From genotype to phenotype: Bioinformatic tools for functional genomics Steve Oliver

Professor of Genomics
School of Biological Sciences
University of Manchester
http://www.bioinf.man.ac.uk
Infoberg
Sequence data
Functional data
Functional Genomics
Level of analysis: Genome
Definition: Complete set of genes of an organism or its organelles.
Status: Context-independent (modifications to the yeast genome may be made with exquisite precision).
Method of analysis: Systematic DNA sequencing.

Level of analysis: Transcriptome
Definition: Complete set of mRNA molecules present in a cell, tissue or organ.
Status: Context-dependent (the complement of mRNAs varies with changes in physiology, development or pathology).
Method of analysis: Hybridisation arrays. SAGE. High-throughput Northern analysis.

Level of analysis: Proteome
Definition: Complete set of protein molecules present in a cell, tissue or organ.
Status: Context-dependent.
Method of analysis: 2-D gel electrophoresis. Peptide mass fingerprinting. Two-hybrid analysis.

Level of analysis: Metabolome
Definition: Complete set of metabolites (low molecular weight intermediates) present in a cell, tissue or organ.
Status: Context-dependent.
Method of analysis: Infra-red spectroscopy. Mass spectrometry. Nuclear magnetic resonance spectrometry.
GENOME
TRANSCRIPTOME
PROTEOME
METABOLOME
[Figure: Aberdeen PRF1: S. cerevisiae 2D map. Annotated 2-D gel image with identified protein spots marked "+" and labelled by gene name (for example ADE6, CDC48, HSP60, ENO2, PDC1, TDH3, ADH1, ACT1, SOD1); one axis is ticked from 4.0 to 6.5 and the other from 10 to 150.]
Peptide mass fingerprinting
1. Denature the protein to expose its sequence:
KETAAAKFERQHMDSSTSAASSSNYCNQMMKSRNLTKDRC
LPVNTFVHESLADVQAVCSQKNVACKNGQTNCYQSYSTMS
ITDCRETGSSKYPNCAYKTTQANKHIIVACEGNPYVPVHF
DASV
2. Digest (trypsin) into peptides m1 to m12:
m1 KETAAAK
m2 FER
m3 QHMDSSTSAASSSNYCNQMMK
m4 SR
m5 NLTK
m6 DR
m7 CLPVNTFVHESLADVQAVCSQK
m8 NVACK
m9 NGQTNCYQSYSTMSITDCR
m10 ETGSSK
m11 YPNCAYKTTQANK
m12 HIIVACEGNPYVPVHFDASV
3. Mass spectrometry: a spectrum of abundance against mass, with one peak per peptide (m1 to m12).
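The digest and mass-measurement steps above can be mimicked in silico. The following is a minimal sketch, not the procedure used in the talk. It assumes the common trypsin rule (cleave after K or R unless the next residue is P), no missed cleavages, and standard monoisotopic residue masses; the names `tryptic_digest` and `peptide_mass` are illustrative. Under this strict rule the slide's sequence yields 14 peptides rather than the 12 drawn on the slide (for example K and ETAAAK instead of KETAAAK).

```python
# Monoisotopic residue masses in daltons (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water per peptide (free N- and C-termini)

# Sequence shown on the slide.
SEQUENCE = (
    "KETAAAKFERQHMDSSTSAASSSNYCNQMMKSRNLTKDRC"
    "LPVNTFVHESLADVQAVCSQKNVACKNGQTNCYQSYSTMS"
    "ITDCRETGSSKYPNCAYKTTQANKHIIVACEGNPYVPVHF"
    "DASV"
)


def tryptic_digest(sequence):
    """Split after every K or R that is not followed by P (assumed rule)."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        following = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and following != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):  # C-terminal peptide
        peptides.append(sequence[start:])
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


if __name__ == "__main__":
    # Print the theoretical fingerprint: one mass per peptide.
    for pep in tryptic_digest(SEQUENCE):
        print(f"{pep:24s} {peptide_mass(pep):10.4f}")
```

The printed list of masses is the theoretical fingerprint that would be compared with the measured peaks.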
PROBLEMS WITH ‘CLASSICAL’
PROTEOME ANALYSIS:
1. Not comprehensive
2. Not high-throughput
3. Destroys protein-protein interactions
that provide important clues to function
[Figure: Number of (protein) database matches against peptide mass (Da), over the range 1000 to 2000 Da, with one curve each for C. elegans, S. cerevisiae, E. coli and H. influenzae.]
Just Enough Diagnostic Information
UMIST
Univ. Manchester
Kwushant Sidhu
Polkit Sangvanich
Tony Sullivan
Olaf Wolkenhauer
Simon Gaskell
Simon Hubbard
Francesco Brancia
Steve Oliver
Provide limited sequence information by:
1. Identification of N-terminal amino acid by
PTC derivatisation
2. Use guanidination to identify C-terminus,
determine lysine content, and improve
signal response
3. Specifically fragment next to Asp
residues using MALDI-QToF MS
1. Start with an initial set of search peptides and associated information.
2. Search the database and compile a protein “hit list” with matching peptides.
3. The top-scoring protein is matched. Remove the corresponding peptides from the search list.
4. If all initial search peptide masses are matched, stop; otherwise continue searching from step 2.
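The iterative search described above can be sketched as a short loop. This is an illustration under stated assumptions, not the scoring scheme used by the authors: each protein is scored simply by how many remaining search masses it explains, `database` is a hypothetical mapping from protein name to theoretical peptide masses, and `tol` is an arbitrary absolute mass tolerance in Da.

```python
def identify_proteins(search_masses, database, tol=0.01):
    """Repeatedly match the top-scoring protein and remove its peptides.

    search_masses: measured peptide masses (Da).
    database: dict of protein name -> list of theoretical peptide masses
              (hypothetical toy structure).
    Returns (identified protein names, masses left unexplained).
    """
    remaining = list(search_masses)
    identified = []
    while remaining:
        # Compile the "hit list": proteins with at least one matching peptide.
        hits = {}
        for name, theoretical in database.items():
            if name in identified:
                continue
            matched = [m for m in remaining
                       if any(abs(m - t) <= tol for t in theoretical)]
            if matched:
                hits[name] = matched
        if not hits:
            break  # nothing in the database explains the remaining masses
        # Assumed score: number of matched search masses.
        best = max(hits, key=lambda name: len(hits[name]))
        identified.append(best)
        # Remove the peptides explained by the matched protein.
        remaining = [m for m in remaining if m not in hits[best]]
    return identified, remaining


# Toy data, invented for illustration only.
toy_db = {
    "protA": [500.0, 750.5, 1200.3],
    "protB": [640.2, 980.7],
    "protC": [500.0, 1500.0],
}
found, unexplained = identify_proteins(
    [500.0, 750.5, 1200.3, 640.2, 980.7], toy_db)
```

Removing matched peptides after each round is what lets the second protein in a mixture surface once the first no longer dominates the hit list.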
[Figure: four panels (S. cerevisiae 1 protein, S. cerevisiae 2 proteins, C. elegans 1 protein, C. elegans 2 proteins) plotting % unambiguous identification against total number of search peptides for six methods: standard, guanidination, PTC (500), PTC (50), Asp-frag and Asp-frag (All).]
[Figure: Identification in a mixture of 3 S. cerevisiae proteins. % unambiguous identification against number of search peptides (4 to 10) for three methods: Asp-promoted daughter ions, N-terminal residue, and peptide masses only (5 ppm).]
Genome Information Management System (GIMS)
• Norman Paton, Carole Goble, Mike Cornell, Paul Kirby - Dept. of Computer Sciences, University of Manchester.
• Steve Oliver, Andy Hayes, Andy Brass - School of Biological Sciences, University of Manchester.
Object Data Model for GIMS
[Figure: UML class diagram. Classes shown: Genome, Chromosome, Chromosome Fragment {Abstract}, Gene, Transcribed, Non Transcribed, Transcribed Fragments {Abstract}, mRNA, snRNA, tRNA, rRNA, Spliced Transcript, Spliced Transcript Component, Intron, ORF, PRIMARY POLYPEPTIDE, FUNCTIONAL PROTEIN, Chromosomal Element, Promoter, Terminator, Regulators, CEN, ORI, TEL. Relationships shown: contains, next/prior, is a, composed of, composes, translates to, with multiplicities 1, 0..1 and *.]
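A small part of such an object model can be sketched in code. The following covers only the top of the diagram (a genome contains chromosomes, a chromosome contains fragments linked by next/prior, and a fragment is either transcribed or non-transcribed). It is an illustrative assumption about how the classes relate, not the GIMS schema itself; the attribute names (`name`, `start`, `end`, `organism`) and the `add` method are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class ChromosomeFragment:
    """Abstract in the diagram; use the subclasses below."""
    name: str
    start: int  # hypothetical coordinate attributes
    end: int
    # 0..1 links to the neighbouring fragments on the chromosome.
    prior: Optional["ChromosomeFragment"] = field(default=None, repr=False)
    next: Optional["ChromosomeFragment"] = field(default=None, repr=False)


@dataclass(eq=False)
class Transcribed(ChromosomeFragment):
    """A fragment that is transcribed (e.g. mRNA, tRNA, rRNA, snRNA)."""


@dataclass(eq=False)
class NonTranscribed(ChromosomeFragment):
    """A fragment that is not transcribed (e.g. CEN, TEL, ORI, promoter)."""


@dataclass(eq=False)
class Chromosome:
    name: str
    fragments: List[ChromosomeFragment] = field(default_factory=list)

    def add(self, fragment: ChromosomeFragment) -> None:
        """Append a fragment and maintain the next/prior links."""
        if self.fragments:
            last = self.fragments[-1]
            last.next = fragment
            fragment.prior = last
        self.fragments.append(fragment)


@dataclass(eq=False)
class Genome:
    organism: str
    chromosomes: List[Chromosome] = field(default_factory=list)


# Invented example fragments, for illustration only.
chrom = Chromosome("I")
tel = NonTranscribed("TEL-left", 1, 800)
orf = Transcribed("example-mRNA", 1000, 2500)
chrom.add(tel)
chrom.add(orf)
genome = Genome("S. cerevisiae", [chrom])
```

Keeping explicit next/prior links makes neighbourhood queries (for example, "what lies upstream of this ORF?") a pointer walk rather than a coordinate search.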
Data in GIMS
Data type: DNA sequences, chromosome locations of coding regions, e.g. ORFs, tRNAs, centromeres, telomeres etc.
Data source: MIPS
Data type: Predicted protein sequences, pI, mol weight, number of transmembrane regions.
Data source: MIPS
Data type: Protein attributes (e.g. cellular location, function, protein class, Prosite motifs, phenotype).
Data source: MIPS
Data type: Protein interaction data (yeast two-hybrid).
Data source: MIPS, Uetz et al. (2000), Ito et al. (2001)
Data type: Protein interaction data (genetic interactions).
Data source: MIPS
Data in GIMS (2)
Data type: Protein interaction data (TAP & HMS-PCI complexes).
Data source: Gavin et al. (2002), Ho et al. (2002)
Data type: Metabolic data (reactions, compounds and enzymes).
Data source: L-compound, L-enzyme
Data type: Post-translational modifications.
Data source: YPD
Data type: Transcription factor.
Data source: SCPD
Data type: Transcriptome data
Data source: Stanford Microarray Database, Rosetta Inpharmatics, Inc and University of Manchester
Evaluating protein-interaction data
Integrating complex data with yeast two-hybrid data
A complex consists of six proteins: A, B, C, D, E and F.
In a yeast two-hybrid experiment, A interacts with another protein. Is it B, C, D, E or F?
Percentages of protein pairs sharing the same cellular location
Dataset: % protein interactions compatible with subcellular location
MIPS complexes: 99.0
HMS-PCI complexes: 47.6
TAP complexes: 55.3
Y2H interactions: 48.3
Randomly generated complexes: 33.7
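A figure of this kind can be computed from a list of interacting pairs and a table of localisation annotations. The sketch below is a guess at the calculation, not the authors' procedure. It assumes a pair counts as compatible when the two proteins share at least one annotated compartment, and that pairs with an unannotated member are skipped; the protein names and locations are invented.

```python
def percent_colocalised(pairs, locations):
    """Percentage of annotated pairs that share at least one compartment.

    pairs: iterable of (protein_a, protein_b) tuples.
    locations: dict of protein -> set of compartment names.
    Pairs with an unannotated protein are ignored (assumption).
    """
    scored = 0
    compatible = 0
    for a, b in pairs:
        if a not in locations or b not in locations:
            continue
        scored += 1
        if locations[a] & locations[b]:  # non-empty set intersection
            compatible += 1
    return 100.0 * compatible / scored if scored else 0.0


# Toy annotations, invented for illustration only.
toy_locations = {
    "P1": {"nucleus"},
    "P2": {"nucleus", "cytoplasm"},
    "P3": {"mitochondrion"},
    "P4": {"cytoplasm"},
}
toy_pairs = [("P1", "P2"), ("P1", "P3"), ("P2", "P4"), ("P3", "P5")]
score = percent_colocalised(toy_pairs, toy_locations)
```

Running the same function on randomly paired proteins gives a baseline against which each interaction dataset can be judged, which is the role of the "randomly generated complexes" row.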
Large-scale interaction data and the distribution of
interactions according to functional categories.
Quantitative comparison of interaction datasets.
MyGrid
Personalised extensible environments for data-intensive in silico experiments in biology
Professor Carole Goble,
University of Manchester
Dr Alan Robinson,
EBI
Approach
Applications
Toolkits
Metadata
Personalisation
Interoperation layer
Context mgt
Process mgt
Communication fabric
Data mgt
Robot Scientist Project
Aberystwyth: Ross King, Phil Reissner, Douglas Kell
York/Imperial: Stephen Muggleton, Chris Bryant
Manchester: Steve Oliver
The Robot Scientist Project
Aim: build a physical implementation of a scientific active learning system and apply it to functional genomics
Test problem: Genetic control of aromatic amino acid biosynthesis in yeast
1. Devise experiments to select between hypotheses
2. Direct a robot to perform experiments
3. Automatically analyse the experimental results
4. Revise set of hypotheses
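The four-step cycle above is an active learning loop, which can be illustrated with a toy example. This sketch does not reproduce ASE-Progol: it swaps in a simple halving heuristic that picks the experiment whose predicted outcomes split the surviving hypotheses most evenly. The linear four-step pathway, the hypothesis "the gene acts at step k", and the growth rule are all invented for illustration.

```python
def choose_experiment(hypotheses, experiments):
    """Pick the experiment that splits the hypotheses most evenly."""
    def imbalance(exp):
        positives = sum(1 for predict in hypotheses.values() if predict(exp))
        return abs(2 * positives - len(hypotheses))
    return min(experiments, key=imbalance)


def active_learning(hypotheses, experiments, run_experiment):
    """Devise, run, analyse, revise until one hypothesis remains.

    hypotheses: dict of label -> function(experiment) -> predicted outcome.
    run_experiment: stands in for the robot; returns the observed outcome.
    """
    surviving = dict(hypotheses)
    untried = list(experiments)
    log = []
    while len(surviving) > 1 and untried:
        exp = choose_experiment(surviving, untried)   # 1. devise
        untried.remove(exp)
        outcome = run_experiment(exp)                 # 2. perform
        log.append((exp, outcome))                    # 3. analyse
        surviving = {label: predict                   # 4. revise
                     for label, predict in surviving.items()
                     if predict(exp) == outcome}
    return list(surviving), log


def make_hypothesis(blocked_step):
    # Toy rule: a mutant blocked at step k grows only when fed the
    # product of step k or of a later step.
    return lambda supplement: supplement >= blocked_step


toy_hypotheses = {k: make_hypothesis(k) for k in range(1, 5)}
true_block = 3  # hidden "truth" used to simulate the robot


def simulated_robot(supplement):
    return supplement >= true_block


survivors, experiment_log = active_learning(
    toy_hypotheses, [1, 2, 3, 4], simulated_robot)
```

In this toy case two well-chosen experiments are enough to isolate one of four hypotheses, whereas trying supplements in a fixed order could take up to three.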
[Figure: Robot Scientist cycle, with boxes labelled Background Knowledge, Analysis, Learning Engine, Consistent Hypotheses, Experiment Selection, Robot, Experimental Results and New Biological Knowledge.]
Analysis of Results
Cost of the chemicals consumed
The cost of the chemicals consumed in converging upon a hypothesis with an accuracy in the range 46 – 88% was lower when trials were selected by ASE-Progol than when they were sampled at random.
To reach an accuracy in the range 46 – 80%, ASE-Progol incurs costs five orders of magnitude lower than random sampling.
Analysis of Results (cont’d)
Duration of experiment
ASE-Progol requires less time to converge upon a hypothesis with an accuracy in the range 74 – 87% than when trials are sampled at random or selected using the naïve strategy.
Time taken to reach an accuracy of 80%:
ASE-Progol = 4 days
Random sampling = 6 days
Naïve strategy = 10 days
CONCLUSIONS
1. Take full advantage of complete genome
sequences
2. Promote close cooperation between
experimentalists &
bioinformaticians/computer scientists
3. Require integration of data from different
‘omic levels to mine reliable biological
information
4. Exploit machine-learning techniques in
the design, execution, and interpretation
of experiments in functional genomics