Proteomic informatics - Computational Bioscience Program

Proteomic informatics
Lawrence Hunter, Ph.D.
Director, Computational Bioscience Program
University of Colorado School of Medicine
Larry.Hunter@uchsc.edu
http://compbio.uchsc.edu/Hunter
Proteomics
• The characterization of the complete complement of proteins in a sample
– Which proteins are present?
– In what isoforms (e.g. cleavage products)?
– With what covalent post-translational modifications?
– In what concentrations (quantification)?
• Today, largely addressed via mass
spectrometry
[NB: many slides in this lecture are from Speed & Schutz,
http://www.ima.umn.edu/talks/workshops/9-29-10-3.2003/speed/speed.ppt ]
Mass Spectrometry
• Many different types of instruments (e.g. MALDI/TOF)
– Produce lists of specific masses (or mass/charge ratios) with “intensities”
– Can dynamically manipulate samples, e.g. by breaking certain bonds
• Long used in molecular studies
– Mostly of small molecules
– Identification of purified proteins
• Now applied to complex mixtures of proteins
What is a Mass Spectrometer?
“An analytical device that determines the molecular weight of
chemical compounds by separating molecular ions according to their
mass-to-charge ratio (m/z)”
[Figure: schematic of a mass spectrometer in three stages: ionisation, separation by m/z, and detection. Example: a sample containing compounds of molecular weight 600 Da (abundance 50 %), 400 Da (abundance 20 %) and 300 Da (abundance 30 %) gives a spectrum with peaks at m/z 601, 401 and 301 and relative intensities 50, 20 and 30.]
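The example above (molecular weights of 300, 400 and 600 Da appearing at m/z 301, 401 and 601) is consistent with each molecule picking up a single proton. A minimal sketch of that arithmetic, assuming protonation is the only charging mechanism; the function name `mz` is chosen here for illustration:

```python
PROTON_MASS = 1.00728  # Da, mass of one proton (assumed charge carrier)


def mz(neutral_mass: float, charge: int = 1) -> float:
    """m/z of a molecule of the given neutral mass carrying `charge` protons."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return (neutral_mass + charge * PROTON_MASS) / charge


for mass in (300.0, 400.0, 600.0):
    # The same molecule appears at a different m/z for each charge state.
    print(f"{mass:.0f} Da: m/z {mz(mass):.2f} (z=1), {mz(mass, 2):.2f} (z=2)")
```

The same compound can therefore show up at several m/z values when it takes more than one charge, which is one reason a spectrum is a list of m/z values rather than a list of masses.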
Many different instruments
• Different MS instruments sensitive (or not) to different types of ions
From Wysocki, et al., 2005
Use of Proteomic Mass Spectra
Two broad approaches:
1. Direct use: discrimination among tissue
states based on raw (unanalyzed)
spectra
2. Analysis: Protein identification,
characterization and quantification from
mass spectra
Predictive models based on raw spectra (or peak lists)
• Typical discrimination problem
• At least one impressive success: ovarian cancers detected from peripheral blood samples
• But…
– Major challenges in reproducibility (lots of reasons a peak might appear incorrectly to be correlated)
– Black box: no identifications or interpretations
– Big challenges in modeling: why are these peaks discriminative?
General challenges
• The curse of dimensionality
– Many possibly interacting factors. The number of possible interactions grows combinatorially with the number of peaks.
– Kernel methods escape the curse by requiring
a similarity function (which must take
interactions into account)
• Statistical challenges
– Too many peaks, and not enough samples
– Uncontrolled variability, observational (post
hoc) studies
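To give a feel for how quickly the space of candidate peak interactions grows, the sketch below counts the k-way combinations of peaks and the total number of peak subsets. It is only a counting illustration; the function names are introduced here and do not correspond to any particular modeling method.

```python
from math import comb


def interaction_counts(n_peaks: int, max_order: int = 3) -> dict:
    """Number of possible k-way interactions among n_peaks, for k = 2..max_order."""
    return {k: comb(n_peaks, k) for k in range(2, max_order + 1)}


def all_interactions(n_peaks: int) -> int:
    """All subsets of two or more peaks: 2^n - n - 1."""
    return 2 ** n_peaks - n_peaks - 1


for n in (10, 100, 1000):
    print(n, "peaks:", interaction_counts(n))
```

With a few hundred samples at most, even the pairwise terms for a spectrum of 1000 peaks far outnumber the observations available to estimate them.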
Mass Spec for protein identification
[Figure: 2D gel]
• Start with a purified protein sample.
– Mass of an intact protein alone is often ambiguous
– Cleavage of a protein at known sites (e.g. by trypsin, which cleaves at K/R residues) generates a set of peptide fragments
– “Protein fingerprinting” – a particular set of fragment masses often identifies the protein
Fingerprinting
• Start from a set of all possible proteins in an organism (called “the database”)
• Theoretical digest of each protein to produce collection of fragments
– E.g. SwissProt’s PeptideCutter http://us.expasy.org/tools/peptidecutter/
• Compare predicted fragments for each protein versus observed masses
– Some fragments missing, others unexpected (e.g. PTMs)
– Ad hoc scoring or probabilistic calculations
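The theoretical digest step can be sketched as follows. This is a simplified illustration, not PeptideCutter's implementation: it assumes trypsin cleaves after K or R except when the next residue is P, ignores missed cleavages and modifications, and reports singly protonated monoisotopic masses. The sequence used is the example protein from the peptide mass fingerprinting slide in this lecture; the function names are chosen here.

```python
# Monoisotopic residue masses in Da (standard values, unmodified residues).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # Da, added once per free peptide
PROTON = 1.00728   # Da, for the singly charged ion


def tryptic_digest(sequence: str) -> list:
    """Cleave after K or R unless the next residue is P (simplified rule)."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        next_aa = sequence[i + 1] if i + 1 < len(sequence) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # C-terminal peptide
    return peptides


def mh_plus(peptide: str) -> float:
    """Monoisotopic m/z of the singly protonated peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER + PROTON


protein = "MALKCGIRGGSRPFLRATSKASRSDD"
for pep in tryptic_digest(protein):
    print(f"{pep:10s} {mh_plus(pep):8.2f}")
```

Note that the R inside GGSRPFLR is not cleaved because it is followed by P, which is why that fragment stays intact in the slide's example.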
MS for protein identification
Peptide mass fingerprinting
[Figure: workflow from protein sample to 2D gel, spot excision, digestion and MS, ending in a spectrum plotted against m/z]
Example: peaks at m/z 333, 336, 406, 448, 462, 889
The only protein in the database that would produce these peaks is
MALK|CGIR|GGSRPFLR|ATSK|ASR|SDD
• The exact protein needs to be in the database
• Works only with single protein fragmentations
Challenges
• Which proteins does a genome specify?
– Gene/coding region prediction from genomic data
– Splice variants
– Coding polymorphisms
• Some peptides are in multiple proteins
• Some peptides are missing in the data
– Post-translational modifications change mass
– Stochastic loss of peptides
• PTMs can cause false positive matches
– Combinations of PTMs make this problem much worse
Issues with tryptic fragments
• Genome level
– Coding sequence identification problem
generally
– Families of similar genes share fragments
• Transcription/translation
– Alternative splicing leads to multiple products
sharing fragments
• Post-translational modifications
– Covalent modifications add mass
– Cleavages reduce mass
The simple matching task
• Given a list of tryptic masses, what is the
most likely protein to have generated it?
– Compare the observed peptides with the ones
that would be expected if the protein were
present.
• Challenges:
– Fragments from true proteins will be missing
due to chance or modifications
– Fragments will spuriously match proteins that
are not present, due to shared fragments or
modifications
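A minimal sketch of the simple matching task, assuming the theoretical tryptic masses for each database protein have already been computed. The protein names, masses and the 0.6 Da tolerance are hypothetical values chosen for illustration, and ranking by matched fraction is only one of many possible ad hoc scores.

```python
def count_matches(observed, theoretical, tol=0.6):
    """Number of theoretical masses with at least one observed mass within tol Da."""
    return sum(any(abs(o - t) <= tol for o in observed) for t in theoretical)


def rank_proteins(observed, database, tol=0.6):
    """Rank proteins by the fraction of their theoretical peptides that were observed."""
    scored = []
    for name, masses in database.items():
        hits = count_matches(observed, masses, tol)
        scored.append((hits / len(masses), hits, name))
    return sorted(scored, reverse=True)


# Hypothetical database: protein name -> theoretical tryptic peptide masses (Da)
database = {
    "protein_A": [333.2, 336.1, 406.2, 448.2, 462.3, 889.5],
    "protein_B": [333.2, 520.3, 611.4, 702.3],
    "protein_C": [406.2, 448.2, 950.6, 1021.5, 1200.7],
}
observed = [333, 336, 406, 448, 462, 889]

for fraction, hits, name in rank_proteins(observed, database):
    print(f"{name}: {hits} matched, fraction {fraction:.2f}")
```

Both challenges listed above are visible even in this toy case: protein_B and protein_C pick up spurious partial matches through shared masses, and any true peptide missing from `observed` would lower the correct protein's fraction.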
Existing approaches
• Widely used programs:
– Mascot (http://www.matrixscience.com)
– Sequest (http://fields.scripps.edu/sequest)
• Good at finding matches to unmodified peptides from a database of sequences
• Not so good at
– Unambiguously identifying a protein (get a list of hits, first not always correct)
– Identifying post-translational modification(s), especially multiple ones
How they work
• Score each protein in the database against the set of peptides from the assay
• Score functions
– Some differences between Sequest/Mascot, but...
– Positive score for matching fragments, penalty for missing
– Penalty for size of matching protein. (Longer proteins have more random matches)
– Treat all proteins as independent
• Random selection in case of ties (not generally consistent)
• Reports each protein in family as separate
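The general shape of such a score function can be illustrated with a toy example. This is not the Mascot or Sequest formula; the reward, penalty and random-match probability below are arbitrary assumptions, used only to show why a size correction matters.

```python
def toy_score(n_matched, n_theoretical,
              match_reward=1.0, miss_penalty=0.5, p_random=0.05):
    """Toy protein score: reward matches, penalize misses and protein size.

    p_random is an assumed probability that any one theoretical peptide
    matches an observed mass by chance, so the expected number of chance
    matches grows linearly with the number of theoretical peptides.
    """
    n_missing = n_theoretical - n_matched
    expected_random = p_random * n_theoretical
    return match_reward * n_matched - miss_penalty * n_missing - expected_random


# A small protein with most peptides seen versus a large protein
# that collects slightly more matches, largely by chance.
print("small protein:", toy_score(n_matched=5, n_theoretical=6))
print("large protein:", toy_score(n_matched=6, n_theoretical=60))
```

Because each protein is scored independently, two family members with the same matched peptides receive identical scores here, which is the tie situation described above.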
PTMs
• Each PTM has a characteristic mass change (e.g. ~80 Da for phosphate)
• More than 30 biologically significant PTMs
– (de)protonation changes mass by 1 Da
• Even just ±2 phosphates cause a large increase in false positive matches
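The effect of allowing phosphates can be sketched by counting how many database peptides could explain one observed mass. The peptide masses, the observed mass and the tolerance are hypothetical; only phosphate gains are modeled (79.96633 Da per HPO3 group), not losses.

```python
PHOSPHO = 79.96633  # Da, monoisotopic mass added per phosphate (HPO3)


def candidate_masses(base_mass, max_phospho=2):
    """Masses of a peptide carrying 0..max_phospho phosphate groups."""
    return [base_mass + k * PHOSPHO for k in range(max_phospho + 1)]


def explanations(observed, peptide_masses, max_phospho, tol=0.05):
    """(peptide mass, phosphate count) pairs within tol Da of the observed mass."""
    found = []
    for base in peptide_masses:
        for k, mass in enumerate(candidate_masses(base, max_phospho)):
            if abs(mass - observed) <= tol:
                found.append((base, k))
    return found


# Hypothetical theoretical peptide masses (Da) and one observed mass
peptides = [800.40, 880.37, 920.53, 960.33, 1000.50]
observed = 960.35

print("no PTMs allowed:   ", explanations(observed, peptides, max_phospho=0))
print("up to 2 phosphates:", explanations(observed, peptides, max_phospho=2))
```

In this toy case a single unambiguous explanation becomes three once phosphates are allowed, and the number of candidates multiplies further when several PTM types are considered together.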
Possible approaches to PTMs
• Post-processing:
– Look only for PTM'd fragments of proteins
already identified by unmodified fragments
• Bayesian: Define a prior probability for
each possible combination of PTMs
– Global score function for each PTM
combination
– Per-protein predictions of likelihood of
particular PTMs.
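The Bayesian idea can be sketched with Bayes' rule over a small set of PTM hypotheses. All numbers below are made up for illustration; in practice the priors would come from a global or per-protein model and the likelihoods from how well each hypothesis explains the observed masses.

```python
def posterior(priors, likelihoods):
    """Posterior over hypotheses: prior times likelihood, renormalized."""
    joint = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(joint.values())
    return {h: value / total for h, value in joint.items()}


# Hypothetical prior probability of each PTM combination for one peptide
priors = {"unmodified": 0.90, "1 phosphate": 0.08, "2 phosphates": 0.02}
# Hypothetical likelihood of the observed mass under each hypothesis
likelihoods = {"unmodified": 0.01, "1 phosphate": 0.30, "2 phosphates": 0.05}

for hypothesis, prob in posterior(priors, likelihoods).items():
    print(f"{hypothesis}: {prob:.3f}")
```

A low prior for heavily modified forms means that such a form is preferred only when the data fit it much better than the unmodified alternative, which is how the prior limits false positives from PTM combinations.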
Approaches to protein family issues
• Current programs either report all ties or
randomly select one.
– Even post-processing to group family members
in reports would be better
• Better still
– Link family members in the identification process
– Represent and report ambiguities
• How to handle probabilities for groups of related
proteins?
– Multiple family members can be present in a
sample...
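The post-processing option, grouping family members in reports, can be sketched by collecting proteins that are supported by exactly the same observed peptides. The protein names and peptide assignments are hypothetical, and this grouping rule is one simple choice among several.

```python
from collections import defaultdict


def group_indistinguishable(matches):
    """Group proteins supported by exactly the same set of observed peptides."""
    groups = defaultdict(list)
    for protein, peptides in matches.items():
        groups[frozenset(peptides)].append(protein)
    # Sort for a stable, reportable order.
    return sorted((sorted(prots), sorted(peps)) for peps, prots in groups.items())


# Hypothetical identification results: protein -> matched peptides
matches = {
    "kinase_alpha": {"ATSK", "GGSRPFLR"},
    "kinase_beta": {"ATSK", "GGSRPFLR"},
    "kinase_gamma": {"ATSK", "GGSRPFLR", "MALK"},
}

for proteins, peptides in group_indistinguishable(matches):
    print(" / ".join(proteins), "<-", ", ".join(peptides))
```

Reporting kinase_alpha and kinase_beta as one ambiguous group is more informative than picking one at random, although it still does not answer whether one or both are present in the sample.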
Going high throughput
• Recently, various successful approaches to identifying a complex mixture of proteins.
• MuDPIT
– Separate mixture into fractions (e.g. by liquid
chromatography or strong cation exchange)
– Identify many proteins in the mixture
simultaneously using fingerprinting
• Tandem (MS/MS)
– Use CID to fragment peptides, and send them
to a second MS. Pattern of fragments allows
MuDPIT Approaches
• Approach as multiple purified mass spec runs
• How to tell how many proteins?
– Just allow multiple high probability matches...
• How to “assign” a peptide to a protein?
– Can one peptide match more than one
protein?
– Related to the protein family problem...
– Use quantitative information?
• Complex instrument & experimental design
Fractionation
• Take a complex mixture and divide it into
smaller sets of proteins with known
qualities
– SCX: separate by charge
– LC: separate by hydrophobicity
• Can make restricted databases based on proteins that could have been in a particular fraction
• Test to make sure that identified proteins are compatible with fraction
Tandem MS (MS/MS)
To gain structural information about the detected masses:
[Figure: from the first-stage spectrum (peaks at m/z 301, 401 and 601), one product is selected, fragmented by collision with a gas, and the fragments go to a second MS for separation and detection.]
– different molecules of the same substance can split in different ways.
– in each molecule, only the pieces that retain one of the charges will
be observed and present in the spectrum; the others are discarded.
MS/MS advantages
• Sequences make identification more reliable (although still depends on database to restrict search)
• Able to directly detect certain kinds of PTMs
• With some instruments, direct de-novo sequencing becoming possible (without database)
• Still new, many potential improvements to informatic approaches
Example MS/MS spectrum
[Figure: MS/MS spectrum of the tryptic fragment Val Phe Gly Lxx Lxx Asp Glu Asp Lys, with the b ions (b2 to b8), y ions (y2 to y8) and a2 labeled. Axes: m/z (150 to 950) versus relative abundance (0 to 100). Peak m/z values printed on the figure: 205.0, 219.0, 247.0, 248.1, 262.1, 304.0, 305.1, 318.1, 372.2, 391.1, 417.2, 418.1, 431.1, 468.4, 506.2, 530.2, 619.2, 645.3, 732.2, 774.4, 789.3, 889.4, 904.5, 936.4, 937.4.]
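The b and y ion labels in the example spectrum can be related to the peptide sequence with a short calculation. This is a sketch under stated assumptions: singly charged ions, monoisotopic residue masses, no modifications, and Lxx (Leu or Ile, which have the same mass) written as L. The function name is chosen here for illustration.

```python
# Monoisotopic residue masses in Da (standard values, unmodified residues).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # Da
PROTON = 1.00728   # Da


def fragment_ions(peptide: str):
    """Singly charged b and y ion m/z values, keyed by ion number."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    n = len(masses)
    # b_i: first i residues plus a proton (N-terminal fragment)
    b = {i: sum(masses[:i]) + PROTON for i in range(1, n)}
    # y_i: last i residues plus water and a proton (C-terminal fragment)
    y = {i: sum(masses[n - i:]) + WATER + PROTON for i in range(1, n)}
    return b, y


b_ions, y_ions = fragment_ions("VFGLLDEDK")  # Lxx written as L
for i in range(2, 9):
    print(f"b{i} {b_ions[i]:7.2f}    y{i} {y_ions[i]:7.2f}")
```

The computed values fall within a few tenths of a Da of peaks printed on the figure (for example b2 near 247.0 and y3 near 391.1), and the spacing between consecutive b or y ions is the mass of one residue, which is what makes sequence read-out possible.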
Interpretation of MS/MS data
• SEQUEST:
– generate a predicted spectrum for each potential peptide using a
simple fragmentation model (all b and y ions have the same
intensity; possible losses from b and y have a lower intensity)
– compute a "cross-correlation" score and find the best-matching
peptide
– since this operation is very time-consuming, a simpler preliminary
score is used to find the 500 peptides in the database that are
most likely to be the correct identification
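The cross-correlation idea can be sketched as follows. This is a simplified illustration and not SEQUEST's implementation: spectra are binned to unit m/z, theoretical peaks all get intensity 1, no spectrum preprocessing is done, and the score is the zero-offset dot product minus the mean dot product over offsets from -75 to +75 bins. The peak lists and candidate names are hypothetical.

```python
def binned(mz_values, intensities=None, n_bins=1000):
    """Unit-width intensity vector; theoretical peaks default to intensity 1."""
    vec = [0.0] * n_bins
    for i, mz in enumerate(mz_values):
        b = int(round(mz))
        if 0 <= b < n_bins:
            vec[b] = max(vec[b], 1.0 if intensities is None else intensities[i])
    return vec


def dot_at_offset(obs, theo, offset):
    """Dot product of obs with theo shifted by `offset` bins."""
    n = len(obs)
    return sum(obs[i] * theo[i + offset]
               for i in range(max(0, -offset), min(n, n - offset)))


def xcorr(obs, theo, max_offset=75):
    """Zero-offset correlation minus the mean correlation over nonzero offsets."""
    background = [dot_at_offset(obs, theo, t)
                  for t in range(-max_offset, max_offset + 1) if t != 0]
    return dot_at_offset(obs, theo, 0) - sum(background) / len(background)


# Hypothetical observed peaks (m/z, relative intensity)
observed = binned([247.0, 262.1, 304.0, 391.1, 417.2, 506.2],
                  [95, 15, 80, 100, 65, 70])
# Hypothetical predicted fragment m/z values for two candidate peptides
candidate_1 = binned([247.1, 262.1, 304.2, 391.2, 417.3, 506.2])
candidate_2 = binned([233.1, 262.1, 320.2, 391.2, 433.3, 506.2])

print("candidate 1:", round(xcorr(observed, candidate_1), 1))
print("candidate 2:", round(xcorr(observed, candidate_2), 1))
```

Subtracting the average over shifted alignments corrects for matches that would occur at any offset by chance. Even in this pure-Python form the cost per candidate is visible, which is why a cheaper preliminary score is used to shortlist peptides first.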
MS/MS Challenges
• Theoretical spectra do not include intensity information (much, yet)
• Many random matches with low scores
– Poor separability of low scoring real hits from high scoring false ones
• Changes in instruments lead to new
theoretical spectra
Quantification
• Would like to know the amount of each protein, not just its identity
• Peak intensity is not directly correlated with amount of original protein
– Theoretical and empirical attempts to map
from peak intensity (or area under a peak) to
original concentration
• Isotopic ratio approaches: use isotopes
(taken up in food, say) to quantify
abundance ratio between two samples.
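The isotopic ratio approach can be sketched as a ratio of paired peak intensities. The intensities below are hypothetical, and taking the median over a protein's peptides is one simple choice of summary, assumed here for robustness to a single bad pair.

```python
from statistics import median


def abundance_ratio(light, heavy):
    """Median heavy/light intensity ratio across peptide pairs of one protein."""
    ratios = [h / l for l, h in zip(light, heavy) if l > 0]
    if not ratios:
        raise ValueError("no usable light/heavy peptide pairs")
    return median(ratios)


# Hypothetical intensities for three peptides of the same protein,
# each seen as an unlabeled (light) and isotope-labeled (heavy) peak.
light = [1200.0, 800.0, 450.0]
heavy = [2500.0, 1500.0, 900.0]

print("heavy/light ratio:", abundance_ratio(light, heavy))
```

Because the light and heavy forms of a peptide behave the same way in the instrument, their intensity ratio is more trustworthy than either absolute intensity, which sidesteps the mapping from peak intensity to concentration.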
Existing software
• Sequest most widely used, but commercial and associated with an instrument manufacturer
• Mascot main alternative (has a web service) http://www.matrixscience.com/
• Open source alternatives improving rapidly, X!Hunter and X!Tandem. Organized through The Global Proteome Machine Organization, http://www.thegpm.org