Interpreting MS/MS Proteomics Results - Proteome Software

Interpreting MS/MS
Proteomics Results
The first thing I should say is that none of the material presented is original research done at Proteome Software, but we do strive to make the tools presented here available in our software product Scaffold. With that caveat aside…
Brian C. Searle
Proteome Software Inc.
Portland, Oregon USA
Brian.Searle@ProteomeSoftware.com
NPC Progress Meeting
(February 2nd, 2006)
Illustrated by Toni Boudreault
Organization
This is first and foremost an introduction, so we’re first going to talk about how you go about identifying proteins with tandem mass spectrometry in the first place.
Then we’re going to talk about the motivations behind the development of the first really useful bioinformatics technique in our field, SEQUEST.
This technique has been extended by two other tools called X! Tandem and Mascot.
We’re also going to talk about how these programs differ and how we can use that to our advantage by considering them simultaneously using probabilities.
Identify • SEQUEST • X! Tandem/Mascot • Differ • Combine
Start with a protein
[Figure: a protein drawn as a chain of single-letter amino acids]
So, this is proteomics, so
we’re going to use tandem
mass spectrometry to
identify proteins-- hopefully
many of them, and hopefully
very quickly.
Cut with an enzyme
[Figure: the protein chain cut into peptides]
And to use this technique you generally have to digest the protein into peptides about 8 to 20 amino acids in length and…
Select a peptide
[Figure: the digested peptides, with one peptide selected]
Look at each peptide individually.
We select the
peptide by mass
using the first half of
the tandem mass
spectrometer
Impart energy in collision cell
[Figure: the selected peptide, A-E-P-T-I-R]
The mass spectrometer imparts energy
into the peptide causing it to fragment at
the peptide bonds between amino acids.
Measure mass of daughter ions
[Figure: Intensity vs. M/z, with peaks for the fragments A, AE, AEP and AEPT at 72.0, 201.1, 298.1 and 399.2]
The masses of these fragment ions are recorded using the second mass spectrometer.
These ions are
commonly called B
ions, based on
nomenclature you
don’t really want to
know about…
B-type Ions
[Figure: Intensity vs. M/z for the peptide A-E-P-T-I; the spacing between successive peaks is labeled A 72.0, E 129.0, P 97.0, T 101.0, I 113.1]
But the mass difference
between the peaks
corresponds directly to the
amino acid sequence.
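B-type Ions
[Figure: Intensity vs. M/z; peak spacings A 72.0, E 129.0, P 97.0, T 101.0, I 113.1, R + H2O 174.1, where each spacing is the difference between neighboring fragments: A - 0, AE - A, AEP - AE, AEPT - AEP, AEPTI - AEPT, AEPTIR - AEPTI]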
For example, the
A-E peak minus
the A peak should
produce the mass
of E.
You can build these mass differences up and derive
a sequence for the original peptide
This is pretty neat and it makes tandem mass
spectrometry one of the best tools out there for
sequencing novel peptides.
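The mass-difference idea can be sketched in a few lines of Python. This is a minimal illustration and not something from the original slides: the monoisotopic residue masses and the proton mass are standard reference values, the function names are made up for this example, and it assumes a clean spectrum that contains only singly charged B ions.

```python
# Monoisotopic residue masses (Da) for the amino acids in the example peptide.
RESIDUE_MASS = {"A": 71.03711, "E": 129.04259, "P": 97.05276,
                "T": 101.04768, "I": 113.08406, "R": 156.10111}
PROTON = 1.00728  # mass of the charge-carrying proton


def b_ion_ladder(peptide):
    """Return singly charged B-ion m/z values for prefixes of length 1..n-1."""
    ladder, total = [], PROTON
    for residue in peptide[:-1]:
        total += RESIDUE_MASS[residue]
        ladder.append(round(total, 2))
    return ladder


def read_sequence(ladder, tolerance=0.02):
    """Turn the differences between consecutive peaks back into letters."""
    letters, previous = [], PROTON
    for peak in ladder:
        diff = peak - previous
        match = [aa for aa, m in RESIDUE_MASS.items() if abs(m - diff) <= tolerance]
        letters.append(match[0] if match else "?")
        previous = peak
    return "".join(letters)


ladder = b_ion_ladder("AEPTIR")
print(ladder)                 # B-ion peak positions
print(read_sequence(ladder))  # letters recovered from the peak spacings
```

Note that leucine has the same mass as isoleucine, so mass differences alone can’t tell those two apart.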
So, it seems pretty easy, doesn’t it?
But there are a couple of confounding factors.
For example…
B ions have a tendency to degrade and lose carbon monoxide, producing A ions.
[Figure: B-type Ions and A-type Ions; Intensity vs. M/z, with each A-type peak offset from its B-type peak by the mass of CO]
Furthermore…
… the second half are represented as Y ions that sequence backwards.
[Figure: Y-type Ions; Intensity vs. M/z, reading H2O, R, I, T, P, E, A]
And, unfortunately, this is the real world, so…
… all the peaks have different measured heights and many peaks can often be missing.
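To make the A and Y series concrete, here is a small sketch that computes them for the same example peptide. It is an illustration under stated assumptions: singly charged ions only, standard monoisotopic masses, and helper names that are invented for this example.

```python
# Standard monoisotopic masses (Da); helper names are hypothetical.
RESIDUE_MASS = {"A": 71.03711, "E": 129.04259, "P": 97.05276,
                "T": 101.04768, "I": 113.08406, "R": 156.10111}
PROTON, WATER, CO = 1.00728, 18.01056, 27.99491


def b_ions(peptide):
    """Prefix fragments: residues from the front, plus one proton."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    return [sum(masses[:cut]) + PROTON for cut in range(1, len(peptide))]


def a_ions(peptide):
    """A ions are B ions that have lost carbon monoxide."""
    return [b - CO for b in b_ions(peptide)]


def y_ions(peptide):
    """Suffix fragments, shortest first: they read the sequence backwards."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    return [sum(masses[-cut:]) + WATER + PROTON for cut in range(1, len(peptide))]


for name, series in (("b", b_ions), ("a", a_ions), ("y", y_ions)):
    print(name, [round(mz, 2) for mz in series("AEPTIR")])
```

The first Y ion is just R plus water plus a proton, which is why the Y series starts from the other end of the peptide.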
All these peaks are seen together simultaneously, and we don’t even know what type of ion they are, making the mass differences approach even more difficult.
[Figure: B-type, A-type, Y-type Ions overlaid; Intensity vs. M/z]
Finally, as with all analytical techniques, there’s noise, producing a final spectrum that looks like…
… this, on a good day.
And so it’s actually fairly difficult to compute the mass differences to sequence the peptide, certainly in a computer-automated way.
So the community needed a
new technique.
Now, it wasn’t all
without hope…
Known Ion Types
We knew a couple of things about peptide fragmentation. Not only do we know to expect B, A, and Y ions, but we also know a couple of other variations on those ions that come up. We even know something about the likelihood of seeing each type of ion, where generally B and Y ions are most prominent.
B-type ions • 100%
A-type ions • 20%
Y-type ions • 100%
B- or Y-type +2H ions • 50%
B- or Y-type -NH3 ions • 20%
B- or Y-type -H2O ions • 20%
If we know the amino acid sequence of a peptide, we can guess what the spectrum should look like!
So it’s actually pretty easy to guess what a spectrum should look like if we know what the peptide sequence is.
Model Spectrum
So as an example, consider the peptide ELVIS LIVES K that was synthesized by Rich Johnson in Seattle.
ELVISLIVESK
*Courtesy of Dr. Richard Johnson
http://www.hairyfatguy.com/
We can create a hypothetical spectrum based on our rules, where B and Y ions are estimated at 100%, plus 2 ions are estimated at 50%, and other stragglers are at 20%.
B/Y type ions (100%)
B/Y +2H type ions (50%)
A type ions, B/Y -NH3/-H2O (20%)
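Those rules translate fairly directly into code. The sketch below builds a hypothetical spectrum for ELVISLIVESK using the percentages above. Several things here are assumptions made for illustration and are not from the slides: the masses are standard monoisotopic values, peaks are binned to the nearest integer m/z, and the "+2H" ions are read as the doubly charged form of each B or Y ion.

```python
# Standard monoisotopic masses (Da) for the residues in ELVISLIVESK.
RESIDUE_MASS = {"E": 129.04259, "L": 113.08406, "V": 99.06841,
                "I": 113.08406, "S": 87.03203, "K": 128.09496}
PROTON, WATER, AMMONIA, CO = 1.00728, 18.01056, 17.02655, 27.99491


def model_spectrum(peptide):
    """Map integer m/z bin -> expected relative intensity (percent)."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    spectrum = {}

    def add(mz, intensity):
        # When two ions share a bin, keep the stronger expectation.
        bin_ = round(mz)
        spectrum[bin_] = max(spectrum.get(bin_, 0), intensity)

    for cut in range(1, len(peptide)):
        b = sum(masses[:cut]) + PROTON
        y = sum(masses[cut:]) + WATER + PROTON
        for ion in (b, y):
            add(ion, 100)                # B/Y-type ions
            add((ion + PROTON) / 2, 50)  # "+2H", read here as doubly charged (assumption)
            add(ion - AMMONIA, 20)       # -NH3
            add(ion - WATER, 20)         # -H2O
        add(b - CO, 20)                  # A-type ions
    return spectrum


spectrum = model_spectrum("ELVISLIVESK")
for mz in sorted(spectrum)[:10]:
    print(mz, spectrum[mz])
```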
Model Spectrum
So if we consider the
spectrum that was derived
from the ELVIS LIVES K
peptide…
We can find where
the overlap is
between the
hypothetical and the
actual spectra…
And say conclusively based on
the evidence that the spectrum
does belong to the ELVIS
LIVES K peptide.
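Finding the overlap can be as simple as asking which hypothetical peaks have a real peak nearby. The following is a rough sketch of that comparison, assuming a fixed m/z tolerance; the peak lists are made-up numbers for illustration, not the actual ELVISLIVESK data, and the function name is hypothetical.

```python
def shared_intensity(model, observed, tolerance=0.5):
    """Sum the model intensity of every model peak that has an observed
    peak within `tolerance` m/z units. Peaks are (m/z, intensity) pairs."""
    score = 0.0
    for mz, intensity in model:
        if any(abs(mz - obs_mz) <= tolerance for obs_mz, _ in observed):
            score += intensity
    return score


# Made-up peak lists for illustration only.
model_peaks = [(130.05, 100), (147.11, 100), (243.13, 100), (102.05, 20)]
observed_peaks = [(130.1, 35.0), (243.2, 80.0), (310.0, 12.0)]

matched = shared_intensity(model_peaks, observed_peaks)
possible = sum(intensity for _, intensity in model_peaks)
print(matched, possible)
```

The ratio of matched to possible intensity gives a crude sense of how much of the hypothetical spectrum is supported by the evidence.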
But who cares?
The more important
question is
“what about
situations
where we
don’t know the
sequence?”
We guess!
And so this was an approach followed by a program called PepSeq, which would guess every combination of amino acids possible, build a hypothetical spectrum, and find the best matching hypothetical.
PepSeq
AAAAAAAAAA
AAAAAAAAAC
AAAAAAAACC
AAAAAAACCC
…
ELVISLIVESK
…
WYYYYYYYYY
YYYYYYYYYY
J. Rozenski et al., Org. Mass Spectrom., 29 (1994) 654-658.
PepSeq
This was a start, but it’s clearly impossibly hard with larger peptides and there’s a lot of room to overfit the data.
• Impossibly hard after 7 or 8 amino acids!
• High false positive rate because you consider so many options
So obviously this isn’t going to work in the long run. Another strategy is needed!
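A quick back-of-the-envelope calculation shows why guessing everything gets out of hand. This sketch simply enumerates candidates the way described above; it is not PepSeq’s actual code, and the function names are invented for the example.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def candidate_count(length):
    """Number of distinct peptides of a given length."""
    return len(AMINO_ACIDS) ** length


def candidates(length):
    """Lazily yield every possible peptide of a given length."""
    for letters in product(AMINO_ACIDS, repeat=length):
        yield "".join(letters)


for length in (3, 7, 8, 11):
    print(length, candidate_count(length))

print(next(candidates(3)))  # the first guess for a 3-residue peptide
```

Each extra residue multiplies the number of hypothetical spectra to build and score by 20.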
Sequencing Explosion
We needed a new invention to come around, and that was shotgun Sanger sequencing. In ’89 and ’90 the Yeast and Human Genome projects were announced, followed by the first chromosome in ’92, et cetera, et cetera.
• 1977 Shotgun sequencing invented, bacteriophage φX174 sequenced.
• 1989 Yeast Genome project announced
• 1990 Human Genome project announced
• 1992 First chromosome (Yeast) sequenced
• 1995 H. influenzae sequenced
• 1996 Yeast Genome sequenced
• 2000 Human Genome draft
In 1994 Jimmy Eng and John Yates published a technique to exploit genome sequencing for use in tandem mass spectrometry.
Eng, J. K.; McCormack, A. L.; Yates, J. R. III
J. Am. Soc. Mass Spectrom. 1994, 5, 976-989.
And the idea was …
SEQUEST
.…instead of searching all
possible peptide sequences,
search only those in
genome databases.
Now, in the postgenomic world this
seems like a pretty
trivial idea,
but back then there was a
lot of assumption placed on
the idea
that we’d actually have a
complete Human genome in
a reasonable amount of
time.
SEQUEST
2*10^14 -- All possible 11mers (ELVISLIVESK)
2*10^10 -- All possible peptides in NR
1*10^8 -- All tryptic peptides in NR
4*10^6 -- All Human tryptic peptides in NR
So, in terms of 11 amino acid peptides, we’re talking about a 10 thousand fold difference between searching every possible 11mer and searching those in the current non-redundant protein database from the NCBI, and roughly a 50 million fold difference for searching human tryptic peptides.
So that was
huge,
it made hypothetical
spectrum matching
feasible.
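To give a feel for how a database shrinks the search, here is a small sketch of an in-silico tryptic digest. It assumes the commonly used rule that trypsin cuts after K or R unless the next residue is P; the protein string is made up for illustration, and the search-space sizes are the ones quoted on the slide.

```python
def tryptic_digest(protein):
    """Split a protein after every K or R that is not followed by P."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # whatever is left at the C-terminus
    return peptides


# A made-up protein sequence, for illustration only.
peptides = tryptic_digest("MKAEPTIRELVISLIVESKGRP")
print(peptides)

# Search-space sizes quoted on the slide.
all_11mers, nr_peptides, human_tryptic = 2e14, 2e10, 4e6
print(all_11mers / nr_peptides)    # fold reduction from using NR
print(all_11mers / human_tryptic)  # fold reduction for human tryptic peptides
```

Only the handful of peptides that come out of the digest need hypothetical spectra, instead of every possible combination of amino acids.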
SEQUEST made a couple of other interesting improvements as well. Jimmy and John noted that there was a discrepancy between the intensities of the hypothetical spectrum and the actual spectrum. Instead of trying to make a better model, they decided just to make the actual spectrum look like the model with normalization…
SEQUEST Model Spectrum
Like so.
For a scoring function they decided to use Cross-Correlation, which basically sums the peaks that overlap between the hypothetical and the actual spectra.
And then they shifted the spectra back and forth so that the peaks shouldn’t align. They used this number, also called the Auto-Correlation, as their background.
SEQUEST XCorr
This is another representation of the Cross Correlation and the Auto Correlation.
[Figure: Correlation Score vs. Offset (AMU), showing the Cross Correlation (direct comparison) and the Auto Correlation (background)]
Gentzel M. et al
Proteomics 3 (2003) 1597-1610
The XCorr score is the Cross Correlation divided by the average of the auto correlation over a 150 AMU range.
$$\mathrm{XCorr} = \frac{\mathrm{CrossCorr}}{\operatorname{avg}\left(\mathrm{AutoCorr}\right)_{\text{offset} = -75 \text{ to } 75}}$$
The XCorr is high if the direct comparison is significantly greater than the background, which is obviously good for peptide identification.
And this XCorr is actually
a pretty robust method for
estimating how accurate
the match is,
and so far, there really
haven’t been any
significant
improvements on it.
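For the curious, the XCorr ratio as described in this talk can be sketched like this. It is a toy version under stated assumptions, not SEQUEST’s implementation: spectra are lists of intensities in 1 AMU bins, the background average includes every offset from -75 to 75, and the peak positions are made up.

```python
def correlation(model, observed, offset):
    """Sum of products of the two binned spectra, with the model shifted by `offset` bins."""
    total = 0.0
    for i, value in enumerate(observed):
        j = i - offset
        if 0 <= j < len(model):
            total += value * model[j]
    return total


def xcorr(model, observed, window=75):
    """Direct comparison divided by the average over offsets -window..window."""
    direct = correlation(model, observed, 0)
    background = [correlation(model, observed, off) for off in range(-window, window + 1)]
    mean_background = sum(background) / len(background)
    return direct / mean_background if mean_background else float("inf")


def binned(peaks, size=400):
    """Turn a list of peak positions into a 0/1 spectrum with `size` bins."""
    spectrum = [0.0] * size
    for position in peaks:
        spectrum[position] = 1.0
    return spectrum


model = binned([100, 150, 200, 250])
good_match = binned([100, 150, 200, 250])
poor_match = binned([110, 170, 230, 290])
print(xcorr(model, good_match), xcorr(model, poor_match))
```

The matching spectrum scores far above its own background, while the shifted one scores nothing at zero offset.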
SEQUEST
$$\mathrm{DeltaCn} = \frac{\mathrm{XCorr}_1 - \mathrm{XCorr}_2}{\mathrm{XCorr}_1}$$
The DeltaCn is another
score that scientists often
use.
It measures how good the
XCorr is relative to the
next best match.
As you can see, this
is actually a pretty
crude calculation.
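In code, the DeltaCn really is just a couple of lines. This sketch assumes you already have the XCorr of every candidate peptide for one spectrum; the scores are invented for the example.

```python
def delta_cn(xcorrs):
    """Relative gap between the best and second-best XCorr for one spectrum."""
    ranked = sorted(xcorrs, reverse=True)
    best, second = ranked[0], ranked[1]
    return (best - second) / best


# Invented XCorr values for three candidate peptides.
print(delta_cn([2.4, 3.2, 1.1]))
```

Everything below the second-best candidate is ignored, which is what makes it such a rough measure.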
Here’s another
representation of
that sentiment.
The XCorr is a strong
measure of accuracy,
whereas the DeltaCn is a weak measure of relative goodness.
                  SEQUEST
Accuracy Score    Strong (XCorr)
Relative Score    Weak (DeltaCn)
Obviously, there could be an alternative method that
focuses more on the success of the relative score.
Mascot and X! Tandem fit that bill.
                  SEQUEST           Alternate Method
Accuracy Score    Strong (XCorr)    Weak
Relative Score    Weak (DeltaCn)    Strong
X! Tandem Scoring
by-Score = Sum of intensities of peaks matching B-type or Y-type ions
$$\mathrm{HyperScore} = \text{by-Score} \times N_y! \times N_b!$$
Now the X! Tandem accuracy score is rather crude. It only considers B and Y ions and attaches these factorial terms with an admittedly hand waving argument.
Fenyo, D.; Beavis, R. C.
Anal. Chem., 75 (2003) 768-774
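Here is the HyperScore formula as a tiny sketch. It assumes that N_b and N_y are the number of matched B-type and Y-type peaks, and the intensities are invented for illustration; it is not X! Tandem’s actual code.

```python
from math import factorial


def hyperscore(matched_b, matched_y):
    """by-Score times Nb! times Ny!, given the intensities of the matched peaks."""
    by_score = sum(matched_b) + sum(matched_y)
    return by_score * factorial(len(matched_b)) * factorial(len(matched_y))


# Invented intensities: three matched B ions and two matched Y ions.
print(hyperscore([10, 20, 5], [30, 15]))
```

The factorials make the score grow very quickly with the number of matched ions, so a long run of matches counts for much more than one big peak.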
Distribution of “Incorrect” Hits
But instead of just comparing the best match to the second best, it looks at the distribution of lower scoring hits, assuming that they are all wrong. This is somewhat based on ideas pioneered with the BLAST algorithm.
[Figure: histogram of # of Matches vs. Hyper Score, with the Second Best and Best Hit marked]
Here, every bar represents the number of matches at a given score. The X! Tandem creators found that the distribution decays (or slopes down) exponentially…
Estimate Likelihood (E-Value)
…and the log of the distribution is relatively linear because of the exponential decay.
[Figure: Log(# of Matches) vs. Hyper Score, with a linear fit labeled Expected Number Of Random Matches and the Best Hit marked]
If the distribution represents the number of random matches at any given score, the linear fit should correspond to the expected number of random matches.
[Figure annotation: Score of 60 has 1/10 chance of occurring at random]
And from this, you can calculate the likelihood
that the best match is random.
In this case, a score of 60 corresponds to a log (base 10) number of matches of -1, which means the estimated number of random matches for that score is 0.1.
This is called an
E-Value, or
Expected-Value.
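The extrapolation can be sketched with a plain least-squares line through the log of the histogram. This is a simplified illustration, not X! Tandem’s actual fitting procedure: the histogram is invented so that it decays exactly tenfold every 10 score units, and the function names are hypothetical.

```python
from math import log10


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


def e_value(histogram, best_score):
    """Extrapolate log10(# of matches) to the best score and undo the log."""
    scores = sorted(score for score, count in histogram.items() if count > 0)
    logs = [log10(histogram[score]) for score in scores]
    slope, intercept = fit_line(scores, logs)
    return 10 ** (slope * best_score + intercept)


# Invented histogram of lower-scoring (assumed incorrect) hits: score -> # of matches.
histogram = {10: 10000, 20: 1000, 30: 100, 40: 10, 50: 1}
print(e_value(histogram, best_score=60))
```

With this made-up histogram the fitted line reaches -1 at a score of 60, the same situation as the example on the slide.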
X! Tandem and Mascot
Now, X! Tandem calculates this E-Value empirically.
E-Value = Likelihood that match is incorrect relative to N guesses. Empirical (X! Tandem)
P-Value = Likelihood that match is incorrect (E~P·N). Theoretical (Mascot)
Another search engine, Mascot, tries to
get at the same kind of number using
theoretical calculations,
most likely based on the number of identified peaks
and the likelihood of finding certain amino acids in the
genome database.
They’ve never explicitly published their
algorithm, so we’ll never really know,
but I suspect it’s something smart.
I just want to bring up a point
that we’ll touch on a little
later…
…the E-Value that X! Tandem
calculates
and the P-Value that Mascot
calculates are
probabilistically based,
but they can only estimate the
likelihood that the match is wrong.
Probability = Likelihood that match is correct
Note: Probability ≠ 1-P!
This is realistically not
nearly as useful as
knowing
the probability that a peptide
identification is right,
which
is NOT 1 minus
the P-Value.
Now, let’s go back and fill in the X! Tandem part of our
accuracy/relativity scoring grid.
                  SEQUEST    X! Tandem
Accuracy Score    XCorr      HyperScore
Relative Score    DeltaCn    E-Value
To reiterate, the XCorr is an excellent measure of accuracy…
…whereas the E-Value is an excellent measure of how
good the best score is relative to the rest.
If we assume that accuracy and relativity
scores are independent measures of
goodness,
could we use both the SEQUEST’s XCorr and
X! Tandem’s E-Value together?
And the answer
is a resounding
yes.
Each point on this
graph is a spectrum,
where correct
identifications are
marked in red, while
incorrect identifications
are marked in blue.
We know what’s
correct and incorrect
because this is a
control sample.
[Figure: 10 Protein Control Sample; X! Tandem -log(E-Value) vs. SEQUEST Discriminant Score]
Although in general the spectra SEQUEST scores well
are spectra X!Tandem also scores well,
there is considerable scatter between the search
engines.
One might wonder whether X! Tandem and Mascot, which use similar scoring approaches, would benefit as much, but the answer is, surprisingly, still yes!
[Figure: 10 Protein Control Sample; X! Tandem -log(E-Value) vs. Mascot Ion-Identity Score]
Now, why are the scores so different?
Why So Different?
• Sequest
– Considers relative intensities
• X! Tandem
– Considers semi-tryptic peptides
– Considers only B/Y-type Ions
• Mascot
– Considers theoretical P-Value relative to search space
Well, here are a couple of possible reasons. SEQUEST is the only method to consider relative intensities.
X! Tandem is the only method to consider peptides outside the standard search space by default, such as semi-tryptic peptides. However, it’s the only score that considers only B and Y ions, as opposed to a complete model.
And Mascot is the only search engine to compute a completely theoretical P-Value.
Consider Multiple Algorithms?
So we clearly want to consider multiple search engines simultaneously, but how?
[Figure: X! Tandem -log(E-Value) vs. Mascot Ion-Identity Score]
How To Compare Search Engines?
– SEQUEST: XCorr>2.5, DeltaCn>0.1
– Mascot: Ion Score-Identity Score>0
– X! Tandem: E-Value<0.01
You can’t use a thresholding system because it’s impossible to find corresponding thresholds. For example, a SEQUEST match with an XCorr of 2.5 doesn’t mean the same thing as an X! Tandem match with an E-Value of 0.01.
The simplest way would be to convert the scores into probabilities and compare those.
Need to convert scores to probabilities!
We advocate for Andrew Keller and Alexey Nesvizhskii’s Peptide Prophet approach because it actually calculates a true probability, not just a p-value.
10 Protein Control Sample (Q-ToF)
X! Tandem approach
So if you remember, X! Tandem considers the best peptide match for a spectrum against a distribution of incorrect matches.
[Figure: # of Matches vs. Mascot Ion-Identity Score; the Other Incorrect IDs for Spectrum form the distribution, and the best match is marked Possibly Correct?]
10 Protein Control Sample (Q-ToF)
Peptide Prophet approach
Well, Peptide Prophet looks across the entire sample, and not at just one spectrum at a time. It compares the best match against all of the other best matches in the sample, which is clearly bimodal.
[Figure: # of Matches vs. Mascot Ion-Identity Score for ALL Other “Best” Matches, with one match marked Possibly Correct?]
Keller, A. et al
Anal. Chem. 74, 5383-5392
The low mode represents matches that are most likely wrong while the high mode represents matches that are probably right.
Peptide Prophet curve fits two distributions to the modes, following the assumption that the low scoring distribution is “Incorrect” and that the higher scoring distribution is “Correct”.
[Figure: # of Matches vs. Mascot Ion-Identity Score, with fitted “Incorrect” and “Correct” distributions and one match marked Possibly Correct?]
10 Protein Control Sample (Q-ToF)
These two distributions can be analyzed using Bayesian statistics with this formula, where + means the match is correct, - means it is incorrect, and D is the observed score.
$$p(+ \mid D) = \frac{p(D \mid +)\,p(+)}{p(D \mid +)\,p(+) + p(D \mid -)\,p(-)}$$
Now that formula looks pretty complex, but…
It just calculates the height of the correct distribution at a particular score, divided by the combined height of both distributions at that score.
This is essentially the probability of having that score and being correct divided by the probability of just having that score.
$$p(+ \mid D) = \frac{p(D \mid +)\,p(+)}{p(D \mid +)\,p(+) + p(D \mid -)\,p(-)} = \frac{\text{prob of having score and being correct}}{\text{prob of having score}}$$
[Figure: # of Matches vs. Mascot Ion-Identity Score, with the “Incorrect” and “Correct” distributions and one match marked Possibly Correct?]
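As a rough sketch of that calculation, here is the formula with two bell curves standing in for the fitted distributions. The Gaussian shapes, their means and widths, and the 50/50 mix of correct and incorrect matches are all assumptions made for illustration; they are not the shapes or values Peptide Prophet fits.

```python
from math import exp, pi, sqrt


def gaussian(x, mean, sd):
    """Height of a normal curve at x."""
    return exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * sqrt(2 * pi))


def probability_correct(score, correct, incorrect, prior_correct):
    """p(+|D) for one score; `correct` and `incorrect` are (mean, sd) pairs."""
    up = gaussian(score, *correct) * prior_correct            # p(D|+) p(+)
    down = gaussian(score, *incorrect) * (1 - prior_correct)  # p(D|-) p(-)
    return up / (up + down)


# Hypothetical fitted distributions on some score axis.
CORRECT, INCORRECT = (2.0, 1.0), (-2.0, 1.0)
for score in (-3, 0, 3):
    print(score, probability_correct(score, CORRECT, INCORRECT, prior_correct=0.5))
```

A score halfway between the two modes comes out as a coin flip, and the probability climbs toward 1 as the score moves into the correct distribution.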
This is a neat method because it actually
considers the likelihood of being correct,
rather than X! Tandem and Mascot, which only
calculate the probability of being incorrect.
It’s because of this that Peptide Prophet can produce a true probability, which is important when the sample characteristics change.
[Figure: Q-ToF; # of Matches vs. Mascot Ion-Identity Score, with the “Incorrect” and “Correct” distributions and one match marked Possibly Correct?]
For example, the control sample we’ve
been looking at was derived from Q-ToF
data
which produces pretty
high quality results
If you compare that to the same sample run on an Ion Trap, the probability of being correct is greatly diminished.
If you’ll note, the Incorrect
distribution doesn’t
change very much
between the two
analyses, however, the
likelihood that the
identification is right
changes dramatically!
[Figure: two panels, Q-ToF and Ion Trap; # of Matches vs. Mascot Ion-Identity Score, each with “Incorrect” and “Correct” distributions and one match marked Possibly Correct?]
As Peptide Prophet considers the correct distribution, it is immune to
fluctuations between samples.
P-Values and E-Values don’t consider this
information, so they can’t be compared across
multiple samples, or different examinations of the
same sample
hence the reason why we
need to use Peptide
Prophet for comparing two
different search engines
Consider Multiple Algorithms?
So going back to the scatter plot between X! Tandem and Mascot, we can use Peptide Prophet to compute the score threshold that represents a 95% cut-off… Like so.
[Figure: X! Tandem -log(E-Value) vs. Mascot Ion-Identity Score, with cut-offs marked Mascot: -2.5=95% and X! Tandem: 2.6=95%]
This allows you to fairly consider the answers from both search engines
simultaneously.
The important thing to note, is that if you looked at a different sample, these
thresholds should change depending on the height of the correct distributions
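To see why the thresholds move, here is a small sketch that searches for the lowest score reaching a 95% probability of being correct. As before, the two Gaussian distributions and their parameters are assumptions made for illustration, and only the fraction of correct matches differs between the two hypothetical samples.

```python
from math import exp, pi, sqrt


def gaussian(x, mean, sd):
    """Height of a normal curve at x."""
    return exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * sqrt(2 * pi))


def probability_correct(score, prior_correct, correct=(2.0, 1.0), incorrect=(-2.0, 1.0)):
    """p(+|D) with hypothetical (mean, sd) pairs for the two distributions."""
    up = gaussian(score, *correct) * prior_correct
    down = gaussian(score, *incorrect) * (1 - prior_correct)
    return up / (up + down)


def threshold(prior_correct, cutoff=0.95):
    """Lowest score, in steps of 0.01, whose probability of being correct reaches the cutoff."""
    for step in range(-500, 1001):
        score = step / 100
        if probability_correct(score, prior_correct) >= cutoff:
            return score
    return None


rich_sample = threshold(prior_correct=0.5)  # half of the best matches are right
poor_sample = threshold(prior_correct=0.1)  # only a tenth of them are right
print(rich_sample, poor_sample)
```

The sample with fewer correct identifications needs a higher score to reach the same 95% confidence, which is exactly why a fixed score cut-off can’t be carried from one sample to another.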
Conclusion
• All search engines use different criteria, producing different scores
• Using multiple search engines simultaneously yields better results
• Peptide Prophet can normalize search engine results
So in conclusion, all of the search engines look at different criteria, and we can leverage this to identify more peptides, and Peptide Prophet is a great mechanism for doing that because it calculates true probabilities, instead of p-values.
The End