De Novo Sequencing and Homology Search with De Novo

advertisement
De Novo Sequencing and Homology
Searching with De Novo Sequence Tags
Possible Ways to Interpret MS/MS Data
2
MS/MS Spectra
de novo
sequencing
1
database
search
peptides
protein DB
Inexact
protein DB
peptides
homology
search
homologous
peptides
3
Why Bother?
• De novo sequencing derives the sequence without
looking into a database.
• De novo sequencing is useful for
– unsequenced genomes (no protein database)
– novel peptides (unmatched spectra after database
search)
– single amino acid polymorphism
– unexpected PTM
– database error
– validate a database match
Outline
•
•
•
•
Basics
Manual De Novo Sequencing
De novo Sequencing Algorithm (PEAKS)
Homology Search with De Novo Tags
Sequence-specific fragment ions
(M+2H)+2
O
[N-term]-NH-CHR-C---NH2+-[C-term]H+
[N-term]-NH-CHR-CO+
b-ion
[N-term]-NH=CHR+ + CO
a-ion
+
NH2-[C-term] H+
y-ion
a – NH3 or H2O
b – NH3 or H2O
y – NH3 or H2O
Non-sequence-specific fragmentations
6
Why does everyone analyze positivelycharged tryptic peptides?
• Usually better sensitivity from positively-charged peptide
ions.
• “Mobile protons” protonate peptide bonds and promote b/y
fragmentation
• Arg sequesters protons in gas phase
• Tryptic peptides typically have 0 -1 Arg
• Tryptic peptide ions typically have two protons
• Therefore, tryptic peptides usually have b/y ions
• Placing Arg’s at the C-terminus makes it more likely that a
complete series of y-ions will be observed.
MS/MS spectrum of doubly-charged tryptic peptide
(one Arg and two protons)
Y
Database search test mix
11719.79003906
2003_061802 46 (45.418) Sm (SG, 2x3.00); Cm (45:48)
3: TOF MSMS 464.26ES+
405
y6 y5 y4 y3 y2 y1
100
Y L Y E I A R
b2
a2
%
y2 b2
L
y1
0
50
y5
y4
y3
y6
m/z
100
150
200
250
300
350
400
450
500
550
600
650
700
750
800
850
MS/MS spectrum of a doubly-charged nontryptic peptide(two Arg’s and two protons)
4050.02978516+2
(M+2H)
500fmol bsa, Asp-N, 2%
2003_092302 6 (16.310) Sm (SG, 2x3.00); Cm (4:6)
y2
Relative Ab. (%)
100
Y S R R H P E
P
y4-17
y4 b4
(b5+18)+2
H
0
50
a5-17
(b6+18)+2
%
Y
2: TOF MSMS 472.76ES+
77.1
y3
y1
b5-17
y6-17
m/z
100
150
200
250
300
350
400
450
500
550
600
650
700
750
800
850
900
Relative Abundance
CID in traps vs quadrupoles
b13
IPIGFAGAQGGFDTR
y2
Ion trap
y9
y14+2
y10
b6 y6
y5
b4
y3
b5
b3
300
400
500
y4
b7 b8
b10
b12
b11
600
700
800
y2
Relative Abundance
b9
y7 y8
y12
900
y11
y13
1000 1100 1200 1300 1400 1500
m/z
IPIGFAGAQGGFDTR
b2
Qtof
(M + 2H) +2
b4
y1
y3
b3
b6
b5
y4
y5
y14+2
y6
y7 y8
y9 y
10
y11
y12
b13 y13
0
100
200
300
400
500
600
700
m/z
800
900 1000 1100 1200 1300 1400 1500
Annoying things to remember when
sequencing peptides by MS/MS
• Leucine and isoleucine have the same mass
• Glutamine and lysine differ by 0.036 u
• Phenylalanine and oxidized methionine differ by 0.033 u
• Cleavages do not occur at every bond (more often than not, there
is no cleavage between the first and second residues)
• Certain amino acids have the same mass as pairs of other amino
acids: G + G = N, A + G = Q, G + V ~ R, A + D ~ W,
S+V~W
• However: mass accuracy resolves many of these ambiguities
Outline
•
•
•
•
Basics
Manual De Novo Sequencing
De novo Sequencing Algorithm (PEAKS)
Homology Search with De Novo Tags
Two approaches to manually sequencing
peptides from MS/MS spectra
1. Finding a series of ions in the middle of the
peptide, and working out towards the termini
(illustrated using ion trap data)
2. Finding the C-terminus and working towards the Nterminus (illustrated using qtof data)
Sequencing from the middle: look for ion series
in the region above the precursor ion (m/z 615)
An obvious series is the one that involves the more
abundant fragment ions (m/z 575, 688, 775, 888, and 987)
L
S
L
V
Another ion series contains pairs separated by 18 Da (water
losses)
L
V
E
S
-18
-18
-18
-18
-18
Two ion series have been identified in
the region above the precursor ion
Problem: Two ion series defining partial sequences LSLV and
LVES have been identified, but it is not known if these are y- or bions (i.e., the sequence direction is unknown).
Solution: Since ion trap data often exhibits high mass b-ions, check
to see if the highest mass ion in either series corresponds to a loss of
either Arg or Lys (the usual tryptic C-terminus). If not, check to see
if the mass difference corresponds to a dipeptide containing Lys or
Arg (it is possible that the b-ion defining the C-terminus is absent).
Calculation: Peptide MW – 17 – fragment ion = C-terminal
residue mass
For the first ion series: 1228 – 17 – 987 = 224 Da
L
S
L
V
224 – 128 = 96
224 – 156 = 68
Therefore this does
not look like a b-ion
series
For the second ion series: 1228 – 17 – 1083 = 128 (the residue
mass of Lys); this looks like a b-ion series and maybe the other
one is a y-ion series
L
V
E
S
K
-18
-18
-18
-18
-18
The high mass b-series predicts the presence of some
low mass y-ions; are they there?
b-series: …LVESK
y1: 147 No
y2: 234 Yes
y3: 363 Yes
y4: 462 Yes
y5: 575 Yes!!
y-ions = residue mass
plus 19 Da
b-ions
y-ions
The high mass y-series predicts the presence of some
low mass b-ions; are they there?
y-series: [242]VLSL…
b2: 243 Yes
b3: 342 Yes
b4: 455 Yes
b5: 542 Yes
b6: 655!! Yes
b-ions
y-ions
Can I account for most of the remaining ions as
neutral losses or internal fragments?
[242]VLSLLVESK
242 = N+Q, N+K, L+E
b-ions
y-ions
neutral loss
Two approaches to manually sequencing
peptides from MS/MS spectra
1. Finding a series of ions in the middle of the peptide, and
working out towards one of the termini (illustrated using ion
trap data)
2. Finding the C-terminus and working towards the Nterminus (illustrated using qtof data)
Outline
•
•
•
•
Basics
Manual De Novo Sequencing
De novo Sequencing Algorithm (PEAKS)
Homology Search with De Novo Tags
Algorithm Design
• The first thing for algorithm design is to define
the property of the solution.
• For the de novo sequencing problem, one
wants to compute a peptide that “best
matches” the given spectrum.
• This “best match” is practically defined by a
scoring function.
Peptide-Spectrum Match Score
peptide
prefix
𝑎1 ⋯ 𝑎𝑖 𝑎𝑖+1 ⋯ 𝑎𝑘
𝑎1 ⋯ 𝑎𝑖
𝑎𝑖+1 ⋯ 𝑎𝑘
suffix
• A fragment score can be computed for every two adjacent amino acids. This
score depends on the presence of the corresponding b and y ions.
• The peptide score is the sum of the fragment scores.
The Fragment Score 𝑓(𝑚) for a Mass
peptide
prefix
𝑎1 ⋯ 𝑎𝑖 𝑎𝑖+1 ⋯ 𝑎𝑘
𝑚 𝑎𝑖
𝑎1 ⋯
𝑎𝑀
𝑚𝑎𝑘
𝑖+1−⋯
suffix
• The fragment score calculation only requires the prefix mass but not the sequence
• Note: suffix mass = total residue mass – prefix mass.
• Thus it is possible to define score 𝑓(𝑚) for each prefix mass value 𝑚.
How to Define 𝑓(𝑚) Statistically
• Learn two probabilities from large training data
– 𝑝 : Prob(a peak is observed at a y-ion m/z).
– 𝑞 : Prob(a peak is observed at a random m/z).
– Usually 𝑝 > 𝑞.
𝑝
𝑞
• If an expected y-ion is observed, log is added to 𝑓(𝑚).
– log
𝑝
𝑞
is called the log-likelihood-ratio
• If an expected y-ion is missing,
1−𝑝
log
,
1−𝑞
is added to 𝑓(𝑚).
• Thus, matching ion is rewarded and missing ion is penalized.
• Other fragment ion types can be considered similarly.
De Novo Sequencing
• For a sequence 𝑃 with prefix masses
𝑚1 , 𝑚2 , ⋯ , 𝑚𝑘 , the peptide score is defined as
𝑠𝑐 𝑃 = 𝑓 𝑚1 + 𝑓 𝑚2 + ⋯ 𝑓 𝑚𝑘
• De Novo Seuqenicng: Given scoring function
𝑓(𝑚) and mass 𝑀, computes a sequence P
with total residue mass 𝑀, and maximizing
𝑠𝑐(𝑃).
Algorithm Idea
• 𝐵𝑒𝑠𝑡𝑆𝑐𝑜𝑟𝑒(𝑚): the maximized score that can be achieved by
a prefix with mass 𝑚.
best sequence for 𝑚
𝑃
𝑃’
𝑚 − 𝑚(𝑎)
𝑎
𝑚 𝑎
• If 𝑃 = 𝑃’𝑎 is the best sequence for 𝑚, then 𝑃’ must be the best
sequence for 𝑚 − 𝑚(𝑎).
• Thus, 𝐵𝑒𝑠𝑡𝑆𝑐𝑜𝑟𝑒 𝑚 = 𝑓 𝑚 + 𝐵𝑒𝑠𝑡𝑆𝑐𝑜𝑟𝑒 𝑚 − 𝑚 𝑎 .
• To compute 𝑎, try 20 residues and use the one that maximizes the
above formula.
Dynamic Programming
⋯⋯
BestScore
⋯⋯
0 1 2 3 ⋯⋯
⋯⋯
𝑚 − 𝑚(𝑎)
⋯⋯
𝑚
𝑀
• The algorithm initializes 𝐵𝑒𝑠𝑡𝑆𝑐𝑜𝑟𝑒(0) = 0 and all other cells to be −∞.
• Then computes 𝐵𝑒𝑠𝑡𝑆𝑐𝑜𝑟𝑒(𝑚) for 𝑚 from 1 to 𝑀 by
𝐵𝑒𝑠𝑡𝑆𝑐𝑜𝑟𝑒 𝑚 = 𝑓 𝑚 + max 𝐵𝑒𝑠𝑡𝑆𝑐𝑜𝑟𝑒 𝑚 − 𝑚 𝑎 .
𝑎
• The best sequence can be retrieved by a backtracking process by
repetitively computing the last amino acid 𝑎.
A Note on PTM
• Variable PTM does not cause major speed slow
down for de novo sequencing algorithms.
– Instead of trying 20 regular amino acids in the
maximization, the algorithm simply tries all modified
amino acids too.
– The time complexity is increased by a constant factor.
(Compare to the exponential growth in database search
approach).
• However, since the solution space is larger when
many variable PTMs are allowed, the accuracy of
the algorithm is reduced.
Accounting for Other Ion Types
• When internal cleavage ions are considered in the scoring
function, it becomes difficult to design efficient algorithm
to find the optimal sequence.
• A compromise between efficiency and accuracy is to
employ a two-stage approach.
– First, compute many (e.g. 10,000) sequences using an efficient
score function that uses only a few of the most important ions.
– Then, evaluate these candidates using a more sophisticated
scoring function additional ions.
• This two-round approach is a tradeoff between the
algorithm speed and accuracy.
Mass Segment Error
• Most errors are due to incomplete ion ladders in the
spectrum.
– Thus, a segment of amino acids cannot be determined.
– However, the total mass of the segment, is fixed.
– E.g. [242]VLSLLVESK, where 242 = N+Q, N+K, or L+E
• The first two or three residues often have low
confidence, because of a lack of fragment ions.
• Most de novo sequencing software uses the precursor
mass as a constraint (thus the mass of the derived
sequence is usually correct).
Outline
•
•
•
•
Basics
Manual De Novo Sequencing
De novo sequencing Algorithm
Homology Search with De Novo Tags
Why Homology Search with De Novo
Sequence
• Advantages:
– Database may not contain the exact peptide sequence,
but a homologous one is there.
– De novo + homology search is great to use the
database of one organism to study a similar organism.
• Disadvantages:
– De novo sequence can only provide partially correct
sequence tags.
– Conventional homology search may fail due to de
novo sequencing errors.
Traditional Sequence Alignment
• Two peptide sequences are aligned by inserting spaces to
appropriate positions. E.g.
FVEVTKL-TDLTK
| || || |||||
FAEV-KLVTDLTK
• The matching residues (including gaps, ‘-’) in each column has a
similarity score that can be looked up in a pre-defined amino acid
substitution matrix, such as BLOSUM or PAM.
• The alignment score is equal to the sum of the column-wise scores.
• There are algorithms to compute the optimal alignment that
maximizes the alignment score.
Limitations of Conventional Homology Search
• Conventional search ignores the possible errors in de
novo sequencing.
• Suppose a true sequence is SLCAFK, and the de novo
sequence is LSCFAK, and the homolog is SLAAFK.
(denovo)
X:
(homolog) Z:
LSCFAK
|
SLAAFK
(denovo)
(real)
X:
Y:
(homolog) Z:
Conventional search using
evolutionary similarities to explain
the mismatches results in a poor
match.
[LS]C[FA]K
[SL]C[AF]K
||
|| |
[SL]A[AF]K
If de novo sequencing errors are
considered, the match becomes
more significant.
A Simple Approach
• We can enumerate all possible combinations of a
mass segment, and search all of them together.
– MS BLAST will do this.
[LS]C[FA]K
• Difficulties:
LSCFAK
SLCFAK
TVCFAK
VTCFAK
LSCAFK
SLCAFK
TVCAFK
VTCAFK
– Do not know which portion of the sequence is error.
– Exponential growth of possibilities.
SPIDER Model
(de novo) X:
(real)
Y:
(homolog) Z:
[LS]C[FA]K
[SL]C[AF]K
||
|| |
[SL]A[AF]K
• Given a de novo sequence X, and a database sequence
Z. Try to reconstruct the real sequence Y.
– The difference between X and Y is explained by de novo
sequencing errors.
– The difference between Y and Z is explained by homology
mutations.
• The real Y should minimize the de novo errors and the
homology mutations needed in the above explanation.
Two exercises
(denovo) X:
(real)
Y:
(homolog) Z:
LSCFAV
SLCFAV
SLCF-V
•The swap of L and S is more likely a de novo
error than a mutation.
•The deletion of A is unlikely a de novo error
(de novo does not change peptide mass).
(denovo) X:
(real)
Y:
(homolog) Z:
LSCFV
EACFV
DACFV
• Mutation and de novo error overlap. Hard
for manual interpretation. Algorithm is
needed.
blosum62
m(LS)=m(EA)=200.1 Da
Conclusion
• When the target peptides are not in a database.
– De novo sequencing
• When the homologous peptides are in database
– Homology search with the de novo tags can find
them
– Some de novo errors can be corrected by
combining the homolog information
Download