PEAKS: De Novo Sequencing
using MS/MS spectra
Bin Ma,
U. Western Ontario, Canada
Kaizhong Zhang,
U. Western Ontario, Canada
Chengzhi Liang,
Bioinformatics Solutions Inc. Canada
Outline
• Background
– Tandem Mass Spectrometry
• De novo sequencing
– Problem Definition and Algorithm.
• Software implementation – PEAKS
• Future work
Background
• Humans have about 100,000 different proteins. Because
of post-translational modifications, each protein can
exist in many different versions.
• Diseases are closely related to abnormal proteins
or to abnormal expression levels of proteins.
• Given a tissue, identifying the proteins
(and their modified versions) in it is a
fundamental problem for drug design.
Proteins and Peptides
• A protein is a sequence of 20 different types
of amino acids.
– A protein is a string over alphabet with size 20
• A peptide is a substring of the protein.
• The 20 amino acids have 19 distinct masses.
– I and L have the same mass and are difficult to
distinguish by MS/MS.
– Regard them as the same letter.
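As a small illustration of the mass claim above, the sketch below groups the 20 standard residues by monoisotopic residue mass. The mass values are standard reference values rounded to five decimals and are not taken from the slides; the helper name `group_by_mass` is hypothetical.

```python
from collections import defaultdict

# Monoisotopic residue masses in Da (standard reference values, rounded).
MONOISOTOPIC_RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def group_by_mass(table):
    """Map each distinct mass to the list of letters that share it."""
    groups = defaultdict(list)
    for letter, mass in table.items():
        groups[mass].append(letter)
    return dict(groups)


groups = group_by_mass(MONOISOTOPIC_RESIDUE_MASS)
# Only I and L end up in the same group.
shared = [letters for letters in groups.values() if len(letters) > 1]
print(len(groups), shared)
```

K and Q differ by only about 0.036 Da, so they stay distinct here but collapse to the same value once masses are rounded to integers.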
Tandem Mass Spectrometry
• MS/MS is the only reliable way for protein
identification.
[Figure: workflow. tissue → fraction → gel → protein (…VITK | GTDIMNEMR | SMW…) → peptide → tandem mass spectrometer → MS/MS spectrum → database or de novo sequencing → peptide sequence: LGSSEVEQVQLVVDGVK]
How Does a Peptide Fragment?
m(b1)=1+m(A1)
m(b2)=1+m(A1)+m(A2)
m(b3)=1+m(A1)+m(A2)+m(A3)
m(y1)=19+m(A4)
m(y2)=19+m(A4)+m(A3)
m(y3)=19+m(A4)+m(A3)+m(A2)
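The formulas above can be sketched in a few lines. This is only an illustration: it assumes nominal (integer) residue masses and the slide's rounded offsets of 1 for b-ions and 19 for y-ions, and the function name `fragment_ions` is hypothetical.

```python
# Nominal (integer) residue masses, a simplifying assumption for illustration.
RESIDUE_MASS = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "L": 113, "I": 113, "N": 114, "D": 115, "Q": 128, "K": 128, "E": 129,
    "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}
B_OFFSET = 1   # m(b_i) = 1 + mass of the first i residues
Y_OFFSET = 19  # m(y_i) = 19 + mass of the last i residues


def fragment_ions(peptide):
    """Return (b_ions, y_ions) for all proper prefixes and suffixes."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    prefix = 0
    for m in masses[:-1]:            # b1 .. b(n-1)
        prefix += m
        b_ions.append(B_OFFSET + prefix)
    suffix = 0
    for m in reversed(masses[1:]):   # y1 .. y(n-1)
        suffix += m
        y_ions.append(Y_OFFSET + suffix)
    return b_ions, y_ions


b_ions, y_ions = fragment_ions("LVGR")
print(b_ions)  # [114, 213, 270]
print(y_ions)  # [175, 232, 331]
```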
Matching Sequence with Spectrum
De Novo Sequencing
• For any peptide P = a1…an, m(P) = Σi m(ai).
• De Novo Sequencing
– Given a spectrum and a mass value m,
compute a sequence P such that m(P) = m and
the matching score score(P) is
maximized.
A Simpler Case – Only Y-ions
Y-ions Determined By a Suffix
[Figure: y-ions y1, y2, y3 of a peptide, with the y-ion mass offset 19]
score(Q) can be defined for a suffix Q.
$\mathrm{score}(\mathrm{LVR}) = \mathrm{score}(\mathrm{VR}) + f(u)$
$\mathrm{DP}(u) = \max_{m(Q) = u} \mathrm{score}(Q)$
$\mathrm{DP}(u) = \max_{a} \mathrm{DP}(u - a) + f(u)$
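The y-ion-only recurrence can be sketched as follows. This is a minimal illustration, not the PEAKS implementation: it assumes integer nominal masses, merges I into L, and uses a hypothetical reward f(u) equal to the reward stored for a peak at mass 19 + u (0 if no such peak exists). The names `denovo_y_only` and `peaks` are made up for this example.

```python
NEG_INF = float("-inf")

# Nominal residue masses; I is treated as L, as suggested on the slides.
RESIDUE_MASS = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "L": 113, "N": 114, "D": 115, "Q": 128, "K": 128, "E": 129, "M": 131,
    "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}
Y_OFFSET = 19


def denovo_y_only(peaks, total_mass):
    """peaks maps integer peak mass -> reward. Returns (score, peptide)."""
    def f(u):
        # Hypothetical reward for a suffix of residue mass u.
        return peaks.get(Y_OFFSET + u, 0.0)

    dp = [NEG_INF] * (total_mass + 1)
    choice = [None] * (total_mass + 1)  # first letter of the best suffix
    dp[0] = 0.0
    for u in range(1, total_mass + 1):
        for letter, mass in RESIDUE_MASS.items():
            if mass <= u and dp[u - mass] > NEG_INF:
                cand = dp[u - mass] + f(u)
                if cand > dp[u]:
                    dp[u], choice[u] = cand, letter
    if dp[total_mass] == NEG_INF:
        return NEG_INF, None
    # Backtracking: peel off the first letter of each suffix in turn.
    letters, u = [], total_mass
    while u > 0:
        letters.append(choice[u])
        u -= RESIDUE_MASS[choice[u]]
    return dp[total_mass], "".join(letters)


# Peaks at y1 and y2 of the peptide LVE (19+129 and 19+129+99).
print(denovo_y_only({148: 1.0, 247: 1.0}, 341))  # (2.0, 'LVE')
```

With nominal masses, several sequences can tie (for example R and GV both weigh 156), so in general the returned peptide is one of possibly many optimal answers.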
Counting Both y and b ions
Strategies
• Consider a prefix R and a suffix Q
simultaneously.
• Consider only those pairs (R,Q) that satisfy a
nice property, which we call “chummy”.
• Chummy pairs have two useful properties:
– The score of a chummy pair can be computed
recursively from a smaller chummy pair.
– There is a series of chummy pairs that grows to
the optimal solution.
Dynamic Programming
• Combining Lemmas A and B, we can compute
$\mathrm{DP}(u, v) = \max_{\substack{m(R) = u,\ m(Q) = v \\ (R,Q)\ \text{chummy}}} \mathrm{score}(R, Q)$
• Suppose (R,Q) is the pair maximizing DP(u,v)
under the condition m(R)+m(Q)+a=m. Then RaQ
is the optimal peptide.
PEAKS – The Software
Comparison of PEAKS and Lutefisk
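| Sample | m/z | z | Correct Sequence | PEAKS (de novo) | Comments | Lutefisk (de novo) |
|---|---|---|---|---|---|---|
| BSA (MALDI MS/MS) | 927.4 | 1 | YLYEIAR | YLYEIAR | correct | [276.14]EY[184.08]R |
| BSA (MALDI MS/MS) | 1439.7 | 1 | RHPEYAVSVLLR | GVLMVDVPPADNGR | Wrong (?) | No results |
| BSA (MALDI MS/MS) | 1479.8 | 1 | LGEYGFQNALIVR | LWYGFQNALIVR | correct | No results |
| BSA (MALDI MS/MS) | 1639.8 | 1 | KVPQVSTPTLVEVSR | RAPKVPQVSTPTLVEVSR | correct | No results |
| Cyt-c (ESI MS/MS) | 482.7 | 2 | EDLIAYLK | EDLIAYLK | correct | [357.15]LAYLK |
| Cyt-c (ESI MS/MS) | 584.8 | 2 | TGPNLHGLFGR | TGPNLHGLFGR | correct | TGPNLHGLFGR |
| Cyt-c (ESI MS/MS) | 589.3 | 1 | GDVEK | VDVEK | V = Ac-G | VDVEK |
| Cyt-c (ESI MS/MS) | 634.4 | 1 | IFVQK | IFVQK | correct | IFVQK |
| Cyt-c (ESI MS/MS) | 678.3 | 1 | YIPGTK | YIPGTK | correct | YIPGTK |
| Cyt-c (ESI MS/MS) | 728.8 | 2 | TGQAPGFSYTDANK | TGQAPGFSYTDANK | correct | [199.10]SAPGF[250.09]TWNK |
| Cyt-c (ESI MS/MS) | 779.4 | 1 | MIFAGIK | MIFAGIK | correct | [244.12]FAGLK |
| Cyt-c (ESI MS/MS) | 792.9 | 2 | KTGQAPGFSYTDAMK | KTGAGAPGFSYTDAMK | almost | [229.15]QGAPGAYQNHANK |
| Cyt-c (ESI MS/MS) | 817.3 | 2 | IFVQKCAQCHTVEK | QFVTHMACCHTVEK | partial | [257.08][218.08][GP][260.08][HM]TVEK |
| Apo-Myoglobin | 662.3 | 1 | ASEDLK | ASEDLK | correct | [244.07]SALK |
| Apo-Myoglobin | 689.9 | 2 | HGTVVLTALGGILK | HGTVVLTALGGILK | correct | HGTVVLTALG[170.1]LK |
| Apo-Myoglobin | 748.4 | 1 | ALELFR | ALELFR | correct | [184.12]ELFR |
| Apo-Myoglobin | 803.9 | 2 | VEADIAGHGQEVLIR | LDADIAGHGQEVLIR | almost | no results |
| Apo-Myoglobin | 908.4 | 2 | GLSDGEWQQVLNVWGK | GLSDGEWQQVLNVWGK | correct | [170.11]SG[244.07]WQQVLNVWGK |
| Apo-Myoglobin | 943.2 | 2 | YLEFISDAIIHVLHSK | YLEFISDAIIHVLHSK | correct | [276.1]EFLSD[184.12]LHVLHSK |

Red = Correct (color marking on the original slide).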
Users
Implementation Particulars
• More accurate scoring:
– sum of the logarithmic intensities
– many other ion types
– coexisting ions, e.g., x2, y2, z2
• Deconvolution
– converting multiply-charged peaks to singly-charged
ones
• Recalibration
– compress/stretch the spectrum to correct calibration error
• Noise reduction
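The deconvolution item can be illustrated for the simplest case, where the charge z of a peak is already known. The sketch below only shows the arithmetic of mapping a multiply-charged m/z value to its singly-charged equivalent; the function name `to_singly_charged` is hypothetical, and the slides do not describe how PEAKS determines charge states.

```python
PROTON_MASS = 1.007276  # Da, mass of a proton


def to_singly_charged(mz, z):
    """Convert an m/z value observed at charge z to the equivalent 1+ m/z.

    An ion of charge z carries z protons; the 1+ form keeps only one,
    so (z - 1) proton masses are removed from the total ion mass mz * z.
    """
    if z < 1:
        raise ValueError("charge must be a positive integer")
    return mz * z - (z - 1) * PROTON_MASS


# A doubly-charged peak at m/z 482.7 corresponds to a 1+ peak near 964.39.
print(round(to_singly_charged(482.7, 2), 2))
```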
Acknowledgement
• Bin Ma and Kaizhong Zhang were supported
by NSERC.
• Chengzhi Liang was supported by BSI.
• Thanks to the development team at BSI for the
software development.
Tandem Mass Spectrometer
[Figure: tandem mass spectrometer. ions (MPSER+, PAK+, SG…+, …) → mass analyzer → precursor ions (PAK+) → fragment → fragment ions (P, AK+, PA, K+, P+, AK, PA+, K) → mass analyzer → detector]
Algorithm Sandwich
• DP(0,0) = 0; DP(u,v) = -infinity for (u,v) != (0,0);
• for u from 1 to m/2 do
    for v from u-max(a) to u+max(a) do
      for a in Σ do
        if u < v then
          DP(u+a, v) = max{ DP(u,v) + f(u,v), DP(u+a, v) }
        else
          DP(u, v+a) = max{ DP(u,v) + g(u,v), DP(u, v+a) }
• find u, v, a such that u+v+a = m and DP(u,v) is maximized;
• backtracking;
Dynamic Programming
1. for u from 0 to m
$\mathrm{DP}(u) = \max_{a} \mathrm{DP}(u - a) + f(u)$
2. backtracking
Dynamic Programming
$\mathrm{DP}(u, v) = \max_{\substack{m(R) = u,\ m(Q) = v \\ R\ \text{is a prefix},\ Q\ \text{is a suffix}}} \mathrm{score}(R, Q)$
• We hope DP(u,v) for u+v=m gives the optimal
prefix and suffix.
• The optimal solution can be obtained by
concatenating the prefix and the suffix.
Chummy Pairs
• Two strings Ra and bQ are called a chummy pair
iff either of the following two conditions is true:
(C1) $1 + m(R) \le 19 + m(bQ) \le 1 + m(Ra)$
(C2) $19 + m(Q) \le 1 + m(Ra) \le 19 + m(bQ)$
(LGE, LVR) satisfies (C2)
(LGE, VR) satisfies (C1)
(LGE, R) satisfies (C1)
(LG, VR) is not chummy
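The four examples above can be checked mechanically. The sketch below is only an illustration: it assumes nominal integer residue masses and reads both inequalities in (C1) and (C2) as non-strict, which is an assumption since the examples do not involve ties. The function names `mass` and `chummy_condition` are hypothetical.

```python
# Nominal residue masses for the letters used in the examples (assumption).
RESIDUE_MASS = {"L": 113, "G": 57, "E": 129, "V": 99, "R": 156}


def mass(s):
    """Total residue mass of a string; the empty string has mass 0."""
    return sum(RESIDUE_MASS[c] for c in s)


def chummy_condition(ra, bq):
    """Return "C1", "C2", or None for prefix Ra and suffix bQ."""
    if not ra or not bq:
        return None
    r, q = ra[:-1], bq[1:]            # drop last letter a / first letter b
    b_prev, b_new = 1 + mass(r), 1 + mass(ra)
    y_prev, y_new = 19 + mass(q), 19 + mass(bq)
    if b_prev <= y_new <= b_new:      # (C1), non-strict by assumption
        return "C1"
    if y_prev <= b_new <= y_new:      # (C2), non-strict by assumption
        return "C2"
    return None


for pair in [("LGE", "LVR"), ("LGE", "VR"), ("LGE", "R"), ("LG", "VR")]:
    print(pair, chummy_condition(*pair))
```

For example, for (LGE, VR) the values are 1 + m(LG) = 171, 19 + m(VR) = 274 and 1 + m(LGE) = 300, so (C1) holds.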
Chummy pairs
• Lemma A – Suppose Ra and bQ are a
chummy pair, with u = m(Ra) and v = m(bQ). If (C1)
is true,
$\mathrm{score}(Ra, bQ) = \mathrm{score}(Ra, Q) + f(u, v)$
If (C2) is true,
$\mathrm{score}(Ra, bQ) = \mathrm{score}(R, bQ) + g(u, v)$
Chummy Pairs
• Lemma B – Let P be the optimal solution.
Then there is a chummy pair (R,Q) and a letter
a such that P=RaQ. Also, there is a series of chummy
pairs such that
$(\varepsilon, \varepsilon) \to (R_1, Q_1) \to \cdots \to (R_n, Q_n) = (R, Q)$