PEAKS: De Novo Sequencing using MS/MS spectra
Bin Ma, U. Western Ontario, Canada
Kaizhong Zhang, U. Western Ontario, Canada
Chengzhi Liang, Bioinformatics Solutions Inc., Canada

Outline
• Background – Tandem Mass Spectrometry
• De novo sequencing – Problem Definition and Algorithm
• Software implementation – PEAKS
• Future work

Background
• Humans have 100,000 different proteins. Because of post-translational modifications, each protein can exist in many different versions.
• Diseases are closely related to abnormal proteins or to abnormal expression levels of proteins.
• Identifying the proteins (and their modified versions) in a given tissue is a fundamental problem for drug design.

Proteins and Peptides
• A protein is a sequence of 20 different types of amino acids.
  – A protein is a string over an alphabet of size 20.
• A peptide is a substring of the protein.
• The 20 amino acids have 19 distinct masses.
  – I and L have the same mass and are difficult to distinguish by MS/MS.
  – Regard them as the same letter.

Tandem Mass Spectrometry
• MS/MS is the only reliable way for protein identification.
[Figure: workflow from tissue through fraction, gel, and protein (…VITK | GTDIMNEMR | SMW…) to peptide; the tandem mass spectrometer produces an MS/MS spectrum, from which a database or de novo sequencing yields the peptide sequence LGSSEVEQVQLVVDGVK]

How Does a Peptide Fragment?
For a peptide $A_1A_2A_3A_4$:
$m(b_1) = 1 + m(A_1)$
$m(b_2) = 1 + m(A_1) + m(A_2)$
$m(b_3) = 1 + m(A_1) + m(A_2) + m(A_3)$
$m(y_1) = 19 + m(A_4)$
$m(y_2) = 19 + m(A_4) + m(A_3)$
$m(y_3) = 19 + m(A_4) + m(A_3) + m(A_2)$

Matching Sequence with Spectrum

De Novo Sequencing
• For any peptide $P = a_1 \ldots a_n$, $m(P) = \sum_i m(a_i)$.
• De novo sequencing: given a spectrum and a mass value m, compute a sequence P such that m(P) = m and the matching score score(P) is maximized.

A Simpler Case – Only Y-ions

Y-ions Determined By a Suffix
[Figure: y-ion ladder starting at 19, with peaks y1, y2, y3]
• score(Q) can be defined for a suffix Q.
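The fragment mass rules from the "How Does a Peptide Fragment?" slide can be sketched in a few lines of Python. This is only an illustration: the function name `fragment_ions` and the residue mass table are assumptions added here (standard monoisotopic residue masses), and the offsets 1 and 19 are the rounded values used on the slide.

```python
# Monoisotopic residue masses in Da (assumed standard values, not from the slides).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}

# Offsets as rounded on the slide; more precise values would be
# about 1.00728 (b-ions) and 19.01784 (y-ions).
B_OFFSET = 1.0
Y_OFFSET = 19.0


def fragment_ions(peptide):
    """Return (b_masses, y_masses) for a peptide string.

    b_i is the offset plus the mass of the first i residues,
    y_i is the offset plus the mass of the last i residues,
    for i = 1 .. len(peptide) - 1, as on the slide.
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_masses, y_masses = [], []
    prefix = 0.0
    for m in masses[:-1]:            # prefixes of length 1 .. n-1
        prefix += m
        b_masses.append(B_OFFSET + prefix)
    suffix = 0.0
    for m in reversed(masses[1:]):   # suffixes of length 1 .. n-1
        suffix += m
        y_masses.append(Y_OFFSET + suffix)
    return b_masses, y_masses
```

Because I and L share one mass in the table, `fragment_ions("IVR")` and `fragment_ions("LVR")` return identical lists, which is why the slides treat the two residues as one letter.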
$$\mathrm{score}(LVR) = \mathrm{score}(VR) + f(u)$$
$$DP(u) = \max_{m(Q) = u} \mathrm{score}(Q)$$
$$DP(u) = \max_{a} DP(u - m(a)) + f(u)$$

Counting Both y and b ions

Strategies
• Consider a prefix R and a suffix Q simultaneously.
• Consider only those pairs (R,Q) that satisfy a nice property, which we call "chummy".
• Chummy pairs allow:
  – The score of a chummy pair can be computed recursively from a smaller chummy pair.
  – There is a series of chummy pairs that grows to the optimal solution.

Dynamic Programming
• Combining Lemmas A and B, we can compute
$$DP(u, v) = \max_{\substack{m(R) = u,\ m(Q) = v \\ (R,Q)\ \text{chummy}}} \mathrm{score}(R, Q)$$
• Suppose (R,Q) is the pair maximizing DP(u,v) under the condition m(R) + m(Q) + m(a) = m. Then RaQ is the optimal peptide.

PEAKS – The Software

Comparison of PEAKS and Lutefisk

m/z    | z | Correct sequence | PEAKS (de novo)    | Comments | Lutefisk (de novo)
MALDI MS/MS, BSA
927.4  | 1 | YLYEIAR          | YLYEIAR            | correct   | [276.14]EY[184.08]R
1439.7 | 1 | RHPEYAVSVLLR     | GVLMVDVPPADNGR     | wrong (?) | no results
1479.8 | 1 | LGEYGFQNALIVR    | LWYGFQNALIVR       | correct   | no results
1639.8 | 1 | KVPQVSTPTLVEVSR  | RAPKVPQVSTPTLVEVSR | correct   | no results
ESI MS/MS, Cyt c
482.7  | 2 | EDLIAYLK         | EDLIAYLK           | correct   | [357.15]LAYLK
584.8  | 2 | TGPNLHGLFGR      | TGPNLHGLFGR        | correct   | TGPNLHGLFGR
589.3  | 1 | GDVEK            | VDVEK              | V = Ac-G  | VDVEK
634.4  | 1 | IFVQK            | IFVQK              | correct   | IFVQK
678.3  | 1 | YIPGTK           | YIPGTK             | correct   | YIPGTK
728.8  | 2 | TGQAPGFSYTDANK   | TGQAPGFSYTDANK     | correct   | [199.10]SAPGF[250.09]TWNK
779.4  | 1 | MIFAGIK          | MIFAGIK            | correct   | [244.12]FAGLK
792.9  | 2 | KTGQAPGFSYTDAMK  | KTGAGAPGFSYTDAMK   | almost    | [229.15]QGAPGAYQNHANK
817.3  | 2 | IFVQKCAQCHTVEK   | QFVTHMACCHTVEK     | partial   | [257.08][218.08][GP][260.08][HM]TVEK
Apo-Myoglobin
662.3  | 1 | ASEDLK           | ASEDLK             | correct   | [244.07]SALK
689.9  | 2 | HGTVVLTALGGILK   | HGTVVLTALGGILK     | correct   | HGTVVLTALG[170.1]LK
748.4  | 1 | ALELFR           | ALELFR             | correct   | [184.12]ELFR
803.9  | 2 | VEADIAGHGQEVLIR  | LDADIAGHGQEVLIR    | almost    | no results
908.4  | 2 | GLSDGEWQQVLNVWGK | GLSDGEWQQVLNVWGK   | correct   | [170.11]SG[244.07]WQQVLNVWGK
943.2  | 2 | YLEFISDAIIHVLHSK | YLEFISDAIIHVLHSK   | correct   | [276.1]EFLSD[184.12]LHVLHSK

Red = Correct

Users

Implementation Particulars
• More accurate scoring:
  – sum of the logarithmic intensities
  – many other ion types
  – coexisting ions, e.g., x2, y2, z2
• Deconvolution
  – converting multiply-charged peaks to singly-charged ones
• Recalibration
  – compressing/stretching the spectrum to correct calibration error
• Noise reduction

Acknowledgement
• Bin Ma and Kaizhong Zhang were supported by NSERC.
• Chengzhi Liang was supported by BSI.
• Thanks to the development team at BSI for the software development.

Tandem Mass Spectrometer
[Figure: ions (MPSER+, PAK+, SG…+, …) enter a mass analyzer that selects precursor ions (PAK+); these fragment into fragment ions (P+ and AK, P and AK+, PA+ and K, PA and K+), which pass through a second mass analyzer to the detector]

Algorithm Sandwich
• DP(0,0) = 0; DP(u,v) = -infinity for (u,v) != (0,0);
• for u from 0 to m/2 do
    for v from u - max_a m(a) to u + max_a m(a) do
      for a in Σ do
        if u < v then
          DP(u + m(a), v) = max{ DP(u,v) + f(u,v), DP(u + m(a), v) }
        else
          DP(u, v + m(a)) = max{ DP(u,v) + g(u,v), DP(u, v + m(a)) }
• find u, v, a such that u + v + m(a) = m and DP(u,v) is maximized;
• backtracking;

Dynamic Programming
1. for u from 0 to m: $DP(u) = \max_{a} DP(u - m(a)) + f(u)$
2. backtracking

Dynamic Programming
$$DP(u, v) = \max_{\substack{m(R) = u,\ m(Q) = v \\ R\ \text{is a prefix},\ Q\ \text{is a suffix}}} \mathrm{score}(R, Q)$$
• We hope DP(u,v) for u + v = m gives the optimal prefix and suffix.
• The optimal solution can be obtained by concatenating the prefix and the suffix.

Chummy Pairs
• Two strings Ra and bQ are called a chummy pair iff either of the following is true:
$$1 + m(R) < 19 + m(bQ) \le 1 + m(Ra) \quad \text{(C1)}$$
$$19 + m(Q) \le 1 + m(Ra) < 19 + m(bQ) \quad \text{(C2)}$$
• Examples:
  – (LGE, LVR): (C2)
  – (LGE, VR): (C1)
  – (LGE, R): (C1)
  – (LG, VR) is not chummy

Chummy Pairs
• Lemma A – Suppose Ra and bQ are a chummy pair, with u = m(Ra) and v = m(bQ).
  – If (C1) is true, $\mathrm{score}(Ra, bQ) = \mathrm{score}(Ra, Q) + f(u, v)$.
  – If (C2) is true, $\mathrm{score}(Ra, bQ) = \mathrm{score}(R, bQ) + g(u, v)$.

Chummy Pairs
• Lemma B – Let P be the optimal solution. Then there is a chummy pair (R,Q) and a letter a such that P = RaQ. Also, there is a series of chummy pairs such that
$$(\varepsilon, \varepsilon) \to (R_1, Q_1) \to \cdots \to (R_n, Q_n) = (R, Q)$$
where ε denotes the empty string.
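The chummy-pair conditions can be checked mechanically, which makes the four examples on the definition slide easy to verify. The sketch below is an illustration only: the helper names, the nominal (integer) residue masses, and in particular the choice of strict versus non-strict inequalities in (C1) and (C2) are assumptions, since the slide text does not preserve the relation signs. The slide's examples do not hit the boundary cases, so they hold under either choice.

```python
# Nominal (integer) residue masses, assumed here for illustration.
NOMINAL_MASS = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "L": 113, "N": 114, "D": 115, "K": 128, "E": 129, "M": 131,
    "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}


def residue_sum(s):
    """m(s): total residue mass of a (possibly empty) string."""
    return sum(NOMINAL_MASS[c] for c in s)


def chummy_condition(prefix, suffix):
    """Classify the pair (Ra, bQ) = (prefix, suffix).

    Returns "C1", "C2", or None if the pair is not chummy.
    Assumed form of the conditions:
      (C1)  1 + m(R)  <  19 + m(bQ) <= 1 + m(Ra)
      (C2)  19 + m(Q) <= 1 + m(Ra)  <  19 + m(bQ)
    """
    if not prefix or not suffix:
        raise ValueError("both strings must be non-empty")
    b_short = 1 + residue_sum(prefix[:-1])   # 1 + m(R)
    b_full = 1 + residue_sum(prefix)         # 1 + m(Ra)
    y_short = 19 + residue_sum(suffix[1:])   # 19 + m(Q)
    y_full = 19 + residue_sum(suffix)        # 19 + m(bQ)
    if b_short < y_full <= b_full:
        return "C1"
    if y_short <= b_full < y_full:
        return "C2"
    return None
```

Read this way, (C1) says the most recently added ion is the y-ion of the suffix, and (C2) says it is the b-ion of the prefix, which is what lets Lemma A peel off one letter from the appropriate side.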