PEAKS: De Novo Sequencing using Tandem Mass Spectrometry Bin Ma Dept. of Computer Science University of Western Ontario Outline • Background • Sandwich algorithm for de novo sequencing • Software implementation – PEAKS Background • Diseases are closely related to the abnormal proteins. • Given a tissue, the identification of the proteins (and their posttranslational modifications) in it is a fundamental problem in proteomics. • MS/MS is the most common way for protein identification. Sample Preparation tissue fraction gel GTDIMR PAK MPSER …… …… Add trypsin peptides HPLC To MS/MS Tandem Mass Spectrometer QTOF ions MPSER + PAK + SG… + … ESI parent ions Quadrupole mass analyzer PAK + PAK + PAK + PAK + detector fragment ions collision P + AK P AK + PA + K PA K+ TOF mass analyzer + AK PA + + K P+ peptide sequence: LGSSEVEQVQLVVDGVK tandem mass spectrometry: MS/MS spectrum database de novo sequencing: LGSSEVEQVQLVVDGVK How Does a Peptide Fragment? m(b1)=1+m(A1) m(b2)=1+m(A1)+m(A2) m(b3)=1+m(A1)+m(A2)+m(A3) m(y1)=19+m(A4) m(y2)=19+m(A4)+m(A3) m(y3)=19+m(A4)+m(A3)+m(A2) Matching Sequence with Spectrum De Novo Sequencing • De Novo Sequencing (Dancik et al., JCB 6:327-342.) – Given a spectrum, a mass value M, compute a sequence P, s.t. m(P)=M, and the matching score is maximized. • We consider the matching score of P is the sum of the scores of the matched peaks. – We use intensity of a peak as its score to illustrate PEAKS’ algorithm. Spectrum Graph Approach • Convert the peak list to a graph. A peptide sequence corresponds to a path in the graph. • Bartels (1990), Biomed. Environ. Mass Spectrom 19:363-368. • Taylor and Johnson (1997). Rapid Comm. Mass Spec. 11:1067-1075. (Lutefisk) • Dancik et al. (1999), JCB 6:327-342. • Chen et al. (2001), JCB 8:325-337. • …… Warm up – Counting Only Y-ions The Score of a Suffix 19 y1 y2 y3 Let Q be a suffix of the peptide. It can determine some y-ions. score(Q) are the sum of scores of those y-ions of Q. DP(m) max score(Q) m(Q ) m 19 Recursive Computation of DP(m) Q’ a Suppose Q is such that DP(m)=score(Q). score (Q) h(m 19) score (Q' ) score(Q’)=DP(m(Q’)) DP (m) h(m 19) DP (m m(a)) Do not know a? DP (m) h(m 19) max DP (m m(a)) a Dynamic Programming 1. for m from 0 to M DP (m) h(m 19) max a DP (m a) 2. backtracking Counting Both y and b Ions Good News y1 y2 y3 bn-3 bn-2 bn-1 Bad News Ions Determined By a Pair P=LGEY Q=LLVR score(P,Q) is the sum of matched peak intensities. A peak can only count once. Chummy Pairs • Two strings P and Q are called chummy pairs, iff. either of the following two is true: 1 m( P ) 19 m(Q) 1 m( P) (C1) P LG, Q VR, P LGE (C2) 19 m(Q ) 1 m( P) 19 m(Q) Q VR, P LGE, Q LVR Recursive Computation of score(P,Q) P=LGEY Q=LLVR Q LVR u=m(P), v=m(Q) score ( P, Q) f (u , v) score ( P, Q ) Chummy pairs • Lemma 1 – Suppose P and Q are a chummy pair. u=m(P), v=m(Q). If (C1) is true, score( P,Q) score( P ,Q) g (u, v) If (C2) is true, score( P,Q) score( P,Q ) f (u, v) Chummy Pairs • Lemma 2 – Let (P,Q) be a chummy pair, a be a letter. – (C1) (P,aQ) is a chummy pair but (Pa,Q) is not. – (C2) (Pa,Q) is a chummy pair but (P,aQ) is not. • Lemma 3 – Let S be the optimal solution. Then there is a chummy pair (P,Q) and a letter a such that S=PaQ. Also, there is a chummy pair series such that ( , ) ( P1 , Q1 ) ( Pn , Qn ) ( P, Q) Dynamic Programming • Combining Lemma 1, 2, 3, we can compute DP (u, v) max m ( P ) u , m ( Q ) v score( P, Q) ( P ,Q ) chummy • Suppose (P,Q) is the pair maximizing DP(u,v) under the condition m(P)+m(Q)+m(a)=M. Then PaQ is the optimal peptide. Algorithm Sandwich • • DP(0,0) = 0; DP(u,v) = -infinity for (u,v)!=(0,0); for u from 1 to M/2 step d do for v from u-m(W) to u+m(W) step d do for a in Σ do if u<v then DP(u a, v) max DP(u, v) f (u, v), DP(u a, v) else DP(u, v a) max DP(u, v) g (u, v), DP(u, v a) • find u,v,a, s.t. u+v+m(a)=M and DP(u,v) maximized; • backtracking; Time: O( M m( W ) | | ) 2 d PEAKS – The Software Comparison • LCQ data (Iontrap instrument): – Generously provided by Dr. Richard Johnson. 144 spectra. • Micromass Q-Tof data: – Measured in UWO’s Protein ID lab. 61 spectra • Sciex Q-Star data: – Provided by U. Victoria’s Genome BC Proteomics Centre. 13 good/okay spectra. PEAKS v.s. Lutefisk • completely correct sequences: – 38/144 v.s. 15/144 • correct amino acids: – 1067/1702 v.s. 767/1702 v.s. • partially correct sequences with 5 or more contiguous correct amino acids: – 94/144 v.s. 64/144 PEAKS v.s. Micromass PLGS • completely correct sequences: – 13/61 v.s. 7/61 • correct amino acids: – 456/764 v.s. 232/764 • partially correct sequences with 5 or more contiguous correct amino acids: – 38/61 v.s. 24/61 PEAKS v.s. Sciex BioAnalyst • completely correct sequences: – 7/13 v.s. 1/13 • correct amino acids: – 115/150 v.s. 86/150 • partially correct sequences with 5 or more contiguous correct amino acids: – 12/61 v.s. 7/61 Users The company logos have been deleted from the original presentation. Please visit http://www.bioinformaticssolutions.com for a list of users. Other Techniques Used by PEAKS • Preprocess the MS/MS spectra – Deconvolution, noise reduction, and signal enhancement. – It does a better job than spectrometer vendor’s software. • Recalibration – compress/stretch the spectrum for calibration error • Positional Confidence – Estimate the confidence level of individual amino acids. Sophisticated Ion Matching Score • Score of one peak matching b ion log( h) e error tolerance 2 supp( b 17, b 18, a, c) PEAKS 2.x’s Additional Feature • Identify the proteins by matching the de novo (partial) sequences. • Then further match the spectra with the peptides of the proteins. Collaborators and References • Sandwich algorithm: – B. Ma, K. Zhang, C. Liang, CPM’03. (sandwich algorithm) • PEAKS: – B. Ma, K. Zhang, C. Hendrie, C. Liang, M. Li, A. Doherty-Kirby, G. Lajoie, Rapid Comm. Mass Spec. (software feature, score function, experiments) • Acknowledgement: – PEAKS development team. (Bioinformatics Solutions Inc.).