An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Protein Sequencing and Identification With Mass Spectrometry

Outline
• Tandem Mass Spectrometry
• De Novo Peptide Sequencing
• Spectrum Graph
• Protein Identification via Database Search
• Identifying Post-Translationally Modified Peptides
• Spectral Convolution
• Spectral Alignment

Amino Acids vs. Nucleic Acids
Amino Acids: Amine, Carboxylic Acid, R-group
Nucleic Acids: Sugar, Phosphate, Base

Protein Backbone
H...-HN-CH-CO-NH-CH-CO-NH-CH-CO-...OH
N-terminus on the left, C-terminus on the right; side chains R(i-1), R(i), R(i+1) belong to amino acid residues i-1, i, i+1.

Breaking of Protein Backbone
H+
H...-HN-CH-CO    NH-CH-CO-NH-CH-CO-...OH
N-terminal fragment: residue i-1 with side chain R(i-1). C-terminal fragment: residues i and i+1 with side chains R(i) and R(i+1).

Breaking Peptides into Fragment Ions
• Proteases, e.g. trypsin, break a protein into peptides.
• A tandem mass spectrometer further breaks the peptides down into fragment ions and measures the mass of each piece.
• The mass spectrometer electrically accelerates the fragmented ions; heavier ions accelerate more slowly than lighter ones.
• Mass spectrometers measure the mass/charge ratio of an ion.
Mass Spectrometry
Matrix-Assisted Laser Desorption/Ionization
From lectures by Vineet Bafna (UCSD)

Tandem Mass Spectrometry
[Figure: LC chromatogram (relative abundance vs. time in minutes); MS scan 1707 (RT 54.44, relative abundance vs. m/z, full MS range 300.00 - 2000.00) with the selected peak at m/z 638.0; MS/MS scan 1708 (RT 54.47, full ms2 of 638.00, range 165.00 - 1925.00). Instrument layout: ion source, MS-1, collision cell, MS-2.]

Using Tandem Mass Spectrometry
[Figure: Sequence, MS/MS instrument, MS/MS spectrum (scan 1708, relative abundance vs. m/z)]
Database search
• Sequest
De novo interpretation
• Sherenga

Tandem Mass Spectrum
• Tandem Mass Spectrometry (MS/MS): mainly generates partial N- and C-terminal
peptides
• Spectrum consists of different ion types because peptides can be broken in several places.
• Chemical noise often complicates the spectrum.
• Represented in 2-D: mass/charge axis vs. intensity axis

Tandem Mass Spectrum: An Example
[Figure: example spectrum; labels: secondary fragmentation, ionized parent peptide]

N- and C-terminal Peptides
[Figure: N-terminal peptides and C-terminal peptides]

Terminal peptides and ion types
Peptide mass (D): 57 + 97 + 147 + 114 = 415
Peptide mass (D) without H2O: 57 + 97 + 147 + 114 - 18 = 397

Peptide Fragmentation
[Figure: backbone of a four-residue peptide (side chains R1 to R4), H-N-C-C(=O)-N-...-COOH, with N-terminal fragment ions a2, b2, b2-H2O, a3, b3, b3-NH3 and C-terminal fragment ions y1, y2, y2-NH3, y3, y3-H2O marked at the cleavage sites]

De novo Peptide Sequencing
[Figure: MS/MS spectrum (scan 1708, relative abundance vs. m/z) leading to a sequence]

Theoretical Spectrum
[Figure: theoretical spectrum, continued over three slides]

Building Spectrum Graph
• How to create vertices (from peaks)
• How to create edges (from mass differences)
• How to score paths
• How to find best path

[Figure sequence: intensity vs. mass/charge (m/z) for the peptide SEQUENCE]
• b-ions: S, E, Q, U, E, N, C, E from left to right
• a-ions: the same ladder; a is an ion type shift in b
• y-ions: E, C, N, E, U, Q, E, S, the sequence read in reverse
• y-ions with corresponding intensities
• noise peaks
• the resulting MS/MS spectrum

Mass Differences Correspond to Amino Acids
[Figure: spectrum peaks with mass differences labeled by the letters of "sequence"]

Ion Types
• Some peaks correspond to fragment ions; others are just random noise.
• Knowing ion types Δ = {δ1, δ2, …, δk} lets us distinguish fragment ions from noise.
• We can learn ion types δi and their probabilities qi by analyzing a large test sample of annotated spectra.

Example of Ion Type
• Δ = {δ1, δ2, …, δk}
• Ion types: {b, b-NH3, b-H2O}
• Corresponding values of Δ = {0, 17, 18}
• *Note: In reality the δ value of ion type b is -1

Peptide Sequencing Problem
Goal: Find a peptide with maximal match between an experimental and theoretical spectrum.
Input:
• S: experimental spectrum
• Δ: set of possible ion types
• m: parent mass
Output:
• P: peptide with mass m whose theoretical spectrum best matches the experimental spectrum S

Vertices
• Masses of potential N-terminal peptides
• Vertices are generated by reverse shift
• Every peak s in a spectrum generates a set of vertices:
• V(s) = {s+δ1, s+δ2, …, s+δk}

Vertices (cont'd)
• Vertices of the spectrum graph:
• {vinit} ∪ V(s1) ∪ V(s2) ∪ ... ∪ V(sm) ∪ {vfin}
• Where Δ = {δ1, δ2, …, δk} are ion types.

Reverse Shifts
[Figure: intensity vs. mass/charge (m/z). Red: mass spectrum with peaks b-H2O and b. Blue: the same spectrum shifted by +H2O, giving b-H2O+H2O and b+H2O; the shifted b-H2O peak coincides with b.]
• Two peaks, b-H2O and b, are given by the mass spectrum.
• With a +H2O shift, if two peaks coincide, that is a possible vertex.

Example of Reverse Shift
[Figure: shift in H2O; shift in H2O and NH3]

Edges
• Two vertices with mass difference corresponding to an amino acid A:
• Connect with an edge labeled by A
• Gap edges for di- and tri-peptides

Paths
• A path in the graph corresponds to an amino acid sequence.
• There are many paths; how do we find the correct one?
• We need scoring to evaluate paths

Path Score
• p(P,S) = probability that peptide P produces spectrum S = {s1, s2, …, sq}
• p(P,s) = the probability that peptide P generates a peak s
• Scoring = computing probabilities
\[ p(P,S) = \prod_{s \in S} p(P,s) \]

Peak Score
• For a position t that represents ion type δj:
\[ p(P, s_t) = \begin{cases} q_j, & \text{if a peak is generated at } t \\ 1 - q_j, & \text{otherwise} \end{cases} \]

Peak Score (cont'd)
• For a position t that is not associated with an ion type:
\[ p_R(P, s_t) = \begin{cases} q_R, & \text{if a peak is generated at } t \\ 1 - q_R, & \text{otherwise} \end{cases} \]
• qR = the probability of a noisy peak that does not correspond to any ion type

Finding Optimal Paths in the Spectrum Graph
• For a given MS/MS spectrum S, find a peptide P' maximizing p(P,S) over all possible peptides P:
\[ p(P', S) = \max_{P} p(P, S) \]
• Peptides = paths in the spectrum graph
• P' = the optimal path in the spectrum graph

Ions and Probabilities
• Tandem mass spectrometry is characterized by a set of ion types {δ1, δ2, ..., δk} and their probabilities {q1, ..., qk}
• δi-ions of a partial peptide are produced independently with probabilities qi

Ions and Probabilities
• A peptide has all k peaks with probability \( \prod_{i=1}^{k} q_i \)
• and no peaks with probability \( \prod_{i=1}^{k} (1 - q_i) \)
• A peptide also produces "random noise" with uniform probability qR in any position.

Ratio Test Scoring for Partial Peptides
• Incorporates premiums for observed ions and penalties for missing ions.
• Example: for k=4, assume that for a partial peptide P' we only see ions δ1, δ2, δ4.
The score is calculated as:
\[ \frac{q_1}{q_R} \cdot \frac{q_2}{q_R} \cdot \frac{1 - q_3}{1 - q_R} \cdot \frac{q_4}{q_R} \]

Scoring Peptides
• T: set of all positions.
• Ti = {t_δ1, t_δ2, ..., t_δk}: set of positions that represent ions of partial peptide Pi.
• A peak at position t_δj is generated with probability qj.
• R = T - ∪ Ti: set of positions that are not associated with any partial peptides (noise).

Probabilistic Model
• For a position t_δj ∈ Ti, the probability p(t, P, S) that peptide P produces a peak at position t:
\[ p(t, P, S) = \begin{cases} q_j, & \text{if a peak is generated at position } t_{\delta_j} \\ 1 - q_j, & \text{otherwise} \end{cases} \]
• Similarly, for t ∈ R, the probability that P produces a random noise peak at t is:
\[ p_R(t) = \begin{cases} q_R, & \text{if a peak is generated at position } t \\ 1 - q_R, & \text{otherwise} \end{cases} \]

Probabilistic Score
• For a peptide P with n amino acids, the score for the whole peptide is expressed by the following ratio test:
\[ \frac{p(P, S)}{p_R(S)} = \prod_{i=1}^{n} \prod_{j=1}^{k} \frac{p(t_{i\delta_j}, P, S)}{p_R(t_{i\delta_j})} \]

Role of de novo Interpretation
• Interpreting MS/MS of novel peptides
• Automatic validation of MS/MS database matches
• Leveraging homology matching across species

Post-Translational Modifications
Proteins are involved in cellular signaling and metabolic regulation.
They are subject to a large number of biological modifications.
Almost all protein sequences are post-translationally modified, and 200 types of modifications of amino acid residues are known.
Examples of Post-Translational Modification
[Figure]

Difficulties in Finding Post-Translational Modifications
Currently, post-translational modifications cannot be inferred from DNA sequences.
Finding post-translational modifications remains an open problem even after the completion of the human genome.
Post-translational modifications increase the number of "letters" in the amino acid alphabet and lead to a combinatorial explosion.

Sequencing of Modified Peptides
De novo peptide sequencing is invaluable for identification of unknown proteins.
However, de novo algorithms are designed for high-quality spectra with good fragmentation and without modifications.
Another approach is to compare a spectrum against a set of known spectra in a database.

Functional Proteomics
• Problem: Given a large collection of uninterpreted spectra, find out which spectra correspond to similar peptides.
• A method that cross-correlates related spectra (e.g., from normal and diseased individuals) would be valuable in functional proteomics.

Protein Identification Problem
• Input: A database of proteins, an experimental spectrum S, a set of ion types Δ, and a parent mass m.
• Output: A peptide of mass m from the database with the best match to spectrum S.

MS/MS Database Search
Database search in mass spectrometry has been very successful in identifying already known proteins.
An experimental spectrum can be compared with the theoretical spectra of database peptides to find the best fit: SEQUEST (Yates et al., 1995).
But reliable algorithms for identification of modified peptides are not yet known.
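The Protein Identification Problem above can be illustrated with a minimal sketch. It is not SEQUEST's scoring. It assumes integer residue masses, a parent mass equal to the sum of the residue masses, a theoretical spectrum made only of b-ions (prefix mass + 1) and y-ions (suffix mass + 19), and the number of shared peaks as the match score. The names `MASS`, `theoretical_spectrum`, and `identify`, and the two database sequences, are hypothetical.

```python
# Integer residue masses (Da) of the 20 standard amino acids.
MASS = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "L": 113, "I": 113, "N": 114, "D": 115, "K": 128, "Q": 128,
    "E": 129, "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}


def theoretical_spectrum(peptide):
    """Assumed ion model: b-ions (prefix + 1) and y-ions (suffix + 19)."""
    masses = [MASS[aa] for aa in peptide]
    total = sum(masses)
    peaks = set()
    prefix = 0
    for m in masses[:-1]:               # every internal cleavage site
        prefix += m
        peaks.add(prefix + 1)           # b-ion of the N-terminal part
        peaks.add(total - prefix + 19)  # y-ion of the C-terminal part
    return peaks


def identify(database, spectrum, parent_mass):
    """Best-matching substring of mass parent_mass, as (peptide, score)."""
    spectrum = set(spectrum)
    best = (None, -1)
    for protein in database:
        for start in range(len(protein)):
            mass = 0
            for end in range(start, len(protein)):
                mass += MASS[protein[end]]
                if mass > parent_mass:   # masses are positive: stop early
                    break
                if mass == parent_mass:
                    peptide = protein[start:end + 1]
                    score = len(theoretical_spectrum(peptide) & spectrum)
                    if score > best[1]:
                        best = (peptide, score)
    return best


database = ["MKPRTEINGA", "GAPRTEYNSW"]  # made-up protein sequences
observed = {98, 133, 246, 254, 355, 375, 476, 484, 597, 632}
print(identify(database, observed, 710))
```

Filtering candidates by parent mass first keeps the comparison to a handful of peptides per protein; a real search engine would also tolerate mass error and use a richer score.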
Search for Modified Peptides: Virtual Database Approach
Yates et al., 1995: an exhaustive search in a virtual database of all modified peptides.
Exhaustive search leads to a large combinatorial problem, even for a small set of modification types.
Problem (Yates et al., 1995): Extend the virtual database approach to a large set of modifications.

Modified Peptide Identification Problem
Input:
• Experimental spectrum S
• Database of peptides
• Parameter k (number of mutations/modifications)
• A set of ion types Δ
• Parent mass m
Output:
• A peptide with the best match to the spectrum S that is at most k mutations/modifications apart from a database peptide.

Peptide Identification Problem: Challenge
Very similar peptides may have very different spectra!
Goal: Define a notion of spectral similarity that correlates well with sequence similarity.
If peptides are a few mutations/modifications apart, the spectral similarity between their spectra should be high.

Deficiency of the Shared Peaks Count
Shared peaks count (SPC): an intuitive measure of spectral similarity.
Problem: SPC diminishes very quickly as the number of mutations increases.
Only a small portion of the correlations between the spectra of mutated peptides is captured by SPC.
SPC Diminishes Quickly
no mutations, SPC=10: S(PRTEIN) = {98, 133, 246, 254, 355, 375, 476, 484, 597, 632}
1 mutation, SPC=5: S(PRTEYN) = {98, 133, 254, 296, 355, 425, 484, 526, 647, 682}
2 mutations, SPC=2: S(PGTEYN) = {98, 133, 155, 256, 296, 385, 425, 526, 548, 583}

Spectral Convolution
\[ S_2 \ominus S_1 = \{ s_2 - s_1 : s_1 \in S_1,\ s_2 \in S_2 \} \]
Number of pairs \( s_1 \in S_1, s_2 \in S_2 \) with \( s_2 - s_1 = x \): \( (S_2 \ominus S_1)(x) \)
The shared peaks count (SPC peak): \( (S_2 \ominus S_1)(0) \)

Elements of S2 ⊖ S1 represented as elements of a difference matrix.
[Figure: difference matrix]
The elements with multiplicity >2 are colored; the elements with multiplicity =2 are circled. The SPC takes into account only the red entries.

Spectral Convolution: An Example
[Figure: spectral convolution plotted against x, for x from -150 to 150, with multiplicities from 0 to 5]

Spectral Comparison: Difficult Case
S = {10, 20, 30, 40, 50, 60, 70, 80, 90, 100}
Which of the spectra
S' = {10, 20, 30, 40, 50, 55, 65, 75, 85, 95}
or
S" = {10, 15, 30, 35, 50, 55, 70, 75, 90, 95}
fits the spectrum S the best?
SPC: both S' and S" have 5 peaks in common with S.
Spectral Convolution: reveals the peaks at 0 and 5.

Spectral Comparison: Difficult Case
[Figure: S compared with S'; S compared with S"]

Limitations of the Spectral Convolution
Spectral convolution does not reveal that spectra S and S' are similar, while spectra S and S" are not.
Clumps of shared peaks: the matching positions in S' come in clumps, while the matching positions in S" don't.
This important property is not captured by spectral convolution and was overlooked in previous MS/MS algorithms.
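The shared peaks count and the spectral convolution from the slides above can be sketched in a few lines, using the spectra given on those slides. This is a minimal illustration that assumes each spectrum is a list of distinct integer peak positions; the function names are hypothetical.

```python
from collections import Counter


def spectral_convolution(s1, s2):
    """Multiplicity of each difference x = b - a, with a in s1 and b in s2."""
    return Counter(b - a for a in s1 for b in s2)


def shared_peaks_count(s1, s2):
    """The SPC is the value of the convolution at x = 0."""
    return spectral_convolution(s1, s2)[0]


prtein = [98, 133, 246, 254, 355, 375, 476, 484, 597, 632]
prteyn = [98, 133, 254, 296, 355, 425, 484, 526, 647, 682]
pgteyn = [98, 133, 155, 256, 296, 385, 425, 526, 548, 583]
for other in (prtein, prteyn, pgteyn):
    print(shared_peaks_count(prtein, other))

s = list(range(10, 101, 10))
s_prime = [10, 20, 30, 40, 50, 55, 65, 75, 85, 95]
s_second = [10, 15, 30, 35, 50, 55, 70, 75, 90, 95]
for other in (s_prime, s_second):
    conv = spectral_convolution(s, other)
    print(conv[0], conv[-5])  # multiplicity at offsets 0 and -5
```

For both S' and S" the convolution has the same multiplicities at the two offsets printed here, which is the limitation the last slide describes: the counts alone do not show whether the matches come in clumps.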
Edit Distance Between Spectra
A = {a1 < … < an}: an ordered set of natural numbers.
A shift Δi transforms {a1, …, an} into {a1, …, ai-1, ai+Δi, …, an+Δi}.
e.g.
10 20 30 40 50 60 70 80 90
10 20 30 35 45 55 65 75 85 (shift of -5 starting at the 4th element)
10 20 30 35 45 55 62 72 82 (shift of -3 starting at the 7th element)

Spectral Alignment Problem
• Find a series of k shifts that make the sets A = {a1, …, an} and B = {b1, …, bm} as similar as possible.
• k-similarity between sets
• D(k): the maximum number of elements in common between sets after k shifts.

Spectral Alignment vs. Sequence Alignment
• Manhattan-like graph with different alphabet and scoring.
• Axes in the graph correspond to peaks in the two spectra.
• In this case, the score is 1 if the diagonal line goes through a peak on both axes, 0 otherwise.
• Movement can be diagonal or perpendicular (but only k times total).

Spectral Alignment = Sequence Alignment in 0-1 Alphabet
• Convert the spectrum to a string in which each index is 1 if it corresponds to a peak and 0 otherwise.

Spectral Product
A = {a1, …, an} and B = {b1, …, bm}
Spectral product A⊗B: two-dimensional matrix with nm 1s corresponding to all pairs of indices (ai, bj) and remaining elements being 0s.
SPC: the number of 1s on the main diagonal.
δ-shifted SPC: the number of 1s on the diagonal (i, i+δ)
[Figure: spectral product matrix with 1s at the pairs (ai, bj); the main diagonal and a diagonal shifted by δ are marked]

Spectral Alignment: k-similarity
k-similarity between spectra: the maximum number of 1s on a path through this graph that uses at most k+1 diagonals.
k-optimal spectral alignment = a path.
The spectral alignment allows one to detect more and more subtle similarities between spectra by increasing k.

Use of k-Similarity
SPC reveals only D(0)=3 matching peaks.
Spectral Alignment reveals more hidden similarities between spectra: D(1)=5 and D(2)=8, and detects the corresponding mutations.
[Figure: black lines represent the paths for k=0; red lines represent the paths for k=1; the blue line in Fig. (b) represents the path for k=2]

Spectral Convolution Limitation
The spectral convolution considers diagonals separately without combining them into feasible mutation scenarios.
[Figure: two spectral products of {10, 20, 30, 40, 50, 60, 70, 80, 90, 100}. Left: with {10, 20, 30, 40, 50, 55, 65, 75, 85, 95}, one shift δ, D(1) = 10, shift function score = 10. Right: with {10, 15, 30, 35, 50, 55, 70, 75, 90, 95}, one shift δ, D(1) = 6.]

Dynamic Programming for Spectral Alignment
Dij(k): the maximum number of 1s on a path to (ai, bj) that uses at most k+1 diagonals.
\[ D_{ij}(k) = \max_{(i',j') < (i,j)} \begin{cases} D_{i'j'}(k) + 1, & \text{if } (i',j') \sim (i,j) \\ D_{i'j'}(k-1) + 1, & \text{otherwise} \end{cases} \]
\[ D(k) = \max_{ij} D_{ij}(k) \]
Running time: O(n^4 k)

Edit Graph for Fast Spectral Alignment
[Figure: edit graph]

Fast Spectral Alignment Algorithm
\[ M_{ij}(k) = \max_{(i',j') \le (i,j)} D_{i'j'}(k) \]
\[ D_{ij}(k) = \max \begin{cases} D_{\mathrm{diag}(i,j)}(k) + 1 \\ M_{i-1,j-1}(k-1) + 1 \end{cases} \]
\[ M_{ij}(k) = \max \begin{cases} D_{ij}(k) \\ M_{i-1,j}(k) \\ M_{i,j-1}(k) \end{cases} \]
Running time: O(n^2 k)

Spectral Alignment: Complications
• Simultaneous analysis of N- and C-terminal ions
• Taking into account the intensities and charges
• Analysis of minor ions
• Much more complicated!

Spectral Alignment: Complications
Spectra are combinations of an increasing (N-terminal ions) and a decreasing (C-terminal ions) number series.
These series form two diagonals in the spectral product, the main diagonal and the perpendicular diagonal.
The described algorithm deals with the main diagonal only.
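The first recurrence for Dij(k) can be written down directly as a sketch. This is the O(n^4 k) version, not the fast algorithm. It assumes non-empty spectra of distinct integer peaks, takes (i', j') ~ (i, j) to mean that the two pairs lie on the same diagonal, and lets a path start at any pair, so every entry starts at 1. The name `spectral_alignment` is hypothetical.

```python
def spectral_alignment(a, b, k):
    """D(k): most matched peak pairs on a path using at most k shifts."""
    a, b = sorted(set(a)), sorted(set(b))
    n, m = len(a), len(b)
    # d[s][i][j] = D_ij(s): best path ending at the pair (a[i], b[j]).
    # Assumption: a path may start at any pair, so each entry starts at 1.
    d = [[[1] * m for _ in range(n)] for _ in range(k + 1)]
    for s in range(k + 1):
        for i in range(n):
            for j in range(m):
                for ip in range(i):          # predecessors with a' < a
                    for jp in range(j):      # and b' < b
                        if a[i] - a[ip] == b[j] - b[jp]:
                            cand = d[s][ip][jp] + 1      # same diagonal
                        elif s > 0:
                            cand = d[s - 1][ip][jp] + 1  # spend one shift
                        else:
                            continue
                        if cand > d[s][i][j]:
                            d[s][i][j] = cand
    return max(max(row) for row in d[k])


s = list(range(10, 101, 10))
s_prime = [10, 20, 30, 40, 50, 55, 65, 75, 85, 95]
s_second = [10, 15, 30, 35, 50, 55, 70, 75, 90, 95]
print(spectral_alignment(s, s_prime, 1), spectral_alignment(s, s_second, 1))
```

On the two spectra from the Spectral Convolution Limitation slide this gives D(1) = 10 and D(1) = 6, the values shown there. The fast algorithm reaches the same table in O(n^2 k) by keeping the running maxima Mij(k) instead of rescanning all predecessors.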