Document

advertisement
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Protein Sequencing and
Identification by Mass
Spectrometry
1
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Outline
• Introduction to Protein Structure
• Peptide Mass Spectra
• De Novo Peptide Sequencing
• Spectrum Graphs
• Protein Identification via Database Search
• Spectral Convolution
• Spectral Alignment Problem
TK modified for WMU CS 6030
2
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Why are proteins and their
sequences important?
• Necessary in order to understand how cells and their
biochemical pathways function.
• A unique set of proteins is involved in each function
• Each organism has a unique set of expressed proteins
• Each type of cell / tissue has a unique set of expressed
proteins
• A key to understanding how the brain or liver works is to
determine what proteins they produce
TK Added for WMU CS 6030
3
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Basic Summary of Protein
Structure
• A polypeptide is a polymer of amino acid
residues, a linear chain based on amino
acids
• A polypeptide can be completely specified by
indicating the sequence of amino acids
• A protein is a collection of one more peptides
bonded together
4
An Introduction to Bioinformatics Algorithms
The Amino Acids
5
www.bioalgorithms.info
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Amino Acids – the basic building block
6
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Formation of polypeptides
• Two amino acids can combine to form a
dipeptide
• Structure in blue is a peptide link
• If many amino acids are joined, we call it a
polypeptide
7
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Ends of the polypeptide chain
• Note that a polypeptide is a chain of amino
acid residues (water molecule is lost).
• End of chain with –NH2 is the N-terminal
• End of chain with –COOH is the C-terminal
8
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Polypeptides -> Proteins
9
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Proteins can be complex structures
10
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
De Novo Sequencing Strategy
• Break proteins into peptide chain fragments
• Use enzymes to cut into pieces
• Further break apart peptide chains using mass
spectrometry
• Peptide chains will break up in a manner predictable by
molecular physics
• The masses of the molecular fragments will collectively
deliver a “mass spectrum” from which the sequence
can be derived
• First, we’ll explore the ways that mass spectrometry can
break peptides apart
TK added for WMU CS 6030
11
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Masses of Amino Acid Residues
12
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
The Periodic Table (part of it!)
13
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Protein Backbone
H...-HN-CH-CO-NH-CH-CO-NH-CH-CO-…OH
N-terminus
Ri-1
AA residuei-1
Ri
AA residuei
Ri+1
C-terminus
AA residuei+1
14
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Peptide Fragmentation
Collision Induced Dissociation
H+
H...-HN-CH-CO
Ri-1
Prefix Fragment
. . . NH-CH-CO-NH-CH-CO-…OH
Ri
Ri+1
Suffix Fragment
• Peptides tend to fragment along the backbone.
• Fragments can also loose neutral chemical groups
like NH3 and H2O.
15
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Breaking Protein into Peptides and
Peptides into Fragment Ions
• Proteases, e.g. trypsin, break protein into
peptides.
• A Tandem Mass Spectrometer further breaks
the peptides down into fragment ions and
measures the mass of each piece.
• Mass Spectrometer accelerates the fragmented
ions; heavier ions accelerate slower than lighter
ones.
• Mass Spectrometer measure mass/charge
ratio of an ion.
16
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
N- and C-terminal Peptides
17
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Terminal peptides and ion types
Peptide
Mass (D)
Peptide
Mass (D)
57 + 97 + 147 + 114 = 415
without
57 + 97 + 147 + 114 – 18 = 397
18
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
N- and C-terminal Peptides
486
71
415
185
301
154
332
57
429
19
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
N- and C-terminal Peptides
486
71
415
185
301
154
332
57
429
20
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Peptide Fragmentation
b2-H2O
a2
b3- NH3
b2
a3
b3
HO
NH3+
|
|
R1 O
R2 O
R3 O
R4
|
||
|
||
|
||
|
H -- N --- C --- C --- N --- C --- C --- N --- C --- C --- N --- C -- COOH
|
|
|
|
|
|
|
H
H
H
H
H
H
H
y3
y2
y3 -H2O
y1
y2 - NH3
21
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Mass Spectra
57 Da =K‘G’
D
D
V
99 Da = ‘V’
L
L
H2O
G
D
K
V
G
mass
0
• The peaks in the mass spectrum:
• Prefix and Suffix Fragments.
• Fragments with neutral losses (-H2O, -NH3)
• Noise and missing peaks.
22
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Protein Identification with MS/MS
G
V
D
K
Peptide
Identification:
Intensity
MS/MS
L
mass
00
23
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Tandem Mass-Spectrometry
24
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Breaking Proteins into Peptides
MPSERGTDIMRPAKID......
protein
GTDIMR
PAKID
MPSER
……
……
HPLC
To
MS/MS
peptides
25
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Tandem Mass Spectrometry
S#: 1707 RT: 54.44 AV: 1 NL: 2.41E7
F: + c Full ms [ 300.00 - 2000.00]
RT: 0.01 - 80.02
100
90
80
1409
LC
NL:
1.52E8
1991
2149
1615 1621
1411
2147
1387
60
1593
1995
1655
1435
50
1987
1445
1661
40
2001 2177
1937
1779
30
2155
2205
2135
2017
1095
80
75
70
65
60
55
801.0
50
45
40
35
Scan 1707
638.9
25
2207
1105
85
30
1307 1313
20
MS
90
Relative Abundance
70
95
Base Peak F: +
c Full ms [
300.00 2000.00]
1611
Relative Abundance
638.0
100
1389
1173.8
20
2329
872.3
687.6
10
2331
10
1275.3
15
1707
944.7
783.3
1048.3
1212.0
1413.9
1617.7
1400
1600
1742.1
1884.5
5
0
200
400
600
800
1000
m/z
0
5
10
15
20
25
30
35
40 45
Time (min)
50
55
60
65
70
75
1200
1800
2000
80
S#: 1708 RT: 54.47 AV: 1 NL: 5.27E6
T: + c d Full ms2 638.00 [ 165.00 - 1925.00]
850.3
100
95
687.3
90
85
Ion
Source
588.1
80
75
70
MS/MS
65
Relative Abundance
collision
MS-2
MS-1
cell
60
55
851.4
425.0
50
45
949.4
40
326.0
35
524.9
30
25
20
589.2
226.9
1048.6
1049.6
397.1
489.1
15
10
629.0
5
0
200
400
600
800
1000
m/z
1200
Scan 1708
1400
1600
1800
2000
26
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Tandem Mass Spectrum
• Tandem Mass Spectrometry (MS/MS): mainly
generates partial N- and C-terminal peptides
• Spectrum consists of different ion types
because peptides can be broken in several
places.
• Chemical noise often complicates the
spectrum.
• Represented in 2-D: mass/charge axis vs.
intensity axis
27
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
De Novo vs. Database Search
S#: 1708 RT: 54.47 AV: 1 NL: 5.27E6
T: + c d Full m s 2 638.00 [ 165.00 - 1925.00]
850.3
100
95
85
588.1
80
De Novo
75
70
65
Relative Abundance
Database
Search
687.3
90
60
55
851.4
425.0
50
45
949.4
40
326.0
35
524.9
30
25
20
589.2
226.9
1048.6
1049.6
397.1
489.1
15
10
629.0
5
0
200
400
600
800
1000
m /z
1200
1400
1600
1800
2000
Mass, Score
Database of
known peptides
MDERHILNM, KLQWVCSDL,
PTYWASDL, ENQIKRSACVM,
TLACHGGEM, NGALPQWRT,
HLLERTKMNVV, GGPASSDA,
GGLITGMQSD, MQPLMNWE,
ALKIIMNVRT,
ALKIIMNVRT,AVGELTK
AVGELTK, ,
HEWAILF, GHNLWAMNAC,
GVFGSVLRA, EKLNKAATYIN..
n
Database of allWpeptides =
R 20
V
A
L
A
L
AAAAAAAA,AAAAAAAC,AAAAAAAD,AAAAAAAE,
T
G
G
AAAAAAAG,AAAAAAAF,AAAAAAAH,AAAAAAI,
E
C
L
P
K
K
W
AVGELTI, AVGELTK
, AVGELTL, AVGELTM,
D
T
YYYYYYYS,YYYYYYYT,YYYYYYYV,YYYYYYYY
AVGELTK
28
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
De Novo vs. Database Search: A Paradox
• The database of all peptides is huge ≈ O(20n) .
• The database of all known peptides is much smaller ≈
O(108).
• However, de novo algorithms can be much faster, even
though their search space is much larger!
• A database search scans all peptides in the database of
all known peptides search space to find best one.
• De novo eliminates the need to scan database of all
peptides by modeling the problem as a graph search.
29
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
De novo Peptide Sequencing
S#: 1708 RT: 54.47 AV: 1 NL: 5.27E6
T: + c d Full ms2 638.00 [ 165.00 - 1925.00]
850.3
100
95
687.3
90
85
588.1
80
75
70
Relative Abundance
65
60
55
851.4
425.0
50
45
949.4
40
326.0
35
524.9
30
25
20
589.2
226.9
1048.6
1049.6
397.1
489.1
15
10
629.0
5
0
200
400
600
800
1000
m/z
1200
1400
1600
1800
2000
Sequence
30
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Theoretical Spectrum
31
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Theoretical Spectrum (cont’d)
32
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Theoretical Spectrum (cont’d)
33
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Two approaches to determining sequence
from the mass spectra
• Exhaustive Search
• Involves analysis of 20L sequences of length L
• Branch and bound advisable, but success has
been limited
• Analysis of Spectrum Graph
• Based on experimental spectrum rather than all
possible matches to possible spectra
TK added slide for WMU CS 6030
34
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Building Spectrum Graph
• How to create vertices (from masses)
• How to create edges (from mass differences)
• How to score paths
• How to find best path
35
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Ion Types
• Some masses correspond to fragment
ions, others are just random noise
• Knowing ion types Δ={δ1, δ2,…, δk} lets us
distinguish fragment ions from noise
• We can learn ion types δi and their
probabilities qi by analyzing a large test
sample of annotated spectra.
36
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Example of Ion Type
• Δ={δ1, δ2,…, δk}
• Ion types
{b, b-NH3, b-H2O}
correspond to
Δ={0, 17, 18}
• This is a set of masses of chemical groups frequently cleaved from
peptide fragments during bombardment of molecules with electrons
in the mass spectrometer collision cell
*Note: In reality the δ value of ion type b is -1 but we will “hide” it for the sake of simplicity
TK modified slide for WMU CS 6030
37
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Theoretical Spectra
• The theoretical spectra of peptide P is
designated as T(P)
• T(P) can be calculated by subtracting all
possible ion types Δ={δ1, δ2,…, δk} from all
possible partial peptides of P
• Every partial peptide will have k masses in
T(P)
TK added slide for WMU CS 6030
38
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Match between Spectra and the
Shared Peak Count
• The match between two spectra is the number of
masses (peaks) they share (Shared Peak Count or
SPC)
• In practice mass-spectrometrists use the weighted SPC
that reflects intensities of the peaks
• Match between experimental (S) and theoretical spectra
(T) is defined similarly
39
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Peptide Sequencing Problem
Goal: Find a peptide with maximal match between
an experimental and theoretical spectrum.
Input:
• S: experimental spectrum
• Δ: set of possible ion types
• m: parent mass
Output:
• P: peptide with mass m, whose theoretical
spectrum T matches the experimental S
spectrum the best
40
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Vertices of Spectrum Graph
• Masses of potential N-terminal peptides
• Vertices are generated by reverse shifts corresponding to ion types
Δ={δ1, δ2,…, δk}
• Every N-terminal peptide can generate up to k ions
m-δ1, m-δ2, …, m-δk
• Every mass s in an MS/MS spectrum generates k vertices
V(s) = {s+δ1, s+δ2, …, s+δk}
corresponding to potential N-terminal peptides
• Vertices of the spectrum graph:
{initial vertex}V(s1) V(s2) ... V(sm) {terminal vertex}
41
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Edges of Spectrum Graph
• Two vertices with mass difference
corresponding to an amino acid A:
• Connect with an edge labeled by A
42
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Peptide Sequences
• A possible peptide sequence is identified by
finding a path from mass 0 to parent mass m
in the resulting DAG
• Many possible sequences (paths) will
typically be identified in the DAG
• The “best path” is identified by scoring using
probabilities that are based on experimental
data
TK added slide for WMU CS 6030
43
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Path Score
• p(P,S) = probability that peptide P produces
spectrum S= {s1,s2,…sq}
• p(P, s) = the probability that peptide P
generates a peak (mass) s
• Scoring = computing probabilities
• p(P,S) = πsєS p(P, s)
44
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Peak Score
• For a position t that represents ion type dj :
qj, if peak is generated at t
p(P,st) =
1-qj , otherwise
45
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Peak Score (cont’d)
• For a position t that is not associated with an
ion type:
qR , if peak is generated at t
pR(P,st) =
1-qR , otherwise
• qR = the probability of a noisy peak that does
not correspond to any ion type
46
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Finding Optimal Paths in the Spectrum Graph
• For a given MS/MS spectrum S, find a
peptide P’ maximizing p(P,S) over all
possible peptides P:
p(P',S)  max P p(P,S)
• Peptides = paths in the spectrum graph
• P’ = the optimal path in the spectrum graph
47
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Ions and Probabilities
• Tandem mass spectrometry is characterized
by a set of ion types {δ1,δ2,..,δk} and their
probabilities {q1,...,qk}
• δi-ions of a partial peptide are produced
independently with probabilities qi
48
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Ions and Probabilities
• A peptide has all k peaks with probability
k
q
i
i 1
k
• and no peaks with probability  (1  qi )
i 1
• A peptide also produces a ``random noise''
with uniform probability qR in any position.
49
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Ratio Test Scoring for Partial Peptides
• Incorporates premiums for observed ions
and penalties for missing ions.
• Example: for k=4, assume that for a partial
peptide P’ we only see ions δ1,δ2,δ4.
The score is calculated as:
q1 q2 (1  q3 ) q4
 

qR qR (1  qR ) qR
50
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Scoring Peptides
• T- set of all positions.
• Ti={t δ1,, t δ2,..., ,t δk,}- set of positions that
represent ions of partial peptides Pi.
• A peak at position tδj is generated with
probability qj.
• R=T- U Ti - set of positions that are not
associated with any partial peptides (noise).
51
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Probabilistic Model
• For a position t δj  Ti the probability p(t, P,S) that
peptide P produces a peak at position t.
 qj
P(t , P, S )  
1  q j
if a peak is generated at position t
j
otherwise
• Similarly, for tR, the probability that P produces a
random noise peak at t is:
 qR
PR (t )  
1  qR
if a peak is generated at position t
otherwise
52
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Probabilistic Score
• For a peptide P with n amino acids, the score
for the whole peptides is expressed by the
following ratio test:
n
k p (t
p ( P, S )
i j , P , S )
 
pR ( S )
pR (ti j )
i 1 j 1
53
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Summary of Steps for Peptide Sequencing
using a Spectrum Graph (8.12)
• Graph experimental spectrum S = {s1,…,sq}
• Label directed edges in graph that
correspond to the mass of a single amino
acid
• Find all possible peptide sequences, which
are paths from 0 to m in the DAG
• Score possible sequences using probabilities
of specific ion masses occurring
• Select the sequence with the maximum score
TK added slide for WMU CS 6030
54
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Protein Identification via Database Search (8.13)
• De novo sequencing is often not feasible with
experimental data because the resulting spectra may be
incomplete
• However, de novo algorithm is much faster!
• Since thousands of proteins have been sequenced, a
database search is possible
• Database search is therefore a more practical solution in
most cases
• May save time in assay/experiment design
• SEQUEST is a program that implements database
search algorithm
Added by TK for WMU CS 6030
References: http://fields.scripps.edu/sequest/. http://en.wikipedia.org/wiki/SEQUEST
55
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
De Novo vs. Database Search: A Paradox
• de novo algorithms are much faster, even though their
search space is much larger!
• A database search scans all peptides in the search
space to find best one.
• De novo eliminates the need to scan all peptides by
modeling the problem as a graph search.
Why not sequence de novo?
56
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Peptide Sequencing Problem
Goal: Find a peptide with maximal match between
an experimental and theoretical spectrum.
Input:
• S: experimental spectrum
• Δ: set of possible ion types
• m: parent mass
Output:
• A peptide with mass m, whose theoretical
spectrum matches the experimental S
spectrum the best
57
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Peptide Identification Problem
Goal: Find a peptide from the database with
maximal match between an experimental and
theoretical spectrum.
Input:
• S: experimental spectrum
• database of peptides
• Δ: set of possible ion types
• m: parent mass
Output:
• A peptide of mass m from the database whose
theoretical spectrum matches the
experimental S spectrum the best
58
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
MS/MS Database Search
Database search in mass-spectrometry has been very
successful in identification of already known proteins.
Experimental spectrum can be compared with theoretical
spectra of database peptides to find the best fit.
SEQUEST (Yates et al., 1995)
But reliable algorithms for identification of modified
peptides is a much more difficult problem.
59
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
The dynamic nature of the proteome
•
•
•
•
•
The proteome of the cell
is changing
Various extra-cellular,
and other signals
activate pathways of
proteins.
A key mechanism of
protein activation is
post-translational
modification (PTM)
These pathways may
lead to other genes
being switched on or off
Mass spectrometry is
key to probing the
proteome and detecting
PTMs
61
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Post-Translational Modifications
Proteins are involved in cellular signaling and
metabolic regulation.
They are subject to a large number of biological
modifications.
Almost all protein sequences are posttranslationally modified and 200 types of
modifications of amino acid residues are
known.
62
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Examples of Post-Translational
Modification
Post-translational modifications increase the number of “letters” in
amino acid alphabet and lead to a combinatorial explosion in both
database search and de novo approaches.
63
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Search for Modified Peptides:
Virtual Database Approach
Yates et al.,1995: an exhaustive search in a
virtual database of all modified peptides.
Exhaustive search leads to a large combinatorial
problem, even for a small set of modifications
types.
Problem (Yates et al.,1995). Extend the virtual
database approach to a large set of
modifications.
64
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Exhaustive Search for modified peptides.
•
YFDSTDYNMAK
Oxidation?
• For each peptide,
generate all
modifications.
• Score each
modification.
Phosphorylation?
• 25=32 possibilities, with 2 types
of modifications!
65
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Peptide Identification Problem Revisited
Goal: Find a peptide from the database with
maximal match between an experimental and
theoretical spectrum.
Input:
• S: experimental spectrum
• database of peptides
• Δ: set of possible ion types
• m: parent mass
Output:
• A peptide of mass m from the database whose
theoretical spectrum matches the
experimental S spectrum the best
66
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Modified Peptide Identification Problem
Goal: Find a modified peptide from the database with maximal
match between an experimental and theoretical spectrum.
Input:
• S: experimental spectrum
• database of peptides
• Δ: set of possible ion types
• m: parent mass
• Parameter k (# of mutations/modifications)
Output:
• A peptide of mass m that is at most k
mutations/modifications apart from a database peptide
and whose theoretical spectrum matches the
experimental S spectrum the best
67
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Database Search:
Sequence Analysis vs. MS/MS Analysis
68
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Peptide Identification Problem: Challenge
Very similar peptides may have very different
spectra!
Goal: Define a notion of spectral similarity that
correlates well with the sequence similarity.
If peptides are a few mutations/modifications
apart, the spectral similarity between their
spectra should be high.
69
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Deficiency of the Shared Peaks Count
Shared peaks count (SPC): intuitive measure
of spectral similarity.
Problem: SPC diminishes very quickly as the
number of mutations increases.
Only a small portion of correlations between
the spectra of mutated peptides is captured
by SPC.
70
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
SPC Diminishes Quickly
no mutations
SPC=10
1 mutation
SPC=5
2 mutations
SPC=2
S(PRTEIN) = {98, 133, 246, 254, 355, 375, 476, 484, 597, 632}
S(PRTEYN) = {98, 133, 254, 296, 355, 425, 484, 526, 647, 682}
S(PGTEYN) = {98, 133, 155, 256, 296, 385, 425, 526, 548, 583}
71
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Spectral Convolution
S 2  S1  {s2  s1:s1  S1,s2  S 2 }
Number of pairs s1  S1 , s2  S 2 with s2  s1  x :
( S 2  S1 )( x)
The shared peaks count (SPC peak) :
( S 2  S1 )(0)
72
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Elements of S2 S1 represented as elements of a difference matrix. The
elements with multiplicity >2 are colored; the elements with multiplicity =2
are circled. The SPC takes into account only the red entries
73
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Naïve approaches to using Spectral Convolution
• The simplest approach is to base comparison only on
SPC
• A more sophisticated, but still naïve, approach is to
consider the mass differences that correspond to a
known peptide modification. A high multiplicity of such a
difference would indicate a possible match.
Added by TK for WMU CS 6030
References: http://fields.scripps.edu/sequest/. http://en.wikipedia.org/wiki/SEQUEST
74
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Spectral Comparison: Difficult Case
S = {10, 20, 30, 40, 50, 60, 70, 80, 90, 100}
Which of the spectra
S’ = {10, 20, 30, 40, 50, 55, 65, 75,85, 95}
or
S” = {10, 15, 30, 35, 50, 55, 70, 75, 90, 95}
fits the spectrum S the best?
SPC: both S’ and S” have 5 peaks in common with S.
Spectral Convolution: reveals the peaks at 0 and 5.
75
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Spectral Comparison: Difficult Case
S
S’
S
S’’
76
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Limitations of the Spectrum Convolutions
Spectral convolution does not reveal that
spectra S and S’ are similar, while spectra S
and S” are not.
Clumps of shared peaks: the matching
positions in S’ come in clumps while the
matching positions in S” don't.
This important property was not captured by
spectral convolution.
77
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Shifts
A = {a1 < … < an} : an ordered set of natural
numbers.
A shift (i,) is characterized by two parameters,
the position (i) and the length ().
The shift (i,) transforms
{a1, …., an}
into
{a1, ….,ai-1,ai+,…,an+  }
78
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Shifts: An Example
The shift (i,) transforms
{a1, …., an}
into
{a1, ….,ai-1,ai+,…,an+  }
e.g.
10 20 30 40 50 60 70 80 90
shift (4, -5)
10 20 30 35 45 55 65 75 85
shift (7,-3)
10 20 30 35 45 55 62 72 82
79
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Spectral Alignment Problem
• Find a series of k shifts that make the sets
A={a1, …., an} and B={b1,….,bn}
as similar as possible.
• k-similarity between sets
• D(k) - the maximum number of elements in
common between sets after k shifts.
80
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Summary
• Introduction to Protein Structure
• Peptide Mass Spectra
• De Novo Peptide Sequencing
• Spectrum Graphs
• Protein Identification via Database Search
• Spectral Convolution
• Spectral Alignment Problem
TK modified for WMU CS 6030
81
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
References
•
Jones, Neil C., and Pevzner, Pavel A., An Introduction to Bioinformatics Algorithms,
chapter 8, and Mass Spectrometry slides at
http://bix.ucsd.edu/bioalgorithms/slides.php
•
Russell, Peter J., iGenetics, A Molecular Approach, chapter 6 (Gene Expression:
Translation)
•
Hoppe, Pamela (WMU), lecture slides from BIOS 2500 (General Genetics)
•
“The Structure of Proteins”,
http://www.chemguide.co.uk/organicprops/aminoacids/proteinstruct.html
•
SEQUEST site, http://fields.scripps.edu/sequest/
TK added for WMU CS 6030
82
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Additional Slides from
Bioinformatics.info
TK modified for WMU CS 6030
83
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Representing Spectra in 0-1 Alphabet
• Convert spectrum to a 0-1 string with 1s
corresponding to the positions of the peaks.
84
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Comparing Spectra=Comparing 0-1 Strings
• A modification with positive offset corresponds to
inserting a block of 0s
• A modification with negative offset corresponds to
deleting a block of 0s
• Comparison of theoretical and experimental spectra
(represented as 0-1 strings) corresponds to a
(somewhat unusual) edit distance/alignment
problem where elementary edit operations are
insertions/deletions of blocks of 0s
• Use sequence alignment algorithms!
85
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Spectral Alignment vs. Sequence Alignment
• Manhattan-like graph with different alphabet
and scoring.
• Movement can be diagonal (matching
masses) or horizontal/vertical
(insertions/deletions corresponding to PTMs).
• At most k horizontal/vertical moves.
86
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Spectral Product
A={a1, …., an} and B={b1,…., bn}
Spectral product AB: two-dimensional matrix with
nm 1s corresponding to all pairs of10 20 30 40 50 55 65 75 85 95
indices (ai,bj) and remaining
elements being 0s.
SPC: the number of 1s at
the main diagonal.
-shifted SPC: the number
of 1s on the diagonal (i,i+ )
1
1
1
1
1 1
1
1
1
1
1
1
1
1
1 1
1
1
1
1
1
1
1
1
1 1
1
1
1
1
1
1
1
1
1 1
1
1
1
1
1
1
1
1
1 1
1
1
1
1
1
1
1
1
1 1
1
1
1
1
1
1
1
1
1 1
1
1
1
1
1
1
1
1
1 1
1
1
1
1
1
1
1
1
1 1
1
1
1
1
1
1
1
1
1 1
1
1
1
1

87
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Spectral Alignment: k-similarity
k-similarity between spectra: the maximum number
of 1s on a path through this graph that uses at most
k+1 diagonals.
k-optimal spectral
alignment = a path.
The spectral alignment
allows one to detect
more and more subtle
similarities between
spectra by increasing k.
88
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Use of k-Similarity
SPC reveals only
D(0)=3 matching
peaks.
Spectral Alignment
reveals more
hidden similarities
between spectra:
D(1)=5 and D(2)=8
and detects
corresponding
mutations.
89
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Black line represent the path for k=0
Red lines represent the path for k=1
Blue lines (right) represents the path for k=2
90
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Spectral Convolution’ Limitation
The spectral convolution considers diagonals
separately without combining them into feasible
mutation scenarios.
10 20 30 40 50 55 65 75 85 95
10 15 30 35
10
10
20
20
30
30
40
40
50

50
60
60
70
70
80
80
90
90
100
100
D(1) =10
shift function score = 10
50 55
70 75 90 95

D(1) =6
91
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Dynamic Programming for
Spectral Alignment
Dij(k): the maximum number of 1s on a path to
(ai,bj) that uses at most k+1 diagonals.
Di ' j ' (k )  1, if (i ' , j ' ) ~ (i, j )
Dij (k )  max {
(i ', j ')(i , j ) Di ' j ' ( k  1)  1, otherwise
D (k )  max Dij (k )
ij
Running time: O(n4 k)
92
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Edit Graph for Fast Spectral Alignment
diag(i,j) – the position
of previous 1 on the
same diagonal as (i,j)
93
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Fast Spectral Alignment Algorithm
M ij (k ) 
max Di ' j ' (k )
(i ', j ')(i , j )
 Ddiag(i , j ) (k )  1
Dij (k )  max 
M i 1, j 1 (k  1)  1
 Dij (k )

M ij (k )  max M i 1, j (k )
M
 i , j 1 (k )
Running time: O(n2 k)
94
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Spectral Alignment: Complications
Spectra are combinations of an increasing (Nterminal ions) and a decreasing (C-terminal
ions) number series.
These series form two diagonals in the
spectral product, the main diagonal and the
perpendicular diagonal.
The described algorithm deals with the main
diagonal only.
95
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Spectral Alignment: Complications
• Simultaneous analysis of N- and C-terminal
ions
• Taking into account the intensities and
charges
• Analysis of minor ions
96
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Filtration: Combining de novo and
Database Search in Mass-Spectrometry
• So far de novo and database search were presented as
two separate techniques
• Database search is rather slow: many labs generate
more than 100,000 spectra per day. SEQUEST takes
approximately 1 minute to compare a single spectrum
against SWISS-PROT (54Mb) on a desktop.
• It will take SEQUEST more than 2 months to analyze the
MS/MS data produced in a single day.
• Can slow database search be combined with fast de
novo analysis?
97
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Why Filtration ?
Sequence Alignment – BLAST
Smith Waterman Algorithm
Protein Query
Sequence matches
Filtration
Scoring
Database
actgcgctagctacggatagctgatcc
agatcgatgccataggtagctgatcc
atgctagcttagacataaagcttgaat
cgatcgggtaacccatagctagctcg
atcgacttagacttcgattcgatcgaat
tcgatctgatctgaatatattaggtccg
atgctagctgtggtagtgatgtaaga
• BLAST filters out very few correct
matches and is almost as accurate as
Smith – Waterman algorithm.
98
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Filtration and MS/MS
Peptide Sequencing – SEQUEST / Mascot
MS/MS spectrum
Sequence matches
Scoring
Filtration
Database
MDERHILNMKLQWVCSDLPT
YWASDLENQIKRSACVMTLA
CHGGEMNGALPQWRTHLLE
RTYKMNVVGGPASSDALITG
MQSDPILLVCATRGHEWAILF
GHNLWACVNMLETAIKLEGVF
GSVLRAEKLNKAAPETYIN..
99
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Filtration in MS/MS Sequencing
• Filtration in MS/MS is more difficult than in BLAST.
• Early approaches using Peptide Sequence Tags were
not able to substitute the complete database search.
• Current filtration approaches are mostly used to
generate additional identifications rather than replace
the database search.
• Can we design a filtration based search that can replace
the database search, and is orders of magnitude faster?
100
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Asking the Old Question Again: Why
Not Sequence De Novo?
• De novo sequencing is still not very accurate!
Algorithm
Amino Acid
Accuracy
Whole Peptide
Accuracy
Lutefisk (Taylor and Johnson, 1997).
0.566
0.189
SHERENGA (Dancik et. al., 1999).
Peaks (Ma et al., 2003).
0.690
0.673
0.289
0.246
PepNovo (Frank and Pevzner, 2005).
0.727
0.296
101
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
So What Can be Done with De Novo?
• Given an MS/MS spectrum:
• Can de novo predict the entire peptide sequence? - No!
(accuracy is less than 30%).
• Can de novo predict partial sequences? - No!
(accuracy is 50% for GutenTag and 80% for PepNovo )
• Can de novo predict a set of partial sequences, that with
high probability, contains at least one correct tag? - Yes!
A Covering Set of Tags
102
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Peptide Sequence Tags
• A Peptide Sequence Tag is short substring of
a peptide.
Example:
Tags:
GVDLK
GVD
VDL
DLK
103
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Filtration with Peptide Sequence Tags
• Peptide sequence tags can be used as filters in
database searches.
• The Filtration: Consider only database peptides
that contain the tag (in its correct relative mass
location).
• First suggested by Mann and Wilm (1994).
• Similar concepts also used by:
• GutenTag - Tabb et. al. 2003.
• MultiTag - Sunayev et. al. 2003.
• OpenSea - Searle et. al. 2004.
104
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Why Filter Database Candidates?
• Filtration makes genomic database searches practical
(BLAST).
• Effective filtration can greatly speed-up the process,
enabling expensive searches involving post-translational
modifications.
• Goal: generate a small set of covering tags and use them
to filter the database peptides.
105
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Tag Generation - Global Tags
W
TAG
R
V
A
C
L
G
E
T
K
P
L
W
AVGELTK
T
D
Prefix Mass
AVG
0.0
VGE
71.0
GEL
170.1
ELT
227.1
LTK
356.2
• Parse tags from de novo reconstruction.
• Only a small number of tags can be generated.
• If the de novo sequence is completely incorrect,
none of the tags will be correct.
106
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Tag Generation - Local Tags
W
R
V
A
G
P
C
L
TAG
T
Prefix Mass
AVG
0.0
WTD
120.2
PET
211.4
E
L
K
W
D
T
• Extract the highest scoring
subspaths from the spectrum graph.
• Sometimes gets misled by locally
promising-looking “garden paths”.
107
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Ranking Tags
• Each additional tag used to filter increases the
number of database hits and slows down the
database search.
• Tags can be ranked according to their scores,
however this ranking is not very accurate.
• It is better to determine for each tag the
“probability” that it is correct, and choose most
probable tags.
108
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Reliability of Amino Acids in Tags
• For each amino acid in a tag we want to assign a
probability that it is correct.
• Each amino acid, which corresponds to an edge in the
spectrum graph, is mapped to a feature space that consists
of the features that correlate with reliability of amino acid
prediction, e.g. score reduction due to edge removal
109
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Score Reduction Due to Edge Removal
W
R
V
A
G
L
E
T
P
C
L
K
W
D
T
• The removal of an edge corresponding to a
genuine amino acid usually leads to a reduction
in the score of the de novo path.
• However, the removal of an edge that does not
correspond to a genuine amino acid tends to
leave the score unchanged.
110
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Probabilities of Tags
• How do we determine the probability of a
predicted tag ?
• We use the predicted probabilities of its amino
acids and follow the concept:
a chain is only as strong as its weakest link
111
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Experimental Results
Algorithm \ #tags
GlobalTag
Length 3 Length 4
1
10
1
10
Length 5
1
10
0.80
0.94
0.73
0.87
0.66
0.80
LocalTag+
0.75
0.96 0.70
0.90
0.57
0.80
GutenTag
0.49
0.89
0.78
0.31
0.64
0.41
• Results are for 280 spectra of doubly charged tryptic peptides
from the ISB and OPD datasets.
112
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Tag-based Database Search
Candidate Peptides (700)
Db
55M
peptides
Tag filter
Tag
extension
Score
Significance
De novo
113
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Matching Multiple Tags
•
•
•
•
Matching of a sequence tag against a database is
fast
Even matching many tags against a database is
fast
k tags can be matched against a database in time
proportional to database size, but independent of
the number of tags.
• keyword trees (Aho-Corasick algorithm)
Scan time can be amortized by combining scans for
many spectra all at once.
•
build one keyword tree from multiple spectra
114
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Keyword Trees
Y
F
A
N
K
S
F
N
T
YFAK
YFNS
FNTA
A
…..Y F R A Y F N T A…..
115
An Introduction to Bioinformatics Algorithms
Tag Extension
Db
55M
peptides
Filter
Extension
www.bioalgorithms.info
Candidate
Peptides
(700)
Score
Significance
De novo
116
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Fast Extension
• Given:
• tag with prefix and suffix masses <mP> xyz <mS>
• match in the database
<mP>xyz<mS>
xyz
• Compute if a suffix and prefix match with allowable
modifications.
• Compute a candidate peptide with most likely
positions of modifications (attachment points).
117
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Scoring Modified Peptides
Db
55M
peptides
Filter
Extension
Score
Significance
De novo
118
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Scoring
• Input:
• Candidate peptide with attached modifications
• Spectrum
• Output:
• Score function that normalizes for length, as
variable modifications can change peptide length.
119
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Assessing Reliability of Identifications
Db
55M
peptides
Filter
extension
Score
Significance
De novo
120
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Selecting Features for Separating
Correct and Incorrect Predictions
• Features:
• Score S: as computed
• Explained Intensity I: fraction of total intensity explained by
annotated peaks.
• b-y score B: fraction of b+y ions annotated
• Explained peaks P: fraction of top 25 peaks annotated.
• Each of I,S,B,P features is normalized (subtract mean
and divide by s.d.)
• Problem: separate correct and incorrect identifications
using I,S,B,P
121
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Separating power of features
122
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Separating power of features
Quality scores:
Q = wI I + wS S + wB B + wP P
The weights are chosen to minimize
the mis-classification error
123
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Distribution of Quality Scores
124
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Results on ISB data-set
• All ISB spectra were searched.
•
•
•
•
•
•
•
•
The top match is valid for 2978 spectra (2765 for Sequest)
InsPecT-Sequest: 644 spectra (I-S dataset)
Sequest-InsPecT: 422 spectra (S-I dataset)
Average explained intensity of I-S = 52%
Average explained intensity of S-I = 28%
Average explained intensity IS = 58%
~70 Met. Oxidations
Run time is 0.7 secs. per spectrum (2.7 secs. for Sequest)
125
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Results for Mus-IMAC data-sets
• The Alliance for Cellular signalling is looking at proteins
phosphorylated in specific signal transduction pathways.
• 6500 spectra are searched with upto 4 modifications (upto 3 Met.
Oxidation and upto 2 Phos.)
• 281 phosphopeptides with P-value < 0.05
126
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
127
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Filtration Results
PTMs
None
Phosphory
lation
Tag
# Tags
Length
Filtration InsPecT SEQUEST
Runtime Runtime
3
1
3.4×10-7
0.17 sec
3
10
1.6×10-6
0.27 sec
3
1
5.8×10-7
0.21 sec
> 1 minute
> 2 minutes
3
10
2.7×10-6
0.38 sec
The search was done against SWISS-PROT (54Mb).
•
With 10 tags of length 3:
• The filtration is 1500 more efficient.
• Less than 4% of spectra are filtered out.
• The search time per spectrum is reduced by two orders of magnitude
as compared to SEQUEST.
128
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Conclusion
• With 10 tags of length 3:
• The filtration is 1500 more efficient than using only
the parent mass alone.
• Less than 4% of the positive peptides are filtered out.
• The search time per spectrum is reduced from over a
minute (SEQUEST) to 0.4 seconds.
129
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
SPIDER: Yet Another Application of de novo
Sequencing
• Suppose you have a good MS/MS spectrum of an
elephant peptide
• Suppose you even have a good de novo
reconstruction of this spectra
• However, until elephant genome is sequenced, it is
hard to verify this de novo reconstruction
• Can you search de novo reconstruction of a peptide
from elephant against human protein database?
• SPIDER (Han, Ma, Zhang ) addresses this
comparative proteomics problem
Slides from Bin Ma, University of Western Ontario
130
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Common de novo sequencing errors
GG
N and GG
have the
same mass
131
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
From de novo Reconstruction to Database
Candidate through Real Sequence
• Given a sequence with errors, search for
the similar sequences in a DB.
(Seq)
X:
(Real) Y:
(Match) Z:
LSCFAV
SLCFAV
SLCF-V
sequencing error
Homology mutations
(Seq)
X:
(Real) Y:
(Match) Z:
LSCF-AV
EACF-AV
DACFKAV
mass(LS)=mass(EA)
132
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Alignment between de novo Candidate and
Database Candidate
(Seq)
X:
(Real) Y:
(Match) Z:
LSCF-AV
EACF-AV
DACFKAV
• If real sequence Y is known then:
d(X,Z) = seqError(X,Y) + editDist(Y,Z)
133
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Alignment between de novo Candidate and
Database Candidate
(Seq)
X:
(Real) Y:
(Match) Z:
LSCF-AV
EACF-AV
DACFKAV
• If real sequence Y is known then:
d(X,Z) = seqError(X,Y) + editDist(Y,Z)
• If real sequence Y is unknown then the distance between de
novo candidate X and database candidate Z:
• d(X,Z) = minY ( seqError(X,Y) + editDist(Y,Z) )
134
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Alignment between de novo Candidate and
Database Candidate
(Seq)
X:
(Real) Y:
(Match) Z:
LSCF-AV
EACF-AV
DACFKAV
• If real sequence Y is known then:
d(X,Z) = seqError(X,Y) + editDist(Y,Z)
• If real sequence Y is unknown then the distance between de
novo candidate X and database candidate Z:
• d(X,Z) = minY ( seqError(X,Y) + editDist(Y,Z) )
• Problem: search a database for Z that minimizes d(X,Z)
• The core problem is to compute d(X,Z) for given X and Z.
135
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Computing seqError(X,Y)
• Align X and Y (according to mass).
(Seq)
(Real)
X:
Y:
LSCFAV
EACFAV
• A segment of X can be aligned to a segment of
Y only if their mass is the same!
• For each erroneous mass block (Xi,Yi), the cost is
f(Xi,Yi)=f(mass(Xi)).
• f(m) depends on how often de novo sequencing
makes errors on a segment with mass m.
• seqError(X,Y) is the sum of all f(mass(Xi)).
X
Y
Z
seqError
editDist
136
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Computing d(X,Z)
(Seq)
X:
(Real) Y:
(Match) Z:
LSCF-AV
EACF-AV
DACFKAV
• Dynamic Programming:
• Let D[i,j]=d(X[1..i], Z[1..j])
• We examine the last block of the alignment of
X[1..i] and Z[1..j].
137
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Dynamic Programming: Four Cases
D[i,j]=D[i,j-1]+indel
D[i,j]=D[i-1,j-1]+dist(X[i],Z[j])
D[i,j]=D[i-1,j]+indel
• Cases A, B, C - no
de novo sequencing
errors
• Case D: de novo
sequencing error
D[i,j]=D[i’-1,j’-1]+alpha(X[i’..i],Z[j’..j])
• D[i,j] is the
minimum of the
four cases.
138
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Computing alpha(.,.)
• alpha(X[i’..i],Z[j’..j])
= min m(y)=m(X[i’..i]) [seqError (X[i’..i],y)+editDist(y,Z[j’..j])]
= min m(y)=m[i’..i] [f(m[i’..i])+editDist(y,Z[j’..j])].
= f(m[i’..i]) + min m(y)=m[i’..i] editDist(y,Z[j’..j]).
• This is like to align a mass with a string.
• Mass-alignment Problem: Given a mass m and a
peptide P, find a peptide of mass m that is most
similar to P (among all possible peptides)
139
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Solving Mass-Alignment Problem
min y  (m  m( y ), Z [i.. j ])  indel

 (m, Z [i.. j ])  min  (m, Z [i..( j  1)])  indel
min  (m  m( y ), Z [i..( j  1)])  dist ( y, Z [ j ])
y

140
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Improving the Efficiency
• Homology Match mode:
• Assumes tagging (only peptides that share a tag of
length 3 with de novo reconstruction are considered)
and extension of found hits by dynamic programming
around the hits.
• Non-gapped homology match mode:
• Sequencing error and homology mutations do not
overlap.
• Segment Match mode:
• No homology mutations.
• Exact Match mode:
• No sequencing errors and homology mutations.
141
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Experiment Result
• The correct peptide sequence for each spectrum is known.
• The proteins are all in Swissprot but not in Human
database.
• SPIDER searches 144 spectra against both Swissprot and
human databases
142
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Example
• Using de novo reconstruction X=CCQWDAEACAFNNPGK,
the homolog Z was found in human database. At the same
time, the correct sequence Y, was found in SwissProt
database.
sequencing errors
Seq(X):
Real(Y):
Database(Z):
CCQ[W ]DAEAC[AF]<NN><PG>K
CCK AD DAEAC FA VE GP K
CCK[AD]DKETC[FA]<EE><GK>K
homology mutations
143
Download