Protein Identification by Sequence Database Search

Nathan Edwards
Department of Biochemistry and Mol. & Cell. Biology
Georgetown University Medical Center
Outline
• Proteomics
• Mass Spectrometry
• Protein Identification
• Peptide Mass Fingerprint
• Tandem Mass Spectrometry
Proteomics
• Proteins are the machines that drive
much of biology
• Genes are merely the recipe
• The direct characterization of a
sample’s proteins en masse.
• What proteins are present?
• How much of each protein is present?
2D Gel-Electrophoresis
• Protein separation
• Molecular weight (MW)
• Isoelectric point (pI)
• Staining
• Birds-eye view of
protein abundance
[Figure: 2D gel image. Bécamel et al., Biol. Proced. Online 2002;4:94-104.]
Paradigm Shift
• Traditional protein chemistry assay
methods struggle to establish identity.
• Identity requires:
• Specificity of measurement (Precision)
• Mass spectrometry
• A reference for comparison
(Measurement → Identity)
• Protein sequence databases
Mass Spectrometer
Sample → Ionizer → Mass Analyzer → Detector
Ionizer
• MALDI
• Electrospray Ionization (ESI)
Mass Analyzer
• Time-Of-Flight (TOF)
• Quadrupole
• Ion-Trap
Detector
• Electron Multiplier (EM)
Mass Spectrometer
(MALDI-TOF)
[Figure: MALDI-TOF schematic. Labels: UV (337 nm) laser; source region (length = s) with pulse voltage, analyte/matrix on a grounded backing plate, and extraction grid (source voltage -Vs); field-free drift zone (Ed = 0, length = D); detector grid (-Vs); microchannel plate detector.]
Mass Spectrum
Mass is fundamental
Peptide Mass Fingerprint
[Figures: workflow in three steps: cut out 2D-gel spot, trypsin digest, MS]
Peptide Mass Fingerprint
• Trypsin: digestion enzyme
• Highly specific
• Cuts after K & R except if followed by P
• Protein sequence from sequence database
• In silico digest
• Mass computation
• For each protein sequence in turn:
• Compare computer generated masses with
observed spectrum
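The steps above can be sketched in a few lines. This is a minimal illustration, not a production search engine: the function names (`digest`, `mh_plus`, `fingerprint_hits`), the 0.05 Da tolerance, and the use of singly protonated masses (proton mass 1.007276) are assumptions made for the example. The sequence is the myoglobin example used in these slides, and the observed masses are three values from the peptide mass list.

```python
# Monoisotopic residue masses, as in the amino-acid mass table.
RESIDUE_MASS = {
    "A": 71.03712, "C": 103.00919, "D": 115.02695, "E": 129.04260,
    "F": 147.06842, "G": 57.02147, "H": 137.05891, "I": 113.08407,
    "K": 128.09497, "L": 113.08407, "M": 131.04049, "N": 114.04293,
    "P": 97.05277, "Q": 128.05858, "R": 156.10112, "S": 87.03203,
    "T": 101.04768, "V": 99.06842, "W": 186.07932, "Y": 163.06333,
}
WATER = 18.010560    # C-terminal mass (H2O)
PROTON = 1.007276    # assumed charge carrier mass for singly charged peptides

MYOGLOBIN = (
    "GLSDGEWQQVLNVWGKVEADIAGHGQEVLIRLFTGHPETLEKFDKFKHLKTEAEMKASED"
    "LKKHGTVVLTALGGILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISDAIIHVLHSKHP"
    "GDFGADAQGAMTKALELFRNDIAAKYKELGFQG"
)

def digest(sequence):
    """In silico trypsin digest: cut after K or R unless followed by P."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        if aa in "KR" and sequence[i + 1:i + 2] != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])   # C-terminal peptide
    return peptides

def mh_plus(peptide):
    """Mass of the singly protonated peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER + PROTON

def fingerprint_hits(sequence, observed, tol=0.05):
    """Peptides whose computed mass lies within tol of an observed peak."""
    return [pep for pep in digest(sequence)
            if any(abs(mh_plus(pep) - peak) <= tol for peak in observed)]

hits = fingerprint_hits(MYOGLOBIN, [748.43, 1271.66, 1606.85])
print(hits)
```

Running the same function over every sequence in a database and ranking proteins by the number of hits gives the basic peptide mass fingerprint search.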
Protein Sequence
• Myoglobin
GLSDGEWQQV LNVWGKVEAD IAGHGQEVLI
RLFTGHPETL EKFDKFKHLK TEAEMKASED
LKKHGTVVLT ALGGILKKKG HHEAELKPLA
QSHATKHKIP IKYLEFISDA IIHVLHSKHP
GDFGADAQGA MTKALELFRN DIAAKYKELG
FQG
Amino-Acid Masses
Amino-Acid        Residual MW    Amino-Acid        Residual MW
A Alanine          71.03712      M Methionine      131.04049
C Cysteine        103.00919      N Asparagine      114.04293
D Aspartic acid   115.02695      P Proline          97.05277
E Glutamic acid   129.04260      Q Glutamine       128.05858
F Phenylalanine   147.06842      R Arginine        156.10112
G Glycine          57.02147      S Serine           87.03203
H Histidine       137.05891      T Threonine       101.04768
I Isoleucine      113.08407      V Valine           99.06842
K Lysine          128.09497      W Tryptophan      186.07932
L Leucine         113.08407      Y Tyrosine        163.06333
Peptide Mass & m/z
• Peptide Molecular Weight:
N-terminal-mass (0.00) +
Sum (AA masses) +
C-terminal-mass (18.010560)
• Observed Peptide m/z:
(Peptide Molecular Weight +
z * Proton-mass (1.007276)) / z
• Monoisotopic mass values!
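As a worked example of the two formulas, the sketch below computes the molecular weight and m/z of two peptides from the myoglobin digest. The function names are illustrative, and the proton mass is taken as 1.007276 Da (an assumption of this sketch); only the residues needed for the two example peptides are listed.

```python
# Monoisotopic residue masses for the residues used below.
RESIDUE_MASS = {
    "A": 71.03712, "E": 129.04260, "F": 147.06842, "G": 57.02147,
    "H": 137.05891, "K": 128.09497, "L": 113.08407, "P": 97.05277,
    "R": 156.10112, "T": 101.04768,
}
N_TERM = 0.00        # N-terminal mass, as on the slide
C_TERM = 18.010560   # C-terminal mass, as on the slide
PROTON = 1.007276    # assumed proton mass in Da

def peptide_mw(peptide):
    """N-terminal mass + sum of residue masses + C-terminal mass."""
    return N_TERM + sum(RESIDUE_MASS[aa] for aa in peptide) + C_TERM

def peptide_mz(peptide, z):
    """Observed m/z for charge state z."""
    return (peptide_mw(peptide) + z * PROTON) / z

for pep in ("ALELFR", "LFTGHPETLEK"):
    print(pep, round(peptide_mw(pep), 4),
          round(peptide_mz(pep, 1), 4), round(peptide_mz(pep, 2), 4))
```

The same peptide appears at a different m/z for each charge state, which is why the charge must be known before an observed m/z can be converted back to a molecular weight.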
Peptide Masses
1815.90  GLSDGEWQQVLNVWGK
1606.85  VEADIAGHGQEVLIR
1271.66  LFTGHPETLEK
1378.83  HGTVVLTALGGILK
1982.05  KGHHEAELKPLAQSHATK
1853.95  GHHEAELKPLAQSHATK
1885.02  YLEFISDAIIHVLHSK
1502.66  HPGDFGADAQGAMTK
748.43  ALELFR
Peptide Mass Fingerprint
[Figure: mass spectrum with peaks labeled by the peptides above]
Sample Preparation for
Tandem Mass Spectrometry
Enzymatic Digest
and
Fractionation
Single Stage MS
[Figure: MS]
Tandem Mass Spectrometry
(MS/MS)
[Figure: MS/MS]
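Peptide Fragmentation
Peptides consist of amino-acids
arranged in a linear backbone.
N-terminus  H…-HN-CH(R_{i-1})-CO-NH-CH(R_i)-CO-NH-CH(R_{i+1})-CO-…OH  C-terminus
(AA residue i-1, AA residue i, AA residue i+1)
Peptide Fragmentation
[Figure: backbone -HN-CH(R_i)-CO-NH-CH(R_{i+1})-CO-NH-. Breaking the peptide bond after residue i gives b_i and y_{n-i}; breaking the bond after residue i+1 gives b_{i+1} and y_{n-i-1}.]
Peptide Fragmentation
[Figure: the same backbone with all fragment types marked: a_i, b_i, c_i on the N-terminal side and x_{n-i}, y_{n-i}, z_{n-i} on the C-terminal side, plus b_{i+1} and y_{n-i-1} at the next peptide bond.]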
Peptide Fragmentation
Peptide: S-G-F-L-E-E-D-E-L-K
MW    ion  fragment     fragment    ion  MW
88    b1   S            GFLEEDELK   y9   1080
145   b2   SG           FLEEDELK    y8   1022
292   b3   SGF          LEEDELK     y7   875
405   b4   SGFL         EEDELK      y6   762
534   b5   SGFLE        EDELK       y5   633
663   b6   SGFLEE       DELK        y4   504
778   b7   SGFLEED      ELK         y3   389
907   b8   SGFLEEDE     LK          y2   260
1020  b9   SGFLEEDEL    K           y1   147
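The b and y ion masses for this peptide can be reproduced with a short sketch. It assumes singly charged fragments, a proton mass of 1.007276 Da, and the usual convention that a b ion is the prefix residue sum plus a proton while a y ion is the suffix residue sum plus water plus a proton. Function names are illustrative, and only the residues in S-G-F-L-E-E-D-E-L-K are listed.

```python
# Monoisotopic residue masses for the residues in SGFLEEDELK.
RESIDUE_MASS = {
    "S": 87.03203, "G": 57.02147, "F": 147.06842, "L": 113.08407,
    "E": 129.04260, "D": 115.02695, "K": 128.09497,
}
WATER = 18.010560
PROTON = 1.007276   # assumed proton mass in Da

def b_ions(peptide):
    """b1..b(n-1): N-terminal prefixes, singly charged."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    return [sum(masses[:i]) + PROTON for i in range(1, len(peptide))]

def y_ions(peptide):
    """y1..y(n-1): C-terminal suffixes, singly charged."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    return [sum(masses[-i:]) + WATER + PROTON for i in range(1, len(peptide))]

PEPTIDE = "SGFLEEDELK"
for i, (b, y) in enumerate(zip(b_ions(PEPTIDE), y_ions(PEPTIDE)), start=1):
    print(f"b{i} {b:9.3f}   y{i} {y:9.3f}")
```

The computed values agree with the table to within the rounding of its integer masses.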
Peptide Fragmentation
[Figures: MS/MS spectrum of S-G-F-L-E-E-D-E-L-K, % intensity vs. m/z (0 to 1000), shown with the b and y ion mass ladders from the table above. Built up in three steps: empty axes; y ion peaks labeled (y2 to y9); b ion peaks (b3 to b9) and y ion peaks labeled.]
Peptide Identification
Given:
• The mass of the precursor ion, and
• The MS/MS spectrum
Output:
• The amino-acid sequence of the
peptide
Peptide Identification
Two paradigms:
• De novo interpretation
• Sequence database search
De Novo Interpretation
[Figures: the MS/MS spectrum, % intensity vs. m/z (0 to 1000), shown unlabeled, then with the gaps between neighboring peaks labeled by amino-acid residue: first E and L, then the full residue ladders.]
De Novo Interpretation
[Figures from Lu and Chen (2003), JCB 10:1]
De Novo Interpretation
• Find good paths in spectrum graph
• Can’t use same peak twice
• Forbidden pairs: NP-hard
• “Nested” forbidden pairs: Dynamic Prog.
• Simple peptide fragmentation model
• Usually many apparently good
solutions
• Needs better fragmentation model
• Needs better path scoring
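A heavily simplified sketch of the spectrum graph idea: peaks are nodes, two peaks are joined when their mass difference equals one residue mass, and a longest path is read off as a sequence ladder. This is an illustration under stated assumptions, not a published algorithm: it ignores the forbidden-pair constraint and any scoring, the 0.02 Da tolerance is arbitrary, and the input is a hypothetical peak list made of the computed b ions of SGFLEEDELK plus one noise peak.

```python
# Monoisotopic residue masses, as in the amino-acid mass table.
RESIDUE_MASS = {
    "A": 71.03712, "C": 103.00919, "D": 115.02695, "E": 129.04260,
    "F": 147.06842, "G": 57.02147, "H": 137.05891, "I": 113.08407,
    "K": 128.09497, "L": 113.08407, "M": 131.04049, "N": 114.04293,
    "P": 97.05277, "Q": 128.05858, "R": 156.10112, "S": 87.03203,
    "T": 101.04768, "V": 99.06842, "W": 186.07932, "Y": 163.06333,
}

def residue_labels(diff, tol):
    """All residues whose mass matches a peak-to-peak difference."""
    return sorted(aa for aa, m in RESIDUE_MASS.items() if abs(m - diff) <= tol)

def longest_ladder(peaks, tol=0.02):
    """Longest chain of peaks in which every step is one residue mass."""
    peaks = sorted(peaks)
    best = [[] for _ in peaks]   # best[i]: labels on the longest path starting at peak i
    for i in range(len(peaks) - 1, -1, -1):
        for j in range(i + 1, len(peaks)):
            labels = residue_labels(peaks[j] - peaks[i], tol)
            if labels and len(best[j]) + 1 > len(best[i]):
                best[i] = ["/".join(labels)] + best[j]
    return max(best, key=len)

# Hypothetical input: b1..b9 of SGFLEEDELK plus one noise peak.
B_IONS = [88.0393, 145.0608, 292.1292, 405.2133, 534.2559,
          663.2985, 778.3254, 907.3680, 1020.4521]
NOISE = [300.0]
ladder = longest_ladder(B_IONS + NOISE)
print(ladder)
```

Even on this clean input the result is ambiguous: leucine and isoleucine have identical masses, so those steps can only be reported as I/L.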
De Novo Interpretation
• Amino-acids have duplicate masses!
• Incomplete ladders create ambiguity.
• Noise peaks and unmodeled fragments
create ambiguity
• “Best” de novo interpretation may have no
biological relevance
• Current algorithms cannot model many
aspects of peptide fragmentation
• Identifies relatively few peptides in high-throughput workflows
Sequence Database
Search
• Compares peptides from a protein
sequence database with spectra
• Filter peptide candidates by
• Precursor mass
• Digest motif
• Score each peptide against spectrum
• Generate all possible peptide fragments
• Match putative fragments with peaks
• Score and rank
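The filter-then-score loop can be sketched as follows. The score here is a plain count of matched peaks, which is an assumption of this example and far simpler than the scoring functions of real search engines. The peak list uses the integer b and y ion masses labeled in the example spectrum, so the fragment tolerance is set loosely to 0.6; the candidate list, tolerances, and function names are all hypothetical.

```python
# Monoisotopic residue masses for the residues in the candidate peptides.
RESIDUE_MASS = {
    "A": 71.03712, "S": 87.03203, "G": 57.02147, "F": 147.06842,
    "L": 113.08407, "E": 129.04260, "D": 115.02695, "K": 128.09497,
    "R": 156.10112,
}
WATER = 18.010560
PROTON = 1.007276   # assumed proton mass in Da

def peptide_mw(peptide):
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER

def fragments(peptide):
    """Singly charged b and y ion m/z values."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b = [sum(masses[:i]) + PROTON for i in range(1, len(peptide))]
    y = [sum(masses[i:]) + WATER + PROTON for i in range(1, len(peptide))]
    return b + y

def score(peptide, peaks, tol):
    """Number of observed peaks explained by a putative fragment."""
    frags = fragments(peptide)
    return sum(1 for peak in peaks if any(abs(peak - f) <= tol for f in frags))

def search(candidates, precursor_mw, peaks, mw_tol=0.5, frag_tol=0.6):
    """Filter candidates by precursor mass, then score and rank them."""
    kept = [p for p in candidates if abs(peptide_mw(p) - precursor_mw) <= mw_tol]
    return sorted(((score(p, peaks, frag_tol), p) for p in kept), reverse=True)

# Labeled peaks from the example spectrum: y2..y9 and b3..b9.
PEAKS = [260, 389, 504, 633, 762, 875, 1022, 1080,
         292, 405, 534, 663, 778, 907, 1020]
ranking = search(["ALELFR", "GSFLEEDELK", "SGFLEEDELK"], 1165.55, PEAKS)
print(ranking)
```

ALELFR is removed by the precursor mass filter; the two remaining candidates have the same mass and differ by a single matched peak, which shows how little evidence can separate closely related sequences.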
Sequence Database
Search
[Figures: candidate peptide S-G-F-L-E-E-D-E-L-K laid over the MS/MS spectrum, % intensity vs. m/z (0 to 1000). Built up in three steps: the candidate sequence; its predicted b and y ion masses; the predicted masses matched to peaks y2 to y9 and b3 to b9.]
Sequence Database
Search
• No need for complete ladders
• Possible to model all known peptide
fragments
• Sequence permutations eliminated
• All candidates have some biological
relevance
• Practical for high-throughput peptide
identification
• Correct peptide might be missing from
database!
Peptide Candidate
Filtering
• Digestion Enzyme: Trypsin
• Cuts just after K or R unless followed
by a P.
• Basic residues (K & R) at C-terminal
attract ionizing charge, leading to
strong y-ions
• “Average” peptide length about 10-15
amino-acids
• Must allow for “missed” cleavage sites
Peptide Candidate
Filtering
>ALBU_HUMAN
MKWVTFISLLFLFSSAYSRGVFRRDAHKSEVAHRFKDLGEENFKAL
VLIAFAQYLQQCPFEDHVKLVNEVTEFAK…
No missed cleavage sites
MK
WVTFISLLFLFSSAYSR
GVFR
R
DAHK
SEVAHR
FK
DLGEENFK
ALVLIAFAQYLQQCPFEDHVK
LVNEVTEFAK
…
Peptide Candidate
Filtering
>ALBU_HUMAN
MKWVTFISLLFLFSSAYSRGVFRRDAHKSEVAHRFKDLGEENFKAL
VLIAFAQYLQQCPFEDHVKLVNEVTEFAK…
One missed cleavage site
MKWVTFISLLFLFSSAYSR
WVTFISLLFLFSSAYSRGVFR
GVFRR
RDAHK
DAHKSEVAHR
SEVAHRFK
FKDLGEENFK
DLGEENFKALVLIAFAQYLQQCPFEDHVK
ALVLIAFAQYLQQCPFEDHVKLVNEVTEFAK
…
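The two peptide lists above can be generated with a small sketch. The function names and the `missed` parameter (the exact number of skipped cleavage sites per peptide) are choices made for this example. Only the portion of the ALBU_HUMAN sequence shown on the slide is used, so the last peptide simply ends where the slide's excerpt ends.

```python
# The ALBU_HUMAN excerpt shown on the slide (the full sequence continues).
ALBU_PREFIX = (
    "MKWVTFISLLFLFSSAYSRGVFRRDAHKSEVAHRFKDLGEENFKAL"
    "VLIAFAQYLQQCPFEDHVKLVNEVTEFAK"
)

def cleavage_sites(sequence):
    """Positions just after K or R, unless the next residue is P."""
    return [i + 1 for i, aa in enumerate(sequence)
            if aa in "KR" and sequence[i + 1:i + 2] != "P"]

def digest(sequence, missed=0):
    """Peptides containing exactly `missed` uncut cleavage sites."""
    inner = [s for s in cleavage_sites(sequence) if s < len(sequence)]
    bounds = [0] + inner + [len(sequence)]
    return [sequence[bounds[i]:bounds[i + 1 + missed]]
            for i in range(len(bounds) - 1 - missed)]

print(digest(ALBU_PREFIX, missed=0))
print(digest(ALBU_PREFIX, missed=1))
```

A search that allows up to one missed cleavage would consider the union of both lists, so each additional allowed missed cleavage adds roughly one more candidate per cleavage site.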
Peptide Candidate
Filtering
• Peptide molecular weight
• Only have m/z value
• Need to determine charge state
• Ion selection tolerance
• Mass for each amino-acid symbol?
• Monoisotopic vs. Average
• “Default” residual mass
• Depends on sample preparation protocol
• Cysteine almost always modified
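Because only the m/z value is measured, each plausible charge state implies a different molecular weight, and candidates must be checked against each one. A minimal sketch, assuming charge states 1 to 3, a 0.5 Da selection tolerance, and a proton mass of 1.007276 Da; the function names and the small candidate table are hypothetical.

```python
PROTON = 1.007276   # assumed proton mass in Da

def neutral_mass(mz, z):
    """Molecular weight implied by an observed m/z at charge state z."""
    return z * (mz - PROTON)

def candidates_in_window(mz, peptide_masses, charges=(1, 2, 3), tol=0.5):
    """(peptide, charge) pairs whose molecular weight fits the precursor."""
    matches = []
    for z in charges:
        mw = neutral_mass(mz, z)
        for peptide, mass in peptide_masses.items():
            if abs(mass - mw) <= tol:
                matches.append((peptide, z))
    return matches

# Hypothetical candidate table: peptide -> monoisotopic molecular weight.
MASSES = {"ALELFR": 747.428, "LFTGHPETLEK": 1270.656}
print(candidates_in_window(374.7213, MASSES))
```

A wider tolerance or more allowed charge states admits more candidates per spectrum, which trades search time and specificity against the risk of filtering out the correct peptide.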
Peptide Molecular Weight
[Figures: isotope peaks of the same peptide, labeled i = 0 to i = 4, where i = # of C13 isotopes]
…from “Isotopes” – An IonSource.Com Tutorial
Peptide Molecular Weight
• Peptide sequence WVTFISLLFLFSSAYSR
• Potential phosphorylation? S,T,Y + 80 Da
WVTFISLLFLFSSAYSR 2018.06
WVTFISLLFLFSSAYSR 2098.06
WVTFISLLFLFSSAYSR 2098.06
WVTFISLLFLFSSAYSR 2098.06
WVTFISLLFLFSSAYSR 2098.06
WVTFISLLFLFSSAYSR 2098.06
WVTFISLLFLFSSAYSR 2098.06
WVTFISLLFLFSSAYSR 2178.06
WVTFISLLFLFSSAYSR 2178.06
…
WVTFISLLFLFSSAYSR 2418.06
- 7 Molecular Weights
- 64 “Peptides”
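The counts on this slide follow from enumerating every subset of the modifiable residues. The sketch below does that with `itertools.combinations`, using the slide's nominal +80 Da per phosphorylation and its starting value of 2018.06; the function name and return format are illustrative.

```python
from itertools import combinations

def phospho_variants(peptide, base_mass, delta=80.0, sites="STY"):
    """Every choice of phosphorylated positions, with the resulting mass."""
    positions = [i for i, aa in enumerate(peptide) if aa in sites]
    variants = []
    for k in range(len(positions) + 1):
        for combo in combinations(positions, k):
            variants.append((combo, base_mass + k * delta))
    return variants

variants = phospho_variants("WVTFISLLFLFSSAYSR", 2018.06)
weights = sorted({round(mass, 2) for _, mass in variants})
print(len(variants), "modified forms,", len(weights), "distinct weights")
```

With six S, T, or Y residues there are 2^6 = 64 forms, and each additional modifiable site doubles the count, which is why variable modifications are so expensive to search.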
Peptide Scoring
• Peptide fragments vary based on
• The instrument
• The peptide’s amino-acid sequence
• The peptide’s charge state
• Etc…
• Search engines model peptide
fragmentation to various degrees.
• Speed vs. sensitivity tradeoff
• y-ions & b-ions occur most frequently
Peptide Identification
• High-throughput workflows demand we
analyze all spectra, all the time.
• Spectra may not contain enough
information to be interpreted correctly
• …fading in and out on a cell phone
• Spectra may contain too many peaks
• …static or background noise
• Peptides may not match our assumptions
• …it’s all Greek to me
• “Don’t know” is an acceptable answer!
Peptide Identification
• Rank the best peptide identifications
• Is the top ranked peptide correct?
Peptide Identification
• Incorrect peptide has best score
• Correct peptide is missing?
• Potential for incorrect conclusion
• What score ensures no incorrect
peptides?
• Correct peptide has weak score
• Insufficient fragmentation, poor score
• Potential for weakened conclusion
• What score ensures we find all correct
peptides?
Statistical Significance
• Can’t prove particular identifications
are right or wrong...
• ...need to know fragmentation in advance!
• A minimal standard for identification
scores...
• ...better than guessing.
• p-value, E-value, statistical significance
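One way to make "better than guessing" concrete is an empirical estimate: score the spectrum against peptides that are known to be wrong, and ask how often they score as well as the identification in question. The sketch below assumes such a list of null scores is available (for example from random or decoy peptides); it is an illustration of the concept and not the method used by any particular search engine. The E-value is taken as the p-value times the number of candidates scored, i.e. the expected number of chance matches at least this good.

```python
def empirical_p_value(score, null_scores):
    """Fraction of null (chance-match) scores at least as good as score."""
    at_least = sum(1 for s in null_scores if s >= score)
    return at_least / len(null_scores)

def e_value(score, null_scores, n_candidates):
    """Expected number of candidates scoring this well by chance alone."""
    return empirical_p_value(score, null_scores) * n_candidates

# Hypothetical null scores from wrong-by-construction peptides.
NULL_SCORES = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]
print(empirical_p_value(5, NULL_SCORES), e_value(5, NULL_SCORES, 50))
```

With only a handful of null scores the estimate is coarse; small p-values need either many null scores or a fitted score distribution.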
Mascot MS/MS Ions
Search
Mascot MS/MS Search
Results
Sequence Database Search
Traps and Pitfalls
Search options may eliminate the correct peptide
• Precursor mass tolerance too small
• Incorrect precursor ion charge state
• Non-tryptic or semi-tryptic peptide
• Incorrect or unexpected modification
• Sequence database too conservative
• Unreliable taxonomy annotation
Sequence Database Search
Traps and Pitfalls
Search options can cause infinite search
times
• Variable modifications increase search times
exponentially
• Non-tryptic search increases search time by
two orders of magnitude
• Large sequence databases contain many
irrelevant peptide candidates
Sequence Database Search
Traps and Pitfalls
Best available peptide isn’t necessarily correct!
• Score statistics (e-values) are essential!
• What is the chance a peptide could score this well by
chance alone?
• Incorrect instrument settings or fragment
tolerance can render scores non-specific.
• The wrong peptide can look correct if the right
peptide is missing!
• Need scores (or e-values) that are invariant to
spectrum quality and peptide properties
Sequence Database Search
Traps and Pitfalls
Search engines often make incorrect
assumptions about sample prep
• Proteins with lots of identified peptides
are not more likely to be present
• Peptide identifications do not represent
independent observations
• Not all proteins are equally interesting
to report
Sequence Database Search
Traps and Pitfalls
Good spectral processing can make a
big difference
• Poorly calibrated spectra require large
m/z tolerances
• Poorly baselined spectra make small
peaks hard to believe
• Poorly de-isotoped spectra have extra
peaks and misleading charge state
assignments
Summary
• Protein identification from tandem
mass spectra is a key proteomics
technology.
• Protein identifications should be
treated with healthy skepticism.
• Look at all the evidence!
• Spectra remain unidentified for a
variety of reasons.
Further Reading
• Matrix Science (Mascot) Web Site
• www.matrixscience.com
• Seattle Proteome Center (ISB)
• www.proteomecenter.org
• Proteomic Mass Spectrometry Lab at
The Scripps Research Institute
• fields.scripps.edu
• UCSF ProteinProspector
• prospector.ucsf.edu