Proteomics Informatics Workshop Part I: Protein Identification

David Fenyö
February 4, 2011
• Introduction to proteomics
• Introduction to mass spectrometry
• Analysis of mass spectra
• Database searching
• Spectrum library searching
• de novo sequencing
• Significance testing
Why Proteomics?
Geiger et al., “Proteomic changes resulting from gene copy number
variations in cancer cells”, PLoS Genet. 2010 Sep 2;6(9). pii: e1001090.
Proteomics Informatics
[Workflow diagram: Biological System → Experimental Design → Samples → Sample Preparation → Measurements (MS, MS/MS) → Data Analysis → Information about each sample → Information Integration → Information about the biological system. Questions asked of each sample: What does the sample contain? How much?]
Sample Preparation
[Workflow diagram as on the previous slide, with the Sample Preparation step expanded: Enrichment, Separation etc., Digestion; top-down versus bottom-up.]
Mass Spectrometry (MS)
Ion Source: MALDI, ESI
Mass Analyzer: Quadrupole, Ion Trap (3D, linear), Time-of-Flight, Orbitrap, FTICR
Detector
[Diagram: Ion Source → Mass Analyzer → Detector, giving a spectrum of intensity versus mass/charge.]
Mass Spectrometry – MALDI-TOF
[Diagram: Ion Source (MALDI, with laser and HV) → Mass Analyzer (Time-of-Flight) → Detector.]
Tandem Mass Spectrometry (MS/MS)
[Diagram: Ion Source → Mass Analyzer 1 → Fragmentation (CAD – Collision Activated Dissociation) → Mass Analyzer 2 → Detector, illustrated with quadrupoles. Panels plot m/z versus time and are marked YES or NO; one panel is labeled "Δm/z is constant". Output: intensity versus mass/charge.]
Dissociation Techniques
CAD: Collision Activated Dissociation (b, y ions)
• increase of internal energy through collisions
ETD: Electron Transfer Dissociation (c, z ions)
• radical driven fragmentation
Dissociation Techniques: CAD versus ETD
CAD                                       ETD
Low charge                                High charge
Short peptides                            Up to intact proteins
Weakest bonds break first                 More uniform fragmentation
Preferred cleavage N-terminal to proline  No cleavage N-terminal to proline
Liquid Chromatography (LC)-MS/MS
[Diagram: LC → Ion Source → Mass Analyzer 1 → Fragmentation → Mass Analyzer 2 → Detector. Panels show a series of spectra (intensity versus mass/charge) recorded over time.]
[Figure: acquisition schemes drawn as numbered scan sequences (1., 2., 3., …). One panel shows repeated cycles of an MS scan followed by MS/MS 1 to MS/MS 3; the other shows cycles of an MS scan followed by MS/MS 1 to MS/MS 10. Panel titles: Data Independent Acquisition, Data Dependent Acquisition. Axes: intensity versus mass/charge.]
Mass Spectrometry – ESI-LC-MS/MS
[Instrument diagram: Ion Source (ESI) → Mass Analyzer 1 (Linear Ion Trap) with Fragmentation (CAD, ETD) and Detector; Fragmentation (HCD) → Mass Analyzer 2 (Orbitrap) with Detector.]
Olsen J V et al. Mol Cell Proteomics 2009;8:2759-2769
Charge-State Distributions
$\frac{m}{z} = \frac{M + nH}{n}$
M - molecular mass
n - number of charges
H - mass of a proton
[Figure: MALDI and ESI charge-state distributions (intensity versus mass/charge) for a peptide, with peaks labeled 1+ to 4+, and for a protein, with peaks labeled 1+ to 5+ and 27+ to 31+.]
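The relation m/z = (M + nH)/n can be tried out numerically. The following is a minimal sketch; the 2000 Da peptide mass is a made-up example value, and the proton mass of 1.00728 Da is a standard constant that does not appear on the slide.

```python
PROTON_MASS = 1.00728  # Da; H in the formula (standard value, not from the slide)


def mz(molecular_mass, n_charges):
    """Return m/z = (M + n*H) / n for a molecule of mass M carrying n protons."""
    if n_charges < 1:
        raise ValueError("n_charges must be a positive integer")
    return (molecular_mass + n_charges * PROTON_MASS) / n_charges


# Hypothetical peptide of 2000 Da observed at charge states 1+ to 4+
for n in range(1, 5):
    print(f"{n}+: m/z = {mz(2000.0, n):.3f}")
```

Higher charge states move the same molecule to lower m/z, which is why ESI spectra of proteins show a ladder of peaks for one molecular mass.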
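Isotope Distributions
[Figure: isotope clusters (intensity versus m/z) for peptides of mass m = 1035 Da, 1878 Da and 2234 Da. The first peak contains only 12C, 14N, 16O, 1H and 32S; it is followed by peaks at +1 Da, +2 Da and +3 Da.]
Natural abundances of the heavy isotopes:
0.015% 2H
1.11% 13C
0.366% 15N
0.038% 17O, 0.200% 18O
0.75% 33S, 4.21% 34S, 0.02% 36S
Only 12C and 13C:
p = 0.0111
n is the number of C in the peptide
m is the number of 13C in the peptide
T_m is the relative intensity of the peak for the peptide with m 13C
$T_m = \binom{n}{m} p^m (1 - p)^{n - m}$
[Figure: isotope distributions, intensity ratio versus peptide mass (two panels).]
[Figure: isotope distribution of GFP (29 kDa), intensity versus m/z, with the monoisotopic mass marked.]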
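The binomial model for the 12C/13C-only case can be evaluated directly. This is a minimal sketch under the slide's simplification (carbon only); the carbon counts 50 and 100 are hypothetical values chosen for illustration.

```python
from math import comb

P_13C = 0.0111  # natural abundance of 13C, as given on the slide


def isotope_pattern(n_carbons, max_13c=5, p=P_13C):
    """Relative intensities T_m for m = 0..max_13c, considering only 12C and 13C."""
    top = min(max_13c, n_carbons)
    return [
        comb(n_carbons, m) * p**m * (1 - p) ** (n_carbons - m)
        for m in range(top + 1)
    ]


# Hypothetical peptides with 50 and 100 carbon atoms
for n in (50, 100):
    print(n, [round(t, 3) for t in isotope_pattern(n)])
```

With more carbon atoms the peak with one 13C overtakes the monoisotopic peak, which matches the trend of the isotope clusters shifting to heavier peaks as peptide mass grows.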
Peak Finding
[Figure: spectrum (intensity versus m/z) with noise.]
Find maxima of
$S(l) = \sum_{|k-l| \le w/2} I(k)$
The signal in a peak can be estimated with the RMSD, the root-mean-square deviation of the intensities $I(k)$ from their mean $\bar{I}$ over the window $|k-l| \le w/2$, and the signal-to-noise ratio of a peak can be estimated by dividing the signal by the RMSD of the background.
The centroid m/z of a peak:
$\left(\frac{m}{z}\right)_{\mathrm{centroid}} = \frac{\sum_{|k-l| \le w/2} I(k)\,\frac{m}{z}(k)}{\sum_{|k-l| \le w/2} I(k)}$
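The windowed sum and the centroid can be sketched as follows. This is an illustration only: the toy spectrum is invented, no noise threshold is applied, and treating a peak as a strict local maximum of S(l) is an assumption about how "find maxima" is implemented.

```python
def window_bounds(l, w, n_points):
    """Index range [lo, hi) covering |k - l| <= w/2, clipped to the spectrum."""
    half = w // 2
    return max(0, l - half), min(n_points, l + half + 1)


def find_peaks(mz, intensity, w):
    """Return (index, centroid m/z) for each local maximum of the windowed sum S(l)."""
    n = len(intensity)
    s = []
    for l in range(n):
        lo, hi = window_bounds(l, w, n)
        s.append(sum(intensity[lo:hi]))
    peaks = []
    for l in range(1, n - 1):
        if s[l] > s[l - 1] and s[l] >= s[l + 1]:
            lo, hi = window_bounds(l, w, n)
            # Intensity-weighted mean m/z over the same window (s[l] > 0 here)
            weighted = sum(i * m for i, m in zip(intensity[lo:hi], mz[lo:hi]))
            peaks.append((l, weighted / s[l]))
    return peaks


# Toy spectrum with a single symmetric peak centred at m/z 100.4
toy_mz = [100.0 + 0.1 * k for k in range(11)]
toy_intensity = [0, 0, 1, 4, 9, 4, 1, 0, 0, 0, 0]
print(find_peaks(toy_mz, toy_intensity, w=4))
```

A real implementation would also compare S(l) against the RMSD of the background before accepting a maximum as a peak.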
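Isotope Clusters and Charge State
[Figure: isotope clusters (intensity versus m/z) for charge states 1+, 2+ and 3+, with peak spacings of 1, 0.5 and 0.33, respectively. Panels are annotated "Possible to Determine Charge?": Yes, Yes, Maybe, No.]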
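Because neighbouring isotope peaks differ by about 1 Da in mass, their spacing in m/z is about 1/z, so the charge can be read from the spacing when the peaks are resolved. A minimal sketch follows; the tolerance and the maximum charge are arbitrary illustrative choices.

```python
def charge_from_spacing(delta_mz, max_charge=10, tolerance=0.02):
    """Return the charge z whose expected spacing 1/z is closest to delta_mz.

    Returns None when no charge up to max_charge fits within the tolerance.
    """
    best = min(range(1, max_charge + 1), key=lambda z: abs(1.0 / z - delta_mz))
    return best if abs(1.0 / best - delta_mz) <= tolerance else None


for spacing in (1.0, 0.5, 0.33):
    print(spacing, "->", charge_from_spacing(spacing))
```

At high charge the expected spacings 1/z crowd together, so the assignment becomes ambiguous unless the instrument resolution is high.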
Identification – Peptide Mass Fingerprinting
Lysis
Fractionation
Digestion
Mass spectrometry
MS
Identified Proteins
Example data – Peptide Mapping by MALDI-TOF
[Figure: MALDI-TOF spectra (intensity versus m/z) of a peptide map, with zoomed views of the m/z ranges 1444.0 to 1458.0 and 2378.0 to 2394.0.]
Information Content in a Single Mass Measurement
[Figure: average number of matching peptides (1 to 10) versus tryptic peptide mass (1000 to 3000 Da), for human and for S. cerevisiae.]
Identification – Peptide Mass Fingerprinting
Lysis
Fractionation
Digestion
Mass spectrometry
MS
Identified Proteins
Peak Finding
Charge determination
De-isotoping
Searching
Identification – Peptide Mass Fingerprinting
Sequence
DB
Digestion
MS
All Peptide
Masses
MS
Compare, Score, Test Significance
Identified Proteins
Repeat for each protein
Pick Protein
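The search loop above (pick a protein, digest it, compute all peptide masses, compare and score, repeat for each protein) can be sketched as below. This is not how ProFound scores; it is a bare match count. The two protein sequences are hypothetical, the cleavage rule (after K or R, but not before P) is the commonly used trypsin rule rather than something stated on the slide, and neutral monoisotopic masses are compared for simplicity.

```python
WATER = 18.0106  # Da, added once per peptide

# Monoisotopic residue masses (Da)
RESIDUE_MASS = {
    "A": 71.0371, "R": 156.101, "N": 114.043, "D": 115.027, "C": 103.009,
    "E": 129.043, "Q": 128.059, "G": 57.0215, "H": 137.059, "I": 113.084,
    "L": 113.084, "K": 128.095, "M": 131.040, "F": 147.068, "P": 97.0528,
    "S": 87.032, "T": 101.048, "W": 186.079, "Y": 163.063, "V": 99.0684,
}


def tryptic_digest(sequence):
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        next_aa = sequence[i + 1] if i + 1 < len(sequence) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass: sum of residue masses plus one water."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def rank_proteins(observed, database, tol=0.2):
    """Score each protein by the number of observed masses it explains."""
    scores = {}
    for name, sequence in database.items():
        theoretical = [peptide_mass(p) for p in tryptic_digest(sequence)]
        scores[name] = sum(
            any(abs(o - t) <= tol for t in theoretical) for o in observed
        )
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical two-protein database and simulated measurements
DATABASE = {"protA": "SGFLEEDELKAAR", "protB": "MGGWVVKPLER"}
observed = [peptide_mass("SGFLEEDELK"), peptide_mass("AAR")]
print(rank_proteins(observed, DATABASE))
```

A match count alone does not say whether the top hit is significant; that requires the score distribution for false matches, as discussed under Significance Testing.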
ProFound – Search Parameters
http://prowl.rockefeller.edu/
ProFound Results
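Example data – ESI-LC-MS/MS
[Figure: MS/MS spectrum, % relative abundance (0 to 100) versus m/z (250 to 1000), selected from an LC run over time. Labeled peaks: 260, 292, 389, 405, 504, 534, 633, 663, 762, 778, 875, 907, 1020, 1022, 1080 and the precursor [M+2H]2+.]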
Peptide Fragmentation
Mass
Analyzer 1
Ion Source
Fragmentation
Mass
Analyzer 2
Detector
b
y
Identification – Tandem MS
Tandem MS – Sequence Confirmation
Peptide: S G F L E E D E L K
b ion   residue   y ion
88      S         1166
145     G         1080
292     F         1022
405     L         875
534     E         762
663     E         633
778     D         504
907     E         389
1020    L         260
1166    K         147
[Figure: MS/MS spectrum, % relative abundance (0 to 100) versus m/z (250 to 1000), with the b ions and y ions marked.]
[The Sequence Confirmation slide is repeated as successive builds of the same b/y ion table for S G F L E E D E L K and the same spectrum (% relative abundance versus m/z, precursor [M+2H]2+, labeled peaks 260, 292, 389, 405, 504, 534, 633, 663, 762, 778, 875, 907, 1020, 1022, 1080). Two of the builds annotate the mass differences 113 and 129 between adjacent peaks.]
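The b and y ion masses used for sequence confirmation can be computed from residue masses. The sketch below assumes singly charged fragments and uses standard values for water and the proton that are not given on the slides; computed values agree with the nominal peak labels to within about half a dalton.

```python
PROTON = 1.00728  # Da (standard value, an assumption not stated on the slide)
WATER = 18.0106   # Da

# Monoisotopic residue masses (Da)
RESIDUE_MASS = {
    "A": 71.0371, "R": 156.101, "N": 114.043, "D": 115.027, "C": 103.009,
    "E": 129.043, "Q": 128.059, "G": 57.0215, "H": 137.059, "I": 113.084,
    "L": 113.084, "K": 128.095, "M": 131.040, "F": 147.068, "P": 97.0528,
    "S": 87.032, "T": 101.048, "W": 186.079, "Y": 163.063, "V": 99.0684,
}


def fragment_ions(peptide):
    """Return singly charged (b_1..b_{n-1}, y_1..y_{n-1}, M+H) for a peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    running = 0.0
    for m in masses[:-1]:            # N-terminal fragments
        running += m
        b_ions.append(running + PROTON)
    running = 0.0
    for m in reversed(masses[1:]):   # C-terminal fragments
        running += m
        y_ions.append(running + WATER + PROTON)
    precursor = sum(masses) + WATER + PROTON
    return b_ions, y_ions, precursor


b_ions, y_ions, precursor = fragment_ions("SGFLEEDELK")
print("b:", [round(x, 1) for x in b_ions])
print("y:", [round(x, 1) for x in y_ions])
print("M+H:", round(precursor, 1))
```

Comparing these lists with the labeled peaks is the confirmation step: a sequence is supported when most of its predicted b and y ions appear in the spectrum.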
Tandem MS – de novo Sequencing
Amino acid masses
1-letter  3-letter  Chemical    Monoisotopic  Average
code      code      formula     mass          mass
A         Ala       C3H5ON      71.0371       71.0788
R         Arg       C6H12ON4    156.101       156.188
N         Asn       C4H6O2N2    114.043       114.104
D         Asp       C4H5O3N     115.027       115.089
C         Cys       C3H5ONS     103.009       103.139
E         Glu       C5H7O3N     129.043       129.116
Q         Gln       C5H8O2N2    128.059       128.131
G         Gly       C2H3ON      57.0215       57.0519
H         His       C6H7ON3     137.059       137.141
I         Ile       C6H11ON     113.084       113.159
L         Leu       C6H11ON     113.084       113.159
K         Lys       C6H12ON2    128.095       128.174
M         Met       C5H9ONS     131.04        131.193
F         Phe       C9H9ON      147.068       147.177
P         Pro       C5H7ON      97.0528       97.1167
S         Ser       C3H5O2N     87.032        87.0782
T         Thr       C4H7O2N     101.048       101.105
W         Trp       C11H10ON2   186.079       186.213
Y         Tyr       C9H9O2N     163.063       163.176
V         Val       C5H9ON      99.0684       99.1326
[Figure: the example MS/MS spectrum (% relative abundance versus m/z, precursor [M+2H]2+); the mass differences between peaks lead to the sequences consistent with the spectrum.]
Tandem MS – de novo Sequencing
Peaks: 260 292 389 405 504 534 633 663 762 778 875 907 1020 1022 1079
Mass differences between each peak and every higher peak:
260:  32 129 145 244 274 373 403 502 518 615 647 760 762 819
292:  97 113 212 242 341 371 470 486 583 615 728 730 787
389:  16 115 145 244 274 373 389 486 518 631 633 690
405:  99 129 228 258 357 373 470 502 615 617 674
504:  30 129 159 258 274 371 403 516 518 575
534:  99 129 228 244 341 373 486 488 545
633:  30 129 145 242 274 387 389 446
663:  99 115 212 244 357 359 416
762:  16 113 145 258 260 317
778:  97 129 242 244 301
875:  32 145 147 204
907:  113 115 172
1020: 2 59
1022: 57
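The mass-difference table can be generated and screened for amino acid residues programmatically. The sketch below is an illustration of that one step, not a full de novo sequencer: it uses the nominal peak list from the slide, merges I and L because they share a mass, and applies a tolerance of 0.5 Da chosen only because the peaks are given as integers.

```python
from itertools import combinations

# Monoisotopic residue masses (Da); I and L are indistinguishable by mass
RESIDUES = [
    ("G", 57.0215), ("A", 71.0371), ("S", 87.032), ("P", 97.0528),
    ("V", 99.0684), ("T", 101.048), ("C", 103.009), ("I/L", 113.084),
    ("N", 114.043), ("D", 115.027), ("Q", 128.059), ("K", 128.095),
    ("E", 129.043), ("M", 131.040), ("H", 137.059), ("F", 147.068),
    ("R", 156.101), ("Y", 163.063), ("W", 186.079),
]

PEAKS = [260, 292, 389, 405, 504, 534, 633, 663, 762, 778, 875, 907,
         1020, 1022, 1079]


def residue_steps(peaks, tol=0.5):
    """List (low peak, high peak, residue) for differences matching a residue."""
    steps = []
    for low, high in combinations(sorted(peaks), 2):
        diff = high - low
        for name, mass in RESIDUES:
            if abs(diff - mass) <= tol:
                steps.append((low, high, name))
    return steps


for low, high, name in residue_steps(PEAKS):
    print(f"{low} -> {high}: {name}")
```

Chaining steps that share a peak gives sequence ladders, for example 292 → 405 → 534 → 663 → 778 → 907 → 1020 reads (I/L)-E-E-D-E-(I/L); whether a ladder is a b or a y series still has to be decided from the precursor mass.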
Tandem MS – de novo Sequencing
Peaks: 260 292 389 405 504 534 633 663 762 778 875 907 1020 1022 1079
[Table: the mass-difference matrix annotated with the residues whose masses match a difference (P, V, I/L, D, E, F, G).]
Sequences consistent with the spectrum:
…GF(I/L)EEDE(I/L)…
…(I/L)EDEE(I/L)FG…
Peptide M+H = 1166
1166 - 1020 - 18 = 128 => K or Q
1166 - 1079 = 87 => S
SGF(I/L)EEDE(I/L)…
SGF(I/L)EEDE(I/L)(K/Q)
Tandem MS – de novo Sequencing
Challenges in de novo sequencing
Neutral loss (-H2O, -NH3)
Modifications
Background peaks
Incomplete information
Tandem MS – Database Search
Sequence
DB
Pick Peptide
MS/MS
All Fragment
Masses
MS/MS
Compare, Score, Test Significance
Repeat for
all peptides
LC-MS
Repeat for all proteins
Lysis
Pick Protein
Fractionation
Digestion
Tandem MS – Database Search
X! Tandem - Search Parameters
http://www.thegpm.org/
Multi-stage searching
spectra
sequences
Tryptic
cleavage
sequences
Modifications #1
Modifications #2
Point mutation
X! Tandem
Search Results
How many fragment masses are needed for identification?
[Figure: probability of identification (0 to 1) versus number of matching fragments (0 to 15), with the critical number of matching fragments marked; second panel: critical number of matching fragments (0 to 16) as a function of a parameter.]
Small peptides are slightly more difficult to identify
[Figure: probability of identification versus number of fragment ions (0 to 20) for precursor masses of 1000, 1500, 2000 and 2500 Da; critical number of fragments versus precursor mass (500 to 3000 Da).]
Δm_precursor = 1 Da
Δm_fragment = 0.5 Da
No modification
A lower precursor mass error requires fewer fragment masses for identification of unmodified peptides
[Figure: probability of identification versus number of fragment ions (0 to 20) for precursor mass errors of 0.01, 1 and 10 Da; critical number of fragments versus precursor mass error (0.001 to 10 Da).]
m_precursor = 2000 Da
Δm_fragment = 0.5 Da
No modification
The dependence on the fragment mass error is weak below a threshold for identification of unmodified peptides
[Figure: probability of identification versus number of fragment ions (0 to 20) for fragment mass errors of 0.01, 0.5, 1 and 2 Da; critical number of fragments versus fragment mass error (0.001 to 10 Da).]
m_precursor = 2000 Da
Δm_precursor = 1 Da
No modification
A moderate number of background peaks can be tolerated when identifying unmodified peptides
[Figure: probability of identification versus number of fragment ions (0 to 20) for 0%, 50% and 80% background; critical number of fragments versus background (0 to 100%).]
m_precursor = 2000 Da
Δm_precursor = 1 Da
Δm_fragment = 0.5 Da
No modification
A large number of background peaks can be tolerated if the fragment mass is accurate
[Figure: probability of identification versus number of fragment ions (0 to 20) for 0%, 50% and 80% background; critical number of fragments versus background (0 to 100%).]
m_precursor = 2000 Da
Δm_precursor = 1 Da
Δm_fragment = 0.01 Da
No modification
Identification of phosphopeptides is only slightly more difficult
[Figure: probability of identification versus number of fragment ions (0 to 20) for phosphorylated and unmodified peptides.]
m_precursor = 2000 Da
Δm_precursor = 1 Da
Δm_fragment = 0.5 Da
Identification – Spectrum Library Search
Spectrum
Library
Pick
Spectrum
MS/MS
Compare, Score, Test Significance
Identified Proteins
Repeat for
all spectra
Lysis
Fractionation
Digestion
LC-MS/MS
Spectrum Library Characteristics – Peptide Length
[Figure: fraction of library (%), 0 to 10, versus peptide length, 0 to 50.]
Spectrum Library Characteristics – Protein Coverage
[Figure: % coverage (0 to 50), by residues and by peptides, versus protein Mr (10 to 190 kDa).]
Spectrum Library Characteristics – Size
Species           Spectra    Peptides   Redundancy
H. sapiens        1002326    270345     ×3.7
P. troglodytes    889232     238688     ×3.7
M. mulatta        754601     195701     ×3.9
M. musculus       732382     199182     ×3.7
R. norvegicus     637776     160439     ×4.0
B. taurus         592070     140063     ×4.2
E. caballus       590514     139849     ×4.2
S. cerevisiae     201253     133166     ×1.5
C. elegans        190952     90981      ×2.1
D. rerio          174049     46546      ×3.7
T. rubripes       169551     36514      ×4.6
D. melanogaster   122353     71928      ×1.7
A. thaliana       111689     62574      ×1.8
Identification – Spectrum Library Search
Library spectrum
(5:25)
Test spectrum
(5:25)
Results: 4 peaks selected, 1 peak missed
Identification – Spectrum Library Search
How likely is this?
Apply a hypergeometric probability model:
- 25 possible m/z values;
- 5 peaks in the library spectrum; and
- 4 selected by the test spectrum.
Matches   Probability
1         0.45
2         0.15
3         0.016
4         0.00039
5         0.0000037
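The hypergeometric model can be checked numerically. The sketch below assumes the reading in which 4 peaks are drawn at random from 25 possible m/z values, 5 of which are occupied by library peaks; under that assumption it reproduces the tabulated probabilities for 1 to 4 matches. The function and parameter names are illustrative.

```python
from math import comb


def match_probability(k, n_possible=25, n_library=5, n_selected=4):
    """P(exactly k of the n_selected peaks coincide with library peaks)."""
    if k < 0 or k > min(n_library, n_selected):
        return 0.0
    return (
        comb(n_library, k)
        * comb(n_possible - n_library, n_selected - k)
        / comb(n_possible, n_selected)
    )


for k in range(1, 5):
    print(k, f"{match_probability(k):.5f}")
```

The probability falls steeply with each additional match, which is what makes a high match count strong evidence against a random hit.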
Identification – Spectrum Library Search
What if there are 1000 possible m/z values and 20 peaks in both the test and the library spectrum?
1 matched: p = 0.6
5 matched: p = 0.0002
10 matched: p = 0.0000000000001
[Figure: p (1.0E+00 down to 1.0E-14, logarithmic scale) versus number of matches (1 to 10).]
Identification – Spectrum Library Search
[Diagram: an experimental mass spectrum (M/Z) is compared with a library of assigned mass spectra, and the best search result is reported.]
X! Hunter Result
[Figure: query spectrum shown against the library spectrum.]
Significance Testing
False protein identification is caused by random matching.
An objective criterion for testing the significance of protein identification results is necessary.
The significance of protein identifications can be tested once the distribution of scores for false results is known.
Significance Testing - Expectation Values
The majority of sequences in a collection will
give a score due to random matching.
[Workflow diagram: spectrum (M/Z) → Database Search → List of Candidates → Distribution of Scores for Random and False Identifications → Extrapolate and Calculate Expectation Values → List of Candidates With Expectation Values]
Rho-diagrams: Overall Quality of a Data Set
Expectation values as a function of score for random matching: e(s) is exponential in the score s, so log e(s) is linear in s.
Definition: E_i (i = 0, -1, -2, …) is the number of spectra that have been assigned an expectation value between exp(i) and exp(i-1). For random matching:
$E_i = \int_{e=\exp(i-1)}^{e=\exp(i)} N\,de = N\{\exp(i) - \exp(i-1)\}$
$\rho(i) = \log\left(\frac{E_i}{E_0}\right) = \log\left(\frac{N\exp(i)\{1-\exp(-1)\}}{N\{1-\exp(-1)\}}\right) = i$
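The definition of ρ(i) translates into a short binning routine. The sketch below follows the definition as stated (bin i collects expectation values between exp(i-1) and exp(i), normalised to bin 0); the function name and the evenly spaced test values, which stand in for purely random matching, are illustrative assumptions.

```python
import math
from collections import Counter


def rho_diagram(expectation_values):
    """Return {i: rho(i)} with rho(i) = log(E_i / E_0) for the non-empty bins."""
    counts = Counter()
    for e in expectation_values:
        if 0.0 < e <= 1.0:
            # Bin i holds values with exp(i-1) < e <= exp(i)
            counts[math.ceil(math.log(e))] += 1
    if counts[0] == 0:
        raise ValueError("no expectation values in the reference bin (exp(-1), 1]")
    return {
        i: math.log(c / counts[0])
        for i, c in sorted(counts.items(), reverse=True)
    }


# Expectation values spread evenly over (0, 1), as expected for random matching
uniform_e = [(k + 0.5) / 100000 for k in range(100000)]
rho = rho_diagram(uniform_e)
for i in range(0, -6, -1):
    print(i, round(rho[i], 3))
```

For random matching the points fall on the diagonal ρ(i) = i; a data set containing true identifications shows an excess of spectra at low expectation values, lifting the curve above the diagonal.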
Rho-diagram – Random Matching
[Figure: ρ versus log(e), both axes from 0 to -6.]
Rho-diagram – Data Quality
[Figure: ρ versus log(e), both axes from 0 to -10.]
Rho-diagram – Parameters
Summary
Protein identification strategies:
- de novo Sequencing
- Searching Sequence Collections
- Searching Spectrum Libraries
It is important to report the significance of the results
Google Group for Proteomics in NYC
Please join!
Proteomics Informatics Workshop
Part II: Protein Characterization
February 18, 2011
• Top-down/bottom-up proteomics
• Post-translational modifications
• Protein complexes
• Cross-linking
• The Global Proteome Machine Database
Proteomics Informatics Workshop
Part III: Protein Quantitation
February 25, 2011
• Metabolic labeling – SILAC
• Chemical labeling
• Label-free quantitation
• Spectrum counting
• Stoichiometry
• Protein processing and degradation
• Biomarker discovery and verification
Proteomics Informatics Workshop
Part I: Protein Identification, February 4, 2011
Part II: Protein Characterization, February 18, 2011
Part III: Protein Quantitation, February 25, 2011