Slides - Fenyo Lab

Previous Lecture: Regression and Correlation
This Lecture
Introduction to Biostatistics and Bioinformatics
Proteomics Informatics
Proteomics Informatics – Learning Objectives
• Structure of mass spectrometry data
• Protein identification
• Protein quantitation
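Protein Identification and Quantitation by Mass Spectrometry
[Figure: samples are converted to peptides and measured by mass spectrometry; in the resulting spectrum (intensity vs m/z), m/z gives the identity and intensity gives the quantity.]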
Sample preparation for protein identification, characterization and quantitation
Lysis → Fractionation → Digestion → Mass spectrometry
Overview of Mass spectrometry
Ion Source → Mass Analyzer → Detector
[Figure: the detector output is a spectrum of intensity vs mass/charge.]
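Mass Spectrometry (MS)
$$ \mathbf{F} = m\mathbf{a} = m\frac{d\mathbf{v}}{dt} = z\,(\mathbf{E} + \mathbf{v} \times \mathbf{B}) $$
$$ \frac{m}{z}\,\frac{d\mathbf{v}}{dt} = \mathbf{E} + \mathbf{v} \times \mathbf{B} $$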
Example data – MALDI-TOF
[Figure: MALDI-TOF spectra (file ATP.txt), intensity vs m/z, shown at full scale and in zoomed views.]
Peptide intensity vs m/z
Peptide Fragmentation
Ion Source → Mass Analyzer 1 → Fragmentation → Mass Analyzer 2 → Detector
Fragment ion types: b and y
Liquid Chromatography (LC)-MS/MS
LC → Ion Source → Mass Analyzer 1 → Fragmentation → Mass Analyzer 2 → Detector
[Figure: a series of spectra (intensity vs mass/charge) recorded over time, before and after fragmentation.]
Example data – ESI-LC-MS/MS
[Figure: peptide intensity vs m/z vs time, with one MS/MS spectrum (% relative abundance vs m/z) showing the [M+2H]2+ precursor and fragment peaks at m/z 260, 292, 389, 405, 504, 534, 633, 663, 762, 778, 875, 907, 1020, 1022 and 1080.]
Fragment intensity vs m/z
Charge-State Distributions
[Figure: peptide spectra (intensity vs mass/charge). MALDI: 1+ and 2+ peaks. ESI: 1+ to 4+ peaks.]
[Figure: protein spectra (intensity vs mass/charge). MALDI: 1+ to 5+ peaks. ESI: high charge states, e.g. 27+ to 31+.]
Charge-State
$$ \frac{m}{z} = \frac{M + nH}{n} $$
M - molecular mass
n - number of charges
H - mass of a proton
Example: a peptide of mass 898
carrying 1 H+: (898 + 1) / 1 = 899 m/z
carrying 2 H+: (898 + 2) / 2 = 450 m/z
carrying 3 H+: (898 + 3) / 3 = 300.3 m/z
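The charge-state relation m/z = (M + nH)/n can be tried out with a few lines of Python. This is a minimal sketch: the function names are illustrative, and the default proton mass of 1.00728 Da is an assumption added here (the slide's example rounds H to 1).

```python
PROTON_MASS = 1.00728  # Da; assumed value, the slide example uses H = 1


def mz(molecular_mass, n_charges, proton_mass=PROTON_MASS):
    """Return m/z for a molecule of mass M carrying n protons."""
    if n_charges < 1:
        raise ValueError("n_charges must be a positive integer")
    return (molecular_mass + n_charges * proton_mass) / n_charges


def molecular_mass(mz_value, n_charges, proton_mass=PROTON_MASS):
    """Invert the relation: recover M from an observed m/z and charge."""
    return n_charges * (mz_value - proton_mass)


if __name__ == "__main__":
    # Reproduce the slide's example with H rounded to 1 Da.
    for n in (1, 2, 3):
        print(n, round(mz(898.0, n, proton_mass=1.0), 1))
```

With `proton_mass=1.0` the loop reproduces the three values of the example; with the default proton mass the results shift by about 0.007 m/z.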
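Isotope Distributions
[Figure: isotope distributions (intensity vs m/z) for peptides of m = 1035 Da, m = 1878 Da and m = 2234 Da. The first peak contains only 12C, 14N, 16O, 1H and 32S; further peaks follow at +1 Da, +2 Da and +3 Da.]
Natural abundance of the heavier isotopes:
0.015% 2H
1.11% 13C
0.366% 15N
0.038% 17O, 0.200% 18O
0.75% 33S, 4.21% 34S, 0.02% 36S
Only 12C and 13C:
p = 0.0111
n is the number of C in the peptide
m is the number of 13C in the peptide
T_m is the relative intensity of the peptide peak with m 13C
$$ T_m = \binom{n}{m}\, p^{m} (1 - p)^{n - m} $$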
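The binomial model for the carbon-only isotope distribution (probability p = 0.0111 that any given carbon is 13C) is easy to evaluate numerically. The sketch below is illustrative only: the function name and the example carbon count of 50 are assumptions, not values from the lecture.

```python
from math import comb


def carbon_isotope_distribution(n_carbons, p=0.0111, max_heavy=5):
    """Relative intensities T_m for m = 0..max_heavy 13C atoms.

    Only 12C/13C is modelled; H, N, O and S isotopes are ignored.
    """
    top = min(max_heavy, n_carbons)
    return [
        comb(n_carbons, m) * p**m * (1 - p) ** (n_carbons - m)
        for m in range(top + 1)
    ]


if __name__ == "__main__":
    # Hypothetical peptide with 50 carbon atoms.
    for m, t in enumerate(carbon_isotope_distribution(50)):
        print(f"+{m} Da: {t:.4f}")
```

As the number of carbons grows, the +1 Da and +2 Da peaks gain intensity relative to the first peak, which is the trend shown across the three example masses.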
Isotope Clusters and Charge State
[Figure: isotope clusters (intensity vs m/z) for charge states 1+, 2+ and 3+; neighbouring isotope peaks are 1, 0.5 and 0.33 m/z units apart, respectively.]
What is the Charge State?
Peaks at m/z 713.3225, 713.8239, 714.3251, 714.8263: Δ between the isotopes is 0.5 Da
Peaks at m/z 432.8990, 433.2330, 433.5671, 433.9014: Δ between the isotopes is 0.33 Da
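Because isotope peaks of an ion with charge z are spaced about 1/z apart in m/z, the charge state can be read from the spacing. The following is a minimal sketch; the function name is illustrative, and the 13C−12C mass difference of 1.00335 Da is a value added here rather than taken from the slides.

```python
ISOTOPE_SPACING = 1.00335  # Da, mass difference between 13C and 12C (assumed)


def charge_from_isotope_peaks(mz_values):
    """Estimate the charge state from the mean spacing of isotope peaks."""
    peaks = sorted(mz_values)
    if len(peaks) < 2:
        raise ValueError("need at least two isotope peaks")
    gaps = [high - low for low, high in zip(peaks, peaks[1:])]
    mean_gap = sum(gaps) / len(gaps)
    return round(ISOTOPE_SPACING / mean_gap)


if __name__ == "__main__":
    cluster_a = [713.3225, 713.8239, 714.3251, 714.8263]
    cluster_b = [432.8990, 433.2330, 433.5671, 433.9014]
    print(charge_from_isotope_peaks(cluster_a))
    print(charge_from_isotope_peaks(cluster_b))
```

Averaging the gaps before dividing makes the estimate less sensitive to the measurement error of any single peak.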
Protein Identification by Mass Spectrometry
[Figure: samples → peptides → mass spectrometry; the m/z values in the spectrum give the identity.]
Protein Identification - Exercise
1. Protein identification: NUP1 was genomically tagged with protein A,
affinity purified under two conditions, and the resulting protein
mixture was analyzed with liquid chromatography mass spectrometry
(LC-MS). Search the resulting spectra (NUP1-less-stringent-wash.mgf,
NUP1-more-stringent-wash.mgf) using X! Tandem
(http://h.thegpm.org/tandem/thegpm_tandem.html). Change the taxon
to “S. cerevisiae (budding yeast)” but otherwise keep the default
parameter settings.
a. Look at the list of identified proteins and explain why they are
found in this sample. More information is also available by selecting
the “go”, “path”, “ppi”, “doms”, “string” tabs at the top of the page.
b. Select the “mh” display at the top right of the page, and zoom in to
+/-100 ppm (the default setting for the mass accuracy that was used in
the search). What precursor mass accuracy should we have used?
Zoom in further and determine what precursor mass accuracy could
have been used if the spectra were recalibrated (so that the error
distribution is centered at zero).
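Part b of the exercise reasons about precursor mass errors in parts per million (ppm) and about re-centering the error distribution. The sketch below shows the arithmetic under stated assumptions: the function names are illustrative, the error values are invented for the example, and the mean is used as the calibration offset (a median would work as well).

```python
def ppm_error(measured_mass, theoretical_mass):
    """Relative mass error in parts per million."""
    return (measured_mass - theoretical_mass) / theoretical_mass * 1e6


def recalibrated_window(errors_ppm):
    """Return (offset, half_width) of the error distribution.

    offset is the mean error; half_width is the largest deviation from
    that mean, i.e. the tolerance needed once the offset is removed.
    """
    offset = sum(errors_ppm) / len(errors_ppm)
    half_width = max(abs(e - offset) for e in errors_ppm)
    return offset, half_width


if __name__ == "__main__":
    # Hypothetical precursor errors (ppm) for confidently identified peptides.
    errors = [12.0, 15.5, 9.0, 14.0, 11.5]
    offset, half_width = recalibrated_window(errors)
    print(f"systematic offset: {offset:.1f} ppm")
    print(f"tolerance after recalibration: +/-{half_width:.1f} ppm")
```

Without recalibration the search tolerance must cover the offset plus the spread; after recalibration it only needs to cover the spread.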
Identification – Tandem MS
Tandem MS – Sequence Confirmation
Peptide: SGFLEEDELK

Residue   b ions   y ions
S             88     1166
G            145     1080
F            292     1022
L            405      875
E            534      762
E            663      633
D            778      504
E            907      389
L           1020      260
K           1166      147

[Figure: MS/MS spectrum (% relative abundance vs m/z) with the [M+2H]2+ precursor and the b and y ions labelled at m/z 260, 292, 389, 405, 504, 534, 633, 663, 762, 778, 875, 907, 1020, 1022 and 1080. Neighbouring peaks within a series are separated by one residue mass, for example 113 (I/L) and 129 (E).]
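The b and y ion masses listed for SGFLEEDELK can be recomputed from monoisotopic residue masses. The sketch below assumes singly charged fragments, a water mass of 18.0106 Da and a proton mass of 1.00728 Da (both values added here); the function names are illustrative.

```python
# Monoisotopic residue masses (Da) for the residues in the example peptide.
MONO = {
    "S": 87.032, "G": 57.0215, "F": 147.068, "L": 113.084,
    "E": 129.043, "D": 115.027, "K": 128.095,
}
WATER = 18.0106   # Da, assumed value
PROTON = 1.00728  # Da, assumed value


def b_ions(peptide):
    """Singly charged b ions b1..b(n-1): N-terminal fragments."""
    total, ions = PROTON, []
    for residue in peptide[:-1]:
        total += MONO[residue]
        ions.append(total)
    return ions


def y_ions(peptide):
    """Singly charged y ions y1..y(n-1): C-terminal fragments."""
    total, ions = WATER + PROTON, []
    for residue in reversed(peptide[1:]):
        total += MONO[residue]
        ions.append(total)
    return ions


if __name__ == "__main__":
    peptide = "SGFLEEDELK"
    print("b:", [round(m, 2) for m in b_ions(peptide)])
    print("y:", [round(m, 2) for m in y_ions(peptide)])
```

The computed values carry the fractional part of the monoisotopic masses, so they differ from the integer labels on the slide by a few tenths of a dalton.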
Tandem MS – de novo Sequencing
Amino acid masses

1-letter  3-letter  Chemical
code      code      formula     Monoisotopic  Average
A         Ala       C3H5ON        71.0371      71.0788
R         Arg       C6H12ON4     156.101      156.188
N         Asn       C4H6O2N2     114.043      114.104
D         Asp       C4H5O3N      115.027      115.089
C         Cys       C3H5ONS      103.009      103.139
E         Glu       C5H7O3N      129.043      129.116
Q         Gln       C5H8O2N2     128.059      128.131
G         Gly       C2H3ON        57.0215      57.0519
H         His       C6H7ON3      137.059      137.141
I         Ile       C6H11ON      113.084      113.159
L         Leu       C6H11ON      113.084      113.159
K         Lys       C6H12ON2     128.095      128.174
M         Met       C5H9ONS      131.04       131.193
F         Phe       C9H9ON       147.068      147.177
P         Pro       C5H7ON        97.0528      97.1167
S         Ser       C3H5O2N       87.032       87.0782
T         Thr       C4H7O2N      101.048      101.105
W         Trp       C11H10ON2    186.079      186.213
Y         Tyr       C9H9O2N      163.063      163.176
V         Val       C5H9ON        99.0684      99.1326

[Figure: MS/MS spectrum (% relative abundance vs m/z) with the [M+2H]2+ precursor and peaks at m/z 260, 292, 389, 405, 504, 534, 633, 663, 762, 778, 875, 907, 1020, 1022 and 1080.]
Mass Differences → Sequences consistent with spectrum
Tandem MS – de novo Sequencing
Peaks (m/z): 260 292 389 405 504 534 633 663 762 778 875 907 1020 1022 1079
Mass differences between each peak and every higher peak:
 260:  32 129 145 244 274 373 403 502 518 615 647 760 762 819
 292:  97 113 212 242 341 371 470 486 583 615 728 730 787
 389:  16 115 145 244 274 373 389 486 518 631 633 690
 405:  99 129 228 258 357 373 470 502 615 617 674
 504:  30 129 159 258 274 371 403 516 518 575
 534:  99 129 228 244 341 373 486 488 545
 633:  30 129 145 242 274 387 389 446
 663:  99 115 212 244 357 359 416
 762:  16 113 145 258 260 317
 778:  97 129 242 244 301
 875:  32 145 147 204
 907: 113 115 172
1020:   2  59
1022:  57
Tandem MS – de novo Sequencing
Mass differences that match an amino acid residue mass are assigned to that residue (E = 129, D = 115, I/L = 113, V = 99, P = 97, F = 147, G = 57); small differences that match no residue are marked X.
Sequences consistent with the spectrum:
…GF(I/L)EEDE(I/L)…
…(I/L)EDEE(I/L)FG…
Completing the sequence from the peptide mass (M+H = 1166):
1166 – 1079 = 87 => S
SGF(I/L)EEDE(I/L)…
1166 – 1020 – 18 = 128 => K or Q
SGF(I/L)EEDE(I/L)(K/Q)
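The mass-difference step of de novo sequencing can be sketched as a lookup: take every pair of peaks and keep the pairs whose difference matches a residue mass. The code below is a simplified illustration, not a full sequencing algorithm; the function names and the 0.5 Da tolerance are assumptions, and I and L are merged because they have the same mass.

```python
from itertools import combinations

# Monoisotopic residue masses (Da); I and L are indistinguishable by mass.
RESIDUES = {
    "G": 57.0215, "A": 71.0371, "S": 87.032, "P": 97.0528, "V": 99.0684,
    "T": 101.048, "C": 103.009, "I/L": 113.084, "N": 114.043, "D": 115.027,
    "Q": 128.059, "K": 128.095, "E": 129.043, "M": 131.04, "H": 137.059,
    "F": 147.068, "R": 156.101, "Y": 163.063, "W": 186.079,
}


def residue_matches(difference, tolerance=0.5):
    """Names of residues whose mass lies within tolerance of difference."""
    return [
        name for name, mass in RESIDUES.items()
        if abs(difference - mass) <= tolerance
    ]


def mass_difference_edges(peaks, tolerance=0.5):
    """All (low_peak, high_peak, residue) triples supported by the peaks."""
    edges = []
    for low, high in combinations(sorted(peaks), 2):
        for name in residue_matches(high - low, tolerance):
            edges.append((low, high, name))
    return edges


if __name__ == "__main__":
    peaks = [260, 292, 389, 405, 504, 534, 633, 663,
             762, 778, 875, 907, 1020, 1022, 1079]
    for low, high, name in mass_difference_edges(peaks):
        print(f"{low} -> {high}: {name}")
```

Some of the reported pairs connect a b ion to a y ion (for example 292 → 389, a difference of 97 that looks like P), which is one reason several sequences can be consistent with the same spectrum.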
Tandem MS – de novo Sequencing
Challenges in de novo sequencing
• Neutral loss (-H2O, -NH3)
• Modifications
• Background peaks
• Incomplete information
Tandem MS – Database Search
Experiment: Lysis → Fractionation → Digestion → LC-MS → MS/MS
Search: Sequence DB → Pick Protein → Pick Peptide → All Fragment Masses → Compare with MS/MS, Score, Test Significance
Repeat for all peptides. Repeat for all proteins.
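The database search loop (pick a protein, pick a peptide, compute all fragment masses, compare with the MS/MS spectrum and score) can be sketched in a few dozen lines. Everything specific here is an assumption made for illustration: the two protein sequences are invented, digestion follows the usual trypsin rule (cleave after K or R, not before P), the score is simply the number of observed peaks matched within 0.6 Da, and precursor mass filtering and significance testing are left out.

```python
import re

MONO = {
    "G": 57.0215, "A": 71.0371, "S": 87.032, "P": 97.0528, "V": 99.0684,
    "T": 101.048, "C": 103.009, "I": 113.084, "L": 113.084, "N": 114.043,
    "D": 115.027, "Q": 128.059, "K": 128.095, "E": 129.043, "M": 131.04,
    "H": 137.059, "F": 147.068, "R": 156.101, "Y": 163.063, "W": 186.079,
}
WATER, PROTON = 18.0106, 1.00728  # Da, assumed values


def tryptic_peptides(protein):
    """Split after K or R unless the next residue is P."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]


def fragment_masses(peptide):
    """Singly charged b and y ions (b1..b(n-1), y1..y(n-1))."""
    masses, b, y = [], PROTON, WATER + PROTON
    for i in range(1, len(peptide)):
        b += MONO[peptide[i - 1]]
        y += MONO[peptide[-i]]
        masses.extend([b, y])
    return masses


def score(peptide, observed, tolerance=0.6):
    """Number of observed peaks explained by a theoretical fragment."""
    theoretical = fragment_masses(peptide)
    return sum(
        any(abs(peak - t) <= tolerance for t in theoretical)
        for peak in observed
    )


def search(database, observed):
    """Return (best_score, protein_name, peptide) over the whole database."""
    candidates = [
        (score(pep, observed), name, pep)
        for name, sequence in database.items()
        for pep in tryptic_peptides(sequence)
    ]
    return max(candidates)


if __name__ == "__main__":
    # Hypothetical two-protein database, for illustration only.
    database = {
        "hypothetical_1": "MKSGFLEEDELKTAYR",
        "hypothetical_2": "GASPVTCLNDQKEMHFRYW",
    }
    observed = [260, 292, 389, 405, 504, 534, 633, 663,
                762, 778, 875, 907, 1020, 1022, 1080]
    print(search(database, observed))
```

A real search engine such as X! Tandem first restricts candidates by precursor mass and then converts the raw score into a statistical significance; this sketch only shows the nested loops and the comparison step.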
Information Content in a Single Mass Measurement
[Figure: two panels, Human and S. cerevisiae, showing the average number of matching peptides (axis from 1 to 10) against tryptic peptide mass [Da] (1000 to 3000).]
Protein Identification and Quantitation by Mass Spectrometry
[Figure: samples → peptides → mass spectrometry; the intensities in the spectrum give the quantity.]
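Protein Quantitation by Mass Spectrometry
Sample i, Protein j, Peptide k
Lysis → Fractionation → Digestion → LC-MS
$C_{ij}$ is the abundance of protein j in sample i and $I^{MS}_{ik}$ is the measured MS intensity of peptide k in sample i. Each step contributes a factor p, labelled by its superscript: L (lysis), Pr (protein level), D (digestion), Pep (peptide level), LC and MS.
$$ I^{MS}_{ik} \propto \sum_j C_{ij}\, p^{L}_{ij}\, p^{Pr}_{ij}\, p^{D}_{ijk}\, p^{Pep}_{ik}\, p^{LC}_{ik}\, p^{MS}_{ik} $$
The proportionality factor depends only on the peptide k.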
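Quantitation – Label-Free (MS)
Sample i, Protein j, Peptide k
Lysis → Fractionation → Digestion → LC-MS
Assumption: the product $p^{L}_{ij}\, p^{Pr}_{ij}\, p^{D}_{ijk}\, p^{Pep}_{ik}\, p^{LC}_{ik}\, p^{MS}_{ik}$, including its peptide-specific prefactor, is constant for all samples. For two samples $i_n$ and $i_m$ this gives
$$ \frac{C_{i_n j}}{C_{i_m j}} = \frac{I^{MS}_{i_n k}}{I^{MS}_{i_m k}} $$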
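Under the label-free assumption, the abundance ratio of a protein between two samples is estimated from the MS intensity ratios of its peptides. The sketch below uses invented intensities, and taking the median over peptides is a choice made here for robustness, not something prescribed by the lecture.

```python
from statistics import median


def protein_ratio(intensities_n, intensities_m):
    """Median ratio of peptide intensities between two samples.

    Both arguments map peptide identifiers to MS intensities; only
    peptides seen in both samples with non-zero intensity are used.
    """
    ratios = [
        intensities_n[peptide] / intensities_m[peptide]
        for peptide in intensities_n
        if intensities_m.get(peptide, 0) > 0
    ]
    if not ratios:
        raise ValueError("no peptide was quantified in both samples")
    return median(ratios)


if __name__ == "__main__":
    # Hypothetical intensities for three peptides of one protein.
    sample_n = {"pepA": 2.0e6, "pepB": 5.0e5, "pepC": 1.2e6}
    sample_m = {"pepA": 1.0e6, "pepB": 2.5e5, "pepC": 4.0e5}
    print(protein_ratio(sample_n, sample_m))
```

Each peptide gives its own estimate of the same protein ratio, so disagreement among peptides is a useful warning that the constant-factor assumption may not hold.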
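Quantitation – Metabolic Labeling
Sample i, Protein j, Peptide k
Light (L) sample $i_n$ and Heavy (H) sample $i_m$ → mixing (M) → Lysis → Fractionation → Digestion → LC-MS
[Figure: the light and heavy forms of peptide k appear in the MS spectrum with intensities $I^{L}_{i_n k}$ and $I^{H}_{i_m k}$, which are compared to obtain the ratio of the protein abundances $C_{i_n j}$ and $C_{i_m j}$.]
Oda et al. PNAS 96 (1999) 6591
Ong et al. MCP 1 (2002) 376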
Quantitation – Labeled Synthetic Peptides
Steps shown: Lysis, Fractionation, Digestion, addition of Synthetic Peptides (Heavy), Enrichment with Peptide antibody, LC-MS
Assumption: all losses after mixing are identical for the heavy and light isotopes and [condition on the factors $p^{L}$, $p^{Pr}$, $p^{D}$ and $p^{M}$; equation not recoverable]
[Figure: light (L) and heavy (H) peptide peaks in the MS spectrum.]
Gerber et al. PNAS 100 (2003) 6940
Anderson, N.L., et al. Proteomics 3 (2004) 235-44
Estimating peptide quantity
[Figure: a peak (intensity vs m/z) quantified by peak height, peak area or curve fitting.]
What is the best way to estimate quantity?
Peak height
- resistant to interference
- poor statistics
Peak area
- better statistics
- more sensitive to interference
Curve fitting
- better statistics
- needs to know the peak shape
- slow
Spectrum counting
- resistant to interference
- easy to implement
- poor statistics for low-abundance proteins
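Peak height and peak area, two of the estimators compared above, can be computed directly from sampled data. This is a minimal sketch on a synthetic Gaussian peak; the peak position, width, amplitude and sampling step are invented for the example, and curve fitting is not shown.

```python
from math import exp


def peak_height(intensities):
    """Quantity estimate from the single highest data point."""
    return max(intensities)


def peak_area(mz_values, intensities):
    """Quantity estimate from the trapezoidal area under the peak."""
    return sum(
        (mz_values[i + 1] - mz_values[i])
        * (intensities[i] + intensities[i + 1]) / 2
        for i in range(len(mz_values) - 1)
    )


# Synthetic Gaussian peak (hypothetical parameters).
CENTER, SIGMA, AMPLITUDE = 500.0, 0.05, 100.0
MZ = [CENTER + (i - 100) * 0.005 for i in range(201)]
SIGNAL = [AMPLITUDE * exp(-0.5 * ((x - CENTER) / SIGMA) ** 2) for x in MZ]

if __name__ == "__main__":
    print("height:", peak_height(SIGNAL))
    print("area:  ", round(peak_area(MZ, SIGNAL), 3))
```

The height uses one data point, which is why its statistics are poor, while the area sums over all points in the integration window and therefore also picks up any interfering signal that falls inside it.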
Proteomics Informatics - Summary
• Structure of mass spectrometry data
• Protein identification
• Protein quantitation
Next Lecture: Gene Expression
Protein Quantitation - Exercise
2. Protein quantitation: Two breast tumor xenografts (one basal and
one luminal) were analyzed by LC-MS and the spectral counts for
the identified peptides in the different analyses are listed in
twosample-three-replicate-comparison.txt.
a. Compare replicate one of Sample 1 with replicate one of Sample 2
using proteomics_no_replicate.py. Which differences are significant?
b. Compare replicates one and two of Sample 1 using
proteomics_one_replicate.py. Compare to the distribution in 2a. Which
differences are significant in 2a?
c. Compare the three replicates of Sample 1 with the three replicates
of Sample 2 using proteomics_three_replicates.py. Which differences
are significant?
d. In cases when a protein is not observed in one sample, how many
spectra do we need to observe in the other sample to say that there is
a significant difference?
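One way to reason about part d is an exact binomial test on the spectral counts. The sketch below is an illustration under strong assumptions that are not part of the exercise: both analyses have the same total number of spectra, there are no replicates, and each spectrum of the protein is equally likely to come from either sample under the null hypothesis. It does not reproduce what the course scripts compute.

```python
from math import comb


def binomial_two_sided_p(count_a, count_b, p_a=0.5):
    """Exact two-sided binomial p-value for a split of count_a vs count_b.

    Sums the probabilities of all splits that are at most as likely as
    the observed one.
    """
    n = count_a + count_b
    probabilities = [
        comb(n, k) * p_a**k * (1 - p_a) ** (n - k) for k in range(n + 1)
    ]
    observed = probabilities[count_a]
    total = sum(p for p in probabilities if p <= observed * (1 + 1e-9))
    return min(1.0, total)


def min_counts_for_significance(alpha=0.05):
    """Smallest count in one sample, with zero in the other, giving p < alpha."""
    n = 1
    while binomial_two_sided_p(n, 0) >= alpha:
        n += 1
    return n


if __name__ == "__main__":
    for n in range(1, 9):
        print(n, binomial_two_sided_p(n, 0))
    print("minimum count:", min_counts_for_significance())
```

With these assumptions a protein seen n times in one sample and never in the other has a two-sided p-value of 2 × 0.5^n; replicate measurements, as in parts b and c, give a more realistic picture of the variability.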
Phosphorylation Exercise: an unmodified peptide
Theoretical fragment ions
Spectrum of the phosphorylated peptide
Stat3_cytosolic_a #7952 RT: 62.88 AV: 1 NL: 6.00E3
T: ITMS +c ESI d Full ms2 1196.04@cid35.00 [315.00-2000.00]
[Figure: MS/MS spectrum, relative abundance (0 to 100) vs m/z, shown in panels covering m/z 315 to 2000 with labelled fragment peaks.]
Spectrum of the peptide
phosphorylated at a different site
Stat3_cytosolic_a #8053 RT: 63.59 AV: 1 NL: 6.15E3
T: ITMS +c ESI d Full ms2 1196.04@cid35.00 [315.00-2000.00]
[Figure: MS/MS spectrum, relative abundance (0 to 100) vs m/z, shown in panels covering m/z 315 to 2000 with labelled fragment peaks.]