pptx - Fenyo Lab

Proteomics Informatics –
Protein identification II: search engines and
protein sequence databases (Week 5)
General Criteria for a Good Protein Identification Algorithm
The response to random input data should be random.
Maximum number of correct identifications and minimum number of incorrect identifications for any data set.
Maximal separation between scores for correct identifications and the distribution of scores for random matching proteins for any data set.
The statistical significance of the results should be calculated.
The searches should be fast.
Search Parameters
Parent tolerance: +/- daltons/ppm
Frag. tolerance: +/- daltons/ppm
Complete mods: Cys alkylation
Potential mods (artifacts): Met/Trp oxidation, Gln/Asn deamidation
Potential mods (PTMs): Phosphoryl, sulfonyl, acetyl, methyl, glycosyl, GPI
Cleavage: Trypsin ([KR]|{P})
Scoring method: Scores or statistics
Sequences: FASTA files
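The cleavage rule [KR]|{P} (cut after K or R unless the next residue is P) can be illustrated with a short in-silico digest. This is a minimal sketch, not the code of any search engine named in these slides; the function name `tryptic_digest` and the `missed_cleavages` parameter are assumptions made for illustration.

```python
import re

def tryptic_digest(sequence, missed_cleavages=0):
    """Return peptides from cleaving after K or R, except before P.

    missed_cleavages is the maximum number of skipped cleavage sites
    allowed inside one peptide (hypothetical parameter name).
    """
    # Position just after every K/R that is not followed by P.
    sites = [m.end() for m in re.finditer(r"[KR](?!P)", sequence)]
    # Peptide boundaries: start, internal cleavage sites, end.
    bounds = [0] + [s for s in sites if s < len(sequence)] + [len(sequence)]
    peptides = []
    for i in range(len(bounds) - 1):
        last = min(i + 2 + missed_cleavages, len(bounds))
        for j in range(i + 1, last):
            peptides.append(sequence[bounds[i]:bounds[j]])
    return peptides

print(tryptic_digest("AAKRPCCKDD"))      # fully cleaved peptides
print(tryptic_digest("AAKRPCCKDD", 1))   # allowing one missed cleavage
```

Allowing more missed cleavages enlarges the list of candidate peptides, which is one reason the choice of search parameters affects both speed and significance.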
Identification – Peptide Mass Fingerprinting
[Workflow diagram: Sequence DB, Pick Protein, Digestion, All Peptide Masses, MS; Compare, Score, Test Significance; Repeat for each protein; Identified Proteins.]
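The compare step of peptide mass fingerprinting can be sketched as follows. This is a simplified illustration under stated assumptions: it ranks proteins by a plain count of matching masses, which is not the scoring or significance test used by ProFound or Mascot. The names `peptide_mass`, `count_matches`, and `rank_proteins` are hypothetical; the residue masses are standard monoisotopic values.

```python
# Standard monoisotopic residue masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056

def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER

def count_matches(measured, theoretical, tol_da=0.5):
    """Count measured masses within tol_da of any theoretical mass."""
    return sum(any(abs(m - t) <= tol_da for t in theoretical) for m in measured)

def rank_proteins(measured, digests, tol_da=0.5):
    """digests maps protein name -> list of peptides from an in-silico digest.

    Returns (name, match_count) pairs, best first. The match count is a
    naive stand-in for a real score.
    """
    scores = {
        name: count_matches(measured, [peptide_mass(p) for p in peptides], tol_da)
        for name, peptides in digests.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

digests = {"P1": ["AAK", "GGR"], "P2": ["WWK"]}
measured = [peptide_mass("AAK") + 0.1, peptide_mass("GGR") - 0.2]
print(rank_proteins(measured, digests))
```

A real search would repeat this for every protein in the sequence database and then test the significance of the best score.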
Response to Random Data
[Figure: normalized frequency of scores for random data.]
ProFound – Search Parameters
http://prowl.rockefeller.edu/
ProFound – Protein Identification
by Peptide Mapping
P(k|DI) \propto P(k|I) \, \frac{(N-r)!}{N!} \prod_{i=1}^{r} \left\{ \sqrt{\frac{2}{\pi}} \, \frac{m_{max}-m_{min}}{\sigma_i} \sum_{j=1}^{g_i} \exp\left[ -\frac{(m_i - m_{ij0})^2}{2\sigma_i^2} \right] \right\} F_{pattern}
W. Zhang & B.T. Chait,
Analytical Chemistry
72 (2000) 2482-2489
ProFound Results
Peptide Mapping – Mass Accuracy
[Figure: two panels, Mascot and ProFound, plotting Score and -log(e) against Mass Tolerance (Da), 0 to 2 Da.]
Peptide Mapping - Database Size
Expectation Values
Peptide mapping example:
Database        Expectation value
S. cerevisiae   4.8e-7
Fungi           8.4e-6
All Taxa        2.9e-4
Missed Cleavage Sites
Expectation Values
Peptide mapping example:
u (missed cleavage sites)   Expectation value
u=1                         4.8e-7
u=2                         1.1e-5
u=4                         6.8e-4
Peptide Mapping - Partial Modifications
Protein    Searched without modifications   Searched with possible phosphorylation of S/T/Y
DARPP-32   0.00006                          0.01
CFTR       0.00002                          0.005
Even if the protein is modified, it is usually better to search a protein sequence database without specifying possible modifications when using peptide mapping data.
Peptide Mapping - Ranking by
Direct Calculation of the Significance
Tandem MS – Database Search
[Workflow diagram: Lysis, Fractionation, Digestion, LC-MS, MS/MS; Sequence DB, Pick Protein, Pick Peptide, All Fragment Masses, MS/MS; Compare, Score, Test Significance; Repeat for all peptides; Repeat for all proteins.]
Algorithms
Comparing and Optimizing Algorithms
[Figure: for Algorithm 1 and Algorithm 2, score distributions of false and true identifications, each with the corresponding curve of Sensitivity versus 1-Specificity.]
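A comparison of two algorithms by sensitivity versus 1-specificity can be sketched by sweeping a score threshold over the true and false score distributions. The code below is an illustrative sketch only; `roc_points` and `auc` are hypothetical helper names, and the example scores are invented for demonstration.

```python
def roc_points(true_scores, false_scores):
    """Return (1-specificity, sensitivity) pairs for every score threshold."""
    thresholds = sorted(set(true_scores) | set(false_scores), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        # Fraction of correct identifications accepted at this threshold.
        sensitivity = sum(s >= t for s in true_scores) / len(true_scores)
        # Fraction of incorrect identifications accepted at this threshold.
        false_rate = sum(s >= t for s in false_scores) / len(false_scores)
        points.append((false_rate, sensitivity))
    return points

def auc(points):
    """Area under the curve by the trapezoid rule."""
    pts = sorted(points)
    return sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

# Invented scores: algorithm 1 separates true from false better than algorithm 2.
algorithm_1 = roc_points([5, 6, 7, 8], [1, 2, 3, 4])
algorithm_2 = roc_points([2, 4, 6, 8], [1, 3, 5, 7])
print(auc(algorithm_1), auc(algorithm_2))
```

The larger the separation between the true and false score distributions, the closer the area under the curve is to 1.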
MS/MS - Parent Mass Error
and Enzyme Specificity
Expectation Values
MS/MS example:
Parent mass error (Δm)   Enzyme specificity   Expectation value
2                        Trypsin              2.5e-5
100                      Trypsin              2.5e-5
2                        non-specific         7.9e-5
100                      non-specific         1.6e-4
xII  xI (nb!ny !)
Sequest
Cross-correlation
X! Tandem - Search Parameters
http://www.thegpm.org/
Generic search engine
[Diagram: spectra and sequences are the inputs; test all cleavages, modifications, & mutations for all sequences.]
Conventional, single stage searching
Some hard problems in MS/MS
analysis in proteomics
Allowing for unanticipated peptide cleavages
- e.g., chymotryptic contamination in trypsin
- calculation order ~ 200 × tryptic cleavage
- “unfortunate” coefficient
Determining potential modifications
- e.g., oxidation, phosphorylation, deamidation
- calculation order 2^n
- NP complete
Detecting point mutations
- e.g., sequence homology
- calculation order 18^N
- NP complete
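The exponential cost of potential modifications can be made concrete by enumerating every modified form of one peptide: with n candidate sites, each either modified or not, there are 2^n forms to test. The sketch below assumes phosphorylation of S, T, or Y as the example modification and marks modified residues in lower case; `modified_forms` is a hypothetical name.

```python
from itertools import product

def modified_forms(peptide, residues="STY"):
    """Enumerate all on/off combinations of a potential modification.

    Modified residues are written in lower case (illustrative convention).
    """
    sites = [i for i, aa in enumerate(peptide) if aa in residues]
    forms = []
    for mask in product((False, True), repeat=len(sites)):
        modified = {site for site, on in zip(sites, mask) if on}
        forms.append("".join(aa.lower() if i in modified else aa
                             for i, aa in enumerate(peptide)))
    return forms

forms = modified_forms("ASTYK")
print(len(forms))  # 3 candidate sites, so 2**3 forms
print(forms)
```

Each additional potential modification site doubles the number of forms, which motivates restricting such searches to a reduced set of sequences.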
Multi-stage searching
[Diagram: spectra are first searched against sequences with tryptic cleavage; the resulting sequences are then searched again with Modifications #1, Modifications #2, and Point mutation.]
X! Tandem
Search Results
Sequence Annotations
Mascot
http://www.matrixscience.com/cgi/search_form.pl?FORMVER=2&SEARCH=MIS
Identification – Spectrum Library Search
[Workflow diagram: Lysis, Fractionation, Digestion, LC-MS/MS, MS/MS; Spectrum Library, Pick Spectrum; Compare, Score, Test Significance; Repeat for all spectra; Identified Proteins.]
Steps in making an
Annotated Spectrum Library (ASL):
1. Find the best 10 spectra for a particular
sequence, with the same PTMs and charge.
2. Add the spectra together and normalize the
intensity values.
3. Assign a “quality” value: the median
expectation value of the 10 spectra used.
4. Record the 20 most intense peaks in the
averaged spectrum, its parent ion z, m/z,
sequence, protein accessions & quality.
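The four steps above can be sketched in a few lines. This is an illustration under assumptions, not the actual library-building code: spectra are added by rounding m/z into bins of width `bin_width`, intensities are normalized so the largest summed peak is 1, and the input layout (a list of `(expectation_value, peaks)` tuples for one sequence, PTM state, and charge) is hypothetical.

```python
from collections import defaultdict
from statistics import median

def build_library_entry(spectra, n_best=10, n_peaks=20, bin_width=1.0):
    """spectra: list of (expectation_value, [(mz, intensity), ...])."""
    # Step 1: keep the best spectra (lowest expectation values).
    best = sorted(spectra, key=lambda s: s[0])[:n_best]
    # Step 2: add the spectra together in m/z bins (assumed binning).
    summed = defaultdict(float)
    for _, peaks in best:
        for mz, intensity in peaks:
            summed[round(mz / bin_width)] += intensity
    # Normalize so the most intense summed peak equals 1.
    top = max(summed.values())
    # Step 3: quality is the median expectation value of the spectra used.
    quality = median(e for e, _ in best)
    # Step 4: record the most intense peaks of the combined spectrum.
    strongest = sorted(summed.items(), key=lambda kv: kv[1], reverse=True)[:n_peaks]
    peaks = sorted((b * bin_width, inten / top) for b, inten in strongest)
    return {"peaks": peaks, "quality": quality}

spectra = [
    (1e-3, [(100.1, 10.0), (200.2, 5.0)]),
    (1e-5, [(100.0, 30.0), (300.0, 4.0)]),
]
print(build_library_entry(spectra, n_peaks=2))
```

A full entry would also store the parent ion z and m/z, the sequence, and the protein accessions alongside the peaks and quality.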
Spectrum Library Characteristics – Peptide Length
[Figure: fraction of library (%) versus peptide length, 0 to 50.]
Spectrum Library Characteristics – Protein Coverage
[Figure: % coverage (residues and peptides) versus protein Mr (kDa), 10 to 190.]
Identification – Spectrum Library Search
Library spectrum
(5:25)
Test spectrum
(5:25)
Results: 4 peaks selected, 1 peak missed
Identification – Spectrum Library Search
How likely is this?
Apply a hypergeometric probability model:
- 25 possible m/z values;
- 5 peaks in the library spectrum; and
- 4 selected by the test spectrum.
Matches   Probability
1         0.45
2         0.15
3         0.016
4         0.00039
5         0.0000037
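The hypergeometric model can be checked numerically. The sketch below assumes the parametrization stated on the slide (25 possible m/z values, 5 library peaks, 4 peaks selected by the test spectrum) and reproduces the probabilities for 1 to 4 matches. The value listed for 5 matches is not reproduced by this parametrization, since 4 selected peaks cannot give 5 matches. The function name `hypergeom_pmf` is hypothetical.

```python
from math import comb

def hypergeom_pmf(k, n_bins, n_library, n_test):
    """Probability of exactly k matches by chance.

    n_bins: possible m/z values; n_library: peaks in the library spectrum;
    n_test: peaks selected by the test spectrum.
    """
    if k < 0 or k > n_test or k > n_library:
        return 0.0
    return (comb(n_library, k) * comb(n_bins - n_library, n_test - k)
            / comb(n_bins, n_test))

for k in range(1, 5):
    print(k, hypergeom_pmf(k, n_bins=25, n_library=5, n_test=4))
```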
Identification – Spectrum Library Search
What if there are 1000 possible m/z values and 20 peaks in both the test and library spectra?
[Figure: p (log scale, 1.0E+00 to 1.0E-14) versus number of matches (1 to 10).]
1 matched: p = 0.6
5 matched: p = 0.0002
10 matched: p = 0.0000000000001
Identification – Spectrum Library Search
[Diagram: an experimental mass spectrum (m/z) is compared with a library of assigned mass spectra to give the best search result.]
X! Hunter
X! Hunter algorithm:
1. Use dot product to find a library spectrum
that best matches a test spectrum.
2. Calculate p-value with hypergeometric
distribution.
3. Use p-value to calculate expectation value,
given the identification parameters.
4. If expectation value is less than the median
expectation value of the library spectrum,
report the median value.
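The four steps of the algorithm can be sketched as below. This is a simplified illustration, not the X! Hunter implementation. Several details are assumptions not given in the slides: spectra are dicts mapping an m/z bin to an intensity, the p-value is the hypergeometric probability of at least the observed number of shared peaks, and the expectation value is taken as the p-value times the number of library spectra searched.

```python
from math import comb, sqrt

def dot_product(a, b):
    """Normalized dot product of two spectra (dicts: m/z bin -> intensity)."""
    num = sum(a[k] * b[k] for k in a.keys() & b.keys())
    den = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def p_value(k, n_bins, n_library, n_test):
    """Chance probability of k or more shared peaks (assumed tail form)."""
    upper = min(n_library, n_test)
    total = sum(comb(n_library, i) * comb(n_bins - n_library, n_test - i)
                for i in range(k, upper + 1))
    return total / comb(n_bins, n_test)

def search(test, library, n_bins):
    """library: list of dicts with 'peaks' and 'quality' (median expectation)."""
    # Step 1: best library spectrum by dot product.
    best = max(library, key=lambda entry: dot_product(test, entry["peaks"]))
    # Step 2: p-value from the number of shared peaks.
    shared = len(test.keys() & best["peaks"].keys())
    p = p_value(shared, n_bins, len(best["peaks"]), len(test))
    # Step 3: expectation value (assumed: p times number of candidates).
    expectation = p * len(library)
    # Step 4: never report a value below the library spectrum's median.
    return best, max(expectation, best["quality"])

library = [
    {"name": "A", "peaks": {100: 1.0, 200: 0.5, 300: 0.2}, "quality": 1e-6},
    {"name": "B", "peaks": {150: 1.0, 250: 0.5, 350: 0.2}, "quality": 1e-6},
]
best, e = search({100: 0.9, 200: 0.6, 300: 0.1}, library, n_bins=1000)
print(best["name"], e)
```

Step 4 acts as a floor: a match cannot be reported as more significant than the spectra from which the library entry was built.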
X! Hunter Result
Query Spectrum
Library Spectrum
Dynamic Range In Proteomics
[Figure: distribution of protein amounts, number of proteins versus log(protein amount), with the experimental dynamic range and the desired dynamic range marked.]
The goal is to identify and characterize all components of a proteome.
Large discrepancy between the experimental dynamic range and the range of amounts of different proteins in a proteome.
[Workflow diagram: Sample (Protein Abundance), Extraction, Protein Labeling, Protein Separation, Digestion, Peptide Labeling, Peptide Separation, Ionization, Mass Separation, Fragmentation, Mass Separation, Detection. Annotations on the steps: limit of amount of material, loss of material, separation of material, detection limit, dynamic range.]
Experimental Designs Simulated
[Diagram: Sample (Protein Abundance), Protein Separation, Digestion, Peptide Separation (# of peptides per "Retention time" bin), Mass Spectrometry (MS dynamic range applied to the peptide masses in each bin).]
Parameters in Simulation
Sample
● Distribution of protein amounts in sample
Protein Separation
● # of Proteins in each fraction
Digestion
Peptide Separation
● Total amount of peptides that are loaded on column (limited by column loading capacity)
● Loss of peptides before binding to the column
● # of peptide fractions
● Loss of peptides after elution off the column
Mass Spectrometry
● Distribution of mass spectrometric response for different peptides present at the same amount
● Dynamic range of mass spectrometer
● Detection limit of mass spectrometer
Simulation Results for 1D-LC-MS
[Figure: number of proteins versus log(protein amount) for Tissue and Body Fluid, with no protein separation and with protein separation into 10 fractions. Workflow shown: complex mixtures of proteins, digestion, RPC, MS analysis.]
Success Rate of a Proteomics Experiment
[Figure: distribution of protein amounts, number of proteins versus log(protein amount), with the proteins detected highlighted.]
DEFINITION: The success rate of a proteomics experiment
is defined as the number of proteins detected divided by
the total number of proteins in the proteome.
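The definition can be illustrated with a toy calculation. The sketch below is not the simulation described in these slides: it assumes log10 protein amounts follow a normal distribution and that a protein is detected whenever its amount is at or above a single detection limit. All parameter names and values are hypothetical.

```python
import random

def simulate_success_rate(n_proteins=10000, mean_log=3.0, sd_log=1.0,
                          detection_limit_log=3.0, seed=0):
    """Success rate = proteins detected / total proteins in the proteome."""
    rng = random.Random(seed)
    # Assumed distribution of log10(protein amount) across the proteome.
    amounts = [rng.gauss(mean_log, sd_log) for _ in range(n_proteins)]
    # Assumed detection rule: a single threshold on the amount.
    detected = sum(a >= detection_limit_log for a in amounts)
    return detected / n_proteins

print(simulate_success_rate())                         # limit at the mean
print(simulate_success_rate(detection_limit_log=1.0))  # more sensitive
```

Lowering the detection limit, or narrowing the distribution of amounts in each fraction, raises the success rate in this toy model.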
Relative Dynamic Range of a Proteomics Experiment
[Figure: distribution of protein amounts and proteins detected, number of proteins versus log(protein amount), with the fraction of proteins detected and RDR90, RDR50, and RDR10 marked.]
DEFINITION: RELATIVE DYNAMIC RANGE, RDRx,
where x is e.g. 10%, 50%, or 90%
Number of Proteins in Mixture
[Figure: success rate and relative dynamic range (RDR50) versus number of proteins in mixture (1 to 100000) for Tissue and Body Fluid. Numbered points 1 and 2 are shown as distributions of number of proteins versus log(protein amount).]
Amount of Peptides Loaded on the Column
[Figure: success rate and relative dynamic range (RDR50) versus amount loaded (0.01 to 100 µg) for Tissue and Body Fluid. Numbered points 2 and 3 are shown as distributions of number of proteins versus log(protein amount).]
Peptide Separation
[Figure: success rate and relative dynamic range (RDR50) versus number of peptide fractions (10 to 100000) for Tissue and Body Fluid. Numbered points 3 and 4 are shown as distributions of number of proteins versus log(protein amount).]
Amount loaded and peptide separation
[Figure: relative dynamic range versus success rate for Tissue as protein separation, amount loaded, and peptide separation are varied, with numbered points 1 to 6 shown as distributions of number of proteins versus log(protein amount).]
Ranges:
Protein separation: 30000 to 3000 proteins in each fraction
Amount loaded: 0.1 µg to 10 µg
Peptide separation: 100 to 1000 fractions
Order:
1. Protein separation, 2. Peptide separation, 3. Amount loaded
1. Protein separation, 2. Amount loaded, 3. Peptide separation
Repeat Analysis
[Figure sequence: results after 1, 2, 3, 4, 5, 6, 7, and 8 analyses.]
Repeat Analysis: Simulations
[Figure: success rate and RDR10 versus number of repeats (0 to 10), experiment compared with simulation.]
Summary
The success rate of proteome analysis is influenced by the following factors (listed in order of importance):
• The degree of protein separation
• Amount of peptides loaded on column or mass spectrometric detection limit
• The degree of peptide separation or mass spectrometric dynamic range