Faster, More Sensitive Peptide ID by Sequence DB

Generalized Protein Parsimony and Spectral Counting for Functional Enrichment Analysis
Nathan Edwards
Department of Biochemistry and Molecular & Cellular Biology
Georgetown University Medical Center

Systems Biology
• molecular biology ↕ phenotype
• molecular biology ↕ biology
• Structured High-Throughput Experiments
  • Proteomics
  • Sequencing
  • Microarrays
  • Metabolomics
• Knowledge Databases
  • Localization
  • Function
  • Process
  • Interactions
  • Pathway
  • Mutation
• Mathematical Models
• Functional Annotation Enrichment

Functional Annotation Enrichment
• Draw 10 of 50!
• In any draw, we expect:
  • ~5 "evens", ~2 "≤ 10", etc.
• Each ball is equally likely
• Balls are independent
• p-value is surprise!
• For transcriptomics:
  • Genes ↔ Balls
  • Genome ↔ Tumbler
  • Diff. Expr. ↔ Draw
  • Annotation ↔ "evens",…

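The tumbler analogy can be made concrete with the hypergeometric distribution. The following is a minimal sketch, assuming 50 balls numbered 1 to 50 and a draw of 10 without replacement; the function names and the "9 evens" observation are illustrative and not taken from the talk.

```python
from math import comb

def expected_count(K, n, N):
    """Expected number of annotated balls when n of N are drawn and K carry the annotation."""
    return n * K / N

def hypergeom_tail(k, K, n, N):
    """P(X >= k): chance of seeing at least k annotated balls in a draw of n
    from N balls, K of which carry the annotation."""
    upper = min(K, n)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / comb(N, n)

N, n = 50, 10          # tumbler size and draw size from the slide
EVENS = 25             # balls with an even number
AT_MOST_10 = 10        # balls numbered 10 or less

print(expected_count(EVENS, n, N))       # 5.0
print(expected_count(AT_MOST_10, n, N))  # 2.0

# "Surprise" of a hypothetical draw containing 9 or more even balls.
print(hypergeom_tail(9, EVENS, n, N))
```

A small tail probability means the draw is surprising if every ball really is equally likely, which is exactly the assumption the next slides question for proteomics.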
Why not in proteomics?
• Double counting and false positives…
  • …due to traditional protein inference
• Proteomics cannot see all proteins…
  • …proteins are not equally likely to be drawn
• Good relative abundance is hard…
  • …extra chemistries, workflows, and software
  • …missing values are particularly problematic

In proteomics…
• Double counting and false positives…
  • Use generalized protein parsimony
• Proteomics cannot see all proteins…
  • Use identified proteins as background
• Good relative abundance is hard…
  • Model differential spectral counts directly

Ignore some PSMs
• FDR filtering leaves some false PSMs
• Enforce strict protein inference criteria
• Leave some PSMs uncovered
[Figure: PSMs and Proteins, labeled 10% and 90%]

Match uncovered PSMs to FDR

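One way to read these slides: select a small set of proteins, each supported by enough peptides of its own, and stop once the fraction of PSMs left uncovered matches the PSM-level FDR. The sketch below is a greedy approximation under that reading. It is not the speaker's implementation, and the names `greedy_parsimony`, `min_unique`, and `fdr`, as well as the toy data, are hypothetical.

```python
def greedy_parsimony(protein_peptides, peptide_psms, min_unique=2, fdr=0.10):
    """Greedily select proteins until the uncovered PSM fraction is <= fdr.

    protein_peptides: dict mapping protein -> set of peptide sequences
    peptide_psms:     dict mapping peptide -> number of PSMs
    A protein is accepted only if it explains at least `min_unique`
    peptides not already explained by previously selected proteins.
    Returns (selected proteins in order, set of uncovered peptides).
    """
    total = sum(peptide_psms.values())
    uncovered = set(peptide_psms)
    selected = []
    while uncovered:
        uncovered_psms = sum(peptide_psms[p] for p in uncovered)
        if uncovered_psms <= fdr * total:
            break  # remaining PSMs are within the FDR budget: ignore them
        best, best_new, best_score = None, set(), 0
        for protein in sorted(protein_peptides):  # sorted for determinism
            if protein in selected:
                continue
            new = protein_peptides[protein] & uncovered
            if len(new) < min_unique:
                continue  # fails the strict inference criterion
            score = sum(peptide_psms[p] for p in new)
            if score > best_score:
                best, best_new, best_score = protein, new, score
        if best is None:
            break  # no remaining protein meets the criterion
        selected.append(best)
        uncovered -= best_new
    return selected, uncovered

# Toy data (hypothetical): four proteins, six peptides, 17 PSMs.
proteins = {"A": {"p1", "p2", "p3"}, "B": {"p3", "p4", "p5"},
            "C": {"p2", "p3"}, "D": {"p6"}}
psms = {"p1": 4, "p2": 3, "p3": 5, "p4": 2, "p5": 2, "p6": 1}
print(greedy_parsimony(proteins, psms))
```

In the toy data, protein C adds nothing once A is chosen, and protein D is supported by a single peptide, so its one PSM is left uncovered rather than forcing a weakly supported protein into the list.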
Plasma membrane enrichment
• Pellicle enrichment of plasma membrane
  • Choksawangkarn et al. JPR 2013 (Fenselau Lab)
• Six replicate LC-MS/MS analyses each
  • Cell-lysate (44,861 MS/MS)
  • Fe₃O₄-Al₂O₃ pellicle (21,871 MS/MS)
• 625 3-unique proteins to match 10% FDR:
  • Lysate: 18,976 PSMs; Pellicle: 13,723 PSMs
  • 89 proteins with significantly (< 10⁻⁵) increased counts

Plasma membrane enrichment
• Na/K+ ATPase subunit alpha-1 (P05023):
  • Lysate: 17; Pellicle: 63; p-value: 2.0 × 10⁻¹¹
• Transferrin receptor protein 1 (P02786):
  • Lysate: 1; Pellicle: 90; p-value: 5.2 × 10⁻³³
• DAVID Bioinformatics analysis (89/625):
  • Plasma membrane (GO:0005886): 29 (5.2 × 10⁻⁵)
  • Transmembrane (SwissProtKW): 24 (1.3 × 10⁻⁶)
• Transmembrane (SwissProtKW):
  • Lysate: 524; Pellicle: 1335; p-value: 2.6 × 10⁻¹⁵⁸

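The slides do not spell out the test behind these per-protein p-values. A minimal sketch, assuming a one-sided hypergeometric test in which a protein's PSMs are treated as a draw from the pooled PSMs of both samples (18,976 lysate and 13,723 pellicle PSMs, the totals reported for this dataset); the function name is hypothetical.

```python
from math import comb

def spectral_count_pvalue(count_a, count_b, total_a, total_b):
    """One-sided hypergeometric p-value for enrichment in sample B.

    A protein has count_a PSMs in sample A and count_b in sample B.
    Under the null, its count_a + count_b PSMs are drawn at random from
    the total_a + total_b pooled PSMs; return P(at least count_b from B).
    """
    n = count_a + count_b
    N = total_a + total_b
    favorable = sum(
        comb(total_b, k) * comb(total_a, n - k)
        for k in range(count_b, min(n, total_b) + 1)
    )
    # Exact integer arithmetic; only the final ratio is converted to float.
    return favorable / comb(N, n)

LYSATE_TOTAL, PELLICLE_TOTAL = 18_976, 13_723

for lysate, pellicle in [(1, 90), (17, 63)]:
    p = spectral_count_pvalue(lysate, pellicle, LYSATE_TOTAL, PELLICLE_TOTAL)
    print(f"Lysate: {lysate}; Pellicle: {pellicle}; p-value: {p:.1e}")
```

Under these assumptions the computed values can be compared directly with the p-values shown on the slide for the same counts. Because Python integers are unbounded, the binomial coefficients are exact even for very small tail probabilities.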
A protein's PSMs rise and fall together!

A protein's PSMs rise and fall together?

Anomalies indicate proteoforms

Nascent polypeptide-associated complex subunit alpha: 7.3 × 10⁻⁸

Pyruvate kinase isozymes M1/M2: 2.5 × 10⁻⁵

Summary
• Functional annotation enrichment for proteomics too:
  • Careful counting (generalized parsimony)
  • Differential abundance by spectral counts
• Use (multivariate-)hypergeometric model for
  • Differential abundance by spectral counts
  • Proteoform detection

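For reference, the multivariate hypergeometric distribution generalizes the two-sample case to several categories. The sketch below only evaluates its probability mass function; how the categories are defined for proteoform detection (for example, per sample or per peptide) is not stated in the slides, so the toy example is an assumption.

```python
from math import comb, prod

def multivariate_hypergeom_pmf(counts, sizes):
    """Probability of drawing counts[i] items from category i when
    sum(counts) items are drawn without replacement from categories
    containing sizes[i] items each."""
    if len(counts) != len(sizes):
        raise ValueError("counts and sizes must have the same length")
    n, N = sum(counts), sum(sizes)
    return prod(comb(K, k) for K, k in zip(sizes, counts)) / comb(N, n)

# Toy example (hypothetical): 10 PSMs spread over three samples holding
# 2, 3, and 5 PSMs; chance that a protein's 3 PSMs fall one per sample.
print(multivariate_hypergeom_pmf((1, 1, 1), (2, 3, 5)))  # 0.25
```

With two categories this reduces to the ordinary hypergeometric probability used for differential spectral counts.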
HER2/Neu Mouse Model of Breast Cancer
• Paulovich, et al. JPR, 2007
• Study of normal and tumor mammary tissue by LC-MS/MS
  • 1.4 million MS/MS spectra
• Peptide-spectrum assignments
  • Normal samples (Nn): 161,286 (49.7%)
  • Tumor samples (Nt): 163,068 (50.3%)
• 4270 proteins identified in total
  • 2-unique generalized protein parsimony

Distribution of p-values (Yeast)