Generalized Protein Parsimony and Spectral Counting for Functional Enrichment Analysis

Nathan Edwards
Department of Biochemistry and Molecular & Cellular Biology
Georgetown University Medical Center

Systems Biology
molecular biology ↕ phenotype
molecular biology ↕ biology
Structured High-Throughput Experiments
  • Proteomics
  • Sequencing
  • Microarrays
  • Metabolomics
Knowledge Databases
  • Localization
  • Function
  • Process
  • Interactions
  • Pathway
  • Mutation
Mathematical Models
Functional Annotation Enrichment

Functional Annotation Enrichment
Draw 10 of 50!
In any draw, we expect:
  • ~ 5 "evens", ~ 2 "≤ 10", etc.
  • Each ball is equally likely
  • Balls are independent
  • p-value is surprise!
For transcriptomics:
  • Genes ↔ Balls
  • Genome ↔ Tumbler
  • Diff. Expr. ↔ Draw
  • Annotation ↔ "evens",…

Why not in proteomics?
• Double counting and false positives…
  …due to traditional protein inference
• Proteomics cannot see all proteins…
  …proteins are not equally likely to be drawn
• Good relative abundance is hard…
  …extra chemistries, workflows, and software
  …missing values are particularly problematic

In proteomics…
• Double counting and false positives…
  Use generalized protein parsimony
• Proteomics cannot see all proteins…
  Use identified proteins as background
• Good relative abundance is hard…
  Model differential spectral counts directly

Ignore some PSMs
• FDR filtering leaves some false PSMs
• Enforce strict protein inference criteria
• Leave some PSMs uncovered
[Figure: PSMs vs. Proteins; labeled 10% on the first build and 90% on the second]

Match uncovered PSMs to FDR

Plasma membrane enrichment
• Pellicle enrichment of plasma membrane
• Six replicate LC-MS/MS analyses each
• Choksawangkarn et al. JPR 2013 (Fenselau Lab)
  • Cell-lysate (44,861 MS/MS)
  • Fe₃O₄-Al₂O₃ pellicle (21,871 MS/MS)
• 625 3-unique proteins to match 10% FDR:
  • Lysate: 18,976 PSMs; Pellicle: 13,723 PSMs
• 89 proteins with significantly (< 10⁻⁵) increased counts

Plasma membrane enrichment
Na/K+ ATPase subunit alpha-1 (P05023):
Transferrin receptor protein 1 (P02786):
Lysate: 17; Pellicle: 63; p-value: 2.0 × 10⁻¹¹
DAVID Bioinformatics analysis (89/625):
Lysate: 1; Pellicle: 90; p-value: 5.2 × 10⁻³³
Plasma membrane (GO:0005886): 29 (5.2 × 10⁻⁵)
Transmembrane (SwissProtKW): 24 (1.3 × 10⁻⁶)
Transmembrane (SwissProtKW):
Lysate: 524; Pellicle: 1335; p-value: 2.6 × 10⁻¹⁵⁸

A protein's PSMs rise and fall together!

A protein's PSMs rise and fall together?
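The ball-drawing analogy and the spectral-count p-values on the preceding slides can both be read as hypergeometric tail probabilities. The following is a minimal sketch under assumptions that the slides do not state: that the tumbler holds balls numbered 1 to 50, and that a protein's p-value is a one-sided tail computed over the pooled lysate and pellicle PSM totals (18,976 and 13,723). The function name `hypergeom_tail` is hypothetical.

```python
from math import comb


def hypergeom_tail(k, N, K, n):
    """P(X >= k) for n draws without replacement from N items, K of them marked."""
    upper = min(K, n)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / comb(N, n)


# Ball analogy (assumes balls numbered 1..50, so 25 are even and 10 are <= 10).
N_BALLS, DRAW = 50, 10
expected_evens = DRAW * 25 / N_BALLS  # 5.0
expected_le_10 = DRAW * 10 / N_BALLS  # 2.0
p_eight_evens = hypergeom_tail(8, N_BALLS, 25, DRAW)  # surprise of >= 8 evens

# Spectral counts (assumed model): PSMs are the balls, pellicle PSMs are the
# marked balls, and one protein's PSMs are the draw.
LYSATE_PSMS, PELLICLE_PSMS = 18_976, 13_723
lysate_count, pellicle_count = 17, 63
p_protein = hypergeom_tail(
    pellicle_count,
    LYSATE_PSMS + PELLICLE_PSMS,
    PELLICLE_PSMS,
    lysate_count + pellicle_count,
)
print(f"{p_protein:.1e}")
```

For the 17/63 counts this sketch gives a value of the same order as the 2.0 × 10⁻¹¹ reported on the slide; exact agreement is not guaranteed, since the slides do not spell out the test that was used.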
Anomalies indicate proteoforms

Nascent polypeptide-associated complex subunit alpha
7.3 × 10⁻⁸

Pyruvate kinase isozymes M1/M2
2.5 × 10⁻⁵

Summary
Functional annotation enrichment for proteomics too:
  • Careful counting (generalized parsimony)
  • Differential abundance by spectral counts
Use (multivariate-)hypergeometric model for
  • Differential abundance by spectral counts
  • Proteoform detection

HER2/Neu Mouse Model of Breast Cancer
• Paulovich, et al. JPR, 2007
• Study of normal and tumor mammary tissue by LC-MS/MS
• Peptide-spectrum assignments
  • Normal samples (Nn): 161,286 (49.7%)
  • Tumor samples (Nt): 163,068 (50.3%)
• 4270 proteins identified in total
• 1.4 million MS/MS spectra
• 2-unique generalized protein parsimony

Distribution of p-values (Yeast)
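The "careful counting" step named in the summary can be illustrated with a greedy cover. This is a rough sketch and not the author's algorithm: it assumes that "k-unique" means each accepted protein must explain at least k PSMs not already explained by an accepted protein, and it swaps in a greedy heuristic for whatever optimization the talk relies on. The names `greedy_parsimony` and `min_unique`, and the toy protein-to-PSM map, are hypothetical.

```python
def greedy_parsimony(protein_to_psms, min_unique=2):
    """Greedy sketch: accept proteins until none adds min_unique new PSMs.

    Returns (accepted proteins in order, set of PSMs left uncovered).
    """
    remaining = {p: set(psms) for p, psms in protein_to_psms.items()}
    uncovered = set().union(*remaining.values())
    accepted = []
    while remaining:
        # Pick the protein explaining the most uncovered PSMs (ties: name order).
        best = max(sorted(remaining), key=lambda p: len(remaining[p] & uncovered))
        gain = remaining[best] & uncovered
        if len(gain) < min_unique:
            break  # stricter criterion: leave the rest of the PSMs uncovered
        accepted.append(best)
        uncovered -= gain
        del remaining[best]
    return accepted, uncovered


# Toy data (hypothetical): PSM identifiers per candidate protein.
toy = {"A": {1, 2, 3, 4}, "B": {3, 4, 5}, "C": {6}, "D": {1, 2}}
strict, left_strict = greedy_parsimony(toy, min_unique=2)
loose, left_loose = greedy_parsimony(toy, min_unique=1)
uncovered_fraction = len(left_strict) / 6  # compare against the PSM-level FDR
```

Raising `min_unique` leaves more PSMs uncovered. The "Match uncovered PSMs to FDR" slide suggests choosing the strictness so that the uncovered fraction is close to the PSM-level FDR, though the slides do not give the exact procedure.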