Proteogenomics
Kelly Ruggles, Ph.D.
Proteomics Informatics
March 31, 2015

Proteogenomics: Intersection of proteomics and genomics
As the cost of high-throughput genome sequencing goes down, whole-genome, exome and RNA sequencing can be easily attained for most proteomics experiments.
In combination with mass spectrometry-based proteomics, sequencing can be used for:
1. Genome annotation
2. Studying the effect of genomic variation on the proteome
3. Biomarker identification

Proteogenomics: Intersection of proteomics and genomics
First published in 2004: “Proteogenomic mapping as a complementary method to perform genome annotation” (Jaffe JD, Berg HC and Church GM), using genomic sequencing to better annotate Mycoplasma pneumoniae.
Renuse S, Chaerkady R and A Pandey, Proteomics. 11(4) 2011

Proteogenomic workflow
• High throughput shotgun MS/MS
• Requires no knowledge of peptides present; uses mass differences to determine the next amino acid in the peptide chain.

Requirements for Proteogenomic Analysis
• DNA or RNA sequencing data
• High resolution MS/MS
• Informatics tools for proteogenomic database construction and protein searching
[Figure: workflow diagram. Labels: Sample; DNA and/or RNA Sequencing; Informatics Tools; Personalized Protein DB; MS/MS; Informatics Tools; Compare, score, test significance; Identified peptides and proteins]

Proteogenomics
• In the past, computational algorithms were commonly used to predict and annotate genes.
  – Many limitations
• With mass spectrometry we can
  – Confirm existing gene models
  – Correct gene models
  – Identify novel genes and splice isoforms

Proteogenomics
1. Improving genome annotation
2. Sequencing driven database construction
3. Proteomic mapping to genomic coordinates

Genome Annotation
• Process of identifying and assigning function to genes
• Historically, identification of protein coding regions was completed using
  – Comparative sequence similarity analysis
  – ab initio gene prediction algorithms
  – RNA transcript analysis
• These methods have limitations in determining
  – Gene start and stop sites
  – Translation reading frames
  – Short genes, overlapping genes
  – Alternative splice boundaries
  – Translated vs. transcribed genes
• Therefore, MS-based proteomics can be used to supplement sequence analysis for genome annotation

Protein Sequence Databases
• Identification of peptides from MS relies heavily on the quality of the protein sequence database (DB)
• DBs with missing peptide sequences will fail to identify the corresponding peptides
• DBs that are too large will have low sensitivity
• The ideal DB is complete and small, containing all proteins in the sample and no irrelevant sequences

Genome Sequence-based database for genome annotation
A commonly used method is to search MS/MS data against a 6-frame translation of the genome, removing bias based on established annotation.
[Figure: MS/MS spectra (intensity vs. m/z) are searched two ways. Against a 6-frame translation of the genome sequence: compare, score, test significance, giving annotated + novel peptides. Against a reference protein DB: compare, score, test significance, giving annotated peptides.]

Creating 6-frame translation database
ATGAAAAGCCTCAGCCTACAGAAACTCTTTTAATATGCATCAGTCAGAATTTAAAAAAAAAATC
[Figure: the sequence above translated in the reading frames of the positive strand and the negative strand, with stop codons shown as *]
Software:
• Peppy: creates the database + searches MS, Risk BA, et al. (2013)
• PIUS (Peptide Identification by Unbiased Searching): Costa et al., 2013
• MS-Dictionary: Kim et al., 2009

Genome Annotation Example 1: A. gambiae
• Peptides mapping to annotated 3’ UTR
• Peptides mapping to novel exon within an existing gene
Renuse S, Chaerkady R and A Pandey, Proteomics. 11(4) 2011

Genome Annotation Example 1: A. gambiae
• Peptides mapping to an unannotated gene
[Figure label: related species]
Renuse S, Chaerkady R and A Pandey, Proteomics. 11(4) 2011

Genome Annotation Example 2: Correcting Mis-annotations
A. Establishes new transcriptional start location
B. Confirms ORF
C. Establishes intron-exon boundaries
D. Determines new reading frame for exons
E. Predicts novel coding region
F. Finds the end of a gene
G. Uses a related species to build on genomic annotation

RNA Sequence-based database for alternative splicing identification
[Figure: MS/MS spectra (intensity vs. m/z) searched against an RNA-Seq junction DB; compare, score, test significance; identification of novel splice isoforms]

Annotation of organisms which lack genome sequencing
[Figure: MS/MS spectra (intensity vs. m/z); de novo MS/MS sequencing; reference DB of related species; compare, score, test significance; identification of potential protein coding regions]

Proteogenomics: Genome Annotation Summary
• Confirms existing gene models
• Corrects existing gene models
  – Intron-exon boundaries
  – Reading frames
  – Novel splice isoforms
  – Novel exons
• Identifies novel genes
• Identifies fusion proteins
• Identifies genomic polymorphisms

Proteogenomics
1. Improving genome annotation
2. Sequencing driven database construction
3. Proteomic mapping to genomic coordinates

Proteogenomic workflow
Before the advent of proteogenomics, variant protein analysis was laborious, often requiring de novo sequencing**, which is very time-consuming, so only a very limited number of peptides could be sequenced.
** DNA/RNA sequencing

Single nucleotide variant database for variant protein identification
[Figure: MS/MS spectra (intensity vs. m/z) searched against reference protein DB + variant DB; compare, score, test significance; identification of variant proteins. Variants predicted from genome sequencing: six aligned sequences for Exon 1, five reading TCGAGAGCTG and one reading TCGATAGCTG.]

Creating variant sequence DB
VCF File Format
# Meta-information lines
Columns:
1. Chromosome
2. Position
3. ID (e.g., dbSNP)
4. Reference base
5. Alternative allele
6. Quality score
7. Filter (PASS = passed filters)
8. Info (e.g., SOMATIC, VALIDATED, ...)

Creating variant sequence DB
Reference sequence (EXON 1, EXON 2, …):
…GTATTGCAAAAATAAGATAGAATAAGAATAATTACGACAAGATTC…
Add in variants within exon boundaries:
…CTATTGCAAAAATACGATAGCATAAGAATAGTTACGACAAGATTC…
In silico translation:
…LLQKYDSIRIVTTRF…
The translated sequences go into the variant DB.

Splice junction database for novel exon, alternative splicing identification
[Figure: MS/MS spectra (intensity vs. m/z) searched against reference protein DB + RNA-Seq junction DB; compare, score, test significance; identification of novel splice proteins. Intron/exon boundaries from RNA sequencing: alternative splicing (Exon 1, Exon 2, Exon 3) and novel expression (Exon 1, Exon X, Exon 2).]

Creating splice junction DB
BED File Format
Columns:
1. Chromosome
2. Chromosome Start
3. Chromosome End
4. Name
5. Score
6. Strand (+ or -)
7-9. Display info
10. # blocks (exons)
11. Size of blocks
12. Start of blocks
[Figure: junction BED file]

Creating splice junction DB
• Map to known intron/exon boundaries
• BED file with new gene mapping
  – Unannotated alternative splicing
  – One novel intron/exon boundary
  – Two novel intron/exon boundaries

Fusion protein identification
[Figure: MS/MS spectra (intensity vs. m/z) searched against reference protein DB + fusion gene DB; compare, score, test significance; identification of variant proteins. Gene X (Exon 1, Exon 2) on Chr 1 and Gene Y (Exon 1, Exon 2) on Chr 2 give the fusion Gene X Exon 1 + Gene Y Exon 2.]
Fusion genes:
• Find consensus sequence: …AGAACTGGAAGAATTGG*AATGGTAGATAACGCAGATCATCT… (* = fusion location)
• 6-frame translation
• FASTA

Informatics tools for customized DB creation
• QUILTS: perl/python based tool to generate DB from genomic and RNA sequencing data (Fenyo lab)
• customProDB: R package to generate DB from RNA-Seq data (Zhang B, et al.)
• Splice-graph database creation (Bafna V. et al.)

Proteogenomics and Human Disease: Genomic Heterogeneity
• Whole genome sequencing has uncovered millions of germline variants between individuals
• Genomic and proteomic studies typically use a reference database to model the general population, masking patient-specific variation
Nature October 28, 2010

Proteogenomics and Human Disease: Cancer Proteomics
Cancer is characterized by altered expression of tumor drivers and suppressors
• Results from gene mutations causing changes in protein expression and activity
• Can influence diagnosis, prognosis and treatment
Cancer proteomics
• Are genomic variants evident at the protein level?
• What is their effect on protein function?
• Can we classify tumors based on protein markers?

Tumor Specific Proteomic Variation
Nature April 15, 2010
Stephens, et al. Complex landscape of somatic rearrangement in human breast cancer genomes. Nature 2009

Personalized Database for Protein Identification
[Figure: two peptide lists under the labels Somatic Variants and Germline Variants, added to the protein DB; MS/MS spectra (intensity vs. m/z); compare, score, test significance; identified peptides and proteins]
SVATGSSEAAGGASGGGAR GQVAGTMKIEIAQYR DSGSYGQSGGEQQR EETSDFAEPTTCITNNQHS EPRDPR FIKGWFCFIISAR…
MQYAPNTQVEIIPQGR SSAEVIAQSR ASSSIIINESEPTTNIQIR QRAQEAIIQISQAISIMETVK SSPVEFECINDK SPAPGMAIGSGR…

Personalized Database for Protein Identification
[Figure: RNA-Seq and genome sequencing feed a tumor specific protein DB; MS/MS spectra (intensity vs. m/z); compare, score, test significance; identified peptides and proteins + tumor specific + patient specific peptides]

Tumor Specific Protein Databases
[Figure: Non-tumor sample: genome sequencing, identify germline variants. Tumor sample: genome sequencing and RNA-Seq, identify alternative splicing, somatic variants and novel expression. The tumor specific protein DB combines the reference human database (Ensembl) with alternative splicing, novel expression, variants and fusion genes, illustrated with the exon, Gene X/Gene Y and TCGAGAGCTG/TCGATAGCTG examples from the earlier slides.]

Proteogenomics and Biomarker Discovery
• Tumor-specific peptides identified by MS can be used as sensitive drug targets or diagnostic tools
  – Fusion proteins
  – Protein isoforms
  – Variants
• Effects of genomic rearrangements on protein expression can elucidate cancer biology

Proteogenomics
1. Improving genome annotation
2. Sequencing driven database construction
3. Proteomic mapping to genomic coordinates

Proteogenomic mapping
• Map observed peptides back to their genomic location.
• Requires tools to convert proteomic location to genomic coordinates
• Use to:
  – Determine exon location of peptides
  – Identify proteotypic peptides
  – Identify novel coding regions
  – Visualize in genome browsers (UCSC Genome Browser, Integrative Genomics Viewer (IGV))
  – Make quantitative comparisons based on genomic location

Informatics tools for proteogenomic mapping
• PGx: python-based tool, maps peptides back to genomic coordinates using a user-defined reference database (Fenyo lab)
• The Proteogenomic Mapping Tool: Java-based search of peptides against a 6-reading-frame sequence database (Sanders WS, et al.)
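The conversion from proteomic location to genomic coordinates described under Proteogenomic mapping can be sketched in a few lines of Python. This is a minimal illustration only, not how PGx or the Proteogenomic Mapping Tool is implemented. The function names (peptide_to_genomic, to_bed_line), the 0-based half-open block convention and the example coordinates are all assumptions made for the sketch. It also assumes the blocks contain only coding nucleotides and that the protein starts at the first coding nucleotide.

```python
def peptide_to_genomic(cds_blocks, strand, aa_start, aa_len):
    """Map a peptide to genomic blocks (hypothetical helper, not a PGx function).

    cds_blocks: list of (start, end) coding blocks, 0-based half-open.
    aa_start:   0-based position of the peptide's first residue in the protein.
    Returns the genomic (start, end) blocks covered by the peptide.
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    total = sum(end - start for start, end in cds_blocks)
    nt_start = aa_start * 3              # three nucleotides per residue
    nt_end = nt_start + aa_len * 3
    if aa_start < 0 or aa_len <= 0 or nt_end > total:
        raise ValueError("peptide lies outside the coding sequence")
    if strand == "-":
        # On the minus strand the protein starts at the highest coordinate,
        # so flip the offsets to count from the lowest genomic coordinate.
        nt_start, nt_end = total - nt_end, total - nt_start
    hits = []
    offset = 0                           # coding nt before the current block
    for start, end in sorted(cds_blocks):
        lo = max(nt_start, offset)
        hi = min(nt_end, offset + (end - start))
        if lo < hi:                      # peptide overlaps this block
            hits.append((start + lo - offset, start + hi - offset))
        offset += end - start
    return hits


def to_bed_line(chrom, name, strand, hits):
    """Format the mapped blocks as one 12-column BED line."""
    start, end = hits[0][0], hits[-1][1]
    sizes = ",".join(str(e - s) for s, e in hits)
    starts = ",".join(str(s - start) for s, _ in hits)
    fields = [chrom, start, end, name, 0, strand,
              start, end, "0", len(hits), sizes, starts]
    return "\t".join(str(f) for f in fields)


# Made-up example: a 30-residue protein encoded by two exons.
exons = [(100, 130), (200, 260)]
blocks = peptide_to_genomic(exons, "+", aa_start=8, aa_len=4)
print(blocks)                                   # peptide spans the junction
print(to_bed_line("chr1", "pep1", "+", blocks))
```

A peptide that crosses an exon junction comes back as two blocks, which is why the BED block columns (10 to 12) are needed to display it correctly in a genome browser.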
PGx: Proteogenomic mapping tool
Manor Askenazi, David Fenyo
[Figure: peptides and a sample specific protein database are used to produce peptides mapped onto genomic coordinates. Genome-wide tracks in 10,000 bp bins: log fold change in expression for Copy Number Variation, Methylation Status, Exon Expression (RNA-Seq) and Peptides, plus Number of Genes/Bin.]

Variant Peptide Mapping
Peptides with single amino acid changes corresponding to germline and somatic variants
[Figure: genome browser view with tracks ENSEMBL Gene, Tumor Peptide and Reference Peptide; peptides SVATGSSETAGGASGGGAR and SVATGSSEAAGGASGGGAR; codon change ACG->GCG]

Novel Peptide Mapping
Peptides corresponding to RNA-Seq expression in non-coding regions
[Figure: genome browser view with tracks ENSEMBL Gene, Tumor Peptide and Tumor RNA-Seq]

Proteogenomic integration
[Figure: tracks for Variants, Proteomic Quantitation, RNA-Seq Data, Predicted gene expression and Proteomic Mapping]
Maps genomic, transcriptomic and proteomic data to the same coordinate system, including quantitative information

Summary
• The integration of proteomics and genomics can improve our understanding of not only genomic annotation, but also of the functional protein products integral to biological processes.
• Proteogenomics is currently being used extensively in cancer discovery
  – Genetic rearrangement differs between tumors
  – Requires personalized database
  – Can provide cancer specific proteins for biomarker development
• Proteogenomics will likely continue to grow, particularly in the identification of genomic abnormalities in disease

Questions?