Proteogenomics
Kelly Ruggles, Ph.D.
Proteomics Informatics
March 31, 2015
Proteogenomics: Intersection of
proteomics and genomics
As the cost of high-throughput
genome sequencing goes
down, whole genome, exome
and RNA sequencing can be
easily obtained for most
proteomics experiments
In combination with mass spectrometry-based
proteomics, sequencing can be used for:
1. Genome annotation
2. Studying the effect of genomic variation on the proteome
3. Biomarker identification
Proteogenomics: Intersection of
proteomics and genomics
First published in 2004: “Proteogenomic mapping as a
complementary method to perform genome annotation”
(Jaffe JD, Berg HC and Church GM) using genomic
sequencing to better annotate Mycoplasma pneumoniae
Renuse S, Chaerkady R and A Pandey, Proteomics. 11(4) 2011
Proteogenomic workflow
High throughput shotgun MS/MS
Requires no knowledge of
peptides present, uses mass
difference to determine next
AA in peptide chain.
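The mass-difference idea can be sketched in a few lines. This is a minimal illustration, not a real de novo sequencing algorithm: it assumes a clean, complete ladder of singly charged b ions, while real spectra contain noise, missing peaks and several ion series. The residue masses are standard monoisotopic values; the function names and the 0.01 Da tolerance are assumptions made for this example.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276  # mass of a proton (Da)


def b_ion_ladder(peptide):
    """Return idealized singly charged b-ion m/z values for a peptide."""
    total = PROTON
    ladder = []
    for aa in peptide:
        total += RESIDUE_MASS[aa]
        ladder.append(total)
    return ladder


def residues_from_ladder(peaks, tol=0.01):
    """Call one residue per gap between consecutive peaks of one ion series.

    Isobaric residues (I and L) cannot be told apart by mass and are
    reported together; gaps matching no residue are reported as '?'.
    """
    peaks = sorted(peaks)
    calls = []
    for low, high in zip(peaks, peaks[1:]):
        gap = high - low
        matches = sorted(aa for aa, m in RESIDUE_MASS.items() if abs(gap - m) <= tol)
        calls.append("/".join(matches) if matches else "?")
    return calls


if __name__ == "__main__":
    # The first residue is not recovered because there is no peak before b1.
    print(residues_from_ladder(b_ion_ladder("SVATG")))
```

Note that leucine and isoleucine share the same mass, so a mass-difference reading alone reports them as `I/L`.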
Requirements for Proteogenomic
Analysis
• DNA or RNA sequencing data
• High resolution MS/MS
• Informatics tools for proteogenomic database construction and
protein searching
DNA and/or RNA Sequencing
Sample
Informatics Tools
MS/MS
Informatics Tools
Compare, score,
test significance
Identified peptides and proteins
Personalized
Protein DB
Proteogenomics
• In the past, computational algorithms were
commonly used to predict and annotate genes.
– Many limitations
• With mass spectrometry we can
– Confirm existing gene models
– Correct gene models
– Identify novel genes and splice isoforms
Proteogenomics
1. Improving genome annotation
2. Sequencing driven database construction
3. Proteomic mapping to genomic coordinates
Genome Annotation
• Process of identifying and assigning function to genes
• Historically, identification of protein coding regions was
completed using
– Comparative sequence similarity analysis
– ab initio gene prediction algorithms
– RNA transcript analysis
• Limitations associated with these methods in determining
– Gene start and stop sites
– Translation reading frames
– Short genes, overlapping genes
– Alternative splice boundaries
– Translated vs. transcribed genes
• Therefore, MS-based proteomics can be used to
supplement sequence analysis for genome annotation
Protein Sequence Databases
• Identification of peptides from MS relies
heavily on the quality of the protein sequence
database (DB)
• DBs with missing peptide sequences will fail to
identify the corresponding peptides
• DBs that are too large will have low sensitivity
• Ideal DB is complete and small, containing all
proteins in the sample and no irrelevant
sequences
Genome Sequence-based database for
genome annotation
A commonly used method is to search MS/MS spectra against a 6-frame translation
of the genome, removing bias from the established annotation
[Workflow diagram: MS/MS spectra (intensity vs. m/z) are compared, scored and
tested for significance against (a) a 6-frame translation of the genome
sequence, yielding annotated + novel peptides, and (b) a reference protein DB,
yielding annotated peptides only.]
Creating 6-frame translation database
ATGAAAAGCCTCAGCCTACAGAAACTCTTTTAATATGCATCAGTCAGAATTTAAAAAAAAAATC
[Figure: the sequence above translated in all six reading frames, three on the
positive strand and three on the negative (reverse-complement) strand;
* marks a stop codon.]
Software:
• Peppy: creates the database + searches MS, Risk BA, et al. (2013)
• PIUS (Peptide Identification by Unbiased Searching): Costa et al.,
2013
• MS-Dictionary: Kim et al, 2009
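A 6-frame translation can be sketched as follows. This is a minimal illustration assuming the standard genetic code and an unambiguous ACGT sequence; it is not how Peppy, PIUS or MS-Dictionary are implemented, and the function names and the minimum segment length are choices made for this example.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (NCBI table 1 layout).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))


def six_frame(seq):
    """Return translations for frames +1..+3 and -1..-3."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


def stop_free_segments(protein, min_len=7):
    """Split a frame translation at stop codons; keep segments worth searching."""
    return [seg for seg in protein.split("*") if len(seg) >= min_len]


if __name__ == "__main__":
    # Sequence taken from the slide.
    dna = "ATGAAAAGCCTCAGCCTACAGAAACTCTTTTAATATGCATCAGTCAGAATTTAAAAAAAAAATC"
    for frame, protein in sorted(six_frame(dna).items()):
        print(frame, protein, stop_free_segments(protein))
```

Splitting each frame at stop codons and discarding very short segments keeps the search database from growing faster than necessary, which matters because overly large databases reduce sensitivity.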
Genome Annotation Example 1:
A. gambiae
Peptides mapping to annotated 3’ UTR
Peptides mapping to novel exon within an existing gene
Renuse S, Chaerkady R and A Pandey, Proteomics. 11(4) 2011
Genome Annotation Example 1:
A. gambiae
Peptides mapping to unannotated gene
related species
Renuse S, Chaerkady R and A Pandey, Proteomics. 11(4) 2011
Genome Annotation Example 2:
Correcting Misannotations
A. Establishes new transcriptional start location
B. Confirm ORF
C. Establishes intron-exon boundaries
D. Determines new reading frame for exons
E. Predicts novel coding region
F. Finds the end of a gene
G. Uses a related species to build on genomic annotation
RNA Sequence-based database for
alternative splicing identification
intensity
MS/MS
RNA-Seq junction
DB
m/z
Compare, score,
test significance
Identification of novel splice
isoforms
Annotation of organisms which lack
genome sequencing
intensity
MS/MS
Reference DB of
related species
m/z
De novo MS/MS
sequencing
Compare, score,
test significance
Identification of potential protein coding regions
Proteogenomics: Genome Annotation
Summary
• Confirms existing gene models
• Corrects existing gene models
– Intron-exon boundaries
– Reading frames
– Novel splice isoforms
– Novel exons
• Identifies novel genes
• Fusion protein identification
• Identify genomic polymorphisms
Proteogenomics
1. Improving genome annotation
2. Sequencing driven database construction
3. Proteomic mapping to genomic coordinates
Proteogenomic workflow
Before the advent of proteogenomics, variant protein analysis was laborious, often requiring de novo sequencing,
which is so time-consuming that only a very limited number of peptides could be sequenced.
DNA/RNA
sequencing
Single nucleotide variant database for
variant protein identification
intensity
MS/MS
Reference
protein DB
+
Variant DB
m/z
Compare, score,
test significance
Identification of
variant proteins
Variants predicted from genome sequencing
Exon 1
TCGAGAGCTG
TCGAGAGCTG
TCGAGAGCTG
TCGAGAGCTG
TCGAGAGCTG
TCGATAGCTG
Creating variant sequence DB
VCF File Format
# Meta-information lines
Columns:
1. Chromosome
2. Position
3. ID (ex: dbSNP)
4. Reference base
5. Alternative allele
6. Quality score
7. Filter (PASS=passed filters)
8. Info (ex: SOMATIC, VALIDATED..)
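The eight fixed columns listed above can be read with a few lines of plain Python. This is a minimal sketch for illustration: the example records are hypothetical, the names `VcfRecord`, `parse_info` and `parse_vcf` are invented here, and per-sample genotype columns are ignored. A production pipeline would normally use a dedicated VCF library.

```python
from typing import Dict, List, NamedTuple, Union


class VcfRecord(NamedTuple):
    chrom: str
    pos: int                 # 1-based position
    variant_id: str          # e.g. a dbSNP ID, or "." if none
    ref: str
    alts: List[str]          # ALT may list several comma-separated alleles
    qual: str
    filter_status: str       # "PASS" means the call passed all filters
    info: Dict[str, Union[str, bool]]


def parse_info(field):
    """Parse the INFO column: 'KEY=value' pairs and bare flags such as SOMATIC."""
    info = {}
    if field in ("", "."):
        return info
    for item in field.split(";"):
        key, sep, value = item.partition("=")
        info[key] = value if sep else True
    return info


def parse_vcf(text):
    """Parse VCF text, skipping '##' meta-information and the '#CHROM' header."""
    records = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        c = line.split("\t")
        records.append(VcfRecord(c[0], int(c[1]), c[2], c[3], c[4].split(","),
                                 c[5], c[6], parse_info(c[7])))
    return records


# Hypothetical example input, made up for illustration only.
EXAMPLE_VCF = "\n".join([
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t12345\t.\tA\tC\t50\tPASS\tSOMATIC;DP=35",
    "chr1\t22222\t.\tG\tT\t12\tLowQual\tDP=4",
])

if __name__ == "__main__":
    passing = [r for r in parse_vcf(EXAMPLE_VCF) if r.filter_status == "PASS"]
    print(passing)
```

Filtering on the FILTER column before building a variant database is one simple way to keep low-confidence calls out of the search space.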
Creating variant sequence DB
EXON 1
EXON 2
…
…
…GTATTGCAAAAATAAGATAGAATAAGAATAATTACGACAAGATTC…
Add in variants within exon boundaries
…CTATTGCAAAAATACGATAGCATAAGAATAGTTACGACAAGATTC…
In silico translation
…LLQKYDSIRIVTTRF…
Variant DB
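The two steps on this slide (add variants within the exon, then translate in silico) can be sketched as below. The reference and variant sequences are the ones shown on the slide; the four substitutions were obtained by comparing them position by position. The function names and the choice of 0-based exon offsets are assumptions for this example, and only single nucleotide substitutions on the coding strand are handled.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(seq):
    """Translate complete codons starting at position 0."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))


def apply_snvs(exon_seq, snvs):
    """Apply (offset, ref_base, alt_base) substitutions; offsets are 0-based."""
    bases = list(exon_seq)
    for offset, ref, alt in snvs:
        if bases[offset] != ref:
            # A mismatch usually means the coordinates or strand are wrong.
            raise ValueError(f"expected {ref} at offset {offset}, found {bases[offset]}")
        bases[offset] = alt
    return "".join(bases)


# Sequences from the slide (the surrounding "..." context is omitted).
REFERENCE = "GTATTGCAAAAATAAGATAGAATAAGAATAATTACGACAAGATTC"
SNVS = [(0, "G", "C"), (14, "A", "C"), (20, "A", "C"), (30, "A", "G")]

if __name__ == "__main__":
    variant = apply_snvs(REFERENCE, SNVS)
    # Hypothetical FASTA header; real tools encode gene and variant IDs here.
    print(">example_variant_entry")
    print(translate(variant))
```

Checking the reference base before substituting is a cheap guard against off-by-one and strand errors, which are easy to introduce when converting VCF positions to exon offsets.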
Splice junction database for novel
exon, alternative splicing identification
intensity
MS/MS
Reference
protein DB
+
m/z
Compare, score,
test significance
RNA-Seq
junction
DB
Intron/Exon boundaries from RNA sequencing
Alt. Splicing
Identification of
novel splice proteins
Exon 1
Exon 2
Exon 3
Novel Expression
Exon 1
Exon X
Exon 2
Creating splice junction DB
BED File Format
Columns:
1. Chromosome
2. Chromosome Start
3. Chromosome End
4. Name
5. Score
6. Strand (+ or -)
7-9. Display info
10. # blocks (exons)
11. Size of blocks
12. Start of blocks
Junction bed file
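Reading the block columns of a junction BED line might look like the sketch below. The example line is hypothetical and only mimics the layout of a two-block junction record; `parse_bed12` and the returned dictionary keys are names invented for this example. BED coordinates are 0-based and half-open, and block starts are relative to the chromosome start.

```python
def parse_bed12(line):
    """Parse one 12-column BED line into absolute block and intron coordinates."""
    f = line.rstrip("\n").split("\t")
    chrom, chrom_start, chrom_end = f[0], int(f[1]), int(f[2])
    name, score, strand = f[3], f[4], f[5]
    block_count = int(f[9])
    # Trailing commas are allowed in the size and start lists.
    sizes = [int(x) for x in f[10].rstrip(",").split(",")]
    starts = [int(x) for x in f[11].rstrip(",").split(",")]
    if not (len(sizes) == len(starts) == block_count):
        raise ValueError("block count does not match block sizes/starts")
    blocks = [(chrom_start + s, chrom_start + s + size) for s, size in zip(starts, sizes)]
    # Each gap between consecutive blocks is an intron, i.e. a splice junction.
    introns = [(a_end, b_start) for (_, a_end), (b_start, _) in zip(blocks, blocks[1:])]
    return {"chrom": chrom, "start": chrom_start, "end": chrom_end, "name": name,
            "score": score, "strand": strand, "blocks": blocks, "introns": introns}


# Hypothetical junction record, made up for illustration only.
EXAMPLE_LINE = "chr1\t14829\t14970\tJUNC00000001\t10\t-\t14829\t14970\t255,0,0\t2\t50,41\t0,100"

if __name__ == "__main__":
    record = parse_bed12(EXAMPLE_LINE)
    print(record["blocks"], record["introns"])
```

The intron boundaries recovered this way are what get compared against known intron/exon boundaries to decide whether a junction is annotated or novel.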
Creating splice junction DB
Map to known
intron/exon boundaries
Bed file with
new gene
mapping
Unannotated alternative splicing
One novel intron/exon boundary
Two novel intron/exon boundaries
Fusion protein identification
intensity
MS/MS
Reference
protein DB
+
Fusion Gene
DB
m/z
Compare, score,
test significance
Gene X
Exon 1
Identification of
variant proteins
Gene X
Exon 2
Chr 1
Gene Y
Exon 2
Gene Y
Exon 1
Chr 2
Gene X
Exon 1
Gene Y
Exon 2
Fusion Genes
Find consensus sequence
.…AGAACTGGAAGAATTGG*AATGGTAGATAACGCAGATCATCT..…
Fusion Location
6 frame translation FASTA
Informatics tools for customized DB
creation
• QUILTS: Perl/Python-based tool to generate
DB from genomic and RNA sequencing data
(Fenyo lab)
• customProDB: R package to generate DB from
RNA-Seq data (Zhang B, et al.)
• Splice-graph database creation (Bafna V. et al.)
Proteogenomics and Human Disease:
Genomic Heterogeneity
• Whole genome sequencing has uncovered millions of
germline variants between individuals
• Genomic and proteomic studies typically use a reference
database to model the general population, masking
patient-specific variation
Nature October 28, 2010
Proteogenomics and Human Disease:
Cancer Proteomics
Cancer is characterized by altered expression of tumor
drivers and suppressors
• Results from gene mutations causing changes in
protein expression, activity
• Can influence diagnosis, prognosis and treatment
Cancer proteomics
• Are genomic variants evident at the protein level?
• What is their effect on protein function?
• Can we classify tumors based on protein markers?
Tumor Specific Proteomic Variation
Nature April 15, 2010
Stephens, et al. Complex landscape of somatic
rearrangement in human breast cancer genomes.
Nature 2009
Personalized Database for Protein
Identification
Somatic Variants
Germline Variants
SVATGSSEAAGGASGGGAR
GQVAGTMKIEIAQYR
DSGSYGQSGGEQQR
EETSDFAEPTTCITNNQHS
EPRDPR
FIKGWFCFIISAR….
MQYAPNTQVEIIPQGR
SSAEVIAQSR
ASSSIIINESEPTTNIQIR
QRAQEAIIQISQAISIMETVK
SSPVEFECINDK
SPAPGMAIGSGR…
intensity
MS/MS
Protein DB
m/z
Compare, score,
test significance
Identified peptides and proteins
Personalized Database for Protein
Identification
RNA-Seq
Genome Sequencing
intensity
MS/MS
Tumor Specific
Protein DB
m/z
Compare, score,
test significance
Identified peptides and proteins
+ tumor specific
+ patient specific peptides
Tumor Specific Protein Databases
Non-Tumor Sample
Genome sequencing
Genome sequencing
RNA-Seq
Tumor Sample
Identify germline variants
Identify alternative splicing,
somatic variants and
novel expression
Alt. Splicing
Novel Expression
Tumor Specific
Protein DB
Exon 1
Exon 1
Exon 3
Exon 2
Variants
Fusion Genes
Gene X
Exon 1
Gene X
Exon 2
Gene X
Gene Y
Exon 1
Gene Y
Exon X
Gene Y
Exon 2
Exon 1
Exon 2
TCGAGAGCTG
TCGAGAGCTG
TCGAGAGCTG
TCGAGAGCTG
TCGAGAGCTG
TCGATAGCTG
Reference Human
Database (Ensembl)
Proteogenomics and Biomarker
Discovery
• Tumor-specific peptides identified by MS can
be used as sensitive drug targets or diagnostic
tools
– Fusion proteins
– Protein isoforms
– Variants
• Effects of genomic rearrangements on protein
expression can elucidate cancer biology
Proteogenomics
1. Improving genome annotation
2. Sequencing driven database construction
3. Proteomic mapping to genomic coordinates
Proteogenomic mapping
• Map back observed peptides to their genomic
location.
• Requires tools to convert proteomic location to
genomic coordinates
• Use to determine:
– Exon location of peptides
– Proteotypic peptides
– Novel coding region
– Visualize in genome browsers (UCSC Genome Browser,
Integrative Genomics Viewer (IGV))
– Quantitative comparison based on genomic location
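Converting a peptide's position in a protein to genomic coordinates can be sketched as follows. This is a simplified illustration, not the method used by any particular mapping tool: it assumes the coding exons of the protein are known as 0-based, half-open genomic intervals, that the protein starts at the first coding base, and that the function and argument names are free choices for this example.

```python
def peptide_to_genome(cds_exons, strand, aa_start, aa_len):
    """Map a peptide to genomic intervals.

    cds_exons: coding blocks as (start, end), 0-based half-open, ascending order.
    strand:    "+" or "-".
    aa_start:  0-based index of the peptide's first residue in the protein.
    aa_len:    peptide length in residues.
    Returns a sorted list of genomic (start, end) intervals; a peptide that
    spans a splice junction yields more than one interval.
    """
    cds_start = aa_start * 3
    cds_end = cds_start + aa_len * 3
    # On the minus strand, translation begins in the exon with the highest coordinates.
    ordered = cds_exons if strand == "+" else cds_exons[::-1]
    intervals = []
    consumed = 0  # coding bases contributed by the exons visited so far
    for ex_start, ex_end in ordered:
        ex_len = ex_end - ex_start
        lo = max(cds_start, consumed)
        hi = min(cds_end, consumed + ex_len)
        if lo < hi:
            if strand == "+":
                intervals.append((ex_start + lo - consumed, ex_start + hi - consumed))
            else:
                intervals.append((ex_end - (hi - consumed), ex_end - (lo - consumed)))
        consumed += ex_len
    return sorted(intervals)


if __name__ == "__main__":
    # Hypothetical two-exon gene: 30 + 60 coding bases = 30 residues.
    exons = [(100, 130), (200, 260)]
    print(peptide_to_genome(exons, "+", aa_start=8, aa_len=4))
```

The resulting intervals can be written out as BED blocks, which is what makes the peptides viewable in a genome browser alongside RNA-Seq and variant tracks.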
Informatics tools for proteogenomic
mapping
• PGx: Python-based tool, maps peptides back
to genomic coordinates using a user-defined
reference database (Fenyo lab)
• The Proteogenomic Mapping Tool: Java-based
search of peptides against 6-reading frame
sequence database (Sanders WS, et al).
PGx: Proteogenomic mapping tool
Peptides
Sample specific
protein database
Log Fold Change in Expression (10,000 bp bins)
Manor Askenazi
David Fenyo
Copy Number Variation
Methylation Status
Exon Expression (RNA-Seq)
Number of Genes/Bin
Peptides
Peptides mapped
onto genomic
coordinates
Variant Peptide Mapping
Peptides with single amino acid changes corresponding to germline and somatic variants
SVATGSSETAGGASGGGAR
ACG->GCG
ENSEMBL Gene
Tumor Peptide
Reference Peptide
SVATGSSEAAGGASGGGAR
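The single amino acid change on this slide can be located and checked against the codon change with a short comparison. This is a sketch for illustration; reading the arrow in ACG->GCG as "before -> after" is an assumption, as are the function names.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def single_aa_changes(before, after):
    """List (1-based position, residue before, residue after) for each mismatch."""
    if len(before) != len(after):
        raise ValueError("peptides must have the same length")
    return [(i + 1, a, b) for i, (a, b) in enumerate(zip(before, after)) if a != b]


def codon_change_matches(codon_before, codon_after, aa_before, aa_after):
    """Check that a codon substitution explains an observed residue change."""
    return CODON_TABLE[codon_before] == aa_before and CODON_TABLE[codon_after] == aa_after


if __name__ == "__main__":
    # Peptides and codons as shown on the slide.
    changes = single_aa_changes("SVATGSSETAGGASGGGAR", "SVATGSSEAAGGASGGGAR")
    for position, aa_before, aa_after in changes:
        ok = codon_change_matches("ACG", "GCG", aa_before, aa_after)
        print(position, aa_before, aa_after, ok)
```

A check like this ties the peptide-level observation back to the nucleotide-level variant call, so that a variant peptide is only reported when both layers agree.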
Novel Peptide Mapping
Peptides corresponding to RNA-Seq expression in non-coding regions
ENSEMBL Gene
Tumor Peptide
Tumor RNA-Seq
Proteogenomic integration
Variants
Proteomic
Quantitation
RNA-Seq Data
Predicted gene
expression
Proteomic
Mapping
Maps genomic, transcriptomic and proteomic data to same coordinate system
including quantitative information
Summary
• The integration of proteomics and genomics can
improve our understanding of not only genomic
annotation, but also of the functional protein products
integral in biological processes.
• Proteogenomics is currently being used extensively in
cancer discovery
– Genetic rearrangement differs between tumors
– Requires personalized database
– Can provide cancer specific proteins for biomarker
development
• Proteogenomics will likely continue to grow,
particularly in the identification of genomic
abnormalities in disease
Questions?