Finding the Lost Treasure of NGS Data

Yan Guo, PhD
Modules Overview
DNA sequencing (exome / whole genome)
• fastq files → FastQC → bwa alignment → Bam files
• GATK refinement: realignment, recalibration, mark-duplication (with dbsnp / indel resources)
• bamQC
• SNP/INDEL → vcf files → best practice filter → gene coding changes
• Somatic mutation
• Structural variant analysis: translocation, inversion, copy number variants
• Gene-level analysis, gene associates
RNAseq
• fastq files → FastQC → tophat alignment → Bam files → Refinement
• SeQC
• Gene quantification: cufflinks annotations, cuffdiff comparisons, Gene List, cluster, functional/pathway
• Novel genes discovery: cufflinks annotations, cuffmerge, cuffdiff comparisons, identifying genes
• Gene-fusion analysis
What do you expect to find in NGS data?
DNAseq
• SNPs
• Somatic Mutations
• Small Indels
• Large Structural Change
• CNV
RNAseq
• Gene expression difference
• Splicing Variants
• Fusion Genes
What don't you expect to find in NGS data?
Exome sequencing reads
• Is mapped? No: unmapped DNA reads
  • Virus/microbe DNA
  • Contamination
• Is mapped? Yes: mapped reads
  • Is targeted? Yes: targeted DNA
  • Is targeted? No: untargeted DNA
    • Intronic DNA
    • Intergenic DNA
    • Mitochondrial DNA
[Figure: Exome Capture]
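The two questions in the chart above (is the read mapped, and does it fall in a targeted region) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Read` record, the `classify` function, the category labels, and the mitochondrial contig names `chrM`/`MT` are all hypothetical, and the only fact taken from the SAM format is that flag bit 0x4 marks an unmapped read. Splitting untargeted nuclear reads into intronic and intergenic would additionally need a gene annotation, which is not modeled here.

```python
from collections import Counter
from dataclasses import dataclass

UNMAPPED_FLAG = 0x4  # SAM flag bit meaning "segment unmapped"


@dataclass
class Read:
    """Hypothetical minimal read record (not a real library type)."""
    name: str
    flag: int
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive


def overlaps(read, regions):
    """True if the read overlaps any (chrom, start, end) region."""
    return any(
        chrom == read.chrom and read.start < r_end and r_start < read.end
        for chrom, r_start, r_end in regions
    )


def classify(read, targets, mito_names=("chrM", "MT")):
    """Sort one read into the categories of the decision chart."""
    if read.flag & UNMAPPED_FLAG:
        return "unmapped"                  # candidate virus/microbe/contamination
    if overlaps(read, targets):
        return "targeted"
    if read.chrom in mito_names:
        return "untargeted:mitochondrial"
    return "untargeted:nuclear"            # intronic or intergenic


if __name__ == "__main__":
    targets = [("chr1", 100, 200)]         # made-up capture region
    reads = [
        Read("r1", 4, "*", 0, 0),
        Read("r2", 0, "chr1", 150, 250),
        Read("r3", 0, "chrM", 10, 110),
        Read("r4", 16, "chr1", 500, 600),
    ]
    print(Counter(classify(r, targets) for r in reads))
```

A linear scan over the target list keeps the sketch short; real capture designs contain many thousands of regions, where an interval index would be the more sensible choice.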
Why do we care about intron and
intergenic regions?
• Some introns can encode specific proteins and
can be processed after splicing to form
noncoding RNA molecules (Rearick, Prakash
et al. 2011).
• The majority of GWAS SNPs are not in coding
regions (706 exon, 3986 intron, 3323
intergenic).
• The ENCODE Project: ENCyclopedia Of DNA
Elements
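To put the GWAS counts quoted above into proportion, the short calculation below works out the share of SNPs that lie outside exons. It assumes the three quoted categories (exon, intron, intergenic) are exhaustive and non-overlapping, which the slide does not state explicitly; the variable names are illustrative.

```python
# Counts as quoted on the slide; treating them as a complete partition
# of the GWAS catalog SNPs is an assumption.
gwas_snps = {"exon": 706, "intron": 3986, "intergenic": 3323}

total = sum(gwas_snps.values())
noncoding = gwas_snps["intron"] + gwas_snps["intergenic"]
noncoding_pct = 100 * noncoding / total

print(f"{noncoding} of {total} GWAS SNPs ({noncoding_pct:.1f}%) are outside exons")
```

Under that assumption, roughly nine in ten GWAS SNPs fall in intronic or intergenic sequence.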
GWAS catalog SNPs
Kit                Target total bases   Missing exon SNPs   Missing intron SNPs   Missing intergenic SNPs
SureSelect (v2)    37627747             387                 3946                  3323
TruSeq             62085286             206                 3980                  3320
SeqCap EZ (v3.0)   64190747             326                 3880                  3317
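Table columns: Samples, Average depth, Intergenic, Intronic, Splicing (1), ncRNA (2), Exonic (Nonsynonymous, Stopgain, Stoploss)

Exonic SNPs
Samples          Average depth   Nonsynonymous   Stopgain   Stoploss
Agilent (N=22)   ≥2              1431            38         6
Agilent (N=22)   ≥5              1142            29         5
Agilent (N=22)   ≥10             892             19         4
1000G (N=6)      ≥2              491             10         1
1000G (N=6)      ≥5              337             6          1
1000G (N=6)      ≥10             233             5          1
Illumina (N=6)   ≥2              25              0          0
Illumina (N=6)   ≥5              0               0          0
Illumina (N=6)   ≥10             0               0          0

Non-exonic SNP counts (each line is one table column; values follow the row order above)
21741, 7362, 4766, 4561, 2784, 1419, 6114, 2408, 1058
48, 39, 37, 19, 12, 9, 0, 0, 0
9129, 5794, 4393, 648, 360, 194, 985, 501, 327
91480, 44269, 28673, 4658, 2815, 1624, 9659, 5344, 3498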
1. Variant is within 2-bp of a splicing junction
2. Variant overlaps a transcript without coding annotation in the gene definition
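Footnote 1 can be read as a simple distance rule. The sketch below is one possible interpretation, not the annotation tool's actual logic: it assumes 1-based inclusive exon coordinates and counts only intronic positions within 2 bp of an exon boundary as "splicing". The function name `is_splicing` and the example coordinates are hypothetical.

```python
def is_splicing(pos, exons, window=2):
    """Return True if pos is intronic and within `window` bp of an exon edge.

    exons: list of (start, end) tuples, 1-based and inclusive (assumed).
    Positions inside an exon are treated as exonic, not splicing.
    """
    if any(start <= pos <= end for start, end in exons):
        return False
    return any(
        0 < start - pos <= window or 0 < pos - end <= window
        for start, end in exons
    )


if __name__ == "__main__":
    exons = [(100, 200), (300, 400)]  # made-up exon coordinates
    for pos in (97, 98, 99, 100, 201, 202, 203):
        print(pos, is_splicing(pos, exons))
```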
Mitochondria
• Mitochondria play an important role in cellular energy
metabolism, free radical generation, and apoptosis
(Andrews, Kubacka et al. 1999; Verma and Kumar 2007).
• Mitochondrial DNA (mtDNA) is a maternally inherited,
16,569-bp closed-circle genome that encodes two rRNAs,
22 tRNAs, and 13 polypeptides.
• Mitochondrial dysfunction is an important
cause of many neurological diseases (Fernandez-Vizarra,
Bugiani et al. 2007) and drug toxicities (Lemasters, Qian et
al. 1999; Wallace and Starkov 2000) and may contribute to
carcinogenesis and tumor progression (Modica-Napolitano
and Singh 2004; Chen 2012).
Mitochondria Extraction Strategy
Results
Virus
• Known oncogenic viruses are estimated to cause 15 to
20 percent of all cancers in humans (Parkin 2006).
• Understanding the viral integration pattern of cancer-associated viruses may uncover novel oncogenes and
tumor suppressors that are associated with cellular
transformation.
• Viral genomes have been detected using off-target
exome sequencing reads (Barzon, Lavezzo et al. 2011;
Li and Delwart 2011; Chevaliez, Rodriguez et al. 2012;
Radford, Chapman et al. 2012; Capobianchi, Giombini
et al. 2013).
One example using HNSCC
Virus Detection in HNSCC in TCGA
Site            clin_hpv_ish   clin_hpv_p16   ExomeSeq   low_pass   RNAseq   HPV
Buccal Mucosa   0              0              0          0          0        0
Buccal Mucosa   0              0              0          0          0        0
Buccal Mucosa   0              0              0          0          0        0
Buccal Mucosa   0              0              0          0          0        0
Buccal Mucosa   0              0              0          0          0        0
Buccal Mucosa   0              0              0          0          0        0
Buccal Mucosa   0              0              0          0          0        0
Buccal Mucosa   0              0              0          0          0        0
Oropharynx      0              0              1          0          0        1
Oropharynx      0              0              0          0          0        0
Oropharynx      0              0              0          0          0        0
Tonsil          1              1              1          0          1        4
Tonsil          1              1              1          0          1        4
Tonsil          1              1              1          0          1        4
Tonsil          0              0              1          1          1        4
Tonsil          0              0              1          1          1        4
Tonsil          0              1              1          0          1        3
Tonsil          1              0              1          0          1        3
Tonsil          0              0              0          1          1        3
Tonsil          0              0              1          0          1        2
Tonsil          0              0              1          0          1        2
Tonsil          0              0              1          0          1        2
Tonsil          0              0              1          0          1        2
Tonsil          0              0              1          0          1        2
Tonsil          0              0              1          0          1        2
Tonsil          0              0              1          0          0        1
Existing Tools
• PathSeq (Kostic, Ojesina et al. 2011)
• VirusSeq (Chen, Yao et al. 2012)
• ViralFusionSeq (Li, Wan et al. 2013)
SNP and Somatic Mutation
Identification using RNAseq Data
• Traditionally, somatic mutations are detected
using Sanger sequencing or RT-PCR by comparing
paired tumor and normal samples. One obvious
limitation of such methods is that we have to
limit our search to a certain genomic region of
interest.
• With the maturity of next generation sequencing,
we can now screen all coding genes or even the
whole genome for somatic mutations at a
reasonable cost.
Why do we want to detect mutations in
RNAseq data?
• You don't have DNA sequencing data
• Detecting mutations was not the original goal,
but why not?
• There is much more RNAseq data than
DNAseq data
• A mutation in RNA is more relevant than a
mutation in DNA
Difficulties
• Not enough depth in non-expressed genes
to detect mutations
• Reverse transcription of RNA to cDNA introduces
additional errors
• Hard to distinguish mutations from RNA editing
• In summary, somatic mutation detection using
RNAseq data produces many more false
positives.
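The difficulties listed above suggest the kind of filters a tumor/normal comparison on RNAseq data might apply. The sketch below is a hedged illustration, not the caller referred to in this talk: the `Candidate` record, the `keep` function, and the thresholds (`min_depth=10`, `min_alt=3`) are hypothetical. It requires enough depth in both samples, requires the alternate allele to be absent from the normal, and sets aside A>G changes (T>C on the opposite strand), the signature of A-to-I RNA editing.

```python
from dataclasses import dataclass

# Substitutions that look like A-to-I RNA editing on either strand.
EDITING_LIKE = {("A", "G"), ("T", "C")}


@dataclass
class Candidate:
    """Hypothetical record for one candidate site in a tumor/normal pair."""
    chrom: str
    pos: int
    ref: str
    alt: str
    tumor_depth: int
    tumor_alt: int
    normal_depth: int
    normal_alt: int


def keep(c, min_depth=10, min_alt=3):
    """Apply illustrative filters; thresholds are assumptions, not tuned values."""
    if c.tumor_depth < min_depth or c.normal_depth < min_depth:
        return False  # gene not expressed highly enough in one sample
    if c.tumor_alt < min_alt or c.normal_alt > 0:
        return False  # weak tumor support, or allele also seen in normal
    if (c.ref, c.alt) in EDITING_LIKE:
        return False  # cannot be told apart from RNA editing here
    return True


if __name__ == "__main__":
    sites = [
        Candidate("chr1", 1000, "C", "T", 40, 12, 35, 0),
        Candidate("chr1", 2000, "A", "G", 50, 20, 45, 0),
        Candidate("chr2", 3000, "G", "A", 6, 4, 30, 0),
    ]
    for site in sites:
        print(site.chrom, site.pos, keep(site))
```

Dropping every A>G site is deliberately conservative: it also discards genuine A>G somatic mutations, trading sensitivity for fewer false positives.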
Somatic Mutation Caller Designed
Specifically for RNAseq Data
Other Ways you can mine your data
Summary
• Get your priorities right: never design a study
just for secondary analysis targets
• If you have old data, think about what else you can
do with it, and try to realize the full potential of
your data
• At VANGARD, we help you with your basic
genomic data analysis needs
• Advanced data analysis can be done through
collaboration.
Acknowledgement
• Yu Shyr
• Tiger Sheng
• Chung-I Li
• Jiang Li
• Mike Guo
• David Samuels
• Chun Li