Finding the Lost Treasure of NGS Data
Yan Guo, PhD

Modules Overview for DNA-sequence
[Workflow diagram. Labels: Exome / whole-Genome, fastq files, FastQC, bwa alignment, Bam files, bamQC, GATK refinement, realignment, recalibration, mark-duplication, dbsnp / indel resources, SNP/INDEL, best practice filter, vcf files, somatic mutation, gene coding changes, gene-level analysis, gene associates, structural variant analysis, translocation, inversion, copy number variants]

RNAseq
[Workflow diagram. Labels: fastq files, FastQC, tophat alignment, Bam files, SeQC, refinement, cufflinks annotations, cuffmerge, cuffdiff comparisons, gene quantification, novel genes discovery, gene-fusion analysis, genes identifying, Gene List, cluster, functional / pathway]

What do you expect to find in NGS data?
DNAseq
• SNPs
• Somatic Mutations
• Small Indels
• Large Structural Change
• CNV
RNAseq
• Gene expression difference
• Splicing Variants
• Fusion Genes

What you don’t expect to find in NGS data?
Exome sequencing reads: Is mapped?
  No: Unmapped DNA reads
    • Virus/Microbe DNA
    • Contamination
  Yes: Mapped reads: Is targeted?
    Yes: Targeted DNA
    No: Untargeted DNA
      • Intronic DNA
      • Intergenic DNA
      • Mitochondrial DNA

Exome Capture
[Figure: exome capture]

Why do we care about intron and intergenic regions
• Some introns can encode specific proteins and can be processed after splicing to form noncoding RNA molecules. (Rearick, Prakash et al.
2011)
• The majority of the GWAS SNPs are not in coding regions (706 exon, 3986 intron, 3323 intergenic)
• The ENCODE Project: ENCyclopedia Of DNA Elements

GWAS catalog SNPs
Kit                Target total bases   Missing exon SNPs   Missing intron SNPs   Missing intergenic SNPs
SureSelect (v2)    37627747             387                 3946                  3323
TruSeq             62085286             206                 3980                  3320
SeqCap EZ (v3.0)   64190747             326                 3880                  3317

Table headers: Samples, Average depth, Intergenic, Intronic, Splicing (1), ncRNA (2), Exonic (Nonsynonymous, Stopgain, Stoploss)

Non-exonic counts, four columns in source order (the mapping of these columns to the Intergenic, Intronic, Splicing, and ncRNA headers is not recoverable from the source):
Samples           Depth
Agilent (N=22)    ≥2       21741   48   9129   91480
Agilent (N=22)    ≥5        7362   39   5794   44269
Agilent (N=22)    ≥10       4766   37   4393   28673
1000G (N=6)       ≥2        4561   19    648    4658
1000G (N=6)       ≥5        2784   12    360    2815
1000G (N=6)       ≥10       1419    9    194    1624
Illumina (N=6)    ≥2        6114    0    985    9659
Illumina (N=6)    ≥5        2408    0    501    5344
Illumina (N=6)    ≥10       1058    0    327    3498

Exonic counts:
Samples           Depth    Nonsynonymous   Stopgain   Stoploss
Agilent (N=22)    ≥2       1431            38         6
Agilent (N=22)    ≥5       1142            29         5
Agilent (N=22)    ≥10       892            19         4
1000G (N=6)       ≥2        491            10         1
1000G (N=6)       ≥5        337             6         1
1000G (N=6)       ≥10       233             5         1
Illumina (N=6)    ≥2         25             0         0
Illumina (N=6)    ≥5          0             0         0
Illumina (N=6)    ≥10         0             0         0

1. Variant is within 2 bp of a splicing junction
2. Variant overlaps a transcript without coding annotation in the gene definition

Mitochondria
• Mitochondria play an important role in cellular energy metabolism, free radical generation, and apoptosis (Andrews, Kubacka et al. 1999; Verma and Kumar 2007).
• Mitochondrial DNA (mtDNA) is a maternally inherited 16,569-bp closed-circle genome that encodes two rRNAs, 22 tRNAs, and 13 polypeptides.
• Mitochondrial dysfunction is an important cause of many neurological diseases (Fernandez-Vizarra, Bugiani et al. 2007) and drug toxicities (Lemasters, Qian et al. 1999; Wallace and Starkov 2000) and may contribute to carcinogenesis and tumor progression (Modica-Napolitano and Singh 2004; Chen 2012).

Mitochondria Extraction Strategy

Results

Virus
• Known oncogenic viruses are estimated to cause 15 to 20 percent of all cancers in humans (Parkin 2006).
• Understanding the viral integration pattern of cancer-associated viruses may uncover novel oncogenes and tumor suppressors that are associated with cellular transformation.
• Viral genomes have been detected using off-target exome sequencing reads (Barzon, Lavezzo et al.
2011; Li and Delwart 2011; Chevaliez, Rodriguez et al. 2012; Radford, Chapman et al. 2012; Capobianchi, Giombini et al. 2013).

One example using HNSCC

Virus Detection in HNSCC in TCGA
Each series below lists one value per tumor, in the order 8 Buccal Mucosa | 3 Oropharynx | 15 Tonsil. Headers are shown where they appear in the source, each followed by the values that follow it there.
Site:                        Buccal Mucosa (8) | Oropharynx (3) | Tonsil (15)
clin_hpv_ish clin_hpv_p16:   0 0 0 0 0 0 0 0 | 0 0 0 | 1 1 1 0 0 0 1 0 0 0 0 0 0 0 0
ExomeSeq:                    0 0 0 0 0 0 0 0 | 0 0 0 | 1 1 1 0 0 1 0 0 0 0 0 0 0 0 0
low_pass:                    0 0 0 0 0 0 0 0 | 1 0 0 | 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1
RNAseq:                      0 0 0 0 0 0 0 0 | 0 0 0 | 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0
HPV:                         0 0 0 0 0 0 0 0 | 0 0 0 | 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
                             0 0 0 0 0 0 0 0 | 1 0 0 | 4 4 4 4 4 3 3 3 2 2 2 2 2 2 1

Existing Tools
• PathSeq (Kostic, Ojesina et al. 2011)
• VirusSeq (Chen, Yao et al. 2012)
• ViralFusionSeq (Li, Wan et al. 2013)

SNP and Somatic Mutation Identification using RNAseq Data
• Traditionally, somatic mutations are detected using Sanger sequencing or RT-PCR by comparing paired tumor and normal samples. One obvious limitation of such methods is that the search has to be limited to a specific genomic region of interest.
• With the maturity of next-generation sequencing, we can now screen all coding genes or even the whole genome for somatic mutations at a reasonable cost.

Why do we want to detect mutation in RNAseq data?
• You don’t have DNA sequencing data
• Detecting mutations was not the original goal, but why not?
• There is much more RNAseq data than DNAseq data
• A mutation observed in RNA is expressed, so it may be more relevant than a mutation observed only in DNA

Difficulties
• Not enough depth in non-expressed genes to detect mutations
• Reverse transcription of RNA to cDNA introduces additional errors
• Hard to distinguish mutations from RNA editing
• In summary, somatic mutation detection using RNAseq data yields many more false positives.
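To make the difficulties listed above concrete, here is a minimal Python sketch of how each one might become a filter on candidate calls from a paired tumor/normal RNAseq experiment. It is an illustration only, not the caller presented in this talk. The `Candidate` record, the threshold values, and the `known_editing_sites` set are all hypothetical; real thresholds depend on the data and the caller.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One candidate somatic variant (hypothetical record layout)."""
    chrom: str
    pos: int               # 1-based position
    ref: str
    alt: str
    tumor_depth: int       # total reads covering the site in tumor
    tumor_alt_reads: int   # reads supporting the alternate allele in tumor
    normal_depth: int      # total reads covering the site in normal
    normal_alt_reads: int  # reads supporting the alternate allele in normal


# Hypothetical thresholds, chosen only for illustration.
MIN_DEPTH = 10            # low expression gives too little depth to call
MIN_ALT_READS = 4         # guards against isolated reverse-transcription errors
MAX_NORMAL_ALT_READS = 0  # alt reads in normal suggest germline or artifact


def is_possible_rna_editing(c, known_editing_sites):
    """A-to-I editing reads as A>G (T>C on the opposite strand)."""
    looks_like_a_to_i = (c.ref, c.alt) in {("A", "G"), ("T", "C")}
    return looks_like_a_to_i and (c.chrom, c.pos) in known_editing_sites


def filter_candidates(candidates, known_editing_sites):
    """Keep candidates that pass the depth, support, and editing filters."""
    kept = []
    for c in candidates:
        if c.tumor_depth < MIN_DEPTH or c.normal_depth < MIN_DEPTH:
            continue  # not enough depth in one of the two samples
        if c.tumor_alt_reads < MIN_ALT_READS:
            continue  # too few supporting reads
        if c.normal_alt_reads > MAX_NORMAL_ALT_READS:
            continue  # alternate allele also seen in the normal sample
        if is_possible_rna_editing(c, known_editing_sites):
            continue  # matches a known RNA editing site
        kept.append(c)
    return kept
```

Calling `filter_candidates(candidates, known_editing_sites)` with a list of `Candidate` records and a set of `(chrom, pos)` tuples returns the surviving candidates in input order. Requiring depth in both samples is a deliberate choice: without coverage in the normal sample, a germline variant cannot be told apart from a somatic one.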
Somatic Mutation Caller Designed Specifically for RNAseq Data

Other Ways you can mine your data

Summary
• Get your priorities right: never design a study just for secondary analysis targets
• If you have old data, think about what else you can do with it and try to realize its full potential
• At VANGARD, we help you with your basic genomic data analysis needs
• Advanced data analysis can be done through collaboration.

Acknowledgement
• Yu Shyr
• Tiger Sheng
• Chung-I Li
• Jiang Li
• Mike Guo
• David Samuels
• Chun Li