Inference of Allele Specific Isoform Expression (ASIE) Levels from RNA-Seq Data
Sahar Al Seesi and Ion Măndoiu
Computer Science and Engineering
CANGS 2012

Outline
• Problem definition
• Challenges and limitations of current approaches
• ASIE pipeline
  – SNVQ
  – ReFHap
  – Diploid IsoEM
• Results

Gene/Isoform Expression Estimation
[Figure: Make cDNA & shatter into fragments -> Sequence fragment ends -> Map reads; outputs: Gene Expression (GE) and Isoform Expression (IE); segments labeled A to E]

Allele Specific Gene/Isoform Expression Estimation
[Figure: the same workflow applied to haplotypes H0 and H1; outputs: Allele Specific Gene Expression (GE) and Allele Specific Isoform Expression (IE); segments labeled A to E]

Challenges and limitations of current approaches
• Need for diploid transcriptome
• Existing studies rely on simple allele coverage analysis at heterozygous SNP sites
  – Not isoform specific
  – Read mapping bias towards the reference allele
  – Use less information, hence less robust estimates

Pipeline for ASIE from RNA-Seq Reads
[Figure: pipeline overview]

Hybrid Approach Based on Merging Alignments
[Figure: mRNA reads -> Transcript Library Mapping -> Transcript mapped reads; mRNA reads -> Genome Mapping -> Genome mapped reads; both -> Read Merging -> Mapped reads]

Merging Rules for Short Reads

Genome       Transcripts   Agree?   Hard Merge
Unique       Unique        Yes      Keep
Unique       Unique        No       Throw
Unique       Multiple      No       Throw
Unique       Not mapped    No       Keep
Multiple     Unique        No       Throw
Multiple     Multiple      No       Throw
Multiple     Not mapped    No       Throw
Not mapped   Unique        No       Keep
Not mapped   Multiple      No       Throw
Not mapped   Not mapped    Yes      Throw

Merging Local Alignments of ION Reads: HardMerge at Base-Level
• Input: SAM files with alignments from genome and transcriptome mapping
• The following alignments are filtered out
  – Any local alignment of length <= 15 bases
  – All alignments of a read that has alignments on different chromosomes or different strands
• Key idea: a read base mapped to multiple locations is discarded
• Output alignments are generated from contiguous stretches of unambiguously mapped bases, based on the unique genomic location of these bases
  – Subject to the above filtering criteria

HardMerge Example
[Figure: Input alignments in genome coordinates; Filter multiple local alignments/sub-alignments; Output alignment]

SNV Detection and Genotyping
J. Duitama, P.K. Srivastava, and I.I. Mandoiu, Towards Accurate Detection and Genotyping of Expressed Variants from Whole Transcriptome Sequencing Data, BMC Genomics 13(Suppl 2):S6, 2012
• A reliable hybrid mapping strategy
• Bayesian model for SNV detection based on quality scores

SNVQ Model
• Calculate conditional probabilities by multiplying contributions of individual reads

Accuracy per Coverage Bins
[Figure: accuracy per coverage bin]

ReFHap
J. Duitama, T. Huebsch, G. McEwen, E. Suk, and M.R. Hoehe, ReFHap: A Reliable and Fast Algorithm for Single Individual Haplotyping, Proc. 1st ACM Intl. Conf. on Bioinformatics and Computational Biology, pp. 160-169, 2010
• Problem Formulation
  – Alleles for each locus are encoded with 0 and 1
  – Fragment: aligned read showing co-occurrence of two or more alleles in the same chromosome copy

Locus   1  2  3  4  5  6  7  8  9  ...
f       -  0  1  1  -  1  -  0  0  ...

Problem Formulation
• Input: Matrix M of m fragments covering n loci

Locus   1  2  3  4  5  ...  n
f1      1  1  0  -  1       -
f2      -  0  1  0  0       1
f3      -  0  0  0  1       -
...
fm      -  -  -  -  1       0

ReFHap vs HapCUT
[Figure: comparison of ReFHap and HapCUT]

IsoEM: Isoform Expression Level Estimation
• Expectation-Maximization algorithm
• Unified probabilistic model incorporating
  – Single and/or paired reads
  – Fragment length distribution
  – Strand information
  – Base quality scores
  – Repeat and hexamer bias correction

Read-isoform compatibility
$w_{r,i} = \sum_{a} O_a Q_a F_a$

Fragment length distribution
• Paired reads
[Figure: a read pair aligned to isoform i (A B C) and isoform j (A C), giving fragment lengths F_a(i) and F_a(j)]

IsoEM vs. Cufflinks 1.0.3 on ION reads
[Figure: bar chart of R2 for IsoEM/Cufflinks estimates vs qPCR; bars: IsoEM HBR, Cufflinks HBR, IsoEM UHR, Cufflinks UHR; y-axis from 0 to 0.8]

Simplified Pipeline for ASIE in F1 Hybrids
[Figure: Parental Genome Sequences and Reference Transcriptome -> Generate Isoform Sequences -> Diploid Transcriptome; Short Reads -> Align to Diploid Transcriptome -> IsoEM -> Allele Specific Expression Levels; illustrated with example FASTA reads, two >chrX parental sequences, and isoforms ABC and AC]

Allele Specific Read Mapping

Whole Brain RNA-Seq Data - Sanger Institute Mouse Genomes Project

Strain         C57BL   BALBc       A/J         CAST         SPRET
SNPs           9,844   3,920,925   4,198,324   17,673,726   35,441,735
Private SNPs   1,488   29,973      44,837      5,368,019    23,455,525

Strain/Hybrid   Number of read pairs   Number of mapped read pairs   Percentage of mapped pairs
C57BL           57,187,342             21,756,070                    38.044
BALBc           62,465,347             28,358,653                    45.399
A/J             46,993,887             22,449,227                    47.771
CAST            54,569,423             22,307,194                    40.879
SPRET           57,411,555             19,016,949                    33.124
C57BLxBALBc     114,374,684            47,682,108                    41.689
C57BLxAJ        93,987,774             35,353,398                    37.615
C57BLxCAST      109,138,846            43,134,951                    39.523
C57BLxSPRET     114,374,684            40,780,806                    35.655

Hybrid (C57BLxStrain)   C57BL IE Pearson   Strain IE Pearson   C57BL GE Pearson   Strain GE Pearson
C57BLxSPRET             0.952              0.726               0.951              0.725
C57BLxBALBc             0.705              0.675               0.706              0.675
C57BLxAJ                0.855              0.902               0.856              0.903
C57BLxCAST              0.872              0.824               0.924              0.882

Correlation between FPKM values, for each strain, inferred from the separate strain RNA-Seq reads vs. the pooled reads of the two strains (synthetic hybrid)

Allele Specific Isoform Expression for Synthetic Hybrid C57BLxAJ
[Figure: two scatter plots of FPKM values from separate strain reads vs. pooled reads; R2 = 0.73 and R2 = 0.81]

Allele Specific Isoform Expression for Synthetic Hybrid C57BLxCAST
[Figure: two scatter plots of FPKM values from separate strain reads vs. pooled reads; R2 = 0.76 and R2 = 0.68]

Allele Specific Expression on Drosophila RNA-Seq data from [McManus et al. 10]
[Figure: two log-scale scatter plots; axes: D.Mel. vs. D.Mel. in Parental Pool, and D.Sec. vs. D.Sec. in Parental Pool; reported R² values: 0.8922 and 0.9333]

Allele Specific Expression for Mouse RNA-Seq Data from [Gregg et al. 2010]

Conclusion
• Proposed novel RNA-Seq analysis pipeline
  – Reconstructs diploid transcriptome
  – Not affected by mapping bias towards reference allele
  – Estimation of allele specific expression levels of isoforms
  – Robust estimation based on all reads

What's Next?
• Test whole pipeline
• Use read coverage information at SNVs, along with max cut sizes in ReFHap, to phase isolated SNPs
• Incorporate flowgram data, when available, in SNV detection
• Deploy on Galaxy
• Develop ASIE plugin for ION Torrent

Acknowledgments
• Ion Mandoiu (UConn)
• Jorge Duitama (KU Leuven)
• Marius Nicolae (UConn)
• Alex Zelikovsky (GSU)
• Serghei Mangul (GSU)
• Adrian Caciula (GSU)
• Dumitru Brinza (Life Tech)
• Pramod Srivastava (UCHC)
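
Supplementary sketch: Merging Rules for Short Reads
The table on the "Merging Rules for Short Reads" slide can be read as a small decision function. The Python below is only an illustrative sketch of that table, not the authors' implementation: the names Status and hard_merge_keep are hypothetical, and the "Agree?" column is modeled as a boolean that matters only when both mappings are unique.

```python
from enum import Enum


class Status(Enum):
    """Mapping status of a read against one reference (hypothetical names)."""
    UNIQUE = "unique"
    MULTIPLE = "multiple"
    NOT_MAPPED = "not mapped"


def hard_merge_keep(genome: Status, transcripts: Status, agree: bool) -> bool:
    """Return True if the table says Keep, False if it says Throw.

    `agree` is only consulted when both mappings are unique; this is an
    assumption about how the "Agree?" column is meant to be used.
    """
    # Both unique: keep only when the two alignments agree.
    if genome is Status.UNIQUE and transcripts is Status.UNIQUE:
        return agree
    # Unique in one reference and unmapped in the other: keep.
    if genome is Status.UNIQUE and transcripts is Status.NOT_MAPPED:
        return True
    if genome is Status.NOT_MAPPED and transcripts is Status.UNIQUE:
        return True
    # Every remaining row of the table (any multiple mapping, or a read
    # mapped nowhere) is thrown away.
    return False


if __name__ == "__main__":
    for g in Status:
        for t in Status:
            verdict = "Keep" if hard_merge_keep(g, t, agree=True) else "Throw"
            print(f"{g.value:<11} {t.value:<11} {verdict}")
```

Encoding the table as three explicit keep cases plus a default throw mirrors the conservative intent of Hard Merge: any ambiguity in either mapping discards the read.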