RNA-Seq based discovery and reconstruction of unannotated transcripts in partially annotated genomes

Serghei Mangul*, Adrian Caciula*, Ion Mandoiu** and Alexander Zelikovsky*
*Georgia State University, **University of Connecticut

Introduction

Genes, Exons, Introns, and Splicing

Gene - a segment of DNA or RNA that carries genetic information.
Exon - a region of a gene that is translated into protein.
Intron - a region of a gene that is not translated into protein.
Splicing - a process in which the introns are removed and the exons are joined to be translated into a single protein.
Alternative splicing - the process in which exons can be spliced in different combinations, called transcripts, to generate the mature RNA.

Fig. 1. Chromosome with its DNA
Fig. 2. Alternative Splicing Process

Alternative splicing is a common mode of gene regulation within cells, being used by 90–95% of human genes. It can drastically alter the function of a gene in different tissue types or environmental conditions, or even inactivate the gene completely.

Discovery and Reconstruction of Unannotated Transcripts (DRUT)

GIVEN: A set of transcripts and frequencies for the reads.
FIND: Transcripts missing from the set.

Expectation Maximization (EM)

INITIALIZATION: Uniform transcript frequencies f(j).
E STEP: Compute the expected number n(j) of reads sampled from transcript j (assuming the current transcript frequencies f(j)).
M STEP: For each transcript j, set f(j) = the portion of reads emitted by transcript j among all reads in the sample.
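The EM steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `em_transcript_frequencies`, the `weights[j][i]` matrix of match weights between transcript j and read i, the `read_counts` list, and the fixed iteration count are all hypothetical choices made for the example.

```python
def em_transcript_frequencies(weights, read_counts, iterations=200):
    """Estimate transcript frequencies f(j) with a plain EM loop.

    weights[j][i]  -- assumed match weight between transcript j and read i
                      (0 when read i cannot be emitted by transcript j)
    read_counts[i] -- number of times read i was observed
    """
    num_transcripts = len(weights)
    num_reads = len(read_counts)

    # INITIALIZATION: uniform transcript frequencies.
    f = [1.0 / num_transcripts] * num_transcripts

    for _ in range(iterations):
        # E STEP: expected number n(j) of reads sampled from transcript j.
        # Each read is split among its compatible transcripts in
        # proportion to f(j) * weight.
        n = [0.0] * num_transcripts
        for i in range(num_reads):
            shares = [f[j] * weights[j][i] for j in range(num_transcripts)]
            norm = sum(shares)
            if norm == 0.0:
                continue  # read matches no candidate transcript
            for j in range(num_transcripts):
                n[j] += read_counts[i] * shares[j] / norm

        # M STEP: f(j) = n(j) / (n(1) + ... + n(N)).
        assigned = sum(n)
        if assigned == 0.0:
            break  # no read could be assigned; keep current frequencies
        f = [n_j / assigned for n_j in n]

    return f


# Toy panel: read 0 matches only transcript 0, read 1 matches both,
# read 2 matches only transcript 1.
toy_weights = [[1.0, 1.0, 0.0],
               [0.0, 1.0, 1.0]]
toy_counts = [60, 30, 10]
print(em_transcript_frequencies(toy_weights, toy_counts))
```

A fixed number of iterations keeps the sketch short; a practical version would more likely stop once the change in f(j) between iterations falls below a tolerance.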
Maximum Likelihood (ML) Model

ML Problem:
GIVEN: Annotations (transcripts) and frequencies of the reads.
FIND: ML estimate of transcript frequencies.

Input data of EM is a panel: a bipartite graph consisting of
- a set of candidate transcripts that are believed to emit the set of reads
- a weighted match based on the mapping of read i to transcript j ($h_{T_j,i}$)

Fig. 3. Panel: a bipartite graph consisting of transcripts with unknown frequencies and reads with observed frequencies ($o_i$)
- LEFT: transcripts -> unknown frequencies
- RIGHT: reads -> observed frequencies
- EDGES: weights ~ probability of the read to be emitted by the transcript

ML Estimates of Transcript Frequencies
- The probability that a read is sampled from transcript j is proportional to f(j).
- f(j) is the (unknown) frequency of transcript j.
- n(j) denotes the number of reads sampled from transcript j.
- The ML estimate for f(j) is given by n(j)/(n(1) + ... + n(N)).

Quality of ML Model

The possible gaps in the ML model include:
- erroneous reads caused by genotyping errors
- missing and/or chimeric candidate transcripts
- an inaccurate read-to-transcript match (caused by genotyping errors)
- non-uniform emission of reads by transcripts

Measure the quality of the ML model by the deviation D of observed reads from expected reads:

$$D = \frac{\sum_{i} |o_i - e_i|}{|R|}$$

where |R| is the number of reads. Expected read frequencies ($e_i$) are calculated based on the weighted match between reads and transcripts:

$$e_i = \sum_{j:\, h_{T_j,i} > 0} \frac{h_{T_j,i}}{\sum_{l} h_{T_j,l}} \, f_j^{ML}$$

where $f_j^{ML}$ are the maximum likelihood frequency estimates of the transcripts.

Fig. 4 shows the relation between transcripts, exons and reads.
Fig. 4. Transcripts - Exons - Reads Relation.

DRUT

Alternative splicing is implicated in many diseases.

SUBPROBLEMS:
- Decide if the panel is likely to be incomplete
- Estimate the total frequency of missing transcripts
- Identify the read spectrum emitted by missing transcripts
- Assemble missing transcripts from the read spectrum emitted by missing transcripts

a) Map reads to annotated transcripts (using Bowtie)
b) VTEM: identify "overexpressed" exons (possibly from unannotated transcripts)
c) Assemble transcripts (e.g., Cufflinks) using reads from "overexpressed" exons and unmapped reads
d) Output: annotated transcripts + novel transcripts

Figure labels: unspliced reads, spliced reads, annotated transcript, overexpressed exons, novel transcript.

Virtual Transcript Expectation Maximization (VTEM)

Fig. 7. VTEM

Fig. 8. An example of VTEM estimation
O - observed exon frequencies
E - expected exon frequencies
VT - virtual transcripts with $h_{T_j,i} = 0$
ML - estimated transcript frequency

Complete annotations: Observed = Expected, so there is nothing to update. The VT frequency stays 0: no false positives.

Partial annotations (T3 missing):
- O < E: decrease VT weights (to 0)
- O > E (overexpressed exons): increase VT weights
- Over the 1st, 2nd and 3rd runs the VT frequency increases and the deviation of expected from observed decreases.
- VT frequency (.2) ≈ T3 frequency (.25); VT's exons (E3, E4) = T3's exons (E3, E4)

Experimental Results

Simulation Setup:
- human genome data (UCSC hg18)
- UCSC database: 66,803 isoforms, 19,372 genes
- single error-free reads: 60M of length 100bp
- for the partially annotated genome: remove exactly one isoform from every gene

Fig. 9. a) Sensitivity and b) PPV of the methods grouped by the number of transcripts per gene. Here, 60M single reads of length 100bp are simulated.
Panels: a) Sensitivity (DRUT, RABT, Cufflinks, existing annotations); b) PPV (DRUT, RABT, Cufflinks). X-axis: number of transcripts per gene (2, 3, 4, 5, 6, 7, 8, >8).

Fig. 9(a) shows that it is more difficult to correctly reconstruct all transcripts in genes with more transcripts. As a result, Cufflinks performs better on genes with few transcripts, since annotations are not used in its standard settings. DRUT has higher sensitivity on genes with 2 and 3 transcripts, but RABT is better on genes with 4 transcripts. For genes with more than 4 transcripts, the performance of annotation-guided methods is equal to the "existing annotations ratio", which means that these methods are unable to reconstruct unannotated transcripts.

* Cufflinks is a well-known tool for transcriptome reconstruction [2].

References
1. S. Mangul, I. Astrovskaya, M. Nicolae, B. Tork, I. Mandoiu, and A. Zelikovsky, "Maximum likelihood estimation of incomplete genomic spectrum from HTS data," in Proc. 11th Workshop on Algorithms in Bioinformatics, 2011.
2. C. Trapnell, B. Williams, G. Pertea, A. Mortazavi, G. Kwan, M. van Baren, S. Salzberg, B. Wold, and L. Pachter, "Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation," Nature Biotechnology, vol. 28, no. 5, pp. 511-515, 2010.