An Integer Programming Approach to Novel Transcript Reconstruction from Paired-End RNA-Seq Reads

Serghei Mangul
Department of Computer Science, Georgia State University
Joint work with Adrian Caciula, Sahar Al Seesi, Dumitru Brinza, Abdul Rouf Banday, Rahul N. Kanadia, Ion Mandoiu and Alex Zelikovsky

Advances in Next Generation Sequencing
• Illumina HiSeq 2000
  – Up to 6 billion PE reads/run
  – 35-100bp read length
• Roche/454 FLX Titanium
  – 400-600 million reads/run
  – 400bp avg. length
• SOLiD 4/5500
  – 1.4-2.4 billion PE reads/run
  – 35-50bp read length
• Ion Proton Sequencer
http://www.economist.com/node/16349358

RNA-Seq
• Make cDNA & shatter into fragments
• Sequence fragment ends
• Map reads
[Figure: reads mapped over exons A-E; panels: Gene Expression, Transcriptome Reconstruction, Isoform Expression]

Transcriptome Reconstruction
• Given partial or incomplete information about something, use that information to make an informed guess about the missing or unknown data.

Transcriptome Reconstruction Types
• Genome-independent reconstruction (de novo)
  – de Bruijn k-mer graph
• Genome-guided reconstruction (ab initio)
  – Spliced read mapping
  – Exon identification
  – Splice graph
• Annotation-guided reconstruction
  – Use existing annotation (known transcripts)
  – Focus on discovering novel transcripts

Previous approaches
• Genome-independent reconstruction
  – Trinity(2011), Velvet(2008), TransABySS(2008)
• Genome-guided reconstruction
  – Scripture(2010)
    • Reports “all” transcripts
  – Cufflinks(2010), IsoLasso(2011), SLIDE(2012)
    • Minimizes set of transcripts explaining reads
• Annotation-guided reconstruction
  – RABT(2011), DRUT(2011)

Challenges and Solutions
• Challenge: read length is currently much shorter than transcript length
• Solutions
  – Paired-end reads
  – Fragment length distribution

[Figure: a gene with exons 1 2 3 4 5 6 7 and four candidate transcripts, exon lists as extracted]
t1 : 1 2 3 4 5 6 7
t2 : 1 3 4 5 6 7
t3 : 1 3 4 5 7
t4 : 1 3 4 5 7
Exons 2 and 6 are “distant” exons: how to phase them?
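A minimal sketch of how paired-end reads plus a fragment length distribution can help with the phasing question. It computes the fragment length that one read pair would imply on each candidate transcript and scores it against an assumed distribution. The exon lengths (100 bp each), the four transcripts, the read-pair coordinates, and the mean/standard deviation (500/50) are illustrative assumptions, not values taken from the figure.

```python
# Hypothetical inputs for illustration only: exon lengths, candidate
# transcripts, and read-pair coordinates are assumptions.
EXON_LEN = {e: 100 for e in range(1, 8)}           # exon id -> length in bp
CANDIDATES = {
    "tA": [1, 2, 3, 4, 5, 6, 7],
    "tB": [1, 3, 4, 5, 6, 7],
    "tC": [1, 2, 3, 4, 5, 7],
    "tD": [1, 3, 4, 5, 7],
}


def implied_fragment_length(transcript, exon_len, start, end):
    """Fragment length implied by a read pair on one transcript.

    start = (exon, 0-based offset of the fragment's first base in that exon)
    end   = (exon, 0-based offset of the fragment's last base in that exon)
    Returns None if the transcript lacks either exon or has them out of order.
    """
    if start[0] not in transcript or end[0] not in transcript:
        return None
    i, j = transcript.index(start[0]), transcript.index(end[0])
    if i > j:
        return None
    if i == j:
        return end[1] - start[1] + 1
    inner = sum(exon_len[e] for e in transcript[i + 1:j])  # exons fully spanned
    return (exon_len[start[0]] - start[1]) + inner + (end[1] + 1)


def z_score(length, mean=500.0, sd=50.0):
    """Distance of an implied length from the assumed distribution, in std. devs."""
    return (length - mean) / sd


if __name__ == "__main__":
    # One read pair: fragment starts mid exon 1 and ends mid exon 7.
    pair_start, pair_end = (1, 50), (7, 49)
    for name, exons in CANDIDATES.items():
        length = implied_fragment_length(exons, EXON_LEN, pair_start, pair_end)
        print(name, length, z_score(length))
```

Under these assumptions the read pair fits the two transcripts that skip exactly one of exons 2 and 6 best, and fits the transcripts that include both or neither less well. A single pair narrows the candidates rather than deciding among them; many pairs would be combined in practice.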
TRIP: Transcriptome Reconstruction using Integer Programming
• Map the RNA-Seq reads to genome
• Construct Splice Graph - G(V,E)
  – V : exons
  – E : splicing events
• Candidate transcripts
  – depth-first-search (DFS)
• Filter candidate transcripts
  – fragment length distribution (FLD)
  – integer programming

Gene representation
[Figure: transcripts Tr1-Tr3 over exons e1-e6, partitioned into pseudo-exons pse1-pse7 with boundaries Spse1/Epse1 through Spse7/Epse7]
• Pseudo-exons - regions of a gene between consecutive transcriptional or splicing events
• Gene - set of non-overlapping pseudo-exons

Splice Graph
[Figure: splice graph over pseudo-exons 1-9 along the genome, with TSS and TES marked]

How to select?
• Select the smallest set of candidate transcripts that yields a good statistical fit between
  – the fragment length distribution empirically determined during library preparation
  – fragment lengths implied by mapped read pairs
[Figure: the same read pair implies a fragment length of 500 on transcript 1-2-3 (exon lengths 200, 200, 200) and 300 on transcript 1-3 (exon lengths 200, 200); fragment length distribution: mean 500, std. dev. 50]

Simplified IP Formulation
• Objective
  \[ \min \sum_{t \in T} y(t) \]
• Constraints
  \[ (1) \quad \sum_{t \in T(p)} y(t) \ge x(p) \quad \forall p \]
  \[ (2) \quad \sum_{p} x(p) \ge N_{reads} \]
  – (1): for each selected pe read, at least one transcript it maps to is selected
• Notation
  – T(p) - set of candidate transcripts on which paired-end read p can be mapped
  – y(t) - 1 if a candidate transcript t is selected, 0 otherwise
  – x(p) - 1 if the pe read p is selected to be mapped

IP Formulation
• Objective
  \[ \min \sum_{t \in T} y(t) \]
• Constraints
  – for each pe read from every category of std. dev., at least one transcript is selected
  – the number of pe reads mapped within each std. dev. category is restricted
  – each pe read is mapped with no more than one category of std. dev.
  – every splice junction is to be covered

Comparison on Simulated Data
\[ Sens = \frac{TP}{TP + FN} \qquad PPV = \frac{TP}{TP + FP} \qquad FScore = \frac{2 \cdot PPV \cdot Sens}{PPV + Sens} \]
• 100x coverage, 2x100bp pe reads, 500 mean fragment length, 10% sd
[Figure: Sens, PPV and FScore plots]

Influence of Sequencing Parameters
• Sens, PPV and FScore defined as on the previous slide
• TRIP-L : individual fragment lengths estimates
• 100x coverage, 2x100bp pe reads, 500 mean fragment length, 10%-100% sd
[Figure: Sens, PPV and FScore plots]

Results on Real RNA-Seq Data
• CD1 mouse retina RNA samples
• specific gene that has 33 annotated transcripts in Ensembl
• TRIP : 5 out of 10 transcripts, confirmed using qPCR
• Cufflinks : 3 out of 10 transcripts
• 6906 alignments for 22346 read pairs with read length of 68

Conclusions
• We introduced a novel method for transcriptome reconstruction from paired-end RNA-Seq reads.
• Our method :
  – exploits the distribution of fragment lengths
  – can use additional experimental data
    • TSS/TES (TRIP with TSS/TES)
    • individual fragment lengths estimates (TRIP-L)
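As an illustration of the Simplified IP Formulation, here is a hedged sketch that solves a toy instance by brute force instead of with an IP solver. It relies on the observation that, once the y(t) are fixed, the best choice is x(p) = 1 exactly when some selected transcript lies in T(p), so a selection is feasible when it covers at least the required number of reads. The function name, transcript names, read names, and read-to-transcript mapping are hypothetical, and exhaustive enumeration is only practical for a handful of candidates.

```python
from itertools import combinations

# Hypothetical toy instance: which candidate transcripts each pe read maps to.
CANDIDATES = ["t1", "t2", "t3"]
READ_TO_TRANSCRIPTS = {
    "p1": ["t1"],
    "p2": ["t1", "t2"],
    "p3": ["t2", "t3"],
    "p4": ["t3"],
}


def smallest_explaining_set(candidates, read_to_transcripts, n_reads):
    """Brute-force the simplified formulation.

    Minimize the number of selected transcripts (sum of y(t)) subject to:
      (1) a read counts as mapped only if one of its transcripts is selected
      (2) at least n_reads reads are mapped
    Returns a smallest feasible set, or None if no selection is feasible.
    """
    for k in range(len(candidates) + 1):          # try smaller sets first
        for chosen in combinations(candidates, k):
            selected = set(chosen)
            mapped = sum(
                1 for ts in read_to_transcripts.values() if selected & set(ts)
            )
            if mapped >= n_reads:
                return selected
    return None


if __name__ == "__main__":
    print(smallest_explaining_set(CANDIDATES, READ_TO_TRANSCRIPTS, 4))
    print(smallest_explaining_set(CANDIDATES, READ_TO_TRANSCRIPTS, 2))
```

On this toy instance no single transcript explains all four reads, so requiring all four forces two transcripts, while requiring only two reads is satisfied by one. A real instance would be handed to an integer programming solver; the enumeration here only makes the objective and constraints concrete.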