Marius Nicolae
Computer Science and Engineering Department, University of Connecticut
Joint work with Serghei Mangul, Ion Mandoiu and Alex Zelikovsky

Introduction
EM Algorithm
Results
Conclusions and future work

Make cDNA & shatter into fragments
Sequence fragment ends
Map reads
◦ Gene Expression (GE)
◦ Isoform Discovery (ID)
◦ Isoform Expression (IE)
[Figure: exons A-E and the isoforms built from them, illustrating GE, ID and IE]

Read ambiguity (multireads)
What is the gene length?
[Figure: gene with exons A-E]

Ignore multireads
[Mortazavi et al. 08]
◦ Fractionally allocate multireads based on unique read estimates
[Pasaniuc et al. 10]
◦ EM algorithm for solving ambiguities
Gene length: sum of lengths of exons that appear in at least one isoform
◦ Underestimates expression levels for genes with 2 or more isoforms [Trapnell et al. 10]
[Figure: isoforms over exons A-E]

[Jiang&Wong 09]
◦ Poisson model, single reads only
[Li et al. 10]
◦ EM Algorithm, single reads only
[Feng et al. 10]
◦ Convex quadratic program, pairs used only for ID
[Trapnell et al. 10]
◦ Extends Jiang’s model to paired reads
◦ Fragment length distribution

EM Algorithm for IE
◦ Single and paired reads
◦ Fragment length distribution
◦ Strand information
◦ Base quality scores
Solving GE by adding isoform levels

EM Algorithm

Paired reads
[Figure: paired reads on isoforms over exons A, B, C]
Single reads
[Figure: single reads on isoforms over exons A, B, C]

E-step
M-step

Results

Human genome UCSC known isoforms
[Figure: number of isoforms vs. isoform length; number of genes vs. number of isoforms]

GNFAtlas2 gene expression levels
◦ Uniform/geometric expression of gene isoforms
Normally distributed fragment lengths
◦ Mean 250, std. dev. 25

Error Fraction (EF)
◦ Percentage of isoforms (or genes) with relative error larger than a given threshold t
Median Percent Error (MPE)
◦ Threshold t for which EF is 50%
r²
◦ Coefficient of determination

30M single reads of length 25
[Figure: % of isoforms over threshold vs. relative error threshold, for Uniq, Rescue, UniqLN, RSEM and IsoEM]
Main difference between IsoEM and RSEM is fragment length modeling

30M single reads of length 25
[Figure: % of genes over threshold vs. relative error threshold, for Uniq, Rescue, GeneEM, RSEM and IsoEM]

Fixed sequencing throughput (750Mb)
[Figure: r² vs. read length and Median Percent Error vs. read length, for paired reads and single reads]
50bp reads better than 100bp!

1-60M 75bp reads
[Figure: r² vs. # reads, for RandomStrand-Pairs-PerfectMapping, RandomStrand-Pairs, CodingStrand-pairs, RandomStrand-Single and CodingStrand-single]
Pairs help, strand info doesn’t
[Trapnell et al. 10] r²=.95 for 13M PE reads

Conclusions and future work

Presented EM algorithm for isoform frequency estimation that exploits fragment length distribution for both single and paired reads
◦ Significant accuracy improvement over existing methods
◦ Code and datasets to be released publicly soon
Ongoing extensions
◦ Confidence intervals
◦ Allele-specific isoform expression
◦ Testing for novel isoforms
◦ Integration with isoform discovery
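
The accuracy measures defined in the Results slides (Error Fraction, Median Percent Error, r²) can be sketched in a few lines of Python. This is a minimal illustration and not the authors' evaluation code. The function names are hypothetical, the treatment of items whose true value is 0 is an assumption, and r² is computed as the squared Pearson correlation, which is one common convention; the slides do not say which definition was used.

```python
from statistics import median


def relative_errors(true_vals, est_vals):
    """Relative error |est - true| / true for each isoform (or gene).

    Assumption: when the true value is 0, the error is 0 if the estimate
    is also 0 and infinity otherwise.
    """
    errs = []
    for t, e in zip(true_vals, est_vals):
        if t == 0:
            errs.append(0.0 if e == 0 else float("inf"))
        else:
            errs.append(abs(e - t) / t)
    return errs


def error_fraction(true_vals, est_vals, threshold):
    """EF: percentage of items whose relative error exceeds `threshold`."""
    errs = relative_errors(true_vals, est_vals)
    return 100.0 * sum(err > threshold for err in errs) / len(errs)


def median_percent_error(true_vals, est_vals):
    """MPE: median relative error, in percent (the threshold where EF is about 50%)."""
    return 100.0 * median(relative_errors(true_vals, est_vals))


def r_squared(true_vals, est_vals):
    """Squared Pearson correlation between true and estimated values."""
    n = len(true_vals)
    mean_t = sum(true_vals) / n
    mean_e = sum(est_vals) / n
    cov = sum((t - mean_t) * (e - mean_e) for t, e in zip(true_vals, est_vals))
    var_t = sum((t - mean_t) ** 2 for t in true_vals)
    var_e = sum((e - mean_e) ** 2 for e in est_vals)
    return cov * cov / (var_t * var_e)
```

For example, with true values `[10, 20, 40, 80]` and estimates `[11, 20, 30, 120]`, the relative errors are 0.1, 0, 0.25 and 0.5, so two of the four items exceed a threshold of 0.2 and the median error lies between 0.1 and 0.25.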
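
The E-step and M-step named in the EM Algorithm slides can be illustrated with a generic EM loop for resolving multireads. The formulas on the slides did not survive, so this is a textbook-style sketch under stated assumptions and not the presented algorithm: the input format (one dict of isoform weights per read), the function names and the iteration count are all hypothetical. The `normal_pdf` helper shows one way a fragment length could be turned into a weight, with defaults taken from the simulation setup (mean 250, std. dev. 25). Strand information, base quality scores and isoform length normalization are not modeled here.

```python
import math


def normal_pdf(x, mean=250.0, sd=25.0):
    """Density of a normal fragment-length distribution.

    Defaults follow the simulation setup (mean 250, std. dev. 25).
    """
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def em_isoform_frequencies(read_weights, isoforms, iterations=200):
    """Generic EM estimate of the share of reads coming from each isoform.

    read_weights: list with one dict per read, mapping isoform -> weight
        (hypothetical format; a weight stands for the probability of the
        read given that isoform, e.g. a fragment-length density).
    isoforms: list of all isoform names.
    """
    freq = {j: 1.0 / len(isoforms) for j in isoforms}  # uniform start
    for _ in range(iterations):
        # E-step: fractionally assign each read to its compatible isoforms,
        # in proportion to current frequency times read weight.
        expected = {j: 0.0 for j in isoforms}
        for weights in read_weights:
            total = sum(freq[j] * w for j, w in weights.items())
            if total == 0.0:
                continue  # read is not explained by any isoform
            for j, w in weights.items():
                expected[j] += freq[j] * w / total
        # M-step: renormalize expected read counts into frequencies.
        n = sum(expected.values())
        if n == 0.0:
            break
        freq = {j: expected[j] / n for j in isoforms}
    return freq


# Two reads unique to iso1, one unique to iso2, one multiread shared by both.
reads = [{"iso1": 1.0}, {"iso1": 1.0}, {"iso1": 1.0, "iso2": 1.0}, {"iso2": 1.0}]
freqs = em_isoform_frequencies(reads, ["iso1", "iso2"])
```

In this toy input the multiread ends up split in the same 2:1 ratio as the unique reads, which is the behavior the fractional-allocation idea aims for. Replacing the constant weights with values such as `normal_pdf(implied_fragment_length)` is one way to let the fragment length distribution influence the assignment.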