the gene pool
Delivering the genomics revolution

GenePool & NGBug Transcriptome Assembly Workshop
NeSC, November 2010

Workflow: RNA QC, cDNA Synthesis, Normalisation, Roche 454 Titanium Sequencing, Sequence QC/QA, Assembly, Annotation

Sequence QC/QA

Statistics for read lengths:
  Min read length: 40
  Max read length: 933
  Mean read length: 311.33
  Standard deviation of read length: 139.31
  Median read length: 344
  N50 read length: 395

Statistics for numbers of reads:
  Number of reads: 35585
  Number of reads >=1kb: 0
  Number of reads in N50: 12195

Statistics for bases in the reads:
  Number of bases in all reads: 11078599
  Number of bases in reads >=1kb: 0
  GC Content of reads: 42.38 %

Statistics for reads >= 150 bp in length:
  Number of reads >= 150 bp: 29060
  Number of bases in reads >= 150 bp: 10508272
  Mean read length for reads over 150 bp: 361.61

Optimising Transcriptome Assembly
the truth is out there....

Optimising transcriptomics
  ribominus versus poly(A) selection
  oligo d(T) versus random 6- | 9-mers
  cap capture vs 5’ end capture vs none

gene expression is very skewed
[Figure: number of mRNAs (1 to 10,000) against expression level (mRNA per cell, 10e-2 to 10e5)]

normalisation
[Figure: a pool dominated by copies of "aaa", with a few "bbb" and one "ccc", is normalised to aaa, bbb, ccc]
[Figure: the same pool plus rare transcripts d to z is normalised to aaa, bbb, ccc, d to z]

subtraction with driver
[Figure: a pool of "aba" and "apa" sequences; labels: normalisation, ds nuclease]

Optimising Transcriptome Assembly
  if the genome is known, the transcriptome can be estimated by mapping transcriptome reads to this reference
  if the genome is well annotated, the transcriptome is ‘known’

Assembler ‘philosophy’: reads
  every read is sacred versus reads can be fragmented

Assembler ‘philosophy’: graphs
  overlap - layout - consensus (OLC) versus De Bruijn graph

Assembler ‘philosophy’: preclustering
  precluster before assembly versus all-against-all assembly

Assembler ‘philosophy’

  programme    reads     graphs     preclusters
  TiCL         sacred    OLC        yes
  CLOBB        sacred    OLC        yes
  MIRA         sacred    OLC        no
  CAP3         sacred    OLC        no
  Newbler      fragment  OLC        no
  Velvet       fragment  de Bruijn  no
  ABySS        fragment  de Bruijn  no
  SOAPdenovo   fragment  de Bruijn  no
  CLC NGCell   fragment  de Bruijn  no

Optimising de novo Transcriptome Assembly
or How do we know when we have arrived?

Assembly Optimality Criteria
  Read use:           use most or all reads
  Number of contigs:  should approach transcriptome size (15k-35k)
  Span of contigs:    should approach transcriptome size (30-50 Mb)
  # unique bases:     more may be better
  N50 of contigs:     should approach transcriptome N50 (~1.3 kb)
  # contigs >1 kb:    more may be better

  comparison to reference datasets: more is better
    previously sequenced ESTs
    related reference transcriptome/proteome
    KOGs/CEGMA data

  comparison to other assemblies: should have the most unique bases

[Figure: cumulative contig length (0 to 2.0e+07) against contigs ranked by size (0 to 35000) for CAP3, CLC, MIRA, Newbler 2.3, Newbler 2.5 and SeqMan]

Improving the credibility of assemblies by merging pairs of initial assemblies

[Figure legend: read categories when two assemblies are merged]
  reads NOT used by either assembler
  reads used by both assemblers BUT contigs not co-assembled
  reads used only by assembler A
  reads used by either assembler contributing to coassembled contigs
  reads used only by assembler B

  Assembly 1             Assembly 2             Contigs (Assembly 1)  Contigs (Assembly 2)  “Credible” C+C contigs  Summed length of C+C contigs
  MIRA                   SeqMan                 35827                 29969                 18068                   16293192
  MIRA                   Newbler (Version 2.5)  35827                 21734                 15951                   15866051
  CLC                    Newbler (Version 2.5)  22746                 21734                 15778                   15825663
  Newbler (Version 2.5)  SeqMan                 21734                 29969                 15783                   15701053
  CAP3                   Newbler (Version 2.5)  24727                 21734                 14275                   14830304
  CAP3                   SeqMan                 24727                 29969                 15387                   14824287
  CLC                    SeqMan                 22746                 29969                 15504                   14679975
  CLC                    MIRA                   22746                 35827                 15334                   14357031
  CAP3                   MIRA                   24727                 35827                 15688                   14243534
  CAP3                   CLC                    24727                 22746                 14149                   13753398
  Newbler (Version 2.3)  Newbler (Version 2.5)  12019                 21734                 9733                    13252303
  CLC                    Newbler (Version 2.3)  22746                 12019                 8884                    12318589
  MIRA                   Newbler (Version 2.3)  35827                 12019                 9380                    11731374
  Newbler (Version 2.3)  SeqMan                 12019                 29969                 8274                    11452990
  CAP3                   Newbler (Version 2.3)  24727                 12019                 8484                    11426423
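
The length-based measures under Assembly Optimality Criteria (number of contigs, span, N50, contigs >1 kb) and the cumulative contig length plot can all be derived from a plain list of contig lengths. The following is a minimal Python sketch, assuming the lengths have already been pulled out of an assembler's output. The function names `n50`, `cumulative_lengths` and `assembly_summary` are illustrative and do not come from the workshop material, the toy lengths are made up, and N50 is taken in its usual sense: the length at which contigs of that size or longer hold at least half of all bases.

```python
from itertools import accumulate


def n50(lengths):
    """Return the N50 of a collection of sequence lengths (0 if empty)."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # First length at which the running total reaches half of all bases.
        if running >= half_total:
            return length
    return 0


def cumulative_lengths(lengths):
    """Cumulative length with contigs ranked from longest to shortest."""
    return list(accumulate(sorted(lengths, reverse=True)))


def assembly_summary(lengths, long_cutoff=1000):
    """Summarise an assembly from its contig lengths.

    long_cutoff is an assumed threshold (1 kb) for counting long contigs.
    """
    lengths = list(lengths)
    return {
        "contigs": len(lengths),
        "span": sum(lengths),
        "n50": n50(lengths),
        "contigs_over_cutoff": sum(1 for n in lengths if n > long_cutoff),
    }


# Hypothetical toy data, not taken from any assembly in the table.
toy_contigs = [1500, 1200, 800, 500]
print(assembly_summary(toy_contigs))
print(cumulative_lengths(toy_contigs))
```

Plotting the output of `cumulative_lengths` for each assembler against contig rank would give a curve of the same kind as the cumulative contig length figure.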
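
The read categories used when merging pairs of initial assemblies start from a simple partition of the read set according to which assembler placed each read. The sketch below covers only that first step, assuming each assembly is available as a mapping from read name to contig name. The function `partition_reads`, the input layout and the toy read names are all hypothetical. Deciding whether the contigs from the two assemblers were co-assembled needs the merged assembly itself and is not modelled here, so the "both" group still mixes co-assembled and non-co-assembled reads.

```python
def partition_reads(all_reads, placed_by_a, placed_by_b):
    """Split reads by which assembler used them.

    all_reads   -- iterable of read names that went into both assemblers
    placed_by_a -- dict of read name -> contig name for assembler A
    placed_by_b -- dict of read name -> contig name for assembler B
    """
    all_reads = set(all_reads)
    used_a = all_reads & set(placed_by_a)
    used_b = all_reads & set(placed_by_b)
    return {
        "neither": all_reads - used_a - used_b,
        "only_a": used_a - used_b,
        "only_b": used_b - used_a,
        "both": used_a & used_b,
    }


# Hypothetical toy input: five reads, two small assemblies.
reads = ["r1", "r2", "r3", "r4", "r5"]
assembly_a = {"r1": "A_contig1", "r2": "A_contig1", "r3": "A_contig2"}
assembly_b = {"r2": "B_contig1", "r3": "B_contig1", "r4": "B_contig2"}

groups = partition_reads(reads, assembly_a, assembly_b)
for name in ("neither", "only_a", "only_b", "both"):
    print(name, sorted(groups[name]))
```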