Assembling Genome
Timothee Cezard
EBI NGS workshop 16/10/2012

• Goal: Find the shortest common superstring of a set of reads.
• This is an NP-hard problem, so we need to use an approximation algorithm.

Main algorithms used:
• Overlap Layout Consensus
• De Bruijn graphs

Overlap Layout Consensus
• Needs an efficient alignment algorithm
• Doesn’t scale well when the number of reads is high
• Use seed-based alignment with extension

TACATAGATTACACAGATTACTGA
|| ||||||||||||||||||||
TAGTTAGATTACACAGATTACTAGA

• A graph is constructed:
(1) Nodes are reads
(2) Edges represent overlapping reads

Overlap graph: CGTAGTGGCAT, ATTCACGTAG

Try to find the Hamiltonian path:
• A path in the graph that contains each node exactly once.
• Computationally expensive

• This approach is used in Celera (CABOG), Newbler, Mira, SGA…
• It is mostly used with Sanger or 454 data.
• Can’t assemble repeats longer than the read length
• Could come back if reads get longer.

De Bruijn Graphs example
“It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity,....”
Dickens, Charles. A Tale of Two Cities. 1859. London: Chapman Hall
Velvet example courtesy of J. Leipzig 2010

De Bruijn Graphs example
itwasthebestoftimesitwastheworstoftimesitwastheageofwisdomitwastheageoffoolishness…
Generate random ‘reads’

How do we assemble?
fincreduli geoffoolis Itwasthebe Itwasthebe geofwisdom itwastheep epochofinc timesitwas stheepocho nessitwast
wastheageo theepochof stheepocho hofincredu estoftimes eoffoolish lishnessit hofbeliefi pochofincr itwasthewo
twastheage toftimesit domitwasth ochofbelie eepochofbe eepochofbe astheworst chofincred theageofwi iefitwasth
ssitwasthe astheepoch efitwasthe wisdomitwa ageoffooli twasthewor ochofbelie sdomitwast sitwasthea eepochofbe
ffoolishne eofwisdomi hebestofti stheageoff twastheepo eworstofti stoftimesi theepochof esitwasthe heepochofi
theepochof sdomitwast astheworst rstoftimes worstoftim stheepocho geoffoolis ffoolishne timesitwas lishnessit
stheageoff eworstofti orstoftime fwisdomitw wastheageo heageofwis incredulit ishnessitw twastheepo wasthewors
astheepoch heworstoft ofbeliefit wastheageo heepochofi pochofincr heageofwis stheageofw fincreduli astheageof
wisdomitwa wastheageo astheepoch olishnessi astheepoch itwastheep twastheage wisdomitwa fbeliefitw bestoftime
epochofbel theepochof sthebestof lishnessit hofbeliefi Itwasthebe ishnessitw sitwasthew ageofwisdo twastheage
esitwasthe twastheage shnessitwa fincreduli fbeliefitw theepochof mesitwasth domitwasth ochofbelie heageofwis
oftimesitw stheepocho bestoftime twastheage foolishnes ftimesitwa thebestoft itwastheag theepochof itwasthewo
ofbeliefit bestoftime mitwasthea imesitwast timesitwas orstoftime estoftimes twasthebes stoftimesi sdomitwast
wisdomitwa theworstof astheworst sitwasthew theageoffo eepochofbe theageofwi foolishnes incredulit ofbeliefit
chofincred beliefitwa beliefitwa wisdomitwa eageoffool eoffoolish itwastheag mesitwasth epochofinc ssitwasthe
itwastheep astheageof stheageoff sitwasthee thebestoft oolishness heepochofb ochofbelie wastheepoc bestoftime
mesitwasth ebestoftim pochofincr
…etc. to 10’s of millions of reads

Traditional all-vs-all comparisons of datasets this size require immense computational resources.
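
To make the cost of all-vs-all comparison concrete, here is a minimal sketch of a naive overlap graph, using the two reads from the overlap graph slide. The function names `overlap_length` and `overlap_graph` and the `min_len` threshold are assumptions made for illustration; real overlap assemblers use seed-based alignment that tolerates mismatches, not the exact suffix/prefix matching shown here.

```python
from itertools import permutations


def overlap_length(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that exactly equals a prefix of `b`.

    Returns 0 when no overlap of at least `min_len` characters exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def overlap_graph(reads, min_len: int = 3):
    """Compare every ordered pair of reads; keep pairs that overlap.

    Returns (edges, comparisons), where edges maps (a, b) to the overlap length.
    """
    edges = {}
    comparisons = 0
    for a, b in permutations(reads, 2):  # n * (n - 1) ordered pairs
        comparisons += 1
        n = overlap_length(a, b, min_len)
        if n:
            edges[(a, b)] = n
    return edges, comparisons


reads = ["ATTCACGTAG", "CGTAGTGGCAT"]
edges, comparisons = overlap_graph(reads)
print(edges)        # the single edge found between the two reads
print(comparisons)  # number of ordered pairs compared
```

The loop visits n × (n − 1) ordered pairs, so the work grows quadratically with the number of reads, which is why this approach struggles at tens of millions of reads.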
De Bruijn solution: Construct a graph efficiently

De Bruijn Graphs Step 1: “Kmerize” the data
Reads and their kmers (k=3):
theageofwi: the hea eag age geo eof ofw fwi
sthebestof: sth the heb ebe bes est sto tof
astheageof: ast sth the hea eag age geo eof
worstoftim: wor ors rst sto tof oft fti tim
imesitwast: ime mes esi sit itw twa was ast
…..etc for all reads in the dataset

De Bruijn Graphs Step 2: Build the graph
Look for k-1 overlaps given by the reads
[Figure: graph of kmer nodes ast, sth, the, hea, eag, age, geo, eof, ofw, fwi, heb, ebe, bes, est, sto, tof, wor, ors, rst, oft, fti, tim, ime, mes, esi, sit, itw, twa, was …..etc for all ‘kmers’ in the dataset]

De Bruijn Graphs Step 3: Simplify the graph

De Bruijn Graphs Step 4: Create contigs
No single solution! Break the graph to give the final assembly

De Bruijn example
The final assembly (k=3): wor incredulity foolishness itwasthe times age epoch be st of wisdom belief

Repeat with a longer “kmer” length
A better assembly (k=10):
itwasthebestoftimesitwastheworstoftimesitwastheageofwisdomitwastheageoffoolis…

Why not always use the longest ‘k’ possible?
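
Steps 1 and 2 above can be sketched in a few lines of Python. This is an illustrative sketch only, not how Velvet or any other assembler is implemented: the helper names `kmerize`, `build_graph` and `branching_nodes` are hypothetical, and the graph follows the slides in using kmers as nodes and linking consecutive kmers of a read (which overlap by k-1 characters).

```python
def kmerize(read: str, k: int) -> list[str]:
    """Step 1: every substring of length k, in order of appearance."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_graph(reads, k: int) -> dict[str, set[str]]:
    """Step 2: one node per distinct kmer, one edge per observed k-1 overlap.

    Identical kmers from different reads collapse into the same node, so no
    read-vs-read comparison is ever performed.
    """
    graph: dict[str, set[str]] = {}
    for read in reads:
        kmers = kmerize(read, k)
        for kmer in kmers:
            graph.setdefault(kmer, set())
        for left, right in zip(kmers, kmers[1:]):
            graph[left].add(right)
    return graph


def branching_nodes(graph) -> list[str]:
    """Nodes with more than one successor: places where the path is ambiguous."""
    return sorted(node for node, successors in graph.items() if len(successors) > 1)


reads = ["theageofwi", "sthebestof", "astheageof", "worstoftim", "imesitwast"]
graph = build_graph(reads, k=3)
print(len(graph))              # number of distinct kmers (nodes)
print(branching_nodes(graph))  # kmers followed by more than one different kmer
```

With these five reads and k=3, the 40 kmer occurrences collapse into 29 nodes, and `the` is the only node with two successors (`hea` and `heb`). A single pass over the reads builds the graph, which is the efficiency gain over all-vs-all comparison. Branch points like this one are what a longer k removes, at a cost that the next slide addresses.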
Sequencing errors: read sthebentof (sthebestof with one wrong base)
k=3: sth the heb ebe ben ent nto tof (mostly unaffected kmers)
k=10: sthebentof (100% wrong kmer)

Strengths and problems of De Bruijn approach
Strengths:
• No need to calculate the overlaps
• Size of the final graph is a function of the genome size
• Repeats are collapsed
Problems:
• Can only resolve repeats up to k long
• Lose connectivity when creating the contigs

Resolve repeats through scaffolding
Contigs from assembly
Align reads from short insert or long insert library
Join contigs using evidence from paired end data
Scaffold

De Bruijn assemblers
• Velvet: http://www.ebi.ac.uk/~zerbino/velvet/
• ABySS: http://www.bcgsc.ca/platform/bioinfo/software/abyss
• SOAP-denovo: http://soap.genomics.org.cn/soapdenovo.html
• ALLPATHS-LG: http://www.broadinstitute.org/software/allpaths-lg/blog/
• IDBA-UD: http://i.cs.hku.hk/~alse/hkubrg/projects/idba_ud/

What makes an assembly good?
• High coverage: 50 to 100X
• Different but precise insert size libraries
• Little to no sequencing errors
• Avoid a large number of variants
• Try different assemblers
• Need a big fat memory machine (from 16 GB to 1 TB)

What makes your assembly better?
Error Correction: Correct the reads before assembly
http://bib.oxfordjournals.org/content/early/2012/04/06/bib.bbs015.full
• SOAP-denovo
• Reptile: http://aluru-sun.ece.iastate.edu/doku.php?id=reptile
• SGA: https://github.com/jts/sga
Joining overlapping reads:
• COPE: ftp://ftp.genomics.org.cn/pub/cope/
• FLASH: http://genomics.jhu.edu/software/FLASH/index.shtml

What makes your assembly better?
Gap Filling: IMAGE, Tsai et al. Genome Biology 2010

Assembly validation
N50 is the most commonly used metric: a weighted median such that 50% of your assembly is contained in contigs of length >= N50

CEGMA: Core Eukaryotic Genes Mapping Approach
• Looks in your assembly for genes that should be there
• Usually the best assembly has the best CEGMA score
http://korflab.ucdavis.edu/datasets/cegma/

There is no magic tool
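
The N50 definition given under assembly validation can be sketched as follows. The function name `n50` and the contig lengths in the example are made up for illustration; assembly tools may differ in details such as whether short contigs are filtered out first.

```python
def n50(contig_lengths):
    """Largest length L such that contigs of length >= L hold at least half the assembly."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("no contigs given")
    half = sum(lengths) / 2
    running = 0
    for length in lengths:  # walk from the longest contig down
        running += length
        if running >= half:
            return length


# Hypothetical contig lengths, total 290: the two longest (80 + 70 = 150)
# already cover half of the assembly, so N50 is the second length.
example_lengths = [80, 70, 50, 40, 30, 20]
print(n50(example_lengths))
```

Because N50 depends only on contig lengths, it says nothing about whether the contigs are correct, which is why a content-based check such as CEGMA is used alongside it.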