Overlap graph

Assembling Genome
Timothee Cezard
EBI NGS workshop
16/10/2012
• Goal: Find the shortest common superstring of a set of
reads.
• This is an NP-hard problem, so we need to use an
approximation algorithm.
Main algorithms used:
• Overlap Layout Consensus
• De Bruijn graphs
Need an efficient alignment algorithm
Pairwise alignment doesn’t scale well when the number of reads is high
Use seed-based alignment with extension
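A minimal sketch of seed-based overlap detection with extension, assuming exact matches only. The name `find_overlaps` and the parameters `seed_len` and `min_overlap` are illustrative and not taken from any assembler; real overlappers also tolerate mismatches and indels.

```python
from collections import defaultdict

def find_overlaps(reads, seed_len=4, min_overlap=5):
    """Return {(i, j): length} where a suffix of reads[i] equals a prefix of reads[j].

    Assumes min_overlap >= seed_len and exact matching (no sequencing errors).
    """
    # Index every read by its leading seed (its first seed_len characters).
    prefix_index = defaultdict(list)
    for j, read in enumerate(reads):
        prefix_index[read[:seed_len]].append(j)

    overlaps = {}
    for i, read in enumerate(reads):
        # Scan suffixes from longest to shortest; only seed hits get extended.
        for start in range(len(read) - min_overlap + 1):
            seed = read[start:start + seed_len]
            for j in prefix_index.get(seed, []):
                if j == i or (i, j) in overlaps:
                    continue
                suffix = read[start:]
                # Extension step: the whole suffix must match the prefix of read j.
                if reads[j].startswith(suffix):
                    overlaps[(i, j)] = len(suffix)
    return overlaps

print(find_overlaps(["ATTCACGTAG", "CGTAGTGGCAT"]))
```

The seed lookup avoids comparing every suffix against every read; only pairs that share a seed pay for the full comparison.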
TACATAGATTACACAGATTACTGA
||
||||||||||||||||||||
TAGTTAGATTACACAGATTACTAGA
• A graph is constructed:
(1) Nodes are reads
(2) Edges represent overlapping reads
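The graph construction can be sketched as below, assuming exact suffix-prefix matches and a brute-force comparison of every pair of reads. The names `overlap_length` and `build_overlap_graph`, and the `min_overlap` threshold, are illustrative assumptions.

```python
def overlap_length(a, b, min_overlap=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if too short)."""
    for length in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0

def build_overlap_graph(reads, min_overlap=3):
    """Nodes are reads; an edge a -> b is stored with its overlap length."""
    graph = {read: {} for read in reads}
    for a in reads:
        for b in reads:
            if a != b:
                length = overlap_length(a, b, min_overlap)
                if length:
                    graph[a][b] = length
    return graph

graph = build_overlap_graph(["ATTCACGTAG", "CGTAGTGGCAT"])
print(graph)
```

The nested loop over all read pairs is the quadratic cost that makes this approach hard to scale.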
CGTAGTGGCAT
Overlap graph
ATTCACGTAG
Try to find a Hamiltonian path:
• a path in the graph that visits each node exactly once.
• Computationally expensive
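To illustrate why the search is expensive, here is a brute-force sketch that tries every ordering of the nodes. The function name `hamiltonian_paths` and the toy graph are illustrative assumptions; assemblers use heuristics rather than this exhaustive search.

```python
from itertools import permutations

def hamiltonian_paths(graph):
    """Yield every ordering of the nodes in which consecutive nodes share an edge.

    graph maps each node to the collection of nodes it points to.
    The loop visits n! orderings, which is hopeless beyond a handful of nodes.
    """
    nodes = list(graph)
    for order in permutations(nodes):
        if all(b in graph[a] for a, b in zip(order, order[1:])):
            yield order

# Toy graph: A overlaps B and C, B overlaps C.
toy = {"A": {"B", "C"}, "B": {"C"}, "C": set()}
print(list(hamiltonian_paths(toy)))
```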
CGTAGTGGCAT
ATTCACGTAG
• This approach is used in Celera (CABOG), Newbler,
Mira, SGA…
• It is mostly used with Sanger or 454 data.
• Can’t assemble repeats longer than the read length
• Could come back if reads get longer.
De Bruijn Graphs example
“It was the best of times, it was the worst of times, it was the age of
wisdom, it was the age of foolishness, it was the epoch of belief, it was
the epoch of incredulity,.... “
Dickens, Charles. A Tale of Two Cities. 1859. London: Chapman Hall
Velvet example courtesy of J. Leipzig 2010
De Bruijn Graphs example
itwasthebestoftimesitwastheworstoftimesitwastheageofwisdomitwastheageoffoolishness…
Generate random ‘reads’
How do we assemble?
fincreduli geoffoolis Itwasthebe Itwasthebe geofwisdom itwastheep epochofinc timesitwas stheepocho nessitwast wastheageo theepochof stheepocho hofincredu
estoftimes eoffoolish lishnessit hofbeliefi pochofincr itwasthewo twastheage toftimesit domitwasth ochofbelie eepochofbe eepochofbe astheworst chofincred theageofwi
iefitwasth ssitwasthe astheepoch efitwasthe wisdomitwa ageoffooli twasthewor ochofbelie sdomitwast sitwasthea eepochofbe ffoolishne eofwisdomi hebestofti
stheageoff twastheepo eworstofti stoftimesi theepochof esitwasthe heepochofi theepochof sdomitwast astheworst rstoftimes worstoftim stheepocho geoffoolis
ffoolishne timesitwas lishnessit stheageoff eworstofti orstoftime fwisdomitw wastheageo heageofwis incredulit ishnessitw twastheepo wasthewors astheepoch
heworstoft ofbeliefit wastheageo heepochofi pochofincr heageofwis stheageofw fincreduli astheageof wisdomitwa wastheageo astheepoch olishnessi astheepoch
itwastheep twastheage wisdomitwa fbeliefitw bestoftime epochofbel theepochof sthebestof lishnessit hofbeliefi Itwasthebe ishnessitw sitwasthew ageofwisdo
twastheage esitwasthe twastheage shnessitwa fincreduli fbeliefitw theepochof mesitwasth domitwasth ochofbelie heageofwis oftimesitw stheepocho bestoftime
twastheage foolishnes ftimesitwa thebestoft itwastheag theepochof itwasthewo ofbeliefit bestoftime mitwasthea imesitwast timesitwas orstoftime estoftimes
twasthebes stoftimesi sdomitwast wisdomitwa theworstof astheworst sitwasthew theageoffo eepochofbe theageofwi foolishnes incredulit ofbeliefit chofincred beliefitwa
beliefitwa wisdomitwa eageoffool eoffoolish itwastheag mesitwasth epochofinc ssitwasthe itwastheep astheageof stheageoff sitwasthee thebestoft oolishness
heepochofb ochofbelie wastheepoc bestoftime mesitwasth ebestoftim pochofincr
…etc. to tens of millions of reads
Traditional all-vs-all comparisons of datasets this size require immense
computational resources.
De Bruijn solution: Construct a graph efficiently
De Bruijn Graphs
Step 1: “Kmerize” the data (create kmers)
Reads and their kmers (k=3):
theageofwi: the hea eag age geo eof ofw fwi
sthebestof: sth the heb ebe bes est sto tof
astheageof: ast sth the hea eag age geo eof
worstoftim: wor ors rst sto tof oft fti tim
imesitwast: ime mes esi sit itw twa was ast
…etc. for all reads in the dataset
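The kmerizing step can be sketched in a few lines. The function name `kmerize` is illustrative; the reads are the five shown on the slide.

```python
from collections import Counter

def kmerize(read, k):
    """Return every substring of length k, in read order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

reads = ["theageofwi", "sthebestof", "astheageof", "worstoftim", "imesitwast"]

# Count how often each kmer occurs across all reads.
kmer_counts = Counter(kmer for read in reads for kmer in kmerize(read, 3))

print(kmerize("theageofwi", 3))  # ['the', 'hea', 'eag', 'age', 'geo', 'eof', 'ofw', 'fwi']
print(kmer_counts["the"], kmer_counts["ast"])  # 3 2
```

A read of length L yields L - k + 1 kmers, so each 10-letter read gives 8 kmers at k=3.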
De Bruijn Graphs
Step 2: Build the graph
Look for k-1 overlaps between kmers; the overlaps are given by the reads
Kmers:
ast ast sth sth the the the hea hea eag eag age age geo geo eof eof ofw
heb ebe bes est sto sto tof tof wor ors rst oft fti fwi
tim ime was mes esi twa itw sit
…etc. for all ‘kmers’ in the dataset
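A minimal sketch of the graph construction, assuming each distinct kmer becomes one node and an edge joins two kmers that are consecutive in some read (so they overlap by k-1 characters). The function name `build_de_bruijn` is illustrative; real assemblers use far more compact data structures.

```python
def build_de_bruijn(reads, k):
    """Map each distinct kmer to the set of kmers that follow it in some read."""
    graph = {}
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        # Every kmer is a node, even when nothing follows it.
        for kmer in kmers:
            graph.setdefault(kmer, set())
        # Consecutive kmers in a read overlap by k-1 characters: add an edge.
        for left, right in zip(kmers, kmers[1:]):
            graph[left].add(right)
    return graph

graph = build_de_bruijn(["sthebestof", "astheageof"], 3)
print(sorted(graph["the"]))  # ['hea', 'heb']
```

No read-against-read comparison is needed: the edges come directly from walking along each read once.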
De Bruijn Graphs
Step 3: Simplify the graph
De Bruijn Graphs
Step 4: Create contigs
No single solution!
Break the graph to give the final assembly
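One common way to simplify the graph and break it into contigs is to merge every chain of nodes that has no branching. The sketch below assumes a graph that maps each kmer to the set of kmers following it; the function name `contigs_from_graph` is illustrative, and real assemblers also remove tips and bubbles, which is not shown here.

```python
def contigs_from_graph(graph):
    """Merge unbranched chains of kmers into contigs (isolated cycles are skipped)."""
    # Build the predecessor sets so in-degrees are known.
    preds = {node: set() for node in graph}
    for node, successors in graph.items():
        for nxt in successors:
            preds.setdefault(nxt, set()).add(node)

    contigs = []
    visited = set()
    for node in preds:
        # Skip nodes that sit inside a chain: one predecessor with one way out.
        if len(preds[node]) == 1:
            (only_pred,) = preds[node]
            if len(graph.get(only_pred, ())) == 1:
                continue
        contig, current = node, node
        visited.add(node)
        # Extend while the path is unambiguous in both directions.
        while len(graph.get(current, ())) == 1:
            (nxt,) = graph[current]
            if len(preds[nxt]) != 1 or nxt in visited:
                break
            contig += nxt[-1]  # each step adds one new character
            visited.add(nxt)
            current = nxt
        contigs.append(contig)
    return contigs

graph = {
    "ast": {"sth"}, "sth": {"the"}, "the": {"heb", "hea"},
    "heb": {"ebe"}, "ebe": {"bes"}, "bes": {"est"}, "est": {"sto"}, "sto": {"tof"},
    "hea": {"eag"}, "eag": {"age"}, "age": {"geo"}, "geo": {"eof"},
}
print(sorted(contigs_from_graph(graph)))  # ['asthe', 'heageof', 'hebestof']
```

The graph breaks at "the" because two different kmers follow it, which is why there is no single solution.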
De Bruijn example
The final assembly (k=3)
Contigs: wor, incredulity, foolishness, itwasthe, times, age, epoch, be, st, of, wisdom, belief
Repeat with a longer “kmer” length
A better assembly (k=10)
itwasthebestoftimesitwastheworstoftimesitwastheageofwisdomitwastheageoffoolis…
Why not always use the longest ‘k’ possible?
Sequencing errors (read sthebestof sequenced as sthebentof):
k=3: sth the heb ebe ben ent nto tof
Mostly unaffected kmers
k=10: sthebentof
100% wrong kmer
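The effect of one sequencing error can be checked numerically. This sketch assumes the true read is "sthebestof" (one of the reads shown in Step 1) and that the error is a substitution, so both reads have the same length; the function names are illustrative.

```python
def kmerize(read, k):
    """Return every substring of length k, in read order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def fraction_wrong(true_read, observed_read, k):
    """Fraction of kmers in the observed read that differ from the true read."""
    true_kmers = kmerize(true_read, k)
    observed_kmers = kmerize(observed_read, k)
    wrong = sum(1 for t, o in zip(true_kmers, observed_kmers) if t != o)
    return wrong / len(observed_kmers)

print(fraction_wrong("sthebestof", "sthebentof", 3))   # 0.375
print(fraction_wrong("sthebestof", "sthebentof", 10))  # 1.0
```

A single error spoils at most k kmers, so the damage per error grows with k while the number of kmers per read shrinks.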
Strengths and problems of the De Bruijn approach
Strengths:
• No need to calculate the overlaps
• Size of the final graph is a function of the genome size
• Repeats are collapsed
Problems:
• Can only resolve repeats shorter than k
• Lose connectivity when creating the contigs
Resolve repeats through scaffolding
• Contigs from assembly
• Align reads from short-insert or long-insert library
• Join contigs using evidence from paired-end data
• Scaffold
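A much-simplified sketch of the joining step, assuming each read pair has already been aligned and reduced to the names of the two contigs its mates hit. The function name `contig_links`, the `min_links` threshold and the contig names are hypothetical; real scaffolders also use orientation and insert size, which are ignored here.

```python
from collections import Counter

def contig_links(pair_hits, min_links=2):
    """Count read pairs whose mates land on two different contigs.

    pair_hits: iterable of (contig_of_mate1, contig_of_mate2) tuples.
    Returns {(contig_a, contig_b): count} for links seen at least min_links times.
    """
    links = Counter()
    for contig_a, contig_b in pair_hits:
        if contig_a != contig_b:
            # Sort the names so (c1, c2) and (c2, c1) count as the same link.
            links[tuple(sorted((contig_a, contig_b)))] += 1
    return {pair: n for pair, n in links.items() if n >= min_links}

hits = [("c1", "c2"), ("c2", "c1"), ("c1", "c1"), ("c2", "c3")]
print(contig_links(hits))  # {('c1', 'c2'): 2}
```

Requiring several supporting pairs before joining two contigs is a guard against chimeric pairs and misalignments.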
De Bruijn assemblers
• Velvet: http://www.ebi.ac.uk/~zerbino/velvet/
• ABySS: http://www.bcgsc.ca/platform/bioinfo/software/abyss
• SOAPdenovo: http://soap.genomics.org.cn/soapdenovo.html
• ALLPATHS-LG: http://www.broadinstitute.org/software/allpaths-lg/blog/
• IDBA-UD: http://i.cs.hku.hk/~alse/hkubrg/projects/idba_ud/
What makes an assembly good?
• High coverage: 50 to 100X
• Different but precise insert size libraries
• Little to no sequencing errors
• Avoid a large number of variants
• Try different assemblers
• Need a big fat memory machine (from 16 GB to 1 TB)
What makes your assembly better?
Error correction: Correct the reads before assembly
http://bib.oxfordjournals.org/content/early/2012/04/06/bib.bbs015.full
• SOAP-denovo
• Reptile: http://aluru-sun.ece.iastate.edu/doku.php?id=reptile
• SGA: https://github.com/jts/sga
Joining overlapping reads:
• COPE: ftp://ftp.genomics.org.cn/pub/cope/
• FLASH: http://genomics.jhu.edu/software/FLASH/index.shtml
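The idea of joining overlapping reads can be sketched as follows, assuming a paired-end fragment shorter than the two read lengths combined and an exact overlap. This is not how COPE or FLASH are implemented (they score overlaps and tolerate mismatches); the function names and `min_overlap` are illustrative.

```python
def reverse_complement(seq):
    """Reverse-complement a DNA sequence made of A, C, G and T."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def merge_pair(forward, reverse, min_overlap=4):
    """Merge a read pair into one longer read, or return None if no overlap is found."""
    # The reverse read comes from the opposite strand, so flip it first.
    mate = reverse_complement(reverse)
    # Try the longest possible overlap first.
    for length in range(min(len(forward), len(mate)), min_overlap - 1, -1):
        if forward.endswith(mate[:length]):
            return forward + mate[length:]
    return None

reverse_read = reverse_complement("CGTAGTGGCAT")
print(merge_pair("ATTCACGTAG", reverse_read))  # ATTCACGTAGTGGCAT
```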
What makes your assembly better?
Gap filling: IMAGE
Tsai et al., Genome Biology 2010
Assembly validation
N50 is the most commonly used metric:
Weighted median such that 50% of your assembly is
contained in contigs of length >= N50
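The definition translates directly into a short computation. The function name `n50` and the example contig lengths are illustrative.

```python
def n50(contig_lengths):
    """Largest length L such that contigs of length >= L hold at least half the assembly."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        # The first contig that pushes the running total past half gives N50.
        if running >= half:
            return length
    return 0  # empty assembly

print(n50([10, 8, 5, 3, 2]))  # 8
```

In the example the assembly totals 28 bases; the two longest contigs (10 + 8) already cover half of it, so N50 is 8.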
CEGMA: Core Eukaryotic Genes Mapping Approach
• Looks in your assembly for genes that should be there
• Usually the best assemblies have the best CEGMA scores
http://korflab.ucdavis.edu/datasets/cegma/
There is no magic tool