Genome assembly
Henrik Lantz - BILS/SciLife/Uppsala University
De novo genome project workflow
•
•
•
•
•
•
•
•
•
Extracting DNA (and RNA) - as much DNA as possible!
Choosing best sequence technology for the project
Sequencing
Quality assessment and other pre-assembly investigations
Assembly
Assembly validation
Assembly comparisons
Repeat masking?
Annotation
Genome assembly - things to think about
• Genome specifics - Size of genome, number of chromosomes,
repeat content, heterozygosity
• Which assembly programs can run on “my” genome?
• What kind of data do these programs need?
Genome assembly - things to think about
Genome assembly - things to think about
• Genome specifics - Size of genome, number of chromosomes,
repeat content, heterozygosity
• Which assembly programs can run on “my” genome?
• What kind of data do these programs need?
• How much data do I need? Will I have enough coverage? Do I
need to subsample?
• Are there closely related organisms that already have had
their genome sequenced?
• Do I need additional data for post-assembly?
Genome assembly programs
•
•
•
•
•
•
•
•
•
•
•
Abyss
Allpaths-LG
CABOG (a.k.a. Celera)
HGAP
Masurca
Mira
Newbler
SGA
SoapDeNovo
Spades
Velvet
Genome assembly programs
Name
Algorithm
Data
Abyss
De Bruijn
Illumina
Allpaths-lg
De Bruijn
Illumina/PacBio
CABOG (Celera)
OLC
All
HGAP
OLC
PacBio
Masurca
De Bruijn/OLC
All
Mira
“OLC”
All
Newbler
OLC
454/Illumina/Torrent
SGA
String
Illumina
SoapDeNovo
De Bruijn
Illumina
Spades
De Bruijn
Illumina
Velvet
De Bruijn
Illumina
OLC vs. de Bruijn
de Bruijn
de Bruijn
Sequence Assembly via De Bruijn Graphs
From Martin & Wang, Nat. Rev. Genet. 2011
From Martin & Wang, Nat. Rev. Genet. 2011
From Martin & Wang, Nat. Rev. Genet. 2011
De Bruijn
• Pros: Computationally efficient, can work with large coverage
short read datasets
• Cons: Sensitive to sequence errors, connection between
assembly and read is lost, does not work so well with longer
reads
OLC
• Pros: Utilizes longer reads well
• Cons: Time consuming, high memory requirements
Assemblathon 2
• Uses 454, Illumina, and PacBio for three large eukaryote
genomes: a bird, a fish, and a snake
• Bird - Illumina 14 libraries, 454, PacBio
• Fish - Illumina, 8 libraries
• Snake - Illumina, 4 libraries
• Teams take the data, perform assemblies with whatever tools
they wish, and then submit their results => teams are
evaluated more than individual programs!
GigaScience 2013, 2:10
Assemblathon 2
Assemblathon 2 - Bird vs. Snake
Assemblathon 2 - Bird
CEGMA
Assemblathon 2 - Validation measures
GAGE-B
• Uses Illumina (HiSeq and MiSeq) data for a number of bacteria
• One team runs all programs => assembly programs are
compared, not teams
• Reference high quality assemblies are available =>
errors/misassemblies can be quantified
Comparison of N50 contig size (in kilobases) on the y-axis, versus depth of coverage on the
x-axis, for the eight assemblers used in this study.
Magoc T et al. Bioinformatics 2013;29:1718-1725
© The Author 2013. Published by Oxford University Press. All rights reserved. For Permissions,
please e-mail: journals.permissions@oup.com
GAGE-B
GAGE-B statistics
Genome assembly programs - pros and cons
•
•
•
•
•
•
•
•
•
•
Abyss
Allpaths-LG
CABOG (a.k.a. Celera)
Masurca
Mira
Newbler
SGA
SoapDeNovo
Spades
Velvet
Allpaths-LG
• Pros: Produces contigs and scaffolds with high N50 values, can
use PacBio data for scaffolding, can run on large genomes
with high coverage
• Cons: Only accepts Illumina data, needs very specific libraries
to work at all (180 bp + 3 kbp), needs very high coverage
(100x), takes a long time to run and requires a lot of memory
Assemblathon 2 - Bird vs. Snake
Assemblathon 2 - Bird
Masurca
• Pros: Can accept any type of data, is a true hybrid assembler,
usable for very large genomes, produces top results in
comparison of assembly statistics
• Cons: Takes a long time to run, unstable(?)
Comparison of N50 contig size (in kilobases) on the y-axis, versus depth of coverage on the
x-axis, for the eight assemblers used in this study.
Magoc T et al. Bioinformatics 2013;29:1718-1725
© The Author 2013. Published by Oxford University Press. All rights reserved. For Permissions,
please e-mail: journals.permissions@oup.com
GAGE-B statistics
MIRA
• Pros: Can accept any type of data, is a true hybrid assembler,
produces good assemblies for smaller genomes, excellent
documentation
• Cons: Only useful for smaller genomes (bacteria, fungi), can
not use high coverage data (prefers max 50x), takes a long
time to run, no scaffolding
Comparison of N50 contig size (in kilobases) on the y-axis, versus depth of coverage on the
x-axis, for the eight assemblers used in this study.
Magoc T et al. Bioinformatics 2013;29:1718-1725
© The Author 2013. Published by Oxford University Press. All rights reserved. For Permissions,
please e-mail: journals.permissions@oup.com
GAGE-B statistics
SoapDeNovo
• Pros: Usable on large genomes, easy to use and runs fairly
quickly, can use high coverage data
• Cons: Only accepts Illumina data, medium results in assembly
statistic comparisons
Assemblathon 2 - Bird vs. Snake
Assemblathon 2 - Bird
GAGE-B statistics
Spades
• Pros: Designed to work with amplified data, performs very
well in GAGE-B with MiSeq data
• Cons: Only accepts Illumina data, only for smaller genomes
Comparison of N50 contig size (in kilobases) on the y-axis, versus depth of coverage on the
x-axis, for the eight assemblers used in this study.
Magoc T et al. Bioinformatics 2013;29:1718-1725
© The Author 2013. Published by Oxford University Press. All rights reserved. For Permissions,
please e-mail: journals.permissions@oup.com
GAGE-B statistics
Some recommendations
•
•
•
•
•
•
Large eukaryote genome, Illumina data: Allpaths-LG (needs specific libraries),
SoapDeNovo, SGA, Masurca
Large eukaryote genome, additional longer reads: Masurca, Newbler, CABOG
Small eukaryote or prokaryote genome, Illumina data: Spades, Masurca,
SoapDeNovo, Abyss, Velvet
Small eukaryote or prokaryote genome, mixed data: MIRA, Masurca, Newbler
Need to run in parallel: Abyss
Amplified data (Single Cell Genomics): Spades
Assemblathon 2 recommendations
•
•
•
•
•
•
Based on the findings of Assemblathon 2, we make a few broad suggestions to
someone looking to perform a de novo assembly of a large eukaryotic genome:
1. Don’t trust the results of a single assembly. If possible, generate several
assemblies (with different assemblers and/or different assembler parameters).
Some of the best assemblies entered for Assemblathon 2 were the evaluation
assemblies rather than the competition entries.
2. Do not place too much faith in a single metric. It is unlikely that we would have
considered SGA to have produced the highest ranked snake assembly if we had
only considered a single metric.
3. Potentially choose an assembler that excels in the area you are interested in
(e.g., coverage, continuity, or number of error free bases).
4. If you are interested in generating a genome assembly for the purpose of genic
analysis (e.g., training a gene finder, studying codon usage bias, looking for intronspecific motifs), then it may not be necessary to be concerned by low N50/NG50
values or by a small assembly size.
5. Assess the levels of heterozygosity in your target genome before you assemble
(or sequence) it and set your expectations accordingly.
Post assembly considerations
• External scaffolders: SSPACE (commercial),
SGA (see Hunt et al. in Genome Biology
2014:15, R42).
• Gap Closers (use with caution!): IMAGE,
PILON, GapCloser
• Error correction: Nesoni, PILON
• Assembly validation is extremely important!
Abyss
• Pros: Only assembler that can run in parallel on different
nodes => does not need a single huge memory node, fast, can
run on large genomes with a high coverage
• Cons: Only accepts Illumina data, does not excel in any
statistics
Comparison of N50 contig size (in kilobases) on the y-axis, versus depth of coverage on the
x-axis, for the eight assemblers used in this study.
Magoc T et al. Bioinformatics 2013;29:1718-1725
© The Author 2013. Published by Oxford University Press. All rights reserved. For Permissions,
please e-mail: journals.permissions@oup.com
Assemblathon 2 - Bird vs. Snake
CABOG (Celera)
• Pros: Can accept any type of data, is a true hybrid assembler,
output can easily be analyzed in the assembly validation
toolkit AMOSvalidate, usable for large genomes
• Cons: Does not perform so well for any statistic in
comparisons
Comparison of N50 contig size (in kilobases) on the y-axis, versus depth of coverage on the
x-axis, for the eight assemblers used in this study.
Magoc T et al. Bioinformatics 2013;29:1718-1725
© The Author 2013. Published by Oxford University Press. All rights reserved. For Permissions,
please e-mail: journals.permissions@oup.com
GAGE-B statistics
Newbler
• Pros: Easy to run, works very well on 454 and Ion Torrent data,
can use many types of data, usable for larger genomes,
produces competitive assemblies if longer reads are available
• Cons: Requires longer reads to perform well
Assemblathon 2 - Bird vs. Snake
Assemblathon 2 - Bird
SGA
• Pros: Usable on large genomes, memory-efficient
• Cons: Only accepts Illumina data, does not perform well in
comparisons of assembly statistics
GAGE-B statistics
Assemblathon 2 - Bird
Velvet
• Pros: Easy to use, runs quickly
• Cons: Only accepts Illumina data, only for smaller genomes,
does not excel in any assembly statistic comparison
Comparison of N50 contig size (in kilobases) on the y-axis, versus depth of coverage on the
x-axis, for the eight assemblers used in this study.
Magoc T et al. Bioinformatics 2013;29:1718-1725
© The Author 2013. Published by Oxford University Press. All rights reserved. For Permissions,
please e-mail: journals.permissions@oup.com
GAGE-B statistics