Hierarchical Genome Assembly
Anas Al-Okaily and Ion Măndoiu
University of Connecticut
School of Engineering
Department of Computer Science and Engineering, University of Connecticut
INTRODUCTION
Current high-throughput sequencing technologies generate large numbers of relatively
short and error-prone reads, making the de novo assembly problem challenging.
Although high quality assemblies can be obtained by assembling multiple paired-end
libraries with both short and long insert sizes, the latter are costly to generate. Recently,
the GAGE-B study showed that a remarkably good assembly quality can be obtained for
bacterial genomes by state-of-the-art assemblers run on a single short-insert library with
very high coverage. In this poster, we introduce and empirically evaluate a novel
hierarchical genome assembly (HGA) methodology that takes further advantage of such
very high coverage by independently assembling disjoint subsets of reads, combining
assemblies of the subsets, and finally re-assembling the combined contigs along with the
original reads.
ASSEMBLY FLOWS
The proposed hierarchical assembly flows consist of the following steps:
1. Partitioning the reads into p disjoint parts, where p = 2, 4, or 8
2. Independent assembly of each part using one of the 8 evaluated
assemblers, with kmer sizes from 21 to 101 in increments of 10
3. Either merging the resulting contigs or combining them using the
Velvet assembler with kmer size 31 and expected coverage = p
4. Reassembling the merged or combined contigs along with the original
reads using SPAdes, again with kmer sizes from 21 to 101 in
increments of 10
For each assembler, reported HGA results are for the assembly with the
largest (uncorrected) N50 over the tested values of p and kmer sizes.
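The partitioning and parameter sweep described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the HGA implementation: the round-robin split, the assumption that every combination of p, assembly kmer, and reassembly kmer is tried, and the names partition_reads, parameter_grid, and best_run are all hypothetical. Only the values of p and the kmer range come from the poster.

```python
from itertools import product

# Values stated on the poster.
PART_COUNTS = (2, 4, 8)
KMER_SIZES = tuple(range(21, 102, 10))  # 21, 31, ..., 101


def partition_reads(reads, p):
    """Split reads (or read pairs) into p disjoint parts.

    Round-robin assignment is an assumption; the poster only says
    that the parts are disjoint.
    """
    parts = [[] for _ in range(p)]
    for index, read in enumerate(reads):
        parts[index % p].append(read)
    return parts


def parameter_grid():
    """List every (p, assembly kmer, reassembly kmer) combination.

    Trying the full cross product is an assumption made for this sketch.
    """
    return list(product(PART_COUNTS, KMER_SIZES, KMER_SIZES))


def best_run(n50_by_params):
    """Return the parameter tuple whose assembly has the largest N50.

    n50_by_params maps (p, assembly kmer, reassembly kmer) to the
    uncorrected N50 of the corresponding final assembly.
    """
    return max(n50_by_params, key=n50_by_params.get)
```

In use, each tuple from parameter_grid() would drive one run of steps 1 to 4, and best_run would then select the reported assembly from the recorded N50 values. The assembler invocations themselves are deliberately left out, since their command lines are not given on the poster.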
THE CHALLENGE OF ULTRA-DEEP DATA ASSEMBLY
Assembly quality delivered by current assemblers improves only marginally or gets worse
for ultra-deep genome sequencing data.
[Figure: Magoc et al. 2013]
CORRECTED N50 RESULTS
[Figure panels: HiSeq datasets (100bp); MiSeq datasets (250bp)]
DATASETS AND ACCURACY METRICS
Assemblies were evaluated using multiple metrics computed
using QUAST, including:
• Number of contigs
• Number of known genes completely or partially covered by
the contigs
• N50, the largest length L such that contigs of length at least L
cover at least 50% of the total length of the assembly
• NA50, computed like N50 after breaking misassembled
contigs
• Genome fraction: percentage of genome bases aligned to at
least one contig
• Duplication ratio: number of aligned contig bases divided by
the number of reference bases aligned to at least one contig
• Number of global and local misassemblies
• Mismatches and indels per 100 kbp
• Unaligned contig length
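Several of the length-based metrics above reduce to short computations. The following Python sketch follows the definitions as worded in the list; it is not QUAST's code, and QUAST's own NA50 computation may differ in details such as the handling of unaligned bases. The function names and the input representation (contig lengths plus misassembly breakpoint positions) are assumptions made for illustration.

```python
def n50(lengths):
    """Largest length L such that contigs of length >= L cover
    at least half of the total assembly length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly


def break_at(length, breakpoints):
    """Split one contig of the given length at misassembly positions."""
    cuts = [0] + sorted(breakpoints) + [length]
    return [b - a for a, b in zip(cuts, cuts[1:]) if b > a]


def na50(contigs):
    """N50 after breaking misassembled contigs.

    contigs is a list of (length, breakpoints) pairs.
    """
    pieces = []
    for length, breakpoints in contigs:
        pieces.extend(break_at(length, breakpoints))
    return n50(pieces)


def genome_fraction(covered_reference_bases, genome_length):
    """Percentage of genome bases aligned to at least one contig."""
    return 100.0 * covered_reference_bases / genome_length


def duplication_ratio(aligned_contig_bases, covered_reference_bases):
    """Aligned contig bases per covered reference base."""
    return aligned_contig_bases / covered_reference_bases


print(n50([80, 70, 50, 40, 30, 20]))  # 70
print(na50([(80, [30]), (70, []), (50, []), (40, []), (30, []), (20, [])]))  # 50
```

In the example, breaking the 80-base contig at position 30 leaves the total length unchanged but lowers the statistic from 70 to 50, which is why NA50 can never exceed N50 for the same assembly.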
[Figure: Lonardi et al. 2015]
BEST HGA PARAMETERS
[Figure panels: Best kmer combinations (HiSeq); Best kmer combinations (MiSeq). Axes: Assembly kmer, Reassembly kmer, Count]
EVALUATED ASSEMBLERS
Assembler: Reference
• Abyss 1.5.1: Simpson, J. T., Wong, K., Jackman, S. D., Schein, J. E., Jones, S. J., and Birol, I. (2009). ABySS: a parallel assembler for short read sequence data. Genome Research, 19(6), 1117–1123.
• Cabog 7.0: Miller, J. R., Delcher, A. L., Koren, S., Venter, E., Walenz, B. P., Brownley, A., Johnson, J., Li, K., Mobarry, C., and Sutton, G. (2008). Aggressive assembly of pyrosequencing reads with mates. Bioinformatics, 24(24), 2818–2824.
• MaSuRCA 2.2.1: Zimin, A. V., Marçais, G., Puiu, D., Roberts, M., Salzberg, S. L., and Yorke, J. A. (2013). The MaSuRCA genome assembler. Bioinformatics, 29(21), 2669–2677.
• Mira 4.0.2: Barthelson, R., McFarlin, A. J., Rounsley, S. D., and Young, S. (2011). Plantagora: modeling whole genome sequencing and assembly of plant genomes. PLoS One, 6(12), e28436.
• SGA 0.10.13: Simpson, J. T. and Durbin, R. (2012). Efficient de novo assembly of large genomes using compressed data structures. Genome Research, 22(3), 549–556.
• SoapDenovo 2.04: Luo, R., Liu, B., Xie, Y., Li, Z., Huang, W., Yuan, J., He, G., Chen, Y., Pan, Q., Liu, Y., et al. (2012). SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience, 1(1), 18.
• SPAdes 3.0.0: Bankevich, A., Nurk, S., Antipov, D., Gurevich, A. A., Dvorkin, M., Kulikov, A. S., Lesin, V. M., Nikolenko, S. I., Pham, S., Prjibelski, A. D., et al. (2012). SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. Journal of Computational Biology, 19(5), 455–477.
• Velvet 1.2.10: Zerbino, D. R. and Birney, E. (2008). Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Research, 18(5), 821–829.
IDENTIFIED GENE RESULTS
[Figure panels: HiSeq datasets (100bp); MiSeq datasets (250bp)]
CONCLUSIONS
Empirical evaluation of this methodology for 8 leading assemblers using 7
GAGE-B bacterial datasets consisting of 100bp Illumina HiSeq and 250bp
Illumina MiSeq reads shows that HGA leads to a significant improvement in
assembly quality for all evaluated assemblers and all datasets. In ongoing
work we are evaluating the HGA methodology on ultra-deep BAC
sequencing data.
Availability: Version 1.0.0 of HGA, implemented in Python, is available at
http://dna.engr.uconn.edu/software/HGA.
Acknowledgements: This work has been partially supported by the
Agriculture and Food Research Initiative Competitive Grant No. 2011-67016-30331
from the USDA National Institute of Food and Agriculture.
References
• Gurevich, A., Saveliev, V., Vyahhi, N., and Tesler, G. (2013). QUAST: quality
assessment tool for genome assemblies. Bioinformatics, 29(8), 1072–1075.
• Lonardi, S., Mirebrahim, H., Wanamaker, S., Alpert, M., Ciardo, G., Duma, D.,
and Close, T. J. (2015). When less is more: "slicing" sequencing data improves read
decoding accuracy and de novo assembly quality. Bioinformatics, advance
access.
• Magoc, T., Pabinger, S., Canzar, S., Liu, X., Su, Q., Puiu, D., Tallon, L. J., and
Salzberg, S. L. (2013). GAGE-B: an evaluation of genome assemblers for
bacterial organisms. Bioinformatics, 29(14), 1718–1725.