De novo assembly of a large genome using next-generation sequencing data
of Toxocara canis
Jung Im Won1), Sangkyoon Hong2), Jin Hwa Kong2), Sun Huh3), Jee Hee Yoon2)
1Research Institute of Electrical and Computer Engineering, Hanyang University;
2Department of Computer Engineering, Hallym University; 3Department of Parasitology,
College of Medicine, Hallym University
Purpose: De novo assembly reconstructs the original sequence from reads without a
reference sequence. Next-generation sequencing (NGS) techniques have the advantage of
producing large quantities of sequence data at low cost. This presentation describes
methods for the de novo assembly and analysis of the whole genome of Toxocara canis,
an intestinal nematode of dogs, using NGS read data.
Methods: Paired-end read data from Toxocara canis libraries with insert sizes of 400 bp
(15.9 Gbp), 1900 bp (10.3 Gbp), and 2900 bp (10.4 Gbp) were used. The total size of the
read data was 36.6 Gbp, the mean read length was 101 bp, and the coverage was 104X.
The de novo assembly program used was SOAPdenovo, which adopts the de Bruijn graph
method based on k-mers. To assess the quality of the results, the total length, mean
length, and N50 of the contigs and scaffolds were analyzed. N50, a standard measure of
de novo assembly, is the length of the shortest contig among the longest contigs that
together make up at least half of the total contig length. The hardware used for this
experiment was a dual Xeon E5620 (quad-core) CPU at 2.4 GHz with 144 GB RAM.
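The de Bruijn graph method mentioned above can be illustrated with a minimal sketch. It is not the SOAPdenovo implementation: the function names, the toy reads, and the k value of 4 are assumptions chosen for illustration, and the walk ignores sequencing errors, reverse complements, and branching, all of which a real assembler must handle.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the (k-1)-mers that follow it in some read."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer is an edge from its prefix to its suffix.
            graph[kmer[:-1]].append(kmer[1:])
    return graph


def walk_unambiguous(graph, start):
    """Extend a contig from `start` while exactly one distinct successor exists."""
    contig = start
    node = start
    seen = {start}
    while True:
        successors = set(graph.get(node, ()))
        if len(successors) != 1:
            break  # dead end or branch: stop extending
        node = successors.pop()
        if node in seen:
            break  # cycle (repeat): stop extending
        seen.add(node)
        contig += node[-1]
    return contig


# Toy overlapping reads (hypothetical data, not from Toxocara canis).
reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]
graph = build_de_bruijn(reads, k=4)
contig = walk_unambiguous(graph, "ATG")
print(contig)
```

A larger k resolves more repeats but needs longer error-free overlaps between reads, which is why the k-mer size has to be tuned for each data set.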
Results: Comparing the statistics of contigs produced at different k-mer sizes from the
read data with an insert size of 400 bp, the best mean contig length, N50, and N90 were
obtained at a k-mer size of 41. Similarity to the genome of a related species was
analyzed using the Caenorhabditis elegans whole genome sequence as the reference.
The read alignment algorithms used were SOAP and GSNAP. k-mismatch alignment was tested
after extending the seed length, and the total number of aligned reads was compared to
assess the accuracy of alignment. The larger the k value, the better the alignment score.
SOAP showed lower accuracy than GSNAP because its k value is below 2 in k-mismatch
alignment. De novo assembly of Toxocara canis produced contigs with a total length of
335 Mbp and an N50 of about 531 bp, at 45.4X coverage, with a k-mer size of 41 and an
insert size of 400 bp.
Scaffolds with a total length of 362 Mbp and an N50 of 4.3 Kbp were obtained by joining
contigs using the mate-pair information from the 1900 bp and 2900 bp read data.
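The N50 and N90 statistics used to compare assemblies can be computed as in the following sketch. The function name `nx` and the example lengths are hypothetical, and the values are made up for illustration rather than taken from the Toxocara canis assembly.

```python
def nx(lengths, percent):
    """Return the Nx statistic for a list of contig or scaffold lengths.

    Nx is the length L such that sequences of length >= L together
    make up at least `percent` percent of the total length.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= percent * total:
            return length
    raise ValueError("lengths must be a non-empty list")


# Hypothetical contig lengths in bp (total 300 bp).
contigs = [100, 80, 60, 40, 20]
n50 = nx(contigs, 50)  # 100 + 80 = 180 >= 150, so N50 is 80
n90 = nx(contigs, 90)  # 100 + 80 + 60 + 40 = 280 >= 270, so N90 is 40
print(n50, n90)
```

Because N90 has to account for more of the total length, it is never larger than N50 for the same assembly.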
Conclusion: NGS-based de novo assembly was performed for Toxocara canis, whose genome
size is about 350 Mbp. The most appropriate k-mer size for building contigs and
scaffolds can be determined. These results show the usability of de novo assembly for
analyzing large genome sequences by presenting a variety of analysis results on contigs
and scaffolds.
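The coverage figures reported above follow from dividing the sequenced bases by the genome size. The sketch below repeats that arithmetic, assuming a genome size of exactly 350 Mbp (the abstract gives it only as "about 350 Mbp"); the variable and function names are illustrative.

```python
GENOME_SIZE_BP = 350e6  # assumed: "about 350 Mbp" taken as exactly 350 Mbp

# Sequenced bases per library, keyed by insert size in bp (values in Gbp).
libraries_gbp = {400: 15.9, 1900: 10.3, 2900: 10.4}


def coverage(bases_bp, genome_size_bp=GENOME_SIZE_BP):
    """Average depth of coverage: sequenced bases divided by genome size."""
    return bases_bp / genome_size_bp


total_bp = sum(libraries_gbp.values()) * 1e9
total_cov = coverage(total_bp)                # all three libraries
contig_cov = coverage(libraries_gbp[400] * 1e9)  # 400 bp library only

print(f"total coverage: {total_cov:.1f}X")
print(f"400 bp library coverage: {contig_cov:.1f}X")
```

Under this assumption the three libraries together give roughly 104X to 105X, and the 400 bp library alone gives about 45.4X, matching the coverage quoted for the contig assembly.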