Assembly Presentation #2

Robert Arthur
Kevin Lee
Xing Liu
Pushkar Pande
Gena Tang
Racchit Thapliyal
Tianjun Ye


Sequencing Methods
Experimental comparison of De Bruijn graph
and Overlap graph assemblers

Preliminary Results

Lab Exercise
Sequencing Methods

Sanger Sequencing
◦ Cycle sequencing reaction
◦ ddNTP-terminated, dye-labeled products
◦ High-resolution electrophoretic separation
◦ Parallelized in 96 or 384 capillaries
◦ Read lengths up to 1 kbp
◦ Raw accuracy up to 99.999%
◦ Cost: about $0.50 per kb
Sequencing Methods

Second Gen. Sequencing
◦ Cyclic array methods: 454, Illumina, AB SOLiD, Polonator, HeliScope
◦ Platforms vary in biochemistry and array generation, yet are conceptually similar in workflow
Illumina
Illumina continued
AB SOLiD
454 Pyrosequencing

Create a DNA library
◦ Ligate adaptors to fragments

Emulsion PCR
◦ Agarose beads
◦ Oil, water, PCR reagents
◦ Yields about 1 million copies of a single fragment on each bead
More 454

Beads arrayed into a picotiter plate
◦ Immobilized by adding enzyme-containing beads
◦ Each well contains exactly 1 bead

Bst polymerase, luciferase, apyrase, and ATP sulfurylase are used
Even more 454
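Example of Output
[Figure: flowgram. Horizontal axis: flow order (T, A, C, G); vertical axis: signal intensity, scaled from 1-mer to 4-mer; the read begins with the key sequence TCAG.]
Each flow measures the presence or absence of one nucleotide at the current position; the signal height gives the homopolymer length.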
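The flowgram idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the flow order is taken to be T, A, C, G as listed in the figure legend, each signal is simply rounded to the nearest whole homopolymer length, and the intensity values are made up so that the read starts with the key TCAG. Real base callers model signal noise far more carefully.

```python
def call_bases(flow_values, flow_order="TACG"):
    """Convert flowgram signal intensities into a base sequence.

    Each flow value is rounded to the nearest integer, which is taken as
    the homopolymer length for the nucleotide flowed in that step.
    The flow order "TACG" is an assumption based on the figure legend.
    """
    bases = []
    for i, signal in enumerate(flow_values):
        nucleotide = flow_order[i % len(flow_order)]
        # Add 0.5 and truncate: round-half-up for non-negative signals.
        run_length = int(signal + 0.5)
        bases.append(nucleotide * run_length)
    return "".join(bases)


# Hypothetical intensities; the first two flow cycles spell the key TCAG.
flows = [1.02, 0.05, 0.98, 0.03,   # T, A, C, G -> "T", "",  "C", ""
         0.04, 1.01, 0.02, 0.97,   # T, A, C, G -> "",  "A", "",  "G"
         2.10, 0.00, 0.06, 3.05]   # T, A, C, G -> "TT", "", "",  "GGG"
sequence = call_bases(flows)
print(sequence)  # TCAGTTGGG
```

A signal near 0 means the flowed nucleotide was absent at that position, and a signal near 3 means a run of three identical bases, which is why long homopolymers are the typical error source for this platform.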
Videos (454 Workflow)
Videos (Pyrosequencing)
note: we did not choose the music
Comparison of 2nd Gen Platforms


De Bruijn Graph assemblers and
Overlap Graph assemblers

De Bruijn Graph assemblers
◦ Velvet, ABySS, EULER

Overlap Graph assemblers
◦ Newbler, Edena, SSAKE, VCAKE
Synthetic Data used for Experiments

Write a C program to simulate reads from a
reference genome with a specified read length,
coverage, and base error rate
◦ Human chr 22, ~33.5M bases
◦ Streptococcus suis, NC_012925.1, ~2M bases
◦ Helicobacter acinonychis Sheeba, ~1.5M bases
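
The simulator itself was written in C and is not shown in the slides. The Python sketch below only illustrates the general idea under simple assumptions: reads start at uniformly random positions on one strand, errors are substitutions only, and the function name and parameters are hypothetical.

```python
import random


def simulate_reads(genome, read_length, coverage, error_rate, seed=0):
    """Sample fixed-length reads from a genome string.

    Assumptions (not taken from the original C program): uniform start
    positions, forward strand only, substitution errors only.
    """
    rng = random.Random(seed)
    # coverage = n_reads * read_length / genome_length
    n_reads = (coverage * len(genome)) // read_length
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_length + 1)
        read = list(genome[start:start + read_length])
        for i, base in enumerate(read):
            if rng.random() < error_rate:
                # Replace the base with one of the three other bases.
                read[i] = rng.choice([b for b in "ACGT" if b != base])
        reads.append("".join(read))
    return reads


# Toy 1 kb random genome standing in for a real reference sequence.
genome_rng = random.Random(1)
toy_genome = "".join(genome_rng.choice("ACGT") for _ in range(1000))
reads = simulate_reads(toy_genome, read_length=35, coverage=50, error_rate=0.001)
print(len(reads), reads[0])
```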

Write another C program to measure the
quality of assemblers
◦ N50 length
◦ No. of contigs
◦ Max contig length
◦ No. of mis-assembled contigs
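
As a rough illustration of the first three metrics, here is a minimal Python sketch (the original evaluation program was written in C and is not shown). N50 is taken in its usual sense: the length of the contig at which the running total, counted from the longest contig downward, first reaches half of the total assembly length. Counting mis-assembled contigs needs alignment against the reference and is not covered here.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def assembly_stats(contig_lengths):
    """Summarize an assembly by contig count, total length, max contig, N50."""
    return {
        "n_contigs": len(contig_lengths),
        "total_length": sum(contig_lengths),
        "max_contig": max(contig_lengths, default=0),
        "n50": n50(contig_lengths),
    }


# Hypothetical contig lengths, total 300: 100 + 80 = 180 >= 150, so N50 = 80.
stats = assembly_stats([100, 80, 60, 40, 20])
print(stats)
```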
Read Length


De Bruijn graph assemblers are only suitable for short-read data
K limitation
◦ Use a hash table or sorting to index K-mers
 Each K-mer needs a unique integer key
 K=16: 4^16 = 2^32 <-> 32-bit integer (unsigned int)
 K=32: 4^32 = 2^64 <-> 64-bit integer (unsigned long long)
 K>32: multiple integers are needed to represent the hash table key
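
The key sizes above follow from packing each base into 2 bits. A minimal Python sketch of that encoding is given here; the mapping A=0, C=1, G=2, T=3 is an assumption chosen for illustration, and Python integers are unbounded, so the 32-bit and 64-bit limits only matter in a fixed-width language such as C.

```python
# Assumed 2-bit code per base; any fixed one-to-one mapping would work.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_kmer(kmer):
    """Pack a K-mer into an integer, 2 bits per base."""
    key = 0
    for base in kmer:
        key = (key << 2) | CODE[base]
    return key


def decode_kmer(key, k):
    """Recover the K-mer string from its integer key."""
    bases = []
    for _ in range(k):
        bases.append("ACGT"[key & 3])
        key >>= 2
    return "".join(reversed(bases))


# The largest 16-mer key fills exactly 32 bits, the largest 32-mer key 64 bits.
print(encode_kmer("T" * 16) == 2**32 - 1)
print(encode_kmer("T" * 32) == 2**64 - 1)
print(decode_kmer(encode_kmer("GATTACA"), 7))
```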


Simulate reads from Streptococcus suis
 Read length 300, 50X coverage, error rate 0.1%
Velvet default: K <= 31, so we use K = 31
Assembler   # of contigs (total length)   N50 length   # of misassembled contigs (total length)
Velvet      46515 (1716053 bp)            115 bp       5 (1346 bp)
Recompile Velvet to allow a larger K, K = 99
Assembler   # of contigs (total length)   N50 length   # of misassembled contigs (total length)
Velvet      441 (1974382 bp)              15328 bp     1 (34 bp)
Quality and Accuracy

The literature contains claims such as “the De Bruijn based
approach is prone to false positives” and “the overlap graph
has better quality”

Simulate reads from Helicobacter acinonychis Sheeba
 Read length 35, 50X coverage, error rate 0.1%
Assembler   # of contigs (total length)   N50 length   # of misassembled contigs (total length)
Velvet      336 (1525746 bp)              10.4 kbp     17 (156637 bp)
Edena       340 (1513259 bp)              9.8 kbp      0 (0 bp)

Simulate reads from Streptococcus suis
 Read length 35, 50X coverage, error rate 0.1%
Assembler   # of contigs (total length)   N50 length   # of misassembled contigs (total length)
Velvet      1106 (1969617 bp)             5266 bp      12 (255594 bp)
Edena       1003 (1970342 bp)             6416 bp      0 (0 bp)
Runtime and Memory Usage

Overlap graph based assemblers are
computationally expensive and use more memory
◦ All-to-all alignment of reads, O(n^2)
◦ More memory is needed to store the overlap graph
 Typically, the number of reads is much larger than the
number of K-mers
◦ Especially for short-read data
 With the same coverage and genome length, shorter
reads mean more reads
◦ De Bruijn graphs are said to be more suitable
for NGS data
 Shorter reads and high throughput
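
The effect of read length on the number of reads, and on the cost of all-to-all comparison, can be worked out directly. The sketch below assumes the usual relation coverage = reads x read length / genome length and uses the rounded ~2M base genome size quoted earlier for Streptococcus suis; the function names are hypothetical.

```python
def reads_needed(genome_length, coverage, read_length):
    """Number of reads required to reach a given coverage (rounded up)."""
    return (genome_length * coverage + read_length - 1) // read_length


def all_pairs(n_reads):
    """Number of read pairs an all-to-all comparison has to consider."""
    return n_reads * (n_reads - 1) // 2


genome_length = 2_000_000  # rounded genome size, an approximation
for read_length in (300, 50, 35):
    n = reads_needed(genome_length, coverage=20, read_length=read_length)
    print(read_length, n, all_pairs(n))
```

Going from 300-base reads to 35-base reads at the same coverage multiplies the number of reads by roughly 8.6, and the number of read pairs by roughly the square of that.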


Simulate reads from Streptococcus suis
 802995 reads
 Read length 50, 20X coverage, error rate 0.1%
Xeon E5530 2.4 GHz
Assembler   Time       Memory
Velvet      33 secs    ~220 MB
SSAKE       26 mins    ~900 MB
VCAKE       107 mins   ~1.1 GB
However!

Recent advances in pattern-matching algorithms and techniques
enable the use of overlap graphs
◦ Suffix tree, suffix array, prefix array, compressed suffix array

Suffix array
◦ Can find overlaps between reads in linear time
◦ A compressed suffix array can significantly reduce the
memory requirements of overlap graph assemblers
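
A full suffix array or FM-index implementation is too long to show here. The Python sketch below uses a much simpler stand-in, a sorted list of reads searched with binary search, only to illustrate the underlying idea: an index over the reads replaces the all-to-all comparison with lookups. It is not linear time and it is not the method used by Edena, SGA, or Pasqual; the function name is hypothetical.

```python
from bisect import bisect_left


def find_overlaps(reads, min_overlap):
    """Return (a, b, length) where a suffix of reads[a] equals a prefix of reads[b].

    Simplified stand-in for a suffix-array lookup: the reads are sorted once,
    and every read that starts with a given suffix is found by binary search.
    """
    order = sorted(range(len(reads)), key=lambda i: reads[i])
    sorted_reads = [reads[i] for i in order]
    overlaps = []
    for a, read in enumerate(reads):
        for length in range(min_overlap, len(read)):
            suffix = read[-length:]
            # Reads starting with `suffix` form one contiguous block
            # in sorted order, beginning at the insertion point.
            pos = bisect_left(sorted_reads, suffix)
            while pos < len(sorted_reads) and sorted_reads[pos].startswith(suffix):
                b = order[pos]
                if b != a:
                    overlaps.append((a, b, length))
                pos += 1
    return overlaps


reads = ["ACGTAC", "GTACGG", "ACGGTT"]
overlaps = find_overlaps(reads, min_overlap=3)
print(overlaps)  # [(0, 1, 4), (1, 2, 4)]
```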

Examples
◦ D. Hernandez, P. François, L. Farinelli, M. Osteras, and J. Schrenzel. De
novo bacterial genome sequencing: millions of very short reads assembled
on a desktop computer. Genome Research 18:802-809, 2008.
◦ J. T. Simpson and R. Durbin. Efficient construction of an assembly
string graph using the FM-index. Bioinformatics 26(12):i367-i373, 2010.
◦ Pasqual
 Pushkar and I have developed a parallel sequence assembler based on the overlap
graph in our research project


Simulate reads from Human chr22
 6978908 reads
 Read length 50, 20X coverage, error rate 0.1%
Xeon E5530 2.4 GHz with 4 cores/8 threads
Assembler          Time       Memory
Velvet             292 mins   ~17 GB
Edena              37 mins    ~7 GB
Pasqual            43 mins    ~8 GB
Parallel Pasqual   9 mins     ~8 GB
Mixed Length Reads

H. influenzae
◦ Read lengths from 30 to 300

Velvet does not work
◦ K is fixed
◦ With a large K, reads shorter than K cannot be
assembled
◦ With a small K, it is difficult to assemble the long
reads

Overlap graph assemblers do not have this
issue
◦ Newbler
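
The fixed-K problem can be made concrete with a small Python sketch: a read shorter than K contributes no K-mers at all, so it is invisible to a De Bruijn graph built with that K. The read lengths below are made up for illustration and the helper names are hypothetical.

```python
def kmers(read, k):
    """Return all K-mers of a read; empty when the read is shorter than k."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def usable_fraction(read_lengths, k):
    """Fraction of reads long enough to contribute at least one K-mer."""
    usable = sum(1 for n in read_lengths if n >= k)
    return usable / len(read_lengths)


print(kmers("ACGTA", 3))   # ['ACG', 'CGT', 'GTA']
print(kmers("AC", 3))      # [] : the read is shorter than K

# Hypothetical mixed read lengths in the 30 to 300 range.
lengths = [30, 50, 100, 300]
print(usable_fraction(lengths, 31))  # 0.75
print(usable_fraction(lengths, 99))  # 0.5
```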
Conclusion

Controversial
◦ The relationship between De Bruijn graphs and overlap graphs
is still unclear

We can still conclude from the experiments
◦ Regarding quality and accuracy, overlap graph assemblers
did better than De Bruijn graph assemblers (Edena produced no
misassembled contigs; Velvet produced 12 to 17)
◦ De Bruijn graph assemblers do not work for long reads
◦ De Bruijn graph assemblers do not work for mixed-length
reads (K is fixed)
◦ Traditional overlap graph assemblers are slower and use
more memory, but the latest ones are faster and use less
memory than De Bruijn graph assemblers


Preliminary Results

Quality score and length distribution
[Figures: quality score and read length distribution plots, one per sample]
Id       Mean length   Median length   Std dev
M19107   577.5849      569             83.9605
M19501   624.7172      621             78.4074
M21127   618.7576      616             81.5678
M21621   620.6305      621             83.978
M21639   573.384       564             66.5525
M21709   626.2459      624             78.2447
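
Length statistics of this kind can be computed from a FASTA file of reads. The slides do not say which tool produced them, so the Python sketch below is only one possible way to do it; it assumes the sample standard deviation and uses a tiny made-up FASTA in place of real 454 reads.

```python
import statistics


def read_lengths_from_fasta(lines):
    """Return the sequence length of every record in an iterable of FASTA lines."""
    lengths = []
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A header closes the previous record and opens a new one.
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


# Hypothetical three-read FASTA; a real run would pass an open file instead.
fasta = [">r1", "ACGT", "ACG", ">r2", "ACGTACGTAC", ">r3", "AC"]
lengths = read_lengths_from_fasta(fasta)
print("mean   ", statistics.mean(lengths))
print("median ", statistics.median(lengths))
print("std dev", statistics.stdev(lengths))  # sample standard deviation (assumption)
```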
Velvet
Id       K    No. of contigs   N50   Max length   Total length   % reads used
M19107   19   217160           16    665          2905543        97.3535
M19107   29   176741           26    655          3315033        88.7319
M19501   19   618036           13    429          4716286        78.9177
M19501   29   537077           18    490          5725530        35.5981
M21127   19   319999           15    483          3498613        91.4239
M21127   29   259942           24    416          3998418        73.0187
M21621   19   218872           16    640          3052522        93.7490
M21621   29   157853           26    838          3256837        87.5425
M21639   19   770867           13    628          5818868        85.0236
M21639   29   680339           19    601          7348599        46.1671
M21709   19   291156           16    768          3425632        95.7695
M21709   29   207736           25    816          3637419        83.8704
$> velveth <output_dir> <k-mer length> -fasta -long <reads.fasta>
$> velvetg <output_dir>
Input: Fasta/Fastq
Output: Fasta
WGS assembler (Celera)
• >50 separate programs make up the Celera Assembler pipeline
• runCA script helps manage them all
Input: frg format
Output: Fasta
Id       No. of contigs   N50     Max length   Total length   % reads used
M19107   236              11881   32038        1766060        96.3570
M19501   214              1230    4519         278112         98.6032
M21127   345              8349    26765        1947955        97.9181
M21621   356              7791    30668        1892633        98.1710
M21639   326              2092    9912         610813         98.3939
M21709   520              4393    15002        1700040        98.5221
$> sffToCA -trim soft -libraryname ${Id}-trimsoft -output ${Id}-trimsoft ${Id}.sff
$> runCA -p ${Id} -d ${Id} ovlConcurrency=4 ${Id}-trimsoft.frg
Newbler
De Novo Assembly
Id       No. of contigs   N50      Max length   Total length
M19107   217              15659    38000        25112606
M19501   75               157459   343196       106836011
M21127   59               121256   316274       40693944
M21621   50               138437   339424       50432798
M21639   175              43023    182797       158028027
M21709   52               140128   319869       69503256
Reference Assembly (Haemophilus-influenzae-refseq.fasta)
Id       No. of contigs   N50      Max length   Total length
M19107   1260             2496     10409        1224223
M19501   988              3503     18724        1380153
M21127   (no data)
M21621   (no data)
M21639   1272             2701     13712        1416318
M21709   313              13836    70298        1607841
$> runAssembly <reads.sff>    // de novo assembly
Input: .sff
Output: Fasta
MIRA
MIRA stands for Mimicking Intelligent Read Assembly
Id       No. of contigs   N50      Max length   Total length   % reads used
M19107   208              18379    51687        1795134        95.7478
M19501   181              185484   321569       1901198        97.7347
M21127   89               81157    305626       1951240        97.4776
M21621   67               90877    253924       1887484        97.5015
M21639   175              90800    152373       2378888        98.1330
M21709   83               62871    197745       1840248        97.6776
Input: Fasta + qual + trace info
Output: Fasta, Ace
$> sff_extract -s ${Id}_in.454.fasta -q ${Id}_in.454.fasta.qual -x ${Id}_traceinfo_in.454.xml ${Id}.sff
$> mira --project=${Id} --job=denovo,genome,normal,454 -GE:not=4 >& ${Id}_assembly.log
Eagle view - M19107.ace
Eagle view - M19501.ace
Works Cited

“Next-generation DNA sequencing”, Shendure and Ji,
http://compgenomics2011.biology.gatech.edu/images/f/f9/ShendureNatureBiotechnology-2008.pdf

“Next-generation DNA sequencing methods”, Mardis,
http://compgenomics2011.biology.gatech.edu/images/5/59/MardisAnnuRevGenet-2008.pdf


Lab Exercise

Download the Lab Exercise file from the
Genome Assembly wiki page