ppt - University of Connecticut

Novel multi-platform next generation assembly
methods for mammalian genomes
The Baylor College of Medicine, the Australian
Government and the University of Connecticut
teamed up (under the KanGO
consortium) as part of an international
effort to generate a de novo genome
sequence for a marsupial model species, the
tammar wallaby (Macropus eugenii).
• The tammar wallaby is an important model organism: it belongs to
the metatherian mammals and harbors unique life history traits and
genome features.
• Genome sequencing was done at several institutions, using several
technologies, over several years.
• We developed a pipeline to integrate long and short read sequences
and existing assemblies using well-known mapping and assembly tools
such as Bowtie and Phrap.
Overview of the Data
Technology   Institution   Read Length   # Reads       # Bases
Sanger       BCM           Avg. 915 bp   9,924,136     9,088,748,105
454          AGRF, UCONN   Avg. 160 bp   1,530,592     275,951,386
Illumina     AGRF          100 bp        271,875,064   27,187,506,400
SOLiD        BCM           25 bp         710,427,490   18,471,114,740
“Paired” Read Overview:
[Figure: a read pair, drawn as Read, Insert, Read]
• All reads used in the reassembly were “paired”.
• Read orientation was not considered.
• The SOLiD reads had an insert size of 1.395 kb.
• The Illumina reads had insert sizes of 3 and 8 kb in roughly equal proportions.
• The 454 reads had insert sizes of 8, 12, 20, and 30 kb, with the majority
split between 20 and 30 kb.
Local Assembly Pipeline
Initial Assembly → Map Short Reads → Scaffold Assembly → Final Mapping → Quality Calling → Gap Reestimation → Finished Assembly
Initial Assembly
Sanger reads were assembled
using Atlas (BCM). Initial
scaffolding was done using
SOLiD mate pairs.
Map Short Reads
Short reads are mapped to the scaffolds using Bowtie. Reads that map
to multiple locations (non-unique) are not considered.
For each pair there are two cases:
• Orphaned, when one
mate is unmapped.
• Complete, when both mates are
mapped.
Figure key: 454, Illumina, Sanger, unmapped read; dotted line = estimated distance
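The two cases, together with the uniqueness filter, can be sketched as follows. This is only an illustration under stated assumptions, not the pipeline's actual code: the function name `classify_pair` is hypothetical, each mate is summarized by the number of alignments found for it, and a pair is assumed to be dropped entirely when either mate maps to multiple locations.

```python
from typing import Optional


def classify_pair(hits_mate1: int, hits_mate2: int) -> Optional[str]:
    """Classify a read pair from the number of alignments of each mate.

    Hypothetical helper: 0 hits = unmapped, 1 hit = uniquely mapped,
    more than 1 hit = maps to multiple locations (non-unique).
    """
    # Assumption: a non-unique mate disqualifies the whole pair.
    if hits_mate1 > 1 or hits_mate2 > 1:
        return None
    mapped = hits_mate1 + hits_mate2  # each value is 0 or 1 at this point
    if mapped == 2:
        return "complete"   # both mates uniquely mapped
    if mapped == 1:
        return "orphaned"   # one mate mapped, its partner unmapped
    return None             # neither mate mapped: nothing to use
```

Complete pairs carry distance information between their two anchor points, while orphaned pairs contribute an unmapped read whose approximate position is known from its mapped mate.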
Pipeline continued
Scaffold Assembly
Feed the contigs and the complete
and orphaned pairs to
Phrap and reassemble.
Final Mapping
Map all data to the final
contigs.
Quality Calling
Map all reads and calculate a
quality score for each base.
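The slides do not say how the per-base quality is computed. Purely as an illustration, the sketch below scores one alignment column with a Phred-scaled value, Q = -10 log10(p), where p is a smoothed estimate of the chance that the consensus base is wrong. The function name `base_quality`, the smoothing rule, and the cap `max_q` are all assumptions made for this example.

```python
import math
from collections import Counter
from typing import Tuple


def base_quality(column: str, max_q: int = 60) -> Tuple[str, int]:
    """Return (consensus base, Phred-scaled quality) for one pileup column.

    `column` holds the bases that the mapped reads place at a single
    position, for example "AAAAC". Hypothetical scoring scheme.
    """
    if not column:
        return ("N", 0)  # no coverage: unknown base, zero quality
    counts = Counter(column.upper())
    consensus, agree = counts.most_common(1)[0]
    depth = len(column)
    # Laplace-smoothed error estimate, so full agreement never gives p = 0.
    p_error = (depth - agree + 1) / (depth + 2)
    quality = round(-10 * math.log10(p_error))
    return consensus, min(quality, max_q)
```

Under this scheme the score grows with the number of agreeing reads, which matches the idea that higher short-read coverage makes a base call more trustworthy.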
Gap Reestimation
Reestimate all gap distances in scaffolds
using complete pairs mapped
on different contigs.
Output: contigs, quality scores, scaffolds (AGP file).
Local Assembly Close-up
[Figure: screenshot with three panels, A, B and C]
• Screenshot taken from CodonCode Aligner, which uses Phrap to map reads.
• Red and blue denote orientation.
• A: Two contigs may be fused using short reads.
• B: A contig may be extended with short reads at one end.
• C: Single-nucleotide and other small errors are corrected using the higher
coverage of the short reads.
• We are confident in these changes because, even at < 10x coverage, each read is
either uniquely mapped itself or has its approximate position supported by a
uniquely mapped read.
Assembly Comparison and Validation
Category         Meug 1.0   Meug 1.1   Meug 1.2
Contigs (10^6)   1.21       1.17       1.101
N50 (10^3)       2.5        2.6        2.91
Bases (10^6)     2546       2536       2574
Scaffolds        616418     277711     277711
Gaps (10^6)      NA         539        614
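For readers unfamiliar with the N50 row: N50 is the contig length L such that contigs of length L or longer contain at least half of all assembled bases. A minimal sketch of the standard calculation follows; the function name `n50` is chosen for this example and is not taken from the pipeline.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one contig length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first contig where the cumulative length
        # reaches half of the total assembled bases.
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for a non-empty list of lengths")
```

For example, `n50([2, 2, 2, 3, 3, 4, 8, 8])` returns 8, because the two longest contigs already hold 16 of the 32 bases.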
RIKEN BACs
Assembly         Library   Total Reads   Total Bases   Recovered Reads   Recovered Bases
1.1 (original)   BAC       147312        113232882     34734             25662069
1.1 (original)   FOSMID    31250         18777544      4335              2263696
1.2 (updated)    BAC       147312        113232882     33328             24555624
1.2 (updated)    FOSMID    31250         18777544      4294              2241059
Gap Estimation Methods
• The gap between two contigs is estimated using an expectation
maximization algorithm. The steps are repeated until the
estimated parameters no longer change.
• Step 1 (maximization): compute the gap estimate x, taking the
mean insert length of the N pairs to be μ (the initial value is the
library average).
• Step 2 (sampling): given x and the lengths of the contigs,
sample μ from the completely mapped pairs spanning the gap.
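One possible reading of these two steps is sketched below. It is not the authors' implementation. Assumptions made for the example: insert sizes follow a normal distribution with the library mean and standard deviation; a pair can only span the gap if its insert lies between gap + 2 × read length and gap + the two contig lengths; and the sampling step is replaced by the closed-form mean of that truncated normal. All function and parameter names are hypothetical.

```python
import math
from typing import Sequence


def _pdf(z: float) -> float:
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def _cdf(z: float) -> float:
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def truncated_mean(mu: float, sd: float, lo: float, hi: float) -> float:
    """Mean of a normal(mu, sd) restricted to the interval [lo, hi]."""
    a, b = (lo - mu) / sd, (hi - mu) / sd
    mass = _cdf(b) - _cdf(a)
    if mass < 1e-12:  # interval far in a tail: fall back to the nearest bound
        return min(max(mu, lo), hi)
    return mu + sd * (_pdf(a) - _pdf(b)) / mass


def estimate_gap(spans: Sequence[float], len_a: int, len_b: int, read_len: int,
                 lib_mean: float, lib_sd: float,
                 tol: float = 0.01, max_iter: int = 1000) -> float:
    """Iteratively estimate the gap between two contigs.

    spans: for each complete pair, the part of its insert lying on the
           two contigs (insert = span + gap).
    """
    if not spans:
        raise ValueError("at least one spanning pair is required")
    mean_span = sum(spans) / len(spans)
    gap = lib_mean - mean_span  # initial value: mu = library average
    for _ in range(max_iter):
        # Step 2 (here: expectation): mean insert of pairs able to span this gap.
        mu = truncated_mean(lib_mean, lib_sd,
                            gap + 2 * read_len, gap + len_a + len_b)
        # Step 1: recompute the gap from the current mean insert length.
        new_gap = mu - mean_span
        if abs(new_gap - gap) < tol:
            return new_gap
        gap = new_gap
    return gap
```

With long contigs the truncation has almost no effect and the estimate is close to the library mean minus the mean span. With short contigs, long inserts cannot be observed, so the corrected estimate is smaller than that naive value.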
Gap Estimation Results
• When using libraries with different insert sizes and standard deviations it is
necessary to bundle the estimates.
• The following is an example of how two libraries are bundled:
• n_x = number of reads in library x, e_x = gap estimate from library x, s_x = standard deviation of library x.
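The bundling formula itself did not survive in this copy of the slides. As a hedged illustration only, a common way to combine such estimates is inverse-variance weighting: an estimate based on n reads from a library with standard deviation s has variance of roughly s²/n, so it receives weight n/s². The sketch below applies that rule; it is an assumption, not necessarily the formula on the original slide, and `bundle_estimates` is a hypothetical name.

```python
from typing import Iterable, Tuple


def bundle_estimates(
    libraries: Iterable[Tuple[int, float, float]]
) -> Tuple[float, float]:
    """Combine per-library gap estimates by inverse-variance weighting.

    libraries: (n, e, s) tuples, where n = number of reads in the library,
               e = that library's gap estimate, s = library std deviation.
    Returns (bundled estimate, standard deviation of the bundled estimate).
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for n, e, s in libraries:
        weight = n / (s * s)  # inverse of the estimate's variance s^2 / n
        weighted_sum += weight * e
        weight_total += weight
    if weight_total == 0.0:
        raise ValueError("no reads to bundle")
    return weighted_sum / weight_total, (1.0 / weight_total) ** 0.5
```

Under this rule a library with many reads and a tight insert size distribution dominates the bundled value, while a small or highly variable library contributes little.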
Simulation study of EM algorithm accuracy.
Rows: simulated gap size; columns: contig length; cells: estimated gap.
gap \ ctg len   1000   2000   3000   4000   5000
10               -19     -7      2     -1    -15
100               66     87     92     88     74
200              161    188    192    188    173
500              446    492    492    487    469
800              733    795    794    784    764
1000             939    997    995    982    956
1200            1153   1196   1196   1177   1152
1500            1501   1493   1496   1469   1436
Conclusion
• This method is a viable way of improving existing draft genomes
with short read technologies at limited (<10x depth) coverage.
• This method is robust and easily parallelized, so it is practical
for large mammalian genomes.
Future Work
• Better results may be obtained through multiple iterations.
• Re-scaffolding of the contigs should be done between iterations.
• A contig-aware assembly algorithm could improve local
assembly performance.