Keynote Speaker, Steven Salzberg: "Assembling Genomes from

Assembling Genomes from Next-Generation Sequencers
Steven Salzberg
Center for Bioinformatics and Computational Biology
University of Maryland Institute for Advanced Computer
Studies
http://cbcb.umd.edu
Solexa sequencing
What can we do with next-gen sequencers?
1. Assembling genomes from very
short reads (part 1)
2. Mapping millions of reads to the
human genome (part 2)
Assemble a bacterial genome entirely
from Solexa reads
Target: a novel strain of Pseudomonas
aeruginosa isolated from a frostbite patient
Every read exactly 33 bp long
8,627,900 reads generated
approximately 41X coverage
just 1/4 of a single Solexa run
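The coverage figure follows from simple arithmetic: total sequenced bases divided by genome size. A minimal sketch, assuming hypothetical genome sizes (the exact size used for the 41X estimate is not stated on the slide; 6.5 Mbp is the approximate species genome size quoted elsewhere in the talk, and 6.9 Mbp is an illustrative value only):

```python
def fold_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of times each base of the genome is sequenced."""
    return num_reads * read_length / genome_size


NUM_READS = 8_627_900   # from the slide
READ_LENGTH = 33        # from the slide

# Genome sizes below are assumptions for illustration.
for genome_size in (6_500_000, 6_900_000):
    coverage = fold_coverage(NUM_READS, READ_LENGTH, genome_size)
    print(f"{genome_size:>9,} bp genome: {coverage:.1f}X")
# 6,500,000 bp genome: 43.8X
# 6,900,000 bp genome: 41.3X
```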
Assembly strategy
Throw every trick in the book at it
Use related genomes
3 finished strains available
Use de novo assemblers
New gene-boosting assembly method
Pseudomonas aeruginosa
A leading cause of hospital-acquired
infections, especially of the lungs
Leading cause of infections in cystic fibrosis
patients
Large (~6.5 Mbp) bacterial genome
high GC - 66%
Comparative Assembly
AMOScmp assembles a
genome using a related
species
Fast, accurate assembly
http://amos.sourceforge.net
Comparative assembly using multiple genomes
Comparative assembly A
Reference genome A
Divergent regions
X
Y
Z
Target genome
Reference genome B
Comparative assembly B
Comparative assembly using multiple genomes
AMOScmp assembly   Contigs   Contigs >200bp   Max contig
PA14 reference     2053      428              170,485
PAO1 reference     2797      865              75,626
Comparative assembly using multiple genomes
Assembly A
Assembly B
Merge
Merged assembly
Comparative assembly using multiple genomes
AMOScmp assembly   Contigs   Contigs >200bp   Max contig
PA14 reference     2053      428              170,485
PAO1 reference     2797      865              75,626
Merged             1850      306              236,472
De novo assembly
Several new methods available
Short reads require long overlaps
e.g., 33 bp reads must overlap by 20 bp
end-trimming helps
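The overlap requirement can be illustrated with a small suffix-prefix check. This is a sketch only, not the method of any particular assembler; the function name and the example sequence are made up for illustration, and the 20 bp threshold is the example value from the slide.

```python
def overlap_length(a: str, b: str, min_overlap: int = 20) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if the longest such overlap is shorter than `min_overlap`.
    """
    max_len = min(len(a), len(b))
    # Try the longest possible overlap first and shrink until one matches.
    for length in range(max_len, min_overlap - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


# Hypothetical 43 bp sequence; two 33 bp reads taken 10 bp apart.
genome = "ACGTTGCAAGCTTAGGCTCAATCGGATCCTAGAGTCCATGCAT"
read_a = genome[0:33]
read_b = genome[10:43]
print(overlap_length(read_a, read_b))        # 23: accepted with a 20 bp minimum
print(overlap_length(read_a, read_b, 25))    # 0: rejected with a 25 bp minimum
```

With 33 bp reads and a 20 bp minimum, two reads can be joined only if their start positions differ by at most 13 bp, which is why high coverage matters for short-read assembly.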
De novo assembly strategies
SSAKE
Warren et al., 2007
Uses DNA prefix tree to find k-mer matches
Edena
Hernandez et al., 2008
overlap-layout algorithm adapted for short reads
Velvet
Zerbino and Birney, 2008
Uses de Bruijn graph algorithm plus error correction
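For readers unfamiliar with the de Bruijn graph idea, the following is a minimal sketch of graph construction only. It is not Velvet's implementation and omits error correction and graph simplification; the function name and the toy read are illustrative assumptions.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read.

    Every k-mer in a read contributes one edge: prefix -> suffix.
    """
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


# Toy example with k = 3: k-mers are aca, caa, aac, acg.
graph = de_bruijn_graph(["acaacg"], k=3)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

A node with more than one successor (here "ac") marks a branch that an assembler has to resolve, typically using coverage or longer-range information.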
De novo assembler performance
● All three programs run with default parameters on the same data set
● input: 8.6 million reads
● platform: 64-bit Opteron, 4 CPUs, 32 GB
memory
Program   Version   CPU time   Wall clock
SSAKE     3.0       2:24:59    5:08:59
Edena     2.11      0:28:31    28:58
Velvet    0.5       0:08:48    10:36
De novo assemblies
Program   # Contigs   N50 (bp)   Sum (bp)     Max contig
SSAKE     185,030     87         14,287,079   5,490
Edena     11,180      837        6,175,460    11,300
Velvet    10,684      1,184      6,841,458    16,239

Program   # Contigs >200 bp   N50 (bp)   Sum (bp)    Singletons
SSAKE     12,532              549        6,090,567   3,164,495
Edena     8,316               902        5,759,209   3,955,865
Velvet    7,382               1,252      6,474,426   1,273,164
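The N50 statistic in these tables can be computed from a list of contig lengths. The sketch below uses the common definition (the length of the contig at which the running total of contigs, sorted longest first, reaches half of the total assembled bases); the talk does not state which variant was used, so treat this as an assumption.

```python
def n50(lengths):
    """Contig length L such that contigs of length >= L hold at least half the bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


# Toy contig lengths, not data from the talk.
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))  # 8
```

Because N50 is weighted by length, filtering out contigs under 200 bp raises it sharply, as the second table shows for SSAKE.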
Gene-boosted assembly
Contig 1
Contig 2
Gap-spanning gene
Gap-spanning gene sequence
Translated amino acid sequence
Translated, mapped reads
Comparative assembly using multiple genomes
Assembly strategy   Contigs   Contigs >200bp   Max contig
Merged, AMOScmp     1850      306              236,472
Gene-boosted        120       120              512,638
Note: input to Gene-boosted assembly included 306 contigs from Merged assembly
Final assembly
76 contigs in one large scaffold, 6.3 Mb
Largest contig: 512,638 bp
additional 436 small contigs spanning 417 kb
9% of the reads unused
5602 protein-coding genes
5568 in PAO1
5892 in PA14
Challenges of next-gen sequencing
1. Assembling genomes from very
short reads (part 1)
2. Mapping millions of reads to the
human genome (part 2)
Short read alignment
Sequencer
Human source
Reads from new sequencing machines are short: 25-50 bp
Short read alignment
Sequencing machine
And you get
MANY of them
Short read alignment
Need to map them
back to human
reference
Bowtie
• Ultrafast short read alignment software
– designed for 25-63bp reads
• Same sensitivity as Maq, but 35 times faster
• Shares formats with Maq
– compatible with Maq’s SNP caller
• Open source:
– http://cbcb.umd.edu/software
– http://bowtie-bio.sourceforge.net
Bowtie overview
• For each read, finds a ‘good’ hit to the
reference, allowing for mismatches
– Prefers mismatches at lower-quality bases
– Can behave like Maq or SOAP
– Calls SNPs using Maq interface
• Uses Burrows-Wheeler index of the reference
genome
– Pre-built genomes available
– Can download or build your own
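The preference for mismatches at lower-quality bases can be illustrated with a quality-weighted penalty. This is a sketch under stated assumptions, not Bowtie's code: it scans the reference by brute force instead of using the Burrows-Wheeler index, and the function names, the mismatch limit, and the toy data are all hypothetical.

```python
def mismatch_penalty(read, quals, segment):
    """Sum of base qualities at the positions where read and reference differ."""
    return sum(q for r, q, g in zip(read, quals, segment) if r != g)


def best_hit(read, quals, reference, max_mismatches=2):
    """Return (position, penalty) of the lowest-penalty hit, or None.

    Brute-force scan used only for illustration; ties keep the leftmost hit.
    """
    best = None
    for pos in range(len(reference) - len(read) + 1):
        segment = reference[pos:pos + len(read)]
        mismatches = sum(r != g for r, g in zip(read, segment))
        if mismatches > max_mismatches:
            continue
        penalty = mismatch_penalty(read, quals, segment)
        if best is None or penalty < best[1]:
            best = (pos, penalty)
    return best


read = "acgtacgt"
quals = [40, 40, 40, 40, 40, 40, 40, 5]   # last base is low quality
reference = "ccgtacgtttttacgtacga"        # two 1-mismatch hits: at 0 and at 12
print(best_hit(read, quals, reference))   # (12, 5)
```

Both candidate positions have exactly one mismatch, but the hit at position 12 disagrees only at the low-quality final base, so it is preferred.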
Why Burrows-Wheeler?
• BWT very compact:
– Approximately ½ byte per base
– As large as the original text, plus a few “extras”
– Can fit onto a standard computer with 2GB of
memory
• Linear-time search algorithm
– proportional to length of query for exact matches
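The claim that exact matching takes time proportional to the query length can be seen in a backward-search sketch over the BWT. This is a teaching version with uncompressed tables, assuming a text terminated by "$"; it is not how Bowtie stores its index, and all names are illustrative.

```python
def build_index(text):
    """Return (first, occ) tables for backward search over BWT(text)."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    bwt = "".join(rot[-1] for rot in rotations)
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    # first[c]: number of characters in the text that sort before c.
    first, total = {}, 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, b in enumerate(bwt):
        for ch in occ:
            occ[ch][i + 1] = occ[ch][i] + (b == ch)
    return first, occ


def count_exact(pattern, first, occ, n):
    """Count occurrences of pattern; one table lookup pair per pattern character."""
    lo, hi = 0, n
    for ch in reversed(pattern):
        if ch not in first:
            return 0
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


text = "acaacg$"
first, occ = build_index(text)
print(count_exact("ac", first, occ, len(text)))   # 2
print(count_exact("aac", first, occ, len(text)))  # 1
```

The loop in `count_exact` runs once per query character regardless of the reference size; a practical index would sample the `occ` table to keep memory small.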
Burrows-Wheeler Transform (BWT)
Text T: acaacg$
Sorted rotations of T:
$acaacg
aacg$ac
acaacg$
acg$aca
caacg$a
cg$acaa
g$acaac
BWT(T), the last column: gc$aaac
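The transform on this slide can be reproduced by sorting all rotations of the text and reading off the last column. The sketch below is the naive construction, suitable only for small examples; the function names are illustrative, and the inverse is included to show that the transform is reversible.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations; text must end with '$'."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)


def inverse_bwt(last_column):
    """Rebuild the text by repeatedly prepending the BWT column and re-sorting."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(ch + row for ch, row in zip(last_column, table))
    # The original text is the rotation that ends with the terminator.
    return next(row for row in table if row.endswith("$"))


print(bwt("acaacg$"))          # gc$aaac
print(inverse_bwt("gc$aaac"))  # acaacg$
```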
Burrows-Wheeler Matrix (BWM)
Burrows-Wheeler Matrix
$acaacg
aacg$ac
acaacg$
acg$aca
caacg$a
cg$acaa
g$acaac
See the suffix array?
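The connection to the suffix array is that the rows of the matrix, truncated at "$", are the suffixes of the text in sorted order. A minimal sketch, using a naive sort that is fine for a toy string but not for a genome (function names are illustrative):

```python
def suffix_array(text):
    """Start offsets of the suffixes of text, in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_suffix_array(text, sa):
    """BWT character for each row is the character just before that suffix."""
    # For offset 0, text[-1] wraps around to the '$' terminator.
    return "".join(text[i - 1] for i in sa)


text = "acaacg$"
sa = suffix_array(text)
print(sa)                               # [6, 2, 0, 3, 1, 4, 5]
print(bwt_from_suffix_array(text, sa))  # gc$aaac
```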
Handling mismatches
Matching
acctagattcagaggtcaccataggcacatgcag
Don’t backtrack to positions in this region
of the read
Handling mismatches
Matching
acctagattcagaggtcaccataggcacatgcag
Allow mismatches in this part of the read
Don’t backtrack to positions in this region
of the read
Handling mismatches
acctagattcagaggtcaccataggcacatgcag
Flip the read and index
around
Allow mismatches in this part of the read
Don’t backtrack to positions in this region
of the read
gacgtacacggataccactggagacttagatcca
Handling mismatches
acctagattcagaggtcaccataggcacatgcag
Allow mismatches in this part of the read
Don’t backtrack to positions in this region
of the read
gacgtacacggataccactggagacttagatcca
Matching
Handling mismatches
• Bowtie uses a more complex scheme to allow
for more than 1 mismatch
–Divides the read into a 28bp “seed” region, which is
assumed to be of high-quality
–Divides that into two parts, similar to the 1-mismatch
scheme
–Allows backtracking in each part in separate phases
to avoid excessive backtracking
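The idea of splitting the read into two parts can be illustrated for the 1-mismatch case: if a read has at most one mismatch, at least one of its halves must match exactly. The sketch below is a simplified stand-in that uses plain string search in two phases; it does not use a Burrows-Wheeler index, a mirror index, or quality values, and it is not Bowtie's multi-mismatch scheme. All names and data are hypothetical.

```python
def find_all(text, pattern):
    """Yield every start offset of pattern in text, including overlapping ones."""
    start = text.find(pattern)
    while start != -1:
        yield start
        start = text.find(pattern, start + 1)


def hamming(a, b):
    """Number of differing positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def align_one_mismatch(read, reference):
    """Offsets where read aligns to reference with at most one mismatch."""
    half = len(read) // 2
    hits = set()
    # Phase 1: right half matches exactly; allow a mismatch in the left half.
    for pos in find_all(reference, read[half:]):
        start = pos - half
        if start >= 0 and hamming(read[:half], reference[start:pos]) <= 1:
            hits.add(start)
    # Phase 2: left half matches exactly; allow a mismatch in the right half.
    for start in find_all(reference, read[:half]):
        end = start + len(read)
        if end <= len(reference) and hamming(read[half:], reference[start + half:end]) <= 1:
            hits.add(start)
    return sorted(hits)


reference = "ggacctagattcagaggtc"
print(align_one_mismatch("acctagattcag", reference))  # [2] exact match
print(align_one_mismatch("agctagattcag", reference))  # [2] mismatch in left half
print(align_one_mismatch("acctagattgag", reference))  # [2] mismatch in right half
print(align_one_mismatch("agctagattgag", reference))  # []  two mismatches
```

Restricting each phase to mismatches in only one part of the read is what bounds the search; in an index-based aligner the same restriction limits how far backtracking can go.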
Bowtie speed
Alignment of 8.84 million Solexa reads from the 1000 Genomes pilot
[Bar chart. Axis: millions of reads per CPU hour. Bar annotations: 268 x, 54 x]
Bowtie memory requirements
(less is better)
Alignment of 8.84 million Solexa reads from the 1000 Genomes pilot
Peak memory usage (megabytes)
Bowtie Sensitivity
Percent reads aligned
Maq
74.7
SOAP
71.6
Bowtie
75.1
>90% of reads are aligned by all 3 programs
Alignment of 8.84 million Solexa reads from the 1000 Genomes pilot
Bowtie index construction
Maximum allowed memory (GB)
Building index for NCBI human reference, build 36, on a 2.4 GHz Opteron with 32GB RAM
Bowtie index construction
• Can build index for a mammalian genome on a
desktop workstation in < 1 day
• Pre-built indices at CBCB:
H. sapiens        2.1 GB
M. musculus       1.8 GB
D. melanogaster   118 MB
S. cerevisiae     12 MB
others…
Acknowledgements
Assembly with short reads
Dan Sommer, Daniela Puiu, Vincent Lee
Short-read alignment (Bowtie)
Ben Langmead, Cole Trapnell, Mihai Pop
Funding
NIH R01-LM06845, R01-GM083873