3. Sequence Assembly Overview

advertisement
MGM Workshop
Assembly Tutorial
Alicia Clum
DOE Joint Genome Institute,
Walnut Creek, CA
May 14, 2012
Contents
1. Vocabulary introduction
2. Introduction to short-read genome
sequencing and assembly
3. Practical experience of short read genome
assembly
4. Improving genome assembly using 3rd
generation sequencing
Contents
1. Vocabulary introduction
2. Introduction to short-read genome
sequencing and assembly
3. Practical experience of short read genome
assembly
4. Improving genome assembly using 3rd
generation sequencing
Vocabulary
• Fragment library: a short insert (270bp)
library with overlapping ends. Aka std library
• Long insert library: A 4-8kb library where
only 100 bp on each end are sequenced. Aka
CLIP, mate pair library
• Contig: A contiguous sequence of DNA
• Scaffold: One or more contigs linked
together by unknown sequence
• Captured gap: A gap within a scaffold. The
order and orientation of the contigs spanning
the gap is known
A
B
C
D
E
Contents
2. Introduction to short-read genome
sequencing and assembly
•
•
•
Short read sequencing and assembly basics
Short read assembly - De Bruijn graph example
Short read assembly – Scaffolding
Why sequence genomes using
short reads?
Mb/day
Cost / Mb
Read length
20,000
650bp
450bp
$400
1
Sanger
$15
1,000
150bp
454
Illumina HiSeq
Traditional genome
sequencing technology
$0.1
Short-read
$$! - We have to figure out how to sequence
microbial genomes using only illumina data
Short read genome sequencing
Genomic
DNA
Random
fragmentation
4-8 kb
fragments
270 bp
fragments
molecular biology
Paired-end
short insert
reads
(10’s millions)
Sequencing
(Illumina)
Paired-end
long insert
reads
(10’s millions)
How do we assemble this data back into a genome?
Assembly outline
Reads
Contigs
Assembly
algorithms
e.g.
Allpaths, Velvet,
Meraculous
Scaffolds
Assembly outline
Reads
Contigs
Scaffolds
‘De Bruijn’
assembly
Contents
2. Introduction to short-read genome
sequencing and assembly
•
•
•
Short read sequencing and assembly basics
Short read assembly - De Bruijn graph example
Short read assembly – Scaffolding
De Bruijn example
“It was the best of times, it was the worst of times, it was the
age of wisdom, it was the age of foolishness, it was the
epoch of belief, it was the epoch of incredulity,.... “
Dickens, Charles. A Tale of Two Cities. 1859. London: Chapman Hall
Example courtesy of J. Leipzig 2010
De Bruijn example
itwasthebestoftimesitwastheworstoftimesitwastheageofwisdomitwastheageoffoolishness…
Generate random ‘reads’
How do we assemble?
fincreduli geoffoolis Itwasthebe Itwasthebe geofwisdom itwastheep epochofinc timesitwas stheepocho nessitwast wastheageo theepochof
stheepocho hofincredu estoftimes eoffoolish lishnessit hofbeliefi pochofincr itwasthewo twastheage toftimesit domitwasth ochofbelie
eepochofbe eepochofbe astheworst chofincred theageofwi iefitwasth ssitwasthe astheepoch efitwasthe wisdomitwa ageoffooli twasthewor
ochofbelie sdomitwast sitwasthea eepochofbe ffoolishne eofwisdomi hebestofti stheageoff twastheepo eworstofti stoftimesi theepochof
esitwasthe heepochofi theepochof sdomitwast astheworst rstoftimes worstoftim stheepocho geoffoolis ffoolishne timesitwas lishnessit
stheageoff eworstofti orstoftime fwisdomitw wastheageo heageofwis incredulit ishnessitw twastheepo wasthewors astheepoch heworstoft
ofbeliefit wastheageo heepochofi pochofincr heageofwis stheageofw fincreduli astheageof wisdomitwa wastheageo astheepoch olishnessi
astheepoch itwastheep twastheage wisdomitwa fbeliefitw bestoftime epochofbel theepochof sthebestof lishnessit hofbeliefi Itwasthebe
ishnessitw sitwasthew ageofwisdo twastheage esitwasthe twastheage shnessitwa fincreduli fbeliefitw theepochof mesitwasth domitwasth
ochofbelie heageofwis oftimesitw stheepocho bestoftime twastheage foolishnes ftimesitwa thebestoft itwastheag theepochof itwasthewo
ofbeliefit bestoftime mitwasthea imesitwast timesitwas orstoftime estoftimes twasthebes stoftimesi sdomitwast wisdomitwa theworstof
astheworst sitwasthew theageoffo eepochofbe
…etc. to 10’s of millions of reads
Traditional all-vs-all assemblers fail due to immense
computational resources (scales with number of reads2)
A million (106 ) reads requires a trillion (1012) pairwise alignments
De Bruijn solution:
Represent the data as a graph (scales with genome size)
De Bruijn example
Step 1:
Convert reads into “Kmers”
Kmer: a substring of defined length
Reads: theageofwi
Kmers : the
(k=3)
hea
sthebestof astheageof worstoftim imesitwast
sth
ast
the
eag
sth
heb
geo
eof
ofw
fwi
sto
tof
esi
sto
eag
est
mes
rst
hea
bes
ime
ors
the
ebe
age
wor
sit
tof
age
geo
eof
oft
fti
tim
itw
twa
was
ast
…..etc for all reads in the dataset
De Bruijn example
Step 2:
Build a De-Bruijn graph from the kmers
ast
ast
sth
sth
the
the
the
hea
hea
eag
eag
age
age
geo
geo
eof
eof
ofw
heb
ebe
bes
est
sto
sto
tof
tof
wor
ors
rst
oft
fti
fwi
tim
ime
was
mes
esi
twa
itw
sit
…..etc for all ‘kmers’ in the dataset
De Bruijn example
Step 3:
Simplify the graph as much as possible:
A De Bruijn Graph
De Bruijn assemblies ‘broken’ by repeats longer than kmer
“It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of
foolishness, it was the epoch of belief, it was the epoch of incredulity,.... “
Drawback of De Bruijn approach
Step 4: Dump graph into consensus (fasta)
No single solution!
Break graph to produce final assembly
Kmer size is an important parameter in
De Bruijn assembly
The final assembly (k=3)
wor
times
incredulity
foolishness
itwasthe
age
epoch
be
st
of
wisdom
belief
Repeat with a longer “kmer” length
A better assembly (k=20)
itwasthebestoftimesitwastheworstoftimesitwastheageofwisdomitwastheageoffoolis…
Why not always use longest ‘k’ possible?
Sequencing errors:
k=3
sth the ebe ent tof
heb
ben nto
sthebentof
k=10
sthebentof
Mostly
unaffected
kmers
100% wrong kmer
Contents
2. Introduction to short-read genome
sequencing and assembly
•
•
•
Short read sequencing and assembly basics
Short read assembly - De Bruijn graph example
Short read assembly – Scaffolding
Scaffolding
Reads
Contigs
‘De Bruijn’
assembly
Align reads to DeBruijn
contigs
Join contigs using evidence
from paired end data
Scaffolds
(An assembly)
“Captured” gaps caused by
repeats. Represented by “NNN” in
assembly
Contents
1. Vocabulary introduction
2. Introduction to short-read genome
sequencing and assembly
3. Practical experience of short read genome
assembly
4. Improving genome assembly using 3rd
generation sequencing
Real life assembly is messy!
Assembly in theory
Uniform coverage, no errors, no contamination
Assembly in reality
Contaminant reads
(-> incorrect +
inflated assembly)
Chimeric reads (->mis-joins)
Sequencing errors
(-> fragmented assembly)
*
* Biased* *coverage (->gaps) *
*
*
*
Worse than predicted assemblies!
Coverage (x)
Fraction of normalized coverage
Real life assembly is messy!
Reference position (bp)
Theoretical
GC% of 100 base windows
Genome properties can also
make assembly difficult
High repeat content
Polyploidy
r
r
r
r
a
a’
r
RESULT:
misassemblies /
collapsed assemblies
Biased sequence composition
RESULT:
fragmented
assembly
Biased sequence abundance
ACTGTCTAGTCAGCGCGC
GCGCGCGCGCCCGCGCG
CGCGGGCGGCGGCGCGG
GCGGGCGCATGTAGTGAT
C
RESULT:
incomplete / fragmented assembly
RESULT:
Incomplete / fragmented assembly
How do we get a good assembly?
Key features of the JGI microbe
sequencing and assembly pipeline
Genomic
DNA
Fragment
Longer inserts
Improved scaffolding
270bp
Overlapping pairedend reads for short
insert library
Allows for long ‘kmer’
Sequence
2 x 150bp
Extensive data QC:
-Remove artifacts
-Remove contaminants
-Reorder libraries
4-8 kb
2 x 100 bp
QC
Assembler
Best Assemblies
Illumina V3
sequencing chemistry
Reduced GC bias
AllPaths LG assembler
Most complete + most
accurate assembler in
our hands
Internal error
correction
Benefits of ALLPATHS
• Internal error correction of all data types
• Simplifies the graph by removing kmers at low coverage
caused by errors
velvet assembly results for error corrected standard data
4000
number of contigs
3500
3000
2500
4084251
2000
4092796
1500
1000
500
0
allpaths-lg
control-no error correction
msr-ca
error correction method
• Given the correct input data, a good assembly can be
produced with default options
• Overlaps the fragment data to use a longer kmer size
• Produces accurate and highly contiguous assemblies
Simulated De Bruijn assembly for six
‘known’microbial genomes
100
90
80
70
60
50
40
30
20
10
0
97%
Bacteria name:
A. haemolyticum
B. Murdochii
C. Flavigena
S. Smaragdinae
H. Turkmenica
C. Woesei
Kmer = 150
A small fraction of
the genome remains
‘unassemblable’
0
10
20
30
40
50
60
70
80
90
100
110
120
130
140
150
% Genome in unique ‘kmers’
What fraction of a genome should we be able to assemble, that is,
can be represented in unique kmers?
'Kmer' length (bp)
Kmer =30, most of the genome CAN be assembled
Matt Blow
How fragmented are short read assemblies?
Goal: genome in 1 fragment per replicon
Short insert
library only
S. Smaragdinae
H. Turkmenica
(7 replicons)
C. Woesei
C. Flavigena
B. Murdochii
Assemblies from
simulated data
0
50
100
150
Number of scaffolds
Assemblies using only short insert sequencing libraries
are acceptable (<100 scaffolds)
How good are short read assemblies?
Goal: genome in 1 fragment per replicon
S. Smaragdinae
Assemblies from
real data
H. Turkmenica
Assemblies from
simulated data
(7 replicons)
Short insert
library only
C. Woesei
C. Flavigena
B. Murdochii
0
50
100
150
Number of scaffolds
Assemblies using only short insert sequencing libraries
are acceptable (<100 scaffolds)
How good are short read assemblies?
Goal: genome in 1 fragment per replicon
S. Smaragdinae
Assemblies from
real data
H. Turkmenica
Assemblies from
simulated data
(7 replicons)
Short + long
insert
libraries
C. Woesei
C. Flavigena
B. Murdochii
0
5
10
Number of scaffolds
Assemblies using short and long insert sequencing
libraries are very good (often a single fragment!)
Comparison of assembly results to
reference genome annotation
Short insert
library only
Short + long
insert libraries
Average
number of
fragments
Average %
known genes
identified
85
97.3%
3
97.4%
Near complete and accurate assemblies are now
possible using only short read data.
But problems remain:
Reads
Contigs
‘De Bruijn’
assembly
Align reads to DeBruijn
contigs
Join contigs using evidence
from paired end data
An assembly
“Captured” gaps caused by
repeats. Represented by “NNN” in
assembly
Contents
1. Vocabulary introduction
2. Introduction to short-read genome
sequencing and assembly
3. Practical experience of short read genome
assembly
4. Improving genome assembly using 3rd
generation sequencing
Pacific Biosciences Sequencer
Long reads from “3rd generation” Pacific Biosciences sequencer hold promise
for improving short-read based assemblies
Late April 2012
14,000
12,000
10,000
8,000
6,000
4,000
2,000
Mean read
length = 2743 bp
Mean read length =
1080bp
Maximum read length >4kb
1
2
3
4
5
Read length (Kb)
Low coverage bias
Reduced sensitivity to
G+C rich regions
compared to illumina
chemistry
6
7
Fraction of normalized coverage
Number of reads
“Early” data
High error rate
Up to 15% error rate
In-del errors
GC% of 100 base windows
PacBio pilot study
For each of 20 Microbes:
Genomic DNA
+
2 x 150bp
270bp insert
+
2 x 100bp
8kb insert
(Illumina HiSeq, V3 chemistry, 2x150b)
Assembly 1
Assembly 2
Assembler: AllPaths V39750 (PacBio enabled)
“1000”bp
(PacBio)
3rd Generation sequence data improves
genome assembly
Most improved genome:
53 / 71 (75%) gaps closed
Number of ‘captured’ gaps in
assembly
100
90
80
70
60
50
40
30
20
10
0
Assembled using:
Illumina only
Illumina + PacBio
Least improved genomes
(.. but started out in good shape)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Microbe (sorted by number of gaps closed)
PacBio assembly improvement is greatest
for genomes with high GC content
90
80
80
LEAST improved genomes
average 56% G+C
75
70
70
65
60
60
50
55
40
50
30
45
20
40
10
35
0
Conclusion:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
rd
3 generation sequencing
promising
for better
Microbe (sorted is
by a
number
of gaps tool
closed)
assemblies, especially for G+C rich genomes.
30
Assembly G+C content
Number of ‘captured’ gaps in
assembly
100
MOST improved genomes
average 66% G+C
Summary
• High quality genome sequencing using
only short-reads is within reach
• Short-read microbial genomes assemblies
are minimally fragmented and contain the
vast majority known genes
• Third-generation sequencing may provide
an inexpensive path to finished genomes
END
Assembly Q+A
Alex Copeland
(Assembly group lead)
James Han
(QC group lead)
Alicia Clum
(Analyst,
Genome assembly)
Metagenome Assembly
Cow rumen
Acid mine
1
10
100
Soil
1000
10000
Species per metagenome
-All challenges of isolate genome assembly remain
-Extra challenges from diversity and different abundance of
constituent genomes
- The same strategies as isolate assembly can be used, but
many heuristics fail for metagenomes
Road map for Long Mate Pair (LMP) library Improvement & Development
Fragmentation
column
Hydroshear
Circularization
Cre-Lox
Covaris
Transposon
Hybridization
Ligation
Nick translation
shearing
Purification
Electro-elution
Clean up
efficiency
Paired tag
generation
RE enzyme
AmplificationPCR cycle #, specificity, choice of enzyme ---
2-D
Electrophoresis
CLIP
LFPE
Illumina
De Bruijn example
The final assembly (k=3)
wor
times
incredulity
foolishness
itwasthe
age
epoch
be
of
st
wisdom
belief
Repeat with a longer “kmer” length
A better assembly (k=20)
itwasthebestoftimesitwastheworstoftimesitwastheageofwisdomitwastheageoffoolis…
Why not always use longest ‘k’ possible?
Sequencing errors:
k=3
sth the ebe ent tof
heb
ben nto
sthebentof
k=10
sthebentof
Mostly
unaffected
kmers
100% wrong kmer
Short read assemblies are
improving over time
100
80
60
40
20
0
80
100
60
80
40
60
20
40
Longest contig
(Kb)
…resulting in better microbial genome assemblies
150
40
30
100
Average results from suboptimal “QC” assemblies
50
20
10
Q4 2009
(n = 64)
Q1 2010
(n = 90)
Q2 2010
(n = 31)
Q3 2010
(n = 68)
Q4 2010
(n = 94)
High quality
reads (%)
Read number
(millions)
But illumina sequence quantity and quality is increasing…
Genome GC
(%)
10
8
6
4
2
0
Q1 2011
(n = 43)
contig N50
(Kb)
Genome size
(Mb)
Sequenced genome properties remain constant
100
90
80
70
60
50
40
30
20
10
0
97%
Bacteria name:
A. haemolyticum
B. Murdochii
C. Flavigena
S. Smaragdinae
H. Turkmenica
C. Woesei
Kmer = 150
A small fraction of
the genome remains
non-unique
0
10
20
30
40
50
60
70
80
90
100
110
120
130
140
150
% Genome in unique ‘kmers’
Q1. What fraction of kmers are unique
in a typical microbial genome?
(6 known microbe genomes, 15 kmer lengths)
'Kmer' length (bp)
Kmer = 30, most of the genome is in unique kmers
Majority of genome contained in unique fragments
Predicted number of
fragments in the assembly
Q2. How do non-unique kmers
fragment the assembly?
250
200
150
100
50
0
A. haemolyticum
B. Murdochii
C. Flavigena
C. Woesei
H. Turkmenica
Microbe
Assemblies will be fragmented!
How do we improve this?
S. Smaragdinae
Metagenome assembly is an
ongoing challenge
-All challenges of isolate genome assembly remain
-Extra challenges from diversity and different abundance of
constituent genomes
- The same strategies as isolate assembly can be used, but
many heuristics fail for metagenomes
Useful Reviews
•
•
Miller JR, Koren S, Sutton G. ,
Assembly algorithms for nextgeneration sequencing data.
Genomics. 2010 Jun;95(6):315-27.
Mihai Pop, Genome assembly reborn:
recent computational challenges.
Brief Bioinform (2009) 10 (4):354-366.
Illumina data quality
Syntrophorhabdus aromaticivorans
Genome Properties
PASS
Read Quality
Library Quality
Run Quality
Illumina data quality
Opitutaceae bacterium TAV2
Genome Properties
FAIL
Read Quality
Library Quality
Run Quality
Metagenomes are harder to assemble
Velvet gaps
Fibrobacter succinogenes & Ignisphaera aggregans
velvet gap size distribution of aligned contig shreds
120
60
4085750 std (trimmed to
Q20)
4085750 jumping
(trimmed to Q20)
4085750 std trimmed +
jumping trimmed
4086221 std
40
4086221 jumping
20
4086221 std + jumping
100
number of gaps
80
0
< 100
100-999
1000-1999 2000-2999 3000-3999
Gap size (bp)
> 4000
Features of Assemblers
Algorithm Feature
Read features
Removal of erroneous reads
Base error correction
Graph construction
Graph reduction
Base substitutions
Homopolymer miscount
Concentrated error in 3′ end
Flow space
Based on K-mer frequencies
Based on K-mer freq and QV
For multiple values of K
By alignment to other reads
Based on K-mer frequencies
Based on Kmer freq and QV
Based on alignments
OLC Assemblers
DBG Assemblers
Euler, AllPaths, SOAP
CABOG
Euler
Newbler
Euler, Velvet, AllPaths
AllPaths
AllPaths
CABOG
Euler, SOAP
AllPaths
CABOG
Reads as graph nodes
K-mers as graph nodes
Simple paths as graph nodes
CABOG, Newbler, Edena
Collapse simple paths
Erosion of spurs
Bubble smoothing
Bubble detection
Reads separate tangled paths
Break at low coverage
Break at high coverage
High coverage indicates repeat
CABOG, Newbler
CABOG, Edena
Edena
Euler, Velvet, ABySS, SOAP
AllPaths
CABOG
CABOG
Euler, Velvet, SOAP
Euler, Velvet, AllPaths, SOAP
Euler, Velvet, SOAP
AllPaths
Euler, SOAP
Velvet, SOAP
Euler
Velvet
Graph partitions
Partition by K-mers
Partition by scaffolds
ABySS
AllPaths
Mate pairs
Constrain path searches
Guide path selection
Merge contigs or fill gaps
Transitive link reduction
Detect, avoid repeat contigs
Create scaffolds
Euler, Velvet, AllPaths
Euler, Allpaths
Velvet, ABySS, SOAP
SOAP
Velvet, SOAP
Euler, Velvet, AllPaths, SOAP
CABOG, Shorty
CABOG
CABOG
CABOG, Shorty
J.R.Miller et al. Genomics 95 (2010)
Greedy overlap
CORRECT
INCORRECT
(A) Overlap between two read (agreement within overlapping
region need not be perfect);
(B) Correct assembly of a genome with two repeats (boxes)
using four reads A–D;
(C) Incorrect assembly produced by the greedy approach.
(D) Disagreement between two reads (thin lines) that could
extend a contig (thick line), indicating a potential repeat
boundary. Contig extension must be terminated in order to
avoid misassembly.
Pop M Brief Bioinform 2009;10:354-366
Overlap-Layout-Consensus (OLC)
Overlap graph of a genome containing a two-copy repeat (B).
Comparing Overlap and
de Bruijn Graphs
Schatz et al., Genome Res. (2010)
Iterative kmer evaluation:
IDBA
Y Peng, et al. IDBA - A Practical Iterative de Bruijn Graph De Novo Assembler (2010)
Jeremy Leipzig (jerdemo.blogspot.com/2009/11/using-vmatch-to-combine-assemblies.html
N50
The N50 size of a set of entities (e.g., contigs or
scaffolds) represents the largest entity E such that at
least half of the total size of the entities is contained in
entities larger than E.
For example, given a collection of contigs with sizes 7,
4, 3, 2, 2, 1, and 1 kb (total size = 20kbp), the N50 length
is 4 because we can cover 10 kb with contigs bigger
than 4kb. http://www.cbcb.umd.edu/research/castats.shtml)
(
N50 length is the length ‘x’ such that 50% of the
sequence is contained in contigs of length x or greater.
(Waterston http://www.pnas.org/cgi/reprint/100/6/3022.pdf)
Theoretical performance
Assessing performance of a range of read lengths
Repeat-induced gaps
Cahill et al., PLoS ONE (2010)
Supplement: Assembler
flowcharts
Phrap
Bamidele-Abegunde T. 2010 http://library2.usask.ca
CAP3 & PCAP
Bamidele-Abegunde T. 2010 http://library2.usask.ca
MIRA
Bamidele-Abegunde T. 2010 http://library2.usask.ca
Velvet
Bamidele-Abegunde T. 2010 http://library2.usask.ca
Supplement: Genome
Improvement
Typical Microbial project
Sequencing
Draft
assembly
Goals:
FINISHING
Completely restore genome
Produce high quality consensus
Annotation
Public release
Metagenomic assembly and
Finishing
The whole-genome shotgun sequencing approach was used for a number of
microbial community projects, however useful quality control and assembly
of these data require reassessing methods developed to handle relatively
uniform sequences derived from isolate microbes.
•
Typically size of metagenomic sequencing project is very large
•
Different organisms have different coverage. Non-uniform sequence coverage results in significant
under- and over-representation of certain community members
•
Low coverage for the majority of organisms in highly complex communities leads to poor (if any)
assemblies
•
Chimerical contigs produced by co-assembly of sequencing reads originating from different
species.
•
Genome rearrangements and the presence of mobile genetic elements (phages, transposons) in
closely related organisms further complicate assembly.
•
No assemblers developed for metagenomic data sets
QC: Annotation of poor quality
sequence
To avoid this:

make sure you use high quality sequence

choose proper assembler
De Bruijn graph for binary code
0000110010111101
K=4
Follow the blue numbered edges to resolve the graph
Compeau et al., Nature Biotechnology (2011)
Hamiltonian vs Eulerian
cycle
Compeau et al., Nature Biotechnology (2011)
Evaluating Assemblers
• When evaluating assemblers we use
reference microbial datasets and
assess
Query
• Number of scaffolds and contigs
• Performance across a range of size, GC
content and repeat content
• Accuracy of sequence produced
• Gene content captured
Reference
Vocabulary
Node: A point, read or kmer
Edge: a line connecting two nodes
Graph: a network of nodes connected by edges
Node
Edge
Graph
Compeau et al., Nature Biotechnology (2011)
Vocabulary
•Kmer: a substring of defined length. For the purposes of this
talk a substring of the sequence read
•de Bruijn graph: a graph representing overlaps between kmers
Kmer
(k=3)
node
edge
Schatz et al., Genome Res. (2010)
Download