Slides 3: NGS short

CS 6293 Advanced Topics:
Current Bioinformatics
Genome Assembly: a brief
introduction
Slides Adapted from Mihai Pop, Art Delcher, and Steven Salzberg
Homework #2
• #1: questions will be posted online before Monday class
• #2: Form groups of 3
– Each group reads two papers on one topic:
short-read alignment or assembly
– Present the papers and compare them
– ~8-minute presentation
• You can go into some really cool details
• Or give the main idea of the paper
– Other teams (and I) will judge you
– Send me the names of your group members and, optionally, the papers
you want to present
– List of papers:
http://www.oxfordjournals.org/our_journals/bioinformatics/nextgenerationsequencing.html
Genome sequencing
AGTAGCACAGA
CTACGACGAGA
CGATCGTGCGA
GCGACGGCGTA
GTGTGCTGTAC
TGTCGTGTGTG
TGTACTCTCCT
3x10^9 nucleotides
~500 nucleotides
A big puzzle
~60 million pieces
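The figures on this slide imply a depth of coverage, which can be checked with a short calculation. This is a minimal sketch; the function name `coverage` is a hypothetical helper, and the three inputs are the approximate values quoted on the slide.

```python
def coverage(num_reads, read_length, genome_size):
    """Average number of reads covering each base (depth of coverage)."""
    return num_reads * read_length / genome_size

genome_size = 3e9   # nucleotides, from the slide
read_length = 500   # approximate read length, from the slide
num_reads = 60e6    # ~60 million pieces, from the slide

print(f"coverage = {coverage(num_reads, read_length, genome_size):.1f}x")
```

With these inputs, each base is covered by about ten reads on average.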
Computational Fragment Assembly
Introduced ~1980
1995: assemble up to 1,000,000 long DNA pieces
2000: assemble whole human genome
Shotgun DNA Sequencing
(Technology)
DNA target sample
1. SHEAR
2. SIZE SELECT (e.g., 10Kbp ± 8% std. dev.)
3. LIGATE & CLONE (Vector)
4. SEQUENCE (Primer): End Reads (Mates), 550bp
Whole Genome Shotgun
Sequencing
– Collect 10x sequence in a 1-to-1 ratio of two types of read pairs,
short (2Kbp) and long (10Kbp): ~35 million reads for Human.
– Collect another 20X in clone coverage of 50Kbp end sequence pairs:
~1.2 million pairs for Human.
– Early simulations showed that if repeats were considered black
boxes, one could still cover 99.7% of the genome unambiguously.
[Figure: clone with BAC 5' and BAC 3' end reads]
+ single highly automated process
+ only three library constructions
– assembly is much more difficult
Sequencing Factory
Celera’s Sequencing Factory
(circa 2001)
• 300 ABI 3700 DNA Sequencers
• 50 Production Staff
• 20,000 sq. ft. of wet lab
• 20,000 sq. ft. of sequencing space
• 800 tons of A/C (160,000 cfm)
• $1 million / year for electrical service
• $10 million / month for reagents
Human Data (April 2000)
• Collected 27.27 million reads = 5.11X coverage
• 21.04 million are paired (77%) = 10.52 million pairs

Insert size   Pairs     True *   Std. dev.
2Kbp          5.045M    98.6%    <6%
10Kbp         4.401M    98.6%    <8%
50Kbp         1.071M    90.0%    <15%
* validated against finished Chrom. 21 sequence

• The clones cover the genome 38.7X
• Data is from 5 individuals (one at roughly 3X, 4 others at 0.5X)
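Clone coverage counts the whole insert spanned by each pair, not just the sequenced ends. A rough sketch of that arithmetic follows, using the pair counts and insert sizes listed for the three libraries. The genome size is an assumption (the slide does not state the value used), and `clone_coverage` is a hypothetical helper name.

```python
def clone_coverage(libraries, genome_size):
    """Sum of (pairs x insert size) over all libraries, divided by genome size."""
    return sum(pairs * insert for pairs, insert in libraries) / genome_size

# (number of pairs, insert size in bp), as listed for the Human data
libraries = [(5.045e6, 2_000), (4.401e6, 10_000), (1.071e6, 50_000)]

# ASSUMPTION: 2.8 Gbp is an illustrative genome size, not taken from the slide.
print(f"clone coverage = {clone_coverage(libraries, 2.8e9):.1f}x")
```

Under that assumed genome size the result lands in the neighborhood of the 38.7X quoted above; most of it comes from the 50Kbp library even though it has the fewest pairs.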
Pairs Give Order & Orientation
Assembly without pairs results in contigs whose order and
orientation are not known.
Pairs, especially groups of corroborating ones, link the contigs
into scaffolds where the size of gaps is well characterized.
[Figure: reads tiled under a contig consensus (15-30Kbp); a 2-pair link
between two contigs, with known mean & std. dev., forms a scaffold]
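One way to see why corroborating pairs characterize a gap well: each spanning pair gives its own gap estimate, and averaging several of them tightens the uncertainty. The sketch below is illustrative only; the function name, the `(a, b)` encoding of a link, and the assumption that the pairs are independent draws from one library are not from the slides.

```python
import math

def gap_estimate(links, insert_mean, insert_std):
    """Estimate the gap between two contigs from mate pairs that span it.

    links: list of (a, b). a is the part of the clone lying in contig 1
    (outer end of read 1 to the end of contig 1); b is the part lying in
    contig 2 (start of contig 2 to the outer end of read 2).
    """
    estimates = [insert_mean - a - b for a, b in links]
    mean = sum(estimates) / len(estimates)
    # Averaging n independent estimates shrinks the standard error by sqrt(n).
    std_err = insert_std / math.sqrt(len(estimates))
    return mean, std_err

# Two corroborating 10Kbp pairs (8% std. dev. = 800bp), hypothetical placements
mean, err = gap_estimate([(4000, 5500), (3000, 6300)], 10_000, 800)
print(f"gap = {mean:.0f} +/- {err:.0f} bp")
```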
Anatomy of a WGS Assembly
STS
Chromosome
STS-mapped Scaffolds
Contig
Read pair (mates)
Gap (mean & std. dev. known)
Consensus
Reads (of several haplotypes)
SNPs
External “Reads”
Assembly gaps
Physical gaps
Sequencing gaps
sequencing gap - we know the order and orientation of the contigs and have at
least one clone spanning the gap
physical gap - no information known about the adjacent contigs, nor about the DNA
spanning the gap
Assembly paradigms
• Overlap-layout-consensus
– greedy (TIGR Assembler, phrap, CAP3...)
– graph-based (Celera Assembler, Arachne)
• Eulerian path (especially useful for short
read sequencing)
TIGR Assembler/phrap
Greedy
• Build a rough map of fragment
overlaps
• Pick the highest-scoring overlap
• Merge the two fragments
• Repeat until no more merges
can be done
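The greedy loop above can be sketched in a few lines. This is a toy version under simplifying assumptions: overlaps are exact suffix-prefix matches scored by length, reads are all on one strand, and no quality values are used. It is not how TIGR Assembler or phrap are implemented.

```python
def overlap_len(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        # Find the single best overlap among all ordered pairs.
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap_len(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:               # no more merges can be done
            return contigs
        merged = contigs[i] + contigs[j][n:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (i, j)]
        contigs.append(merged)

print(greedy_assemble(["ATGCGT", "GCGTAC", "TACGGA"]))
```

Because each merge is chosen locally, a long overlap created by a repeat can be taken before the correct one, which is the failure mode the greedy approach is known for.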
(A) Overlap between two reads—note that agreement within overlapping region need not be
perfect; (B) Correct assembly of a genome with two repeats (boxes) using four reads A–D; (C)
Assembly produced by the greedy approach.
Pop M Brief Bioinform 2009;10:354-366
© The Author 2009. Published by Oxford University Press. For Permissions, please email:
journals.permissions@oxfordjournals.org
Overlap-layout-consensus
Main entity: read
Relationship between reads: overlap
[Figure: overlap graph on reads 1-9]
ACCTGA
ACCTGA
AGCTGA
ACCAGA
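The consensus step can be illustrated with a per-column majority vote. This sketch reads the sequences above as three aligned reads whose consensus is ACCTGA, which is an interpretation of the slide; real assemblers also weigh base qualities and handle gaps.

```python
from collections import Counter

def consensus(aligned_reads):
    """Majority vote per column over equal-length, already-aligned reads."""
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*aligned_reads)
    )

print(consensus(["ACCTGA", "AGCTGA", "ACCAGA"]))
```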
Paths through graphs and
assembly
• Hamiltonian circuit: visit each node (city) exactly
once, returning to the start
• Hamiltonian path: visit each node (city) exactly
once
[Figure: graph on nodes A-I; a Hamiltonian path through the nodes
corresponds to the genome]
Overlap between two
sequences
overlap (19 bases), overhang (6 bases)
        GGATGCGCGGACACGTAGCCAGGAC
CAGTACTTGGATGCGCTGACACGTAGC
% identity = 18/19 = 94.7%
overlap - region of similarity between sequences
overhang - un-aligned ends of the sequences
The assembler screens merges based on:
• length of overlap
• % identity in overlap region
• maximum overhang size.
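The numbers in this example can be reproduced with a small ungapped comparison. This is a sketch only: the offset of 8 is read off the alignment shown, the thresholds in `accept` are hypothetical, and real overlappers use gapped alignment.

```python
def overlap_stats(top, bottom, offset):
    """Ungapped overlap when `top` starts `offset` bases into `bottom`
    and runs past the end of `bottom`."""
    length = min(len(top), len(bottom) - offset)
    matches = sum(top[i] == bottom[offset + i] for i in range(length))
    return {
        "overlap": length,
        "identity": matches / length,
        # un-aligned start of `bottom`, un-aligned end of `top`
        "overhangs": (offset, len(top) - length),
    }

def accept(stats, min_overlap=15, min_identity=0.9, max_overhang=10):
    """Screen a merge on the three criteria; thresholds are hypothetical."""
    return (stats["overlap"] >= min_overlap
            and stats["identity"] >= min_identity
            and max(stats["overhangs"]) <= max_overhang)

stats = overlap_stats("GGATGCGCGGACACGTAGCCAGGAC",
                      "CAGTACTTGGATGCGCTGACACGTAGC", offset=8)
print(stats, accept(stats))
```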
All pairs alignment
• Needed by the assembler
• Try all pairs: must consider ~n^2 pairs
• Smarter solution: only n × coverage (e.g. 8) pairs are possible
– Build a table of k-mers contained in sequences
(single pass through the genome)
– Generate the pairs from k-mer table (single pass
through k-mer table)
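The two passes can be sketched as follows. The function name and the choice of a dictionary of sets are illustrative assumptions; production overlappers also cap very frequent k-mers and handle reverse complements, which this sketch omits.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(reads, k):
    """Pairs of read indices that share at least one k-mer."""
    table = defaultdict(set)
    for idx, read in enumerate(reads):        # single pass over the sequences
        for i in range(len(read) - k + 1):
            table[read[i:i + k]].add(idx)
    pairs = set()
    for ids in table.values():                # single pass over the k-mer table
        pairs.update(combinations(sorted(ids), 2))
    return pairs

print(candidate_pairs(["ATGCGT", "GCGTAC", "TTTTTT"], k=4))
```

Only the pairs returned here need a full alignment, so reads with nothing in common are never compared.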
[Figure: reads A-I grouped by a shared k-mer]
BWT-based overlap detection
• Efficient construction of an assembly string graph using
the FM-index, Jared T. Simpson and Richard Durbin,
Bioinformatics, 26 (12): i367-i373 (2010)
• Read it yourself for more details
[Figure: BWT for multiple sequences; reads sharing the string ACT, with $
marking the end of a read]
OVERLAP GRAPH
Edge types:
• Regular dovetail
• Prefix dovetail
• Suffix dovetail
[Figure: relative placement of reads A and B for each edge type]
E.g.: edges are annotated with deltas of overlaps
The Unitig Reduction
1. Remove "Transitively Inferable" Overlaps:
[Figure: reads A, B, C; the A-C overlap is implied by the A-B and B-C overlaps]
2. Collapse "Unique Connector" Overlaps:
[Figure: reads A and B merged; edge labels 412, 352, 45]
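Both reduction steps can be sketched on a plain directed graph. This is a simplified illustration, not the Celera Assembler's code: edges carry no overlap deltas, the graph is assumed acyclic, and "unique connector" is taken to mean an edge that is the only way out of its source and the only way into its target.

```python
def transitive_reduction(edges):
    """Drop edge (a, c) whenever some b gives edges (a, b) and (b, c)."""
    succ = {}
    for a, b in edges:
        succ.setdefault(a, set()).add(b)
    return {
        (a, c) for a, c in edges
        if not any(c in succ.get(b, ()) for b in succ[a] if b != c)
    }

def collapse_chains(edges):
    """Chains of reads joined by unique-connector edges."""
    out_deg, in_deg = {}, {}
    for a, b in edges:
        out_deg[a] = out_deg.get(a, 0) + 1
        in_deg[b] = in_deg.get(b, 0) + 1
    nxt = {a: b for a, b in edges if out_deg[a] == 1 and in_deg[b] == 1}
    nodes = {n for edge in edges for n in edge}
    heads = nodes - set(nxt.values())   # nodes that start a chain
    chains = []
    for head in sorted(heads):
        chain = [head]
        while chain[-1] in nxt:
            chain.append(nxt[chain[-1]])
        chains.append(chain)
    return chains

edges = {("A", "B"), ("B", "C"), ("A", "C"), ("C", "D"), ("C", "E")}
reduced = transitive_reduction(edges)
print(sorted(reduced))
print(collapse_chains(reduced))
```

In the example, A-C is removed as inferable, A-B-C collapses into one chain, and the fork at C stops the chain from growing further.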
Celera Assembly Pipeline
Stages: Trim & Screen → Overlapper → Unitiger → Scaffolder → Repeat Rez I, II
• Overlapper: find all overlaps ≥ 40bp allowing 6% mismatch. An overlap
between A and B implies either a TRUE overlap or a REPEAT-INDUCED one.
• Unitiger: compute all overlap-consistent sub-assemblies:
unitigs (uniquely assembled contigs).
• Scaffolder: scaffold U-unitigs with confirmed pairs (mated reads).
• Repeat Rez I, II: fill repeat gaps with doubly anchored positive
unitigs (Unitig>0).
Handling repeats
1. Repeat detection
– pre-assembly: find fragments that belong to repeats
• statistically (most existing assemblers)
• repeat database (RepeatMasker)
– during assembly: detect "tangles" indicative of
repeats (Pevzner, Tang, Waterman 2001)
– post-assembly: find repetitive regions and
potential mis-assemblies.
• Reputer, RepeatMasker
• "unhappy" mate-pairs (too close, too far, mis-oriented)
2. Repeat resolution
– find DNA fragments belonging to the repeat
– determine correct tiling across the repeat
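The "unhappy" mate-pair check mentioned above can be sketched as a simple classifier. The function name, the 3-standard-deviation cutoff, and the assumption that a correctly placed pair has its reads facing each other are illustrative choices, not taken from the slides.

```python
def classify_mate_pair(left_strand, right_strand, distance, mean, std, n_std=3):
    """Classify a pair placed in an assembly.

    left/right refer to the reads' order in the assembly; distance is
    measured between the outer ends of the two reads.
    """
    if (left_strand, right_strand) != ("+", "-"):
        return "mis-oriented"
    if distance < mean - n_std * std:
        return "too close"
    if distance > mean + n_std * std:
        return "too far"
    return "happy"

# 10Kbp library with 800bp std. dev.
for placement in [("+", "-", 10_000), ("+", "-", 5_000),
                  ("+", "-", 14_000), ("+", "+", 10_000)]:
    print(placement, classify_mate_pair(*placement, mean=10_000, std=800))
```

A cluster of pairs that are all too close or all too far around the same region is stronger evidence of a mis-assembly than any single unhappy pair.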
Statistical repeat detection
Significant deviations from average coverage are flagged as
repeats.
- frequent k-mers are ignored
- "arrival" rate of reads in contigs compared with theoretical
value
Problem 1: the assumption of a uniform distribution of fragments leads to
false positives
- non-random libraries
- poor clonability regions
Problem 2: repeats with low copy number are missed, which leads
to false negatives
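A simple way to picture the arrival-rate test is a Poisson-style z-score on the number of reads in a contig. This is a stand-in for illustration, not the statistic any particular assembler uses; the cutoff of 3 and the function names are assumptions.

```python
import math

def arrival_z_score(reads_in_contig, contig_length, total_reads, genome_size):
    """How far the observed read count is from the count expected if reads
    start uniformly at random along the genome."""
    expected = total_reads * contig_length / genome_size
    return (reads_in_contig - expected) / math.sqrt(expected)

def looks_repetitive(reads_in_contig, contig_length, total_reads, genome_size,
                     z_cutoff=3.0):
    z = arrival_z_score(reads_in_contig, contig_length, total_reads, genome_size)
    return z > z_cutoff

# 27.27 million reads over an assumed 3 Gbp genome; 10Kbp contig
print(looks_repetitive(95, 10_000, 27.27e6, 3e9))
print(looks_repetitive(180, 10_000, 27.27e6, 3e9))
```

Both problems listed above show up here: a region that clones unusually well inflates the count without being a repeat, and a two-copy repeat in a short contig may not clear the cutoff.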
Mis-assembled repeats
[Figure: three kinds of mis-assembled repeats (collapsed tandem, excision,
rearrangement), drawn with regions labeled I-IV and a-f]
Eulerian path-based assembly
• Break each read into k-mers (typically k >= 19)
• Construct a de Bruijn graph using the k-mers
from all reads
– Each k-mer is a node
– v1 has a directed edge to v2 if v1 can be expressed
by removing the last char from v2 and adding a new
char at the beginning of v2, E.g.
v1 = acgtctgact
v2 = cgtctgactg
• Find an Eulerian path in the graph
– visits each edge exactly once
[Figure: 1. Sequencing; 2. Constructing a de Bruijn graph;
3. Simplification; 4. Error removal]
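A toy version of the construction described above: k-mers are nodes, an edge joins two k-mers that appear consecutively in a read, and an Eulerian path is found with Hierholzer's algorithm. This sketch assumes error-free, single-strand reads and that an Eulerian path exists; it skips the simplification and error-removal steps, and the function names are hypothetical.

```python
from collections import defaultdict

def de_bruijn_edges(reads, k):
    """Edges between consecutive k-mers; each distinct edge is kept once."""
    edges = set()
    for read in reads:
        for i in range(len(read) - k):
            edges.add((read[i:i + k], read[i + 1:i + k + 1]))
    return edges

def eulerian_path(edges):
    """Hierholzer's algorithm: a path that uses every edge exactly once."""
    out_edges = defaultdict(list)
    balance = defaultdict(int)          # out-degree minus in-degree
    for u, v in sorted(edges):
        out_edges[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    starts = [node for node, b in balance.items() if b == 1]
    start = starts[0] if starts else next(iter(out_edges))
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if out_edges[node]:
            stack.append(out_edges[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]

def spell(path):
    """Sequence spelled by a path of overlapping k-mers."""
    return path[0] + "".join(node[-1] for node in path[1:])

reads = ["ATGCGT", "GCGTGC", "GTGCA"]
print(spell(eulerian_path(de_bruijn_edges(reads, k=3))))
```

In this example the k-mer TGC occurs twice in the underlying sequence, so the path passes through that node twice, once for each pair of edges.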
Eulerian path-based assembly
• No need to compute pairwise overlaps – important for
NGS data
• Eulerian paths are much easier to find than Hamiltonian
paths
– Catch: multiple Eulerian paths may exist
– Loss of information
• Repeats appear as cycles in the graph
– Less likely to cause mis-assembly
• More suitable for short-read assembly
– Newbler
– VELVET
– EDENA
– ABySS
– See Flicek & Birney, Nat Methods, 2009
References
• Sense from sequence reads: methods for alignment and
assembly, Paul Flicek & Ewan Birney, Nature Methods 6,
S6 - S12 (2009)
• Genome assembly reborn: recent computational
challenges, Mihai Pop, Briefings in Bioinformatics, 10(4):
354-366 (2009)