Slides 2: NGS short

CS 6293 Advanced Topics:
Current Bioinformatics
Next-generation sequencing Mapping short reads
Short read mapping
• Input:
– A reference genome
– A collection of many 25-100bp tags (reads)
– User-specified parameters
• Output:
– One or more genomic coordinates for each tag
• In practice, only 70-75% of tags successfully
map to the reference genome. Why?
Multiple mapping
• A single tag may occur more than once in
the reference genome.
• The user may choose to ignore tags that
appear more than n times.
• As n gets large, you get more data, but
also more noise in the data.
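The n cutoff can be sketched in a few lines of Python. This is only an illustration with exact matching on a toy reference; the names occurrences and keep_tag are made up for the example, and treating an unmapped tag as "not kept" is an assumption.

```python
def occurrences(reference, tag):
    """All 0-based start positions of tag in reference, overlaps included."""
    hits = []
    start = reference.find(tag)
    while start != -1:
        hits.append(start)
        start = reference.find(tag, start + 1)
    return hits


def keep_tag(reference, tag, n):
    """Keep a tag only if it maps at least once and at most n times."""
    return 1 <= len(occurrences(reference, tag)) <= n


reference = "agcagcagact"
print(occurrences(reference, "ag"))   # a repeated tag
print(keep_tag(reference, "ag", 1))   # dropped when only unique tags are allowed
print(keep_tag(reference, "ag", 3))   # kept with a looser cutoff
```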
Inexact matching
• An observed tag may not exactly match any position in the reference
genome.
• Sometimes, the tag almost matches one or more positions.
• Such mismatches may represent a SNP (single-nucleotide
polymorphism, see wikipedia) or a bad read-out.
• The user can specify the maximum number of mismatches, or a
phred-style quality score threshold.
• As the number of allowed mismatches goes up, the number of
mapped tags increases, but so does the number of incorrectly
mapped tags.
Read Length is Not As Important For Resequencing
[Figure: % of paired k-mers with uniquely assignable location (y-axis, 0% to
100%) versus length of k-mer reads in bp (x-axis, 8 to 20), for E. coli and
human. Credit: Jay Shendure.]
Mapping Reads Back
• Hash Table (Lookup table)
– FAST, but requires perfect matches. [O(m n + N)]
• Array Scanning
– Can handle mismatches, but not gaps. [O(m N)]
• Dynamic Programming (Smith Waterman)
– Indels
– Mathematically optimal solution
– Slow (most programs use Hash Mapping as a prefilter) [O(mnN)]
• Burrows-Wheeler Transform (BW Transform)
– FAST. [O(m + N)] (without mismatch/gap)
– Memory efficient.
– But for gaps/mismatches, it lacks sensitivity
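The hash-table option can be sketched as follows, assuming the reference (not the reads) is hashed and that k equals the tag length, so only perfect matches are found. build_kmer_index and lookup are illustrative names, not functions from any of the tools discussed here.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer of the reference to the list of its 0-based start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def lookup(index, tag):
    """Positions where the tag occurs exactly; empty list if it never occurs."""
    return index.get(tag, [])


index = build_kmer_index("agcagcagact", 3)
print(lookup(index, "gca"))   # every exact occurrence
print(lookup(index, "gct"))   # one mismatch away from the reference: no hit
```

Building the index is a single pass over the reference; each lookup afterwards is a single dictionary access, which is why this approach is fast but blind to mismatches.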
Spaced seed alignment
• Tags and tag-sized
pieces of reference are
cut into small “seeds.”
• Pairs of spaced seeds
are stored in an index.
• Look up spaced seeds for
each tag.
• For each “hit,” confirm the
remaining positions.
• Report results to the user.
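The steps above can be sketched in Python under some simplifying assumptions: each tag is cut into 4 equal seeds, every pair of seed slots is indexed, and up to 2 mismatches are allowed (with at most 2 mismatches, at least 2 of the 4 seeds are untouched, so some pair still hits). The function names and the toy reference are made up for the example; real tools differ in seed layout and storage.

```python
from collections import defaultdict
from itertools import combinations


def seed_pairs(window, n_seeds=4):
    """Yield ((i, j), key) for every pair of seed slots in a window."""
    size = len(window) // n_seeds
    seeds = [window[s * size:(s + 1) * size] for s in range(n_seeds)]
    for i, j in combinations(range(n_seeds), 2):
        yield (i, j), seeds[i] + seeds[j]


def build_index(reference, tag_len):
    """Index every tag-sized piece of the reference by each of its seed pairs."""
    index = defaultdict(set)
    for pos in range(len(reference) - tag_len + 1):
        for slots, key in seed_pairs(reference[pos:pos + tag_len]):
            index[(slots, key)].add(pos)
    return index


def map_tag(reference, index, tag, max_mismatches=2):
    """Look up the tag's seed pairs, then confirm the remaining positions."""
    candidates = set()
    for slots, key in seed_pairs(tag):
        candidates |= index.get((slots, key), set())
    hits = []
    for pos in sorted(candidates):
        window = reference[pos:pos + len(tag)]
        d = sum(a != b for a, b in zip(window, tag))
        if d <= max_mismatches:
            hits.append((pos, d))
    return hits


reference = "acgtacgtggccttaaggcc"
index = build_index(reference, 8)
print(map_tag(reference, index, "acgtggcc"))   # exact copy of reference[4:12]
print(map_tag(reference, index, "tcgtgacc"))   # same piece with two substitutions
```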
Burrows-Wheeler
• Store entire reference
genome.
• Align tag base by base
from the end.
• When tag is traversed, all
active locations are
reported.
• If no match is found, then
back up and try a
substitution.
Why Burrows-Wheeler?
• BWT very compact:
– Approximately ½ byte per base
– As large as the original text, plus a few
“extras”
– Can fit onto a standard computer with 2GB of
memory
• Linear-time search algorithm
– proportional to length of query for exact
matches
Burrows-Wheeler Transform (BWT)
Text: acaacg$
Sorted rotations:
$acaacg
aacg$ac
acaacg$
acg$aca
caacg$a
cg$acaa
g$acaac
BWT (last column): gc$aaac
Burrows-Wheeler Matrix (BWM)
Suffix start  BWM row
              $acaacg
3             aacg$ac
1             acaacg$
4             acg$aca
2             caacg$a
5             cg$acaa
6             g$acaac
See the suffix array?
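A minimal Python sketch of the construction, assuming a plain sort of all rotations (fine for slide-sized strings, far too slow and memory-hungry for a genome). The function names are illustrative. Positions are 0-based in the code; adding 1 gives the numbering used on the slide, where the $ row would be position 7.

```python
def bwm(text):
    """Sorted rotations of text; text must end with the sentinel '$'."""
    return sorted(text[i:] + text[:i] for i in range(len(text)))


def bwt(text):
    """Last column of the Burrows-Wheeler matrix."""
    return "".join(row[-1] for row in bwm(text))


def suffix_array(text):
    """0-based start positions of the suffixes in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


text = "acaacg$"
for start, row in zip(suffix_array(text), bwm(text)):
    print(start + 1, row)          # 1-based suffix start next to its BWM row
print(bwt(text))
```

Because '$' sorts before the letters and occurs once, sorting rotations and sorting suffixes give the same row order, which is why the suffix array can be read off the matrix.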
Key observation
a1 c1 a2 a3 c2 g1 $1
“last first (LF) mapping”
The i-th occurrence of character X in
the last column corresponds to
the same text character as the i-th
occurrence of X in the first column.
First  BWM row  Last
$1     $acaacg  g1
a2     aacg$ac  c1
a1     acaacg$  $1
a3     acg$aca  a2
c1     caacg$a  a1
c2     cg$acaa  a3
g1     g$acaac  c2
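The observation can be checked mechanically. The sketch below labels every character with its occurrence number in the text, then lists the labelled first and last columns of the matrix; labelled_columns and same_order are hypothetical helper names.

```python
from collections import defaultdict


def labelled_columns(text):
    """First and last BWM columns as (char, occurrence number in the text) pairs."""
    seen = defaultdict(int)
    labels = []
    for ch in text:
        seen[ch] += 1
        labels.append((ch, seen[ch]))
    order = sorted(range(len(text)), key=lambda i: text[i:])   # suffix array
    first = [labels[i] for i in order]
    # The last character of a rotation is the one preceding its suffix;
    # index -1 wraps around to the '$' for the rotation starting at 0.
    last = [labels[i - 1] for i in order]
    return first, last


def same_order(first, last, ch):
    """True if the occurrences of ch appear in the same order in both columns."""
    return [x for x in first if x[0] == ch] == [x for x in last if x[0] == ch]


first, last = labelled_columns("acaacg$")
print(first)
print(last)
print(all(same_order(first, last, ch) for ch in "acg$"))
```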
Recover text
[Figure: recovering the text from the BWT of acaacg$ using LF mapping.]
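Recovering the text from the last column alone can be sketched as below. This is a small illustration, assuming the text ends with a unique '$' that sorts before every other character; inverse_bwt is an illustrative name.

```python
def inverse_bwt(last):
    """Rebuild the original text (including the trailing '$') from its BWT."""
    # first_row[c]: index of the first matrix row that starts with c
    counts = {}
    for ch in last:
        counts[ch] = counts.get(ch, 0) + 1
    first_row, total = {}, 0
    for ch in sorted(counts):
        first_row[ch] = total
        total += counts[ch]

    # lf[i]: row whose first character is the same text character as last[i]
    seen, lf = {}, []
    for ch in last:
        lf.append(first_row[ch] + seen.get(ch, 0))
        seen[ch] = seen.get(ch, 0) + 1

    # Row 0 starts with '$', so its last character is the final text character.
    # Following LF from there yields the text from right to left.
    row, out = 0, []
    for _ in range(len(last) - 1):
        out.append(last[row])
        row = lf[row]
    return "".join(reversed(out)) + "$"


print(inverse_bwt("gc$aaac"))
print(inverse_bwt("tgcc$ggaaaac"))
```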
Exact match
[Figure: exact-match search on the Burrows-Wheeler matrix of acaacg$.]
Exact match (another example)
BWT(agcagcagact) = tgcc$ggaaaac
Search for pattern: gca
Burrows-Wheeler matrix of agcagcagact$:
$agcagcagact
act$agcagcag
agact$agcagc
agcagact$agc
agcagcagact$
cagact$agcag
cagcagact$ag
ct$agcagcaga
gact$agcagca
gcagact$agca
gcagcagact$a
t$agcagcagac
Test with your own seq and pattern at: http://www.allisons.org/ll/AlgDS/Strings/BWT/
Auxiliary data structures
Key for efficient pattern matching: how to find the corresponding chars in
the first column efficiently, in terms of both time and space.
Row  SA  BWM row       BWT  rank: a  c  g  t
 0       $agcagcagact  t
 1    9  act$agcagcag  g          0  0  1  1
 2    7  agact$agcagc  c          0  1  1  1
 3    4  agcagact$agc  c          0  2  1  1
 4    1  agcagcagact$  $          0  2  1  1
 5    6  cagact$agcag  g          0  2  2  1
 6    3  cagcagact$ag  g          0  2  3  1
 7   10  ct$agcagcaga  a          1  2  3  1
 8    8  gact$agcagca  a          2  2  3  1
 9    5  gcagact$agca  a          3  2  3  1
10    2  gcagcagact$a  a          4  2  3  1
11   11  t$agcagcagac  c          4  3  3  1
FM indices: a = 1, c = 5, g = 8, t = 11
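The rank, SA, and FM index columns can be rebuilt with a short Python sketch. It assumes everything is stored in full (real tools keep only samples) and uses 0-based text positions, so SA values are 1 lower than in the table. Here rank[row][c] counts the occurrences of c in the BWT from row 0 through that row, and first_row[c] is the FM index of c, the first row that begins with c. All names are illustrative.

```python
def fm_tables(text):
    """Suffix array, BWT, cumulative rank table, and first-row table for text."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)     # character preceding each suffix
    alphabet = sorted(set(text) - {"$"})

    rank, running = [], dict.fromkeys(alphabet, 0)
    for ch in bwt:
        if ch in running:
            running[ch] += 1
        rank.append(dict(running))             # snapshot after this row

    first_row = {}
    for row, i in enumerate(sa):
        if text[i] != "$":
            first_row.setdefault(text[i], row)
    return sa, bwt, rank, first_row


sa, bwt, rank, first_row = fm_tables("agcagcagact$")
print(bwt)
print(first_row)
for row in range(len(bwt)):
    print(row, sa[row] + 1, bwt[row], [rank[row][c] for c in "acgt"])
```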
Searching for gca, step 1 (last character, a):
The block starts as all rows. FM index of a is 1; the rank of a is 0 above
the block and 4 at its bottom row.
Next block:
From 1 + 0 = 1
to 1 + (4-1) = 4
Searching for gca, step 2 (middle character, c):
The current block is rows 1 to 4. FM index of c is 5; the rank of c is 0
just above the block (row 0) and 2 at its bottom row (row 4).
Next block:
From 5 + 0 = 5
to 5 + (2-1) = 6
Searching for gca, step 3 (first character, g):
The current block is rows 5 to 6. FM index of g is 8; the rank of g is 1
just above the block (row 4) and 3 at its bottom row (row 6).
Next block:
From 8 + 1 = 9
to 8 + (3-1) = 10
Rows 9 and 10 have SA values 5 and 2, so gca occurs at positions 2 and 5.
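The three "Next block" computations follow one rule, sketched below in Python. This is an illustration under simple assumptions: rank is recomputed by counting (a real FM index looks it up), blocks are inclusive row ranges as on the slides, and reported positions are 1-based to match the SA column. backward_search is a made-up name.

```python
def backward_search(text, pattern):
    """Return (sorted 1-based match positions, list of row blocks visited)."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)
    first_row = {}                       # FM index: first row starting with each char
    for row, i in enumerate(sa):
        first_row.setdefault(text[i], row)

    def rank(c, row):
        """Occurrences of c in the BWT from row 0 through row (0 if row is -1)."""
        return bwt[:row + 1].count(c)

    lo, hi = 0, len(bwt) - 1             # start with every row
    blocks = []
    for c in reversed(pattern):          # consume the pattern right to left
        if c not in first_row:
            return [], blocks
        lo = first_row[c] + rank(c, lo - 1)
        hi = first_row[c] + rank(c, hi) - 1
        if lo > hi:                      # empty block: no occurrence
            return [], blocks
        blocks.append((lo, hi))
    return sorted(sa[r] + 1 for r in range(lo, hi + 1)), blocks


positions, blocks = backward_search("agcagcagact$", "gca")
print(blocks)
print(positions)
```

The work per pattern character does not depend on how many rows are in the block, which is what makes the search time proportional to the query length once rank lookups are constant time.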
Inexact match
Main advantage of BWT over
suffix array
• BWT needs less memory than suffix array
• For human genome m = 3 × 10^9:
– Suffix array: m log2(m) bits = 4m bytes = 12 GB
– BWT: m/4 bytes plus extras = 1-2 GB
• m/4 bytes to store BWT (2 bits per char)
• Suffix array and occurrence counts array take 5 m
log2(m) bits = 20m bytes
• In practice, SA and OCC only partially stored, most
elements are computed on demand (takes time!)
• Tradeoff between time and space
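The arithmetic behind these figures can be reproduced directly. The sketch assumes a 4-letter alphabet, entries rounded up to whole bytes (32 bits for the human genome), and 1 GB = 10^9 bytes; index_sizes is an illustrative name.

```python
import math


def index_sizes(m):
    """Rough index sizes in bytes for a genome of m bases."""
    bits_per_entry = math.ceil(math.log2(m))        # bits needed to address a position
    bytes_per_entry = math.ceil(bits_per_entry / 8)
    return {
        "suffix_array": m * bytes_per_entry,        # one position per suffix
        "bwt": m * 2 // 8,                          # 2 bits per character
        # suffix array plus one full occurrence-count column per character
        "suffix_array_plus_occ": 5 * m * bytes_per_entry,
    }


sizes = index_sizes(3 * 10**9)
for name, size in sizes.items():
    print(f"{name}: {size / 1e9:.2f} GB")
```

Storing the suffix array and occurrence counts in full would cost far more than the BWT itself, which is why only samples are kept and the rest is computed on demand.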
Comparison
Spaced seeds (e.g., MAQ)
• Requires ~50 GB of memory.
• Runs 30-fold slower.
• Is much simpler to program.
Burrows-Wheeler (e.g., Bowtie)
• Requires <2 GB of memory.
• Runs 30-fold faster.
• Is much more complicated to program.
Short-read mapping software
Software  Technique      Developer          License
Eland     Hashing reads  Illumina           ?
SOAP      Hashing refs   BGI                Academic
Maq       Hashing reads  Sanger (Li, Heng)  GNUPL
Bowtie    BWT            Salzberg/UMD       GNUPL
BWA       BWT            Sanger (Li, Heng)  GNUPL
SOAP2     BWT & hashing  BGI                Academic
http://www.oxfordjournals.org/our_journals/bioinformatics/nextgenerationsequencing.html
References
• (Bowtie) Ultrafast and memory-efficient alignment of short DNA sequences
to the human genome, Langmead et al., Genome Biology (2009) 10:R25
• SOAP: short oligonucleotide alignment, Ruiqiang Li et al., Bioinformatics
(2008) 24:713-4
• (BWA) Fast and Accurate Short Read Alignment with Burrows-Wheeler
Transform, Li Heng and Richard Durbin, Bioinformatics (2009) 25:1754–1760
• SOAP2: an improved ultrafast tool for short read alignment, Ruiqiang Li
et al., Bioinformatics (2009) 25:1966–1967
• (MAQ) Mapping short DNA sequencing reads and calling variants using
mapping quality scores, Li H, Ruan J, Durbin R, Genome Res. (2008)
18:1851-8
• Sense from sequence reads: methods for alignment and assembly, Paul
Flicek & Ewan Birney, Nature Methods (2009) 6:S6-S12
• http://www.allisons.org/ll/AlgDS/Strings/BWT/