Lecture 5 - Tresch Group

Achim Tresch
UoC / MPIPZ
Cologne
Omics
treschgroup.de/OmicsModule1415.html
tresch@mpipz.mpg.de
Mapping of Sequence Reads
Today's topics:
Hash tables
Suffix arrays
Burrows-Wheeler transform
Mapping of Sequence Reads
Short Read Applications
• Genotyping
Goal: identify variations
[Figure: short reads (e.g. GCCCTATCG, CTATCGGAAA, AAATTTGC) stacked over their matching positions in the reference genome …CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGTATAC…; the reads carry an A where the reference has a C in …CGGCAATTT…]
• RNA-seq, ChIP-seq, Methyl-seq, ClIP-seq
Goal: classify, measure significant peaks
[Figure: short reads piled up over one region of the reference genome …CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGTATAC…, forming a peak]
Indexing of the reference genome
• Genomes and reads are too large for direct approaches like dynamic programming
• Indexing is required
– Suffix tree
– Suffix array
– Seed hash tables (many variants, incl. spaced seeds)
• Choice of index is key to performance
Indexing of the reference genome
• Genome indices can be big. For human:
– Suffix tree: > 35 GB
– Suffix array: > 12 GB
– Seed hash tables: > 12 GB
• Large indices necessitate painful compromises
1. Require big-memory machine
2. Use secondary storage
3. Build new index each run
4. Subindex and do multiple passes
Hash tables
Slides taken from Michael Main
University of Colorado
Hash tables
• The simplest kind of hash table is an array of records.
• This example has 701 records, at indices [0] to [700].
• Each record has a special field, called its key.
• In this example, the key is a long integer number, e.g. the record at [4] has key 506643548.
• The number might be a person's identification number, and the rest of the record has information about the person.
• When a hash table is in use, some spots contain valid records, and other spots are "empty".
[Figure: the array with records 281942902 at [1], 233667136 at [2], 506643548 at [4] and 155778322 at [700]; all other spots are empty]
Inserting a new record
• In order to insert a new record, the key must somehow be converted to an index.
• The function which does this is the hash function.
• The index is also called the hash value of the key.
• Example: a new record with key 580625685.
In our case: The keys are short sequences, and the records contain their location in the genome
Typical hash function: take the remainder of the integer division of the key by the array size:
hash value = key mod 701
For the new key: 580625685 mod 701 = 3
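The modulo rule can be checked directly. The following is a minimal sketch; the names TABLE_SIZE and hash_value are illustrative and do not come from the slides.

```python
TABLE_SIZE = 701  # number of slots in the example array


def hash_value(key: int, table_size: int = TABLE_SIZE) -> int:
    """Return the array index for a key: the remainder of key / table_size."""
    return key % table_size


# The new record from the slides lands at index 3.
print(hash_value(580625685))
```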
• The hash value is used for the location of the new record: key 580625685 is stored at [3].
Collisions
• Here is another new record to insert, key 701466868, with a hash value of 2.
• This is called a collision, because there is already another valid record at [2].
• When a collision occurs, move forward until you find an empty spot. Here [3] and [4] are occupied as well, so the first empty spot is [5].
• The new record goes in the empty spot.
A Quiz
If the keys were short sequences, can you think of a hash function for generating index values?
Example key: ATACCG
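One possible answer to the quiz, given here only as a sketch: read the sequence as a number in base 4 and then apply the same modulo rule as before. The encoding A=0, C=1, G=2, T=3 and the function names are assumptions made for this illustration, not part of the slides.

```python
# Assumed 2-bit encoding of the four bases (illustrative choice).
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def sequence_to_int(seq: str) -> int:
    """Interpret a DNA sequence as a base-4 number."""
    value = 0
    for base in seq:
        value = value * 4 + BASE_CODE[base]
    return value


def sequence_hash(seq: str, table_size: int = 701) -> int:
    """Hash value of a sequence: its base-4 number mod the table size."""
    return sequence_to_int(seq) % table_size


print(sequence_to_int("ATACCG"), sequence_hash("ATACCG"))
```

For sequences of fixed length L, the base-4 number is unique per sequence, so collisions only arise from the modulo step.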
Searching for a Key
• The data that's attached to a key can be found quickly. Example: search for key 701466868.
• Calculate the hash value of the key (here: [2]).
• Check that location of the array for the key.
• Keep moving forward until you find the key, or you reach an empty spot. Here the records at [2], [3] and [4] do not match; the key is found at [5].
• When the item is found, the information can be copied to the necessary location.
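The insert and search steps from the slides can be put together in a small sketch. The class name LinearProbingTable and the wrap-around from [700] to [0] are assumptions of this example; the slides only say "move forward".

```python
class LinearProbingTable:
    """Array of (key, data) records; collisions go to the next free slot."""

    def __init__(self, size=701):
        self.size = size
        self.slots = [None] * size  # None marks an empty spot

    def _hash(self, key):
        return key % self.size

    def insert(self, key, data):
        """Store the record and return the index where it ended up."""
        i = self._hash(key)
        for _ in range(self.size):
            if self.slots[i] is None or self.slots[i][0] == key:
                self.slots[i] = (key, data)
                return i
            i = (i + 1) % self.size  # assumed wrap-around at the array end
        raise RuntimeError("hash table is full")

    def search(self, key):
        """Return the data for key, or None if an empty spot is reached."""
        i = self._hash(key)
        for _ in range(self.size):
            if self.slots[i] is None:
                return None
            if self.slots[i][0] == key:
                return self.slots[i][1]
            i = (i + 1) % self.size
        return None


table = LinearProbingTable()
for key in (281942902, 233667136, 506643548, 155778322, 580625685):
    table.insert(key, f"record {key}")

# 701466868 hashes to [2]; [2], [3] and [4] are taken, so it goes to [5].
slot = table.insert(701466868, "record 701466868")
print(slot, table.search(701466868))
```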
Summary
Hash tables store a collection of records with
keys.
The location of a record depends on the hash
value of the record's key.
When a collision occurs, the next available
location is used.
Searching for a particular key is generally quick.
How can hash tables be used for mapping?
Mapping by hash tables
Preprocessing of the target genome:
• Cut the genome into short sequences of fixed length L
• Use these sequences as keys to create a hash table (this takes time, but only once!)
• For all such sequences, store all matching positions in the genome as data in the hash table
[Figure: hash table entries, each with a key and its starting positions, e.g. Key: ACTAGGTCTT GAGAATCTTA; Matches: Chr II, 304938 and Chr V, 2053723]
Mapping by hash tables
Mapping of short reads from a sequencing experiment:
• For every read, use a substring of length L and check its occurrence in the hash table
• Given the few (possibly none) matching positions, try to extend the alignment to the whole short read.
[Figure: a read substring is looked up as a key; its entry lists the matches Chr II, 304938 and Chr V, 2053723]
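Both phases can be sketched in a few lines. This example rests on several assumptions: the genome is cut at every position (overlapping substrings), the first L bases of the read serve as the seed, extension is a plain mismatch count without gaps, and the names build_index and map_read are invented for illustration. Python's dict is itself a hash table, so it stands in for the array shown earlier. The reference and read strings are taken from the genotyping figure.

```python
from collections import defaultdict


def build_index(genome, L):
    """Preprocessing: map every length-L substring to its start positions."""
    index = defaultdict(list)
    for pos in range(len(genome) - L + 1):
        index[genome[pos:pos + L]].append(pos)
    return index


def map_read(read, genome, index, L, max_mismatches=1):
    """Look up the read's first L bases, then extend each hit to the full read."""
    hits = []
    for pos in index.get(read[:L], []):
        candidate = genome[pos:pos + len(read)]
        if len(candidate) < len(read):
            continue  # read would run past the end of the genome
        mismatches = sum(a != b for a, b in zip(read, candidate))
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits


genome = "CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGTATAC"
index = build_index(genome, 6)            # built once
hits = map_read("CTATCGGAAA", genome, index, 6)
print(hits)  # (0-based start position, number of mismatches)
```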
Suffix Arrays
• Suffix arrays were introduced by Manber and Myers in 1993
• More space efficient than suffix trees
• A suffix array for a string x of length m is an array of size m that specifies the lexicographic ordering of the suffixes of x.
Idea: Every substring is a prefix of a suffix
Suffix Arrays
Example of a suffix array for acaaacatat$
Starting position of each suffix in the search string, in lexicographic order of the suffixes:
3 4 1 5 7 9 2 6 8 10 11
Suffix Array Construction
• Naive construction
– Similar to insertion sort
– Insert all the suffixes into the array one by one, making sure that the newly inserted suffix is in its correct place
– Running time complexity: O(m^2) suffix comparisons, where m is the length of the string
• Manber and Myers give an O(m log m) construction in their 1993 paper.
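For small strings, a suffix array can be obtained by simply sorting the suffix start positions. The sketch below is neither the insertion-based naive method nor the Manber and Myers algorithm; it uses Python's built-in sort and copies every suffix, so it only suits short examples. To reproduce the acaaacatat$ example it assumes 1-based positions and an ordering in which $ sorts after all letters; the function name and the "~" substitution trick are illustrative.

```python
def suffix_array(text, sentinel="$"):
    """Return 1-based start positions of the suffixes in sorted order."""

    def sort_key(start):
        # Replace the sentinel by "~", which sorts after lowercase letters,
        # so that "$" is treated as the largest character (assumption).
        return text[start:].replace(sentinel, "~")

    order = sorted(range(len(text)), key=sort_key)
    return [start + 1 for start in order]


print(suffix_array("acaaacatat$"))
```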
Suffix Array Construction
• There exists a memory-efficient variant that uses O(n) space, where n is the size of the database string
• However, in this case query time increases
• Lookup query
– Binary search
– O(m log n) time; m is the size of the query
– Can reduce time to O(m + log n) using a more efficient implementation
Suffix Array Search
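find(Pattern P in SuffixArray A of text S):
  lo = 0, hi = length(A) - 1
  for i in 0 .. length(P) - 1:
    binary search within [lo, hi] for the range x..y
      such that P[i] = S[A[j] + i] for all j = x, x+1, …, y
    if no such range exists: return "not found"
    lo = x, hi = y
  return {lo, hi}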
Suffix Array Search
Search ‘is’ in mississippi$
Examine the pattern letter by letter, reducing the range of occurrence each time.
- First letter i: occurs in indices from 0 to 3
- Second letter s: occurs in indices from 2 to 3
Done. Output: issippi$ and ississippi$

Index  Position  Suffix
0      11        i$
1      8         ippi$
2      5         issippi$
3      2         ississippi$
4      1         mississippi$
5      10        pi$
6      9         ppi$
7      7         sippi$
8      4         sissippi$
9      6         ssippi$
10     3         ssissippi$
11     12        $
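The letter-by-letter range narrowing can be sketched as follows. This example uses 0-based positions and Python's default string ordering, in which $ sorts before the letters, so row numbers are shifted by one relative to the table above; the function names are illustrative.

```python
def build_suffix_array(text):
    """0-based suffix start positions in Python's default sort order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find(pattern, text, sa):
    """Return the sorted 0-based start positions of pattern in text."""
    lo, hi = 0, len(sa)  # half-open range [lo, hi) of candidate rows
    for i, ch in enumerate(pattern):

        def char_at(row):
            # Character at offset i of the suffix in this row ("" if too short).
            return text[sa[row] + i: sa[row] + i + 1]

        # Binary search for the first row whose character is >= ch.
        a, b = lo, hi
        while a < b:
            mid = (a + b) // 2
            if char_at(mid) < ch:
                a = mid + 1
            else:
                b = mid
        new_lo = a
        # Binary search for the first row whose character is > ch.
        b = hi
        while a < b:
            mid = (a + b) // 2
            if char_at(mid) <= ch:
                a = mid + 1
            else:
                b = mid
        lo, hi = new_lo, a
        if lo == hi:
            return []  # the range became empty: no occurrence
    return sorted(sa[lo:hi])


text = "mississippi$"
sa = build_suffix_array(text)
print(find("is", text, sa))  # 0-based start positions of 'is'
```

Each pattern letter costs two binary searches over the current range, which gives the O(m log n) lookup time stated earlier.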
Summary
• A suffix array can be built very fast.
• It can answer queries very fast:
– How many times does ‘ATG’ appear?
• Disadvantages:
– Can’t do approximate matching
– Hard to insert new sequences / modify sequences dynamically (need to rebuild the array)
Links
• http://pauillac.inria.fr/~quercia/documents-info/Luminy98/albert/JAVA+html/SuffixTreeGrow.html
• http://home.in.tum.de/~maass/suffix.html
• http://homepage.usask.ca/~ctl271/857/suffix_tree.shtml
• http://homepage.usask.ca/~ctl271/810/approximate_matching.shtml
• http://www.cs.mcgill.ca/~cs251/OldCourses/1997/topic7/
• http://dogma.net/markn/articles/suffixt/suffixt.htm
• http://www.csse.monash.edu.au/~lloyd/tildeAlgDS/Tree/Suffix/
Bowtie: A Highly Scalable Tool
for Post-Genomic Datasets
(Slides by Ben Langmead)
Short Read Alignment
• Given a reference and a set of reads, report at least one “good” local alignment for each read if one exists
– Approximate answer to: where in genome did read originate?
• What is “good”? For now, we concentrate on:
– Fewer mismatches is better
– Failing to align a low-quality base is better than failing to align a high-quality base
[Figure: example alignments. GATCAA against …TGATCATA… is better than GAGAAT against …TGATCATA…; GATcaT against …TGATATTA… is better than GTACAT against …TGATcaTA…]
Burrows-Wheeler Transform
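Text T: acaacg$
1. Rotate the string one by one in each row.
2. Sort the rows lexicographically (equivalent to sorting the suffixes); the result is the Burrows-Wheeler Matrix:
$acaacg
aacg$ac
acaacg$
acg$aca
caacg$a
cg$acaa
g$acaac
3. BWT(T) is the last column of the matrix: gc$aaac
The last column contains the characters preceding the characters in the first column.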
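The rotate, sort and read-last-column recipe translates almost literally into code. This is a sketch for short strings only, since it materialises the full matrix; it assumes the text ends in a unique $ that sorts before all letters, as it does in Python's default ordering. The function name bwt is illustrative.

```python
def bwt(text):
    """Burrows-Wheeler transform via the explicit rotation matrix."""
    # One rotation of the text per row.
    rotations = [text[i:] + text[:i] for i in range(len(text))]
    # Sorting the rows gives the Burrows-Wheeler Matrix.
    matrix = sorted(rotations)
    # BWT(T) is the last column.
    return "".join(row[-1] for row in matrix)


print(bwt("acaacg$"))
```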
Burrows-Wheeler Transform
• Reversible permutation used originally in compression
[Figure: T → Burrows-Wheeler Matrix → last column = BWT(T)]
• Once BWT(T) is built, all else shown here is discarded
– Matrix will be shown for illustration only
• In long texts, BWT(T) contains more runs of repeated characters than the original text → easier to compress!
Burrows-Wheeler Transform
• Property that makes BWT(T) reversible is “LF Mapping”
– The ith occurrence of a character in the Last column is the same text occurrence as the ith occurrence in the First column
[Figure: Burrows-Wheeler Matrix with BWT(T); the ‘a’ of rank 2 (second ‘a’) in the last column is the same text occurrence as the ‘a’ of rank 2 (second ‘a’) in the first column]
Burrows M, Wheeler DJ: A block sorting lossless data compression algorithm. Digital Equipment Corporation, Palo Alto, CA, Technical Report 124; 1994
Burrows-Wheeler Transform
• To recreate T from BWT(T), repeatedly apply rule:
T ← BWT[ LF(i) ] + T; i ← LF(i)
– Where LF(i) maps row i to row whose first character corresponds to i’s last per LF Mapping
• Could be called “unpermute” or “walk-left” algorithm
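A sketch of the walk-left reconstruction follows. It applies the rule in a slightly rearranged order (prepend BWT[i], then step to LF(i)), starts in row 0 and stops when the $ is reached; it assumes a unique $ that sorts before all other characters. The function names lf_mapping and inverse_bwt are illustrative.

```python
def lf_mapping(bwt):
    """For each row i, the row whose first character is row i's last character."""
    counts = {}
    ranks = []  # rank of each character occurrence within the last column
    for ch in bwt:
        ranks.append(counts.get(ch, 0))
        counts[ch] = counts.get(ch, 0) + 1
    # First row of each character's block in the (sorted) first column.
    first_row = {}
    total = 0
    for ch in sorted(counts):
        first_row[ch] = total
        total += counts[ch]
    return [first_row[ch] + ranks[i] for i, ch in enumerate(bwt)]


def inverse_bwt(bwt):
    """Recreate T from BWT(T) by walking left through the text."""
    lf = lf_mapping(bwt)
    i = 0          # row 0 starts with "$" (assumed smallest, unique)
    text = "$"
    while bwt[i] != "$":
        text = bwt[i] + text  # prepend the character preceding the current one
        i = lf[i]
    return text


print(inverse_bwt("gc$aaac"))
```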
BWT in Bioinformatics
• Oligomer counting
– Healy J et al: Annotating large genomes with exact word
matches. Genome Res 2003, 13(10):2306-2315.
• Whole-genome alignment
– Lippert RA: Space-efficient whole genome comparisons with
Burrows-Wheeler transforms. J Comp Bio 2005, 12(4):407-415.
• Smith-Waterman alignment to large reference
– Lam TW et al: Compressed indexing and local alignment of DNA.
Bioinformatics 2008, 24(6):791-797.
TopHat: Bowtie for RNA-seq
• TopHat is a fast splice junction mapper for RNA-Seq reads. It
aligns RNA-Seq reads using Bowtie, and then analyzes the
mapping results to identify splice junctions between exons.
– Contact: Cole Trapnell (cole@cs.umd.edu)
– http://tophat.cbcb.umd.edu
Acknowledgements
NGS Exercises were designed by
Nicolas Delhomme,
EMBL Heidelberg
University of Umeå
Comparison to Maq & SOAP
                      CPU time     Wall clock time  Reads per hour  Peak virtual memory footprint  Bowtie speedup  Reads aligned (%)
Bowtie -v 2 (server)  15m:07s      15m:41s          33.8 M          1,149 MB                       -               67.4
SOAP (server)         91h:57m:35s  91h:47m:46s      0.08 M          13,619 MB                      351x            67.3
Bowtie (PC)           16m:41s      17m:57s          29.5 M          1,353 MB                       -               71.9
Maq (PC)              17h:46m:35s  17h:53m:07s      0.49 M          804 MB                         59.8x           74.7
Bowtie (server)       17m:58s      18m:26s          28.8 M          1,353 MB                       -               71.9
Maq (server)          32h:56m:53s  32h:58m:39s      0.27 M          804 MB                         107x            74.7
PC: 2.4 GHz Intel Core 2, 2 GB RAM
Server: 2.4 GHz AMD Opteron, 32 GB RAM
Bowtie v0.9.6, Maq v0.6.6, SOAP v1.10
SOAP not run on PC due to memory constraints
Reads: FASTQ 8.84 M reads from 1000 Genomes (Acc: SRR001115)
Reference: Human (NCBI 36.3, contigs)
• Bowtie delivers about 30 million alignments per CPU hour