Achim Tresch, UoC / MPIPZ Cologne
Omics: treschgroup.de/OmicsModule1415.html
tresch@mpipz.mpg.de

Mapping of Sequence Reads

Today's topics:
• Hash tables
• Suffix arrays
• Burrows-Wheeler transform

Short Read Applications
• Genotyping
  Goal: identify variations
  [Figure: short reads stacked along the reference genome …CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGTATAC…]
• RNA-seq, ChIP-seq, Methyl-seq, CLIP-seq
  Goal: classify, measure significant peaks
  [Figure: short reads piling up over one region of the reference genome …CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGTATAC…]

Indexing of the reference genome
• Genomes and reads are too large for direct approaches like dynamic programming
• Indexing is required:
  – Suffix tree
  – Suffix array
  – Seed hash tables (many variants, incl. spaced seeds)
• Choice of index is key to performance
• Genome indices can be big. For human:
  – Suffix tree: > 35 GB
  – Suffix array: > 12 GB
  – Seed hash tables: > 12 GB
• Large indices necessitate painful compromises:
  1. Require big-memory machine
  2. Use secondary storage
  3. Build new index each run
  4. Subindex and do multiple passes

Hash tables
(Slides taken from Michael Main, University of Colorado)
• The simplest kind of hash table is an array of records.
• This example has 701 records.
  [Figure: an array of records with slots [0], [1], [2], [3], [4], [5], …, [700]]
• Each record has a special field, called its key.
• In this example, the key is a long integer number (e.g. the record at [4] has key 506643548).
• The number might be a person's identification number, and the rest of the record has information about the person.
• When a hash table is in use, some spots contain valid records, and other spots are "empty".
  [Figure: the array [0] … [700] with four valid records, numbers 281942902, 233667136, 506643548 and 155778322 (the last one at [700]); the other spots are empty]

Inserting a new record
• In order to insert a new record, the key must somehow be converted to an index.
• The function which does this is the hash function.
• The index is also called the hash value of the key.
• In our case: the keys are short sequences, and the records contain their location in the genome.
• Typical hash function: take the remainder of the integer division of the key by the array size (key mod array size).
  Example: key 580625685 with array size 701 gives 580625685 mod 701 = 3.
• The hash value is used for the location of the new record: the record with key 580625685 is stored at [3].

Collisions
• Here is another new record to insert, with key 701466868 and a hash value of 2.
• This is called a collision, because there is already another valid record at [2].
• When a collision occurs, move forward until you find an empty spot.
• The new record goes in the empty spot. In the example, the record with key 701466868 moves forward past the occupied spots and ends up at [5].

A Quiz
• If the keys were short sequences (e.g. ATACCG), can you think of a hash function for generating index values?

Searching for a Key
• The data that's attached to a key can be found quickly.
• Calculate the hash value of the key (for key 701466868, the hash value is 2).
• Check that location of the array for the key. In the example, the record at [2] has a different key ("Not me").
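The insertion and collision rule from these slides can be sketched in Python as follows. This is only an illustration under stated assumptions: the class name `LinearProbingTable`, its method names, and the wrap-around from slot [700] back to slot [0] are not part of the slides. Lookup walks the same sequence of slots as insertion, starting at the hash value of the key.

```python
TABLE_SIZE = 701  # array size used in the slides


class LinearProbingTable:
    """Array of records; collisions are resolved by moving forward (hypothetical helper)."""

    def __init__(self, size=TABLE_SIZE):
        self.size = size
        self.slots = [None] * size  # None marks an empty spot

    def hash_value(self, key):
        # Hash function from the slides: key mod array size.
        return key % self.size

    def insert(self, key, data=None):
        """Store (key, data) and return the slot that was used."""
        index = self.hash_value(key)
        for _ in range(self.size):
            slot = self.slots[index]
            if slot is None or slot[0] == key:
                self.slots[index] = (key, data)
                return index
            # Collision: move forward. Wrapping around at the end of the
            # array is an assumption; the slides do not cover that case.
            index = (index + 1) % self.size
        raise RuntimeError("hash table is full")

    def find(self, key):
        """Return the slot holding key, or None if the key is absent."""
        index = self.hash_value(key)
        for _ in range(self.size):
            slot = self.slots[index]
            if slot is None:      # reached an empty spot: key is not stored
                return None
            if slot[0] == key:    # found the record
                return index
            index = (index + 1) % self.size
        return None


table = LinearProbingTable()
for key in (281942902, 233667136, 506643548, 155778322, 580625685):
    table.insert(key)

slot = table.insert(701466868)  # hash value 2, but [2], [3] and [4] are taken
print(slot)                     # 5
```

With the keys from the slides, 580625685 lands at [3] and the colliding key 701466868 moves forward from [2] to the first empty spot.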
Searching for a Key (continued)
• Keep moving forward until you find the key, or you reach an empty spot. In the example, the search for key 701466868 starts at its hash value [2] and passes records with other keys ("Not me") until it reaches the matching record ("Yes!").
• When the item is found, the information can be copied to the necessary location.

Summary
• Hash tables store a collection of records with keys.
• The location of a record depends on the hash value of the record's key.
• When a collision occurs, the next available location is used.
• Searching for a particular key is generally quick.
• How can hash tables be used for mapping?

Mapping by hash tables
Preprocessing of the target genome:
• Cut the genome into short sequences of fixed length L
• Use these sequences as keys to create a hash table (this takes time, but only once!)
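A minimal sketch of this preprocessing step, assuming that every overlapping window of length L is used as a key and that Python's built-in `dict` (itself a hash table) stands in for the table. The function name `build_seed_index` is hypothetical, and the toy genome is the reference fragment from the genotyping slide, not real data.

```python
from collections import defaultdict


def build_seed_index(genome, seed_length):
    """Map every substring of length seed_length to all of its start positions."""
    index = defaultdict(list)
    for start in range(len(genome) - seed_length + 1):
        seed = genome[start:start + seed_length]
        index[seed].append(start)  # 0-based start position in the genome
    return index


# Toy reference; a real genome would be read from a FASTA file.
genome = "CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGTATAC"
index = build_seed_index(genome, seed_length=4)

print(index["CTAT"])           # [7, 18]
print(index.get("GGGG", []))   # [] : this seed does not occur
```

Building the index costs one pass over the genome, and each later lookup of a seed is a single hash table access. Real read mappers use more compact encodings than Python strings and lists.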
Mapping by hash tables (continued)
Preprocessing of the target genome:
• For all such sequences, store all matching positions in the genome as data in the hash table
  Example entry: Key: ACTAGGTCTT GAGAATCTTA; Matches: Chr II, 304938; Chr V, 2053723
  [Figure: hash table entries, each with a key (e.g. ACTAGGTCTT GAGAATCTTA) and, as content, a starting position (e.g. Chr V, 2053723)]

Mapping of short reads from a sequencing experiment:
• For every read, use a substring of length L and check its occurrence in the hash table
• Given the few (possibly none) matching positions, try to extend the alignment to the whole short read.

Suffix Arrays
• Suffix arrays were introduced by Manber and Myers in 1993
• More space efficient than suffix trees
• A suffix array for a string x of length m is an array of size m that specifies the lexicographic ordering of the suffixes of x.
• Idea: Every substring is a prefix of a suffix

Example of a suffix array for acaaacatat$:
  3 4 1 5 7 9 2 6 8 10 11
Each entry is the starting position of that suffix in the search string (positions are 1-based; in this example the terminator $ sorts after the letters).

Suffix Array Construction
• Naive construction
  – Similar to insertion sort
  – Insert all the suffixes into the array one by one, making sure that the newly inserted suffix is in its correct place
  – Running time complexity: O(m²), where m is the length of the string
• Manber and Myers give an O(m log m) construction in their 1993 paper.
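The following sketch illustrates a naive construction and a lookup. It uses Python's built-in sort instead of the insertion procedure described above, so it is not the Manber and Myers algorithm. Because every substring is a prefix of a suffix, all occurrences of a pattern sit in one contiguous block of the sorted suffixes, which binary search can locate. The function names are hypothetical, and mapping `$` to `~` is an assumption made only to reproduce the ordering of the example (terminator after all letters).

```python
def sort_key(s):
    # The example places the terminator '$' after all letters, so map it to
    # '~', which sorts after a-z in ASCII (assumes the text contains no '~').
    return s.replace("$", "~")


def build_suffix_array(text):
    """Naive construction: sort the start positions by the suffix they begin."""
    return sorted(range(len(text)), key=lambda i: sort_key(text[i:]))


def find_occurrences(text, suffix_array, pattern):
    """Return all 0-based start positions of pattern, using two binary searches."""
    m = len(pattern)

    def prefix(rank):
        start = suffix_array[rank]
        return sort_key(text[start:start + m])

    target = sort_key(pattern)

    # First rank whose length-m prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < target:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # First rank whose length-m prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= target:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[first:lo])


text = "acaaacatat$"
sa = build_suffix_array(text)
print([i + 1 for i in sa])                 # [3, 4, 1, 5, 7, 9, 2, 6, 8, 10, 11]
print(find_occurrences(text, sa, "aca"))   # [0, 4]
```

Printed with 1-based positions, the array matches the example above. Each binary search step compares at most `len(pattern)` characters.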
Suffix Array Construction (continued)
• There exists a memory-efficient construction that uses O(n) space, where n is the size of the database string
• However, in this case query time increases
• Lookup query
  – Binary search
  – O(m log n) time; m is the size of the query
  – Can reduce time to O(m + log n) using a more efficient implementation

Suffix Array Search
(S is the indexed string, A its suffix array)

find(Pattern P in SuffixArray A):
    lo = 0, hi = length(A)
    for i in 0:length(P)
        Binary search within lo..hi for x, y such that P[i] = S[A[j]+i] for all j = x, x+1, …, y
        lo = x, hi = y
    return {lo, hi}

Search 'is' in mississippi$
Examine the pattern letter by letter, reducing the range of occurrence each time.
– First letter i: occurs in indices from 0 to 3
– Second letter s: occurs in indices from 2 to 3
Done. Output: issippi$ and ississippi$

Index  Start  Suffix
0      11     i$
1      8      ippi$
2      5      issippi$
3      2      ississippi$
4      1      mississippi$
5      10     pi$
6      9      ppi$
7      7      sippi$
8      4      sissippi$
9      6      ssippi$
10     3      ssissippi$
11     12     $

Note: start positions are 1-based. The table lists the bare terminator suffix $ last, although inside the other suffixes $ sorts before the letters (i$ precedes ippi$).

Summary
• A suffix array can be built very fast.
• It can answer queries very fast:
  – How many times does 'ATG' appear?
• Disadvantages:
  – Can't do approximate matching
  – Hard to insert new sequences / modify sequences dynamically (need to rebuild the array)

Links
• http://pauillac.inria.fr/~quercia/documents-info/Luminy98/albert/JAVA+html/SuffixTreeGrow.html
• http://home.in.tum.de/~maass/suffix.html
• http://homepage.usask.ca/~ctl271/857/suffix_tree.shtml
• http://homepage.usask.ca/~ctl271/810/approximate_matching.shtml
• http://www.cs.mcgill.ca/~cs251/OldCourses/1997/topic7/
• http://dogma.net/markn/articles/suffixt/suffixt.htm
• http://www.csse.monash.edu.au/~lloyd/tildeAlgDS/Tree/Suffix/

Bowtie: A Highly Scalable Tool for Post-Genomic Datasets
(Slides by Ben Langmead)

Short Read Alignment
• Given a reference and a set of reads, report at least one "good" local alignment for each read if one exists
  – Approximate answer to: where in genome did read originate?
• What is "good"?
For now, we concentrate on:
  – Fewer mismatches is better:
    read GATCAA against …TGATCATA… is better than read GAGAAT against …TGATCATA…
  – Failing to align a low-quality base is better than failing to align a high-quality base:
    read GATcaT against …TGATATTA… is better than read GTACAT against …TGATcaTA…

Burrows-Wheeler Transform
• Text T = acaacg$
• Rotate the string one by one in each row: acaacg$, caacg$a, aacg$ac, acg$aca, cg$acaa, g$acaac, $acaacg
• Sort the rows (equivalently, the suffixes of T) lexicographically. This gives the Burrows-Wheeler matrix: $acaacg, aacg$ac, acaacg$, acg$aca, caacg$a, cg$acaa, g$acaac
• The last column contains the characters preceding the characters in the first column
• BWT(T) is the last column of the Burrows-Wheeler matrix: gc$aaac

Burrows-Wheeler Transform
• Reversible permutation used originally in compression
• Once BWT(T) is built, all else shown here is discarded
  – Matrix will be shown for illustration only
• In long texts, BWT(T) contains more runs of repeated characters than the original text, so it is easier to compress!

Burrows-Wheeler Transform
• Property that makes BWT(T) reversible is "LF Mapping"
  – The ith occurrence of a character in the Last column is the same text occurrence as the ith occurrence in the First column
  [Figure: in the Burrows-Wheeler matrix, the 'a' of rank 2 (second 'a') in the Last column and the 'a' of rank 2 in the First column are the same text occurrence]
Burrows M, Wheeler DJ: A block sorting lossless data compression algorithm. Digital Equipment Corporation, Palo Alto, CA 1994, Technical Report 124; 1994

Burrows-Wheeler Transform
• To recreate T from BWT(T), repeatedly apply rule:
    T ← BWT[LF(i)] + T; i ← LF(i)
  – Where LF(i) maps row i to the row whose first character corresponds to i's last per LF Mapping
• Could be called "unpermute" or "walk-left" algorithm

BWT in Bioinformatics
• Oligomer counting
  – Healy J et al: Annotating large genomes with exact word matches. Genome Res 2003, 13(10):2306-2315.
• Whole-genome alignment
  – Lippert RA: Space-efficient whole genome comparisons with Burrows-Wheeler transforms. J Comp Bio 2005, 12(4):407-415.
• Smith-Waterman alignment to large reference
  – Lam TW et al: Compressed indexing and local alignment of DNA. Bioinformatics 2008, 24(6):791-797.
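The transform and its inversion via LF mapping can be sketched as below. This is a toy illustration, not how Bowtie builds its index: it materialises all rotations, which is only feasible for short strings. The function names are hypothetical, and the code assumes that T ends with a unique `$` that sorts before every other character (true for letters in ASCII). The inversion loop is a slight rearrangement of the walk-left rule: it prepends the last character of the current row and then follows LF to the next row.

```python
def bwt(text):
    """Last column of the sorted rotation matrix of text (text must end in '$')."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(row[-1] for row in rotations)


def lf_mapping(last):
    """lf[i] = row whose first character is the same text occurrence as last[i]."""
    counts = {}
    for ch in last:
        counts[ch] = counts.get(ch, 0) + 1

    # Row at which each character starts in the (sorted) first column.
    first_row, total = {}, 0
    for ch in sorted(counts):
        first_row[ch] = total
        total += counts[ch]

    lf, seen = [], {}
    for ch in last:
        rank = seen.get(ch, 0)            # rank of this occurrence in the last column
        lf.append(first_row[ch] + rank)   # same rank in the first column
        seen[ch] = rank + 1
    return lf


def inverse_bwt(last):
    """Recreate T from BWT(T) by walking left from the row that starts with '$'."""
    lf = lf_mapping(last)
    i, text = 0, "$"                      # row 0 starts with '$'
    while last[i] != "$":
        text = last[i] + text             # character preceding the current one in T
        i = lf[i]
    return text


transformed = bwt("acaacg$")
print(transformed)               # gc$aaac
print(inverse_bwt(transformed))  # acaacg$
```

Only the last column is needed to recover the text, which is why the matrix can be discarded once BWT(T) is built.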
TopHat: Bowtie for RNA-seq
• TopHat is a fast splice junction mapper for RNA-Seq reads. It aligns RNA-Seq reads using Bowtie, and then analyzes the mapping results to identify splice junctions between exons.
  – Contact: Cole Trapnell (cole@cs.umd.edu)
  – http://tophat.cbcb.umd.edu

Acknowledgements
NGS Exercises were designed by Nicolas Delhomme, EMBL Heidelberg / University of Umeå

Comparison to Maq & SOAP

                      CPU time     Wall clock time  Reads per hour  Peak virtual memory footprint  Bowtie speedup  Reads aligned (%)
Bowtie -v 2 (server)  15m:07s      15m:41s          33.8 M          1,149 MB                       -               67.4
SOAP (server)         91h:57m:35s  91h:47m:46s      0.08 M          13,619 MB                      351x            67.3
Bowtie (PC)           16m:41s      17m:57s          29.5 M          1,353 MB                       -               71.9
Maq (PC)              17h:46m:35s  17h:53m:07s      0.49 M          804 MB                         59.8x           74.7
Bowtie (server)       17m:58s      18m:26s          28.8 M          1,353 MB                       -               71.9
Maq (server)          32h:56m:53s  32h:58m:39s      0.27 M          804 MB                         107x            74.7

• PC: 2.4 GHz Intel Core 2, 2 GB RAM
• Server: 2.4 GHz AMD Opteron, 32 GB RAM
• Bowtie v0.9.6, Maq v0.6.6, SOAP v1.10
• SOAP not run on PC due to memory constraints
• Reads: FASTQ 8.84 M reads from 1000 Genomes (Acc: SRR001115)
• Reference: Human (NCBI 36.3, contigs)

• Bowtie delivers about 30 million alignments per CPU hour