CS 6293 Advanced Topics: Current Bioinformatics Next-generation sequencing Mapping short reads Short read mapping • Input: – A reference genome – A collection of many 25-100bp tags (reads) – User-specified parameters • Output: – One or more genomic coordinates for each tag • In practice, only 70-75% of tags successfully map to the reference genome. Why? Multiple mapping • A single tag may occur more than once in the reference genome. • The user may choose to ignore tags that appear more than n times. • As n gets large, you get more data, but also more noise in the data. Inexact matching ? • An observed tag may not exactly match any position in the reference genome. • Sometimes, the tag almost matches one or more positions. • Such mismatches may represent a SNP (single-nucleotide polymorphism, see wikipedia) or a bad read-out. • The user can specify the maximum number of mismatches, or a phred-style quality score threshold. • As the number of allowed mismatches goes up, the number of mapped tags increases, but so does the number of incorrectly mapped tags. % of Paired K-mers with Uniquely Assignable Location Read Length is Not As Important For Resequencing 100% 90% 80% 70% 60% E.COLI 50% HUMAN 40% 30% 20% 10% 0% 8 Jay Shendure 10 12 14 16 18 20 Length of K-mer Reads (bp) Mapping Reads Back • Hash Table (Lookup table) – FAST, but requires perfect matches. [O(m n + N)] • Array Scanning – Can handle mismatches, but not gaps. [O(m N)] • Dynamic Programming (Smith Waterman) – Indels – Mathematically optimal solution – Slow (most programs use Hash Mapping as a prefilter) [O(mnN)] • Burrows-Wheeler Transform (BW Transform) – FAST. [O(m + N)] (without mismatch/gap) – Memory efficient. – But for gaps/mismatches, it lacks sensitivity Spaced seed alignment • Tags and tag-sized pieces of reference are cut into small “seeds.” • Pairs of spaced seeds are stored in an index. • Look up spaced seeds for each tag. • For each “hit,” confirm the remaining positions. • Report results to the user. Burrows-Wheeler • Store entire reference genome. • Align tag base by base from the end. • When tag is traversed, all active locations are reported. • If no match is found, then back up and try a substitution. Why Burrows-Wheeler? • BWT very compact: – Approximately ½ byte per base – As large as the original text, plus a few “extras” – Can fit onto a standard computer with 2GB of memory • Linear-time search algorithm – proportional to length of query for exact matches Burrows-Wheeler Transform (BWT) BWT acaacg$ $acaacg aacg$ac acaacg$ acg$aca caacg$a cg$acaa g$acaac gc$aaac Burrows-Wheeler Matrix (BWM) Burrows-Wheeler Matrix $acaacg aacg$ac acaacg$ acg$aca caacg$a cg$acaa g$acaac Burrows-Wheeler Matrix 3 1 4 2 5 6 $acaacg aacg$ac acaacg$ acg$aca caacg$a cg$acaa g$acaac See the suffix array? Key observation a1c1a2a3c2g1$1 “last first (LF) mapping” The i-th occurrence of character X in the last column corresponds to the same text character as the i-th occurrence of X in the first column. 1$acaacg1 2aacg$ac1 1acaacg$1 3acg$aca2 1caacg$a1 2cg$acaa3 1g$acaac2 Recover text 4 6 5 6 3 3 4 4 2 5 6 5 6 5 6 3 1 4 2 5 6 Exact match 3 1 4 2 5 6 Exact match (another example) BWT(agcagcagact) = tgcc$ggaaaac Search for pattern: gca gca gca gca gca $agcagcagact $agcagcagact $agcagcagact $agcagcagact act$agcagcag act$agcagcag act$agcagcag act$agcagcag agact$agcagc agact$agcagc agact$agcagc agact$agcagc agcagact$agc agcagact$agc agcagact$agc agcagact$agc agcagcagact$ agcagcagact$ agcagcagact$ agcagcagact$ cagact$agcag cagact$agcag cagact$agcag cagact$agcag cagcagact$ag cagcagact$ag cagcagact$ag cagcagact$ag ct$agcagcaga ct$agcagcaga ct$agcagcaga ct$agcagcaga gact$agcagca gact$agcagca gact$agcagca gact$agcagca gcagact$agca gcagact$agca gcagact$agca gcagact$agca gcagcagact$a gcagcagact$a gcagcagact$a gcagcagact$a t$agcagcagac t$agcagcagac t$agcagcagac t$agcagcagac Test with your own seq and pattern at: http://www.allisons.org/ll/AlgDS/Strings/BWT/ Auxiliary data structures Key for efficient pattern matching: how to find the corresponding chars in the first column efficiently, in terms of both time and space. rank BWT SA $agcagcagact t a c g t 1 9 act$agcagcag g 0 0 1 1 2 7 agact$agcagc c 0 1 1 1 3 4 agcagact$agc c 0 2 1 1 4 1 agcagcagact$ $ 0 2 1 1 5 6 cagact$agcag g 0 2 2 1 6 3 cagcagact$ag g 0 2 3 1 7 10 ct$agcagcaga a 1 2 3 1 8 8 gact$agcagca a 2 2 3 1 9 5 gcagact$agca a 3 2 3 1 10 2 gcagcagact$a a 4 2 3 1 11 11 t$agcagcagac c 4 3 3 1 a c g T 1 5 8 11 FM indices Auxiliary data structures Key for efficient pattern matching: how to find the corresponding chars in the first column efficiently, in terms of both time and space. gca rank BWT SA $agcagcagact t a c g t 1 9 act$agcagcag g 0 0 1 1 2 7 agact$agcagc c 0 1 1 1 3 4 agcagact$agc c 0 2 1 1 4 1 agcagcagact$ $ 0 2 1 1 5 6 cagact$agcag g 0 2 2 1 6 3 cagcagact$ag g 0 2 3 1 7 10 ct$agcagcaga a 1 2 3 1 8 8 gact$agcagca a 2 2 3 1 9 5 gcagact$agca a 3 2 3 1 10 2 gcagcagact$a a 4 2 3 1 11 11 t$agcagcagac c 4 3 3 1 a c g t 1 5 8 11 FM indices Next block: From 1 + 0 = 1 to 1 + (4-1) = 4 Auxiliary data structures Key for efficient pattern matching: how to find the corresponding chars in the first column efficiently, in terms of both time and space. gca rank BWT SA $agcagcagact t a c g t 1 9 act$agcagcag g 0 0 1 1 2 7 agact$agcagc c 0 1 1 1 3 4 agcagact$agc c 0 2 1 1 4 1 agcagcagact$ $ 0 2 1 1 5 6 cagact$agcag g 0 2 2 1 6 3 cagcagact$ag g 0 2 3 1 7 10 ct$agcagcaga a 1 2 3 1 8 8 gact$agcagca a 2 2 3 1 9 5 gcagact$agca a 3 2 3 1 10 2 gcagcagact$a a 4 2 3 1 11 11 t$agcagcagac c 4 3 3 1 a c g T 1 5 8 11 FM indices Next block: From 5 + 0 = 5 to 5 + (2-1) = 6 Auxiliary data structures Key for efficient pattern matching: how to find the corresponding chars in the first column efficiently, in terms of both time and space. gca rank BWT SA $agcagcagact t a c g t 1 9 act$agcagcag g 0 0 1 1 2 7 agact$agcagc c 0 1 1 1 3 4 agcagact$agc c 0 2 1 1 4 1 agcagcagact$ $ 0 2 1 1 5 6 cagact$agcag g 0 2 2 1 6 3 cagcagact$ag g 0 2 3 1 7 10 ct$agcagcaga a 1 2 3 1 8 8 gact$agcagca a 2 2 3 1 9 5 gcagact$agca a 3 2 3 1 10 2 gcagcagact$a a 4 2 3 1 11 11 t$agcagcagac c 4 3 3 1 a c g T 1 5 8 11 FM indices Next block: From 8 + 1 = 9 to 8 + (3-1) = 10 Inexact match Main advantage of BWT against suffix array • BWT needs less memory than suffix array • For human genome m = 3 * 109 : – Suffix array: mlog2(m) bits = 4m bytes = 12GB – BWT: m/4 bytes plus extras = 1 - 2 GB • m/4 bytes to store BWT (2 bits per char) • Suffix array and occurrence counts array take 5 m log2 m bits = 20 n bytes • In practice, SA and OCC only partially stored, most elements are computed on demand (takes time!) • Tradeoff between time and space Comparison Spaced seeds • Requires ~50Gb of memory. • Runs 30-fold slower. • Is much simpler to program. Burrows-Wheeler • Requires <2Gb of memory. • Runs 30-fold faster. • Is much more complicated to program. MAQ Bowtie Short-read mapping software Software Technique Developer License Eland Hashing reads Illumnia ? SOAP Hashing refs BGI Academic Maq Hashing reads Sanger (Li, Heng) GNUPL Bowtie BWT Salzberg/UMD BWA BWT Sanger (Li, Heng) GNUPL SOAP2 BWT & hashing BGI GNUPL Academic http://www.oxfordjournals.org/our_journals/bioinformatics/nextgenerationsequencing.html References • • • • • • • (Bowtie) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome, Langmead et al, Genome Biology 2009, 10:R25 SOAP: short oligonucleotide alignment, Ruiqiang Li et al. Bioinformatics (2008) 24: 713-4 (BWA) Fast and Accurate Short Read Alignment with Burrows-Wheeler Transform, Li Heng and Richard Durbin, (2009) 25:1754–1760 SOAP2: an improved ultrafast tool for short read alignment, Ruiqiang Li, (2009) 25: 1966–1967 (MAQ) Mapping short DNA sequencing reads and calling variants using mapping quality scores. Li H, Ruan J, Durbin R. Genome Res. (2008) 18:1851-8. Sense from sequence reads: methods for alignment and assembly, Paul Flicek & Ewan Birney, Nature Methods 6, S6 - S12 (2009) http://www.allisons.org/ll/AlgDS/Strings/BWT/