CS 5263 Bioinformatics Lecture 7: Heuristic Sequence Alignment Tools (BLAST) Multiple Sequence Alignment Roadmap • Last lecture review – Sequence alignment statistics • Today – Heuristic alignment algorithms • Basic Local Alignment Search Tools – Multiple sequence alignment algorithms Sequence Alignment Statistics • Substitution matrices – How is the BLOSUM-N matrix made – How to make your own substitution matrix – What’s the meaning of an arbitrary substitution matrix • Significance of sequence alignments – P-value estimation for • Global alignment scores • Local alignment scores – Critical score for local alignment to guarantee significance Heuristic Local Aligners -- BLAST and alike State of biological databases Sequenced Genomes: Human Mouse Neurospora Tetraodon Drosophila Rice sea squirts 3109 2.7109 4107 3108 1.2108 1.0109 1.6108 Current rate of sequencing: 4 big labs 3 109 bp /year/lab 10s small labs Private sectors Yeast Rat Fugu fish Mosquito Worm Arabidopsis 1.2107 2.6109 3.3108 2.8108 1.0108 1.2108 State of biological databases Number of genes in these genomes: Vertebrate: ~30,000 Insects: ~14,000 Worm: ~17,000 Fungi: ~6,000-10,000 Small organisms: 100s-1,000s Each known or predicted gene has an associated protein sequence >1,000,000 known / predicted protein sequences Some useful applications of alignments Given a newly discovered gene, - Does it occur in other species? Assume we try Smith-Waterman: Our new gene 104 The entire genomic database 1010 - 1011 May take several weeks! Some useful applications of alignments Given a newly sequenced organism, - Which subregions align with other organisms? - Potential genes - Other functional units Assume we try Smith-Waterman: Our newly sequenced mammal 3109 The entire genomic database 1010 - 1011 > 1000 years ??? BLAST • Basic Local Alignment Search Tool – Altschul, Gish, Miller, Myers, Lipman, J Mol Biol 1990 – The most widely used bioinformatics tool – One of the most cited papers in the history of science • Which is better: long mediocre match or a few nearby, short, strong matches with the same total score? – Score-wise, exactly equivalent – Biologically, later may be more interesting, & is common – At least, if must miss some, rather miss the former • BLAST is a heuristic algorithm emphasizing the later – speed/sensitivity tradeoff: BLAST may miss former, but gains greatly in speed BLAST • Available at NCBI (National Center for Biotechnology Information) for download and online use. http://blast.ncbi.nlm.nih.gov/ • Along with many sequence databases query Main idea: 1.Construct a dictionary of all the words in the query 2.Initiate a local alignment for each word match between query and DB Running Time: O(MN) However, orders of magnitude faster than Smith-Waterman DB BLAST Original Version …… Dictionary: All words of length k (~11 for DNA, 3 for proteins) Alignment initiated between words of alignment score T (typically T = k) query …… Alignment: Ungapped extensions until score below statistical threshold scan DB Output: All local alignments with score > statistical threshold query BLAST Original Version A C G A A G T A A G G T C C A G T k = 4, T = 4 The matching word GGTC initiates an alignment Extension to the left and right with no gaps until alignment falls < 50% Output: GTAAGGTCC GTTAGGTCC C C C T T C C T G G A T T G C G A Example: Gapped BLAST Added features: • • Pairs of words can initiate alignment Extensions with gaps in a band around anchor Output: GTAAGGTCCAGT GTTAGGTC-AGT C T G A T C C T G G A T T G C G A A C G A A G T A A G G T C C A G T Example Query: gattacaccccgattacaccccgattaca (29 letters) [2 mins] Database: All GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or phase 0, 1 or 2 HTGS sequences) 1,726,556 sequences; 8,074,398,388 total letters >gi|28570323|gb|AC108906.9| Oryza sativa chromosome 3 BAC OSJNBa0087C10 genomic sequence, complete sequence Length = 144487 Score = 34.2 bits (17), Expect = 4.5 Identities = 20/21 (95%) Strand = Plus / Plus Query: Sbjct: 4 tacaccccgattacaccccga 24 ||||||| ||||||||||||| 125138 tacacccagattacaccccga 125158 Score = 34.2 bits (17), Expect = 4.5 Identities = 20/21 (95%) Strand = Plus / Plus Query: Sbjct: 4 tacaccccgattacaccccga 24 ||||||| ||||||||||||| 125104 tacacccagattacaccccga 125124 >gi|28173089|gb|AC104321.7| Oryza sativa chromosome 3 BAC OSJNBa0052F07 genomic sequence, complete sequence Length = 139823 Score = 34.2 bits (17), Expect = 4.5 Identities = 20/21 (95%) Strand = Plus / Plus Query: Sbjct: 4 tacaccccgattacaccccga 24 ||||||| ||||||||||||| 3891 tacacccagattacaccccga 3911 Example Query: Human atoh enhancer, 179 letters [1.5 min] Result: 57 blast hits 1. 2. 3. 4. 5. 6. 7. 8. gi|7677270|gb|AF218259.1|AF218259 Homo sapiens ATOH1 enhanc... gi|22779500|gb|AC091158.11| Mus musculus Strain C57BL6/J ch... gi|7677269|gb|AF218258.1|AF218258 Mus musculus Atoh1 enhanc... gi|28875397|gb|AF467292.1| Gallus gallus CATH1 (CATH1) gene... gi|27550980|emb|AL807792.6| Zebrafish DNA sequence from clo... gi|22002129|gb|AC092389.4| Oryza sativa chromosome 10 BAC O... gi|22094122|ref|NM_013676.1| Mus musculus suppressor of Ty ... gi|13938031|gb|BC007132.1| Mus musculus, Similar to suppres... 355 1e-95 264 4e-68 256 9e-66 78 5e-12 54 7e-05 44 0.068 42 0.27 42 0.27 gi|7677269|gb|AF218258.1|AF218258 Mus musculus Atoh1 enhancer sequence Length = 1517 Score = 256 bits (129), Expect = 9e-66 Identities = 167/177 (94%), Gaps = 2/177 (1%) Strand = Plus / Plus Query: 3 tgacaatagagggtctggcagaggctcctggccgcggtgcggagcgtctggagcggagca 62 ||||||||||||| ||||||||||||||||||| |||||||||||||||||||||||||| Sbjct: 1144 tgacaatagaggggctggcagaggctcctggccccggtgcggagcgtctggagcggagca 1203 Query: 63 cgcgctgtcagctggtgagcgcactctcctttcaggcagctccccggggagctgtgcggc 122 |||||||||||||||||||||||||| ||||||||| |||||||||||||||| ||||| Sbjct: 1204 cgcgctgtcagctggtgagcgcactc-gctttcaggccgctccccggggagctgagcggc 1262 Query: 123 cacatttaacaccatcatcacccctccccggcctcctcaacctcggcctcctcctcg 179 ||||||||||||| || ||| |||||||||||||||||||| ||||||||||||||| Sbjct: 1263 cacatttaacaccgtcgtca-ccctccccggcctcctcaacatcggcctcctcctcg 1318 BLAST Score: bit score vs raw score • Bit score is converted from raw score by taking into account K and : S’ = ( S – log K) / log 2 • To compute E-value from bit score: E = KM’N’ e-S = M’N’ 2-S’ • Critical score is now: S* = log2(M’N’) If S’ >> S*: significant If S’ << S*: not significant (M’ ~ M, N’ ~ N) Different types of BLAST • blastn: search nucleic acid databases • blastp: search protein databases • blastx: you give a nucleic acid sequence, search protein databases • tblastn: you give a protein sequence, search nucleic acid databases • tblastx: you give a nucleic sequence, search nucleic acid database, implicitly translate both into protein sequences BLAST cons and pros • Advantages – Fast!!!! – A few minutes to search a database of 1011 bases • Disadvantages – Sensitivity may be low – Often misses weak homologies • New improvement – Make it even faster • Mainly for aligning very similar sequences or really long sequences – E.g. whole genome vs whole genome – Make it more sensitive • PSI-BLAST: iteratively add more homologous sequences • PatternHunter: discontinuous seeds Variants of BLAST NCBI-BLAST: most widely used version WU-BLAST: (Washington University BLAST): another popular version Optimized, added features MEGABLAST: Optimized to align very similar sequences. Linear gap penalty BLAT: Blast-Like Alignment Tool BlastZ: Optimized for aligning two genomes PSI-BLAST: BLAST produces many hits Those are aligned, and a pattern is extracted Pattern is used for next search; above steps iterated Sensitive for weak homologies Slower Pattern hunter • Instead of exact matches of consecutive matches of k-mer, we can • look for discontinuous matches – My query sequence looks like: • ACGTAGACTAGCAGTTAAG – Search for sequences in database that match • AXGXAGXCTAXC • X stands for don’t care • Seed: 101011011101 Pattern hunter • A good seed may give you both a higher sensitivity and higher specificity • You may think 110110110110 is the best seed – Because mutation in the third position of a codon often doesn’t change the amino acid – Best seed is actually 110100110010101111 • Empirically determined • How to design such seed is an open problem • May combine multiple random seeds Things we’ve covered so far • Global alignment – Needleman-Wunsch and variants • Local Alignment – Smith-Waterman • Improvement on space and time • More accurate gap penalty models • Heuristic algorithms – BLAST families • Statistics for sequence alignment • Commonality: They all deal with aligning two sequences – Pair-wise sequence alignment • Next: Aligning multiple sequences all together – Multiple sequence alignment • Motivation: – A faint similarity between two sequences becomes very significant if present in many sequences • Protein domains • Motifs responsible for gene regulation Definition • Given N sequences x1, x2,…, xN: – Insert gaps (-) in each sequence xi, such that • All sequences have the same length L • Score of the global mapping is maximum • Pairwise alignment: a hypothesis on the evolutionary relationship between the letters of two sequences • Same for a multiple alignment! Scoring Function • Ideally: – Find alignment that maximizes probability that sequences evolved from common ancestor x ? y z w v Phylogenetic tree or evolution tree Scoring Function (cont’d) • Unfortunately: too many parameters • Compromises: – Ignore phylogenetic tree • Compute from pair-wise scores – Based on sum of all pair-wise scores – Based on scores with a consensus sequence First assumption • Columns are independent – Similar in pair-wise alignment • Therefore, the score of an alignment is the sum of all columns • Need to decide how to score a single column Scoring Function: Sum Of Pairs Definition: Induced pairwise alignment A pairwise alignment induced by the multiple alignment Example: Induces: - x: AC-GCGG-C y: AC-GC-GAG z: GCCGC-GAG x: ACGCGG-C; y: ACGC-GAC; - x: AC-GCGG-C; z: GCCGC-GAG; - y: AC-GCGAG z: GCCGCGAG - Sum Of Pairs (cont’d) • The sum-of-pairs score of an alignment is the sum of the scores of all induced pairwise alignments S(m) = k<l s(mk, ml) s(mk, ml): score of induced alignment (k,l) Example: x: y: z: AC-GCGG-C AC-GC-GAG GCCGC-GAG A C G T A 1 -1 -1 -1 -1 C -1 (A,A) + (A,G) x 2 = -1 (C,C) x 3 (-,A) x 2 + =3 (A,A) = -1 Total score = (-1) + 3 + (-2) + 3 + 3 + (-2) + 3 + (-1) + (-1) = 5 1 -1 -1 -1 G -1 -1 1 -1 -1 T -1 -1 -1 1 -1 - -1 -1 -1 -1 0 Sum Of Pairs (cont’d) • Drawback: no evolutionary characterization – Every sequence derived from all others • Heuristic way to incorporate evolution tree – Weighted Sum of Pairs: S(m) = k<l wkl s(mk, ml) wkl: weight decreasing with distance Human Mouse Duck Chicken Consensus score -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC CAG-CTATCAC--GACCGC----TCGATTTGCTCGAC Consensus sequence: CAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC • Find optimal consensus string m* to maximize S(m) = i s(m*, mi) s(mk, ml): score of pairwise alignment (k,l) Multiple Sequence Alignments Algorithms Multidimensional Dynamic Programming (MDP) Generalization of Needleman-Wunsh: • Find the longest path in a high-dimensional cube – As opposed to a two-dimensional grid • Uses a N-dimensional matrix – As apposed to a two-dimensional array • Entry F(i1, …, ik) represents score of optimal alignment for s1[1..i1], … sk[1..ik] F(i1,i2,…,iN) = max(all neighbors of a cell) (F(nbr)+S(current)) Multidimensional Dynamic Programming (MDP) • Example: in 3D (three sequences): (i-1,j-1,k-1) • 23 – 1 = 7 neighbors/cell F(i-1,j-1,k-1) + S(xi, xj, xk), F(i-1,j-1,k ) + S(xi, xj, -), F(i-1,j ,k-1) + S(xi, -, xk), F(i,j,k) = max F(i ,j-1,k-1) + S(-, xj, xk), F(i-1,j ,k ) + S(xi, -, -), F(i ,j-1,k ) + S(-, xj, -), F(i ,j ,k-1) + S(-, -, xk) (i-1,j-1,k) (i,j-1,k-1) (i,j-1,k) (i-1,j,k-1) (i-1,j,k) (i,j,k-1) (i,j,k) Multidimensional Dynamic Programming (MDP) Running Time: 1. Size of matrix: LN; Where L = length of each sequence N = number of sequences 2. Neighbors/cell: 2N – 1 Therefore………………………… O(2N LN) Faster MDP • Carrillo & Lipman, 1988 – Branch and bound – Other heuristics • Practical for about 6 sequences of length about 200-300. Progressive Alignment • • Multiple Alignment is NP-hard Most used heuristic: Progressive Alignment Algorithm: 1. 2. 3. 4. Align two of the sequences xi, xj Fix that alignment Align a third sequence xk to the alignment xi,xj Repeat until all sequences are aligned Running Time: O(NL2) Each alignment takes O(L2) Repeat N times Progressive Alignment x y z w • When evolutionary tree is known: – Align closest first, in the order of the tree Example: Order of alignments: 1. (x,y) 2. (z,w) 3. (xy, zw) Progressive Alignment: CLUSTALW CLUSTALW: most popular multiple protein alignment Algorithm: 1. Find all dij: alignment dist (xi, xj) • High alignment score => short distance 2. Construct a tree (Neighbor-joining hierarchical clustering. Will discuss in future) 3. Align nodes in order of decreasing similarity + a large number of heuristics CLUSTALW example • • • • S1 ALSK S2 TNSD S3 NASK S4 NTSD CLUSTALW example • • • • S1 ALSK S2 TNSD S3 NASK S4 NTSD s1 s2 s3 s4 s1 0 9 4 7 s2 0 8 3 0 7 s3 s4 0 Distance matrix CLUSTALW example • • • • S1 ALSK S2 TNSD S3 NASK S4 NTSD s1 s2 s3 s4 s1 0 9 4 7 s2 0 8 3 0 7 s3 s4 0 s1 s3 s2 s4 CLUSTALW example • • • • S1 ALSK S2 TNSD S3 NASK S4 NTSD -ALSK NA-SK s1 s2 s3 s4 s1 0 9 4 7 s2 0 8 3 0 7 s3 s4 0 s1 s3 s2 s4 CLUSTALW example • • • • S1 ALSK S2 TNSD S3 NASK S4 NTSD -ALSK NA-SK -TNSD NT-SD s1 s2 s3 s4 s1 0 9 4 7 s2 0 8 3 0 7 s3 s4 0 s1 s3 s2 s4 CLUSTALW example • • • • S1 ALSK S2 TNSD S3 NASK S4 NTSD -ALSK -TNSD NA-SK NT-SD s1 s2 s3 s4 s1 0 9 4 7 s2 0 8 3 0 7 s3 s4 0 -ALSK NA-SK -TNSD NT-SD s1 s3 s2 s4 Iterative Refinement Problems with progressive alignment: • Depend on pair-wise alignments • If sequences are very distantly related, much higher likelihood of errors • Initial alignments are “frozen” even when new evidence comes Example: x: y: GAAGTT GAC-TT Frozen! z: w: GAACTG GTACTG Now clear: correct y should be GA-CTT Iterative Refinement Algorithm (Barton-Stenberg): 1. 2. 3. 4. Align most similar xi, xj Align xk most similar to (xixj) Repeat 2 until (x1…xN) are aligned For j = 1 to N, Remove xj, and realign to x1…xj-1xj+1…xN 5. Repeat 4 until convergence Progressive alignment Iterative Refinement (cont’d) For each sequence y 1. Remove y 2. Realign y z x y (while rest fixed) allow y to vary x,z fixed projection Note: Guaranteed to converge (why?) Running time: O(kNL2), k: number of iterations Iterative Refinement Example: align (x,y), (z,w), (xy, zw): x: y: z: w: GAAGTTA GAC-TTA GAACTGA GTACTGA After realigning y: x: y: z: w: GAAGTTA G-ACTTA GAACTGA GTACTGA + 3 matches Iterative Refinement • Example not handled well: x: y1: y2: y3: GAAGTTA GAC-TTA GAC-TTA GAC-TTA z: w: GAACTGA GTACTGA Realigning any single yi changes nothing Restricted MDP • Similar to bounded DP in pair-wise alignment 1. Construct progressive multiple alignment m 2. Run MDP, restricted to radius R from m z y Running Time: O(2N RN-1 L) x Restricted MDP x: y1: y2: y3: GAAGTTA GAC-TTA GAC-TTA GAC-TTA z: w: GAACTGA GTACTGA • Within radius 1 of the optimal Restricted MDP will fix it. Other approaches • Profile Hidden Markov Models – Statistical learning methods – Will discuss in future Multiple alignment tools • Clustal W (Thompson, 1994) – Most popular • • • • • • • PRRP (Gotoh, 1993) HMMT (Eddy, 1995) DIALIGN (Morgenstern, 1998) T-Coffee (Notredame, 2000) MUSCLE (Edgar, 2004) Align-m (Walle, 2004) PROBCONS (Do, 2004) In summary • Multiple alignment algorithms: – MDP (too slow) • Branch & Bound doesn’t solve the problem entirely – Progressive alignment: clustalW – Iterative refinement – Restricted MDP