Sequential Data Mining Jianlin Cheng Department of Computer Science University of Missouri, Columbia 2011 Sequential Data • Sequence data: a series of alphabets or numbers • Spatial sequence data: biological sequence, text document, a computer program, web page, a missile trajectory • Temporal sequence data: speech, movie clips, stock prices, a series of web clicks Key Issue • Measure the similarity between two sequences, i.e. sequence comparison • Sequence database search / retrieval • Sequence classification • Sequence clustering • Sequence pattern recognition • Sequence similarity definition and calculation is a challenge compared to vector-based data Sequence Alignment • • • • Use biological sequences as an example Optimal pairwise sequence alignment algorithm Fast, heuristic sequence alignment (BLAST) Statistical significance of sequence alignment score Sequence Data • DNA Sequence (gene) gaagagagcttcaggtttggggaagagaca acaactcccgctcagaagcaggggccgata • RNA Sequence caaaucacuacacacaggguagaagguggaacgcacaggagcaugu caacggggugc • Protein Sequence RELQVWGRDNNSLSEAGANRQGDVSFNLPQITLWQRPL VTIKIGGQLKEALLDTGADDTVLEDIDLPGKWKP Global Pairwise Sequence Alignment ITAKPAKTTSPKEQAIGLSVTFLSFLLPAGVLYHL ITAKPQWLKTSESVTFLSFLLPQTQGLYHL Global Pairwise Sequence Alignment ITAKPAKTTSPKEQAIGLSVTFLSFLLPAGVLYHL ITAKPQWLKTSESVTFLSFLLPQTQGLYHL ITAKPAKT-TSPKEQAIGLSVTFLSFLLPAG-VLYHL ITAKPQWLKTSE-------SVTFLSFLLPQTQGLYHL Alignment (similarity) score Three Main Issues 1.Definition of alignment score 2.Algorithms of finding the optimal alignment 3.Evaluation of significance of alignment score A Simple Scoring Scheme • Score of a character pair: S(match)=1, S(not_match) = -1, S(gap-char) = -1 n • Score of an alignment = S i 1 ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL ITAKPQWLKSTE-------SVTFLSFLLPQTQGLYHL 5 – 7 – 7 +10 -4 + 4= 1 Optimization Problem • How many possible alignments exist for two sequences with length m and n? • How can we find the best alignment to maximize alignment score? Possible Alignments Shortest ACTG--ATGGATG ACTG ATGGATG Intermediate AC-TG--ATG-GATG Longest ACTG----------ATGGATG Total Number of Possible Alignments AGATCAGAAAT-G --AT-AG-AATCC A G A A T T C A A G GA A A A A T T C G C m+n Total Number of Alignments Select m positions out of m+n possible positions: m n (m n)! m!n! m Exponential! If m = 300, n = 300, total = 1037 Needleman and Wunsch Algorithm • Given sequences P and Q, we use a matrix M to record the optimal alignment scores of all prefixes of P and Q. M[i,j] is the best alignment score for the prefixes P[1..i] and Q[1..j]. • M[i,j] = max [ M[i-1,j-1] + S(P[i],Q[j]), M[i,j-1] + S(-, Q[j]) M[i-1,j] + S(P[i], -) ] Needleman and Wunsch Algorithm • Given sequences P and Q, we use a matrix M to record the optimal alignment scores of all prefixes of P and Q. M[i,j] is the best alignment score for the prefixes P[1..i] and Q[1..j]. • M[i,j] = max [ M[i-1,j-1] + S(P[i],Q[j]), M[i,j-1] + S(-, Q[j]) M[i-1,j] + S(P[i], -) ] Dynamic Programming Dynamic Programming Algorithm Three-Step Algorithm: •Initialization •Matrix filling (scoring) •Trace back (alignment) 1. Initialization of Matrix M _ A T _ 0 i A G A -1 T C A G -4 A A A -8 T G -11 -2 -3 -5 -6 -7 -9 -10 -12 -1 -2 j A -3 G -4 A -5 A -6 T -7 2. Fill Matrix _ _ 0 A -1 A G A -1 1 -2 0 -3 -1 T C A G -4 T -2 A -3 G -4 A -5 A -6 T -7 -2 -5 -3 -6 -4 -7 -5 A A A -8 -6 -9 -7 -10 -8 T G -11 -9 -12 -10 M[i,j] = max [ M[i-1,j-1] + S(P[i],Q[j]), M[i,j-1] + S(-, Q[j]) M[i-1,j] + S(P[i], -) ] 2. Fill Matrix _ _ 0 A -1 T -2 A G A T C A G A A A -1 1 0 -2 0 0 -3 -1 -1 T G -4 -2 A -3 G -4 A -5 A -6 T -7 0 -5 -3 -1 -6 -4 -2 -7 -5 -3 -8 -6 -4 -9 -7 -5 -10 -8 -6 -11 -9 -7 -12 -10 -8 M[i,j] = max [ M[i-1,j-1] + S(P[i],Q[j]), M[i,j-1] + S(-, Q[j]) M[i-1,j] + S(P[i], -) ] 2. Fill Matrix _ _ 0 A -1 T -2 A -3 G A A T -4 -5 -6 -7 A G A -1 1 0 -1 -2 -3 -4 -5 -2 0 0 -1 0 -1 -2 -3 -3 -1 -1 1 -1 1 0 -1 T C A G A A A -4 0 0 0 0 0 1 T G -2 -5 -3 -1 -1 -1 -1 -1 0 -6 -4 -2 0 -1 0 0 -1 -7 -5 -3 -1 1 0 -1 -1 -8 -6 -4 -2 0 2 1 0 -9 -7 -5 -3 -1 1 3 2 -10 -8 -6 -4 -2 0 2 2 -11 -9 -7 -5 -3 -1 1 3 -12 -10 -8 -6 -4 -2 0 2 M[i,j] = max [ M[i-1,j-1] + S(P[i],Q[j]), M[i,j-1] + S(-, Q[j]) M[i-1,j] + S(P[i], -) ] 3. Trace Back _ _ A G A T C A AGATCAGAAATG G --AT-AG-AATA A A T G 0 A T A -1 -2 -3 G A A T -4 -5 -6 -7 -1 1 0 -1 -2 -3 -4 -5 -2 0 0 -1 0 -1 -2 -3 -3 -1 -1 1 -1 1 0 -1 -4 -2 0 0 0 0 0 1 -5 -3 -1 -1 -1 -1 -1 0 -6 -4 -2 0 -1 0 0 -1 -7 -5 -3 -1 1 0 -1 -1 -8 -6 -4 -2 0 2 1 0 -9 -7 -5 -3 -1 1 3 2 -10 -8 -6 -4 -2 0 2 2 -11 -9 -7 -5 -3 -1 1 3 -12 -10 -8 -6 -4 -2 0 2 Pairwise Alignment Algorithm Using Dynamic Programming • Initialization: Given two sequences with length m and n, create a (m+1)×(n+1) matrix M. Initialize the first row and first column according to scoring matrix. • For j in 1..n (column) for i in 1..m (row) M[i,j] = max( (M[i-1,j-1]+S(i,j), M[i,j-1]+S(-,j), M[i1, j] + S(i,-) ) Record the selected path toward (i,j) • Report alignment score M[m,n] and trace back to M[0,0] to generate the optimal alignment. What is the time complexity of this algorithm? Local Sequence Alignment Using DP • Biological sequences often only have local similarity. • During evolution, only functional and structural important regions are highly conserved. • Global alignment sacrifices the local similarity to maximize the global alignment score. • Need to use alignment method to identify the local similar regions disregard of other dissimilar regions. Transcription binding site Local Alignment Algorithm Goal: find an alignment of the substrings of P and Q with maximum alignment score. Naïve Algorithm: (m+1)*m/2 substrings of P, (n+1)*n/2 substrings of Q Using DP for each substring pairs: m2 * n2 * O(mn) = O(m3n3) Smith-Waterman Algorithm Same dynamic programming algorithm as global alignment except for three differences. 1. All negative scores is converted to 0 (why?) 2. Alignment can start from anywhere in the matrix 3. Alignment can end at anywhere in the matrix Local Alignment Algorithm • Initialization: Given two sequences with length m and n, create a (m+1)×(n+1) matrix M. Initialize the first row and first column to 0s. • For j in 1..n (column) for i in 1..m (row) M[i,j] = max( 0, M[i-1,j-1]+S(i,j), M[i,j-1]+S(-,j), M[i-1, j] + S(i,-) ) Record the selected path. • Find elements in matrix M with maximum values. Trace back till 0 and report the alignment corresponding to the path. 1. Initialization _ _ 0 A 0 G 0 A 0 T 0 C 0 A 0 G 0 A 0 A 0 A 0 T 0 G 0 A 0 T 0 A 0 G 0 A 0 A 0 T 0 2. Fill the Matrix _ A T A G A A T _ 0 0 0 0 0 0 0 0 A 0 1 0 1 0 1 1 0 G 0 0 0 0 2 1 0 0 A 0 1 0 1 0 3 2 1 T 0 0 2 1 0 2 2 3 C 0 0 1 0 0 1 1 2 A 0 1 0 2 1 1 2 1 G 0 0 0 1 3 2 1 1 A 0 1 0 1 2 4 3 2 A 0 1 0 1 1 3 5 4 A 0 1 0 1 0 2 4 4 T 0 0 2 1 0 1 3 5 G 0 0 1 1 2 1 2 4 3. Trace Back _ A T A G A A T _ 0 0 0 0 0 0 0 0 A 0 1 0 1 0 1 1 0 G 0 0 0 0 2 1 0 0 A 0 1 0 1 0 3 2 1 T 0 0 2 1 0 2 2 3 C 0 0 1 0 0 1 1 2 A 0 1 0 2 1 1 2 1 G 0 0 0 1 3 2 1 1 A 0 1 0 1 2 4 3 2 A 0 1 0 1 1 3 5 4 A 0 1 0 1 0 2 4 4 T 0 0 2 1 0 1 3 5 G 0 0 1 1 2 1 2 4 Two Best Local Alignments Local Alignment 1: ATCAGAA AT-AGAA Local Alignment 2: ATCAGAAAT AT-AGA-AT Scoring Matrix • How to accurately measure the similarity between amino acids (or nucleotides) is one key issue of sequence alignment. • For nucleotides, a simple identical / not identical scheme is mostly ok. • Due to various properties of amino acids, it is hard and also critical to measure the similarity between amino acids. Evolutionary Substitution Approach • During evolution, the substitution of similar amino acids is more likely to be selected within a protein family of similar sequences than random substitutions (M. Dayhoff) • The frequency / probability that one residue substitutes another one is an indicator of their similarity. PAM Scoring Matrices (M. Dayhoff) • Select a number of protein families. • Align sequences in each family and count the frequency of amino acid substitution in each column: Pij. • Similarity score is logarithm of the ratio of observed substitution probability over the random substitution probability. S(i,j) = log(Pij / (Pi * Pj)). Pi is the observed probability of residue i and Pj is the observed probability of residue j • PAM: Point Accepted Mutation • Let data tell strategy! - NBA A Simplified Example ACGTCGAGT ACCACGTGT CACACTACT ACCGCATGA ACCCTATCT TCCGTAACA ACCATAAGT AGCATAAGT ACTATAAGT ACGATAAGT Chars A C G T Prob. 6 / 10 1 / 10 2 / 10 1 / 10 P(A<->C) = 0.07+0.07=0.14 Substitution Frequency Table A C G T A 30 6 12 6 C 6 0 2 1 G 12 2 1 2 T 6 1 2 0 Total number of substitutions: 90 A C G T A .33 .07 .14 .07 C .07 0 .02 .01 G .14 .02 .01 .02 T .07 .01 .02 0 A Simplified Example ACGTCGAGT ACCACGTGT CACACTACT ACCGCATGA ACCCTATCT TCCGTAACA ACCATAAGT AGCATAAGT ACTATAAGT ACGATAAGT Chars A C G T Prob. 6 / 10 1 / 10 2 / 10 1 / 10 Substitution Frequency Table A C G T A 30 6 12 6 C 6 0 2 1 G 12 2 2 2 T 6 1 2 0 Total number of substitutions: 90 A C P(A<->C) = 0.07+0.07=0.14 S(A,C) = log(0.14/(0.6*0.1)) = 0.36 G T A .33 .07 .14 .07 C .07 0 .02 .01 G .14 .02 .01 .02 T .07 .01 .02 0 PAM250 Matrix (log odds multiplied by 10) BLOSUM Matrices (Henikoff and Henikoff) • PAM calculation is based on global alignments • PAM matrices don’t work well for aligning sequences with little similarity. • BLOSUM: BLOcks SUbstitution Matrix • BLOSUM based on highly conserved local regions /blocks without gaps. Block 1 Block2 BLOSUM62 Matrix Significance of Sequence Alignment Score • Why do we need significant test? • Mathematical view: unusual versus “by chance” • Biological view: evolutionary related or not? Randomization Approach • Randomization is a fundamental idea due to a statistician Fisher. • Randomly permute chars within sequence P and Q to generate new sequences (P’ and Q’). Align random sequences and record alignment scores. • Assuming these scores obey normal distribution, compute mean (u) and standard derivation (σ) of alignment scores 95% Alignment score Normal distribution of alignment scores of two sequences •If S = u+2 σ, the probability of observing the alignment score equal to or more extreme than this by chance is 2.5%, e.g., P(S>=u+2 σ) = 2.5%. Thus we are 97.5% confident that the alignment score is not by chance, i.e. significant. •For any score x, we can compute P(S >= x), which is called p-value. Model-Based Approach (Karlin and Altschul) http://www.people.virginia.edu/~wrp/cshl02/Altschul/Altschul-3.html • Extreme Value Distribution K and λ are statistical parameters depending on substitution matrix. For BLOSUM62, λ =0.252, K=0.35 P-Value • P(S≥x) is called p-value. It is the probability that two random sequences has an alignment score >= x. • Smaller means more significant. • E-value = p-value * # sequences Problems of Using Dynamic Programming to Search Large Sequence Databases • Search homologs in DNA and protein database is often the first step of a bioinformatics study. • DP is too slow for large sequence database search such as Genbank and UniProt. Each DP search can take hours. • Most DP search time is wasted on unrelated sequences or dissimilar regions. • Developing fast, practical sequence comparison methods for database search is important. Fast Sequence Search Methods • All successful, rapid sequence comparison methods are based on a simple fact: similar sequences share some common words. • First such method is FASTP (Pearson & Lipman, 1985) • Most widely used methods are BLAST (Altschul et al., 1990) and PSI-BLAST (Altschul et al., 1997). Basic Local Alignment Search Tool (S. Altschul, W. Gish, W. Miller, E. Meyer and D. Lipman) 1. Compile a list of words for a query 2. Scan sequences in database for word hits 3. Extending hits David Lipman Stephen Altschul Step 1: Compile Word List • Words: w-mer with length w. • Protein 4-mer and DNA 12-mer Query: DSRSKGEPRDSGTLQSQEAKAVKKTSLFE Words: DSRS, SRSK, RSKG, KGEP…. Step 2: Scan Database Classic problem: find occurrence of a list of words in a sequence. •Integer indexing approach (hashing) •Deterministic finite automaton or finite state machine (faster) Finite State Machine • A finite state machine for string matching: (input alphabet: a,b,c) Word: abab a a a 0 a 1 b 2 a 3 b 4 Database sequence: bcabccaaababacababacabb Step 3: Extension • Extend words on both ends • Terminate the process when we reach a segment pair whose score falls a certain distance below the best score found for shorter extensions. • Depart from the ideal of finding a guaranteed Maximum Segment Pair, but the added inaccuracy is negligible. • Report significant MSP according to extreme value distribution An Example of Extension Query: DSRSKGEPRDSGTLQSQEAKAVKKTSLFE Words: DSRS, SRSK, RSKG, KGEP…. Database Sequence: PESRSKGEPRDSGKKQMDSOKPD Maximum Segment Pair: ESRSKGEPRDSG P-Value and E-Value • P-value • E-value = database size * p-value • Common threshold: 0.01 S P-value = Prob(score >=S) Why is BLAST so Successful? • • • • Address a fundamental problem Extremely fast Simple, yet powerful idea Well founded in computer theory (wordsstring matching, hashing, random process of Maximum Segment Pair) • Implementation tricks -> super speed software • Sacrifice a little accuracy for speed practically (good heuristics) Realism NCBI Online Blast DNA Blast Protein Blast Matched sequences ranked by score and evalue Output Format Significant local alignments Application of Pairwise Sequence Comparison • • • • Database search Clustering Classification Motif recognition