Efficient and Accurate Discovery of Patterns in Sequence Datasets Avrilia Floratou, SandeepTata, and Jignesh M. Patel ICDE 2010 Outline Motivation A powerful new model FLAME(FLexible and Accurate Motif Detector) algorithm Experimental Results Conclusion 2 2010/5/27 Motivation Existing sequence mining algorithms mostly focus on mining for subsequences(non-contiguous). For instance, assume that a sequence is “a, b, a, c, b, a, c”, the sequence “a, b, b, c” is a subsequence constructed by choosing the 1st, 2nd, 5th, and 7th. “Are there any frequently recurring patterns in this time series dataset?” The recurring subsequences are similar, but not identical. It allow for some noise. Approximate subsequence mining problem.(contiguous) 3 2010/5/27 (Cont.) Computational biology To detect short sequences, usually of length 6-15, that occur frequently in a given set of DNA or protein sequences. We call these frequently occurring patterns as motifs. Goal To present a new model for approximate motif mining in many different domains. To present FLAME algorithm to efficiently find motifs that satisfy these model. 4 2010/5/27 The model (L, M, s, k) model L is the length of the motif. M is a distance matrix that is used to compute the similarity between two strings. s is the maximum distance threshold within which two strings are considered similar. k is the min_sup. If at least k strings T1,…, Tk in the DB each of them is length L, and d(S,Ti) s, where d ( A , B ) M ( a , b ) is a distance function. Then, a string S is an (L, M, s, k) motif. n i 1 5 i i 2010/5/27 (Cont.) Protein motif mining Some amino acids are very similar to each other, while some are very different. Because Alanine and Valine are both hydrophobic, the matrix can be used to award a small penalty for M(X,Y) when X and Y are similar. When Alanine is matched to Glycine, the large penalty is awarded. Because Glycine is a hydrophilic amino acid. (L, d, k) model (L, f, d, k) model 6 2010/5/27 (L, d, k) model (L, d, k) model is a mismatch based model for finding DNA motifs. The distance measure between two strings is the Hamming distance, M(X,Y)=1 if X Y and M(X,Y)=0 if X=Y The signature is usually a short string of DNA 6-15 bases long. These signatures are seldom identical and differ in a few positions. For instance {ABCD, ACCD, ABCA} if ABCD is the model sequence, the other two sequences are within one mismatch of the motif, so these sequences would constitute a (4,1,3) motif. 7 2010/5/27 (L, f, d, k) model This model builds on themotif (L, d, k) model to include positional (8,1,0,5) motif or (8,1,5) constraints onk)the mismatches. Use the (L, d, model to retrieve the pattern, we will end up Tomany specify the number of fixed-position mismatches. with extraneous hits that might not be meningful. {ABCD, ACCD, ADCD}, this set forms a (4,1,3) motif, but the mismatches, whenever they occur, are always in position two (AcCD, AdCD). The advantage of this model Consider a DNA DB consisting of 5 sequences, each of length 500. Assume that each sequence has in it the motif GTGAACAC, and each instance of the motif has a mismatch at the fifth position. 8 2010/5/27 FLAME algorithm It first construct a data suffix tree and model suffix tree. To traversing the nodes of the model space in depth first order. Using two subroutines Evaluate_support: to compute the list of matches and the support. Expand_Matches: to ensure that the number of mismatches to the model string does not exceed d. 9 2010/5/27 (Cont.) The data suffix tree: The model suffix tree: On the set of all possible model srtings. To help guide the exploration of the model space in a way that avoids redundant work. 10 2010/5/27 (Cont.)the strategy of pruning the model suffix tree Assume that the dataset consists of sequences over the alphabet {A, B, C, D, E} All the strings of length L starting with the symbol A form a subset of the model space. … A AA AB AC AD AE … 11 2010/5/27 (Cont.) No mismatches are allowed. AA AB … k A AC AD AE … 12 2010/5/27 (Cont.) As mismatches are allowed. A AA AB … B AC AD AE … 13 2010/5/27 Experimental results 14 2010/5/27 (Cont.) 15 2010/5/27 Conclusion This paper presented a powerful new model: (L, M, s, k) It also presented FLAME to find (L, M, s, k) motifs 16 2010/5/27