Mining Order-Preserving Submatrices from Data with Repeated Measurements
Ben Kao, Kevin Y. Yip, Sau Dan Lee, Chun Kit Chui
ICDM 2008

Presentation Outline
The traditional order-preserving submatrix (OPSM) mining problem
Mining OPSMs from data with repeated measurements (OPSM-RM)
Basic algorithm
Efficient mining methods
    MinBound
    The HTBound technique
Experimental results

The traditional OPSM mining problem
Preliminaries

Order-preserving Submatrices
The order-preserving submatrix (OPSM) model is a pattern-based subspace clustering model that applies to a matrix of numerical data values.
Objective: identify a subset of columns over which a subset of rows exhibit a similar pattern of rises and falls in their values.

Matrix of numerical data values
      C1  C2  C3  C4  C5  C6  C7  C8
R1    36  32  12  19  18  42  33   8
R2    11  22  33  24  30   3   9  23
R3    14  18  48  28  38  11  33  21
R4    20  14   5  10   7  24  44  13
R5    38  25  10  24  19  39   8  22
[Figure: data matrix plotted; no obvious patterns observed]

Order-preserving submatrix
      C3  C5  C4  C2  C1  C6
R1    12  18  19  32  36  42
R4     5   7  10  14  20  24
R5    10  19  24  25  38  39
The values of each row are increasing w.r.t. the column order.
[Figure: R1, R4 and R5 plotted over the reordered columns; concurrent rising patterns]

Application: mining gene expression datasets
Rows are genes, columns are experimental conditions, and entries are expression values.
Genes that exhibit simultaneous rises and falls of their expression values across different experiments reveal interesting patterns and knowledge.
Biologists would like to identify sets of genes that are functionally related.
OPSMs suggest candidate sets of genes that are of interest to them.
The mined OPSMs are then followed up with costly small-scale testing.
Requirement: as few false positive results as possible.

Order-preserving Submatrices: definition
Given a data matrix M with n rows and m columns, an OPSM S consists of
    a subset of rows R, e.g. R = {R1, R4, R5}, and
    a permutation of a subset of columns (the pattern) P, e.g. P = <C3,C5,C4,C2,C1,C6>,
such that the entries of all rows in R are monotonically increasing w.r.t. P.
Mining OPSMs: find all OPSMs whose number of rows (the support) is greater than or equal to a user-specified threshold (frequent).

Matrix of numerical data values
      C1  C2  C3  C4  C5  C6  …  Cm
R1    36  32  12  19  18  42  …   8
R2    11  22  33  24  30   3  …  23
R3     3  25  31  22  11   4  …  26
R4    20  14   5  10   7  24  …  13
R5    38  25  10  24  19  39  …  22
…      …   …   …   …   …   …  …   …
Rn

The order-preserving submatrix shown earlier is obtained from this matrix by taking the subset of rows {R1, R4, R5} and the reordered subset of columns (experimental conditions) <C3,C5,C4,C2,C1,C6>.

Apriori property
Let P1 and P2 be two patterns such that P1 is a subsequence of P2. Then the support of P2 is no greater than the support of P1.
E.g. if <a,b> is infrequent, then <a,b,c>, <a,b,c,d> and <c,a,b,d> are all infrequent.
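To make the notions of support and the Apriori property concrete, here is a minimal sketch that checks which rows of the 5 x 8 example matrix support a pattern. The function name `supporting_rows` and the use of 1-based column numbers are illustrative choices, and treating "monotonically increasing" as strictly increasing is an assumption, since the slides do not discuss ties.

```python
# Data matrix from the slides: rows R1..R5, columns C1..C8.
MATRIX = {
    "R1": [36, 32, 12, 19, 18, 42, 33, 8],
    "R2": [11, 22, 33, 24, 30, 3, 9, 23],
    "R3": [14, 18, 48, 28, 38, 11, 33, 21],
    "R4": [20, 14, 5, 10, 7, 24, 44, 13],
    "R5": [38, 25, 10, 24, 19, 39, 8, 22],
}


def supporting_rows(matrix, pattern):
    """Return the rows whose values strictly increase along `pattern`.

    `pattern` lists 1-based column numbers, e.g. [3, 5] stands for <C3,C5>.
    Strict increase is an assumption; the slides do not discuss ties.
    """
    rows = []
    for name, values in matrix.items():
        picked = [values[c - 1] for c in pattern]
        if all(x < y for x, y in zip(picked, picked[1:])):
            rows.append(name)
    return rows


full = supporting_rows(MATRIX, [3, 5, 4, 2, 1, 6])  # P = <C3,C5,C4,C2,C1,C6>
sub = supporting_rows(MATRIX, [3, 4, 1])            # a subsequence of P
print(full, sub)
```

Every row that supports the longer pattern also supports its subsequence, which is exactly why the support can only shrink as a pattern grows.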
With the Apriori property, we can adopt an iterative candidate generation-and-test mining framework to prune the search space.
Start from mining size-2 OPSMs, because size-1 OPSMs do not have any "orderings".

Mining framework
Size-2 OPSM candidates -> Support counting -> Frequent size-k OPSMs -> OPSM candidate generation -> Size-(k+1) candidate OPSMs -> Support counting -> ...

Support counting: the support counting module verifies the number of supporting rows of each candidate OPSM. OPSMs with at least ρ supporting rows are frequent and are passed to the candidate generation module.
OPSM candidate generation: according to the Apriori property, we only generate those size-(k+1) candidates all of whose size-k subsequences are frequent.
The algorithm terminates when no more candidates are generated.
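The generate-and-test loop above can be sketched as follows on the 5 x 8 example matrix. This is an illustrative reading of the framework, not the authors' implementation: the names `support` and `mine_opsms`, the join rule (P1 without its first item must equal P2 without its last item), and the strict-increase test are assumptions.

```python
from collections import defaultdict
from itertools import combinations, permutations

# Data matrix from the slides (rows R1..R5, columns C1..C8).
MATRIX = [
    [36, 32, 12, 19, 18, 42, 33, 8],
    [11, 22, 33, 24, 30, 3, 9, 23],
    [14, 18, 48, 28, 38, 11, 33, 21],
    [20, 14, 5, 10, 7, 24, 44, 13],
    [38, 25, 10, 24, 19, 39, 8, 22],
]


def support(pattern):
    """Number of rows whose values strictly increase along `pattern` (1-based columns)."""
    return sum(
        all(row[a - 1] < row[b - 1] for a, b in zip(pattern, pattern[1:]))
        for row in MATRIX
    )


def mine_opsms(min_support):
    """Level-wise search; returns {pattern: support} for every frequent pattern."""
    n_cols = len(MATRIX[0])
    candidates = list(permutations(range(1, n_cols + 1), 2))  # size-2 candidates
    result = {}
    while candidates:
        # Support counting: keep the candidates with enough supporting rows.
        frequent = {}
        for cand in candidates:
            s = support(cand)
            if s >= min_support:
                frequent[cand] = s
        result.update(frequent)

        # Candidate generation: join P1 and P2 when P1 without its first item
        # equals P2 without its last item, then apply the Apriori check.
        by_prefix = defaultdict(list)
        for p in frequent:
            by_prefix[p[:-1]].append(p)
        candidates = []
        for p1 in frequent:
            for p2 in by_prefix[p1[1:]]:
                if p2[-1] == p1[0]:
                    continue  # a column may appear only once in a pattern
                cand = p1 + (p2[-1],)
                if all(sub in frequent
                       for sub in combinations(cand, len(cand) - 1)):
                    candidates.append(cand)
    return result


frequent = mine_opsms(min_support=3)
print(frequent[(3, 5, 4, 2, 1, 6)])
```

With a threshold of 3 rows, the pattern <C3,C5,C4,C2,C1,C6> from the earlier slide is among the frequent patterns found. The loop stops on its own once a level produces no candidates.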
Mining OPSMs from Data with Repeated Measurements
Fractional support

Problem motivation
A main drawback of the basic OPSM mining problem is that it is very sensitive to noisy data.
In microarray experiments in particular, each value in the dataset is a physical measurement that is subject to different kinds of errors.
To combat errors, experiments are often repeated and multiple measured values (replicates) are recorded.
The replicates allow a better estimate of the actual physical quantity.

An example dataset in which each column has 3 replicates (e.g. the experiment in column "a" is repeated 3 times, generating 3 sub-columns a1, a2, a3):

A dataset without repeated measurements
       a    b    c    d
r1    49   38  115   82
r2    67   96  124   48
r3    65   67  132   95
r4    81  115  133   62

A dataset with repeated measurements
      a1   a2   a3   b1   b2   b3   c1   c2   c3   d1   d2   d3
r1    49   55   80   38   51   81  115  101   79   82  110   50
r2    67   54  130   96   85   82  124   92   94   48   37   32
r3    65   49   62   67   39   28  132  119   83   95   89   64
r4    81   83  105  115  110   87  133  108  105   62   52   51

Problem motivation
The original OPSM definition is not robust against noisy data.
According to its definition, a row either supports or does not support a pattern.
It fails to take advantage of the additional information provided by data replicates.
There is a strong motivation to revise the definition of OPSM to handle repeated measurements.
For example, in row r3, (a1, b1) = (65, 67) supports pattern <a,b>, whereas (a2, b2) = (49, 39) does not.

Fractional support
The fractional support s_i(P) of a pattern P contributed by a row i is the number of replicate combinations of row i that support the pattern, divided by the total number of replicate combinations of the columns in P.

Fractional support of <a,b,d> in row 1:
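The definition of fractional support can be evaluated by brute force, enumerating every replicate combination of the columns in the pattern. The sketch below does this for the replicated example dataset; the name `fractional_support` and the dictionary layout are illustrative, and strict inequality between consecutive values is an assumption.

```python
from fractions import Fraction
from itertools import product

# Replicated dataset from the slides: row -> column -> replicate values.
DATA = {
    "r1": {"a": [49, 55, 80], "b": [38, 51, 81],
           "c": [115, 101, 79], "d": [82, 110, 50]},
    "r2": {"a": [67, 54, 130], "b": [96, 85, 82],
           "c": [124, 92, 94], "d": [48, 37, 32]},
    "r3": {"a": [65, 49, 62], "b": [67, 39, 28],
           "c": [132, 119, 83], "d": [95, 89, 64]},
    "r4": {"a": [81, 83, 105], "b": [115, 110, 87],
           "c": [133, 108, 105], "d": [62, 52, 51]},
}


def fractional_support(row, pattern):
    """s_i(P): supporting replicate combinations / all replicate combinations."""
    combos = list(product(*(row[col] for col in pattern)))
    supporting = sum(
        all(x < y for x, y in zip(combo, combo[1:])) for combo in combos
    )
    return Fraction(supporting, len(combos))


print(fractional_support(DATA["r3"], ("a", "b")))
print(fractional_support(DATA["r1"], ("a", "b", "d")))
```

Enumeration costs the product of the replicate counts per row, so it is only practical for short patterns; it serves here as a reference against which faster counting schemes can be checked.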
Pattern P: <a,b,d>
Total number of replicate combinations: sd_1(P) = 3 * 3 * 3 = 27
Replicate combinations that support the pattern: <a1,b2,d1>, <a1,b2,d2>, <a1,b3,d1>, <a1,b3,d2>, <a2,b3,d1>, <a2,b3,d2>, <a3,b3,d1>, <a3,b3,d2>, so sn_1(P) = 8
Fractional support of the pattern: s_1(P) = sn_1(P) / sd_1(P) = 8/27

The fractional support satisfies the following requirements:
Requirement 1: If all replicate combinations of a row support a certain pattern, the row strongly supports the pattern.
Requirement 2: If only a fraction of the replicate combinations support a pattern, the resulting fractional support is fuzzy (away from 0 and 1), which reflects the uncertainty.

Problem Definition
The total fractional support of a pattern P (or simply the support of P) is defined as the sum of the fractional supports of P contributed by all rows.
A pattern P is frequent if its total fractional support is not less than a given support threshold ρ.

Mining OPSMs from Data with Repeated Measurements
Size-2 OPSM candidates -> Support counting -> Frequent size-k OPSMs -> Candidate pattern generation -> Size-(k+1) candidate OPSMs -> Support counting -> ...

Problem 1 (candidate pattern generation): combinatorial explosion of the number of candidates generated in each iteration.
Solution: for each generated pattern, obtain an upper bound of its fractional support. If the upper bound is below the support threshold, the candidate can be pruned.

Problem 2 (support counting): support counting requires a computationally expensive subsequence counting step.
Solution: maintain a data structure for efficient support counting, e.g. a head-tail tree and run-length encoding of the transformed sequences of the dataset.
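As a small illustration of the problem definition, the sketch below sums the per-row fractional supports and compares the total with a threshold. The names `total_support` and `is_frequent` are illustrative, and the threshold ρ = 2 used in the example is a hypothetical value, not one taken from the slides.

```python
from fractions import Fraction
from itertools import product

# Replicated dataset from the slides, one dict per row r1..r4.
ROWS = [
    {"a": [49, 55, 80], "b": [38, 51, 81], "c": [115, 101, 79], "d": [82, 110, 50]},
    {"a": [67, 54, 130], "b": [96, 85, 82], "c": [124, 92, 94], "d": [48, 37, 32]},
    {"a": [65, 49, 62], "b": [67, 39, 28], "c": [132, 119, 83], "d": [95, 89, 64]},
    {"a": [81, 83, 105], "b": [115, 110, 87], "c": [133, 108, 105], "d": [62, 52, 51]},
]


def fractional_support(row, pattern):
    """Fraction of replicate combinations that strictly increase along `pattern`."""
    combos = list(product(*(row[col] for col in pattern)))
    good = sum(all(x < y for x, y in zip(c, c[1:])) for c in combos)
    return Fraction(good, len(combos))


def total_support(pattern):
    """Sum of the fractional supports contributed by all rows."""
    return sum(fractional_support(row, pattern) for row in ROWS)


def is_frequent(pattern, rho):
    """A pattern is frequent if its total support is not less than rho."""
    return total_support(pattern) >= rho


RHO = 2  # hypothetical threshold, chosen only for this example
print(total_support(("a", "b")), is_frequent(("a", "b"), RHO))
print(total_support(("b", "a")), is_frequent(("b", "a"), RHO))
```

Unlike the traditional setting, both orderings of a column pair can collect non-zero support from the same row, so the two totals share the four rows' worth of support between them.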
Efficient Mining Methods
MinBound (naïve)
HTBound

Scenario
Candidate pattern generation: in iteration k, each length-k candidate pattern P is generated from two length-(k-1) frequent patterns P1 and P2, where P1 and P2 are subsequences of P.
E.g. P = <a,b,c>, P1 = <a,b> and P2 = <b,c>.

Bounding techniques
Given a length-k pattern P, we want to obtain an upper bound of s(P) before the support counting step.
This is done by obtaining an upper bound of s_i(P) for every row i.

MinBound
Input: s_i(P1) and s_i(P2), where P1 and P2 are subsequences of P and P is generated by joining P1 and P2.
MinBound: s_i(P) ≤ min{ s_i(P1), s_i(P2) } for every row i.
The sum of minima is no larger than the minimum of sums: sum_i min{ s_i(P1), s_i(P2) } ≤ min{ s(P1), s(P2) }, so bounding row by row is at least as tight as bounding with the total supports of P1 and P2.

Transformed sequence dataset
Before the mining step, we preprocess the data matrix by sorting the entries of each row by value in ascending order and replacing each entry by its column label.

A dataset with repeated measurements
      a1   a2   a3   b1   …
r1    49   55   80   38   …
r2    67   54  130   96   …
r3    65   49   62   67   …
r4    81   83  105  115   …

Column sequences
r1    <b,a,d,b,a,c,a,b,d,c,d,c>
r2    <d,d,d,a,a,b,b,c,c,b,c,a>
r3    <b,b,a,a,d,a,b,c,d,d,c,c>
r4    <d,d,d,a,a,b,c,c,a,b,b,c>

To determine the fractional support of a row for a pattern, we count the number of occurrences of the pattern in the row sequence.

MinBound example (row r1, column sequence <b,a,d,b,a,c,a,b,d,c,d,c>)
Pattern P = <a,b,c,d>, generated from <a,b,c> and <b,c,d>
s_1(<a,b,c>) = 9/27
s_1(<b,c,d>) = 7/27
s_1(P) ≤ min{ 9/27, 7/27 } = 7/27
Notice that the true value of s_1(<a,b,c,d>) is 2/27.
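The MinBound computation can be sketched as below, using the four column sequences exactly as printed on the slides. The helper names `occurrences`, `frac` and `min_bound` are illustrative, and the subsequence counter is a standard dynamic-programming routine rather than anything prescribed by the presentation.

```python
from fractions import Fraction

# Transformed row sequences as given on the slides.
SEQUENCES = {
    "r1": "badbacabdcdc",
    "r2": "dddaabbccbca",
    "r3": "bbaadabcddcc",
    "r4": "dddaabccabbc",
}
REPLICATES = 3  # every column has 3 replicates in this example


def occurrences(seq, pattern):
    """Count occurrences of `pattern` as a (not necessarily contiguous) subsequence."""
    ways = [1] + [0] * len(pattern)  # ways[j]: matches of pattern[:j] seen so far
    for item in seq:
        for j in range(len(pattern), 0, -1):
            if pattern[j - 1] == item:
                ways[j] += ways[j - 1]
    return ways[-1]


def frac(seq, pattern):
    """Fractional support of one row: occurrences / replicate combinations."""
    return Fraction(occurrences(seq, pattern), REPLICATES ** len(pattern))


def total(pattern):
    """Total fractional support over all rows."""
    return sum(frac(seq, pattern) for seq in SEQUENCES.values())


def min_bound(p1, p2):
    """MinBound on s(P): sum over rows of min(s_i(P1), s_i(P2))."""
    return sum(min(frac(seq, p1), frac(seq, p2)) for seq in SEQUENCES.values())


r1 = SEQUENCES["r1"]
print(frac(r1, "abc"), frac(r1, "bcd"), frac(r1, "abcd"))
print(total("abcd"), min_bound("abc", "bcd"), min(total("abc"), total("bcd")))
```

For row r1 the three fractions reproduce the slide's figures (9/27, 7/27 and the true value 2/27), and the second line shows the chain: true support, then MinBound, then the coarser minimum of total supports.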
HTBound

HT arrays
Assume that we generate pattern P = <a,b,c,d> in the candidate generation module. Our task is to obtain the number of occurrences of <a,b,c,d> in each of the row sequences.

Column sequence of r1
Position    1   2   3   4   5   6   7   8   9  10  11  12
Item        b   a   d   b   a   c   a   b   d   c   d   c

Patterns already computed (number of occurrences of the pattern in row 1, i.e. sn_1)
Head pattern <a,b,c>: 9
Tail pattern <b,c,d>: 7

Number of replicates for each column
Column                  a   b   c   d
Number of replicates    3   3   3   3

Since pattern <a,b,c,d> is generated, <a,b,c> and <b,c,d> must both be frequent, so we can assume that the numbers of occurrences of the head and tail patterns in a given row are readily available.
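The preprocessing step (sorting a row's replicate values and keeping only the column labels) and the claim that counting occurrences in the sequence yields the number of supporting replicate combinations can both be checked on row r1. In this sketch the names `to_sequence`, `occurrences` and `brute_force` are illustrative, and breaking ties between equal values by column label is an assumption that does not matter for r1, which has no ties.

```python
from itertools import product

# Row r1 of the replicated dataset from the slides.
R1 = {"a": [49, 55, 80], "b": [38, 51, 81],
      "c": [115, 101, 79], "d": [82, 110, 50]}


def to_sequence(row):
    """Sort all replicate values ascending and keep only the column labels."""
    entries = sorted((value, col) for col, values in row.items() for value in values)
    return "".join(col for _, col in entries)


def occurrences(seq, pattern):
    """Count occurrences of `pattern` as a subsequence of `seq`."""
    ways = [1] + [0] * len(pattern)
    for item in seq:
        for j in range(len(pattern), 0, -1):
            if pattern[j - 1] == item:
                ways[j] += ways[j - 1]
    return ways[-1]


def brute_force(row, pattern):
    """Count supporting replicate combinations by direct enumeration."""
    return sum(
        all(x < y for x, y in zip(combo, combo[1:]))
        for combo in product(*(row[col] for col in pattern))
    )


seq = to_sequence(R1)
print(seq)
for pattern in ("abc", "bcd", "abd"):
    print(pattern, occurrences(seq, pattern), brute_force(R1, pattern))
```

The sequence printed for r1 matches the one on the slide, and the two counters agree on every pattern, including the head pattern <a,b,c> (9) and the tail pattern <b,c,d> (7).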
HT-arrays
Running example: pattern P = <a,b,c,d>, row r1 with column sequence <b,a,d,b,a,c,a,b,d,c,d,c>; head pattern <a,b,c> (9 occurrences), tail pattern <b,c,d> (7 occurrences).

Head array
Concerns the head sub-pattern (i.e. P[1…k-1]).
It contains r(P[1]) entries. P[1] is the 1st item in the pattern (i.e. "a") and r(a) is the number of replicates of column "a", so the head array contains r(a) = 3 slots.
The l-th entry records the number of times P[2…k-1] appears after the l-th occurrence of P[1] in the sequence.

Tail array
Concerns the tail sub-pattern (i.e. P[2…k]).
It consists of sn_i(P[2…k-1]) entries.
The l-th entry records the number of times P[k] appears after the l-th occurrence of P[2…k-1] in the sequence, where the occurrences are in lexicographic order according to the positions of the occurrences.

Head array, 1st slot
The 1st slot of the head array concerns the 1st occurrence of "a" in the sequence (position 2). How many <b,c> appear after that "a"?
The occurrences of <b,c> after position 2 are at positions (4,6), (4,10), (4,12), (8,10) and (8,12).
There are 5 occurrences of <b,c> after the 1st occurrence of "a", so the slot holds 5.
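The head and tail array definitions can be turned into a short sketch for row r1. The function names are illustrative and the occurrence enumeration is deliberately naive. The final helper, `count_from_ht`, rests on an observation of ours that is not taken from the slides: because mid-pattern occurrences are listed in lexicographic order of positions, those lying after a given occurrence of P[1] form a suffix of that list, so the last h entries of the tail array belong to a head slot holding h.

```python
from itertools import combinations

SEQ = "badbacabdcdc"  # column sequence of row r1 from the slides
PATTERN = "abcd"      # P = <a,b,c,d>


def positions_of(seq, pattern):
    """All occurrences of `pattern` as 0-based position tuples, in lexicographic order."""
    return [
        pos for pos in combinations(range(len(seq)), len(pattern))
        if all(seq[i] == item for i, item in zip(pos, pattern))
    ]


def ht_arrays(seq, pattern):
    """Head and tail arrays of `pattern` in `seq` (pattern length must be >= 3)."""
    first, mid, last = pattern[0], pattern[1:-1], pattern[-1]
    mid_occ = positions_of(seq, mid)
    # Head: for each occurrence of P[1], the mid occurrences starting after it.
    head = [
        sum(1 for occ in mid_occ if occ[0] > p)
        for p, item in enumerate(seq) if item == first
    ]
    # Tail: for each mid occurrence, the occurrences of P[k] after its end.
    tail = [
        sum(1 for q in range(occ[-1] + 1, len(seq)) if seq[q] == last)
        for occ in mid_occ
    ]
    return head, tail


def count_from_ht(head, tail):
    """Occurrences of the full pattern, using the suffix observation (our assumption)."""
    return sum(sum(tail[len(tail) - h:]) for h in head)


head, tail = ht_arrays(SEQ, PATTERN)
print(head, tail, count_from_ht(head, tail))
```

The head array sums to the number of occurrences of the head pattern and the tail array to that of the tail pattern, which gives a quick consistency check on both.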
Head array, 2nd slot
The 2nd slot concerns the 2nd occurrence of "a" in the sequence (position 5). How many <b,c> appear after that "a"?
Since there are 2 occurrences of <b,c> after the 2nd occurrence of "a", the slot holds the value 2.

Head array, 3rd slot
The 3rd slot concerns the 3rd occurrence of "a" in the sequence (position 7). How many <b,c> appear after that "a"?
There are 2 occurrences of <b,c> after the 3rd occurrence of "a", so the slot holds 2.

Head array: 5 2 2

Tail array, number of entries
Patterns already computed for row 1: head pattern <a,b,c>: 9, tail pattern <b,c,d>: 7, mid pattern <b,c>: 8.
The number of entries in the tail array is the number of occurrences of the mid pattern <b,c> in the column sequence, i.e. 8.

Tail array, 1st slot
The 1st slot concerns the 1st occurrence of <b,c> according to the lexicographic order (positions 1 and 6).
The number of times "d" appears after the 1st occurrence of <b,c> in the sequence is 2, so the slot holds 2.
Transformed sequence dataset The 2nd occurrence of <b,c> according to the lexicographic order HT arrays Column Sequence < b,a,d,b,a,c,a,b,d,c,d,c> r1 Pattern P 1 2 <a, b, c, d> 1 2 3 4 3 6 7 8 9 5 10 11 r1 < b,a,d,b,a,c,a,b,d,c,d,c> r2 <d,d,d,a,a,b,b,c,c,b,c,a> r3 <b,b,a,a,d,a,b,c,d,d,c,c> R4 <d,d,d,a,a,b,c,c,a,b,b,c> 12 Tail pattern Mid pattern <a, b, c> <b, c, d> <b, c> 9 7 8 2 2 2 1 Head array 5 Head pattern Patterns already computed Number of occurrences of the pattern in row 1 (i.e. sni(P) ) HT-arrays 4 Column Sequences The number of times “d” appears after Concerns the head sub-pattern (i.e. P[1…k-1]). The 2nd occurrence of <b,c> in the sequence It contains r(P[1]) entries The l-th entry records the number of times P[2…k-1] appear after the l-th occurrence of P[1] in the sequence. Tail array Concerns the tail sub-pattern (i.e. P[2…k]). It consists of sni(P[2,,,k-1]) entries. The l-th entry records the number of times P[k] appears after the l-th occurrence of P[2…k-1] in the sequence, where the occurrences are in lexicographic order according to the positions of the occurrences. Transformed sequence dataset The 3rd occurrence of <b,c> according to the lexicographic order HT arrays Column Sequence < b,a,d,b,a,c,a,b,d,c,d,c> r1 Pattern P 1 2 <a, b, c, d> 1 2 3 4 3 6 7 8 9 5 10 11 r1 < b,a,d,b,a,c,a,b,d,c,d,c> r2 <d,d,d,a,a,b,b,c,c,b,c,a> r3 <b,b,a,a,d,a,b,c,d,d,c,c> R4 <d,d,d,a,a,b,c,c,a,b,b,c> 12 Tail pattern Mid pattern <a, b, c> <b, c, d> <b, c> 9 7 8 2 2 2 1 0 Head array 5 Head pattern Patterns already computed Number of occurrences of the pattern in row 1 (i.e. sni(P) ) HT-arrays 4 Column Sequences The number of times “d” appears after Concerns the head sub-pattern (i.e. P[1…k-1]). The 3rd occurrence of <b,c> in the sequence It contains r(P[1]) entries The l-th entry records the number of times P[2…k-1] appear after the l-th occurrence of P[1] in the sequence. Tail array Concerns the tail sub-pattern (i.e. P[2…k]). It consists of sni(P[2,,,k-1]) entries. 
The l-th entry records the number of times P[k] appears after the l-th occurrence of P[2…k-1] in the sequence, where the occurrences are in lexicographic order according to the positions of the occurrences. Transformed sequence dataset The 4th occurrence of <b,c> according to the lexicographic order HT arrays Column Sequence < b,a,d,b,a,c,a,b,d,c,d,c> r1 Pattern P 1 2 <a, b, c, d> 1 2 3 4 3 6 7 8 9 5 10 11 r1 < b,a,d,b,a,c,a,b,d,c,d,c> r2 <d,d,d,a,a,b,b,c,c,b,c,a> r3 <b,b,a,a,d,a,b,c,d,d,c,c> R4 <d,d,d,a,a,b,c,c,a,b,b,c> 12 Tail pattern Mid pattern <a, b, c> <b, c, d> <b, c> 9 7 8 2 2 2 1 0 2 Head array 5 Head pattern Patterns already computed Number of occurrences of the pattern in row 1 (i.e. sni(P) ) HT-arrays 4 Column Sequences The number of times “d” appears after Concerns the head sub-pattern (i.e. P[1…k-1]). The 4th occurrence of <b,c> in the sequence It contains r(P[1]) entries The l-th entry records the number of times P[2…k-1] appear after the l-th occurrence of P[1] in the sequence. Tail array Concerns the tail sub-pattern (i.e. P[2…k]). It consists of sni(P[2,,,k-1]) entries. The l-th entry records the number of times P[k] appears after the l-th occurrence of P[2…k-1] in the sequence, where the occurrences are in lexicographic order according to the positions of the occurrences. Transformed sequence dataset Column Sequences HT arrays Column Sequence < b,a,d,b,a,c,a,b,d,c,d,c> r1 Pattern P 1 <a, b, c, d> 1 2 3 4 2 3 5 6 7 8 9 Head pattern Patterns already computed Number of occurrences of the pattern in row 1 (i.e. sni(P) ) HT-arrays 4 5 10 11 r1 < b,a,d,b,a,c,a,b,d,c,d,c> r2 <d,d,d,a,a,b,b,c,c,b,c,a> r3 <b,b,a,a,d,a,b,c,d,d,c,c> R4 <d,d,d,a,a,b,c,c,a,b,b,c> 12 Tail pattern Mid pattern <a, b, c> <b, c, d> <b, c> 9 7 8 2 2 2 1 0 2 1 0 1 0 There are 5 <b,c>’s appear after The 1st occurrence of "a" in the sequence. It is sufficient to calculate the fractional support of pattern P in r1 solely from its Head and Tail arrays. 1. 
How many <b,c> appear after the 1st “a” in the sequence? 2. How many “d” appear after each of the 5 <b,c> ? 5. Transformed sequence dataset Column Sequences HT arrays Column Sequence < b,a,d,b,a,c,a,b,d,c,d,c> r1 Pattern P 1 <a, b, c, d> 1 2 3 4 2 3 5 6 7 8 9 Head pattern Patterns already computed Number of occurrences of the pattern in row 1 (i.e. sni(P) ) HT-arrays 4 5 10 11 r1 < b,a,d,b,a,c,a,b,d,c,d,c> r2 <d,d,d,a,a,b,b,c,c,b,c,a> r3 <b,b,a,a,d,a,b,c,d,d,c,c> R4 <d,d,d,a,a,b,c,c,a,b,b,c> 12 Tail pattern Mid pattern <a, b, c> <b, c, d> <b, c> 9 7 8 2 2 There are 5 <b,c>’s appear after The 1st occurrence of "a" in the sequence. 2 1 0 2 1 0 1 0 The 5 <b,c>’s must correspond to the last five <b,c> according to the lexicographical order. It is sufficient to calculate the fractional support of pattern P in r1 solely from its Head and Tail arrays. 1. How many <b,c> appear after the 1st “a” in the sequence? 5. 2. How many “d” appear after each of the 5 <b,c> ? 2+1+0+1+0 = 4. There are 4 <a,b,c,d> occurrences from the 1st “a” in the sequence. Transformed sequence dataset Column Sequences HT arrays Column Sequence < b,a,d,b,a,c,a,b,d,c,d,c> r1 Pattern P 1 <a, b, c, d> 1 2 3 4 2 3 5 6 7 8 9 Head pattern Patterns already computed Number of occurrences of the pattern in row 1 (i.e. sni(P) ) HT-arrays 4 5 10 11 r1 < b,a,d,b,a,c,a,b,d,c,d,c> r2 <d,d,d,a,a,b,b,c,c,b,c,a> r3 <b,b,a,a,d,a,b,c,d,d,c,c> R4 <d,d,d,a,a,b,c,c,a,b,b,c> 12 Tail pattern Mid pattern <a, b, c> <b, c, d> <b, c> 9 7 8 2 2 There are 5 <b,c>’s appear after The 1st occurrence of "a" in the sequence. 2 1 0 2 1 0 1 0 The 5 <b,c>’s must correspond to the last five <b,c> according to the lexicographical order. The number of occurrences of <a,b,c,d> in the entire sequence : 4 Transformed sequence dataset Column Sequences HT arrays Column Sequence < b,a,d,b,a,c,a,b,d,c,d,c> r1 Pattern P 1 <a, b, c, d> 1 2 3 4 2 3 5 6 7 8 9 Head pattern Patterns already computed Number of occurrences of the pattern in row 1 (i.e. 
sni(P) ) HT-arrays 4 5 10 11 r1 < b,a,d,b,a,c,a,b,d,c,d,c> r2 <d,d,d,a,a,b,b,c,c,b,c,a> r3 <b,b,a,a,d,a,b,c,d,d,c,c> R4 <d,d,d,a,a,b,c,c,a,b,b,c> 12 Tail pattern Mid pattern <a, b, c> <b, c, d> <b, c> 9 7 8 2 2 There are 2<b,c>’s appear after The 2nd occurrence of "a" in the sequence. 2 1 0 2 1 0 1 0 The 2 <b,c>’s must correspond to the last two <b,c> according to the lexicographical order. The number of occurrences of <a,b,c,d> in the entire sequence : 4 + 1 Transformed sequence dataset Column Sequences HT arrays Column Sequence < b,a,d,b,a,c,a,b,d,c,d,c> r1 Pattern P 1 <a, b, c, d> 1 2 3 4 2 3 5 6 7 8 9 Head pattern Patterns already computed Number of occurrences of the pattern in row 1 (i.e. sni(P) ) HT-arrays 4 5 10 11 r1 < b,a,d,b,a,c,a,b,d,c,d,c> r2 <d,d,d,a,a,b,b,c,c,b,c,a> r3 <b,b,a,a,d,a,b,c,d,d,c,c> R4 <d,d,d,a,a,b,c,c,a,b,b,c> 12 Tail pattern Mid pattern <a, b, c> <b, c, d> <b, c> 9 7 8 2 2 There are 2 <b,c>’s appear after The 3rd occurrence of "a" in the sequence. 2 1 0 2 1 0 1 0 The 2 <b,c>’s must correspond to the last two <b,c> according to the lexicographical order. The number of occurrences of <a,b,c,d> in the entire sequence : 4 + 1 + 1 = 6 Transformed sequence dataset Column Sequences HT arrays Column Sequence < b,a,d,b,a,c,a,b,d,c,d,c> r1 Pattern P 1 <a, b, c, d> 1 2 3 4 2 3 5 6 7 8 9 Head pattern Patterns already computed Number of occurrences of the pattern in row 1 (i.e. sni(P) ) HT-arrays 4 5 10 11 r1 < b,a,d,b,a,c,a,b,d,c,d,c> r2 <d,d,d,a,a,b,b,c,c,b,c,a> r3 <b,b,a,a,d,a,b,c,d,d,c,c> R4 <d,d,d,a,a,b,c,c,a,b,b,c> 12 Tail pattern Mid pattern <a, b, c> <b, c, d> <b, c> 9 7 8 2 2 There are 5 <b,c>’s appear after The 3rd occurrence of "a" in the sequence. 2 1 0 2 1 0 1 0 The 2 <b,c>’s must correspond to the last two <b,c> according to the lexicographical order. 
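The r1 walkthrough can be checked with a small brute-force sketch. The function names (`occurrences`, `count_after`, `ht_arrays`, `ht_sum`) are illustrative and not taken from the presentation, and the enumeration is deliberately naive: it only illustrates what the Head and Tail arrays contain, not how an efficient miner would build them.

```python
def occurrences(seq, pattern):
    """Position tuples (1-based) at which pattern occurs as a subsequence
    of seq, listed in lexicographic order of the positions."""
    found = []

    def extend(start, depth, chosen):
        if depth == len(pattern):
            found.append(tuple(chosen))
            return
        for pos in range(start, len(seq)):
            if seq[pos] == pattern[depth]:
                extend(pos + 1, depth + 1, chosen + [pos + 1])

    extend(0, 0, [])
    return found


def count_after(seq, pattern, position):
    """Number of occurrences of pattern lying strictly after a 1-based position."""
    return len(occurrences(seq[position:], pattern))


def ht_arrays(seq, pattern):
    """Head and Tail arrays of pattern in seq, as defined on the slides."""
    first, mid, last = pattern[0], pattern[1:-1], pattern[-1]
    # Head: for each occurrence of P[1], how many mid patterns follow it.
    head = [count_after(seq, mid, p[0]) for p in occurrences(seq, first)]
    # Tail: for each occurrence of the mid pattern, how many P[k] follow it.
    tail = [count_after(seq, last, p[-1]) for p in occurrences(seq, mid)]
    return head, tail


def ht_sum(head, tail):
    """For each head entry h, add up the last h entries of the tail array."""
    return sum(sum(tail[len(tail) - h:]) for h in head)


r1 = "badbacabdcdc"          # column sequence of row r1
head, tail = ht_arrays(r1, "abcd")
print(head)                  # [5, 2, 2]
print(tail)                  # [2, 1, 0, 2, 1, 0, 1, 0]
print(ht_sum(head, tail))    # 6
```

The HT-sum agrees with a direct count of the occurrences of <a,b,c,d> in r1, which is what makes the two arrays sufficient on their own.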
Efficient Mining Methods: HTBound

HT Bound
Pattern P = <a,b,c,d>
Patterns already computed, with their number of occurrences in row 1 (i.e. sni):
Head pattern <a,b,c>: 9
Tail pattern <b,c,d>: 7
Mid pattern <b,c>: 8
HT-arrays of r1: H = <5, 2, 2>, T = <2, 1, 0, 2, 1, 0, 1, 0>

Storing the HT-arrays is impractical: the number of entries in the T-array is the number of occurrences of the mid pattern, which, in the worst case, is exponential in the pattern's length.
It is possible to compute an upper bound of the HT-sum by storing only 3 numbers, without ever constructing or storing the HT-arrays.
Our solution is to provide a way to construct two arrays H* and T* from the 3 numbers such that the following is always true:
HT-sum(H, T) ≤ HT-sum(H*, T*)
Challenge: we cannot reconstruct the real HT-arrays solely from the 3 numbers.

H*T*-arrays for the upper bound of the HT-sum

                          H*                     T*
Size of the array         r(a) = 3 slots         sni(<b,c>) = 8 slots
Maximum value per slot    sni(<b,c>) = 8         r(d) = 3
Sum over all slots        sni(<a,b,c>) = 9       sni(<b,c,d>) = 7

Important properties
Head array (1st entry: number of <b,c> after the 1st "a"): the values are non-increasing.
Tail array (1st entry: number of "d" after the 1st <b,c>):
1. There are r(P[2]) = r(b) = 3 intervals such that the values within each interval are non-increasing. In the example the intervals are <2,1,0>, <2,1,0> and <1,0>.
2. The interval averages are non-increasing.

Construction
H* (push): the sum 9 is pushed into the first slots, at most 8 per slot, giving H* = <8, 1, 0>.
T* (assign): the sum 7 is assigned across the 8 slots, giving seven slots of value 1 and one slot of value 0.

HT-sum(H, T) = 6
HT-sum(H*, T*) = 8
HT-sum(H, T) ≤ HT-sum(H*, T*)

We do not have to materialize the H* and T* arrays to compute HT-sum(H*, T*).
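The push and assign steps can be sketched as follows. This is one possible reading of the slide, not the authors' implementation. The helper names are hypothetical. Filling H* greedily from the first slot is an assumption. So is placing the smaller T* values first, an ordering chosen here only because it reproduces the reported HT-sum(H*, T*) = 8. The even assignment never exceeds the per-slot maximum r(d), as long as the sum is at most the number of slots times r(d).

```python
def ht_sum(head, tail):
    """For each head entry h, add up the last h entries of the tail array."""
    return sum(sum(tail[len(tail) - h:]) for h in head)


def push(total, slots, cap):
    """Assumed 'push' step: fill slots from the left, each up to cap,
    until total is used up."""
    out = []
    for _ in range(slots):
        value = min(cap, total)
        out.append(value)
        total -= value
    return out


def assign_evenly(total, slots):
    """Assumed 'assign' step: spread total over the slots as evenly as
    possible. Smaller values are placed first (an assumption)."""
    low, extra = divmod(total, slots)
    return [low] * (slots - extra) + [low + 1] * extra


# The 3 stored numbers for P = <a,b,c,d> in r1, plus r(a) = 3 replicates.
sn_head, sn_tail, sn_mid = 9, 7, 8
r_a = 3

h_star = push(sn_head, slots=r_a, cap=sn_mid)   # [8, 1, 0]
t_star = assign_evenly(sn_tail, slots=sn_mid)   # [0, 1, 1, 1, 1, 1, 1, 1]
bound = ht_sum(h_star, t_star)                  # 8

actual = ht_sum([5, 2, 2], [2, 1, 0, 2, 1, 0, 1, 0])   # 6, from the real HT-arrays
print(h_star, t_star, bound, actual)
```

The arrays are materialized here only to make the example easy to inspect. The point of the slide is that the same value can be computed from the three stored numbers alone.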
               Number of slots in the partition              Value in each slot in the partition
Partition 1    sni(<a,b,c>) mod sni(<b,c>)                   ⌈sni(<a,b,c>) / sni(<b,c>)⌉
Partition 2    sni(<b,c>) - (sni(<a,b,c>) mod sni(<b,c>))    ⌊sni(<a,b,c>) / sni(<b,c>)⌋

HT-sum(H, T) ≤ HT-sum(H*, T*)

Experimental Evaluation

Experimental settings
Implementation: C programming language
Machine: CPU 2.6 GHz, memory 1 GB, Fedora

Dataset
Real dataset: yeast galactose dataset
Subset of 205 genes (rows)
20 experimental conditions (columns)
4 biological replicates per condition
Publicly available: http://expression.microslu.washington.edu/expression/kayee/medvedovic2003/medvedovic_bioinf2003.html

Synthetic dataset
We model the values by a Gaussian distribution with mean and variance equal to the sample mean and variance of the 4 replicates.
Replicates simulation: expression values of new replicates are sampled from the fitted Gaussian.
Columns simulation: new columns are synthesized by randomly drawing an existing column, discarding its expression values but keeping the fitted Gaussians, and sampling new values from them.
Rows simulation: new rows are synthesized in the same way as new columns, but with an existing row as the template instead of a column.

Speed performance in different iterations
[Figure: Running time (5000 rows, 20 cols, 4 rep / c, ρ=0.2)]
HTBound is very effective in reducing the mining time by pruning infrequent candidates.
The pruning effectiveness is most pronounced in iteration 4, in which the number of candidates is the highest.
[Figure: Number of candidates generated (5000 rows, 20 cols, 4 rep / c, ρ=0.2)]
The number of frequent patterns gives the minimum number of candidates that any pruning technique can generate.
The number of unpruned candidates under HTBound is very close to the actual number of frequent patterns. In particular, HTBound prunes 94% of all infrequent candidates generated by Basic.

Scalability w.r.t. support threshold
[Figures: Running time; Running time (in % of Basic)]
HTBound achieved the greatest saving in all settings.
At higher support thresholds, the two bounds prune more candidates, because more upper bounds fall below the minimum support count when the latter is higher.

Scalability w.r.t. number of rows
[Figures: Running time; Running time (in % of Basic)]

Scalability w.r.t. number of columns
[Figures: Running time; Running time (in % of Basic)]

Scalability w.r.t. number of replicates
[Figures: Running time; Running time (in % of Basic)]

Conclusion
We have described the problem that high noise levels pose to the mining of OPSMs, and discussed how it can be alleviated by exploiting repeated measurements.
Proposed a number of efficient mining techniques to speed up the mining process.
Performed experiments on real microarray data to show the usability of the proposed "fractional support" model.
Demonstrate the effectiveness of the pruning methods.
Demonstrate