Mining Order-Preserving Submatrices from Data with Repeated Measurements
Ben Kao, Kevin Y. Yip, Sau Dan Lee, Chun Kit Chui
ICDM 2008
Presentation Outline
- The traditional order-preserving submatrix (OPSM) mining problem
- Mining OPSMs from data with repeated measurements (OPSM-RM)
  - Basic algorithm
  - Efficient mining methods
    - MinBound
    - HTBound
- Experimental results
The traditional OPSM mining problem
Preliminaries
Order-preserving Submatrices
Matrix of numerical data values (plotted on the slide as one line per row; no obvious patterns observed)

      C1  C2  C3  C4  C5  C6  C7  C8
R1    36  32  12  19  18  42  33   8
R2    11  22  33  24  30   3   9  23
R3    14  18  48  28  38  11  33  21
R4    20  14   5  10   7  24  44  13
R5    38  25  10  24  19  39   8  22

- The order-preserving submatrix (OPSM) problem is a pattern-based subspace clustering model that applies to a matrix of numerical data values.
- Objective: to discover a subset of attributes (columns) over which a subset of tuples (rows) exhibit a similar pattern of rises and falls in the tuples' values.
Order-preserving Submatrices
Order-preserving submatrix: rows R1, R4 and R5 over the reordered columns C3, C5, C4, C2, C1, C6

      C3  C5  C4  C2  C1  C6
R1    12  18  19  32  36  42
R4     5   7  10  14  20  24
R5    10  19  24  25  38  39

- The values of these rows are increasing w.r.t. the column order, so R1, R4 and R5 show concurrent rising patterns when plotted.
- No obvious patterns are observed in the plot of the full data matrix.
Order-preserving Submatrices
Application: mining gene expression datasets.
- Rows are genes, columns are experimental conditions, and each entry is an expression value.
- Genes that exhibit simultaneous rises and falls of their expression values across different experiments reveal interesting patterns and knowledge.
- Biologists would like to identify sets of genes that are functionally related.
- OPSMs suggest candidate sets of genes that are of interest to them.
  - Costly small-scale tests are then performed on the mined OPSMs.
- Requirement: as few false positive results as possible.
Order-preserving Submatrices
- Given a data matrix M with n rows and m columns, an OPSM S consists of
  - a subset of rows R, e.g. R = {R1, R4, R5};
  - a permutation of a subset of columns (the pattern) P, e.g. P = <C3,C5,C4,C2,C1,C6>.
- The entries of all rows in R are monotonically increasing w.r.t. P.
- Mining OPSMs: find all OPSMs whose number of rows (the support) is greater than or equal to a user-specified threshold (frequent).

Matrix of numerical data values

      C1  C2  C3  C4  C5  C6  …  Cm
R1    36  32  12  19  18  42  …   8
R2    11  22  33  24  30   3  …  23
R3     3  25  31  22  11   4  …  26
R4    20  14   5  10   7  24  …  13
R5    38  25  10  24  19  39  …  22
…      …   …   …   …   …   …  …   …
Rn     …

Order-preserving submatrix (a subset of rows over a reordered subset of columns, i.e. experimental conditions)

      C3  C5  C4  C2  C1  C6
R1    12  18  19  32  36  42
R4     5   7  10  14  20  24
R5    10  19  24  25  38  39
Order-preserving Submatrices
Apriori property
- Let P1 and P2 be two patterns such that P1 is a subsequence of P2. The support of P2 must be no greater than the support of P1.
- E.g. if <a,b> is infrequent, then <a,b,c>, <a,b,c,d> and <c,a,b,d> are all infrequent.
With the Apriori property, we can adopt an iterative candidate set generation-and-test mining framework to prune the search space:
- Size-2 OPSM candidates: start from mining size-2 OPSMs, because size-1 OPSMs do not have any "orderings".
- Support counting: the support counting module verifies the number of supporting rows of the candidate OPSMs.
- Frequent size-k OPSMs: those OPSMs with #supporting rows ≥ ρ are frequent; they are passed into the candidate generation module.
- OPSM candidate generation: according to the Apriori property, we only generate those size-(k+1) candidates whose size-k sub-patterns are all frequent. The size-(k+1) candidate OPSMs go back to support counting.
- The algorithm terminates when no more candidates are generated.
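The generation-and-test loop above can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: the names `supports` and `mine_opsms` are hypothetical, candidates are formed by appending one column to a frequent pattern (an assumption about how candidates are enumerated), and the data is the 5 x 8 example matrix from the earlier slides.

```python
from itertools import combinations, permutations

COLUMNS = ["C1", "C2", "C3", "C4", "C5", "C6", "C7", "C8"]
VALUES = {
    "R1": [36, 32, 12, 19, 18, 42, 33, 8],
    "R2": [11, 22, 33, 24, 30, 3, 9, 23],
    "R3": [14, 18, 48, 28, 38, 11, 33, 21],
    "R4": [20, 14, 5, 10, 7, 24, 44, 13],
    "R5": [38, 25, 10, 24, 19, 39, 8, 22],
}
MATRIX = {row: dict(zip(COLUMNS, vals)) for row, vals in VALUES.items()}


def supports(row, pattern):
    """True if the row's values strictly increase along the column pattern."""
    return all(row[a] < row[b] for a, b in zip(pattern, pattern[1:]))


def mine_opsms(matrix, columns, min_support):
    """Level-wise mining; returns {pattern: list of supporting rows}."""
    # Size-2 candidates: every ordered pair of distinct columns.
    candidates = list(permutations(columns, 2))
    frequent = {}
    while candidates:
        size = len(candidates[0])
        # Support counting: keep candidates with enough supporting rows.
        level = {}
        for pattern in candidates:
            rows = [r for r, vals in matrix.items() if supports(vals, pattern)]
            if len(rows) >= min_support:
                level[pattern] = rows
        frequent.update(level)
        # Candidate generation: append one column, and keep the result only
        # if every sub-pattern with one column removed is frequent (Apriori).
        candidates = [
            pattern + (col,)
            for pattern in level
            for col in columns
            if col not in pattern
            and all(sub in level for sub in combinations(pattern + (col,), size))
        ]
    return frequent


FREQUENT = mine_opsms(MATRIX, COLUMNS, min_support=3)
print(FREQUENT[("C3", "C5", "C4", "C2", "C1", "C6")])  # ['R1', 'R4', 'R5']
```

With a support threshold of 3 rows, the sketch recovers the example pattern <C3,C5,C4,C2,C1,C6> together with its supporting rows R1, R4 and R5.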
Mining OPSMs from Data with Repeated Measurements
Fractional support
Problem motivation
- A main drawback of the basic OPSM mining problem is that it is very sensitive to noisy data.
  - In microarray experiments in particular, each value in the dataset is a physical measurement that is subject to different kinds of errors.
- To combat errors, experiments are often repeated and multiple measured values (replicates) are recorded.
  - The replicates allow a better estimate of the actual physical quantity.

An example dataset in which each column has 3 replicates (e.g. the experiment in column "a" is repeated 3 times, generating 3 sub-columns a1, a2, a3):

A dataset without repeated measurements

       a    b    c    d
r1    49   38  115   82
r2    67   96  124   48
r3    65   67  132   95
r4    81  115  133   62

A dataset with repeated measurements

      a1   a2   a3   b1   b2   b3   c1   c2   c3   d1   d2   d3
r1    49   55   80   38   51   81  115  101   79   82  110   50
r2    67   54  130   96   85   82  124   92   94   48   37   32
r3    65   49   62   67   39   28  132  119   83   95   89   64
r4    81   83  105  115  110   87  133  108  105   62   52   51
Problem motivation
- The original OPSM definition is not robust against noisy data. By definition, a row either supports a pattern or does not support it.
  - It fails to take advantage of the additional information provided by data replicates.
- There is a strong motivation to revise the definition of OPSM to handle repeated measurements.
- E.g. in row r3 of the dataset with repeated measurements, (a1, b1) = (65, 67) supports pattern <a,b>, but (a2, b2) = (49, 39) does not support pattern <a,b>.
Problem motivation
- The fractional support si(P) of a pattern P contributed by a row i is the number of replicate combinations of row i that support the pattern, divided by the total number of replicate combinations of the columns in P.

Fractional support of P = <a,b,d> in row 1 of the dataset with repeated measurements (a = 49, 55, 80; b = 38, 51, 81; d = 82, 110, 50):
- Total number of replicate combinations: sd1(P) = 3 * 3 * 3 = 27
- Number of replicate combinations that support the pattern: sn1(P) = 8, namely
  <a1,b2,d1>, <a1,b2,d2>, <a1,b3,d1>, <a1,b3,d2>, <a2,b3,d1>, <a2,b3,d2>, <a3,b3,d1>, <a3,b3,d2>
- Fractional support of the pattern: s1(P) = sn1(P) / sd1(P) = 8/27

The fractional support satisfies the following requirements:
- Requirement 1: if all replicate combinations of a row support a certain pattern, the row strongly supports the pattern.
- Requirement 2: if only a fraction of the replicate combinations support a pattern, the resulting fractional support is fuzzy (away from 0 and 1), which reflects the uncertainty.
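The definition can be checked by brute force. The sketch below enumerates every replicate combination of row 1 and counts those whose values increase along the pattern; `fractional_support` and `ROW1` are hypothetical names chosen for illustration, and the values are those of row r1 in the dataset with repeated measurements.

```python
from fractions import Fraction
from itertools import product

# Row r1 of the dataset with repeated measurements: column -> replicates.
ROW1 = {
    "a": [49, 55, 80],
    "b": [38, 51, 81],
    "c": [115, 101, 79],
    "d": [82, 110, 50],
}


def fractional_support(row, pattern):
    """Supporting replicate combinations divided by all combinations."""
    combos = list(product(*(row[col] for col in pattern)))
    supporting = sum(
        all(x < y for x, y in zip(combo, combo[1:])) for combo in combos
    )
    return Fraction(supporting, len(combos))


# The eight supporting combinations of <a,b,d> out of 27.
print(fractional_support(ROW1, "abd"))  # 8/27
```

Enumerating all combinations is exponential in the pattern length, so this only serves to make the definition concrete on a small example.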
Problem Definition
- The total fractional support of a pattern P (or simply the support of P), s(P), is defined as the sum of the fractional supports of P contributed by all rows.
- A pattern P is frequent if its total fractional support is not less than a given support threshold ρ.
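As a small worked instance of this definition, the sketch below sums the per-row fractional supports over the four example rows and compares the total with a threshold. The function names and the thresholds used in the example calls are illustrative assumptions, not values from the slides.

```python
from fractions import Fraction
from itertools import product

# The example dataset with repeated measurements: row -> column -> replicates.
DATASET = {
    "r1": {"a": [49, 55, 80], "b": [38, 51, 81], "c": [115, 101, 79], "d": [82, 110, 50]},
    "r2": {"a": [67, 54, 130], "b": [96, 85, 82], "c": [124, 92, 94], "d": [48, 37, 32]},
    "r3": {"a": [65, 49, 62], "b": [67, 39, 28], "c": [132, 119, 83], "d": [95, 89, 64]},
    "r4": {"a": [81, 83, 105], "b": [115, 110, 87], "c": [133, 108, 105], "d": [62, 52, 51]},
}


def row_support(row, pattern):
    """Fractional support of the pattern contributed by one row."""
    combos = list(product(*(row[col] for col in pattern)))
    hits = sum(all(x < y for x, y in zip(c, c[1:])) for c in combos)
    return Fraction(hits, len(combos))


def total_support(dataset, pattern):
    """Sum of the fractional supports contributed by all rows."""
    return sum(row_support(row, pattern) for row in dataset.values())


def is_frequent(dataset, pattern, rho):
    """A pattern is frequent if its total support is not less than rho."""
    return total_support(dataset, pattern) >= rho


# Rows contribute 4/9, 6/9, 3/9 and 8/9 to <a,b>.
print(total_support(DATASET, "ab"))  # 7/3
print(is_frequent(DATASET, "ab", rho=2))  # True
```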
Mining OPSMs from Data with Repeated Measurements
Mining framework: size-2 OPSM candidates → support counting → frequent size-k OPSMs → candidate pattern generation → size-(k+1) candidate OPSMs → support counting → …
- Problem 1 (candidate pattern generation): combinatorial explosion of the number of candidates generated in each iteration.
  - Solution: for each generated pattern, obtain an upper bound of its fractional support. If the upper bound is below the support threshold, we can prune the candidate.
- Problem 2 (support counting): support counting requires a computationally expensive subsequence counting step.
  - Solution: maintain a data structure for efficient support counting, e.g. a head-tail tree, or a run-length encoding of the transformed sequences of the dataset.
Efficient Mining Methods
- MinBound (naïve)
- HTBound
Scenario
- In iteration k, each length-k candidate pattern P is generated from two length-(k-1) frequent patterns P1 and P2, where P1 and P2 are subsequences of P.
  - E.g. P = <a,b,c>, P1 = <a,b> and P2 = <b,c>.
Bounding techniques
- Given a length-k pattern P, we want to obtain an upper bound of s(P) before the support counting step.
- This is done by obtaining an upper bound of si(P) for every row i.
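One way to read the example P = <a,b,c>, P1 = <a,b>, P2 = <b,c> is as a join of two patterns that overlap in all but their first and last columns. The sketch below implements that reading; the overlap condition and the name `join_candidates` are assumptions made for illustration, since the slide only gives the single example.

```python
def join_candidates(frequent_patterns):
    """Join P1 and P2 (tuples of equal length) into P1 + last column of P2.

    Assumed join condition: P1 without its first column equals P2 without
    its last column, and the new last column does not already occur in P1.
    """
    patterns = sorted(set(frequent_patterns))
    candidates = []
    for p1 in patterns:
        for p2 in patterns:
            if p1[1:] == p2[:-1] and p2[-1] not in p1:
                candidates.append(p1 + (p2[-1],))
    return candidates


print(join_candidates([("a", "b"), ("b", "c")]))  # [('a', 'b', 'c')]
```

Under this reading, P1 is the head P[1…k-1] and P2 is the tail P[2…k] of the generated pattern.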
MinBound
- Input: si(P1) and si(P2), where P1 and P2 are subsequences of P, and P is generated by joining P1 and P2.
- MinBound: si(P) ≤ min{si(P1), si(P2)} for every row i, so s(P) ≤ Σi min{si(P1), si(P2)}.
- The sum of minimums is no larger than the minimum of sums, so this bound is at least as tight as min{s(P1), s(P2)}.
Transformed sequence dataset

A dataset with repeated measurements

      a1   a2   a3   b1   …
r1    49   55   80   38   …
r2    67   54  130   96   …
r3    65   49   62   67   …
r4    81   83  105  115   …

Column sequences

r1    <b,a,d,b,a,c,a,b,d,c,d,c>
r2    <d,d,d,a,a,b,b,c,c,b,c,a>
r3    <b,b,a,a,d,a,b,c,d,d,c,c>
r4    <d,d,d,a,a,b,c,c,a,b,b,c>

- Before the mining step, we preprocess the data matrix by sorting the entries of each row in ascending order of their values and replacing each entry with its column label.
- To determine the fractional support of a row for a pattern, we determine the number of occurrences of the pattern (as a subsequence) in the row sequence.
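The preprocessing and the counting step can be sketched as follows. The function names are hypothetical, and the dynamic-programming counter is a standard way to count subsequence occurrences rather than the data structure used by the authors; ties between equal values are broken by input order, which is an assumption.

```python
def to_column_sequence(row):
    """Sort all replicate values of a row and keep only the column labels."""
    entries = [(value, col) for col, values in row.items() for value in values]
    entries.sort(key=lambda entry: entry[0])  # stable sort: ties keep input order
    return [col for _, col in entries]


def count_occurrences(sequence, pattern):
    """Number of occurrences of pattern as a subsequence of sequence."""
    # counts[j] = ways to match the first j pattern items seen so far.
    counts = [1] + [0] * len(pattern)
    for label in sequence:
        # Go right to left so each sequence item extends only older matches.
        for j in range(len(pattern), 0, -1):
            if pattern[j - 1] == label:
                counts[j] += counts[j - 1]
    return counts[-1]


ROW1 = {"a": [49, 55, 80], "b": [38, 51, 81], "c": [115, 101, 79], "d": [82, 110, 50]}
SEQ1 = to_column_sequence(ROW1)
print("".join(SEQ1))                    # badbacabdcdc
print(count_occurrences(SEQ1, "abc"))   # 9
print(count_occurrences(SEQ1, "abcd"))  # 6, i.e. 6/81 = 2/27 of all combinations
```

Dividing the count by the product of the replicate counts of the pattern's columns gives the fractional support of the row.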
MinBound
Example for row r1, with column sequence <b,a,d,b,a,c,a,b,d,c,d,c>:

Pattern P        s1(P)
<a, b, c>        = 9/27
<b, c, d>        = 7/27
<a, b, c, d>     ≤ min{9/27, 7/27} = 7/27

Notice that the true value of s1(<a,b,c,d>) is 2/27.
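The bound can be reproduced on the four column sequences. This is a minimal sketch under the assumption that every column has 3 replicates; `row_support` and `min_bound` are illustrative names, and the per-row supports are recomputed here rather than looked up from a previous iteration as a real miner would do.

```python
from fractions import Fraction

SEQUENCES = {
    "r1": "badbacabdcdc",
    "r2": "dddaabbccbca",
    "r3": "bbaadabcddcc",
    "r4": "dddaabccabbc",
}
REPLICATES = 3  # assumed number of replicates per column


def count_occurrences(sequence, pattern):
    """Number of occurrences of pattern as a subsequence of sequence."""
    counts = [1] + [0] * len(pattern)
    for label in sequence:
        for j in range(len(pattern), 0, -1):
            if pattern[j - 1] == label:
                counts[j] += counts[j - 1]
    return counts[-1]


def row_support(sequence, pattern):
    """Fractional support: occurrences over all replicate combinations."""
    return Fraction(count_occurrences(sequence, pattern), REPLICATES ** len(pattern))


def min_bound(sequences, p1, p2):
    """Upper bound of s(P): sum over rows of min{si(P1), si(P2)}."""
    return sum(
        min(row_support(seq, p1), row_support(seq, p2)) for seq in sequences.values()
    )


r1 = SEQUENCES["r1"]
print(min(row_support(r1, "abc"), row_support(r1, "bcd")))  # 7/27
print(row_support(r1, "abcd"))                              # 2/27
```

The gap between 7/27 and 2/27 for row r1 shows how loose the per-row minimum can be, which is what motivates a tighter bound.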
Efficient Mining Methods
- MinBound (naïve)
- HTBound
HT arrays
Column sequences of the transformed sequence dataset

r1    <b,a,d,b,a,c,a,b,d,c,d,c>
r2    <d,d,d,a,a,b,b,c,c,b,c,a>
r3    <b,b,a,a,d,a,b,c,d,d,c,c>
r4    <d,d,d,a,a,b,c,c,a,b,b,c>

Pattern P = <a, b, c, d>
- Assume that we generate the pattern <a,b,c,d> in the candidate generation module.
- Our task is to obtain the number of occurrences of <a,b,c,d> in each of the row sequences.
HT arrays
Column sequence of r1 (positions 1 to 12): <b,a,d,b,a,c,a,b,d,c,d,c>
Pattern P = <a, b, c, d>

Number of replicates for each column

Column j                a   b   c   d
Number of replicates    3   3   3   3

Patterns already computed, with the number of occurrences of the pattern in row 1 (i.e. sni(P))

Head pattern    <a, b, c>    9
Tail pattern    <b, c, d>    7

- Since the pattern <a,b,c,d> has been generated, <a,b,c> and <b,c,d> must be frequent, so we can assume that the numbers of occurrences of the head and tail patterns in a given row are readily available.
HT-arrays (row r1, pattern P = <a, b, c, d>)
- P[1] is the 1st item in the pattern (i.e. "a"), and r(a) is the number of replicates of column "a". Therefore, the Head array contains r(a) = 3 slots.
Head array
- Concerns the head sub-pattern (i.e. P[1…k-1]).
- It contains r(P[1]) entries.
- The l-th entry records the number of times P[2…k-1] appears after the l-th occurrence of P[1] in the sequence.
Tail array
- Concerns the tail sub-pattern (i.e. P[2…k]).
- It consists of sni(P[2…k-1]) entries.
- The l-th entry records the number of times P[k] appears after the l-th occurrence of P[2…k-1] in the sequence, where the occurrences are in lexicographic order according to the positions of the occurrences.
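The two arrays can be built directly from these definitions. The sketch below does so by brute force for row r1 and P = <a,b,c,d>; the function names are hypothetical, patterns are assumed to have length k ≥ 3 so that the middle part P[2…k-1] is non-empty, and enumerating every occurrence is only meant to make the definitions concrete, not to be efficient.

```python
from itertools import combinations


def occurrences(sequence, pattern):
    """Position tuples of all occurrences of pattern, in lexicographic order."""
    return [
        positions
        for positions in combinations(range(len(sequence)), len(pattern))
        if all(sequence[i] == item for i, item in zip(positions, pattern))
    ]


def head_array(sequence, pattern):
    """Entry l: occurrences of P[2..k-1] after the l-th occurrence of P[1]."""
    first, mid = pattern[0], pattern[1:-1]
    return [
        len(occurrences(sequence[i + 1:], mid))
        for i, label in enumerate(sequence)
        if label == first
    ]


def tail_array(sequence, pattern):
    """Entry l: occurrences of P[k] after the l-th occurrence of P[2..k-1]."""
    mid, last = pattern[1:-1], pattern[-1]
    return [
        sum(1 for label in sequence[positions[-1] + 1:] if label == last)
        for positions in occurrences(sequence, mid)
    ]


SEQ1 = list("badbacabdcdc")   # column sequence of row r1
PATTERN = list("abcd")
print(head_array(SEQ1, PATTERN))  # [5, 2, 2]
print(tail_array(SEQ1, PATTERN))  # [2, 1, 0, 2, 1, 0, 1, 0]
```

The head entries sum to 9 and the tail entries to 7, which are the numbers of occurrences of the head pattern <a,b,c> and the tail pattern <b,c,d> in row r1.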
HT arrays: Head array, 1st slot
- The 1st slot of the Head array concerns the 1st occurrence of "a" in the column sequence of r1, <b,a,d,b,a,c,a,b,d,c,d,c>.
- How many <b,c> appear after that "a"?
- There are 5 occurrences of <b,c> after the 1st occurrence of "a", so the slot holds the value 5.
HT arrays: Head array, 2nd slot
- The 2nd slot concerns the 2nd occurrence of "a" in the sequence. How many <b,c> appear after that "a"?
- Since there are 2 occurrences of <b,c> after the 2nd occurrence of "a", we have the value 2 in the slot.
- Head array so far: 5, 2.
HT arrays: Head array, 3rd slot
- The 3rd slot concerns the 3rd occurrence of "a" in the sequence. How many <b,c> appear after that "a"?
- There are 2 occurrences of <b,c> after the 3rd occurrence of "a", so the slot holds the value 2.
- Head array: 5, 2, 2.
HT arrays: size of the Tail array
Patterns already computed, with the number of occurrences of the pattern in row 1 (i.e. sni(P))

Head pattern    <a, b, c>    9
Tail pattern    <b, c, d>    7
Mid pattern     <b, c>       8

- Head array: 5, 2, 2.
- The number of entries in the Tail array is the number of occurrences of the mid pattern <b,c> in the column sequence, i.e. 8.
HT arrays: Tail array, 1st slot
- The 1st slot concerns the 1st occurrence of <b,c> according to the lexicographic order of positions in the column sequence of r1, <b,a,d,b,a,c,a,b,d,c,d,c>.
- The slot records the number of times "d" appears after the 1st occurrence of <b,c> in the sequence, which is 2.
- Head array: 5, 2, 2. Tail array so far: 2.
Transformed sequence dataset
The 2nd occurrence of <b,c> according to the
lexicographic order
HT arrays
Column Sequence
< b,a,d,b,a,c,a,b,d,c,d,c>
r1
Pattern P
1
2
<a, b, c, d>
1
2
3
4
3

6
7
8
9
5
10
11
r1
< b,a,d,b,a,c,a,b,d,c,d,c>
r2
<d,d,d,a,a,b,b,c,c,b,c,a>
r3
<b,b,a,a,d,a,b,c,d,d,c,c>
R4
<d,d,d,a,a,b,c,c,a,b,b,c>
12
Tail pattern
Mid pattern
<a, b, c>
<b, c, d>
<b, c>
9
7
8
2
2
2
1
Head array




5
Head pattern
Patterns already
computed
Number of occurrences
of the pattern in row 1
(i.e. sni(P) )
HT-arrays
4
Column Sequences
The number of times “d” appears after
Concerns the head sub-pattern (i.e. P[1…k-1]).
The 2nd occurrence of <b,c> in the sequence
It contains r(P[1]) entries
The l-th entry records the number of times P[2…k-1] appear after the l-th
occurrence of P[1] in the sequence.
Tail array



Concerns the tail sub-pattern (i.e. P[2…k]).
It consists of sni(P[2,,,k-1]) entries.
The l-th entry records the number of times P[k] appears after the l-th occurrence
of P[2…k-1] in the sequence, where the occurrences are in lexicographic order
according to the positions of the occurrences.
Transformed sequence dataset
The 3rd occurrence of <b,c> according to the
lexicographic order
HT arrays
Column Sequence
< b,a,d,b,a,c,a,b,d,c,d,c>
r1
Pattern P
1
2
<a, b, c, d>
1
2
3
4
3

6
7
8
9
5
10
11
r1
< b,a,d,b,a,c,a,b,d,c,d,c>
r2
<d,d,d,a,a,b,b,c,c,b,c,a>
r3
<b,b,a,a,d,a,b,c,d,d,c,c>
R4
<d,d,d,a,a,b,c,c,a,b,b,c>
12
Tail pattern
Mid pattern
<a, b, c>
<b, c, d>
<b, c>
9
7
8
2
2
2
1
0
Head array




5
Head pattern
Patterns already
computed
Number of occurrences
of the pattern in row 1
(i.e. sni(P) )
HT-arrays
4
Column Sequences
The number of times “d” appears after
Concerns the head sub-pattern (i.e. P[1…k-1]).
The 3rd occurrence of <b,c> in the sequence
It contains r(P[1]) entries
The l-th entry records the number of times P[2…k-1] appear after the l-th
occurrence of P[1] in the sequence.
Tail array



Concerns the tail sub-pattern (i.e. P[2…k]).
It consists of sni(P[2,,,k-1]) entries.
The l-th entry records the number of times P[k] appears after the l-th occurrence
of P[2…k-1] in the sequence, where the occurrences are in lexicographic order
according to the positions of the occurrences.
Transformed sequence dataset
The 4th occurrence of <b,c> according to the
lexicographic order
HT arrays
Column Sequence
< b,a,d,b,a,c,a,b,d,c,d,c>
r1
Pattern P
1
2
<a, b, c, d>
1
2
3
4
3

6
7
8
9
5
10
11
r1
< b,a,d,b,a,c,a,b,d,c,d,c>
r2
<d,d,d,a,a,b,b,c,c,b,c,a>
r3
<b,b,a,a,d,a,b,c,d,d,c,c>
R4
<d,d,d,a,a,b,c,c,a,b,b,c>
12
Tail pattern
Mid pattern
<a, b, c>
<b, c, d>
<b, c>
9
7
8
2
2
2
1
0
2
Head array




5
Head pattern
Patterns already
computed
Number of occurrences
of the pattern in row 1
(i.e. sni(P) )
HT-arrays
4
Column Sequences
The number of times “d” appears after
Concerns the head sub-pattern (i.e. P[1…k-1]).
The 4th occurrence of <b,c> in the sequence
It contains r(P[1]) entries
The l-th entry records the number of times P[2…k-1] appear after the l-th
occurrence of P[1] in the sequence.
Tail array



Concerns the tail sub-pattern (i.e. P[2…k]).
It consists of sni(P[2,,,k-1]) entries.
The l-th entry records the number of times P[k] appears after the l-th occurrence
of P[2…k-1] in the sequence, where the occurrences are in lexicographic order
according to the positions of the occurrences.
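The head and tail array definitions can be checked on r1 with a short Python sketch. This is only an illustration, not the authors' implementation (the slides report a C implementation); the function names are hypothetical, positions are 0-based, and the pattern is assumed to have at least 3 columns so that the mid pattern is non-empty.

```python
def count_after(seq, pattern, start):
    """Count occurrences of `pattern` as a subsequence of seq[start:]."""
    # dp[j] = number of ways to match the first j symbols of the pattern so far
    dp = [1] + [0] * len(pattern)
    for sym in seq[start:]:
        for j in range(len(pattern), 0, -1):
            if pattern[j - 1] == sym:
                dp[j] += dp[j - 1]
    return dp[len(pattern)]


def occurrences(seq, pattern):
    """All occurrences of `pattern` as position tuples, in lexicographic order."""
    def extend(prefix, start, j):
        if j == len(pattern):
            yield tuple(prefix)
            return
        for pos in range(start, len(seq)):
            if seq[pos] == pattern[j]:
                yield from extend(prefix + [pos], pos + 1, j + 1)
    return list(extend([], 0, 0))


def head_array(seq, pattern):
    """l-th entry: occurrences of P[2..k-1] after the l-th occurrence of P[1]."""
    first, mid = pattern[0], pattern[1:-1]
    return [count_after(seq, mid, pos + 1)
            for pos, sym in enumerate(seq) if sym == first]


def tail_array(seq, pattern):
    """l-th entry: occurrences of P[k] after the l-th occurrence of P[2..k-1]."""
    mid, last = pattern[1:-1], pattern[-1]
    return [seq[occ[-1] + 1:].count(last) for occ in occurrences(seq, mid)]


r1 = list("badbacabdcdc")   # column sequence of r1 from the slides
P = list("abcd")            # pattern <a, b, c, d>
H = head_array(r1, P)       # [5, 2, 2]
T = tail_array(r1, P)       # [2, 1, 0, 2, 1, 0, 1, 0]
```

Enumerating every occurrence of the mid pattern is done here only to make the definition concrete; it is exactly the cost that the bounding technique later in the talk tries to avoid.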
HT arrays
Column sequence of r1: <b,a,d,b,a,c,a,b,d,c,d,c>
Pattern P: <a, b, c, d>
Patterns already computed, with the number of occurrences of each pattern in row 1 (i.e. sni(P)):
Head pattern <a, b, c>: 9
Tail pattern <b, c, d>: 7
Mid pattern <b, c>: 8
Head array: 5, 2, 2
Tail array: 2, 1, 0, 2, 1, 0, 1, 0
It is sufficient to calculate the fractional support of pattern P in r1 solely from its Head and Tail arrays.
There are 5 <b,c>'s that appear after the 1st occurrence of "a" in the sequence.
The 5 <b,c>'s must correspond to the last five <b,c> according to the lexicographic order.
1. How many <b,c> appear after the 1st "a" in the sequence? 5.
2. How many "d" appear after each of the 5 <b,c>? 2+1+0+1+0 = 4.
There are 4 <a,b,c,d> occurrences starting from the 1st "a" in the sequence.
HT arrays
Column sequence of r1: <b,a,d,b,a,c,a,b,d,c,d,c>
Pattern P: <a, b, c, d>
Head array: 5, 2, 2
Tail array: 2, 1, 0, 2, 1, 0, 1, 0
There are 5 <b,c>'s that appear after the 1st occurrence of "a" in the sequence. The 5 <b,c>'s must correspond to the last five <b,c> according to the lexicographic order.
The number of occurrences of <a,b,c,d> in the entire sequence: 4
There are 2 <b,c>'s that appear after the 2nd occurrence of "a" in the sequence. The 2 <b,c>'s must correspond to the last two <b,c> according to the lexicographic order.
The number of occurrences of <a,b,c,d> in the entire sequence: 4 + 1
There are 2 <b,c>'s that appear after the 3rd occurrence of "a" in the sequence. The 2 <b,c>'s must correspond to the last two <b,c> according to the lexicographic order.
The number of occurrences of <a,b,c,d> in the entire sequence: 4 + 1 + 1 = 6
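The counting procedure walked through above (for each head entry h, add up the last h entries of the tail array) can be written as a small Python sketch. The name ht_sum and the brute-force cross-check are illustrative assumptions, not taken from the slides.

```python
def ht_sum(head, tail):
    """For each head entry h, add up the last h entries of the tail array."""
    n = len(tail)
    # tail[n - h:] is empty when h == 0, so such entries contribute nothing
    return sum(sum(tail[n - h:]) for h in head)


def count_occurrences(seq, pattern):
    """Brute-force check: count `pattern` as a subsequence of `seq`."""
    dp = [1] + [0] * len(pattern)
    for sym in seq:
        for j in range(len(pattern), 0, -1):
            if pattern[j - 1] == sym:
                dp[j] += dp[j - 1]
    return dp[len(pattern)]


H = [5, 2, 2]                    # head array of <a, b, c, d> in r1
T = [2, 1, 0, 2, 1, 0, 1, 0]     # tail array of <a, b, c, d> in r1
total = ht_sum(H, T)             # 4 + 1 + 1 = 6
direct = count_occurrences("badbacabdcdc", "abcd")   # also 6
```

Taking a suffix of the tail array works because the mid-pattern occurrences that start after a given "a" are exactly the last ones in lexicographic order.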
Efficient Mining
Methods
HTBound
HT Bound
Pattern P: <a, b, c, d>
Patterns already computed, with the number of occurrences of each pattern in row 1 (i.e. sni(P)):
Head pattern <a, b, c>: 9
Tail pattern <b, c, d>: 7
Mid pattern <b, c>: 8
HT-arrays
Head array: 5, 2, 2
Tail array: 2, 1, 0, 2, 1, 0, 1, 0
Storing the HT-arrays is impractical.
The number of entries in the T-array is the number of occurrences of the mid pattern, which, in the worst case, is exponential in the pattern's length.
It is possible to compute an upper bound of the HT-sum by storing only 3 numbers, without ever constructing or storing the HT-arrays.
Our solution is to provide a way to construct two arrays H* and T* from the 3 numbers such that the following is always true:
HT-sum (H, T) ≤ HT-sum (H*, T*)
HT Bound
Pattern P: <a, b, c, d>
Patterns already computed, with the number of occurrences of each pattern in row 1 (i.e. sni(P)):
Head pattern <a, b, c>: 9
Tail pattern <b, c, d>: 7
Mid pattern <b, c>: 8
HT-arrays
Head array: 5, 2, 2 (the 1st entry is the number of <b,c> after the 1st "a")
Tail array: 2, 1, 0, 2, 1, 0, 1, 0 (the 1st entry is the number of "d" after the 1st <b,c>)
H*T*-arrays for upper bound of HT-sum
Challenge: We cannot construct the real HT-arrays solely from the 3 numbers.
Size of H* and T* arrays: H* has r(a) = 3 slots; T* has sni(<b,c>) = 8 slots
Maximum value per slot: sni(<b,c>) = 8 for H*; r(d) = 3 for T*
Sum over all slots: sni(<a,b,c>) = 9 for H*; sni(<b,c,d>) = 7 for T*
HT-sum (H, T) ≤ HT-sum (H*, T*)
HT Bound
Pattern P: <a, b, c, d>
HT-arrays
Head array: 5, 2, 2
Tail array: Interval 1: 2, 1, 0; Interval 2: 2, 1, 0; Interval 3: 1, 0
H*T*-arrays for upper bound of HT-sum
Challenge: We cannot construct the real HT-arrays solely from the 3 numbers.
Important properties
Head array: Non-increasing.
Tail array:
1. There are r(P[2]) = r(b) = 3 intervals such that the values in each interval are non-increasing.
2. The interval averages are non-increasing.
Size of H* and T* arrays: H* has r(a) = 3 slots; T* has sni(<b,c>) = 8 slots
Maximum value per slot: sni(<b,c>) = 8 for H*; r(d) = 3 for T*
Sum over all slots: sni(<a,b,c>) = 9 for H*; sni(<b,c,d>) = 7 for T*
HT-sum (H, T) ≤ HT-sum (H*, T*)
HT Bound
Pattern P: <a, b, c, d>
HT-arrays
Head array: 5, 2, 2
Tail array: 2, 1, 0, 2, 1, 0, 1, 0
HT-Sum(H,T) = 6
H*T*-arrays for upper bound of HT-sum
H* (push): 8, 1, 0
T* (assign): seven slots with value 1 and one slot with value 0
HT-Sum (H*,T*) = 8
Size of H* and T* arrays: H* has r(a) = 3 slots; T* has sni(<b,c>) = 8 slots
Maximum value per slot: sni(<b,c>) = 8 for H*; r(d) = 3 for T*
Sum over all slots: sni(<a,b,c>) = 9 for H*; sni(<b,c,d>) = 7 for T*
HT-sum (H, T) ≤ HT-sum (H*, T*)
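The "push" and "assign" steps can be sketched in Python under one reading of the slides: H* is filled front to back up to the per-slot maximum, and T* spreads its sum as evenly as possible. The function names are hypothetical, and placing the larger T* values last is an assumption; it is the arrangement that reproduces the slide's HT-Sum(H*, T*) = 8.

```python
def push(slots, cap, total):
    """Fill slots front to back, each up to `cap`, until `total` is used up."""
    out = []
    for _ in range(slots):
        value = min(cap, total)
        out.append(value)
        total -= value
    return out


def assign(slots, total):
    """Spread `total` as evenly as possible; larger values go last (assumed)."""
    quotient, remainder = divmod(total, slots)
    return [quotient] * (slots - remainder) + [quotient + 1] * remainder


def ht_sum(head, tail):
    """For each head entry h, add up the last h entries of the tail array."""
    n = len(tail)
    return sum(sum(tail[n - h:]) for h in head)


# The 3 stored numbers for P = <a, b, c, d> in r1
sni_head, sni_tail, sni_mid = 9, 7, 8
r_a = 3   # r(a): number of slots in H*

H_star = push(r_a, sni_mid, sni_head)   # [8, 1, 0]
T_star = assign(sni_mid, sni_tail)      # [0, 1, 1, 1, 1, 1, 1, 1]
bound = ht_sum(H_star, T_star)          # 8
exact = ht_sum([5, 2, 2], [2, 1, 0, 2, 1, 0, 1, 0])   # 6
```

On this example the bound (8) is above the exact HT-sum (6), as the inequality on the slide requires; the sketch does not prove that the inequality holds in general.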
HT Bound
Pattern P: <a, b, c, d>
HT-arrays
Head array: 5, 2, 2
Tail array: 2, 1, 0, 2, 1, 0, 1, 0
HT-Sum(H,T) = 6
H*T*-arrays for upper bound of HT-sum
H* (push): 8, 1, 0
T* (assign): seven slots with value 1 and one slot with value 0
HT-Sum (H*,T*) = 8
We don't have to materialize the H* and T* arrays to compute HT-Sum(H*, T*)!
Partition 1: number of slots in the partition = sni(<a,b,c>) mod sni(<b,c>); value in each slot in the partition = sni(<a,b,c>) / sni(<b,c>)
Partition 2: number of slots in the partition = sni(<b,c>) – (sni(<a,b,c>) mod sni(<b,c>)); value in each slot in the partition = sni(<a,b,c>) / sni(<b,c>)
HT-sum (H, T) ≤ HT-sum (H*, T*)
Experimental
Evaluation
Experimental settings
C programming language
Machine
CPU: 2.6 GHz
Memory: 1 GB
Fedora
Dataset
Real dataset: Yeast galactose dataset
Subset of 205 genes (rows)
20 experimental conditions (columns)
4 biological replicates per condition
Publicly available: http://expression.microslu.washington.edu/expression/kayee/medvedovic2003/medvedovic_bioinf2003.html
Synthetic dataset
We model the values by a Gaussian distribution with the mean and variance equal to the sample mean and variance of the 4 replicates.
Replicates simulation – Expression values of new replicates are sampled from the Gaussian.
Columns simulation – New columns are synthesized by randomly drawing an existing column, discarding the existing expression values, but keeping the fitted Gaussians and sampling new values from them.
Rows simulation – New rows are synthesized as in the synthesis of new columns, but with an existing row as template instead of a column.
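The synthetic-data procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' generator: the data layout (matrix[row][col] is a list of replicate values), the function names and the example numbers are all hypothetical, and "variance" is taken to be the unbiased sample variance.

```python
import random
import statistics


def fit_gaussian(replicates):
    """Sample mean and sample variance of one (row, column) cell."""
    return statistics.mean(replicates), statistics.variance(replicates)


def simulate_replicates(replicates, n_new, rng):
    """Replicates simulation: draw new values from the fitted Gaussian."""
    mean, variance = fit_gaussian(replicates)
    return [rng.gauss(mean, variance ** 0.5) for _ in range(n_new)]


def simulate_column(matrix, n_rep, rng):
    """Columns simulation: pick an existing column at random, discard its
    values, keep the fitted Gaussians and sample n_rep new values per row."""
    col = rng.randrange(len(matrix[0]))
    return [simulate_replicates(row[col], n_rep, rng) for row in matrix]


# Hypothetical 2 x 2 matrix with 4 replicates per cell (made-up values)
matrix = [
    [[1.0, 1.2, 0.9, 1.1], [2.0, 2.1, 1.9, 2.2]],
    [[0.5, 0.4, 0.6, 0.5], [3.0, 3.0, 3.0, 3.0]],
]
rng = random.Random(0)   # fixed seed so the run is repeatable
new_replicates = simulate_replicates(matrix[0][0], 6, rng)
new_column = simulate_column(matrix, 4, rng)
```

Rows simulation would follow the same pattern with an existing row, rather than a column, used as the template.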
Speed performance in different iterations
Running time (5000 rows, 20 cols, 4 rep / c, ρ=0.2)
Number of candidates generated (5000 rows, 20 cols, 4 rep / c, ρ=0.2)
HTBound is very effective in reducing the mining time by pruning infrequent candidates.
The pruning effectiveness is most pronounced in iteration 4, in which the number of candidates is the highest.
Speed performance in different iterations
Running time (5000 rows, 20 cols, 4 rep / c, ρ=0.2)
Number of candidates generated (5000 rows, 20 cols, 4 rep / c, ρ=0.2)
The number of frequent patterns tells us the minimum number of candidates that can be generated by any pruning technique.
The number of unpruned candidates under HTBound is very close to the actual number of frequent patterns.
In particular, HTBound has already pruned 94% of all infrequent candidates generated by Basic.
Scalability w.r.t. support threshold
Running time
HTBound achieves the greatest saving in all settings.
Running time (in % of Basic)
We observe that at higher support thresholds the two bounds are capable of pruning more candidates, because more upper bounds fall below the minimum support count when the latter is higher.
Scalability w.r.t. number of rows
Running time
Running time (in % of Basic)
Scalability w.r.t. number of columns
Running time
Running time (in % of Basic)
Scalability w.r.t. number of replicates
Running time
Running time (in % of Basic)
Conclusion
We have described the problem that a high noise level poses to the mining of OPSMs, and discussed how it can be alleviated by exploiting repeated measurements.
We proposed a number of efficient mining techniques to speed up the mining process.
We performed experiments on real microarray data to:
Demonstrate the usability of the proposed "fractional support" model.
Demonstrate the effectiveness of the pruning methods.