COMP 790-090 Data Mining - Computer Science

Sequential Pattern Mining
COMP 790-90 Seminar
BCB 713 Module
Spring 2011
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Sequential Pattern Mining
Why sequential pattern mining?
GSP algorithm
FreeSpan and PrefixSpan
Border Collapsing
Constraints and extensions
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
Sequence Databases and Sequential Pattern Analysis
(Temporal) order is important in many situations
Time-series databases and sequence databases
Frequent patterns → (frequent) sequential patterns
Applications of sequential pattern mining
Customer shopping sequences:
First buy computer, then CD-ROM, and then digital camera,
within 3 months.
Medical treatment, natural disasters (e.g., earthquakes),
science & engineering processes, stocks and markets,
telephone calling patterns, Weblog click streams, DNA
sequences and gene structures
What Is Sequential Pattern Mining?
Given a set of sequences, find the complete
set of frequent subsequences
A sequence database
SID   Sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>
A sequence: <(ef)(ab)(df)cb>
An element may contain a set of items.
Items within an element are unordered
and we list them alphabetically.
<a(bc)dc> is a subsequence
of <a(abc)(ac)d(cf)>
Given support threshold min_sup =2, <(ab)c> is a
sequential pattern
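The containment test and the support count defined above can be sketched in Python. This is a minimal illustration rather than part of the original slides: `parse`, `contains`, and `support` are hypothetical helper names, and items are assumed to be single lowercase letters.

```python
def parse(text):
    """Turn 'a(abc)(ac)d(cf)' into a list of item sets, one per element."""
    elements, i = [], 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)
            elements.append(frozenset(text[i + 1:j]))
            i = j + 1
        else:
            elements.append(frozenset(text[i]))
            i += 1
    return elements


def contains(sequence, pattern):
    """True if pattern is a subsequence of sequence (greedy leftmost match)."""
    pos = 0
    for element in pattern:
        # Skip elements of the sequence that do not include this pattern element.
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next pattern element must match a later element
    return True


def support(database, pattern):
    """Number of sequences in the database that contain the pattern."""
    return sum(contains(seq, pattern) for seq in database)


db = [parse(s) for s in ("a(abc)(ac)d(cf)", "(ad)c(bc)(ae)",
                         "(ef)(ab)(df)cb", "eg(af)cbc")]
print(contains(db[0], parse("a(bc)dc")))  # True
print(support(db, parse("(ab)c")))        # 2
```

Only sequences 10 and 30 have an element containing both a and b followed by a later c, which is why <(ab)c> reaches min_sup = 2.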
Challenges on Sequential Pattern Mining
A huge number of possible sequential
patterns are hidden in databases
A mining algorithm should
Find the complete set of patterns satisfying the
minimum support (frequency) threshold
Be highly efficient, scalable, involving only a
small number of database scans
Be able to incorporate various kinds of user-specific constraints
A Basic Property of Sequential Patterns: Apriori
A basic property: Apriori (Agrawal & Srikant ’94)
If a sequence S is not frequent
Then none of the super-sequences of S is frequent
E.g., <hb> is infrequent, so <hab> and <(ah)b> are infrequent as well
Seq. ID   Sequence
10        <(bd)cb(ac)>
20        <(bf)(ce)b(fg)>
30        <(ah)(bf)abf>
40        <(be)(ce)d>
50        <a(bd)bcb(ade)>
Given support threshold min_sup = 2
Basic Algorithm: Breadth-First Search (GSP)
L = 1
While (Result_L != NULL)
    Candidate Generate
    Prune
    Test
    L = L + 1
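The generate-and-test loop can be sketched in Python as follows. This is a simplified level-wise miner, not GSP's actual join-based candidate generation: each frequent pattern is grown by one frequent item, and the explicit Prune step is omitted. The names `contains`, `extend`, and `level_wise` are hypothetical, and sequences are assumed to be lists of sorted item strings.

```python
def contains(sequence, pattern):
    """True if pattern is a subsequence of sequence (greedy leftmost match)."""
    pos = 0
    for element in pattern:
        while pos < len(sequence) and not set(element) <= set(sequence[pos]):
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def extend(pattern, items):
    """Grow a pattern by one item: as a new element, or inside the last element."""
    for x in items:
        yield pattern + (x,)                         # sequence extension
        if x > pattern[-1][-1]:
            yield pattern[:-1] + (pattern[-1] + x,)  # itemset extension


def level_wise(database, min_sup):
    items = sorted({x for seq in database for element in seq for x in element})
    candidates = [(x,) for x in items]  # initial candidates: all singletons
    frequent_items, result, n_candidates = [], {}, []
    while candidates:
        n_candidates.append(len(candidates))
        # Test: one pass over the database per level.
        level = {c: sum(contains(seq, c) for seq in database) for c in candidates}
        level = {c: sup for c, sup in level.items() if sup >= min_sup}
        result.update(level)
        if not frequent_items:
            frequent_items = [c[0] for c in level]
        # Candidate generation for the next level.
        candidates = [e for c in level for e in extend(c, frequent_items)]
    return result, n_candidates


db = [["bd", "c", "b", "ac"], ["bf", "ce", "b", "fg"], ["ah", "bf", "a", "b", "f"],
      ["be", "ce", "d"], ["a", "bd", "b", "c", "b", "ade"]]
result, n_candidates = level_wise(db, min_sup=2)
print(n_candidates[:2])                   # [8, 51]
print(result[("bd", "c", "b", "a")])      # 2
```

Because pruning is skipped, the candidate counts from the third level onward are larger than GSP's; the set of frequent patterns found is unaffected.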
Finding Length-1 Sequential Patterns
Initial candidates: all singleton sequences
<a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>
Scan database once, count support for
candidates
min_sup = 2

Seq. ID   Sequence
10        <(bd)cb(ac)>
20        <(bf)(ce)b(fg)>
30        <(ah)(bf)abf>
40        <(be)(ce)d>
50        <a(bd)bcb(ade)>

Cand   Sup
<a>    3
<b>    5
<c>    4
<d>    3
<e>    3
<f>    2
<g>    1
<h>    1
The Mining Process
1st scan: 8 cand., 6 length-1 seq. pat.
    <a> <b> <c> <d> <e> <f> <g> <h>
2nd scan: 51 cand., 19 length-2 seq. pat.; 10 cand. not in DB at all
    <aa> <ab> … <af> <ba> <bb> … <ff> <(ab)> … <(ef)>
3rd scan: 46 cand., 19 length-3 seq. pat.; 20 cand. not in DB at all
    <abb> <aab> <aba> <baa> <bab> …
4th scan: 8 cand., 6 length-4 seq. pat.
    <abba> <(bd)bc> …
5th scan: 1 cand., 1 length-5 seq. pat.
    <(bd)cba>
Legend: a candidate is discarded either because it cannot pass the sup. threshold or because it is not in the DB at all.

min_sup = 2

Seq. ID   Sequence
10        <(bd)cb(ac)>
20        <(bf)(ce)b(fg)>
30        <(ah)(bf)abf>
40        <(be)(ce)d>
50        <a(bd)bcb(ade)>
Generating Length-2 Candidates
51 length-2 candidates

       <a>    <b>    <c>    <d>    <e>    <f>
<a>    <aa>   <ab>   <ac>   <ad>   <ae>   <af>
<b>    <ba>   <bb>   <bc>   <bd>   <be>   <bf>
<c>    <ca>   <cb>   <cc>   <cd>   <ce>   <cf>
<d>    <da>   <db>   <dc>   <dd>   <de>   <df>
<e>    <ea>   <eb>   <ec>   <ed>   <ee>   <ef>
<f>    <fa>   <fb>   <fc>   <fd>   <fe>   <ff>

       <a>    <b>      <c>      <d>      <e>      <f>
<a>           <(ab)>   <(ac)>   <(ad)>   <(ae)>   <(af)>
<b>                    <(bc)>   <(bd)>   <(be)>   <(bf)>
<c>                             <(cd)>   <(ce)>   <(cf)>
<d>                                      <(de)>   <(df)>
<e>                                               <(ef)>
<f>

Without the Apriori property: 8*8 + 8*7/2 = 92 candidates
Apriori prunes 44.57% of the candidates
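The candidate counts above follow from simple counting: n items give n*n ordered pairs <xy> plus n*(n-1)/2 unordered pairs <(xy)>. A small Python check, with `length2_candidates` as a hypothetical helper name:

```python
def length2_candidates(n_items):
    """n*n sequence candidates <xy> plus n*(n-1)/2 itemset candidates <(xy)>."""
    return n_items * n_items + n_items * (n_items - 1) // 2


with_apriori = length2_candidates(6)     # only the 6 frequent items are combined
without_apriori = length2_candidates(8)  # all 8 items would be combined
pruned = 1 - with_apriori / without_apriori
print(with_apriori, without_apriori, f"{pruned:.2%}")  # 51 92 44.57%
```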
Pattern Growth (prefixSpan)
Prefix and Suffix (Projection)
<a>, <aa>, <a(ab)> and <a(abc)> are prefixes
of sequence <a(abc)(ac)d(cf)>
Given sequence <a(abc)(ac)d(cf)>

Prefix   Suffix (Prefix-Based Projection)
<a>      <(abc)(ac)d(cf)>
<aa>     <(_bc)(ac)d(cf)>
<ab>     <(_c)(ac)d(cf)>
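The projections in this table can be reproduced with a short Python sketch. It assumes sequences are lists of sorted item strings and handles only prefixes grown by appending a new single-item element, which covers <a>, <aa>, and <ab>; `project` and `show` are hypothetical helper names.

```python
def project(elements, item):
    """Suffix of a sequence with respect to the first occurrence of `item`.

    The leftover of the matched element is kept with a leading '_'.
    Returns None if the item does not occur.
    """
    for i, element in enumerate(elements):
        if element.startswith("_"):
            continue  # leftover of the previous prefix element, not a new element
        if item in element:
            rest = element[element.index(item) + 1:]
            head = ["_" + rest] if rest else []
            return head + elements[i + 1:]
    return None


def show(elements):
    """Format a list of elements in the slides' angle-bracket notation."""
    return "<" + "".join(e if len(e) == 1 else f"({e})" for e in elements) + ">"


sequence = ["a", "abc", "ac", "d", "cf"]    # <a(abc)(ac)d(cf)>
suffix_a = project(sequence, "a")
print(show(suffix_a))                       # <(abc)(ac)d(cf)>
print(show(project(suffix_a, "a")))         # <(_bc)(ac)d(cf)>
print(show(project(suffix_a, "b")))         # <(_c)(ac)d(cf)>
```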
Example

Sequence_id   Sequence
10            <a(abc)(ac)d(cf)>
20            <(ad)c(bc)(ae)>
30            <(ef)(ab)(df)cb>
40            <eg(af)cbc>

An example (min_sup = 2):

Prefix   Sequential Patterns
<a>      <a>, <aa>, <ab>, <a(bc)>, <a(bc)a>, <aba>, <abc>, <(ab)>, <(ab)c>, <(ab)d>,
         <(ab)f>, <(ab)dc>, <ac>, <aca>, <acb>, <acc>, <ad>, <adc>, <af>
<b>      <b>, <ba>, <bc>, <(bc)>, <(bc)a>, <bd>, <bdc>, <bf>
<c>      <c>, <ca>, <cb>, <cc>
<d>      <d>, <db>, <dc>, <dcb>
<e>      <e>, <ea>, <eab>, <eac>, <eacb>, <eb>, <ebc>, <ec>, <ecb>, <ef>, <efb>, <efc>, <efcb>
<f>      <f>, <fb>, <fbc>, <fc>, <fcb>
PrefixSpan
(the example to be continued)
Step1: Find length-1 sequential patterns;
<a>:4, <b>:4, <c>:4, <d>:3, <e>:3, <f>:3 (notation: <pattern>:support)
Step2: Divide search space;
six subsets according to the six prefixes;
Step3: Find subsets of sequential patterns;
By constructing corresponding projected databases and mine
each recursively.
Example (to be continued)

Sequence_id   Sequence
10            <a(abc)(ac)d(cf)>
20            <(ad)c(bc)(ae)>
30            <(ef)(ab)(df)cb>
40            <eg(af)cbc>

Prefix: <a>
Projected (suffix) database:
    <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
Sequential patterns:
    <a>, <aa>, <ab>, <a(bc)>, <a(bc)a>, <aba>, <abc>, <(ab)>, <(ab)c>, <(ab)d>,
    <(ab)f>, <(ab)dc>, <ac>, <aca>, <acb>, <acc>, <ad>, <adc>, <af>
Example
Find sequential patterns having prefix <a>:
1. Scan sequence database S once. Sequences in S containing <a> are projected w.r.t. <a> to form the <a>-projected database.
2. Scan the <a>-projected database once, get six length-2 sequential patterns having prefix <a>:
   frequent items: <a>:2, <b>:4, <(_b)>:2, <c>:4, <d>:2, <f>:2
   patterns: <aa>:2, <ab>:4, <(ab)>:2, <ac>:4, <ad>:2, <af>:2
3. Recursively, all sequential patterns having prefix <a> can be further partitioned into 6 subsets. Construct respective projected databases and mine each.
   E.g., the <aa>-projected database has two sequences: <(_bc)(ac)d(cf)> and <(_e)>.
PrefixSpan Algorithm
Main Idea: Use frequent prefixes to divide the search space and to project sequence databases. Only search the relevant sequences.
PrefixSpan(α, i, S|α)
1. Scan S|α once, find the set of frequent items b such that
   • b can be assembled to the last element of α to form a sequential pattern; or
   • <b> can be appended to α to form a sequential pattern.
2. For each frequent item b, append it to α to form a sequential pattern α', and output α';
3. For each α', construct the α'-projected database S|α', and call PrefixSpan(α', i+1, S|α').
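The recursion can be sketched in Python as below. This is a simplified pattern-growth miner under stated assumptions, not the published implementation: instead of materializing suffix databases it keeps, for each sequence, a pointer to the element matched by the last prefix element, and it tests each item separately rather than counting all items in a single scan. `first_index`, `prefixspan`, and `grow` are hypothetical names, and sequences are lists of sorted item strings.

```python
def first_index(seq, start, needed):
    """Index of the first element at or after `start` containing all of `needed`."""
    for j in range(start, len(seq)):
        if needed <= seq[j]:
            return j
    return None


def prefixspan(database, min_sup):
    db = [[frozenset(e) for e in seq] for seq in database]
    items = sorted(set().union(*(e for seq in db for e in seq)))
    found = {}

    def grow(prefix, proj):
        # proj: (sequence id, index of the element matched by the last prefix element)
        last = frozenset(prefix[-1]) if prefix else frozenset()
        for x in items:
            # <x> appended to the prefix as a new element
            extensions = [(prefix + (x,),
                           [(sid, first_index(db[sid], pos + 1, {x}))
                            for sid, pos in proj])]
            # x assembled into the last element of the prefix
            if prefix and x > max(last):
                extensions.append((prefix[:-1] + (prefix[-1] + x,),
                                   [(sid, first_index(db[sid], pos, last | {x}))
                                    for sid, pos in proj]))
            for pattern, new_proj in extensions:
                new_proj = [(sid, j) for sid, j in new_proj if j is not None]
                if len(new_proj) >= min_sup:
                    found[pattern] = len(new_proj)  # output the grown pattern
                    grow(pattern, new_proj)         # mine its projection recursively

    grow((), [(sid, -1) for sid in range(len(db))])
    return found


db = [["a", "abc", "ac", "d", "cf"], ["ad", "c", "bc", "ae"],
      ["ef", "ab", "df", "c", "b"], ["e", "g", "af", "c", "b", "c"]]
found = prefixspan(db, min_sup=2)
print(found[("a", "b")], found[("ab",)])   # 4 2
```

On the four-sequence example database this yields the six length-2 patterns with prefix <a> listed earlier, for instance <ab>:4 and <(ab)>:2.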
Approximate match
Compatibility Matrix
When you observe d1, spread the count as d1: 90%, d2: 5%, d3: 5%
Match
The degree to which pattern P is retained/reflected in S
$M(P,S) = P(P \mid S) = \prod_{i=1}^{l_P} C(p_i, s_i)$ when $l_S = l_P$
$M(P,S) = \max_{S'} M(P,S')$ over all possible subsequences $S'$ of $S$ with $l_{S'} = l_P$, when $l_S > l_P$
Example

P       S        M
d1d1    d1d3     0.9*0
d1d2    d1d2     0.9*0.8
d1d2    d1d3     0.9*0.05
d1d2    d2d3     0.1*0.05
d1d2    d1d2d3   0.9*0.8
Calculate Max over all
Dynamic Programming
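$M(p_1 p_2 \ldots p_i,\ s_1 s_2 \ldots s_j) = \max\{\, M(p_1 p_2 \ldots p_{i-1},\ s_1 s_2 \ldots s_{j-1}) \cdot C(p_i, s_j),\ M(p_1 p_2 \ldots p_i,\ s_1 s_2 \ldots s_{j-1}) \,\}$
Time: $O(l_P \cdot l_S)$
When the compatibility matrix is sparse: $O(l_S)$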
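The recurrence can be written as a small table-filling routine in Python. The sketch below is illustrative only: `match` is a hypothetical name, the compatibility values are the ones that appear in the worked example (C(d1,d1) = 0.9, C(d2,d2) = 0.8, C(d2,d3) = 0.05, C(d1,d2) = 0.1, C(d1,d3) = 0), and every entry not shown there is assumed to be 0.

```python
def match(pattern, sequence, compat):
    """Best match of `pattern` over all subsequences of `sequence`.

    compat maps (pattern symbol, observed symbol) to a compatibility value;
    missing pairs are treated as 0 (an assumption of this sketch).
    """
    lp, ls = len(pattern), len(sequence)
    # m[i][j] = match of the first i pattern symbols in the first j sequence symbols
    m = [[0.0] * (ls + 1) for _ in range(lp + 1)]
    for j in range(ls + 1):
        m[0][j] = 1.0  # the empty pattern matches with value 1
    for i in range(1, lp + 1):
        for j in range(i, ls + 1):
            use = m[i - 1][j - 1] * compat.get((pattern[i - 1], sequence[j - 1]), 0.0)
            skip = m[i][j - 1]
            m[i][j] = max(use, skip)
    return m[lp][ls]


C = {("d1", "d1"): 0.9, ("d2", "d2"): 0.8, ("d2", "d3"): 0.05,
     ("d1", "d2"): 0.1, ("d1", "d3"): 0.0}
print(match(["d1", "d2"], ["d1", "d2", "d3"], C))
```

For P = d1d2 and S = d1d2d3 the best alignment pairs d1 with d1 and d2 with d2, giving 0.9*0.8 as in the example table.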
Match in D
Average over all sequences in D
Spread of match
If the compatibility matrix is the identity matrix, match = support
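A quick Python check of this special case, under illustrative assumptions: the four symbol sequences are made up for the demonstration, `match` uses a compact one-row form of the dynamic program, and support is expressed as the fraction of sequences containing the pattern so that it is comparable to the average match.

```python
def match(pattern, sequence, compat):
    """Best match of pattern in sequence, one-row dynamic program."""
    row = [1.0] + [0.0] * len(pattern)
    for s in sequence:
        # Go right to left so row[i - 1] still refers to the previous symbol.
        for i in range(len(pattern), 0, -1):
            row[i] = max(row[i], row[i - 1] * compat(pattern[i - 1], s))
    return row[-1]


def match_in_db(pattern, database, compat):
    """Average match over all sequences in the database."""
    return sum(match(pattern, seq, compat) for seq in database) / len(database)


def is_subsequence(pattern, sequence):
    it = iter(sequence)
    return all(symbol in it for symbol in pattern)


def identity(p, s):
    return 1.0 if p == s else 0.0


db = ["abcab", "acb", "bbc", "cab"]  # hypothetical symbol sequences
support = sum(is_subsequence("ab", seq) for seq in db) / len(db)
print(match_in_db("ab", db, identity), support)  # 0.75 0.75
```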
Anti-Monotone
The match of a pattern P in a symbol sequence S
is less than or equal to the match of any subpattern
of P in S
The match of a pattern P in a sequence database D
is less than or equal to the match of any subpattern
of P in D
Can use any support based algorithm
More patterns match so require efficient solution
Sample based algorithms
Border collapsing of ambiguous patterns
Chernoff Bound
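Given sample size n and range R, with probability $1 - \delta$ the true value lies within $\pm\epsilon$ of the value observed in the sample, where
$\epsilon = \sqrt{\dfrac{R^2 \ln(1/\delta)}{2n}}$
Distribution free
More conservative
Sample size: fit in memory
Restricted spread: for pattern P = p1p2..pL, R = min(match[pi]) for all 1 ≤ i ≤ L
Sample match above min_match + ε: frequent patterns
Sample match below min_match - ε: infrequent patterns
Sample match in between: ambiguous patterns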
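The bound ε = sqrt(R² ln(1/δ) / (2n)) and the resulting three-way split can be sketched in Python. `epsilon` and `classify` are hypothetical helper names, and the numbers in the usage lines are illustrative, not taken from the slides.

```python
import math


def epsilon(value_range, delta, n):
    """Error bound for a sample of size n, value range R, confidence 1 - delta."""
    return math.sqrt(value_range ** 2 * math.log(1 / delta) / (2 * n))


def classify(sample_match, min_match, eps):
    """Label a pattern from its match in the sample."""
    if sample_match > min_match + eps:
        return "frequent"
    if sample_match < min_match - eps:
        return "infrequent"
    return "ambiguous"


eps = epsilon(1.0, 0.01, 10_000)
print(round(eps, 4))                          # 0.0152
print(classify(0.30, 0.25, eps))              # frequent
print(classify(0.26, 0.25, eps))              # ambiguous
print(epsilon(0.2, 0.01, 10_000) < eps)       # True: a restricted spread tightens the bound
```

Quadrupling the sample size halves ε, and a smaller range R (the restricted spread) shrinks it proportionally, which leaves fewer ambiguous patterns to resolve against the full database.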
Algorithm
Scan DB: O(N * min(Ls*m, Ls + m^2))
    Find the match of each individual symbol
    Take a random sample of sequences
Identify borders that embrace the set of ambiguous patterns: O(m^Lp * |S| * Lp * n)
    min_match ± ε
    Existing methods for association rule mining
Locate the border of frequent patterns in the entire DB
    via border collapsing
Border Collapsing
If memory cannot hold the counters of all ambiguous patterns
Probe-and-collapse: binary search
Probe patterns with the highest collapsing power until memory is filled
If memory can hold all patterns up to the 1/x layer, the space of ambiguous patterns can be narrowed to at least 1/x of the original one, where x is a power of 2
If a level-wise search takes y scans of the DB, only O(log_x y) scans are necessary when the border collapsing technique is employed
Periodic Pattern
Full periodic pattern
ABC ABC ABC
Partial periodic pattern
ABC ADC ACC ABC
Pattern hierarchy
ABC ABC ABC DE DE DE DE ABC ABC
ABC DE DE DE DE ABC ABC ABC DE DE
DE DE
Periodic Pattern
Recent Achievements
Partial Periodic Pattern
Asynchronous Periodic Pattern
Meta Pattern
InfoMiner/InfoMiner+/STAMP
Clustering Sequential Data
CLUSEQ
ApproxMAP