Sequential Pattern Mining
COMP 790-90 Seminar
BCB 713 Module
Spring 2011
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL

Sequential Pattern Mining
  - Why sequential pattern mining?
  - GSP algorithm
  - FreeSpan and PrefixSpan
  - Border Collapsing
  - Constraints and extensions

COMP 790-090 Data Mining: Concepts, Algorithms, and Applications

Sequence Databases and Sequential Pattern Analysis
  - (Temporal) order is important in many situations
    - Time-series databases and sequence databases
    - Frequent patterns and (frequent) sequential patterns
  - Applications of sequential pattern mining
    - Customer shopping sequences: first buy a computer, then a CD-ROM, and then a digital camera, within 3 months
    - Medical treatment, natural disasters (e.g., earthquakes), science & engineering processes, stocks and markets, telephone calling patterns, Web log click streams, DNA sequences and gene structures

What Is Sequential Pattern Mining?
  - Given a set of sequences, find the complete set of frequent subsequences
  - A sequence database:

    SID   Sequence
    10    <a(abc)(ac)d(cf)>
    20    <(ad)c(bc)(ae)>
    30    <(ef)(ab)(df)cb>
    40    <eg(af)cbc>

  - A sequence: <(ef)(ab)(df)cb>
  - An element may contain a set of items. Items within an element are unordered and we list them alphabetically.
  - <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
  - Given support threshold min_sup = 2, <(ab)c> is a sequential pattern

Challenges on Sequential Pattern Mining
  - A huge number of possible sequential patterns are hidden in databases
  - A mining algorithm should
    - Find the complete set of patterns satisfying the minimum support (frequency) threshold
    - Be highly efficient and scalable, involving only a small number of database scans
    - Be able to incorporate various kinds of user-specific constraints

A Basic Property of Sequential Patterns: Apriori
  - A basic property: Apriori (Agrawal & Srikant ’94)
    - If a sequence S is not frequent, then none of the super-sequences of S is frequent
    - E.g., <hb> is infrequent, so are <hab> and <(ah)b>
  - Given support threshold min_sup = 2:

    Seq. ID   Sequence
    10        <(bd)cb(ac)>
    20        <(bf)(ce)b(fg)>
    30        <(ah)(bf)abf>
    40        <(be)(ce)d>
    50        <a(bd)bcb(ade)>

Basic Algorithm: Breadth First Search (GSP)
    L = 1
    While (Result_L != NULL)
        Candidate Generate
        Prune
        Test
        L = L + 1

Finding Length-1 Sequential Patterns
  - Initial candidates: all singleton sequences <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>
  - Scan the database once, count support for candidates
  - min_sup = 2, same five-sequence database as on the Apriori slide:

    Cand   Sup
    <a>    3
    <b>    5
    <c>    4
    <d>    3
    <e>    3
    <f>    2
    <g>    1
    <h>    1

The Mining Process
  (min_sup = 2, same five-sequence database)
  - 1st scan: 8 candidates (<a> <b> <c> <d> <e> <f> <g> <h>), 6 length-1 sequential patterns
  - 2nd scan: 51 candidates (<aa> <ab> … <af> <ba> <bb> … <ff> <(ab)> … <(ef)>), 19 length-2 sequential patterns; 10 candidates not in DB at all
  - 3rd scan: 46 candidates (<abb> <aab> <aba> <baa> <bab> …), 19 length-3 sequential patterns; 20 candidates not in DB at all
  - 4th scan: 8 candidates (<abba> <(bd)bc> …), 6 length-4 sequential patterns
  - 5th scan: 1 candidate (<(bd)cba>), 1 length-5 sequential pattern
  - Figure legend: candidates are eliminated either because they cannot pass the support threshold or because they do not occur in the DB at all

Generating Length-2 Candidates
  - 51 length-2 candidates: 6*6 = 36 of the form <xy> plus 6*5/2 = 15 of the form <(xy)>

          <a>    <b>    <c>    <d>    <e>    <f>
    <a>   <aa>   <ab>   <ac>   <ad>   <ae>   <af>
    <b>   <ba>   <bb>   <bc>   <bd>   <be>   <bf>
    <c>   <ca>   <cb>   <cc>   <cd>   <ce>   <cf>
    <d>   <da>   <db>   <dc>   <dd>   <de>   <df>
    <e>   <ea>   <eb>   <ec>   <ed>   <ee>   <ef>
    <f>   <fa>   <fb>   <fc>   <fd>   <fe>   <ff>

          <b>      <c>      <d>      <e>      <f>
    <a>   <(ab)>   <(ac)>   <(ad)>   <(ae)>   <(af)>
    <b>            <(bc)>   <(bd)>   <(be)>   <(bf)>
    <c>                     <(cd)>   <(ce)>   <(cf)>
    <d>                              <(de)>   <(df)>
    <e>                                       <(ef)>

  - Without the Apriori property: 8*8 + 8*7/2 = 92 candidates
  - Apriori prunes 44.57% of the candidates

Pattern Growth (PrefixSpan)
  - Prefix and Suffix (Projection)
    - <a>, <aa>, <a(ab)> and <a(abc)> are prefixes of sequence <a(abc)(ac)d(cf)>
  - Given sequence <a(abc)(ac)d(cf)>:

    Prefix   Suffix (Prefix-Based Projection)
    <a>      <(abc)(ac)d(cf)>
    <aa>     <(_bc)(ac)d(cf)>
    <ab>     <(_c)(ac)d(cf)>

Example
    Sequence_id   Sequence
    10            <a(abc)(ac)d(cf)>
    20            <(ad)c(bc)(ae)>
    30            <(ef)(ab)(df)cb>
    40            <eg(af)cbc>

  An example (min_sup = 2):

    Prefix   Sequential Patterns
    <a>      <a>, <aa>, <ab>, <a(bc)>, <a(bc)a>, <aba>, <abc>, <(ab)>, <(ab)c>, <(ab)d>, <(ab)f>, <(ab)dc>, <ac>, <aca>, <acb>, <acc>, <ad>, <adc>, <af>
    <b>      <b>, <ba>, <bc>, <(bc)>, <(bc)a>, <bd>, <bdc>, <bf>
    <c>      <c>, <ca>, <cb>, <cc>
    <d>      <d>, <db>, <dc>, <dcb>
    <e>      <e>, <ea>, <eab>, <eac>, <eacb>, <eb>, <ebc>, <ec>, <ecb>, <ef>, <efb>, <efc>, <efcb>
    <f>      <f>, <fb>, <fbc>, <fc>, <fcb>

PrefixSpan (the example to be continued)
  - Step 1: Find length-1 sequential patterns (pattern:support): <a>:4, <b>:4, <c>:4, <d>:3, <e>:3, <f>:3
  - Step 2: Divide the search space into six subsets according to the six prefixes
  - Step 3: Find the subsets of sequential patterns by constructing the corresponding projected databases and mining each recursively

Example (to be continued)
  (same four-sequence database as above)

    Prefix:                       <a>
    Projected (suffix) database:  <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
    Sequential patterns:          <a>, <aa>, <ab>, <a(bc)>, <a(bc)a>, <aba>, <abc>, <(ab)>, <(ab)c>, <(ab)d>, <(ab)f>, <(ab)dc>, <ac>, <aca>, <acb>, <acc>, <ad>, <adc>, <af>

Example
  Find sequential patterns having prefix <a>:
  1. Scan sequence database S once. Sequences in S containing <a> are projected w.r.t. <a> to form the <a>-projected database.
  2. Scan the <a>-projected database once to get six length-2 sequential patterns having prefix <a>:
       frequent items in the projected database:  <a>:2, <b>:4, <(_b)>:2, <c>:4, <d>:2, <f>:2
       resulting patterns:                        <aa>:2, <ab>:4, <(ab)>:2, <ac>:4, <ad>:2, <af>:2
  3. Recursively, all sequential patterns having prefix <a> can be further partitioned into 6 subsets. Construct the respective projected databases and mine each. E.g., the <aa>-projected database has two sequences: <(_bc)(ac)d(cf)> and <(_e)>.

PrefixSpan Algorithm
  Main idea: use frequent prefixes to divide the search space and to project sequence databases, so that only the relevant sequences are searched.

  PrefixSpan(α, i, S|α)
  1. Scan S|α once, find the set of frequent items b such that
     - b can be assembled to the last element of α to form a sequential pattern; or
     - <b> can be appended to α to form a sequential pattern.
  2. For each frequent item b, append it to α to form a sequential pattern α’, and output α’.
  3. For each α’, construct the α’-projected database S|α’, and call PrefixSpan(α’, i+1, S|α’).
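The prefix-growth recursion above can be sketched in a few lines of Python. This is a simplified illustration, not the PrefixSpan implementation itself: instead of materializing suffix projections such as <(_bc)(ac)d(cf)>, it keeps for each prefix the list of whole sequences that contain it and re-tests containment there. The names `parse`, `contains` and `prefix_growth` are hypothetical helpers introduced for this sketch, and items are assumed to be single letters as on the slides.

```python
from typing import Dict, FrozenSet, List, Optional, Tuple

Element = FrozenSet[str]
Sequence = Tuple[Element, ...]


def parse(text: str) -> Sequence:
    """Parse slide notation such as 'a(abc)(ac)d(cf)' (single-letter items assumed)."""
    elements: List[Element] = []
    group: Optional[List[str]] = None
    for ch in text:
        if ch == "(":
            group = []
        elif ch == ")":
            elements.append(frozenset(group or []))
            group = None
        elif group is not None:
            group.append(ch)
        else:
            elements.append(frozenset(ch))
    return tuple(elements)


def contains(seq: Sequence, pattern: Sequence) -> bool:
    """Greedy leftmost test: each pattern element must be a subset of a later element of seq."""
    pos = 0
    for element in pattern:
        while pos < len(seq) and not element <= seq[pos]:
            pos += 1
        if pos == len(seq):
            return False
        pos += 1
    return True


def prefix_growth(rows: List[Sequence], min_sup: int,
                  prefix: Sequence = (),
                  out: Optional[Dict[Sequence, int]] = None) -> Dict[Sequence, int]:
    """Grow `prefix` item by item; `rows` holds only the sequences that contain it."""
    out = {} if out is None else out
    items = sorted({item for seq in rows for element in seq for item in element})
    for item in items:
        # Case "<b> can be appended": start a new element.
        candidates = [prefix + (frozenset([item]),)]
        # Case "b can be assembled to the last element": alphabetical order avoids duplicates.
        if prefix and item > max(prefix[-1]):
            candidates.append(prefix[:-1] + (prefix[-1] | {item},))
        for cand in candidates:
            supporting = [seq for seq in rows if contains(seq, cand)]
            if len(supporting) >= min_sup:
                out[cand] = len(supporting)
                prefix_growth(supporting, min_sup, cand, out)
    return out


db = [parse(s) for s in ("a(abc)(ac)d(cf)", "(ad)c(bc)(ae)", "(ef)(ab)(df)cb", "eg(af)cbc")]
patterns = prefix_growth(db, min_sup=2)
print(patterns[parse("(ab)c")])  # support of <(ab)c> in the example database
```

Restricting each recursive call to the supporting sequences plays the role of the projected database: sequences that do not contain the prefix are never scanned again. A real PrefixSpan additionally stores only the suffix of each sequence, which this sketch omits for brevity.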
Approximate match
  - Compatibility Matrix
  - When you observe d1, spread the count as d1: 90%, d2: 5%, d3: 5%

Match
  - The degree to which pattern P is retained/reflected in S
  - $M(P,S) = P(P \mid S) = \prod_{i=1}^{l_P} C(p_i, s_i)$ when $l_S = l_P$
  - $M(P,S) = \max_{S'} M(P,S')$ over all possible subsequences $S'$ of $S$ with $l_{S'} = l_P$, when $l_S > l_P$
  - Example:

    P      S        M
    d1d1   d1d3     0.9*0
    d1d2   d1d2     0.9*0.8
    d1d2   d1d3     0.9*0.05
    d1d2   d2d3     0.1*0.05
    d1d2   d1d2d3   0.9*0.8

Calculate Max over all
  - Dynamic Programming
  - $M(p_1 p_2 \ldots p_i,\; s_1 s_2 \ldots s_j) = \max\{\, M(p_1 p_2 \ldots p_{i-1},\; s_1 s_2 \ldots s_{j-1}) \cdot C(p_i, s_j),\;\; M(p_1 p_2 \ldots p_i,\; s_1 s_2 \ldots s_{j-1}) \,\}$
  - Cost: $O(l_P \cdot l_S)$
  - When the compatibility matrix is sparse: $O(l_S)$

Match in D
  - Average over all sequences in D

Spread of match
  - If the compatibility matrix is the identity matrix, match = support

Anti-Monotone
  - The match of a pattern P in a symbol sequence S is less than or equal to the match of any subpattern of P in S
  - The match of a pattern P in a sequence database D is less than or equal to the match of any subpattern of P in D
  - Can use any support-based algorithm
  - More patterns match, so an efficient solution is required
    - Sample-based algorithms
    - Border collapsing of ambiguous patterns

Chernoff Bound
  - Given sample size n and range R, with probability $1 - \delta$ the true value lies within $\epsilon$ of the sample value, where $\epsilon = \sqrt{R^2 \ln(1/\delta) / (2n)}$
  - Distribution free
  - More conservative
  - Sample size: fit in memory
  - Restricted spread: for pattern $P = p_1 p_2 \ldots p_L$, $R = \min(\mathrm{match}[p_i])$ for all $1 \le i \le L$
  - Figure scale, top to bottom: frequent patterns; min_match + $\epsilon$; min_match - $\epsilon$; infrequent patterns. Patterns whose sample match falls between the two thresholds are the ambiguous ones.

Algorithm
  - Scan DB: $O(N \cdot \min(L_s \cdot m,\; L_s + m^2))$
    - Find the match of each individual symbol
    - Take a random sample of sequences
  - Identify borders that embrace the set of ambiguous patterns: $O(m^{L_p} \cdot |S| \cdot L_p \cdot n)$
    - Min_match
    - Existing methods for association rule mining
  - Locate the border of frequent patterns in the entire DB via border collapsing

Border Collapsing
  - If memory cannot hold the counters of all ambiguous patterns
    - Probe-and-collapse: binary search
    - Probe patterns with the highest collapsing power until memory is filled
  - If memory can hold all patterns up to the 1/x layer, the space of ambiguous patterns can be narrowed to at least 1/x of the original one, where x is a power of 2
  - If a level-wise search takes y scans of the DB, only $O(\log_x y)$ scans are necessary when the border collapsing technique is employed

Periodic Pattern
  - Full periodic pattern: ABC ABC ABC
  - Partial periodic pattern: ABC ADC ACC ABC
  - Pattern hierarchy: ABC ABC ABC DE DE DE DE ABC ABC ABC DE DE DE DE ABC ABC ABC DE DE DE DE

Periodic Pattern: Recent Achievements
  - Partial Periodic Pattern
  - Asynchronous Periodic Pattern
  - Meta Pattern
  - InfoMiner/InfoMiner+/STAMP

Clustering Sequential Data
  - CLUSEQ
  - ApproxMAP
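
Supplementary sketch for the "Calculate Max over all" slide: the dynamic-programming recurrence for the match M(P,S) can be written as a single rolling array. This is a minimal illustration under stated assumptions, not the authors' code. The function name `match` and the dictionary layout `(true symbol, observed symbol) -> C` are hypothetical; the matrix entries are one reading of the values implied by the compatibility and Match example slides, and any pair not listed is assumed to have compatibility 0.

```python
from typing import Dict, Sequence, Tuple

# (true symbol, observed symbol) -> compatibility C; missing pairs are treated as 0.
Compat = Dict[Tuple[str, str], float]


def match(pattern: Sequence[str], observed: Sequence[str], compat: Compat) -> float:
    """Best product of compatibilities over all alignments of pattern to a subsequence of observed."""
    # best[i] = M(p_1..p_i, symbols read so far); the empty pattern matches with 1.
    best = [1.0] + [0.0] * len(pattern)
    for s in observed:
        # Go right to left so best[i - 1] still refers to the previous observed prefix.
        for i in range(len(pattern), 0, -1):
            best[i] = max(best[i], best[i - 1] * compat.get((pattern[i - 1], s), 0.0))
    return best[-1]


# Assumed entries, read off the slides' example (not a complete matrix).
C: Compat = {
    ("d1", "d1"): 0.9, ("d2", "d1"): 0.05, ("d3", "d1"): 0.05,
    ("d1", "d2"): 0.1, ("d2", "d2"): 0.8,
    ("d1", "d3"): 0.0, ("d2", "d3"): 0.05,
}

print(round(match(["d1", "d2"], ["d1", "d2", "d3"], C), 3))  # 0.72
```

The two nested loops give the O(lP*lS) cost stated on the slide. Skipping symbols whose compatibility with s is zero is what would bring the cost down toward O(lS) for a sparse matrix; the sketch does not implement that optimization.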