Bottleneck of Frequent-pattern Mining
Multiple database scans are costly
Mining long patterns needs many passes of scanning and generates lots of candidates
To find the frequent itemset i1 i2 … i100:
# of scans: 100
# of candidates: (100 choose 1) + (100 choose 2) + … + (100 choose 100) = 2^100 - 1 ≈ 1.27 × 10^30 !
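A quick check of this candidate count, as a minimal Python sketch (not part of the original slides):

```python
from math import comb

# Sum of C(100, k) over k = 1..100 equals 2^100 - 1
n_candidates = sum(comb(100, k) for k in range(1, 101))
assert n_candidates == 2**100 - 1
print(f"{n_candidates:.2e}")  # ~1.27e+30
```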
Bottleneck: candidate-generation-and-test
Can we avoid candidate generation?
Mining Freq Patterns w/o Candidate Generation
Grow long patterns from short ones using local frequent items
“abc” is a frequent pattern
Get all transactions having “abc”: DB|abc (projected database on abc)
“d” is a local frequent item in DB|abc ⇒ “abcd” is a frequent pattern
Get all transactions having “abcd” (projected database on “abcd”) and find longer itemsets
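A minimal Python sketch of this growth step; the small transaction set and the min_sup value below are illustrative, not from the slides:

```python
from collections import Counter

min_sup = 2   # illustrative absolute support threshold
db = [{'a', 'b', 'c', 'd'},
      {'a', 'b', 'c', 'd', 'e'},
      {'a', 'b', 'c', 'e'},
      {'b', 'd'}]

abc = {'a', 'b', 'c'}
# DB|abc: the projected database of transactions containing "abc"
db_abc = [t for t in db if abc <= t]

# Local frequent items in DB|abc (excluding a, b, c themselves)
local_counts = Counter(item for t in db_abc for item in t - abc)
local_frequent = [x for x, n in local_counts.items() if n >= min_sup]
print(local_frequent)   # ['d', 'e']: so "abcd" and "abce" are frequent patterns
```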
Mining Freq Patterns w/o Candidate Generation
Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure
Highly condensed, but complete for frequent pattern mining
Avoid costly database scans
Develop an efficient, FP-tree-based frequent pattern mining method
A divide-and-conquer methodology: decompose mining tasks into smaller ones
Avoid candidate generation: examine the sub-database (conditional pattern base) only!
Construct FP-tree from a Transaction DB
TID   Items bought                (Ordered) frequent items
100   {f, a, c, d, g, i, m, p}    {f, c, a, m, p}
200   {a, b, c, f, l, m, o}       {f, c, a, b, m}
300   {b, f, h, j, o}             {f, b}
400   {b, c, k, s, p}             {c, b, p}
500   {a, f, c, e, l, p, m, n}    {f, c, a, m, p}

min_sup = 50% (a support count of at least 3 of the 5 transactions)
Steps:
1. Scan the DB once and find the frequent 1-itemsets (single-item patterns): f:4, c:4, a:3, b:3, m:3, p:3; all other items appear in fewer than 3 transactions
2. Order the frequent items in frequency-descending order: f, c, a, b, m, p (L-order)
3. Process the DB again, rewriting each transaction with only its frequent items, sorted in L-order
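A minimal Python sketch of steps 1 to 3 on this DB; the transactions are those in the table above, and ties in the frequency ordering are broken to match the slides' L-order:

```python
from collections import Counter

min_sup = 3   # 50% of 5 transactions
db = {100: ['f','a','c','d','g','i','m','p'],
      200: ['a','b','c','f','l','m','o'],
      300: ['b','f','h','j','o'],
      400: ['b','c','k','s','p'],
      500: ['a','f','c','e','l','p','m','n']}

# Step 1: one scan to count single items
counts = Counter(item for items in db.values() for item in items)

# Step 2: frequent items in frequency-descending order (ties broken as in the slides)
l_order = ['f', 'c', 'a', 'b', 'm', 'p']
assert all(counts[i] >= min_sup for i in l_order)
assert all(c < min_sup for i, c in counts.items() if i not in l_order)

# Step 3: keep only frequent items and sort each transaction in L-order
rank = {item: r for r, item in enumerate(l_order)}
ordered = {tid: sorted((i for i in items if i in rank), key=rank.get)
           for tid, items in db.items()}
print(ordered[100])   # ['f', 'c', 'a', 'm', 'p']
print(ordered[400])   # ['c', 'b', 'p']
```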
Construct FP-tree from a Transaction DB
Initial FP-tree: just the root node {}; the header table lists f, c, a, b, m, p, each with frequency 0 and a nil node-link
Construct FP-tree from a Transaction DB
Insert {f, c, a, m, p} (TID 100): creates the path {} → f:1 → c:1 → a:1 → m:1 → p:1; header counts become f:1, c:1, a:1, b:0, m:1, p:1
Construct FP-tree from a Transaction DB
Insert {f, c, a, b, m} (TID 200): shares the prefix f-c-a, so those counts become f:2, c:2, a:2, and a new branch a:2 → b:1 → m:1 is added; header counts become f:2, c:2, a:2, b:1, m:2, p:1
Construct FP-tree from a Transaction DB
Insert {f, b} (TID 300): f becomes f:3 and a new branch f:3 → b:1 is added; header counts become f:3, c:2, a:2, b:2, m:2, p:1
Construct FP-tree from a Transaction DB
Insert {c, b, p} (TID 400): shares no prefix with the f-branch, so a new branch {} → c:1 → b:1 → p:1 is added directly under the root; header counts become f:3, c:3, a:2, b:3, m:2, p:2
Construct FP-tree from a Transaction DB
Insert {f, c, a, m, p} (TID 500): reuses the first path, giving f:4 → c:3 → a:3 → m:2 → p:2; final header counts are f:4, c:4, a:3, b:3, m:3, p:3, with each header entry's node-link threading all tree nodes for that item
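A minimal Python sketch of the insertion procedure illustrated above; the class and function names are mine, not from the original FP-growth paper:

```python
class FPNode:
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 0, parent
        self.children = {}      # item -> child FPNode
        self.node_link = None   # next node in the tree carrying the same item

def insert(root, ordered_items, header):
    """Insert one transaction whose items are already filtered and in L-order."""
    node = root
    for item in ordered_items:
        child = node.children.get(item)
        if child is None:
            child = FPNode(item, node)
            node.children[item] = child
            # thread the new node into the header table's node-link chain
            child.node_link = header.get(item)
            header[item] = child
        child.count += 1
        node = child

root, header = FPNode(None, None), {}
for t in [['f','c','a','m','p'], ['f','c','a','b','m'], ['f','b'],
          ['c','b','p'], ['f','c','a','m','p']]:
    insert(root, t, header)

f = root.children['f']
print(f.count, f.children['c'].count, f.children['c'].children['a'].count)   # 4 3 3
```

In this sketch header[item] points at the most recently inserted node for that item, and following the node_link pointers from it visits every node carrying that item, which is what the mining phase on the following slides relies on.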
Benefits of FP-tree Structure
Completeness:
Preserve complete DB information for frequent pattern mining (given a prior min_support)
Each transaction is mapped onto one FP-tree path; counts are stored at each node
Compactness
One FP-tree path may correspond to multiple transactions; the tree is never larger than the original database (not counting node-links and count fields)
Reduce irrelevant information: infrequent items are gone
Frequency-descending ordering: more frequent items are closer to the tree top and more likely to be shared
How Effective Is FP-tree?
[Chart: size (log scale, 1 to 100,000) vs. support threshold (0% to 100%) on the Connect-4 dataset (a dense dataset), comparing Tran. DB, Freq. Tran. DB, Alphabetical FP-tree, and Ordered FP-tree]
Mining Frequent Patterns Using FP-tree
General idea (divide-and-conquer)
Recursively grow frequent pattern path using FP-tree
Frequent patterns can be partitioned into subsets according to L-order
L-order=f-c-a-b-m-p
Patterns containing p
Patterns having m but not p
Patterns having b but neither m nor p
…
Patterns having c but none of a, b, m, p
Pattern f
Mining Frequent Patterns Using FP-tree
Step 1 : Construct conditional pattern base for each item in header table
Step 2: Construct conditional FP-tree from each conditional pattern-base
Step 3: Recursively mine conditional FP-trees and grow frequent patterns obtained so far
If a conditional FP-tree contains only a single path, simply enumerate all combinations of its items as patterns
Step 1: Construct Conditional Pattern Base
Start at the header table of the FP-tree
Traverse the FP-tree by following the node-links of each frequent item
Accumulate all transformed prefix paths of that item to form its conditional pattern base
[Figure: the FP-tree and header table built above]

Conditional pattern bases (item → conditional pattern base):
c: f:3
a: fc:3
b: fca:1, f:1, c:1
m: fca:2, fcab:1
p: fcam:2, cb:1
Step 2: Construct Conditional FP-tree
For each conditional pattern base:
Accumulate the count of each item in the base
Construct an FP-tree over the items of the base that are still frequent
Example: p's conditional pattern base is {fcam:2, cb:1}; item counts within it are f:2, c:3, a:2, m:2, b:1. With min_sup = 50% (# transactions = 5, so count ≥ 3), only c survives, and p's conditional FP-tree is the single path {} → c:3
Mining Frequent Patterns by Creating Conditional Pattern-Bases

Item   Conditional pattern-base       Conditional FP-tree
p      {(fcam:2), (cb:1)}             {(c:3)}|p
m      {(fca:2), (fcab:1)}            {(f:3, c:3, a:3)}|m
b      {(fca:1), (f:1), (c:1)}        Empty
a      {(fc:3)}                       {(f:3, c:3)}|a
c      {(f:3)}                        {(f:3)}|c
f      Empty                          Empty
Step 3: Recursively mine conditional FP-tree
Collect all patterns that end with p:
FP: p(3); conditional pattern base: fcam:2, cb:1; conditional FP-tree: the single node c(3)
Grow the suffix to cp(3): FP: cp(3); its conditional pattern base is empty, so the recursion stops
Step 3: Recursively mine conditional FP-tree
Collect all patterns that end with m:
FP: m(3); conditional pattern base: fca:2, fcab:1; conditional FP-tree: the single path f(3) → c(3) → a(3)
Grow the suffix with each item on that path: fm(3), cm(3), am(3)
FP: fm(3); conditional pattern base: nil
FP: cm(3); conditional pattern base: f:3; conditional FP-tree: f(3); grows to fcm(3), whose pattern base is nil
The suffix am(3) is continued on the next slide
Collect all patterns that end at m (cont’d)
FP: am(3); conditional pattern base: fc:3; conditional FP-tree: the path f(3) → c(3)
FP: fam(3); conditional pattern base: nil
FP: cam(3); conditional pattern base: f:3; conditional FP-tree: f(3); grows to fcam(3), whose pattern base is nil
Patterns ending with m: m, fm, cm, am, fcm, cam, fam, fcam (each with support 3)
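A compact Python sketch of this recursion. For brevity it recurses directly on conditional pattern bases (lists of (prefix, count) pairs) rather than building explicit conditional FP-trees, but it enumerates the same patterns; the ordered transactions are those from the construction slides:

```python
from collections import Counter

def mine(pattern_base, suffix, min_sup, results):
    """pattern_base: list of (prefix_items, count); suffix: tuple of items already fixed."""
    counts = Counter()
    for prefix, cnt in pattern_base:
        for item in prefix:
            counts[item] += cnt
    for item, cnt in counts.items():
        if cnt < min_sup:
            continue
        grown = (item,) + suffix          # grow the frequent pattern
        results[grown] = cnt
        # conditional pattern base for the grown suffix: whatever precedes `item` in each prefix
        new_base = [(prefix[:prefix.index(item)], c)
                    for prefix, c in pattern_base if item in prefix]
        mine(new_base, grown, min_sup, results)

db = [['f','c','a','m','p'], ['f','c','a','b','m'], ['f','b'],
      ['c','b','p'], ['f','c','a','m','p']]
results = {}
mine([(t, 1) for t in db], (), 3, results)
print(results[('p',)], results[('c', 'p')])       # 3 3
print(results[('f', 'c', 'a', 'm')])              # 3
print(len(results))                               # 18 frequent patterns in total
```

Building the conditional FP-tree explicitly, as the slides do, adds compression when many prefixes share items, but the recursion structure is the same.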
FP-growth vs. Apriori: Scalability With the Support Threshold
[Chart: run time (y-axis 0 to 100) vs. support threshold (0% to 3%) on data set T25I20D10K, comparing D1 FP-growth runtime and D1 Apriori runtime]
Why Is Frequent Pattern Growth Fast?
Performance study shows
FP-growth is an order of magnitude faster than Apriori
Reasoning
No candidate generation, no candidate test
Use compact data structure
Eliminate repeated database scan
Basic operations are counting and FP-tree building
Weaknesses of FP-growth
Support-dependent: cannot accommodate a dynamically changing support threshold
Cannot accommodate incremental DB updates
Mining requires recursive operations
Maximal patterns and Border
Maximal patterns: a frequent itemset X is maximal if none of its supersets is frequent
Maximal Patterns = {AD, ACE, BCDE}
20 frequent patterns, but only 3 maximal patterns!
The 3 maximal patterns can be used to represent all 20 patterns
[Figure: the itemset lattice over {A, B, C, D, E}, from the empty set (null) up to ABCDE, with the border separating the frequent itemsets from the infrequent ones]
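A minimal Python sketch of the maximality check; the list of 20 frequent itemsets below is the downward closure of {AD, ACE, BCDE} implied by the lattice on this slide:

```python
def maximal_patterns(frequent_itemsets):
    """Keep only itemsets that have no frequent proper superset."""
    fs = [frozenset(x) for x in frequent_itemsets]
    return [x for x in fs if not any(x < y for y in fs)]

# The 20 frequent itemsets whose maximal patterns are AD, ACE, BCDE
freq = ['A', 'B', 'C', 'D', 'E',
        'AC', 'AD', 'AE', 'BC', 'BD', 'BE', 'CD', 'CE', 'DE',
        'ACE', 'BCD', 'BCE', 'BDE', 'CDE', 'BCDE']
maximal = maximal_patterns(freq)
print(sorted(''.join(sorted(m)) for m in maximal))   # ['ACE', 'AD', 'BCDE']
```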
Closed Itemsets
Tid   Itemset
1     A, C, T, W
2     C, D, W
3     A, C, T, W
4     A, C, D, W
5     A, C, D, T, W
6     C, D, T

Example:
AC is not closed, since every transaction containing AC also contains W
CDW is closed, since transaction Tid=2 contains no other item
Frequent Closed Patterns
For frequent itemset X, if there exists no item y s.t. every transaction containing X also contains y, then X is a frequent closed pattern
“acdf” is a frequent closed pattern
Concise rep. of freq pats
Reduce # of patterns and rules
N. Pasquier et al. In ICDT’99
Min_sup=2
TID Items
10 a, c, d, e, f
20 a, b, e
30 c, e, f
40 a, c, d, f
50 c, e, f
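A minimal Python sketch of this closedness test on the table above (the function name is mine):

```python
def is_frequent_closed(itemset, db, min_sup):
    """X is frequent closed if sup(X) >= min_sup and no extra item appears
    in every transaction that contains X."""
    x = set(itemset)
    supporting = [t for t in db if x <= t]
    if len(supporting) < min_sup:
        return False
    # items common to all supporting transactions; X is closed iff nothing extra is common
    return set.intersection(*supporting) == x

db = [{'a','c','d','e','f'},   # TID 10
      {'a','b','e'},           # TID 20
      {'c','e','f'},           # TID 30
      {'a','c','d','f'},       # TID 40
      {'c','e','f'}]           # TID 50

print(is_frequent_closed('acdf', db, min_sup=2))   # True
print(is_frequent_closed('ac', db, min_sup=2))     # False: every transaction with {a,c} also has d and f
```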