Instructor: Qiang Yang
Thanks: J.Han and J. Pei
Multiple database scans are costly
Mining long patterns needs many passes of scanning and generates lots of candidates
To find frequent itemset $i_1 i_2 \ldots i_{100}$
# of scans: 100
# of candidates: $\binom{100}{1} + \binom{100}{2} + \cdots + \binom{100}{100} = 2^{100} - 1 \approx 1.27 \times 10^{30}$!
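The candidate count can be checked numerically. The following is a minimal sketch using only Python's standard `math.comb`; the variable name `num_candidates` is chosen here for illustration.

```python
import math

# Sum of C(100, k) for k = 1..100: every non-empty subset of the
# 100-item pattern is a candidate that Apriori would generate.
num_candidates = sum(math.comb(100, k) for k in range(1, 101))

# The binomial identity gives the closed form 2^100 - 1.
assert num_candidates == 2**100 - 1
print(f"{num_candidates:.3e}")  # prints 1.268e+30
```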
Bottleneck: candidate-generation-and-test
Can we avoid candidate generation?
Frequent-pattern mining methods 2
FP-growth: Frequent-pattern Mining
Without Candidate Generation
Heuristic: let P be a frequent itemset, S be the set of transactions that contain P, and x be an item. If x is a frequent item in S, then {x} ∪ P must be a frequent itemset
No candidate generation!
A compact data structure, FP-tree, to store information for frequent pattern mining
Recursive mining algorithm for mining complete set of frequent patterns
TID   Items Bought
100   f,a,c,d,g,i,m,p
200   a,b,c,f,l,m,o
300   b,f,h,j,o
400   b,c,k,s,p
500   a,f,c,e,l,p,m,n
Min Support = 3
List of frequent items, sorted: (item:support)
<(f:4), (c:4), (a:3),(b:3),(m:3),(p:3)>
The root of the tree is created and labeled with “{}”
Scan the database
Scanning the first transaction leads to the first branch of the tree:
<(f:1),(c:1),(a:1),(m:1),(p:1)>
Order according to frequency
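The first scan and the reordering step can be sketched as follows. This is an illustrative sketch, not the slides' own code: the names `f_list`, `order_transaction`, and `TIE_BREAK` are assumptions, and ties in support are broken in the order printed on the slide (f before c, then a, b, m, p), which is an arbitrary but fixed choice.

```python
from collections import Counter

# The five transactions of the running example (TIDs 100-500).
transactions = [
    "f,a,c,d,g,i,m,p", "a,b,c,f,l,m,o", "b,f,h,j,o",
    "b,c,k,s,p", "a,f,c,e,l,p,m,n",
]
transactions = [t.split(",") for t in transactions]
MIN_SUPPORT = 3
# Assumption: equal supports are ordered as on the slide.
TIE_BREAK = "fcabmp"

# Scan 1: count every item and keep those that meet the minimum support.
counts = Counter(item for t in transactions for item in t)
frequent = {i: n for i, n in counts.items() if n >= MIN_SUPPORT}
f_list = sorted(frequent, key=lambda i: (-frequent[i], TIE_BREAK.index(i)))

def order_transaction(t):
    """Drop infrequent items and sort the rest by descending support."""
    return [i for i in f_list if i in t]

print([(i, frequent[i]) for i in f_list])
print(order_transaction(transactions[0]))  # the first branch of the tree
```

Every transaction is reordered the same way before it is inserted, which is what lets transactions with common frequent items share a prefix in the tree.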
Transaction Database
TID   Items
100   f,a,c,d,g,i,m,p
Header Table (item, count, node head): one entry each for f, c, a, m, p
FP-tree after TID 100: {} (root) → f:1 → c:1 → a:1 → m:1 → p:1
TID   Items Bought
100   f,a,c,d,g,i,m,p
200   a,b,c,f,l,m,o
300   b,f,h,j,o
400   b,c,k,s,p
500   a,f,c,e,l,p,m,n
Frequent Single Items:
F1=<f,c,a,b,m,p>
TID=200
Frequent items in this transaction (intersect with F1): f,c,a,b,m
Shared prefix with the first branch <f,c,a,m,p>: <f,c,a>
Create two new nodes below <a>: <b> and, as its child, <m>
Transaction Database
TID   Items
200   f,c,a,b,m
Header Table (item, count, node head): one entry each for f, c, a, b, m, p
FP-tree after TID 200:
{} (root)
  f:2
    c:2
      a:2
        m:1
          p:1
        b:1
          m:1
Transaction Database
TID   Items
100   f,a,c,d,g,i,m,p
200   a,b,c,f,l,m,o
300   b,f,h,j,o
400   b,c,k,s,p
500   a,f,c,e,l,p,m,n
Min support = 3
Frequent 1-items in frequency descending order: f,c,a,b,m,p
Header Table (item, count, node head): one entry each for f, c, a, b, m, p
FP-tree after all five transactions:
{} (root)
  f:4
    c:3
      a:3
        m:2
          p:2
        b:1
          m:1
    b:1
  c:1
    b:1
      p:1
Scans the database only twice
Subsequent mining: based on the FP-tree
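The second scan, which inserts each reordered transaction into the tree, might look like the sketch below. The `Node` class, `build_fp_tree`, and the use of a plain list per item in place of linked node heads are illustrative assumptions, not the slides' data layout.

```python
class Node:
    """One FP-tree node: an item, its count, and links to parent and children."""
    def __init__(self, item, parent=None):
        self.item, self.count, self.parent = item, 0, parent
        self.children = {}

def build_fp_tree(ordered_transactions):
    """Insert each frequency-ordered transaction as a path from the root."""
    root = Node(None)
    header = {}  # item -> list of nodes (stand-in for the node-link chain)
    for t in ordered_transactions:
        node = root
        for item in t:
            if item not in node.children:
                child = Node(item, node)
                node.children[item] = child
                header.setdefault(item, []).append(child)
            node = node.children[item]
            node.count += 1  # shared prefixes only increment counts
    return root, header

# Transactions after scan 1: infrequent items dropped, order f,c,a,b,m,p.
ordered = [list("fcamp"), list("fcabm"), list("fb"), list("cbp"), list("fcamp")]
root, header = build_fp_tree(ordered)

def show(node, depth=0):
    """Print the tree with one node per line, indented by depth."""
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)

show(root)
```

The database is read once to count items and once to build the tree; all later mining works on `root` and `header` only.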
Step 1: form conditional pattern base
Step 2: construct conditional FP-tree
Step 3: recursively mine conditional FP-trees
Let {I} be a frequent item
Its conditional pattern base is a sub-database that consists of the set of prefix paths in the FP-tree with item {I} as a co-occurring suffix pattern
Example:
{m} is a frequent item
{m}’s conditional pattern base:
<f,c,a>: support =2
<f,c,a,b>: support = 1
Mine recursively on such databases
(Figure: the FP-tree of the running example, in which m occurs on the paths f:4, c:3, a:3, m:2 and f:4, c:3, a:3, b:1, m:1.)
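A conditional pattern base can be computed with a few lines of code. The sketch below makes one simplifying assumption: instead of following node-links in the tree, it reads the prefix in front of the item in each frequency-ordered transaction. Each ordered transaction corresponds to exactly one root path, so both routes give the same prefixes with the same counts. The function name is hypothetical.

```python
from collections import Counter

# Frequency-ordered transactions of the running example (order f,c,a,b,m,p).
ordered = [list("fcamp"), list("fcabm"), list("fb"), list("cbp"), list("fcamp")]

def conditional_pattern_base(item, ordered_transactions):
    """Map each non-empty prefix path that precedes `item` to its count."""
    base = Counter()
    for t in ordered_transactions:
        if item in t:
            prefix = tuple(t[:t.index(item)])
            if prefix:  # an empty prefix carries no pattern information
                base[prefix] += 1
    return dict(base)

print(conditional_pattern_base("m", ordered))
print(conditional_pattern_base("p", ordered))
```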
Let {I} be a suffix item, {DB|I} be the conditional pattern base
The FP-tree Tree_I built from {DB|I} is known as the conditional pattern tree
Example:
{m} is a frequent item
{m}’s conditional pattern base:
<f,c,a>: support = 2
<f,c,a,b>: support = 1
{m}’s conditional pattern tree: {} → f:3 → c:3 → a:3 (b is dropped because its support in the base is 1, below the minimum support of 3)
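To see which items survive into {m}'s conditional pattern tree, one can count item supports inside the conditional pattern base and drop the infrequent ones. This is a small sketch under the slides' minimum support of 3; the function and variable names are hypothetical.

```python
from collections import Counter

MIN_SUPPORT = 3
# {m}'s conditional pattern base from the example above.
m_base = {("f", "c", "a"): 2, ("f", "c", "a", "b"): 1}

def frequent_items_in_base(base, min_support):
    """Support of each item within the base, keeping only frequent items."""
    counts = Counter()
    for prefix, n in base.items():
        for item in prefix:
            counts[item] += n
    return {i: n for i, n in counts.items() if n >= min_support}

kept = frequent_items_in_base(m_base, MIN_SUPPORT)

# Remove infrequent items from every path and merge identical paths.
pruned_paths = Counter()
for prefix, n in m_base.items():
    pruned_paths[tuple(i for i in prefix if i in kept)] += n

print(kept)                # item supports inside the base
print(dict(pruned_paths))  # the paths that make up the conditional tree
```

Both prefix paths collapse into the single path f, c, a with count 3, since b occurs only once in the base.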
Let α be a frequent item in DB, B be α’s conditional pattern base, and β be an itemset in B.
Then α + β is frequent in DB if and only if β is frequent in B.
Example:
Starting with α = {p}
{p}’s conditional pattern base (from the tree) B =
(f,c,a,m): 2
(c,b): 1
Let β be {c}.
Then α + β = {p,c}, with support = 3.
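The property can be checked on the running example by counting the same itemset in two places: inside {p}'s conditional pattern base, and together with p in the full database. The helper names below are hypothetical.

```python
# Full transaction database of the running example, one set per transaction.
transactions = [set("facdgimp"), set("abcflmo"), set("bfhjo"),
                set("bcksp"), set("afcelpmn")]
# {p}'s conditional pattern base: prefix path -> count.
p_base = {("f", "c", "a", "m"): 2, ("c", "b"): 1}

def support_in_db(itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def support_in_base(itemset, base):
    """Total count of the prefix paths that contain every item of `itemset`."""
    return sum(n for prefix, n in base.items() if itemset <= set(prefix))

# {c} inside the base and {p, c} in the database have the same support.
print(support_in_base({"c"}, p_base), support_in_db({"p", "c"}))
# The same holds for an infrequent extension such as {f}.
print(support_in_base({"f"}, p_base), support_in_db({"p", "f"}))
```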
Let P be a single-path FP-tree
Let $\{I_1, I_2, \ldots, I_k\}$ be an itemset in the tree
Let $I_j$ have the lowest support
Then $\mathrm{support}(\{I_1, I_2, \ldots, I_k\}) = \mathrm{support}(I_j)$
Example:
(Figure: the FP-tree of the running example.)
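For a single-path tree, every combination of nodes is a pattern whose support is the smallest count among the chosen nodes. The sketch below applies this to the single path f:3, c:3, a:3, which is what results from summing the counts in {m}'s conditional pattern base and dropping b; the second path with unequal counts is a hypothetical input added only to show the minimum at work.

```python
from itertools import combinations

def single_path_patterns(path):
    """All non-empty node combinations of a single path, with their supports."""
    patterns = {}
    for k in range(1, len(path) + 1):
        for combo in combinations(path, k):
            items = frozenset(item for item, _ in combo)
            # The support is the lowest count among the chosen nodes.
            patterns[items] = min(count for _, count in combo)
    return patterns

m_path = [("f", 3), ("c", 3), ("a", 3)]
patterns = single_path_patterns(m_path)
print(len(patterns), set(patterns.values()))

# Hypothetical path with unequal counts, for illustration only.
uneven = single_path_patterns([("x", 5), ("y", 4), ("z", 2)])
print(uneven[frozenset("xy")], uneven[frozenset("xz")])
```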
Recursive Algorithm
Input: A transaction database, min_supp
Output: The complete set of frequent patterns
1. FP-Tree construction
2. Mining FP-Tree by calling FP_growth(FP_tree, null)
Key Idea: consider single path FP-tree and multi-path FP-tree separately
Keep splitting until a single-path FP-tree is obtained
Procedure FP-growth(Tree, α)
If Tree contains a single path P, then
  For each combination (denoted as β) of the nodes in the path P, do
    Generate pattern β + α with support = minimum support of the nodes in β
Else for each a_i in the header of Tree, do {
  Generate pattern β = a_i + α with support = a_i.support;
  Construct
  (1) β’s conditional pattern base and
  (2) β’s conditional FP-tree Tree_β
  If Tree_β is not empty, then
    Call FP-growth(Tree_β, β);
}
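The recursion can be tried out with a compact sketch. It is a simplification rather than the procedure as stated: conditional pattern bases are kept as plain lists of (items, count) pairs instead of compact FP-trees, the single-path shortcut is omitted, and ties in support are broken alphabetically. The name `fp_growth` and its parameters are chosen here for illustration. The pattern-growth idea is the same: extend the current suffix by each locally frequent item, then recurse on that item's conditional pattern base.

```python
from collections import Counter

def fp_growth(pattern_base, min_sup, suffix=frozenset()):
    """Return {itemset: support} for all frequent itemsets that extend `suffix`.

    pattern_base is a list of (items, count) pairs.
    """
    # Count items in this (conditional) database and keep the frequent ones.
    counts = Counter()
    for items, n in pattern_base:
        for item in set(items):
            counts[item] += n
    frequent = {i: c for i, c in counts.items() if c >= min_sup}
    order = sorted(frequent, key=lambda i: (-frequent[i], i))
    rank = {item: r for r, item in enumerate(order)}

    # Reorder every entry by descending support, dropping infrequent items.
    ordered = [(sorted((i for i in set(items) if i in rank), key=rank.get), n)
               for items, n in pattern_base]

    patterns = {}
    for item in order:
        beta = suffix | {item}
        patterns[beta] = frequent[item]
        # Conditional pattern base of beta: the prefixes in front of `item`.
        cond_base = [(t[:t.index(item)], n) for t, n in ordered if item in t]
        cond_base = [(p, n) for p, n in cond_base if p]
        patterns.update(fp_growth(cond_base, min_sup, beta))
    return patterns

db = [(t.split(","), 1) for t in [
    "f,a,c,d,g,i,m,p", "a,b,c,f,l,m,o", "b,f,h,j,o",
    "b,c,k,s,p", "a,f,c,e,l,p,m,n",
]]
result = fp_growth(db, min_sup=3)
for itemset, support in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print("".join(sorted(itemset)), support)
```

No candidate itemsets are generated and tested against the database here; each pattern is produced directly from counts in a smaller conditional database.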
FP-Growth vs. Apriori: Scalability With the Support Threshold
Data set T25I20D10K
(Figure: D1 FP-growth runtime and D1 Apriori runtime, on a vertical axis from 0 to 100, plotted against support threshold (%) from 0 to 3.)
FP-Growth vs. Tree-Projection: Scalability with the Support Threshold
Data set T25I20D100K
(Figure: curves for D2 FP-growth and D2 TreeProjection; horizontal axis: support threshold (%), 0 to 2; vertical axis: 0 to 140.)
Divide-and-conquer:
Decompose both the mining task and DB according to the frequent patterns obtained so far
Leads to focused search of smaller databases
Other factors
No candidate generation, no candidate test
Compressed database: FP-tree structure
No repeated scan of the entire database
Basic operations are counting and FP-tree building, not pattern search and matching
Mining closed frequent itemsets and max-patterns
CLOSET (DMKD’00)
Mining sequential patterns
FreeSpan (KDD’00), PrefixSpan (ICDE’01)
Constraint-based mining of frequent patterns
Convertible constraints (KDD’00, ICDE’01)
Computing iceberg data cubes with complex measures
H-tree and H-cubing algorithm (SIGMOD’01)
Multi-level, quantitative association rules, correlation and causality, ratio rules, sequential patterns, emerging patterns, temporal associations, partial periodicity
Classification, clustering, iceberg cubes, etc.
Items often form a hierarchy
Flexible support settings: items at the lower level are expected to have lower support.
Transaction database can be encoded based on dimensions and levels
Explore shared multi-level mining
Item        Level   Support   min_sup (uniform support)   min_sup (reduced support)
Milk        1       10%       5%                          5%
2% Milk     2       6%        5%                          3%
Skim Milk   2       4%        5%                          3%
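The effect of the two support settings on the milk example can be worked through in a few lines. This is only an illustrative sketch; the dictionary layout and the function name are assumptions.

```python
# item -> (hierarchy level, support), taken from the milk example.
supports = {"Milk": (1, 0.10), "2% Milk": (2, 0.06), "Skim Milk": (2, 0.04)}

# Minimum support per level under the two settings.
uniform = {1: 0.05, 2: 0.05}
reduced = {1: 0.05, 2: 0.03}

def frequent_items(min_sup_by_level):
    """Items whose support reaches the threshold of their own level."""
    return [name for name, (level, sup) in supports.items()
            if sup >= min_sup_by_level[level]]

print(frequent_items(uniform))  # Skim Milk is missed at 5%
print(frequent_items(reduced))  # Skim Milk passes the lower level-2 threshold
```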
Quantitative Association Rules
Numeric attributes are dynamically discretized
Such that the confidence or compactness of the rules mined is maximized.
2-D quantitative association rules: A_quan1 ∧ A_quan2 ⇒ A_cat
Cluster “adjacent” association rules to form general rules using a 2-D grid.
Example: age(X,”34-35”) ∧ income(X,”30K-50K”) ⇒ buys(X,”high resolution TV”)
Which rule is redundant?
milk ⇒ wheat bread, [support = 8%, confidence = 70%]
“skim milk” ⇒ wheat bread, [support = 2%, confidence = 72%]
The first rule is more general than the second rule.
A rule is redundant if its support is close to the “expected” value, based on a general rule, and its confidence is close to that of the general rule.
Rules in DB were found, and then a set of new tuples, db, is added to DB
Task: to find new rules in DB + db.
Usually, DB is much larger than db.
Properties of itemsets:
An itemset is frequent in DB + db if it is frequent in both DB and db.
An itemset is infrequent in DB + db if it is infrequent in both DB and db.
If an itemset is frequent only in DB, merge its stored count with its count in db. No DB scan is needed!
If an itemset is frequent only in db, scan DB once to update its count.
Same principle applicable to distributed/parallel mining.
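The two update cases can be sketched as below. All data here are hypothetical toy values, and `updated_count` and `old_counts` are illustrative names; the sketch assumes that the counts of itemsets found frequent in DB were stored when DB was first mined.

```python
def count(itemset, transactions):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def updated_count(itemset, old_counts, DB, db):
    """Return (count in DB + db, whether DB had to be scanned)."""
    if itemset in old_counts:
        # Frequent in DB: its count is already stored, so only db is read.
        return old_counts[itemset] + count(itemset, db), False
    # No stored count: one scan of DB is needed.
    return count(itemset, DB) + count(itemset, db), True

# Hypothetical toy data: DB is the old database, db the newly added tuples.
DB = [{"a", "b"}, {"a", "c"}, {"a", "b"}, {"b", "c"}]
db = [{"a", "c"}, {"c"}]
# Counts stored from the earlier mining run on DB.
old_counts = {frozenset("a"): 3, frozenset("ab"): 2}

total_a, scanned_a = updated_count(frozenset("a"), old_counts, DB, db)
total_ac, scanned_ac = updated_count(frozenset("ac"), old_counts, DB, db)
print(total_a, scanned_a)    # merged from the stored count
print(total_ac, scanned_ac)  # required a scan of DB
```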
Association does not measure correlation [BMS97, AY98].
Among 5000 students
3000 play basketball, 3750 eat cereal, 2000 do both
play basketball ⇒ eat cereal [40%, 66.7%]
Conclusion: “basketball and cereal are correlated” is misleading because the overall percentage of students eating cereal is 75%, higher than 66.7%.
Confidence does not always give the correct picture!
$$\frac{P(A \wedge B)}{P(A)\,P(B)} = \frac{P(B \mid A)\,P(A)}{P(A)\,P(B)} = \frac{P(B \mid A)}{P(B)}$$
P(A ∧ B) = P(B) * P(A), if A and B are independent events, so the ratio is then 1
If the value is less than 1, A and B are negatively correlated; if it is greater than 1, A and B are positively correlated.
P(B|A)/P(B) is known as the lift of the rule A ⇒ B
If less than one, then B and A are negatively correlated.
Basketball ⇒ Cereal: $\frac{2000}{3000 \cdot 3750 / 5000} = \frac{2000 \cdot 5000}{3000 \cdot 3750} \approx 0.89 < 1$
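The basketball and cereal figures can be worked through directly. A minimal sketch, with variable names chosen for illustration:

```python
n, basketball, cereal, both = 5000, 3000, 3750, 2000

support = both / n              # P(basketball and cereal)
confidence = both / basketball  # P(cereal | basketball)
p_cereal = cereal / n           # P(cereal)
lift = confidence / p_cereal    # P(cereal | basketball) / P(cereal)

print(f"support={support:.1%} confidence={confidence:.1%} "
      f"P(cereal)={p_cereal:.1%} lift={lift:.3f}")
```

The rule looks strong on support and confidence, yet the lift is below 1: playing basketball makes eating cereal slightly less likely than it is in the whole population.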
             Item1   not Item1   row sum
Item2          1         2          3
not Item2      4         2          6
column sum     5         4          9
$$\chi^2 = \frac{(1 - 3 \cdot 5/9)^2}{3 \cdot 5/9} + \frac{(2 - 3 \cdot (9-5)/9)^2}{3 \cdot (9-5)/9} + \frac{(4 - (9-3) \cdot 5/9)^2}{(9-3) \cdot 5/9} + \frac{(2 - (9-3) \cdot (9-5)/9)^2}{(9-3) \cdot (9-5)/9} = 0.9$$
The cutoff value at 95% significance level is 3.84 > 0.9
Thus, we do not reject the independence assumption.
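The statistic can be recomputed from the contingency table. The sketch below uses plain Python with no statistics library; the 3.84 cutoff is the value quoted above, and the variable names are illustrative.

```python
# Observed counts: rows are Item2 / not Item2, columns are Item1 / not Item1.
observed = [[1, 2],
            [4, 2]]

row = [sum(r) for r in observed]        # row sums
col = [sum(c) for c in zip(*observed)]  # column sums
n = sum(row)                            # grand total

# Under independence the expected count of a cell is row sum * column sum / n.
chi2 = sum(
    (observed[i][j] - row[i] * col[j] / n) ** 2 / (row[i] * col[j] / n)
    for i in range(2) for j in range(2)
)

CUTOFF_95 = 3.84  # cutoff quoted on the slide
print(round(chi2, 3), chi2 < CUTOFF_95)
```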
Finding all the patterns in a database autonomously ? — unrealistic!
The patterns could be too many but not focused!
Data mining should be an interactive process
User directs what to be mined using a data mining query language (or a graphical user interface)
Constraint-based mining
User flexibility: provides constraints on what to be mined
System optimization: explores such constraints for efficient mining— constraint-based mining
Knowledge type constraint :
classification, association, etc.
Data constraint — using SQL-like queries
find product pairs sold together in stores in Vancouver in Dec.’00
Dimension/level constraint
in relevance to region, price, brand, customer category
Rule (or pattern) constraint
small sales (price < $10) triggers big sales (sum > $200)
Interestingness constraint
strong rules: min_support ≥ 3%, min_confidence ≥ 60%
Constrained mining vs. constraint-based search/reasoning
Both are aimed at reducing search space
Finding all patterns satisfying constraints vs. finding some (or one) answer in constraint-based search in AI
Constraint-pushing vs. heuristic search
It is an interesting research problem on how to integrate them
Constrained mining vs. query processing in DBMS
Database query processing requires finding all answers
Constrained pattern mining shares a similar philosophy with pushing selections deeply in query processing
Constrained Frequent Pattern Mining: A
Mining Query Optimization Problem
Given a frequent pattern mining query with a set of constraints C, the algorithm should be
sound: it only finds frequent sets that satisfy the given constraints C
complete : all frequent sets satisfying the given constraints C are found
A naïve solution
First find all frequent sets, and then test them for constraint satisfaction
More efficient approaches:
Analyze the properties of constraints comprehensively
Push them as deeply as possible inside the frequent pattern computation.
Anti-monotonicity
When an itemset S satisfies the constraint, so does any of its subsets
sum(S.Price) ≤ v is anti-monotone
sum(S.Price) ≥ v is not anti-monotone
Example. C: range(S.profit) ≤ 15 is anti-monotone
Itemset ab violates C
So does every superset of ab
TDB (min_sup=2)
TID   Transaction
10    a, b, c, d, f
20    b, c, d, f, g, h
30    a, c, d, e, f
40    c, e, f, g
Item   Profit
a      40
b      0
c      -20
d      10
e      -30
f      30
g      20
h      -10
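The anti-monotone example can be verified by brute force over the profit table. A small sketch, with the helper name `satisfies` and the enumeration chosen for illustration:

```python
from itertools import combinations

profit = {"a": 40, "b": 0, "c": -20, "d": 10,
          "e": -30, "f": 30, "g": 20, "h": -10}

def satisfies(itemset, v=15):
    """Constraint C: range(S.profit) <= v."""
    values = [profit[i] for i in itemset]
    return max(values) - min(values) <= v

# Every superset of {a, b}: add any subset of the remaining six items.
supersets_of_ab = [set("ab") | set(extra)
                   for k in range(7)
                   for extra in combinations("cdefgh", k)]

print(satisfies("ab"))                              # range is 40 - 0 = 40
print(any(satisfies(s) for s in supersets_of_ab))   # no superset recovers
```

Adding items can only widen the range, so once ab violates C the whole branch above it can be pruned without counting support.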
Constraint                          Antimonotone
v ∈ S                               no
S ⊇ V                               no
S ⊆ V                               yes
min(S) ≤ v                          no
min(S) ≥ v                          yes
max(S) ≤ v                          yes
max(S) ≥ v                          no
count(S) ≤ v                        yes
count(S) ≥ v                        no
sum(S) ≤ v (∀a ∈ S, a ≥ 0)          yes
sum(S) ≥ v (∀a ∈ S, a ≥ 0)          no
range(S) ≤ v                        yes
range(S) ≥ v                        no
avg(S) θ v, θ ∈ {=, ≤, ≥}           convertible
support(S) ≥ ξ                      yes
support(S) ≤ ξ                      no
Monotonicity
When an itemset S satisfies the constraint, so does any of its supersets
sum(S.Price) ≥ v is monotone
min(S.Price) ≤ v is monotone
Example. C: range(S.profit) ≥ 15
Itemset ab satisfies C
So does every superset of ab
TDB (min_sup=2)
TID   Transaction
10    a, b, c, d, f
20    b, c, d, f, g, h
30    a, c, d, e, f
40    c, e, f, g
Item   Profit
a      40
b      0
c      -20
d      10
e      -30
f      30
g      20
h      -10
Constraint                          Monotone
v ∈ S                               yes
S ⊇ V                               yes
S ⊆ V                               no
min(S) ≤ v                          yes
min(S) ≥ v                          no
max(S) ≤ v                          no
max(S) ≥ v                          yes
count(S) ≤ v                        no
count(S) ≥ v                        yes
sum(S) ≤ v (∀a ∈ S, a ≥ 0)          no
sum(S) ≥ v (∀a ∈ S, a ≥ 0)          yes
range(S) ≤ v                        no
range(S) ≥ v                        yes
avg(S) θ v, θ ∈ {=, ≤, ≥}           convertible
support(S) ≥ ξ                      no
support(S) ≤ ξ                      yes
We will not consider these in this course.
Mine possible association rules in the form itemset ⇒ class
Itemset: a set of attribute-value pairs
Class: class label
Build Classifier
Organize rules according to decreasing precedence based on confidence and support
B. Liu, W. Hsu & Y. Ma. Integrating classification and association rule mining. In KDD’98
Emerging pattern (EP): A pattern frequent in one class of data but infrequent in others.
Age<=30 is frequent in class “buys_computer=yes” and infrequent in class “buys_computer=no”
Rule: age<=30 ⇒ buys computer
G. Dong & J. Li. Efficient mining of emerging patterns: discovering trends and differences. In KDD’99