Mining Association Rules with Constraints

Mining Association Rules with Constraints
Wei Ning
Joon Wong
COSC 6412 Presentation
1
Outline

- Introduction
- Summary of Approach
- Algorithm CAP
- Performance Analysis
- Conclusion
- References
2
Introduction

- Recall association rule mining
- Association rule mining finds interesting association or correlation relationships among the items in a large data set.
4
Some problems encountered during association rule mining

- Overwhelming number of rules?
- Not what you want?
- Long waits?
- Lack of focus
5
Introduction (cont.)

- Example: Walmart
- Suppose a manager wants to find the most popular shoes in winter.
6
Outline

- Introduction
- Summary of Approach
- Algorithm CAP
- Performance Analysis
- Conclusion
- References
7
Mining frequent itemsets vs. mining association rules

- Mining frequent itemsets is essentially the same problem as mining association rules: once the frequent itemsets are found, the rules follow directly.
8
Constrained Mining

- A naive solution
  - First find all frequent sets, and then test them for constraint satisfaction.
- Our approach
  - Analyze the properties of constraints comprehensively.
  - Push them as deeply as possible inside the frequent-pattern computation.
9
Frequent Itemsets & Constraints

TDB (min_sup = 2)

  TID | Transaction
  ----+------------
  10  | a, b, c
  20  | b, c, d, f
  30  | a, c

  Item | Value
  -----+------
  a    |  40
  b    |  10
  c    | -20
  d    |  10
  e    | -30

- Given a transaction database
- Frequent itemset: a subset of items that frequently appears in transactions, e.g. {a, c}
- Constraint: a predicate over itemsets
  - C(I): sum(I) > 50
  - C(abd) = true
10
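Note (not from the original slides): a minimal Python sketch of evaluating this kind of itemset constraint, assuming the item values shown above.

```python
# Minimal sketch: evaluating the constraint C(I): sum(I) > 50
# over the item-value table from this slide.
ITEM_VALUE = {"a": 40, "b": 10, "c": -20, "d": 10, "e": -30}

def constraint_sum_gt(itemset, threshold=50):
    """Return True if the sum of the item values exceeds the threshold."""
    return sum(ITEM_VALUE[i] for i in itemset) > threshold

print(constraint_sum_gt({"a", "b", "d"}))  # True:  40 + 10 + 10 = 60 > 50
print(constraint_sum_gt({"a", "c"}))       # False: 40 - 20 = 20 <= 50
```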
Mining Frequent Itemsets With Constraints

- Given
  - A transaction database TDB
  - A support threshold min_sup
  - A constraint C
- Find the complete set of frequent itemsets satisfying the constraint
- Use the constraint to
  - Express the user's focus
  - Improve both effectiveness and efficiency
11
Classification of Constraints

- We have the following classification of constraints
  - Anti-monotone
  - Monotone
  - Succinct
  - Convertible
    - Convertible anti-monotone
    - Convertible monotone
    - Strongly convertible
  - Inconvertible
12
Anti-Monotone

- Definition 1 (Anti-Monotone): A 1-var constraint C is anti-monotone if for all sets S, S':
  S ⊇ S' & S satisfies C ⇒ S' satisfies C.
- Simply put, when an itemset S violates the constraint, so does every superset of S.
13
Is min(S) ≥ v anti-monotone?

- S = {5, 10, 14}, v = 7, constraint: min(S) ≥ 7
- {5} violates it (min = 5 < 7).
- Supersets of {5}: {5, 10}, {5, 14}, {5, 10, 14}
- They all violate it too, since their minimum is still 5.
- min(S) ≥ v is anti-monotone.
14
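Note (not from the original slides): a small Python sketch of how an anti-monotone constraint such as min(S) ≥ v enables pruning; once a set violates it, none of its supersets need to be examined.

```python
# Sketch: pruning with the anti-monotone constraint min(S) >= v.
# Items are numeric values, as in the {5, 10, 14} example above.
from itertools import combinations

def min_at_least(itemset, v):
    """The constraint min(S) >= v."""
    return min(itemset) >= v

def candidate_supersets(base, universe, size):
    """All supersets of `base` of the given size drawn from `universe`."""
    rest = [x for x in universe if x not in base]
    for extra in combinations(rest, size - len(base)):
        yield tuple(sorted(base + extra))

universe = (5, 10, 14)
v = 7

# {5} already violates min(S) >= 7, so anti-monotonicity lets us skip
# every superset of {5} without testing it.
if not min_at_least((5,), v):
    skipped = list(candidate_supersets((5,), universe, 2)) + \
              list(candidate_supersets((5,), universe, 3))
    print("pruned without testing:", skipped)
    # pruned without testing: [(5, 10), (5, 14), (5, 10, 14)]
```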
Succinct

- Definition 2 (Succinct)
  - I ⊆ Item is a succinct set if it can be expressed as σ_p(Item) for some selection predicate p.
  - SP ⊆ 2^Item is a succinct powerset if there is a fixed number of succinct sets Item1, …, Itemk ⊆ Item such that SP can be expressed in terms of the strict powersets of Item1, …, Itemk, using union and minus.
  - Finally, a 1-var constraint C is succinct provided SAT_C(Item) is a succinct powerset.
15
Succinct

- General idea: we can enumerate all and only those sets that are guaranteed to satisfy the constraint.
- If a constraint is succinct, we can directly generate precisely the sets that satisfy it.
16
Succinct examples

- Itemsets containing a or b
- Itemsets containing some item with value more than 30
17
Succinct example

- C1 ≡ Item.Price ≥ 100
  - Item1 = σ_{Item.Price ≥ 100}(Item) = {a, b}
  - 2^Item1 = {{a}, {b}, {a, b}}
  - SAT_C1(Item) = {{a}, {b}, {a, b}}
  - SAT_C1(Item) = 2^Item1
  - C1 is succinct
18
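Note (my own illustration, with hypothetical prices): a minimal Python sketch of the idea behind succinctness, generating exactly the itemsets that satisfy C1 ≡ Item.Price ≥ 100 instead of enumerating and testing all itemsets.

```python
# Sketch: a succinct constraint lets us generate satisfying itemsets directly.
# Prices here are hypothetical, chosen so that only a and b cost >= 100.
from itertools import combinations

PRICE = {"a": 120, "b": 150, "c": 30, "d": 45}

def sat_c1(items, min_price=100):
    """Directly enumerate all non-empty itemsets whose members
    all have price >= min_price (constraint C1)."""
    item1 = [i for i in items if PRICE[i] >= min_price]   # selection sigma_p(Item)
    for k in range(1, len(item1) + 1):                      # strict powerset of Item1
        for subset in combinations(item1, k):
            yield set(subset)

print(list(sat_c1(PRICE)))   # three sets: {a}, {b}, {a, b}
```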
Convertible

- Convert tough constraints into anti-monotone or monotone ones by properly ordering the items.
19
Convertible

- Definition:
  - R is an order of items
  - Convertible anti-monotone: itemset X satisfies the constraint ⇒ so does every prefix of X w.r.t. R
20
Convertible example

- Constraint C: avg(X) ≥ 25

  Original order          Value-descending order
  Item | Value            Item | Value
  a    |  40              a    |  40
  b    |   0              f    |  30
  c    | -20              g    |  20
  d    |  10              d    |  10
  e    | -30              b    |   0
  f    |  30              h    | -10
  g    |  20              c    | -20
  h    | -10              e    | -30

- Order items in value-descending order: <a, f, g, d, b, h, c, e>
- Itemset afd satisfies C (avg = (40 + 30 + 10) / 3 ≈ 26.7 ≥ 25)
- So do its prefixes a and af
- Thus, the constraint becomes anti-monotone!
21
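Note (my own illustration, using the values from this slide): a short Python sketch checking the convertible anti-monotone property of avg(X) ≥ 25 under the value-descending item order.

```python
# Sketch: avg(X) >= v becomes anti-monotone once items are listed in
# value-descending order, because every prefix of a satisfying itemset
# then satisfies the constraint as well.
VALUE = {"a": 40, "b": 0, "c": -20, "d": 10, "e": -30, "f": 30, "g": 20, "h": -10}

# Value-descending order R = <a, f, g, d, b, h, c, e>
R = sorted(VALUE, key=VALUE.get, reverse=True)

def avg_at_least(itemset, v=25):
    return sum(VALUE[i] for i in itemset) / len(itemset) >= v

def prefixes(itemset):
    """Prefixes of an itemset with respect to the order R."""
    ordered = [i for i in R if i in itemset]
    return [ordered[:k] for k in range(1, len(ordered) + 1)]

# Itemset afd satisfies C, and so does every prefix (a, af, afd).
for p in prefixes({"a", "f", "d"}):
    print("".join(p), avg_at_least(p))
# a True / af True / afd True
```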
Commonly Used Constraints — A General Picture

  Constraint                     | Anti-monotone | Monotone    | Succinct
  -------------------------------+---------------+-------------+---------
  v ∈ S                          | no            | yes         | yes
  S ⊇ V                          | no            | yes         | yes
  S ⊆ V                          | yes           | no          | yes
  min(S) ≤ v                     | no            | yes         | yes
  min(S) ≥ v                     | yes           | no          | yes
  max(S) ≤ v                     | yes           | no          | yes
  max(S) ≥ v                     | no            | yes         | yes
  count(S) ≤ v                   | yes           | no          | weakly
  count(S) ≥ v                   | no            | yes         | weakly
  sum(S) ≤ v (∀a ∈ S, a ≥ 0)     | yes           | no          | no
  sum(S) ≥ v (∀a ∈ S, a ≥ 0)     | no            | yes         | no
  range(S) ≤ v                   | yes           | no          | no
  range(S) ≥ v                   | no            | yes         | no
  avg(S) θ v, θ ∈ {=, ≤, ≥}      | convertible   | convertible | no
22
Optional proof that min(S) ≥ v is anti-monotone

- According to the table, min(S) ≥ v is both anti-monotone and succinct.
- Only the anti-monotone part is proved here due to time limitations.
- Something special…
23
Constraint Classification

[Diagram: relationships among the constraint classes: anti-monotone, monotone, succinct, convertible anti-monotone, convertible monotone, strongly convertible, and inconvertible]
24
Summary of Approach
Recapitulation

- Basic idea of mining frequent itemsets with constraints.
- Several important constraint classes were introduced.
25
Outline

- Introduction
- Summary of Approach
- Algorithm CAP
- Performance Analysis
- Conclusion
- References
26
Algorithms

- There are many algorithms for constraint-based association rule mining.
  - Algorithm Direct
  - Algorithm MultiJoins & Reorder
  - Algorithm Apriori†
  - Algorithm Hybrid(m)
  - Algorithm CAP (main focus)
27
Design of Algorithm

- Sound
  - An algorithm is sound provided it only finds frequent sets that satisfy the given constraints.
- Complete
  - An algorithm is complete provided all frequent sets satisfying the given constraints are found.
28
Algorithm Apriori†

- Main idea: use the Apriori algorithm to get the frequent itemsets, then apply the constraints to the itemsets found.
  - Step 1) Run Apriori with Cfreq
  - Step 2) Apply C − Cfreq to get the final Ans
29
Algorithm Apriori† (Pseudocode)

1. C1 consists of sets of size 1; k = 1; Ans = ∅;
2. While (Ck not empty) {
   2.1 conduct db scan to form Lk from Ck;
   2.2 form Ck+1 from Lk based on Cfreq; k++; }
3. For each set S in some Lk:
   add S to Ans if S satisfies (C − Cfreq).
30
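The pseudocode above maps to a straightforward implementation. Below is a minimal, runnable Python sketch of Apriori† (my own illustration, not the authors' code): it mines frequent itemsets with plain Apriori and applies the non-frequency constraint only at the end.

```python
# Sketch of Apriori-dagger: run plain Apriori, then filter the frequent
# itemsets with the non-frequency constraint as a post-processing step.
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {itemset (frozenset): support} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # C1: candidate 1-itemsets
    candidates = {frozenset([i]) for t in transactions for i in t}
    frequent, k = {}, 1
    while candidates:
        # db scan: count supports and keep the frequent candidates (Lk)
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: s for c, s in counts.items() if s >= min_sup}
        frequent.update(level)
        # form C(k+1) by joining Lk with itself and pruning by the Apriori property
        k += 1
        candidates = set()
        for a in level:
            for b in level:
                union = a | b
                if len(union) == k and all(
                    frozenset(sub) in level for sub in combinations(union, k - 1)
                ):
                    candidates.add(union)
    return frequent

def apriori_dagger(transactions, min_sup, constraint):
    """Apriori-dagger: Apriori first, constraint check afterwards."""
    return {s: sup for s, sup in apriori(transactions, min_sup).items()
            if constraint(s)}

# Usage on the example TDB that appears on the next slides,
# with the constraint S ⊆ {A, C, E}:
tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
ans = apriori_dagger(tdb, min_sup=2, constraint=lambda s: s <= {"A", "C", "E"})
print(sorted("".join(sorted(s)) for s in ans))  # ['A', 'AC', 'C', 'CE', 'E']
```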
The Apriori† Algorithm — An Example

Database TDB
  Tid | Items
  10  | A, C, D
  20  | B, C, E
  30  | A, B, C, E
  40  | B, E

1st scan  C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
          L1: {A}:2, {B}:3, {C}:3, {E}:3
          C2: {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan  C2: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
          L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2
          C3: {B,C,E}
3rd scan  L3: {B,C,E}:2
The Apriori† Algorithm — An Example (cont.)

Database TDB
  Tid | Items
  10  | A, C, D
  20  | B, C, E
  30  | A, B, C, E
  40  | B, E

L1: {A}:2, {B}:3, {C}:3, {E}:3
L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2
L3: {B,C,E}:2

Constraint: {A, C, E} ⊇ T.Item

Ans: {A}, {C}, {E}, {A, C}, {C, E}
Algorithm CAP

- Succinct and anti-monotone
  - Strategy I: Replace C1 in the Apriori algorithm by C1^C, the candidate 1-itemsets that satisfy C.
- Anti-monotone but non-succinct
  - Strategy II: Define Ck as in the Apriori algorithm. Drop a set S ∈ Ck from counting if S fails C, i.e., constraint satisfaction is tested before counting is done.
33
Algorithm CAP (cont.)

- Succinct but non-anti-monotone
  - Strategy III: Too complicated; to be discussed later…
- Non-succinct & non-anti-monotone
  - Strategy IV: Induce any weaker constraint C′ from C. Depending on whether C′ is anti-monotone and/or succinct, use one of Strategies I–III above for the generation of the frequent sets.
34
Algorithm CAP (Pseudocode)

1. if Csam ∪ Csuc ∪ Cnone is non-empty, prepare C1 as indicated in Strategies I, III, and IV; k = 1;
2. if Csuc is non-empty {
   2.1 conduct db scan to form L1 as indicated in Strategy III;
   2.2 form C2 as indicated in Strategy III; k = 2; }
3. while (Ck not empty) {
   3.1 conduct db scan to form Lk from Ck;
   3.2 form Ck+1 from Lk based on Strategy III if Csuc is non-empty, and Strategy II for constraints in Cam; }
4. if Cnone is empty, Ans = ∪k Lk. Otherwise, for each set S in some Lk, add S to Ans iff S satisfies Cnone.
35
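For intuition, here is a minimal Python sketch (my own illustration, not the authors' code) of the two simplest CAP strategies: Strategy I restricts the candidate 1-itemsets using a succinct anti-monotone constraint (here a subset constraint), and Strategy II drops candidates that fail an anti-monotone constraint before they are counted.

```python
# Sketch of CAP with Strategies I and II only:
#   Strategy I  - start from C1 restricted by a succinct anti-monotone
#                 constraint (here: "itemset is a subset of `allowed`"),
#   Strategy II - test an anti-monotone constraint before counting a candidate.
from itertools import combinations

def cap(transactions, min_sup, allowed=None, anti_monotone=None):
    """Frequent itemsets under a subset constraint (Strategy I) and an
    optional anti-monotone constraint (Strategy II)."""
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    if allowed is not None:
        items &= set(allowed)                       # Strategy I: shrink C1
    candidates = {frozenset([i]) for i in items}
    frequent, k = {}, 1
    while candidates:
        if anti_monotone is not None:               # Strategy II: prune before counting
            candidates = {c for c in candidates if anti_monotone(c)}
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: s for c, s in counts.items() if s >= min_sup}
        frequent.update(level)
        k += 1
        candidates = {a | b for a in level for b in level
                      if len(a | b) == k and all(
                          frozenset(sub) in level for sub in combinations(a | b, k - 1))}
    return frequent

# Usage on the running example: constraint S ⊆ {A, C, E}, min_sup = 2
tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(sorted("".join(sorted(s)) for s in cap(tdb, 2, allowed={"A", "C", "E"})))
# ['A', 'AC', 'C', 'CE', 'E']
```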
The Algorithm CAP — An Example

Constraints: {A, C, E} ⊇ T.Item & minimum support count = 2
Question: Which strategy should we apply?

Database TDB
  Tid | Items
  10  | A, C, D
  20  | B, C, E
  30  | A, B, C, E
  40  | B, E
The Algorithm CAP — An Example (cont.)

Apply Strategy I!

Database TDB
  Tid | Items
  10  | A, C, D
  20  | B, C, E
  30  | A, B, C, E
  40  | B, E

1st scan  C1: {A}:2, {C}:3, {E}:3
          L1: {A}:2, {C}:3, {E}:3
          C2: {A,C}, {A,E}, {C,E}
2nd scan  C2: {A,C}:2, {A,E}:1, {C,E}:2
          L2: {A,C}:2, {C,E}:2
          C3: {} (empty, because {A, E} was pruned earlier)

Ans: {A}, {C}, {E}, {A, C}, {C, E}
Case 3: Succinct but not anti-monotone. Revisit…

- Constraint: min(S) < 5, over the items {1} … {10}
- If Apriori starts candidate generation only from the items that satisfy the constraint on their own ({1}, {2}, {3}, {4}), it only builds sets such as {1,2}, {2,3}, …, {3,4}, …, {1,2,3,4}.
- Some possible frequent sets may then be lost, e.g. {1, 8} and {1, 2, 10}, even though they satisfy min(S) < 5.

**Information extracted from past presentation.
38
Case 3: Succinct but not anti-monotone. Continued…

- Algorithm Direct
  - Idea: play it safe. Generate C^c_(k+1) using L^c_k × F, where F is the set of all frequent items.
- Algorithm MultiJoins
- Algorithm Reorder
39
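Note (my own illustration): a tiny Python sketch of the "play it safe" candidate generation used by Direct, where every constrained frequent k-itemset is extended with every frequent item, so no satisfying set can be missed.

```python
# Sketch of Direct-style candidate generation: extend each constrained
# frequent k-itemset with every frequent item (L_k^c x F), so that sets
# such as {1, 8} are not lost when the constraint (e.g. min(S) < 5) is
# not anti-monotone.
def direct_candidates(constrained_level, frequent_items):
    """Generate the (k+1)-candidates from L_k^c x F."""
    return {frozenset(s) | {i}
            for s in constrained_level for i in frequent_items if i not in s}

# Hypothetical example: frequent items 1..10, constraint min(S) < 5
L1c = [{1}, {2}, {3}, {4}]         # 1-itemsets satisfying the constraint
F = set(range(1, 11))               # all frequent items
C2c = direct_candidates(L1c, F)
print(frozenset({1, 8}) in C2c)     # True: {1, 8} is kept as a candidate
```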
Outline

- Introduction
- Summary of Approach
- Algorithm CAP
- Performance Analysis
- Conclusion
- References
40
Performance Analysis (Specification)

- Programs written in C
- Transactional databases generated using the program from the IBM Almaden Research Center
- 100,000 records, domain of 1,000 items
- Page size 4 KB
- SPARC-10 environment
41
Performance Analysis (Terminology)

- Speedup
  - Comparison of execution time between two algorithms.
- Item selectivity
  - x% of the items satisfy the constraint.
- Support threshold
  - A low support threshold means more frequent sets to process.
42
Performance Analysis

- Note: support threshold set at 0.5%.
- For 10% selectivity, CAP runs 80 times faster than Apriori†!
- For 30% selectivity, the speedup is about 10 times.
43
Performance Analysis

- Note: item selectivity fixed at 30%.
- As the support threshold goes up, the number of frequent itemsets goes down, and Apriori† improves.
- CAP is still at least 8 times faster.
44
Performance Analysis

  Support | L1      | L2     | L3      | L4     | L5    | L6    | L7    | L8
  --------+---------+--------+---------+--------+-------+-------+-------+------
  0.2%    | 174/582 | 79/969 | 29/1140 | 8/1250 | 1/934 | 0/451 | 0/132 | 0/20
  0.6%    | 98/313  | 1/12   | 0/1     | 0      | 0     | 0     | 0     | 0

- Each entry is of the form a/b
  - a is the number of frequent sets satisfying the constraint.
  - b is the total number of frequent sets.
- For L4 at 0.2% support, Apriori† processes 1250 frequent sets, only 8 of which satisfy the constraint and are found by CAP.
45
Conclusion

- The ideas of anti-monotonicity, succinctness, and convertibility are introduced in the paper.
- Sound, complete, and efficient algorithms are presented for constraint-based association rule mining.
46
References

- R. Srikant, Q. Vu, and R. Agrawal. Mining association rules with item constraints. KDD'97.
- R. Ng, L. V. S. Lakshmanan, J. Han, and A. Pang. Exploratory mining and pruning optimizations of constrained association rules. SIGMOD'98.
- J. Pei and J. Han. Can we push more constraints into frequent pattern mining? KDD'00.
47