Association Rule Mining

advertisement
Data Mining and
Knowledge Acquisition
— Chapter 6 —
BIS 541
2013/2014 Summer
1
Chapter 5: Mining Association
Rules in Large Databases




Association rule mining
Mining single-dimensional Boolean association rules
from transactional databases
Mining multilevel association rules from transactional
databases
Mining multidimensional association rules from
transactional databases and data warehouse

From association mining to correlation analysis

Constraint-based association mining

Summary
2
What Is Association Mining?
Association rule mining:
 Finding frequent patterns, associations, correlations, or
causal structures among sets of items or objects in
transaction databases, relational databases, and other
information repositories.
 Applications:
 Market basket analysis, cross-marketing, catalog
design, etc.
 Examples.
 Rule form: “Body ead [support, confidence]”.
 buys(x, “diapers”)  buys(x, “beers”) [0.5%, 60%]
 major(x, “MIS”) ^ takes(x, “DM”) grade(x, “AA”)
[1%, 75%]

3
Association Rule: Basic Concepts


Given:
 (1) database of transactions,
 (2) each transaction is a list of items (purchased by a
customer in a visit)
Find: all rules that correlate the presence of one set of
items with that of another set of items
 E.g., 98% of people who purchase tires and auto
accessories also get automotive services done

The user specifies
 Minimum support level
 Minimum confidence level
 Rules exceeding the two trasholds are listed as
interesting
4
Basic Concepts cont.






I:{i1,..,im} set of all items, T any transaction
AT: T contains the itemset A
AT, BT A,B itemsets
Examine rule like:
AB where
AB=,
 support s: P(AB)


frequency of transactions containing both A and B
confidence c: P(BA) = P(AB)/P(A)

Conditional probability that a transaction containing
A contains B
5
Rule Measures: Support and
Confidence
Customer
buys both
Find all the rules X & Y  Z with
minimum confidence and support
 support, s, probability that a
transaction contains {X  Y 
Z}
 confidence, c, conditional
Customer
buys beer
probability that a transaction
having {X  Y} also contains Z
Transaction ID Items Bought Let minimum support 50%, and
minimum confidence 50%,
2000
A,B,C
we have
1000
A,C
 A  C (50%, 66.6%)
4000
A,D
5000
B,E,F
 C  A (50%, 100%)
Customer 
buys diaper
6
Frequent itemsets







Strong association rules:
 Support rule > min_support
 Confidence rule > min_confidence
k-item set: itemsets containing k items
occurrence frequency=count=support count:
Minimum support count =
min_sup*#transactions in database
frequent item sets:
 İtemsets satisfying minimum support count
The Apriori Algorithm has two steps:
 (1) - Find all frequent itemsets
 (2) - Genertate strong association rules from frequent
itemsets
7
Mining Association Rules—An Example(1)
Transaction ID
2000
1000
4000
5000
Items Bought
A,B,C
A,C
A,D
B,E,F
Min_support 50%
Min._confidence 50%
Min_count:0.5*4=2
Frequent Itemset Support
{A}
75%
{B}
50%
{C}
50%
{D}
25%
{A}.{B}.{C}.{D} are 1-itemsets
{A}.{B}.{C} are frequent 1-itemsets as
Count[{A}] = 3 >= 2 (minimum_count) or
Support[{A}] = 75% >= 50% (minimum_support)
{D} is not a frequent 1-itemsets as
Count[{D}] = 1 < 2 (minimum_count) or
Support[{D}] = 25% < 50% (minimum_support)
8
Mining Association Rules—An Example(2)
Transaction ID
2000
1000
4000
5000
Items Bought
A,B,C
A,C
A,D
B,E,F
Min_support 50%
Min._confidence 50%
Min_count:0.5*4=2
Frequent Itemset Support
{A.B}
25%
{A.C}
50%
{A.D}
25%
{B,C}
25%
{A.B}.{A.C}.{A.D}.{B.C} are 2-itemsets
{A.C}is frequent 2-itemsets as
Count[{A.C}] = 2 >= 2 (minimum_count) or
Support[{A.C}] = 50% >= 50% (minimum_support)
{A.B}.{A.D} are not frequent 2-itemsets as
Count[{A.D}] = 1 < 2 (minimum_count) or
Support[{A.D}] = 25% < 50% (minimum_support)
9
Mining Association Rules—An Example(3)
Transaction ID
2000
1000
4000
5000
Items Bought
A,B,C
A,C
A,D
B,E,F
Min. support 50%
Min. confidence 50%
Frequent Itemset Support
{A}
75%
{B}
50%
{C}
50%
{A,C}
50%
For rule A  C:
support = support({A C}) = 50%
confidence = support({A C})/support({A}) = 66.6%
Strong rule as support >=min_support
confidence >= min_confidence
10
The Apriori Principle
Transaction ID
2000
1000
4000
5000
Items Bought
A,B,C
A,C
A,D
B,E,F
Min. support 50%
Min. confidence 50%
Frequent Itemset Support
{A}
75%
{B}
50%
{C}
50%
{A,C}
50%
The Apriori principle:
Any subset of a frequent itemset must be frequent
{A.C} is a frequent 2-itemset
{A} and {C}: subsets of {A,C} must be frequent
1-itemsets
11
Apriori Algorithme has two steps

(1)-Find the frequent itemsets: the sets of items that
have minimum support (the key step)

A subset of a frequent itemset must also be a frequent itemset


Iteratively find frequent itemsets with cardinality from 1 to k (kitemset)


i.e., if {AB} is a frequent itemset, both {A} and {B} should be a frequent
itemsets
Until k is an empty set
(2)-Use the frequent itemsets to generate association
rules.
12
Generation of frequent itemsets from
candidate itemsets (Step 1)

C1L1  C2L2 C3  L3  C4L4…

From Ck (candidate k-itemsets) generate Lk :Ck  Lk


From candidate k itemsets generate frequent k
itemsets
(a)-Using the Apriori principle that:

Eliminate itemset sk in Ck if


(b)-For candidate k itemsets in Ck


At least one k-1 subset of sk is not in Lk-1
Make a database scan to eliminate those itemsets whose
support counts are below the critical min support cout
From frequent k itemsets Lk generate candidate k+1
itemsets Ck+1 : Lk  Ck+1

Self joining any Lk with Lk
13
Self Join operation






Sort the items in any li Lk in some lexicographic order
 li[1]<li[2]<,… <li[k-1]<li[k]
li and lj are elements of Lk li.lj Lk
If li[1]=lj[1] and li[2]=lj[2] and … li[k-1]=lj[k-1]
and li[k]<lj[k]
 The first k-1 elements are the same
 Only the last elements are different
li lj satisfiing the above condition
Construct the item set lk+1:
 li[1], li[2],… li[k-1],li[k], lj[k]

common items
 the k-1 items are taken form li or lj
 k th item is taken from li
 k+1 th item is from lj
14
Example of Self Join operation

Lexigographic order: alphabetic a<b<c<d....

L3={abc, abd, acd, ace, bcd}

Self-joining: L3*L3 Step(2)


abcd from abc and abd

acde from acd and ace
Pruning by Apriori principle: Step(1a)


acde is removed because ade is not in L3
C4={abcd}
15
The Apriori Algorithm — Example
min support cont=2
Database D
TID
100
200
300
400
itemset sup.
C1
{1}
2
{2}
3
Scan D
{3}
3
{4}
1
{5}
3
Items
134
235
1235
25
C2 itemset sup
L2 itemset sup
2
2
3
2
{1
{1
{1
{2
{2
{3
C3 itemset
{2 3 5}
Scan D
{1 3}
{2 3}
{2 5}
{3 5}
2}
3}
5}
3}
5}
5}
1
2
1
2
3
2
L1 itemset sup.
{1}
{2}
{3}
{5}
2
3
3
3
C2 itemset
{1 2}
Scan D
{1
{1
{2
{2
{3
3}
5}
3}
5}
5}
L3 itemset sup
{2 3 5} 2
16
Example 6.1 Han
TID_____list of item_Ids
 T100
125
9 transactions
 T200
24
D=9
 T300
23
minimum transaction
 T400
124
support_count=2
 T500
13
min_sup=2/9=22%
 T600
23
 T700
13
min conf: 70%
 T800
1235
 T900
123
Find strong association rules: having min sup count
of 2 and min confidence %70

17
Data Dictionary





1:
2:
3:
4:
5:
milk
apple
butter
bread
orange
18
1th iteration of algorithm











C1: itemset sup_count L1:itemset
 1
6
1
 2
7
2
 3
6

3
 4
2
4
 5
2
5
sup_count
6
7
6 
2
2
C2:L1 join L1, itemset sup_count
1
1
1
1
2
2
2
3
3
4
2
3
4
5
3
4
5
4
5
5
4
4
1x
2
4
2
2
0x
1x
0x
L2 itset supcount
12
4
13
4
15
2
23
4
24
2

25
2
frequent 2 item sets L2
those itemsets in C2
having minimum support
Step (1b)
19
3 th iteration







Self join to get C3 Step (2)
C3: L2 join L2: [1 2 3], [1 2 5],[1 3 5],[2 3 4],
[2 3 5],[2 4 5]
Now Step (1a) Apply Apriori to every itemset in C3
2 item subsets of [1 2 3]:[1 2],[1 3],[2 3]
 all 2 items sets are members of L2
 keep [1 2 3] in C3
2 item subsets of [1 2 5]:[1 2],[1 5],[2 5]
 all 2 items sets are members of L2
 keep [1 2 5] in C3
2 item subsets of [1 3 5]:[1 3],[1 5],[3 5]
 [3 5] is not a members of L2 so it si not frequent
 remove [1 2 5] from C3
20
3 iteration cont.




2 item subsets of [2 3 4]:[2 3],[2 4],[3 4]
 [3 4] is not a members of L2 so it si not frequent
 remove [2 3 4] from C3
2 item subsets of [2 3 5]:[2 3],[2 5],[3 5]
 [3 5] is not a members of L2 so it si not frequent
 remove [2 3 5] from C3
2 item subsets of [2 4 5]:[2 4],[2 5],[4 5]
 [4 5] is not a members of L2 so it si not frequent
 remove [2 4 5] from C3
C3:[1 2 3],[1 2 5] after pruning
21
4 th iteration








C3L3 check min support Step (1b)
L3:those item sets having minimum support
L3: item sets minsupcount
 1 2 3
2
 1 2 5
2
L3 join L3 to generate C4 Step (2)
L3 join L3: 1 2 3 5
pruned since its subset [2 3 5] is not frequent
C4=
the algorithm terminates
22
Generating Association Rules from
frequent itemsets




Strong rules
 min support and min confidence
confidence(AB)= P(BA):sup_count(AB)
sup_count(A)
for each frequent itemset l
 generate non empty subsets of l: denoted by s
 For each sl


construct rules: s (l-s)
Satısfying the condition:


sup_count(l)/sup_count(s)>=min_conf
are listed as interestıng
23
Example 6.2 Han cont.




the 3-frequent item set l:[1 2 5]: transaction containing
milk, apple and orange is frequent
non empty subsets of l are
 [1 2],[1 5],[2 5],[1],[2],[5]
the resulting association rules are:
 125 conf: 2/4=50%
 152 conf: 2/2=100%
 251 conf: 2/2=100%
 125 conf: 2/6=33%
 215 conf: 2/7=29%
 512 conf: 2/2=100%
if min conf: 70% 2th 3th and last rules are strong
24
Example 6.2 cont. Detail on confidence
for two rules








For the rule
152
conf: s(1,2,5)/s(1,5)
conf: 2/2=100% >= 70%
A strong rule
For the rule
215
conf: s(1,2,5)/s(2)
conf: 2/7=29% < 70%
Not a strong rule
25
Exercise

Find all strong association rules in Example 6.2
 Check minimum confindence
 for 2-frequent intemsets




[1,2], [1,3], [1,5], [2,3], [2,4], [2,5]
12, 21
25, 52 exetra
for 3-frequent intemset



[1,2,5]
123
3  12 exetra
26
Exercise






a) Suppose A  B and B  C are strong rules
Dose this imply that A  C is also a strong rule?
b) Suppose A  B and A  C are strong rules
Dose this imply that B  C is also a strong rule?
c) Suppose A  C and B  C are strong rules
Dose this imply that A and B  C is also a strong
rule?
27
Bottleneck of Frequent-pattern Mining


Multiple database scans are costly
Mining long patterns needs many passes of
scanning and generates lots of candidates

To find frequent itemset i1i2…i100


# of scans: 100
# of Candidates: (1001) + (1002) + … + (110000) = 21001 = 1.27*1030 !

Bottleneck: candidate-generation-and-test

Can we avoid candidate generation?
28
Is Apriori Fast Enough? — Performance
Bottlenecks

The core of the Apriori algorithm:



Use frequent (k – 1)-itemsets to generate candidate frequent kitemsets
Use database scan and pattern matching to collect counts for the
candidate itemsets
The bottleneck of Apriori: candidate generation
 Huge candidate sets:



104 frequent 1-itemset will generate 107 candidate 2-itemsets
To discover a frequent pattern of size 100, e.g., {a1, a2, …,
a100}, one needs to generate 2100  1030 candidates.
Multiple scans of database:

Needs (n +1 ) scans, n is the length of the longest pattern
29
Mining Frequent Patterns Without
Candidate Generation

Compress a large database into a compact, FrequentPattern tree (FP-tree) structure



highly condensed, but complete for frequent pattern
mining
avoid costly database scans
Develop an efficient, FP-tree-based frequent pattern
mining method


A divide-and-conquer methodology: decompose mining
tasks into smaller ones
Avoid candidate generation: sub-database test only!
30
Construct FP-tree from a
Transaction DB
TID
100
200
300
400
500
Items bought
(ordered) frequent items
{f, a, c, d, g, i, m, p}
{f, c, a, m, p}
{a, b, c, f, l, m, o}
{f, c, a, b, m}
{b, f, h, j, o}
{f, b}
{b, c, k, s, p}
{c, b, p}
{a, f, c, e, l, p, m, n}
{f, c, a, m, p}
Steps:
1. Scan DB once, find frequent
1-itemset (single item
pattern)
2. Order frequent items in
frequency descending order
3. Scan DB again, construct
FP-tree
min_support = 0.5
{}
Header Table
Item frequency head
f
4
c
4
a
3
b
3
m
3
p
3
f:4
c:3
c:1
b:1
a:3
b:1
p:1
m:2
b:1
p:2
m:1
31
Benefits of the FP-tree Structure


Completeness:
 never breaks a long pattern of any transaction
 preserves complete information for frequent pattern
mining
Compactness
 reduce irrelevant information—infrequent items are gone
 frequency descending ordering: more frequent items are
more likely to be shared
 never be larger than the original database (if not count
node-links and counts)
 Example: For Connect-4 DB, compression ratio could be
over 100
32
Chapter 5: Mining Association
Rules in Large Databases




Association rule mining
Mining single-dimensional Boolean association rules
from transactional databases
Mining multilevel association rules from transactional
databases
Mining multidimensional association rules from
transactional databases and data warehouse

From association mining to correlation analysis

Constraint-based association mining

Summary
33
Multiple-Level Association Rules
Food





Items often form hierarchy.
Items at the lower level are
expected to have lower
support.
Rules regarding itemsets at
appropriate levels could be
quite useful.
Transaction database can be
encoded based on
dimensions and levels
We can explore shared multilevel mining
bread
milk
skim
Fraser
TID
T1
T2
T3
T4
T5
2%
wheat
white
Sunset
Items
{111, 121, 211, 221}
{111, 211, 222, 323}
{112, 122, 221, 411}
{111, 121}
{111, 122, 211, 221, 413}
34
Mining Multi-Level Associations

A top_down, progressive deepening approach:
 First find high-level strong rules:
milk  bread [20%, 60%].

Then find their lower-level “weaker” rules:
2% milk  wheat bread [6%, 50%].

Variations at mining multiple-level association rules.
 Level-crossed association rules:
2% milk

 Wonder wheat bread
Association rules with multiple, alternative
hierarchies:
2% milk
 Wonder bread
35
Multi-level Association: Uniform
Support vs. Reduced Support

Uniform Support: the same minimum support for all levels
 + One minimum support threshold.
No need to examine itemsets
containing any item whose ancestors do not have minimum
support.

– Lower level items do not occur as frequently. If support threshold
too high  miss low level associations
 too low  generate too many high level associations
Reduced Support: reduced minimum support at lower levels
 There are 4 search strategies:






Level-by-level independent
Level-cross filtering by k-itemset
Level-cross filtering by single item
Controlled level-cross filtering by single item
36
Uniform Support
Multi-level mining with uniform support
Level 1
min_sup = 5%
Level 2
min_sup = 5%
Milk
[support = 10%]
2% Milk
Skim Milk
[support = 6%]
[support = 4%]
Back
37
Reduced Support
Multi-level mining with reduced support
Level 1
min_sup = 5%
Level 2
min_sup = 3%
Milk
[support = 10%]
2% Milk
Skim Milk
[support = 6%]
[support = 4%]
Back
38




Controlled level-cross filtering by single item
Specify a level passage treshold for each level k
min_sup_T(k+1)<LPT(k)<min_sup_T(k)
Example:
 High level milk


Low level 2% milk,skim milk


min supp=5%
Min supp = 3%
Level passage trashold = 4%
39
Multi-level Association: Redundancy
Filtering




Some rules may be redundant due to “ancestor”
relationships between items.
Example
 milk  wheat bread
[support = 8%, confidence = 70%]
 2% milk  wheat bread [support = 2%, confidence = 72%]
We say the first rule is an ancestor of the second rule.
A rule is redundant if its support is close to the “expected”
value, based on the rule’s ancestor.
40
Multi-Level Mining: Progressive
Deepening

A top-down, progressive deepening approach:
 First mine high-level frequent items:
milk (15%), bread (10%)

Then mine their lower-level “weaker” frequent
itemsets:
2% milk (5%), wheat bread (4%)

Different min_support threshold across multi-levels
lead to different algorithms:
 If adopting the same min_support across multilevels
then toss t if any of t’s ancestors is infrequent.

If adopting reduced min_support at lower levels
then examine only those descendents whose ancestor’s
support is frequent/non-negligible.
41
Progressive Refinement of Data
Mining Quality

Why progressive refinement?



Trade speed with quality: step-by-step refinement.
Superset coverage property:


Mining operator can be expensive or cheap, fine or
rough
Preserve all the positive answers—allow a positive false
test but not a false negative test.
Two- or multi-step mining:


First apply rough/cheap operator (superset coverage)
Then apply expensive algorithm on a substantially
reduced candidate set (Koperski & Han, SSD’95).
42
Chapter 5: Mining Association
Rules in Large Databases




Association rule mining
Mining single-dimensional Boolean association rules
from transactional databases
Mining multilevel association rules from transactional
databases
Mining multidimensional association rules from
transactional databases and data warehouse

From association mining to correlation analysis

Constraint-based association mining

Summary
43
Interestingness Measurements

Objective measures
Two popular measurements:
 support; and


confidence
Subjective measures (Silberschatz & Tuzhilin,
KDD95)
A rule (pattern) is interesting if
 it is unexpected (surprising to the user);
and/or
 actionable (the user can do something with it)
44
Criticism to Support and Confidence

Example 1: (Aggarwal & Yu, PODS98)
 Among 5000 students
 3000 play basketball
 3750 eat cereal
 2000 both play basket ball and eat cereal
 play basketball  eat cereal [40%, 66.7%] is misleading
because the overall percentage of students eating cereal is 75%
which is higher than 66.7%.
 play basketball  not eat cereal [20%, 33.3%] is far more
accurate, although with lower support and confidence
basketball not basketball sum(row)
cereal
2000
1750
3750
not cereal
1000
250
1250
sum(col.)
3000
2000
5000
45
Criticism to Support and Confidence
(Cont.)


Example 2:
 X and Y: positively correlated,
 X and Z, negatively related
 support and confidence of
X=>Z dominates
We need a measure of dependent
or correlated events
corrA, B

P( A B)

P( A) P( B)
X 1 1 1 1 0 0 0 0
Y 1 1 0 0 0 0 0 0
Z 0 1 1 1 1 1 1 1
Rule Support Confidence
X=>Y 25%
50%
X=>Z 37.50%
75%
P(B|A)/P(B) is also called the lift
of rule A => B
46
Other Interestingness Measures: Interest

Interest (correlation, lift)
P( A  B)
P( A) P( B)

taking both P(A) and P(B) in consideration

P(A^B)=P(B)*P(A), if A and B are independent events

A and B negatively correlated, if the value is less than 1;
otherwise A and B positively correlated
X 1 1 1 1 0 0 0 0
Y 1 1 0 0 0 0 0 0
Z 0 1 1 1 1 1 1 1
Itemset
Support
Interest
X,Y
X,Z
Y,Z
25%
37.50%
12.50%
2
0.9
0.57
47
Example





Total transactions 10,000
İtems C:computers, V: video
V: 7,500 C: 6,000 C and V: 4,000
 Min_support: 0.3 min_conf:0,50
Consider the rule:
Buy(X: computer) buy(X: video)
 Support : = 4000/10000 = 0.4
 Confidence: P(C and V) /P(C) = 4000/6000 =%66
 Strong but
 The probablity of buying a video is 0.75 buying a
comuter reduces the probablity of buying a video
 From 0.75 to 0.66
 Computer and video are negatively correlated
48






Lift of A  B
Lift : P(A and B)/P(A)*P(B)
P(A and B) = P(B|A)*P(A) then
Lift = P(B|A)/P(B)
Ratio of probablity of buying A and B divided by
buying A and B independently
Or it can be interpreted as:
 Conditional probablity of buying B given that A
is purchased divided by unconditional
probablity of buying B
49
C
not C
V
4000
3500
not V
2000
500
7500
2500
10000
4000
6000
Lift CV is P(P and V)/P(V)P(C) = P(V|C)/P(V)
= 0.4/0.6*0.75=0.89<1 there is a negative correlation
Between Video and computer
50
Are All the Rules Found Interesting?

“Buy walnuts  buy milk [1%, 80%]” is misleading

if 85% of customers buy milk

Support and confidence are not good to represent correlations

So many interestingness measures? (Tan, Kumar, Sritastava @KDD’02)
lift 
P( A B)
P( A) P( B)
all _ conf 
sup( X )
max_item _ sup( X )
sup( X )
coh 
| universe( X ) |
Milk
No Milk
Sum (row)
Coffee
m, c
~m, c
c
No Coffee
m, ~c
~m, c
~c
Sum(col.)
m
~m

all-conf
coh
2
9.26
0.91
0.83
9055
100,000
8.44
0.09
0.05
670
10000
100,000
9.18
0.09
0.09
8172
1000
1000
1
0.5
0.33
0
DB
m, c
~m, c
m~c
~m~c
lift
A1
1000
100
100
10,000
A2
100
1000
1000
A3
1000
100
A4
1000
1000
51
All Confidence









All confidence:
All_conf= sup(X)/max sup(Xi)i
X: (X1,X2,...,Xk)
For k = 2
Rules are X1X2 and X2 X1
All_conf = sup(X1,X2)/max sup(X1),sup(X2)
Here sup(X1,X2)/sup(X1): confidence of rule
X1X2
Ex all conf: 0.4/max(0.6,0.75)=0.4/0.75=0.53
52
Cosine





Cosine : P(A,B)/sqrt(P(A),P(B))
Similar to lift but take square root of denominator
Both cosine and all_conf are null inveriant
 Not affected from null transactions
Ex:
Cosine: 0.4/sqrt(0.6*0.75)=0.27
53
Mining Highly Correlated Patterns



lift and 2 are not good measures for correlations
in transactional DBs
all-conf or cosine could be good measures
(Omiecinski @TKDE’03)
Both all-conf and coherence have the downward
closure
all _ conf 
sup( X )
DB
max_item _ sup( X )
m, c
~m, c
m~c
~m~c
lift
A1
1000
100
100
10,000
A2
100
1000
1000
A3
1000
100
A4
1000
1000
sup( X )
coh 
| universe( X ) |
all-conf
coh
2
9.26
0.91
0.83
9055
100,000
8.44
0.09
0.05
670
10000
100,000
9.18
0.09
0.09
8172
1000
1000
1
0.5
0.33
0
54
Dataset
mc
mc
mc
mc
all_conf.
cosine
2
lift
A1
1000
100
100
100000
0.91
0.91
83.64
83452.6
A2
1000
100
100
10000
0.91
0.91
9.36
9055.7
A3
1000
100
100
1000
0.91
0.91
1.82
1472.7
A4
1000
100
100
0
0.91
0.91
0.99
9.9
B1
1000
1000
1000
1000
0.5
0.5
1
0
C1
100
1000
1000
100000
0.09
0.09
8.44
670
C2
1000
100
10000
100000
0.09
0.29
9.18
8172.8
C3
1
1
100
10000
0.1
0.07
50
48.5
55
Chapter 5: Mining Association
Rules in Large Databases




Association rule mining
Mining single-dimensional Boolean association rules
from transactional databases
Mining multilevel association rules from transactional
databases
Mining multidimensional association rules from
transactional databases and data warehouse

From association mining to correlation analysis

Constraint-based association mining

Summary
56
Constraint-based (Query-Directed) Mining



Finding all the patterns in a database autonomously? —
unrealistic!
 The patterns could be too many but not focused!
Data mining should be an interactive process
 User directs what to be mined using a data mining
query language (or a graphical user interface)
Constraint-based mining
 User flexibility: provides constraints on what to be
mined
 System optimization: explores such constraints for
efficient mining—constraint-based mining
57
Constraints in Data Mining





Knowledge type constraint:
 classification, association, etc.
Data constraint — using SQL-like queries
 find product pairs sold together in stores in Chicago in
Dec.’02
Dimension/level constraint
 in relevance to region, price, brand, customer category
Rule (or pattern) constraint
 small sales (price < $10) triggers big sales (sum >
$200)
Interestingness constraint
 strong rules: min_support  3%, min_confidence 
60%
58
Example



bread  milk
milk  butter
 Strong rules but items are not that valuable
TV  VCD player
 Support may be lower then previous rules but
value of items are much higher
 This rule may be more valuable
59




Apriori principle stating that
 All non empty subsets of a frequent itemsets
must also be frequent
Note that:
 If a given itemset does not satisfy minimum
support
 None of its supersets can
Other examples of anti-monotone constraints:
 Min(l.price) >= 500
 Count(l) < 10
Average(l.price) < 10 : not anti-monotone
60
Anti-Monotonicity in Constraint Pushing
TDB (min_sup=2)

Anti-monotonicity




When an intemset S violates the
constraint, so does any of its superset
sum(S.Price)  v is anti-monotone
sum(S.Price)  v is not anti-monotone
Example. C: range(S.profit)  15 is antimonotone

Itemset ab violates C

So does every superset of ab
TID
Transaction
10
a, b, c, d, f
20
b, c, d, f, g, h
30
a, c, d, e, f
40
c, e, f, g
Item
Profit
a
40
b
0
c
-20
d
10
e
-30
f
30
g
20
h
-10
61
Monotonicity for Constraint Pushing
TDB (min_sup=2)

Monotonicity

sum(S.Price)  v is monotone
Transaction
10
a, b, c, d, f
20
b, c, d, f, g, h
30
a, c, d, e, f
40
c, e, f, g
Item
Profit
min(S.Price)  v is monotone
a
40
b
0
Example. C: range(S.profit)  15
c
-20
d
10
e
-30
f
30
g
20
h
-10



When an intemset S satisfies the
constraint, so does any of its
superset
TID

Itemset ab satisfies C

So does every superset of ab
62
The Apriori Algorithm — Example
Database D
TID
100
200
300
400
itemset sup.
C1
{1}
2
{2}
3
Scan D
{3}
3
{4}
1
{5}
3
Items
134
235
1235
25
C2 itemset sup
L2 itemset sup
2
2
3
2
{1
{1
{1
{2
{2
{3
C3 itemset
{2 3 5}
Scan D
{1 3}
{2 3}
{2 5}
{3 5}
2}
3}
5}
3}
5}
5}
1
2
1
2
3
2
L1 itemset sup.
{1}
{2}
{3}
{5}
2
3
3
3
C2 itemset
{1 2}
Scan D
{1
{1
{2
{2
{3
3}
5}
3}
5}
5}
L3 itemset sup
{2 3 5} 2
63
Naïve Algorithm: Apriori + Constraint
Database D
TID
100
200
300
400
itemset sup.
C1
{1}
2
{2}
3
Scan D
{3}
3
{4}
1
{5}
3
Items
134
235
1235
25
C2 itemset sup
L2 itemset sup
2
2
3
2
{1
{1
{1
{2
{2
{3
C3 itemset
{2 3 5}
Scan D
{1 3}
{2 3}
{2 5}
{3 5}
2}
3}
5}
3}
5}
5}
1
2
1
2
3
2
L1 itemset sup.
{1}
{2}
{3}
{5}
2
3
3
3
C2 itemset
{1 2}
Scan D
L3 itemset sup
{2 3 5} 2
{1
{1
{2
{2
{3
3}
5}
3}
5}
5}
Constraint:
Sum{S.price < 5}
64
The Constrained Apriori Algorithm: Push
an Anti-monotone Constraint Deep
Database D
TID
100
200
300
400
itemset sup.
C1
{1}
2
{2}
3
Scan D
{3}
3
{4}
1
{5}
3
Items
134
235
1235
25
C2 itemset sup
L2 itemset sup
2
2
3
2
{1
{1
{1
{2
{2
{3
C3 itemset
{2 3 5}
Scan D
{1 3}
{2 3}
{2 5}
{3 5}
2}
3}
5}
3}
5}
5}
1
2
1
2
3
2
L1 itemset sup.
{1}
{2}
{3}
{5}
2
3
3
3
C2 itemset
{1 2}
Scan D
L3 itemset sup
{2 3 5} 2
{1
{1
{2
{2
{3
3}
5}
3}
5}
5}
Constraint:
Sum{S.price < 5}
65
The Constrained Apriori Algorithm: Push
Another Constraint Deep
Database D
TID
100
200
300
400
itemset sup.
C1
{1}
2
{2}
3
Scan D
{3}
3
{4}
1
{5}
3
Items
134
235
1235
25
C2 itemset sup
L2 itemset sup
2
2
3
2
{1
{1
{1
{2
{2
{3
C3 itemset
{2 3 5}
Scan D
{1 3}
{2 3}
{2 5}
{3 5}
2}
3}
5}
3}
5}
5}
1
2
1
2
3
2
L1 itemset sup.
{1}
{2}
{3}
{5}
2
3
3
3
C2 itemset
{1 2}
Scan D
L3 itemset sup
{2 3 5} 2
{1
{1
{2
{2
{3
3}
5}
3}
5}
5}
Constraint:
min{S.price <= 1 }
66
Chapter 5: Mining Association Rules in
Large Databases

Association rule mining

Algorithms for scalable mining of (single-dimensional
Boolean) association rules in transactional databases

Mining various kinds of association/correlation rules

Constraint-based association mining

Sequential pattern mining

Applications/extensions of frequent pattern mining

Summary
67
Sequence Databases and Sequential
Pattern Analysis

Transaction databases, time-series databases vs. sequence
databases

Frequent patterns vs. (frequent) sequential patterns

Applications of sequential pattern mining

Customer shopping sequences:


First buy computer, then CD-ROM, and then digital camera,
within 3 months.
Medical treatment, natural disasters (e.g., earthquakes),
science & engineering processes, stocks and markets, etc.

Telephone calling patterns, Weblog click streams

DNA sequences and gene structures
68
What Is Sequential Pattern Mining?

Given a set of sequences, find the complete set
of frequent subsequences
A
sequence
: < (ef) (ab) (df) c b >
A sequence database
SID
sequence
10
<a(abc)(ac)d(cf)>
20
<(ad)c(bc)(ae)>
30
<(ef)(ab)(df)cb>
40
<eg(af)cbc>
An element may contain a set of items.
Items within an element are unordered
and we list them alphabetically.
<a(bc)dc> is a subsequence
of <a(abc)(ac)d(cf)>
Given support threshold min_sup =2, <(ab)c> is a
sequential pattern
69
Challenges on Sequential Pattern Mining


A huge number of possible sequential patterns are
hidden in databases
A mining algorithm should



find the complete set of patterns, when possible,
satisfying the minimum support (frequency) threshold
be highly efficient, scalable, involving only a small
number of database scans
be able to incorporate various kinds of user-specific
constraints
70
Studies on Sequential Pattern Mining




Concept introduction and an initial Apriori-like algorithm
 R. Agrawal & R. Srikant. “Mining sequential patterns,” ICDE’95
GSP—An Apriori-based, influential mining method (developed at IBM
Almaden)
 R. Srikant & R. Agrawal. “Mining sequential patterns:
Generalizations and performance improvements,” EDBT’96
From sequential patterns to episodes (Apriori-like + constraints)
 H. Mannila, H. Toivonen & A.I. Verkamo. “Discovery of frequent
episodes in event sequences,” Data Mining and Knowledge
Discovery, 1997
Mining sequential patterns with constraints

M.N. Garofalakis, R. Rastogi, K. Shim: SPIRIT: Sequential Pattern
Mining with Regular Expression Constraints. VLDB 1999
71
Sequential pattern mining: Cases and
Parameters


Duration of a time sequence T
 Sequential pattern mining can then be confined to the
data within a specified duration
 Ex. Subsequence corresponding to the year of 1999
 Ex. Partitioned sequences, such as every year, or every
week after stock crashes, or every two weeks before
and after a volcano eruption
Event folding window w
 If w = T, time-insensitive frequent patterns are found
 If w = 0 (no event sequence folding), sequential
patterns are found where each event occurs at a
distinct time instant
 If 0 < w < T, sequences occurring within the same
period w are folded in the analysis
72
Example


When event folding window is 5 munites
Purchases within 5 munits is considered to be
taken together
73
Sequential pattern mining: Cases and
Parameters (2)

Time interval, int, between events in the discovered
pattern
 int = 0: no interval gap is allowed, i.e., only strictly
consecutive sequences are found


min_int  int  max_int: find patterns that are
separated by at least min_int but at most max_int


Ex. “Find frequent patterns occurring in consecutive weeks”
Ex. “If a person rents movie A, it is likely she will rent movie
B within 30 days” (int  30)
int = c  0: find patterns carrying an exact interval

Ex. “Every time when Dow Jones drops more than 5%, what
will happen exactly two days later?” (int = 2)
74
A Basic Property of Sequential Patterns: Apriori

A basic property: Apriori (Agrawal & Sirkant’94)
 If a sequence S is not frequent
 Then none of the super-sequences of S is frequent
 E.g, <hb> is infrequent  so do <hab> and <(ah)b>
Seq. ID
Sequence
10
<(bd)cb(ac)>
20
<(bf)(ce)b(fg)>
30
<(ah)(bf)abf>
40
<(be)(ce)d>
50
<a(bd)bcb(ade)>
Given support threshold
min_sup =2
75
GSP—A Generalized Sequential Pattern Mining Algorithm

GSP (Generalized Sequential Pattern) mining algorithm


Outline of the method




proposed by Agrawal and Srikant, EDBT’96
Initially, every item in DB is a candidate of length-1
for each level (i.e., sequences of length-k) do
 scan database to collect support count for each
candidate sequence
 generate candidate length-(k+1) sequences from
length-k frequent sequences using Apriori
repeat until no frequent sequence or no candidate
can be found
Major strength: Candidate pruning by Apriori
76
Finding Length-1 Sequential Patterns


Examine GSP using an example
Initial candidates: all singleton sequences


<a>, <b>, <c>, <d>, <e>, <f>,
<g>, <h>
Scan database once, count support for
candidates
min_sup =2
Cand
Sup
<a>
3
<b>
5
<c>
4
<d>
3
<e>
3
<f>
2
Seq. ID
Sequence
10
<(bd)cb(ac)>
<g>
1
20
<(bf)(ce)b(fg)>
<h>
1
30
<(ah)(bf)abf>
40
<(be)(ce)d>
50
<a(bd)bcb(ade)>
77
Generating Length-2 Candidates
51 length-2
Candidates
<a>
<a>
<b>
<c>
<d>
<e>
<f>
<a>
<b>
<c>
<d>
<e>
<f>
<a>
<aa>
<ab>
<ac>
<ad>
<ae>
<af>
<b>
<ba>
<bb>
<bc>
<bd>
<be>
<bf>
<c>
<ca>
<cb>
<cc>
<cd>
<ce>
<cf>
<d>
<da>
<db>
<dc>
<dd>
<de>
<df>
<e>
<ea>
<eb>
<ec>
<ed>
<ee>
<ef>
<f>
<fa>
<fb>
<fc>
<fd>
<fe>
<ff>
<b>
<c>
<d>
<e>
<f>
<(ab)>
<(ac)>
<(ad)>
<(ae)>
<(af)>
<(bc)>
<(bd)>
<(be)>
<(bf)>
<(cd)>
<(ce)>
<(cf)>
<(de)>
<(df)>
<(ef)>
Without Apriori
property,
8*8+8*7/2=92
candidates
Apriori prunes
44.57% candidates
78
Generating Length-3 Candidates and Finding Length-3 Patterns

Generate Length-3 Candidates



Self-join length-2 sequential patterns
 Based on the Apriori property
 <ab>, <aa> and <ba> are all length-2 sequential
patterns  <aba> is a length-3 candidate
 <(bd)>, <bb> and <db> are all length-2 sequential
patterns  <(bd)b> is a length-3 candidate
46 candidates are generated
Find Length-3 Sequential Patterns


Scan database once more, collect support counts for
candidates
19 out of 46 candidates pass support threshold
80
The GSP Mining Process
5th scan: 1 cand. 1 length-5 seq.
pat.
Cand. cannot pass
sup. threshold
<(bd)cba>
Cand. not in DB at all
4th scan: 8 cand. 6 length-4 seq. <abba> <(bd)bc> …
pat.
3rd scan: 46 cand. 19 length-3 seq. <abb> <aab> <aba> <baa> <bab> …
pat. 20 cand. not in DB at all
2nd scan: 51 cand. 19 length-2 seq.
<aa> <ab> … <af> <ba> <bb> … <ff> <(ab)> … <(ef)>
pat. 10 cand. not in DB at all
1st scan: 8 cand. 6 length-1 seq.
<a> <b> <c> <d> <e> <f> <g> <h>
pat.
min_sup =2
Seq. ID
Sequence
10
<(bd)cb(ac)>
20
<(bf)(ce)b(fg)>
30
<(ah)(bf)abf>
40
<(be)(ce)d>
50
<a(bd)bcb(ade)>
81


Definition c is a contiguous subsequence of a
sequence s:{s1,s2,...,sn} if
 c is derived by dropping an item from s1 or sn
 c is derived by dropping an item from si which
has at least 2 items
 c’ is a contiguous subsequence of c and c is a
contiguous subsequence of s
Ex: s:{ (1,2),(3,4),5,6}
 { 2,(3,4),5}, { (1,2),3,5,6},{ (3,5} are but
 { (1,2),(3,4),6},{ (1,5,6} are not
82
Candidate genration


Step 1: Join Step Lk-1 join with Lk-1 to give Ck
 s1 and s2 are joined if dropping first item of s1
and last item of s2 gives the same sequence
 s1 is extended by adding the last item of s2
Step 2: Prune Step Delete candidate sequences
having (k-1) contiguous subsequences whose
support count is less than min_support count
83







L3
{(1,2),3}
{(1,2),4}
{1,(3,4)}
{(1,3),5}
{2,(3,4)}
{2,3,5}




C4
{(1,2),(3,4)}
{(1,2),3,5}
L4
{(1,2),(3,4)}
{(1,2),3} joined with {2,(3,4)} to give {(1,2),(3,4)}
{(1,2),3} joined with {2,3,5} to give {(1,2),3,5}
{(1,2),3,5} is dropped since its 3 contiguous subseq
{(1,3,5} not in L3
84
Bottlenecks of GSP

A huge set of candidates could be generated
1,000 frequent length-1 sequences generate
1000  999
1000 1000 
 1,499 ,500 length-2 candidates!

2

Multiple scans of database in mining

Real challenge: mining long sequential patterns


An exponential number of short candidates
A length-100 sequential pattern needs 1030
candidate sequences!
100 100
  100
 
  2
i 1  i 
 1  1030
86
Download