Association Rules
presented by
Zbigniew W. Ras *) #)
*) University of North Carolina – Charlotte
#) ICS, Polish Academy of Sciences
Market Basket Analysis (MBA)
Analyzes customer buying habits by finding associations and
correlations between the different items that customers place
in their "shopping basket"

Customer 1: milk, eggs, sugar, bread
Customer 2: milk, eggs, cereal, bread
Customer 3: eggs, sugar
Market Basket Analysis
Given: a database of customer transactions, where
each transaction is a set of items
Find groups of items which are frequently
purchased together
Goal of MBA
 Extract information on purchasing behavior
 Actionable information: can suggest
 new store layouts
 new product assortments
 which products to put on promotion
MBA is applicable whenever a customer purchases
multiple things in proximity
Association Rules
 Express how products/services relate to each other
and tend to group together
 "if a customer purchases three-way calling, then he or she
will also purchase call-waiting"
 Simple to understand
 Actionable information: bundle three-way calling and
call-waiting in a single package
Basic Concepts
Transactions:
Relational format <Tid, item>:
<1, item1>
<1, item2>
<2, item3>
Compact format <Tid, itemset>:
<1, {item1, item2}>
<2, {item3}>
Item: single element
Itemset: set of items
Support of an itemset I [denoted by sup(I)]: number of transactions containing I
Threshold for minimum support: λ
Itemset I is frequent if sup(I) ≥ λ.
A frequent itemset represents a set of items which are
positively correlated.
Frequent Itemsets

Transaction ID | Items Bought
1              | dairy, fruit
2              | dairy, fruit, vegetable
3              | dairy
4              | fruit, cereals

sup({dairy}) = 3
sup({fruit}) = 3
sup({dairy, fruit}) = 2

If λ = 3, then
{dairy} and {fruit} are frequent while {dairy, fruit} is not.
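
A minimal Python sketch (not from the slides) of this support computation on the transactions above; the function name sup and the λ = 3 threshold are just the values used in this example.

def sup(itemset, transactions):
    # Support of an itemset = number of transactions containing it
    return sum(1 for t in transactions if set(itemset) <= set(t))

transactions = [
    {"dairy", "fruit"},
    {"dairy", "fruit", "vegetable"},
    {"dairy"},
    {"fruit", "cereals"},
]

print(sup({"dairy"}, transactions))           # 3 -> frequent for λ = 3
print(sup({"fruit"}, transactions))           # 3 -> frequent for λ = 3
print(sup({"dairy", "fruit"}, transactions))  # 2 -> not frequent for λ = 3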
Association Rules: AR(s, c)
 {A, B} - partition of a set of items
 r = [A → B]
 Support of r: sup(r) = sup(A ∪ B)
 Confidence of r: conf(r) = sup(A ∪ B)/sup(A)
 Thresholds:
  minimum support - s
  minimum confidence - c
 r ∈ AR(s, c), if sup(r) ≥ s and conf(r) ≥ c
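
A small Python sketch (mine, not from the slides) of sup(r) and conf(r) for a rule A → B, run on the transaction database used in the example on the next slide:

def sup(itemset, transactions):
    return sum(1 for t in transactions if set(itemset) <= set(t))

def rule_measures(A, B, transactions):
    # Support and confidence of the rule A -> B
    s = sup(set(A) | set(B), transactions)
    return s, s / sup(A, transactions)

transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
print(rule_measures({"A"}, {"C"}, transactions))  # (2, 0.666...): sup = 2, conf = 2/3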
Association Rules - Example

Transaction ID | Items Bought
2000           | A, B, C
1000           | A, C
4000           | A, D
5000           | B, E, F

Min. support - 2 [50%]
Min. confidence - 50%

Frequent Itemset | Support
{A}              | 75%
{B}              | 50%
{C}              | 50%
{A,C}            | 50%

 For rule A → C:
  sup(A → C) = 2
  conf(A → C) = sup({A,C})/sup({A}) = 2/3

 The Apriori principle:
  Any subset of a frequent itemset must be frequent
The Apriori algorithm [Agrawal]
 Fk : set of frequent itemsets of size k
 Ck : set of candidate itemsets of size k

F1 := {frequent items}; k := 1;
while card(Fk) ≥ 1 do begin
  Ck+1 := new candidates generated from Fk;
  for each transaction t in the database do
    increment the count of all candidates in Ck+1 that are contained in t;
  Fk+1 := candidates in Ck+1 with minimum support;
  k := k + 1
end
Answer := ∪ { Fk : k ≥ 1 & card(Fk) ≥ 1 }
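
Below is a minimal Python sketch of this loop (my illustration, not the authors' implementation); candidate generation joins frequent k-itemsets and prunes them with the Apriori principle.

from itertools import combinations

def apriori(transactions, min_sup):
    # Returns {frozenset itemset: support} for all frequent itemsets
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}

    def count(candidates):
        # Keep only candidates whose support reaches min_sup
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= min_sup}

    freq = count({frozenset([i]) for i in items})   # F1
    answer = dict(freq)
    k = 1
    while freq:
        prev = set(freq)
        # C(k+1): unions of two frequent k-itemsets that have size k+1 ...
        candidates = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        # ... pruned so that every k-subset is frequent (Apriori principle)
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k))}
        freq = count(candidates)                    # F(k+1)
        answer.update(freq)
        k += 1
    return answer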
Apriori - Example
[Figure: the lattice of all subsets of {a, b, c, d} - singletons, pairs, triples, and the full 4-itemset]
{a,d} is not frequent, so the 3-itemsets {a,b,d}, {a,c,d} and the 4-itemset {a,b,c,d} are not generated.
Algorithm Apriori: Illustration
 The task of mining association rules is mainly to
discover strong association rules (high confidence
and strong support) in large databases.

TID  | Items
1000 | A, B, C
2000 | A, C
3000 | A, D
4000 | B, E, F

 Mining association rules is composed of two steps:
 1. Discover the large itemsets, i.e., the sets of items that have
    transaction support above a predetermined minimum support s.
 2. Use the large itemsets to generate the association rules.

MinSup = 2

Large itemsets | Support
{A}            | 3
{B}            | 2
{C}            | 2
{A,C}          | 2
Algorithm Apriori: Illustration (s = 2)

Database D
TID | Items
100 | A, C, D
200 | B, C, E
300 | A, B, C, E
400 | B, E

Scan D -> C1
Itemset | Count
{A}     | 2
{B}     | 3
{C}     | 3
{D}     | 1
{E}     | 3

F1
Itemset | Count
{A}     | 2
{B}     | 3
{C}     | 3
{E}     | 3

C2 (generated from F1), scan D
Itemset | Count
{A, B}  | 1
{A, C}  | 2
{A, E}  | 1
{B, C}  | 2
{B, E}  | 3
{C, E}  | 2

F2
Itemset | Count
{A, C}  | 2
{B, C}  | 2
{B, E}  | 3
{C, E}  | 2

C3 (generated from F2), scan D
Itemset   | Count
{B, C, E} | 2

F3
Itemset   | Count
{B, C, E} | 2
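
As a brute-force check of the counts in this walk-through (a sketch of mine, not part of the slides), one can enumerate every candidate itemset over database D and keep those with support ≥ 2:

from itertools import combinations

D = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
items = sorted({i for t in D for i in t})

# Count every possible itemset; print the ones meeting MinSup = 2.
# This reproduces F1, F2 and F3 above ({D} alone falls below MinSup).
for k in range(1, len(items) + 1):
    for cand in combinations(items, k):
        support = sum(1 for t in D if set(cand) <= t)
        if support >= 2:
            print(set(cand), support)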
Representative Association Rules
 Definition 1.
The cover C of a rule X → Y, denoted by C(X → Y),
is defined as follows:
C(X → Y) = { [X ∪ Z] → V : Z, V are disjoint subsets of Y }.
 Definition 2.
The set RR(s, c) of Representative Association Rules
is defined as follows:
RR(s, c) =
{ r ∈ AR(s, c): there is no rule r' ∈ AR(s, c) such that r' ≠ r and r ∈ C(r') }
s - threshold for minimum support
c - threshold for minimum confidence
 Representative Rules (informal description):
[as short as possible] → [as long as possible]
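
A small Python sketch of Definition 1 (names are mine; I also assume the consequent V must be non-empty so every member of the cover is a proper rule):

from itertools import combinations

def subsets(s):
    # All subsets of s as frozensets, including the empty set
    s = list(s)
    return [frozenset(c) for k in range(len(s) + 1) for c in combinations(s, k)]

def cover(X, Y):
    # C(X -> Y): rules [X u Z] -> V with Z, V disjoint subsets of Y
    # (V taken non-empty here, an assumption not stated on the slide)
    X, Y = frozenset(X), frozenset(Y)
    return {(X | Z, V) for Z in subsets(Y) for V in subsets(Y - Z) if V}

for ant, cons in cover({"H"}, {"B", "C"}):
    print(set(ant), "->", set(cons))
# the cover contains 5 rules:
# {H}->{B}, {H}->{C}, {H}->{B,C}, {H,B}->{C}, {H,C}->{B}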
Representative Association Rules - Example
Transactions:
{A,B,C,D,E}
{A,B,C,D,E,F}
{A,B,C,D,E,H,I}
{A,B,E}
{B,C,D,E,H,I}
Find RR(2, 80%)

Representative Rules
From {B,C,D,E,H,I}:
{H} → {B,C,D,E,I}
{I} → {B,C,D,E,H}
From {A,B,C,D,E}:
{A,C} → {B,D,E}
{A,D} → {B,C,E}
Beyond Support and Confidence
 Example 1: (Aggarwal & Yu)

          | coffee | not coffee | sum(row)
tea       |   20   |      5     |    25
not tea   |   70   |      5     |    75
sum(col.) |   90   |     10     |   100

 {tea} => {coffee} has high support (20%) and confidence (80%)
 However, the a priori probability that a customer buys coffee is 90%
 A customer who is known to buy tea is less likely to buy coffee (by 10%)
 There is a negative correlation between buying tea and buying coffee
 {~tea} => {coffee} has higher confidence (93%)
Correlation and Interest
 Two events A and B are independent
if P(A ∧ B) = P(A)*P(B); otherwise they are correlated.
 Interest = P(A ∧ B) / [P(A)*P(B)]
 Interest expresses a measure of correlation:
  equal to 1 → A and B are independent events
  less than 1 → A and B are negatively correlated
  greater than 1 → A and B are positively correlated
 In our example,
I(drink tea → drink coffee) = 0.2/(0.25*0.9) = 0.89, i.e., they are
negatively correlated.
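
A quick numeric check of this example (a sketch; the variable names are mine):

# Counts taken from the contingency table (100 customers total)
tea_and_coffee, tea, coffee, total = 20, 25, 90, 100

confidence = tea_and_coffee / tea                                  # P(coffee | tea)
interest = (tea_and_coffee / total) / ((tea / total) * (coffee / total))

print(confidence)           # 0.8  -> the 80% confidence of {tea} => {coffee}
print(round(interest, 2))   # 0.89 -> below 1, so negatively correlated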
Questions?
Thank You