Association Rules
presented by Zbigniew W. Ras *,#)
*) University of North Carolina – Charlotte
#) ICS, Polish Academy of Sciences

Market Basket Analysis (MBA)
  Analyzes customer buying habits by finding associations and correlations between the different items that customers place in their “shopping basket”.
    Customer1: milk, eggs, sugar, bread
    Customer2: milk, eggs, cereal, bread
    Customer3: eggs, sugar

Market Basket Analysis
  Given: a database of customer transactions, where each transaction is a set of items.
  Find: groups of items which are frequently purchased together.

Goal of MBA
  Extract information on purchasing behavior.
  Actionable information: can suggest
    new store layouts
    new product assortments
    which products to put on promotion
  MBA is applicable whenever a customer purchases multiple things in proximity.

Association Rules
  Express how products/services relate to each other and tend to group together.
  “If a customer purchases three-way calling, then the customer will also purchase call-waiting.”
  Simple to understand.
  Actionable information: bundle three-way calling and call-waiting in a single package.

Basic Concepts
  Transactions:
    Relational format <Tid, item>:  <1, item1>  <1, item2>  <2, item3>
    Compact format <Tid, itemset>:  <1, {item1, item2}>  <2, {item3}>
  Item: single element. Itemset: set of items.
  Support of an itemset I [denoted by sup(I)]: card(I), read here as the number of transactions that contain I.
  Threshold for minimum support: s
  Itemset I is frequent if: sup(I) ≥ s.
  A frequent itemset represents a set of items which are positively correlated.

Frequent Itemsets
  Transaction ID   Items Bought
  1                dairy, fruit
  2                dairy, fruit, vegetable
  3                dairy
  4                fruit, cereals

  sup({dairy}) = 3
  sup({fruit}) = 3
  sup({dairy, fruit}) = 2
  If s = 3, then {dairy} and {fruit} are frequent while {dairy, fruit} is not.
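The support counts on the "Frequent Itemsets" slide can be checked with a few lines of Python. This is only a minimal sketch: the names `transactions`, `sup` and `is_frequent` are chosen here for illustration and are not part of the slides, and support is taken as the number of transactions containing the itemset.

```python
# Transactions from the "Frequent Itemsets" slide, in compact <Tid, itemset> form.
transactions = {
    1: {"dairy", "fruit"},
    2: {"dairy", "fruit", "vegetable"},
    3: {"dairy"},
    4: {"fruit", "cereals"},
}


def sup(itemset, transactions):
    """Count the transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions.values() if items <= t)


def is_frequent(itemset, transactions, min_sup):
    """An itemset is frequent when its support reaches the threshold."""
    return sup(itemset, transactions) >= min_sup


# Assumed threshold of 3, as in the slide's example.
for candidate in [{"dairy"}, {"fruit"}, {"dairy", "fruit"}]:
    print(sorted(candidate),
          sup(candidate, transactions),
          is_frequent(candidate, transactions, min_sup=3))
```

With a threshold of 3, the two single items pass the test and the pair {dairy, fruit} does not, matching the slide.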
Association Rules: AR(s, c)
  {A, B}: partition of a set of items
  r = [A → B]
  Support of r: sup(r) = sup(A ∪ B)
  Confidence of r: conf(r) = sup(A ∪ B) / sup(A)
  Thresholds:
    minimum support: s
    minimum confidence: c
  r ∈ AR(s, c), if sup(r) ≥ s and conf(r) ≥ c

Association Rules - Example
  Transaction ID   Items Bought
  2000             A, B, C
  1000             A, C
  4000             A, D
  5000             B, E, F

  Min. support: 2 [50%]
  Min. confidence: 50%

  Frequent Itemset   Support
  {A}                75%
  {B}                50%
  {C}                50%
  {A, C}             50%

  For rule A → C:
    sup(A → C) = 2
    conf(A → C) = sup({A, C}) / sup({A}) = 2/3

The Apriori principle: Any subset of a frequent itemset must be frequent.

The Apriori algorithm [Agrawal]
  Fk : set of frequent itemsets of size k
  Ck : set of candidate itemsets of size k

  F1 := {frequent items}; k := 1;
  while card(Fk) ≥ 1 do begin
      Ck+1 := new candidates generated from Fk;
      for each transaction t in the database do
          increment the count of all candidates in Ck+1 that are contained in t;
      Fk+1 := candidates in Ck+1 with minimum support;
      k := k + 1
  end
  Answer := ∪ { Fk : k ≥ 1 & card(Fk) ≥ 1 }

Apriori - Example
  Itemset lattice over {a, b, c, d}:
    {a, b, c, d}
    {a, b, c}  {a, b, d}  {a, c, d}  {b, c, d}
    {a, b}  {a, c}  {a, d}  {b, c}  {b, d}  {c, d}
    {a}  {b}  {c}  {d}
  {a, d} is not frequent, so the 3-itemsets {a, b, d}, {a, c, d} and the 4-itemset {a, b, c, d} are not generated.

Algorithm Apriori: Illustration
  The task of mining association rules is mainly to discover strong association rules (high confidence and strong support) in large databases.

  TID    Items
  1000   A, B, C
  2000   A, C
  3000   A, D
  4000   B, E, F

  Mining association rules is composed of two steps:
  1. Discover the large itemsets, i.e., the itemsets whose transaction support is at least a predetermined minimum support s.

     Large itemset   Support
     {A}             3
     {B}             2
     {C}             2
     {A, C}          2

  2.
Use the large itemsets to generate the association rules.
  MinSup = 2

Algorithm Apriori: Illustration (S = 2)
  Database D
    TID   Items
    100   A, C, D
    200   B, C, E
    300   A, B, C, E
    400   B, E

  Scan D → C1                  F1
    Itemset   Count              Itemset   Count
    {A}       2                  {A}       2
    {B}       3                  {B}       3
    {C}       3                  {C}       3
    {D}       1                  {E}       3
    {E}       3

  C2 (generated from F1), Scan D → counts      F2
    Itemset   Count                              Itemset   Count
    {A, B}    1                                  {A, C}    2
    {A, C}    2                                  {B, C}    2
    {A, E}    1                                  {B, E}    3
    {B, C}    2                                  {C, E}    2
    {B, E}    3
    {C, E}    2

  C3 (generated from F2), Scan D → counts      F3
    Itemset     Count                            Itemset     Count
    {B, C, E}   2                                {B, C, E}   2

Representative Association Rules
  Definition 1. Cover C of a rule X → Y is denoted by C(X → Y) and defined as follows:
    C(X → Y) = { [X ∪ Z] → V : Z, V are disjoint subsets of Y }.
  Definition 2. Set RR(s, c) of Representative Association Rules is defined as follows:
    RR(s, c) = { r ∈ AR(s, c) : ~∃(r' ∈ AR(s, c)) [r' ≠ r & r ∈ C(r')] }
    s: threshold for minimum support
    c: threshold for minimum confidence
  Representative Rules (informal description):
    [as short as possible] → [as long as possible]

Representative Association Rules
  Transactions:
    {A, B, C, D, E}
    {A, B, C, D, E, F}
    {A, B, C, D, E, H, I}
    {A, B, E}
    {B, C, D, E, H, I}
  Find RR(2, 80%).
  Representative Rules
    From (BCDEHI):
      {H} → {B, C, D, E, I}
      {I} → {B, C, D, E, H}
    From (ABCDE):
      {A, C} → {B, D, E}
      {A, D} → {B, C, E}

Beyond Support and Confidence
  Example 1: (Aggarwal & Yu)

                coffee   not coffee   sum(row)
    tea           20         5           25
    not tea       70         5           75
    sum(col.)     90        10          100

  {tea} => {coffee} has high support (20%) and confidence (80%).
  However, the a priori probability that a customer buys coffee is 90%.
  A customer who is known to buy tea is less likely to buy coffee (by 10 percentage points).
  There is a negative correlation between buying tea and buying coffee.
  {~tea} => {coffee} has higher confidence (93%).

Correlation and Interest
  Two events are independent if P(A ∧ B) = P(A) * P(B); otherwise they are correlated.
  Interest = P(A ∧ B) / (P(A) * P(B))
  Interest expresses a measure of correlation. If it is:
    equal to 1: A and B are independent events
    less than 1: A and B are negatively correlated
    greater than 1: A and B are positively correlated
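The confidence and interest values for the tea/coffee table can be recomputed directly from the counts. The sketch below is illustrative only: the function names `confidence` and `interest` and the variable names are assumptions made here, and `fractions.Fraction` is used so the ratios stay exact.

```python
from fractions import Fraction

# Counts read off the tea/coffee table (100 customers in total).
n_total = 100
n_tea = 25
n_not_tea = 75
n_coffee = 90
n_tea_and_coffee = 20
n_not_tea_and_coffee = 70


def confidence(n_both, n_antecedent):
    """conf(A => B) = sup(A and B) / sup(A), computed from raw counts."""
    return Fraction(n_both, n_antecedent)


def interest(n_both, n_a, n_b, n_total):
    """Interest = P(A and B) / (P(A) * P(B)); equals 1 under independence."""
    p_both = Fraction(n_both, n_total)
    p_a = Fraction(n_a, n_total)
    p_b = Fraction(n_b, n_total)
    return p_both / (p_a * p_b)


conf_tea = confidence(n_tea_and_coffee, n_tea)
conf_not_tea = confidence(n_not_tea_and_coffee, n_not_tea)
interest_tea = interest(n_tea_and_coffee, n_tea, n_coffee, n_total)

print("conf(tea => coffee)  =", float(conf_tea))
print("conf(~tea => coffee) =", round(float(conf_not_tea), 2))
print("interest(tea, coffee) =", round(float(interest_tea), 2))
```

The interest of tea and coffee comes out as the exact fraction 8/9, which is below 1 and therefore indicates negative correlation under the definition above.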
  In our example, I(drink tea → drink coffee) = 0.89, i.e. they are negatively correlated.

Questions?
Thank You
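Supplementary sketch: the Apriori pseudocode from the slides can be written in Python and run on database D from the illustration (S = 2). The function name `apriori`, its arguments, and the join-then-prune way of generating candidates are assumptions of this sketch; the slides only say that new candidates are "generated from Fk".

```python
from itertools import combinations


def apriori(transactions, min_sup):
    """Return {frozenset: support count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # One database scan: count the candidates contained in each transaction,
        # then keep only those that reach the minimum support.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= min_sup}

    # F1: frequent single items.
    items = {item for t in transactions for item in t}
    frequent = count({frozenset([i]) for i in items})
    answer = dict(frequent)
    k = 1
    while frequent:
        # Join step: unite pairs of frequent k-itemsets into (k+1)-itemsets.
        candidates = {a | b for a in frequent for b in frequent
                      if len(a | b) == k + 1}
        # Prune step (Apriori principle): every k-subset must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k))}
        frequent = count(candidates)
        answer.update(frequent)
        k += 1
    return answer


# Database D from the "Algorithm Apriori: Illustration" slide.
D = [
    {"A", "C", "D"},
    {"B", "C", "E"},
    {"A", "B", "C", "E"},
    {"B", "E"},
]
result = apriori(D, min_sup=2)
for itemset in sorted(result, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), result[itemset])
```

On database D the prune step removes {A, B, C} and {A, C, E} before any counting, because {A, B} and {A, E} are not frequent, so {B, C, E} is the only 3-itemset candidate, as in the illustration.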