Association Rules: Market Basket Analysis
Prof. Mingdi Xin
The Paul Merage School of Business
University of California, Irvine

Agenda
• Please work on your projects
  – Gather data
  – Refer to the project guidelines
• Association rules: definition and applications
• Generating and identifying good association rules
• Mining class association rules
• Sequential pattern mining (optional)

Association Rules
• Techniques that automatically look for associations among items in the data
  – If {set of items} Then {set of items}
• Unsupervised
  – No "class" attribute
  – The outcome can be one attribute or a combination of attributes
• Interesting, previously unknown patterns may emerge that can be used within a business
• Assume all data are categorical
• Initially used for market basket analysis, to find how items purchased by customers are related

Association Rules
Rule format: If {set of items} Then {set of items}
             (LHS)             (RHS)
If {Diaper, Baby Food} Then {Beer, Wine}: the LHS implies the RHS.
An association rule is valid if it satisfies some evaluation measures.

Association Rule Discovery: Definition
• Given a set of records, each of which contains some number of items from a given collection, produce dependency rules that predict the occurrence of an item based on occurrences of other items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}

Association Rule Discovery: Application 1
• Marketing and sales promotion:
  – Let the rule discovered be {Bagels, …} --> {Potato Chips}
  – Potato Chips as consequent: can be used to determine what should be done to boost its sales.
  – Bagels in the antecedent: can be used to see which products would be affected if the store discontinued selling bagels.
  – Bagels in the antecedent and Potato Chips in the consequent: can be used to see what products should be sold with bagels to promote the sale of potato chips.

Association Rule Discovery: Application 2
• Supermarket shelf management:
  – Goal: identify items that are bought together by sufficiently many customers.
  – Approach: process the point-of-sale data collected with barcode scanners to find dependencies among items.
  – A classic rule: if a customer buys diapers and milk, then he is very likely to buy beer. So don't be surprised if you find six-packs stacked next to diapers.

Association Rules: Applications beyond Market Basket Analysis
• Medical: associated symptoms; side effects for patients with multiple prescriptions.
• Inventory management:
  – Goal: a consumer appliance repair company wants to anticipate the nature of repairs on its consumer products and keep its service vehicles equipped with the right parts, to reduce the number of visits to consumer households.
  – Approach: process the data on tools and parts required in previous repairs at different consumer locations and discover the co-occurrence patterns.

Association Rule Mining
• Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.

Market-basket transactions (this table is used by the worked examples that follow):

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Examples of association rules:
{Diaper} --> {Beer}
{Milk, Bread} --> {Eggs, Coke}
{Beer, Bread} --> {Milk}
Implication means co-occurrence, not causality!

Definition: Frequent Itemset
• Itemset
  – A collection of one or more items, e.g., {Milk, Bread, Diaper}
  – k-itemset: an itemset that contains k items
• Support count (σ)
  – Frequency of occurrence of an itemset
  – E.g., σ({Milk, Bread, Diaper}) = 2
• Support (s)
  – Fraction of transactions that contain an itemset
  – E.g., s({Milk, Bread, Diaper}) = 2/5
• Frequent itemset
  – An itemset whose support is greater than or equal to a minsup threshold
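These quantities are mechanical to compute. Below is a minimal Python sketch (my illustration, not part of the slides); the transaction list mirrors the market-basket table above.

```python
# Support count (sigma) and support (s) over the five market-basket transactions.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """sigma(itemset): number of transactions containing every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """s(itemset): fraction of transactions that contain the itemset."""
    return support_count(itemset, transactions) / len(transactions)

print(support_count({"Milk", "Bread", "Diaper"}, transactions))  # 2
print(support({"Milk", "Bread", "Diaper"}, transactions))        # 0.4
```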
Rule Evaluation Metrics
• Support (s)
  – Fraction of transactions that contain both X and Y:
    Support = (no. of transactions containing the items in both LHS and RHS) / (total no. of transactions in the dataset)
• Confidence (c)
  – Measures how often items in Y appear in transactions that contain X:
    Confidence = (no. of transactions containing both LHS and RHS) / (no. of transactions containing LHS)

Example: {Milk, Diaper} --> {Beer}
s = σ({Milk, Diaper, Beer}) / |T| = 2/5 = 0.4
c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 ≈ 0.67

Rule Evaluation: Lift

Transaction No.  Item 1  Item 2     Item 3     Item 4
100              Beer    Diaper     Chocolate
101              Milk    Chocolate  Shampoo
102              Beer    Milk       Vodka      Chocolate
103              Beer    Milk       Diaper     Chocolate
104              Milk    Diaper     Beer
…

What are the support and confidence of the rule {Chocolate} --> {Milk}?
Support = 3/5, Confidence = 3/4: very high support and confidence.
Is Chocolate a good predictor of a Milk purchase? No! Milk occurs in 4 out of 5 transactions, so Chocolate actually decreases the chance of a Milk purchase: 3/4 < 4/5, i.e., P(Milk | Chocolate) < P(Milk).
Lift = (3/4) / (4/5) = 0.9375 < 1

Rule Evaluation: Lift (cont.)
Lift measures how much more likely the RHS is given the LHS than the RHS is on its own:
Lift = confidence of the rule / benchmark confidence
Benchmark confidence = count of RHS / number of transactions in the database

Example: {Diaper} --> {Beer}
Total number of customers in the database: 1,000
No. of customers buying Diaper: 200
No. of customers buying Beer: 50
No. of customers buying Diaper & Beer: 20
Benchmark confidence of Beer = 50/1000 (5%)
Confidence = 20/200 (10%)
Lift = 10% / 5% = 2
A higher lift indicates a better rule.
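As a sanity check on these numbers, here is a small Python sketch (my illustration; it uses only the five transactions listed in the lift table above) that reproduces the confidence and lift of {Chocolate} --> {Milk}.

```python
# Confidence and lift for {Chocolate} -> {Milk} over the five lift-table transactions.
transactions = [
    {"Beer", "Diaper", "Chocolate"},          # 100
    {"Milk", "Chocolate", "Shampoo"},         # 101
    {"Beer", "Milk", "Vodka", "Chocolate"},   # 102
    {"Beer", "Milk", "Diaper", "Chocolate"},  # 103
    {"Milk", "Diaper", "Beer"},               # 104
]

def count(itemset):
    return sum(1 for t in transactions if itemset <= t)

def confidence(lhs, rhs):
    # transactions containing LHS and RHS / transactions containing LHS
    return count(lhs | rhs) / count(lhs)

def lift(lhs, rhs):
    # rule confidence / benchmark confidence of the RHS
    return confidence(lhs, rhs) / (count(rhs) / len(transactions))

print(confidence({"Chocolate"}, {"Milk"}))  # 0.75
print(lift({"Chocolate"}, {"Milk"}))        # 0.9375, i.e., below 1
```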
Exercise 1
Compute the support for the itemsets {a}, {b, d}, and {a, b, d} by treating each transaction ID as a market basket.

Exercise 2
Use the results of the previous problem to compute the confidence of the association rules {b, d} → {a} and {a} → {b, d}. State what these values mean in plain English.

Exercise 3
Compute the support for the itemsets {a}, {b, d}, and {a, b, d} by treating each customer ID as a market basket.

Exercise 4
Use the results of the previous problem to compute the confidence of the association rules {b, d} → {a} and {a} → {b, d}.

Mining Association Rules
Examples of rules (from the market-basket table above):
{Milk, Diaper} --> {Beer} (s = 0.4, c = 0.67)
{Milk, Beer} --> {Diaper} (s = 0.4, c = 1.0)
{Diaper, Beer} --> {Milk} (s = 0.4, c = 0.67)
{Beer} --> {Milk, Diaper} (s = 0.4, c = 0.67)
{Diaper} --> {Milk, Beer} (s = 0.4, c = 0.5)
{Milk} --> {Diaper, Beer} (s = 0.4, c = 0.5)

Observations:
• All of the above rules are binary partitions of the same itemset, {Milk, Diaper, Beer}.
• Rules originating from the same itemset have identical support but can have different confidence.
• Thus we may decouple the support and confidence requirements.

Association Rule Mining Task
• Given a set of transactions T, the goal of association rule mining is to find all rules having
  – support ≥ minsup threshold
  – confidence ≥ minconf threshold
• Brute-force approach:
  – List all possible association rules
  – Compute the support and confidence of each rule
  – Prune rules that fail the minsup or minconf threshold
  Computationally prohibitive!

Frequent Itemset Generation (optional)
[Itemset lattice over the items {A, B, C, D, E}: from the null set through all 1-itemsets, 2-itemsets, 3-itemsets, and 4-itemsets up to ABCDE.]
Given d items, there are 2^d possible candidate itemsets.

Frequent Itemset Generation (optional)
• Brute-force approach:
  – Each itemset in the lattice is a candidate frequent itemset
  – Count the support of each candidate by scanning the database
  – Match each transaction against every candidate

Computational Complexity (optional)
• Given d unique items:
  – Total number of itemsets = 2^d
  – Total number of possible association rules:
    R = \sum_{k=1}^{d-1} \binom{d}{k} \sum_{j=1}^{d-k} \binom{d-k}{j} = 3^d - 2^{d+1} + 1
  – If d = 6, R = 602 rules.

Identifying Association Rules
• Two-step approach:
  1. Frequent itemset generation: generate all itemsets whose support ≥ minsup.
  2. Rule generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset.
• Frequent itemset generation is still computationally expensive.

Phase 1: Finding All Frequent Itemsets
How do we perform an efficient search for all frequent itemsets?
If {diaper, beer} is frequent, then {diaper} and {beer} are each frequent as well. This means that:
• If an itemset is not frequent (e.g., {wine}), then no itemset that includes wine can be frequent either, such as {wine, beer}.
• We therefore first find all itemsets of size 1 that are frequent, then try to "expand" these by counting the frequency of all itemsets of size 2 that include frequent itemsets of size 1.
• Example: if {wine} is not frequent, we need not check whether {wine, beer} is frequent. But if both {wine} and {beer} are frequent, then it is possible (though not guaranteed) that {wine, beer} is also frequent. We then take only the itemsets of size 2 that are frequent, try to expand those, and so on.

Algorithm Apriori
• Iterative algorithm: find all 1-item frequent itemsets, then all 2-item frequent itemsets, and so on.
  – In each iteration k, only consider itemsets built from frequent (k−1)-itemsets.
• Find the frequent itemsets of size 1: F1.
• From k = 2:
  – Ck = candidates of size k: those itemsets of size k that could be frequent, given Fk−1.
  – Fk = those candidates that are actually frequent, Fk ⊆ Ck (this needs one scan of the database).

Details: Ordering of Items
• The items in I are sorted in lexicographic order (which is a total order).
• The order is used throughout the algorithm in each itemset.
• {w[1], w[2], …, w[k]} represents a k-itemset w consisting of items w[1], w[2], …, w[k], where w[1] < w[2] < … < w[k] according to the total order.

Details: The Algorithm

Algorithm Apriori(T)
  C1 ← init-pass(T);
  F1 ← {f | f ∈ C1, f.count/n ≥ minsup};    // n: no. of transactions in T
  for (k = 2; Fk−1 ≠ ∅; k++) do
      Ck ← candidate-gen(Fk−1);
      for each transaction t ∈ T do
          for each candidate c ∈ Ck do
              if c is contained in t then
                  c.count++;
          end
      end
      Fk ← {c ∈ Ck | c.count/n ≥ minsup};
  end
  return F ← ∪k Fk;

Apriori Candidate Generation
• The candidate-gen function takes Fk−1 and returns a superset (called the candidates, Ck) of the set of all frequent k-itemsets. It has two steps, which the sketch below makes concrete:
  – Join step: generate all possible candidate itemsets Ck of length k.
  – Prune step: remove those candidates in Ck that cannot be frequent.
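Here is a minimal Python sketch of the join and prune steps (my illustration, assuming items are kept in lexicographic order as described above; the lecture's formal pseudocode follows).

```python
from itertools import combinations

def candidate_gen(F_prev, k):
    """Generate C_k from F_prev, the set of frequent (k-1)-itemsets
    (each represented as a frozenset)."""
    sorted_sets = sorted(tuple(sorted(f)) for f in F_prev)
    candidates = set()
    for f1 in sorted_sets:
        for f2 in sorted_sets:
            # Join step: f1 and f2 share their first k-2 items and
            # f1's last item precedes f2's last item in the total order.
            if f1[:-1] == f2[:-1] and f1[-1] < f2[-1]:
                c = frozenset(f1 + (f2[-1],))
                # Prune step: keep c only if every (k-1)-subset is frequent.
                if all(frozenset(s) in F_prev for s in combinations(c, k - 1)):
                    candidates.add(c)
    return candidates

# From F2 = {{A,C}, {B,C}, {B,E}, {C,E}}, only (B,C) and (B,E) share a prefix,
# and the resulting {B,C,E} survives pruning since all its 2-subsets are in F2.
F2 = {frozenset(s) for s in [("A", "C"), ("B", "C"), ("B", "E"), ("C", "E")]}
print(candidate_gen(F2, 3))  # {frozenset({'B', 'C', 'E'})}
```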
Candidate-gen Function

Function candidate-gen(Fk−1)
  Ck ← ∅;
  forall f1, f2 ∈ Fk−1
        with f1 = {i1, …, ik−2, ik−1}
        and f2 = {i1, …, ik−2, i′k−1}
        and ik−1 < i′k−1 do
      c ← {i1, …, ik−1, i′k−1};        // join f1 and f2
      Ck ← Ck ∪ {c};
      for each (k−1)-subset s of c do
          if (s ∉ Fk−1) then
              delete c from Ck;        // prune
      end
  end
  return Ck;

Example: Finding Frequent Itemsets
Dataset T (minsup = 0.5):

TID    Items
T100   1, 3, 4
T200   2, 3, 5
T300   1, 2, 3, 5
T400   2, 5

(Notation: itemset:count)
1. Scan T → C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
   → F1: {1}:2, {2}:3, {3}:3, {5}:3
   → C2: {1,2}, {1,3}, {1,5}, {2,3}, {2,5}, {3,5}
2. Scan T → C2: {1,2}:1, {1,3}:2, {1,5}:1, {2,3}:2, {2,5}:3, {3,5}:2
   → F2: {1,3}:2, {2,3}:2, {2,5}:3, {3,5}:2
   → C3: {2,3,5}
3. Scan T → C3: {2,3,5}:2
   → F3: {2,3,5}

Agrawal (1994)'s Apriori Algorithm: An Example
An itemset must have been purchased at least twice (minimum support count of 2) in order to be considered frequent, or supported.

Transactions:
T-ID  Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan → C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
→ L1: {A}:2, {B}:3, {C}:3, {E}:3

C2: {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan → C2: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
→ L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

C3: {B,C,E} (what about {A,B,C} and {A,C,E}? They are pruned, because {A,B} and {A,E} are not frequent.)
3rd scan → C3: {B,C,E}:2
→ L3: {B,C,E}:2

Phase 2: Generating Association Rules
Assume {Milk, Bread, Butter} is a frequent itemset.
• Using the items contained in the itemset, list all possible rules:
  – {Milk} --> {Bread, Butter}
  – {Bread} --> {Milk, Butter}
  – {Butter} --> {Milk, Bread}
  – {Milk, Bread} --> {Butter}
  – {Milk, Butter} --> {Bread}
  – {Bread, Butter} --> {Milk}
• Calculate the confidence of each rule.
• Pick the rules with confidence above the minimum confidence.
Confidence of {Milk} --> {Bread, Butter}
= (no. of transactions that support {Milk, Bread, Butter}) / (no. of transactions that support {Milk})
= Support({Milk, Bread, Butter}) / Support({Milk})
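A minimal Python sketch of this rule-generation phase (my illustration; the support counts below are hypothetical and would come from Phase 1, which has already counted every frequent itemset along with all of its subsets):

```python
from itertools import combinations

def generate_rules(support_counts, minconf):
    """support_counts: dict mapping frozenset -> support count for every
    frequent itemset; subsets of frequent itemsets are themselves frequent,
    so their counts are available too."""
    rules = []
    for itemset in support_counts:
        if len(itemset) < 2:
            continue  # a rule needs a non-empty LHS and RHS
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                rhs = itemset - lhs
                conf = support_counts[itemset] / support_counts[lhs]
                if conf >= minconf:
                    rules.append((set(lhs), set(rhs), conf))
    return rules

# Hypothetical counts for {Milk, Bread, Butter} and its subsets:
counts = {
    frozenset({"Milk"}): 4, frozenset({"Bread"}): 3, frozenset({"Butter"}): 2,
    frozenset({"Milk", "Bread"}): 3, frozenset({"Milk", "Butter"}): 2,
    frozenset({"Bread", "Butter"}): 2,
    frozenset({"Milk", "Bread", "Butter"}): 2,
}
for lhs, rhs, conf in generate_rules(counts, minconf=0.7):
    print(lhs, "-->", rhs, f"(c = {conf:.2f})")
```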
Exercise 5

Transaction No.  Item 1  Item 2     Item 3     Item 4
100              Beer    Diaper     Chocolate
101              Milk    Chocolate  Shampoo
102              Beer    Soap       Vodka
103              Beer    Cheese     Wine
104              Milk    Diaper     Beer       Chocolate

Given the above list of transactions, do the following:
1) Find all the frequent itemsets (minimum support 40%).
2) Find all the association rules (minimum confidence 70%).
3) For the discovered association rules, calculate the lift.

What is the objective of Apriori?
A: Identify good association rules
B: Identify all frequent itemsets
C: Identify all rules from frequent itemsets
D: Determine the computational complexity of finding association rules
E: None of the above

Given the table below, the confidence of the rule Bread --> Coke is
A: 1/4
B: 2/4
C: 3/4
D: 4/4
E: None of the above

Problems with Association Mining
• Single minsup: it assumes that all items in the data are of the same nature and/or have similar frequencies.
• Not true: in many applications, some items appear very frequently in the data while others rarely appear. E.g., in a supermarket, people buy food processors and cooking pans much less frequently than they buy bread and milk.

Rare Item Problem
• If the frequencies of items vary a great deal, we encounter two problems:
  – If minsup is set too high, rules that involve rare items will not be found.
  – To find rules that involve both frequent and rare items, minsup has to be set very low. This may cause a combinatorial explosion, because the frequent items will be associated with one another in all possible ways.

Multiple Minsups Model
• The minimum support of a rule is expressed in terms of the minimum item supports (MIS) of the items that appear in the rule.
• Each item can have its own minimum item support.
• By providing different MIS values for different items, the user effectively expresses different support requirements for different rules.

Minsup of a Rule
• Let MIS(i) be the MIS value of item i. The minsup of a rule R is the lowest MIS value among the items in the rule.
• I.e., a rule R: a1, a2, …, ak --> ak+1, …, ar satisfies its minimum support if its actual support is ≥ min(MIS(a1), MIS(a2), …, MIS(ar)).

An Example
• Consider the items bread, shoes, and clothes, with user-specified MIS values
  MIS(bread) = 2%, MIS(shoes) = 0.1%, MIS(clothes) = 0.2%.
• The following rule does not satisfy its minsup:
  clothes --> bread [sup = 0.15%, conf = 70%], since min(MIS(clothes), MIS(bread)) = 0.2% > 0.15%.
• The following rule satisfies its minsup:
  clothes --> shoes [sup = 0.15%, conf = 70%], since min(MIS(clothes), MIS(shoes)) = 0.1% ≤ 0.15%.

Agenda
• Please work on your projects
  – Gather data
  – Refer to the project guidelines
• Association rules: definition and applications
• Generating and identifying good association rules
• Mining class association rules
• Sequential pattern mining (optional)

Mining Class Association Rules (CARs)
• Normal association rule mining does not have any target.
• It finds all possible rules that exist in the data, i.e., any item can appear as a consequent or a condition of a rule.
• However, in some applications the user is interested in specific targets.
  – E.g., the user has a set of text documents from some known topics and wants to find out which words are associated or correlated with each topic.

Problem Definition
• Let T be a transaction data set consisting of n transactions.
• Each transaction is also labeled with a class y.
• Let I be the set of all items in T, and Y the set of all class labels, with I ∩ Y = ∅.
• A class association rule (CAR) is an implication of the form X --> y, where X ⊆ I and y ∈ Y.
• The definitions of support and confidence are the same as those for normal association rules.

An Example
• A text document data set:
  doc 1: Student, Teach, School : Education
  doc 2: Student, School : Education
  doc 3: Teach, School, City, Game : Education
  doc 4: Baseball, Basketball : Sport
  doc 5: Basketball, Player, Spectator : Sport
  doc 6: Baseball, Coach, Game, Team : Sport
  doc 7: Basketball, Team, City, Game : Sport
• Let minsup = 20% and minconf = 60%. Two examples of class association rules:
  Student, School --> Education [sup = 2/7, conf = 2/2]
  Game --> Sport [sup = 2/7, conf = 2/3]

Mining Algorithm
• Unlike normal association rules, CARs can be mined directly in one step.
• The key operation is to find all ruleitems that have support above minsup. A ruleitem is of the form (condset, y), where condset is a set of items from I (i.e., condset ⊆ I) and y ∈ Y is a class label.
• Each ruleitem basically represents a rule condset --> y.
• The Apriori algorithm can be modified to generate CARs; the sketch below shows how one ruleitem is evaluated.
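A minimal Python sketch (my illustration, using the text-document data set above) of evaluating the support and confidence of a single candidate CAR:

```python
def car_metrics(condset, y, docs):
    """Support and confidence of the class association rule condset -> y.
    docs: list of (set_of_items, class_label) pairs."""
    n = len(docs)
    cond = sum(1 for items, _ in docs if condset <= items)
    both = sum(1 for items, label in docs if condset <= items and label == y)
    return both / n, (both / cond if cond else 0.0)

docs = [
    ({"Student", "Teach", "School"}, "Education"),
    ({"Student", "School"}, "Education"),
    ({"Teach", "School", "City", "Game"}, "Education"),
    ({"Baseball", "Basketball"}, "Sport"),
    ({"Basketball", "Player", "Spectator"}, "Sport"),
    ({"Baseball", "Coach", "Game", "Team"}, "Sport"),
    ({"Basketball", "Team", "City", "Game"}, "Sport"),
]
print(car_metrics({"Student", "School"}, "Education", docs))  # (2/7, 2/2)
print(car_metrics({"Game"}, "Sport", docs))                   # (2/7, 2/3)
```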
Next Week
• Guest lecture: Doug Herman, Executive Director of Global Data Science & Analytics, Ingram Micro
• Logistic regression

WEKA
• Find association rules: Apriori
• http://facweb.cs.depaul.edu/mobasher/classes/ect584/WEKA/associate.html

Agenda
• Please work on your projects
  – Gather data
  – Refer to the project guidelines
• Association rules: definition and applications
• Generating and identifying good association rules
• Mining class association rules
• Sequential pattern mining (optional)

Sequential Pattern Mining (informational, not required)
• Association rule mining does not consider the order of transactions.
• In many applications such orderings are significant. E.g.:
  – In market basket analysis, it is interesting to know whether people buy some items in sequence, e.g., buying a bed first and then bed sheets some time later.
  – In web usage mining, it is useful to find the navigational patterns of users in a website from the sequences of their page visits.

Objective (informational, not required)
• Given a set S of input data sequences, the problem of mining sequential patterns is to find all the sequences that have a user-specified minimum support.
• Each such sequence is called a frequent sequence, or a sequential pattern.
• The support for a sequence is the fraction of total data sequences in S that contain this sequence.
• Algorithm: Srikant, R., and R. Agrawal. "Mining Sequential Patterns: Generalizations and Performance Improvements." International Conference on Extending Database Technology, Springer, Berlin, Heidelberg, 1996.

[Example and Example (cont.) slides (informational, not required): worked examples shown as figures, not reproduced here.]

Weka Memory
• To increase the Java virtual machine memory for WEKA:
  – On your computer, go to "Start".
  – In "Search Programs and Files", type "cmd".
  – Use "cd …" to change the directory to the Weka folder (under Program Files).
  – At the command prompt, type: java -Xmx512m -jar weka.jar