
Business Intelligence Technologies – Data Mining
Lecture 2: Market Basket Analysis, Association Rules

Agenda
• Market basket analysis & Association rules
• Case Discussion
• Software demo
• Exercise

Barbie® → Candy
1. Put them closer together in the store.
2. Put them far apart in the store.
3. Package candy bars with the dolls.
4. Package Barbie + candy + poorly selling item.
5. Raise the price on one, lower it on the other.
6. Barbie accessories for proofs of purchase.
7. Do not advertise candy and Barbie together.
8. Offer candies in the shape of a Barbie Doll.

Market Basket Analysis (MBA)
MBA in retail setting
• Find out what items are bought together
• Cross-selling
• Optimize shelf layout
• Product bundling
• Timing promotions
• Discount planning (avoid double-discounts)
• Product selection under limited space
• Targeted advertisement, personalized coupons, item recommendations
Usage beyond Market Basket
• Medical (one symptom after another)
• Financial (customers with mortgage acct also have saving acct)

What the data contains
Transaction No. | Item 1 | Item 2 | Item 3 | Item 4
100 | Beer | Diaper | Chocolate | Cheese
101 | Milk | Chocolate | Shampoo |
102 | Beer | Wine | Vodka |
103 | Beer | Cheese | Diaper |
104 | Ice Cream | Diaper | Beer | Chocolate
…

Customer No. | Age | Income | Saving_acct | Children | Mortgage
100 | >50 | High | Yes | Yes | Yes
101 | 35-50 | Mid | No | No | No
102 | <35 | High | Yes | No | Yes
103 | >50 | Mid | Yes | No | Yes
104 | <35 | Low | No | Yes | No
…

Rules Discovered from MBA
• Actionable Rules: Wal-Mart customers who purchase Barbie dolls have a 60% likelihood of also purchasing one of three types of candy bars
• Trivial Rules: Customers who purchase large appliances are very likely to purchase maintenance agreements
• Inexplicable Rules: When a new hardware store opens, one of the most commonly sold items is toilet bowl cleaners

Learning Frequent Itemsets and Association Rules from Data
A descriptive approach for discovering relevant and valid associations among items in the data.
If buy diapers Then buy beer
The itemset corresponding to this rule is {Diaper, Beer}.
• Itemset: A collection of items.
• Frequent Itemset: An itemset that occurs often in data.
Oftentimes, finding frequent itemsets is enough.

Market Basket Analysis
Transaction No. | Item 1 | Item 2 | Item 3 | Item 4
100 | Beer | Diaper | Chocolate | Cheese
101 | Milk | Chocolate | Shampoo |
102 | Beer | Wine | Vodka |
103 | Beer | Cheese | Diaper |
104 | Ice Cream | Diaper | Beer | Chocolate
…
Examples:
• Shoppers who buy Diaper are very likely to buy Beer: If buy Diaper Then buy Beer
• Shoppers who buy Beer and Diaper are likely to buy Cheese and Chocolate: If buy Beer, Diaper Then buy Cheese, Chocolate

Association Rules
Rule format: If {set of items} Then {set of items}
Example: If {Diaper, Baby Food} (LHS) Then {Beer, Wine} (RHS)
LHS implies RHS

Evaluation of Association Rules
What rules should be considered valid?
If {Diaper} (LHS) Then {Beer} (RHS)
An association rule is valid if it satisfies some evaluation measures.

Rule Evaluation
Transaction No. | Item 1 | Item 2 | Item 3
100 | Beer | Diaper | Chocolate
101 | Milk | Chocolate | Wine
102 | Beer | Wine | Vodka
103 | Beer | Cheese | Diaper
104 | Ice Cream | Diaper | Beer
…
Milk & Wine co-occur. But… only 2 out of 200K transactions contain these items.

Rule Evaluation – Support
Support: the frequency with which the items in LHS and RHS co-occur.
Support = (No. of transactions containing items in LHS and RHS) / (Total No. of transactions in the dataset)
Transaction No. | Item 1 | Item 2 | Item 3
100 | Beer | Diaper | Chocolate
101 | Milk | Chocolate | Shampoo
102 | Beer | Wine | Vodka
103 | Beer | Cheese | Diaper
104 | Ice Cream | Diaper | Beer
…
E.g., the support of the {Diaper} → {Beer} rule is 3/5: 60% of the transactions contain both items.

Support evaluation is not enough?
My friend Bill, an 85-year-old man, told me a joke at a party last Friday: An old man is celebrating his 103rd birthday. “I will hold my 104th birthday party next year. You are all welcome to join me,” he announces to his guests proudly.
“How do you know you will still be alive then?” one of his guests asks. “Because very few people die between the ages of 103 and 104,” he replies.
Explain the logic of the old man and provide your comments.
• The old man’s logic: P{103+ & died} is low, so 1 - P{103+ & died} is high.
• Common knowledge: P{103+ & died} = P{103+} * P{died|103+}, where P{103+} is low. So the low value of P{103+ & died} is due to P{103+}, while P{died|103+} is still high.

Rule Evaluation - Confidence
Is Beer leading to Diaper purchase or Diaper leading to Beer purchase?
Transaction No. | Item 1 | Item 2 | Item 3
100 | Beer | Diaper | Chocolate
101 | Milk | Chocolate | Shampoo
102 | Beer | Wine | Vodka
103 | Beer | Cheese | Diaper
104 | Ice Cream | Diaper | Beer
…
• Among the transactions with Diaper, 100% have Beer.
• Among the transactions with Beer, 75% have Diaper.
Confidence = (No. of transactions containing both LHS and RHS) / (No. of transactions containing LHS)
• Confidence for {Diaper} → {Beer}: 3/3. When Diaper is purchased, the likelihood of Beer purchase is 100%.
• Confidence for {Beer} → {Diaper}: 3/4. When Beer is purchased, the likelihood of Diaper purchase is 75%.
So {Diaper} → {Beer} is the more important rule according to confidence.

Rule Evaluation - Lift
Transaction No. | Item 1 | Item 2 | Item 3 | Item 4
100 | Beer | Diaper | Chocolate |
101 | Milk | Chocolate | Shampoo |
102 | Beer | Milk | Vodka | Chocolate
103 | Beer | Milk | Diaper | Chocolate
104 | Milk | Diaper | Beer |
…
What’s the support and confidence for rule {Chocolate} → {Milk}?
• Support = 3/5
• Confidence = 3/4
Very high support and confidence. Does Chocolate really lead to Milk purchase?
No! Milk occurs in 4 out of 5 transactions, so Chocolate is even decreasing the chance of Milk purchase (3/4 < 4/5).
Lift = (3/4)/(4/5) = 0.9375 < 1

Rule Evaluation – Lift (cont.)
Measures how much more likely the RHS is given the LHS than the RHS alone.
Lift = confidence of the rule / frequency of the RHS
Example: {Diaper} → {Beer}
• Total number of customers in database: 1000
• No. of customers buying Diaper: 200
• No. of customers buying Beer: 50
• No. of customers buying Diaper & Beer: 20
• Frequency of Beer = 50/1000 (5%)
• Confidence = 20/200 (10%)
• Lift = 10%/5% = 2
Lift higher than 1 implies people have a higher chance of buying Beer when they buy Diaper.
Lift lower than 1 implies people have a lower chance of buying Milk when they buy Chocolate.

Algorithm to Extract Association Rules (1)
Given a set of transactions T, the goal of association rule mining is to find all rules having
• support ≥ minsup threshold
• confidence ≥ minconf threshold
Brute-force approach:
• List all possible association rules
• Compute the support and confidence for each rule
• Prune rules that fail the minsup and minconf thresholds
Computationally prohibitive!

Frequent Itemset Generation
Brute-force approach:
• Each itemset in the lattice is a candidate frequent itemset
• Count the support of each candidate by scanning the database
• Match each transaction against every candidate
• Complexity ~ O(NMw) => Expensive since M = 2^d !!!
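The evaluation measures and the brute-force search described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the software used in the class demo: the function names (`support`, `confidence`, `lift`, `brute_force_frequent_itemsets`) are hypothetical, the data is the five-transaction table from the lift slide, and exact fractions are used so the results match the slide arithmetic.

```python
from fractions import Fraction
from itertools import combinations

# The five transactions from the lift slide, one set of items per basket.
transactions = [
    {"Beer", "Diaper", "Chocolate"},
    {"Milk", "Chocolate", "Shampoo"},
    {"Beer", "Milk", "Vodka", "Chocolate"},
    {"Beer", "Milk", "Diaper", "Chocolate"},
    {"Milk", "Diaper", "Beer"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    hits = sum(itemset <= t for t in transactions)
    return Fraction(hits, len(transactions))

def confidence(lhs, rhs, transactions):
    """Support of LHS and RHS together, divided by support of LHS."""
    both = set(lhs) | set(rhs)
    return support(both, transactions) / support(lhs, transactions)

def lift(lhs, rhs, transactions):
    """Confidence of the rule divided by the frequency of the RHS."""
    return confidence(lhs, rhs, transactions) / support(rhs, transactions)

def brute_force_frequent_itemsets(transactions, minsup):
    """Test every non-empty itemset over d items: 2^d - 1 candidates."""
    items = sorted(set().union(*transactions))
    frequent = {}
    n_candidates = 0
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            n_candidates += 1
            s = support(candidate, transactions)  # one scan of the data
            if s >= minsup:
                frequent[frozenset(candidate)] = s
    return frequent, n_candidates

print(support({"Chocolate", "Milk"}, transactions))       # 3/5
print(confidence({"Chocolate"}, {"Milk"}, transactions))  # 3/4
print(lift({"Chocolate"}, {"Milk"}, transactions))        # 15/16

frequent, n_candidates = brute_force_frequent_itemsets(transactions, Fraction(2, 5))
print(n_candidates)  # 63 candidates for d = 6 distinct items
```

With only 6 distinct items the search already scans the data 63 times; each additional item doubles the number of candidates, which is why the brute-force approach does not scale.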
[Figure: N transactions, each of width w, matched against a list of M candidate itemsets]

Mining Association Rules
TID | Items
1 | Bread, Milk
2 | Bread, Diaper, Beer, Eggs
3 | Milk, Diaper, Beer, Coke
4 | Bread, Milk, Diaper, Beer
5 | Bread, Milk, Diaper, Coke
Example of Rules:
{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)
Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but can have different confidence
• Thus, we may decouple the support and confidence requirements

Mining Association Rules
Two-step approach:
1. Frequent Itemset Generation: Generate all itemsets whose support ≥ minsup
2. Rule Generation: Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
Frequent itemset generation is still computationally expensive.

Algorithm to Extract Association Rules (2)
The standard algorithm: Apriori
Rakesh Agrawal, Ramakrishnan Srikant: Fast Algorithms for Mining Association Rules in Large Databases. VLDB 1994: 487-499
• The Association Rules problem was defined as: Generate all association rules that have support greater than the user-specified minimum support and confidence greater than the user-specified minimum confidence
• The base algorithm uses support and confidence, but we can also use lift to rank the rules discovered by Apriori.
• The algorithm performs an efficient search over the data to find all such rules.

Finding Association Rules from Data
The association rules discovery problem is decomposed into two sub-problems:
1. Find all sets of items (itemsets) whose support is above minimum support, called frequent itemsets or large itemsets
2. From each frequent itemset, generate rules whose confidence is above minimum confidence.
• Given a large itemset Y and a subset X of Y, calculate the confidence of the rule X → (Y - X)
• If its confidence is above the minimum confidence, then X → (Y - X) is an association rule we are looking for.

Example
Transaction No. | Item 1 | Item 2 | Item 3
100 | Beer | Diaper | Chocolate
101 | Milk | Chocolate | Shampoo
102 | Beer | Wine | Vodka
103 | Beer | Cheese | Diaper
104 | Ice Cream | Diaper | Beer
A data set with 5 transactions. Minimum support = 40%, Minimum confidence = 80%
Phase 1: Find all frequent itemsets
• {Beer} (support=80%), {Diaper} (60%), {Chocolate} (40%)
• {Beer, Diaper} (60%)
Phase 2:
• Beer → Diaper (conf. 60%÷80% = 75%)
• Diaper → Beer (conf. 60%÷60% = 100%)
Only Diaper → Beer meets the 80% minimum confidence.

Phase 1: Finding all frequent itemsets
How to perform an efficient search of all frequent itemsets?
Note: frequent itemsets of size n contain itemsets of size n-1 that also must be frequent.
Example: if {diaper, beer} is frequent then {diaper} and {beer} are each frequent as well.
This means that…
• If an itemset is not frequent (e.g., {wine}) then no itemset that includes wine can be frequent either, such as {wine, beer}.
• We therefore first find all itemsets of size 1 that are frequent. Then we try to “expand” these by counting the frequency of all itemsets of size 2 that include frequent itemsets of size 1.
• Example: If {wine} is not frequent we need not try to find out whether {wine, beer} is frequent. But if both {wine} & {beer} were frequent then it is possible (though not guaranteed) that {wine, beer} is also frequent.
• Then take only itemsets of size 2 that are frequent, and try to expand those, etc.

Phase 2: Generating Association Rules
Assume {Milk, Bread, Butter} is a frequent itemset.
Using items contained in the itemset, list all possible rules:
• {Milk} → {Bread, Butter}
• {Bread} → {Milk, Butter}
• {Butter} → {Milk, Bread}
• {Milk, Bread} → {Butter}
• {Milk, Butter} → {Bread}
• {Bread, Butter} → {Milk}
Calculate the confidence of each rule.
Pick the rules with confidence above the minimum confidence.
Confidence of {Milk} → {Bread, Butter}
= (No. of transactions that support {Milk, Bread, Butter}) / (No. of transactions that support {Milk})
= Support{Milk, Bread, Butter} / Support{Milk}

Association
If the rule {Yogurt} → {Bread, Butter} is found to have minimum confidence, does it mean the rule {Bread, Butter} → {Yogurt} also has minimum confidence?
No. Example:
• Support of {Yogurt} is 20%, {Yogurt, Bread, Butter} is 10%, {Bread, Butter} is 50%
• Confidence of {Yogurt} → {Bread, Butter} is 10%/20% = 50%
• Confidence of {Bread, Butter} → {Yogurt} is 10%/50% = 20%

Agrawal (94)’s Apriori Algorithm: An Example
Transactions
T-ID | Items
10 | A, C, D
20 | B, C, E
30 | A, B, C, E
40 | B, E

C1 (1st scan)
Itemset | sup
{A} | 2
{B} | 3
{C} | 3
{D} | 1
{E} | 3

L1
Itemset | sup
{A} | 2
{B} | 3
{C} | 3
{E} | 3

C2
Itemset: {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}

C2 (2nd scan)
Itemset | sup
{A, B} | 1
{A, C} | 2
{A, E} | 1
{B, C} | 2
{B, E} | 3
{C, E} | 2

L2
Itemset | sup
{A, C} | 2
{B, C} | 2
{B, E} | 3
{C, E} | 2

C3
Itemset: {B, C, E}
{A, B, C}? Not a candidate, because its subset {A, B} is not frequent.

L3 (3rd scan)
Itemset | sup
{B, C, E} | 2

Sequential Patterns
Instead of finding associations between items in a single transaction, find associations between items across related transactions over time.
Customer ID | Transaction Date | Item 1 | Item 2
AA | 2/2/2001 | Laptop | Case
AA | 1/13/2002 | Wireless network card | Router
BB | 4/5/2002 | Laptop | iPaq
BB | 8/10/2002 | Wireless network card | Router
…
Sequence: {Laptop}, {Wireless Card, Router}
A sequence has to satisfy some predetermined minimum support.

Examples of Sequence Data
Sequence Database | Sequence | Element (Transaction) | Event (Item)
Customer | Purchase history of a given customer | A set of items bought by a customer at time t | Books, dairy products, CDs, etc.
Web Data | Browsing activity of a particular Web visitor | A collection of files viewed by a Web visitor after a single mouse click | Home page, index page, contact info, etc.
Event data | History of events generated by a given sensor | Events triggered by a sensor at time t | Types of alarms generated by sensors
Genome sequences | DNA sequence of a particular species | An element of the DNA sequence | Bases A, T, G, C
[Figure: a sequence drawn as an ordered series of elements (transactions), each containing one or more events (items) E1 to E4]

Examples of Sequence
• Web sequence: < {Homepage} {Electronics} {Digital Cameras} {Canon Digital Camera} {Shopping Cart} {Order Confirmation} {Return to Shopping} >
• Sequence of books checked out at a library: < {Fellowship of the Ring} {The Two Towers} {Return of the King} >

Applications of Association Rules
• Market-Basket Analysis: e.g. product assortment optimization (see next slide)
• Recommendations: Determine which books are frequently purchased together and recommend associated books or products to people who express interest in an item.
• Healthcare: By studying the side-effects in patients with multiple prescriptions, we can discover previously unknown interactions and warn patients about them.
• Fraud detection: Finding in insurance data that a certain doctor often works with a certain lawyer may indicate potential fraudulent activity. (virtual items)
• Sequence Discovery: Looks for associations between items bought over time. E.g., we may notice that people who buy chili tend to buy antacid within a month.
Knowledge like this can be used to plan inventory levels.

Product Assortment Optimization
Graphs of expected sales (e.g. derived from association rules) and costs (e.g. of purchasing and holding inventory) can allow us to optimize the number and selection (choice) of items in a product category.
[Figure: two plots of Dollars against Products in Category. The first shows Revenues, Costs and Margin; the second marks the Max Profit point, where Margin = Revenues - Costs]

Case - Merkur
1. What are the benefits of finding the associated products sold together within the same transaction, or sold together to the same customer? (i.e. use transaction or customer as the unit of analysis)
2. How to perform an item-based Market Basket Analysis or a customer-based Market Basket Analysis, and what are the benefits for each? (i.e. MBA based on data about a specific item, MBA based on data about a specific customer)
3. What are the interesting results from MBA discussed in the case?
4. How to decide promotion items based on MBA?
5. How to evaluate a promotion based on MBA?
6. How does MBA help product bundling?
7. Please brainstorm a promotion plan based on MBA to maximize the net profit of the retailer.
8. How to do targeted promotion over time?
9. Other possible strategies based on MBA?
Exercise
Transaction No. | Item 1 | Item 2 | Item 3 | Item 4
100 | Beer | Diaper | Chocolate |
101 | Milk | Chocolate | Shampoo |
102 | Beer | Soap | Vodka |
103 | Beer | Cheese | Wine |
104 | Milk | Diaper | Beer | Chocolate
Given the above list of transactions, do the following:
1) Find all the frequent itemsets (minimum support 40%)
2) Find all the association rules (minimum confidence 70%)
3) For the discovered association rules, calculate the lift

What to Do After Class
• Read Chapter 4, 9
• Read cases for Lecture 3
• Get familiar with SAS or WEKA, replicate the class demo.
• Talk to candidate companies for your project
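For checking hand computations when replicating the demo, the two-phase procedure from the lecture (level-wise frequent itemset search, then rule generation) can be sketched in plain Python. This is a minimal teaching sketch rather than the Agrawal and Srikant implementation or what SAS or WEKA do internally. The names `apriori` and `generate_rules` are hypothetical. The data is the four-transaction Apriori example from the slides, with a minimum support count of 2 as implied by the pruning of {D}, and the 80% confidence threshold is an assumption chosen for illustration.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Phase 1: return {itemset: support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted(set().union(*transactions))
    candidates = [frozenset([item]) for item in items]  # C1
    frequent = {}
    k = 1
    while candidates:
        # One scan of the data counts every candidate of size k.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}  # Lk
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-candidates, then prune any
        # candidate that has an infrequent k-subset.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = [
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent

def generate_rules(frequent, min_conf):
    """Phase 2: split each frequent itemset Y into X -> (Y - X)."""
    rules = []
    for itemset, count in frequent.items():
        for size in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), size):
                lhs = frozenset(lhs)
                conf = count / frequent[lhs]  # support(Y) / support(X)
                if conf >= min_conf:
                    rules.append((lhs, itemset - lhs, conf))
    return rules

# Transactions 10, 20, 30, 40 from the Apriori example slide.
data = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
frequent = apriori(data, min_count=2)
rules = generate_rules(frequent, min_conf=0.8)  # assumed threshold

for lhs, rhs, conf in rules:
    print(sorted(lhs), "->", sorted(rhs), f"conf={conf:.2f}")
```

On this data the only frequent 3-itemset is {B, C, E} with count 2, matching L3 on the slide. Candidates such as {A, B, C} are never counted because {A, B} was already found infrequent.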