Data Mining Algorithm – Market Basket Analysis

Market Basket Analysis is the most widely used and, in many ways, most successful data mining algorithm. It essentially determines which products people purchase together. Stores can use this information to place these products in the same area. Direct marketers can use it to determine which new products to offer to their current customers. Inventory policies can be improved if reorder points reflect the demand for complementary products.

Rules are written in the form “left-hand side implies right-hand side”, for example:

Yellow Peppers IMPLIES Red Peppers, Bananas, Bakery

To make effective use of a rule, three numeric measures about that rule must be considered: (1) support, (2) confidence and (3) lift.

Measures of Predictive Ability

1. Support refers to the percentage of baskets where the rule was true (both left and right side products were present).
2. Confidence measures what percentage of baskets that contained the left-hand product also contained the right.
3. Lift measures how much more frequently the right-hand item is found in baskets that contain the left-hand item than in baskets overall.

An Example

Rule                             Lift    Support (%)   Confidence (%)
Green Peppers IMPLIES Bananas    1.37    3.77          85.96
Red Peppers IMPLIES Bananas      1.43    8.58          89.47
Yellow Peppers IMPLIES Bananas   1.17    22.12         73.09

The confidence suggests people buying any kind of pepper also buy bananas. Green peppers sell in about the same quantities as red or yellow, but are not as predictive.

Market Basket Analysis Methodology

We first need a list of transactions and what was purchased in each. This is easily obtained these days from scanning cash registers. Next, we choose a list of products to analyze and tabulate how many times each was purchased with the others. The diagonal of the table shows how often a product was purchased in any combination, and the off-diagonal entries show how often each pair of products was bought together.
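The tabulation step described above can be sketched in a few lines of Python. This is only an illustration: the baskets are made up for the purpose, and the function name cross_tabulate is a hypothetical choice, not part of any library.

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets, invented purely for illustration.
baskets = [
    ["bread", "butter", "jam"],
    ["bread", "butter"],
    ["bread", "jam"],
    ["butter"],
]

def cross_tabulate(baskets):
    """Count how often each product, and each pair of products, appears in a basket."""
    counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))          # ignore repeats within one basket
        for item in items:
            counts[(item, item)] += 1        # diagonal: product bought at all
        for a, b in combinations(items, 2):
            counts[(a, b)] += 1              # off-diagonal: bought together
            counts[(b, a)] += 1              # keep the table symmetric
    return counts

table = cross_tabulate(baskets)
products = sorted({p for basket in baskets for p in basket})

# Print the table with one row and one column per product.
print("".ljust(8) + "".join(p.ljust(8) for p in products))
for row in products:
    print(row.ljust(8) + "".join(str(table[(row, col)]).ljust(8) for col in products))
```

Storing both (a, b) and (b, a) doubles the memory used but lets the table be read by row or by column without any special handling.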
A Convenience Store Example (5 transactions)

Consider the following simple example of five transactions at a convenience store:

Transaction 1: Frozen pizza, cola, milk
Transaction 2: Milk, potato chips
Transaction 3: Cola, frozen pizza
Transaction 4: Milk, pretzels
Transaction 5: Cola, pretzels

These need to be cross-tabulated and displayed in a table.

Product bought   Pizza also   Milk also   Cola also   Chips also   Pretzels also
Pizza            2            1           2           0            0
Milk             1            3           1           1            1
Cola             2            1           3           0            1
Chips            0            1           0           1            0
Pretzels         0            1           1           0            2

Pizza and cola sell together more often than any other combination; a cross-marketing opportunity? Milk sells well with everything: people probably come here specifically to buy it.

Computing Measures of Association

For a rule of the form A IMPLIES B (example: Cola IMPLIES Frozen Pizza), the measures are computed from the table above.

Formula: Support = occurrences / total transactions * 100%

Support of single products:
Support of Frozen Pizza = 2/5 * 100% = 40%
Support of Cola = 3/5 * 100% = 60%
Support of Milk = 3/5 * 100% = 60%

Support of combination:
Support of Cola IMPLIES Frozen Pizza = 2/5 * 100% = 40%
Support of Frozen Pizza IMPLIES Cola = 2/5 * 100% = 40%
Support of Milk IMPLIES Potato Chips = 1/5 * 100% = 20%

Formula: Confidence = Support of combination / Support of A * 100%

Confidence of Cola IMPLIES Frozen Pizza = 40 / 60 * 100% = 67%
Confidence of Milk IMPLIES Potato Chips = 20 / 60 * 100% = 33%

Convenience store

Customers who buy orange juice also buy milk with 75% confidence.
The combination of milk and orange juice has a support of 30%.
Milk has a support of 90% (it appears in 90% of baskets).
Orange juice has a support of 40%.

Formula: Lift = Confidence of combination / Support of B

Support of B is the likelihood of finding the right-hand side item in any random basket.

Lift of Cola IMPLIES Frozen Pizza = 66.7 / 40 = 1.67
Lift of Milk IMPLIES Potato Chips = 33.3 / 20 = 1.67
Lift of Orange Juice IMPLIES Milk = 75 / 90 = 0.83

(The first two use the unrounded confidences, 66.7% and 33.3%.)

Any rule with a lift of less than 1.0 does not indicate a real cross-selling opportunity, no matter how high its support and confidence, because it actually offers less ability to predict a purchase than random chance does.

Lift measures how well the association rule performs by comparing its performance to the “null” rule (predicting the right-hand side item with no knowledge of the left-hand side item). In this sense, lift can also be thought of as improvement, since it measures the improvement of the prediction over random chance.

Using the Results

The tabulations can immediately be translated into association rules and the numerical measures computed. Comparing this week’s table to last week’s table can immediately show the effect of this week’s promotional activities. Some rules are going to be trivial (hot dogs and buns sell together) or inexplicable (toilet rings sell only when a new hardware store is opened).

Limitations to Market Basket Analysis

A large number of real transactions are needed to do an effective basket analysis, but the data’s accuracy is compromised if the products do not all occur with similar frequency. The analysis can sometimes capture results that were due to the success of previous marketing campaigns (and not the natural tendencies of customers).

Performing Analysis with Virtual Items

The sales data can be augmented with the addition of virtual items. For example, we could record that the customer was new to us, or had children.
The transaction record might look like:

Item 1: Sweater
Item 2: Jacket
Item 3: New

This might allow us to see what patterns new customers have versus old customers.

Multidimensional Market Basket Analysis

Rules can involve more than two items, for example Plant and Clay Pot IMPLIES Soil. These rules are built iteratively. First, pairs are found, then relevant sets of three or four. These are then pruned by removing those that occur infrequently. In an environment like a grocery store, where customers commonly buy over 100 items, rules could involve as many as 10 items or more.

Association Rules (Market Basket Analysis)

Market basket: collection of items purchased by a customer in a single transaction (e.g. supermarket, web)

Association rules:
Unsupervised learning
Used for pattern discovery
Each rule has the form: A -> B, or Left -> Right

For example: “70% of customers who purchase 2% milk will also purchase whole wheat bread.”

Data mining using association rules is the process of looking for strong rules:
1. Find the large itemsets (i.e. most frequent combinations of items). Most frequently used algorithm: Apriori algorithm.
2. Generate association rules for the above itemsets.

How to measure the strength of an association rule?
1. Using support/confidence
2. Using dependence framework

Support/confidence

Support shows the frequency of the patterns in the rule; it is the percentage of transactions that contain both A and B, i.e.
Support = Probability(A and B)
Support = (# of transactions involving A and B) / (total number of transactions).

Confidence is the strength of implication of a rule; it is the percentage of transactions that contain B if they contain A, i.e.
Confidence = Probability(B if A) = P(B|A)
Confidence = (# of transactions involving A and B) / (total number of transactions that have A).
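The two steps and the two measures above can be sketched together in Python, using the five convenience-store transactions from the earlier example. This is a simplified level-wise search in the spirit of Apriori, not a full implementation of it (it skips the subset-pruning step and simply recounts each candidate). The function names and the 40% minimum support threshold are assumptions made for illustration.

```python
from itertools import combinations

# The five convenience-store transactions from the earlier example.
transactions = [
    {"pizza", "cola", "milk"},
    {"milk", "chips"},
    {"cola", "pizza"},
    {"milk", "pretzels"},
    {"cola", "pretzels"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(lhs, rhs, baskets):
    """Support of (A and B) divided by support of A."""
    return support(set(lhs) | set(rhs), baskets) / support(lhs, baskets)

def frequent_itemsets(baskets, min_support):
    """Level-wise search: frequent k-itemsets are joined into (k+1)-item candidates."""
    items = sorted(set().union(*baskets))
    level = [frozenset([i]) for i in items if support([i], baskets) >= min_support]
    found = {}
    while level:
        for itemset in level:
            found[itemset] = support(itemset, baskets)
        # Candidates: unions of two frequent sets that are exactly one item larger.
        size = len(level[0]) + 1
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == size}
        # Prune the candidates that occur too infrequently.
        level = [c for c in candidates if support(c, baskets) >= min_support]
    return found

# Step 1: find the large itemsets (assumed threshold: 40% of baskets).
for itemset, s in frequent_itemsets(transactions, 0.4).items():
    print(sorted(itemset), f"support = {s:.0%}")

# Step 2: measure the strength of a rule built from a large itemset.
print(f"confidence(cola -> pizza) = {confidence({'cola'}, {'pizza'}, transactions):.0%}")
```

Recounting support for every candidate is wasteful on real data; production implementations avoid it, but for a handful of baskets it keeps the logic easy to follow.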
Example:

Customer   Item purchased   Item purchased
1          Pizza            Pepsi
2          Salad            Soda
3          Pizza            Soda
4          Salad            Tea

If A is “purchased pizza” and B is “purchased soda” (Pepsi is counted here as a separate item from soda), then
Support = P(A and B) = 1/4
Confidence = P(B|A) = 1/2

Confidence does not measure whether the association between A and B is random or not. For example, if milk occurs in 30% of all baskets, the information that milk occurs in 30% of all baskets with bread is useless. But if milk is present in 50% of all baskets that contain coffee, that is significant information.

Support allows us to weed out the most infrequent combinations, but sometimes we should not ignore them: for example, if the transaction is valuable and generates a large revenue, or if the products repel each other.

Example: We measure the following:
P(Coke in a basket) = 50%
P(Pepsi in a basket) = 50%
P(Coke and Pepsi in a basket) = 0.001%

What does this mean? If Coke and Pepsi were independent, we would expect that P(Coke and Pepsi in a basket) = 0.5 * 0.5 = 0.25, i.e. 25%. The fact that the observed joint probability is much smaller says that the products are dependent and that they repel each other.
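The independence check above amounts to dividing the observed joint probability by the product of the two individual probabilities. This is the same quantity as lift, since Confidence / Support of B = P(A and B) / (P(A) * P(B)). A minimal sketch follows; the function names and the 5% tolerance band around 1.0 are assumptions chosen for illustration.

```python
def dependence_ratio(p_a, p_b, p_a_and_b):
    """Observed joint probability divided by the joint probability
    that independence would predict, P(A) * P(B)."""
    return p_a_and_b / (p_a * p_b)

def describe(ratio, tolerance=0.05):
    """Rough verbal reading of the ratio; the tolerance is an arbitrary choice."""
    if ratio > 1 + tolerance:
        return "attract (bought together more often than chance)"
    if ratio < 1 - tolerance:
        return "repel (bought together less often than chance)"
    return "roughly independent"

# Coke and Pepsi, using the probabilities measured in the example (0.001% = 0.00001).
coke_pepsi = dependence_ratio(0.50, 0.50, 0.001 / 100)

# Orange juice and milk, using the supports from the earlier convenience-store example.
juice_milk = dependence_ratio(0.40, 0.90, 0.30)

print(f"Coke/Pepsi ratio: {coke_pepsi:.5f} -> {describe(coke_pepsi)}")
print(f"Orange juice/milk ratio: {juice_milk:.2f} -> {describe(juice_milk)}")
```

A ratio near 1 means the two products behave as if independent; the further it falls below 1, the more strongly they repel each other.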