Data Mining Algorithm-Market Basket Analysis

Market Basket Analysis is the most widely used and, in many ways, the most successful data
mining algorithm.
It essentially determines what products people purchase together.
Stores can use this information to place these products in the same area.
Direct marketers can use this information to determine which new products to offer to their
current customers.
Inventory policies can be improved if reorder points reflect the demand for the complementary
products.
Rules are written in the form “left-hand side implies right-hand side” and an example is:
Yellow Peppers IMPLIES Red Peppers, Bananas, Bakery
To make effective use of a rule, three numeric measures about that rule must be considered:
(1) support, (2) confidence and (3) lift.
Measures of Predictive Ability
1. Support refers to the percentage of baskets where the rule was true (both left and right side
products were present).
2. Confidence measures what percentage of baskets that contained the left-hand product also
contained the right.
3. Lift measures how much more frequently the right-hand item is found in baskets that contain
the left-hand item than in baskets overall.
An Example
Rule                              Lift   Support (%)   Confidence (%)
Green Peppers IMPLIES Bananas     1.37       3.77           85.96
Red Peppers IMPLIES Bananas       1.43       8.58           89.47
Yellow Peppers IMPLIES Bananas    1.17      22.12           73.09
The confidence suggests people buying any kind of pepper also buy bananas.
Green peppers sell in about the same quantities as red or yellow, but are not as
predictive.
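Assuming the lift figures in the example follow the usual definition (confidence divided by the support of the right-hand item), the three rules also imply how often bananas appear in baskets overall. A minimal sketch of that arithmetic, using only the figures from the example; the variable names are illustrative:

```python
# Figures from the pepper example: (lift, confidence in percent).
rules = {
    "Green Peppers": (1.37, 85.96),
    "Red Peppers": (1.43, 89.47),
    "Yellow Peppers": (1.17, 73.09),
}

# Assumption: lift = confidence / support(B), so support(B) = confidence / lift.
implied_banana_support = {
    pepper: confidence / lift for pepper, (lift, confidence) in rules.items()
}

for pepper, share in implied_banana_support.items():
    print(f"{pepper}: bananas in about {share:.1f}% of all baskets")
```

All three rules imply roughly the same banana share (between 62% and 63%), which is what we would expect if the measures were computed consistently.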
Market Basket Analysis Methodology
We first need a list of transactions and what was purchased. This is pretty easily
obtained these days from scanning cash registers.
Next, we choose a list of products to analyze, and tabulate how many times each was
purchased with the others.
The diagonal of the table shows how often each product is purchased in any combination,
and the off-diagonal cells show how often each pair of products was bought together.
A Convenience Store Example (5 transactions)
Consider the following simple example about five transactions at a convenience store:
Transaction 1: Frozen pizza, cola, milk
Transaction 2: Milk, potato chips
Transaction 3: Cola, frozen pizza
Transaction 4: Milk, pretzels
Transaction 5: Cola, pretzels
These need to be cross tabulated and displayed in a table.
Product Bought   Pizza also   Milk also   Cola also   Chips also   Pretzels also
Pizza                2            1           2            0             0
Milk                 1            3           1            1             1
Cola                 2            1           3            0             1
Chips                0            1           0            1             0
Pretzels             0            1           1            0             2
Pizza and Cola sell together more often than any other combo; a cross-marketing opportunity?
Milk sells well with everything – people probably come here specifically to buy it.
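The cross tabulation for the five transactions can be reproduced with a few lines of code. This is only a sketch: the product names are shortened as in the table, and the variable names are illustrative.

```python
# The five convenience-store transactions, one set of products per basket.
transactions = [
    {"Pizza", "Cola", "Milk"},
    {"Milk", "Chips"},
    {"Cola", "Pizza"},
    {"Milk", "Pretzels"},
    {"Cola", "Pretzels"},
]
products = ["Pizza", "Milk", "Cola", "Chips", "Pretzels"]

# counts[a][b] = number of baskets containing both a and b.
# When a == b this is the number of baskets containing a (the diagonal).
counts = {
    a: {
        b: sum(1 for basket in transactions if a in basket and b in basket)
        for b in products
    }
    for a in products
}

# Print one row per product, in the same column order as the table.
for a in products:
    print(f"{a:<9}", *(f"{counts[a][b]:>3}" for b in products))
```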
Computing Measures of Association
A IMPLIES B
Example: Cola IMPLIES Frozen Pizza
Formula:
Support = occurrences / total transactions * 100%
Support of Cola IMPLIES Frozen Pizza = 2/5 * 100% = 40%
Support of Frozen Pizza IMPLIES Cola = 2/5 * 100% = 40%
Support of Frozen Pizza = 2/5 * 100% = 40%
Support of Cola = 3/5 * 100% = 60%
Support of Milk = 3/5 * 100% = 60%
Support of Milk IMPLIES Potato Chips = 1/5 * 100% = 20%
The support of a rule is the support of the combination of its items.
Confidence = Support of combination / Support of A * 100%
Confidence of Cola IMPLIES Frozen Pizza = 40 / 60 * 100% = 67%
Confidence of Milk IMPLIES Potato Chips = 20 / 60 * 100% = 33%
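The support and confidence formulas can be checked against the five transactions with a short sketch. The function names `support` and `confidence` are illustrative, not part of any library.

```python
# The five convenience-store transactions.
transactions = [
    {"Frozen Pizza", "Cola", "Milk"},
    {"Milk", "Potato Chips"},
    {"Cola", "Frozen Pizza"},
    {"Milk", "Pretzels"},
    {"Cola", "Pretzels"},
]

def support(items):
    """Percentage of baskets that contain every item in `items`."""
    hits = sum(1 for basket in transactions if set(items) <= basket)
    return 100 * hits / len(transactions)

def confidence(lhs, rhs):
    """Support of the combination divided by support of the left-hand side, in percent."""
    return 100 * support(set(lhs) | set(rhs)) / support(lhs)

print(support({"Cola", "Frozen Pizza"}))                   # support of the combination
print(f"{confidence({'Cola'}, {'Frozen Pizza'}):.0f}%")    # Cola IMPLIES Frozen Pizza
print(f"{confidence({'Milk'}, {'Potato Chips'}):.0f}%")    # Milk IMPLIES Potato Chips
```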
Another example: convenience store customers who buy orange juice also buy milk with 75% confidence.
The combination of milk and orange juice has a support of 30%.
Milk appears in 90% of baskets.
Orange juice appears in 40% of baskets.
Lift = Confidence of combination / Support of B
The support of B is the likelihood of finding the right-hand side item in any random basket.
Lift of Cola IMPLIES Frozen Pizza = 66.7 / 40 = 1.67
Lift of Milk IMPLIES Potato Chips = 33.3 / 20 = 1.67
Lift of Orange Juice IMPLIES Milk = 75 / 90 = 0.83
Any rule with a lift of less than 1.0 does not indicate a real cross-selling opportunity, no matter
how high its support and confidence, because it predicts a purchase less well than random
chance does.
*Lift measures how well the association rule performs by comparing its performance to the
“null” rule (predicting the right-hand side item without knowing whether the left-hand side
item is present). In this sense, lift can also be thought of as improvement, since it measures
the improvement of the prediction over chance.
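A small sketch of the lift calculation, working from supports expressed in percent. The function name `lift` and its parameter names are illustrative.

```python
def lift(support_both, support_lhs, support_rhs):
    """Lift of LHS IMPLIES RHS, with all three supports given in percent."""
    confidence = 100 * support_both / support_lhs   # percent
    return confidence / support_rhs

# Supports taken from the convenience-store and orange-juice examples.
cola_pizza = lift(support_both=40, support_lhs=60, support_rhs=40)
milk_chips = lift(support_both=20, support_lhs=60, support_rhs=20)
juice_milk = lift(support_both=30, support_lhs=40, support_rhs=90)

for name, value in [("Cola IMPLIES Frozen Pizza", cola_pizza),
                    ("Milk IMPLIES Potato Chips", milk_chips),
                    ("Orange Juice IMPLIES Milk", juice_milk)]:
    verdict = "better than chance" if value > 1 else "no better than chance"
    print(f"{name}: lift {value:.2f} ({verdict})")
```

The orange juice rule has high confidence, yet its lift is below 1.0 because milk is already in 90% of baskets.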
Using the Results
The tabulations can immediately be translated into association rules and the numerical
measures computed.
Comparing this week’s table to last week’s table can immediately show the effect of this
week’s promotional activities.
Some rules are going to be trivial (hot dogs and buns sell together) or inexplicable (toilet rings
sell only when a new hardware store is opened).
Limitations to Market Basket Analysis
A large number of real transactions are needed to do an effective basket analysis, but the
data’s accuracy is compromised if all the products do not occur with similar frequency.
The analysis can sometimes capture results that were due to the success of previous
marketing campaigns (and not natural tendencies of customers).
Performing Analysis with Virtual Items
The sales data can be augmented with the addition of virtual items. For example, we could
record that the customer was new to us, or had children.
The transaction record might look like:
Item 1: Sweater
Item 2: Jacket
Item 3: New
This might allow us to see what patterns new customers have versus old customers.
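As a rough illustration, a virtual item can be treated like any other item in the basket. The baskets below are invented purely for this sketch (only the Sweater, Jacket, New record comes from the text), and the helper name `confidence` is an assumption.

```python
# Hypothetical baskets; "New" is a virtual item marking a first-time customer.
baskets = [
    {"Sweater", "Jacket", "New"},
    {"Sweater", "New"},
    {"Jacket", "Scarf"},
    {"Sweater", "Scarf"},
]

def confidence(lhs, rhs):
    """Fraction of baskets containing `lhs` that also contain `rhs`."""
    with_lhs = [basket for basket in baskets if lhs in basket]
    return sum(1 for basket in with_lhs if rhs in basket) / len(with_lhs)

# How often new customers buy a sweater, compared with all customers.
new_rate = confidence("New", "Sweater")
overall_rate = sum(1 for basket in baskets if "Sweater" in basket) / len(baskets)
print(f"New customers: {new_rate:.0%}, all customers: {overall_rate:.0%}")
```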
Multidimensional Market Basket Analysis
Rules can involve more than two items, for example Plant and Clay Pot IMPLIES Soil.
These rules are built iteratively. First, pairs are found, then relevant sets of three or four.
These are then pruned by removing those that occur infrequently.
In an environment like a grocery store, where customers commonly buy over 100 items, rules
could involve as many as 10 items or more.
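The iterative build-and-prune idea can be sketched as a level-wise search. This is a simplified illustration run on the five convenience-store transactions, not a full implementation: it prunes only by how often an itemset occurs, and the function name and `min_count` threshold are assumptions.

```python
from itertools import combinations

transactions = [
    {"Pizza", "Cola", "Milk"},
    {"Milk", "Chips"},
    {"Cola", "Pizza"},
    {"Milk", "Pretzels"},
    {"Cola", "Pretzels"},
]

def frequent_itemsets(transactions, min_count):
    """Find itemsets of size 1, then 2, then 3, ... keeping only those
    that occur in at least `min_count` baskets at each step."""
    items = sorted({item for basket in transactions for item in basket})
    current = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while current:
        counts = {c: sum(1 for basket in transactions if c <= basket)
                  for c in current}
        # Prune: drop itemsets that occur too infrequently.
        kept = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(kept)
        # Build candidates one item larger from the surviving itemsets only.
        k += 1
        current = list({a | b for a, b in combinations(kept, 2)
                        if len(a | b) == k})
    return frequent

result = frequent_itemsets(transactions, min_count=2)
for itemset, n in result.items():
    print(sorted(itemset), n)
```

With a threshold of two baskets, the only pair that survives is Pizza and Cola, so no set of three is ever considered.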
Association Rules (Market Basket Analysis)
Market basket: collection of items purchased by a customer in a single transaction (e.g.
supermarket, web)
Association rules:
- Unsupervised learning
- Used for pattern discovery

Each rule has form: A -> B, or Left -> Right
For example: “70% of customers who purchase 2% milk will also purchase whole wheat
bread.”
Data mining using association rules is the process of looking for strong rules:
1. Find the large itemsets (i.e. most frequent combinations of items)
Most frequently used algorithm: Apriori algorithm.
2. Generate association rules for the above itemsets.
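Step 2 can be sketched as follows: given one frequent itemset, try every split into a left and right side and keep the rules whose confidence clears a threshold. The data is the five convenience-store transactions; the function names and the `min_confidence` threshold are illustrative assumptions.

```python
from itertools import combinations

transactions = [
    {"Pizza", "Cola", "Milk"},
    {"Milk", "Chips"},
    {"Cola", "Pizza"},
    {"Milk", "Pretzels"},
    {"Cola", "Pretzels"},
]

def count(itemset):
    """Number of baskets that contain every item in `itemset`."""
    return sum(1 for basket in transactions if itemset <= basket)

def rules_from(itemset, min_confidence):
    """Return (lhs, rhs, confidence) for every split of a frequent itemset
    into two non-empty parts whose confidence is at least `min_confidence`."""
    itemset = frozenset(itemset)
    found = []
    for size in range(1, len(itemset)):
        for lhs in map(frozenset, combinations(sorted(itemset), size)):
            rhs = itemset - lhs
            # The itemset is assumed frequent, so lhs occurs at least once.
            conf = count(itemset) / count(lhs)
            if conf >= min_confidence:
                found.append((set(lhs), set(rhs), conf))
    return found

strong = rules_from({"Pizza", "Cola"}, min_confidence=0.8)
print(strong)
```

Here Pizza IMPLIES Cola passes the threshold while Cola IMPLIES Pizza does not, which shows that confidence depends on the direction of the rule.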
How to measure the strength of an association rule?
1. Using support/confidence
2. Using dependence framework
Support/confidence
Support shows the frequency of the patterns in the rule; it is the percentage of transactions that
contain both A and B, i.e.
Support = Probability(A and B)
Support = (# of transactions involving A and B) / (total number of transactions).
Confidence is the strength of implication of a rule; it is the percentage of transactions that
contain B if they contain A, i.e.
Confidence = Probability(B if A) = P(B | A)
Confidence = (# of transactions involving A and B) / (total number of transactions that have A).
Example:
Customer   Item purchased   Item purchased
1          Pizza            Pepsi
2          Salad            Soda
3          Pizza            Soda
4          Salad            Tea
If A is “purchased pizza” and B is “purchased soda” then
Support = P(A and B) = 1/4
Confidence = P(B | A) = 1/2
Confidence does not measure if the association between A and B is random or not.
For example, if milk occurs in 30% of all baskets, information that milk occurs in 30% of all
baskets with bread is useless. But if milk is present in 50% of all baskets that contain coffee, that
is significant information.
Support allows us to weed out most infrequent combinations – but sometimes we should not
ignore them, for example, if the transaction is valuable and generates a large revenue, or if the
products repel each other.
Example: We measure the following:
P(Coke in a basket) = 50%
P(Pepsi in a basket) = 50%
P(Coke and Pepsi in a basket) = 0.001%
What does this mean? If Coke and Pepsi were independent, we would expect that
P(Coke and Pepsi in a basket) = 0.5 * 0.5 = 0.25, i.e. 25%.
The fact that the observed joint probability is much smaller says that the products are
dependent and that they repel each other.
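The dependence check amounts to comparing the observed joint probability with the product of the individual probabilities. A minimal sketch with the figures above; the variable names are illustrative:

```python
# Probabilities from the example, written as fractions rather than percentages.
p_coke = 0.50
p_pepsi = 0.50
p_both = 0.00001  # 0.001%

# Joint probability we would expect if the two purchases were independent.
expected_if_independent = p_coke * p_pepsi

# Ratio of observed to expected; this is the same quantity as lift.
# A value near 1 suggests independence, far below 1 suggests the items repel.
ratio = p_both / expected_if_independent

print(f"expected {expected_if_independent:.2f}, observed {p_both:.5f}, ratio {ratio:.5f}")
```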