Market basket analysis & Association rules
Case Discussion
Software demo
Exercise
1. Put them closer together in the store.
2. Put them far apart in the store.
3. Package candy bars with the dolls.
4. Package Barbie + candy + poorly selling item.
5. Raise the price on one, lower it on the other.
6. Barbie accessories for proofs of purchase.
7. Do not advertise candy and Barbie together.
8. Offer candies in the shape of a Barbie Doll.
MBA in retail setting
Find out which items are bought together
Cross-selling
Optimize shelf layout
Product bundling
Timing promotions
Discount planning (avoid double-discounts)
Product selection under limited space
Targeted advertisement, Personalized coupons, item recommendations
Usage beyond Market Basket
Medical (one symptom after another)
Financial (customers with mortgage acct also have saving acct)
Transaction No.  Item 1     Item 2     Item 3     Item 4
100              Beer       Diaper     Chocolate  Cheese
101              Milk       Chocolate  Shampoo
102              Beer       Wine       Vodka
103              Beer       Cheese     Diaper     Chocolate
104              Ice Cream  Diaper     Beer
…
Customer No.  Age    Income
100           >50    High
101           35-50  Mid
102           <35    High
103           >50    Mid
104           <35    Low
…
Saving_acct Children Mortgage
Yes Yes Yes
No
Yes
No
No
No
Yes
Yes
No
No
Yes
Yes
No
Actionable Rules
Wal-Mart customers who purchase Barbie dolls have a 60% likelihood of also purchasing one of three types of candy bars
Trivial Rules
Customers who purchase large appliances are very likely to purchase maintenance agreements
Inexplicable Rules
When a new hardware store opens, one of the most commonly sold items is toilet bowl cleaners
Data
If buy diapers, then buy beer
The itemset corresponding to this rule is {Diaper, Beer}
Itemset: A collection of items.
Frequent Itemset: An itemset that occurs often in data.
Oftentimes, finding frequent itemsets is enough.
Transaction No.  Item 1     Item 2     Item 3     Item 4
100              Beer       Diaper     Chocolate  Cheese
101              Milk       Chocolate  Shampoo
102              Beer       Wine       Vodka
103              Beer       Cheese     Diaper     Chocolate
104              Ice Cream  Diaper     Beer
…
Examples:
Shoppers who buy Diaper are very likely to buy Beer.
If buy Diaper, then buy Beer
Shoppers who buy Beer and Diaper are likely to buy Cheese and Chocolate.
If buy Beer, Diaper, then buy Cheese, Chocolate
If {Diaper, Baby Food} then {Beer, Wine}
LHS = {Diaper, Baby Food}, RHS = {Beer, Wine}
What rules should be considered valid?
If {Diaper} (LHS) then {Beer} (RHS)
An association rule is valid if it satisfies some evaluation measures
Milk & Wine co-occur
But…
Only 2 out of 200K transactions contain these items
Transaction No.  Item 1     Item 2     Item 3
100              Beer       Diaper     Chocolate
101              Milk       Chocolate  Wine
102              Beer       Wine       Vodka
103              Beer       Cheese     Diaper
104              Ice Cream  Diaper     Beer
…
Support:
The frequency with which the items in LHS and RHS co-occur.
E.g., the support of the {Diaper} → {Beer} rule is 3/5:
60% of the transactions contain both items.
Support = (No. of transactions containing items in LHS and RHS) / (Total No. of transactions in the dataset)
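The support formula can be checked with a few lines of Python. This is only a sketch: the transactions are typed in by hand from the five-row Beer/Diaper example, and the function name `support` is an illustrative choice, not part of any library.

```python
# Five example transactions (100-104), entered by hand for illustration.
transactions = [
    {"Beer", "Diaper", "Chocolate"},   # 100
    {"Milk", "Chocolate", "Shampoo"},  # 101
    {"Beer", "Wine", "Vodka"},         # 102
    {"Beer", "Cheese", "Diaper"},      # 103
    {"Ice Cream", "Diaper", "Beer"},   # 104
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

# Support of the rule {Diaper} -> {Beer} is the support of {Diaper, Beer}.
print(support({"Diaper", "Beer"}, transactions))  # 0.6
```

Note that support is symmetric: {Diaper} → {Beer} and {Beer} → {Diaper} share the same itemset and therefore the same support.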
Transaction No.  Item 1     Item 2     Item 3
100              Beer       Diaper     Chocolate
101              Milk       Chocolate  Shampoo
102              Beer       Wine       Vodka
103              Beer       Cheese     Diaper
104              Ice Cream  Diaper     Beer
…
My friend Bill, an 85-year-old man, told me a joke at a party last Friday:
An old man is celebrating his 103rd birthday.
“I will hold my 104th birthday party next year. You are all welcome to join me,” he announces to his guests proudly.
“How do you know you will still be alive then?” one of his guests asks.
“Because very few people die between the ages of 103 and 104,” he replies.
Explain the logic of the old man and provide your comments.
The old man’s logic: P{103+ & died} is low, so 1 - P{103+ & died} is high.
Common knowledge: P{103+ & died} = P{103+} * P{died | 103+}, where P{103+} is low.
So P{103+ & died} is low because P{103+} is low, while P{died | 103+} is still high.
Is Beer leading to Diaper purchase or Diaper leading to Beer purchase?
Among the transactions with Diaper, 100% have Beer.
Among the transactions with Beer, 75% have Diaper.
Transaction No.  Item 1     Item 2     Item 3
100              Beer       Diaper     Chocolate
101              Milk       Chocolate  Shampoo
102              Beer       Wine       Vodka
103              Beer       Cheese     Diaper
104              Ice Cream  Diaper     Beer
…
Confidence = (No. of transactions containing both LHS and RHS) / (No. of transactions containing LHS)
Confidence for {Diaper} → {Beer}: 3/3
When Diaper is purchased, the likelihood of Beer purchase is 100%
Confidence for {Beer} → {Diaper}: 3/4
When Beer is purchased, the likelihood of Diaper purchase is 75%
So, {Diaper} → {Beer} is a more important rule according to confidence.
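Unlike support, confidence depends on the direction of the rule. The sketch below recomputes both directions for the five example transactions; the data is entered by hand and the helper name `confidence` is hypothetical.

```python
# Five example transactions (100-104), entered by hand for illustration.
transactions = [
    {"Beer", "Diaper", "Chocolate"},   # 100
    {"Milk", "Chocolate", "Shampoo"},  # 101
    {"Beer", "Wine", "Vodka"},         # 102
    {"Beer", "Cheese", "Diaper"},      # 103
    {"Ice Cream", "Diaper", "Beer"},   # 104
]

def confidence(lhs, rhs, transactions):
    """Share of the transactions containing LHS that also contain RHS."""
    lhs, rhs = set(lhs), set(rhs)
    n_lhs = sum(1 for t in transactions if lhs <= t)
    n_both = sum(1 for t in transactions if (lhs | rhs) <= t)
    if n_lhs == 0:
        raise ValueError("LHS never occurs, so confidence is undefined")
    return n_both / n_lhs

print(confidence({"Diaper"}, {"Beer"}, transactions))  # 1.0
print(confidence({"Beer"}, {"Diaper"}, transactions))  # 0.75
```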
Transaction No.  Item 1  Item 2     Item 3     Item 4
100              Beer    Diaper     Chocolate
101              Milk    Chocolate  Shampoo
102              Beer    Milk       Vodka
103              Beer    Milk       Diaper     Chocolate
104              Milk    Diaper     Beer       Chocolate
…
What’s the support and confidence for the rule {Chocolate} → {Milk}?
Support = 3/5 Confidence = 3/4
Very high support and confidence.
Does Chocolate really lead to Milk purchase?
No! Milk occurs in 4 out of 5 transactions, so buying Chocolate actually decreases the chance of a Milk purchase (3/4 < 4/5)
Lift = (3/4)/(4/5) = 0.9375 < 1
Lift measures how much more likely the RHS is given the LHS than the RHS on its own
Lift = confidence of the rule / frequency of the RHS
Example: {Diaper} → {Beer}
Total number of customers in database: 1000
No. of customers buying Diaper: 200
No. of customers buying Beer: 50
No. of customers buying Diaper & Beer: 20
Frequency of Beer = 50/1000 (5%)
Confidence = 20/200 (10%)
Lift = 10%/5% = 2
A lift higher than 1 implies people have a higher chance of buying Beer when they buy Diaper. A lift lower than 1 implies people have a lower chance of buying Milk when they buy Chocolate.
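Both lift examples can be reproduced from raw counts. A minimal sketch, assuming the counts quoted on the slides; the function name `lift_from_counts` is made up for illustration. The formula is rearranged as (n_both × n_total) / (n_lhs × n_rhs), which is algebraically the same as confidence divided by the frequency of the RHS.

```python
def lift_from_counts(n_total, n_lhs, n_rhs, n_both):
    """Lift = confidence / frequency of RHS = (n_both/n_lhs) / (n_rhs/n_total)."""
    return (n_both * n_total) / (n_lhs * n_rhs)

# {Diaper} -> {Beer}: 1000 customers, 200 buy Diaper, 50 buy Beer, 20 buy both.
print(lift_from_counts(1000, 200, 50, 20))  # 2.0

# {Chocolate} -> {Milk}: 5 transactions, 4 with Chocolate, 4 with Milk, 3 with both.
print(lift_from_counts(5, 4, 4, 3))  # 0.9375
```

Because the formula is symmetric in LHS and RHS, swapping the two sides of a rule leaves its lift unchanged, even though its confidence may change.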
Given a set of transactions T, the goal of association rule mining is to find all rules having
support ≥ minsup threshold
confidence ≥ minconf threshold
Brute-force approach:
List all possible association rules
Compute the support and confidence for each rule
Prune rules that fail the minsup and minconf thresholds
Computationally prohibitive !
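To get a feel for why listing every rule is prohibitive, the sketch below enumerates all rules over a tiny catalogue of six items (the item names are illustrative, borrowed from the Bread/Milk example in this deck). Each item can sit in the LHS, in the RHS, or in neither, and both sides must be non-empty, which gives 3^d - 2^(d+1) + 1 rules for d items.

```python
from itertools import combinations

def enumerate_rules(items):
    """Return every rule LHS -> RHS with non-empty, disjoint LHS and RHS."""
    rules = []
    for size in range(2, len(items) + 1):
        for itemset in combinations(items, size):
            # Every binary partition of the itemset is one candidate rule.
            for lhs_size in range(1, size):
                for lhs in combinations(itemset, lhs_size):
                    rhs = tuple(i for i in itemset if i not in lhs)
                    rules.append((lhs, rhs))
    return rules

items = ["Bread", "Milk", "Diaper", "Beer", "Eggs", "Coke"]  # d = 6
rules = enumerate_rules(items)
print(len(rules))            # 602
print(3**6 - 2**7 + 1)       # 602, the closed-form count
```

With only six items there are already 602 candidate rules; the count roughly triples with every additional item.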
Brute-force approach:
Each itemset in the lattice is a candidate frequent itemset
Count the support of each candidate by scanning the database
Match each transaction against every candidate
Complexity ~ O(NMw) => Expensive since M = 2^d !!!
(N = number of transactions, M = number of candidate itemsets, w = maximum transaction width, d = number of distinct items)

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
Example of Rules:
{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)
Observations:
• All the above rules are binary partitions of the same itemset:
{Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but can have different confidence
• Thus, we may decouple the support and confidence requirements
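The observations can be verified directly: every binary partition of {Milk, Diaper, Beer} shares one support value, while the confidence depends on which side is the LHS. A sketch using the five Bread/Milk transactions; the helper names `support` and `rules_from` are illustrative only.

```python
from itertools import combinations

# The five transactions (TID 1-5) from the table, entered by hand.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def rules_from(itemset):
    """All binary partitions LHS -> RHS of one itemset, with support and confidence."""
    itemset = frozenset(itemset)
    s = support(itemset)  # identical for every rule built from this itemset
    rules = []
    for lhs_size in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), lhs_size):
            lhs = frozenset(lhs)
            rules.append((lhs, itemset - lhs, s, s / support(lhs)))
    return rules

rules = rules_from({"Milk", "Diaper", "Beer"})
for lhs, rhs, s, c in rules:
    print(sorted(lhs), "->", sorted(rhs), f"(s={s:.1f}, c={c:.2f})")
```

Since the support is computed once per itemset, only the confidence needs to be recomputed per rule, which is what makes the two-step decoupling worthwhile.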
Two-step approach:
Frequent Itemset Generation
Generate all itemsets whose support ≥ minsup
Rule Generation
Generate high confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
Frequent itemset generation is still computationally expensive
The standard algorithm: Apriori
Rakesh Agrawal, Ramakrishnan Srikant: Fast Algorithms for Mining
Association Rules in Large Databases. VLDB 1994: 487-499
The Association Rules problem was defined as:
Generate all association rules that have
support greater than the user-specified minimum support
and confidence greater than the user-specified minimum confidence
The base algorithm uses support and confidence, but we can also use lift to rank the rules discovered by Apriori.
The algorithm performs an efficient search over the data to find all such rules.
Association rules discovery problem is decomposed into two sub-problems:
1. Find all sets of items (itemsets) whose support is above minimum support, called frequent itemsets or large itemsets
2. From each frequent itemset, generate rules whose confidence is above minimum confidence.
Given a large itemset Y, and X is a subset of Y
Calculate confidence of the rule X → (Y − X)
If its confidence is above the minimum confidence, then X → (Y − X) is an association rule we are looking for.
Transaction No.  Item 1     Item 2     Item 3
100              Beer       Diaper     Chocolate
101              Milk       Chocolate  Shampoo
102              Beer       Wine       Vodka
103              Beer       Cheese     Diaper
104              Ice Cream  Diaper     Beer
A data set with 5 transactions
Minimum support = 40%, Minimum confidence = 80%
Phase 1: Find all frequent itemsets
{Beer} (support = 80%), {Diaper} (60%), {Chocolate} (40%)
{Beer, Diaper} (60%)
Phase 2:
{Beer} → {Diaper} (conf. 60% ÷ 80% = 75%)
{Diaper} → {Beer} (conf. 60% ÷ 60% = 100%)
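The two phases can be replayed on the five example transactions. This is a deliberately naive sketch that tests every possible itemset (no Apriori pruning), which is only feasible because the data set is tiny; the transactions are entered by hand and all names are illustrative.

```python
from itertools import combinations

transactions = [
    {"Beer", "Diaper", "Chocolate"},   # 100
    {"Milk", "Chocolate", "Shampoo"},  # 101
    {"Beer", "Wine", "Vodka"},         # 102
    {"Beer", "Cheese", "Diaper"},      # 103
    {"Ice Cream", "Diaper", "Beer"},   # 104
]
MIN_SUPPORT = 0.4
MIN_CONFIDENCE = 0.8

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

# Phase 1: keep every itemset whose support reaches the minimum support.
items = sorted(set().union(*transactions))
frequent = {}
for size in range(1, len(items) + 1):
    for combo in combinations(items, size):
        s = support(frozenset(combo))
        if s >= MIN_SUPPORT:
            frequent[frozenset(combo)] = s

# Phase 2: split each frequent itemset into LHS -> RHS and check confidence.
rules = []
for itemset, s in frequent.items():
    for lhs_size in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), lhs_size):
            lhs = frozenset(lhs)
            conf = s / frequent[lhs]  # subsets of a frequent itemset are frequent
            if conf >= MIN_CONFIDENCE:
                rules.append((set(lhs), set(itemset - lhs), conf))

print(frequent)
print(rules)  # only {Diaper} -> {Beer} reaches 80% confidence
```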
How to perform an efficient search of all frequent itemsets?
Note: every frequent itemset of size n contains itemsets of size n−1 that must also be frequent
Example: if {diaper, beer} is frequent then {diaper} and {beer} are each frequent as well
This means that…
If an itemset is not frequent (e.g., { wine } ) then no itemset that includes wine can be frequent either, such as {wine, beer} .
We therefore first find all itemsets of size 1 that are frequent.
Then try to “expand” these by counting the frequency of all itemsets of size 2 that include frequent itemsets of size 1.
Example:
If {wine} is not frequent we need not try to find out whether {wine, beer} is frequent. But if both {wine} & {beer} were frequent then it is possible (though not guaranteed) that {wine, beer} is also frequent.
Then take only itemsets of size 2 that are frequent, and try to expand those, etc.
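The "expand" step can be sketched as a small function that proposes size-k candidates only when all of their size-(k−1) subsets are already known to be frequent. This is a simplified stand-in for the join-and-prune step of Apriori (it produces the same candidate set, but not in the way the original paper organizes the join); the function name `generate_candidates` is hypothetical.

```python
from itertools import combinations

def generate_candidates(frequent_prev, k):
    """Size-k itemsets whose (k-1)-subsets are all in `frequent_prev`."""
    frequent_prev = set(frequent_prev)
    if not frequent_prev:
        return set()
    # Only items that appear in some frequent itemset can be part of a candidate.
    items = sorted(set().union(*frequent_prev))
    candidates = set()
    for combo in combinations(items, k):
        if all(frozenset(sub) in frequent_prev for sub in combinations(combo, k - 1)):
            candidates.add(frozenset(combo))
    return candidates

# {Wine} is not frequent, so no candidate containing Wine is ever proposed.
frequent_1 = {frozenset({"Beer"}), frozenset({"Diaper"}), frozenset({"Chocolate"})}
candidates_2 = generate_candidates(frequent_1, 2)
print(sorted(sorted(c) for c in candidates_2))
```

Each candidate still has to be counted against the data, because having frequent subsets makes an itemset possible but not guaranteed to be frequent.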
Assume {Milk, Bread, Butter} is a frequent itemset.
Using items contained in the itemset, list all possible rules
{Milk} → {Bread, Butter}
{Bread} → {Milk, Butter}
{Butter} → {Milk, Bread}
{Milk, Bread} → {Butter}
{Milk, Butter} → {Bread}
{Bread, Butter} → {Milk}
Calculate the confidence of each rule
Pick the rules with confidence above the minimum confidence
Confidence of {Milk} → {Bread, Butter}
= (No. of transactions that support {Milk, Bread, Butter}) / (No. of transactions that support {Milk})
= Support {Milk, Bread, Butter} / Support {Milk}
If the rule {Yogurt} → {Bread, Butter} is found to have minimum confidence,
does it mean the rule {Bread, Butter} → {Yogurt} also has minimum confidence?
No.
Example:
Support of {Yogurt} is 20%
Support of {Yogurt, Bread, Butter} is 10%
Support of {Bread, Butter} is 50%
Confidence of {Yogurt} → {Bread, Butter} is 10%/20% = 50%
Confidence of {Bread, Butter} → {Yogurt} is 10%/50% = 20%
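Once itemset supports are known, rule confidence needs no further pass over the data. The sketch below stores the three support values from the Yogurt example in a dictionary (a hypothetical layout chosen for illustration) and derives both confidences from it.

```python
# Support values quoted in the Yogurt example, keyed by itemset.
support = {
    frozenset({"Yogurt"}): 0.20,
    frozenset({"Bread", "Butter"}): 0.50,
    frozenset({"Yogurt", "Bread", "Butter"}): 0.10,
}

def confidence(lhs, rhs):
    """confidence(LHS -> RHS) = support(LHS and RHS together) / support(LHS)."""
    lhs, rhs = frozenset(lhs), frozenset(rhs)
    return support[lhs | rhs] / support[lhs]

print(confidence({"Yogurt"}, {"Bread", "Butter"}))  # about 0.5
print(confidence({"Bread", "Butter"}, {"Yogurt"}))  # about 0.2
```

The numerator is the same in both directions; only the denominator changes, which is why the two rules can land on opposite sides of a confidence threshold.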
Agrawal (94)’s Apriori Algorithm: An Example

Transactions
T-ID  Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan → C1
Itemset  sup
{A}      2
{B}      3
{C}      3
{D}      1
{E}      3

L1
Itemset  sup
{A}      2
{B}      3
{C}      3
{E}      3

C2
Itemset
{A, B}
{A, C}
{A, E}
{B, C}
{B, E}
{C, E}

2nd scan → C2 with counts
Itemset  sup
{A, B}   1
{A, C}   2
{A, E}   1
{B, C}   2
{B, E}   3
{C, E}   2

L2
Itemset  sup
{A, C}   2
{B, C}   2
{B, E}   3
{C, E}   2

C3
Itemset
{B, C, E}
{A, B, C}?

3rd scan → L3
Itemset    sup
{B, C, E}  2
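The example can be traced in code. A minimal level-wise sketch, assuming a minimum support count of 2 (inferred from the example, where {D} with count 1 is dropped and itemsets with count 2 are kept); it is a simplified illustration, not the implementation from the Agrawal and Srikant paper.

```python
from itertools import combinations

transactions = {
    10: {"A", "C", "D"},
    20: {"B", "C", "E"},
    30: {"A", "B", "C", "E"},
    40: {"B", "E"},
}
MIN_COUNT = 2  # assumption: inferred from the example, not stated on the slide

def count(itemset):
    """Number of transactions containing the itemset (one database scan)."""
    return sum(1 for items in transactions.values() if itemset <= items)

def apriori():
    levels = []  # levels[k-1] maps each frequent k-itemset to its count
    candidates = {frozenset([i]) for items in transactions.values() for i in items}
    k = 1
    while candidates:
        counted = {c: count(c) for c in candidates}
        frequent = {c: n for c, n in counted.items() if n >= MIN_COUNT}
        if not frequent:
            break
        levels.append(frequent)
        k += 1
        # Next candidates: k-itemsets whose (k-1)-subsets are all frequent.
        items = sorted(set().union(*frequent))
        candidates = {
            frozenset(c) for c in combinations(items, k)
            if all(frozenset(s) in frequent for s in combinations(c, k - 1))
        }
    return levels

levels = apriori()
for k, frequent in enumerate(levels, start=1):
    print(f"L{k}:", {"".join(sorted(c)): n for c, n in frequent.items()})
```

{A, B, C} never becomes a candidate because its subset {A, B} is not in L2, which answers the question mark in C3.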
Instead of finding associations between items within a single transaction, find associations between items across related transactions over time.
Customer ID  Transaction Date  Item 1                 Item 2
AA           2/2/2001          Laptop                 Case
AA           1/13/2002         Wireless network card  Router
BB           4/5/2002          Laptop                 iPaq
BB           8/10/2002         Wireless network card  Router
…            …                 …                      …
Sequence : {Laptop}, {Wireless Card, Router}
A sequence has to satisfy some predetermined minimum support
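Sequence support can be counted per customer: a customer supports a sequence if its elements appear, in order, within that customer's purchase history. The sketch below is one simple way to check this, assuming the four purchases of customers AA and BB shown in this section; the function names are illustrative.

```python
# Purchase histories per customer, ordered by transaction date
# (entered by hand from the AA/BB example; spellings normalised).
histories = {
    "AA": [{"Laptop", "Case"}, {"Wireless network card", "Router"}],
    "BB": [{"Laptop", "iPaq"}, {"Wireless network card", "Router"}],
}

def contains(history, pattern):
    """True if the elements of `pattern` appear, in order, inside `history`."""
    position = 0
    for element in history:
        # Match the next pattern element as early as possible (greedy scan).
        if position < len(pattern) and pattern[position] <= element:
            position += 1
    return position == len(pattern)

def sequence_support(pattern, histories):
    """Fraction of customers whose history contains the pattern."""
    matches = sum(1 for h in histories.values() if contains(h, pattern))
    return matches / len(histories)

pattern = [{"Laptop"}, {"Wireless network card", "Router"}]
print(sequence_support(pattern, histories))  # 1.0
```

Order matters here: the reversed pattern (network card and router first, laptop later) is supported by neither customer.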
Sequence Database: Customer
  Sequence: Purchase history of a given customer
  Element (Transaction): A set of items bought by a customer at time t
  Event (Item): Books, dairy products, CDs, etc.

Sequence Database: Web Data
  Sequence: Browsing activity of a particular Web visitor
  Element (Transaction): A collection of files viewed by a Web visitor after a single mouse click
  Event (Item): Home page, index page, contact info, etc.

Sequence Database: Event data
  Sequence: History of events generated by a given sensor
  Element (Transaction): Events triggered by a sensor at time t
  Event (Item): Types of alarms generated by sensors

Sequence Database: Genome sequences
  Sequence: DNA sequence of a particular species
  Element (Transaction): An element of the DNA sequence
  Event (Item): Bases A, T, G, C
[Figure: a sequence shown as an ordered series of elements (transactions), each containing one or more events (items) such as E1, E2, E3, E4]
Web sequence:
< {Homepage} {Electronics} {Digital Cameras} {Canon Digital Camera} {Shopping Cart} {Order Confirmation} {Return to Shopping} >
Sequence of books checked out at a library:
< {Fellowship of the Ring} {The Two Towers} {Return of the King} >
Market-Basket Analysis:
e.g. Product assortment optimization (see next slide)
Recommendations: Determines which books are frequently purchased together and recommends associated books or products to people who express interest in an item.
Healthcare: Studying the side-effects in patients with multiple prescriptions, we can discover previously unknown interactions and warn patients about them.
Fraud detection: Finding in insurance data that a certain doctor often works with a certain lawyer may indicate potential fraudulent activity. (virtual items)
Sequence Discovery: looks for associations between items bought over time. E.g., we may notice that people who buy chili tend to buy antacid within a month. Knowledge like this can be used to plan inventory levels.
Graphs of expected sales (e.g., derived from association rules) and costs (e.g., of purchasing and holding inventory) can allow us to optimize the number and selection (choice) of items in a product category.
[Figure: Dollars plotted against Products in Category, showing Revenues, Costs, and Margin (Margin = Revenues - Costs), with the Max Profit point marked]
Case Discussion
1. What are the benefits of finding the associated products sold together within the same transaction, or sold together to the same customer? (i.e. use transaction or customer as the unit of analysis)
2. How to perform an item-based Market Basket Analysis or a customer-based Market Basket Analysis, and what are the benefits for each? (i.e. MBA based on data about a specific item, MBA based on data about a specific customer)
3. What are the interesting results from MBA discussed in the case?
4. How to decide promotion items based on MBA?
5. How to evaluate a promotion based on MBA?
6. How does MBA help product bundling?
7. Please brainstorm a promotion plan based on MBA to maximize the net profit of the retailer.
8. How to do targeted promotion over time?
9. Other possible strategies based on MBA?
Exercise
Transaction No.  Item 1  Item 2     Item 3     Item 4
100              Beer    Diaper     Chocolate
101              Milk    Chocolate  Shampoo
102              Beer    Soap       Vodka
103              Beer    Cheese     Wine
104              Milk    Diaper     Beer       Chocolate
Given the above list of transactions, do the following:
1) Find all the frequent itemsets (minimum support 40%)
2) Find all the association rules (minimum confidence 70%)
3) For the discovered association rules, calculate the lift