Association Rules
Market Basket Analysis
Prof. Mingdi Xin
The Paul Merage School of Business
University of California, Irvine
Agenda
• Please work on your projects
– Gather Data
– Refer to project guidelines
• Association rules: Definition and applications
• Generating and identifying good Association Rules
• Mining class association rules
• Sequential pattern mining (optional)
Association Rules
• Techniques that automatically look for associations among
items in the data
– If {set of items} Then {set of items}
• Unsupervised
– No “class” attribute
– Outcome can be one or combination of attributes
• Interesting patterns (previously unknown) may emerge that
can be used within a business
• Assume all data are categorical.
• Initially used for Market Basket Analysis to find how items
purchased by customers are related
Association Rules
Rule format:
  If {set of items} Then {set of items}
     (LHS)              (RHS)
Example: If {Diaper, Baby Food} Then {Beer, Wine}
The LHS implies the RHS.
An association rule is valid if it satisfies some evaluation measures.
Association Rule Discovery: Definition
• Given a set of records each of which contain some number
of items from a given collection;
– Produce dependency rules which will predict occurrence of an
item based on occurrences of other items.
TID | Items
1   | Bread, Coke, Milk
2   | Beer, Bread
3   | Beer, Coke, Diaper, Milk
4   | Beer, Bread, Diaper, Milk
5   | Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
Association Rule Discovery: Application 1
• Marketing and Sales Promotion:
– Let the rule discovered be
{Bagels, … } --> {Potato Chips}
– Potato Chips as consequent => Can be used to determine what should
be done to boost its sales.
– Bagels in the antecedent => Can be used to see which products would
be affected if the store discontinues selling bagels.
– Bagels in antecedent and Potato chips in consequent => Can be used to
see what products should be sold with Bagels to promote sale of
Potato chips.
Association Rule Discovery: Application 2
• Supermarket shelf management.
– Goal: To identify items that are bought together by
sufficiently many customers.
– Approach: Process the point-of-sale data collected with
barcode scanners to find dependencies among items.
– A classic rule:
  • If a customer buys diapers and milk, then he is very likely to buy beer.
• So, don’t be surprised if you find six-packs stacked next to
diapers.
Association Rules: Application beyond Market
Basket Analysis
• Medical (associated symptoms, side effects for patients
with multiple prescriptions)
• Inventory Management:
– Goal: A consumer appliance repair company wants to
anticipate the nature of repairs on its consumer
products and keep the service vehicles equipped with
the right parts, to reduce the number of visits to consumer
households.
– Approach: Process the data on tools and parts required
in previous repairs at different consumer locations and
discover the co-occurrence patterns.
Association Rule Mining
• Given a set of transactions, find rules that will predict the
occurrence of an item based on the occurrences of other items
in the transaction
Market-Basket transactions
Example of Association Rules
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}
Implication means co-occurrence, not causality!
Definition: Frequent Itemset
• Itemset
– A collection of one or more items
• Example: {Milk, Bread, Diaper}
– k-itemset
• An itemset that contains k items
• Support count (σ)
  – Frequency of occurrence of an itemset
  – E.g. σ({Milk, Bread, Diaper}) = 2
• Support
  – Fraction of transactions that contain an itemset
  – E.g. s({Milk, Bread, Diaper}) = 2/5
• Frequent Itemset
  – An itemset whose support is greater than or equal to a minsup threshold
• Rule Evaluation Metrics
  – Support (s)
    • Fraction of transactions that contain both X and Y
      Support = No. of transactions containing items in both LHS and RHS / Total no. of transactions in the dataset
  – Confidence (c)
    • Measures how often items in Y appear in transactions that contain X
      Confidence = No. of transactions containing both LHS and RHS / No. of transactions containing LHS
Example: {Milk, Diaper} → {Beer}
  s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
  c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 ≈ 0.67
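These two measures are easy to compute directly. Below is a minimal Python sketch that evaluates them for {Milk, Diaper} → {Beer}; the transaction list reuses the TID/Items table from the "Association Rule Discovery: Definition" slide above, and the helper name support_count is illustrative, not from the slides:

```python
# Sketch: support and confidence for {Milk, Diaper} -> {Beer}
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def support_count(itemset, transactions):
    """sigma(itemset): number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

lhs, rhs = {"Milk", "Diaper"}, {"Beer"}
support = support_count(lhs | rhs, transactions) / len(transactions)                    # 2/5 = 0.4
confidence = support_count(lhs | rhs, transactions) / support_count(lhs, transactions)  # 2/3 ≈ 0.67
print(support, confidence)
```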
Rule Evaluation - Lift
Transaction No. | Item 1 | Item 2    | Item 3    | Item 4
100             | Beer   | Diaper    | Chocolate |
101             | Milk   | Chocolate | Shampoo   |
102             | Beer   | Milk      | Vodka     | Chocolate
103             | Beer   | Milk      | Diaper    | Chocolate
104             | Milk   | Diaper    | Beer      |
…

What's the support and confidence for the rule {Chocolate} → {Milk}?
Support = 3/5
Confidence = 3/4
Very high support and confidence.
Is Chocolate a good predictor of Milk purchase?
No! Milk occurs in 4 out of 5 transactions, so Chocolate actually decreases the chance of a Milk purchase:
3/4 < 4/5, i.e. P(Milk | Chocolate) < P(Milk)
Lift = (3/4) / (4/5) = 0.9375 < 1
Rule Evaluation – Lift (cont.)
Lift measures how much more likely the RHS is given the LHS than the RHS alone.
  Lift = confidence of the rule / benchmark confidence
  Benchmark confidence = count of RHS / number of transactions in the database
Example: {Diaper} → {Beer}
• Total number of customers in the database: 1000
• No. of customers buying Diaper: 200
• No. of customers buying Beer: 50
• No. of customers buying Diaper & Beer: 20
• Benchmark confidence of Beer = 50/1000 (5%)
• Confidence = 20/200 (10%)
• Lift = 10% / 5% = 2
Higher lift indicates a better rule.
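The same arithmetic as a short sketch (variable names are illustrative; the counts are the ones listed on this slide):

```python
# Sketch: lift of {Diaper} -> {Beer} from raw counts
n_total, n_diaper, n_beer, n_both = 1000, 200, 50, 20

confidence = n_both / n_diaper    # 20/200  = 0.10
benchmark = n_beer / n_total      # 50/1000 = 0.05  (support of the RHS alone)
lift = confidence / benchmark     # 0.10 / 0.05 = 2.0; lift > 1 means the LHS raises the chance of the RHS
print(lift)
```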
Exercise 1
Compute the support for subsets {a}, {b, d}, and {a, b, d} by treating each transaction ID as a market basket.
Exercise 2
Use the results in the previous problem to compute the confidence for the association rules {b, d} → {a} and {a} → {b, d}. State what these values mean in plain English.
Exercise 3
Compute the support for itemsets {a}, {b, d}, and {a, b, d} by treating each customer ID as a market basket.
Exercise 4
Use the results in the previous problem to compute the confidence for the association rules {b, d} → {a} and {a} → {b, d}.
Mining Association Rules
Example of Rules:
{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)
Observations:
• All the above rules are binary partitions of the same itemset:
{Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but
can have different confidence
• Thus, we may decouple the support and confidence requirements
Association Rule Mining Task
• Given a set of transactions T, the goal of
association rule mining is to find all rules having
– support ≥ minsup threshold
– confidence ≥ minconf threshold
• Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf thresholds
⇒ Computationally prohibitive!
Frequent Itemset Generation
Optional
[Itemset lattice diagram: all itemsets over items A–E, from the null set up to ABCDE]
Given d items, there are 2^d possible candidate itemsets.
Frequent Itemset Generation
Optional
• Brute-force approach:
  – Each itemset in the lattice is a candidate frequent itemset
  – Count the support of each candidate by scanning the database
  – Match each transaction against every candidate
Computational Complexity
Optional
• Given d unique items:
  – Total number of itemsets = 2^d
  – Total number of possible association rules:
      R = Σ(k=1..d−1) [ C(d, k) × Σ(j=1..d−k) C(d−k, j) ] = 3^d − 2^(d+1) + 1
  – If d = 6, R = 602 rules
Identifying Association Rules
• Two-step approach:
  1. Frequent Itemset Generation
     – Generate all itemsets whose support ≥ minsup
  2. Rule Generation
     – Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
• Frequent itemset generation is still computationally expensive
Phase 1: Finding all frequent itemsets
How can we perform an efficient search for all frequent itemsets?
If {diaper, beer} is frequent, then {diaper} and {beer} are each frequent as well.
This means that:
• If an itemset is not frequent (e.g., {wine}), then no itemset that includes wine can be frequent either, such as {wine, beer}.
• We therefore first find all itemsets of size 1 that are frequent, then try to "expand" these by counting the frequency of all itemsets of size 2 that include frequent itemsets of size 1.
• Example: if {wine} is not frequent, we need not check whether {wine, beer} is frequent. But if both {wine} and {beer} are frequent, then it is possible (though not guaranteed) that {wine, beer} is also frequent.
• Then take only the itemsets of size 2 that are frequent, try to expand those, and so on.
Algorithm Apriori
• Iterative algorithm: find all 1-item frequent itemsets, then all 2-item frequent itemsets, and so on.
  – In each iteration k, only consider itemsets that contain some frequent (k−1)-itemset.
• Find frequent itemsets of size 1: F1
• For k = 2, 3, …
  – Ck = candidates of size k: those itemsets of size k that could be frequent, given Fk-1
  – Fk = those candidates that are actually frequent, Fk ⊆ Ck (requires one scan of the database).
Details: ordering of items
• The items in I are sorted in lexicographic order
(which is a total order).
• The order is used throughout the algorithm in each
itemset.
• {w[1], w[2], …, w[k]} represents a k-itemset w
consisting of items w[1], w[2], …, w[k], where
w[1] < w[2] < … < w[k] according to the total
order.
Details: the algorithm
Algorithm Apriori(T)
  C1 ← init-pass(T);
  F1 ← {f | f ∈ C1, f.count/n ≥ minsup};   // n: no. of transactions in T
  for (k = 2; Fk-1 ≠ ∅; k++) do
    Ck ← candidate-gen(Fk-1);
    for each transaction t ∈ T do
      for each candidate c ∈ Ck do
        if c is contained in t then
          c.count++;
      end
    end
    Fk ← {c ∈ Ck | c.count/n ≥ minsup}
  end
  return F ← ∪k Fk;
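The pseudocode above translates fairly directly into Python. The sketch below is illustrative rather than a faithful implementation: the names apriori and candidate_gen are made up here, and the join is done as a simple pairwise union instead of the lexicographic join described on the next slides (after the prune step, both produce the same candidate set):

```python
from itertools import combinations

def candidate_gen(F_prev, k):
    """Join + prune: build size-k candidates from the frequent (k-1)-itemsets F_prev."""
    F_prev = set(map(frozenset, F_prev))
    candidates = set()
    for f1 in F_prev:
        for f2 in F_prev:
            union = f1 | f2
            if len(union) == k:  # join: f1 and f2 differ in exactly one item
                # prune: every (k-1)-subset of the candidate must itself be frequent
                if all(frozenset(s) in F_prev for s in combinations(union, k - 1)):
                    candidates.add(union)
    return candidates

def apriori(transactions, minsup):
    """Level-wise frequent itemset mining, mirroring the pseudocode above."""
    n = len(transactions)
    items = {frozenset([i]) for t in transactions for i in t}
    F = {c for c in items if sum(1 for t in transactions if c <= t) / n >= minsup}  # F1
    all_frequent = set(F)
    k = 2
    while F:
        Ck = candidate_gen(F, k)
        counts = {c: sum(1 for t in transactions if c <= t) for c in Ck}  # one pass over T
        F = {c for c, cnt in counts.items() if cnt / n >= minsup}         # Fk
        all_frequent |= F
        k += 1
    return all_frequent
```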
Apriori candidate generation
• The candidate-gen function takes Fk-1 and returns a
superset (called the candidates Ck) of the set of all
frequent k-itemsets. It has two steps
– join step: Generate all possible candidate itemsets Ck of
length k
– prune step: Remove those candidates in Ck that cannot
be frequent.
Candidate-gen function
Function candidate-gen(Fk-1)
  Ck ← ∅;
  forall f1, f2 ∈ Fk-1
      with f1 = {i1, …, ik-2, ik-1}
      and  f2 = {i1, …, ik-2, i'k-1}
      and  ik-1 < i'k-1 do
    c ← {i1, …, ik-1, i'k-1};            // join f1 and f2
    Ck ← Ck ∪ {c};
    for each (k-1)-subset s of c do
      if (s ∉ Fk-1) then
        delete c from Ck;                // prune
    end
  end
  return Ck;
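As a concrete illustration (this example is not on the slides): if F3 = {{1,2,3}, {1,2,4}, {1,3,4}, {1,3,5}, {2,3,4}}, the join step yields {1,2,3,4} and {1,3,4,5}, and the prune step removes {1,3,4,5} because its subset {1,4,5} is not in F3. The candidate_gen sketch above reaches the same final candidate set:

```python
F3 = [{1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {1, 3, 5}, {2, 3, 4}]
print(candidate_gen(F3, 4))   # expected: {frozenset({1, 2, 3, 4})}
```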
Example – Finding frequent itemsets
Dataset T (minsup = 0.5):
TID  | Items
T100 | 1, 3, 4
T200 | 2, 3, 5
T300 | 1, 2, 3, 5
T400 | 2, 5

(itemset:count)
1. scan T → C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
   → F1: {1}:2, {2}:3, {3}:3, {5}:3
   → C2: {1,2}, {1,3}, {1,5}, {2,3}, {2,5}, {3,5}
2. scan T → C2: {1,2}:1, {1,3}:2, {1,5}:1, {2,3}:2, {2,5}:3, {3,5}:2
   → F2: {1,3}:2, {2,3}:2, {2,5}:3, {3,5}:2
   → C3: {2,3,5}
3. scan T → C3: {2,3,5}:2 → F3: {2,3,5}
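Running the apriori sketch from the earlier slide on this dataset (a usage illustration, not part of the handout) reproduces the same frequent itemsets:

```python
T = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
for itemset in sorted(apriori(T, minsup=0.5), key=len):
    print(set(itemset))
# F1: {1} {2} {3} {5}; F2: {1,3} {2,3} {2,5} {3,5}; F3: {2,3,5}
```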
Agrawal (94)'s Apriori Algorithm — An Example
An itemset must have been purchased at least twice in order to be considered frequent (supported), i.e., minimum support count = 2.

Transactions:
T-ID | Items
10   | A, C, D
20   | B, C, E
30   | A, B, C, E
40   | B, E

1st scan → C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1 (frequent 1-itemsets): {A}:2, {B}:3, {C}:3, {E}:3

C2 (candidates from L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan → C2 counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

C3: {B,C,E} — why not {A,B,C} or {A,C,E}? They are pruned because {A,B} and {A,E} are not frequent.
3rd scan → C3 count: {B,C,E}:2
L3: {B,C,E}:2
Phase 2: Generating Association Rules
Assume {Milk, Bread, Butter} is a frequent itemset.
• Using the items contained in the itemset, list all possible rules:
  – {Milk} → {Bread, Butter}
  – {Bread} → {Milk, Butter}
  – {Butter} → {Milk, Bread}
  – {Milk, Bread} → {Butter}
  – {Milk, Butter} → {Bread}
  – {Bread, Butter} → {Milk}
• Calculate the confidence of each rule
• Pick the rules with confidence above the minimum confidence

Confidence of {Milk} → {Bread, Butter}
  = No. of transactions that support {Milk, Bread, Butter} / No. of transactions that support {Milk}
  = Support{Milk, Bread, Butter} / Support{Milk}
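This phase can also be sketched in a few lines of Python. The function name rules_from_itemset is illustrative, and because the handout gives no transactions for {Milk, Bread, Butter}, the example call reuses the five-transaction table from the earlier slides with the frequent itemset {Milk, Diaper, Beer}:

```python
from itertools import combinations

def rules_from_itemset(itemset, transactions, minconf):
    """List every LHS -> RHS split of a frequent itemset and keep the high-confidence rules."""
    def count(s):
        return sum(1 for t in transactions if s <= t)

    itemset = frozenset(itemset)
    rules = []
    for r in range(1, len(itemset)):                 # every non-empty proper subset is a possible LHS
        for lhs in map(frozenset, combinations(itemset, r)):
            rhs = itemset - lhs
            conf = count(itemset) / count(lhs)       # support(itemset) / support(LHS)
            if conf >= minconf:
                rules.append((set(lhs), set(rhs), conf))
    return rules

transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]
for lhs, rhs, conf in rules_from_itemset({"Milk", "Diaper", "Beer"}, transactions, minconf=0.7):
    print(lhs, "->", rhs, round(conf, 2))
# keeps {Milk, Beer} -> {Diaper} and {Diaper, Beer} -> {Milk}, each with confidence 1.0 in this data
```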
Exercise 5
Transaction No. | Item 1 | Item 2    | Item 3    | Item 4
100             | Beer   | Diaper    | Chocolate |
101             | Milk   | Chocolate | Shampoo   |
102             | Beer   | Soap      | Vodka     |
103             | Beer   | Cheese    | Wine      |
104             | Milk   | Diaper    | Beer      | Chocolate
Given the above list of transactions, do the following:
1) Find all the frequent itemsets (minimum support 40%)
2) Find all the association rules (minimum confidence 70%)
3) For the discovered association rules, calculate the lift
What is the objective of Apriori?
A: Identify good association rules
B: Identify all frequent itemsets
C: Identify all rules from frequent itemsets
D: Determine the computational complexity of
finding association rules
E: None of the above
Given the table below, the confidence of the rule
Bread → Coke is
A: 1/4
B: 2/4
C: 3/4
D: 4/4
E: None of the above
Problems with the association mining
• Single minsup: It assumes that all items in the data
are of the same nature and/or have similar
frequencies.
• Not true: In many applications, some items appear
very frequently in the data, while others rarely
appear.
E.g., in a supermarket, people buy food processors and
cooking pans much less frequently than they buy bread
and milk.
Rare Item Problem
• If the frequencies of items vary a great deal, we will
encounter two problems
– If minsup is set too high, those rules that involve rare items
will not be found.
– To find rules that involve both frequent and rare items,
minsup has to be set very low. This may cause
combinatorial explosion because those frequent items will
be associated with one another in all possible ways.
Multiple minsups model
• The minimum support of a rule is expressed in terms of
minimum item supports (MIS) of the items that appear
in the rule.
• Each item can have a minimum item support.
• By providing different MIS values for different items,
the user effectively expresses different support
requirements for different rules.
Minsup of a rule
• Let MIS(i) be the MIS value of item i. The minsup of a
rule R is the lowest MIS value of the items in the rule.
• I.e., a rule R: a1, a2, …, ak → ak+1, …, ar satisfies its
minimum support if its actual support is ≥
min(MIS(a1), MIS(a2), …, MIS(ar)).
An Example
• Consider the following items: bread, shoes, clothes.
  The user-specified MIS values are as follows:
    MIS(bread) = 2%, MIS(shoes) = 0.1%, MIS(clothes) = 0.2%
  The following rule doesn't satisfy its minsup:
    clothes → bread [sup = 0.15%, conf = 70%]
  The following rule satisfies its minsup:
    clothes → shoes [sup = 0.15%, conf = 70%]
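A small sketch of this check, using the MIS values from the example above (the function name and dictionary layout are illustrative):

```python
def satisfies_minsup(rule_items, actual_support, mis):
    """A rule meets its minsup if its support >= the lowest MIS among its items."""
    return actual_support >= min(mis[item] for item in rule_items)

mis = {"bread": 0.02, "shoes": 0.001, "clothes": 0.002}

print(satisfies_minsup({"clothes", "bread"}, 0.0015, mis))   # False: 0.15% < min(0.2%, 2%) = 0.2%
print(satisfies_minsup({"clothes", "shoes"}, 0.0015, mis))   # True:  0.15% >= min(0.2%, 0.1%) = 0.1%
```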
Agenda
• Please work on your projects
– Gather Data
– Refer to project guidelines
• Association rules: Definition and applications
• Generating and identifying good Association Rules
• Mining class association rules
• Sequential pattern mining (optional)
Mining class association rules (CAR)
• Normal association rule mining does not have any
target.
• It finds all possible rules that exist in data, i.e., any
item can appear as a consequent or a condition of a
rule.
• However, in some applications, the user is interested
in some targets.
– E.g., the user has a set of text documents from some known
topics. He/she wants to find out what words are associated
or correlated with each topic.
Problem definition
• Let T be a transaction data set consisting of n
transactions.
• Each transaction is also labeled with a class y.
• Let I be the set of all items in T, Y be the set of all class
labels, and I ∩ Y = ∅.
• A class association rule (CAR) is an implication of the form
  X → y, where X ⊆ I and y ∈ Y.
• The definitions of support and confidence are the same
as those for normal association rules.
An example
• A text document data set:
  doc 1: Student, Teach, School         : Education
  doc 2: Student, School                : Education
  doc 3: Teach, School, City, Game      : Education
  doc 4: Baseball, Basketball           : Sport
  doc 5: Basketball, Player, Spectator  : Sport
  doc 6: Baseball, Coach, Game, Team    : Sport
  doc 7: Basketball, Team, City, Game   : Sport
• Let minsup = 20% and minconf = 60%. The following are two
examples of class association rules:
  Student, School → Education  [sup = 2/7, conf = 2/2]
  Game → Sport                 [sup = 2/7, conf = 2/3]
Mining algorithm
• Unlike normal association rules, CARs can be mined
directly in one step.
• The key operation is to find all ruleitems that have
support above minsup. A ruleitem is of the form:
(condset, y)
where condset is a set of items from I (i.e., condset ⊆ I),
and y ∈ Y is a class label.
• Each ruleitem basically represents a rule: condset → y
• The Apriori algorithm can be modified to generate CARs
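A minimal sketch of the ruleitem-counting idea for single-item condsets, using the labeled documents from the example slide above (names and structure are illustrative; a full CAR miner would also grow larger condsets level-wise, as Apriori does):

```python
from collections import Counter

# labeled "transactions": (set of words, class label), taken from the example slide
docs = [
    ({"Student", "Teach", "School"}, "Education"),
    ({"Student", "School"}, "Education"),
    ({"Teach", "School", "City", "Game"}, "Education"),
    ({"Baseball", "Basketball"}, "Sport"),
    ({"Basketball", "Player", "Spectator"}, "Sport"),
    ({"Baseball", "Coach", "Game", "Team"}, "Sport"),
    ({"Basketball", "Team", "City", "Game"}, "Sport"),
]
n, minsup, minconf = len(docs), 0.20, 0.60

ruleitem_count, condset_count = Counter(), Counter()
for words, y in docs:
    for w in words:
        ruleitem_count[(w, y)] += 1   # count of the ruleitem ({w}, y)
        condset_count[w] += 1         # count of the condset {w}

for (w, y), cnt in ruleitem_count.items():
    sup, conf = cnt / n, cnt / condset_count[w]
    if sup >= minsup and conf >= minconf:
        print(f"{{{w}}} -> {y}  [sup={sup:.2f}, conf={conf:.2f}]")
# includes, e.g., {School} -> Education [sup=0.43, conf=1.00] and {Game} -> Sport [sup=0.29, conf=0.67]
```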
Next Week
• Guest Lecture
Doug Herman
Executive Director of Global Data Science & Analytics
Ingram Micro
• Logistic Regression
WEKA
• Find association rules
– Apriori
• http://facweb.cs.depaul.edu/mobasher/classes/ect584/WEKA/associate.html
Agenda
• Please work on your projects
– Gather Data
– Refer to project guidelines
• Association rules: Definition and applications
• Generating and identifying good Association Rules
• Mining class association rules
• Sequential pattern mining (optional)
Sequential pattern mining
Informational
NOT required
• Association rule mining does not consider the
order of transactions.
• In many applications such orderings are
significant. E.g.,
– In market basket analysis, it is interesting to know whether
people buy some items in sequence,
  • e.g., buying a bed first and then bed sheets some time later.
– In Web usage mining, it is useful to find the navigational
patterns of users on a Web site from their sequences of page
visits.
Objective
Informational
NOT required
• Given a set S of input data sequences, the problem
of mining sequential patterns is to find all the
sequences that have a user-specified minimum
support.
• Each such sequence is called a frequent
sequence, or a sequential pattern.
• The support for a sequence is the fraction of total
data sequences in S that contain this sequence.
• Algorithm: Srikant, R., and R. Agrawal. "Mining sequential patterns:
Generalizations and performance improvements." International Conference on
Extending Database Technology. Springer, Berlin, Heidelberg, 1996.
Example
Informational
NOT required
Example (cont.)
Informational
NOT required
Weka Memory
• To increase Java virtual memory for WEKA
– On your computer go to “Start”
– In “Search Program and Files” type: “cmd”
– Use "cd …" to change directory to the Weka folder (under
Program Files)
– At the command prompt, type:
  java -Xmx512m -jar weka.jar