COMP5318 Knowledge Discovery
and Data Mining
Week 8:
Mining Association Rules
Reference: TSK pp.328-353, 363-370
Dunham 125-142
Outline
• What is Association Rule Mining?
• Basic concepts
• Item, Itemset, Transaction, Support, Confidence, …
• Association rule problem definition
• Apriori principle
• What, why, how
• Apriori algorithm
• FP-growth algorithm
• Discussion
What is Association Rule Mining?
• Association rule mining finds
• combinations of items that typically occur
together in a database (market-basket analysis)
• Sequences of items that occur frequently
(sequential analysis) in a database
• Originally introduced for market-basket analysis: useful for analysing the purchasing behaviour of customers.
Market-Basket Analysis – Examples
• Where should strawberries be placed to maximize their sales?
• Services purchased together by telecommunication customers (e.g. broadband Internet, call forwarding, etc.) help determine how to bundle these services to maximize revenue
• Unusual combinations of insurance claims can be a sign of fraud
• Medical histories can give indications of complications based on combinations of treatments
• Sport: analyzing game statistics (shots blocked, assists, and fouls) to gain competitive advantage
  • “When player X is on the floor, player Y’s shot accuracy decreases from 75% to 30%”
  • Bhandari et al. (1997). Advanced Scout: data mining and knowledge discovery in NBA data, Data Mining and Knowledge Discovery, 1(1), pp. 121-125
Basic Concepts
• Set of items: I = {i1, i2, …, im}
• Set of transactions: T = {t1, t2, …, tn}
• Each transaction tj (1 ≤ j ≤ n) is a subset of I
• Example:
  • 5 transactions: T = {t1, t2, …, t5}
  • 5 items: I = {Bread, Jelly, PeanutButter, Milk, Beer}
• Itemset: a collection (set) of 1 or more items
• If an itemset contains k items, it is called a k-itemset
  e.g. {Jelly, Milk, Bread} is a 3-itemset
[Table: example dataset, transactions t1–t5 over the items above]
Basic Concepts
• Searching for rules of the form X → Y, where X and Y are itemsets, e.g.
  • {Bread} → {Jelly}
  • {Bread, Jelly} → {PeanutButter}
• Formally:
  Given I = {i1, i2, …, im} and T = {t1, t2, …, tn}, an association rule is an implication of the form X → Y, where X, Y ⊆ I and X ∩ Y = ∅ (i.e. X and Y are disjoint itemsets)
• Association rules have 2 important properties
  • Support
  • Confidence
  These measure how “interesting” the rule is.
Support of an Itemset
• The support of an itemset X is the number (or percentage) of
transactions containing that itemset.
• Example:
Question: What is support({Bread, PeanutButter})?
Answer: 3 (or 3/5 = 60%)
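The definition above can be sketched in a few lines of Python. The transaction contents below are an assumption: the slide's dataset table is not reproduced in these notes, so a hypothetical 5-transaction dataset was chosen that agrees with the answer quoted above. The function names `support_count` and `support` are illustrative only.

```python
# Hypothetical toy dataset (an assumption, not the slide's own table):
# 5 transactions over the 5 items named on the slides.
TRANSACTIONS = [
    {"Bread", "Jelly", "PeanutButter"},
    {"Bread", "PeanutButter"},
    {"Bread", "Milk", "PeanutButter"},
    {"Beer", "Bread"},
    {"Beer", "Milk"},
]


def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def support(itemset, transactions):
    """Support expressed as a fraction of all transactions."""
    return support_count(itemset, transactions) / len(transactions)


print(support_count({"Bread", "PeanutButter"}, TRANSACTIONS))  # 3
print(support({"Bread", "PeanutButter"}, TRANSACTIONS))        # 0.6
```

Both forms (absolute count and percentage) appear on the slides, so the sketch exposes both.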
Support and Confidence of an Association Rule
• Support of an association rule X → Y is the number (or percentage) of transactions that contain X ∪ Y
  • support(X → Y) = support(X ∪ Y)
  • Measures how often the rule occurs in the dataset
  • Low support: “uninteresting rule; occurs by chance”
• Confidence of an association rule X → Y is the number of transactions that contain X ∪ Y divided by the number of transactions that contain X
  • confidence(X → Y) = support(X → Y) / support(X)
  • Measures the reliability (strength) of the rule
  • Confidence can be seen as an estimate of P(Y|X)
  • Intuition: Question: given X, what is the most likely Y?
    Answer: the Y for which P(Y|X) is highest.
Support and Confidence - Example
• What are the support and confidence of the following rules?
  • {Beer} → {Bread}
  • {Bread, PeanutButter} → {Jelly}
support(X → Y) = support(X ∪ Y)
confidence(X → Y) = support(X → Y) / support(X)
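As a rough illustration of the two formulas, the sketch below computes rule support and confidence from transaction counts. The dataset is hypothetical (the slide's table is not reproduced in these notes), so the printed values are answers for this assumed data only; `rule_support` and `confidence` are illustrative names.

```python
# Hypothetical toy dataset (an assumption made for illustration).
TRANSACTIONS = [
    {"Bread", "Jelly", "PeanutButter"},
    {"Bread", "PeanutButter"},
    {"Bread", "Milk", "PeanutButter"},
    {"Beer", "Bread"},
    {"Beer", "Milk"},
]


def count(itemset, transactions):
    """Number of transactions containing all items of `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def rule_support(x, y, transactions):
    """support(X -> Y) = support(X u Y), as a fraction of all transactions."""
    return count(set(x) | set(y), transactions) / len(transactions)


def confidence(x, y, transactions):
    """confidence(X -> Y) = support(X u Y) / support(X), computed from counts."""
    return count(set(x) | set(y), transactions) / count(x, transactions)


for x, y in [({"Beer"}, {"Bread"}), ({"Bread", "PeanutButter"}, {"Jelly"})]:
    print(sorted(x), "->", sorted(y),
          "support:", rule_support(x, y, TRANSACTIONS),
          "confidence:", round(confidence(x, y, TRANSACTIONS), 3))
```

Working from counts rather than percentages avoids dividing two rounded fractions.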
Association Rule Mining Problem Definition
Given a set of transactions T = {t1, t2, …, tn} and 2 thresholds, minsup and minconf:
• Find all association rules X → Y with support ≥ minsup and confidence ≥ minconf
  • i.e. we want rules with high support and confidence
  • We call these rules interesting
• We would like to
  • Design an efficient algorithm for mining association rules in large data sets
  • Develop an effective approach for distinguishing interesting rules from spurious ones
Generating Association Rules – Approach 1
(Naïve)
• Enumerate all possible rules and select those of them that
satisfy the minimum support and confidence thresholds
• Not practical for large databases
• For a given dataset with m items, the total number of possible rules is 3^m − 2^(m+1) + 1 (Why?*)
• And most of these will be discarded!
• We need a strategy for rule generation -- generate only the
promising rules
• rules that are likely to be interesting, or, more accurately, don’t
generate rules that can’t be interesting.
*hint: use inclusion-exclusion principle
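One way to sanity-check the count is to enumerate every rule for small m and compare with the closed form. The sketch below assumes that a rule is any ordered pair (X, Y) of disjoint, non-empty itemsets; the function names are illustrative.

```python
from itertools import product


def count_rules_brute_force(m):
    """Count ordered pairs (X, Y) of disjoint, non-empty itemsets over m items."""
    count = 0
    # Each item is placed in X (0), in Y (1), or in neither (2).
    for assignment in product(range(3), repeat=m):
        if 0 in assignment and 1 in assignment:  # X and Y both non-empty
            count += 1
    return count


def count_rules_formula(m):
    # 3^m assignments in total, minus those with X empty (2^m),
    # minus those with Y empty (2^m), plus the one counted twice (both empty).
    return 3**m - 2**(m + 1) + 1


for m in range(1, 7):
    print(m, count_rules_brute_force(m), count_rules_formula(m))
```

For the 5-item example both counts are 180, which is already large compared with the handful of rules that turn out to be interesting.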
Generating Association Rules – Approach 2
• What do these rules have in common?
  A,B → C
  A,C → B
  B,C → A
  Answer: they have the same support: support({A,B,C})
• The support of a rule X → Y depends only on the support of its itemset X ∪ Y
• Hence, a better approach: find frequent itemsets first, then generate the rules
• A frequent itemset is an itemset whose support is at least minsup
• If an itemset is infrequent, all the rules that contain it will have support < minsup and there is no need to generate them
Generating Association Rules – Approach 2
• 2-step approach:
  Step 1: Generate frequent itemsets (i.e. support ≥ minsup): Frequent Itemset Mining
  • e.g. {A,B,C} is frequent (so A,B → C, A,C → B and B,C → A satisfy the minsup threshold).
  Step 2: From them, extract rules that satisfy the confidence threshold (i.e. confidence ≥ minconf)
  • e.g. maybe only A,B → C and C,B → A are confident
• Step 1 is the computationally difficult part (the next slides explain why, and a way to reduce the complexity)
Frequent Itemset Generation (Step 1)
– Brute-Force Approach
• Enumerate all possible itemsets and scan the dataset to calculate the support for each of them
• Example: I = {a,b,c,d,e}
[Figure: itemset lattice for I, i.e. the search space showing superset/subset relationships]
• Given d items, there are 2^d − 1 possible (non-empty) candidate itemsets => not practical for large d
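The brute-force approach can be sketched as follows. The dataset is a hypothetical 5-transaction example (an assumption, since the slide's table is not reproduced here), and `brute_force_frequent_itemsets` is an illustrative name. The point of the sketch is the number of candidates tested, not efficiency.

```python
from itertools import combinations

# Hypothetical toy dataset (an assumption made for illustration).
TRANSACTIONS = [
    {"Bread", "Jelly", "PeanutButter"},
    {"Bread", "PeanutButter"},
    {"Bread", "Milk", "PeanutButter"},
    {"Beer", "Bread"},
    {"Beer", "Milk"},
]


def brute_force_frequent_itemsets(transactions, minsup):
    """Test every non-empty itemset; return the frequent ones and the number tested."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    tested = 0
    frequent = {}
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            tested += 1
            # One pass over the data per candidate.
            cnt = sum(1 for t in transactions if set(candidate) <= t)
            if cnt / n >= minsup:
                frequent[frozenset(candidate)] = cnt
    return frequent, tested


frequent, tested = brute_force_frequent_itemsets(TRANSACTIONS, minsup=0.3)
print(tested)         # 31 candidates, i.e. 2^5 - 1
print(len(frequent))  # 5 frequent itemsets for this assumed data
```

Only 5 of the 31 candidates survive here, which is the waste that support-based pruning is meant to avoid.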
Frequent Itemset Generation (Step 1)
-- Apriori Principle (1)
A subset of any frequent itemset is also frequent
Example: If {c,d,e} is frequent then {c,d}, {c,e}, {d,e}, {c}, {d} and {e} are also frequent
Frequent Itemset Generation (Step 1)
-- Apriori Principle (2)
If an itemset is not frequent, a superset of it is also not frequent
Example: If we know that {a,b} is infrequent, the entire sub-graph can be pruned, i.e. {a,b,c}, {a,b,d}, {a,b,e}, {a,b,c,d}, {a,b,c,e}, {a,b,d,e} and {a,b,c,d,e} are infrequent
Frequent Itemset Generation (Step 1)
-- Apriori Principle (3)
That is:
If an itemset is frequent then all its subsets are frequent
Equivalently (more useful):
If an itemset is not frequent, then none of its supersets is frequent
“Support is anti-monotonic”: the support of an itemset never increases as we add items to it.
Use this to prune the search space: “support-based pruning”.
Recall the 2 Step process for
Association Rule Mining
Step 1: Find all frequent Itemsets
So far: main ideas and concepts (Apriori
principle).
Later: algorithms
Step 2: Generate the association rules from
the frequent itemsets.
ARGen Algorithm (Step 2)
• Generates interesting rules from the frequent itemsets
• Already know the rules are frequent (Why?), just need to
check confidence.
ARGen algorithm
for each frequent itemset F
    generate all non-empty proper subsets S of F
    for each s in S do
        if confidence(s → F−s) ≥ minConf
            then output rule s → F−s
    end
end
Example:
F = {a,b,c}
S = {{a,b}, {a,c}, {b,c}, {a}, {b}, {c}}
rules output: {a,b} → {c}, etc.
ARGen - Example
• minsup = 30%, minconf = 50%
• The set of frequent itemsets: L = {{Beer}, {Bread}, {Milk}, {PeanutButter}, {Bread, PeanutButter}}
• Only the last itemset in L can be split into 2 non-empty subsets (Bread and PeanutButter), so it is the only one that can yield rules.
  confidence(Bread → PeanutButter) = support({Bread, PeanutButter}) / support({Bread}) = 60/80 = 0.75 > minconf
  confidence(PeanutButter → Bread) = support({Bread, PeanutButter}) / support({PeanutButter}) = 60/60 = 1 > minconf
• => 2 rules will be generated
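The ARGen procedure and the example above can be sketched in Python as follows. The support values (in %) are the ones quoted in these slides. The function name `argen` and the dictionary layout are assumptions of this sketch, which relies on every subset of a frequent itemset being present in the dictionary (Apriori principle).

```python
from itertools import combinations


def argen(frequent, minconf):
    """frequent: dict mapping frozenset -> support, containing every frequent itemset.
    Returns a list of (antecedent, consequent, confidence) tuples."""
    rules = []
    for itemset, sup in frequent.items():
        # All non-empty proper subsets of the itemset are possible antecedents.
        for r in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), r):
                lhs = frozenset(lhs)
                conf = sup / frequent[lhs]  # support(F) / support(s)
                if conf >= minconf:
                    rules.append((lhs, itemset - lhs, conf))
    return rules


# Supports in percent, as quoted on the slides.
FREQUENT = {
    frozenset({"Beer"}): 40,
    frozenset({"Bread"}): 80,
    frozenset({"Milk"}): 40,
    frozenset({"PeanutButter"}): 60,
    frozenset({"Bread", "PeanutButter"}): 60,
}

rules = argen(FREQUENT, minconf=0.5)
for lhs, rhs, conf in rules:
    print(sorted(lhs), "->", sorted(rhs), conf)
```

Single-item itemsets produce no rules because they have no non-empty proper subset, so the inner loops simply do not run for them.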
Summary so far
• Concepts (item, itemset, transaction, support, confidence, association rules)
• 2-step process for association rule mining
  • Step 1: Frequent itemset mining
    • The most computationally difficult step in association rule mining.
    • Apriori principle: support is anti-monotonic.
  • Step 2: Extract rules from frequent itemsets (ARGen).
What’s next?
• Algorithms for finding frequent itemsets (i.e. Step 1)
  • Apriori algorithm
  • FP-Growth algorithm
Apriori Algorithm
Frequent Itemset Generation
– Apriori Algorithm
1. Generate candidate itemsets with size = 1 (all items)
2. Scan the database to see which of them are frequent (database scan step)
3. Use only the frequent itemsets to generate the set of candidates with size = size + 1 (candidate generation step: AprioriGen)
4. If candidates were generated, go to 2.
5. Stop. All frequent itemsets found.
Recall that we would then generate the association rules from these (ARGen).
Apriori Algorithm
I = {Beer, Bread, Jelly, PeanutButter, Milk}
T = {t1, t2, …, t5}
• Let minSup = 30%
• Level = itemset size

Level 1
  Candidate itemsets: {Beer} (40%), {Bread} (80%), {Jelly} (20%), {Milk} (40%), {PeanutButter} (60%)
  Frequent itemsets: {Beer}, {Bread}, {Milk}, {PeanutButter}
Level 2
  Candidate itemsets*: {Beer, Bread} (20%), {Beer, Milk} (20%), {Beer, PeanutButter} (0%), {Bread, Milk} (20%), {Bread, PeanutButter} (60%), {Milk, PeanutButter} (20%)
  Frequent itemsets: {Bread, PeanutButter}
Why not {Jelly, …}?
* AprioriGen – later…
Note the benefits of using Apriori Principle for
Candidate Generation
• For our simple example (5 items)
  • Brute-force approach: generate all candidate itemsets of a given size
    C(5,1) + C(5,2) + C(5,3) = 5 + 10 + 10 = 25
  • Apriori: generate candidate itemsets from frequent itemsets
    C(5,1) + C(4,2) = 5 + (4×3)/2 = 5 + 6 = 11
• => Apriori is much more efficient
• Let’s look into this in more detail…
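The two candidate counts can be checked quickly with Python's `math.comb`. This is only an arithmetic check of the binomial coefficients on the slide; the variable names are illustrative.

```python
from math import comb

# Brute force: all itemsets of sizes 1, 2 and 3 over the 5 items.
brute_force_candidates = comb(5, 1) + comb(5, 2) + comb(5, 3)

# Apriori: 5 candidate 1-itemsets, then pairs built only from the
# 4 frequent 1-itemsets; no 3-itemset candidates are generated.
apriori_candidates = comb(5, 1) + comb(4, 2)

print(brute_force_candidates)  # 25
print(apriori_candidates)      # 11
```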
Apriori-Gen (Candidate Generation)
• Apriori-Gen = the algorithm for generating candidate itemsets of size k from frequent itemsets of size k-1
• Initially (k = 1), all itemsets of size 1 are considered as candidate itemsets
• From k = 2 onwards: different strategies
  • Brute force (for comparison only, not part of Apriori)
  • F_{k-1} × F_1
  • F_{k-1} × F_{k-1}
Brute-force Approach for Candidate Generation
• Not part of Apriori
• Generate all possible combinations of size k (from the original items) and then prune the infrequent ones => will generate C(d,k) candidate itemsets at level k, where d is the total number of items
• Pruning so many itemsets is expensive
• Total cost of generation and pruning: O(d·2^(d−1))
• In addition, we would already know that most of these candidates are not frequent
Apriori-Gen – F_{k-1} × F_1
• Extend each frequent (k-1)-itemset with a frequent 1-itemset
• Will generate all frequent itemsets of size k, as each frequent k-itemset consists of a frequent (k-1)-itemset and a frequent 1-itemset => the procedure is complete
• Less computationally expensive than brute force
• However, may generate the same candidate itemset more than once
  • Solution: keep the frequent itemsets in lexicographic order and extend a (k-1)-itemset only with lexicographically larger items, e.g. {Bread, Diapers} can be extended with {Milk} but {Diapers, Milk} cannot be extended with {Bread}
• Some of the generated candidates are already known to be infrequent (Apriori principle)
Apriori-Gen – F_{k-1} × F_1 (cont.)
• Although an improvement, it may still produce unnecessary candidate itemsets
• E.g. merging {Beer, Diapers} with {Milk} is not necessary as one of the subsets ({Beer, Milk}) is infrequent
  • this can be checked at candidate generation time and the itemset discarded, or
  • the itemset can be generated and then pruned
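A minimal sketch of this strategy, with the lexicographic restriction and a separate pruning step, might look as follows. The frequent sets `F1` and `F2` are assumed for illustration (chosen so that {Beer, Milk} is infrequent, as in the example above), and the function names are hypothetical. Itemsets are represented as sorted tuples.

```python
from itertools import combinations


def extend_with_f1(frequent_prev, frequent_items):
    """Extend each frequent (k-1)-itemset with lexicographically larger frequent items."""
    candidates = []
    for itemset in frequent_prev:
        for item in frequent_items:
            if item > itemset[-1]:  # only larger items, so no duplicates arise
                candidates.append(itemset + (item,))
    return candidates


def prune(candidates, frequent_prev):
    """Keep a candidate only if all of its (k-1)-subsets are frequent."""
    known = set(frequent_prev)
    return [c for c in candidates
            if all(sub in known for sub in combinations(c, len(c) - 1))]


# Assumed frequent 1- and 2-itemsets (illustrative values).
F1 = ["Beer", "Bread", "Diapers", "Milk"]
F2 = [("Beer", "Diapers"), ("Bread", "Diapers"), ("Bread", "Milk"), ("Diapers", "Milk")]

candidates = extend_with_f1(F2, F1)
kept = prune(candidates, F2)
print(candidates)  # [('Beer', 'Diapers', 'Milk'), ('Bread', 'Diapers', 'Milk')]
print(kept)        # [('Bread', 'Diapers', 'Milk')]
```

Here the check is done after generation; the slide's first option (checking during generation) would fold `prune` into the inner loop.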
Apriori-Gen – F_{k-1} × F_{k-1}
• Assumes lexicographic ordering
• Merges a pair of frequent (k-1)-itemsets only if their first k-2 items are identical
• E.g. k=3, merging itemsets of size 2
  {Bread, Diapers} and {Bread, Milk} will be merged: {Bread, Diapers, Milk}
  {Beer, Diapers} and {Diapers, Milk} will not be merged
• Complete procedure
• Will not generate duplicates
• Does not guarantee that all generated candidate itemsets are frequent => pruning is needed
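The merge rule can be sketched as below, with itemsets held as sorted tuples. The list `F2` is an assumed set of frequent 2-itemsets containing the pairs from the example above; `merge_fk1_fk1` is an illustrative name.

```python
def merge_fk1_fk1(frequent_prev):
    """Merge pairs of frequent (k-1)-itemsets that share their first k-2 items."""
    prev = sorted(frequent_prev)
    candidates = []
    for i, a in enumerate(prev):
        for b in prev[i + 1:]:
            if a[:-1] == b[:-1]:  # identical prefix of length k-2
                candidates.append(a + (b[-1],))
    return candidates


# Assumed frequent 2-itemsets (illustrative values).
F2 = [("Beer", "Diapers"), ("Bread", "Diapers"), ("Bread", "Milk"), ("Diapers", "Milk")]

print(merge_fk1_fk1(F2))  # [('Bread', 'Diapers', 'Milk')]
```

Because the list is sorted and each pair is visited once, every candidate is produced exactly once; a subset-based pruning step is still needed afterwards.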
Frequent Itemset Generation in Apriori –
Clothing Example
• Given: 20 clothing transactions; minSup = 20% (i.e. at least 4 of the 20 transactions), minConf = 50%
• Generate frequent itemsets using the Apriori algorithm and the F_{k-1} × F_{k-1} strategy for candidate itemset generation
1. Level 1: generate all 1-itemsets and find the frequent ones
   Candidate: {Blouse}(3), {Jeans}(14), {Shoes}(10), {Shorts}(5), {Skirt}(6), {TShirt}(13)
   Frequent: {Jeans}(14), {Shoes}(10), {Shorts}(5), {Skirt}(6), {TShirt}(13)
Frequent Itemset Generation in Apriori –
Clothing Example (cont.)
2. Level 2: use Apriori-Gen to generate candidate 2-itemsets from frequent 1-itemsets (F_1 × F_1)
   Candidate: {Jeans, Shoes}(7), {Jeans, Shorts}(5), {Jeans, Skirt}(2), {Jeans, TShirt}(8), {Shoes, Shorts}(4), {Shoes, Skirt}(3), {Shoes, TShirt}(9), {Shorts, Skirt}(0), {Shorts, TShirt}(4), {Skirt, TShirt}(3)
   Frequent: {Jeans, Shoes}(7), {Jeans, Shorts}(5), {Jeans, TShirt}(8), {Shoes, Shorts}(4), {Shoes, TShirt}(9), {Shorts, TShirt}(4)
3. Level 3: use Apriori-Gen to generate candidate 3-itemsets from frequent 2-itemsets (F_2 × F_2; the 1st item should be identical)
   Candidate: {Jeans, Shoes, Shorts}(4), {Jeans, Shoes, TShirt}(7), {Jeans, Shorts, TShirt}(4), {Shoes, Shorts, TShirt}(4)
   Frequent: {Jeans, Shoes, Shorts}(4), {Jeans, Shoes, TShirt}(7), {Jeans, Shorts, TShirt}(4), {Shoes, Shorts, TShirt}(4)
Frequent Itemset Generation in Apriori –
Clothing Example (cont.)
4. Level 4: use Apriori-Gen to generate candidate 4-itemsets from frequent 3-itemsets (F_3 × F_3; the 1st and 2nd items should be identical)
   Candidate: {Jeans, Shoes, Shorts, TShirt}(4)
   Frequent: {Jeans, Shoes, Shorts, TShirt}(4)
5. Level 5: use Apriori-Gen to generate candidate 5-itemsets from frequent 4-itemsets (F_4 × F_4; the 1st, 2nd and 3rd items should be identical)
   => stop (there is only one frequent 4-itemset, so no 5-itemset candidates can be generated)
Clothing Example – Generation of AR Rules
• The next step is to use the frequent itemsets to generate association rules with the ARGen algorithm (slide 14)
• minConf = 50%
• The set of frequent itemsets is
  L = {{Jeans}, {Shoes}, {Shorts}, {Skirt}, {TShirt}, {Jeans, Shoes}, {Jeans, Shorts}, {Jeans, TShirt}, {Shoes, Shorts}, {Shoes, TShirt}, {Shorts, TShirt}, {Jeans, Shoes, Shorts}, {Jeans, Shoes, TShirt}, {Jeans, Shorts, TShirt}, {Shoes, Shorts, TShirt}, {Jeans, Shoes, Shorts, TShirt}}
• We ignore the first 5 as they cannot be split into 2 non-empty subsets. We test all the others, e.g.:
  confidence(Jeans → Shoes) = support({Jeans, Shoes}) / support({Jeans}) = (7/20) / (14/20) = 50% ≥ minConf
  etc.
Frequent Itemset Generation in Apriori – Pseudo Code
[Figure: Apriori pseudo code; the candidate generation step can use e.g. F_{k-1} × F_1 or F_{k-1} × F_{k-1}, etc.]
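Since the pseudo code figure is not reproduced in these notes, the following is a minimal Python sketch of the level-wise loop described earlier, not the slide's own pseudo code. It assumes the F_{k-1} × F_{k-1} strategy with subset-based pruning, represents itemsets as sorted tuples, and runs on a hypothetical 5-transaction dataset.

```python
from itertools import combinations


def apriori(transactions, minsup):
    """Return a dict mapping each frequent itemset (sorted tuple) to its count."""
    n = len(transactions)
    # Level 1 candidates: every single item.
    candidates = [(item,) for item in sorted(set().union(*transactions))]
    frequent = {}
    while candidates:
        # Database scan step: count each candidate.
        counts = {c: sum(1 for t in transactions if set(c) <= t) for c in candidates}
        level = {c: cnt for c, cnt in counts.items() if cnt / n >= minsup}
        frequent.update(level)
        # Candidate generation step: merge itemsets sharing their first k-2 items,
        # then drop candidates that have an infrequent (k-1)-subset.
        prev = sorted(level)
        candidates = []
        for i, a in enumerate(prev):
            for b in prev[i + 1:]:
                if a[:-1] == b[:-1]:
                    cand = a + (b[-1],)
                    if all(s in level for s in combinations(cand, len(cand) - 1)):
                        candidates.append(cand)
    return frequent


# Hypothetical toy dataset (an assumption made for illustration).
TRANSACTIONS = [
    {"Bread", "Jelly", "PeanutButter"},
    {"Bread", "PeanutButter"},
    {"Bread", "Milk", "PeanutButter"},
    {"Beer", "Bread"},
    {"Beer", "Milk"},
]

result = apriori(TRANSACTIONS, minsup=0.3)
for itemset, cnt in sorted(result.items()):
    print(itemset, cnt)
```

The loop makes one pass over the data per level, which is the scan cost raised in the discussion of Apriori's weaknesses.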
FP-Growth Algorithm
Frequent Pattern Growth (FP-Growth)
Algorithm
• Apriori: generate-and-test approach – generates
candidate itemsets and tests if they are frequent
• Problem: the generation of candidate itemsets is
expensive
• FP-growth – the first algorithm that allows frequent itemset discovery without candidate itemset generation
• Uses a compact data structure called FP-tree and
extracts frequent itemsets directly from the FP-tree
FP-Tree
• Nodes correspond to items and have a counter
• 1 path = 1 transaction
• Reads 1 transaction at a time and maps it to a path
• Pointers between nodes containing the same item
• Paths may overlap as transactions share items => increment the counter and add pointers
• More paths overlap -> higher compression
• => FP-tree may fit in memory => direct extraction of frequent itemsets from the FP-tree instead of many passes over data stored on disk
FP-Tree Construction
• Pass 1: Scan the data and find the support for each item. Discard infrequent items. Sort frequent items in decreasing order of support.
  • For our example: a, b, c, d, e
• Pass 2: construct the FP-tree
  • Read trans. 1 {a, b}. Create 2 nodes a and b and the path null->a->b. Set counts of a and b to 1.
  • Read trans. 2 {b, c, d}. Create 3 nodes for b, c and d and the path null->b->c->d. Set counts to 1. Note that although trans. 1 and 2 share b, the paths are disjoint as they don’t share a common prefix.
  • Read trans. 3 {a, c, d, e}. It shares the common prefix item {a} with trans. 1 => the paths for trans. 1 and 3 will overlap and the frequency count for node a will be incremented by 1.
  • Continue until all transactions are mapped to a path in the FP-tree.
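Pass 2 can be sketched as below, using only the three transactions quoted above. The item order a, b, c, d, e is taken from the slide rather than recomputed. `FPNode`, `build_fp_tree` and the header table of node lists (standing in for the same-item pointers) are assumptions of this sketch.

```python
class FPNode:
    """One FP-tree node: an item, a counter, a parent link and child nodes."""

    def __init__(self, item, parent):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}  # item -> FPNode


def build_fp_tree(transactions, item_order):
    """Insert each transaction as a path, with items sorted by `item_order`."""
    rank = {item: i for i, item in enumerate(item_order)}
    root = FPNode(None, None)
    header = {}  # item -> list of nodes holding that item (same-item links)
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.__getitem__):
            child = node.children.get(item)
            if child is None:  # no shared prefix here: start a new branch
                child = FPNode(item, node)
                node.children[item] = child
                header.setdefault(item, []).append(child)
            child.count += 1   # shared prefix: just increment the counter
            node = child
    return root, header


# The first three transactions from the slide.
root, header = build_fp_tree([{"a", "b"}, {"b", "c", "d"}, {"a", "c", "d", "e"}],
                             item_order="abcde")
print(root.children["a"].count)  # 2: trans. 1 and 3 share the prefix a
print(len(header["b"]))          # 2: b appears on two disjoint paths
```

A list per item is used for the header table for brevity; a linked chain of node pointers, as drawn on the slides, would serve the same purpose.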
FP-Tree Size
• The FP-tree is typically smaller than the uncompressed data, as many transactions share items
  • Best case scenario: all transactions contain the same set of items
    • 1 path in the FP-tree
  • Worst case scenario: every transaction has a unique set of items (no items in common)
    • the size of the FP-tree = size of the original data
    • However, the storage requirements for the FP-tree are higher: need to store the pointers between the nodes and the counters
• The size of the FP-tree depends on how the items are ordered
  • Ordering by decreasing support is typically used but it does not always lead to the smallest tree
FP-Growth Algorithm
• Extracts frequent itemsets from the FP-tree
• Bottom-up algorithm: from the leaves to the root
  • For our example: first look for frequent itemsets ending in e, then in d, c, b and a (Note: reverse lexicographic order!)
• Extract the paths ending in e, d, c, b and a (also called prefix paths)
[Figures: complete FP-tree; prefix paths ending in e, d, c, b and a]
FP-Growth Algorithm (cont. 1)
• Each prefix path sub-tree is processed recursively to extract the frequent itemsets, and the solutions are then merged
  e.g. the prefix path sub-tree for e will be used to extract frequent itemsets ending in e, then in de, ce, be and ae. Each of them can be decomposed into smaller subproblems, e.g. de into cde, bde, ade, etc.
  e -> de -> cde, bde, ade
       ce -> bce, ace
       be -> abe
       ae
  d ...
  c ...
  b ...
  a ...
• End of recursion: no more frequent itemsets can be extracted, i.e. empty tree or tree with 1 item, where tree = prefix path sub-tree or conditional FP-tree
FP-Growth Algorithm - Example
• Extract frequent itemsets for e. Let minsup = 2.
1) Obtain the prefix path sub-tree for e
2) Check if {e} is a frequent item by adding the counts. If so, extract it.
   • Yes, count = 3 => {e} is extracted as a frequent itemset.
3) As {e} is frequent, find frequent itemsets ending in de, ce, be and ae. To do this, we first need to obtain the conditional FP-tree for e.
[Figure: prefix paths ending in e, d, c, b and a]
FP-Growth Example (cont. 2)
• To obtain the conditional FP-tree for e:
  • Update the support counts along the prefix paths to reflect the number of transactions containing e => b and c should be set to 1, a to 2
  • Remove the nodes containing e: information about node e is no longer needed because of the previous step
  • Remove infrequent items (nodes) from the prefix paths, e.g. b has a support of 1 and appears once => there is only 1 trans. containing b and e => be is infrequent => remove b.
[Figures: prefix paths for e with updated counts; final conditional FP-tree for e]
FP-Growth Example (cont. 3)
4) Use the conditional FP-tree for e to find frequent itemsets ending in de, ce and ae
   • Note that be is not considered as b is not in the conditional FP-tree
   • For each of them (e.g. de) find the prefix paths from the conditional tree for e and extract frequent itemsets; generate the conditional FP-tree, etc.
[Figure: recursion for suffix e]
   Extract {e}
   Frequent itemsets ending with de? Extract {d,e}
   Frequent itemsets ending with ade? Extract {a,d,e}
   The conditional FP-tree for de contains 1 item, so there is no need to generate prefix paths ending in ade (they would be the same as the cond. FP-tree for de); extract frequent itemsets (if any) and stop this branch of the recursion. Continue with itemsets ending with ce.
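The recursion above can be imitated with a simplified pattern-growth sketch that keeps each conditional database as a plain list of (prefix path, count) pairs instead of a pointer-linked FP-tree. This is a swapped-in simplification, not the FP-growth implementation from the slides. The first three transactions are those quoted on the construction slide; the other seven are assumed for illustration, chosen so that e occurs in 3 transactions as stated above.

```python
from collections import Counter


def pattern_growth(cond_db, minsup, suffix=()):
    """cond_db: list of (ordered item tuple, count). Yields (itemset, support count)."""
    counts = Counter()
    for items, cnt in cond_db:
        for item in items:
            counts[item] += cnt
    for item, cnt in counts.items():
        if cnt < minsup:
            continue  # infrequent in this conditional database: skip it
        new_suffix = (item,) + suffix
        yield frozenset(new_suffix), cnt
        # Conditional database for `item`: the prefix preceding it on each path.
        new_db = [(items[:items.index(item)], c)
                  for items, c in cond_db
                  if item in items and items.index(item) > 0]
        yield from pattern_growth(new_db, minsup, new_suffix)


# Transactions 1-3 are from the slides; 4-10 are assumed for illustration.
TRANSACTIONS = ["ab", "bcd", "acde", "ade", "abc", "abcd", "a", "abc", "abd", "bce"]
database = [(tuple(sorted(t)), 1) for t in TRANSACTIONS]  # order a, b, c, d, e

found = dict(pattern_growth(database, minsup=2))
for itemset in sorted(found, key=sorted):
    if "e" in itemset:
        print(sorted(itemset), found[itemset])
```

For this assumed data the itemsets containing e come out as {e}, {d,e}, {a,d,e}, {c,e} and {a,e}, in line with the walk-through above. {b,e} is absent because b and e occur together only once.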
FP-Growth Example - Solution
• FP-Growth algorithm will find the following frequent itemsets:
[Table: frequent itemsets found by FP-Growth]
Discussion
• Association rules are typically sought for very large databases => efficient algorithms are needed
• The Apriori algorithm makes 1 pass through the dataset for each different itemset size
  • The maximum number of database scans is k+1, where k is the cardinality of the largest frequent itemset (4 in the clothing ex.)
  • Potentially large number of scans: a weakness of Apriori
• Sometimes the database is too big to be kept in memory and must be kept on disk
• The amount of computation also depends on the minimum support; the confidence has less impact as it does not affect the number of passes
• Variations
  • Using sampling of the database
  • Using partitioning of the database
  • Generation of incremental rules
Discussion (2)
• FP-growth is typically an order of magnitude faster than Apriori
  • No candidate generation
  • Uses a compact data structure
  • Only 2 scans of the database: one to count the support of each item and one to build the FP-tree
  • Basic operation is FP-tree building and counting