Apriori for Mining Association Rules

Fast Algorithms for Mining Association Rules
Rakesh Agrawal
Ramakrishnan Srikant
Slides from Ofer Pasternak
Data Mining Seminar 2003
Introduction
• Bar-code technology
• Mining association rules over basket data (93)
• Tires ∧ accessories ⇒ automotive service
• Cross-marketing, attached mailing.
• Very large databases.

©Ofer Pasternak
Notation
• Items – I = {i1, i2, …, im}
• Transaction – set of items T ⊆ I
  – Items are sorted lexicographically
• TID – unique identifier for each transaction
Notation

• Association Rule – X ⇒ Y
• X ⊂ I, Y ⊂ I and X ∩ Y = ∅
Confidence and Support


• Association rule X ⇒ Y has confidence c if c% of the transactions in D that contain X also contain Y.
• Association rule X ⇒ Y has support s if s% of the transactions in D contain both X and Y (that is, X ∪ Y).
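The two definitions above can be sketched in a few lines of Python. This is only an illustration: the function names and the toy database `D` are assumptions made for the example, not part of the slides, and support is returned as a fraction rather than a percentage.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(x, y, transactions):
    """support(X ∪ Y) / support(X): the share of X-transactions that also contain Y.

    Assumes X occurs in at least one transaction (otherwise this divides by zero).
    """
    x, y = frozenset(x), frozenset(y)
    return support(x | y, transactions) / support(x, transactions)


# Hypothetical toy database D, invented for this example.
D = [frozenset(t) for t in (
    {"tires", "accessories", "service"},
    {"tires", "accessories"},
    {"tires", "service"},
    {"accessories"},
)]

rule_support = support({"tires", "accessories", "service"}, D)
rule_confidence = confidence({"tires", "accessories"}, {"service"}, D)
```

Here the rule tires ∧ accessories ⇒ service is supported by one transaction out of four, and one of the two transactions containing both tires and accessories also contains service.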
Define the Problem
Given a set of transactions D, generate
all association rules that have support
and confidence greater than the
user-specified minimum support and
minimum confidence.
Discovering all Association Rules
• Find all large itemsets
  – itemsets with support above minimum support.
• Use large itemsets to generate the rules.
General idea
• Say ABCD and AB are large itemsets
• Compute conf = support(ABCD) / support(AB)
• If conf >= minconf, the rule AB ⇒ CD holds.
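The idea generalizes to every large itemset and every non-empty proper subset of it. The sketch below is an assumption-laden illustration, not the paper's rule-generation procedure: `generate_rules` and the `supports` dictionary are hypothetical names, and the dictionary is assumed to hold a support value for every large itemset (and hence for all of its subsets).

```python
from itertools import combinations


def generate_rules(supports, minconf):
    """Return (antecedent, consequent, confidence) for rules meeting minconf.

    `supports` maps frozenset -> support (count or fraction) and must contain
    every large itemset, which implies it contains all their subsets as well.
    """
    rules = []
    for itemset, sup in supports.items():
        if len(itemset) < 2:
            continue  # a rule needs a non-empty antecedent and consequent
        for r in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), r):
                a = frozenset(antecedent)
                conf = sup / supports[a]
                if conf >= minconf:
                    rules.append((a, itemset - a, conf))
    return rules


# Hypothetical support counts, invented for this example.
supports = {
    frozenset({"A"}): 4,
    frozenset({"B"}): 2,
    frozenset({"A", "B"}): 2,
}
rules = generate_rules(supports, minconf=0.8)
```

With these counts, A ⇒ B has confidence 2/4 and is rejected, while B ⇒ A has confidence 2/2 and is kept.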

Discovering Large Itemsets
• Multiple passes over the data
• First pass – count the support of individual items.
• Subsequent passes
  – Generate candidates using the previous pass's large itemsets.
  – Go over the data and check the actual support of the candidates.
• Stop when no new large itemsets are found.
The Trick
Any subset of a large itemset is large.
Therefore, to find the large k-itemsets:
– Create candidates by combining large (k-1)-itemsets.
– Delete those that contain any subset that is not large.
Algorithm Apriori
L_1 = {large 1-itemsets};                      // count item occurrences
for (k = 2; L_{k-1} ≠ ∅; k++) do begin
    C_k = apriori-gen(L_{k-1});                // generate new k-itemset candidates
    forall transactions t ∈ D do begin
        C_t = subset(C_k, t);                  // candidates contained in t
        forall candidates c ∈ C_t do
            c.count++;                         // find the support of all the candidates
    end
    L_k = {c ∈ C_k | c.count ≥ minsup}         // take only those with support over minsup
end
Answer = ∪_k L_k;
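A minimal Python sketch of the loop above follows. It rests on several assumptions: `minsup` is an absolute transaction count rather than a percentage, itemsets are represented as sorted tuples, and the `subset` call is replaced by a naive scan over all candidates instead of the hash-tree lookup. The toy transactions are invented for the example.

```python
from collections import defaultdict
from itertools import combinations


def apriori_gen(prev_large, k):
    """Join large (k-1)-itemsets that share their first k-2 items, then prune."""
    prev = sorted(tuple(sorted(s)) for s in prev_large)
    prev_set = set(prev)
    candidates = set()
    for i, p in enumerate(prev):
        for q in prev[i + 1:]:
            if p[:-1] != q[:-1]:
                break  # sorted order: no later q shares p's prefix
            c = p + (q[-1],)
            # Keep c only if every (k-1)-subset is itself large.
            if all(s in prev_set for s in combinations(c, k - 1)):
                candidates.add(c)
    return candidates


def apriori(transactions, minsup):
    """Return {itemset: count} for every itemset whose count is >= minsup."""
    transactions = [frozenset(t) for t in transactions]
    counts = defaultdict(int)
    for t in transactions:                     # first pass: individual items
        for item in t:
            counts[(item,)] += 1
    large = {c: n for c, n in counts.items() if n >= minsup}
    answer = dict(large)
    k = 2
    while large:                               # stop when L_{k-1} is empty
        candidates = apriori_gen(large.keys(), k)
        counts = defaultdict(int)
        for t in transactions:
            for c in candidates:               # naive stand-in for subset(C_k, t)
                if t.issuperset(c):
                    counts[c] += 1
        large = {c: n for c, n in counts.items() if n >= minsup}
        answer.update(large)
        k += 1
    return answer


# Hypothetical transactions, invented for this example.
D = [{1, 2, 3}, {1, 2}, {1, 3}, {2, 3}, {1, 2, 3}]
result = apriori(D, minsup=3)
```

On this input every single item and every pair reaches the threshold, while the candidate (1, 2, 3) occurs in only two transactions and is discarded, which ends the loop.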
Candidate generation

Join step
p and q are two large (k-1)-itemsets identical in their first k-2 items. Join by adding the last item of q to p.
    insert into C_k
    select p.item_1, p.item_2, ..., p.item_{k-1}, q.item_{k-1}
    from L_{k-1} p, L_{k-1} q
    where p.item_1 = q.item_1, ..., p.item_{k-2} = q.item_{k-2}, p.item_{k-1} < q.item_{k-1};

Prune step
Check all the (k-1)-subsets and remove any candidate with a "small" subset.
    forall itemsets c ∈ C_k do
        forall (k-1)-subsets s of c do
            if (s ∉ L_{k-1}) then
                delete c from C_k;
Example
L_3 = { {1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2 3 4} }
After joining: { {1 2 3 4}, {1 3 4 5} }
After pruning: { {1 2 3 4} }
{1 3 4 5} is removed because {1 4 5} and {3 4 5} are not in L_3.
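The join and prune steps can be replayed on this example with a short Python sketch. The function names `join` and `prune` are hypothetical, and the code assumes each itemset is given as a tuple whose items are already sorted.

```python
from itertools import combinations


def join(prev_large):
    """Join step: pair p, q that agree on all but the last item, with p's last item < q's."""
    prev = [tuple(s) for s in prev_large]
    return sorted(
        p + (q[-1],)
        for p in prev
        for q in prev
        if p[:-1] == q[:-1] and p[-1] < q[-1]
    )


def prune(candidates, prev_large):
    """Prune step: drop candidates that have a (k-1)-subset which is not large."""
    prev = {tuple(s) for s in prev_large}
    return [
        c for c in candidates
        if all(s in prev for s in combinations(c, len(c) - 1))
    ]


L3 = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)]
C4_joined = join(L3)          # [(1, 2, 3, 4), (1, 3, 4, 5)]
C4 = prune(C4_joined, L3)     # [(1, 2, 3, 4)]
```

The strict `<` in the join plays the same role as in the SQL formulation: each candidate is produced once, from one ordered pair (p, q).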
Correctness
• Show that C_k ⊇ L_k.
• Any subset of a large itemset must also be large.
• The join is equivalent to extending L_{k-1} with all items and then removing those whose (k-1)-subsets are not in L_{k-1}.
• The join condition p.item_{k-1} < q.item_{k-1} prevents duplications.
• The prune step deletes only candidates that have a (k-1)-subset outside L_{k-1}, so it never removes a member of L_k.
Subset Function
• Candidate itemsets C_k are stored in a hash-tree.
• Finds in O(k) time whether a candidate itemset of size k is contained in transaction t.
• Total time O(max(k, size(t))).
• Called inside the Apriori loop as C_t = subset(C_k, t) to collect the candidates contained in t.
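To make the contract of the subset function concrete, here is a simplified Python stand-in. It is deliberately not a hash-tree: it assumes the candidates are kept in a flat hash set of sorted tuples and enumerates the k-item subsets of the transaction, so it does not reproduce the complexity figures quoted above. The candidate set and transaction are invented for the example.

```python
from itertools import combinations


def subset(candidates, t, k):
    """Return the size-k candidates contained in transaction t.

    `candidates` is a set of sorted k-tuples. Each k-subset of t is looked up
    in that set; the slides' hash-tree reaches the same answer while avoiding
    the enumeration of every k-subset.
    """
    items = sorted(t)
    return [c for c in combinations(items, k) if c in candidates]


# Hypothetical candidate set C_2 and transaction t.
C2 = {(1, 2), (1, 3), (2, 5)}
t = {1, 2, 3, 4}
Ct = subset(C2, t, 2)
```

Only (1, 2) and (1, 3) are returned, since item 5 does not occur in t.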