Fast Algorithms for Mining Association Rules
Rakesh Agrawal, Ramakrishnan Srikant
Slides from Ofer Pasternak
Data Mining Seminar 2003
©Ofer Pasternak

Introduction
Bar-code technology
Mining association rules over basket data (93)
Tires ∧ accessories ⇒ automotive service
Cross-marketing, attached mailing.
Very large databases.

Notation
Items – I = {i1, i2, …, im}
Transaction – set of items T ⊆ I
  – Items are sorted lexicographically
TID – unique identifier for each transaction
Association Rule – X ⇒ Y, where X ⊂ I, Y ⊂ I and X ∩ Y = ∅

Confidence and Support
Association rule X ⇒ Y has confidence c if c% of transactions in D that contain X also contain Y.
Association rule X ⇒ Y has support s if s% of transactions in D contain X ∪ Y (both X and Y).

Define the Problem
Given a set of transactions D, generate all association rules that have support and confidence greater than the user-specified minimum support (minsup) and minimum confidence (minconf).

Discovering all Association Rules
Find all large itemsets – itemsets with support above minimum support.
Use large itemsets to generate the rules.

General idea
Say ABCD and AB are large itemsets.
Compute conf = support(ABCD) / support(AB)
If conf >= minconf, the rule AB ⇒ CD holds.

Discovering Large Itemsets
Multiple passes over the data
First pass – count the support of individual items.
Subsequent passes
  – Generate candidates using the previous pass's large itemsets.
  – Go over the data and check the actual support of the candidates.
Stop when no new large itemsets are found.

The Trick
Any subset of a large itemset is large.
Therefore, to find large k-itemsets:
  – Create candidates by combining large (k-1)-itemsets.
  – Delete those that contain any subset that is not large.
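
The support and confidence definitions, and the rule test conf = support(ABCD) / support(AB) from the "General idea" slide, can be sketched in a few lines of Python. This is only an illustration: the function names (support, confidence, rule_holds) and the toy basket data D are assumptions made for the example, and support is expressed as a fraction rather than a percentage.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(X u Y) / support(X) for the rule X => Y."""
    x = frozenset(antecedent)
    xy = x | frozenset(consequent)
    sx = support(x, transactions)
    if sx == 0:
        return 0.0  # X never occurs, so the rule has no evidence
    return support(xy, transactions) / sx


def rule_holds(antecedent, consequent, transactions, minsup, minconf):
    """True if X => Y reaches both thresholds (using >=, as on the slide)."""
    xy = frozenset(antecedent) | frozenset(consequent)
    return (support(xy, transactions) >= minsup
            and confidence(antecedent, consequent, transactions) >= minconf)


# Hypothetical basket data, invented for this example.
D = [frozenset(t) for t in (
    {"A", "B", "C", "D"},
    {"A", "B", "C", "D"},
    {"A", "B"},
    {"A", "C"},
    {"B", "D"},
)]

# support(AB) = 3/5, support(ABCD) = 2/5, so conf(AB => CD) = 2/3.
print(support({"A", "B"}, D))
print(confidence({"A", "B"}, {"C", "D"}, D))
print(rule_holds({"A", "B"}, {"C", "D"}, D, minsup=0.4, minconf=0.6))
```

Each call to support rescans all of D, which is fine for a toy example but is exactly the cost the multi-pass candidate counting in Apriori is designed to avoid.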

Algorithm Apriori
  L_1 = {large 1-itemsets};                      // Count item occurrences
  for (k = 2; L_{k-1} ≠ ∅; k++) do begin
    C_k = apriori-gen(L_{k-1});                  // Generate new k-itemset candidates
    forall transactions t ∈ D do begin           // Find the support of all the candidates
      C_t = subset(C_k, t);
      forall candidates c ∈ C_t do
        c.count++;
    end
    L_k = {c ∈ C_k | c.count ≥ minsup}           // Take only those with support over minsup
  end
  Answer = ∪_k L_k;

Candidate generation
Join step – p and q are two large (k-1)-itemsets identical in their first k-2 items. Join by adding the last item of q to p.
  insert into C_k
  select p.item_1, p.item_2, …, p.item_{k-1}, q.item_{k-1}
  from L_{k-1} p, L_{k-1} q
  where p.item_1 = q.item_1, …, p.item_{k-2} = q.item_{k-2}, p.item_{k-1} < q.item_{k-1};
Prune step – check all the (k-1)-subsets and remove any candidate with a "small" subset.
  forall itemsets c ∈ C_k do
    forall (k-1)-subsets s of c do
      if (s ∉ L_{k-1}) then
        delete c from C_k;

Example
L_3 = { {1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2 3 4} }
After joining: { {1 2 3 4}, {1 3 4 5} }
After pruning: { {1 2 3 4} }
{1 3 4 5} is removed because {1 4 5} and {3 4 5} are not in L_3.

Correctness
Show that C_k ⊇ L_k.
Any subset of a large itemset must also be large.
The join is equivalent to extending L_{k-1} with all items and removing those whose (k-1)-subsets are not in L_{k-1}.
The join condition p.item_{k-1} < q.item_{k-1} prevents duplications.

Subset Function
Candidate itemsets C_k are stored in a hash-tree.
Finds in O(k) time whether a candidate itemset of size k is contained in transaction t.
Total time O(max(k, size(t)))
  L_1 = {large 1-itemsets};
  for (k = 2; L_{k-1} ≠ ∅; k++) do begin
    C_k = apriori-gen(L_{k-1});
    forall transactions t ∈ D do begin
      C_t = subset(C_k, t);
      forall candidates c ∈ C_t do
        c.count++;
    end
    L_k = {c ∈ C_k | c.count ≥ minsup}
  end
  Answer = ∪_k L_k;
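
The join step, the prune step and the multi-pass loop can be put together in a short Python sketch. It makes several simplifying assumptions that are not in the slides: itemsets are sorted tuples, minsup is an absolute count (minsup_count) rather than a percentage, and the hash-tree subset function is replaced by a naive containment test on every candidate. The names apriori_gen and apriori mirror the pseudocode, but the code is an illustration and not the authors' implementation.

```python
from itertools import combinations


def apriori_gen(large_prev):
    """Build size-k candidates from a set of large (k-1)-itemsets.

    `large_prev` is a set of sorted tuples, all of the same length k-1.
    """
    candidates = set()
    for p in large_prev:
        for q in large_prev:
            # Join step: same first k-2 items, and p's last item < q's last
            # item (the "<" keeps each candidate from being built twice).
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # Prune step: every (k-1)-subset of c must itself be large.
                if all(s in large_prev for s in combinations(c, len(c) - 1)):
                    candidates.add(c)
    return candidates


def apriori(transactions, minsup_count):
    """Return {itemset tuple: support count} for all large itemsets."""
    transactions = [frozenset(t) for t in transactions]

    # First pass: count the support of individual items.
    counts = {}
    for t in transactions:
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    large = {c: n for c, n in counts.items() if n >= minsup_count}
    answer = dict(large)

    # Subsequent passes: generate candidates, then count them against the data.
    while large:
        candidates = apriori_gen(set(large))
        counts = dict.fromkeys(candidates, 0)
        for t in transactions:
            for c in candidates:
                if t.issuperset(c):  # naive stand-in for the hash-tree subset()
                    counts[c] += 1
        large = {c: n for c, n in counts.items() if n >= minsup_count}
        answer.update(large)
    return answer


# The candidate-generation example from the slides.
L3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
print(apriori_gen(L3))  # {(1, 2, 3, 4)}
```

The containment scan costs one test per candidate per transaction. The hash-tree described on the "Subset Function" slide exists to avoid exactly that, so this sketch shows the control flow only and not the performance characteristics.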