Apriori for Mining Association Rules

Fast Algorithms for Mining Association Rules
Rakesh Agrawal, Ramakrishnan Srikant
Slides from Ofer Pasternak, Data Mining Seminar 2003

Introduction
• Bar-code technology
• Mining association rules over basket data (93)
• Tires ∧ accessories ⇒ automotive service
• Cross-marketing, attached mailing.
• Very large databases.

Notation
• Items – I = {i_1, i_2, …, i_m}
• Transaction – set of items T ⊆ I
  – Items are sorted lexicographically
• TID – unique identifier for each transaction

Notation
• Association rule – X ⇒ Y
  X ⊂ I, Y ⊂ I and X ∩ Y = ∅

Confidence and Support
• Association rule X ⇒ Y has confidence c if c% of the transactions in D that contain X also contain Y.
• Association rule X ⇒ Y has support s if s% of the transactions in D contain both X and Y.

Define the Problem
Given a set of transactions D, generate all association rules that have support and confidence greater than the user-specified minimum support and minimum confidence.

Discovering all Association Rules
• Find all large itemsets – itemsets with support above minimum support.
• Use the large itemsets to generate the rules.

General idea
• Say ABCD and AB are large itemsets
• Compute conf = support(ABCD) / support(AB)
• If conf >= minconf, AB ⇒ CD holds.

Discovering Large Itemsets
• Multiple passes over the data
• First pass – count the support of individual items.
• Subsequent pass
  – Generate candidates using the previous pass's large itemsets.
  – Go over the data and check the actual support of the candidates.
• Stop when no new large itemsets are found.

The Trick
Any subset of a large itemset is large. Therefore, to find large k-itemsets:
– Create candidates by combining large (k-1)-itemsets.
– Delete those that contain any subset that is not large.

Algorithm Apriori
    L_1 ← {large 1-itemsets}                     (count item occurrences)
    for (k = 2; L_{k-1} ≠ ∅; k++) do begin
        C_k ← apriori-gen(L_{k-1});              (generate new k-itemset candidates)
        forall transactions t ∈ D do begin
            C_t ← subset(C_k, t);
            forall candidates c ∈ C_t do
                c.count++;                       (find the support of all the candidates)
        end
        L_k ← {c ∈ C_k | c.count ≥ minsup}       (take only those with support over minsup)
    end
    Answer ← ∪_k L_k;

Candidate generation
• Join step: p and q are two large (k-1)-itemsets identical in their first k-2 items. Join by adding the last item of q to p.
    insert into C_k
    select p.item_1, p.item_2, ..., p.item_{k-1}, q.item_{k-1}
    from L_{k-1} p, L_{k-1} q
    where p.item_1 = q.item_1, ..., p.item_{k-2} = q.item_{k-2}, p.item_{k-1} < q.item_{k-1};
• Prune step: check all the subsets, remove a candidate with a "small" subset.
    forall itemsets c ∈ C_k do
        forall (k-1)-subsets s of c do
            if (s ∉ L_{k-1}) then
                delete c from C_k;

Example
L_3 = { {1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2 3 4} }
After joining: { {1 2 3 4}, {1 3 4 5} }
After pruning: { {1 2 3 4} }
{1 3 4 5} is removed because {1 4 5} and {3 4 5} are not in L_3.

Correctness
Show that C_k ⊇ L_k
• Any subset of a large itemset must also be large.
• The join is equivalent to extending L_{k-1} with all items and removing those whose (k-1)-subsets are not in L_{k-1}.
• The join condition p.item_{k-1} < q.item_{k-1} prevents duplications.

Subset Function
• Candidate itemsets C_k are stored in a hash-tree.
• Finds in O(k) time whether a candidate itemset of size k is contained in transaction t.
• Total time O(max(k, size(t)))
• Used in the counting step of Algorithm Apriori: C_t ← subset(C_k, t)
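The passes described in these slides can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: itemsets are represented as sorted tuples, minsup is taken as an absolute transaction count rather than a percentage, candidates are counted with a plain scan instead of the hash-tree, and the function names apriori_gen, apriori and rules are chosen here for illustration.

```python
from itertools import combinations


def apriori_gen(prev_large):
    """Join and prune steps: build k-item candidates from large (k-1)-itemsets.

    prev_large is a set of sorted tuples, all of the same length k-1.
    """
    candidates = set()
    for p in prev_large:
        for q in prev_large:
            # Join: same first k-2 items, and p's last item < q's last item
            # (the strict inequality prevents duplicate candidates).
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # Prune: every (k-1)-subset of c must itself be large.
                if all(s in prev_large for s in combinations(c, len(c) - 1)):
                    candidates.add(c)
    return candidates


def apriori(transactions, minsup):
    """Return {itemset: count} for every itemset with count >= minsup.

    Assumption: minsup is an absolute count, not a percentage.
    """
    transactions = [frozenset(t) for t in transactions]
    # First pass: count the support of individual items.
    counts = {}
    for t in transactions:
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    large = {c: n for c, n in counts.items() if n >= minsup}
    answer = dict(large)
    # Subsequent passes: stop when no new large itemsets are found.
    while large:
        candidates = apriori_gen(set(large))
        # Plain scan stands in for the hash-tree subset function.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if t.issuperset(c):
                    counts[c] += 1
        large = {c: n for c, n in counts.items() if n >= minsup}
        answer.update(large)
    return answer


def rules(large, minconf):
    """Yield (X, Y, confidence) for every rule X => Y meeting minconf."""
    for itemset, count in large.items():
        for r in range(1, len(itemset)):
            for x in combinations(itemset, r):
                # conf = support(X and Y together) / support(X)
                conf = count / large[x]
                if conf >= minconf:
                    y = tuple(i for i in itemset if i not in x)
                    yield x, y, conf


# Candidate generation on the L_3 example from the slides.
L3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
print(apriori_gen(L3))  # {(1, 2, 3, 4)}
```

The lookup large[x] in rules is safe because every subset of a large itemset is itself large, so it is always present in the result of apriori.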
