Chapter 5. Association Rules

Introduction

In data mining, association rule learners are used to discover elements that co-occur frequently within a data set[1] consisting of multiple independent selections of elements (such as purchasing transactions), and to discover rules, such as implication or correlation, that relate co-occurring elements. Association-finding algorithms answer questions such as "If a customer purchases product A, how likely is he to purchase product B?" and "What products will a customer buy if he buys products C and D?" This application of association rule learners is also known as market basket analysis. As with most data mining techniques, the task is to reduce a potentially huge amount of information to a small, understandable set of statistically supported statements.

A famous story about association rule mining is the "beer and diapers" story. A purported survey of the behavior of supermarket shoppers discovered that customers (presumably young men) who buy diapers tend also to buy beer. This anecdote became popular as an example of how unexpected association rules might be found in everyday data.

Definitions

Association rule mining: given a set of transactions, find rules that predict the occurrence of an item based on the occurrences of other items in the transaction.

Association rule: an implication expression of the form X → Y, where X and Y are itemsets.

Example: market-basket transactions

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example rules: {Diaper} → {Beer}, {Milk, Bread} → {Eggs, Coke}, {Beer, Bread} → {Milk}

Itemset: a collection of one or more items, e.g. {Milk, Bread, Diaper}.

k-itemset: an itemset that contains k items.

Rule evaluation metrics:
- Support (s): the fraction of transactions that contain both X and Y.
- Confidence (c): how often items in Y appear in transactions that contain X.

Example for the rule {Milk, Diaper} → {Beer}, where σ(X) denotes the number of transactions containing X and |T| the total number of transactions:

s = σ({Milk, Diaper, Beer}) / |T| = 2/5 = 0.4 = 40%
c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 ≈ 0.67 = 67%

Frequent itemset: an itemset whose support is greater than or equal to a minsup threshold.

Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set.

Motivation: finding inherent regularities in data.
- What products were often purchased together? Beer and diapers?!
- What are the subsequent purchases after buying a PC?
- What kinds of DNA are sensitive to this new drug?
- Can we automatically classify web documents?

Applications: basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis.

Association Rule Mining Task

Given a set of transactions T, the goal of association rule mining is to find all rules having
- support ≥ minsup threshold
- confidence ≥ minconf threshold

Example rules over the transactions above (verified in the sketch after this list):

{Milk, Diaper} → {Beer}   (s=0.4, c=0.67)
{Milk, Beer} → {Diaper}   (s=0.4, c=1.0)
{Diaper, Beer} → {Milk}   (s=0.4, c=0.67)
{Beer} → {Milk, Diaper}   (s=0.4, c=0.67)
{Diaper} → {Milk, Beer}   (s=0.4, c=0.5)
{Milk} → {Diaper, Beer}   (s=0.4, c=0.5)
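To make the two metrics concrete, here is a minimal Python sketch that computes support and confidence of a rule over the five example transactions. The helper names (support_count, evaluate_rule) are illustrative, not from any particular library.

```python
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """sigma(X): the number of transactions containing every item of X."""
    return sum(1 for t in transactions if itemset <= t)

def evaluate_rule(X, Y, transactions):
    """Return (support, confidence) of the rule X -> Y."""
    both = support_count(X | Y, transactions)
    s = both / len(transactions)               # fraction of transactions containing X and Y
    c = both / support_count(X, transactions)  # of those containing X, how many also contain Y
    return s, c

s, c = evaluate_rule({"Milk", "Diaper"}, {"Beer"}, transactions)
print(f"s = {s:.2f}, c = {c:.2f}")  # s = 0.40, c = 0.67
```

Note that all six rules listed above are binary partitions of the same itemset {Milk, Diaper, Beer}, so they share the same support (0.4) but differ in confidence.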
Methods for Mining Frequent Patterns

The downward closure property of frequent patterns: any subset of a frequent itemset must be frequent. If {beer, diaper, nuts} is frequent, so is {beer, diaper}, since every transaction containing {beer, diaper, nuts} also contains {beer, diaper}.

Scalable mining methods fall into three major approaches:
- Apriori (Agrawal & Srikant @VLDB'94)
- Frequent pattern growth (FP-growth — Han, Pei & Yin @SIGMOD'00)
- Vertical data format approach (Charm — Zaki & Hsiao @SDM'02)

Apriori pruning principle: if any itemset is infrequent, its supersets should not be generated or tested! (Agrawal & Srikant @VLDB'94; Mannila et al. @KDD'94)

Method:
1. Initially, scan the DB once to get the frequent 1-itemsets.
2. Generate length-(k+1) candidate itemsets from the length-k frequent itemsets.
3. Test the candidates against the DB.
4. Terminate when no frequent or candidate set can be generated.

Frequent Itemset Generation

Brute-force approach:
- Each itemset in the lattice is a candidate frequent itemset.
- Count the support of each candidate by scanning the database.

[Figure: the itemset lattice over five items A–E, from the null set down to ABCDE. Given d items, there are 2^d possible candidate itemsets. (Tan, Steinbach & Kumar, Introduction to Data Mining)]

Given d unique items, the total number of itemsets is 2^d, and the total number of possible association rules is

R = \sum_{k=1}^{d-1} \binom{d}{k} \sum_{j=1}^{d-k} \binom{d-k}{j} = 3^d - 2^{d+1} + 1

If d = 6, R = 602 rules. Expensive!!!

Frequent Itemset Generation Strategies
- Reduce the number of candidates (M): complete search has M = 2^d; use pruning techniques to reduce M.
- Reduce the number of transactions (N): reduce the size of N as the size of the itemset increases; used by DHP and vertical-based mining algorithms.
- Reduce the number of comparisons (NM): use efficient data structures to store the candidates or transactions, so there is no need to match every candidate against every transaction.

Reducing the Number of Candidates

Apriori principle: if an itemset is frequent, then all of its subsets must also be frequent. The principle holds due to the following property of the support measure:

\forall X, Y : (X \subseteq Y) \Rightarrow s(X) \ge s(Y)

That is, the support of an itemset never exceeds the support of its subsets. This is known as the anti-monotone property of support.

[Figure: illustrating the Apriori principle on the lattice — once an itemset is found to be infrequent, all of its supersets are pruned. (Tan, Steinbach & Kumar)]

The Apriori Algorithm — An Example

Database TDB, with minsup = 2:

TID  Items
10   A, C, D
20   B, C, E
30   A, B, C, E
40   B, E

1st scan — candidate 1-itemsets C1 are counted, and the frequent set L1 survives:
C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1: {A}:2, {B}:3, {C}:3, {E}:3   ({D} is pruned: sup = 1 < minsup)

2nd scan — C2 is generated from L1 and counted:
C2: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

3rd scan — C3 is generated from L2 and counted:
C3: {B,C,E}
L3: {B,C,E}:2
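The level-wise procedure is compact enough to sketch in full. Below is a minimal Python implementation of the Apriori method (my own illustrative code, not the original authors'), combining the self-join and subset-pruning steps described in the next section; run on the TDB example, it reproduces L1, L2, and L3 above.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Level-wise Apriori. Returns {itemset: support count} for all frequent itemsets."""
    items = sorted({i for t in transactions for i in t})
    level = {frozenset([i]) for i in items}  # C1: candidate 1-itemsets
    frequent, k = {}, 1
    while level:
        # One DB scan: count the support of every candidate at this level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        Lk = {c: n for c, n in counts.items() if n >= minsup}
        frequent.update(Lk)
        # Self-join Lk with itself to build C(k+1), pruning any candidate
        # that has an infrequent k-subset (the Apriori principle).
        candidates = set()
        for a in Lk:
            for b in Lk:
                u = a | b
                if len(u) == k + 1 and all(frozenset(s) in Lk for s in combinations(u, k)):
                    candidates.add(u)
        level = candidates
        k += 1
    return frequent

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
for itemset, n in sorted(apriori(tdb, 2).items(),
                         key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(set(itemset), n)  # ends with {'B', 'C', 'E'} 2
```

The dictionary-based counting here is the brute-force strategy described above; the "reduce the number of comparisons" strategy would replace it with a more efficient candidate structure.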
Important Details of Apriori

How are candidates generated?
- Step 1: self-join Lk with itself.
- Step 2: prune candidates that have an infrequent subset.

Example of candidate generation:
- L3 = {abc, abd, acd, ace, bcd}
- Self-joining L3 * L3 gives abcd (from abc and abd) and acde (from acd and ace).
- Pruning: acde is removed because its subset ade is not in L3.
- C4 = {abcd}

Maximal Frequent Itemset

An itemset is maximal frequent if none of its immediate supersets is frequent.

[Figure: itemset lattice over items A–E showing the border between the frequent and infrequent itemsets; the maximal frequent itemsets are those just inside the border. (Tan, Steinbach & Kumar)]

Closed Itemset

An itemset is closed if none of its immediate supersets has the same support as the itemset. For example, over these transactions:

TID  Items
1    {A,B}
2    {B,C,D}
3    {A,B,C,D}
4    {A,B,D}
5    {A,B,C,D}

the support counts are:

1-itemsets: {A}:4, {B}:5, {C}:3, {D}:4
2-itemsets: {A,B}:4, {A,C}:2, {A,D}:3, {B,C}:3, {B,D}:4, {C,D}:3
3-itemsets: {A,B,C}:2, {A,B,D}:3, {A,C,D}:2, {B,C,D}:3
4-itemset:  {A,B,C,D}:2

Here {C} (support 3) is not closed, because its immediate superset {B,C} also has support 3; {B} (support 5) is closed, because every superset has strictly lower support.

Maximal vs Closed Itemsets

Consider the transactions:

TID  Items
1    ABC
2    ABCD
3    BCE
4    ACDE
5    DE

[Figure: the itemset lattice annotated with the IDs of the transactions supporting each itemset; itemsets at the bottom of the lattice are not supported by any transaction. (Tan, Steinbach & Kumar)]

[Figure: the same lattice with minimum support = 2, marking itemsets that are closed but not maximal and itemsets that are both closed and maximal. Here # closed = 9 and # maximal = 4. (Tan, Steinbach & Kumar)]

The three collections are nested: maximal frequent itemsets ⊆ closed frequent itemsets ⊆ frequent itemsets.
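The two definitions are easy to check mechanically. The sketch below (my own illustrative code) enumerates the frequent itemsets of the five-transaction example from the closed-itemset table above and labels each one as closed and/or maximal by inspecting its immediate supersets.

```python
from itertools import combinations

transactions = [{"A", "B"}, {"B", "C", "D"}, {"A", "B", "C", "D"},
                {"A", "B", "D"}, {"A", "B", "C", "D"}]
items = sorted({i for t in transactions for i in t})
minsup = 2

def sup(itemset):
    return sum(1 for t in transactions if itemset <= t)

# Brute-force enumeration is fine for four items (2^4 itemsets).
frequent = {frozenset(c): sup(frozenset(c))
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if sup(frozenset(c)) >= minsup}

for X, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    supersets = [X | {i} for i in items if i not in X]       # immediate supersets
    maximal = not any(Y in frequent for Y in supersets)      # no frequent immediate superset
    closed = all(frequent.get(Y, 0) < s for Y in supersets)  # every superset has lower support
    labels = [name for name, flag in (("closed", closed), ("maximal", maximal)) if flag]
    print(sorted(X), s, *labels)
```

For instance, the output shows {C} with support 3 is not closed (its superset {B,C} also has support 3), while {A,B,C,D} with support 2 is both closed and maximal.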
Alternative Methods for Frequent Itemset Generation

One design choice is the representation of the database: horizontal vs. vertical data layout. The horizontal layout stores one row of items per transaction; the vertical layout stores, for each item, the list of TIDs of the transactions that contain it.

Horizontal data layout:

TID  Items
1    A,B,E
2    B,C,D
3    C,E
4    A,C,D
5    A,B,C,D
6    A,E
7    A,B
8    A,B,C
9    A,C,D
10   B

Vertical data layout:

A: 1, 4, 5, 6, 7, 8, 9
B: 1, 2, 5, 7, 8, 10
C: 2, 3, 4, 8, 9
D: 2, 4, 5, 9
E: 1, 3, 6

Frequent Pattern Growth (FP-growth)

FP-growth uses a compressed representation of the database, the FP-tree. Once the FP-tree has been constructed, a recursive divide-and-conquer approach is used to mine the frequent itemsets.

The FP-tree is built by inserting transactions one at a time along shared prefix paths. For the transaction database

TID  Items
1    {A,B}
2    {B,C,D}
3    {A,C,D,E}
4    {A,D,E}
5    {A,B,C}
6    {A,B,C,D}
7    {B,C}
8    {A,B,C}
9    {A,B,D}
10   {B,C,E}

after reading TID=1 the tree is the single path null → A:1 → B:1; after reading TID=2 a second branch null → B:1 → C:1 → D:1 is added, since the transaction shares no prefix with the first. [Figure: the FP-tree after the first two transactions. (Tan, Steinbach & Kumar)]

After all ten transactions, the tree consists of two branches rooted at A:7 and B:3, together with a header table holding one entry per item (A, B, C, D, E); each entry's pointers chain together all tree nodes for that item, and these pointers are used to assist frequent itemset generation. [Figure: the complete FP-tree with header table. (Tan, Steinbach & Kumar)]

FP-growth then mines the tree recursively. For example, to find the itemsets ending in D, the D-pointers are followed to collect the conditional pattern base for D:

P = {(A:1,B:1,C:1), (A:1,B:1), (A:1,C:1), (A:1), (B:1,C:1)}

and FP-growth is applied recursively on P. The frequent itemsets found (with sup > 1) are AD, BD, CD, ACD, and BCD.
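Here is a minimal sketch of the FP-tree construction step (my own illustrative code; the class and field names are assumptions, and the recursive mining step is omitted). Note that the worked example above inserts items in the fixed order A–E, so the sketch takes an explicit item order; classic FP-growth descriptions instead sort each transaction by descending item support. Pruning of infrequent items before insertion is also omitted, since every item in this example is frequent at minsup = 2.

```python
from collections import defaultdict

class Node:
    """One FP-tree node: an item, a count, a parent link, and children keyed by item."""
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}  # item -> child Node

def build_fp_tree(transactions, item_order):
    """Insert each transaction along a shared prefix path; return (root, header table)."""
    root = Node(None, None)
    header = defaultdict(list)  # header table: item -> chain of nodes for that item
    for t in transactions:
        node = root
        for item in sorted(t, key=item_order.index):
            if item not in node.children:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)  # extend this item's node-link chain
            node = node.children[item]
            node.count += 1
    return root, header

transactions = [{"A","B"}, {"B","C","D"}, {"A","C","D","E"}, {"A","D","E"},
                {"A","B","C"}, {"A","B","C","D"}, {"B","C"}, {"A","B","C"},
                {"A","B","D"}, {"B","C","E"}]
root, header = build_fp_tree(transactions, item_order="ABCDE")
# Counts along each item's pointer chain, e.g. A -> [7] and B -> [5, 3],
# matching the A:7, B:5 and B:3 nodes described in the text.
print({item: [n.count for n in header[item]] for item in sorted(header)})
```

The conditional pattern base for an item can then be gathered by walking each node in its header chain up to the root via the parent links.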