Association Mining
Data Mining, Spring 2012

Transactional Database
• Transaction – a row in the database
• e.g. {Eggs, Cheese, Milk}

Transactional dataset (used as the running example):
  {Eggs, Cheese, Milk, Jam}
  {Cheese, Bacon, Butter, Bread}
  {Bread, Butter, Milk}
  {Eggs, Cat food}
  {Eggs, Milk, Cheese}

Items and Itemsets
• Item – a single element: {Milk}, {Cheese}, {Bread}, etc.
• Itemset – a set of items: {Milk}, {Milk, Cheese}, {Bacon, Bread, Milk}
• An itemset does not have to appear in the dataset
• Can be of size 1 to n

The Support Measure
Support(X) = (# of transactions containing X) / (# of transactions in the database)

Support Examples
Support({Eggs}) = 3/5 = 60%
Support({Eggs, Milk}) = 2/5 = 40%

Minimum Support
Minsup – the minimum support threshold for an itemset to be considered frequent (user defined).
Frequent itemset – an itemset in a database whose support is greater than or equal to minsup.
Support(X) >= minsup → frequent
Support(X) < minsup → infrequent

Minimum Support Examples
Minimum support = 50%
Support({Eggs}) = 3/5 = 60% → Pass
Support({Eggs, Milk}) = 2/5 = 40% → Fail

Association Rules
An association rule X => Y says that when the itemset X appears in a transaction, the itemset Y tends to appear as well. Its confidence is:
Confidence(X => Y) = sup(X U Y) / sup(X)

Confidence Example 1
{Eggs} => {Bread}
Confidence = sup({Eggs, Bread}) / sup({Eggs})
Confidence = (0/5) / (3/5) = 0%  (Eggs and Bread never occur together in the dataset above)

Confidence Example 2
{Milk} => {Eggs, Cheese}
Confidence = sup({Milk, Eggs, Cheese}) / sup({Milk})
Confidence = (2/5) / (3/5) ≈ 67%

Strong Association Rules
Minimum confidence (minconf) – a user-defined minimum bound on confidence.
Strong association rule – a rule X => Y whose confidence is at least minconf; this is a potentially interesting rule for the user.
Conf(X => Y) >= minconf → strong
Conf(X => Y) < minconf → uninteresting

Minimum Confidence Example
Minconf = 50%
{Eggs} => {Bread}: confidence = 0% → Fail
{Milk} => {Eggs, Cheese}: confidence ≈ 67% → Pass

Association Mining
Association mining finds the strong rules contained in a dataset from its frequent itemsets. It can be divided into two major subtasks:
1. Finding frequent itemsets
2. Rule generation

Transactional Database Revisited
• Some algorithms change items into letters or numbers
• Numbers are more compact
• Easier to make comparisons

Transactional dataset (the same five transactions, items encoded as integers):
  {1, 2, 3, 5}
  {2, 7, 6, 8}
  {8, 6, 3}
  {1, 4}
  {1, 3, 2}

Basic Set Logic
Subset – an itemset X is contained in an itemset Y.
Superset – an itemset Y contains an itemset X.
Example: X = {1,2}, Y = {1,2,3,5}, so Y is a superset of X.

Apriori
Arranges the database into a temporary lattice structure to find associations.
Apriori principle:
1. Itemsets in the lattice with support < minsup will only produce supersets with support < minsup.
2. The subsets of frequent itemsets are always frequent.
Apriori prunes the lattice structure of non-frequent itemsets using minsup. This reduces the number of comparisons and the number of candidate itemsets.

Monotonicity
Monotone (upward closed) – a measure f is monotone if X being a subset of Y implies f(X) cannot exceed f(Y).
Anti-monotone (downward closed) – a measure f is anti-monotone if X being a subset of Y implies f(Y) cannot exceed f(X).
Support is anti-monotone: if X is a subset of Y, then support(Y) cannot exceed support(X). Apriori uses this property to prune the lattice structure.
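The support and confidence measures defined above are simple to compute directly. Below is a minimal Python sketch, assuming the five-transaction dataset shown above; the function names support and confidence are illustrative, not from any particular library.

    DATASET = [
        {"Eggs", "Cheese", "Milk", "Jam"},
        {"Cheese", "Bacon", "Butter", "Bread"},
        {"Bread", "Butter", "Milk"},
        {"Eggs", "Cat food"},
        {"Eggs", "Milk", "Cheese"},
    ]

    def support(itemset, transactions):
        # Fraction of transactions that contain every item of `itemset`.
        itemset = set(itemset)
        return sum(1 for t in transactions if itemset <= t) / len(transactions)

    def confidence(x, y, transactions):
        # conf(X => Y) = sup(X U Y) / sup(X)
        return support(set(x) | set(y), transactions) / support(x, transactions)

    print(support({"Eggs"}, DATASET))                         # 0.6
    print(support({"Eggs", "Milk"}, DATASET))                 # 0.4
    print(confidence({"Milk"}, {"Eggs", "Cheese"}, DATASET))  # 0.666...

An itemset X is then frequent when support(X, ...) >= minsup, and a rule X => Y is strong when confidence(X, Y, ...) >= minconf.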
Itemset Lattice
[Figure: the lattice of all itemsets that can be formed from the items in the database.]

Lattice Pruning

Lattice Example
[Figures: an itemset lattice being pruned level by level; minsup = 30%.]
1. Count the occurrences of each 1-itemset in the database and compute its support:
   Support = #occurrences / #rows in db
   Prune anything with support less than minsup = 30%.
2. Count the occurrences of each remaining 2-itemset in the database and compute its support. Prune anything less than minsup = 30%.
3. Count the occurrences of the last remaining 3-itemset in the database and compute its support. Prune anything less than minsup = 30%.

Example – Results
Frequent itemsets: {1}, {2}, {3}, {1,2}, {1,3}, {2,3}, {1,2,3}

Apriori Algorithm

Frequent Itemset Generation (1-itemsets)
Transactional database (used for the rest of this example):
  {1, 2, 3, 5}
  {2, 3, 5}
  {1, 3, 5}
  {1, 4, 5}
Minsup = 70%
1. Generate all 1-itemsets
2. Calculate the support for each itemset
3. Determine whether or not each itemset is frequent

  Itemset | Support | Frequent
  {1}     | 75%     | Yes
  {2}     | 50%     | No
  {3}     | 75%     | Yes
  {4}     | 25%     | No
  {5}     | 100%    | Yes

Frequent Itemset Generation (2-itemsets)
Generate all 2-itemsets from the frequent 1-itemsets, minsup = 70%:
{1} U {3} = {1,3}, {1} U {5} = {1,5}, {3} U {5} = {3,5}

  Itemset | Support | Frequent
  {1,3}   | 50%     | No
  {1,5}   | 75%     | Yes
  {3,5}   | 75%     | Yes

Frequent Itemset Generation (3-itemsets)
Generate all 3-itemsets, minsup = 70%:
{1,5} U {3,5} = {1,3,5}

  Itemset | Support | Frequent
  {1,3,5} | 50%     | No

(By the Apriori principle, {1,3,5} could have been pruned without counting: its subset {1,3} is already infrequent.)

Frequent Itemset Results
All frequent itemsets generated are output:
{1}, {3}, {5}
{1,5}, {3,5}

Apriori Rule Mining
Rule combinations:
1. From a 2-itemset {1,2}: {1}=>{2}, {2}=>{1}
2. From a 3-itemset {1,2,3}: {1}=>{2,3}, {2,3}=>{1}, {1,2}=>{3}, {3}=>{1,2}, {1,3}=>{2}, {2}=>{1,3}

Strong Rule Generation (rules among {1}, {3}, {5})
1. I = {{1}, {3}, {5}}
2. Rules = X => Y
3. Minconf = 80%

  Rule     | Confidence | Strong
  {1}=>{3} | 67%        | No
  {3}=>{1} | 67%        | No
  {1}=>{5} | 100%       | Yes
  {5}=>{1} | 75%        | No
  {3}=>{5} | 100%       | Yes
  {5}=>{3} | 75%        | No

Strong Rule Generation (rules from the 3-itemset {2,3,5})
Minconf = 80%

  Rule       | Confidence | Strong
  {2}=>{3,5} | 100%       | Yes
  {3,5}=>{2} | 67%        | No
  {2,3}=>{5} | 100%       | Yes
  {5}=>{2,3} | 50%        | No
  {2,5}=>{3} | 100%       | Yes
  {3}=>{2,5} | 67%        | No

Strong Rules Results
All strong rules generated are output:
{1}=>{5}
{3}=>{5}
{2}=>{3,5}
{2,3}=>{5}
{2,5}=>{3}

Other Frequent Itemsets
Closed frequent itemset – a frequent itemset X that has no immediate superset with the same support count as X.
Maximal frequent itemset – a frequent itemset none of whose immediate supersets are frequent.

Itemset Relationships
Maximal frequent itemsets are a subset of closed frequent itemsets, which are a subset of all frequent itemsets.

Targeted Association Mining
* Users may only be interested in specific results
* Potential to get smaller, faster, and more focused results
* Examples:
1. A user wants to know how often only bread and garlic cloves occur together.
2. A user wants to know what items occur with toilet paper.

Itemset Trees
* Itemset tree – a data structure which aids users in querying for a specific itemset and its support.
* Items within a transaction are mapped to integer values and ordered such that each transaction is in lexical order.
  {Bread, Onion, Garlic} = {1, 2, 3}
* Why use numbers?
  - They make the tree more compact
  - Numbers follow the ordering easily

Itemset Trees (structure)
An itemset tree T contains:
* A root pair (I, f(I)), where I is an itemset and f(I) is its count.
* A (possibly empty) set {T1, T2, ..., Tk}, each element of which is an itemset tree.
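To make the level-wise procedure concrete, here is a short Python sketch of Apriori frequent-itemset generation, assuming the four-transaction database used in the example above; apriori is an illustrative name, and the join/prune steps follow the Apriori principle described earlier. It reproduces the frequent itemsets listed in the results above.

    from itertools import combinations

    DB = [{1, 2, 3, 5}, {2, 3, 5}, {1, 3, 5}, {1, 4, 5}]
    MINSUP = 0.70

    def support(itemset, db):
        return sum(1 for t in db if itemset <= t) / len(db)

    def apriori(db, minsup):
        # Level 1: frequent 1-itemsets.
        items = sorted({i for t in db for i in t})
        level = [frozenset([i]) for i in items
                 if support(frozenset([i]), db) >= minsup]
        frequent = list(level)
        k = 2
        while level:
            prev = set(level)
            # Join: union pairs of frequent (k-1)-itemsets into k-item candidates.
            candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == k}
            # Prune (Apriori principle): every (k-1)-subset must itself be frequent.
            candidates = {c for c in candidates
                          if all(frozenset(s) in prev for s in combinations(c, k - 1))}
            level = [c for c in candidates if support(c, db) >= minsup]
            frequent.extend(level)
            k += 1
        return frequent

    print(sorted(sorted(f) for f in apriori(DB, MINSUP)))
    # [[1], [1, 5], [3], [3, 5], [5]]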
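A matching sketch of the rule-generation step: for each itemset, every split into an antecedent X and consequent Y is checked against minconf. strong_rules is an illustrative name, and the itemset {2,3,5} is included only to mirror the second example above (at the 70% minsup used earlier it would not be frequent).

    from itertools import combinations

    DB = [{1, 2, 3, 5}, {2, 3, 5}, {1, 3, 5}, {1, 4, 5}]
    MINCONF = 0.80

    def support(itemset, db):
        return sum(1 for t in db if itemset <= t) / len(db)

    def strong_rules(itemset, db, minconf):
        # Emit every rule X => Y with X U Y = itemset and confidence >= minconf.
        rules = []
        for r in range(1, len(itemset)):               # size of the antecedent X
            for lhs in combinations(sorted(itemset), r):
                x = frozenset(lhs)
                y = frozenset(itemset) - x
                conf = support(x | y, db) / support(x, db)
                if conf >= minconf:
                    rules.append((sorted(x), sorted(y), conf))
        return rules

    for s in [{1, 5}, {3, 5}, {2, 3, 5}]:
        for x, y, conf in strong_rules(s, DB, MINCONF):
            print(f"{x} => {y} (conf = {conf:.0%})")
    # [1] => [5] (conf = 100%)
    # [3] => [5] (conf = 100%)
    # [2] => [3, 5] (conf = 100%)
    # [2, 3] => [5] (conf = 100%)
    # [2, 5] => [3] (conf = 100%)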
* If an itemset Ij is contained in the root's itemset I, it is also contained in each of the root's children.
* If Ij is not contained in the root's itemset I, it may still be contained in the root's children if:
  first_item(I) <= first_item(Ij) and last_item(I) < last_item(Ij)

Building an Itemset Tree
Let ci be a node in the itemset tree, and let I be a transaction from the dataset.
Loop over the nodes, starting at the root:
Case 1: ci = I
  - increment the count of ci.
Case 2: ci is a child of I (I is a lexical prefix of ci)
  - make I the parent node of ci.
Case 3: ci and I contain a common lexical overlap, e.g. {1,2,4} vs. {1,2,6}
  - make a node for the overlap ({1,2} here) and make I and ci its children.
Case 4: ci is a parent of I (ci is a lexical prefix of I)
  - loop to check ci's children; if no case applies there, make I a child of ci.
Note: {2,6} and {1,2,6} do NOT have a lexical overlap – an overlap must be a common prefix.

Itemset Trees – Creation
Dataset:
  {2, 4}
  {1, 2, 3, 5}
  {1, 2, 6}
  {2}
  {2, 9}
  {3, 6}
[Figures: the tree after each insertion.]
• {2,4} becomes the first child of the (empty) root.
• {1,2,3,5} has no lexical overlap with {2,4}: child node of the root.
• {1,2,6} has the lexical overlap {1,2} with {1,2,3,5}: a node {1,2} is created, with {1,2,3,5} and {1,2,6} as its children.
• {2} is a lexical prefix of {2,4}: parent node – {2} becomes the parent of {2,4}.
• {2,9} descends through {2}: child node of {2}.
• {3,6} has no overlap with any child: child node of the root.
The finished tree contains 8 nodes, including the root.

Itemset Trees – Querying
Let I be the query itemset, let ci be a node in the tree, and let totalSup be the total count for I in the tree.
For all nodes ci such that first_item(ci) <= first_item(I):
Case 1: if I is contained in ci
  - add f(ci) to totalSup (every transaction counted in ci's subtree contains I).
Case 2: if I is not contained in ci and last_item(ci) < last_item(I)
  - proceed down ci's subtree.

Querying Example 1
Query: {2}, totalSup = 0
• Node {2}: 2 = 2, so {2} is contained – add its count: totalSup = 3.
• Node {1,2}: {1,2} contains 2 – add its count: totalSup = 3 + 2 = 5.
• Node {3,6}: 3 > 2, end of the subtrees to check. Return totalSup = 5.

Querying Example 2
Query: {2,9}, totalSup = 0
• Node {2}: 2 <= 2 and 2 < 9 – continue down.
• Node {2,4}: 4 < 9, but {2,4} doesn't contain {2,9} – go to the next sibling.
• Node {2,9}: {2,9} = {2,9} – add to support: totalSup = 1.
• Node {1,2}: 1 <= 2 and 2 < 9 – continue down.
• Node {1,2,3,5}: 5 < 9, but {1,2,3,5} doesn't contain {2,9} – go to the next sibling.
• Node {1,2,6}: 6 < 9, but {1,2,6} doesn't contain {2,9} – go to the next node.
• Node {3,6}: 3 > 2 – fail. End of tree: totalSup = 1.
Nodes in the tree: 8.
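The following compact Python sketch implements both building and querying, assuming the reconstructed six-transaction dataset above. One modelling choice here (an assumption, chosen to match how the querying examples add whole-subtree totals at {2} and {1,2}): each node's count aggregates all transactions in its subtree that contain the node's itemset, so a query adds a node's count and skips its subtree whenever the query is contained in that node. Node, insert, and query are illustrative names.

    from dataclasses import dataclass, field

    @dataclass
    class Node:
        itemset: tuple                      # items in lexical order
        count: int = 0                      # transactions in this subtree containing `itemset` (assumption: aggregated)
        children: list = field(default_factory=list)

    def common_prefix(a, b):
        p = []
        for x, y in zip(a, b):
            if x != y:
                break
            p.append(x)
        return tuple(p)

    def insert(root, t):
        # Insert transaction t (a lexically ordered tuple); root's itemset
        # is assumed to be a (possibly empty) prefix of t.
        root.count += 1                                  # t contains root's itemset
        for child in list(root.children):
            c = child.itemset
            p = common_prefix(t, c)
            if t == c:                                   # Case 1: identical itemsets
                child.count += 1
                return
            if p == t:                                   # Case 2: t is a prefix of the child
                root.children.remove(child)              #   t becomes the child's parent
                root.children.append(Node(t, child.count + 1, [child]))
                return
            if p == c:                                   # Case 4: child is a prefix of t
                insert(child, t)
                return
            if len(p) > len(root.itemset):               # Case 3: proper lexical overlap
                root.children.remove(child)              #   new internal node for the overlap
                root.children.append(Node(p, child.count + 1, [child, Node(t, 1)]))
                return
        root.children.append(Node(t, 1))                 # no overlap with any child

    def query(node, q):
        # Total number of transactions containing the itemset q.
        c = node.itemset
        if c and set(q) <= set(c):      # q contained: the whole subtree counts
            return node.count
        if c and c[0] > q[0]:           # descendants all start with c[0] > first_item(q): prune
            return 0
        if not c or c[-1] < q[-1]:      # q may still be completed deeper in the tree
            return sum(query(ch, q) for ch in node.children)
        return 0

    # Build the tree for the example dataset and run both example queries.
    root = Node(())
    for t in [(2, 4), (1, 2, 3, 5), (1, 2, 6), (2,), (2, 9), (3, 6)]:
        insert(root, t)
    print(query(root, (2,)))     # 5
    print(query(root, (2, 9)))   # 1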