Data Mining: Mining Frequent Patterns, Association and Correlations

Mining Frequent Patterns, Association and Correlations
- Basic concepts and a road map
- Efficient and scalable frequent itemset mining methods
- Mining various kinds of association rules
- From association mining to correlation analysis
- Summary

What Is Frequent Pattern Analysis?
- Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
- First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining
- Motivation: finding inherent regularities in data
  - What products were often purchased together? Beer and diapers?!
  - What are the subsequent purchases after buying a PC?
  - What kinds of DNA are sensitive to this new drug?
  - Can we automatically classify web documents?
- All data are assumed to be categorical; there is no good algorithm for numeric data

Why Is Frequent Pattern Mining Important?
- Discloses an intrinsic and important property of data sets
- Forms the foundation for many essential data mining tasks:
  - Association, correlation, and causality analysis
  - Sequential and structural (e.g., sub-graph) patterns
  - Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
  - Classification: associative classification
  - Cluster analysis: frequent-pattern-based clustering
  - Data warehousing: cube gradients
- Broad applications

Association Rules
- Example rule: customers who buy a PC also buy antivirus software
  PC => antivirus-sw [support = 2%, confidence = 60%]
- Support: 2% of the customers bought a PC and antivirus software together
- Confidence: 60% of the customers who bought a PC also bought antivirus software
- A rule needs to satisfy both minimum support and minimum confidence

Definitions
- Let I be the set of all items, e.g., I = {bread, milk, beer, chocolate, egg, diaper}
- Transaction: a set of items, e.g., T1 = {milk, egg}
- Itemset: a set of items
  - k-itemset: an itemset with k items
- Frequent itemset: an itemset that satisfies a frequency threshold called the minimum support count

Transaction Data: A Set of Documents
- A text document data set, where each document is treated as a "bag" of keywords:

  doc1: Student, Teach, School
  doc2: Student, School
  doc3: Teach, School, City, Game
  doc4: Baseball, Basketball
  doc5: Basketball, Player, Spectator
  doc6: Baseball, Coach, Game, Team
  doc7: Basketball, Team, City, Game

Association Rules
- Association rule: an implication A => B, where A ⊂ I, B ⊂ I, and A ∩ B = ∅
- A rule holds with some support s and confidence c:
  - support(A => B): the percentage of transactions that contain A ∪ B, i.e., P(A ∪ B)
  - confidence(A => B): the conditional probability that a transaction having A also contains B, i.e., P(B|A) = freq(A ∪ B) / freq(A)
- Minimum support and minimum confidence are prespecified thresholds

Basic Concepts: Frequent Patterns and Association Rules

  Transaction-id | Items bought
  1              | A, B, D
  2              | A, C, D
  3              | A, D, E
  4              | B, E, F
  5              | B, C, D, E, F

- Itemset: I = {i1, ..., ik}
- Goal: find all rules X => Y with minimum support and confidence
  - Support s: the probability that a transaction contains X ∪ Y
  - Confidence c: the conditional probability that a transaction having X also contains Y
- Let sup_min = 50% and conf_min = 50%. Then:
  - Frequent itemsets: {A:3, B:3, D:4, E:3, AD:3}
  - Association rules: A => D (60%, 100%) and D => A (60%, 75%)
  (a small computational check of these numbers follows)
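These numbers are easy to verify mechanically. Below is a minimal Python sketch (illustrative helper names, not part of the original slides) that computes support and confidence on the five-transaction database above.

```python
# Minimal support/confidence check for the example database above.
transactions = [
    {"A", "B", "D"},
    {"A", "C", "D"},
    {"A", "D", "E"},
    {"B", "E", "F"},
    {"B", "C", "D", "E", "F"},
]

def support(itemset, db):
    # Fraction of transactions that contain every item in `itemset`.
    return sum(itemset <= t for t in db) / len(db)

def confidence(A, B, db):
    # P(B | A) = support(A U B) / support(A)
    return support(A | B, db) / support(A, db)

print(support({"A", "D"}, transactions))       # 0.6  -> A => D has support 60%
print(confidence({"A"}, {"D"}, transactions))  # 1.0  -> A => D has confidence 100%
print(confidence({"D"}, {"A"}, transactions))  # 0.75 -> D => A has confidence 75%
```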
Association Rule Mining
- Goal: find all rules that satisfy the user-specified minimum support and minimum confidence
- Association rule mining can be viewed as two steps:
  1. Find all frequent itemsets: itemsets that occur at least as frequently as a prespecified minimum support count min_sup
  2. Generate strong association rules: rules that satisfy minimum support and minimum confidence
- The first step is the more costly of the two

Closed Patterns
- Huge number of itemsets: if an itemset is frequent, then all of its subsets are frequent
- A long pattern contains a combinatorial number of sub-patterns; e.g., {a1, ..., a100} contains C(100,1) + C(100,2) + ... + C(100,100) = 2^100 - 1 ≈ 1.27 × 10^30 sub-patterns
- Too huge to compute or store
- Solution: mine closed patterns instead

Closed Patterns
- An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X (proposed by Pasquier, et al. @ ICDT'99)
- i.e., use the biggest possible itemset for a given support level
- Closed patterns are a lossless compression of frequent patterns
- Reduces the number of patterns and rules

Closed Patterns
- Exercise: DB = {<a1, ..., a100>, <a1, ..., a50>}, min_sup = 1
- What is the set of closed itemsets?
  - <a1, ..., a100>: 1
  - <a1, ..., a50>: 2
- What is the set of all patterns? All 2^100 - 1 of them!

Mining Frequent Patterns, Association and Correlations
- Basic concepts and a road map
- Efficient and scalable frequent itemset mining methods
- Mining various kinds of association rules
- From association mining to correlation analysis
- Summary

Scalable Methods for Mining Frequent Patterns
- The downward closure property of frequent patterns: any subset of a frequent itemset must be frequent
  - If {beer, diaper, nuts} is frequent, so is {beer, diaper}
  - i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}
- Scalable mining methods: several approaches
  - Apriori (Agrawal & Srikant @ VLDB'94)
  - Frequent pattern growth (FP-growth: Han, Pei & Yin @ SIGMOD'00)
- Any algorithm should find the same set of rules, although their computational efficiencies and memory requirements may differ

Apriori: A Candidate Generation-and-Test Approach
- Apriori pruning principle: if any itemset is infrequent, its supersets should not be generated or tested! (Agrawal & Srikant @ VLDB'94; Mannila, et al. @ KDD'94)
- Method:
  - Initially, scan the DB once to get the frequent 1-itemsets
  - Generate length-(k+1) candidate itemsets from length-k frequent itemsets
  - Test the candidates against the DB
  - Terminate when no frequent or candidate set can be generated

The Apriori Algorithm: An Example (sup_min = 2)

  Database TDB:
  Tid | Items
  10  | A, C, D
  20  | B, C, E
  30  | A, B, C, E
  40  | B, E

  1st scan: C1 = {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
            L1 = {A}:2, {B}:3, {C}:3, {E}:3
  C2 from L1: {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
  2nd scan: C2 counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
            L2 = {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2
  C3 from L2: {B,C,E}
  3rd scan: L3 = {B,C,E}:2

The Apriori Algorithm
- Pseudo-code (Ck: candidate itemsets of size k; Lk: frequent itemsets of size k):

  L1 = {frequent items};
  for (k = 1; Lk != ∅; k++) do begin
      Ck+1 = candidates generated from Lk;
      for each transaction t in database do
          increment the count of all candidates in Ck+1 that are contained in t;
      Lk+1 = candidates in Ck+1 with min_support;
  end
  return ∪k Lk;

Important Details of Apriori
- How to generate candidates?
  - Step 1: self-joining Lk
  - Step 2: pruning
- How to count supports of candidates?
  (a runnable sketch of both steps follows)
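To make the generate-and-test loop concrete, here is a small runnable Python sketch of Apriori. It is an illustration rather than the published algorithm: the self-join is simplified to "union any two frequent k-itemsets that form a (k+1)-set" instead of the lexicographic join on the first k-1 items, and all names (`apriori`, `tdb`) are ours.

```python
from itertools import combinations

def apriori(db, min_count):
    # db: list of transactions (sets of items); min_count: minimum support count.
    counts = {}
    for t in db:                                  # one scan for the 1-itemsets
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    Lk = {frozenset([i]) for i, c in counts.items() if c >= min_count}
    frequent = {s: counts[next(iter(s))] for s in Lk}   # itemset -> count
    k = 1
    while Lk:
        # Step 1 (self-join): combine frequent k-itemsets into (k+1)-candidates.
        Ck = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # Step 2 (prune): drop candidates that have an infrequent k-subset.
        Ck = {c for c in Ck
              if all(frozenset(s) in Lk for s in combinations(c, k))}
        # One DB scan: count the support of the surviving candidates.
        cand = dict.fromkeys(Ck, 0)
        for t in db:
            for c in Ck:
                if c <= t:
                    cand[c] += 1
        Lk = {c for c, n in cand.items() if n >= min_count}
        frequent.update({c: cand[c] for c in Lk})
        k += 1
    return frequent

# The example TDB above, with minimum support count 2:
tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(apriori(tdb, 2))   # includes frozenset({'B', 'C', 'E'}): 2, i.e. L3
```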
Example of Candidate Generation
- L3 = {abc, abd, acd, ace, bcd}
- Self-joining L3 * L3 yields:
  - abcd, from abc and abd
  - acde, from acd and ace
- Pruning:
  - acde is removed, because its subset ade is not in L3
- C4 = {abcd}

Step 2: Generating Rules from Frequent Itemsets
- Frequent itemsets are not yet association rules; one more step is needed to generate the rules
- For each frequent itemset X and each proper nonempty subset A of X, let B = X - A; then A => B is an association rule if confidence(A => B) ≥ min_conf, where
  - support(A => B) = support(A ∪ B) = support(X)
  - confidence(A => B) = support(A => B) / support(A)

Generating Rules: An Example
- Suppose {2,3,4} is frequent, with support = 50%
- Its proper nonempty subsets are {2,3}, {2,4}, {3,4}, {2}, {3}, {4}, with supports 50%, 50%, 75%, 75%, 75%, and 75% respectively
- These generate the following association rules:
  - 2,3 => 4, confidence = 100%
  - 2,4 => 3, confidence = 100%
  - 3,4 => 2, confidence = 67%
  - 2 => 3,4, confidence = 67%
  - 3 => 2,4, confidence = 67%
  - 4 => 2,3, confidence = 67%
- All of the rules have support = 50%, since support(A => B) = support(A ∪ B) = support(X)

Generating Rules: Summary
- To obtain A => B, we only need support(A => B) and support(A)
- All the information required for the confidence computation was already recorded during itemset generation; there is no need to read the data T again
- This step is not as time-consuming as frequent itemset generation
  (a sketch of this step follows)
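Rule generation needs no further database scan: every subset of a frequent itemset is itself frequent, so its support count is already available. A minimal sketch, assuming the `frequent` dictionary returned by the `apriori` sketch above:

```python
from itertools import combinations

def gen_rules(frequent, min_conf):
    # frequent: dict mapping frozenset -> support count.
    rules = []
    for X, sup_X in frequent.items():
        for r in range(1, len(X)):                  # proper nonempty subsets A
            for A in map(frozenset, combinations(X, r)):
                conf = sup_X / frequent[A]          # support(X) / support(A)
                if conf >= min_conf:
                    # Min support holds automatically: support(A => B) = support(X).
                    rules.append((set(A), set(X - A), conf))
    return rules

# e.g., gen_rules(apriori(tdb, 2), 0.5) yields rules such as ({'B'}, {'E'}, 1.0)
```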
Bottleneck of the Apriori Algorithm
- Multiple database scans are costly
- Mining long patterns needs many passes of scanning and generates lots of candidates
  - To find the frequent itemset i1 i2 ... i100, the number of scans is 100
- Bottleneck: candidate generation and test
- Can we avoid candidate generation?

Mining Frequent Patterns Without Candidate Generation
- Grow long patterns from short ones using local frequent items:
  - "abc" is a frequent pattern
  - Get all transactions having "abc": DB|abc
  - "d" is a local frequent item in DB|abc, so abcd is a frequent pattern

Construct FP-tree from a Transaction Database (min_support = 3)

  TID | Items bought             | (Ordered) frequent items
  100 | {f, a, c, d, g, i, m, p} | {f, c, a, m, p}
  200 | {a, b, c, f, l, m, o}    | {f, c, a, b, m}
  300 | {b, f, h, j, o, w}       | {f, b}
  400 | {b, c, k, s, p}          | {c, b, p}
  500 | {a, f, c, e, l, p, m, n} | {f, c, a, m, p}

- Steps:
  1. Scan the DB once and find the frequent 1-itemsets (single-item patterns); the header table is f:4, c:4, a:3, b:3, m:3, p:3
  2. Sort the frequent items in descending frequency order into the f-list: f-c-a-b-m-p
  3. Scan the DB again and construct the FP-tree by inserting each transaction's ordered frequent items:
     - After TID 100: {} - f:1 - c:1 - a:1 - m:1 - p:1
     - After TID 200: the shared prefix becomes f:2 - c:2 - a:2, which forks into m:1 - p:1 and b:1 - m:1
     - After all five transactions:

       {}
       |- f:4
       |  |- c:3
       |  |  '- a:3
       |  |     |- m:2
       |  |     |  '- p:2
       |  |     '- b:1
       |  |        '- m:1
       |  '- b:1
       '- c:1
          '- b:1
             '- p:1

  Each header-table entry keeps node-links to all tree nodes carrying its item.

Find Patterns Having p From p's Conditional Pattern Base
- Start at the frequent-item header table of the FP-tree
- Traverse the FP-tree by following the node-links of each frequent item p
- Accumulate all of the transformed prefix paths of item p to form p's conditional pattern base:

  item | conditional pattern base
  c    | f:3
  a    | fc:3
  b    | fca:1, f:1, c:1
  m    | fca:2, fcab:1
  p    | fcam:2, cb:1

Example: Conditional Patterns Having 'p'
- Start at the bottom of the header table and find all paths to 'p':
  - (f:4, c:3, a:3, m:2, p:2) and (c:1, b:1, p:1)
- Adjust the counts along each path to p's count on that path:
  - (f:2, c:2, a:2, m:2, p:2) and (c:1, b:1, p:1)
- Take the prefix paths (remove 'p'):
  - (f:2, c:2, a:2, m:2) and (c:1, b:1)

From Conditional Pattern Bases to Conditional FP-trees
- For each pattern base:
  - Accumulate the count for each item in the base
  - Construct the FP-tree over the frequent items of the pattern base
- Example (min_support = 3): m's conditional pattern base is fca:2, fcab:1, which gives the m-conditional FP-tree {} - f:3 - c:3 - a:3 (b is dropped, since its count of 1 is below min_support)
- All frequent patterns relating to m: m, fm, cm, am, fcm, fam, cam, fcam

Another Example (min_sup = 2)

  Transaction | Itemset    | Sorted itemset
  T1          | a, b, e    | b, a, e
  T2          | b, d       | b, d
  T3          | b, c       | b, c
  T4          | a, b, d    | b, a, d
  T5          | a, c       | a, c
  T6          | b, c       | b, c
  T7          | a, c       | a, c
  T8          | a, b, c, e | b, a, c, e
  T9          | a, b, c    | b, a, c

  Frequent items: b:7, a:6, c:6, d:2, e:2

Example (cont.)

  Item | Cond. pattern base     | Cond. FP-tree     | Frequent patterns
  e    | {b,a:1}, {b,a,c:1}     | <b:2, a:2>        | {b,e:2}, {a,e:2}, {a,b,e:2}
  d    | {b,a:1}, {b:1}         | <b:2>             | {b,d:2}
  c    | {b,a:2}, {b:2}, {a:2}  | <b:4, a:2>, <a:2> | {b,c:4}, {a,c:4}, {a,b,c:2}
  a    | {b:4}                  | <b:4>             | {b,a:4}

Benefits of the FP-tree Structure
- Completeness:
  - Preserves complete information for frequent pattern mining
  - Never breaks a long pattern of any transaction
- Compactness:
  - Reduces irrelevant information: infrequent items are gone
  - Items appear in descending frequency order: the more frequently an item occurs, the more likely its nodes are to be shared
  - The tree is never larger than the original database (not counting node-links and the count fields)

Partition Patterns and Databases
- Frequent patterns can be partitioned into subsets according to the f-list (f-c-a-b-m-p):
  - Patterns containing p
  - Patterns having m but no p
  - ...
  - Patterns having c but none of a, b, m, p
  - The pattern f
- This partitioning gives completeness and non-redundancy

Scaling FP-growth by DB Projection
- What if the FP-tree cannot fit in memory? Use DB projection:
  - First partition the database into a set of projected DBs
  - Then construct and mine an FP-tree for each projected DB
- Parallel projection vs. partition projection: parallel projection is space-costly
  (a compact runnable sketch of FP-tree construction and FP-growth mining follows)
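Below is a compact runnable FP-growth sketch in Python; it is an illustration with our own structure and names, not the published implementation. Two scans build the FP-tree together with a header table of node-links, and mining recurses on conditional pattern bases as described in the slides above.

```python
from collections import defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 0
        self.children = {}

def build_tree(db, min_count):
    # db: list of (itemset, count) pairs -> (root, header table of node-links).
    freq = defaultdict(int)
    for items, cnt in db:                       # scan 1: global item counts
        for i in items:
            freq[i] += cnt
    freq = {i: c for i, c in freq.items() if c >= min_count}
    rank = {i: r for r, i in enumerate(sorted(freq, key=freq.get, reverse=True))}
    root, header = Node(None, None), defaultdict(list)
    for items, cnt in db:                       # scan 2: insert ordered paths
        node = root
        for i in sorted((i for i in items if i in freq), key=rank.get):
            if i not in node.children:
                node.children[i] = Node(i, node)
                header[i].append(node.children[i])   # maintain the node-links
            node = node.children[i]
            node.count += cnt
    return root, header

def fp_growth(db, min_count, suffix=()):
    # Yields (pattern, support count) for every frequent pattern in db.
    _, header = build_tree(db, min_count)
    for item, nodes in header.items():
        support = sum(n.count for n in nodes)
        pattern = (item,) + suffix
        yield pattern, support
        cond_db = []                            # item's conditional pattern base
        for n in nodes:
            path, p = [], n.parent
            while p.item is not None:           # walk up to collect the prefix path
                path.append(p.item)
                p = p.parent
            if path:
                cond_db.append((path, n.count))
        yield from fp_growth(cond_db, min_count, pattern)   # conditional FP-tree

# The nine-transaction example above, with min_sup = 2:
db = [(t, 1) for t in ("abe", "bd", "bc", "abd", "ac", "bc", "ac", "abce", "abc")]
patterns = {frozenset(p): s for p, s in fp_growth(db, 2)}
print(patterns[frozenset("abc")])   # 2, matching {a,b,c:2} in the table above
print(patterns[frozenset("bd")])    # 2, matching {b,d:2}
```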
Why Is FP-Growth the Winner?
- Divide-and-conquer:
  - It decomposes both the mining task and the DB according to the frequent patterns obtained so far
  - This leads to focused searches of smaller databases
- Other factors:
  - No candidate generation, no candidate test
  - Compressed database: the FP-tree structure
  - No repeated scans of the entire database
  - The basic operations are counting local frequent items and building sub-FP-trees; there is no pattern search and matching

Mining Frequent Patterns, Association and Correlations
- Basic concepts and a road map
- Efficient and scalable frequent itemset mining methods
- Mining various kinds of association rules
- From association mining to correlation analysis
- Summary

Mining Various Kinds of Association Rules
- Mining multilevel associations
- Mining multidimensional associations
- Mining quantitative associations
- Mining interesting correlation patterns

Mining Multiple-Level Association Rules
- Items often form hierarchies, and items at the lower levels are expected to have lower support
  - e.g., it is difficult to find strong associations for "Logitech wireless laser mouse M505" & "Norton antivirus 2015" at that level of detail
- Flexible support settings: use multiple min_sup values
  - Uniform support: Level 1 min_sup = 5%, Level 2 min_sup = 5%
  - Reduced support: Level 1 min_sup = 5%, Level 2 min_sup = 3%
  - Example: Milk [support = 10%], 2% Milk [support = 6%], Skim Milk [support = 4%]; with uniform 5% support, Skim Milk is missed, while with the reduced level-2 support it is found

Problem With Uniform Support
- If the frequencies of items vary a great deal, we encounter two problems:
  - If min_sup is set too high, rules that involve rare items will not be found
  - To find rules that involve both frequent and rare items, min_sup has to be set very low; this causes a combinatorial explosion, because the frequent items become associated with one another in all possible ways

Mining Multi-Dimensional Associations
- Single-dimensional rules: buys(X, "milk") => buys(X, "bread")
- Multi-dimensional rules: two or more dimensions or predicates
  - Inter-dimension association rules (no repeated predicates): age(X, "19-25") ∧ occupation(X, "student") => buys(X, "coke")
  - Hybrid-dimension association rules (repeated predicates): age(X, "19-25") ∧ buys(X, "popcorn") => buys(X, "coke")
- Categorical attributes: a finite number of possible values, no ordering among values (occupation, brand, color)
- Quantitative attributes: numeric, with an implicit ordering among values; handled by discretization, clustering, and gradient approaches

Mining Quantitative Associations
- Techniques can be categorized by how numerical attributes, such as age or salary, are treated:
  1. Static discretization, based on predefined concept hierarchies (data cube methods)
  2. Dynamic discretization, based on the data distribution (quantitative rules): intervals are determined dynamically
  3. Clustering: distance-based association; one-dimensional clustering, then association

Mining Other Interesting Patterns
- Flexible support constraints:
  - Some items (e.g., diamonds) may occur rarely but are valuable
  - Customized sup_min specification and application
- Top-k closed frequent patterns, for when sup_min is hard to specify

Mining Frequent Patterns, Association and Correlations
- Basic concepts and a road map
- Efficient and scalable frequent itemset mining methods
- Mining various kinds of association rules
- From association mining to correlation analysis
- Summary

Interestingness Measure: Lift
- Strong rules are not necessarily important: support and confidence are only estimates and do not measure the real strength of a correlation

              | Basketball | Not basketball | Sum (row)
  Cereal      | 2000       | 1750           | 3750
  Not cereal  | 1000       | 250            | 1250
  Sum (col.)  | 3000       | 2000           | 5000

- play basketball => eat cereal [40%, 66.7%] is misleading: the overall percentage of students eating cereal is 75%, which is higher than 66.7%
- play basketball => not eat cereal [20%, 33.3%] is more accurate, although it has lower support and confidence

Interestingness Measure: Correlations (Lift)
- Lift is a measure of dependent/correlated events:

  lift(A, B) = P(A ∪ B) / (P(A) P(B))

  (as before, A ∪ B denotes that a transaction contains both A and B)
- If lift > 1, A and B are positively correlated
- If lift < 1, A and B are negatively correlated
- If lift = 1, A and B are independent
- For the table above, with B = plays basketball and C = eats cereal:

  lift(B, C)  = (2000/5000) / ((3000/5000) × (3750/5000)) ≈ 0.89
  lift(B, ¬C) = (1000/5000) / ((3000/5000) × (1250/5000)) ≈ 1.33
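A quick plain-Python check of the two lift values (the variable names are ours):

```python
# Lift computation on the basketball/cereal contingency table above.
n = 5000
P_B = 3000 / n          # P(plays basketball)
P_C = 3750 / n          # P(eats cereal)
P_notC = 1250 / n       # P(does not eat cereal)

lift_B_C = (2000 / n) / (P_B * P_C)         # joint probability over the product
lift_B_notC = (1000 / n) / (P_B * P_notC)

print(round(lift_B_C, 2))     # 0.89 -> B and C are negatively correlated
print(round(lift_B_notC, 2))  # 1.33 -> B and not-C are positively correlated
```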
Mining Frequent Patterns, Association and Correlations
- Basic concepts and a road map
- Efficient and scalable frequent itemset mining methods
- Mining various kinds of association rules
- From association mining to correlation analysis
- Summary

Summary
- Frequent pattern mining: an important task in data mining
- Support & confidence
- Scalable frequent pattern mining methods:
  - Apriori (candidate generation & test)
  - FP-growth (projection-based)
- Mining a variety of rules and interesting patterns
- Correlation analysis: lift