CIS4930 Introduction to Data Mining
Frequent Pattern Mining
Peixiang Zhao
Tallahassee, Florida, 2016

What is Frequent Pattern Analysis?
• Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
• Motivation: finding inherent regularities in data
– What products were often purchased together? — Beer and diapers?!
– What are the subsequent purchases after buying a PC?
– What kinds of DNA are sensitive to this new drug?
– Can we automatically classify web documents?
• Applications
– Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis

Why Frequent Patterns?
• Frequent patterns
– An intrinsic and important property of datasets
• Foundation for many essential data mining tasks
– Association, correlation, and causality analysis
– Sequential and structural (e.g., sub-graph) patterns
– Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
– Classification: discriminative frequent pattern analysis
– Cluster analysis: frequent pattern-based clustering
– Broad applications

Association Rule Mining
• Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction
– Examples: {Diaper} → {Beer}, {Milk, Bread} → {Eggs, Coke}, {Beer, Bread} → {Milk}
– Implication means co-occurrence, not causality!

Basic Concepts
• Itemset
– A set of one or more items
– k-itemset: X = {x1, …, xk}
• (Absolute) support of X
– Frequency or count of occurrences of the itemset X
• (Relative) support of X
– The fraction of transactions that contain X (i.e., the probability that a transaction contains X)
• An itemset X is frequent if
– X's support is no less than a minsup threshold
[Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both]

Basic Concepts
• Association rule
– An implication expression of the form X → Y, where X and Y are itemsets
– Support, s: the fraction of transactions that contain both X and Y (the probability that a transaction contains X and Y)
– Confidence, c: how often items in Y appear in transactions that contain X (the conditional probability that a transaction having X also contains Y)
• Goal: find all association rules X → Y with minimum support and confidence
• Example: let minsup = 50%, minconf = 50%
– Frequent patterns: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3
– Association rules (many more!): Beer → Diaper (60%, 100%); Diaper → Beer (60%, 75%)

Closed Patterns and Max-Patterns
• A long pattern contains a combinatorial number of sub-patterns, e.g., {a1, …, a100} contains (100 choose 1) + (100 choose 2) + … + (100 choose 100) = 2^100 − 1 ≈ 1.27×10^30 sub-patterns!
• Solution: mine closed patterns and max-patterns instead
• An itemset X is closed if
– X is frequent and there exists no super-pattern Y ⊃ X with the same support as X
• An itemset X is a maximal pattern if
– X is frequent and there exists no frequent super-pattern Y ⊃ X
• Closed patterns are a lossless compression of frequent patterns
– Reducing the # of patterns and rules

Closed and Max-Patterns: An Example
• Exercise: DB = {<a1, …, a100>, <a1, …, a50>}, min_sup = 1
• What is the set of closed itemsets?
– <a1, …, a100>: 1
– <a1, …, a50>: 2
• What is the set of maximal patterns?
– <a1, …, a100>: 1
• What is the set of all patterns?
– Far too many to list: all 2^100 − 1 non-empty subsets of {a1, …, a100}!
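To make the definitions above concrete, here is a small Python sketch that computes support, confidence, and the closed/maximal itemsets by brute force. The five transactions are a made-up toy database, chosen only so that the item counts match the example above (Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3); they are not taken from the slides.

```python
from itertools import combinations

# Toy transaction database (illustrative only)
DB = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def support(itemset):
    """Absolute support: number of transactions containing the itemset."""
    return sum(1 for t in DB if itemset <= t)

def confidence(X, Y):
    """Confidence of the rule X -> Y: sup(X u Y) / sup(X)."""
    return support(X | Y) / support(X)

minsup = 3  # absolute minsup threshold

# Enumerate all frequent itemsets by brute force (fine for a toy database)
items = sorted(set().union(*DB))
frequent = [frozenset(c)
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if support(set(c)) >= minsup]

# Closed: no proper superset with the same support (a same-support superset of
# a frequent itemset is itself frequent, so searching within `frequent` suffices).
# Maximal: no proper superset that is frequent.
closed  = [X for X in frequent
           if not any(X < Y and support(Y) == support(X) for Y in frequent)]
maximal = [X for X in frequent if not any(X < Y for Y in frequent)]

print("frequent:", frequent)
print("closed:  ", closed)
print("maximal: ", maximal)
print("conf(Beer -> Diaper) =", confidence({"Beer"}, {"Diaper"}))  # 1.0
```

With minsup = 3 this prints five frequent itemsets, of which four are closed and three are maximal, matching the support/confidence values quoted in the example (Beer → Diaper has confidence 100%, Diaper → Beer has 75%).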
Computational Complexity
• How many itemsets can potentially be generated in the worst case?
– The number of frequent itemsets is sensitive to the minsup threshold
– When minsup is low, there exist an exponential number of frequent itemsets
– The worst case: M^N, where M = # distinct items and N = max length of a transaction
• The worst-case complexity vs. the expected probability
– Ex. Suppose Walmart has 10^4 kinds of products
  • The chance to pick up one product: 10^-4
  • The chance to pick up a particular set of 10 products: ~10^-40
  • What is the chance that this particular set of 10 products is frequent 10^3 times in 10^9 transactions?

Association Rule Mining
• Given a set of transactions T, the goal of association rule mining is to find all rules having
– support ≥ minsup threshold
– confidence ≥ minconf threshold
• Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf thresholds
– Computationally prohibitive!

Mining Association Rules
• Observations
– All the association rules below are binary partitions of the same itemset: {Milk, Diaper, Beer}
– Rules originating from the same itemset have identical support but can have different confidence
– Thus, we may decouple the support and confidence requirements
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
Example rules:
{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)

Mining Association Rules
• Two-step approach:
1. Frequent itemset mining
– Generate all itemsets whose support ≥ minsup
2. Rule generation
– Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
• Frequent itemset generation is still computationally expensive

Frequent Itemset Mining
• Given d items, there are 2^d possible candidate itemsets
[Figure: itemset lattice over items A–E, from the empty (null) itemset up to ABCDE]

Frequent Itemset Mining
• Brute-force approach
1. Each itemset in the lattice is a candidate frequent itemset
2. Count the support of each candidate by scanning the database
– Match each transaction against every candidate
– Time complexity: O(N·M·w), where N = # transactions, M = # candidates, w = transaction width
– Expensive since M = 2^d
[Figure: the N transactions of the example database matched against the list of M candidate itemsets]

Computational Complexity of Association Rule Mining
• Given d unique items:
– Total number of itemsets = 2^d
– Total number of possible association rules:
  R = Σ_{k=1}^{d−1} [ C(d, k) · Σ_{j=1}^{d−k} C(d−k, j) ] = 3^d − 2^(d+1) + 1
– If d = 6, R = 602 rules

Frequent Itemset Mining Strategies
• Reduce the number of candidates (M)
– Complete search: M = 2^d
– Use pruning techniques to reduce M
• Reduce the number of transactions (N)
– Reduce the size of N as the size of the itemset increases
– Used by vertical-based mining algorithms
• Reduce the number of comparisons (N·M)
– Use efficient data structures to store the candidates or transactions
– No need to match every candidate against every transaction

Methods for Frequent Itemset Mining
• Traversal of the itemset lattice
– General-to-specific vs. specific-to-general
[Figures: (a) general-to-specific, (b) specific-to-general, and (c) bidirectional traversal of the frequent itemset border in the lattice from the null itemset to {a1, a2, ..., an}]
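As a quick sanity check of the rule-count formula on the "Computational Complexity of Association Rule Mining" slide above, the double sum and the closed form 3^d − 2^(d+1) + 1 can be compared directly; a minimal Python sketch:

```python
from math import comb

def rule_count(d):
    """Number of possible association rules over d items, via the double sum."""
    return sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
               for k in range(1, d))

def rule_count_closed_form(d):
    """Equivalent closed form: 3^d - 2^(d+1) + 1."""
    return 3**d - 2**(d + 1) + 1

for d in range(2, 8):
    assert rule_count(d) == rule_count_closed_form(d)

print(rule_count(6))  # 602, matching the d = 6 example on the slide
```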
Methods for Frequent Itemset Mining
• Traversal of the itemset lattice
– Breadth-first vs. depth-first
[Figures: (a) breadth-first and (b) depth-first traversal of the itemset lattice]

Methods for Frequent Itemset Mining
• Traversal of the itemset lattice
– Equivalence classes
[Figures: the lattice over items A–D organized as (a) a prefix tree and (b) a suffix tree]

Methods for Frequent Itemset Mining
• Representation of the database
– Horizontal vs. vertical data layout
Horizontal data layout:
TID  Items
1    A, B, E
2    B, C, D
3    C, E
4    A, C, D
5    A, B, C, D
6    A, E
7    A, B
8    A, B, C
9    A, C, D
10   B
Vertical data layout:
A: 1, 4, 5, 6, 7, 8, 9
B: 1, 2, 5, 7, 8, 10
C: 2, 3, 4, 8, 9
D: 2, 4, 5, 9
E: 1, 3, 6

Frequent Itemset Mining Methods
• Apriori: a candidate generation-and-test approach
– Improving the efficiency of Apriori
• FP-Growth: a frequent pattern-growth approach
• ECLAT: frequent pattern mining with the vertical data format

The Downward Closure Property
• The downward closure property of frequent patterns
– Any subset of a frequent itemset must be frequent
  • If {beer, diaper, nuts} is frequent, so is {beer, diaper}
  • i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}
• The Apriori principle holds due to the following property of the support measure:
– ∀X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)
– Support of an itemset never exceeds the support of its subsets
– This is known as the anti-monotone property of support

Illustrating the Apriori Principle
[Figure: itemset lattice over items A–E; once an itemset is found to be infrequent, all of its supersets are pruned]

Apriori: A Candidate Generation & Test Approach
• Apriori pruning principle: if there is any itemset which is infrequent, its superset should not be generated/tested!
• Algorithm
1. Initially, scan the DB once to get the frequent 1-itemsets
2.1 Generate length-(k+1) candidate itemsets from length-k frequent itemsets
2.2 Test the candidates against the DB
3. Terminate when no frequent or candidate set can be generated

The Apriori Algorithm — An Example
[Figure: trace of Apriori on an example database TDB with sup_min = 2: the 1st scan produces C1 and L1, the 2nd scan C2 and L2, and the 3rd scan C3 and L3]

The Apriori Algorithm
• Ck: candidate itemsets of size k
• Lk: frequent itemsets of size k
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in the database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
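A compact Python rendering of the Apriori pseudocode above, run on the 5-transaction table shown earlier; the function name and the absolute threshold are illustrative, and candidate generation uses the self-join plus prune scheme described on the next slide.

```python
from collections import defaultdict
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise Apriori: returns {frozenset(itemset): support} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # L1: frequent 1-itemsets
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[frozenset([item])] += 1
    L = {X: c for X, c in counts.items() if c >= min_support}
    frequent = dict(L)
    k = 1
    while L:
        # Candidate generation: self-join Lk, then prune candidates with an infrequent k-subset
        prev = set(L)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k))}
        # Support counting: bump every candidate contained in each transaction
        counts = defaultdict(int)
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        L = {X: n for X, n in counts.items() if n >= min_support}
        frequent.update(L)
        k += 1
    return frequent

# Example: the 5-transaction table shown earlier, with absolute min_support = 3
db = [{"Bread", "Milk"},
      {"Bread", "Diaper", "Beer", "Eggs"},
      {"Milk", "Diaper", "Beer", "Coke"},
      {"Bread", "Milk", "Diaper", "Beer"},
      {"Bread", "Milk", "Diaper", "Coke"}]
for itemset, sup in sorted(apriori(db, 3).items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(set(itemset), sup)
```

On this data the loop stops at k = 3: the only size-3 candidate that survives pruning, {Bread, Milk, Diaper}, has support 2 and fails the threshold.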
Implementation
• How to generate candidates?
– Step 1: self-joining Lk
– Step 2: pruning
• Example of candidate generation
– L3 = {abc, abd, acd, ace, bcd}
– Self-joining: L3 * L3
  • abcd from abc and abd
  • acde from acd and ace
– Pruning:
  • acde is removed because ade is not in L3
– C4 = {abcd}

Candidate Generation: SQL
• SQL implementation of candidate generation
– Suppose the items in Lk-1 are listed in an order
– Step 1: self-joining Lk-1
insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
– Step 2: pruning
forall itemsets c in Ck do
    forall (k-1)-subsets s of c do
        if (s is not in Lk-1) then delete c from Ck

Frequency Counting
• Hash tree
– Suppose you have 15 candidate itemsets of length 3:
  {1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}
– You need:
  • A hash function
  • A max leaf size: the max number of itemsets stored in a leaf node (if the number of candidate itemsets exceeds the max leaf size, split the node)
[Figure: hash tree over the 15 candidates; the hash function sends items 1, 4, 7 to one branch, 2, 5, 8 to another, and 3, 6, 9 to the third]

Subset Operation
• Given a transaction t = {1, 2, 3, 5, 6}, what are the possible subsets of size 3?
[Figure: level-by-level enumeration of all 3-item subsets of transaction t]

Subset Operation Using Hash Tree
[Figure: matching transaction {1, 2, 3, 5, 6} against the hash tree by recursively hashing on the transaction's items]

Factors Affecting Complexity
• Choice of minimum support threshold
– lowering the support threshold results in more frequent itemsets
– this may increase the number of candidates and the max length of frequent itemsets
• Dimensionality (number of items) of the data set
– more space is needed to store the support count of each item
– if the number of frequent items also increases, both computation and I/O costs may also increase
• Size of database
– since Apriori makes multiple passes, the run time of the algorithm may increase with the number of transactions
• Average transaction width
– transaction width increases with denser data sets
– this may increase the max length of frequent itemsets, and the number of subsets in a transaction increases with its width

Maximal Frequent Itemset
• An itemset is maximal frequent if none of its immediate supersets is frequent
[Figure: itemset lattice with the border separating frequent from infrequent itemsets; the maximal frequent itemsets lie just inside the border]

Closed Itemset
• An itemset is closed if none of its immediate supersets has the same support as the itemset
TID  Items
1    {A, B}
2    {B, C, D}
3    {A, B, C, D}
4    {A, B, D}
5    {A, B, C, D}
Itemset supports:
{A}: 4, {B}: 5, {C}: 3, {D}: 4
{A,B}: 4, {A,C}: 2, {A,D}: 3, {B,C}: 3, {B,D}: 4, {C,D}: 3
{A,B,C}: 2, {A,B,D}: 3, {A,C,D}: 2, {B,C,D}: 3, {A,B,C,D}: 2
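The closed-itemset definition can be checked against the table above with a few lines of brute-force Python; the sketch below is illustrative only (fine at this size, not how one would do it at scale).

```python
from itertools import combinations

# The transaction database from the closed-itemset slide above
DB = [{"A","B"}, {"B","C","D"}, {"A","B","C","D"}, {"A","B","D"}, {"A","B","C","D"}]
items = sorted(set().union(*DB))

def support(X):
    return sum(1 for t in DB if X <= t)

# Support of every non-empty itemset
sup = {frozenset(c): support(set(c))
       for k in range(1, len(items) + 1) for c in combinations(items, k)}

# Closed: no immediate superset (one extra item) has the same support
closed = [X for X, s in sup.items()
          if not any(sup[X | {i}] == s for i in items if i not in X)]

for X in sorted(closed, key=lambda X: (len(X), sorted(X))):
    print(sorted(X), sup[X])
# e.g. {B}: 5 and {A,B}: 4 are closed, while {A}: 4 is not
# (its immediate superset {A,B} has the same support)
```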
Maximal vs. Closed Itemsets
TID  Items
1    ABC
2    ABCD
3    BCE
4    ACDE
5    DE
[Figure: itemset lattice annotated with the transaction ids supporting each itemset; itemsets supported by no transaction are marked]

Maximal vs. Closed Frequent Itemsets
• Minimum support = 2
[Figure: the same lattice with frequent itemsets marked as "closed but not maximal" or "closed and maximal"; # closed = 9, # maximal = 4]

Maximal vs. Closed Itemsets
• Maximal frequent itemsets ⊆ closed frequent itemsets ⊆ frequent itemsets

Further Improvement of Apriori
• Major computational challenges
– Multiple scans of the transaction database
– Huge number of candidates
– Tedious workload of support counting for candidates
• Improving Apriori: general ideas
– Reduce passes of transaction database scans
– Shrink the number of candidates
– Facilitate support counting of candidates

Partition: Scan Database Only Twice
• Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
– Scan 1: partition the database and find local frequent patterns
– Scan 2: consolidate global frequent patterns
• Equivalently (contrapositive): if sup_j(i) < σ·|DB_j| in every partition DB_j (j = 1, …, k), then sup(i) = sup_1(i) + … + sup_k(i) < σ·(|DB_1| + … + |DB_k|) = σ·|DB|, so i cannot be frequent in the whole DB

FP-Growth: A Frequent Pattern-Growth Approach
• Bottlenecks of the Apriori approach
– Breadth-first (i.e., level-wise) search
– Candidate generation and test
  • Often generates a huge number of candidates
• The FP-Growth approach
– Depth-first search
– Avoid explicit candidate generation
• Major philosophy: grow long patterns from short ones using local frequent items only
– "abc" is a frequent pattern
– Get all transactions having "abc", i.e., project the DB on abc: DB|abc
– "d" is a local frequent item in DB|abc ⇒ abcd is a frequent pattern

General Idea
• The FP-growth method indexes the database for fast support computation via an augmented prefix tree: the frequent pattern tree (FP-tree)
– Each node in the tree is labeled with a single item, with the support information for the itemset comprising the items on the path from the root to that node
• FP-tree construction
– Initially the tree contains as root the null item ∅
– For each tuple X ∈ D, we insert X into the FP-tree, incrementing the count of all nodes along the path that represents X
– If X shares a prefix with some previously inserted transaction, then X follows the same path until the common prefix ends; for the remaining items in X, new nodes are created under the common prefix, with counts initialized to 1
– The FP-tree is complete when all transactions have been inserted
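A minimal Python sketch of the FP-tree construction just described; the class and function names are mine, ties in the f-list order are broken arbitrarily, and a header table of node-links is kept because FP-growth needs it later.

```python
from collections import Counter, defaultdict

class FPNode:
    """One FP-tree node: an item label, a count, a parent link, and child links."""
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 0
        self.children = {}

def build_fptree(transactions, minsup):
    """Build an FP-tree over the frequent items, inserting each transaction in f-list order."""
    freq = Counter(item for t in transactions for item in t)
    flist = [i for i, c in freq.most_common() if c >= minsup]   # descending support
    rank = {item: r for r, item in enumerate(flist)}
    root = FPNode(None, None)
    header = defaultdict(list)    # item -> nodes carrying that item (the node-links)
    for t in transactions:
        path = sorted((i for i in t if i in rank), key=rank.get)  # keep frequent items, f-list order
        node = root
        for item in path:
            if item not in node.children:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1       # bump the count of every node on the inserted path
    return root, header, flist

# Example: the 5-transaction table used earlier, with absolute minsup = 3
db = [{"Bread", "Milk"}, {"Bread", "Diaper", "Beer", "Eggs"},
      {"Milk", "Diaper", "Beer", "Coke"}, {"Bread", "Milk", "Diaper", "Beer"},
      {"Bread", "Milk", "Diaper", "Coke"}]
root, header, flist = build_fptree(db, 3)
print(flist)   # Bread, Milk, Diaper (all support 4, order arbitrary), then Beer
```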
FP-Tree: Preprocessing
• Procedure
1. Scan the DB once and find the frequent 1-itemsets (single-item patterns)
2. Sort the frequent items in frequency descending order: the f-list
3. Scan the DB again and construct the FP-tree
• This way, frequent patterns can be partitioned into subsets according to the f-list (e.g., f-list = b-e-a-c-d):
– Patterns containing d
– Patterns having c but no d
– …
– Patterns having b but no e, a, c, d
• Completeness and non-redundancy

Frequent Pattern Tree
• The FP-tree is a prefix-compressed representation of D, and the frequent items are sorted in descending order of support
[Figure: example FP-tree]

FP-Growth Algorithm
• Given an FP-tree R, projected FP-trees are built for each frequent item i in R, in increasing order of support, in a recursive manner
• To project R on item i, we find all the occurrences of i in the tree, and for each occurrence, we determine the corresponding path from the root to i
– The count of item i on a given path is recorded in cnt(i), and the path is inserted into the new projected tree
– While inserting the path, the count of each node along the given path is incremented by the path count cnt(i)
• The base case for the recursion happens when the input FP-tree R is a single path
– Basic enumeration

Projected Frequent Pattern Tree for D
[Figure: projected FP-trees for the example database D]

Frequent Pattern Tree Projection
[Figure: step-by-step FP-tree projection]

FP-Growth Algorithm
[Pseudocode figure; annotations: P — the projected items; X — the new projected itemset, including i; the support of X = the minimum support of x in Y; the computation is recursive]

Benefits of the FP-Tree
• Completeness
– Preserves complete information for frequent pattern mining
– Never breaks a long pattern of any transaction
• Compactness
– Reduces irrelevant info — infrequent items are gone
– Items are in frequency descending order: the more frequently occurring, the more likely to be shared
– Never larger than the original database (not counting node-links and the count fields)

Advantages of the Pattern-Growth Approach
• Divide-and-conquer
– Decompose both the mining task and the DB according to the frequent patterns obtained so far
– Leads to focused search of smaller databases
• Other factors
– No candidate generation, no candidate test
– Compressed database: the FP-tree structure
– No repeated scan of the entire database
– Basic operations: counting local frequent items and building sub-FP-trees — no pattern search and matching

ECLAT: Mining by Exploring the Vertical Data Format
• For each item, store a list of transaction ids (tids)
[Figure: vertical tid-list representation of the example database]

ECLAT
• Determine the support of any k-itemset by intersecting the tid-lists of two of its (k−1)-subsets
– Example: t(A) = {1, 3, 4, 5}, t(B) = {1, 2, 3, 4, 5, 6} ⇒ t(AB) = {1, 3, 4, 5}
• Eclat intersects the tid-sets only if the frequent itemsets share a common prefix
– It traverses the prefix search tree in a DFS-like manner, processing a group of itemsets that have the same prefix, also called a prefix equivalence class

Eclat Algorithm
[Pseudocode figure; annotations: F — the set of frequent itemsets; P — frequent items and their tid-lists; Xa — one frequent itemset; Pa — the set of frequent itemsets generated from Xa (they are in the same prefix equivalence class); the computation is recursive]

ECLAT: Tid-list Intersections
[Figure: example of tid-list intersections]
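A compact recursive sketch of ECLAT along the lines described above, using the vertical tid-lists from the earlier data-layout slide; the function signature and variable names are mine.

```python
def eclat(prefix, items_with_tids, minsup, results):
    """DFS over a prefix equivalence class, intersecting tid-sets.

    items_with_tids: list of (item, tidset) pairs that extend the current prefix."""
    while items_with_tids:
        item, tids = items_with_tids.pop()
        if len(tids) >= minsup:
            itemset = prefix | {item}
            results[frozenset(itemset)] = len(tids)
            # New equivalence class: the remaining items, intersected with this tid-set
            suffix = [(other, tids & other_tids)
                      for other, other_tids in items_with_tids]
            eclat(itemset, suffix, minsup, results)
    return results

# Vertical layout of the 10-transaction table shown earlier (items A-E)
vertical = {"A": {1,4,5,6,7,8,9}, "B": {1,2,5,7,8,10},
            "C": {2,3,4,8,9}, "D": {2,4,5,9}, "E": {1,3,6}}
freq = eclat(set(), sorted(vertical.items()), 3, {})
print(freq)   # includes e.g. {'A','B'}: 4 (tids 1,5,7,8) and {'A','C'}: 3 (tids 4,8,9)
```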
Generating Association Rules
• How to efficiently generate rules from frequent itemsets?
– In general, confidence does not have an anti-monotone property
  • c(ABC → D) can be larger or smaller than c(AB → D)
– But the confidence of rules generated from the same itemset does have an anti-monotone property
  • e.g., for L = {A, B, C, D}: c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
– Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule

Example
• If {A, B, C, D} is a frequent itemset, the candidate rules are:
  ABC → D, ABD → C, ACD → B, BCD → A,
  AB → CD, AC → BD, AD → BC, BC → AD, BD → AC, CD → AB,
  A → BCD, B → ACD, C → ABD, D → ABC
• If |L| = k, then there are 2^k − 2 candidate association rules (ignoring L → ∅ and ∅ → L)

Generating Association Rules
• Given a frequent itemset Z, we look at all proper subsets X ⊂ Z to compute rules of the form X → Y, where Y = Z \ X
• The rule must be frequent
– s = sup(X ∪ Y) = sup(Z) ≥ minsup
• We compute the confidence as
– c = sup(X ∪ Y) / sup(X) = sup(Z) / sup(X)
– If c ≥ minconf, this rule is a strong association rule
– Otherwise, conf(W → Z \ W) ≤ c < minconf for all subsets W ⊂ X, because sup(W) ≥ sup(X); we can thus avoid checking the subsets of X

Example
[Figure: lattice of rules generated from {A, B, C, D}; CD → AB is a low-confidence rule, and all rules whose LHS is a subset of CD are pruned]

Association Rule Mining Algorithm
[Pseudocode figure; annotations: F — the set of frequent itemsets; A — all proper subsets of Z; X — start from the largest subset; output a strong association rule; if X fails, all its subsets fail as well]

Pattern Evaluation
• Association rule algorithms tend to produce too many rules
– Many of them are uninteresting or redundant
– Redundant if {A, B, C} → {D} and {A, B} → {D} have the same support & confidence
• Interestingness measures can be used to prune/rank the derived patterns
• In the original formulation of association rules, support & confidence are the only measures used
– Anything else?

Computing Interestingness Measures
• Given a rule X → Y, the information needed to compute rule interestingness can be obtained from a contingency table
Contingency table for X → Y:
        Y      ¬Y
X      f11    f10    f1+
¬X     f01    f00    f0+
       f+1    f+0    |T|
– f11: support of X and Y
– f10: support of X and ¬Y
– f01: support of ¬X and Y
– f00: support of ¬X and ¬Y
• These counts are used to define various measures: support, confidence, lift, Gini, J-measure, …

Drawback of Confidence
        Coffee   ¬Coffee
Tea       15        5       20
¬Tea      75        5       80
          90       10      100
• Association rule: Tea → Coffee
• Confidence = P(Coffee | Tea) = 0.75, but P(Coffee) = 0.9
• Although the confidence is high, the rule is misleading: P(Coffee | ¬Tea) = 0.9375

Properties of a Good Interestingness Measure
• Three principal properties a good measure M must satisfy:
1. M(A, B) = 0 if A and B are statistically independent, i.e., P(A, B) = P(A)·P(B)
2. M(A, B) increases monotonically with P(A, B) when P(A) and P(B) remain unchanged
3. M(A, B) decreases monotonically with P(A) [or P(B)] when P(A, B) and P(B) [or P(A)] remain unchanged

Other Considerations
• Is the measure symmetric? Does M(A, B) = M(B, A)?
• Is the measure insensitive to row/column scaling?
        Male  Female              Male  Female
High      2      3      5   High    4     30     34
Low       1      4      5   Low     2     40     42
          3      7     10           6     70     76
  (male counts scaled 2x, female counts scaled 10x)
• Is the measure invariant under null addition, i.e., under adding transactions that contain neither A nor B?
        A     ¬A                   A     ¬A
B       p     r          B         p     r
¬B      q     s          ¬B        q     s + k
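Lift, one of the measures listed above, makes the tea/coffee drawback explicit; a small sketch using the contingency counts from that table:

```python
# Contingency counts for Tea -> Coffee from the table above
f11, f10 = 15, 5     # Tea & Coffee, Tea & not-Coffee
f01, f00 = 75, 5     # not-Tea & Coffee, not-Tea & not-Coffee
n = f11 + f10 + f01 + f00

support    = f11 / n                  # P(Tea, Coffee) = 0.15
confidence = f11 / (f11 + f10)        # P(Coffee | Tea) = 0.75
p_coffee   = (f11 + f01) / n          # P(Coffee) = 0.9
lift       = confidence / p_coffee    # 0.75 / 0.9 ~ 0.83

print(f"confidence = {confidence:.3f}, P(Coffee) = {p_coffee:.2f}, lift = {lift:.3f}")
# lift < 1 says Tea and Coffee co-occur less often than under independence,
# i.e. they are negatively correlated even though the rule's confidence looks high.
```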
Summary
• Basic concepts: association rules, the support–confidence framework, closed and max-patterns
• Scalable frequent pattern mining methods
– Apriori (candidate generation & frequency counting)
– Projection-based: FP-Growth
– Vertical format approach: ECLAT
• Which patterns are interesting?
– Pattern evaluation methods

Research on Pattern Mining: A Road Map
[Figure: road map of research topics in pattern mining]

Mining Multiple-Level Association Rules
• Items often form hierarchies
• Flexible support settings
– Items at the lower level are expected to have lower support
• Uniform support: Level 1 min_sup = 5%, Level 2 min_sup = 5%
• Reduced support: Level 1 min_sup = 5%, Level 2 min_sup = 3%
• Example hierarchy: Milk [support = 10%], with children 2% Milk [support = 6%] and Skim Milk [support = 4%]

Mining Multiple-Level Association Rules
• Flexible min-support thresholds: some items are more valuable but less frequent
– Use non-uniform, group-based min-support
– E.g., {diamond, watch, camera}: 0.05%; {bread, milk}: 5%; …
• Redundancy filtering: some rules may be redundant due to "ancestor" relationships between items
– milk → wheat bread [support = 8%, confidence = 70%]
– 2% milk → wheat bread [support = 2%, confidence = 72%]
– A rule is redundant if its support is close to the "expected" value, based on the rule's ancestor

Negative and Rare Patterns
• Rare patterns: very low support but interesting
– E.g., buying Rolex watches
– Set individual-based or special group-based support thresholds for valuable items
• Negative patterns
– Since it is unlikely that one buys a Ford Expedition (an SUV) and a Toyota Prius (a hybrid car) together, Ford Expedition and Toyota Prius are likely negatively correlated patterns
• Negatively correlated patterns that are infrequent tend to be more interesting than those that are frequent

Constraint-Based (Query-Directed) Mining
• Finding all the patterns in a database autonomously? — unrealistic!
– The patterns could be too many but not focused!
• Data mining should be an interactive process
– The user directs what is to be mined using a data mining query language (or a graphical user interface)
• Constraint-based mining
– User flexibility: provides constraints on what is to be mined
– Optimization: explores such constraints for efficient mining
– Note: still find all the answers satisfying the constraints, rather than some answers found by "heuristic search"

Meta-Rule Guided Mining
• A meta-rule can be in rule form with partially instantiated predicates and constants:
  P1(X, Y) ^ P2(X, W) => buys(X, "iPad")
• The resulting rule derived can be:
  age(X, "15-25") ^ profession(X, "student") => buys(X, "iPad")
• In general, it can be in the form:
  P1 ^ P2 ^ … ^ Pl => Q1 ^ Q2 ^ … ^ Qr
• Method to find meta-rules
– Find frequent (l+r) predicates (based on a min-support threshold)
– Push constants deeply into the mining when possible
– Use confidence, correlation, and other measures when possible
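One way to operationalize the redundancy-filtering idea a few slides back is to compare a specialized rule's support with the value "expected" from its ancestor rule. The sketch below is only illustrative: the 25% share of 2% milk among milk sales and the 30% tolerance are assumed figures, not numbers from the slides.

```python
def redundant(ancestor_support, descendant_support, share, tolerance=0.3):
    """Flag a specialized rule as redundant if its support is close to the value
    expected from its ancestor rule, i.e. ancestor_support * share.

    share: assumed fraction of the ancestor item's sales accounted for by the
    specialized item (e.g. the fraction of milk sold as 2% milk)."""
    expected = ancestor_support * share
    return abs(descendant_support - expected) <= tolerance * expected

# milk -> wheat bread has 8% support; if 2% milk were ~25% of milk sales,
# the expected support of "2% milk -> wheat bread" would be ~2%.
print(redundant(0.08, 0.02, share=0.25))   # True: close to expected, so likely redundant
```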