Frequent Pattern and Association Analysis
(based on the slides for the book "Data Mining: Concepts and Techniques")

Outline
- Basic concepts
- Scalable mining methods
- Mining a variety of rules and interesting patterns
- Constraint-based mining
- Mining sequential and structured patterns
- Extensions and applications

What Is Frequent Pattern Analysis?
- Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
- First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of association rule mining

Motivation
- Finding inherent regularities in data:
  - What products are often purchased together? Beer and diapers?!
  - What are the subsequent purchases after buying a PC?
  - What kinds of DNA are sensitive to this new drug?
  - Can we automatically classify web documents?
- Examples: basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, DNA sequence analysis, etc.

Basic Concepts (1)
- Item: Boolean variable representing its presence or absence
- Basket: Boolean vector of item variables
- Baskets are analyzed to discover patterns of items that are frequently associated (bought) together
- Association rules: association patterns of the form X => Y

Basic Concepts (2)

  Transaction-id   Items bought
  10               A, B, D
  20               A, C, D
  30               A, D, E
  40               B, E, F
  50               B, C, D, E, F

(Venn-diagram figure: customers who buy beer, customers who buy diapers, and their overlap.)

- Itemset: X = {x1, …, xk}
- Goal: find all the rules X => Y with minimum support and confidence
- Support s: probability that a transaction contains X ∪ Y
    support = #transactions containing X ∪ Y / #transactions = P(X ∪ Y)
- Confidence c: conditional probability that a transaction containing X also contains Y
    confidence = #transactions containing X ∪ Y / #transactions containing X = P(Y|X)

Basic Concepts (3)
- Interesting (or strong) rules satisfy both a minimum support threshold and a minimum confidence threshold
- For the database above, let sup_min = 50% and conf_min = 50%
- Frequent patterns: {A:3, B:3, D:4, E:3, AD:3}
- Association rules:
  - A => D (60%, 100%): 60% of all analyzed transactions show that A and D are bought together; 100% of customers that bought A also bought D
  - D => A (60%, 75%)

Basic Concepts (4)
- Itemset (or pattern): a set of items (a k-itemset has k items)
- Occurrence frequency of an itemset: the number of transactions containing the itemset; also known as the frequency, support count, or count of the itemset
- Frequent itemset: an itemset that satisfies minimum support, i.e., frequency >= min_sup * #transactions
- confidence(A => B) = P(B|A) = frequency(A ∪ B) / frequency(A)
- The problem of mining association rules therefore reduces to the problem of mining frequent itemsets

Mining frequent itemsets
- A long pattern contains a combinatorial number of sub-patterns
- Solution: mine closed patterns and max-patterns
- An itemset X is closed frequent if X is frequent and there exists no super-itemset Y ⊃ X with the same support as X
- An itemset X is a max-itemset if X is frequent and there exists no frequent super-itemset Y ⊃ X
- The closed itemsets are a lossless compression of the frequent patterns: they reduce the number of patterns and rules (a brute-force sketch follows below)
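To make these definitions concrete, here is a minimal brute-force Python sketch (an illustration, not from the slides) that finds the frequent, closed, and maximal itemsets of the five-transaction database from Basic Concepts (2); the function names are ours.

    from itertools import combinations

    def frequent_itemsets(transactions, min_sup):
        # Brute force: count every subset of every transaction (toy data only;
        # this exponential enumeration is what motivates Apriori below).
        counts = {}
        for t in transactions:
            for k in range(1, len(t) + 1):
                for sub in combinations(sorted(t), k):
                    counts[sub] = counts.get(sub, 0) + 1
        return {frozenset(s): c for s, c in counts.items() if c >= min_sup}

    def closed_and_maximal(freq):
        closed, maximal = [], []
        for x, sup in freq.items():
            supers = [y for y in freq if x < y]          # frequent proper supersets
            if not any(freq[y] == sup for y in supers):  # no superset, same support
                closed.append(x)
            if not supers:                               # no frequent superset at all
                maximal.append(x)
        return closed, maximal

    db = [{'A','B','D'}, {'A','C','D'}, {'A','D','E'},
          {'B','E','F'}, {'B','C','D','E','F'}]
    freq = frequent_itemsets(db, min_sup=3)  # 50% of 5 transactions
    print(freq)     # {A}:3, {B}:3, {D}:4, {E}:3, {A,D}:3 -- as on the slide
    print(closed_and_maximal(freq))
    # closed:  {B}, {D}, {E}, {A,D}   ({A} is absorbed by {A,D}, same support)
    # maximal: {B}, {E}, {A,D}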
Example
- DB = {<a1, …, a100>, <a1, …, a50>}, min_sup = 1
- What is the set of closed itemsets? <a1, …, a100>: 1 and <a1, …, a50>: 2
- What is the set of max-patterns? <a1, …, a100>: 1
- What is the set of all frequent patterns? All 2^100 - 1 nonempty subsets of <a1, …, a100> -- far too many to enumerate!

Scalable Methods for Mining Frequent Patterns
- Downward closure (or Apriori) property of frequent patterns: any subset of a frequent itemset must be frequent
  - If {beer, diaper, nuts} is frequent, so is {beer, diaper}, i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}
- Scalable mining methods: three major approaches
  - Apriori (Agrawal & Srikant @VLDB'94)
  - Frequent pattern growth (FP-growth -- Han, Pei & Yin @SIGMOD'00)
  - Vertical data format approach (Charm -- Zaki & Hsiao @SDM'02)

Apriori: A Candidate Generation-and-Test Approach
- Method:
  - Initially, scan the DB once to get the frequent 1-itemsets
  - Generate length-(k+1) candidate itemsets from the length-k frequent itemsets
  - Test the candidates against the DB
  - Terminate when no frequent or candidate set can be generated
- Apriori pruning principle: if any itemset is infrequent, its supersets should not be generated/tested!

How to Generate Candidates?
Suppose the items in Lk-1 are listed in some order.
- Step 1: self-joining Lk-1

    insert into Ck
    select p.item1, p.item2, …, p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

- Step 2: pruning

    forall itemsets c in Ck do
        forall (k-1)-subsets s of c do
            if (s is not in Lk-1) then delete c from Ck

Example (sup_min = 2)

  Database TDB
  Tid   Items
  10    A, C, D
  20    B, C, E
  30    A, B, C, E
  40    B, E

  1st scan -> C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
              L1: {A}:2, {B}:3, {C}:3, {E}:3
  2nd scan -> C2: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
              L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2
  3rd scan -> C3: {B,C,E}:2
              L3: {B,C,E}:2

The Apriori Algorithm
Pseudo-code (Ck: candidate itemsets of size k; Lk: frequent itemsets of size k; a runnable sketch follows after the examples below):

    L1 = {frequent items};
    for (k = 1; Lk != ∅; k++) do begin
        Ck+1 = candidates generated from Lk;
        for each transaction t in database do
            increment the count of all candidates in Ck+1 that are contained in t;
        Lk+1 = candidates in Ck+1 with min_support;
    end
    return ∪k Lk;

Important Details of Apriori
- How to generate candidates?
  - Step 1: self-join Lk
  - Step 2: pruning
- Example of candidate generation:
  - L3 = {abc, abd, acd, ace, bcd}
  - Self-joining L3*L3: abcd from abc and abd; acde from acd and ace
  - Pruning: acde is removed because ade is not in L3
  - C4 = {abcd}

Another example (1) / Another example (2)
(Figure slides.)
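The pseudo-code above maps almost line by line onto a small Python implementation. A minimal in-memory sketch, assuming the whole transaction database fits in a list and skipping the hash-tree support counting described later; all names are illustrative:

    from itertools import combinations

    def apriori(transactions, min_sup):
        # Level-wise candidate-generation-and-test (toy, in-memory version).
        items = sorted({i for t in transactions for i in t})
        L = [frozenset([i]) for i in items
             if sum(1 for t in transactions if i in t) >= min_sup]
        frequent = list(L)
        k = 2
        while L:
            # Self-join: merge pairs of (k-1)-itemsets whose union has size k.
            candidates = {a | b for a in L for b in L if len(a | b) == k}
            # Prune: every (k-1)-subset of a candidate must already be frequent.
            prev = set(L)
            candidates = [c for c in candidates
                          if all(frozenset(s) in prev
                                 for s in combinations(c, k - 1))]
            # One pass over the DB to count supports of the pruned candidates.
            L = [c for c in candidates
                 if sum(1 for t in transactions if c <= t) >= min_sup]
            frequent += L
            k += 1
        return frequent

    tdb = [{'A', 'C', 'D'}, {'B', 'C', 'E'}, {'A', 'B', 'C', 'E'}, {'B', 'E'}]
    print(apriori(tdb, min_sup=2))
    # Reproduces the trace above: L1 = {A},{B},{C},{E};
    # L2 = {A,C},{B,C},{B,E},{C,E}; L3 = {B,C,E}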
Generation of Association Rules from Frequent Itemsets
- confidence(A => B) = P(B|A) = support_count(A ∪ B) / support_count(A)
  - support_count(A ∪ B): number of transactions containing the itemset A ∪ B
  - support_count(A): number of transactions containing the itemset A
- Association rules can then be generated as follows (a sketch follows below):
  - For each frequent itemset l, generate all nonempty subsets of l
  - For every nonempty proper subset s of l, output the rule s => (l - s) if support_count(l) / support_count(s) >= min_conf
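A minimal sketch of this rule-generation loop, assuming the frequent itemsets and their support counts (for instance, from the Apriori sketch above) already sit in a dictionary; the names are illustrative:

    from itertools import combinations

    def gen_rules(freq, min_conf):
        # freq maps frozenset itemsets to support counts and is closed under
        # subsets (Apriori guarantees this, so freq[s] is always present).
        rules = []
        for l, sup_l in freq.items():
            for k in range(1, len(l)):              # all nonempty proper subsets
                for s in map(frozenset, combinations(l, k)):
                    conf = sup_l / freq[s]          # sup_count(l) / sup_count(s)
                    if conf >= min_conf:
                        rules.append((s, l - s, conf))
        return rules

    # Frequent itemsets of the sup_min = 2 example, with their support counts:
    freq = {frozenset('A'): 2, frozenset('B'): 3, frozenset('C'): 3,
            frozenset('E'): 3, frozenset('AC'): 2, frozenset('BC'): 2,
            frozenset('BE'): 3, frozenset('CE'): 2, frozenset('BCE'): 2}
    for s, rhs, conf in gen_rules(freq, min_conf=1.0):
        print(set(s), '=>', set(rhs), conf)
    # {A}=>{C}, {B}=>{E}, {E}=>{B}, {B,C}=>{E}, {C,E}=>{B}, all with conf 1.0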
How to Count Supports of Candidates?
- Why is counting supports of candidates a problem?
  - The total number of candidates can be huge
  - One transaction may contain many candidates
- Method based on hashing: candidate itemsets are stored in a hash-tree
  - A leaf node of the hash-tree contains a list of itemsets and counts
  - An interior node contains a hash table
  - Subset function: finds all the candidates contained in a transaction

Challenges of Frequent Pattern Mining
- Challenges:
  - Multiple scans of the transaction database
  - Huge number of candidates
  - Tedious workload of support counting for candidates
- Improving Apriori -- general ideas: partitioning, sampling, and others

Partition: Scan the Database Only Twice
- Any itemset that is potentially frequent in the DB must be frequent in at least one of the partitions of the DB
- Scan 1: partition the database and find local frequent patterns
- Scan 2: consolidate the global frequent patterns

Sampling for Frequent Patterns
- Select a sample of the original database and mine frequent patterns within the sample using Apriori
- Scan the database once to verify the frequent itemsets found in the sample
- Scan the database again to find missed frequent patterns
- Use a lower support threshold to find frequent itemsets in the sample
- Trade-off: accuracy vs. efficiency

Bottleneck of Frequent-pattern Mining
- Multiple database scans are costly
- Mining long patterns needs many passes of scanning and generates lots of candidates
  - To find the frequent itemset i1 i2 … i100: 100 scans, and C(100,1) + C(100,2) + … + C(100,100) = 2^100 - 1 ≈ 1.27 * 10^30 candidates!
- Bottleneck: candidate generation and test
- Can we avoid candidate generation? Yes!

Visualization of Association Rules
(Figure slides: plane graph, rule graph, and SGI/MineSet 3.0 visualizations.)

Mining Various Kinds of Association Rules
- Mining multi-level associations
- Mining multi-dimensional associations
- Mining quantitative associations
- Mining interesting correlation patterns

Mining Multiple-Level Association Rules
- Items often form a hierarchy; items at the lower levels are expected to have lower support
  - Example hierarchy: Milk [support = 10%] with children 2% Milk [support = 6%] and Skim Milk [support = 4%]
- Flexible support settings:
  - Uniform support: level 1 min_sup = 5%, level 2 min_sup = 5%
  - Reduced support: level 1 min_sup = 5%, level 2 min_sup = 3%

Multi-level Association: Redundancy Filtering
- Some rules may be redundant due to "ancestor" relationships between items
- Example:
  - R1: milk => wheat bread [support = 8%, confidence = 70%]
  - R2: 2% milk => wheat bread [support = 2%, confidence = 72%]
- R1 is an ancestor of R2: R1 can be obtained from R2 by replacing items with their ancestors in the concept hierarchy
- R2 is redundant if its support is close to the "expected" value, based on the rule's ancestor

Mining Multi-Dimensional Associations
- Single-dimensional rules: buys(X, "milk") => buys(X, "bread")
- Multi-dimensional rules: 2 or more dimensions or predicates
  - Inter-dimension association rules (no repeated predicates): age(X, "19-25") ∧ occupation(X, "student") => buys(X, "coke")
  - Hybrid-dimension association rules (repeated predicates): age(X, "19-25") ∧ buys(X, "popcorn") => buys(X, "coke")

Recall: Categorical vs. Quantitative Attributes
- Categorical (or nominal) attributes: finite number of possible values, no ordering among values
- Quantitative attributes: numeric, implicit ordering among values

Mining Quantitative Associations
Techniques can be categorized by how numerical attributes, such as age or salary, are treated:
- Static discretization based on predefined concept hierarchies (data cube methods)
- Dynamic discretization based on the data distribution (quantitative rules, e.g., Agrawal & Srikant @SIGMOD'96)
- Clustering: distance-based association (e.g., Yang & Miller @SIGMOD'97) -- one-dimensional clustering, then association
- Deviation (e.g., Aumann & Lindell @KDD'99): Sex = female => Wage: mean = $7/hr (overall mean = $9)

Static Discretization of Quantitative Attributes (1)
- Attributes are discretized prior to mining using a concept hierarchy; numeric values are replaced by ranges
- The frequent itemset mining algorithm must be modified so that frequent predicate sets are searched: instead of searching only one attribute (e.g., buys), search through all relevant attributes (e.g., age, occupation, buys) and treat each attribute-value pair as an itemset

Static Discretization of Quantitative Attributes (2)
- A data cube is well suited for mining and may already exist
- The cells of an n-dimensional cuboid correspond to the predicate sets and can be used to store the support counts
- Mining from data cubes can be much faster
(Cuboid lattice figure: (age), (income), (buys); (age, income), (age, buys), (income, buys); (age, income, buys).)

Mining Interesting Correlation Patterns
- Strong association rules (with high support and confidence) can be uninteresting. Example:

              Basketball   Not basketball   Sum (row)
  Cereal        2000           1750           3750
  Not cereal    1000            250           1250
  Sum (col.)    3000           2000           5000

- play basketball => eat cereal [40%, 66.7%] is misleading: the overall percentage of students eating cereal is 75%, which is higher than 66.7%
- play basketball => not eat cereal [20%, 33.3%] is more accurate, although it has lower support and confidence (a numeric check follows below)

Are All the Rules Found Interesting?
- The confidence of a rule only estimates the conditional probability of item B given item A; it does not measure the real correlation between A and B
- Another example: buy walnuts => buy milk [1%, 80%] is misleading if 85% of customers buy milk
- Support and confidence are not good measures of correlation
- Many other interestingness measures exist (Tan, Kumar & Srivastava @KDD'02)
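As a numeric check of the basketball/cereal example, a short Python sketch using the counts from the table above:

    # Counts from the contingency table on the previous slide.
    n = 5000
    basketball = 3000            # students who play basketball
    cereal = 3750                # students who eat cereal
    basketball_and_cereal = 2000

    support = basketball_and_cereal / n              # 0.40
    confidence = basketball_and_cereal / basketball  # 0.667
    baseline = cereal / n                            # 0.75

    # The rule "basketball => cereal" holds at [40%, 66.7%], yet eating
    # cereal is *more* likely (75%) regardless of basketball: a confidence
    # below the baseline P(cereal) means the two are negatively associated.
    print(support, confidence, baseline)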
Are All the Rules Found Interesting? (cont.)
- Correlation rule: A => B [support, confidence, correlation measure]
- Measures of dependent/correlated events: the correlation coefficient, or lift
- The occurrence of A is independent of the occurrence of B if P(A ∪ B) = P(A) P(B)

    lift(A, B) = P(A ∪ B) / (P(A) P(B)) = P(B|A) / P(B) = conf(A => B) / sup(B)

- If lift > 1, A and B are positively correlated; if lift < 1, they are negatively correlated

Lift: Example (rule Coffee => Milk)

              Milk     No Milk   Sum (row)
  Coffee      m, c     ~m, c     c
  No Coffee   m, ~c    ~m, ~c    ~c
  Sum (col.)  m        ~m

  DB    m, c    ~m, c    m, ~c    ~m, ~c    lift
  A1    1000      100      100     10,000   9.26
  A2     100     1000     1000    100,000   8.44
  A3    1000      100    10000    100,000   9.18
  A4    1000     1000     1000      1,000   1

Mining Highly Correlated Patterns
- χ² and lift are not good measures for correlations in transactional DBs
- all-confidence or coherence could be good measures (Omiecinski @TKDE'03):

    all_conf(X) = sup(X) / max_item_sup(X)
    coh(X) = sup(X) / |universe(X)|

  where max_item_sup(X) is the maximum support of any single item in X, and universe(X) is the set of transactions containing at least one item of X (both are sketched below)
- Both all-confidence and coherence have the downward closure property, so efficient mining algorithms can be derived (Lee et al. @ICDM'03sub)

Bibliography (Book)
- Data Mining: Concepts and Techniques, J. Han & M. Kamber, Morgan Kaufmann, 2001 (Chapter 6 in the 2001 book; Chapter 4 in the draft)
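Finally, a minimal sketch of the two correlation measures on toy data; the helper names all_conf and coherence are illustrative, not from any library:

    def all_conf(sup, X):
        # all_conf(X) = sup(X) / max_item_sup(X)
        return sup[X] / max(sup[frozenset([i])] for i in X)

    def coherence(transactions, X):
        # coh(X) = sup_count(X) / |universe(X)|, where universe(X) is the
        # set of transactions containing at least one item of X.
        sup_count = sum(1 for t in transactions if X <= t)
        universe = [t for t in transactions if X & t]
        return sup_count / len(universe)

    # Toy coffee/milk data: m = milk, c = coffee, b = bread.
    tdb = [{'m', 'c'}, {'m', 'c'}, {'m'}, {'c'}, {'b'}]
    sup = {frozenset({'m'}): 3/5, frozenset({'c'}): 3/5,
           frozenset({'m', 'c'}): 2/5}
    print(all_conf(sup, frozenset({'m', 'c'})))   # (2/5) / (3/5) = 0.667
    print(coherence(tdb, {'m', 'c'}))             # 2 / 4 = 0.5

Because both measures are downward closed, an Apriori-style level-wise search can prune with them exactly as it prunes with support.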