Frequent Pattern and Association Analysis
(based on the slides for the book "Data Mining: Concepts and Techniques")

Outline
- Basic concepts
- Scalable mining methods
- Mining a variety of rules and interesting patterns
- Constraint-based mining
- Mining sequential and structured patterns
- Extensions and applications

What Is Frequent Pattern Analysis?
- Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
- First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of association rule mining

Motivation
- Finding inherent regularities in data:
  - What products are often purchased together? Beer and diapers?!
  - What are the subsequent purchases after buying a PC?
  - What kinds of DNA are sensitive to this new drug?
  - Can we automatically classify web documents?
- Examples: basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, DNA sequence analysis, etc.

Basic Concepts (1)
- Item: Boolean variable representing its presence or absence
- Basket: Boolean vector of item variables
- Baskets are analyzed to discover patterns of items that are frequently associated (bought) together
- Association rules: association patterns of the form X => Y

Basic Concepts (2)

  Transaction-id   Items bought
  10               A, B, D
  20               A, C, D
  30               A, D, E
  40               B, E, F
  50               B, C, D, E, F

(Venn-diagram figure: customers who buy beer, customers who buy diapers, and their overlap.)

- Itemset: X = {x1, …, xk}
- Goal: find all the rules X => Y with minimum support and confidence
- Support s: probability that a transaction contains X ∪ Y
    support = #transactions containing X ∪ Y / #transactions = P(X ∪ Y)
- Confidence c: conditional probability that a transaction containing X also contains Y
    confidence = #transactions containing X ∪ Y / #transactions containing X = P(Y|X)

Basic Concepts (3)
- Interesting (or strong) rules satisfy both a minimum support threshold and a minimum confidence threshold
- For the database above, let sup_min = 50% and conf_min = 50%
- Frequent patterns: {A:3, B:3, D:4, E:3, AD:3}
- Association rules:
  - A => D (60%, 100%): 60% of all analyzed transactions show that A and D are bought together; 100% of customers that bought A also bought D
  - D => A (60%, 75%)

Basic Concepts (4)
- Itemset (or pattern): a set of items (a k-itemset has k items)
- Occurrence frequency of an itemset: the number of transactions containing the itemset; also known as the frequency, support count, or count of the itemset
- Frequent itemset: an itemset that satisfies minimum support, i.e., frequency >= min_sup * #transactions
- confidence(A => B) = P(B|A) = frequency(A ∪ B) / frequency(A)
- The problem of mining association rules therefore reduces to the problem of mining frequent itemsets

Mining frequent itemsets
- A long pattern contains a combinatorial number of sub-patterns
- Solution: mine closed patterns and max-patterns
- An itemset X is closed frequent if X is frequent and there exists no super-itemset Y ⊃ X with the same support as X
- An itemset X is a max-itemset if X is frequent and there exists no frequent super-itemset Y ⊃ X
- The closed itemsets are a lossless compression of the frequent patterns: they reduce the number of patterns and rules (a brute-force sketch follows below)
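To make these definitions concrete, here is a minimal brute-force Python sketch (an illustration, not from the slides) that finds the frequent, closed, and maximal itemsets of the five-transaction database from Basic Concepts (2); the function names are ours.

    from itertools import combinations

    def frequent_itemsets(transactions, min_sup):
        # Brute force: count every subset of every transaction (toy data only;
        # this exponential enumeration is what motivates Apriori below).
        counts = {}
        for t in transactions:
            for k in range(1, len(t) + 1):
                for sub in combinations(sorted(t), k):
                    counts[sub] = counts.get(sub, 0) + 1
        return {frozenset(s): c for s, c in counts.items() if c >= min_sup}

    def closed_and_maximal(freq):
        closed, maximal = [], []
        for x, sup in freq.items():
            supers = [y for y in freq if x < y]          # frequent proper supersets
            if not any(freq[y] == sup for y in supers):  # no superset, same support
                closed.append(x)
            if not supers:                               # no frequent superset at all
                maximal.append(x)
        return closed, maximal

    db = [{'A','B','D'}, {'A','C','D'}, {'A','D','E'},
          {'B','E','F'}, {'B','C','D','E','F'}]
    freq = frequent_itemsets(db, min_sup=3)  # 50% of 5 transactions
    print(freq)     # {A}:3, {B}:3, {D}:4, {E}:3, {A,D}:3 -- as on the slide
    print(closed_and_maximal(freq))
    # closed:  {B}, {D}, {E}, {A,D}   ({A} is absorbed by {A,D}, same support)
    # maximal: {B}, {E}, {A,D}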
Example
- DB = {<a1, …, a100>, <a1, …, a50>}, min_sup = 1
- What is the set of closed itemsets? <a1, …, a100>: 1 and <a1, …, a50>: 2
- What is the set of max-patterns? <a1, …, a100>: 1
- What is the set of all frequent patterns? All 2^100 - 1 nonempty subsets of <a1, …, a100> -- far too many to enumerate!

Scalable Methods for Mining Frequent Patterns
- Downward closure (or Apriori) property of frequent patterns: any subset of a frequent itemset must be frequent
  - If {beer, diaper, nuts} is frequent, so is {beer, diaper}, i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}
- Scalable mining methods: three major approaches
  - Apriori (Agrawal & Srikant @VLDB'94)
  - Frequent pattern growth (FP-growth -- Han, Pei & Yin @SIGMOD'00)
  - Vertical data format approach (Charm -- Zaki & Hsiao @SDM'02)

Apriori: A Candidate Generation-and-Test Approach
- Method:
  - Initially, scan the DB once to get the frequent 1-itemsets
  - Generate length-(k+1) candidate itemsets from the length-k frequent itemsets
  - Test the candidates against the DB
  - Terminate when no frequent or candidate set can be generated
- Apriori pruning principle: if any itemset is infrequent, its supersets should not be generated/tested!

How to Generate Candidates?
Suppose the items in Lk-1 are listed in some order.
- Step 1: self-joining Lk-1

    insert into Ck
    select p.item1, p.item2, …, p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

- Step 2: pruning

    forall itemsets c in Ck do
        forall (k-1)-subsets s of c do
            if (s is not in Lk-1) then delete c from Ck

Example (sup_min = 2)

  Database TDB
  Tid   Items
  10    A, C, D
  20    B, C, E
  30    A, B, C, E
  40    B, E

  1st scan -> C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
              L1: {A}:2, {B}:3, {C}:3, {E}:3
  2nd scan -> C2: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
              L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2
  3rd scan -> C3: {B,C,E}:2
              L3: {B,C,E}:2

The Apriori Algorithm
Pseudo-code (Ck: candidate itemsets of size k; Lk: frequent itemsets of size k; a runnable sketch follows after the examples below):

    L1 = {frequent items};
    for (k = 1; Lk != ∅; k++) do begin
        Ck+1 = candidates generated from Lk;
        for each transaction t in database do
            increment the count of all candidates in Ck+1 that are contained in t;
        Lk+1 = candidates in Ck+1 with min_support;
    end
    return ∪k Lk;

Important Details of Apriori
- How to generate candidates?
  - Step 1: self-join Lk
  - Step 2: pruning
- Example of candidate generation:
  - L3 = {abc, abd, acd, ace, bcd}
  - Self-joining L3*L3: abcd from abc and abd; acde from acd and ace
  - Pruning: acde is removed because ade is not in L3
  - C4 = {abcd}

Another example (1) / Another example (2)
(Figure slides.)
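The pseudo-code above maps almost line by line onto a small Python implementation. A minimal in-memory sketch, assuming the whole transaction database fits in a list and skipping the hash-tree support counting described later; all names are illustrative:

    from itertools import combinations

    def apriori(transactions, min_sup):
        # Level-wise candidate-generation-and-test (toy, in-memory version).
        items = sorted({i for t in transactions for i in t})
        L = [frozenset([i]) for i in items
             if sum(1 for t in transactions if i in t) >= min_sup]
        frequent = list(L)
        k = 2
        while L:
            # Self-join: merge pairs of (k-1)-itemsets whose union has size k.
            candidates = {a | b for a in L for b in L if len(a | b) == k}
            # Prune: every (k-1)-subset of a candidate must already be frequent.
            prev = set(L)
            candidates = [c for c in candidates
                          if all(frozenset(s) in prev
                                 for s in combinations(c, k - 1))]
            # One pass over the DB to count supports of the pruned candidates.
            L = [c for c in candidates
                 if sum(1 for t in transactions if c <= t) >= min_sup]
            frequent += L
            k += 1
        return frequent

    tdb = [{'A', 'C', 'D'}, {'B', 'C', 'E'}, {'A', 'B', 'C', 'E'}, {'B', 'E'}]
    print(apriori(tdb, min_sup=2))
    # Reproduces the trace above: L1 = {A},{B},{C},{E};
    # L2 = {A,C},{B,C},{B,E},{C,E}; L3 = {B,C,E}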
Generation of Association Rules from Frequent Itemsets
- confidence(A => B) = P(B|A) = support_count(A ∪ B) / support_count(A)
  - support_count(A ∪ B): number of transactions containing the itemset A ∪ B
  - support_count(A): number of transactions containing the itemset A
- Association rules can then be generated as follows (a sketch follows below):
  - For each frequent itemset l, generate all nonempty subsets of l
  - For every nonempty proper subset s of l, output the rule s => (l - s) if support_count(l) / support_count(s) >= min_conf
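A minimal sketch of this rule-generation loop, assuming the frequent itemsets and their support counts (for instance, from the Apriori sketch above) already sit in a dictionary; the names are illustrative:

    from itertools import combinations

    def gen_rules(freq, min_conf):
        # freq maps frozenset itemsets to support counts and is closed under
        # subsets (Apriori guarantees this, so freq[s] is always present).
        rules = []
        for l, sup_l in freq.items():
            for k in range(1, len(l)):              # all nonempty proper subsets
                for s in map(frozenset, combinations(l, k)):
                    conf = sup_l / freq[s]          # sup_count(l) / sup_count(s)
                    if conf >= min_conf:
                        rules.append((s, l - s, conf))
        return rules

    # Frequent itemsets of the sup_min = 2 example, with their support counts:
    freq = {frozenset('A'): 2, frozenset('B'): 3, frozenset('C'): 3,
            frozenset('E'): 3, frozenset('AC'): 2, frozenset('BC'): 2,
            frozenset('BE'): 3, frozenset('CE'): 2, frozenset('BCE'): 2}
    for s, rhs, conf in gen_rules(freq, min_conf=1.0):
        print(set(s), '=>', set(rhs), conf)
    # {A}=>{C}, {B}=>{E}, {E}=>{B}, {B,C}=>{E}, {C,E}=>{B}, all with conf 1.0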
How to Count Supports of Candidates?
- Why is counting supports of candidates a problem?
  - The total number of candidates can be huge
  - One transaction may contain many candidates
- Method based on hashing: candidate itemsets are stored in a hash-tree
  - A leaf node of the hash-tree contains a list of itemsets and counts
  - An interior node contains a hash table
  - Subset function: finds all the candidates contained in a transaction

Challenges of Frequent Pattern Mining
- Challenges:
  - Multiple scans of the transaction database
  - Huge number of candidates
  - Tedious workload of support counting for candidates
- Improving Apriori -- general ideas: partitioning, sampling, and others

Partition: Scan the Database Only Twice
- Any itemset that is potentially frequent in the DB must be frequent in at least one of the partitions of the DB
- Scan 1: partition the database and find local frequent patterns
- Scan 2: consolidate the global frequent patterns

Sampling for Frequent Patterns
- Select a sample of the original database and mine frequent patterns within the sample using Apriori
- Scan the database once to verify the frequent itemsets found in the sample
- Scan the database again to find missed frequent patterns
- Use a lower support threshold to find frequent itemsets in the sample
- Trade-off: accuracy vs. efficiency

Bottleneck of Frequent-pattern Mining
- Multiple database scans are costly
- Mining long patterns needs many passes of scanning and generates lots of candidates
  - To find the frequent itemset i1 i2 … i100: 100 scans, and C(100,1) + C(100,2) + … + C(100,100) = 2^100 - 1 ≈ 1.27 * 10^30 candidates!
- Bottleneck: candidate generation and test
- Can we avoid candidate generation? Yes!

Visualization of Association Rules
(Figure slides: plane graph, rule graph, and SGI/MineSet 3.0 visualizations.)

Mining Various Kinds of Association Rules
- Mining multi-level associations
- Mining multi-dimensional associations
- Mining quantitative associations
- Mining interesting correlation patterns

Mining Multiple-Level Association Rules
- Items often form a hierarchy; items at the lower levels are expected to have lower support
  - Example hierarchy: Milk [support = 10%] with children 2% Milk [support = 6%] and Skim Milk [support = 4%]
- Flexible support settings:
  - Uniform support: level 1 min_sup = 5%, level 2 min_sup = 5%
  - Reduced support: level 1 min_sup = 5%, level 2 min_sup = 3%

Multi-level Association: Redundancy Filtering
- Some rules may be redundant due to "ancestor" relationships between items
- Example:
  - R1: milk => wheat bread [support = 8%, confidence = 70%]
  - R2: 2% milk => wheat bread [support = 2%, confidence = 72%]
- R1 is an ancestor of R2: R1 can be obtained from R2 by replacing items with their ancestors in the concept hierarchy
- R2 is redundant if its support is close to the "expected" value, based on the rule's ancestor

Mining Multi-Dimensional Associations
- Single-dimensional rules: buys(X, "milk") => buys(X, "bread")
- Multi-dimensional rules: 2 or more dimensions or predicates
  - Inter-dimension association rules (no repeated predicates): age(X, "19-25") ∧ occupation(X, "student") => buys(X, "coke")
  - Hybrid-dimension association rules (repeated predicates): age(X, "19-25") ∧ buys(X, "popcorn") => buys(X, "coke")

Recall: Categorical vs. Quantitative Attributes
- Categorical (or nominal) attributes: finite number of possible values, no ordering among values
- Quantitative attributes: numeric, implicit ordering among values

Mining Quantitative Associations
Techniques can be categorized by how numerical attributes, such as age or salary, are treated:
- Static discretization based on predefined concept hierarchies (data cube methods)
- Dynamic discretization based on the data distribution (quantitative rules, e.g., Agrawal & Srikant @SIGMOD'96)
- Clustering: distance-based association (e.g., Yang & Miller @SIGMOD'97) -- one-dimensional clustering, then association
- Deviation (e.g., Aumann & Lindell @KDD'99): Sex = female => Wage: mean = $7/hr (overall mean = $9)

Static Discretization of Quantitative Attributes (1)
- Attributes are discretized prior to mining using a concept hierarchy; numeric values are replaced by ranges
- The frequent itemset mining algorithm must be modified so that frequent predicate sets are searched: instead of searching only one attribute (e.g., buys), search through all relevant attributes (e.g., age, occupation, buys) and treat each attribute-value pair as an itemset

Static Discretization of Quantitative Attributes (2)
- A data cube is well suited for mining and may already exist
- The cells of an n-dimensional cuboid correspond to the predicate sets and can be used to store the support counts
- Mining from data cubes can be much faster
(Cuboid lattice figure: (age), (income), (buys); (age, income), (age, buys), (income, buys); (age, income, buys).)

Mining Interesting Correlation Patterns
- Strong association rules (with high support and confidence) can be uninteresting. Example:

              Basketball   Not basketball   Sum (row)
  Cereal        2000           1750           3750
  Not cereal    1000            250           1250
  Sum (col.)    3000           2000           5000

- play basketball => eat cereal [40%, 66.7%] is misleading: the overall percentage of students eating cereal is 75%, which is higher than 66.7%
- play basketball => not eat cereal [20%, 33.3%] is more accurate, although it has lower support and confidence (a numeric check follows below)

Are All the Rules Found Interesting?
- The confidence of a rule only estimates the conditional probability of item B given item A; it does not measure the real correlation between A and B
- Another example: buy walnuts => buy milk [1%, 80%] is misleading if 85% of customers buy milk
- Support and confidence are not good measures of correlation
- Many other interestingness measures exist (Tan, Kumar & Srivastava @KDD'02)
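As a numeric check of the basketball/cereal example, a short Python sketch using the counts from the table above:

    # Counts from the contingency table on the previous slide.
    n = 5000
    basketball = 3000            # students who play basketball
    cereal = 3750                # students who eat cereal
    basketball_and_cereal = 2000

    support = basketball_and_cereal / n              # 0.40
    confidence = basketball_and_cereal / basketball  # 0.667
    baseline = cereal / n                            # 0.75

    # The rule "basketball => cereal" holds at [40%, 66.7%], yet eating
    # cereal is *more* likely (75%) regardless of basketball: a confidence
    # below the baseline P(cereal) means the two are negatively associated.
    print(support, confidence, baseline)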
Are All the Rules Found Interesting? (cont.)
- Correlation rule: A => B [support, confidence, correlation measure]
- Measures of dependent/correlated events: the correlation coefficient, or lift
- The occurrence of A is independent of the occurrence of B if P(A ∪ B) = P(A) P(B)

    lift(A, B) = P(A ∪ B) / (P(A) P(B)) = P(B|A) / P(B) = conf(A => B) / sup(B)

- If lift > 1, A and B are positively correlated; if lift < 1, they are negatively correlated

Lift: Example (rule Coffee => Milk)

              Milk     No Milk   Sum (row)
  Coffee      m, c     ~m, c     c
  No Coffee   m, ~c    ~m, ~c    ~c
  Sum (col.)  m        ~m

  DB    m, c    ~m, c    m, ~c    ~m, ~c    lift
  A1    1000      100      100     10,000   9.26
  A2     100     1000     1000    100,000   8.44
  A3    1000      100    10000    100,000   9.18
  A4    1000     1000     1000      1,000   1

Mining Highly Correlated Patterns
- χ² and lift are not good measures for correlations in transactional DBs
- all-confidence or coherence could be good measures (Omiecinski @TKDE'03):

    all_conf(X) = sup(X) / max_item_sup(X)
    coh(X) = sup(X) / |universe(X)|

  where max_item_sup(X) is the maximum support of any single item in X, and universe(X) is the set of transactions containing at least one item of X (both are sketched below)
- Both all-confidence and coherence have the downward closure property, so efficient mining algorithms can be derived (Lee et al. @ICDM'03sub)

Bibliography (Book)
- Data Mining: Concepts and Techniques, J. Han & M. Kamber, Morgan Kaufmann, 2001 (Chapter 6 in the 2001 book; Chapter 4 in the draft)
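Finally, a minimal sketch of the two correlation measures on toy data; the helper names all_conf and coherence are illustrative, not from any library:

    def all_conf(sup, X):
        # all_conf(X) = sup(X) / max_item_sup(X)
        return sup[X] / max(sup[frozenset([i])] for i in X)

    def coherence(transactions, X):
        # coh(X) = sup_count(X) / |universe(X)|, where universe(X) is the
        # set of transactions containing at least one item of X.
        sup_count = sum(1 for t in transactions if X <= t)
        universe = [t for t in transactions if X & t]
        return sup_count / len(universe)

    # Toy coffee/milk data: m = milk, c = coffee, b = bread.
    tdb = [{'m', 'c'}, {'m', 'c'}, {'m'}, {'c'}, {'b'}]
    sup = {frozenset({'m'}): 3/5, frozenset({'c'}): 3/5,
           frozenset({'m', 'c'}): 2/5}
    print(all_conf(sup, frozenset({'m', 'c'})))   # (2/5) / (3/5) = 0.667
    print(coherence(tdb, {'m', 'c'}))             # 2 / 4 = 0.5

Because both measures are downward closed, an Apriori-style level-wise search can prune with them exactly as it prunes with support.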