Association Rules & Correlations

- Basic concepts
- Efficient and scalable frequent itemset mining methods: Apriori and improvements; FP-growth
- Rule post-mining: visualization and validation
- Interesting association rules

Rule Validation

- Only a small subset of the derived rules might be meaningful/useful
- A domain expert must validate the rules
- Useful tools: visualization, correlation analysis

Visualization of Association Rules: Plane Graph

[figure only: association rules drawn as a plane graph]

Visualization of Association Rules (SGI/MineSet 3.0)

[figure only: screenshot of the MineSet 3.0 rule visualizer]

Pattern Evaluation

- Association rule algorithms tend to produce too many rules; many of them are uninteresting or redundant
- confidence(A → B) = P(B | A) = P(A & B) / P(A)
- Confidence alone is not a discriminative enough criterion
- Beyond the original support & confidence: interestingness measures can be used to prune and rank the derived patterns

Application of Interestingness Measures

[figure only: where interestingness measures fit in the rule-mining pipeline]

Computing an Interestingness Measure

Given a rule X → Y, the information needed to compute the rule's interestingness can be obtained from a contingency table.

Contingency table for X → Y:

             Y      ¬Y
    X       f11     f10     f1+
    ¬X      f01     f00     f0+
            f+1     f+0     |T|

    f11: support of X and Y
    f10: support of X and ¬Y
    f01: support of ¬X and Y
    f00: support of ¬X and ¬Y

The table is used to define various measures: support, confidence, lift, Gini, J-measure, etc.

Drawback of Confidence

            Coffee   ¬Coffee
    Tea       15        5       20
    ¬Tea      75        5       80
              90       10      100

- Association rule: Tea → Coffee
- Confidence = P(Coffee | Tea) = 15/20 = 0.75, but P(Coffee) = 0.9 > 0.75
- Although the confidence is high, the rule is misleading: P(Coffee | ¬Tea) = 75/80 = 0.9375 >> 0.75

Statistical-Based Measures

- Measures that take statistical dependence into account
- Does X lift the probability of Y? I.e., divide the probability of Y given X by the probability of Y:

    Lift = P(Y | X) / P(Y)

- This is the same as the interest factor:

    I = P(X, Y) / (P(X) P(Y))

  where I = 1 indicates independence, I > 1 a positive association, and I < 1 a negative one
- PS = P(X, Y) - P(X) P(Y)    (PS: Piatetsky-Shapiro)
- Many other measures exist

Example: Lift/Interest

            Coffee   ¬Coffee
    Tea       15        5       20
    ¬Tea      75        5       80
              90       10      100

- Association rule: Tea → Coffee
- Confidence = P(Coffee | Tea) = 0.75, but P(Coffee) = 0.9
- Lift = 0.75 / 0.9 = 0.8333 (< 1, therefore Tea and Coffee are negatively associated)

Drawback of Lift & Interest

- Statistical independence: if P(X, Y) = P(X) P(Y), then Lift = 1

             Y      ¬Y                         Y      ¬Y
    X       10       0      10         X      90       0      90
    ¬X       0      90      90         ¬X      0      10      10
            10      90     100                90      10     100

    Lift = 0.1 / (0.1 × 0.1) = 10      Lift = 0.9 / (0.9 × 0.9) = 1.11

- Lift therefore favors infrequent items
- Other criteria have been proposed: Gini, J-measure, etc.

There are lots of measures proposed in the literature:

- Some measures are good for certain applications, but not for others
- What criteria should we use to determine whether a measure is good or bad?
- What about Apriori-style support-based pruning? How does it affect these measures?
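To make the measures above concrete, here is a minimal Python sketch (ours, not from the slides; the helper name rule_measures is our choice) that computes support, confidence, lift/interest, and the Piatetsky-Shapiro measure directly from the four contingency counts, and reproduces the tea/coffee numbers:

    def rule_measures(f11, f10, f01, f00):
        """Interestingness measures for a rule X -> Y from 2x2 contingency counts.

        f11: transactions with X and Y       f10: transactions with X, not Y
        f01: transactions with Y, not X      f00: transactions with neither
        """
        n = f11 + f10 + f01 + f00        # |T|, total number of transactions
        p_xy = f11 / n                   # P(X, Y)
        p_x = (f11 + f10) / n            # P(X)
        p_y = (f11 + f01) / n            # P(Y)
        return {
            "support": p_xy,             # P(X, Y)
            "confidence": p_xy / p_x,    # P(Y | X)
            "lift": p_xy / (p_x * p_y),  # = P(Y | X) / P(Y), the interest factor I
            "PS": p_xy - p_x * p_y,      # Piatetsky-Shapiro
        }

    # Tea -> Coffee from the table above: f11 = 15, f10 = 5, f01 = 75, f00 = 5
    print(rule_measures(15, 5, 75, 5))
    # confidence = 0.75, lift ~ 0.8333 (< 1: negative association), PS = -0.03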
Association Rules & Correlations

- Basic concepts
- Efficient and scalable frequent itemset mining methods: Apriori and improvements; FP-growth
- Rule derivation, visualization and validation
- Multi-level Associations
- Summary

Multiple-Level Association Rules

- Items often form a hierarchy:

    [figure: item hierarchy with food at the root; bread and milk below it; wheat, white, 2%, and skim at the next level; brand leaves such as Fraser and Sunset]

- Items at the lower levels are expected to have lower support
- Rules regarding itemsets at the appropriate levels could be quite useful
- The transaction database can be encoded based on dimensions and levels
- We can explore shared multi-level mining

    TID    Items
    T1     {111, 121, 211, 221}
    T2     {111, 211, 222, 323}
    T3     {112, 122, 221, 411}
    T4     {111, 121}
    T5     {111, 122, 211, 221, 413}

Mining Multi-Level Associations

- A top-down, progressive deepening approach:
  - First find high-level strong rules: milk → bread [20%, 60%]
  - Then find their lower-level "weaker" rules: 2% milk → wheat bread [6%, 50%]
- Variations on mining multiple-level association rules:
  - Level-crossed association rules: 2% milk → Wonder wheat bread
  - Association rules with multiple, alternative hierarchies: 2% milk → Wonder bread

Multi-level Association: Uniform Support vs. Reduced Support

- Uniform Support: the same minimum support for all levels
  - (+) One minimum support threshold; no need to examine itemsets containing any item whose ancestors do not have minimum support
  - (-) Lower-level items do not occur as frequently: if the support threshold is too high, we miss low-level associations; if it is too low, we generate too many high-level associations
- Reduced Support: reduced minimum support at lower levels. There are 4 search strategies:
  - Level-by-level independent
  - Level-cross filtering by k-itemset
  - Level-cross filtering by single item
  - Controlled level-cross filtering by single item

Uniform Support

Multi-level mining with uniform support:

    Level 1 (min_sup = 5%):  Milk [support = 10%]
    Level 2 (min_sup = 5%):  2% Milk [support = 6%]    Skim Milk [support = 4%]

Reduced Support

Multi-level mining with reduced support:

    Level 1 (min_sup = 5%):  Milk [support = 10%]
    Level 2 (min_sup = 3%):  2% Milk [support = 6%]    Skim Milk [support = 4%]

Multi-level Association: Redundancy Filtering

- Some rules may be redundant due to "ancestor" relationships between items. Example:
  - milk → wheat bread [support = 8%, confidence = 70%]
  - Say that 2% milk accounts for 25% of milk sales; then: 2% milk → wheat bread [support = 2%, confidence = 72%]
- We say the first rule is an ancestor of the second rule
- A rule is redundant if its support is close to the "expected" value derived from the rule's ancestor (a sketch of this test follows below)
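A minimal sketch of the redundancy test just described (the helper name, and the tolerance parameter that makes "close to expected" precise, are our assumptions, not from the slides): the descendant rule's expected support is the ancestor rule's support scaled by the descendant item's share of the ancestor item's sales.

    def is_redundant(ancestor_support, descendant_share, observed_support, tol=0.2):
        """Flag a multi-level rule as redundant w.r.t. its ancestor rule.

        ancestor_support:  support of the ancestor rule (milk -> wheat bread: 0.08)
        descendant_share:  fraction of the ancestor item covered by the descendant
                           (2% milk is 25% of milk sales: 0.25)
        observed_support:  support of the descendant rule (2% milk -> wheat bread: 0.02)
        tol:               relative tolerance for "close to expected" (our choice)
        """
        expected = ancestor_support * descendant_share
        return abs(observed_support - expected) <= tol * expected

    # The example above: expected = 0.08 * 0.25 = 0.02 = observed, so redundant
    print(is_redundant(0.08, 0.25, 0.02))   # True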
Multi-Level Mining: Progressive Deepening

- A top-down, progressive deepening approach:
  - First mine high-level frequent items: milk (15%), bread (10%)
  - Then mine their lower-level "weaker" frequent itemsets: 2% milk (5%), wheat bread (4%)
- Different min_support thresholds across the levels lead to different algorithms:
  - If adopting the same min_support across all levels, toss an item t if any of t's ancestors is infrequent
  - If adopting reduced min_support at lower levels, examine only those descendants whose ancestor's support is frequent/non-negligible

Association Rules & Correlations

- Basic concepts
- Efficient and scalable frequent itemset mining methods: Apriori and improvements; FP-growth
- Rule derivation, visualization and validation
- Multi-level Associations
- Temporal associations and frequent sequences [covered later]
- Other association mining methods
- Summary

Other Association Mining Methods

- CHARM: mining frequent itemsets in a vertical data format
- Mining frequent closed patterns
- Mining max-patterns
- Mining quantitative associations (e.g., what is the implication between age and income?)
- Constraint-based association mining
- Frequent patterns in data streams: a very difficult problem; performance is a real issue
- Constraint-based (query-directed) mining
- Mining sequential and structured patterns

Summary

- Association rule mining is probably the most significant contribution from the database community to KDD
- New interesting research directions:
  - Association analysis in other types of data: spatial data, multimedia data, time-series data
  - Association rule mining for data streams: a very difficult challenge

Statistical Independence

- Population of 1000 students:
  - 600 students know how to swim (S)
  - 700 students know how to bike (B)
  - 420 students know how to swim and bike (S, B)
- P(S, B) = 420/1000 = 0.42
- P(S) × P(B) = 0.6 × 0.7 = 0.42
- P(S, B) = P(S) × P(B) => statistically independent
- P(S, B) > P(S) × P(B) => positively correlated
- P(S, B) < P(S) × P(B) => negatively correlated
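A tiny sketch (ours, not from the slides) that performs this three-way classification from raw counts; it cross-multiplies integers rather than comparing floats, so the exact-independence case in the student example is not lost to rounding:

    def correlation_type(n, n_s, n_b, n_sb):
        """Classify S and B by comparing P(S, B) with P(S) * P(B)."""
        joint = n_sb * n        # proportional to P(S, B)
        product = n_s * n_b     # proportional to P(S) * P(B)
        if joint > product:
            return "positively correlated"
        if joint < product:
            return "negatively correlated"
        return "statistically independent"

    # The student example: 1000 students, 600 swim, 700 bike, 420 do both
    print(correlation_type(1000, 600, 700, 420))   # statistically independent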