Association Rules & Correlations

- Basic concepts
- Efficient and scalable frequent itemset mining methods: Apriori and improvements; FP-growth
- Rule post-mining: visualization and validation
- Interesting association rules

Rule Validation

- Only a small subset of the derived rules might be meaningful/useful
- A domain expert must validate the rules
- Useful tools: visualization, correlation analysis

Visualization of Association Rules: Plane Graph

[figure only: association rules drawn as a plane graph]

Visualization of Association Rules (SGI/MineSet 3.0)

[figure only: screenshot of the MineSet 3.0 rule visualizer]

Pattern Evaluation

- Association rule algorithms tend to produce too many rules; many of them are uninteresting or redundant
- confidence(A → B) = P(B | A) = P(A & B) / P(A)
- Confidence alone is not a discriminative enough criterion
- Beyond the original support & confidence: interestingness measures can be used to prune and rank the derived patterns

Application of Interestingness Measures

[figure only: where interestingness measures fit in the rule-mining pipeline]

Computing an Interestingness Measure

Given a rule X → Y, the information needed to compute the rule's interestingness can be obtained from a contingency table.

Contingency table for X → Y:

             Y      ¬Y
    X       f11     f10     f1+
    ¬X      f01     f00     f0+
            f+1     f+0     |T|

    f11: support of X and Y
    f10: support of X and ¬Y
    f01: support of ¬X and Y
    f00: support of ¬X and ¬Y

The table is used to define various measures: support, confidence, lift, Gini, J-measure, etc.

Drawback of Confidence

            Coffee   ¬Coffee
    Tea       15        5       20
    ¬Tea      75        5       80
              90       10      100

- Association rule: Tea → Coffee
- Confidence = P(Coffee | Tea) = 15/20 = 0.75, but P(Coffee) = 0.9 > 0.75
- Although the confidence is high, the rule is misleading: P(Coffee | ¬Tea) = 75/80 = 0.9375 >> 0.75

Statistical-Based Measures

- Measures that take statistical dependence into account
- Does X lift the probability of Y? I.e., divide the probability of Y given X by the probability of Y:

    Lift = P(Y | X) / P(Y)

- This is the same as the interest factor:

    I = P(X, Y) / (P(X) P(Y))

  where I = 1 indicates independence, I > 1 a positive association, and I < 1 a negative one
- PS = P(X, Y) - P(X) P(Y)    (PS: Piatetsky-Shapiro)
- Many other measures exist

Example: Lift/Interest

            Coffee   ¬Coffee
    Tea       15        5       20
    ¬Tea      75        5       80
              90       10      100

- Association rule: Tea → Coffee
- Confidence = P(Coffee | Tea) = 0.75, but P(Coffee) = 0.9
- Lift = 0.75 / 0.9 = 0.8333 (< 1, therefore Tea and Coffee are negatively associated)

Drawback of Lift & Interest

- Statistical independence: if P(X, Y) = P(X) P(Y), then Lift = 1

             Y      ¬Y                         Y      ¬Y
    X       10       0      10         X      90       0      90
    ¬X       0      90      90         ¬X      0      10      10
            10      90     100                90      10     100

    Lift = 0.1 / (0.1 × 0.1) = 10      Lift = 0.9 / (0.9 × 0.9) = 1.11

- Lift therefore favors infrequent items
- Other criteria have been proposed: Gini, J-measure, etc.

There are lots of measures proposed in the literature:

- Some measures are good for certain applications, but not for others
- What criteria should we use to determine whether a measure is good or bad?
- What about Apriori-style support-based pruning? How does it affect these measures?
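To make the measures above concrete, here is a minimal Python sketch (ours, not from the slides; the helper name rule_measures is our choice) that computes support, confidence, lift/interest, and the Piatetsky-Shapiro measure directly from the four contingency counts, and reproduces the tea/coffee numbers:

    def rule_measures(f11, f10, f01, f00):
        """Interestingness measures for a rule X -> Y from 2x2 contingency counts.

        f11: transactions with X and Y       f10: transactions with X, not Y
        f01: transactions with Y, not X      f00: transactions with neither
        """
        n = f11 + f10 + f01 + f00        # |T|, total number of transactions
        p_xy = f11 / n                   # P(X, Y)
        p_x = (f11 + f10) / n            # P(X)
        p_y = (f11 + f01) / n            # P(Y)
        return {
            "support": p_xy,             # P(X, Y)
            "confidence": p_xy / p_x,    # P(Y | X)
            "lift": p_xy / (p_x * p_y),  # = P(Y | X) / P(Y), the interest factor I
            "PS": p_xy - p_x * p_y,      # Piatetsky-Shapiro
        }

    # Tea -> Coffee from the table above: f11 = 15, f10 = 5, f01 = 75, f00 = 5
    print(rule_measures(15, 5, 75, 5))
    # confidence = 0.75, lift ~ 0.8333 (< 1: negative association), PS = -0.03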
Association Rules & Correlations

- Basic concepts
- Efficient and scalable frequent itemset mining methods: Apriori and improvements; FP-growth
- Rule derivation, visualization and validation
- Multi-level Associations
- Summary

Multiple-Level Association Rules

- Items often form a hierarchy:

    [figure: item hierarchy with food at the root; bread and milk below it; wheat, white, 2%, and skim at the next level; brand leaves such as Fraser and Sunset]

- Items at the lower levels are expected to have lower support
- Rules regarding itemsets at the appropriate levels could be quite useful
- The transaction database can be encoded based on dimensions and levels
- We can explore shared multi-level mining

    TID    Items
    T1     {111, 121, 211, 221}
    T2     {111, 211, 222, 323}
    T3     {112, 122, 221, 411}
    T4     {111, 121}
    T5     {111, 122, 211, 221, 413}

Mining Multi-Level Associations

- A top-down, progressive deepening approach:
  - First find high-level strong rules: milk → bread [20%, 60%]
  - Then find their lower-level "weaker" rules: 2% milk → wheat bread [6%, 50%]
- Variations on mining multiple-level association rules:
  - Level-crossed association rules: 2% milk → Wonder wheat bread
  - Association rules with multiple, alternative hierarchies: 2% milk → Wonder bread

Multi-level Association: Uniform Support vs. Reduced Support

- Uniform Support: the same minimum support for all levels
  - (+) One minimum support threshold; no need to examine itemsets containing any item whose ancestors do not have minimum support
  - (-) Lower-level items do not occur as frequently: if the support threshold is too high, we miss low-level associations; if it is too low, we generate too many high-level associations
- Reduced Support: reduced minimum support at lower levels. There are 4 search strategies:
  - Level-by-level independent
  - Level-cross filtering by k-itemset
  - Level-cross filtering by single item
  - Controlled level-cross filtering by single item

Uniform Support

Multi-level mining with uniform support:

    Level 1 (min_sup = 5%):  Milk [support = 10%]
    Level 2 (min_sup = 5%):  2% Milk [support = 6%]    Skim Milk [support = 4%]

Reduced Support

Multi-level mining with reduced support:

    Level 1 (min_sup = 5%):  Milk [support = 10%]
    Level 2 (min_sup = 3%):  2% Milk [support = 6%]    Skim Milk [support = 4%]

Multi-level Association: Redundancy Filtering

- Some rules may be redundant due to "ancestor" relationships between items. Example:
  - milk → wheat bread [support = 8%, confidence = 70%]
  - Say that 2% milk accounts for 25% of milk sales; then: 2% milk → wheat bread [support = 2%, confidence = 72%]
- We say the first rule is an ancestor of the second rule
- A rule is redundant if its support is close to the "expected" value derived from the rule's ancestor (a sketch of this test follows below)
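A minimal sketch of the redundancy test just described (the helper name, and the tolerance parameter that makes "close to expected" precise, are our assumptions, not from the slides): the descendant rule's expected support is the ancestor rule's support scaled by the descendant item's share of the ancestor item's sales.

    def is_redundant(ancestor_support, descendant_share, observed_support, tol=0.2):
        """Flag a multi-level rule as redundant w.r.t. its ancestor rule.

        ancestor_support:  support of the ancestor rule (milk -> wheat bread: 0.08)
        descendant_share:  fraction of the ancestor item covered by the descendant
                           (2% milk is 25% of milk sales: 0.25)
        observed_support:  support of the descendant rule (2% milk -> wheat bread: 0.02)
        tol:               relative tolerance for "close to expected" (our choice)
        """
        expected = ancestor_support * descendant_share
        return abs(observed_support - expected) <= tol * expected

    # The example above: expected = 0.08 * 0.25 = 0.02 = observed, so redundant
    print(is_redundant(0.08, 0.25, 0.02))   # True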
Multi-Level Mining: Progressive Deepening

- A top-down, progressive deepening approach:
  - First mine high-level frequent items: milk (15%), bread (10%)
  - Then mine their lower-level "weaker" frequent itemsets: 2% milk (5%), wheat bread (4%)
- Different min_support thresholds across the levels lead to different algorithms:
  - If adopting the same min_support across all levels, toss an item t if any of t's ancestors is infrequent
  - If adopting reduced min_support at lower levels, examine only those descendants whose ancestor's support is frequent/non-negligible

Association Rules & Correlations

- Basic concepts
- Efficient and scalable frequent itemset mining methods: Apriori and improvements; FP-growth
- Rule derivation, visualization and validation
- Multi-level Associations
- Temporal associations and frequent sequences [covered later]
- Other association mining methods
- Summary

Other Association Mining Methods

- CHARM: mining frequent itemsets in a vertical data format
- Mining frequent closed patterns
- Mining max-patterns
- Mining quantitative associations (e.g., what is the implication between age and income?)
- Constraint-based association mining
- Frequent patterns in data streams: a very difficult problem; performance is a real issue
- Constraint-based (query-directed) mining
- Mining sequential and structured patterns

Summary

- Association rule mining is probably the most significant contribution from the database community to KDD
- New interesting research directions:
  - Association analysis in other types of data: spatial data, multimedia data, time-series data
  - Association rule mining for data streams: a very difficult challenge

Statistical Independence

- Population of 1000 students:
  - 600 students know how to swim (S)
  - 700 students know how to bike (B)
  - 420 students know how to swim and bike (S, B)
- P(S, B) = 420/1000 = 0.42
- P(S) × P(B) = 0.6 × 0.7 = 0.42
- P(S, B) = P(S) × P(B) => statistically independent
- P(S, B) > P(S) × P(B) => positively correlated
- P(S, B) < P(S) × P(B) => negatively correlated
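A tiny sketch (ours, not from the slides) that performs this three-way classification from raw counts; it cross-multiplies integers rather than comparing floats, so the exact-independence case in the student example is not lost to rounding:

    def correlation_type(n, n_s, n_b, n_sb):
        """Classify S and B by comparing P(S, B) with P(S) * P(B)."""
        joint = n_sb * n        # proportional to P(S, B)
        product = n_s * n_b     # proportional to P(S) * P(B)
        if joint > product:
            return "positively correlated"
        if joint < product:
            return "negatively correlated"
        return "statistically independent"

    # The student example: 1000 students, 600 swim, 700 bike, 420 do both
    print(correlation_type(1000, 600, 700, 420))   # statistically independent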