Selecting the Right Interestingness Measure for Association Patterns
Pang-Ning Tan, Vipin Kumar, and Jaideep Srivastava
Department of Computer Science and Engineering, University of Minnesota
Presented by Ahmet Bulut (Bulut, Singh)

Motivation
• A major data mining problem: analysis of relationships among variables
• Finding sets of binary variables that co-occur together
  – Association Rule Mining [Agrawal et al.]
• How do we find a suitable metric to capture the dependencies between variables (defined in terms of contingency tables)?
• Many metrics provide conflicting information
• Goal: an automated way of choosing the best metric for an application domain

Measuring Dependence
• Compare the joint probability P(A, B) against the independence baseline P(A)·P(B); the two are equal under statistical independence

Justification for Conflicts
• Example: table E10 is ranked highest by the I (interest) measure but lowest according to the φ coefficient
• We need to recognize the intrinsic properties of the existing measures

Analysis of a Measure
• Example: the relationship between the gender of a student and the grade obtained in a course
• Scale the number of male students by 2 and the number of female students by 10
• One expects scale-invariance in this particular application
• Most measures are sensitive to scaling of rows and columns, e.g., the gini index, interest, and mutual information

Solutions to zero in
• Support-based pruning
  – Eliminate uncorrelated and poorly correlated patterns
• Table standardization
  – Modify contingency tables to have uniform margins
  – Many measures then provide non-conflicting information
• Expectation of domain experts
  – Choose the measure that agrees with the expectations the most
  – The number of contingency tables, |T|, is high
  – It is possible to extract a small subset, S, of contingency tables
  – Find the best measure for S as an approximation for T

Preliminaries
• T(D) = {t1, t2, ..., tN}: the set of contingency tables derived from dataset D
• For each measure M in the set of measures P: M(T) = {m1, m2, ..., mN} are the measure values, and OM(T) = {o1, o2, ..., oN} is the corresponding ranking vector
• The similarity between any two measures M1 and M2: the similarity between OM1(T) and OM2(T)
• The similarity metric used is the correlation coefficient: if corr(OM1(T), OM2(T)) > threshold, the measures are similar

Desired Properties of a Measure M

Properties of a Measure M cont'd.
• Denote a 2x2 contingency table as a contingency matrix M = [f11 f10; f01 f00]
• An interestingness measure is a matrix operator O such that O(M) = k, where k is a scalar
  – For instance, for the φ coefficient as the interestingness measure, k equals a normalized form of the determinant operator, Det(M) = f11·f00 − f01·f10
• Statistical independence corresponds to a singular matrix M, whose determinant equals 0

Properties of a Measure M cont'd.
• Property 1: Symmetry under variable permutation: O(M^T) = O(M)
  – cosine (IS), interest factor (I), and odds ratio (α) are symmetric
• Property 2: Row/Column Scaling Invariance: with R = C = [k1 0; 0 k2], R × M is a row scaling and M × R is a column scaling
  – If O(RM) = O(M) and O(MR) = O(M), then the measure is scale-invariant
  – The odds ratio (α) satisfies this property
• Property 3: Antisymmetry under row/column permutation: with S = [0 1; 1 0]
  – If O(SM) = −O(M), the measure is antisymmetric under row permutation
  – If O(MS) = −O(M), the measure is antisymmetric under column permutation
  – Measures that are symmetric under the row and column permutation operations make no distinction between positive and negative correlations of a table; for example, the gini index
• Property 4: Inversion Invariance: if O(SMS) = O(M), the measure is invariant under inversion
  – Insight: inversion flips presence with absence and vice versa for binary variables
  – The φ coefficient and odds ratio, along with Yule's Q and Y and collective strength, are symmetric binary measures
  – The Jaccard measure is asymmetric

Property 4 and Property 5
• Market basket analysis requires unequal treatment of the binary values of a variable
  – A symmetric binary measure is therefore not suitable here
• Property 5: Null Invariance: O(M + C) = O(M), where C = [0 0; 0 k] and k is a positive constant
  – For binary variables: adding more records that contain neither of the two variables under consideration leaves the measure unchanged, so co-occurrence is emphasized

Effect of Support-Based Pruning
• Randomly generated synthetic dataset of 10,000 contingency tables
• Darker cells indicate measure pairs with correlation > 0.85; lighter cells indicate otherwise
• Under tighter bounds on the support of the patterns, many measures become correlated

Elimination of Poorly Correlated Tables using Support-Based Pruning
• A minimum support threshold prunes out the low-support patterns
• A maximum support threshold eliminates uncorrelated, negatively correlated, and positively correlated tables in equal proportion
• A lower bound on support will prune out mostly the negatively correlated or uncorrelated tables
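The pruning effect above can be sketched in a few lines of Python. This is not from the slides: the table counts and the 30% minimum-support threshold are invented for illustration, and only the φ coefficient is computed.

```python
from math import sqrt

def stats(f11, f10, f01, f00):
    """Support and phi-coefficient of a 2x2 contingency table
    [f11 f10; f01 f00] over N = f11 + f10 + f01 + f00 records."""
    n = f11 + f10 + f01 + f00
    support = f11 / n                       # P(A, B)
    pa, pb = (f11 + f10) / n, (f11 + f01) / n
    # phi = (P(A,B) - P(A)P(B)) / sqrt(P(A)P(B)(1-P(A))(1-P(B)))
    phi = (support - pa * pb) / sqrt(pa * pb * (1 - pa) * (1 - pb))
    return support, phi

tables = [
    (80, 5, 5, 10),   # positively correlated, high support
    (2, 50, 50, 8),   # negatively correlated, low support
    (5, 45, 45, 5),   # negatively correlated, low support
]

# A minimum-support floor discards the low-support tables; here that
# removes exactly the negatively correlated ones, illustrating the
# slide's claim about a lower bound on support.
kept = [t for t in tables if stats(*t)[0] >= 0.3]
```

Running this keeps only the first table: the two negatively correlated tables both have joint support well below the threshold.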
Table Standardization
• A standardized table: a visual depiction of the joint distribution of two variables after elimination of non-uniform marginals

Implications of Standardization
• The rankings from different measures become identical

Implications of Standardization cont'd
• After standardization, a matrix has the form [x y; y x], where y = N/2 − x and x = f*11
• Nearly all of the measures are monotonically increasing functions of x
  – They therefore produce identical rankings on standardized, positively correlated tables, i.e., for values of x with N/4 < x < N/2
  – Some measures do not satisfy this property
• IPF (iterative proportional fitting) favors the odds ratio measure, so the final rankings agree with the odds ratio rankings before standardization
• Take-away: different standardization techniques may be more appropriate for different application domains

Measure Selection Based on Rankings by Experts
• Ideally, experts rank all the contingency tables, and the best measure is chosen accordingly
• This is a laborious task if the number of tables is too large
• Instead, provide a smaller set of tables from which to decide the best measure

Table Selection via the Disjoint Algorithm
• Use the Disjoint algorithm to choose a subset of tables of cardinality k
• Rank the tables according to the various measures
• Compute the similarity between different measures
• A good table selection scheme minimizes D(S_S, S_T) = max_{i,j} |S_T(i, j) − S_S(i, j)|

Experimental Results

Conclusions
• Key properties to consider for selecting the right measure
• No measure is consistently better than the others
• There are situations where most measures provide correlated information
• Choosing the right measure on an unbiased small subset of all the tables gives a good estimate of the ideal solution
• Future work:
  – Extension to k-way contingency tables
  – Associations between mixed data types
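As a closing illustration of Property 2 (row/column scaling invariance) and the grade–gender example, here is a minimal Python sketch. The table values are invented, not taken from the paper; rows are high/low grade and columns are male/female, with the male column scaled by 2 and the female column by 10 as in the slide.

```python
def odds_ratio(f11, f10, f01, f00):
    # alpha = (f11 * f00) / (f10 * f01)
    return (f11 * f00) / (f10 * f01)

def interest(f11, f10, f01, f00):
    # I = P(A,B) / (P(A) * P(B)) = f11 * N / ((f11 + f10) * (f11 + f01))
    n = f11 + f10 + f01 + f00
    return (f11 * n) / ((f11 + f10) * (f11 + f01))

# Hypothetical contingency matrix [f11 f10; f01 f00]:
# column 1 = male, column 2 = female.
t = (30, 20, 10, 40)

# Column scaling M x C with C = [2 0; 0 10]: duplicate male records
# twice and female records ten times.
scaled = (30 * 2, 20 * 10, 10 * 2, 40 * 10)

print(odds_ratio(*t), odds_ratio(*scaled))  # unchanged: odds ratio is scale-invariant
print(interest(*t), interest(*scaled))      # changes: interest factor is not
```

The odds ratio is identical before and after scaling, while the interest factor shifts, matching the slide's point that most measures (unlike the odds ratio) are sensitive to row/column scaling.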