Selecting the Right Interestingness
Measure for Association Patterns
Pang-Ning Tan, Vipin Kumar, and Jaideep Srivastava
Department of Computer Science and Engineering
University of Minnesota
Presented by Ahmet Bulut
Bulut, Singh #
Motivation
• Major data mining problem: the analysis of relationships among variables
• Finding sets of binary variables that co-occur together
  – Association Rule Mining [Agrawal et al.]
• How do we find a suitable metric to capture the dependencies between variables (defined in terms of contingency tables)?
• Many metrics provide conflicting information
• Goal: an automated way of choosing the best metric for an application domain
Interest factor: I = P(A ∩ B) / (P(A) · P(B))
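As a quick illustration, the interest factor can be computed directly from the four cells of a 2x2 contingency table. A minimal sketch (the function name and counts are illustrative, not from the paper):

```python
# Interest factor I = P(A,B) / (P(A) * P(B)), computed from a 2x2
# contingency table with cells f11, f10, f01, f00 (illustrative sketch).
def interest_factor(f11, f10, f01, f00):
    n = f11 + f10 + f01 + f00        # total number of transactions
    p_ab = f11 / n                   # P(A, B): both variables present
    p_a = (f11 + f10) / n            # P(A)
    p_b = (f11 + f01) / n            # P(B)
    return p_ab / (p_a * p_b)

# I > 1 suggests positive dependence, I = 1 independence, I < 1 negative dependence
print(interest_factor(30, 20, 20, 30))  # → 1.2
```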
Justification for Conflicts
• Example table E10 is ranked highest by the interest (I) measure but lowest according to the φ coefficient
• We need to recognize the intrinsic properties of the existing measures
Analysis of a Measure
• Example: the relationship between the gender of a student and the grade obtained in a course
• Scale the number of male students by 2 and the number of female students by 10
• One expects scale-invariance in this particular application
• Most measures are sensitive to the scaling of rows and columns, e.g., gini index, interest, mutual information
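A small sketch makes the scale-(in)variance point concrete: scaling one column of the table leaves the odds ratio untouched but changes the interest factor. The counts below are made up for illustration:

```python
def odds_ratio(f11, f10, f01, f00):
    # cross-product ratio of the 2x2 table
    return (f11 * f00) / (f10 * f01)

def interest(f11, f10, f01, f00):
    n = f11 + f10 + f01 + f00
    return (f11 * n) / ((f11 + f10) * (f11 + f01))

base = (60, 40, 30, 70)  # hypothetical grade-vs-gender counts
# scale column 1 (e.g. male students) by 2 and column 2 (female) by 10
scaled = (base[0] * 2, base[1] * 10, base[2] * 2, base[3] * 10)

print(odds_ratio(*base), odds_ratio(*scaled))  # identical: odds ratio is scale-invariant
print(interest(*base), interest(*scaled))      # different: interest is scale-sensitive
```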
Solutions to zero in on a measure
• Support-based pruning
  – Eliminate uncorrelated and poorly correlated patterns
• Table standardization
  – Modify contingency tables to have uniform margins
  – Afterwards, many measures provide non-conflicting information
• Expectations of domain experts
  – Choose the measure that agrees with the expectations the most
  – The number of contingency tables, |T|, is high
  – It is possible to extract a small subset, S, of contingency tables
  – Find the best measure for S as an approximation for T
Preliminaries
• T(D) = {t1, t2, ..., tN}: the set of 2x2 contingency tables derived from dataset D
• P: the set of measures under consideration
• For each measure M in P:
  – M(T) = {m1, m2, ..., mN}: the values of M on the tables in T
  – OM(T) = {o1, o2, ..., oN}: the tables ordered (ranked) according to their values of M
• The similarity between any two measures M1 and M2: the similarity between OM1(T) and OM2(T)
• The similarity metric used is the correlation coefficient:
  if corr(OM1(T), OM2(T)) > threshold, then M1 and M2 are similar
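A minimal sketch of this similarity computation; the rank-vector and correlation helpers below are illustrative, not the paper's code, and the measure values are made up:

```python
def rank_vector(values):
    # position of each table in the ordering induced by a measure
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for position, index in enumerate(order):
        ranks[index] = position + 1
    return ranks

def corr(xs, ys):
    # Pearson correlation coefficient between two equal-length vectors
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# values of two measures M1, M2 on the same tables t1..t5 (made-up numbers)
m1 = [0.2, 0.9, 0.5, 0.7, 0.1]
m2 = [1.1, 3.0, 2.2, 2.5, 0.8]
THRESHOLD = 0.8
similar = corr(rank_vector(m1), rank_vector(m2)) > THRESHOLD
print(similar)  # → True (the two measures induce the same ranking here)
```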
Desired Properties of a Measure M
Properties of a Measure M cont’d.
• Denote a 2x2 contingency table as a contingency matrix
  M = [f11 f10; f01 f00]
• An interestingness measure is a matrix operator O such that
  – OM = k, where k is a scalar
  – For instance, for the φ coefficient as the interestingness measure, k equals the normalized form of the determinant operator: Det(M) = f11·f00 − f01·f10
• Statistical independence: a singular matrix M, whose determinant equals 0
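A sketch of this determinant view (the φ normalization below follows the standard definition of the φ coefficient; function names are illustrative):

```python
def det(f11, f10, f01, f00):
    # determinant of the contingency matrix [f11 f10; f01 f00]
    return f11 * f00 - f01 * f10

def phi(f11, f10, f01, f00):
    # φ coefficient = determinant normalized by the row/column margins
    margins = (f11 + f10) * (f01 + f00) * (f11 + f01) * (f10 + f00)
    return det(f11, f10, f01, f00) / margins ** 0.5

print(det(20, 20, 30, 30))  # → 0: a statistically independent (singular) table
print(phi(50, 0, 0, 50))    # → 1.0: perfect positive correlation
```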
Properties of a Measure M cont’d.
• Property 1: Symmetry under variable permutation: O(M^T) = O(M)
  – cosine (IS), interest factor (I), odds ratio (α), and the Jaccard measure are symmetric binary measures
  – For example, the gini index is asymmetric
• Property 2: Row/Column Scaling Invariance: R = C = [k1 0; 0 k2]
  – R x M is row scaling and M x R is column scaling
  – If O(RM) = O(M) and O(MR) = O(M), then M is row/column scaling invariant
  – The odds ratio (α) satisfies this property
• Property 3: Antisymmetry under row/column permutation: S = [0 1; 1 0]
  – If O(SM) = -O(M), M is antisymmetric under row permutation
  – If O(MS) = -O(M), M is antisymmetric under column permutation
• Property 4: Inversion Invariance: the row and column permutation S = [0 1; 1 0] applied together
  – If O(SMS) = O(M), M is invariant under inversion
  – Insight: inversion flips presence with absence and vice versa for binary variables; an inversion-invariant measure makes no distinction between the positive and negative correlations of a table
  – Measures that are symmetric under the combined row and column permutation: the φ coefficient, odds ratio, and collective strength, along with Yule’s Q and Y
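These operator definitions are easy to check numerically. The sketch below (with φ and interest re-stated, and a made-up table) verifies that φ is inversion-invariant while the interest factor is not:

```python
def phi(f11, f10, f01, f00):
    margins = (f11 + f10) * (f01 + f00) * (f11 + f01) * (f10 + f00)
    return (f11 * f00 - f01 * f10) / margins ** 0.5

def interest(f11, f10, f01, f00):
    n = f11 + f10 + f01 + f00
    return (f11 * n) / ((f11 + f10) * (f11 + f01))

def inversion(table):
    # S M S with S = [0 1; 1 0]: swap f11 <-> f00 and f10 <-> f01
    f11, f10, f01, f00 = table
    return (f00, f01, f10, f11)

t = (10, 20, 30, 40)  # arbitrary example table
print(abs(phi(*t) - phi(*inversion(t))) < 1e-12)  # → True: φ is inversion-invariant
print(abs(interest(*t) - interest(*inversion(t))))  # nonzero: interest is not
```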
Property 4 and Property 5
• Market basket analysis requires unequal treatment of the binary values of a variable
  – A symmetric (inversion-invariant) measure like the ones above is not suitable
• Property 5: Null Invariance: O(M + C) = O(M), where C = [0 0; 0 k] and k is a positive constant
  – For binary variables, more records are added that contain neither of the two variables under consideration; a null-invariant measure still emphasizes co-occurrence
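A sketch contrasting a null-invariant measure (cosine/IS) with a non-null-invariant one (φ), using made-up counts:

```python
def cosine_is(f11, f10, f01, f00):
    # IS / cosine measure: f11 / sqrt((f11+f10)*(f11+f01)); f00 plays no role
    return f11 / ((f11 + f10) * (f11 + f01)) ** 0.5

def phi(f11, f10, f01, f00):
    margins = (f11 + f10) * (f01 + f00) * (f11 + f01) * (f10 + f00)
    return (f11 * f00 - f01 * f10) / margins ** 0.5

t = (10, 20, 30, 40)
t_plus_nulls = (10, 20, 30, 40 + 1000)  # add records containing neither variable

print(cosine_is(*t) == cosine_is(*t_plus_nulls))  # → True: null-invariant
print(phi(*t), phi(*t_plus_nulls))                # values differ: φ is not
```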
Effect of Support-Based Pruning
• Randomly generated synthetic dataset of 10,000 contingency tables
• Darker cells indicate correlation > 0.85; lighter cells indicate otherwise
• With tighter bounds on the support of the patterns, many measures become correlated
Elimination of poorly correlated tables using Support-Based Pruning
• Use a minimum support threshold to prune out the low-support patterns
• A maximum support threshold eliminates uncorrelated, negatively correlated, and positively correlated tables in equal proportion
• A lower bound on support prunes out mostly the negatively correlated or uncorrelated tables
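The pruning itself is a one-line filter once a support function is defined. A minimal sketch with made-up tables and thresholds:

```python
def support(table):
    # support of the pattern {A, B} = f11 / N
    f11, f10, f01, f00 = table
    return f11 / (f11 + f10 + f01 + f00)

tables = [
    (1, 49, 49, 1),    # support 0.01: low-support, likely uncorrelated/negative
    (30, 20, 20, 30),  # support 0.30
    (90, 5, 5, 0),     # support 0.90: very high support
]

MIN_SUPPORT, MAX_SUPPORT = 0.05, 0.80
pruned = [t for t in tables if MIN_SUPPORT <= support(t) <= MAX_SUPPORT]
print(len(pruned))  # → 1: only the middle table survives both bounds
```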
Table standardization
• A standardized table: a visual depiction of the joint distribution of two variables after the elimination of non-uniform marginals
Implications of standardization
• The rankings from different measures become identical
Implications of standardization cont’d
• After standardization, a contingency matrix has the form [x y; y x], where y = N/2 − x and x = f*11
• Consider measures that are monotonically increasing functions of x (nearly all of them are)
  – They produce identical rankings on standardized, positively correlated tables
  – Some measures do not satisfy this property
• Consider the values of x in the range N/4 < x < N/2
• IPF (iterative proportional fitting) standardization favors the odds ratio measure, so the final rankings agree with the odds-ratio rankings obtained before standardization
• Take-away: different standardization techniques may be more appropriate for different application domains
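Standardization to uniform margins can be sketched with iterative proportional fitting; the implementation below is a simplified illustration, not the paper's code. Note that each IPF rescaling multiplies the numerator and denominator of the cross-product ratio equally, so the odds ratio is preserved, which is consistent with the point above:

```python
def ipf_standardize(table, iterations=200):
    # Alternately rescale rows and columns until every margin equals N/2.
    f11, f10, f01, f00 = (float(v) for v in table)
    target = (f11 + f10 + f01 + f00) / 2
    for _ in range(iterations):
        r1, r2 = f11 + f10, f01 + f00            # current row margins
        f11, f10 = f11 * target / r1, f10 * target / r1
        f01, f00 = f01 * target / r2, f00 * target / r2
        c1, c2 = f11 + f01, f10 + f00            # current column margins
        f11, f01 = f11 * target / c1, f01 * target / c1
        f10, f00 = f10 * target / c2, f00 * target / c2
    return f11, f10, f01, f00

s11, s10, s01, s00 = ipf_standardize((10, 20, 30, 40))
# the standardized table has the [x y; y x] shape described above
print(abs(s11 - s00) < 1e-9, abs(s10 - s01) < 1e-9)  # → True True
```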
Measure Selection Based on
Rankings by Experts
• Ideally, experts rank all the contingency tables, and the best measure is chosen accordingly
• This is a laborious task if the number of tables is large
• Instead, provide a smaller set of tables on which to decide the best measure
Table Selection via the Disjoint Algorithm
• Use the Disjoint algorithm to choose a subset S of tables of cardinality k
• Rank the tables according to the various measures
• Compute the similarity between the different measures
• A good table selection scheme minimizes
  D(SS, ST) = max over (i, j) of | ST(i, j) − SS(i, j) |
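The criterion compares the measure-similarity matrix computed on the subset S against the one computed on all of T. A minimal sketch of the distance D (the matrix values are made up):

```python
def selection_distance(sim_subset, sim_all):
    # D(S_S, S_T) = max over (i, j) of |S_T(i, j) - S_S(i, j)|
    n = len(sim_all)
    return max(abs(sim_all[i][j] - sim_subset[i][j])
               for i in range(n) for j in range(n))

# pairwise similarity of 3 measures on all tables T vs. on a chosen subset S
sim_all    = [[1.00, 0.90, 0.30],
              [0.90, 1.00, 0.40],
              [0.30, 0.40, 1.00]]
sim_subset = [[1.00, 0.85, 0.35],
              [0.85, 1.00, 0.55],
              [0.35, 0.55, 1.00]]

print(selection_distance(sim_subset, sim_all))  # largest entry-wise discrepancy (~0.15)
```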
Experimental Results
Conclusions
• There are key properties to consider for selecting the right measure
• No measure is consistently better than the others
• There are situations where most measures provide correlated information
• Choosing the right measure on an unbiased small subset of all the tables gives a good approximation to the ideal solution
• Future work
  – Extension to k-way contingency tables
  – Associations between mixed data types