Mining Non-Redundant High Order Correlations in Binary Data
Xiang Zhang, Feng Pan, Wei Wang, and Andrew Nobel
VLDB 2008

Outline
• Motivation
• Properties related to NIFSs
• Pruning candidates by mutual information
• The algorithm
  – Bounds based on pair-wise correlations
  – Bounds based on Hamming distances
• Discussion

Motivation
Example: suppose X, Y, and Z are binary features, where X and Y are disease SNPs and Z = X XOR Y is the complex disease trait. Then {X, Y, Z} has a strong correlation, but there is no correlation in {X, Z}, {Y, Z}, or {X, Y}.

Summary
We can see that such a high-order correlation pattern cannot be identified by examining only the pair-wise correlations. The desired correlation patterns have two aspects:
• The correlation involves more than two features.
• The correlation is non-redundant, i.e., removing any feature greatly reduces the correlation.

(Cont.)
Let $I(Y;X)/H(Y) = \big(H(Y) - H(Y \mid X)\big)/H(Y)$ be the relative entropy reduction of $Y$ given $X$. Consider three features $X_1$, $X_2$, and $X_3$ with
• $I(X_3; X_1)/H(X_3) = 21.97\%$
• $I(X_3; X_2)/H(X_3) = 8.62\%$
i.e., the relative entropy reduction of $X_3$ given $X_1$ or $X_2$ alone is small, yet
• $I(X_3; X_1, X_2)/H(X_3) = 81.59\%$
i.e., $X_1$ and $X_2$ jointly reduce the uncertainty of $X_3$ far more than either does separately. This strong correlation exists only when the three features are considered together.

The first figure is computed from the slides' 20-sample example:
$$
\frac{H(X_3) - H(X_3 \mid X_1)}{H(X_3)}
= \frac{-\left(\frac{6}{20}\log_2\frac{6}{20} + \frac{14}{20}\log_2\frac{14}{20}\right)
      - \left[\frac{9}{20}\left(-\frac{5}{9}\log_2\frac{5}{9} - \frac{4}{9}\log_2\frac{4}{9}\right)
            + \frac{11}{20}\left(-\frac{10}{11}\log_2\frac{10}{11} - \frac{1}{11}\log_2\frac{1}{11}\right)\right]}
       {-\left(\frac{6}{20}\log_2\frac{6}{20} + \frac{14}{20}\log_2\frac{14}{20}\right)}
= \frac{0.881 - 0.688}{0.881} \approx 21.97\%.
$$

(Cont.)
In this paper, the authors study the problem of finding non-redundant high-order correlations in binary data.
NIFSs (Non-redundant Interacting Feature Subsets):
• The features in an NIFS together have high multi-information.
• All proper subsets of an NIFS have low multi-information.
The computational challenge of finding NIFSs:
• Feature combinations must be enumerated to find the subsets that have high correlation.
• For each such subset, all of its subsets must be checked to make sure there is no redundancy.

Definition of NIFS
A subset of features $\{X_1, X_2, \ldots, X_n\}$ is an NIFS if the following two criteria are satisfied:
• $\{X_1, X_2, \ldots, X_n\}$ is an SFS (strongly correlated feature subset);
• every proper subset $X' \subset \{X_1, X_2, \ldots, X_n\}$ is a WFS (weakly correlated feature subset).
Ex. $\{X_1, X_2, X_3\}$ is an NIFS:
• $\{X_1, X_2, X_3\}$ is an SFS;
• $\{X_1, X_2\}$, $\{X_1, X_3\}$, $\{X_2, X_3\}$ are WFSs, where $C(X_1, X_2) = 0.03$, $C(X_1, X_3) = 0.22$, $C(X_2, X_3) = 0.22$.

Properties related to NIFSs
(Downward closure property of WFSs): if a feature subset $\{X_1, \ldots, X_n\}$ is a WFS, then all of its subsets are WFSs.
Advantage: this greatly reduces the complexity of the problem.
Let $X = \{X_1, \ldots, X_n\}$ be an NIFS. Then no proper subset $Y \subset X$ is an NIFS, since every proper subset of $X$ is a WFS and therefore not an SFS.

Pruning candidates by mutual information
Ex. Suppose $\{X_i, X_j\}$ is not a WFS, i.e., $C(X_i, X_j) > \varepsilon$ (with $\varepsilon = 0.25$ and $\delta = 0.8$). Then all supersets of $\{X_i, X_j\}$ can be safely pruned, since none of them can have all proper subsets weakly correlated. (A sketch of this WFS/SFS test is given below.)
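To make the measure concrete, here is a minimal sketch (not the authors' code; the data and function names are ours) of the multi-information $C(X_1,\ldots,X_k) = \sum_i H(X_i) - H(X_1,\ldots,X_k)$ on binary columns, the WFS/SFS tests $C \le \varepsilon$ and $C \ge \delta$, and the XOR example from the motivation slide:

```python
from collections import Counter
from math import log2
import random

def entropy(columns):
    """Joint entropy (in bits) of one or more aligned binary columns."""
    rows = list(zip(*columns))
    n = len(rows)
    return -sum((c / n) * log2(c / n) for c in Counter(rows).values())

def multi_information(columns):
    """C(X_1,...,X_k) = sum_i H(X_i) - H(X_1,...,X_k)."""
    return sum(entropy([col]) for col in columns) - entropy(columns)

EPS, DELTA = 0.25, 0.8  # thresholds epsilon and delta from the slides

def is_wfs(columns):
    return multi_information(columns) <= EPS   # weakly correlated

def is_sfs(columns):
    return multi_information(columns) >= DELTA # strongly correlated

# XOR example from the motivation slide: Z = X XOR Y on illustrative data.
random.seed(0)
X = [random.randint(0, 1) for _ in range(1000)]
Y = [random.randint(0, 1) for _ in range(1000)]
Z = [x ^ y for x, y in zip(X, Y)]
print(multi_information([X, Y]))     # ~0 bits: the pair looks uncorrelated
print(multi_information([X, Z]))     # ~0 bits
print(multi_information([X, Y, Z]))  # ~1 bit: strong 3-way correlation
```

As the printed values show, every pair is a WFS while the triple is an SFS, which is exactly the pattern that pair-wise screening misses.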
Algorithm
Upper and lower bounds based on pairwise correlations, expressed through the average entropy, in bits per symbol, of a randomly drawn $k$-element subset of $\{X_1, \ldots, X_n\}$.

Algorithm (Cont.)
Suppose the current candidate feature subset is $V = \{X_a, X_{a+1}, \ldots, X_b\}$, with bounds $lb(V) \le C(V) \le ub(V)$:
• If $\delta \le lb(V)$, then $V$ is an SFS without computing $C(V)$: check whether all subsets of $V$ of size $b - a$ (i.e., $|V| - 1$) are WFSs; by downward closure this decides whether $V$ is an NIFS.
• If $\varepsilon < lb(V)$, then $V$ is not a WFS and the subtree of $V$ can be pruned.
• If $ub(V) \le \varepsilon$, then $V$ is a WFS; there is no need to calculate $C(V)$, and the search proceeds directly to its subtree, using the adding proposition to obtain upper and lower bounds on the multi-information of each direct child node of $V$.
• Otherwise ($lb(V) \le \varepsilon < ub(V)$), the bounds are inconclusive: $C(V)$ must be calculated via the adding proposition and all subsets of $V$ checked. Then, if $C(V) > \varepsilon$, the subtree of $V$ is pruned; if $C(V) \ge \delta$ and all proper subsets of $V$ are WFSs, $V$ is output as an NIFS.
(A sketch of this search skeleton, with the bound computations abstracted away, appears at the end of these notes.)

Discussion
The paper uses an entropy-based correlation measure to address the problem of finding non-redundant interacting feature subsets.

(Cont.)
A worked pairwise example from the slides' sample data:
$$
C(X_1, X_2) = H(X_1) + H(X_2) - H(X_1, X_2)
$$
$$
= -\left(\tfrac{9}{20}\log_2\tfrac{9}{20} + \tfrac{11}{20}\log_2\tfrac{11}{20}\right)
  - \left(\tfrac{8}{20}\log_2\tfrac{8}{20} + \tfrac{12}{20}\log_2\tfrac{12}{20}\right)
  + \left(\tfrac{7}{20}\log_2\tfrac{7}{20} + \tfrac{4}{20}\log_2\tfrac{4}{20} + \tfrac{5}{20}\log_2\tfrac{5}{20} + \tfrac{4}{20}\log_2\tfrac{4}{20}\right)
$$
$$
= 0.992 + 0.971 - 1.933 = 0.03 < \varepsilon.
$$

Let $\varepsilon = 0.25$ and $\delta = 0.8$. Then:
• $\{X_1, X_2, X_3\}$, $\{X_1, X_2, X_3, X_6\}$, and $\{X_7, X_8, X_9, X_{10}\}$ are SFSs, with $C(X_1, X_2, X_3) = 0.82$, $C(X_1, X_2, X_3, X_6) = 0.97$, and $C(X_7, X_8, X_9, X_{10}) = 1.15$.
• $\{X_1, X_2\}$ and $\{X_7, X_8, X_9\}$ are WFSs, with $C(X_1, X_2) = 0.03$ and $C(X_7, X_8, X_9) = 0.15$.
This is why the definition requires that every proper subset of an NIFS be weakly correlated.

Adding proposition
Gives upper and lower bounds on the multi-information of a direct child node $V \cup \{X_i\}$, stated in terms of the Hamming distances between features (equation omitted).
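To close, here is a hedged sketch of the search skeleton the algorithm slides describe: a depth-first walk of the set-enumeration tree that prunes the subtree of any candidate that is not a WFS, and outputs a candidate as an NIFS when it is an SFS whose immediate proper subsets are all WFSs. The $lb/ub$ bounds and the adding proposition are abstracted away (exact $C(V)$ stands in for them), the function and variable names are ours, and it reuses multi_information() from the earlier sketch.

```python
from itertools import combinations

def find_nifs(features, eps=0.25, delta=0.8):
    """features: dict mapping feature name -> binary column (list of 0/1).
    Returns every NIFS as a tuple of feature names."""
    names = sorted(features)
    results = []

    def immediate_subsets_are_wfs(v):
        # By the downward closure of WFSs it suffices to check the subsets
        # of size |V| - 1 (a single feature has C = 0, trivially a WFS).
        return all(
            multi_information([features[f] for f in s]) <= eps
            for s in combinations(v, len(v) - 1) if len(s) >= 2
        )

    def dfs(prefix, start):
        for i in range(start, len(names)):
            v = prefix + (names[i],)
            if len(v) >= 2:
                c = multi_information([features[f] for f in v])
                if c >= delta and immediate_subsets_are_wfs(v):
                    results.append(v)  # V is an SFS, all proper subsets WFSs
                if c > eps:
                    continue           # V is not a WFS: prune V's subtree
            dfs(v, i + 1)              # V is a WFS (or a singleton): extend it
    dfs((), 0)
    return results
```

On the XOR data from the first sketch, find_nifs({'X': X, 'Y': Y, 'Z': Z}) should return only ('X', 'Y', 'Z'): each pair has $C \approx 0 \le \varepsilon$, while the triple has $C \approx 1 \ge \delta$.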