Minimum Redundancy and Maximum Relevance Feature Selection and Its Applications

Hanchuan Peng
Janelia Farm Research Campus, Howard Hughes Medical Institute

Outline
- What is mRMR feature selection
- Applications in cancer classification
- Applications in image pattern recognition
- Theoretical basis of mRMR
- Combinations with other methods
- How to use mRMR programs

Feature Selection Problem
- Given data D on a variable pool Ω = {x1, …, xM} and a target variable h, select a subset S of Ω (e.g., a set of genes) that predicts h well.
- A prediction example: predict the cancer type h from the discretized expression states of genes {xi} over N samples:

  Gene |                                 N samples
  G1   | On        On        Baseline  Off    Baseline  On    Off       On        On
  G2   | On        Off       On        Off    Off       On    On        On        On
  ...  |
  GM   | On        Off       On        On     Off       On    Off       On        On
  h    | Prostate  Prostate  Liver     Liver  Lung      Lung  Lymphoma  Lymphoma  Lymphoma

(Ding & Peng, CSB2003, JBCB2005)

Redundancy in Features
- Most used methods select top-ranking features based on mutual information, F-test, etc., without considering the relationships among features.
- Problems: the selected features are correlated, and they cover narrow regions of the feature space.

mRMR: Discrete Variables
- Maximize relevance: $\max V_I$, where
  $V_I = \frac{1}{|S|} \sum_{i \in S} I(h, i)$
- Minimize redundancy: $\min W_I$, where
  $W_I = \frac{1}{|S|^2} \sum_{i,j \in S} I(i, j)$
- Here S is the set of selected features, h is the target class (e.g., the types of different cancers, or annotations), and I(i, j) is the mutual information between features i and j.

(Ding & Peng, CSB2003, JBCB2005)

mRMR: Continuous Variables
- Maximize relevance using the F-statistic F(i, h): $\max V_F$, where
  $V_F = \frac{1}{|S|} \sum_{i \in S} F(i, h)$
- Minimize redundancy using the correlation c(i, j): $\min W_c$, where
  $W_c = \frac{1}{|S|^2} \sum_{i,j \in S} |c(i, j)|$
- Relevance and redundancy can also be defined using the mutual information of hybrid (continuous or categorical) variables (Peng, Long, & Ding, TPAMI, 2005).

(Ding & Peng, CSB2003, JBCB2005)

mRMR Selection Schemes: Combine Redundancy and Relevance
- Additive combination: $\max (V - W)$
- Multiplicative combination: $\max (V / W)$
- (Sketches of both the discrete and the continuous criteria follow below.)

(Ding & Peng, CSB2003, JBCB2005)
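To make the discrete-variable criterion concrete, here is a minimal Python sketch (an illustration, not the released mRMR code; the function names are my own). It estimates mutual information from the empirical joint frequencies of discretized features, computes the relevance and redundancy scores defined above, and combines them additively (MID) or multiplicatively (MIQ):

```python
import numpy as np
from collections import Counter

def mutual_information(a, b):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((c / n) * np.log((c / n) / ((pa[x] / n) * (pb[y] / n)))
               for (x, y), c in pab.items())

def mrmr_scores(X, h, S):
    """Relevance V_I and redundancy W_I of the feature subset S.

    X: list of discretized feature columns; h: target labels.
    Following the slides, the redundancy sum runs over all pairs
    (i, j) in S, including i == j.
    """
    V = np.mean([mutual_information(X[i], h) for i in S])
    W = np.mean([mutual_information(X[i], X[j]) for i in S for j in S])
    return V, W

V, W = mrmr_scores(X=[[0, 1, 1, 2], [0, 1, 0, 2]], h=[0, 1, 1, 1], S=[0, 1])
print(V - W)   # additive combination (MID)
print(V / W)   # multiplicative combination (MIQ)
```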
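For the continuous-variable form, a corresponding sketch (again illustrative, not the released implementation) can use SciPy's one-way ANOVA F-statistic for the relevance term and the absolute Pearson correlation for the redundancy term:

```python
import numpy as np
from scipy.stats import f_oneway

def f_statistic(x, h):
    """F(i, h): one-way ANOVA F-statistic of feature x grouped by class label h.

    x: 1-D NumPy array of feature values; h: NumPy array of class labels.
    """
    return f_oneway(*[x[h == c] for c in np.unique(h)]).statistic

def mrmr_scores_continuous(X, h, S):
    """V_F and W_c for the columns of X (samples x features) indexed by S."""
    V = np.mean([f_statistic(X[:, i], h) for i in S])
    C = np.abs(np.corrcoef(X[:, S], rowvar=False))  # |c(i, j)| for i, j in S
    W = C.mean()                                    # (1/|S|^2) * sum of |c(i, j)|
    return V, W
```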
Most Related Methods
- Most used feature selection methods pick top-ranking features without considering the relationships among features.
- Yu & Liu, 2003/2004: based on information gain; essentially a similar approach.
- Wrappers: not a filter approach; the classifier is involved in the selection, so the features do not generalize well.
- PCA and ICA: the derived features are orthogonal or independent, but not in the original feature space.

Microarray Gene Expression Data Sets for Cancer Classification
[Figure: the microarray data sets used in the experiments. (Ding & Peng, CSB2003, JBCB2005)]

Lymphoma Results
[Figure: classification results on the Lymphoma data. (Ding & Peng, CSB2003, JBCB2005)]

What is the Role of Redundancy Reduction?
[Figure: (a) relevance and (b) redundancy of mRMR features on the discretized NCI data set; (c) the respective LOOCV errors obtained using the Naive Bayes classifier. (Ding & Peng, CSB2003, JBCB2005)]

Do mRMR Features Better Cover the Data Distribution Space and Thus Perform Well on Different Classifiers?
[Figure: average LOOCV errors of three different classifiers (NB, SVM, and LDA) on three multi-class data sets. (Ding & Peng, CSB2003, JBCB2005)]

What is the Relationship of mRMR Features and Various Data Discretization Schemes?
[Figure: LOOCV testing results (#errors) for binarized NCI and Lymphoma data using the SVM classifier. (Ding & Peng, CSB2003, JBCB2005)]

Do mRMR Features Generalize Well on Unseen Data?
[Figure: testing errors on the Child Leukemia data (7 classes, 215 training samples, 112 testing samples); M is the number of features used in classification. (Ding & Peng, CSB2003, JBCB2005)]

Comparison with Other Work
[Figure: comparison of mRMR results with other published results.]

Automatic Annotation of in situ Gene Expression Patterns in Fly Embryos
- Minimally redundant wavelet-embryo features were used to train classifiers.
- 456 genes, 80 ontological annotation terms, 6 developmental ranges.
- The automatic annotations are consistent with those generated manually in BDGP.
- 99.5% accuracy in predicting the developmental stages of embryos from gene expression patterns.
- Short terms: ECNS: embryonic central nervous system; VNC: ventral nerve cord; EH: embryonic hindgut; EM: embryonic midgut; EDE: embryonic dorsal epidermis.
[Figure: example embryo images, wavelet-embryo features, and prediction confidence scores.]

(Zhou & Peng, Bioinformatics, 2007; Peng et al, BMC Cell Bio, 2007)

Annotation System
- Extract wavelet-embryo features.
- Tier 1: select developmental-stage-specific features (mRMR); assign the developmental stage (classification).
- Tier 2: select annotation-A-specific, annotation-B-specific, …, annotation-Z-specific features (mRMR); predict each annotation (classification with LDA, SVM, etc.).
- Determine the final annotations based on probabilistic ranking.

(Peng et al, BMC Cell Bio, 2007)

Tier 1 Accuracy: Developmental Stage Prediction
- Overall accuracy > 99%.

Tier 2 Accuracy: Specific Annotation Prediction
[Figure: prediction accuracies for the individual annotation terms. (Zhou & Peng, Bioinformatics, 2007)]

Mutual Information and the Maximum Dependency Criterion
- Mutual information measures the statistical association between variables.
- For two univariate variables x and y:
  $I(x; y) = \iint p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} \, dx\, dy$
- For a multivariate variable $S_m = \{x_1, \ldots, x_m\}$ and the target h:
  $I(S_m; h) = \iint p(S_m, h) \log \frac{p(S_m, h)}{p(S_m)\, p(h)} \, dS_m\, dh = \int \cdots \int p(x_1, \ldots, x_m, h) \log \frac{p(x_1, \ldots, x_m, h)}{p(x_1, \ldots, x_m)\, p(h)} \, dx_1 \cdots dx_m\, dh$
- Maximum dependency criterion: from the pool Ω = {x1, …, xM} and data D, select the subset S that maximizes I(S; h), not merely individually relevant variables.

(Peng, Long, Ding, TPAMI 2005)

Maximum Dependency Feature Selection is Combinatorial
- Number of subspaces to examine: $2^{|\Omega|}$ (i.e., $2^M$) in total, and $\binom{|\Omega|}{|S|}$ for a given number of features |S|:

  |Ω|    |S|   #configurations of selected variables
  1000   1     10^3
  1000   2     0.5 x 10^6
  1000   3     ~1.66 x 10^8
  5000   3     ~2.08 x 10^10

- Heuristic search algorithms are necessary; the simplest case is incremental search.

High-Dimensional Mutual Information is Hard to Estimate
- Writing $S_m = \{S_{m-1}, x_m\}$:
  $I(S_m; h) = \iint p(S_{m-1}, x_m, h) \log \frac{p(S_{m-1}, x_m, h)}{p(S_{m-1}, x_m)\, p(h)} \, dS_m\, dh$
- Estimating the high-dimensional $I(S_m; h)$ is difficult: finding the inverse of the large covariance matrix is an ill-posed problem, and the number of samples is insufficient.

(Peng, Long, Ding, TPAMI 2005)

Factorize the Mutual Information
- Define
  $J(x_1, x_2, \ldots, x_m) = \int \cdots \int p(x_1, \ldots, x_m) \log \frac{p(x_1, \ldots, x_m)}{p(x_1) \cdots p(x_m)} \, dx_1 \cdots dx_m$
  and analogously $J(x_1, \ldots, x_m, h)$ with h included as one more variable.
- Then the mutual information factorizes as
  $I(S_m, h) = J(S_m, h) - J(S_m)$

Upper & Lower Bounds of J(·)
- Lower bound: $J(x_1, x_2, \ldots, x_m) \ge 0$, attained when the variables are maximally independent.
- Upper bound (it can be proved):
  $J(x_1, \ldots, x_n) \le \min\left\{ \sum_{i=2}^{n} H(x_i),\; \sum_{i=1, i \ne 2}^{n} H(x_i),\; \ldots,\; \sum_{i=1}^{n-1} H(x_i) \right\}$
  attained when the variables are maximally dependent.

(Peng, Long, Ding, TPAMI 2005)

Factorize I(S_m, h): Relevance and Redundancy
- Relevance of S = {x1, x2, …} to h:
  $RL(S, h) = \frac{1}{|S|} \sum_{x_i \in S} I(x_i, h)$
- Redundancy among the variables {x1, x2, …}:
  $RD(S) = \frac{1}{|S|^2} \sum_{x_i, x_j \in S} I(x_i, x_j)$
- For incremental search, $I(S_m, h) = J(S_{m-1}, x_m, h) - J(S_{m-1}, x_m)$, so maximizing I(S, h) is "equivalent" to maximizing RL(S, h) − RD(S), i.e., the combination of min-Redundancy and Max-Relevance (mRMR).
- mRMR is an optimal first-order approximation of the maximization of I(·); relevance-only ranking maximizes only J(·)!

(Peng, Long, Ding, TPAMI 2005)

Advantages of mRMR
- Both the relevance and the redundancy estimation are low-dimensional problems (each involves only 2 variables). This is much easier than directly estimating the multivariate density or mutual information in the high-dimensional space:
- Faster speed.
- More reliable estimation.

(Peng, Long, Ding, TPAMI 2005)

Search Algorithm of mRMR
- Greedy (incremental) search; a sketch follows below:
  1. In the pool Ω, find the variable x1 with the largest I(·, h), and exclude it from Ω.
  2. With S the set selected so far, search for the next variable x that maximizes $I(x, h) - \frac{1}{|S|} \sum_{x_j \in S} I(x, x_j)$.
  3. Iterate until the expected number of variables has been obtained, or other constraints are satisfied.
- Complexity: O(|S| · |Ω|).

(Peng, Long, Ding, TPAMI 2005)
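A compact Python sketch of this greedy search, under the same assumptions as the earlier sketches and reusing the mutual_information helper defined above (again an illustration, not the released program). It implements the additive MID criterion and caches the pairwise terms so the loop costs O(|S|·|Ω|) mutual-information evaluations:

```python
def mrmr_select(X, h, k):
    """Greedy mRMR (MID) selection of k features.

    X: list of discretized feature columns; h: target labels.
    Returns the indices of the selected features, in selection order.
    """
    pool = set(range(len(X)))
    relevance = {i: mutual_information(X[i], h) for i in pool}

    # Step 1: start with the single most relevant feature.
    first = max(pool, key=relevance.get)
    selected, pool = [first], pool - {first}

    pair_mi = {}  # cache of pairwise I(x_i, x_j), filled lazily
    while pool and len(selected) < k:
        def mid_score(i):
            for j in selected:
                if (i, j) not in pair_mi:
                    pair_mi[(i, j)] = mutual_information(X[i], X[j])
            red = sum(pair_mi[(i, j)] for j in selected) / len(selected)
            return relevance[i] - red  # relevance minus mean redundancy
        best = max(pool, key=mid_score)
        selected.append(best)
        pool.remove(best)
    return selected
```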
Comparing Max-Dep and mRMR: Complexity of Feature Selection
[Figure: time cost (in seconds) of selecting individual features based on mutual information estimation for continuous feature variables; parallel experiments on a cluster of eight 3.06 GHz Xeon CPUs running RedHat Linux 9, with the Matlab implementation.]

Comparing Max-Dep and mRMR: Accuracy of the Selected Features in Classification
[Figure: leave-one-out cross-validation classification accuracies of the features selected by mRMR and MaxDep.]

Use Wrappers to Refine Features
- mRMR is a filter approach: fast and independent of the classifier, but the selected features might still be redundant.
- Wrappers seek to minimize the number of classification errors directly: slow, and the features are less robust because they depend on the classifier, but the prediction accuracy is better.
- Combination: use mRMR first to generate a short feature pool, then use a wrapper to obtain a least-redundant feature set with better accuracy; a sketch follows below.
- Forward wrappers (incremental selection) and backward wrappers (decremental selection).
[Figure: forward and backward wrapper refinement results on the NCI data.]
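A minimal sketch of the "mRMR first, wrapper second" combination in the forward (incremental) direction. It assumes a scikit-learn-style classifier and uses 5-fold cross-validation as the error estimate (the slides use LOOCV); forward_wrapper and candidate_pool are illustrative names, not part of the released programs:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

def forward_wrapper(X, y, candidate_pool, max_features=20):
    """Forward (incremental) wrapper over an mRMR-generated candidate pool.

    X: (n_samples, n_features) array; y: labels;
    candidate_pool: feature indices pre-selected by mRMR.
    Greedily adds the feature that most improves cross-validated accuracy.
    """
    selected, best_acc = [], 0.0
    remaining = list(candidate_pool)
    while remaining and len(selected) < max_features:
        accs = [(np.mean(cross_val_score(GaussianNB(), X[:, selected + [i]], y, cv=5)), i)
                for i in remaining]
        acc, i = max(accs)
        if acc <= best_acc:   # stop when no candidate improves accuracy
            break
        selected.append(i)
        remaining.remove(i)
        best_acc = acc
    return selected, best_acc

# Usage: pool = mrmr_select([X[:, j] for j in range(X.shape[1])], y, k=50)
#        feats, acc = forward_wrapper(X, y, pool)
```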
mRMR Website
- http://research.janelia.org/peng/proj/mRMR

Available Versions
- C/C++ version: for Linux/Unix/Mac; a simple command-line executable.
- Matlab version: all source code; can be embedded in your programs easily.
- Online version: upload your data set (CSV format: comma-separated values) and get the results right away.

Available Datasets
- NCI data (9 cancers, discretized as 3 states)
- Lung Cancer data (7 subtypes, discretized as 3 states)
- Lymphoma data (9 cancer subtypes, discretized as 3 states)
- Leukemia data (2 subtypes, discretized as 3 states)
- Colon Cancer data (2 subtypes, discretized as 3 states)
- The continuous-valued raw data should be obtained from the original sources.

Citations
- [TPAMI05] Hanchuan Peng, Fuhui Long, and Chris Ding, "Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 27, No. 8, pp. 1226-1238, 2005. Presents the theory of mutual-information-based feature selection and the relationship of four selection schemes: maximum dependency, mRMR, maximum relevance, and minimal redundancy; also gives the "mRMR + wrapper" combination scheme and the mutual information estimation of continuous/hybrid variables.
- [JBCB05] Chris Ding and Hanchuan Peng, "Minimum redundancy feature selection from microarray gene expression data," Journal of Bioinformatics and Computational Biology, Vol. 3, No. 2, pp. 185-205, 2005. A comprehensive suite of experimental results of mRMR for microarray gene selection under many different conditions; an extended version of the CSB03 paper.
- [CSB03] Chris Ding and Hanchuan Peng, "Minimum redundancy feature selection from microarray gene expression data," Proc. 2nd IEEE Computational Systems Bioinformatics Conference (CSB 2003), pp. 523-528, Stanford, CA, Aug. 2003. Presents the first set of mRMR results and different definitions of the relevance/redundancy terms.
- [Bioinfo07] Jie Zhou and Hanchuan Peng, "Automatic recognition and annotation of gene expression patterns of fly embryos," Bioinformatics, Vol. 23, No. 5, pp. 589-596, 2007. An application of mRMR to selecting good wavelet image features.

Conclusions
- Max-Dependency feature selection can be efficiently implemented as the mRMR algorithm.
- mRMR significantly outperforms the widely used max-relevance selection method: mRMR features cover a broader feature space with fewer features.
- mRMR is very efficient and useful for gene selection and many other applications.
- The programs are ready: http://research.janelia.org/peng/proj/mRMR

Acknowledgment
- Collaborators: Chris Ding, Fuhui Long, Jie Zhou