Feature selection methods
Isabelle Guyon – isabelle@clopinet.com
IPAM summer school on Mathematics in Brain Imaging, July 2008

Feature Selection
• Thousands to millions of low-level features: select the most relevant ones to build better, faster, and easier-to-understand learning machines.
[Figure: data matrix X with m examples and n features, reduced to n' selected features.]

Application examples
[Figure: example applications – customer knowledge, system diagnosis, quality control, market analysis, text categorization, OCR/HWR, bioinformatics, machine vision – placed by their typical number of variables/features, roughly 10 to 10^6.]

Nomenclature
• Univariate method: considers one variable (feature) at a time.
• Multivariate method: considers subsets of variables (features) together.
• Filter method: ranks features or feature subsets independently of the predictor (classifier).
• Wrapper method: uses a classifier to assess features or feature subsets.

Univariate Filter Methods

Univariate feature ranking
[Figure: class-conditional densities P(Xi|Y=1) and P(Xi|Y=-1) with means μ+, μ− and standard deviations σ+, σ−.]
• Normally distributed classes, equal variance σ² unknown; estimated from the data as σ²_within.
• Null hypothesis H0: μ+ = μ−.
• T statistic: if H0 is true,
  t = (μ+ − μ−) / (σ_within √(1/m+ + 1/m−)) ~ Student(m+ + m− − 2 d.f.),
  where m+ and m− are the numbers of examples in each class.
(A short code sketch of this ranking appears further below.)

Statistical tests (chap. 2)
[Figure: null distribution of the relevance index; the p-value is the tail area beyond the observed value.]
• H0: X and Y are independent.
• Relevance index ↔ test statistic.
• P-value ↔ false positive rate FPR = n_fp / n_irr.
• Multiple testing problem: use the Bonferroni correction, p_val → n · p_val.
• False discovery rate: FDR = n_fp / n_sc ≤ FPR · n / n_sc.
• Probe method: FPR ≈ n_sp / n_p.
(n: total number of features; n_fp: false positives; n_irr: irrelevant features; n_sc: selected features; n_p: probes; n_sp: selected probes.)

Univariate Dependence
• Independence: P(X, Y) = P(X) P(Y).
• Measure of dependence (mutual information):
  MI(X, Y) = ∫ P(X,Y) log [ P(X,Y) / (P(X) P(Y)) ] dX dY = KL( P(X,Y) || P(X) P(Y) ).

Other criteria (chap. 3)
A choice of feature-ranking methods, depending on the nature of:
• the variables and the target (binary, categorical, continuous);
• the problem (dependencies between variables, linear/nonlinear relationships between variables and target);
• the available data (number of examples, number of variables, noise in the data);
• the available tabulated statistics.

Multivariate Methods

Univariate selection may fail
(Guyon-Elisseeff, JMLR 2004; Springer 2006)

Filters, Wrappers, and Embedded methods
• Filter: all features → filter → feature subset → predictor.
• Wrapper: all features → multiple feature subsets, each assessed by the predictor itself.
• Embedded method: all features → feature subset selected jointly with the training of the predictor.

Relief (Kira and Rendell, 1992)
• Relief criterion = ⟨D_miss / D_hit⟩, where D_hit is the distance to the nearest example of the same class (nearest hit), D_miss the distance to the nearest example of the other class (nearest miss), and ⟨·⟩ averages over examples.

Wrappers for feature selection (Kohavi-John, 1997)
• N features → 2^N possible feature subsets!

Search Strategies (chap. 4)
• Exhaustive search.
• Simulated annealing, genetic algorithms.
• Beam search: keep the k best paths at each step.
• Greedy search: forward selection or backward elimination.
• PTA(l,r): plus l, take away r – at each step, run SFS l times, then SBS r times.
• Floating search (SFFS and SBFS): one step of SFS (resp. SBS), then SBS (resp. SFS) as long as we find better subsets than those of the same size obtained so far; at any time, if a better subset of the same size was already found, switch abruptly.

Forward Selection (wrapper)
Also referred to as SFS: Sequential Forward Selection.
[Diagram: start with the empty set; at each step, evaluate the remaining candidate features (n, then n−1, then n−2, …, down to 1) and add the best one.]

Forward Selection (embedded)
Guided search: we do not consider alternative paths.
Typical examples: Gram-Schmidt orthogonalization and tree classifiers.

Backward Elimination (wrapper)
Also referred to as SBS: Sequential Backward Selection.
[Diagram: start with all n features; at each step, evaluate the candidate removals and eliminate the worst feature, down to a single feature.]

Backward Elimination (embedded)
Guided search: we do not consider alternative paths.
Typical example: "recursive feature elimination" (RFE-SVM).
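Not part of the original slides: a minimal Python sketch of the RFE-SVM idea just described, assuming scikit-learn and NumPy are available. The synthetic data, the LinearSVC settings, and the rfe_svm helper are illustrative choices, not Guyon's code.

  # Sketch of RFE-SVM: train a linear SVM, drop the feature with the smallest
  # |w_i|, and repeat (backward elimination guided by the classifier weights).
  import numpy as np
  from sklearn.svm import LinearSVC

  def rfe_svm(X, y, n_keep):
      remaining = list(range(X.shape[1]))
      while len(remaining) > n_keep:
          clf = LinearSVC(C=1.0, dual=False).fit(X[:, remaining], y)
          w = np.abs(clf.coef_).ravel()      # one weight per remaining feature
          remaining.pop(int(np.argmin(w)))   # eliminate the least useful feature
      return remaining

  rng = np.random.RandomState(0)
  X = rng.randn(100, 20)
  y = (X[:, 0] + X[:, 1] > 0).astype(int)    # only features 0 and 1 matter
  print(rfe_svm(X, y, n_keep=2))             # expected to keep features 0 and 1

scikit-learn also ships a ready-made sklearn.feature_selection.RFE class implementing the same loop, with a configurable number of features removed per iteration.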
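Looking back at the Univariate Filter Methods slides: a short sketch of t-statistic ranking with p-values and a Bonferroni correction, assuming NumPy/SciPy. The synthetic data and the 0.05 threshold are illustrative assumptions, not from the lecture.

  # Univariate filter: rank features by a two-sample t-test (equal variances),
  # then correct the p-values for multiple testing (Bonferroni).
  import numpy as np
  from scipy.stats import ttest_ind

  rng = np.random.RandomState(0)
  n, m = 500, 40                               # n features, m examples
  X = rng.randn(m, n)
  y = rng.randint(0, 2, size=m)
  X[y == 1, :3] += 1.0                         # make the first 3 features relevant

  t, pval = ttest_ind(X[y == 1], X[y == 0], axis=0, equal_var=True)
  pval_bonferroni = np.minimum(pval * n, 1.0)  # Bonferroni: p_val -> n * p_val
  ranking = np.argsort(pval)                   # most significant features first
  selected = np.where(pval_bonferroni < 0.05)[0]
  print(ranking[:5], selected)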
Scaling Factors
• Idea: transform a discrete space into a continuous space, s = [s1, s2, s3, s4].
• Discrete indicators of feature presence: si ∈ {0, 1}.
• Continuous scaling factors: si ∈ ℝ.
• Now we can do gradient descent!

Formalism (chap. 5)
• Many learning algorithms are cast as the minimization of a regularized functional:
  empirical error + regularization (capacity control).
• This is the justification of RFE and many other embedded methods.
(Next few slides: André Elisseeff.)

Embedded method
• Embedded methods are a good inspiration for designing new feature selection techniques for your own algorithms:
  – Find a functional that represents your prior knowledge about what a good model is.
  – Add the s weights into the functional and make sure it is either differentiable or that a sensitivity analysis can be performed efficiently.
  – Optimize alternately with respect to the model parameters α and the scaling factors s.
  – Use early stopping (validation set) or your own stopping criterion to stop and select the subset of features.
• Embedded methods are therefore not too far from wrapper techniques and can be extended to multiclass problems, regression, etc.

The l1 SVM (Bi et al., 2003; Zhu et al., 2003)
• A version of the SVM in which the regularizer W(w) = ||w||² is replaced by the l1 norm W(w) = Σi |wi|.
• Can be considered an embedded feature selection method:
  – Some weights are drawn to zero (the method tends to remove redundant features).
  – This differs from the regular SVM, where redundant features are included.

The l0 SVM (Weston et al., 2003)
• Replace the regularizer ||w||² by the l0 norm (the number of nonzero weights).
• Further replace it by Σi log(ε + |wi|).
• This boils down to a multiplicative update algorithm.

Causality

What can go wrong? (Guyon-Aliferis-Elisseeff, 2007; http://clopinet.com/causality)
[Figures: scatter plots of variables X1 and X2 and causal graphs relating X1, X2, and the target Y.]

Wrapping up

Bilevel optimization
• Data: M samples by N variables/features, split into three sets of m1 training, m2 validation, and m3 test examples.
• 1) For each feature subset, train the predictor on the training data.
• 2) Select the feature subset that performs best on the validation data.
  – Repeat and average if you want to reduce variance (cross-validation).
• 3) Test on the test data.
(A code sketch of this protocol follows the next slide.)

Complexity of Feature Selection
• With high probability: Generalization_error ≤ Validation_error + ε(C/m2).

  Method                     | Number of subsets tried | Complexity C
  Exhaustive search wrapper  | 2^N                     | N
  Nested subsets             | N(N+1)/2                | log N
  Feature ranking            | N                       | n

(m2: number of validation examples; N: total number of features; n: feature subset size.)
• Try to keep C of the order of m2.
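A minimal Python sketch of the train/validation/test protocol from the Wrapping up slides, assuming scikit-learn. The candidate subsets (all subsets of one or two features), the logistic-regression predictor, and the split sizes are illustrative assumptions, not from the lecture.

  # Wrapper selection with a training/validation/test split (bilevel optimization).
  import numpy as np
  from itertools import combinations
  from sklearn.linear_model import LogisticRegression

  rng = np.random.RandomState(0)
  X = rng.randn(300, 6)
  y = (X[:, 0] - X[:, 2] > 0).astype(int)         # features 0 and 2 are relevant
  m1, m2 = 150, 75                                # m3 = 75 test examples remain
  X_tr, y_tr = X[:m1], y[:m1]
  X_va, y_va = X[m1:m1 + m2], y[m1:m1 + m2]
  X_te, y_te = X[m1 + m2:], y[m1 + m2:]

  def validation_score(subset):
      # 1) train the predictor on the training data, restricted to the subset
      clf = LogisticRegression().fit(X_tr[:, subset], y_tr)
      # 2) evaluate the subset on the validation data
      return clf.score(X_va[:, subset], y_va)

  candidates = [list(c) for k in (1, 2) for c in combinations(range(6), k)]
  best = max(candidates, key=validation_score)    # select best subset on validation
  # 3) report the final estimate on the held-out test data
  clf = LogisticRegression().fit(X_tr[:, best], y_tr)
  print(best, clf.score(X_te[:, best], y_te))

Only 21 subsets are tried here, in line with the advice on the complexity slide: keep the number of subsets compared (and hence C) small relative to the m2 validation examples.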
CLOP
http://clopinet.com/CLOP/
• CLOP = Challenge Learning Object Package.
• Based on the Matlab® Spider package developed at the Max Planck Institute.
• Two basic abstractions: Data object and Model object.
• Typical script:
  D = data(X,Y);          % data constructor
  M = kridge;             % model constructor
  [R, Mt] = train(M, D);  % train the model => Mt
  Dt = data(Xt, Yt);      % test data constructor
  Rt = test(Mt, Dt);      % test the model

NIPS 2003 feature selection challenge
http://clopinet.com/isabelle/Projects/ETH/Feature_Selection_w_CLOP.html

Conclusion
• Feature selection focuses on uncovering subsets of variables X1, X2, … that are predictive of the target Y.
• Multivariate feature selection is in principle more powerful than univariate feature selection, but not always in practice.
• Taking a closer look at the type of dependencies, in terms of causal relationships, may help refine the notion of variable relevance.

Acknowledgements and references
1) Feature Extraction, Foundations and Applications. I. Guyon et al., Eds. Springer, 2006. http://clopinet.com/fextract-book
2) Causal feature selection. I. Guyon, C. Aliferis, A. Elisseeff. To appear in "Computational Methods of Feature Selection", Huan Liu and Hiroshi Motoda, Eds., Chapman and Hall/CRC Press, 2007. http://clopinet.com/causality