AMCS/CS 340: Data Mining - Feature Selection
Xiangliang Zhang, King Abdullah University of Science and Technology

Outline
• Introduction
• Unsupervised Feature Selection
  o Clustering
  o Matrix Factorization
• Supervised Feature Selection
  o Individual Feature Ranking (Single Variable Classifier)
  o Feature Subset Selection
    - Filters
    - Wrappers
• Summary

Introduction

Problems due to poor variable selection
• The input dimension is too large, so the curse of dimensionality may arise.
• A poor model may be built when unrelated inputs are added or relevant inputs are missing.
• Complex models that contain too many inputs are more difficult to understand.

Applications
• OCR (optical character recognition)
• HWR (handwriting recognition)

Benefits of feature selection
• Facilitating data visualization
• Data understanding
• Reducing the measurement and storage requirements
• Reducing training and utilization times
• Defying the curse of dimensionality to improve prediction performance

Feature Selection/Extraction
Given a data matrix X of N samples described by m low-level features {Fj} (thousands to millions of them), and possibly a target Y: select or extract the d << m most relevant features {fi} to build better, faster, and easier-to-understand learning machines.
• Using the label Y: supervised
• Without the label Y: unsupervised

Feature Selection vs. Extraction
Selection:
• Choose the best subset of size d from the m features: {fi} is a subset of {Fj}, i = 1, ..., d and j = 1, ..., m.
Extraction:
• Extract d new features by a linear or non-linear combination of all m features: {fi} = f({Fj}).
• The new features may have no physical interpretation/meaning.

Unsupervised Feature Selection

Feature Selection by Clustering
• Group the features into clusters.
• Replace the (many) similar variables in one cluster by a (single) cluster centroid.
• E.g., k-means, hierarchical clustering.
• A minimal code sketch is given at the end of this section.
Example of a student project: Abdullah Khamis, AMCS/CS 340, 2010 Fall, "Statistical Learning Based System for Text Classification".

Other unsupervised FS methods
• Matrix factorization
  o PCA (Principal Component Analysis): use the PCs with the largest eigenvalues as "features"
  o SVD (Singular Value Decomposition): use the singular vectors with the largest singular values as "features"
  o NMF (Non-negative Matrix Factorization)
• Nonlinear dimensionality reduction
  o Isomap
  o LLE (Locally Linear Embedding)
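Returning to feature selection by clustering: below is a minimal sketch, assuming a NumPy data matrix X of shape (N, m) and scikit-learn's KMeans. Keeping the actual feature nearest to each centroid (rather than the centroid itself, as on the slide) is a common variant that preserves interpretability; the function name and parameters are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_by_feature_clustering(X, k, random_state=0):
    """Cluster the m columns of X into k groups; keep one feature per group."""
    # Each feature becomes a point described by its values on the N samples.
    km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X.T)
    kept = []
    for c in range(k):
        members = np.flatnonzero(km.labels_ == c)          # features in cluster c
        dist = np.linalg.norm(X.T[members] - km.cluster_centers_[c], axis=1)
        kept.append(int(members[np.argmin(dist)]))         # closest to the centroid
    return sorted(kept)
```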
Supervised Feature Selection

Feature Ranking
• Build better, faster, and easier-to-understand learning machines.
• Discover the features most relevant to the target label, e.g., find the genes that discriminate between healthy and diseased patients.
Given the data matrix X (N samples, m features), keep d useful features:
- eliminate useless features (distracters);
- rank the useful features;
- eliminate redundant features.

Example: detecting attacks in real HTTP logs
The logs contain a common request, a JS XSS attack, a remote file inclusion attack, and a DoS attack. Represent each HTTP request by a vector
• in 95 dimensions, corresponding to the 95 ASCII codes between 33 and 127,
• of the character distribution, computed as the frequency of each ASCII code in the path source of the HTTP request.
Should we classify the HTTP vectors in the 95-dim space or in a reduced-dimension space? Which dimensions should we choose, and which choice is better? (A code sketch of this vectorization is given at the end of this subsection.)

Individual Feature Ranking (1): by AUC
1. Rank the features by the AUC obtained when a single feature xi is used as the classifier score:
   AUC = 1: most related; AUC = 0.5: most unrelated.
(The ROC curve plots the true positive rate against the false positive rate; the AUC is the area under it.)

Individual Feature Ranking (2): by Mutual Information
2. Rank the features by the mutual information I(i) between each variable and the target; the higher I(i), the more related the attribute xi is to the class y:
   I(i) = Σ_xi Σ_y P(X = xi, Y = y) log [ P(X = xi, Y = y) / (P(X = xi) P(Y = y)) ]
where
• P(Y = y): frequency count of class y,
• P(X = xi): frequency count of attribute value xi,
• P(X = xi, Y = y): frequency count of attribute value xi jointly with class y.

Individual Feature Ranking (3): with a continuous target
3. Rank the features by the Pearson correlation coefficient R(i), which detects linear dependencies between a variable and the target; rank by R(i) or R²(i) (linear regression):
   |R| = 1: related; R = 0: unrelated.

Individual Feature Ranking (4): by T-test
• Null hypothesis H0: μ+ = μ- (xi and Y are independent).
• The relevance index is the test statistic. If H0 is true,
   t = (μ+ - μ-) / (σ_within √(1/n+ + 1/n-)) ~ Student(n+ + n- - 2 d.f.),
  where n+ and n- are the numbers of samples with labels + and -, and
   σ²_within = [(n+ - 1) σ²+ + (n- - 1) σ²-] / (n+ + n- - 2).
4. Rank by the p-value (the false positive rate): the lower the p-value, the more related xi is to the class y.

Individual Feature Ranking (5): by Fisher Score
• Fisher discrimination, two-class case:
   F = between-class variance / pooled within-class variance
     = (μ+ - μ-)² / [(n+ σ²+ + n- σ²-) / (n+ + n-)].
5. Rank by the F value: the higher F, the more related xi is to the class y.
(A code sketch computing all five ranking criteria also follows at the end of this subsection.)

Rank features in HTTP logs: FS + SVM results (demo: http://www.lri.fr/~xlzhang/KAUST/CS340_slides/FS_rank_demo.zip)

  Method                         | AUC    | Accuracy | Selected features
  All 95 features                | 0.8797 | 97.92%   |
  1. AUC ranking (D=30)          | 0.9212 | 97.60%   | #.128:;?BLOQ[\]_aefhiklmoptuw|
  2. MI ranking (D=30)           | 0.8849 | 97.96%   | "#./1268;ALPQRS[\]_`aehkltwyz|
  3. R ranking (D=30)            | 0.9208 | 97.67%   | "#,.2:;?LQS[\]_`aehiklmoptuwz|
  4. T-test ranking (D=30)       | 0.9208 | 97.67%   | "#,.2:;?LQS[\]_`aehiklmoptuwz|
  5. Fisher score ranking (D=30) | 0.9208 | 97.67%   | "#,.2:;?LQS[\]_`aehiklmoptuwz|
  6. PCA (D=30, unsupervised)    | 0.8623 | 97.74%   | constructed features

(Figure: ROC curves, true positive rate vs. false positive rate, comparing all 95 features against the top-30 features ranked by AUC, MI, correlation coefficient, t-test, and Fisher score.)
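A sketch of the character-distribution representation described above: each request path becomes a 95-dim vector of ASCII-code frequencies (codes 33 to 127). The function name and the example path are illustrative, not from the original logs.

```python
import numpy as np

def request_to_vector(path: str) -> np.ndarray:
    """Character distribution of an HTTP request path over ASCII codes 33..127."""
    counts = np.zeros(95)
    in_range = [c for c in path if 33 <= ord(c) <= 127]
    for ch in in_range:
        counts[ord(ch) - 33] += 1.0
    return counts / max(len(in_range), 1)   # frequencies; guard against empty path

v = request_to_vector("/index.php?page=../../etc/passwd")  # a toy inclusion-style path
```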
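And a hedged sketch computing the five univariate ranking criteria above, assuming a NumPy matrix X of shape (N, m) and binary labels y in {0, 1}; all names are illustrative. Note that scikit-learn's mutual information is a nearest-neighbor estimate for continuous features, whereas the slides use frequency counts over discrete values.

```python
import numpy as np
from scipy import stats
from sklearn.metrics import roc_auc_score
from sklearn.feature_selection import mutual_info_classif

def univariate_scores(X, y):
    """Score every feature individually; X: (N, m), y: binary labels in {0, 1}."""
    pos, neg = X[y == 1], X[y == 0]
    m = X.shape[1]
    # (1) AUC of each single feature used as a score (0.5 = unrelated).
    auc = np.array([roc_auc_score(y, X[:, i]) for i in range(m)])
    # (2) Mutual information I(i), estimated for continuous features.
    mi = mutual_info_classif(X, y)
    # (3) Pearson correlation R(i) with the target (rank by |R| or R^2).
    r = np.array([stats.pearsonr(X[:, i], y)[0] for i in range(m)])
    # (4) T-test p-value per feature: lower p-value = more related to y.
    pval = stats.ttest_ind(pos, neg, equal_var=True).pvalue
    # (5) Fisher score: between-class over pooled within-class variance.
    fisher = (pos.mean(0) - neg.mean(0)) ** 2 / (
        (len(pos) * pos.var(0) + len(neg) * neg.var(0)) / len(X) + 1e-12)
    return {"AUC": auc, "MI": mi, "R": r, "t_pvalue": pval, "Fisher": fisher}
```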
Issues of individual feature ranking
• Relevance vs. usefulness: relevance does not imply usefulness, and usefulness does not imply relevance.
• It leads to the selection of a redundant subset: the k best features are not the best k features.
• A variable that is useless by itself can be useful together with others.

Useless features become useful
• Separation can be gained by using two variables instead of one, or by adding variables.
• Ranking variables individually and independently of each other cannot determine which combination of variables would give the best performance.

Feature Subset Selection

Multivariate feature selection is complex (Kohavi-John, 1997)
With M features, there are 2^M possible feature subsets!

Objectives of feature selection (figure slide): trade off the predictive power of a subset against its size.

Questions before subset feature selection
1. How do we search the space of all possible variable subsets?
2. Do we use the prediction performance to guide the search?
   • No → Filter
   • Yes → Wrapper; then we must also decide
     1) how to assess the prediction performance of a learning machine, to guide the search and to halt it;
     2) which predictor to use: popular predictors include decision trees, Naive Bayes, least-squares linear predictors, and SVMs.

Filter: feature subset selection
All features → Filter → feature subset → Predictor.
The feature subset is chosen by an evaluation criterion that measures how each subset of input variables relates to the class. E.g., the correlation-based feature selector (CFS) prefers subsets that contain features highly correlated with the class and uncorrelated with each other:

   M(S = {f_i, i = 1...k}) = k R̄_cf / √(k + k(k-1) R̄_ff)

where R̄_cf is the mean feature-class correlation (how predictive of the class the set of features is) and R̄_ff is the average feature-feature intercorrelation (how much redundancy there is among the feature subset). A sketch of this merit function appears after the wrapper discussion below.

Filter: feature subset selection (2)
Do we search all possible feature subsets, k = 1, ..., M? Search strategies include:
- exhaustive enumeration,
- forward selection,
- backward elimination,
- best first,
- forward/backward with a stopping criterion.
The filter method is a pre-processing step, independent of the learning algorithm.

Forward Selection
Sequential forward selection (SFS): starting from an empty candidate set, features are sequentially added until the addition of further features does not decrease the criterion.

Backward Elimination
Sequential backward selection (SBS): starting from the full candidate set, features are sequentially removed until the removal of further features increases the criterion.

Wrapper: feature selection methods
All features → multiple feature subsets → Predictor, with the predictor inside the selection loop.
• The learning model is used as part of the evaluation function and also to induce the final learning model.
• Subsets of features are scored according to their predictive power.
• The parameters of the model are optimized by measuring some cost function.
• Danger of over-fitting with intensive search! (A sketch of forward selection used as a wrapper follows below.)
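First, a minimal sketch of the CFS merit above, assuming a NumPy matrix X of shape (N, m), labels y, and Pearson correlation as the correlation measure (Hall's original CFS uses a symmetrical-uncertainty variant for discrete data; the helper name is mine).

```python
import numpy as np

def cfs_merit(X, y, subset):
    """M(S) = k * mean|R_cf| / sqrt(k + k(k-1) * mean|R_ff|) for S = subset."""
    k = len(subset)
    # Mean feature-class correlation: how predictive of the class the set is.
    r_cf = np.mean([abs(np.corrcoef(X[:, f], y)[0, 1]) for f in subset])
    if k == 1:
        return r_cf
    # Average feature-feature intercorrelation: redundancy within the subset.
    r_ff = np.mean([abs(np.corrcoef(X[:, f], X[:, g])[0, 1])
                    for i, f in enumerate(subset) for g in subset[i + 1:]])
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)
```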
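And a sketch of sequential forward selection used as a wrapper, with cross-validated score as the guiding criterion; the stopping rule and all names are illustrative, e.g., forward_selection(X, y, GaussianNB()). Given the over-fitting danger noted above, the final subset should be evaluated on data not used during the search.

```python
import numpy as np
from sklearn.model_selection import cross_val_score

def forward_selection(X, y, estimator, cv=5):
    """Greedily add the feature that most improves CV score; stop when none does."""
    remaining = list(range(X.shape[1]))
    selected, best = [], -np.inf
    while remaining:
        # Score every candidate extension of the current subset.
        score, f = max((cross_val_score(estimator, X[:, selected + [f]], y,
                                        cv=cv).mean(), f) for f in remaining)
        if score <= best:            # no further improvement: halt the search
            break
        selected.append(f)
        remaining.remove(f)
        best = score
    return selected
```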
RFE SVM
Recursive Feature Elimination (RFE) SVM (Guyon-Weston, 2000; US patent 7,117,188).
All features → train SVM → eliminate useless feature(s); performance degradation? Yes: stop. No: continue...
1: repeat
2:   find w and b by training a linear SVM;
3:   remove the feature with the smallest value |wi|;
4: until a desired number of features remain.
(A code sketch of this loop appears at the end of this section.)

Selecting feature subsets in HTTP logs: FS + SVM results

  Method                      | AUC    | Accuracy | Selected features
  All 95 features             | 0.8797 | 97.92%   |
  1. AUC-ranking (D=30)       | 0.9212 | 97.60%   | #.128:;?BLOQ[\]_aefhiklmoptuw|
  2. R-ranking (D=30)         | 0.9208 | 97.67%   | "#,.2:;?LQS[\]_`aehiklmoptuwz|
  3. SFS Gram-Schmidt (D=30)  | 0.8914 | 97.85%   | #&,./25:;=?DFLQ[\_`ghklmptwxz|
  4. RFE SVM (D=30)           | 0.9174 | 97.85%   | "#',-.0238:;<?D[\]_`ghklmoqwz|
  5. ....
  6. ....

Comparison of Filter and Wrapper
• Main goal: rank subsets of useful features.
• Search strategies: explore the space of all possible feature combinations.
• Two criteria: predictive power (maximize) and subset size (minimize).
• Predictive power assessment:
  - Filter methods: criteria that do not involve any learning machine, e.g., a relevance index based on correlation coefficients or test statistics.
  - Wrapper methods: the performance of a learning machine trained with a given feature subset.
• Wrappers are potentially very time-consuming, since they typically need to evaluate a cross-validation scheme at every iteration.
• Filter methods are much faster, but they do not incorporate learning.

Forward Selection with Trees
• Tree classifiers, like CART (Breiman, 1984) or C4.5 (Quinlan, 1993), perform forward selection implicitly: at each step, choose the feature that "reduces entropy" most, working towards "node purity" (e.g., split all the data on f1, then on f2).
• Feature subset selection can also be done with a Random Forest (a sketch follows at the end of this section).

Summary

Conclusion
Feature selection focuses on uncovering subsets of variables X1, X2, ... that are predictive of the target Y.
• Univariate feature selection: how to rank the features?
• Multivariate (subset) feature selection (Filter, Wrapper, Embedded): how to search the subsets of features, and how to evaluate them?
• Feature extraction: how to construct new features in linear/non-linear ways?

In practice
• No method is universally better: there is a wide variety of variable types, data distributions, learning machines, and objectives.
• Match the method complexity to the ratio M/N (features to samples):
  - univariate feature selection may work better than multivariate feature selection;
  - non-linear classifiers are not always better.
• Feature selection is not always necessary to achieve good performance.
See the NIPS 2003 and WCCI 2006 challenges: http://clopinet.com/challenges
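As promised above, a sketch of the RFE-SVM loop, assuming scikit-learn's LinearSVC as the linear SVM and a binary classification task; eliminating one feature per iteration follows the slide's pseudocode, and the regularization setting is illustrative.

```python
import numpy as np
from sklearn.svm import LinearSVC

def rfe_svm(X, y, n_keep):
    """Recursively drop the feature with the smallest |w_i| of a linear SVM."""
    active = list(range(X.shape[1]))
    while len(active) > n_keep:
        svm = LinearSVC(C=1.0, dual=False).fit(X[:, active], y)  # step 2: find w, b
        weakest = int(np.argmin(np.abs(svm.coef_).ravel()))      # step 3: min |w_i|
        del active[weakest]                                      # eliminate it
    return active                                                # step 4: done
```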
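And for the Random Forest route mentioned under "Forward Selection with Trees": a sketch that ranks features by impurity-based importance and keeps the top d. Parameter choices and the function name are illustrative; note this yields a filter-style ranking rather than an explicit subset search.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def top_features_by_forest(X, y, d, random_state=0):
    """Rank features by Random Forest impurity importance; keep the top d."""
    rf = RandomForestClassifier(n_estimators=200, random_state=random_state)
    rf.fit(X, y)
    return np.argsort(rf.feature_importances_)[::-1][:d]
```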
Feature selection toolbox
• Matlab: sequentialfs (sequential feature selection, shown in the demo):
  forward: good; backward: be careful with the definition of the criterion.
• Feature Selection Toolbox 3: freely available, open-source software in C++.
• Weka
(A rough scikit-learn counterpart of sequentialfs is sketched after the references.)

References
• Isabelle Guyon and André Elisseeff. "An Introduction to Variable and Feature Selection." JMLR, 2003.
• Isabelle Guyon et al., Eds. Feature Extraction: Foundations and Applications. Springer, 2006. http://clopinet.com/fextract-book
• Pabitra Mitra, C. A. Murthy, and Sankar K. Pal. "Unsupervised Feature Selection Using Feature Similarity." IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(3), 2002.
• Marc Van Hulle, Katholieke Universiteit Leuven, course slides: http://134.58.34.50/~marc/DM_course/slides_selection.pdf
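Finally, for readers working in Python rather than Matlab: scikit-learn's SequentialFeatureSelector plays roughly the role of sequentialfs. A sketch; the estimator, the number of features, and the data names X and y are illustrative.

```python
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.naive_bayes import GaussianNB

# Greedy forward search guided by cross-validated score, as in sequentialfs;
# direction="backward" gives sequential backward elimination instead.
sfs = SequentialFeatureSelector(GaussianNB(), n_features_to_select=30,
                                direction="forward", cv=5)
# sfs.fit(X, y); X_reduced = sfs.transform(X)   # X, y: your data (illustrative)
```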