G54DMT – Data Mining Techniques and Applications
http://www.cs.nott.ac.uk/~jqb/G54DMT
Dr. Jaume Bacardit
jqb@cs.nott.ac.uk

Topic 2: Data Preprocessing
Lecture 3: Feature and prototype selection

Outline of the lecture
• Definition and taxonomy
• Feature selection
  – Evaluation methods
  – Exploration mechanisms
• Prototype selection
  – Filters
  – Wrappers
• Resources

Feature Selection
• Transforming a dataset by removing some of its columns
• Example: a dataset with attributes A1 A2 A3 A4 and class C is reduced to A2 A4 C

Prototype Selection
• Transforming a dataset by removing some of its rows
• Example: a dataset with instances I1 I2 I3 I4 is reduced to I1 I2 I4

Why?
• Lack of quality in the data (for both rows and columns)
  – Redundant
  – Irrelevant
  – Noisy
• Scalability issues
  – Some methods may not be able to cope with a large number of attributes or instances
• In general, to help the data mining method learn better

Taxonomy of feature/prototype selection methods
• Filter methods
  – The reduction process happens before the learning process
  – Uses some kind of metric that tries to estimate the goodness of the reduction
  – Pipeline: Dataset → Filter → Classification method

Taxonomy of feature/prototype selection methods
• Wrapper methods
  – Filter methods only estimate the goodness of the reduced dataset
  – Why not use the actual data mining method (or at least a fast one) to tell us whether the reduction is good or bad?
  – The space of possible reductions is iteratively explored by a search algorithm
  – Pipeline: Dataset → (explore reduction ⇄ classification method) → Classifier

Taxonomy of feature/prototype selection methods
• Embedded methods
  – Learning methods that incorporate the feature/instance reduction inside their own process
  – The difference from wrapper methods is that embedded methods are aware that they are performing a reduction
  – We will discuss some of them in the data mining topic

Feature selection
• Two issues characterise FS methods
  – Feature evaluation (for the filters)
    • How do we estimate the goodness of a feature subset?
    • The metric can apply to
      – A feature subset
      – Individual features (generating a ranking)
  – Subset exploration (for both filters and wrappers)
    • How do we explore the space of feature subsets?

Feature evaluation methods
• Four types of metrics (Liu and Yu, 2005)
  – Distance metrics
    • A feature is good if it helps separate better between classes
  – Information metrics
    • Quantify the information gain (Information Theory) of a feature
  – Dependency metrics
    • Quantify the correlation between attributes and between each attribute and the class
  – Consistency metrics
    • Inconsistency: having two equal instances but with different class labels
    • These metrics try to find the minimal set of features that maintains the same level of consistency as the whole dataset

Fisher Score
• Metric from Fisher Discriminant Analysis
• Already discussed in the complexity metrics lecture
• For each feature, it quantifies the overlap between the distributions of values of the dataset's classes
• Cannot detect redundant features, as it scores each feature independently
• Image taken from http://featureselection.asu.edu/featureselection_techreport.pdf

Relief/ReliefF
• Metric for individual features
• Given a sample of p instances from the training set, it scores each feature i as follows:
  S_i = 1/2 · Σ_{t=1..p} [ d(f_{t,i}, f_{NM(x_t),i}) − d(f_{t,i}, f_{NH(x_t),i}) ]
• d = distance function; f_{t,i} = value of feature i for the sampled instance x_t; NM(x_t) = instance closest to x_t with a different class (nearest miss); NH(x_t) = instance closest to x_t with the same class (nearest hit)
• This is the metric for two-class problems (Relief). The formula for multi-class problems (ReliefF) is more complex
• Image taken from http://featureselection.asu.edu/featureselection_techreport.pdf
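To make the scoring procedure concrete, here is a minimal sketch of the two-class Relief score in Python/NumPy. It is my own illustration rather than code from the lecture: the function name, the use of the absolute difference as the per-feature distance d, and the L1 distance for finding the nearest hit/miss are assumptions made to keep the example self-contained.

```python
import numpy as np

def relief_scores(X, y, p=50, rng=None):
    """Two-class Relief: score each feature using p randomly sampled instances.

    For every sampled instance we find its nearest hit (same class) and nearest
    miss (different class), and reward features whose values differ on the miss
    but agree on the hit.
    """
    rng = np.random.default_rng(rng)
    n, d = X.shape
    scores = np.zeros(d)
    sample = rng.choice(n, size=min(p, n), replace=False)
    for t in sample:
        dist = np.abs(X - X[t]).sum(axis=1)            # L1 distance to every instance
        dist[t] = np.inf                               # exclude the instance itself
        same = (y == y[t])
        nh = np.argmin(np.where(same, dist, np.inf))   # nearest hit
        nm = np.argmin(np.where(~same, dist, np.inf))  # nearest miss
        scores += np.abs(X[t] - X[nm]) - np.abs(X[t] - X[nh])
    return scores / 2                                  # 1/2 factor as in the formula above

# Tiny made-up example: feature 0 separates the classes, feature 1 is noise,
# so feature 0 should get the higher score.
X = np.array([[0.1, 0.9], [0.2, 0.1], [0.9, 0.8], [0.8, 0.2]])
y = np.array([0, 0, 1, 1])
print(relief_scores(X, y, p=4, rng=0))
```

Features with larger scores separate the classes better; in the toy dataset above, feature 0 ends up with a clearly positive score while the noisy feature 1 does not.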
CFS (Correlation-based Feature Selection)
• Evaluates subsets of features
• Principle: select the subset that has maximal correlation with the class and minimal correlation between its features
• Merit of a subset S of k features: Merit_S = k·r_cf / sqrt(k + k(k−1)·r_ff)
• r_cf = average correlation between the features and the class
• r_ff = average correlation between pairs of features
• For discrete attributes: correlation based on normalised Information Gain (Information Theory)
• For continuous attributes: Pearson's correlation
• Image taken from http://featureselection.asu.edu/featureselection_techreport.pdf

CFS correlation functions
• For problems with only discrete attributes
  – Correlation measured as symmetrical uncertainty: SU(X,Y) = 2·[H(X) + H(Y) − H(X,Y)] / [H(X) + H(Y)]
  – H(X) = entropy, H(X,Y) = joint entropy of the X and Y variables, that is, computed over every possible combination of values of X and Y
• For problems with continuous attributes
  – Involving two continuous variables: Pearson's correlation
  – Involving one continuous and one discrete variable
    • Each discrete variable X of n values is decomposed into n binary variables Xbi
  – Involving two discrete variables
    • Decomposition for both X and Y
• www.cs.waikato.ac.nz/~mhall/icml2000.ps

Exploration mechanisms
• There are 2^d possible feature subsets of d features
• Several heuristic feature selection methods:
  – Best single features under the feature independence assumption: choose by significance tests
  – Best step-wise feature selection:
    • The best single feature is picked first
    • Then the next best feature conditioned on the first, ...
  – Step-wise feature elimination:
    • Repeatedly eliminate the worst feature
  – Best combined feature selection and elimination
  – Optimal branch and bound:
    • Use feature elimination and backtracking
• Slide taken from Han & Kamber

Hill Climbing search
• Simple and fast search procedure
• Start point: all features / no features / a random subset
• Local change:
  – Add any possible feature
  – Remove any possible feature
  – Add/remove
• Image from "Correlation-based Feature Selection for Machine Learning". Mark A. Hall. PhD thesis

Best-first search
• Hill climbing only remembers the latest best solution; best-first search keeps all the subsets evaluated so far, so it can backtrack to the most promising unexplored one
• Image from "Correlation-based Feature Selection for Machine Learning". Mark A. Hall. PhD thesis

Genetic Algorithms (Holland, 75)
• Bio-inspired optimisation algorithm
• This algorithm manipulates, through an iterative cycle, a population of candidate solutions
• Throughout this cycle the population is evolved, that is, it is manipulated in a procedure that takes inspiration from the principles of natural selection and genetics
• These procedures do not try to mimic life, they are just inspired by how it works

Genetic Algorithm working cycle
• Population A → Evaluation → Selection → Population B → Crossover → Population C → Mutation → Population D → (back to evaluation)

Genetic Algorithms: terms
• Population
  – Possible solutions of the problem
  – Traditionally represented as bit-strings (e.g. each bit associated to a feature, indicating whether it is selected or not)
  – Each bit of an individual is called a gene
• Evaluation
  – Giving a goodness value to each individual in the population
• Selection
  – Process that rewards good individuals
  – Good individuals will survive and get more than one copy in the next population. Bad individuals will disappear

Genetic Algorithms
• Crossover
  – Exchanging subparts of the solutions
  – Examples: 1-point crossover, uniform crossover
  – The crossover stage takes two individuals from the population (parents) and, with a certain probability Pc, generates two offspring

Genetic Algorithms
• Mutation
  – Intuition: introduce small (random) variations to individuals
  – For a binary representation: bit-flipping
    • Select a gene from one of the individuals in the population with probability Pm
    • If that gene has value 1, switch it to 0; if it has value 0, switch it to 1
• Replacement: often omitted stage that decides how populations A and D are mixed
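The sketch below ties the GA ingredients above (bit-string population, evaluation, selection, 1-point crossover with probability Pc, bit-flip mutation with probability Pm) into a wrapper-style feature selector. It is only an illustration under my own assumptions: the fitness function is a hypothetical stand-in (leave-one-out nearest-centroid accuracy minus a small penalty per selected feature) chosen to keep the example self-contained; a real wrapper would call the actual data mining method, e.g. through cross-validation, and the tournament selection used here is one of several possible selection schemes.

```python
import numpy as np

def fitness(mask, X, y):
    """Toy wrapper fitness: leave-one-out nearest-centroid accuracy on the
    selected features, minus a small penalty per feature. Stands in for the
    real classifier a wrapper would normally call."""
    if not mask.any():
        return 0.0
    Xs = X[:, mask.astype(bool)]
    classes = np.unique(y)
    correct = 0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        cents = np.array([Xs[keep & (y == c)].mean(axis=0) for c in classes])
        pred = classes[np.argmin(((cents - Xs[i]) ** 2).sum(axis=1))]
        correct += pred == y[i]
    return correct / len(y) - 0.01 * mask.sum()

def ga_feature_selection(X, y, pop_size=20, generations=30, pc=0.8, pm=0.05, seed=0):
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    pop = rng.integers(0, 2, size=(pop_size, d))             # population A: random bit-strings
    for _ in range(generations):
        fit = np.array([fitness(ind, X, y) for ind in pop])  # evaluation
        # selection: binary tournament, the better of two random individuals is copied
        parents = np.array([pop[max(rng.integers(0, pop_size, 2), key=lambda i: fit[i])]
                            for _ in range(pop_size)])
        # crossover: 1-point, applied to consecutive pairs with probability pc
        children = parents.copy()
        for i in range(0, pop_size - 1, 2):
            if rng.random() < pc:
                cut = rng.integers(1, d)
                children[i, cut:], children[i + 1, cut:] = \
                    parents[i + 1, cut:].copy(), parents[i, cut:].copy()
        # mutation: flip each gene with probability pm; population D replaces A
        flip = rng.random(children.shape) < pm
        pop = np.where(flip, 1 - children, children)
    fit = np.array([fitness(ind, X, y) for ind in pop])
    return pop[np.argmax(fit)].astype(bool)

# Tiny illustration: features 0 and 1 carry the class signal, 2 and 3 are noise
rng = np.random.default_rng(1)
y = rng.integers(0, 2, 60)
X = np.c_[y + 0.1 * rng.normal(size=60), 1 - y + 0.1 * rng.normal(size=60),
          rng.normal(size=60), rng.normal(size=60)]
print(ga_feature_selection(X, y))  # typically keeps informative feature(s), drops the noise
```

The per-feature penalty in the toy fitness plays the role of a preference for compact subsets; without it the GA has no pressure to drop redundant or irrelevant features that do not hurt accuracy.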
Applying FS in WEKA
• 4 methods
  – CFS + Best First
  – Wrapper (with C4.5) + Best First
  – ReliefF + Ranker (top 5 features)
  – InfoGain + Ranker (top 5 features)
• One dataset: Wisconsin breast cancer

ARFF headers of the filtered datasets

@relation 'wisconsin-breast-cancerweka.filters.supervised.attribute.AttributeSelectionEweka.attributeSelection.ReliefFAttributeEval -M -1 -D 1 -K 10-Sweka.attributeSelection.Ranker -T 1.7976931348623157E308 -N 5'
@attribute Bare_Nuclei numeric
@attribute Clump_Thickness numeric
@attribute Cell_Shape_Uniformity numeric
@attribute Cell_Size_Uniformity numeric
@attribute Normal_Nucleoli numeric
@attribute Class {benign,malignant}

@relation 'wisconsin-breast-cancerweka.filters.supervised.attribute.AttributeSelectionEweka.attributeSelection.CfsSubsetEvalSweka.attributeSelection.BestFirst -D 1 -N 5'
@attribute Clump_Thickness numeric
@attribute Cell_Size_Uniformity numeric
@attribute Cell_Shape_Uniformity numeric
@attribute Marginal_Adhesion numeric
@attribute Single_Epi_Cell_Size numeric
@attribute Bare_Nuclei numeric
@attribute Bland_Chromatin numeric
@attribute Normal_Nucleoli numeric
@attribute Mitoses numeric
@attribute Class {benign,malignant}

@relation 'wisconsin-breast-cancerweka.filters.supervised.attribute.AttributeSelectionEweka.attributeSelection.WrapperSubsetEval -B weka.classifiers.trees.J48 -F 5 -T 0.01 -R 1 -- -C 0.25 -M 2Sweka.attributeSelection.BestFirst -D 1 -N 5'
@attribute Clump_Thickness numeric
@attribute Cell_Shape_Uniformity numeric
@attribute Normal_Nucleoli numeric
@attribute Class {benign,malignant}

@relation 'wisconsin-breast-cancerweka.filters.supervised.attribute.AttributeSelectionEweka.attributeSelection.InfoGainAttributeEvalSweka.attributeSelection.Ranker -T 1.7976931348623157E308 -N 5'
@attribute Cell_Size_Uniformity numeric
@attribute Cell_Shape_Uniformity numeric
@attribute Bare_Nuclei numeric
@attribute Bland_Chromatin numeric
@attribute Single_Epi_Cell_Size numeric
@attribute Class {benign,malignant}

Each attribute has an associated colour (in the original slides). In black I have left those attributes picked up by only one method

Genetic Algorithms
• To know more about them
  – What is an Evolutionary Algorithm?
    • Chapter 2 of "Introduction to Evolutionary Computing" by Eiben & Smith
• To formally understand them
  – The Design of Innovation, by David E. Goldberg
    • Contains theoretical models that explain how genetic algorithms work and how to adjust them properly

Prototype selection: wrapper methods
• Global search
  – Evolutionary computation-based (example)
    • Same representation as in FS: one bit per instance
    • When dealing with a large set of instances these methods need to be adapted, e.g. through stratification
  – Estimation of Distribution Algorithms (example)
    • Related to GAs, but with different (smarter) exploration methods
    • A model of the structure of the problem is created from the population
    • Exploration is done by sampling the model

Prototype selection: wrapper methods
• Heuristic search: Windowing method
• Start with a small set (window)
• Iterate through:
  – Learning
  – Testing the theory with the examples not in the window
  – All misclassified examples (up to MaxIncSize) will be added to the window
• www.jair.org/media/487/live-487-1704-jair.pdf

Prototype selection: filter methods
• Static sampling (John & Langley, 96)
  – Generate a random sample from the training set
  – Use statistical tests (χ² for discrete attributes, a large-sample test relying on the central limit theorem for continuous ones) to check whether the sample is similar enough to the whole set
• Pattern by Ordered Projections (J.C. Riquelme et al., 03)
  – Projects the data separately onto each dimension of the dataset, identifying which instances lie at the decision frontiers
  – If an instance is never at a frontier, it is removed

Prototype selection: filter methods
• Instance-based methods
  – Based on the family of nearest-neighbour classifiers
  – Could be considered wrapper/embedded, but because of their simplicity they are used as filters
  – Greatly depend on the definition of distance
  – Two types of methods
    • Those that start with an empty subset and add examples to it
    • Those that start with the complete set and remove examples

Prototype selection: filter methods
• Condensed Nearest Neighbour Rule (Hart, 68)
  – Starts S with two random examples from T, one for each class
  – Tries to classify every example in T using S; if an example is misclassified, it is added to S
  – Stops when all examples in T have been checked
• Edited Nearest Neighbour Rule (Wilson, 72)
  – Starts with S = T
  – For every element in S, check its k nearest neighbours
  – If the class of the element does not agree with the majority class of its k neighbours, drop it
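Both rules are simple enough to sketch directly. The Python/NumPy code below is my own illustration of CNN and ENN as described above (a single pass over T for CNN, k = 3 for ENN), not code taken from the lecture or from KEEL; the Euclidean distance is an assumption.

```python
import numpy as np

def cnn(X, y, rng=None):
    """Condensed NN (Hart, 68): start with one random example per class and
    add every example that the current subset S misclassifies with 1-NN."""
    rng = np.random.default_rng(rng)
    S = [rng.choice(np.flatnonzero(y == c)) for c in np.unique(y)]
    for i in rng.permutation(len(y)):                      # single pass over T
        nearest = S[np.argmin(((X[S] - X[i]) ** 2).sum(axis=1))]
        if y[nearest] != y[i]:                             # misclassified -> keep it
            S.append(i)
    return np.sort(S)

def enn(X, y, k=3):
    """Edited NN (Wilson, 72): start with S = T and drop every example whose
    class disagrees with the majority class of its k nearest neighbours."""
    keep = []
    for i in range(len(y)):
        d = ((X - X[i]) ** 2).sum(axis=1)
        d[i] = np.inf                                      # ignore the example itself
        neigh = np.argsort(d)[:k]
        votes = np.bincount(y[neigh], minlength=y.max() + 1)
        if np.argmax(votes) == y[i]:
            keep.append(i)
    return np.array(keep)

# Tiny illustration: two overlapping 2-D clusters
rng = np.random.default_rng(0)
X = np.r_[rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))]
y = np.r_[np.zeros(50, int), np.ones(50, int)]
print("CNN keeps", len(cnn(X, y, rng=0)), "of", len(y), "instances")  # mostly border points
print("ENN keeps", len(enn(X, y)), "of", len(y), "instances")         # drops noisy points
```

The two sketches illustrate the opposite philosophies listed above: CNN grows a small subset that preserves the decision boundary, while ENN removes the instances that look like noise with respect to their neighbourhood.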
Prototype selection: filter methods
• DROP1 (Wilson & Martinez, 00)
  – Drops an example if its neighbours would be classified correctly more times without it than with it

Example of prototype selection using KEEL
• Again using the Wisconsin breast cancer dataset
• 3 methods (using the Imbalance Learning module)
  – Pattern by Ordered Projections
  – CNN
  – ENN
• Instances per class after selection:

  Method   Positive   Negative
  None        191        355
  CNN          20        191
  ENN         182        344
  POP         179        251

Resources
• A review of feature selection techniques in bioinformatics
• Toward Integrating Feature Selection Algorithms for Classification and Clustering
• List of feature selection methods in KEEL
• Feature selection in WEKA
• List of instance selection methods in KEEL
• Survey of instance-based reduction techniques

Questions?