G54DMT – Data Mining Techniques and
Applications
http://www.cs.nott.ac.uk/~jqb/G54DMT
Dr. Jaume Bacardit
jqb@cs.nott.ac.uk
Topic 2: Data Preprocessing
Lecture 3: Feature and prototype selection
Outline of the lecture
• Definition and taxonomy
• Feature selection
– Evaluation methods
– Exploration mechanisms
• Prototype selection
– Filters
– Wrappers
• Resources
Feature Selection
• Transforming a dataset by removing some of
its columns
[Illustration: a dataset with attributes A1, A2, A3, A4 and class C is reduced to one with only A2, A4 and C]
Prototype Selection
• Transforming a dataset by removing some of
its rows
[Illustration: a dataset with instances I1, I2, I3, I4 is reduced to one with only I1, I2 and I4]
Why?
• Lack of quality in the data (for both rows and
columns)
– Redundant
– Irrelevant
– Noisy
• Scalability issues
– Some methods may not be able to cope with a large
number of attributes or instances
• In general, to help the data mining method learn
better
Taxonomy of feature/prototype
selection methods
• Filter methods
– The reduction process happens before the
learning process
– Using some kind of metric that tries to estimate
the goodness of the reduction
[Diagram: Dataset → Filter → Classification method]
Taxonomy of feature/prototype
selection methods
• Wrapper methods
– Filter methods try to estimate the goodness of the reduced
dataset
– Why don’t we use the actual data mining method (or at
least a fast one) to tell if the reduction is good or bad?
– The space of possible reductions is iteratively
explored by a search algorithm
[Diagram: Dataset → Explore reduction → Classification method → Classifier; the classifier's quality feeds back into the next reduction]
Taxonomy of feature/prototype
selection methods
• Embedded methods
– Learning methods that incorporate the
feature/instance reduction inside their process
– Difference between these and wrapper methods
is that these methods are aware that they are
performing a reduction
– We will discuss some of them in the data mining
topic
Feature selection
• Two issues that characterise the FS methods
– Feature evaluation (for the filters)
• How do we estimate the goodness of a feature subset?
• Metric applies to
– Feature subset
– Individual features (generating a ranking)
– Subset exploration (for both filters and wrappers)
• How do we explore the space of feature subsets?
Feature evaluation methods
• Four types of metrics (Liu and Yu, 2005)
– Distance metrics
• A feature scores highly if it helps to separate the classes better
– Information metrics
• Quantify the information gain (Information Theory) of a feature
– Dependency metrics
• Quantify the correlation between attributes and between each
attribute and the class
– Consistency metrics
• Inconsistency: having two equal instances but with different class
labels
• These metrics try to find the minimal set of features that
maintains the same level of consistency as the whole dataset
Fisher Score
• Metric from the Fisher Discriminant Analysis
• Already discussed in the complexity metrics
topic
• For each feature, it quantifies the overlap between
the distributions of its values in the dataset's
classes
• Cannot detect irrelevant features
Image taken from http://featureselection.asu.edu/featureselection_techreport.pdf
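As a hedged reconstruction of the score shown in that image (the standard Fisher score), for feature i and classes j = 1, ..., c:

F(i) = \frac{\sum_{j=1}^{c} n_j\,(\mu_{i,j} - \mu_i)^2}{\sum_{j=1}^{c} n_j\,\sigma_{i,j}^2}

where n_j is the number of instances of class j, \mu_{i,j} and \sigma_{i,j}^2 are the mean and variance of feature i within class j, and \mu_i is its overall mean.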
Relief/ReliefF
• Metric for individual features
• Given a sample of p instances from the training set, it
scores each feature as follows (formula reconstructed below):
• d = distance function; f_{t,i} = value of feature i in the sampled
instance x_t; NM(x_t) = instance closest to x_t with a
different class (nearest miss); NH(x_t) = instance closest to x_t
with the same class (nearest hit)
• Metric for two-class problems (Relief). The formula
for multi-class problems (ReliefF) is more complex
Image taken from http://featureselection.asu.edu/featureselection_techreport.pdf
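As a hedged reconstruction of the Relief score shown in that image, for feature i and the p sampled instances x_t:

S_i = \frac{1}{2} \sum_{t=1}^{p} \left[ d\!\left(f_{t,i}, f_{NM(x_t),i}\right) - d\!\left(f_{t,i}, f_{NH(x_t),i}\right) \right]

For illustration only (not the exact ReliefF implemented in WEKA), a minimal Python sketch of this two-class scoring, assuming numeric features, Manhattan distance to find the nearest hit/miss, and the per-feature absolute difference as d:

import numpy as np

def relief_scores(X, y, n_samples=100, seed=None):
    """Two-class Relief feature scores (illustrative sketch).

    X: (n, d) array of numeric features, y: (n,) array of class labels.
    Higher scores suggest features that separate the two classes better.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Rescale each feature to [0, 1] so per-feature differences are comparable
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0
    Xs = (X - X.min(axis=0)) / span

    scores = np.zeros(d)
    sample = rng.choice(n, size=min(n_samples, n), replace=False)
    for t in sample:
        dists = np.abs(Xs - Xs[t]).sum(axis=1)        # Manhattan distance
        dists[t] = np.inf                             # ignore the instance itself
        same = (y == y[t])
        nh = np.where(same, dists, np.inf).argmin()   # nearest hit NH(x_t)
        nm = np.where(~same, dists, np.inf).argmin()  # nearest miss NM(x_t)
        # Reward features that differ on the nearest miss and agree on the hit
        scores += np.abs(Xs[t] - Xs[nm]) - np.abs(Xs[t] - Xs[nh])
    return scores / len(sample)                       # averaged per sampled instance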
CFS (Correlation-based Feature
Selection)
• Evaluates subsets of features
• Principle: select the subset that has maximal
correlation to the class, and minimal correlation
between features
• r_cf = average correlation between features and class
• r_ff = average correlation between features
• For discrete attributes: correlation based on
normalised Information Gain (Information Theory)
• For continuous attributes: Pearson’s correlation
Image taken from http://featureselection.asu.edu/featureselection_techreport.pdf
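As a hedged reconstruction of the merit function shown in that image (following Hall's formulation), for a subset S of k features:

\mathrm{Merit}_S = \frac{k\,\overline{r_{cf}}}{\sqrt{k + k(k-1)\,\overline{r_{ff}}}}

where \overline{r_{cf}} and \overline{r_{ff}} are the average feature–class and feature–feature correlations defined above.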
CFS correlation functions
• For problems with only discrete
attributes
– H(X) = entropy of X; H(X,Y) = joint entropy
of the X and Y variables, i.e. computed over
every possible combination of values
of X and Y
• For problems with continuous
attributes
– Involving two continuous variables:
Pearson’s correlation
– Involving one continuous and one
discrete variable
• Each discrete variable X of n values is
decomposed into n binary variables Xbi
– Involving two discrete variables
• Decomposition for both X and Y
www.cs.waikato.ac.nz/~mhall/icml2000.ps
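As a hedged reconstruction of the discrete-attribute correlation shown in that image (the symmetrical uncertainty used in Hall's thesis):

SU(X,Y) = 2\,\frac{H(X) + H(Y) - H(X,Y)}{H(X) + H(Y)}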
Exploration mechanisms
• There are 2^d possible feature subsets of d features
• Several heuristic feature selection methods:
– Best single features under the feature independence
assumption: choose by significance tests
– Best step-wise feature selection:
• The best single-feature is picked first
• Then the next best feature conditioned on the first, ...
– Step-wise feature elimination:
• Repeatedly eliminate the worst feature
– Best combined feature selection and elimination
– Optimal branch and bound:
• Use feature elimination and backtracking
Slide taken from Han & Kamber
Hill Climbing search
• Simple and fast search procedure
• Start point: all features / no features / a random subset
• Local change:
– Add any possible feature
– Remove any possible feature
– Add/remove
Image from “Correlation-based Feature Selection for Machine Learning”. Mark A. Hall. PhD thesis
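As an illustrative sketch (not tied to any particular tool), greedy forward hill climbing over feature subsets in Python; evaluate is a hypothetical placeholder for any subset metric, e.g. a CFS-style merit or the cross-validated accuracy of a fast classifier:

def hill_climb_forward(n_features, evaluate):
    """Greedy forward selection by hill climbing (sketch).

    n_features: total number of features in the dataset.
    evaluate:   callable mapping a set of feature indices to a score
                (higher is better), e.g. a filter metric or a wrapper.
    """
    selected = set()                       # start point: no features
    best_score = float('-inf')             # (could also start from all features
                                           #  or a random subset)
    while True:
        candidates = [f for f in range(n_features) if f not in selected]
        if not candidates:
            break
        # Local change: try every possible single-feature addition
        score, best_f = max((evaluate(selected | {f}), f) for f in candidates)
        if score <= best_score:            # no addition improves: stop
            break
        selected.add(best_f)
        best_score = score
    return selected, best_score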
Best-first search
• Hill climbing only remembers the latest best
solution. Best-first search remembers all solutions
evaluated so far, so it can backtrack to an earlier promising one
Image from “Correlation-based Feature Selection for Machine Learning”. Mark A. Hall. PhD thesis
Genetic Algorithms (Holland, 75)
• Bio-inspired optimization algorithm
• This algorithm manipulates, through an
iterative cycle, a population of candidate
solutions.
• Throughout this cycle, the population is
evolved, that is, it is manipulated by a
procedure that takes inspiration from the
principles of natural selection and genetics
• These procedures do not try to mimic life; they are
just inspired by how it works
Genetic Algorithm working cycle
[Diagram of the cycle: Population A → Evaluation → Selection → Population B → Crossover → Population C → Mutation → Population D]
Genetic Algorithms: terms
• Population
– Possible solutions of the problem
– Traditionally represented as bit-strings (e.g. each bit
associated with a feature, indicating whether it is selected or not)
– Each bit of an individual is called a gene
• Evaluation
– Giving a goodness value to each individual in the
population
• Selection
– Process that rewards good individuals
– Good individuals will survive, and get more than one copy
in the next population. Bad individuals will disappear
Genetic Algorithms
• Crossover
– Exchanging subparts of the solutions
– Common operators: 1-point crossover, uniform crossover
– The crossover stage takes two individuals from
the population (parents) and, with a certain
probability Pc, generates two offspring
Genetic Algorithms
• Mutation
– Intuition: introduce small (random) variations to
individuals
– For binary representation: bit-flipping
• Select a gene from one of the individuals in the
population with probability Pm
• If that gene has value 1, switch it to 0; if it has value 0,
switch it to 1
• Replacement: often omitted stage that
decides how populations A and D are mixed
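Tying the terms above together, a minimal and purely illustrative bit-string GA for feature selection in Python; fitness is a hypothetical user-supplied function scoring a bit-string (one gene per feature), e.g. a wrapper that cross-validates a fast classifier on the selected features:

import random

def ga_feature_selection(n_features, fitness, pop_size=50, generations=100,
                         p_cross=0.8, p_mut=None):
    """Bit-string GA for feature selection (illustrative sketch)."""
    p_mut = p_mut if p_mut is not None else 1.0 / n_features
    pop = [[random.randint(0, 1) for _ in range(n_features)]
           for _ in range(pop_size)]

    def tournament(scored):
        # Selection: the better of two randomly drawn individuals survives
        a, b = random.sample(scored, 2)
        return a[1] if a[0] >= b[0] else b[1]

    for _ in range(generations):
        scored = [(fitness(ind), ind) for ind in pop]          # evaluation
        new_pop = []
        while len(new_pop) < pop_size:
            p1, p2 = tournament(scored), tournament(scored)    # selection
            c1, c2 = p1[:], p2[:]
            if random.random() < p_cross:                      # 1-point crossover
                cut = random.randint(1, n_features - 1)
                c1, c2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            for child in (c1, c2):                             # bit-flip mutation
                for i in range(n_features):
                    if random.random() < p_mut:
                        child[i] = 1 - child[i]
            new_pop.extend([c1, c2])
        pop = new_pop[:pop_size]                               # replacement
    best_score, best = max((fitness(ind), ind) for ind in pop)
    return best, best_score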
Applying FS in WEKA
• 4 methods
– CFS + Best First
– Wrapper (with C4.5) + Best First
– ReliefF + Ranker (top 5 features)
– InfoGain + Ranker (top 5 features)
• One dataset: Wisconsin breast cancer
ARFF headers of the filtered datasets
@relation 'wisconsin-breast-cancerweka.filters.supervised.attribute.AttributeSelectionEweka.attributeSelection.ReliefFAttributeEval -M -1 -D 1 -K
10-Sweka.attributeSelection.Ranker -T 1.7976931348623157E308 -N 5'
@attribute Bare_Nuclei numeric
@attribute Clump_Thickness numeric
@attribute Cell_Shape_Uniformity numeric
@attribute Cell_Size_Uniformity numeric
@attribute Normal_Nucleoli numeric
@attribute Class {benign,malignant}
@relation 'wisconsin-breast-cancerweka.filters.supervised.attribute.AttributeSelectionEweka.attributeSelection.CfsSubsetEvalSweka.attributeSelection.BestFirst -D 1 -N 5'
@attribute Clump_Thickness numeric
@attribute Cell_Size_Uniformity numeric
@attribute Cell_Shape_Uniformity numeric
@attribute Marginal_Adhesion numeric
@attribute Single_Epi_Cell_Size numeric
@attribute Bare_Nuclei numeric
@attribute Bland_Chromatin numeric
@attribute Normal_Nucleoli numeric
@attribute Mitoses numeric
@attribute Class {benign,malignant}
@relation 'wisconsin-breast-cancerweka.filters.supervised.attribute.AttributeSelectionEweka.attributeSelection.WrapperSubsetEval -B
weka.classifiers.trees.J48 -F 5 -T 0.01 -R 1 -- -C 0.25 -M 2Sweka.attributeSelection.BestFirst -D 1 -N 5'
@attribute Clump_Thickness numeric
@attribute Cell_Shape_Uniformity numeric
@attribute Normal_Nucleoli numeric
@attribute Class {benign,malignant}
@relation 'wisconsin-breast-cancerweka.filters.supervised.attribute.AttributeSelectionEweka.attributeSelection.InfoGainAttributeEvalSweka.attributeSelection.Ranker -T 1.7976931348623157E308 -N 5'
@attribute Cell_Size_Uniformity numeric
@attribute Cell_Shape_Uniformity numeric
@attribute Bare_Nuclei numeric
@attribute Bland_Chromatin numeric
@attribute Single_Epi_Cell_Size numeric
@attribute Class {benign,malignant}
Each attribute has an associated colour. In
black I have left those attributes picked up
by one method only
Genetic Algorithms
• To know more about them
– What is an Evolutionary Algorithm?
• Chapter 2 of “Introduction to Evolutionary Computing”
by Eiben & Smith
• To formally understand them
– The Design of Innovation, by David E. Goldberg
• Contains theoretical models that explain how genetic
algorithms work and how to adjust them properly
Prototype selection: wrapper
methods
• Global search
– Evolutionary computation-based (example)
• Same representation as in FS: One bit per instance
• When dealing with a large set of instances these methods
need to be adapted, e.g. through stratification
– Estimation of Distribution Algorithms (example)
• Related to GAs, but with different (smarter) exploration
methods
• A model of the structure of the problem is created from
the population
• Exploration is done by sampling the model
Prototype selection: wrapper
methods
• Heuristic search:
Windowing method
• Start with a small set
• Iterate through:
– Learning
– Testing the theory with
examples not in the
window
– All misclassified examples
(up to MaxIncSize) will be
added to the window
www.jair.org/media/487/live-487-1704-jair.pdf
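A rough Python sketch of the windowing loop described above, assuming a scikit-learn-style classifier with fit/predict; make_classifier, init_size and max_inc_size are illustrative parameters, not the exact ones from the cited paper:

import numpy as np

def windowing(X, y, make_classifier, init_size=100, max_inc_size=50, seed=None):
    """Windowing-style prototype selection (illustrative sketch).

    make_classifier: zero-argument callable returning a fresh classifier
                     with scikit-learn-style fit/predict methods.
    Returns the indices of the final window.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    window = list(rng.choice(n, size=min(init_size, n), replace=False))
    while True:
        clf = make_classifier().fit(X[window], y[window])      # learning
        outside = np.setdiff1d(np.arange(n), window)
        if len(outside) == 0:
            break
        pred = clf.predict(X[outside])                         # testing the theory
        missed = outside[pred != y[outside]]                   # misclassified examples
        if len(missed) == 0:
            break
        window.extend(missed[:max_inc_size])                   # grow the window
    return np.array(window)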
Prototype selection: filter methods
• Static sampling (John & Langley, 96)
– Generate a random sample from the training set
– Use statistical tests (χ² for discrete attributes, a large-sample
test relying on the central limit theorem for continuous ones) to
check if the sample is similar enough to the whole set
• Pattern by Ordered Projections (J.C. Riquelme et al.,
03)
– Projects the data separately onto each dimension of the
dataset, identifying which instances lie at the decision
frontiers
– If an instance is never at the frontier, it is removed
Prototype selection: filter methods
• Instance-based methods
– Based on the family of nearest-neighbour
classifiers.
– Could be considered wrapper/embedded, but
because of their simplicity they are used as filters
– Greatly depend on the definition of distance
– Two types of methods
• Those that start with an empty subset and add
examples to it
• Those that start with the complete set and remove
examples
Prototype selection: filter methods
• Condensed Nearest Neighbour Rule (Hart, 68)
– Starts S with two random examples from T, one for each
class
– Tries to classify every example in T using S; each
misclassified example is added to S
– Stops when all examples in T have been checked
• Edited Nearest Neighbour Rule (Wilson, 72)
– Starts with S=T
– For every element in S, check its k nearest neighbours
– If the class of the element does not agree with the
majority class of the k neighbours, drop it (both rules are sketched below)
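Illustrative Python sketches of both rules (not the exact KEEL implementations), assuming numeric features, Euclidean distance and numpy arrays X (instances) and y (class labels):

import numpy as np
from collections import Counter

def cnn_select(X, y, seed=None):
    """Condensed Nearest Neighbour (Hart, 68) - sketch.
    Grows a subset S: start with one random example per class, then add
    every example misclassified by 1-NN over the current S."""
    rng = np.random.default_rng(seed)
    S = [rng.choice(np.where(y == c)[0]) for c in np.unique(y)]
    for i in rng.permutation(len(y)):
        d = np.linalg.norm(X[S] - X[i], axis=1)
        if y[S][np.argmin(d)] != y[i]:        # misclassified -> add to S
            S.append(i)
    return np.array(S)

def enn_select(X, y, k=3):
    """Edited Nearest Neighbour (Wilson, 72) - sketch.
    Drops an example when its class disagrees with the majority class
    of its k nearest neighbours."""
    keep = []
    for i in range(len(y)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                         # exclude the example itself
        nn = np.argsort(d)[:k]
        majority = Counter(y[nn]).most_common(1)[0][0]
        if y[i] == majority:
            keep.append(i)
    return np.array(keep)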
Prototype selection: filter methods
• DROP1 (Wilson & Martinez, 00)
• Drops an example if its neighbours would be classified
correctly more times without it than with it
Example of prototype selection
using KEEL
• Again using the Wisconsin breast cancer dataset
• 3 Methods (using the Imbalance Learning module)
– Pattern by Ordered Projections
– CNN
– ENN
Method   Positive   Negative
None     191        355
CNN      20         191
ENN      182        344
POP      179        251
Resources
• A review of feature selection techniques in
bioinformatics
• Toward Integrating Feature Selection
Algorithms for Classification and Clustering
• List of feature selection methods in KEEL
• Feature selection in WEKA
• List of instance selection methods in KEEL
• Survey of instance-based reduction
techniques
Questions?