INF 5300: Lecture 2
Classification and feature selection
Asbjørn Berge
Department of Informatics
University of Oslo
17 February 2004
Outline
1. Classification
2. Problems
3. Solutions
Features
A feature is any aspect, quality or characteristic of an object. The d features of an object make up a feature vector. The set of available feature vectors spans the feature space. 2-D or 3-D subsets or projections of the feature space can be visualized as scatter plots.
Patterns
A pattern is the pair (x, ω): a feature vector x and a label ω corresponding to an object. A good feature vector for an object results in a pattern that is simple to "learn"; this usually means that equally labeled objects have similar feature vectors, while differently labeled objects have significantly dissimilar feature vectors.
Classification
The ultimate goal of classification is to establish boundaries in feature space, separating groups of patterns with minimum error. A new observation (an unlabeled feature vector of an object) is labeled according to the decision boundaries in feature space. Decision boundaries can be (piecewise) linear or, for example, composed of higher-order polynomials. However, a more complex model implies more parameters, which might cause the model to overfit the training data.
Maximum A Posteriori
Decision boundaries can be simply explained as the boundaries where our density estimates for the different classes are equal. (An oversimplification of this idea: we choose the label of a feature vector by taking the label of the dominant class in a neighborhood of feature space.)
A mathematical definition of this idea is to classify according to the Maximum A Posteriori (MAP) probability. In plain English: given that we know, for each label, how densely its feature vectors occur, which label has the maximum probability of having produced the observed feature vector?
The prior probability (the "ratio of natural occurrence") of each label is denoted p(ω_i).
Maximum A Posteriori
The probability of a feature vector given that we know its label is denoted p(x|ω_i).
The probability of a feature vector is p(x) = Σ_{i=1}^{C} p(x|ω_i) p(ω_i).
The probability of a label given that we know the feature vector can be found by Bayes' rule:
p(ω_i|x) = p(x|ω_i) p(ω_i) / p(x)
The rule is simply to classify according to the label corresponding to the maximum value of p(ω_i|x) for the given feature vector x.
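To make the MAP rule concrete, here is a minimal Python sketch (not from the lecture): it assumes the class-conditional densities p(x|ω_i) are known and uses two toy 2-D Gaussians from scipy purely for illustration; the names and numbers are my own.

import numpy as np
from scipy.stats import multivariate_normal

# Toy class-conditional densities p(x|w_i): two 2-D Gaussians (assumed for illustration).
class_densities = [
    multivariate_normal(mean=[0.0, 0.0], cov=np.eye(2)),
    multivariate_normal(mean=[2.0, 2.0], cov=np.eye(2)),
]
priors = np.array([0.7, 0.3])  # p(w_i), must sum to 1

def map_classify(x):
    """Return the index of the label with maximum posterior p(w_i|x)."""
    likelihoods = np.array([d.pdf(x) for d in class_densities])  # p(x|w_i)
    joint = likelihoods * priors                                  # p(x|w_i) p(w_i)
    posteriors = joint / joint.sum()                              # Bayes' rule: divide by p(x)
    return int(np.argmax(posteriors))

print(map_classify([0.5, 0.2]))  # -> 0
print(map_classify([2.1, 1.8]))  # -> 1

Note that the denominator p(x) only normalizes the posteriors and does not change which label wins, so taking the argmax over the joint p(x|ω_i) p(ω_i) gives the same decision.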
Maximum A Posteriori
Problem: it is usually complex to estimate the probability density p(x|ω_i) of our set of feature vectors. A simplification is needed.
Solution: most data is usually somewhat elliptically "cloud-shaped"¹. The Gaussian distribution has the same property, and is easy to treat algebraically:
p(x|ω_i) = 1 / ((2π)^(d/2) |Σ_i|^(1/2)) · exp[−(1/2)(x − µ_i)^T Σ_i^(−1) (x − µ_i)]
with parameters² the mean vector µ̂_i = (1/n_i) Σ_{j=1}^{n_i} x_j and the covariance matrix Σ̂_i = (1/n_i) Σ_{j=1}^{n_i} (x_j − µ̂_i)(x_j − µ̂_i)^T.
So all we need to generate a MAP rule is to estimate the parameters of the Gaussian distribution.
¹ Reason: random (additive) noise on ideal measurements.
² "Maximum likelihood" estimates.
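A small numpy sketch of the maximum likelihood estimates quoted above, for the training vectors of a single class; the function and variable names are illustrative assumptions, not part of the slides.

import numpy as np

def ml_gaussian_estimates(X):
    """ML estimates of the Gaussian parameters for one class.

    X is an (n_i x d) array whose rows are the training feature
    vectors x_j belonging to class i.
    """
    n_i = X.shape[0]
    mu_hat = X.mean(axis=0)                    # (1/n_i) * sum_j x_j
    centered = X - mu_hat
    sigma_hat = (centered.T @ centered) / n_i  # (1/n_i) * sum_j (x_j - mu)(x_j - mu)^T
    return mu_hat, sigma_hat

# Example: 5 two-dimensional training vectors for one class
X = np.array([[1.0, 2.0], [1.2, 1.9], [0.8, 2.2], [1.1, 2.1], [0.9, 1.8]])
mu_hat, sigma_hat = ml_gaussian_estimates(X)
print(mu_hat)     # estimated mean vector
print(sigma_hat)  # estimated covariance matrix (divides by n_i, as on the slide)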
Maximum A Posteriori
Using Gaussian distributions, the posterior probability p(ω_i|x) can be simplified to a discriminant function
g_i(x) = −(1/2)(x − µ_i)^T Σ_i^(−1) (x − µ_i) − (1/2) log(|Σ_i|) + log(p(ω_i)).
The quadratic term here is referred to as the Mahalanobis distance. Note the similarity to the (squared) Euclidean distance; the only difference is Σ_i, which can be visualized as a space-stretching factor.
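The discriminant function translates directly into code. The following numpy sketch (names assumed for illustration) evaluates g_i(x) for each class and returns the label with the largest value.

import numpy as np

def discriminant(x, mu, sigma, prior):
    """Quadratic Gaussian discriminant g_i(x) as on the slide."""
    diff = x - mu
    sigma_inv = np.linalg.inv(sigma)
    mahalanobis_sq = diff @ sigma_inv @ diff   # squared Mahalanobis distance
    _, logdet = np.linalg.slogdet(sigma)       # log(|Sigma_i|)
    return -0.5 * mahalanobis_sq - 0.5 * logdet + np.log(prior)

def classify(x, mus, sigmas, priors):
    scores = [discriminant(x, m, s, p) for m, s, p in zip(mus, sigmas, priors)]
    return int(np.argmax(scores))

# Two toy classes with different covariances (illustrative values only)
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
sigmas = [np.eye(2), 2.0 * np.eye(2)]
priors = [0.5, 0.5]
print(classify(np.array([0.4, 0.1]), mus, sigmas, priors))  # -> 0

Using np.linalg.slogdet avoids numerical under- or overflow when computing log(|Σ_i|) in higher dimensions.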
The curse of dimensionality
Bellman (1961), referring to the computational complexity of searching the neighborhood of data points in high-dimensional settings.
Commonly used in statistics to describe the problem of data sparsity.
Example: a 3-class pattern recognition problem
Divide the feature space into uniform bins
Compute the ratio of objects from each class in each bin
For a new object, find the correct bin and assign it to the dominant class
The curse of dimensionality
To preserve the resolution of the bins:
3 bins in 1D increases to 3^2 = 9 bins in 2D
Roughly 3 examples per bin in 1D means we need 27 examples to preserve the density of examples in 2D
The curse of dimensionality
Using 3 features makes the problem worse:
The number of bins is now 3^3 = 27
81 examples are needed to preserve the density of examples
Using the original number (9) of examples means that at least 2/3 of the feature space is empty!
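The bin arithmetic on these slides can be checked with a few lines of Python (3 bins per axis and roughly 3 examples per bin, as in the example):

for d in (1, 2, 3):           # number of features
    bins = 3 ** d             # uniform bins at 3 per axis
    needed = 3 * bins         # examples needed to keep ~3 examples per bin
    print(d, bins, needed)    # -> 1 3 9, 2 9 27, 3 27 81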
The curse of dimensionality
Dividing the sample space into equally spaced bins is very inefficient.
Can we beat the curse?
Use prior knowledge (e.g. discard parts of the sample space or restrict the estimates)
Increase the smoothness of the density estimate (wider bins)
Reduce the dimensionality
In practice, the curse means that, for a given sample size, there is a maximum number of features one can add before the classifier starts to degrade.
Dimensionality reduction
In general there are two approaches for dimensionality reduction:
Feature selection: choose a subset of the features,
(x_1, x_2, ..., x_n)^T → (x_{i_1}, x_{i_2}, ..., x_{i_m})^T
Feature extraction: create a subset of new features by combining the existing features,
(x_1, x_2, ..., x_n)^T → (y_1, y_2, ..., y_m)^T = f(x_1, x_2, ..., x_n)
Feature selection
Given a feature set x = {x_1, x_2, ..., x_n}, find a subset y_m = {x_{i_1}, x_{i_2}, ..., x_{i_m}}, with m < n, which optimizes an objective function J(y_m):
(x_1, x_2, ..., x_n)^T → (x_{i_1}, x_{i_2}, ..., x_{i_m})^T, where {x_{i_1}, x_{i_2}, ..., x_{i_m}} = argmax over m and the indices i_k of J({x_{i_1}, ..., x_{i_m}})
Feature selection
Motivation for feature selection
Features may be expensive to obtain
Selected features possess simple meanings (measurement units
remain intact)
Features may be discrete or non-numeric
Feature selection
Search strategy
Exhaustive search implies evaluating (n choose m) feature subsets if we fix m, and 2^n subsets if we need to search over all possible m as well.
Choosing 10 out of 100 features results in roughly 10^13 queries to J.
Obviously we need to guide the search!
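For a concrete sense of scale, the "10 out of 100" claim can be checked with Python's math.comb (my own check, not part of the slides):

import math
print(math.comb(100, 10))  # about 1.73e13 subsets, i.e. queries to J for fixed m = 10
print(2 ** 100)            # about 1.3e30 subsets if all subset sizes are searched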
Objective function (J)
Objective function
Distance metrics
Euclidean (mean difference, minimum distance, maximum distance)
Parametric (Mahalanobis, Bhattacharyya³)
Information theoretic (divergence⁴)
Classifier performance
Need to generate a test set from the training data (e.g. cross-validation)
Note that distance metrics are usually pairwise comparisons (we need to define a C-class extension of the distance).
³ An extension of the Mahalanobis distance that uses different covariance matrices for the classes.
⁴ The difference between pdfs.
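As one possible parametric objective, the sketch below computes the Bhattacharyya distance between two classes modeled as Gaussians. The closed-form expression used here is the standard two-Gaussian formula; it is not spelled out on the slide, so treat the implementation as an assumption for illustration.

import numpy as np

def bhattacharyya_gaussian(mu1, sigma1, mu2, sigma2):
    """Bhattacharyya distance between two Gaussian class models.

    Standard closed form (assumed here, not given on the slide):
    (1/8)(mu1-mu2)^T S^-1 (mu1-mu2) + (1/2) ln(|S| / sqrt(|S1||S2|)),
    with S = (S1 + S2)/2.
    """
    s = 0.5 * (sigma1 + sigma2)
    diff = mu1 - mu2
    term1 = 0.125 * diff @ np.linalg.inv(s) @ diff
    term2 = 0.5 * np.log(np.linalg.det(s) /
                         np.sqrt(np.linalg.det(sigma1) * np.linalg.det(sigma2)))
    return term1 + term2

# Larger values indicate better separability of the two classes.
print(bhattacharyya_gaussian(np.array([0.0, 0.0]), np.eye(2),
                             np.array([2.0, 0.0]), 2.0 * np.eye(2)))

Like the other distance metrics listed above, this is a pairwise (two-class) measure, so a C-class extension (for example, the sum or minimum over all class pairs) has to be chosen separately.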
Naïve individual feature selection
Goal: select the two best features individually.
Any reasonable objective J will rank the features J(x_1) > J(x_2) ≈ J(x_3) > J(x_4).
Thus the features chosen are [x_1, x_2] or [x_1, x_3].
However, x_4 is the only feature that provides information complementary to x_1.
Sequential Forward Selection (SFS)
Starting from the empty set, sequentially add the feature x⁺ that results in the highest objective function J(Y_k + x⁺) when combined with the features Y_k that have already been selected.
Algorithm
1. Start with the empty set Y_0 = ∅
2. Select the next best feature x⁺ = argmax_{x ∉ Y_k} J(Y_k + x)
3. Update Y_{k+1} = Y_k + x⁺; k = k + 1
4. Go to 2
SFS performs best when the optimal subset has a small number of features.
SFS cannot discard features that become obsolete when other features are added.
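A minimal Python sketch of SFS over feature indices, using a generic objective J that scores an index set; the toy J with individual feature scores and a redundancy penalty is my own assumption, included only so the example runs.

def sfs(all_features, J, m):
    """Sequential Forward Selection: greedily add the feature that
    maximizes J(Y + x) until m features are selected."""
    selected = []
    while len(selected) < m:
        remaining = [x for x in all_features if x not in selected]
        best = max(remaining, key=lambda x: J(selected + [x]))
        selected.append(best)          # Y_{k+1} = Y_k + x+
    return selected

# Toy objective (assumed): each feature has an individual score, and the
# redundant pair of features 1 and 2 is penalized when both are present.
scores = {0: 5.0, 1: 4.0, 2: 3.9, 3: 1.0}
redundant_pairs = {(1, 2), (2, 1)}

def J(subset):
    value = sum(scores[i] for i in subset)
    penalty = sum(3.0 for a in subset for b in subset if (a, b) in redundant_pairs)
    return value - penalty

print(sfs([0, 1, 2, 3], J, 2))   # -> [0, 1] with this toy objective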
Sequential Backward Selection (SBS)
Starting from the full set, sequentially remove the feature x⁻ whose removal results in the smallest decrease in the value of J(Y_k − x⁻).
Algorithm
1. Start with the full set Y_0 = X
2. Remove the worst feature x⁻ = argmax_{x ∈ Y_k} J(Y_k − x)
3. Update Y_{k+1} = Y_k − x⁻; k = k + 1
4. Go to 2
SBS performs best when the optimal subset has a large number of features.
SBS cannot re-select a discarded feature, even if it becomes useful again after other features are removed.
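The mirror-image sketch for SBS, again with an assumed toy objective so the example is self-contained.

def sbs(all_features, J, m):
    """Sequential Backward Selection: greedily remove the feature whose
    removal hurts J the least, until only m features remain."""
    selected = list(all_features)
    while len(selected) > m:
        worst = max(selected, key=lambda x: J([f for f in selected if f != x]))
        selected.remove(worst)         # Y_{k+1} = Y_k - x-
    return selected

# Toy objective (assumed): sum of individual feature scores.
scores = {0: 5.0, 1: 4.0, 2: 3.9, 3: 1.0}
def J(subset):
    return sum(scores[i] for i in subset)

print(sbs([0, 1, 2, 3], J, 2))   # -> [0, 1]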
Plus-L Minus-R Selection (LRS)
If L > R, LRS starts from the empty set and repeatedly adds L features and then removes R features.
If L < R, LRS starts from the full set and repeatedly removes R features and then adds L features.
Algorithm
1. If L > R then start with the empty set Y = ∅; else start with the full set Y = X and go to step 3
2. Repeat the SFS step L times
3. Repeat the SBS step R times
4. Go to step 2
LRS attempts to compensate for the weaknesses of SFS and SBS by backtracking.
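A possible sketch of Plus-L Minus-R selection for the L > R case, built from one greedy forward step and one greedy backward step; the helper names and the toy objective are assumptions made to keep the example self-contained.

def forward_step(selected, all_features, J):
    """Add the single feature that maximizes J (one SFS step)."""
    best = max((x for x in all_features if x not in selected),
               key=lambda x: J(selected + [x]))
    return selected + [best]

def backward_step(selected, J):
    """Remove the single feature whose removal hurts J the least (one SBS step)."""
    worst = max(selected, key=lambda x: J([f for f in selected if f != x]))
    return [f for f in selected if f != worst]

def lrs(all_features, J, L, R, target_size):
    """Plus-L Minus-R selection for L > R: start empty, add L, remove R, repeat."""
    selected = []
    while len(selected) < target_size:
        for _ in range(L):
            selected = forward_step(selected, all_features, J)
        for _ in range(R):
            selected = backward_step(selected, J)
    return selected

# Toy objective (assumed): sum of individual feature scores.
scores = {0: 5.0, 1: 4.0, 2: 3.9, 3: 1.0}
J = lambda subset: sum(scores[i] for i in subset)
print(lrs([0, 1, 2, 3], J, L=2, R=1, target_size=2))   # -> [0, 1]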
Bidirectional Search (BDS)
Bidirectional Search is a parallel implementation of SFS and SBS:
SFS is performed from the empty set
SBS is performed from the full set
To guarantee that SFS and SBS converge to the same solution, we must ensure that
features already selected by SFS are not removed by SBS
features already removed by SBS are not selected by SFS
For example, before SFS attempts to add a new feature, it checks whether the feature has been removed by SBS; if it has, SFS attempts to add the second best feature, and so on. SBS operates in a similar fashion.
Sequential Floating Selection (SFFS and SFBS)
An extension of the LRS algorithms with flexible backtracking capabilities.
Rather than fixing the values of L and R, these floating methods allow those values to be determined from the data: the size of the subset during the search can be thought of as "floating".
Sequential Floating Forward Selection (SFFS) starts from the empty set. After each forward step, SFFS performs backward steps as long as the objective function increases.
Sequential Floating Backward Selection (SFBS) starts from the full set. After each backward step, SFBS performs forward steps as long as the objective function increases.
Sequential Floating Selection (SFFS and SFBS)
SFFS algorithm (SFBS is analogous):
1. Start with the empty set Y = ∅
2. Perform an SFS step
3. Find the worst feature x⁻ = argmax_{x ∈ Y_k} J(Y_k − x)
4. If J(Y_k − x⁻) > J(Y_k) then Y_{k+1} = Y_k − x⁻; k = k + 1; go to step 3, else go to step 2
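A compact sketch of SFFS along the lines of the algorithm above: one SFS step, followed by backward steps for as long as they improve J. The guard against removing the feature that was just added, and the toy objective, are my own additions to keep the sketch terminating and runnable.

def sffs(all_features, J, m):
    """Sequential Floating Forward Selection (sketch)."""
    Y = []
    while len(Y) < m:
        # SFS step: add the best remaining feature
        best = max((x for x in all_features if x not in Y), key=lambda x: J(Y + [x]))
        Y = Y + [best]
        # Floating part: remove the worst feature (other than the one just added,
        # which guards against cycling) for as long as removal improves J
        while len(Y) > 1:
            candidates = [f for f in Y if f != best]
            worst = max(candidates, key=lambda x: J([f for f in Y if f != x]))
            if J([f for f in Y if f != worst]) > J(Y):
                Y = [f for f in Y if f != worst]
            else:
                break
    return Y

# Toy objective (assumed): individual scores, with a penalty when the two
# redundant features 1 and 2 appear together.
scores = {0: 5.0, 1: 4.0, 2: 3.9, 3: 2.0}
def J(subset):
    value = sum(scores[i] for i in subset)
    if 1 in subset and 2 in subset:
        value -= 5.0
    return value

print(sffs([0, 1, 2, 3], J, 3))   # -> [0, 1, 3]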
Optimal searches and randomized methods
If the criterion increases monotonically, i.e. J(x_{i_1}) ≤ J(x_{i_1}, x_{i_2}) ≤ ... ≤ J(x_{i_1}, x_{i_2}, ..., x_{i_n}), one can use graph-theoretic methods such as branch and bound or dynamic programming to perform effective subset searches.
Randomized methods are also popular; examples include sequential search with random starting subsets, simulated annealing (random subset permutations where the randomness gradually cools off), and genetic algorithms.