INF 5300
Feature selection - theory and practice
Anne Solberg (anne@ifi.uio.no)
Today:
• Short review of classification
• Finding the best features
• How do we compare features
• Tomorrow: how to generate new features by linear
transforms.
Typical image analysis tasks
• Preprocessing/noise filtering
• Feature extraction
– Are the original image pixel values sufficient for classification, or do
we need additional features?
– What kind of features do we use to discriminate between the object
types involved?
• Exploratory feature analysis and selection
– Which features separate the object classes best?
– How many features are needed?
• Classifier selection and training
– Select the type of classifier (e.g. Gaussian or neural net)
– Estimate the classifier parameters based on available training data
• Classification
– Assign each object/pixel to the class with the highest probability
• Validating classifier accuracy
Two approaches to classification
• Pixel-based classification
– Compute features in a window centered at each pixel
– Classify each pixel in the image
– Example: satellite image classification
• Object-based classification
– Find the objects in the scene
– Compute features for each object
– Classify each object in the image
– Example: text recognition
Supervised or unsupervised classification
• Supervised classification
– Classify each object or pixel into a set of k known classes
– Class parameters are estimated using a set of training
samples from each class.
• Unsupervised classification
– Partition the feature space into a set of k clusters
– k is not known and must be estimated (difficult)
• In both cases, classification is based on the value of
the set of n features x1,...,xn.
• The object is classified to the class which has the
highest posterior probability.
Feature space
• We will discriminate between different object classes based on a
set of features.
• The features are chosen given the application.
• Normally, a large set of different features is investigated.
• Classifier design also involves feature selection - selecting the
best subset out of a large feature set.
• Given a training set of a certain size, the dimensionality of the
feature vector must be limited.
• Careful selection of features is the most important step in image
classification!
Feature space and discriminant functions
Different types of decision functions:
• Univariate (a single feature)
– Linear
• Multivariate (several features):
– Linear
– Quadratic
– Piecewise smooth functions of higher order
Basic classification principles
Classification task:
• Classify object x = {x1,...,xn} to one of the R classes ω1,...,ωR
• Decision rule d(x)=ωr divides the feature space into R
disjoint subsets Kr, r=1,...R.
• The borders between subsets Kr, r=1,...R are defined by R
scalar discrimination functions g1(x),...,gR(x)
• The discrimination functions must satisfy:
gr(x)≥gs(x), s≠r, for all x∈Kr
• Discrimination hypersurfaces are thus defined by
gr(x)-gs(x)=0
• The pattern x will be classified to the class whose
discrimination function gives a maximum:
d(x) = ωr ⇔ gr(x) = max_{s=1,...,R} gs(x)
Bayesian classification
• Prior probabilities P(ωr) for each class
• Bayes classification rule: classify a pattern x to the
class with the highest posterior probability P(ωr|x)
P(ωr|x) = max_{s=1,...,R} P(ωs|x)
• P(ωs|x) is computed using Bayes formula:
P(ωs|x) = p(x|ωs) P(ωs) / p(x),  where  p(x) = Σ_{s=1,...,R} p(x|ωs) P(ωs)
• p(x| ωs) is the class-conditional probability density
for a given class.
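As a minimal Python sketch of this rule (assuming the class-conditional densities p(x|ωs) are available as callables and the priors as an array; the function name is illustrative):

```python
import numpy as np

def bayes_classify(x, class_densities, priors):
    """Classify x to the class with the highest posterior P(w_s|x).

    class_densities: list of callables, class_densities[s](x) = p(x|w_s)
    priors: array of prior probabilities P(w_s), summing to 1.
    """
    likelihoods = np.array([p(x) for p in class_densities])
    joint = likelihoods * priors      # p(x|w_s) P(w_s)
    evidence = joint.sum()            # p(x) = sum_s p(x|w_s) P(w_s)
    posteriors = joint / evidence     # Bayes formula
    return int(np.argmax(posteriors)), posteriors
```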
Classification with Gaussian distributions
• Probability distribution for n-dimensional Gaussian vector:
p(x|ωs) = 1 / ( (2π)^(n/2) |Σs|^(1/2) ) · exp[ -½ (x - µs)ᵀ Σs⁻¹ (x - µs) ]

µ̂s = (1/Ms) Σ_{m=1,...,Ms} xm
Σ̂s = (1/Ms) Σ_{m=1,...,Ms} (xm - µ̂s)(xm - µ̂s)ᵀ
where the sum is over all training samples belonging to class s
• µs and Σs are not known, but they are estimated from the Ms training samples of class s as the Maximum Likelihood estimates above.
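A small Python sketch of the ML estimates and the resulting class-conditional density (helper names are illustrative):

```python
import numpy as np

def estimate_gaussian(X):
    """ML estimates of the mean and covariance from the rows of X (Ms x n)."""
    mu = X.mean(axis=0)
    Xc = X - mu
    sigma = (Xc.T @ Xc) / X.shape[0]   # ML estimate: divide by Ms
    return mu, sigma

def gaussian_density(x, mu, sigma):
    """Evaluate the n-dimensional Gaussian density p(x|w_s)."""
    n = mu.size
    diff = x - mu
    norm = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(sigma))
    return float(np.exp(-0.5 * diff @ np.linalg.solve(sigma, diff)) / norm)
```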
Training a classifier
• The parameters of the classifier must be estimated
• Statistical classifiers use class-conditional probability
densities:
– For the Gaussian case: estimate µk and Σk
• For neural nets: estimate network weights
• Ground truth data is used to train and test the classifier
– Divide ground truth data into a separate training and test
set (possibly also a validation set)
• From the training data set, estimate all classifier
parameters
• Store classifier parameters in a ”Class description
database”.
Validating classifier performance
• Classification performance is evaluated on a different
set of samples with known class - the test set.
• The training set and the test set must be
independent!
• Normally, the set of ground truth pixels (with known
class) is partitioned into a set of training pixels and
a set of test pixels of approximately the same size.
• This can be repeated several times to compute more
robust estimates as average test accuracy over
several different partitions of test set and training
set.
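A minimal sketch of such repeated partitioning, with train_fn/predict_fn as placeholders for whatever classifier is used:

```python
import numpy as np

def repeated_holdout_accuracy(X, y, train_fn, predict_fn, n_repeats=10, seed=0):
    """Average test accuracy over several random 50/50 train/test partitions."""
    rng = np.random.default_rng(seed)
    accuracies = []
    for _ in range(n_repeats):
        perm = rng.permutation(len(y))
        half = len(y) // 2
        train_idx, test_idx = perm[:half], perm[half:]
        model = train_fn(X[train_idx], y[train_idx])     # estimate parameters
        predictions = predict_fn(model, X[test_idx])     # classify the test set
        accuracies.append(np.mean(predictions == y[test_idx]))
    return float(np.mean(accuracies))
```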
Confusion matrices
• A matrix with the true class label versus the estimated
class labels for each class
True class labels (rows) versus estimated class labels (columns):

                Estimated class labels
True class    Class 1   Class 2   Class 3   Total #samples
Class 1          80        15         5          100
Class 2           5       140         5          150
Class 3          25        50       125          200
Total           110       205       135          450
Confusion matrix - cont.
Alternatives:
• Report the number of correctly classified pixels for each class.
• Report the percentage of correctly classified pixels for each class.
• Report the percentage of correctly classified pixels in total.
• Why is this not a good measure if the number of test pixels from each class varies between classes?
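A small sketch of how the matrix and the per-class percentages can be computed (class labels assumed to be integers 0,...,K-1):

```python
import numpy as np

def confusion_matrix(true_labels, predicted_labels, n_classes):
    """Rows: true class, columns: estimated class."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(true_labels, predicted_labels):
        cm[t, p] += 1
    return cm

def per_class_accuracy(cm):
    """Percentage of correctly classified samples for each (true) class."""
    return 100.0 * np.diag(cm) / cm.sum(axis=1)

# For the example above: per-class accuracies 80.0%, 93.3% and 62.5%,
# overall accuracy 345/450 = 76.7% - dominated by the largest classes.
```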
The curse of dimensionality
• Assume we have S classes and a n-dimensional feature vector.
• With a fully multivariate Gaussian model, we must estimate S
different mean vectors and S different covariance matrices
from training samples.
µ̂s has n elements
Σ̂s has n(n+1)/2 distinct elements (it is symmetric)
• Assume that we have Ms training samples from each class
• Given Ms, the achievable classification performance has a maximum for a
certain value of n; increasing n beyond this limit leads to worse
performance.
• Adding more features is not always a good idea!
• If we have limited training data, we can use diagonal covariance
matrices or regularization.
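To illustrate the parameter count above (n mean elements plus n(n+1)/2 distinct covariance elements per class):

```python
# Parameters to estimate per class for a full multivariate Gaussian model
for n in (5, 10, 50):
    n_params = n + n * (n + 1) // 2
    print(f"n = {n:3d}: {n_params} parameters per class")
# n =   5: 20 parameters, n = 10: 65 parameters, n = 50: 1325 parameters
```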
How do we beat the ”curse of dimensionality”?
• Use regularized estimates for the Gaussian case
– Use diagonal covariance matrices
– Apply regularized covariance estimation:
• Generate few, but informative features
– Careful feature design given the application
• Reducing the dimensionality
– Feature selection
– Feature transforms
Regularized covariance matrix estimation
• Let the covariance matrix be a weighted combination of a class-specific
covariance matrix Σk and a common covariance matrix Σ:

Σk(α) = [ (1-α) nk Σk + α n Σ ] / [ (1-α) nk + α n ]

where 0≤α≤1 must be determined, and nk and n are the number of training
samples for class k and overall.
• Alternatively:

Σk(β) = (1-β) Σk + β I

where the parameter 0≤β≤1 must be determined.
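Both estimators are straightforward to sketch in Python (function names are illustrative):

```python
import numpy as np

def regularized_covariance(sigma_k, sigma_common, n_k, n, alpha):
    """Weighted combination of class-specific and common covariance, 0 <= alpha <= 1."""
    return ((1 - alpha) * n_k * sigma_k + alpha * n * sigma_common) / \
           ((1 - alpha) * n_k + alpha * n)

def shrunk_covariance(sigma_k, beta):
    """Shrink the class covariance towards the identity matrix, 0 <= beta <= 1."""
    return (1 - beta) * sigma_k + beta * np.eye(sigma_k.shape[0])
```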
Feature selection
• Given a large set of N features, how do we select the
best subset of n features?
– How do we select n?
– Finding the best combination of n features out of N possible
is a large optimization problem.
– Full search is normally not possible.
– Suboptimal approaches are often used.
– How many features are needed?
• Alternative: compute lower-dimensional projections
of the N-dimensional space
– PCA
– Fisher’s linear discriminant
– Projection pursuit and other non-linear approaches
Exploratory data analysis
• For a small number of
features, manual data
analysis to study the features
is recommended.
• Evaluate e.g.
– Error rates for single-feature classification
– Scatter plots
– Clustering different
feature sets
(Figure: scatter plots of feature combinations)
Preprocessing - data normalization
• Features may have different ranges
– Feature 1 has range f1min-f1max
– Feature n has range fnmin-fnmax
– This does not reflect their significance in classification
performance!
– Example: minimum distance classifier uses Euclidean
distance
• Features with large absolute values will dominate the classifier
Feature normalization
• Normalize all features to have the same mean and variance.
• Data set with N objects and K features
• Features xik, i=1...N, k=1,...K
Zero mean, unit variance:

x̄k = (1/N) Σ_{i=1,...,N} xik
σk² = (1/(N-1)) Σ_{i=1,...,N} (xik - x̄k)²
x̂ik = (xik - x̄k) / σk

Softmax (non-linear):

y = (xik - x̄k) / (r σk)
x̂ik = 1 / (1 + exp(-y))
Remark: normalization may destroy important discrimination
information!
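A minimal sketch of both normalizations for a data matrix X with one object per row:

```python
import numpy as np

def zscore_normalize(X):
    """Zero mean, unit variance per feature (column)."""
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)            # matches the 1/(N-1) variance estimate
    return (X - mean) / std

def softmax_normalize(X, r=1.0):
    """Non-linear 'softmax' squashing of each feature into (0, 1)."""
    y = (X - X.mean(axis=0)) / (r * X.std(axis=0, ddof=1))
    return 1.0 / (1.0 + np.exp(-y))
```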
Feature selection
• Search strategy
– Exhaustive search implies evaluating m!/(l!(m-l)!) feature subsets if we fix the subset size l, and 2^m subsets if all possible sizes l must be searched as well.
– Choosing 10 out of 100 will result in ~10^13 queries to J.
– Obviously we need to guide the search!
• Objective function (J)
– ”Predict” classifier performance
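The numbers above are easy to verify:

```python
from math import comb

m = 100
print(comb(m, 10))   # 17310309456440, i.e. ~1.7e13 subsets of size 10
print(2 ** m)        # ~1.3e30 subsets if the size is not fixed
```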
Distance measures (to specify J)
• Between two classes:
– Distance between the closest two points?
– Maximum distance between two points?
– Distance between the class means?
– Average distance between points in the two classes?
– Which distance measure?
• Between K classes:
– How do we generalize to more than two classes?
– Average distance between the classes?
– Smallest distance between a pair of classes?
Note: Often performance should be evaluated in terms of
classification error rate
Class separability measures
• How do we get an indication of the separability
between two classes?
– Euclidean distance |µr- µs|
– Bhattacharyya distance
• Can be defined for different distributions
• For Gaussian data, it is

B = (1/8) (µr - µs)ᵀ [ (Σr + Σs)/2 ]⁻¹ (µr - µs) + (1/2) ln( |(Σr + Σs)/2| / √(|Σr| |Σs|) )

– Mahalanobis distance between two classes:

∆ = (µ1 - µ2)ᵀ Σ⁻¹ (µ1 - µ2),  Σ = N1Σ1 + N2Σ2
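A sketch of the Gaussian Bhattacharyya distance above in Python:

```python
import numpy as np

def bhattacharyya_distance(mu_r, mu_s, sigma_r, sigma_s):
    """Bhattacharyya distance between two Gaussian class distributions."""
    sigma_avg = 0.5 * (sigma_r + sigma_s)
    diff = mu_r - mu_s
    term1 = 0.125 * diff @ np.linalg.solve(sigma_avg, diff)
    term2 = 0.5 * np.log(np.linalg.det(sigma_avg) /
                         np.sqrt(np.linalg.det(sigma_r) * np.linalg.det(sigma_s)))
    return float(term1 + term2)
```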
Divergence
• Divergence (see 5.5 in Theodoridis and
Koutroumbas) is a measure of distance between
probability density functions.
• Mahalanobis distance is a form of divergence
measure.
• The Bhattacharyya distance is related to the Chernoff
bound for the lowest classification error.
• If two classes have equal covariance matrices Σ1=Σ2, then the
Bhattacharyya distance is proportional to the
Mahalanobis distance.
Selecting individual features
• Each feature is treated individually (no correlation between
features)
• Select a criterion, e.g. FDR or divergence
• Rank the features according to the value of the criterion C(k)
• Select the set of features with the best individual criterion values
• Multiclass situations (one of these is often used):
– Average class separability, or
– C(k) = min distance(i,j) - worst case
• Advantage with individual selection: computation time
• Disadvantage: no correlation is utilized.
Individual feature selection cont.
• We can also include a simple measure of feature correlation.
• Cross-correlation between feature i and j (|ρij| ≤ 1):

ρij = Σ_{n=1,...,N} xni xnj / √( Σ_{n=1,...,N} xni² · Σ_{n=1,...,N} xnj² )

• Simple algorithm:
– Select a criterion C and compute C(k) for all xk, k=1,...,m. Rank them in descending order and select the feature with the best value. Call it xi1.
– Compute the cross-correlation between xi1 and all other features. Choose the feature xi2 for which
  i2 = argmax_j { α1 C(j) - α2 |ρ_{i1 j}| },  for all j ≠ i1
– Select xik, k=3,...,l, so that
  ik = argmax_j { α1 C(j) - (α2/(k-1)) Σ_{r=1,...,k-1} |ρ_{ir j}| },  for all j ≠ ir, r=1,...,k-1
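A sketch of this selection scheme, assuming the individual criterion values C(k) are precomputed in a vector scores and X holds one object per row:

```python
import numpy as np

def rank_features_with_correlation(X, scores, l, alpha1=1.0, alpha2=0.5):
    """Select l features individually, penalizing correlation with the
    already selected features."""
    N, m = X.shape
    norms = np.sqrt((X ** 2).sum(axis=0))
    rho = (X.T @ X) / np.outer(norms, norms)     # cross-correlation matrix
    selected = [int(np.argmax(scores))]          # x_i1: best individual criterion
    while len(selected) < l:
        best_j, best_val = None, -np.inf
        for j in range(m):
            if j in selected:
                continue
            penalty = np.mean(np.abs(rho[selected, j]))
            value = alpha1 * scores[j] - alpha2 * penalty
            if value > best_val:
                best_j, best_val = j, value
        selected.append(best_j)
    return selected
```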
Sequential backward selection
• Example: 4 features x1,x2,x3,x4
• Choose a criterion C and compute it for the vector
[x1,x2,x3,x4]T
• Eliminate one feature at a time by computing
[x1,x2,x3]T, [x1,x2,x4]T, [x1,x3,x4]T and [x2,x3,x4]T
• Select the best combination, say [x1,x2,x3]T.
• From the selected 3-dimensional feature vector
eliminate one more feature, and evaluate the
criterion for [x1,x2]T, [x1,x3]T, [x2,x3]T and select the
one with the best value.
• Number of combinations searched: 1 + ½[(m+1)m - l(l+1)]
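A minimal sketch of backward elimination, assuming a criterion function that scores any feature subset:

```python
import numpy as np

def sequential_backward_selection(features, criterion, l):
    """Start from all m features, remove one feature at a time until l remain."""
    current = list(features)
    while len(current) > l:
        candidates = [[f for f in current if f != removed] for removed in current]
        values = [criterion(c) for c in candidates]
        current = candidates[int(np.argmax(values))]   # keep the best reduced subset
    return current
```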
Sequential forward selection
• Compute the criterion value for each feature. Select the
feature with the best value, say x1.
• Form all possible combinations of features x1 (the winner at
the previous step) and a new feature, e.g. [x1,x2]T, [x1,x3]T,
[x1,x4]T, etc. Compute the criterion and select the best one,
say [x1,x3]T.
• Continue with adding a new feature.
• Number of combinations searched: lm-l(l-1)/2.
– Backwards selection is faster if l is closer to m than to 1.
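The forward counterpart, with the same assumed criterion interface:

```python
def sequential_forward_selection(features, criterion, l):
    """Add the feature giving the best criterion value at each step."""
    selected, remaining = [], list(features)
    while len(selected) < l:
        values = [criterion(selected + [f]) for f in remaining]
        best = remaining[values.index(max(values))]
        selected.append(best)
        remaining.remove(best)
    return selected
```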
Plus-L Minus-R Selection (LRS)
Bidirectional Search (BDS)
Floating search methods
• Problem with backward selection: if one feature is excluded, it
cannot be considered again.
• Floating methods can reconsider features previously discarded.
• Floating search can be defined both for forward and backward
selection, here we study forward selection.
• Let Xk={x1,x2,...,xk} be the best combination of the k features
and Ym-k the remaining m-k features.
• At the next step, the best subset of k+1 features, Xk+1, is formed by
’borrowing’ an element from Ym-k.
• Then, return to previously selected lower dimension subset to
check whether the inclusion of this new element improves the
criterion.
• If so, let the new element replace one of the previously selected
features.
Algorithm for floating search
• Step I: Inclusion
xk+1=argmaxy∈Ym-kC({Xk,y}) (choose the element from Ym-k
that has best effect of C when combined with Xk).
Set Xk+1= {Xk, xk+1}.
• Step II: Test
1. xr= argmaxy∈Xk+1C({Xk+1-y}) (Find the feature with the
least effect on C when removed from Xk+1)
2. If r=k+1, change k=k+1 and go to step I.
3. If r≠k+1 AND C({Xk+1-xr})<C(Xk), goto step I. (If removing
xr does not give a better k-feature subset, no further backwards
selection.)
4. If k=2 put Xk= Xk+1- xr and C(Xk)=C(Xk+1- xr). Goto step I.
Algorithm cont.
• Step III: Exclusion
1. Xk’ = Xk+1 - xr (remove xr)
2. xs = argmax_{y∈Xk’} C({Xk’ - y}) (find the least significant feature in the new set)
3. If C(Xk’ - xs) < C(Xk-1) then Xk = Xk’ and goto step I.
4. Put Xk-1’ = Xk’ - xs and k = k-1.
5. If k=2, put Xk = Xk’ and C(Xk) = C(Xk’) and goto step I.
6. Goto step III.
Floating search often yields better performance than
sequential search, but at the cost of increased
computational time.
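A compact sketch of floating forward selection, assuming a criterion function that scores any feature subset; tracking the best subset found for each size is one way of implementing the test in step II:

```python
def floating_forward_selection(features, criterion, l):
    """Sequential floating forward selection (simplified sketch of the steps above)."""
    best = {0: ([], -float("inf"))}   # best (subset, value) found for each size
    k = 0
    while k < l:
        # Step I: inclusion - add the most significant feature not in the subset
        base = best[k][0]
        candidates = [f for f in features if f not in base]
        x_new = max(candidates, key=lambda f: criterion(base + [f]))
        current, value = base + [x_new], criterion(base + [x_new])
        k += 1
        if k not in best or value > best[k][1]:
            best[k] = (current, value)
        # Steps II/III: conditional exclusion while it improves a smaller subset
        while k > 2:
            x_worst = max(current,
                          key=lambda f: criterion([g for g in current if g != f]))
            reduced = [g for g in current if g != x_worst]
            reduced_value = criterion(reduced)
            if reduced_value > best[k - 1][1]:
                current, k = reduced, k - 1
                best[k] = (reduced, reduced_value)
            else:
                break
    return best[l][0]
```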
Optimal searches and randomized methods
Feature transforms
• We now consider computing new features as linear
combinations of the existing features.
• From the original feature vector x, we compute a
new vector y of transformed features
y = Aᵀx
where y is l-dimensional, x is m-dimensional, and A is an m×l matrix (so Aᵀ is l×m).
• y is normally defined in such a way that it has lower
dimension than x.
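A small illustration, with A as a placeholder matrix (PCA, Fisher's discriminant etc. choose A in specific ways, covered next time):

```python
import numpy as np

m, l = 10, 3
A = np.random.randn(m, l)   # placeholder transform; columns are the new directions
x = np.random.randn(m)      # original m-dimensional feature vector
y = A.T @ x                 # transformed l-dimensional feature vector
```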
Literature on pattern recognition
• Updated review of statistical pattern recognition:
– A. Jain, R. Duin and J. Mao: Statistical pattern recognition: a review, IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 1, pp. 4-37, January 2000.
• Classical PR books
– R. Duda, P. Hart and D. Stork, Pattern Classification, 2nd ed., Wiley, 2001.
– B. Ripley, Pattern Recognition and Neural Networks, Cambridge University Press, 1996.
– S. Theodoridis and K. Koutroumbas, Pattern Recognition, Academic Press, 1999.
Sequential Floating Search (SFFS and SFBS)