INF 5300, Lecture 2: Classification and feature selection
Asbjørn Berge, Department of Informatics, University of Oslo
17 February 2004

Outline
1 Classification
2 Problems
3 Solutions

Features
- A feature is any aspect, quality or characteristic of an object.
- The d features of an object make up a feature vector.
- The set of available feature vectors spans the feature space.
- 2-d or 3-d subsets or projections of the feature space can be visualized as scatter plots.

Patterns
- A pattern is the pair (x, ω): a feature vector x and a label ω corresponding to an object.
- A good feature vector for an object results in a pattern that is simple to "learn"; this usually means that equally labeled objects have similar feature vectors, while differently labeled objects have clearly dissimilar feature vectors.

Classification
- The ultimate goal of classification is to establish boundaries in feature space that separate groups of patterns with minimum error.
- A new observation (the unlabeled feature vector of an object) is labeled according to the decision boundaries in feature space.
- Decision boundaries can be (piecewise) linear or, for example, composed of higher-order polynomials. However, a more complex model implies more parameters, which might cause the model to overfit the training data.

Maximum A Posteriori
- Decision boundaries can be described simply as the boundaries where our density estimates for the competing classes are equal. (An oversimplified version of this idea: we choose the label of a feature vector by taking the label of the dominant class in a feature-space neighborhood.)
- A mathematical formulation of this idea is to classify according to the Maximum A Posteriori (MAP) probability. In English: given the prior probability of each label and the density of feature vectors for each label, which label is the most probable for the observed feature vector?
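Before formalizing this, a small illustration of the MAP idea as code; the priors and the (Gaussian) class-conditional densities below are invented for the example and are not taken from the lecture.

```python
from scipy.stats import norm

# Hypothetical two-class, one-dimensional problem: class-conditional densities
# p(x|w_i) are Gaussian and the priors p(w_i) are assumed known.
priors = {"w1": 0.7, "w2": 0.3}
densities = {"w1": norm(loc=0.0, scale=1.0), "w2": norm(loc=2.0, scale=1.0)}

def map_label(x):
    """Pick the label with the largest posterior; p(w_i|x) is proportional to p(x|w_i) p(w_i)."""
    scores = {w: densities[w].pdf(x) * priors[w] for w in priors}
    return max(scores, key=scores.get)

print(map_label(0.5))  # close to the w1 cloud and w1 has the larger prior -> "w1"
print(map_label(3.0))  # deep inside the w2 cloud, which outweighs the prior -> "w2"
```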
Maximum A Posteriori
- The prior probability ("ratio of natural occurrence") of each label is denoted p(ω_i).
- The probability of a feature vector given that we know its label is denoted p(x|ω_i).
- The probability of a feature vector is p(x) = Σ_{i=1}^{C} p(x|ω_i) p(ω_i).
- The probability of a label given the feature vector follows from Bayes' rule:
  p(ω_i|x) = p(x|ω_i) p(ω_i) / p(x)
- The rule is simply to classify according to the label with the maximum value of p(ω_i|x) for the given feature vector x.

Maximum A Posteriori
- Problem: it is usually complex to estimate the probability density p(x|ω_i) of our set of feature vectors. A simplification is needed.
- Solution: most data is somewhat elliptically "cloud-shaped" (the reason being random, additive noise on ideal measurements). The Gaussian distribution has the same property and is easy to treat algebraically:
  p(x|ω_i) = 1 / ((2π)^(d/2) |Σ_i|^(1/2)) exp[ -1/2 (x - µ_i)^T Σ_i^(-1) (x - µ_i) ]
  with mean vector µ̂_i = (1/n_i) Σ_{j=1}^{n_i} x_j and covariance matrix Σ̂_i = (1/n_i) Σ_{j=1}^{n_i} (x_j - µ̂_i)(x_j - µ̂_i)^T (the maximum-likelihood estimates).
- So all we need in order to build a MAP rule is to estimate the parameters of the Gaussian distribution for each class.

Maximum A Posteriori
- Using Gaussian distributions, the posterior probability p(ω_i|x) can be simplified to a discriminant function
  g_i(x) = -1/2 (x - µ_i)^T Σ_i^(-1) (x - µ_i) - 1/2 log|Σ_i| + log P(ω_i).
- The quadratic term is referred to as the (squared) Mahalanobis distance. Note the similarity to the squared Euclidean distance; the only difference is Σ_i, which can be visualized as a space-stretching factor.
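A minimal sketch of this Gaussian MAP rule in code, assuming training data X as an (n, d) array with d >= 2 and integer class labels y; the function names and the interface are invented for illustration, not taken from the lecture.

```python
import numpy as np

def fit_gaussian_map(X, y):
    """Estimate, per class, the prior P(w_i), the mean vector and the
    covariance matrix (the maximum-likelihood estimates from the slides)."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = {
            "prior": len(Xc) / len(X),
            "mean": Xc.mean(axis=0),
            # bias=True gives the 1/n_i normalization used on the slide
            "cov": np.cov(Xc, rowvar=False, bias=True),
        }
    return params

def discriminant(x, p):
    """g_i(x) = -1/2 (x-mu)^T Sigma^-1 (x-mu) - 1/2 log|Sigma| + log P(w_i)."""
    diff = x - p["mean"]
    maha = diff @ np.linalg.inv(p["cov"]) @ diff  # squared Mahalanobis distance
    return -0.5 * maha - 0.5 * np.log(np.linalg.det(p["cov"])) + np.log(p["prior"])

def classify(x, params):
    """MAP rule: pick the class with the largest discriminant value."""
    return max(params, key=lambda c: discriminant(x, params[c]))
```

In practice one would also guard against near-singular covariance estimates, which is exactly the small-sample problem behind the curse of dimensionality discussed next.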
The curse of dimensionality
- The term is due to Bellman (1961), referring to the computational complexity of searching the neighborhood of data points in high-dimensional settings.
- It is commonly used in statistics to describe the problem of data sparsity.
- Example: a 3-class pattern recognition problem. Divide the feature space into uniform bins, compute the ratio of objects from each class in each bin, and assign a new object to the dominant class in its bin.

The curse of dimensionality
- To preserve the resolution of the bins, 3 bins in 1D become 3^2 = 9 bins in 2D.
- With roughly 3 examples per bin in 1D (9 examples in total), 27 examples are needed in 2D to preserve the density of examples.

The curse of dimensionality
- Using 3 features makes the problem worse: the number of bins is now 3^3 = 27, and 81 examples are needed to preserve the density.
- Using the original number of examples (9) means that at least 2/3 of the feature space is empty!

The curse of dimensionality
- Dividing the sample space into equally spaced bins is thus very inefficient. Can we beat the curse?
- Use prior knowledge (e.g. discard parts of the sample space, or restrict the estimates).
- Increase the smoothness of the density estimate (wider bins).
- Reduce the dimensionality.
- In practice, the curse means that, for a given sample size, there is a maximum number of features one can add before the classifier starts to degrade.

Dimensionality reduction
- In general there are two approaches to dimensionality reduction:
- Feature selection: choose a subset of the existing features, (x_1, x_2, ..., x_n) → (x_{i_1}, x_{i_2}, ..., x_{i_m}).
- Feature extraction: create new features by combining the existing ones, (x_1, x_2, ..., x_n) → (y_1, y_2, ..., y_m) = f(x_1, x_2, ..., x_n).

Feature selection
- Given a feature set x = {x_1, x_2, ..., x_n}, find the subset y_m = {x_{i_1}, x_{i_2}, ..., x_{i_m}} with m < n that optimizes an objective function J:
  (x_{i_1}, x_{i_2}, ..., x_{i_m}) = argmax over {i_1, ..., i_m} of J(x_{i_1}, x_{i_2}, ..., x_{i_m})

Feature selection
- Motivation for feature selection:
- Features may be expensive to obtain.
- Selected features keep a simple meaning (measurement units remain intact).
- Features may be discrete or non-numeric.

Feature selection
- Search strategy: exhaustive search means evaluating (n choose m) subsets if we fix m, and 2^n subsets if all values of m must be searched as well. Choosing 10 features out of 100 results in roughly 10^13 queries to J. Obviously we need to guide the search!
- The other ingredient is the objective function J.

Objective function
- Distance metrics:
- Euclidean (mean difference, minimum distance, maximum distance).
- Parametric (Mahalanobis; Bhattacharyya, an extension of Mahalanobis allowing different class covariances).
- Information theoretic (divergence, a measure of the difference between pdfs).
- Classifier performance: requires a test set generated from the training data (e.g. by cross-validation).
- Note that distance metrics are usually pairwise comparisons, so a C-class extension of the distance is needed.
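As one concrete and deliberately simple choice of distance-based objective (an illustrative choice, not one prescribed by the lecture), the sketch below scores a candidate feature subset by the squared Mahalanobis distance between two class means under an unweighted pooled covariance estimate. It assumes X is an (n, d) array, y holds binary labels 0/1, and subset is a tuple of column indices.

```python
import numpy as np

def mahalanobis_separability(X, y, subset):
    """Toy objective J(subset): squared Mahalanobis distance between the two
    class means, computed on the candidate feature columns only."""
    Xs = X[:, list(subset)]
    X0, X1 = Xs[y == 0], Xs[y == 1]
    pooled = (np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)) / 2
    pooled = np.atleast_2d(pooled)  # np.cov squeezes single-feature subsets
    diff = X0.mean(axis=0) - X1.mean(axis=0)
    return float(diff @ np.linalg.inv(pooled) @ diff)
```

For more than two classes one would need, as noted above, some C-class extension, for instance averaging the criterion over all class pairs.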
Naïve individual feature selection
- Goal: select the two best features individually.
- Any reasonable objective J will rank the features, say J(x_1) > J(x_2) ≈ J(x_3) > J(x_4).
- The features chosen are thus [x_1, x_2] or [x_1, x_3].
- However, x_4 may be the only feature that provides information complementary to x_1: ranking features individually ignores how they interact.

Sequential Forward Selection (SFS)
- Starting from the empty set, sequentially add the feature x+ that gives the highest objective value J(Y_k + x+) when combined with the already selected features Y_k.
- Algorithm:
  1. Start with the empty set Y_0 = ∅.
  2. Select the next best feature x+ = argmax over x ∉ Y_k of J(Y_k + x).
  3. Update Y_{k+1} = Y_k + x+; k = k + 1.
  4. Go to step 2.
- SFS performs best when the optimal subset has a small number of features.
- SFS cannot discard features that become obsolete once other features have been added.

Sequential Backward Selection (SBS)
- Starting from the full set, sequentially remove the feature x- whose removal gives the smallest decrease in the objective, i.e. the largest J(Y_k - x-).
- Algorithm:
  1. Start with the full set Y_0 = X.
  2. Remove the worst feature x- = argmax over x ∈ Y_k of J(Y_k - x).
  3. Update Y_{k+1} = Y_k - x-; k = k + 1.
  4. Go to step 2.
- SBS performs best when the optimal subset has a large number of features.
- SBS cannot re-select a feature once it has been discarded, even if it later becomes useful.

Plus-L Minus-R Selection (LRS)
- If L > R, LRS starts from the empty set and repeatedly adds L features and then removes R features.
- If L < R, LRS starts from the full set and repeatedly removes R features and then adds L features.
- Algorithm:
  1. If L > R, start with the empty set Y = ∅; otherwise start with the full set Y = X and go to step 3.
  2. Repeat the SFS step L times.
  3. Repeat the SBS step R times.
  4. Go to step 2.
- LRS attempts to compensate for the weaknesses of SFS and SBS by backtracking.

Bidirectional Search (BDS)
- Bidirectional Search is a parallel implementation of SFS and SBS: SFS is run from the empty set while SBS is run from the full set.
- To guarantee that SFS and SBS converge to the same solution, we must ensure that features already selected by SFS are never removed by SBS, and that features already removed by SBS are never selected by SFS.
- For example, before SFS attempts to add a new feature, it checks whether that feature has been removed by SBS and, if so, attempts to add the second-best feature instead, and so on. SBS operates in a similar fashion.
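A sketch of the SFS loop above in code. Here J is any objective that scores a tuple of feature indices (for instance the separability sketch earlier), and the stopping criterion is simply a target subset size m; both are illustrative assumptions rather than part of the lecture.

```python
def sfs(n_features, J, m):
    """Sequential Forward Selection: start from the empty set and greedily
    add, at each step, the feature x+ that maximizes J(Y_k + x+)."""
    selected = []                       # Y_k, in the order features were added
    remaining = set(range(n_features))
    while len(selected) < m:
        x_plus = max(remaining, key=lambda f: J(tuple(selected + [f])))
        selected.append(x_plus)
        remaining.remove(x_plus)
    return selected
```

For example, sfs(X.shape[1], lambda s: mahalanobis_separability(X, y, s), m=3) would greedily pick three features under the toy objective; SBS is the mirror image, starting from the full index set and repeatedly dropping the feature whose removal hurts J the least.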
Sequential Floating Selection (SFFS and SFBS)
- An extension of the LRS algorithms with flexible backtracking: rather than fixing the values of L and R, the floating methods let them be determined by the data, so the size of the subset can be thought of as "floating" during the search.
- Sequential Floating Forward Selection (SFFS) starts from the empty set; after each forward step, SFFS performs backward steps as long as the objective function increases.
- Sequential Floating Backward Selection (SFBS) starts from the full set; after each backward step, SFBS performs forward steps as long as the objective function increases.

Sequential Floating Selection (SFFS and SFBS)
- SFFS algorithm (SFBS is analogous):
  1. Start with the empty set Y = ∅.
  2. Perform an SFS step.
  3. Find the worst feature x- = argmax over x ∈ Y_k of J(Y_k - x).
  4. If J(Y_k - x-) > J(Y_k), then Y_{k+1} = Y_k - x-; k = k + 1; go to step 3. Otherwise go to step 2.

Optimal searches and randomized methods
- If the criterion increases monotonically, J(x_{i_1}) ≤ J(x_{i_1}, x_{i_2}) ≤ ... ≤ J(x_{i_1}, x_{i_2}, ..., x_{i_n}), graph-theoretic methods such as branch and bound or dynamic programming can perform effective subset searches.
- Randomized methods are also popular; examples include sequential searches from random starting subsets, simulated annealing (random subset permutations where the randomness "cools off") and genetic algorithms.
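A sketch of SFFS with the same illustrative interface as the SFS sketch above. To guarantee termination, this version only accepts a backward (floating) step when the reduced subset improves on the best subset of that size found so far; that bookkeeping is a common implementation choice rather than something stated on the slides.

```python
def sffs(n_features, J, m):
    """Sequential Floating Forward Selection sketch: after every forward
    (SFS) step, conditionally drop features as long as the reduced subset
    beats the best subset of the same size seen so far."""
    selected = []
    best = {}  # subset size -> (score, subset) of the best subset seen so far
    while len(selected) < m:
        # forward step: add the feature that maximizes J
        candidates = [f for f in range(n_features) if f not in selected]
        x_plus = max(candidates, key=lambda f: J(tuple(selected + [f])))
        selected = selected + [x_plus]
        score = J(tuple(selected))
        if score > best.get(len(selected), (float("-inf"), None))[0]:
            best[len(selected)] = (score, list(selected))
        # floating step: remove the least useful feature while the reduced
        # subset improves on the best subset of its size
        while len(selected) > 2:
            x_minus = max(selected, key=lambda f: J(tuple(s for s in selected if s != f)))
            reduced = [s for s in selected if s != x_minus]
            reduced_score = J(tuple(reduced))
            if reduced_score > best.get(len(reduced), (float("-inf"), None))[0]:
                best[len(reduced)] = (reduced_score, reduced)
                selected = reduced
            else:
                break
    return best[m][1]
```

Tracking the best subset per size is what keeps the floating search from oscillating between adding and removing the same features.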