INF 5300 Feature selection - theory and practice
Anne Solberg (anne@ifi.uio.no)

Today:
• Short review of classification
• Finding the best features
• How do we compare features?
• Tomorrow: how to generate new features by linear transforms.

Typical image analysis tasks
• Preprocessing/noise filtering
• Feature extraction
– Are the original image pixel values sufficient for classification, or do we need additional features?
– What kind of features do we use to discriminate between the object types involved?
• Exploratory feature analysis and selection
– Which features separate the object classes best?
– How many features are needed?
• Classifier selection and training
– Select the type of classifier (e.g. Gaussian or neural net)
– Estimate the classifier parameters based on available training data
• Classification
– Assign each object/pixel to the class with the highest probability
• Validating classifier accuracy

Two approaches to classification
• Pixel-based classification
– Compute features in a window centered at each pixel
– Classify each pixel in the image
– Example: satellite image classification
• Object-based classification
– Find the objects in the scene
– Compute features for each object
– Classify each object in the image
– Example: text recognition

Supervised or unsupervised classification
• Supervised classification
– Classify each object or pixel into a set of k known classes
– Class parameters are estimated using a set of training samples from each class.
• Unsupervised classification
– Partition the feature space into a set of k clusters
– k is not known and must be estimated (difficult)
• In both cases, classification is based on the values of the n features x1,...,xn.
• The object is classified to the class with the highest posterior probability.

Feature space
• We will discriminate between different object classes based on a set of features.
• The features are chosen given the application.
• Normally, a large set of different features is investigated.
• Classifier design also involves feature selection - selecting the best subset out of a large feature set.
• Given a training set of a certain size, the dimensionality of the feature vector must be limited.
• Careful selection of features is the most important step in image classification!

Feature space and discriminant functions
Different types of decision functions:
• Univariate (a single feature)
– Linear
• Multivariate (several features)
– Linear
– Quadratic
– Piecewise smooth functions of higher order
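To make the decision-function types above concrete, here is a minimal sketch (Python) that evaluates a linear and a quadratic discriminant for a two-dimensional feature vector and assigns the pattern to the class with the larger value. The coefficient values are arbitrary placeholders for illustration, not values from the lecture.

```python
import numpy as np

def linear_discriminant(x, w, w0):
    # Linear decision function: g(x) = w^T x + w0
    return w @ x + w0

def quadratic_discriminant(x, W, w, w0):
    # Quadratic decision function: g(x) = x^T W x + w^T x + w0
    return x @ W @ x + w @ x + w0

# Hypothetical two-class example with a 2-dimensional feature vector.
x = np.array([1.5, -0.3])
g1 = linear_discriminant(x, w=np.array([0.8, -1.2]), w0=0.5)
g2 = quadratic_discriminant(x, W=np.array([[0.5, 0.0], [0.0, 0.2]]),
                            w=np.array([-0.4, 0.9]), w0=-0.1)

# Assign x to the class whose discriminant value is largest
# (the rule formalized on the next slide).
predicted_class = 1 if g1 >= g2 else 2
print(g1, g2, predicted_class)
```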
Basic classification principles
Classification task:
• Classify an object x = {x1,...,xn} to one of the R classes ω1,...,ωR.
• The decision rule d(x) = ωr divides the feature space into R disjoint subsets Kr, r = 1,...,R.
• The borders between the subsets Kr, r = 1,...,R, are defined by R scalar discrimination functions g1(x),...,gR(x).
• The discrimination functions must satisfy gr(x) ≥ gs(x), s ≠ r, for all x ∈ Kr.
• The discrimination hypersurfaces are thus defined by gr(x) − gs(x) = 0.
• The pattern x is classified to the class whose discrimination function gives the maximum:
d(x) = ωr ⇔ gr(x) = max_{s=1,...,R} gs(x)

Bayesian classification
• Prior probabilities P(ωr) for each class.
• Bayes classification rule: classify a pattern x to the class with the highest posterior probability P(ωr|x):
P(ωr|x) = max_{s=1,...,R} P(ωs|x)
• P(ωs|x) is computed using Bayes' formula:
P(ωs|x) = p(x|ωs) P(ωs) / p(x), where p(x) = Σ_{s=1,...,R} p(x|ωs) P(ωs)
• p(x|ωs) is the class-conditional probability density for class ωs.

Classification with Gaussian distributions
• Probability density for an n-dimensional Gaussian feature vector:
p(x|ωs) = 1 / ( (2π)^(n/2) |Σs|^(1/2) ) exp[ −(1/2) (x − µs)^T Σs^(−1) (x − µs) ]
• µs and Σs are not known, but are estimated from the Ms training samples of class s as the maximum likelihood estimates:
µ̂s = (1/Ms) Σ_{m=1,...,Ms} xm
Σ̂s = (1/Ms) Σ_{m=1,...,Ms} (xm − µ̂s)(xm − µ̂s)^T
where the sums are over all training samples belonging to class s.

Training a classifier
• The parameters of the classifier must be estimated.
• Statistical classifiers use class-conditional probability densities:
– For the Gaussian case: estimate µk and Σk
• For neural nets: estimate the network weights.
• Ground truth data are used to train and test the classifier:
– Divide the ground truth data into separate training and test sets (possibly also a validation set).
• From the training set, estimate all classifier parameters.
• Store the classifier parameters in a "class description database".

Validating classifier performance
• Classification performance is evaluated on a different set of samples with known class - the test set.
• The training set and the test set must be independent!
• Normally, the set of ground truth pixels (with known class) is partitioned into a set of training pixels and a set of test pixels of approximately the same size.
• This can be repeated several times to compute more robust estimates, e.g. the average test accuracy over several different partitions into training and test sets.

Confusion matrices
• A matrix of the true class labels versus the estimated class labels:

                     Estimated class labels
True class      Class 1   Class 2   Class 3   Total #samples
Class 1             80        15         5              100
Class 2              5       140         5              150
Class 3             25        50       125              200
Total              110       205       135              450

Confusion matrix - cont.
Alternatives (for the confusion matrix above):
• Report the number of correctly classified pixels for each class.
• Report the percentage of correctly classified pixels for each class.
• Report the percentage of correctly classified pixels in total.
• Why is the last option not a good measure if the number of test pixels varies between classes?
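As a concrete illustration of these alternatives, the sketch below computes per-class and overall accuracy from the example confusion matrix above. The numbers are the ones from the table; the code itself is only an illustration.

```python
import numpy as np

# Confusion matrix from the example: rows = true class, columns = estimated class.
confusion = np.array([[ 80,  15,   5],
                      [  5, 140,   5],
                      [ 25,  50, 125]])

# Correctly classified samples lie on the diagonal.
correct_per_class = np.diag(confusion)
samples_per_class = confusion.sum(axis=1)                  # 100, 150, 200

# Percentage correctly classified for each class.
class_accuracy = correct_per_class / samples_per_class     # 0.80, 0.93, 0.625

# Percentage correctly classified in total. This is dominated by the classes
# with many test samples (here class 3), which is why it can be misleading
# when the number of test pixels differs between classes.
overall_accuracy = correct_per_class.sum() / confusion.sum()   # 345/450 ≈ 0.77

print(class_accuracy, overall_accuracy)
```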
The curse of dimensionality
• Assume we have S classes and an n-dimensional feature vector.
• With a fully multivariate Gaussian model, we must estimate S different mean vectors and S different covariance matrices from training samples:
– µ̂s has n elements
– Σ̂s, being symmetric, has n(n+1)/2 distinct elements
• Assume that we have Ms training samples from each class.
• For a given Ms, classification performance reaches its maximum at a certain value of n; increasing n beyond this limit leads to worse performance.
• Adding more features is not always a good idea!
• If we have limited training data, we can use diagonal covariance matrices or regularization.

How do we beat the "curse of dimensionality"?
• Use regularized estimates for the Gaussian case
– Use diagonal covariance matrices
– Apply regularized covariance estimation
• Generate few, but informative features
– Careful feature design given the application
• Reduce the dimensionality
– Feature selection
– Feature transforms

Regularized covariance matrix estimation
• Let the covariance matrix be a weighted combination of a class-specific covariance matrix Σk and a common covariance matrix Σ:
Σk(α) = [ (1−α) nk Σk + α n Σ ] / [ (1−α) nk + α n ]
where 0 ≤ α ≤ 1 must be determined, and nk and n are the number of training samples for class k and overall.
• Alternatively:
Σk(β) = (1−β) Σk + β I
where the parameter 0 ≤ β ≤ 1 must be determined.

Feature selection
• Given a large set of N features, how do we select the best subset of n features?
– How do we select n?
– Finding the best combination of n features out of N possible is a large optimization problem.
– Full search is normally not possible.
– Suboptimal approaches are often used.
– How many features are needed?
• Alternative: compute lower-dimensional projections of the N-dimensional space
– PCA
– Fisher's linear discriminant
– Projection pursuit and other non-linear approaches

Exploratory data analysis
• For a small number of features, manual data analysis to study the features is recommended.
• Evaluate e.g.
– Error rates for single-feature classification
– Scatter plots of feature combinations
– Clustering of different feature sets

Preprocessing - data normalization
• Features may have different ranges:
– Feature 1 has range f1min-f1max
– Feature n has range fnmin-fnmax
– The ranges do not necessarily reflect the features' significance for classification performance!
– Example: a minimum distance classifier uses Euclidean distance
• Features with large absolute values will dominate the classifier.

Feature normalization
• Normalize all features to have the same mean and variance.
• Data set with N objects and K features xik, i = 1,...,N, k = 1,...,K.
• Zero mean, unit variance:
x̄k = (1/N) Σ_{i=1,...,N} xik
σk² = 1/(N−1) Σ_{i=1,...,N} (xik − x̄k)²
x̂ik = (xik − x̄k) / σk
• Softmax (non-linear):
y = (xik − x̄k) / (r σk)
x̂ik = 1 / (1 + exp(−y))
• Remark: normalization may destroy important discrimination information!

Feature selection
• Search strategy
– Exhaustive search requires evaluating (m choose l) = m!/(l!(m−l)!) subsets if the subset size l is fixed, and 2^m subsets if all possible sizes must be searched as well.
– Choosing 10 out of 100 results in about 10^13 queries to J (see the sketch below).
– Obviously we need to guide the search!
• Objective function J
– "Predicts" classifier performance.
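The counts above are easy to verify with the standard library; this small sketch computes the number of criterion evaluations for an exhaustive search, both for a fixed subset size and over all subset sizes. The values of m and l are simply the ones used in the example.

```python
from math import comb

m = 100   # total number of candidate features
l = 10    # desired subset size

# Exhaustive search with a fixed subset size l: one query to J per subset.
fixed_size_subsets = comb(m, l)     # = m! / (l! (m-l)!) ≈ 1.7e13 for m=100, l=10

# Exhaustive search over all possible subset sizes: 2^m subsets
# (including the empty set).
all_subsets = 2 ** m

print(f"{fixed_size_subsets:.2e} subsets of size {l}")
print(f"{all_subsets:.2e} subsets in total")
```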
Distance measures (to specify J)
• Between two classes:
– Distance between the two closest points?
– Maximum distance between two points?
– Distance between the class means?
– Average distance between points in the two classes?
– Which distance measure should we use?
• Between K classes:
– How do we generalize to more than two classes?
– Average distance between the classes?
– Smallest distance between a pair of classes?
• Note: performance should often be evaluated in terms of the classification error rate.

Class separability measures
• How do we get an indication of the separability between two classes?
– Euclidean distance between the class means, |µr − µs|
– Bhattacharyya distance
• Can be defined for different distributions.
• For Gaussian data it is
B = (1/8) (µr − µs)^T [ (Σr + Σs)/2 ]^(−1) (µr − µs) + (1/2) ln( |(Σr + Σs)/2| / sqrt(|Σr| |Σs|) )
– Mahalanobis distance between two classes:
∆ = (µ1 − µ2)^T Σ^(−1) (µ1 − µ2), with Σ = N1Σ1 + N2Σ2

Divergence
• Divergence (see Section 5.5 in Theodoridis and Koutroumbas) is a measure of the distance between probability density functions.
• The Mahalanobis distance is a form of divergence measure.
• The Bhattacharyya distance is related to the Chernoff bound on the lowest classification error.
• If the two classes have equal covariance, Σ1 = Σ2, the Bhattacharyya distance is proportional to the Mahalanobis distance.

Selecting individual features
• Each feature is treated individually (correlation between features is ignored).
• Select a criterion, e.g. FDR or divergence.
• Rank the features according to the criterion value C(k).
• Select the set of features with the best individual criterion values.
• Multiclass situations:
– Average class separability over the classes (often used), or
– C(k) = min distance(i, j), i.e. the worst case.
• Advantage of individual selection: computation time.
• Disadvantage: feature correlation is not utilized.

Individual feature selection cont.
• We can also include a simple measure of feature correlation.
• Cross-correlation between features i and j (|ρij| ≤ 1):
ρij = Σ_{n=1,...,N} xni xnj / sqrt( Σ_{n=1,...,N} xni² · Σ_{n=1,...,N} xnj² )
• Simple algorithm:
– Select a criterion C and compute C(k) for all features xk, k = 1,...,m. Rank in descending order and select the feature with the best value; call it xi1.
– Compute the cross-correlation between xi1 and all other features. Choose as the second feature xi2 the one for which
i2 = argmax_j { α1 C(j) − α2 |ρ_{i1 j}| }, for all j ≠ i1.
– Select xik, k = 3,...,l, so that
ik = argmax_j { α1 C(j) − (α2/(k−1)) Σ_{r=1,...,k−1} |ρ_{ir j}| }, for all j not already selected.

Sequential backward selection
• Example: 4 features x1, x2, x3, x4.
• Choose a criterion C and compute it for the full feature vector [x1,x2,x3,x4]^T.
• Eliminate one feature at a time by computing C for [x1,x2,x3]^T, [x1,x2,x4]^T, [x1,x3,x4]^T and [x2,x3,x4]^T.
• Select the best combination, say [x1,x2,x3]^T.
• From the selected 3-dimensional feature vector, eliminate one more feature: evaluate the criterion for [x1,x2]^T, [x1,x3]^T and [x2,x3]^T, and select the combination with the best value.
• Number of combinations searched: 1 + 1/2 ((m+1)m − l(l+1)).

Sequential forward selection
• Compute the criterion value for each single feature. Select the feature with the best value, say x1.
• Form all possible pairs of x1 (the winner of the previous step) and a new feature, e.g. [x1,x2]^T, [x1,x3]^T, [x1,x4]^T, etc. Compute the criterion for each pair and select the best one, say [x1,x3]^T.
• Continue by adding one new feature at a time.
• Number of combinations searched: lm − l(l−1)/2.
– Backward selection is faster if l is closer to m than to 1.
• (A code sketch of this forward procedure follows the BDS slide below.)

Plus-L Minus-R Selection (LRS)

Bidirectional Search (BDS)
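The referenced sketch of sequential forward selection: it greedily adds the single feature that gives the largest criterion value when combined with the features already chosen. The criterion function J is assumed to be supplied by the user (e.g. a class separability measure or an estimated classification accuracy); the function name, signature and toy criterion are illustrative, not from the lecture.

```python
def sequential_forward_selection(features, J, l):
    """Greedy forward selection: start from the empty set and repeatedly add
    the feature that maximizes the criterion J evaluated on the subset.

    features : iterable of candidate feature indices
    J        : callable mapping a list of feature indices to a score (higher is better)
    l        : desired number of features
    """
    selected = []
    remaining = list(features)
    while len(selected) < l and remaining:
        # Evaluate the criterion for each candidate added to the current subset.
        scores = {f: J(selected + [f]) for f in remaining}
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical usage: select 2 of 4 features with a toy criterion that sums
# individual feature scores and subtracts a small redundancy penalty.
if __name__ == "__main__":
    individual_score = {0: 0.6, 1: 0.5, 2: 0.4, 3: 0.3}
    def J(subset):
        return sum(individual_score[f] for f in subset) - 0.05 * (len(subset) - 1)
    print(sequential_forward_selection(range(4), J, l=2))   # -> [0, 1]
```

With m = 4 candidates and l = 2, this performs 4 + 3 = 7 criterion evaluations, matching the count lm − l(l−1)/2 given above.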
Floating search methods
• Problem with backward selection: once a feature has been excluded, it cannot be considered again.
• Floating methods can reconsider features previously discarded.
• Floating search can be defined for both forward and backward selection; here we study forward selection.
• Let Xk = {x1,x2,...,xk} be the best combination of k features and Ym−k the remaining m−k features.
• At the next step, the best (k+1)-feature subset Xk+1 is formed by 'borrowing' an element from Ym−k.
• Then, return to the previously selected lower-dimensional subsets to check whether the inclusion of this new element improves the criterion.
• If so, let the new element replace one of the previously selected features.

Algorithm for floating search
• Step I: Inclusion
xk+1 = argmax_{y∈Ym−k} C({Xk, y}) (choose the element from Ym−k that has the best effect on C when combined with Xk). Set Xk+1 = {Xk, xk+1}.
• Step II: Test
1. xr = argmax_{y∈Xk+1} C(Xk+1 − {y}) (find the feature with the least effect on C when removed from Xk+1).
2. If r = k+1, set k = k+1 and go to Step I.
3. If r ≠ k+1 and C(Xk+1 − {xr}) < C(Xk), go to Step I. (If removing xr does not improve the criterion, do no further backward steps.)
4. If k = 2, put Xk = Xk+1 − {xr} and C(Xk) = C(Xk+1 − {xr}); go to Step I.
5. Otherwise, go to Step III.

Algorithm cont.
• Step III: Exclusion
1. Xk' = Xk+1 − {xr} (remove xr).
2. xs = argmax_{y∈Xk'} C(Xk' − {y}) (find the least significant feature in the new set).
3. If C(Xk' − {xs}) < C(Xk−1), then put Xk = Xk' and go to Step I.
4. Put Xk−1' = Xk' − {xs} and k = k − 1.
5. If k = 2, put Xk = Xk' and C(Xk) = C(Xk'), and go to Step I.
6. Go to Step III.
• Floating search often yields better performance than sequential search, but at the cost of increased computation time.

Optimal searches and randomized methods

Feature transforms
• We now consider computing new features as linear combinations of the existing features.
• From the original feature vector x, we compute a new vector y of transformed features:
y = A^T x
where y is l-dimensional, x is m-dimensional, and A is an m×l matrix.
• y is normally defined in such a way that it has lower dimension than x.

Literature on pattern recognition
• An updated review of statistical pattern recognition:
– A. Jain, R. Duin and J. Mao, "Statistical pattern recognition: a review", IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 1, pp. 4-37, January 2000.
• Classical pattern recognition books:
– R. Duda, P. Hart and D. Stork, Pattern Classification, 2nd ed., Wiley, 2001.
– B. Ripley, Pattern Recognition and Neural Networks, Cambridge University Press, 1996.
– S. Theodoridis and K. Koutroumbas, Pattern Recognition, Academic Press, 1999.

Sequential Floating Search (SFFS and SFBS)
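As an illustration of the sequential floating forward selection (SFFS) idea described in the algorithm slides above, here is a minimal, simplified sketch. It follows the inclusion/conditional-exclusion structure rather than the exact step numbering of the slides, and the criterion function C is assumed to be supplied by the user (higher values mean better subsets); all names are illustrative.

```python
def sffs(features, C, l):
    """Simplified sequential floating forward selection.

    features : iterable of candidate feature indices
    C        : callable mapping a list of feature indices to a score (higher is better)
    l        : desired number of selected features
    """
    selected = []                 # current subset X_k
    remaining = set(features)     # Y_{m-k}
    best = {}                     # best criterion value found so far for each subset size

    while len(selected) < l:
        # Inclusion: add the most significant feature from the remaining set.
        x_new = max(remaining, key=lambda f: C(selected + [f]))
        selected = selected + [x_new]
        remaining.discard(x_new)
        best[len(selected)] = max(best.get(len(selected), float("-inf")), C(selected))

        # Conditional exclusion: as long as dropping the least significant feature
        # gives a better subset than the best one previously found at that size,
        # drop it (the "floating" part that reconsiders earlier decisions).
        while len(selected) > 2:
            least = max(selected, key=lambda f: C([g for g in selected if g != f]))
            candidate = [g for g in selected if g != least]
            if C(candidate) > best.get(len(candidate), float("-inf")):
                selected = candidate
                remaining.add(least)
                best[len(selected)] = C(selected)
            else:
                break
    return selected
```

In practice C could be, for example, the Mahalanobis or Bhattacharyya distance between the classes computed on the candidate subset, or a cross-validated error estimate, as discussed in the distance-measure slides.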