CpE 646 Pattern Recognition and Classification
Prof. Hong Man
Department of Electrical and Computer Engineering
Stevens Institute of Technology

Bayesian Decision Theory
Chapter 3 (Sections 3.7, 3.8)

Outline:
• Problems of dimensionality
• Component analysis and discriminants

Problems of Dimensionality
• In practical multi-category cases, it is common to see problems involving hundreds of features, especially when the features are binary valued.
• We assume every single feature is useful for some discrimination, but we do not know (and usually doubt) whether they all provide independent information.
• Two concerns:
  – Classification accuracy depends upon the dimensionality and the amount of training data.
  – Computational complexity of designing a classifier.

Accuracy, Dimension and Training Sample Size
• If the features are statistically independent, classification performance can be excellent in theory.
• Example: two classes, multivariate normal with the same covariance, p(x|ωj) ~ N(μj, Σ), j = 1, 2. With equal priors the Bayes error is

  P(\text{error}) = \frac{1}{\sqrt{2\pi}} \int_{r/2}^{\infty} e^{-u^2/2}\, du, \qquad r^2 = (\mu_1 - \mu_2)^t \Sigma^{-1} (\mu_1 - \mu_2),

  and therefore \lim_{r \to \infty} P(\text{error}) = 0.
• r^2 is called the squared Mahalanobis distance from μ1 to μ2.

The Mahalanobis Distance
• The squared Mahalanobis distance from x to μi is r^2 = (x - \mu_i)^t \Sigma_i^{-1} (x - \mu_i).
• The loci of points of constant density are hyperellipsoids of constant squared Mahalanobis distance.

Accuracy, Dimension and Training Sample Size
• If the features are independent, then

  \Sigma = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_d^2) \qquad \text{and} \qquad r^2 = \sum_{i=1}^{d} \left( \frac{\mu_{i1} - \mu_{i2}}{\sigma_i} \right)^2.

• This shows how each feature contributes to reducing the probability of error (by increasing r):
  – The most useful features are the ones for which the difference between the means is large relative to the standard deviation.
  – Also, no feature is useless.
  – To further reduce the error rate, we can introduce new independent features.

Accuracy, Dimension and Training Sample Size
• In general, if the probabilistic structure of the problem is known, introducing new features will improve classification performance.
  – The existing feature set may be inadequate.
  – The Bayes risk cannot be increased by adding a new feature; at worst, the Bayes classifier will simply ignore the new feature.
• However, adding new features increases the computational complexity of both the feature extractor and the classifier.

Accuracy, Dimension and Training Sample Size
• It has frequently been observed in practice that, beyond a certain point, the inclusion of additional features leads to worse rather than better performance.
  – Frequently this is because we have the wrong model (e.g. Gaussian) or the wrong assumptions (e.g. independent features).
  – Also, the number of training samples may be inadequate, so the distributions are not estimated accurately.
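Since the two-class error above depends only on the Mahalanobis distance r between the means, it is easy to evaluate numerically. The following is a minimal Python sketch (the function name and the example numbers are made up for illustration), using the standard normal tail to evaluate the error integral.

  import numpy as np
  from scipy.stats import norm

  def bayes_error_two_gaussians(mu1, mu2, cov):
      """Two equal-covariance Gaussians with equal priors:
      P(error) = (1/sqrt(2*pi)) * integral_{r/2}^{inf} exp(-u^2/2) du,
      where r^2 is the squared Mahalanobis distance between the means."""
      diff = mu1 - mu2
      r2 = diff @ np.linalg.solve(cov, diff)       # r^2 = (mu1-mu2)^t Sigma^{-1} (mu1-mu2)
      return r2, norm.sf(np.sqrt(r2) / 2.0)        # sf(.) is the upper Gaussian tail

  # Example with made-up numbers: the larger r is, the smaller the error
  mu1 = np.array([0.0, 0.0])
  mu2 = np.array([3.0, 1.0])
  cov = np.array([[1.0, 0.2],
                  [0.2, 2.0]])
  r2, p_err = bayes_error_two_gaussians(mu1, mu2, cov)
  print(f"r^2 = {r2:.3f}, P(error) = {p_err:.4f}")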
Computational Complexity
• Our design methodology is also affected by computational difficulty.
• Notation of computational complexity:
  – Order of a function: f(x) is "of the order of h(x)", written f(x) = O(h(x)) and read "big oh of h(x)", if there exist constants c and x0 such that |f(x)| ≤ c|h(x)| for all x > x0.
    • This is an upper bound: f(x) grows no worse than h(x) for sufficiently large x.
    • Example: for f(x) = a0 + a1 x + a2 x^2 we can take h(x) = x^2, so f(x) = O(x^2).
    • The big-oh order of a function is not unique. In our example, f(x) = O(x^2), or O(x^3), or O(x^4), or O(x^2 ln x).
  – Big theta notation: f(x) = Θ(h(x)), read "big theta of h(x)", if there exist constants x0, c1 and c2 such that c1 h(x) ≤ f(x) ≤ c2 h(x) for all x > x0.
    • In the previous example, f(x) = Θ(x^2), but f(x) = Θ(x^3) does not hold.
• Computational complexity is measured in terms of the number of basic mathematical operations, such as additions, multiplications, and divisions, or in terms of computing time and memory requirements.

Complexity of ML Estimation
• Consider a Gaussian classifier in d dimensions with n training samples for each of c classes.
• For each category, we have to compute the discriminant function

  g(x) = -\frac{1}{2} (x - \hat{\mu})^t \hat{\Sigma}^{-1} (x - \hat{\mu}) - \frac{d}{2} \ln 2\pi - \frac{1}{2} \ln |\hat{\Sigma}| + \ln P(\omega)

  – \hat{\mu}: d × n additions and 1 division, O(dn)
  – \hat{\Sigma}: d(d+1)/2 components, each requiring n multiplications and additions, O(d^2 n)
  – |\hat{\Sigma}| is O(d^2) and \hat{\Sigma}^{-1} is O(d^3)
  – The term -\frac{d}{2}\ln 2\pi is O(1); estimation of P(ω) is O(n)

Complexity of ML Estimation
• The total complexity for one discriminant function is dominated by the O(d^2 n) term.
• For a total of c classes, the overall computational complexity of learning in this Bayes classifier is O(cd^2 n) ≈ O(d^2 n) (c is a constant much smaller than d and n).
• The complexity increases as d and n increase.

Computational Complexity
• The computational complexity of classification is usually lower than that of parameter estimation (i.e. learning).
• Given a test point x, for each of the c categories:
  – Compute (x - \hat{\mu}): O(d).
  – Compute the quadratic form (x - \hat{\mu})^t \hat{\Sigma}^{-1} (x - \hat{\mu}): O(d^2).
  – Compute the decision \max_i g_i(x): O(c).
  – So overall, for small c, classification is an O(d^2) operation.

Overfitting
• Frequently the number of training samples is inadequate. To address this problem we can:
  – Reduce the dimensionality
    • Redesign the feature extractor
    • Select a subset of the existing features
  – Assume all classes share the same covariance matrix
  – Find a better way to estimate the covariance matrix
    • With some prior estimate, Bayesian learning can refine this estimate
  – Assume the covariance matrix is diagonal, i.e. the off-diagonal elements are all zero and the features are independent.

Overfitting
• Assuming feature independence will generally produce a suboptimal classifier, but in some cases it may outperform the ML estimate over the full parameter space.
  – This is because of insufficient data.
• To obtain accurate estimates of the model parameters we need many more data samples than parameters. If we do not have them, we may have to reduce the parameter space.
  – We can start with a complex model, and then smooth and simplify it. Such a solution may have poor estimation error on the training data, but may generalize well on test data.
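Putting the ML-estimation and classification steps together, here is a minimal Python sketch (the function names and the toy data are illustrative, not part of the course material); the "diagonal" flag corresponds to the feature-independence simplification discussed under overfitting.

  import numpy as np

  def fit_gaussian_ml(X, diagonal=False):
      """ML estimates for one class; optionally keep only the diagonal of the
      covariance (the feature-independence simplification mentioned above)."""
      mu = X.mean(axis=0)                          # O(dn)
      Xc = X - mu
      sigma = (Xc.T @ Xc) / X.shape[0]             # O(d^2 n), the dominant learning cost
      if diagonal:
          sigma = np.diag(np.diag(sigma))          # zero the off-diagonal elements
      return mu, np.linalg.inv(sigma), np.linalg.slogdet(sigma)[1]

  def discriminant(x, mu, sigma_inv, logdet, prior):
      """g(x) = -1/2 (x-mu)^t Sigma^{-1} (x-mu) - d/2 ln 2pi - 1/2 ln|Sigma| + ln P(w)."""
      diff = x - mu
      quad = diff @ sigma_inv @ diff               # O(d^2) per test point
      return -0.5 * quad - 0.5 * len(mu) * np.log(2 * np.pi) - 0.5 * logdet + np.log(prior)

  # Toy usage with made-up data: assign x to the class with the largest g_i(x)
  rng = np.random.default_rng(0)
  X1 = rng.normal([0.0, 0.0], 1.0, size=(50, 2))
  X2 = rng.normal([2.0, 1.0], 1.0, size=(50, 2))
  models = [fit_gaussian_ml(X) for X in (X1, X2)]
  x = np.array([1.8, 0.9])
  scores = [discriminant(x, *m, prior=0.5) for m in models]
  print("assigned class:", int(np.argmax(scores)) + 1)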
Component Analysis and Discriminants
• One of the major problems in statistical machine learning and pattern recognition is the curse of dimensionality, i.e. the large value of the feature dimension d.
• We can combine features in order to reduce the dimension of the feature space – dimension reduction.
  – Linear combinations are simple to compute and tractable.
  – Project the high-dimensional data onto a lower-dimensional space.

Component Analysis and Discriminants
• Two classical approaches for finding an "optimal" linear transformation:
  – PCA (Principal Component Analysis): the projection that best represents the data in a least-squares sense.
  – MDA (Multiple Discriminant Analysis): the projection that best separates the data in a least-squares sense.

Principal Component Analysis
• Given n data samples {x1, x2, ..., xn} in d dimensions.
• Define a linear transform

  \tilde{x}_k = m + a_k e,

  where \tilde{x}_k is an approximation of the feature vector x_k, e is a unit vector in a particular direction (||e|| = 1), a_k is the 1-D projection of x_k onto the basis vector e, and m is the sample mean

  m = \frac{1}{n} \sum_{k=1}^{n} x_k.

• We try to minimize the squared-error criterion function

  J(a_1, a_2, \ldots, a_n, e) = \sum_{k=1}^{n} \| (m + a_k e) - x_k \|^2.

Principal Component Analysis
• To solve for a_k:

  J(a_1, a_2, \ldots, a_n, e) = \sum_{k=1}^{n} \| a_k e - (x_k - m) \|^2
                              = \sum_{k=1}^{n} a_k^2 \|e\|^2 - 2 \sum_{k=1}^{n} a_k e^t (x_k - m) + \sum_{k=1}^{n} \| x_k - m \|^2.

  Setting \partial J / \partial a_k = 0, we have a_k = e^t (x_k - m).

• We also define the scatter matrix

  S = \sum_{k=1}^{n} (x_k - m)(x_k - m)^t.

  Compare this with the covariance matrix from ML estimation (CPE646-4, p.19).

Principal Component Analysis
• To solve for e, given a_k = e^t (x_k - m) and ||e|| = 1, we have

  J(e) = \sum_{k=1}^{n} a_k^2 - 2 \sum_{k=1}^{n} a_k^2 + \sum_{k=1}^{n} \| x_k - m \|^2
       = - \sum_{k=1}^{n} [e^t (x_k - m)]^2 + \sum_{k=1}^{n} \| x_k - m \|^2
       = - \sum_{k=1}^{n} e^t (x_k - m)(x_k - m)^t e + \sum_{k=1}^{n} \| x_k - m \|^2
       = - e^t S e + \sum_{k=1}^{n} \| x_k - m \|^2.

Principal Component Analysis
• Minimizing J(e) is the same as maximizing e^t S e. We can use a Lagrange multiplier to maximize e^t S e subject to the constraint ||e|| = 1:

  u = e^t S e - \lambda (e^t e - 1), \qquad \frac{\partial u}{\partial e} = 2 S e - 2 \lambda e = 0,

  so S e = \lambda e and e^t S e = \lambda e^t e = \lambda.

• To maximize e^t S e, we should therefore select e as the eigenvector corresponding to the largest eigenvalue of the scatter matrix S.

Principal Component Analysis
• In general, if we extend this 1-dimensional projection to a d'-dimensional projection (d' ≤ d),

  \tilde{x}_k = m + \sum_{i=1}^{d'} a_{ik} e_i,

  the solution is to let the e_i be the eigenvectors corresponding to the d' largest eigenvalues of the scatter matrix.
• Because the scatter matrix is real and symmetric, the eigenvectors are orthogonal.
• The projections of x_k onto each of these eigenvectors (i.e. basis vectors) are called the principal components.

Problem of PCA
• PCA reduces the dimension for each class individually. The resulting components are a good representation of the data in each class, but they may not be good for discrimination.
• For example, suppose two classes have 2-D Gaussian-like densities, represented by two ellipses, and are well separable. If we project the data onto the first principal component (i.e. from 2-D to 1-D), they become inseparable (with a very low Chernoff information). The best projection here is onto the short axis.

Problem of PCA
• In fact, this is a typical issue when comparing generative models and discriminative models: generative models aim at representing the data faithfully, while discriminative models target telling objects apart.
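To make the eigenvector solution concrete, here is a minimal PCA sketch in Python (the function name, variable names and toy data are illustrative only).

  import numpy as np

  def pca(X, d_prime):
      """Project n x d data onto the d' leading eigenvectors of the scatter matrix."""
      m = X.mean(axis=0)                           # sample mean
      Xc = X - m
      S = Xc.T @ Xc                                # scatter matrix S = sum_k (x_k - m)(x_k - m)^t
      eigvals, eigvecs = np.linalg.eigh(S)         # S is real symmetric: orthogonal eigenvectors
      order = np.argsort(eigvals)[::-1][:d_prime]  # keep the d' largest eigenvalues
      E = eigvecs[:, order]                        # d x d' basis e_1, ..., e_d'
      A = Xc @ E                                   # principal components a_ik = e_i^t (x_k - m)
      X_hat = m + A @ E.T                          # reconstruction x_k ~ m + sum_i a_ik e_i
      return A, E, X_hat

  # Toy usage: 3-D data that is nearly planar, reduced to 2-D
  rng = np.random.default_rng(1)
  X = rng.multivariate_normal([0, 0, 0], [[3, 1, 0], [1, 2, 0], [0, 0, 0.1]], size=200)
  A, E, X_hat = pca(X, d_prime=2)
  print("mean squared reconstruction error:", np.mean(np.sum((X - X_hat) ** 2, axis=1)))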
Fisher Linear Discriminant
• Consider a two-class problem.
• Given n data samples {x1, x2, ..., xn} in d dimensions: n1 samples in subset (class) D1 labeled ω1, and n2 samples in subset (class) D2 labeled ω2.
• The linear combination y = w^t x projects a d-dimensional feature vector onto a 1-D scale, with ||w|| = 1.
• The resulting n samples y1, ..., yn are divided into the subsets Y1 and Y2.
• We adjust the direction of w to maximize class separability.

Fisher Linear Discriminant
• The sample means of the original and the projected samples are

  m_i = \frac{1}{n_i} \sum_{x \in D_i} x, \qquad \tilde{m}_i = \frac{1}{n_i} \sum_{y \in Y_i} y = \frac{1}{n_i} \sum_{x \in D_i} w^t x = w^t m_i,

  so \tilde{m}_1 - \tilde{m}_2 = w^t (m_1 - m_2).

• We define the scatter of the projected samples labeled ωi as

  \tilde{s}_i^2 = \sum_{y \in Y_i} (y - \tilde{m}_i)^2.

Fisher Linear Discriminant
• The Fisher linear discriminant employs the linear function w^t x for which the criterion function

  J(w) = \frac{ | \tilde{m}_1 - \tilde{m}_2 |^2 }{ \tilde{s}_1^2 + \tilde{s}_2^2 }

  is maximum.
  – | \tilde{m}_1 - \tilde{m}_2 | is the distance between the projected means (between-class scatter).
  – \tilde{s}_1^2 + \tilde{s}_2^2 is called the total within-class scatter.

Fisher Linear Discriminant
• We define the scatter matrices

  S_i = \sum_{x \in D_i} (x - m_i)(x - m_i)^t,

  so that

  \tilde{s}_i^2 = \sum_{x \in D_i} (w^t x - w^t m_i)^2 = \sum_{x \in D_i} w^t (x - m_i)(x - m_i)^t w = w^t S_i w

  and

  \tilde{s}_1^2 + \tilde{s}_2^2 = w^t S_W w, \qquad S_W = S_1 + S_2 \quad \text{(within-class scatter matrix)}.

Fisher Linear Discriminant
• The between-class scatter is

  ( \tilde{m}_1 - \tilde{m}_2 )^2 = ( w^t m_1 - w^t m_2 )^2 = w^t (m_1 - m_2)(m_1 - m_2)^t w = w^t S_B w,

  where S_B = (m_1 - m_2)(m_1 - m_2)^t is the between-class scatter matrix.
• The criterion function becomes

  J(w) = \frac{ w^t S_B w }{ w^t S_W w } \quad \text{(a generalized Rayleigh quotient)}.

Fisher Linear Discriminant
• The vector w that maximizes J(·) must satisfy

  S_B w = \lambda S_W w

  for some constant λ. If S_W is not singular, this becomes

  S_W^{-1} S_B w = \lambda w.

• This is an eigenvalue problem, but we do not need to solve for λ and w explicitly: since S_B w is always in the direction of (m_1 - m_2), the solution is

  w = S_W^{-1} (m_1 - m_2).

Fisher Linear Discriminant
• The remaining step is to separate the two classes in one dimension.
• The complexity of finding w is dominated by calculating S_W^{-1}, which is an O(d^2 n) calculation.

Fisher Linear Discriminant
• If the class-conditional densities p(x|ωi) are multivariate normal with equal covariance matrices Σ, the optimal decision boundary satisfies (CPE646-3, p.22)

  w^t x + w_0 = 0,

  where w_0 is a constant involving w and the prior probabilities, and w = \Sigma^{-1} (\mu_1 - \mu_2).
• This w is equivalent to w = S_W^{-1} (m_1 - m_2).
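A minimal Python sketch of this two-class solution, assuming S_W is nonsingular (the function name and the toy data are illustrative only):

  import numpy as np

  def fisher_direction(X1, X2):
      """w = S_W^{-1}(m1 - m2): the direction maximizing J(w) = |m1~ - m2~|^2 / (s1^2 + s2^2)."""
      m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
      S1 = (X1 - m1).T @ (X1 - m1)                 # class scatter matrices
      S2 = (X2 - m2).T @ (X2 - m2)
      Sw = S1 + S2                                 # within-class scatter, assumed nonsingular
      w = np.linalg.solve(Sw, m1 - m2)             # forming S_W costs O(d^2 n); the solve is O(d^3)
      return w / np.linalg.norm(w)                 # the scale of w does not affect J(w)

  # Toy usage: project both classes onto the 1-D line y = w^t x
  rng = np.random.default_rng(2)
  cov = [[2.0, 1.0], [1.0, 2.0]]
  X1 = rng.multivariate_normal([0.0, 0.0], cov, size=100)
  X2 = rng.multivariate_normal([3.0, 2.0], cov, size=100)
  w = fisher_direction(X1, X2)
  y1, y2 = X1 @ w, X2 @ w
  print("projected class means:", y1.mean(), y2.mean())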
Multiple Discriminant Analysis
• For the c-class problem, the generalization of the Fisher linear discriminant projects the d-dimensional feature vectors onto a (c-1)-dimensional space, assuming d ≥ c.
• The generalization of the within-class scatter matrix is

  S_W = \sum_{i=1}^{c} S_i, \qquad S_i = \sum_{x \in D_i} (x - m_i)(x - m_i)^t, \qquad m_i = \frac{1}{n_i} \sum_{x \in D_i} x.

Multiple Discriminant Analysis
• The generalization of the between-class scatter matrix:
  – Define the total mean vector

    m = \frac{1}{n} \sum_{x} x = \frac{1}{n} \sum_{i=1}^{c} n_i m_i.

  – Define the total scatter matrix

    S_T = \sum_{x} (x - m)(x - m)^t = \sum_{i=1}^{c} \sum_{x \in D_i} (x - m_i + m_i - m)(x - m_i + m_i - m)^t = S_W + \sum_{i=1}^{c} n_i (m_i - m)(m_i - m)^t.

  – Define a general between-class scatter matrix

    S_B = \sum_{i=1}^{c} n_i (m_i - m)(m_i - m)^t, \qquad \text{so that} \quad S_T = S_W + S_B.

Multiple Discriminant Analysis
• The projection from d dimensions to (c-1) dimensions can be expressed as

  y_i = w_i^t x, \quad i = 1, \ldots, c-1, \qquad \text{or} \qquad y = W^t x,

  where the w_i are the columns of the d × (c-1) matrix W.
• The criterion function becomes

  J(W) = \frac{ | W^t S_B W | }{ | W^t S_W W | }.

• The solution that maximizes this function satisfies

  S_B w_i = \lambda_i S_W w_i.

Multiple Discriminant Analysis
• To solve for the w_i we can:
  – Compute the λi from |S_B - \lambda_i S_W| = 0.
  – Then compute the w_i from (S_B - \lambda_i S_W) w_i = 0.
• The solution for W is not unique; all solutions give the same value of J(W).
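A minimal Python sketch of the c-class case, assuming S_W is nonsingular so the generalized eigenproblem S_B w_i = λ_i S_W w_i can be solved directly (SciPy's symmetric generalized eigensolver is used; the function name and the toy data are illustrative only):

  import numpy as np
  from scipy.linalg import eigh

  def mda(X, labels):
      """Columns of W solve S_B w_i = lambda_i S_W w_i for the (c-1) largest lambda_i."""
      classes = np.unique(labels)
      c, d = len(classes), X.shape[1]
      m = X.mean(axis=0)                              # total mean vector
      Sw = np.zeros((d, d))                           # within-class scatter
      Sb = np.zeros((d, d))                           # between-class scatter
      for k in classes:
          Xi = X[labels == k]
          mi = Xi.mean(axis=0)
          Sw += (Xi - mi).T @ (Xi - mi)
          Sb += len(Xi) * np.outer(mi - m, mi - m)
      # generalized symmetric eigenproblem S_B w = lambda S_W w (S_W assumed nonsingular)
      eigvals, eigvecs = eigh(Sb, Sw)
      order = np.argsort(eigvals)[::-1][: c - 1]      # rank of S_B is at most c-1
      return eigvecs[:, order]                        # d x (c-1) projection matrix W

  # Toy usage: three classes in 4-D projected down to 2-D via y = W^t x
  rng = np.random.default_rng(3)
  X = np.vstack([rng.normal(mu, 1.0, size=(60, 4))
                 for mu in ([0, 0, 0, 0], [3, 0, 0, 0], [0, 3, 0, 0])])
  labels = np.repeat([0, 1, 2], 60)
  W = mda(X, labels)
  Y = X @ W
  print(Y.shape)                                      # (180, 2)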