METU EE 583 Lecture Notes by A.Aydin ALATAN © 2014
EE 583 PATTERN RECOGNITION
Statistical Pattern Recognition
• Bayes Decision Theory
• Supervised Learning
• Linear Discriminant Functions
• Unsupervised Learning

Supervised Learning
Supervised Learning == Training
• Parametric approaches
  - Maximum likelihood estimation
  - Bayesian parameter estimation
• Non-parametric approaches
  - Direct pdf (multi-D histogram) estimation
  - Parzen window pdf estimation
  - kn-nearest neighbor pdf estimation
  - Nearest-neighbor rule

Parametric Approaches
"Curse of dimensionality": we need a lot of training data to determine completely unknown statistics in a multi-dimensional problem.
A rule of thumb: "use at least 10 times as many training samples per class as the number of features (i.e. D)".
Hence, with some a priori information, it is possible to estimate the parameters of a known distribution form from fewer samples.

Maximum Likelihood Estimation (1/4)
Assume c sets of samples, drawn according to $p(x\,|\,\omega_j)$, which has a known parametric form; e.g. the pdf is known to be Gaussian, but its mean and variance are unknown.
Let $\Theta_j$ be the unknown deterministic parameter set of the pdf for class j; writing $p(x\,|\,\omega_j) = p(x\,|\,\omega_j, \Theta_j)$ shows this dependence.
Aim: use the information provided by the observed samples to estimate the unknown parameters.
Note that the sample sets of the classes have independent pdf's, so there are c separate problems.

Maximum Likelihood Estimation (2/4)
For an arbitrary class, let an observed sample set X contain n samples, X = {x_1, ..., x_n}, drawn independently from the density $p(x\,|\,\Theta)$.
The likelihood of the observed sample set X is
$p(X\,|\,\Theta) = \prod_{k=1}^{n} p(x_k\,|\,\Theta)$.
Find the value of the parameter that maximizes $p(X\,|\,\Theta)$: differentiate this conditional probability with respect to the parameter and equate the result to zero.

Maximum Likelihood Estimation (3/4)
Find the value of the unknown parameter that maximizes $p(X\,|\,\Theta)$. For different values of Θ, the observed samples give different values of $p(X\,|\,\Theta)$, since the densities $p(x_k\,|\,\Theta)$ change. The argument of the maximum of this product is the ML estimate; $\log p(X\,|\,\Theta)$ has the same argument of the maximum.
(The slide illustrates this with plots of $p(X\,|\,\Theta)$ and $\log p(X\,|\,\Theta)$ versus Θ, and of the candidate densities $p(x\,|\,\Theta_1)$ and $p(x\,|\,\Theta_2)$.)

Maximum Likelihood Estimation (4/4)
It is better to work with the logarithm for analytical purposes:
$l(\Theta) = \log p(X\,|\,\Theta) = \sum_{k=1}^{n} \log p(x_k\,|\,\Theta)$.
Note: taking the logarithm does not affect finding the maximum.
Differentiate $l(\Theta)$ and equate it to zero:
$\nabla_{\Theta}\, l(\Theta) = \sum_{k=1}^{n} \nabla_{\Theta} \log p(x_k\,|\,\Theta) = 0$.
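As a quick illustration (a minimal numpy/scipy sketch, not part of the original notes; the true mean, known variance, and search grid are arbitrary choices), the log-likelihood $l(\Theta)$ can also be maximized numerically over a grid of candidate values; the result should agree with the closed-form estimate derived on the next slides.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=200)   # samples from p(x|Theta); true mean 2.0, variance known

# l(Theta) = sum_k log p(x_k | Theta); here Theta is the mean, variance assumed known (= 1)
candidates = np.linspace(0.0, 4.0, 401)
log_lik = np.array([norm.logpdf(x, loc=m, scale=1.0).sum() for m in candidates])

theta_ml = candidates[np.argmax(log_lik)]       # argument maximizing l(Theta) on the grid
print(theta_ml, x.mean())                       # grid ML estimate vs. the sample mean
```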
ML Estimate of the Univariate Normal:
Assume both the mean $\theta_1$ and the variance $\theta_2$ are unknown for a Gaussian pdf:
$\log p(x_k\,|\,\Theta) = -\tfrac{1}{2}\log(2\pi\theta_2) - \tfrac{1}{2\theta_2}(x_k-\theta_1)^2$.
Differentiate with respect to $\theta_1$ and $\theta_2$:
$\nabla_\Theta \log p(x_k\,|\,\Theta) = \begin{pmatrix} \tfrac{1}{\theta_2}(x_k-\theta_1) \\[4pt] -\tfrac{1}{2\theta_2} + \tfrac{(x_k-\theta_1)^2}{2\theta_2^2} \end{pmatrix}$.
Setting the sums over the samples to zero gives the maximum likelihood estimates of the parameters:
$\sum_{k=1}^{n} \tfrac{1}{\theta_2}(x_k-\theta_1) = 0 \;\Rightarrow\; \hat{\theta}_1 = \tfrac{1}{n}\sum_{k=1}^{n} x_k$
$-\sum_{k=1}^{n}\tfrac{1}{\theta_2} + \sum_{k=1}^{n}\tfrac{(x_k-\theta_1)^2}{\theta_2^2} = 0 \;\Rightarrow\; \hat{\theta}_2 = \tfrac{1}{n}\sum_{k=1}^{n} (x_k-\hat{\theta}_1)^2$
These are the ML estimates of the mean and the variance.

ML Estimate of the Multivariate Normal:
Assume only the mean vector is unknown:
$\log p(x_k\,|\,\mu) = -\tfrac{1}{2}\log\{(2\pi)^d |\Sigma|\} - \tfrac{1}{2}(x_k-\mu)^t \Sigma^{-1}(x_k-\mu)$.
Differentiate:
$\nabla_\mu \log p(x_k\,|\,\mu) = \Sigma^{-1}(x_k-\mu)$.
Maximum likelihood estimate of the unknown mean vector:
$\sum_{k=1}^{n}\Sigma^{-1}(x_k-\mu) = 0 \;\Rightarrow\; \hat{\mu} = \tfrac{1}{n}\sum_{k=1}^{n} x_k$.
The MLE of the mean is the arithmetic average of the sample vectors.

Bayesian Parameter Estimation (1/3)
Can we incorporate a priori knowledge about the unknown parameters into the formulation? Remember that the Bayesian minimum-error-rate classifier maximizes $p(\omega_i\,|\,x)$.
Emphasizing the role of the observed sample set X:
$P(\omega_i\,|\,x, X) = \dfrac{p(x\,|\,\omega_i, X)\,P(\omega_i\,|\,X)}{\sum_{j=1}^{c} p(x\,|\,\omega_j, X)\,P(\omega_j\,|\,X)}$
Assume the a priori probabilities are known: $P(\omega_i\,|\,X) = P(\omega_i)$.
Assume the sample sets of the classes are independent: $p(x\,|\,\omega_i, X) = p(x\,|\,\omega_i, X_i)$, so there are c separate problems, each of the form $p(x\,|\,X)$.

Bayesian Parameter Estimation (2/3)
$P(\omega_i\,|\,x, X) = \dfrac{p(x\,|\,\omega_i, X_i)\,P(\omega_i)}{\sum_{j=1}^{c} p(x\,|\,\omega_j, X_j)\,P(\omega_j)}$
The main aim is to compute $p(x\,|\,X)$:
$p(x\,|\,X) = \int p(x, \Theta\,|\,X)\,d\Theta = \int p(x\,|\,\Theta)\,p(\Theta\,|\,X)\,d\Theta$,
where the form of $p(x\,|\,\Theta)$ is known and $p(\Theta\,|\,X)$ is yet to be determined.
The samples are drawn independently according to $p(x\,|\,\Theta)$, whose parametric form is known. The Bayesian approach assumes that the unknown parameter is a random variable with a known prior density $p(\Theta)$.

Bayesian Parameter Estimation (3/3)
If $p(\Theta\,|\,X)$ is sharply peaked at some value $\hat{\Theta}$, we obtain
$p(x\,|\,X) = \int p(x\,|\,\Theta)\,p(\Theta\,|\,X)\,d\Theta \approx p(x\,|\,\hat{\Theta})$.
If we are not sure about the value (i.e. there is no sharp peak), the result is an average over the possible values of Θ.
How do we determine $p(\Theta\,|\,X)$? For various densities, different analytical results exist.
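Before the analytical Gaussian case that follows, a minimal numpy/scipy sketch (not from the notes; the prior, the true parameter, and the grid are arbitrary assumptions) of computing $p(\Theta\,|\,X) \propto p(X\,|\,\Theta)\,p(\Theta)$ numerically on a grid and then evaluating $p(x\,|\,X)$ by the integral above:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x = rng.normal(loc=1.0, scale=1.0, size=20)        # samples from p(x|Theta); sigma assumed known (= 1)

theta = np.linspace(-3.0, 4.0, 701)                 # grid of candidate parameter values
d_theta = theta[1] - theta[0]
prior = norm.pdf(theta, loc=0.0, scale=2.0)         # known prior p(Theta)
lik = np.array([norm.pdf(x, loc=t, scale=1.0).prod() for t in theta])   # p(X|Theta) on the grid

post = lik * prior
post /= (post.sum() * d_theta)                      # normalized posterior p(Theta|X)

# p(x|X) = integral of p(x|Theta) p(Theta|X) dTheta, evaluated at a single test point
x_test = 0.5
p_x_given_X = (norm.pdf(x_test, loc=theta, scale=1.0) * post).sum() * d_theta
print(p_x_given_X)
```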
Bayesian Parameter Estimation: Univariate Normal Distribution (1/3)
A univariate normal distribution with unknown mean µ: $p(x\,|\,\mu) \sim N(\mu, \sigma^2)$.
A priori information about µ is expressed by the density $p(\mu) \sim N(\mu_0, \sigma_0^2)$.
Observing the sample set D, $p(\mu\,|\,D)$ becomes
$p(\mu\,|\,D) = \dfrac{p(D\,|\,\mu)\,p(\mu)}{\int p(D\,|\,\mu)\,p(\mu)\,d\mu} = \alpha \prod_{k=1}^{n} p(x_k\,|\,\mu)\,p(\mu)$
$p(\mu\,|\,D) = \alpha \prod_{k=1}^{n} \left[\tfrac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\tfrac{1}{2}\left(\tfrac{x_k-\mu}{\sigma}\right)^2}\right] \tfrac{1}{\sqrt{2\pi}\,\sigma_0}\, e^{-\tfrac{1}{2}\left(\tfrac{\mu-\mu_0}{\sigma_0}\right)^2}$

Bayesian Parameter Estimation: Univariate Normal Distribution (2/3)
$p(\mu\,|\,D) = \alpha' \exp\left\{-\tfrac{1}{2}\left[\sum_{k=1}^{n}\left(\tfrac{x_k-\mu}{\sigma}\right)^2 + \left(\tfrac{\mu-\mu_0}{\sigma_0}\right)^2\right]\right\} = \alpha'' \exp\left\{-\tfrac{1}{2}\left[\left(\tfrac{n}{\sigma^2}+\tfrac{1}{\sigma_0^2}\right)\mu^2 - 2\left(\tfrac{1}{\sigma^2}\sum_{k=1}^{n} x_k + \tfrac{\mu_0}{\sigma_0^2}\right)\mu\right]\right\}$
$\Rightarrow\; p(\mu\,|\,D) \sim N(\mu_n, \sigma_n^2)$, with sample mean $m_n = \tfrac{1}{n}\sum_{k=1}^{n} x_k$ and
$\mu_n = \dfrac{n\sigma_0^2}{n\sigma_0^2+\sigma^2}\, m_n + \dfrac{\sigma^2}{n\sigma_0^2+\sigma^2}\, \mu_0\,, \qquad \sigma_n^2 = \dfrac{\sigma_0^2\,\sigma^2}{n\sigma_0^2+\sigma^2}$.
With an increasing number of samples, $\sigma_n^2 \to 0$: as $n \to \infty$, $p(\mu\,|\,D)$ approaches a Dirac delta centered at the sample mean, i.e. an ever sharper peak (Bayesian learning).

Bayesian Parameter Estimation: Univariate Normal Distribution (3/3)
After determining $p(\mu\,|\,D)$, $p(x\,|\,D)$ is obtained by
$p(x\,|\,D) = \int p(x\,|\,\mu)\,p(\mu\,|\,D)\,d\mu = \int \tfrac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\tfrac{1}{2}\left(\tfrac{x-\mu}{\sigma}\right)^2}\; \tfrac{1}{\sqrt{2\pi}\,\sigma_n}\, e^{-\tfrac{1}{2}\left(\tfrac{\mu-\mu_n}{\sigma_n}\right)^2}\, d\mu$
$\Rightarrow\; p(x\,|\,D) = \tfrac{1}{2\pi\,\sigma\sigma_n}\, e^{-\tfrac{1}{2}\,\tfrac{(x-\mu_n)^2}{\sigma^2+\sigma_n^2}}\, f(\sigma,\sigma_n) \;\Rightarrow\; p(x\,|\,D) \sim N(\mu_n,\, \sigma^2+\sigma_n^2)$.
Compared to the initial knowledge $p(x\,|\,\mu)$, $p(x\,|\,D)$ has additional uncertainty due to the lack of exact knowledge of µ.

General Bayesian Learning
In summary:
• The form of the density, $p(x\,|\,\Theta)$, is assumed to be known, but the value of the parameter Θ is unknown.
• Our initial knowledge about the parameter Θ is assumed to be contained in a known a priori density $p(\Theta)$.
• The rest of our knowledge about Θ is contained in n samples drawn according to the unknown probability density $p(x\,|\,\Theta)$.

Comparison: ML vs. Bayesian
ML avoids many assumptions and is analytically easier to solve, although some of its estimates can be biased. Bayesian parameter estimation permits including a priori information about the unknown parameter, but the analytical derivations are cumbersome. For ordinary cases, both approaches give similar results when sufficient sample data are available.
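A minimal numpy sketch of the univariate-normal update derived above (the prior parameters and the data are illustrative assumptions): the posterior parameters $\mu_n$ and $\sigma_n^2$ follow directly from the formulas, $\sigma_n^2$ shrinks as n grows, and the predictive density $p(x\,|\,D)$ is $N(\mu_n, \sigma^2 + \sigma_n^2)$.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = 1.0                    # known data standard deviation
mu0, sigma0 = 0.0, 2.0         # prior p(mu) ~ N(mu0, sigma0^2)
x = rng.normal(loc=1.2, scale=sigma, size=50)

for n in (1, 5, 50):
    m_n = x[:n].mean()                                           # sample mean of the first n samples
    mu_n = (n * sigma0**2 * m_n + sigma**2 * mu0) / (n * sigma0**2 + sigma**2)
    var_n = (sigma0**2 * sigma**2) / (n * sigma0**2 + sigma**2)  # posterior variance, shrinks with n
    print(n, mu_n, var_n, sigma**2 + var_n)                      # last value: variance of p(x|D)
```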
Non-Parametric Approaches
Parametric approaches require
• knowing the form of the density, and
• finding the parameters of that density.
In many cases,
• the form is not known, or
• the form does not let you find a unique solution (multi-modal densities).

Non-Parametric Approaches
The solution is to use non-parametric approaches, which do not assume a form. There are 2 main directions:
• Estimating densities non-parametrically
  - Direct estimation of the density
  - Parzen window
  - k-NN estimation
• Nearest Neighbor Rules

Non-Parametric Approaches: Density Estimation (1/3)
The probability P of a vector x falling into a region R is
$P = \int_{R} p(x')\,dx'$.
Let n samples of x be drawn independently according to p(x). The probability that k of these n independent samples fall into R is binomial:
$P_k = \binom{n}{k} P^k (1-P)^{n-k}$, with $E[k] = nP$ and $\mathrm{var}(k) = nP(1-P)$.
Since the binomial distribution peaks very sharply around its expected value, the number of observed samples in R, $k_{obs}$, should be approximately equal to $E[k] = nP$.
Note that the probability P can be estimated via $P \approx k_{obs}/n$, but we need the density p(x).

Non-Parametric Approaches: Density Estimation (2/3)
Assume p(x) is almost constant within R: $\int_{R} p(x')\,dx' \approx p(x)\,V$, where V is the volume of R.
Hence, combining the previous relations, one obtains the expected result
$p(x) \approx \dfrac{k_{obs}/n}{V}$.
There are two approximations (≈) in the previous relations:
• If k (or n) goes to infinity, or V goes to zero, these approximations converge to the exact values.
• For finite n, fixing V and k independently of n causes problems: if V → 0, then typically no samples fall in R and p(x) ≈ 0 (useless).

Non-Parametric Approaches: Density Estimation (3/3)
Form a sequence of regions R_n, centered at x, for n samples, and define
$p_n(x) \equiv \dfrac{k_n/n}{V_n}$.
Three conditions under which $p_n(x)$ converges to $p(x)$ as $n \to \infty$:
(1) $\lim_{n\to\infty} V_n = 0$  (2) $\lim_{n\to\infty} k_n = \infty$  (3) $\lim_{n\to\infty} k_n/n = 0$.
Examples that achieve these conditions:
• Parzen: an initial volume $V_0$ is shrunk with n, e.g. $V_n = V_0/\sqrt{n}$.
• k-NN: $R_n$ is grown until it contains $k_n$ samples, e.g. $k_n = \sqrt{n}$.

Non-Parametric Approaches: Parzen Windows (1/2)
Assume the region $R_n$ is a d-dimensional hypercube whose edge has length $h_n$ (so $V_n = h_n^d$). The number of samples falling in $R_n$ can be obtained analytically by using the window function
$\Phi(u) = \begin{cases} 1 & |u_j| \le 1/2,\;\; j = 1,\dots,d \\ 0 & \text{otherwise} \end{cases}$
For a hypercube centered at x, the number of samples and the estimate for the density are obtained as
$k_n = \sum_{i=1}^{n} \Phi\!\left(\dfrac{x - x_i}{h_n}\right)$ and $p_n(x) = \dfrac{1}{n}\sum_{i=1}^{n} \dfrac{1}{V_n}\,\Phi\!\left(\dfrac{x - x_i}{h_n}\right)$.
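A minimal numpy sketch of the hypercube estimate above, in one dimension (the bandwidth schedule $h_n = h_1/\sqrt{n}$ and the data are illustrative assumptions):

```python
import numpy as np

def parzen_hypercube(x_grid, samples, h):
    """p_n(x) = (1/n) * sum_i (1/V) * Phi((x - x_i)/h), hypercube window; d = 1, so V = h."""
    n = samples.size
    u = (x_grid[:, None] - samples[None, :]) / h     # (x - x_i)/h for every grid point / sample pair
    phi = (np.abs(u) <= 0.5).astype(float)           # window: 1 inside the unit cube, 0 outside
    return phi.sum(axis=1) / (n * h)

rng = np.random.default_rng(3)
samples = rng.normal(size=400)
h_n = 1.0 / np.sqrt(samples.size)                    # e.g. h_n = h_1 / sqrt(n) with h_1 = 1
x_grid = np.linspace(-4, 4, 200)
p_hat = parzen_hypercube(x_grid, samples, h_n)       # density estimate on the grid
```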
Non-Parametric Approaches: Parzen Windows (2/2)
The window function can be generalized for better interpolation of the density: each sample contributes to the estimate according to its distance from x.
• If $h_n$ is very large, then $p_n(x)$ is a superposition of slowly changing functions, an "out-of-focus" estimate.
• If $h_n$ is very small, then the window function approaches a Dirac delta and the estimate is a sum of sharp pulses centered at the samples $x_1, \dots, x_n$.
With an unlimited number of samples, $p_n(x)$ converges to the unknown density for any value of $h_n$. With a limited number of samples, the best option is to seek an acceptable compromise.

Non-Parametric Approaches: Example, Parzen Windows (1/2)
$h_n = h_1/\sqrt{n}$, a normal density, and the window function $\Phi(u) = \dfrac{1}{\sqrt{2\pi}}\,e^{-u^2/2}$.
(The slide shows the resulting estimates for various n and $h_1$.)

Non-Parametric Approaches: Example, Parzen Windows (2/2)
The same experiment with a bi-modal density, again with $h_n = h_1/\sqrt{n}$ and the window function $\Phi(u) = \dfrac{1}{\sqrt{2\pi}}\,e^{-u^2/2}$.

Non-Parametric Approaches: kn-Nearest Neighbor
The Parzen window approach depends on the initial selection of the cell volume V. One remedy is to choose the cell volume as a function of the data, rather than as an arbitrary function of the number of samples. In order to estimate p(x) from n samples, center a cell around x and grow it until it captures the $k_n$ nearest samples ($k_n$ is a function of n). The resulting estimate is
$p_n(x) = \dfrac{k_n/n}{V_n}$.
Necessary conditions for convergence:
$\lim_{n\to\infty} k_n = \infty$ and $\lim_{n\to\infty} \dfrac{k_n}{n} = 0$ (e.g. $k_n = \sqrt{n}$).

Non-Parametric Approaches: Example, kn-Nearest Neighbor
(The slide shows kn-NN density estimates for various n.)

Non-Parametric Approaches: Parzen vs kn-Nearest Neighbor
Both methods do converge, but it is very difficult to make meaningful statements about their finite-sample behaviour.

Non-Parametric Approaches: Classification Rule
All 3 methods (direct, Parzen, kn-NN) can be used to obtain a posteriori probabilities from n-sample data. In a cell containing k samples in total, with $k_i$ samples from class i,
$p_n(x, \omega_i) = \dfrac{k_i/n}{V_n}$ and $P_n(\omega_i\,|\,x) = \dfrac{p_n(x,\omega_i)}{\sum_{j=1}^{c} p_n(x,\omega_j)} = \dfrac{k_i}{k}$.
Cell size selection can be achieved by using either the Parzen window or the kn-NN approach. Using an arbitrarily large number of samples, the unknown probabilities can be obtained with optimum performance.

Non-Parametric Approaches: Nearest Neighbor Rule (1/3)
All 3 methods (direct, Parzen, kn-NN) can be used to obtain a posteriori probabilities from n-sample data, so that the estimated density is utilized in the Bayes Decision Rule. A more radical approach is to use the nearest neighbor among the sample data directly to classify the unknown test data (the Nearest Neighbor Rule, NN-R). While the Bayes Rule (minimum error rate) is optimal when choosing between different classes, NN-R is suboptimal.

Non-Parametric Approaches: Nearest Neighbor Rule (2/3)
Assume that there is an unlimited number of labeled "prototypes" for each class. If the test point x is nearest to one of these prototypes, x', then $p(\omega_i\,|\,x) \approx p(\omega_i\,|\,x')$ for all i. Obviously, if x' is labeled with class m, i.e. $p(\omega_m\,|\,x') > p(\omega_j\,|\,x')$ for all $j \ne m$, one should expect $p(\omega_m\,|\,x) > p(\omega_j\,|\,x)$ for all $j \ne m$. For unlimited samples, the error rate of NN-R is less than twice the error rate of the Bayes decision rule.

Non-Parametric Approaches: Nearest Neighbor Rule (3/3)
The NN-rule partitions the feature space into cells consisting of all points closer to a given training point than to any other training point (a Voronoi tessellation).
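A minimal numpy sketch of the NN rule; the parameter k also covers the k-neighbor voting extension discussed on the next slide (the two-class Gaussian data here are placeholders):

```python
import numpy as np

def knn_classify(x, train_X, train_y, k=1):
    """Label x by majority vote among its k nearest training samples (k = 1 is the NN rule)."""
    dists = np.linalg.norm(train_X - x, axis=1)      # Euclidean distances to all prototypes
    nearest = np.argsort(dists)[:k]                  # indices of the k closest prototypes
    votes = np.bincount(train_y[nearest])            # count the labels among the neighbors
    return np.argmax(votes)

rng = np.random.default_rng(4)
train_X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
train_y = np.array([0] * 50 + [1] * 50)
print(knn_classify(np.array([2.5, 2.5]), train_X, train_y, k=1))   # NN rule
print(knn_classify(np.array([2.5, 2.5]), train_X, train_y, k=5))   # k-NN voting
```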
Non-Parametric Approaches: k-Nearest Neighbor Rule
A straightforward extension of the Nearest Neighbor rule is to use k neighbors instead of only one. The classification is achieved by a vote among the k neighbors (k is usually selected odd to avoid ties). Selecting k requires a compromise:
• If k is too high, some of these k neighbors may have different probabilities, for finite n.
• If k is too low, the estimation may not be reliable.
The optimal behavior is obtained as both k and n approach infinity. (The slide illustrates the rule with k = 5.)

Dimension Reduction
In supervised learning, excessive dimensionality of the features should be decreased. The main approaches are
• Principal Component Analysis (unsupervised)
• Fisher's Linear Discriminant (supervised; data with class information is required)

Principal Component Analysis
Assume there are n vectors in d dimensions: $\{x_1, \dots, x_n\}$. These vectors are represented by their projections onto a line, with unit direction e, passing through their sample mean m:
$x = m + a\,e$.
For a fixed line, the optimal coefficients $a_k$ minimize the distance between the points and the line:
$\min_{a_1,\dots,a_n} J(a_1,\dots,a_n, e) = \min_{a_1,\dots,a_n} \sum_{k=1}^{n} \left\| (m + a_k e) - x_k \right\|^2$
$J(\cdot) = \sum_{k=1}^{n} a_k^2 \|e\|^2 - 2\sum_{k=1}^{n} a_k\, e^t (x_k - m) + \sum_{k=1}^{n} \|x_k - m\|^2$
$\dfrac{\partial J(\cdot)}{\partial a_k} = 0 \;\Rightarrow\; a_k = e^t (x_k - m)$.

Principal Component Analysis
Assume the coefficients $a_k$ are obtained; the same cost function J(·) is now minimized with respect to the line direction e, where $a_k = e^t (x_k - m)$. Define the scatter matrix S (similar to the covariance) as
$S \equiv \sum_{k=1}^{n} (x_k - m)(x_k - m)^t$.
Then, with $\|e\| = 1$,
$J(\cdot) = \sum_{k=1}^{n} a_k^2 - 2\sum_{k=1}^{n} a_k\, e^t (x_k - m) + \sum_{k=1}^{n} \|x_k - m\|^2 = -\sum_{k=1}^{n} \left(e^t (x_k - m)\right)^2 + \sum_{k=1}^{n} \|x_k - m\|^2 = -\,e^t S e + \sum_{k=1}^{n} \|x_k - m\|^2$
$\Rightarrow\; \min_{e} J(\cdot) = \max_{e} e^t S e$.
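A minimal numpy sketch of this projection on placeholder data; it uses the result derived on the next slide, namely that the criterion $e^t S e$ is maximized by the eigenvector of S with the largest eigenvalue:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.multivariate_normal([0, 0, 0], np.diag([5.0, 1.0, 0.2]), size=300)  # n x d placeholder data

m = X.mean(axis=0)                       # sample mean
centered = X - m
S = centered.T @ centered                # scatter matrix  S = sum_k (x_k - m)(x_k - m)^t

eigvals, eigvecs = np.linalg.eigh(S)     # eigh: symmetric S, eigenvalues in ascending order
e = eigvecs[:, -1]                       # direction maximizing e^t S e (largest eigenvalue)

a = centered @ e                         # optimal coefficients a_k = e^t (x_k - m)
X_recon = m + np.outer(a, e)             # 1-D reconstruction  m + a_k e
```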
Principal Component Analysis
$\min_{e} J(\cdot) = \max_{e} e^t S e$. The maximum of $e^t S e$ must be obtained under the constraint $\|e\| = 1$. Using a Lagrange multiplier:
$u \equiv e^t S e + \lambda(1 - e^t e) \;\Rightarrow\; \dfrac{\partial u}{\partial e} = 0 \;\Rightarrow\; 2Se - 2\lambda e = 0$.
The solution is the vector e which is the eigenvector of S corresponding to its largest eigenvalue.
The result can be generalized to a d′-dimensional projection by minimizing
$J_{d'} = \sum_{k=1}^{n} \left\| \left(m + \sum_{i=1}^{d'} a_{ki}\, e_i\right) - x_k \right\|^2$, where $x = m + \sum_{i=1}^{d'} a_i e_i$,
such that the $e_i$'s are the eigenvectors of S corresponding to its d′ largest eigenvalues.

Principal Component Analysis
Remember the n vectors in d dimensions: $X = [x_1, \dots, x_n]$ (columns taken as the mean-subtracted samples). Note the difficulty in calculating with S when d ≫ n (S is d×d):
$S \equiv \sum_{k=1}^{n} (x_k - m)(x_k - m)^t = XX^t$.
Instead of solving $Se = \lambda e$, i.e. $XX^t e = \lambda e$, try solving
$X^t X f = \lambda f$.
Multiplying by X from the left,
$X X^t X f = \lambda X f \;\Leftrightarrow\; XX^t (Xf) = \lambda\,(Xf) \;\Leftrightarrow\; XX^t e = \lambda e$ with $e = Xf$.
Note that $XX^t$ is d×d, whereas $X^t X$ is only n×n.

Fisher's Linear Discriminant (1/8)
Fisher's approach aims to project the d-dimensional data onto a line (1-D), defined by a direction w. The projected data are expected to be well separated between the two classes after such a dimension reduction.

Fisher's Linear Discriminant (2/8)
Feature vector projections: $y_i = w^t x_i$, $i = 1, \dots, n$.
Measures for separation based on w:
• the difference between the projection means, and
• the variance of the within-class projected data.
Choose the projection w in order to maximize
$J(\cdot) = \dfrac{(\tilde{m}_1 - \tilde{m}_2)^2}{s_1^2 + s_2^2}$,
where $\tilde{m}_i$ is the projection mean for class i and $s_i^2 = \sum_{y \in Y_i} (y - \tilde{m}_i)^2$ is the scatter of the projected class-i data.

Fisher's Linear Discriminant (3/8)
Relation between the sample and projection means:
$m_i = \dfrac{1}{n_i} \sum_{x \in X_i} x \;\Rightarrow\; \tilde{m}_i = \dfrac{1}{n_i} \sum_{y \in Y_i} y = \dfrac{1}{n_i} \sum_{x \in X_i} w^t x = w^t m_i$.
Define the scatter matrices $S_i$ and the within-class scatter $S_W$:
$S_i = \sum_{x \in X_i} (x - m_i)(x - m_i)^T$ and $S_W = S_1 + S_2$.
Note that $s_i$ and $S_i$ are related as
$s_i^2 = \sum_{y \in Y_i} (y - \tilde{m}_i)^2 = \sum_{x \in X_i} (w^T x - w^T m_i)^2 = \sum_{x \in X_i} w^T (x - m_i)(x - m_i)^T w = w^T S_i w$.

Fisher's Linear Discriminant (4/8)
Similarly, the relation between $\tilde{m}_1$ and $\tilde{m}_2$ becomes
$(\tilde{m}_1 - \tilde{m}_2)^2 = (w^T m_1 - w^T m_2)^2 = w^T (m_1 - m_2)(m_1 - m_2)^T w = w^T S_B w$
(note that $S_B$ has rank 1). The initial criterion function $J(\cdot) = \dfrac{(\tilde{m}_1 - \tilde{m}_2)^2}{s_1^2 + s_2^2}$ can therefore be written as
$J(w) = \dfrac{w^T S_B w}{w^T S_W w}$.
A w vector that maximizes J must satisfy $S_B w = \lambda S_W w$ (see the distributed notes for its proof). If $S_W$ is non-singular, then
$S_W^{-1} S_B w = \lambda w$; since $S_B w$ is always in the direction of $m_1 - m_2$,
$w = S_W^{-1} (m_1 - m_2)$.
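A minimal numpy sketch of the two-class solution $w = S_W^{-1}(m_1 - m_2)$ on placeholder Gaussian data:

```python
import numpy as np

rng = np.random.default_rng(6)
X1 = rng.multivariate_normal([0, 0], [[2, 0.5], [0.5, 1]], size=100)   # class 1 samples
X2 = rng.multivariate_normal([3, 2], [[2, 0.5], [0.5, 1]], size=100)   # class 2 samples

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S1 = (X1 - m1).T @ (X1 - m1)             # class scatter matrices
S2 = (X2 - m2).T @ (X2 - m2)
Sw = S1 + S2                             # within-class scatter

w = np.linalg.solve(Sw, m1 - m2)         # w = Sw^{-1} (m1 - m2)
y1, y2 = X1 @ w, X2 @ w                  # 1-D projections; class means should be well separated
print(y1.mean(), y2.mean())
```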
Fisher's Linear Discriminant (5/8)
For a 2-class problem, the d-dimensional data is projected onto a line. As an extension to the c-class problem, it is possible to project the data onto (c-1) dimensions instead of a line. For (c-1) dimensions:
$y_i = w_i^T x$, $i = 1, \dots, c-1 \;\Rightarrow\; y = W^T x$.
Define new scatter matrices in the d-dimensional space:
$S_T = \sum_{x} (x - m)(x - m)^T$ (sum over the whole data set), $\quad S_W = \sum_{i=1}^{c} S_i$,
$S_T = \sum_{i=1}^{c} \sum_{x \in D_i} (x - m_i + m_i - m)(x - m_i + m_i - m)^T = S_W + S_B$,
where $S_B = \sum_{i=1}^{c} n_i (m_i - m)(m_i - m)^T$ (note that $S_B$ has rank at most c-1).

Fisher's Linear Discriminant (6/8)
Remember that the samples are projected by $y = W^T x$. The resulting projected mean vectors in (c-1) dimensions are
$\tilde{m}_i = \dfrac{1}{n_i} \sum_{y \in Y_i} y = \dfrac{1}{n_i} \sum_{x \in X_i} W^T x = W^T m_i$, $\qquad \tilde{m} = \dfrac{1}{n} \sum_{i=1}^{c} n_i \tilde{m}_i = W^T m$.
The scatter matrices in (c-1) dimensions can be defined as
$\tilde{S}_W = \sum_{i=1}^{c} \sum_{y \in Y_i} (y - \tilde{m}_i)(y - \tilde{m}_i)^T$, $\qquad \tilde{S}_B = \sum_{i=1}^{c} n_i (\tilde{m}_i - \tilde{m})(\tilde{m}_i - \tilde{m})^T$.

Fisher's Linear Discriminant (7/8)
The relation between the scatter matrices in the original and projected spaces is
$\tilde{S}_W = \sum_{i=1}^{c} \sum_{x \in X_i} (W^T x - W^T m_i)(W^T x - W^T m_i)^T = W^T S_W W$,
$\tilde{S}_B = \sum_{i=1}^{c} n_i\,(W^T m_i - W^T m)(W^T m_i - W^T m)^T = W^T S_B W$.

Fisher's Linear Discriminant (8/8)
The relation between the scatter matrices is thus $\tilde{S}_W = W^T S_W W$ and $\tilde{S}_B = W^T S_B W$. For better discrimination in the projected space we want
$\min |\tilde{S}_W|$ and $\max |\tilde{S}_B|$ ( |·| : determinant ),
$\Rightarrow\; J(W) = \dfrac{|\tilde{S}_B|}{|\tilde{S}_W|} = \dfrac{|W^T S_B W|}{|W^T S_W W|}$.
Note that the determinant is the product of the scatter along the principal directions.
Solution for J(W): the columns of the optimal W are the generalized eigenvectors that correspond to the (c-1) largest eigenvalues in
$S_B w_i = \lambda_i S_W w_i$.
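A minimal sketch of the c-class solution on placeholder 3-class data; scipy's generalized symmetric eigensolver is used for $S_B w = \lambda S_W w$, which is one standard way to solve it:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(7)
means = [np.array([0.0, 0.0, 0.0]), np.array([4.0, 0.0, 0.0]), np.array([0.0, 4.0, 0.0])]
X = np.vstack([rng.multivariate_normal(mu, np.eye(3), size=60) for mu in means])
y = np.repeat(np.arange(3), 60)
c, d = 3, X.shape[1]

m = X.mean(axis=0)
Sw = np.zeros((d, d))
Sb = np.zeros((d, d))
for i in range(c):
    Xi = X[y == i]
    mi = Xi.mean(axis=0)
    Sw += (Xi - mi).T @ (Xi - mi)                    # within-class scatter
    Sb += Xi.shape[0] * np.outer(mi - m, mi - m)     # between-class scatter (rank <= c-1)

# Generalized eigenproblem  Sb w = lambda Sw w ; eigh returns eigenvalues in ascending order
eigvals, eigvecs = eigh(Sb, Sw)
W = eigvecs[:, -(c - 1):]                            # columns: the (c-1) leading eigenvectors
Y = X @ W                                            # projection onto (c-1) dimensions
```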