CS 340 Lec. 18: Multivariate Gaussian Distributions and Linear Discriminant Analysis
AD, March 2011

Multivariate Gaussian

Consider data $\{x^i\}_{i=1}^N$ where $x^i \in \mathbb{R}^D$ and we assume they are independent and identically distributed. A standard pdf used to model multivariate real data is the multivariate Gaussian or normal
$$p(x \mid \mu, \Sigma) = \mathcal{N}(x; \mu, \Sigma) = \frac{1}{(2\pi)^{D/2}\,|\Sigma|^{1/2}} \exp\Big(-\tfrac{1}{2}\underbrace{(x-\mu)^T \Sigma^{-1} (x-\mu)}_{\text{Mahalanobis distance}}\Big).$$
It can be shown that $\mu$ is the mean and $\Sigma$ is the covariance of $\mathcal{N}(x; \mu, \Sigma)$, i.e.
$$\mathbb{E}(X) = \mu \quad \text{and} \quad \operatorname{cov}(X) = \Sigma.$$
It will be used extensively in our discussion of unsupervised learning and can also be used for generative classifiers (i.e. discriminant analysis).

Special Cases

When $D = 1$, we are back to
$$p(x \mid \mu, \sigma^2) = \mathcal{N}(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big(-\frac{1}{2\sigma^2}(x-\mu)^2\Big).$$
When $D = 2$ and writing
$$\Sigma = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix} \quad \text{where } \rho = \operatorname{corr}(X_1, X_2) \in [-1, 1],$$
we have
$$p(x \mid \mu, \Sigma) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left(-\frac{1}{2(1-\rho^2)}\left[\frac{(x_1-\mu_1)^2}{\sigma_1^2} + \frac{(x_2-\mu_2)^2}{\sigma_2^2} - \frac{2\rho(x_1-\mu_1)(x_2-\mu_2)}{\sigma_1\sigma_2}\right]\right).$$

Graphical Illustrations

(Figure) Illustration of 2D Gaussian pdfs for different covariance matrices: (left) full, (middle) diagonal, (right) spherical.

Why are the contours of a multivariate Gaussian elliptical?

If we plot the values of $x$ such that $p(x \mid \mu, \Sigma)$ is equal to a constant, i.e. such that
$$(x-\mu)^T \Sigma^{-1} (x-\mu) = c > 0$$
where $c$ is given, then we obtain an ellipse.

$\Sigma$ is a positive definite matrix, so we have $\Sigma = U \Lambda U^T$ where $U$ is an orthonormal matrix of eigenvectors, i.e. $U^T U = I$, and $\Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_D)$ with $\lambda_k > 0$ is the diagonal matrix of eigenvalues. Hence we have
$$\Sigma^{-1} = (U^T)^{-1} \Lambda^{-1} U^{-1} = U \Lambda^{-1} U^T = \sum_{k=1}^D \frac{1}{\lambda_k} u_k u_k^T,$$
so
$$(x-\mu)^T \Sigma^{-1} (x-\mu) = \sum_{k=1}^D \frac{y_k^2}{\lambda_k} \quad \text{where } y_k = u_k^T (x-\mu).$$

Graphical Illustrations

(Figure) Level sets of the same 2D Gaussian pdfs, i.e. the points $x$ where $p(x) = c$ for different values of $c$: (left) a full covariance matrix has elliptical contours, (middle) a diagonal covariance matrix gives an axis-aligned ellipse, (right) a spherical covariance matrix gives circular contours.

Properties of Multivariate Gaussians

Marginalization is straightforward. Conditioning is easy; e.g. if $X = (X_1, X_2)$ with $p(x) = p(x_1, x_2) = \mathcal{N}(x; \mu, \Sigma)$ where
$$\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix},$$
then
$$p(x_1 \mid x_2) = \frac{p(x_1, x_2)}{p(x_2)} = \mathcal{N}\big(x_1; \mu_{1|2}, \Sigma_{1|2}\big)$$
with
$$\mu_{1|2} = \mu_1 + \Sigma_{12} \Sigma_{22}^{-1} (x_2 - \mu_2), \qquad \Sigma_{1|2} = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}.$$
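As a quick illustration of the conditioning formulas above, here is a minimal NumPy/SciPy sketch (not part of the original slides; all numbers are made-up examples) that computes $\mu_{1|2}$ and $\Sigma_{1|2}$ for a bivariate Gaussian and checks the result against the ratio $p(x_1, x_2)/p(x_2)$.

```python
# Sketch (assumed parameters): condition a bivariate Gaussian on x2 and verify
# p(x1 | x2) = p(x1, x2) / p(x2) using scipy's multivariate_normal density.
import numpy as np
from scipy.stats import multivariate_normal

# Joint parameters, partitioned as in the slide: x = (x1, x2), both scalar here.
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])

mu1, mu2 = mu[:1], mu[1:]
S11, S12 = Sigma[:1, :1], Sigma[:1, 1:]
S21, S22 = Sigma[1:, :1], Sigma[1:, 1:]

x2 = np.array([0.5])                                  # observed value of x2
# mu_{1|2} = mu1 + Sigma12 Sigma22^{-1} (x2 - mu2)
mu_cond = mu1 + S12 @ np.linalg.solve(S22, x2 - mu2)
# Sigma_{1|2} = Sigma11 - Sigma12 Sigma22^{-1} Sigma21
Sigma_cond = S11 - S12 @ np.linalg.solve(S22, S21)

# Sanity check: p(x1, x2) / p(x2) should equal N(x1; mu_{1|2}, Sigma_{1|2}).
x1 = np.array([0.0])
joint = multivariate_normal(mu, Sigma).pdf(np.concatenate([x1, x2]))
marg2 = multivariate_normal(mu2, S22).pdf(x2)
cond = multivariate_normal(mu_cond, Sigma_cond).pdf(x1)
print(np.isclose(joint / marg2, cond))                # True
```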
Independence and Correlation for Gaussian Variables

It is well known that independence implies uncorrelatedness; i.e. if the components $(X_1, \ldots, X_D)$ of a vector $X$ are independent, then they are uncorrelated. However, uncorrelatedness does not imply independence in general.

If the components $(X_1, \ldots, X_D)$ of a vector $X$ distributed according to a multivariate Gaussian are uncorrelated, then they are independent.

Proof. If $(X_1, \ldots, X_D)$ are uncorrelated then $\Sigma = \operatorname{diag}(\sigma_1^2, \ldots, \sigma_D^2)$ and $|\Sigma| = \prod_{k=1}^D \sigma_k^2$, so
$$p(x \mid \mu, \Sigma) = \prod_{k=1}^D p(x_k \mid \mu_k, \sigma_k^2) = \prod_{k=1}^D \mathcal{N}(x_k; \mu_k, \sigma_k^2).$$

ML Parameter Learning for Multivariate Gaussian

Consider data $\{x^i\}_{i=1}^N$ where $x^i \in \mathbb{R}^D$ and assume they are independent and identically distributed from $\mathcal{N}(x; \mu, \Sigma)$. The ML parameter estimates of $(\mu, \Sigma)$ maximize by definition
$$\sum_{i=1}^N \log \mathcal{N}(x^i; \mu, \Sigma) = -\frac{ND}{2}\log(2\pi) - \frac{N}{2}\log|\Sigma| - \frac{1}{2}\sum_{i=1}^N (x^i - \mu)^T \Sigma^{-1} (x^i - \mu).$$
We obtain after painful calculations the fairly intuitive results
$$\widehat{\mu} = \frac{\sum_{i=1}^N x^i}{N}, \qquad \widehat{\Sigma} = \frac{\sum_{i=1}^N (x^i - \widehat{\mu})(x^i - \widehat{\mu})^T}{N}.$$

Application to Supervised Learning using Bayes Classifier

Assume you are given some training data $\{x^i, y^i\}_{i=1}^N$ where $x^i \in \mathbb{R}^D$ and $y^i \in \{1, 2, \ldots, C\}$ can take $C$ different values. Given an input test point $x$, you want to predict/estimate the output $y$ associated with $x$. Previously we have followed a probabilistic approach
$$p(y = c \mid x) = \frac{p(y = c)\, p(x \mid y = c)}{\sum_{c'=1}^C p(y = c')\, p(x \mid y = c')}.$$
This requires modelling and learning the parameters of the class-conditional density of the features, $p(x \mid y = c)$.

Height/Weight Data

(Figure) (left) Height/weight data for the female (red) and male (blue) classes; (right) 2D Gaussians fit to each class, where 95% of the probability mass is inside the ellipse.

Supervised Learning using Bayes Classifier

Assume we pick $p(x \mid y = c) = \mathcal{N}(x; \mu_c, \Sigma_c)$ and $p(y = c) = \pi_c$; then
$$\begin{aligned} p(y = c \mid x) &\propto \pi_c\, |\Sigma_c|^{-1/2} \exp\big(-\tfrac{1}{2}(x - \mu_c)^T \Sigma_c^{-1} (x - \mu_c)\big) \\ &= \exp\big(\mu_c^T \Sigma_c^{-1} x - \tfrac{1}{2}\mu_c^T \Sigma_c^{-1} \mu_c - \tfrac{1}{2}\log|\Sigma_c| + \log \pi_c\big)\, \exp\big(-\tfrac{1}{2} x^T \Sigma_c^{-1} x\big). \end{aligned}$$
For models where $\Sigma_c = \Sigma$ for all classes, the quadratic term and the $\log|\Sigma|$ term are the same for every class and cancel in the normalization; this is known as linear discriminant analysis:
$$p(y = c \mid x) = \frac{\exp\big(\beta_c^T x + \gamma_c\big)}{\sum_{c'=1}^C \exp\big(\beta_{c'}^T x + \gamma_{c'}\big)}$$
where $\beta_c = \Sigma^{-1}\mu_c$, $\gamma_c = -\tfrac{1}{2}\mu_c^T\Sigma^{-1}\mu_c + \log\pi_c$, and the model is very similar to logistic regression.

Decision Boundaries

(Figure) Decision boundaries in 2D for the 2- and 3-class case: (a) linear boundary, (b) all linear boundaries, (c) parabolic boundary, (d) some linear, some quadratic.

Binary Classification

Consider the case where $C = 2$; then one can check that
$$p(y = 1 \mid x) = g\big((\beta_1 - \beta_0)^T x + \gamma_1 - \gamma_0\big)$$
where $g(a) = 1/(1 + e^{-a})$ is the sigmoid (logistic) function. We have
$$\gamma_1 - \gamma_0 = -\tfrac{1}{2}(\mu_1 - \mu_0)^T \Sigma^{-1} (\mu_1 + \mu_0) + \log(\pi_1/\pi_0),$$
and defining
$$x_0 := \tfrac{1}{2}(\mu_1 + \mu_0) - \frac{(\mu_1 - \mu_0)\,\log(\pi_1/\pi_0)}{(\mu_1 - \mu_0)^T \Sigma^{-1} (\mu_1 - \mu_0)}, \qquad w := \beta_1 - \beta_0 = \Sigma^{-1}(\mu_1 - \mu_0),$$
we obtain
$$p(y = 1 \mid x) = g\big(w^T (x - x_0)\big):$$
$x$ is shifted by $x_0$ and then projected onto the line $w$.
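To make the LDA recipe above concrete, here is a small NumPy sketch (not from the slides; the synthetic data and all numbers are invented for illustration). It computes the ML class means, a pooled covariance estimate, the quantities $\beta_c$ and $\gamma_c$, and the softmax posterior, then checks that in the binary case this agrees with $g\big((\beta_1 - \beta_0)^T x + \gamma_1 - \gamma_0\big)$.

```python
# Sketch (assumed synthetic data): LDA with a shared covariance matrix.
# beta_c = Sigma^{-1} mu_c, gamma_c = -1/2 mu_c^T Sigma^{-1} mu_c + log pi_c.
import numpy as np

rng = np.random.default_rng(0)
# Synthetic training data: two Gaussian classes sharing the same covariance.
N0, N1 = 60, 40
X = np.vstack([rng.multivariate_normal([0, 0], [[1, .3], [.3, 1]], N0),
               rng.multivariate_normal([2, 1], [[1, .3], [.3, 1]], N1)])
y = np.array([0] * N0 + [1] * N1)

classes = np.unique(y)
pis = np.array([np.mean(y == c) for c in classes])          # class priors pi_c
mus = np.array([X[y == c].mean(axis=0) for c in classes])   # ML class means mu_c
# Pooled ML covariance: average the centred outer products over all points.
Xc = X - mus[y]
Sigma = Xc.T @ Xc / len(X)

Sigma_inv = np.linalg.inv(Sigma)
betas = mus @ Sigma_inv                                      # row c is beta_c^T
gammas = -0.5 * np.einsum('cd,de,ce->c', mus, Sigma_inv, mus) + np.log(pis)

def posterior(x):
    """p(y = c | x) via softmax of beta_c^T x + gamma_c."""
    a = betas @ x + gammas
    a -= a.max()                          # subtract max for numerical stability
    e = np.exp(a)
    return e / e.sum()

x_test = np.array([1.0, 0.5])
p = posterior(x_test)
# Binary check: p(y=1|x) = g((beta_1 - beta_0)^T x + gamma_1 - gamma_0).
w, b = betas[1] - betas[0], gammas[1] - gammas[0]
print(p[1], 1.0 / (1.0 + np.exp(-(w @ x_test + b))))         # the two values agree
```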
Binary Classification

(Figure) Geometry of LDA in the 2-class case where $\Sigma_1 = \Sigma_2 = I$. Example where $\Sigma = \sigma^2 I$, so $w$ is in the direction of $\mu_1 - \mu_0$ (a small numerical check appears at the end of these notes).

Generative or Discriminative Classifiers?

When the model for the class-conditional densities is correct (resp. incorrect), generative classifiers will typically outperform (resp. underperform) discriminative classifiers for large enough datasets.

Generative classifiers can be difficult to learn, whereas discriminative classifiers try to learn directly the posterior probability of interest.

Generative classifiers can handle missing data easily; discriminative methods cannot.

Discriminative classifiers can be more flexible; e.g. substitute $\Phi(x)$ for $x$.

Application to Bankruptcy Data

(Figure) Discriminant analysis on the bankruptcy data set using Gaussian class-conditional densities (black = training data, blue = correct, red = wrong, nerr = 3). Points belonging to class 1 (bankrupt) are shown as triangles, the others (solvent) as circles; estimated labels are based on the posterior probability of belonging to each class.
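Finally, here is the numerical check promised above: a minimal NumPy sketch (hypothetical parameters, not from the slides) of the shift-then-project view of binary LDA. With $\Sigma = \sigma^2 I$ the direction of $w = \Sigma^{-1}(\mu_1 - \mu_0)$ is simply $\mu_1 - \mu_0$, and $g\big(w^T(x - x_0)\big)$ matches the posterior computed directly from Bayes' rule.

```python
# Sketch (assumed parameters): verify p(y=1|x) = g(w^T (x - x0)) for binary LDA
# with a spherical covariance Sigma = sigma^2 I.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

sigma2 = 0.5
Sigma = sigma2 * np.eye(2)                    # spherical covariance
mu0, mu1 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
pi0, pi1 = 0.7, 0.3

Sigma_inv = np.linalg.inv(Sigma)
w = Sigma_inv @ (mu1 - mu0)                   # proportional to mu1 - mu0 here
x0 = 0.5 * (mu1 + mu0) - (mu1 - mu0) * (
    np.log(pi1 / pi0) / ((mu1 - mu0) @ Sigma_inv @ (mu1 - mu0)))

def log_joint(x, mu, pi):
    # log pi_c - 1/2 (x - mu_c)^T Sigma^{-1} (x - mu_c);
    # the normalising constant is shared by both classes and cancels below.
    d = x - mu
    return np.log(pi) - 0.5 * d @ Sigma_inv @ d

x = np.array([1.0, -0.5])
a0, a1 = log_joint(x, mu0, pi0), log_joint(x, mu1, pi1)
p1_direct = np.exp(a1) / (np.exp(a0) + np.exp(a1))
print(np.isclose(sigmoid(w @ (x - x0)), p1_direct))   # True
```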