CS 782 – Machine Learning
Lecture 4: Linear Models for Classification
- Probabilistic generative models
- Probabilistic discriminative models

Probabilistic Generative Models

We have shown that, for two classes with Gaussian class-conditional densities sharing a covariance matrix $\Sigma$,

$$p(C_1 \mid x) = \sigma(w^T x + w_0)$$

where $\sigma(a) = \dfrac{1}{1 + e^{-a}}$ is the logistic sigmoid and

$$w = \Sigma^{-1}(\mu_1 - \mu_2), \qquad w_0 = -\frac{1}{2}\mu_1^T \Sigma^{-1}\mu_1 + \frac{1}{2}\mu_2^T \Sigma^{-1}\mu_2 + \ln\frac{p(C_1)}{p(C_2)}$$

Decision boundary:

$$p(C_1 \mid x) = p(C_2 \mid x) = 0.5 \;\Longleftrightarrow\; \sigma(w^T x + w_0) = 0.5 \;\Longleftrightarrow\; w^T x + w_0 = 0$$

For $K > 2$ classes:

$$p(C_k \mid x) = \frac{p(x \mid C_k)\, p(C_k)}{\sum_j p(x \mid C_j)\, p(C_j)} = \frac{e^{a_k}}{\sum_j e^{a_j}}, \qquad a_k = \ln\big(p(x \mid C_k)\, p(C_k)\big)$$

We can show the following result (again for Gaussian class-conditional densities with a shared covariance matrix):

$$p(C_k \mid x) = \frac{e^{w_k^T x + w_{k0}}}{\sum_j e^{w_j^T x + w_{j0}}}, \qquad w_k = \Sigma^{-1}\mu_k, \qquad w_{k0} = -\frac{1}{2}\mu_k^T \Sigma^{-1}\mu_k + \ln p(C_k)$$

Maximum likelihood solution

We have a parametric functional form for the class-conditional densities:

$$p(x \mid C_k) = \frac{1}{(2\pi)^{D/2}\,|\Sigma|^{1/2}} \exp\!\Big(-\frac{1}{2}(x - \mu_k)^T \Sigma^{-1}(x - \mu_k)\Big)$$

We can estimate the parameters and the prior class probabilities using maximum likelihood. Consider the two-class case with a shared covariance matrix.

Training data: $\{x_n, t_n\}$, $n = 1, \dots, N$, where $t_n = 1$ denotes class $C_1$ and $t_n = 0$ denotes class $C_2$.
Priors: $p(C_1) = \pi$, $p(C_2) = 1 - \pi$.

For a data point $x_n$ from class $C_1$ we have $t_n = 1$ and therefore

$$p(x_n, C_1) = p(C_1)\, p(x_n \mid C_1) = \pi\, \mathcal{N}(x_n \mid \mu_1, \Sigma)$$

For a data point $x_n$ from class $C_2$ we have $t_n = 0$ and therefore

$$p(x_n, C_2) = p(C_2)\, p(x_n \mid C_2) = (1 - \pi)\, \mathcal{N}(x_n \mid \mu_2, \Sigma)$$

Assuming the observations are drawn independently, we can write the likelihood function as follows:

$$p(\mathbf{t} \mid \pi, \mu_1, \mu_2, \Sigma) = \prod_{n=1}^{N} p(t_n \mid \pi, \mu_1, \mu_2, \Sigma) = \prod_{n=1}^{N} \big[\pi\, \mathcal{N}(x_n \mid \mu_1, \Sigma)\big]^{t_n} \big[(1 - \pi)\, \mathcal{N}(x_n \mid \mu_2, \Sigma)\big]^{1 - t_n}$$

where $\mathbf{t} = (t_1, \dots, t_N)^T$.

We want to find the values of the parameters that maximize the likelihood function, i.e., fit the model that best describes the observed data. As usual, we consider the log of the likelihood:

$$\ln p(\mathbf{t} \mid \pi, \mu_1, \mu_2, \Sigma) = \sum_{n=1}^{N} \Big\{ t_n \ln \pi + t_n \ln \mathcal{N}(x_n \mid \mu_1, \Sigma) + (1 - t_n)\ln(1 - \pi) + (1 - t_n)\ln \mathcal{N}(x_n \mid \mu_2, \Sigma) \Big\}$$

We first maximize the log likelihood with respect to $\pi$. The terms that depend on $\pi$ are

$$\sum_{n=1}^{N} \big\{ t_n \ln \pi + (1 - t_n)\ln(1 - \pi) \big\}$$

Setting the derivative with respect to $\pi$ to zero:

$$\frac{1}{\pi}\sum_{n=1}^{N} t_n - \frac{1}{1 - \pi}\sum_{n=1}^{N}(1 - t_n) = \frac{N_1}{\pi} - \frac{N - N_1}{1 - \pi} = 0 \quad\Longrightarrow\quad \pi_{ML} = \frac{N_1}{N} = \frac{N_1}{N_1 + N_2}$$

where $N_1$ and $N_2$ are the numbers of training points in classes $C_1$ and $C_2$.

Thus, the maximum likelihood estimate of $\pi$ is the fraction of points in class $C_1$. The result generalizes to the multiclass case: the maximum likelihood estimate of $p(C_k)$ is the fraction of points in the training set that belong to $C_k$.
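To make the two-class result concrete, here is a minimal NumPy sketch (not part of the original slides) that evaluates the posterior $p(C_1 \mid x)$ from given generative-model parameters; the names `posterior_c1`, `mu1`, `mu2`, `Sigma`, and `prior1` are illustrative assumptions.

```python
import numpy as np

def posterior_c1(x, mu1, mu2, Sigma, prior1):
    """p(C1 | x) = sigma(w^T x + w0) for Gaussian class-conditional
    densities with shared covariance Sigma and prior p(C1) = prior1."""
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu2)
    w0 = (-0.5 * mu1 @ Sigma_inv @ mu1
          + 0.5 * mu2 @ Sigma_inv @ mu2
          + np.log(prior1 / (1.0 - prior1)))
    a = w @ x + w0
    return 1.0 / (1.0 + np.exp(-a))  # logistic sigmoid
```

A point lies on the decision boundary exactly when this function returns 0.5, i.e., when $w^T x + w_0 = 0$.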
Maximum likelihood solution (continued)

We now maximize the log likelihood with respect to $\mu_1$. The terms that depend on $\mu_1$ are

$$\sum_{n=1}^{N} t_n \ln \mathcal{N}(x_n \mid \mu_1, \Sigma) = -\frac{1}{2}\sum_{n=1}^{N} t_n (x_n - \mu_1)^T \Sigma^{-1}(x_n - \mu_1) + \text{const}$$

Setting the derivative with respect to $\mu_1$ to zero:

$$\sum_{n=1}^{N} t_n \Sigma^{-1}(x_n - \mu_1) = 0 \quad\Longrightarrow\quad \mu_1 = \frac{1}{N_1}\sum_{n=1}^{N} t_n x_n$$

Thus, the maximum likelihood estimate of $\mu_1$ is the sample mean of all the input vectors $x_n$ assigned to class $C_1$. Maximizing the log likelihood with respect to $\mu_2$ gives the analogous result:

$$\mu_2 = \frac{1}{N_2}\sum_{n=1}^{N}(1 - t_n)\, x_n$$

Maximizing the log likelihood with respect to $\Sigma$, we obtain the maximum likelihood estimate

$$\Sigma_{ML} = \frac{N_1}{N} S_1 + \frac{N_2}{N} S_2, \qquad S_1 = \frac{1}{N_1}\sum_{n \in C_1}(x_n - \mu_1)(x_n - \mu_1)^T, \qquad S_2 = \frac{1}{N_2}\sum_{n \in C_2}(x_n - \mu_2)(x_n - \mu_2)^T$$

Thus, the maximum likelihood estimate of the covariance is the weighted average of the sample covariance matrices associated with each of the classes. These results extend to K classes.

Probabilistic Discriminative Models

Two-class case:
$$p(C_1 \mid x) = \sigma(w^T x + w_0)$$
Multiclass case:
$$p(C_k \mid x) = \frac{e^{w_k^T x + w_{k0}}}{\sum_j e^{w_j^T x + w_{j0}}}$$

Discriminative approach: use the functional form of the generalized linear model for the posterior probabilities and determine its parameters directly using maximum likelihood.

Advantages:
- Fewer parameters to be determined.
- Improved predictive performance, especially when the class-conditional density assumptions give a poor approximation of the true distributions.

Two-class case:
$$p(C_1 \mid x) = \sigma(w^T x + w_0) = y(x), \qquad p(C_2 \mid x) = 1 - p(C_1 \mid x)$$
In the terminology of statistics, this model is known as logistic regression.

Assuming $x \in \mathbb{R}^M$, how many parameters do we need to estimate? $M + 1$.

How many parameters did we estimate to fit Gaussian class-conditional densities (generative approach)?
- prior $p(C_1) = \pi$: 1
- 2 mean vectors: $2M$
- shared covariance matrix: $\dfrac{M(M+1)}{2}$
- total: $1 + 2M + \dfrac{M(M+1)}{2}$, which grows as $O(M^2)$

Logistic Regression

$$p(C_1 \mid x) = y(x) = \sigma(w^T x + w_0)$$

We use maximum likelihood to determine the parameters of the logistic regression model.

Training data: $\{x_n, t_n\}$, $n = 1, \dots, N$, where $t_n = 1$ denotes class $C_1$ and $t_n = 0$ denotes class $C_2$. We want to find the values of $w$ that maximize the posterior probabilities associated with the observed data.

Likelihood function:

$$L(w) = \prod_{n=1}^{N} P(C_1 \mid x_n)^{t_n}\big(1 - P(C_1 \mid x_n)\big)^{1 - t_n} = \prod_{n=1}^{N} y(x_n)^{t_n}\big(1 - y(x_n)\big)^{1 - t_n}$$

We consider the negative logarithm of the likelihood:

$$E(w) = -\ln L(w) = -\sum_{n=1}^{N}\Big\{ t_n \ln y(x_n) + (1 - t_n)\ln\big(1 - y(x_n)\big)\Big\}, \qquad \arg\min_w E(w)$$

We compute the derivative of the error function with respect to $w$ (the gradient). We first need the derivative of the logistic sigmoid function:

$$\frac{d\sigma}{da} = \frac{d}{da}\,\frac{1}{1 + e^{-a}} = \frac{e^{-a}}{(1 + e^{-a})^2} = \frac{1}{1 + e^{-a}}\cdot\frac{e^{-a}}{1 + e^{-a}} = \sigma(a)\big(1 - \sigma(a)\big)$$

Writing $y_n = y(x_n)$:

$$\nabla E(w) = -\sum_{n=1}^{N}\Big\{\frac{t_n}{y_n}\,y_n(1 - y_n)\,x_n - \frac{1 - t_n}{1 - y_n}\,y_n(1 - y_n)\,x_n\Big\} = -\sum_{n=1}^{N}\big\{t_n(1 - y_n) - (1 - t_n)\,y_n\big\}\,x_n = \sum_{n=1}^{N}(y_n - t_n)\,x_n$$

The gradient of $E$ at $w$ gives the direction of the steepest increase of $E$ at $w$. We need to minimize $E$, so we update $w$ by moving in the direction opposite to the gradient:

$$w^{(\tau+1)} = w^{(\tau)} - \eta\,\nabla E(w)$$

This technique is called gradient descent. It can be shown that $E$ is a convex function of $w$; thus it has a unique minimum. An efficient iterative technique exists to find the optimal $w$ parameters (Newton-Raphson optimization).
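The following is a minimal NumPy sketch (not from the slides) of batch gradient descent for two-class logistic regression using the update rule above; the helper names `fit_logistic`, `eta`, and `n_iters`, and the convention of absorbing $w_0$ into $w$ through a constant feature, are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic(X, t, eta=0.1, n_iters=1000):
    """Batch gradient descent for two-class logistic regression.
    X: (N, M) inputs, t: (N,) targets in {0, 1}."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # absorb w0 into w
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iters):
        y = sigmoid(Xb @ w)          # y_n = sigma(w^T x_n + w0)
        grad = Xb.T @ (y - t)        # sum_n (y_n - t_n) x_n
        w = w - eta * grad           # move against the gradient
    return w
```

Each iteration processes the entire training set once; this is the batch behaviour contrasted with the on-line scheme discussed next.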
Batch vs. on-line learning

$$\nabla E(w) = \sum_{n=1}^{N}(y_n - t_n)\,x_n$$

The computation of the above gradient requires processing the entire training set (batch technique):

$$w^{(\tau+1)} = w^{(\tau)} - \eta\,\nabla E(w)$$

- If the data set is large, the above technique can be costly.
- For real-time applications in which data become available as a continuous stream, we may want to update the parameters as data points are presented to us (on-line technique).

On-line learning

After the presentation of each data point $n$, we compute the contribution of that data point to the gradient (stochastic gradient):

$$\nabla E_n(w) = (y_n - t_n)\,x_n$$

The on-line updating rule for the parameters becomes:

$$w^{(\tau+1)} = w^{(\tau)} - \eta\,\nabla E_n(w) = w^{(\tau)} - \eta\,(y_n - t_n)\,x_n$$

$\eta > 0$ is called the learning rate. Its value needs to be chosen carefully to ensure convergence.

Multiclass Logistic Regression

Multiclass case:

$$p(C_k \mid x) = \frac{e^{w_k^T x + w_{k0}}}{\sum_j e^{w_j^T x + w_{j0}}} = y_k(x)$$

We use maximum likelihood to determine the parameters of the multiclass logistic regression model.

Training data: $\{x_n, \mathbf{t}_n\}$, $n = 1, \dots, N$, where $\mathbf{t}_n = (0, \dots, 1, \dots, 0)^T$ is a 1-of-K coding vector whose $k$-th element $t_{nk}$ equals 1 when $x_n$ belongs to class $C_k$. We want to find the values of $w_1, \dots, w_K$ that maximize the posterior probabilities associated with the observed data.

Likelihood function:

$$L(w_1, \dots, w_K) = \prod_{n=1}^{N}\prod_{k=1}^{K} P(C_k \mid x_n)^{t_{nk}} = \prod_{n=1}^{N}\prod_{k=1}^{K} y_k(x_n)^{t_{nk}}$$

We consider the negative logarithm of the likelihood:

$$E(w_1, \dots, w_K) = -\ln L(w_1, \dots, w_K) = -\sum_{n=1}^{N}\sum_{k=1}^{K} t_{nk}\ln y_k(x_n), \qquad \arg\min_{w_j} E(w_1, \dots, w_K)$$

We compute the gradient of the error function with respect to one of the parameter vectors:

$$\nabla_{w_j} E(w_1, \dots, w_K) = -\nabla_{w_j}\sum_{n=1}^{N}\sum_{k=1}^{K} t_{nk}\ln y_k(x_n)$$

Thus, we need to compute the derivatives of the softmax function $y_k = \dfrac{e^{a_k}}{\sum_j e^{a_j}}$:

$$\frac{\partial y_k}{\partial a_k} = \frac{e^{a_k}}{\sum_j e^{a_j}} - \frac{e^{a_k}\,e^{a_k}}{\big(\sum_j e^{a_j}\big)^2} = y_k - y_k^2 = y_k(1 - y_k)$$

For $j \neq k$:

$$\frac{\partial y_k}{\partial a_j} = -\frac{e^{a_k}\,e^{a_j}}{\big(\sum_i e^{a_i}\big)^2} = -y_k\,y_j$$

Compact expression:

$$\frac{\partial y_k}{\partial a_j} = y_k\,(I_{kj} - y_j)$$

where $I_{kj}$ are the elements of the identity matrix.

Using this result with $a_j = w_j^T x + w_{j0}$ and writing $y_{nk} = y_k(x_n)$:

$$\nabla_{w_j} E(w_1, \dots, w_K) = -\sum_{n=1}^{N}\sum_{k=1}^{K}\frac{t_{nk}}{y_{nk}}\,y_{nk}(I_{kj} - y_{nj})\,x_n = \sum_{n=1}^{N}\sum_{k=1}^{K} t_{nk}\,(y_{nj} - I_{kj})\,x_n = \sum_{n=1}^{N}(y_{nj} - t_{nj})\,x_n$$

where the last step uses $\sum_k t_{nk} = 1$.

It can be shown that $E$ is a convex function of $w_1, \dots, w_K$; thus it has a unique minimum. For a batch solution, we can use the Newton-Raphson optimization technique. On-line solution (stochastic gradient descent):

$$w_j^{(\tau+1)} = w_j^{(\tau)} - \eta\,\nabla E_n(w) = w_j^{(\tau)} - \eta\,(y_{nj} - t_{nj})\,x_n$$
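To make the on-line multiclass update concrete, here is a minimal NumPy sketch (not from the slides) of stochastic gradient descent for softmax (multiclass logistic) regression; the names `fit_softmax_sgd`, `eta`, and `n_epochs`, and the per-epoch shuffling with a fixed learning rate, are illustrative assumptions.

```python
import numpy as np

def softmax(a):
    a = a - a.max()                  # subtract max for numerical stability
    e = np.exp(a)
    return e / e.sum()

def fit_softmax_sgd(X, T, eta=0.1, n_epochs=100, seed=0):
    """Stochastic gradient descent for multiclass logistic regression.
    X: (N, M) inputs, T: (N, K) targets in 1-of-K coding.
    Biases w_k0 are handled by appending a constant feature of 1."""
    rng = np.random.default_rng(seed)
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    N, K = T.shape
    W = np.zeros((K, Xb.shape[1]))   # one weight vector w_j per class
    for _ in range(n_epochs):
        for n in rng.permutation(N):             # present points one at a time
            y = softmax(W @ Xb[n])               # y_nk = p(C_k | x_n)
            W -= eta * np.outer(y - T[n], Xb[n]) # w_j -= eta * (y_nj - t_nj) x_n
    return W
```

In practice the learning rate is often decayed over epochs to help convergence; a constant `eta` is used here only to keep the sketch close to the update rule on the slide.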