CS 782 – Machine Learning
Lecture 4: Linear Models for Classification
- Probabilistic generative models
- Probabilistic discriminative models

Probabilistic Generative Models

We have shown that, for two classes with Gaussian class-conditional densities sharing a covariance matrix $\Sigma$,

$$p(C_1 \mid x) = \sigma(w^T x + w_0)$$

where $\sigma(a) = \dfrac{1}{1 + e^{-a}}$ is the logistic sigmoid and

$$w = \Sigma^{-1}(\mu_1 - \mu_2), \qquad w_0 = -\frac{1}{2}\mu_1^T \Sigma^{-1}\mu_1 + \frac{1}{2}\mu_2^T \Sigma^{-1}\mu_2 + \ln\frac{p(C_1)}{p(C_2)}$$

Decision boundary:

$$p(C_1 \mid x) = p(C_2 \mid x) = 0.5 \;\Longleftrightarrow\; \sigma(w^T x + w_0) = 0.5 \;\Longleftrightarrow\; w^T x + w_0 = 0$$

For $K > 2$ classes:

$$p(C_k \mid x) = \frac{p(x \mid C_k)\, p(C_k)}{\sum_j p(x \mid C_j)\, p(C_j)} = \frac{e^{a_k}}{\sum_j e^{a_j}}, \qquad a_k = \ln\big(p(x \mid C_k)\, p(C_k)\big)$$

We can show the following result (again for Gaussian class-conditional densities with a shared covariance matrix):

$$p(C_k \mid x) = \frac{e^{w_k^T x + w_{k0}}}{\sum_j e^{w_j^T x + w_{j0}}}, \qquad w_k = \Sigma^{-1}\mu_k, \qquad w_{k0} = -\frac{1}{2}\mu_k^T \Sigma^{-1}\mu_k + \ln p(C_k)$$

Maximum likelihood solution

We have a parametric functional form for the class-conditional densities:

$$p(x \mid C_k) = \frac{1}{(2\pi)^{D/2}\,|\Sigma|^{1/2}} \exp\!\Big(-\frac{1}{2}(x - \mu_k)^T \Sigma^{-1}(x - \mu_k)\Big)$$

We can estimate the parameters and the prior class probabilities using maximum likelihood. Consider the two-class case with a shared covariance matrix.

Training data: $\{x_n, t_n\}$, $n = 1, \dots, N$, where $t_n = 1$ denotes class $C_1$ and $t_n = 0$ denotes class $C_2$.
Priors: $p(C_1) = \pi$, $p(C_2) = 1 - \pi$.

For a data point $x_n$ from class $C_1$ we have $t_n = 1$ and therefore

$$p(x_n, C_1) = p(C_1)\, p(x_n \mid C_1) = \pi\, \mathcal{N}(x_n \mid \mu_1, \Sigma)$$

For a data point $x_n$ from class $C_2$ we have $t_n = 0$ and therefore

$$p(x_n, C_2) = p(C_2)\, p(x_n \mid C_2) = (1 - \pi)\, \mathcal{N}(x_n \mid \mu_2, \Sigma)$$

Assuming the observations are drawn independently, we can write the likelihood function as follows:

$$p(\mathbf{t} \mid \pi, \mu_1, \mu_2, \Sigma) = \prod_{n=1}^{N} p(t_n \mid \pi, \mu_1, \mu_2, \Sigma) = \prod_{n=1}^{N} \big[\pi\, \mathcal{N}(x_n \mid \mu_1, \Sigma)\big]^{t_n} \big[(1 - \pi)\, \mathcal{N}(x_n \mid \mu_2, \Sigma)\big]^{1 - t_n}$$

where $\mathbf{t} = (t_1, \dots, t_N)^T$.

We want to find the values of the parameters that maximize the likelihood function, i.e., fit the model that best describes the observed data. As usual, we consider the log of the likelihood:

$$\ln p(\mathbf{t} \mid \pi, \mu_1, \mu_2, \Sigma) = \sum_{n=1}^{N} \Big\{ t_n \ln \pi + t_n \ln \mathcal{N}(x_n \mid \mu_1, \Sigma) + (1 - t_n)\ln(1 - \pi) + (1 - t_n)\ln \mathcal{N}(x_n \mid \mu_2, \Sigma) \Big\}$$

We first maximize the log likelihood with respect to $\pi$. The terms that depend on $\pi$ are

$$\sum_{n=1}^{N} \big\{ t_n \ln \pi + (1 - t_n)\ln(1 - \pi) \big\}$$

Setting the derivative with respect to $\pi$ to zero:

$$\frac{1}{\pi}\sum_{n=1}^{N} t_n - \frac{1}{1 - \pi}\sum_{n=1}^{N}(1 - t_n) = \frac{N_1}{\pi} - \frac{N - N_1}{1 - \pi} = 0 \quad\Longrightarrow\quad \pi_{ML} = \frac{N_1}{N} = \frac{N_1}{N_1 + N_2}$$

where $N_1$ and $N_2$ are the numbers of training points in classes $C_1$ and $C_2$.

Thus, the maximum likelihood estimate of $\pi$ is the fraction of points in class $C_1$. The result generalizes to the multiclass case: the maximum likelihood estimate of $p(C_k)$ is the fraction of points in the training set that belong to $C_k$.
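To make the two-class result concrete, here is a minimal NumPy sketch (not part of the original slides) that evaluates the posterior $p(C_1 \mid x)$ from given generative-model parameters; the names `posterior_c1`, `mu1`, `mu2`, `Sigma`, and `prior1` are illustrative assumptions.

```python
import numpy as np

def posterior_c1(x, mu1, mu2, Sigma, prior1):
    """p(C1 | x) = sigma(w^T x + w0) for Gaussian class-conditional
    densities with shared covariance Sigma and prior p(C1) = prior1."""
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu2)
    w0 = (-0.5 * mu1 @ Sigma_inv @ mu1
          + 0.5 * mu2 @ Sigma_inv @ mu2
          + np.log(prior1 / (1.0 - prior1)))
    a = w @ x + w0
    return 1.0 / (1.0 + np.exp(-a))  # logistic sigmoid
```

A point lies on the decision boundary exactly when this function returns 0.5, i.e., when $w^T x + w_0 = 0$.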
Maximum likelihood solution (continued)

We now maximize the log likelihood with respect to $\mu_1$. The terms that depend on $\mu_1$ are

$$\sum_{n=1}^{N} t_n \ln \mathcal{N}(x_n \mid \mu_1, \Sigma) = -\frac{1}{2}\sum_{n=1}^{N} t_n (x_n - \mu_1)^T \Sigma^{-1}(x_n - \mu_1) + \text{const}$$

Setting the derivative with respect to $\mu_1$ to zero:

$$\sum_{n=1}^{N} t_n \Sigma^{-1}(x_n - \mu_1) = 0 \quad\Longrightarrow\quad \mu_1 = \frac{1}{N_1}\sum_{n=1}^{N} t_n x_n$$

Thus, the maximum likelihood estimate of $\mu_1$ is the sample mean of all the input vectors $x_n$ assigned to class $C_1$. Maximizing the log likelihood with respect to $\mu_2$ gives the analogous result:

$$\mu_2 = \frac{1}{N_2}\sum_{n=1}^{N}(1 - t_n)\, x_n$$

Maximizing the log likelihood with respect to $\Sigma$, we obtain the maximum likelihood estimate

$$\Sigma_{ML} = \frac{N_1}{N} S_1 + \frac{N_2}{N} S_2, \qquad S_1 = \frac{1}{N_1}\sum_{n \in C_1}(x_n - \mu_1)(x_n - \mu_1)^T, \qquad S_2 = \frac{1}{N_2}\sum_{n \in C_2}(x_n - \mu_2)(x_n - \mu_2)^T$$

Thus, the maximum likelihood estimate of the covariance is the weighted average of the sample covariance matrices associated with each of the classes. These results extend to K classes.

Probabilistic Discriminative Models

Two-class case:
$$p(C_1 \mid x) = \sigma(w^T x + w_0)$$
Multiclass case:
$$p(C_k \mid x) = \frac{e^{w_k^T x + w_{k0}}}{\sum_j e^{w_j^T x + w_{j0}}}$$

Discriminative approach: use the functional form of the generalized linear model for the posterior probabilities and determine its parameters directly using maximum likelihood.

Advantages:
- Fewer parameters to be determined.
- Improved predictive performance, especially when the class-conditional density assumptions give a poor approximation of the true distributions.

Two-class case:
$$p(C_1 \mid x) = \sigma(w^T x + w_0) = y(x), \qquad p(C_2 \mid x) = 1 - p(C_1 \mid x)$$
In the terminology of statistics, this model is known as logistic regression.

Assuming $x \in \mathbb{R}^M$, how many parameters do we need to estimate? $M + 1$.

How many parameters did we estimate to fit Gaussian class-conditional densities (generative approach)?
- prior $p(C_1) = \pi$: 1
- 2 mean vectors: $2M$
- shared covariance matrix: $\dfrac{M(M+1)}{2}$
- total: $1 + 2M + \dfrac{M(M+1)}{2}$, which grows as $O(M^2)$

Logistic Regression

$$p(C_1 \mid x) = y(x) = \sigma(w^T x + w_0)$$

We use maximum likelihood to determine the parameters of the logistic regression model.

Training data: $\{x_n, t_n\}$, $n = 1, \dots, N$, where $t_n = 1$ denotes class $C_1$ and $t_n = 0$ denotes class $C_2$. We want to find the values of $w$ that maximize the posterior probabilities associated with the observed data.

Likelihood function:

$$L(w) = \prod_{n=1}^{N} P(C_1 \mid x_n)^{t_n}\big(1 - P(C_1 \mid x_n)\big)^{1 - t_n} = \prod_{n=1}^{N} y(x_n)^{t_n}\big(1 - y(x_n)\big)^{1 - t_n}$$

We consider the negative logarithm of the likelihood:

$$E(w) = -\ln L(w) = -\sum_{n=1}^{N}\Big\{ t_n \ln y(x_n) + (1 - t_n)\ln\big(1 - y(x_n)\big)\Big\}, \qquad \arg\min_w E(w)$$

We compute the derivative of the error function with respect to $w$ (the gradient). We first need the derivative of the logistic sigmoid function:

$$\frac{d\sigma}{da} = \frac{d}{da}\,\frac{1}{1 + e^{-a}} = \frac{e^{-a}}{(1 + e^{-a})^2} = \frac{1}{1 + e^{-a}}\cdot\frac{e^{-a}}{1 + e^{-a}} = \sigma(a)\big(1 - \sigma(a)\big)$$

Writing $y_n = y(x_n)$:

$$\nabla E(w) = -\sum_{n=1}^{N}\Big\{\frac{t_n}{y_n}\,y_n(1 - y_n)\,x_n - \frac{1 - t_n}{1 - y_n}\,y_n(1 - y_n)\,x_n\Big\} = -\sum_{n=1}^{N}\big\{t_n(1 - y_n) - (1 - t_n)\,y_n\big\}\,x_n = \sum_{n=1}^{N}(y_n - t_n)\,x_n$$

The gradient of $E$ at $w$ gives the direction of the steepest increase of $E$ at $w$. We need to minimize $E$, so we update $w$ by moving in the direction opposite to the gradient:

$$w^{(\tau+1)} = w^{(\tau)} - \eta\,\nabla E(w)$$

This technique is called gradient descent. It can be shown that $E$ is a convex function of $w$; thus it has a unique minimum. An efficient iterative technique exists to find the optimal $w$ parameters (Newton-Raphson optimization).
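The following is a minimal NumPy sketch (not from the slides) of batch gradient descent for two-class logistic regression using the update rule above; the helper names `fit_logistic`, `eta`, and `n_iters`, and the convention of absorbing $w_0$ into $w$ through a constant feature, are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic(X, t, eta=0.1, n_iters=1000):
    """Batch gradient descent for two-class logistic regression.
    X: (N, M) inputs, t: (N,) targets in {0, 1}."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # absorb w0 into w
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iters):
        y = sigmoid(Xb @ w)          # y_n = sigma(w^T x_n + w0)
        grad = Xb.T @ (y - t)        # sum_n (y_n - t_n) x_n
        w = w - eta * grad           # move against the gradient
    return w
```

Each iteration processes the entire training set once; this is the batch behaviour contrasted with the on-line scheme discussed next.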
Batch vs. on-line learning

$$\nabla E(w) = \sum_{n=1}^{N}(y_n - t_n)\,x_n$$

The computation of the above gradient requires processing the entire training set (batch technique):

$$w^{(\tau+1)} = w^{(\tau)} - \eta\,\nabla E(w)$$

- If the data set is large, the above technique can be costly.
- For real-time applications in which data become available as a continuous stream, we may want to update the parameters as data points are presented to us (on-line technique).

On-line learning

After the presentation of each data point $n$, we compute the contribution of that data point to the gradient (stochastic gradient):

$$\nabla E_n(w) = (y_n - t_n)\,x_n$$

The on-line updating rule for the parameters becomes:

$$w^{(\tau+1)} = w^{(\tau)} - \eta\,\nabla E_n(w) = w^{(\tau)} - \eta\,(y_n - t_n)\,x_n$$

$\eta > 0$ is called the learning rate. Its value needs to be chosen carefully to ensure convergence.

Multiclass Logistic Regression

Multiclass case:

$$p(C_k \mid x) = \frac{e^{w_k^T x + w_{k0}}}{\sum_j e^{w_j^T x + w_{j0}}} = y_k(x)$$

We use maximum likelihood to determine the parameters of the multiclass logistic regression model.

Training data: $\{x_n, \mathbf{t}_n\}$, $n = 1, \dots, N$, where $\mathbf{t}_n = (0, \dots, 1, \dots, 0)^T$ is a 1-of-K coding vector whose $k$-th element $t_{nk}$ equals 1 when $x_n$ belongs to class $C_k$. We want to find the values of $w_1, \dots, w_K$ that maximize the posterior probabilities associated with the observed data.

Likelihood function:

$$L(w_1, \dots, w_K) = \prod_{n=1}^{N}\prod_{k=1}^{K} P(C_k \mid x_n)^{t_{nk}} = \prod_{n=1}^{N}\prod_{k=1}^{K} y_k(x_n)^{t_{nk}}$$

We consider the negative logarithm of the likelihood:

$$E(w_1, \dots, w_K) = -\ln L(w_1, \dots, w_K) = -\sum_{n=1}^{N}\sum_{k=1}^{K} t_{nk}\ln y_k(x_n), \qquad \arg\min_{w_j} E(w_1, \dots, w_K)$$

We compute the gradient of the error function with respect to one of the parameter vectors:

$$\nabla_{w_j} E(w_1, \dots, w_K) = -\nabla_{w_j}\sum_{n=1}^{N}\sum_{k=1}^{K} t_{nk}\ln y_k(x_n)$$

Thus, we need to compute the derivatives of the softmax function $y_k = \dfrac{e^{a_k}}{\sum_j e^{a_j}}$:

$$\frac{\partial y_k}{\partial a_k} = \frac{e^{a_k}}{\sum_j e^{a_j}} - \frac{e^{a_k}\,e^{a_k}}{\big(\sum_j e^{a_j}\big)^2} = y_k - y_k^2 = y_k(1 - y_k)$$

For $j \neq k$:

$$\frac{\partial y_k}{\partial a_j} = -\frac{e^{a_k}\,e^{a_j}}{\big(\sum_i e^{a_i}\big)^2} = -y_k\,y_j$$

Compact expression:

$$\frac{\partial y_k}{\partial a_j} = y_k\,(I_{kj} - y_j)$$

where $I_{kj}$ are the elements of the identity matrix.

Using this result with $a_j = w_j^T x + w_{j0}$ and writing $y_{nk} = y_k(x_n)$:

$$\nabla_{w_j} E(w_1, \dots, w_K) = -\sum_{n=1}^{N}\sum_{k=1}^{K}\frac{t_{nk}}{y_{nk}}\,y_{nk}(I_{kj} - y_{nj})\,x_n = \sum_{n=1}^{N}\sum_{k=1}^{K} t_{nk}\,(y_{nj} - I_{kj})\,x_n = \sum_{n=1}^{N}(y_{nj} - t_{nj})\,x_n$$

where the last step uses $\sum_k t_{nk} = 1$.

It can be shown that $E$ is a convex function of $w_1, \dots, w_K$; thus it has a unique minimum. For a batch solution, we can use the Newton-Raphson optimization technique. On-line solution (stochastic gradient descent):

$$w_j^{(\tau+1)} = w_j^{(\tau)} - \eta\,\nabla E_n(w) = w_j^{(\tau)} - \eta\,(y_{nj} - t_{nj})\,x_n$$
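To make the on-line multiclass update concrete, here is a minimal NumPy sketch (not from the slides) of stochastic gradient descent for softmax (multiclass logistic) regression; the names `fit_softmax_sgd`, `eta`, and `n_epochs`, and the per-epoch shuffling with a fixed learning rate, are illustrative assumptions.

```python
import numpy as np

def softmax(a):
    a = a - a.max()                  # subtract max for numerical stability
    e = np.exp(a)
    return e / e.sum()

def fit_softmax_sgd(X, T, eta=0.1, n_epochs=100, seed=0):
    """Stochastic gradient descent for multiclass logistic regression.
    X: (N, M) inputs, T: (N, K) targets in 1-of-K coding.
    Biases w_k0 are handled by appending a constant feature of 1."""
    rng = np.random.default_rng(seed)
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    N, K = T.shape
    W = np.zeros((K, Xb.shape[1]))   # one weight vector w_j per class
    for _ in range(n_epochs):
        for n in rng.permutation(N):             # present points one at a time
            y = softmax(W @ Xb[n])               # y_nk = p(C_k | x_n)
            W -= eta * np.outer(y - T[n], Xb[n]) # w_j -= eta * (y_nj - t_nj) x_n
    return W
```

In practice the learning rate is often decayed over epochs to help convergence; a constant `eta` is used here only to keep the sketch close to the update rule on the slide.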