Logistic Regression

CS 782 – Machine Learning
Lecture 4
Linear Models for Classification
• Probabilistic generative models
• Probabilistic discriminative models
Probabilistic Generative Models
We have shown:

pC1 | x    w T x  w0

w   1  μ1  μ2 
1 T
1 T
pC1 
w0   μ1 μ1  μ2 μ2  ln
2
2
pC2 
Decision boundary:

1
pC1 | x   pC2 | x   0.5
T

0
.
5

w
x  w0  0
 w T x  w0 
1 e
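As a concrete illustration of these formulas (a minimal sketch of my own, not course code; `mu1`, `mu2`, `Sigma`, and `prior1` are placeholder inputs), the posterior and the decision boundary can be evaluated as follows:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def two_class_posterior(x, mu1, mu2, Sigma, prior1):
    """p(C1 | x) for Gaussian class-conditionals with a shared covariance."""
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu2)
    w0 = (-0.5 * mu1 @ Sigma_inv @ mu1
          + 0.5 * mu2 @ Sigma_inv @ mu2
          + np.log(prior1 / (1.0 - prior1)))
    return sigmoid(w @ x + w0)

# The decision boundary w^T x + w0 = 0 is exactly where this posterior equals 0.5.
x = np.array([0.2, -0.1])
print(two_class_posterior(x, np.array([1.0, 1.0]), np.array([-1.0, -1.0]), np.eye(2), 0.5))
```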
Probabilistic Generative Models
K > 2 classes:

$$ p(C_k \mid x) = \frac{p(x \mid C_k)\, p(C_k)}{\sum_j p(x \mid C_j)\, p(C_j)} = \frac{e^{a_k}}{\sum_j e^{a_j}}, \qquad a_k = \ln\bigl( p(x \mid C_k)\, p(C_k) \bigr) $$
We can show the following result:
pCk | x  
e
w kT x  wk 0
e
w Tj x  w j 0
j
1
w k   μk
1 T 1
wk 0   μk  μk  ln pCk 
2
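The softmax form above is straightforward to evaluate directly; the small sketch below (an illustration of mine, not from the slides) uses the usual max-subtraction trick for numerical stability:

```python
import numpy as np

def softmax(a):
    """Posteriors p(C_k | x) from activations a_k; subtracting max(a) avoids overflow."""
    e = np.exp(a - np.max(a))
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # sums to 1
```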
Maximum likelihood solution
We have a parametric functional form for the class-conditional densities:
$$ p(x \mid C_k) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\Bigl\{ -\tfrac{1}{2} (x - \mu_k)^T \Sigma^{-1} (x - \mu_k) \Bigr\} $$
We can estimate the parameters and the prior class
probabilities using maximum likelihood.
• Two-class case with shared covariance matrix.
Training data: $\{x_n, t_n\}$, $n = 1, \ldots, N$, where $t_n = 1$ denotes class $C_1$ and $t_n = 0$ denotes class $C_2$.
Priors: $p(C_1) = \pi$, $p(C_2) = 1 - \pi$.
Maximum likelihood solution
Training data: $\{x_n, t_n\}$, $n = 1, \ldots, N$, where $t_n = 1$ denotes class $C_1$ and $t_n = 0$ denotes class $C_2$. Priors: $p(C_1) = \pi$, $p(C_2) = 1 - \pi$.

For a data point $x_n$ from class $C_1$ we have $t_n = 1$ and therefore

$$ p(x_n, C_1) = p(C_1)\, p(x_n \mid C_1) = \pi\, \mathcal{N}(x_n \mid \mu_1, \Sigma) $$

For a data point $x_n$ from class $C_2$ we have $t_n = 0$ and therefore

$$ p(x_n, C_2) = p(C_2)\, p(x_n \mid C_2) = (1 - \pi)\, \mathcal{N}(x_n \mid \mu_2, \Sigma) $$
Maximum likelihood solution
$$ p(x_n, C_1) = \pi\, \mathcal{N}(x_n \mid \mu_1, \Sigma), \qquad p(x_n, C_2) = (1 - \pi)\, \mathcal{N}(x_n \mid \mu_2, \Sigma) $$

Assuming observations are drawn independently, we can write the likelihood function as follows:

$$ p(\mathbf{t} \mid \pi, \mu_1, \mu_2, \Sigma) = \prod_{n=1}^{N} p(t_n \mid \pi, \mu_1, \mu_2, \Sigma) = \prod_{n=1}^{N} p(x_n, C_1)^{t_n}\, p(x_n, C_2)^{1 - t_n} = \prod_{n=1}^{N} \bigl[ \pi\, \mathcal{N}(x_n \mid \mu_1, \Sigma) \bigr]^{t_n} \bigl[ (1 - \pi)\, \mathcal{N}(x_n \mid \mu_2, \Sigma) \bigr]^{1 - t_n} $$

where $\mathbf{t} = (t_1, \ldots, t_N)^T$.
Maximum likelihood solution
We want to find the values of the parameters that maximize
the likelihood function, i.e., fit a model that best describes
the observed data.
pt |  , μ1 , μ2 ,   
N
 N  x
| μ1 ,   1   N  x n | μ2 ,  
tn
n
1t n
n 1
As usual, we consider the log of the likelihood:
ln pt |  , μ1 , μ2 ,   
N
 t
n 1
n
ln   t n ln N  x n | μ1 ,    1  t n  ln 1     1  t n  ln N  x n | μ2 ,  
Maximum likelihood solution
ln pt |  , μ1 , μ2 ,   
N
 t
n 1
n
ln   t n ln N  x n | μ1 ,    1  t n  ln 1     1  t n  ln N  x n | μ2 ,  
We first maximize the log likelihood with respect to $\pi$. The terms that depend on $\pi$ are

$$ \sum_{n=1}^{N} \bigl[ t_n \ln \pi + (1 - t_n) \ln(1 - \pi) \bigr] $$

Setting the derivative with respect to $\pi$ equal to zero:

$$ \frac{\partial}{\partial \pi} \sum_{n=1}^{N} \bigl[ t_n \ln \pi + (1 - t_n) \ln(1 - \pi) \bigr] = \frac{1}{\pi} \sum_{n=1}^{N} t_n - \frac{1}{1 - \pi} \sum_{n=1}^{N} (1 - t_n) = 0 $$

$$ (1 - \pi) \sum_{n=1}^{N} t_n = \pi \Bigl( N - \sum_{n=1}^{N} t_n \Bigr) \;\Longrightarrow\; \pi = \frac{1}{N} \sum_{n=1}^{N} t_n = \frac{N_1}{N} = \frac{N_1}{N_1 + N_2} $$

where $N_1$ and $N_2$ denote the number of training points in classes $C_1$ and $C_2$, respectively.
Maximum likelihood solution
$$ \pi_{ML} = \frac{N_1}{N_1 + N_2} $$
Thus, the maximum likelihood estimate of $\pi$ is the fraction of points in class $C_1$.
The result can be generalized to the multiclass case: the maximum likelihood estimate of $p(C_k)$ is given by the fraction of points in the training set that belong to $C_k$.
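In code this is just label counting; a trivial sketch of my own, assuming integer class labels $0, \ldots, K-1$:

```python
import numpy as np

labels = np.array([0, 0, 1, 2, 1, 0])       # hypothetical class labels for 6 points
priors = np.bincount(labels) / len(labels)  # ML estimates of p(C_k)
print(priors)                               # [0.5, 0.3333..., 0.1666...]
```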
Maximum likelihood solution
ln pt |  , μ1 , μ2 ,   
N
 t
n 1
n
ln   t n ln N  x n | μ1 ,    1  t n  ln 1     1  t n  ln N  x n | μ2 ,  
We now maximize the log likelihood with respect to $\mu_1$. The terms that depend on $\mu_1$ are $\sum_{n=1}^{N} t_n \ln \mathcal{N}(x_n \mid \mu_1, \Sigma)$. Recalling that

$$ \mathcal{N}(x_n \mid \mu_1, \Sigma) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\Bigl\{ -\tfrac{1}{2} (x_n - \mu_1)^T \Sigma^{-1} (x_n - \mu_1) \Bigr\} $$

we have

$$ \frac{\partial}{\partial \mu_1} \sum_{n=1}^{N} t_n \ln \mathcal{N}(x_n \mid \mu_1, \Sigma) = \frac{\partial}{\partial \mu_1} \Bigl[ -\tfrac{1}{2} \sum_{n=1}^{N} t_n (x_n - \mu_1)^T \Sigma^{-1} (x_n - \mu_1) + \text{const} \Bigr] = \Sigma^{-1} \sum_{n=1}^{N} t_n (x_n - \mu_1) $$

Setting this to zero:

$$ \sum_{n=1}^{N} t_n (x_n - \mu_1) = 0 \;\Longrightarrow\; \sum_{n=1}^{N} t_n x_n = N_1 \mu_1 \;\Longrightarrow\; \mu_1 = \frac{1}{N_1} \sum_{n=1}^{N} t_n x_n $$
Maximum likelihood solution
$$ \mu_1 = \frac{1}{N_1} \sum_{n=1}^{N} t_n x_n $$
Thus, the maximum likelihood estimate of $\mu_1$ is the sample mean of all the input vectors $x_n$ assigned to class $C_1$.

By maximizing the log likelihood with respect to $\mu_2$ we obtain a similar result for $\mu_2$:

$$ \mu_2 = \frac{1}{N_2} \sum_{n=1}^{N} (1 - t_n)\, x_n $$
Maximum likelihood solution
Maximizing the log likelihood with respect to $\Sigma$, we obtain the maximum likelihood estimate

$$ \Sigma_{ML} = \frac{N_1}{N} S_1 + \frac{N_2}{N} S_2 $$

$$ S_1 = \frac{1}{N_1} \sum_{n \in C_1} (x_n - \mu_1)(x_n - \mu_1)^T, \qquad S_2 = \frac{1}{N_2} \sum_{n \in C_2} (x_n - \mu_2)(x_n - \mu_2)^T $$
• Thus, the maximum likelihood estimate of the covariance is the weighted average of the sample covariance matrices associated with each of the classes.
• This result extends to K classes.
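To make the whole estimation procedure concrete, here is a minimal NumPy sketch (an illustration under the slides' assumptions, not the course's official code) that computes $\pi_{ML}$, $\mu_1$, $\mu_2$, and $\Sigma_{ML}$ from a labelled sample and then forms the posterior weights $w$, $w_0$ from the earlier slide:

```python
import numpy as np

def fit_generative_two_class(X, t):
    """ML estimates (pi, mu1, mu2, Sigma) for the shared-covariance Gaussian model,
    plus the posterior weights (w, w0) from the earlier slide.

    X: (N, D) inputs; t: (N,) targets with t_n = 1 for C1 and t_n = 0 for C2.
    """
    N = len(t)
    N1 = t.sum()
    N2 = N - N1
    pi = N1 / N                                   # prior p(C1) = fraction of C1 points
    mu1 = (t[:, None] * X).sum(axis=0) / N1       # class means
    mu2 = ((1 - t)[:, None] * X).sum(axis=0) / N2
    S1 = (X[t == 1] - mu1).T @ (X[t == 1] - mu1) / N1
    S2 = (X[t == 0] - mu2).T @ (X[t == 0] - mu2) / N2
    Sigma = (N1 / N) * S1 + (N2 / N) * S2         # weighted average of class covariances

    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu2)
    w0 = (-0.5 * mu1 @ Sigma_inv @ mu1 + 0.5 * mu2 @ Sigma_inv @ mu2
          + np.log(pi / (1 - pi)))
    return pi, mu1, mu2, Sigma, w, w0
```

The returned $w$, $w_0$ can then be plugged into $p(C_1 \mid x) = \sigma(w^T x + w_0)$.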
Probabilistic Discriminative Models
Two-class case:

$$ p(C_1 \mid x) = \sigma(w^T x + w_0) $$

Multiclass case:

$$ p(C_k \mid x) = \frac{e^{w_k^T x + w_{k0}}}{\sum_j e^{w_j^T x + w_{j0}}} $$
Discriminative approach: use the functional form of the
generalized linear model for the posterior probabilities and
determine its parameters directly using maximum
likelihood.
Probabilistic Discriminative Models
Advantages:
• Fewer parameters to be determined
• Improved predictive performance, especially when the class-conditional density assumptions give a poor approximation of the true distributions.
Probabilistic Discriminative Models
Two-class case:


pC1 | x    w T x  w0  y x 
pC2 | x   1  pC1 | x 
In the terminology of statistics, this model is known as
logistic regression.
Assuming $x \in \mathbb{R}^M$, how many parameters do we need to estimate?

$$ M + 1 $$
Probabilistic Discriminative Models
How many parameters did we estimate to fit Gaussian
class-conditional densities (generative approach)?
pC1   1
2 mean vectors  2 M
M2 M M2 M
  M

2
2
M2 M
total  1  2 M 
O M2
2
 
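For example, with $M = 100$ input features the logistic regression model has $M + 1 = 101$ parameters, whereas the generative model has $1 + 2(100) + \frac{100 \cdot 101}{2} = 1 + 200 + 5050 = 5251$.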
Logistic Regression


pC1 | x    w T x  w0  y x 
We use maximum likelihood to determine the parameters
of the logistic regression model.
Training data: $\{x_n, t_n\}$, $n = 1, \ldots, N$, where $t_n = 1$ denotes class $C_1$ and $t_n = 0$ denotes class $C_2$.

We want to find the values of $w$ that maximize the posterior probabilities associated with the observed data.

Likelihood function:

$$ L(w) = \prod_{n=1}^{N} P(C_1 \mid x_n)^{t_n}\, \bigl(1 - P(C_1 \mid x_n)\bigr)^{1 - t_n} = \prod_{n=1}^{N} y(x_n)^{t_n}\, \bigl(1 - y(x_n)\bigr)^{1 - t_n} $$
Logistic Regression


pC1 | x    w T x  w0  y x 
N
Lw    PC1 | xn  n 1  PC1 | xn 
1t n
t
n 1
N
  y  xn  n 1  y  xn 
1t n
t
n 1
We consider the negative logarithm of the likelihood:
$$ E(w) = -\ln L(w) = -\ln \prod_{n=1}^{N} y(x_n)^{t_n}\, \bigl(1 - y(x_n)\bigr)^{1 - t_n} = -\sum_{n=1}^{N} \bigl[ t_n \ln y(x_n) + (1 - t_n) \ln\bigl(1 - y(x_n)\bigr) \bigr] $$

$$ \arg\min_{w} E(w) $$
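A small sketch of evaluating this cross-entropy error for a candidate $w$ (illustrative only; clipping $y$ away from 0 and 1 is a numerical safeguard of my own, not from the slides):

```python
import numpy as np

def cross_entropy_error(w, w0, X, t):
    """E(w) = -sum_n [ t_n ln y_n + (1 - t_n) ln(1 - y_n) ] with y_n = sigma(w^T x_n + w0)."""
    y = 1.0 / (1.0 + np.exp(-(X @ w + w0)))
    y = np.clip(y, 1e-12, 1.0 - 1e-12)   # keep the logs finite
    return -np.sum(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))
```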
Logistic Regression


pC1 | x    w T x  w0  y x 
We compute the derivative of the error function with
respect to w (gradient):

$$ \frac{\partial}{\partial w} E(w) = -\frac{\partial}{\partial w} \sum_{n=1}^{N} \bigl[ t_n \ln y(x_n) + (1 - t_n) \ln\bigl(1 - y(x_n)\bigr) \bigr] $$
We need to compute the derivative of the logistic sigmoid
function:


$$ \frac{\partial \sigma(a)}{\partial a} = \frac{\partial}{\partial a} \frac{1}{1 + e^{-a}} = \frac{e^{-a}}{(1 + e^{-a})^2} = \frac{1}{1 + e^{-a}} \cdot \frac{e^{-a}}{1 + e^{-a}} = \frac{1}{1 + e^{-a}} \Bigl( 1 - \frac{1}{1 + e^{-a}} \Bigr) = \sigma(a)\bigl(1 - \sigma(a)\bigr) $$
Logistic Regression
pC1 | x    w T x  w0   y x 

  N

E w  
  t n ln y  x n   1  t n  ln 1  y  x n  

w
w  n 1

 tn


1  tn 
 yn 1  yn xn  
   yn 1  yn x n 
1  yn 
n 1  y n

N
N
N
n 1
n 1
  t n 1  yn x n  1  t n  yn x n    t n  t n yn  yn  t n yn x n 
N
N
n 1
n 1
  t n  yn x n    yn  t n x n
N
E w     yn  t n x n
n 1
Logistic Regression
$$ \nabla E(w) = \sum_{n=1}^{N} (y_n - t_n)\, x_n $$
• The gradient of E at w gives the direction of steepest increase of E at w. We need to minimize E, so we update w by moving in the direction opposite to the gradient:

$$ w^{(\tau + 1)} = w^{(\tau)} - \eta\, \nabla E(w) $$

This technique is called gradient descent.
• It can be shown that E is a convex function of w; thus, it has a unique minimum.
• An efficient iterative technique exists to find the optimal w parameters (Newton-Raphson optimization).
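A minimal batch gradient-descent sketch of this update rule (illustrative only; the fixed learning rate, iteration count, and folding the bias in as an extra constant input are my own choices, not from the slides):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic_batch(X, t, eta=0.1, n_iters=1000):
    """Batch gradient descent on the cross-entropy error E(w)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])   # append a constant 1 so w0 is the last weight
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iters):
        y = sigmoid(Xb @ w)                     # y_n = sigma(w^T x_n + w0)
        grad = Xb.T @ (y - t)                   # sum_n (y_n - t_n) x_n
        w = w - eta * grad                      # step against the gradient
    return w
```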
Batch vs. on-line learning
$$ \nabla E(w) = \sum_{n=1}^{N} (y_n - t_n)\, x_n $$
• The computation of the above gradient requires processing the entire training set (batch technique):

$$ w^{(\tau + 1)} = w^{(\tau)} - \eta\, \nabla E(w) $$
• If the data set is large, the above technique can be costly.
• For real-time applications in which data become available as continuous streams, we may want to update the parameters as data points are presented to us (on-line technique).
On-line learning
• After the presentation of each data point n, we compute the contribution of that data point to the gradient (stochastic gradient):

$$ \nabla E_n(w) = (y_n - t_n)\, x_n $$

The on-line updating rule for the parameters becomes:

$$ w^{(\tau + 1)} = w^{(\tau)} - \eta\, \nabla E_n(w) = w^{(\tau)} - \eta\, (y_n - t_n)\, x_n $$

$\eta > 0$ is called the learning rate. Its value needs to be chosen carefully to ensure convergence.
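A corresponding on-line (stochastic gradient descent) sketch, again illustrative; the number of shuffled passes and the constant learning rate are my own choices:

```python
import numpy as np

def fit_logistic_online(X, t, eta=0.05, n_epochs=10, seed=0):
    """Stochastic (on-line) gradient descent: update w after each data point."""
    rng = np.random.default_rng(seed)
    Xb = np.hstack([X, np.ones((len(X), 1))])    # bias folded in as a constant input
    w = np.zeros(Xb.shape[1])
    for _ in range(n_epochs):
        for n in rng.permutation(len(Xb)):       # present the points in random order
            y_n = 1.0 / (1.0 + np.exp(-Xb[n] @ w))
            w = w - eta * (y_n - t[n]) * Xb[n]   # w <- w - eta * grad E_n(w)
        # eta could be decayed between epochs to help convergence
    return w
```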
Multiclass Logistic Regression
Multiclass case:
pCk | x  
e
w kT x  wk 0
e
w Tj x  w j 0
 yk  x 
j
We use maximum likelihood to determine the parameters
of the logistic regression model.
Training data: $\{x_n, \mathbf{t}_n\}$, $n = 1, \ldots, N$, where $\mathbf{t}_n = (0, \ldots, 1, \ldots, 0)^T$ is a 1-of-K vector denoting the class $C_k$ of $x_n$.

We want to find the values of $w_1, \ldots, w_K$ that maximize the posterior probabilities associated with the observed data.

Likelihood function:

$$ L(w_1, \ldots, w_K) = \prod_{n=1}^{N} \prod_{k=1}^{K} P(C_k \mid x_n)^{t_{nk}} = \prod_{n=1}^{N} \prod_{k=1}^{K} y_k(x_n)^{t_{nk}} $$
Multiclass Logistic Regression
pCk | x  
e
w kT x  wk 0
e
w Tj x  w j 0
 yk  x 
j
N
K
Lw1 ,, w K    yk  xn  nk
t
n 1 k 1
We consider the negative logarithm of the likelihood:
$$ E(w_1, \ldots, w_K) = -\ln L(w_1, \ldots, w_K) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk} \ln y_k(x_n) $$

$$ \arg\min_{w_1, \ldots, w_K} E(w_1, \ldots, w_K) $$
Multiclass Logistic Regression
pCk | x  
e
w kT x  wk 0
e
w Tj x  w j 0
 yk  x 
j
N
K
E w1 ,, w K    t nk ln yk  xn 
n 1 k 1
We compute the gradient of the error function with respect
to one of the parameter vectors:


E w1 ,, w K  
w j
w j
 N K

  t nk ln yk  xn 
 n 1 k 1

Multiclass Logistic Regression
pCk | x  
e
w kT x  wk 0
e
w Tj x  w j 0
j


E w1 ,, w K  
w j
w j
 yk  x 
 N K

  t nk yk  xn 
 n 1 k 1

Thus, we need to compute the derivatives of the softmax
function:


$$ \frac{\partial y_k}{\partial a_k} = \frac{\partial}{\partial a_k} \frac{e^{a_k}}{\sum_j e^{a_j}} = \frac{e^{a_k} \sum_j e^{a_j} - e^{a_k} e^{a_k}}{\bigl( \sum_j e^{a_j} \bigr)^2} = y_k - y_k^2 = y_k (1 - y_k) $$
Multiclass Logistic Regression
pCk | x  
e
w kT x  wk 0
e
w Tj x  w j 0
j


E w1 ,, w K  
w j
w j
 yk  x 
 N K

  t nk yk  xn 
 n 1 k 1

Thus, we need to compute the derivatives of the softmax
function:

For $j \neq k$:

$$ \frac{\partial y_k}{\partial a_j} = \frac{\partial}{\partial a_j} \frac{e^{a_k}}{\sum_i e^{a_i}} = -\frac{e^{a_k} e^{a_j}}{\bigl( \sum_i e^{a_i} \bigr)^2} = -y_k\, y_j $$
Multiclass Logistic Regression


$$ \frac{\partial y_k}{\partial a_k} = \frac{\partial}{\partial a_k} \frac{e^{a_k}}{\sum_j e^{a_j}} = \frac{e^{a_k} \sum_j e^{a_j} - e^{a_k} e^{a_k}}{\bigl( \sum_j e^{a_j} \bigr)^2} = y_k - y_k^2 = y_k (1 - y_k) $$

$$ \text{for } j \neq k, \quad \frac{\partial y_k}{\partial a_j} = -\frac{e^{a_k} e^{a_j}}{\bigl( \sum_i e^{a_i} \bigr)^2} = -y_k\, y_j $$
Compact expression:

$$ \frac{\partial y_k}{\partial a_j} = y_k \bigl( I_{kj} - y_j \bigr) $$

where $I_{kj}$ are the elements of the identity matrix.
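As a quick sanity check (my own illustration, not part of the lecture), the compact expression can be verified numerically against a finite-difference approximation of the softmax Jacobian:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

a = np.array([0.5, -1.0, 2.0])
y = softmax(a)

# Analytic Jacobian: dy_k/da_j = y_k (I_kj - y_j)
J_analytic = np.diag(y) - np.outer(y, y)

# Finite-difference approximation of the same Jacobian
eps = 1e-6
J_numeric = np.zeros((3, 3))
for j in range(3):
    a_plus = a.copy()
    a_plus[j] += eps
    J_numeric[:, j] = (softmax(a_plus) - y) / eps

print(np.max(np.abs(J_analytic - J_numeric)))  # should be on the order of 1e-6 or smaller
```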
Multiclass Logistic Regression
pCk | x  
e
w kT x  wk 0
e
w Tj x  w j 0
 yk  x 
j

yk  yk I kj  y j 
a j

 w j E w1 ,, w K  
w j
N
K
  t nk
n 1 k 1
 N K

  t nk ln yk  x n  
 n 1 k 1

N K
1
ynk I kj  ynj x n   t nk ynj  t nk I kj x n 
ynk
n 1 k 1
 y
N
n 1
nj
 t nj x n
Multiclass Logistic Regression
$$ \nabla_{w_j} E(w_1, \ldots, w_K) = \sum_{n=1}^{N} \bigl( y_{nj} - t_{nj} \bigr)\, x_n $$
• It can be shown that E is a convex function of the parameters; thus, it has a unique minimum.
• For a batch solution, we can use the Newton-Raphson optimization technique.
• On-line solution (stochastic gradient descent):

$$ w_j^{(\tau + 1)} = w_j^{(\tau)} - \eta\, \nabla E_n(w) = w_j^{(\tau)} - \eta\, \bigl( y_{nj} - t_{nj} \bigr)\, x_n $$