Week 6

Linear Models for Classification:
Probabilistic Methods
Adapted from Seung-Joon Yi
Biointelligence Laboratory, Seoul National University
http://bi.snu.ac.kr/
Recall: Linear Methods for Classification

Problem definition: Given the training data {x_n, t_n}, find a linear model y_k(x) for each class that partitions the feature space into decision regions.

Deterministic models:
- Discriminant functions
- Fisher's linear discriminant
- Perceptron
Probabilistic Approaches for Classification

Generative models:
- Inference: model p(x|C_k) and p(C_k)
- Decision: combine them via Bayes' theorem to obtain p(C_k|x)

Discriminative models:
- Model p(C_k|x) directly
- Use the functional form of the generalized linear model explicitly
- Determine the parameters directly using maximum likelihood
Logistic Sigmoid Function

- Originates from models of population growth.
- Closely approximates the cumulative distribution function of a Normal random variable.
- If the class-conditional densities are Normal, the posterior probabilities become logistic sigmoids.
Posterior Probabilities

The posterior probabilities can be formulated as:
- 2-class: a logistic sigmoid acting on a linear function of x
- K-class: a softmax transformation of a linear function of x

Then the parameters of the densities, as well as the class priors, can be determined using maximum likelihood.
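As a concrete reference (my own illustration, not part of the slides), a minimal NumPy sketch of the two activation functions used throughout:

import numpy as np

def sigmoid(a):
    # Logistic sigmoid: sigma(a) = 1 / (1 + exp(-a))
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    # Softmax (normalized exponential) over the last axis;
    # subtracting the max is a standard numerical-stability trick.
    a = a - np.max(a, axis=-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)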
Probabilistic Generative Models: 2-Class

Recall: given p(x|C_k) and p(C_k), find p(C_k|x).

The posterior can be expressed by a logistic sigmoid:

p(C_1|x) = \frac{p(x|C_1)\,p(C_1)}{p(x|C_1)\,p(C_1) + p(x|C_2)\,p(C_2)} = \frac{1}{1 + \exp(-a)} = \sigma(a),

where

a = \ln \frac{p(x|C_1)\,p(C_1)}{p(x|C_2)\,p(C_2)}.

a is called the logit (the log odds).
Probabilistic Generative Models: K-Class

The posterior can be expressed by the softmax function (normalized exponential), the multi-class generalisation of the logistic sigmoid:

p(C_k|x) = \frac{p(x|C_k)\,p(C_k)}{\sum_j p(x|C_j)\,p(C_j)} = \frac{\exp(a_k)}{\sum_j \exp(a_j)},

where a_k = \ln p(x|C_k)\,p(C_k).
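A minimal NumPy sketch of this generative posterior (my own illustration; it assumes ln p(x|C_k) and p(C_k) are already available):

import numpy as np

def generative_posterior(log_likelihoods, priors):
    # log_likelihoods: array of ln p(x|C_k); priors: array of p(C_k)
    a = log_likelihoods + np.log(priors)   # a_k = ln p(x|C_k) p(C_k)
    a = a - a.max()                        # constant shift; cancels in the ratio
    e = np.exp(a)
    return e / e.sum()                     # softmax of a_k = p(C_k|x)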
Probabilistic Generative Models: Gaussian Class Conditionals for 2 Classes

Assume Gaussian class-conditional densities with a shared covariance matrix Σ:

p(x|C_k) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu_k)^T \Sigma^{-1} (x - \mu_k) \right\}.

Then the posterior is a logistic sigmoid of a linear function of x:

p(C_1|x) = \sigma(w^T x + w_0),

where

w = \Sigma^{-1} (\mu_1 - \mu_2) \quad \text{and} \quad w_0 = -\frac{1}{2} \mu_1^T \Sigma^{-1} \mu_1 + \frac{1}{2} \mu_2^T \Sigma^{-1} \mu_2 + \ln \frac{p(C_1)}{p(C_2)}.

Note:
- The quadratic terms in x from the exponents cancel.
- The resulting decision boundary is linear in input space.
- The priors only shift the decision boundary, i.e. they move it parallel to itself without changing its orientation.
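A short NumPy sketch of these formulas (illustrative only; argument names are my own):

import numpy as np

def two_class_gaussian_posterior(x, mu1, mu2, Sigma, prior1):
    # Shared-covariance Gaussian generative model, 2 classes.
    # Returns p(C1|x) = sigmoid(w^T x + w0).
    Sinv = np.linalg.inv(Sigma)
    w = Sinv @ (mu1 - mu2)
    w0 = (-0.5 * mu1 @ Sinv @ mu1
          + 0.5 * mu2 @ Sinv @ mu2
          + np.log(prior1 / (1.0 - prior1)))
    return 1.0 / (1.0 + np.exp(-(w @ x + w0)))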
Probabilistic Generative Models: Gaussian Class Conditionals for K Classes

a_k(x) = w_k^T x + w_{k0},

where

w_k = \Sigma^{-1} \mu_k \quad \text{and} \quad w_{k0} = -\frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \ln p(C_k).

- When the covariance matrix is shared across classes, the decision boundaries are linear.
- When each class-conditional density has its own covariance matrix Σ_k, a_k becomes a quadratic function of x, giving rise to a quadratic discriminant.
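For reference (a standard result stated here for completeness, not spelled out on the slide), with class-specific covariances the activations take the quadratic form

a_k(x) = -\frac{1}{2} \ln |\Sigma_k| - \frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) + \ln p(C_k) + \text{const}.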
Probabilistic Generative Models: Maximum Likelihood Solution

Two classes. Given:
- Data set {x_n, t_n}, n = 1, ..., N
- t_n = 1 or 0 (denoting C_1 and C_2, respectively)

Q: Find P(C_1) = π and P(C_2) = 1 - π, together with the parameters of the class-conditional densities p(x|C_k): μ_1, μ_2 and Σ.

Let P(C_1) = π and P(C_2) = 1 - π.
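The likelihood being maximized (the standard form for this model, written out here for completeness) is

p(\mathbf{t}, X \mid \pi, \mu_1, \mu_2, \Sigma) = \prod_{n=1}^{N} \left[ \pi\, \mathcal{N}(x_n|\mu_1, \Sigma) \right]^{t_n} \left[ (1 - \pi)\, \mathcal{N}(x_n|\mu_2, \Sigma) \right]^{1 - t_n}.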
Probabilistic Generative Models: Maximizing the Log Likelihood w.r.t. π, μ_1, μ_2, Σ

\pi = \frac{1}{N} \sum_{n=1}^{N} t_n = \frac{N_1}{N_1 + N_2}

\mu_1 = \frac{1}{N_1} \sum_{n=1}^{N} t_n x_n, \qquad \mu_2 = \frac{1}{N_2} \sum_{n=1}^{N} (1 - t_n) x_n

\Sigma = \frac{N_1}{N} S_1 + \frac{N_2}{N} S_2, \qquad S_k = \frac{1}{N_k} \sum_{n \in C_k} (x_n - \mu_k)(x_n - \mu_k)^T
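A minimal NumPy sketch of these estimators (my own illustration; X is the N×D data matrix, t an integer 0/1 target array):

import numpy as np

def fit_shared_cov_gaussians(X, t):
    # Maximum-likelihood estimates for the shared-covariance Gaussian model.
    N = len(t)
    N1 = t.sum()
    N2 = N - N1
    pi = N1 / N                                 # prior p(C1)
    mu1 = X[t == 1].mean(axis=0)                # class means
    mu2 = X[t == 0].mean(axis=0)
    S1 = np.cov(X[t == 1].T, bias=True)         # per-class MLE covariances S_k
    S2 = np.cov(X[t == 0].T, bias=True)
    Sigma = (N1 / N) * S1 + (N2 / N) * S2       # weighted shared covariance
    return pi, mu1, mu2, Sigma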
Probabilistic Generative Models: Discrete Features

- Discrete feature values x_i ∈ {0, 1}.
- With D inputs, a general class-conditional table grows exponentially with the number of features, to 2^D entries.
- Naive Bayes assumption: treat the features as independent, conditioned on the class C_k:

p(x|C_k) = \prod_{i=1}^{D} \mu_{ki}^{x_i} (1 - \mu_{ki})^{1 - x_i}

\ln p(x|C_k)\,p(C_k) = \sum_{i=1}^{D} \left[ x_i \ln \mu_{ki} + (1 - x_i) \ln(1 - \mu_{ki}) \right] + \ln p(C_k)

- Again linear with respect to the features, as in the continuous case.
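A small NumPy sketch of this Bernoulli naive Bayes model (my own illustration; Laplace smoothing is added to avoid log(0) and is not part of the slide):

import numpy as np

def fit_mu(X, labels, K, alpha=1.0):
    # Estimate mu_ki for each class k, with Laplace smoothing alpha.
    return np.array([(X[labels == k].sum(axis=0) + alpha) /
                     ((labels == k).sum() + 2 * alpha) for k in range(K)])

def log_joint(x, mu, log_prior):
    # a_k = ln p(x|C_k) p(C_k) for a binary feature vector x.
    return (x * np.log(mu) + (1 - x) * np.log(1 - mu)).sum(axis=1) + log_prior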
Bayes Decision Boundaries: 2D
[Figure: Pattern Classification, Duda et al., p. 42]

Bayes Decision Boundaries: 3D
[Figure: Pattern Classification, Duda et al., p. 43]
For Both Gaussian-Distributed and Discrete Inputs

The posterior class probabilities are given by generalized linear models with logistic sigmoid (2-class) or softmax (K-class) activation functions.
Probabilistic Generative Models: Exponential Family

Recall that the Bernoulli, binomial, multinomial and Gaussian distributions can all be expressed in the general exponential-family form

p(x|\lambda_k) = h(x)\, g(\lambda_k) \exp\left\{ \lambda_k^T u(x) \right\},

and the 2-class posterior again takes the form p(C_1|x) = \sigma(a).
Probabilistic Generative Models: Exponential Family

2 classes: logistic function
- Consider the subclass for which u(x) = x. Introducing a scaling parameter s,

p(x|\lambda_k, s) = \frac{1}{s} h\!\left(\frac{1}{s} x\right) g(\lambda_k) \exp\left\{ \frac{1}{s} \lambda_k^T x \right\}.

- Then a(x) is linear in x:

a(x) = (\lambda_1 - \lambda_2)^T x + \ln g(\lambda_1) - \ln g(\lambda_2) + \ln p(C_1) - \ln p(C_2).

K classes: softmax function, again linear with respect to x:

a_k(x) = \lambda_k^T x + \ln g(\lambda_k) + \ln p(C_k), \qquad p(C_k|x) = \frac{\exp(a_k)}{\sum_j \exp(a_j)}.
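As a quick check (my own worked example, not on the slide), the Bernoulli distribution fits this form with u(x) = x:

p(x|\mu) = \mu^x (1 - \mu)^{1 - x} = (1 - \mu) \exp\left\{ x \ln \frac{\mu}{1 - \mu} \right\},

so h(x) = 1, \lambda = \ln \frac{\mu}{1 - \mu} and g(\lambda) = 1 - \mu = \sigma(-\lambda).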
Probabilistic Discriminative Models

- Goal: find p(C_k|x) directly.
- No separate inference step for p(x|C_k).
- Discriminative training: maximize the likelihood defined through p(C_k|x) directly.
- This can improve prediction performance when p(x|C_k) is poorly estimated.
Fixed Basis Functions φ(x)

- Assume a fixed nonlinear transformation: transform the inputs using a vector of basis functions φ(x).
- The resulting decision boundaries, y(x) = w^T φ(x), are linear in the feature space, and generally nonlinear in the original input space.
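For instance (an illustrative choice of basis, not prescribed by the slide), Gaussian basis functions centred on a set of points:

import numpy as np

def gaussian_basis(x, centers, s):
    # phi_j(x) = exp(-||x - mu_j||^2 / (2 s^2)), with a constant bias feature.
    d2 = np.sum((x - centers) ** 2, axis=1)
    return np.concatenate(([1.0], np.exp(-d2 / (2.0 * s ** 2))))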
Posterior Probability of a Class for the 2-Class Problem

p(C_1|\phi) = y(\phi) = \sigma(w^T \phi)

The number of adjustable parameters (M-dimensional feature space, 2 classes):
- Two Gaussian class-conditional densities (generative model):
  - 2M parameters for the means
  - M(M+1)/2 parameters for the (shared) covariance matrix
  - Grows quadratically with M
- Logistic regression (discriminative model):
  - M parameters for w
  - Grows linearly with M
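A quick worked count (my own example, M = 100): 2M + M(M+1)/2 + 1 = 200 + 5050 + 1 = 5251 parameters for the generative model (including the prior), versus M = 100 for logistic regression.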
Determining the Parameters Using Maximum Likelihood

- Write down the likelihood function, then take the negative log likelihood: the cross-entropy error function.
- Recall: the cross entropy between two probability distributions measures the average number of bits needed to identify an event from a set of possibilities if a coding scheme based on a distribution q is used, rather than the "true" distribution p.
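Written out (the standard forms for this model, with y_n = σ(w^T φ_n) and targets t_n ∈ {0, 1}):

p(\mathbf{t}|w) = \prod_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1 - t_n}, \qquad E(w) = -\ln p(\mathbf{t}|w) = -\sum_{n=1}^{N} \left[ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \right].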
The Gradient of the Error Function w.r.t. w

- The contribution of data point n is the error, i.e. the prediction y_n minus the target value t_n, times the basis vector φ_n.
- This has the same form as the gradient for the linear regression model.
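Explicitly (the standard result, stated here):

\nabla E(w) = \sum_{n=1}^{N} (y_n - t_n)\, \phi_n.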
Iterative Reweighted Least Squares

Recall, linear regression models in Ch. 3:
- The ML solution under a Gaussian noise assumption has a closed form, as a consequence of the quadratic dependence of the log likelihood on the parameter w.

Logistic regression model:
- No longer a closed-form solution.
- But the error function is convex and has a unique minimum, so an efficient iterative technique can be used.
- The Newton-Raphson update to minimize a function E(w):

w^{(\text{new})} = w^{(\text{old})} - H^{-1} \nabla E(w),

where H is the Hessian matrix, the matrix of second derivatives of E(w).
Iterative Reweighted Least Squares (Cont'd)

CASE 1: Sum-of-squares error function (linear regression)
- The Newton-Raphson update gives the exact closed-form least-squares solution in a single step.

CASE 2: Cross-entropy error function (logistic regression)
- The Newton-Raphson update becomes

w^{(\text{new})} = (\Phi^T R \Phi)^{-1} \Phi^T R z, \qquad z = \Phi w^{(\text{old})} - R^{-1}(y - t),

where R is the diagonal matrix with R_{nn} = y_n(1 - y_n). Since R depends on w, the update is applied repeatedly: iterative reweighted least squares (IRLS).
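A compact NumPy sketch of IRLS for logistic regression (my own illustration; Phi is the N×M design matrix, t the 0/1 targets):

import numpy as np

def irls_logistic(Phi, t, n_iter=20):
    N, M = Phi.shape
    w = np.zeros(M)
    for _ in range(n_iter):
        y = 1.0 / (1.0 + np.exp(-Phi @ w))       # y_n = sigma(w^T phi_n)
        R = y * (1.0 - y)                        # diagonal of R, kept as a vector
        grad = Phi.T @ (y - t)                   # gradient of the cross-entropy error
        H = Phi.T @ (R[:, None] * Phi)           # Hessian: Phi^T R Phi
        H += 1e-8 * np.eye(M)                    # tiny ridge for numerical safety
        w = w - np.linalg.solve(H, grad)         # Newton-Raphson step
    return w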
Multiclass Logistic Regression

Posterior probability for multiclass classification:

p(C_k|\phi) = y_k(\phi) = \frac{\exp(a_k)}{\sum_j \exp(a_j)}, \qquad a_k = w_k^T \phi.

We can use maximum likelihood to determine the parameters directly:
- Likelihood function using the 1-of-K coding scheme
- Cross-entropy error function for the multiclass classification problem
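Written out (the standard forms, with t_nk the 1-of-K targets and y_nk = y_k(φ_n)):

p(T|w_1, \ldots, w_K) = \prod_{n=1}^{N} \prod_{k=1}^{K} y_{nk}^{t_{nk}}, \qquad E(w_1, \ldots, w_K) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk} \ln y_{nk}.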
Multiclass Logistic Regression (Cont'd)

The derivative of the error function:

\nabla_{w_j} E = \sum_{n=1}^{N} (y_{nj} - t_{nj})\, \phi_n

- Same form: the product of the error and the basis function.

The Hessian matrix:
- The IRLS algorithm can also be used for batch processing.
Generalized Linear Models

- Recall: for a broad range of class-conditional distributions, described by the exponential family, the resulting posterior class probabilities are given by a logistic (or softmax) transformation acting on a linear function of the feature variables.
- However, this is not the case for all choices of class-conditional density.
- It might be worth exploring other types of discriminative probabilistic model.
Generalized Linear Model: 2 Classes

For example: for each input we evaluate a_n = w^T φ_n and set the target to t_n = 1 if a_n ≥ θ, and to t_n = 0 otherwise, where θ is a threshold value.
Noisy Threshold Model

The corresponding activation function, when θ is drawn from a density p(θ) (here a mixture of Gaussians), is the cumulative distribution function f(a) = \int_{-\infty}^{a} p(\theta)\, d\theta.
Probit Function

When p(θ) is a zero-mean, unit-variance Gaussian, the activation function is the probit function

\Phi(a) = \int_{-\infty}^{a} \mathcal{N}(\theta|0, 1)\, d\theta,

which has a sigmoidal shape. The generalized linear model based on a probit activation function is known as probit regression.
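A one-line Python sketch of the probit function via the error function (my own illustration):

from math import erf, sqrt

def probit(a):
    # Phi(a) = 0.5 * (1 + erf(a / sqrt(2))), the standard Normal CDF.
    return 0.5 * (1.0 + erf(a / sqrt(2.0)))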
Canonical Link Functions

- Recall: if we take the derivative of the error function w.r.t. the parameter vector w, it takes the form of the error times the feature vector, for
  - the logistic regression model with sigmoid activation function, and
  - the logistic regression model with softmax activation function.
- This is a general result of assuming a conditional distribution for the target variable from the exponential family, along with a corresponding choice for the activation function known as the canonical link function.
Canonical Link Functions (Cont'd)

Consider conditional distributions of the target variable drawn from the exponential family:
- Write down the log likelihood.
- Take its derivative with respect to w.
- Choosing the activation function to be the corresponding canonical link function, the derivative again reduces to the error times the feature vector (see below).
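The end result (stated here without the intermediate steps): for an exponential-family target distribution with scale parameter s and the canonical link as activation function,

\nabla_w E(w) = \frac{1}{s} \sum_{n=1}^{N} (y_n - t_n)\, \phi_n.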
The Laplace Approximation

- Goal: find a Gaussian approximation to a non-Gaussian density, centered on the mode z_0 of the distribution.
- Suppose p(z) = (1/Z) f(z), non-Gaussian, with Z unknown.
- Taylor-expand the logarithm of the target function around the mode z_0:

\ln f(z) \simeq \ln f(z_0) - \frac{1}{2} (z - z_0)^T A (z - z_0), \qquad A = -\nabla\nabla \ln f(z)\big|_{z = z_0}.

- Resulting approximate Gaussian distribution:

q(z) = \mathcal{N}(z|z_0, A^{-1}).
Laplace Approximation for p(z) ∝ exp(-z^2/2) σ(20z + 4)

- Left: the normalized distribution p(z) (yellow), together with the Laplace approximation centred on the mode z_0 of p(z) (red).
- Right: the negative logarithms of the corresponding curves.
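A rough numerical sketch of the 1-D Laplace approximation for this example density (my own illustration; the mode and curvature are found on a grid by finite differences):

import numpy as np

def laplace_approx_1d(log_f, z_grid):
    # Locate the mode on a grid, then estimate A = -d^2/dz^2 log f(z) at the mode.
    vals = log_f(z_grid)
    i = np.argmax(vals)
    h = z_grid[1] - z_grid[0]
    A = -(vals[i + 1] - 2 * vals[i] + vals[i - 1]) / h ** 2
    return z_grid[i], 1.0 / A              # mode z0 and variance A^{-1}

# Example density from the slide: f(z) = exp(-z^2/2) * sigmoid(20 z + 4)
log_f = lambda z: -0.5 * z ** 2 - np.log1p(np.exp(-(20 * z + 4)))
z0, var = laplace_approx_1d(log_f, np.linspace(-3.0, 3.0, 2001))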
Model Comparison and BIC

- The Laplace approximation also gives an approximation to the normalization constant Z. This result can be used to obtain an approximation to the model evidence, which plays a central role in Bayesian model comparison.
- Consider a set of models {M_i} having parameters {θ_i}. The log of the model evidence can then be approximated as

\ln p(D) \simeq \ln p(D|\theta_{\text{MAP}}) + \ln p(\theta_{\text{MAP}}) + \frac{M}{2} \ln(2\pi) - \frac{1}{2} \ln |A|,

where A is the Hessian of the negative log posterior at θ_MAP and M is the number of parameters.
- Further approximation, under some additional assumptions, gives the Bayesian Information Criterion (BIC):

\ln p(D) \simeq \ln p(D|\theta_{\text{MAP}}) - \frac{1}{2} M \ln N.
Bayesian Logistic Regression

- Exact Bayesian inference for logistic regression is intractable.
- Gaussian prior: p(w) = \mathcal{N}(w|m_0, S_0)
- Posterior: p(w|t) ∝ p(w) p(t|w)
- Log of the posterior:

\ln p(w|t) = -\frac{1}{2} (w - m_0)^T S_0^{-1} (w - m_0) + \sum_{n=1}^{N} \left[ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \right] + \text{const}

- Laplace approximation of the posterior distribution: q(w) = \mathcal{N}(w|w_{\text{MAP}}, S_N), with S_N^{-1} = S_0^{-1} + \sum_n y_n (1 - y_n)\, \phi_n \phi_n^T.
Predictive Distribution

- The predictive distribution is obtained by marginalizing with respect to the posterior distribution p(w|t), which is approximated by the Gaussian q(w):

p(C_1|\phi, \mathbf{t}) = \int p(C_1|\phi, w)\, p(w|\mathbf{t})\, dw \simeq \int \sigma(w^T \phi)\, q(w)\, dw

- Here a = w^T φ is a marginal of a Gaussian and is therefore also Gaussian, with mean \mu_a = w_{\text{MAP}}^T \phi and variance \sigma_a^2 = \phi^T S_N \phi.
Predictive Distribution (Cont'd)

- The resulting approximation to the predictive distribution is

p(C_1|\mathbf{t}) \simeq \int \sigma(a)\, \mathcal{N}(a|\mu_a, \sigma_a^2)\, da.

- To integrate over a, we make use of the close similarity between the logistic sigmoid function and the probit function: σ(a) ≈ Φ(λa) with λ^2 = π/8. Then

\int \sigma(a)\, \mathcal{N}(a|\mu_a, \sigma_a^2)\, da \simeq \sigma\!\left( \kappa(\sigma_a^2)\, \mu_a \right), \qquad \kappa(\sigma^2) = \left( 1 + \pi \sigma^2 / 8 \right)^{-1/2}.

- Finally we get

p(C_1|\phi, \mathbf{t}) \simeq \sigma\!\left( \kappa(\sigma_a^2)\, \mu_a \right).
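A short NumPy sketch of this approximate predictive probability (my own illustration; it assumes w_MAP and S_N from the Laplace approximation above):

import numpy as np

def bayes_logreg_predict(phi, w_map, S_N):
    # p(C1|phi, t) ~= sigma(kappa(sigma_a^2) * mu_a), kappa(s2) = (1 + pi*s2/8)^(-1/2)
    mu_a = w_map @ phi
    s2_a = phi @ S_N @ phi
    kappa = 1.0 / np.sqrt(1.0 + np.pi * s2_a / 8.0)
    return 1.0 / (1.0 + np.exp(-kappa * mu_a))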