Lecture Notes for CMPUT 466/551
Nilanjan Ray
• What is meant by linear classification?
– The decision boundaries in the feature (input) space are linear
• Should the regions be contiguous?
[Figure: piecewise linear decision boundaries in 2D input space, with regions $R_1$–$R_4$ over axes $X_1$, $X_2$]
• There is a discriminant function $\delta_k(x)$ for each class $k$
• Classification rule: $R_k = \{x : k = \arg\max_j \delta_j(x)\}$
• In higher-dimensional space the decision boundaries are piecewise hyperplanar
• Remember that the 0-1 loss function led to the classification rule: $R_k = \{x : k = \arg\max_j \Pr(G = j \mid X = x)\}$
• So, $\Pr(G = k \mid X = x)$ can serve as $\delta_k(x)$
• All we require here is that the class boundaries $\{x : \delta_k(x) = \delta_j(x)\}$ be linear for every $(k, j)$ pair
• One can achieve this if the $\delta_k(x)$ themselves are linear, or if any monotone transform of $\delta_k(x)$ is linear
– An example:
$$P(G = 1 \mid X = x) = \frac{\exp(\beta_0 + \beta^T x)}{1 + \exp(\beta_0 + \beta^T x)}, \qquad P(G = 2 \mid X = x) = \frac{1}{1 + \exp(\beta_0 + \beta^T x)}$$
So that
$$\log\left[\frac{P(G = 1 \mid X = x)}{P(G = 2 \mid X = x)}\right] = \beta_0 + \beta^T x \quad \text{(linear)}$$
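A quick numerical check of this linearity; a minimal sketch with made-up values of $\beta_0$ and $\beta$ (not from the lecture):

```python
import numpy as np

# Hypothetical parameters (not from the lecture) for a 2-feature problem.
beta0, beta = -0.5, np.array([1.2, -0.7])

def posteriors(x):
    """Two-class posteriors of the logistic form shown above."""
    eta = beta0 + beta @ x          # linear function of x
    p1 = np.exp(eta) / (1.0 + np.exp(eta))
    p2 = 1.0 / (1.0 + np.exp(eta))
    return p1, p2

x = np.array([0.3, 2.0])
p1, p2 = posteriors(x)
# The log-odds recovers the linear function beta0 + beta^T x exactly.
print(np.log(p1 / p2), beta0 + beta @ x)
```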
Linear Classification as a Linear Regression

2D input space: $X = (X_1, X_2)$
Number of classes/categories $K = 3$, so the output is $Y = (Y_1, Y_2, Y_3)$
Training sample of size $N = 5$:
$$\mathbf{X} = \begin{pmatrix} 1 & x_{11} & x_{12} \\ 1 & x_{21} & x_{22} \\ 1 & x_{31} & x_{32} \\ 1 & x_{41} & x_{42} \\ 1 & x_{51} & x_{52} \end{pmatrix}, \qquad \mathbf{Y} = \begin{pmatrix} y_{11} & y_{12} & y_{13} \\ y_{21} & y_{22} & y_{23} \\ y_{31} & y_{32} & y_{33} \\ y_{41} & y_{42} & y_{43} \\ y_{51} & y_{52} & y_{53} \end{pmatrix}$$
Each row of $\mathbf{Y}$ has exactly one 1 indicating the category/class (indicator matrix).

Regression output:
$$\hat{Y}((x_1, x_2)) = (1 \; x_1 \; x_2)\,(\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{Y} = (1 \; x_1 \; x_2)\,(\hat\beta_1 \; \hat\beta_2 \; \hat\beta_3)$$
Or,
$$\hat{Y}_1((x_1, x_2)) = (1 \; x_1 \; x_2)\,\hat\beta_1, \quad \hat{Y}_2((x_1, x_2)) = (1 \; x_1 \; x_2)\,\hat\beta_2, \quad \hat{Y}_3((x_1, x_2)) = (1 \; x_1 \; x_2)\,\hat\beta_3$$
Classification rule:
$$\hat{G}((x_1, x_2)) = \arg\max_k \hat{Y}_k((x_1, x_2))$$
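A minimal numpy sketch of this indicator-matrix regression; the toy data below are invented for illustration and are not the lecture's example:

```python
import numpy as np

# Toy training set: N=5 points in 2D, class labels in {0, 1, 2} (K=3).
X_raw = np.array([[0.1, 0.2], [1.0, 1.1], [2.0, 0.1], [0.2, 2.1], [2.2, 2.0]])
g = np.array([0, 1, 2, 1, 2])

N, K = X_raw.shape[0], 3
X = np.hstack([np.ones((N, 1)), X_raw])       # prepend the constant-1 column
Y = np.zeros((N, K)); Y[np.arange(N), g] = 1  # indicator (one-hot) matrix

# Least-squares coefficients, one column per class: B = (X^T X)^{-1} X^T Y
B = np.linalg.solve(X.T @ X, X.T @ Y)

def classify(x1, x2):
    yhat = np.array([1.0, x1, x2]) @ B        # \hat{Y}_k for k = 1..K
    return int(np.argmax(yhat))               # arg max_k rule

print(classify(1.9, 0.2))   # likely class 2, nearest to that cluster in the toy data
```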
Linear regression of the indicator matrix can lead to masking

[Figure: 2D input space with three classes; the fitted $\hat{Y}_k((x_1, x_2)) = (1 \; x_1 \; x_2)\,\hat\beta_k$ are shown along a viewing direction, and the middle class is masked]

LDA can avoid this masking
LDA

Essentially the minimum-error Bayes' classifier
Assumes that the class conditional densities are (multivariate) Gaussian
Assumes equal covariance for every class

Posterior probability (application of Bayes' rule):
$$\Pr(G = k \mid X = x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}$$
$\pi_k$ is the prior probability for class $k$; $f_k(x)$ is the class conditional density (likelihood):
$$f_k(x) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k)\right)$$
LDA…
$$\log\frac{\Pr(G = k \mid X = x)}{\Pr(G = l \mid X = x)} = \log\frac{\pi_k}{\pi_l} + \log\frac{f_k(x)}{f_l(x)} = \left(\log\pi_k + x^T\Sigma^{-1}\mu_k - \tfrac{1}{2}\mu_k^T\Sigma^{-1}\mu_k\right) - \left(\log\pi_l + x^T\Sigma^{-1}\mu_l - \tfrac{1}{2}\mu_l^T\Sigma^{-1}\mu_l\right) = \delta_k(x) - \delta_l(x)$$
Classification rule:
$$\hat{G}(x) = \arg\max_k \delta_k(x) \quad \text{is equivalent to} \quad \hat{G}(x) = \arg\max_k \Pr(G = k \mid X = x)$$
The good old Bayes classifier!
When are we going to use the training data?

Training data: $(g_i, x_i),\ i = 1{:}N$; total $N$ input-output pairs; $N_k$ is the number of pairs in class $k$; total number of classes is $K$.
The training data are utilized to estimate:
Prior probabilities: $\hat\pi_k = N_k / N$
Means: $\hat\mu_k = \sum_{g_i = k} x_i / N_k$
Covariance matrix: $\hat\Sigma = \sum_{k=1}^{K} \sum_{g_i = k} (x_i - \hat\mu_k)(x_i - \hat\mu_k)^T / (N - K)$
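A minimal numpy sketch of these LDA estimates and the discriminant $\delta_k(x)$; the toy data at the bottom are assumed for illustration, not from the lecture:

```python
import numpy as np

def lda_fit(X, g, K):
    """Estimate priors, class means, and the pooled covariance (divisor N - K)."""
    N, p = X.shape
    priors = np.array([np.mean(g == k) for k in range(K)])
    means = np.array([X[g == k].mean(axis=0) for k in range(K)])
    Sigma = np.zeros((p, p))
    for k in range(K):
        D = X[g == k] - means[k]
        Sigma += D.T @ D
    Sigma /= (N - K)
    return priors, means, Sigma

def lda_predict(x, priors, means, Sigma):
    """delta_k(x) = log pi_k + x^T Sigma^{-1} mu_k - 0.5 mu_k^T Sigma^{-1} mu_k."""
    Sinv = np.linalg.inv(Sigma)
    deltas = [np.log(pk) + x @ Sinv @ mk - 0.5 * mk @ Sinv @ mk
              for pk, mk in zip(priors, means)]
    return int(np.argmax(deltas))

# Toy usage (labels 0..K-1):
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1, (20, 2)), rng.normal([3, 3], 1, (20, 2))])
g = np.array([0] * 20 + [1] * 20)
priors, means, Sigma = lda_fit(X, g, K=2)
print(lda_predict(np.array([2.5, 2.5]), priors, means, Sigma))  # likely class 1
```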
[Figure: LDA was able to avoid the masking here]
QDA
• Relaxes the equal-covariance assumption: the class conditional probability densities (still multivariate Gaussians) are allowed to have different covariance matrices
• The class decision boundaries are no longer linear but quadratic:
$$\log\frac{\Pr(G = k \mid X = x)}{\Pr(G = l \mid X = x)} = \log\frac{\pi_k}{\pi_l} + \log\frac{f_k(x)}{f_l(x)} = \left(\log\pi_k - \tfrac{1}{2}(x - \mu_k)^T\Sigma_k^{-1}(x - \mu_k) - \tfrac{1}{2}\log|\Sigma_k|\right) - \left(\log\pi_l - \tfrac{1}{2}(x - \mu_l)^T\Sigma_l^{-1}(x - \mu_l) - \tfrac{1}{2}\log|\Sigma_l|\right) = \delta_k(x) - \delta_l(x)$$
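A minimal sketch of the QDA discriminant, reusing the toy-data conventions above; note that np.cov's default unbiased divisor $N_k - 1$ is used for the per-class covariances, a small assumption on my part:

```python
import numpy as np

def qda_fit(X, g, K):
    """Per-class priors, means, and covariance matrices (no pooling)."""
    priors = np.array([np.mean(g == k) for k in range(K)])
    means = [X[g == k].mean(axis=0) for k in range(K)]
    covs = [np.cov(X[g == k].T) for k in range(K)]   # per-class covariance
    return priors, means, covs

def qda_delta(x, prior, mean, cov):
    """delta_k(x) = log pi_k - 0.5 (x-mu_k)^T Sigma_k^{-1} (x-mu_k) - 0.5 log|Sigma_k|"""
    d = x - mean
    return (np.log(prior)
            - 0.5 * d @ np.linalg.solve(cov, d)
            - 0.5 * np.log(np.linalg.det(cov)))

def qda_predict(x, priors, means, covs):
    return int(np.argmax([qda_delta(x, p, m, c)
                          for p, m, c in zip(priors, means, covs)]))
```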
QDA and Masking
Better than linear regression of the indicator matrix at handling masking
Usually computationally more expensive than LDA
Fisher's Linear Discriminant
[DHS]
From the training set we want to find a direction along which the separation between the class means is high and the overlap between the classes is small
Projection of a vector $x$ on a unit vector $w$: $w^T x$
[Figure: geometric interpretation of the projection $w^T x$ of $x$ onto $w$]
From the training set we want to find a direction $w$ along which the separation between the projections of the class means is high and the overlap between the projections of the classes is small
Class means:
$$m_1 = \frac{1}{N_1}\sum_{x_i \in R_1} x_i, \qquad m_2 = \frac{1}{N_2}\sum_{x_i \in R_2} x_i$$
Projected class means:
$$\tilde{m}_1 = \frac{1}{N_1}\sum_{x_i \in R_1} w^T x_i = w^T m_1, \qquad \tilde{m}_2 = \frac{1}{N_2}\sum_{x_i \in R_2} w^T x_i = w^T m_2$$
Difference between projected class means:
$$\tilde{m}_2 - \tilde{m}_1 = w^T (m_2 - m_1)$$
Scatter of the projected data (this indicates the overlap between the classes):
$$\tilde{s}_1^2 + \tilde{s}_2^2 = \sum_{y_i : x_i \in R_1} (y_i - \tilde{m}_1)^2 + \sum_{y_i : x_i \in R_2} (y_i - \tilde{m}_2)^2 = \sum_{x_i \in R_1} (w^T x_i - w^T m_1)^2 + \sum_{x_i \in R_2} (w^T x_i - w^T m_2)^2$$
$$= w^T \Big[\sum_{x_i \in R_1} (x_i - m_1)(x_i - m_1)^T\Big] w + w^T \Big[\sum_{x_i \in R_2} (x_i - m_2)(x_i - m_2)^T\Big] w = w^T S_1 w + w^T S_2 w$$
Fisher's LD…
Ratio of the (squared) difference of the projected means over the total scatter:
$$r(w) = \frac{(\tilde{m}_2 - \tilde{m}_1)^2}{\tilde{s}_1^2 + \tilde{s}_2^2} = \frac{w^T S_B w}{w^T S_W w}$$
a Rayleigh quotient, where
$$S_W = S_1 + S_2, \qquad S_B = (m_2 - m_1)(m_2 - m_1)^T$$
We want to maximize $r(w)$. The solution is
$$w \propto S_W^{-1}(m_2 - m_1)$$
So far so good. However, how do we get the classifier?
All we know at this point is that the direction $w = S_W^{-1}(m_2 - m_1)$ separates the projected data very well.
Since we know that the projected class means are well separated, we can choose the average of the two projected means as a threshold for classification.
Classification rule: $x \in R_2$ if $y(x) > 0$, else $x \in R_1$, where
$$y(x) = w^T x - \tfrac{1}{2}(\tilde{m}_1 + \tilde{m}_2) = w^T x - \tfrac{1}{2} w^T (m_1 + m_2) = \big(S_W^{-1}(m_2 - m_1)\big)^T \Big(x - \tfrac{1}{2}(m_1 + m_2)\Big)$$
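A minimal numpy sketch of this two-class Fisher discriminant and its midpoint-threshold rule; the toy data are assumed, not from the lecture:

```python
import numpy as np

def fisher_ld(X1, X2):
    """Return (w, y) where w = S_W^{-1}(m2 - m1) and y(x) is the rule above."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - m1).T @ (X1 - m1)           # within-class scatter of class 1
    S2 = (X2 - m2).T @ (X2 - m2)           # within-class scatter of class 2
    w = np.linalg.solve(S1 + S2, m2 - m1)  # S_W^{-1} (m2 - m1)
    def y(x):
        return w @ (x - 0.5 * (m1 + m2))   # threshold at midpoint of projected means
    return w, y

# Toy usage: decide R2 if y(x) > 0, else R1.
rng = np.random.default_rng(1)
X1 = rng.normal([0, 0], 1, (30, 2))
X2 = rng.normal([4, 1], 1, (30, 2))
w, y = fisher_ld(X1, X2)
print("class R2" if y(np.array([3.5, 1.0])) > 0 else "class R1")
```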
Fisher's LD and LDA become the same when:
(1) The prior probabilities are the same
(2) There is a common covariance matrix for the class conditional densities
(3) Both class conditional densities are multivariate Gaussian
Ex. Show that Fisher's LD classifier and LDA produce the same classification rule under the above assumptions.
Note: (1) Fisher's LD does not assume Gaussian densities
(2) Fisher's LD can be used for dimension reduction in a multiple-class scenario
Logistic Regression
• The output of the regression is the posterior probability, i.e., Pr(output | input)
• Always ensures that the outputs sum to 1 and each output is non-negative
• A linear classification method
• We need to know about two concepts to understand logistic regression
– Newton-Raphson method
– Maximum likelihood estimation
Newton-Raphson Method
A technique for solving a non-linear equation $f(x) = 0$.
Taylor series: $f(x_{n+1}) \approx f(x_n) + (x_{n+1} - x_n)\,f'(x_n)$
After rearrangement:
$$x_{n+1} = x_n + \frac{f(x_{n+1}) - f(x_n)}{f'(x_n)}$$
If $x_{n+1}$ is a root or very close to the root, then $f(x_{n+1}) \approx 0$, so:
$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)} \qquad \text{(rule for iteration)}$$
Need an initial guess $x_0$.
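A minimal sketch of this 1-D iteration; the example function and starting point are chosen for illustration:

```python
def newton_raphson(f, fprime, x0, tol=1e-10, max_iter=100):
    """Iterate x_{n+1} = x_n - f(x_n)/f'(x_n) from an initial guess x0."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / fprime(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Example: root of f(x) = x^2 - 2 (i.e., sqrt(2)), starting from x0 = 1.
print(newton_raphson(lambda x: x * x - 2.0, lambda x: 2.0 * x, x0=1.0))
```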
Newton-Raphson in Multi-dimensions
We want to solve the equations:
$$f_1(x_1, x_2, \ldots, x_N) = 0,\quad f_2(x_1, x_2, \ldots, x_N) = 0,\quad \ldots,\quad f_N(x_1, x_2, \ldots, x_N) = 0$$
Taylor series:
$$f_j(\mathbf{x} + \Delta\mathbf{x}) \approx f_j(\mathbf{x}) + \sum_{k=1}^{N} \frac{\partial f_j}{\partial x_k}\,\Delta x_k, \qquad j = 1, \ldots, N$$
After some rearrangement, the rule for iteration (need an initial guess) is:
$$\begin{pmatrix} x_1^{n+1} \\ x_2^{n+1} \\ \vdots \\ x_N^{n+1} \end{pmatrix} = \begin{pmatrix} x_1^{n} \\ x_2^{n} \\ \vdots \\ x_N^{n} \end{pmatrix} - \begin{pmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \cdots & \frac{\partial f_1}{\partial x_N} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_2}{\partial x_N} \\ \vdots & & & \vdots \\ \frac{\partial f_N}{\partial x_1} & \frac{\partial f_N}{\partial x_2} & \cdots & \frac{\partial f_N}{\partial x_N} \end{pmatrix}^{-1} \begin{pmatrix} f_1(x_1^n, x_2^n, \ldots, x_N^n) \\ f_2(x_1^n, x_2^n, \ldots, x_N^n) \\ \vdots \\ f_N(x_1^n, x_2^n, \ldots, x_N^n) \end{pmatrix}$$
The matrix of partial derivatives is the Jacobian matrix.
Solve:
$$f_1(x_1, x_2) = x_1^2 + \cos(x_2) = 0, \qquad f_2(x_1, x_2) = \sin(x_1) + x_1^2 x_2^3 = 0$$
Iteration rule (needs an initial guess):
$$\begin{pmatrix} x_1^{n+1} \\ x_2^{n+1} \end{pmatrix} = \begin{pmatrix} x_1^{n} \\ x_2^{n} \end{pmatrix} - \begin{pmatrix} 2x_1^n & -\sin(x_2^n) \\ \cos(x_1^n) + 2x_1^n (x_2^n)^3 & 3(x_1^n)^2 (x_2^n)^2 \end{pmatrix}^{-1} \begin{pmatrix} (x_1^n)^2 + \cos(x_2^n) \\ \sin(x_1^n) + (x_1^n)^2 (x_2^n)^3 \end{pmatrix}$$
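A minimal numpy sketch of this 2-D iteration, assuming the system as reconstructed above ($f_1 = x_1^2 + \cos x_2$, $f_2 = \sin x_1 + x_1^2 x_2^3$) and an arbitrarily chosen initial guess:

```python
import numpy as np

def F(x):
    x1, x2 = x
    return np.array([x1**2 + np.cos(x2),
                     np.sin(x1) + x1**2 * x2**3])

def J(x):
    x1, x2 = x
    return np.array([[2 * x1,                       -np.sin(x2)],
                     [np.cos(x1) + 2 * x1 * x2**3,   3 * x1**2 * x2**2]])

x = np.array([1.0, 2.0])               # initial guess (assumed, not from the slide)
for _ in range(50):
    step = np.linalg.solve(J(x), F(x))
    x = x - step
    if np.linalg.norm(step) < 1e-12:
        break
print(x, F(x))                          # F(x) should be close to zero at convergence
```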
Maximum Likelihood Parameter Estimation
Let's start with an example. We want to find the unknown parameters, the mean and standard deviation, of a Gaussian pdf, given $N$ independent samples from it.
$$p(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$
Samples: $x_1, \ldots, x_N$
Form the likelihood function:
$$L(\mu, \sigma) = \prod_{i=1}^{N} \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$$
Estimate the parameters that maximize the likelihood function:
$$(\hat\mu, \hat\sigma) = \arg\max_{\mu, \sigma} L(\mu, \sigma)$$
Let's find $(\hat\mu, \hat\sigma)$.
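A minimal numpy sketch of these MLEs, using the well-known closed forms (sample mean, and the standard deviation with divisor $N$), checked against a crude numerical maximization of the log-likelihood; the data are simulated for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=3.0, scale=1.5, size=500)     # N independent Gaussian samples

# Closed-form maximum likelihood estimates.
mu_hat = x.mean()
sigma_hat = np.sqrt(np.mean((x - mu_hat) ** 2))  # note divisor N, not N-1

def log_likelihood(mu, sigma):
    """log L(mu, sigma) for the Gaussian samples above."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (x - mu) ** 2 / (2 * sigma**2))

# A grid search over (mu, sigma) should peak near (mu_hat, sigma_hat).
grid_mu = np.linspace(2.5, 3.5, 101)
grid_sigma = np.linspace(1.0, 2.0, 101)
ll = np.array([[log_likelihood(m, s) for s in grid_sigma] for m in grid_mu])
i, j = np.unravel_index(np.argmax(ll), ll.shape)
print(mu_hat, sigma_hat, grid_mu[i], grid_sigma[j])
```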
Logistic Regression Model
The method directly models the posterior probabilities as the output of the regression:
$$\Pr(G = k \mid X = x) = \frac{\exp(\beta_{k0} + \beta_k^T x)}{1 + \sum_{l=1}^{K-1} \exp(\beta_{l0} + \beta_l^T x)}, \qquad k = 1, \ldots, K-1$$
$$\Pr(G = K \mid X = x) = \frac{1}{1 + \sum_{l=1}^{K-1} \exp(\beta_{l0} + \beta_l^T x)}$$
$x$ is a $p$-dimensional input vector; $\beta_k$ is a $p$-dimensional vector for each $k$.
Total number of parameters is $(K-1)(p+1)$.
Note that the class boundaries are linear.
How can we show this linear nature?
What is the discriminant function for every class in this model?
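A minimal sketch of these posteriors for $K$ classes, with arbitrary parameters assumed for illustration; it also shows that the log-odds against the reference class $K$ is exactly the linear score $\beta_{k0} + \beta_k^T x$:

```python
import numpy as np

def posteriors(x, B):
    """B has K-1 rows, each [beta_k0, beta_k]; returns Pr(G=k|X=x) for k=1..K."""
    eta = B[:, 0] + B[:, 1:] @ x                 # K-1 linear scores
    denom = 1.0 + np.sum(np.exp(eta))
    probs = np.append(np.exp(eta), 1.0) / denom  # classes 1..K-1, then class K
    return probs

# Hypothetical parameters for K=3 classes, p=2 inputs (not from the lecture).
B = np.array([[0.2, 1.0, -1.0],
              [-0.3, 0.5, 0.8]])
x = np.array([1.5, -0.5])
p = posteriors(x, B)
# The posteriors sum to 1, and log(p_1/p_K) equals the linear score for class 1.
print(p.sum(), np.log(p[0] / p[2]), B[0, 0] + B[0, 1:] @ x)
```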
Logistic Regression Computation
Let's fit the logistic regression model for $K = 2$, i.e., the number of classes is 2.
Training set: $(x_i, g_i),\ i = 1, \ldots, N$
Log-likelihood:
$$l(\beta) = \sum_{i=1}^{N} \log \Pr(G = y_i \mid X = x_i) = \sum_{i=1}^{N} \big[ y_i \log \Pr(G = 1 \mid X = x_i) + (1 - y_i) \log \Pr(G = 2 \mid X = x_i) \big]$$
$$= \sum_{i=1}^{N} \left[ y_i \beta^T x_i - \log\big(1 + \exp(\beta^T x_i)\big) \right]$$
Here the $x_i$ are $(p+1)$-dimensional input vectors with leading entry 1; $\beta$ is a $(p+1)$-dimensional vector; $y_i = 1$ if $g_i = 1$ and $y_i = 0$ if $g_i = 2$.
We want to maximize the log-likelihood in order to estimate $\beta$.
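A minimal numpy sketch of this log-likelihood on toy data (assumed, not from the lecture); a numerically safer form of $\log(1 + e^t)$ is used:

```python
import numpy as np

def log_likelihood(beta, X, y):
    """l(beta) = sum_i [ y_i beta^T x_i - log(1 + exp(beta^T x_i)) ].

    X has a leading column of 1s; y is 0/1 coded as on the slide."""
    t = X @ beta
    return np.sum(y * t - np.logaddexp(0.0, t))  # logaddexp(0,t) = log(1+e^t), stable

# Toy usage:
rng = np.random.default_rng(3)
X = np.hstack([np.ones((50, 1)), rng.normal(size=(50, 2))])
y = (rng.random(50) < 0.5).astype(float)
print(log_likelihood(np.zeros(3), X, y))   # equals -N*log(2) when beta = 0
```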
Newton-Raphson for LR
$$\frac{\partial l(\beta)}{\partial \beta} = \sum_{i=1}^{N} \left( y_i - \frac{\exp(\beta^T x_i)}{1 + \exp(\beta^T x_i)} \right) x_i = 0$$
These are $(p+1)$ non-linear equations to solve for $(p+1)$ unknowns.
Solve by the Newton-Raphson method:
$$\beta^{\text{new}} = \beta^{\text{old}} - \left[ \text{Jacobian}\!\left( \frac{\partial l(\beta)}{\partial \beta} \right) \right]^{-1} \frac{\partial l(\beta)}{\partial \beta}, \quad \text{where} \quad \text{Jacobian}\!\left( \frac{\partial l(\beta)}{\partial \beta} \right) = -\sum_{i=1}^{N} x_i x_i^T \, \frac{\exp(\beta^T x_i)}{1 + \exp(\beta^T x_i)} \cdot \frac{1}{1 + \exp(\beta^T x_i)}$$
In matrix notation:
$$\frac{\partial l(\beta)}{\partial \beta} = \sum_{i=1}^{N} \left( y_i - \frac{\exp(\beta^T x_i)}{1 + \exp(\beta^T x_i)} \right) x_i = \mathbf{X}^T(\mathbf{y} - \mathbf{p}), \qquad \text{Jacobian}\!\left( \frac{\partial l(\beta)}{\partial \beta} \right) = -\mathbf{X}^T \mathbf{W} \mathbf{X}$$
So the NR rule becomes:
$$\beta^{\text{new}} = \beta^{\text{old}} + (\mathbf{X}^T \mathbf{W} \mathbf{X})^{-1} \mathbf{X}^T (\mathbf{y} - \mathbf{p}),$$
where
$$\mathbf{X} = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{pmatrix}_{N \times (p+1)}, \qquad \mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}_{N \times 1}, \qquad \mathbf{p} = \begin{pmatrix} \exp(\beta^T x_1) / \big(1 + \exp(\beta^T x_1)\big) \\ \exp(\beta^T x_2) / \big(1 + \exp(\beta^T x_2)\big) \\ \vdots \\ \exp(\beta^T x_N) / \big(1 + \exp(\beta^T x_N)\big) \end{pmatrix}_{N \times 1},$$
and $\mathbf{W}$ is an $N \times N$ diagonal matrix with $i$th diagonal entry
$$\frac{\exp(\beta^T x_i)}{1 + \exp(\beta^T x_i)} \left( 1 - \frac{\exp(\beta^T x_i)}{1 + \exp(\beta^T x_i)} \right).$$
• Newton-Raphson:
$$\beta^{\text{new}} = \beta^{\text{old}} + (\mathbf{X}^T \mathbf{W} \mathbf{X})^{-1} \mathbf{X}^T (\mathbf{y} - \mathbf{p}) = (\mathbf{X}^T \mathbf{W} \mathbf{X})^{-1} \mathbf{X}^T \mathbf{W} \big( \mathbf{X}\beta^{\text{old}} + \mathbf{W}^{-1}(\mathbf{y} - \mathbf{p}) \big) = (\mathbf{X}^T \mathbf{W} \mathbf{X})^{-1} \mathbf{X}^T \mathbf{W} \mathbf{z}$$
– Adjusted response: $\mathbf{z} = \mathbf{X}\beta^{\text{old}} + \mathbf{W}^{-1}(\mathbf{y} - \mathbf{p})$
– Iteratively reweighted least squares (IRLS):
$$\beta^{\text{new}} = \arg\min_{\beta}\; (\mathbf{z} - \mathbf{X}\beta)^T \mathbf{W} (\mathbf{z} - \mathbf{X}\beta)$$
(Note that at $\beta = \beta^{\text{old}}$ this weighted residual equals $(\mathbf{y} - \mathbf{p})^T \mathbf{W}^{-1} (\mathbf{y} - \mathbf{p})$.)
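A minimal numpy sketch of this IRLS / Newton-Raphson loop for two-class logistic regression; the data are simulated for illustration, and no safeguards against separation or a singular $\mathbf{X}^T\mathbf{W}\mathbf{X}$ are included:

```python
import numpy as np

def fit_logistic_irls(X, y, n_iter=25):
    """Newton-Raphson / IRLS: beta <- (X^T W X)^{-1} X^T W z at each step."""
    N, p1 = X.shape
    beta = np.zeros(p1)
    for _ in range(n_iter):
        eta = X @ beta
        p = 1.0 / (1.0 + np.exp(-eta))             # p_i = Pr(G=1 | x_i)
        W = p * (1.0 - p)                          # diagonal of W
        z = eta + (y - p) / W                      # adjusted response
        # Weighted least squares: solve (X^T W X) beta = X^T W z
        XtW = X.T * W
        beta = np.linalg.solve(XtW @ X, XtW @ z)
    return beta

# Toy usage: data generated from a known beta; the fit should be close to it.
rng = np.random.default_rng(4)
true_beta = np.array([-1.0, 2.0, -0.5])
X = np.hstack([np.ones((500, 1)), rng.normal(size=(500, 2))])
y = (rng.random(500) < 1.0 / (1.0 + np.exp(-(X @ true_beta)))).astype(float)
print(fit_logistic_irls(X, y))
```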
After fitting the logistic regression model to the data:
$$\Pr(\text{MI} = \text{yes} \mid x) = \frac{\exp(-4.130 + 0.006\,x_{\text{sbp}} + 0.080\,x_{\text{tobacco}} + 0.185\,x_{\text{ldl}} + 0.939\,x_{\text{famhist}} - 0.035\,x_{\text{obesity}} + 0.001\,x_{\text{alcohol}} + 0.043\,x_{\text{age}})}{1 + \exp(-4.130 + 0.006\,x_{\text{sbp}} + 0.080\,x_{\text{tobacco}} + 0.185\,x_{\text{ldl}} + 0.939\,x_{\text{famhist}} - 0.035\,x_{\text{obesity}} + 0.001\,x_{\text{alcohol}} + 0.043\,x_{\text{age}})}$$

              Coefficient   Std. Error   Z Score
(Intercept)     -4.130        0.964       -4.285
sbp              0.006        0.006        1.023
tobacco          0.080        0.026        3.034
ldl              0.185        0.057        3.219
famhist          0.939        0.225        4.178
obesity         -0.035        0.029       -1.187
alcohol          0.001        0.004        0.136
age              0.043        0.010        4.184
After ignoring the negligible coefficients:
$$\Pr(\text{MI} = \text{yes} \mid x) = \frac{\exp(-4.204 + 0.081\,x_{\text{tobacco}} + 0.168\,x_{\text{ldl}} + 0.924\,x_{\text{famhist}} + 0.044\,x_{\text{age}})}{1 + \exp(-4.204 + 0.081\,x_{\text{tobacco}} + 0.168\,x_{\text{ldl}} + 0.924\,x_{\text{famhist}} + 0.044\,x_{\text{age}})}$$
What happened to systolic blood pressure? Obesity?
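A small sketch evaluating this reduced model for one input; the coefficients are the ones on the slide, while the feature values below are made up for illustration:

```python
import numpy as np

coef = {"intercept": -4.204, "tobacco": 0.081, "ldl": 0.168,
        "famhist": 0.924, "age": 0.044}

def prob_mi_yes(tobacco, ldl, famhist, age):
    """Pr(MI = yes | x) from the reduced logistic regression model above."""
    eta = (coef["intercept"] + coef["tobacco"] * tobacco + coef["ldl"] * ldl
           + coef["famhist"] * famhist + coef["age"] * age)
    return 1.0 / (1.0 + np.exp(-eta))

# Hypothetical individual: some tobacco use, ldl 4.0, positive family history, age 50.
print(prob_mi_yes(tobacco=2.0, ldl=4.0, famhist=1, age=50))
```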
Newton-Raphson update for multiclass ($K > 2$) logistic regression:
$$\tilde\beta^{\text{new}} = \tilde\beta^{\text{old}} + (\tilde{\mathbf{X}}^T \tilde{\mathbf{W}} \tilde{\mathbf{X}})^{-1} \tilde{\mathbf{X}}^T (\tilde{\mathbf{y}} - \tilde{\mathbf{p}}),$$
where
$$\tilde\beta = \begin{pmatrix} \beta_{10} \\ \beta_1 \\ \beta_{20} \\ \beta_2 \\ \vdots \\ \beta_{(K-1)0} \\ \beta_{K-1} \end{pmatrix}_{(K-1)(p+1) \times 1}, \qquad \tilde{\mathbf{X}} = \begin{pmatrix} \mathbf{X} & & \\ & \ddots & \\ & & \mathbf{X} \end{pmatrix}_{N(K-1) \times (K-1)(p+1)}, \qquad \mathbf{X} = \begin{pmatrix} 1 & x_1^T \\ 1 & x_2^T \\ \vdots & \vdots \\ 1 & x_N^T \end{pmatrix}_{N \times (p+1)}$$
($\tilde{\mathbf{X}}$ is block-diagonal with $K-1$ copies of $\mathbf{X}$.)
$\tilde{\mathbf{y}}$ is an $N(K-1)$-dimensional vector:
$$\tilde{\mathbf{y}} = \begin{pmatrix} \mathbf{y}_1 \\ \mathbf{y}_2 \\ \vdots \\ \mathbf{y}_{K-1} \end{pmatrix}, \qquad \mathbf{y}_k = \begin{pmatrix} \delta(g_1 - k) \\ \delta(g_2 - k) \\ \vdots \\ \delta(g_N - k) \end{pmatrix}, \quad 1 \le k \le K-1,$$
where $\delta(z)$ is an indicator function: $\delta(z) = 1$ if $z = 0$, and $0$ otherwise.
$\tilde{\mathbf{p}}$ is an $N(K-1)$-dimensional vector:
$$\tilde{\mathbf{p}} = \begin{pmatrix} \mathbf{p}_1 \\ \mathbf{p}_2 \\ \vdots \\ \mathbf{p}_{K-1} \end{pmatrix}, \qquad \mathbf{p}_k = \begin{pmatrix} \exp(\beta_{k0} + \beta_k^T x_1) \,/\, \big(1 + \sum_{l=1}^{K-1} \exp(\beta_{l0} + \beta_l^T x_1)\big) \\ \exp(\beta_{k0} + \beta_k^T x_2) \,/\, \big(1 + \sum_{l=1}^{K-1} \exp(\beta_{l0} + \beta_l^T x_2)\big) \\ \vdots \\ \exp(\beta_{k0} + \beta_k^T x_N) \,/\, \big(1 + \sum_{l=1}^{K-1} \exp(\beta_{l0} + \beta_l^T x_N)\big) \end{pmatrix}, \quad 1 \le k \le K-1.$$
$\tilde{\mathbf{W}}$ is an $N(K-1) \times N(K-1)$ matrix of blocks:
$$\tilde{\mathbf{W}} = \begin{pmatrix} \mathbf{W}_{11} & \mathbf{W}_{12} & \cdots & \mathbf{W}_{1(K-1)} \\ \mathbf{W}_{21} & \mathbf{W}_{22} & \cdots & \mathbf{W}_{2(K-1)} \\ \vdots & & & \vdots \\ \mathbf{W}_{(K-1)1} & \mathbf{W}_{(K-1)2} & \cdots & \mathbf{W}_{(K-1)(K-1)} \end{pmatrix},$$
where each $\mathbf{W}_{km}$, $1 \le k, m \le K-1$, is an $N \times N$ diagonal matrix. Writing
$$p_k(x_i) = \frac{\exp(\beta_{k0} + \beta_k^T x_i)}{1 + \sum_{l=1}^{K-1} \exp(\beta_{l0} + \beta_l^T x_i)},$$
if $k = m$, the $i$th diagonal entry of $\mathbf{W}_{km}$ is $p_k(x_i)\big(1 - p_k(x_i)\big)$; if $k \ne m$, the $i$th diagonal entry is $-\,p_k(x_i)\,p_m(x_i)$.
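A minimal numpy sketch of one such multiclass NR update, building $\tilde{\mathbf{y}}$, $\tilde{\mathbf{p}}$, $\tilde{\mathbf{X}}$, and $\tilde{\mathbf{W}}$ explicitly (dense and unoptimized; toy data assumed in the usage at the end):

```python
import numpy as np

def multiclass_nr_step(X, g, beta_tilde, K):
    """One NR update: beta~ <- beta~ + (X~^T W~ X~)^{-1} X~^T (y~ - p~).

    X: N x (p+1) with leading 1s; g: labels in {1,...,K}; beta_tilde: length (K-1)(p+1)."""
    N, p1 = X.shape
    B = beta_tilde.reshape(K - 1, p1)               # row k-1 holds (beta_k0, beta_k)
    eta = X @ B.T                                   # N x (K-1) linear scores
    P = np.exp(eta) / (1.0 + np.exp(eta).sum(axis=1, keepdims=True))  # p_k(x_i)

    # Stacked vectors y~ and p~, ordered class-block by class-block.
    y_tilde = np.concatenate([(g == k).astype(float) for k in range(1, K)])
    p_tilde = np.concatenate([P[:, k - 1] for k in range(1, K)])

    # Block-diagonal X~ and the blocked weight matrix W~.
    X_tilde = np.kron(np.eye(K - 1), X)
    W_tilde = np.zeros((N * (K - 1), N * (K - 1)))
    for k in range(K - 1):
        for m in range(K - 1):
            w = P[:, k] * ((k == m) - P[:, m])      # p_k(1-p_k) or -p_k p_m
            W_tilde[k*N:(k+1)*N, m*N:(m+1)*N] = np.diag(w)

    H = X_tilde.T @ W_tilde @ X_tilde
    grad = X_tilde.T @ (y_tilde - p_tilde)
    return beta_tilde + np.linalg.solve(H, grad)

# Toy usage: K=3 classes, p=2 features, start from beta~ = 0 and take a few steps.
rng = np.random.default_rng(5)
X = np.hstack([np.ones((60, 1)), rng.normal(size=(60, 2))])
g = rng.integers(1, 4, size=60)
bt = np.zeros(2 * 3)                                # (K-1)(p+1) parameters
for _ in range(5):
    bt = multiclass_nr_step(X, g, bt, K=3)
print(bt)
```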
LDA vs. Logistic Regression
• LDA (Generative model)
– Assumes Gaussian class-conditional densities and a common covariance
– Model parameters are estimated by maximizing the full log-likelihood; the parameters for each class are estimated independently of the other classes; $Kp + p(p+1)/2 + (K-1)$ parameters
– Makes use of marginal density information Pr( X )
– Easier to train, low variance, more efficient if model is correct
– Higher asymptotic error, but converges faster
• Logistic Regression (Discriminative model)
– Assumes class-conditional densities are members of the (same) exponential family distribution
– Model parameters are estimated by maximizing the conditional log-likelihood, with simultaneous consideration of all other classes; $(K-1)(p+1)$ parameters
– Ignores marginal density information Pr( X )
– Harder to train, robust to uncertainty about the data generation process
– Lower asymptotic error, but converges more slowly
Generative vs. Discriminative Learning
• Example: Generative: Linear Discriminant Analysis. Discriminative: Logistic Regression.
• Objective function: Generative: full log-likelihood, $\sum_i \log p_\theta(x_i, y_i)$. Discriminative: conditional log-likelihood, $\sum_i \log p_\theta(y_i \mid x_i)$.
• Model assumptions: Generative: class densities $p(x \mid y = k)$, e.g., Gaussian in LDA. Discriminative: discriminant functions $\delta_k(x)$.
• Parameter estimation: Generative: "easy", one single sweep. Discriminative: "hard", iterative optimization.
• Advantages: Generative: more efficient if the model is correct, borrows strength from $p(x)$. Discriminative: more flexible, robust because of fewer assumptions.
• Disadvantages: Generative: biased if the model is incorrect. Discriminative: may also be biased; ignores the information in $p(x)$.