Course Outline

[Course outline diagram, reconstructed as a list]
• Model information complete: Bayes decision theory ("optimal" rules)
• Model information incomplete:
  – Supervised learning: parametric approach (plug-in rules); nonparametric approach (density estimation, geometric rules such as k-NN and MLP)
  – Unsupervised learning: parametric approach (mixture resolving); nonparametric approach (cluster analysis: hard, fuzzy)

[Figure: two-dimensional feature space]

Supervised Learning
Chapter 3: Maximum-Likelihood & Bayesian Parameter Estimation

• Introduction
• Maximum-Likelihood Estimation
• Bayesian Estimation
• Curse of Dimensionality
• Component Analysis & Discriminants
• EM Algorithm

Introduction

• Bayesian framework
• We could design an optimal classifier if we knew:
  – P(ωi): the priors
  – P(x | ωi): the class-conditional densities
  Unfortunately, we rarely have this complete information!
• Design a classifier based on a set of labeled training samples (supervised learning)
  – Assume the priors are known
  – We need a sufficient number of training samples to estimate the class-conditional densities, especially when the dimensionality of the feature space is large


• Assumption about the problem: a parametric model of P(x | ωi) is available
• Assume P(x | ωi) is multivariate Gaussian:
  P(x | ωi) ~ N(μi, Σi)
  – Characterized by two parameters: the mean vector μi and the covariance matrix Σi
• Parameter estimation techniques:
  – Maximum-Likelihood (ML) estimation and Bayesian estimation
  – Results of the two procedures are nearly identical, but there is a subtle difference
• In ML estimation, the parameters are assumed to be fixed but unknown! The Bayesian estimation procedure, by its nature, utilizes whatever prior information is available about the unknown parameter
• MLE: the best parameters are those that maximize the probability of obtaining the observed samples
• Bayesian methods view the parameters as random variables having some known prior distribution; how do we know the priors?
• In either approach, we use P(ωi | x) for our classification rule!

Maximum-Likelihood Estimation



• Has good convergence properties as the sample size increases: the estimated parameter value approaches the true value as n increases
• Often simpler than alternative techniques
• General principle:
  – Assume we have c classes and P(x | ωj) ~ N(μj, Σj)
  – Write P(x | ωj) ≡ P(x | ωj, θj), where
    θj = (μj, Σj) = (μj1, μj2, …, σj11, σj22, …, cov(x_j^m, x_j^n), …)
  – Use the samples from class ωj to estimate the class ωj parameters


• Use the information in the training samples to estimate θ = (θ1, θ2, …, θc), where θi (i = 1, 2, …, c) is associated with the ith category
• Suppose the sample set D contains n i.i.d. samples x1, x2, …, xn; then
  P(D | θ) = ∏_{k=1}^{n} P(x_k | θ) ≡ F(θ)
  P(D | θ) is called the likelihood of θ w.r.t. the set of samples
• The ML estimate of θ is, by definition, the value θ̂ that maximizes P(D | θ)
  "It is the value of θ that best agrees with the actually observed training samples"
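As a quick illustration of this definition, here is a minimal sketch (not from the slides) that finds θ̂ numerically for a univariate Gaussian with θ = (μ, σ); the synthetic data and the use of scipy.optimize.minimize are assumptions made for the example.

```python
# Hedged sketch: numerically maximize the likelihood P(D | theta) for a
# univariate Gaussian, theta = (mu, sigma). Illustrative only; the closed-form
# solution derived later in the slides gives the same answer.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
D = rng.normal(loc=2.0, scale=1.5, size=200)         # n i.i.d. samples x_1..x_n

def neg_log_likelihood(theta):
    mu, sigma = theta
    if sigma <= 0:                                    # keep sigma in a valid range
        return np.inf
    return -np.sum(norm.logpdf(D, loc=mu, scale=sigma))

theta_hat = minimize(neg_log_likelihood, x0=[0.0, 1.0], method="Nelder-Mead").x
print("ML estimate (mu, sigma):", theta_hat)          # close to the sample mean / std
```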

Optimal estimation
• Let θ = (θ1, θ2, …, θp)^t and let ∇θ be the gradient operator
  ∇θ = [ ∂/∂θ1, ∂/∂θ2, …, ∂/∂θp ]^t
• We define l(θ) as the log-likelihood function
  l(θ) = ln P(D | θ)
• New problem statement: determine the θ that maximizes the log-likelihood
  θ̂ = arg max_θ l(θ)
• The set of necessary conditions for an optimum is obtained by setting the gradient of the log-likelihood to zero:
  ∇θ l = ∑_{k=1}^{n} ∇θ ln P(x_k | θ) = 0

• Example of a specific case: unknown μ
  – P(x | μ) ~ N(μ, Σ)
    (samples are drawn from a multivariate normal population)
  – For a single sample x_k:
    ln P(x_k | μ) = −(1/2) ln[ (2π)^d |Σ| ] − (1/2) (x_k − μ)^t Σ^{−1} (x_k − μ)
    and  ∇μ ln P(x_k | μ) = Σ^{−1} (x_k − μ)
  – Here θ = μ, therefore the ML estimate for μ must satisfy:
    ∑_{k=1}^{n} Σ^{−1} (x_k − μ̂) = 0
• Multiplying by Σ and rearranging, we obtain:
  μ̂ = (1/n) ∑_{k=1}^{n} x_k
  which is simply the arithmetic average (the sample mean) of the training samples!
• Conclusion:
  Given P(x_k | ωj), j = 1, 2, …, c, to be Gaussian in a d-dimensional feature space, estimate the vector θ = (θ1, θ2, …, θc)^t and perform the classification using the Bayes decision rule of Chapter 2!
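A minimal sketch of the resulting plug-in classifier, assuming two Gaussian classes with known, equal priors and a known shared covariance; the synthetic data and parameter choices are illustrative, not from the slides.

```python
# Hedged sketch of the plug-in rule: estimate each class mean with the ML
# estimate (the sample mean), then classify with the Bayes decision rule of
# Chapter 2. Data, priors, and the shared covariance are illustrative choices.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
X1 = rng.multivariate_normal([0, 0], np.eye(2), size=100)   # class omega_1 samples
X2 = rng.multivariate_normal([3, 3], np.eye(2), size=100)   # class omega_2 samples

mu1_hat, mu2_hat = X1.mean(axis=0), X2.mean(axis=0)          # ML estimates of the means
Sigma = np.eye(2)                                            # assumed known covariance
priors = [0.5, 0.5]                                          # assumed known priors

def classify(x):
    posts = [p * multivariate_normal.pdf(x, mean=m, cov=Sigma)
             for p, m in zip(priors, [mu1_hat, mu2_hat])]
    return int(np.argmax(posts)) + 1                         # returns 1 or 2

print(classify([0.2, -0.1]), classify([2.8, 3.1]))
```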
• ML Estimation: univariate Gaussian case with unknown μ and σ²
  – θ = (θ1, θ2) = (μ, σ²)
  – For a single sample x_k:
    l = ln P(x_k | θ) = −(1/2) ln(2π θ2) − (1/(2θ2)) (x_k − θ1)²
  – Setting the gradient to zero,
    ∇θ l = [ ∂(ln P(x_k | θ))/∂θ1 , ∂(ln P(x_k | θ))/∂θ2 ]^t = 0
    gives the two conditions
    (1/θ2) (x_k − θ1) = 0
    −1/(2θ2) + (x_k − θ1)² / (2θ2²) = 0
• Summing over all n samples and evaluating at the ML estimates θ̂1, θ̂2:
  ∑_{k=1}^{n} (1/θ̂2) (x_k − θ̂1) = 0        (1)
  −∑_{k=1}^{n} (1/θ̂2) + ∑_{k=1}^{n} (x_k − θ̂1)² / θ̂2² = 0        (2)
• Combining (1) and (2), one obtains:
  μ̂ = (1/n) ∑_{k=1}^{n} x_k  ;   σ̂² = (1/n) ∑_{k=1}^{n} (x_k − μ̂)²
• Bias
  – The ML estimate for σ² is biased:
    E[ (1/n) ∑_i (x_i − x̄)² ] = ((n − 1)/n) σ² ≠ σ²
  – An unbiased estimator for Σ is the sample covariance matrix:
    C = (1/(n − 1)) ∑_{k=1}^{n} (x_k − μ̂)(x_k − μ̂)^t
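The bias can be checked empirically; the following sketch (the data distribution, sample size, and number of trials are illustrative assumptions) compares the ML variance estimate, which divides by n, with the unbiased sample variance, which divides by n − 1.

```python
# Hedged sketch (not from the slides): empirically compare the biased ML
# variance estimate (divide by n) with the unbiased estimate (divide by n-1).
import numpy as np

rng = np.random.default_rng(2)
true_var, n, trials = 4.0, 10, 20000

samples = rng.normal(0.0, np.sqrt(true_var), size=(trials, n))
ml_var = samples.var(axis=1, ddof=0)        # ML estimate: divide by n
unbiased_var = samples.var(axis=1, ddof=1)  # sample variance: divide by n-1

print("E[ML estimate]       ~", ml_var.mean())        # about (n-1)/n * 4 = 3.6
print("E[unbiased estimate] ~", unbiased_var.mean())  # about 4.0
```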

Bayesian Estimation (Bayesian learning approach for pattern classification problems)
• In MLE, θ was assumed to have a fixed (but unknown) value
• In Bayesian estimation, θ is a random variable
• The computation of the posterior probabilities P(ωi | x) lies at the heart of Bayesian classification
• Goal: compute P(ωi | x, D)
  Given the training sample set D, Bayes formula can be written
  P(ωi | x, D) = P(x | ωi, D) P(ωi | D) / ∑_{j=1}^{c} P(x | ωj, D) P(ωj | D)
• To demonstrate the preceding equation, use:
  P(x, D | ωi) = P(x | Di, ωi) P(D | ωi)
  P(x | D) = ∑_j P(x, ωj | D)
  P(ωi) = P(ωi | D)   (the training samples provide this!)
• Thus:
  P(ωi | x, D) = P(x | ωi, Di) P(ωi) / ∑_{j=1}^{c} P(x | ωj, Dj) P(ωj)

Bayesian Parameter Estimation: Gaussian Case
• Goal: estimate μ using the a-posteriori density P(μ | D)
• The univariate Gaussian case: P(μ | D)
  – μ is the only unknown parameter
  – P(x | μ) ~ N(μ, σ²)
  – P(μ) ~ N(μ0, σ0²)
  – μ0 and σ0 are known!
• The posterior density is
  P(μ | D) = P(D | μ) P(μ) / ∫ P(D | μ) P(μ) dμ        (1)
           = α ∏_{k=1}^{n} P(x_k | μ) P(μ)
• Reproducing density: the posterior is again Gaussian,
  P(μ | D) ~ N(μn, σn²)        (2)
• The updated parameters of the prior are
  μn = ( n σ0² / (n σ0² + σ²) ) μ̂n + ( σ² / (n σ0² + σ²) ) μ0
  σn² = σ0² σ² / (n σ0² + σ²)
  where μ̂n = (1/n) ∑_{k=1}^{n} x_k is the sample mean
• The univariate case: P(x | D)
  – P(μ | D) has been computed
  – P(x | D) remains to be computed:
    P(x | D) = ∫ P(x | μ) P(μ | D) dμ   is Gaussian
    It provides:
    P(x | D) ~ N(μn, σ² + σn²)
  – The desired class-conditional density is P(x | Dj, ωj)
  – Using P(x | Dj, ωj) together with P(ωj) in Bayes formula, we obtain the Bayesian classification rule:
    max_j P(ωj | x, D)  ≡  max_j [ P(x | ωj, Dj) P(ωj) ]

Bayesian Parameter Estimation: General Theory
• The P(x | D) computation can be applied to any situation in which the unknown density can be parameterized; the basic assumptions are:
  – The form of P(x | θ) is assumed known, but the value of θ is not known exactly
  – Our knowledge about θ is assumed to be contained in a known prior density P(θ)
  – The rest of our knowledge about θ is contained in a set D of n samples x1, x2, …, xn drawn independently according to P(x)
• The basic problem is:
  "Compute the posterior density P(θ | D)", then "Derive P(x | D)"
• Using Bayes formula, we have:
  P(θ | D) = P(D | θ) P(θ) / ∫ P(D | θ) P(θ) dθ
• And, by the independence assumption:
  P(D | θ) = ∏_{k=1}^{n} P(x_k | θ)
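For a one-dimensional parameter, the posterior can simply be evaluated on a grid. The sketch below (Gaussian likelihood and prior, synthetic data, and the grid range are all illustrative assumptions) applies Bayes formula and the independence assumption directly.

```python
# Hedged sketch (not from the slides): compute P(theta | D) numerically on a
# grid for a one-dimensional parameter, directly from Bayes formula and the
# independence assumption. The Gaussian likelihood and prior are illustrative.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
D = rng.normal(1.0, 1.0, size=20)            # samples x_1..x_n
theta_grid = np.linspace(-3, 4, 1401)        # candidate values of theta (the mean)
dtheta = theta_grid[1] - theta_grid[0]

log_lik = norm.logpdf(D[:, None], loc=theta_grid, scale=1.0).sum(axis=0)
log_prior = norm.logpdf(theta_grid, loc=0.0, scale=2.0)

log_post = log_lik + log_prior
post = np.exp(log_post - log_post.max())     # unnormalized, rescaled for stability
post /= post.sum() * dtheta                  # normalize so the density integrates to 1

print("posterior mean of theta ~", (theta_grid * post).sum() * dtheta)
```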
Overfitting
Problem of Insufficient Data
• How do we train a classifier (e.g., estimate the covariance matrix) when the training set size is small compared to the number of features?
• Reduce the dimensionality
– Select a subset of features
– Combine available features to get a smaller number of more
“salient” features.
• Bayesian techniques
– Assume a reasonable prior on the parameters to compensate
for small amount of training data
• Model Simplification
– Assume statistical independence
• Heuristics
– Threshold the estimated covariance matrix such that only
correlations above a threshold are retained.
Practical Observations
• Most heuristics and model simplifications are almost surely incorrect
• In practice, however, classifiers based on model simplification often perform better than those based on full parameter estimation
• Paradox: how can a suboptimal/simplified model perform better on the test dataset than the MLE of the full parameter set?
  – The answer involves the problem of insufficient data
Insufficient Data in Curve Fitting
Curve Fitting Example (contd)
• The example shows that a 10th-degree polynomial
fits the training data with zero error
– However, the test or the generalization error is
much higher for this fitted curve
• When the data size is small, one cannot be sure
about how complex the model should be
• A small change in the data will change the parameters of the 10th-degree polynomial significantly; such instability is not a desirable quality
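A small sketch of the effect described above, assuming a sine-shaped true curve, 11 noisy training points, and polynomial fits of degree 2 and 10; the degree-10 fit reaches near-zero training error but a much larger test error.

```python
# Hedged sketch (not from the slides): fit low- and high-degree polynomials to
# a handful of noisy points and compare training vs. test error. The data,
# noise level, and degrees are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(5)
f = lambda x: np.sin(2 * np.pi * x)                 # "true" underlying curve
x_train = np.linspace(0, 1, 11)
y_train = f(x_train) + rng.normal(0, 0.2, x_train.size)
x_test = np.linspace(0, 1, 200)
y_test = f(x_test)

for degree in (2, 10):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE = {train_err:.4f}, test MSE = {test_err:.4f}")
```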
Handling insufficient data
• Heuristics and model simplifications
• Shrinkage is an intermediate approach, which combines
“common covariance” with individual covariance matrices
– Individual covariance matrices shrink towards a common
covariance matrix.
– Also called regularized discriminant analysis
• Shrinkage estimator for a covariance matrix, given a shrinkage factor 0 < α < 1:
  Σi(α) = [ (1 − α) n_i Σ_i + α n Σ ] / [ (1 − α) n_i + α n ]
• Further, the common covariance Σ can be shrunk towards the identity matrix:
  Σ(β) = (1 − β) Σ + β I
Problems of Dimensionality
Introduction
• Real world applications usually come with a large
number of features
– Text in documents is represented using frequencies of tens
of thousands of words
– Images are often represented by extracting local features
from a large number of regions within an image
• Naive intuition: the more features, the better the classification performance? Not always!
• There are two issues that must be confronted with high
dimensional feature spaces
– How does the classification accuracy depend on the
dimensionality and the number of training samples?
– What is the computational complexity of the classifier?
Statistically Independent Features
• If features are statistically independent, it is
possible to get excellent performance as
dimensionality increases
• For a two-class problem with multivariate normal classes, P(x | ωj) ~ N(μj, Σ), and equal prior probabilities, the probability of error is
  P(e) = (1/√(2π)) ∫_{r/2}^{∞} e^{−u²/2} du
  where the squared Mahalanobis distance between the class means is
  r² = (μ1 − μ2)^T Σ^{−1} (μ1 − μ2)
Statistically Independent Features
• When the features are independent, the covariance matrix is diagonal, and we have
  r² = ∑_{i=1}^{d} ( (μ_{i1} − μ_{i2}) / σ_i )²
• Since r² increases monotonically as features are added, P(e) decreases
• As long as the means of the features differ in the two classes, the error decreases
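A quick numerical sketch of this behavior, assuming illustrative per-feature mean differences and unit variances, and using the standard normal tail function for P(e) = (1/√(2π)) ∫_{r/2}^∞ e^{−u²/2} du:

```python
# Hedged sketch (not from the slides): evaluate P(e) for independent features,
# showing the error fall as informative features are added. The mean
# differences and sigmas are illustrative assumptions.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
d = 20
mean_diff = rng.uniform(0.2, 0.8, size=d)    # |mu_i1 - mu_i2| for each feature
sigma = np.ones(d)                           # per-feature standard deviations

for k in (1, 5, 10, 20):
    r = np.sqrt(np.sum((mean_diff[:k] / sigma[:k]) ** 2))
    print(f"d = {k:2d}: r = {r:.3f}, P(e) = {norm.sf(r / 2):.4f}")
```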
Increasing Dimensionality
• If a given set of features does not result in good
classification performance, it is natural to add
more features
• High dimensionality results in increased cost and
complexity for both feature extraction and
classification
• If the probabilistic structure of the problem is completely known, adding new features cannot increase the Bayes risk
Curse of Dimensionality
• In practice, increasing the dimensionality beyond a certain point in the presence of a finite number of training samples often leads to worse performance, rather than better
• The main reasons for this paradox are as follows:
  – the Gaussian assumption that is typically made is almost surely incorrect
  – the training sample size is always finite, so the estimation of the class-conditional densities is not very accurate
• Analysis of this “curse of dimensionality” problem
is difficult
A Simple Example
• Trunk (PAMI, 1979) provided a simple example illustrating this phenomenon:
  p(ω1) = p(ω2) = 1/2
  μ1 = μ,  μ2 = −μ,  where the ith component of the mean vector is μ_i = (1/i)^{1/2}, i.e.
  μ1 = ( 1, 1/√2, 1/√3, 1/√4, … )^t,   μ2 = −( 1, 1/√2, 1/√3, 1/√4, … )^t
  p(x | ω1) ~ G(μ1, I):   p(x | ω1) = ∏_{i=1}^{N} (1/√(2π)) exp( −(1/2)(x_i − μ_i)² )
  p(x | ω2) ~ G(μ2, I):   p(x | ω2) = ∏_{i=1}^{N} (1/√(2π)) exp( −(1/2)(x_i + μ_i)² )
  N: number of features
Case 1: Mean Values Known
• Bayes decision rule:
  Decide ω1 if x^t μ = x1 μ1 + x2 μ2 + … + xN μN > 0,  or equivalently if  ∑_{i=1}^{N} x_i / √i > 0
• The resulting probability of error is
  P_e(N) = (1/√(2π)) ∫_{||μ1 − μ2||/2}^{∞} e^{−z²/2} dz,   with ||μ1 − μ2||² = 4 ∑_{i=1}^{N} 1/i,
  so
  P_e(N) = (1/√(2π)) ∫_{(∑_{i=1}^{N} 1/i)^{1/2}}^{∞} e^{−z²/2} dz
• ∑_{i=1}^{N} 1/i is a divergent series, so the lower limit of the integral grows without bound and
  P_e(N) → 0  as  N → ∞
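A short sketch evaluating this expression (the values of N are illustrative):

```python
# Hedged sketch (not from the slides): P_e(N) for Trunk's example with known
# means, using the formula above with the standard normal tail function.
import numpy as np
from scipy.stats import norm

for N in (1, 10, 100, 1000, 10000):
    harmonic = np.sum(1.0 / np.arange(1, N + 1))                 # sum_{i=1}^N 1/i
    print(f"N = {N:6d}: P_e(N) = {norm.sf(np.sqrt(harmonic)):.4f}")  # -> 0 as N grows
```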
Case 2: Mean Values Unknown
• m labeled training samples are available; the mean is estimated with the pooled estimate
  μ̂ = (1/m) ∑_{i=1}^{m} x^i,   where x^i is replaced by −x^i if x^i ∈ ω2
• Plug-in decision rule:
  Decide ω1 if x^t μ̂ = x1 μ̂1 + x2 μ̂2 + … + xN μ̂N > 0
• The probability of error is
  P_e(N, m) = P(ω2) P(x^t μ̂ > 0 | x ∈ ω2) + P(ω1) P(x^t μ̂ ≤ 0 | x ∈ ω1)
            = P(x^t μ̂ > 0 | x ∈ ω2)   (due to symmetry)
• Let z = x^t μ̂ = ∑_{i=1}^{N} x_i μ̂_i; it is difficult to compute the exact distribution of z
Case 2: Mean Values Unknown (contd)
• By symmetry, P_e(N, m) = P(x^t μ̂ > 0 | x ∈ ω2) = P(z ≤ 0 | x ∈ ω1). Conditioned on x ∈ ω1, the mean and variance of z are
  E[z] = ∑_{i=1}^{N} 1/i
  VAR[z] = (1 + 1/m) ∑_{i=1}^{N} 1/i + N/m
• As N → ∞,
  (z − E[z]) / √VAR[z]  ~  G(0, 1)   (standard normal)
• Therefore
  P_e(N, m) = P(z ≤ 0 | x ∈ ω1) = P( (z − E[z])/√VAR[z] ≤ −E[z]/√VAR[z] )
            ≈ (1/√(2π)) ∫_{θ(N,m)}^{∞} e^{−z²/2} dz
  where
  θ(N, m) = E[z] / √VAR[z] = ( ∑_{i=1}^{N} 1/i ) / ( (1 + 1/m) ∑_{i=1}^{N} 1/i + N/m )^{1/2}
• Since lim_{N→∞} θ(N, m) = 0 for any fixed m,
  lim_{N→∞} P_e(N, m) = 1/2
  With a finite number of training samples, adding more and more features eventually drives the error back towards that of random guessing (the peaking phenomenon)
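A short sketch of the limiting expression above for a fixed m, showing the error first decrease and then climb back towards 1/2 as features are added; the values of m and N are illustrative.

```python
# Hedged sketch (not from the slides): the asymptotic error for Trunk's
# unknown-mean case, using theta(N, m) from the formula above and the
# standard normal tail function.
import numpy as np
from scipy.stats import norm

m = 25                                           # fixed training set size
for N in (1, 10, 100, 1000, 100000):
    s = np.sum(1.0 / np.arange(1, N + 1))        # sum_{i=1}^N 1/i
    theta = s / np.sqrt((1 + 1 / m) * s + N / m)
    print(f"N = {N:6d}: P_e(N, m={m}) ~ {norm.sf(theta):.4f}")
```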

Component Analysis and Discriminants
• Combine features in order to reduce the dimension of the feature space
• Linear combinations are simple to compute and tractable
• Project high-dimensional data onto a lower-dimensional space
• Two classical approaches for finding an "optimal" linear transformation:
  – PCA (Principal Component Analysis): "the projection that best represents the data in a least-squares sense"
  – MDA (Multiple Discriminant Analysis): "the projection that best separates the data in a least-squares sense"
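A minimal PCA sketch via the eigendecomposition of the sample covariance matrix; the synthetic data and the choice of two components are illustrative assumptions (MDA would instead use the between- and within-class scatter matrices).

```python
# Hedged sketch (not from the slides): PCA by projecting centered data onto
# the leading eigenvectors of the sample covariance matrix.
import numpy as np

rng = np.random.default_rng(8)
X = rng.multivariate_normal(np.zeros(5), np.diag([5.0, 2.0, 0.5, 0.1, 0.1]), size=300)

X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)            # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]                 # sort descending by variance

k = 2
W = eigvecs[:, order[:k]]                         # top-k principal directions
Z = X_centered @ W                                # projected (reduced) data
print("explained variance ratio:", eigvals[order[:k]].sum() / eigvals.sum())
```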