# CS479/679 Pattern Recognition, Spring 2006 – Dr. George Bebis

Parameter Estimation: Bayesian Estimation
Chapter 3 (Duda et al.) – Sections 3.3-3.7
Bayesian Estimation
• Assumes that the parameters θ are random variables with some known a priori distribution p(θ).
• Estimates a distribution rather than making a point estimate like ML:

p(x/D) = ∫ p(x/θ) p(θ/D) dθ

Note: the BE solution might not be of the parametric form assumed.
Role of Training Examples
• If p(x/ωi) and P(ωi) are known, Bayes' rule allows us to compute the posterior probabilities P(ωi/x):

P(ωi/x) = p(x/ωi) P(ωi) / Σ_j p(x/ωj) P(ωj)

• Consider the role of the training examples D by introducing them in the computation of the posterior probabilities:

P(ωi/x, D)
Role of Training Examples (cont’d)
P(ωi/x, D) = p(x, D/ωi) P(ωi) / p(x, D)
           = p(x/D, ωi) p(D/ωi) P(ωi) / (p(x/D) p(D))
           = p(x/ωi, D) P(ωi/D) / p(x/D)

Marginalizing the denominator, p(x/D) = Σ_j p(x, ωj/Dj), and using only the samples from class i in the numerator:

P(ωi/x, D) = p(x/ωi, Di) P(ωi/Di) / Σ_j p(x/ωj, Dj) P(ωj/Dj)
Role of Training Examples (cont’d)
• The training examples Di are important in determining both the class-conditional densities and the prior probabilities:

P(ωi/x, Di) = p(x/ωi, Di) P(ωi/Di) / Σ_j p(x/ωj, Dj) P(ωj/Dj)

• For simplicity, replace P(ωi/D) with P(ωi):

P(ωi/x, Di) = p(x/ωi, Di) P(ωi) / Σ_j p(x/ωj, Dj) P(ωj)
Bayesian Estimation (BE)
• Need to estimate p(x/ωi, Di) for every class ωi.
• If the samples in Dj give no information about θi for i ≠ j, we need to solve c independent problems of the form "Given D, estimate p(x/D)":

P(ωi/x, Di) = p(x/ωi, Di) P(ωi) / Σ_j p(x/ωj, Dj) P(ωj)
BE Approach
• Estimate p(x/D) as follows:

p(x/D) = ∫ p(x, θ/D) dθ = ∫ p(x/θ, D) p(θ/D) dθ

• Since p(x/θ, D) = p(x/θ), we have:

p(x/D) = ∫ p(x/θ) p(θ/D) dθ
Interpretation of BE Solution
• If we are less certain about the exact value of θ, consider a weighted average of p(x/θ) over the possible values of θ:

p(x/D) = ∫ p(x/θ) p(θ/D) dθ

• The samples D exert their influence on p(x/D) through p(θ/D).
BE Solution – Special Case
• Suppose p(θ/D) peaks sharply at θ = θ̂; then p(x/D) can be approximated as follows:

p(x/D) = ∫ p(x/θ) p(θ/D) dθ ≈ p(x/θ̂)

(assuming that p(x/θ) is smooth)
Relation to ML solution
p(x/D) = ∫ p(x/θ) p(θ/D) dθ

p(θ/D) = p(D/θ) p(θ) / p(D)

• If p(D/θ) peaks sharply at θ = θ̂, then p(θ/D) will, in general, peak sharply at θ = θ̂ too (i.e., close to the ML solution):

p(x/D) ≈ p(x/θ̂)

• Therefore, ML is a special case of BE!
BE Main Steps
(1) Compute p(θ/D):

p(θ/D) = p(D/θ) p(θ) / p(D) = α ∏_{k=1}^n p(x_k/θ) p(θ)

(α is a normalization constant)

(2) Compute p(x/D):

p(x/D) = ∫ p(x/θ) p(θ/D) dθ
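The two steps above can be sketched numerically on a discrete parameter grid. Everything in this sketch — the N(θ, 1) likelihood, the N(0, 1) prior, and the sample data — is an illustrative assumption, not from the slides:

```python
import numpy as np

# Hypothetical setup: data drawn from N(theta, 1), prior p(theta) ~ N(0, 1).
rng = np.random.default_rng(0)
D = rng.normal(2.0, 1.0, size=20)               # training samples

theta = np.linspace(-5.0, 5.0, 1001)            # grid of parameter values
dtheta = theta[1] - theta[0]
prior = np.exp(-0.5 * theta**2)                 # p(theta), unnormalized

# Step (1): p(theta/D) = alpha * prod_k p(x_k/theta) * p(theta)
log_lik = np.sum([-0.5 * (x - theta)**2 for x in D], axis=0)
post = np.exp(log_lik - log_lik.max()) * prior
post /= post.sum() * dtheta                     # normalize so it integrates to 1

# Step (2): p(x/D) = integral of p(x/theta) p(theta/D) dtheta, at a query x
x = 1.5
p_x_theta = np.exp(-0.5 * (x - theta)**2) / np.sqrt(2.0 * np.pi)
p_x_given_D = np.sum(p_x_theta * post) * dtheta
print(p_x_given_D)
```

The grid makes the normalizing constant α and the predictive integral explicit; in higher dimensions this is exactly the "complex multidimensional integration" the slides mention later.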
Case 1: Univariate Gaussian, Unknown μ
Assume p(x/μ) ~ N(μ, σ²) and p(μ) ~ N(μ0, σ0²) (known μ0 and σ0).
D = {x1, x2, …, xn} (independently drawn)

(1) Compute p(μ/D):

p(μ/D) = p(D/μ) p(μ) / p(D) = α ∏_{k=1}^n p(x_k/μ) p(μ)
Case 1: Univariate Gaussian, Unknown μ (cont'd)
• It can be shown that p(μ/D) has the following form:

p(μ/D) ~ N(μn, σn²), i.e., p(μ/D) peaks at μn

where:

μn = (n σ0² / (n σ0² + σ²)) x̄n + (σ² / (n σ0² + σ²)) μ0,   σn² = σ0² σ² / (n σ0² + σ²)
Case 1: Univariate Gaussian, Unknown μ (cont'd)
• μn is a weighted average of the sample mean x̄n and the prior mean μ0 (i.e., it lies between them).
• μn → x̄n as n → ∞ (the ML estimate); larger n implies more samples and more weight on the data.
• σn² → 0 as n → ∞: more samples imply less uncertainty about μ.
Case 1: Univariate Gaussian, Unknown μ (cont'd)
[Figure: Bayesian learning — p(μ/D) becomes sharper and sharper as n increases]
Case 1: Univariate Gaussian, Unknown μ (cont'd)
(2) Compute p(x/D):

p(x/D) = ∫ p(x/μ) p(μ/D) dμ ~ N(μn, σ² + σn²)

(terms independent of μ come out of the integral)
As the number of samples increases, p(x/D) converges to p(x/μ).
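The closed-form update for this case can be sketched as follows; the helper name and the sample data are illustrative:

```python
import numpy as np

def gaussian_mean_posterior(D, mu0, sigma0_sq, sigma_sq):
    """Return (mu_n, sigma_n_sq) of p(mu/D) ~ N(mu_n, sigma_n^2),
    for known data variance sigma_sq and prior p(mu) ~ N(mu0, sigma0_sq)."""
    n = len(D)
    xbar = np.mean(D)
    denom = n * sigma0_sq + sigma_sq
    mu_n = (n * sigma0_sq / denom) * xbar + (sigma_sq / denom) * mu0
    sigma_n_sq = sigma0_sq * sigma_sq / denom
    return mu_n, sigma_n_sq

rng = np.random.default_rng(1)
D = rng.normal(3.0, 1.0, size=50)               # illustrative data
mu_n, sigma_n_sq = gaussian_mean_posterior(D, mu0=0.0, sigma0_sq=1.0, sigma_sq=1.0)
pred_var = 1.0 + sigma_n_sq                     # p(x/D) ~ N(mu_n, sigma^2 + sigma_n^2)
print(mu_n, sigma_n_sq, pred_var)
```

Note how μn shrinks the sample mean toward μ0, and how σn² is already small after 50 samples.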
Case 2: Multivariate Gaussian, Unknown μ
Assume p(x/μ) ~ N(μ, Σ) and p(μ) ~ N(μ0, Σ0) (known μ0, Σ0).
D = {x1, x2, …, xn} (independently drawn)

(1) Compute p(μ/D):

p(μ/D) = p(D/μ) p(μ) / p(D) = α ∏_{k=1}^n p(x_k/μ) p(μ)
Case 2: Multivariate Gaussian, Unknown μ (cont'd)
• It can be shown that p(μ/D) has the following form:

p(μ/D) = c · exp[-(1/2)(μ - μn)ᵗ Σn⁻¹ (μ - μn)]

where:

μn = Σ0 (Σ0 + (1/n)Σ)⁻¹ x̄n + (1/n)Σ (Σ0 + (1/n)Σ)⁻¹ μ0
Σn = Σ0 (Σ0 + (1/n)Σ)⁻¹ (1/n)Σ
Case 2: Multivariate Gaussian, Unknown μ (cont'd)
(2) Compute p(x/D):

p(x/D) = ∫ p(x/μ) p(μ/D) dμ ~ N(μn, Σ + Σn)
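The multivariate update above can likewise be sketched directly from the formulas; the helper name and the 2-D data are illustrative:

```python
import numpy as np

def mvn_mean_posterior(D, mu0, Sigma0, Sigma):
    """Return (mu_n, Sigma_n) of p(mu/D) ~ N(mu_n, Sigma_n),
    for known Sigma and prior p(mu) ~ N(mu0, Sigma0)."""
    n = D.shape[0]
    xbar = D.mean(axis=0)
    A = np.linalg.inv(Sigma0 + Sigma / n)       # (Sigma0 + (1/n)Sigma)^(-1)
    mu_n = Sigma0 @ A @ xbar + (Sigma / n) @ A @ mu0
    Sigma_n = Sigma0 @ A @ (Sigma / n)
    return mu_n, Sigma_n

rng = np.random.default_rng(2)
Sigma = np.eye(2)
D = rng.multivariate_normal([1.0, -1.0], Sigma, size=100)
mu_n, Sigma_n = mvn_mean_posterior(D, np.zeros(2), np.eye(2), Sigma)
# Predictive density: p(x/D) ~ N(mu_n, Sigma + Sigma_n)
print(mu_n)
print(Sigma_n)
```

With Σ0 = Σ = I and n = 100, Σn collapses to (1/101)I, mirroring the univariate σn² behavior.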
Recursive Bayes Learning
• Develop an incremental learning algorithm:

D^n = {x1, x2, …, x(n-1), xn} = D^(n-1) ∪ {xn}

• Rewrite p(D^n/θ) = ∏_{k=1}^n p(x_k/θ) as follows:

p(D^n/θ) = p(xn/θ) p(D^(n-1)/θ)

Recursive Bayes Learning (cont'd)

p(θ/D^n) = p(D^n/θ) p(θ) / p(D^n)
         = p(D^n/θ) p(θ) / ∫ p(D^n/θ) p(θ) dθ
         = p(xn/θ) p(D^(n-1)/θ) p(θ) / ∫ p(xn/θ) p(D^(n-1)/θ) p(θ) dθ
         = p(xn/θ) p(θ/D^(n-1)) / ∫ p(xn/θ) p(θ/D^(n-1)) dθ

where p(θ/D^0) = p(θ), n = 1, 2, …
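The recursion can be sketched on a discrete grid, where after each new sample the previous posterior plays the role of the prior. The N(θ, 1) model and N(0, 1) prior here are illustrative assumptions:

```python
import numpy as np

theta = np.linspace(-5.0, 5.0, 2001)
dtheta = theta[1] - theta[0]
post = np.exp(-0.5 * theta**2)                  # p(theta/D^0) = p(theta)
post /= post.sum() * dtheta

rng = np.random.default_rng(3)
for x_n in rng.normal(2.0, 1.0, size=30):       # samples arrive one at a time
    lik = np.exp(-0.5 * (x_n - theta)**2)       # p(x_n/theta)
    post = lik * post                           # p(x_n/theta) p(theta/D^{n-1})
    post /= post.sum() * dtheta                 # divide by normalizing integral

theta_map = theta[np.argmax(post)]              # posterior mode after 30 samples
print(theta_map)
```

Only the running posterior is stored; the raw samples never need to be revisited, which is the point of the incremental formulation.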
Example
• Assume p(x/θ) ~ U(0, θ) and a prior p(θ) that is uniform over [0, 10]; apply the recursion:

p(θ/D^n) = p(xn/θ) p(θ/D^(n-1)) / ∫ p(xn/θ) p(θ/D^(n-1)) dθ

where p(θ/D^0) = p(θ).
Example (cont'd)
(x4 = 8)
In general: p(θ/D^n) ∝ 1/θ^n, for max_x[D^n] ≤ θ ≤ 10
Example (cont’d)
• p(θ/D^4) peaks at θ̂ = 8
[Figure: p(θ/D^n) over successive iterations, starting from the flat prior p(θ/D^0)]
• ML estimate: p(x/θ̂) ~ U(0, 8)
• Bayesian estimate:

p(x/D) = ∫ p(x/θ) p(θ/D) dθ
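A grid-based version of this example: p(x/θ) ~ U(0, θ) with prior U(0, 10). The first three sample values below are hypothetical; the slides only fix x4 = 8:

```python
import numpy as np

theta = np.linspace(0.01, 10.0, 1000)
dtheta = theta[1] - theta[0]
post = np.full_like(theta, 0.1)                 # p(theta/D^0) = U(0, 10)

for x_n in [4.0, 7.0, 2.0, 8.0]:                # x4 = 8
    lik = np.where(theta >= x_n, 1.0 / theta, 0.0)   # U(0, theta) density at x_n
    post = lik * post                           # previous posterior as prior
    post /= post.sum() * dtheta                 # normalize

theta_hat = theta[np.argmax(post)]              # posterior peaks at max(D) = 8
print(theta_hat)
```

The posterior is proportional to 1/θ⁴ on [max(D), 10], so it peaks at θ̂ = 8, matching the figure described above.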
ML vs Bayesian Estimation
• Number of training examples
– The two methods are equivalent given an infinite number of
training examples (and prior distributions that do not
exclude the true solution).
– For small training sets, they give different results in
most cases.
• Computational complexity
– ML uses differential calculus or gradient search for
maximizing the likelihood.
– Bayesian estimation requires complex multidimensional
integration techniques.
ML vs Bayesian Estimation (cont’d)
• Solution complexity
– ML solutions are easier to interpret (i.e., they must be of the
assumed parametric form).
– A Bayesian estimation solution might not be of the
parametric form assumed.
• Prior distribution
– If the prior distribution p(θ) is uniform, Bayesian
estimation solutions are equivalent to ML solutions.
– In general, the two methods will give different solutions.
Computational Complexity
ML estimation (dimensionality: d, # training data: n, # classes: c)
• Learning complexity, using the Gaussian discriminant:

g(x) = -(1/2)(x - μ̂)ᵗ Σ̂⁻¹ (x - μ̂) - (d/2) ln 2π - (1/2) ln|Σ̂| + ln P(ω)

– Sample mean μ̂: O(dn)
– Sample covariance Σ̂: O(d²n)
– Inverse Σ̂⁻¹ and determinant |Σ̂|: O(d³)
– Prior P(ω): O(n); constant terms: O(1)

These computations must be repeated c times (once for each class). Since n > d, the total learning complexity is dominated by O(d²n).
Computational Complexity (cont'd)
Classification complexity (dimensionality: d, # classes: c)
• Evaluating the discriminant

g(x) = -(1/2)(x - μ̂)ᵗ Σ̂⁻¹ (x - μ̂) - (d/2) ln 2π - (1/2) ln|Σ̂| + ln P(ω)

costs O(d²) for the quadratic term and O(1) for the remaining (precomputed) terms.

These computations must be repeated c times (once for each class), and the maximum is taken.
Computational Complexity (cont'd)
Bayesian Estimation
• Learning complexity: higher than ML
• Classification complexity: same as ML
Main Sources of Error in Classifier Design

P(ωi/x, Di) = p(x/ωi, Di) P(ωi) / Σ_j p(x/ωj, Dj) P(ωj)

• Bayes error
– The error due to overlapping densities p(x/ωi).
• Model error
– The error due to choosing an incorrect model.
• Estimation error
– The error due to incorrectly estimated parameters
(e.g., due to a small number of training examples).
Overfitting
• When the number of training examples is inadequate, the
solution obtained might not be optimal.
• Consider the problem of curve fitting:
– The points were sampled from a parabola (plus noise).
– A 10th-degree polynomial fits the data perfectly but does not
generalize well.
– A larger error on the training data might improve generalization!
– As a rule: # training examples > # model parameters.
Overfitting (cont’d)
• Control model complexity:
– Assume a diagonal covariance matrix (i.e., uncorrelated features).
– Use the same covariance matrix for all classes.
• Shrinkage techniques:
– Shrink the individual covariance matrices toward the common covariance Σ:

Σi(α) = [(1-α) ni Σi + α n Σ] / [(1-α) ni + α n],   0 < α < 1

– Shrink the common covariance matrix toward the identity matrix:

Σ(β) = (1-β) Σ + β I,   0 < β < 1
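The two shrinkage steps can be sketched as below. The sample-size-weighted pooled estimate of the common covariance is an assumption for illustration, and the helper name and matrices are made up:

```python
import numpy as np

def shrink_covariances(Sigmas, ns, alpha, beta):
    """Apply the two shrinkage formulas to per-class covariance estimates."""
    n = sum(ns)
    # Assumed pooled estimate of the common covariance Sigma.
    Sigma = sum(ni * Si for ni, Si in zip(ns, Sigmas)) / n
    # Sigma_i(alpha) = [(1-a) n_i Sigma_i + a n Sigma] / [(1-a) n_i + a n]
    shrunk = [((1.0 - alpha) * ni * Si + alpha * n * Sigma)
              / ((1.0 - alpha) * ni + alpha * n)
              for ni, Si in zip(ns, Sigmas)]
    # Sigma(beta) = (1-b) Sigma + b I
    common = (1.0 - beta) * Sigma + beta * np.eye(Sigma.shape[0])
    return shrunk, common

S1 = np.array([[2.0, 0.5], [0.5, 1.0]])
S2 = np.array([[1.0, -0.3], [-0.3, 3.0]])
shrunk, common = shrink_covariances([S1, S2], ns=[30, 70], alpha=0.2, beta=0.1)
print(shrunk[0])
print(common)
```

Setting α = 1 replaces every class covariance by the common one; β > 0 additionally regularizes toward the identity, which helps when n is small relative to d.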