Parameter Estimation:
Maximum Likelihood Estimation
Chapter 3 (Duda et al.) – Sections 3.1-3.2
CS479/679 Pattern Recognition
Dr. George Bebis
Parameter Estimation
• Bayesian Decision Theory allows us to design an optimal classifier, provided we first estimate P(ω_i) and p(x|ω_i):

    $P(\omega_j \mid \mathbf{x}) = \dfrac{p(\mathbf{x} \mid \omega_j)\, P(\omega_j)}{p(\mathbf{x})}$

• Estimating P(ω_i) is usually not very difficult.
• Estimating p(x|ω_i) can be much more difficult:
  – The dimensionality of the feature space is large.
  – The number of samples is often too small.
Parameter Estimation (cont’d)
• We will make the following assumptions:
  – A set of training samples D = {x_1, x_2, ..., x_n} is given, where the samples were drawn independently according to p(x|ω_j).
  – p(x|ω_j) has some known parametric form, e.g.,
        p(x|ω_i) ~ N(μ_i, Σ_i),
    also denoted p(x|θ) where θ = (μ_i, Σ_i).
• Parameter estimation problem:
  Given D, find the best possible θ.
Main Methods in
Parameter Estimation
• Maximum Likelihood (ML)
• Bayesian Estimation (BE)
Main Methods in Parameter Estimation
• Maximum Likelihood (ML)
– The best estimate is obtained by maximizing the probability of obtaining the samples D = {x_1, x_2, ..., x_n} actually observed:

    $p(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n \mid \boldsymbol\theta) \equiv p(D \mid \boldsymbol\theta)$

    $\hat{\boldsymbol\theta} = \arg\max_{\boldsymbol\theta}\, p(D \mid \boldsymbol\theta)$

– ML assumes that θ is fixed and makes a point estimate:

    $p(\mathbf{x} \mid \boldsymbol\theta) \simeq p(\mathbf{x} \mid \hat{\boldsymbol\theta})$
Main Methods in Parameter Estimation
(cont’d)
• Bayesian Estimation (BE)
  – Assumes that θ is a set of random variables with some known a priori distribution p(θ).
  – Estimates a distribution rather than making a point estimate (unlike ML):

    $p(\mathbf{x} \mid D) = \int p(\mathbf{x} \mid \boldsymbol\theta)\, p(\boldsymbol\theta \mid D)\, d\boldsymbol\theta$

  Note: the BE solution p(x|D) might not be of the parametric form assumed for p(x|θ).
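To make the integral concrete, here is a minimal numerical sketch (not from the slides; all names and parameter values are illustrative) that approximates p(x|D) on a parameter grid, for 1-D Gaussian data with unknown mean θ, known variance, and a Gaussian prior:

```python
import numpy as np

def predictive_density(x, D, sigma=1.0, prior_mu=0.0, prior_sigma=10.0):
    """Grid approximation of p(x|D) for N(theta, sigma^2) data."""
    thetas = np.linspace(-20, 20, 2001)          # grid over the parameter
    d_theta = thetas[1] - thetas[0]
    # log p(D|theta) under the independence assumption (constants dropped)
    log_lik = sum(-0.5 * ((xk - thetas) / sigma) ** 2 for xk in D)
    # log prior p(theta) ~ N(prior_mu, prior_sigma^2)
    log_prior = -0.5 * ((thetas - prior_mu) / prior_sigma) ** 2
    post = np.exp(log_lik + log_prior)
    post /= post.sum() * d_theta                  # normalize p(theta|D)
    # p(x|D) ≈ sum over theta of p(x|theta) p(theta|D) d_theta
    lik_x = np.exp(-0.5 * ((x - thetas) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return np.sum(lik_x * post) * d_theta

D = [1.2, 0.8, 1.1, 0.9]
print(predictive_density(1.0, D))
```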
ML Estimation - Assumptions
• Consider c classes and c training data sets (i.e., one for each class):
    D_1, D_2, ..., D_c
• Samples in D_j are drawn independently according to p(x|ω_j).
• Problem: given D_1, D_2, ..., D_c and a model for p(x|ω_j) ~ p(x|θ), estimate:
    θ_1, θ_2, ..., θ_c
ML Estimation - Problem Formulation
• If the samples in D_j provide no information about θ_i (i ≠ j), we need to solve c independent problems (i.e., one for each class).
• The ML estimate for D = {x_1, x_2, ..., x_n} is the value θ̂ that maximizes p(D|θ) (i.e., best supports the training data):

    $\hat{\boldsymbol\theta} = \arg\max_{\boldsymbol\theta}\, p(D \mid \boldsymbol\theta)$

  – Using the independence assumption, we can simplify p(D|θ):

    $p(D \mid \boldsymbol\theta) = p(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n \mid \boldsymbol\theta) = \prod_{k=1}^{n} p(\mathbf{x}_k \mid \boldsymbol\theta)$
ML Estimation - Solution
• How can we find the maximum of p(D|θ)?

    $\nabla_{\boldsymbol\theta}\, p(D \mid \boldsymbol\theta) = 0$

  where $\nabla_{\boldsymbol\theta} = \left[\frac{\partial}{\partial\theta_1}, \ldots, \frac{\partial}{\partial\theta_p}\right]^t$ is the gradient operator.
ML Estimation Using Log-Likelihood
• Taking the log for simplicity:

    $p(D \mid \boldsymbol\theta) = p(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n \mid \boldsymbol\theta) = \prod_{k=1}^{n} p(\mathbf{x}_k \mid \boldsymbol\theta)$

    $\ln p(D \mid \boldsymbol\theta) = \sum_{k=1}^{n} \ln p(\mathbf{x}_k \mid \boldsymbol\theta)$   (log-likelihood)

• Maximize ln p(D|θ):

    $\hat{\boldsymbol\theta} = \arg\max_{\boldsymbol\theta}\, \ln p(D \mid \boldsymbol\theta)$

    $\nabla_{\boldsymbol\theta} \ln p(D \mid \boldsymbol\theta) = 0 \quad\text{or}\quad \sum_{k=1}^{n} \nabla_{\boldsymbol\theta} \ln p(\mathbf{x}_k \mid \boldsymbol\theta) = 0$
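As an illustration (a sketch, not part of the original slides), the following snippet maximizes the log-likelihood by brute-force grid search for 1-D Gaussian data with unknown mean and known variance; the additive constants of the log-density are dropped since they do not affect the argmax:

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.normal(loc=2.0, scale=1.0, size=100)     # i.i.d. samples, true mean 2.0

thetas = np.linspace(0.0, 4.0, 4001)             # candidate values of theta
# log-likelihood ln p(D|theta) = sum_k ln p(x_k|theta), constants dropped
log_lik = np.array([-0.5 * np.sum((D - t) ** 2) for t in thetas])

theta_hat = thetas[np.argmax(log_lik)]
print(theta_hat, D.mean())                       # both close to the true mean
```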
Example
[Figure: training data drawn from a Gaussian with unknown mean and known variance; plots of the likelihood p(D|θ) and the log-likelihood ln p(D|θ), both peaking at θ̂ = μ.]
ML for Multivariate Gaussian Density:
Case of Unknown θ=μ
• Assume p(x|μ) ~ N(μ, Σ), with Σ known:

    $\ln p(\mathbf{x}_k \mid \boldsymbol\mu) = -\frac{1}{2}(\mathbf{x}_k - \boldsymbol\mu)^t \Sigma^{-1} (\mathbf{x}_k - \boldsymbol\mu) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma|$

• Computing the gradient, we have:

    $\nabla_{\boldsymbol\mu} \ln p(D \mid \boldsymbol\mu) = \sum_k \nabla_{\boldsymbol\mu} \ln p(\mathbf{x}_k \mid \boldsymbol\mu) = \sum_k \Sigma^{-1}(\mathbf{x}_k - \boldsymbol\mu)$
ML for Multivariate Gaussian Density:
Case of Unknown θ=μ (cont’d)
• Setting $\nabla_{\boldsymbol\mu} \ln p(D \mid \boldsymbol\mu) = 0$ we have:

    $\sum_{k=1}^{n} \Sigma^{-1}(\mathbf{x}_k - \boldsymbol\mu) = 0 \quad\text{or}\quad \sum_{k=1}^{n} \mathbf{x}_k - n\boldsymbol\mu = 0$

• The solution μ̂ is given by:

    $\hat{\boldsymbol\mu} = \frac{1}{n}\sum_{k=1}^{n} \mathbf{x}_k$

  The ML estimate is simply the "sample mean".
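A quick numerical check (illustrative, not from the slides; Σ and the true mean are made up): the gradient condition is satisfied at the sample mean regardless of the value of Σ:

```python
import numpy as np

rng = np.random.default_rng(1)
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])       # known covariance
X = rng.multivariate_normal(mean=[1.0, -1.0], cov=Sigma, size=500)

mu_hat = X.mean(axis=0)                          # sample mean
# residual of sum_k Sigma^{-1}(x_k - mu) at mu_hat (should be ~0)
residual = np.linalg.inv(Sigma) @ (X - mu_hat).sum(axis=0)
print(mu_hat, residual)
```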
Special Case: Maximum A-Posteriori
Estimator (MAP)
• Assume that θ is a random variable with known p(θ), and consider:

    $p(\boldsymbol\theta \mid D) = \frac{p(D \mid \boldsymbol\theta)\, p(\boldsymbol\theta)}{p(D)}$

• Maximize p(θ|D), or equivalently p(D|θ)p(θ), or ln p(D|θ)p(θ):

    $\prod_{k=1}^{n} p(\mathbf{x}_k \mid \boldsymbol\theta)\, p(\boldsymbol\theta)$

    $\sum_{k=1}^{n} \ln p(\mathbf{x}_k \mid \boldsymbol\theta) + \ln p(\boldsymbol\theta)$
Special Case: Maximum A-Posteriori
Estimator (MAP) (cont’d)
• What happens when p(θ) is uniform? The objective

    $\sum_{k=1}^{n} \ln p(\mathbf{x}_k \mid \boldsymbol\theta) + \ln p(\boldsymbol\theta)$

  reduces to

    $\sum_{k=1}^{n} \ln p(\mathbf{x}_k \mid \boldsymbol\theta)$

  so MAP is equivalent to ML.
MAP for Multivariate Gaussian Density:
Case of Unknown θ=μ
• Assume p(x|μ) ~ N(μ, Σ_μ) with Σ_μ = diag(σ_μ²), and p(μ) ~ N(μ_0, Σ_{μ_0}) with Σ_{μ_0} = diag(σ_{μ_0}²), where both μ_0 and σ_{μ_0} are known.
• Maximize ln p(μ|D) = ln p(D|μ)p(μ):

    $\sum_{k=1}^{n} \ln p(\mathbf{x}_k \mid \boldsymbol\mu) + \ln p(\boldsymbol\mu)$

    $\nabla_{\boldsymbol\mu}\left(\sum_{k=1}^{n} \ln p(\mathbf{x}_k \mid \boldsymbol\mu) + \ln p(\boldsymbol\mu)\right) = 0$
MAP for Multivariate Gaussian Density:
Case of Unknown θ=μ (cont’d)
• Setting the gradient to zero gives:

    $\sum_{k=1}^{n} \frac{1}{\sigma_\mu^2}(\mathbf{x}_k - \boldsymbol\mu) - \frac{1}{\sigma_{\mu_0}^2}(\boldsymbol\mu - \boldsymbol\mu_0) = 0$

• Solving for μ, the MAP estimate is:

    $\hat{\boldsymbol\mu} = \dfrac{\boldsymbol\mu_0 + \frac{\sigma_{\mu_0}^2}{\sigma_\mu^2}\sum_{k=1}^{n}\mathbf{x}_k}{1 + \frac{\sigma_{\mu_0}^2}{\sigma_\mu^2}\, n}$

• If $\frac{\sigma_{\mu_0}^2}{\sigma_\mu^2} \gg 1$ (a very broad prior), then $\hat{\boldsymbol\mu} = \frac{1}{n}\sum_{k=1}^{n}\mathbf{x}_k$ (the ML solution).

• What happens when $\sigma_{\mu_0} \to 0$? The prior dominates and $\hat{\boldsymbol\mu} = \boldsymbol\mu_0$.
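A minimal sketch (illustrative names and values, 1-D case) of the closed-form MAP estimate above, showing both limiting behaviors:

```python
import numpy as np

def map_mean(D, sigma, mu0, sigma0):
    """MAP estimate of a Gaussian mean with prior N(mu0, sigma0^2)."""
    n = len(D)
    r = sigma0**2 / sigma**2                     # prior-to-noise variance ratio
    return (mu0 + r * np.sum(D)) / (1 + r * n)

rng = np.random.default_rng(2)
D = rng.normal(loc=3.0, scale=1.0, size=20)

print(map_mean(D, sigma=1.0, mu0=0.0, sigma0=100.0))  # broad prior -> ~ sample mean
print(map_mean(D, sigma=1.0, mu0=0.0, sigma0=1e-6))   # tight prior -> ~ mu0
print(D.mean())
```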
ML for Univariate Gaussian Density:
Case of Unknown θ=(μ,σ2)
• Assume p(x|θ) ~ N(μ, σ²), with θ = (θ_1, θ_2) = (μ, σ²):

    $\ln p(x_k \mid \boldsymbol\theta) = -\frac{1}{2}\ln 2\pi\sigma^2 - \frac{1}{2\sigma^2}(x_k - \mu)^2$

  or, in terms of θ:

    $\ln p(x_k \mid \boldsymbol\theta) = -\frac{1}{2}\ln 2\pi\theta_2 - \frac{1}{2\theta_2}(x_k - \theta_1)^2$

• Differentiating with respect to θ_1 and θ_2:

    $\nabla_{\boldsymbol\theta} \ln p(x_k \mid \boldsymbol\theta) = \begin{bmatrix} \frac{1}{\theta_2}(x_k - \theta_1) \\ -\frac{1}{2\theta_2} + \frac{(x_k - \theta_1)^2}{2\theta_2^2} \end{bmatrix}$
ML for Univariate Gaussian Density:
Case of Unknown θ=(μ,σ2) (cont’d)
• Setting $\sum_{k=1}^{n} \nabla_{\boldsymbol\theta} \ln p(x_k \mid \boldsymbol\theta) = 0$ gives the conditions:

    $\sum_{k=1}^{n} \frac{1}{\hat\theta_2}(x_k - \hat\theta_1) = 0 \quad\text{and}\quad -\sum_{k=1}^{n} \frac{1}{2\hat\theta_2} + \sum_{k=1}^{n} \frac{(x_k - \hat\theta_1)^2}{2\hat\theta_2^2} = 0$

• The solutions are given by:

    $\hat\mu = \frac{1}{n}\sum_{k=1}^{n} x_k$   (sample mean)

    $\hat\sigma^2 = \frac{1}{n}\sum_{k=1}^{n} (x_k - \hat\mu)^2$   (sample variance)
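As a sanity check (a sketch, not from the slides; the data parameters are made up), the closed-form solutions can be compared against a coarse grid search over θ = (μ, σ²):

```python
import numpy as np

rng = np.random.default_rng(3)
D = rng.normal(loc=1.0, scale=2.0, size=200)
n = len(D)

mu_hat = D.mean()
sigma2_hat = np.mean((D - mu_hat) ** 2)          # note: 1/n, not 1/(n-1)

def log_lik(mu, s2):
    """Full log-likelihood ln p(D|mu, s2) for univariate Gaussian data."""
    return -0.5 * n * np.log(2 * np.pi * s2) - np.sum((D - mu) ** 2) / (2 * s2)

mus = np.linspace(0.0, 2.0, 201)
s2s = np.linspace(1.0, 8.0, 701)
grid = np.array([[log_lik(m, s) for s in s2s] for m in mus])
i, j = np.unravel_index(grid.argmax(), grid.shape)
print((mu_hat, sigma2_hat), (mus[i], s2s[j]))    # agree up to grid resolution
```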
ML for Multivariate Gaussian Density:
Case of Unknown θ=(μ,Σ)
• In the general case (i.e., multivariate Gaussian with both μ and Σ unknown) the solutions are:

    $\hat{\boldsymbol\mu} = \frac{1}{n}\sum_{k=1}^{n} \mathbf{x}_k$   (sample mean)

    $\hat\Sigma = \frac{1}{n}\sum_{k=1}^{n} (\mathbf{x}_k - \hat{\boldsymbol\mu})(\mathbf{x}_k - \hat{\boldsymbol\mu})^t$   (sample covariance)
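In NumPy these estimates are one-liners (an illustrative sketch; the distribution parameters are made up). Note that np.cov defaults to the 1/(n−1) normalization, so bias=True is needed to reproduce the ML estimate:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.multivariate_normal(mean=[0.0, 1.0], cov=[[1.0, 0.3], [0.3, 2.0]], size=300)
n = X.shape[0]

mu_hat = X.mean(axis=0)                          # sample mean
centered = X - mu_hat
Sigma_hat = centered.T @ centered / n            # sample covariance (1/n)

print(np.allclose(Sigma_hat, np.cov(X, rowvar=False, bias=True)))  # True
```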
Biased and Unbiased Estimates
• An estimate θ̂ is unbiased when $E[\hat{\boldsymbol\theta}] = \boldsymbol\theta$.
• The ML estimate μ̂ is unbiased, i.e., $E[\hat{\boldsymbol\mu}] = \boldsymbol\mu$.
• The ML estimates σ̂² and Σ̂ are biased:

    $E[\hat\sigma^2] = \frac{n-1}{n}\,\sigma^2 \qquad E[\hat\Sigma] = \frac{n-1}{n}\,\Sigma$
Biased and Unbiased Estimates (cont’d)
• The following are unbiased estimates of σ² and Σ:

    $\hat\sigma^2 = \frac{1}{n-1}\sum_{k=1}^{n} (x_k - \hat\mu)^2$

    $\hat\Sigma = \frac{1}{n-1}\sum_{k=1}^{n} (\mathbf{x}_k - \hat{\boldsymbol\mu})(\mathbf{x}_k - \hat{\boldsymbol\mu})^t$
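A short Monte Carlo check (illustrative, not from the slides) of the bias factor (n−1)/n for the 1/n variance estimate:

```python
import numpy as np

rng = np.random.default_rng(5)
n, trials, sigma2 = 5, 100_000, 4.0

samples = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(trials, n))
mu_hat = samples.mean(axis=1, keepdims=True)
ss = ((samples - mu_hat) ** 2).sum(axis=1)       # sum of squared deviations

print(np.mean(ss / n))        # ~ (n-1)/n * sigma2 = 3.2  (biased, ML)
print(np.mean(ss / (n - 1)))  # ~ sigma2 = 4.0            (unbiased)
```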
Comments
• ML estimation is simpler than alternative methods (e.g., Bayesian estimation).
• ML provides increasingly accurate estimates as the number of training samples grows.
• If the model for p(x|θ) is correct and the samples really are drawn independently, ML works well.