Lecture 12: Parameter Estimation

PROBABILITY AND STATISTICS
FOR ENGINEERING
Principles of Parameter Estimation
Hossein Sameti
Department of Computer Engineering
Sharif University of Technology
The Estimation Problem
 We use the various concepts introduced and studied in earlier lectures to
solve practical problems of interest.
 Consider the problem of estimating an unknown parameter of interest
from a few of its noisy observations.
- the daily temperature in a city
- the depth of a river at a particular spot
 Observations (measurements) are made on data that contain the desired nonrandom parameter $\theta$ and undesired noise.
The Estimation Problem
 For example, Observation = signal (desired part) + noise,
 or, the $i$-th observation can be represented as
$X_i = \theta + n_i, \quad i = 1, 2, \ldots, n.$
 $\theta$: the unknown nonrandom desired parameter
 $n_i,\ i = 1, 2, \ldots, n$: random variables that may be dependent or independent from observation to observation.
 The Estimation Problem:
- Given $n$ observations $X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n$, obtain the “best” estimator for the unknown parameter $\theta$ in terms of these observations.
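 As a quick illustration (not part of the original slides), the observation model can be simulated as follows; the Gaussian noise, the numbers, and the variable names are assumptions made for this example.

import numpy as np

# Simulate the observation model X_i = theta + n_i for a fixed but unknown theta.
# The noise is taken to be zero-mean Gaussian purely for illustration.
rng = np.random.default_rng(0)
theta = 25.0                                  # e.g. the true daily temperature
sigma = 2.0                                   # assumed noise standard deviation
n = 10
x = theta + sigma * rng.standard_normal(n)    # noisy observations x_1, ..., x_n
print(x)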
Estimators
 Let us denote by $\hat\theta(X)$ the estimator for $\theta$.
 Obviously $\hat\theta(X)$ is a function of only the observations.
 “Best estimator” in what sense?
 Ideal solution: the estimate $\hat\theta(X)$ coincides with the unknown $\theta$.
 Almost always, any estimate will result in an error given by
$e = \hat\theta(X) - \theta.$
 One strategy would be to select the estimator $\hat\theta(X)$ so as to minimize some function of this error:
- the mean square error (leading to the MMSE estimator; written out below),
- the absolute value of the error,
- etc.
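 For concreteness (this expression is not written out on the slide, but it is the standard form of the criterion), the mean-square-error strategy picks $\hat\theta(X)$ to minimize
$E\big[(\hat\theta(X) - \theta)^2\big],$
while the absolute-error strategy minimizes $E\big[\,|\hat\theta(X) - \theta|\,\big]$.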
A More Fundamental Approach: Principle of Maximum Likelihood
 Underlying Assumption: the available data $X_1, X_2, \ldots, X_n$ have something to do with the unknown parameter $\theta$.
 We assume that the joint p.d.f of $X_1, X_2, \ldots, X_n$, namely $f_X(x_1, x_2, \ldots, x_n; \theta)$, depends on $\theta$.
 This method
- assumes that the given sample data set is representative of the population $f_X(x_1, x_2, \ldots, x_n; \theta)$
- chooses the value for $\theta$ that most likely caused the observed data to occur
Principle of Maximum Likelihood
 In other words, given the observations $x_1, x_2, \ldots, x_n$, $f_X(x_1, x_2, \ldots, x_n; \theta)$ is a function of $\theta$ alone.
 The value of $\theta$ that maximizes this p.d.f is the most likely value for $\theta$, and it is chosen as the ML estimate $\hat\theta_{ML}(X)$ for $\theta$.
[Figure: the likelihood function $f_X(x_1, x_2, \ldots, x_n; \theta)$ plotted as a function of $\theta$, with its peak at $\hat\theta_{ML}(X)$.]
 Given $X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n$, the joint p.d.f $f_X(x_1, x_2, \ldots, x_n; \theta)$ represents the likelihood function.
 The ML estimate can be determined either from
- the likelihood equation
$\hat\theta_{ML} = \arg\sup_{\theta}\, f_X(x_1, x_2, \ldots, x_n; \theta),$
- or using the log-likelihood function
$L(x_1, x_2, \ldots, x_n; \theta) = \log f_X(x_1, x_2, \ldots, x_n; \theta).$
 If $L(x_1, x_2, \ldots, x_n; \theta)$ is differentiable and a supremum $\hat\theta_{ML}$ exists in the above equation, then it must satisfy
$\left.\frac{\partial \log f_X(x_1, x_2, \ldots, x_n; \theta)}{\partial \theta}\right|_{\theta = \hat\theta_{ML}} = 0.$
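 In practice the (log-)likelihood is often maximized numerically rather than in closed form. Below is a minimal sketch, assuming a Gaussian model $X_i \sim N(\theta, \sigma^2)$ with known $\sigma$; the data values are hypothetical.

import numpy as np
from scipy.optimize import minimize_scalar

# Maximize the log-likelihood by minimizing its negative.
# Assumed model: X_i ~ N(theta, sigma^2) with sigma known; data are made up.
x = np.array([24.1, 26.3, 25.2, 24.8, 25.9])
sigma = 1.0

def neg_log_likelihood(theta):
    # -log f_X(x; theta), dropping additive constants that do not depend on theta
    return 0.5 * np.sum((x - theta) ** 2) / sigma**2

res = minimize_scalar(neg_log_likelihood)
print(res.x, x.mean())   # the numerical maximizer matches the sample mean

 For this Gaussian model the next example derives the maximizer analytically.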
Example
 Let $X_i = \theta + w_i,\ i = 1, \ldots, n,$ represent $n$ observations, where $\theta$ is the unknown parameter of interest,
 $w_i,\ i = 1, \ldots, n,$ are zero-mean independent normal r.vs with common variance $\sigma^2$.
 Determine the ML estimate for $\theta$.
Solution
 Since the $w_i$'s are independent r.vs and $\theta$ is an unknown constant, the $X_i$'s are independent normal random variables.
 Thus the likelihood function takes the form
$f_X(x_1, x_2, \ldots, x_n; \theta) = \prod_{i=1}^{n} f_{X_i}(x_i; \theta).$
Example - continued
 Each $X_i$ is Gaussian with mean $\theta$ and variance $\sigma^2$ (why?).
 Thus
$f_{X_i}(x_i; \theta) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x_i - \theta)^2 / 2\sigma^2}.$
 Therefore the likelihood function is
$f_X(x_1, x_2, \ldots, x_n; \theta) = \frac{1}{(2\pi\sigma^2)^{n/2}}\, e^{-\sum_{i=1}^{n}(x_i - \theta)^2 / 2\sigma^2}.$
 It is easier to work with the log-likelihood function $L(X; \theta)$ in this case.
Example - continued
 We obtain
$L(X; \theta) = \ln f_X(x_1, x_2, \ldots, x_n; \theta) = -\frac{n}{2}\ln(2\pi\sigma^2) - \sum_{i=1}^{n}\frac{(x_i - \theta)^2}{2\sigma^2},$
 and taking the derivative with respect to $\theta$, we get
$\left.\frac{\partial \ln f_X(x_1, x_2, \ldots, x_n; \theta)}{\partial \theta}\right|_{\theta = \hat\theta_{ML}} = \left.\sum_{i=1}^{n}\frac{2(x_i - \theta)}{2\sigma^2}\right|_{\theta = \hat\theta_{ML}} = 0,$
 or
$\hat\theta_{ML}(X) = \frac{1}{n}\sum_{i=1}^{n} X_i.$
 This linear estimator represents the ML estimate for $\theta$.
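 As a quick check (a step not shown on the slide), the second derivative of the log-likelihood is
$\frac{\partial^2 L(X; \theta)}{\partial \theta^2} = -\frac{n}{\sigma^2} < 0,$
so the stationary point is indeed a maximum.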
Unbiased Estimators
 Notice that the estimator is a r.v. Taking its expected value, we get
$E[\hat\theta_{ML}(X)] = \frac{1}{n}\sum_{i=1}^{n} E(X_i) = \theta,$
 i.e., the expected value of the estimator does not differ from the desired parameter, and hence there is no bias between the two.
 Such estimators are known as unbiased estimators.
 $\hat\theta_{ML}(X) = \frac{1}{n}\sum_{i=1}^{n} X_i$ represents an unbiased estimator for $\theta$.
Consistent Estimators
 Moreover, the variance of the estimator is given by
$\operatorname{Var}(\hat\theta_{ML}) = E[(\hat\theta_{ML} - \theta)^2] = E\left[\left(\frac{1}{n}\sum_{i=1}^{n} X_i - \theta\right)^{2}\right] = \frac{1}{n^2}\,E\left[\left(\sum_{i=1}^{n}(X_i - \theta)\right)^{2}\right]$
$= \frac{1}{n^2}\left[\sum_{i=1}^{n} E\big[(X_i - \theta)^2\big] + \sum_{i=1}^{n}\sum_{\substack{j=1 \\ j \neq i}}^{n} E\big[(X_i - \theta)(X_j - \theta)\big]\right].$
 The latter terms are zero since $X_i$ and $X_j$ are independent r.vs.
 So,
$\operatorname{Var}(\hat\theta_{ML}) = \frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var}(X_i) = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}.$
 And
$\operatorname{Var}(\hat\theta_{ML}) \to 0 \quad \text{as} \quad n \to \infty,$
 another desired property. We say estimators that satisfy this limit are
consistent estimators.
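 A small Monte Carlo experiment (illustrative only, not from the slides) makes the unbiasedness and the $\sigma^2/n$ variance visible; the chosen $\theta$, $\sigma$, and sample sizes are arbitrary.

import numpy as np

# Monte Carlo check: the sample mean of X_i = theta + w_i should have
# empirical mean close to theta and empirical variance close to sigma^2 / n.
rng = np.random.default_rng(1)
theta, sigma, trials = 5.0, 2.0, 20000
for n in (10, 100, 1000):
    x = theta + sigma * rng.standard_normal((trials, n))
    est = x.mean(axis=1)                      # theta_hat_ML for each trial
    print(n, est.mean(), est.var(), sigma**2 / n)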
Example
 Let $X_1, X_2, \ldots, X_n$ be i.i.d. uniform random variables in $(0, \theta)$ with common p.d.f
$f_{X_i}(x_i; \theta) = \frac{1}{\theta}, \quad 0 < x_i \le \theta,$
 where $\theta$ is an unknown parameter. Find the ML estimate for $\theta$.
Solution
 The likelihood function in this case is given by
$f_X(x_1, x_2, \ldots, x_n; \theta) = \prod_{i=1}^{n}\frac{1}{\theta} = \frac{1}{\theta^{n}}, \quad 0 < x_i \le \theta,$
$= \frac{1}{\theta^{n}}, \quad 0 < \max(x_1, x_2, \ldots, x_n) \le \theta.$
 The likelihood function here is maximized by the minimum value of $\theta$,
Example - continued
 and since $\theta \ge \max(X_1, X_2, \ldots, X_n)$, we get
$\hat\theta_{ML}(X) = \max(X_1, X_2, \ldots, X_n)$
to be the ML estimate for $\theta$,
 a nonlinear function of the observations.
 Is this an unbiased estimate for $\theta$? We need to evaluate its mean.
 It is easier to determine its p.d.f and proceed directly.
 Let $Z = \max(X_1, X_2, \ldots, X_n)$, where $f_{X_i}(x_i; \theta) = \frac{1}{\theta},\ 0 < x_i \le \theta.$
Example - continued
 Then
$F_Z(z) = P[\max(X_1, X_2, \ldots, X_n) \le z] = P(X_1 \le z, X_2 \le z, \ldots, X_n \le z) = \prod_{i=1}^{n} P(X_i \le z) = \prod_{i=1}^{n} F_{X_i}(z) = \left(\frac{z}{\theta}\right)^{n},$
 so that
$f_Z(z) = \begin{cases} \dfrac{n z^{n-1}}{\theta^{n}}, & 0 \le z \le \theta, \\ 0, & \text{otherwise}. \end{cases}$
 Using the above, we get
$E[\hat\theta_{ML}(X)] = E(Z) = \int_{0}^{\theta} z\, f_Z(z)\, dz = \frac{n}{\theta^{n}}\int_{0}^{\theta} z^{n}\, dz = \frac{n}{n+1}\,\theta = \frac{\theta}{1 + 1/n}.$
Example - continued
 In this case $E[\hat\theta_{ML}(X)] \ne \theta$, so the ML estimator is not an unbiased estimator for $\theta$.
 However, note that as $n \to \infty$,
$\lim_{n \to \infty} E[\hat\theta_{ML}(X)] = \lim_{n \to \infty}\frac{\theta}{1 + 1/n} = \theta,$
 i.e., the ML estimator is an asymptotically unbiased estimator.
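 A Monte Carlo check of this bias (again illustrative, with arbitrary $\theta$ and sample sizes) compares the empirical mean of $\max(X_1, \ldots, X_n)$ with $n\theta/(n+1)$:

import numpy as np

# For X_i ~ Uniform(0, theta), E[max(X_1, ..., X_n)] = n*theta/(n+1):
# biased low for finite n, approaching theta as n grows.
rng = np.random.default_rng(2)
theta, trials = 4.0, 50000
for n in (5, 50, 500):
    z = rng.uniform(0.0, theta, size=(trials, n)).max(axis=1)
    print(n, z.mean(), n * theta / (n + 1))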
 Also,
$E(Z^2) = \int_{0}^{\theta} z^{2} f_Z(z)\, dz = \frac{n}{\theta^{n}}\int_{0}^{\theta} z^{n+1}\, dz = \frac{n\,\theta^{2}}{n+2},$
 so that
$\operatorname{Var}[\hat\theta_{ML}(X)] = E(Z^2) - [E(Z)]^2 = \frac{n\,\theta^{2}}{n+2} - \frac{n^{2}\theta^{2}}{(n+1)^{2}} = \frac{n\,\theta^{2}}{(n+1)^{2}(n+2)}.$
 Var[ˆML ( X )]  0 as n  , implying that this estimator is a consistent
estimator.
Example
 Let $X_1, X_2, \ldots, X_n$ be i.i.d. Gamma random variables with unknown parameters $\alpha$ and $\beta$.
 Determine the ML estimator for $\alpha$ and $\beta$.
Solution
 Here $x_i > 0$, and
$f_X(x_1, x_2, \ldots, x_n; \alpha, \beta) = \frac{\beta^{n\alpha}}{(\Gamma(\alpha))^{n}}\prod_{i=1}^{n} x_i^{\alpha - 1}\, e^{-\beta\sum_{i=1}^{n} x_i}.$
 This gives the log-likelihood function to be
$L(x_1, x_2, \ldots, x_n; \alpha, \beta) = \log f_X(x_1, x_2, \ldots, x_n; \alpha, \beta)$
$= n\alpha\log\beta - n\log\Gamma(\alpha) + (\alpha - 1)\sum_{i=1}^{n}\log x_i - \beta\sum_{i=1}^{n} x_i.$
Example - continued
 Differentiating $L$ with respect to $\alpha$ and $\beta$, we get
$\left.\frac{\partial L}{\partial \alpha}\right|_{\alpha = \hat\alpha,\ \beta = \hat\beta} = n\log\hat\beta - n\,\frac{\Gamma'(\hat\alpha)}{\Gamma(\hat\alpha)} + \sum_{i=1}^{n}\log x_i = 0,$
$\left.\frac{\partial L}{\partial \beta}\right|_{\alpha = \hat\alpha,\ \beta = \hat\beta} = \frac{n\hat\alpha}{\hat\beta} - \sum_{i=1}^{n} x_i = 0.$
 Thus,
$\hat\beta_{ML}(X) = \frac{\hat\alpha_{ML}}{\frac{1}{n}\sum_{i=1}^{n} x_i},$
 so,
$\log\hat\alpha_{ML} - \frac{\Gamma'(\hat\alpha_{ML})}{\Gamma(\hat\alpha_{ML})} = \log\left(\frac{1}{n}\sum_{i=1}^{n} x_i\right) - \frac{1}{n}\sum_{i=1}^{n}\log x_i.$
 Notice that this equation is highly nonlinear in $\hat\alpha_{ML}$.
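 Since this equation has no closed-form solution, $\hat\alpha_{ML}$ is usually found numerically. A minimal sketch, assuming SciPy is available and using simulated data with arbitrary true parameters (an illustration, not part of the lecture):

import numpy as np
from scipy.special import digamma      # Gamma'(a) / Gamma(a)
from scipy.optimize import brentq

# Solve log(alpha) - digamma(alpha) = log(mean(x)) - mean(log(x)) for alpha_hat,
# then beta_hat = alpha_hat / mean(x). Data are simulated for illustration.
rng = np.random.default_rng(3)
x = rng.gamma(shape=2.5, scale=1.0 / 1.5, size=1000)   # true alpha = 2.5, beta = 1.5

s = np.log(x.mean()) - np.log(x).mean()                # always > 0 by Jensen's inequality
alpha_hat = brentq(lambda a: np.log(a) - digamma(a) - s, 1e-6, 1e6)
beta_hat = alpha_hat / x.mean()
print(alpha_hat, beta_hat)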
Conclusion
 In general the (log)-likelihood function
- can have more than one solution, or no solutions at all.
- may not even be differentiable
- can be extremely complicated to solve explicitly
Best Unbiased Estimator
 We have seen that $\hat\theta_{ML}(X) = \frac{1}{n}\sum_{i=1}^{n} X_i$ represents an unbiased estimator for $\theta$ with variance $\frac{\sigma^2}{n}$.
 It is possible that, for a given $n$, there may be other unbiased estimators for this problem with even lower variance.
 If such is indeed the case, those estimators will naturally be preferable to the previous one.
 Is it possible to determine the lowest possible value for the variance of
any unbiased estimator?
 A theorem by Cramer and Rao gives a complete answer to this problem.
Cramer - Rao Bound
 The variance of any unbiased estimator $\hat\theta$ based on observations $X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n$ for $\theta$ must satisfy the lower bound
$\operatorname{Var}(\hat\theta) \;\ge\; \frac{1}{E\left[\left(\dfrac{\partial\ln f_X(x_1, x_2, \ldots, x_n; \theta)}{\partial\theta}\right)^{2}\right]} \;=\; \frac{-1}{E\left[\dfrac{\partial^{2}\ln f_X(x_1, x_2, \ldots, x_n; \theta)}{\partial\theta^{2}}\right]}.$
 The right-hand side of the above equation acts as a lower bound on the variance of all unbiased estimators for $\theta$, provided their joint p.d.f satisfies certain regularity restrictions (see (8-79)-(8-81), Text).
Efficient Estimators
 Any unbiased estimator whose variance coincides with the Cramer-Rao bound must be the best.
 Such estimators are known as efficient estimators.
 Let us examine whether $\hat\theta_{ML}(X) = \frac{1}{n}\sum_{i=1}^{n} X_i$ is efficient.
$\left(\frac{\partial\ln f_X(x_1, x_2, \ldots, x_n; \theta)}{\partial\theta}\right)^{2} = \frac{1}{\sigma^{4}}\left(\sum_{i=1}^{n}(X_i - \theta)\right)^{2};$
 and
$E\left[\left(\frac{\partial\ln f_X(x_1, x_2, \ldots, x_n; \theta)}{\partial\theta}\right)^{2}\right] = \frac{1}{\sigma^{4}}\left[\sum_{i=1}^{n} E\big[(X_i - \theta)^{2}\big] + \sum_{i=1}^{n}\sum_{\substack{j=1\\ j\ne i}}^{n} E\big[(X_i - \theta)(X_j - \theta)\big]\right] = \frac{1}{\sigma^{4}}\, n\sigma^{2} = \frac{n}{\sigma^{2}}.$
 So the Cramer-Rao lower bound is $\frac{\sigma^{2}}{n}$.
Rao-Blackwell Theorem
 As we obtained before, the variance of this ML estimator equals the specified bound, so it is an efficient estimator.
 If there are no unbiased estimators that are efficient, the best estimator
will be an unbiased estimator with the lowest possible variance.
 How does one find such an unbiased estimator?
 Rao-Blackwell theorem gives a complete answer to this problem.
 The Cramer-Rao bound can be extended to the multiparameter case as well.
Estimating Parameters with a-priori p.d.f
 So far, we discussed nonrandom parameters that are unknown.
 What if the parameter of interest is a r.v with a-priori p.d.f $f_{\theta}(\theta)$?
 How does one obtain a good estimate for $\theta$ based on the observations $X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n$?
 One technique is to use the observations to compute its a-posteriori p.d.f $f_{\theta|X}(\theta \mid x_1, x_2, \ldots, x_n)$.
 Of course, we can use Bayes' theorem to obtain this a-posteriori p.d.f:
$f_{\theta|X}(\theta \mid x_1, x_2, \ldots, x_n) = \frac{f_{X|\theta}(x_1, x_2, \ldots, x_n \mid \theta)\, f_{\theta}(\theta)}{f_X(x_1, x_2, \ldots, x_n)}.$
 Notice that this is only a function of $\theta$, since $x_1, x_2, \ldots, x_n$ represent given observations.
MAP Estimator
 Once again, we can look for the most probable value of $\theta$ suggested by the above a-posteriori p.d.f.
 Naturally, the most likely value for $\theta$ is the one corresponding to the maximum of the a-posteriori p.d.f (the MAP estimator for $\theta$).
f ( | x1 , x2 , , xn )
ˆ

MAP
 It is possible to use other optimality criteria as well.
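 As a concrete sketch (not from the slides), the MAP estimate can be computed numerically by maximizing the log-prior plus the log-likelihood; the Gaussian prior/likelihood pair and all numbers below are assumptions chosen for illustration.

import numpy as np
from scipy.optimize import minimize_scalar

# MAP sketch: maximize log f(theta) + log f(x | theta), i.e. minimize the negative.
# Assumed model: theta ~ N(mu0, tau^2) a priori, X_i | theta ~ N(theta, sigma^2).
rng = np.random.default_rng(4)
mu0, tau, sigma = 0.0, 1.0, 2.0
x = 1.5 + sigma * rng.standard_normal(20)     # observations from a "true" theta = 1.5

def neg_log_posterior(theta):
    log_prior = -0.5 * (theta - mu0) ** 2 / tau**2
    log_lik = -0.5 * np.sum((x - theta) ** 2) / sigma**2
    return -(log_prior + log_lik)

theta_map = minimize_scalar(neg_log_posterior).x
print(theta_map, x.mean())   # the MAP estimate is pulled from the sample mean toward mu0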