Introduction to Estimation Theory: A Tutorial
Volkan Cevher
Georgia Institute of Technology, Center for Signal and Image Processing

Outline

– Introduction
– Terminology and Preliminaries
– Bayesian (Random) Parameter Estimation
– Nonrandom Parameter Estimation
– Questions

Introduction

– Classical detection problem:
  – Design of optimum procedures for deciding between possible statistical situations given a random observation:

    H_0 : Y_k \sim P \in \mathcal{P}_0, \quad k = 1, \ldots, n
    H_1 : Y_k \sim P \in \mathcal{P}_1, \quad k = 1, \ldots, n

  – The model has the following components:
    – Parameter Space (for parametric detection problems)
    – Probabilistic Mapping from Parameter Space to Observation Space
    – Observation Space
    – Detection Rule

– Parameter Space:
  – Completely characterizes the output given the mapping.
  – Each hypothesis corresponds to a point in the parameter space. This mapping is one-to-one.
– Probabilistic Mapping from Parameter Space to Observation Space:
  – The probability law that governs the effect of a parameter on the observation.

Example 1: (The original slide shows a diagram of the probabilistic mapping from the parameter space to the observation space.) The observation is

    Y_k = N_k, \quad \text{where } N_k \sim
      \begin{cases}
        N(0, \sigma^2)  & \text{with probability } 1/2, \\
        N(1, \sigma^2)  & \text{with probability } 1/4, \\
        N(-1, \sigma^2) & \text{with probability } 1/4,
      \end{cases}

and the parameter space is \Lambda = [-1 \;\; 0 \;\; 1]^T.
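
Below is a minimal numpy sketch of the model in Example 1, intended only to make the "parameter space / probabilistic mapping / observation space" picture concrete; the value σ = 1 and the sample size are assumptions made for illustration.

    import numpy as np

    # Sketch of Example 1: a point of the parameter space {-1, 0, 1} is drawn with
    # prior probabilities 1/2, 1/4, 1/4 and then mapped probabilistically to the
    # observation space through Gaussian noise. sigma = 1.0 is assumed here.
    rng = np.random.default_rng(0)
    sigma, n = 1.0, 5

    theta = rng.choice([0.0, 1.0, -1.0], p=[0.5, 0.25, 0.25])  # parameter space point
    y = rng.normal(loc=theta, scale=sigma, size=n)             # observation space point
    print("true parameter:", theta, "observations:", y)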

– Observation Space:
  – The observation takes values in a finite-dimensional space, i.e., Γ ⊆ ℝ^n, where n is finite.
– Detection Rule:
  – A mapping of the observation space into points of the parameter space is called a detection rule.

– Classical estimation problem:
  – We are interested not in making a choice among several discrete situations, but rather in making a choice among a continuum of possible states.
  – Think of a family of distributions on the observation space, indexed by a set of parameters.
  – Given the observation, determine as accurately as possible the actual value of the parameter.

Example 2:

    Y_k = N_k, \quad N_k \sim N(\mu, \sigma^2)

In this example, given the observations, the parameter μ is being estimated. Its value is not chosen from a set of discrete values, but rather is estimated as accurately as possible.
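
As a quick illustration of Example 2, the sketch below draws samples from N(μ, σ²) and estimates μ by the sample mean; the particular values μ = 2 and σ = 1 are assumptions for illustration only, and the sample mean is just one natural (not necessarily optimal) estimate.

    import numpy as np

    # Example 2 in code: observe Y_k ~ N(mu, sigma^2) and estimate the continuous
    # parameter mu from the data. mu = 2.0 and sigma = 1.0 are assumed values.
    rng = np.random.default_rng(1)
    mu, sigma, n = 2.0, 1.0, 1000
    y = rng.normal(mu, sigma, size=n)
    print("sample-mean estimate of mu:", y.mean())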

– The estimation problem has the same components as the detection problem:
  – Parameter Space
  – Probabilistic Mapping from Parameter Space to Observation Space
  – Observation Space
  – Estimation Rule
– The detection problem can be thought of as a special case of the estimation problem.
– There are a variety of estimation procedures, differing basically in the amount of prior information about the parameter and in the performance criteria applied.
– Estimation theory is less structured than detection theory: "Detection is science, estimation is art." (Array Signal Processing, Johnson and Dudgeon.)

– Based on the a priori information about the parameter, there are two basic approaches to parameter estimation:
  – Bayesian Parameter Estimation: the parameter is assumed to be a random quantity related statistically to the observation.
  – Nonrandom Parameter Estimation: the parameter is a constant without any probabilistic structure.

Terminology and Preliminaries

– Estimation theory relies on jargon to characterize the properties of estimators. In this presentation, the following definitions are used:
  – The set of n observations is represented by the n-dimensional vector y (observation space):

    y = [Y_1 \; \cdots \; Y_k \; \cdots \; Y_n]^T

  – The values of the parameters are denoted by the vector θ (parameter space Λ).
  – The estimate of this parameter vector is denoted by θ̂(y) : Γ → Λ.

– Definitions (continued):
  – The estimation error ε(y) (ε in short) is defined as the difference between the estimate and the actual parameter:

    \varepsilon(y) = \hat{\theta}(y) - \theta

  – The function C[a, θ] : Λ × Λ → ℝ_+ is the cost of estimating a true value of θ as a.
  – Given such a cost function C, the Bayes risk (average risk) of the estimator is defined by

    r(\hat{\theta}) = E\{ E\{ C[\hat{\theta}(Y), \Theta] \mid Y \} \}

Example 3: Suppose we would like to minimize the Bayes risk

    r(\hat{\theta}) = E\{ E\{ C[\hat{\theta}(Y), \Theta] \mid Y \} \}

for a given cost function C. By inspection, one can see that the Bayes estimate of θ can be found (if it exists) by minimizing, for each y, the posterior cost given Y = y:

    E\{ C[\hat{\theta}(Y), \Theta] \mid Y = y \}
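
The observation in Example 3 can be checked numerically: discretize the posterior of Θ, evaluate the posterior expected cost for every candidate estimate, and pick the minimizer. The sketch below does this for the squared-error cost with an arbitrary (assumed) posterior; the grid, the posterior shape, and the cost are illustrative choices, not part of the original example.

    import numpy as np

    # Given a discretized posterior for Theta and a cost C[a, theta], the Bayes
    # estimate minimizes the posterior cost E{C[a, Theta] | Y = y} over a.
    theta_grid = np.linspace(0.0, 4.0, 401)
    posterior = np.exp(-0.5 * (theta_grid - 1.3) ** 2 / 0.2)   # unnormalized example
    posterior /= posterior.sum()

    def posterior_cost(a, cost):
        return np.sum(cost(a, theta_grid) * posterior)

    squared = lambda a, t: (a - t) ** 2
    costs = [posterior_cost(a, squared) for a in theta_grid]
    print("Bayes estimate under squared-error cost:", theta_grid[int(np.argmin(costs))])

For the squared-error cost the minimizer should land (up to grid resolution) on the posterior mean, anticipating the MMSE estimator discussed later.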

– Definitions (continued):
  – An estimate is said to be unbiased if the expected value of the estimate equals the true value of the parameter, E{θ̂ | θ} = θ; otherwise the estimate is said to be biased. The bias b(θ) is usually considered to be additive, so that

    b(\theta) = E\{\hat{\theta} \mid \theta\} - \theta

  – An estimate is said to be asymptotically unbiased if the bias tends to zero as the number of observations tends to infinity.
  – An estimate is said to be consistent if the mean-squared estimation error tends to zero as the number of observations becomes large:

    \lim_{n \to \infty} E\{\varepsilon^T \varepsilon\} = 0
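
These definitions are easy to probe empirically. The sketch below checks, by Monte Carlo, that the sample mean of N(μ, σ²) data has (numerically) zero bias and a mean-squared error that shrinks as n grows; μ = 1 and σ = 2 are assumed purely for illustration.

    import numpy as np

    # Empirical look at bias and consistency for the sample mean of N(mu, sigma^2).
    rng = np.random.default_rng(2)
    mu, sigma, trials = 1.0, 2.0, 20000
    for n in (10, 100, 1000):
        est = rng.normal(mu, sigma, size=(trials, n)).mean(axis=1)
        bias = est.mean() - mu                 # should hover around 0
        mse = np.mean((est - mu) ** 2)         # should decay roughly like sigma^2 / n
        print(f"n={n:5d}  bias={bias:+.4f}  mse={mse:.4f}")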

– Definitions (continued):
  – An efficient estimate has a mean-squared error that equals a particular lower bound: the Cramér-Rao bound. If an efficient estimate exists, it is optimum in the mean-squared sense: no other estimate has a smaller mean-squared error.
– The following shorthand notation will also be used for brevity:

    p_\theta(y) \triangleq p_{y \mid \theta}(y \mid \theta) = \text{the probability density of } y \text{ given } \theta
    E_\theta\{y\} \triangleq E\{y \mid \theta\}

– The following definitions and theorems will be useful later in the presentation.
– Definition: Sufficiency
  Suppose that Λ is an arbitrary set. A function T on the observation space Γ is said to be a sufficient statistic for the parameter set Λ if the distribution of y conditioned on T(y) does not depend on θ for θ ∈ Λ.
  If knowing T(y) removes any further dependence on θ of the distribution of y, one can conclude that T(y) contains all the information in y that is useful for estimating θ. Hence, it is sufficient.

– Definition: Minimal Sufficiency
  A function T on Γ is said to be minimal sufficient for the parameter set Λ if it is a function of every other sufficient statistic for Λ.
  A minimal sufficient statistic represents the furthest reduction of the observation that can be made without destroying information about θ.
  A minimal sufficient statistic does not necessarily exist for every problem, and even when it exists it is usually very difficult to identify.

– The Factorization Theorem:
  Suppose that the parameter set Λ has a corresponding family of densities p_θ. A statistic T is sufficient for Λ if and only if there are functions g_θ and h such that

    p_\theta(y) = g_\theta[T(y)]\, h(y)

  for all y and θ. Refer to the supplement for a proof.

Example 4: (Poor) Consider the binary hypothesis-testing problem Λ = {0, 1} with densities p_0 and p_1. Noting that

    p_\theta(y) =
      \begin{cases}
        p_0(y) & \text{if } \theta = 0 \\
        \dfrac{p_1(y)}{p_0(y)}\, p_0(y) & \text{if } \theta = 1,
      \end{cases}

the factorization p_\theta(y) = g_\theta[T(y)]\, h(y) is possible with

    h(y) = p_0(y), \qquad
    T(y) = \frac{p_1(y)}{p_0(y)} = L(y), \qquad
    g_\theta(t) =
      \begin{cases}
        1 & \text{if } \theta = 0 \\
        t & \text{if } \theta = 1.
      \end{cases}

Thus the likelihood ratio L is a sufficient statistic for the binary hypothesis-testing problem.
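
The factorization in Example 4 can be verified numerically for concrete densities. The sketch below uses p_0 = N(0, 1) and p_1 = N(1, 1) applied i.i.d. to each sample; these two Gaussians are an assumed choice, not part of the original example.

    import numpy as np

    # Check p_theta(y) = g_theta[T(y)] * h(y) with h = p0, T = L = p1/p0,
    # g_0(t) = 1 and g_1(t) = t, for i.i.d. samples under N(0,1) vs N(1,1).
    def gauss(y, m):
        return np.exp(-0.5 * (y - m) ** 2) / np.sqrt(2 * np.pi)

    y = np.array([0.3, -1.2, 0.8])
    p0 = np.prod(gauss(y, 0.0))          # joint density under theta = 0
    p1 = np.prod(gauss(y, 1.0))          # joint density under theta = 1
    h, T = p0, p1 / p0                   # T(y) is the likelihood ratio L(y)
    print(np.isclose(1 * h, p0), np.isclose(T * h, p1))   # both True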

– The Rao-Blackwell Theorem:
  Suppose that ĝ(y) is an unbiased estimate of g(θ) and that T is sufficient for Λ. Define

    \tilde{g}[T(y)] = E_\theta\{\hat{g}(Y) \mid T(Y) = T(y)\}.

  Then g̃[T(y)] is also an unbiased estimate of g(θ). Furthermore,

    \mathrm{Var}_\theta(\tilde{g}[T(Y)]) \le \mathrm{Var}_\theta(\hat{g}(Y)),

  with equality if and only if

    P_\theta(\hat{g}(Y) = \tilde{g}[T(Y)]) = 1.

  Refer to the supplement for a proof.
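
A small Monte Carlo sketch of the theorem, under assumed values: for Y_k ~ N(θ, 1) i.i.d., ĝ(Y) = Y_1 is unbiased for θ, T(Y) = Σ Y_k is sufficient, and conditioning gives g̃ = E{Y_1 | T}, which here equals the sample mean; it should show a markedly smaller variance than Y_1 while staying unbiased.

    import numpy as np

    # Rao-Blackwellization for Y_k ~ N(theta, 1): replace the crude unbiased
    # estimate Y_1 by its conditional expectation given the sufficient statistic,
    # which in this model is simply the sample mean. theta = 0.7, n = 20 assumed.
    rng = np.random.default_rng(3)
    theta, n, trials = 0.7, 20, 50000
    Y = rng.normal(theta, 1.0, size=(trials, n))
    g_hat = Y[:, 0]              # crude unbiased estimate of theta
    g_tilde = Y.mean(axis=1)     # E{Y_1 | sum(Y)} = sample mean
    print("means:     ", g_hat.mean(), g_tilde.mean())        # both near theta
    print("variances: ", g_hat.var(), ">=", g_tilde.var())    # ~1 vs ~1/n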

– Definition: Completeness
  The parameter family Λ is said to be complete if the condition E_θ{f(Y)} = 0 for all θ ∈ Λ implies that P_θ(f(Y) = 0) = 1 for all θ ∈ Λ.

Example 5: (Poor) Suppose that Γ = {0, 1, ..., n}, Λ = [0, 1], and

    p_\theta(y) = \frac{n!}{y!\,(n-y)!}\, \theta^y (1-\theta)^{n-y}, \quad y = 0, \ldots, n, \quad 0 \le \theta \le 1.

For any function f on Γ, we have

    E_\theta\{f(Y)\} = \sum_{y=0}^{n} f(y)\, \frac{n!}{y!\,(n-y)!}\, \theta^y (1-\theta)^{n-y}
                     = (1-\theta)^n \sum_{y=0}^{n} a_y x^y,

where x = θ/(1-θ) and a_y = f(y)\, n!/[y!\,(n-y)!]. The condition E_θ{f(Y)} = 0 for all θ ∈ Λ then implies that

    \sum_{y=0}^{n} a_y x^y = 0 \quad \text{for all } x \ge 0.

However, an nth-order polynomial has at most n zeros unless all of its coefficients are zero. Hence f(y) = 0 for y = 0, ..., n, and Λ is complete.

– Definition: Exponential Families
  – A class of distributions with parameter set Λ is said to be an exponential family if there are real-valued functions C, Q_1, ..., Q_m, T_1, ..., T_m, and h such that

    p_\theta(y) = C(\theta)\, \exp\Big\{ \sum_{l=1}^{m} Q_l(\theta)\, T_l(y) \Big\}\, h(y).

  – T(y) = [T_1(y), ..., T_m(y)]^T is a complete sufficient statistic.

Bayesian Parameter Estimation

– For the random observation Y ∈ Γ, indexed by a parameter θ ∈ Λ ⊆ ℝ^m, our goal is to find a function θ̂ : Γ → Λ such that θ̂(y) is the best guess of the true value of θ given Y = y.
– Bayesian estimators are the estimators that minimize the Bayes risk.
– The following estimators are commonly used in practice and are distinguished by their cost functions.

– Minimum-Mean-Squared-Error (MMSE):
  – Squared-error (Euclidean) cost function:

    C[a, \theta] = \|a - \theta\|^2 = \sum_{i=1}^{m} C_i[a_i, \theta_i] = \sum_{i=1}^{m} (a_i - \theta_i)^2

  – The posterior cost given Y = y is

    E\{C[\hat{\theta}(Y), \Theta] \mid Y = y\}
      = E\{\|\hat{\theta}(Y) - \Theta\|^2 \mid Y = y\}
      = \|\hat{\theta}(y)\|^2 - 2\,\mathrm{Re}\{[\hat{\theta}(y)]^H E\{\Theta \mid Y = y\}\} + E\{\|\Theta\|^2 \mid Y = y\}

  – Minimizing this posterior cost also minimizes the Bayes risk r(θ̂). Hence, on differentiating with respect to θ̂(y), one obtains the Bayes estimate

    \hat{\theta}_{\mathrm{MMSE}}(y) = E\{\Theta \mid Y = y\}
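
The sketch below illustrates the MMSE estimator in a conjugate Gaussian model that is assumed here purely for illustration: Θ ~ N(0, 1) a priori and Y_k | Θ = θ ~ N(θ, σ²), k = 1, ..., n, for which the posterior mean has the closed form E{Θ | Y = y} = Σ_k y_k / (n + σ²). Its average squared error should come out below that of the plain sample mean, which ignores the prior.

    import numpy as np

    # MMSE (posterior mean) versus the prior-ignoring sample mean, by Monte Carlo.
    rng = np.random.default_rng(4)
    sigma, n, trials = 1.0, 5, 100000
    theta = rng.normal(0.0, 1.0, size=trials)                      # Theta ~ N(0, 1)
    Y = theta[:, None] + rng.normal(0.0, sigma, size=(trials, n))  # Y_k given Theta

    mmse = Y.sum(axis=1) / (n + sigma ** 2)     # conditional mean E{Theta | Y}
    naive = Y.mean(axis=1)                      # sample mean (ignores the prior)
    for name, est in [("MMSE", mmse), ("sample mean", naive)]:
        print(name, "empirical Bayes risk:", np.mean((est - theta) ** 2))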

– Minimum-Mean-Absolute-Error (MMAE):
  – Absolute-error cost function:

    C[a, \theta] = \|a - \theta\|_1 = \sum_{i=1}^{m} C_i[a_i, \theta_i] = \sum_{i=1}^{m} |a_i - \theta_i|

  – The posterior cost given Y = y is

    E\{C[\hat{\theta}(Y), \Theta] \mid Y = y\}
      = E\{\|\hat{\theta}(Y) - \Theta\|_1 \mid Y = y\}
      = \sum_{i=1}^{m} \int_0^{\infty} P(|\hat{\theta}_i(Y) - \Theta_i| > x \mid Y = y)\, dx

  – Here we used the fact that if P(X ≥ 0) = 1, then

    E\{X\} = \int_0^{\infty} P(X > x)\, dx.

  – Further simplification is possible:

    E\{C[\hat{\theta}(Y), \Theta] \mid Y = y\}
      = \sum_{i=1}^{m} \Big[ \int_0^{\infty} P(\Theta_i > x + \hat{\theta}_i(y) \mid Y = y)\, dx
                           + \int_0^{\infty} P(\Theta_i < -x + \hat{\theta}_i(y) \mid Y = y)\, dx \Big]
      = \sum_{i=1}^{m} \Big[ \int_{\hat{\theta}_i(y)}^{\infty} P(\Theta_i > t \mid Y = y)\, dt
                           + \int_{-\infty}^{\hat{\theta}_i(y)} P(\Theta_i < t \mid Y = y)\, dt \Big]

  – Taking the derivative with respect to each θ̂_i(y), one finds

    \frac{\partial}{\partial \hat{\theta}_i(y)} E\{C[\hat{\theta}(Y), \Theta] \mid Y = y\}
      = P(\Theta_i < \hat{\theta}_i(y) \mid Y = y) - P(\Theta_i > \hat{\theta}_i(y) \mid Y = y).

    This derivative is a nondecreasing function of θ̂_i(y) that approaches -1 as θ̂_i(y) → -∞ and +1 as θ̂_i(y) → +∞. Thus E{C[θ̂(Y), Θ] | Y = y} achieves its minimum where its derivative changes sign:

    P(\Theta_i > t \mid Y = y) \ge P(\Theta_i < t \mid Y = y), \quad t < \hat{\theta}_{i,\mathrm{MMAE}}(y)
    P(\Theta_i > t \mid Y = y) \le P(\Theta_i < t \mid Y = y), \quad t > \hat{\theta}_{i,\mathrm{MMAE}}(y)

– Maximum A Posteriori Probability (MAP):
  – Uniform-error cost function (for some small Δ > 0):

    C[a, \theta] =
      \begin{cases}
        1 & \text{if } \max_{1 \le i \le m} |a_i - \theta_i| > \Delta \\
        0 & \text{if } \max_{1 \le i \le m} |a_i - \theta_i| \le \Delta
      \end{cases}

  – The posterior cost given Y = y is

    E\{C[\hat{\theta}(Y), \Theta] \mid Y = y\}
      = 1 - P(|\hat{\theta}_1(Y) - \Theta_1| \le \Delta, \ldots, |\hat{\theta}_m(Y) - \Theta_m| \le \Delta \mid Y = y).

  – Under some smoothness conditions (as Δ → 0), the estimator that minimizes this posterior cost is

    \hat{\theta}_{\mathrm{MAP}}(y) = \arg\max_{\theta \in \Lambda}\; p_{\Theta \mid Y}(\theta \mid Y = y)

– Observations:
  – MMSE estimator:

    \hat{\theta}_{\mathrm{MMSE}}(y) = E\{\Theta \mid Y = y\}

    The MMSE estimate of θ given Y = y is the conditional mean of Θ given Y = y.
  – MMAE estimator:

    P(\Theta_i > t \mid Y = y) \ge P(\Theta_i < t \mid Y = y), \quad t < \hat{\theta}_{i,\mathrm{MMAE}}(y)
    P(\Theta_i > t \mid Y = y) \le P(\Theta_i < t \mid Y = y), \quad t > \hat{\theta}_{i,\mathrm{MMAE}}(y)

    The MMAE estimate of θ given Y = y is the conditional median of Θ given Y = y.
  – MAP estimator:

    \hat{\theta}_{\mathrm{MAP}}(y) = \arg\max_{\theta \in \Lambda}\; p_{\Theta \mid Y}(\theta \mid Y = y)

    The MAP estimate of θ given Y = y is the conditional mode of Θ given Y = y.

Example 6: (Poor) Suppose the observation has the conditional density

    p_\theta(y) =
      \begin{cases}
        \theta e^{-\theta y} & \text{if } y \ge 0 \\
        0                    & \text{if } y < 0,
      \end{cases}

i.e., y has an exponential density with parameter θ. Suppose Θ is also an exponential random variable, with prior density

    w(\theta) =
      \begin{cases}
        e^{-\theta} & \text{if } \theta \ge 0 \\
        0           & \text{if } \theta < 0.
      \end{cases}

Then the posterior density of Θ given Y = y is

    w(\theta \mid y) = \frac{\theta\, e^{-\theta(1+y)}}{\int_0^{\infty} \alpha\, e^{-\alpha(1+y)}\, d\alpha}
                     = (1+y)^2\, \theta\, e^{-(1+y)\theta}

for θ ≥ 0 and y ≥ 0, and w(θ | y) = 0 otherwise.

Example 7: (Continued.)
  – The MMSE estimate is the mean of this posterior:

    \hat{\theta}_{\mathrm{MMSE}}(y) = \frac{2}{1+y}

  – The MMAE estimate is the median of this posterior, i.e., the value t solving e^{-(1+y)t}\,[1 + (1+y)t] = 1/2, which gives approximately

    \hat{\theta}_{\mathrm{MMAE}}(y) \approx \frac{1.68}{1+y}

  – The MAP estimate is the mode of this posterior (the value where it is maximum):

    \hat{\theta}_{\mathrm{MAP}}(y) = \frac{1}{1+y}

  – To decide which one to use, one must decide which of the three cost functions best suits the problem at hand.
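
The three estimates in Example 7 can be cross-checked by sampling from the posterior, which is a gamma density with shape 2 and rate (1 + y). The value y = 3 below is an assumed observation used only for the check.

    import numpy as np

    # Posterior of Theta given Y = y in Examples 6-7: Gamma(shape=2, rate=1+y).
    # Its mean, median and mode are the MMSE, MMAE and MAP estimates, respectively.
    rng = np.random.default_rng(5)
    y = 3.0
    samples = rng.gamma(shape=2.0, scale=1.0 / (1.0 + y), size=1_000_000)

    print("MMSE (posterior mean):  ", samples.mean(), "  closed form:", 2.0 / (1.0 + y))
    print("MMAE (posterior median):", np.median(samples), "  approx:", 1.68 / (1.0 + y))
    print("MAP  (posterior mode):   closed form:", 1.0 / (1.0 + y))
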
Nonrandom Parameter Estimation

– Our goal is the same as in the Bayesian parameter estimation problem: find an estimate θ̂(y) of θ.
– Assume that the parameter set Λ is real valued. In the nonrandom parameter estimation problem, we do not know anything about the true value of θ other than the fact that it lies in Λ. Hence, the question we would like to answer is: given the observation Y = y, what is the best estimate of θ?

– The only averaging of performance that can be done is with respect to the distribution of Y given θ, for a given cost function C.
– A reasonable restriction to place on an estimate of θ is that its expected value equal the true parameter value:

    E_\theta\{\hat{\theta}(Y)\} = \theta, \quad \theta \in \Lambda

– For its tractability, the squared Euclidean norm will be used as the cost function.

– When the squared-error cost is used, the risk function is

    R_\theta(\hat{\theta}) = E_\theta\{\|\hat{\theta}(Y) - \theta\|^2\}, \quad \theta \in \Lambda

  – One cannot generally expect to minimize this risk uniformly for all θ. This is easily seen for the squared-error cost: for any particular value of θ, say θ_0, the conditional mean-squared error can be made zero by choosing the estimate to be identically θ_0 for all observations y.
  – However, if θ is not close to θ_0, such an estimate performs poorly.

– With the unbiasedness restriction, the conditional mean-squared error becomes the variance of the estimate. Hence, these estimators are termed minimum-variance unbiased estimators (MVUEs).
– The procedure for seeking MVUEs:
  – Find a complete sufficient statistic T for Λ.
  – Find any unbiased estimator ĝ(y) of g(θ).
  – Then g̃[T(y)] = E_θ{ĝ(Y) | T(Y) = T(y)} is an MVUE of g(θ).

Example 8: (Poor) Consider the model

    Y_k = N_k + \mu s_k, \quad k = 1, \ldots, n,

where N_1, ..., N_n are i.i.d. N(0, σ²) noise samples and s_k is a known signal for k = 1, ..., n. Our objective is to estimate μ and σ².

1. The density of Y is given by

    p(y) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\Big\{ -\frac{1}{2\sigma^2} \sum_{k=1}^{n} (y_k - \mu s_k)^2 \Big\}
         = C(\theta)\, \exp\{ \theta_1 T_1(y) + \theta_2 T_2(y) \}\, h(y),

where θ = [θ_1 θ_2]^T and

    \theta_1 = \frac{\mu}{\sigma^2}, \qquad \theta_2 = -\frac{1}{2\sigma^2},
    T_1(y) = \sum_{k=1}^{n} s_k y_k, \qquad T_2(y) = \sum_{k=1}^{n} y_k^2,
    C(\theta) = \Big( -\frac{\theta_2}{\pi} \Big)^{n/2} \exp\Big\{ \frac{\theta_1^2}{4\theta_2} \sum_{k=1}^{n} s_k^2 \Big\}, \qquad h(y) = 1.
Example 9: (Continued.) Note that T = [T_1 T_2]^T is a complete sufficient statistic for θ.

2. We wish to estimate μ = g_1(θ) = -θ_1/(2θ_2) and σ² = g_2(θ) = -1/(2θ_2). Assuming that s_1 ≠ 0, the estimate ĝ_1(y) = y_1/s_1 is an unbiased estimator of g_1(θ).

Moreover, note that

    E_\theta\{T_1^2(Y)\} = \mathrm{Var}_\theta\{T_1(Y)\} + \big( E_\theta\{T_1(Y)\} \big)^2
                         = n\sigma^2\, \overline{s^2} + n^2 \mu^2 (\overline{s^2})^2,
                         \quad \text{with } \overline{s^2} = \frac{1}{n}\sum_{k=1}^{n} s_k^2,

and that

    E_\theta\{T_2(Y)\} = \sum_{k=1}^{n} E_\theta\{Y_k^2\} = \sum_{k=1}^{n} (\sigma^2 + \mu^2 s_k^2) = n\sigma^2 + n\mu^2\, \overline{s^2}.

Hence,

    \hat{g}_2(y) = \big[ T_2(y) - T_1^2(y)/(n\overline{s^2}) \big] / (n-1)

is an unbiased estimate of g_2(θ).

Example 10: (Continued.)

3. Since T is a complete sufficient statistic, the estimates

    \tilde{g}_1[T(y)] = E_\theta\{\hat{g}_1(Y) \mid T(Y) = T(y)\}
    \tilde{g}_2[T(y)] = E_\theta\{\hat{g}_2(Y) \mid T(Y) = T(y)\}

are MVUEs of μ and σ², respectively. Note that ĝ_1(Y) and T_1(Y) are both linear functions of Y and hence are jointly Gaussian. The MVUEs are therefore

    \tilde{g}_1[T(y)] = E_\theta\{\hat{g}_1(Y)\}
                        + \mathrm{Cov}[\hat{g}_1(Y), T_1(Y)]\, \big[\mathrm{Var}_\theta[T_1(Y)]\big]^{-1} \big[ T_1(y) - E_\theta\{T_1(Y)\} \big]
                      = \mu + \sigma^2 (n\sigma^2\, \overline{s^2})^{-1} \big[ T_1(y) - n\mu\, \overline{s^2} \big]
                      = \frac{T_1(y)}{n\, \overline{s^2}}
                      = \frac{\sum_{k=1}^{n} s_k y_k}{n\, \overline{s^2}} \;\triangleq\; \hat{\mu},

    \tilde{g}_2[T(y)] = \big[ T_2(y) - T_1^2(y)/(n\overline{s^2}) \big] / (n-1)
                      = \frac{1}{n-1} \sum_{k=1}^{n} (y_k - \hat{\mu} s_k)^2.
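
A quick Monte Carlo check of the MVUEs derived in Examples 8-10; the particular values μ = 2, σ = 1.5, n = 8 and the ramp signal s below are assumptions chosen only to exercise the formulas. Both sample averages should land close to the true μ and σ², confirming unbiasedness.

    import numpy as np

    # MVUEs from Examples 8-10:
    #   mu_hat     = sum(s_k * y_k) / (n * s2bar)
    #   sigma2_hat = sum((y_k - mu_hat * s_k)^2) / (n - 1)
    rng = np.random.default_rng(6)
    mu, sigma, n, trials = 2.0, 1.5, 8, 200000
    s = np.linspace(0.5, 2.0, n)              # known signal (assumed shape)
    s2bar = np.mean(s ** 2)

    Y = mu * s + rng.normal(0.0, sigma, size=(trials, n))
    mu_hat = (Y @ s) / (n * s2bar)
    sigma2_hat = np.sum((Y - mu_hat[:, None] * s) ** 2, axis=1) / (n - 1)
    print("average mu_hat:    ", mu_hat.mean(), "   true mu:      ", mu)
    print("average sigma2_hat:", sigma2_hat.mean(), "   true sigma^2: ", sigma ** 2)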

– Maximum-Likelihood (ML) Estimation:
  – For many problems arising in practice, it is usually not feasible to find MVUEs.
  – Another method for seeking good estimators is needed.
  – ML is one of the most commonly used methods in the signal processing literature.
– Consider MAP estimation for θ:

    \hat{\theta}_{\mathrm{MAP}}(y) = \arg\max_{\theta \in \Lambda}\; p_\theta(y)\, w(\theta)

  – In the absence of any prior information about the parameter θ, we can assume that it is uniformly distributed (w(θ) becomes a uniform density), since this represents the worst-case scenario.

– ML Estimation: (Continued.)
  – Hence, the MAP estimate for a given y is any value of θ that maximizes p_θ(y) over Λ.
  – p_θ(y), viewed as a function of θ, is usually called the likelihood function.
  – Hence, the ML estimate is

    \hat{\theta}_{\mathrm{ML}}(y) = \arg\max_{\theta \in \Lambda}\; p_\theta(y)

  – Maximizing p_θ(y) is the same as maximizing log p_θ(y) (the log-likelihood function). Therefore, a necessary condition for the maximum-likelihood estimate is

    \frac{\partial}{\partial \theta} \log p_\theta(y) \Big|_{\theta = \hat{\theta}_{\mathrm{ML}}(y)} = 0.

  – The above condition is also known as the likelihood equation.
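
As a concrete sketch, take i.i.d. observations from the exponential density p_θ(y) = θ e^{-θy}, y ≥ 0 (the density of Example 6, with θ now treated as a nonrandom parameter). The likelihood equation n/θ - Σ_k y_k = 0 gives θ̂_ML = 1/ȳ, and a brute-force grid search over the log-likelihood should agree with it; the true θ and the grid are assumed values.

    import numpy as np

    # ML estimation for i.i.d. exponential data:
    # log p_theta(y) = n*log(theta) - theta*sum(y).
    rng = np.random.default_rng(7)
    theta_true, n = 1.4, 500
    y = rng.exponential(scale=1.0 / theta_true, size=n)

    grid = np.linspace(0.1, 5.0, 4901)
    loglik = n * np.log(grid) - grid * y.sum()     # log-likelihood on the grid
    print("grid maximizer:        ", grid[int(np.argmax(loglik))])
    print("closed form 1/mean(y): ", 1.0 / y.mean())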

– Cramér-Rao Bound:
  – Let θ̂(Y) be an unbiased estimator of θ. Then the error covariance matrix of θ̂ is bounded below by the Cramér-Rao bound (refer to the supplement).
  – If the Cramér-Rao bound can be satisfied with equality, only the maximum-likelihood estimate achieves it. Hence, if an efficient estimate exists, it is the maximum-likelihood estimate.

Example 11: Refer to the attached paper "The Stochastic CRB for Array Processing: A Textbook Derivation" by Stoica, Larsson, and Gershman.
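
A far simpler scalar illustration of the bound than Example 11 (an assumed scenario, not taken from the attached paper): for estimating the mean θ of N(θ, σ²) samples with σ known, the Fisher information of n observations is n/σ², so the CRB is σ²/n, and the ML estimate (the sample mean) attains it.

    import numpy as np

    # Empirical check that the sample mean is efficient for the mean of a Gaussian.
    rng = np.random.default_rng(8)
    theta, sigma, n, trials = 0.5, 2.0, 50, 100000
    Y = rng.normal(theta, sigma, size=(trials, n))
    ml = Y.mean(axis=1)                              # ML estimate of theta
    print("Cramer-Rao bound:", sigma ** 2 / n, "  empirical variance of ML:", ml.var())
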
Questions