Introduction to Estimation Theory: A Tutorial
Volkan Cevher
Georgia Institute of Technology, Center for Signal and Image Processing

Outline
– Introduction
– Terminology and Preliminaries
– Bayesian (Random) Parameter Estimation
– Nonrandom Parameter Estimation
– Questions

Introduction

The classical detection problem is the design of optimum procedures for deciding between possible statistical situations given a random observation:
    H_0: Y_k \sim P \in \mathcal{P}_0,  k = 1, ..., n
    H_1: Y_k \sim P \in \mathcal{P}_1,  k = 1, ..., n
The model has the following components:
– Parameter Space (for parametric detection problems)
– Probabilistic Mapping from Parameter Space to Observation Space
– Observation Space
– Detection Rule

Parameter Space:
– Completely characterizes the output given the mapping. Each hypothesis corresponds to a point in the parameter space, and this correspondence is one-to-one.

Probabilistic Mapping from Parameter Space to Observation Space:
– The probability law that governs the effect of a parameter on the observation.

Example 1:
– Probabilistic mapping:
    Y_k = N_k,  where  N_k \sim N(0, \sigma^2) with probability 1/2,
                       N_k \sim N(1, \sigma^2) with probability 1/4,
                       N_k \sim N(-1, \sigma^2) with probability 1/4.
– Parameter space (the possible means of the observation): \Lambda = \{-1, 0, 1\}.

Observation Space:
– Finite dimensional, i.e., \Gamma \subseteq \mathbb{R}^n, where n is finite.

Detection Rule:
– A mapping from the observation space to points in the parameter space is called a detection rule.

The classical estimation problem:
– We are interested not in making a choice among several discrete situations, but rather in making a choice among a continuum of possible states.
– Think of a family of distributions on the observation space, indexed by a set of parameters.
– Given the observation, determine as accurately as possible the actual value of the parameter.

Example 2:
    Y_k = N_k,   N_k \sim N(\theta, \sigma^2)
– In this example, the parameter θ is estimated from the observations. Its value is not chosen from a set of discrete values, but rather is estimated as accurately as possible. (A small simulation of this setup is sketched at the end of this Introduction.)

The estimation problem has the same components as the detection problem:
– Parameter Space
– Probabilistic Mapping from Parameter Space to Observation Space
– Observation Space
– Estimation Rule
The detection problem can be thought of as a special case of the estimation problem. There are a variety of estimation procedures, differing basically in the amount of prior information about the parameter and in the performance criteria applied. Estimation theory is less structured than detection theory: "Detection is science, estimation is art" (Array Signal Processing, Johnson and Dudgeon).

Based on the a priori information about the parameter, there are two basic approaches to parameter estimation:
– Bayesian Parameter Estimation: the parameter is assumed to be a random quantity related statistically to the observation.
– Nonrandom Parameter Estimation: the parameter is a constant without any probabilistic structure.
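To make Example 2 concrete, here is a minimal simulation sketch in Python (not from the original slides): it draws n observations Y_k ~ N(θ, σ²) for illustrative values of θ and σ and forms the sample mean as one natural estimate of θ. The numerical values and the choice of estimator are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed (illustrative) values -- not from the slides.
theta_true = 1.5   # unknown mean to be estimated
sigma = 2.0        # noise standard deviation
n = 100            # number of observations

# Example 2: Y_k = N_k with N_k ~ N(theta, sigma^2)
y = theta_true + sigma * rng.standard_normal(n)

# One natural estimate of theta is the sample mean of the observations.
theta_hat = y.mean()
print(f"true theta = {theta_true:.3f}, estimate = {theta_hat:.3f}")
```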
Terminology and Preliminaries

Estimation theory relies on jargon to characterize the properties of estimators. In this presentation, the following definitions are used:
– The set of n observations is represented by the n-dimensional vector y (in the observation space \Gamma):
    y = [Y_1, ..., Y_k, ..., Y_n]^T
– The values of the parameters are denoted by the vector θ (in the parameter space \Lambda). The estimate of this parameter vector is denoted by \hat{\theta}(y): \Gamma \to \Lambda.
– The estimation error \varepsilon(y) (ε in short) is defined as the difference between the estimate and the actual parameter:
    \varepsilon(y) = \hat{\theta}(y) - \theta
– The function C[a, θ]: \Lambda \times \Lambda \to \mathbb{R}_+ is the cost of estimating a true value of θ as a.
– Given such a cost function C, the Bayes risk (average risk) of the estimator is defined by
    r(\hat{\theta}) = E\{ E\{ C[\hat{\theta}(Y), \Theta] \mid Y \} \}

Example 3: Suppose we would like to minimize the Bayes risk r(\hat{\theta}) = E\{E\{C[\hat{\theta}(Y), \Theta] \mid Y\}\} for a given cost function C. By inspection, one can see that the Bayes estimate of θ can be found (if it exists) by minimizing, for each y, the posterior cost given Y = y:
    E\{ C[\hat{\theta}(y), \Theta] \mid Y = y \}

Definitions (continued):
– An estimate is said to be unbiased if the expected value of the estimate equals the true value of the parameter, E\{\hat{\theta} \mid \theta\} = \theta. Otherwise the estimate is said to be biased. The bias b(θ) is usually considered to be additive, so that
    b(\theta) = E\{\hat{\theta} \mid \theta\} - \theta
– An estimate is said to be asymptotically unbiased if the bias tends to zero as the number of observations tends to infinity.
– An estimate is said to be consistent if the mean-squared estimation error tends to zero as the number of observations becomes large:
    \lim_{n \to \infty} E\{\varepsilon^T \varepsilon\} = 0
– An efficient estimate has a mean-squared error that equals a particular lower bound: the Cramer-Rao bound. If an efficient estimate exists, it is optimum in the mean-squared sense: no other unbiased estimate has a smaller mean-squared error.
(A small numerical check of the unbiasedness and consistency definitions is sketched after the sufficiency definitions below.)

The following shorthand notation will also be used for brevity:
    p_\theta(y) \triangleq p_{y|\theta}(y \mid \theta)   (the probability density of y given θ)
    E_\theta\{y\} \triangleq E\{y \mid \theta\}

Definition (Sufficiency): Suppose that \Lambda is an arbitrary parameter set. A function T on the observation space \Gamma is said to be a sufficient statistic for the parameter set \Lambda if the distribution of Y conditioned on T(Y) does not depend on θ, for θ in \Lambda. If knowing T(y) removes any further dependence of the distribution of y on θ, one can conclude that T(y) contains all the information in y that is useful for estimating θ; hence, it is sufficient.

Definition (Minimal Sufficiency): A function T on \Gamma is said to be minimal sufficient for the parameter set \Lambda if it is a function of every other sufficient statistic for \Lambda. A minimal sufficient statistic represents the furthest reduction of the observation that does not destroy information about θ. A minimal sufficient statistic does not necessarily exist for every problem, and even when it exists, it is usually very difficult to identify.
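As promised above, here is a quick numerical illustration of the unbiasedness and consistency definitions. The sketch assumes Gaussian observations Y_k ~ N(θ, σ²) as in Example 2 and uses the sample mean as the estimator (an illustrative choice, not prescribed by the slides); it approximates the bias and the mean-squared error by Monte Carlo for increasing n.

```python
import numpy as np

rng = np.random.default_rng(1)
theta_true, sigma = 1.5, 2.0     # illustrative values, not from the slides
n_trials = 5000                  # Monte Carlo repetitions

for n in (10, 100, 1000):
    # n_trials independent realizations of n observations Y_k ~ N(theta, sigma^2)
    y = theta_true + sigma * rng.standard_normal((n_trials, n))
    theta_hat = y.mean(axis=1)                    # sample-mean estimator
    bias = theta_hat.mean() - theta_true          # empirical bias (should be ~0: unbiased)
    mse = np.mean((theta_hat - theta_true) ** 2)  # empirical MSE (shrinks like sigma^2/n: consistent)
    print(f"n={n:5d}  bias={bias:+.4f}  mse={mse:.4f}  sigma^2/n={sigma**2 / n:.4f}")
```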
The Factorization Theorem: Suppose that the parameter set \Lambda has a corresponding family of densities \{p_\theta;\ \theta \in \Lambda\}. A statistic T is sufficient for \Lambda if and only if there exist functions g_\theta and h such that
    p_\theta(y) = g_\theta[T(y)] \, h(y)
for all y \in \Gamma and \theta \in \Lambda. Refer to the supplement for a proof.

Example 4 (Poor): Consider the hypothesis-testing problem \Lambda = \{0, 1\} with densities p_0 and p_1. Noting that
    p_\theta(y) = p_0(y)                      if \theta = 0,
    p_\theta(y) = [p_1(y)/p_0(y)] \, p_0(y)   if \theta = 1,
the factorization p_\theta(y) = g_\theta[T(y)] h(y) is possible with
    h(y) = p_0(y),    T(y) = p_1(y)/p_0(y) \triangleq L(y),
    g_\theta(t) = 1 if \theta = 0,   g_\theta(t) = t if \theta = 1.
Thus the likelihood ratio L is a sufficient statistic for the binary hypothesis-testing problem.

The Rao-Blackwell Theorem: Suppose that \hat{g}(y) is an unbiased estimate of g(\theta) and that T is sufficient for \Lambda. Define
    \tilde{g}[T(y)] = E_\theta\{\hat{g}(Y) \mid T(Y) = T(y)\}.
Then \tilde{g}[T(Y)] is also an unbiased estimate of g(\theta). Furthermore,
    Var_\theta(\tilde{g}[T(Y)]) \le Var_\theta(\hat{g}(Y)),
with equality if and only if P_\theta(\hat{g}(Y) = \tilde{g}[T(Y)]) = 1. Refer to the supplement for a proof.

Definition (Completeness): The parameter family is said to be complete if the condition E_\theta\{f(Y)\} = 0 for all \theta \in \Lambda implies that P_\theta(f(Y) = 0) = 1 for all \theta \in \Lambda.

Example 5 (Poor): Suppose that \Gamma = \{0, 1, ..., n\}, \Lambda = (0, 1), and
    p_\theta(y) = \frac{n!}{y!(n-y)!} \theta^y (1-\theta)^{n-y},   y = 0, ..., n,   0 < \theta < 1.
For any function f on \Gamma, we have
    E_\theta\{f(Y)\} = \sum_{y=0}^{n} f(y) \frac{n!}{y!(n-y)!} \theta^y (1-\theta)^{n-y} = (1-\theta)^n \sum_{y=0}^{n} a_y x^y,
where x = \theta/(1-\theta) and a_y = f(y)\, n!/[y!(n-y)!]. The condition E_\theta\{f(Y)\} = 0 for all \theta \in \Lambda therefore implies that
    \sum_{y=0}^{n} a_y x^y = 0   for all x > 0.
However, an nth-order polynomial has at most n zeros unless all of its coefficients are zero. Hence f(y) = 0 for all y, and the family is complete.

Definition (Exponential Families): A class of distributions with parameter set \Lambda is said to be an exponential family if there are real-valued functions C, Q_1, ..., Q_m, T_1, ..., T_m, and h such that
    p_\theta(y) = C(\theta) \exp\Big\{ \sum_{l=1}^{m} Q_l(\theta) T_l(y) \Big\} h(y).
In this case T(y) = [T_1(y), ..., T_m(y)]^T is a complete sufficient statistic (under mild conditions on \Lambda).
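As a concrete illustration of the exponential-family form and the Factorization Theorem, the sketch below writes the i.i.d. N(μ, σ²) family in the form C(θ) exp{Σ_l Q_l(θ) T_l(y)} h(y), with T(y) = [Σ_k y_k, Σ_k y_k²]^T, and checks numerically that the factored expression reproduces the joint density. This is a standard example assumed here for illustration (it is not taken from the slides), and the parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n = 0.7, 1.3, 8       # illustrative parameter values
y = mu + sigma * rng.standard_normal(n)

# Direct evaluation of the joint density of n i.i.d. N(mu, sigma^2) samples.
direct = (2 * np.pi * sigma**2) ** (-n / 2) * np.exp(-np.sum((y - mu) ** 2) / (2 * sigma**2))

# Exponential-family factorization: p_theta(y) = C(theta) * exp(Q1*T1 + Q2*T2) * h(y)
T1, T2 = y.sum(), np.sum(y**2)                # sufficient statistic T(y) = [T1, T2]
Q1, Q2 = mu / sigma**2, -1 / (2 * sigma**2)   # Q_l(theta)
C = (2 * np.pi * sigma**2) ** (-n / 2) * np.exp(-n * mu**2 / (2 * sigma**2))
h = 1.0
factored = C * np.exp(Q1 * T1 + Q2 * T2) * h

print(direct, factored)   # the two evaluations agree up to floating-point error
```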
Bayesian (Random) Parameter Estimation

For a random observation Y \in \Gamma indexed by a parameter \theta \in \Lambda \subseteq \mathbb{R}^m, our goal is to find a function \hat{\theta}: \Gamma \to \Lambda such that \hat{\theta}(y) is the best guess of the true value of θ given Y = y. Bayesian estimators are the estimators that minimize the Bayes risk. The following estimators are commonly used in practice and are distinguished by their cost functions.

Minimum-Mean-Squared-Error (MMSE) Estimation:
– Euclidean (squared-error) cost function:
    C[a, \theta] = \|a - \theta\|^2 = \sum_{i=1}^{m} C_i[a_i, \theta_i] = \sum_{i=1}^{m} (a_i - \theta_i)^2
– The posterior cost given Y = y is
    E\{C[\hat{\theta}(Y), \Theta] \mid Y = y\} = E\{\|\hat{\theta}(y) - \Theta\|^2 \mid Y = y\}
        = \|\hat{\theta}(y)\|^2 - 2\,\mathrm{Re}\{\hat{\theta}(y)^H E\{\Theta \mid Y = y\}\} + E\{\|\Theta\|^2 \mid Y = y\}
– Minimizing this posterior cost also minimizes the Bayes risk r(\hat{\theta}). Hence, differentiating with respect to \hat{\theta}(y) and setting the derivative to zero, one obtains the Bayes estimate
    \hat{\theta}_{MMSE}(y) = E\{\Theta \mid Y = y\}

Minimum-Mean-Absolute-Error (MMAE) Estimation:
– Absolute-error cost function:
    C[a, \theta] = \sum_{i=1}^{m} C_i[a_i, \theta_i] = \sum_{i=1}^{m} |a_i - \theta_i|
– The posterior cost given Y = y is
    E\{C[\hat{\theta}(Y), \Theta] \mid Y = y\} = \sum_{i=1}^{m} E\{|\hat{\theta}_i(Y) - \Theta_i| \mid Y = y\}
        = \sum_{i=1}^{m} \int_0^\infty P(|\hat{\theta}_i(Y) - \Theta_i| > x \mid Y = y)\, dx
  Here we used the fact that if P(X \ge 0) = 1, then E\{X\} = \int_0^\infty P(X > x)\, dx.
– Further simplification is possible:
    E\{C[\hat{\theta}(Y), \Theta] \mid Y = y\}
        = \sum_{i=1}^{m} \Big[ \int_0^\infty P(\Theta_i > x + \hat{\theta}_i(Y) \mid Y = y)\, dx + \int_0^\infty P(\Theta_i < \hat{\theta}_i(Y) - x \mid Y = y)\, dx \Big]
        = \sum_{i=1}^{m} \Big[ \int_{\hat{\theta}_i(Y)}^{\infty} P(\Theta_i > t \mid Y = y)\, dt + \int_{-\infty}^{\hat{\theta}_i(Y)} P(\Theta_i < t \mid Y = y)\, dt \Big]
– Taking the derivative with respect to each \hat{\theta}_i(Y), one can see that
    \frac{\partial}{\partial \hat{\theta}_i(Y)} E\{C[\hat{\theta}(Y), \Theta] \mid Y = y\} = P(\Theta_i < \hat{\theta}_i(Y) \mid Y = y) - P(\Theta_i > \hat{\theta}_i(Y) \mid Y = y)
  This derivative is a nondecreasing function of \hat{\theta}_i(Y) that approaches -1 as \hat{\theta}_i(Y) \to -\infty and +1 as \hat{\theta}_i(Y) \to +\infty. Thus E\{C[\hat{\theta}(Y), \Theta] \mid Y = y\} achieves its minimum where its derivative changes sign:
    P(\Theta_i \ge t \mid Y = y) \ge P(\Theta_i \le t \mid Y = y)   for t \le \hat{\theta}_{i,MMAE}(y),
    P(\Theta_i \ge t \mid Y = y) \le P(\Theta_i \le t \mid Y = y)   for t \ge \hat{\theta}_{i,MMAE}(y).

Maximum A Posteriori Probability (MAP) Estimation:
– Uniform-error cost function (for some small \Delta > 0):
    C[a, \theta] = 1  if \max_{1 \le i \le m} |a_i - \theta_i| > \Delta,    C[a, \theta] = 0  if \max_{1 \le i \le m} |a_i - \theta_i| \le \Delta
– The posterior cost given Y = y is
    E\{C[\hat{\theta}(Y), \Theta] \mid Y = y\} = 1 - P(|\hat{\theta}_1(Y) - \Theta_1| \le \Delta, ..., |\hat{\theta}_m(Y) - \Theta_m| \le \Delta \mid Y = y).
– Under some smoothness conditions, as \Delta \to 0 the estimator that minimizes this posterior cost is given by
    \hat{\theta}_{MAP}(y) = \arg\max_{\theta} p_{\Theta|Y}(\theta \mid Y = y)

Observations:
– MMSE estimator: \hat{\theta}_{MMSE}(y) = E\{\Theta \mid Y = y\}. The MMSE estimate of θ given Y = y is the conditional mean of Θ given Y = y.
– MMAE estimator: defined componentwise by the sign-change condition above. The MMAE estimate of θ given Y = y is the conditional median of Θ given Y = y.
– MAP estimator: \hat{\theta}_{MAP}(y) = \arg\max_{\theta} p_{\Theta|Y}(\theta \mid Y = y). The MAP estimate of θ given Y = y is the conditional mode of Θ given Y = y.

Example 6 (Poor): Suppose the conditional probability density of the observation is
    p_\theta(y) = \theta e^{-\theta y}  for y \ge 0,    p_\theta(y) = 0  for y < 0,
so Y has an exponential density with parameter θ. Suppose Θ is also an exponential random variable, with prior density
    w(\theta) = e^{-\theta}  for \theta \ge 0,    w(\theta) = 0  for \theta < 0.
Then the posterior density of Θ given Y = y is
    w(\theta \mid y) = \frac{\theta e^{-(1+y)\theta}}{\int_0^\infty \theta' e^{-(1+y)\theta'}\, d\theta'} = (1+y)^2 \, \theta \, e^{-(1+y)\theta}
for \theta \ge 0 and y \ge 0, and w(\theta \mid y) = 0 otherwise.

Example 7 (continuing Example 6):
– The MMSE estimate is the mean of this posterior distribution:
    \hat{\theta}_{MMSE}(y) = \frac{2}{1+y}
– The MMAE estimate is the median of this posterior distribution; the median has no simple closed form, but it satisfies (1+y)\,\hat{\theta}_{MMAE}(y) \approx 1.68, so
    \hat{\theta}_{MMAE}(y) \approx \frac{1.68}{1+y}
– The MAP estimate is the mode of this posterior distribution (the value of θ at which it is maximized):
    \hat{\theta}_{MAP}(y) = \frac{1}{1+y}
– To decide which one to use, one must decide which of the three cost functions best suits the problem at hand. (A numerical check of these three estimates is sketched below.)
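To double-check Example 7 numerically, the sketch below evaluates the posterior (1+y)² θ e^{-(1+y)θ} on a grid for one arbitrary observed value y and reads off its mean, median, and mode, comparing them with the closed forms 2/(1+y), roughly 1.68/(1+y), and 1/(1+y). The grid limits and the value of y are illustrative choices.

```python
import numpy as np

y = 0.5                                        # an arbitrary observed value
lam = 1.0 + y                                  # posterior rate parameter (1 + y)
theta = np.linspace(0.0, 30.0, 300001)         # grid over theta >= 0
post = lam**2 * theta * np.exp(-lam * theta)   # posterior (1+y)^2 * theta * exp(-(1+y)*theta)

dtheta = theta[1] - theta[0]
post /= post.sum() * dtheta                    # normalize numerically (already ~1)

mean = np.sum(theta * post) * dtheta                  # MMSE: posterior mean
cdf = np.cumsum(post) * dtheta
median = theta[np.searchsorted(cdf, 0.5)]             # MMAE: posterior median
mode = theta[np.argmax(post)]                         # MAP: posterior mode

print(f"MMSE  numeric {mean:.4f}   closed form {2 / lam:.4f}")
print(f"MMAE  numeric {median:.4f}   closed form {1.678 / lam:.4f}")
print(f"MAP   numeric {mode:.4f}   closed form {1 / lam:.4f}")
```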
Nonrandom Parameter Estimation

Our goal is the same as in the Bayesian parameter estimation problem: find \hat{\theta}(y). Assume that the parameter set \Lambda is real valued. In the nonrandom parameter estimation problem, we do not know anything about the true value of θ other than the fact that it lies in \Lambda. Hence, the question we would like to answer is: given the observation Y = y, what is the best estimate of θ?

The only averaging of performance that can be done is with respect to the distribution of Y given θ, for a given cost function C. A reasonable restriction to place on an estimate of θ is that its expected value equal the true parameter value:
    E_\theta\{\hat{\theta}(Y)\} = \theta,   \theta \in \Lambda
For its tractability, the squared Euclidean norm will be used as the cost function.

When the squared-error cost is used, the risk function is
    R_\theta(\hat{\theta}) = E_\theta\{\|\hat{\theta}(Y) - \theta\|^2\},   \theta \in \Lambda
– One cannot generally expect to minimize this risk function uniformly for all \theta \in \Lambda.
– This is easily seen for the squared-error cost: for any particular value of θ, say \theta_0, the conditional mean-squared error can be made zero by choosing the estimate to be identically \theta_0 for all observations y. However, if θ is not close to \theta_0, such an estimate performs poorly.

With the unbiasedness restriction, the conditional mean-squared error becomes the variance of the estimate. Hence, these estimators are termed minimum-variance unbiased estimators (MVUEs). The procedure for seeking MVUEs:
– Find a complete sufficient statistic T for \Lambda.
– Find any unbiased estimator \hat{g}(y) of g(\theta).
– Then \tilde{g}[T(y)] = E_\theta\{\hat{g}(Y) \mid T(Y) = T(y)\} is an MVUE of g(\theta).

Example 8 (Poor): Consider the model
    Y_k = \theta s_k + N_k,   k = 1, ..., n,
where N_1, ..., N_n are i.i.d. N(0, \sigma^2) noise samples and s_k is a known signal for k = 1, ..., n. Our objective is to estimate θ and σ².

1. The density of Y is given by
    p_\theta(y) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\Big\{ -\frac{1}{2\sigma^2} \sum_{k=1}^{n} (y_k - \theta s_k)^2 \Big\}
               = C(\theta) \exp\{ \theta_1 T_1(y) + \theta_2 T_2(y) \}\, h(y),
where \theta = [\theta_1\ \theta_2]^T with \theta_1 = \theta/\sigma^2 and \theta_2 = -1/(2\sigma^2), and
    T_1(y) = \sum_{k=1}^{n} s_k y_k,   T_2(y) = \sum_{k=1}^{n} y_k^2,
    C(\theta) = \Big(\frac{-\theta_2}{\pi}\Big)^{n/2} \exp\Big\{ \frac{\theta_1^2}{4\theta_2} \sum_{k=1}^{n} s_k^2 \Big\},   h(y) = 1.

Example 9 (continuing Example 8): Note that T = [T_1\ T_2]^T is a complete sufficient statistic for θ.

2. We wish to estimate g_1(\theta) = -\theta_1/(2\theta_2) = \theta and g_2(\theta) = -1/(2\theta_2) = \sigma^2. Assuming that s_1 \ne 0, the estimate \hat{g}_1(y) = y_1/s_1 is an unbiased estimator of g_1(\theta). Moreover, note that
    E_\theta\{T_1^2(Y)\} = Var_\theta\{T_1(Y)\} + (E_\theta\{T_1(Y)\})^2 = n\sigma^2\overline{s^2} + n^2\theta^2(\overline{s^2})^2,   with \overline{s^2} = (1/n)\sum_{k=1}^{n} s_k^2,
and that
    E_\theta\{T_2(Y)\} = \sum_{k=1}^{n} E_\theta\{Y_k^2\} = \sum_{k=1}^{n} (\sigma^2 + \theta^2 s_k^2) = n\sigma^2 + n\theta^2\overline{s^2}.
Hence, \hat{g}_2(y) = [T_2(y) - T_1^2(y)/(n\overline{s^2})]/(n-1) is an unbiased estimate of g_2(\theta).

Example 10 (continuing Example 8):

3. Since T is a complete sufficient statistic, the estimates
    \tilde{g}_1[T(y)] = E_\theta\{\hat{g}_1(Y) \mid T(Y) = T(y)\},
    \tilde{g}_2[T(y)] = E_\theta\{\hat{g}_2(Y) \mid T(Y) = T(y)\}
are MVUEs of g_1(\theta) and g_2(\theta). Note that \hat{g}_1(Y) and T_1(Y) are both linear functions of Y and are therefore jointly Gaussian. Hence, the MVUEs are
    \tilde{g}_1[T(y)] = E_\theta\{\hat{g}_1(Y)\} + Cov[\hat{g}_1(Y), T_1(Y)]\,[Var_\theta(T_1(Y))]^{-1}\,[T_1(y) - E_\theta\{T_1(Y)\}]
                     = \theta + \sigma^2 (n\sigma^2\overline{s^2})^{-1} [T_1(y) - n\theta\overline{s^2}]
                     = \frac{T_1(y)}{n\overline{s^2}} = \frac{\sum_{k=1}^{n} s_k y_k}{n\overline{s^2}} \equiv \hat{\theta},
and, since \hat{g}_2(y) is already a function of T(y) (so that conditioning on T leaves it unchanged),
    \tilde{g}_2[T(y)] = \frac{T_2(y) - T_1^2(y)/(n\overline{s^2})}{n-1} = \hat{\sigma}^2 = \frac{1}{n-1} \sum_{k=1}^{n} (y_k - \hat{\theta} s_k)^2.
(A numerical check of these two MVUEs is sketched below.)
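The following sketch simulates the model of Example 8 (with an arbitrary signal vector s and illustrative values of θ and σ², all assumptions for illustration) and checks empirically that the two MVUE expressions above are unbiased for θ and σ².

```python
import numpy as np

rng = np.random.default_rng(3)
theta_true, sigma = 0.8, 1.5                 # illustrative true values
s = np.array([1.0, -2.0, 0.5, 3.0, 1.5])     # known signal s_k (arbitrary choice), n = 5
n = s.size
n_trials = 200000

# Simulate Y_k = theta * s_k + N_k, N_k ~ N(0, sigma^2), for many independent trials.
y = theta_true * s + sigma * rng.standard_normal((n_trials, n))

T1 = y @ s                       # T1(y) = sum_k s_k y_k
T2 = np.sum(y**2, axis=1)        # T2(y) = sum_k y_k^2
s2_bar = np.mean(s**2)           # (1/n) sum_k s_k^2

theta_hat = T1 / (n * s2_bar)                           # MVUE of theta
sigma2_hat = (T2 - T1**2 / (n * s2_bar)) / (n - 1)      # MVUE of sigma^2

print("E[theta_hat]  ~", theta_hat.mean(), " (true:", theta_true, ")")
print("E[sigma2_hat] ~", sigma2_hat.mean(), " (true:", sigma**2, ")")
```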
Maximum-Likelihood (ML) Estimation:
– For many problems arising in practice, it is not feasible to find MVUEs, so another method for seeking good estimators is needed. ML estimation is one of the most commonly used methods in the signal processing literature.
– Consider MAP estimation for θ:
    \hat{\theta}_{MAP}(y) = \arg\max_{\theta \in \Lambda} p_\theta(y)\, w(\theta)
– In the absence of any prior information about the parameter θ, we can assume that it is uniformly distributed (w(θ) becomes a uniform density), since this represents the worst-case scenario.
– Hence, the MAP estimate for a given y becomes any value of θ that maximizes p_\theta(y) over \Lambda. Viewed as a function of θ, p_\theta(y) is called the likelihood function, and the ML estimate is
    \hat{\theta}_{ML}(y) = \arg\max_{\theta \in \Lambda} p_\theta(y)
– Maximizing p_\theta(y) is the same as maximizing \log p_\theta(y) (the log-likelihood function). Therefore, a necessary condition for the maximum-likelihood estimate (when the maximum is interior and the density is differentiable in θ) is
    \frac{\partial}{\partial \theta} \log p_\theta(y) \Big|_{\theta = \hat{\theta}_{ML}(y)} = 0.
  This condition is also known as the likelihood equation.

Cramer-Rao Bound:
– Let \hat{\theta}(Y) be an unbiased estimator of θ. Then the error covariance matrix of \hat{\theta} is bounded below by the Cramer-Rao bound (refer to the supplement).
– If the Cramer-Rao bound can be satisfied with equality, only the maximum-likelihood estimate achieves it. Hence, if an efficient estimate exists, it is the maximum-likelihood estimate.

Example 11: Refer to the attached paper "The Stochastic CRB for Array Processing: A Textbook Derivation" by Stoica, Larsson, and Gershman.

Questions
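As a supplementary numerical illustration of ML estimation and the Cramer-Rao bound, consider again the model of Example 8 with σ² known. Solving the likelihood equation in that case gives \hat{\theta}_{ML}(y) = \sum_k s_k y_k / \sum_k s_k^2 (the same expression as the MVUE found earlier), and the Cramer-Rao bound on the variance of any unbiased estimate of θ is \sigma^2 / \sum_k s_k^2; this derivation is standard but is not spelled out in the slides. The sketch below, with arbitrary illustrative values, compares the empirical variance of the ML estimate with the bound.

```python
import numpy as np

rng = np.random.default_rng(4)
theta_true, sigma = 0.8, 1.5                 # illustrative true values (sigma^2 known)
s = np.array([1.0, -2.0, 0.5, 3.0, 1.5])     # known signal (arbitrary choice)
n_trials = 200000

# Simulate Y_k = theta * s_k + N_k repeatedly and form the ML estimate each time.
y = theta_true * s + sigma * rng.standard_normal((n_trials, s.size))
theta_ml = (y @ s) / np.sum(s**2)            # ML estimate: solves the likelihood equation

crb = sigma**2 / np.sum(s**2)                # Cramer-Rao bound on Var(theta_hat)
print("empirical variance of ML estimate:", theta_ml.var())
print("Cramer-Rao bound:                 ", crb)
# The two agree, illustrating that the ML estimate is efficient for this linear Gaussian model.
```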