
Peter Hoff
Shrinkage estimators
October 31, 2013
Contents
1 Shrinkage estimators
2 Admissible linear shrinkage estimators
3 Admissibility of unbiased normal mean estimators
4 Motivating the James-Stein estimator
   4.1 What is wrong with X?
   4.2 An oracle estimator
   4.3 Adaptive shrinkage estimation
5 Risk of δJS
   5.1 Risk bound for δJS
   5.2 Stein's identity
6 Some oracle inequalities
   6.1 A simple oracle inequality
7 Unknown variance or covariance
Much of this content comes from Lehmann and Casella [1998], sections 5.2, 5.4, 5.5,
4.6 and 4.7.
1 Shrinkage estimators
Consider a model {p(x|θ) : θ ∈ Θ} for a random variable X such that
E[X|θ] = µ(θ), 0 < Var[X|θ] = σ²(θ) < ∞ ∀θ ∈ Θ.
A linear estimator δ(x) for µ(θ) is an estimator of the form
δab (X) = aX + b.
Is δab admissible?
Theorem 1 (LC thm 5.2.6). δab(X) = aX + b is inadmissible for E[X|θ] under squared error loss whenever
1. a > 1,
2. a = 1 and b ≠ 0, or
3. a < 0.
Proof. The risk of δab is
R(θ, δab) = E[(aX + b − µ)²|θ]
= E[(a(X − µ) + (b − µ(1 − a)))²|θ]
= E[a²(X − µ)² + (b − µ(1 − a))² + 2a(X − µ)(b − µ(1 − a))|θ]
= a²σ² + (b − µ(1 − a))²,
since the cross term has expectation zero.
1. If a > 1, then R(θ, δab) ≥ a²σ² > σ² = R(θ, X), so δab is dominated by X.
2. If a < 0, then
R(θ, δab) > (b − µ(1 − a))²
= (1 − a)²(b/(1 − a) − µ)²
≥ (b/(1 − a) − µ)² = R(θ, b/(1 − a)),
and so δab is dominated by the constant estimator b/(1 − a).
3. If a = 1 and b ≠ 0, then R(θ, δab) = σ² + b² > σ² = R(θ, X), so δab is dominated by X.
Letting w = 1 − a and µ0 = b/(1 − a), the result suggests that if we want to use an
admissible linear estimator, it should be of the form
δ(X) = wµ0 + (1 − w)X , w ∈ [0, 1]
We call such estimators linear shrinkage estimators as they “shrink” the estimate
from X towards µ0 . Intuitively, you can think of µ0 as your “guess” as to the value
of µ, and w as the confidence you have in your guess. Of course, the closer your
guess is to the truth, the better your estimator.
If µ0 represents your guess as to µ(θ), it seems natural to require that µ0 ∈ µ(Θ) =
{µ : µ = µ(θ), θ ∈ Θ}, i.e. µ0 is a possible value of µ.
Lemma 1. If µ(Θ) is convex and µ0 ∉ µ̄(Θ) (the closure of µ(Θ)), then for w ∈ (0, 1] the estimator δ(X) = wµ0 + (1 − w)X is not admissible.
Proof. For the one-dimensional case, suppose µ0 > µ(θ) ∀θ ∈ Θ. Let µ̃0 = sup_{θ∈Θ} µ(θ) and δ̃(X) = wµ̃0 + (1 − w)X. Then δ̃(X) dominates δ(X): the two estimators have the same variance, and δ(X) has the larger bias for every θ. The proof is similar for the case µ0 < µ(θ) ∀θ ∈ Θ.
Exercise: Generalize this result to higher dimensions.
2 Admissible linear shrinkage estimators
We have shown that δ(X) = wµ0 + (1 − w)X is inadmissible for µ(θ) = E[X|θ] if
• w ∉ [0, 1], or
• µ0 ∉ µ(Θ).
Restricting attention to w ∈ [0, 1] and µ0 ∈ µ(Θ), it may seem that such estimators
should always be admissible, but “always” is almost always too inclusive.
Exercise: Give an example where wµ0 + (1 − w)X is not admissible, even with w ∈ (0, 1) and µ0 ∈ µ(Θ).
Linear shrinkage via conjugate priors
What about using a Bayesian argument? Recall,
Theorem. Any unique Bayes estimator is admissible.
If we can show that wµ0 + (1 − w)X is unique Bayes under some prior, then we will
have shown admissibility.
Let X1 , . . . , Xn ∼ i.i.d. p(x|θ), where
p(x|θ) ∈ P = {p(x|θ) = h(x) exp(x · θ − A(θ)) : θ ∈ H}
Consider estimation of µ = E[X|θ] under squared error loss.
Let π(θ) ∝ exp(n0 µ0 · θ − n0 A(θ)) where n0 > 0 and µ0 ∈ Conv{E[X|θ] : θ ∈ H}
Recall that under this prior,
E[µ] ≡ E[E[X|θ]] = µ0 .
Then π(θ|x) ∝ exp(n1 µ1 · θ − n1 A(θ)), where n1 = n0 + n and
n1 µ1 = n0 µ0 + n x̄,
µ1 = (n0/n1) µ0 + (n/n1) x̄ = [n0/(n0 + n)] µ0 + [n/(n0 + n)] x̄.
Under this posterior distribution,
E[µ|x] ≡ E[E[X|θ]|x] = µ1 .
Therefore, the unique Bayes estimator of µ = E[X|θ] under squared error loss is
µ1 = wµ0 + (1 − w)x̄, with w = n0/(n0 + n) ∈ (0, 1),
and so this linear shrinkage estimator is admissible.
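As a quick numerical illustration of this posterior-mean shrinkage (a minimal sketch; the function name, data, and prior settings below are invented for illustration and are not part of the notes):

```python
import numpy as np

def conjugate_shrinkage(x, mu0, n0):
    """Posterior mean mu1 = w*mu0 + (1 - w)*xbar, with w = n0/(n0 + n)."""
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    w = n0 / (n0 + n)
    return w * mu0 + (1 - w) * x.mean(axis=0)

# hypothetical example: prior guess mu0 = 0 with prior weight n0 = 5, n = 20 observations
rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=1.0, size=(20, 3))
print(conjugate_shrinkage(x, mu0=np.zeros(3), n0=5.0))  # shrunk towards 0 relative to x.mean(0)
```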
Example (multiple normal means):
Let X ∼ Np (θ, σ 2 I). First consider the case that σ 2 is known, so that
p(x|θ) = (2πσ²)^{−p/2} exp(−(x − θ)·(x − θ)/[2σ²])
∝θ exp(x·θ/σ² − θ·θ/[2σ²]).
Consider the normal prior
π(θ) = (2πτ0²)^{−p/2} exp(−(θ − θ0)·(θ − θ0)/[2τ0²])
∝θ exp(θ0·θ/τ0² − θ·θ/[2τ0²]),
where τ0² is analogous to 1/n0 in the general formulation for exponential families.
The posterior density is
π(θ|x) ∝θ exp{ [θ0/τ0² + x/σ²]·θ − θ·θ [1/σ² + 1/τ0²]/2 }
= exp{ θ1·θ/τ1² − θ·θ/[2τ1²] },
where
• 1/τ1² = 1/τ0² + 1/σ²,
• θ1 = [(1/τ0²)/(1/τ0² + 1/σ²)] θ0 + [(1/σ²)/(1/τ0² + 1/σ²)] x ≡ wθ0 + (1 − w)x.
So {θ|x} ∼ Np(θ1, τ1² I), which means that
E[θ|x] = θ1 = wθ0 + (1 − w)x
uniquely minimizes the posterior risk under squared error loss. The posterior mean is therefore a unique Bayes estimator and also an admissible estimator of θ. Since this result holds for all τ0² > 0, we have the following:
Lemma 2. For each w ∈ (0, 1) and θ 0 ∈ Rp , the estimator δwθ0 (x) = wθ 0 +(1−w)x
is admissible for estimating θ in the model X ∼ Np (θ, σ 2 I), θ ∈ Rp , where σ 2 is
known.
Of course, what we would like is the following lemma:
Lemma 3. For each w ∈ (0, 1) and θ 0 ∈ Rp , the estimator δwθ0 (X) = wθ 0 +(1−w)X
is admissible for estimating θ in the model X ∼ Np (θ, σ 2 I), θ ∈ Rp , σ 2 ∈ R+ .
How can this result be obtained?
Theorem 2. Let P = {p(x|θ, ψ) : (θ, ψ) ∈ Θ × Ψ}, and for ψ0 ∈ Ψ, let Pψ0 =
{p(x|θ, ψ0 ) : θ ∈ Θ} be a submodel. If δ is admissible for estimating θ under Pψ0 for
each ψ0 ∈ Ψ, then δ is admissible for estimating θ under P.
Proof. Suppose δ satisfies the conditions of the theorem but is not admissible under P. Then there exists a δ′ ∈ D such that
R((θ, ψ), δ′) ≤ R((θ, ψ), δ) ∀(θ, ψ), and
R((θ0, ψ0), δ′) < R((θ0, ψ0), δ) for some (θ0, ψ0).
In particular, δ′ dominates δ within the submodel Pψ0. But this contradicts the assumption that δ is admissible for estimating θ under Pψ0. Therefore, no such δ′ can exist, and so δ is admissible for P.
A corollary to this theorem is the admissibility of wθ 0 + (1 − w)X in the normal
model with unknown variance.
3 Admissibility of unbiased normal mean estimators
Let X ∼ Np (θ, σ 2 I), θ ∈ Rp , σ 2 > 0.
For estimation of θ under squared error loss, we have shown that the linear shrinkage
estimator δ(x) = wθ 0 + (1 − w)x is
• inadmissible if w ∉ [0, 1],
• admissible if w ∈ (0, 1).
What remains to evaluate is the admissibility for w ∈ {0, 1}. Admissibility for w = 1
is easy to show - the estimator δ1θ0 (x) = θ 0 beats everything at θ 0 and so can’t be
dominated. The last and most interesting case is that of w = 0, i.e. δ0 (X) = X, the
unbiased MLE and UMVUE.
Blyth’s method
Recall Blyth’s method for showing admissibility using a limiting Bayes argument:
Theorem 3 (LC 5.7.13). Suppose Θ ⊂ Rp is open, and that R(θ, δ) is continuous in
θ for all δ ∈ D. Let δ be an estimator and {πn } be a sequence of measures such that
for any open ball B ⊂ Θ,
[R(πn, δ) − R(πn, δπn)] / πn(B) → 0 as n → ∞.
Then δ is admissible.
Let’s try to use this to show admissibility of δ0 (X) = X in the normal means
problem. We begin with the case that σ 2 = 1 is known.
X ∼ Np(θ, I)
θ ∼ Np(0, τ² I)
{θ|X} ∼ Np( [τ²/(1 + τ²)] X, [τ²/(1 + τ²)] I ).
The unique Bayes estimator is δτ²(X) = E[θ|X] = [τ²/(1 + τ²)] X ≡ (1 − w)X.
To apply the theorem, we need to compute the Bayes risk of δτ 2 and X under the
Np(0, τ² I) prior πτ². The loss we will use is "scaled" squared error loss, L(θ, d) = Σⱼ (θj − dj)²/p. Because the risk is the average of the individual MSEs, the Bayes
risk is just the average of the Bayes risks from the p components,
R(θ, δ) = E[ Σ_{j=1}^p (θj − δj)² ]/p = Σ_{j=1}^p E[(θj − δj)²]/p,
and so calculating the Bayes risk is similar to calculating the risk in the p = 1
problem. For δτ 2 (x) = ax, where a = 1 − w = τ 2 /(1 + τ 2 ), we have
E[(aX − θ)²] = E[(a(X − θ) − (1 − a)θ)²]
= a² E[(X − θ)²] − 2a(1 − a) E[(X − θ)θ] + (1 − a)² E[θ²]
= a² + (1 − a)² τ²
= τ⁴/(1 + τ²)² + τ²/(1 + τ²)² = τ²/(1 + τ²),
where the cross term vanishes because E[(X − θ)θ] = E[θ E[X − θ|θ]] = 0.
A more intuitive way to calculate this makes use of the fact that δτ 2 (X) = E[θ|X],
so
E[(θ − δτ²)²] = E[(θ − E[θ|X])²]
= Ex[ Eθ|x[(θ − E[θ|X])²] ]
= Ex[ Var[θ|X] ]
= Ex[ τ²/(1 + τ²) ] = τ²/(1 + τ²).
Similarly,
E[(X − θ)2 ] = Eθ [Ex|θ [(X − θ)2 ]] = Eθ [1] = 1.
So R(πτ², δτ²) = τ²/(1 + τ²) and R(πτ², X) = 1. Returning to the p-variate case, since the Bayes risk is the arithmetic average of the risks for each of the p components of θ, we have
R(πτ², δτ²) = τ²/(1 + τ²),
R(πτ², X) = 1.
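These two Bayes risks are easy to confirm by Monte Carlo (a sketch; the choices of p, τ², and simulation size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
p, tau2, nsim = 5, 2.0, 200_000

theta = rng.normal(scale=np.sqrt(tau2), size=(nsim, p))  # theta ~ N_p(0, tau2 I)
x = theta + rng.normal(size=(nsim, p))                   # X | theta ~ N_p(theta, I)
delta = (tau2 / (1 + tau2)) * x                          # Bayes estimator delta_{tau^2}

scaled_loss = lambda d: np.mean(np.sum((d - theta) ** 2, axis=1) / p)
print(scaled_loss(delta), tau2 / (1 + tau2))  # Bayes risk of the Bayes estimator
print(scaled_loss(x), 1.0)                    # Bayes risk of X
```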
Note that
δτ²(X) = [τ²/(1 + τ²)] X ↑ X as τ² ↑ ∞, and
R(πτ², δτ²) ↑ R(πτ², X) as τ² ↑ ∞,
so X is a “limiting Bayes” estimator, for which the risk difference from the Bayes
estimator converges to zero. This is promising - let’s now apply the theorem. Letting
B be any open finite ball in Rp , we need to see if the following limit is zero:
lim_{τ²→∞} [R(πτ², X) − R(πτ², δτ²)] / πτ²(B) = lim_{τ²→∞} [1 − τ²/(1 + τ²)] / πτ²(B)
= lim_{τ²→∞} [(1 + τ²) πτ²(B)]^{−1}.
Now πτ²(B) → 0 as τ² → ∞ for any bounded set B. Therefore, the limit is zero only if
lim_{τ²→∞} τ² πτ²(B) = ∞.
We have
τ² πτ²(B) = τ² ∫_B (2πτ²)^{−p/2} exp(−||θ||²/[2τ²]) dθ
= (2π)^{−p/2} × (τ²)^{1−p/2} × ∫_B exp(−||θ||²/[2τ²]) dθ
≡ (2π)^{−p/2} × (a) × (b).
Now take the limit as τ² → ∞:
(b) → Vol(B) as τ² → ∞, and
(a) → ∞ if p = 1, → 1 if p = 2, → 0 if p > 2.
Therefore, the desired limit is achieved for p = 1 but not for p > 1.
• By the theorem, X is admissible for θ when p = 1.
• For p > 1, this method of showing admissibility does not work.
– For p = 2, X can be shown to be admissible using Blyth’s method with
non-normal priors (see LC exercise 5.4.5).
– For p > 2, X can’t be shown to be admissible because it isn’t.
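The behaviour of τ²πτ²(B) that drives this trichotomy can be checked numerically. For a ball B of radius r centered at the origin, πτ²(B) = P(χ²_p ≤ r²/τ²), so the quantity is easy to evaluate (a sketch; the radius and the grid of τ² values are arbitrary):

```python
import numpy as np
from scipy.stats import chi2

r = 1.0  # radius of the ball B, here centered at the origin
for p in (1, 2, 3):
    for tau2 in (1e2, 1e4, 1e6):
        prob = chi2.cdf(r ** 2 / tau2, df=p)  # pi_{tau^2}(B) = P(||theta|| <= r)
        print(p, tau2, tau2 * prob)
# tau2 * pi(B) grows without bound for p = 1, levels off at a constant for p = 2,
# and tends to zero for p = 3, matching the limits derived above.
```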
Interpreting the failure of Blyth’s method:
The admissibility conditions for Blyth's method derived from consideration of the existence of an estimator δ that dominates X. If such an estimator exists, then by continuity of risks
∃ ε > 0 and an open ball B ⊂ Θ such that R(θ, X) − R(θ, δ) > ε ∀θ ∈ B,
which implies for each prior πk that
R(πk, X) − R(πk, δ) = ∫ [R(θ, X) − R(θ, δ)] πk(dθ) ≥ ∫_B [R(θ, X) − R(θ, δ)] πk(dθ) ≥ ε πk(B).
Comparing to the Bayes risk of the Bayes estimator δk under πk then gives
R(πk, X) − R(πk, δk) ≥ R(πk, X) − R(πk, δ) ≥ ε πk(B) ∀k,
as δk has Bayes risk less than or equal to that of δ. Could such a δ exist?
Could exist: Suppose B is a ball such that πk (B) goes to zero very fast. Then
an estimator (like X) can have a good limiting Bayes risk and still do poorly on B.
This allows for the possibility of domination by another estimator that does better
on B.
Couldn’t exist: On the other hand, if R(πk , X) − R(πk , δk ) goes to zero very fast
(e.g. faster than the probability of any ball B), then in a sense X would have to
be doing well everywhere, and would not be able to be dominated - this is Blyth’s
method for showing admissibility.
What fails in the admissibility proof for the normal means problem is that for p > 2,
the probability πk (B) of an open ball B is going to zero much faster than the Bayes
risk difference, leaving a large enough “gap” for some other estimator to do better.
4 Motivating the James-Stein estimator
Stein [1956] showed that X is inadmissible for θ in the normal means problem when p > 2. This was surprising, as X is the MLE and UMVUE for θ. In this section, we motivate the James-Stein estimator, which dominates X when p > 2.
4.1 What is wrong with X?
For large p,
• X may be close to θ, but
• X · X = ||X||2 may be far from θ · θ = ||θ||2 .
If X ∼ Np(θ, I),
E[||X||²] = E[ Σ_{j=1}^p Xj² ] = Σ_{j=1}^p (θj² + 1) = ||θ||² + p,
so for large p, the magnitude of the estimator vector X is expected to be much larger
than the magnitude of the estimand vector θ. More insight can be gained as follows:
Note that every vector x can be expressed as
x = sθ + r, for some s ∈ R and r : θ · r = 0.
Here, the random variable s is the magnitude of the projection of x in the direction of
θ, and r is the residual vector. Using this decomposition, we can write the squared-error loss of ax for estimating θ as
||ax − θ||2 = (ax − θ) · (ax − θ)
= ((as − 1)θ + ar) · ((as − 1)θ + ar)
= (as − 1)2 ||θ||2 + a2 ||r||2
Now consider replacing x with X ∼ Np (θ, I). The random-variable version of the
above equation is then
||aX − θ||2 = (aS − 1)2 ||θ||2 + a2 ||R||2
Exercise: Show that
• S ∼ N(1, ||θ||⁻²),
• ||R||² ∼ χ²_{p−1},
• S and R are independent.
Now imagine a situation where p is growing but ||θ||2 remains fixed. The distribution
of (aS − 1)²||θ||² remains fixed whereas the distribution of a²||R||² blows up. This suggests that if we think ||θ||²/p is small we should use an estimator like aX with a < 1 to control the error that comes from ||R||². But what should the value of a be?
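The distributional claims in the exercise can be spot-checked by simulation (a sketch; θ and the sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
p, nsim = 20, 100_000
theta = np.ones(p)                             # any fixed theta; here ||theta||^2 = p
x = theta + rng.normal(size=(nsim, p))         # X ~ N_p(theta, I)

s = x @ theta / (theta @ theta)                # coefficient in x = s*theta + r
r = x - s[:, None] * theta                     # residual, orthogonal to theta
print(s.mean(), s.var(), 1 / (theta @ theta))  # S has mean 1 and variance ||theta||^{-2}
print((r ** 2).sum(axis=1).mean(), p - 1)      # ||R||^2 has mean p - 1 (chi^2_{p-1})
```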
4.2 An oracle estimator
Question: Among estimators aX : a ∈ [0, 1], which has the smallest risk?
Solution:
E[||aX − θ||2 ] = E[||(aX − aθ) − (1 − a)θ||2 ]
= a2 p + (1 − a)2 ||θ||2 .
Taking derivatives, the minimizing value ã of a satisfies
2ãp − 2(1 − ã)||θ||² = 0,
ã/(1 − ã) = ||θ||²/p,
ã = ||θ||²/(||θ||² + p).
Thus the optimal shrinkage “estimator” is given by δã (X) = ãX. This is not really
an estimator in the usual sense, because the ideal degree of shrinkage ã depends on
θ. For this reason, ãX is sometimes called an “oracle estimator:” You would need
an oracle to tell you the value of ||θ||2 before you could use it. Note that the risk of
this estimator is
||θ||4 p + p2 ||θ||2
(||θ||2 + p)2
p||θ||2 (||θ||2 + p)
=
(||θ||2 + p)2
p||θ||2
=
||θ||2 + p
E[||aX − θ||2 ] =
and so
E[||aX − θ||2 ] = p
||θ||2
< p = E[||X − θ||2 ].
||θ||2 + p
The risk differential is large if ||θ||2 is small compared to p.
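A short simulation confirms the oracle risk formula (a sketch; θ is an arbitrary choice with ||θ||² small relative to p):

```python
import numpy as np

rng = np.random.default_rng(3)
p, nsim = 10, 200_000
theta = np.full(p, 0.5)                        # ||theta||^2 = 2.5, small compared to p = 10
x = theta + rng.normal(size=(nsim, p))

a = theta @ theta / (theta @ theta + p)        # oracle shrinkage factor (uses theta itself)
risk = lambda est: np.mean(np.sum((est - theta) ** 2, axis=1))
print(risk(a * x), p * (theta @ theta) / (theta @ theta + p))  # simulated vs formula
print(risk(x), p)                                              # risk of X is p
```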
4.3 Adaptive shrinkage estimation
As shown above, the optimal amount of shrinkage ã is
ã = ||θ||²/(||θ||² + p) = (||θ||²/p) / (||θ||²/p + 1).
Note that ||θ||²/p is the variability of the θj values around zero. Can this variability
be estimated? Consider the following hierarchical model:
Xj = θj + εj,
ε1, . . . , εp ∼ i.i.d. N(0, 1),
θ1, . . . , θp ∼ i.i.d. N(0, τ²).
If you’d like to connect this with some actual inference problem, imagine that each
Xj is the sample mean or t-statistic calculated from observations from experiment j,
with population mean θj .
Suppose you believed this model and knew the value of τ². If you were interested in finding an estimator δ(X) that minimized the expected squared error ||θ − δ(X)||²
under repeated sampling of
• θ1 , . . . , θp , followed by sampling of
• X1 , . . . , X p ,
you would want to come up with an estimator δ(X) that minimized
E[||θ − δ(X)||²] = ∫∫ ||θ − δ(x)||² p(dx|θ) p(dθ).
Exercise: Show that δτ²(X) = [τ²/(1 + τ²)] X minimizes this expected loss.
If we knew τ², then the estimator to use would be [τ²/(1 + τ²)] X. We generally don't know τ², but maybe it can be estimated from the data. Under the above model,
X = θ + ε,
θ ∼ Np(0, τ² I), ε ∼ Np(0, I), Cov(θ, ε) = 0.
This means that the distribution of X marginalized over θ is
X ∼ Np (0, (τ 2 + 1)I).
An unbiased estimator of τ 2 + 1 is clearly ||X||2 /p, so an unbiased estimator of τ 2 is
τ̂² = (||X||² − p)/p.
However, we were interested in estimating τ 2 /(τ 2 + 1), not τ 2 . If p > 2, you can use
the fact that ||X||² ∼ gamma(p/2, 1/[2(τ² + 1)]) to show that
E[||X||⁻²] = [(p − 2)(τ² + 1)]⁻¹,
and so
E[(p − 2)/||X||²] = 1/(τ² + 1), and E[1 − (p − 2)/||X||²] = τ²/(τ² + 1).
Again, [τ²/(τ² + 1)] X would be the optimal estimator in this hierarchical model if we knew τ². If we don't know τ², we might instead consider using
δJS(X) = ( estimate of τ²/(τ² + 1) ) X = (1 − (p − 2)/||X||²) X.
This estimator is called the James-Stein estimator. As we will see, it has many
interesting properties:
• For large p in the hierarchical normal model, it is almost as good as the oracle estimator [τ²/(τ² + 1)] X:
E_{X,θ}[||θ − δJS||²] ≈ E_{X,θ}[||θ − [τ²/(τ² + 1)] X||²].
• Even if the hierarchical normal model isn't correct, it is still almost as good as the oracle estimator ãX in the normal means model:
E_{X|θ}[||θ − δJS||²] ≈ E_{X|θ}[||θ − ãX||²].
• In the normal means problem, this estimator dominates the unbiased estimator X if p > 2:
E_{X|θ}[||θ − δJS||²] < E_{X|θ}[||θ − X||²] ∀θ.
We will show this last inequality first.
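Before doing so, here is a small simulation sketch comparing the three estimators in the normal means model (the settings and the helper function are my own, not from the notes); the scaled risks should come out roughly in the order oracle ≤ James-Stein < X:

```python
import numpy as np

def james_stein(x):
    """delta_JS(x) = (1 - (p - 2)/(x.x)) x, applied to each row of x."""
    shrink = 1 - (x.shape[-1] - 2) / np.sum(x * x, axis=-1, keepdims=True)
    return shrink * x

rng = np.random.default_rng(4)
p, nsim = 10, 200_000
theta = rng.normal(size=p)
x = theta + rng.normal(size=(nsim, p))                      # X ~ N_p(theta, I)

a = theta @ theta / (theta @ theta + p)                     # oracle factor (needs theta)
scaled_risk = lambda est: np.mean(np.sum((est - theta) ** 2, axis=1)) / p
print("X      :", scaled_risk(x))                           # approximately 1
print("JS     :", scaled_risk(james_stein(x)))              # below 1 for p > 2
print("oracle :", scaled_risk(a * x))                       # the best of the class {aX}
```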
5 Risk of δJS
We will show that δJS dominates X by showing that the risk R(θ, δJS ) = E[||δJS −
θ||2 ]/p is uniformly less than 1. This will not be done by computing its risk function
directly, but instead by showing that 1 is an upper bound on the risk. This bound will
be obtained via an identity that has applications beyond the calculation of R(θ, δJS ).
5.1 Risk bound for δJS
We can write the James-Stein estimator as
δJS = ( estimate of τ²/(1 + τ²) ) x = (1 − (p − 2)/(x·x)) x
= x − [(p − 2)/(x·x)] x
≡ x − g(x).
Under X ∼ Np (θ, I),
E[||δJS − θ||²] = E[||X − g(X) − θ||²]
= E[||(X − θ) − g(X)||²]
= E[||X − θ||²] + E[||g(X)||²] − 2E[(X − θ)·g(X)],
where all expectations are with respect to the distribution of X given θ. The first
expectation is p and the second is
E[||g(X)||²] = E[(p − 2)² (X·X)/(X·X)²] = (p − 2)² E[1/(X·X)].
The third expectation is more complicated, but in the next subsection we’ll derive
an identity (Stein’s identity) for computing E[(X − θ) · g(X)] that is applicable for
arbitrary functions g. Stein's identity as applied to g(x) = [(p − 2)/(x·x)] x gives
E[(X − θ)·g(X)] = E[(p − 2)²/(X·X)].
Using this for the above risk calculation gives
E[||δJS − θ||²] = p + (p − 2)² E[1/(X·X)] − 2(p − 2)² E[1/(X·X)]
= p − (p − 2)² E[1/(X·X)].
Note that we haven’t actually calculated the risk of δJS is closed form - our formula
depends on the expectation of 1/X·X, which is an inverse-moment of a noncentral χ2
distribution where the noncentrality parameter depends on θ. However, computing
this moment is not necessary to show that δJS dominates X: Since
1/(x·x) > 0 ∀x ∈ Rp, we have
E[||δJS − θ||²] = p − (p − 2)² E[1/(X·X)] < p = E[||X − θ||²].
Since the expectation of (X · X)−1 is complicated, further study of the risk of δJS is
often achieved via a study of its unbiased risk estimate. From the above calculation,
we see that
E[||δJS − θ||²] = E[ p − (p − 2)²/(X·X) ],
and so p − (p − 2)²/(X·X) can be said to be an unbiased estimate of the risk of δJS.
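The unbiasedness of this risk estimate is easy to check by simulation (a sketch; settings arbitrary): averaging p − (p − 2)²/(X·X) over draws of X should agree with the Monte Carlo estimate of E[||δJS − θ||²].

```python
import numpy as np

rng = np.random.default_rng(5)
p, nsim = 10, 200_000
theta = rng.normal(size=p)
x = theta + rng.normal(size=(nsim, p))
xx = np.sum(x * x, axis=1)

js = (1 - (p - 2) / xx)[:, None] * x
loss = np.sum((js - theta) ** 2, axis=1)   # realized loss of delta_JS for each draw
sure = p - (p - 2) ** 2 / xx               # unbiased risk estimate for each draw
print(loss.mean(), sure.mean())            # both approximate E[||delta_JS - theta||^2]
```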
Stein’s identity
We start with a univariate version of the identity:
Lemma 4 (Stein’s identity). Let X ∼ N (µ, σ 2 ) and let g(x) be such that E[|g 0 |] < ∞.
Then
E[g(X)(X − µ)] = σ 2 E[g 0 (X)].
Proof.
The proof follows from Fubini's theorem and a bit of calculus. Letting p(x) = φ([x − µ]/σ)/σ, note that p′(x) = −[(x − µ)/σ²] p(x). By the fundamental theorem of calculus,
p(x) = ∫_{−∞}^x −[(y − µ)/σ²] p(y) dy = ∫_x^∞ [(y − µ)/σ²] p(y) dy.
The expectation we wish to calculate is
E[g′(X)] = ∫_{−∞}^∞ g′(x) p(x) dx = ∫_{−∞}^0 g′(x) p(x) dx + ∫_0^∞ g′(x) p(x) dx.
Doing the first part, we have
∫_0^∞ g′(x) p(x) dx = ∫_0^∞ g′(x) ∫_x^∞ [(y − µ)/σ²] p(y) dy dx
= ∫∫_{0<x<y<∞} g′(x) [(y − µ)/σ²] p(y) dy dx
= ∫_0^∞ [(y − µ)/σ²] p(y) ∫_0^y g′(x) dx dy
= ∫_0^∞ [(y − µ)/σ²] p(y) [g(y) − g(0)] dy
= E[(g(X) − g(0)) [(X − µ)/σ²] 1(X > 0)].
Similarly, the second part is
∫_{−∞}^0 g′(x) p(x) dx = E[(g(X) − g(0)) [(X − µ)/σ²] 1(X < 0)].
Adding the two parts gives
E[g′(X)] = E[(g(X) − g(0)) (X − µ)/σ²]
= E[g(X)(X − µ)/σ²] − g(0) E[(X − µ)/σ²] = E[g(X)(X − µ)/σ²],
and so
E[g(X)(X − µ)] = σ² E[g′(X)].
Stein’s lemma is often alternatively proven with integration by parts. These proofs
go roughly as follows: As before,
p(x) = (2πσ²)^{−1/2} exp{−(x − µ)²/[2σ²]},
dp(x)/dx = −(1/σ²)(x − µ) p(x).
E[g′(X)] = ∫_{−∞}^∞ g′(x) p(x) dx
= lim_{c→∞} ∫_{−c}^c g′(x) p(x) dx
≡ lim_{c→∞} ∫_{−c}^c v′(x) u(x) dx
= lim_{c→∞} [ u(x)v(x)|_{−c}^c − ∫_{−c}^c v(x) u′(x) dx ]
= lim_{c→∞} [ g(x)p(x)|_{−c}^c − ∫_{−c}^c g(x) [−(x − µ)/σ²] p(x) dx ]
= E[g(X)(X − µ)/σ²] + lim_{c→∞} g(x)p(x)|_{−c}^c,
where u(x) = p(x) and v(x) = g(x).
To complete the proof we have to show that the last limit is zero. This is straightforward to show if p(x) decreases monotonically in |x|:
Lemma 5. Let p(x) be decreasing to zero in |x| and let E[|g′(X)|] < ∞. Then g(x)p(x) → 0 as x → ±∞.
Proof. Given ε > 0, there exists K such that
∫_K^∞ |g′(x)| p(x) dx < ε/3.
Then for any t sufficiently large,
1. p(t) < p(K) ( ∫_0^K |g′(x)| p(x) dx )^{−1} × ε/3, and
2. p(t) |g(0)| < ε/3.
From this, we have
|g(t)p(t)| = p(t) |g(0) + ∫_0^t g′(x) dx|
= p(t) |g(0) + ∫_0^K g′(x) dx + ∫_K^t g′(x) dx|
≤ p(t)|g(0)| + p(t) ∫_0^K |g′(x)| dx + p(t) ∫_K^t |g′(x)| dx
≤ p(t)|g(0)| + [p(t)/p(K)] ∫_0^K |g′(x)| p(x) dx + ∫_K^t |g′(x)| p(x) dx
≤ ε/3 + ε/3 + ε/3 = ε,
where the second-to-last inequality holds because p(K) ≤ p(x) on x ∈ (0, K) and p(t) ≤ p(x) on x ∈ (K, t), due to p(x) being monotonically decreasing.
This identity generalizes to other exponential families. See LC Theorem 1.5.15.
For computing the risk of a vector-valued function g : Rp → Rp , we will need a
multivariate version of the above identity.
Lemma 6 (Stein’s identity, multivariate version). Let X ∼ Np (µ, σ 2 I) and let g(x) :
Rp → Rp such that E[|dgi /dxi |] < ∞. Then
E[(X − µ) · g(X)] = σ 2 E[∇ · g(X)]
where ∇ · g =
Pp
j=1
dgj (x)/dxj .
Proof. This is just a corollary of the univariate version:
E[dgp(X)/dxp] = ∫_{x_{−p}} [ ∫_{xp} (dgp(x)/dxp) p(xp) dxp ] Π_{j=1}^{p−1} p(xj) dxj
= ∫_{x_{−p}} ( E_{xp}[gp(X)(Xp − µp)] / σ² ) Π_{j=1}^{p−1} p(xj) dxj
= (1/σ²) E[gp(X)(Xp − µp)],
and similarly for each other j ∈ {1, . . . , p}. Therefore
E[∇·g(X)] = Σ_j E[dgj(X)/dxj]
= Σ_j E[gj(X)(Xj − µj)]/σ²
= E[g(X)·(X − µ)]/σ².
Now we are in a position to apply the lemma to obtain the unbiased risk estimator of
δJS. Recall we needed to calculate E[(X − θ)·g(X)], where g(x) = [(p − 2)/(x·x)] x. Applying the lemma, we have
g(x) = (p − 2) ( x1/(x·x), . . . , xp/(x·x) ),
∇·g(x) = (p − 2) Σ_j [x·x − 2xj²]/(x·x)²
= [(p − 2)/(x·x)²] [p(x·x) − 2(x·x)]
= (p − 2)²/(x·x),
and so
E[(X − θ)·g(X)] = E[(p − 2)²/(X·X)],
as we used above.
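The divergence formula ∇·g(x) = (p − 2)²/(x·x) can be spot-checked with finite differences (a sketch; the dimension and test point are arbitrary):

```python
import numpy as np

def g(x):
    """g(x) = (p - 2) x / (x.x)."""
    return (x.size - 2) * x / (x @ x)

def divergence_fd(fun, x, h=1e-6):
    """Central finite-difference approximation of sum_j d f_j(x) / d x_j."""
    total = 0.0
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = h
        total += (fun(x + e)[j] - fun(x - e)[j]) / (2 * h)
    return total

rng = np.random.default_rng(6)
x = rng.normal(size=7)
print(divergence_fd(g, x), (x.size - 2) ** 2 / (x @ x))  # the two values should agree closely
```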
6 Some oracle inequalities

6.1 A simple oracle inequality
Recall that if we knew ||θ||2 , the optimal estimator in the class {δa (x) = ax : a ∈
[0, 1]} would be δã where
ã = (||θ||²/p) / (||θ||²/p + 1).
We also showed that the risk of this estimator is
R(θ, ãX) = E[||ãX − θ||²]/p = ||θ||²/(||θ||² + p) < 1.
Not surprisingly, it turns out that
R(θ, ãX) ≤ R(θ, δ JS )
(use the fact that X · X has a noncentral χ2 distribution, or condition on X · X).
But how much worse is δJS than the oracle estimator δã ? Recall the risk of δ JS is
R(θ, δJS) = 1 − [(p − 2)²/p] E[(X·X)⁻¹].
Since 1/x is convex, Jensen’s inequality gives
E[1/(X·X)] ≥ 1/E[X·X] = 1/(||θ||² + p),
and so
R(θ, δJS) ≤ 1 − [(p − 2)²/p] · 1/(||θ||² + p),
p ( R(θ, δJS) − R(θ, δã) ) ≤ p − (p − 2)²/(||θ||² + p) − p||θ||²/(||θ||² + p)
= [ p||θ||² + p² − p² + 4p − 4 − ||θ||²p ] / (||θ||² + p)
= 4(p − 1)/(||θ||² + p)
≤ 4(p − 1)/p ≤ 4,
and so
R(θ, δã) ≤ R(θ, δJS) ≤ R(θ, δã) + 4/p.
Additional work can get you to
R(θ, δã) ≤ R(θ, δJS) ≤ R(θ, δã) + 2/p.
For more on this, see Johnstone [2002] or Candès [2006].
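The bound can also be tabulated directly from the two closed-form expressions above (a sketch; the grid of ||θ||² values is arbitrary): the Jensen upper bound on R(θ, δJS) never exceeds the oracle risk by more than 4/p.

```python
p = 10
for theta2 in (0.1, 1.0, 10.0, 100.0):                  # values of ||theta||^2
    r_oracle = theta2 / (theta2 + p)                    # R(theta, a~ X)
    r_js_bound = 1 - (p - 2) ** 2 / (p * (theta2 + p))  # Jensen upper bound on R(theta, d_JS)
    print(theta2, r_oracle, r_js_bound, r_js_bound - r_oracle <= 4 / p)
```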
7 Unknown variance or covariance
Suppose X ∼ Np (θ, σ 2 I), where σ 2 is known. Letting
• X̃ = X/σ and
• θ̃ = θ/σ,
we have X̃ ∼ Np (θ̃, I). The James-Stein estimator δ̃ JS of θ̃ = θ/σ is then
δ̃JS(X̃) = (1 − (p − 2)/(X̃·X̃)) X̃ = (1 − σ²(p − 2)/(X·X)) X/σ.
It seems natural then that the JSE of θ should be σ times the JSE of θ̃ = θ/σ, giving
δJS(X) = (1 − σ²(p − 2)/(X·X)) X.
Of course, often σ² is not known. Is there a version of the JSE in this case? Yes, if
you have information about σ 2 . Consider the following hierarchical sampling scheme:
Xi,j = θj + εi,j, i = 1, . . . , n, j = 1, . . . , p,
{εi,j} ∼ i.i.d. N(0, σ²).
Letting Xj = X̄·j = Σ_{i=1}^n Xi,j / n, we now basically have the situation described
above.
Also note that the data contain information about σ 2 via the pooled sample sum of
squares:
S = Σ_{j=1}^p Σ_{i=1}^n (Xi,j − X̄·j)² ∼ σ² χ²_{p(n−1)}.
Note further that X and S are statistically independent. For this and similar situations, James and Stein [1961] considered estimators of the following form:
Let X ∼ Np (θ, σ 2 I) be independent of S ∼ σ 2 χ2k . Define the estimator
δc(X, S) = (1 − cS/(X·X)) X.
The value of c that minimizes the risk of δc for all θ is c = (p − 2)/(k + 2), resulting
in the following estimator:
δJS(X, S) = (1 − [(p − 2)/(k + 2)] S/(X·X)) X.
In particular, note that this estimator dominates X.
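A minimal simulation sketch of this estimator, drawing X and S directly from the model just described (all settings are arbitrary, and σ² is of course not used by the estimator itself):

```python
import numpy as np

rng = np.random.default_rng(7)
p, k, sigma2, nsim = 10, 30, 4.0, 200_000
theta = rng.normal(size=p)

x = theta + np.sqrt(sigma2) * rng.normal(size=(nsim, p))  # X ~ N_p(theta, sigma2 I)
s = sigma2 * rng.chisquare(k, size=nsim)                  # S ~ sigma2 chi^2_k, independent of X

c = (p - 2) / (k + 2)
js = (1 - c * s / np.sum(x * x, axis=1))[:, None] * x     # delta_JS(X, S)

risk = lambda est: np.mean(np.sum((est - theta) ** 2, axis=1))
print(risk(js), risk(x))   # delta_JS(X, S) should show the smaller estimated risk
```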
This result also generalizes to the correlated data case: Let X ∼ Np (θ, Σ) and
S ∼ Wishart(Σ, k) be independent. Consider estimators of the form
δc(X, S) = (1 − c/(X^T S^{−1} X)) X.
James and Stein [1961] show that the estimator obtained by setting
c = (p − 2)/(k − p + 3)
minimizes the risk for all values of θ.
See Brown and Han for recent work on these and related problems, including extensions to regression problems.
References
L.D. Brown and X. Han. Optimal estimation of multidimensional normal means with
an unknown variance.
Emmanuel J. Candès. Modern statistical estimation via oracle inequalities. Acta Numer., 15:257–325, 2006. ISSN 0962-4929. doi: 10.1017/S0962492906230010.
W. James and Charles Stein. Estimation with quadratic loss. In Proc. 4th Berkeley
Sympos. Math. Statist. and Prob., Vol. I, pages 361–379. Univ. California Press,
Berkeley, Calif., 1961.
I. M. Johnstone. Function estimation and Gaussian sequence models. Unpublished manuscript, 2002.
E. L. Lehmann and George Casella. Theory of point estimation. Springer Texts in
Statistics. Springer-Verlag, New York, second edition, 1998. ISBN 0-387-98502-6.
Charles Stein. Inadmissibility of the usual estimator for the mean of a multivariate
normal distribution. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, 1954–1955, vol. I, pages 197–206, Berkeley and
Los Angeles, 1956. University of California Press.