
Peter Hoff
Shrinkage estimators
October 31, 2013
Contents
1 Shrinkage estimators
2 Admissible linear shrinkage estimators
3 Admissibility of unbiased normal mean estimators
4 Motivating the James-Stein estimator
   4.1 What is wrong with X?
   4.2 An oracle estimator
   4.3 Adaptive shrinkage estimation
5 Risk of δJS
   5.1 Risk bound for δJS
   5.2 Stein's identity
6 Some oracle inequalities
   6.1 A simple oracle inequality
7 Unknown variance or covariance
Much of this content comes from Lehmann and Casella [1998], sections 5.2, 5.4, 5.5,
4.6 and 4.7.
1 Shrinkage estimators
Consider a model {p(x|θ) : θ ∈ Θ} for a random variable X such that
E[X|θ] = µ(θ), 0 < Var[X|θ] = σ²(θ) < ∞ ∀θ ∈ Θ.
A linear estimator δ(x) for µ(θ) is an estimator of the form
δab (X) = aX + b.
Is δab admissible?
Theorem 1 (LC thm 5.2.6). δab(X) = aX + b is inadmissible for E[X|θ] under squared error loss whenever
1. a > 1,
2. a = 1 and b ≠ 0, or
3. a < 0.
Proof. The risk of δab is
R(θ, δab) = E[(aX + b − µ)²|θ]
= E[(a(X − µ) + (b − µ(1 − a)))²|θ]
= E[a²(X − µ)² + (b − µ(1 − a))² + 2a(X − µ)(b − µ(1 − a))|θ]
= a²σ² + (b − µ(1 − a))²,
since the cross term has expectation zero.
1. If a > 1, then R(θ, δab) ≥ a²σ² > σ² = R(θ, X), so δab is dominated by X.
2. If a < 0, then
R(θ, δab) > (b − µ(1 − a))²
= (1 − a)²(b/(1 − a) − µ)²
≥ (b/(1 − a) − µ)² = R(θ, b/(1 − a)),
and so δab is dominated by the constant estimator b/(1 − a).
3. If a = 1 and b ≠ 0, then R(θ, δab) = σ² + b² > σ² = R(θ, X), so δab is dominated by X.
Letting w = 1 − a and µ0 = b/(1 − a), the result suggests that if we want to use an
admissible linear estimator, it should be of the form
δ(X) = wµ0 + (1 − w)X , w ∈ [0, 1]
We call such estimators linear shrinkage estimators as they “shrink” the estimate
from X towards µ0 . Intuitively, you can think of µ0 as your “guess” as to the value
of µ, and w as the confidence you have in your guess. Of course, the closer your
guess is to the truth, the better your estimator.
If µ0 represents your guess as to µ(θ), it seems natural to require that µ0 ∈ µ(Θ) =
{µ : µ = µ(θ), θ ∈ Θ}, i.e. µ0 is a possible value of µ.
Lemma 1. If µ(Θ) is convex and µ0 ∉ µ̄(Θ) (the closure of µ(Θ)), then for w ∈ (0, 1] the estimator δ(X) = wµ0 + (1 − w)X is not admissible.
Proof. For the one-dimensional case, suppose µ0 > µ(θ) ∀θ ∈ Θ. Let µ̃0 = sup_{θ∈Θ} µ(θ) and δ̃(X) = wµ̃0 + (1 − w)X. Then δ̃(X) dominates δ(X): the two estimators have the same variance, and δ(X) has the larger bias for every θ. The proof is similar for the case µ0 < µ(θ) ∀θ ∈ Θ.
Exercise: Generalize this result to higher dimensions.
2 Admissible linear shrinkage estimators
We have shown that δ(X) = wµ0 + (1 − w)X is inadmissible for µ(θ) = E[X|θ] if
• w ∉ [0, 1], or
• µ0 ∉ µ(Θ).
Restricting attention to w ∈ [0, 1] and µ0 ∈ µ(Θ), it may seem that such estimators
should always be admissible, but “always” is almost always too inclusive.
Exercise: Give an example where wµ0 + (1 − w)X is not admissible, even with w ∈ (0, 1) and µ0 ∈ µ(Θ).
Linear shrinkage via conjugate priors
What about using a Bayesian argument? Recall,
Theorem. Any unique Bayes estimator is admissible.
If we can show that wµ0 + (1 − w)X is unique Bayes under some prior, then we will
have shown admissibility.
Let X1 , . . . , Xn ∼ i.i.d. p(x|θ), where
p(x|θ) ∈ P = {p(x|θ) = h(x) exp(x · θ − A(θ)) : θ ∈ H}
Consider estimation of µ = E[X|θ] under squared error loss.
Let π(θ) ∝ exp(n0 µ0 · θ − n0 A(θ)) where n0 > 0 and µ0 ∈ Conv{E[X|θ] : θ ∈ H}
Recall that under this prior,
E[µ] ≡ E[E[X|θ]] = µ0 .
Then π(θ|x) ∝ exp(n1 µ1 · θ − n1 A(θ)), where n1 = n0 + n and
n1 µ1 = n0 µ0 + n x̄,
µ1 = (n0/n1) µ0 + (n/n1) x̄ = [n0/(n0 + n)] µ0 + [n/(n0 + n)] x̄.
Under this posterior distribution,
E[µ|x] ≡ E[E[X|θ]|x] = µ1 .
Therefore, the unique Bayes estimator of µ = E[X|θ] under squared error loss is
µ1 = wµ0 + (1 − w)x̄, with w = n0/(n0 + n) ∈ (0, 1),
and so this linear shrinkage estimator is admissible.
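As a quick numerical illustration of this posterior-mean shrinkage (a minimal sketch; the function name, data, and prior settings below are invented for illustration and are not part of the notes):

```python
import numpy as np

def conjugate_shrinkage(x, mu0, n0):
    """Posterior mean mu1 = w*mu0 + (1 - w)*xbar, with w = n0/(n0 + n)."""
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    w = n0 / (n0 + n)
    return w * mu0 + (1 - w) * x.mean(axis=0)

# hypothetical example: prior guess mu0 = 0 with prior weight n0 = 5, n = 20 observations
rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=1.0, size=(20, 3))
print(conjugate_shrinkage(x, mu0=np.zeros(3), n0=5.0))  # shrunk towards 0 relative to x.mean(0)
```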
Example (multiple normal means):
Let X ∼ Np (θ, σ 2 I). First consider the case that σ 2 is known, so that
p(x|θ) = (2πσ²)^{−p/2} exp(−(x − θ)·(x − θ)/[2σ²])
∝θ exp(x·θ/σ² − θ·θ/[2σ²]).
Consider the normal prior
π(θ) = (2πτ0²)^{−p/2} exp(−(θ − θ0)·(θ − θ0)/[2τ0²])
∝θ exp(θ0·θ/τ0² − θ·θ/[2τ0²]),
where τ0² is analogous to 1/n0 in the general formulation for exponential families.
The posterior density is
π(θ|x) ∝θ exp{ [θ0/τ0² + x/σ²]·θ − θ·θ [1/σ² + 1/τ0²]/2 }
= exp{ θ1·θ/τ1² − θ·θ/[2τ1²] },
where
• 1/τ1² = 1/τ0² + 1/σ²,
• θ1 = [(1/τ0²)/(1/τ0² + 1/σ²)] θ0 + [(1/σ²)/(1/τ0² + 1/σ²)] x ≡ wθ0 + (1 − w)x.
So {θ|x} ∼ Np(θ1, τ1² I), which means that
E[θ|x] = θ1 = wθ0 + (1 − w)x
uniquely minimizes the posterior risk under squared error loss. The posterior mean is therefore a unique Bayes estimator and also an admissible estimator of θ. Since this result holds for all τ0² > 0, we have the following:
Lemma 2. For each w ∈ (0, 1) and θ 0 ∈ Rp , the estimator δwθ0 (x) = wθ 0 +(1−w)x
is admissible for estimating θ in the model X ∼ Np (θ, σ 2 I), θ ∈ Rp , where σ 2 is
known.
Of course, what we would like is the following lemma:
Lemma 3. For each w ∈ (0, 1) and θ 0 ∈ Rp , the estimator δwθ0 (X) = wθ 0 +(1−w)X
is admissible for estimating θ in the model X ∼ Np (θ, σ 2 I), θ ∈ Rp , σ 2 ∈ R+ .
How can this result be obtained?
Theorem 2. Let P = {p(x|θ, ψ) : (θ, ψ) ∈ Θ × Ψ}, and for ψ0 ∈ Ψ, let Pψ0 =
{p(x|θ, ψ0 ) : θ ∈ Θ} be a submodel. If δ is admissible for estimating θ under Pψ0 for
each ψ0 ∈ Ψ, then δ is admissible for estimating θ under P.
Proof. Suppose δ satisfies the conditions of the theorem but is not admissible under P. Then there exists a δ′ ∈ D such that
R((θ, ψ), δ′) ≤ R((θ, ψ), δ) ∀(θ, ψ), and
R((θ0, ψ0), δ′) < R((θ0, ψ0), δ) for some (θ0, ψ0).
In particular, δ′ dominates δ within the submodel Pψ0. But this contradicts the assumption that δ is admissible for estimating θ under Pψ0. Therefore, no such δ′ can exist, and so δ is admissible for P.
A corollary to this theorem is the admissibility of wθ 0 + (1 − w)X in the normal
model with unknown variance.
3 Admissibility of unbiased normal mean estimators
Let X ∼ Np (θ, σ 2 I), θ ∈ Rp , σ 2 > 0.
For estimation of θ under squared error loss, we have shown that the linear shrinkage
estimator δ(x) = wθ 0 + (1 − w)x is
• inadmissible if w ∉ [0, 1],
• admissible if w ∈ (0, 1).
What remains to evaluate is the admissibility for w ∈ {0, 1}. Admissibility for w = 1
is easy to show - the estimator δ1θ0 (x) = θ 0 beats everything at θ 0 and so can’t be
dominated. The last and most interesting case is that of w = 0, i.e. δ0 (X) = X, the
unbiased MLE and UMVUE.
Blyth’s method
Recall Blyth’s method for showing admissibility using a limiting Bayes argument:
Theorem 3 (LC 5.7.13). Suppose Θ ⊂ Rp is open, and that R(θ, δ) is continuous in
θ for all δ ∈ D. Let δ be an estimator and {πn } be a sequence of measures such that
for any open ball B ⊂ Θ,
[R(πn, δ) − R(πn, δπn)] / πn(B) → 0 as n → ∞.
Then δ is admissible.
Let’s try to use this to show admissibility of δ0 (X) = X in the normal means
problem. We begin with the case that σ 2 = 1 is known.
X ∼ Np(θ, I)
θ ∼ Np(0, τ² I)
{θ|X} ∼ Np( [τ²/(1 + τ²)] X, [τ²/(1 + τ²)] I ).
The unique Bayes estimator is δτ²(X) = E[θ|X] = [τ²/(1 + τ²)] X ≡ (1 − w)X.
To apply the theorem, we need to compute the Bayes risk of δτ 2 and X under the
Np(0, τ² I) prior πτ². The loss we will use is "scaled" squared error loss, L(θ, d) = Σⱼ (θj − dj)²/p. Because the risk is the average of the individual MSEs, the Bayes
risk is just the average of the Bayes risks from the p components,
R(θ, δ) = E[ Σ_{j=1}^p (θj − δj)² ]/p = Σ_{j=1}^p E[(θj − δj)²]/p,
and so calculating the Bayes risk is similar to calculating the risk in the p = 1
problem. For δτ 2 (x) = ax, where a = 1 − w = τ 2 /(1 + τ 2 ), we have
E[(aX − θ)²] = E[(a(X − θ) − (1 − a)θ)²]
= a² E[(X − θ)²] − 2a(1 − a) E[(X − θ)θ] + (1 − a)² E[θ²]
= a² + (1 − a)² τ²
= τ⁴/(1 + τ²)² + τ²/(1 + τ²)² = τ²/(1 + τ²),
where the cross term vanishes because E[(X − θ)θ] = E[θ E[X − θ|θ]] = 0.
A more intuitive way to calculate this makes use of the fact that δτ 2 (X) = E[θ|X],
so
E[(θ − δτ²)²] = E[(θ − E[θ|X])²]
= Ex[ Eθ|x[(θ − E[θ|X])²] ]
= Ex[ Var[θ|X] ]
= Ex[ τ²/(1 + τ²) ] = τ²/(1 + τ²).
Similarly,
E[(X − θ)2 ] = Eθ [Ex|θ [(X − θ)2 ]] = Eθ [1] = 1.
So R(πτ², δτ²) = τ²/(1 + τ²) and R(πτ², X) = 1. Returning to the p-variate case, since the Bayes risk is the arithmetic average of the risks for each of the p components of θ, we have
R(πτ², δτ²) = τ²/(1 + τ²),
R(πτ², X) = 1.
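These two Bayes risks are easy to confirm by Monte Carlo (a sketch; the choices of p, τ², and simulation size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
p, tau2, nsim = 5, 2.0, 200_000

theta = rng.normal(scale=np.sqrt(tau2), size=(nsim, p))  # theta ~ N_p(0, tau2 I)
x = theta + rng.normal(size=(nsim, p))                   # X | theta ~ N_p(theta, I)
delta = (tau2 / (1 + tau2)) * x                          # Bayes estimator delta_{tau^2}

scaled_loss = lambda d: np.mean(np.sum((d - theta) ** 2, axis=1) / p)
print(scaled_loss(delta), tau2 / (1 + tau2))  # Bayes risk of the Bayes estimator
print(scaled_loss(x), 1.0)                    # Bayes risk of X
```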
Note that
δτ²(X) = [τ²/(1 + τ²)] X ↑ X as τ² ↑ ∞, and
R(πτ², δτ²) ↑ R(πτ², X) as τ² ↑ ∞,
so X is a “limiting Bayes” estimator, for which the risk difference from the Bayes
estimator converges to zero. This is promising - let’s now apply the theorem. Letting
B be any open finite ball in Rp , we need to see if the following limit is zero:
lim_{τ²→∞} [R(πτ², X) − R(πτ², δτ²)] / πτ²(B) = lim_{τ²→∞} [1 − τ²/(1 + τ²)] / πτ²(B)
= lim_{τ²→∞} [(1 + τ²) πτ²(B)]^{−1}.
Now πτ²(B) → 0 as τ² → ∞ for any bounded set B. Therefore, the limit is zero only if
lim_{τ²→∞} τ² πτ²(B) = ∞.
We have
τ² πτ²(B) = τ² ∫_B (2πτ²)^{−p/2} exp(−||θ||²/[2τ²]) dθ
= (2π)^{−p/2} × (τ²)^{1−p/2} × ∫_B exp(−||θ||²/[2τ²]) dθ
≡ (2π)^{−p/2} × (a) × (b).
Now take the limit as τ² → ∞:
(b) → Vol(B) as τ² → ∞, and
(a) → ∞ if p = 1, → 1 if p = 2, → 0 if p > 2.
Therefore, the desired limit is achieved for p = 1 but not for p > 1.
• By the theorem, X is admissible for θ when p = 1.
• For p > 1, this method of showing admissibility does not work.
– For p = 2, X can be shown to be admissible using Blyth’s method with
non-normal priors (see LC exercise 5.4.5).
– For p > 2, X can’t be shown to be admissible because it isn’t.
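The behaviour of τ²πτ²(B) that drives this trichotomy can be checked numerically. For a ball B of radius r centered at the origin, πτ²(B) = P(χ²_p ≤ r²/τ²), so the quantity is easy to evaluate (a sketch; the radius and the grid of τ² values are arbitrary):

```python
import numpy as np
from scipy.stats import chi2

r = 1.0  # radius of the ball B, here centered at the origin
for p in (1, 2, 3):
    for tau2 in (1e2, 1e4, 1e6):
        prob = chi2.cdf(r ** 2 / tau2, df=p)  # pi_{tau^2}(B) = P(||theta|| <= r)
        print(p, tau2, tau2 * prob)
# tau2 * pi(B) grows without bound for p = 1, levels off at a constant for p = 2,
# and tends to zero for p = 3, matching the limits derived above.
```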
Interpreting the failure of Blyth’s method:
The admissibility conditions for Blyth's method derived from consideration of the existence of an estimator δ that dominates X. If such an estimator exists, then by continuity of risks
∃ ε > 0 and an open ball B ⊂ Θ such that R(θ, X) − R(θ, δ) > ε ∀θ ∈ B,
which implies for each prior πk that
R(πk, X) − R(πk, δ) = ∫ [R(θ, X) − R(θ, δ)] πk(dθ) ≥ ∫_B [R(θ, X) − R(θ, δ)] πk(dθ) ≥ ε πk(B).
Comparing to the Bayes risk of the Bayes estimator δk under πk then gives
R(πk, X) − R(πk, δk) ≥ R(πk, X) − R(πk, δ) ≥ ε πk(B) ∀k,
as δk has Bayes risk less than or equal to that of δ. Could such a δ exist?
Could exist: Suppose B is a ball such that πk (B) goes to zero very fast. Then
an estimator (like X) can have a good limiting Bayes risk and still do poorly on B.
This allows for the possibility of domination by another estimator that does better
on B.
Couldn’t exist: On the other hand, if R(πk , X) − R(πk , δk ) goes to zero very fast
(e.g. faster than the probability of any ball B), then in a sense X would have to
be doing well everywhere, and would not be able to be dominated - this is Blyth’s
method for showing admissibility.
What fails in the admissibility proof for the normal means problem is that for p > 2,
the probability πk (B) of an open ball B is going to zero much faster than the Bayes
risk difference, leaving a large enough “gap” for some other estimator to do better.
4 Motivating the James-Stein estimator
Stein [1956] showed that X is inadmissible for θ in the normal means problem when p > 2. This was surprising, as X is the MLE and UMVUE for θ. In this section, we motivate the James-Stein estimator, which dominates X when p > 2.
4.1 What is wrong with X?
For large p,
• X may be close to θ, but
• X · X = ||X||2 may be far from θ · θ = ||θ||2 .
If X ∼ Np(θ, I),
E[||X||²] = E[ Σ_{j=1}^p Xj² ] = Σ_{j=1}^p (θj² + 1) = ||θ||² + p,
so for large p, the magnitude of the estimator vector X is expected to be much larger
than the magnitude of the estimand vector θ. More insight can be gained as follows:
Note that every vector x can be expressed as
x = sθ + r, for some s ∈ R and r : θ · r = 0.
Here, the random variable s is the magnitude of the projection of x in the direction of
θ, and r is the residual vector. Using this decomposition, we can write the squared-error loss of ax for estimating θ as
||ax − θ||2 = (ax − θ) · (ax − θ)
= ((as − 1)θ + ar) · ((as − 1)θ + ar)
= (as − 1)2 ||θ||2 + a2 ||r||2
Now consider replacing x with X ∼ Np (θ, I). The random-variable version of the
above equation is then
||aX − θ||2 = (aS − 1)2 ||θ||2 + a2 ||R||2
Exercise: Show that
• S ∼ N(1, ||θ||⁻²),
• ||R||² ∼ χ²_{p−1},
• S and R are independent.
Now imagine a situation where p is growing but ||θ||2 remains fixed. The distribution
of (aS − 1)²||θ||² remains fixed whereas the distribution of a²||R||² blows up. This suggests that if we think ||θ||²/p is small we should use an estimator like aX with a < 1 to control the error that comes from ||R||². But what should the value of a be?
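The distributional claims in the exercise can be spot-checked by simulation (a sketch; θ and the sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
p, nsim = 20, 100_000
theta = np.ones(p)                             # any fixed theta; here ||theta||^2 = p
x = theta + rng.normal(size=(nsim, p))         # X ~ N_p(theta, I)

s = x @ theta / (theta @ theta)                # coefficient in x = s*theta + r
r = x - s[:, None] * theta                     # residual, orthogonal to theta
print(s.mean(), s.var(), 1 / (theta @ theta))  # S has mean 1 and variance ||theta||^{-2}
print((r ** 2).sum(axis=1).mean(), p - 1)      # ||R||^2 has mean p - 1 (chi^2_{p-1})
```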
4.2 An oracle estimator
Question: Among estimators aX : a ∈ [0, 1], which has the smallest risk?
Solution:
E[||aX − θ||2 ] = E[||(aX − aθ) − (1 − a)θ||2 ]
= a2 p + (1 − a)2 ||θ||2 .
Taking derivatives, the minimizing value ã of a satisfies
2ãp − 2(1 − ã)||θ||² = 0,
ã/(1 − ã) = ||θ||²/p,
ã = ||θ||²/(||θ||² + p).
Thus the optimal shrinkage “estimator” is given by δã (X) = ãX. This is not really
an estimator in the usual sense, because the ideal degree of shrinkage ã depends on
θ. For this reason, ãX is sometimes called an “oracle estimator:” You would need
an oracle to tell you the value of ||θ||2 before you could use it. Note that the risk of
this estimator is
||θ||4 p + p2 ||θ||2
(||θ||2 + p)2
p||θ||2 (||θ||2 + p)
=
(||θ||2 + p)2
p||θ||2
=
||θ||2 + p
E[||aX − θ||2 ] =
and so
E[||aX − θ||2 ] = p
||θ||2
< p = E[||X − θ||2 ].
||θ||2 + p
The risk differential is large if ||θ||2 is small compared to p.
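A short simulation confirms the oracle risk formula (a sketch; θ is an arbitrary choice with ||θ||² small relative to p):

```python
import numpy as np

rng = np.random.default_rng(3)
p, nsim = 10, 200_000
theta = np.full(p, 0.5)                        # ||theta||^2 = 2.5, small compared to p = 10
x = theta + rng.normal(size=(nsim, p))

a = theta @ theta / (theta @ theta + p)        # oracle shrinkage factor (uses theta itself)
risk = lambda est: np.mean(np.sum((est - theta) ** 2, axis=1))
print(risk(a * x), p * (theta @ theta) / (theta @ theta + p))  # simulated vs formula
print(risk(x), p)                                              # risk of X is p
```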
4.3 Adaptive shrinkage estimation
As shown above, the optimal amount of shrinkage ã is
ã = ||θ||²/(||θ||² + p) = (||θ||²/p) / (||θ||²/p + 1).
Note that ||θ||²/p is the variability of the θj values around zero. Can this variability
be estimated? Consider the following hierarchical model:
Xj = θj + εj,
ε1, . . . , εp ∼ i.i.d. N(0, 1),
θ1, . . . , θp ∼ i.i.d. N(0, τ²).
If you’d like to connect this with some actual inference problem, imagine that each
Xj is the sample mean or t-statistic calculated from observations from experiment j,
with population mean θj .
Suppose you believed this model and knew the value of τ². If you were interested in finding an estimator δ(X) that minimized the expected squared error ||θ − δ(X)||²
under repeated sampling of
• θ1 , . . . , θp , followed by sampling of
• X1 , . . . , X p ,
you would want to come up with an estimator δ(X) that minimized
E[||θ − δ(X)||²] = ∫∫ ||θ − δ(x)||² p(dx|θ) p(dθ).
Exercise: Show that δτ²(X) = [τ²/(1 + τ²)] X minimizes this expected loss.
If we knew τ², then the estimator to use would be [τ²/(1 + τ²)] X. We generally don't know τ², but maybe it can be estimated from the data. Under the above model,
X = θ + ε,
θ ∼ Np(0, τ² I), ε ∼ Np(0, I), Cov(θ, ε) = 0.
This means that the distribution of X marginalized over θ is
X ∼ Np (0, (τ 2 + 1)I).
An unbiased estimator of τ 2 + 1 is clearly ||X||2 /p, so an unbiased estimator of τ 2 is
τ̂² = (||X||² − p)/p.
However, we were interested in estimating τ 2 /(τ 2 + 1), not τ 2 . If p > 2, you can use
the fact that ||X||² ∼ gamma(p/2, 1/[2(τ² + 1)]) to show that
E[||X||⁻²] = [(p − 2)(τ² + 1)]⁻¹,
and so
E[(p − 2)/||X||²] = 1/(τ² + 1), and E[1 − (p − 2)/||X||²] = τ²/(τ² + 1).
Again, [τ²/(τ² + 1)] X would be the optimal estimator in this hierarchical model if we knew τ². If we don't know τ², we might instead consider using
δJS(X) = ( estimate of τ²/(τ² + 1) ) X = (1 − (p − 2)/||X||²) X.
This estimator is called the James-Stein estimator. As we will see, it has many
interesting properties:
• For large p in the hierarchical normal model, it is almost as good as the oracle estimator [τ²/(τ² + 1)] X:
E_{X,θ}[||θ − δJS||²] ≈ E_{X,θ}[||θ − [τ²/(τ² + 1)] X||²].
• Even if the hierarchical normal model isn't correct, it is still almost as good as the oracle estimator ãX in the normal means model:
E_{X|θ}[||θ − δJS||²] ≈ E_{X|θ}[||θ − ãX||²].
• In the normal means problem, this estimator dominates the unbiased estimator X if p > 2:
E_{X|θ}[||θ − δJS||²] < E_{X|θ}[||θ − X||²] ∀θ.
We will show this last inequality first.
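Before doing so, here is a small simulation sketch comparing the three estimators in the normal means model (the settings and the helper function are my own, not from the notes); the scaled risks should come out roughly in the order oracle ≤ James-Stein < X:

```python
import numpy as np

def james_stein(x):
    """delta_JS(x) = (1 - (p - 2)/(x.x)) x, applied to each row of x."""
    shrink = 1 - (x.shape[-1] - 2) / np.sum(x * x, axis=-1, keepdims=True)
    return shrink * x

rng = np.random.default_rng(4)
p, nsim = 10, 200_000
theta = rng.normal(size=p)
x = theta + rng.normal(size=(nsim, p))                      # X ~ N_p(theta, I)

a = theta @ theta / (theta @ theta + p)                     # oracle factor (needs theta)
scaled_risk = lambda est: np.mean(np.sum((est - theta) ** 2, axis=1)) / p
print("X      :", scaled_risk(x))                           # approximately 1
print("JS     :", scaled_risk(james_stein(x)))              # below 1 for p > 2
print("oracle :", scaled_risk(a * x))                       # the best of the class {aX}
```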
5 Risk of δJS
We will show that δJS dominates X by showing that the risk R(θ, δJS ) = E[||δJS −
θ||2 ]/p is uniformly less than 1. This will not be done by computing its risk function
directly, but instead by showing that 1 is an upper bound on the risk. This bound will
be obtained via an identity that has applications beyond the calculation of R(θ, δJS ).
5.1 Risk bound for δJS
We can write the James-Stein estimator as
δJS = ( estimate of τ²/(1 + τ²) ) x = (1 − (p − 2)/(x·x)) x
= x − [(p − 2)/(x·x)] x
≡ x − g(x).
Under X ∼ Np (θ, I),
E[||δJS − θ||²] = E[||X − g(X) − θ||²]
= E[||(X − θ) − g(X)||²]
= E[||X − θ||²] + E[||g(X)||²] − 2E[(X − θ)·g(X)],
where all expectations are with respect to the distribution of X given θ. The first
expectation is p and the second is
E[||g(X)||²] = E[(p − 2)² (X·X)/(X·X)²] = (p − 2)² E[1/(X·X)].
The third expectation is more complicated, but in the next subsection we’ll derive
an identity (Stein’s identity) for computing E[(X − θ) · g(X)] that is applicable for
arbitrary functions g. Stein's identity as applied to g(x) = [(p − 2)/(x·x)] x gives
E[(X − θ)·g(X)] = E[(p − 2)²/(X·X)].
Using this for the above risk calculation gives
E[||δJS − θ||²] = p + (p − 2)² E[1/(X·X)] − 2(p − 2)² E[1/(X·X)]
= p − (p − 2)² E[1/(X·X)].
Note that we haven’t actually calculated the risk of δJS is closed form - our formula
depends on the expectation of 1/X·X, which is an inverse-moment of a noncentral χ2
distribution where the noncentrality parameter depends on θ. However, computing
this moment is not necessary to show that δJS dominates X: Since
1/(x·x) > 0 ∀x ∈ Rp, we have
E[||δJS − θ||²] = p − (p − 2)² E[1/(X·X)] < p = E[||X − θ||²].
Since the expectation of (X · X)−1 is complicated, further study of the risk of δJS is
often achieved via a study of its unbiased risk estimate. From the above calculation,
we see that
E[||δJS − θ||²] = E[ p − (p − 2)²/(X·X) ],
and so p − (p − 2)²/(X·X) can be said to be an unbiased estimate of the risk of δJS.
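The unbiasedness of this risk estimate is easy to check by simulation (a sketch; settings arbitrary): averaging p − (p − 2)²/(X·X) over draws of X should agree with the Monte Carlo estimate of E[||δJS − θ||²].

```python
import numpy as np

rng = np.random.default_rng(5)
p, nsim = 10, 200_000
theta = rng.normal(size=p)
x = theta + rng.normal(size=(nsim, p))
xx = np.sum(x * x, axis=1)

js = (1 - (p - 2) / xx)[:, None] * x
loss = np.sum((js - theta) ** 2, axis=1)   # realized loss of delta_JS for each draw
sure = p - (p - 2) ** 2 / xx               # unbiased risk estimate for each draw
print(loss.mean(), sure.mean())            # both approximate E[||delta_JS - theta||^2]
```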
Stein’s identity
We start with a univariate version of the identity:
Lemma 4 (Stein’s identity). Let X ∼ N (µ, σ 2 ) and let g(x) be such that E[|g 0 |] < ∞.
Then
E[g(X)(X − µ)] = σ 2 E[g 0 (X)].
Proof.
The proof follows from Fubini's theorem and a bit of calculus. Letting p(x) = φ([x − µ]/σ)/σ, note that p′(x) = −[(x − µ)/σ²] p(x). By the fundamental theorem of calculus,
p(x) = ∫_{−∞}^x −[(y − µ)/σ²] p(y) dy = ∫_x^∞ [(y − µ)/σ²] p(y) dy.
The expectation we wish to calculate is
E[g′(X)] = ∫_{−∞}^∞ g′(x) p(x) dx = ∫_{−∞}^0 g′(x) p(x) dx + ∫_0^∞ g′(x) p(x) dx.
Doing the first part, we have
∫_0^∞ g′(x) p(x) dx = ∫_0^∞ g′(x) ∫_x^∞ [(y − µ)/σ²] p(y) dy dx
= ∫∫_{0<x<y<∞} g′(x) [(y − µ)/σ²] p(y) dy dx
= ∫_0^∞ [(y − µ)/σ²] p(y) ∫_0^y g′(x) dx dy
= ∫_0^∞ [(y − µ)/σ²] p(y) [g(y) − g(0)] dy
= E[(g(X) − g(0)) [(X − µ)/σ²] 1(X > 0)].
Similarly, the second part is
∫_{−∞}^0 g′(x) p(x) dx = E[(g(X) − g(0)) [(X − µ)/σ²] 1(X < 0)].
Adding the two parts gives
E[g′(X)] = E[(g(X) − g(0)) (X − µ)/σ²]
= E[g(X)(X − µ)/σ²] − g(0) E[(X − µ)/σ²] = E[g(X)(X − µ)/σ²],
and so
E[g(X)(X − µ)] = σ² E[g′(X)].
Stein’s lemma is often alternatively proven with integration by parts. These proofs
go roughly as follows: As before,
p(x) = (2πσ²)^{−1/2} exp{−(x − µ)²/[2σ²]},
dp(x)/dx = −(1/σ²)(x − µ) p(x).
E[g′(X)] = ∫_{−∞}^∞ g′(x) p(x) dx
= lim_{c→∞} ∫_{−c}^c g′(x) p(x) dx
≡ lim_{c→∞} ∫_{−c}^c v′(x) u(x) dx
= lim_{c→∞} [ u(x)v(x)|_{−c}^c − ∫_{−c}^c v(x) u′(x) dx ]
= lim_{c→∞} [ g(x)p(x)|_{−c}^c − ∫_{−c}^c g(x) [−(x − µ)/σ²] p(x) dx ]
= E[g(X)(X − µ)/σ²] + lim_{c→∞} g(x)p(x)|_{−c}^c,
where u(x) = p(x) and v(x) = g(x).
To complete the proof we have to show that the last limit is zero. This is straightforward to show if p(x) decreases monotonically in |x|:
Lemma 5. Let p(x) be decreasing to zero in |x| and let E[|g′(X)|] < ∞. Then g(x)p(x) → 0 as x → ±∞.
Proof. Given ε > 0, there exists K such that
∫_K^∞ |g′(x)| p(x) dx < ε/3.
Then for any t sufficiently large,
1. p(t) < p(K) ( ∫_0^K |g′(x)| p(x) dx )^{−1} × ε/3, and
2. p(t) |g(0)| < ε/3.
From this, we have
|g(t)p(t)| = p(t) |g(0) + ∫_0^t g′(x) dx|
= p(t) |g(0) + ∫_0^K g′(x) dx + ∫_K^t g′(x) dx|
≤ p(t)|g(0)| + p(t) ∫_0^K |g′(x)| dx + p(t) ∫_K^t |g′(x)| dx
≤ p(t)|g(0)| + [p(t)/p(K)] ∫_0^K |g′(x)| p(x) dx + ∫_K^t |g′(x)| p(x) dx
≤ ε/3 + ε/3 + ε/3 = ε,
where the second-to-last inequality holds because p(K) ≤ p(x) on x ∈ (0, K) and p(t) ≤ p(x) on x ∈ (K, t), due to p(x) being monotonically decreasing.
This identity generalizes to other exponential families. See LC Theorem 1.5.15.
For computing the risk of a vector-valued function g : Rp → Rp , we will need a
multivariate version of the above identity.
Lemma 6 (Stein’s identity, multivariate version). Let X ∼ Np (µ, σ 2 I) and let g(x) :
Rp → Rp such that E[|dgi /dxi |] < ∞. Then
E[(X − µ) · g(X)] = σ 2 E[∇ · g(X)]
where ∇ · g =
Pp
j=1
dgj (x)/dxj .
Proof. This is just a corollary of the univariate version:
E[dgp(X)/dxp] = ∫_{x_{−p}} [ ∫_{xp} (dgp(x)/dxp) p(xp) dxp ] Π_{j=1}^{p−1} p(xj) dxj
= ∫_{x_{−p}} ( E_{xp}[gp(X)(Xp − µp)] / σ² ) Π_{j=1}^{p−1} p(xj) dxj
= (1/σ²) E[gp(X)(Xp − µp)],
and similarly for each other j ∈ {1, . . . , p}. Therefore
E[∇·g(X)] = Σ_j E[dgj(X)/dxj]
= Σ_j E[gj(X)(Xj − µj)]/σ²
= E[g(X)·(X − µ)]/σ².
Now we are in a position to apply the lemma to obtain the unbiased risk estimator of
δJS. Recall we needed to calculate E[(X − θ)·g(X)], where g(x) = [(p − 2)/(x·x)] x. Applying the lemma, we have
g(x) = (p − 2) ( x1/(x·x), . . . , xp/(x·x) ),
∇·g(x) = (p − 2) Σ_j [x·x − 2xj²]/(x·x)²
= [(p − 2)/(x·x)²] [p(x·x) − 2(x·x)]
= (p − 2)²/(x·x),
and so
E[(X − θ)·g(X)] = E[(p − 2)²/(X·X)],
as we used above.
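The divergence formula ∇·g(x) = (p − 2)²/(x·x) can be spot-checked with finite differences (a sketch; the dimension and test point are arbitrary):

```python
import numpy as np

def g(x):
    """g(x) = (p - 2) x / (x.x)."""
    return (x.size - 2) * x / (x @ x)

def divergence_fd(fun, x, h=1e-6):
    """Central finite-difference approximation of sum_j d f_j(x) / d x_j."""
    total = 0.0
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = h
        total += (fun(x + e)[j] - fun(x - e)[j]) / (2 * h)
    return total

rng = np.random.default_rng(6)
x = rng.normal(size=7)
print(divergence_fd(g, x), (x.size - 2) ** 2 / (x @ x))  # the two values should agree closely
```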
6 Some oracle inequalities

6.1 A simple oracle inequality
Recall that if we knew ||θ||2 , the optimal estimator in the class {δa (x) = ax : a ∈
[0, 1]} would be δã where
ã = (||θ||²/p) / (||θ||²/p + 1).
We also showed that the risk of this estimator is
R(θ, ãX) = E[||ãX − θ||²]/p = ||θ||²/(||θ||² + p) < 1.
Not surprisingly, it turns out that
R(θ, ãX) ≤ R(θ, δ JS )
(use the fact that X · X has a noncentral χ2 distribution, or condition on X · X).
But how much worse is δJS than the oracle estimator δã ? Recall the risk of δ JS is
R(θ, δJS) = 1 − [(p − 2)²/p] E[(X·X)⁻¹].
Since 1/x is convex, Jensen’s inequality gives
E[1/(X·X)] ≥ 1/E[X·X] = 1/(||θ||² + p),
and so
R(θ, δJS) ≤ 1 − [(p − 2)²/p] · 1/(||θ||² + p),
p ( R(θ, δJS) − R(θ, δã) ) ≤ p − (p − 2)²/(||θ||² + p) − p||θ||²/(||θ||² + p)
= [ p||θ||² + p² − p² + 4p − 4 − ||θ||²p ] / (||θ||² + p)
= 4(p − 1)/(||θ||² + p)
≤ 4(p − 1)/p ≤ 4,
and so
R(θ, δã) ≤ R(θ, δJS) ≤ R(θ, δã) + 4/p.
Additional work can get you to
R(θ, δã) ≤ R(θ, δJS) ≤ R(θ, δã) + 2/p.
For more on this, see Johnstone [2002] or Candès [2006].
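The bound can also be tabulated directly from the two closed-form expressions above (a sketch; the grid of ||θ||² values is arbitrary): the Jensen upper bound on R(θ, δJS) never exceeds the oracle risk by more than 4/p.

```python
p = 10
for theta2 in (0.1, 1.0, 10.0, 100.0):                  # values of ||theta||^2
    r_oracle = theta2 / (theta2 + p)                    # R(theta, a~ X)
    r_js_bound = 1 - (p - 2) ** 2 / (p * (theta2 + p))  # Jensen upper bound on R(theta, d_JS)
    print(theta2, r_oracle, r_js_bound, r_js_bound - r_oracle <= 4 / p)
```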
7 Unknown variance or covariance
Suppose X ∼ Np (θ, σ 2 I), where σ 2 is known. Letting
• X̃ = X/σ and
• θ̃ = θ/σ,
we have X̃ ∼ Np (θ̃, I). The James-Stein estimator δ̃ JS of θ̃ = θ/σ is then
δ̃JS(X̃) = (1 − (p − 2)/(X̃·X̃)) X̃ = (1 − σ²(p − 2)/(X·X)) X/σ.
It seems natural then that the JSE of θ should be σ times the JSE of θ̃ = θ/σ, giving
δJS(X) = (1 − σ²(p − 2)/(X·X)) X.
Of course, often σ² is not known. Is there a version of the JSE in this case? Yes, if
you have information about σ 2 . Consider the following hierarchical sampling scheme:
Xi,j = θj + εi,j, i = 1, . . . , n, j = 1, . . . , p,
{εi,j} ∼ i.i.d. N(0, σ²).
Letting Xj = X̄·j = Σ_{i=1}^n Xi,j / n, we now basically have the situation described
above.
Also note that the data contain information about σ 2 via the pooled sample sum of
squares:
S = Σ_{j=1}^p Σ_{i=1}^n (Xi,j − X̄·j)² ∼ σ² χ²_{p(n−1)}.
Note further that X and S are statistically independent. For this and similar situations, James and Stein [1961] considered estimators of the following form:
Let X ∼ Np (θ, σ 2 I) be independent of S ∼ σ 2 χ2k . Define the estimator
δc(X, S) = (1 − cS/(X·X)) X.
The value of c that minimizes the risk of δc for all θ is c = (p − 2)/(k + 2), resulting
in the following estimator:
δJS(X, S) = (1 − [(p − 2)/(k + 2)] S/(X·X)) X.
In particular, note that this estimator dominates X.
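A minimal simulation sketch of this estimator, drawing X and S directly from the model just described (all settings are arbitrary, and σ² is of course not used by the estimator itself):

```python
import numpy as np

rng = np.random.default_rng(7)
p, k, sigma2, nsim = 10, 30, 4.0, 200_000
theta = rng.normal(size=p)

x = theta + np.sqrt(sigma2) * rng.normal(size=(nsim, p))  # X ~ N_p(theta, sigma2 I)
s = sigma2 * rng.chisquare(k, size=nsim)                  # S ~ sigma2 chi^2_k, independent of X

c = (p - 2) / (k + 2)
js = (1 - c * s / np.sum(x * x, axis=1))[:, None] * x     # delta_JS(X, S)

risk = lambda est: np.mean(np.sum((est - theta) ** 2, axis=1))
print(risk(js), risk(x))   # delta_JS(X, S) should show the smaller estimated risk
```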
This result also generalizes to the correlated data case: Let X ∼ Np (θ, Σ) and
S ∼ Wishart(Σ, k) be independent. Consider estimators of the form
δc(X, S) = (1 − c/(X^T S^{−1} X)) X.
James and Stein [1961] show that the estimator obtained by setting
c = (p − 2)/(k − p + 3)
minimizes the risk for all values of θ.
See Brown and Han for recent work on these and related problems, including extensions to regression problems.
References
L.D. Brown and X. Han. Optimal estimation of multidimensional normal means with
an unknown variance.
Emmanuel J. Candès. Modern statistical estimation via oracle inequalities. Acta Numer., 15:257–325, 2006. ISSN 0962-4929. doi: 10.1017/S0962492906230010.
W. James and Charles Stein. Estimation with quadratic loss. In Proc. 4th Berkeley
Sympos. Math. Statist. and Prob., Vol. I, pages 361–379. Univ. California Press,
Berkeley, Calif., 1961.
I. M. Johnstone. Function estimation and Gaussian sequence models. Unpublished manuscript, 2002.
E. L. Lehmann and George Casella. Theory of point estimation. Springer Texts in
Statistics. Springer-Verlag, New York, second edition, 1998. ISBN 0-387-98502-6.
Charles Stein. Inadmissibility of the usual estimator for the mean of a multivariate
normal distribution. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, 1954–1955, vol. I, pages 197–206, Berkeley and
Los Angeles, 1956. University of California Press.