The horseshoe estimator for sparse signals

CARLOS M. CARVALHO
NICHOLAS G. POLSON
JAMES G. SCOTT
Biometrika (2010)
Presented by Eric Wang
10/14/2010
Overview
• This paper proposes the horseshoe estimator, an analytically tractable approach to sparse estimation that is more robust and adaptive to different sparsity patterns than existing approaches.
• Two theorems are proved, characterizing the proposed estimator's tail robustness and demonstrating a super-efficient rate of convergence to the correct estimate of the sampling density in sparse situations.
• The proposed estimator's performance is demonstrated using both real and simulated data. The authors show that its answers correspond quite closely to those obtained by Bayesian model averaging.
The horseshoe estimator
• Consider a p-dimensional vector y with y_i | θ_i ~ N(θ_i, σ²), where θ = (θ_1, …, θ_p) is sparse. The authors propose the following model for estimation and prediction:
  θ_i | λ_i ~ N(0, λ_i²),   λ_i | τ ~ C⁺(0, τ),
where C⁺(0, a) is a standard half-Cauchy distribution on the positive reals with location 0 and scale parameter a.
• The name horseshoe prior arises from the observation that, for fixed values τ² = σ² = 1, where κ_i = 1/(1 + λ_i²) and κ_i is the amount of shrinkage toward zero a posteriori, κ_i has a horseshoe-shaped prior, κ_i ~ Beta(1/2, 1/2).
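As a concrete illustration of this hierarchy, here is a minimal NumPy sketch (not from the paper) that draws θ and y under the assumptions σ = τ = 1:

import numpy as np

rng = np.random.default_rng(0)
p = 1000
sigma, tau = 1.0, 1.0        # noise scale and global shrinkage scale (assumed fixed)

# lambda_i ~ C+(0, tau): tau times the absolute value of a standard Cauchy draw
lam = tau * np.abs(rng.standard_cauchy(p))

theta = rng.normal(0.0, lam)          # theta_i | lambda_i ~ N(0, lambda_i^2)
y = rng.normal(theta, sigma)          # y_i | theta_i ~ N(theta_i, sigma^2)

# Most theta_i are tiny while a few are huge: the spike-at-zero, heavy-tailed
# behaviour that makes the prior suitable for sparse signals.
print(np.mean(np.abs(theta) < 0.1), np.max(np.abs(theta)))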
The horseshoe estimator
• The meaning of κ_i is as follows: κ_i ≈ 0 yields virtually no shrinkage and describes signals, while κ_i ≈ 1 yields near-total shrinkage and (hopefully) describes noise.
• [Figure: the horseshoe-shaped Beta(1/2, 1/2) prior density on the shrinkage coefficient κ_i.]
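This Beta(1/2, 1/2) shape is easy to verify by simulation; a small sketch (assuming τ = σ = 1, and using the closed-form Beta(1/2, 1/2) quantile function sin²(πu/2) for comparison):

import numpy as np

rng = np.random.default_rng(1)
n = 200_000

lam = np.abs(rng.standard_cauchy(n))      # lambda_i ~ C+(0, 1), i.e. tau = 1
kappa = 1.0 / (1.0 + lam ** 2)            # shrinkage coefficient kappa_i

# Compare empirical quantiles of kappa with Beta(1/2, 1/2) quantiles,
# which have the closed form sin^2(pi * u / 2).
u = np.linspace(0.05, 0.95, 10)
print(np.round(np.quantile(kappa, u), 3))
print(np.round(np.sin(np.pi * u / 2.0) ** 2, 3))   # the two rows agree closely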
The horseshoe density function
• The horseshoe prior density lacks an analytic form, but very tight bounds are available:
Theorem 1. The univariate horseshoe density p_HS(θ) (with τ = 1) satisfies the following:
(a) lim_{θ→0} p_HS(θ) = ∞;
(b) for θ ≠ 0,
  (K/2) log(1 + 4/θ²) < p_HS(θ) < K log(1 + 2/θ²),
where K = 1/√(2π³).
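The bounds can be checked numerically; in the sketch below (an illustration, not the paper's code) the horseshoe density is evaluated by integrating the normal scale mixture over the half-Cauchy mixing density, via the change of variables λ = tan(πu/2):

import numpy as np

def horseshoe_density(theta, n_grid=20_000):
    # p_HS(theta) = \int_0^1 N(theta | 0, lambda(u)^2) du, where lambda(u) = tan(pi*u/2)
    # is the quantile function of the half-Cauchy C+(0, 1) mixing distribution.
    u = np.linspace(1e-6, 1.0 - 1e-6, n_grid)
    lam = np.tan(np.pi * u / 2.0)
    dens = np.exp(-0.5 * (theta / lam) ** 2) / (np.sqrt(2.0 * np.pi) * lam)
    return np.sum(0.5 * (dens[1:] + dens[:-1]) * np.diff(u))   # trapezoidal rule

K = 1.0 / np.sqrt(2.0 * np.pi ** 3)
for theta in (0.5, 1.0, 2.0, 5.0):
    lower = (K / 2.0) * np.log1p(4.0 / theta ** 2)
    upper = K * np.log1p(2.0 / theta ** 2)
    p = horseshoe_density(theta)
    print(theta, round(lower, 4), round(p, 4), round(upper, 4))   # lower < p < upper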
• Alternatively, it is possible to integrate over the global scale τ instead, though the dependence this induces among the θ_i causes more issues. Therefore the authors do not take this approach.
Horseshoe estimator for sparse signals
Review of similar methods
• Scott & Berger (2006) studied the discrete mixture θ_i ~ w g(θ_i) + (1 − w) δ_0, where w is the prior inclusion probability and δ_0 is a point mass at zero.
• Tipping (2001) studied the Student-t prior, which is defined by an inverse-gamma mixing density, λ_i² ~ IG(a, b).
• The double-exponential prior (Bayesian lasso) has an exponential mixing density on λ_i².
Review of similar methods
• The normal-Jeffreys prior is an improper prior induced by placing Jeffreys' prior on each variance term, p(λ_i²) ∝ 1/λ_i², leading to p(θ_i) ∝ 1/|θ_i|. This choice is commonly used in the absence of a global scale parameter.
• The Strawderman-Berger prior does not have an analytic form, but arises from assuming κ_i ~ Beta(1/2, 1), with θ_i | κ_i ~ N(0, κ_i⁻¹ − 1).
• The normal-exponential-gamma family of priors generalizes the lasso specification, using a gamma prior to mix over the exponential rate parameter, leading to κ_i ~ Beta(1, c).
Review of similar methods
[Figure: the priors compared in terms of tail robustness (behaviour for large signals) and shrinkage of noise (behaviour near zero).]
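To give a numerical sense of the comparison this figure makes, the sketch below simulates the implied shrinkage coefficient κ_i = 1/(1 + λ_i²) under the horseshoe (τ = 1) and under an assumed unit-rate double-exponential prior:

import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# Horseshoe (tau = 1): lambda ~ C+(0, 1), so kappa ~ Beta(1/2, 1/2)
kappa_hs = 1.0 / (1.0 + rng.standard_cauchy(n) ** 2)

# Double-exponential (Laplace) with unit rate: lambda^2 ~ Exp(1/2)
kappa_de = 1.0 / (1.0 + rng.exponential(scale=2.0, size=n))

for name, k in (("horseshoe", kappa_hs), ("double-exponential", kappa_de)):
    # mass near kappa = 0 (signals left unshrunk) and near kappa = 1 (noise shrunk away)
    print(name, np.mean(k < 0.05), np.mean(k > 0.95))

The horseshoe places substantial mass near both κ = 0 and κ = 1, whereas the double-exponential places little mass near either extreme: it neither leaves large signals alone nor shrinks noise all the way toward zero.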
Robustness to large signals
• Theorem 2. Let p(y | θ) be the likelihood, and suppose that p(θ) is a zero-mean scale mixture of normals, θ | λ² ~ N(0, λ²), with λ² having proper prior p(λ²). Assume further that the likelihood and p(λ²) are such that the marginal density m(y) = ∫ p(y | θ) p(θ) dθ is finite for all y. Define the following three pseudo-densities, which may be improper. Then the posterior mean E(θ | y) can be expressed in terms of m(y) and these pseudo-densities.
Robustness to large signals
• If p(y | θ) is a Gaussian likelihood, then the result of Theorem 2 reduces to
  E(θ | y) = y + d/dy log m(y).
• A key consequence of Theorem 2 is that if the prior on θ is chosen so that the derivative of the log prior density vanishes for large |θ|, then the derivative of the log predictive density, d/dy log m(y), is bounded and tends to 0 for large |y|. This happens for heavy-tailed priors, including the proposed horseshoe prior. This yields
  E(θ | y) ≈ y for large |y|,
so very large signals are left essentially unshrunk.
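The vanishing score is easy to see numerically. The sketch below (assuming σ = τ = 1) computes m(y) by integrating θ out analytically and the half-Cauchy scale out by quadrature, then differentiates log m(y) by finite differences:

import numpy as np

def log_marginal_hs(y, n_grid=20_000):
    # m(y) = \int N(y | 0, 1 + lambda^2) p(lambda) dlambda with lambda ~ C+(0, 1),
    # written as an integral over u in (0, 1) via lambda(u) = tan(pi*u/2).
    u = np.linspace(1e-6, 1.0 - 1e-6, n_grid)
    var = 1.0 + np.tan(np.pi * u / 2.0) ** 2
    dens = np.exp(-0.5 * y ** 2 / var) / np.sqrt(2.0 * np.pi * var)
    return np.log(np.sum(0.5 * (dens[1:] + dens[:-1]) * np.diff(u)))

h = 1e-4
for y in (1.0, 3.0, 5.0, 10.0, 20.0):
    score = (log_marginal_hs(y + h) - log_marginal_hs(y - h)) / (2.0 * h)
    print(y, round(score, 4), round(y + score, 4))   # Tweedie: E(theta | y) = y + score
# The score shrinks toward 0 as y grows, so E(theta | y) -> y: big signals survive.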
The horseshoe score function
• Theorem 3. Suppose y | θ ~ N(θ, 1). Let m_τ(y) denote the predictive density under the horseshoe prior for known scale parameter τ, i.e. where θ | λ ~ N(0, λ²) and λ ~ C⁺(0, τ). Then
  |d/dy log m_τ(y)| ≤ b
for some constant b that depends upon τ, and
  lim_{|y|→∞} d/dy log m_τ(y) = 0.
• Corollary: lim_{|y|→∞} { E(θ | y) − y } = 0, so the horseshoe posterior mean leaves very large signals essentially unshrunk.
• Although the horseshoe prior has no analytic form, it does lead to an exact expression for the posterior mean,
  E(θ | y) = { 1 − E(κ | y) } y,
where E(κ | y) can be written in terms of Φ₁, a degenerate hypergeometric function of two variables.
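The same shrinkage weight is straightforward to approximate without special functions; a minimal sketch (assuming σ = τ = 1) uses self-normalized importance sampling with the half-Cauchy prior as the proposal:

import numpy as np

def horseshoe_posterior_mean(y, n_draws=200_000, rng=np.random.default_rng(3)):
    # Draw local scales from the prior, lambda ~ C+(0, 1), weight each draw by the
    # marginal likelihood N(y | 0, 1 + lambda^2), and average kappa = 1/(1 + lambda^2).
    lam = np.abs(rng.standard_cauchy(n_draws))
    var = 1.0 + lam ** 2
    w = np.exp(-0.5 * y ** 2 / var) / np.sqrt(var)    # unnormalized importance weights
    kappa_post = np.sum(w / var) / np.sum(w)          # E(kappa | y)
    return (1.0 - kappa_post) * y                     # E(theta | y) = {1 - E(kappa | y)} y

for y in (0.5, 1.0, 2.0, 5.0, 10.0):
    print(y, round(horseshoe_posterior_mean(y), 3))
# Small observations are shrunk hard toward zero; large ones are left nearly alone.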
Estimating τ
• The conditional posterior distribution of τ, given the local scales, admits a simple approximation when the dimensionality p is large.
• This gives an approximate sampling distribution for τ in terms of the λ_i.
• If most observations are shrunk toward 0, then τ will be small with high probability.
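For completeness, one common way to sample τ together with everything else is an auxiliary-variable Gibbs sampler in the style of Makalic & Schmidt (2016); the sketch below assumes σ = 1, uses the equivalent parameterization θ_i ~ N(0, λ_i²τ²) with λ_i ~ C⁺(0, 1) and τ ~ C⁺(0, 1), and is not necessarily the computational scheme used in the paper:

import numpy as np

def inv_gamma(rng, shape, scale):
    # Inverse-gamma draw(s) with the given shape and scale (elementwise over `scale`).
    return scale / rng.gamma(shape, size=np.shape(scale))

def horseshoe_gibbs(y, n_iter=2000, rng=np.random.default_rng(4)):
    # Normal means: y_i ~ N(theta_i, 1), theta_i ~ N(0, lam_i^2 * tau^2),
    # lam_i ~ C+(0, 1), tau ~ C+(0, 1); the half-Cauchy priors are represented
    # with inverse-gamma auxiliary variables nu_i and xi.
    p = len(y)
    lam2, tau2, nu, xi = np.ones(p), 1.0, np.ones(p), 1.0
    theta_sum = np.zeros(p)
    for _ in range(n_iter):                      # (no burn-in handling, for brevity)
        s2 = 1.0 / (1.0 + 1.0 / (lam2 * tau2))   # theta_i | rest ~ N(s2 * y_i, s2)
        theta = rng.normal(s2 * y, np.sqrt(s2))
        lam2 = inv_gamma(rng, 1.0, 1.0 / nu + theta ** 2 / (2.0 * tau2))
        nu = inv_gamma(rng, 1.0, 1.0 + 1.0 / lam2)
        tau2 = inv_gamma(rng, (p + 1) / 2.0, 1.0 / xi + np.sum(theta ** 2 / lam2) / 2.0)
        xi = inv_gamma(rng, 1.0, 1.0 + 1.0 / tau2)
        theta_sum += theta
    return theta_sum / n_iter                    # posterior-mean estimate of theta

y = np.array([0.1, -0.3, 0.2, 8.0, -6.5, 0.05])
print(np.round(horseshoe_gibbs(y), 2))           # small entries shrunk, large ones kept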
Comparison to double exponential
Super-efficient convergence
• Theorem 4. Suppose the true sampling model is y_i ~ N(θ_0, 1). Then:
(1) For the predictive density under the horseshoe prior, the optimal rate of convergence to the true sampling density when θ_0 = 0 is strictly faster than when θ_0 ≠ 0 (the improvement involves a constant b).
(2) Suppose q(θ) is any other prior density that is continuous, bounded above, and strictly positive on a neighborhood of the true value θ_0. For the predictive density under q, the optimal rate of convergence, regardless of θ_0, is only the slower (θ_0 ≠ 0) rate from part (1): the horseshoe's unbounded spike at zero is what buys the super-efficient rate at the sparse value θ_0 = 0.
Example - simulated data
• Data generated from a sparse normal-means model, y_i ~ N(θ_i, σ²), with most of the θ_i equal to zero.
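A small end-to-end version of such an experiment (a sketch with assumed settings: p = 200 means, 10 nonzero signals, σ = τ = 1, and coordinate-wise posterior means computed by importance sampling over the local scales):

import numpy as np

rng = np.random.default_rng(5)
p, n_signals = 200, 10

theta = np.zeros(p)
theta[:n_signals] = rng.normal(0.0, 5.0, n_signals)   # a few large signals
y = rng.normal(theta, 1.0)                            # noisy observations, sigma = 1

# Coordinate-wise horseshoe posterior means via self-normalized importance
# sampling over the half-Cauchy local scales (tau = 1).
lam = np.abs(rng.standard_cauchy(20_000))
var = 1.0 + lam ** 2
w = np.exp(-0.5 * y[:, None] ** 2 / var) / np.sqrt(var)
kappa_post = (w / var).sum(axis=1) / w.sum(axis=1)    # E(kappa_i | y_i)
theta_hat = (1.0 - kappa_post) * y

print("sum of squared errors, raw y        :", round(float(np.sum((y - theta) ** 2)), 1))
print("sum of squared errors, horseshoe fit:", round(float(np.sum((theta_hat - theta) ** 2)), 1))
# Noise coordinates are shrunk hard toward zero while the few large signals are
# left nearly untouched, which typically gives a much smaller total error.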
Example - Vanguard mutual-fund data
• Here, the authors show how the horseshoe can provide a
regularized estimate of a large covariance matrix whose
inverse may be sparse.
• Vanguard mutual funds dataset containing n = 86 weekly
returns for p = 59 funds.
• Suppose the observation matrix is Y (n × p), with each p-dimensional row vector drawn from a zero-mean Gaussian with covariance matrix Σ.
• We will model the Cholesky decomposition of Ω = Σ⁻¹.
Example - Vanguard mutual-fund data
• The goal is to estimate the ensemble of regression models in the implied triangular system
  y^(j) = Σ_{k<j} β_{jk} y^(k) + ε_j,   j = 2, …, p,
where y^(j) is the j-th column of Y.
• The regression coefficients are assumed to have a horseshoe prior, and posterior means were computed using MCMC.
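The sketch below sets up that triangular system from a data matrix and rebuilds the precision matrix as Ω = T'D⁻¹T; ordinary least squares stands in here for the horseshoe-regularized regressions actually used in the paper:

import numpy as np

def triangular_precision_estimate(Y):
    # Y is n x p with zero-mean rows.  Regress each column on all earlier columns,
    #   y^(j) = sum_{k<j} beta_{jk} y^(k) + eps_j,   eps_j ~ N(0, d_j),
    # collect -beta_{jk} in a unit lower-triangular T and the residual variances
    # in D, and return Omega = T' D^{-1} T as the estimate of Sigma^{-1}.
    p = Y.shape[1]
    T = np.eye(p)
    d = np.empty(p)
    d[0] = Y[:, 0].var()
    for j in range(1, p):
        X, yj = Y[:, :j], Y[:, j]
        beta, *_ = np.linalg.lstsq(X, yj, rcond=None)   # OLS stand-in for horseshoe regression
        T[j, :j] = -beta
        d[j] = (yj - X @ beta).var()
    return T.T @ np.diag(1.0 / d) @ T

# Toy usage with simulated returns standing in for the n = 86 x p = 59 fund data.
rng = np.random.default_rng(6)
Y = rng.multivariate_normal(np.zeros(5), np.diag([1.0, 2.0, 1.0, 3.0, 1.0]), size=100)
print(np.round(triangular_precision_estimate(Y), 2))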
Conclusions
• This paper introduces the horseshoe prior as a good default
prior for sparse problems.
• Empirically, the model performs similarly to Bayesian model
averaging, the current standard.
• The model exhibits strong global shrinkage and robust local
adaptation to signals.