Handling Sparsity via the Horseshoe

Carlos M. Carvalho
Booth School of Business
The University of Chicago
Chicago, IL 60637

Nicholas G. Polson
Booth School of Business
The University of Chicago
Chicago, IL 60637

James G. Scott
McCombs School of Business
The University of Texas
Austin, TX 78712

Abstract

This paper presents a general, fully Bayesian framework for sparse supervised-learning problems based on the horseshoe prior. The horseshoe prior is a member of the family of multivariate scale mixtures of normals, and is therefore closely related to widely used approaches for sparse Bayesian learning, including, among others, Laplacian priors (e.g. the LASSO) and Student-t priors (e.g. the relevance vector machine). The advantages of the horseshoe are its robustness at handling unknown sparsity and large outlying signals. These properties are justified theoretically via a representation theorem and accompanied by comprehensive empirical experiments that compare its performance to benchmark alternatives.

1 Introduction

Supervised learning can be cast as the problem of estimating a set of coefficients β = {βi}_{i=1}^p that determine some functional relationship between a set of p inputs {xi}_{i=1}^p and a target variable y. This framework, while simple, is of central focus in modern statistics and artificial-intelligence research; it encompasses problems of regression, classification, function estimation, covariance regularization, and others still. The main challenges arise in "large-p" problems where, in order to avoid overly complex models that will predict poorly, some form of dimensionality reduction is needed. This entails finding sparse solutions, where some of the elements βi are zero (or very small).

From a Bayesian-learning perspective, there are two main sparse-estimation alternatives: discrete mixtures and shrinkage priors. The first approach (Mitchell and Beauchamp, 1988; George and McCulloch, 1993) models each βi with a prior comprising both a point mass at βi = 0 and an absolutely continuous alternative; the second approach (see, e.g., Tibshirani, 1996 and Tipping, 2001) models the βi's with absolutely continuous "shrinkage" priors centered at zero.

The choice of one approach or the other involves a series of tradeoffs. Discrete mixtures offer the correct representation of sparse problems by placing positive prior probability on βi = 0, but pose several difficulties. These include foundational issues related to the specification of priors for trans-dimensional model comparison, and computational issues related both to the calculation of marginal likelihoods and to the rapid combinatorial growth of the solution set. Shrinkage priors, on the other hand, can be very attractive computationally. But they create their own set of challenges, since the posterior probability mass on {βi = 0} (a set of Lebesgue measure zero) is never positive. Truly sparse solutions can therefore be achieved only through artifice.

In this paper we adopt the shrinkage approach, while at the same time acknowledging the discrete-mixture approach as a methodological ideal. Indeed, it is with this ideal in mind that we describe the horseshoe prior (Carvalho, Polson and Scott, 2008) as a default choice for shrinkage in the presence of sparsity. We begin our discussion in the simple situation where β is a vector of normal means, since it is here that the lessons drawn from a comparison of different shrinkage approaches for modeling sparsity are most readily understood.
In this context, we provide a theoretical characterization of the robustness properties of the horseshoe via a representation theorem for the posterior mean of β, given data y. We then give a handful of examples of the horseshoe's performance in linear models, function estimation, and covariance regularization (a problem of unsupervised learning for which the horseshoe prior is still highly relevant).

Our goal is not to characterize the horseshoe estimator as a "cure-all"—merely a default procedure that is well-behaved, that is computationally tractable, and that seems to outperform its competitors in a wide variety of sparse situations. We also try to provide some intuition as to the nature of this advantage: namely, the horseshoe prior's ability to adapt to different sparsity patterns while simultaneously avoiding the over-shrinkage of large coefficients.

Finally, we will return several times to a happy, and remarkably consistent, fact about the horseshoe's performance: that it quite closely mimics the answers one would get by performing Bayesian model averaging, or BMA, under a heavy-tailed discrete-mixture model. Bayesian model averaging is clearly the predictive gold standard for such problems (see, e.g., Hoeting et al., 1999), and a large part of the horseshoe prior's appeal stems from its ability to provide "BMA-like" performance without the attendant computational fuss.

2 The Horseshoe Prior

We start by introducing our approach to sparsity in the simple, stylized situation where (y | β) ∼ N(β, σ²I), and where β is believed to be sparse. The horseshoe prior assumes that each βi is conditionally independent with density πHS(βi | τ), where πHS can be represented as a scale mixture of normals:

    (βi | λi, τ) ∼ N(0, λi²τ²),   λi ∼ C+(0, 1),    (1)

where C+(0, 1) is a half-Cauchy distribution for the standard deviation λi.
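To make the scale-mixture representation in (1) concrete, the following minimal sketch (not from the paper; plain NumPy/SciPy, with τ fixed at an illustrative value) draws each βi by first sampling the local scale λi from a standard half-Cauchy and then sampling βi from the conditional normal. The heavy tails and the sharp concentration at zero visible in the draws are exactly the two features of the prior discussed below.

```python
import numpy as np
from scipy import stats

def sample_horseshoe_prior(p, tau=1.0, rng=None):
    """Draw p coefficients from the horseshoe prior via its
    scale-mixture-of-normals representation in equation (1):
    lambda_i ~ C+(0, 1),  beta_i | lambda_i ~ N(0, lambda_i^2 * tau^2)."""
    rng = np.random.default_rng(rng)
    lam = stats.halfcauchy.rvs(size=p, random_state=rng)   # local scales
    beta = rng.normal(0.0, lam * tau)                       # conditional normals
    return beta, lam

# Illustration: most draws sit very near zero, but a few are very large.
beta, lam = sample_horseshoe_prior(p=10_000, tau=1.0, rng=0)
print("fraction of |beta| < 0.1 :", np.mean(np.abs(beta) < 0.1))
print("largest |beta|           :", np.abs(beta).max())
```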
We refer to the λi's as the local shrinkage parameters and to τ as the global shrinkage parameter. Figure 1 plots the densities for the horseshoe, Laplacian and Student-t priors. The density function πHS(βi | τ) lacks a closed-form representation, but it behaves essentially like log(1 + 2/βi²), and can be well approximated by elementary functions as detailed in Theorem 1 of Carvalho et al. (2008).

Figure 1: The horseshoe prior and two close cousins: Laplacian and Student-t (df = 1). A separate panel shows the behavior of the horseshoe density near 0.

The horseshoe prior has two interesting features that make it particularly useful as a shrinkage prior for sparse problems. Its flat, Cauchy-like tails allow strong signals to remain large (that is, un-shrunk) a posteriori. Yet its infinitely tall spike at the origin provides severe shrinkage for the zero elements of β. As we will highlight in the discussion that follows, these are the key elements that make the horseshoe an attractive choice for handling sparse vectors.

2.1 Relation to other shrinkage priors

The density in (1) is perfectly well defined without reference to the λi's, which can be marginalized away. But by writing the horseshoe prior as a scale mixture of normals, we can identify its relationship with commonly used procedures in supervised learning. For example, exponential mixing, with λi² ∼ Exp(2), implies independent Laplacian priors for each βi; inverse-gamma mixing, with λi² ∼ IG(a, b), leads to Student-t priors. The former represents the underlying model for the LASSO (Tibshirani, 1996), while the latter is the model associated with the relevance vector machine (RVM) of Tipping (2001).

This common framework allows us to compare the appropriateness of the assumptions made by different models. These assumptions can be better understood by representing models in terms of the "shrinkage profiles" associated with their posterior expectations. Assume for now that σ² = τ² = 1, and define κi = 1/(1 + λi²). Then κi is a random shrinkage coefficient, and can be interpreted as the amount of weight that the posterior mean for βi places on 0 once the data y have been observed:

    E(βi | yi, λi²) = [λi²/(1 + λi²)] yi + [1/(1 + λi²)]·0 = (1 − κi) yi.

Since κi ∈ [0, 1], this is clearly finite, and so by Fubini's theorem,

    E(βi | y) = ∫₀¹ (1 − κi) yi π(κi | y) dκi = {1 − E(κi | yi)} yi.    (2)

By applying this transformation and inspecting the priors on κi implied by different choices for π(λi), we can develop an understanding of how these models attempt to discern between signal and noise. Figure 2 plots the densities for κ derived from a few important models in this class. Choosing λi ∼ C+(0, 1) implies κi ∼ Be(1/2, 1/2), a density that is symmetric and unbounded at both 0 and 1. This horseshoe-shaped shrinkage profile expects to see two things a priori: strong signals (κ ≈ 0, no shrinkage), and zeros (κ ≈ 1, total shrinkage).

Figure 2: Densities for the shrinkage weights κi ∈ [0, 1] under the Laplacian, Strawderman–Berger, Student-t and horseshoe priors. κi = 0 means no shrinkage and κi = 1 means total shrinkage to zero.
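The Be(1/2, 1/2) claim above is easy to check numerically. The short sketch below (my own illustration, not code from the paper) samples λ ∼ C+(0, 1), transforms to κ = 1/(1 + λ²), and compares the empirical distribution of the draws against Beta(1/2, 1/2) probabilities; the mass piling up at both ends of [0, 1] is the horseshoe shape that gives the prior its name.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# kappa = 1/(1 + lambda^2) with lambda ~ C+(0,1) should be Beta(1/2, 1/2).
lam = stats.halfcauchy.rvs(size=200_000, random_state=rng)
kappa = 1.0 / (1.0 + lam**2)

# Compare empirical bin frequencies with Beta(1/2, 1/2) probabilities.
edges = np.linspace(0.0, 1.0, 11)
empirical = np.histogram(kappa, bins=edges)[0] / kappa.size
theoretical = np.diff(stats.beta.cdf(edges, 0.5, 0.5))
for lo, hi, e, t in zip(edges[:-1], edges[1:], empirical, theoretical):
    print(f"kappa in [{lo:.1f}, {hi:.1f}):  empirical {e:.3f}   Beta(1/2,1/2) {t:.3f}")
```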
No other commonly used shrinkage prior shares these features. The Laplacian prior tends to a fixed constant near κ = 1, and disappears entirely near κ = 0. The Student-t prior and the Strawderman–Berger prior (see Section 2.3) are both unbounded near κ = 0, reflecting their heavy tails. But both are bounded near κ = 1, limiting these priors in their ability to squelch noise components back to zero.

As an illustration, consider a simple example. Two repeated standard normal observations yi1 and yi2 were simulated for each of 1000 means: 10 signals with βi = 10, 90 signals with βi = 2, and 900 noise components where βi = 0. Based on these data, we estimate the vector β under two models: (i) independent horseshoe priors for each βi, and (ii) independent Laplacian priors. Both models assume τ ∼ C+(0, 1), along with Jeffreys' prior π(σ) ∝ 1/σ.

The shrinkage characteristics of the models are presented in Figure 3, where ȳi is plotted against β̂i = E(βi | y). The important differences occur when ȳi ≈ 0 and when ȳi is large. Compared to the horseshoe prior, the Laplacian specification tends to over-shrink the large values of ȳ and yet under-shrink the noise observations. This is a direct effect of the prior on κi, which in the Laplacian case is bounded both at 0 and 1, limiting the ability of each κi to approach these values a posteriori.

Figure 3: Plots of ȳi versus θ̂i for Laplacian (left) and horseshoe (right) priors on data where most of the means are zero. The diagonal lines are where θ̂i = ȳi. Additional panels show the mean local shrinkage weights and posterior draws for τ under each model.

Figure 3 also plots posterior draws for the global shrinkage parameter τ, offering a closer look at the mechanism underlying signal discrimination. Under the horseshoe model, τ is estimated to be much smaller than in the Laplacian model. This is perhaps the single most important characteristic of the horseshoe: the clear separation between the global and local shrinkage effects. The global shrinkage parameter tries to estimate the overall sparsity level, while the local shrinkage parameters are able to flag the non-zero elements of β. Heavy tails for π(λi) play a key role in this process, allowing the estimates of βi to escape the strong "gravitational pull" towards zero exercised by τ.

Put another way, the horseshoe has the freedom to shrink globally (via τ) and yet act locally (via λi). This is not possible under the Laplacian prior, whose shrinkage profile forces a compromise between shrinking noise and flagging signals. This leads to over-estimation of the signal density of the underlying vector, combined with under-estimation of the larger elements of β. Performance therefore suffers—in this simple example, the mean squared error was 25% lower under the horseshoe model.

In fairness, the most commonly used form of the Laplacian model is the LASSO, where estimators are defined by the posterior mode (MAP), thereby producing zeros in the solution set. Our experiments of the next section, however, indicate that the issues we have highlighted about Laplacian priors remain even when the mode is used: the overall estimate of the sparsity level will still be governed by τ, which in turn is heavily influenced by the tail behavior of the prior on λi. Robustness here is crucial, and it is to this issue that we now turn.
2.2 Robust Shrinkage

The robust behavior of the horseshoe can be formalized using the following representation of the posterior mean of β when (y | β) ∼ N(β, 1). Conditional on one sample y*,

    E(β | y*) = y* + (d/dy*) ln m(y*),    (3)

where m(y*) = ∫ p(y* | β) π(β) dβ is the marginal density for y* (see Polson, 1991). From (3) we get an essential insight about the behavior of an estimator in situations where y* is very different from the prior mean. In particular, robustness is achieved by using priors having the "bounded influence" property—i.e. those giving rise to a score function that is bounded as a function of y*. If such a bound exists, then for large values of |y*|, E(β | y*) ≈ y*, implying that the estimator never misses too badly in the tails of the prior.

Theorem 3 of Carvalho et al. (2008) shows that the horseshoe prior is indeed of bounded influence, and furthermore that

    lim_{|y*| → ∞} (d/dy*) ln m_H(y*) = 0.    (4)

The Laplacian prior is also of bounded influence, but crucially, this bound does not decay to zero in the tails. Instead,

    lim_{|y*| → ∞} (d/dy*) ln m_L(y*) = ±a,    (5)

where a varies inversely with the global shrinkage parameter τ (Pericchi and Smith, 1992). Unfortunately, when the vector β is sparse, τ will be estimated to be small, and this "nonrobustness bias" a will be quite large.

Figure 4 illustrates these results by showing the relationship between y* and the posterior mean under both the horseshoe and the Laplacian priors. These are available analytically for fixed values of τ, which for the sake of illustration were chosen to yield near-identical shrinkage within roughly 2σ of the origin. Both models can "bow" near the origin to accommodate sparse vectors by changing τ; only the horseshoe can simultaneously perform well in the tails, even when τ is very small.

Figure 4: A comparison of the posterior mean versus y (in units of σ) for horseshoe and Laplacian priors.

This effect can be confirmed by inspecting the joint distribution of the data and parameters under the horseshoe prior,

    p(y, κ, τ) ∝ π(τ) τ^p ∏_{i=1}^p [exp(−κi yi²/2) / √(1 − κi)] ∏_{i=1}^p [1 / (τ²κi + 1 − κi)],    (6)

from which it is clear that the marginal density for κi is always unbounded at 1, regardless of τ. (This is one reason why the posterior mode is inappropriate here.) Hence the horseshoe prior, its tail robustness notwithstanding, will always have the ability to severely shrink elements of β when needed.
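The bounded-influence behavior in (3)–(4) can also be seen directly from the shrinkage-weight representation E(β | y) = {1 − E(κ | y)} y of equation (2). The sketch below (an illustration of mine, with σ = τ = 1 as in Section 2.1) estimates E(κ | y) by self-normalized importance sampling, using the Be(1/2, 1/2) prior on κ as the proposal: small observations are shrunk nearly to zero, while large observations are left essentially untouched.

```python
import numpy as np
from scipy import stats

def horseshoe_posterior_mean(y, n_draws=200_000, rng=None):
    """Posterior mean E(beta | y) for a single observation y ~ N(beta, 1)
    under the horseshoe prior with sigma = tau = 1, via the identity
    E(beta | y) = (1 - E(kappa | y)) * y.  E(kappa | y) is estimated by
    importance sampling with the Be(1/2, 1/2) prior on kappa as proposal
    and the marginal likelihood y | kappa ~ N(0, 1/kappa) as the weight."""
    rng = np.random.default_rng(rng)
    kappa = stats.beta.rvs(0.5, 0.5, size=n_draws, random_state=rng)
    w = stats.norm.pdf(y, loc=0.0, scale=1.0 / np.sqrt(kappa))
    e_kappa = np.sum(kappa * w) / np.sum(w)
    return (1.0 - e_kappa) * y

for y in [0.5, 1.0, 2.0, 5.0, 10.0]:
    print(f"y = {y:5.1f}   E(beta | y) = {horseshoe_posterior_mean(y, rng=2):6.3f}")
```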
2.3 Relation to Bayesian model averaging

As we have mentioned, one alternative approach for handling sparsity is the use of discrete-mixture priors, where

    βi ∼ (1 − w) δ0 + w · π(βi).    (7)

Here, w is the prior inclusion probability, and δ0 is a degenerate distribution at zero, so that each βi is assigned probability (1 − w) of being zero a priori. Crucial to the good performance of the model in (7) are the choice of π(β) and the careful estimation of w. The former allows large signals to be accommodated, while the latter allows the model to adapt to the overall level of sparsity in β, automatically handling the implied multiple-testing problem (Scott and Berger, 2006). By carefully choosing π(β) and accounting for the uncertainty in w, this model can be considered the "gold standard" for sparse problems, both theoretically and empirically. This is extensively discussed by, for example, Hoeting et al. (1999) and Johnstone and Silverman (2004). The discrete-mixture model is therefore an important benchmark for any shrinkage prior.

Here, we will focus on a version of the discrete mixture where the nonzero βi's follow independent Strawderman–Berger priors (Strawderman, 1971; Berger, 1980), which have Cauchy-like tails and yet still allow closed-form convolution with the normal likelihood. Figure 2 displays the shrinkage profile of the Strawderman–Berger prior on the κ scale, where it is seen to yield a Be(1, 1/2) distribution. Here, the point mass at βi = 0 can equivalently be construed as a point mass at κi = 1. If Strawderman–Berger priors are assumed for the nonzero βi's, the discrete mixture model will yield a shrinkage profile with the desired unboundedness both at κi ≈ 0 (signal) and κi ≈ 1 (noise).

Notice that both the horseshoe prior and the discrete mixture have mechanisms for controlling the overall signal density in β. In the discrete mixture model, this parameter is clearly w, the prior inclusion probability. But under the horseshoe prior, this role is played by τ, the common variance parameter. This is easily seen from the joint distribution in (6), since one can approximate the conditional posterior for τ by

    p(τ² | κ) ≈ (τ²)^{−p/2} {1 + (1 − κ̄)/(τ²κ̄)}^{−p} ≈ (τ²)^{−p/2} exp{−p(1 − κ̄)/(τ²κ̄)},

where κ̄ = p⁻¹ Σ_{i=1}^p κi. This is essentially a Ga{(p + 2)/2, p(1 − κ̄)/κ̄} distribution for τ⁻², implying a posterior mean for τ² of approximately 2(1 − κ̄)/κ̄. When κ̄ gets close to 1, implying that most observations are shrunk to zero, then τ² is estimated to be very small.

2.4 Hyperparameters

Much of the above discussion focused on the behavior implied by different choices of priors for the local shrinkage parameters λi. Yet the estimation of the global parameters τ and σ plays a large role in separating signal from noise, as seen in the example depicted in Figure 3.

So far, we have focused on a fully Bayesian specification where weakly informative priors were used both for τ and σ (as well as w in the discrete mixture). There is a vast literature on choosing priors for variance components in general hierarchical models, and justifications for our choices of τ ∼ C+(0, 1) and π(σ) ∝ 1/σ appear in Gelman (2006).

Alternatives to a fully Bayesian analysis include cross-validation and empirical Bayes, often called Type-II maximum likelihood. These "plug-in" analyses are, in fact, the standard choices in many applications of shrinkage estimation in both machine learning and statistics.

While we certainly do not intend to argue that "plug-in" alternatives are wrong per se, we do recommend, as a conservative and more robust route, the use of the fully Bayesian approach. The full Bayes analysis is quite simple computationally using MCMC, and will avoid at least three potential problems:

1. Plug-in approaches will ignore the unknown correlation structure between τ and σ (or τ, σ and w in the discrete mixture model). This can potentially give misleading results in situations where the correlation is severe, while the full Bayes analysis will automatically average over this joint uncertainty.

2. The marginal maximum-likelihood solution is always in danger of collapsing to the degenerate τ̂ = 0. The issue is exacerbated when very few signals are present, in which case the posterior mass of τ will concentrate near 0 and signals will be flagged via large values of the local shrinkage parameters λi.

3. Plug-in methods may fail to correspond to any kind of Bayesian analysis even asymptotically, when there no longer is any uncertainty about the relevant hyperparameters. See Scott and Berger (2008) for an extensive discussion of this phenomenon.

One may ask, of course, whether a global scale parameter τ is even necessary, and whether the local parameters λi can be counted upon to do all the work. (This is the tactic used in, for example, the relevance vector machine.) But this is equivalent to choosing τ = 1, and we feel that Figure 3 is enough to call this practice into question, given how far away the posterior distribution is from τ = 1.
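As a concrete illustration of how simple the full Bayes analysis can be, the sketch below gives a Gibbs sampler for the normal-means model of Section 2 (yi ∼ N(βi, σ²), horseshoe prior on β, τ ∼ C+(0, 1), π(σ) ∝ 1/σ). It is not the authors' implementation; it relies on the standard identity that a half-Cauchy scale can be written as a mixture of inverse-gammas (λi² | νi ∼ IG(1/2, 1/νi) with νi ∼ IG(1/2, 1), and similarly for τ), which makes every full conditional either normal or inverse-gamma. The function name horseshoe_gibbs is my own.

```python
import numpy as np

def horseshoe_gibbs(y, n_iter=2000, burn=500, rng=None):
    """Gibbs sampler for y_i ~ N(beta_i, sigma^2) with a horseshoe prior on beta,
    tau ~ C+(0,1) and p(sigma) proportional to 1/sigma, using the inverse-gamma
    augmentation of the half-Cauchy scales.  Returns posterior draws of beta."""
    rng = np.random.default_rng(rng)
    y = np.asarray(y, dtype=float)
    p = y.size

    def inv_gamma(shape, scale):            # IG(shape, scale) draws
        return scale / rng.gamma(shape, 1.0, size=np.shape(scale))

    beta = np.zeros(p)
    lam2, nu = np.ones(p), np.ones(p)       # local scales and auxiliaries
    tau2, xi = 1.0, 1.0                     # global scale and auxiliary
    sigma2 = 1.0
    draws = []

    for it in range(n_iter):
        # beta_i | rest ~ N(m_i, v_i)
        v = 1.0 / (1.0 / sigma2 + 1.0 / (lam2 * tau2))
        beta = rng.normal(v * y / sigma2, np.sqrt(v))
        # local scales and their auxiliaries
        lam2 = inv_gamma(1.0, 1.0 / nu + beta**2 / (2.0 * tau2))
        nu = inv_gamma(1.0, 1.0 + 1.0 / lam2)
        # global scale and its auxiliary
        tau2 = float(inv_gamma((p + 1) / 2.0, 1.0 / xi + np.sum(beta**2 / lam2) / 2.0))
        xi = float(inv_gamma(1.0, 1.0 + 1.0 / tau2))
        # error variance under Jeffreys' prior
        sigma2 = float(inv_gamma(p / 2.0, np.sum((y - beta) ** 2) / 2.0))
        if it >= burn:
            draws.append(beta.copy())
    return np.array(draws)
```

Calling horseshoe_gibbs(y).mean(axis=0) then gives approximate posterior means of the kind compared across priors in the experiments of Section 3.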
3 Examples

Carvalho, Polson, Scott and Yae (2009) provide an extensive discussion of the use of the horseshoe in traditional supervised-learning situations, including linear regression, generalized linear models, and function estimation through basis expansions. We now focus on a few examples that highlight the effectiveness of the horseshoe as a good default procedure.

These situations all involve an n-dimensional vector y ∼ N(Xβ, σ²I), where X is an n × p design matrix. In regression, the rows of X are the predictors for each subject; in basis models, they are the bases evaluated at the points in predictor space to which each entry of y corresponds. The horseshoe prior for the p-dimensional vector β takes the form in (1).

3.1 Exchangeable means

Experiment 1 demonstrates the operational similarities between the horseshoe and a heavy-tailed discrete mixture. We still focus on the problem of estimating a p-dimensional sparse mean β of a multivariate normal distribution (implying that X is the identity). We simulated 1000 data sets with different sparsity configurations and 20% non-zero entries on average. Nonzero βi's were generated randomly from a Student-t distribution with scale τ = 3 and degrees of freedom equal to 3. Data y was simulated under two possibilities for the noise variance: σ² = 1 and σ² = 9. In each data set we estimate β by the posterior mean under three different models: horseshoe, Laplacian, and discrete mixture. The results for estimation risk are reported in Table 1.

                 σ² = 1                 σ² = 9
  Loss        LP     HS     DM       LP     HS     DM
  ℓ2    LP   209   1.62   1.62      850   1.47   1.51
        HS     -     77   0.95        -    416   0.99
        DM     -      -     93        -      -    440
  ℓ1    LP   178   1.50   1.60      341   1.56   1.75
        HS     -     80   1.02        -    142   1.10
        DM     -      -     83        -      -    123

Table 1: Risk under squared-error (ℓ2) loss and absolute-error (ℓ1) loss in Experiment 1. Diagonal entries in the top and bottom halves are median sums of squared errors and absolute errors, respectively, over 1000 simulated data sets. Off-diagonal entries are average risk ratios (risk of row divided by risk of column), in units of σ. LP: Laplacian. HS: horseshoe. DM: discrete mixture, fully Bayes.

Regardless of the situation, the Laplacian loses quite significantly both to the horseshoe prior and the discrete-mixture model. Yet neither of these two options enjoys a systematic advantage; their similarities in shrinkage profiles seem to translate quite directly to similar empirical results. Meanwhile, the nonrobustness of the Laplacian prior is quite apparent.
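To make the simulation design concrete, the sketch below generates one Experiment-1-style data set (roughly 20% nonzero means drawn from a Student-t with 3 degrees of freedom and scale 3, plus Gaussian noise) and runs it through the hypothetical horseshoe_gibbs sampler sketched after Section 2.4. The exact sparsity mechanism of the paper, and the competing Laplacian and discrete-mixture fits, are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(3)

# One Experiment-1-style data set: sparse normal means with ~20% signals.
p, sigma = 1000, 1.0
signal = rng.random(p) < 0.20                                   # which means are nonzero
beta_true = np.where(signal, 3.0 * rng.standard_t(df=3, size=p), 0.0)
y = beta_true + rng.normal(0.0, sigma, size=p)

# Posterior means under the horseshoe, reusing the sampler sketched earlier.
beta_hat = horseshoe_gibbs(y, n_iter=2000, burn=500, rng=4).mean(axis=0)
print("sum of squared errors:", np.sum((beta_hat - beta_true) ** 2))
```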
3.2 Regression

In Experiment 2, we chose two fixed vectors of ten nonzero coefficients: β_{1:10} = (2, 2, 2, 2, 2, 2, 2, 2, 5, 20) and β_{1:10} = (1, 2, 3, 4, 5, 6, 7, 8, 9, 10). We then "padded" these with (p − 10) zeros for several different choices of p, simulated random design matrices with moderately correlated entries, and simulated y by adding standard normal errors to the true linear predictor Xβ. In all cases, n scaled linearly with p. For this example, we evaluated the horseshoe using the LASSO (i.e. the posterior mode under Laplacian priors) as a benchmark, with τ chosen through cross-validation. Results are presented in Table 2.

  Case 1: β_{1:10} = (2, 2, 2, 2, 2, 2, 2, 2, 5, 20)
    p        20     50    100    200    400
    n        24     60    120    240    480
    Lasso   1.86   0.78   0.34   0.13   0.12
    HS      1.28   0.33   0.11   0.06   0.07

  Case 2: β_{1:10} = (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
    p        20     50    100    200    400
    n        25     55    105    205    405
    Lasso   0.61   0.40   0.48   0.21   0.23
    HS      0.31   0.23   0.12   0.09   0.08

Table 2: Mean-squared error in estimating β in Experiment 2.

In Experiment 3, we fixed p = 50, but rather than fixing the non-zero values of β, we simulated 1000 data sets with varying levels of sparsity, where nonzero βi's were generated from a standard Student-t with 2 degrees of freedom. (The coefficients were 80% sparse on average, with nonzero status decided by a weighted coin flip.) We again compared the horseshoe against the LASSO, but also included Bayesian model averaging using Zellner–Siow priors as a second benchmark. Results for both estimation error and out-of-sample prediction error are displayed in Figure 5. As these results show, both BMA and the horseshoe prior systematically outperform the LASSO in sparse regression problems, without either one enjoying a noticeable advantage over the other.

Figure 5: Results for Experiment 3: sums of squared errors for parameter estimation and for out-of-sample prediction under the Lasso, the horseshoe, and BMA. "BMA" refers to the model-averaged results under Zellner–Siow priors. "Lasso" refers to the posterior mode under Laplacian priors.

3.3 Basis expansion with kernels

In Experiment 4, we used the sine test function described in Tipping (2001) to assess the ability of the horseshoe prior to handle regularized kernel regression. For each of 100 different simulated data sets, 100 random points ti were simulated uniformly between −20 and 20. The response yi was then set to sin(ti)/ti + εi, with εi ∼ N(0, σ = 0.15).

The goal was to estimate the underlying function f(t) using kernel methods. As a benchmark, we use the relevance vector machine, corresponding to independent Student-t priors with zero degrees of freedom. Gaussian kernels were centered at each of the 100 observed points, with the kernel bandwidth chosen as the default in the "rvm" function in the R package "kernlab." These kernel basis functions, evaluated at the observed values of ti, formed the 100 × 100 design matrix, with β representing the vector of kernel weights.

In these 100 simulated data sets, the average sum of squared errors in estimating f(t) at 100 out-of-sample t points was 7.55 using the horseshoe prior, and 8.19 using the relevance vector machine. In 91 cases out of 100, the horseshoe prior yielded lower risk. An example of one simulated data set is shown in Figure 6.

Figure 6: One example data set in Experiment 4 involving the sin(t)/t test function, showing the true function, the data, and the horseshoe (SSE = 1.26) and RVM (SSE = 3.44) estimates.
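The design matrix in Experiment 4 is easy to reproduce in outline. The sketch below (my own, with an arbitrary bandwidth standing in for kernlab's rvm default, which is not stated in the paper) simulates one sin(t)/t data set and assembles the 100 × 100 Gaussian-kernel basis; the resulting X could then be paired with a horseshoe prior on the kernel weights exactly as in equation (1).

```python
import numpy as np

rng = np.random.default_rng(5)

# One simulated data set from the sin(t)/t test function of Tipping (2001).
n = 100
t = rng.uniform(-20.0, 20.0, size=n)
f_true = np.sinc(t / np.pi)                  # np.sinc(x) = sin(pi*x)/(pi*x), so this is sin(t)/t
y = f_true + rng.normal(0.0, 0.15, size=n)

# Gaussian kernels centered at each observed point; the bandwidth is a guess,
# standing in for the kernlab::rvm default used in the paper.
bandwidth = 3.0
X = np.exp(-((t[:, None] - t[None, :]) ** 2) / (2.0 * bandwidth**2))
print(X.shape)   # (100, 100) design matrix of kernel basis functions
```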
3.4 Unsupervised covariance estimation

Suppose we observe a matrix Y whose rows are n realizations of a p-dimensional vector y ∼ N(0, Σ), and that the goal is to estimate Σ. This is an important problem in portfolio allocation, where one must assess the variance of a weighted portfolio of assets, and where regularized estimates of Σ are known to offer substantial improvements over the straight estimator Σ̂ = Y′Y.

A useful way of regularizing Σ is by introducing off-diagonal zeros in its inverse Ω. This can be done by searching for undirected graphs that characterize the Markov structure of y, a process known as Gaussian graphical modeling (see Jones et al., 2005). While quite potent as a tool for regularization, Bayesian model averaging across different graphical models poses the same difficulties as it does in linear models: marginal likelihoods are difficult to compute, and the model space is enormously difficult to search.

Luckily, Gaussian graphical modeling can also be done indirectly, either by fitting a series of sparse self-on-self regression models for (yj | y_{−j}), j = 1, ..., p, or by representing the Cholesky decomposition of Ω as a triangular system of sparse regressions. The first option is done using the LASSO by Meinshausen and Buhlmann (2006). We now present similar results using the horseshoe.

Our test data set is the Vanguard mutual-fund data set (p = 59, n = 86) of Carvalho and Scott (2009). We recapitulate their out-of-sample prediction exercise, which involves estimating Σ using the first 60 observations, and then attempting to impute random subsets of missing values among the remaining 26 observations. We use that paper's full BMA results as a benchmark (which required many hours of computing using the FINCS algorithm of Scott and Carvalho, 2008).

Results from this prediction exercise are presented in Table 3, where it is clear that the horseshoe, despite being a much simpler computational strategy, performs almost as well as the benchmark (Bayesian model averaging). Once again, both BMA and the horseshoe outperform alternatives based on the Laplacian prior.

  Method   L-OR   L-AND   L-Chol    HS    BMA
  SSE       791     729      520   372    347

Table 3: Sum of squared errors in predicting missing return values in the 59-dimensional mutual-fund example. "L-OR" and "L-AND" refer to estimates based on Lasso regressions in the full conditionals of each asset. "L-Chol" and "HS" refer to Lasso and horseshoe models on the triangular system of linear regressions from the Cholesky decomposition of Σ⁻¹. Finally, "BMA" is based on Bayesian model averaging using FINCS.
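The Cholesky-based route mentioned above has a simple mechanical core: regress each variable on its predecessors, collect the coefficients in a unit lower-triangular matrix and the residual variances in a diagonal matrix, and reassemble the precision matrix. The sketch below (mine, not the paper's code) uses plain least squares as a placeholder for the sparse horseshoe regressions that the paper actually fits on each row of the triangular system.

```python
import numpy as np

def precision_from_triangular_regressions(Y):
    """Estimate Omega = Sigma^{-1} by regressing each column of Y on the
    columns that precede it: y_j = sum_{k<j} b_{jk} y_k + e_j, e_j ~ N(0, d_j).
    Then Omega = (I - B)' D^{-1} (I - B), with B strictly lower triangular.
    Ordinary least squares is used here as a stand-in for the sparse
    (horseshoe-regularized) fits described in the text."""
    Y = Y - Y.mean(axis=0)
    n, p = Y.shape
    B = np.zeros((p, p))
    d = np.empty(p)
    d[0] = Y[:, 0].var()
    for j in range(1, p):
        coef, *_ = np.linalg.lstsq(Y[:, :j], Y[:, j], rcond=None)
        B[j, :j] = coef
        d[j] = np.mean((Y[:, j] - Y[:, :j] @ coef) ** 2)
    T = np.eye(p) - B
    return T.T @ np.diag(1.0 / d) @ T

# Quick check on simulated data: the estimate should approach the true precision.
rng = np.random.default_rng(6)
Sigma = np.array([[1.0, 0.5, 0.0], [0.5, 1.0, 0.3], [0.0, 0.3, 1.0]])
Y = rng.multivariate_normal(np.zeros(3), Sigma, size=5000)
print(np.round(precision_from_triangular_regressions(Y), 2))
print(np.round(np.linalg.inv(Sigma), 2))
```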
4 Discussion

We have introduced and discussed the use of the horseshoe prior in the estimation of sparse vectors in supervised-learning problems. The horseshoe prior is based on a novel multivariate-normal scale mixture; it yields estimates that are robust both to unknown sparsity patterns and to large outlying signals, making it an attractive default option.

It is reassuring that the theoretical insights of Section 2 regarding sparsity and robustness can be observed in practice, as we have demonstrated through a variety of experiments.

Moreover, it is surprising that in all situations where we have investigated the matter, the answers obtained by the horseshoe closely mimic those arising from the gold standard for sparse estimation and prediction: Bayesian model averaging across discrete-mixture models. This is an interesting (and as yet under-explored) fact that may prove very useful in ultra-high-dimensional situations, where the computational challenges associated with BMA may be very cumbersome indeed.

Additional detail concerning these issues can be found in working papers available from the authors' websites.

Acknowledgements

The first author acknowledges the support of the IBM Corporation Scholar Fund at the University of Chicago Booth School of Business, and the third author that of a graduate research fellowship from the U.S. National Science Foundation.

References

J. Berger (1980). A Robust Generalized Bayes Estimator and Confidence Region for a Multivariate Normal Mean. The Annals of Statistics, 8, 716–761.

C. Carvalho, N. Polson and J. G. Scott (2008). The Horseshoe Estimator for Sparse Signals. Discussion Paper 2008-31, Duke University Department of Statistical Science.

C. Carvalho, N. Polson, J. G. Scott and S. Yae (2009). Bayesian Regularized Regression and Basis Expansion via the Horseshoe. Working paper.

C. Carvalho and J. G. Scott (2009). Objective Bayesian Model Selection in Gaussian Graphical Models. Biometrika (to appear).

A. Gelman (2006). Prior Distributions for Variance Parameters in Hierarchical Models. Bayesian Analysis, 1, 515–533.

E. George and R. McCulloch (1993). Variable Selection via Gibbs Sampling. Journal of the American Statistical Association, 88, 881–889.

J. Hoeting, D. Madigan, A. E. Raftery and C. Volinsky (1999). Bayesian Model Averaging: A Tutorial. Statistical Science, 14, 382–417.

I. Johnstone and B. Silverman (2004). Needles and Straw in Haystacks: Empirical-Bayes Estimates of Possibly Sparse Sequences. The Annals of Statistics, 32, 1594–1649.

B. Jones, C. Carvalho, A. Dobra, C. Hans, C. Carter and M. West (2005). Experiments in Stochastic Computation for High-dimensional Graphical Models. Statistical Science, 20, 388–400.

N. Meinshausen and P. Buhlmann (2006). High-dimensional Graphs and Variable Selection with the Lasso. The Annals of Statistics, 34, 1436–1462.

T. Mitchell and J. Beauchamp (1988). Bayesian Variable Selection in Linear Regression (with discussion). Journal of the American Statistical Association, 83, 1023–1036.

L. Pericchi and A. Smith (1992). Exact and Approximate Posterior Moments for a Normal Location Parameter. Journal of the Royal Statistical Society B, 54, 793–804.

N. Polson (1991). A Representation of the Posterior Mean for a Location Model. Biometrika, 78, 426–430.

J. G. Scott and J. Berger (2006). An Exploration of Aspects of Bayesian Multiple Testing. Journal of Statistical Planning and Inference, 136, 2144–2162.

J. G. Scott and J. Berger (2008). Bayes and Empirical-Bayes Multiplicity Adjustment in the Variable-Selection Problem. Discussion Paper 2008-10, Duke University Department of Statistical Science.

J. G. Scott and C. Carvalho (2008). Feature-inclusion Stochastic Search for Gaussian Graphical Models. Journal of Computational and Graphical Statistics, 17, 790–808.

W. Strawderman (1971). Proper Bayes Minimax Estimators of the Multivariate Normal Mean. The Annals of Mathematical Statistics, 42, 385–388.

R. Tibshirani (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society B, 58, 267–288.

M. Tipping (2001). Sparse Bayesian Learning and the Relevance Vector Machine. Journal of Machine Learning Research, 1, 211–244.