Non-Local Priors for High-Dimensional Estimation

David Rossell (University of Warwick, Department of Statistics), Donatello Telesca (UCLA, Department of Biostatistics)

April 16, 2014

Abstract

Simultaneously achieving parsimony and good predictive power in high dimensions is a main challenge in statistics. Non-local priors (NLPs) possess appealing properties for high-dimensional model choice, but their use for estimation has not been studied in detail, mostly due to difficulties in characterizing the posterior on the parameter space. We give a general representation of NLPs as mixtures of truncated distributions. This enables simple posterior sampling and flexibly defining NLPs beyond previously proposed families. We develop posterior sampling algorithms and assess performance in p >> n setups. We observed low posterior serial correlation and notably accurate high-dimensional estimation for linear models. Relative to benchmark and hyper-g priors, SCAD and LASSO, combining NLPs with Bayesian model averaging provided substantially lower estimation error when p >> n. In gene expression data they achieved higher cross-validated R² using an order of magnitude fewer predictors than competing methods. Remarkably, these results were obtained without the need for methods to pre-screen predictors. Our findings contribute to the debate of whether different priors should be used for estimation and model selection, showing that selection priors may actually be desirable for high-dimensional estimation.

Keywords: Model Selection, MCMC, Non-Local Priors, Bayesian Model Averaging

1 Introduction

Developing high-dimensional methods that balance parsimony and good predictive power is one of the main challenges in statistics. Non-local prior (NLP) distributions have appealing properties for Bayesian model selection. Relative to local priors, NLPs discard spurious covariates faster as the sample size n grows, while preserving exponential learning rates to detect non-zero coefficients [Johnson and Rossell, 2010]. This additional parsimony enforcement has important consequences in high dimensions. Denote the observations by y_n ∈ Y_n, where Y_n is the sample space. In Normal regression models, when the number of variables p grows at rate O(n^α) with 0.5 ≤ α < 1 and the ratios between model prior probabilities are bounded, the posterior probability P(M_t | y_n) assigned to the data-generating model M_t converges in probability to 1 when using NLPs. In contrast, when using local priors P(M_t | y_n) converges to 0 [Johnson and Rossell, 2012]. That is, consistency of P(M_t | y_n) occurs if and only if NLPs are used. These results make NLPs potentially very attractive for high-dimensional estimation, but this use has not been studied in detail. We remark that consistency of P(M_t | y_n) can also be achieved using model prior probabilities that increasingly encourage sparsity as p grows [Liang et al., 2013]. While useful, the extra complexity penalty induced by this strategy is not guided by the data but by a priori assumptions on the model space, and is complementary to parameter-specific priors such as NLPs.

For simplicity we assume a parametric density f(y_n | θ) indexed by θ ∈ Θ ⊆ R^p, defined with respect to an adequate σ-algebra and σ-finite dominating measure. We entertain a collection of K models M_1, ..., M_K with corresponding parameter spaces Θ_1, ..., Θ_K ⊆ Θ.
We assume that models are nested: Θ_K = Θ is the full model and for any k′ ∈ {1, ..., K − 1} there exists k such that Θ_{k′} ⊂ Θ_k, dim(Θ_{k′}) < dim(Θ_k) and Θ_k ∩ Θ_{k′} has 0 Lebesgue measure. We say that π(θ | M_k), an absolutely continuous density for θ ∈ Θ_k under model M_k, is a NLP if it converges to 0 as θ approaches any value θ_0 that would be consistent with a submodel M_{k′}.

Definition 1. Let ||θ − θ_0|| be a suitable distance between θ ∈ Θ_k and θ_0 ∈ Θ_{k′} ⊂ Θ_k. π(θ | M_k) is a non-local prior under model M_k if for all θ_0, k′ ≠ k and ε > 0 there exists η > 0 such that ||θ − θ_0|| < η implies that π(θ | M_k) < ε.

Intuitively, NLPs define probabilistic separation of the models under consideration, which is the basis for their improved learning rates. For preciseness, the definition also assumes that (as usual) the intersection between any two models (k, k′) of equal dimensionality is included in another model k″, i.e. Θ_k ∩ Θ_{k′} ⊆ Θ_{k″}. To fix ideas, consider the Normal linear regression y ∼ N(Xθ, φI) with unknown variance φ. The three following distributions are NLPs under M_k:

π_M(θ | φ, M_k) = ∏_{i∈M_k} [θ_i² / (τφ)] N(θ_i; 0, τφ),   (1)

π_I(θ | φ, M_k) = ∏_{i∈M_k} [(τφ)^{1/2} / (√π θ_i²)] exp{−τφ/θ_i²},   (2)

π_E(θ | φ, M_k) = ∏_{i∈M_k} exp{√2 − τφ/θ_i²} N(θ_i; 0, τφ),   (3)

where i ∈ M_k indexes the non-zero coefficients, N(θ_i; 0, v) is the univariate Normal density with mean 0 and variance v, and π_M, π_I and π_E are called the product MOM, iMOM and eMOM priors (respectively).

Consider now an estimation problem where interest lies either on the posterior distribution conditional on a single model, π(θ | M_k, y_n), or on the Bayesian model averaging (BMA) posterior π(θ | y_n) = Σ_{k=1}^K π(θ | M_k, y_n) P(M_k | y_n). Let M_t be the data-generating model. Whenever P(M_t | y_n) → 1, the oracle property π(θ | y_n) → π(θ | M_t, y_n) is achieved by direct application of Slutsky's theorem. For instance, under the conditions in Johnson and Rossell [2012] NLPs achieve P(M_t | y_n) → 1 and hence achieve the oracle property. However, using NLPs for estimating θ presents important challenges. Strategies to compute P(M_k | y_n) and explore the model space are available, but π(θ | M_k, y_n) has extreme multi-modalities in high dimensions, e.g. the posteriors induced by (1)-(3) have 2^{p_k} modes, where p_k = dim(Θ_k). Also, finding functional forms that lead to simple calculations limits the flexibility in choosing the prior.

The main contribution of this manuscript lies in the representation of NLPs as mixtures of truncated distributions (Section 3). We prove that this is a one-to-one characterization, in the sense that any NLP can be represented as such a mixture, and any non-degenerate mixture induces a NLP. The representation provides an intuitive justification for NLPs, adds flexibility in prior choice and enables efficient posterior sampling even under strong multi-modalities (Section 4). We also study finite-sample performance in case studies with growing p (Section 5). Throughout we place emphasis on high dimensions and devote our interest to proper priors, as their shrinkage properties outperform improper priors even in moderate dimensions. See the early arguments in Stein [1956], the review on penalized likelihood in Fan and Lv [2010], or related Bayesian perspectives in George et al. [2012]. To provide intuition, we first introduce a simple example built on basic principles.
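Before proceeding, a small base-R sketch may help fix ideas about (1)-(3). It simply codes the univariate factor of each prior; the choices τ = 1, φ = 1 and the function names are ours for illustration only (not the interface of any package), and the mombf package used later in the paper provides these prior families in full.

# Univariate factors of the product MOM, iMOM and eMOM densities in (1)-(3).
# tau = 1 and phi = 1 are arbitrary; function names are illustrative, not a package API.
mom_dens  <- function(th, tau = 1, phi = 1)
  th^2 / (tau * phi) * dnorm(th, 0, sqrt(tau * phi))
imom_dens <- function(th, tau = 1, phi = 1)
  sqrt(tau * phi) / (sqrt(pi) * th^2) * exp(-tau * phi / th^2)
emom_dens <- function(th, tau = 1, phi = 1)
  exp(sqrt(2) - tau * phi / th^2) * dnorm(th, 0, sqrt(tau * phi))
# All three vanish as th -> 0, the defining non-local property
curve(mom_dens(x), -3, 3, n = 1000, ylab = "prior density")
curve(imom_dens(x), -3, 3, n = 1000, add = TRUE, lty = 2)
curve(emom_dens(x), -3, 3, n = 1000, add = TRUE, lty = 3)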
2 Non-local priors for testing and estimation

In the fundamental Bayesian paradigm the prior depends only on one's knowledge (or lack thereof), but in practice the analysis goals may affect the prior choice. For instance, different priors are often used for estimation and model selection (although see Bernardo [2010] and references therein for an alternative decision-theoretic approach). While this practice deviates from the basic paradigm, it can be argued that some preferences are hard to formalize in the utility function and may be easier to account for in the prior. For instance, in high dimensions point masses may simplify interpretation and computation, which could otherwise be unfeasible.

Suppose we wish to both estimate θ ∈ R and test H_0: θ = 0 vs. H_1: θ ≠ 0, and that the analyst is comfortable with specifying a (possibly vague) prior for the estimation problem. As an example, the gray line in Figure 1 shows a Cauchy(0, 0.25) prior expressing confidence that θ is close to 0, e.g. P(|θ| > 0.25) = 0.5 and P(|θ| < 3) = 0.947. Setting a prior for the testing problem is more challenging, as the analyst wishes to assign positive prior probability to H_0. To be consistent, she aims to preserve the estimation prior as much as possible. This could indicate a belief that θ = 0 exactly, or simply reflect that values |θ| < λ are irrelevant, where λ is a practical significance threshold. She sets λ = 0.25 and combines a point mass at 0 with a Cauchy(0, 0.25) truncated to exclude (−0.25, 0.25), each with weight 0.5. The black line in Figure 1 (top) shows the resulting prior. It assigns the same P(|θ| > θ_0) as the estimation prior for any θ_0 ≥ 0.25 and concentrates all probability in (−0.25, 0.25) at θ = 0.

The intuitive appeal of truncated priors for Bayesian tests has been argued before [Verdinelli and Wasserman, 1996, Rousseau, 2010, Klugkist and Hoijtink, 2007]. They encourage coherency between estimation and testing. As important limitations, they require a practical significance threshold λ, and even as n → ∞ there is no chance of detecting small but non-zero effects. In fact, most Bayesian approaches to hypothesis testing combine point masses with untruncated priors. These include Jeffreys-Zellner-Siow priors [Jeffreys, 1961, Zellner and Siow, 1980, 1984], g and hyper-g priors [Zellner, 1986, Liang et al., 2008], unit information priors [Kass and Wasserman, 1995] and conventional priors [Bayarri and Garcia-Donato, 2007]. Although not strictly necessary, most objective Bayes approaches are also based on untruncated priors (e.g. O'Hagan [1995], Berger and Pericchi [1996], Moreno et al. [1998], Berger and Pericchi [2001], Pérez and Berger [2002]).

Figure 1: Marginal priors for θ ∈ R (estimation prior Cauchy(0, 0.0625) shown in grey). Top: mixture of point mass at 0 and Cauchy(0, 0.0625) truncated at λ = 0.25; Middle: same with untruncated Cauchy(0, 0.0625); Bottom: same as top with λ ∼ IG(3, 10).

Figure 1 (middle) shows an untruncated Cauchy(0, 0.25) combined with a point mass at 0. It is substantially more concentrated around 0 than the original estimation prior, e.g. P(|θ| > 0.25) decreased from 0.5 to 0.25. Setting a larger scale parameter for the Cauchy would not fix the issue, as the Cauchy mode would remain at 0.
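The tail probabilities quoted in this example are easy to check numerically. A minimal base-R verification, using the Cauchy(0, 0.25) scale parameterization as in the text:

# Estimation prior: Cauchy(0, 0.25)
2 * pcauchy(-0.25, location = 0, scale = 0.25)       # P(|theta| > 0.25) = 0.5
1 - 2 * pcauchy(-3, location = 0, scale = 0.25)      # P(|theta| < 3) approx 0.947
# Figure 1 (middle): point mass at 0 (weight 0.5) + untruncated Cauchy(0, 0.25) (weight 0.5)
0.5 * (2 * pcauchy(-0.25, location = 0, scale = 0.25))   # P(|theta| > 0.25) drops to 0.25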
We view this discrepancy between estimation and testing priors as troublesome, as their underlying beliefs cannot be easily reconciled. Suppose that the analyst goes back to the truncated Cauchy but now expresses her uncertainty in the truncation point by placing a prior λ ∼ G(2.5, 10), so that E(λ) = 0.25. Figure 1 (bottom) shows the marginal prior on θ, obtained by integrating out λ. The prior under H_1 is a smooth version of the truncated Cauchy that goes to 0 as θ → 0, hence it is a NLP (Definition 1). Relative to the estimation prior, most of the probability assigned to θ ≈ 0 is absorbed by the point mass, and P(|θ| > θ_0) is roughly preserved for θ_0 > 0.5. In contrast with the truncated Cauchy, it avoids setting a fixed λ and aims to detect any θ ≠ 0. The example is meant to show a case where a NLP arises as a mixture of truncated distributions, and that this builds NLPs from first principles. We formalize this intuition in the following section.

3 Non-local priors as truncation mixtures

We establish the correspondence between NLPs and mixtures of truncated distributions. In Section 3.1 we show that the two classes are essentially equivalent, and in Section 3.2 we study how the nature of the mixture determines some relevant NLP characteristics. Our subsequent discussion is conditional on a given model M_k. Therefore, we simplify notation letting π(θ) = π(θ | M_k) and refer to dim(Θ_k) as p. We note that any NLP density under M_k can be written as π(θ) ∝ d(θ)π_u(θ), where the penalty d(θ) → 0 as θ → θ_0 for any θ_0 ∈ Θ_{k′} ⊂ Θ_k and π_u(θ) is an arbitrary prior. To ensure that π(θ) is proper we assume ∫ d(θ)π_u(θ)dθ < ∞. NLPs are often expressed in this form (e.g. (1) or (3)), but the representation is always possible: a NLP density π(θ) can be written as π(θ) = [π(θ)/π_u(θ)] π_u(θ) = d(θ)π_u(θ), where π_u(θ) is any local prior density and d(θ) = π(θ)/π_u(θ).

3.1 Equivalence between NLPs and truncation mixtures

We first prove that truncation mixtures define valid NLPs, and subsequently that any NLP may be represented via truncation schemes. Given that the representation is not unique, we give two constructive approaches and discuss their relative merits. Let π_u(θ) be an arbitrary prior on θ (typically, a local prior) and let λ ∈ R_+ be a latent truncation point. Define the conditional prior π(θ | λ) ∝ π_u(θ)I(d(θ) > λ), where as before d(θ) → 0 as θ → θ_0 ∈ Θ_{k′} ⊂ Θ_k, and let π(λ) be a marginal prior for λ. The following proposition holds.

Proposition 1. Define π(θ | λ) ∝ π_u(θ)I(d(θ) > λ), assume that π(λ) places no probability mass at λ = 0 and that π_u(θ) is bounded in a neighbourhood around any θ_0 ∈ Θ_{k′} ⊂ Θ_k. Then π(θ) = ∫ π(θ | λ)π(λ)dλ defines a non-local prior.

Proof. See Appendix A.1.

As a corollary, whenever d(θ) can be expressed as the product of independent penalties d_i(θ_i), NLPs may also be induced with multiple latent truncation variables. This alternative representation can be convenient for sampling (as illustrated later on) or to avoid the marginal dependency between elements in θ induced by a common truncation.

Corollary 1. Define π_u(θ) as in Proposition 1 with d(θ) = ∏_{i=1}^p d_i(θ_i). Let the latent truncations λ = (λ_1, ..., λ_p)′ ∈ R_+^p have an absolutely continuous prior π(λ). Then π(θ | λ) ∝ π_u(θ) ∏_{i=1}^p I(d_i(θ_i) > λ_i) defines a non-local prior.

Proof. Replace I(d(θ) > λ) by ∏_{i=1}^p I(d_i(θ_i) > λ_i) in the proof of Proposition 1. Letting any λ_i go to 0 and applying the same argument delivers the result.
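Proposition 1 is easy to illustrate by simulation. The following base-R sketch uses purely illustrative choices (π_u = N(0, 1), d(θ) = θ² and λ ∼ Exp(1), none of which are prescribed by the proposition): it draws λ from its prior and then θ from the truncated π_u by rejection, and the resulting marginal prior visibly vanishes at the origin.

# Proposition 1 by simulation: a truncation mixture yields a prior that vanishes at 0.
# Illustrative choices only: pi_u = N(0,1), d(theta) = theta^2, lambda ~ Exp(1) (no mass at 0).
set.seed(1)
B <- 20000
lambda <- rexp(B, rate = 1)
theta <- numeric(B)
for (b in seq_len(B)) {
  repeat {                              # rejection: draw from pi_u until d(theta) > lambda
    z <- rnorm(1)
    if (z^2 > lambda[b]) { theta[b] <- z; break }
  }
}
hist(theta, breaks = 200, freq = FALSE, xlab = "theta",
     main = "Marginal prior from the truncation mixture")
# The histogram dips to 0 at the origin, as Proposition 1 guarantees.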
Example 1. Consider the linear regression model y ∼ N(Xθ, σ²I), where y = (y_1, ..., y_n)′, θ ∈ R^p, σ² is known and I is the n × n identity matrix. Variable selection is conceptualized as the vanishing of any component θ_i of θ (i = 1, ..., p). Accordingly, we define a NLP for θ that penalizes θ_i → 0 with a single truncation point as in Proposition 1, e.g. π(θ | λ) ∝ N(θ; 0, τI) I(∏_{i=1}^p θ_i² > λ). To complete the prior specification we choose some specific form for π(λ), e.g. Gamma or Inverse Gamma. Obviously, the choice of π(λ) affects the properties of the marginal prior π(θ); we study this issue in Section 3.2. An alternative prior based on Corollary 1 is π(θ | λ_1, ..., λ_p) ∝ N(θ; 0, τI) ∏_{i=1}^p I(θ_i² > λ_i). This prior results in marginal independence as long as the prior on (λ_1, ..., λ_p) has independent components.

We turn attention to the complementary question: is it possible to represent any given NLP with latent truncations? The following proposition proves that such a representation can always be achieved with a single truncation variable λ and an adequate prior choice π(λ).

Proposition 2. Let π(θ) ∝ d(θ)π_u(θ) be an arbitrary non-local prior and denote by h(λ) = P_u(d(θ) > λ), where P_u(·) is the probability under π_u. Then π(θ) is the marginal prior associated to π(θ | λ) ∝ π_u(θ)I(d(θ) > λ) and

π(λ) = h(λ) / E_u(d(θ)) ∝ h(λ),

where E_u(·) is the expectation with respect to π_u(θ).

Proof. See Appendix A.2.

A corollary is that NLPs with product penalties d(θ) = ∏_{i=1}^p d_i(θ_i) can also be represented with multiple truncation variables.

Corollary 2. Let π(θ) ∝ π_u(θ) ∏_{i=1}^p d_i(θ_i) be a non-local prior, denote by h(λ) = P_u(d_1(θ_1) > λ_1, ..., d_p(θ_p) > λ_p) and assume that h(λ) integrates to a finite constant. Then π(θ) is the marginal prior associated to π(θ | λ) ∝ π_u(θ) ∏_{i=1}^p I(d_i(θ_i) > λ_i) and π(λ) ∝ h(λ).

Proof. See Appendix A.3.

The main advantage of Corollary 2 is that, in spite of introducing additional latent variables, the representation greatly facilitates sampling. The technical condition that h(λ) has a finite integral is guaranteed when π_u(θ) has independent components, as the result then follows by applying Proposition 2 to each univariate marginal. We illustrate this issue in an example where we seek to sample from the prior. Section 4 discusses the advantages for posterior sampling.

Example 2. Consider the product MOM prior in (1), where τ is the prior dispersion. Setting d(θ) = ∏_{i=1}^p θ_i² and π_u(θ) = N(θ; 0, τI) in Proposition 2, (1) can be represented using π(θ | λ) ∝ N(θ; 0, τI) I(∏_{i=1}^p θ_i² > λ) and the following marginal prior on λ:

π(λ) = P_u(∏_{i=1}^p θ_i²/τ > λ/τ^p) / E_u(∏_{i=1}^p θ_i²) = h(λ/τ^p) / τ^p,
where h(·) is the survival function for a product of independent chi-square random variables with 1 degree of freedom (Springer and Thompson [1970] give expressions for h(·)). Using this representation, prior draws are obtained as follows:

1. Draw u ∼ Unif(0, 1). Set λ = P^{-1}(u), where P(u) = P_π(λ ≤ u) is the cdf associated to π(λ).

2. Draw θ ∼ N(0, τI) I(d(θ) > λ).

As important drawbacks, P(u) requires Meijer G-functions and is cumbersome to evaluate for large p. Furthermore, sampling from a multivariate Normal with truncation region ∏_{i=1}^p θ_i² > λ is also non-trivial.

As an alternative representation, given the product penalty ∏_{i=1}^p θ_i², we use Corollary 2 and sample from the univariate marginals. Let P(u) = P(λ_i < u) be the cdf associated to π(λ_i) = h(λ_i/τ)/τ, where h(·) is the survival function of a χ²_1 distribution. For i = 1, ..., p do:

1. Draw u ∼ Unif(0, 1) and set λ_i = P^{-1}(u).

2. Draw θ_i ∼ N(0, τ) I(θ_i² > λ_i).

The function P^{-1}(·) can be tabulated and quickly evaluated, rendering the approach computationally efficient. Figure 2 shows 100,000 draws from univariate (left) and bivariate (right) product MOM priors with τ = 5.

Figure 2: 10,000 independent MOM prior draws (τ = 5). Lines indicate true density.

3.2 Deriving NLP properties for a given mixture

Our results so far prove a correspondence between NLPs and mixtures of truncated distributions. We now establish how the two most important characteristics of a NLP functional form, namely the penalty d(θ) and its tail behavior, depend on a given truncation scheme. It is necessary to distinguish whether a single or multiple truncation variables are used. For a common truncation variable λ, shared across θ_1, ..., θ_p, the following holds.

Proposition 3. Let π(θ) be the marginal non-local prior for π(θ, λ) = [π_u(θ)/h(λ)] ∏_{i=1}^p I(d(θ_i) > λ) π(λ), where h(λ) = P_u(d(θ_1) > λ, ..., d(θ_p) > λ) and π(λ) is absolutely continuous with respect to the Lebesgue measure on R_+. Denoting d_min = min{d(θ_1), ..., d(θ_p)}, the following hold:

1. As d_min → 0, π(θ) ∝ π_u(θ) d_min π(λ*), where λ* ∈ (0, d_min). If π(λ) ∝ 1 as λ → 0+ (including the case π(λ) ∝ h(λ)) then π(θ) ∝ π_u(θ) d_min.

2. As d_min → ∞, π(θ) has tails at least as thick as π_u(θ). In particular, if ∫ [π(λ)/h(λ)] dλ < ∞ then π(θ) ∝ π_u(θ).

Proof. See Appendix A.4.

In words, the non-local penalty is given by d_min π(d_min), i.e. it only depends on the smallest individual penalty among d(θ_1), ..., d(θ_p). This property is important as the asymptotic Bayes factor rates depend on the form of the penalty. Proposition 3 also indicates that π(θ) inherits its tail behavior from π_u(θ), which can be important for parameter estimation and to avoid finite-sample inconsistencies [Liang et al., 2008]. Corollary 3 extends the results to multiple truncation points.

Corollary 3. Let λ = (λ_1, ..., λ_p)′ be continuous positive variables and π(θ) be the marginal non-local prior for π(θ, λ) = [π_u(θ)/h(λ)] ∏_{i=1}^p I(d_i(θ_i) > λ_i) π(λ), where h(λ) = P_u(d_1(θ_1) > λ_1, ..., d_p(θ_p) > λ_p).
The following hold:

1. As d_i(θ_i) → 0 for i = 1, ..., p, π(θ) ∝ π_u(θ) ∏_{i=1}^p d_i(θ_i) π(λ_i*), where π(λ_i*) is the marginal prior for λ_i at λ_i* ∈ (0, d(θ_i)).

2. As d_i(θ_i) → ∞ for i = 1, ..., p, π(θ) has tails at least as thick as π_u(θ). In particular, if E[h(λ)^{-1}] < ∞ under the prior on λ, then π(θ) ∝ π_u(θ).

Proof. See Appendix A.5.

In words, multiple truncation variables induce a multiplicative non-local penalty. This implies that assigning λ_i ∼ π(λ) for i = 1, ..., p induces stronger parsimony than a common λ ∼ π(λ). As an example, with multiple λ_i it may suffice that each d_i(θ_i)π(λ_i) induces a quadratic penalty, but with a single λ an exponential penalty might be preferable.

4 Posterior sampling

We use the latent truncation characterization to derive posterior sampling algorithms, and show how setting the truncation mixture as in Proposition 2 and Corollary 2 leads to convenient simplifications. Section 4.1 provides two generic Gibbs algorithms to sample from arbitrary posteriors, and Section 4.2 adapts the algorithm to linear models under the product MOM, iMOM and eMOM priors (1)-(3). As before, sampling is implicitly conditional on a given model M_k and we drop M_k to keep notation simple.

4.1 General algorithm

First consider a NLP defined by a single latent truncation, i.e. π(θ | λ) = π_u(θ)I(d(θ) > λ)/h(λ), where h(λ) = P_u(d(θ) > λ) and π(λ) is an arbitrary prior on λ ∈ R_+. The joint posterior can be expressed as

π(θ, λ | y) ∝ f(y | θ) [π_u(θ)I(d(θ) > λ)/h(λ)] π(λ).   (4)

Sampling from π(θ | y) directly is challenging as it has 2^p modes. However, straightforward algebra gives the following k-th Gibbs iteration to sample from π(θ, λ | y).

Algorithm 1. Gibbs sampling with a single truncation

1. Draw λ^(k) ∼ π(λ | y, θ^(k−1)) ∝ I(d(θ^(k−1)) > λ) π(λ)/h(λ). When π(λ) ∝ h(λ) as in Proposition 2, λ^(k) ∼ Unif(0, d(θ^(k−1))).

2. Draw θ^(k) ∼ π(θ | y, λ^(k)) ∝ π_u(θ | y) I(d(θ) > λ^(k)).

That is, λ^(k) is sampled from a univariate distribution that reduces to a uniform when setting π(λ) ∝ h(λ), and θ^(k) from a truncated version of π_u(·). For instance, π_u(·) may be a local prior that allows easy posterior sampling. As a difficulty, the truncation region {θ : d(θ) > λ^(k)} is non-linear and non-convex, so that jointly sampling θ = (θ_1, ..., θ_p) may be challenging. One may apply a Gibbs step to each element in θ_1, ..., θ_p sequentially, which only requires univariate truncated draws from π_u(·), but the mixing of the chain may suffer.

The multiple truncation representation in Corollary 2 provides a convenient alternative. Consider π(θ | λ) = π_u(θ) ∏_{i=1}^p I(d_i(θ_i) > λ_i)/h(λ), where h(λ) = P_u(d_1(θ_1) > λ_1, ..., d_p(θ_p) > λ_p) and π(λ) is an arbitrary prior. The following steps define the k-th Gibbs sampling iteration:

Algorithm 2. Gibbs sampling with multiple truncations

1. Draw λ^(k) ∼ π(λ | y, θ^(k−1)) ∝ ∏_{i=1}^p I(d_i(θ_i^(k−1)) > λ_i) π(λ)/h(λ). If π(λ) ∝ h(λ) as in Corollary 2, λ_i^(k) ∼ Unif(0, d_i(θ_i^(k−1))) independently.

2. Draw θ^(k) ∼ π(θ | y, λ^(k)) ∝ π_u(θ | y) ∏_{i=1}^p I(d_i(θ_i) > λ_i^(k)).

Now the truncation region in Step 2 is defined by hyper-rectangles, which facilitates sampling. As in Algorithm 1, by setting the prior conveniently Step 1 avoids evaluating π(λ) and h(λ).

4.2 Linear models

We adapt Algorithm 2 to a Normal linear regression y ∼ N(Xθ, φI) with unknown variance φ and the three priors in (1)-(3).
We set the prior φ ∼ IG(a_φ/2, b_φ/2) and let τ be a user-specified prior dispersion. To set a hyperprior on τ or determine it in a data-adaptive manner see Rossell et al. [2013], and for objective Bayes alternatives see e.g. Consonni and La Rocca [2010]. For all three priors, Step 2 in Algorithm 2 samples from a multivariate Normal with rectangular truncation around 0, for which we developed an efficient algorithm. Kotecha and Djuric [1999] and Rodriguez-Yam et al. [2004] proposed Gibbs-after-orthogonalization strategies that result in low serial correlation, and Wilhelm and Manjunath [2010] implemented the approach in the R package tmvtnorm under restrictions of the type l ≤ θ_i ≤ u. Here we require sampling under d_i(θ_i) ≥ l, which defines a non-convex region. Our adapted algorithm is in Appendix A.6 and implemented in our R package mombf. An important property is that the algorithm produces independent samples when the posterior probability of the truncation region becomes negligible. Since NLPs only assign high posterior probability to a model when the posterior for non-zero coefficients is well shifted from the origin, the truncation region is indeed often negligible. We outline the full algorithm separately for each prior.

4.2.1 Product MOM prior

Straightforward algebra shows that the full conditional posteriors are

π(θ | φ, y) ∝ (∏_{i=1}^p θ_i²) N(θ; m, φS^{-1}),
π(φ | θ, y) = IG((a_φ + n + 3p)/2, (b_φ + s_R² + θ′θ/τ)/2),   (5)

where S = X′X + τ^{-1}I, m = S^{-1}X′y and s_R² = (y − Xθ)′(y − Xθ) is the sum of squared residuals. Corollary 2 represents the product MOM prior in (1) as

π(θ | φ, λ) = N(θ; 0, τφI) ∏_{i=1}^p I(θ_i²/(τφ) > λ_i) / h(λ_i),   (6)

marginalized with respect to π(λ_i) = h(λ_i) = P(θ_i²/(τφ) > λ_i | φ), where h(·) is the survival function of a chi-square with 1 degree of freedom. Algorithm 2 and straightforward algebra give the k-th Gibbs iteration as

1. Draw φ^(k) ∼ IG((a_φ + n + 3p)/2, (b_φ + s_R² + (θ^(k−1))′θ^(k−1)/τ)/2).

2. Draw λ^(k) ∼ π(λ | θ^(k−1), φ^(k), y) ∝ ∏_{i=1}^p I((θ_i^(k−1))²/(τφ^(k)) > λ_i), i.e. λ_i^(k) ∼ Unif(0, (θ_i^(k−1))²/(τφ^(k))).

3. Draw θ^(k) ∼ π(θ | λ^(k), φ^(k), y) ∝ N(θ; m, φ^(k)S^{-1}) ∏_{i=1}^p I(θ_i²/(τφ^(k)) > λ_i^(k)).

Step 1 samples unconditionally on λ, so that no efficiency is lost by introducing these latent variables. Step 3 requires truncated multivariate Normal draws.

4.2.2 Product iMOM prior

We focus on models with fewer than n parameters. The full conditional posteriors are

π(θ | φ, y) ∝ N(θ; m, φS^{-1}) ∏_{i=1}^p [√(τφ)/θ_i²] e^{−τφ/θ_i²},
π(φ | θ, y) = e^{−τφ Σ_{i=1}^p θ_i^{-2}} IG(φ; (a_φ + n − p)/2, (b_φ + s_R²)/2),   (7)

where S = X′X, m = S^{-1}X′y and s_R² = (y − Xθ)′(y − Xθ). Now, the iMOM prior in (2) can be written as

π_I(θ | φ) = N(θ; 0, τ_N φI) ∏_{i=1}^p [√(τφ) e^{−τφ/θ_i²}] / [√π θ_i² N(θ_i; 0, τ_N φ)] = N(θ; 0, τ_N φI) ∏_{i=1}^p d(θ_i, φ).   (8)

While in principle any value of τ_N may be used, setting τ_N ≥ 2τ guarantees d(θ_i, φ) to be monotone increasing in θ_i², so that its inverse exists (Appendix A.7). By default we set τ_N = 2τ. Following Corollary 2 we represent (8) using latent variables λ, i.e.

π(θ | φ, λ) = N(θ; 0, τ_N φI) ∏_{i=1}^p I(d(θ_i, φ) > λ_i) / h(λ_i)   (9)

and π(λ) = ∏_{i=1}^p h(λ_i), where h(λ_i) = P(d(θ_i, φ) > λ_i), which we need not evaluate. Algorithm 2 gives the following MH within Gibbs procedure.

1. MH step:

(a) Propose φ* ∼ IG(φ; (a_φ + n − p)/2, (b_φ + s_R²)/2).

(b) Set φ^(k) = φ* with probability min{1, e^{(φ^(k−1) − φ*) τ Σ_{i=1}^p θ_i^{-2}}}, else φ^(k) = φ^(k−1).

2. Draw λ_i^(k) ∼ Unif(0, d(θ_i^(k−1), φ^(k))) for i = 1, ..., p.
3. Draw θ^(k) ∼ N(θ; m, φ^(k)S^{-1}) ∏_{i=1}^p I(d(θ_i, φ^(k)) > λ_i^(k)).

Step 3 requires the inverse d^{-1}(·), which can be evaluated efficiently by combining an asymptotic approximation with a linear interpolation search (Appendix A.7). As an indication, 10,000 draws for p = 2 variables required 0.58 seconds on a 2.8 GHz processor running OS X 10.6.8.

4.2.3 Product eMOM prior

The full conditional posteriors are

π(θ | φ, y) ∝ (∏_{i=1}^p e^{−τφ/θ_i²}) N(θ; m, φS^{-1}),
π(φ | θ, y) ∝ e^{−τφ Σ_{i=1}^p θ_i^{-2}} IG(φ; a*/2, b*/2),   (10)

where S = X′X + τ^{-1}I, m = S^{-1}X′y, a* = a_φ + n + p, b* = b_φ + s_R² + θ′θ/τ and s_R² = (y − Xθ)′(y − Xθ). Corollary 2 represents the product eMOM prior π_E(θ) in (3) as

π(θ | φ, λ) = N(θ; 0, τφI) ∏_{i=1}^p I(e^{√2 − τφ/θ_i²} > λ_i) / h(λ_i),   (11)

marginalized with respect to π(λ_i) = h(λ_i) = P(e^{√2 − τφ/θ_i²} > λ_i | φ). Again h(λ_i) has no simple form, but it is not required by Algorithm 2, which gives the k-th Gibbs iteration

1. Draw φ^(k) ∼ e^{−τφ Σ_{i=1}^p θ_i^{-2}} IG(φ; a*/2, b*/2) via a MH step:

(a) Propose φ* ∼ IG(φ; a*/2, b*/2).

(b) Set φ^(k) = φ* with probability min{1, e^{(φ^(k−1) − φ*) τ Σ_{i=1}^p θ_i^{-2}}}, else φ^(k) = φ^(k−1).

2. Draw λ_i^(k) ∼ Unif(0, e^{√2 − τφ^(k)/(θ_i^(k−1))²}) for i = 1, ..., p.

3. Draw θ^(k) ∼ N(θ; m, φ^(k)S^{-1}) ∏_{i=1}^p I(−τφ^(k)/θ_i² > log(λ_i^(k)) − √2), i.e. restricted to the region where e^{√2 − τφ^(k)/θ_i²} > λ_i^(k).

5 Examples

We assess the performance of the posterior sampling algorithms proposed in Section 4, as well as the use of NLPs for high-dimensional parameter estimation. Section 5.1 shows a bivariate example designed to illustrate the main posterior sampling issues. Section 5.2 studies p ≥ n cases and compares the BMA estimators induced by NLPs with benchmark [Fernández et al., 2001] and hyper-g priors [Liang et al., 2008], SCAD [Fan and Li, 2001] and LASSO [Tibshirani, 1996]. In all examples we use the default prior dispersions τ = 0.358, 0.133, 0.119 for the product MOM, iMOM and eMOM priors (respectively), all of which assign 0.01 prior probability to |θ_i/√φ| < 0.2 [Johnson and Rossell, 2010], and φ ∼ IG(0.01/2, 0.01/2). We set P(M_k) = 0 whenever dim(Θ_k) > n, and adapt the Gibbs search on the model space proposed in Johnson and Rossell [2012] to never visit those models. We assess the relative merits attained by each method without the help of any procedures to pre-screen covariates.

5.1 Posterior samples for a given model

We simulate 1,000 realizations from y_i ∼ N(θ_1 x_{1i} + θ_2 x_{2i}, 1), where (x_{1i}, x_{2i}) are drawn from a bivariate Normal with E(x_{1i}) = E(x_{2i}) = 0, V(x_{1i}) = V(x_{2i}) = 2, Cov(x_{1i}, x_{2i}) = 1. We first consider a scenario where both variables have non-zero coefficients θ_1 = 0.5, θ_2 = 1, and compute posterior probabilities for the four possible models. We assign equal a priori probabilities and obtain exact integrated likelihoods using the functions pmomMarginalU, pimomMarginalU and pemomMarginalU in the mombf package (the former is available in closed form; for the latter two we used 10^6 importance samples). The posterior probability assigned to the full model under all three priors is 1 (up to rounding) (Table 1).

Table 1: Posterior probabilities

                       MOM        iMOM       eMOM
θ_1 = 0.5, θ_2 = 1
θ_1 = 0, θ_2 = 0       0          0          0
θ_1 = 0, θ_2 ≠ 0       2.8e-78    2.72e-78   6.86e-79
θ_1 ≠ 0, θ_2 = 0       1.95e-191  3.82e-191  5.90e-191
θ_1 ≠ 0, θ_2 ≠ 0       1          1          1
θ_1 = 0, θ_2 = 1
θ_1 = 0, θ_2 = 0       1.69e-225  4.39e-225  1.08e-224
θ_1 = 0, θ_2 ≠ 0       0.999      1          1
θ_1 ≠ 0, θ_2 = 0       1.82e-193  1.64e-192  6.80e-192
θ_1 ≠ 0, θ_2 ≠ 0       8.83e-05   3.30e-09   3.17e-09
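To make Section 4.2.1 concrete in this bivariate setting, the following self-contained R sketch implements the product MOM Gibbs sampler (steps 1-3) for a fixed model. It is an illustrative sketch rather than the mombf implementation: the orthogonalized multivariate update of Appendix A.6 is replaced by a simpler coordinate-wise truncated-Normal update (helper rtnorm.out is ours), and the settings (n = 1000, τ = 0.358, a_φ = b_φ = 0.01) mirror the simulation above.

# Product MOM Gibbs sampler of Section 4.2.1 for a fixed model: illustrative sketch.
# Coordinate-wise truncated-Normal updates replace the orthogonalized update of Appendix A.6.
set.seed(1)
n <- 1000; th.true <- c(0.5, 1); tau <- 0.358; a.phi <- 0.01; b.phi <- 0.01
Sigma <- matrix(c(2, 1, 1, 2), 2, 2)
X <- matrix(rnorm(n * 2), n, 2) %*% chol(Sigma)   # covariates with variance 2, covariance 1
y <- as.vector(X %*% th.true + rnorm(n))
p <- ncol(X)
S <- crossprod(X) + diag(p) / tau                 # S = X'X + I/tau
m <- as.vector(solve(S, crossprod(X, y)))         # m = S^{-1} X'y

# Draw from N(mu, sd^2) restricted to the exterior region |x| > b
rtnorm.out <- function(mu, sd, b) {
  pL <- pnorm(-b, mu, sd); pU <- 1 - pnorm(b, mu, sd)
  u <- runif(1, 0, pL + pU)
  if (u <= pL) qnorm(u, mu, sd) else qnorm(pnorm(b, mu, sd) + (u - pL), mu, sd)
}

B <- 1000; theta <- m; draws <- matrix(NA, B, p + 1)
for (k in 1:B) {
  # Step 1: phi | theta, y ~ IG((a+n+3p)/2, (b + s2R + theta'theta/tau)/2)
  s2R <- sum((y - X %*% theta)^2)
  phi <- 1 / rgamma(1, shape = (a.phi + n + 3 * p) / 2,
                    rate = (b.phi + s2R + sum(theta^2) / tau) / 2)
  # Step 2: lambda_i | theta, phi ~ Unif(0, theta_i^2 / (tau * phi))
  lambda <- runif(p, 0, theta^2 / (tau * phi))
  # Step 3: theta | lambda, phi, y is N(m, phi * S^{-1}) truncated to theta_i^2 > tau*phi*lambda_i;
  # update each coordinate from its conditional Normal, truncated to the exterior region
  for (i in 1:p) {
    mu.i <- m[i] - sum(S[i, -i] * (theta[-i] - m[-i])) / S[i, i]
    sd.i <- sqrt(phi / S[i, i])
    theta[i] <- rtnorm.out(mu.i, sd.i, sqrt(tau * phi * lambda[i]))
  }
  draws[k, ] <- c(theta, phi)
}
colMeans(draws[-(1:100), ])   # posterior means of (theta_1, theta_2, phi) after burn-in

Because the posterior for non-zero coefficients sits far from the origin, the truncation is rarely active and the sketch behaves essentially like an untruncated Gibbs sampler, in line with the low serial correlations reported below.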
Figure 3 (left) shows 900 Gibbs draws (100 burn-in) obtained under the full model. The posterior mass is well shifted away from 0 and resembles an elliptical shape for the three priors, as expected.
Table 2 gives the first-order auto-correlations, which are very small. This example reflects the advantages of the orthogonalization strategy, which is particularly efficient as the latent truncation becomes negligible.

Figure 3: 900 Gibbs draws after a 100 burn-in when θ = (0.5, 1)′ (left) and θ = (0, 1)′ (right). Contours indicate posterior density. Top: MOM (τ = 0.358); Middle: iMOM (τ = 0.133); Bottom: eMOM (τ = 0.119).

Table 2: Serial correlation in Gibbs sampling algorithm

                     MOM      iMOM     eMOM
θ_1 = 0.5, θ_2 = 1
θ_1                  0.096    0.110    0.018
θ_2                  0.034    0.134    0.019
φ                   -0.016    0.069    0.027
θ_1 = 0, θ_2 = 1
θ_1                  0.115    0.032    0.049
θ_2                  0.134    0.122    0.042
φ                   -0.040    0.327    0.353

We now set θ_1 = 0, θ_2 = 1 and keep n = 1000 and (x_{1i}, x_{2i}) as before. We simulated several data sets and in most cases did not observe a noticeable multi-modality in the posterior density. We portray a specific simulation that did exhibit multi-modality, as this poses a greater challenge from a sampling perspective.
Table 1 shows that the data-generating model adequately concentrated the posterior mass. Although the full model was clearly dismissed in light of the data, as an exercise we drew from its posterior. Figure 3 (right) shows 900 Gibbs draws after a 100 burn-in, and Table 2 indicates the auto-correlation. The sampled values adequately captured the multi-modal, non-elliptical posterior.

5.2 High-dimensional estimation

We perform a simulation study with n = 100 and growing dimensionality p = 100, 500, 1000. We set θ_i = 0 for i = 1, ..., p − 5, set the remaining 5 coefficients to (0.6, 1.2, 1.8, 2.4, 3) and consider residual variances φ = 1, 4. Covariates were sampled from x ∼ N(0, Σ), where Σ has unit variances and all pairwise correlations set to ρ = 0 or ρ = 0.25. We remark that these ρ are population correlations; the maximum absolute sample correlations when ρ = 0 were 0.37, 0.44, 0.47 for p = 100, 500, 1000 (respectively), and 0.54, 0.60, 0.62 when ρ = 0.25. We simulated 1,000 data sets under each setup.

Parameter estimates θ̂ were obtained via Bayesian model averaging. Let δ be the model indicator. For each simulated data set, we performed 1,000 full Gibbs iterations (100 burn-in), which are equivalent to 1,000 × p birth-death moves. These provided approximate posterior samples δ_1, ..., δ_1000 from P(δ | y). We estimated model probabilities from the proportion of MCMC visits and obtained posterior draws for θ using the algorithms in Section 4.2. For benchmark (BP) and hyper-g (HG) priors we used the same strategy for δ and drew θ from their corresponding posteriors. For comparison, we also used SCAD and LASSO to estimate θ (penalty parameter set via 10-fold cross-validation).

Figure 4 shows the overall Sum of Squared Errors (SSE), Σ_{i=1}^p (θ̂_i − θ_i)², averaged across simulations for scenarios with φ = 1, 4 and ρ = 0, 0.25.
MOM and iMOM perform similarly and show an SSE between 1.15 and 10 times lower than that of the other methods in all scenarios. As p grows, the differences between methods become larger. To gain more insight into how the lower SSE is achieved, Figure 5 shows the SSE separately for θi = 0 (left) and θi ≠ 0 (right). The largest differences between methods were observed for the zero coefficients, where the SSE remains very stable as p grows for MOM and iMOM and is smaller for the former. SCAD appears particularly sensitive to increasing p. For non-zero coefficients the differences in SSE are smaller, with iMOM slightly outperforming MOM. Here the SSE tends to be largest for LASSO. Most methods show an SSE remarkably close to that of the least squares oracle estimator (Figure 5, right panels, black horizontal segments).

5.3 Gene expression data

We assess predictive performance in high-dimensional gene expression data. Calon et al. [2012] used mice experiments to identify 172 genes potentially related to the gene TGFB, and showed that these were related to colon cancer progression in an independent data set with n = 262 human patients. TGFB plays a crucial role in colon cancer, hence it is important to understand its relation to other genes. Our goal is to predict TGFB in the human data, first using only the p = 172 genes and then adding 10,000 extra genes. Both response and predictors were standardized to zero mean and unit variance (data provided as Supplementary Material). We assessed predictive performance via the leave-one-out cross-validated R2 coefficient between predictions and observations. For Bayesian methods we report the posterior expected number of variables in the model (i.e. the mean number of predictors used by BMA), and for SCAD and LASSO the number of selected variables.

Table 3: Expression data with p = 172 and p = 10,172 genes. p̄: mean number of predictors (MOM, iMOM, BP, HG) or number of selected predictors (SCAD, LASSO). R2 is the leave-one-out cross-validated coefficient between (yi, ŷi).

               p = 172             p = 10,172
            p̄       R2           p̄         R2
  MOM       4.3    0.566          6.5      0.617
  iMOM      5.3    0.560         10.3      0.620
  BP        4.2    0.562        259.0      0.014
  HG       10.5    0.559        116.6      0.419
  SCAD     28      0.560         81        0.535
  LASSO    42      0.586        159        0.570

Table 3 shows the results. When using p = 172 predictors all methods achieve a similar R2, that for LASSO being slightly higher. We note that MOM, iMOM and the benchmark prior used substantially fewer predictors. These results appear reasonable in a moderate-dimensional setting where the genes are expected to be related to TGFB. However, when using p = 10,172 predictors important differences between methods are observed. The BMA estimates based on MOM and iMOM priors remain parsimonious (6.5 and 10.3 predictors, respectively) and the cross-validated R2 increases roughly by 0.05. In contrast, for the remaining methods the number of predictors increased sharply (the smallest being 81 predictors for SCAD) and a drop in R2 was observed. Predictors with large marginal inclusion probabilities under MOM/iMOM included genes related to various cancer types (ESM1, GAS1, HIC1, CILP, ARL4C, PCGF2), TGFB regulators (FAM89B) or AOC3, which is used to alleviate certain cancer symptoms. These findings suggest that NLPs were extremely effective in detecting a parsimonious subset of predictors in this high-dimensional example. Regarding the efficiency of our proposed posterior sampler, the correlation between θ̂ from two independent chains with 1,000 iterations each was > 0.99.
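As a reference for how the predictive metric in Table 3 can be computed, the sketch below evaluates a leave-one-out cross-validated R2 (the squared correlation between held-out observations and their predictions). It uses scikit-learn's LASSO purely as a stand-in predictor; it is not the NLP/BMA machinery of the paper, and the function name is illustrative.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import LeaveOneOut, cross_val_predict

def loo_cv_r2(X, y):
    """Leave-one-out cross-validated R^2: squared correlation between each
    held-out y_i and its prediction yhat_i, as reported in Table 3."""
    model = LassoCV(cv=10)  # penalty chosen by 10-fold CV, mirroring the LASSO setup of Section 5.2
    yhat = cross_val_predict(model, X, y, cv=LeaveOneOut())
    return float(np.corrcoef(y, yhat)[0, 1] ** 2)
```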
6 Discussion

We represented NLPs as mixtures of truncated distributions, which enables their use for estimation. This one-to-one characterization defines NLPs from arbitrary local priors, which greatly reduces the analytical restrictions in choosing the prior formulation. We provided sampling algorithms for three NLP families (MOM, iMOM and eMOM) in linear regression. Beyond their computational appeal, latent truncations build NLPs from first principles.

We observed promising results. Posterior samples exhibited low serial correlation, captured multi-modalities, and independent runs of the chain delivered virtually identical parameter estimates. Also, combining NLPs with BMA delivered remarkably precise parameter estimates in p >> n cases. Applying NLPs to gene expression data resulted in extremely parsimonious predictions that achieved higher cross-validated predictive performance than competing methods. These results did not require procedures to pre-screen covariates. Johnson and Rossell [2012] showed that NLPs provide advantages in selecting the data-generating model in p = n cases. Hence, our results show that it is not only possible to use the same prior for estimation and selection, but that it may indeed be desirable. The shrinkage provided by BMA and point masses appears extremely competitive. We remark that we used default informative priors, which are relatively popular for testing but perhaps less readily adopted for estimation. Developing objective Bayes variations is an interesting avenue for future research.

The proposed construction of NLPs remains feasible beyond linear regression. In particular, inference based on MCMC in several model classes may be easily achieved by applying our results to extend available posterior simulation strategies. Promising applications include, but are not limited to, generalized linear, graphical and mixture models.

A Proofs and Miscellanea

A.1 Proof of Proposition 1

The goal is to show that for all ε > 0 there exists η > 0 such that d(θ) < η implies π(θ) < ε. By construction, the conditional prior density is π(θ | λ) = πu(θ)I(d(θ) > λ)/h(λ), where h(λ) = Pu(d(θ) > λ) = ∫ πu(θ)I(d(θ) > λ)dθ. Let θ be a value such that d(θ) < η, and express the prior density as

\[
\pi(\theta) = \int_{\lambda \le \eta} \frac{\pi_u(\theta)\, I(d(\theta) > \lambda)}{h(\lambda)}\, \pi(\lambda)\, d\lambda + \int_{\lambda > \eta} \pi(\theta \mid \lambda)\, \pi(\lambda)\, d\lambda. \qquad (12)
\]

The second term in (12) is 0, as by assumption d(θ) < η. Now, consider that for λ ≤ η, h(λ) = Pu(d(θ) > λ) is minimized at λ = η, and therefore (12) can be bounded by

\[
\pi(\theta) \le \frac{\pi_u(\theta) \int_{\lambda \le \eta} I(d(\theta) > \lambda)\, \pi(\lambda)\, d\lambda}{h(\eta)} = \frac{\pi_u(\theta)\, P(\lambda < \min\{\eta, d(\theta)\})}{h(\eta)}. \qquad (13)
\]

Notice that the numerator can be made arbitrarily small by decreasing η, since πu(θ) is bounded around θ0 and by assumption there is no prior mass at λ = 0, so that the cdf in the numerator converges to 0 as η → 0, while the denominator converges to 1 as η → 0. That is, it is possible to choose η such that π(θ) ≤ ε, which gives the result.

A.2 Proof of Proposition 2

We first note that in order for π(θ) to be proper the random variable d(θ) must have finite expectation with respect to πu(θ). Now, the marginal prior for θ is

\[
\pi(\theta) = \int \frac{\pi_u(\theta)\, I(d(\theta) > \lambda)}{P_u(d(\theta) > \lambda)}\, \pi(\lambda)\, d\lambda = \pi_u(\theta) \int_0^{d(\theta)} \frac{\pi(\lambda)}{h(\lambda)}\, d\lambda. \qquad (14)
\]

Suppose we set π(λ) ∝ h(λ), which we can do as long as ∫ h(λ)dλ < ∞. Then π(θ) ∝ πu(θ)d(θ), which proves the result. The only step left is to show that indeed ∫ h(λ)dλ < ∞. In general

\[
\int h(\lambda)\, d\lambda = \int P_u(d(\theta) > \lambda)\, d\lambda = \int S_{d(\theta)}(\lambda)\, d\lambda, \qquad (15)
\]

where S_{d(θ)}(λ) is the survival function of the positive random variable d(θ), and therefore (15) is equal to its expectation Eu(d(θ)) with respect to πu(θ), which is finite as discussed at the beginning of the proof.

A.3 Proof of Corollary 2

Analogously to the proof of Proposition 2, the marginal prior for θ is

\[
\pi(\theta) = \int \cdots \int \frac{\pi_u(\theta) \prod_{i=1}^p I(d_i(\theta_i) > \lambda_i)}{P_u(d_1(\theta_1) > \lambda_1, \ldots, d_p(\theta_p) > \lambda_p)}\, \pi(\lambda)\, d\lambda_1 \cdots d\lambda_p = \int_0^{d_1(\theta_1)} \cdots \int_0^{d_p(\theta_p)} \pi_u(\theta)\, \frac{\pi(\lambda)}{h(\lambda)}\, d\lambda_1 \cdots d\lambda_p \propto \pi_u(\theta) \prod_{i=1}^p d_i(\theta_i), \qquad (16)
\]

as by assumption π(λ) ∝ h(λ).
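As a quick numerical illustration of the construction in A.2 (not part of the original appendix), the sketch below assumes a base density πu(θ) = N(0, 1) and penalty d(θ) = θ², so that h(λ) = 2{1 − Φ(√λ)}. Drawing λ from π(λ) ∝ h(λ) and then θ from πu truncated to d(θ) > λ should yield a marginal proportional to θ² N(θ; 0, 1), whose second moment equals 3.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
S = 200_000

# pi_u = N(0,1), d(theta) = theta^2, so h(lambda) = P_u(theta^2 > lambda) = 2 * (1 - Phi(sqrt(lambda))).
# Step 1: draw lambda with density proportional to h(lambda) via a numerical inverse cdf on a grid.
grid = np.linspace(0.0, 40.0, 20_001)
h = 2.0 * norm.sf(np.sqrt(grid))
cdf = np.cumsum(h)
cdf /= cdf[-1]
lam = np.interp(rng.uniform(size=S), cdf, grid)

# Step 2: draw theta from pi_u truncated to theta^2 > lambda, i.e. |theta| > sqrt(lambda),
# by inverse-cdf sampling on the two symmetric tails.
c = np.sqrt(lam)
sign = rng.choice([-1.0, 1.0], size=S)
theta = sign * norm.isf(rng.uniform(size=S) * norm.sf(c))

# By Proposition 2 the marginal of theta is proportional to theta^2 * N(theta; 0, 1),
# a MOM-type non-local density whose second moment is 3.
print(np.mean(theta ** 2))   # should be close to 3
```

Swapping in a different πu or d(θ) in this sketch yields other non-local priors, mirroring the flexibility of the latent-truncation construction.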
A.4 Proof of Proposition 3

By definition, the marginal density is

\[
\pi(\theta) = \int \pi_u(\theta)\, \frac{\pi(\lambda)}{h(\lambda)} \prod_{i=1}^p I(d(\theta_i) > \lambda)\, d\lambda = \pi_u(\theta) \int I(\lambda < d_{\min})\, \frac{\pi(\lambda)}{h(\lambda)}\, d\lambda = \pi_u(\theta)\, P_\lambda(d_{\min}) \int \frac{1}{h(\lambda)}\, \pi(\lambda \mid \lambda < d_{\min})\, d\lambda, \qquad (17)
\]

where Pλ(dmin) = P(λ < dmin) is the cdf of λ evaluated at dmin. As dmin → 0 we have that π(λ | λ < dmin) converges to a point mass at zero, and hence the integral on the right-hand side of (17) converges to 1/h(0) = 1. To finish the proof of statement 1, notice that Pλ(dmin) ∝ dmin π(λ*) by the Mean Value Theorem, as long as Pλ(·) is differentiable and continuous around 0+, i.e. λ is a continuous random variable. To prove statement 2, notice that Pλ(dmin) → 1 as dmin → ∞ and that the integral on the right-hand side of (17) is m(dmin) = E(1/h(λ) | λ < dmin), which is increasing in dmin as h(λ) is monotone decreasing in λ. Hence, π(θ) ∝ πu(θ)m(dmin), where m(dmin) increases as dmin → ∞, i.e. π(θ) has tails at least as thick as πu(θ). Furthermore, if ∫ π(λ)/h(λ) dλ < ∞ the Monotone Convergence Theorem applies and m(dmin) converges to a finite constant, i.e. π(θ) ∝ πu(θ).

A.5 Proof of Corollary 3

Analogously to the proof of Proposition 3, the marginal density can be written as

\[
\pi(\theta) = \pi_u(\theta) \prod_{i=1}^p P_{\lambda_i}(d_i(\theta_i)) \int \cdots \int \frac{1}{h(\lambda)}\, \pi\left(\lambda \mid \lambda_1 < d_1(\theta_1), \ldots, \lambda_p < d_p(\theta_p)\right) d\lambda_1 \cdots d\lambda_p = \pi_u(\theta)\, E\left(h(\lambda)^{-1} \mid \lambda_i < d_i(\theta_i)\ \text{for all}\ i\right) \prod_{i=1}^p P_{\lambda_i}(d_i(\theta_i)), \qquad (18)
\]

where h(λ) is a multivariate survival function and is hence decreasing in λ. Thus E(h(λ)^{-1} | λi < di(θi) for all i) decreases as di(θi) → 0. To find the limit as di(θi) → 0, note that the integral is bounded by the finite integral obtained by plugging di(θi) = 1 into the integrand. Hence, the Dominated Convergence Theorem applies, the integral converges to h(0)^{-1} = 1 and (18) converges to πu(θ) ∏_{i=1}^p Pλi(di(θi)). Since λ1, . . . , λp are continuous, the Mean Value Theorem applies, so that Pλi(di(θi)) = di(θi)π(λ*i) for some λ*i ∈ (0, di(θi)). To prove statement 2, notice that Pλi(di(θi)) → 1 as di(θi) → ∞ and that E(h(λ)^{-1} | λi < di(θi) for all i) increases as di(θi) → ∞. Hence, π(θ) ∝ πu(θ)m(θ), where m(θ) increases as di(θi) → ∞, which proves statement 2. Further, if E(h(λ)^{-1}) < ∞ the Monotone Convergence Theorem applies and π(θ) ∝ πu(θ).

A.6 Multivariate Normal sampling under outer rectangular truncation

The goal is to sample θ ∼ N(µ, Σ)I(θ ∈ T) with truncation region T = {θ : θi < li or θi > ui, i = 1, . . . , p}. We generalize the Gibbs sampling of Rodriguez-Yam et al. [2004] and the importance sampling of Hajivassiliou [1993] and Keane [1993] to the non-convex region T. Let D = chol(Σ) be the Cholesky decomposition of Σ and K = D^{-1} its inverse, so that KΣK' = KDD'K' = I is the identity matrix, and define α = Kµ. The random variable Z = Kθ follows a N(α, I)I(Z ∈ S) distribution with truncation region S. Since θ = K^{-1}Z = DZ, denoting by d_{i·} the ith row of D we obtain the truncation region S = {Z : d_{i·}Z ≤ li or d_{i·}Z ≥ ui, i = 1, . . . , p}. The full conditionals for Zi given Z(−i) = (Z1, . . . , Zi−1, Zi+1, . . . , Zp) needed for Gibbs sampling follow from straightforward algebra. Denote by d_{jk} the (j, k) element in D; then Zi | Z(−i) ∼ N(αi, 1) truncated so that either d_{ji} Zi ≤ lj − Σ_{k≠i} d_{jk} Zk or d_{ji} Zi ≥ uj − Σ_{k≠i} d_{jk} Zk holds, simultaneously for j = 1, . . . , p.

We now adapt the algorithm to address the fact that this truncation region is non-convex. The region excluded from sampling can be written as S_i^c = ∪_{j=1}^p (aj, bj), where aj = (lj − Σ_{k≠i} d_{jk} Zk)/d_{ji} when d_{ji} > 0 and aj = (uj − Σ_{k≠i} d_{jk} Zk)/d_{ji} when d_{ji} < 0 (and analogously for bj). As given, S_i^c is a union of possibly non-disjoint intervals, which complicates sampling. Fortunately, it can be expressed as a union of disjoint intervals ∪_{j=1}^K (ãj, b̃j) with the following algorithm. Suppose the aj are sorted increasingly; set ã1 = a1, b̃1 = b1 and K = 1. For j = 2, . . . , p repeat the following two steps.

1. If aj > b̃K set K = K + 1, ãK = aj and b̃K = bj; else if aj ≤ b̃K and bj ≥ b̃K set b̃K = bj.

2. Set j = j + 1.

Finally, because (ã1, b̃1), . . . , (ãK, b̃K) are disjoint and increasing, we may draw a uniform number u in (0, 1) excluding the intervals (Φ(ãj), Φ(b̃j)) and set Zi = Φ^{-1}(u), where Φ(·) is the Normal(αi, 1) cdf.
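To make the update above concrete, here is a sketch (illustrative Python, not the authors' implementation) of one Gibbs sweep for Z ∼ N(α, I) subject to the outer rectangular truncation of θ = DZ. It assumes the current Z is feasible, that D is the lower-triangular Cholesky factor (so d_{ii} > 0), and that rows with d_{ji} = 0 impose no constraint on Zi.

```python
import numpy as np
from scipy.stats import norm

def gibbs_sweep_outer(Z, alpha, D, l, u, rng):
    """One Gibbs sweep for Z ~ N(alpha, I) subject to, for every j,
    (D @ Z)_j <= l_j  or  (D @ Z)_j >= u_j  (outer truncation of theta = D @ Z)."""
    p = len(Z)
    for i in range(p):
        active = np.nonzero(D[:, i])[0]               # rows whose constraint involves Z_i
        rest = D[active] @ Z - D[active, i] * Z[i]    # sum_{k != i} d_{jk} Z_k for those rows
        lo = (l[active] - rest) / D[active, i]
        hi = (u[active] - rest) / D[active, i]
        a = np.where(D[active, i] > 0, lo, hi)        # excluded interval (a_j, b_j) per constraint
        b = np.where(D[active, i] > 0, hi, lo)
        # merge possibly overlapping excluded intervals into disjoint ones (algorithm in A.6)
        order = np.argsort(a)
        merged = []
        for aj, bj in zip(a[order], b[order]):
            if merged and aj <= merged[-1][1]:
                merged[-1][1] = max(merged[-1][1], bj)
            else:
                merged.append([aj, bj])
        # draw u in (0,1) excluding the images (Phi(a_j), Phi(b_j)) under the N(alpha_i, 1) cdf
        gl = norm.cdf([m[0] for m in merged], loc=alpha[i])
        gh = norm.cdf([m[1] for m in merged], loc=alpha[i])
        lengths = np.clip(np.concatenate(([gl[0]], gl[1:] - gh[:-1], [1.0 - gh[-1]])), 0.0, None)
        starts = np.concatenate(([0.0], gh))
        k = rng.choice(len(lengths), p=lengths / lengths.sum())
        v = starts[k] + rng.uniform() * lengths[k]
        Z[i] = norm.ppf(v, loc=alpha[i])
    return Z
```

After a sweep, θ = D @ Z is a draw that respects the outer truncation; repeating sweeps gives the Gibbs sampler described in A.6.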
A.7 Monotonicity and inverse of the iMOM prior penalty

Consider the product iMOM prior as given in (8). We first study the monotonicity of the penalty d(θi, λ), which for simplicity we denote here by d(θ), and then provide an algorithm to evaluate its inverse function. Equivalently, it is convenient to consider the log-penalty

\[
\log(d(\theta)) = \frac{1}{2}\left(\log(\tau \tau_N) + 2\log(\phi) + \log(2)\right) - \log\left((\theta - \theta_0)^2\right) - \frac{\tau\phi}{(\theta - \theta_0)^2} + \frac{(\theta - \theta_0)^2}{2\tau_N\phi}, \qquad (19)
\]

as its inverse uniquely determines the inverse of d(θ). Denoting z = (θ − θ0)^2, (19) can be written as

\[
g(z) = \frac{1}{2}\left(\log(\tau \tau_N) + 2\log(\phi) + \log(2)\right) - \log(z) - \frac{\tau\phi}{z} + \frac{z}{2\tau_N\phi}. \qquad (20)
\]

To show the monotonicity of (20) we compute its derivative g'(z) = −1/z + τφ/z² + 1/(2τNφ) and show that it is positive for all z. Clearly, both when z → 0 and when z → ∞ we have positive g'(z). Hence we just need to see that there is some τN for which all roots of g'(z) are imaginary, so that g'(z) > 0 for all z. Simple algebra shows that the roots of g'(z) are z = τNφ ± τNφ √(1 − 2τ/τN), so that for τN ≤ 2τ there are no real roots. Hence, for τN ≤ 2τ, g(z) is monotone increasing.

We now provide an algorithm to evaluate the inverse. That is, given a threshold t we seek z0 such that g(z0) = t. Our strategy is to obtain an initial guess from an approximation to g(z), and then use continuity and monotonicity to bound the desired z0 and conduct a search based on linear interpolation. Inspecting the expression for g(z) in (20), we see that the term log(z) is dominated by τφ/z when z approaches 0 and by z/(2τNφ) when z is large. Hence, we approximate g(z) by dropping the log(z) term, obtaining

\[
g(z) \approx \frac{1}{2}\left(\log(\tau \tau_N) + 2\log(\phi) + \log(2)\right) - \frac{\tau\phi}{z} + \frac{z}{2\tau_N\phi}. \qquad (21)
\]

Setting (21) equal to t and solving for z gives the initial guess z0 = τNφ(−b + √(b² + 2τ/τN)), where b = (1/2)(log(τ τN) + 2 log(φ) + log(2)) − t. If g(z0) < t we set the lower bound zl = z0 and obtain an upper bound zu by successively doubling z0 until g(zu) > t. Similarly, if g(z0) > t we set the upper bound zu = z0 and find a lower bound zl by successively halving z0. Once (zl, zu) are determined, we use linear interpolation to update z0, evaluate g(z0) and update either zl or zu. The process continues until |g(z0) − t| is below some tolerance (we used 10^{-5}). In our experience the initial guess is often quite good and the algorithm converges in very few iterations.
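A short sketch of this inversion (illustrative Python; parameter names follow the notation above, and the tolerance is the one quoted in the text):

```python
import numpy as np

def g(z, tau, tau_n, phi):
    """Log-penalty g(z) in (20), with z = (theta - theta_0)^2."""
    return (0.5 * (np.log(tau * tau_n) + 2.0 * np.log(phi) + np.log(2.0))
            - np.log(z) - tau * phi / z + z / (2.0 * tau_n * phi))

def g_inverse(t, tau, tau_n, phi, tol=1e-5, max_iter=100):
    """Solve g(z0) = t: initial guess from (21), bracket by doubling/halving,
    then refine by linear interpolation (g is increasing for tau_n <= 2 * tau)."""
    b = 0.5 * (np.log(tau * tau_n) + 2.0 * np.log(phi) + np.log(2.0)) - t
    z0 = tau_n * phi * (-b + np.sqrt(b ** 2 + 2.0 * tau / tau_n))   # root of the approximation (21)
    if g(z0, tau, tau_n, phi) < t:
        zl, zu = z0, 2.0 * z0
        while g(zu, tau, tau_n, phi) < t:
            zu *= 2.0
    else:
        zl, zu = 0.5 * z0, z0
        while g(zl, tau, tau_n, phi) > t:
            zl *= 0.5
    for _ in range(max_iter):
        gl, gu = g(zl, tau, tau_n, phi), g(zu, tau, tau_n, phi)
        z0 = zl + (t - gl) * (zu - zl) / (gu - gl)   # linear interpolation step
        gz = g(z0, tau, tau_n, phi)
        if abs(gz - t) < tol:
            break
        if gz < t:
            zl = z0
        else:
            zu = z0
    return z0
```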
Acknowledgements

This research was partially funded by the NIH grant R01 CA158113-01.

References

M.J. Bayarri and G. Garcia-Donato. Extending conventional priors for testing general hypotheses in linear models. Biometrika, 94:135–152, 2007.
J.O. Berger and R.L. Pericchi. The intrinsic Bayes factor for model selection and prediction. Journal of the American Statistical Association, 91:109–122, 1996.
J.O. Berger and R.L. Pericchi. Objective Bayesian methods for model selection: introduction and comparison, volume 38, pages 135–207. Institute of Mathematical Statistics Lecture Notes - Monograph Series, 2001.
J.M. Bernardo. Integrated objective Bayesian estimation and hypothesis testing. In J.M. Bernardo, M.J. Bayarri, J.O. Berger, A.P. Dawid, D. Heckerman, A.F.M. Smith, and M. West, editors, Bayesian Statistics 9 - Proceedings of the Ninth Valencia International Meeting, pages 1–68. Oxford University Press, 2010.
A. Calon, E. Espinet, S. Palomo-Ponce, D.V.F. Tauriello, M. Iglesias, M.V. Céspedes, M. Sevillano, C. Nadal, P. Jung, X.H.-F. Zhang, D. Byrom, A. Riera, D. Rossell, R. Mangues, J. Massague, E. Sancho, and E. Batlle. Dependency of colorectal cancer on a TGF-beta-driven programme in stromal cells for metastasis initiation. Cancer Cell, 22(5):571–584, 2012.
G. Consonni and L. La Rocca. On moment priors for Bayesian model choice with applications to directed acyclic graphs. In J.M. Bernardo, M.J. Bayarri, J.O. Berger, A.P. Dawid, D. Heckerman, A.F.M. Smith, and M. West, editors, Bayesian Statistics 9 - Proceedings of the Ninth Valencia International Meeting, pages 119–144. Oxford University Press, 2010.
J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96:1348–1360, 2001.
J. Fan and J. Lv. A selective overview of variable selection in high dimensional feature space. Statistica Sinica, 20:101–140, 2010.
C. Fernández, E. Ley, and M.F.J. Steel. Benchmark priors for Bayesian model averaging. Journal of Econometrics, 100:381–427, 2001.
E. George, F. Liang, and X. Xu. From minimax shrinkage estimation to minimax shrinkage prediction. Statistical Science, 27:82–94, 2012.
V.A. Hajivassiliou. Simulating normal rectangle probabilities and their derivatives: the effects of vectorization. Econometrics, 11:519–543, 1993.
H. Jeffreys. Theory of Probability. Oxford University Press, Oxford, England, third edition, 1961.
V.E. Johnson and D. Rossell. Prior densities for default Bayesian hypothesis tests. Journal of the Royal Statistical Society B, 72:143–170, 2010.
V.E. Johnson and D. Rossell. Bayesian model selection in high-dimensional settings. Journal of the American Statistical Association, 107(498):649–660, 2012.
R.E. Kass and L. Wasserman. A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. Journal of the American Statistical Association, 90:928–934, 1995.
M.P. Keane. Simulation estimation for panel data models with limited dependent variables. Econometrics, 11:545–571, 1993.
I. Klugkist and H. Hoijtink. The Bayes factor for inequality and about equality constrained models. Computational Statistics & Data Analysis, 51(12):6367–6379, 2007.
J.H. Kotecha and P.M. Djuric. Gibbs sampling approach for generation of truncated multivariate Gaussian random variables. In Proceedings, 1999 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 1757–1760. IEEE Computer Society, 1999.
F. Liang, R. Paulo, G. Molina, M.A. Clyde, and J.O. Berger. Mixtures of g-priors for Bayesian variable selection. Journal of the American Statistical Association, 103:410–423, 2008.
F. Liang, Q. Song, and K. Yu. Bayesian modeling for high-dimensional generalized linear models. Journal of the American Statistical Association, 108(502):589–606, 2013.
E. Moreno, F. Bertolino, and W. Racugno. An intrinsic limiting procedure for model selection and hypotheses testing. Journal of the American Statistical Association, 93:1451–1460, 1998.
A. O'Hagan. Fractional Bayes factors for model comparison. Journal of the Royal Statistical Society, Series B, 57:99–118, 1995.
J.M. Pérez and J.O. Berger. Expected posterior prior distributions for model selection. Biometrika, 89:491–512, 2002.
G. Rodriguez-Yam, R.A. Davis, and L.L. Scharf. Efficient Gibbs sampling of truncated multivariate normal with application to constrained linear regression. PhD thesis, Department of Statistics, Colorado State University, 2004.
D. Rossell, D. Telesca, and V.E. Johnson. High-dimensional Bayesian classifiers using non-local priors. In Statistical Models for Data Analysis XV, pages 305–314. Springer, 2013.
J. Rousseau. Approximating interval hypothesis: p-values and Bayes factors. In J.M. Bernardo, M.J. Bayarri, J.O. Berger, and A.P. Dawid, editors, Bayesian Statistics 8, pages 417–452. Oxford University Press, 2010.
M.D. Springer and W.E. Thompson. The distribution of products of beta, gamma and Gaussian random variables. SIAM Journal on Applied Mathematics, 18(4):721–737, 1970.
C. Stein. Inadmissibility of the usual estimator for the mean of a multivariate distribution. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 197–206, 1956.
R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58:267–288, 1996.
I. Verdinelli and L. Wasserman. Bayes factors, nuisance parameters and imprecise tests. In J.M. Bernardo, J.O. Berger, A.P. Dawid, and A.F.M. Smith, editors, Bayesian Statistics 5, pages 765–771. Oxford University Press, 1996.
S. Wilhelm and B.G. Manjunath. tmvtnorm: a package for the truncated multivariate normal distribution. The R Journal, 2:25–29, 2010.
A. Zellner. On assessing prior distributions and Bayesian regression analysis with g-prior distributions. In Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti. North-Holland/Elsevier, Amsterdam; New York, 1986.
A. Zellner and A. Siow. Posterior odds ratios for selected regression hypotheses. In J.M. Bernardo, M.H. DeGroot, D.V. Lindley, and A.F.M. Smith, editors, Bayesian Statistics: Proceedings of the First International Meeting Held in Valencia (Spain), volume 1. Valencia University Press, 1980.
A. Zellner and A. Siow. Basic issues in econometrics. University of Chicago Press, Chicago, 1984.