An Information Matrix Prior for Bayesian Analysis in Generalized Linear Models with High Dimensional Data
Mayetri Gupta∗ and Joseph G. Ibrahim†
Abstract
An important challenge in analyzing high dimensional data in regression settings is that of facing a situation in which the number of covariates p in the model greatly exceeds the sample size n (sometimes termed the "p > n" problem). In this article, we develop a novel specification for a general class of prior distributions, called Information Matrix (IM) priors, for high-dimensional generalized linear models. The priors are first developed for settings in which p < n, and then extended to the p > n case by defining a ridge parameter in the prior construction, leading to the Information Matrix Ridge (IMR) prior. The IM and IMR priors are based on a broad generalization of Zellner's g-prior for Gaussian linear models. Various theoretical properties of the prior and implied posterior are derived, including existence of the prior and posterior moment generating functions, tail behavior, and connections to Gaussian priors and Jeffreys' prior. Several simulation studies and an application to a nucleosomal positioning data set demonstrate the advantages of the new priors over Gaussian priors, as well as g-priors, in high dimensional settings.
KEY WORDS: Fisher Information; g-prior; Importance sampling; Model identifiability; Prior elicitation.
∗ Department of Biostatistics, Boston University, MA 02118, U.S.A. Email: gupta@bu.edu; Phone: 1.617.414.7946; Fax: 1.617.638.6484
† Department of Biostatistics, University of North Carolina at Chapel Hill, NC 27599, U.S.A. Email: ibrahim@bios.unc.edu
1 Introduction
In the analysis of data arising in many scientific applications, one often faces a scenario in which the number of variables (p) greatly exceeds the sample size (n), often termed the "p > n" problem. In these problems, fitting many types of statistical models leads to model nonidentifiability, in which parameters cannot be estimated via maximum likelihood. The specification of proper priors can alleviate such a nonidentifiability problem and lead to proper posterior distributions, as long as one uses a valid probability density for the data, i.e., ∫ f(y|θ) dy < ∞. Specification of proper priors in the p > n context is not easy, since it is desirable that the prior (i) leads to existence of prior and posterior moments, which is not guaranteed and is not easy to check theoretically; (ii) is relatively non-informative, so that the data can essentially drive the inference; (iii) is at least somewhat semi-automatic in nature, requiring relatively little or minimal hyperparameter specification; and (iv) is easy to interpret and computationally feasible.
Properties (i)–(iv) are important to investigate in considering a class of priors for posterior inference. The existence of posterior moments is important since Bayesian inference often relies on the estimation of posterior quantities (e.g., means) via Markov chain Monte Carlo (MCMC) or other sampling procedures. Computing Bayes factors and other model assessment criteria often requires the existence of posterior moments (Chen, Ibrahim, and Yiannoutsos (1999)). Constructing estimates through a black-box use of MCMC, without knowing that these estimates are well defined for the class of priors considered, may lead to nonsensical posterior inference. Noninformativeness is desirable in model selection and model assessment settings, or wherever prior information is not available. The specification of semi-automatic priors has been advocated by Spiegelhalter and Smith (1982), Zellner (1986), Mitchell and Beauchamp (1988), Berger and Pericchi (1996), Bedrick, Christensen, and Johnson (1996), Chen, Ibrahim, and Yiannoutsos (1999), and Ibrahim and Chen (2003), and is attractive in high dimensional settings where it is difficult to find contextual interpretations for all parameters.
In the p > n paradigm, there has been little work on the specification of desirable priors and, in particular, priors that attain (i)–(iv) above. West (2003) specifies singular g-priors (Zellner (1986)) for the Gaussian model in the context of Bayesian factor analysis for p > n. Liang, Paulo, Molina, Clyde, and Berger (2008) advocate the use of mixtures of g-priors to resolve consistency issues in model selection, but their adaptation does not directly extend to cases where p > n. When p > n, Np(0, γI) priors are often not desirable: for small to moderate γ, they are typically too informative, and for large γ they often lead to computationally unstable posteriors, since the model becomes weakly identified (see Section 4). They also do not capture the a priori correlation in the parameters, and eliciting a prior covariance matrix when p > n is extremely difficult.
The class of priors we consider are called Information Matrix (IM) priors. They can be applied in any parametric regression context, but we initially focus on generalized linear models (GLMs). The functional form of the IM prior is obtained once a parametric statistical model is specified for the data, the kernel of the IM prior being specified through the Fisher information matrix. Let (y1, . . . , yn) be independent univariate response variables with density p(yi|X, β, φ), where β is a p × 1 vector of regression coefficients, φ = (φ1, . . . , φq) a vector of dispersion parameters, and X the n × p matrix of covariates with i-th row x′i = (xi1, . . . , xip), where '′' denotes matrix transposition. Let α = (β, φ). Assume for the moment that φ is known, p < n, and rank(X) = p. The likelihood function of β for a regression model with independent responses is L(β) = ∏_{i=1}^{n} p(yi|xi, β), and the (i, j)-th element of the Fisher information matrix is

    Iij(β) = −E[ ∂²/(∂βi ∂βj) log L(β) ],

where the expectation is with respect to y|β, and βi is the i-th component of β (i, j = 1, . . . , p).
Further, assume that the p × p Fisher information matrix, I(β), is non-singular, as is often the case when p < n. The general IM prior is now defined as

    πIM(β) ∝ |I(β)|^{1/2} exp{ −(1/(2c0)) (β − µ0)′ I(β) (β − µ0) },    (1.1)

where µ0 and c0 ≥ 0 are specified location and dispersion hyperparameters. The IM prior (1.1)
captures the prior covariance of β via the Fisher information matrix, which seems an attractive specification since this matrix plays a major role in determining the large sample covariance of β in both Bayesian and frequentist inference. The use of the design matrix X is attractive since X may reveal redundant covariates. The prior (1.1) is semi-automatic, requiring specifications only for µ0 (which can be taken to be 0) and the scalar c0. It should be mentioned that some recent work by Wang and George (2007) and Wang (2002) uses a related form of the prior for β, with two important differences: (i) the form of the prior for all GLMs is taken to be Gaussian, and (ii) the covariance depends on the observed information matrix, rather than the expected one, leading to a data-dependent prior, unlike our specification here. We now focus on the development of these priors and the resultant posterior estimators in the class of GLMs, where many theoretical and computational properties can be characterized for p < n and p > n.
2 IM Priors for Generalized Linear Models
We first consider the IM prior for GLMs when p < n, and where the dispersion parameter φ is known or intrinsically fixed, as for example in the binomial, Poisson, and exponential regression models. For a GLM, yi|xi (i = 1, . . . , n) has a conditional density given by

    p(yi|xi, β, φ) = exp{ a_i^{−1}(φ) (yi θi − b(θi)) + c(yi, φ) },  i = 1, 2, . . . , n,    (2.1)

where the canonical parameter θi satisfies θi = θ(ηi), and a(·), b(·) and c(·) determine a particular family in the class. The function θ(·) is the link function for the GLM, often referred to as the θ-link, and ηi = x′i β. When a canonical link is used, θ(ηi) = ηi. The function ai(φ) is commonly of the form ai(φ) = φ^{−1} w_i^{−1}, where the wi's are known weights. Without loss of generality and for ease of exposition, let φ = 1 and wi = 1. Then (2.1) can be rewritten as

    p(yi|xi, β) = exp{ yi θi − b(θi) + c(yi) },  i = 1, 2, . . . , n.    (2.2)
Now, let X be the n × p matrix of covariates with i-th row x′i, let θ(Xβ) be a component-wise function of Xβ that depends on the link, and let J be an n × 1 vector of ones. Then in matrix form, we can write the likelihood function of β for the GLM in (2.2) as

    p(y|β) = exp{ y′ θ(Xβ) − J′ b(θ(Xβ)) + c(y) }.    (2.3)
For the class of GLMs (with φ = 1 and wi = 1), the Fisher information matrix for β is

    I(β) = X′ Ω(β) X,    (2.4)

where

    Ω(β) = ∆(β) V(β) ∆(β),    (2.5)

V(β) is the n × n diagonal matrix of variance functions with i-th diagonal element vi = v(x′i β) = d²b(θi)/dθi², and ∆(β) is an n × n diagonal matrix of "link adjustments", with i-th diagonal element δi = δ(x′i β) = dθi/dηi. We denote the i-th diagonal element of Ω(β) by ωi.
We now can write the IM prior for β as

    πIM(β) ∝ |X′ Ω(β) X|^{1/2} exp{ −(1/(2c0)) (β − µ0)′ (X′ Ω(β) X) (β − µ0) }.    (2.6)
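For concreteness, the prior (2.6) is straightforward to evaluate numerically. The following is a minimal R sketch for the logistic model, where ωi = e^{x′iβ}/(1 + e^{x′iβ})²; the function name and the simulated design are purely illustrative, not part of the original development.

# Unnormalized log-density of the IM prior (2.6) for a logistic GLM.
# X: n x p design matrix (p < n, full rank); beta, mu0: p-vectors; c0 > 0.
log_im_prior_logistic <- function(beta, X, mu0 = rep(0, ncol(X)), c0 = 1) {
  eta <- drop(X %*% beta)
  omega <- plogis(eta) * plogis(-eta)        # diagonal of Omega(beta)
  I_beta <- crossprod(X * omega, X)          # X' Omega(beta) X
  d <- beta - mu0
  as.numeric(0.5 * determinant(I_beta, logarithm = TRUE)$modulus) -
    drop(t(d) %*% I_beta %*% d) / (2 * c0)
}

set.seed(1)
X <- matrix(rnorm(40), nrow = 20, ncol = 2)  # n = 20, p = 2
log_im_prior_logistic(c(0.5, -0.5), X)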
The prior in (2.6) can be viewed as a generalization of several types of priors. As c0 → ∞, (2.6) converges to Jeffreys' prior for GLMs, given by

    πJ(β) ∝ |X′ Ω(β) X|^{1/2}.    (2.7)

Jeffreys' prior is a popular noninformative prior for Bayesian inference in multiparameter settings involving regression coefficients, due to its invariance and local uniformity properties (Kass (1989, 1990)). Also, as long as X′Ω(β)X is bounded (for example, in the logistic model) as ‖β‖ → ∞, the ratio of prior (2.6) to (2.7) goes to zero, showing that the IM prior has lighter tails than Jeffreys' in the limit. Jeffreys' prior for GLMs is improper for most models, except for binomial regression models (Ibrahim and Laud (1991)). For the Gaussian linear model, (2.6) reduces to Zellner's g-prior (Zellner (1986)). In this case, Ω(β) = σ^{−2}I, where σ² denotes the variance of yi in the Gaussian linear model. For any given GLM, if Ω(β) is a constant matrix free of β, say W, then the IM prior will always be a Gaussian prior, that is, β ∼ Np(µ0, c0(X′WX)^{−1}). For example, for the gamma model with log-link, Ω(β) is a constant (identity) matrix. In this sense, the IM prior can be thought of as a "generalized" g-prior.
The tail behavior of the IM prior can be theoretically compared with the Gaussian prior for specific GLMs. To show that its tails are heavier than those of a Gaussian distribution, we need

    lim_{‖β‖→∞} πIM(β) / φ(β; ν, Σ) = ∞,    (2.8)

where φ(β; ν, Σ) denotes the p-dimensional multivariate normal density with mean ν and covariance matrix Σ. (Showing that the expression in (2.8) goes to zero indicates the tails are lighter
than the Gaussian.) For the binomial regression model and the inverse Gaussian model (with
canonical link), it can be shown that (2.8) holds, so that the IM prior under these models has
heavier tails (Figure 1). For most GLMs, if the IM prior is proper and Ω(β) → 0 elementwise as
||β|| → ∞, then the tails of the IM prior will be heavier than those of Gaussian priors. When the
elements of Ω(β) become infinite as ||β|| → ∞, the IM prior has lighter tails than the Gaussian
prior, as with the Poisson model with canonical link (Figure 2). Although the IM prior for the Poisson model has lighter tails than a Gaussian prior, the IM prior for this model is much flatter
in the “middle” part of the distribution and hence effectively acts as a more noninformative prior.
Figure 1: Density of IM prior under a logistic model with p = 1 compared with Jeffreys’ ( “Jeff”) and a Gaussian
prior (“Norm”) for a simulated data set with n = 20. The plot is shown at three different scales. The total density
under each curve is the same (normalized); however the Gaussian prior has the highest mass near the center while
Jeffreys’ prior has much heavier tails (visible in the third panel). The IM prior has lower mass near the center of
the distribution compared to the Gaussian prior but thicker tails; while it has thinner tails and more central mass
compared to Jeffreys’ prior.
When the dispersion parameter φ is unknown, the IM prior based on α, where α = (β, φ), is
typically not proper for GLMs and has properties that are difficult to characterize. In these cases,
one can construct the IM prior of β conditional on φ, and then specify a proper prior for φ such
as a gamma prior or a product of independent gamma densities (Ibrahim and Laud (1991)).

Figure 2: Density of IM prior under a Poisson model with p = 1 compared with a Gaussian prior for a simulated data set with n = 20, shown at two different scales.

The conditional IM prior of β given φ is defined as
    πIM(β|φ) ∝ |I(β|φ)|^{1/2} exp{ −(1/(2c0)) (β − µ0)′ I(β|φ) (β − µ0) },    (2.9)

where

    Iij(β|φ) = −E[ ∂²/(∂βi ∂βj) log L(β, φ) ],

and L(β, φ) is the likelihood function of (β, φ). We specify the joint prior of (β, φ) as

    π(β, φ) ∝ πIM(β|φ) π(φ),    (2.10)
where π(φ) is a proper prior for φ. When q = 1, π(φ) can be taken to be a gamma prior and, for
a general q, π(φ1 , . . . , φq ) can be assumed to be a product of q independent gamma densities. For
GLMs, following the results of Section 3, it can be shown that (2.10) is jointly proper for (β, φ),
when φ is distributed as a product of gamma densities.
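In code, the joint prior (2.10) is simply a sum of log terms; a brief R sketch for the q = 1 case (the conditional-IM function and the gamma hyperparameters are placeholders, not values prescribed here):

# Joint log-prior (2.10): conditional IM log-prior for beta plus a gamma
# log-prior for a scalar dispersion phi (q = 1 case).
log_joint_prior <- function(beta, phi, log_im_prior_cond, shape = 1, rate = 1) {
  log_im_prior_cond(beta, phi) +
    dgamma(phi, shape = shape, rate = rate, log = TRUE)
}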
2.1 Conditions for the existence of the prior MGF when p < n
We now present theoretical results establishing conditions for the existence of the prior and posterior moment generating functions (MGFs) using the IM prior for GLMs. The proofs of these
results are presented as special cases of more general theorems given in Section 3.
Sufficiency. A sufficient condition for the existence of the prior MGF of β for the IM prior in (2.6) is that

    ∫ exp{ τ θ^{−1}(r) } [ d²b(r)/dr² ]^{1/2} dr < ∞,

for τ in some (−ε, ε). This is derived as a special case of Theorem 1 (Corollary 1.2), and matches the condition for MGF existence for Jeffreys' prior when p < n (Ibrahim and Laud (1991)).
Necessity. A necessary condition for existence of the prior MGF of β for the IM prior in (2.6) is the finiteness of the p-dimensional integral

    ∫ ∏_{j=1}^{p} [ vj(β) δj²(β) ]^{1/2} exp{ −(1/(2c0)) (β − µ0)′ I(β) (β − µ0) + t′β } dβ,    (2.11)

for any t in some p-dimensional sphere about 0, where I(β) = X′Ω(β)X. In many cases, checking (2.11) reduces to a condition involving a one-dimensional integral (Section 3.1). Note that (2.11) is a less stringent condition than that for Jeffreys' prior, allowing prior MGFs to exist for models where Jeffreys' prior is improper. This is discussed further in Section 3.1.
2.2 Existence of the posterior MGF when p < n
A sufficient condition for the existence of the posterior MGF of β for the IM prior in (2.6) is the finiteness of the following one-dimensional integral for τ ∈ (−ε, ε), some ε > 0:

    ∫ exp{ τ θ^{−1}(r) + φ^{−1} w (y r − b(r)) } [ d²b(r)/dr² ]^{1/2} dr.    (2.12)

This is the same condition as for Jeffreys' prior in Ibrahim and Laud (1991). Since the sufficient conditions for the prior and posterior are the same as those for Jeffreys' prior, the examples which satisfy the sufficiency conditions for Jeffreys' prior in Ibrahim and Laud (1991) also satisfy the sufficient conditions here; e.g., the posterior MGFs exist for binomial, Poisson and Gamma GLMs if none of the observed y's is zero.
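For the canonical binomial (logistic) model, θ^{−1}(r) = r and d²b(r)/dr² = e^r/(1 + e^r)², so (2.12) can be verified numerically. A small R check (the values of τ and y are illustrative; φ = w = 1):

# Integrand of (2.12) for the logistic model, written inside a single exp()
# for numerical stability: exp{(tau + y + 1/2) r - 2 log(1 + e^r)}.
f <- function(r, tau = 0.1, y = 1) exp((tau + y + 0.5) * r - 2 * log1p(exp(r)))
integrate(f, lower = -Inf, upper = Inf)  # a finite value, so (2.12) holds here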
3 IM Ridge Priors for GLMs when p > n
When p > n, β is not identifiable in the likelihood function and the MLE of β does not exist.
Specifying a proper prior guarantees a proper posterior in this setting; however, not all proper
priors yield the existence of prior and posterior moments which are often the end objectives in
Bayesian inference. To generalize the IM priors to a proper prior in the p > n case, we introduce
a scalar "ridge" parameter λ in the prior construction, defining the IM Ridge (IMR) prior as

    πIMR(β) ∝ |X′ Ω(β) X + λI|^{1/2} exp{ −(1/(2c0)) (β − µ0)′ (X′ Ω(β) X + λI) (β − µ0) }.    (3.1)
Here λ can be considered a "ridge" parameter, as used in regression models for high dimensional covariates and to reduce the effects of collinearity (Hoerl and Kennard (1970)). With the introduction of λ, the matrix X′Ω(β)X + λI is always positive definite, regardless of the rank of X and the form of Ω, for any GLM. As c0 → ∞ in (3.1), the IMR prior converges to a generalized Jeffreys' prior given by π*J(β) ∝ |X′Ω(β)X + λI|^{1/2}, which is useful in the p > n setting since the usual Jeffreys' prior (2.7) does not exist when X′Ω(β)X is singular. This idea is similar in spirit (but considerably different in form) to the introduction of a constant matrix Λ added to the sample covariance matrix S in the construction of a data-dependent prior for the population covariance matrix Σ in multivariate normal settings (Schafer (1997)).
To understand the IMR prior in (3.1), first consider the IMR prior for the linear model. In this case, the IMR prior for β|σ² is Gaussian and the posterior distribution is also Gaussian. Specifically, consider the linear model Y = Xβ + ε, where ε ∼ Nn(0, σ²I), and p > n. The IMR prior in this case becomes β|σ² ∼ Np(µ0, c0σ²(X′X + λIp)^{−1}), and the posterior distribution of β given σ² is easily shown to be the non-singular Gaussian distribution

    β|y, σ² ∼ N( Σ[ (1/c0)(X′X + λI)µ0 + X′y ], σ²Σ ),    (3.2)

where Σ = [(1 + 1/c0)X′X + (λ/c0)Ip]^{−1}. Theoretical properties of the IMR prior for the linear model are investigated in Section 3.3.
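Since (3.2) is available in closed form, posterior computation for the linear model requires only a p × p solve; a minimal R sketch (σ² assumed known; hyperparameter defaults are illustrative):

# Posterior mean and covariance (3.2) for the linear model under the IMR prior.
imr_posterior_linear <- function(X, y, sigma2 = 1, c0 = 1, lambda = 0.5,
                                 mu0 = rep(0, ncol(X))) {
  p <- ncol(X)
  Sigma <- solve((1 + 1 / c0) * crossprod(X) + (lambda / c0) * diag(p))
  m <- Sigma %*% ((crossprod(X) + lambda * diag(p)) %*% mu0 / c0 +
                    crossprod(X, y))
  list(mean = drop(m), cov = sigma2 * Sigma)
}

# p > n example: the posterior is proper even though X'X is singular.
set.seed(2)
X <- matrix(rnorm(200), nrow = 10, ncol = 20); y <- rnorm(10)
fit <- imr_posterior_linear(X, y)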
3.1 MGF existence for GLMs using the IMR prior
The prior MGF must exist for the posterior MGF to exist, when p > n. We first provide necessary
and sufficient conditions for prior MGF existence.
Theorem 1. A sufficient condition for the existence of the MGF of β based on the prior (3.1) when p > n is that the one-dimensional integral

    ∫ exp{ −(λM/(2c0)) [θ^{−1}(r)]² + τi θ^{−1}(r) } [ d²b(r)/dr² ]^{1/2} dr < ∞    (3.3)

for some τi ∈ (−ε, ε), where rank(X) = n, M is such that |X*| ≤ M^{−p/2}, and X* = (X′ ⋮ x′0)′, with x0 a (p − n) × p matrix selected such that X* is positive definite.
Proof. Proofs of all theorems are given in the Appendices (in the online Article Supplement).
Corollary 1.1. Let ωk(β) denote the k-th diagonal element of Ω(β) in (2.5). If ∑_{k=1}^{n} ωk(β) ≤ 1, the prior MGF of β from (3.1) always exists, as the integral is dominated by a Gaussian MGF.
Corollary 1.2. The sufficient condition (3.3) also holds and is identical in the p < n case and,
additionally, reduces to the sufficient condition for Jeffreys’ prior if λ = 0.
The IMR prior thus can be useful (and more desirable than the IM prior) even in the p < n
case, in the face of high-dimensionality, collinearity, or weak identifiability.
Theorem 2. If the prior MGF exists when p > n, the p-dimensional integral

    ∫ as(β) exp{ −(1/(2c0)) (β − µ0)′ (I(β) + λI) (β − µ0) + t′β } dβ    (3.4)

is finite for some t ∈ (−ε, ε)^p, for s = 0, 1, . . . , p, where as(β) = |I(β)^{(i1,...,is)}| is the determinant of the (p − s) × (p − s) sub-matrix of I(β) formed by leaving out the (i1, . . . , is)-th rows and columns of I(β) = X′Ω(β)X.
Note here that a0(β) = |I(β)|, a_{p−1}(β) = trace(I(β)), and ap(β) = 1. As a working principle, try to check the sufficiency condition first; if it does not hold, check the necessity condition. For the necessity condition (Theorem 2), it is easiest to first check the p = 1 case. Necessity does not hold if there exists no t ∈ (−ε, ε) such that

    ∫ a11(β)^{1/2} exp{ −(1/(2c0)) (β − µ0)² (a11(β) + λ) + tβ } dβ

is finite, where a11(β) = b″(θ) dθ/dη. If this condition is not satisfied, the MGF does not exist. If the necessity condition is satisfied for p = 1, we need to check the condition for larger p; if for any p we find that the necessity condition is not satisfied, the prior MGF does not exist.
Corollary 2.1. The result in Theorem 2 holds when n > p, and simplifies to (2.11) when λ = 0.
Theorem 3. A sufficient condition for the existence of the posterior MGF for πIMR(β) is that

    ∫ exp{ −(M1λ/(2c0)) [θ^{−1}(r)]² + τj θ^{−1}(r) + φ^{−1} wj (yj r − b(r)) } [ d²b(r)/dr² ]^{1/2} dr    (3.5)

is finite for some τj in an open neighborhood about zero, for each j = 1, . . . , n. For j = n + 1, . . . , p, the condition is the same as for the existence of the prior MGF.
Corollary 3.1. When p < n (and λ = 0), a sufficient condition for the existence of the posterior MGF is that

    ∫ exp{ τ θ^{−1}(r) + φ^{−1} w (y r − b(r)) } [ d²b(r)/dr² ]^{1/2} dr    (3.6)

is finite for some τ in an open neighborhood about zero, and that the MLE exists.
Here we additionally require that the MLE exists (i.e., the likelihood function is bounded
above). However, existence of the prior MGF is not necessary.
The next result demonstrates the usage of the necessary and sufficient conditions in some
specific examples of GLMs to determine existence of prior and posterior MGFs. Without loss of
generality, we assume that µ0 = 0.
Theorem 4. (i) The sufficient conditions (3.3) and (3.5) guarantee the existence of prior and posterior MGFs for p > n in the Binomial model with canonical and probit links, and in the Poisson model with canonical and identity links.
(ii) According to condition (3.4), the prior and posterior MGFs do not exist for the Gamma model with canonical link.
(iii) For the Inverse Gaussian model with canonical link, (3.3) is not satisfied, while (3.4) is satisfied; thus it cannot be determined from these conditions alone whether the prior and posterior MGFs exist.

Note, however, that the prior and posterior MGFs do exist for the Gamma model with a log link, due to a result shown in Section 3.2. All derivations are given in the Appendices.
3.2 Connection between IM and g-priors

When the matrix Ω(β) = ∆(β)V(β)∆(β) in the information matrix is independent of β, the IMR prior reduces to a "ridge" g-prior for a Gaussian linear model: it is of Gaussian form, proper, and its MGF exists. This provides a quick way to determine existence of the prior and posterior MGFs for models for which the sufficiency conditions do not hold, but the necessary ones do. Examples are the gamma model with log-link and the Gaussian model with canonical link. For the gamma model with log link, note that b(θ) = −log(−θ), so that b″(θ) = vi(β) = 1/θ²; and θ = −e^{−η}, thus dθ/dη = δi(β) = −θ. The diagonal elements of Ω(β) are vi(β)δi²(β) = 1. Hence Ω(β) reduces to an identity matrix, and πIMR(β) is a Gaussian distribution with mean µ0 and covariance matrix c0(X′X + λI)^{−1}, which is always proper, and for which the MGF exists even if p > n.
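The identity ωi = vi(β)δi²(β) = 1 for the gamma log-link model can also be confirmed by finite differences; a short, purely illustrative R check:

# For the gamma model with log link: theta(eta) = -exp(-eta), b(theta) = -log(-theta).
# Check numerically that b''(theta) * (d theta/d eta)^2 = 1 on a grid of eta.
theta <- function(eta) -exp(-eta)
b     <- function(th)  -log(-th)
eta <- seq(-2, 2, by = 0.5); h <- 1e-4
v     <- (b(theta(eta) + h) - 2 * b(theta(eta)) + b(theta(eta) - h)) / h^2
delta <- (theta(eta + h) - theta(eta - h)) / (2 * h)
round(v * delta^2, 4)  # all entries approximately 1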
3.3 Characterization of "frequentist" bias and variance

To determine the influence of λ and c0 in the IMR prior on the posterior estimates, we first explore the case of Gaussian linear models, where closed forms exist. For simplicity, assume µ0 = 0.

Theorem 5. The posterior covariance matrix of β for the Gaussian linear model,

    σ²Σ = σ² [ (1 + 1/c0) X′X + (λ/c0) Ip ]^{−1},    (3.7)

satisfies

    (c0/(c0 + 1))^{p} x0^{−n} n^{−n/2} (c0/λ)^{p/2} ≤ |Σ| ≤ (c0/λ)^{p/2},

where x0 = max{diag(X′X)}, if c0 > λ.

Corollary 5.1. If p → ∞, c0 > λ, and n is finite, then |Σ| → ∞.
Theorem 6. The posterior bias of β, for the IMR prior, satisfies

    (1/p) ∑_{i=1}^{p} bias(βi) ≤ (1/p) ∑_{i=1}^{p} |βi|.

Equivalently, in terms of the bias matrix D = ΣX′X − I, ‖Dβ‖² = β′D′Dβ ≤ β′β.
It is also of interest to compare the bias and MSE of estimates arising from the use of the IMR prior to those obtained using a Gaussian prior, N(0, c0 Ip). For simplicity, assume σ² = 1 and βi = 1 for i = 1, . . . , p. With DIM and DN denoting the bias matrices of the IMR and Gaussian priors, respectively, it can be shown that the ratio of their determinants is

    |DIM| / |DN| = [ |{(1 + c0)X′X + λI}^{−1} X′X c0 − I| · |X′X + λI| · |X′X + c0 I| ] / [ |{(X′X + c0 I)^{−1} X′X − I}^{−1} X′X c0 − I| · |(1 + c0)X′X + λI| · |c0 I| ],    (3.8)

and, similarly, the ratio of the determinants of the mean square error (MSE) matrices is

    |MSEIM| / |MSEN| = |(ΣX′X − I)′(ΣX′X − I) + Σ| / |[(X′X + c0 I)^{−1}X′X − I]′[(X′X + c0 I)^{−1}X′X − I] + (X′X + c0 I)^{−1}|,    (3.9)
where Σ is as given in (3.7). Figure 3 depicts the behavior of these ratios for a set of choices of (n, p). The bias ratio (averaged over 5 data sets) decreases with an increase in c0, and the IMR prior is uniformly better when c0 is sufficiently large (> 3), irrespective of whether n is larger or smaller than p. When n ≥ p, the MSE using the IMR prior is uniformly better when c0 is moderately large. However, when p > n, an increase in c0 leads to an increase in the MSE with the IM prior, and a sharper Gaussian prior (with a large bias) is favored in terms of the MSE.
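The IMR-side quantities in these comparisons follow directly from (3.7), Theorem 6, and the numerator of (3.9); a short R sketch with σ² = 1 (the Gaussian-prior counterparts can be coded in the same way):

# Bias matrix D = Sigma X'X - I and MSE matrix D'D + Sigma for the IMR prior
# in the Gaussian linear model, with Sigma as in (3.7).
imr_bias_mse <- function(X, c0, lambda) {
  p <- ncol(X)
  Sigma <- solve((1 + 1 / c0) * crossprod(X) + (lambda / c0) * diag(p))
  D <- Sigma %*% crossprod(X) - diag(p)
  list(log_det_bias = as.numeric(determinant(D)$modulus),                  # log |D_IM|
       log_det_mse  = as.numeric(determinant(crossprod(D) + Sigma)$modulus))  # log |MSE_IM|
}

set.seed(3)
X <- matrix(rnorm(50 * 30), nrow = 50, ncol = 30)
imr_bias_mse(X, c0 = 10, lambda = 0.5)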
Figure 3: Ratio of determinants of bias (panel 1) and MSE (panel 2) for the IMR prior compared to a Gaussian N(0, c0 Ip) prior for the Normal linear model, for (n, p) = (30, 70), (50, 50), and (70, 30).
3.4 Elicitation of λ, µ0 and c0
The parameter λ plays an important role in the construction of the IMR prior, since its introduction makes the prior covariance matrix of β nonsingular regardless of p. One question is whether to take λ fixed or random in the model. Empirical experience suggests that if λ is random, any prior for λ would have to be quite sharp, since there is no information in the data for estimating it. Empirical studies show that taking λ fixed gives similar results with far less computational effort. When taking 0 < λ ≤ 1, posterior inference about β appears quite robust for a wide range of values of λ in (0, 1) (Section 4.2). Note that when c0 is large, λ and the design matrix X have a small impact on the posterior analysis of β. Since our main aim is to develop a relatively noninformative IM prior for Bayesian inference, other practical hyperparameter choices include µ0 = 0 (consistent with Zellner's g-prior) and a moderately large c0 (c0 ≥ 1).
4 Empirical Studies

4.1 Density of the IM prior compared to Jeffreys' and Gaussian priors
We simulated data under a logistic model with sample size n = 20, varying the number of
covariates in the range 1 ≤ p ≤ 500, to get an idea of the behavior of the IMR prior for p < n
and p > n. For the prior πIMR(β), setting µ0 = 0, the posterior density of β is

    p(β|y) ∝ |Σ|^{−1/2} exp{ ∑_{i=1}^{n} yi x′i β − (1/2) β′Σ^{−1}β } / ∏_{i=1}^{n} (1 + e^{x′i β}),    (4.1)

where Σ = Σ(β) = c0[X′Ω(β)X + λIp]^{−1}, and ωi(β) = e^{x′i β}/[1 + e^{x′i β}]² (for i = 1, . . . , n). When p < n and λ = 0, (4.1) reduces to the IM prior. The IM prior was compared to Jeffreys' prior and a Gaussian prior N(0, c0(X′X)^{−1}/4) (i.e., equivalent to setting β = 0 in the IM prior). For the
p = 1 case, we can plot the exact prior densities (normalized computationally to be on the same
scale) over a range of values. Figure 1 shows the three priors superimposed on the same plot at
three scales. The IMR prior lies between Jeffreys’ prior and the Gaussian prior at the center of
the range, has heavier tails than do the others for a a wider range than Jeffreys’ and, at the tails,
converges to a Gaussian prior. We repeated the study with the same n, but set p = 5 and used the
IMR prior. In this case, we cannot analytically derive the marginals, but instead prior and posterior samples are drawn from the respective distributions using Adaptive Rejection Metropolis
Sampling (ARMS) (Gilks, Best, and Tan (1995)), using the HI package in the statistical software
R (R Development Core Team (2004)). Smoothed kernel density estimates based on a Gaussian
kernel are plotted for one of the marginals (Figure 4). This indicates the relative flatness of the IMR prior compared to the posterior density, showing that it is less informative than taking a Gaussian prior. Posterior densities are centered closely around 0.9, the true value of β.
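The log-posterior (4.1) used throughout this section can be written compactly; a minimal R sketch with µ0 = 0 (the function name is ours; plogis and log1p are used for numerical stability):

# Unnormalized log-posterior (4.1) for logistic regression under the IMR prior.
log_post_imr_logistic <- function(beta, X, y, c0 = 1, lambda = 0.5) {
  eta <- drop(X %*% beta)
  omega <- plogis(eta) * plogis(-eta)                   # omega_i(beta)
  Sigma_inv <- (crossprod(X * omega, X) + lambda * diag(ncol(X))) / c0
  # log |Sigma|^{-1/2} = (1/2) log |Sigma^{-1}|
  as.numeric(0.5 * determinant(Sigma_inv, logarithm = TRUE)$modulus) +
    sum(y * eta) - sum(log1p(exp(eta))) -
    0.5 * drop(t(beta) %*% Sigma_inv %*% beta)
}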
Figure 4: Panel 1: Density of IMR prior with p = 5 compared with Jeffreys' prior ("Jeff") and a Gaussian prior ("Norm") for a simulated data set with n = 20. Panel 2: Prior ("pr") and posterior ("po") log-densities based on the three priors.
4.2 Robustness of posterior estimates

Effect of c0 on posterior estimates. We investigated the performance of estimates using the IM prior with different settings of c0 and for values of p from 10 to 500, with a sample size n = 150. Data were generated from a logistic model with a design matrix X = (xij), xij ∼ N(0, 0.1), and coefficient vector β = (β1, . . . , βp), where βj = 0.9 (j = 1, . . . , p). When p > n, the ARMS algorithm did not work well for generating posterior samples of β, the autocorrelations in the samples being quite high. To generate posterior draws, we turned to importance sampling with a multivariate t trial density with mean 0, dispersion matrix Vt, and 3 df, denoted as t3(0, Vt), where Vt = [(1 + 1/g)X′WX + (λ/g)I]^{−1}, and g is a scalar quantity controlling the dispersion. The performance of the estimates worsened as p increased for a fixed n (Figure 5). With an increase in c0, the change in the bias and MSE of estimators was small for p ≤ 200. When p = 500, the variance of the estimator overshadowed the bias, and hence a moderate c0 led to a small MSE, with a negligible change in bias. We tested a range of g between 0.1 and 100, for which smaller values appeared to give more accurate results. Results using g between 0.1 and 5 were virtually indistinguishable. The results reported are for a setting of g = 0.25 and λ = 0.5.
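A sketch of this importance sampler in base R, reusing the log_post_imr_logistic sketch from Section 4.1 (we take W = I/4, the upper bound of the logistic variance function, as an assumption; the paper does not fix W explicitly):

# Importance sampling from (4.1) with a multivariate t_3(0, Vt) trial density,
# Vt = [(1 + 1/g) X'WX + (lambda/g) I]^{-1}.
imr_importance <- function(X, y, n_draws = 5000, g = 0.25, lambda = 0.5,
                           c0 = 1, df = 3) {
  p <- ncol(X)
  Vt_inv <- (1 + 1 / g) * crossprod(X) / 4 + (lambda / g) * diag(p)
  L <- chol(solve(Vt_inv))
  # multivariate t draws: N(0, Vt) scaled by sqrt(df / chi^2_df)
  B <- (matrix(rnorm(n_draws * p), n_draws, p) %*% L) /
    sqrt(rchisq(n_draws, df) / df)
  log_q <- -(df + p) / 2 * log1p(rowSums((B %*% Vt_inv) * B) / df)  # log t density, up to a constant
  log_w <- sapply(seq_len(n_draws),
                  function(i) log_post_imr_logistic(B[i, ], X, y, c0, lambda)) - log_q
  w <- exp(log_w - max(log_w))
  list(draws = B, weights = w / sum(w))   # posterior mean: colSums(draws * weights)
}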
Figure 5: Performance of the IMR prior for different settings of c0, for p = 10, 30, 100, 200, 500.

Robustness to choice of λ. We tested the effect of λ on posterior estimates when p > n. Here we present results from a simulation with data from a logistic model with p = 100, n = 50, c0 = 1, and λ chosen at approximately equal intervals between 0 and 1 (λ > 0). Figure 6 shows
that very small values of λ, close to zero, led to unstable estimates; however, interestingly, the
estimates were remarkably consistent, exhibiting little or no difference over a large range of λ
values, between 0.4 and 1. Since the results were robust to this wide range of λ for both the p < n
and p > n cases, we chose the value λ = 0.5 for all analyses.
Figure 6: Performance of estimates using the IMR prior for different settings of λ.
Effect of the relationship between n and p on posterior estimates. We compared the performance of estimates based on the IMR prior to Gaussian priors and a "g-prior" for the logistic model, as the sample size decreased relative to p. We generated sets of data for a fixed p = 100, while n varied between p/2 = 50 and 2p = 200. Five data sets were generated for each combination of (n, p), while β was generated from a N(β0, c0[X′W(β0)X + λI]) distribution, with β0 = (2 × J50, −2 × J50)′, where Jq denotes a q-dimensional vector of ones.

Sampling from the posterior distribution was done with a multivariate t trial density. In addition to the bias and MSE, we also computed the theoretical "asymptotic covariance" of the estimates from the inverse of the Hessian matrix of the log-posterior, as

    Vas = −[ ∂²/∂β² log p(β|y, x) ]^{−1}.
Vas cannot be evaluated in closed form, so we used a numerical computation routine in R to get an approximate estimate. The results from the IM prior (Figure 7) were compared with estimates using (i) N(0, γI) prior distributions, with γ = 1 for a highly informative prior and γ = 10^6 for an almost non-informative prior (the default used in the software BUGS (Gilks, Thomas, and Spiegelhalter (1994))); and (ii) N(0, γ(X′X)^{−1}) priors for γ = 1, 10^6 (equivalent to the "g-prior" for a normal linear model). We denote the estimates found using the five methods, the IMR prior, Gaussian priors N(0, I) and N(0, 10^6 I), and g-priors N(0, (X′X)^{−1}) and N(0, 10^6(X′X)^{−1}), as IM, NO, NO6, GP, and GP6, respectively. Figure 7 shows that (i) when n > p, the IM prior had the lowest bias, or bias comparable to NO6; (ii) when p > n, the IM prior had bias almost comparable to the NO prior; and (iii) the MSE of estimates based on the IM prior was almost uniformly the lowest. The GP6 and NO6 priors performed badly in terms of MSE, especially when p > n, whereas for larger n, the NO and GP priors appear too informative and hence led to more biased estimates. Overall, the IMR prior appears to be an attractive choice for computing estimates of the regression coefficients in both cases, whether n is larger or smaller than p.
Comparison to Bayesian model averaging predictions. In high-dimensional regression applications, an alternative to estimating the full model is Bayesian model averaging (BMA) of the posterior estimates (Hoeting, Madigan, Raftery, and Volinsky (1999)). We compared predictions using the IMR prior in a full model to those obtained using BMA with a g-prior in a scenario where p > n in logistic regression. Since the g-prior is undefined when p > n, the model averaging procedure was restricted to involve only p < n models. IMR was compared with (i) BMA using BIC, in a stepwise variable selection algorithm (Yeung, Bumgarner, and Raftery (2005)) implemented as the iBMA routine in the R package "BMA", and (ii) BMA under Zellner's g-prior with a marginal likelihood evaluated using the generalized Laplace approximation (Bollen, Ray, and Zavisca (2005)), with model selection through the evolutionary Monte Carlo (EMC)
algorithm (Liang and Wong (2000)). The approximation to the marginal likelihood is given by

    p(y) ≈ exp[ ∑_{i=1}^{n} { yi x′i β̂ − log(1 + e^{x′i β̂}) } ] |c0(X′X)^{−1}|^{1/2} φ(β̂; 0, [X′Ω(β̂)X]^{−1} + c0(X′X)^{−1}),

where φ(y; µ, Σ) denotes the multivariate Gaussian density N(µ, Σ) at y, β̂ is the MLE of β, and Ω(·) is as defined in (2.5).

Figure 7: Comparison of estimates based on the IMR prior with a Gaussian prior N(0, γI), and g-prior N(0, γ(X′X)^{−1}), with γ = 1 and 10^6, for a data set with p = 100. When n ≥ p, estimates based on the IM prior had the lowest bias and MSE; when n < p, they performed equally well as the N(0, I) prior.
We first adapted a procedure proposed in Hoeting et al. (1999) to compare the predictive performance of the three methods. The data was randomly split into halves, and each model selection method was applied on the first half of the data (the "training set", T). The performance was then measured on the second half (the "test set", t) of the data through an approximation to the predictive logarithmic score (Good (1952)), which is given by the sum of the logarithms of the observed ordinates of the predictive density under the model M for each observation in the test set, −∑_{d∈t} log P(d|D_T, M), where d = (y, x) denotes a data point and D_T denotes the training data set. For BMA, the predictive score is measured by −∑_{d∈t} log{ ∑_{M∈M} P(d|D_T, M) p(M|D_T) }; the smaller the score for a model or model average, the better is the predictive performance. The second criterion used for comparing the performance was obtained by generating the fitted probabilities for the test set using regression coefficients β̂^{(T)} estimated from the training set, given by π̂i^{(t)} = exp(Xi β̂^{(T)}) / [1 + exp(Xi β̂^{(T)})]. We then compared the performance of the three methods through their receiver operating characteristic (ROC) curves, based on comparison
to the true value of the ordinates. Figure 8 shows the ROC curves for the IMR prior, BMA using BIC, and BMA using the g-prior (gBMA), for a simulated data set under a logistic regression model with test and training data set sizes of 75, and 100 covariates. IMR and gBMA appeared to have a similar performance (with IMR performing better at the two ends of the curve), and both performed much better than the BMA-BIC method at almost all points. The corresponding predictive logarithmic scores were 53.67 (IMR), 77.79 (BMA-BIC), and 51.43 (gBMA). The
scores from IMR and gBMA are highly comparable, though BMA requires a much higher computational cost in exploring the high dimensional model space (it sampled a total of 3592 distinct models in 50,000 iterations of EMC). Interestingly, when BMA with the g-prior was restricted to the top 100 models sampled, the score increased to 197.65, making it extremely inaccurate.
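For a single fitted model, the predictive logarithmic score above reduces to a few lines of R, given coefficients estimated on the training set (a sketch; beta_hat is whatever point estimate the training-set posterior supplies):

# Predictive log score: -sum of log predictive ordinates over the test set.
predictive_log_score <- function(X_test, y_test, beta_hat) {
  p_hat <- plogis(drop(X_test %*% beta_hat))   # fitted pi_hat, as above
  -sum(dbinom(y_test, size = 1, prob = p_hat, log = TRUE))
}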
0.0
0.2
0.4
0.6
1 − specificity
0.8
1.0
Figure 8: ROC curves comparing the predictive classification performance between IMR (red solid line), BMA
using BIC (green dashed line), and BMA using a g-prior (blue dotted line) for a simulated data set.
5 Analysis of nucleosomal positioning data

5.1 Data description and background
The misregulation of the chromatin structure in DNA is associated with the progression of cancer, aging, and developmental defects (Johnson (2000)). It is known that the accessibility of genetic information in DNA depends on the positioning of the histone proteins packaging the chromatin, forming nucleosomes (Kornberg and Lorch (1999)), which in turn depends upon the underlying DNA sequence. Nucleosomes typically comprise regions of about 147 bp of DNA separated by stretches of "open" DNA (nucleosome-free regions, or NFRs). Nucleosome positioning is known to be influenced by di- and tri-nucleotide repeats (Thastrom, Bingham, and Widom (2004)), but overall, the sequence signals influencing positioning are relatively weak and difficult to detect.

To determine how sequence features affect nucleosome positioning, we obtained data from a genome-wide study of chromatin structure in yeast (Hogan, Lee, and Lieb (2006)). The data consist of normalized log-ratios of intensities measured for a tiled array for chromosome III, consisting of 50-mer oligonucleotide probes that overlap every 20 bp. We first fitted a two-state Gaussian hidden Markov model, or HMM (Juang and Rabiner (1991)), to determine the nucleosomal state for each probe. We then refrained from any further use of the probe-level microarray data, as it was our primary interest to determine whether certain sequence features are predictive of nucleosome and nucleosome-free positions in the genome, which would enable us to make predictions for genomic regions for which experimental data are currently unavailable or difficult to obtain.
5.2 Analysis with the sample size n > dimension p
In order to determine how sequence features might influence the positions of nucleosome-free
regions, we concentrated on a region of about 1400 adjacent probes on yeast chromosome III.
For each probe, the covariate vector was the set of observed frequencies of nucleotide k-tuples,
with k = 1, 2, 3, 4. This led to a total of p = 340 covariates (without including an intercept
in the model). The HMM-based classification gave the observed “state” of each probe, whether
corresponding to a nucleosome-free region (NFR) or a nucleosome (N). The results reported here are with c0 = 1; results with c0 = 10 were essentially similar.
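The covariate construction is easy to reproduce; an illustrative R sketch counting nucleotide k-tuple frequencies in a probe sequence (the function name and example sequence are ours):

# Counts of all nucleotide k-tuples, k = 1,...,4, in a DNA string,
# giving the 4 + 16 + 64 + 256 = 340 covariates used above.
ktuple_counts <- function(seq_string, kmax = 4) {
  chars <- strsplit(tolower(seq_string), "")[[1]]
  unlist(lapply(1:kmax, function(k) {
    tuples <- apply(expand.grid(rep(list(c("a", "c", "g", "t")), k),
                                stringsAsFactors = FALSE),
                    1, paste, collapse = "")
    obs <- sapply(seq_len(length(chars) - k + 1),
                  function(i) paste(chars[i:(i + k - 1)], collapse = ""))
    table(factor(obs, levels = sort(tuples)))
  }))
}

length(ktuple_counts("acgtacgtacggttaacgat"))  # 340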
We carried out ten-fold cross validation to test the predictive power of the model. The set of probes, with the associated covariates, was divided into ten non-overlapping pairs of training sets (90% of probes) and test sets (10% of probes). Each training set thus had a sample size of n = 1260, which is greater than the number of covariates (340). For each training-test set pair, the logistic regression model was first fitted to the training data set (with the three priors: IMR, N(0, I) and N(0, 10^6 I)), and the fitted values of β were used to compute the posterior probabilities of classification into the NFR state for the corresponding test set. The sensitivity and specificity of cross validation using the three different priors were compared, where any region having an estimated posterior probability of 50% or more under the logistic model was classified as an NFR. The IMR prior showed uniformly higher sensitivity, while its specificity was comparable to the other two priors (Table 1). The IMR prior predicted a slightly higher percentage of NFRs than the true percentage, while the other two methods consistently underestimated the number of NFRs. Overall, using the IMR prior was about 16% more accurate in predicting NFRs compared to the next best method. We also compared the results with Bayesian model averaging, using a g-prior and the set of 340 covariates, in which case it performed equally well as the full model with an IMR prior. The computational cost of using BMA was much higher than IMR: 5,000 iterations under this setting took about 104 hours on a 1.261 GHz Intel Pentium III compute node running Red Hat Enterprise Linux 4. In comparison, generating 5000 independent samples from the posterior distribution of β using the IMR prior required less than an hour.
Table 1: Overall sensitivity and specificity of methods using three types of priors, under the full model, and BMA using the g-prior (gBMA), by ten-fold cross validation, on set (a): p = 340, n = 1260 and set (b): p = 1364, n = 1260. %pNFR: percentage of predicted nucleosome-free regions by each method; ave.corr: average percent correct classification = (Sens + Spec)/2, averaged over the ten cross-validation data sets. The average percentage of NFRs in the data sets was 43%. Note: it was not possible to use the gBMA procedure for set (b) due to the massive computational cost.

Set  Prior          Range[E(β|y)]      %pNFR   ave.corr
(a)  IMR            (−0.295, 0.443)    0.708   0.664
     N(0, I)        (−0.192, 0.161)    0.123   0.509
     N(0, 10^6 I)   (−0.188, 0.168)    0.111   0.475
     gBMA           (−8.778, 11.437)   0.569   0.670
(b)  IMR            (−0.518, 0.535)    0.387   0.447
     N(0, I)        (−1.142, 0.496)    0       −
     N(0, 10^6 I)   (−3.424, 2.179)    0       −
Out of the 340 covariates, using the IMR prior, 93 and 91 coefficients had approximate 95% HPD intervals above and below zero, respectively. Among the significant dinucleotides, "aa", "at", "tg", and "tt" had a positive effect on the possibility of being an NFR, while "ac", "ag", "cc", "ct", "gc", and "gg" had the opposite effect. It was previously found that "aa" or "tt" repeats have an effect of making DNA rigid, and thus difficult to form nucleosomes, while "gg" and "cc" lead to less rigid DNA for which it is easier to form nucleosomes (Thastrom et al. (2004)). Thus these results seem reasonable compared to biological knowledge. On the other hand, we see that dinucleotides alone do not seem to have the strongest power to distinguish NFRs from nucleosomal regions, suggesting that the relationship between sequence factors and nucleosome positioning could be more complex than linear.
5.3 Analysis with p > n

Next, we increased the number of predictors to test how far the improved model fit from including the 5-tuple counts would be offset by the increased covariate dimensionality. We repeated the same process for creating the 10 training-test data set pairs as in the earlier case, except that we included k-tuples for k = 1, . . . , 5, leading to p = 1364, with the training set size remaining 1260. We then carried out the ten-fold cross-validation procedure over each training-test set pair. As seen in Table 1, the IMR prior in this case is the only method that could predict even a proportion of the NFRs correctly. However, the overall predictive power using 5-mers decreased, due to a combination of overfitting with sparse data as well as induced bias due to the massive increase in dimensionality.
The above empirical studies mainly illustrate that the use of the IMR prior is beneficial in situations where the number of observations does not significantly exceed, or is even less than, the number of observed covariates, and the lack of prior knowledge prevents selection of a smaller number of covariates in advance of the analysis. We observed some dissimilarities between the sets of covariates found significant in the two situations; however, we suspect the main reason for this is that the covariates are highly collinear (the average correlation between them ranges from −0.7 to 0.9) and the values are highly sparse, especially the counts for k-mers with a large k. The results, however, indicate that the positioning of nucleosomes may indeed depend on a number of underlying sequence factors, and not just a few dinucleotides, as was previously thought. The logistic regression model may be a simplification of the actual relationship between the covariates and response, but is a first step towards modeling a more complex, possibly non-linear relationship with the covariates. Using the logistic model to connect sequence features with nucleosomal state, rather than modeling the probe-level data as an intermediate step, is a direct attempt to determine how sequence features influence positioning. For instance, a model with high predictive power of correct nucleosomal state can provide useful surrogate information for other applications when experimental data, which are expensive to generate, are not available.
6 Discussion
The proposed IM and IMR priors can be viewed as a broad generalization of the "g-prior" (Zellner (1986)) for Gaussian linear models, reducing to Jeffreys' prior as a limiting case. Although the g-prior was originally conceived (and is still most frequently used) in the context of model selection, the proposed priors provide a desirable alternative to Gaussian or improper priors with high-dimensional data in generalized linear models. The IM and IMR priors appear to produce results similar to a diffuse Gaussian prior, but are computationally more stable with collinear variables. They provide a desirable alternative to Jeffreys' prior, being proper for most GLMs, while giving more flexibility than Jeffreys' prior in the choice of tuning parameters and being less subjective than the choice of an arbitrary Gaussian prior. Theoretical and computational properties of the IM and IMR priors were investigated, demonstrating their effectiveness in a variety of situations. The IM and IMR priors for many GLMs are proper and their moment generating functions (MGFs) exist.
Numerical studies demonstrated that the IMR prior, even with the full model, compared favorably with a more complex Bayesian model averaging procedure with a g-prior that involves dimension reduction. With extremely high dimensional data, the BMA procedure became computationally infeasible in our examples. The BMA procedure could also be used with an IMR prior; it would be interesting to explore the possibility of improving variable selection methods in GLMs, as in Hans, Dobra, and West (2007) and Liang et al. (2008), through the use of an IMR prior instead of conventional priors. This would require the ability to compute accurate approximations to the marginal likelihoods, which is a complex problem outside the Gaussian family of priors. Future work is also needed in developing alternative methods for eliciting λ, such as choosing the λ that maximizes the marginal likelihood. Although our current focus is on GLMs, the IMR prior framework can be easily extended to a variety of other models, for instance, to survival and longitudinal models, and to others used in a multitude of scientific applications.
Acknowledgments
We would like to thank two anonymous referees whose valuable insights and suggestions led
to major improvements in the overall clarity and presentation of this paper, and to thank Jason
Lieb and Greg Hogan for making available the yeast nucleosomal array data. This research was
supported in part by the NIGMS grant GM070335.
Appendix
The appendices are available in the online version of the article at
http://www.stat.sinica.tw/statistica.
References
Bedrick, E. J., Christensen, R., and Johnson, W. (1996). A new perspective on priors for generalized linear models. J. Am. Stat. Assoc. 91, 1450–1460.
Berger, J. and Pericchi, L. (1996). The intrinsic Bayes factor for model selection and prediction.
J. Amer. Stat. Assoc. 91, 109–122.
Bollen, K., Ray, S., and Zavisca, J. (2005). A scaled unit information prior approximation to the
Bayes factor. Technical report, SAMSI LVSSS Transition Workshop: Latent Variable Models
in the Social Sciences.
Chen, M. H., Ibrahim, J. G., and Yiannoutsos, C. (1999). Prior elicitation, variable selection and
Bayesian computation for logistic regression models. J. Roy. Stat. Soc. B 61, 223–242.
Gilks, W. R., Thomas, A., and Spiegelhalter, D. J. (1994). A language and program for complex
Bayesian modelling. The Statistician 43, 169–178.
Gilks, W. R., Best, N. G., and Tan, K. K. C. (1995). Adaptive rejection Metropolis sampling.
Applied Statistics 44, 455–472.
Good, I. J. (1952). Rational decisions. J. Roy. Statist. Soc. B. 14, 107–114.
Hans, C., Dobra, A., and West, M. (2007). Shotgun stochastic search for “large p” regression. J.
Amer. Statist. Assoc. 102, 507–516.
Hoerl, A. E. and Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal
problems. Technometrics 12, 55–67.
Hoeting, J. A., Madigan, D., Raftery, A. E., and Volinsky, C. T. (1999). Bayesian model averaging: a tutorial. Statist. Sci. 14, 382–417.
Hogan, G. J., Lee, C.-K., and Lieb, J. D. (2006). Cell cycle-specified fluctuation of nucleosome
occupancy at gene promoters. PLoS Genet 2, e158.
Ibrahim, J. and Chen, M.-H. (2003). Conjugate priors for generalized linear models. Statistica
Sinica 13, 461–476.
Ibrahim, J. G. and Laud, P. W. (1991). On Bayesian analysis of generalized linear models using
Jeffreys’ prior. J. Amer. Statist. Assoc. 86, 981–986.
Johnson, C. A. (2000). Chromatin modification and disease. J Med Genet 37, 905–915.
Juang, B.-H. and Rabiner, L. R. (1991). Hidden Markov models for speech recognition. Technometrics 33, 251–272.
Kass, R. E. (1989). The geometry of asymptotic inference. Statistical Science 4, 188–234.
Kass, R. E. (1990). Data-translated likelihood and Jeffreys’ rules. Biometrika 77, 107–114.
Kornberg, R. D. and Lorch, Y. (1999). Twenty-five years of the nucleosome, fundamental particle
of the eukaryote chromosome. Cell 98, 285–294.
Liang, F. and Wong, W. H. (2000). Evolutionary Monte Carlo: applications to Cp model sampling and change point problem. Statistica Sinica 10, 317–342.
Liang, F., Paulo, R., Molina, G., Clyde, M. A., and Berger, J. O. (2008). Mixtures of g-priors for
Bayesian variable selection. J. Am. Stat. Assoc. 103, 410–423.
Mitchell, T. J. and Beauchamp, J. J. (1988). Bayesian variable selection in linear regression. J.
Am. Stat. Assoc. 83, 1023–1032.
R Development Core Team (2004). R: A language and environment for statistical computing. R
Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.
Schafer, J. L. (1997). Analysis of Incomplete Multivariate Data. Chapman and Hall, London.
Spiegelhalter, D. J. and Smith, A. F. M. (1982). Bayes factors for linear and log-linear models
with vague prior information. J. Roy. Stat. Soc. B 44, 377–387.
Thastrom, A., Bingham, L. M., and Widom, J. (2004). Nucleosomal locations of dominant DNA
sequence motifs for histone-DNA interactions and nucleosome positioning. J Mol Biol 338,
695–709.
Wang, X. (2002). Bayesian Variable Selection for GLM. Ph.D. thesis, The University of Texas
at Austin.
Wang, X. and George, E. I. (2007). Adaptive Bayesian criteria in variable selection for generalized linear models. Statistica Sinica 17, 667–690.
West, M. (2003). Bayesian factor regression models in the “large p, small n” paradigm. Bayesian
Statistics 7, 723–732.
Yeung, K., Bumgarner, R., and Raftery, A. (2005). Bayesian model averaging: Development of
an improved multi-class, gene selection and classification tool for microarray data. Bioinformatics 21, 2394–2402.
Zellner, A. (1986). On assessing prior distributions and Bayesian regression analysis with g-prior distributions. In Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti (Goel, P. K. and Zellner, A., eds.), 233. North-Holland, Amsterdam.
Online Supplementary material for "An Information Matrix Prior for Bayesian Analysis in Generalized Linear Models with High Dimensional Data"

Mayetri Gupta and Joseph G. Ibrahim
Appendix
Proof of Theorem 1

Without loss of generality, let µ0 = 0. We need to show that the prior MGF

    ∫ e^{t′β} πIMR(β) dβ ∝ ∫ |X′Ω(β)X + λI|^{1/2} exp{ −(1/(2c0)) β′(X′Ω(β)X + λI)β + t′β } dβ    (0.1)

exists for some t in a neighborhood including zero. As previously, we have I(β) = X′Ω(β)X. By Corollary 13.7.4 in Harville (1997), we have

    |I(β) + λI| = ∑_{s=0}^{p} λ^s ∑_T |I(β)^{(i1,...,is)}|,    (0.2)

where T = {i1, . . . , is} is an s-dimensional subset of the p positive integers {1, . . . , p}, the summation is over all C(p, s) such subsets, and |I(β)^{(i1,...,is)}| is the determinant of the (p − s) × (p − s)
submatrix of I(β) obtained by leaving out the (i1, . . . , is)-th rows and columns of I(β). Then, using (0.2), we have

    (0.1) = ∫ [ ∑_{s=0}^{p} λ^s ∑_T |I(β)^{(i1,...,is)}| ]^{1/2} exp{ −(1/(2c0)) β′(I(β) + λI)β + t′β } dβ
          ≤ ∫ ∑_{s=0}^{p} λ^{s/2} ∑_T |I(β)^{(i1,...,is)}|^{1/2} exp{ −(1/(2c0)) β′(I(β) + λI)β + t′β } dβ.    (0.3)
Now, by applying the Cauchy-Binet formula, we have

    |I(β)^{(i1,...,is)}| = ∑_{V(T)} c(a_{i1}, . . . , a_{i_{p−s}}) ∏_{j=1}^{p−s} ω_{ij},

where V(T) = {i1, . . . , i_{p−s}}, and c(a_{i1}, . . . , a_{i_{p−s}}) = |X*^{(p−s)}|², with X*^{(p−s)} = (x_{i1}, . . . , x_{i_{p−s}}) being a (p − s) × (p − s) matrix with j-th column x_{ij} (j = 1, . . . , p − s), so that
    (0.3) = ∫ ∑_{s=0}^{p} λ^{s/2} ∑_T ∑_{V(T)} [ c(a_{i1}, . . . , a_{i_{p−s}}) ∏_{j=1}^{p−s} ω_{ij} ]^{1/2} exp{ −(1/(2c0)) β′(I(β) + λI)β + t′β } dβ
          ≤ ∫ ∑_{s=0}^{p} λ^{s/2} ∑_T ∑_{V(T)} c^{1/2}(a_{i1}, . . . , a_{i_{p−s}}) ∏_{j=1}^{p−s} ω_{ij}^{1/2} e^{ −(1/(2c0)) β′(I(β)+λI)β + t′β } dβ.    (0.4)
Now, for any s such that p − s > n, c(a_{i1}, . . . , a_{i_{p−s}}) = 0. So

    (0.4) ≤ ∫ ∑_{s=p−n}^{p} λ^{s/2} ∑_T ∑_{V(T)} c^{1/2}(a_{i1}, . . . , a_{i_{p−s}}) ∏_{j=1}^{p−s} ω_{ij}^{1/2} e^{ −(1/(2c0)) β′(I(β)+λI)β + t′β } dβ.    (0.5)
Now, without loss of generality, taking (i1, . . . , i_{p−s}) = (1, . . . , p − s), the finiteness of (0.5) is equivalent to the finiteness of

    ∫ ∑_{s=p−n}^{p} ∏_{j=1}^{p−s} ω_j^{1/2} exp{ −(1/(2c0)) β′(I(β) + λI)β + t′β } dβ,    (0.6)
as λ and c(a_{i1}, . . . , a_{i_{p−s}}) are constants that do not depend on β. For any positive semi-definite matrices A and B, |A + B| ≥ |A| and x′(A + B)x ≥ x′Ax. So we can simplify (0.6) as

    (0.6) ≤ ∫ ∑_{s=p−n}^{p} ∏_{j=1}^{p−s} [ vj(β) δj²(β) ]^{1/2} exp{ −(λ/(2c0)) β′β + t′β } dβ.    (0.7)
Now let vj(β) = δj(β) = 1 when j = n + 1, . . . , p. Let us construct a p × p matrix X* = (X′ ⋮ x′0)′, where x0 is a (p − n) × p matrix such that X* is positive definite (p.d.); if X is of rank n, this can always be done. Then, make the substitution u = X*β, so that β = (X*)^{−1}u, with Jacobian |J| = |(X*)^{−1}|. Denoting Q = (X*)^{−1}, we have β′β = u′Q′Qu. So, finiteness of (0.7) is equivalent to finiteness of

    ∫ ∏_{j=1}^{p} [ v(θ(uj)) δ²(θ(uj)) ]^{1/2} exp{ −(λ/(2c0)) u′Q′Qu + t′Qu } du.    (0.8)
Now, we need to find a scalar constant M1 > 0 such that u′Q′Qu ≥ M1 u′u for all u. Since Q is p.d., a necessary and sufficient condition for this to hold is |Q′Q| ≥ M1^p, i.e., |X*| ≤ M1^{−p/2}. So the required M1 > 0 exists, as |X*| is bounded above. Next, we have

    (0.8) ≤ ∫ ∏_{j=1}^{p} v(θ(uj))^{1/2} δ(θ(uj)) exp{ −(M1λ/(2c0)) u′u + t′Qu } du
          = ∏_{j=1}^{p} ∫ v(θ(uj))^{1/2} δ(θ(uj)) e^{ −(M1λ/(2c0)) uj² + τj uj } duj,    (0.9)
where τ′ = t′Q. Finally, make the transformation rj = θ(uj), so that uj = θ^{−1}(rj). Then (0.9) becomes a product of p one-dimensional integrals:

    ∏_{j=1}^{p} ∫ v(rj)^{1/2} δ(rj) e^{ −(M1λ/(2c0)) [θ^{−1}(rj)]² + τj θ^{−1}(rj) } (duj/drj) drj
    = ∏_{j=1}^{p} ∫ e^{ −(M1λ/(2c0)) [θ^{−1}(rj)]² + τj θ^{−1}(rj) } [ d²b(rj)/drj² ]^{1/2} drj,    (0.10)

as (d/drj) θ^{−1}(rj) = δ(rj) = drj/duj. So if each integral in (0.10) is finite for some τj in an open interval containing zero, the prior MGF exists. The corollaries immediately follow.
Proof of Corollary 1.2

To see that Corollary 1.2 holds, take λ = 0 in (0.3); then the expression reduces to only the term with s = 0. Continue the proof until Eqn. (0.7), where, instead of augmenting the matrix, we delete the last n − p rows to get a square p × p matrix. The condition then follows.
Proof of Theorem 2
Starting with (0.1) in the proof of Theorem 1, we have
   (0.1) = ∫ { Σ_{s=0}^{p} λ^s Σ_T |I(β)^{(i_1,…,i_s)}| }^{1/2} exp{ −(1/(2c_0)) β′(I(β) + λI)β + t′β } dβ

         ≥ ∫ { λ^s |I(β)^{(i_1,…,i_s)}| }^{1/2} exp{ −(1/(2c_0)) β′(I(β) + λI)β + t′β } dβ,      (0.11)
for every s = 0, . . . , p. So finiteness of (0.11) necessitates finiteness of each integral
   λ^{s/2} ∫ |I(β)^{(i_1,…,i_s)}|^{1/2} exp{ −(1/(2c_0)) β′(I(β) + λI)β + t′β } dβ.      (0.12)
In practice, to disprove existence of the prior MGF, it suffices to show that (0.12) is infinite for any one s ∈ {0, …, p}.
Proof of Corollary 2.1
To see that Corollary 2.1 holds, take λ = 0 in (0.11); the only non-zero term is then the s = 0 term. Continuing the proof in the same way, the necessary condition is the finiteness of the same integral, with s = λ = 0.
Proof of Theorem 3
This proof follows along the same lines as the proof of Theorem 1, with the likelihood term added, and setting w_i = 0 (for i = n + 1, …, p), so that Σ_{i=n+1}^{p} φ^{−1} w_i [y_i θ_i − b(θ_i)] = 0.
Proof of Theorem 4
Existence of MGFs for specific models is shown below through application of Theorems 1–3.
• Binomial with canonical link. Here b(θ) = log(1 + e^θ), so b′′(θ) = e^θ/(1 + e^θ)^2, which is bounded. Thus the sufficient condition (3.3) is satisfied. (A numerical quadrature check of this integral and the probit one below is sketched after this list.)
• Binomial with probit link. For the probit link, the link function is given by θ(η) = log{Φ(η)/(1 − Φ(η))}, so that θ^{−1}(η) = Φ^{−1}(e^η/(1 + e^η)). To check whether the sufficient condition (3.3) holds, we need to check the finiteness of

   ∫_{−∞}^{∞} exp{ −(λM/(2c_0)) [Φ^{−1}(e^r/(1+e^r))]^2 + τ Φ^{−1}(e^r/(1+e^r)) } e^{r/2}/(1 + e^r) dr

      = ∫ exp{ −(λM/(2c_0)) z^2 + τz } φ(z)/√(Φ(z)[1 − Φ(z)]) dz,

substituting z = Φ^{−1}(e^r/(1+e^r)). This is finite, since φ(z)/√(Φ(z)[1 − Φ(z)]) is bounded.
• Poisson with canonical link. Here b(θ) = e^θ ⇒ b′′(θ) = e^θ, and the integral in (3.3) is proportional to E(e^{(1/2+τ)θ}) for a Gaussian θ, which exists; so the sufficient condition (3.3) is satisfied.
• Poisson with identity link. Here θ(η) = log(η) ⇒ θ^{−1}(η) = e^η ⇒ (d/dr) θ^{−1}(r) = e^r. Now,

   ∫_{−∞}^{∞} e^{ −(λM/(2c_0)) e^{2r} + τ e^r } e^{r/2} dr = ∫_{0}^{∞} e^{ −(λM/(2c_0)) u + τ √u } u^{−3/4} du.      (0.13)

The existence of (0.13) is equivalent to the existence of E( e^{τ√u} u^{−1} ) when u ∼ Gamma(p, α), where the shape parameter p = 5/4 and the scale parameter α = λM/(2c_0). If τ < 0, E( e^{τ√u} u^{−1} ) < E(u^{−1}), which exists for any p > 1 (by Corollary 2.1 of Piegorsch and Casella (1985)). If τ > 0, the existence of (0.13) is equivalent to the existence of E(u^{−1}) where u ∼ Gamma(5/4, λM/(2c_0) − τ), which exists as long as τ < λM/(2c_0). So (0.13) exists for any τ ∈ (−∞, λM/(2c_0)), and the sufficient condition (3.3) is satisfied.
• Gamma with canonical link. Here, b(θ) = −log(−θ), so b′′(θ) = 1/θ^2. The sufficient condition would be satisfied by the existence of

   ∫_0^∞ e^{ −(λM/(2c_0)) r^2 + τr } r^{−1} dr = ∫_0^∞ e^{ −(λM/(2c_0)) u + τ√u } u^{−1} du,

which does not exist for τ > 0, since in that case, with u ∼ Exponential(λM/(2c_0)), E( u^{−1} e^{τ√u} ) > E(u^{−1}) = ∞. Since the sufficient condition (3.3) fails, we then check whether the minimum necessary condition holds, by taking p = 1 and checking the (p − 1)-th term in the integral. Since a_{11}(β) = x^2 v_x = x^2/(x^2 β^2) = 1/β^2, the integral in the necessity condition (3.4) for the prior MGF reduces to

   ∫_{−∞}^{∞} β^{−2} e^{ −(1/(2c_0)) (x^2/β^2 + λ) β^2 } dβ = e^{ −x^2/(2c_0) } ∫_{−∞}^{∞} β^{−2} e^{ −(λ/(2c_0)) β^2 } dβ = ∞,

since the second negative moment of a Gaussian distribution is infinite. Hence the necessary condition fails.
• Gamma with log link. With the log link, η = log(μ), or log(−1/θ) = η, which implies that θ^{−1}(η) = −log(−η). So the integral in (3.3) reduces to

   ∫_0^∞ exp{ −(λM/(2c_0)) (log r)^2 + τ(−log r) } (1/r)^{1+1/2} dr = ∫_0^∞ (1/r)^{ (λM/(2c_0)) log r } (1/r)^{ τ + 3/2 } dr,

which is infinite, and hence the sufficient condition does not hold. It is straightforward to check that the necessity condition (3.4) holds. Hence from these conditions we cannot determine whether the prior MGF exists. We will return to this model in Section 3.2 and show that the prior MGF does exist here due to a special result.
• Inverse Gaussian with canonical link. Here b(θ) = −(−2θ)^{1/2} ⇒ b′′(θ) = (−2θ)^{−3/2}, and θ^{−1}(r) = r. So the prior MGF would exist if the integral
   ∫_0^∞ e^{ −(λM/(2c_0)) r^2 + τr } r^{−3/2} dr = ∫_0^∞ e^{ −(λM/(2c_0)) u + τ√u } u^{ −3/4 − 1/2 } du      (0.14)
were finite. Now (0.14) is greater than E(u^{−5/4}) for an exponential distribution with mean 2c_0/(λM) when τ < 0, and hence is infinite. Next we check the necessity condition for p = 1. Here a_{11}(β) = x^{1/2} β^{−3/2}, and the integral to check necessity is
   ∫_0^∞ β^{−3/4} e^{ −(1/(2c_0)) (√x β^{1/2} + λβ^2) + tβ } dβ = ∫_0^∞ u^{ −3/8 − 1/2 } e^{ −(√x/(2c_0)) u^{1/4} + t u^{1/2} − (λ/(2c_0)) u } du

      < ∫_0^∞ u^{−7/8} e^{ −( √x/(2c_0) + t − λ/(2c_0) ) u^{1/4} } du = ∫_0^∞ v^{ −7/2 + 3 } e^{ −( √x/(2c_0) + t − λ/(2c_0) ) v } dv < ∞.
So the sufficiency condition (3.3) is not satisfied, but the necessity condition (3.4) is satisfied, so that we cannot directly determine whether the prior MGF of β exists.
• Posterior MGF existence. The existence of the posterior MGFs can be checked in the same way as that of the prior MGFs, as shown below.
– Binomial with canonical link. The posterior MGF exists, as the sufficient condition (3.5) is satisfied:

   ∫ exp{ −(λM/(2c_0)) r^2 + (τ + φ^{−1}wy) r } e^{r/2}/(1 + e^r)^{ φ^{−1}w + 1 } dr < ∞.
– Gamma with log link. Prior MGF existence is shown in Section 3.2. The posterior MGF also exists, as

   ∫_0^∞ (1/r)^{ (λM/(2c_0)) log r } (1/r)^{ τ + 3/2 − φ^{−1}w } e^{ −φ^{−1}w r y } dr < ∫_0^∞ r^{ −τ − 3/2 + α } e^{ −αry } dr < ∞      (0.15)

for τ + 3/2 − α < 1, by Corollary 2.1 of Piegorsch and Casella (1985). (For a gamma model, φ^{−1} = α, w = 1.) A special case is the exponential model, with α = 1. In this case, it can be seen that the posterior MGF exists, as the integral (0.15) is finite for τ < 1/2.
– Poisson with canonical link. The sufficient condition (3.5) would be satisfied by the finiteness of the integral ∫_{−∞}^{∞} e^{ −(λM/(2c_0)) r^2 + τr + yr − e^r } dr, that is, if E( e^{(τ+y)r − e^r} ) exists for a Gaussian distribution with variance c_0/(λM). Since 0 < e^{−e^r} < 1, E( e^{(τ+y)r − e^r} ) < E( e^{(τ+y)r} ) < ∞, and the posterior MGF exists.
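The model-specific sufficient-condition integrals above are one-dimensional and can be checked by quadrature. The sketch below does this for the binomial model with canonical and probit links; the constant a stands in for λM/(2c_0) and is arbitrary, so this is an illustration rather than the paper's computation:

import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

a = 0.5                                        # stands in for lam*M/(2*c0)

# Canonical (logit) link: weight sqrt(b''(r)) = e^{r/2}/(1 + e^r), computed in logs.
def f_logit(r, tau):
    return np.exp(-a * r**2 + tau * r + r / 2 - np.logaddexp(0.0, r))

# Probit link, after the substitution z = Phi^{-1}(e^r/(1+e^r)):
# integrand exp(-a z^2 + tau z) * phi(z)/sqrt(Phi(z)(1 - Phi(z))), in logs.
def f_probit(z, tau):
    return np.exp(-a * z**2 + tau * z + norm.logpdf(z)
                  - 0.5 * (norm.logcdf(z) + norm.logsf(z)))

for tau in (-0.3, 0.0, 0.3):                   # tau in an open interval around 0
    print(tau,
          quad(f_logit, -np.inf, np.inf, args=(tau,))[0],
          quad(f_probit, -np.inf, np.inf, args=(tau,))[0])   # all finite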
Proof of Theorem 5
Upper bound
From (3.7), Σ^{−1} = (1 + 1/c_0) X′X + (λ/c_0) I_p, and since both terms are positive semi-definite,

   |Σ^{−1}| ≥ |(1 + 1/c_0) X′X| + |(λ/c_0) I_p|.      (0.16)

When p > n, |(1 + 1/c_0) X′X| = 0. Also, from (0.16), |Σ| ≤ (c_0/λ)^p. So as p → ∞, |Σ| → 0 if c_0 < λ.
Lower bound
   |Σ^{−1}| = |(1 + 1/c_0) X′X + (λ/c_0) I_p| = (λ/c_0)^p | I_n + ((c_0 + 1)/λ) XX′ |,

since |A + BC| = |A| |I_k + CA^{−1}B|. By Hadamard's inequality,

   |Σ^{−1}|^2 ≤ (λ/c_0)^{2p} ∏_{i=1}^{n} Σ_{k=1}^{n} { 1 + ((c_0+1)/λ) x_i′x_k }^2.      (0.17)
Now let x_0 be chosen such that x_i′x_k ≤ x_0^2 (for 1 ≤ i, k ≤ n). Then,

   (0.17) ≤ (λ/c_0)^{2p} n^n { 1 + ((c_0+1)/λ) x_0^2 }^{2n}

   ⇒ |Σ| ≥ (c_0/λ)^p { 1 + ((c_0+1)/λ) x_0^2 }^{−n} / n^{n/2}.      (0.18)
From (0.17) and (0.18) it is easy to see that lim_{p→∞} |Σ| = 0 if c_0 < λ. However, if c_0 > λ, and n is fixed, then lim_{p→∞} |Σ| = ∞, which proves the corollary.
Proof of Theorem 6
Bias = E[E(β|Y)] − β = (ΣX′X − I)β, where Σ = { ((c_0+1)/c_0) X′X + (λ/c_0) I_p }^{−1}.
Part 1: Average bias. This is given by

   (1/p) Σ_{i=1}^{p} bias(β_i) = (1/p) J′(ΣX′X − I)β
      = −(1/(p(c_0+1))) { J′ + (λc_0/(c_0+1)) J′ ( X′X + (λ/(c_0+1)) I_p )^{−1} } β.
Now, let β = BJ, where B = diag(|β_1|, …, |β_p|) and J denotes a p-dimensional vector of ones. Now,

   ( X′X + (λ/(c_0+1)) I_p )^{−1} B ≤ ((c_0+1)/λ) B,      (0.19)

since |X′X + (λ/(c_0+1)) I_p| ≥ |(λ/(c_0+1)) I_p|. So (0.19) implies that

   J′ ( X′X + (λ/(c_0+1)) I_p )^{−1} BJ ≤ ((c_0+1)/λ) J′BJ,
so finally,

   ‖ average bias ‖ ≤ (1/(p(c_0+1))) { 1 + ((c_0+1)/λ)(c_0 λ/(c_0+1)) } Σ_{i=1}^{p} |β_i| = (1/p) Σ_{i=1}^{p} |β_i|.
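This bound can be illustrated numerically; the following sketch uses the proof's β = BJ setup (nonnegative entries) with arbitrary dimensions and constants, none taken from the paper:

import numpy as np

rng = np.random.default_rng(4)
n, p, c0, lam = 5, 12, 2.0, 1.0
X = rng.standard_normal((n, p))
beta = np.abs(rng.standard_normal(p))         # beta = B J, as in the proof

Sigma = np.linalg.inv((c0 + 1) / c0 * X.T @ X + lam / c0 * np.eye(p))
avg_bias = np.ones(p) @ (Sigma @ X.T @ X - np.eye(p)) @ beta / p
print(abs(avg_bias) <= np.mean(beta))         # True: the Part 1 bound holds here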
Part 2: Bound on determinant of bias matrix. Let D = ΣX′X − I. Then,

   D = [c_0 X′X + X′X + λI]^{−1} c_0 X′X − I = −[c_0 X′X + X′X + λI]^{−1} (X′X + λI).

So,

   |D| = |X′X + λI| / |c_0 X′X + X′X + λI| ≤ 1,

which again shows that ‖Dβ‖^2 = β′D′Dβ ≤ β′β.
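Since D is symmetric here (a product of two commuting symmetric matrices) with all eigenvalues in [−1, 0), both the determinant bound and the contraction property can be checked directly; a sketch with illustrative constants:

import numpy as np

rng = np.random.default_rng(5)
n, p, c0, lam = 5, 8, 2.0, 1.0
X = rng.standard_normal((n, p))
G = X.T @ X
D = -np.linalg.inv((c0 + 1) * G + lam * np.eye(p)) @ (G + lam * np.eye(p))

beta = rng.standard_normal(p)
print(abs(np.linalg.det(D)) <= 1.0)                        # True
print(np.linalg.norm(D @ beta) <= np.linalg.norm(beta))    # True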
References
Harville, D. A. (1997). Matrix Algebra from a Statistician’s Perspective. Springer-Verlag.
Piegorsch, W. W. and Casella, G. (1985). The existence of the first negative moment. Amer.
Statist., 39(1):60–62.