An Information Matrix Prior for Bayesian Analysis in Generalized Linear Models with High Dimensional Data
Mayetri Gupta∗ and Joseph G. Ibrahim†
Abstract
An important challenge in analyzing high dimensional data in regression settings is that of facing a situation in which the number of covariates p in the model greatly exceeds the sample size n (sometimes termed the "p > n" problem). In this article, we develop a novel specification for a general class of prior distributions, called Information Matrix (IM) priors, for high-dimensional generalized linear models. The priors are first developed for settings in which p < n, and then extended to the p > n case by defining a ridge parameter in the prior construction, leading to the Information Matrix Ridge (IMR) prior. The IM and IMR priors are based on a broad generalization of Zellner's g-prior for Gaussian linear models. Various theoretical properties of the prior and implied posterior are derived, including existence of the prior and posterior moment generating functions, tail behavior, and connections to Gaussian priors and Jeffreys' prior. Several simulation studies and an application to a nucleosomal positioning data set demonstrate the advantages of the new priors over Gaussian priors, as well as g-priors, in high dimensional settings.
KEY WORDS: Fisher Information; g-prior; Importance sampling; Model identifiability; Prior elicitation.
∗ Department of Biostatistics, Boston University, MA 02118, U.S.A. Email: gupta@bu.edu; Phone: 1.617.414.7946; Fax: 1.617.638.6484
† Department of Biostatistics, University of North Carolina at Chapel Hill, NC 27599, U.S.A. Email: ibrahim@bios.unc.edu
1 Introduction
In the analysis of data arising in many scientific applications, one often faces a scenario in which the number of variables (p) greatly exceeds the sample size (n), often termed the "p > n" problem. In these problems, fitting many types of statistical models leads to model nonidentifiability, in which parameters cannot be estimated via maximum likelihood. The specification of proper priors can alleviate such a nonidentifiability problem and lead to proper posterior distributions, as long as one uses a valid probability density for the data, i.e., ∫ f(y|θ) dy < ∞. Specification of proper priors in the p > n context is not easy, since it is desirable that the prior (i) leads to existence of prior and posterior moments, which is not guaranteed and is not easy to check theoretically; (ii) is relatively non-informative, so that the data can essentially drive the inference; (iii) is at least somewhat semi-automatic in nature, requiring relatively little or minimal hyperparameter specification; and (iv) is easy to interpret and computationally feasible.
Properties (i)–(iv) are important to investigate in considering a class of priors for posterior inference. The existence of posterior moments is important since Bayesian inference often relies on the estimation of posterior quantities (e.g., means) via Markov chain Monte Carlo (MCMC) or other sampling procedures. Computing Bayes factors and other model assessment criteria often requires the existence of posterior moments (Chen, Ibrahim, and Yiannoutsos (1999)). Constructing estimates through a black-box use of MCMC, without knowing that these estimates are well defined for the class of priors considered, may lead to nonsensical posterior inference. Noninformativeness is desirable in model selection and model assessment settings, or wherever prior information is not available. The specification of semi-automatic priors has been advocated by Spiegelhalter and Smith (1982), Zellner (1986), Mitchell and Beauchamp (1988), Berger and Pericchi (1996), Bedrick, Christensen, and Johnson (1996), Chen, Ibrahim, and Yiannoutsos (1999), and Ibrahim and Chen (2003), and is attractive in high dimensional settings where it is difficult to find contextual interpretations for all parameters.
In the p > n paradigm, there has been little work on the specification of desirable priors and, in particular, priors that attain (i)–(iv) above. West (2003) specifies singular g-priors (Zellner (1986)) for the Gaussian model in the context of Bayesian factor analysis for p > n. Liang, Paulo, Molina, Clyde, and Berger (2008) advocate the use of mixtures of g-priors to resolve consistency issues in model selection, but their adaptation does not directly extend to cases where p > n. When p > n, Np(0, γI) priors are often not desirable: for small to moderate γ, they are typically too informative, and for large γ they often lead to computationally unstable posteriors, since the model becomes weakly identified (see Section 4). They also do not capture the a priori correlation in the parameters, and eliciting a prior covariance matrix when p > n is extremely difficult.
The class of priors we consider are called Information Matrix (IM) priors. They can be applied in any parametric regression context, but we initially focus on generalized linear models (GLMs). The functional form of the IM prior is obtained once a parametric statistical model is specified for the data, the kernel of the IM prior being specified through the Fisher information matrix. Let (y1, . . . , yn) be independent univariate response variables with density p(yi|X, β, φ), where β is a p × 1 vector of regression coefficients, φ = (φ1, . . . , φq) a vector of dispersion parameters, and X the n × p matrix of covariates with i-th row x′i = (xi1, . . . , xip), where '′' denotes matrix transposition. Let α = (β, φ). Assume for the moment that φ is known, p < n, and rank(X) = p. The likelihood function of β for a regression model with independent responses is L(β) = ∏_{i=1}^{n} p(yi|xi, β), and the (i, j)-th element of the Fisher information matrix is

    Iij(β) = −E[ ∂²/(∂βi ∂βj) log L(β) ],

where the expectation is with respect to y|β, and βi is the i-th component of β (i, j = 1, . . . , p).
Further, assume that the p × p Fisher information matrix, I(β), is non-singular, as is often the case when p < n. The general IM prior is now defined as

    πIM(β) ∝ |I(β)|^{1/2} exp{ −(1/(2c0)) (β − µ0)′ I(β) (β − µ0) },    (1.1)

where µ0 and c0 ≥ 0 are specified location and dispersion hyperparameters. The IM prior (1.1)
captures the prior covariance of β via the Fisher information matrix, which seems an attractive specification since this matrix plays a major role in determining the large sample covariance of β in both Bayesian and frequentist inference. The use of the design matrix X is attractive since X may reveal redundant covariates. The prior (1.1) is semi-automatic, requiring specifications only for µ0 (which can be taken to be 0) and the scalar c0. It should be mentioned that some recent work by Wang and George (2007) and Wang (2002) uses a related form of the prior for β, with two important differences: (i) the form of the prior for all GLMs is taken to be Gaussian, and (ii) the covariance depends on the observed information matrix, rather than the expected one, leading to a data-dependent prior, unlike our specification here. We now focus on the development of these priors and the resultant posterior estimators in the class of GLMs, where many theoretical and computational properties can be characterized for p < n and p > n.
2 IM Priors for Generalized Linear Models
We first consider the IM prior for GLMs when p < n, and where the dispersion parameter φ is known or intrinsically fixed, as for example in the binomial, Poisson, and exponential regression models. For a GLM, yi|xi (i = 1, . . . , n) has a conditional density given by

    p(yi|xi, β, φ) = exp{ a_i^{−1}(φ) (yi θi − b(θi)) + c(yi, φ) },  i = 1, 2, . . . , n,    (2.1)

where the canonical parameter θi satisfies θi = θ(ηi), and a(·), b(·) and c(·) determine a particular family in the class. The function θ(·) is the link function for the GLM, often referred to as the θ-link, and ηi = x′i β. When a canonical link is used, θ(ηi) = ηi. The function ai(φ) is commonly of the form ai(φ) = φ^{−1} w_i^{−1}, where the wi's are known weights. Without loss of generality and for ease of exposition, let φ = 1 and wi = 1. Then (2.1) can be rewritten as

    p(yi|xi, β) = exp{ yi θi − b(θi) + c(yi) },  i = 1, 2, . . . , n.    (2.2)
Now, let X be the n × p matrix of covariates with i-th row x′i, let θ(Xβ) be a component-wise function of Xβ that depends on the link, and let J be an n × 1 vector of ones. Then in matrix form, we can write the likelihood function of β for the GLM in (2.2) as

    p(y|β) = exp{ y′ θ(Xβ) − J′ b(θ(Xβ)) + c(y) }.    (2.3)
For the class of GLMs (with φ = 1 and wi = 1), the Fisher information matrix for β is

    I(β) = X′ Ω(β) X,    (2.4)

where

    Ω(β) = ∆(β) V(β) ∆(β),    (2.5)

V(β) is the n × n diagonal matrix of variance functions with i-th diagonal element vi = v(x′i β) = d²b(θi)/dθi², and ∆(β) is an n × n diagonal matrix of "link adjustments", with i-th diagonal element δi = δ(x′i β) = dθi/dηi. We denote the i-th diagonal element of Ω(β) by ωi.
We now can write the IM prior for β as

    πIM(β) ∝ |X′ Ω(β) X|^{1/2} exp{ −(1/(2c0)) (β − µ0)′ (X′ Ω(β) X) (β − µ0) }.    (2.6)
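For concreteness, the prior (2.6) is straightforward to evaluate numerically. The following is a minimal R sketch for the logistic model, where ωi = e^{x′iβ}/(1 + e^{x′iβ})²; the function name and the simulated design are purely illustrative, not part of the original development.

# Unnormalized log-density of the IM prior (2.6) for a logistic GLM.
# X: n x p design matrix (p < n, full rank); beta, mu0: p-vectors; c0 > 0.
log_im_prior_logistic <- function(beta, X, mu0 = rep(0, ncol(X)), c0 = 1) {
  eta <- drop(X %*% beta)
  omega <- plogis(eta) * plogis(-eta)        # diagonal of Omega(beta)
  I_beta <- crossprod(X * omega, X)          # X' Omega(beta) X
  d <- beta - mu0
  as.numeric(0.5 * determinant(I_beta, logarithm = TRUE)$modulus) -
    drop(t(d) %*% I_beta %*% d) / (2 * c0)
}

set.seed(1)
X <- matrix(rnorm(40), nrow = 20, ncol = 2)  # n = 20, p = 2
log_im_prior_logistic(c(0.5, -0.5), X)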
The prior in (2.6) can be viewed as a generalization of several types of priors. As c0 → ∞, (2.6) converges to Jeffreys' prior for GLMs, given by

    πJ(β) ∝ |X′ Ω(β) X|^{1/2}.    (2.7)

Jeffreys' prior is a popular noninformative prior for Bayesian inference in multiparameter settings involving regression coefficients, due to its invariance and local uniformity properties (Kass (1989, 1990)). Also, as long as X′Ω(β)X is bounded (for example, in the logistic model) as ‖β‖ → ∞, the ratio of prior (2.6) to (2.7) goes to zero, showing that the IM prior has lighter tails than Jeffreys' in the limit. Jeffreys' prior for GLMs is improper for most models, except for binomial regression models (Ibrahim and Laud (1991)). For the Gaussian linear model, (2.6) reduces to Zellner's g-prior (Zellner (1986)). In this case, Ω(β) = σ^{−2}I, where σ² denotes the variance of yi in the Gaussian linear model. For any given GLM, if Ω(β) is a constant matrix free of β, say W, then the IM prior will always be a Gaussian prior, that is, β ∼ Np(µ0, c0(X′WX)^{−1}). For example, for the gamma model with log-link, Ω(β) is a constant (identity) matrix. In this sense, the IM prior can be thought of as a "generalized" g-prior.
The tail behavior of the IM prior can be theoretically compared with the Gaussian prior for specific GLMs. To show that its tails are heavier than those of a Gaussian distribution, we need

    lim_{‖β‖→∞} πIM(β) / φ(β; ν, Σ) = ∞,    (2.8)

where φ(β; ν, Σ) denotes the p-dimensional multivariate normal density with mean ν and covariance matrix Σ. (Showing that the expression in (2.8) goes to zero indicates the tails are lighter
than the Gaussian.) For the binomial regression model and the inverse Gaussian model (with
canonical link), it can be shown that (2.8) holds, so that the IM prior under these models has
heavier tails (Figure 1). For most GLMs, if the IM prior is proper and Ω(β) → 0 elementwise as
||β|| → ∞, then the tails of the IM prior will be heavier than those of Gaussian priors. When the
elements of Ω(β) become infinite as ||β|| → ∞, the IM prior has lighter tails than the Gaussian
prior, as with the Poisson model with canonical link (Figure 2). Although the IM prior for the Poisson model has lighter tails than a Gaussian prior, the IM prior for this model is much flatter
in the “middle” part of the distribution and hence effectively acts as a more noninformative prior.
Figure 1: Density of IM prior under a logistic model with p = 1 compared with Jeffreys’ ( “Jeff”) and a Gaussian
prior (“Norm”) for a simulated data set with n = 20. The plot is shown at three different scales. The total density
under each curve is the same (normalized); however the Gaussian prior has the highest mass near the center while
Jeffreys’ prior has much heavier tails (visible in the third panel). The IM prior has lower mass near the center of
the distribution compared to the Gaussian prior but thicker tails; while it has thinner tails and more central mass
compared to Jeffreys’ prior.
When the dispersion parameter φ is unknown, the IM prior based on α, where α = (β, φ), is
typically not proper for GLMs and has properties that are difficult to characterize. In these cases,
one can construct the IM prior of β conditional on φ, and then specify a proper prior for φ such
as a gamma prior or a product of independent gamma densities (Ibrahim and Laud (1991)).

Figure 2: Density of IM prior under a Poisson model with p = 1 compared with a Gaussian prior for a simulated data set with n = 20, shown at two different scales.

The conditional IM prior of β given φ is defined as
    πIM(β|φ) ∝ |I(β|φ)|^{1/2} exp{ −(1/(2c0)) (β − µ0)′ I(β|φ) (β − µ0) },    (2.9)

where

    Iij(β|φ) = −E[ ∂²/(∂βi ∂βj) log L(β, φ) ],

and L(β, φ) is the likelihood function of (β, φ). We specify the joint prior of (β, φ) as

    π(β, φ) ∝ πIM(β|φ) π(φ),    (2.10)
where π(φ) is a proper prior for φ. When q = 1, π(φ) can be taken to be a gamma prior and, for
a general q, π(φ1 , . . . , φq ) can be assumed to be a product of q independent gamma densities. For
GLMs, following the results of Section 3, it can be shown that (2.10) is jointly proper for (β, φ),
when φ is distributed as a product of gamma densities.
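In code, the joint prior (2.10) is simply a sum of log terms; a brief R sketch for the q = 1 case (the conditional-IM function and the gamma hyperparameters are placeholders, not values prescribed here):

# Joint log-prior (2.10): conditional IM log-prior for beta plus a gamma
# log-prior for a scalar dispersion phi (q = 1 case).
log_joint_prior <- function(beta, phi, log_im_prior_cond, shape = 1, rate = 1) {
  log_im_prior_cond(beta, phi) +
    dgamma(phi, shape = shape, rate = rate, log = TRUE)
}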
2.1 Conditions for the existence of the prior MGF when p < n
We now present theoretical results establishing conditions for the existence of the prior and posterior moment generating functions (MGFs) using the IM prior for GLMs. The proofs of these
results are presented as special cases of more general theorems given in Section 3.
Sufficiency. A sufficient condition for the existence of the prior MGF of β for the IM prior in (2.6) is that

    ∫ exp{ τ θ^{−1}(r) } [ d²b(r)/dr² ]^{1/2} dr < ∞,

for τ in some (−ε, ε). This is derived as a special case of Theorem 1 (Corollary 1.2), and matches the condition for MGF existence for Jeffreys' prior when p < n (Ibrahim and Laud (1991)).
Necessity. A necessary condition for existence of the prior MGF of β for the IM prior in (2.6) is the finiteness of the p-dimensional integral

    ∫ ∏_{j=1}^{p} [ vj(β) δj²(β) ]^{1/2} exp{ −(1/(2c0)) (β − µ0)′ I(β) (β − µ0) + t′β } dβ,    (2.11)

for any t in some p-dimensional sphere about 0, where I(β) = X′Ω(β)X. In many cases, checking (2.11) reduces to a condition involving a one-dimensional integral (Section 3.1). Note that (2.11) is a less stringent condition than that for Jeffreys' prior, allowing prior MGFs to exist for models where Jeffreys' prior is improper. This is discussed further in Section 3.1.
2.2 Existence of the posterior MGF when p < n
A sufficient condition for the existence of the posterior MGF of β for the IM prior in (2.6) is the finiteness of the following one-dimensional integral for τ ∈ (−ε, ε), some ε > 0:

    ∫ exp{ τ θ^{−1}(r) + φ^{−1} w (y r − b(r)) } [ d²b(r)/dr² ]^{1/2} dr.    (2.12)

This is the same condition as for Jeffreys' prior in Ibrahim and Laud (1991). Since the sufficient conditions for the prior and posterior are the same as those for Jeffreys' prior, the examples which satisfy the sufficiency conditions for Jeffreys' prior in Ibrahim and Laud (1991) also satisfy the sufficient conditions here; e.g., the posterior MGFs exist for binomial, Poisson and Gamma GLMs if none of the observed y's is zero.
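For the canonical binomial (logistic) model, θ^{−1}(r) = r and d²b(r)/dr² = e^r/(1 + e^r)², so (2.12) can be verified numerically. A small R check (the values of τ and y are illustrative; φ = w = 1):

# Integrand of (2.12) for the logistic model, written inside a single exp()
# for numerical stability: exp{(tau + y + 1/2) r - 2 log(1 + e^r)}.
f <- function(r, tau = 0.1, y = 1) exp((tau + y + 0.5) * r - 2 * log1p(exp(r)))
integrate(f, lower = -Inf, upper = Inf)  # a finite value, so (2.12) holds here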
3 IM Ridge Priors for GLMs when p > n
When p > n, β is not identifiable in the likelihood function and the MLE of β does not exist.
Specifying a proper prior guarantees a proper posterior in this setting; however, not all proper
priors yield the existence of prior and posterior moments which are often the end objectives in
Bayesian inference. To generalize the IM priors to a proper prior in the p > n case, we introduce
a scalar "ridge" parameter λ in the prior construction, defining the IM Ridge (IMR) prior as

    πIMR(β) ∝ |X′ Ω(β) X + λI|^{1/2} exp{ −(1/(2c0)) (β − µ0)′ (X′ Ω(β) X + λI) (β − µ0) }.    (3.1)
Here λ can be considered a "ridge" parameter, as used in regression models for high dimensional covariates and to reduce the effects of collinearity (Hoerl and Kennard (1970)). With the introduction of λ, the matrix X′Ω(β)X + λI is always positive definite, regardless of the rank of X and the form of Ω, for any GLM. As c0 → ∞ in (3.1), the IMR prior converges to a generalized Jeffreys' prior given by π*J(β) ∝ |X′Ω(β)X + λI|^{1/2}, which is useful in the p > n setting since the usual Jeffreys' prior (2.7) does not exist when X′Ω(β)X is singular. This idea is similar in spirit (but considerably different in form) to the introduction of a constant matrix Λ added to the sample covariance matrix S in the construction of a data-dependent prior for the population covariance matrix Σ in multivariate normal settings (Schafer (1997)).
To understand the IMR prior in (3.1), first consider the IMR prior for the linear model. In this case, the IMR prior for β|σ² is Gaussian and the posterior distribution is also Gaussian. Specifically, consider the linear model Y = Xβ + ε, where ε ∼ Nn(0, σ²I), and p > n. The IMR prior in this case becomes β|σ² ∼ Np(µ0, c0σ²(X′X + λIp)^{−1}), and the posterior distribution of β given σ² is easily shown to be the non-singular Gaussian distribution

    β|y, σ² ∼ N( Σ[ (1/c0)(X′X + λI)µ0 + X′y ], σ²Σ ),    (3.2)

where Σ = [(1 + 1/c0)X′X + (λ/c0)Ip]^{−1}. Theoretical properties of the IMR prior for the linear model are investigated in Section 3.3.
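Since (3.2) is available in closed form, posterior computation for the linear model requires only a p × p solve; a minimal R sketch (σ² assumed known; hyperparameter defaults are illustrative):

# Posterior mean and covariance (3.2) for the linear model under the IMR prior.
imr_posterior_linear <- function(X, y, sigma2 = 1, c0 = 1, lambda = 0.5,
                                 mu0 = rep(0, ncol(X))) {
  p <- ncol(X)
  Sigma <- solve((1 + 1 / c0) * crossprod(X) + (lambda / c0) * diag(p))
  m <- Sigma %*% ((crossprod(X) + lambda * diag(p)) %*% mu0 / c0 +
                    crossprod(X, y))
  list(mean = drop(m), cov = sigma2 * Sigma)
}

# p > n example: the posterior is proper even though X'X is singular.
set.seed(2)
X <- matrix(rnorm(200), nrow = 10, ncol = 20); y <- rnorm(10)
fit <- imr_posterior_linear(X, y)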
3.1 MGF existence for GLMs using the IMR prior
The prior MGF must exist for the posterior MGF to exist, when p > n. We first provide necessary
and sufficient conditions for prior MGF existence.
Theorem 1. A sufficient condition for the existence of the MGF of β based on the prior (3.1) when p > n is that the one-dimensional integral

    ∫ exp{ −(λM/(2c0)) [θ^{−1}(r)]² + τi θ^{−1}(r) } [ d²b(r)/dr² ]^{1/2} dr < ∞    (3.3)

for some τi ∈ (−ε, ε), where rank(X) = n, M is such that |X*| ≤ M^{−p/2}, and X* = (X′ ⋮ x′0)′, with x0 a (p − n) × p matrix selected such that X* is positive definite.
Proof. Proofs of all theorems are given in the Appendices (in the online Article Supplement).
Corollary 1.1. Let ωk(β) denote the k-th diagonal element of Ω(β) in (2.5). If ∑_{k=1}^{n} ωk(β) ≤ 1, the prior MGF of β from (3.1) always exists, as the integral is dominated by a Gaussian MGF.
Corollary 1.2. The sufficient condition (3.3) also holds and is identical in the p < n case and,
additionally, reduces to the sufficient condition for Jeffreys’ prior if λ = 0.
The IMR prior thus can be useful (and more desirable than the IM prior) even in the p < n
case, in the face of high-dimensionality, collinearity, or weak identifiability.
Theorem 2. If the prior MGF exists when p > n, the p-dimensional integral

    ∫ as(β) exp{ −(1/(2c0)) (β − µ0)′ (I(β) + λI) (β − µ0) + t′β } dβ    (3.4)

is finite for some t ∈ (−ε, ε)^p, for s = 0, 1, . . . , p, where as(β) = |I(β)^{(i1,...,is)}| is the determinant of the (p − s) × (p − s) sub-matrix of I(β) formed by leaving out the (i1, . . . , is)-th rows and columns of I(β) = X′Ω(β)X.
Note here that a0(β) = |I(β)|, a_{p−1}(β) = trace(I(β)), and ap(β) = 1. As a working principle, try to check the sufficiency condition first; if it does not hold, check the necessity condition. For the necessity condition (Theorem 2), it is easiest to first check the p = 1 case. Necessity does not hold if there exists no t ∈ (−ε, ε) such that

    ∫ a11(β)^{1/2} exp{ −(1/(2c0)) (β − µ0)² (a11(β) + λ) + tβ } dβ

is finite, where a11(β) = b″(θ) dθ/dη. If this condition is not satisfied, the MGF does not exist. If the necessity condition is satisfied for p = 1, we need to check the condition for larger p; if for any p we find that the necessity condition is not satisfied, the prior MGF does not exist.
Corollary 2.1. The result in Theorem 2 holds when n > p, and simplifies to (2.11) when λ = 0.
Theorem 3. A sufficient condition for the existence of the posterior MGF for πIMR(β) is that

    ∫ exp{ −(M1λ/(2c0)) [θ^{−1}(r)]² + τj θ^{−1}(r) + φ^{−1} wj (yj r − b(r)) } [ d²b(r)/dr² ]^{1/2} dr    (3.5)

is finite for some τj in an open neighborhood about zero, for each j = 1, . . . , n. For j = n + 1, . . . , p, the condition is the same as for the existence of the prior MGF.
Corollary 3.1. When p < n (and λ = 0), a sufficient condition for the existence of the posterior MGF is that

    ∫ exp{ τ θ^{−1}(r) + φ^{−1} w (y r − b(r)) } [ d²b(r)/dr² ]^{1/2} dr    (3.6)

is finite for some τ in an open neighborhood about zero, and that the MLE exists.
Here we additionally require that the MLE exists (i.e., the likelihood function is bounded
above). However, existence of the prior MGF is not necessary.
The next result demonstrates the usage of the necessary and sufficient conditions in some
specific examples of GLMs to determine existence of prior and posterior MGFs. Without loss of
generality, we assume that µ0 = 0.
Theorem 4. (i) The sufficient conditions (3.3) and (3.5) guarantee the existence of prior and posterior MGFs for p > n in the Binomial model with canonical and probit links, and in the Poisson model with canonical and identity links.
(ii) According to condition (3.4), the prior and posterior MGFs do not exist for the Gamma model with canonical link.
(iii) For the Inverse Gaussian model with canonical link, (3.3) is not satisfied, while (3.4) is satisfied; thus it cannot be determined from these conditions alone whether the prior and posterior MGFs exist.

Note, however, that the prior and posterior MGFs do exist for the Gamma model with a log link, due to a result shown in Section 3.2. All derivations are given in the Appendices.
3.2 Connection between IM and g-priors

When the matrix Ω(β) = ∆(β)V(β)∆(β) in the information matrix is independent of β, the IMR prior reduces to a "ridge" g-prior for a Gaussian linear model: it is of Gaussian form, proper, and its MGF exists. This provides a quick way to determine existence of the prior and posterior MGFs for models for which the sufficiency conditions do not hold, but the necessary ones do. Examples are the gamma model with log-link and the Gaussian model with canonical link. For the gamma model with log link, note that b(θ) = −log(−θ), so that b″(θ) = vi(β) = 1/θ²; and θ = −e^{−η}, thus dθ/dη = δi(β) = −θ. The diagonal elements of Ω(β) are vi(β)δi²(β) = 1. Hence Ω(β) reduces to an identity matrix, and πIMR(β) is a Gaussian distribution with mean µ0 and covariance matrix c0(X′X + λI)^{−1}, which is always proper, and for which the MGF exists even if p > n.
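The identity ωi = vi(β)δi²(β) = 1 for the gamma log-link model can also be confirmed by finite differences; a short, purely illustrative R check:

# For the gamma model with log link: theta(eta) = -exp(-eta), b(theta) = -log(-theta).
# Check numerically that b''(theta) * (d theta/d eta)^2 = 1 on a grid of eta.
theta <- function(eta) -exp(-eta)
b     <- function(th)  -log(-th)
eta <- seq(-2, 2, by = 0.5); h <- 1e-4
v     <- (b(theta(eta) + h) - 2 * b(theta(eta)) + b(theta(eta) - h)) / h^2
delta <- (theta(eta + h) - theta(eta - h)) / (2 * h)
round(v * delta^2, 4)  # all entries approximately 1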
3.3 Characterization of "frequentist" bias and variance

To determine the influence of λ and c0 in the IMR prior on the posterior estimates, we first explore the case of Gaussian linear models, where closed forms exist. For simplicity, assume µ0 = 0.

Theorem 5. The posterior covariance matrix of β for the Gaussian linear model,

    σ²Σ = σ² [ (1 + 1/c0) X′X + (λ/c0) Ip ]^{−1},    (3.7)

satisfies

    (c0/(c0 + 1))^{p} x0^{−n} n^{−n/2} (c0/λ)^{p/2} ≤ |Σ| ≤ (c0/λ)^{p/2},

where x0 = max{diag(X′X)}, if c0 > λ.

Corollary 5.1. If p → ∞, c0 > λ, and n is finite, then |Σ| → ∞.
Theorem 6. The posterior bias of β, for the IMR prior, satisfies

    (1/p) ∑_{i=1}^{p} bias(βi) ≤ (1/p) ∑_{i=1}^{p} |βi|.

Equivalently, in terms of the bias matrix D = ΣX′X − I, ‖Dβ‖² = β′D′Dβ ≤ β′β.
It is also of interest to compare the bias and MSE of estimates arising from the use of the IMR prior to those obtained using a Gaussian prior, N(0, c0 Ip). For simplicity, assume σ² = 1 and βi = 1 for i = 1, . . . , p. With DIM and DN denoting the bias matrices of the IMR and Gaussian priors, respectively, it can be shown that the ratio of their determinants is

    |DIM| / |DN| = [ |{(1 + c0)X′X + λI}^{−1} X′X c0 − I| · |X′X + λI| · |X′X + c0 I| ] / [ |{(X′X + c0 I)^{−1} X′X − I}^{−1} X′X c0 − I| · |(1 + c0)X′X + λI| · |c0 I| ],    (3.8)

and, similarly, the ratio of the determinants of the mean square error (MSE) matrices is

    |MSEIM| / |MSEN| = |(ΣX′X − I)′(ΣX′X − I) + Σ| / |[(X′X + c0 I)^{−1}X′X − I]′[(X′X + c0 I)^{−1}X′X − I] + (X′X + c0 I)^{−1}|,    (3.9)
where Σ is as given in (3.7). Figure 3 depicts the behavior of these ratios for a set of choices of (n, p). The bias ratio (averaged over 5 data sets) decreases with an increase in c0, and the IMR prior is uniformly better when c0 is sufficiently large (> 3), irrespective of whether n is larger or smaller than p. When n ≥ p, the MSE using the IMR prior is uniformly better when c0 is moderately large. However, when p > n, an increase in c0 leads to an increase in the MSE with the IM prior, and a sharper Gaussian prior (with a large bias) is favored in terms of the MSE.
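The IMR-side quantities in these comparisons follow directly from (3.7), Theorem 6, and the numerator of (3.9); a short R sketch with σ² = 1 (the Gaussian-prior counterparts can be coded in the same way):

# Bias matrix D = Sigma X'X - I and MSE matrix D'D + Sigma for the IMR prior
# in the Gaussian linear model, with Sigma as in (3.7).
imr_bias_mse <- function(X, c0, lambda) {
  p <- ncol(X)
  Sigma <- solve((1 + 1 / c0) * crossprod(X) + (lambda / c0) * diag(p))
  D <- Sigma %*% crossprod(X) - diag(p)
  list(log_det_bias = as.numeric(determinant(D)$modulus),                  # log |D_IM|
       log_det_mse  = as.numeric(determinant(crossprod(D) + Sigma)$modulus))  # log |MSE_IM|
}

set.seed(3)
X <- matrix(rnorm(50 * 30), nrow = 50, ncol = 30)
imr_bias_mse(X, c0 = 10, lambda = 0.5)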
Figure 3: Ratio of determinants of bias (panel 1) and MSE (panel 2) for the IMR prior compared to a Gaussian N(0, c0 Ip) prior for the Normal linear model, for (n, p) = (30, 70), (50, 50), and (70, 30).
3.4 Elicitation of λ, µ0 and c0
The parameter λ plays an important role in the construction of the IMR prior, since its introduction makes the prior covariance matrix of β nonsingular regardless of p. One question is whether to take λ fixed or random in the model. Empirical experience suggests that if λ is random, any prior for λ would have to be quite sharp, since there is no information in the data for estimating it. Empirical studies show that taking λ fixed gives similar results with far less computational effort. When taking 0 < λ ≤ 1, posterior inference about β appears quite robust for a wide range of values of λ in (0, 1) (Section 4.2). Note that when c0 is large, λ and the design matrix X have a small impact on the posterior analysis of β. Since our main aim is to develop a relatively noninformative IM prior for Bayesian inference, other practical hyperparameter choices include µ0 = 0 (consistent with Zellner's g-prior) and a moderately large c0 (c0 ≥ 1).
4 Empirical Studies

4.1 Density of the IM prior compared to Jeffreys' and Gaussian priors
We simulated data under a logistic model with sample size n = 20, varying the number of
covariates in the range 1 ≤ p ≤ 500, to get an idea of the behavior of the IMR prior for p < n
and p > n. For the prior πIMR(β), setting µ0 = 0, the posterior density of β is

    p(β|y) ∝ |Σ|^{−1/2} exp{ ∑_{i=1}^{n} yi x′i β − (1/2) β′Σ^{−1}β } / ∏_{i=1}^{n} (1 + e^{x′i β}),    (4.1)

where Σ = Σ(β) = c0[X′Ω(β)X + λIp]^{−1}, and ωi(β) = e^{x′i β}/[1 + e^{x′i β}]² (for i = 1, . . . , n). When p < n and λ = 0, (4.1) reduces to the IM prior. The IM prior was compared to Jeffreys' prior and a Gaussian prior N(0, c0(X′X)^{−1}/4) (i.e., equivalent to setting β = 0 in the IM prior). For the
p = 1 case, we can plot the exact prior densities (normalized computationally to be on the same
scale) over a range of values. Figure 1 shows the three priors superimposed on the same plot at
three scales. The IMR prior lies between Jeffreys’ prior and the Gaussian prior at the center of
the range, has heavier tails than do the others for a a wider range than Jeffreys’ and, at the tails,
converges to a Gaussian prior. We repeated the study with the same n, but set p = 5 and used the
IMR prior. In this case, we cannot analytically derive the marginals, but instead prior and posterior samples are drawn from the respective distributions using Adaptive Rejection Metropolis
Sampling (ARMS) (Gilks, Best, and Tan (1995)), using the HI package in the statistical software
R (R Development Core Team (2004)). Smoothed kernel density estimates based on a Gaussian
kernel are plotted for one of the marginals (Figure 4). This indicates the relative flatness of the IMR prior compared to the posterior density, showing that it is less informative than taking a Gaussian prior. Posterior densities are centered closely around 0.9, the true value of β.
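The log-posterior (4.1) used throughout this section can be written compactly; a minimal R sketch with µ0 = 0 (the function name is ours; plogis and log1p are used for numerical stability):

# Unnormalized log-posterior (4.1) for logistic regression under the IMR prior.
log_post_imr_logistic <- function(beta, X, y, c0 = 1, lambda = 0.5) {
  eta <- drop(X %*% beta)
  omega <- plogis(eta) * plogis(-eta)                   # omega_i(beta)
  Sigma_inv <- (crossprod(X * omega, X) + lambda * diag(ncol(X))) / c0
  # log |Sigma|^{-1/2} = (1/2) log |Sigma^{-1}|
  as.numeric(0.5 * determinant(Sigma_inv, logarithm = TRUE)$modulus) +
    sum(y * eta) - sum(log1p(exp(eta))) -
    0.5 * drop(t(beta) %*% Sigma_inv %*% beta)
}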
Figure 4: Panel 1: Density of IMR prior with p = 5 compared with Jeffreys' prior ("Jeff") and a Gaussian prior ("Norm") for a simulated data set with n = 20. Panel 2: Prior ("pr") and posterior ("po") log-densities based on the three priors.
4.2 Robustness of posterior estimates

Effect of c0 on posterior estimates. We investigated the performance of estimates using the IM prior with different settings of c0 and for values of p from 10 to 500, with a sample size n = 150. Data were generated from a logistic model with a design matrix X = (xij), xij ∼ N(0, 0.1), and coefficient vector β = (β1, . . . , βp), where βj = 0.9 (j = 1, . . . , p). When p > n, the ARMS algorithm did not work well for generating posterior samples of β, the autocorrelations in the samples being quite high. To generate posterior draws, we turned to importance sampling with a multivariate t trial density with mean 0, dispersion matrix Vt, and 3 df, denoted as t3(0, Vt), where Vt = [(1 + 1/g)X′WX + (λ/g)I]^{−1}, and g is a scalar quantity controlling the dispersion. The performance of the estimates worsened as p increased for a fixed n (Figure 5). With an increase in c0, the change in the bias and MSE of estimators was small for p ≤ 200. When p = 500, the variance of the estimator overshadowed the bias, and hence a moderate c0 led to a small MSE, with a negligible change in bias. We tested a range of g between 0.1 and 100, for which smaller values appeared to give more accurate results. Results using g between 0.1 and 5 were virtually indistinguishable. The results reported are for a setting of g = 0.25 and λ = 0.5.
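A sketch of this importance sampler in base R, reusing the log_post_imr_logistic sketch from Section 4.1 (we take W = I/4, the upper bound of the logistic variance function, as an assumption; the paper does not fix W explicitly):

# Importance sampling from (4.1) with a multivariate t_3(0, Vt) trial density,
# Vt = [(1 + 1/g) X'WX + (lambda/g) I]^{-1}.
imr_importance <- function(X, y, n_draws = 5000, g = 0.25, lambda = 0.5,
                           c0 = 1, df = 3) {
  p <- ncol(X)
  Vt_inv <- (1 + 1 / g) * crossprod(X) / 4 + (lambda / g) * diag(p)
  L <- chol(solve(Vt_inv))
  # multivariate t draws: N(0, Vt) scaled by sqrt(df / chi^2_df)
  B <- (matrix(rnorm(n_draws * p), n_draws, p) %*% L) /
    sqrt(rchisq(n_draws, df) / df)
  log_q <- -(df + p) / 2 * log1p(rowSums((B %*% Vt_inv) * B) / df)  # log t density, up to a constant
  log_w <- sapply(seq_len(n_draws),
                  function(i) log_post_imr_logistic(B[i, ], X, y, c0, lambda)) - log_q
  w <- exp(log_w - max(log_w))
  list(draws = B, weights = w / sum(w))   # posterior mean: colSums(draws * weights)
}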
Figure 5: Performance of the IMR prior for different settings of c0, for p = 10, 30, 100, 200, 500.

Robustness to choice of λ. We tested the effect of λ on posterior estimates when p > n. Here we present results from a simulation with data from a logistic model with p = 100, n = 50, c0 = 1, and λ chosen at approximately equal intervals between 0 and 1 (λ > 0). Figure 6 shows
that very small values of λ, close to zero, led to unstable estimates; however, interestingly, the
estimates were remarkably consistent, exhibiting little or no difference over a large range of λ
values, between 0.4 and 1. Since the results were robust to this wide range of λ for both the p < n
and p > n cases, we chose the value λ = 0.5 for all analyses.
Figure 6: Performance of estimates using the IMR prior for different settings of λ.
Effect of the relationship between n and p on posterior estimates. We compared the performance of estimates based on the IMR prior to Gaussian priors and a "g-prior" for the logistic model, as the sample size decreased relative to p. We generated sets of data for a fixed p = 100, while n varied between p/2 = 50 and 2p = 200. Five data sets were generated for each combination of (n, p), while β was generated from a N(β0, c0[X′W(β0)X + λI]) distribution, with β0 = (2 × J50, −2 × J50)′, where Jq denotes a q-dimensional vector of ones.

Sampling from the posterior distribution was done with a multivariate t trial density. In addition to the bias and MSE, we also computed the theoretical "asymptotic covariance" of the estimates from the inverse of the Hessian matrix of the log-posterior, as

    Vas = −[ ∂²/∂β² log p(β|y, x) ]^{−1}.
Vas cannot be evaluated in closed form, so we used a numerical computation routine in R to get an approximate estimate. The results from the IM prior (Figure 7) were compared with estimates using (i) N(0, γI) prior distributions, with γ = 1 for a highly informative prior and γ = 10^6 for an almost non-informative prior (the default used in the software BUGS (Gilks, Thomas, and Spiegelhalter (1994))); and (ii) N(0, γ(X′X)^{−1}) priors for γ = 1, 10^6 (equivalent to the "g-prior" for a normal linear model). We denote the estimates found using the five methods, the IMR prior, Gaussian priors N(0, I) and N(0, 10^6 I), and g-priors N(0, (X′X)^{−1}) and N(0, 10^6(X′X)^{−1}), as IM, NO, NO6, GP, and GP6, respectively. Figure 7 shows that (i) when n > p, the IM prior had the lowest bias, or bias comparable to NO6; (ii) when p > n, the IM prior had bias almost comparable to the NO prior; and (iii) the MSE of estimates based on the IM prior was almost uniformly the lowest. The GP6 and NO6 priors performed badly in terms of MSE, especially when p > n, whereas for larger n, the NO and GP priors appear too informative and hence led to more biased estimates. Overall, the IMR prior appears to be an attractive choice for computing estimates of the regression coefficients in both cases, whether n is larger or smaller than p.
Comparison to Bayesian model averaging predictions. In high-dimensional regression applications, an alternative to estimating the full model is Bayesian model averaging (BMA) of the posterior estimates (Hoeting, Madigan, Raftery, and Volinsky (1999)). We compared predictions using the IMR prior in a full model to those obtained using BMA with a g-prior in a scenario where p > n in logistic regression. Since the g-prior is undefined when p > n, the model averaging procedure was restricted to involve only p < n models. IMR was compared with (i) BMA using BIC, in a stepwise variable selection algorithm (Yeung, Bumgarner, and Raftery (2005)) implemented as the iBMA routine in the R package "BMA", and (ii) BMA under Zellner's g-prior with a marginal likelihood evaluated using the generalized Laplace approximation (Bollen, Ray, and Zavisca (2005)), with model selection through the evolutionary Monte Carlo (EMC)
algorithm (Liang and Wong (2000)). The approximation to the marginal likelihood is given by

    p(y) ≈ exp[ ∑_{i=1}^{n} { yi x′i β̂ − log(1 + e^{x′i β̂}) } ] |c0(X′X)^{−1}|^{1/2} φ(β̂; 0, [X′Ω(β̂)X]^{−1} + c0(X′X)^{−1}),

where φ(y; µ, Σ) denotes the multivariate Gaussian density N(µ, Σ) at y, β̂ is the MLE of β, and Ω(·) is as defined in (2.5).

Figure 7: Comparison of estimates based on the IMR prior with a Gaussian prior N(0, γI), and g-prior N(0, γ(X′X)^{−1}), with γ = 1 and 10^6, for a data set with p = 100. When n ≥ p, estimates based on the IM prior had the lowest bias and MSE; when n < p, they performed equally well as the N(0, I) prior.
We first adapted a procedure proposed in Hoeting et al. (1999) to compare the predictive performance of the three methods. The data was randomly split into halves, and each model selection method was applied on the first half of the data (the "training set", T). The performance was then measured on the second half (the "test set", t) of the data through an approximation to the predictive logarithmic score (Good (1952)), which is given by the sum of the logarithms of the observed ordinates of the predictive density under the model M for each observation in the test set, −∑_{d∈t} log P(d|D_T, M), where d = (y, x) denotes a data point and D_T denotes the training data set. For BMA, the predictive score is measured by −∑_{d∈t} log{ ∑_{M∈M} P(d|D_T, M) p(M|D_T) }; the smaller the score for a model or model average, the better is the predictive performance. The second criterion used for comparing the performance was obtained by generating the fitted probabilities for the test set using regression coefficients β̂^{(T)} estimated from the training set, given by π̂i^{(t)} = exp(Xi β̂^{(T)}) / [1 + exp(Xi β̂^{(T)})]. We then compared the performance of the three methods through their receiver operating characteristic (ROC) curves, based on comparison
to the true value of the ordinates. Figure 8 shows the ROC curves for the IMR prior, BMA using BIC, and BMA using the g-prior (gBMA), for a simulated data set under a logistic regression model with test and training data set sizes of 75, and 100 covariates. IMR and gBMA appeared to have a similar performance (with IMR performing better at the two ends of the curve), and both performed much better than the BMA-BIC method at almost all points. The corresponding predictive logarithmic scores were 53.67 (IMR), 77.79 (BMA-BIC), and 51.43 (gBMA). The
scores from IMR and gBMA are highly comparable, though BMA requires a much higher computational cost in exploring the high dimensional model space (it sampled a total of 3592 distinct models in 50,000 iterations of EMC). Interestingly, when BMA with the g-prior was restricted to the top 100 models sampled, the score increased to 197.65, making it extremely inaccurate.
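For a single fitted model, the predictive logarithmic score above reduces to a few lines of R, given coefficients estimated on the training set (a sketch; beta_hat is whatever point estimate the training-set posterior supplies):

# Predictive log score: -sum of log predictive ordinates over the test set.
predictive_log_score <- function(X_test, y_test, beta_hat) {
  p_hat <- plogis(drop(X_test %*% beta_hat))   # fitted pi_hat, as above
  -sum(dbinom(y_test, size = 1, prob = p_hat, log = TRUE))
}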
0.0
0.2
0.4
0.6
1 − specificity
0.8
1.0
Figure 8: ROC curves comparing the predictive classification performance between IMR (red solid line), BMA
using BIC (green dashed line), and BMA using a g-prior (blue dotted line) for a simulated data set.
5 Analysis of nucleosomal positioning data

5.1 Data description and background
The misregulation of the chromatin structure in DNA is associated with the progression of cancer, aging, and developmental defects (Johnson (2000)). It is known that the accessibility of genetic information in DNA depends on the positioning of the histone proteins packaging the chromatin, forming nucleosomes (Kornberg and Lorch (1999)), which in turn depends upon the underlying DNA sequence. Nucleosomes typically comprise regions of about 147 bp of DNA separated by stretches of "open" DNA (nucleosome-free regions, or NFRs). Nucleosome positioning is known to be influenced by di- and tri-nucleotide repeats (Thastrom, Bingham, and Widom (2004)), but overall, the sequence signals influencing positioning are relatively weak and difficult to detect.

To determine how sequence features affect nucleosome positioning, we obtained data from a genome-wide study of chromatin structure in yeast (Hogan, Lee, and Lieb (2006)). The data consist of normalized log-ratios of intensities measured for a tiled array for chromosome III, consisting of 50-mer oligonucleotide probes that overlap every 20 bp. We first fitted a two-state Gaussian hidden Markov model, or HMM (Juang and Rabiner (1991)), to determine the nucleosomal state for each probe. We then refrained from any further use of the probe-level microarray data, as it was our primary interest to determine whether certain sequence features are predictive of nucleosome and nucleosome-free positions in the genome, which would enable us to make predictions for genomic regions for which experimental data are currently unavailable or difficult to obtain.
5.2 Analysis with the sample size n > dimension p
In order to determine how sequence features might influence the positions of nucleosome-free
regions, we concentrated on a region of about 1400 adjacent probes on yeast chromosome III.
For each probe, the covariate vector was the set of observed frequencies of nucleotide k-tuples,
with k = 1, 2, 3, 4. This led to a total of p = 340 covariates (without including an intercept
in the model). The HMM-based classification gave the observed “state” of each probe, whether
corresponding to a nucleosome-free region (NFR) or a nucleosome (N). The results reported here are with c0 = 1; results with c0 = 10 were essentially similar.
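The covariate construction is easy to reproduce; an illustrative R sketch counting nucleotide k-tuple frequencies in a probe sequence (the function name and example sequence are ours):

# Counts of all nucleotide k-tuples, k = 1,...,4, in a DNA string,
# giving the 4 + 16 + 64 + 256 = 340 covariates used above.
ktuple_counts <- function(seq_string, kmax = 4) {
  chars <- strsplit(tolower(seq_string), "")[[1]]
  unlist(lapply(1:kmax, function(k) {
    tuples <- apply(expand.grid(rep(list(c("a", "c", "g", "t")), k),
                                stringsAsFactors = FALSE),
                    1, paste, collapse = "")
    obs <- sapply(seq_len(length(chars) - k + 1),
                  function(i) paste(chars[i:(i + k - 1)], collapse = ""))
    table(factor(obs, levels = sort(tuples)))
  }))
}

length(ktuple_counts("acgtacgtacggttaacgat"))  # 340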
We carried out ten-fold cross validation to test the predictive power of the model. The set of probes, with the associated covariates, was divided into ten non-overlapping pairs of training sets (90% of probes) and test sets (10% of probes). Each training set thus had a sample size of n = 1260, which is greater than the number of covariates (340). For each training-test set pair, the logistic regression model was first fitted to the training data set (with the three priors: IMR, N(0, I) and N(0, 10^6 I)), and the fitted values of β were used to compute the posterior probabilities of classification into the NFR state for the corresponding test set. The sensitivity and specificity of cross validation using the three different priors were compared, where any region having an estimated posterior probability of 50% or more under the logistic model was classified as an NFR. The IMR prior showed uniformly higher sensitivity, while its specificity was comparable to the other two priors (Table 1). The IMR prior predicted a slightly higher percentage of NFRs than the true percentage, while the other two methods consistently underestimated the number of NFRs. Overall, using the IMR prior was about 16% more accurate in predicting NFRs compared to the next best method. We also compared the results with Bayesian model averaging, using a g-prior and the set of 340 covariates, in which case it performed equally well as the full model with an IMR prior. The computational cost of using BMA was much higher than IMR: 5,000 iterations under this setting took about 104 hours on a 1.261 GHz Intel Pentium III compute node running Red Hat Enterprise Linux 4. In comparison, generating 5000 independent samples from the posterior distribution of β using the IMR prior required less than an hour.
Table 1: Overall sensitivity and specificity of methods using three types of priors, under the full model, and BMA using the g-prior (gBMA), by ten-fold cross validation, on set (a): p = 340, n = 1260 and set (b): p = 1364, n = 1260. %pNFR: percentage of predicted nucleosome-free regions by each method; ave.corr: average percent correct classification = (Sens + Spec)/2, averaged over the ten cross-validation data sets. The average percentage of NFRs in the data sets was 43%. Note: it was not possible to use the gBMA procedure for set (b) due to the massive computational cost.

Set  Prior          Range[E(β|y)]      %pNFR   ave.corr
(a)  IMR            (−0.295, 0.443)    0.708   0.664
     N(0, I)        (−0.192, 0.161)    0.123   0.509
     N(0, 10^6 I)   (−0.188, 0.168)    0.111   0.475
     gBMA           (−8.778, 11.437)   0.569   0.670
(b)  IMR            (−0.518, 0.535)    0.387   0.447
     N(0, I)        (−1.142, 0.496)    0       −
     N(0, 10^6 I)   (−3.424, 2.179)    0       −
Out of the 340 covariates, using the IMR prior, 93 and 91 coefficients had approximate 95% HPD intervals above and below zero, respectively. Among the significant dinucleotides, "aa", "at", "tg", and "tt" had a positive effect on the possibility of being an NFR, while "ac", "ag", "cc", "ct", "gc", and "gg" had the opposite effect. It was previously found that "aa" or "tt" repeats have an effect of making DNA rigid, and thus difficult to form nucleosomes, while "gg" and "cc" lead to less rigid DNA for which it is easier to form nucleosomes (Thastrom et al. (2004)). Thus these results seem reasonable compared to biological knowledge. On the other hand, we see that dinucleotides alone do not seem to have the strongest power to distinguish NFRs from nucleosomal regions, suggesting that the relationship between sequence factors and nucleosome positioning could be more complex than linear.
5.3 Analysis with p > n

Next, we increased the number of predictors to test how far the improved model fit from including the 5-tuple counts would be offset by the increased covariate dimensionality. We repeated the same process for creating the 10 training-test data set pairs as in the earlier case, except that we included k-tuples for k = 1, . . . , 5, leading to p = 1364, with the training set size remaining 1260. We then carried out the ten-fold cross-validation procedure over each training-test set pair. As seen in Table 1, the IMR prior in this case is the only method that could predict even a proportion of the NFRs correctly. However, the overall predictive power using 5-mers decreased, due to a combination of overfitting with sparse data as well as induced bias due to the massive increase in dimensionality.
The above empirical studies mainly illustrate that the use of the IMR prior is beneficial in situations where the number of observations does not significantly exceed, or is even less than, the number of observed covariates, and the lack of prior knowledge prevents selection of a smaller number of covariates in advance of the analysis. We observed some dissimilarities between the sets of covariates found significant in the two situations; however, we suspect the main reason for this is that the covariates are highly collinear (the average correlation between them ranges from −0.7 to 0.9) and the values are highly sparse, especially the counts for k-mers with a large k. The results, however, indicate that the positioning of nucleosomes may indeed depend on a number of underlying sequence factors, and not just a few dinucleotides, as was previously thought. The logistic regression model may be a simplification of the actual relationship between the covariates and response, but is a first step towards modeling a more complex, possibly non-linear relationship with the covariates. Using the logistic model to connect sequence features with nucleosomal state, rather than modeling the probe-level data as an intermediate step, is a direct attempt to determine how sequence features influence positioning. For instance, a model with high predictive power of correct nucleosomal state can provide useful surrogate information for other applications when experimental data, which are expensive to generate, are not available.
6 Discussion
The proposed IM and IMR priors can be viewed as a broad generalization of the "g-prior" (Zellner (1986)) for Gaussian linear models, reducing to Jeffreys' prior as a limiting case. Although the g-prior was originally conceived (and is still most frequently used) in the context of model selection, the proposed priors provide a desirable alternative to Gaussian or improper priors with high-dimensional data in generalized linear models. The IM and IMR priors appear to produce results similar to a diffuse Gaussian prior, but are computationally more stable with collinear variables. They provide a desirable alternative to Jeffreys' prior, being proper for most GLMs, while giving more flexibility than Jeffreys' prior in the choice of tuning parameters and being less subjective than the choice of an arbitrary Gaussian prior. Theoretical and computational properties of the IM and IMR priors were investigated, demonstrating their effectiveness in a variety of situations. The IM and IMR priors for many GLMs are proper and their moment generating functions (MGFs) exist.
Numerical studies demonstrated that the IMR prior, even with the full model, compared favorably with a more complex Bayesian model averaging procedure with a g-prior that involves dimension reduction. With extremely high dimensional data, the BMA procedure became computationally infeasible in our examples. The BMA procedure could also be used with an IMR prior; it would be interesting to explore the possibility of improving variable selection methods in GLMs, as in Hans, Dobra, and West (2007) and Liang et al. (2008), through the use of an IMR prior instead of conventional priors. This would require the ability to compute accurate approximations to the marginal likelihoods, which is a complex problem outside the Gaussian family of priors. Future work is also needed in developing alternative methods for eliciting λ, such as choosing the λ that maximizes the marginal likelihood. Although our current focus is on GLMs, the IMR prior framework can be easily extended to a variety of other models, for instance, to survival and longitudinal models, and to others used in a multitude of scientific applications.
Acknowledgments
We would like to thank two anonymous referees whose valuable insights and suggestions led
to major improvements in the overall clarity and presentation of this paper, and to thank Jason
Lieb and Greg Hogan for making available the yeast nucleosomal array data. This research was
supported in part by the NIGMS grant GM070335.
Appendix
The appendices are available in the online version of the article at
http://www.stat.sinica.tw/statistica.
References
Bedrick, E. J., Christensen, R., and Johnson, W. (1996). A new perspective on priors for generalized linear models. J. Am. Stat. Assoc. 91, 1450–1460.
Berger, J. and Pericchi, L. (1996). The intrinsic Bayes factor for model selection and prediction.
J. Amer. Stat. Assoc. 91, 109–122.
Bollen, K., Ray, S., and Zavisca, J. (2005). A scaled unit information prior approximation to the
Bayes factor. Technical report, SAMSI LVSSS Transition Workshop: Latent Variable Models
in the Social Sciences.
Chen, M. H., Ibrahim, J. G., and Yiannoutsos, C. (1999). Prior elicitation, variable selection and
Bayesian computation for logistic regression models. J. Roy. Stat. Soc. B 61, 223–242.
Gilks, W. R., Thomas, A., and Spiegelhalter, D. J. (1994). A language and program for complex
Bayesian modelling. The Statistician 43, 169–178.
Gilks, W. R., Best, N. G., and Tan, K. K. C. (1995). Adaptive rejection Metropolis sampling.
Applied Statistics 44, 455–472.
Good, I. J. (1952). Rational decisions. J. Roy. Statist. Soc. B. 14, 107–114.
Hans, C., Dobra, A., and West, M. (2007). Shotgun stochastic search for “large p” regression. J.
Amer. Statist. Assoc. 102, 507–516.
Hoerl, A. E. and Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal
problems. Technometrics 12, 55–67.
Hoeting, J. A., Madigan, D., Raftery, A. E., and Volinsky, C. T. (1999). Bayesian model averaging: a tutorial. Statist. Sci. 14, 382–417.
Hogan, G. J., Lee, C.-K., and Lieb, J. D. (2006). Cell cycle-specified fluctuation of nucleosome
occupancy at gene promoters. PLoS Genet 2, e158.
Ibrahim, J. and Chen, M.-H. (2003). Conjugate priors for generalized linear models. Statistica
Sinica 13, 461–476.
Ibrahim, J. G. and Laud, P. W. (1991). On Bayesian analysis of generalized linear models using
Jeffreys’ prior. J. Amer. Statist. Assoc. 86, 981–986.
Johnson, C. A. (2000). Chromatin modification and disease. J Med Genet 37, 905–915.
Juang, B.-H. and Rabiner, L. R. (1991). Hidden Markov models for speech recognition. Technometrics 33, 251–272.
Kass, R. E. (1989). The geometry of asymptotic inference. Statistical Science 4, 188–234.
Kass, R. E. (1990). Data-translated likelihood and Jeffreys’ rules. Biometrika 77, 107–114.
Kornberg, R. D. and Lorch, Y. (1999). Twenty-five years of the nucleosome, fundamental particle
of the eukaryote chromosome. Cell 98, 285–294.
Liang, F. and Wong, W. H. (2000). Evolutionary Monte Carlo: applications to Cp model sampling and change point problem. Statistica Sinica 10, 317–342.
Liang, F., Paulo, R., Molina, G., Clyde, M. A., and Berger, J. O. (2008). Mixtures of g-priors for
Bayesian variable selection. J. Am. Stat. Assoc. 103, 410–423.
Mitchell, T. J. and Beauchamp, J. J. (1988). Bayesian variable selection in linear regression. J.
Am. Stat. Assoc. 83, 1023–1032.
R Development Core Team (2004). R: A language and environment for statistical computing. R
Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.
Schafer, J. L. (1997). Analysis of Incomplete Multivariate Data. Chapman and Hall, London.
Spiegelhalter, D. J. and Smith, A. F. M. (1982). Bayes factors for linear and log-linear models
with vague prior information. J. Roy. Stat. Soc. B 44, 377–387.
Thastrom, A., Bingham, L. M., and Widom, J. (2004). Nucleosomal locations of dominant DNA
sequence motifs for histone-DNA interactions and nucleosome positioning. J Mol Biol 338,
695–709.
Wang, X. (2002). Bayesian Variable Selection for GLM. Ph.D. thesis, The University of Texas
at Austin.
Wang, X. and George, E. I. (2007). Adaptive Bayesian criteria in variable selection for generalized linear models. Statistica Sinica 17, 667–690.
West, M. (2003). Bayesian factor regression models in the “large p, small n” paradigm. Bayesian
Statistics 7, 723–732.
Yeung, K., Bumgarner, R., and Raftery, A. (2005). Bayesian model averaging: Development of
an improved multi-class, gene selection and classification tool for microarray data. Bioinformatics 21, 2394–2402.
Zellner, A. (1986). On assessing prior distributions and Bayesian regression analysis with g-prior distributions. In Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti (Goel, P. K. and Zellner, A., eds.), 233. North-Holland, Amsterdam.
Online Supplementary material for "An Information Matrix Prior for Bayesian Analysis in Generalized Linear Models with High Dimensional Data"

Mayetri Gupta and Joseph G. Ibrahim
Appendix
Proof of Theorem 1

Without loss of generality, let µ0 = 0. We need to show that the prior MGF

    ∫ e^{t′β} πIMR(β) dβ ∝ ∫ |X′Ω(β)X + λI|^{1/2} exp{ −(1/(2c0)) β′(X′Ω(β)X + λI)β + t′β } dβ    (0.1)

exists for some t in a neighborhood including zero. As previously, we have I(β) = X′Ω(β)X. By Corollary 13.7.4 in Harville (1997), we have

    |I(β) + λI| = ∑_{s=0}^{p} λ^s ∑_T |I(β)^{(i1,...,is)}|,    (0.2)

where T = {i1, . . . , is} is an s-dimensional subset of the p positive integers {1, . . . , p}, the summation is over all C(p, s) such subsets, and |I(β)^{(i1,...,is)}| is the determinant of the (p − s) × (p − s)
submatrix of I(β) obtained by leaving out the (i1, . . . , is)-th rows and columns of I(β). Then, using (0.2), we have

    (0.1) = ∫ [ ∑_{s=0}^{p} λ^s ∑_T |I(β)^{(i1,...,is)}| ]^{1/2} exp{ −(1/(2c0)) β′(I(β) + λI)β + t′β } dβ
          ≤ ∫ ∑_{s=0}^{p} λ^{s/2} ∑_T |I(β)^{(i1,...,is)}|^{1/2} exp{ −(1/(2c0)) β′(I(β) + λI)β + t′β } dβ.    (0.3)
Now, by applying the Cauchy-Binet formula, we have

    |I(β)^{(i1,...,is)}| = ∑_{V(T)} c(a_{i1}, . . . , a_{i_{p−s}}) ∏_{j=1}^{p−s} ω_{ij},

where V(T) = {i1, . . . , i_{p−s}}, and c(a_{i1}, . . . , a_{i_{p−s}}) = |X*^{(p−s)}|², with X*^{(p−s)} = (x_{i1}, . . . , x_{i_{p−s}}) being a (p − s) × (p − s) matrix with j-th column x_{ij} (j = 1, . . . , p − s), so that
    (0.3) = ∫ ∑_{s=0}^{p} λ^{s/2} ∑_T ∑_{V(T)} [ c(a_{i1}, . . . , a_{i_{p−s}}) ∏_{j=1}^{p−s} ω_{ij} ]^{1/2} exp{ −(1/(2c0)) β′(I(β) + λI)β + t′β } dβ
          ≤ ∫ ∑_{s=0}^{p} λ^{s/2} ∑_T ∑_{V(T)} c^{1/2}(a_{i1}, . . . , a_{i_{p−s}}) ∏_{j=1}^{p−s} ω_{ij}^{1/2} e^{ −(1/(2c0)) β′(I(β)+λI)β + t′β } dβ.    (0.4)
Now, for any s such that p − s > n, c(a_{i1}, . . . , a_{i_{p−s}}) = 0. So

    (0.4) ≤ ∫ ∑_{s=p−n}^{p} λ^{s/2} ∑_T ∑_{V(T)} c^{1/2}(a_{i1}, . . . , a_{i_{p−s}}) ∏_{j=1}^{p−s} ω_{ij}^{1/2} e^{ −(1/(2c0)) β′(I(β)+λI)β + t′β } dβ.    (0.5)
Now, without loss of generality, taking (i1, . . . , i_{p−s}) = (1, . . . , p − s), the finiteness of (0.5) is equivalent to the finiteness of

    ∫ ∑_{s=p−n}^{p} ∏_{j=1}^{p−s} ω_j^{1/2} exp{ −(1/(2c0)) β′(I(β) + λI)β + t′β } dβ,    (0.6)
as λ and c(a_{i1}, . . . , a_{i_{p−s}}) are constants that do not depend on β. For any positive semi-definite matrices A and B, |A + B| ≥ |A| and x′(A + B)x ≥ x′Ax. So we can simplify (0.6) as

    (0.6) ≤ ∫ ∑_{s=p−n}^{p} ∏_{j=1}^{p−s} [ vj(β) δj²(β) ]^{1/2} exp{ −(λ/(2c0)) β′β + t′β } dβ.    (0.7)
Now let vj(β) = δj(β) = 1 when j = n + 1, . . . , p. Let us construct a p × p matrix X* = (X′ ⋮ x′0)′, where x0 is a (p − n) × p matrix such that X* is positive definite (p.d.); if X is of rank n, this can always be done. Then, make the substitution u = X*β, so that β = (X*)^{−1}u, with Jacobian |J| = |(X*)^{−1}|. Denoting Q = (X*)^{−1}, we have β′β = u′Q′Qu. So, finiteness of (0.7) is equivalent to finiteness of

    ∫ ∏_{j=1}^{p} [ v(θ(uj)) δ²(θ(uj)) ]^{1/2} exp{ −(λ/(2c0)) u′Q′Qu + t′Qu } du.    (0.8)
Now, we need to find a scalar constant M1 > 0 such that u′Q′Qu ≥ M1 u′u for all u. Since Q is p.d., a necessary and sufficient condition for this to hold is |Q′Q| ≥ M1^p, i.e., |X*| ≤ M1^{−p/2}. So the required M1 > 0 exists, as |X*| is bounded above. Next, we have

    (0.8) ≤ ∫ ∏_{j=1}^{p} v(θ(uj))^{1/2} δ(θ(uj)) exp{ −(M1λ/(2c0)) u′u + t′Qu } du
          = ∏_{j=1}^{p} ∫ v(θ(uj))^{1/2} δ(θ(uj)) e^{ −(M1λ/(2c0)) uj² + τj uj } duj,    (0.9)
where τ′ = t′Q. Finally, make the transformation rj = θ(uj), so that uj = θ^{−1}(rj). Then (0.9) becomes a product of p one-dimensional integrals:

    ∏_{j=1}^{p} ∫ v(rj)^{1/2} δ(rj) e^{ −(M1λ/(2c0)) [θ^{−1}(rj)]² + τj θ^{−1}(rj) } (duj/drj) drj
    = ∏_{j=1}^{p} ∫ e^{ −(M1λ/(2c0)) [θ^{−1}(rj)]² + τj θ^{−1}(rj) } [ d²b(rj)/drj² ]^{1/2} drj,    (0.10)

as (d/drj) θ^{−1}(rj) = δ(rj) = drj/duj. So if each integral in (0.10) is finite for some τj in an open interval containing zero, the prior MGF exists. The corollaries immediately follow.
Proof of Corollary 1.2

To see that Corollary 1.2 holds, take λ = 0 in (0.3); then the expression reduces to only the term with s = 0. Continue the proof until Eqn. (0.7), where, instead of augmenting the matrix, we delete the last n − p rows to get a square p × p matrix. The condition then follows.
Proof of Theorem 2
Starting with (0.1) in the proof of Theorem 1, we have
   (0.1) = ∫ { Σ_{s=0}^{p} λ^s Σ_T |I(β)^{(i_1,…,i_s)}| }^{1/2} exp{ −(1/(2c_0)) β′(I(β) + λI)β + t′β } dβ

         ≥ ∫ { λ^s |I(β)^{(i_1,…,i_s)}| }^{1/2} exp{ −(1/(2c_0)) β′(I(β) + λI)β + t′β } dβ,      (0.11)
for every s = 0, . . . , p. So finiteness of (0.11) necessitates finiteness of each integral
   λ^{s/2} ∫ |I(β)^{(i_1,…,i_s)}|^{1/2} exp{ −(1/(2c_0)) β′(I(β) + λI)β + t′β } dβ.      (0.12)
In practice, to disprove existence of the prior MGF, it suffices to show that (0.12) is infinite for any one s ∈ {0, …, p}.
Proof of Corollary 2.1
To see that Corollary 2.1 holds, take λ = 0 in (0.11); the only non-zero term is then the s = 0 term. Continuing the proof in the same way, the necessary condition is the finiteness of the same integral, with s = λ = 0.
Proof of Theorem 3
This proof follows along the same lines as the proof of Theorem 1, with the likelihood term added, and setting w_i = 0 (for i = n + 1, …, p), so that Σ_{i=n+1}^{p} φ^{−1} w_i [y_i θ_i − b(θ_i)] = 0.
Proof of Theorem 4
Existence of MGFs for specific models is shown below through application of Theorems 1–3.
• Binomial with canonical link. Here b(θ) = log(1 + e^θ), so b′′(θ) = e^θ/(1 + e^θ)^2, which is bounded. Thus the sufficient condition (3.3) is satisfied. (A numerical quadrature check of this integral and the probit one below is sketched after this list.)
• Binomial with probit link. For the probit link, the link function is given by θ(η) = log{Φ(η)/(1 − Φ(η))}, so that θ^{−1}(η) = Φ^{−1}(e^η/(1 + e^η)). To check whether the sufficient condition (3.3) holds, we need to check the finiteness of

   ∫_{−∞}^{∞} exp{ −(λM/(2c_0)) [Φ^{−1}(e^r/(1+e^r))]^2 + τ Φ^{−1}(e^r/(1+e^r)) } e^{r/2}/(1 + e^r) dr

      = ∫ exp{ −(λM/(2c_0)) z^2 + τz } φ(z)/√(Φ(z)[1 − Φ(z)]) dz,

substituting z = Φ^{−1}(e^r/(1+e^r)). This is finite, since φ(z)/√(Φ(z)[1 − Φ(z)]) is bounded.
• Poisson with canonical link. Here b(θ) = e^θ ⇒ b′′(θ) = e^θ, and the integral in (3.3) is proportional to E(e^{(1/2+τ)θ}) for a Gaussian θ, which exists; so the sufficient condition (3.3) is satisfied.
• Poisson with identity link. Here θ(η) = log(η) ⇒ θ^{−1}(η) = e^η ⇒ (d/dr) θ^{−1}(r) = e^r. Now,

   ∫_{−∞}^{∞} e^{ −(λM/(2c_0)) e^{2r} + τ e^r } e^{r/2} dr = ∫_{0}^{∞} e^{ −(λM/(2c_0)) u + τ √u } u^{−3/4} du.      (0.13)

The existence of (0.13) is equivalent to the existence of E( e^{τ√u} u^{−1} ) when u ∼ Gamma(p, α), where the shape parameter p = 5/4 and the scale parameter α = λM/(2c_0). If τ < 0, E( e^{τ√u} u^{−1} ) < E(u^{−1}), which exists for any p > 1 (by Corollary 2.1 of Piegorsch and Casella (1985)). If τ > 0, the existence of (0.13) is equivalent to the existence of E(u^{−1}) where u ∼ Gamma(5/4, λM/(2c_0) − τ), which exists as long as τ < λM/(2c_0). So (0.13) exists for any τ ∈ (−∞, λM/(2c_0)), and the sufficient condition (3.3) is satisfied.
• Gamma with canonical link. Here, b(θ) = −log(−θ), so b′′(θ) = 1/θ^2. The sufficient condition would be satisfied by the existence of

   ∫_0^∞ e^{ −(λM/(2c_0)) r^2 + τr } r^{−1} dr = ∫_0^∞ e^{ −(λM/(2c_0)) u + τ√u } u^{−1} du,

which does not exist for τ > 0, since in that case, with u ∼ Exponential(λM/(2c_0)), E( u^{−1} e^{τ√u} ) > E(u^{−1}) = ∞. Since the sufficient condition (3.3) fails, we then check whether the minimum necessary condition holds, by taking p = 1 and checking the (p − 1)-th term in the integral. Since a_{11}(β) = x^2 v_x = x^2/(x^2 β^2) = 1/β^2, the integral in the necessity condition (3.4) for the prior MGF reduces to

   ∫_{−∞}^{∞} β^{−2} e^{ −(1/(2c_0)) (x^2/β^2 + λ) β^2 } dβ = e^{ −x^2/(2c_0) } ∫_{−∞}^{∞} β^{−2} e^{ −(λ/(2c_0)) β^2 } dβ = ∞,

since the second negative moment of a Gaussian distribution is infinite. Hence the necessary condition fails.
• Gamma with log link. With the log link, η = log(μ), or log(−1/θ) = η, which implies that θ^{−1}(η) = −log(−η). So the integral in (3.3) reduces to

   ∫_0^∞ exp{ −(λM/(2c_0)) (log r)^2 + τ(−log r) } (1/r)^{1+1/2} dr = ∫_0^∞ (1/r)^{ (λM/(2c_0)) log r } (1/r)^{ τ + 3/2 } dr,

which is infinite, and hence the sufficient condition does not hold. It is straightforward to check that the necessity condition (3.4) holds. Hence from these conditions we cannot determine whether the prior MGF exists. We will return to this model in Section 3.2 and show that the prior MGF does exist here due to a special result.
• Inverse Gaussian with canonical link. Here b(θ) = −(−2θ)^{1/2} ⇒ b′′(θ) = (−2θ)^{−3/2}, and θ^{−1}(r) = r. So the prior MGF would exist if the integral
   ∫_0^∞ e^{ −(λM/(2c_0)) r^2 + τr } r^{−3/2} dr = ∫_0^∞ e^{ −(λM/(2c_0)) u + τ√u } u^{ −3/4 − 1/2 } du      (0.14)
were finite. Now (0.14) is greater than E(u^{−5/4}) for an exponential distribution with mean 2c_0/(λM) when τ < 0, and hence is infinite. Next we check the necessity condition for p = 1. Here a_{11}(β) = x^{1/2} β^{−3/2}, and the integral to check necessity is
   ∫_0^∞ β^{−3/4} e^{ −(1/(2c_0)) (√x β^{1/2} + λβ^2) + tβ } dβ = ∫_0^∞ u^{ −3/8 − 1/2 } e^{ −(√x/(2c_0)) u^{1/4} + t u^{1/2} − (λ/(2c_0)) u } du

      < ∫_0^∞ u^{−7/8} e^{ −( √x/(2c_0) + t − λ/(2c_0) ) u^{1/4} } du = ∫_0^∞ v^{ −7/2 + 3 } e^{ −( √x/(2c_0) + t − λ/(2c_0) ) v } dv < ∞.
So the sufficiency condition (3.3) is not satisfied, but the necessity condition (3.4) is satisfied, so that we cannot directly determine whether the prior MGF of β exists.
• Posterior MGF existence. The existence of the posterior MGFs can be checked in the same way as that of the prior MGFs, as shown below.
– Binomial with canonical link. The posterior MGF exists, as the sufficient condition (3.5) is satisfied:

   ∫ exp{ −(λM/(2c_0)) r^2 + (τ + φ^{−1}wy) r } e^{r/2}/(1 + e^r)^{ φ^{−1}w + 1 } dr < ∞.
– Gamma with log link. Prior MGF existence is shown in Section 3.2. The posterior MGF also exists, as

   ∫_0^∞ (1/r)^{ (λM/(2c_0)) log r } (1/r)^{ τ + 3/2 − φ^{−1}w } e^{ −φ^{−1}w r y } dr < ∫_0^∞ r^{ −τ − 3/2 + α } e^{ −αry } dr < ∞      (0.15)

for τ + 3/2 − α < 1, by Corollary 2.1 of Piegorsch and Casella (1985). (For a gamma model, φ^{−1} = α, w = 1.) A special case is the exponential model, with α = 1. In this case, it can be seen that the posterior MGF exists, as the integral (0.15) is finite for τ < 1/2.
– Poisson with canonical link. The sufficient condition (3.5) would be satisfied by the finiteness of the integral ∫_{−∞}^{∞} e^{ −(λM/(2c_0)) r^2 + τr + yr − e^r } dr, that is, if E( e^{(τ+y)r − e^r} ) exists for a Gaussian distribution with variance c_0/(λM). Since 0 < e^{−e^r} < 1, E( e^{(τ+y)r − e^r} ) < E( e^{(τ+y)r} ) < ∞, and the posterior MGF exists.
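The model-specific sufficient-condition integrals above are one-dimensional and can be checked by quadrature. The sketch below does this for the binomial model with canonical and probit links; the constant a stands in for λM/(2c_0) and is arbitrary, so this is an illustration rather than the paper's computation:

import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

a = 0.5                                        # stands in for lam*M/(2*c0)

# Canonical (logit) link: weight sqrt(b''(r)) = e^{r/2}/(1 + e^r), computed in logs.
def f_logit(r, tau):
    return np.exp(-a * r**2 + tau * r + r / 2 - np.logaddexp(0.0, r))

# Probit link, after the substitution z = Phi^{-1}(e^r/(1+e^r)):
# integrand exp(-a z^2 + tau z) * phi(z)/sqrt(Phi(z)(1 - Phi(z))), in logs.
def f_probit(z, tau):
    return np.exp(-a * z**2 + tau * z + norm.logpdf(z)
                  - 0.5 * (norm.logcdf(z) + norm.logsf(z)))

for tau in (-0.3, 0.0, 0.3):                   # tau in an open interval around 0
    print(tau,
          quad(f_logit, -np.inf, np.inf, args=(tau,))[0],
          quad(f_probit, -np.inf, np.inf, args=(tau,))[0])   # all finite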
Proof of Theorem 5
Upper bound
From (3.7), Σ^{−1} = (1 + 1/c_0) X′X + (λ/c_0) I_p, and since both terms are positive semi-definite,

   |Σ^{−1}| ≥ |(1 + 1/c_0) X′X| + |(λ/c_0) I_p|.      (0.16)

When p > n, |(1 + 1/c_0) X′X| = 0. Also, from (0.16), |Σ| ≤ (c_0/λ)^p. So as p → ∞, |Σ| → 0 if c_0 < λ.
Lower bound
   |Σ^{−1}| = |(1 + 1/c_0) X′X + (λ/c_0) I_p| = (λ/c_0)^p | I_n + ((c_0 + 1)/λ) XX′ |,

since |A + BC| = |A| |I_k + CA^{−1}B|. By Hadamard's inequality,

   |Σ^{−1}|^2 ≤ (λ/c_0)^{2p} ∏_{i=1}^{n} Σ_{k=1}^{n} { 1 + ((c_0+1)/λ) x_i′x_k }^2.      (0.17)
Now let x_0 be chosen such that x_i′x_k ≤ x_0^2 (for 1 ≤ i, k ≤ n). Then,

   (0.17) ≤ (λ/c_0)^{2p} n^n { 1 + ((c_0+1)/λ) x_0^2 }^{2n}

   ⇒ |Σ| ≥ (c_0/λ)^p { 1 + ((c_0+1)/λ) x_0^2 }^{−n} / n^{n/2}.      (0.18)
From (0.17) and (0.18) it is easy to see that lim_{p→∞} |Σ| = 0 if c_0 < λ. However, if c_0 > λ, and n is fixed, then lim_{p→∞} |Σ| = ∞, which proves the corollary.
Proof of Theorem 6
Bias = E[E(β|Y)] − β = (ΣX′X − I)β, where Σ = { ((c_0+1)/c_0) X′X + (λ/c_0) I_p }^{−1}.
Part 1: Average bias. This is given by

   (1/p) Σ_{i=1}^{p} bias(β_i) = (1/p) J′(ΣX′X − I)β
      = −(1/(p(c_0+1))) { J′ + (λc_0/(c_0+1)) J′ ( X′X + (λ/(c_0+1)) I_p )^{−1} } β.
Now, let β = BJ, where B = diag(|β_1|, …, |β_p|) and J denotes a p-dimensional vector of ones. Now,

   ( X′X + (λ/(c_0+1)) I_p )^{−1} B ≤ ((c_0+1)/λ) B,      (0.19)

since |X′X + (λ/(c_0+1)) I_p| ≥ |(λ/(c_0+1)) I_p|. So (0.19) implies that

   J′ ( X′X + (λ/(c_0+1)) I_p )^{−1} BJ ≤ ((c_0+1)/λ) J′BJ,
so finally,

   ‖ average bias ‖ ≤ (1/(p(c_0+1))) { 1 + ((c_0+1)/λ)(c_0 λ/(c_0+1)) } Σ_{i=1}^{p} |β_i| = (1/p) Σ_{i=1}^{p} |β_i|.
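This bound can be illustrated numerically; the following sketch uses the proof's β = BJ setup (nonnegative entries) with arbitrary dimensions and constants, none taken from the paper:

import numpy as np

rng = np.random.default_rng(4)
n, p, c0, lam = 5, 12, 2.0, 1.0
X = rng.standard_normal((n, p))
beta = np.abs(rng.standard_normal(p))         # beta = B J, as in the proof

Sigma = np.linalg.inv((c0 + 1) / c0 * X.T @ X + lam / c0 * np.eye(p))
avg_bias = np.ones(p) @ (Sigma @ X.T @ X - np.eye(p)) @ beta / p
print(abs(avg_bias) <= np.mean(beta))         # True: the Part 1 bound holds here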
Part 2: Bound on determinant of bias matrix. Let D = ΣX′X − I. Then,

   D = [c_0 X′X + X′X + λI]^{−1} c_0 X′X − I = −[c_0 X′X + X′X + λI]^{−1} (X′X + λI).

So,

   |D| = |X′X + λI| / |c_0 X′X + X′X + λI| ≤ 1,

which again shows that ‖Dβ‖^2 = β′D′Dβ ≤ β′β.
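Since D is symmetric here (a product of two commuting symmetric matrices) with all eigenvalues in [−1, 0), both the determinant bound and the contraction property can be checked directly; a sketch with illustrative constants:

import numpy as np

rng = np.random.default_rng(5)
n, p, c0, lam = 5, 8, 2.0, 1.0
X = rng.standard_normal((n, p))
G = X.T @ X
D = -np.linalg.inv((c0 + 1) * G + lam * np.eye(p)) @ (G + lam * np.eye(p))

beta = rng.standard_normal(p)
print(abs(np.linalg.det(D)) <= 1.0)                        # True
print(np.linalg.norm(D @ beta) <= np.linalg.norm(beta))    # True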
References
Harville, D. A. (1997). Matrix Algebra from a Statistician’s Perspective. Springer-Verlag.
Piegorsch, W. W. and Casella, G. (1985). The existence of the first negative moment. Amer.
Statist., 39(1):60–62.