Bayesian generalized linear models using marginal likelihoods
Jinfang Wang
Graduate School of Science and Technology, Chiba University
1-33 Yayoi-cho, Inage-ku, Chiba 263-8522, Japan
wang@math.s.chiba-u.ac.jp
Summary. We extend the generalized linear model (GLM) by introducing conjugate priors for the canonical parameters. The linear predictors are now linked with
the prior means instead of the sampling means as in the traditional GLM. The
regression parameters are estimated by maximizing either the conditional or the
unconditional marginal likelihood functions. EM-type algorithms are proposed to
numerically estimate these parameters.
Key words: EM-type algorithm, generalized linear model, logistic regression,
marginal likelihood
1 Introduction
We consider independent univariate random variables Y1 , · · · , Yn with densities
f(yj | θj) = exp{ wj [yj θj − ψ(θj)] + c(wj, yj) } ,   j = 1, · · · , n ,   (1)
where wj are known scale parameters. The canonical parameters θj are not known
but are traditionally assumed a priori to satisfy
g(µj) = xj^T β ,   j = 1, · · · , n ,   (2)
where µj = E(Yj | θj) = ψ′(θj) are the sampling means, xj being known p × 1
covariate vectors, and g(·) a pre-specified monotonic link function. The inferential
focus is usually on the p-dimensional structure parameter β. The model (1)–(2)
is known as the generalized linear model (GLM), proposed originally by [NW72] for
dealing with non-normal errors in regression; see [MN89]. This approach for inference
about β assumes, among other things, that the choice of the link function g defined
by (2) is satisfactory to a reasonable extent.
GLM has been extended in various ways to allow for more model flexibility. An
important contribution is due to [BC93], who proposed generalized linear mixed
models by adding a random effect part to the fixed effects in (2). Such Bayesian
approaches have become increasingly popular because posterior analyses can be
carried out using Markov chain Monte Carlo numerical integration techniques. With
a Gaussian prior for β, N (β0 , Σ0 ), where β0 and Σ0 are known, the posterior density
for β is given by
π(β | y1, · · · , yn) ∝ exp{ Σ_{j=1}^n wj [θj yj − ψ(θj)] − (1/2)(β − β0)^T Σ0^{-1}(β − β0) } .
It is seen that a penalty of quadratic form is introduced into the likelihood function.
This approach is taken for example by [DS93].
This paper concerns some Bayesian methods to relax the assumption (2) on the
link function. Instead of using a prior distribution for β, we shall regard β as fixed
and introduce prior distributions for the canonical parameters θj. The vagueness of
these priors is modeled through second-stage hyper-priors. Following [Alb88], we
assume that θ1 , · · · , θn in (1) have independent conjugate prior densities
π(θj | mj, λ) = exp{ λ [mj θj − ψ(θj)] + k(mj, λ) } ,   j = 1, · · · , n ,   (3)
where ψ(θ) is defined in (1), mj and λ are hyperparameters. The densities (3) have
the property that E(µj) = E[ψ′(θj)] = mj. That is, the hyperparameters mj are
prior means of the sampling means, mj = E [E(Yj |θj )]. The hyperparameter λ
in (3), common for j = 1, · · · , n, reflects the strength of one’s prior belief about
m1 , · · · , mn .
Replacing the sometimes unrealistic assumption (2), we now assume that
g(mj) = xj^T β ,   j = 1, · · · , n .   (4)
The three components (1), (3) and (4) constitute a basic Bayesian generalized linear
model (BGLM). In this model the strict equality linking the means µj and the
linear predictors xj^T β, namely g(µj) = xj^T β, is made vague through the specification
of the prior means mj. When λ → ∞, the prior distributions of the sampling
means µj = ψ′(θj) tend to a degenerate distribution having probability mass on
the prior means mj ([Alb88]). Hence, our Bayesian model is identical to the
classical GLM in this limiting case.
To complete the BGLM it remains to model λ. Two approaches are possible.
The first is to treat λ as an unknown constant and estimate it by the empirical Bayes approach, so that ξ^T = (β^T, λ)
is estimated simultaneously from the observed data. The second, and probably
preferable, approach is to assign a second-stage noninformative prior for λ,
λ ∼ π(λ) .   (5)
In this case we define the estimate of β as a value maximizing the unconditional marginal likelihood function. In §2 we discuss both the conditional and the
unconditional marginal likelihood approaches. In §3 we apply the ideas of §2 to
the beta-binomial model. In §4 we discuss an example involving the urinary tract
infections using the Bayesian logistic regression model of §3.
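To fix ideas before turning to the estimation methods, the following sketch simulates data from the three-component hierarchy (1), (3) and (4) in the beta-binomial instance treated later in §3: the covariates determine the prior means mj through the link, the conjugate prior generates the sampling means µj, and the responses are drawn from the binomial sampling density. All variable names and numerical settings below are illustrative assumptions, not part of the paper.

```python
import numpy as np

# A minimal simulation sketch of the BGLM hierarchy (1), (3), (4),
# beta-binomial case of Section 3; all settings are illustrative.
rng = np.random.default_rng(0)
n, beta_true, lam = 200, np.array([-0.5, 1.0]), 25.0   # lam = prior strength lambda
w = rng.integers(5, 20, size=n)                        # binomial indices n_j (weights w_j)

X = np.column_stack([np.ones(n), rng.normal(size=n)])  # covariate vectors x_j
m = 1.0 / (1.0 + np.exp(-X @ beta_true))               # prior means m_j via the logit link (4)
mu = rng.beta(lam * m, lam * (1.0 - m))                # sampling means mu_j from the conjugate prior (3)
y = rng.binomial(w, mu) / w                            # observed proportions y_j from (1)
```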
2 Marginal Likelihood Methods
2.1 Conditional marginal likelihood
The conditional posterior densities of θ1 , · · · , θn have closed forms
π(θj | yj ; β, λ) = exp {(wj + λ) [ηj θj − ψ(θj )] + k(ηj , wj + λ) }
where ηj are the posterior means of µj conditional on yj , and ηj = E(µj |yj ) =
(wj yj + λmj )/(wj + λ) . These posterior distributions depend on β through the
prior means m1, · · · , mn via (4). To estimate ξ = (β^T, λ)^T, we consider the marginal
likelihood functions
f(yj | ξ) = ∫ f(yj | θj) π(θj | mj, λ) dθj ,   j = 1, · · · , n .
Directly evaluating these integrals, we find that
log f (yj | ξ) = c(wj , yj ) + k(mj , λ) − k(ηj , wj + λ) ,
j = 1, · · · , n.
These functions can also be derived using the basic marginal likelihood identity
([Chi95])
f (yj | β, λ) = f (yj | θj ) × π(θj | mj , λ)/π(θj | yj ; β, λ) .
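This identity is easy to check numerically. The sketch below is an illustration only, using the beta-binomial parameterization introduced later in §3 with made-up values of nj, nj yj, mj and λ; it verifies that the ratio on the right-hand side does not depend on θj and coincides with the known beta-binomial marginal.

```python
import numpy as np
from scipy.special import betaln, gammaln
from scipy.stats import betabinom   # known beta-binomial pmf, used only as a cross-check

n_j, k_j, m_j, lam = 12, 7, 0.4, 8.0            # illustrative n_j, n_j*y_j, m_j, lambda
y_j = k_j / n_j                                  # observed proportion
eta_j = (n_j * y_j + lam * m_j) / (n_j + lam)    # posterior mean of mu_j
psi = lambda t: np.log1p(np.exp(t))              # psi(theta) = log(1 + e^theta)
logC = gammaln(n_j + 1) - gammaln(k_j + 1) - gammaln(n_j - k_j + 1)

def log_ratio(theta):
    log_f = n_j * (y_j * theta - psi(theta)) + logC                            # sampling density (1)
    log_prior = lam * (m_j * theta - psi(theta)) - betaln(lam * m_j, lam * (1 - m_j))   # prior (3)
    log_post = (n_j + lam) * (eta_j * theta - psi(theta)) \
               - betaln((n_j + lam) * eta_j, (n_j + lam) * (1 - eta_j))        # conditional posterior
    return log_f + log_prior - log_post

# The ratio is the same for every theta and equals the beta-binomial log pmf.
print([round(log_ratio(t), 6) for t in (-2.0, 0.0, 1.5)])
print(round(betabinom.logpmf(k_j, n_j, lam * m_j, lam * (1 - m_j)), 6))
```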
Now let y = (y1 , · · · , yn ). Summing log f (yj | ξ) over the observations, we obtain
the following joint marginal log-likelihood function
ℓ(y | ξ) = Σ_{j=1}^n [ k(mj, λ) − k(ηj, wj + λ) ] + Σ_{j=1}^n c(wj, yj) .   (6)
Maximizing ℓ(y | ξ) with respect to ξ, we obtain ξ̂, the maximum conditional
marginal likelihood estimator (MCMLE). Note that the second term in (6) is functionally independent of ξ and thus plays no role in estimating β. Note also that the observations y affect the estimation process only through the posterior means η1, · · · , ηn.
We may use Newton-type algorithms to find the MCMLE. To establish notation,
let mj = g^{-1}(xj^T β) = h(xj^T β). Let ḣ(a) = h′(a), k10(a, b) = ∂k(a, b)/∂a, k01(a, b) =
∂k(a, b)/∂b, and kij(a, b) = ∂^{i+j} k(a, b)/∂a^i ∂b^j (i, j ≥ 1). The conditional marginal
likelihood estimating functions for ξ can then be written as
Lβ(y | ξ) = Σ_j { k10(mj, λ) − [λ/(wj + λ)] k10(ηj, wj + λ) } ḣ(xj^T β) xj ,
Lλ(y | ξ) = Σ_j { k01(mj, λ) − k01(ηj, wj + λ) − [wj(mj − yj)/(wj + λ)^2] k10(ηj, wj + λ) } .   (7)
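Rather than coding (7) directly, one can maximize (6) numerically. The sketch below is a minimal illustration; the function names, the use of scipy.optimize.minimize, and the log-λ parameterization are choices made here for the sketch, not part of the paper, and the beta-binomial normalizer of §3 is supplied as the default k.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import betaln

def marginal_loglik(xi, y, X, w, k):
    """Joint conditional marginal log-likelihood (6), omitting the c(w_j, y_j) term."""
    beta, lam = xi[:-1], np.exp(xi[-1])          # work with log(lambda) to keep lambda > 0
    m = 1.0 / (1.0 + np.exp(-X @ beta))          # prior means m_j via the logit link (4)
    eta = (w * y + lam * m) / (w + lam)          # posterior means eta_j
    return np.sum(k(m, lam) - k(eta, w + lam))

# Beta-binomial choice k(m, lambda) = -log B(lambda*m, lambda*(1-m)); cf. Section 3.
def k_betabin(m, lam):
    return -betaln(lam * m, lam * (1.0 - m))

def mcmle(y, X, w, k=k_betabin):
    """Numerical MCMLE of (beta, log lambda) by direct maximization of (6)."""
    xi0 = np.zeros(X.shape[1] + 1)
    res = minimize(lambda xi: -marginal_loglik(xi, y, X, w, k), xi0, method="BFGS")
    return res.x[:-1], np.exp(res.x[-1])
```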
2.2 EM algorithm
The EM algorithm can be useful when directly maximizing ℓ(y | ξ) is difficult. For
example, using a Newton-type method to solve the likelihood equations Lβ (y | ξ) =
0, Lλ (y | ξ) = 0, typically involves computation of the Hessian matrix of the likelihood function, evaluation of which may not be easy.
To develop the algorithm, we begin by considering z = (y^T, θ)^T as the complete
data, while θ = (θ1, · · · , θn) is regarded as missing. Now the complete data likelihood
takes the form
f (z | ξ) = f (y | θ) × π(θ | m, λ) ,
where f (y | θ) and π(θ | m, λ) are the products of (1) and (3) over the observations,
respectively. The parameter β enters into the likelihood through the prior means.
The EM algorithm ([DLR77]) now looks for the MCMLE by iterating
• E-step: Compute Q(ξ | ξ^(t)) = E[ log f(z | ξ) | y, ξ^(t) ].
• M-step: Choose ξ^(t+1) = arg max_{ξ ∈ Ξ} Q(ξ | ξ^(t)).
until convergence. For our BGLM, the complete data log likelihood decomposes into
log f(z | ξ) = log f(y | θ) + log π(θ | ξ) = Σ_{j=1}^n { λ [mj θj − ψ(θj)] + k(mj, λ) } + log f(y | θ) .
The Q function in the E-step can then be expressed as
"(
Q(ξ | ξ (t) ) = λ E
)
n
X
#
| yj , ξ (t)
(mj θj − ψ(θj ))
j=1
+
n
X
n
k(mj , λ) + E log f (y | θ) | y, ξ (t)
o
,
(8)
j=1
where the expectations are taken with respect to the conditional posterior distributions of the θj given the data and ξ^(t). Since the
expectations are conditional on ξ^(t), the parameter ξ enters into the Q function
only through m1, · · · , mn. It follows that the last term on the right-hand side of (8)
is irrelevant in the M-step when we maximize Q(ξ | ξ^(t)) with respect to ξ.
Differentiating Q(ξ | ξ^(t)) with respect to λ and β, and setting the derivatives to zero, we may
then find ξ^(t+1) by solving the following equations for β and λ:
Σ_{j=1}^n { mj E[θj | yj, ξ^(t)] − E[ψ(θj) | yj, ξ^(t)] + k01(mj, λ) } = 0 ,
Σ_{j=1}^n { λ E[θj | yj, ξ^(t)] + k10(mj, λ) } ḣ(xj^T β) xj = 0 ,   (9)
where k10 and k01 are as defined in §2.1. Newton-type methods may
have to be invoked to solve the equations (9).
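For the beta-binomial case of §3 the E-step expectations in (9) have closed forms in terms of the digamma function: with aj = (nj + λ)ηj and bj = (nj + λ)(1 − ηj), E[θj | yj, ξ^(t)] equals digamma(aj) − digamma(bj) and E[ψ(θj) | yj, ξ^(t)] equals digamma(aj + bj) − digamma(bj), since µj | yj is Beta(aj, bj). The sketch below is one possible implementation under these assumptions; the M-step maximizes Q numerically rather than solving (9), and the log-λ parameterization, starting values and iteration count are illustrative choices, not prescriptions from the paper.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import betaln, digamma

def em_mcmle(y, X, w, n_iter=100):
    """EM iteration for the MCMLE in the beta-binomial BGLM (illustrative sketch)."""
    beta, log_lam = np.zeros(X.shape[1]), np.log(10.0)
    for _ in range(n_iter):
        # E-step: posterior expectations of theta_j and psi(theta_j) given y_j, xi^(t)
        lam = np.exp(log_lam)
        m = 1.0 / (1.0 + np.exp(-X @ beta))
        eta = (w * y + lam * m) / (w + lam)
        a, b = (w + lam) * eta, (w + lam) * (1.0 - eta)
        E_theta = digamma(a) - digamma(b)
        E_psi = digamma(a + b) - digamma(b)

        # M-step: maximize Q(xi | xi^(t)) over xi = (beta, log lambda), cf. (8)
        def neg_Q(xi):
            b_, l_ = xi[:-1], np.exp(xi[-1])
            m_ = 1.0 / (1.0 + np.exp(-X @ b_))
            k_ = -betaln(l_ * m_, l_ * (1.0 - m_))
            return -np.sum(l_ * (m_ * E_theta - E_psi) + k_)

        res = minimize(neg_Q, np.append(beta, log_lam), method="BFGS")
        beta, log_lam = res.x[:-1], res.x[-1]
    return beta, np.exp(log_lam)
```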
2.3 Unconditional marginal likelihood
For some models, such as the beta-binomial model to be studied in §3, the empirical
Bayes approach for estimating λ discussed so far may be difficult. Now we introduce a second approach by using (5). Let f (y | λ, β) be the marginal likelihood of β
conditional on λ. We define the unconditional marginal likelihood of β by
f(y | β) = ∫ f(y | λ, β) π(λ) dλ ,   (10)
where the second-stage prior π(λ) is completely specified, a noninformative
prior usually being the choice. The unconditional marginal density f(y | β) is a
mixture of the densities f(y | λ, β) with respect to the prior π. We estimate β by maximizing f(y | β)
to obtain the maximum unconditional marginal likelihood estimator (MUMLE).
The problem of maximizing (10) can be replaced by that of maximizing the
expected logarithm of the marginal likelihood function
G(β) = Eπ[ log f(y | λ, β) ] = ∫ log f(y | λ, β) dΠ(λ) ,
where Π′(λ) = π(λ). Let T(Π) be a functional defined implicitly through
∫ [ ∂ log f(y | λ, β)/∂β ]_{β=T(Π)} dΠ(λ) = 0 .   (11)
If T (Π) in (11) is uniquely defined, then β̂ = T (Π) evaluated at the “true” prior
distribution Π is the value that maximizes the objective function G(β). Since solving
(11) is difficult in general, we now consider an approximate method.
The idea is to use β̃ = T (Π̂) to approximate β̂ = T (Π), where Π̂ is an empirical distribution using i.i.d. values λ1 , · · · , λR from the known noninformative prior
Π(λ). In other words, we maximize the following approximate objective function
Ĝ(β) = ∫ log f(y | λ, β) dΠ̂(λ) = (1/R) Σ_{r=1}^R log f(y | λr, β)
to obtain β̃ = T (Π̂). Note that Ĝ(β) is an unbiased estimate of G(β) for each value
of β, that is, EΠ Ĝ(β) = G(β). The approximate estimate β̃ can be found using
Newton’s method, for example, by solving the following equations
Σ_{r=1}^R ∂ log f(y | λr, β)/∂β = 0 .
Under mild conditions on the hyper-prior Π, we may show that, conditional on
y, β̃ = T(Π̂), based on a sequence of i.i.d. values λ1, · · · , λR from Π, is a consistent
estimator of β̂ = T(Π) as R → ∞.
In order to obtain a reliable estimate β̃, we need to choose an appropriate number
R, an issue common to any Monte Carlo method. With noninformative hyper-priors, we would expect that R need not be very large in order to ensure
a reliable approximation of β̂. In the beta-binomial model of §3, for instance, we
found that R = 10 was sufficient.
Using (6) and omitting a constant not involving β and λ, we can write Ĝ(β) as
Ĝ(β) = Σ_{r=1}^R Σ_{j=1}^n [ k(mj, λr) − k(ηjr, wj + λr) ] ,   (12)
where ηjr = (wj yj + λr mj)/(wj + λr). We can then use standard optimization
techniques to work with the approximate objective function (12).
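A minimal sketch of this Monte Carlo step for the beta-binomial case of §3 is given below; the function names, the use of scipy.optimize.minimize, and the diffuse χ² hyper-prior used to generate the draws λ1, · · · , λR are illustrative assumptions only.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import betaln

def mumle_mc(y, X, w, lam_draws):
    """Approximate MUMLE: maximize G_hat(beta) of (12) over beta."""
    def neg_G_hat(beta):
        m = 1.0 / (1.0 + np.exp(-X @ beta))
        total = 0.0
        for lam in lam_draws:                           # sum over the R hyper-prior draws
            eta = (w * y + lam * m) / (w + lam)
            total += np.sum(betaln((w + lam) * eta, (w + lam) * (1.0 - eta))
                            - betaln(lam * m, lam * (1.0 - m)))
        return -total
    res = minimize(neg_G_hat, np.zeros(X.shape[1]), method="BFGS")
    return res.x

# For example, R = 10 i.i.d. draws from a diffuse hyper-prior (illustrative choice):
# lam_draws = np.random.default_rng(1).chisquare(df=50, size=10)
```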
2.4 MUMLE using empirical Bayes
In cases where it is preferable to estimate Π(λ), we then have to maximize
G(β) = ∫ log f(y | β, λ) dΠ(λ)
without assuming that Π(λ) is known. The EM algorithm is again useful here. First,
we assume that β is known, and Π is supported on J distinct points. Assume that
J is known a priori. At the tth step, we have
πr^(t) = Pr(λ = λr^(t)) ,   r = 1, · · · , J .
The EM algorithm updates λr^(t) and πr^(t) by repeating the following two steps
• E-step: πr^(t+1) ∝ f(y | β, λr^(t)) πr^(t) ,   r = 1, · · · , J ;
• M-step: (λ1^(t+1), · · · , λJ^(t+1)) = arg max Σ_{r=1}^J πr^(t+1) log f(y | β, λr) ;
until convergence, to get
Πβ = { (λ1^β, π1^β), · · · , (λJ^β, πJ^β) } .
We thus arrive at the following EMM algorithm for finding the MUMLE. Assume β^(t) is known. At the (t + 1)th step, compute
1. EM step: obtain Π_{β^(t)}, with πr^{β^(t)} = Pr(λ = λr^{β^(t)}), r = 1, · · · , J, by running the EM iteration above with β = β^(t).
2. M step: β^(t+1) = arg max Σ_{r=1}^J πr^{β^(t)} log f(y | β, λr^{β^(t)}).
Repeat (1) and (2) until convergence.
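The following sketch outlines one way the EMM iteration might be coded for the beta-binomial marginal of §3. It is a simplified, assumption-laden illustration rather than the full algorithm: for simplicity the support points λ1, · · · , λJ are kept on a fixed user-supplied grid and only the weights πr are updated in the inner EM step, and all function names and iteration counts are choices made here.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import betaln

def log_marg(beta, lam, y, X, w):
    """log f(y | beta, lambda), up to a constant (beta-binomial case of Section 3)."""
    m = 1.0 / (1.0 + np.exp(-X @ beta))
    eta = (w * y + lam * m) / (w + lam)
    return np.sum(betaln((w + lam) * eta, (w + lam) * (1.0 - eta))
                  - betaln(lam * m, lam * (1.0 - m)))

def emm(y, X, w, lam_grid, n_outer=20, n_inner=50):
    """EMM sketch: inner EM updates the weights pi_r, outer M-step updates beta."""
    J = len(lam_grid)
    beta, pi = np.zeros(X.shape[1]), np.full(J, 1.0 / J)
    for _ in range(n_outer):
        for _ in range(n_inner):                       # inner EM step (weights only)
            logp = (np.array([log_marg(beta, l, y, X, w) for l in lam_grid])
                    + np.log(np.clip(pi, 1e-300, 1.0)))
            pi = np.exp(logp - logp.max())
            pi /= pi.sum()
        def neg_obj(b):                                # outer M-step over beta
            return -sum(p * log_marg(b, l, y, X, w) for p, l in zip(pi, lam_grid))
        beta = minimize(neg_obj, beta, method="BFGS").x
    return beta, pi
```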
3 Beta-binomial model
Now we apply the marginal likelihood methods to the problem of assessing the effects
of covariates upon a binary outcome. Let y1 , · · · , yn be observed proportions from
independent binomial trials with index parameters nj and success probabilities µj ,
j = 1, · · · , n. The logarithms of the density functions of yj given θj are
log f(yj | θj) = nj [yj θj − ψ(θj)] + log( nj choose nj yj ) ,
where θ = logit µ = log{µ/(1 − µ)} and ψ(θ) = log(1 + e^θ).
We assume that θj follow the conjugate prior distributions
π(θj | mj , λ) = exp {λ[mj θj − ψ(θj )] + k(mj , λ)}
where mj is the prior mean of µj . The normalization constant is given by
k(mj, λ) = − log B(λmj, λ − λmj) ,   (13)
where B(a, b) is a beta function. Further, we assume that the prior means and
the linear predictors are linked by the following equations
logit mj = log{mj/(1 − mj)} = β^T xj ,   j = 1, · · · , n.
The conjugate priors π(θj | mj, λ) with k(mj, λ) given by (13) correspond to assigning to the mean variables µj
the beta priors Be(λmj, λ − λmj), with density functions
π(µj | mj, λ) = µj^{λmj − 1} (1 − µj)^{λ − λmj − 1} / B(λmj, λ − λmj) ,   j = 1, · · · , n .
Note that π has mean mj and variance vj = mj (1 − mj )/(λ + 1). When λ → ∞,
we have vj → 0 so that π will tend to a degenerate distribution concentrated at the
prior mean mj . So in the limit we get the usual logistic model.
The posterior means of µj are given by ηj = (nj yj + λmj )/(nj + λ) .
The marginal distributions of yj are independent beta-binomials. Ignoring a
constant free of β and λ, we can write the marginal log likelihood function as
ℓ(y | ξ) = Σ_j log B[(nj + λ)ηj, (nj + λ)(1 − ηj)] − Σ_j log B[λmj, λ(1 − mj)] .
We may now maximize ℓ(y | ξ) with respect to ξ by solving the likelihood equations. Let φ(z) be the digamma function. Differentiating ℓ(y | ξ) with respect to β
and λ and setting to zero, we can write the likelihood equations as
0 = Σ_{j=1}^n { φ((nj + λ)ηj) − φ((nj + λ)(1 − ηj)) − φ(λmj) + φ(λ(1 − mj)) } Aj ,
0 = Σ_{j=1}^n { mj φ((nj + λ)ηj) + (1 − mj) φ((nj + λ)(1 − ηj)) + φ(λ)
      − φ(nj + λ) − mj φ(λmj) − (1 − mj) φ(λ(1 − mj)) } ,   (14)
where Aj = mj (1 − mj )xj . The maximum conditional marginal likelihood estimator can then be found by solving the above likelihood equations.
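As a hedged illustration, the likelihood equations (14) can be coded directly with scipy.special.digamma and handed to a general root finder; the (β, log λ) parameterization, the starting values, and the use of scipy.optimize.root are choices made here for the sketch, not prescriptions from the paper.

```python
import numpy as np
from scipy.optimize import root
from scipy.special import digamma

def score(xi, y, X, n):
    """Left-hand sides of the likelihood equations (14) for the beta-binomial model."""
    beta, lam = xi[:-1], np.exp(xi[-1])
    m = 1.0 / (1.0 + np.exp(-X @ beta))
    eta = (n * y + lam * m) / (n + lam)
    a, b = (n + lam) * eta, (n + lam) * (1.0 - eta)
    u = digamma(a) - digamma(b) - digamma(lam * m) + digamma(lam * (1.0 - m))
    s_beta = X.T @ (u * m * (1.0 - m))                        # first equation, A_j = m_j(1-m_j)x_j
    s_lam = np.sum(m * digamma(a) + (1.0 - m) * digamma(b) + digamma(lam)
                   - digamma(n + lam) - m * digamma(lam * m)
                   - (1.0 - m) * digamma(lam * (1.0 - m)))    # second equation
    return np.append(s_beta, s_lam)

def mcmle_betabin(y, X, n):
    """Solve (14) for (beta, lambda) with a general-purpose root finder."""
    xi0 = np.append(np.zeros(X.shape[1]), np.log(10.0))
    sol = root(score, xi0, args=(y, X, n))
    return sol.x[:-1], np.exp(sol.x[-1])
```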
On the other hand, the unconditional approach seeks to maximize
Σ_j ∫ log { B[(nj + λ)ηj, (nj + λ)(1 − ηj)] / B[λmj, λ(1 − mj)] } dΠ(λ)
where Π(λ) is a proper prior distribution. To approximate the MUMLE we
maximize the following empirical sum
Σ_{r=1}^R Σ_{j=1}^n log { B[(nj + λr)ηjr, (nj + λr)(1 − ηjr)] / B[λr mj, λr(1 − mj)] } ,
where λ1 , · · · , λR are i.i.d. values from Π(λ), and ηjr = (nj yj + λr mj )/(nj + λr ).
We can use Newton's method to solve the following equation for β:
0 = Σ_{r=1}^R Σ_{j=1}^n { φ((nj + λr)ηjr) − φ((nj + λr)(1 − ηjr)) − φ(λr mj) + φ(λr(1 − mj)) } Aj .
4 An Application to Urinary Tract Infections
We analyzed a data set (LogXact4 Manual, 1999, p. 198, Cytel Software Corporation) from an epidemiological survey concerning urinary tract infections
(UTI) among college women. We applied the beta-binomial model studied in §3. We
assumed that λ has a chi-squared distribution Π(λ) = χ²(ν) with a large number of degrees
of freedom ν. Our results are consistent with some well-established current theories
on possible causes of UTI ([FBN96, FGP95, FMG97]). We shall report our results
elsewhere.
References
[Alb88] Albert, J.H.: Computational methods using a Bayesian hierarchical generalized linear model. J. Amer. Statistical Assoc., 83, 1037–1044 (1988)
[BC93] Breslow, N.E., Clayton, D.G.: Approximate inference in generalized linear mixed models. J. Amer. Statistical Assoc., 88, 9–25 (1993)
[Chi95] Chib, S.: Marginal likelihood from the Gibbs output. J. Amer. Statistical Assoc., 90, 1313–1321 (1995)
[DS93] Dellaportas, P., Smith, A.F.M.: Bayesian inference for generalized linear and proportional hazards models via Gibbs sampling. Appl. Statist., 42, 443–459 (1993)
[DLR77] Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. B, 39, 1–38 (1977)
[FBN96] Fihn, S.D., Boyko, E.J., Normand, E.H., Chen, C., Grafton, J.R., Hunt, M., Yarbro, P., Scholes, D., Stergachis, A.: Association between use of spermicide-coated condoms and Escherichia coli urinary tract infection in young women. Am. J. Epidemiol., 144, 512–520 (1996)
[FGP95] Foxman, B., Geiger, A., Palin, K., Gillespie, B., Koopman, J.S.: First-time urinary tract infection and sexual behavior. Epidemiology, 6, 162–168 (1995)
[FMG97] Foxman, B., Marsh, J.V., Gillespie, B., Rubin, K.N., Koopman, J.S., Spear, S.: Condom use and first-time urinary tract infection. Epidemiology, 8, 637–641 (1997)
[MN89] McCullagh, P., Nelder, J.A.: Generalized Linear Models (2nd ed). Chapman and Hall, London (1989)
[NW72] Nelder, J.A., Wedderburn, R.W.M.: Generalized linear models. J. Roy. Statist. Soc. A, 135, 370–384 (1972)