Bayesian generalized linear models using marginal likelihoods
Jinfang Wang
Graduate School of Science and Technology, Chiba University
1-33 Yayoi-cho, Inage-ku, Chiba 263-8522, Japan
wang@math.s.chiba-u.ac.jp
Summary. We extend the generalized linear model (GLM) by introducing conjugate priors for the canonical parameters. The linear predictors are now linked with
the prior means instead of the sampling means as in the traditional GLM. The
regression parameters are estimated by maximizing either the conditional or the
unconditional marginal likelihood functions. EM-type algorithms are proposed to
numerically estimate these parameters.
Key words: EM-type algorithm, generalized linear model, logistic regression,
marginal likelihood
1 Introduction
We consider independent univariate random variables Y1 , · · · , Yn with densities
f(yj | θj) = exp{ wj [yj θj − ψ(θj)] + c(wj, yj) } ,   j = 1, · · · , n ,   (1)
where wj are known scale parameters. The canonical parameters θj are not known
but are traditionally assumed a priori to satisfy
g(µj) = xj^T β ,   j = 1, · · · , n ,   (2)
where µj = E(Yj | θj) = ψ′(θj) are the sampling means, xj being known p × 1
covariate vectors, and g(·) a pre-specified monotonic link function. The inferential
focus is usually on the p-dimensional structure parameter β. The model (1)–(2)
is known as the generalized linear model (GLM), proposed originally by [NW72] for
dealing with non-normal errors in regression; see [MN89]. This approach for inference
about β assumes, among other things, that the choice of the link function g defined
by (2) is satisfactory to a reasonable extent.
GLM has been extended in various ways to allow for more model flexibility. An
important contribution is due to [BC93], who proposed generalized linear mixed
models by adding a random effect part to the fixed effects in (2). Such Bayesian
approaches have become increasingly popular because posterior analyses can be
carried out using Markov chain Monte Carlo numerical integration techniques. With
a Gaussian prior for β, N (β0 , Σ0 ), where β0 and Σ0 are known, the posterior density
for β is given by
π(β | y1, · · · , yn) ∝ exp{ Σ_{j=1}^n wj [θj yj − ψ(θj)] − (1/2)(β − β0)^T Σ0^{-1}(β − β0) } .
It is seen that a penalty of quadratic form is introduced into the likelihood function.
This approach is taken for example by [DS93].
This paper concerns some Bayesian methods to relax the assumption (2) on the
link function. Instead of using a prior distribution for β, we shall regard β as fixed
and introduce prior distributions for the canonical parameters θj. The vagueness of
these priors is modeled through second-stage hyper-priors. Following [Alb88], we
assume that θ1 , · · · , θn in (1) have independent conjugate prior densities
π(θj | mj, λ) = exp{ λ [mj θj − ψ(θj)] + k(mj, λ) } ,   j = 1, · · · , n ,   (3)
where ψ(θ) is defined in (1), mj and λ are hyperparameters. The densities (3) have
the property that E(µj) = E[ψ′(θj)] = mj. That is, the hyperparameters mj are
prior means of the sampling means, mj = E [E(Yj |θj )]. The hyperparameter λ
in (3), common for j = 1, · · · , n, reflects the strength of one’s prior belief about
m1 , · · · , mn .
Replacing the sometimes unrealistic assumption (2), we now assume that
g(mj) = xj^T β ,   j = 1, · · · , n .   (4)
The three components (1), (3) and (4) constitute a basic Bayesian generalized linear
model (BGLM). In this model the strict equality linking the means µj and the
linear predictors xj^T β, namely g(µj) = xj^T β, is made vague through the specification
of the prior means mj. When λ → ∞, the prior distributions of the sampling
means µj = ψ′(θj) tend to a degenerate distribution having probability mass on
the prior means mj ([Alb88]). Hence, our Bayesian model is identical to the
classical GLM in this limiting case.
To complete the BGLM it remains to model λ. Two approaches are possible.
The first is to treat λ as an unknown constant and estimate it by the empirical Bayes approach, so that ξ^T = (β^T, λ)
is estimated simultaneously from the observed data. The second, and probably
preferable, approach is to assign a second-stage noninformative prior for λ,
λ ∼ π(λ) .   (5)
In this case we define the estimate of β as a value maximizing the unconditional marginal likelihood function. In §2 we discuss both the conditional and the
unconditional marginal likelihood approaches. In §3 we apply the ideas of §2 to
the beta-binomial model. In §4 we discuss an example involving the urinary tract
infections using the Bayesian logistic regression model of §3.
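To fix ideas before turning to the estimation methods, the following sketch simulates data from the three-component hierarchy (1), (3) and (4) in the beta-binomial instance treated later in §3: the covariates determine the prior means mj through the link, the conjugate prior generates the sampling means µj, and the responses are drawn from the binomial sampling density. All variable names and numerical settings below are illustrative assumptions, not part of the paper.

```python
import numpy as np

# A minimal simulation sketch of the BGLM hierarchy (1), (3), (4),
# beta-binomial case of Section 3; all settings are illustrative.
rng = np.random.default_rng(0)
n, beta_true, lam = 200, np.array([-0.5, 1.0]), 25.0   # lam = prior strength lambda
w = rng.integers(5, 20, size=n)                        # binomial indices n_j (weights w_j)

X = np.column_stack([np.ones(n), rng.normal(size=n)])  # covariate vectors x_j
m = 1.0 / (1.0 + np.exp(-X @ beta_true))               # prior means m_j via the logit link (4)
mu = rng.beta(lam * m, lam * (1.0 - m))                # sampling means mu_j from the conjugate prior (3)
y = rng.binomial(w, mu) / w                            # observed proportions y_j from (1)
```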
2 Marginal Likelihood Methods
2.1 Conditional marginal likelihood
The conditional posterior densities of θ1 , · · · , θn have closed forms
π(θj | yj ; β, λ) = exp {(wj + λ) [ηj θj − ψ(θj )] + k(ηj , wj + λ) }
where ηj are the posterior means of µj conditional on yj , and ηj = E(µj |yj ) =
(wj yj + λmj )/(wj + λ) . These posterior distributions depend on β through the
prior means m1, · · · , mn via (4). To estimate ξ = (β^T, λ)^T, we consider the marginal
likelihood functions
f(yj | ξ) = ∫ f(yj | θj) π(θj | mj, λ) dθj ,   j = 1, · · · , n .
Directly evaluating these integrals, we find that
log f (yj | ξ) = c(wj , yj ) + k(mj , λ) − k(ηj , wj + λ) ,
j = 1, · · · , n.
These functions can also be derived using the basic marginal likelihood identity
([Chi95])
f (yj | β, λ) = f (yj | θj ) × π(θj | mj , λ)/π(θj | yj ; β, λ) .
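This identity is easy to check numerically. The sketch below is an illustration only, using the beta-binomial parameterization introduced later in §3 with made-up values of nj, nj yj, mj and λ; it verifies that the ratio on the right-hand side does not depend on θj and coincides with the known beta-binomial marginal.

```python
import numpy as np
from scipy.special import betaln, gammaln
from scipy.stats import betabinom   # known beta-binomial pmf, used only as a cross-check

n_j, k_j, m_j, lam = 12, 7, 0.4, 8.0            # illustrative n_j, n_j*y_j, m_j, lambda
y_j = k_j / n_j                                  # observed proportion
eta_j = (n_j * y_j + lam * m_j) / (n_j + lam)    # posterior mean of mu_j
psi = lambda t: np.log1p(np.exp(t))              # psi(theta) = log(1 + e^theta)
logC = gammaln(n_j + 1) - gammaln(k_j + 1) - gammaln(n_j - k_j + 1)

def log_ratio(theta):
    log_f = n_j * (y_j * theta - psi(theta)) + logC                            # sampling density (1)
    log_prior = lam * (m_j * theta - psi(theta)) - betaln(lam * m_j, lam * (1 - m_j))   # prior (3)
    log_post = (n_j + lam) * (eta_j * theta - psi(theta)) \
               - betaln((n_j + lam) * eta_j, (n_j + lam) * (1 - eta_j))        # conditional posterior
    return log_f + log_prior - log_post

# The ratio is the same for every theta and equals the beta-binomial log pmf.
print([round(log_ratio(t), 6) for t in (-2.0, 0.0, 1.5)])
print(round(betabinom.logpmf(k_j, n_j, lam * m_j, lam * (1 - m_j)), 6))
```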
Now let y = (y1 , · · · , yn ). Summing log f (yj | ξ) over the observations, we obtain
the following joint marginal log-likelihood function
ℓ(y | ξ) = Σ_{j=1}^n [ k(mj, λ) − k(ηj, wj + λ) ] + Σ_{j=1}^n c(wj, yj) .   (6)
Maximizing ℓ(y | ξ) with respect to ξ, we obtain ξ̂, the maximum conditional
marginal likelihood estimator (MCMLE). Note that the second term in (6) is functionally independent of ξ and thus plays no role in estimating β. Note also that the observations y affect the estimation process only through the posterior means η1, · · · , ηn.
We may use Newton-type algorithms to find the MCMLE. To establish notation,
let mj = g^{-1}(xj^T β) = h(xj^T β). Let ḣ(a) = h′(a), k10(a, b) = ∂k(a, b)/∂a, k01(a, b) =
∂k(a, b)/∂b, and kij(a, b) = ∂^{i+j} k(a, b)/∂a^i ∂b^j (i, j ≥ 1). The conditional marginal
likelihood estimating functions for ξ can then be written as
Lβ(y | ξ) = Σ_j { k10(mj, λ) − [λ/(wj + λ)] k10(ηj, wj + λ) } ḣ(xj^T β) xj ,
Lλ(y | ξ) = Σ_j { k01(mj, λ) − k01(ηj, wj + λ) − [wj(mj − yj)/(wj + λ)^2] k10(ηj, wj + λ) } .   (7)
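Rather than coding (7) directly, one can maximize (6) numerically. The sketch below is a minimal illustration; the function names, the use of scipy.optimize.minimize, and the log-λ parameterization are choices made here for the sketch, not part of the paper, and the beta-binomial normalizer of §3 is supplied as the default k.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import betaln

def marginal_loglik(xi, y, X, w, k):
    """Joint conditional marginal log-likelihood (6), omitting the c(w_j, y_j) term."""
    beta, lam = xi[:-1], np.exp(xi[-1])          # work with log(lambda) to keep lambda > 0
    m = 1.0 / (1.0 + np.exp(-X @ beta))          # prior means m_j via the logit link (4)
    eta = (w * y + lam * m) / (w + lam)          # posterior means eta_j
    return np.sum(k(m, lam) - k(eta, w + lam))

# Beta-binomial choice k(m, lambda) = -log B(lambda*m, lambda*(1-m)); cf. Section 3.
def k_betabin(m, lam):
    return -betaln(lam * m, lam * (1.0 - m))

def mcmle(y, X, w, k=k_betabin):
    """Numerical MCMLE of (beta, log lambda) by direct maximization of (6)."""
    xi0 = np.zeros(X.shape[1] + 1)
    res = minimize(lambda xi: -marginal_loglik(xi, y, X, w, k), xi0, method="BFGS")
    return res.x[:-1], np.exp(res.x[-1])
```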
2.2 EM algorithm
The EM algorithm can be useful when directly maximizing ℓ(y | ξ) is difficult. For
example, using a Newton-type method to solve the likelihood equations Lβ (y | ξ) =
0, Lλ (y | ξ) = 0, typically involves computation of the Hessian matrix of the likelihood function, evaluation of which may not be easy.
To develop the algorithm, we begin by considering z = (y^T, θ)^T as the complete
data, while θ = (θ1, · · · , θn) is regarded as missing. Now the complete data likelihood
takes the form
f (z | ξ) = f (y | θ) × π(θ | m, λ) ,
where f (y | θ) and π(θ | m, λ) are the products of (1) and (3) over the observations,
respectively. The parameter β enters into the likelihood through the prior means.
The EM algorithm ([DLR77]) now looks for the MCMLE by iterating
• E-step: Compute Q(ξ | ξ^(t)) = E[ log f(z | ξ) | y, ξ^(t) ].
• M-step: Choose ξ^(t+1) = arg max_{ξ ∈ Ξ} Q(ξ | ξ^(t)).
until convergence. For our BGLM, the complete data log likelihood decomposes into
log f(z | ξ) = log f(y | θ) + log π(θ | ξ) = Σ_{j=1}^n { λ [mj θj − ψ(θj)] + k(mj, λ) } + log f(y | θ) .
The Q function in the E-step can then be expressed as
"(
Q(ξ | ξ (t) ) = λ E
)
n
X
#
| yj , ξ (t)
(mj θj − ψ(θj ))
j=1
+
n
X
n
k(mj , λ) + E log f (y | θ) | y, ξ (t)
o
,
(8)
j=1
where the expectations are taken with respect to the conditional posterior distributions of the θj given the data and ξ^(t). Since the
expectations are conditional on ξ^(t), the parameter ξ enters into the Q function
only through m1, · · · , mn. It follows that the last term on the right-hand side of (8)
is irrelevant in the M-step when we maximize Q(ξ | ξ^(t)) with respect to ξ.
Differentiating Q(ξ | ξ^(t)) with respect to λ and β, and setting the derivatives to zero, we may
then find ξ^(t+1) by solving the following equations for β and λ:
Σ_{j=1}^n { mj E[θj | yj, ξ^(t)] − E[ψ(θj) | yj, ξ^(t)] + k01(mj, λ) } = 0 ,
Σ_{j=1}^n { λ E[θj | yj, ξ^(t)] + k10(mj, λ) } ḣ(xj^T β) xj = 0 ,   (9)
where k10 and k01 are as defined in §2.1. Newton-type methods may
have to be invoked to solve the equations (9).
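For the beta-binomial case of §3 the E-step expectations in (9) have closed forms in terms of the digamma function: with aj = (nj + λ)ηj and bj = (nj + λ)(1 − ηj), E[θj | yj, ξ^(t)] equals digamma(aj) − digamma(bj) and E[ψ(θj) | yj, ξ^(t)] equals digamma(aj + bj) − digamma(bj), since µj | yj is Beta(aj, bj). The sketch below is one possible implementation under these assumptions; the M-step maximizes Q numerically rather than solving (9), and the log-λ parameterization, starting values and iteration count are illustrative choices, not prescriptions from the paper.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import betaln, digamma

def em_mcmle(y, X, w, n_iter=100):
    """EM iteration for the MCMLE in the beta-binomial BGLM (illustrative sketch)."""
    beta, log_lam = np.zeros(X.shape[1]), np.log(10.0)
    for _ in range(n_iter):
        # E-step: posterior expectations of theta_j and psi(theta_j) given y_j, xi^(t)
        lam = np.exp(log_lam)
        m = 1.0 / (1.0 + np.exp(-X @ beta))
        eta = (w * y + lam * m) / (w + lam)
        a, b = (w + lam) * eta, (w + lam) * (1.0 - eta)
        E_theta = digamma(a) - digamma(b)
        E_psi = digamma(a + b) - digamma(b)

        # M-step: maximize Q(xi | xi^(t)) over xi = (beta, log lambda), cf. (8)
        def neg_Q(xi):
            b_, l_ = xi[:-1], np.exp(xi[-1])
            m_ = 1.0 / (1.0 + np.exp(-X @ b_))
            k_ = -betaln(l_ * m_, l_ * (1.0 - m_))
            return -np.sum(l_ * (m_ * E_theta - E_psi) + k_)

        res = minimize(neg_Q, np.append(beta, log_lam), method="BFGS")
        beta, log_lam = res.x[:-1], res.x[-1]
    return beta, np.exp(log_lam)
```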
2.3 Unconditional marginal likelihood
For some models, such as the beta-binomial model to be studied in §3, the empirical
Bayes approach for estimating λ discussed so far may be difficult. Now we introduce a second approach by using (5). Let f (y | λ, β) be the marginal likelihood of β
conditional on λ. We define the unconditional marginal likelihood of β by
f(y | β) = ∫ f(y | λ, β) π(λ) dλ ,   (10)
where the second-stage prior π(λ) is completely specified, a noninformative
prior usually being the choice. The unconditional marginal density f(y | β) is a
mixture of the densities f(y | λ, β) with respect to the prior π. We estimate β by maximizing f(y | β)
to obtain the maximum unconditional marginal likelihood estimator (MUMLE).
The problem of maximizing (10) can be replaced by that of maximizing the
expected logarithm of the marginal likelihood function
G(β) = Eπ[ log f(y | λ, β) ] = ∫ log f(y | λ, β) dΠ(λ) ,
where Π′(λ) = π(λ). Let T(Π) be a functional defined implicitly through
∫ [ ∂ log f(y | λ, β)/∂β ]_{β=T(Π)} dΠ(λ) = 0 .   (11)
If T (Π) in (11) is uniquely defined, then β̂ = T (Π) evaluated at the “true” prior
distribution Π is the value that maximizes the objective function G(β). Since solving
(11) is difficult in general, we now consider an approximate method.
The idea is to use β̃ = T (Π̂) to approximate β̂ = T (Π), where Π̂ is an empirical distribution using i.i.d. values λ1 , · · · , λR from the known noninformative prior
Π(λ). In other words, we maximize the following approximate objective function
Ĝ(β) = ∫ log f(y | λ, β) dΠ̂(λ) = (1/R) Σ_{r=1}^R log f(y | λr, β)
to obtain β̃ = T (Π̂). Note that Ĝ(β) is an unbiased estimate of G(β) for each value
of β, that is, EΠ Ĝ(β) = G(β). The approximate estimate β̃ can be found using
Newton’s method, for example, by solving the following equations
Σ_{r=1}^R ∂ log f(y | λr, β)/∂β = 0 .
Under mild conditions on the hyper-prior Π, we may show that, conditional on
y, β̃ = T(Π̂), based on a sequence of i.i.d. values λ1, · · · , λR from Π, is a consistent
estimator of β̂ = T(Π) as R → ∞.
In order to obtain a reliable estimate β̃, we need to choose an appropriate number
R, an issue common to any Monte Carlo method. With noninformative hyper-priors, we would expect that R need not be very large in order to ensure
a reliable approximation of β̂. In the beta-binomial model of §3, for instance, we
found that R = 10 was sufficient.
Using (6) and omitting a constant not involving β and λ, we can write Ĝ(β) as
Ĝ(β) = Σ_{r=1}^R Σ_{j=1}^n [ k(mj, λr) − k(ηjr, wj + λr) ] ,   (12)
where ηjr = (wj yj + λr mj)/(wj + λr). We can then use standard optimization
techniques to work with the approximate objective function (12).
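A minimal sketch of this Monte Carlo step for the beta-binomial case of §3 is given below; the function names, the use of scipy.optimize.minimize, and the diffuse χ² hyper-prior used to generate the draws λ1, · · · , λR are illustrative assumptions only.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import betaln

def mumle_mc(y, X, w, lam_draws):
    """Approximate MUMLE: maximize G_hat(beta) of (12) over beta."""
    def neg_G_hat(beta):
        m = 1.0 / (1.0 + np.exp(-X @ beta))
        total = 0.0
        for lam in lam_draws:                           # sum over the R hyper-prior draws
            eta = (w * y + lam * m) / (w + lam)
            total += np.sum(betaln((w + lam) * eta, (w + lam) * (1.0 - eta))
                            - betaln(lam * m, lam * (1.0 - m)))
        return -total
    res = minimize(neg_G_hat, np.zeros(X.shape[1]), method="BFGS")
    return res.x

# For example, R = 10 i.i.d. draws from a diffuse hyper-prior (illustrative choice):
# lam_draws = np.random.default_rng(1).chisquare(df=50, size=10)
```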
2.4 MUMLE using empirical Bayes
In cases where it is preferable to estimate Π(λ), we then have to maximize
G(β) = ∫ log f(y | β, λ) dΠ(λ)
without assuming that Π(λ) is known. The EM algorithm is again useful here. First,
we assume that β is known, and Π is supported on J distinct points. Assume that
J is known a priori. At the tth step, we have
πr^(t) = Pr(λ = λr^(t)) ,   r = 1, · · · , J .
The EM algorithm updates λr^(t) and πr^(t) by repeating the following two steps
• E-step: πr^(t+1) ∝ f(y | β, λr^(t)) πr^(t) ,   r = 1, · · · , J ;
• M-step: (λ1^(t+1), · · · , λJ^(t+1)) = arg max Σ_{r=1}^J πr^(t+1) log f(y | β, λr) ;
until convergence, to get
Πβ = { (λ1^β, π1^β), · · · , (λJ^β, πJ^β) } .
We thus arrive at the following EMM algorithm for finding the MUMLE. Assume β^(t) is known. At the (t + 1)th step, compute
1. EM step: obtain Π_{β^(t)}, with πr^{β^(t)} = Pr(λ = λr^{β^(t)}), r = 1, · · · , J, by running the EM iteration above with β = β^(t).
2. M step: β^(t+1) = arg max Σ_{r=1}^J πr^{β^(t)} log f(y | β, λr^{β^(t)}).
Repeat (1) and (2) until convergence.
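The following sketch outlines one way the EMM iteration might be coded for the beta-binomial marginal of §3. It is a simplified, assumption-laden illustration rather than the full algorithm: for simplicity the support points λ1, · · · , λJ are kept on a fixed user-supplied grid and only the weights πr are updated in the inner EM step, and all function names and iteration counts are choices made here.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import betaln

def log_marg(beta, lam, y, X, w):
    """log f(y | beta, lambda), up to a constant (beta-binomial case of Section 3)."""
    m = 1.0 / (1.0 + np.exp(-X @ beta))
    eta = (w * y + lam * m) / (w + lam)
    return np.sum(betaln((w + lam) * eta, (w + lam) * (1.0 - eta))
                  - betaln(lam * m, lam * (1.0 - m)))

def emm(y, X, w, lam_grid, n_outer=20, n_inner=50):
    """EMM sketch: inner EM updates the weights pi_r, outer M-step updates beta."""
    J = len(lam_grid)
    beta, pi = np.zeros(X.shape[1]), np.full(J, 1.0 / J)
    for _ in range(n_outer):
        for _ in range(n_inner):                       # inner EM step (weights only)
            logp = (np.array([log_marg(beta, l, y, X, w) for l in lam_grid])
                    + np.log(np.clip(pi, 1e-300, 1.0)))
            pi = np.exp(logp - logp.max())
            pi /= pi.sum()
        def neg_obj(b):                                # outer M-step over beta
            return -sum(p * log_marg(b, l, y, X, w) for p, l in zip(pi, lam_grid))
        beta = minimize(neg_obj, beta, method="BFGS").x
    return beta, pi
```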
3 Beta-binomial model
Now we apply the marginal likelihood methods to the problem of assessing the effects
of covariates upon a binary outcome. Let y1 , · · · , yn be observed proportions from
independent binomial trials with index parameters nj and success probabilities µj ,
j = 1, · · · , n. The logarithms of the density functions of yj given θj are
log f(yj | θj) = nj [yj θj − ψ(θj)] + log( nj choose nj yj ) ,
where θ = logit µ = log{µ/(1 − µ)} and ψ(θ) = log(1 + e^θ).
We assume that θj follow the conjugate prior distributions
π(θj | mj , λ) = exp {λ[mj θj − ψ(θj )] + k(mj , λ)}
where mj is the prior mean of µj . The normalization constant is given by
k(mj, λ) = − log B(λmj, λ − λmj) ,   (13)
where B(a, b) is a beta function. Further, we assume that the prior means and
the linear predictors are linked by the following equations
logit mj = log{mj/(1 − mj)} = β^T xj ,   j = 1, · · · , n.
The conjugate priors π(θj | mj, λ) with k(mj, λ) given by (13) correspond to assigning to the mean variables µj
the beta priors Be(λmj, λ − λmj), with density functions
π(µj | mj, λ) = µj^{λmj − 1} (1 − µj)^{λ − λmj − 1} / B(λmj, λ − λmj) ,   j = 1, · · · , n .
Note that π has mean mj and variance vj = mj (1 − mj )/(λ + 1). When λ → ∞,
we have vj → 0 so that π will tend to a degenerate distribution concentrated at the
prior mean mj . So in the limit we get the usual logistic model.
The posterior means of µj are given by ηj = (nj yj + λmj )/(nj + λ) .
The marginal distributions of yj are independent beta-binomials. Ignoring a
constant free of β and λ, we can write the marginal log likelihood function as
ℓ(y | ξ) = Σ_j log B[(nj + λ)ηj, (nj + λ)(1 − ηj)] − Σ_j log B[λmj, λ(1 − mj)] .
We may now maximize ℓ(y | ξ) with respect to ξ by solving the likelihood equations. Let φ(z) be the digamma function. Differentiating ℓ(y | ξ) with respect to β
and λ and setting to zero, we can write the likelihood equations as
0 = Σ_{j=1}^n { φ((nj + λ)ηj) − φ((nj + λ)(1 − ηj)) − φ(λmj) + φ(λ(1 − mj)) } Aj ,
0 = Σ_{j=1}^n { mj φ((nj + λ)ηj) + (1 − mj) φ((nj + λ)(1 − ηj)) + φ(λ)
      − φ(nj + λ) − mj φ(λmj) − (1 − mj) φ(λ(1 − mj)) } ,   (14)
where Aj = mj (1 − mj )xj . The maximum conditional marginal likelihood estimator can then be found by solving the above likelihood equations.
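As a hedged illustration, the likelihood equations (14) can be coded directly with scipy.special.digamma and handed to a general root finder; the (β, log λ) parameterization, the starting values, and the use of scipy.optimize.root are choices made here for the sketch, not prescriptions from the paper.

```python
import numpy as np
from scipy.optimize import root
from scipy.special import digamma

def score(xi, y, X, n):
    """Left-hand sides of the likelihood equations (14) for the beta-binomial model."""
    beta, lam = xi[:-1], np.exp(xi[-1])
    m = 1.0 / (1.0 + np.exp(-X @ beta))
    eta = (n * y + lam * m) / (n + lam)
    a, b = (n + lam) * eta, (n + lam) * (1.0 - eta)
    u = digamma(a) - digamma(b) - digamma(lam * m) + digamma(lam * (1.0 - m))
    s_beta = X.T @ (u * m * (1.0 - m))                        # first equation, A_j = m_j(1-m_j)x_j
    s_lam = np.sum(m * digamma(a) + (1.0 - m) * digamma(b) + digamma(lam)
                   - digamma(n + lam) - m * digamma(lam * m)
                   - (1.0 - m) * digamma(lam * (1.0 - m)))    # second equation
    return np.append(s_beta, s_lam)

def mcmle_betabin(y, X, n):
    """Solve (14) for (beta, lambda) with a general-purpose root finder."""
    xi0 = np.append(np.zeros(X.shape[1]), np.log(10.0))
    sol = root(score, xi0, args=(y, X, n))
    return sol.x[:-1], np.exp(sol.x[-1])
```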
On the other hand, the unconditional approach seeks to maximize
Σ_j ∫ log { B[(nj + λ)ηj, (nj + λ)(1 − ηj)] / B[λmj, λ(1 − mj)] } dΠ(λ)
where Π(λ) is a proper prior distribution. To approximate the MUMLE we
maximize the following empirical sum
Σ_{r=1}^R Σ_{j=1}^n log { B[(nj + λr)ηjr, (nj + λr)(1 − ηjr)] / B[λr mj, λr(1 − mj)] } ,
where λ1 , · · · , λR are i.i.d. values from Π(λ), and ηjr = (nj yj + λr mj )/(nj + λr ).
We can use Newton's method to solve the following equation for β:
0 = Σ_{r=1}^R Σ_{j=1}^n { φ((nj + λr)ηjr) − φ((nj + λr)(1 − ηjr)) − φ(λr mj) + φ(λr(1 − mj)) } Aj .
4 An Application to Urinary Tract Infections
We analyzed a data set (LogXact4 Manual, 1999, p. 198, Cytel Software Corporation) from an epidemiological survey concerning urinary tract infections
(UTI) among college women. We applied the beta-binomial model studied in §3. We
assumed that λ has a chi-squared distribution Π(λ) = χ²(ν) with a large number of degrees
of freedom ν. Our results are consistent with some well-established current theories
on possible causes of UTI ([FBN96, FGP95, FMG97]). We shall report our results
elsewhere.
References
[Alb88] Albert, J.H.: Computational methods using a Bayesian hierarchical generalized linear model. J. Amer. Statistical Assoc., 83, 1037–1044 (1988)
[BC93] Breslow, N.E., Clayton, D.G.: Approximate inference in generalized linear mixed models. J. Amer. Statistical Assoc., 88, 9–25 (1993)
[Chi95] Chib, S.: Marginal likelihood from the Gibbs output. J. Amer. Statistical Assoc., 90, 1313–1321 (1995)
[DS93] Dellaportas, P., Smith, A.F.M.: Bayesian inference for generalized linear and proportional hazards models via Gibbs sampling. Appl. Statist., 42, 443–459 (1993)
[DLR77] Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. B, 39, 1–38 (1977)
[FBN96] Fihn, S.D., Boyko, E.J., Normand, E.H., Chen, C., Grafton, J.R., Hunt, M., Yarbro, P., Scholes, D., Stergachis, A.: Association between use of spermicide-coated condoms and Escherichia coli urinary tract infection in young women. Am. J. Epidemiol., 144, 512–520 (1996)
[FGP95] Foxman, B., Geiger, A., Palin, K., Gillespie, B., Koopman, J.S.: First-time urinary tract infection and sexual behavior. Epidemiology, 6, 162–168 (1995)
[FMG97] Foxman, B., Marsh, J.V., Gillespie, B., Rubin, K.N., Koopman, J.S., Spear, S.: Condom use and first-time urinary tract infection. Epidemiology, 8, 637–641 (1997)
[MN89] McCullagh, P., Nelder, J.A.: Generalized Linear Models (2nd ed). Chapman and Hall, London (1989)
[NW72] Nelder, J.A., Wedderburn, R.W.M.: Generalized linear models. J. Roy. Statist. Soc. A, 135, 370–384 (1972)