Lecture 14
MCMC for multilevel logistic
regressions in MLwiN
Lecture Contents
• Recap of MCMC from day 3
• Recap of logistic regression from days 1 & 4
• Metropolis Hastings sampling
• Reunion Island dataset (2 and 3 level)
• VPC for binomial models
• Method Comparison on
Rodriguez/Goldman examples
MCMC Methods (recap)
• Goal: To sample from joint posterior distribution.
p(  , u,  u2 ,  e2 | y )
• Problem: For complex models this involves
multidimensional integration.
• Solution: It may be possible to sample from
conditional posterior distributions,
p(  | y, u,  u2 ,  e2 ), p(u j | y,  ,  u2 ,  e2 ),
p( u2 | y,  , u,  e2 ), p( e2 | y,  , u,  u2 )
• It can be shown that after convergence such a
sampling approach generates dependent samples
from the joint posterior distribution.
Gibbs Sampling (recap)
• When we can sample directly from the conditional
posterior distributions then such an algorithm is
known as Gibbs Sampling.
• This proceeds as follows for the variance
components example:
• Firstly give all unknown parameters starting values:
$\beta(0),\ u(0),\ \sigma_u^2(0),\ \sigma_e^2(0)$.
• Next loop through the following steps:
Gibbs Sampling for VC model
• Sample from
$p(\beta \mid y, u(0), \sigma_u^2(0), \sigma_e^2(0))$ to generate $\beta(1)$, then from
$p(u_j \mid y, \beta(1), \sigma_u^2(0), \sigma_e^2(0))$ to generate $u_j(1)$ for each $j$, then from
$p(\sigma_u^2 \mid y, \beta(1), u(1), \sigma_e^2(0))$ to generate $\sigma_u^2(1)$, and then from
$p(\sigma_e^2 \mid y, \beta(1), u(1), \sigma_u^2(1))$ to generate $\sigma_e^2(1)$.
These steps are then repeated with the generated
values from this loop replacing the starting values.
The chain of values produced by this procedure is
known as a Markov chain. Note that β is generated
as a block while each uj is updated individually.
Algorithm Summary
Repeat the following four steps
• 1. Generate β from its (Multivariate)
Normal conditional distribution.
• 2. Generate each uj from its Normal
conditional distribution.
• 3. Generate 1/σu2 from its Gamma
conditional distribution.
• 4. Generate 1/σe2 from its Gamma
conditional distribution.
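For concreteness, the loop below is a minimal Python sketch of these four steps for a two-level variance components model y_ij = β + u_j + e_ij. It assumes a flat prior for β and Gamma(ε, ε) priors for the precisions; these are common default choices, but the sketch is illustrative rather than MLwiN's exact implementation.

```python
import numpy as np

def gibbs_vc(y, group, n_iter=5000, eps=0.001, seed=1):
    """Gibbs sampler for y_ij = beta + u_j + e_ij (a sketch, not MLwiN's exact code).
    y: response vector; group: integer codes 0..J-1 giving each observation's level 2 unit."""
    rng = np.random.default_rng(seed)
    y, group = np.asarray(y, dtype=float), np.asarray(group)
    N, J = len(y), int(group.max()) + 1
    beta, u = y.mean(), np.zeros(J)                 # starting values
    sigma2_u, sigma2_e = 1.0, 1.0
    chain = np.empty((n_iter, 3))
    for t in range(n_iter):
        # 1. beta from its Normal conditional (flat prior on beta assumed)
        beta = rng.normal((y - u[group]).mean(), np.sqrt(sigma2_e / N))
        # 2. each u_j from its Normal conditional
        for j in range(J):
            r_j = y[group == j] - beta
            prec = len(r_j) / sigma2_e + 1.0 / sigma2_u
            u[j] = rng.normal(r_j.sum() / sigma2_e / prec, np.sqrt(1.0 / prec))
        # 3. 1/sigma2_u from its Gamma conditional (Gamma(eps, eps) prior assumed)
        sigma2_u = 1.0 / rng.gamma(eps + J / 2.0, 1.0 / (eps + 0.5 * np.sum(u ** 2)))
        # 4. 1/sigma2_e from its Gamma conditional (Gamma(eps, eps) prior assumed)
        e = y - beta - u[group]
        sigma2_e = 1.0 / rng.gamma(eps + N / 2.0, 1.0 / (eps + 0.5 * np.sum(e ** 2)))
        chain[t] = beta, sigma2_u, sigma2_e
    return chain
```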
Logistic regression model
• A standard Bayesian logistic regression
model (e.g. for the rat tumour example) can
be written as follows:
$y_i \sim \mathrm{Binomial}(n_i, p_i)$
$\mathrm{logit}(p_i) = X_i\beta = \beta_0 + \beta_1 x_i$
$\beta_0 \sim N(0, m_0),\ \beta_1 \sim N(0, m_1)$
• Both MLwiN and WinBUGS can fit this
model but can we write out the conditional
posterior distributions and use Gibbs
Sampling?
Conditional distribution for β0
p(  0 | y, 1 )  p(  0 ) p( y |  0 , 1 )

 exp(  0  1 xi ) 

1

exp( 
) 
2m0 i  1  exp(  0  1 xi ) 
m0
2
0
yi


1


 1  exp(  0  1 xi ) 
ni  yi
p(  0 | y, 1 ) ~ ?
This distribution is not a standard distribution and
so we cannot simply simulate from a standard
random number generator. However both
WinBUGS and MLwiN can fit this model using
MCMC. We will in this lecture describe how MLwiN
does this before considering WinBUGS in the next lecture.
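The density above is easy to evaluate pointwise (up to a constant) but does not match any standard family we can draw from directly. A short illustrative sketch in Python, with m0 the prior variance for β0 and hypothetical data arrays x, y, n:

```python
import numpy as np

def log_cond_beta0(beta0, beta1, x, y, n, m0):
    """log p(beta0 | y, beta1) up to an additive constant: easy to evaluate,
    but not a standard distribution we can sample with a standard generator."""
    eta = beta0 + beta1 * x
    loglik = np.sum(y * eta - n * np.log1p(np.exp(eta)))   # Binomial log-likelihood
    logprior = -beta0 ** 2 / (2.0 * m0)                     # N(0, m0) prior for beta0
    return loglik + logprior
```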
Metropolis Hastings (MH) sampling
• An alternative, and more general, way to
construct an MCMC sampler.
• A form of generalised rejection sampling (see
later) where values are drawn from approximate
distributions and “corrected” so that,
asymptotically they behave as random
observations from the target distribution.
• MH sampling algorithms sequentially draw
candidate observations from a ‘proposal’
distribution, conditional on the current value,
thus inducing a Markov chain.
General MH algorithm step
Let us focus on a single parameter θ and its
posterior distribution p(θ|Y).
Now at iteration t, θ takes value θt and we
generate a new value θ* from a proposal
distribution q.
Then we accept this new value and let
θt+1 =θ* with acceptance probability α(θ*, θt)
otherwise we set θt+1 = θt.
The acceptance probability is
$\alpha(\theta^*, \theta_t) = \min\!\left(1, \dfrac{p(\theta^* \mid Y)\, q(\theta_t \mid \theta^*)}{p(\theta_t \mid Y)\, q(\theta^* \mid \theta_t)}\right).$
Choosing a proposal distribution
Remarkably the proposal distribution can have
almost any form.
There are some (silly) exceptions, e.g. a proposal that has
point mass at one value, but assuming that the proposal
allows the chain to explore the whole posterior and
doesn’t produce a reducible chain we are OK.
Three special cases of the Metropolis-Hastings algorithm are:
1. Random walk metropolis sampling.
2. Independence sampling.
3. Gibbs sampling!
Pure Metropolis Sampling
The general MH sampling algorithm is due to
Hastings (1970); however, it is a
generalisation of pure Metropolis sampling
(Metropolis et al., 1953).
This special case is when $q(\theta_1 \mid \theta_2) = q(\theta_2 \mid \theta_1)\ \forall\, \theta_1, \theta_2$,
i.e. the proposal distribution is symmetric.
This then reduces the acceptance probability to
$\alpha(\theta^*, \theta_t) = \min\!\left(1, \dfrac{p(\theta^* \mid Y)}{p(\theta_t \mid Y)}\right).$
Random Walk Metropolis
• This is an example of pure Metropolis sampling.
• Here q(θ1|θ2) = q(|θ1 – θ2|).
• Typical examples of random walk proposals are
Normal distributions centered around the current
value of the parameter i.e. q(θ1|θ2) ~N(θ2,s2)
where s2 is the (fixed) proposal variance that can
be tuned to give particular acceptance rates.
• This is the method used within MLwiN.
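Combining this with the logistic regression conditional from earlier, a random walk Metropolis update for β0 might look like the sketch below (data names and the proposal SD s are illustrative; because the Normal proposal is symmetric, the q terms cancel). The accepted/rejected flag returned here is the quantity monitored by the adaptive tuning method described later.

```python
import numpy as np

def rw_metropolis_beta0(beta0, beta1, x, y, n, m0, s, rng):
    """One random walk Metropolis update for beta0 in the logistic model."""
    def log_post(b0):
        eta = b0 + beta1 * x
        return np.sum(y * eta - n * np.log1p(np.exp(eta))) - b0 ** 2 / (2.0 * m0)
    beta0_star = rng.normal(beta0, s)          # symmetric N(beta0, s^2) proposal
    if np.log(rng.uniform()) < log_post(beta0_star) - log_post(beta0):
        return beta0_star, True                # accepted
    return beta0, False                        # rejected: chain stays at current value
```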
Independence Sampler
The independence sampler is so called as each proposal
is independent of the current parameter value i.e.
q(θ1|θ2) = q(θ1).
This leads to acceptance probability
$\alpha(\theta^*, \theta_t) = \min\!\left(1, \dfrac{p(\theta^* \mid Y)\, q(\theta_t)}{p(\theta_t \mid Y)\, q(\theta^*)}\right).$
An example proposal distribution could be a Normal centred
at the ML estimate with an inflated variance. The
independence sampler can sometimes work very well
but equally can work very badly!
Gibbs Sampling
The Gibbs sampler that we have already studied
is a special case of the MH algorithm. The
proposal distribution is the full conditional
distribution which leads to acceptance rate 1 as
shown below:
p ( * | y )q ( t |  * )
rt 


p ( t | y )q ( * |  t )
p ( * | y ) / p ( (*i ) |  t (  i ) , y )
p ( t | y ) / p ( t (i ) |  t (  i ) , y )
p ( t (  i ) | y )
p ( t (  i ) | y )
1
MH Sampling in MLwiN
• MLwiN actually uses a hybrid method.
• Gibbs sampling steps are used for variance parameters.
• MH steps are used for fixed effects and residuals.
• Univariate Normal proposal distributions are used.
• For the proposal standard deviation a scaled IGLS
standard error is initially used (multiplied by 5.8 on the
variance scale).
• However an adaptive method is used to tune these
proposal distributions prior to the burn-in period.
MH Sampling for Normal model
For the education dataset we can illustrate MH Sampling
on the VC model by modifying steps 1 and 2.
Repeat the following four steps:
• 1. Generate βi by Univariate Normal MH Sampling.
• 2. Generate each uj by Univariate Normal MH Sampling.
• 3. Generate 1/σu2 from its Gamma conditional
distribution.
• 4. Generate 1/σe2 from its Gamma conditional
distribution.
MH Sampling in MLwiN
Here we see how to change the estimation
method. Note that for Binomial
responses this will change
automatically and Gibbs
sampling will not be available.
Trajectories plot
Here we see MH sampling for β.
Adaptive Method (ad hoc)
One way of finding a ‘good’ proposal distribution is to
choose a distribution that gives a particular acceptance
rate. It has been shown that a ‘good’ acceptance rate is
often around 50%.
MLwiN has incorporated an adaptive method that uses this
fact to construct univariate Normal proposals with an
acceptance rate of approximately 50%.
Method
Before the burn-in we have an adaptation period where the
sampler improves the proposal distribution. The adaptive
method requires a desired acceptance rate e.g. 50% and
tolerance e.g. 10% resulting in an acceptable range of
(40%,60%).
Adaptive method algorithm
Run the sampler for consecutive batches of 100 iterations.
Compare the number accepted, N with the desired
acceptance rate, R.
If N  R,  new   old /( 2  NR )
N
Else  new   old  (2  100
100 R )
Repeat this procedure until 3 consecutive values of N lie
within the acceptable range and then mark this
parameter. When all the parameters are marked the
adaptation period is over.
N.B. Proposal SDs are still modified after being marked
until adaptation period is over.
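A sketch of this tuning loop, written against the update rule reconstructed above, is given below. It tunes a single parameter and assumes an update(sd, rng) callback that performs one MH update and returns True if the proposal was accepted; both the callback and the batching details are illustrative rather than MLwiN's exact scheme.

```python
def adapt_proposal_sd(update, sd0, rate=0.5, tol=0.1, batch=100, rng=None):
    """Ad hoc adaptive tuning of a random walk proposal SD (sketch of the method above)."""
    sd, in_row = sd0, 0
    while in_row < 3:                              # want 3 consecutive batches in range
        accepted = sum(update(sd, rng) for _ in range(batch))
        if accepted <= rate * batch:               # too few acceptances: shrink the SD
            sd = sd / (2.0 - accepted / (rate * batch))
        else:                                      # too many acceptances: inflate the SD
            sd = sd * (2.0 - (batch - accepted) / ((1.0 - rate) * batch))
        in_row = in_row + 1 if abs(accepted / batch - rate) <= tol else 0
    return sd
```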
Example of Adaptive period
Adaptive method used on parameter, β0:
Iteration   Proposal SD   Accepted (per 100)   Batches in row
0           1.000         -                    -
100         0.505         1                    0
200         0.263         4                    0
300         0.138         5                    0
400         0.074         7                    0
500         0.046         19                   0
600         0.032         29                   0
700         0.031         48                   1
800         0.026         40                   2
900         0.024         46                   3*
1000        0.021         51                   3*
1500        0.022         48                   3*
Comparison of Gibbs vs MH
MLwiN also has an MVN (multivariate Normal proposal) MH algorithm. A comparison of
ESS (effective sample size) for a run of 5,000 iterations of the VC model follows:
Parameter   Gibbs   MH     MH MV
β0          216     33     59
β1          4413    973    303
σ²u         2821    2140   2919
σ²e         4712    4895   4728
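The ESS figures above discount the raw run length for autocorrelation in each chain, using something like ESS = n / (1 + 2 Σ_k ρ_k). A rough sketch of such a calculation is given below; it truncates the sum at the first non-positive autocorrelation, which is one simple convention and need not match MLwiN's exact estimator.

```python
import numpy as np

def effective_sample_size(chain):
    """Rough ESS estimate: n / (1 + 2 * sum of positive-lag autocorrelations)."""
    x = np.asarray(chain, dtype=float)
    x = x - x.mean()
    n = len(x)
    # autocorrelations at lags 0..n-1
    acf = np.correlate(x, x, mode="full")[n - 1:] / (np.arange(n, 0, -1) * x.var())
    rho_sum = 0.0
    for rho in acf[1:]:
        if rho <= 0:                     # truncate at the first non-positive value
            break
        rho_sum += rho
    return n / (1.0 + 2.0 * rho_sum)
```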
2 level Reunion Island dataset
We have already studied this dataset with a
continuous response.
Here we consider a subset with only the 1st
lactation for each cow resulting in 2 levels: cows
nested within herds.
The (Binary) response of interest is fscr – whether
the first service results in a conception.
There are two predictors – ai (whether
insemination was natural or artificial) and heifer
(the age of the cow).
MCMC algorithm
Our MLwiN algorithm has 3 steps:
• 1. Generate βi by Univariate Normal MH
Sampling.
• 2. Generate each uj by Univariate Normal
MH Sampling.
• 3. Generate 1/σu2 from its Gamma
conditional distribution.
MLwiN Demo
The 2-level model is set up and run in IGLS giving the following
starting values:
Trajectories for 5k iterations
Here we see some poor mixing particularly for the variance:
DIC for binomial models
We can use DIC to check whether we need random effects
and whether to include heifer in the model. Note we only
ran each model for 5,000 iterations.
Model              pD      DIC
VC + AI            18.37   2087.15 (2086.48 after 50k)
LR + AI            2.16    2095.70
VC + AI + Heifer   22.16   2087.43
VC Model after 50k iterations
Here is a trace for the herd level variance after 50k iterations that suggests
we need to run even longer!
VPC for Binomial models
VPC is harder to calculate for Binomial models as
the level 1 variance is part of the Binomial
distribution and hence related to the mean and
on a different scale to higher level variances.
Goldstein et al. (2002) propose 4 methods:
1. Use a Taylor series approximation.
2. Use a simulation based approach.
3. Switch to a Normal response model.
4. Use the latent variable approach in Snijders
and Bosker.
VPC for Binomial models
Snijders and Bosker (1999) suggest the following:
The variance of a standard logistic distribution is π2/3
and so the level 1 variance should be replaced by this
value.
In the Reunion Island example this means
$\mathrm{VPC}_{\mathrm{HERD}} = \dfrac{0.089}{0.089 + \pi^2/3} = 0.0263.$
Or in other words 2.63% of the variation is at the herd
level. The fact that there isn’t a huge level 2 variance
may in part explain the poor mixing of the MCMC
algorithm.
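As a quick numeric check, the latent variable VPC above and, for comparison, the simulation based VPC of Goldstein et al. (2002) (method 2 in the earlier list) can be computed as sketched below; the fixed part value beta_x passed to the simulation version is a placeholder for the fitted linear predictor, which is not quoted on these slides.

```python
import numpy as np

sigma2_u = 0.089                                   # herd level variance from the fitted model
vpc_latent = sigma2_u / (sigma2_u + np.pi ** 2 / 3)
print(round(vpc_latent, 4))                        # 0.0263, as quoted above

def vpc_simulation(beta_x, sigma2_u, n_sim=100000, seed=1):
    """Goldstein et al. (2002) simulation approach, sketched for a single fixed part value."""
    rng = np.random.default_rng(seed)
    u = rng.normal(0.0, np.sqrt(sigma2_u), n_sim)  # simulate herd effects
    p = 1.0 / (1.0 + np.exp(-(beta_x + u)))        # cluster-specific probabilities
    level2_var = p.var()                           # between-herd variance of p
    level1_var = np.mean(p * (1.0 - p))            # average Binomial (level 1) variance
    return level2_var / (level2_var + level1_var)
```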
3 level Reunion Island dataset
We will now fit the same models to the 3-level dataset. After 5,000 iterations we have the
following rather worrying results:
Running for longer
We ran the chains for 100k after a burn-in of 5k and thinned the chains by a
factor of 10. The results are still a little worrying:
Potential solutions
In the last 2 slides we have seen some bad mixing
behaviour for some parameters.
The solution of running for longer seems to work
but we need to run for a very long time!
In the next lecture we will look at WinBUGS
methods for this model and also a
reparameterisation of the model known as
hierarchical centering.
For this model it looks like MCMC is extremely
computationally intensive to get reliable
estimates. In what follows we look at an example
where MCMC is clearly useful.
The Guatemalan Child Health
dataset.
This consists of a subsample of 2,449 respondents from the 1987
National Survey of Maternal and Child Health, with a 3-level
structure of births within mothers within communities.
The subsample consists of all women from the chosen communities
who had some form of prenatal care during pregnancy. The
response variable is whether this prenatal care was modern
(physician or trained nurse) or not.
Rodriguez and Goldman (1995) use the structure of this dataset to
consider how quasi-likelihood methods compare with simply
ignoring the multilevel structure and fitting a standard (single level)
logistic regression.
They perform this by constructing simulated datasets based on the
original structure but with known true values for the fixed effects
and variance parameters.
They consider the MQL method and show that the estimates of the
fixed effects produced by MQL are worse than the estimates
produced by standard logistic regression disregarding the multilevel
structure!
The Guatemalan Child Health
dataset.
Goldstein and Rasbash (1996) consider the same problem but use the PQL
method. They show that the results produced by PQL 2nd order
estimation are far better than for MQL but still biased.
The model in this situation is
$y_{ijk} \sim \mathrm{Bernoulli}(p_{ijk})$ with
$\mathrm{logit}(p_{ijk}) = \beta_0 + \beta_1 x_{1ijk} + \beta_2 x_{2jk} + \beta_3 x_{3k} + u_{jk} + v_k$
where $u_{jk} \sim N(0, \sigma_u^2)$ and $v_k \sim N(0, \sigma_v^2)$.
In this formulation i,j and k index the level 1, 2 and 3 units respectively.
The variables x1,x2 and x3 are composite scales at each level because the
original model contained many covariates at each level.
Browne (1998) considered the hybrid Metropolis-Gibbs method in MLwiN
and two possible variance priors ($\Gamma^{-1}(\varepsilon, \varepsilon)$ and Uniform priors on the variances).
Simulation Results
The following gives point estimates (MCSE) for 4 methods and 500 simulated
datasets.
Parameter (True)   MQL1          PQL2          Gamma         Uniform
β0 (0.65)          0.474 (0.01)  0.612 (0.01)  0.638 (0.01)  0.655 (0.01)
β1 (1.00)          0.741 (0.01)  0.945 (0.01)  0.991 (0.01)  1.015 (0.01)
β2 (1.00)          0.753 (0.01)  0.958 (0.01)  1.006 (0.01)  1.031 (0.01)
β3 (1.00)          0.727 (0.01)  0.942 (0.01)  0.982 (0.01)  1.007 (0.01)
σ²v (1.00)         0.550 (0.01)  0.888 (0.01)  1.023 (0.01)  1.108 (0.01)
σ²u (1.00)         0.026 (0.01)  0.568 (0.01)  0.964 (0.02)  1.130 (0.02)
Simulation Results
The following gives interval coverage probabilities (90%/95%) for 4 methods
and 500 simulated datasets.
Parameter (True)   MQL1        PQL2        Gamma       Uniform
β0 (0.65)          67.6/76.8   86.2/92.0   86.8/93.2   88.6/93.6
β1 (1.00)          56.2/68.6   90.4/96.2   92.8/96.4   92.2/96.4
β2 (1.00)          13.2/17.6   84.6/90.8   88.4/92.6   88.6/92.8
β3 (1.00)          59.0/69.6   85.2/89.8   86.2/92.2   88.6/93.6
σ²v (1.00)         0.6/2.4     70.2/77.6   89.4/94.4   87.8/92.2
σ²u (1.00)         0.0/0.0     21.2/26.8   84.2/88.6   88.0/93.0
Summary of simulations
The Bayesian approach yields excellent bias and
coverage results.
For the fixed effects, MQL performs badly but the
other 3 methods all do well.
For the random effects, MQL and PQL both
perform badly but MCMC with both priors is
much better.
Note that this is an extreme scenario with few level 1
units per level 2 unit yet high level 2 variance; in
other examples MQL/PQL will not be so bad.
Introduction to Practical
In the practical you will be let loose on
MLwiN with two datasets:
1. A dataset on contraceptive use in
Bangladesh.
2. A veterinary epidemiology dataset on
pneumonia in pigs.