Lecture 14
MCMC for multilevel logistic
regressions in MLwiN
Lecture Contents
• Recap of MCMC from day 3
• Recap of logistic regression from days 1 & 4
• Metropolis Hastings sampling
• Reunion Island dataset (2 and 3 level)
• VPC for binomial models
• Method Comparison on
Rodriguez/Goldman examples
MCMC Methods (recap)
• Goal: To sample from joint posterior distribution.
p(  , u,  u2 ,  e2 | y )
• Problem: For complex models this involves
multidimensional integration.
• Solution: It may be possible to sample from
conditional posterior distributions,
p(  | y, u,  u2 ,  e2 ), p(u j | y,  ,  u2 ,  e2 ),
p( u2 | y,  , u,  e2 ), p( e2 | y,  , u,  u2 )
• It can be shown that after convergence such a
sampling approach generates dependent samples
from the joint posterior distribution.
Gibbs Sampling (recap)
• When we can sample directly from the conditional
posterior distributions then such an algorithm is
known as Gibbs Sampling.
• This proceeds as follows for the variance
components example:
• Firstly give all unknown parameters starting values:
$\beta(0),\ u(0),\ \sigma_u^2(0),\ \sigma_e^2(0)$.
• Next loop through the following steps:
Gibbs Sampling for VC model
• Sample from
$p(\beta \mid y, u(0), \sigma_u^2(0), \sigma_e^2(0))$ to generate $\beta(1)$, then from
$p(u_j \mid y, \beta(1), \sigma_u^2(0), \sigma_e^2(0))$ to generate $u_j(1)$ for each $j$, then from
$p(\sigma_u^2 \mid y, \beta(1), u(1), \sigma_e^2(0))$ to generate $\sigma_u^2(1)$, and then from
$p(\sigma_e^2 \mid y, \beta(1), u(1), \sigma_u^2(1))$ to generate $\sigma_e^2(1)$.
These steps are then repeated with the generated
values from this loop replacing the starting values.
The chain of values produced by this procedure is
known as a Markov chain. Note that β is generated
as a block while each uj is updated individually.
Algorithm Summary
Repeat the following four steps
• 1. Generate β from its (Multivariate)
Normal conditional distribution.
• 2. Generate each uj from its Normal
conditional distribution.
• 3. Generate 1/σu2 from its Gamma
conditional distribution.
• 4. Generate 1/σe2 from its Gamma
conditional distribution.
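For concreteness, the loop below is a minimal Python sketch of these four steps for a two-level variance components model y_ij = β + u_j + e_ij. It assumes a flat prior for β and Gamma(ε, ε) priors for the precisions; these are common default choices, but the sketch is illustrative rather than MLwiN's exact implementation.

```python
import numpy as np

def gibbs_vc(y, group, n_iter=5000, eps=0.001, seed=1):
    """Gibbs sampler for y_ij = beta + u_j + e_ij (a sketch, not MLwiN's exact code).
    y: response vector; group: integer codes 0..J-1 giving each observation's level 2 unit."""
    rng = np.random.default_rng(seed)
    y, group = np.asarray(y, dtype=float), np.asarray(group)
    N, J = len(y), int(group.max()) + 1
    beta, u = y.mean(), np.zeros(J)                 # starting values
    sigma2_u, sigma2_e = 1.0, 1.0
    chain = np.empty((n_iter, 3))
    for t in range(n_iter):
        # 1. beta from its Normal conditional (flat prior on beta assumed)
        beta = rng.normal((y - u[group]).mean(), np.sqrt(sigma2_e / N))
        # 2. each u_j from its Normal conditional
        for j in range(J):
            r_j = y[group == j] - beta
            prec = len(r_j) / sigma2_e + 1.0 / sigma2_u
            u[j] = rng.normal(r_j.sum() / sigma2_e / prec, np.sqrt(1.0 / prec))
        # 3. 1/sigma2_u from its Gamma conditional (Gamma(eps, eps) prior assumed)
        sigma2_u = 1.0 / rng.gamma(eps + J / 2.0, 1.0 / (eps + 0.5 * np.sum(u ** 2)))
        # 4. 1/sigma2_e from its Gamma conditional (Gamma(eps, eps) prior assumed)
        e = y - beta - u[group]
        sigma2_e = 1.0 / rng.gamma(eps + N / 2.0, 1.0 / (eps + 0.5 * np.sum(e ** 2)))
        chain[t] = beta, sigma2_u, sigma2_e
    return chain
```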
Logistic regression model
• A standard Bayesian logistic regression
model (e.g. for the rat tumour example) can
be written as follows:
$y_i \sim \mathrm{Binomial}(n_i, p_i)$
$\mathrm{logit}(p_i) = X_i\beta = \beta_0 + \beta_1 x_i$
$\beta_0 \sim N(0, m_0),\ \beta_1 \sim N(0, m_1)$
• Both MLwiN and WinBUGS can fit this
model but can we write out the conditional
posterior distributions and use Gibbs
Sampling?
Conditional distribution for β0
p(  0 | y, 1 )  p(  0 ) p( y |  0 , 1 )

 exp(  0  1 xi ) 

1

exp( 
) 
2m0 i  1  exp(  0  1 xi ) 
m0
2
0
yi


1


 1  exp(  0  1 xi ) 
ni  yi
p(  0 | y, 1 ) ~ ?
This distribution is not a standard distribution and
so we cannot simply simulate from a standard
random number generator. However both
WinBUGS and MLwiN can fit this model using
MCMC. We will in this lecture describe how MLwiN
does this before considering WinBUGS in the next lecture.
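The density above is easy to evaluate pointwise (up to a constant) but does not match any standard family we can draw from directly. A short illustrative sketch in Python, with m0 the prior variance for β0 and hypothetical data arrays x, y, n:

```python
import numpy as np

def log_cond_beta0(beta0, beta1, x, y, n, m0):
    """log p(beta0 | y, beta1) up to an additive constant: easy to evaluate,
    but not a standard distribution we can sample with a standard generator."""
    eta = beta0 + beta1 * x
    loglik = np.sum(y * eta - n * np.log1p(np.exp(eta)))   # Binomial log-likelihood
    logprior = -beta0 ** 2 / (2.0 * m0)                     # N(0, m0) prior for beta0
    return loglik + logprior
```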
Metropolis Hastings (MH) sampling
• An alternative, and more general, way to
construct an MCMC sampler.
• A form of generalised rejection sampling (see
later) where values are drawn from approximate
distributions and “corrected” so that,
asymptotically they behave as random
observations from the target distribution.
• MH sampling algorithms sequentially draw
candidate observations from a ‘proposal’
distribution, conditional on the current value,
thus inducing a Markov chain.
General MH algorithm step
Let us focus on a single parameter θ and its
posterior distribution p(θ|Y).
Now at iteration t, θ takes value θt and we
generate a new value θ* from a proposal
distribution q.
Then we accept this new value and let
θt+1 =θ* with acceptance probability α(θ*, θt)
otherwise we set θt+1 = θt.
The acceptance probability is
$\alpha(\theta^*, \theta_t) = \min\!\left(1, \dfrac{p(\theta^* \mid Y)\, q(\theta_t \mid \theta^*)}{p(\theta_t \mid Y)\, q(\theta^* \mid \theta_t)}\right).$
Choosing a proposal distribution
Remarkably the proposal distribution can have
almost any form.
There are some (silly) exceptions, e.g. a proposal that has
point mass at one value, but assuming that the proposal
allows the chain to explore the whole posterior and
doesn’t produce a reducible chain we are OK.
Three special cases of the Metropolis-Hastings algorithm are:
1. Random walk metropolis sampling.
2. Independence sampling.
3. Gibbs sampling!
Pure Metropolis Sampling
The general MH sampling algorithm is due to
Hastings (1970); however, it is a
generalisation of pure Metropolis sampling
(Metropolis et al., 1953).
This special case is when $q(\theta_1 \mid \theta_2) = q(\theta_2 \mid \theta_1)\ \forall\, \theta_1, \theta_2$,
i.e. the proposal distribution is symmetric.
This then reduces the acceptance probability to
$\alpha(\theta^*, \theta_t) = \min\!\left(1, \dfrac{p(\theta^* \mid Y)}{p(\theta_t \mid Y)}\right).$
Random Walk Metropolis
• This is an example of pure Metropolis sampling.
• Here q(θ1|θ2) = q(|θ1 – θ2|).
• Typical examples of random walk proposals are
Normal distributions centered around the current
value of the parameter i.e. q(θ1|θ2) ~N(θ2,s2)
where s2 is the (fixed) proposal variance that can
be tuned to give particular acceptance rates.
• This is the method used within MLwiN.
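Combining this with the logistic regression conditional from earlier, a random walk Metropolis update for β0 might look like the sketch below (data names and the proposal SD s are illustrative; because the Normal proposal is symmetric, the q terms cancel). The accepted/rejected flag returned here is the quantity monitored by the adaptive tuning method described later.

```python
import numpy as np

def rw_metropolis_beta0(beta0, beta1, x, y, n, m0, s, rng):
    """One random walk Metropolis update for beta0 in the logistic model."""
    def log_post(b0):
        eta = b0 + beta1 * x
        return np.sum(y * eta - n * np.log1p(np.exp(eta))) - b0 ** 2 / (2.0 * m0)
    beta0_star = rng.normal(beta0, s)          # symmetric N(beta0, s^2) proposal
    if np.log(rng.uniform()) < log_post(beta0_star) - log_post(beta0):
        return beta0_star, True                # accepted
    return beta0, False                        # rejected: chain stays at current value
```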
Independence Sampler
The independence sampler is so called as each proposal
is independent of the current parameter value i.e.
q(θ1|θ2) = q(θ1).
This leads to acceptance probability
$\alpha(\theta^*, \theta_t) = \min\!\left(1, \dfrac{p(\theta^* \mid Y)\, q(\theta_t)}{p(\theta_t \mid Y)\, q(\theta^*)}\right).$
An example proposal distribution could be a Normal centred
at the ML estimate with an inflated variance. The
independence sampler can sometimes work very well
but equally can work very badly!
Gibbs Sampling
The Gibbs sampler that we have already studied
is a special case of the MH algorithm. The
proposal distribution is the full conditional
distribution which leads to acceptance rate 1 as
shown below:
p ( * | y )q ( t |  * )
rt 


p ( t | y )q ( * |  t )
p ( * | y ) / p ( (*i ) |  t (  i ) , y )
p ( t | y ) / p ( t (i ) |  t (  i ) , y )
p ( t (  i ) | y )
p ( t (  i ) | y )
1
MH Sampling in MLwiN
• MLwiN actually uses a hybrid method.
• Gibbs sampling steps are used for variance parameters.
• MH steps are used for fixed effects and residuals.
• Univariate Normal proposal distributions are used.
• For the proposal standard deviation a scaled IGLS
standard error is initially used (multiplied by 5.8 on the
variance scale).
• However an adaptive method is used to tune these
proposal distributions prior to the burn-in period.
MH Sampling for Normal model
For the education dataset we can illustrate MH Sampling
on the VC model by modifying steps 1 and 2.
Repeat the following four steps:
• 1. Generate βi by Univariate Normal MH Sampling.
• 2. Generate each uj by Univariate Normal MH Sampling.
• 3. Generate 1/σu2 from its Gamma conditional
distribution.
• 4. Generate 1/σe2 from its Gamma conditional
distribution.
MH Sampling in MLwiN
Here we see how to change the estimation
method. Note that for Binomial
responses this will change
automatically and Gibbs
sampling will not be available.
Trajectories plot
Here we see MH sampling for β.
Adaptive Method (ad hoc)
One way of finding a ‘good’ proposal distribution is to
choose a distribution that gives a particular acceptance
rate. It has been shown that a ‘good’ acceptance rate is
often around 50%.
MLwiN has incorporated an adaptive method that uses this
fact to construct univariate Normal proposals with an
acceptance rate of approximately 50%.
Method
Before the burn-in we have an adaptation period where the
sampler improves the proposal distribution. The adaptive
method requires a desired acceptance rate e.g. 50% and
tolerance e.g. 10% resulting in an acceptable range of
(40%,60%).
Adaptive method algorithm
Run the sampler for consecutive batches of 100 iterations.
Compare the number accepted, N with the desired
acceptance rate, R.
If N  R,  new   old /( 2  NR )
N
Else  new   old  (2  100
100 R )
Repeat this procedure until 3 consecutive values of N lie
within the acceptable range and then mark this
parameter. When all the parameters are marked the
adaptation period is over.
N.B. Proposal SDs are still modified after being marked
until adaptation period is over.
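A sketch of this tuning loop, written against the update rule reconstructed above, is given below. It tunes a single parameter and assumes an update(sd, rng) callback that performs one MH update and returns True if the proposal was accepted; both the callback and the batching details are illustrative rather than MLwiN's exact scheme.

```python
def adapt_proposal_sd(update, sd0, rate=0.5, tol=0.1, batch=100, rng=None):
    """Ad hoc adaptive tuning of a random walk proposal SD (sketch of the method above)."""
    sd, in_row = sd0, 0
    while in_row < 3:                              # want 3 consecutive batches in range
        accepted = sum(update(sd, rng) for _ in range(batch))
        if accepted <= rate * batch:               # too few acceptances: shrink the SD
            sd = sd / (2.0 - accepted / (rate * batch))
        else:                                      # too many acceptances: inflate the SD
            sd = sd * (2.0 - (batch - accepted) / ((1.0 - rate) * batch))
        in_row = in_row + 1 if abs(accepted / batch - rate) <= tol else 0
    return sd
```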
Example of Adaptive period
Adaptive method used on parameter, β0:
Iteration   Proposal SD   Accepted (per 100)   Batches in row
0           1.000         -                    -
100         0.505         1                    0
200         0.263         4                    0
300         0.138         5                    0
400         0.074         7                    0
500         0.046         19                   0
600         0.032         29                   0
700         0.031         48                   1
800         0.026         40                   2
900         0.024         46                   3*
1000        0.021         51                   3*
1500        0.022         48                   3*
Comparison of Gibbs vs MH
MLwiN also has an MVN (multivariate Normal proposal) MH algorithm. A comparison of
ESS (effective sample size) for a run of 5,000 iterations of the VC model follows:
Parameter   Gibbs   MH     MH MV
β0          216     33     59
β1          4413    973    303
σ²u         2821    2140   2919
σ²e         4712    4895   4728
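The ESS figures above discount the raw run length for autocorrelation in each chain, using something like ESS = n / (1 + 2 Σ_k ρ_k). A rough sketch of such a calculation is given below; it truncates the sum at the first non-positive autocorrelation, which is one simple convention and need not match MLwiN's exact estimator.

```python
import numpy as np

def effective_sample_size(chain):
    """Rough ESS estimate: n / (1 + 2 * sum of positive-lag autocorrelations)."""
    x = np.asarray(chain, dtype=float)
    x = x - x.mean()
    n = len(x)
    # autocorrelations at lags 0..n-1
    acf = np.correlate(x, x, mode="full")[n - 1:] / (np.arange(n, 0, -1) * x.var())
    rho_sum = 0.0
    for rho in acf[1:]:
        if rho <= 0:                     # truncate at the first non-positive value
            break
        rho_sum += rho
    return n / (1.0 + 2.0 * rho_sum)
```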
2 level Reunion Island dataset
We have already studied this dataset with a
continuous response.
Here we consider a subset with only the 1st
lactation for each cow resulting in 2 levels: cows
nested within herds.
The (Binary) response of interest is fscr – whether
the first service results in a conception.
There are two predictors – ai (whether
insemination was natural or artificial) and heifer
(the age of the cow).
MCMC algorithm
Our MLwiN algorithm has 3 steps:
• 1. Generate βi by Univariate Normal MH
Sampling.
• 2. Generate each uj by Univariate Normal
MH Sampling.
• 3. Generate 1/σu2 from its Gamma
conditional distribution.
MLwiN Demo
The 2-level model is set up and run in IGLS giving the following
starting values:
Trajectories for 5k iterations
Here we see some poor mixing particularly for the variance:
DIC for binomial models
We can use DIC to check whether we need random effects
and whether to include heifer in the model. Note we only
ran each model for 5,000 iterations.
Model              pD      DIC
VC + AI            18.37   2087.15 (2086.48 after 50k)
LR + AI            2.16    2095.70
VC + AI + Heifer   22.16   2087.43
VC Model after 50k iterations
Here is a trace for the herd level variance after 50k iterations that suggests
we need to run even longer!
VPC for Binomial models
VPC is harder to calculate for Binomial models as
the level 1 variance is part of the Binomial
distribution and hence related to the mean and
on a different scale to higher level variances.
Goldstein et al. (2002) propose 4 methods:
1. Use a Taylor series approximation.
2. Use a simulation based approach.
3. Switch to a Normal response model.
4. Use the latent variable approach in Snijders
and Bosker.
VPC for Binomial models
Snijders and Bosker (1999) suggest the following:
The variance of a standard logistic distribution is π2/3
and so the level 1 variance should be replaced by this
value.
In the Reunion Island example this means
$\mathrm{VPC}_{\mathrm{HERD}} = \dfrac{0.089}{0.089 + \pi^2/3} = 0.0263.$
Or in other words 2.63% of the variation is at the herd
level. The fact that there isn’t a huge level 2 variance
may in part explain the poor mixing of the MCMC
algorithm.
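As a quick numeric check, the latent variable VPC above and, for comparison, the simulation based VPC of Goldstein et al. (2002) (method 2 in the earlier list) can be computed as sketched below; the fixed part value beta_x passed to the simulation version is a placeholder for the fitted linear predictor, which is not quoted on these slides.

```python
import numpy as np

sigma2_u = 0.089                                   # herd level variance from the fitted model
vpc_latent = sigma2_u / (sigma2_u + np.pi ** 2 / 3)
print(round(vpc_latent, 4))                        # 0.0263, as quoted above

def vpc_simulation(beta_x, sigma2_u, n_sim=100000, seed=1):
    """Goldstein et al. (2002) simulation approach, sketched for a single fixed part value."""
    rng = np.random.default_rng(seed)
    u = rng.normal(0.0, np.sqrt(sigma2_u), n_sim)  # simulate herd effects
    p = 1.0 / (1.0 + np.exp(-(beta_x + u)))        # cluster-specific probabilities
    level2_var = p.var()                           # between-herd variance of p
    level1_var = np.mean(p * (1.0 - p))            # average Binomial (level 1) variance
    return level2_var / (level2_var + level1_var)
```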
3 level Reunion Island dataset
We will now fit the same models to the 3-level dataset. After 5,000 iterations we have the
following rather worrying results:
Running for longer
We ran the chains for 100k after a burn-in of 5k and thinned the chains by a
factor of 10. The results are still a little worrying:
Potential solutions
In the last 2 slides we have seen some bad mixing
behaviour for some parameters.
The solution of running for longer seems to work
but we need to run for a very long time!
In the next lecture we will look at WinBUGS
methods for this model and also a
reparameterisation of the model known as
hierarchical centering.
For this model it looks like MCMC is extremely
computationally intensive to get reliable
estimates. In what follows we look at an example
where MCMC is clearly useful.
The Guatemalan Child Health
dataset.
This consists of a subsample of 2,449 respondents from the 1987
National Survey of Maternal and Child Health, with a 3-level
structure of births within mothers within communities.
The subsample consists of all women from the chosen communities
who had some form of prenatal care during pregnancy. The
response variable is whether this prenatal care was modern
(physician or trained nurse) or not.
Rodriguez and Goldman (1995) use the structure of this dataset to
consider how quasi-likelihood methods compare with simply
ignoring the multilevel structure and fitting a standard (single level)
logistic regression.
They perform this by constructing simulated datasets based on the
original structure but with known true values for the fixed effects
and variance parameters.
They consider the MQL method and show that the estimates of the
fixed effects produced by MQL are worse than the estimates
produced by standard logistic regression disregarding the multilevel
structure!
The Guatemalan Child Health
dataset.
Goldstein and Rasbash (1996) consider the same problem but use the PQL
method. They show that the results produced by PQL 2nd order
estimation are far better than for MQL but still biased.
The model in this situation is
$y_{ijk} \sim \mathrm{Bernoulli}(p_{ijk})$ with
$\mathrm{logit}(p_{ijk}) = \beta_0 + \beta_1 x_{1ijk} + \beta_2 x_{2jk} + \beta_3 x_{3k} + u_{jk} + v_k$
where $u_{jk} \sim N(0, \sigma_u^2)$ and $v_k \sim N(0, \sigma_v^2)$.
In this formulation i,j and k index the level 1, 2 and 3 units respectively.
The variables x1,x2 and x3 are composite scales at each level because the
original model contained many covariates at each level.
Browne (1998) considered the hybrid Metropolis-Gibbs method in MLwiN
and two possible variance priors ($\Gamma^{-1}(\varepsilon, \varepsilon)$ and Uniform priors on the variances).
Simulation Results
The following gives point estimates (MCSE) for 4 methods and 500 simulated
datasets.
Parameter (True)   MQL1          PQL2          Gamma         Uniform
β0 (0.65)          0.474 (0.01)  0.612 (0.01)  0.638 (0.01)  0.655 (0.01)
β1 (1.00)          0.741 (0.01)  0.945 (0.01)  0.991 (0.01)  1.015 (0.01)
β2 (1.00)          0.753 (0.01)  0.958 (0.01)  1.006 (0.01)  1.031 (0.01)
β3 (1.00)          0.727 (0.01)  0.942 (0.01)  0.982 (0.01)  1.007 (0.01)
σ²v (1.00)         0.550 (0.01)  0.888 (0.01)  1.023 (0.01)  1.108 (0.01)
σ²u (1.00)         0.026 (0.01)  0.568 (0.01)  0.964 (0.02)  1.130 (0.02)
Simulation Results
The following gives interval coverage probabilities (90%/95%) for 4 methods
and 500 simulated datasets.
Parameter (True)   MQL1        PQL2        Gamma       Uniform
β0 (0.65)          67.6/76.8   86.2/92.0   86.8/93.2   88.6/93.6
β1 (1.00)          56.2/68.6   90.4/96.2   92.8/96.4   92.2/96.4
β2 (1.00)          13.2/17.6   84.6/90.8   88.4/92.6   88.6/92.8
β3 (1.00)          59.0/69.6   85.2/89.8   86.2/92.2   88.6/93.6
σ²v (1.00)         0.6/2.4     70.2/77.6   89.4/94.4   87.8/92.2
σ²u (1.00)         0.0/0.0     21.2/26.8   84.2/88.6   88.0/93.0
Summary of simulations
The Bayesian approach yields excellent bias and
coverage results.
For the fixed effects, MQL performs badly but the
other 3 methods all do well.
For the random effects, MQL and PQL both
perform badly but MCMC with both priors is
much better.
Note that this is an extreme scenario with few level 1
units per level 2 unit yet high level 2 variance; in
other examples MQL/PQL will not be so bad.
Introduction to Practical
In the practical you will be let loose on
MLwiN with two datasets:
1. A dataset on contraceptive use in
Bangladesh.
2. A veterinary epidemiology dataset on
pneumonia in pigs.