Multilevel & Bayes
A Mixed Story
Joop Hox
Methodology & Statistics
Utrecht University
The Bayes Story as a
“who-dunnit?”
Some early links between
Bayes and multilevel
How Bayesians almost invented
multilevel analysis
Statistics, around 1930
Bayesian statistics: fiercely subjectivist
Frequentist statistics: not yet of age
“Discussions” (read: total war) between Fisher, Karl Pearson, and Neyman & Egon Pearson
But total agreement that Bayesian statistics
with unknown priors was wrong
Nice and non-partisan discussion of their different approaches in Vic Barnett's Comparative Statistical Inference
An interesting discussion,
1971, Royal Stat. Soc.
A presentation by Lindley & Smith discusses
Bayesian estimation for the linear model
Key term: exchangeability
Observed units are exchangeable, and have some
distribution (e.g. normal) described by parameters
Key term: hyperparameter
These parameters also have a distribution described
by hyperparameters
The distribution of hyperparameters can be
structured (e.g. follow some linear model)
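A minimal sketch of this two-stage hierarchy in modern notation (the symbols below are illustrative, not Lindley & Smith's original notation):

```latex
\begin{align*}
y_{ij} \mid \theta_j &\sim N(\theta_j,\ \sigma^2)
  && \text{stage 1: exchangeable observations in unit } j\\
\theta_j \mid \mu, \tau^2 &\sim N(\mu,\ \tau^2)
  && \text{stage 2: unit parameters, described by hyperparameters}\\
\theta_j \mid \boldsymbol{\gamma}, \tau^2 &\sim N(\mathbf{z}_j'\boldsymbol{\gamma},\ \tau^2)
  && \text{stage 2 structured as a linear model}
\end{align*}
```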
Lindley & Smith, 1971,
Example 2
Assume a sample of schools, with variables x
and y linearly related (regression)
One can estimate y from x using regression
parameters estimated per school
These estimates can be improved by incorporating
information from the other schools
assuming exchangeability
What does exchangeability mean?
Suppose y is school success; conditional on school variables (here: none), we should not care which school we go to – it's all the same to us
Formally: exchanging (reordering) schools does not
change their joint distribution
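In symbols (my formulation of the standard definition): for any permutation π of the school indices,

```latex
p(y_1, y_2, \ldots, y_J) = p(y_{\pi(1)}, y_{\pi(2)}, \ldots, y_{\pi(J)})
```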
Lindley & Smith, 1971,
Example 2
 Estimates can be improved by incorporating
information from the other schools.
 How?
 Suppose we have teacher ratings of ‘school
climate’
 Two approaches:
1. Use the mean rating per school: the OLS estimate $\hat{\beta}_{0j}^{\,OLS}$
2. Use an empty 2-level model, $Y_{ij} = \gamma_{00} + u_{0j} + e_{ij}$, and calculate the EB estimate
   $\hat{\beta}_{0j}^{\,EB} = \lambda_j \hat{\beta}_{0j}^{\,OLS} + (1 - \lambda_j)\,\gamma_{00}$
 EB = Empirical Bayes
 Effectively using information of all schools as prior
 Known to be biased but also more precise
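A minimal numerical sketch of the two approaches on simulated data (the weight lambda_j below is the usual reliability of the school mean, tau² / (tau² + sigma²/n_j); that formula is an assumption on my part, since the slide does not define lambda_j):

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate J schools with n teachers each: Y_ij = gamma_00 + u_0j + e_ij
J, n = 100, 10
gamma_00, tau2, sigma2 = 5.0, 0.5, 2.0
u = rng.normal(0.0, np.sqrt(tau2), size=J)          # school-level residuals
y = gamma_00 + u[:, None] + rng.normal(0.0, np.sqrt(sigma2), size=(J, n))

# Approach 1: OLS estimate = mean rating per school
beta_ols = y.mean(axis=1)

# Approach 2: EB (shrinkage) estimate from the empty 2-level model.
# The true variance components are plugged in for simplicity; in
# practice they are estimated from the data.
lam = tau2 / (tau2 + sigma2 / n)                    # reliability of the school mean
beta_eb = lam * beta_ols + (1 - lam) * gamma_00     # shrink towards the grand mean

print(f"reliability lambda = {lam:.2f}")
print(f"variance of OLS estimates: {beta_ols.var():.3f}")
print(f"variance of EB estimates:  {beta_eb.var():.3f}  (shrunken)")
```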
Empirical Bayes Estimates
EB is an interesting link, because it can be
justified from both a frequentist and a Bayesian
perspective:
EB point estimates come from a posterior distribution whose prior is estimated from the data
‘Shrinkage estimates’ provide improved, more reliable point estimates for individual population members
John Tukey: shrinkage estimates borrow strength
(from other population members)
EB/Shrinkage Estimates
Bottom Line:
 Both EB and shrinkage estimates underestimate the true
population variance
The point estimates $\hat{\beta}_{0j}^{\,EB}$ are biased
The point estimates $\hat{\beta}_{0j}^{\,EB}$ are more precise
 Frequentist: lower Mean Squared Error in repeated sampling
 Bayesian: lower variance posterior distribution
 Conclusion: use EB estimates as point estimates
 to describe units (e.g. school climate) (cf. factor scores)
 to diagnose if units are outliers
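A small repeated-sampling sketch of this bias/precision trade-off (an illustrative simulation, not from the slides; the variance components are treated as known):

```python
import numpy as np

rng = np.random.default_rng(1)

J, n, reps = 50, 5, 2000
gamma_00, tau2, sigma2 = 0.0, 0.5, 2.0
lam = tau2 / (tau2 + sigma2 / n)                 # reliability weight, assumed known

beta_true = gamma_00 + rng.normal(0.0, np.sqrt(tau2), size=J)   # fixed school effects

mse_ols = np.empty(reps)
mse_eb = np.empty(reps)
for r in range(reps):
    # Draw a new sample of pupils for the same schools
    y = beta_true[:, None] + rng.normal(0.0, np.sqrt(sigma2), size=(J, n))
    b_ols = y.mean(axis=1)
    b_eb = lam * b_ols + (1 - lam) * gamma_00    # biased towards gamma_00
    mse_ols[r] = np.mean((b_ols - beta_true) ** 2)
    mse_eb[r] = np.mean((b_eb - beta_true) ** 2)

print(f"MSE of OLS estimates: {mse_ols.mean():.3f}")
print(f"MSE of EB estimates:  {mse_eb.mean():.3f}")   # lower, despite the bias
```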
Lindley & Smith, follow-up
 Lindley & Smith (1971) next discuss estimation
procedures, using iterative estimation
 The proceedings include a serious discussion by
other RSS members
Local history: PhD thesis by Margo Jansen on Bayesian estimation in educational measurement (RUG, 1977)
 However, none of this had a large impact on
statistical practice
 Although this approach can clearly be extended
to multilevel modeling, this did not happen
In other words, their approach was not followed up
Why?
 Although this approach can clearly be extended
to multilevel modeling, this did not happen
 Multilevel modeling = Having levels and variables at
all levels (my definition)
 Why?
1. The interest was mostly in Bayesian estimation as such, compared to frequentist estimation
2. The estimation method used simple iterative equations (which work only for simple problems)
3. The discussion was highly theoretical
Interlude:
first multilevel software
GENMOD (1989)
No EB estimates, but they can be calculated from the output
HLM (1988)
OLS and EB residuals available
Exploratory procedures and diagnostics file
ML2 (1987)
All data in memory (< 640 kB!), includes a stats package
OLS and EB residuals available
Exploratory procedures and diagnostic plots
VARCL (1986)
3 levels!
Non-normal outcomes!
Multilevel
Bayes
Bayesian estimation for multilevel models starts
with MLwiN 1.0 (1998)
With a readable manual
+ Bayesian estimation explained in Hox (2002)
Yet Bayesian estimation is rarely used
Googling ‘Bayesian estimation MLwiN’: the 1st substantive use is found on page 7 of the results
On the other hand:
Googling ‘Bayesian estimation Mplus’: the 1st substantive use is found on page 6 of the results
Why Bayesian estimation?
Complex models
In MLwiN, Bayesian estimation was introduced for multilevel nonlinear models, which MLwiN otherwise estimates with limited precision using Taylor series linearization, just like SPSS…
Other software may use numerical approximation, which generally works well but may require much computing power
In MLwiN also used for cross-classifications and
multiple membership models
Bayesian estimation works well when models become
complex
Bayesian estimation for
complex models
Why are cross-classified (CC) and multiple
membership (MM) data complex?
In a two-level intercept-only model the covariances between individuals form a block-diagonal matrix
Dependencies exist only within groups
100 groups of size 10 means 100 10×10 matrices = 10 000 cells
CC and MM models: dependencies can be anywhere
100 groups of size 10 means one 1000×1000 matrix = 1 000 000 cells
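A quick back-of-the-envelope check in code (purely illustrative; the numbers match the bullet points above):

```python
import numpy as np

J, n = 100, 10                     # 100 groups of size 10 -> 1000 individuals
tau2, sigma2 = 0.5, 2.0

# Two-level intercept-only model: dependencies only within a group, so the
# covariance matrix of all 1000 individuals is block diagonal.
block = tau2 * np.ones((n, n)) + sigma2 * np.eye(n)   # within-group covariance
V = np.zeros((J * n, J * n))
for j in range(J):
    V[j * n:(j + 1) * n, j * n:(j + 1) * n] = block

print(V.shape)              # (1000, 1000)
print(np.count_nonzero(V))  # 10000 cells inside the 100 blocks of 10x10
print(V.size)               # 1000000 cells when dependencies can be anywhere (CC/MM)
```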
Why Bayesian estimation?
Small sample sizes (especially 2nd level)
Example: Meuleman & Billiet (2009) How
many countries are needed for accurate
multilevel SEM?
ML estimation: at least 40 countries for accurate estimates; for good power, at least 60
Hox, vd Schoot & Matthijsse (2012) How
few countries will do?
Using Bayesian estimation: 20 countries give good accuracy, and even 10 is not too bad (note that the model has 10 parameters!)
Why Bayesian estimation?
Violation of (some) assumptions
Bayesian estimates always have admissible values
Play around with different distributions, e.g. if there are outliers, choose a parameter distribution with a long tail (Hox & vd Schoot, 2013)
It is simple to work with derived parameter estimates, e.g. indirect effects in mediation models
Generate k MCMC iterations, calculate k indirect
effects, examine distribution
(Hox, Moerbeek, Kluytmans & vd Schoot, 2014)
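A sketch of that recipe (the draws below are faked with random numbers standing in for k MCMC iterations of the two paths of an indirect effect; in a real analysis they come from the sampler's output):

```python
import numpy as np

rng = np.random.default_rng(7)

# Stand-ins for k MCMC draws of the a path (X -> M) and b path (M -> Y)
k = 5000
a_draws = rng.normal(0.30, 0.05, size=k)
b_draws = rng.normal(0.40, 0.08, size=k)

# Derived parameter: one indirect effect per MCMC iteration
indirect = a_draws * b_draws

# Examine its distribution: point estimate and 95% credible interval
print(f"posterior median: {np.median(indirect):.3f}")
print(f"95% credible interval: {np.percentile(indirect, [2.5, 97.5]).round(3)}")
```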
Why Bayesian estimation?
If you have incomplete data
Listwise deletion (LD, the default in most software) is extremely wasteful if the missing values are at the 2nd level
And LD makes the strong assumption of MCAR; principled methods are better (Hox, van Buuren & Jolani, 2016)
Missing data can be viewed as a complex model
 Each missing data point is yet another parameter
Bayesian estimation can be used directly, or to
generate multiple imputations
Remember Gerko Vink …
Why Bayesian estimation?
 It is not always clear from which population the 2nd level
sample comes…
 50 US states, 28 EU member states, 12 NL provinces
 In traditional statistics, we must assume that they are a
sample from some population.
 Ad hoc ‘solutions’
1. Ignore (unless a reviewer complains)
2. Assume a hypothetical population (very similar to 1.)
3. Declare that using a simple random model is more parsimonious than a fixed effects model (dummies)
 Bayesian solution
1. Assume exchangeability
Note that ad hoc solution 3 deals with uncertainty, which is close to Bayes
Why Bayesian estimation?
Why not?
Bayesian estimation is rarely used
Googling ‘Bayesian estimation MLwiN’: the 1st substantive use is found on page 7 of the results
Googling ‘Bayesian estimation Mplus’: the 1st substantive use is found on page 6 of the results
 WHY?