Multilevel & Bayes: A Mixed Story
Joop Hox
Methodology & Statistics, Utrecht University
1

The Bayes Story as a “who-dunnit?”
2

Some early links between Bayes and multilevel
How Bayesians almost invented multilevel analysis
3

Statistics, around 1930
Bayesian statistics: fiercely subjectivist
Frequentist statistics: not yet of age
“Discussions” (read: total war) between Fisher, Karl Pearson, and Neyman-Egon Pearson
But total agreement that Bayesian statistics with unknown priors was wrong
A nice, non-partisan discussion of their different approaches: Vic Barnett, Comparative Statistical Inference
4

An interesting discussion, 1971, Royal Stat. Soc.
A presentation by Lindley & Smith discusses Bayesian estimation for the linear model
Key term: exchangeability
Observed units are exchangeable and have some distribution (e.g. normal) described by parameters
Key term: hyperparameter
These parameters also have a distribution, described by hyperparameters
The distribution of hyperparameters can be structured (e.g. follow some linear model)
5

Lindley & Smith, 1971, Example 2
Assume a sample of schools, with variables x and y linearly related (regression)
One can estimate y from x using regression parameters estimated per school
These estimates can be improved by incorporating information from the other schools, assuming exchangeability
What does exchangeability mean? Suppose y is school success; conditional on school variables (here: none) we should not care which school we go to – it is all the same to us
Formally: exchanging (reordering) the schools does not change their joint distribution
6

Lindley & Smith, 1971, Example 2
Estimates can be improved by incorporating information from the other schools. How?
Suppose we have teacher ratings of ‘school climate’
Two approaches:
1. Use the mean rating per school: β̂_0j(OLS)
2.
Use an empty 2-level model, Y_ij = γ_00 + u_0j + e_ij, and calculate
β̂_0j(EB) = λ_j β̂_0j(OLS) + (1 − λ_j) γ̂_00
EB = Empirical Bayes
Effectively uses the information from all schools as the prior
Known to be biased, but also more precise
7

Empirical Bayes Estimates
EB is an interesting link, because it can be justified from both a frequentist and a Bayesian perspective:
EB point estimates have a posterior distribution whose prior is estimated from the data
‘Shrinkage estimates’ provide improved, more reliable point estimates for individual population members
John Tukey: shrinkage estimates ‘borrow strength’ (from the other population members)
8

EB/Shrinkage Estimates
Bottom line: both EB and shrinkage estimates underestimate the true population variance
The point estimates β̂_0j(EB) are biased
The point estimates β̂_0j(EB) are more precise
Frequentist: lower mean squared error in repeated sampling
Bayesian: lower-variance posterior distribution
Conclusion: use EB estimates as point estimates
to describe units (e.g. school climate) (cf. factor scores)
to diagnose whether units are outliers
9

Lindley & Smith, follow-up
Lindley & Smith (1971) next discuss estimation procedures, using iterative estimation
The proceedings include a serious discussion by other RSS members
Local history: PhD thesis by Margo Jansen on Bayesian estimation in educational measurement (RUG, 1977)
However, none of this had a large impact on statistical practice
Although this approach can clearly be extended to multilevel modeling, this did not happen
In other words, their approach was not followed up
10

Why?
Although this approach can clearly be extended to multilevel modeling, this did not happen
Multilevel modeling = having levels and variables at all levels (my definition)
Why?
1. The interest was mostly in Bayesian estimation as compared to frequentist estimation
2. The estimation method used simple iterative equations (works only in simple problems)
3.
The discussion was highly theoretical
11

Interlude: the first multilevel software
GENMOD (1989): no EB estimates, but they can be calculated from the output
HLM (1988): OLS and EB residuals available; exploratory procedures and a diagnostics file
ML2 (1987): all data in memory (<640 k!); includes a stats package; OLS and EB residuals available; exploratory procedures and diagnostic plots
VARCL (1986): 3 levels! Non-normal outcomes!
12

Multilevel Bayes
Bayesian estimation for multilevel models starts with MLwiN 1.0 (1998)
With a readable manual + Bayesian estimation explained in Hox (2002)
Yet Bayesian estimation is rarely used
Google ‘Bayesian estimation MLwiN’: 1st substantive use found on page 7
On the other hand: Google ‘Bayesian estimation Mplus’: 1st substantive use found on page 6
13

Why Bayesian estimation? Complex models
In MLwiN, introduced for multilevel nonlinear models, which MLwiN estimates with limited precision using Taylor linearization, just like SPSS…
Other software may use numerical approximation, which generally works well but may take much computing power
In MLwiN, also used for cross-classified and multiple membership models
Bayesian estimation works well when models become complex
14

Bayesian estimation for complex models
Why are cross-classified (CC) and multiple membership (MM) data complex?
In a two-level intercept-only model the covariances between individuals form a block-diagonal matrix
Dependencies exist only within groups
100 groups of size 10 means 100 10×10 matrices = 10,000 cells
CC and MM models: dependencies can occur anywhere
100 groups of size 10 means one 1000×1000 matrix = 1,000,000 cells
15

Why Bayesian estimation? Small sample sizes (especially at the 2nd level)
Example: Meuleman & Billiet (2009): how many countries are needed for accurate multilevel SEM?
ML estimation: at least 40 for accuracy; for good power at least 60
Hox, vd Schoot & Matthijsse (2012): how few countries will do?
Using Bayesian estimation: 20 for good accuracy, and even 10 is not too bad (note: the model has 10 parameters!)
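The cell counts for nested versus cross-classified data can be checked with a short sketch (illustrative only; the variance components tau2 and sigma2 are made-up values, not from the talk). It builds the covariance matrix implied by a two-level intercept-only model with 100 groups of size 10 and confirms that only the within-group blocks can be nonzero:

```python
import numpy as np

# Two-level intercept-only model: Var(y_ij) = tau2 + sigma2,
# Cov(y_ij, y_i'j) = tau2 within a group, 0 across groups.
tau2, sigma2 = 0.5, 1.0          # illustrative variance components
n_groups, group_size = 100, 10
n = n_groups * group_size        # 1000 individuals in total

group = np.repeat(np.arange(n_groups), group_size)  # group index per individual
same_group = group[:, None] == group[None, :]       # (1000, 1000) boolean mask
cov = np.where(same_group, tau2, 0.0) + np.eye(n) * sigma2

within_block_cells = n_groups * group_size**2       # 100 blocks of 10x10
total_cells = n * n                                 # the full 1000x1000 matrix

print(within_block_cells)                           # 10000
print(total_cells)                                  # 1000000
print(np.count_nonzero(cov) <= within_block_cells)  # True: block-diagonal
```

In a CC or MM model the `same_group` mask no longer has this block pattern, so in principle any of the 1,000,000 cells can be nonzero, which is what makes those models computationally heavy.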
16

Why Bayesian estimation? Violation of (some) assumptions
Bayesian estimates always have admissible values
Play around with different distributions; e.g., if there are outliers, choose a parameter distribution with a long tail (Hox & vd Schoot, 2013)
Work simply with derived parameter estimates, e.g. indirect effects in mediation models
Generate k MCMC iterations, calculate k indirect effects, and examine their distribution (Hox, Moerbeek, Kluytmans & vd Schoot, 2014)
17

Why Bayesian estimation? If you have incomplete data
Listwise deletion (LD, the default in most software) is extremely wasteful if the missing data are at the 2nd level
And LD makes the strong assumption MCAR; principled methods are better (Hox, van Buuren & Jolani, 2016)
Missing data can be viewed as a complex model
Each missing data point is yet another parameter
Bayesian estimation can be used directly, or to generate multiple imputations
Remember Gerko Vink …
18

Why Bayesian estimation? It is not always clear from which population the 2nd-level sample comes…
50 US states, 28 EU member states, 12 NL provinces
In traditional statistics, we must assume that they are a sample from some population
Ad hoc ‘solutions’:
1. Ignore the issue (unless a reviewer complains)
2. Assume a hypothetical population (very similar to 1.)
3. Declare that a simple random-effects model is more parsimonious than a fixed-effects model (dummies)
Bayesian solution:
1. Assume exchangeability
Note that ad hoc solution 3 is dealing with uncertainty, which is close to Bayes
19

Why Bayesian estimation? Why not?
Bayesian estimation is rarely used
Google ‘Bayesian estimation MLwiN’: 1st substantive use found on page 7
Google ‘Bayesian estimation Mplus’: 1st substantive use found on page 6
WHY?
20
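The derived-parameter procedure from the "violation of assumptions" slide (generate k MCMC iterations, calculate k indirect effects, examine the distribution) can be sketched as follows. This is a minimal illustration with simulated posterior draws; the normal posteriors and the path values 0.40 and 0.30 are assumptions for the sketch, not results from any of the cited papers:

```python
import numpy as np

rng = np.random.default_rng(42)
k = 10_000                       # number of MCMC draws (illustrative)

# Pretend these are posterior draws of the two regression paths;
# in a real analysis they come from the sampler's chain output.
a_draws = rng.normal(loc=0.40, scale=0.10, size=k)  # x -> mediator path
b_draws = rng.normal(loc=0.30, scale=0.08, size=k)  # mediator -> y path

indirect = a_draws * b_draws     # one indirect effect per draw

# Point estimate and 95% credible interval taken directly from the draws
est = np.median(indirect)
lo, hi = np.percentile(indirect, [2.5, 97.5])
print(f"indirect effect ~ {est:.3f}, 95% interval [{lo:.3f}, {hi:.3f}]")
```

The point of working with the draws is that the product of two normally distributed paths is itself not normal; summarizing the k products directly sidesteps any normality assumption on the derived parameter.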