My First Bayes: Why and How to Run Your First Bayesian Model

Rens van de Schoot
a.g.j.vandeschoot@uu.nl
rensvandeschoot.wordpress.com

As stated by Kruschke (2012) in a recent special issue of Perspectives on Psychological Science: "[...] whereas the 20th century was dominated by null hypothesis significance testing, the 21st century is becoming Bayesian."

NHST vs. Bayes

Pr(Data | H0) ≠ Pr(Hi | Data)

Bayes' Theorem

Pr(Hi | Data) = posterior ∝ prior × data

The posterior probability is proportional to the product of the prior probability and the likelihood.

Bayes' theorem: prior, data and posterior

Suppose p(A) = 1/4, p(B) = 1/4, and p(A∩B) = 1/16. What is the probability of A within B? Stated otherwise: what is the probability of A given B, p(A|B)?

(1) p(A|B) = p(A∩B) / p(B)
(2) p(B|A) = p(A∩B) / p(A)
(3) p(A∩B) = p(B|A) · p(A)
(4) p(A|B) = p(B|A) · p(A) / p(B)   <- Bayes' Theorem

In the example: p(A|B) = (1/16) / (1/4) = 1/4.

Now replace A and B with a hypothesis (H) and data (D):

Pr(H | D) = Pr(D | H) · Pr(H) / Pr(D)

so the posterior probability Pr(Hi | Data) is proportional to the product of the prior probability and the likelihood.

Intelligence (IQ)

[Figure: a distribution for IQ on an axis running from -∞ to ∞.]

Interval     Cognitive designation
40-54        Severely challenged (<1% of test takers)
55-69        Challenged (2.3% of test takers)
70-84        Below average
85-114       Average (68% of test takers)
115-129      Above average
130-144      Gifted (2.3% of test takers)
145-159      Genius (less than 1% of test takers)
160-175      Extraordinary genius

Prior knowledge

[Figures: five possible priors for IQ ("Prior knowledge 1" through "Prior knowledge 5"), plotted on an IQ axis running from 40 to 180, with 100 marked on the axis.]

Prior, data and posterior

[Figures: the prior and the data (likelihood) are combined into the posterior; shown first on an unbounded IQ axis and then on the 40-180 axis, for several of the priors above.]

How to obtain the posterior?

In complex models, the posterior is often intractable (impossible to compute exactly).
Solution: approximate the posterior by simulation:
– simulate many draws from the posterior distribution;
– compute the mode, median, mean, 95% interval, et cetera from the simulated draws.

ANOVA example

Four unknown parameters μj (j = 1, ..., 4) and one common but unknown variance σ². Statistical model:

Y = μ1·D1 + μ2·D2 + μ3·D3 + μ4·D4 + E, with E ~ N(0, σ²),

where D1-D4 are the group dummies.

The Gibbs sampler

Specify the prior: Pr(μ1, μ2, μ3, μ4, σ²)
Prior(μj) ~ Nor(μ0, var0), here Nor(0, 10000)
Prior(σ²) ~ IG(0.001, 0.001), an inverse gamma with shape a and scale b

Combining the prior with the likelihood provides the posterior:

Post(μ1, μ2, μ3, μ4, σ² | data)

... this is a 5-dimensional distribution ...

Iterative evaluation via the conditional distributions:

Post(μ1 | μ2, μ3, μ4, σ², data) ~ N(M1, S1)
Post(μ2 | μ1, μ3, μ4, σ², data) ~ N(M2, S2)
Post(μ3 | μ1, μ2, μ4, σ², data) ~ N(M3, S3)
Post(μ4 | μ1, μ2, μ3, σ², data) ~ N(M4, S4)
Post(σ² | μ1, μ2, μ3, μ4, data) ~ IG(a, b)

The Gibbs sampler:
1. Assign starting values
2. Sample μ1 from its conditional distribution
3. Sample μ2 from its conditional distribution
4. Sample μ3 from its conditional distribution
5. Sample μ4 from its conditional distribution
6. Sample σ² from its conditional distribution
7. Go to step 2, until enough iterations have been run

Iteration    μ1      μ2      μ3      μ4      σ²
1            3.00    5.00    8.00    3.00    10
2            3.75    4.25    7.00    4.30    8
3            3.65    4.11    6.78    5.55    5
...          ...     ...     ...     ...     ...
15           4.45    3.19    5.08    6.55    1.1
...          ...     ...     ...     ...     ...
199          4.59    3.75    5.21    6.36    1.2
200          4.36    3.45    4.65    6.99    1.3
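The recipe above is easy to express in code. Below is a minimal sketch of this Gibbs sampler in Python/NumPy; the data, group sizes, and starting values are illustrative assumptions, and the conditional distributions follow from the conjugate Normal and Inverse-Gamma priors just specified.

```python
# Minimal Gibbs sampler for the four-group ANOVA model above.
# The data and starting values are illustrative; the conditionals follow
# from the Nor(0, 10000) priors on the means and the IG(0.001, 0.001)
# prior on the common variance.
import numpy as np

rng = np.random.default_rng(1)
y = [rng.normal(m, 2.0, size=25) for m in (4.0, 3.5, 5.0, 6.5)]  # 4 groups

mu0, var0 = 0.0, 10000.0      # Nor(mu0, var0) prior for each mu_j
a0, b0 = 0.001, 0.001         # IG(shape, scale) prior for sigma^2
n_total = sum(len(g) for g in y)

mu, sigma2 = np.zeros(4), 10.0            # step 1: starting values
draws = np.empty((200, 5))

for t in range(200):
    for j in range(4):                    # steps 2-5: mu_j | rest ~ N(M_j, S_j)
        prec = len(y[j]) / sigma2 + 1.0 / var0
        M = (y[j].sum() / sigma2 + mu0 / var0) / prec
        mu[j] = rng.normal(M, np.sqrt(1.0 / prec))
    sse = sum(((y[j] - mu[j]) ** 2).sum() for j in range(4))
    # step 6: sigma^2 | rest ~ IG(a0 + n/2, b0 + SSE/2)
    sigma2 = 1.0 / rng.gamma(a0 + n_total / 2, 1.0 / (b0 + sse / 2))
    draws[t] = (*mu, sigma2)              # one row of the table above

post = draws[100:]                        # omit burn-in (first half)
print(post.mean(axis=0).round(2))                         # posterior means
print(np.percentile(post, [2.5, 97.5], axis=0).round(2))  # 95% intervals
```

Each pass through the loop produces one row like those in the table; the trace plots discussed next are simply draws[:, k] plotted against the iteration number.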
Trace plots

[Figures: trace plots of the chain, showing in turn the starting value, the start of a stable pattern, the stable pattern itself, the chain with the burn-in omitted, and finally a kernel density plot of the resulting posterior distribution.]

Burn-in

The Gibbs sampler must run t 'burn-in' iterations before it reaches the target distribution f(Z).
– How many iterations are needed to converge on the target distribution?
Diagnostics:
– examine a graph of the burn-in;
– try different starting values;
– run several chains in parallel.

Convergence

[Figures: a series of trace plots illustrating chains that do and do not converge.]

What went wrong? Mplus automatically monitors convergence using the potential scale reduction factor (PSRF). The PSRF criterion essentially requires the between-chain variation to be small relative to the total of the between- and within-chain variation. The PSRF is checked every 100th iteration for each parameter separately, and the highest value is displayed on your computer screen while Mplus is running.

[Figures: further trace plots with the accompanying convergence diagnostics.]

Conclusions about convergence

– Burn-in: Mplus deletes the first half of each chain.
– Run multiple chains (the Mplus default is 2).
– The PSR statistic compares the variance within chains to the variance between chains and must be close to 1.
– Decrease BCONVERGENCE: the default is .05, but .01 is better.
– ALWAYS do a graphical evaluation of each and every parameter.

Summing up

Probability: degree of belief.
Prior: what is known before observing the data.
Posterior: what is known after observing the data.
Informative prior: a tool to include subjective knowledge.
Non-informative prior: tries to express the absence of prior knowledge; the posterior is mainly determined by the data.
MCMC methods: simulation (sampling) techniques to obtain the posterior distribution and all posterior summary measures.
Convergence: important to check.

Uncertainty in Classical Statistics

Uncertainty = the sampling distribution:
– estimate the population parameter θ by θ̂;
– imagine drawing an infinity of samples;
– the sampling distribution is the distribution of θ̂ over these samples.
The problem is that we have only one sample:
– estimate θ̂ and its sampling distribution;
– estimate the 95% confidence interval.

Inference in Classical Statistics

What does a 95% confidence interval actually mean?
– Over an infinity of samples, 95% of these intervals contain the true population value.
– But we have only one sample.
– We never know whether our present estimate θ̂ and its confidence interval belong to those 95% or not.

What does a 95% confidence interval NOT mean?
– It does NOT mean that we have a 95% probability that the true population value is within the limits of our confidence interval.
– We only have the aggregate assurance that in the long run 95% of our confidence intervals contain the true population value.

Uncertainty in Bayesian Statistics

Uncertainty = a probability distribution for the population parameter. In classical statistics the population parameter has one single true value; in Bayesian statistics we imagine a distribution of possible values of the population parameter.

Inference in Bayesian Statistics

What does a 95% central credibility interval mean? We have a 95% probability that the population value is within the limits of our credibility interval.
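To make the contrast concrete, here is a small simulation (a sketch; the population values are illustrative assumptions) of the frequentist claim: across repeated samples roughly 95% of the 95% confidence intervals contain the true value, yet any single interval either does or does not.

```python
# Repeated-sampling check of the 95% confidence interval interpretation.
import numpy as np

rng = np.random.default_rng(0)
true_mu, sigma, n, reps = 100.0, 15.0, 20, 10_000

hits = 0
for _ in range(reps):
    sample = rng.normal(true_mu, sigma, size=n)
    se = sample.std(ddof=1) / np.sqrt(n)
    lo, hi = sample.mean() - 1.96 * se, sample.mean() + 1.96 * se
    hits += lo <= true_mu <= hi                # did THIS interval catch mu?
print(hits / reps)   # near 0.95 (a bit less, since 1.96 ignores the t correction)
```

For any one of those ten thousand intervals, the statement "the true value is inside with probability 95%" is not meaningful in the frequentist framework; it is exactly the statement the Bayesian credibility interval does license.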
Software

WinBUGS/OpenBUGS    Bayesian inference Using Gibbs Sampling; very general, but the user must set up the model
R packages          LearnBayes, R2WinBUGS, MCMCpack
MLwiN               Special implementation for multilevel regression
AMOS                Special implementation for SEM
Mplus               Very general (SEM + ML + many other models)

MPLUS - ML

DATA: FILE IS data.dat;
VARIABLE: NAMES ARE x1-x5 y1-y5;
ANALYSIS: ESTIMATOR IS ML;
MODEL: y1-y5 ON x1-x5;

MPLUS - BAYES

DATA: FILE IS data.dat;
VARIABLE: NAMES ARE x1-x5 y1-y5;
ANALYSIS: ESTIMATOR IS BAYES;
MODEL: y1-y5 ON x1-x5;

MPLUS - BAYES

– Single-level, multilevel, and mixture models
– Continuous and categorical outcomes
– Default non-informative priors or user-specified informative priors (MODEL PRIORS)
– Multiple chains using parallel processing (CHAINS)
– Convergence assessment using Gelman-Rubin potential scale reduction factors
– Posterior parameter distributions with means, medians, modes, and credibility intervals (POINT)
– Posterior parameter trace plots
– Autocorrelation plots
– Posterior predictive checking plots

MPLUS - BAYES: ANALYSIS options

– POINT (MEAN, MEDIAN, MODE; default is MEDIAN)
– CHAINS (default is 2), combined with PROCESSORS = 32
– BSEED
– STVALUES (= ML, PERTURBED, UNPERTURBED)
– MEDIATOR (OBSERVED, LATENT; default is LATENT)
– ALGORITHM (GIBBS, MH; default is GIBBS)
– BCONVERGENCE (related to the PSR)
– BITERATIONS (to go beyond 50K iterations)
– FBITERATIONS (fixed number of iterations)
– THIN (every kth iteration recorded; default is 1)
– DISTRIBUTION (how many iterations are used for MODE)

MPLUS - BAYES: MODEL PRIOR

– Informative priors are used to incorporate prior knowledge.
– Flexible in Mplus, but (Win)BUGS is more flexible and offers more tools.
– In models with small variance parameters and small samples the posterior is often sensitive to the choice of prior, especially for the variance estimates.
– Do a sensitivity analysis (try different priors).

Default priors:
– Intercepts, regression slopes, loadings: N(0, infinity), unless these parameters are in a probit regression, in which case N(0, 5) is used.
– Variances: IG(-1, 0).
– Covariance matrices: IW(0, -p-1), unless the elements include parameters from a probit regression, in which case IW(I, p+1) is used.
– Thresholds: N(0, infinity).
– Class proportions: Dirichlet prior D(10, 10, ..., 10).

Let's Analyze IQ

N = 20; the data are generated with mean = 102 and SD = 15.

[Figures: the generated IQ data and the priors used below.]

Prior type   Software   Prior variance            Posterior mean IQ   95% C.I./C.C.I.
ML           –          –                         102.00              94.42 - 109.57
Prior 1      A          –                         101.99              94.35 - 109.62
Prior 2a     M or A     large variance, SD=100    101.99              94.40 - 109.42
Prior 2b     M or A     medium variance, SD=10    101.99              94.89 - 109.07
Prior 2c     M or A     small variance, SD=1      102.00              100.12 - 103.87
Prior 3      A          –                         102.03              94.22 - 109.71
Prior 4      W          medium variance, SD=10    102.00              97.76 - 106.80
Prior 5      W          small variance, SD=1      102.00              100.20 - 103.90
Prior 6a     W          large variance, SD=100    99.37               92.47 - 106.10
Prior 6b     W          medium variance, SD=10    86.56               80.17 - 92.47

(Software: A = AMOS, M = Mplus, W = WinBUGS.)
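The pattern in the table can be reproduced, approximately, with the conjugate normal-normal update. The sketch below treats the data variance as known, which the analyses in the table do not, so the intervals only roughly track the tabled ones; the specific prior means used are assumptions for illustration.

```python
# Conjugate normal-normal update for the IQ example (data variance
# treated as known for simplicity).
import math

def posterior(prior_mean, prior_sd, data_mean=102.0, data_sd=15.0, n=20):
    se2 = data_sd ** 2 / n                     # squared standard error
    prec = 1.0 / se2 + 1.0 / prior_sd ** 2     # posterior precision
    mean = (data_mean / se2 + prior_mean / prior_sd ** 2) / prec
    sd = math.sqrt(1.0 / prec)
    return round(mean, 2), (round(mean - 1.96 * sd, 2),
                            round(mean + 1.96 * sd, 2))

# Accurate prior mean with shrinking prior SD: the interval narrows while
# the posterior mean stays near 102 (compare priors 2a-2c above).
for prior_sd in (100, 10, 1):
    print(prior_sd, posterior(102, prior_sd))

# A wrong prior mean (85 is an assumed value) with a small SD drags the
# posterior away from the data (compare priors 6a-6b above).
print(posterior(85, 1))
```

With an accurate prior mean and SD = 1 this reproduces the narrow 100.12 - 103.87 interval of prior 2c almost exactly; the wrong-mean run lands in the mid-80s, as in the last row of the table.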
An example where Bayesian analysis is an improvement

The intervention program ATLAS (Adolescent Training and Learning to Avoid Steroids) was administered to high school football players to prevent the use of anabolic steroids. Data are from 861 high school football players. Example from Bengt Muthén, Bayesian Analysis In Mplus: A Brief Introduction (incomplete draft, Version 3, May 17, 2010). Data from MacKinnon, D.P., Lockwood, C.M., & Williams, J. (2004). Confidence limits for the indirect effect: Distribution of the product and resampling methods. Multivariate Behavioral Research, 39, 99-128.

ATLAS example, continued

The focus is on the indirect effect of the intervention on nutrition via the perceived severity of using steroids. The ML estimate is 0.02 (SE = 0.011), p = 0.056, not significant.

ATLAS example, setup

DATA: file = mbr2004atlas.dat;
VARIABLE: names = obs group severity nutrit;
  usevariables = group - nutrit;
ANALYSIS: estimator = bayes;
  processors = 2;
  chains = 4;
  bconvergence = 0.01;
MODEL: severity on group (a);
  nutrit on severity (b) group;
MODEL CONSTRAINT: new(indirect);
  indirect = a*b;    ! with Bayes there is currently no INDIRECT or VIA command
OUTPUT: tech8 standardized;
PLOT: type = plot2;

ATLAS example, continued

MODEL RESULTS (default settings)

                            Estimate   Posterior S.D.   One-Tailed P-Value   95% C.I. (Lower 2.5% - Upper 2.5%)
New/Additional Parameters
  INDIRECT                  0.018      0.012            0.010                0.002 - 0.045

MODEL RESULTS (chains = 4; bconvergence = 0.01)

  INDIRECT                  0.019      0.011            0.011                0.002 - 0.043

Why is the Bayes estimate significant and the ML estimate not? The sampling distribution of a product of two coefficients is skewed, so the symmetric normal approximation behind the ML p-value fits poorly; the posterior distribution captures that skewness directly.

Two-level CFA, Sibling Data

37 families, 187 children; scores on 6 intelligence tests. Multilevel structure: children are nested within families.
– Example from Hox, Multilevel Analysis, 1st Ed., 2002.
– Problematic because of the small family-level sample size.
Source: Van Peet, A.A.J. (1992). De potentieeltheorie van intelligentie [The potentiality theory of intelligence]. Unpublished Ph.D. thesis, University of Amsterdam.

[Figure: path diagram of the sibling-data CFA — one general factor at the between (family) level and two factors, numeric and perceptual, at the within (child) level, each measured by the six tests.]

ML estimation, 2-level CFA

TITLE: two level factor analysis Van Peet data, using ML estimators;
DATA: FILE IS ggkind.dat;
VARIABLE: NAMES ARE famnr wordlist cards figures matrices animals occup;
  CLUSTER IS famnr;
ANALYSIS: TYPE IS TWOLEVEL;
  ESTIMATOR IS ML;
MODEL: %BETWEEN%
  general BY wordlist cards figures matrices animals occup;
  %WITHIN%
  numeric BY wordlist cards matrices;
  percept BY occup animals figures;
OUTPUT: SAMPSTAT STANDARDIZED;

ML estimation, 2-level CFA: MODEL RESULTS

Residual Variances   Estimate   S.E.    Est./S.E.   Two-Tailed P-Value
WORDLIST             1.598      1.323   1.208       0.227
CARDS                3.871      1.769   2.188       0.029
FIGURES              2.315      1.496   1.548       0.122
MATRICES             -0.160     0.673   -0.237      0.813
ANIMALS              1.085      1.400   0.775       0.438
OCCUP                5.705      1.988   2.870       0.004

Note the inadmissible negative residual variance for MATRICES.

Bayes estimation, 2-level CFA

The same input as above, with ESTIMATOR IS Bayes instead of ESTIMATOR IS ML.

Bayes estimation, 2-level CFA: MODEL RESULTS

Residual Variances   Estimate   Posterior S.D.   One-Tailed P-Value   95% C.I. (Lower 2.5% - Upper 2.5%)
WORDLIST             4.618      2.235            0.000                1.322 - 9.820
CARDS                3.991      2.369            0.000                0.434 - 9.468
FIGURES              2.150      1.625            0.000                0.201 - 6.299
MATRICES             0.712      0.669            0.000                0.101 - 2.604
ANIMALS              3.104      1.891            0.000                0.676 - 7.825
OCCUP                6.389      2.511            0.000                2.888 - 12.652

All residual variances, including the one for MATRICES, are now admissible.
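The columns of the Bayesian MODEL RESULTS above can be computed directly from the posterior draws. Below is a sketch, assuming the Mplus conventions stated earlier (point estimate = posterior median by default; one-tailed p-value taken as the share of draws on the far side of zero); the draws themselves are simulated for illustration.

```python
# Posterior summaries in the style of the Bayesian MODEL RESULTS table.
import numpy as np

def summarize(draws):
    draws = np.asarray(draws)
    est = np.median(draws)                      # POINT default is median
    sd = draws.std(ddof=1)                      # posterior S.D.
    p = min((draws <= 0).mean(), (draws >= 0).mean())  # one-tailed p-value
    lo, hi = np.percentile(draws, [2.5, 97.5])  # 95% credibility interval
    return (round(est, 3), round(sd, 3), round(p, 3),
            (round(lo, 3), round(hi, 3)))

# A variance parameter sampled from an inverse-gamma posterior is positive
# by construction, so its one-tailed p-value is 0.000, as in the table.
rng = np.random.default_rng(0)
variance_draws = 1.0 / rng.gamma(3.0, 1.0, size=5000)
print(summarize(variance_draws))
```

Because a variance can never be negative under such a prior, the Bayesian estimate cannot produce the Heywood case that ML ran into for MATRICES.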
What have we learned so far?

Results are a compromise of prior & data. However, pay attention to:
– non/low-informative priors;
– informative priors;
– misspecification of the prior;
– convergence.
Results are easier to communicate (e.g., the CCI compared to the confidence interval).

Missing Data

Yet another advantage: missing data are easily handled within the Bayesian framework.
– Bayesian estimation is used to create several data sets in which the missing values have been imputed.
– The multiple-imputation data sets can be used for subsequent model estimation using ML or WLSMV.
– The imputed data sets can be saved for subsequent analysis, or the analysis can be carried out in the same run.

The Gibbs sampler
1. Assign starting values
2. Sample μ1 from its conditional distribution
3. Sample μ2 from its conditional distribution
4. Sample μ3 from its conditional distribution
5. Sample μ4 from its conditional distribution
6. Sample σ² from its conditional distribution
7. Go to step 2 until enough iterations

The Gibbs sampler: now including missing data

Data: y(obs) and y(mis).
1. Assign starting values, including y(mis)
2.-6. Sample μ1, ..., μ4 and σ² from their conditional distributions, as above
7. Sample y(mis) from y(obs) and the current parameter values
8. Go to step 2 until enough iterations
(A code sketch of this extended sampler appears at the very end of this document.)

Missing data: an example (n = 467)

1. Listwise deletion
2. FIML
3. FIML with auxiliary variables
4. Bayes
5. Bayes with auxiliary variables

Inadmissible parameters

WARNING: THE LATENT VARIABLE COVARIANCE MATRIX (PSI) IS NOT POSITIVE DEFINITE. THIS COULD INDICATE A NEGATIVE VARIANCE/RESIDUAL VARIANCE FOR A LATENT VARIABLE, A CORRELATION GREATER OR EQUAL TO ONE BETWEEN TWO LATENT VARIABLES, OR A LINEAR DEPENDENCY AMONG MORE THAN TWO LATENT VARIABLES. CHECK THE TECH4 OUTPUT FOR MORE INFORMATION.

Here the prior is the default inverse gamma with shape a = -1 and scale b = 0.

[Figure: two-level model with between-level factors fb1 and fb2 and within-level factors fw1 and fw2, each measured by y1-y8; the between-level factor correlation is r = 1.05, an inadmissible value.]

[Figure: the same model re-estimated with a different prior; the between-level factor correlation is now r = .94.]

More on priors

– Specifying the type of distribution, e.g., normal versus inverse chi-square.
– Non-informative versus informative priors.

Conclusions I/II

• An excellent tool to include prior knowledge, if available
• Estimates (including intervals) always lie in the parameter space if the prior is chosen wisely
• Results are easier to communicate
• Missing data are easily handled
• BUT: Bayes does not solve misspecification of the model

Conclusions II/II

• Better small-sample performance; large-sample theory is not needed
• Analyses can be made less computationally demanding: Bayesian analysis helps when ML is slow or intractable due to many dimensions of numerical integration
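As a closing illustration, here is a sketch of the extended Gibbs sampler for missing data described above, reduced to a single mean for brevity; the data, the flat prior on the mean, and the IG(0.001, 0.001) prior on the variance are illustrative assumptions.

```python
# Gibbs sampler with a data-augmentation step: y(mis) is sampled from the
# model given the current parameter values (step 7 above).
import numpy as np

rng = np.random.default_rng(2)
y = np.array([4.2, np.nan, 5.1, 3.8, np.nan, 4.6])   # np.nan marks y(mis)
miss = np.isnan(y)
n = len(y)

y[miss] = y[~miss].mean()      # step 1: starting values, including y(mis)
mu, sigma2 = 0.0, 10.0
keep = []

for t in range(2000):
    # steps 2-6 (collapsed to one mean here): sample the parameters given
    # the completed data, with a flat prior on mu and IG(0.001, 0.001)
    # on sigma^2
    mu = rng.normal(y.mean(), np.sqrt(sigma2 / n))
    sigma2 = 1.0 / rng.gamma(0.001 + n / 2,
                             1.0 / (0.001 + ((y - mu) ** 2).sum() / 2))
    # step 7: sample y(mis) from y(obs) and the current parameter values
    y[miss] = rng.normal(mu, np.sqrt(sigma2), size=miss.sum())
    keep.append(mu)

print(np.mean(keep[1000:]))    # posterior mean of mu after burn-in
```

The imputation step is a single extra line; running it inside the sampler is what lets Bayesian estimation handle the missing values, or save multiply imputed data sets for a subsequent ML or WLSMV analysis.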