Structural Equation Modeling Using Mplus
Chongming Yang
Research Support Center, FHSS College

Structural?
  Structuralism
  Components
  Relations

Objectives
  Introduction to SEM
    The model
    Parameters
    Estimation
    Model evaluation
    Applications
  Estimate simple models with Mplus

Continuous Dependent Variables
  Session I

Information of Variable
  Mean
  Variance
  Skewness
  Kurtosis

Variance & Covariance
  $V = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
  $Cov = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$

Covariance Matrix (S)
        x1      x2      x3
  x1    V1
  x2    Cov21   V2
  x3    Cov31   Cov32   V3

Statistical Model
  Probabilistic statement about relations of variables
  Imperfect but useful representation of reality

Structural Equation Modeling
  A system of regression equations for latent variables to estimate and test direct and indirect effects without the influence of measurement errors.
  To estimate and test theories about interrelations among observed and latent variables.

Latent Variable (Construct / Factor / Trait)
  A hypothetical variable
    cannot be measured directly
    no objective measurement unit
    inferred from observable manifestations
  Multiple manifestations (indicators)
  Normally distributed interval dimension

How is Depression Distributed in?
  BYU students
  Patients for therapy

Normal Distributions
  [Figure: normal distribution curves]

Levels of Analyses
  Observed
  Latent

Test Theories
  Classical True Score Theory: Observed Score = True Score + Error
  Item Response Theory
  Generalizability
  (Raykov & Marcoulides, 2006)

Graphic Symbols of SEM
  Rectangle: observed variable
  Oval: latent variable or error
  Single-headed arrow: causal relation
  Double-headed arrow: correlation

Graphic Measurement Model of Latent $\xi$
  [Figure: latent variable $\xi$ measured by X1, X2, X3 with loadings $\lambda_1$, $\lambda_2$, $\lambda_3$ and errors $\delta_1$, $\delta_2$, $\delta_3$]

Equations
  Specific equations
    $X_1 = \lambda_1 \xi + \delta_1$
    $X_2 = \lambda_2 \xi + \delta_2$
    $X_3 = \lambda_3 \xi + \delta_3$
  Matrix symbols
    $X = \Lambda \xi + \delta$
  True Score Theory?
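To make the measurement equations concrete, the following minimal sketch computes the variances and covariances that a one-factor model implies for its indicators, assuming mutually uncorrelated errors. The function name and all numeric values are hypothetical and chosen only for illustration; they do not come from the slides.

```python
def implied_covariance(loadings, factor_var, error_vars):
    """Covariance matrix implied by X_i = lambda_i * xi + delta_i.

    Assumes a single factor xi and errors that are uncorrelated
    with each other and with the factor.
    """
    p = len(loadings)
    # Common part: lambda_i * lambda_j * Var(xi) for every pair (i, j).
    sigma = [[loadings[i] * loadings[j] * factor_var for j in range(p)]
             for i in range(p)]
    # Unique part: the error variance is added on the diagonal only.
    for i in range(p):
        sigma[i][i] += error_vars[i]
    return sigma


# Hypothetical parameter values (not estimates from any real data).
loadings = [1.0, 0.5, 2.0]      # lambda_1 fixed at 1 (reference indicator)
factor_var = 4.0                # variance of xi
error_vars = [1.0, 2.0, 3.0]    # variances of delta_1..delta_3

sigma = implied_covariance(loadings, factor_var, error_vars)
for row in sigma:
    print(row)
```

Each diagonal entry is the squared loading times the factor variance plus the error variance, and each off-diagonal entry depends only on the two loadings and the factor variance.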
Relations of Variances
  $V_{X_1} = \lambda_1^2 \phi + \theta_1$
  $V_{X_2} = \lambda_2^2 \phi + \theta_2$
  $V_{X_3} = \lambda_3^2 \phi + \theta_3$
  $\phi$ = variance of $\xi$
  $\theta$ = measurement error / uniqueness

Unknown Parameters
  $V_{X_1} = \lambda_1^2 \phi + \theta_1$
  $V_{X_2} = \lambda_2^2 \phi + \theta_2$
  $V_{X_3} = \lambda_3^2 \phi + \theta_3$

Sample Covariance Matrix (S)
        x1      x2      x3
  x1    V1
  x2    Cov21   V2
  x3    Cov31   Cov32   V3

Variance of $\xi$
  Variance of $\xi$ = common covariance of X1, X2 and X3
  Variance of $\delta$ = diag($\theta_1$, $\theta_2$, $\theta_3$), with off-diagonal elements fixed at 0

Unstandardized Parameterization (scaling)
  $\lambda_1 = 1$ (gives $\xi$ the metric of X1; X1 is called the reference indicator)
  Variance of $\xi$ = common variance of X1, X2 and X3
  Squared $\lambda$ = explained variance of X ($R^2$)
  Variance of $\delta$ = unexplained variance (error)
  Total variance = squared $\lambda$ + variance of $\delta$

Just Identified Model
  [Figure: $\xi$ measured by X1, X2, X3 with loadings $\lambda_1$, $\lambda_2$, $\lambda_3$ and errors $\delta_1$, $\delta_2$, $\delta_3$]

Reference Indicator (marker)
  Choose conceptually the best
  Small variance: non-convergence
  Different markers: different parameter estimates and standard errors
  Affects measurement invariance tests
  Does not affect standardized estimates

Standardized Parameterization (scaling)
  Variance of $\xi$ = 1 = common variance of X1, X2 and X3
  Squared $\lambda$ = explained variance of X ($R^2$)
  Variance of $\delta$ = $1 - \lambda^2$
  Mean of $\xi$ = 0
  Mean of $\delta$ = 0

Two Kinds of Parameters
  Fixed at 0, 1, or other values
  Freely estimated

[Figure: example path diagram. Latent variables: General Intelligence, Emotional Intelligence, Personality, Job Satisfaction, Marital Satisfaction. Observed variables: Analytic, Reasoning, Verbal, Self Control, Recognize/Assess, Social Relations, Perceived Benefit, Perceived Cost, Being Appreciated, Agreeableness, Openness. Errors d1-d7 and e1-e4; disturbances z1, z2.]

Structural Equation Model in Matrix Symbols
  $X = \Lambda_x \xi + \delta$ (exogenous)
  $Y = \Lambda_y \eta + \varepsilon$ (endogenous)
  $\eta = B\eta + \Gamma\xi + \zeta$ (structural model)
  Note: The measurement model reflects the true score theory.

Structural Equation Model in Matrix Symbols
  $X = \tau_x + \Lambda_x \xi + \delta$ (measurement)
  $Y = \tau_y + \Lambda_y \eta + \varepsilon$ (measurement)
  $\eta = \alpha + B\eta + \Gamma\xi + \zeta$ (structural)
  Note: SEM with mean structure.

Model Implied Covariance Matrix ($\Sigma$)
  Note: This covariance matrix contains unknown parameters in the equations.
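For reference, the implied covariance matrix of the model without mean structure is usually written in the following standard form. This is a sketch under stated assumptions: $\Phi$ is the covariance matrix of $\xi$, $\Psi$ the covariance matrix of $\zeta$, and $\Theta_\varepsilon$, $\Theta_\delta$ the error covariance matrices, with errors and disturbances uncorrelated with the factors. These symbols follow common LISREL-type notation and are assumed rather than taken from the slide.

```latex
\Sigma(\theta) =
\begin{bmatrix}
\Lambda_y (I - B)^{-1} (\Gamma \Phi \Gamma' + \Psi) \left[(I - B)^{-1}\right]' \Lambda_y' + \Theta_\varepsilon
  & \Lambda_y (I - B)^{-1} \Gamma \Phi \Lambda_x' \\
\Lambda_x \Phi \Gamma' \left[(I - B)^{-1}\right]' \Lambda_y'
  & \Lambda_x \Phi \Lambda_x' + \Theta_\delta
\end{bmatrix}
```

The upper-left block is the implied covariance matrix of Y, the lower-right block that of X, and the off-diagonal blocks hold the covariances between Y and X.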
$(I - B)$ = non-singular

Estimations / Fit Functions
  Hypothesis: $\Sigma = S$ or $\Sigma - S = 0$
  Maximum Likelihood
    $F = \log|\Sigma| + \mathrm{trace}(S\Sigma^{-1}) - \log|S| - (p + q)$
    (here $p + q$ is the total number of observed Y and X variables)

Convergence: Reaching Limit
  Minimize F while adjusting the unknown parameters through an iterative process
  Convergence value: F difference between the last two iterations
  Default convergence = .0001
  Increase to help convergence (0.001 or 0.01), e.g.
    Analysis: convergence = .01;

No Convergence
  No unique parameter estimates
  Lack of degrees of freedom: under-identification
  Variance of reference indicator too small
  Fixed parameters are left to be freely estimated
  Misspecified model

Absolute Fit Index
  $\chi^2 = F(N - 1)$ (N = sample size)
  $df = p(p+1)/2 - q$
    p = number of observed variables, so $p(p+1)/2$ counts the variances and covariances (plus the means when a mean structure is modeled)
    q = number of unknown parameters to be estimated
  prob = ? (A nonsignificant $\chi^2$ indicates good fit. Why?)

Sample Information
        x1      x2      x3      x4      …
  x1    v1
  x2    cov21   v2
  x3    cov31   cov32   v3
  x4    cov41   cov42   cov43   v4
  …
        Mean1   Mean2   Mean3   Mean4   …
  Total info = P(P+1)/2 + Means

Absolute Fit: SRMR
  Standardized Root Mean Square Residual
  SRMR = difference between observed and implied covariances in standardized metric
  Desirable when < .08, but no consensus

Relative Fit: Relative to Baseline (Null) Model
  All unknown relational parameters are fixed at 0
  Variables not related
  Model implied covariance = 0
  Fit to sample covariance matrix S
  Obtain $\chi^2$, df, prob (typically close to .0000)

Relative Fit Indices
  $CFI = 1 - (\chi^2 - df) / (\chi^2_b - df_b)$
    b = baseline model
    Comparative Fit Index, desirable >= .95 (95% better than the baseline model)
  $TLI = (\chi^2_b/df_b - \chi^2/df) / (\chi^2_b/df_b - 1)$
    Tucker-Lewis Index, desirable >= .90
  $RMSEA = \sqrt{(\chi^2 - df) / (n \cdot df)}$
    Root Mean Square Error of Approximation, desirable <= .06; penalizes a large model with more unknown parameters

Special Case A
  [Figure: Sex predicting Verbal Aggression (indicators t4a3, t4a93, t4a94) and Physical Aggression (indicators t4a37, t4a57, t4a90); errors e1-e6, disturbances d1, d2]

Special Case A
  Assumption: $x = \xi$
  $y = \tau_y + \Lambda_y \eta + \varepsilon$
  $\eta = \alpha + \Gamma x + \zeta$

Special Case B
  [Figure: Verbal Aggression (indicators x1-x3, errors e1-e3) and Physical Aggression (indicators x4-x6, errors e4-e6) predicting Peer Status (disturbance d)]

Special Case B
  Assumption:
  $y = \eta$
  $x = \tau_x + \Lambda_x \xi + \delta$
  $y = \alpha + \Gamma\xi + \zeta$

Other Special Cases of SEM
  Confirmatory Factor Analysis (measurement model only)
  Multiple & Multivariate Regression
  ANOVA / MANOVA (multigroup CFA)
  ANCOVA
  Path Analysis Model (no latent variables)
  Simultaneous Econometric Equations…
  Growth Curve Modeling
  …

EFA vs. CFA
  [Figure: Exploratory Factor Analysis vs. Confirmatory Factor Analysis, each with Factor 1 and Factor 2, indicators x1-x6 and errors e1-e6]

Multiple Regression
  [Figure: x1, x2, x3 predicting Y with error e]

ANCOVA
  [Figure: Group, Pretest1 and Pretest2 predicting Posttest1 and Posttest2, with errors e1, e2]

Multivariate Normality Assumption
  Observed data are summed up perfectly by the covariance matrix S (+ means M); S thus is an estimator of the population covariance

Consequences of Violation
  Inflated $\chi^2$ & deflated CFI and TLI: reject plausible models
  Inflated standard errors: attenuate factor loadings and relations of latent variables (structural parameters)
  (Cause: sample covariances were underestimated)

Accommodating Strategies
  Correcting fit and correcting standard errors
    Satorra-Bentler scaled $\chi^2$ & standard errors (estimator = mlm; in Mplus)
    Bootstrapping
  Transforming nonnormal variables
    Transforming into new normal indicators (undesirable)
  SEM with Categorical Variables

Satorra-Bentler Scaled $\chi^2$ & SE
  S-B $\chi^2$ = $d^{-1}$ (ML-based $\chi^2$)
    (d = scaling factor that incorporates kurtosis)
  Effect: performs well with continuous data in terms of $\chi^2$, CFI, TLI, RMSEA, parameter estimates and standard errors.
  Also works with certain categorical variables (see next slide)
  Analysis: estimator = MLM;

Workable Categorical Data
  [Figure: distribution of a categorical variable with five response categories]

Nonworkable Categorical Data
  [Figure: distribution of a categorical variable with three response categories]

Bootstrapping (resampling of data)
  Original    btstrp1    btstrp2    …
  x   y       x   y      x   y
  1   5       5   3      1   3
  2   4       1   1      5   4
  3   3       3   2      4   1
  4   2       4   5      2   2
  5   1       2   4      3   5
  .   .       .   .      .   .
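The resampling idea can be sketched in a few lines. This is a rough illustration, not what Mplus does internally: it assumes whole cases (rows) are drawn with replacement, uses the five (x, y) cases from the "Original" column above, and builds a simple percentile interval for the mean of x. The function names, the seed, and the choice of statistic are all hypothetical.

```python
import random


def bootstrap_samples(rows, n_samples, seed=0):
    """Draw n_samples data sets by resampling cases with replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(rows)
    return [[rows[rng.randrange(n)] for _ in range(n)]
            for _ in range(n_samples)]


def mean(values):
    return sum(values) / len(values)


# The (x, y) cases shown in the "Original" column.
original = [(1, 5), (2, 4), (3, 3), (4, 2), (5, 1)]

# 500 draws, mirroring the Bootstrap = 500 setting used in this deck.
samples = bootstrap_samples(original, n_samples=500)

# Statistic of interest in each bootstrap sample: here, the mean of x.
x_means = sorted(mean([x for x, _ in s]) for s in samples)

# Simple percentile 95% interval from the sorted bootstrap statistics.
lower = x_means[int(0.025 * len(x_means))]
upper = x_means[int(0.975 * len(x_means)) - 1]
print(lower, upper)
```

In an SEM application the statistic computed per sample would be a parameter estimate or an indirect effect rather than a mean, but the resampling loop has the same shape.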
Limitation of Bootstrapping
  Assumption: Sample = Population
  Useful diagnostic tool
  Does not compensate for
    small or unrepresentative samples
    severely non-normal data
    absence of independent samples for cross-validation
  Analysis: Bootstrap = 500 (standard/residual);
  Output: stand cinterval;

Mplus
  www.statmodel.com
  Multiple programs integrated
    SEM of both continuous and categorical variables
    Multilevel modeling
    Mixture modeling (identify hidden groups)
    Complex survey data modeling (stratification, clustering, weights)
    Modern missing data treatment
    Monte Carlo simulations

Types of Mplus Files
  Data (*.dat, *.txt)
  Input (specify a model, <= 80 columns/line)
  Output (automatically produced)
  Plot (automatically produced)

Data File Format
  Free
    Delimited by tab, space, or comma
    All missing values must be flagged with special numbers / symbols
    Default in Mplus
    Computationally slow with large data sets
  Fixed
    Format = 3F3, 5F3.2, F5.1;

Mplus Input
  DATA: File = ?
  VARIABLE: Names = ?; Usevar = ?; Categ = ?;
  ANALYSIS: Type = ?
  MODEL: (BY, ON, WITH)
  OUTPUT: Stand;

Model Specification in Mplus
  BY      Measured by (F by x1 x2 x3 x4)
  ON      Regressed on (y on x)
  WITH    Correlated with (x with y)
  XWITH   Interact with (inter | F1 xwith F2)
  PON     Pair ON (y1 y2 pon x1 x2 = y1 on x1; y2 on x2)
  PWITH   Pair WITH (x1 x2 pwith y1 y2 = x1 with y1; x2 with y2)

Default Specification
  Error or residual (disturbance)
  Covariance of exogenous variables in CFA
  Certain covariances of residuals (z1, z2)

Graphic Model
  [Figure: path diagram with factors F1 (y1-y3), F2 (y4-y6), F3 (y7-y9), F4 (y10-y12), F5 (y13-y15) and disturbances d3, d4, d5]

Model Specification
  Model:
    f1 by y1-y3;
    f2 by y4-y6;
    f3 by y7-y9;
    f4 by y10-y12;
    f5 by y13-y15;
    f3 on f1 f2;
    f4 on f2;
    f5 on f2 f3 f4;
  MeaErrors are au…

Practice
  Prepare two data files for Mplus
    Mediation.sav
    Aggress.sav
  Model specification
    Single-group CFA
    Examine mediation effects in a full SEM
    Run a MIMIC model of aggressions
    Multigroup CFA to examine measurement invariance

SPSS Data
  Missing values?
    Leave blank to use fixed format
    Recode into a special number to use free format
  Save as & choose file type
    Fixed ASCII
    Free *.dat (with or without variable names?)
  Copy & paste variable names into the Mplus input file

Mplus Interface
  Activate the Mplus program
  Language Generator
  Manually create an input file

Four Separate Files (Mplus)
  Data      best prepared with other programs
  Input     a model must be specified manually
  Output    automatic output window
  Graph     automatic graph file

Data File
  Individual case data (*.dat or *.txt)
    Free format (default)
      Variables separated by tab, comma, or space
      All missing values must be flagged with special symbols or numbers
    Fixed format
      Each variable takes fixed space, e.g. 2F2, 4F6, 5F6.3
      Missing values can be left blank
  Summary data
    Variance-covariance matrix, means
    Correlation matrix, standard deviations, means

SPSS to Mplus
  Open "Antisocial.sav" with SPSS
  Work in the Variable window
  Option 1: Fixed format
    Change format to simplify
    Save as ? (Type = Fixed ASCII)
  Option 2: Free format
    Recode missing values
    Save as ?
    (Tab-delimited)

Fixed Format
  F3 4F3.2 25F1
    F3      one variable that takes 3 columns
    4F3.2   4 variables, each taking 3 columns, with 2 decimal places
    25F1    25 variables, each using one column

Copy SPSS Variable Names into Mplus
  Menu: Utilities, Variables
  Highlight to select variables
  Paste
  Go to the Syntax window
  Select & copy
  Paste under Names Are in the Mplus input file
  Practice now

SAS to Mplus
  Assign flags to missing values (use Array code for many variables)
  Proc Export Data = Data File
    Outfile = "Mplus input file folder\*.dat"
    DBMS = dlm Replace;
  Run;
  Practice

Fixed Format Out of SAS
  Open with SPSS
  Save as fixed format
  Practice

Stata2mplus
  Converts a Stata data file to *.dat
  Find out: http://www.ats.ucla.edu/stat/stata/faq/stata2mplus.htm

Modification Indices
  Lower-bound estimate of the expected chi-square decrease
  from freely estimating a parameter fixed at 0
  Mplus Output: stand Mod(10);
  Start with the least important parameters (covariance of errors)
  Caution: justification?

Indirect (Mediation) Effect
  A*B
  Mplus specification:
    Model Indirect: DV IND Mediator IV;

Model Comparison
  Model:
    Probabilistic statement about the relations of variables
    Imperfect but useful
  Models differ:
    Different variables and different relations
    Same variables but different relations

Nested Model
  A nested model (b) comes from a general model (a) by
    Removing a parameter (e.g. a path)
    Fixing a parameter at a value (e.g. 0)
    Constraining a parameter to be equal to another
  Both models have the same variables

Test If A = B
  [Figure: the five-factor path model (F1-F5, y1-y15, d3-d5) with two paths labeled A and B]

Model Comparison via $\chi^2$ Difference
  $\chi^2$ =        df =        (Nested model)
  $\chi^2$ =        df =        (Default model)
  ___________________________________
  $\chi^2_{dif}$ =        $df_{dif}$ =        p = ? (a single tail)
  Find the p value at the following website:
  http://www.tutor-homework.com/statistics_tables/statistics_tables.html
  Conclusion: If p > .05, there is no significant difference in fit between the default model and the nested model.
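As an alternative to a lookup table, the single-tail p value of the difference test can be computed directly. The sketch below uses only the Python standard library and evaluates the chi-square upper tail through the regularized incomplete gamma function (series expansion for small values, continued fraction otherwise). The function names and the example chi-square values are hypothetical. Simple differencing of this kind applies to ordinary ML chi-squares; scaled statistics such as those from MLM or WLSMV need a corrected difference test instead.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(chi-square with df degrees of freedom > x)."""
    if x <= 0:
        return 1.0
    a, half_x = df / 2.0, x / 2.0
    log_prefactor = -half_x + a * math.log(half_x) - math.lgamma(a)
    if half_x < a + 1.0:
        # Series expansion of the lower regularized incomplete gamma function.
        term = total = 1.0 / a
        n = a
        while abs(term) > abs(total) * 1e-15:
            n += 1.0
            term *= half_x / n
            total += term
        return 1.0 - total * math.exp(log_prefactor)
    # Continued fraction (modified Lentz method) for the upper function.
    tiny = 1e-300
    b = half_x + 1.0 - a
    c = 1.0 / tiny
    d = 1.0 / b
    h = d
    for i in range(1, 10000):
        an = -i * (i - a)
        b += 2.0
        d = an * d + b
        if abs(d) < tiny:
            d = tiny
        c = b + an / c
        if abs(c) < tiny:
            c = tiny
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < 1e-15:
            break
    return h * math.exp(log_prefactor)


def chi2_difference_test(chi2_nested, df_nested, chi2_default, df_default):
    """Return (chi2_dif, df_dif, p) for a nested vs. default model comparison."""
    chi2_dif = chi2_nested - chi2_default
    df_dif = df_nested - df_default
    return chi2_dif, df_dif, chi2_sf(chi2_dif, df_dif)


# Hypothetical fit results, for illustration only.
chi2_dif, df_dif, p = chi2_difference_test(85.3, 41, 80.1, 40)
print(round(chi2_dif, 2), df_dif, round(p, 4))
```

With these made-up numbers the difference is about 5.2 on 1 degree of freedom, so p falls below .05 and the constrained model would fit significantly worse.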
  Put differently, the hypothesis that the constrained parameters of the two models are equal is not rejected; if p < .05, that hypothesis is not supported.

Practice
  Test if effect A = B

Equality Constraints in Mplus
  Parameter labels:
    Numbers
    Letters
    Combination of numbers and letters
  Constraint (B = A)
    F3 on F1 (A);
    F3 on F2 (A);

Run CFA with Real Data
  [Figure: Verbal Aggression (indicators a3, a93, a94; errors e1-e3) and Physical Aggression (indicators a37, a57, a90; errors e4-e6)]

Multigroup Analysis
  VARIABLE: USEVAR = X1 X2 X3 X4;
            Grouping IS sex (0=F 1=M);
  ANALYSIS: TYPE = MISSING H1;
  MODEL:    F1 BY X1 - X4;
  MODEL M:  F1 BY X2 - X4;
  Note: sex is the grouping variable and is not used in the model.

Why Does Measurement Invariance Matter?
  $X_{g1} = \tau_{g1} + \lambda_{g1}\xi_{g1} + \delta_{g1}$
  $X_{g2} = \tau_{g2} + \lambda_{g2}\xi_{g2} + \delta_{g2}$
  $X_{g1} - X_{g2} = (\tau_{g1} - \tau_{g2}) + (\lambda_{g1}\xi_{g1} - \lambda_{g2}\xi_{g2}) + (\delta_{g1} - \delta_{g2})$
  Under invariance (equal $\tau$ and $\lambda$ across groups, errors with mean 0), the expected difference reduces to $\lambda(\xi_{g1} - \xi_{g2})$, a difference on the latent dimension only.

Test Measurement Invariance: Default Model
  Model:
    F1 By a3
          a93 (1)
          a94 (2);
    F2 By a37
          a57 (3)
          a90 (4);
  Model M:
    F1 By a93 a94;
    F2 By a57 a90;
  Output: stand;
  Note: The loadings in the second group carry no labels and are freely estimated. Reference indicators in the second group are omitted.

Test Measurement Invariance: Constrained Model
  Model:
    F1 By a3
          a93 (1)
          a94 (2);
    F2 By a37
          a57 (3)
          a90 (4);
  Model M:
    F1 By a93 (1)
          a94 (2);
    F2 By a57 (3)
          a90 (4);
  Output: stand;
  Note: Reference indicators in the second group are omitted.

Estimate with Real Data
  [Figure: Sex, Race1 and Race2 predicting Verbal Aggression (a3, a93, a94) and Physical Aggression (a37, a57, a90); errors e1-e6, disturbances d1, d2]

SEM with Categorical Indicators
  Session II

Problems of Ordinal Scales
  Not truly an interval measure of a latent dimension; has measurement errors
  Limited range, biased against extreme scores
  Items are equally weighted (implicitly by 1) when summed up or averaged, losing item sensitivity

Criticisms on Using Ordinal Scales as Measures of Latent Constructs
  Stevens (1951): …means should be avoided because its meaning could be easily interpreted beyond ranks.
  Merbitz (1989): Ordinal scales and foundations of
misinference
  Muthen (1983): Pearson product moment correlations of ordinal scales will produce distorted results in structural equation modeling.
  Wright (1998): “…misuses nonlinear raw scores or Likert scales as though they were linear measures will produce systematically distorted results. …It’s not only unfair, it is immoral.”

Assumption of Categorical Indicators
  A categorical indicator is a coarse categorization of a normally distributed underlying dimension

Latent (Polychoric) Correlation

Categorization of Latent Dimension & Threshold
  [Figure: a normally distributed latent dimension cut by thresholds into response categories: No / Yes for a binary item; Never, Sometimes, Often on a 1-5 scale for an ordinal item Y]

Threshold
  The values of a latent dimension at which respondents have a 50% probability of responding in either of two adjacent categories
  Number of thresholds = response categories - 1,
  e.g. a binary variable has one threshold.
  Mplus specification: [x$1] [y$2];

Normal Cumulative Distributions
  [Figure: normal cumulative distribution curves]

Measurement Models of Categorical Indicators (2P IRT)
  Probit: $P(u = 1 \mid \eta) = \Phi[(-\tau + \lambda\eta)\,\theta^{-1/2}]$
    (Estimation = weighted least squares with mean- and variance-adjusted $\chi^2$)
  Logistic: $P(u = 1 \mid \eta) = 1 / (1 + e^{-(-\tau + \lambda\eta)})$
    (Maximum likelihood estimation)

Converting CFA to IRT Parameters
  Probit conversion
    $a = \lambda\theta^{-1/2}$
    $b = \tau / \lambda$
  Logit conversion
    $a = \lambda / D$
    $b = \tau / \lambda$
    (D = 1.7)

One Parameter Item Response Theory Model
  Analysis: Estimator = ML;
  Model: F by X1@1.7 X2@1.7 … Xn@1.7;

Sample Information
  Latent correlation matrix: equivalent to the covariance matrix of continuous indicators
  Threshold matrix Δ: equivalent to the means of continuous indicators

Stages of Estimation
  Sample information: correlations / thresholds / intercepts (maximum likelihood)
  Correlation structure (weighted least squares)
  $F = \sum_{g=1}^{G} (s^{(g)} - \sigma^{(g)})' \, W^{(g)-1} (s^{(g)} - \sigma^{(g)})$

$W^{-1}$ Matrix
  Elements:
    S1  intercepts and/or thresholds
    S2  slopes
    S3  residual variances and correlations
  $W^{-1}$: divided by sample size

Estimation
  WLSMV: weighted least squares estimation, with a $\chi^2$ test statistic that is mean- and variance-adjusted

Baseline Model
  Estimated thresholds of all the categorical indicators
  df
  = $p^2 - 3p$ (p = number of polychoric correlations)

Data Preparation Tip
  Categorical indicators are required to have consistent response categories across groups
  Run Crosstab to identify zero cells
  Recode variables to collapse certain categories to eliminate zero cells

Inconsistent Categories
  Before recoding:
            1    2    3    4    5
  Male      60   80   43   4    0
  Female    57   86   32   16   2
  After collapsing categories 4 and 5:
            1    2    3    4
  Male      60   80   43   4
  Female    57   86   32   18

Specify Dependent Variables as Categorical
  Variable:
    Categ = x1-x3;
    Categ = all;

Reporting Results
  Guidelines:
    Conceptual model
    Software + version
    Data (continuous or categorical?)
    Treatment of missing values
    Estimation method
    Model fit indices ($\chi^2$(df), p, CFI, TLI, RMSEA)
    Measurement properties (factor loadings + reliability)
    Structural parameter estimates (estimate, significance, 95% confidence intervals)
      ($\beta$ = .23*, CI = .18~.28)

Reliability of Categorical Indicators (variance approach)
  $\omega = (\sum \lambda_i)^2 / [(\sum \lambda_i)^2 + \sum \psi_i^2]$, where
    $(\sum \lambda_i)^2$ = square of the sum of standardized factor loadings
    $\sum \psi_i^2$ = sum of residual variances
    i = items or indicators
    $\psi_i^2 = 1 - \lambda_i^2$
  McDonald, R. P. (1999). Test theory: A unified treatment (p. 89). Mahwah, New Jersey: Lawrence Erlbaum Associates.

Calculator of Reliability (Categorical Indicators)
  SPSS reliability data
  SPSS reliability syntax

Troubleshooting Strategy
  Start with one part of a big model
  Ensure every part works
  Estimate all parts simultaneously

Important Resources
  Mplus website: www.statmodel.com
  Papers: http://www.statmodel.com/papers.shtml
  Mplus discussions: http://www.statmodel.com/cgi-bin/discus/discus.cgi
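The reliability formula given under "Reliability of Categorical Indicators" can be evaluated with a short script instead of a spreadsheet-style calculator. This is a minimal sketch assuming standardized loadings from a one-factor model, so that each residual variance equals one minus the squared loading. The function name and the example loadings are hypothetical.

```python
def omega_reliability(std_loadings):
    """Reliability from standardized loadings (variance approach).

    Assumes each residual variance equals 1 - loading**2, i.e. the
    indicators are standardized and load on a single factor.
    """
    squared_sum = sum(std_loadings) ** 2
    residual_sum = sum(1.0 - loading ** 2 for loading in std_loadings)
    return squared_sum / (squared_sum + residual_sum)


# Hypothetical standardized loadings, for illustration only.
example_loadings = [0.8, 0.7, 0.6]
reliability = omega_reliability(example_loadings)
print(round(reliability, 3))
```

Higher and more uniform loadings push the value toward 1, because the residual sum in the denominator shrinks relative to the squared sum of loadings.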