Tate Center Lecture Series
Brooks Applegate, EMR
3/10/2014

Often the analysis focuses on covariances, so it is referred to as Covariance Structure Modeling or Structural Regression Models
Often it involves unobserved or latent variables, so it is referred to as Latent Variable Modeling
Often SEM models test/estimate causal effects (theoretically modeled), so it is referred to as Causal Modeling

About 100 years ago Spearman put down the foundation of what is today called the Common Factor Model (CFM) and the techniques to statistically analyze it: Factor Analysis (FA)
FA is one of the most frequently used multivariate statistical techniques in use today
CFM & FA study the variance structures within a group
The fundamental intent of FA is to determine the number and nature of the latent variables that account for the variation/covariation in a larger set of observed variables

The Common Factor Model postulates that each indicator in a set of observed measures is a linear function of one or more common factors and one unique factor
The FA task is to partition the variance of each indicator into
  Common variance
  Unique variance
There are two broad FA families that do this
  Exploratory FA (EFA)
  Confirmatory FA (CFA)

[Two path diagrams: observed indicators X1 to X7, each with a unique factor δ, and latent factors ξ1 and ξ2]

In path analysis (PA), the researcher specifies a model that attempts to explain why X and Y (and other observed variables) covary
Part of the explanation about why two variables covary may include presumed causal effects (e.g., X causes Y)
Other parts of the explanation may reflect presumed noncausal relations, such as a spurious association
The overall goal in PA is to estimate causal versus noncausal aspects of observed covariances

To reasonably infer that X is a cause of Y, all of the following conditions must be met:
1. There is time precedence, that is, X precedes Y in time
2. The direction of the causal relation is correctly specified, that is, X causes Y instead of the reverse or that X and Y cause each other
3. The association between X and Y does not disappear when external variables, such as common causes of both, are held constant (i.e., it is not spurious)
It is very unlikely that all these conditions would be satisfied in a single study

The assessment of variables at different times at least provides a measurement framework consistent with the specification of directional causal effects
Longitudinal designs pose many potential difficulties, such as subject attrition and the need for additional resources
When the variables are concurrently measured, it is not possible to demonstrate time precedence
Therefore, when all variables are measured at the same time, the researcher needs a very clear, substantive rationale for specifying that X causes Y instead of the reverse or that X and Y mutually influence each other

[Path diagram: observed variables X1, X2, Y1, Y2, Y3]

The basic foundation was laid down in the 1970s
SEM generally became accessible to researchers in the 1980s
New developments (statistical estimation theory, numerical analysis & desktop computing power) have made this family accessible to researchers who have not had extensive training in applied statistics and measurement

SEM can be thought of as a generalization or extension of:
  ANOVA (DOE)
  Regression
  Principal Factor Analysis
It can model multilevel data
Provided an appropriate link function, it can accommodate non-linear response data
  Item Response Models
  Growth Mixture Models

SEM fits (and tests) a priori models to data OR fits a posteriori models to data
Models employ both estimation and hypothesis testing
Models require the explicit representation of observed (indicator) and unobserved (latent)
variables
Models can be applied to experimental and non-experimental data
Models can be exploratory, confirmatory or a mixture of both (Joreskog, 1993)
  Strictly confirmatory
  Alternative models
  Model-generating

Path Models
  All observed indicator variables
  Mediation & moderation analysis
  Cause and effects
Measurement Models (EFA, E/CFA & CFA)
  Definition, structure and relationships of the latent factors
Structural Regression Models
  Integration of path models (depicting relations among latent factors) together with the CFA measurement models
Special models
  Latent Growth Models (longitudinal multilevel models)
  Latent Class Models
  Item Response Models (IRT)

Whole and Parts

Symbol            Interpretation
X                 Observed exogenous variable
Y                 Observed endogenous variable
D                 Unobserved exogenous variable (i.e., a disturbance)
[arrow symbol]    Variance of exogenous variable
[arrow symbol]    Covariance between a pair of exogenous variables
[arrow symbol]    Presumed direct causal effect (e.g., X → Y)
[arrow symbol]    Presumed reciprocal causal effects (e.g., Y1 ⇄ Y2)

[Path diagram: indicators x1 to x5 and y1 to y10, latent factors ξ1, ξ2, η1, η2, η3, and disturbances D]

[Path diagram: indicators x1 to x5 and y1 to y10, with X1, X2, Y1, Y2, Y3]

[Path diagram: X1, X2, Y1, Y2, Y3]

1. Specify the model (where is your theory?)
2. Establish that the model is identified
3. Prepare/screen the data/variables
4. Estimate the model
   1. Examine model fit
   2. Interpret
   3. Consider alternative/equivalent models
5. Re-specify the model
6. Write it up accurately
7. Replicate (cross validate) your model
8. Apply the results

[Flowchart of the modeling steps: specification, identification, data acquisition & preparation, estimation, fit evaluation, interpretation & reporting]

Combining path models with latent variables and their measurement components into Structural Regression (SR) Models

In a path model the disturbance factors of the endogenous variables reflect both
  Measurement error
  All omitted causes
In a CFA model measurement errors are moved to the unique variances of the observed indicator variables
  There is no counterpart to omitted causes
In a SEM model
  Measurement errors are reflected in the measurement model
  Omitted causes are reflected in the disturbance factors of the endogenous latent variables

Exogenous factors are uncorrelated with the disturbances of the endogenous factors
Measurement errors are uncorrelated

Unit Loading Identification (ULI)
  Used to scale disturbances and measurement errors
  Common software default
  Generally not a problem unless there are only 2 indicator variables for a factor and the model has equality constraints involving the other indicator (a constraint interaction)
Unit Variance Identification (UVI)
  Common for scaling exogenous variables

Disturbances & measurement errors are typically assigned a scale through unit loading identification (ULI) constraints
Exogenous factors are typically scaled either by ULI (the loading of one indicator per factor is fixed to 1.0), which leaves the factor unstandardized, or by UVI (the variance of the factor is fixed to 1.0), which standardizes it
Common SEM software limits the scaling of the endogenous factors to ULI only (thus treating them as unstandardized)

Basically the same issues and considerations
as CFA & Path models
Start with the number of observations, which depends on the number of variables (not the sample size): v(v+1)/2, where v = # of observed variables
Need to have a just-identified or over-identified model
Count the number of variances and covariances of all the exogenous variables (measurement errors, disturbances and exogenous factors)
Count the number of direct effects on endogenous variables (factor loadings, direct effects on endogenous factors from other factors)
These counts together (the free parameters) must not exceed v(v+1)/2

The 2-Step rule is a sufficient condition for SEM identification
1. The measurement model must be identified
2. If the structural model is recursive, the full model is identified
   If the model is nonrecursive, things are a bit more complicated!
Empirical underidentification
  Created when there is substantial model misspecification

Do it all at once
  Probably not recommended
  If the overall fit is good: GOOD JOB
  If the overall fit is poor, what to do?
    Is the poor fit due to a poor measurement model?
    Is the poor fit due to a poor structural model?

Based on the recognition that the structural part of the SEM is actually nested under the more general (correlated) CFA model
1. Respecify the full SEM model as a measurement (CFA) model and estimate (and fix!?!)
   If the measurement model adequately fits, move to Step 2
   If the measurement model is a poor fit, then the structural model will be just as poor or worse
2. Place constraints on the structural part of the CFA model to bring it into line with your structural model
   Now consider alternative structural models
   If alternative structural models dramatically affect the measurement model portion of the SEM, the measurement model is not invariant
   This results in interpretational confounding, that is, the meaning/interpretation of the measurement model is a function of the structural model

Expansion of the 2-step process
Requirement: Each latent factor has at least 4 indicator variables
1. E/CFA (mixed methods)
2. CFA with all latent factors freely correlated (step 1 in the 2-step approach)
3. Begin to place constraints on the structural portion of the model
4. Incremental or sequential tests of a priori hypotheses
Steps 3 & 4 are really incremental refinement of the structural part of the model from the general CFA to the end SEM

SAS/CALIS
IBM SPSS Amos (21.0.0) (Arbuckle & Wothke)
  Bundled with SPSS
  http://www-142.ibm.com/software/products/us/en/spssamos/
  AMOS 21.0.0 stand alone ($1590.00)
LISREL (9.1) (Joreskog & Sorbom) $495.00
  http://www.ssicentral.com
  Free student versions
EQS (6.2) for Windows (Bentler) $595.00
  http://www.mvsoft.com
Mplus-7 (Muthen & Muthen) ($595.00)
  https://www.statmodel.com
  Demo version available
  Add-ons for multilevel and mixture models
Many open source programs, e.g. R

Structural Regression Model

A research question that focuses on the regression of Y on X (e.g., do principal experience(s) predict school building health?)
A survey is constructed with 4 items all theoretically related to a latent construct X (principal experiences)
And 3 different items theoretically related to a latent construct Y (school health)

Typically a researcher derives an X-composite variable and a Y-composite variable
  X = (x1+x2+x3+x4)
  Y = (y1+y2+y3)
Then regresses Y on X

[Path diagram: X with indicators x1 to x4 and Y with indicators y1 to y3; parameters marked 1 or *]

Understanding the Measurement Model

Consider the regression of Y on X
[Diagram: Y and X]
Expressed as a path model, the regression of Y on X is diagrammed
[Path diagram: X and Y]

Modeling Measurement Error in X (measurement error variance is 0.019)
[Path diagram: Y, X and FX; loading 1.; error variance 0.019]
FX is the True Score on X

Modeling Measurement Errors in both X & Y
[Path diagram: X, Y, FX and FY; loadings 1.; error variances 0.019 (X) and 0.022 (Y)]
FY is the True Score on Y
FX is the True Score on X

Confirmatory Factor Analysis

CFA is a type of SEM that deals specifically with measurement models
  The relationships between observed variables (indicators) and latent variables (factors)
CFA models are hypothesis driven
CFA has become one of the most popular techniques/methods used in applied social and health science research

Psychometric evaluation of “test” instruments
Construct validation
Measurement invariance
Investigation of method effects

[Path diagram: observed indicators X1 to X7, each with a unique factor δ, and latent factors ξ1 and ξ2]

Missing data, Non-normality, & Categorical Data

Conduct a Missing Data Analysis

MCAR (missing completely at random)
  Probability of missing on Y is unrelated to Y and all other variables in the data set
MAR (missing at random)
  When the probability of missing on Y depends on one (or more) X variables (not related to Y when X is held constant)
NMAR (missing not at random)
  When missingness is related to values that could have been observed
Planned missingness (ignorable missing)

Consult your favorite statistician

Listwise deletion
  Loss of power
  If missingness is MCAR
    Estimates are consistent (unbiased)
    Usually not efficient (large std errors)
  If missingness is MAR
    Estimates may be neither consistent nor efficient
Pairwise deletion
  If you use a covariance or correlation matrix (you will probably lie)
  Matrix may not be positive definite (so it may not be invertible)
  MCAR
    Consistent estimates (in large samples)
    Biased std errors
  MAR
    Estimates and std errors are biased

Mean substitution, LVCF (last value carried forward), regression substitution
  Tends to underestimate
  variances & std errors but overestimate correlations
EM imputation
  Missingness must be MCAR or MAR + multivariate normal
  Std errors are consistent
Multiple imputation (3 step process)
  Generate m (5 is enough) imputed data sets (typically EM + MCMC)
  Analyze the m parallel data sets individually
  Combine the analyses
    SAS PROC MIANALYZE

Preferred choice for handling missing data
MCAR or MAR + multivariate normality
MLM estimator (robust ML) can be used with non-normal data
Available in Mplus, LISREL, Amos, Mx, SAS
  SPSS(?)

ML & GLS estimators are robust to minor departures from multivariate normality
When non-normality is pronounced, don’t use ML or GLS estimators
  ML is very sensitive to high kurtosis
  Inflated χ2
  Modest underestimation of fit indices
  Moderate to severe underestimation of std errors
  All bad things get much worse in small samples

Weighted Least Squares (WLS)
  Requires very large samples
Robust ML (best choice)
  Typically requires raw data
  SB χ2 (Satorra-Bentler scaled χ2)
  Cannot be used the same way for testing nested models: the difference test must be adjusted
There is quite a bit of variation among the popular SEM programs, so read and get familiar with the tool you use

Don’t use an ML estimator
  Produces attenuated correlation estimates if there is a floor or ceiling effect
  Produces “pseudofactors” that are artifacts of item difficulty or extremeness
  Produces incorrect test statistics and std errors
  Possibly produces incorrect parameter estimates if there is a floor or ceiling effect

WLS
  Weight matrix requires a large sample
  Assume 10 indicators: b = (10*11)/2 = 55, so the W matrix has (b*(b+1))/2 = (55*56)/2 = 1540 unique elements
  Skewness can aggravate a small sample size problem
Robust WLS (WLSMV in Mplus)
  Appears to be the best choice for categorical data
  But still requires large samples
ULS
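
The counting arithmetic used in the slides (v(v+1)/2 observations for identification, and the 55 and 1540 figures for the WLS weight matrix with 10 indicators) can be checked with a few lines of Python. This is only an illustrative sketch: the function names are hypothetical and do not come from any SEM package, and model_df implements only the counting rule (observations minus free parameters), which is a necessary but not a sufficient condition for identification.

```python
def n_unique_moments(v: int) -> int:
    """Unique variances and covariances among v observed variables: v(v+1)/2."""
    if v < 1:
        raise ValueError("v must be a positive integer")
    return v * (v + 1) // 2


def wls_weight_elements(v: int) -> int:
    """Unique elements of the WLS weight matrix W, which is b x b with b = v(v+1)/2."""
    b = n_unique_moments(v)
    return b * (b + 1) // 2


def model_df(v: int, free_parameters: int) -> int:
    """Counting rule only: unique moments minus freely estimated parameters."""
    return n_unique_moments(v) - free_parameters


if __name__ == "__main__":
    # Show how quickly the weight matrix grows with the number of indicators.
    for v in (7, 10, 20):
        print(v, n_unique_moments(v), wls_weight_elements(v))
```

For the 10-indicator case this reproduces b = 55 and 1540 weight-matrix elements. As a further illustration, the survey example with 4 x items and 3 y items has 7 observed variables and therefore 28 observations; assuming ULI scaling and 15 free parameters (5 loadings, 7 error variances, 1 factor variance, 1 disturbance variance and 1 path, a count made here for illustration and not stated in the slides), model_df(7, 15) gives 13 degrees of freedom.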