Tate Center Lecture Series Brooks Applegate, EMR 3/10/2014




Often the analysis focuses on covariances, so it is referred to as Covariance Structure Modeling or Structural Regression Models
Often it involves unobserved or latent variables, so it is referred to as Latent Variable Modeling
Often SEM models test/estimate causal effects (theoretically modeled), so it is referred to as Causal Modeling
3/10/14
Tate Center Series
2

About 100 years ago Spearman put down the foundation of what is today called the Common Factor Model (CFM) and the technique used to statistically analyze it: Factor Analysis (FA)
FA is one of the most frequently used multivariate statistical techniques today.
 CFM & FA study the variance structures within a group.
 The fundamental intent of FA is to determine the number and nature of the latent variables that account for the variation/covariation in a larger set of observed variables


Postulates that each indicator in a set of
observed measures is a linear function of one
or more common factors and one unique factor.

The task of FA is to partition the variance of each indicator into
 Common variance
 Unique variance

There are two broad FA families that do this
 Exploratory FA (EFA)
 Confirmatory FA (CFA)
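The variance partition described above can be sketched in a few lines. This is only an illustration: the loadings are hypothetical (not from the lecture), the indicators are assumed to be standardized, and the two common factors are assumed to be uncorrelated, so that each indicator's common variance is the sum of its squared loadings and the unique variance is the remainder.

```python
# Hypothetical standardized loadings of 4 indicators on 2 uncorrelated common factors.
loadings = {
    "X1": (0.8, 0.0),
    "X2": (0.7, 0.0),
    "X3": (0.0, 0.6),
    "X4": (0.3, 0.5),
}


def partition_variance(row):
    """Split a standardized indicator's variance (1.0) into common and unique parts."""
    common = sum(lam ** 2 for lam in row)  # communality: sum of squared loadings
    unique = 1.0 - common                  # uniqueness: remainder of the total variance
    return common, unique


partition = {name: partition_variance(row) for name, row in loadings.items()}

for name, (common, unique) in partition.items():
    print(f"{name}: common = {common:.2f}, unique = {unique:.2f}")
```

With correlated factors the common variance would also include cross-product terms involving the factor correlation, which this sketch leaves out.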
[Figures (two slides): path diagrams of a two-factor model with latent factors ξ1 and ξ2, observed indicators X1 to X7, and a unique factor δ for each indicator]

In path analysis (PA), the researcher specifies a model that attempts to explain why X and Y (and other observed variables) covary
 Part of the explanation about why two variables covary may include presumed causal effects (e.g., X causes Y)
 Other parts of the explanation may reflect presumed noncausal relations, such as a spurious association
The overall goal in PA is to estimate causal versus noncausal aspects of observed covariances

To reasonably infer that X is a cause of Y, all of the following conditions must be met:
1. There is time precedence, that is, X precedes Y in time
2. The direction of the causal relation is correctly specified, that is, X causes Y instead of the reverse or that X and Y cause each other
3. The association between X and Y does not disappear when external variables, such as common causes of both, are held constant (i.e., it is not spurious)
It is very unlikely that all these conditions would be satisfied in a single study
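The third condition (the association survives when a common cause is held constant) can be illustrated with a first-order partial correlation. This is a minimal sketch with made-up correlations; Z stands for a hypothetical common cause of X and Y, and the function name is only illustrative.

```python
import math


def partial_correlation(r_xy, r_xz, r_yz):
    """First-order partial correlation of X and Y, holding Z constant."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Case 1: the X-Y correlation equals r_xz * r_yz, so it is fully explained by Z.
spurious = partial_correlation(r_xy=0.30, r_xz=0.6, r_yz=0.5)

# Case 2: the X-Y correlation is larger than Z alone can account for.
surviving = partial_correlation(r_xy=0.50, r_xz=0.6, r_yz=0.5)

print(f"r_xy = 0.30 -> partial r = {spurious:.3f}")
print(f"r_xy = 0.50 -> partial r = {surviving:.3f}")
```

In the first case the association vanishes once Z is controlled, which is what a spurious association looks like; in the second case part of it remains. A nonzero partial correlation is still only consistent with a causal effect, since the other two conditions must hold as well.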




The assessment of variables at different times at least provides a measurement framework consistent with the specification of directional causal effects
Longitudinal designs pose many potential difficulties, such as subject attrition and the need for additional resources
When the variables are concurrently measured, it is not possible to demonstrate time precedence
Therefore, the researcher needs a very clear, substantive rationale for specifying that X causes Y instead of the reverse or that X and Y mutually influence each other when all variables are measured at the same time
[Figure: path diagram with observed variables X1, X2, Y1, Y2, and Y3]



The basic foundation was laid down in the 1970s
Generally became accessible to researchers in the 1980s
New developments (statistical estimation theory, numerical analysis & desktop computing power) have made this family of techniques accessible to researchers who have not had extensive training in applied statistics and measurement

SEM can be generally thought of as a generalization or extension of:
 ANOVA (DOE)
 Regression
 Principal Factor Analysis
It can model multilevel data
Provided an appropriate link function, it can accommodate non-linear response data
 Item Response Models
 Growth Mixture Models

SEM fits (and tests) a priori models to data OR fits a posteriori models to data
Models employ both estimation and hypothesis testing
Models require the explicit representation of observed (indicator) and unobserved (latent) variables
Models can be applied to experimental and nonexperimental data
Models can be exploratory, confirmatory or a mixture of both (Jöreskog, 1993)
 Strictly confirmatory
 Alternative models
 Model-generating

Path Models
 All observed indicator variables
 Mediation & moderation analysis
 Cause and effects
Measurement Models (EFA, E/CFA & CFA)
 Definition, structure and relationships of the latent factors
Structural Regression Models
 Integration of path models (depicting relations among latent factors) together with the CFA measurement models
Special models
 Latent Growth Models (longitudinal multilevel models)
 Latent Class Models
 Item Response Models (IRT)
Whole and Parts

Symbol: Interpretation
X: Observed exogenous variable
Y: Observed endogenous variable
D: Unobserved exogenous variable (i.e., a disturbance)
[symbol not recovered]: Variance of exogenous variable
[symbol not recovered]: Covariance between a pair of exogenous variables
→: Presumed direct causal effect (e.g., X → Y)
⇄: Presumed reciprocal causal effects (e.g., Y1 ⇄ Y2)
[Figure: full structural regression model diagram with latent factors ξ1, ξ2, η1, η2, and η3, indicators x1 to x5 and y1 to y10, and disturbances D on the endogenous factors]
[Figure: the same diagram with the latent variables labeled X1, X2, Y1, Y2, and Y3, and indicators x1 to x5 and y1 to y10]
[Figure: path diagram among X1, X2, Y1, Y2, and Y3 only]
1. Specify the model – where is your theory?
2. Establish that the model is identified
3. Prepare/screen the data/variables
4. Estimate the model
 1. Examine model fit
 2. Interpret
 3. Consider alternative/equivalent models
5. Re-specify the model
6. Write it up accurately
7. Replicate (cross validate) your model
8. Apply the results
[Figure: flowchart of the SEM process with boxes for specification, identification, data acquisition & preparation, estimation, fit evaluation, interpretation & reporting, and a return to specification; paths are labeled a, b1, and b2]
Combining path model with latent variables
and their measurement components into
Structural Regression (SR) Models

In a path model the disturbance factors of the endogenous variables reflect both
 Measurement error
 All omitted causes
In a CFA model measurement errors are moved to the unique variances of the observed indicator variables
 There is no counterpart to omitted causes
In a SEM model
 Measurement errors are reflected in the measurement model
 Omitted causes are reflected in the disturbance factors of the endogenous latent variables

Exogenous factors are uncorrelated with the disturbances of the endogenous factors
Measurement errors are uncorrelated

Unit Loading Identification (ULI)
 Used to scale disturbance and measurement errors
 Common software default
 Generally not a problem unless there are only 2 indicator variables for a factor and the model has equality constraints involving the other indicator (a constraint interaction)
Unit Variance Identification (UVI)
 Common for scaling exogenous variables



Disturbances & measurement errors are typically assigned a scale through unit loading identification (ULI) constraints
Exogenous variables are typically scaled either by ULI (one indicator loading per factor is fixed to 1.0, so the factor is unstandardized) or by UVI (the variance of the factor is fixed to 1.0, so the factor is standardized)
Common SEM software limits the scaling of the
endogenous factors to only ULI (thus treating
them unstandardized)
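To see why the two scaling choices are interchangeable for a single factor, the sketch below rescales a hypothetical ULI solution into its UVI equivalent and compares the common-factor part of the implied covariances. The loadings, the factor variance, and the helper name are made up for illustration.

```python
import math


def implied_common_cov(loadings, factor_variance):
    """Common-factor part of the implied covariance matrix: lambda_i * lambda_j * phi."""
    return [[li * lj * factor_variance for lj in loadings] for li in loadings]


# Hypothetical ULI solution: first loading fixed to 1.0, factor variance estimated.
uli_loadings = [1.0, 0.8, 1.2]
uli_variance = 4.0

# Equivalent UVI solution: factor variance fixed to 1.0, loadings rescaled by sqrt(phi).
scale = math.sqrt(uli_variance)
uvi_loadings = [lam * scale for lam in uli_loadings]
uvi_variance = 1.0

uli_cov = implied_common_cov(uli_loadings, uli_variance)
uvi_cov = implied_common_cov(uvi_loadings, uvi_variance)

# The two parameterizations imply the same covariances among the indicators.
same = all(
    abs(a - b) < 1e-9
    for row_a, row_b in zip(uli_cov, uvi_cov)
    for a, b in zip(row_a, row_b)
)
print("UVI loadings:", uvi_loadings)
print("Implied covariances match:", same)
```

Only the metric of the factor changes; the fit of the model does not, apart from the constraint-interaction cases mentioned earlier.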

Basically the same issues and considerations as CFA & Path models
Start with the number of variables (not the sample size)
 v(v+1)/2, where v = # of observed variables
Need to have a just-identified or over-identified model
 Count the number of variances and covariances of all the exogenous variables (measurement errors, disturbances and exogenous factors)
 Count the number of direct effects on endogenous variables (factor loadings, direct effects on endogenous factors from other factors)
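The counting rule above can be written out as a small helper. The example counts are an assumption made for illustration: a model with two factors, X (4 indicators) and Y (3 indicators), one structural path from X to Y, uncorrelated measurement errors, and one loading per factor fixed to 1.0 (ULI).

```python
def observations(v):
    """Number of unique variances and covariances among v observed variables."""
    return v * (v + 1) // 2


def model_df(v, n_variances_covariances, n_direct_effects):
    """Degrees of freedom: observations minus freely estimated parameters."""
    free_parameters = n_variances_covariances + n_direct_effects
    return observations(v) - free_parameters


# Variances: 7 measurement errors + 1 exogenous factor + 1 disturbance = 9
# Direct effects: 5 free loadings (7 minus 2 fixed to 1.0) + 1 structural path = 6
df = model_df(v=7, n_variances_covariances=9, n_direct_effects=6)

print("observations:", observations(7))
print("model df:", df)
```

A nonnegative df is only a necessary condition; it does not by itself establish that the model is identified.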

The 2-Step rule is a sufficient condition for SEM identification
1. The measurement model must be identified
2. If the structural model is recursive the full model is identified
If the model is nonrecursive things are a bit more complicated!
Empirical underidentification
 Created when there is substantial model misspecification

Do it all at once



Probably not recommended
If the overall fit is good - GOOD JOB
If the overall fit is poor what to do?
 Is the poor fit due to a poor measurement model?
 Is the poor fit due to a poor structural model?
Based on the recognition that the structural part of the SEM is actually nested under the more general (correlated) CFA model
1. Respecify the full SEM model as a measurement (CFA) model and estimate (and fix!?!)
 If the measurement model adequately fits move to Step 2
 If the measurement model is a poor fit then the structural model will be just as poor or worse
2. Place constraints on the structural part of the CFA model to bring it into line with your structural model
Now consider alternative structural models
 If alternative structural models dramatically affect the measurement model portion of the SEM, the measurement model is not invariant
 This results in interpretational confounding – that is, the meaning/interpretation of the measurement model is a function of the structural model

Expansion of the 2-step process
Requirement: Each latent factor has at least 4 indicator variables (mixed methods)
1. E/CFA
2. CFA with all latent factors freely correlated (step 1 in the 2-step approach)
3. Begin to place constraints on the structural portion of the model
4. Incremental or sequential tests of a priori hypotheses
Steps 3 & 4 are really incremental refinement of the structural part of the model from the general CFA to the end SEM


SAS/CALIS
IBM SPSS Amos (21.0.0) (Arbuckle & Wothke) bundled with SPSS
 http://www-142.ibm.com/software/products/us/en/spssamos/
 AMOS 21.0.0 stand alone ($1590.00)
LISREL (9.1) (Joreskog & Sorbom) $495.00
 http://www.ssicentral.com
 Free student versions
EQS (6.2) for Windows (Bentler) $595.00
 http://www.mvsoft.com
Mplus-7 (Muthen & Muthen) ($595.00)
 https://www.statmodel.com
 Demo version available
 Add-ons for multilevel and mixture models
Many open source programs, e.g. R
Structural Regression Model



A research question that focuses on the regression of Y on X (e.g., do principal experiences predict school building health?)
A survey is constructed with 4 items, all theoretically related to a latent construct X (principal experiences)
And 3 different items theoretically related to a latent construct Y (school health)

Typically a researcher derives an X-composite variable
 X = (x1+x2+x3+x4)
And a Y-composite variable
 Y = (y1+y2+y3)
Then regresses Y on X
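The composite-score approach just described can be sketched as follows. The item responses are invented for five hypothetical respondents, and the regression is an ordinary least-squares fit computed by hand so that the example has no dependencies.

```python
from statistics import mean

# Hypothetical item responses for five respondents (not data from the lecture).
x_items = [  # x1..x4
    (3, 4, 3, 4),
    (2, 2, 3, 2),
    (5, 4, 5, 5),
    (1, 2, 1, 2),
    (4, 4, 3, 4),
]
y_items = [  # y1..y3
    (4, 3, 4),
    (2, 3, 2),
    (5, 5, 4),
    (2, 1, 1),
    (3, 4, 4),
]

# Composite scores: unweighted sums of the items.
X = [sum(row) for row in x_items]
Y = [sum(row) for row in y_items]

# Ordinary least-squares regression of Y on X.
x_bar, y_bar = mean(X), mean(Y)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))
sxx = sum((x - x_bar) ** 2 for x in X)
slope = sxy / sxx
intercept = y_bar - slope * x_bar

print(f"Y = {intercept:.3f} + {slope:.3f} * X")
```

Summing the items weights them equally and treats the composites as if they were measured without error, which is the assumption a structural regression model relaxes.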
[Figure: structural regression model in which latent X (indicators x1 to x4) predicts latent Y (indicators y1 to y3); parameters are marked * and two paths are marked 1]
Understanding the Measurement Model
Consider the regression of Y on X (diagrammed as follows)
[Figure: regression of Y on X]
Expressed as a path model, the regression of Y on X is diagrammed
[Figure: path model X → Y]
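Modeling Measurement Error in X (measurement error variance is 0.019)
[Figure: path model in which FX predicts Y; X is the indicator of FX with a path marked 1. and error variance 0.019]
FX is the True Score on X
Modeling Measurement Errors in both X & Y
[Figure: path model in which FX predicts FY; X and Y are the indicators with paths marked 1., and the diagram shows the values 0.019 and 0.022]
FY is the True Score on Y
FX is the True Score on X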
Confirmatory Factor Analysis

CFA is a type of SEM that deals specifically with measurement models
 The relationships between observed variables (indicators) and latent variables (factors)
 CFA models are hypothesis driven
CFA has become one of the most popular techniques/methods used in applied social and health science research




Psychometric evaluation of “test” instruments
Construct validation
Measurement invariance
Investigation of method effects
[Figure: CFA path diagram with latent factors ξ1 and ξ2, observed indicators X1 to X7, and a unique factor δ for each indicator]
Missing data, Non-normality, & Categorical Data
Conduct a Missing Data Analysis

MCAR (missing completely at random)
 Probability of missing on Y is unrelated to Y and all other variables in the data set
MAR (missing at random)
 When the probability of missing on Y depends on one (or more) X variables (not related to Y when X is held constant)
NMAR (not missing at random)
 When missingness is related to values that could have been observed
Planned missingness (ignorable missing)
Consult your favorite statistician

Listwise deletion
 Loss of power
 If missingness is MCAR
  Estimates are consistent (unbiased)
  Usually not efficient (large std errors)
 If missingness is MAR
  Estimates may not be consistent nor efficient
Pairwise deletion
 If you use a covariance or correlation matrix (you will probably lie)
 Matrix may not be positive definite (it is then not a proper covariance matrix, and estimation can fail)
 MCAR
  Consistent estimates (in large samples)
  Biased std errors
 MAR
  Estimates and std errors are biased
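The positive-definiteness problem with pairwise deletion can be checked directly. The sketch below attempts a Cholesky factorization, which succeeds only for a symmetric positive definite matrix. Both correlation matrices are hypothetical: the first mimics what can happen when each correlation is computed on a different subset of cases, the second is an ordinary coherent matrix.

```python
import math


def is_positive_definite(matrix):
    """Return True if a symmetric matrix admits a Cholesky factorization."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial_sum = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                pivot = matrix[i][i] - partial_sum
                if pivot <= 0:  # a non-positive pivot means the matrix is not PD
                    return False
                lower[i][j] = math.sqrt(pivot)
            else:
                lower[i][j] = (matrix[i][j] - partial_sum) / lower[j][j]
    return True


# Hypothetical pairwise-deletion result: each r is plausible alone, but not jointly.
pairwise_r = [
    [1.0, 0.9, 0.9],
    [0.9, 1.0, -0.9],
    [0.9, -0.9, 1.0],
]

# Hypothetical coherent correlation matrix for comparison.
coherent_r = [
    [1.0, 0.5, 0.4],
    [0.5, 1.0, 0.3],
    [0.4, 0.3, 1.0],
]

print("pairwise matrix PD:", is_positive_definite(pairwise_r))
print("coherent matrix PD:", is_positive_definite(coherent_r))
```

If two variables both correlate 0.9 with a third, they cannot correlate -0.9 with each other in any single sample, so a matrix like the first one can only arise when the entries come from different cases.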

Mean substitution, LVCF, regression substitution
 Tends to underestimate variances & std errors but overestimate correlations
EM imputation
 Missingness must be MCAR or MAR + multivariate normal
 Std errors are consistent
Multiple imputation (3 step process)
 Generate m (5 is enough) imputed data sets (typically EM + MCMC)
 Analyze the 5 parallel data sets individually
 Combine the analyses
  SAS PROC MIANALYZE
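The third multiple-imputation step, combining the analyses, is commonly done with Rubin's rules; the lecture does not say which pooling method it has in mind, so treat this as an assumption. The sketch pools one parameter across m = 5 imputed data sets, and the estimates and standard errors are invented for illustration.

```python
import math
from statistics import mean, variance


def pool_estimates(estimates, standard_errors):
    """Pool one parameter across m imputed data sets using Rubin's rules."""
    m = len(estimates)
    pooled = mean(estimates)                           # pooled point estimate
    within = mean([se ** 2 for se in standard_errors])  # average sampling variance
    between = variance(estimates)                       # variance across imputations
    total = within + (1 + 1 / m) * between              # total variance
    return pooled, math.sqrt(total), total


# Hypothetical results for one path coefficient from 5 imputed data sets.
estimates = [0.40, 0.44, 0.38, 0.42, 0.41]
standard_errors = [0.10, 0.10, 0.10, 0.10, 0.10]

pooled, pooled_se, total = pool_estimates(estimates, standard_errors)
print(f"pooled estimate = {pooled:.3f}, pooled SE = {pooled_se:.4f}")
```

The pooled standard error is larger than any single-data-set standard error because it also carries the uncertainty due to the missing values.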





Preferred choice for handling missing data
MCAR or MAR + multivariate normality
MLM estimator (robust ML) can be used with non-normal data
Available in Mplus, LISREL, Amos, Mx, SAS, SPSS(?)


ML & GLS estimators are robust to minor departures from multivariate normality
When non-normality is pronounced don’t use ML or GLS estimators
 ML is very sensitive to high kurtosis
 Inflated χ2
 Modest underestimation of fit indices
 Moderate to severe underestimation of std errors
 All bad things get much worse in small samples

Weighted Least Squares (WLS)
 Requires very large samples
Robust ML (best choice)
 Typically requires raw data
 Satorra-Bentler (SB) scaled χ2
  Cannot be used the same way for testing nested models – must be adjusted
There is quite a bit of variation among the
popular SEM programs – so read and get
familiar with the tool you use
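For the nested-model adjustment noted above for the SB χ2, one widely used form is the Satorra-Bentler scaled difference test. The lecture does not name a specific adjustment, so this is an assumption, and the fit values below are hypothetical. Model 0 is the nested (more restricted) model and model 1 the comparison model; each has a scaled χ2, a scaling correction factor c, and degrees of freedom d.

```python
def sb_scaled_difference(t0_scaled, c0, d0, t1_scaled, c1, d1):
    """Satorra-Bentler scaled chi-square difference for two nested models."""
    # Scaling correction for the difference test.
    cd = (d0 * c0 - d1 * c1) / (d0 - d1)
    # Multiplying each scaled chi-square by its correction factor recovers the
    # unscaled ML chi-square; their difference is then rescaled by cd.
    trd = (t0_scaled * c0 - t1_scaled * c1) / cd
    return trd, d0 - d1


# Hypothetical output from two nested models fit with a robust ML estimator.
trd, df_diff = sb_scaled_difference(
    t0_scaled=60.0, c0=1.20, d0=26,
    t1_scaled=45.0, c1=1.15, d1=24,
)
print(f"scaled difference = {trd:.2f} on {df_diff} df")
```

Simply subtracting the two scaled χ2 values (60.0 - 45.0 here) would not give a statistic that follows a χ2 distribution, which is why the adjustment is needed. Programs differ in how they report the correction factors, so check the documentation of the tool in use.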

Don’t use an ML estimator
 Produces attenuated correlation estimates if there is a floor or ceiling effect
 Produces “pseudofactors” that are artifacts of item difficulty or extremeness
 Produces incorrect test statistics and std errors
 Possibly produces incorrect parameter estimates if there is a floor or ceiling effect

WLS
 Weight matrix requires a large sample
  Assume 10 indicators
  b = (10*11)/2 = 55, so the W matrix has (b*(b+1))/2 = (55*56)/2 = 1540 distinct elements
  Skewness can aggravate a small sample size problem
Robust WLS (WLSMV – Mplus)
 Appears to be the best choice for categorical data
 But still requires large samples
ULS
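The weight-matrix arithmetic for WLS generalizes to any number of indicators. The helper below reproduces the 10-indicator figures given above and shows how quickly the count grows; the function name is only illustrative.

```python
def wls_weight_matrix_size(n_indicators):
    """Return (b, distinct elements of W) for a model with n_indicators variables."""
    # b: unique variances and covariances among the indicators.
    b = n_indicators * (n_indicators + 1) // 2
    # W is a symmetric b x b matrix, so it has b(b+1)/2 distinct elements.
    distinct_elements = b * (b + 1) // 2
    return b, distinct_elements


for n in (10, 20):
    b, elements = wls_weight_matrix_size(n)
    print(f"{n} indicators: b = {b}, W has {elements} distinct elements")
```

Doubling the number of indicators multiplies the number of elements to estimate by roughly fourteen, which is one way to see why WLS needs such large samples.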