Structural equation models: opportunities,
risks and discussion of some applications in the
travel behavior research domain
Marco Diana, Politecnico di Torino (I)
University of Maryland, College Park, 29th November 2014
Structure of the seminar
1. Structural equation models are grounded on two multivariate statistical analysis techniques:
   • Multiple regression
   • Principal component and factor analysis
2. Basic notions on structural equation models (SEM)
3. Use of SEM: needed input, range of output, most common issues in travel behavior research
4. Available software packages
5. Discussion on some applications in the study of mobility behaviours
Marco Diana, Structural equations models – University of Maryland, College Park, 29/11/2014
Measurement scales (Stevens, 1946)
• Metric (quantitative) variables:
  - Ratio scales (e.g. body weight, road length)
  - Interval scales (e.g. temperature in degrees Celsius)
• Nonmetric (qualitative) variables:
  - Ordinal scales (e.g. degree of satisfaction)
  - Categorical scales (e.g. sex)
Univariate and bivariate analyses
• One random variable:
  - Univariate distributions and related moments (mean, variance…)
• Two random variables:
  - Bivariate, joint and conditional distributions and related moments
  - Interdependence analyses => correlations (Pearson, Spearman…), contingency tables
  - Dependence analyses => linear regression, ANOVA
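As a small illustration of the two correlation measures named above, the following sketch compares them on a monotonic but non-linear relationship. The data are hypothetical, and the `ranks` helper is an assumption introduced here (it is only valid when there are no ties); Spearman's coefficient is computed as Pearson's coefficient on the ranks.

```python
import numpy as np

def ranks(values):
    """Ranks 1..n of the values (valid only when there are no ties)."""
    order = np.argsort(values)
    result = np.empty(len(values), dtype=float)
    result[order] = np.arange(1, len(values) + 1)
    return result

# Hypothetical data: y grows with x, but not linearly.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = x ** 3

# Pearson measures linear association between the raw values.
pearson = np.corrcoef(x, y)[0, 1]
# Spearman is Pearson applied to the ranks: it measures monotonic association.
spearman = np.corrcoef(ranks(x), ranks(y))[0, 1]

print("Pearson :", round(pearson, 3))
print("Spearman:", round(spearman, 3))
```

Spearman's coefficient reaches 1 because the ordering of the two variables is identical, while Pearson's stays below 1 because the relationship is not linear.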
Multivariate statistical analysis techniques
[Figure: classification of multivariate statistical analysis techniques, from Hair et al. (1998)]
Multiple linear regression (1/2)
Operating instructions:
1. Dependence technique => need to identify x and y
2. A single linear relationship
3. Only one metric dependent variable (y)
4. Two or more linearly independent variables ($x_1, x_2, \dots$), either metric or binary
Objective:
[Path diagram: $x_1$ and $x_2$ point to y with coefficients $a_1$ and $a_2$]
• Find the values of the parameters $a_0, a_1, a_2, \dots$ in
$y = a_0 + a_1 x_1 + a_2 x_2 + \dots + e$
• … such that the sum of squared errors (differences between the observed and the predicted values of y) is minimised (OLS).
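The OLS objective can be sketched in a few lines of NumPy. The data below are hypothetical and built without noise so that the estimates can be checked by eye; `np.linalg.lstsq` is used as one convenient least-squares solver, not as the method of any specific package mentioned in this seminar.

```python
import numpy as np

# Hypothetical toy data (not from the seminar): 5 observations, 2 predictors.
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 6.0]])
# y is built exactly as 1 + 2*x1 + 0.5*x2, so the fit should recover these values.
y = 1.0 + 2.0 * X[:, 0] + 0.5 * X[:, 1]

# Prepend a column of ones so that the intercept a0 is estimated too.
design = np.column_stack([np.ones(len(X)), X])

# lstsq returns the vector a that minimises the sum of squared errors
# ||y - design @ a||^2.
a, *_ = np.linalg.lstsq(design, y, rcond=None)

residuals = y - design @ a
sse = float(residuals @ residuals)
print("a0, a1, a2 =", a)
print("sum of squared errors =", sse)
```

With real, noisy data the sum of squared errors would be strictly positive, and the four assumptions on the next slide would need to be checked on the residuals.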
Multiple linear regression (2/2)
Assumptions:
1. Linear relationship
2. Independence of errors
3. Normal distribution of errors
4. Constant variance of errors (homoskedasticity)
NB1: multicollinearity of x variables is «slightly less problematic» than in some discrete choice models
NB2: measurement errors cannot be distinguished from the regression error term
SEM can be helpful in both cases!
Factor & Principal Components Analysis
Operating instructions:
1. Interdependence analysis => «We only have x»
2. Metric variables (possible extensions)
Objective:
• Analyse the correlation matrix of the variables, looking for clusters of variables that are more correlated among themselves and less correlated with the others
• Find latent variables (factors, constructs, components, dimensions) from such groups that can therefore «synthesise» or «represent» the observed x variables
Common, specific and total variance
• Both methods are based on the study of the variance in the data
• The common variance is the variance that is shared among all x variables
• The specific variance is associated only with a specific variable $x_i$ (including the part due to measurement errors)
• The total variance is the sum of the two
• PCA: the input is the correlation matrix => the method considers the total variance
• FA: the main diagonal of the correlation matrix contains an estimate of the common variance => the method considers only the common variance
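The difference between the two input matrices can be made concrete with a short sketch. The correlation matrix is hypothetical, and the use of squared multiple correlations as the common-variance estimate is an assumption made here for illustration (it is one widely used starting estimate, not necessarily the one meant in the slide).

```python
import numpy as np

# Hypothetical correlation matrix of three observed variables.
R = np.array([[1.0, 0.6, 0.5],
              [0.6, 1.0, 0.4],
              [0.5, 0.4, 1.0]])

# PCA works on R as it is: the ones on the diagonal are the total
# (standardised) variance of each variable.
total_variance = np.trace(R)

# FA replaces the diagonal with an estimate of the common variance.
# Assumption: squared multiple correlations, SMC_i = 1 - 1 / (R^-1)_ii.
smc = 1.0 - 1.0 / np.diag(np.linalg.inv(R))
R_reduced = R.copy()
np.fill_diagonal(R_reduced, smc)

print("total variance (PCA input):", total_variance)
print("common variance estimates (FA input diagonal):", smc)
```

The off-diagonal correlations are identical in the two matrices; only the diagonal, and therefore the amount of variance that the method tries to reproduce, changes.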
Principal component analysis (Pearson, 1901)
• Transformation of p observed variables x into p latent variables t, linear combinations of x
• i.e., find the values of the coefficients $a_{11}, a_{21}, \dots$ in
$t_1 = a_{11}x_1 + a_{12}x_2 + \dots + a_{1p}x_p$
$t_2 = a_{21}x_1 + a_{22}x_2 + \dots + a_{2p}x_p$
…
$t_p = a_{p1}x_1 + a_{p2}x_2 + \dots + a_{pp}x_p$
• … such that:
  - The components $t_1 \dots t_p$ are sorted by decreasing variance
  - The components $t_i$ are uncorrelated with each other
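A minimal sketch of this transformation, assuming that the components are extracted from a (hypothetical) correlation matrix through its eigendecomposition, which is the standard way to obtain coefficients with the two properties listed above:

```python
import numpy as np

# Hypothetical correlation matrix of p = 3 observed variables.
R = np.array([[1.0, 0.6, 0.5],
              [0.6, 1.0, 0.4],
              [0.5, 0.4, 1.0]])

# eigh is meant for symmetric matrices and returns eigenvalues in
# ascending order, so they are re-sorted by decreasing variance.
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
variances = eigvals[order]

# Row i of A holds the coefficients a_i1 ... a_ip of component t_i.
A = eigvecs[:, order].T

# Covariance matrix of the components: diagonal means uncorrelated components.
component_cov = A @ R @ A.T

print("component variances:", variances)
print("coefficients of t_1:", A[0])
```

The variances of the components add up to p, the total variance of the standardised variables, so their cumulative share shows how much variability the first few components retain.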
Factor analysis (Spearman, 1904)
• Regression of p observed variables x on k<p latent variables $\xi$
• i.e., find the values of the loadings $\lambda_{11}, \lambda_{21}, \dots$ in
$x_1 = \lambda_{11}\xi_1 + \lambda_{12}\xi_2 + \dots + \lambda_{1k}\xi_k + \delta_1$
$x_2 = \lambda_{21}\xi_1 + \lambda_{22}\xi_2 + \dots + \lambda_{2k}\xi_k + \delta_2$
…
$x_p = \lambda_{p1}\xi_1 + \lambda_{p2}\xi_2 + \dots + \lambda_{pk}\xi_k + \delta_p$
• … such that the factors $\xi$ can explain the common variance among the x variables
• Unlike PCA, here we assume that the factors $\xi$ actually exist (more formally, the covariance matrix of the x variables must have some properties)
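The property required of the covariance matrix can be written out explicitly. The following is a sketch under the usual textbook assumptions, which the slide does not spell out: the latent factors are uncorrelated with the specific terms, and the specific terms are uncorrelated with each other. The symbols $\Lambda$ (the p × k matrix of loadings), $\Phi$ (covariance matrix of the factors) and $\Theta_\delta$ (diagonal covariance matrix of the specific terms) are notation introduced here.

```latex
\mathbf{x} = \Lambda \boldsymbol{\xi} + \boldsymbol{\delta},
\qquad
\operatorname{Cov}(\boldsymbol{\xi}, \boldsymbol{\delta}) = 0
\quad\Longrightarrow\quad
\Sigma = \operatorname{Cov}(\mathbf{x})
       = \Lambda \, \Phi \, \Lambda^{\top} + \Theta_{\delta}
```

In words: the off-diagonal covariances among the observed variables must be reproducible by only k<p factors, while each diagonal element splits into a common part (from $\Lambda \Phi \Lambda^{\top}$) and a specific part (from $\Theta_\delta$).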
Common requirements and results
• Both PCA and FA give meaningful results iff the x variables are at least partly correlated => multicollinearity is desirable!
• Sample size: at least 5 observations per observed variable x, and in any case at least 100
• We consider the first k<p components of a PCA or we look for k<p factors through a FA => methods to choose k are needed
• If the common variance is a substantial part of the total variance, the two methods give similar results
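The slide does not say which methods are used to choose k. As an illustration only, the sketch below applies two common heuristics, which are assumptions introduced here: keeping the components whose eigenvalue exceeds 1 (often called the Kaiser criterion) and looking at the cumulative share of variance. The correlation matrix is hypothetical.

```python
import numpy as np

# Hypothetical correlation matrix of p = 4 observed variables forming
# two correlated pairs: (x1, x2) and (x3, x4).
R = np.array([[1.0, 0.7, 0.1, 0.1],
              [0.7, 1.0, 0.1, 0.1],
              [0.1, 0.1, 1.0, 0.6],
              [0.1, 0.1, 0.6, 1.0]])

# Eigenvalues of R, sorted by decreasing size (variance of each component).
eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]

# Heuristic 1 (assumption): keep components with eigenvalue > 1, i.e. those
# that carry more variance than a single standardised variable.
k_kaiser = int(np.sum(eigvals > 1.0))

# Heuristic 2 (assumption): cumulative share of the total variance.
explained = np.cumsum(eigvals) / eigvals.sum()

print("eigenvalues:", np.round(eigvals, 3))
print("k by eigenvalue > 1:", k_kaiser)
print("cumulative variance share:", np.round(explained, 3))
```

Here both heuristics point to k = 2, which matches the two pairs built into the example; with real data they can disagree, which is why the choice of k needs to be justified case by case.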
PCA: areas of use
• Aim: to represent data variability with the minimum number of latent variables
• Theoretical assumptions: none, we simply want to summarise the variables while trying to preserve the patterns within the dataset
• Data characteristics: the specific variance and the one due to measurement errors are a negligible proportion of the total variance
[Path diagram: components $t_1$ and $t_2$ linked to the observed variables $x_1 \dots x_7$ through the coefficients $a_{11}, a_{12}, a_{13}$ and $a_{23} \dots a_{27}$]
Factor analysis: areas of use
• Aim: identifying the dimensions, or latent factors, implied by the set of x variables being considered
• Theoretical assumptions: latent factors do exist, on the basis of a theory that allows the interpretation of the observed correlations
• Data characteristics: specific and measurement error variances are not negligible, therefore only the common variance is considered
[Path diagram: factors $\xi_1$ and $\xi_2$ linked to the observed variables $x_1 \dots x_7$ through the loadings $\lambda_{11}, \lambda_{12}, \lambda_{13}$ and $\lambda_{23} \dots \lambda_{27}$]
Exploratory vs confirmatory analysis
• The factor analysis we introduced is exploratory (EFA): the number of latent factors and their relationships with the observed variables are found a posteriori, through the analysis itself.
• If we have a well-founded theory that is empirically supported by previous EFAs, it is better to define the factors and their relations with the observed variables a priori, computing the loadings $\lambda_{ij}$ and checking the model «goodness of fit» => confirmatory technique (CFA)
SEM can be used to implement a CFA!
Combining regression and factor analysis
Examples of combinations of the two methods:
• Chained regressions: path analysis (Wright, 1934)
  [Path diagram; variable labels: Freedom, Well-being, Safety, Reliability, Education, Children <14, Income, Age]
• Higher-order factor analyses
  [Path diagram; construct labels: Affective, Car attitudes, Cognitive]
• Regression where some variables are latent
  [Path diagram; variable labels: Trip rates, Nationality, Rootedness, Education, Mobility, Income, Systematic trips, Transfers, Holidays, VFR]
SEM – Structural equation models
• It would be possible to estimate the previous models by decomposing them and running n distinct regressions and/or factor analyses
• However, this would be an inefficient use of data
Structural equation models (Jöreskog et al., 1973)
• Regression and FA are generalised and combined, through simultaneous estimation of all parameters:
  - Further results and «diagnostic tools»
  - Further applications compared to the previous examples
LISREL notation of a SEM model
• Measurement model:
$x = \Lambda_x \xi + \delta$
$y = \Lambda_y \eta + \varepsilon$
where x and y are the observed exogenous and endogenous variables, $\xi$ and $\eta$ the latent ones, $\Lambda_x$ and $\Lambda_y$ the loading matrices, $\delta$ and $\varepsilon$ the error terms
• Structural model:
$\eta = B\eta + \Gamma\xi + \zeta$
where $B$ and $\Gamma$ are the structural coefficient matrices and $\zeta$ the error terms
• The two models are jointly estimated.
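Joint estimation means that all parameters are chosen together so that the covariance matrix implied by the model is as close as possible to the one observed in the sample. The sketch below computes that implied matrix for a tiny model with one exogenous and one endogenous latent variable, two indicators each. All numerical values are hypothetical, and the matrices `Phi`, `Psi`, `Theta_d` and `Theta_e` (covariances of the exogenous latent variables and of the three error terms) are standard LISREL companions that the slide does not mention; error terms are assumed uncorrelated with the latent variables and with each other.

```python
import numpy as np

# Hypothetical parameter values for a small model.
Lx = np.array([[1.0], [0.8]])    # loadings of x1, x2 on the exogenous latent
Ly = np.array([[1.0], [0.9]])    # loadings of y1, y2 on the endogenous latent
B = np.array([[0.0]])            # effects among endogenous latents (none here)
G = np.array([[0.5]])            # effect of the exogenous on the endogenous latent
Phi = np.array([[1.0]])          # variance of the exogenous latent
Psi = np.array([[0.3]])          # variance of the structural error
Theta_d = np.diag([0.4, 0.5])    # measurement error variances of x1, x2
Theta_e = np.diag([0.2, 0.3])    # measurement error variances of y1, y2

# Reduced form of the structural model: eta = (I - B)^-1 (G xi + zeta).
inv = np.linalg.inv(np.eye(B.shape[0]) - B)
cov_eta = inv @ (G @ Phi @ G.T + Psi) @ inv.T

# Implied covariances of the observed variables.
Sxx = Lx @ Phi @ Lx.T + Theta_d
Syy = Ly @ cov_eta @ Ly.T + Theta_e
Syx = Ly @ inv @ G @ Phi @ Lx.T

# Full implied covariance matrix, ordered as (y1, y2, x1, x2).
Sigma = np.block([[Syy, Syx], [Syx.T, Sxx]])
print(np.round(Sigma, 3))
```

An estimation routine would vary the free entries of these matrices until `Sigma` matches the sample covariance matrix as closely as the chosen fit function allows.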
Example (Hair et al., 1998)
[Figure: model path diagram]
Example, cont. (Hair et al., 1998)
[Figure: complete model]
Parameters that can be estimated
• Structural coefficients (regression coefficients)
• Factor loadings, both of exogenous and endogenous variables
• Correlations between endogenous constructs (to be avoided!) or between exogenous constructs (obviously not between endogenous and exogenous ones)
• Variance of the measurement error of the observed variables (endogenous and exogenous)
• Covariance of the measurement errors of the observed variables (endogenous and exogenous)
Confirmatory technique => the analyst chooses which parameters should be estimated
Input and assumptions
• Input: covariance or correlation matrix of the observed variables, as in factor analysis:
  - Covariances: total effects are found, comparison between different models/populations/samples (transferability)
  - Correlations: understanding patterns among variables and their relative importance
• Assumptions:
  - From regression: linear relationships, multivariate normal distributions
  - From sampling theory: random sample, independent observations
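The relationship between the two possible inputs can be sketched as follows, with a hypothetical covariance matrix. A correlation matrix is a covariance matrix of standardised variables, so the information about the measurement scales is lost in the conversion.

```python
import numpy as np

# Hypothetical covariance matrix of two observed variables
# (variances 4 and 9, covariance 3).
S = np.array([[4.0, 3.0],
              [3.0, 9.0]])

# Standard deviations are the square roots of the diagonal.
sd = np.sqrt(np.diag(S))

# Dividing each covariance by the product of the two standard deviations
# gives the correlation matrix.
R = S / np.outer(sd, sd)

print("standard deviations:", sd)
print("correlation matrix:\n", R)
```

Two samples with the same correlations but different variances would yield the same `R` and different `S`, which is one way to see why covariances are the input to prefer when models are compared across samples.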
Data requirements and estimation
• Sample size:
  - At least 100-150 observations
  - 10 observations per parameter, 15 when non-normality is detected
  - Overfitting when we use more than 400 observations (the model becomes too sensitive)
• Estimation methods:
  - Parametric: maximum likelihood (ML)
  - Non-parametric: ADF-WLS => 1000 observations are needed
  - Resampling: bootstrap, jackknife
Common problems in SEM
• The same symptom (estimation process not converging, variances<0, loadings>1, «mysterious» error messages…) could be due to different problems:
  - Unsound theoretical basis, specification errors
  - Model identification: degrees of freedom, scales and number of indicators per construct, rank and order conditions…
  - Non-normality when using a parametric estimation method
  - Algebraic properties of the input matrix (positive definite…)
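Two of the checks listed above lend themselves to a quick sketch: counting degrees of freedom and testing whether an input matrix is positive definite. The function names and the example values are hypothetical. Note that non-negative degrees of freedom are a necessary condition for identification, not a sufficient one.

```python
import numpy as np

def degrees_of_freedom(n_observed, n_free_parameters):
    """Distinct variances and covariances minus free parameters."""
    n_moments = n_observed * (n_observed + 1) // 2
    return n_moments - n_free_parameters

def is_positive_definite(matrix):
    """True if the symmetric matrix has only strictly positive eigenvalues."""
    return bool(np.all(np.linalg.eigvalsh(matrix) > 0))

# Hypothetical one-factor CFA with 4 indicators: 3 free loadings (one is
# fixed to set the scale), 1 factor variance, 4 error variances = 8 parameters.
df = degrees_of_freedom(n_observed=4, n_free_parameters=8)

# A valid correlation matrix and an impossible one ("correlation" above 1),
# such as might result from a data entry error.
S_ok = np.array([[1.0, 0.5], [0.5, 1.0]])
S_bad = np.array([[1.0, 1.2], [1.2, 1.0]])

print("degrees of freedom:", df)
print("S_ok positive definite:", is_positive_definite(S_ok))
print("S_bad positive definite:", is_positive_definite(S_bad))
```

Running such checks before estimation can help to tell an input problem apart from a specification problem when the software only reports a generic failure.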
Goodness of fit measures in SEM
• Problems and symptoms are not uniquely linked; the same goes for fit measures:
  - Absolute fit
  - Parsimonious fit
  - Incremental fit
  - Structural model fit (sign and significance of coefficients, rho-squared)
  - Measurement model fit (unidimensionality of constructs, Cronbach’s alpha)
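Among the measures listed above, Cronbach's alpha is simple enough to compute by hand. The sketch below uses its standard formula, alpha = k/(k-1) · (1 − sum of item variances / variance of the total score); the function name and the Likert answers are hypothetical.

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for a (respondents x items) array of scores."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    # Sample variance of each item and of the summed scale.
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1.0 - item_variances.sum() / total_variance)

# Hypothetical answers of 5 respondents to 3 Likert items that are meant
# to measure the same construct.
scores = np.array([[4, 5, 4],
                   [2, 3, 2],
                   [5, 5, 4],
                   [1, 2, 1],
                   [3, 3, 3]])

alpha = cronbach_alpha(scores)
print("Cronbach's alpha:", round(alpha, 3))
```

The items in this example move together almost perfectly, so alpha is close to 1; items that share little variance would bring it towards 0.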
Advanced SEM applications
• Path analysis:
  - Reciprocal implications (non-recursive models)
  - Direct, indirect and total effects
  - Mean structures (different means of latent variables)
• Regression with an estimation of correlations among variables (endogenous or exogenous, observed or latent)
  - Models with repeated observations
  - Models with longitudinal data (latent growth)
• Including categorical variables
  - Multiple sample models, mixture models
You simply can’t do all this by combining regression and FA!
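One of the items above, the decomposition into direct, indirect and total effects, can be sketched with the structural matrices B (effects among endogenous variables) and Γ (effects of exogenous on endogenous variables). The path coefficients are hypothetical; the total effects of the exogenous variables follow from the reduced form of the structural model as (I − B)⁻¹Γ.

```python
import numpy as np

# Hypothetical path model: xi -> eta1 (0.5), eta1 -> eta2 (0.4),
# and a direct path xi -> eta2 (0.2).
B = np.array([[0.0, 0.0],
              [0.4, 0.0]])     # row = affected variable, column = cause
G = np.array([[0.5],
              [0.2]])          # direct effects of xi on eta1 and eta2

# Total effects of xi on the endogenous variables: (I - B)^-1 G.
total = np.linalg.inv(np.eye(2) - B) @ G

# Direct effects are the entries of G; indirect effects are what remains.
direct = G
indirect = total - direct

print("direct  :", direct.ravel())
print("indirect:", indirect.ravel())
print("total   :", total.ravel())
```

The indirect effect on the second endogenous variable is the product of the coefficients along the mediated path (0.5 × 0.4), which is added to the direct effect to give the total.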
Software for SEM estimation
• LISREL 9.1 (Jöreskog et al.)
• EQS 6.1 (Bentler et al.)
• Mplus 7 (Muthén et al.)
• SAS => PROC CALIS (SAS Institute)
• Statistica => SEPATH (StatSoft)
• SPSS => Amos (IBM)
• R => sem, lavaan, …
(Packages that I used to be familiar with are in bold in the original slides; they are not necessarily the best ones…)
SEM applications in travel research
• Golob (2003) reviewed more than 50 papers on a wealth of topics:
  - Mode choice behaviors
  - Determinants of car ownership and use
  - Longitudinal and panel data analyses
  - Activity-based models
  - Travel attitudes-behaviors relationships
  - Driving behaviors and safety issues
• Obviously many more SEM papers have appeared since then, although I would have expected an even sharper increase
Example: primary utility (Diana, 2008)
• Travel demand is derived only from the need to perform activities in different places…
  - Activity-based models
  - Utility-maximising models that minimise travel times
• …but is it always true?
  - «Teleportation test»: 3% of the sample indicates an ideal commute time <2 min, 50% >20 min (Mokhtarian and Salomon, 2001)
  - Random utility models where travel-time coefficients >= 0: always garbage or…
Example: primary utility (Diana, 2008)
• Goal: capturing and measuring the «primary utility» latent construct
• Theoretical model => EFA => primary utility is due to different factors:
  - Importance of on-trip activities
  - Importance of activities at different locations
  - Ideal trip length
  - Travel-related cognitive and affective attitudes
  - Performance and use of the travel means
• Item analysis => 6 constructs are related to primary utility => second-order CFA
Model specification (Diana, 2008)
Primary utility measurement scale
[Figure panels: drivers versus transit riders; commuting versus other trips]
Modal diversion (Diana, 2010)
• Modal diversion versus mode choice
• Demand for unknown services:
  - «cognitive asymmetry» <=> SP surveys
  - Attitudes and rational evaluations have a different relative importance according to the alternative
• Behavioral modal diversion model: the endogenous variable measures the propensity to change on a Likert scale
• Data limitations => implementation of submodels and use of standardized estimates
Modal diversion (Diana, 2010)
[Figure: estimated model] Standardized estimation => comparing different structural coefficients
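Why standardization makes coefficients comparable can be shown in a plain regression setting, used here as a simplified stand-in for a structural equation. The data are simulated and the variable names are hypothetical; a standardized coefficient equals the unstandardized one multiplied by the ratio of the standard deviations of the predictor and of the outcome.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical predictors measured on very different scales.
cost = rng.normal(10.0, 2.0, n)           # e.g. euros
travel_time = rng.normal(30.0, 10.0, n)   # e.g. minutes
y = -0.20 * cost - 0.05 * travel_time + rng.normal(0.0, 1.0, n)

# Unstandardized coefficients: OLS with an intercept, slopes kept.
X = np.column_stack([np.ones(n), cost, travel_time])
b = np.linalg.lstsq(X, y, rcond=None)[0][1:]

# Standardized coefficients: b * sd(x) / sd(y).
sd_x = np.array([cost.std(ddof=1), travel_time.std(ddof=1)])
beta = b * sd_x / y.std(ddof=1)

def zscore(v):
    """Centre a variable and scale it to unit standard deviation."""
    return (v - v.mean()) / v.std(ddof=1)

# Cross-check: the same values come from regressing z-scores on z-scores.
Z = np.column_stack([zscore(cost), zscore(travel_time)])
beta_check = np.linalg.lstsq(Z, zscore(y), rcond=None)[0]

print("unstandardized:", np.round(b, 3))
print("standardized  :", np.round(beta, 3))
```

Standardized values express effects in standard deviation units, so they can be ranked within one model; they depend on the variances in the sample, which is why unstandardized estimates are preferred for comparisons across subsamples.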
SEM with subsamples
Is there a difference between the diversion to buses and the diversion to shared taxis? => Compare the unstandardized estimates of the single structural equations in the two subsamples
Model with MULTIM   REL_COST   REL_TIME   REL_WAIT   REL_WALK   MULTIM
All                 -0.20      -0.25      -0.15      -0.14       0.17
Buses               -0.11 *    -0.39      -0.29      -0.05 **    0.29 *
DRT                 -0.07 *    -0.21      -0.14      -0.15       0.15 *

Model with COGNIT   REL_COST   REL_TIME   REL_WAIT   REL_WALK   COGNIT
All                 -0.19      -0.26      -0.13      -0.09      -0.20
Buses               -0.08 *    -0.38      -0.27       0.01      -0.08 **
DRT                 -0.07 *    -0.21      -0.11 *    -0.10 *    -0.29

* = not significant at the 5% level; ** = not significant at the 20% level
Thank you for your attention!
Questions, remarks, …
Marco Diana
marco.diana@polito.it
List of acronyms
ADF-WLS = Asymptotically distribution-free weighted least squares
CFA = Confirmatory factor analysis
EFA = Exploratory factor analysis
FA = Factor analysis
ML = Maximum likelihood
OLS = Ordinary least squares
PCA = Principal components analysis
SEM = Structural equation model
VFR = Visiting friends and relatives
Mentioned references
• Diana, M. (2008) Making the “primary utility of travel” concept operational: a measurement model for the assessment of the intrinsic utility of reported trips, Transportation Research A, 42(3), 455-474.
• Diana, M. (2010) From mode choice to modal diversion: a new behavioural paradigm and an application to the study of the demand for innovative transport services, Technological Forecasting & Social Change, 77(3), 429-441.
• Golob, T.F. (2003) Structural equation modeling for travel behavior research, Transportation Research B, 37(1), 1-25.
• Hair, J.F., Anderson, R.E., Tatham, R.L., Black, W.C. (1998) Multivariate Data Analysis, 5th ed., Prentice Hall (more recent editions are now available).
• Mokhtarian, P.L., Salomon, I. (2001) How derived is the demand for travel? Some conceptual and measurement considerations, Transportation Research A, 35(8), 695-719.