Structural equation models: opportunities, risks and discussion of some applications in the travel behavior research domain

Marco Diana, Politecnico di Torino (I)
University of Maryland, College Park, 29th November 2014

Structure of the seminar

1. Structural equation models are grounded on two multivariate statistical analysis techniques:
   - Multiple regression
   - Principal component and factor analysis
2. Basic notions on structural equation models (SEM)
3. Use of SEM: needed input, range of output, most common issues in travel behavior research
4. Available software packages
5. Discussion of some applications in the study of mobility behaviours

Marco Diana, Structural equations models – University of Maryland, College Park, 29/11/2014

Measurement scales (Stevens, 1946)

Metric (quantitative) variables:
- Ratio scales (e.g. body weight, road length)
- Interval scales (e.g. temperature)

Nonmetric (qualitative) variables:
- Ordinal scales (e.g. degree of satisfaction)
- Categorical scales (e.g. sex)

Univariate and bivariate analyses

One random variable:
- Univariate distributions and related moments (mean, variance…)

Two random variables:
- Bivariate, joint and conditional distributions and related moments
- Interdependence analyses => correlations (Pearson, Spearman…), contingency tables
- Dependence analyses => linear regression, ANOVA

Multivariate statistical analysis techniques

[Figure: overview of multivariate statistical analysis techniques. From: Hair et al. (1998)]

Multiple linear regression (1/2)

Operating instructions:
1. Dependence technique => need to identify x and y
2. A unique linear relationship
3. Only one metric dependent variable ($y$)
4. Two or more linearly independent variables ($x_1, x_2, \dots$), either metric or binary

Objective:

[Path diagram: $x_1$ and $x_2$ point to $y$, with coefficients $a_1$ and $a_2$]

Find the values of the parameters $a_0, a_1, a_2, \dots$ in

$y = a_0 + a_1 x_1 + a_2 x_2 + \dots + e$

such that the sum of the squared errors $e$ (the differences between $y$ and $a_0 + a_1 x_1 + a_2 x_2 + \dots$) is minimised (OLS).

Multiple linear regression (2/2)

Assumptions:
1. Linear relationship
2. Independence of errors
3. Normal distribution of errors
4. Constant variance of errors (homoskedasticity)

NB1: multicollinearity of the x variables is «slightly less problematic» than in some discrete choice models
NB2: measurement errors are not distinguishable

SEM can be helpful in both cases!

Factor and principal component analysis

Operating instructions:
1. Interdependence analysis => «We only have x»
2. Metric variables (extensions are possible)

Objective:
- Analyse the correlation matrix of the variables, looking for clusters of variables that are more correlated with each other and less correlated with the others
- Find latent variables (factors, constructs, components, dimensions) from such groups, which can therefore «summarise» or «represent» the observed x variables

Common, specific and total variance

- Both methods are based on the study of the variance in the data
- The common variance is the variance that is shared among all x variables
- The specific variance is associated only with a specific variable $x_i$ (including the variance due to measurement errors)
- The total variance is the sum of the two

PCA: the input is the correlation matrix => this method considers the total variance
FA: the main diagonal of the correlation matrix contains an estimate of the common variance => this method considers only the common variance

Principal component analysis (Pearson, 1901)

Transformation of $p$ observed variables $x$ into $p$ latent variables $t$ that are linear combinations of $x$, i.e. find the values of the coefficients $a_{11}, a_{21}, \dots$ in

$t_1 = a_{11} x_1 + a_{12} x_2 + \dots + a_{1p} x_p$
$t_2 = a_{21} x_1 + a_{22} x_2 + \dots + a_{2p} x_p$
…
$t_p = a_{p1} x_1 + a_{p2} x_2 + \dots + a_{pp} x_p$

such that:
- The components $t_1 \dots t_p$ are sorted by decreasing variance
- The components $t_i$ are mutually uncorrelated

Factor analysis (Spearman, 1904)

Regression of $p$ observed variables $x$ on $k < p$ latent variables $\xi$, i.e. find the values of the loadings $\lambda_{11}, \lambda_{21}, \dots$ in

$x_1 = \lambda_{11} \xi_1 + \lambda_{12} \xi_2 + \dots + \lambda_{1k} \xi_k + \delta_1$
$x_2 = \lambda_{21} \xi_1 + \lambda_{22} \xi_2 + \dots + \lambda_{2k} \xi_k + \delta_2$
…
$x_p = \lambda_{p1} \xi_1 + \lambda_{p2} \xi_2 + \dots + \lambda_{pk} \xi_k + \delta_p$

such that the factors $\xi$ can explain the common variance among the x variables

Unlike PCA, here we assume that the factors $\xi$ actually exist (more formally, the covariance matrix of the x variables must have some properties)

Common requirements and results

- Both PCA and FA give meaningful results iff the x variables are at least partly correlated => multicollinearity is desirable!
- Sample size: at least 5 observations per observed variable x, and in any case at least 100
- We consider the first $k < p$ components of a PCA, or we look for $k < p$ factors through an FA => methods to choose $k$ are needed
- If the common variance is a substantial part of the total variance, the two methods give similar results

PCA: scope of use

Aim: to represent data variability with the minimum number of latent variables

Theoretical assumptions: none, we simply want to summarise the variables while trying to preserve the patterns within the dataset

[Path diagram: components $t_1$ and $t_2$ linked to the observed variables $x_1 \dots x_7$ through the coefficients $a_{11}, a_{12}, a_{13}, a_{23}, \dots, a_{27}$]

Data characteristics: the specific variance and the variance due to measurement errors are a negligible proportion of the total variance

Factor analysis: scope of use

Aim: identifying the dimensions, or latent factors, implied by the set of x variables being considered

Theoretical assumptions: latent factors do exist, on the basis of a theory that allows the interpretation of the observed correlations

[Path diagram: factors $\xi_1$ and $\xi_2$ linked to the observed variables $x_1 \dots x_7$ through the loadings $\lambda_{11}, \lambda_{12}, \lambda_{13}, \lambda_{23}, \dots, \lambda_{27}$]

Data characteristics: specific and measurement error variances are not negligible, therefore only the common variance is considered

Exploratory vs confirmatory analysis

The factor analysis we introduced is exploratory (EFA): the number of latent factors and their relationships with the observed variables are found a posteriori, through the analysis itself.
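As a rough illustration of choosing the number of components or factors a posteriori, the sketch below extracts the eigenvalues of a correlation matrix, sorts them by decreasing variance and counts those greater than 1. Several things here are assumptions that do not come from the slides: the "eigenvalue greater than 1" rule (often called the Kaiser criterion) is only one of several possible ways to choose k, the correlation matrix `R` is invented, and `jacobi_eigenvalues` is a hypothetical helper written in plain Python so that the example is self-contained. In practice a statistical package would be used.

```python
import math


def jacobi_eigenvalues(matrix, tol=1e-12, max_sweeps=50):
    """Eigenvalues of a symmetric matrix, via cyclic Jacobi rotations."""
    a = [list(map(float, row)) for row in matrix]
    n = len(a)
    for _ in range(max_sweeps):
        # Stop when the off-diagonal part has (numerically) vanished.
        off_diagonal = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off_diagonal < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                # Rotation angle that sets a[p][q] to zero.
                theta = 0.5 * math.atan2(2.0 * a[p][q], a[q][q] - a[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # update columns p and q (A <- A J)
                    akp, akq = a[k][p], a[k][q]
                    a[k][p] = c * akp - s * akq
                    a[k][q] = s * akp + c * akq
                for k in range(n):  # update rows p and q (A <- J^T A)
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k] = c * apk - s * aqk
                    a[q][k] = s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)


# Invented correlation matrix: x1 and x2 are correlated with each other,
# x3 and x4 are correlated with each other, and the two pairs are unrelated.
R = [
    [1.0, 0.8, 0.0, 0.0],
    [0.8, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.6],
    [0.0, 0.0, 0.6, 1.0],
]

eigenvalues = jacobi_eigenvalues(R)                  # variances of the components
k = sum(1 for value in eigenvalues if value > 1.0)   # assumed retention rule
explained = sum(eigenvalues[:k]) / len(R)            # share of total variance kept
print(eigenvalues, k, explained)
```

For this matrix the eigenvalues are close to 1.8, 1.6, 0.4 and 0.2, so the assumed rule retains two components that together account for about 85% of the total variance, one for each cluster of correlated variables.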
If we have a well-founded theory that is empirically supported by previous EFAs, it is better to define the factors and their relations with the observed variables a priori, then compute the loadings $\lambda_{ij}$ and check the model «goodness of fit» => confirmatory technique (CFA)

SEM can be used to implement a CFA!

Combining regression and factor analysis

Examples of combinations of the two methods:
- Chained regressions: path analysis (Wright, 1934)
  [Path diagram with the variables Freedom, Well-being, Safety, Reliability, Education, Children <14, Income, Age]
- Higher-order factor analyses
  [Path diagram with the constructs Affective, Car attitudes, Cognitive]
- Regression where some variables are latent
  [Path diagram with the variables Trip rates, Nationality, Rootedness, Education, Mobility, Income, Systematic trips, Transfers, Holidays, VFR]

SEM – Structural equation models

- It would be possible to estimate the previous models by decomposing them and running n distinct regressions and/or factor analyses
- However, this would be an inefficient use of the data

Structural equation models (Jöreskog et al., 1973): regression and FA are generalised and combined, through simultaneous estimation of all parameters:
- Further results and «diagnostic tools»
- Further applications compared to the previous examples

LISREL notation of a SEM model

Measurement model:

$x = \Lambda_x \xi + \delta$
$y = \Lambda_y \eta + \varepsilon$

where $x$ and $y$ are the observed exogenous and endogenous variables, $\xi$ and $\eta$ the latent ones, $\Lambda_x$ and $\Lambda_y$ the loading matrices, $\delta$ and $\varepsilon$ the error terms

Structural model:

$\eta = B \eta + \Gamma \xi + \zeta$

where $B$ and $\Gamma$ are the structural coefficient matrices and $\zeta$ the error terms

The two models are jointly estimated.
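To make the link between the two models and the data more concrete, the sketch below computes the covariance matrix of the observed variables implied by a small model written in this notation. It rests on assumptions that are not stated in the slides: the symbols $\Phi$, $\Psi$, $\Theta_\delta$ and $\Theta_\varepsilon$ denote the covariance matrices of $\xi$, $\zeta$, $\delta$ and $\varepsilon$, the error terms are taken to be uncorrelated with the latent variables and with each other, and all parameter values and function names are invented for illustration. Under these assumptions, $\Sigma_{xx} = \Lambda_x \Phi \Lambda_x' + \Theta_\delta$, $\Sigma_{yy} = \Lambda_y (I - B)^{-1} (\Gamma \Phi \Gamma' + \Psi) (I - B)^{-1\prime} \Lambda_y' + \Theta_\varepsilon$ and $\Sigma_{xy} = \Lambda_x \Phi \Gamma' (I - B)^{-1\prime} \Lambda_y'$.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def diag(values):
    n = len(values)
    return [[values[i] if i == j else 0.0 for j in range(n)] for i in range(n)]


def inverse(a):
    """Gauss-Jordan inverse with partial pivoting."""
    n = len(a)
    aug = [list(map(float, a[i])) + [1.0 if i == j else 0.0 for j in range(n)]
           for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def implied_covariance(lam_x, lam_y, B, G, phi, psi, theta_d, theta_e):
    """Blocks of the model-implied covariance matrix of (x, y)."""
    m = len(B)
    i_minus_b = [[(1.0 if i == j else 0.0) - B[i][j] for j in range(m)]
                 for i in range(m)]
    A = inverse(i_minus_b)                                # (I - B)^-1
    inner = add(matmul(matmul(G, phi), transpose(G)), psi)
    cov_eta = matmul(matmul(A, inner), transpose(A))      # Cov(eta)
    cov_xi_eta = matmul(matmul(phi, transpose(G)), transpose(A))
    s_xx = add(matmul(matmul(lam_x, phi), transpose(lam_x)), theta_d)
    s_yy = add(matmul(matmul(lam_y, cov_eta), transpose(lam_y)), theta_e)
    s_xy = matmul(matmul(lam_x, cov_xi_eta), transpose(lam_y))
    return s_xx, s_xy, s_yy


# Invented parameter values: one exogenous latent xi (indicators x1, x2)
# and two endogenous latents eta1, eta2 (indicators y1, y2 and y3, y4).
lam_x = [[1.0], [0.8]]
lam_y = [[1.0, 0.0], [0.9, 0.0], [0.0, 1.0], [0.0, 0.7]]
B = [[0.0, 0.0], [0.5, 0.0]]          # eta1 -> eta2
G = [[0.6], [0.3]]                    # xi -> eta1 and xi -> eta2
phi = [[1.0]]                         # variance of xi
psi = diag([0.5, 0.5])                # variances of the structural errors
theta_d = diag([0.5, 0.5])            # measurement error variances of x
theta_e = diag([0.4, 0.4, 0.4, 0.4])  # measurement error variances of y

s_xx, s_xy, s_yy = implied_covariance(lam_x, lam_y, B, G, phi, psi, theta_d, theta_e)
print(s_xx, s_xy, s_yy, sep="\n")
```

Joint estimation can then be read, loosely, as searching for the parameter values whose implied matrix is as close as possible to the covariance matrix observed in the sample. The sketch only evaluates the implied matrix for fixed values and does not perform any estimation.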
Example (Hair, 1998)

[Figure: model path diagram]

Example, cont. (Hair, 1998)

[Figure: complete model]

Parameters that can be estimated

- Structural coefficients (regression coefficients)
- Factor loadings, both of exogenous and endogenous variables
- Correlations between endogenous constructs (to be avoided!) or between exogenous constructs (obviously not between endogenous and exogenous ones)
- Variance of the measurement error of the observed variables (endogenous and exogenous)
- Covariance of the measurement errors of the observed variables (endogenous and exogenous)

Confirmatory technique => the analyst chooses which parameters should be estimated

Input and assumptions

Input: covariance or correlation matrix of the observed variables, as in factor analysis:
- Covariances: total effects are found, comparison between different models/populations/samples (transferability)
- Correlations: understanding patterns among variables and their relative importance

Assumptions:
- From regression: linear relationships, multivariate normal distributions
- From sampling theory: random sample, independent observations

Data requirements and estimation

Sample size:
- At least 100-150 observations
- 10 observations per parameter, 15 when non-normality is detected
- Overfitting when we use more than 400 observations (the model becomes too sensitive)

Estimation methods:
- Parametric: maximum likelihood (ML)
- Non-parametric: ADF-WLS => 1000 observations are needed
- Resampling: bootstrap, jackknife

Common problems in SEM

The same symptom could be due to different problems: estimation process not converging, variances < 0, loadings > 1, «mysterious» error messages…
- Unsound theoretical basis, specification errors
- Model identification: degrees of freedom, scales and number of indicators per construct, rank and order conditions…
- Non-normality when using a parametric estimation method
- Algebraic properties of the input matrix (positive definite…)

Goodness of fit measures in SEM

Problems and symptoms are not uniquely linked, and the same goes for fit measures:
- Absolute fit
- Parsimonious fit
- Incremental fit
- Structural model fit (sign and significance of coefficients, rho-squared)
- Measurement model fit (unidimensionality of constructs, Cronbach’s alpha)

Advanced SEM applications

- Path analysis:
  - Reciprocal effects (non-recursive models)
  - Direct, indirect and total effects
- Mean structures (different means of latent variables)
- Regression with an estimation of correlations among variables (endogenous or exogenous, observed or latent)
- Models with repeated observations
- Models with longitudinal data (latent growth)
- Including categorical variables
- Multiple sample models, mixture models

You simply can’t do all this by combining regression and FA!

Software for SEM estimation

- LISREL 9.1 (Jöreskog et al.)
- EQS 6.1 (Bentler et al.)
- Mplus 7 (Muthén et al.)
- SAS => PROC CALIS (SAS Institute)
- Statistica => SEPATH (StatSoft)
- SPSS => Amos (IBM)
- R => sem, lavaan, …

(Packages that I used to be familiar with are in bold in the original slides; they are not necessarily the best ones…)

SEM applications in travel research

Golob (2003) reviewed more than 50 papers on a wealth of topics:
- Mode choice behaviors
- Determinants of car ownership and use
- Longitudinal and panel data analyses
- Activity-based models
- Travel attitudes-behaviors relationships
- Driving behaviors and safety issues

Obviously many more SEM papers have appeared since then, although I would have expected an even sharper increase

Example: primary utility (Diana, 2008)

Travel demand is derived only from the need to perform activities in different places…
- Activity-based models
- Utility-maximising models that minimise travel times

…but is it always true?
- «Teleportation test»: 3% of the sample indicates an ideal commute time < 2 min, 50% > 20 min (Mokhtarian and Salomon, 2001)
- Random utility models where travel-time coefficients >= 0: always garbage or…

Example: primary utility (Diana, 2008)

Goal: capturing and measuring the «primary utility» latent construct

Theoretical model => EFA => primary utility is due to different factors:
- Importance of on-trip activities
- Importance of activities at different locations
- Ideal trip length
- Travel-related cognitive and affective attitudes
- Performance and use of the travel means

Item analysis => 6 constructs are related to primary utility => second-order CFA

Model specification (Diana, 2008)

[Figure: model specification]

Primary utility measurement scale

- Drivers versus transit riders
- Commuting versus other trips

Modal diversion (Diana, 2010)

- Modal diversion versus mode choice
- Demand for unknown services: «cognitive asymmetry» <=> SP surveys
- Attitudes and rational evaluations have a different relative importance according to the alternative
- Behavioral modal diversion model: the endogenous variable measures the propensity to change on a Likert scale
- Data limitations => implementation of submodels and use of standard estimations

Modal diversion (Diana, 2010)

Standardized estimation => comparing different structural coefficients

SEM with subsamples

Is there a difference in the diversion to buses and to shared taxis?
=> Comparing unstandardized estimates of the single structural equations in the two subsamples

Model with MULTIM
        REL_COST   REL_TIME   REL_WAIT   REL_WALK   MULTIM
All     -0.20      -0.25      -0.15      -0.14       0.17
Buses   -0.11 *    -0.39      -0.29      -0.05 **    0.29 *
DRT     -0.07 *    -0.21      -0.14      -0.15       0.15 *

Model with COGNIT
        REL_COST   REL_TIME   REL_WAIT   REL_WALK   COGNIT
All     -0.19      -0.26      -0.13      -0.09      -0.20
Buses   -0.08 *    -0.38      -0.27       0.01      -0.08 **
DRT     -0.07 *    -0.21      -0.11 *    -0.10 *    -0.29

* = not significant at the 5% level
** = not significant at the 20% level

Thank you for your attention!

Questions, remarks, …

Marco Diana
marco.diana@polito.it

List of acronyms

- ADF-WLS = Asymptotically distribution-free weighted least squares
- CFA = Confirmatory factor analysis
- EFA = Exploratory factor analysis
- FA = Factor analysis
- ML = Maximum likelihood
- OLS = Ordinary least squares
- PCA = Principal components analysis
- SEM = Structural equation model
- VFR = Visiting friends and relatives

Mentioned references

- Diana, M. (2008) Making the “primary utility of travel” concept operational: a measurement model for the assessment of the intrinsic utility of reported trips, Transportation Research A, 42(3), 455-474.
- Diana, M. (2010) From mode choice to modal diversion: a new behavioural paradigm and an application to the study of the demand for innovative transport services, Technological Forecasting & Social Change, 77(3), 429-441.
- Golob, T.F. (2003) Structural equation modeling for travel behavior research, Transportation Research B, 37(1), 1-25.
- Hair, J.F., Anderson, R.E., Tatham, R.L., Black, W.C. (1998) Multivariate Data Analysis, 5th ed., Prentice Hall (more recent editions are now available).
- Mokhtarian, P.L., Salomon, I. (2001) How derived is the demand for travel? Some conceptual and measurement considerations, Transportation Research A, 35(8), 695-719.