Using structural equation modeling to discover the hidden structure of CK data∗

Ene-Margit Tiit1, Mare Vähi2, and Kai Saks3

1 Institute of Mathematical Statistics, University of Tartu, Estonia, Mare.Vahi@ut.ee
2 Institute of Mathematical Statistics, University of Tartu, Estonia, Mare.Vahi@ut.ee
3 Department of Internal Medicine, University of Tartu, Estonia, Kai.Saks@ut.ee

Summary. Modeling the factors that influence the quality of life (QoL) of different population groups has challenged scientists in several fields. Our task is to build a model of QoL for clients of social care. The results demonstrate that SEM methodology is helpful in discovering hidden dependencies and causalities in data sets characterized by low correlations, a high number of variables and complicated structures of dependencies between variables and groups of variables.

Key words: structural equation models, factor, quality of life.

1 Structural Equations as statistical method

Structural Equation Modeling (SEM; initially the LISREL method of K. Jöreskog) is a very general and powerful multivariate analysis technique that includes a number of other analysis methods as special cases. One of the fundamental ideas in multivariate analysis is that of statistical (linear or nonlinear) dependence. This idea generalizes, in various ways, to several variables interrelated by a group of linear equations. The rules become more complex and the calculations more difficult, but the basic message remains the same: you can test whether variables are interrelated through a set of linear relationships by examining the variances and covariances of the variables. It has been assumed that SEM models, which characterize the directions of influences, can also describe and check hypotheses about causal dependencies. There exist procedures for testing whether a set of variances and covariances in a covariance matrix fits a specified structure.
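To make the idea of a covariance matrix "fitting a specified structure" concrete, here is a minimal sketch that is not taken from the CareKeys analysis. It assumes a hypothetical three-variable chain x → y → z with made-up coefficients; the names `coef`, `err_var` and `implied_covariance` are illustrative only. The linear equations imply a particular covariance matrix, and data generated by those equations reproduce it up to sampling error.

```python
import random

# Hypothetical chain x -> y -> z; all coefficients are made up for illustration.
names = ["x", "y", "z"]                      # predictors listed before outcomes
coef = {"y": {"x": 0.5}, "z": {"y": 0.8}}    # y = 0.5 x + e_y, z = 0.8 y + e_z
err_var = {"x": 1.0, "y": 1.0, "z": 1.0}     # variances of x, e_y, e_z

def implied_covariance(names, coef, err_var):
    """Covariance matrix implied by a recursive system of linear equations."""
    weights = {}  # weights[v][u]: weight of the independent term u in variable v
    for v in names:
        w = {v: 1.0}
        for parent, a in coef.get(v, {}).items():
            for u, wu in weights[parent].items():
                w[u] = w.get(u, 0.0) + a * wu
        weights[v] = w
    return [[sum(weights[r].get(u, 0.0) * weights[c].get(u, 0.0) * err_var[u]
                 for u in names)
             for c in names] for r in names]

def simulate(n, seed=0):
    """Draw n observations from the same (hypothetical) equations."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0, 1)
        y = 0.5 * x + rng.gauss(0, 1)
        z = 0.8 * y + rng.gauss(0, 1)
        rows.append((x, y, z))
    return rows

def sample_covariance(rows):
    """Unbiased sample covariance matrix of the simulated rows."""
    n, k = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(k)]
    return [[sum((r[a] - means[a]) * (r[b] - means[b]) for r in rows) / (n - 1)
             for b in range(k)] for a in range(k)]

sigma = implied_covariance(names, coef, err_var)
s = sample_covariance(simulate(50_000))
max_gap = max(abs(s[a][b] - sigma[a][b]) for a in range(3) for b in range(3))
print(sigma)     # model-implied covariance matrix
print(max_gap)   # largest deviation of the sample covariance from it
```

In this toy model the implied covariance of x and z equals cov(x, y) · cov(y, z) / var(y). That is a restriction on the covariance matrix which data may or may not satisfy, and restrictions of this kind are what the fit tests examine.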
The way structural modeling works is as follows:

1. You state the way that you believe the variables are interrelated, often with the use of a path diagram.
2. You work out, via some complex internal rules, what the implications of this are for the variances and covariances of the variables.
3. You test whether the variances and covariances fit this model of them.
4. Results of the statistical testing, as well as parameter estimates and standard errors for the numerical coefficients in the linear equations, are reported.
5. On the basis of this information, you decide whether the model seems like a good fit to your data.

http://www.statsoft.com/textbook/stsepath.html#index

∗ This study is a part of the CareKeys project (supported by the European Commission, contract No QLK6-CT-2002-02525).

2 Grouping CareKeys data for building SEM scheme

To integrate all possible variables into one model we used the structural equation modeling (SEM) procedure. With this procedure we tested which variables from the whole CK data set (Clint, Index and Mandex from the IC data) have statistically significant influences on other variables, and tried to find the structure of all interactions. In this process we moved from big models to smaller ones (backward methodology), excluding the defined latent variables from the model step by step.

We used the following scheme for describing the structure of influences. In total, we defined 14 groups of variables, measured using the Index, Clint and Mandex variables on the basis of IC (normal) clients for whom the QoL variables (PGMS and WHOQOL) were (at least partially) measured. All variables used were completed using EM methodology for imputation. Each group defines one latent variable, representing the most important information of the group of variables in the sense of linear dependencies.
To find the latent variables, factors (obtained by either exploratory or confirmatory factor analysis) are used as intermediate variables. Hence, as a starting point we had 435 cases with about 265 variables (without blanks), see Table 1. From the QoL variables we took only the integrated components (domains), but included 6 additional variables from Clint characterizing some aspects of QoL. Some initially nominal variables were recoded as binary ones (e.g. country, marital status, language and close persons were presented as binary variables). Factor analysis was carried out in every group, and the number of factors was fixed so as to reach a description rate of at least 50%. In total, 42 factors were needed to describe at least 50% of the whole information. Using stepwise regression and correlation analysis, the expected influences between groups were estimated, see Table 2. It follows from Table 2 that the influences (measured via the determination coefficient R²) are in general not high.

3 The first model

3.1 Building the model

Here the Background (BG) group forms the exogenous latent variable (not depending on other groups of variables); all the other groups form endogenous latent variables, i.e. they depend on other groups.

Table 1. Groups of variables

No  Name of group                           Symbol  Number of variables  Number of factors
1   Background (Index)                      BG      12                   3
2   ADL/IADL need (Index)                   AN      20                   2
3   ADL/IADL supply (Index)                 AS      18                   2
4   Medical need (Index)                    MN      17                   3
5   Medical supply (Index)                  MS      16                   3
6   Social need (Index)                     SN      11                   2
7   Social supply (Index)                   SS      11                   2
8   Prof QoC (Index)                        PQ      26                   4
9   Client QoC (Clint)                      QC      37                   5
10  Living environment (Clint, Index)       L       18                   4
11  Informal network, close people (Clint)  P       19                   4
12  Life experiences (Clint)                E       9                    2
13  QoL (Clint)                             QL      16                   3
14  Management (Mandex)                     M       35                   3

To build the initial model the following scheme was constructed, see Figure 1, where each box represents a latent variable with the same symbol as the corresponding group of initial variables in Table 1. The latent variables are formed by the factors of the given group of initial variables; the number of factors used is also given in Table 1. The arrows show the expected influences between latent variables. In general, we have a model with 42 manifest variables (initial factors) and 14 latent variables, of which one (BG) is exogenous and 13 are endogenous.

The model between latent variables, given as a scheme in Figure 1, can be written as the following system of linear equations, where all error terms are dropped for simplicity's sake.

\begin{align*}
\mathrm{QL} &= a_{1,1}\,\mathrm{AN} + a_{2,1}\,\mathrm{AS} + a_{3,1}\,\mathrm{MN} + a_{4,1}\,\mathrm{MS} + a_{5,1}\,\mathrm{SN} + a_{6,1}\,\mathrm{SS} \\
            &\quad + a_{7,1}\,\mathrm{PQ} + a_{8,1}\,\mathrm{QC} + a_{9,1}\,\mathrm{L} + a_{10,1}\,\mathrm{P} + a_{11,1}\,\mathrm{E} + a_{12,1}\,\mathrm{BG} \\
\mathrm{P}  &= a_{11,2}\,\mathrm{E} + a_{12,2}\,\mathrm{BG} \\
\mathrm{E}  &= a_{12,3}\,\mathrm{BG} \\
\mathrm{AN} &= a_{12,4}\,\mathrm{BG} \\
\mathrm{MN} &= a_{12,5}\,\mathrm{BG} \\
\mathrm{SN} &= a_{12,6}\,\mathrm{BG} \\
\mathrm{M}  &= a_{12,7}\,\mathrm{BG} \\
\mathrm{AS} &= a_{1,8}\,\mathrm{AN} + a_{7,8}\,\mathrm{PQ} + a_{8,8}\,\mathrm{QC} + a_{10,8}\,\mathrm{P} \\
\mathrm{MS} &= a_{3,9}\,\mathrm{MN} + a_{7,9}\,\mathrm{PQ} + a_{8,9}\,\mathrm{QC} + a_{10,9}\,\mathrm{P} \\
\mathrm{SS} &= a_{2,10}\,\mathrm{AS} + a_{7,10}\,\mathrm{PQ} + a_{8,10}\,\mathrm{QC} + a_{10,10}\,\mathrm{P} \\
\mathrm{PQ} &= a_{13,11}\,\mathrm{M} \\
\mathrm{L}  &= a_{13,12}\,\mathrm{M} \\
\mathrm{QC} &= a_{7,13}\,\mathrm{PQ} + a_{9,13}\,\mathrm{L}
\end{align*}

Solving the system means estimating the parameters a_{ij} that characterize the strength of the influences (their number is 31) and their variances, as well as the error terms (42 + 13, corresponding to the manifest variables and the endogenous latent variables).
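The structure of the system above can also be inspected mechanically. The following is a minimal sketch, not part of the original analysis: it transcribes the right-hand sides of the equations as printed into a hypothetical dictionary `equations` (outcome mapped to its predictors), identifies the exogenous and endogenous latent variables, and checks that the influences contain no loops, i.e. that the model is recursive. It relies only on `graphlib` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Outcome -> predictors, transcribed from the system of equations as printed.
equations = {
    "QL": ["AN", "AS", "MN", "MS", "SN", "SS", "PQ", "QC", "L", "P", "E", "BG"],
    "P": ["E", "BG"],
    "E": ["BG"],
    "AN": ["BG"],
    "MN": ["BG"],
    "SN": ["BG"],
    "M": ["BG"],
    "AS": ["AN", "PQ", "QC", "P"],
    "MS": ["MN", "PQ", "QC", "P"],
    "SS": ["AS", "PQ", "QC", "P"],
    "PQ": ["M"],
    "L": ["M"],
    "QC": ["PQ", "L"],
}

# Every symbol that appears on either side of an equation.
variables = set(equations) | {p for preds in equations.values() for p in preds}

# Endogenous variables have their own equation; exogenous ones do not.
endogenous = sorted(equations)
exogenous = sorted(variables - set(equations))

# static_order() lists predictors before outcomes and raises
# graphlib.CycleError if the influences form a loop.
order = list(TopologicalSorter(equations).static_order())

print("exogenous:", exogenous)
print("number of endogenous variables:", len(endogenous))
print("one admissible evaluation order:", order)
```

Under this transcription BG is the only variable without an equation of its own and the remaining 13 are endogenous, in line with the description of the model; because no loop is found, each latent variable can be expressed through variables that precede it in the printed order.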
The total number of parameters to be estimated is thus 117 (more than three times smaller than the number of observations, hence the task is solvable in the statistical sense). After that it is necessary to check the significance of the parameters and the goodness of fit of the model.

Table 2. Estimated influences between groups

No  Source of influence  Outcome variable  R²
1   BG                   E                 0,08
2   BG                   P                 0,118
3   BG                   AN                0,208
4   BG                   MN                0,169
5   BG                   SN                0,157
6   BG                   PQ                0,168
7   BG                   L                 0,19
8   BG                   QL                0,107
9   M                    PQ                0,15
10  M                    L                 0,19
11  E                    P                 0,08
12  E                    SN                0,05
13  E                    QL                0,07
14  PQ                   AS                0,11
15  PQ                   MS                0,11
16  PQ                   SS                0,13
17  PQ                   QC                0,06
18  PQ                   QL                0,09
19  L                    QC                0,12
20  L                    QL                0,13
21  AN                   AS                0,31
22  AN                   QL                0,07
23  MN                   MS                0,27
24  MN                   QL                0,07
25  SN                   SS                0,27
26  SN                   QL                0,09
27  P                    AS                0,06
28  P                    SS                0,11
29  P                    QL                0,09
30  AS                   QL                0,07
31  MS                   QL                0,05
32  SS                   QL                0,09
33  QC                   QL                0,14

Fig. 1. The scheme of the first model

3.2 Results of the first model

All parameters a_{ij} of the model, as well as the residuals and the variances of the terms, were estimated. The distribution of the normalized residuals was quite close to a normal distribution with mean 0, hence the assumptions of the model were met, but some errors were quite big. Of all estimated parameters describing the influences, 2/3 were statistically significant (level 0,05). It follows that almost all latent variables had an impact on QoL. The latent variables without a significant direct influence on QoL were BG, AN and SS. The influence of QC was rather weak. The strong and significant influences found were the following: E → QL; QC → QL; PQ → QL; MS → QL; P → QL; MN → QL; SN → QL; AS → QL; L → QL.
Also, there is a quite strong influence between the E and P groups: E → P. For the need variables we obtained models similar to each other, with mostly significant parameters: AN → AS; PQ → AS; QC → AS; MN → MS; PQ → MS; QC → MS; SN → SS; PQ → SS; QC → SS. It follows that all groups of S-variables depend on the one hand on the N-variables (a result of the fact that the S- and N-variables are filled in by the same person), but also on PQ and QC, as expected. All models depending on BG variables only turned out to be insignificant. Quite interesting chains of significant influences are also the following: M → PQ → QC; M → L → QC.

Still, the statistical quality of the first model was not high: χ² had the value 6648 (with 786 degrees of freedom), demonstrating a rather bad fit of the model to the data. The second problem was that the matrix of estimates was not positive definite. It follows that some of the estimates were not uniquely defined and several might be quite unstable. To improve the model it is necessary to decrease the number of variables and/or dependencies between the groups of variables.

4 The second model

4.1 Defining the second model

Using the statistics that show the weakest points of the first model, the following steps were taken to improve the model:

1. Unrotated factors were used as manifest variables instead of rotated ones, as there was no need to interpret the manifest variables as intermediate ones.
2. The number of manifest variables was decreased in the following groups (latent variables), as the description rate remained close to 50% in these groups after the change:
   • in SN and SS: 1 factor instead of 2;
   • in QC: 4 factors instead of 5;
   • in P: 2 factors instead of 4.
3. As the latent variables describing Need and Supply were strongly pairwise correlated (AN and AS, MN and MS, SN and SS, respectively), they might cause linear functional dependencies of the estimates.
Hence it would be necessary to drop either the N- or the S-variables. It seems more reasonable to keep the S-variables and drop the N-variables.
4. The latent variable BG had very low impact on all other latent variables. Most of the connections between BG and other latent variables were dropped from the model.

4.2 The scheme of the second model

Additionally, some parameters (QC → AS, QC → SS) were excluded using the Wald criterion, which demonstrated a bad fit of a part of the model to the given block of the initial covariance matrix. The new model contains 11 latent variables, among them 3 exogenous (BG, E and M) and 8 endogenous ones (AS, MS, SS, P, L, PQ, QC and QL), see Figure 2. The number of manifest variables (factors of the initial variables) was 31. The number of model parameters describing influences was 21, and the number of error terms to be estimated was 39, of which 8 were variances of latent variables and 31 of manifest variables. Hence the total number of parameters to be estimated was 80.

Fig. 2. The scheme of the second model

4.3 Interpretation of the second model

The choice of exogenous latent variables (recommended by the characteristics calculated for the first model) is quite understandable: they can be considered as independent inputs:

• E describes the client's life experience;
• BG describes the client's background, which also depends on the country;
• M describes the management culture and quality in the given institution.

The living conditions latent variable L seemingly depends on BG and M. PQ can be considered a process variable influencing the real care, measured by the latent variables AS (supply of ADL and IADL), MS (supply of medication) and SS (supply in the emotional, psychological and social area), but also the clients' estimated quality of care (QC). The life experiences (E) of a client influence his/her personal contacts and list of important persons (P). The latter is also an input for all care (Supply) variables via informal care.
The Quality of Life latent variable (QL) is the output, depending on most of the listed latent variables.

4.4 Results of the second model

More than half of the estimated parameters of the equations were statistically significant. Using standardized parameters we got the following equations:

\begin{align*}
\mathrm{QL} &= -3{,}54\,\mathrm{P} - 0{,}76\,\mathrm{AS} + 0{,}22\,\mathrm{MS} + 2{,}83\,\mathrm{SS} - 2{,}63\,\mathrm{PQ} - 51{,}64\,\mathrm{L} + 32{,}57\,\mathrm{QC} + 0\,\mathrm{BG} + 0\,\mathrm{E} \\
\mathrm{P}  &= 0\,\mathrm{E} \\
\mathrm{AS} &= -0{,}06\,\mathrm{P} + 1{,}41\,\mathrm{PQ} \\
\mathrm{MS} &= 0{,}19\,\mathrm{PQ} - 0{,}002\,\mathrm{QC} \\
\mathrm{SS} &= 0{,}73\,\mathrm{P} + 0{,}06\,\mathrm{PQ} \\
\mathrm{PQ} &= 0\,\mathrm{M} \\
\mathrm{QC} &= 0{,}07\,\mathrm{PQ} + 0{,}64\,\mathrm{L} \\
\mathrm{L}  &= 0\,\mathrm{BG} + 0\,\mathrm{M}
\end{align*}

NB: the sign (plus or minus) of a coefficient has no meaning, as the latent variables are estimated from factors, and the signs of factors are arbitrary and can be changed. From the estimated equations the following conclusions can be drawn:

• BG (Background) does not influence the other latent variables significantly;
• E (Life experiences) influences neither personal relations (P) nor Quality of Life (QL);
• Management (M) influences neither L (living conditions) nor PQ (professional care).

The goodness-of-fit statistic χ² improved more than twofold, being now 3222 (which is still too big for 416 degrees of freedom); the ratio χ²/df also improved and is now 7,75 (against 8,46 in the case of the first model). Comparison with the fit of the independence model (χ² equals 4059) demonstrates the efficiency of the last model. It seems, however, that the model can be improved further by deleting the latent variables that have no statistically significant influence on other latent variables.

5 The third model

The third model has a much simpler structure than the previous ones. It contains 8 latent variables, of which three (P, PQ and L) are exogenous and can be considered as inputs; QC, AS, MS and SS are process variables, and QL is the output. The number of expected influences is now 15 and the number of manifest variables 23 (instead of 31 in the second model).
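The fit figures reported for the first two models can be cross-checked with elementary arithmetic. The sketch below assumes the usual count for covariance-structure models, in which p manifest variables supply p(p + 1)/2 distinct variances and covariances and the degrees of freedom equal that number minus the number of free parameters. The function names are illustrative and do not come from any SEM package; the input numbers are those quoted in the text.

```python
def degrees_of_freedom(n_manifest, n_parameters):
    """Distinct variances and covariances minus the number of free parameters."""
    return n_manifest * (n_manifest + 1) // 2 - n_parameters

def chi2_per_df(chi2, df):
    """Relative chi-square, i.e. the ratio chi2/df."""
    return chi2 / df

# (manifest variables, parameters to be estimated, reported chi-square)
models = {
    "first": (42, 117, 6648),
    "second": (31, 80, 3222),
}

results = {}
for name, (p, t, chi2) in models.items():
    df = degrees_of_freedom(p, t)
    results[name] = (df, chi2_per_df(chi2, df))
    print(f"{name} model: df = {df}, chi2/df = {results[name][1]:.2f}")
```

The computed degrees of freedom, 786 and 416, agree with the values reported for the first and second model, which suggests that the parameter counts and the numbers of manifest variables given in the text are mutually consistent.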
With fewer influences and manifest variables, the number of parameters to be estimated decreases to 58, which warrants higher stability of the estimates.

Fig. 3. The scheme of the third model

Solving the third model gave a statistically much better model, in which χ² equaled 1171 (df = 215), and the difference from the independence model has increased, which proves the existence of significant dependencies among the latent variables. Again the largest and most significant parameters appear in the first equation, connecting QL with all other latent variables, where PQ, QC, AS and P (professional care, quality of care, supply in ADL and IADL, and personal relations) had an especially strong influence. All other influences were weaker, but the connection between MS (medical supply) and QC (clients' quality of care) and the influence of P (personal relations) on AS (supply in ADL and IADL) and SS (supply in emotional and social support) were also notable.

6 Summary

The results demonstrate that SEM methodology is also helpful in discovering hidden dependencies and causalities in data sets characterized by low correlations, a high number of variables and complicated structures of dependencies between variables and groups of variables. The results gained after several modeling steps can be interpreted, and they prove the hypotheses about multilevel dependencies and causal relations between QL, QC and other variables measured in the CK data set.