Using structural equation modeling to discover
the hidden structure of CK data∗
Ene-Margit Tiit1, Mare Vähi2, and Kai Saks3
1 Institute of Mathematical Statistics, University of Tartu, Estonia, Mare.Vahi@ut.ee
2 Institute of Mathematical Statistics, University of Tartu, Estonia, Mare.Vahi@ut.ee
3 Department of Internal Medicine, University of Tartu, Estonia, Kai.Saks@ut.ee
Summary. Modeling the factors that influence the quality of life (QoL) of different
population groups has been a challenge for scientists in several areas.
Our task is to build a model of QoL for clients of social care. The results demonstrate
that the SEM methodology is helpful in discovering hidden dependencies and causalities in data sets characterized by low correlations, a high number of variables and
complicated structures of dependencies between variables and groups of variables.
Key words: structural equation models, factor, quality of life.
1 Structural equations as a statistical method
Structural Equation Modeling (SEM; initially the LISREL method of K. Jöreskog) is a
very general and powerful multivariate analysis technique that includes a number of
other analysis methods as special cases. One of the fundamental
ideas in multivariate analysis is the idea of statistical (linear or nonlinear) dependence. This idea generalizes, in various ways, to several variables interrelated by a
group of linear equations. The rules become more complex and the calculations more
difficult, but the basic message remains the same: you can test whether variables
are interrelated through a set of linear relationships by examining the variances and
covariances of the variables. It has been assumed that SEM models, which characterize
the directions of influences, can also be used to describe and test hypotheses about
causal dependencies.
There exist procedures for testing whether a set of variances and covariances in
a covariance matrix fits a specified structure. The way structural modeling works is
as follows:
1. You state the way that you believe the variables are interrelated, often with
the use of a path diagram.
2. You work out, via some complex internal rules, what the implications of this
are for the variances and covariances of the variables.
3. You test whether the variances and covariances fit this model of them.
4. The results of the statistical testing, as well as parameter estimates and standard
errors for the numerical coefficients in the linear equations, are reported.
5. On the basis of this information, you decide whether the model seems like a
good fit to your data.
http://www.statsoft.com/textbook/stsepath.html#index
∗ This study is a part of the CareKeys project (supported by the European Commission, contract No QLK6-CT-2002-02525).
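Steps 2 and 3 of the procedure above can be illustrated with a minimal sketch. It assumes a recursive path model written as y = B y + e, for which the implied covariance matrix is Σ = (I − B)^(-1) Ψ (I − B)^(-T), and it uses the standard maximum-likelihood discrepancy F = ln|Σ| + tr(S Σ^(-1)) − ln|S| − p. The three-variable chain, its coefficients and the function names are hypothetical and are not taken from the CareKeys analysis.

```python
import numpy as np

def implied_covariance(B, Psi):
    """Covariance implied by the path model y = B y + e with Cov(e) = Psi."""
    p = B.shape[0]
    inv = np.linalg.inv(np.eye(p) - B)
    return inv @ Psi @ inv.T

def ml_discrepancy(S, Sigma):
    """ML fit function; it equals 0 when Sigma reproduces S exactly."""
    p = S.shape[0]
    _, logdet_sigma = np.linalg.slogdet(Sigma)
    _, logdet_s = np.linalg.slogdet(S)
    return logdet_sigma + np.trace(S @ np.linalg.inv(Sigma)) - logdet_s - p

# Hypothetical chain x1 -> x2 -> x3 (coefficients chosen for illustration only).
B = np.array([[0.0, 0.0, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.4, 0.0]])
Psi = np.diag([1.0, 0.75, 0.84])  # variances of the error terms

Sigma = implied_covariance(B, Psi)

# A "sample" covariance that the chain cannot reproduce: cov(x1, x3) is too large.
S_misfit = Sigma.copy()
S_misfit[0, 2] = S_misfit[2, 0] = 0.6

fit_exact = ml_discrepancy(Sigma, Sigma)      # model matches the data
fit_misfit = ml_discrepancy(S_misfit, Sigma)  # model does not match the data
```

In the usual formulation the minimized discrepancy multiplied by N − 1 (N being the number of cases) gives the χ² statistic that is compared with the degrees of freedom of the model.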
2 Grouping CareKeys data for building the SEM scheme
To integrate all possible variables into one model we used the structural equation
modeling (SEM) procedure. Using this procedure we tested which variables of the whole
CK data set (Clint, Index and Mandex from IC data) have statistically significant
influences on other variables, and tried to find the structure of all interactions. In
this process we moved from large models to smaller ones (backward methodology),
excluding the defined latent variables from the model step by step.
We used the following scheme for describing the structure of influences.
In total, we defined 14 groups of variables, measured using Index, Clint and
Mandex variables, on the basis of the IC (normal) clients for whom the QoL variables
(PGMS and WHOQOL) were at least partially measured. Missing values in all variables
used were imputed using the EM methodology.
Each group defines one latent variable, representing the most important information of the group of variables in the sense of linear dependencies. To find the latent
variables, factors (found using either exploratory or confirmatory factor analysis)
are used as intermediate variables.
Hence, as a starting point we had 435 cases with about 265 variables (without blanks), see Table 1. From the QoL variables we took only the integrated components
(domains), but included 6 additional variables from Clint characterizing some
aspects of QoL. Some initially nominal variables were recoded as binary ones (e.g.
country, marital status, language and close persons were presented as binary variables).
In all groups a factor analysis was carried out, and the number of factors was fixed
so as to reach a description rate of at least 50%. In total, the number of factors
describing at least 50% of the whole information was 42.
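The rule of keeping factors until at least 50% of the information is described can be sketched as follows. The sketch assumes that the description rate is the share of total variance carried by the leading eigenvalues of a group's correlation matrix, which is a principal-component stand-in for the factor analysis used in the study. The data are simulated and the function name is hypothetical.

```python
import numpy as np

def n_factors_for_rate(X, min_rate=0.5):
    """Smallest number of leading components whose eigenvalues of the
    correlation matrix account for at least `min_rate` of the total variance."""
    R = np.corrcoef(X, rowvar=False)
    eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]      # descending order
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    k = int(np.searchsorted(cumulative, min_rate) + 1)  # first index reaching the rate
    k = min(k, len(eigvals))                            # guard against rounding at 100%
    return k, float(cumulative[k - 1])

# Simulated group: 435 cases, 12 variables driven by two common factors.
rng = np.random.default_rng(0)
factors = rng.normal(size=(435, 2))
loadings = rng.uniform(0.5, 0.9, size=(2, 12))          # hypothetical loadings
X = factors @ loadings + rng.normal(size=(435, 12))

k, rate = n_factors_for_rate(X)
```

Applying such a rule group by group and summing the resulting numbers of factors would give the total number of manifest variables of the model.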
Using step-wise regression and correlation analysis, the expected influences between groups were estimated, see Table 2.
It follows from Table 2 that the influences (measured via the determination
coefficient R2) are, in general, not high.
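How a determination coefficient between two groups can be obtained is sketched below. The sketch assumes an ordinary least-squares regression of one outcome factor on all factors of the source group; the step-wise selection of predictors is omitted. The data are simulated and the function name is hypothetical.

```python
import numpy as np

def r_squared(X, y):
    """Determination coefficient of the OLS regression of y on the columns of X."""
    X1 = np.column_stack([np.ones(len(y)), X])      # add an intercept column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    residuals = y - X1 @ beta
    ss_res = float(residuals @ residuals)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot

# Simulated example: three source factors, one weakly dependent outcome factor.
rng = np.random.default_rng(1)
source = rng.normal(size=(435, 3))
outcome = 0.3 * source[:, 0] + rng.normal(size=435)

r2_weak = r_squared(source, outcome)

# An outcome that is an exact linear function of the source factors.
r2_exact = r_squared(source, 2.0 * source[:, 0] - source[:, 1])
```

A weak dependence of this kind yields a small determination coefficient, of the same order as the values listed in Table 2.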
3 The first model
3.1 Building the model
Here the Background (BG) group forms the exogenous latent variable (not depending on other groups of variables); all the other groups form endogenous latent
variables, i.e. they depend on other groups.
Table 1. Groups of variables
No  Name of group                            Symbol  Number of variables  Number of factors
1   Background (Index)                       BG      12                   3
2   ADL/IADL need (Index)                    AN      20                   2
3   ADL/IADL supply (Index)                  AS      18                   2
4   Medical need (Index)                     MN      17                   3
5   Medical supply (Index)                   MS      16                   3
6   Social need (Index)                      SN      11                   2
7   Social supply (Index)                    SS      11                   2
8   Prof QoC (Index)                         PQ      26                   4
9   Client QoC (Clint)                       QC      37                   5
10  Living environment (Clint, Index)        L       18                   4
11  Informal network, close people (Clint)   P       19                   4
12  Life experiences (Clint)                 E       9                    2
13  QoL (Clint)                              QL      16                   3
14  Management (Mandex)                      M       35                   3
To build the initial model the following scheme was constructed, see Figure 1,
where each box represents a latent variable with the same symbol as the corresponding
group of initial variables in Table 1. The latent variables are formed by the factors
of the given group of initial variables. The number of factors used is also given in
Table 1. The arrows show the expected influences between latent variables. Altogether,
we have a model with 42 manifest variables (initial factors) and 14 latent variables,
of which one (BG) is exogenous and 13 are endogenous. The model between latent
variables given as a scheme in Figure 1 can be written as the following system
of linear equations, where all error terms are dropped for simplicity's sake.
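\begin{align*}
QL &= a_{1,1}\,AN + a_{2,1}\,AS + a_{3,1}\,MN + a_{4,1}\,MS + a_{5,1}\,SN + a_{6,1}\,SS \\
   &\quad + a_{7,1}\,PQ + a_{8,1}\,QC + a_{9,1}\,L + a_{10,1}\,P + a_{11,1}\,E + a_{12,1}\,BG \\
P  &= a_{11,2}\,E + a_{12,2}\,BG \\
E  &= a_{12,3}\,BG \\
AN &= a_{12,4}\,BG \\
MN &= a_{12,5}\,BG \\
SN &= a_{12,6}\,BG \\
M  &= a_{12,7}\,BG \\
AS &= a_{1,8}\,AN + a_{7,8}\,PQ + a_{8,8}\,QC + a_{10,8}\,P \\
MS &= a_{3,9}\,MN + a_{7,9}\,PQ + a_{8,9}\,QC + a_{10,9}\,P \\
SS &= a_{5,10}\,SN + a_{7,10}\,PQ + a_{8,10}\,QC + a_{10,10}\,P \\
PQ &= a_{13,11}\,M \\
L  &= a_{13,12}\,M \\
QC &= a_{7,13}\,PQ + a_{9,13}\,L
\end{align*}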
Solving the system means estimating the parameters aij that characterize the
strength of the influences (their number is 31) and their variances, and also the error
terms (42 + 13, corresponding to the manifest variables and the endogenous latent
variables). Hence the total number of parameters to be estimated is 117 (more than
3 times smaller than the number of observations, hence the task is solvable in the
statistical sense). After that it is necessary to check the significance of the
parameters and the goodness of fit of the model.
Table 2. Estimated influences between groups
No  Source of influence  Outcome variable  R2
1   BG                   E                 0,08
2   BG                   P                 0,118
3   BG                   AN                0,208
4   BG                   MN                0,169
5   BG                   SN                0,157
6   BG                   PQ                0,168
7   BG                   L                 0,19
8   BG                   QL                0,107
9   M                    PQ                0,15
10  M                    L                 0,19
11  E                    P                 0,08
12  E                    SN                0,05
13  E                    QL                0,07
14  PQ                   AS                0,11
15  PQ                   MS                0,11
16  PQ                   SS                0,13
17  PQ                   QC                0,06
18  PQ                   QL                0,09
19  L                    QC                0,12
20  L                    QL                0,13
21  AN                   AS                0,31
22  AN                   QL                0,07
23  MN                   MS                0,27
24  MN                   QL                0,07
25  SN                   SS                0,27
26  SN                   QL                0,09
27  P                    AS                0,06
28  P                    SS                0,11
29  P                    QL                0,09
30  AS                   QL                0,07
31  MS                   QL                0,05
32  SS                   QL                0,09
33  QC                   QL                0,14
Fig. 1. The scheme of the first model
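The parameter count given for the first model can be related to the degrees of freedom of its χ2 test. The sketch below applies the usual counting rule df = p(p + 1)/2 − q, where p is the number of manifest variables and q the number of estimated parameters. That the analysis used exactly this rule is an assumption; the function names are hypothetical, while the numbers 42, 117 and 435 are those stated in the text.

```python
def covariance_moments(n_manifest):
    """Number of distinct variances and covariances among the manifest variables."""
    return n_manifest * (n_manifest + 1) // 2

def degrees_of_freedom(n_manifest, n_parameters):
    """Distinct moments minus estimated parameters (standard counting rule)."""
    return covariance_moments(n_manifest) - n_parameters

n_manifest, n_parameters, n_cases = 42, 117, 435   # values stated in the text

moments = covariance_moments(n_manifest)                  # 42 * 43 / 2
df_first = degrees_of_freedom(n_manifest, n_parameters)   # moments - 117
cases_per_parameter = n_cases / n_parameters              # observations per parameter
```

Under this rule the first model has 786 degrees of freedom, which agrees with the number reported with its χ2 statistic in Section 3.2.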
3.2 Results of the first model
All parameters aij of the model, as well as the residuals and the variances of the
terms, were estimated. The distribution of the normalized residuals was quite close to a normal distribution with mean 0, hence the assumptions of the model were met, but some errors were
quite large. Of all estimated parameters describing the influences, 2/3 were statistically significant (level 0,05). It follows that almost all latent variables
had an impact on QoL. The latent variables without a significant direct influence on
QoL were BG, AN and SS. The influence of QC was rather weak.
The strong and significant influences found were the following:
E → QL; QC → QL; PQ → QL; MS → QL; P → QL; MN → QL; SN → QL;
AS → QL; L → QL.
Also, there is quite a strong influence between the E and P groups: E → P.
For the need variables we obtained mutually similar models with mostly significant
parameters:
AN → AS; PQ → AS; QC → AS;
MN → MS; PQ → MS; QC → MS;
SN → SS; PQ → SS; QC → SS.
It follows that all groups of S-variables depend, on the one hand, on the N-variables (a result of the fact that the S- and N-variables are filled in by the same
person), but also on PQ and QC, as expected.
All models depending on BG variables only turned out to be insignificant.
The following chains of significant influences are also quite interesting:
M → PQ → QC;
M → L → QC.
Still, the statistical quality of the first model was not high: χ2 had the
value 6648 (with 786 degrees of freedom), demonstrating a rather bad
fit of the model to the data. The second problem was that the matrix of estimates
was not positive definite. It follows that some of the estimates were
not uniquely defined and several might be quite unstable. To improve the model it
is necessary to decrease the number of variables and/or of dependencies between the
groups of variables.
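Whether a symmetric matrix is positive definite can be checked from its eigenvalues. The sketch below is a generic illustration: the two small matrices are hypothetical and the tolerance is an arbitrary choice. It shows that a matrix in which two variables are perfectly correlated fails the check, which is the kind of linear dependency that makes estimates non-unique.

```python
import numpy as np

def is_positive_definite(A, tol=1e-10):
    """True if A is symmetric and all its eigenvalues exceed `tol`."""
    A = np.asarray(A, dtype=float)
    if not np.allclose(A, A.T):
        return False
    return bool(np.linalg.eigvalsh(A).min() > tol)

# Hypothetical correlation matrix without linear dependencies.
regular = np.array([[1.0, 0.5, 0.2],
                    [0.5, 1.0, 0.4],
                    [0.2, 0.4, 1.0]])

# Hypothetical matrix in which the first two variables are perfectly correlated.
degenerate = np.array([[1.0, 1.0, 0.3],
                       [1.0, 1.0, 0.3],
                       [0.3, 0.3, 1.0]])

regular_ok = is_positive_definite(regular)
degenerate_ok = is_positive_definite(degenerate)
```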
4 The second model
4.1 Defining the second model
Using the statistics that show the weakest points of the first model, the following
steps were taken with the aim of improving the model:
1. Unrotated factors were used as manifest variables instead of rotated factors,
as there was no need to interpret the manifest variables, which serve only as
intermediate ones.
2. The number of manifest variables was decreased in the following groups (latent
variables), as the description rate remained close to 50% in these groups after
the change:
• in SN and SS: 1 factor instead of 2;
• in QC: 4 factors instead of 5;
• in P: 2 factors instead of 4.
3. As the latent variables describing Need and Supply were strongly pair-wise correlated (AN and AS, MN and MS, SN and SS, respectively), they might
cause linear functional dependencies between the estimates. Hence it would be
necessary to drop either the N- or the S-variables. It seems more reasonable to keep
the S-variables and drop the N-variables.
4. The latent variable BG had very low impact on all other latent variables. Most
of the connections between BG and other latent variables were dropped from
the model.
4.2 The scheme of the second model
Additionally, some parameters (QC→AS, QC→SS) were excluded using the Wald
criterion, which demonstrated a bad fit of a part of the model to the given block of
the initial covariance matrix.
The new model contains 11 latent variables, among them 3 exogenous (BG, E
and M) and 8 endogenous ones (AS, MS, SS, P, L, PQ, QC and QL), see Figure
2. The number of manifest variables (factors of initial variables) was 31. The number
of model parameters describing influences was 21, and the number of error terms to
be estimated was 39, of which 8 were variances of latent variables and 31 of manifest
variables. Hence the total number of parameters to be estimated was 80.
Fig. 2. The scheme of the second model
4.3 Interpretation of the second model
The choice of exogenous latent variables (suggested by the characteristics calculated for the first model) is quite understandable: they can be considered as
independent inputs:
• E describes the client's life experience;
• BG describes the client's background, which also depends on the country;
• M describes the management culture and quality in the given institution.
The latent variable for living conditions, L, seemingly depends on BG and M. PQ can be
considered a process variable influencing the real care, measured by the latent variables AS (supply of ADL and IADL), MS (supply of medication) and SS (supply
in the emotional, psychological and social area), but also the client's estimated
Quality of care (QC).
The life experiences (E) of a client influence his/her personal contacts and list of
important persons (P). The latter is also an input for all care (Supply) variables via
informal care.
The Quality of Life latent variable (QL) is the output, depending on most of the
listed latent variables.
4.4 Results of the second model
Among the estimated parameters of the equations, more than half were statistically significant. Using standardized parameters we got the following equations:
QL = −3,54 P − 0,76 AS + 0,22 MS + 2,83 SS − 2,63 PQ − 51,64 L + 32,57 QC + 0 BG + 0 E;
P = 0 E;
AS = −0,06 P + 1,41 PQ;
MS = 0,19 PQ − 0,002 QC;
SS = 0,73 P + 0,06 PQ;
PQ = 0 M;
QC = 0,07 PQ + 0,64 L;
L = 0 BG + 0 M.
NB: the sign (plus or minus) of a coefficient has no meaning, as the latent
variables are estimated via factors, and the signs of factors are arbitrary and can be changed.
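The remark on signs can be illustrated with a minimal sketch. It assumes a simple regression of an outcome on a single factor score; the scores and the function name are hypothetical. Reversing the sign of the factor reverses the sign of the coefficient, while the strength of the relation stays the same.

```python
def slope(x, y):
    """Least-squares slope of the simple regression of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

# Hypothetical factor scores and outcome values.
factor = [-1.2, -0.4, 0.1, 0.7, 1.5]
outcome = [-0.9, -0.5, 0.2, 0.4, 1.1]

# The same factor with reversed orientation is an equally valid solution.
flipped = [-f for f in factor]

b_original = slope(factor, outcome)
b_flipped = slope(flipped, outcome)   # same magnitude, opposite sign
```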
From the estimated equations the following conclusions can be drawn:
• BG (Background) does not significantly influence the other latent variables;
• E (Life experiences) influences neither personal relations (P) nor Quality of Life
(QL);
• Management (M) influences neither L (living conditions) nor PQ (professional
care).
The goodness-of-fit statistic χ2 decreased more than twofold, being now 3222 (which
is still too large for 416 degrees of freedom); the ratio χ2/df also improved
and is now 7,75 (against 8,46 in the case of the first model). Comparison with
the fit of the independence model (χ2 equals 4059) demonstrates the efficiency of the
last model. But it seems possible to improve it further by deleting the latent
variables that have no statistically significant influence on other latent variables.
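The comparison of the first two models can be recomputed directly from the reported χ2 values and degrees of freedom. The sketch below only performs this arithmetic; the dictionary layout and variable names are hypothetical.

```python
# Reported fit statistics: (chi-square, degrees of freedom).
fits = {
    "first": (6648, 786),
    "second": (3222, 416),
}

# Ratio chi-square / df for each model; smaller values indicate a better fit.
ratios = {name: chi2 / df for name, (chi2, df) in fits.items()}

# Factor by which chi-square decreased from the first to the second model.
improvement = fits["first"][0] / fits["second"][0]

# Chi-square of the independence model relative to the second model.
independence_chi2 = 4059
gain_over_independence = independence_chi2 - fits["second"][0]
```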
5 The third model
The third model has a much simpler structure than the previous ones. It contains 8
latent variables, of which three (P, PQ and L) are exogenous and can be considered
as input; QC, AS, MS and SS are process variables and QL is the output.
The number of expected influences is now 15 and the number of manifest variables is 23 (instead of 31 in the second model). As a result, the number of parameters
to be estimated decreases to 58, which should ensure a higher stability of the estimates.
Fig. 3. The scheme of the third model
The solution of the third model gave a statistically much better model, with χ2
equal to 1171 (df = 215), and the difference from the independence model has increased,
which confirms the existence of significant dependencies among the latent variables. Again
the largest and most significant parameters appear in the first equation, connecting
QL with all other latent variables, where PQ, QC, AS and P (professional care,
quality of care, supply in ADL and IADL, and personal relations) had an especially
strong influence. All other influences were weaker, but the connection
between MS (medical supply) and QC (client's quality of care) and the influence
of P (personal relations) on AS (supply in ADL and IADL) and SS (supply in
emotional and social support) were also notable.
6 Summary
The results demonstrate that the SEM methodology is also helpful in discovering hidden dependencies and causalities in data sets characterized by low correlations, a high
number of variables and complicated structures of dependencies between variables
and groups of variables. The results obtained after several modeling steps can be interpreted, and they support the hypotheses about multilevel dependencies and causal relations
between QL, QC and other variables measured in the CK data set.