MarkHaggard_CSI2011 - University of Cambridge

Mark Haggard (1,3), Jan Zirk-Sadowski (1,2) & Helen Spencer (3)
1: Department of Experimental Psychology, University of Cambridge
2: Centre for Neuroscience in Education, University of Cambridge
3: Multi-centre Otitis Media Study Group
Background to statistical methods
Structural equation modelling (SEM) is a multi-purpose
modelling technique based on generalised estimation
equations. Despite great potential within biology, psychology
and social science, the range of uses seen has been quite
limited, often just for summarising a set of inter-correlations
with an imposed causal interpretation declared to be consistent
with the data. As well as this ability to summarise, the graphical
notation (used to input model structures) is a conceptual aid to
thinking about variance components and their separation, as
offered by SEM. However the small peer community
appreciating the pluri-potential scientific value of SEM has
meant slow development of applications and application ‘lore’
in the 30+ years since software became available. We illustrate
how such knowledge can generate solutions in an area of
causal cascades previously difficult to systematise.
Fig 3: Preferred (reference) model (=2)
(version on baseline individual differences)
[Path diagram not reproduced. Labels in the diagram: Antecedents: length of history, SES, gender; Age; Hearing level #; Zsum OR reported hearing difficulties; Zdiff OR ear infection severity/number; Respiratory infections; Sleep disturbance; Balance, anxiety, inappropriate behaviour, & speech/language variables; Development; Child QoL; Parent QoL.]
Shading and ‘OR’ separate 2 alternative formulations that were tested: raw
variables or their summed & differenced Z-scores. Boxes & error circles for
markers again omitted, as are separate boxes and paths for antecedents.
Background to problem area
The central question is still hotly disputed: do minor and
common fluctuating ear and hearing problems in childhood
have consequences for development, health and quality of life;
and if so, by what mechanism? The clinical literature contains
hypotheses & explanations, explicit or implicit & mostly
imprecise, for the correlations among these measures, but has
few well-controlled empirical studies.
♦ A ‘good’ model as a scientific aim – what is it? ♦
Good models show a mix of 6 virtues: (a) predictive accuracy: Rsq,
goodness-of-fit, etc; (b) explicability of findings, including relation to
other findings; (c) scope, the coverage of plausible effects in the
topic area; (d) generalisability, ie features that do not just optimise fit
to the derivation data but suit other sets; (e) simplicity, usually seen
as parsimony, ie few but powerfully predictive variables; & (f) data
economy, insofar as this can be reconciled with reliability, meaning
few & practical measurements (or questionnaire items) marking the
predictive concepts. These virtues trade: improving (a, c) may
degrade (e, f). Further improvement is possible by disaggregating
facets of OM &/or aspects of developmental impact, but that needs
several models, ie it buys virtues (a & b), at the expense of (c & e).
Contention
Causal inference is impossible from a single correlation. Even
in a medium-complex multivariable data structure, it is not
granted by drawing a regression arrow in a particular direction.
Rather, it has to be earned by accumulation of three
circumstances: (i) having sufficient complexity at two or more
logically “causally related stages” that the direction of arrow(s)
can make some difference to the model error, (ii) various
constraints on stages, for example some arrows whose
reversal is inconceivable, and (iii) hypothesis testing
as with other techniques, here by the clear contrasting of sets
of alternative models. We have developed on this basis a
principled stepped modelling strategy that limits the
exploratory phase and hence limits capitalisation upon chance.
Fig 2: ‘Worthy opponent’ model (=1)
[Path diagram not reproduced. Labels in the diagram: Predisposition: weighted total of SES, sex, length of history; Age; Mucosal disease: weighted total of ear infections, respiratory infections; Hearing level as measured in audiometry; Reported hearing difficulties; Weighted total of balance, anxiety, inappropriate behaviour, & speech/language variables (and Parent & child QoL); Development.]
For simplicity boxes & error circles for observed marker variables are not shown.
Acknowledgements
We thank Medical Research Council UK and Deafness Research UK for financial support and
MRC Multi-centre Otitis Media Study Group (ORLs, audiologists and nurses/research
assistants, listed fully in Clinical Otolaryngology) for acquisition of clinical data. Statisticians
Kath Bennett and Elaine Nicholls worked previously on the project, particularly on the
derivations of optimum facet scores.
Data source
The data are test scores and scaled questionnaire scores from
376 children at 11 Ear, Nose and Throat clinics, entering a
randomised clinical trial (TARGET). Baseline data & outcomes
used here are each on 2 occasions (-3 & 0 mo; +3 & +6 mo).
TARGET addresses surgical management in otitis media with
effusion (OME – also known as glue ear). Standard measures
were available only for measured hearing (HL), so we invested
in new questionnaire measures of all the known major facets
and retained and scaled the response levels of the best items
according to the usual psychometric criteria. We also mapped
the facet scores into a standard measure of Quality of Life,
reflecting impact that arises throughout the present cascades. The
‘correlational’ models use the average of the 2 baseline
occasions, for the measures listed in the path graphs, whilst
the ‘experimental’ models use the difference between these and the
average of the two first post-randomisation occasions.
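The two kinds of model input described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name, the column order (-3, 0, +3, +6 mo) and the sign convention (outcome minus baseline) are ours, and the demo numbers are invented, not trial data.

```python
import numpy as np

def baseline_and_shift(scores):
    """scores: array (n_children, 4); columns assumed ordered as the
    occasions -3, 0, +3 and +6 months (hypothetical layout)."""
    scores = np.asarray(scores, dtype=float)
    baseline = scores[:, :2].mean(axis=1)   # input to 'correlational' models
    outcome = scores[:, 2:].mean(axis=1)    # mean of first two post-randomisation occasions
    shift = outcome - baseline              # input to 'experimental' models (assumed sign)
    return baseline, shift

# Invented scores for two children, for illustration only
demo = np.array([[30.0, 34.0, 22.0, 20.0],
                 [25.0, 27.0, 26.0, 24.0]])
baseline, shift = baseline_and_shift(demo)
```

Averaging two occasions before differencing reduces occasion-specific measurement error in both terms of the shift score.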
Strategy & methods of analysis
We used standard AMOS (SPSS Inc), inspecting a small subset
of available performance parameters for models: Chi Sq* and
Akaike Information Criterion (AIC), supplemented by
permutation@ and bootstrapping#. Inclusion/ exclusion of a link
of only marginal significance is by definition not a big issue for
overall goodness of fit (GoF), but it can affect parsimony, as
reflected in AIC, and the permutation P. We undertook minor
exploratory optimisation of an a priori theoretical model, then
froze this as reference for a designed grid of contrasts, and
compared model fit & parameter values estimated for specific
links where necessary, within this grid (Excerpt in Table).
A. Experimentally manipulated driving of the covariance, with
randomised treatment entered as a pair of
wholly independent binary variables. The optimum treatment
analysis is analysis of covariance, which avoids the assumption, involved
in taking simple difference scores, of equal variance of pre- and
post-treatment measures. However, for simplicity we here modelled
baseline-to-outcome shifts in mediating variables and in the ultimate
outcome variables, replacing “antecedents” by 2 treatment terms,
+/- ventilation tubes and +/- adenoidectomy, each of which acts on
specific disease measures. Given near-homogeneity of baseline
and outcome variance, this simplification permits examination for
similar co-variance structure when this is additionally driven by a
manipulated variable, not just observed cross-sectional correlation.
B. The comparison of preferred model against control models
with graded diffuseness of correlation structure, & commitment to
some ‘worthy opponent’, hence a non-trivial challenge.
The structure in Fig 2 is a single cascade of serial regressions
between latent variables summarising all markers at their
(predefined) stage. In English, it states that the aggregate of
antecedent risk factors determines the severity of disease, by
various markers; and this determines the aggregate severity of
intermediate markers of development, which in turn determines
degree of impact on quality of life. Put thus, Model 1 is not
implausible, but somewhat uninteresting and unimpressive. The
more interesting postulate of two major cascades of influence is
shown in Model 2, Fig 3; this captures clinical intuitions and some
previous findings. The contest between preferred model and worthy
opponent is more informative than the mere demonstration that
some preferred model is an absolutely adequate model.
Results
The table embraces two versions (baseline absolute correlations & correlations among treatment-driven difference
scores). The better fit for model 2 (Fig 3) reflects the fact that
the latent variables in 1 have diverse markers; the imposed
factorial purity in 1 is false. This part of the difference lies in
what is called “the measurement model”. The other part lies
in the substantive (structural) model. Insofar as Model 2 is
correct in implying separate and specific between-stage
correlations, (& allowing some by-passing of stages as
another manifestation of serial cascade, not shown in Fig 3
for graphic simplicity), imposition of a single cascade in 1
must lead to poor fit, and it does, even on the Akaike
Information Criterion, with its adjustment for greater
parsimony from fitting fewer path parameters. The
experimental models with front end driven by randomised
treatment are generally not as strong as the pure correlation
models (due to added error from differencing). However, they
show largely similar structures (convincing overall) and
replicate the main structural contrasts (Model 2 better than 1;
rotated Z summing & differencing of two key variables better
than raw). The ‘worthy opponent’ serial-only model 1 performs
particularly badly on the experimental (treatment-driven) data.
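The ‘rotated’ Z summing and differencing mentioned above can be illustrated as follows. This is a minimal sketch on simulated data; the variable names x and y are hypothetical stand-ins for the two key variables, and the function is ours, not part of the AMOS analysis.

```python
import numpy as np

def z_sum_diff(x, y):
    """Standardise two variables, then return their sum and difference."""
    zx = (x - x.mean()) / x.std(ddof=1)
    zy = (y - y.mean()) / y.std(ddof=1)
    return zx + zy, zx - zy

# Simulated, correlated pair (illustration only, not trial data)
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 0.6 * x + 0.8 * rng.normal(size=500)

zsum, zdiff = z_sum_diff(x, y)
r_raw = np.corrcoef(x, y)[0, 1]          # the raw variables are correlated
r_rot = np.corrcoef(zsum, zdiff)[0, 1]   # the rotated pair is uncorrelated
```

Because both Z-scores have unit variance, the covariance of their sum and difference is zero by construction, so the rotated pair separates what the two variables share from what distinguishes them.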
Model | ‘Correlational’: Akaike Info Criterion | ‘Correlational’: Permutation P (for better model) @ | ‘Experimental’: Akaike Info Criterion | ‘Experimental’: Permutation P (for better model) @
1 (serial only) | 230.7 /180 | 2.6 e-4 | 324.8 /180 | 0.134
2 (serial & parallel) | 258.7 /238 | 3.53 e-6 | 287.1 /180 | 4.02 e-5
For AIC the figure after the slash is the saturated value, and the approach of the
given value (downwards) to it represents parsimony-adjusted goodness of fit.
Whilst the p-value associated with chi-sq would represent lack of fit (small = bad),
the p-value for permutation gives the exclusivity of match of structure to variable
values and has conventional small p = good. Standard notation is used for the 3
cells with very small p (ie near to best possible arrangement for data obtained).
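The relation between a tabulated AIC and its saturated value can be sketched as below. This assumes the common definition AIC = chi-square + 2q, with q the number of free parameters; for the saturated model chi-square is zero and q equals the number of distinct sample moments. The function names and the option of counting means are our assumptions, not taken from the analysis.

```python
def aic(chi_sq, n_free_params):
    """AIC under the assumed definition: chi-square plus twice the
    number of freely estimated parameters."""
    return chi_sq + 2 * n_free_params

def saturated_aic(n_observed, with_means=True):
    """AIC of the saturated model: chi-square is 0 and every distinct
    sample moment is a free parameter."""
    moments = n_observed * (n_observed + 1) // 2   # variances and covariances
    if with_means:
        moments += n_observed                      # plus one mean per variable
    return 2 * moments

sat_12 = saturated_aic(12)   # 12 observed variables with means
sat_14 = saturated_aic(14)   # 14 observed variables with means
```

Under these assumptions, 12 and 14 observed variables with estimated means give 180 and 238, which matches the saturated values in the table; this is an arithmetic observation, not a statement of how the models were actually specified.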
Conclusions
A single adequate SEM is a starting point, not an end. As a
treatment analysis with multiple outcomes, the SEMs provide
a sophisticated alternative to MANOVA or principal-components
reduction of multiple dependent variables to one
summary measure. SEM permits a distinction between
mediator and outcome measures within causal dependency.
Via the worthy opponent, a certainty value can be attached to the
postulate that at least two main cascades (Model 2 better)
are required. This requirement for multiple cascades
complements the fuller treatment analyses of the trial in
underpinning a combined treatment policy. Those show that the 2
treatment elements (separated in the shift versions here)
each have a basis of candidature aligned with 1 cascade.
Footnotes to text and table
* Not tabulated, as uninformative. In a large sample, even very good
models differ from data at p < 0.005. The chi-square is however the basis
of most other tests, including the two tabulated.
@ This test examines whether for a structure of the given form other
permutations of the particular slots that the variables occupy may be
more adequate. The p-value is the probability of an equal or better
chi-square value on doing this. Exhaustive for small numbers of variables, it
has for a large number to be sampled (by drawing with replacement) as
the number of possible models becomes astronomical. To avoid tying up
computers for weeks (or the lifetime of the universe) the need for more
extensive iterations of small differences between small numbers of
important models can be approached in stages: 10^5, 10^6, 10^7
permutations etc. Typically we went up to 3.5 × 10^6 permutations.
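The logic of the permutation test can be sketched as follows. This is a toy illustration, not the AMOS procedure: a real run refits the SEM for every permutation, whereas here a simple squared-distance discrepancy against a fixed implied matrix stands in for the refitted chi-square. All names, the serial-chain implied matrix and the idealised data are assumptions made for the example.

```python
import numpy as np

def discrepancy(sample_corr, implied_corr, order):
    """Toy stand-in for a refitted chi-square: squared distance between the
    implied matrix and the sample matrix with variables re-slotted."""
    permuted = sample_corr[np.ix_(order, order)]
    return float(np.sum((permuted - implied_corr) ** 2))

def permutation_p(sample_corr, implied_corr, n_draws=2000, seed=0):
    """Proportion of sampled slot permutations that fit equally well or better."""
    rng = np.random.default_rng(seed)
    k = sample_corr.shape[0]
    observed = discrepancy(sample_corr, implied_corr, np.arange(k))
    hits = 0
    for _ in range(n_draws):
        order = rng.permutation(k)   # each draw is independent (sampling with replacement)
        if discrepancy(sample_corr, implied_corr, order) <= observed:
            hits += 1
    return hits / n_draws

# Idealised example: 6 variables in a serial chain, data matching the model exactly
idx = np.arange(6)
implied = 0.5 ** np.abs(idx[:, None] - idx[None, :])
p_value = permutation_p(implied.copy(), implied)
```

In this idealised case only the original ordering and its mirror image reproduce the chain, so very few of the 720 possible orderings fit as well and the p-value is small, which is the ‘small p = good’ reading used in the table.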
# Bootstrapping (not shown) can provide empirical confidence intervals
more conservative than those with the strong parametric assumptions. It
also permits a direct comparison of fit between two models. It comes in
two versions. The Bollen-Stine version uses an
‘empirical chi-sq test’ to judge the bootstrapped model’s goodness-of-fit.
The ordinary (Maximum-Likelihood bootstrap) version is used to calculate
the bootstrapped parameter estimates (e.g. confidence intervals for the
bootstrapped standardized as well as for unstandardised regression
weights, ie for the β coefficients).
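The ordinary bootstrap described above can be sketched for the simplest case, a single standardised regression weight, which for one predictor equals the correlation coefficient. This is an illustration on simulated data under our own assumptions (function name, percentile interval, 2000 resamples); it does not reproduce the AMOS implementation, and the Bollen-Stine version is not shown.

```python
import numpy as np

def bootstrap_beta_ci(x, y, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the standardised weight of y on x."""
    rng = np.random.default_rng(seed)
    n = len(x)
    betas = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)              # resample cases with replacement
        betas[b] = np.corrcoef(x[idx], y[idx])[0, 1]  # standardised weight for one predictor
    alpha = (1 - level) / 2
    lo, hi = np.quantile(betas, [alpha, 1 - alpha])
    return float(lo), float(hi)

# Simulated data (not trial data); sample size chosen to match the 376 children
rng = np.random.default_rng(1)
x = rng.normal(size=376)
y = 0.5 * x + rng.normal(size=376)

beta_hat = float(np.corrcoef(x, y)[0, 1])
lo, hi = bootstrap_beta_ci(x, y)
```

The percentile interval makes no distributional assumption about the estimate, which is why such intervals can differ from those obtained under strong parametric assumptions.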