Interrupted Time Series Designs

Overview
• Role of ITS in the history of WSC
– Two classes of ITS for WSCs
• Two examples of WSC comparing ITS to RE
• Issues in ITS vs RE WSCs
– Methodological and logistical
– Analytical
ITS
• A series of observations on a dependent variable over time
  – N = 100 observations is the desirable standard
  – N < 100 observations is still helpful, even with very few observations—and by far the most common!
• Interrupted by the introduction of an intervention.
• The time series should show an “effect” at the time of the interruption.
Two Classes of ITS for WSC
• Large scale ITS on aggregates
• Single-case (SCD) and N-of-1 designs in social
science and medicine
• These two classes turn out to have very
different advantages and disadvantages in the
context of WSCs.
• Consider examples of the two classes:
Large Scale ITS on Aggregates:
The effects of an alcohol warning label on prenatal drinking
[Figure: monthly prenatal drinking (proportions, roughly −0.6 to 0.8) plotted by Month of First Prenatal Visit, Sep-86 through Sep-91, with vertical markers at the Label Law Date and where the Impact Begins.]
Large Scale ITS on Aggregates
• Advantages:
  – Very high policy interest.
  – Sometimes very long time series, which makes analysis easier.
• Disadvantages:
  – Typically very simple, with only a few design elements (perhaps a control group; little chance to introduce and remove treatment; rarely even implemented with multiple baseline designs).
  – Usual problems with uncontrolled and unknown attrition and treatment implementation.
  – We have yet to find a really strong example in education.
• Formidable logistical problems in designing WSCs that are well enough controlled to meet the criteria we outlined on Day 1 for good WSCs.
  – We are not aware of any WSCs comparing RE to this kind of ITS.
Single-case (SCD) and N-of-1 designs in
social science and medicine
• Each time series is done on a single person, though a study usually includes multiple SCDs
• Advantages:
  – Very well controlled, with many opportunities to introduce design elements (treatment withdrawal, multiple baseline, and more), low attrition, excellent treatment implementation.
  – Plentiful in certain parts of education and psychology
• Disadvantages:
  – Of less general policy interest except in some circles (e.g., special education), but
    • IES now allows them for both treatment development studies and impact studies under some conditions.
    • Increasing interest in medicine (e.g., CENT reporting standards).
  – Typically short time series, which makes analysis more difficult
    • Much work currently being done on this.
    • Should be applicable to short time series in schools or classes
• Has proven somewhat more amenable to WSC
Two Examples of WSC of RE vs ITS
• Roifman et al. (1987)
  – WSC method: a longitudinal randomized crossover design
  – Medical example
  – One study that can be analyzed simultaneously as
    • a randomized experiment
    • 6 single-case designs
• Pivotal Response Training
  – WSC method: meta-analytic comparison of RE vs ITS
  – Educational example on treatment of autism
  – Multiple studies with multiple outcomes
• No claim that these two examples are optimal
  – But they do illustrate some possibilities, and the design, analytical, and logistical issues that arise.
Roifman et al. (1987)
• High-dose versus low-dose intravenous immunoglobulin in hypogammaglobulinaemia and chronic lung disease
• 12 patients in a longitudinal randomized crossover design. After one baseline (no IgG) observation:
  – Group A: 6 receive the high dose for 6 sessions, then the low dose for 6 sessions.
  – Group B: 6 receive the low dose for 6 sessions, then the high dose for 6 sessions.
• Outcome is serum IgG levels
• Here is a graph of results
Even though this example uses
individual people for each time
series, one can imagine this kind
of study being implemented using
schools or classrooms.
How many time points are needed
is an interesting question.
Analysis Strategy
• To compare RE to SCD results, we analyze the data two ways
  – As an N-of-1 trial: analyze Group B only, as if it were six single-case designs
  – As an RE: analyze Time 6 data as a randomized experiment comparing Group A and Group B.
• Analyst blinding
  – I analyzed the RE
  – David Rindskopf analyzed the SCD
Analytic Methods
• The RCT is easy
  – usual regression (or ANOVA) to get the group mean difference and its se.
    • We did run ANCOVA covarying the pretest, but results were essentially the same.
  – Or a usual d-statistic (or bias-corrected g)
• The SCD analysis needs to produce a result that is in a comparable metric
  – Used a multilevel model in WinBUGS to adjust for nonlinearity and get a group mean difference at time 6 (or 12, but with potential carryover effects)
  – d-statistic (or g) for SCDs that is in the same metric as the usual between-groups d (Hedges, Pustejovsky & Shadish, in press).
    • But the current incarnation assumes linearity
Analysis: RCT
• If we analyze as a randomized experiment with the endpoint at the last observation before the crossover (time 6):
  – Group A (M = 794.93, SD = 90.48)
  – Group B (M = 283.89, SD = 71.10)
  – MD = 511.05 (SE = 46.98), t = 10.88, df = 10, p < .001
• d = 6.28, g = 5.80, V(g) = 1.98 (se = 1.41)
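As a check on these numbers, a minimal R sketch that reproduces the mean difference, t, d, and g (up to rounding) from the reported summary statistics, with n = 6 per group as in the design:

    # Reported summary statistics (n = 6 per group)
    mA <- 794.93; sdA <- 90.48
    mB <- 283.89; sdB <- 71.10
    n  <- 6

    sp <- sqrt(((n - 1) * sdA^2 + (n - 1) * sdB^2) / (2 * n - 2))  # pooled SD
    md <- mA - mB           # mean difference, 511.04
    se <- sp * sqrt(2 / n)  # 46.98
    tv <- md / se           # t = 10.88 on df = 10

    d <- md / sp                              # 6.28
    g <- d * (1 - 3 / (4 * (2 * n - 2) - 1))  # bias-corrected g = 5.80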
Analysis 2: SCD
• If we analyze only Group B (6 cases) using a d-estimator¹:
  – g = 4.59, V(g) = 1.43 (se = 1.196)
  – Close to the RE estimate: g = 5.80, V(g) = 1.98 (se = 1.41)
• We also have a WinBUGS analysis² taking trend into account:
  – MD = 495, SE = 54, “t” = 495/54 = 9.2
  – Very close to the best estimate from the randomized experiment of MD = 511.05, SE = 46.98
¹ Hedges, Pustejovsky & Shadish, in press, Research Synthesis Methods
² Script and data input available on request
Comparing Results RE vs SCD
• Means and d in same direction
• Means and d of similar magnitude
• It is not clear that the standard errors from the previous slides are really comparable, but treating them as if they were:
  – Test overlap using 84% confidence intervals, which simulates a z-test¹ (sketch below)
    • For g: 3.82 < 5.80 < 7.76 for the RE; 2.91 < 4.59 < 6.27 for the SCD
    • For the group mean difference: 445.04 < 511 < 577.06 for the RE; 419.13 < 495 < 570.87 for the SCD
  – That is, no significant difference between the SCD and RE.
• Another option would be to bootstrap the standard errors.
¹ Julious, 2004, Pharmaceutical Statistics
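A minimal R sketch of the overlap test (the 84% limits use z = qnorm(.92) ≈ 1.41; small rounding differences from the values above are possible):

    ci84 <- function(est, se) est + c(-1, 1) * qnorm(0.92) * se  # 84% CI limits

    ci84(5.80, 1.41)     # g, RE
    ci84(4.59, 1.196)    # g, SCD
    ci84(511.05, 46.98)  # mean difference, RE
    ci84(495, 54)        # mean difference, SCD (WinBUGS)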
Comments on This WSC Method
• Using randomized crossover designs with longitudinal observations is a promising method.
• Statistical problems:
  – How to compare results from RE and SCD when they clearly are not independent.
  – Did not deal with autocorrelation
    • Should be possible to do in several ways
    • But correcting would likely make SEs larger and so make RE-ITS differences less significant
  – Need to explore further the effects of trend and nonlinearities
Example: PRT
• Pivotal Response Training (PRT) for Childhood Autism
• This WSC method does a meta-analytic comparison of results from SCDs to results from an RE.
• Meta-analytic WSCs have a long history but also have significant flaws, in that many unknown variables may be confounded with the designs.
  – But those flaws may often be no greater than in the usual 3-arm nonrandomized WSC
  – The big difference is that the latter usually has raw data but meta-analysis does not. In the case of SCDs, however, we do have the raw data (digitized).
The PRT Data Set
• Pivotal Response Training (PRT) for Childhood Autism
• 18 studies containing 91 cases.
• We used only the 14 studies with at least 3 cases (66 cases total).
• If there were only one outcome measure per study, this would result in 14 effect sizes.
• But each study measures multiple outcomes on cases, so the total number of effect sizes is 54.
Histogram of Effect Sizes
Descriptive statistics for g:
• N = 54 valid, 0 missing
• Mean = 1.311520
• Median = 1.041726
• Std. deviation = 1.2359393
• Minimum = −0.4032
• Maximum = 5.4356
Data Aggregated (by simple averaging) to Study Level
(w = 1/vg, the inverse-variance weight)

sid          g          vg           w
 19   .25643452   .06523894   15.328268
 20   .54197801   .06563454   15.235879
  9   .63926057   .02315301   43.190927
  5   .75588367   .03473584   28.788705
  8   .99578116   .17598106   5.6824295
 18   1.3178189   .36416389   2.7460164
 17   1.3252908   .3465313    2.8857422
 15   1.6105048   .11272789   8.8709195
  7   1.6148902   .1154149    8.6643927
 16   1.6345153   .23100681   4.3288767
 10   2.5494302   .26735438   3.7403539
  3   2.5985178   .67319145   1.4854615
 11   2.5989373   .78074928   1.2808209
  4   3.5641752   .15104969   6.6203379
Initial Meta-Analysis (aggregated to
the study level)
------- Distribution Description --------------------------------N
Min ES
Max ES
Wghtd SD
14.000
.256
3.564
.767
------- Fixed & Random Effects Model ----------------------------Mean ES
-95%CI
+95%CI
SE
Z
P
Fixed
1.0100
.8493
1.1706
.0820
12.3222
.0000
Random
1.4540
.9887
1.9193
.2374
6.1244
.0000
------- Random Effects Variance Component -----------------------v
=
.592448
------- Homogeneity Analysis ------------------------------------Q
df
p
87.4789
13.0000
.0000
I2 = 85.14%
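The output above appears to come from a standard meta-analysis macro; a minimal R sketch reproduces it with the metafor package, using the study-level g and vg from the earlier table (the DerSimonian-Laird τ² estimator is our assumption, but it matches the reported variance component of .5924):

    library(metafor)

    # Study-level effect sizes (g) and variances (vg) from the table above
    g  <- c(.25643452, .54197801, .63926057, .75588367, .99578116, 1.3178189,
            1.3252908, 1.6105048, 1.6148902, 1.6345153, 2.5494302, 2.5985178,
            2.5989373, 3.5641752)
    vg <- c(.06523894, .06563454, .02315301, .03473584, .17598106, .36416389,
            .3465313,  .11272789, .1154149,  .23100681, .26735438, .67319145,
            .78074928, .15104969)

    rma(yi = g, vi = vg, method = "FE")  # fixed effect: mean ES = 1.010
    rma(yi = g, vi = vg, method = "DL")  # random effects (DerSimonian-Laird):
                                         # mean ES = 1.454, tau^2 = .592, Q = 87.48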
RE on PRT: Nefdt et al. (2010)
• From one RE (Nefdt et al., 2010), we selected the outcomes most similar to those used in the SCDs
  – G = .875, V(G) = .146 (se = .382).
• Recall that the meta-analysis of PRT showed
  – G = 1.454, V(G) = .056 (se = .2374)
• Are they the same?
  – Same direction, somewhat different magnitudes
  – Again using the 84% confidence interval overlap test:
    • .338 < .875 < 1.412 for the RE
    • 1.120 < 1.454 < 1.788 for the SCDs
    • The confidence intervals overlap substantially, so again no significant difference between RE and SCD
Comments on PRT Meta-Analytic Example
• This is just a very rough first stab
• Need to deal better with linearity issues in the SCD analyses
  – We are currently expanding our d-statistic to cases with nonlinearity
  – In the meantime, one can detrend prior to analysis
• Need also to code possible covariates confounded with the design for meta-regression.
Some Other Possible Examples
• Labor Economics (Dan Black)
  – Many randomized experiments in labor economics use outcomes that are in archives with 20-30 data points
• Army Recruiting Experiment (Coady Wing)
  – Incentive provided for enlisting into select specialties, but only in randomly selected recruiting districts.
  – Could tap VA/SSA/etc. records for outcomes
Issues in WSCs comparing ITS to RE
• Design Issues and Options
• Analytic Issues
• Logistical Issue
Design Issue: How to Make Random Assignment to RE vs ITS More Feasible in a Four-Arm Study
• How to motivate people to take multiple measures so that attrition is not a problem in the time series?
  – Large amounts of money for each observation?
    • E.g., 300 people × 12 observations × $100 per session = $360,000
    • With 100 observations, the total cost is $3,000,000, but how to space observations in a policy-relevant way?
    • Do a pilot study to determine the optimal payment per observation.
  – Shorten the overall time by using very frequent measurements per day (e.g., the mood literature).
    • Could then reduce costs by paying per day or the like
    • But how policy relevant?
Design Issues: Nonrandomized Studies
Comparing ITS to RE
• E.g., Michalopoulos et al. (2004) had time series data with about 30 observations over time for a randomized treatment group, a randomized control group, and a nonrandomized comparison group.
In this study, longitudinal randomized experiments were done within cities. The nonrandomized comparison group was from another city.
More on Michalopoulos
• They did not analyze the ITS as a time series. Instead they just substituted the comparison group for the control group and analyzed it like a randomized experiment. This is not how an ITS would be analyzed.
• In addition, this has the usual confounds between method and third variables.
• But this perhaps can and should be done more
  – Does someone want to reanalyze Michalopoulos?
    • Compare results from the usual randomized estimate at one time point to
      – analyzing the randomized control as an ITS
Design Issues: Randomized Crossover
Designs
• Key issue: How many longitudinal randomized crossover designs exist with enough data points?
  – A Medline search for “randomized crossover design” found 27,986 hits. Surely some of these are longitudinal.
  – A Medline search for “longitudinal randomized crossover design” found 126 hits. But a quick look suggested few were what we need. So it will be a tedious search to find truly longitudinal randomized crossover designs.
  – I have a list of
  – This is another good study for someone to do.
• Might need a meta-analytic summary, so statistical issues will emerge about what effect size to use and how to deal with trend.
Design Issues: Meta-Analytic
Approaches
• Feasible in areas where both ITS and REs are commonly used
  – A key issue would be to find areas with both (N-of-1 trials in medicine? SCDs in education and psychology?)
  – My lab is currently working on this
• In many respects these are just aggregations of three- and four-arm studies, with all the flaws and strengths therein.
  – But meta-regression could help clarify some confounds of methods with third variables.
Analytic Issues
• Because so many real examples are short ITS, usual ARIMA modeling etc. is not practical.
• Key analytic issues:
  – What is the metric for the comparison of ITS and RE?
  – Dealing with trend
  – Dealing with autocorrelation
Analytic Issues:
When the Metric is the Same
• To compare RE to ITS, the same effect estimate has to be measured in the same metric
• Not a problem
  – In longitudinal randomized crossover designs
  – In three-arm studies in which participants are randomized to all methods and treated identically and simultaneously (e.g., Shadish et al., 2008, 2011)
  – In three-arm studies like Michalopoulos that are all part of one large study
  – Or over multiple studies, if they just happened to use the same outcome variable.
• In all cases, ensure the effect estimate is the same (ATE, ToT, etc.).
• But meta-analyzing all this can require finding a common metric if metrics differ across studies.
Analytic Issues: Metric Is Different
• Special case: when all outcomes are dichotomous but may measure different constructs
  – Can use HLM (etc.) with a code for outcome type (Haddock, Rindskopf & Shadish, 1998)
• Otherwise, need a common effect size estimate like d, r, the odds ratio, the rate difference, etc.:
Analytic Issue: Metric is Different
• d-statistic for ABk SCDs (Hedges, Pustejovsky & Shadish, in press, Research Synthesis Methods)
  – SPSS macro in progress; R script available but needs individual adaptation to each study
  – Assumes no trend and a normally distributed outcome
  – Takes autocorrelation and between-within case variability into account; requires a minimum of 3 cases.
• d-statistic for multiple baseline SCDs nearing completion (Hedges, Pustejovsky & Shadish also)
• Grant proposal pending to extend the work to
  – Various kinds of trend
  – Various kinds of outcome (e.g., counts distributed as Poisson or binomial)
• There are other effect sizes, but none are well justified statistically or comparable to between-groups effect sizes. A naive sketch of the between-case metric idea follows below.
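Because the HPS macros are still in progress, here is a deliberately naive R sketch of the underlying idea of a between-case standardized mean difference for AB-type SCD data. This is not the HPS estimator (it omits the autocorrelation and small-sample corrections), and the data frame dat, with columns case, phase ("A"/"B"), and y, is hypothetical:

    naive_bc_smd <- function(dat) {
      # Per-case means within each phase (aggregate sorts cases within phase)
      m  <- aggregate(y ~ case + phase, data = dat, FUN = mean)
      mA <- m$y[m$phase == "A"]
      mB <- m$y[m$phase == "B"]
      num <- mean(mB - mA)  # average baseline-to-treatment change across cases

      # Denominator mimics the "total" SD a between-groups d would use:
      # between-case variance of baseline means plus average within-case variance
      base <- dat[dat$phase == "A", ]
      s2_within  <- mean(tapply(base$y, base$case, var))
      s2_between <- var(mA)

      num / sqrt(s2_between + s2_within)
    }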
Analytic Issues:
How to Deal with Trend in the ITS
• The issue is to prevent the presence of linear or nonlinear trend from causing a spurious difference between the RE and ITS estimates.
[Illustrative plot: a hypothetical 10-point series divided into baseline and treatment phases, with visible trend.]
Trend:
Outcome Already in Common Metric
• If you do not need to convert to a common metric (e.g., the outcome in the RE and ITS is identical), there are two options (see the sketch after this list):
  – Model trend using ordinary regression (Huitema, 2011) or multilevel models (Van den Noortgate & Onghena; Kyse, Rindskopf & Shadish).
    • But this may produce two estimates:
      – Main effect of treatment
      – Interaction of trend with treatment
    • Working out whether the regression/HLM estimate is truly identical to the ATE from the RE is not transparent
      – For example, how to demonstrate that the effect estimate from WinBUGS for the SCDs in the Roifman example is really an ATE?
  – Remove trend by detrending the data using
    • First-order differencing (but is second-order differencing needed? and it loses data points)
    • Or regression with trend (but what polynomial order?) and subsequent analysis of the residuals
    • Then could use the HPS d
    • But it is not clear it is GOOD to remove the trend x treatment interaction, which may be a real effect of treatment.
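A minimal sketch of the two detrending options just listed, assuming a response vector y and a time vector t:

    # Option 1: first-order differencing (loses one data point)
    y_diff <- diff(y)

    # Option 2: fit a trend by regression and analyze the residuals
    # (polynomial order is a judgment call; a linear trend is shown)
    y_detrended <- resid(lm(y ~ t))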
Trend:
Outcomes in Different Metrics
• E.g., in the PRT meta-analysis, where each study had the same construct but different measures:
  – Detrend the data using the methods previously described
  – Then compute the HPS d
  – Or wait until the HPS d with trend adjustment is ready in a few years.
Analytic Issues: Diagnosing Trend and
Sensitivity Analyses
• A different approach is to use nonparametric or semi-parametric methods to see whether the presence of an effect is sensitive to the presence of trend or trend x treatment interactions.
• These methods allow the data to tell you about trend and interactions, whereas parametric methods require you to know the trend beforehand.
• We have been exploring Generalized Additive Models (GAMs), a semi-parametric method.
Introduction to GAM
• Like a parametric regression, but replacing some or all of the parametric predictors with smoothed nonparametric predictors. E.g.,
• Parametric:
  Yt = β0 + β1Xt + β2zt + β3[Xt − (n1 + 1)]zt + εt
• GAM with smoothed trend and interaction:
  Yt = β0 + s1(Xt) + β2zt + s3([Xt − (n1 + 1)]zt) + εt
• For the techies: the smooth is a cubic regression spline fit by iteratively reweighted least squares, with the best-fitting model chosen by generalized cross-validation.
  – Modeled in R using the mgcv package (Wood, 2010); a sketch follows below.
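A minimal mgcv sketch of the two models above. The data frame dat, its columns (y; time; z coded 0 for baseline and 1 for treatment), and n1 (the number of baseline points) are all hypothetical, and for short series the basis dimension k may need to be lowered:

    library(mgcv)

    dat$xz <- (dat$time - (n1 + 1)) * dat$z  # phase-centered trend x treatment term

    # Parametric GLM: linear trend, level shift, linear interaction
    fit_glm <- glm(y ~ time + z + xz, family = binomial, data = dat)

    # GAM: smooth the trend and interaction with cubic regression splines;
    # method = "GCV.Cp" picks smoothness by generalized cross-validation
    fit_gam <- gam(y ~ s(time, bs = "cr") + z + s(xz, bs = "cr"),
                   family = binomial, data = dat, method = "GCV.Cp")

    summary(fit_gam)       # edf > 1 on a smooth suggests nonlinearity
    AIC(fit_glm, fit_gam)  # rough model comparison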
GAM Questions to Ask
• Is there trend or trend x treatment
interaction?
• Is either nonlinear?
• Is the treatment effect robust to trend?
• Consider the following SCD:
Parametric and GAM Results
• A parametric GLM with binomial errors finds a treatment effect but no trend or interaction.
• The best-fitting GAM smooths the interaction. Annotations on the model output:
  – The edf of the smooth suggests the degree of nonlinearity may be high; edf is monotonically related to polynomial order.
  – The parametric trend is borderline linear.
  – The treatment effect is significant.
  – The F test suggests the smoothed interaction is not significant, but that test is underpowered.
GAM Conclusions
• About our three questions:
  – A trend by treatment interaction might be present
  – If so, it may be highly nonlinear
  – But the treatment effect is robust to trend, compared to the usual GLM.
• About GAM, based on our experience so far:
  – Works well with > 20 data points; less sure below 20
  – Good as a sensitivity analysis about trend; too early to say if good as a primary analysis of ITS effects.
  – Open power questions
    • Model comparison tests seem well powered (too well?)
    • The F test for a smooth is said to be underpowered.
  – Does GAM overfit the data?
Conclusions about Trend
• Probably the most difficult problem for WSCs of RE and ITS
• Lots of methods available and in development, but no “best practice” yet
  – So multiple sensitivity analyses are warranted
• The decision about trend depends in part on the decision/context regarding the metric
  – If all outcomes are the same, more flexibility.
  – If outcomes vary: detrend, use d, use GAM for sensitivity analyses?
Analytic Issues: Autocorrelation
• Observations (or their errors) on the same case over time are autocorrelated
  – Both effect estimates and standard errors can be biased if autocorrelation is not modeled correctly.
• Typically computed with a standard Yule-Walker estimator on the residuals yt from a four-parameter regression:

    rj = Σt=1…n−j yt yt+j / Σt=1…n yt²
Correcting for Bias in Autocorrelation
• Biased downwards (a) with small time series and (b) the more regression parameters are used to estimate the residuals; the bias is approximately −(P + 3ρ)/t, where P is the number of regression parameters and t the series length.
• With the usual four-parameter model (P = 4), a corrected estimate is:

    r̃ = (r·t + 4) / (t − 3)
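A minimal R sketch combining the two formulas above, again assuming hypothetical vectors y, time, and z and a scalar n1:

    fit <- lm(y ~ time + z + I((time - (n1 + 1)) * z))  # four-parameter regression
    e   <- resid(fit)
    tt  <- length(e)

    r1 <- sum(e[-tt] * e[-1]) / sum(e^2)      # Yule-Walker lag-1 autocorrelation
    r1_corrected <- (r1 * tt + 4) / (tt - 3)  # corrects the downward bias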
Variability in Autocorrelations
• Raw autocorrelations can be quite variable:
Autocorrelation and Sampling Error
• Much of the observed variability may be due to sampling error; the sampling variance of a lag-j autocorrelation is approximately:

    vj = (1 − ρj²) / (t − j − 3)
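A worked example, assuming the reconstruction of vj above is right: for a lag-1 autocorrelation of .20 estimated from 20 observations,

    rho <- 0.20; t_n <- 20; j <- 1
    v_j <- (1 - rho^2) / (t_n - j - 3)  # = 0.06, so the SE is about 0.24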
• E.g., consider Bayesian (or empirical Bayes) estimates of autocorrelations from two SCD studies:
Trace Plot for Schutte Data
[Trace plot for the Schutte data: autocorrelation estimates conditional on τ (conditional means for A = (Intercept) and B–N = cases 1–13), alongside the posterior probability of τ.]
Most plausible values of τ (those with large bars) are small, corresponding to variances of .09 or less, though they seem to indicate that τ is unlikely to be zero.
At the most likely values of τ, the ACs are greatly shrunken around a conditional mean of just less than 0.20, with a range that probably does not extend much beyond −0.05 to +0.30, not enough to bias G or V(G) much.
Trace Plot for Dyer Data
Here the results are different:
[Trace plot for the Dyer data: autocorrelation estimates conditional on τ (conditional means for A = (Intercept) and B–E = cases 1–4), alongside the posterior probability of τ.]
1. Cases B, C, and E shrink to a common small negative autocorrelation.
2. Case D is an outlier with a larger positive autocorrelation. Inspection of the graph for Case D suggests a ceiling effect, not present in Cases B, C, and E, that could cause the higher autocorrelation.
3. Again, however, τ is unlikely to be zero.
4. And again, the range of the AC is likely to fall between −0.20 and +0.30, not enough to bias G or V(G) much.
Implications of Bayesian Results
• Doing Bayesian analyses of ITS/SCDs may be a very useful approach
• We need more research to understand whether assuming a common underlying (Bayesian or EB) autocorrelation is justified for, say, cases within studies.
  – The Schutte data say yes
  – The Dyer data say no (but moderator analyses?)
• If we can make that assumption, much of the variability goes away, and the remaining autocorrelation may be small enough to ignore
Dealing with Autocorrelations
• Lots of methods, but don’t ignore entirely
• In GLM/HLM/GAM:
– Incorrect specification of trend in such models can
lead to spurious autocorrelations, so modeling
trend is important to results
– Tentative work (GAMs) suggests properly
modeling trend may reduce ACs to levels that are
unimportant for bias.
• For d, our SPSS macro estimates AC and
adjusts d appropriately.
Logistical Issue
• How to get the data for the time series.
  – Sometimes it is available in an archive, etc.
  – Sometimes it has to be digitized from graphs
• The latter can be done with high validity and reliability:
  – Shadish, W. R., Brasil, I. C. C., Illingworth, D. A., White, K., Galindo, R., Nagler, E. D., & Rindskopf, D. M. (2009). Using UnGraph® to extract data from image files: Verification of reliability and validity. Behavior Research Methods, 41, 177-183.
  – There is freeware also.
• But digitizing can be time consuming and tedious for large numbers of studies
  – E.g., a very good graduate student digitizing 800 SCDs from 100 studies took 8 months (including coding; Shadish & Sullivan, 2011).
Questions to Ask
• This is the one area where we still need studies on the main-effect question of “can ITS = RE?”
• Design variations to also examine:
  – Does it help to add a nonequivalent control?
  – Does it help to add a nonequivalent DV?
  – What about variations in ITS design?
    • Ordinary ITS with one intervention at one time
    • Multiple baseline designs with staggered implementation of the intervention over time
      – over cases
      – over measures within one case
Conclusion
• WSCs of ITS and RE are badly needed if ITS is going to regain the credibility it once had (assuming it does, in fact, give a good answer).
• Challenging to design and analyze, but much progress has been made already, especially in SCDs and N-of-1 trials.
• Questions?
Some Comments on Meta-Analysis
• Meta-analysis is probably the oldest empirical approach to studying RE-QE differences
  – Smith, Glass & Miller (1980), psychotherapy
  – Lipsey & Wilson (1993), meta-meta-analysis
  – Both compared results from REs to results from NREs with no adjustments of any kind.
  – Both found that dRE = dNRE, but perhaps s²RE < s²NRE
  – But…
How Much Credibility Should We Give
To Such Studies?
• The RE-NRE question was always secondary to substantive interests (does psychotherapy work? do behavioral and educational interventions work?)
  – So there was no careful attention to definitions of RE and NRE
    • Just used the original researchers' word for it
  – No attention to covariates confounded with RE-NRE.
• Glaser (my student) carefully recoded a large random sample of Lipsey and Wilson using clear definitions, etc., and found that their finding did not replicate.
2nd Generation Meta-Analyses
• E.g., Shadish & Ragsdale (1996); Heinsman & Shadish (1996).
• Careful selection of two hundred well-defined REs and NREs
  – from 5 areas (psychotherapy, presurgical patient education, SAT coaching, ability grouping of students, prevention of juvenile drug abuse),
  – coded on a host of covariates potentially confounded with assignment method,
  – with meta-regression to adjust the RE-NRE difference for those covariates.
• Other studies of the same sort (Kownacki & Shadish, 1999, Alcoholics Anonymous; Shadish et al., 2000, psychotherapy)
Some Illustrative General Findings
• Confounds with assignment method are rampant, and differ across RE vs NRE
• And are often quite different across substantive areas
• But adjusting for those confounds greatly reduces or eliminates RE-NRE differences.
• Tentative hypothesis: If studies were conducted identically in all respects except for assignment method, they would yield similar results.
Some Illustrative Specific Findings
• Studies using self-selection into conditions yielded far more bias than studies using other-selection.
• Local controls produce more accurate NRE estimates than non-local controls.
• It is crucial to control for activity level in the control group.
• Across a set of studies, the pretest d (on the outcome) is often a very strong predictor of the posttest d.
3rd Generation Meta-Analyses?
• Most meta-analyses do not have access to individual-person data within studies.
  – So they cannot use typical statistical adjustments at that level—PSA, ANCOVA, SEM, etc.
• But access may be more possible today
  – Digitizing individual outcome data from graphs, as in SCDs, often with some narrative and/or quantitative description of each case's personal characteristics
  – Growing numbers of individual-patient meta-analyses, that is, from REs and NREs where the individual data are available both within and across multiple sites.
• This may allow some use of adjustments to NREs within and across studies that has not been possible before, but it is too early to tell.
What are the Disadvantages of Meta-Analyses?
• For example, at the study level, meta-analytic data are correlational, so our ability to know and code for confounds with assignment method is inherently limited.
  – Also true of all three-arm designs so far?
    • Even worse, because within a single three-arm study assignment is totally confounded with study-level covariates (but not individual-level covariates).
    • At least within meta-analysis one has variation in those study-level confounds over studies to model and adjust. This may only inform about study-level covariates (and perhaps aggregate-level person covariates—e.g., average age), but those covariates are nonetheless important in understanding RE-NRE differences in results (and this may change with 3rd-generation meta-analysis).
• Other criticisms? Discussion of the role of MA in WSC?