
Behavioral Assessment, Vol. 14, pp. 153-171, 1992
Printed in the USA. All rights reserved.
0191-5401/92 $5.00 + .00
Copyright © 1992 Pergamon Press Ltd.
Randomization Tests for Extensions and
Variations of ABAB Single-Case Experimental
Designs: A Rejoinder
PATRICK ONGHENA
Katholieke Universiteit Leuven, Belgium
Randomization tests have been developed for several single-case experimental
designs. It is argued, however, that the randomization tests developed by Levin,
Marascuilo, and Hubert (1978) for the ABAB design and by Marascuilo and
Busk (1988) for replicated ABAB designs across subjects are inappropriate. An
alternative randomization procedure for the ABAB design is presented, and the
appropriate corresponding randomization test is derived. It is shown how this
alternative procedure has to be adapted to allow for statistical analyses of
extended ABAB, parametric-variation, drug-evaluation, interaction, replication,
and multiple-baseline designs. Finally, some limitations of the use of random­
ization tests for single-case designs are discussed, with particular reference to
changing-criterion designs.
Key Words: randomization tests; ABAB designs; within-subject designs;
statistical analysis; changing-criterion designs; nonparametric statistical tests;
meta-analysis; time-series designs.
Randomization theory provides for a category of statistical tests that is
valid for single-case experimental designs (Edgington, 1967, 1980c).
Furthermore, these "randomization tests" are extremely versatile in this
area as a consequence of their potential for tailoring the tests to the specific
designs used (Edgington, 1987).
Randomization tests have been developed for (a) AB and ABA designs
(Edgington, 1975a), (b) ABAB designs and extensions (Levin, Marascuilo,
& Hubert, 1978), (c) alternating treatments designs (Edgington, 1967,
1980b), (d) multiple schedule designs (Edgington, 1982), (e) multiple-base­
line designs (Wampold & Worsham, 1986), and (f) replicated AB and
ABAB designs across subjects (Marascuilo & Busk, 1988).
The author wishes to thank Jan Beirlant, Paul De Boeck, Luc Delbeke, Eugene Edgington, Joel
Levin, Gert Storms, and three anonymous reviewers for their helpful comments on an earlier draft
of the article. The author is Research Assistant of the National Fund for Scientific Research
(Belgium).
Correspondence concerning this article should be addressed to Patrick Onghena, K.U.
Leuven, Department of Psychology, Tiensestraat 102, B-3000 Leuven, Belgium.
In the present article, it will be argued that the Levin et al. (1978) ran­
domization test (in the following abbreviated as the LMH test) is inappro­
priate for the ABAB design, and that the randomization test developed by
Marascuilo and Busk (1988) is inappropriate for the replicated ABAB
design across subjects because it is based on the LMH test. A more appro­
priate, alternative test will be presented, and it will be shown how this
alternative test has to be modified to analyze replication and multiple-base­
line designs and single-case experimental designs which have hitherto
received little attention in the randomization test literature: extended
ABAB designs, parametric-variation designs, drug-evaluation designs, and
interaction designs.
THE LMH TEST FOR ABAB DESIGNS
According to Kazdin (1982), ABAB designs are the most basic experi­
mental designs in single-case research. They consist of a family of designs
in which repeated measurements are taken for a single unit (e.g., a person)
under two different alternating conditions in four phases, with several mea­
surements in each phase. In the first phase, repeated measurements are
taken under control conditions (baseline or first A phase); in the second,
under experimental conditions (intervention or first B phase); in the third,
under control conditions again (withdrawal or second A phase); and in the
fourth, under experimental conditions again (second intervention or second
B phase; see also Barlow & Hersen, 1984, pp. 157-166).
Because of the non-independence of the measurements, parametric sta­
tistical tests based on random sampling are inappropriate for single-subject
designs in general, and ABAB designs in particular (Edgington, 1967,
1975a; Levin et al., 1978). Therefore, Levin et al. (1978) proposed to use a
nonparametric randomization test to analyze such designs. However, they
wanted a close correspondence between the random sampling and the ran­
domization model. They used phase mean scores to get "approximately
independent, or at least uncorrelated, random variables that are identically
distributed even though the population model itself generates dependent
observations" (p. 177).
Consider, for example, the hypothetical data collected in a single-sub­
ject ABAB design in Table 1. To perform the LMH test, the mean scores of
each phase are regarded as a population. The null hypothesis that there is
no difference between the A and the B condition is assessed by referring to
the conditional sampling distribution of a test statistic associated with all
theoretically possible assignments of these means to the conditions. For the
data in Table 1, the mean values of each phase are 4.0, 2.0, 3.0, and 1.0,
respectively. The number of possible assignments of them to Conditions A
and B, under the assumption that two must go to each condition, is given
TABLE 1
Hypothetical Data Used to Illustrate the Levin, Marascuilo, and Hubert Test for an
ABAB Design With 24 Measurement Times (MT)

MT         1  2  3  4  5  6   7  8  9 10 11 12  13 14 15 16 17 18  19 20 21 22 23 24
Condition  A  A  A  A  A  A   B  B  B  B  B  B   A  A  A  A  A  A   B  B  B  B  B  B
Score      6  2  5  3  4  4   1  2  3  1  3  2   2  3  4  2  4  3   0  2  0  2  0  2
by the binomial coefficient C(4, 2) = 6. One could assign the following pairs to
one of the conditions (let us say A), where order is not important: 4.0 and
2.0, 4.0 and 3.0, 4.0 and 1.0, 2.0 and 3.0, 2.0 and 1.0, and 3.0 and 1.0. For
each of the six possible assignments to a particular condition (here A), the
sum of the two means is then computed as the test statistic, and based on
these sums a sampling distribution is defined (Table 2).
For the observed assignment, the sum of the means associated with
condition A was 4.0 + 3.0 = 7.0. The probability (p) that the value of the
test statistic for one of the assignments is as large as the observed test
statistic in the generated distribution is 1/6 = .1667 for a one-tailed test. If a
significance level of α = .05 was chosen, the null hypothesis of no differ­
ence between Conditions A and B is not rejected.
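The LMH computation just described is small enough to sketch in a few lines (an illustrative reimplementation, not the authors' own code; the phase means are those of Table 1):

```python
from itertools import combinations

# Phase means from Table 1 (A1, B1, A2, B2).
means = [4.0, 2.0, 3.0, 1.0]
observed = 4.0 + 3.0  # sum of the means of the two actual A phases

# All C(4, 2) = 6 ways to assign two of the four phase means to Condition A.
sums_for_A = [m1 + m2 for m1, m2 in combinations(means, 2)]

# One-tailed p-value: proportion of assignments whose sum is at least as large.
p = sum(s >= observed for s in sums_for_A) / len(sums_for_A)
print(len(sums_for_A), p)  # 6 assignments; p = 1/6
```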
THE INAPPROPRIATENESS OF THE LMH TEST
FOR ABAB DESIGNS
Although the LMH test has a certain appeal as a descriptive device and
is relatively simple to perform, it is not an appropriate randomization test to
allow for causal inferences. To the extent that the LMH test is presented as
a randomization test for a systematic ABAB design (i.e., first an A condi­
tion, then a B condition, then an A condition, and finally a B condition), it
is invalid because there has to be random assignment of measurement times
TABLE 2
Randomization Distribution for the Levin,
Marascuilo, and Hubert Test Applied to the
Hypothetical Data of Table 1

Means Assigned
to Condition A     Sum of Means     Probability
4.0 and 3.0        7.0              1/6
4.0 and 2.0        6.0              1/6
4.0 and 1.0        5.0              1/6
3.0 and 2.0        5.0              1/6
3.0 and 1.0        4.0              1/6
2.0 and 1.0        3.0              1/6
to conditions if a valid randomization test is to be performed (Edgington,
1980b, 1980c, 1984, 1987).
Although Levin et al. (1978) defended the application of randomization
tests to systematic ABAB designs, they also suggested using the LMH test
for randomized ABAB designs:
For example, in the baseline (A) versus treatment (B) situation an initial A phase could
be designated as an adaptation or "warm-up" phase, not to be included in the analysis. It
might then be possible to randomly assign two A and two B conditions to four successive
phases to constitute the analysis proper. In other instances, A and B may represent two
different experimental treatments (rather than a baseline and a treatment), in which case
random assignment may be more reasonable. (Note that under the proper randomization
scheme, it is possible to come up with two consecutive phases of the same type.) (p. 175)
The last sentence in parentheses, however, is crucial. It implies, though
Levin et al. (1978) seem to ignore this, that the LMH test assumes random
assignments that are inconsistent with the ABAB sequence, and hence, that
it is not an appropriate randomization test for the randomized ABAB
design. For a randomization test to be valid, the permutation of the data to
obtain the distribution of test statistics must correspond to the particular
type of random assignment that is actually used (Edgington, 1980b, 1980c).
This means that if a randomization procedure was performed that corre­
sponded to the permutation of the means in the LMH test, an ABAB design
would have been equally likely as an AABB, ABBA, BAAB, BABA, or
BBAA design. Hence, an investigator who wanted to use an ABAB design
may, as a result of this randomization procedure, have arrived at an AB
(AABB) or an ABA (ABBA) design instead.
The LMH test is not well suited to analyze ABAB designs because it
is an analogue of the randomization test that is appropriate for the alternat­
ing treatments design in which there is one potential alternation of condi­
tion per measurement time (Edgington, 1967, 1980b). It seems that Levin
et al. (1978) and Marascuilo and Busk (1988) consider the ABAB design
as if it were an alternating treatments design with two different conditions,
four measurement times, and two measurement times for each condition,
with the only exception being that for the ABAB design the phase means
have to be used because there is slower alternation and more measure­
ments per phase.
It could be argued that, while it is true that the LMH test is inappropri­
ate for ABAB designs, it is a valid test for randomized XXXX designs,
where X could be an A or B phase, and two A and two B phases can be
assigned. Although this design has not been used yet, the argument is cor­
rect. There is, however, still another problem: the LMH test has zero power
for randomized XXXX designs at any reasonable significance level
(α = .10 or lower). By definition, a randomized XXXX design consists of
four phases, and consequently the p-value can never be smaller than .1667.
Levin et al. (1978) and Marascuilo and Busk (1988) realized this, and they
proposed to test at the α = .1667 level for a one-tailed test or at the
α = .3333 level for a two-tailed test, to strengthen the alternative hypothe­
sis (e.g., by making more specific predictions about the rank ordering of
the mean values), to add phases to the design, or to add subjects to the
study. These suggestions, however, only confirm that the basic LMH test
has zero power at α = .10, α = .05, or lower.
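The zero-power claim can be checked directly from the attainable p-values (a sketch assuming the six-point LMH distribution of Table 2):

```python
# The LMH randomization distribution contains only six equally likely
# assignments, so the attainable one-tailed p-values are 1/6, 2/6, ..., 6/6.
attainable = [k / 6 for k in range(1, 7)]
print(min(attainable))  # .1667: rejection at alpha = .10 or .05 is impossible
```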
The LMH test also lacks power for systematic ABAB designs, as in the
example with the data in Table 1. The validity consideration is, however,
more fundamental. Even if the LMH test were powerful (as it can be made
in its extensions), there would still be the problem of its validity.
In sum, the LMH test is inappropriate for ABAB designs because, to
the extent that it is proposed for systematic ABAB designs it is invalid, and
to the extent that it is proposed for randomized designs it has zero power
and is, strictly speaking, not even a test for ABAB designs.
AN APPROPRIATE RANDOMIZATION TEST FOR
THE ABAB DESIGN
A valid randomization test is performed in nine consecutive steps: (a)
choice of the alternative hypothesis, (b) specification of the level of signifi­
cance a, (c) specification of the number of measurement times N, (d) ran­
domization of some aspect of the design, (e) choice of a sensitive test
statistic, (f) data collection, (g) computation of the test statistic from the
obtained data, (h) formation of the randomization distribution, and (i) com­
parison of the observed test statistic to the randomization distribution. It is
essential for the validity of the test that the first five steps are performed
before data are collected (Edgington, 1980b, 1980c, 1984, 1987; Wampold
& Worsham, 1986). Because in the previous section it was shown that the
invalidity of the LMH test is the most fundamental problem, the alternative
test will be presented closely following these nine steps to assure its validi­
ty. In addition, the alternative test will be described with the same data and
level of significance that were used in the illustration of the LMH test, and
it will be shown to be a powerful randomization test.
The Null and Alternative Hypotheses
The null hypothesis of a randomization test for single-case experiments
is always that there is no differential effect of the conditions for any of the
measurement times. The alternative hypothesis can be chosen. As in statis­
tical tests for group experiments with two conditions, it is possible to have
a directional (one-tailed) or a nondirectional (two-tailed) alternative.
Furthermore, it is possible to make even stronger predictions about trends
(Edgington, 1975b, 1987; Levin et al., 1978). Generally speaking, the more
specific the predictions, the more powerful one's test becomes. The trade­
off is, of course, that the test becomes completely inappropriate if the data
are not in the predicted direction. As some critics of one-tailed t tests might
say, one should only use a directional test if any treatment differences of a
kind other than the predicted kind are regarded to be either irrational or
irrelevant (Edgington, 1975b; Goldfried, 1959; Goldman, 1960; Kimmel,
1957; Levin et al., 1978).
In the following, the randomization test for the ABAB design will be
described for a one-tailed alternative, parallel to the illustration of the
LMH test (A > B). In applied behavior analysis, it is usually reasonable to
specify the direction of behavior change (Kazdin, 1982; Wampold &
Worsham, 1986).
The Level of Significance and the Number of Measurement Times
In the Neyman-Pearson decision-theoretical framework of hypothesis
testing, the level of significance and the number of observations must be
specified before the data are collected (Cohen, 1988). Together with the
test statistic, the design, and the alternative hypothesis, they determine the
power of the test.
In parametric statistical group analyses, the issue of Type II errors is
frequently, wrongly, neglected (Cohen, 1962, 1990; Sedlmeier &
Gigerenzer, 1989), and this neglect is even more prevalent in statistical
analyses of single-case data. Occasionally, qualitative references are made
to sensitivity considerations with respect to the choice of the test statistic,
the randomization procedure, the alternative hypothesis, or the level of sig­
nificance (e.g., Barlow & Hersen, 1984; Edgington, 1987; Kazdin, 1982;
Levin et al., 1978), but there has never been an attempt to quantify the rela­
tionship between the number of measurement times and the power of ran­
domization tests for single-case designs.
This quantification is straightforward, however, if there exists a group
design with the same randomization structure as the single-case design.
The power of the randomization test for the single-case design is then equal
to the power of the randomization test for the group design (Onghena,
1991; Onghena & Delbeke, 1992). If the same test statistic is used, the
power of the randomization test for the group design itself can be approxi­
mated using the standard tables of the corresponding random sampling test
(May, Masson, & Hunter, 1990, pp. 313-314) or else can be derived by
computer-intensive methods (Gabriel & Hall, 1983; Gabriel & Hsu, 1983;
Kempthorne, 1952; Kempthorne & Doerfler, 1969; Onghena, 1992). For
example, the randomization procedure for the alternating treatments design
as presented by Edgington (1967) has the same structure as the completely
randomized design (using Kirk's [1982] terminology), the only difference
being that for the first the random assignment to conditions concerns the
measurement times and for the last it concerns the subjects.
If the randomization procedure of the single-case design has an unusual
group design counterpart, then the power of the randomization test for the
single-case design has to be derived directly by computer-intensive meth­
ods (Onghena, 1991, 1992; Onghena & Delbeke, 1992), after specifying a
more restrictive alternative hypothesis (e.g., a specific additive or multi­
plicative effect for all measurement times under the experimental condition
as proposed by Gabriel & Hall, 1983; Gabriel & Hsu, 1983; Kempthorne,
1952; Kempthorne & Doerfler, 1969; Onghena, 1992). For example, the
randomization procedure for an AB design concerns the intervention point
(Edgington, 1975a), and the group design counterpart would be to assign
the first k selected subjects to the A condition, the others to the B condi­
tion, and to determine k at random. This is unusual, and it would be very
inefficient for a group design. Also the randomization procedure for an
ABAB design, to be presented next, is of this type.
To make it possible to work with the data of Table 1, suppose that on
the basis of practical and power considerations, a level of significance
α = .05 and N = 24 measurement times are chosen.
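For the AB intervention-point randomization just described, such a computer-intensive power calculation can be sketched as follows. The additive effect size, the normal error model, and the simulation settings are illustrative assumptions, not values from the cited sources:

```python
import random

def ab_p_value(scores, k, n_min=4):
    """One-tailed randomization p-value for an AB design: the intervention
    point k was selected at random from the admissible points, and the
    statistic mean(B) - mean(A) is recomputed for every admissible point."""
    N = len(scores)

    def stat(j):
        a, b = scores[:j], scores[j:]
        return sum(b) / len(b) - sum(a) / len(a)

    stats = [stat(j) for j in range(n_min, N - n_min + 1)]
    return sum(s >= stat(k) for s in stats) / len(stats)

def estimate_power(N=24, effect=2.0, alpha=0.10, nsim=1000, seed=1):
    """Monte Carlo power under an assumed additive effect with N(0, 1) errors.
    With N = 24 and n_min = 4 there are 17 admissible intervention points,
    so the smallest attainable p is 1/17 = .059 and alpha = .05 is unattainable;
    alpha = .10 is used here for that reason."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(nsim):
        k = rng.randint(4, N - 4)  # randomly selected intervention point
        scores = [rng.gauss(0, 1) + (effect if t >= k else 0.0)
                  for t in range(N)]
        rejections += ab_p_value(scores, k) <= alpha
    return rejections / nsim

print(estimate_power())
```

The same scheme extends to the ABAB case by replacing the single intervention point with the triplet randomization presented next.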
Randomization
The most appropriate aspect of the ABAB design that may be subject to
randomization is the determination of the three points of change: the time
of the first intervention (t1), the time of the withdrawal (t2), and the time of
the second intervention (t3). The possible times of the withdrawal are con­
ditional upon the time of the first intervention, and the possible times of the
second intervention are conditional upon the time of the withdrawal.
Furthermore, the random selection of the three points of change has to be
restricted to rule out the possibility of having too few (or even no) mea­
surement times for one of the phases.
This randomization procedure makes a randomization test possible that
radically differs from the LMH test. It follows the Edgington (1975a, 1980b)
model for AB and ABA designs instead of the Edgington (1967) model for
alternating treatments designs. In fact, Edgington (1975a, 1980b) already sug­
gested this randomization procedure for extensions of AB and ABA designs.
To continue the illustration, suppose an ABAB design is chosen with at
least n = 4 measurements in each phase (N = 24, determined in the preceding
step). Then it can be shown (see Appendix A, Equation 2) that there are 165
possible data divisions according to this randomization procedure. To make
all data divisions equally probable, it is necessary to enumerate all possible
triplets (t1, t2, t3) followed by random selection of a triplet. In the example,
random selection of a triplet out of the set of 165 triplets {(5,9,13), (5,9,14),
..., (5,9,21), (5,10,14), ..., (5,10,21), ..., (6,10,14), ..., (6,10,21), ...,
(13,17,21)} results in a probability of 1/165 for each of the data divisions.
Suppose that, for the example, triplet (7,13,19) was selected at random.
Notice that the resulting data division parallels the design in the illustration
of the LMH test (Table 1).
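The enumeration and random selection of a triplet can be sketched as follows (assuming N = 24 and at least n = 4 measurement times per phase, as in the example):

```python
import random

N, n = 24, 4  # measurement times and minimum phase length

# Enumerate every admissible triplet (t1, t2, t3): first intervention,
# withdrawal, and second intervention.  Each of the four phases must
# contain at least n measurement times.
triplets = [(t1, t2, t3)
            for t1 in range(n + 1, N + 1)
            for t2 in range(t1 + n, N + 1)
            for t3 in range(t2 + n, N - n + 2)]

print(len(triplets))            # 165 admissible data divisions
print(random.choice(triplets))  # one division, selected with probability 1/165
```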
The Test Statistic
Any test statistic that is sensitive to the expected effect of the treatment
is acceptable. If one expects a difference in level between the two condi­
tions, Condition A having the highest level, then the most straightforward
test statistic will be T = (A1 + A2) - (B1 + B2), the difference between the
sum of the phase means for the baseline phases and for the treatment phases. For the example,
this is the statistic chosen.
For a difference in level, other interesting test statistics have been pro­
posed for particular predictions: test statistics for immediate but decaying
effects (Edgington, 1975a) and for delayed effects (Wampold & Furlong,
test statistics corrected for trends (Edgington, 1980b, 1980c, 1984,
1987) and for outliers (Marascuilo & Busk, 1988), and ordinal or nominal
test statistics (Edgington, 1984; Marascuilo & Busk, 1988). It is also possi­
ble to test for changes in slope (Wampold & Furlong, 1981), for specific
linear or nonlinear trends (Edgington, 1975b; Levin et al., 1978), and for
changes in variability (Edgington, 1975a).
Data Collection, the Observed Test Statistic, and the Randomization
Distribution
Suppose the data in Table 1 were obtained. The test statistic T for these
data (called the observed test statistic) is (4.0 + 3.0) - (2.0 + 1.0) = 4.0.
Then T is computed for all the data divisions that could have been selected
following the randomization procedure. If the null hypothesis is true, each
measurement is independent of the condition that is implemented, and any
difference between the measurements under the two conditions is due sole­
ly to a difference in factors associated with the time the two treatments
were administered. Hence, each measurement time can be characterized by
its own measurement, and consequently, under the null hypothesis, the ran­
dom division of the measurement times in phases can be conceptualized as
the random division of the measurements in phases. Because each division
of the measurements is associated with a value for T, this results in a distri­
bution of T under the null hypothesis (the randomization distribution). For
the obtained data, the six lowest and the six highest values of T in this dis­
tribution are shown in Table 3.
The Probability Value
As a last step, the observed test statistic is compared to the randomiza­
tion distribution. If the proportion of test statistics that is as extreme as the
TABLE 3
Six Highest and Six Lowest Values of the Test Statistic T(a) in
the Randomization Distribution for the Alternative Test
Applied to the Hypothetical Data of Table 1

Triplet of Measurement
Times for Change        T       Probability
(7,14,19)              4.20     1/165
(7,11,19)              4.13     1/165
(7,15,19)              4.13     1/165
(7,13,19)              4.00     1/165
(7,14,18)              3.96     1/165
(6,14,19)              3.95     1/165

(12,17,21)             1.24     1/165
(5,10,14)              1.20     1/165
(11,17,21)             1.18     1/165
(12,16,21)             1.09     1/165
(11,16,21)             1.05     1/165
(13,17,21)             1.00     1/165

(a) T = (A1 + A2) - (B1 + B2).
observed test statistic (the probability value or p-value) is smaller than or
equal to α, the null hypothesis is rejected and the predicted effect in the
alternative hypothesis is accepted.
In the example, the proportion of test statistics that is as large as (we
had a directional alternative hypothesis) the observed test statistic is
p = 4/165 = .0242. This is smaller than the predetermined α = .05, and con­
sequently the null hypothesis of no difference between the A and B condi­
tions is rejected, in favor of the alternative hypothesis that, for some (per­
haps all) of the measurement times, the A condition has resulted in a higher
score than if the B condition had been given.
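The whole procedure can be sketched end to end, using scores consistent with the phase means of Table 1 (4.0, 2.0, 3.0, and 1.0) and with the values reported in Table 3; the individual scores should be read as illustrative:

```python
N, n = 24, 4
scores = [6, 2, 5, 3, 4, 4,   # first A phase (mean 4.0)
          1, 2, 3, 1, 3, 2,   # first B phase (mean 2.0)
          2, 3, 4, 2, 4, 3,   # second A phase (mean 3.0)
          0, 2, 0, 2, 0, 2]   # second B phase (mean 1.0)

def T(t1, t2, t3):
    """(mean A1 + mean A2) - (mean B1 + mean B2) for a division at t1, t2, t3."""
    phases = [scores[:t1 - 1], scores[t1 - 1:t2 - 1],
              scores[t2 - 1:t3 - 1], scores[t3 - 1:]]
    a1, b1, a2, b2 = (sum(ph) / len(ph) for ph in phases)
    return (a1 + a2) - (b1 + b2)

triplets = [(t1, t2, t3)
            for t1 in range(n + 1, N + 1)
            for t2 in range(t1 + n, N + 1)
            for t3 in range(t2 + n, N - n + 2)]

observed = T(7, 13, 19)           # the triplet actually selected at random
dist = [T(*t) for t in triplets]  # the randomization distribution
p = sum(x >= observed for x in dist) / len(dist)
print(observed, p)  # T = 4.0; p = .0242 for these illustrative scores
```

Whatever the scores, the p-value can never fall below 1/165 under this randomization scheme.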
In sum, the alternative randomization procedure results in a valid ran­
domization test for the ABAB design that has higher power than the LMH
test with the fixed six randomization possibilities. The superior power was
illustrated using the same data and level of significance, rejecting the null
hypothesis with the alternative test while this was not possible with the
LMH test.
RANDOMIZATION TESTS FOR EXTENDED ABAB,
DRUG-EVALUATION, INTERACTION, AND PARAMETRIC­
VARIATION DESIGNS
The ABAB design can be extended by adding one or more phases after
the basic ABAB sequence (Barlow & Hersen, 1984, p. 175). A randomiza­
tion test for such an extended ABAB design follows the nine steps
described above. The randomization involves the k points of change; that
is, the random sampling of a k-tuple of points of change out of the set of all
possible k-tuples (see Appendix A for the computation of the cardinality of
this set). Finally, a test statistic sensitive to the expected effect of the treat­
ment (e.g., the difference between the sum of the means for the A condition
and the sum of the means for the B condition) is calculated for the obtained
data and compared with the randomization distribution of this test statistic
to obtain the probability value.
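The cardinality of this set can also be computed and checked by brute force (a sketch using the standard stars-and-bars count, which agrees with the 165 divisions of the ABAB example; this closed form is a reconstruction, not necessarily the formula given in Appendix A):

```python
from itertools import combinations
from math import comb

def n_divisions(N, k, n):
    """Closed-form count of admissible k-tuples of change points when each
    of the k + 1 phases must contain at least n of the N measurement times."""
    return comb(N - (k + 1) * n + k, k)

def n_divisions_brute(N, k, n):
    """The same count by brute force over all candidate change points."""
    count = 0
    for ts in combinations(range(2, N + 1), k):
        bounds = (1,) + ts + (N + 1,)
        if all(b - a >= n for a, b in zip(bounds, bounds[1:])):
            count += 1
    return count

print(n_divisions(24, 3, 4))  # 165, the ABAB example above
print(n_divisions(24, 5, 3))  # an extended ABABAB design, for instance
```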
The ABAB design can also be extended by comparing more than two
conditions (e.g., in an ABABACA design; Barlow & Hersen, 1984, p. 177),
and an appropriate randomization test can easily follow this extension.
Besides the probable increment in k, the choice of the test statistic deserves
particular attention. Edgington (1984, 1987) proposed to use the F-statistic
from analysis of variance or an equivalent, but ordinal and nominal test
statistics are also possible (Edgington, 1984; Marascuilo & Busk, 1988).
These same modifications apply to drug-evaluation designs and interac­
tion designs. In a drug-evaluation design (Barlow & Hersen, 1984, p. 183)
a no-drug condition (A), a drug condition (B), and a placebo condition (A1)
are compared (e.g., in an AA1AA1ABA1B design). For the randomization
test, this is equivalent to an extended ABAB design with three conditions.
In an interaction design (Barlow & Hersen, 1984, p. 193), a combination of
conditions is present in some phases, allowing for an assessment of the
additive and non-additive effects of the conditions (e.g., in an
ABAC[BC]B[BC]B design, with [BC] representing a phase where both
condition B and condition C are present). For the randomization test, this is
equivalent to an extended ABAB design with four conditions, with the
combination condition considered as a separate condition. Drug-evaluation
and interaction designs can also be extended by adding several drugs or
conditions in the comparison.
An interesting extension of the ABAB design is the parametric-varia­
tion design, in which several levels of an independent variable are present­
ed consecutively (e.g., in an ABABB(1)B(2)B(3) design, with B, B(1), B(2),
and B(3) representing different levels of an independent variable; Barlow
& Hersen, 1984, p. 179). The randomization procedure for a parametric­
variation randomization test is equivalent to the randomization procedure
of an extended ABAB design. Again, however, particular attention is need­
ed for the choice of the test statistic. Parallel to the test statistic of the cor­
relational trend test (Edgington, 1987), a correlation coefficient could be
used. The correlation between the obtained data and coefficients assigned
to the different levels of the independent variable might be calculated and
compared to the randomization distribution of this test statistic to get the
probability value. The assignment of the coefficients to the levels could be
performed using an ordinal, an interval, or a ratio weighting scheme.
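Such a correlation statistic can be sketched as follows (hypothetical scores and an assumed interval weighting; in the actual test this statistic would be recomputed for every admissible division of the measurement times):

```python
from math import sqrt

def pearson_r(x, y):
    """Plain Pearson correlation, used here as the test statistic."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical scores from an ABABB(1)B(2)B(3) design, three measurement
# times per phase, with interval-scale coefficients for the levels
# (A = 0, B = 1, B(1) = 2, B(2) = 3, B(3) = 4).
scores = [2, 3, 2,  4, 5, 4,  2, 2, 3,  5, 4, 5,  6, 7, 6,  8, 8, 9,  10, 9, 11]
levels = [0, 0, 0,  1, 1, 1,  0, 0, 0,  1, 1, 1,  2, 2, 2,  3, 3, 3,  4, 4, 4]

print(round(pearson_r(scores, levels), 3))
```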
RANDOMIZATION TESTS FOR REPLICATED ABAB DESIGNS
ABAB designs (or extensions and variations) can be replicated with
different subjects, and the data can be integrated by using meta-analytic
techniques (Rosenthal, 1978; White, Rusch, Kazdin, & Hartmann, 1989). If
the single-subject experiments can be assumed statistically independent,
the probability value of the randomization test for the replicated ABAB
design is obtained by performing a randomization test for each single-sub­
ject design separately, following the guidelines described above, and then
combining the resulting probability values (Edgington, 1972a, 1972b;
Edgington & Haller, 1983, 1984).
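As an illustration of combining independent per-subject p-values, the following sketch uses Fisher's well-known product method; Edgington (1972a, 1972b) describes an additive combining method with the same purpose:

```python
from math import exp, log

def fisher_combined_p(pvalues):
    """Fisher's method: X = -2 * sum(ln p_i) is chi-square with 2k degrees
    of freedom under the joint null hypothesis.  For even degrees of freedom
    the chi-square survival function has a closed form, so no statistical
    library is needed: P(X > x) = exp(-x/2) * sum_{i<k} (x/2)^i / i!."""
    k = len(pvalues)
    half = -sum(log(p) for p in pvalues)  # this is x / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return exp(-half) * total

# Three hypothetical per-subject randomization p-values from replicated
# ABAB designs, each analyzed separately as described above.
print(fisher_combined_p([0.0242, 0.10, 0.07]))
```

None of the three values is individually decisive at every conventional level, yet the combined value falls well below .05.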
In their randomization test for replicated ABAB designs, however,
Marascuilo and Busk (1988) proposed to use the LMH test and the associ­
ated "randomization procedure", to calculate a test statistic across subjects,
and to calculate a single randomization distribution of this test statistic.
Because the latter turned out to be laborious for more than two subjects,
they proposed to use a normal approximation. As we argued above, the
LMH test is inappropriate for the ABAB design, and consequently the
Marascuilo and Busk (1988) test is inappropriate for replicated ABAB
designs. The meta-analytic procedure we propose for replicated ABAB
designs differs from the Marascuilo and Busk (1988) test in four respects:
(a) the test statistic is not calculated across subjects but for each subject
separately, (b) there is not a single randomization distribution but as many
as there are subjects in the design, (c) there is no need for a reference to a
normal approximation, and (d) the test is not based on the LMH test but on
the alternative randomization test for the ABAB design described above.
RANDOMIZATION TESTS FOR MULTIPLE-BASELINE DESIGNS
In a multiple-baseline across subjects design, there is usually only one
intervention (without withdrawal), and the data for all subjects are collect­
ed simultaneously with sequential application of the intervention across
subjects (Harris & Jenson, 1985). Single-subject multiple-baseline designs
are possible when the intervention is sequentially aimed at different
behaviors or settings for a single subject (Barlow & Hersen, 1984, pp.
210-243). The graphical analysis of the results obtained in a multiple­
baseline design consists of both between- and within-baseline compar­
isons (Barlow & Hersen, 1984; Harris & Jenson, 1985; Hayes, 1981). If,
for example, an intervention is applied to one of the baselines and pro­
duces a change in it, while little or no change is observed in the other
baselines, better experimental control over historical sources of confound­
ing is obtained than in a singular AB design (Cook & Campbell, 1979).
Notice that in the multiple-baseline literature the term "baseline" is used
to refer to the entire AB sequence.
Wampold and Worsham (1986) used precisely this between-baseline fea­
ture to develop a randomization test for the multiple-baseline design. They
fixed the intervention points for the different baselines and they randomly
selected the order in which the persons, behaviors, or settings were subject­
ed to the treatment. A test statistic for the obtained data is then compared to
a randomization distribution consisting of test statistics computed for all
possible orders in which the persons, behaviors, or settings could have been
subjected to treatment. Hence, in all baselines the intervention points of
other baselines are located to obtain the randomization distribution.
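The Wampold and Worsham (1986) scheme can be sketched for three baselines (hypothetical data and intervention points; the fixed, staggered points stay attached to their positions while the order of the units is permuted):

```python
from itertools import permutations

# Hypothetical multiple-baseline data: three units, ten measurement times
# each, with fixed, staggered intervention points (1-based time of first B).
data = [[3, 2, 3, 6, 7, 6, 7, 8, 7, 8],
        [2, 3, 2, 2, 3, 7, 8, 7, 8, 7],
        [3, 3, 2, 3, 2, 3, 2, 7, 8, 8]]
points = [4, 6, 8]

def stat(order):
    """Sum over baselines of mean(B) - mean(A), with intervention point
    points[j] given to the unit in position j of the order."""
    total = 0.0
    for unit, pt in zip(order, points):
        a, b = data[unit][:pt - 1], data[unit][pt - 1:]
        total += sum(b) / len(b) - sum(a) / len(a)
    return total

observed = stat((0, 1, 2))  # the order actually selected at random
dist = [stat(order) for order in permutations(range(3))]  # 3! = 6 orders
p = sum(s >= observed for s in dist) / len(dist)
print(p)  # 1/6, the smallest value attainable with three units
```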
As Marascuilo and Busk (1988) observed, however, this randomization
test has low power because of the restricted randomization possibilities.
Therefore, they proposed to improve the power by determining the interven­
tion point at random instead of fixing it in advance. The random determina­
tion of intervention points also randomly selects the order in which the per­
sons, behaviors, or settings are subjected to the treatment. The control over
historical sources of confounding is obtained by randomization within a
baseline, and the randomization distribution is obtained not only by locating
the intervention point of a baseline in the other baselines, but by locating all
possible intervention points in a baseline. A multivariate test statistic such
as the difference of the means for the A and the B phase summed over all
baselines may be used, and the p-value of this statistic can be derived by
comparing it to the single randomization distribution of this test statistic.
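The Marascuilo and Busk (1988) refinement can be sketched in the same way (hypothetical data again; each baseline's intervention point is now determined at random, so the randomization distribution crosses all admissible points of all baselines):

```python
from itertools import product

data = [[3, 2, 3, 6, 7, 6, 7, 8, 7, 8],
        [2, 3, 2, 2, 3, 7, 8, 7, 8, 7],
        [3, 3, 2, 3, 2, 3, 2, 7, 8, 8]]
n_min = 3  # at least three measurement times in each phase
admissible = range(n_min + 1, len(data[0]) - n_min + 2)  # points 4 .. 8

def diff(series, pt):
    """mean(B) - mean(A) for one baseline with intervention point pt."""
    a, b = series[:pt - 1], series[pt - 1:]
    return sum(b) / len(b) - sum(a) / len(a)

# Observed statistic for the randomly determined points 4, 6, and 8.
observed = sum(diff(s, pt) for s, pt in zip(data, [4, 6, 8]))

# Multivariate statistic summed over baselines, for every combination of
# admissible intervention points: 5 ** 3 = 125 equally likely divisions.
dist = [sum(diff(s, pt) for s, pt in zip(data, pts))
        for pts in product(admissible, repeat=len(data))]
p_value = sum(s >= observed for s in dist) / len(dist)
print(len(dist), p_value)
```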
Marascuilo and Busk (1988) considered the randomization test with a random intervention point inappropriate for multiple-baseline designs across behaviors or settings because of the probable intercorrelations between the baselines. As Edgington (1992) has argued, however, these intercorrelations cause interpretation difficulties but leave the validity of the test unaffected. This is because of the multivariate approach in constructing a randomization test for multiple-baseline designs: rejection of the randomization null hypothesis is rejection of the overall null hypothesis that there is no effect on any of the baselines. Significant results cannot, however, be interpreted for a specific baseline. The interpretation for a specific baseline can only be unequivocal if the baselines are independent, but then the multiple-baseline design becomes a replicated AB design, and the meta-analytic techniques of the previous paragraph are more appropriate.
Multiple-baseline designs can be extended to include other simultaneous series (e.g., ABA, ABAB, ...). The randomization test can be developed after random determination of the intervention and withdrawal points, and by choosing an appropriate multivariate test statistic. It should be noted that the non-concurrent multiple-baseline design (Barlow & Hersen, 1984, p. 244; Hayes, 1981; Watson & Workman, 1981) is, in fact, a mere replicated AB design (Mansell, 1982) and consequently can be analyzed with the meta-analytic techniques of the previous paragraph.
DISCUSSION
Although randomization tests appear to be valid and versatile in the area of single-case experimentation, Kazdin (1980, 1982, 1984) noted that several features of single-case research would preclude their use. The major obstacles in using randomization tests in single-case experimentation would be: (a) the necessity to have rapidly alternating conditions, (b) the possibility of multiple-treatment interference, (c) computational complexity, and (d) the necessity of randomization (Kazdin, 1980, 1982, 1984).
The first obstacle, however, is based on a false assumption. It is not true that randomization tests can only deal with single-case designs in which the conditions are alternated rapidly. In fact, rapid alternation of conditions was not a requirement in any of the randomization tests described in this paper. Presumably, this false assumption stems from the fact that randomization tests were introduced in the single-case literature by Edgington (1967) as a way to analyze alternating treatments designs.
Multiple-treatment interference is the confounding effect whereby the outcome of one treatment is due, in part, to previous treatments that may have been provided (Campbell & Stanley, 1966; Cook & Campbell, 1979). As such, it is a potential obstacle of a design and not of a statistical test that analyzes the data gathered following the design (Edgington, 1980a). Of course, if it were assumed that randomization tests are associated with alternating treatments designs, the possibility of multiple-treatment interference would seem especially acute with randomization tests.
Randomization tests for alternating treatments designs require a rapidly mounting number of data divisions as the number of measurement times increases, so that even for a fairly small number of measurement times, time-consuming computer programs or even approximations are required (Edgington, 1969, 1987; Kazdin, 1982, 1984). Fortunately, the randomization tests for single-case experimental designs involving random determination of points of change have a more restricted random assignment procedure and consequently can be performed with relatively simple computer programs. A computer program for the tests presented in this article is available from the author, together with a computer program for the random sampling of the k-tuple out of the set of all possible k-tuples of points of change. The wider availability of powerful computers and efficient algorithms ultimately removes this obstacle.
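A minimal version of such a random sampling program might look like the following sketch. This is not the author's original program; the function names and the 0-based indexing convention are this example's assumptions. It enumerates the admissible k-tuples of points of change, given a minimum phase length, and draws one at random with equal probability.

```python
import random
from itertools import combinations

def admissible_tuples(N, k, n_min):
    """All k-tuples of points of change for N measurement times with at
    least n_min measurements in each of the k + 1 phases. A point of
    change p means the next phase starts at measurement index p
    (0-based), so the first phase has p1 measurements, the gaps between
    successive points give the middle phases, and N - pk the last one."""
    return [pts for pts in combinations(range(1, N), k)
            if pts[0] >= n_min
            and all(b - a >= n_min for a, b in zip(pts, pts[1:]))
            and N - pts[-1] >= n_min]

def sample_points_of_change(N, k, n_min):
    # Random assignment: draw one admissible k-tuple at random.
    return random.choice(admissible_tuples(N, k, n_min))
```

The length of the enumerated list agrees with the restricted-case count derived in Appendix A; for example, N = 10 measurement times, k = 3 points of change, and at least 2 measurements per phase yield 10 admissible tuples.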
This leaves the fourth obstacle as the only real one. As Kazdin (1980) put it, "Randomization tests do not merely provide statistical tools to help the investigator analyze data obtained in single-case research but dictate changes in the basic designs that are used" (p. 258). As mentioned above, the fourth step in the performance of a valid randomization test is always the randomization of some aspect of the design, and this has to be done before the data are collected. Randomization tests are thus incompatible with the systematic designs and the response-guided experimentation which are common in single-case research (Edgington, 1983, 1984).
For example, in a changing-criterion design (Barlow & Hersen, 1984, p. 205) the baseline condition (A) is followed by a treatment condition (B) until a preset criterion is met. Then, a condition (B1) is introduced with a more stringent criterion, with treatment applied until this new level is met, and so forth. This changing-criterion element can also be combined with an ABAB element (e.g., in an ABABB1B2B3 design). If the number of measurement times for each phase can be specified a priori, for example, by characterizing each phase by the specification of the criterion but without the requirement that it should be met before it is respecified in the next phase, then it is possible to perform a randomization test for this version of the changing-criterion design, because it is then equivalent to the parametric-variation design. In such a randomized changing-criterion design, generally, even stronger predictions are possible than in a parametric-variation design, because the preset criteria themselves constitute predictions, and a test statistic has to be chosen accordingly (e.g., some measure of goodness-of-fit to the criterion). The underlying rationale of changing-criterion designs, however, is usually the shaping of behavior (Kazdin, 1980), and in shaping the number of measurement times cannot be specified a priori.
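To make the suggested test statistic concrete, here is one conceivable choice, sketched in Python. The text only calls for "some measure of goodness-of-fit to the criterion"; the summed absolute deviation of each phase mean from its preset criterion, and all names below, are this example's own assumptions.

```python
from statistics import mean

def criterion_fit(series, points, criteria):
    """Goodness-of-fit statistic for a randomized changing-criterion
    design (illustrative): the sum over phases of the absolute
    deviation between the phase mean and that phase's preset criterion.

    series:   the measurements, in order
    points:   the points of change (index of the first observation of
              each new phase)
    criteria: the preset criterion level for each phase
    Smaller values indicate the behavior tracked the criteria closely.
    """
    bounds = [0, *points, len(series)]
    phases = (series[a:b] for a, b in zip(bounds, bounds[1:]))
    return sum(abs(mean(ph) - c) for ph, c in zip(phases, criteria))
```

In a randomization test built on this statistic, the p-value would be the proportion of the randomization distribution with a fit at least as good (i.e., a value at least as small) as the obtained one.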
Edgington (1980a) remarked that it is possible to perform a randomization test in designs where the manipulation of conditions is partially dependent on the data (e.g., randomization after stability of the baseline or, in a changing-criterion design, after the criterion is met), but this may reduce the power substantially by restricting the randomization possibilities. Moreover, the basic obstacle remains that it is impossible to perform a randomization test where the manipulation of conditions is entirely dependent on the data.
It must be acknowledged, however, that while it is true that randomization tests dictate a change in the basic designs that are used, namely the introduction of a random element, this change improves the internal and statistical conclusion validity of the results (Cook & Campbell, 1979). Randomization controls for unknown as well as known sources of confounding, and Campbell and Stanley (1966) considered it important enough to serve as a criterion to distinguish experimental from quasi-experimental designs (Edgington, 1984, 1992). Hence, the effort to overcome the fourth obstacle is repaid by increased internal and statistical conclusion validity.
REFERENCES
Barlow, D. H., & Hersen, M. (Eds.). (1984). Single-case experimental designs: Strategies for
studying behavior change (2nd ed.). Oxford: Pergamon Press.
Campbell, D. T., & Stanley, J. C. (1966). Experimental and quasi-experimental designs for
research. Chicago: Rand McNally.
Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review.
Journal of Abnormal and Social Psychology, 65, 145-153.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ:
Erlbaum.
Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45, 1304-1312.
Cook, T. D., & Campbell, D. T. (Eds.). (1979). Quasi-experimentation: Design and analysis
issues for field settings. Chicago: Rand McNally.
Edgington, E. S. (1967). Statistical inference from N = 1 experiments. Journal of Psychology, 65,
195-199.
Edgington, E. S. (1969). Approximate randomization tests. Journal of Psychology, 72, 143-149.
Edgington, E. S. (1972a). An additive method for combining probability values from independent
experiments. Journal of Psychology, 80, 351-363.
Edgington, E. S. (1972b). A normal curve method for combining probability values from indepen­
dent experiments. Journal of Psychology, 82, 85-89.
Edgington, E. S. (1975a). Randomization tests for one-subject operant experiments. Journal of
Psychology, 90, 57-68.
Edgington, E. S. (1975b). Randomization tests for predicted trends. Canadian Psychological
Review, 16, 49-53.
Edgington, E. S. (1980a). Overcoming obstacles to single-subject experimentation. Journal of
Educational Statistics, 5, 261-267.
Edgington, E. S. (1980b). Random assignment and statistical tests for one-subject experiments.
Behavioral Assessment, 2, 19-28.
Edgington, E. S. (1980c). Validity of randomization tests for one-subject experiments. Journal of
Educational Statistics, 5, 235-251.
Edgington, E. S. (1982). Nonparametric tests for single-subject multiple schedule experiments.
Behavioral Assessment, 4, 83-91.
Edgington, E. S. (1983). Response-guided experimentation. Contemporary Psychology, 28,
64-65.
Edgington, E. S. (1984). Statistics and single case analysis. In M. Hersen, R. M. Eisler, & P. M.
Miller (Eds.), Progress in behavior modification: Vol. 16 (pp. 83-120). New York: Raven
Press.
Edgington, E. S. (1987). Randomization tests (2nd ed.). New York: Marcel Dekker.
Edgington, E. S. (1992). Nonparametric tests for single-case experiments. In T. R. Kratochwill &
J. Levin (Eds.), Single case design and analysis (pp. 133-157). Hillsdale, NJ: Erlbaum.
Edgington, E. S., & Haller, O. (1983). A computer program for combining probabilities.
Educational and Psychological Measurement, 43, 835-837.
Edgington, E. S., & Haller, O. (1984). Combining probabilities from discrete probability distribu­
tions. Educational and Psychological Measurement, 44, 265-274.
Gabriel, K. R., & Hall, W. J. (1983). Re-randomization inference on regression and shift effects:
Computationally feasible methods. Journal of the American Statistical Association, 78,
827-836.
Gabriel, K. R., & Hsu, C. F. (1983). Power studies of re-randomization tests, with application to
weather modification experiments. Journal of the American Statistical Association, 78,
766-775.
Goldfried, M. R. (1959). One-tailed tests and "unexpected" results. Psychological Review, 66,
79-80.
Goldman, M. (1960). Some further remarks on one-tailed tests and "unexpected" results.
Psychological Reports, 6, 171-173.
Harris, F. N., & Jenson, W. R. (1985). Comparisons of multiple-baseline across persons designs
and AB designs with replication: Issues and confusions. Behavioral Assessment, 7, 121-127.
Hayes, S. C. (1981). Single case experimental design and empirical clinical practice. Journal of
Consulting and Clinical Psychology, 49, 193-211.
Kazdin, A. E. (1980). Obstacles in using randomization tests in single-case experimentation.
Journal of Educational Statistics, 5, 253-260.
Kazdin, A. E. (1982). Single-case research designs: Methods for clinical and applied settings.
New York: Oxford University Press.
Kazdin, A. E. (1984). Statistical analyses for single-case experimental designs. In D. H. Barlow &
M. Hersen (Eds.), Single-case experimental designs: Strategies for studying behavior change
(2nd ed.) (pp. 285-324). Oxford: Pergamon Press.
Kempthorne, O. (1952). The design and analysis of experiments. New York: Wiley.
Kempthorne, O., & Doerfler, T. E. (1969). The behavior of some significance tests under
experimental randomization. Biometrika, 56, 231-247.
Kimmel, H. D. (1957). Three criteria for the use of one-tailed tests. Psychological Bulletin, 54,
351-353.
Kirk, R. E. (1982). Experimental design: Procedures for the behavioral sciences (2nd ed.). Pacific
Grove, CA: Brooks/Cole.
Levin, J. R., Marascuilo, L. A., & Hubert, L. J. (1978). N = nonparametric randomization tests. In
T. R. Kratochwill (Ed.), Single-subject research: Strategies for evaluating change (pp.
167-196). New York: Academic Press.
Mansell, J. (1982). Repeated direct replication of AB designs. Journal of Behavior Therapy and
Experimental Psychiatry, 13, 261.
Marascuilo, L. A., & Busk, P. L. (1988). Combining statistics for multiple-baseline AB and
replicated ABAB designs across subjects. Behavioral Assessment, 10, 1-28.
May, R. B., Masson, M. E. J., & Hunter, M. A. (1990). Application of statistics in behavioral
research. New York: Harper & Row.
Onghena, P. (1991). Het onderscheidingsvermogen van randomiseringstoetsen bij N=1-experimenten:
De relatieve efficiëntie van AB-proefopzetten en proefopzetten met alternerende
behandelingen [The power of randomization tests for single-case experiments: The relative
efficiency of AB designs and alternating treatments designs]. Unpublished manuscript,
Katholieke Universiteit Leuven, Centrum voor Mathematische Psychologie en Psychologische
Methodologie, Leuven, Belgium.
Onghena, P. (1992, May). The power of randomization tests. Paper presented at the Meeting of the
University Center of Statistics, the NFWO contact group Data Analysis, and the NFWO con­
tact group Probability Theory and its Applications, Leuven, Belgium.
Onghena, P., & Delbeke, L. (1992, July). Power analysis of randomization tests for single-case
designs. International Journal of Psychology, 27, 379.
Rosenthal, R. (1978). Combining results of independent studies. Psychological Bulletin, 85,
185-193.
Sedlmeier, P., & Gigerenzer, G. (1989). Do studies of statistical power have an effect on the
power of the studies? Psychological Bulletin, 105, 309-316.
Wampold, B. E., & Furlong, M. J. (1981). Randomization tests in single-subject designs:
Illustrative examples. Journal of Behavioral Assessment, 3, 329-341.
Wampold, B. E., & Worsham, N. L. (1986). Randomization tests for multiple-baseline designs.
Behavioral Assessment, 8, 135-143.
Watson, P. J., & Workman, E. A. (1981). Non-concurrent multiple-baseline across individuals
design: An extension of the traditional multiple-baseline design. Journal of Behavior Therapy
and Experimental Psychiatry, 12, 257-259.
White, D. M., Rusch, F. R., Kazdin, A. E., & Hartmann, D. P. (1989). Applications of
meta-analysis in individual-subject research. Behavioral Assessment, 11, 281-296.
APPENDIX A
THE CARDINALITY THEOREM FOR SINGLE-CASE DESIGNS
INVOLVING RANDOM DETERMINATION OF POINTS
OF CHANGE
The Unrestricted Case
The total number of possible data divisions for a single-case design involving random determination of k points of change with N measurement times is

$\binom{N+k}{k}$.   (1)
Proof. The determination of k points of change with N measurement times is equivalent to filling k positions with slashes and N positions with dashes on a line with N + k positions. There are $\binom{N+k}{k}$ ways of selecting k of the N + k positions to be filled by slashes, so there are $\binom{N+k}{k}$ ways of determining k points of change with N measurement times. Finally, each k-tuple uniquely defines a data division, so there are $\binom{N+k}{k}$ possible data divisions.
Remark. Because there are no restrictions on the position of the k points of change, this formula assumes that it is possible to have two or more points of change in a row or to begin or end with a point of change, implying no change of condition or compressing the design. For example, suppose we want to collect N = 24 measurements according to an ABAB design. According to Formula (1) the total number of possible data divisions is $\binom{N+k}{k} = \binom{27}{3} = 2925$, because there are k = 3 points of change in an ABAB design. This includes, however (using the slash-dash notation of the proof):

----//--------------------/

and other AA-designs, and

-----------------///-------

and other AB-designs. In order to avoid these and other aberrations it is necessary to restrict the possible positions of the k points of change.
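Formula (1) and the example count can be checked mechanically. The following sketch (Python; the function names are illustrative, not from the paper) computes the formula and cross-checks it by direct enumeration, using the observation that a data division is a multiset of k cut positions chosen from the N + 1 gaps before, between, and after the measurements.

```python
from math import comb
from itertools import combinations_with_replacement

def n_divisions_unrestricted(N, k):
    # Formula (1): k indistinguishable points of change among N
    # measurement times, with coincident and boundary placements allowed.
    return comb(N + k, k)

def enumerate_divisions(N, k):
    # Brute-force cross-check: count the multisets of k cut positions
    # drawn from the N + 1 available gaps.
    return sum(1 for _ in combinations_with_replacement(range(N + 1), k))
```

For the ABAB example in the text (N = 24, k = 3), both functions return 2925.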
The Restricted Case
The total number of possible data divisions for a single-case design involving random determination of k points of change with N measurement times and at least n measurement times in each phase is

$\binom{N - n(k+1) + k}{k}$.   (2)
Proof. If there are k points of change, there are m = k + 1 phases. Let $n_i'$ represent the number of measurement times in phase i and let n represent the minimum number of measurement times in each phase. Hence, $n_i'$ and n are both integers, with $n_i' \geq n$ for i = 1, ..., m, and $n_1' + n_2' + \cdots + n_i' + \cdots + n_m' = N$. Define the ordered m-tuple $(n_1', n_2', \ldots, n_i', \ldots, n_m')$ to be a data division of N measurement times into m phases. Then the mapping

$(n_1', n_2', \ldots, n_i', \ldots, n_m') \mapsto (n_1' - n, n_2' - n, \ldots, n_i' - n, \ldots, n_m' - n)$

is a bijection from the set A of data divisions of N measurement times into m phases with at least n measurement times in each phase onto a set B of unrestricted data divisions of N - mn measurement times into m phases. According to Equation (1), the cardinality of set B, #B, is equal to $\binom{N - nm + k}{k}$ or $\binom{N - n(k+1) + k}{k}$. Because of the bijection #A = #B, and consequently #A = $\binom{N - n(k+1) + k}{k}$, as claimed.
Remark. With $n \geq 1$ the aberrations are excluded. To have an ABAB design, several measurements are needed in each phase, hence $n \geq 2$ and usually $n \geq 4$. In single-case research it is, however, not uncommon to require a longer baseline or a longer final phase. Therefore, it may be of interest to generalize Equation (2) to allow for phases with unequal restrictions.
The Generalized Case
The total number of possible data divisions for a single-case design involving random determination of k points of change with N measurement times and at least $n_i$ measurement times in phase i, with $n_i$ not necessarily equal to $n_j$, for i, j = 1, ..., m, and $i \neq j$, is

$\binom{N - \sum_{i=1}^{k+1} n_i + k}{k}$.   (3)
Proof. If there are k points of change, there are m = k + 1 phases. Let $n_i'$ represent the number of measurement times and $n_i$ the minimum number of measurement times in phase i. Hence, $n_i'$ and $n_i$ are both integers, with $n_i' \geq n_i$ for i = 1, ..., m, and $n_1' + n_2' + \cdots + n_i' + \cdots + n_m' = N$. Define the ordered m-tuple $(n_1', n_2', \ldots, n_i', \ldots, n_m')$ to be a data division of N measurement times into m phases. Then the mapping

$(n_1', n_2', \ldots, n_i', \ldots, n_m') \mapsto (n_1' - n_1, n_2' - n_2, \ldots, n_i' - n_i, \ldots, n_m' - n_m)$

is a bijection from the set A of data divisions of N measurement times into m phases with at least $n_i$ measurement times in phase i onto a set B of unrestricted data divisions of $N - \sum_{i=1}^{m} n_i$ measurement times into m phases. According to Equation (1), the cardinality of set B, #B, is equal to Formula (3). Because of the bijection #A = #B, and consequently #A is equal to Formula (3), as claimed.
Remark. If $n_i = n_j = n$ for all i, j = 1, ..., m (with $i \neq j$), then the generalized case specializes to the restricted case. If in addition $n_i = n_j = 0$ for all i, j = 1, ..., m (with $i \neq j$), then the generalized case specializes to the unrestricted case.
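The generalized count is a one-liner to compute. The sketch below (Python; the function name and argument layout are this example's own) implements Formula (3); with equal minima it reproduces the restricted case, and with all minima zero the unrestricted case, mirroring the specializations noted above.

```python
from math import comb

def n_divisions(N, mins):
    """Formula (3): number of data divisions for N measurement times
    with at least mins[i] measurement times in phase i. The number of
    points of change is k = len(mins) - 1 (one fewer than the phases)."""
    k = len(mins) - 1
    return comb(N - sum(mins) + k, k)
```

For example, an ABAB design with N = 24 and no restrictions, `n_divisions(24, [0, 0, 0, 0])`, gives the 2925 divisions of the unrestricted case, while requiring at least 2 measurements per phase reduces the count to $\binom{19}{3}$.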
RECEIVED: 19 AUGUST 1991
FINAL ACCEPTANCE: 5 JULY 1992