Behavioral Assessment, Vol. 14, pp. 153-171, 1992. Printed in the USA. All rights reserved. 0191-5401/92 $5.00 + .00. Copyright © 1992 Pergamon Press Ltd.

Randomization Tests for Extensions and Variations of ABAB Single-Case Experimental Designs: A Rejoinder

PATRICK ONGHENA
Katholieke Universiteit Leuven, Belgium

Randomization tests have been developed for several single-case experimental designs. It is argued, however, that the randomization tests developed by Levin, Marascuilo, and Hubert (1978) for the ABAB design and by Marascuilo and Busk (1988) for replicated ABAB designs across subjects are inappropriate. An alternative randomization procedure for the ABAB design is presented, and the appropriate corresponding randomization test is derived. It is shown how this alternative procedure has to be adapted to allow for statistical analyses of extended ABAB, parametric-variation, drug-evaluation, interaction, replication, and multiple-baseline designs. Finally, some limitations of the use of randomization tests for single-case designs are discussed, with particular reference to changing-criterion designs.

Key Words: randomization tests; ABAB designs; within-subject designs; statistical analysis; changing-criterion designs; nonparametric statistical tests; meta-analysis; time-series designs.

Randomization theory provides a category of statistical tests that is valid for single-case experimental designs (Edgington, 1967, 1980c). Furthermore, these "randomization tests" are extremely versatile in this area as a consequence of their potential for tailoring the tests to the specific designs used (Edgington, 1987). Randomization tests have been developed for (a) AB and ABA designs (Edgington, 1975a), (b) ABAB designs and extensions (Levin, Marascuilo, & Hubert, 1978), (c) alternating treatments designs (Edgington, 1967, 1980b), (d) multiple schedule designs (Edgington, 1982), (e) multiple-baseline designs (Wampold & Worsham, 1986), and (f) replicated AB and ABAB designs across subjects (Marascuilo & Busk, 1988).

The author wishes to thank Jan Beirlant, Paul De Boeck, Luc Delbeke, Eugene Edgington, Joel Levin, Gert Storms, and three anonymous reviewers for their helpful comments on an earlier draft of the article. The author is Research Assistant of the National Fund for Scientific Research (Belgium). Correspondence concerning this article should be addressed to Patrick Onghena, K.U. Leuven, Department of Psychology, Tiensestraat 102, B-3000 Leuven, Belgium.

In the present article, it will be argued that the Levin et al. (1978) randomization test (abbreviated in the following as the LMH test) is inappropriate for the ABAB design, and that the randomization test developed by Marascuilo and Busk (1988) is inappropriate for the replicated ABAB design across subjects because it is based on the LMH test. A more appropriate, alternative test will be presented, and it will be shown how this alternative test has to be modified to analyze replication and multiple-baseline designs, as well as single-case experimental designs that have hitherto received little attention in the randomization test literature: extended ABAB designs, parametric-variation designs, drug-evaluation designs, and interaction designs.

THE LMH TEST FOR ABAB DESIGNS

According to Kazdin (1982), ABAB designs are the most basic experimental designs in single-case research.
They consist of a family of designs in which repeated measurements are taken for a single unit (e.g., a person) under two different alternating conditions in four phases, with several measurements in each phase. In the first phase, repeated measurements are taken under control conditions (baseline or first A phase); in the second, under experimental conditions (intervention or first B phase); in the third, under control conditions again (withdrawal or second A phase); and in the fourth, under experimental conditions again (second intervention or second B phase; see also Barlow & Hersen, 1984, pp. 157-166).

Because of the nonindependence of the measurements, parametric statistical tests based on random sampling are inappropriate for single-subject designs in general, and ABAB designs in particular (Edgington, 1967, 1975a; Levin et al., 1978). Therefore, Levin et al. (1978) proposed using a nonparametric randomization test to analyze such designs. They wanted, however, a close correspondence between the random sampling model and the randomization model. They used phase mean scores to get "approximately independent, or at least uncorrelated, random variables that are identically distributed even though the population model itself generates dependent observations" (p. 177).

Consider, for example, the hypothetical data collected in a single-subject ABAB design in Table 1. To perform the LMH test, the mean scores of the phases are regarded as a population. The null hypothesis that there is no difference between the A and the B condition is assessed by referring to the conditional sampling distribution of a test statistic associated with all theoretically possible assignments of these means to the conditions.

TABLE 1
Hypothetical Data Used to Illustrate the Levin, Marascuilo, and Hubert Test for an ABAB Design With 24 Measurement Times (MT)

MT       Condition    Phase Mean Score
1-6      A            4.0
7-12     B            2.0
13-18    A            3.0
19-24    B            1.0

For the data in Table 1, the mean values of the phases are 4.0, 2.0, 3.0, and 1.0, respectively. The number of possible assignments of these means to Conditions A and B, under the assumption that two must go to each condition, is given by the binomial coefficient $\binom{4}{2} = 6$. One could assign the following pairs to one of the conditions (let us say A), where order is not important: 4.0 and 2.0, 4.0 and 3.0, 4.0 and 1.0, 2.0 and 3.0, 2.0 and 1.0, and 3.0 and 1.0. For each of the six possible assignments to a particular condition (here A), the sum of the two means is then computed as the test statistic, and based on these sums a sampling distribution is defined (Table 2). For the observed assignment, the sum of the means associated with Condition A was 4.0 + 3.0 = 7.0. The probability (p) that the value of the test statistic for one of the assignments is as large as the observed test statistic in the generated distribution is 1/6 = .1667 for a one-tailed test. If a significance level of α = .05 was chosen, the null hypothesis of no difference between Conditions A and B is not rejected.

TABLE 2
Randomization Distribution for the Levin, Marascuilo, and Hubert Test Applied to the Hypothetical Data of Table 1

Means Assigned to Condition A    Sum of Means    Probability
4.0 and 3.0                      7.0             1/6
4.0 and 2.0                      6.0             1/6
4.0 and 1.0                      5.0             1/6
3.0 and 2.0                      5.0             1/6
3.0 and 1.0                      4.0             1/6
2.0 and 1.0                      3.0             1/6
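As a computational aside, the LMH procedure just described is easy to reproduce. Below is a minimal sketch (the function and variable names are ours, not Levin et al.'s): it enumerates the $\binom{4}{2} = 6$ assignments of the phase means to Condition A and reports the one-tailed probability of a sum at least as large as the observed 7.0.

```python
from itertools import combinations

phase_means = [4.0, 2.0, 3.0, 1.0]  # A1, B1, A2, B2 means from Table 1

def lmh_p_value(means, observed_a=(0, 2)):
    """One-tailed LMH test: every pair of the four phase means could have
    been assigned to Condition A; the test statistic is the pair's sum."""
    observed = sum(means[i] for i in observed_a)       # 4.0 + 3.0 = 7.0
    sums = [a + b for a, b in combinations(means, 2)]  # C(4,2) = 6 sums
    return sum(s >= observed for s in sums) / len(sums)

print(lmh_p_value(phase_means))  # 1/6 = 0.1667, not significant at .05
```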
THE INAPPROPRIATENESS OF THE LMH TEST FOR ABAB DESIGNS

Although the LMH test has a certain appeal as a descriptive device and is relatively simple to perform, it is not an appropriate randomization test to allow for causal inferences. To the extent that the LMH test is presented as a randomization test for a systematic ABAB design (i.e., first an A condition, then a B condition, then an A condition, and finally a B condition), it is invalid, because there has to be random assignment of measurement times to conditions if a valid randomization test is to be performed (Edgington, 1980b, 1980c, 1984, 1987). Although Levin et al. (1978) defended the application of randomization tests to systematic ABAB designs, they also suggested using the LMH test for randomized ABAB designs:

For example, in the baseline (A) versus treatment (B) situation an initial A phase could be designated as an adaptation or "warm-up" phase, not to be included in the analysis. It might then be possible to randomly assign two A and two B conditions to four successive phases to constitute the analysis proper. In other instances, A and B may represent two different experimental treatments (rather than a baseline and a treatment), in which case random assignment may be more reasonable. (Note that under the proper randomization scheme, it is possible to come up with two consecutive phases of the same type.) (p. 175)

The last sentence in parentheses, however, is crucial. It implies, a point that Levin et al. (1978) seem to ignore, that the LMH test assumes random assignments that are inconsistent with the ABAB sequence, and hence that it is not an appropriate randomization test for the randomized ABAB design. For a randomization test to be valid, the permutation of the data used to obtain the distribution of test statistics must correspond to the particular type of random assignment that was actually used (Edgington, 1980b, 1980c). This means that if a randomization procedure had been performed that corresponded to the permutation of the means in the LMH test, an ABAB design would have been equally likely as an AABB, ABBA, BAAB, BABA, or BBAA design. Hence, an investigator who wanted to use an ABAB design may, as a result of this randomization procedure, have arrived at an AB (AABB) or an ABA (ABBA) design instead.

The LMH test is not well suited to analyze ABAB designs because it is an analogue of the randomization test that is appropriate for the alternating treatments design, in which there is one potential alternation of condition per measurement time (Edgington, 1967, 1980b). It seems that Levin et al. (1978) and Marascuilo and Busk (1988) consider the ABAB design as if it were an alternating treatments design with two different conditions, four measurement times, and two measurement times for each condition, the only exception being that for the ABAB design the phase means have to be used because there is slower alternation and there are more measurements per phase.

It could be argued that, while it is true that the LMH test is inappropriate for ABAB designs, it is a valid test for randomized XXXX designs, where X could be an A or a B phase, and two A and two B phases can be assigned. Although this design has not been used yet, the argument is correct. There is, however, still another problem: the LMH test has zero power for randomized XXXX designs at any reasonable significance level (α = .10 or lower). By definition, a randomized XXXX design consists of four phases, and consequently the p-value can never be smaller than .1667.
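The zero-power point is easy to verify by enumeration; a sketch (assuming, as in the quotation above, that two A and two B phases are assigned at random to four successive phases):

```python
from itertools import permutations

# The six equally likely phase orders under the LMH randomization scheme.
orders = sorted(set(permutations("AABB")))
print(["".join(o) for o in orders])
# ['AABB', 'ABAB', 'ABBA', 'BAAB', 'BABA', 'BBAA']
print(1 / len(orders))  # 0.1667: the smallest one-tailed p-value attainable
```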
Levin et al. (1978) and Marascuilo and Busk (1988) realized this limitation, and they proposed to test at the α = .1667 level for a one-tailed test or at the α = .3333 level for a two-tailed test, to strengthen the alternative hypothesis (e.g., by making more specific predictions about the rank ordering of the mean values), to add phases to the design, or to add subjects to the study. These suggestions, however, only confirm that the basic LMH test has zero power at α = .10, α = .05, or lower. The LMH test also lacks power for systematic ABAB designs, as in the example with the data in Table 1. The validity consideration is, however, more fundamental: even if the LMH test were powerful (as it can be made in its extensions), there would still be the problem of its validity.

In sum, the LMH test is inappropriate for ABAB designs because, to the extent that it is proposed for systematic ABAB designs, it is invalid, and to the extent that it is proposed for randomized designs, it has zero power and is, strictly speaking, not even a test for ABAB designs.

AN APPROPRIATE RANDOMIZATION TEST FOR THE ABAB DESIGN

A valid randomization test is performed in nine consecutive steps: (a) choice of the alternative hypothesis, (b) specification of the level of significance α, (c) specification of the number of measurement times N, (d) randomization of some aspect of the design, (e) choice of a sensitive test statistic, (f) data collection, (g) computation of the test statistic from the obtained data, (h) formation of the randomization distribution, and (i) comparison of the observed test statistic to the randomization distribution. It is essential for the validity of the test that the first five steps are performed before data are collected (Edgington, 1980b, 1980c, 1984, 1987; Wampold & Worsham, 1986). Because the previous section showed that invalidity is the LMH test's most fundamental problem, the alternative test will be presented closely following these nine steps, to assure its validity. In addition, the alternative test will be described with the same data and level of significance that were used in the illustration of the LMH test, and it will be shown to be a powerful randomization test.

The Null and Alternative Hypotheses

The null hypothesis of a randomization test for single-case experiments is always that there is no differential effect of the conditions for any of the measurement times. The alternative hypothesis can be chosen. As in statistical tests for group experiments with two conditions, it is possible to have a directional (one-tailed) or a nondirectional (two-tailed) alternative. Furthermore, it is possible to make even stronger predictions about trends (Edgington, 1975b, 1987; Levin et al., 1978). Generally speaking, the more specific the predictions, the more powerful the test becomes. The trade-off is, of course, that the test becomes completely inappropriate if the data are not in the predicted direction. As some critics of one-tailed t tests might say, one should only use a directional test if any treatment differences of a kind other than the predicted kind are regarded as either irrational or irrelevant (Edgington, 1975b; Goldfried, 1959; Goldman, 1960; Kimmel, 1957; Levin et al., 1978).
In the following, the randomization test for the ABAB design will be described for a one-tailed alternative (A > B), parallel to the illustration of the LMH test. In applied behavior analysis, it is usually reasonable to specify the direction of behavior change (Kazdin, 1982; Wampold & Worsham, 1986).

The Level of Significance and the Number of Measurement Times

In the Neyman-Pearson decision-theoretical framework of hypothesis testing, the level of significance and the number of observations must be specified before the data are collected (Cohen, 1988). Together with the test statistic, the design, and the alternative hypothesis, they determine the power of the test. In parametric statistical group analyses, the issue of Type II errors is frequently, and wrongly, neglected (Cohen, 1962, 1990; Sedlmeier & Gigerenzer, 1989), and this neglect is even more prevalent in statistical analyses of single-case data. Occasionally, qualitative references are made to sensitivity considerations with respect to the choice of the test statistic, the randomization procedure, the alternative hypothesis, or the level of significance (e.g., Barlow & Hersen, 1984; Edgington, 1987; Kazdin, 1982; Levin et al., 1978), but there has never been an attempt to quantify the relationship between the number of measurement times and the power of randomization tests for single-case designs.

This quantification is straightforward, however, if there exists a group design with the same randomization structure as the single-case design. The power of the randomization test for the single-case design is then equal to the power of the randomization test for the group design (Onghena, 1991; Onghena & Delbeke, 1992). If the same test statistic is used, the power of the randomization test for the group design itself can be approximated using the standard tables of the corresponding random sampling test (May, Masson, & Hunter, 1990, pp. 313-314) or else can be derived by computer-intensive methods (Gabriel & Hall, 1983; Gabriel & Hsu, 1983; Kempthorne, 1952; Kempthorne & Doerfler, 1969; Onghena, 1992). For example, the randomization procedure for the alternating treatments design as presented by Edgington (1967) has the same structure as the completely randomized design (using Kirk's [1982] terminology), the only difference being that in the former the random assignment to conditions concerns the measurement times and in the latter it concerns the subjects.

If the randomization procedure of the single-case design has an unusual group design counterpart, then the power of the randomization test for the single-case design has to be derived directly by computer-intensive methods (Onghena, 1991, 1992; Onghena & Delbeke, 1992), after specifying a more restrictive alternative hypothesis (e.g., a specific additive or multiplicative effect for all measurement times under the experimental condition, as proposed by Gabriel & Hall, 1983; Gabriel & Hsu, 1983; Kempthorne, 1952; Kempthorne & Doerfler, 1969; Onghena, 1992). For example, the randomization procedure for an AB design concerns the intervention point (Edgington, 1975a), and the group design counterpart would be to assign the first k selected subjects to the A condition and the others to the B condition, with k determined at random. This is unusual, and it would be very inefficient for a group design. The randomization procedure for an ABAB design, to be presented next, is also of this type.
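To make the computer-intensive approach concrete, here is a minimal Monte Carlo sketch for the AB design with a random intervention point. Everything in it beyond the design itself is an assumption for illustration: a standard-normal baseline series, an additive treatment effect, a one-tailed B > A alternative, and the particular function names and parameter values.

```python
import random

def ab_p_value(scores, t1, n=4):
    """One-tailed randomization test for an AB design whose intervention
    point t1 (the first B measurement time) was selected at random; the
    test statistic is mean(B) - mean(A)."""
    N = len(scores)
    def stat(t):
        a, b = scores[:t - 1], scores[t - 1:]
        return sum(b) / len(b) - sum(a) / len(a)
    points = range(n + 1, N - n + 2)   # admissible intervention points
    observed = stat(t1)
    dist = [stat(t) for t in points]
    return sum(s >= observed for s in dist) / len(dist)

def ab_power(effect, N=30, n=4, alpha=.05, reps=2000, seed=1):
    """Monte Carlo power under an additive effect in the B phase. Note that
    with N = 24 and n = 4 there are only 17 data divisions, so p can never
    drop below 1/17 = .0588 and the power at alpha = .05 is zero; N = 30
    gives 23 divisions and a minimum p of 1/23 = .0435."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        t1 = rng.choice(range(n + 1, N - n + 2))   # randomize the design
        scores = [rng.gauss(0, 1) + (effect if i + 1 >= t1 else 0.0)
                  for i in range(N)]
        rejections += ab_p_value(scores, t1, n) <= alpha
    return rejections / reps

print(ab_power(effect=1.5))  # estimated power for a large assumed effect
```

The note inside ab_power also illustrates why the level of significance and N must be fixed together in advance: the attainable p-values are determined by the number of admissible data divisions.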
To make it possible to work with the data of Table 1, suppose that, on the basis of practical and power considerations, a level of significance α = .05 and N = 24 measurement times are chosen.

Randomization

The most appropriate aspect of the ABAB design that may be subjected to randomization is the determination of the three points of change: the time of the first intervention ($t_1$), the time of the withdrawal ($t_2$), and the time of the second intervention ($t_3$). The possible times of the withdrawal are conditional upon the time of the first intervention, and the possible times of the second intervention are conditional upon the time of the withdrawal. Furthermore, the random selection of the three points of change has to be restricted to rule out the possibility of having too few (or even no) measurement times in one of the phases.

This randomization procedure makes possible a randomization test that differs radically from the LMH test. It follows the Edgington (1975a, 1980b) model for AB and ABA designs instead of the Edgington (1967) model for alternating treatments designs. In fact, Edgington (1975a, 1980b) already suggested this randomization procedure for extensions of AB and ABA designs.

To continue the illustration, suppose an ABAB design is chosen with at least n = 4 measurements in each phase (N = 24, determined in the preceding step). Then it can be shown (see Appendix A, Equation 2) that there are 165 possible data divisions according to this randomization procedure. To make all data divisions equally probable, it is necessary to enumerate all possible triplets ($t_1$, $t_2$, $t_3$) and then select one triplet at random. In the example, random selection of a triplet out of the set of 165 triplets {(5,9,13), (5,9,14), ..., (5,9,21), (5,10,14), ..., (5,10,21), ..., (6,10,14), ..., (6,10,21), ..., (13,17,21)} results in a probability of 1/165 for each of the data divisions. Suppose that, for the example, the triplet (7,13,19) was selected at random. Notice that the resulting data division parallels the design in the illustration of the LMH test (Table 1).
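Under these assumptions (N = 24, at least four measurements per phase), the set of admissible triplets can be enumerated directly, and the random selection of step (d) is then a single draw from that set; a sketch (variable names are ours):

```python
import random

N, n = 24, 4  # 24 measurement times, at least 4 per phase

# All admissible triplets (t1, t2, t3) of points of change, each ti being
# the first measurement time of a new phase and each phase keeping at
# least n measurement times.
triplets = [(t1, t2, t3)
            for t1 in range(n + 1, N - 3 * n + 2)
            for t2 in range(t1 + n, N - 2 * n + 2)
            for t3 in range(t2 + n, N - n + 2)]

print(len(triplets))              # 165, in agreement with Appendix A, Eq. (2)
print(triplets[0], triplets[-1])  # (5, 9, 13) and (13, 17, 21)
print(random.choice(triplets))    # the actual random assignment (step d)
```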
The Test Statistic

Any test statistic that is sensitive to the expected effect of the treatment is acceptable. If one expects a difference in level between the two conditions, Condition A having the highest level, then the most straightforward test statistic is $T = (A_1 + A_2) - (B_1 + B_2)$, the difference between the sums of the phase means for the baseline and treatment phases. For the example, this is the statistic chosen.

For a difference in level, other interesting test statistics have been proposed for particular predictions: test statistics for immediate but decaying effects (Edgington, 1975a) and for delayed effects (Wampold & Furlong, 1981), test statistics corrected for trends (Edgington, 1980b, 1980c, 1984, 1987) and for outliers (Marascuilo & Busk, 1988), and ordinal or nominal test statistics (Edgington, 1984; Marascuilo & Busk, 1988). It is also possible to test for changes in slope (Wampold & Furlong, 1981), for specific linear or nonlinear trends (Edgington, 1975b; Levin et al., 1978), and for changes in variability (Edgington, 1975a).

Data Collection, the Observed Test Statistic, and the Randomization Distribution

Suppose the data in Table 1 were obtained. The test statistic T for these data (called the observed test statistic) is (4.0 + 3.0) - (2.0 + 1.0) = 4.0. Then T is computed for all the data divisions that could have been selected following the randomization procedure. If the null hypothesis is true, each measurement is independent of the condition that is implemented, and any difference between the measurements under the two conditions is due solely to a difference in factors associated with the times the two treatments were administered. Hence, each measurement time can be characterized by its own measurement, and consequently, under the null hypothesis, the random division of the measurement times into phases can be conceptualized as the random division of the measurements into phases. Because each division of the measurements is associated with a value of T, this results in a distribution of T under the null hypothesis (the randomization distribution). For the obtained data, the six highest and the six lowest values of T in this distribution are shown in Table 3.

TABLE 3
Six Highest and Six Lowest Values of the Test Statistic T^a in the Randomization Distribution for the Alternative Test Applied to the Hypothetical Data of Table 1

Triplet of Measurement Times for Change    T       Probability
(7,14,19)                                  4.20    1/165
(7,11,19)                                  4.13    1/165
(7,15,19)                                  4.13    1/165
(7,13,19)                                  4.00    1/165
(7,14,18)                                  3.96    1/165
(6,14,19)                                  3.95    1/165
(12,17,21)                                 1.24    1/165
(5,10,14)                                  1.20    1/165
(11,17,21)                                 1.18    1/165
(12,16,21)                                 1.09    1/165
(11,16,21)                                 1.05    1/165
(13,17,21)                                 1.00    1/165

^a T = (A_1 + A_2) - (B_1 + B_2).

The Probability Value

As a last step, the observed test statistic is compared to the randomization distribution. If the proportion of test statistics that is as extreme as the observed test statistic (the probability value or p-value) is smaller than or equal to α, the null hypothesis is rejected and the predicted effect in the alternative hypothesis is accepted.

In the example, the proportion of test statistics that is as large as (we had a directional alternative hypothesis) the observed test statistic is p = 4/165 = .0242. This is smaller than the predetermined α = .05, and consequently the null hypothesis of no difference between the A and B conditions is rejected, in favor of the alternative hypothesis that for some (perhaps all) of the measurement times the A condition has resulted in a higher score than if the B condition had been given.

In sum, the alternative randomization procedure results in a valid randomization test for the ABAB design that has higher power than the LMH test with its fixed six randomization possibilities. The superior power was illustrated using the same data and level of significance: the null hypothesis was rejected with the alternative test, whereas this was not possible with the LMH test.
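The whole procedure can be sketched in a few lines. Because only the phase means of Table 1 are used in the text, the series below is hypothetical: it is constructed to reproduce those means (4.0, 2.0, 3.0, and 1.0), so the observed statistic is again T = 4.0, but the p-value it prints is for these made-up scores and need not equal the article's 4/165.

```python
def abab_p_value(scores, t1, t2, t3, n=4):
    """One-tailed randomization test for an ABAB design randomized through
    its three points of change;
    T = (mean A1 + mean A2) - (mean B1 + mean B2)."""
    N = len(scores)

    def T(a, b, c):
        phases = [scores[:a - 1], scores[a - 1:b - 1],
                  scores[b - 1:c - 1], scores[c - 1:]]
        m = [sum(p) / len(p) for p in phases]
        return (m[0] + m[2]) - (m[1] + m[3])

    triplets = [(a, b, c)
                for a in range(n + 1, N - 3 * n + 2)
                for b in range(a + n, N - 2 * n + 2)
                for c in range(b + n, N - n + 2)]
    observed = T(t1, t2, t3)
    dist = [T(*t) for t in triplets]       # the randomization distribution
    return sum(v >= observed for v in dist) / len(dist)

# Hypothetical scores reproducing Table 1's phase means 4.0, 2.0, 3.0, 1.0.
scores = ([4, 5, 3, 4, 5, 3] + [2, 3, 1, 2, 2, 2] +
          [3, 4, 2, 3, 3, 3] + [1, 2, 0, 1, 1, 1])
print(abab_p_value(scores, 7, 13, 19))     # observed T = 4.0
```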
RANDOMIZATION TESTS FOR EXTENDED ABAB, DRUG-EVALUATION, INTERACTION, AND PARAMETRIC-VARIATION DESIGNS

The ABAB design can be extended by adding one or more phases after the basic ABAB sequence (Barlow & Hersen, 1984, p. 175). A randomization test for such an extended ABAB design follows the nine steps described above. The randomization involves the k points of change; that is, the random sampling of a k-tuple of points of change out of the set of all possible k-tuples (see Appendix A for the computation of the cardinality of this set). Finally, a test statistic sensitive to the expected effect of the treatment (e.g., the difference between the sum of the means for the A condition and the sum of the means for the B condition) is calculated for the obtained data and compared with the randomization distribution of this test statistic to obtain the probability value.

The ABAB design can also be extended by comparing more than two conditions (e.g., in an ABABACA design; Barlow & Hersen, 1984, p. 177), and an appropriate randomization test can easily follow this extension. Besides the probable increase in k, the choice of the test statistic deserves particular attention. Edgington (1984, 1987) proposed using the F statistic from analysis of variance or an equivalent, but ordinal and nominal test statistics are also possible (Edgington, 1984; Marascuilo & Busk, 1988).

These same modifications apply to drug-evaluation designs and interaction designs. In a drug-evaluation design (Barlow & Hersen, 1984, p. 183), a no-drug condition (A), a drug condition (B), and a placebo condition (A1) are compared (e.g., in an AA1AA1ABA1B design). For the randomization test, this is equivalent to an extended ABAB design with three conditions. In an interaction design (Barlow & Hersen, 1984, p. 193), a combination of conditions is present in some phases, allowing for an assessment of the additive and nonadditive effects of the conditions (e.g., in an ABAC[BC]B[BC]B design, with [BC] representing a phase in which both Condition B and Condition C are present). For the randomization test, this is equivalent to an extended ABAB design with four conditions, with the combination condition considered as a separate condition. Drug-evaluation and interaction designs can also be extended by adding several drugs or conditions to the comparison.

An interesting extension of the ABAB design is the parametric-variation design, in which several levels of an independent variable are presented consecutively (e.g., in an ABABB(1)B(2)B(3) design, with B, B(1), B(2), and B(3) representing different levels of an independent variable; Barlow & Hersen, 1984, p. 179). The randomization procedure for a parametric-variation randomization test is equivalent to the randomization procedure of an extended ABAB design. Again, however, particular attention is needed for the choice of the test statistic. Parallel to the test statistic of the correlational trend test (Edgington, 1987), a correlation coefficient could be used: the correlation between the obtained data and coefficients assigned to the different levels of the independent variable is calculated and compared to the randomization distribution of this test statistic to get the probability value. The assignment of the coefficients to the levels could be performed using an ordinal, an interval, or a ratio weighting scheme.
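A sketch of such a correlational statistic follows. For simplicity it assumes four consecutive phases of six measurements at increasing levels of the independent variable, with interval weights 0 to 3; the scores, weights, and function name are ours. The randomization distribution would then be formed exactly as in the ABAB example, recomputing the weight vector for each admissible tuple of change points.

```python
def corr(x, y):
    """Pearson correlation, written out to keep the sketch self-contained."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

weights = [0] * 6 + [1] * 6 + [2] * 6 + [3] * 6      # interval weighting
scores = [2, 3, 2, 3, 2, 3, 4, 5, 4, 5, 4, 5,
          6, 7, 6, 7, 6, 7, 8, 9, 8, 9, 8, 9]        # hypothetical data
print(round(corr(scores, weights), 3))               # the observed statistic
```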
RANDOMIZATION TESTS FOR REPLICATED ABAB DESIGNS

ABAB designs (or extensions and variations) can be replicated with different subjects, and the data can be integrated by using meta-analytic techniques (Rosenthal, 1978; White, Rusch, Kazdin, & Hartmann, 1989). If the single-subject experiments can be assumed to be statistically independent, the probability value of the randomization test for the replicated ABAB design is obtained by performing a randomization test for each single-subject design separately, following the guidelines described above, and then combining the resulting probability values (Edgington, 1972a, 1972b; Edgington & Haller, 1983, 1984).

In their randomization test for replicated ABAB designs, however, Marascuilo and Busk (1988) proposed to use the LMH test and the associated "randomization procedure," to calculate a test statistic across subjects, and to calculate a single randomization distribution of this test statistic. Because the latter turned out to be laborious for more than two subjects, they proposed to use a normal approximation. As argued above, the LMH test is inappropriate for the ABAB design, and consequently the Marascuilo and Busk (1988) test is inappropriate for replicated ABAB designs.

The meta-analytic procedure we propose for replicated ABAB designs differs from the Marascuilo and Busk (1988) test in four respects: (a) the test statistic is not calculated across subjects but for each subject separately, (b) there is not a single randomization distribution but as many as there are subjects in the design, (c) there is no need for a normal approximation, and (d) the test is not based on the LMH test but on the alternative randomization test for the ABAB design described above.
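Once each subject's randomization test has produced its own p-value, the combining step is mechanical. Here is a sketch in the spirit of the normal-curve method of Edgington (1972b); the function name and the example p-values are ours. Under the overall null hypothesis each p-value is (approximately) uniform on (0, 1), so the sum of k of them has mean k/2 and variance k/12, and an unusually small sum signals a treatment effect.

```python
import math

def combine_p_normal_curve(pvalues):
    """Combine independent p-values through their sum, using the normal
    approximation: sum ~ N(k/2, k/12) under the overall null hypothesis."""
    k = len(pvalues)
    z = (sum(pvalues) - k / 2) / math.sqrt(k / 12)
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))  # Phi(z): small sum -> small p

# E.g., three replicated ABAB experiments analyzed separately:
print(combine_p_normal_curve([0.0242, 0.06, 0.10]))  # about .004
```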
RANDOMIZATION TESTS FOR MULTIPLE-BASELINE DESIGNS

In a multiple-baseline across subjects design, there is usually only one intervention (without withdrawal), and the data for all subjects are collected simultaneously, with sequential application of the intervention across subjects (Harris & Jenson, 1985). Single-subject multiple-baseline designs are possible when the intervention is sequentially aimed at different behaviors or settings for a single subject (Barlow & Hersen, 1984, pp. 210-243). The graphical analysis of the results obtained in a multiple-baseline design consists of both between- and within-baseline comparisons (Barlow & Hersen, 1984; Harris & Jenson, 1985; Hayes, 1981). If, for example, an intervention is applied to one of the baselines and produces a change in it, while little or no change is observed in the other baselines, better experimental control over historical sources of confounding is obtained than in a singular AB design (Cook & Campbell, 1979). Notice that in the multiple-baseline literature the term "baseline" is used to refer to the entire AB sequence.

Wampold and Worsham (1986) used precisely this between-baseline feature to develop a randomization test for the multiple-baseline design. They fixed the intervention points for the different baselines and randomly selected the order in which the persons, behaviors, or settings were subjected to the treatment. A test statistic for the obtained data is then compared to a randomization distribution consisting of the test statistics computed for all possible orders in which the persons, behaviors, or settings could have been subjected to the treatment. Hence, to obtain the randomization distribution, the intervention points of the other baselines are relocated to each baseline.

As Marascuilo and Busk (1988) observed, however, this randomization test has low power because of the restricted randomization possibilities. Therefore, they proposed to improve the power by determining the intervention point of each baseline at random instead of fixing it in advance. The random determination of intervention points also randomly selects the order in which the persons, behaviors, or settings are subjected to the treatment. The control over historical sources of confounding is obtained by randomization within a baseline, and the randomization distribution is obtained not only by locating the intervention point of a baseline in the other baselines, but by locating all possible intervention points in each baseline. A multivariate test statistic, such as the difference of the means for the A and the B phases summed over all baselines, may be used, and the p-value of this statistic can be derived by comparing it to the single randomization distribution of this test statistic.

Marascuilo and Busk (1988) considered the randomization test with random intervention points inappropriate for multiple-baseline designs across behaviors or settings because of the probable intercorrelations between the baselines. As Edgington (1992) has argued, however, these intercorrelations cause interpretation difficulties but leave the validity of the test unaffected. This is because of the multivariate approach in constructing a randomization test for multiple-baseline designs: rejection of the randomization null hypothesis is rejection of the overall null hypothesis that there is no effect on any of the baselines. Significant results cannot, however, be interpreted for a specific baseline. The interpretation for a specific baseline can only be unequivocal if the baselines are independent, but then the multiple-baseline design becomes a replicated AB design, and the meta-analytic techniques of the previous section are more appropriate.

Multiple-baseline designs can be extended to include other simultaneous series (e.g., ABA, ABAB, ...). The randomization test can be developed after random determination of the intervention and withdrawal points, and by choosing an appropriate multivariate test statistic. It should be noted that the non-concurrent multiple-baseline design (Barlow & Hersen, 1984, p. 244; Hayes, 1981; Watson & Workman, 1981) is, in fact, a mere replicated AB design (Mansell, 1982) and consequently can be analyzed with the meta-analytic techniques of the previous section.
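A sketch of this multivariate procedure for a multiple-baseline AB design with a random intervention point in each baseline follows. The data, names, and minimum phase length are hypothetical; the statistic is the B-minus-A mean difference summed over baselines, and the single randomization distribution runs over all combinations of admissible intervention points.

```python
from itertools import product

def mb_p_value(baselines, points, n=2):
    """One-tailed multivariate randomization test for a multiple-baseline
    AB design with a randomly determined intervention point per baseline."""
    def diff(scores, t):
        a, b = scores[:t - 1], scores[t - 1:]
        return sum(b) / len(b) - sum(a) / len(a)

    admissible = [range(n + 1, len(s) - n + 2) for s in baselines]
    observed = sum(diff(s, t) for s, t in zip(baselines, points))
    dist = [sum(diff(s, t) for s, t in zip(baselines, combo))
            for combo in product(*admissible)]
    return sum(v >= observed for v in dist) / len(dist)

# Hypothetical data: three baselines, interventions at times 4, 6, and 8.
baselines = [[1, 1, 2, 5, 6, 5, 6, 5, 6, 5],
             [2, 1, 1, 2, 1, 6, 7, 6, 7, 6],
             [1, 2, 1, 1, 2, 1, 2, 6, 7, 6]]
print(mb_p_value(baselines, points=(4, 6, 8)))  # 7**3 = 343 combinations
```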
DISCUSSION

Although randomization tests appear to be valid and versatile in the area of single-case experimentation, Kazdin (1980, 1982, 1984) noted that several features of single-case research would preclude their use. The major obstacles to using randomization tests in single-case experimentation would be: (a) the necessity of rapidly alternating conditions, (b) the possibility of multiple-treatment interference, (c) computational complexity, and (d) the necessity of randomization (Kazdin, 1980, 1982, 1984).

The first obstacle, however, is based on a false assumption. It is not true that randomization tests can only deal with single-case designs in which the conditions are alternated rapidly. In fact, rapid alternation of conditions was a requirement in none of the randomization tests described in this article. Presumably, this false assumption stems from the fact that randomization tests were introduced in the single-case literature by Edgington (1967) as a way to analyze alternating treatments designs.

Multiple-treatment interference is the confounding effect that the outcome of one treatment is due, in part, to previous treatments that may have been provided (Campbell & Stanley, 1966; Cook & Campbell, 1979). As such, it is a potential obstacle of a design and not of a statistical test that analyzes the data gathered following the design (Edgington, 1980a). Of course, if it were assumed that randomization tests are associated with alternating treatments designs, the possibility of multiple-treatment interference would be especially acute with randomization tests.

Randomization tests for alternating treatments designs do involve a rapidly mounting number of data divisions as the number of measurement times increases, so that even for a fairly small number of measurement times, time-consuming computer programs or approximations are required (Edgington, 1969, 1987; Kazdin, 1982, 1984). Fortunately, the randomization tests for single-case experimental designs involving random determination of points of change have a more restricted random assignment procedure and consequently can be performed with relatively simple computer programs. A computer program for the tests presented in this article is available from the author, together with a computer program for the random sampling of the k-tuple out of the set of all possible k-tuples of points of change. The wider availability of powerful computers and efficient algorithms ultimately removes this obstacle.
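Such a program can indeed be short. The sketch below (ours, not the author's program) enumerates the admissible k-tuples and then draws one at random, which guarantees that every admissible data division is equally probable, as the validity argument requires.

```python
import random
from itertools import combinations

def random_change_points(N, k, n=1, seed=None):
    """Uniformly sample one admissible k-tuple of points of change for a
    design with N measurement times and at least n measurements per phase."""
    rng = random.Random(seed)
    admissible = [t for t in combinations(range(2, N + 1), k)
                  if all(b - a >= n for a, b in zip((1,) + t, t + (N + 1,)))]
    return rng.choice(admissible)

print(random_change_points(N=24, k=3, n=4))  # one random ABAB assignment
```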
This leaves the fourth obstacle as the only real one. As Kazdin (1980) put it, "Randomization tests do not merely provide statistical tools to help the investigator analyze data obtained in single-case research but dictate changes in the basic designs that are used" (p. 258). As mentioned above, the fourth step in the performance of a valid randomization test is always the randomization of some aspect of the design, and this has to be done before the data are collected. Randomization tests are thus incompatible with the systematic designs and the response-guided experimentation that are common in single-case research (Edgington, 1983, 1984).

For example, in a changing-criterion design (Barlow & Hersen, 1984, p. 205), the baseline condition (A) is followed by a treatment condition (B) until a preset criterion is met. Then a condition (B1) with a more stringent criterion is introduced, with treatment applied until this new level is met, and so forth. This changing-criterion element can also be combined with an ABAB element (e.g., in an ABABB1B2B3 design). If the number of measurement times for each phase can be specified a priori (for example, by characterizing each phase by the specification of the criterion, but without the requirement that it be met before it is respecified in the next phase), then it is possible to perform a randomization test for this version of the changing-criterion design, because it is then equivalent to the parametric-variation design. In such a randomized changing-criterion design, generally even stronger predictions are possible than in a parametric-variation design, because one has the preset criteria as predictions, and a test statistic has to be chosen accordingly (e.g., some measure of goodness of fit to the criteria). The underlying rationale of changing-criterion designs, however, is usually the shaping of behavior (Kazdin, 1980), and in shaping the number of measurement times cannot be specified a priori.

Edgington (1980a) remarked that it is possible to perform a randomization test in designs where the manipulation of conditions is partially dependent on the data (e.g., randomization after stability of the baseline or, in a changing-criterion design, after the criterion is met). This may, however, reduce the power substantially by restricting the randomization possibilities, and it leaves the basic obstacle that it is impossible to perform a randomization test where the manipulation of conditions is entirely dependent on the data.

It must be acknowledged, however, that while it is true that randomization tests dictate a change in the basic designs that are used, namely the introduction of a random element, this change improves the internal and statistical conclusion validity of the results (Cook & Campbell, 1979). Randomization controls for unknown as well as known sources of confounding, and Campbell and Stanley (1966) considered it important enough to serve as a criterion for distinguishing experimental from quasi-experimental designs (Edgington, 1984, 1992). Hence, the effort to overcome the fourth obstacle is rewarded with increased internal and statistical conclusion validity.

REFERENCES

Barlow, D. H., & Hersen, M. (Eds.). (1984). Single-case experimental designs: Strategies for studying behavior change (2nd ed.). Oxford: Pergamon Press.
Campbell, D. T., & Stanley, J. C. (1966). Experimental and quasi-experimental designs for research. Chicago: Rand McNally.
Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology, 65, 145-153.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.
Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45, 1304-1312.
Cook, T. D., & Campbell, D. T. (Eds.). (1979). Quasi-experimentation: Design and analysis issues for field settings. Chicago: Rand McNally.
Edgington, E. S. (1967). Statistical inference from N = 1 experiments. Journal of Psychology, 65, 195-199.
Edgington, E. S. (1969). Approximate randomization tests. Journal of Psychology, 72, 143-149.
Edgington, E. S. (1972a). An additive method for combining probability values from independent experiments. Journal of Psychology, 80, 351-363.
Edgington, E. S. (1972b). A normal curve method for combining probability values from independent experiments. Journal of Psychology, 82, 85-89.
Edgington, E. S. (1975a). Randomization tests for one-subject operant experiments. Journal of Psychology, 90, 57-68.
Edgington, E. S. (1975b). Randomization tests for predicted trends. Canadian Psychological Review, 16, 49-53.
Edgington, E. S. (1980a). Overcoming obstacles to single-subject experimentation. Journal of Educational Statistics, 5, 261-267.
Edgington, E. S. (1980b). Random assignment and statistical tests for one-subject experiments. Behavioral Assessment, 2, 19-28.
Edgington, E. S. (1980c). Validity of randomization tests for one-subject experiments. Journal of Educational Statistics, 5, 235-251.
Edgington, E. S. (1982). Nonparametric tests for single-subject multiple schedule experiments. Behavioral Assessment, 4, 83-91.
Edgington, E. S. (1983). Response-guided experimentation. Contemporary Psychology, 28, 64-65.
Edgington, E. S. (1984). Statistics and single case analysis. In M. Hersen, R. M. Eisler, & P. M. Miller (Eds.), Progress in behavior modification: Vol. 16 (pp. 83-120). New York: Raven Press.
Edgington, E. S. (1987). Randomization tests (2nd ed.). New York: Marcel Dekker.
Edgington, E. S. (1992). Nonparametric tests for single-case experiments. In T. R. Kratochwill & J. Levin (Eds.), Single-case design and analysis (pp. 133-157). Hillsdale, NJ: Erlbaum.
Edgington, E. S., & Haller, O. (1983). A computer program for combining probabilities. Educational and Psychological Measurement, 43, 835-837.
Edgington, E. S., & Haller, O. (1984). Combining probabilities from discrete probability distributions. Educational and Psychological Measurement, 44, 265-274.
Gabriel, K. R., & Hall, W. J. (1983). Re-randomization inference on regression and shift effects: Computationally feasible methods. Journal of the American Statistical Association, 78, 827-836.
Gabriel, K. R., & Hsu, C. F. (1983). Power studies of re-randomization tests, with application to weather modification experiments. Journal of the American Statistical Association, 78, 766-775.
Goldfried, M. R. (1959). One-tailed tests and "unexpected" results. Psychological Review, 66, 79-80.
Goldman, M. (1960). Some further remarks on one-tailed tests and "unexpected" results. Psychological Reports, 6, 171-173.
Harris, F. N., & Jenson, W. R. (1985). Comparisons of multiple-baseline across persons designs and AB designs with replication: Issues and confusions. Behavioral Assessment, 7, 121-127.
Hayes, S. C. (1981). Single case experimental design and empirical clinical practice. Journal of Consulting and Clinical Psychology, 49, 193-211.
Kazdin, A. E. (1980). Obstacles in using randomization tests in single-case experimentation. Journal of Educational Statistics, 5, 253-260.
Kazdin, A. E. (1982). Single-case research designs: Methods for clinical and applied settings. New York: Oxford University Press.
Kazdin, A. E. (1984). Statistical analyses for single-case experimental designs. In D. H. Barlow & M. Hersen (Eds.), Single-case experimental designs: Strategies for studying behavior change (2nd ed., pp. 285-324). Oxford: Pergamon Press.
Kempthorne, O. (1952). The design and analysis of experiments. New York: Wiley.
Kempthorne, O., & Doerfler, T. E. (1969). The behavior of some significance tests under experimental randomization. Biometrika, 56, 231-247.
Kimmel, H. D. (1957). Three criteria for the use of one-tailed tests. Psychological Bulletin, 54, 351-353.
Kirk, R. E. (1982). Experimental design: Procedures for the behavioral sciences (2nd ed.). Pacific Grove, CA: Brooks/Cole.
Levin, J. R., Marascuilo, L. A., & Hubert, L. J. (1978). N = nonparametric randomization tests. In T. R. Kratochwill (Ed.), Single-subject research: Strategies for evaluating change (pp. 167-196). New York: Academic Press.
Mansell, J. (1982). Repeated direct replication of AB designs. Journal of Behavior Therapy and Experimental Psychiatry, 13, 261.
Marascuilo, L. A., & Busk, P. L. (1988). Combining statistics for multiple-baseline AB and replicated ABAB designs across subjects. Behavioral Assessment, 10, 1-28.
May, R. B., Masson, M. E. J., & Hunter, M. A. (1990). Application of statistics in behavioral research. New York: Harper & Row.
Onghena, P. (1991). Het onderscheidingsvermogen van randomiseringstoetsen bij N=1-experimenten: De relatieve efficiëntie van AB-proefopzetten en proefopzetten met alternerende behandelingen [The power of randomization tests for single-case experiments: The relative efficiency of AB designs and alternating treatments designs]. Unpublished manuscript, Katholieke Universiteit Leuven, Centrum voor Mathematische Psychologie en Psychologische Methodologie, Leuven, Belgium.
Onghena, P. (1992, May). The power of randomization tests. Paper presented at the Meeting of the University Center of Statistics, the NFWO contact group Data Analysis, and the NFWO contact group Probability Theory and its Applications, Leuven, Belgium.
Onghena, P., & Delbeke, L. (1992, July). Power analysis of randomization tests for single-case designs. International Journal of Psychology, 27, 379.
Rosenthal, R. (1978). Combining results of independent studies. Psychological Bulletin, 85, 185-193.
Sedlmeier, P., & Gigerenzer, G. (1989). Do studies of statistical power have an effect on the power of the studies? Psychological Bulletin, 105, 309-316.
Wampold, B. E., & Furlong, M. J. (1981). Randomization tests in single-subject designs: Illustrative examples. Journal of Behavioral Assessment, 3, 329-341.
Wampold, B. E., & Worsham, N. L. (1986). Randomization tests for multiple-baseline designs. Behavioral Assessment, 8, 135-143.
Watson, P. J., & Workman, E. A. (1981). Non-concurrent multiple-baseline across individuals design: An extension of the traditional multiple-baseline design. Journal of Behavior Therapy and Experimental Psychiatry, 12, 257-259.
White, D. M., Rusch, F. R., Kazdin, A. E., & Hartmann, D. P. (1989). Applications of meta-analysis in individual-subject research. Behavioral Assessment, 11, 281-296.

APPENDIX A
THE CARDINALITY THEOREM FOR SINGLE-CASE DESIGNS INVOLVING RANDOM DETERMINATION OF POINTS OF CHANGE

The Unrestricted Case

The total number of possible data divisions for a single-case design involving random determination of k points of change with N measurement times is

$\binom{N+k}{k}$.  (1)

Proof. The determination of k points of change with N measurement times is equivalent to filling k positions with slashes and N positions with dashes on a line with N + k positions. There are $\binom{N+k}{k}$ ways of selecting k of the N + k positions to be filled by slashes, so there are $\binom{N+k}{k}$ ways of determining k points of change with N measurement times. Finally, each k-tuple uniquely defines a data division, so there are $\binom{N+k}{k}$ possible data divisions.

Remark. Because there are no restrictions on the positions of the k points of change, this formula assumes that it is possible to have two or more points of change in a row, or to begin or end with a point of change, implying no change of condition or compressing the design. For example, suppose we want to collect N = 24 measurements according to an ABAB design. According to Formula (1), the total number of possible data divisions is $\binom{N+k}{k} = \binom{27}{3} = 2925$, because there are k = 3 points of change in an ABAB design. This includes, however (using the slash-dash notation of the proof):

----//--------------------/

and other AA-designs, and

-----------------///-------

and other AB-designs. In order to avoid these and other aberrations, it is necessary to restrict the possible positions of the k points of change.
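The example counts quoted in this article can be checked against the formulas by brute force; a sketch (Equations 2 and 3 are derived below, and the unequal minimum phase lengths in the last example are ours):

```python
from math import comb

def count_divisions(N, k, mins):
    """Brute-force count of ordered (k+1)-tuples of phase lengths summing
    to N, with phase i containing at least mins[i] measurement times."""
    def rec(remaining, mins_left):
        if len(mins_left) == 1:
            return 1 if remaining >= mins_left[0] else 0
        return sum(rec(remaining - first, mins_left[1:])
                   for first in range(mins_left[0], remaining + 1))
    return rec(N, list(mins))

# Unrestricted case, Eq. (1): C(N + k, k).
print(comb(27, 3), count_divisions(24, 3, (0, 0, 0, 0)))              # 2925 2925
# Restricted case, Eq. (2): C(N - n(k + 1) + k, k), the 165 of the example.
print(comb(24 - 4 * 4 + 3, 3), count_divisions(24, 3, (4, 4, 4, 4)))  # 165 165
# Generalized case, Eq. (3), here with a longer baseline required.
print(comb(24 - 18 + 3, 3), count_divisions(24, 3, (6, 4, 4, 4)))     # 84 84
```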
The Restricted Case

The total number of possible data divisions for a single-case design involving random determination of k points of change with N measurement times and at least n measurement times in each phase is

$\binom{N - n(k+1) + k}{k}$.  (2)

Proof. If there are k points of change, there are m = k + 1 phases. Let $n_i'$ represent the number of measurement times in phase i and let n represent the minimum number of measurement times in each phase. Hence, $n_i'$ and n are both integers, with $n_i' \geq n$ for $i = 1, \ldots, m$, and $n_1' + n_2' + \cdots + n_i' + \cdots + n_m' = N$. Define the ordered m-tuple $(n_1', n_2', \ldots, n_i', \ldots, n_m')$ to be a data division of N measurement times into m phases. Then the mapping

$(n_1', n_2', \ldots, n_i', \ldots, n_m') \mapsto (n_1' - n, n_2' - n, \ldots, n_i' - n, \ldots, n_m' - n)$

is a bijection from the set A of data divisions of N measurement times into m phases with at least n measurement times in each phase onto the set B of unrestricted data divisions of N - mn measurement times into m phases. According to Equation (1), the cardinality of set B, #B, is equal to $\binom{(N - nm) + k}{k}$, or $\binom{N - n(k+1) + k}{k}$. Because of the bijection, #A = #B, and consequently #A = $\binom{N - n(k+1) + k}{k}$, as claimed.

Remark. With $n \geq 1$ the aberrations are excluded. To have an ABAB design, several measurements are needed in each phase, hence $n \geq 2$ and usually $n \geq 4$. In single-case research it is, however, not uncommon to require a longer baseline or a longer final phase. Therefore, it may be of interest to generalize Equation (2) to allow for phases with unequal restrictions.

The Generalized Case

The total number of possible data divisions for a single-case design involving random determination of k points of change with N measurement times and at least $n_i$ measurement times in phase i, with $n_i$ not necessarily equal to $n_j$ for $i, j = 1, \ldots, m$ and $i \neq j$, is

$\binom{N - \sum_{i=1}^{k+1} n_i + k}{k}$.  (3)

Proof. If there are k points of change, there are m = k + 1 phases. Let $n_i'$ represent the number of measurement times and $n_i$ the minimum number of measurement times in phase i. Hence, $n_i'$ and $n_i$ are both integers, with $n_i' \geq n_i$ for $i = 1, \ldots, m$, and $n_1' + n_2' + \cdots + n_i' + \cdots + n_m' = N$. Define the ordered m-tuple $(n_1', n_2', \ldots, n_i', \ldots, n_m')$ to be a data division of N measurement times into m phases. Then the mapping

$(n_1', n_2', \ldots, n_i', \ldots, n_m') \mapsto (n_1' - n_1, n_2' - n_2, \ldots, n_i' - n_i, \ldots, n_m' - n_m)$

is a bijection from the set A of data divisions of N measurement times into m phases with at least $n_i$ measurement times in phase i onto the set B of unrestricted data divisions of $N - \sum_{i=1}^{m} n_i$ measurement times into m phases. According to Equation (1), the cardinality of set B, #B, is equal to Formula (3). Because of the bijection, #A = #B, and consequently #A is equal to Formula (3), as claimed.

Remark. If $n_i = n_j$ for all $i, j = 1, \ldots, m$ (with $i \neq j$), then the generalized case specializes to the restricted case. If in addition $n_i = n_j = 0$ for all $i, j = 1, \ldots, m$ (with $i \neq j$), then the generalized case specializes to the unrestricted case.

RECEIVED: 19 AUGUST 1991
FINAL ACCEPTANCE: 5 JULY 1992