Behav. Res. Ther. Vol. 32, No. 7, pp. 783-786, 1994
Copyright © 1994 Elsevier Science Ltd. Printed in Great Britain. All rights reserved
0005-7967/94 $7.00 + 0.00
Pergamon

RANDOMIZATION TESTS FOR RESTRICTED ALTERNATING TREATMENTS DESIGNS

PATRICK ONGHENA¹* and EUGENE S. EDGINGTON²

¹Katholieke Universiteit Leuven, Department of Psychology, Center for Mathematical Psychology and Psychological Methodology, Tiensestraat 102, B-3000 Leuven, Belgium and ²The University of Calgary, 2500 University Drive NW, Calgary, Alberta, Canada T2N 1N4

*Author for correspondence.

(Received 29 July 1993; received for publication 17 November 1993)

Summary: Alternating Treatments Designs (ATDs) with random assignment of the treatments to the measurement times provide very powerful single-case experiments. However, complete randomization might cause too many consecutive administrations of the same treatment to occur in the design. In order to exclude these possibilities, an ATD with restricted randomization can be used. In this article we provide a general rationale for the random assignment procedure in such a Restricted Alternating Treatments Design (RATD), and derive the corresponding randomization test. A software package for randomization tests in RATDs, ATDs and other single-case experimental designs [Van Damme & Onghena, Single-case randomization tests, version 1.1, Department of Psychology, Katholieke Universiteit Leuven, Belgium] is discussed.

INTRODUCTION

Single-case experimental designs can be classified as within-series designs, between-series designs and combined-series designs (Barlow, Hayes & Nelson, 1984; Hayes, 1981). Randomization tests for within-series designs and combined-series designs were presented by Onghena (1992) using the rationale of Edgington (1987). In this paper, valid randomization tests will be presented for between-series designs.

The Alternating Treatments Design (ATD) is the prototypical between-series design, which provides an extremely powerful strategy to study the relative effectiveness of two or more treatments (Barlow & Hayes, 1979; Barlow & Hersen, 1984; Kratochwill & Levin, 1992). If randomization is introduced, it concerns random assignment of the different levels of the independent variable (the treatments) to the measurement times. An ATD is then a Completely Randomized Design (CRD) for a single unit, with repeated measures under different levels of the independent variable, and a standard independent t or ANOVA F randomization test may be used to assess the statistical significance (Edgington, 1987). As Barlow and Hersen (1984) observed, the need for randomization in ATDs is obvious:

"Of course, one would not want to proceed in a simple A-B-A-B-A-B-A-B fashion. Rather, one would want to randomize the order of introduction of the treatments to control for sequential confounding, or the possibility that introducing Treatment A first, for example, would bias the results in favor of Treatment A." (p. 253)

On the other hand, for clinical, practical, or research reasons, several of the possible randomizations may be proscribed. As Barlow and Hersen (1984) also remarked:

"Finally, in arranging for random alternation of treatments to avoid order effects, one must be careful not to bunch too many administrations of the same treatment together in a row. For example, if eight alternations were available ... then the investigator might want to set an upper limit of three consecutive administrations of one treatment." (p. 265)
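For concreteness, the complete randomization of an ATD can be sketched in a few lines of Python (our illustration; the article itself contains no code):

```python
import random

# Completely randomized ATD: draw 4 of the 8 measurement times
# for Treatment A; the remaining times receive Treatment B.
a_times = set(random.sample(range(8), 4))
design = ["A" if t in a_times else "B" for t in range(8)]
print("-".join(design))  # e.g. A-B-B-A-B-A-A-B
```

Under this scheme every one of the C(8, 4) = 70 possible designs is equally likely, including extreme sequences such as A-A-A-A-B-B-B-B; the restricted design discussed below excludes exactly such sequences.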
Reasons to set an upper limit on the number of times a treatment can be administered consecutively include the impracticality in certain situations of more than some fixed number of consecutive applications of a treatment, the aversive effects of certain enduring treatments, or the effect of length of exposure to some condition on the subject's becoming aware of the experimental manipulation.

As was demonstrated by Edgington (1987) and Onghena (1992), in order to have Type I error rate control, the permutation of the data to obtain the distribution of test statistics must correspond to the random assignment that is actually used. Therefore, if certain outcomes of the random assignment procedure lead to designs that are impossible or undesirable (e.g. A-A-A-A-B-B-B-B) and would be discarded, then the standard independent t or ANOVA F randomization test assuming complete randomization would not be a valid test. A valid randomization test for a Restricted Alternating Treatments Design (RATD), that is, a randomized single-case design with an upper limit on the number of times a treatment can be administered consecutively, is possible, however, if the restrictions on the random assignments are taken into account to obtain the randomization distribution.

AN EXAMPLE

A school board has decided to evaluate the effectiveness of two kinds of individual support after school hours for a highly gifted child with poor performance on a norm-referenced achievement test and with inferior grades on the school exams. One wants to compare support after school hours by the classroom teacher (Treatment A) with support after school hours by a special-education teacher (Treatment B). Eight one-month periods are available, each period ending with an examination on 100 points. Suppose four of the periods are randomly assigned to Treatment A and the others are assigned to Treatment B. There are 70 ways to pick 4 out of 8, and consequently there are 70 possible outcomes of this randomization procedure. The following designs, however, are considered to be undesirable:

A-A-A-A-B-B-B-B
A-A-A-B-B-B-B-A
A-A-B-B-B-B-A-A
A-B-B-B-B-A-A-A
B-B-B-B-A-A-A-A
B-B-B-A-A-A-A-B
B-B-A-A-A-A-B-B
B-A-A-A-A-B-B-B

because one wants at least some alternation, and because it is considered unfair if one teacher could follow the student for 4 consecutive periods while the other could not. This leaves 62 possible designs. These 62 designs are enumerated and one is randomly sampled to be the design that is actually used.

The school board expects Treatment B to be superior to Treatment A and decides to test the null hypothesis that there is no differential effect of the treatments for any of the measurement times, using a randomization test on the difference between means, B − A. The level of significance is set at 5%. Suppose the actual design is A-B-B-A-B-A-A-B and that the scores are, respectively, 40-51-56-45-61-54-55-66. Consequently, the observed value of the test statistic B − A is 10.
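The whole procedure can be verified with a short Python sketch (our illustration, not code from the article): it enumerates the 70 assignments, removes the eight designs with four consecutive administrations of the same treatment, and computes the one-sided P-value for the observed data.

```python
from itertools import combinations

scores = [40, 51, 56, 45, 61, 54, 55, 66]

def max_run(design):
    """Length of the longest run of identical treatments."""
    run = longest = 1
    for prev, cur in zip(design, design[1:]):
        run = run + 1 if cur == prev else 1
        longest = max(longest, run)
    return longest

# All 70 ways to assign 4 of the 8 periods to Treatment A; keep the
# 62 designs in which no treatment occurs more than 3 times in a row.
designs = []
for a_pos in combinations(range(8), 4):
    d = ["A" if i in a_pos else "B" for i in range(8)]
    if max_run(d) <= 3:
        designs.append(d)
assert len(designs) == 62

def b_minus_a(design):
    """Test statistic: mean of the B scores minus mean of the A scores."""
    a = [s for s, t in zip(scores, design) if t == "A"]
    b = [s for s, t in zip(scores, design) if t == "B"]
    return sum(b) / len(b) - sum(a) / len(a)

observed = b_minus_a(list("ABBABAAB"))          # 10.0
stats = [b_minus_a(d) for d in designs]
p = sum(s >= observed for s in stats) / len(stats)
print(observed, p)                               # 10.0, 3/62 = 0.04839
```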
The randomization distribution of the RATD randomization test is derived by keeping the scores fixed, superposing the 62 possible designs, and calculating and sorting the randomization statistics (see Table 1).

Table 1. Four highest and four lowest values of the test statistic B − A in the randomization distribution for the restricted alternating treatments design, conditional on the data 40-51-56-45-61-54-55-66

Design            B − A   Probability
A-A-B-A-B-A-B-B    12.0   1/62
A-A-B-A-B-B-A-B    11.5   1/62
A-B-B-A-B-A-A-B    10.0*  1/62
A-B-A-A-B-A-B-B     9.5   1/62
...                 ...    ...
B-A-B-B-A-B-A-A    -9.5   1/62
B-A-A-B-A-B-B-A   -10.0   1/62
B-B-A-B-A-A-B-A   -11.5   1/62
B-B-A-B-A-B-A-A   -12.0   1/62

*Observed value of the test statistic.

The P-value of the randomization test is the proportion of randomization statistics that are not smaller than the observed test statistic. As can be seen in Table 1, there are two randomization statistics larger than, and one as large as, the observed test statistic. Consequently, the P-value is 3/62 = 0.04839. Because the P-value is not larger than the level of significance, the null hypothesis of no difference between the treatments is rejected. For this child, the individual support after school hours by the special-education teacher seemed more effective, and it might therefore be considered to continue this treatment in the months to come.

Comparison of the CRD and the RATD randomization test

Before carrying out randomization tests, one should take account of the lowest possible P-value that can be attained. The lowest possible P-value is the inverse of the number of randomizations. Because in a CRD the number of randomizations is at least as large as in an RATD, the lowest possible P-value of a CRD randomization test is never larger than the lowest possible P-value of an RATD randomization test. Notice, however, that with the data as given in the example the CRD randomization test would give a P-value of 4/70 = 0.05714 (see Table 2), which is larger than the P-value of the RATD randomization test, and larger than the level of significance. This is because one of the randomizations (viz. A-A-A-A-B-B-B-B) that gives a higher test statistic than the one observed is included in the CRD randomization test, but excluded in the RATD randomization test.

Table 2. Four highest and four lowest values of the test statistic B − A in the randomization distribution for the completely randomized design, conditional on the data 40-51-56-45-61-54-55-66

Design            B − A   Probability
A-A-B-A-B-A-B-B    12.0   1/70
A-A-B-A-B-B-A-B    11.5   1/70
A-A-A-A-B-B-B-B    11.0   1/70
A-B-B-A-B-A-A-B    10.0*  1/70
...                 ...    ...
B-A-A-B-A-B-B-A   -10.0   1/70
B-B-B-B-A-A-A-A   -11.0   1/70
B-B-A-B-A-A-B-A   -11.5   1/70
B-B-A-B-A-B-A-A   -12.0   1/70

*Observed value of the test statistic.

Generating restricted randomizations

In the example, an upper limit on the number of times a treatment can be administered consecutively is set by listing all possible designs before the study is started, discarding those in which the limit is exceeded, and taking at random one of the remaining designs. Because the number of possible designs increases very rapidly with increasing numbers of observations, however, the computational load of this procedure is prohibitive for many applications. Two alternative procedures are available: (1) the waste-basket procedure, and (2) the constructive-sequential procedure.

The waste-basket procedure is interesting when the number of designs exceeding the limit is only a small proportion of the possible designs. With this procedure, the treatments are randomly assigned to the measurement times, and if the resulting design exceeds the upper limit, the design is discarded and another random assignment is performed (see the sketch below). The waste-basket procedure, however, is not efficient if a large proportion of the designs has to be discarded. In this case, the constructive-sequential procedure is more efficient.
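As an illustration (ours; the function names are hypothetical), the waste-basket procedure is a simple rejection-sampling loop:

```python
import random

def has_long_run(design, limit):
    """True if any treatment occurs more than `limit` times in a row."""
    run = 1
    for prev, cur in zip(design, design[1:]):
        run = run + 1 if cur == prev else 1
        if run > limit:
            return True
    return False

def waste_basket(n_times=8, n_a=4, limit=3):
    """Draw complete randomizations until one respects the upper
    limit on consecutive identical treatments."""
    while True:
        a_times = set(random.sample(range(n_times), n_a))
        design = ["A" if t in a_times else "B" for t in range(n_times)]
        if not has_long_run(design, limit):
            return design
```

Because each admissible design survives the rejection step with the same probability, the procedure selects each of the 62 designs of the example with probability 1/62, so the unweighted randomization test remains valid.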
With this procedure, treatment indicators are randomly sampled without replacement from a population with as many treatment indicators as there are measurement times for each treatment, and if the limit on the number of consecutive identical treatments is reached, the treatment indicators for that treatment are temporarily withdrawn from the population. For example, with two treatments and an upper limit of two consecutive identical treatments, Treatment A may be randomly assigned to the first and second measurement times, but then Treatment B has to be assigned to the third measurement time, and so on (a sketch is given below).

The difference in efficiency between the two procedures is obvious if the statistical significance of the randomization test is assessed by random data permutation (see Edgington, 1987, for the difference between systematic and random data permutation tests). If the statistical significance of the randomization test is assessed by systematic data permutation, however, one must take account of the fact that, with the constructive-sequential procedure, the designs are not equally likely. For example, with two treatments, 8 measurement times, and an upper limit of two consecutive identical treatments, the design A-B-A-B-A-B-A-B has a probability of

(4/8)(4/7)(3/6)(3/5)(2/4)(2/3)(1/2)(1) = 1/70    (1)

while the design A-A-B-B-A-A-B-B has a probability of

(4/8)(3/7)(1)(3/5)(1)(1/3)(1)(1) = 3/70    (2)

of being the design that is actually used. Consequently, the randomization statistics have to be weighted with these probabilities (Cox, 1956; Kempthorne & Doerfler, 1969). If the statistical significance of the randomization test is assessed by random data permutation, a valid randomization test is obtained if the algorithm used to perform the initial randomization is the same as the algorithm used to generate the randomization statistics.
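The constructive-sequential procedure and the selection probabilities of equations (1) and (2) can be sketched in Python as follows (our illustration; note that the procedure can reach a dead end, e.g. when only A indicators remain while A is at its limit. The article does not discuss this case; simply restarting, as done here, is our assumption, and it makes the final selection probabilities proportional to the path probabilities in equations (1) and (2)):

```python
import random
from fractions import Fraction

def constructive_sequential(counts=(("A", 4), ("B", 4)), limit=2):
    """One draw of the constructive-sequential procedure. Returns the
    design and the product of selection probabilities along the path,
    i.e. the weight needed for systematic data permutation."""
    pool = dict(counts)                 # remaining treatment indicators
    design, run, prob = [], 0, Fraction(1)
    while sum(pool.values()) > 0:
        # Indicators of a treatment that has reached the run-length
        # limit are temporarily withdrawn from the population.
        allowed = [t for t, c in pool.items()
                   if c > 0 and not (design and t == design[-1] and run == limit)]
        if not allowed:
            # Dead end; the article does not discuss this case, so we
            # simply restart from scratch.
            return constructive_sequential(counts, limit)
        total = sum(pool[t] for t in allowed)
        choice = random.choices(allowed, [pool[t] for t in allowed])[0]
        prob *= Fraction(pool[choice], total)
        run = run + 1 if design and choice == design[-1] else 1
        design.append(choice)
        pool[choice] -= 1
    return design, prob

design, prob = constructive_sequential()
print("-".join(design), prob)  # e.g. A-B-A-B-A-B-A-B 1/70
```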
Software availability

Randomization tests for RATDs cannot be performed with the usual permutation algorithms because of the restrictions on the assignments and on the permutation of the data. The SCRT program (Van Damme & Onghena, 1993), however, is specifically designed to deal with these sorts of single-case experiments. In addition to the randomization tests for within-series and combined-series designs, one can easily perform randomization tests for RATDs. It is possible to restrict the number of consecutive administrations of the same treatment separately for each treatment, or to restrict it for some treatments and not for others. Other features of SCRT of interest to the single-case researcher are: the possibility to read any customized set of possible designs from an external file, a Statistics Editor to define a tailor-made test statistic prior to data collection, and a nonparametric meta-analytic procedure to analyze replicated single-case experiments or small-N designs. The program runs on IBM/PC (80286, 80386, or 80486) and compatibles, and can be obtained together with a 30-page manual by e-mail (fpaag02@blekul11.earn or Patrick.Onghena@psy.kuleuven.ac.be) or by writing to the first author.

Acknowledgements. The authors wish to thank Luc Delbeke, Geert Van Damme, and two anonymous reviewers for their helpful comments on an earlier version of the manuscript. The first author is Research Assistant of the National Fund for Scientific Research (Belgium).

REFERENCES

Barlow, D. H. & Hayes, S. C. (1979). Alternating treatments design: One strategy for comparing the effects of two treatments in a single subject. Journal of Applied Behavior Analysis, 12, 199-210.
Barlow, D. H., Hayes, S. C. & Nelson, R. O. (1984). The scientist-practitioner: Research and accountability in clinical and educational settings. New York: Pergamon Press.
Barlow, D. H. & Hersen, M. (1984). Single case experimental designs: Strategies for studying behavior change (2nd edn). New York: Pergamon.
Cox, D. R. (1956). A note on weighted randomization. Annals of Mathematical Statistics, 27, 1144-1151.
Edgington, E. S. (1987). Randomization tests (2nd edn). New York: Marcel Dekker.
Hayes, S. C. (1981). Single case experimental design and empirical clinical practice. Journal of Consulting and Clinical Psychology, 49, 193-211.
Kempthorne, O. & Doerfler, T. E. (1969). The behaviour of some significance tests under experimental randomization. Biometrika, 56, 231-248.
Kratochwill, T. R. & Levin, J. R. (Eds) (1992). Single-case research design and analysis: New directions for psychology and education. Hillsdale, NJ: Erlbaum.
Onghena, P. (1992). Randomization tests for extensions and variations of ABAB single-case experimental designs: A rejoinder. Behavioral Assessment, 14, 153-171.
Van Damme, G. & Onghena, P. (1993). Single-case randomization tests (version 1.1) [Computer program]. Department of Psychology, Katholieke Universiteit Leuven (Belgium).