Examining Treatment Effects for Single-Case ABAB Designs through Sensitivity Analyses

A dissertation presented to the faculty of The Patton College of Education of Ohio University

In partial fulfillment of the requirements for the degree Doctor of Philosophy

Christine A. Crumbacher

May 2013

© 2013 Christine A. Crumbacher. All Rights Reserved.

This dissertation titled Examining Treatment Effects for Single-Case ABAB Designs Through Sensitivity Analyses by CHRISTINE A. CRUMBACHER has been approved for the Department of Educational Studies and The Patton College of Education by

John H. Hitchcock
Associate Professor of Educational Studies

Renée A. Middleton
Dean, The Patton College of Education

Abstract

CRUMBACHER, CHRISTINE A., Ph.D., May 2013, Educational Research and Evaluation

Examining Treatment Effects for Single-Case ABAB Designs through Sensitivity Analyses

Director of Dissertation: John H. Hitchcock

Single-case designs (SCDs) are often used to examine the impact of an intervention over brief periods of time (Kratochwill & Stoiber, 2002; Segool, Brinkman, & Carlson, 2007). The majority of SCDs are inspected using visual analysis (Kromrey & Foster-Johnson, 1996; Morgan & Morgan, 2009). Although the single-case literature suggests that visual analyses have merit (Brossart, Parker, Olson, & Mahadevan, 2006; Kratochwill & Brody, 1978), there are concerns regarding the reliability of the procedure (Shadish et al., 2009). Recent advances in hierarchical linear models (HLM) allow for statistical analyses of treatment effects (Nagler, Rindskopf, & Shadish, 2008), thus making it possible to compare and contrast results from HLM and visual analyses to ascertain if the different methods yield consistent conclusions. This work performed a series of sensitivity analyses while also exploring ways in which HLM can be used to examine new and different questions when dealing with published single-case data. The work applied analyses to ABAB designs only. In addition to reporting the results of visual analysis performed by the original authors, it also utilized recently published guidelines by the What Works Clearinghouse (WWC) that standardize visual analysis processes (Kratochwill, Hitchcock, Horner, Levin, Odom, Rindskopf, & Shadish, 2010). The comparisons presented here are based on nine single-case studies that meet WWC design standards. All studies examined intervention impacts on behavioral outcomes. UnGraph digitizing software was used to quantify results from ABAB graphs, and HLM and STATA software were used to perform statistical analyses. In addition to applying a statistical procedure to check conclusions about treatment effects based on visual analyses, HLM was used to examine between-subject variation of performance on outcome measures. In order to statistically describe treatment impacts, effect size estimates were calculated using four methods: (a) the percentage of nonoverlapping data (Morgan & Morgan, 2009), (b) the Standardized Mean Difference (Busk & Serlin, 1992), (c) the improvement rate difference, and (d) R2, which assesses the proportion of variance in the dependent variable that can be explained by treatment exposure.

Acknowledgments

Deepest thanks to the faculty who advised me on this dissertation. Thank you to all committee members, Dr. John Hitchcock, Dr. Gordon Brooks, Dr. Bruce Carlson, and Dr. Jerry Johnson. A special thanks to Dr. John Hitchcock, my academic advisor and dissertation chair, for his continued support and guidance.
I would also like to thank The Patton College of Education for the opportunity to study at Ohio University.

Table of Contents

Page
Abstract ............ 3
Acknowledgments ............ 5
List of Tables ............ 8
List of Figures ............ 10
Chapter One: Introduction ............ 11
  Background of the Study ............ 11
  SCDs and the What Works Clearinghouse ............ 13
  Descriptive Statistical Analysis ............ 19
  Hierarchical General Linear Modeling in SCDs ............ 22
  Effect Sizes ............ 25
  Statement of the Problem ............ 28
  Research Questions ............ 30
  Primary Question: ............ 30
  Secondary Questions: ............ 30
  Significance of Study ............ 30
  Delimitations and Limitations of the Study ............ 33
  Definition of Terms ............ 36
  Organization of the Study ............ 39
Chapter Two: Review of Literature ............ 41
  Review of Philosophical Issues ............ 41
  Permutation Tests ............ 43
  Other Types of SCDs ............ 46
  Minimizing Threats to Internal Validity Using the ABAB Design ............ 47
  Reliability and Validity of the UnGraph Procedure ............ 49
  HLM Applications to SCD ............ 50
  Three HLM Models for ABAB Designs ............ 57
    Variances in two-level models. ............ 60
    Estimation procedures. ............ 61
  Hierarchical Generalized Linear Models (HGLMs) ............ 63
  Some Effect Size Options in SCDs ............ 65
    Effect Sizes for Meta-Analyses and Comparisons of SCD and Group Studies. ............ 70
  Type I and II error rates in SCDs ............ 72
  Chapter Summary ............ 74
Chapter Three: Methodology ............ 76
  Article Selection and Descriptions ............ 77
  Digitizing and Modeling the Data ............ 83
  Comparing and Contrasting Visual Analysis, HLM/STATA, and Author Reports ............ 91
  Level-2 Predictors ............ 92
  Alternative Approaches for Exploring Variation ............ 94
  Effect Sizes ............ 96
  UnGraph Validity and Reliability ............ 99
  Chapter Summary ............ 101
Chapter Four: Results ............ 103
  Primary Question: ............ 103
  Secondary Questions: ............ 103
  Section 1: The Pilot ............ 104
  Section 2: Sensitivity Analysis Examining Treatment Effectiveness ............ 119
  Results for Research Question 1 ............ 119
  Results for Research Question 2 and 3 ............ 127
  Results for Research Question 4 ............ 156
Chapter Five: Discussion, Conclusions, and Recommendations ............ 159
  Discussion of the Results ............ 160
  Conclusions ............ 174
  Recommendations ............ 176
References ............ 179
Appendix A: Tables ............ 194
Appendix B: Graphs, Extracted Data from SPSS and Excel with Codes ............ 208
Appendix C: All Models (Working and Non-Working) ............ 278

List of Tables

Page
Table 1. Title of Articles and WWC Standards ............ 80
Table 2. Results of Sensitivity Analyses Pertaining to Statement of a Treatment Impact ............ 92
Table 3. Results of the Multi-level Models and Level-2 Contributors ............ 94
Table 4. Effect Size Methods, Criteria, and Software ............ 97
Table 5. Simple Non-Linear Model without Slopes for the Lambert and Colleagues' (2006) Study ............ 107
Table 6. Final Estimation of Variance Components for the Lambert and Colleagues' (2006) Study ............ 110
Table 7. Possible Level-2 Predictors from the Exploratory Analysis for the Lambert and Colleagues (2006) Study ............ 111
Table 8. Simple Non-Linear Model without Slopes with CLASSB on Intercept for the Lambert and Colleagues' (2006) Study ............ 113
Table 9. Final Model: Lambert and Colleagues (2006) ............ 114
Table 10. Final Estimation of Variance Components for the Lambert, Cartledge and Colleagues' (2006) Study ............ 117
Table 11. Final Results of Sensitivity Analyses Pertaining to Statement of a Treatment Impact ............ 124
Table 12. Final Model: Murphy and Colleagues (2006) ............ 131
Table 13. Final Model: Mavropoulou and Colleagues (2006) ............ 137
Table 14. Final Model: Ramsey and Colleagues (2006) ............ 141
Table 15. Final Model: Restori and Colleagues (2007) ............ 143
Table 16. Final Model: Williamson and Colleagues (2009) ............ 148
Table 17. Final Model: Amato-Zech and Colleagues (2006) ............ 150
Table 18. A Comparison of Effect Sizes from the PND: Visual Analysis Versus Extracted Data from UnGraph ............ 153
Table 19. A Comparison of Descriptive Statistics between Raw Data and Extracted for Two Studies ............ 158
Table 20. List of Effect Sizes from the Original Article Compared to Calculated Effect Sizes Using the PND, SMD, and R2 ............ 194
Table 21. List of Effect Sizes from the Original Article Compared to Calculated Effect Sizes Using the IRD ............ 197
Table 22. A Comparison of the Reported Data to the Extracted ............ 200
Table 23. List of Population Type, Independent Variables, Dependent Variables, and Reported Effect Sizes ............ 202
Table 24. List of Level-2 Variables Used in the Exploratory Analyses ............ 205

List of Figures

Page
Figure 1. ABAB Design with Disruptive Behavior as the Outcome Variables ............ 13
Figure 2. An Example of a Multiple-Baseline Design ............ 47

Chapter One: Introduction

Background of the Study

Single-case research started in the early 20th century and was developed primarily to assess the impacts of treatments, often in the context of Applied Behavioral Analysis (Morgan & Morgan, 2009). Single-case designs (SCDs) use an experimental process where treatment access is systematically manipulated by a researcher, performance is monitored over time, and the units of interest serve as their own control (Horner, Carr, Halle, McGee, Odom, & Wolery, 2005; Kratochwill et al., 2010; Segool, Brinkman, & Carlson, 2007). Many of these investigations are done to test treatment impacts in applied settings (Kratochwill & Stoiber, 2002; Morgan & Morgan, 2009). Although SCDs can yield causal findings, they typically utilize small sample sizes and tend to not generalize well to an underlying population of interest or other settings (Edgington, 1987; Jenson, Clark, Kircher, & Kristjansson, 2007; Kratochwill et al., 2010). These studies can, however, be aggregated and examined within a meta-analytic framework (Scruggs & Mastropieri, 1998), and researchers have recently become interested in re-analyzing SCDs using new statistical techniques so as to support such work, as well as to examine data in ways that have typically not been available using more classic methods (Iwakabe & Gazzola, 2009; Nagler et al., 2008; Shadish, Rindskopf, & Hedges, 2011).

The recent methodological interest in SCDs may be due in part to growing demands for evidence-based practices by the U.S. Department of Education, which has recently favored studies that can yield causal findings (Jenson et al., 2007). Although causal arguments rely on logic and not necessarily statistical inference, modeling data is typically used to address a causal question (Shadish, Cook, & Campbell, 2002). Yet, single-case research does not synchronize well with statistical tests that generally require large sample sizes, equal variances across study conditions, independent errors, and approximately normal distributions (Edgington, 1987). SCDs almost always have small sample sizes due to their focus on a single person or group (Morgan & Morgan, 2009); furthermore, repeated observation of a single entity yields a violation of the independent error assumption (Krishef, 1991).

Randomization in SCDs supports causal inference because allowing chance to dictate the start and withdrawal of a treatment should, in expectation, minimize data trends (Edgington, 1987). Randomized SCDs can use permutation tests (a nonparametric approach) to determine if performance on the dependent variable is any different during treatment compared to a control/baseline condition (Edgington, 1987). Randomization is, however, often difficult to apply in SCDs (Edgington, 1987, 1995; Kratochwill & Levin, 2010). This is because behaviorally-based interventions typically compel researchers to treat people on the basis of need. When dealing with students who exhibit self-injurious behavior, for example, any desire to randomize with a new treatment would likely be a secondary concern to providing them with help. In a context in which randomization has not been used, permutation tests can still provide information on whether or not there is a treatment effect, much in the same way that independent t-tests can be applied in quasi-experimental group designs that did not randomize study participants to treatment conditions.
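To make the idea concrete, a minimal sketch of such a test is shown below using fictitious A- and B-phase counts (all values and variable names are hypothetical). For simplicity the sketch exhaustively re-shuffles phase labels rather than randomizing the intervention start point, as a formal single-case randomization test would, so it should be read as an illustration of the logic rather than a recommended procedure.

    # Simplified permutation test on fictitious AB-phase data (illustration only).
    import itertools
    import numpy as np

    baseline = np.array([9, 11, 10, 12, 10])        # hypothetical A-phase counts
    treatment = np.array([4, 3, 5, 2, 3])           # hypothetical B-phase counts
    observed = baseline.mean() - treatment.mean()   # observed mean difference

    combined = np.concatenate([baseline, treatment])
    n_a = len(baseline)

    # Enumerate every way of labeling five of the ten observations as "baseline"
    # and count how often a mean difference at least as large arises by chance.
    diffs = []
    for idx in itertools.combinations(range(len(combined)), n_a):
        mask = np.zeros(len(combined), dtype=bool)
        mask[list(idx)] = True
        diffs.append(combined[mask].mean() - combined[~mask].mean())

    p_value = np.mean(np.array(diffs) >= observed)
    print(f"observed difference = {observed:.2f}, permutation p = {p_value:.3f}")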
Permutation tests are nevertheless limited by the fact that they cannot model subject characteristic data (Howell, 2009), which may be of interest to researchers if they wish to examine whether treatment effects appear to be more powerful across classrooms or types of students. A recent alternative to permutation tests is the application of hierarchical generalized linear models (HGLM), which not only allows for the application of classic inferential approaches to examine whether a treatment effect is present, but can also be used to ascertain whether subject characteristics explain variance in a dependent variable of interest while accounting for clustered data that clearly do not fit a normal distribution (Nagler et al., 2008). Primarily for these reasons, techniques based on HGLMs will be examined in the current study.

SCDs and the What Works Clearinghouse

Single-case research designs consist of three major types: within-series (i.e., AB, ABA, and ABAB designs), between-series (i.e., alternating treatments designs), and combined-series (i.e., multiple baseline designs) (Segool et al., 2007). Within-series designs compare baseline performance on a dependent variable (A) to performance in the presence of a treatment/intervention (B) (Morgan & Morgan, 2009). A fictitious example of an ABAB design is seen in Figure 1.

Figure 1. An example of an ABAB design with disruptive behavior as the outcome (the graph plots counts of disruptive behavior across sessions in alternating A and B phases).

The presence of the treatment was manipulated by a researcher who had an a priori expectation that the level of problematic behavior would be higher when the treatment was not in place. The vertical Y-axis presents the number of times a disruptive behavior has been observed. The horizontal X-axis represents data collection times during baseline (A phases) and presence of the treatment (B phases). The dark vertical lines represent changes in study condition (i.e., treatment exposure). Figure 1 shows that five data points were collected in the first baseline phase, five were collected in the first treatment phase, and so on. Overall, the figure is meant to depict a causal argument that the treatment was responsible for drops in problematic behavior, even in the absence of randomization (Horner et al., 2005). This is because the researcher confirmed that the disruptive behavior decreases during treatment phases and increases after removal of the treatment, and this pattern of effects has been replicated.

Designs demonstrating such patterns of effects are legion (Horner et al., 2005), and the U.S. Department of Education's Institute of Education Sciences (IES) has taken an interest in providing a clearinghouse that describes interventions that yield positive academic and behavioral outcomes for children. The What Works Clearinghouse (WWC) was developed in 2002 and involves a network of standards, guidelines, and criteria specific to single-case research. The WWC has thus developed criteria for judging whether designs can reasonably make a causal argument about the impact of a treatment by considering different design features and visual analyses (Kratochwill et al., 2010). To clarify, the WWC offers criteria that separate out SCDs with designs that yield reasonable internal validity from those that do not, and of the designs that meet such criteria, visual analyses are applied to determine if there is indeed an effect (some studies may be internally valid but the treatment was judged to not have made a difference).
As a quick aside, visual analysis is a method used for determining treatment effectiveness by visually examining SCD graphs while considering various features such as mean performance change from baseline to treatment. In terms of design criteria, all articles used in the current study should meet the WWC standards and Kratochwill and colleagues' (2010) steps for assessing design. These standards include: reason to believe there was systematic manipulation of the independent variable over time; inter-assessor agreement was examined on at least 20% of all observations, at least once within each phase, and an agreement rate of 0.80 was achieved (at least 0.60 if measured by Cohen's kappa); a minimum of three data points were collected in each study phase; and there was an opportunity to demonstrate at least three intervention effects at three different points in time.

A typical ABAB design can meet the last criterion because the introduction of the treatment after baseline (going from A to B) yields an opportunity to demonstrate an effect. Removing the treatment (B to A) allows for a second chance to demonstrate a treatment impact, and re-introducing the treatment (A back to B) yields a third opportunity. An implication here is that several variants of the ABAB design cannot demonstrate three treatment impacts (e.g., AB, ABA, BAB) and thus are not able to meet WWC standards. Put another way, shorter designs do not allow for enough replication of a treatment effect to allow for a reasonable causal argument. Once a study is deemed to have the capacity to generate a strong causal argument (i.e., passes standards), the WWC will consider the evidence the study produces using visual analyses. Again, this means the WWC will not consider visual analyses from designs that do not pass standards. At the conclusion of visual analyses, reviewers render one of three judgments: 'Strong Evidence,' 'Moderate Evidence,' or 'No Evidence' of a causal relation (Kratochwill et al., 2010).

Interpreting graphs in single-subject research dates back to the 1970s (Brossart, Parker, Olson, & Mahadevan, 2006; Kratochwill & Brody, 1978). Today, visual analyses are used to analyze treatment effectiveness in the majority of studies (Brossart et al., 2006; Busk & Marascuilo, 1992; Horner et al., 2005). Indeed, Kromrey and Foster-Johnson (1996) estimate that 80% of SCDs rely on visual analyses. Examining trend, level, and variability can allow for an assessment of whether the treatment appeared to work (Horner et al., 2005; Kratochwill et al., 2010; Morgan & Morgan, 2009). According to Kratochwill and colleagues (2010), the WWC considers six features when examining the presence of a causal relationship in the context of an SCD: "level; trend; variability; immediacy of the effect; overlap; and consistency of data patterns across similar phases" (p. 18).

Level refers to the performance mean within each phase. Consideration of this feature is analogous to an unstandardized effect size (i.e., a simple mean difference) in a classic experimental design with a single treatment and control group. If the average performance in a treatment phase is better than in a non-treatment phase, then there is evidence that the intervention worked. Trend is in essence the slope of the regression line (the line of best fit found within each phase). Variability in the data provides information about fluctuations of the dependent variable (Morgan & Morgan, 2009).
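To illustrate how these first three features might be quantified for a single phase, a brief sketch is given below; the data and variable names are hypothetical, and the computations are only rough numerical counterparts to what a trained visual analyst considers.

    # Illustrative only: quantifying level, trend, and variability for one
    # hypothetical phase of an ABAB graph.
    import numpy as np

    phase_a1 = np.array([12, 10, 11, 13, 12], dtype=float)   # hypothetical baseline counts
    sessions = np.arange(1, len(phase_a1) + 1, dtype=float)

    level = phase_a1.mean()                        # level: mean performance within the phase
    trend = np.polyfit(sessions, phase_a1, 1)[0]   # trend: slope of the within-phase line of best fit
    variability = phase_a1.std(ddof=1)             # variability: spread around the phase mean

    print(f"level = {level:.1f}, trend = {trend:.2f} per session, SD = {variability:.2f}")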
The remaining features are immediacy of the effect, overlap of data points (i.e., performance) across phases, and consistency of data patterns across similar phases. Immediacy of the effect is characterized by any change in level of performance exhibited by the last three data points in one phase compared to the first three data points in the following phase. This is considered under the logic that the more immediate an observed impact, the more likely one can attribute it to the presence (or removal) of a treatment. Overlap is the percentage of overlapping data from one phase to the next, with larger separation of the data signaling a greater likelihood of a treatment effect. Lastly, consistency of data in similar phases involves pairing the baseline phases and treatment phases (e.g., the first A phase compared to the second A phase, and the first B phase compared to the second B phase) and looking for consistent patterns. The more consistent the data patterns are across similar phases, the more likely it is that a causal relationship exists.

A key advantage to utilizing WWC visual analysis procedures when reanalyzing the results from the original studies is that the approach is standardized. By contrast, visual analyses may well vary from one author to the next and oftentimes are not reported in detail. Since a primary goal of this work is to compare HLM and visual analyses to see if each approach yields comparable results, it should be beneficial to use a consistent visual analysis approach that incorporates the most recent thinking on best practices. The WWC procedures provide such consistency.

Despite the advantages of standardized visual analyses, researchers disagree on whether to rely solely on the approach when statistical procedures are available (cf. Baer, 1977; Kazdin, 1992; Morgan & Morgan, 2009; Parsonson & Baer, 1986). Concerns with visual analyses lie in the fact that some graphs are not easily interpreted and raters can reasonably disagree on whether a treatment effect is present. Additional limitations discussed in the literature include: (a) human error can occur while reading graphs, (b) lack of rater training, and (c) the lack of formal standardized criteria for treatment effects (Kromrey & Foster-Johnson, 1996; Morgan & Morgan, 2009). Statistical procedures can of course be valuable when raters disagree (Danov & Symons, 2008) and it is argued that performing sensitivity analyses may mitigate disagreements (Kratochwill et al., 2010). This sets the stage for sensitivity analyses. Should disparate but reasonable analytic techniques yield similar conclusions, this provides evidence that different traditions would concur on the basic information practitioners need; that is, whether a treatment worked. Perhaps visual analyses and statistical tests can be used in conjunction to lessen the magnitude of human error while examining treatment effects (Morgan & Morgan, 2009).

Descriptive Statistical Analysis

Current SCD analytic practice does not eschew statistical analyses. Although the literature indicates that visual analysis is the dominant approach when examining treatment effects, researchers often use descriptive statistical approaches to analyze treatment effectiveness. Descriptive approaches are relatively easy to apply and provide information about trends and treatment impacts. These include calculation of the mean (level) and the percentage of nonoverlapping data (PND) statistic.
Calculating the mean entails computing the mean for the baseline and treatment phase(s) and calculating their differences (Morgan & Morgan, 2009). The percentage of nonoverlapping data (PND) statistic is calculated using the percentage of data that do not overlap (Morgan & Morgan, 2009). The PND calculation is not used in an inferential statistical test; it is a visually based procedure (the reverse is calculated as well; that is, the number of treatment data points that fall below the lowest data point observed at baseline). The PND uses features associated with visual analysis to assess treatment effectiveness. An effective rating could be a graph that has little data overlap, limited variability, distinguishable levels, and immediacy of effect (i.e., the last three data points in the first baseline and the first three in the treatment are not close in numerical value). A questionable rating may be a graph that is not as discernible; that is, it may have a few of the features associated with an effective rating and a few that are rather questionable to a trained visual expert. A noneffective judgment could be a graph that has many overlapping data points, large variability, levels that are close in numerical value, and data that lack an immediate change or visible gap between phases.

Criteria for the PND suggest that any percentage equal to or greater than 90% (i.e., 90% of the data in a treatment phase do not overlap with data observed at baseline) indicates a very effective treatment; 70%-90% indicates an effective treatment (Scruggs & Mastropieri, 1998). A range of 50%-70% indicates questionable effectiveness, while a range lower than 50% is interpreted as an ineffective treatment (Scruggs & Mastropieri, 1998). Although the approach is straightforward, it does have limitations (described in Chapter Two). The PND uses the features associated with visual analysis to determine effective / questionable / not effective renderings.

Statistical methods are available to examine treatment effectiveness, including adding regression lines (trend), creating statistical process control (SPC) charts, and generating dual-criteria analyses (Morgan & Morgan, 2009). The regression line procedure identifies a line of best fit between the data points in each phase to display trends in the data; the regression line can facilitate determination of treatment effects if the line differs in intercept or slope relative to the baseline (Morgan & Morgan, 2009). Further, observing the trend in baseline might help researchers to predict where other data points within the baseline may lie (Morgan & Morgan, 2009). The SPC charts consider outliers in data, using standard deviations, and investigate whether such outliers are best explained by a treatment effect. A formula for calculating the standard deviation in SCD is as follows:

SD = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}}, (1)

where x = a single data point, \bar{x} = the mean of all data, and n = the total number of data points. This formula states that the standard deviation equals the square root of the sum of squared deviations of each data point (raw score) from the mean of all data points in the study, divided by n minus one, the total number of data points minus one. Data points that deviate by two standard deviations across phases (above or below) are considered to be atypical (Morgan & Morgan, 2009). If at least two repeated data points fall outside this range, then a treatment effect is plausible (Orme & Cox, 2001).
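As a hedged illustration of two of the descriptive indices just described, the sketch below computes the PND and a two-standard-deviation band check for a hypothetical behavior expected to decrease; the band here is built from the baseline phase, which is one common choice, whereas formula (1) above describes the standard deviation of all data points in the study.

    # Illustration only: PND and a two-SD band check on hypothetical counts of a
    # problem behavior (lower values indicate improvement).
    import numpy as np

    baseline = np.array([10, 12, 9, 11, 10], dtype=float)   # A phase
    treatment = np.array([8, 4, 3, 5, 2], dtype=float)      # B phase

    # PND for a behavior expected to decrease: share of treatment points that
    # fall below the lowest baseline point.
    pnd = 100 * np.mean(treatment < baseline.min())

    # SPC-style check: treatment points falling more than two standard
    # deviations below the baseline mean are treated as atypical.
    band_low = baseline.mean() - 2 * baseline.std(ddof=1)
    outside_band = treatment < band_low

    print(f"PND = {pnd:.0f}%")
    print(f"points below the 2-SD band: {int(outside_band.sum())} of {len(treatment)}")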
Dual-criteria analysis is a visual aid technique that considers both the mean and trend of baseline data. These sources of information are used to extrapolate a line that depicts counterfactual performance and allows for easier comparisons with performance in the treatment phase. The more points that lie above the extrapolated information, the more likely it is that the treatment is responsible for causing a change (Morgan & Morgan, 2009). A more conservative approach has also been tested by moving the regression line up or down by 0.25 standard deviations (Morgan & Morgan, 2009), which is a useful method as well.

Hierarchical General Linear Modeling in SCDs

The above techniques are all variants of descriptive analyses and do not provide an opportunity to test statistical significance. HGLM (this work will also use the more common phrase "Hierarchical Linear Modeling," or HLM) can, however, provide a test for statistical significance if quantified data are available. Two-level models in SCDs consist of the extracted/raw graph data (Level-1) and subject characteristic data (Level-2). A statistical sensitivity analysis of SCD graphs using two-level models with raw data is not available in the existing literature; the lack of such work may be due to several issues including design and availability of information.

As noted above, use of statistical inference in the context of SCDs can be problematic. Problems associated with statistical analyses using SCDs include autocorrelation, overdispersion, and small sample sizes (Nagler et al., 2008). Overdispersion and concerns around small sample sizes are discussed below. An understanding of autocorrelation, however, provides a sense of how ordinary least squares (OLS) approaches are problematic when statistically analyzing single-case data. Since data in ABAB designs (and other types of SCDs) have observations nested by case (Kratochwill et al., 2010; Raudenbush & Bryk, 2002), autocorrelation becomes a concern. In the context of SCDs, autocorrelation is in essence a form of serial dependence, where observations within the study yield information that allows for prediction of subsequent observations (Krishef, 1991). This presents an interesting problem because violation of the assumption of independence can, at times, represent a desirable outcome in the context of visual analysis. This is because stable baseline performance should predict little or no change in status in absence of treatment, and introduction of treatment should be the primary reason for any given change (Horner et al., 2005). Put another way, more severe autocorrelation within a study phase can at times facilitate visual analysis but undermine OLS statistical analyses.

Multilevel modeling can, however, address violations of the independence assumption by allowing error terms to randomly vary when modeling each unit's individual growth trajectory (this is Level-1 of a multi-level model, where Level-2 typically uses subject characteristics to model variation in the Level-1 coefficients). That is, the repeated measures of an entity over time are treated as nested, and autocorrelation, as well as other difficulties associated with varying numbers of measurements and different spacing of measurement (Raudenbush & Bryk, 2002), are handled in the analytic model by accounting for the degree of association between nested observations. The approach can also allow researchers to test multiple hypotheses that pertain to a particular study and to explain subject characteristics that are unique.
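As a sketch of the kind of two-level model implied by this discussion (the notation follows the general Poisson formulation in Raudenbush & Bryk, 2002, and Nagler et al., 2008; the exact models fitted in this work are reported in Chapter Three and Appendix C), a Level-1 count outcome with a phase indicator might be written as

\log(\lambda_{ti}) = \pi_{0i} + \pi_{1i}(\mathrm{PHASE}_{ti})

with Level-2 equations

\pi_{0i} = \beta_{00} + r_{0i}
\pi_{1i} = \beta_{10} + r_{1i}

where \lambda_{ti} is the expected count for case i at time t, PHASE is a 0/1 indicator of treatment exposure, \pi_{0i} is the case-specific baseline log count, \pi_{1i} is the case-specific phase effect, and the random effects r_{0i} and r_{1i} allow cases to vary around the overall intercept \beta_{00} and overall phase effect \beta_{10}.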
The use of hierarchical linear modeling to analyze published SCDs requires some prerequisite steps (Nagler et al., 2008) because raw data are typically not provided in reports. McDougal, Narkon, and Wells (2011) call for providing raw data when presenting SCD reports so as to ease comparison of results across studies, and decry the fact that this is not the norm. UnGraph is one type of digitizing software that solves this problem as it quantifies the X and Y coordinates of a published graph and exports data into Excel or SPSS formats (it should be noted that the author found few studies that tested the reliability of UnGraph in the literature; the reliability of the procedure is discussed further below). Once data are exported, the researcher can compute new variables based on a particular study. UnGraph in essence makes raw data available for re-analysis. With such data, the researcher can compute variables for time-series data (e.g., scaling the dependent variable for proportion or interval data), create interaction effects, and code Level-2 participant characteristics. After defining all variables in SPSS, one can import the data into HLM and run models to provide estimates for coefficients. This can lead to analyses where variation between specific students can be tested for contributions to the model (i.e., using dummy codes to separate one person from the group). In sum, specific units/schools/classrooms/subject characteristics can be added to the model in hopes of differentiating or explaining behavior patterns.

Raudenbush and Bryk (2002) stated that hypothesis testing in HLM depends in part on the distributional shape of the data. When data are not normal, the use of linear Level-1 models is unsuitable. This is common with count data (all articles used in this study follow the assumptions of count data), which fall mainly into two types of distributions, Poisson or binomial (Nagler et al., 2008; Raudenbush & Bryk, 2002). SCDs with count data therefore need to assume one of these distributions, which requires a mathematical adjustment when using HLM software (Raudenbush, Bryk, Cheong, Congdon, & du Toit, 2004). The Poisson distribution is often used to model the number of events in a precise time period, such as a classroom session or week. Binomial distributions are used to model the number of events that occurred out of a fixed, total number of possible occurrences (Nagler et al., 2008; Raudenbush & Bryk, 2002). Examples of variables that have binomial distributions could be passed/failed a test, did/did not display behavior, or whether a baseball player managed to get a hit when at bat (Raudenbush & Bryk, 2002). In studies where behavior is coded as either occurring or non-occurring events (typically coded 0 or 1), the binomial distribution should be used (Nagler et al., 2008).

Any post-hoc tests in HLM are associated with finding mean differences between groups and within the individual (Nagler et al., 2008). Exploratory tests, which in these analyses involve the Level-2 variables, can help explain variance in the data of a given set of studies. One advantage of using a nested design approach is the ability to determine how Level-2 predictors explain individual variation. Although multi-level models yield p values that can help inform the researcher if the treatment was effective and how Level-2 variables influence the study, they cannot synthesize effect sizes to demonstrate the strength of association between outcome and intervention.
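To make the role of such p values concrete, a minimal sketch is given below; it fits only a single-case, Level-1 Poisson regression with a phase dummy (the statsmodels library and all data values are assumptions of the illustration), whereas the analyses in this work use two-level models estimated in HLM and STATA.

    # Minimal sketch: Level-1 Poisson regression of a count outcome on a phase
    # dummy for one hypothetical case; the full HGLM adds Level-2 structure.
    import numpy as np
    import statsmodels.api as sm

    counts = np.array([10, 12, 9, 11, 10, 4, 3, 5, 2, 3])   # A phase then B phase
    phase = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])        # 0 = baseline, 1 = treatment

    X = sm.add_constant(phase)
    fit = sm.GLM(counts, X, family=sm.families.Poisson()).fit()

    print(fit.params)    # intercept: baseline log count; slope: phase effect on the log-count scale
    print(fit.pvalues)   # p value for the phase effect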
Furthermore, it is difficult to generalize findings to other settings or populations due to the small sample size typically associated with these analyses.

Effect Sizes

The degree to which a variable (or set of variables) influences outcomes is referred to as the effect size of the analysis (Cohen, 1988, 1992). According to Parker, Vannest, and Brown (2009), reporting effect sizes along with visual analysis results is helpful for four reasons. Doing so (a) promotes objectivity when raters disagree, (b) promotes precision when discussing treatment impacts, (c) provides further insight about whether results are due to chance alone, and (d) offers a means to communicate findings. Effect sizes can be calculated for any given single-case study, but it is suggested that a minimum of three data points per phase (five for the ABAB design) are needed to determine an effect (Beeson & Robey, 2006; Horner et al., 2005; Kratochwill et al., 2010; Shadish, Sullivan, Hedges, & Rindskopf, 2010).

Generating and interpreting effect sizes in SCDs is complicated by the fact that standardized mean differences (SMDs) are not readily comparable to effect sizes generated from group-based studies (e.g., randomized controlled trials and quasi-experiments). This is because the variances of SCDs tend to be small relative to group-based designs and there can be cases where mean performance shifts from baseline for an individual can yield large numerators. In SCDs, one can simply divide phase-mean differences by within-phase variance using some variant of the formula:

ES = \dfrac{\bar{X}_E - \bar{X}_C}{s}, (2)

where \bar{X}_E = the mean score for the experimental group, \bar{X}_C = the mean score for the control group, and s = the pooled standard deviation. There are variations of this general formula where the method of obtaining the denominator is altered, such as using the control group standard deviation (Grissom & Kim, 2005) or assuming that the population standard deviations are equal and therefore using the pooled estimate (Nagler et al., 2008; Van den Noortgate & Onghena, 2003). In SCDs, this is somewhat akin to using variance estimates based on intervention and/or baseline phases. The result is often a standardized effect size that is considerably larger than Cohen's general standards of 0.2, 0.5, and 0.8 (Cohen, 1988). Of course, these values are heuristics and context should drive determination of whether an effect is large, but standardized effects from SCDs can be so large that it is difficult to compare them to group-based studies (Kratochwill et al., 2010; Shadish, Rindskopf, & Hedges, 2011). An effect size formula that effectively represents single-case designs and parallels group-based studies is still being developed. Such formulas will help researchers compare effect sizes between studies for SCDs. A recently developed procedure, the d-estimator (Shadish, Sullivan, Hedges, & Rindskopf, 2010), uses the baseline standard deviation as opposed to the pooled standard deviation in the denominator. The d-estimator promises to generate estimates of SCD treatment impacts that are comparable to group-based designs (Shadish et al., 2010) but the procedure is still under development (Kratochwill et al., 2010; Shadish et al., 2010).
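As an illustration of formula (2) applied to single-case phases, a short sketch follows; the data are hypothetical, and the choice of denominator (pooled within-phase SD versus the baseline-only SD emphasized by the d-estimator) is exactly the kind of assumption discussed above. With data like these, either variant returns a value far above Cohen's 0.8 benchmark, which is why such estimates are hard to compare with group-based ones.

    # Illustrative standardized mean difference for one hypothetical ABAB case.
    import numpy as np

    baseline = np.array([10, 12, 9, 11, 10], dtype=float)
    treatment = np.array([4, 3, 5, 2, 3], dtype=float)

    mean_diff = baseline.mean() - treatment.mean()

    # Pooled within-phase standard deviation (one common variant of formula 2).
    pooled_sd = np.sqrt(((len(baseline) - 1) * baseline.var(ddof=1) +
                         (len(treatment) - 1) * treatment.var(ddof=1)) /
                        (len(baseline) + len(treatment) - 2))

    smd_pooled = mean_diff / pooled_sd
    smd_baseline = mean_diff / baseline.std(ddof=1)   # baseline SD in the denominator instead

    print(f"SMD (pooled SD) = {smd_pooled:.2f}, SMD (baseline SD) = {smd_baseline:.2f}")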
Other methods of measuring relationships between the independent and dependent variables exist. For example, R2 can express an effect as the proportion of variation explained, but it must be adjusted for use with categorical or continuous predictors and in instances where data trends are controlled for in the analyses (Brossart et al., 2006; Kirk, 1996). Therefore, there are some limitations in interpreting R2 in SCDs due to baseline and trend effects inherent in the design. For example, failure to account for a baseline trend can reduce R2 and this may in turn undermine interpretation (Parker et al., 2006). The proportion of the variation explained by the phase differences is one way to interpret R2; however, other interpretations are available in single-case research and consideration should be contingent on the design's function (Parker & Brossart, 2003). Phase differences are the shifts that occur between the baseline (a1) and treatment (b1) intervals. The actual differences of these shifts would be similar to statistical testing between the averages of the baselines and treatments.

Formulas for effect size will yield overestimated effects in SCDs (Brand, Best, & Stoica, 2011). Hence, a way to calibrate effect size estimates in SCDs is needed so they can be compared to more standard approaches such as Cohen's d. Having said that, developing such approaches is not the focus of this work; rather, the focus is on comparing HLM results (specifically p values) with visual analyses that describe whether there is a treatment effect (to clarify, these two features in essence focus on the same issue, that is, whether a treatment effect is present, not on how large the effect may be). This discussion is offered only to clarify that SCD effect size calculation is inherently problematic. Nevertheless, this work will calculate treatment impacts using conventional methods so as to describe treatment effects and to compare them to the original authors' reported effect size estimates via sensitivity analyses.

Statement of the Problem

A motivating question behind this work is: how can quantitative methods supplement visual analyses? Quantitative methods to assess treatments are becoming more popular in the literature (Maggin, O'Keeffe, & Johnson, 2011). Visual analysis does not use p values to assess treatment effectiveness; therefore, sensitivity analyses need to be conducted, not to compare the two methods, but to verify the conclusions reached by study authors. Evaluating treatment effectiveness is a critical component of any SCD designed to assess intervention effects. As noted above, visual analyses represent the preferred (or most used) method for determining if there is a treatment effect, but statistical approaches have recently been developed and are promising. Therefore, more research should be conducted to compare results of visual and statistical analyses.

In addition to examining whether visual and statistical analyses yield comparable conclusions, Nagler and colleagues (2008) described a value-added component in HLM through the use of Level-2 data. Testing the influence of Level-2 information on student performance might yield additional information about circumstances under which treatment effects are present. Lastly, there is an opportunity to assess effect size after digitizing graphs using the PND, IRD, SMD, and R2. Although this work is not a meta-analysis, testing these procedures and comparing the effect sizes to the original work will provide a sensitivity analysis. Later, data from this study can potentially contribute to the quantitative syntheses of SCDs using formulas such as the d-estimator.

The primary purpose of this study is to compare results from three different analytic procedures: (a) claims made by original study authors; (b) WWC-informed visual analyses as applied for purposes of this dissertation (note that this is not necessarily comparable to WWC procedures, which entail independent coding by two trained coders and reconciliation with a senior methodologist); and (c) results of HLM.
In the process, this work will also test the concept of quantifying graphs found in single-case research. With new digitizing software, researchers are now able to obtain coordinates of graphs and may test phase effects (i.e., differences between baselines and treatments). Secondary purposes include (a) examining whether Level-2 information explains data variance using HLM procedures in a way that might yield new insights about published SCD data, (b) application and exploration of effect size estimates, and (c) exploring the reliability and validity of UnGraph.

Research Questions

This is a methodological dissertation focused on testing new single-case analysis procedures. It is guided by four research questions, one primary and three secondary:

Primary question. (1) Does quantification and subsequent statistical analysis of selected ABAB graphs produce conclusions similar to visual analyses? Comparisons will be made between WWC visual analyses and those of the study's original authors. WWC visual analyses will be employed so as to standardize the procedure across the various studies used to inform this work.

Secondary questions. (2) Do any of the subject characteristics (Level-2 data) explain between-subjects variation? If so, can this information be used to yield new findings? The procedures advanced by Nagler and colleagues (2008) allow for statistical analyses of Level-2 variables that can yield new findings; the approach can basically be used to extend findings of the original SCD authors. Therefore, testing these procedures would seem to be a worthwhile endeavor. (3) Do the PND, IRD (nonparametric indices), SMD (a parametric index), and R2 yield effect sizes similar to those of the original studies? (4) Is UnGraph a valid and reliable tool for digitizing ABAB graphs?

Significance of Study

SCDs are primarily conducted with visual analysis using a team of experts, yet scholars have expressed concern over potential bias with the procedure and scenarios that involve "close calls." It therefore seems reasonable to examine whether HLM analyses routinely yield comparable results with visual analyses. Nagler and colleagues (2008) created a handbook concerning the use of multi-level modeling in HLM for small sample sizes. The methods used in this handbook are employed for this work. Ideally, there will be few if any differences between the different procedures. This would provide an early indication (i.e., early in the sense that only a small number of studies were examined due to the computational intensity of Nagler and colleagues' methods) that statistical and visual procedures coincide. In the event that they do not, this work will not necessarily be able to recommend one approach over another. That would be a task for later study. Nevertheless, failure of the two methods to converge would yield a warning for the emerging field of statistical data analyses of SCDs.

In addition, Level-2 predictors may provide information for prediction purposes. Statistically significant Level-2 variables would explain variation across subjects (Nagler et al., 2008). Without statistical testing of the Level-2 variables, only subjective claims can be made concerning subject characteristics and the outcome variable.
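As a sketch of how a Level-2 predictor enters such a model (the notation extends the general two-level form sketched earlier in this chapter and is an assumption of this illustration, not a model reported by the original authors), a dichotomous subject characteristic W_i, such as classroom membership coded 0/1, can be added to the intercept and phase-effect equations:

\pi_{0i} = \beta_{00} + \beta_{01} W_i + r_{0i}
\pi_{1i} = \beta_{10} + \beta_{11} W_i + r_{1i}

A statistically significant \beta_{01} would indicate that the two groups differ at baseline, while a significant \beta_{11} would indicate that the phase (treatment) effect differs across the groups.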
Unfortunately, there does not appear to be a competing statistical approach (at least so far as the researcher is aware) that answers questions pertaining to Level-2 data in SCDs. In addition, visual analysis and descriptive statistics do not yield statistical information about the phases beyond means and ranges. Since regression-based statistical approaches are recommended in the available technical literature, it seems worthwhile to test Level-2 contributions, even when sample sizes are small.

The basic point behind the Level-2 variable analyses is that the SCD literature tends not to synthesize results, and when it does, it does not go beyond making some logical inferences based on visual work. Visual analysis does not rely on statistical analyses to synthesize results and make group comparisons (e.g., boys tended to respond to treatment better than girls, kids in Classroom A tended to do better than kids in Classroom B). A lesser justification for the Level-2 work is that the approach has been advanced in the methodological literature. Accordingly, testing Level-2 variables seems logical in this sensitivity analysis simply because it is recommended as an option, and the original authors' statements concerning these variables can be compared to this work's findings (no study chosen in this work used quantification to analyze results). Further justification for testing Level-2 significance comes from Nagler and colleagues (2008). They found significance under similar circumstances and thus provided evidence that Level-2 analyses can be reasonably interpreted. Furthermore, significant Level-2 variables can be interpreted as influencing the baseline or treatment phase in some way. That is, if a significant Level-2 variable is influencing the intercept in SCD work, one should interpret that as one group (e.g., ethnic background, coded 0 = Caucasian, 1 = African American) demonstrating a difference in performance from the other on the intercept (i.e., baseline). On the other hand, if the Level-2 variable for ethnic background is significant for the treatment phase, then one group may be performing differently during the treatment phase. Finally, calculating effect size estimates can help yield new insights about treatment impacts of the studies re-analyzed in this dissertation.

Delimitations and Limitations of the Study

This dissertation is delimited in several respects. First, the study consisted of nine ABAB, behaviorally-based studies. Identifying articles that matched the WWC criteria for acceptable designs and limiting the article search to students with behavioral issues restricted the number of articles to quantify. Furthermore, the HLM procedures used in this work are intensive and time consuming, making it difficult to apply the procedures with a larger group of studies. More importantly, this work focuses on testing an emerging application of HLM and obtaining an early sense of whether results are congruent with established visual analysis procedures. It is in essence an attempt to independently examine an emerging methodology. This work does not attempt to make a substantive contribution to the knowledge base on treating students with emotional-behavioral disorders. In short, this is a methodological dissertation with the intent to make a contribution to the methodological literature; this justifies the use of a small number of studies.
A second delimitation to this work is that the methodology being tested focuses only on ABAB designs, and not on other types of SCDs such as multiple-baseline, alternating treatment, and changing criterion designs, which could be analyzed using similar methods. Using ABAB designs will limit threats to validity and possibly allow more data to be used in analyses. A third delimitation is that this work is primarily a sensitivity analysis. In the event that differences are routinely found across these procedures, this might serve as a warning to SCD methodologists, but beyond that the work will not attempt to exhaustively examine why the differences were found. Ideally, the alternatives will routinely converge and results would indicate that we have early evidence that the different techniques generate consistent results.

In terms of the primary research question, limitations to this study include using WWC visual analysis procedures without having access to WWC resources. There is no guarantee that the application of the approaches would yield identical conclusions had these studies gone through formal WWC review. Kratochwill and colleagues (2010) show that each single-case study is independently coded and analyzed by trained, doctoral-level methodologists and differences are reconciled by a senior methodologist who in turn has access to content advisors. In short, considerable resources used by the WWC are not available for this work and there is limited assurance that the visual analyses used here would yield the same results seen in a WWC review. Of course, the very rationale for the use of WWC approaches is to standardize the technique when performing sensitivity analyses. Having said that, the visual analyses were checked by the dissertation chair; he co-authored the report on WWC visual analyses and is a senior methodologist on the WWC project.

For the first and second research questions, the manual did not designate the exact matrix (i.e., identity, unstructured, diagonal) to model the data. Nagler and colleagues (2008) discuss options for constraining random effects and it is assumed that the no-constraints model is the same as the unstructured matrix, which is the default in HLM (Garson, 2012). For this reason, an unstructured variance-covariance matrix is used in these analyses and these matrices can be seen in Appendix C in tau dimensions. These matrices indicate the unstructured variance-covariance matrix, but because this is an assumption, it is a limitation. Each coefficient in the unstructured matrix estimates heterogeneous variances on the diagonal and non-zero covariances on the off diagonal (Garson, 2012; Raudenbush & Bryk, 2002).

In terms of the secondary research questions, another limitation to this study is that although the magnitude of treatment impact of each SCD will be described using different effect size calculations, none of the current procedures is ideal. There is nothing statistically wrong with calculating an SMD but, as discussed above, this approach typically yields results that are difficult to compare with effect sizes from group-based designs (Kratochwill et al., 2010). Some technical difficulties with both the PND and R2 procedures are evident. The IRD or "risk difference" is a newer effect size for summarizing single-case designs, and this will be calculated as well (Parker, Vannest, & Brown, 2009). The IRD is a difference of two proportions of data overlap (the intervention minus the baseline).
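A heavily hedged sketch of this idea follows; for illustration, an "improved" treatment point is simply one that does not overlap with the baseline range, and an "improved" baseline point is one that reaches into the treatment range. The published IRD procedure (Parker, Vannest, & Brown, 2009) identifies improved points by removing the fewest data points needed to eliminate overlap, so this simplified version can differ from it.

    # Simplified, illustrative improvement-rate-difference-style index for a
    # behavior expected to decrease (lower scores are better); data hypothetical.
    import numpy as np

    baseline = np.array([10, 12, 9, 11, 10], dtype=float)
    treatment = np.array([8, 4, 3, 5, 2], dtype=float)

    improved_treatment = treatment < baseline.min()   # treatment points below every baseline point
    improved_baseline = baseline <= treatment.max()   # baseline points overlapping the treatment range

    ird = improved_treatment.mean() - improved_baseline.mean()
    print(f"IRD (simplified) = {ird:.2f}")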
The IRD is different from other effect size calculations because it is based on risk indices (Maggin et al., 2011b). The other estimates described in this dissertation are not derived from rates. The IRD was not used in the original nine articles, but it can be compared to the PND. These effect size estimates and associated issues will be discussed in Chapter Two.

Finally, some attempt was made to assess the reliability and validity of the UnGraph extraction procedure. For the latter issue, an ideal validation effort would be to obtain raw data from additional study authors and confirm that quantified graphs yield similar means, variances, and so on. However, it is often difficult to obtain raw data, thereby limiting this aspect of the work. All original authors were contacted to obtain raw data and, for any responses to this request, comparisons will be made.

Definition of Terms

Phase Difference

A phase difference is defined as the shift in the performance of a dependent variable that occurs between the baseline phase (first A in the ABAB design) and the treatment implementation (first B). For example, if a child shows a large number of tantrums (the dependent variable of interest) at baseline, and this is reduced after a treatment is introduced, there is a phase difference.

Sensitivity Analysis

A sensitivity analysis is a process where two different analytic procedures are pursued to determine if they yield the same finding. The procedure attempts to ascertain if a set of findings is sensitive to the methods used.

Eta-Squared

Eta-squared is an effect size measure that is equivalent to R2, typically seen in ANOVA.

Error (R)

This statistical notation, R, is described as the error term which allows each student to vary from the grand mean.

Random Error (ε)

The unexplainable error found in measurements between the observed value and the predicted value in a model.

Coefficient of Determination, R2

The coefficient of determination, or R2, is a type of effect size that shows how much variance in a dependent variable is explained by the predictor variable. In the context of SCDs, R2 represents the proportion of the variation explained by phase differences.

Hierarchical Generalized Linear Models (HGLMs)

Hierarchical generalized linear models (HGLMs) are extensions of hierarchical linear modeling used where normality and the assumptions of linearity are not feasible with the given data and no transformational procedure will correct the data (Raudenbush & Bryk, 2002). Given such data, special link functions (e.g., logit, log) can facilitate the incorporation of different distributional shapes (Raudenbush & Bryk, 2002).

Treatment Effects

A treatment is deemed effective in ABAB designs if the design and resulting evidence allow for causal arguments per WWC criteria. From a statistical perspective, a treatment is deemed to be effective if baseline and treatment means in the study are significantly different from each other, p < .05. Interchangeably, phase effects are the average rates of change in log odds (binomial distribution) and log count (Poisson) as a student switches from baseline to treatment phases (Nagler et al., 2008).

Poisson and Binomial Distributions

Poisson and binomial distributions are common when using count data, and the choice between them depends on the nature of the dependent variable. For example, if the dependent variable is the number of tantrums displayed, then the distribution chosen should be Poisson.
On the other hand, if the dependent variable is the percentage of time on-task, then the distribution that accurately represents the data is binomial. The two distributions make different assumptions concerning the mean and variance. In both Poisson and binomial distributions, the variance is a function of the mean (Nagler et al., 2008). In Poisson distributions the mean and variance are equal, so as the mean increases, so does the variance. For the binomial distribution, the variance is largest when the proportion is 0.5 (Nagler et al., 2008). The Poisson distribution is generally used to model the number of events in a fixed time period. Binomial distributions are used to model the number of events that occurred when the total number of possible events is known; for example, the dependent variable comes from a fixed number of binary (0, 1) observations (Nagler et al., 2008; Raudenbush & Bryk, 2002), where the variance of each observation is p(1 – p), and p equals the probability of success (defined as an event occurring). An important feature of both distributions is that they provide some guidance about how much variance one might expect in observed data, which is important when examining the presence of overdispersion. Overdispersion In multi-level modeling, overdispersion occurs if variability in the Level-1 outcome is greater than what is expected from the Level-1 sampling model (Raudenbush & Bryk, 2002). The SCDs examined here all use count data. Again, the presence of count data requires models to assume either a Poisson or binomial distribution, and these distributions provide a sense of how much variability in the data can be expected. When the observed variability of the count data is greater than one might expect under these distributions, overdispersion is typically thought to be present (Nagler et al., 2008). Such overdispersion may complicate the capacity to assess whether a treatment effect is present using statistical procedures, as the baseline and treatment means need to be far enough apart to determine whether they are in fact different. There are, however, corrective steps that one can take, and the overall issues are discussed in greater detail in Chapter Two. Autocorrelation Autocorrelation occurs when the baselines display trend, which can make phase differences statistically indiscernible (Parker et al., 2006). Serial dependency is another term for autocorrelation, whereby the future behavior of a person is predictable from prior instances; hence a trend develops (Krishef, 1991). Autocorrelation or serial dependency can alter effect size magnitudes and significance levels (Parker et al., 2006); additionally, it violates the independence assumption of data in regression (Fox, 1991). Organization of the Study Chapter One establishes the purpose and significance of the research questions, as well as study delimitations and limitations. Chapter Two begins with a general overview of treatment interventions that are used to help children who display behavioral difficulties. Articles used in the current work involve children with identified disabilities or children who are at risk of being identified. The criteria for identifying children with disabilities are therefore described. The chapter also reviews differences between visual and statistical analyses, different types of SCDs, UnGraph software procedures, and finally, statistical concerns encountered when dealing with SCDs, such as overdispersion, autocorrelation, and effect size estimation.
40 Chapter Two concludes with an overview of subject characteristics that can explain between subject variations and how that may add value to the original articles in this study. Chapter Three reports research design, HLM interpretations, data collection and data analyses. Chapter Four includes the main results from HLM analyses. Discussion, conclusions and recommendations are in Chapter Five. References and Appendices (e.g., sensitivity analyses, effect size calculations) are attached at the end of this study. 41 Chapter Two: Review of Literature Review of Philosophical Issues There has been, as of late, an increased emphasis on the search for evidence-based instruction (Iwakabe & Gazzola, 2009). SCDs offer an important class of techniques for identifying evidence-based instruction, mainly because they can deal with small samples and highly contextualized treatments. Iwakabe and Gazzola (2009) argue that attempts to synthesize and aggregate single-case studies may help to develop evidence-based treatment interventions for populations of people with specific needs. Meta-analysts have taken an interest in statistical examinations of SCDs (e.g. permutations tests, HLM) because doing so can promote synthesis and generalization of findings. On the other hand, SCD research communities who favor localized treatment plans tend to not be as interested in probabilities or generalization (Jenson et al., 2007; Morgan & Morgan, 2009). In addition, some studies use multiple SCDs and often use very different students (e.g., students who are typically developing and students with disabilities) or contexts (e.g., teaching students in general education or self-contained settings). This makes efforts to understand treatment impacts and their generalization a complex endeavor (Kauffman & Lloyd, 1995). This work attempts to help address these disparate issues by testing and applying an emergent methodology for analyzing SCD data developed by Nagler and colleagues (2008). More specifically, the work focuses on comparing the results of emerging HLM applications and standard visual analyses when determining if a SCD produced evidence of a treatment effect. The work also applies a technique that can test for treatment 42 differences among types of students and contexts. As noted in Chapter One, these sensitivity analyses are replicated across nine studies. Although the purpose of this work is not to directly contribute to the treatment literature, it seemed reasonable to work with a corpus of studies that have a similar goal since radically different types of studies might complicate the sensitivity analyses. The decision was to focus on studies that examine treatment effects of students with behavior disorders both because this could yield a large number of SCDs that can pass WWC standards (which was necessary so standardized visual analyses could be applied) and because it seems likely that future meta-analyses of SCDs would examine this particular literature base. There is after all a current effort by the WWC to review this topic and on-going calls in the literature to identify treatments that work well with students with emotional behavioral disorders (Kauffman & Landrum, 2009). Several of the studies used in this work involve students with emotional disturbances. Severe Emotional Disturbance (SED) is defined according to IDEA (2012, p. 
7-8) as "a condition exhibiting one or more of the following characteristics over a long period of time and to a marked degree, which adversely affects educational performance an inability to learn which cannot be explained by intellectual, sensory, or health factors; an inability to build or maintain satisfactory interpersonal relationships with peers and teachers; inappropriate types of behavior or feelings under normal circumstances; a general pervasive mood of unhappiness or depression; or 43 a tendency to develop physical symptoms or fears associated with personal or school problems" (IDEA, 2012, p. 7-8). Statistical vs. Visual Analysis Whether or not one should use statistical inference in SCDs represents an ongoing controversy in the literature (Morgan & Morgan, 2009). Visual analysis proponents argue that techniques specific to SCDs mimic more widely accepted procedures used in statistical inference testing (Morgan & Morgan, 2009), but visual inspection of the data will not yield the commonly used p value index and the approach may be subject to unreliable interpretation of the analyst (Ottenbacher, 1990). It is likely that visual analysis remains in the field due to its usefulness in determining treatment effectiveness (Kratochwill et al., 2010) and drawing statistical inference is not commensurate with single-case research given the use of small samples (Edgington, 1995). Researchers believe that visual analysis techniques must be an option and remain viable given careful examination of trend, variability, and parallel statistical data interpretation (e.g., percentage of nonoverlapping data, confidence interval bands). Furthermore, visual inspection of graphs is thought to be effective and swift. Of course, there remain concerns about the potentially subjective nature of the process and occasionally low inter-rater reliabilities, even when experts are involved (Morgan & Morgan, 2009). Permutation Tests Permutation tests are described here to help justify the use of HLM since they tend to have assumption issues within SCDs. Permutation tests, also known as 44 randomization tests, are a subset of non-parametric tests which involve re-sampling the original data for all possible values of the test statistic (Edgington, 1987). In order to perform permutations of ABAB data, random assignment of treatment blocks to treatments should have been performed by the researcher (Edgington, 1987) which is difficult in SCDs. Permutation tests can be used in SCDs, but certain assumptions of the data must be met before they are interpreted. One assumption would be there is no baseline trend. Another assumption would be randomization of units to treatment settings (e.g., days). Assuming randomization was conducted, permutation tests could be applied in the context of a SCD since they do not require assumptions about the data distribution, and provide a p value that can be used to assess treatment effectiveness (Edgington, 1987; Kratochwill & Levin, 2010). The null hypothesis in a randomization test is the expectation that at each treatment time, performance would be the same had an alternative treatment study condition been given at that time (Edgington, 1987). A p value is used in tandem with the MA-MB test statistic, which is the difference of the average values between Method A and Method B. This test yields the nonparametric exact significance level (Edgington, 1995) based on rank data. 
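To make the logic of the MA – MB randomization test concrete, the sketch below enumerates every possible assignment of observations to phases and recomputes the statistic each time. The data and variable names are hypothetical, and for simplicity individual observations (rather than whole treatment blocks) are permuted, so this is an illustration of the reasoning rather than a faithful reproduction of Edgington's block-randomization procedure.

    # Permutation (randomization) test sketch for a two-phase comparison.
    # Hypothetical data; assumes the treatment assignments were randomized,
    # as Edgington (1987) requires before the test can be interpreted.
    import itertools
    import numpy as np

    scores = np.array([7, 8, 6, 7, 3, 2, 4, 3])   # observed outcome series
    labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])   # 0 = baseline (A), 1 = treatment (B)

    def mean_diff(scores, labels):
        """MA - MB: difference between the phase means."""
        return scores[labels == 0].mean() - scores[labels == 1].mean()

    observed = mean_diff(scores, labels)

    # Enumerate every way the treatment labels could have been assigned and
    # rebuild the permutation distribution of the statistic.
    n_treat = int(labels.sum())
    assignments = list(itertools.combinations(range(len(scores)), n_treat))
    count_extreme = 0
    for treat_idx in assignments:
        perm = np.zeros(len(scores), dtype=int)
        perm[list(treat_idx)] = 1
        if abs(mean_diff(scores, perm)) >= abs(observed):
            count_extreme += 1

    p_value = count_extreme / len(assignments)    # exact two-sided p value
    print(observed, p_value)

Because every admissible reassignment is evaluated, the resulting p value is exact rather than an approximation based on a theoretical distribution.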
Since the test uses all available data using iterations, permutations are advisable for conditions where sample sizes are small. Unfortunately, the smaller samples also increase the likelihood of committing a Type II error (Edgington, 1995). One must also consider if differential carry-over effects occurred. These effects can, for example, be seen when drugs/treatments are used in tandem and the researcher does not know which caused the 45 behavior(s) to change (Edgington, 1995). Furthermore, the results of permutation tests overlook any covariation of treatment effectiveness with subject characteristics (a limitation that can be addressed via HLM analyses) cannot account for baseline trends, and few researchers can randomly assign treatment blocks to treatments (Edgington, 1995). For those who prefer using permutations, certain software programs will test for the presence of a baseline trend. When the baseline is not flat, several corrective tests, such as the Allison and Gorman (ALLIS-M) which improves the mean, (ALLIS-T) for trend in slope, and in mean and slope together (ALLIS-MT) facilitate trend-control (Parker, Cryer, & Byrns, 2006). Newer methods are available but are not covered here (refer to Parker et al., 2006). Proponents of nonrandomized single-subject studies suggest however that the researcher(s) should not lose the ability to present and withdraw the treatment. There is also a practical consideration since studies with patients who have severe behavioral issues, for example, may make the use of randomization problematic (Edgington, 1987; Morgan & Morgan, 2009). One interesting aspect to SCDs is that simply increasing the number of trials (total number of possible trials on each day) for individual studies (permutations do this since they increase the length of the study) can shift a binomial distribution towards a normal distribution (Raudenbush & Bryk, 2002). Using normal distributions can minimize threats associated with quantification, but small sample sizes will influence interpretations (i.e., probabilities of behavior, assumptions concerning normality). 46 Other Types of SCDs Although only ABAB designs were used for this study, there are other types of SCDs. Testing two or more different treatments in one study would generally entail the use of an alternating-treatment design, where the comparison of subject performance under each condition can be monitored (Morgan & Morgan, 2009). The systematic manipulation of the treatments over time allows for examination of which treatment is more effective for the patient(s) (Morgan & Morgan, 2009). Alternating-treatment designs are most often used for specific people with individual needs, where the researcher can quickly assess different treatments or independent variables for custom treatment plans (Morgan & Morgan, 2009). Multiple-baseline designs are constructed so that treatments are staggered across time. The staggered onset of treatment can address various threats to internal validity such as history, regression to the mean, maturation, instrumentation, and so on. This is because if one of these factors is responsible for observed performance change, then it is likely that these will be seen across different baselines. For example, if maturation were the driver of performance change, one would expect to see improvement in baselines before the onset of treatment. 
However, if changes in performance occur only after the implementation of treatment, then one can have confidence that the treatment/intervention is causing the behavior change (see Figure 2). Figure 2. An example of a multiple-baseline design (with percent of intervals of on-task behavior as the outcome variable). Minimizing Threats to Internal Validity Using the ABAB Design Internal validity is the degree to which the researcher is certain that changes in a dependent variable are due to changes in an independent variable and not other factors (Morgan & Morgan, 2009; Shadish et al., 2002). The ABAB design can yield strong internal validity. The design can eliminate confounding, where the researcher is not sure whether a specific intervention is responsible for observed performance change or if some hidden variable is causing the effect (Morgan & Morgan, 2009). Consider, for example, maturation and history threats. Maturation in ABAB designs can be examined via reversal of the treatment condition and replication, because if maturation rather than the treatment is driving performance change, then withdrawing and reintroducing the treatment should produce no corresponding differences in performance. History threats occur when some other uncontrolled factor(s) outside of the study are influencing the dependent variable (Morgan & Morgan, 2009). The ABAB design can handle this threat by observing the different treatment phases. If some uncontrolled factor is responsible for performance, then there should be few, if any, changes associated with removal and reintroduction of the intervention (Morgan & Morgan, 2009). Regression to the mean, data loss, and instrumentation threats can also be dealt with via replication. Regression to the mean is a statistical phenomenon where extreme scores converge toward the mean, and this represents a standard rationale for the need to examine counterfactual conditions in any design that is used to investigate causal inference. In studies where subjects are chosen on the basis of high or low scores, common in single-case research, this can be a plausible threat. But again, if regression to the mean were driving performance change in an ABAB design, then there should not be an apparent treatment effect with each introduction and subsequent removal of the tested intervention. Instrumentation threats can occur when the researchers unknowingly change the treatment constructs, when bias due to observation occurs, or via repeated testing (Kratochwill et al., 2010). But again, these threats can be rendered implausible using the same logic discussed above. Attrition in single-case research can occur for several reasons, and subjects leaving too early may not provide enough information to continue the experiment. Five data points per phase and three phase repetitions are needed to fully meet standards set by the WWC. Here, losing data might render a study that does not meet standards, meaning internal validity is thought to be suspect, and the evidence produced by the study will not be examined. ABAB designs do have to contend with subjectivity during visual analyses. One rater may claim an effect upon inspection of the data and another may not (Danov & Symons, 2008). In the parlance of null hypothesis tests of statistical significance, rater disagreement can yield a problem that is similar to Type I and Type II errors. That is, if raters disagree, there is still a chance that a treatment intervention is deemed effective when in fact it is not, or ineffective when it is.
Although the definition or the amount of inter-rater agreement varies, the specifics of the data can cause raters to disagree despite being an expert or novice (Danov & Symons, 2008). A general approach for handling this concern is to train visual analysts to use replicable and systematic procedure for judging data. Reliability and Validity of the UnGraph Procedure Shadish and colleagues (2009) investigated data extraction reliability (inter-rater) and validity of UnGraph using three coders and 91 graphs. The correlation between the extracted values for the original coder and the two coders averaged .96 (median = .99). The reliability between the phases (baseline to treatment) was also tested. The average 50 difference between phases for the two coders had a correlation of r = .95. To explore the validity of UnGraph procedure, means and standard deviations from the original data were compared to baseline means from the extracted data. Also, five different methods of testing validity between extracted data and original articles were employed for the mean of each phase of the study considering the outcome variable: (1) looking at mean percentages; (2) mean time or duration; (3) mean number correct; (4) mean; and (5) percentage correct from the different types of studies. By averaging the extracted and the reported data over all phases and correlating these averages, the five kinds of data validity suggest a range from .97 - .99. The authors suggested that any issues concerning reading data and mistakes were from human error (e.g., extracting the wrong number of data points, issues reading overlapping data points). HLM Applications to SCD Techniques and model fit criteria needed to test models associated with SCDs are available. The assumptions behind HLM when analyzing single-case research designs are not that different from the assumptions of the standard HLM models, except the distributions change in count data and they are considered non-linear models. The study of individual change assumes individual’s trajectory is a function of parameters and random error (Raudenbush & Bryk, 2002, p. 186). Since the data are nested within persons, they should not be thought of as a fixed set for all participants (Jenson et al., 2007; Raudenbush & Bryk, 2002). A key interest is using HLM to determine whether there are phase differences in ABAB designs (Nagler et al., 2008; Raudenbush & Bryk, 2002). Phase differences are defined as statistically significant findings between baseline 51 and treatment phases indicating treatment effectiveness for a given study (Nagler et al., 2008). A p value is given in HLM to describe these phase differences, suggesting on average a difference between the treatment and baseline phases. Several different tests are used to determine model fit (Jenson et al., 2007). According to Van den Noortgate and Onghena (2003) the simplest hierarchical model for SCDs is the two-level model where measures are nested within persons. For AB or ABAB designs in single-case research (recall that this study pertains to ABAB designs only) each model can provide the researcher with information concerning the impact of change for the group of observations within each phase. A typical set of analyses will cluster several ABAB studies together (often, a single article will report on several at a time). In general, a model for the overall change in time with behavior is recorded for each student. Last, analyses entail modeling the change in time with behavior recorded between phases. 
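To make the nesting of measures within persons concrete, the sketch below shows the long ("stacked") layout in which digitized ABAB data are typically arranged before a two-level analysis: one row per session per student, with a phase indicator at Level-1 and a student characteristic repeated down the rows as Level-2 information. The variable names and values are illustrative only and are not those produced by UnGraph or expected by the HLM program.

    # Illustrative long-format layout for digitized ABAB data.
    import pandas as pd

    records = [
        # student, session, phase (0 = baseline, 1 = treatment), count, trials, sex
        ("s1", 1, 0, 9, 10, "M"),
        ("s1", 2, 0, 8, 10, "M"),
        ("s1", 3, 1, 3, 10, "M"),
        ("s1", 4, 1, 2, 10, "M"),
        ("s2", 1, 0, 7, 10, "F"),
        ("s2", 2, 0, 8, 10, "F"),
        ("s2", 3, 1, 4, 10, "F"),
        ("s2", 4, 1, 3, 10, "F"),
    ]
    data = pd.DataFrame(records, columns=["student", "session", "phase",
                                          "count", "trials", "sex"])

    # Level-1 rows (sessions) vastly outnumber Level-2 rows (students),
    # which is why the session-level file is always the larger of the two.
    print(data.groupby("student").size())
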
More complex models can be used to test whether the student specific characteristics contribute any information regarding treatment impacts for a particular student (Nagler et al., 2008). This is a key point in SCD research, as Kratochwill and colleagues (2010) stated, these designs can yield convincing causal information and results for a particular student in the study. The binomial distribution is often used in the context of SCDs given the presence of data from a fixed interval (i.e., probability). For example, each trial can yield a binary response and that each time period can be from a fixed number of trials. Count data can therefore be considered as an umbrella term to describe both numbers (i.e., frequency of behavior) and proportion (i.e., fixed interval or probability). Further, data can deal with 52 the number of successes “Y” in n independent trials, and the probability of success “π” is constant over trials is; Y ~ Binomial(n, π) (Skrondal & Rabe-Hesketh, 2007). In generalized linear models, the distribution is determined for the count of success Yi in ni trials for unit i and is conditional on covariates xi, Yi | xi ~ Binomial(ni, πi). The covariates determine the probability πi seen below using πi ≡ E(Pi|xi) = h(ηi), ηi = xi′β, where Pi = Yi / ni, h(·) is the inverse of a link function (i.e., logit or probit), and ηi is the linear predictor (Skrondal & Rabe-Hesketh, 2007). Therefore, the variance is a function of the expression Var(Pi | xi) = πi(1 – πi) / ni = (pi | xi) [1- (pi | xi)] / ni . The independence assumption is sometimes violated when overdispersion occurs in single-level models (Skrondal & Rabe-Hesketh, 2007). Oftentimes data in SCDs are clustered or close together within phases. Dispersion refers to the variability in a data set, as well as expected variation around a population parameter. It is commonly measured by estimating the variance of a variable. Recall, overdispersion is defined as the Level-1 outcome variability being greater than expected in comparison to what is anticipated from either a Poisson or binomial distribution (Raudenbush & Bryk, 2002) and can be diagnosed by looking at the variance components output where values of sigma-squared (σ²) over one indicate overdispersion in the model). In behavioral research, the outcome variable can be on several scales (e.g., count, dichotomous). Recall in Poisson and binomial distributions, the variance is a function of the mean (Nagler et al., 2008). The 53 mean and variance are equal in Poisson distributions, and as the variance increases so does the mean (Nagler et al., 2008). Traditionally, overdispersion occurs when there is heterogeneity among subjects (Agresti, 1996) or when the “observed random variation is greater than the expected random variation” (Ghahfarokhi, Iravani, & Sepehri, 2008, p. 544). In SCDs, overdispersion has a similar definition but when count data have more than expected variability by the binomial or Poisson distributions, overdispersion is typically (but not always) present (Nagler et al., 2008). When σ² is closer to one, (i.e., the criteria for determining overdispersion) then overdispersion is not an issue (Raudenbush & Bryk, 2002). HLM has an overdispersion option before a model is run that accounts for overdispersion present. By checking this option before running a model, likelihood estimates can be compared across models (Raudenbush & Bryk, 2002; Raudenbush et al., 2004) given the presence of overdispersion. 
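As a rough illustration of the kind of dispersion check described above, computed outside of the HLM program, the sketch below compares the observed variation in fabricated binomial proportions with the variation the binomial model expects; a Pearson statistic divided by its degrees of freedom that is well above one hints at overdispersion. The data and the simple saturated phase-means fit are assumptions made for illustration.

    # Rough overdispersion diagnostic for a binomial outcome.
    import numpy as np

    counts = np.array([9, 8, 7, 9, 3, 2, 4, 3])   # successes per session (fabricated)
    trials = np.full_like(counts, 10)             # fixed number of trials per session
    phase  = np.array([0, 0, 0, 0, 1, 1, 1, 1])   # 0 = baseline, 1 = treatment

    p = counts / trials
    # Fitted probabilities from a simple phase-means model (baseline vs. treatment).
    fitted = np.where(phase == 0, p[phase == 0].mean(), p[phase == 1].mean())

    # Pearson residuals under the binomial variance p(1 - p)/n.
    pearson = (p - fitted) / np.sqrt(fitted * (1 - fitted) / trials)
    dispersion = np.sum(pearson ** 2) / (len(counts) - 2)   # near 1 if binomial variance holds

    print(round(dispersion, 2))   # values well above 1 suggest overdispersion
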
Overdispersion continues to be a possible concern in single-case research due to the estimation of the variances at each level. If the estimation procedure used overestimates or underestimates either level of the multi-level model based on the distribution, overdispersion will be present and this highlights the need for better parameter estimations in small samples. All models in this work used binomial distributions and overdispersion can be recognized when the observed variance is larger than the expected binomial variance (Skrondal & Rabe-Hesketh, 2007). Overdispersion can be handled by using one of two methods (a) introduce a Level-1 random effect in the linear predictor (Nagler et al., 2008; Raudenbush & Bryk, 2002; Skrondal & Rabe-Hesketh, 2007) or (b) use a different estimation procedure (e.g., quasilikelihood) with a modified variance function (see Raudenbush & Bryk, 2002). It was 54 recommended by Skrondal and Rabe-Hesketh (2007) to add a random effect (error term) at Level-1. Like overdispersion, underdispersion can be an issue as well (although not as common). Underdispersion is defined as having less variation in the data than predicted (Raudenbush & Bryk, 2002). It should be noted that underdispersion is not possible for a random effect model when using HLM software since the number of trials has to be greater than one (Skrondal & Rabe-Hesketh, 2007). When overdispersion is suspected, HLM software allows for a simple correction by providing an option in the Basic Settings tab and the software in essence introduces a random effect (Option A above). Furthermore, if the researcher suspects that overdisperison is present then models can be run twice in HLM, once assuming overdisperison and once without overdispersion. The researcher can then check to see if estimates of the fixed effects are unwavering (Nagler et al., 2008). Also, the likelihood function for model fit between the two options (i.e., with or without overdispersion) will produce one model with a lower value, suggesting the better fitting model. Although models are analyzed for ABAB designs in HLM, conceptually they generally share similar elements for testing fixed and random slopes and intercepts. One distinct difference is that in single-case research, measures are within persons. In the ABAB design, multiple observations on each individual are nested within the individual (Nagler et al., 2008); that is, observations represent Level-1 data and student characteristics represent Level-2 data. A linear regression model is described next because of the commonality of such a model, and then approaches used for this work are 55 described. This is done so that the reader is comfortable with one method and can easily transition into modeling in SCDs, which follow the same understanding just different distributions. For one individual, a Level-1 linear regression model in HLM would look like: Yt = P0 + P1(Xt)+ Rt where, Yt = person observed score at time t Xt = a dichotomous variable for phase, where 0 = baseline and 1 = treatment P0 = mean of the baseline phase P1= difference between baseline mean and treatment mean. Rt = error term (assumed to be normally distributed, and independent) The Level-1 model will take all the data obtained and reduce it into two scores, one for the mean of the baseline phase, Y0, and the other for the difference between the baseline mean and treatment means, Y1 (Jenson et al., 2007). Then a separate linear model is produced for each: β0j = P00 + R0j β1j = P10 + R1j . 
According to Jenson and colleagues (2007) the equation for β0j states that the “baseline mean for subject j equals the grand mean baseline level, P00, plus a random effect, R0j” (p. 488). Similarly, β1j equals “the grand mean difference between baseline and treatment phases, P10, plus a random effect, R1j” (p. 488). The difference between 56 the treatment effect and the grand mean treatment effect is specified as R1j. In the context of a two-level model, a statistically significant slope would indicate a treatment effect. If there in an indication that remaining variation exists in baseline or treatment, then predictor variables can be added to the Level-2 model in an attempt to explain the observed variance among subjects (Jenson et al., 2007). Reliable variance for the predictors will yield significant p values. For example, if a study was using two classrooms (A = 0 and B = 1), then a variable like CLASS can be created and added to the Level-2 equation to see if the treatment was more or less effective for class A than class B. An exploratory analysis in HLM can tell the researcher where to put the CLASS variable, mainly whether it should be placed on the intercept or predictor variable. Below is an example of a Level-2 predictor (CLASS) added to the model, β1j = P10 + P11(CLASS) + R1j Where, P10 = Grand mean difference between baseline and treatment phases for Class A P11 = The difference between the mean treatment effect for Class A and Class B. One can also separately analyze the baseline means. For example, add sex (0 = male and 1 = female) to the Level-2 equation, β0j. One could test if the mean baseline behavior, γ01, increases or decreases as a function of sex: β0j = P00 + P01(SEX) + R0j 57 Where, P00 = Grand mean difference between baseline and treatment phases for males P01 = The difference between the mean treatment effect for sex. When assessing model fit, several approaches are available to test variation in behavior both within and between persons and modeling alternate covariance structures (Raudenbush & Bryk, 2002). For example, constraining the random effects can be used to explore the within subject variation and how it interacts with the intercepts and slopes (Nagler et al., 2008). In HLM, modeling count or proportional data would yield equations similar to the above, except the distributional shape would change the meaning of the output. For example, the binomial distribution would describe the log odds (logit) scale and the Poisson would describe the log scale of the behavior (Nagler et al., 2008). Three HLM Models for ABAB Designs The models described previously, although similar to the linear models typically used in software for multi-level model, will slightly change in interpretation depending on the distribution used. They were described to provide a reference and a starting point for further discussion. The remaining models are based upon one and two-level models with the binomial distribution. Nagler and colleagues (2008) discuss three types of models used to answer questions pertaining to strengths of associations, variations, and predictors associated with the dependent variable in ABAB designs. For the purposes of this discussion, a dependent variable is discussed in terms of whether a behavior occurred 58 or not, but in principle other types of dependent variables can be examined with this modeling approach. These three models will be applied to this dissertation. 
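Before those three models are described, a brief sketch may help make the preceding linear formulation concrete. The example below fits a two-level linear model with a random intercept, a random phase effect, and a Level-2 CLASS predictor on the phase effect, using a general-purpose mixed-model routine as a stand-in for the HLM program actually used in this work; the data, coefficients, and variable names are fabricated for illustration.

    # Two-level linear sketch: sessions nested in students, phase as the
    # Level-1 predictor and CLASS as a Level-2 predictor on the phase effect.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    rows = []
    for student, cls in [("s1", 0), ("s2", 0), ("s3", 1), ("s4", 1)]:
        for phase in [0] * 5 + [1] * 5 + [0] * 5 + [1] * 5:            # ABAB sequence
            y = 8 - 4 * phase - 1.5 * phase * cls + rng.normal(0, 1)   # fabricated outcome
            rows.append((student, cls, phase, y))
    data = pd.DataFrame(rows, columns=["student", "CLASS", "phase", "y"])

    # Random intercept and random phase slope for each student; the phase:CLASS
    # term asks whether the treatment effect differs between the two classes.
    # With so few students a convergence warning is likely; the fit is shown
    # only to illustrate the model structure.
    model = smf.mixedlm("y ~ phase + phase:CLASS", data,
                        groups=data["student"], re_formula="~phase")
    result = model.fit()
    print(result.summary())
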
The Full Non-Linear Model with Slopes includes all Level-1 predictors but no Level-2 predictors (Raudenbush & Bryk, 2002). The aim is to find variables that are not contributing anything to understanding the data and remove them to promote parsimony (Nagler et al., 2008). Here, we would expect the slope associated with time to be not significant. This suggests that the baseline trend is flat, or not changing over time (Nagler et al., 2008). Furthermore, a null for any interactions involving change in session/days between phases suggests the trend during treatment is predicted to be flat (i.e., does not change over time). Next, the Simple Non-Linear Model without Slopes would retain any variables that were significant in the Full model and allow the error term for all students to vary from the grand mean (Raudenbush & Bryk, 2002). Coefficients in this model allow the researcher to compute probabilities for observing target behaviors during baseline and treatment phases. This test will allow the researcher to see if any Level-2 variable(s) contribute to the model in an exploratory fashion (Nagler et al., 2008). This exploratory analysis will determine whether any Level-2 variables contribute to understanding variance in the outcome. If so, then these Level-2 variables will be added to the last model (Nagler et al., 2008). Last, the Simple Non-Linear Model with any Level-2 predictors included is tested. If any Level-2 predictors are significant, then probabilities for the target behavior can be calculated under the Level-2 conditions (e.g., Classroom A = 0, Classroom B = 1). 59 That is, probabilities of the behavior occurring in each class can be computed and compared. Further, Nagler and colleagues (2008) indicate the difference of between subject variation(s) from one model to the next can be observed from the variance components tables in the HLM output. In the ‘final estimation’ of the variance components, any between subject variations in estimates of the intercept will yield a p value which tests the null hypothesis that baseline averages for all subjects are similar. A significant p value suggests that the variance is too big to assume only estimation error. The between subject variances in phase effect produces a p value that tests the null hypothesis that on average, the probability of a behavior occurring is similar for all subjects. Additionally, any Level-2 predictors that are significant may indicate that they contribute to the estimation of the outcome measure. Other models could be designed to test how much subjects vary from average expectations between subjects (Nagler et al., 2008). Described before, constraining random effects to zero, where intercepts and phases are not allowed to vary across subjects will allow the researcher to compare model fit and decide if other constraints are necessary (Nagler et al., 2008). Some statistical evidence suggests that standard errors and estimation of the fixed effects are robust to violations of homogeneous residual errors at Level-1, var(rij) = σ² (Raudenbush & Bryk, 2002). However, the random slopes, heterogeneous Level-1 variance model has a different variance at each time point and tests if variances depend systematically as a function of any Level-1 or Level-2 predictors (Raudenbush & Bryk, 2002). 60 Variances in Two-Level Models The variance-covariance matrices in these models are based on the variances and covariances at Level-1 and Level-2. 
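As a concrete illustration, with made-up numbers, an unstructured Level-2 (tau) matrix for a model with a random intercept and a random phase effect has the two variances on the diagonal and a freely estimated covariance off the diagonal; the small sketch below simply shows how such a matrix is read.

    # Illustrative unstructured tau matrix (values are fabricated).
    import numpy as np

    tau = np.array([[0.42, -0.10],    # var(intercept), cov(intercept, phase effect)
                    [-0.10, 0.25]])   # cov(phase effect, intercept), var(phase effect)

    # Correlation between a subject's baseline level and treatment effect.
    corr = tau[0, 1] / np.sqrt(tau[0, 0] * tau[1, 1])
    print(round(corr, 2))
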
Maximum likelihood estimation will be affected by the fixed and random effects incorporated into the model (Raudenbush & Bryk, 2002). Maximum likelihood is an estimation procedure that selects values that make the observed results most probable. Depending on whether the actual data are balanced or unbalanced, the covariance structure will be estimated differently (Raudenbush & Bryk, 2002). To have balanced designs, each Level-2 unit must have the same sample size and the predictors need identical distributions (Raudenbush & Bryk, 2002). In SCDs, where any predictor variable otherwise known as time-varying covariate(s) can take on different sets of values for each person, it may be that no two participants have the same behavioral pattern at time, t (Raudenbush & Bryk, 2002). That is, at one time point, there may be only one person having a behavior. Hence, the model allows for a heterogeneous variance-covariance matrix across persons as a function of individual variation in exposure time. In SCDs, the number of session or x-axis data points can differ between people. This may be due to students missing school days, or extended baselines for some individuals (oftentimes baselines are extended if initial problem behaviors are not observed or if data patterns are highly unstable). Data extracted from SCDs do not have approximately normal sampling distributions or the same variances (Raudenbush & Bryk, 2002). Most commonly, the Level-1 and Level-2 error residuals are assumed to be independently and normally distributed with a mean of zero and constant variance, σ² (Raudenbush & Bryk, 2002; 61 Van den Noortgate & Onghena, 2003). The mean and variance do however depend on the distributional shape of the data. Given the nature of behavioral studies, the Poisson distribution with constant exposure is chosen when dealing with count data and a binomial distribution is used when the outcome is a proportion (Nagler et al., 2008). The distributional differences will affect interpretation in HLM as Poisson distributions produce estimates on a log scale and binomial distributions estimates are on a log odds or logit scale (Nagler et al., 2008). Estimation procedures. In the past, the two most commonly used estimation procedures have been Full maximum likelihood (MLF) and Restricted maximum likelihood (MLR) (Raudenbush & Bryk, 2002). Even though MLR is more accurate with smaller sample sizes, Raudenbush and Bryk (2002) recommend using MLF in order to ascertain model fits using likelihoods so as to handle the (typically) small number of Level-2 units. When there is a small number of Level-2 units (J), the MLF variance estimates will be smaller than MLR by a factor of approximately (J-F) / J, where F is the total number of elements in the fixed effects vector (Raudenbush & Bryk, 2002). It may help to point out that a vector in mixed models can be composed of observations, fixed effects, random effects, and random errors where the fixed effects are specific to the individual. If the number of observations (nj) for which each variance components (σj²) is being estimated, usually the sample size in each level needs to be large. Using ordinary least squares (OLS) is too conservative for small sample sizes, and Raudenbush and Bryk (2002) warn both estimation procedures, MLF and MLR, are too liberal. Another approach is to use an exact t-test distribution under the null hypothesis where 62 OLS is used to compute β estimates, but here likelihood functions cannot be created to compare models. 
Either way, the standard approaches for linear model estimation procedures cannot be used for two-level or three-level “nonlinear” models, instead closer approximations to ML are advised. Closer approximations to ML are available in software packages that have more options for estimation procedures, such as the Gauss-Hermite quadrature, where the random effect of μ is centered on an approximate posterior mode rather than the mean of zero (Pinheiro & Bates, 1995). If the random effects are large, then this procedure provides more accurate results for parameter estimation (Raudenbush & Bryk, 2002). But, in keeping with two-level hierarchical generalized linear models, the Penalized quasi-likelihood (PQL) and MLF are more appropriate with the algorithm being “expectation-maximization” (EM) or LaPlace approximation with Fisher scoring (Raudenbush & Bryk, 2002). The estimation procedures all maximize differently, for example, the PQL maximizes the likelihood based on the Level-2 fixed effects based on initial parameter estimates and variance-components (Raudenbush & Bryk, 2002). Recall that for two-level hierarchical analysis, three kinds of parameters need to be estimated: (1) fixed effects, (2) random Level-1 coefficients, and (3) variancecovariance components (Raudenbush & Bryk, 2002). The estimating procedures for these parameters entail the use of MLF or MLR. In SCD research using the ABAB design, examples of Level-1 predictors are a case/student identifier, behavior (outcome variable), time (independent variable), phases (baseline, treatment), and any interaction terms (Nagler et al., 2008). Examples of Level-2 predictors are any subject 63 characteristics such as cognitive ability, chronological age, or sex that could contribute or explain the dependent variable. In sum, for a small number of subjects in single-case research, HLM software provides two options, MLR and MLF, with MLR as the better option (Nagler et al., 2008; Raudenbush & Bryk, 2002). The MLF and MLR estimates will produce similar Level-1 variances; however, differences lie with the estimation of Level-2 variances (Raudenbush & Bryk, 2002). When the Level-2 sample is small, the MLF variance estimates will be smaller than MLR estimates making the MLR the preferred estimation procedure with small sample sizes (Raudenbush & Bryk, 2002). Again, to get likelihoods for model fit analyses, MLF must be used (Nagler et al., 2008; Raudenbush & Bryk, 2002). Hierarchical Generalized Linear Models (HGLMs) Hierarchical Generalized Linear Models (HGLMs) are extensions to hierarchical models for count data where the models allow the interpretation of probabilities (Raudenbush & Bryk, 2002). Hierarchical generalized linear models are a broad class of models that can be expressed in a common way. Individual models, such as the logistic, Poisson, linear, and so on, can be represented by linear functions and can be distinguished by a link function and by the probability distribution used to model errors. Link functions are transformations for the Level-1 predicted value so that they are constrained within given intervals. According to Raudenbush and Bryk (2002) the binomial sampling is the log odds of success, modeled with the following formula: Yij = log[φij/(1-φij)] , 64 where Yij is the log odds of success. If the probability of success, φij, is 0.5, the odds of success equals 1.0 and the log-odds is log(1), which equals 0 (p. 295). With Ps incorporated into the equation seen below, Yij = P0j + P1jX1ij + P2jX2ij + … + PpjXpij . 
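To preview how output on this scale is interpreted, the following small numeric sketch, with made-up coefficients, converts a fitted linear predictor on the logit scale back to odds and probabilities, anticipating the conversion formalized in the next paragraph.

    # Converting a logit-scale prediction back to odds and a probability.
    # Illustrative coefficients: baseline log-odds 0.2, phase effect -1.5.
    import math

    p0, p1 = 0.2, -1.5            # baseline log-odds and phase effect (fabricated)
    for phase in (0, 1):          # 0 = baseline, 1 = treatment
        logit = p0 + p1 * phase   # the linear predictor Yij on the log-odds scale
        odds = math.exp(logit)
        prob = 1 / (1 + math.exp(-logit))
        print(phase, round(odds, 2), round(prob, 2))
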
It is possible to generate a predicted log-odds (Yij) for any case. These predicted log-odds can be converted to odds by taking exp(Yij) (Raudenbush & Bryk, 2002). A probability can be computed from the following formula, φij = 1 / (1 + exp{-Yij}) . (3) Another threat to single-case research is autocorrelation or serial dependency (Krishef, 1991; Parker et al., 2006). Autocorrelation not only represents a violation of the independent-errors assumption, but can also alter effect size interpretation (Parker et al., 2006). Mainly, autocorrelation occurs when the baseline is unstable or displays a trend (Parker et al., 2006). If the autocorrelations are positive, the standard errors will be underestimated and the Type I error rate is increased (Crosbie, 1987; Manolov, Arnau, Solanas, & Bono, 2010; Parker et al., 2006; Scheffé, 1959; Suen & Ary, 1987). The statistical tests within HLM for baseline and treatment trend were described previously, when testing the null hypothesis concerning flatness. In the current study, if the data reveal a trend in baseline, the article will be rejected. Recall that this is customary given the standards and guidelines already in place for SCDs by the WWC. Some Effect Size Options in SCDs Effect size estimates in SCDs quantify the standardized difference in behavior between phases (Parker, Vannest, & Brown, 2009). Effect size formulas for SCDs are still being developed (Shadish et al., 2010), and Maggin, Chafouleas, Goddard, and Johnson (2011a) suggest comparing the results of multiple approaches. Each effect size estimate is calculated differently because researchers either visually assess graphs or statistically analyze data. Disparate findings are not unusual, and even statistical approaches can produce inconsistent results. In a meta-analysis of quantitative methods for SCDs involving students with disabilities, Maggin and colleagues (2011b) assessed effect size estimates and subject characteristics for 68 SCDs. Mainly, the meta-analysis was an inventory of the methods used, the subject characteristics reported, and the effect sizes published to determine treatment effectiveness. Among the results were an increased interest in high- and low-incidence disabilities in SCDs, variability in the methods used to assess SCDs, and the finding that two primary effect size estimates were reported in the literature: (1) the PND and (2) the SMD. Other effect size indices appeared in the Maggin and colleagues (2011b) study, but none accounted for more than 10% of the synthesis. Further, in connecting methodological rigor to criteria similar to those of the WWC, an emphasis shared by this dissertation, only 24 articles were used in the Maggin and colleagues (2011a) study, and no Level-2 data were collected. The Maggin and colleagues (2011a) study analyzed treatment interventions using token economies, drawing on AB, reversal, and multiple-baseline designs, and did not analyze anything beyond effect size estimates and treatment effectiveness. Four effect size estimates were compared. The two nonparametric effect size estimates were the PND and the Improvement Rate Difference (IRD), and the two parametric effect size estimates were the SMD and a raw-data multilevel effect size. For the nonparametric procedures, the PND and IRD can be used to estimate treatment effectiveness. The PND is an older, nonparametric effect size index. The PND is calculated using this formula: PND = (1 – Percent of Treatment Overlap Points) × 100.
(4) Percent of Treatment Overlap Points is defined as the total number of treatment data points that are “less extreme or equal to the greatest data point in baseline divided by the total number of data points in the treatment phase” (Maggin et al., 2011a, p.10). The PND is widely reported in the SCD literature and school psychology (Scruggs & Mastropieri, 2001), but it has weak qualities (Kratochwill et al., 2010; Maggin et al., 2011a; Shadish, Rindskopf, & Hedges, 2008). Computationally the PND is simple, but the procedure requires researcher judgment in the context of close calls and decisions about whether a data point overlaps can be inconsistent. Another limitation is that, as the number of data points increase, the PND value trends to decrease since there is more of an opportunity to demonstrate an overlap. This makes it difficult to compare PND results from one study to another unless they have the same number of observations (Allison & Gorman, 1994). In addition, the PND does not correlate highly with other effect size indices (Maggin et al., 2011a; Shadish, Rindskopf, & Hedges, 2008). The IRD 67 may be a better approach because it uses an improvement rate difference. The IRD is calculated by the difference between improvement rates (IR) for the treatment and baseline phases (Maggin, Swaminathan, Rogers, O’Keeffe, Sugai, & Horner, 2011), respectively called by this dissertation as IRT and IRB. IRT is calculated as the number of data points indicating a study participant is performing better, relative to baseline, divided by the total number of observations in the given treatment phase (Parker, Vannest, & Brown, 2009). An ABAB design would yield two IRT estimates and these are averaged. In the presence of a study with no overlapping data points, IRT would equal 100%. The IRB is derived by dividing the number data points where a student performs equal to or better, at baseline, relative to the subsequent treatment phase. A study with no overlapping data points between phases would yield an IRB of zero. IRB estimates are also averaged. The difference between the two improvement rates yields IRD (Cochrane Collaboration, 2006; Sackett, Richardson, Rosenberg, & Haynes, 1997). The formula below displays the IRD calculation: IRT – IRB = IRD (5) where, IR = improvement rate T = treatment phase B = baseline phase. A 100% on the IRD would mean that data in the baseline and treatment phases do not overlap and this indicates a highly effective treatment. By contrast, when the IRD is 68 50%, there is considerable overlap in performance between phases (Parker et al., 2009). The IRD was calculated for this study even though no article used this estimate. This was mainly done to compare it to the other effect size estimates. The parametric procedures used by Maggin and colleagues (2011a) to calculate effect sizes were the SMD and a regression-based, multilevel approach (RDMES). The first parametric effect size used between baseline and treatment phases was the SMD (Maggin et al., 2011a). The SMD was calculated by taking the difference between the mean baselines (MB) and mean treatments (MT) and dividing it by the baseline standard deviation (Busk & Serlin, 1992). The formula is as follows: (XT – XB) / s = SMD (6) Where, XT = treatment average XB = baseline average s = baseline standard deviation. 
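Pulling the overlap-based and mean-based indices above together, the sketch below computes the PND, a simplified version of the IRD, the SMD, and an OLS-based R2 from fabricated counts of a problem behavior (so lower values indicate improvement). It treats a single baseline-treatment contrast; for an ABAB design the overlap indices would be computed for each A-B pair and averaged, and the published IRD removes the minimum number of overlapping points, so results from this shortcut can differ slightly.

    # Effect size sketches for a single A-B contrast (fabricated data;
    # lower scores indicate improvement).
    import numpy as np

    baseline  = np.array([8, 9, 7, 8, 10])
    treatment = np.array([4, 3, 5, 2, 3])

    # PND: percentage of treatment points more extreme than the best baseline point.
    pnd = 100 * np.mean(treatment < baseline.min())

    # Simplified IRD: improvement rate in treatment minus overlap rate in baseline.
    irt = np.mean(treatment < baseline.min())     # treatment points showing improvement
    irb = np.mean(baseline <= treatment.max())    # baseline points overlapping treatment
    ird = irt - irb

    # SMD: mean difference scaled by the baseline standard deviation
    # (Busk & Serlin, 1992); the sign is negative here because the behavior decreased.
    smd = (treatment.mean() - baseline.mean()) / baseline.std(ddof=1)

    # R^2 from regressing the outcome on the phase indicator: the proportion
    # of variance explained by phase differences.
    y = np.concatenate([baseline, treatment])
    x = np.concatenate([np.zeros_like(baseline), np.ones_like(treatment)])
    r2 = np.corrcoef(x, y)[0, 1] ** 2

    print(pnd, round(ird, 2), round(smd, 2), round(r2, 2))
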
The baseline standard deviation is used instead of the pooled standard deviation (which is characteristic of effect sizes in group-based studies) because it allows certain assumptions concerning the normality and independence of the data to be relaxed (Busk & Serlin, 1992; Maggin et al., 2011a; White, Rusch, Kazdin, & Hartmann, 1989). The criterion for an 'effective intervention' is an SMD of 2.0 or above (Jenson et al., 2007). Although R2 results are presented below, they should be interpreted cautiously in SCDs. R2 was calculated to estimate the proportion of the variation explained by phase differences using OLS regression (see Parker & Hagan-Burke, 2007). Typically, in linear regression, R2 is a measure of goodness of fit (Parker & Hagan-Burke, 2007), and a value of 1.0 indicates a perfect fit of the regression line to the actual data. The R2 formula is as follows: R2 ≡ 1 – (SSerr / SStot), where SSerr is the sum of squared residuals and SStot is the total sum of squares of the outcome. Again, count data require that R2 be interpreted differently for SCDs. For this study, R2 was calculated using OLS regression and interpreted as the proportion of the variation explained by the phase differences (i.e., between baseline and treatment averages). The literature suggests using multiple effect size metrics for comparative purposes (Kratochwill et al., 2010; Maggin et al., 2011a). The Maggin and colleagues (2011a) study found effective interventions and comparable effect sizes using four effect size estimates. The class was the unit of analysis in the token economy study; all effect size estimates indicated treatment effectiveness, with the PND at 83.09%, the IRD at 63.92%, the SMD at 4.57, and the RDMES at 9.37 (Maggin et al., 2011a). The regression technique involved separating out the different baseline and treatment conditions and dividing the standardized scores by the root mean squared error (Maggin et al., 2011a; Van den Noortgate & Onghena, 2008). The authors recommended more stringent methodological design standards, so that more studies meet WWC criteria, and the reporting of more descriptive information (Level-2 data) in general. Maggin and colleagues (2011a) suggested that researchers provide more subject-level characteristic data so that the likelihood that a student will respond to token economies is known. Also, even though the study focused on token economies, some studies used different implementation techniques (Maggin et al., 2011a), thus limiting treatment generalizability. Additionally, Parker and colleagues (2009) took the IRD from 166 published data series contrasts (AB) and correlated it with R2, the Kruskal-Wallis W, and the PND. The Kruskal-Wallis W is the most powerful nonparametric technique available (Parker et al., 2009). The IRD correlated with the Kruskal-Wallis W at .86, with R2 at .86, and with the PND at .83. The strongest correlation was between R2 and the Kruskal-Wallis W (.92) and the weakest was between R2 and the PND (.75). It should be noted that although some of these procedures have similar interpretations, direct comparisons between the estimates are uncommon (Parker et al., 2009). In fact, among the 166 published data series, two thirds failed to meet equal variance or normality assumptions and two thirds had autocorrelation issues (Parker et al., 2009). Effect sizes for meta-analyses and comparisons of SCD and group studies.
The d-estimator (Shadish et al., 2010) should be separated from the other effect size (ES) estimates as a meta-analytic approach because it is specifically developed to generate SCD effect sizes that can be compared to group-base studies (it is however still under development). This estimate standardizes the between-case variability, not the withincase variability (Shadish et al., 2010). Future work on this formula could adjust for autocorrelation issues and incorporate pooled standard deviations in the denominator (Shadish et al., 2010). Across disciplines, the magnitude of treatment effects varies 71 (Parker, Brossart, Vannest, Long, Garcia De-Alba, Baugh, & Sullivan, 2005) thus limiting generalization. Rosnow and Rosenthal (1989) agree adding ES may be contextdependent not to mention different across similar treatments (Brand et al., 2011). One way to compare across studies would be to include raw data (McDougal, Narkon, & Wells, 2011). Researchers would be able to compare more studies, especially the ones who failed to calculate an effect size or report descriptive statistics. In an examination of behavioral self-management (BSM) techniques, McDougall, Skouge, Farrell, and Hoff (2006) found that only 1 of 38 studies using a SCD calculated ES. Afterwards, McDougall and colleagues (2006) suggested that all researchers either report an ES or provide the data necessary to calculate the ES. Most notably, the original researchers did not include descriptive statistics, such as pooled or within-phase standard deviations necessary to calculate ES indices. Maggin and colleagues (2011b) found a lack of reported data in their meta-analyses of SCDs that focused on interventions for students with disabilities. The synthesis started with 682 abstracts and from those 87 candidate studies matched specific criteria (e.g., publication date, disability related, single-subject). The inclusion criteria continued and eventually eliminated 26 more studies due to the lack of quality (e.g., not a review, disability status not clear, effect sizes not reported). Approximately 30.77% (n = 8) of the eliminated articles did not report ES. The remaining 61 articles were selected to review for patterns, themes, and similar subject characteristics and an ancestral search included 7 more studies (n = 68). The most frequent way of assessing treatment effect was by calculating a mean of the baseline and treatment phases (29.41%, n = 20) and the second most common method of estimating a 72 treatment effect was to calculate effect size at the subject level, not aggregated across subject participants (20.59%, n = 14). Power is low in single-case research given the small sample size (Manolov et al., 2010; Nagler et al., 2008) and Monte Carlo applications are providing helpful design techniques to better understand the relationship between studies (Coe, 2002; Jenson et al., 2007; Manolov et al., 2010; Raudenbush & Liu, 2004). Power analysis software called, PinT (Snijders, Bosker, & Guldemond, 2007) can determine standard errors and provide optimal sample sizes in two-level models in HLM. The newest (windows) version 2.12 can be downloaded on Snijders’ website. Type I and II error rates in SCDs Several Monte Carlo applications have been used in the context of single-case research, including examination of statistical power (Manolov et al., 2010; Raudenbush & Liu, 2004) and Type I error rates (Coe, 2002; Jenson et al., 2007). 
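A hedged sketch of the general Monte Carlo logic referenced here follows: generate many null (no-effect) AB series with a chosen level of lag-one autocorrelation, run the same test on each, and record how often a "significant" effect is declared. The generating values, series lengths, and the naive t test are arbitrary choices for illustration and do not reproduce the cited simulations.

    # Monte Carlo sketch of a Type I error check under lag-one autocorrelation.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n_per_phase, rho, alpha = 10, 0.4, 0.05
    n_sims = 2000
    false_alarms = 0

    for _ in range(n_sims):
        # Generate an AR(1) error series with no true phase effect.
        e = np.empty(2 * n_per_phase)
        e[0] = rng.normal()
        for t in range(1, e.size):
            e[t] = rho * e[t - 1] + rng.normal()
        baseline, treatment = e[:n_per_phase], e[n_per_phase:]
        # Naive t test that ignores the serial dependence.
        if stats.ttest_ind(baseline, treatment).pvalue < alpha:
            false_alarms += 1

    # With positive autocorrelation this proportion typically exceeds the
    # nominal 0.05, illustrating the Type I error inflation discussed below.
    print(false_alarms / n_sims)
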
Controlling for Type I error rates remains a concern because it would be undesirable to state an intervention is effective when it is in fact not. Monte Carlo simulations have been conducted to understand different features of SCDs. Jenson and colleagues (2007) used Monte Carlo simulation to vary the number of data points in the baseline and treatment phases, the type and magnitude of autocorrelation, and the number of subjects. Findings suggest that including random effects in Level-2 equations will provide some protection against Type I errors in several different scenarios (again, based on different autocorrelation sizes and subjects). 73 Further, Manolov and colleagues (2010) conducted a simulation to generate AB design data to test the performance of four regression-based techniques including general trend and autocorrelation. Primarily, estimation procedures were compared to detect existing effects and false alarm rates. The procedures compared were ordinary least squares (OLS), generalized least squares estimation (GLS), differencing analysis (DA), and trend analysis (TA) using three different distributions: normal, negative exponential, and Laplace. Results indicated that the regression parameters of OLS estimation could not be advised for short series designs. Choosing the correct regression model and controlling autocorrelation and trend did not guarantee useful p values for treatment effects. OLS and GLS estimation procedures were useful only when there was no trend in the data and data series were independent, but OLS appeared to be more sensitive to treatment effects than TA or DA. In TA, the correction was so severe that it was found to overcorrect the data and remove data produced by the intervention for both trend and autocorrelation. Lastly, DA had issues in detecting treatment effects when the effect was a change in slope, and did not detect ineffective interventions as effective even when more measurements were available. In conclusion, power in SCDs is influenced by several characteristics: trend, autocorrelation, type of distribution, and estimation procedures in place. Power is an enduring issue in SCDs, but it was suggested by Manolov and colleagues (2010) that positive autocorrelation alters the Type I error rates of the GLS estimate which indicates the need to estimate a data correction procedure iteratively, and not all at once. 74 Chapter Summary Visual analysis is a powerful technique in determining treatment effect and is still the dominant mode of analysis in SCDs. Treatment effectiveness can be assessed visually and with parametric procedures. In comparison with visual analysis, multi-level modeling may provide an alternative and quantifiable approach to determining treatment effect. Using multi-level models requires that subjects/classrooms/units are nested (Raudenbush & Bryk, 2002). These analyses are accurate for estimating standard errors and more information about the outcome variable and the contribution of predictor variables (Raudenbush & Bryk, 2002). For this study, the Level-1 data in HLM contains multiple data points on a single person, which is then matched to the Level-2 data (via an identifier variable). Of course, the data in Level-1 is larger than Level-2 since UnGraph counts each point in the ABAB graph as a point at each exposure time. Data from each subject is eventually aggregated together to form one file, causing Level-1 files to be larger. Using multi-level modeling is tedious. 
Only nine articles were used in the current study due to the time-consuming collection of data and the fact that Level-2 data were also analyzed, making the process even more laborious given the use of multi-level modeling. Several parametric threats to validity, such as autocorrelation, overdispersion, and power, are still being tested with Monte Carlo simulation for studies with small samples (Nagler et al., 2008). Used in tandem, visual and statistical methods can determine the magnitude of treatment effectiveness and provide more than one approach for the assessment of SCDs (Kratochwill et al., 2010; Maggin et al., 2011a). Unfortunately, with blended data collection procedures, interpretation can become unclear. Effect size estimation per study is still a concern, and efforts concerning meta-analysis techniques in SCDs are ongoing as well. Of the three effect size estimates considered here (the PND, IRD, and SMD), the PND is the most widely used in SCD research (Maggin et al., 2011a) and therefore will be used in this sensitivity analysis; it was also the most frequently reported estimate in the articles chosen. The SMD will also be used in the current study because this measure was reported in some of the original articles, providing an opportunity to conduct sensitivity analyses and a separate check to address Research Question 4 (if SMDs from digitized data are similar to those reported in the base articles, this would help validate the use of UnGraph). Again, the IRD will be calculated, but mainly as an added benefit for further sensitivity analyses, since it was not calculated in any of the nine original articles.
Chapter Three: Methodology
This chapter describes the research design and analytic procedures used to address the research questions. The purpose of this study was to conduct a sensitivity analysis comparing standardized visual analyses (based on recent WWC recommendations), multi-level models, and results presented by original authors. The key focus of the sensitivity analysis was whether the three sources of information consistently report the presence of a treatment effect. Secondary questions examined Level-2 predictors (whether they explain variance in treatment impacts across subjects), documented the magnitude of treatment impacts, and checked on the validity and reliability of the UnGraph procedure. This last effort was made in an attempt to add to the evidence base on whether digitizing data can be done consistently across different analysts. As described in Chapter One, the research questions were as follows:
Primary Question
1) Does quantification and subsequent statistical analysis of the selected ABAB graphs produce conclusions similar to visual analyses? Recall, comparisons were made between independent WWC visual analyses and those of the studies' original authors.
Secondary Questions
2) Do any of the subject characteristics (Level-2 data) explain between-subjects variation? If so, can this information be used to yield new findings?
3) Do the PND and IRD (nonparametric indices), the SMD (a parametric index), and R2 yield effect sizes that are similar to the ones reported in the original studies?
4) Is UnGraph a valid and reliable tool for digitizing ABAB graphs?
The analyses conducted for this work followed a series of steps.
Article Selection and Descriptions
ABAB single-case designs that meet WWC standards were selected for this work.
The reason for this is that the WWC performs visual analyses only on studies that meet its design criteria, which standardizes the application of visual analysis procedures across each report. To clarify, there is no way to be sure that original authors consistently applied particular visual analysis procedures or whether they used current recommendations. The WWC approach, by contrast, standardizes procedures, which sets the stage for comparing and contrasting results across the two key methods of interest. Identifying articles that should meet standards is not a trivial step: seventy-eight articles were originally found, and only nine were chosen for this dissertation. There are two general reasons why some articles were not included in this work: (a) Some articles did not meet a WWC standard, such as having fewer than three data points per phase or failing to report inter-observer agreement. Keep in mind the desire to use standardized visual analysis approaches; the WWC does not analyze studies that do not meet its design standards. (b) Some articles reported results for only one student. Reporting on only one subject prevents addressing the first research question, since the HLM analyses will not converge with a single participant. Also, the second research question, concerning whether subject characteristics explain variance in the dependent variable, could not be answered. As noted above, only articles that examine the impacts of interventions on children with behavioral issues were used. Characteristics of the children included in the selected articles include the presence of social delinquency, aggressive, withdrawn, depressed, disruptive, disobedient, or acting-out tendencies.
Since this work only examined articles that should meet WWC standards, it is important to review study features that would meet the standards. These include: the systematic manipulation of the independent variable; inter-rater agreement (generally called interobserver agreement) documented on the basis of a statistical measure of consistency between raters, for example percentage/proportional agreement (0.80 - 0.90 on average) or Cohen's kappa coefficient (0.60), where percentage agreement reflects raw observer agreement and Cohen's kappa accounts for chance agreement; a minimum of three data points per phase (note that studies with as few as three data points in a phase can meet standards "with reservations," so studies with this limitation were included); and an attempt to demonstrate an effect at three different points in time. Again, only articles that can meet these standards were included in the study, to remove any question about whether WWC analysis procedures could be applied to them.
ERIC, Academic Search Complete, PsycINFO, and Social Science (Education Full Text) were searched using the key words: SCD, ABAB design, behavior, and emotional disturbance. Seventy-eight articles focusing on classroom-based behavioral interventions were found, and of these, nine used an ABAB design and satisfied WWC criteria. The articles selected for this study are mainly concerned with decreasing off-task behavior or increasing participation in the classroom. Several of the articles loosely define children with disabilities, and some include children with disabilities integrated with typically developing children.
The use of only nine articles reflects a combination of the time-consuming procedures required for this study, the nature of a methodological dissertation, and the scarcity of articles that meet all of the required WWC standards and the specifics of the ABAB design. Further, this work was primarily interested in providing sensitivity analyses, not generalizations. The articles chosen, and the design rating applied using the WWC standards, are displayed in Table 1. Three of the articles that met with reservation had fewer than five data points per phase, and one study had 18% (instead of 20%) agreement in phases. Given the limited number of articles that meet all WWC requirements, the decision was made to include these articles.
Table 1
Titles of Articles and WWC Standards (Authors; Title; Size; Meets Standards / Meets With Reservation)
1) Amato-Zech, Hoff, & Doepke (2006). Increasing On-Task Behavior in the Classroom: Extension of Self-Monitoring Strategies. N = 3. Y.
2) Cole & Levinson (2002). Effects of Within-Activity Choices on the Challenging Behavior of Children with Severe Developmental Disabilities. N = 2. Y.
3) Lambert, Cartledge, Heward, & Lo (2006). Effects of Response Cards on Disruptive Behavior and Academic Responding during Math Lessons by Fourth-Grade Urban Students. N = 9. Y.
4) Mavropoulou, Papadopoulou, & Kakana (2011). Effects of Task Organization on the Independent Play of Students with Autism Spectrum Disorders. N = 2. Y With Reservation.
5) Murphy, Theodore, Alric-Edwards, & Hughes (2007). Interdependent Group Contingency and Mystery Motivators to Reduce Preschool Disruptive Behavior. N = 8. Y With Reservation.
6) Ramsey, Jolivette, Puckett-Patterson, & Kennedy (2010). Using Choice to Increase Time On-Task, Task-Completion, and Accuracy for Students with Emotional/Behavior Disorders in a Residential Facility. N = 5. Y With Reservation.
7) Restori, Gresham, Chang, Lee, & Laija-Rodriquez (2007). Functional Assessment-Based Interventions for Children At-Risk for Emotional and Behavioral Disorders. N = 8. Y With Reservation.
8) Theodore, Bray, Kehle, & Jenson (2001). Randomization of Group Contingencies and Reinforcers to Reduce Classroom Disruptive Behavior. N = 3. Y.
9) Williamson, Campbell-Whatley, & Lo (2009). Using a Random Dependent Group Contingency to Increase On-task Behaviors of High School Students with High Incidence Disabilities. N = 6. Y.
Visual Analyses of Selected Articles
Standards put forth by the WWC list six features for assessing intervention impacts by visually examining between- and within-phase patterns of data. These are described above but are summarized again for this chapter. The term "between phase" denotes the idea of comparing and contrasting data between adjacent phases in a study. For example, does the number of tantrums a child exhibits at baseline differ from the number seen after onset of treatment? Within-phase analyses consider information such as data trends and variability within a given study phase (Horner et al., 2005), such as baseline or a given treatment phase. Visual analyses consider a series of perspectives which include: (1) level, (2) trend, (3) variability, (4) immediacy of the effect, (5) overlap, and (6) consistency of data patterns across similar phases. The six features assess patterns across all phases of the design (Kratochwill et al., 2010). Levels are the means within phases, whereas trend refers to the slope of the regression line, or line of best fit, found within each phase.
Consideration of variability is no different from the commonly understood definition of the term, keeping in mind that it is often assessed via visual approaches as opposed to the application of descriptive statistics. Immediacy of effect is the comparison of the last three points in the baseline phase to the first three points in the intervention phase; it differentiates patterns between phases. Assessing immediacy of the effect helps establish the presence of a treatment effect because rapid change in performance after the onset or removal of treatment makes it easier to discern that the treatment variable caused changes in the dependent variable. Overlap deals with the percentage of data in one phase that overlaps with data in the adjacent phase; the larger the separation of data, the more likely there is a treatment effect (Kratochwill et al., 2010). To clarify, high levels of data overlap suggest limited or no change in performance from treatment to non-treatment phases; if this is the case, there is no evidence of a treatment impact. Lastly, consistency of data in similar phases involves comparing the baseline phases with each other (e.g., comparing baseline 1 and baseline 2) and the treatment phases with each other, looking for consistent patterns (Kratochwill et al., 2010). The more consistency across similar phases, the more plausible it is that a causal relation exists (Kratochwill et al., 2010).
Patterns are closely observed in visual analysis for trends. A key trend that was analyzed was the first baseline phase. Typically, a stable baseline must be present before an intervention should be implemented, because a trend suggesting a decline in the behavior of concern can make distinguishing levels difficult. In other words, should problematic target behavior be diminishing on its own before treatment, it would be difficult to conclude that introduction of an intervention is responsible for any improvement. Differentiating an effective treatment intervention from one that is not entails consideration of the six features already described for the visual analysis method. For this dissertation, all treatments will be considered either effective or non-effective. In order to be conservative, any questionable treatment interventions will be considered non-effective. This is done to simplify interpretation.
Digitizing and Modeling the Data
Nagler and colleagues' (2008) manual, Analyzing Data from Small N Designs Using Multilevel Models: A Procedural Handbook, examined small sample sizes in multilevel models. The manual was followed for this work, so the general analytic steps are described here. Here, one ABAB design was scanned from existing literature and UnGraph was used to digitize the coordinates. Once the data were obtained, Nagler and colleagues (2008) used SPSS to create dummy-coded variables and interaction terms, and then used HLM software to statistically compare baseline to treatment phases. Dummy coding sets the stage for examining the presence of treatment effects; A and B phases were distinguished using a 0/1 coding scheme. In the handbook several types of designs were analyzed, but the ABAB design is of interest here because it can yield strong evidence of a treatment effect, given that it includes a reversal to baseline. The procedures and HLM analyses described in Chapters One and Two will be used to answer part of the research questions; the effect size estimates were analyzed separately and compared to the original studies.
Before hierarchical models were analyzed in HLM/STATA software, SPSS was used to transform variables. The variables that required transformation were the outcome measures and the independent variable (sessions). The outcome measure was rounded up or down if it was reported as a whole number in the original article (e.g., number of off-task behaviors); otherwise, the outcome measure was not changed. Recall, the dependent variables in these articles all reflect some percentage of the target behavior in a given time period. The session variable was rounded to whole numbers. Session was the x-axis of the graph(s) and represented the number of sessions in the original study. Session was re-centered so that zero was the final session of each baseline phase. This was done for HLM interpretation, so that zeros on all Level-1 variables (e.g., phase, session, interactions) indicated the final baseline session. The variables added in SPSS included trials, session (time), treatment, order, and interaction terms (treatment by order, session by treatment, session by order, and a three-way interaction among session, treatment, and order). A trial indicated the total number of possible trials on each day and was not really a variable but a constant. Order was coded 0 for the first AB pair and 1 for the second AB pair; this was done to permit a statistical test between the first introduction of the treatment and the second.
All of the articles produced data that followed a binomial distribution; all outcomes were some form of a "proportion correct" variable, such as the percentage of time a student was on-task. Use of the binomial distribution in modeling was therefore expected and appropriate. For all analyses, a standard critical p value was used, p < .05. STATA was used when there were only two subjects in an article (there were two such articles) because this software package was able to provide closer approximations when estimating the parameters (results did not converge in HLM for these studies). Raudenbush and Bryk (2002) have suggested HGLMs using Fisher scoring as the estimation procedure (for closer approximations to ML estimates), which could aid with convergence issues; this estimation procedure is not available in HLM for measures within persons. All other analyses were performed using MLF, in order to compare likelihood functions for model fit. Recall that the binomial distribution models the number of successes Y in n trials when the trials are independent and the probability of success π is constant over trials: Y ~ Binomial(n, π) (Skrondal & Rabe-Hesketh, 2007). In generalized linear models, the count of successes Yi in ni trials for unit i, conditional on covariates xi, is modeled as Yi | xi ~ Binomial(ni, πi). All of the articles in this study were modeled with the binomial distribution. Three types of models were run in both HLM and STATA, mirroring procedures in Nagler and colleagues' (2008) manual. Overall, the objective was to obtain a model with good fit (likelihood function closest to zero) while accounting for overdispersion.
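Before turning to the model specifications, the following minimal sketch illustrates the kind of variable coding described at the start of this section. The column names and data values are hypothetical; this is an illustration of the coding scheme, not the SPSS syntax actually used in this study.

```python
import pandas as pd

# Hypothetical long-format data for one subject: one row per digitized session.
df = pd.DataFrame({
    "subject": [1, 1, 1, 1, 1, 1, 1, 1],
    "session": [1, 2, 3, 4, 5, 6, 7, 8],                      # x-axis value from the graph
    "phase":   ["A1", "A1", "B1", "B1", "A2", "A2", "B2", "B2"],
    "outcome": [7, 8, 3, 2, 6, 7, 2, 1],                       # count out of the possible intervals
})

df["trials"] = 10                                              # constant: possible intervals per session
df["trt"]   = df["phase"].isin(["B1", "B2"]).astype(int)       # 0 = baseline, 1 = treatment
df["order"] = df["phase"].isin(["A2", "B2"]).astype(int)       # 0 = first AB pair, 1 = second AB pair

# Re-center session so that 0 marks the final baseline session within each AB pair.
last_base = (df[df["trt"] == 0]
             .groupby(["subject", "order"], as_index=False)["session"].max()
             .rename(columns={"session": "last_base"}))
df = df.merge(last_base, on=["subject", "order"])
df["sess1"] = df["session"] - df["last_base"]

# Interaction terms used in the Full Non-Linear Model with Slopes.
df["trtord"]   = df["trt"] * df["order"]
df["s1trt"]    = df["sess1"] * df["trt"]
df["s1ord"]    = df["sess1"] * df["order"]
df["s1trtord"] = df["sess1"] * df["trt"] * df["order"]
print(df)
```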
In an ABAB design for single-case research, the Full Non-Linear Model with Slopes included all Level-1 predictors but no Level-2 predictors. An example of a Full Non-Linear Model with Slopes is as follows:
Level-1:
Yti = P0i + P1i(SESS1ti) + P2i(TRTti) + P3i(ORDERti) + P4i(TRTORDti) + P5i(S1TRTti) + P6i(S1ORDti) + P7i(S1TRTORDti)
Level-2:
P0i = β00 + R0i
P1i = β10 + R1i
P2i = β20 + R2i
P3i = β30 + R3i
P4i = β40 + R4i
P5i = β50 + R5i
P6i = β60 + R6i
P7i = β70 + R7i
where:
Yti = the log odds of the dependent variable occurring (or the expected number of behaviors, out of the maximum possible intervals).
P0i = the average log odds at the final baseline session for all subjects (β00), plus an error term to allow each student to vary from this grand mean (R0).
P1i = the average rate of change in log odds per day of observation (SESS1) during each baseline/treatment pair (β10), plus an error term to allow each student to vary from this grand mean effect (R1); 0 = last baseline observation.
P2i = the average rate of change in log odds as a subject switches from the baseline to the treatment phase (TRT) for all students (β20), plus an error term to allow each student to vary from this grand mean (R2).
P3i = the average rate of change in log odds as a subject switches from observations in the first AB pair to observations in the second AB pair (β30), plus an error term to allow each student to vary from this grand mean (R3).
P4i = the average change in treatment effect as a subject switches from the first AB pair to the second AB pair (TRTORD) for all students (β40), plus an error term to allow each student to vary from this grand mean (R4).
P5i = the average change in session effect (i.e., time slope) as a subject switches from baseline to treatment phase (S1TRT) for all students (β50), plus an error term to allow each student to vary from this grand mean (R5).
P6i = the average change in session effect (i.e., time slope) as a subject switches from the first AB pair to the second AB pair (S1ORD) for all students (β60), plus an error term to allow each student to vary from the grand mean (R6).
P7i = the average change in the differing slopes in baseline versus treatment phases as a subject switches from the first AB pair to the second AB pair (S1TRTORD) (β70), plus an error term to allow each student to vary from this grand mean (R7).
The full Level-1 formula states that the log odds of any dependent variable (or the expected number of days where the behavior was observed, out of X possible intervals) is the sum of eight parts: the log odds at the intercept (the final baseline session), plus a term accounting for the rate of change in log odds with implementation of the intervention (TRT), plus a term accounting for the rate of change in log odds with time (SESS1), plus a term accounting for the rate of change in log odds from the first phase pair (A1, B1) to the second phase pair (A2, B2) (ORDER), plus four interaction terms (three 2-way interactions and one 3-way interaction) (Nagler et al., 2008). As the purpose for now is only to provide an example, the definitions of the average changes in variable effects are presented exactly as Nagler and colleagues (2008) presented them in their manual. As noted in Chapter Two, the inclusion of random effects in the Level-2 equations generally protects against Type I errors; for this reason, all error terms were activated in the above Level-2 model. In general, the aim in the Full Non-Linear Model with Slopes is to find variables that are not contributing anything to the model and remove them for the sake of parsimony.
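For reference, substituting each Level-2 equation into the Level-1 equation gives the combined (mixed) form of the same model. This is only a restatement of the equations above, written so that the fixed effects and the subject-specific random effects are visible at a glance (with ηti denoting the log odds for subject i at occasion t):

```latex
\begin{aligned}
Y_{ti} \mid \pi_{ti} &\sim \mathrm{Binomial}(n_{ti}, \pi_{ti}), \qquad
\eta_{ti} = \log\!\frac{\pi_{ti}}{1-\pi_{ti}},\\[4pt]
\eta_{ti} &= \beta_{00} + \beta_{10}\,\mathrm{SESS1}_{ti} + \beta_{20}\,\mathrm{TRT}_{ti}
            + \beta_{30}\,\mathrm{ORDER}_{ti} + \beta_{40}\,\mathrm{TRTORD}_{ti}\\
          &\quad + \beta_{50}\,\mathrm{S1TRT}_{ti} + \beta_{60}\,\mathrm{S1ORD}_{ti}
            + \beta_{70}\,\mathrm{S1TRTORD}_{ti}\\
          &\quad + R_{0i} + R_{1i}\,\mathrm{SESS1}_{ti} + R_{2i}\,\mathrm{TRT}_{ti}
            + R_{3i}\,\mathrm{ORDER}_{ti} + R_{4i}\,\mathrm{TRTORD}_{ti}\\
          &\quad + R_{5i}\,\mathrm{S1TRT}_{ti} + R_{6i}\,\mathrm{S1ORD}_{ti}
            + R_{7i}\,\mathrm{S1TRTORD}_{ti}.
\end{aligned}
```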
One modeling strategy described by Nagler and colleagues (2008) is to examine the slope of measurement occasion in the models; this slope can be referred to as "time." The expectation is that the baseline slope will not be statistically significant, which would suggest that the baseline trend is flat. Furthermore, if there is no interaction between the change in session/days between phases and this slope, this would suggest the trend is flat within phases. This analysis can help inform researchers about subtle data trends that may be hard to discern using visual analysis alone.
A second model is the Simple Non-Linear Model without Slopes. This model retains the variables that were significant in the Full Non-Linear Model with Slopes. An example of the Simple Non-Linear Model without Slopes from Nagler and colleagues' manual, in which only a treatment variable was retained, is:
Level-1:
Yti = P0i + P1i(TRTti)
Level-2:
P0i = β00 + R0i
P1i = β10 + R1i
where:
Yti = the log odds of the dependent variable occurring.
P0i = the average log odds during baseline for all subjects (β00), plus an error term to allow each student to vary from the grand mean (R0).
P1i = the average rate of change in log odds as a subject switches from baseline (TRT = 0) to the treatment phase (TRT = 1) for all students (β10), plus an error term to allow each student to vary from the grand mean (R1).
The Level-2 equations model the intercepts and phase changes. The Level-1 equation now states that the log odds of the dependent variable is the sum of two parts: the log odds at the intercept (in this case, the baseline phase overall, since trend was found to be flat and removed), plus a term accounting for the rate of change in log odds with a phase change (treatment phase). This model allows error terms for all students to vary from the grand mean (Raudenbush & Bryk, 2002) and can yield a p value to determine whether there was a treatment effect. The coefficients in this model also allow for the computation of probabilities, such as the probability of observing a behavior during the baseline and treatment phases (e.g., 15 observed disruptive behaviors in the first baseline compared to 5 observations of a disruptive behavior in the first treatment phase). As a follow-up analysis, Level-2 variables can be used to analyze whether subject characteristics appear to explain variance in performance. Note that the use of Level-2 information is a departure from a typical visual analysis. HLM provides a t statistic for each candidate Level-2 variable, indicating its contribution to the estimation of the outcome. The third and last model includes the Level-2 predictor(s) with the largest t value(s) from the second model's exploratory analysis. Nagler and colleagues refer to this third and last model as the Simple Non-Linear Model with any Level-2 predictors. If no Level-2 predictors are found, the model is called the Simplified L-1 Model with No Level-2 Predictors. Again, coefficients in this model can be used to determine the probability of behaviors occurring at baseline and treatment, respectively. An example of this model is as follows (assuming a Level-2 variable called "Class B," which refers to the class a child is assigned to):
Level-1:
Yti = P0i + P1i(TRTti)
Level-2:
P0i = β00 + β01(CLASSBi) + R0i
P1i = β10 + R1i
where:
Yti = the log odds of the dependent variable occurring.
P0i = the average log odds during baseline for all students (β00), plus a term to allow students in Class B to have a different baseline level (β01), plus an error term to allow each student to vary from this grand mean (R0).
P1i = the average rate of change in log odds as a subject switches from baseline (TRT = 0) to the treatment phase (TRT = 1) for all students (β10), plus an error term to allow each student to vary from this grand mean (R1).
The Level-1 equation still reads the same as in the Simple Non-Linear Model without Slopes, but now the Level-2 equations model the baselines and phase changes. In their example, Nagler and colleagues calculated between-subject variances from the variance components tables in the HLM output. In the 'final estimation' of the variance components, any between-subject variation in estimates of the intercept yielded a p value which tested the null hypothesis that baseline averages for all subjects were similar. A significant p value suggested that the variance was too large to be attributed to estimation error. The between-subject variance in the phase effect produced a p value that tested the null hypothesis that, on average, the probability of observing disruptive behavior was similar for all subjects. Further, any Level-2 predictors that were significant were included for their contribution to the estimation of the outcome measure in the final model. If the predictors reduce the variance in the model, then probabilities of the behavior occurring in both the baseline and treatment phases can be calculated and compared. If a Level-2 variable is significant, then probabilities for observing a target behavior (more generally, the outcome variable) should be computed.
Comparing and Contrasting Visual Analysis, HLM/STATA, and Author Reports
Table 2 is meant to convey how results across methods were compared (actual results are in Chapter Four). The table yields a quick reference pertaining to the consistency of results across methods. A sensitivity analysis on the effect sizes is also described in Chapter Four.
Table 2
Results of Sensitivity Analyses Pertaining to Statement of a Treatment Impact
Columns: Study and Author(s); Dependent Variable; Published Statement; WWC Independent Visual Analysis*; HLM/STATA Results (p values)**. Rows: (1), (2), and so on for each study.
*Only one visual analysis expert was used in the current study.
**For a treatment to be deemed effective, p < .05.
As noted in Chapter One, an ideal outcome would be to find no differences across the different methods in the overall statements of whether there was a treatment effect. Should author statements diverge from WWC and HLM results, this might reasonably be attributed to disparate application of the visual analysis technique. Should there be differences between WWC and HLM results, it would be of interest to SCD researchers, since two different but reasonable analytic strategies would have yielded inconsistent findings.
Research Question 2: Level-2 Predictors
As noted above, the Simple Non-Linear Model without Slopes further tests Level-2 variable(s) in an exploratory fashion (Raudenbush & Bryk, 2002). This exploratory analysis examined the capacity of a Level-2 variable, or set of variables, to explain variation in observed performance across a group of subjects (that is, a group of similar ABAB data series).
In the event any significant Level-2 variables are identified, a third model, the Simple Non-Linear Model including any Level-2 predictors, can be used to describe how participant characteristics explain variation in the outcome variable. In the 'final estimation' of the variance components, any between-subject variation in the estimates yields a p value which tests the null hypothesis that baseline averages for all subjects are similar (Nagler et al., 2008). Again, according to Nagler and colleagues (2008), a significant p value, p < .05, suggests that the variance is too large to be attributed to estimation error. The between-subject variance across phase effects produces a p value that tests the null hypothesis that, on average, the probability of observing a target behavior is similar for all subjects. If the Level-2 variables did not contribute anything to the model (p > .05), they were dropped from the analysis, and the associated study is considered to have 'no Level-2 predictors' that explained variation in the model. Note that the Level-2 analyses require there to be at least three people in a study or the model will not converge due to small sample size; therefore this procedure was limited to five studies. To clarify the goals of the analyses, Table 3 displays how results are to be depicted (actual results for each article are presented in Chapter Four).
Table 3
Results of the Multilevel Models and Level-2 Contributors
Columns: Articles (Study 1, Study 2, and so on). Rows: Overdispersion / without Overdispersion; Simplified Level-1 model with Level-2 predictors on intercept (TRT); Coefficients; Variance Components (SD); Likelihood Function (MLF).
* = p < .05, ** = p < .01
Alternative Approaches for Exploring Variation
As suggested by Nagler and colleagues (2008), there are alternative approaches for exploring the variation among intercepts and phase effects. These approaches allow the researcher to observe how much subjects vary from average expectations. Recall, the models in this dissertation use an exploratory analysis to aid in the detection of Level-2 variables and between-subjects variation. That is, the models use the variance components column and associated p values to determine treatment effectiveness and between-subject variation, which is the same method employed by Nagler and colleagues (2008). Since the between-subject variation was already detected via exploratory Level-2 variable contributions, it was not necessary to use the alternative analyses. The alternative approach is to constrain the random effects in four different ways and compare likelihood functions: (1) place no constraints on the random effects, (2) restrict the intercepts to zero so they do not vary across subjects, (3) restrict the phase effects to zero across subjects, and (4) restrict both intercepts and phase effects to zero. The manual followed for this dissertation uses the Level-2 exploratory analysis, rather than the constraints just described, to explain the between-subject variance. It is up to the researcher whether to use the exploratory method (placing the Level-2 variable(s) on the intercepts and phase effects to test for between-subject variance) or to constrain the random effects; both methods do not need to be employed, especially given the nature of a sensitivity analysis. Further, these data are repeated measures, which complicates the error structures needed to model the data.
Recall that all of the slopes and intercepts vary across Level-2 units and are not fixed (i.e., error terms are activated) for the three models used in this dissertation, allowing the grand means to vary, which is most consistent with a model without constraints. However, these types of repeated-measures designs have variance-covariance matrices that may be treated as unstructured or identity based, among other types of structures, such as the diagonal matrix (Singer & Willett, 2003). For the unstructured matrix, three quantities are estimated: the variance of the intercepts, the variance of the slopes, and the covariance between slopes and intercepts. The benefit of using the identity matrix over the unstructured matrix is that one gains a degree of freedom because no estimation is required for the covariance (Singer & Willett, 2003). However, since Nagler and colleagues (2008) indicate they used a no-constraints model, it is assumed that their analyses followed an unstructured matrix. Since no restrictions were placed on the variance/covariance matrix in the Nagler and colleagues (2008) manual, this dissertation followed their model-building strategies, and no further testing was needed to assess model fit using alternative approaches. Again, alternative strategies were suggested by Nagler and colleagues (2008) to explore variation; however, this was not the focus of this dissertation, nor was it necessary. It is worth mentioning that Nagler and colleagues (2008) do imply that more research is needed concerning between- and within-subject variation and the appropriate estimation procedures needed to fit models with small sample sizes.
Effect Sizes
Research Question 3 dealt with calculating effect sizes using the PND, IRD, and SMD. When possible, the calculations were compared to what was reported in the original studies. R2 was also calculated to determine the proportion of the variation explained by phase differences using ordinary least squares (OLS) regression, using the dependent measure(s) and the treatment variable only. Manolov and colleagues (2010) suggested that OLS estimation procedures are sufficient to determine the proportion of variance when there is no trend in the data and the data series are independent. Calculations for the four estimates used in this study are provided in Chapter Four. For the current study, the PND statistic and the SMD were checked for similarity to the original studies. Also, the PND reported in the original articles was compared to the PND calculated from the extracted data to assess consistency between the two methods. To provide background information needed for the following discussion, Table 4 lists all formulas used for the effect size calculations.
Table 4
Effect Size Methods, Criteria, and Software (Procedure; Method; Criteria; Program)
Visual Analysis: Assessing level, trend, variability, immediacy of effect, overlap, and consistency of data patterns. Criteria: see Chapter Two.
PND: PND = (1 − percent of treatment overlap points) × 100. Criteria: ≥ 90% very effective treatment; 70%-90% effective treatment; 50%-70% questionable effectiveness; < 50% ineffective treatment (Scruggs & Mastropieri, 1998). Program: SPSS/Excel.
IRD: IRD = IR Treatment − IR Baseline. Criteria: 100% (1.00) very effective; 50% (.50) half of the phases (AB) overlap, or chance-level improvement (Vannest et al., 2011). Program: online calculator.
SMD: SMD = (Treatment Average − Baseline Average) / Baseline Standard Deviation. Criteria: approximately 2.0 indicates an effective treatment (Jenson et al., 2007). Program: SPSS.
R2: R2 = 1 − (residual sum of squares / total sum of squares), the proportion of the variation explained by the phase differences (OLS regression). Criteria: 1.0 is a perfect fit. Program: SPSS.
The PND statistic was analyzed by assessing the percentage of data that did not overlap: choosing the most extreme value in the baseline and calculating the amount of data in the treatment phase that did not overlap with it (Morgan & Morgan, 2009). The IRD was calculated by taking the difference between two improvement rates (IRs). According to Parker, Vannest, and Brown (2009), the IR is defined as "the number of improved data points divided by total data points in each phase" (p. 139):
IR = improved data points / total data points in the phase   (7)
Recall from Chapter Two that the IRD is simply the difference between the proportions of improved data points in the treatment and baseline phases. These points can be identified visually or with software for calculating the IRD. The IRD formula is:
IRD = IRT − IRB   (8)
Again, the difference between the two improvement rates for the phases (where T stands for treatment and B for baseline) is the definition of the IRD (Cochrane Collaboration, 2006; Sackett, Richardson, Rosenberg, & Haynes, 1997). Fortunately, free software is available to calculate the IRD, and multiple contrasts can be compared (Vannest, Parker, & Gonen, 2011). Contrasts are determined by the researcher. The two most common contrasts for ABAB designs were calculated for this study, although only one was needed for the calculation of the IRD. Recall that the baselines were coded as A1 and A2 and the treatments as B1 and B2. One contrast is A1 versus B1, B1 versus A2, and A2 versus B2. Another option is to compare the A's to the B's, meaning A1A2 versus B1B2. Both contrasts were calculated and compared to the reported PND from the original articles, even though only one contrast option is needed to calculate the IRD. The SMD was calculated by taking the difference between the treatment mean (MT) and the baseline mean (MB) and dividing it by the first baseline standard deviation (Busk & Serlin, 1992).
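As an illustration of the Table 4 formulas, the short sketch below computes the PND, SMD, and R2 for a single AB contrast using made-up data for a target behavior that should decrease with treatment (the IRD was obtained from the online calculator and is not reproduced here). This is only a sketch of the arithmetic, not the SPSS/Excel procedures actually used in the study.

```python
import numpy as np

# Hypothetical counts of disruptive intervals per session for one AB contrast.
baseline  = np.array([7, 8, 6, 9, 7], dtype=float)
treatment = np.array([3, 2, 4, 1, 2], dtype=float)

# PND: percentage of treatment points more extreme (here, lower) than the most
# extreme (lowest) baseline point (Scruggs & Mastropieri, 1998).
pnd = 100 * np.mean(treatment < baseline.min())

# SMD: (treatment mean - baseline mean) / baseline standard deviation (Busk & Serlin, 1992).
smd = (treatment.mean() - baseline.mean()) / baseline.std(ddof=1)

# R^2 from an OLS regression of the outcome on the phase dummy (0 = A, 1 = B):
# 1 - SS_residual / SS_total, the proportion of variance explained by phase.
y = np.concatenate([baseline, treatment])
phase = np.concatenate([np.zeros(len(baseline)), np.ones(len(treatment))])
y_hat = np.where(phase == 1, treatment.mean(), baseline.mean())   # OLS fit with a single dummy
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

print(f"PND = {pnd:.0f}%, SMD = {smd:.2f}, R^2 = {r2:.2f}")
```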
UnGraph Validity and Reliability
Research Question 4 involved conducting a reliability analysis on the extraction of digitized data from the original articles using UnGraph. Furthermore, the validity of the process was assessed by obtaining raw data from the original authors; all study authors were contacted twice to obtain original data. Note that other researchers are interested in the general idea of recreating raw data. The reliability and validity of the UnGraph procedure were examined at three levels: (1) the correlation between the original "raw" data and UnGraph's extracted data, (2) the agreement between the researcher of this work and a second coder, and (3) the correspondence between the authors' reported means/percentages/ranges and UnGraph's extracted data.
This is not the first attempt to examine the reliability of the UnGraph procedure. Shadish and colleagues (2009) examined 91 studies of instructional or behavioral interventions with a variety of student populations. Three undergraduate students extracted the data from around 30 graphs each, and inter-rater reliability was calculated using the percent agreement between the three raters. Each person was paired with another, so that if someone was the original data extractor, the other two people would test the scanned data against the original coder. In that study, there was agreement on 92.31% of the graphs (n = 84 of 91) regarding the number of data points to extract. The number of data points to extract was computed as the total number of scanned data points agreed upon (counting each datum as one point) divided by the total data points. It is also noteworthy that there was a correlation of r = .99 (p < .001) between coders in terms of the number of data points to extract. The correlation between the extracted values for the original coder and the two other coders averaged 0.96 (median = .99) across all studies. The reliability between the phases (baseline to treatment) was also tested: the average difference between phases for the two coders had a correlation of r = .95. To explore the validity of the data, means and standard deviations from the original data were compared to baseline means from the extracted data. Because validity was reported on more than one phase per study, forty-four studies yielded 152 data points. Five different methods of testing validity were employed, looking at mean percentages, mean time, mean number correct, mean, and percentage correct from the different types of studies. Correlations of the average extracted data, over all phases and then for each of the five kinds of data, ranged from .97 to .99. Any issues concerning reading data and mistakes made by the coders in Shadish and colleagues' (2009) article were presented as human error.
Although this prior work supports the use of the UnGraph procedure, it seems prudent to independently check the reliability and validity of the approach. This study recorded the percentage of data agreement (the total number of data points that agree between the extracted data and the original data) and Pearson product-moment correlations between the two coders. One coder was the primary extractor, and the second coder was another graduate student in the same program of study. To assess the validity of the data, the descriptive statistics from the extracted data were compared to the original graphs; for example, means/percentages/ranges from the original data were compared to the extracted data. Of course, digitized results were also compared with available raw data. No prior comparison of raw data to extracted data was found in a search of ERIC, Academic Search Complete, or Social Sciences Index; therefore, this is the first sensitivity analysis comparing raw data and extracted data for the UnGraph software.
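For concreteness, the reliability checks just described amount to calculations like the following minimal sketch, shown here with made-up values for two coders (not the study's actual data).

```python
import numpy as np
from scipy import stats

# Hypothetical UnGraph extractions of the same graph by two coders.
coder1 = np.array([72.0, 68.5, 21.0, 18.5, 70.0, 66.0, 15.0, 12.5])
coder2 = np.array([71.5, 68.0, 21.5, 19.0, 70.5, 65.0, 15.5, 12.0])

# Agreement on the number of data points extracted from the graph.
points_agree = len(coder1) == len(coder2)

# Pearson product-moment correlation between the two coders' extracted values.
r, p = stats.pearsonr(coder1, coder2)
print(f"Same number of points: {points_agree}; r = {r:.3f} (p = {p:.4f})")
```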
Chapter Summary
Several articles were used to perform sensitivity analyses that focused on comparing and contrasting results from HLM and visual analysis, for the purpose of comparing the original authors' work to WWC visual analysis and then to quantification. Data were collected using UnGraph, a digitizing software package. SPSS was used to create interaction terms and any subject characteristic data (Level-2 data). The PND statistic, IRD, SMD, and R2 were checked for similarities to the original studies. The PND statistic was analyzed by assessing the percentage of data that did not overlap: choosing the most extreme value in the baseline and calculating the amount of data in the treatment phase that did not overlap with it (Morgan & Morgan, 2009). The IRD was the difference between the two improvement rates for the phases (baseline and treatment) (Parker et al., 2009). The SMD was calculated by taking the difference between the treatment mean (MT) and the baseline mean (MB) and dividing it by the baseline standard deviation (Busk & Serlin, 1992). R2 was also calculated to express an effect as the proportion of the variation explained by phase differences using ordinary least squares (OLS) regression, where R2 equals one minus the sum of squared differences between the actual and predicted Y values divided by the sum of squared differences between the actual Y values and their mean. The interpretation of R2 was lightly described in Chapter Two and is reported in Chapters Four and Five in an effort to introduce the use of R2 in SCDs; however, because there are several different interpretations of R2, it is not widely used or accepted as the best method for effect size estimation in this field. Lastly, the reliability and validity of UnGraph were checked via inter-rater reliabilities between two coders. In addition, the extracted data were compared to the original authors' reported descriptive statistics as a further sensitivity check. When available, raw data from the original authors were compared to the extracted data to verify the sensitivity of UnGraph.
Chapter Four: Results
The results from the sensitivity analyses are summarized in this chapter. Recall the research questions for this dissertation:
Primary Question
1) Does quantification and subsequent statistical analysis of selected ABAB graphs produce conclusions similar to visual analyses?
Secondary Questions
2) Do any of the subject characteristics (Level-2 data) explain between-subjects variation? If so, can this information be used to yield new findings?
3) Do the PND and IRD (nonparametric indices), the SMD (a parametric index), and R2 yield similar effect sizes as the original studies?
4) Is UnGraph a valid and reliable tool for digitizing ABAB graphs?
This dissertation focused on comparing two distinct methodologies used for analyzing single-case designs. For this reason, this chapter is organized into two major sub-sections. The first section offers detailed descriptions of a pilot of the statistical procedures used here. The pilot is modeled on the specific approach described by Nagler and colleagues (2008). Detailed descriptions of each step needed to address parts of Questions 1 and 2 are offered while working with the study presented in the Nagler and colleagues (2008) publication. The pilot endeavored to replicate the Nagler and colleagues work, starting with the data digitization process. This was done to ensure the procedures could be independently applied. Once the process was replicated (allowing for rounding error and small differences that can be attributed to the digitization process), it was applied again with eight more studies. The results of these analyses are summarized in Section 2 of the chapter. The primary purpose of the second section of this chapter is to provide the overall results pertaining to each research question.
To clarify, the intent is first to explain the statistical approaches used and then to discuss the analyses and results used to address the four questions this dissertation was designed to answer.
Section 1: The Pilot
The pilot data for this study came from the article by Lambert, Cartledge, Heward, and Lo (2006) called Effects of Response Cards on Disruptive Behavior and Academic Responding during Math Lessons by Fourth-Grade Urban Students. The original authors were interested in the impact that response cards (white boards used by the students to write answers based on questions from the teacher) had on disruptive behavior during a lesson, compared to baseline teaching, which entailed traditional hand-raising and waiting for the teacher to call on the student. The dependent variable was the proportion of disruptive behaviors, relative to on-task behaviors, during a five-second interval. That is, each student's (n = 9) behavior was recorded during a five-second interval, where disruptive behavior was recorded as having occurred or not. In total, behavior was recorded across ten intervals. Since the outcome variable focused on the proportion of observed disruptive behavior to all behavior (i.e., the percentage of times a student was disruptive), a binomial distribution of the data was assumed. Disruptive behavior was considered to have occurred during an observation interval if the student displayed any of the following behaviors: talking, provoking others, looking around the classroom, misusing the white boards, writing notes to friends, drawing pictures on the white boards, sucking on fingers, or leaving their seat. The authors indicated the treatment was effective after using visual analysis. The phase changes that support this assertion can be seen in Appendix B, where the graphs for each article can be found. Note that these data were produced from the digitization process using UnGraph; this point is revisited when discussing Research Question 4. For now, the focus is on the application of HLM techniques to statistically analyze data, whereas the original authors relied primarily on visual procedures.
The Full Non-Linear Model states that the log odds of disruptive behavior (or the expected number of intervals where disruptive behavior was observed, out of 10 possible intervals) was the sum of eight parts: (1) the log odds of disruptive behavior occurring at the intercept (the final baseline session); (2) a term accounting for the average rate of change in the log odds after implementation of the intervention (TRT), which is the average impact between baselines and treatments; (3) a term accounting for the rate of change in log odds over time (SESS1); and (4) a term accounting for the rate of change in log odds from the first pair of phases (A1, B1) to the second pair of phases (A2, B2), included to examine differential treatment impacts. This term is referred to as "order" (and designated as ORDER in tables). Several 2-way interactions were also tested in the model. These included interactions between (1) session slope and treatment, (2) treatment and order, and (3) session slope and order. The session slope and treatment interaction tested whether the session slope differed between baseline and treatment (recall that a session is the interval, or time variable, in the study and treatment is coded 0 = baseline and 1 = treatment) (Nagler et al., 2008).
The interaction between treatment and order tested whether the intervention effect observed in the first AB pair was significantly different from that in the second AB pair. This is because the first pair (first baseline then treatment) and the second pair (second baseline and treatment) will sometimes be affected by repeated exposure to the treatment. The order variable can be used to determine whether the level or amount of target behavior at the start of the study differs from that at the end. Lastly, the interaction between session and order tested whether the time slope differed between the first AB pair and the second AB pair. The 3-way interaction tested the combination of the session slope, treatment, and order.
In order to build a parsimonious model, all non-significant variables (p > .05) were removed; the treatment variable (TRT) was significant and therefore retained. The next model tested was the Simple Non-Linear Model without Slopes. Its Level-1 equation states that the log odds of disruptive behavior is the sum of two parts: (1) the log odds at the intercept (in this case, the baseline phase overall, since trend was found to be flat), and (2) a term accounting for the rate of change in log odds given a phase change (baseline to treatment). To clarify, statistical analyses indicated that there was no baseline trend, meaning the rate of disruptive behavior was consistent prior to the onset of treatment, so there was no need to include a term that accounted for baseline changes. However, the presence of disruptive behavior could be expected to change after treatment onset, so a term was included that captured the log odds of the performance change during treatment. The Level-2 equations model the intercepts and treatment impacts (to clarify, the phase change is the treatment impact from an A phase to a B phase): the average log odds during baseline for all subjects, plus an error term to allow each student to vary from the grand mean, and the average rate of change in log odds as a subject switched from baseline to the treatment phase, plus an error term that also allows each student to vary from the grand mean. Table 5 describes the final estimation of fixed effects; the expected probabilities of behavior in the baseline and treatment phases were calculated from these results.
Table 5
Simple Non-Linear Model without Slopes for the Lambert and Colleagues' (2006) Study
Fixed Effect | Coefficient | Standard error | t ratio | Approx. d.f. | p value
For INTRCPT1, P0: INTRCPT2, β00 | 0.82 | .19 | 4.15 | 8 | .004
For TRT slope, P1: INTRCPT2, β10 | -2.53 | .16 | -15.77 | 8 | .000
When treatment = 0 (i.e., the baseline phase), the overall average log odds of exhibiting a disruptive behavior for all students was 0.82 (β00). Additional formulas for determining the expected probability (in baseline and when switching from baseline to treatment) are available, and the results can be derived from the information presented in Table 5. The formula for the expected probability of observing disruptive behavior at baseline is the exponential of the baseline regression coefficient divided by one plus that value. So, the expected probability of observing a disruptive behavior during the baseline phase was 0.69 [exp(0.82) = 2.27; 2.27 / 3.27 = 0.69]. In other words, there was a high likelihood that children would be observed engaging in a disruptive behavior before being treated.
This is a critical point because a basic tenet of single-case designs is that there must be evidence of a problem behavior at baseline in order to have an opportunity to demonstrate a treatment impact. Visual analyses clearly demonstrated the presence of a problem, so at this point the process of digitizing the data and subsequent modeling has established a statistical means for describing baseline performance. Now one can examine whether there was a treatment effect; that is, whether the use of response boards caused a reduction in baseline levels of disruptive behavior. The average rate of change in log odds as a student switched from baseline (TRT = 0) to treatment (TRT = 1) was -2.53 (see β10 in Table 5). This represents a treatment effect, and it was significant, as the associated p value was less than .05. Applying the same formula to the treatment phase (summing the regression coefficients, exponentiating, and dividing by one plus that value), the calculations show: exp(0.82 - 2.53) = exp(-1.71) = 0.18; 0.18 / 1.18 = 0.15. Therefore, the expected probability of observing a disruptive behavior during the treatment phase was 0.15. In other words, the probability of observing a disruptive behavior before treatment was 0.69, and after treatment this dropped to 0.15. Visual comparisons were made between the different phases, and in this case a reduction of the target behavior was evident. At this point, consider that the visual data presented in the published report yield a similar conclusion to what is described here: there is a drop in the probability of the presence of disruptive behavior after onset of treatment. If working with designs that allow for reasonable levels of confidence that drops in problematic behavior are a function of the treatment, these analyses allow researchers to (a) statistically describe changes in performance as a function of treatment, and (b) apply statistical perspectives on the probability that the changes observed in these data reflect a small sample. To add to this point, the analyses are able to handle a number of problems associated with OLS regression, including autocorrelation, problematic distributional assumptions (specifically through use of the binomial distribution in lieu of a normal distribution), and different measurement times. In sum, this step corresponds with Research Question 1, which examines the degree of correspondence (i.e., percent agreement) between visual analyses and statistical modeling.
Moving on to Research Question 2, while still working with the pilot study, Table 6 displays the final estimation of variance components (which accounts for performance across all ABAB series in the article) for the Simple Non-Linear Model without Slopes.
Table 6
Final Estimation of Variance Components for the Lambert and Colleagues (2006) Study
Random Effect | Standard Deviation | Variance Component | df | Chi-square | p value
INTRCPT1, R0 | .53 | .28 | 8 | 43.81 | .000
TRT slope, R1 | .18 | .03 | 8 | 11.51 | .174
Level-1, E | 1.49 | 2.22 | | |
The between-subjects variance on the intercepts was estimated to be .28, which corresponds to a standard deviation of .53. The associated p value is the result of a null hypothesis test, where the null condition states that baseline averages for all subjects were similar in terms of their frequency of disruptive behavior. This hypothesis was rejected.
In other words, the variance was too large to be attributed to sampling error alone, meaning some students were better behaved than others at baseline. The significant p value (for the intercept) suggests there should be further testing for variation in the baseline phase. Since the p value is not significant for the treatment phase, no Level-2 variables will help explain that variation. Keep in mind that if no statistically significant between-subjects variation were found for either baseline or treatment, one would stop here. In this case, since there may be some Level-2 variable(s) that help explain some of the remaining variance in the baseline, another model is introduced.
In order to explore the possibility that certain subject characteristic (Level-2) variables might account for some of the between-subject variation, an exploratory analysis of the potential contributions of Level-2 variables was added to the intercept. Table 7 displays the potential Level-2 predictors and their associated t values.
Table 7
Possible Level-2 Predictors from the Exploratory Analysis for the Lambert and Colleagues (2006) Study (Level-1 coefficient: INTRCPT, β0)
Potential Level-2 Predictor | Coefficient | Standard Error | t value
CLASSB | -.79 | .21 | -3.60
WHITE | .65 | .53 | 1.21
AGE | -.49 | .32 | -1.55
PREGRADE | -.14 | .19 | -.70
Table 7 indicates that Class B (the indicator that a student was in Class B instead of Class A) might help to explain some of the between-subject variance in the intercepts, since it had the largest t value. This means that class can be used to explain variation in the baseline (since the intercept variance was significant in the final estimation of variance components). Therefore, the class variable was added to the last model used in the analysis.
The final model, the Simple Non-Linear Model with Class B tested on the intercept (i.e., testing the Level-2 variable only where variation was significant in the previous model), still reads the same as the Simple Non-Linear Model without Slopes at Level-1, but the Level-2 equations now model the baselines and treatment impacts. Recall that the variance components output suggested there was a significant amount of variation unaccounted for in the baseline (there could be variation left in either the baseline or the treatment, but because the variation around the treatment variable was non-significant, Level-2 variables should be used to examine baseline variation). From the prior test (see Table 6), variation around the intercept (i.e., baseline) was significant; therefore the class variable can be examined. Now, the probability of behavior at baseline is interpreted as the average log odds for all students (β00), plus a term to allow students in Class B to have a different baseline level (β01), plus an error term to allow each student to vary from this grand mean (R0). In other words, a statistically significant β00 would indicate that the baselines differ for students, and a statistically significant β01 would signify that the classroom the child came from is a variable that should be included in the final model. P1 is the average rate of change in log odds as a subject switches from baseline (TRT = 0) to the treatment phase (TRT = 1) for all students (β10), plus an error term to allow each student to vary from this grand mean (R1). Table 8 displays the final model, the Simple Non-Linear Model without Slopes with Class B on the intercept. From this, the expected probability of observing disruptive behavior at baseline was calculated in a similar fashion as for the first model.
Table 8

Simple Non-Linear Model without Slopes with CLASSB on Intercept for the Lambert and Colleagues' (2006) Study

Fixed Effect            Coefficient   Standard Error   T ratio   Approx. d.f.   p value
INTRCPT1, P0
  INTRCPT2, β00             1.33            .19            7.10        7          .000
  CLASSB, β01              -0.91            .22           -4.01        7          .006
TRT Slope, P1
  INTRCPT2, β10            -2.53            .16          -16.14        8          .000

The final working model was the Simple Non-Linear Model with the variable class included on the intercept. Seen another way, Table 9 summarizes the final model, the outcome variable, the coefficients (with a regression equation included), and the variance components (in standard deviations). The likelihood function was used to determine whether this model is more effective at explaining the data than other models (i.e., this model works best in comparison to one that did not include the class variable). Although the estimations for the parameters generate slightly different numbers than what was originally reported by Nagler and colleagues (2008), the results for the pilot match the manual. Recall that the working and non-working models used for each article in this dissertation can be seen in Appendix B.

Table 9

Final Model: Lambert and Colleagues (2006)

Final Model: Overdispersion; Simplified L-1 model with CLASSB on INTCPT
Outcome Variable: Disruptive Behavior
Variables:
  P0 = β00 + β01*(CLASSB) + R0   (INTCPT)
  P1 = β10 + R1                  (TRT)
Coefficient Estimates: β00 = 1.33**; β01 = -0.91**; β10 = -2.53**
Variance Components (SD): Sigma² = 2.24; R0 = 0.29**; R1 = 0.13
Likelihood Function (MLF): -485.52
Log(Y) = 1.33 - .91(CLASSB) - 2.53(TRT)
* p < .05, ** p < .01, Y = dependent variable

Although the odds ratio will be the same for Class A and Class B since no interaction was present, the probabilities of observing disruptive behavior are still calculated to yield descriptive information for each class separately. The concept here needs clarification because examining the class variable in this instance entails including it as a main effect, not a simple main effect. Stated another way, a constant shift exists between the classrooms because there was no interaction in the model. Basically, for one class there is an increased probability of observing disruptive behavior compared to the other even though the effectiveness of the treatment was the same for both classes. The manual and this dissertation simply wish to convey that the authors were descriptively telling readers that students from one classroom were behaving differently than those from the other, and the probability calculation procedures described here provide a new way of looking at this performance differential. From the final model, the probabilities of observing disruptive behavior were computed to describe performance in both baseline and treatment phases. When treatment = 0 (i.e., the baseline phase), the overall average log odds of a student exhibiting disruptive behavior in Class A (CLASSB = 0) was 1.33 (β00); [exp(1.33) = 3.78; 3.78 / 4.78 = 0.79]. That is, the expected probability of observing a disruptive behavior during the baseline phase for a student in Class A (when treatment = 0) was 0.79. For a student in Class B, the overall average log odds of exhibiting a disruptive behavior was 0.42 (1.33 - .91 = 0.42); [exp(0.42) = 1.52; 1.52 / 2.52 = 0.60]. So, the expected probability of observing a disruptive behavior during the baseline phase for a student in Class B was 0.60, a drop that is associated with a statistically significant p value.
It appears that students in Class B tended to be better behaved compared to those in Class A at baseline. This new information could be of interest to the original researchers especially since there are separate teachers for each classroom. 116 As seen in Table 9, the significance test that compares the differences between the baseline and treatment phases demonstrates the intervention’s treatment effectiveness. The average rate of change in log odds as a student switches from baseline to treatment was -2.53 (β10). This value was significant as the p value for β10 was less than .05. Now, for Class A the expected probability of observing a disruptive behavior during the treatment phase was 0.23, [exp(1.33 - 2.53) = exp(-1.20) = 0.30; 0.30 / 1.30 = 0.23]. For Class B, the calculation must consider the coefficients for Class B and the treatment variable, [exp(1.33 - 0.91 - 2.53) = exp(-2.11) = 0.12; 0.12 / 1.12 = 0.11]. The expected probability of observing a disruptive behavior during the treatment phase was 0.11. Combined, the modeling results reveal that students in Class B started with lower rates of problematic behavior and also showed better performance after the treatment. Estimates of the variance components for this model, seen in Table 10, indicated that there may still be between-subject variation in estimates of the intercept. This is demonstrated by the significant p value for the intercept (.009). Differences in treatment effect were not found (p = .19). 117 Table 10 Final Estimation of Variance Components for the Lambert, Cartledge and Colleagues’ (2006) Study Random Effect Standard Deviation Variance Component df Chi-square p value R0 .29 .08 7 18.85 .009 TRT Slope, R1 .13 .02 8 11.19 .191 1.49 2.24 INTRCPT1, Level-1, E Looking back at Table 8, the treatment was effective (p < .01) and the subject characteristic Class B (which classroom the students came from) was also significant (p = .006) when considering performance at the intercept. Class A had a higher level of expected probability of observing a disruptive behavior during the baseline phase than a student who came from Class B. Therefore, the Simple Non-Linear Model without Slopes with Class B on intercept with overdispersion was the final model (Table 9). One can go back after the final model is chosen and test overdisperson by comparing the likelihood functions (i.e., one with overdispersion and another without) but if sigma squared is over 1.0 overdisperion must be accounted for in the model. Once the researcher tests for overdispersion, the model will either account for it or determine to be unnecessary when estimating standard errors. 118 All effect size estimates collected by the original authors and the calculated comparisons can be seen in Tables 21 and 22 in Appendix A. Table 23, also seen in Appendix A, lists sample details, independent variables, dependent variables, and original effect sizes (reported by the authors) for each article included in the sensitivity analysis. The PND was reported in the original article as an effect size criterion, making comparisons of extracted data easy to compare to this study, which also calculated the PND. The results of this pilot work concluded that the treatment intervention was effective. A Level-2 variable, class membership, explained some of the variation in performance across students. These findings were replicated after digitizing the graphed data then analyzing these data using HLM techniques. 
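As a concrete illustration of that replication step, the four class-by-phase probabilities reported above can be regenerated directly from the Table 9 coefficients. The sketch below is illustrative only; the coefficient values are transcribed from Table 9 and the helper function is hypothetical.

```python
import math

# Coefficients transcribed from Table 9 (Lambert and colleagues, 2006, final model)
b00, b01, b10 = 1.33, -0.91, -2.53  # intercept, CLASSB shift, treatment effect

def expected_prob(classb, trt):
    """Expected probability of disruptive behavior for a given class and phase."""
    log_odds = b00 + b01 * classb + b10 * trt
    return math.exp(log_odds) / (1 + math.exp(log_odds))

for classb in (0, 1):
    for trt in (0, 1):
        label = "Class B" if classb else "Class A"
        phase = "treatment" if trt else "baseline"
        print(f"{label}, {phase}: {expected_prob(classb, trt):.2f}")
# Class A, baseline: 0.79   Class A, treatment: 0.23
# Class B, baseline: 0.60   Class B, treatment: 0.11
```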
As was hoped, the results also matched (i.e., allowing for rounding error) those from Nagler and colleagues (2008) work. Remember that the Lambert and colleagues’ (2006) study had nine students. SMD and R2 estimates could not be compared to the originally reported results as the study authors did not calculate these statistics. PND effect size estimates were reported for five students (four were not reported). In terms of the five that were reported, three were matched by this dissertation work. Two suggested effectiveness in the original study (92.80% and 94.10%) and questionable in the calculated PND statistic (77.78% and 61.36%). Seven out of nine SMD calculations were indicative of effective treatment interventions ranging from 2.57 to 5.80. The last two student’s (B4 and B5) SMD were calculated at 1.05 and 1.27 suggesting the treatment intervention as not effective for these 119 students. The R2 (proportion of the variation explained by the phase differences) was 59%. Replicating the work from the manual was necessary to have confidence in working with this new methodology. The pilot results also provided an educative function as these procedures are innovative. That is, the detailed explanations offered above provide a framework for a quick explanation for the analyses of the eight remaining studies. All necessary information pertaining to significant Level-2 predictors (i.e., p values and calculated probabilities for behaviors in baseline and treatment phases) will be provided in each article summary. Eight more ABAB articles were collected for the current study. Before the individual articles are described, Section 2 provides the results for Research Question One. Section 2: Sensitivity Analysis Examining Treatment Effectiveness Results for research question 1. To address question one, comparisons were made between the results of the original authors, independent visual analyses based on WWC procedures (discussed in Chapter 2), and HLM. The treatment variable tested in each HLM analysis was the difference (and associated p value) in observed performance between mean baseline and mean treatment phases. Data were analyzed via HLM techniques using two software programs (HLM 6 and STATA 11.0). Data were drawn from nine published articles (the eight that were collected for this work and the study described in the above pilot section). Two of the nine articles had small sample sizes (n ≤ 2) and could not be analyzed using HLM 120 software because this yielded inadequate degrees of freedom and results would not converge. STATA was therefore used since this package included more estimation procedures that could be applied. More specifically, STATA could use a Penalized quasi-likelihood for binary outcomes (also called Laplace’s estimation approximation) and provide log-likelihoods with respect to the parameters using Fisher scoring (Manolov et al., 2010; Raudenbush & Bryk, 2002). Note that details associated with use of these procedures are described in Chapter Three. A key finding deals with the level of agreement between visual and statistical analyses used for examining the presence of a treatment effect, as well as statements made by original study authors. Of the 14 dependent variables examined (remember that some articles had more than one dependent variable) there was a 93% (n = 13 of 14) agreement rate between the original authors and HLM modeling results. 
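The agreement rates reported in this subsection are simple percent-agreement figures: the share of dependent variables on which two approaches reach the same effectiveness decision. A minimal sketch follows; the decisions shown are made up for illustration and are not the actual study data.

```python
def percent_agreement(calls_a, calls_b):
    """Share of cases on which two sets of effectiveness decisions match."""
    matches = sum(a == b for a, b in zip(calls_a, calls_b))
    return 100 * matches / len(calls_a)

authors = ["Effective", "Effective", "Not Effective", "Effective"]
hlm     = ["Effective", "Effective", "Effective", "Effective"]
print(percent_agreement(authors, hlm))  # 75.0 for this toy example
```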
Mavropoulou and colleagues (2011) considered the treatment to have questionable effectiveness on one dependent variable (the independent visual analyses reached this conclusion as well and determined it as not effective); yet HLM procedures yielded a significant p value. There was less agreement between HLM results and independent visual analyses. In all, there was only a 62.30% (n = 9 of 14) agreement rate between the two procedures in terms of whether there was a treatment effect. There was a 71.40% (n = 10 of 14) agreement rate between visual analyses and reports of the original study authors. Table 11, presented below, lists details specific to Research Question 1. Before the Table is presented, the results of visual analyses completed for each article are described first. The intervention described in article one in Table 11 was determined to 121 be effective because there was little overlapping data across baseline and treatment phases. Trends suggested the target behavior improved during the implementation of treatment and there was a change in the means across phases. The intervention in article two was considered to be ineffective because while one person (Keith) seemed to respond to the treatment implementation, his performance showed high variability and there was overlapping data across study phases. For the second person, Wally, the same concerns existed; there was almost complete overlapping data for the first treatment implementation to the second baseline phase. The intervention described in article three was considered to be effective since six of the nine people yielded little overlapping data across phases and mean performance changes favored the intervention. Also there was little variability in performance in the baselines, and the intervention effect was immediately apparent (recall, one would want to see a gap between the data between the baseline and treatment phases indicating a change in behavior). There were two students and three outcome measures in article four, and their intervention was considered to be ineffective for all of them. The three dependent variables were: (1) on-task behavior, (2) required number of teacher prompts, and (3) task performance (i.e., executing the correct steps/actions in each task such as placing a domino card in the correct place). The on-task behavior for the first student in this article, Vaggelis, was highly variable, the analyst could not document an immediate effect, and there were many overlapping data points. For the second person, Yiannis, the trend and level improved after onset of treatment, but there was still overlapping data when looking at on-task behaviors. For teacher prompting, there was considerable 122 overlapping data present for both students. And lastly, for the performance variable the data had high variability and overlap. The visual analyses conducted while examining results from article five determined the treatment was effective since there was overall low variability in performance, the treatment effect immediately yielded an improvement, and almost no problematic behaviors were seen in treatment phases. Since this outcome measure was the percentage of disruptive intervals, one would expect a favorable treatment to reduce during the treatment phase, which it did for almost all students. The intervention examined in article six was found to be effective at improving the on-task behavior and task completion. 
There was limited overlap of performance across phases, and levels and trends suggested the baseline and treatment phases were consistent to the intended direction of the slope (i.e., trend). The intervention did not show an impact on work accuracy as there was high variability in performance within each phase, associated overlap in performance across phases, and the analyst could not document an immediate treatment effect (i.e., last two data points in baselines in comparison to the first few data points in treatment phases were similar). The intervention assessed in article seven was considered to be effective when measuring academic achievement and disruptive behavior. For academic achievement, there was low variability and performance trends and levels were consistent with a treatment effect (i.e., the trend went in the intended direction and the average performance was higher during treatment phases). Only one person out of eight showed any overlapping data and this was considered to be minimal. 123 The intervention in article eight was found to be effective; data showed an immediate effect, low variability in performance, and no overlapping data. Finally, the intervention in article nine was considered to not be effective because there were too many overlapping data points between phases and changes in performance levels across phases was minimal. Table 11 provides the key information needed to address the first research question addressed by this dissertation, which considers if visual and statistical analyses using HLM procedures (described in detail in Section 1 of the chapter) yield similar conclusions. The table lists each study and the dependent variable(s) examined in each article (the dependent variables are listed after each articles citation, for example the Amato-Zech and colleagues (2006) article has the dependent variable percent of intervals of on-task behavior). The table also displays overall claims made by the original study authors pertaining to whether the treatment was effective and if overall results from visual and HLM analyses agree. The table also summarizes the visual analysis results described above. In terms of HLM results, an intervention was considered to be effective if the p value associated with mean change in performance across study phases was less than .05. Recall, all dependent variables are percentages or percent of intervals; no article used count data for the purposes of this dissertation. To provide a quick interpretive guide, the original study authors, the independent visual analyses and HLM results all agree that the intervention in the first article (AmatoZech, Hoff, & Doepke, 2006) was effective. There was complete agreement across methods. This was not the case for the second article (Cole & Levinson, 2002), and so 124 on. A discussion of the somewhat low agreement rate between independent visual analyses, and author claims and HLM results is provided in Chapter Five. Table 11 Final Results of Sensitivity Analyses Pertaining to Statement of a Treatment Impact Study/ Dependent Variable(s) Author(s) Published Statement about the Treatment Independent Visual Analysis* HLM Results (p values)** Effective Effective Effective Not Effective Effective Effective Effective (1) Amato-Zech, Hoff & Doepke (2006). Increasing On-Task Behavior in the Classroom: Extension of Self-Monitoring Strategies. Percent of intervals of on-task behavior. (2) Cole & Levinson (2002). 
Choices on the Challenging Effects Effective of Within-Activity Behavior of Children with Severe Developmental Disabilities. Percentage of task steps with challenging behavior. (3) Lambert, Cartledge, Heward, & Lo (2006). Effects of Response Cards on Disruptive Behavior Effective and Academic Responding 125 Table 11 (Continued) Study/ Dependent Variable(s) Author(s) Independent HLM Published Visual Results (p Statement Analysis* values)** about the Treatment ________________________________________________________________________ during Math Lessons by Fourth-Grade Urban Students. Number of intervals of disruptive behavior. (4) Mavropoulou, Papadopoulou, Effective & Kakana (2011). Effects of Task Not Effective Organization on the Independent Not Effective Play of Students with Autism Spectrum Disorders. Not Effective Effective Not Effective Not Effective Not Effective Effective Effective Effective Percentage of intervals with on-task behavior, prompting behavior, and performance behavior. (5) Murphy, Theodore, Alric-Edwards, & Hughes (2007). Interdependent Group Contingency and Mystery Motivators to Reduce Preschool Disruptive Behavior. Percentage of disruptive intervals. Effective 126 Table 11 (Continued) Study/ Dependent Variable(s) Author(s) Independent HLM Published Visual Results (p Statement Analysis* values)** about the Treatment ________________________________________________________________________ (6) Ramsey, Jolivette, Puckett-Patterson, Effective & Kennedy (2010). Using Choice to Increase Effective Time On-TaskCompletion, Effective and Accuracy for Students with Emotional/ Behavior Disorders in a Residential Facility. Effective Effective Effective Effective Not Effective Effective Effective Effective Effective Effective Effective Effective Kehle, & Jenson (2001). Effective Randomization of Group Behavior Contingencies and Reinforcers to Reduce Classroom Disruptive. Effective Effective Percentage of time on-task, task-completion, and accuracy. (7) Restori, Gresham, Chang, Lee, & LaijaRodriquez (2007). Functional AssessmentBased Interventions for Children At-Risk for Emotional and Behavioral Disorders. Percent of intervals of academic achievement and disruptive behavior. (8) Theodore, Bray, Percentage of disruptive Interval 127 Table 11 (Continued) Study/ Dependent Variable(s) Author(s) Published Statement about the Treatment Independent Visual Analysis* HLM Results (p values)** Not Effective Effective (9) Williamson, Campbell- Whatley, & Lo (2009). Effective Using a Random Dependent Group Contingency to Increase On-task Behaviors of High School Students with High Incidence Disabilities. Percent of intervals of on-task behavior. *Only one visual analysis expert was used in the current study. ** For a treatment to be deemed effective, p < .05. Results for Research Questions 2 and 3 To address the second research question, a summary of the statistical models used for analyzing data from the nine articles is offered here. The model choices are all based on the procedures described in the above pilot section. Each table presented in this subsection provides the final models used to analyze data from each article, the likelihood functions,5 and estimates of variance components. Finally, each table lists the dependent variables from each article (i.e., on-task behavior, disruptive behavior, the percentage of 5 The model with the likelihood function closest to zero is the best model suited for the data. 
128 task analysis steps or how many procedures a student successfully completed, accuracy of academic tasks during academic courses, performance or implementing the correct number of tasks/procedures). Although specific details regarding measurement schemes vary from one article to another, on-task behavior can generally be described as a measure of the percentage of time a student was attending to a teacher or actively engaged in academic work. Disruptive behavior refers to the frequency a student was engaged in unsanctioned, problematic behavior. Recall that a binomial distribution was used for each of these variables. All Level-2 data available in the original articles were tested to explain variation in the dependent variable, except when n < 3. Statistically significant Level-2 variables were found in three studies. In sum, four Level-2 variables helped explain variance in a given model (p < .05). These variables are CLASSB (which class a child came from, as described in the above pilot section) from the Lambert and colleagues’ (2006) article, ONTRACK (if a child was on-track to attend the next grade level) from the Murphy and colleagues’ article (2007), and twice for the intervention variable (type of treatment plan given to a child) on two different outcome measures (disruptive behavior and academic achievement) from the Restori and colleagues’ (2007) article. A complete list of the Level-2 variables used in the analyses presented here can be seen in Table 26 in Appendix A. Recall that research question three pertains to effect sizes. A detailed description of the articles and original statements concerning treatment effectiveness (Research 129 Question One) compared to the results of the sensitivity analyses is also revisited in light of various effect size calculations. Cole and Levinson (2002) studied the impact of offering children a chance to make their own choices on reducing disruptive behavior. Behaviors were considered problematic if the students demonstrated aggression, tantrums, and noncompliance (i.e., throwing objects, walking away from desk, or screaming). Two boys who attended a school that served students with emotional/behavioral disorders were included in this study. Each instructional routine lasted for thirty minutes and occurred for twenty-nine days for Keith and thirty-two days for Wally. Each student had a different choice option tailored to the specific need of the child’s IEP (e.g., vocational or daily living skills such as washing hands). To present a choice, rather than using a phrase like: put the soap in your hands, interventionists asked a question like: do you want to use the bar soap or the pump soap? (Cole & Levinson, 2002). The rationale behind the study was that choice making might influence a child’s willingness to comply with directions. The authors concluded that choice making was an effective intervention (Cole & Levinson, 2002). Statistical analyses of digitized data (using STATA) indicated the intervention was effective (p < .05) and no Level-2 predictors were found in the multi-level analysis to be significant (p > .05). The final model presented is therefore a simplified Level-1 model, with no Level-2 predictors. The final model’s results are not presented here because it did not converge. Some details can be seen in Appendix C. Even though some parameters are available, the model failed to converge so they should be cautiously interpreted due to small sample size and biased estimations. 
As noted in Table 11, the independent visual analyses determined that the intervention was not effective. The original authors reported each phase's percent of task-analysis steps with challenging behavior. The PND effect sizes suggest the treatment was of questionable effectiveness for Keith (51.66%) and not effective for Wally (28.60%). Overall, the statistical test determined the treatment to be effective, p < .05, but again, without convergence this p value should be cautiously interpreted. The SMD suggested an ineffective treatment intervention for Keith (1.42) as well as for Wally (.96). The R2 (proportion of the variation explained by the phase differences) was 25%.

Murphy, Theodore, Aloiso, Alric-Edwards and Hughes (2007) used intermittent mystery motivators in an interdependent group contingency to reduce disruptive behavior in the classroom. Disruptive behaviors were defined as touching other students, leaving a designated learning area (i.e., a rug), and standing or lying down on the rug. Fifteen-second intervals were recorded for fifteen minutes across eight days during baseline and treatment. This study included eight students. Classrooms were composed of children with a diverse range of behavioral skills. The intermittent mystery motivator treatment intervention was determined to be effective by the original authors, the independent analyses, and HLM procedures. However, HLM techniques provided more information concerning the behavior of the students. The Level-2 variable, whether a student was on-track to enter first grade (ONTRACK), was statistically significant, p < .01. This means that students in the study were performing differently in the baseline phase when comparing the two groups. See Table 12.

Table 12

Final Model: Murphy and Colleagues (2007)

Final Model: Overdispersion; Simple Non-Linear Model with ONTRACK Predictor on INTCPT
Outcome Variable: Disruptive Behavior
Variables:
  P0 = β00 + β01*(ONTRACK) + R0   (INTCPT)
  P1 = β10 + R1                   (TRT)
  P2 = β20 + R2                   (ORDER)
Coefficient Estimates: β00 = -1.79**; β01 = 2.06**; β10 = -1.93**; β20 = -1.12**
Variance Components (SD): Sigma² = 3.07; R0 = 0.43**; R1 = 0.47*; R2 = 0.27
Likelihood Function (MLF): -444.29
Log(Y) = -1.79 + 2.06(ONTRACK) - 1.93(TRT) - 1.12(ORDER)
* p < .05, ** p < .01, Y = dependent variable, disruptive behavior

As before, some useful descriptive analyses can be gleaned from the table. When treatment = 0 (i.e., the baseline phase) and order = 0 (i.e., A1B1) and on-track = 0, the overall average log odds of exhibiting a disruptive behavior for students on-track to graduate preschool in the baseline was -1.79 (β00), [exp(-1.79) = 0.17; 0.17 / 1.17 = 0.15]. In other words, the expected probability of observing a disruptive behavior during the baseline phase for an on-track student was 0.15. When treatment = 0 (i.e., the baseline phase) and order = 1 (i.e., the second AB pairing) and on-track = 0, the overall average log odds of exhibiting a disruptive behavior for these students was -2.91, [exp(-1.79 - 1.12) = exp(-2.91) = 0.05; 0.05 / 1.05 = 0.05]. In other words, the expected probability of observing a disruptive behavior during the baseline phase for an on-track student in the second AB pairing was 0.05, when treatment = 0 and order = 1. When treatment = 0 (i.e., the baseline phase) and order = 0 (i.e., A1B1) and on-track = 1, the overall average log odds of exhibiting a disruptive behavior for students not on-track to graduate preschool at baseline was 0.27 for the first AB pairing, [exp(-1.79 + 2.06) = exp(0.27) = 1.31; 1.31 / 2.31 = 0.57].
In other words, the expected probability of observing a disruptive behavior during the baseline phase for a student not on-track in the first AB pairing was 0.57. Lastly, when treatment = 0 (i.e., the baseline phase) and order = 1 (i.e., A2B2) and on-track = 1, the overall average log odds of exhibiting a disruptive behavior for students not on-track to graduate preschool at baseline in the second AB pairing was -0.85, [exp(-1.79 - 1.12 + 2.06) = exp(-0.85) = 0.43; 0.43 / 1.43 = 0.30]. In other words, the expected probability of observing a disruptive behavior during the baseline phase for a student not on-track in the second AB pairing was 0.30. The significance test that compares the differences between the baseline and treatment phases demonstrates the study's treatment effectiveness. The average rate of change in log odds as a student switches from baseline (TRT = 0) to treatment (TRT = 1) was -1.93 (β10). This phase effect was significant as the p value for β10 was less than .01, so the treatment variable was considered to be statistically significant. For the treatment phase, when treatment = 1 and order = 1 and on-track = 1, the overall average log odds of exhibiting a disruptive behavior was -2.78, [exp(-1.79 - 1.93 - 1.12 + 2.06) = exp(-2.78) = 0.06; 0.06 / 1.06 = 0.06]. In other words, the expected probability of observing a disruptive behavior during the treatment phase for a student not on-track in the second AB pairing was 0.06. When treatment = 1 and order = 1 and on-track = 0, the overall average log odds of exhibiting a disruptive behavior was -4.84, [exp(-1.79 - 1.93 - 1.12) = exp(-4.84) = 0.008; 0.008 / 1.008 = 0.008]. In other words, the expected probability of observing a disruptive behavior during the treatment phase for an on-track student in the second AB pairing was 0.008. When treatment = 1 and order = 0 and on-track = 1, the overall average log odds of exhibiting a disruptive behavior was -1.66, [exp(-1.79 - 1.93 + 2.06) = exp(-1.66) = 0.19; 0.19 / 1.19 = 0.16]. In other words, the expected probability of observing a disruptive behavior during the treatment phase for a student not on-track in the first AB pairing was 0.16. Lastly, when treatment = 1 and order = 0 and on-track = 0, the overall average log odds of exhibiting a disruptive behavior was -3.72, [exp(-1.79 - 1.93) = exp(-3.72) = 0.02; 0.02 / 1.02 = 0.02]. In other words, the expected probability of observing a disruptive behavior during the treatment phase for an on-track student in the first AB pairing was 0.02. Murphy and colleagues (2007) reported SMDs that ranged from .99 to 7.71. According to the SMDs calculated for this dissertation, which ranged from .61 to 2.88, the treatment was found to be effective for three out of eight students. The R2 (proportion of the variation explained by the average baseline to treatment differences) was 21%.

Mavropoulou, Papadopoulou, and Kakana (2011) studied a visually based intervention focusing on task organization while working with two boys, Vaggelis and Yiannis. Both boys had difficulties completing tasks independently, and they were diagnosed with autism spectrum disorders. On-task behavior was defined as attending to materials. Off-task was defined as throwing materials or not using them properly for the task at hand (Mavropoulou et al., 2011). Three dependent variables were measured: (1) on-task behavior, (2) teacher prompting, and (3) task performance. According to the study authors, on-task behavior was measured by attending to materials, paying attention visually, and manipulating materials appropriately.
Teacher prompting (measured as the percentage of intervals of teacher prompting) dealt with re-directing a student by cueing or saying the child's name. Lastly, task performance was defined as placing a picture in a correct category (e.g., placing a picture of a piece of clothing on a doll), which yielded a binary score (1 = correct and 0 = incorrect). Task performance was recorded as a percentage. All treatment sessions were fifteen minutes long and data were collected in ten-second intervals during an observational period. For each student a total of forty-five intervals were assessed. Table 13 (below) yields data that can be used to compute the expected probabilities for on-task behavior in the baseline and treatment phases. The models used to analyze two of the variables (teacher prompting and task performance) did not converge and therefore are not analyzed any further; however, the on-task model did converge. Therefore, when session = 0 and treatment = 0, the expected probability of observing a student with on-task behavior during the final session of the baseline phase was 0.97, [exp(3.60) = 36.60; 36.60 / 37.60 = 0.97]. The average rate of change in log odds as a student switched from baseline to treatment was .22. Therefore, when session = 0 and treatment = 1, the expected probability of observing a student with on-task behavior during the treatment phase was 0.98, [exp(3.60 + 0.22) = exp(3.82) = 45.6; 45.6 / 46.6 = 0.98]. The average rate of change in log odds per day of observation was 0.01. This increase was significant (p < .01); therefore, it was concluded that the baseline trend is not flat but increases over time (days). In other words, on-task behavior increased in log odds from the start of the baseline to the end of treatment. The study authors calculated PND statistics, producing a score of 50% for on-task behavior and a score of 45% for accurate task performance. The PND for teacher prompting was not reported, but the authors did say the treatment was not effective. These values, like the study authors' own conclusions, suggest the intervention was not effective in terms of impacting Vaggelis' on-task behavior and task accuracy (Mavropoulou et al., 2011). For Yiannis, a PND of 75% was calculated for on-task behavior, and 70% for task accuracy. This suggests that the intervention effectively improved performance on both variables (Mavropoulou et al., 2011). Results differed across the three approaches for examining treatment effectiveness on these outcomes. For on-task behavior, the original authors and the re-analyses using the software STATA found the treatment to be effective, whereas the independent visual analysis determined it was not effective. As noted in Table 11, the original authors and the re-analyses found the treatment not effective when dealing with the teacher prompt variable, and the independent visual analysis reached the same conclusion. When examining task performance, the original authors and the independent visual analyses found the treatment not effective, whereas the statistical analyses yielded the opposite conclusion. Effect size estimates were independently calculated and all PND statistics matched except for one (task performance), which did not change the overall interpretation (i.e., reported 70% and calculated 60%, both of which would be considered questionable effectiveness).
The SMDs are not consistent to the PND; PND results indicate that the intervention was not effective for Vaggelis (50%, 30%, and 45% for on-task, teacher prompting, and performance, respectively) whereas the SMD suggested it was for Yiannis (75%, 40%, and 60% for on-task, teacher prompting, and performance, respectively). The R2 (proportion of the variation explained by the phase differences) was 34% for on-task behavior, 1% for teacher prompting, and 39% for performance. Table 13 provides results from HLM analyses. 137 Table 13 Final Model: Mavropoulou and Colleagues (2006) Final Model Overdispersion Simplified L-1 model No L-2 Predictors Outcome Variable On-Task Behavior Variables P0 = β00+ R0 INTCPT P1 = β10 + R1 SESS P2 = β20 + R2 TRT β00+ R0 INTCPT β10 + R1 SESS β20 + R2 TRT Coefficient Estimates β00 = β10= β20= Variance Sigma ( 2) Components R0 (SD) R1 R2 Likelihood Function (MLF) 3.60** 0.01** 0.22** 0.03 0.09 0.01 0.07 - 241.86 On-Task Log(Y) = 3.60 + 0.01(SESS) + 0.22(TRT) * p < .05, ** p < .01, Y = dependent variable The Theodore, Bray, Kehle and Jenson (2001) study originally had five students, but considerable amounts of data were missing from two and they were dropped from analyses presented here. The intervention entailed listing classroom rules in the front of 138 the classroom as well as on each student’s desk. The teacher instituted an award system, with reinforcement contingent upon group behavior that allowed students to select various options. Disruptive behaviors were defined as a failure if the student displayed one of the following behaviors: use of obscene words, touching or talking to other students at inappropriate times, verbal put-downs, not facing the teacher, or listening to music too loudly. The intervention (posting rules) was conducted for two 45 minute time blocks during the school day for two weeks during regular classroom instruction, where no rewards were given for behavior. This period was considered the baseline phase. The study consisted of fifteen second intervals for a period of twenty minutes every other day. The treatment was considered by the authors, independent visual analyses and statistical procedures used here to be effective for all the students. The intercept data did not converge during analyses and therefore no expected probabilities pertaining to dependent variable were computed for this study. In the Theodore and colleagues’ study (2001) the reported SMD ranges from 2.6 to 5.2 and the calculated ranged from 2.8 to 5.3. PND results were consistent with an effective intervention (all were 100%). The R2 (proportion of the variation explained by the phase differences) was 78%. The Ramsey, Jolivette, Puckett-Patterson, and Kennedy (2010) study had three dependent variables of interest: on-task behavior, task-completion and accuracy. All five students in the study were classified with an emotional / behavior disorder (E / BD). Ontask behavior was defined as an observation that students: (1) were examining an 139 assignment, (2) writing questions related to the assignment, (3) followed directions, (4) did not use obscene language, and / or (5) did not touch others. Task-completion was calculated by taking the total number of times a task was determined to be correctly completed divided by the total number of attempts. Lastly, accuracy was defined as the percentage of tasks completed correctly. According to the authors, each observation session was conducted for fifteen minutes during the student’s independent work time. 
The sessions were conducted across consecutive weekdays, twice a day. During the baseline phase, teachers gave assignments during independent practice time in math and English classes. Teachers told students to complete a specific assignment (no choice) and gave that assignment to the children. During the treatment (choice) phase, students were given the option of which of two assignments they wanted to complete first (Ramsey et al., 2010). The original study used the PND as a criterion for determining treatment effect. The students (Sara, Chris, Trey, and Abby) demonstrated higher percentages of time on-task, task-completion, and accuracy during treatment, whereas Katie's percentages were lower (Ramsey et al., 2010). According to Ramsey and colleagues (2010), the treatment worked for four of the five students during independent academic tasks. It should be noted that when students all have the same characteristics (in this case, an E/BD classification), discriminating performance across students using HLM procedures becomes difficult. Table 14 presents the final working model for this article. For all three dependent variables, the main effect of treatment was significant (p < .05). From this table, we can compute the expected probabilities for each dependent variable, starting with accuracy, or the number of task steps completed correctly, for the baseline and treatment phases. That is, during baseline (when treatment = 0), the expected probability of observing an accurate response from a student was 0.11, [exp(-2.09) = 0.12; 0.12 / 1.12 = 0.11]. The average rate of change as a person switches from baseline (treatment = 0) to treatment (treatment = 1) was 1.82, [exp(-2.09 + 1.82) = exp(-0.27) = 0.76; 0.76 / 1.76 = 0.43]. So, the expected probability of observing a student completing accurate steps during the treatment phase was 0.43. The next outcome variable was on-task behavior. During baseline (when treatment = 0), the expected probability of observing a student with on-task behavior was 0.39, [exp(-0.46) = 0.63; 0.63 / 1.63 = 0.39]. The average rate of change as a student switched from baseline (treatment = 0) to treatment (treatment = 1) was 2.22, [exp(-0.46 + 2.22) = exp(1.76) = 5.81; 5.81 / 6.81 = 0.85]. So, the expected probability of observing a student with on-task behavior during the treatment phase increased to 0.85. Lastly, one can compute the expected probabilities for task completion for the baseline and treatment phases. That is, during baseline (when treatment = 0), the expected probability of observing a completed task was 0.21, [exp(-1.49) = 0.26; 0.26 / 1.26 = 0.21]. The average rate of change as a person switches from baseline (treatment = 0) to treatment (treatment = 1) was 2.53, [exp(-1.49 + 2.53) = exp(1.04) = 2.83; 2.83 / 3.83 = 0.74]. So, the expected probability of observing a student complete a task during the treatment phase was 0.74. All three procedures for assessing treatment impacts matched up, except when dealing with the dependent variable accuracy. The independent visual analyses did not deem the treatment to be effective, but the original authors claimed there was an effect, and this was supported by the HLM results. For all but two people (Abby and Trey), the intervention was effective according to the SMD for all outcome variables. The R2 (proportion of the variation explained by the phase differences) was 33% when examining the on-task variable, 34% for the task completion variable, and 22% for the accuracy variable.
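The R2 figures reported throughout this chapter describe the proportion of variance in an outcome that is explained by the phase (baseline versus treatment) distinction. One simple way to obtain such a value from digitized data is sketched below; it illustrates the general idea under that definition rather than reproducing the exact computation performed by the software, and the data shown are hypothetical.

```python
import numpy as np

def phase_r_squared(y, phase):
    """Proportion of variance in y explained by a baseline (0) / treatment (1) indicator."""
    y = np.asarray(y, dtype=float)
    phase = np.asarray(phase)
    fitted = np.where(phase == 1, y[phase == 1].mean(), y[phase == 0].mean())
    ss_res = np.sum((y - fitted) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

y     = [20, 25, 30, 70, 75, 80, 25, 30, 72, 78]  # hypothetical percentages across an ABAB series
phase = [0,  0,  0,  1,  1,  1,  0,  0,  1,  1]
print(round(phase_r_squared(y, phase), 2))
```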
Table 14 Final Model: Ramsey and Colleagues (2006) Final Model Overdispersion Simplified L1 model No L2 Predictors Outcome Variable Variables Coefficient Estimates Accuracy P0 = β00+ R0 INTCPT P1 = β10 + R1 TRT β00 = β10= Variance Sigma ( 2) Components R0 (SD) R1 Likelihood Function (MLF) Accuracy Log(Y) = - 2.09 + 1.82(TRT) On-Task Log(Y) = - 0.46 + 2.22(TRT) β00+ R0 INTCPT β10 + R1 TRT Overdispersion Simplified L1 model No L-2 Predictors Overdispersion Simplified L1 model No L-2 Predictors On-Task Behavior Task Completion β00+ R0 INTCPT β10 + R1 TRT β00+ R0 INTCPT β10 + R1 TRT -2.09* 1.82** - 0.46 2.22** 4.36 1.41** 0.36 8.50 0.59** 0.55* 5.39 1.33** 0.15 -378.21 - 430.24 - 394.27 -1.49* 2.53** 142 Table 14 (continued) Task Completion Log(Y) = - 1.49 + 2.53(TRT) * p < .05, ** p < .01, Y = dependent variable The next article used in the current study was Restori, Gresham, Chang, Lee, and Laija-Rodriquez (2007). The study examined the effect of self-monitoring and differential reinforcement of behaviors on academic and disruptive behavior. A particular focus of the article was comparing antecedent and consequent-based treatment strategies, where the former of the two tends to be more proactive in orientation and the latter essentially relies on reactive applications of a behavior plan. The article examined treatment impacts on eight students. When students correctly worked on assigned material, answered questions correctly and sought assistance, they were viewed as being engaged; whether students were rated as being engaged constituted the academic dependent variable of interest in this study. Disruptive behavior was defined as out-ofseat behavior, making disruptive noises, disturbing others, and talking without permission. Academic engagement was defined as the student correctly working on assigned academic material (i.e., working on material, answering questions correctly, and seeking assistance when appropriate). Participants were observed in 10 second intervals for 15 minutes, where each observation session was considered a single data point. The author’s results indicated that treatments primarily seen as “antecedent-based were more effective than treatment strategies that were primarily consequent-based for reducing 143 disruptive behavior and increasing academic engagement for all participants” (Restori et al., 2007, p. 26). These two conditions are referred to as interventions. The independent visual analyses and HLM results supported this assertion. For the Restori and colleagues (2007) article, both outcome variables disruptive behavior and academic achievement had the same final working model, the Simple Non-Linear Model with the Level-2 variable (intervention) on the treatment slope (see Table 15). 
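Because the model for disruptive behavior in this article also carries a session term (SESS1), the expected probability is not a single value per phase; it shifts across observation sessions and reduces to the figures walked through below when session = 0. The sketch that follows illustrates this using the disruptive behavior coefficients reported in Table 15 (presented next); the helper function is hypothetical and the session value of 10 is arbitrary.

```python
import math

# Disruptive behavior coefficients from Table 15 (Restori and colleagues, 2007)
b00, b_trt, b_intervention, b_sess = 0.12, -3.26, 1.34, -0.06

def prob_disruptive(trt, intervention, session):
    """Expected probability of disruptive behavior for a phase, treatment plan, and session."""
    eta = b00 + b_trt * trt + b_intervention * intervention + b_sess * session
    return math.exp(eta) / (1 + math.exp(eta))

# Final baseline session (session = 0): antecedent-based (0) versus consequent-based (1) plans
print(round(prob_disruptive(0, 0, 0), 2))   # 0.53
print(round(prob_disruptive(0, 1, 0), 2))   # 0.81
# The same comparison ten sessions later, under treatment
print(round(prob_disruptive(1, 0, 10), 2))
print(round(prob_disruptive(1, 1, 10), 2))
```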
Table 15

Final Model: Restori and Colleagues (2007)

Disruptive Behavior
  HLM Approach: Overdispersion; Simple Non-Linear model with INTERVENTION on TRT
  Variables: P0 = β00 + β01*(INTERVENTION) + R0 (INTCPT); P1 = β10 + R1 (TRT); P2 = β20 + R2 (SESS1)
  Coefficient Estimates: β00 = 0.12; β10 = -3.26*; β01 = 1.34**; β20 = -0.06*
  Variance Components (SD): Sigma² = 1.22; R0 = 0.01; R1 = 0.00; R2 = 0.18
  Likelihood Function (MLF): -276.68
  Log(Y) = 0.12 - 3.26(TRT) + 1.34(INTERVENTION) - 0.06(SESS1)

Academic Achievement
  HLM Approach: Overdispersion; Simple Non-Linear model with INTERVENTION on TRT
  Variables: P0 = β00 + β01*(INTERVENTION) + R0 (INTCPT); P1 = β10 + R1 (TRT)
  Coefficient Estimates: β00 = -0.97**; β10 = 3.55**; β01 = -1.10*
  Variance Components (SD): Sigma² = 1.29; R0 = 0.11; R1 = 0.52
  Likelihood Function (MLF): -284.38
  Log(Y) = -0.97 + 3.55(TRT) - 1.10(INTERVENTION)

* p < .05, ** p < .01, Y = dependent variable

At baseline, the overall average log odds of exhibiting a disruptive behavior for a student in the antecedent-based treatment (intervention = 0; recall that this is a Level-2 variable) was 0.12 (β00), [exp(0.12) = 1.13; 1.13 / 2.13 = 0.53], when treatment = 0 and session = 0. The expected probability of observing a disruptive behavior for a student in the antecedent-based treatment was 0.53 during the final session of the baseline phase. At baseline, the overall average log odds of exhibiting a disruptive behavior for a student in the consequent-based treatment during the final session of the baseline phase was 1.46 (β00 + β01), [exp(0.12 + 1.34) = exp(1.46) = 4.30; 4.30 / 5.30 = 0.81], when treatment = 0 and session = 0. The expected probability of observing a disruptive behavior during the final session of the baseline phase for a student in the consequent-based treatment was 0.81. The significance test that compares the differences between the baseline and treatment phases (for both dependent variables) demonstrates that the treatment was effective for both outcome variables. The average rate of change in log odds as a student switches from baseline (TRT = 0) to treatment (TRT = 1) was -3.26 (β10). This phase effect was significant, as the p value for β10 was less than .05, so the treatment variable was statistically significant. For a person in the antecedent-based treatment, the expected probability of observing a disruptive behavior during the treatment phase (session = 0) was 0.04, [exp(0.12 - 3.26) = exp(-3.14) = 0.04; 0.04 / 1.04 = 0.04]. For a person in the consequent-based treatment, the expected probability of observing a disruptive behavior during the treatment phase (session = 0) was 0.15, [exp(0.12 - 3.26 + 1.34) = exp(-1.80) = 0.17; 0.17 / 1.17 = 0.15]. Lastly, the average rate of change in log odds per day of observation was -0.06. This decrease was significant (p < .05); therefore, it was concluded that the baseline trend is not flat but changes over time (days). In other words, disruptive behavior decreased in log odds from the start of the baseline to the end of treatment. The Level-2 variable intervention appeared to influence the academic achievement outcome variable in the Restori and colleagues (2007) study as well. At baseline, the overall average log odds of exhibiting academic achievement (recall this has to do with the student working on appropriate academic materials) for a student in the antecedent-based treatment (intervention = 0) was -0.97 (β00), [exp(-0.97) = 0.38; 0.38 / 1.38 = 0.28], when treatment = 0.
The expected probability of observing academic achievement during the baseline phase for a student in the antecedent-based treatment was 0.28. At baseline, the overall average log odds of exhibiting academic achievement for a student in the consequent-based treatment was -2.07 (β00 + β01), [exp(-0.97 - 1.10) = exp(-2.07) = 0.13; 0.13 / 1.13 = 0.12], when treatment = 0. The expected probability of observing academic achievement during the baseline phase for a student in the consequent-based treatment was 0.12. The significance test that compares the differences between the baseline and treatment phases demonstrates that both interventions are effective. The average rate of change in log odds as a student switches from baseline (TRT = 0) to treatment (TRT = 1) was 3.55 (β10). This phase effect was significant as the p value for β10 was less than .05, so the treatment variable was statistically significant. For a person in the antecedent-based treatment (intervention = 0), the expected probability of observing academic achievement during the treatment phase was 0.93, [exp(-0.97 + 3.55) = exp(2.58) = 13.20; 13.20 / 14.20 = 0.93]. For a person in the consequent-based treatment, the expected probability of observing academic achievement during the treatment phase was 0.81, [exp(-0.97 + 3.55 - 1.10) = exp(1.48) = 4.40; 4.40 / 5.40 = 0.81]. The original authors stated that antecedent-based treatment interventions were more effective than consequent-based interventions for reducing disruptive behavior, even though both were found to lower the behavior. The statistical analyses performed here agree with these points. No effect size statistics, only means, were provided in the Restori and colleagues' (2007) study. In the current study, the PND for overall disruptive behavior indicates that the treatment worked for seven out of eight students (for one student the PND = 0%). In terms of the academic achievement variable, the PND indicates the treatment was effective for all but one student (PND = 50%). SMD values show similar results. Using this effect size estimate, the intervention appeared to have questionable effectiveness for one student (SMD = 1.51) and was effective for another (SMD = 2.27). The R2 (proportion of the variation explained by the phase differences) was 63% for disruptive behavior and 77% for academic engagement.

Next, Williamson, Campbell-Whatley, and Lo (2009) examined another group contingency reward system. The purpose of the study was to see if a group contingency worked for six students in a resource room with high incidence disabilities. Students one through three had more on-task behavior and therefore may have been the most affected. This reward system included putting names in a jar and choosing one name after 25 minutes. If the child whose name was drawn had at least four of five plus marks by his or her name, this was indicative of on-task behavior and the whole class would receive a reward (Williamson et al., 2009). The teacher did not reveal which child's name was drawn, and the whole class determined the reward for on-task behavior. All six study participants were African American students with disabilities. The outcome measure was on-task behavior, which was defined as students keeping their eyes and head oriented toward coursework, working appropriately with the given materials, being quiet, and remaining in the assigned area. Observations were conducted for the last 25 minutes of the period, in five-second intervals. Table 16 presents the final working model for this article.
From this table, expected probabilities can be computed for on-task behavior in the baseline and treatment phases. A point worth raising here is that the variable order was statistically significant, which means that average performance differed between the first (A1B1) and second (A2B2) AB pairings. Here we see that treatment (p < .001) and order (p < .001) were significant. That is, when treatment = 0 and order = 0, the expected probability of observing an on-task behavior from a student at baseline in the first AB was 0.42, [exp(-0.31) = 0.72; 0.72 / 1.72 = 0.42]. And when treatment = 0 and order = 1, the expected probability of observing an on-task behavior from a student at baseline in the second AB was 0.63, [exp(-0.31 + .83) = exp(0.52) = 1.68; 1.68 / 2.68 = 0.63]. The average rate of change as a student switched from baseline to treatment was 1.98. The expected probability of observing a student with on-task behavior during the treatment phase during the first AB was 0.84, [exp(-0.31 + 1.98) = exp(1.67) = 5.31; 5.31 / 6.31 = 0.84], when order = 0. The expected probability of observing a student with on-task behavior during the treatment phase during the second AB was 0.92, [exp(-0.31 + 1.98 + .83) = exp(2.50) = 12.2; 12.2 / 13.2 = 0.92], when order = 1. Also, the interaction between treatment and order was significant, p < .001. According to the authors, the intervention was effective for three of the six participants. The original authors and HLM results indicate the overall treatment was effective, yet the independent visual analysis concluded the intervention effect was questionable. The final model was the Simplified Level One with No Level-2 Predictors without Overdispersion.

Table 16

Final Model: Williamson and Colleagues (2009)

Final Model: No Overdispersion; Simplified L-1 model, No L-2 Predictors
Outcome Variable: On-Task Behavior
Variables:
  P0 = β00 + R0   (INTCPT)
  P1 = β10 + R1   (TRT)
  P2 = β20 + R2   (ORDER)
  P3 = β30 + R3   (TRTORD)
Coefficient Estimates: β00 = -0.31; β10 = 1.98**; β20 = 0.83**; β30 = -1.84**
Variance Components (SD): Sigma² = n/a; R0 = 0.12; R1 = 0.00; R2 = 0.00; R3 = 0.09
Likelihood Function (MLF): -239.42
Log(Y) = -0.31 + 1.98(TRT) + 0.83(ORDER) - 1.84(TRTORD)
* p < .05, ** p < .01, Y = dependent variable, n/a = no overdispersion

The PND effect size estimates indicate the treatment was effective for one student (PND = 76.4%), questionable for three (PND = 61.1%, 55.55%, and 50%), and not effective for two (PND = 40.4% and 27.78%). The SMD effect size estimates somewhat verified these findings; however, the three students with questionable PNDs (61.1%, 55.55%, and 50%) had quite different SMD results (30.5, 2.29, and .90). Recall that a 50% cutoff is used to render a decision about treatment effectiveness for the PND, and numbers around 2.0 are considered effective for the SMD. The R2 (proportion of the variation explained by the phase differences) was 29%.

Table 17 presents the final working model for the Amato-Zech and colleagues' (2006) article. Here, the authors used an electronic cue as a treatment intervention that vibrated every three minutes. The cue would prompt three students to record whether they were on-task or not. The observation period was fifteen minutes a day, two to three times per week, for forty-four sessions. Baseline sessions were described as a typical classroom environment where on-task behavior was defined as students actively or passively paying attention to instruction.
From Table 17, we can compute the expected probabilities for on-task behavior in the baseline and treatment phases. At baseline (treatment = 0), the expected probability of observing an on-task behavior from a student was 0.39, [exp(-0.46) = 0.63; 0.63 / 1.63 = 0.39]. The average rate of change in log odds as a student switched from baseline (treatment = 0) to treatment (treatment = 1) was 0.32, so the expected probability of observing a student completing an on-task behavior during the treatment phase was 0.47, [exp(-0.46 + 0.32) = exp(-0.14) = 0.87; 0.87 / 1.87 = 0.47]. According to the authors, the treatment appeared to be effective for all students. The independent visual analyses and HLM results concurred.

Table 17

Final Model: Amato-Zech and Colleagues (2006)

Final Model: No Overdispersion; Simplified L-1 model, No L-2 Predictors
Outcome Variable: On-Task Behavior
Variables:
  P0 = β00 + R0   (INTCPT)
  P1 = β10 + R1   (TRT)
Coefficient Estimates: β00 = -0.46**; β10 = 0.32**
Variance Components (SD): Sigma² = n/a; R0 = 0.02; R1 = 0.01
Likelihood Function (MLF): -159.49
Log(Y) = -0.46 + 0.32(TRT)
* p < .05, ** p < .01, Y = dependent variable, n/a = no overdispersion

PND effect size estimates calculated here match the original report. The SMDs ranged from 2.8 to 4.5, suggesting effective treatments for all three students, and the R2 (proportion of the variation explained by the phase differences) was 51%.

Improvement Rate Difference

The IRD was calculated separately and is therefore not integrated into the findings above. Recall that an IRD of 100% would suggest that all data points in the treatment phase exceed those in the baseline phase (i.e., no overlap) and would indicate a highly effective treatment, whereas an IRD of 50% indicates only chance-level improvement from the baseline to the treatment phase (Parker et al., 2009). The IRD was not reported in any of the nine original articles; it was therefore computed here only for comparison with the other effect size estimates. Two contrast methods were proposed by Parker and colleagues (2009) to calculate the IRD. The first method compares three contrasts (A1 versus B1, B1 versus A2, and A2 versus B2), and the second uses two contrasts (A1A2 versus B1B2). Both options are feasible, but only one effect is needed per person/ABAB graph (Parker et al., 2009). One way to situate the IRD within this sensitivity analysis is to compare it to the reported PND (i.e., the PND originally reported in each article); the literature suggests a strong correlation between the two indices, r = .83 (Parker et al., 2009). For this study, the correlation between the reported PNDs and the IRDs was r = .92.

Comparison of the PNDs based on Visual Analyses to PNDs based on Extracted Data

Comparing effect sizes in the SCDs examined here was complicated by the fact that even the same estimate yielded inconsistent results. Four out of the nine articles had at least one inconsistent PND when comparing the visual analysis to the data extracted from UnGraph. This is not necessarily surprising given how UnGraph extracts data: UnGraph traces the line or allows the user to point and click on the data points, which may be slightly inconsistent, especially if the graph is lacking in quality. When this occurred, the PND was re-calculated and compared to both the extracted and visual PNDs; whichever of the two was verified by the re-calculated PND was chosen as the correct PND.
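For reference, the PND described here and the SMD used throughout can be computed directly from digitized phase data. The sketch below assumes higher scores are the therapeutic direction unless indicated otherwise and divides the mean difference by the baseline standard deviation, the common single-case variant; the data and function names are hypothetical.

```python
from statistics import mean, stdev

def pnd(baseline, treatment, increase_is_good=True):
    """Percentage of non-overlapping data: treatment points beyond the most
    extreme baseline point, in the therapeutic direction."""
    if increase_is_good:
        nonoverlap = sum(t > max(baseline) for t in treatment)
    else:
        nonoverlap = sum(t < min(baseline) for t in treatment)
    return 100 * nonoverlap / len(treatment)

def smd(baseline, treatment):
    """Standardized mean difference: phase mean difference over the baseline SD."""
    return (mean(treatment) - mean(baseline)) / stdev(baseline)

baseline  = [20, 25, 30, 28]          # hypothetical digitized percentages
treatment = [55, 60, 27, 70, 65]
print(pnd(baseline, treatment))       # 80.0
print(round(smd(baseline, treatment), 2))
```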
The PND from the extracted data was calculated in Excel using the PND formula given in Chapter Two. The corrected PND column reports the value the researcher judged to be the best PND for the data. The inconsistencies (there are few) are described in full in Chapter Five.

Table 18
A Comparison of Effect Sizes from the PND: Visual Analysis versus Extracted Data from UnGraph

Amato-Zech, Hoff, and Doepke (2006), increasing on-task behavior in the classroom: extension of self-monitoring strategies. DV: On-task Behavior.
Jack: visual analysis 100%, extracted data 100%
David: visual analysis 100%, extracted data 100%
Allison: visual analysis 93.75%, extracted data 93.75%

Cole and Levinson (2002), effects of within-activity choices on the challenging behavior of children with developmental disabilities. DV: Percent of Task Analysis Steps.
Keith: visual analysis 51.66%, extracted data 51.66%
Wally: visual analysis 28.60%, extracted data 42.85%, correct 42.85%

Lambert, Cartledge, Heward, and Lo (2006), effects of response cards on disruptive behavior and academic responding during math lessons by fourth-grade urban students. DV: Disruptive Behavior.
A1: visual analysis 77.78%, extracted data 77.78%
A2: visual analysis 100%, extracted data 100%
A3: visual analysis 81.25%, extracted data 81.25%
A4: visual analysis 75.71%, extracted data 75.71%
B1: visual analysis 61.36%, extracted data 61.36%
B2: visual analysis 100%, extracted data 100%
B3: visual analysis 94.4%, extracted data 94.4%
B4: visual analysis 0%, extracted data 0%
B5: visual analysis 33.33%, extracted data 33.33%

Mavropoulou, Papadopoulou, and Kakana (2011), effects of task organization on the independent play of students with autism spectrum disorders. DVs: On-task / Teacher Prompt / Performance.
Vaggelis: visual analysis 50% / 30% / 45%, extracted data 50% / 30% / 45%
Yiannis: visual analysis 75% / 40% / 60%, extracted data 75% / 40% / 60%

Murphy, Theodore, Aloiso, Alric-Edwards, and Hughes (2007), interdependent group contingency and mystery motivators to reduce preschool disruptive behavior. DV: Disruptive Behavior.
S1: visual analysis 50%, extracted data 50%
S2: visual analysis 58.3%, extracted data 58.3%
S3: visual analysis 42.85%, extracted data 42.85%
S4: visual analysis 0%, extracted data 0%
S5: visual analysis 87.5%, extracted data 93.75%, correct 93.75%
S6: visual analysis 35.7%, extracted data 66.95%, correct 66.95%
S7: visual analysis 35.7%, extracted data 50%, correct 35.7%
S8: visual analysis 100%, extracted data 100%

Ramsey, Jolivette, Puckett-Patterson, and Kennedy (2010), using choice to increase time on-task, task completion, and accuracy for students with emotional/behavior disorders in a residential facility. DVs: On-task / Task Complete / Accuracy.
Abby: visual analysis 50% / 50% / 75%, extracted data 50% / 50% / 67.85%, correct (Accuracy) 75%
Sara: visual analysis 94.44% / 100% / 100%, extracted data 94.44% / 100% / 100%
Trey: visual analysis 50% / 77.77% / 72.22%, extracted data 50% / 77.77% / 72.22%
Chris: visual analysis 66.4% / 100% / 70.7%, extracted data 73.5% / 100% / 70.7%, correct (On-task) 66.4%
Katie: visual analysis 26.6% / 31.66% / 30%, extracted data 26.6% / 35% / 30%, correct (Task Complete) 31.66%

Restori, Gresham, Chang, Howard, and Laija-Rodriquez (2007), functional assessment-based interventions for children at-risk for emotional and behavioral disorders. DVs reported as paired values (Disruptive / Academic): Overall Disruptive Behavior and Overall Academic Engagement.
A1: visual analysis 100% / 100%, extracted data 100% / 100%
A2: visual analysis 100% / 100%, extracted data 100% / 100%
A3: visual analysis 100% / 100%, extracted data 100% / 100%
A4: visual analysis 100% / 100%, extracted data 100% / 100%
C1: visual analysis 100% / 92.8%, extracted data 100% / 100%, correct 92.8%
C2: visual analysis 0% / 50%, extracted data 0% / 50%
C3: visual analysis 91.6% / 91.6%, extracted data 91.6% / 91.6%
C4: visual analysis 100% / 91.6%, extracted data 100% / 91.6%

Theodore, Bray, Kehle, and Jenson (2001), randomization of group contingencies and reinforcers to reduce classroom disruptive behavior. DV: Disruptive Behavior.
S1-S5: visual analysis 100%, extracted data 100%

Williamson, Campbell-Whatley, and Lo (2009), using a random dependent group contingency to increase on-task behaviors of high school students with high incidence disabilities. DV: On-task Behavior.
S1: visual analysis 40.4%, extracted data 40.4%
S2: visual analysis 76.4%, extracted data 76.4%
S3: visual analysis 61.1%, extracted data 61.1%
S4: visual analysis 55.55%, extracted data 55.55%
S5: visual analysis 50.0%, extracted data 50.0%
S6: visual analysis 27.78%, extracted data 27.78%

Note. DV = dependent variable; ID = subject identifier; PND (visual analysis) = percentage of non-overlapping data determined through the visual process described in this dissertation; PND (extracted data) = percentage of non-overlapping data calculated from the numerical data scanned using UnGraph. PND formula = (1 - percent of treatment overlap points) * 100. A correct value is listed only where the two methods disagreed.

Results for Research Question 4

The validity of using UnGraph to extract data has been examined in the past (Shadish et al., 2009). The concerns addressed here deal with the ability of UnGraph to correctly extract the data and with whether two independent raters would scan data similarly. To address the first point, attempts were made to contact the original authors to obtain their original datasets; only two responded, and these data were compared to the UnGraph results. In short, analyses for this research question were done in three ways: (1) a comparison of the extracted data to the raw data, (2) a comparison between the primary researcher and a second coder, and (3) a comparison between the extracted data and the original authors' descriptive data.

Two authors responded to the data request: Mavropoulou and colleagues (2011) and Amato-Zech and colleagues (2006). Data on four dependent variables (two on-task outcomes, prompting, and performance) were included in the comparison of extracted data to original data, with N = 38 representing the total number of data points obtained from the raw data provided by the original authors. The correlations between the UnGraph results and the raw data for the Mavropoulou and colleagues (2011) study were both 1.00 for on-task and performance, and .99 for prompting. For the Amato-Zech and colleagues (2006) study, on-task behavior yielded a correlation of .99 between the original and extracted data. For the independent variable (sessions), both studies had a correlation of 1.00 between the original authors' data and the scanned data. Lastly, there was 100% agreement for both studies on the number of data points to extract.

Further, a second coder was used to check the reliability of the primary researcher and the ability to accurately extract data using UnGraph. The second coder independently digitized the same nine studies. The overall average correlation for the nine studies was r = .996 (p < .001). The average outcome variable correlation was r = .994 and the average session correlation was r = .998. For the outcome variable, three correlations were 1.00, one was .999, three were .998, and the lowest was .967 for the Cole and Levinson (2002) study. For the session variable, six correlations were 1.00, with the lowest being .98 for the Lambert and colleagues (2006) study. Between the two coders, the percent of agreement of extracted data points (i.e., the total number of data points selected by the secondary coder divided by the number selected by the primary coder) was 97.70% (1,330 out of 1,362 data points). The trained graduate student accidentally extracted data for generalization phases in one article, which made the denominator 32 points higher than it should have been.
Generalization phases extend beyond the core design phases in an effort to collect additional behavioral data and to verify that the outcome measure does, in fact, increase or decrease as a function of the treatment. In addition, descriptive data from the original authors were compared to the extracted data. Table 22 in the Appendix shows the studies' reported data concerning the outcome variables (measured by mean intervals, ranges, and percentage of intervals) and compares those statistics to the extracted data. The average correlation between the original and extracted data (averaging baseline and treatment phases) for the nine articles in this study was .99. Each article, with its reported and extracted data, is listed individually in the Appendix. Below, Table 19 displays the mean and variance of both the raw and extracted data (from the authors who responded to the data request), further demonstrating the validity of UnGraph.

Table 19
A Comparison of Descriptive Statistics between Raw Data and Extracted Data

            Raw Data    Extracted Data
Mean        48.38       47.96
Variance    959.68      968.68
Note. N = 38.

Chapter Five: Discussion, Conclusions, and Recommendations

This chapter discusses results from the analyses that were conducted to address each research question. This is followed by a summary of the findings, a series of conclusions, and a discussion of suggestions for future research. The key point of this work was to compare results across two different analytic traditions. The work is a type of sensitivity analysis in that the primary question considers whether overall treatment effects are found whether HLM techniques or visual analyses are used. Put another way, the work examines whether a treatment effect is found using one approach or the other (i.e., are findings and conclusions sensitive to method?). Strong correspondence between the approaches would be encouraging; low correspondence suggests the need for further investigation beyond the scope of this dissertation. Secondary questions dealt with the examination of Level-2 subject characteristics in hierarchical statistical models. This is of some interest because the literature (Nagler et al., 2008) describes a method in which regression techniques can examine the influence of such variables, even in the context of single-case designs, which work with quite small sample sizes; furthermore, application of these methods can potentially push analyses in new directions and reveal findings that are not possible with visual analysis approaches. Finally, the calculation of effect size estimates (using different methods), their comparison to author results, and an assessment of the reliability and validity of the UnGraph procedure for digitizing coordinates were also of interest.

Discussion of the Results

The primary research question: Does quantification and subsequent statistical analysis of selected ABAB graphs produce conclusions similar to visual analyses? Comparisons between the original authors' conclusions (all of which relied on a form of visual analysis) and statistical analyses of the ABAB graphs produced similar findings. There was a 93% agreement rate between the original authors and the re-analyses using HLM techniques, across 14 dependent analyses focused on whether there was a treatment effect. Only one difference was found between the original articles and the re-analyses.
In terms of this one exception, Mavropoulou and colleagues (2011) considered the intervention to be ineffective (for task performance), but analyses using HLM techniques rendered a significant p value (p < .05). While the original authors questioned the treatment's effectiveness, it is noteworthy that the statistical approach suggested there was a treatment impact. These results should be interpreted carefully, as the models based on data drawn from this article did not converge. It is also the case that the HLM approaches always concurred with claims made by authors when they stated there was a treatment impact.

A 71.4% agreement rate was found when comparing the original authors' claims to results from re-analyses using independent visual analysis approaches based on WWC criteria. These two findings could be explained in several different ways. It could mean that the WWC renders conservative opinions about treatment impacts in the context of single-case designs. A related limitation, noted from the beginning of this work, was that some WWC approaches were not followed. In the WWC project, three doctoral-level researchers trained in visual analysis render an opinion, and that level of attention was not possible here. This is not trivial, as seeking such corroboration is done by others in the field. Maggin and colleagues (2011a) recommended using a team of individuals trained in visual analysis to report findings for each individual case. It is difficult to determine whether the single visual analyst used in this work was overly conservative when rendering decisions about treatment impacts or whether others with similar training would reach a similar conclusion.

There was also 62.3% agreement between the WWC-based analyses and the HLM analyses. For the most part, the HLM analyses routinely found treatment effects (there was one exception), and it seems likely that some of the same limitations noted for the SMD (recall that the procedure tends to yield very large results) apply here. It could be that the HLM estimation procedure used for this dissertation (i.e., full maximum likelihood) yielded inflated parameters. This is worrisome and might suggest that full maximum likelihood may not be the best option given the data. It is also unfortunate not to have been able to compare models using restricted maximum likelihood, as this estimation procedure does not render a likelihood function.

Three articles were analyzed using models that did not converge, and so their results were not included in Chapter Four. These articles were: (a) Cole and Levinson (2002), studying disruptive behavior; (b) Theodore and colleagues (2001), concerning disruptive behavior; and (c) Mavropoulou and colleagues (2011), for teacher prompting and task performance. The failure to converge may have occurred for several reasons, but it would be reasonable to assume that the small sample sizes, large between- and within-subject variability, and parameter estimation using MLF caused the problem. These models were dropped from Chapter Four but are available in Appendix C.

The second research question: Do any of the subject characteristics (Level-2 data) explain between-subject variation? If so, can this information be used to yield new findings? There were four Level-2 predictors that explained (p < .05) variance between subjects. The Lambert and colleagues (2006) study had a Level-2 variable that was a statistically significant predictor of baseline variability. This variable was Class B, which captured classroom membership (Class A = 0, Class B = 1).
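To make concrete where a subject-level predictor such as this enters the model, the sketch below writes out one plausible two-level Bernoulli specification with a classroom dummy at Level 2, using the same P/β notation as the tables above. This is an illustrative sketch only; the exact specification estimated in the Lambert re-analysis may have differed.

```latex
% Level 1 (within subject): log odds of the behavior at occasion t for subject i
\[
  \eta_{ti} \;=\; \log\!\left(\frac{\varphi_{ti}}{1-\varphi_{ti}}\right)
           \;=\; P_{0i} \;+\; P_{1i}\,\mathrm{TRT}_{ti}
\]
% Level 2 (between subjects): CLASSB is the 0/1 classroom dummy (Class A = 0, Class B = 1)
\[
  P_{0i} \;=\; \beta_{00} \;+\; \beta_{01}\,\mathrm{CLASSB}_{i} \;+\; R_{0i},
  \qquad
  P_{1i} \;=\; \beta_{10} \;+\; R_{1i}
\]
```

Under a sketch of this form, a statistically significant β01 would indicate that classroom membership accounts for part of the between-subject variation in the baseline (intercept) log odds, which is the kind of Level-2 finding described for the Class B variable here.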
The authors stated that both classrooms displayed a decrease in disruptive behavior during the response card intervention, but it was later revealed that classroom teacher 'A' had some difficulty recording the responses at the onset of the study. No statistical analysis of differences between the two classrooms was conducted in the original article. Here, as found by Nagler and colleagues (2008) and verified by this study, the variable Class B (which classroom the child came from) was a statistically significant predictor (p < .05) of variance at baseline. This new finding could be explained by several possibilities, including but not limited to the following: (a) the children were coming into the classroom with prior knowledge of using white boards to answer questions, (b) the two classrooms had students who performed differently, or (c) the teachers implementing the intervention had differing teaching strategies.

For the Murphy and colleagues' (2007) study, the original authors reported differences between the two types of students (those who were on track to advance to the next grade level and those who were not) in the discussion section of the article. The authors were limited to descriptive statistics, and stated that the onset of skills may be attributed to starting educational levels or the "heterogeneous nature of the classroom" (p. 60). This is consistent with the expected probabilities of observing disruptive behavior in the baseline and treatment phases. The average overall expected probabilities of observing the behavior matched the concerns the authors had about not fully achieving a return to baseline (the second A phase) and about the large reductions in behavior when implementing the second treatment (the second B phase). The authors attributed this to starting educational levels and the mix of students used in the study, and possibly to students who came with more skills (i.e., remembering how to behave from a prior experience).

The Restori and colleagues (2007) study investigated functional assessments using typically developing children in a general education setting. It appeared that poor academic engagement and extreme rates of disruptive behavior were associated with avoidant or attention-seeking behaviors. The authors reported a significant increase in academic achievement in both treatment phases, and a significant reduction in disruptive behaviors. Also, students assigned to antecedent-based interventions had increased academic engagement in comparison to those assigned to consequent-based interventions. The findings reported by the original authors were consistent with results from the HLM analyses. For example, the treatment variable was a statistically significant predictor of academic achievement, p < .01, as well as of disruptive behavior, p < .01. In addition, Restori and colleagues (2007) stated that "antecedent-based treatment interventions were more effective than consequent-based interventions for reducing disruptive behavior" (p. 26), even though both were found to reduce the behavior. The statistical analyses found that both treatment interventions were successful for the group and confirmed antecedent interventions as more effective, even though the odds ratios would be the same. For the most part, the statistical analyses enhanced interpretation of the original articles. Calculating the expected probabilities of behavior somewhat verified claims made by the study authors, but also provided more information concerning the baseline and treatment phases.
In one case, even though the researchers claimed that both interventions worked effectively (which, according to the visual analyses, they did), the statistical work revealed that one was the better option (given the data). Caution is advised here because differences can occur across estimation procedures. Further, since the sample sizes for these studies were so small, generalization is neither viable nor suggested. For now, the overall conclusion reached here is that Level-2 analyses are useful, but more work on their procedures and properties is warranted. Given the sample sizes in SCDs, it is advised that estimation of the parameters be studied further, especially when using MLF. Finally, the variance-covariance matrix chosen for this dissertation was unstructured, since this is the default in HLM (Garson, 2012), and the variance-covariance output produced did indicate an unstructured matrix. Again, there are several options for restricting the variance components, and these and other parameter estimation techniques should be studied in the future, including autoregressive parameters, according to the manual's authors. Also, the HLM6 software does not yield the variance-covariance matrix by default, but the output can be recovered (Raudenbush, Bryk, & Congdon, 2004).

Model building was necessary to answer Research Questions 1 (HLM analyses) and 2 (Level-2 contributions). Although some issues arose with model convergence and interaction terms, Nagler and colleagues (2008) state: "no model is going to completely fit actual data because of great within-phase and within-subject variation" (p. 11, Section 5). Nagler and colleagues (2008) go on to note that, in real analyses, inferences from both the treatment effect sizes and the estimates from the models would be compared for final judgments concerning treatment impact.

The third research question: Do the PND and IRD (nonparametric indices), the SMD (a parametric index), and R² yield effect sizes similar to those in the original studies? To address this question, effect sizes calculated from either the extracted data or the graphs were compiled and compared to those originally described in the articles. All effect size similarities and inconsistencies can be seen in Table 20 (see Chapter Four) and Table 21 in Appendix A. Comparing multiple effect sizes for individual studies may be considered unwarranted because of the computational differences within each estimate. Generating and interpreting effect sizes in SCDs is complicated by the fact that standardized mean differences (SMDs) are not readily comparable to effect sizes generated from group-based studies (e.g., randomized controlled trials and quasi-experiments). This is because the variances of SCDs tend to be small relative to group-based designs, and there can be cases where shifts in an individual's mean baseline performance yield large numerators. It might, however, be enlightening to calculate and display other methods as a way to verify the conclusions of the original articles and to see whether the effect sizes had discernible patterns. Maggin and colleagues (2011a) suggested using more than one effect size estimate to understand the relationship between the outcome and treatment variable, even though this was not commonly seen in the literature.
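To illustrate what computing more than one estimate for the same series involves, the sketch below applies two additional estimates to the same hypothetical data used with the pnd() helper sketched earlier. The SMD is written here in a common Busk and Serlin-style form (the phase mean difference divided by the baseline standard deviation), and R² is computed as the squared correlation between a 0/1 phase indicator and the outcome; these are illustrative choices, and other formulations of both indices exist.

```python
from statistics import mean, stdev

def smd(baseline, treatment):
    """Standardized mean difference: phase mean difference scaled by the
    baseline standard deviation (one common single-case formulation)."""
    return (mean(treatment) - mean(baseline)) / stdev(baseline)

def r_squared(baseline, treatment):
    """Proportion of outcome variance explained by phase membership,
    computed as the squared correlation with a 0/1 phase indicator."""
    y = baseline + treatment
    x = [0] * len(baseline) + [1] * len(treatment)
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return (sxy ** 2) / (sxx * syy)

# Hypothetical pooled baseline (A1 + A2) and treatment (B1 + B2) observations
baseline = [20, 25, 30, 22, 28, 26]
treatment = [35, 40, 28, 45, 50, 38]
print(round(smd(baseline, treatment), 2))        # ~3.82, above the ~2.0 benchmark
print(round(r_squared(baseline, treatment), 2))  # ~0.62 of the variation tied to phase
```

Reporting such estimates side by side makes it easier to spot cases in which the indices point in different directions, which is the kind of inconsistency discussed below.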
The PND, although not considered the best estimate developed, continues to dominate the field of SCDs in school psychology (Maggin et al., 2011a), and until another estimate is developed it will continue to be reported. The PND is simple to calculate; however, according to Allison and Gorman (1994), as the number of data points increases the PND value trends toward zero, making it difficult to compare PND results from one study to another. In addition, the PND does not correlate highly with other effect size indices (Maggin et al., 2011a; Shadish, Rindskopf, & Hedges, 2008). Nonetheless, for this dissertation, 33% (n = 3 of 9) of the articles used the PND as an effect size estimate. It may simply be that effect size estimates in SCDs present ongoing issues for the field.

Amato-Zech and colleagues (2006) reported a PND, which matched the PND calculated by the primary researcher of this sensitivity analysis. The SMD suggested the intervention was effective for each subject (ranging from 2.8 to 4.5; recall that values around 2.0 or higher are considered effective). Lambert and colleagues' (2006) article reported a PND for five students but did not report one for the remaining four. Only three of the reported values were confirmed, and only two of the PND results suggested the intervention was effective. By contrast, all SMD calculations suggested the treatment was effective for each person, except for the last two individuals (B4 and B5). Again, although comparing across effect size methods is not common, it could be helpful when software is not available, when time is limited, or when the data have high variability.

In addition, Mavropoulou and colleagues (2011) reported four PNDs in their article, although six could have been calculated. Of the four reported, three were confirmed. The one that did not match was Yiannis' performance result; however, the result calculated here did not substantively alter the overall conclusion that the treatment's effectiveness was questionable for this student. As an aside, the HLM analyses yielded a significant p value when examining whether there was a treatment impact on the behavioral outcome, even though the original study authors questioned its effectiveness, as did the independent visual analyses. Further, the SMD was calculated at .39, which suggested the treatment was not effective for this individual, Yiannis.

Murphy and colleagues (2007) reported results using the SMD approach, with the treatment deemed effective for five of the eight subjects. The SMD calculated from the extracted data for this dissertation did not agree with the results of the original authors. According to the article, four students experienced effective treatments (using the standard cutoff of around 2.0 or higher); however, two were just slightly below the threshold (1.99 and 1.98). The PND suggested that five students did not experience an effective intervention. Theodore and colleagues (2001) reported SMD results in their article, and these were all confirmed.
The PND results calculated by the dissertation author also suggest an effective rating. In the main, the effect size estimates are highly variable, do not always match, and should be studied further.

The IRD was included in this study as a means of comparing it to the PND from both the originally reported and the extracted data, mainly because the literature does not offer IRD ranges for determining treatment effectiveness (unlike the PND or the SMD). Recall that an IRD of 100% indicates that all data points in the treatment phase exceed those in the baseline phase, suggesting an effective treatment; by contrast, an IRD of 50% indicates only chance-level improvement from the baseline to the treatment phase (Parker et al., 2009). Recall also that Parker and colleagues (2009) found a correlation of .83 between the IRD and the PND across 166 AB data series. The contrast method used here (A1A2 versus B1B2) yielded a correlation of .92 between the IRD and the reported PND. This may suggest that the IRD and PND are somewhat comparable effect size estimates for these types of behavior-based ABAB designs. Further, the IRD, as a newer estimate, could be used in the future, possibly in place of the PND, until a more viable effect size estimate is devised for SCD data. For now, there is no way to generalize these effect size findings to other settings. It can only be concluded that the effect size metrics yielded inconsistencies, and the single-case community should know about this finding.

The last measure of association, the R², can be seen in the last column of Table 20 in the Appendix. The proportion of the variation explained by the phase differences is one way to interpret R²; however, other interpretations are available in single-case research, and the choice should be contingent on the design's function (Parker & Brossart, 2003). This measure can give a sense of how much of the outcome variation is associated with the phase differences, with larger proportions indicating greater variation between the baseline and treatment phases. In an effort to correlate different effect size measures, Parker and colleagues (2009) found that the strongest correlation was between R² and the Kruskal-Wallis W and the weakest was between R² and the PND. This highlights that caution is needed when using effect sizes in SCDs.

Although comparing different effect size estimates is not common, this dissertation suggests the extra steps may be useful, since the estimates can yield inconsistent results. Certain effect size procedures tend to be consistent with one another (e.g., the IRD and the PND) whereas others are not (e.g., R² and the PND). Furthermore, even comparisons of the exact same estimate (i.e., PNDs based on the visual analysis technique versus PNDs based on the extracted data) were inconsistent. The visual analysis PNDs and the extracted data PNDs are displayed in Table 18 in Chapter Four. Recall that visual analysis uses the features and criteria described in Chapter Two for determining a treatment effect (i.e., it is a visual process, analyzed by multiple senior research methodologists, that assesses trend, level, and overlapping data). The data in the column labeled 'PND (extracted data)' in Table 18 are the exact data extracted by the researcher of this dissertation (and can be found in Appendix B). This column was calculated using the PND formula. The column labeled 'PND (correct)' is the suggested PND effect size estimate based on the information gathered (by the dissertation author) after re-analyzing the differences (discussed below). It is worrisome that some of the PNDs did not match exactly.
After re-analyzing the two methods of calculating the PND (one based on visual inspection, the other computed from the extracted data), the re-analysis matched one of the two values, and that value was chosen as the correct PND. The column labeled PND (correct) was added to suggest which PND option may be the best. Other researchers would likely agree that taking a second look at such inconsistencies, in an effort to arrive at the same estimate, is valuable. Finding an exact match alleviates concerns, while failure to do so points the field to possible areas of future study: (a) why effect sizes do not match in SCDs and (b) how even the same estimate can differ between visual inspection and quantification using UnGraph.

In general, the PNDs based on extracted UnGraph data and on independent visual analysis (remember that even when graphs are assessed visually, the PND involves a calculation, since it is the percentage of non-overlapping data) did produce some different numbers, because trivial differences arose when clicking on specific data points. UnGraph itself can be problematic in that one can get a false sense of security, particularly when dealing with hard-to-read graphs. When dealing with software that has such calibration systems for extracting data from graphs, it may be advisable to have more than one trained extractor check the results or simply to round decimals to whole numbers.

For the Cole and Levinson (2002) article, one student (Wally) yielded two data points (the last two data points in the second AB pairing) that visually demonstrated no overlap, but the extracted data treated those two points as overlapping. This is important to mention because the PND can be interpreted differently. The PND does not treat tied data as countable (i.e., by definition the PND counts only data that fall above/below the highest/lowest baseline data point in the intended direction of the treatment intervention); therefore, the researcher considered the data points in the Cole and Levinson article to be overlapping. This is unfortunate for data that fall at zero or at the highest possible value of the outcome measure. For the Murphy and colleagues (2007) article, three of the eight students' PND values differed between the two datasets, and in all three cases the estimate was higher when using the extracted data. For one student, the second AB pairing yielded a data point that was not overlapping, but this was not the case when basing the PND on visual analysis (the data point in UnGraph registered 5.50 in the second treatment phase and 7.26 in the second baseline phase). This same general issue occurred with the other students. Consequently, slight differences will yield different conclusions. Thus, the final PND column in the table suggests the recommended effect size estimate based on the comparison between the two methods. The best method was chosen by the researcher of this dissertation based on a second look at the data in SPSS and at the visual analyses. It is not that UnGraph is inconsistent; rather, it is that the PND can be computed differently.

The next inconsistent PND result was found when analyzing data from a student in the Murphy and colleagues' (2007) study. Visually it was clear that five points fell below the lowest value in the first AB pairing, but the quantification found all seven points (out of seven) overlapping. Therefore, visual analysis was the better method, because it was visually clear that the points did not overlap. Similarly, for the Ramsey and colleagues (2010) article, three effect size estimates were inconsistent.
Since an effect was readily apparent from visual analysis, the PND based on this procedure was deemed to be correct. Finally, for the Restori and colleagues (2007) study, one person was visually difficult to assess on the outcome measure of overall disruptive behavior. The researcher of this dissertation had to make a judgment call based on the PND differences, which is a limitation of this dissertation. All other articles and outcome measures were exactly the same between the two methods. This all leads to an important suggestion for further research: use the two methods in tandem. This comparison should help researchers see that even one inconsistent data point can change an effect size estimate, which justifies comparing multiple effect sizes (possibly even within the same estimate, as this dissertation demonstrated). That is, even comparisons of the same effect size estimate (the PND) yielded inconsistent results. More research should be conducted on effect size estimates in the single-case community.

The fourth research question was: Is UnGraph a valid and reliable tool for digitizing ABAB graphs? Examining the UnGraph procedure is important because this dissertation is predicated on the assumption that the UnGraph procedure can reliably reproduce published data presented in graph form. The reliability and validity of the UnGraph procedure were examined at three levels: (1) the correlation between the original "raw" data and UnGraph's extracted data, (2) agreement between the researcher of this work and a second coder, and (3) the authors' reported means/percentages/ranges compared to UnGraph's extracted data. First, authors from two studies provided raw data: Mavropoulou and colleagues (2011) and Amato-Zech and colleagues (2006). Ideally, all authors would have been able to respond to requests for raw data, and the fact that only two responded represents a study limitation. Correlations between the raw and extracted data for the Mavropoulou and colleagues (2011) study were both r = 1.00 for on-task and performance, and r = .99 for prompting. For the Amato-Zech and colleagues (2006) study, on-task behavior yielded a correlation of r = .99 between the raw and extracted data. For the independent variable (sessions), both studies had a correlation of r = 1.00 between the original authors' data and the scanned data. Lastly, there was 100% agreement for both studies on the number of data points to extract between the original authors and the researcher in this study. This shows that UnGraph was able to reproduce the published data.

In addition, an independent coder also digitized the data, and the results were compared to the extraction efforts of the dissertation author. The overall average correlation between the two coders (for the nine studies) was r = .9957. The average outcome variable correlation was r = .9938 and the average session correlation was r = .9976. Between the two coders, the percent of agreement of extracted data points (i.e., the total number of data points clicked on by the secondary coder divided by the number clicked on by the primary coder) was 97.70%. In all, the number of data points available to extract (1,362) was based on the primary researcher's work. Any divergence from the percent of agreement of extracted data points was due to an error on the part of the second coder, who did not follow directions and extracted phases not needed for this study.
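For completeness, the two reliability summaries used in this comparison (a correlation between the coders' extracted values and a percent agreement on the number of points extracted) can be sketched as follows. The values below are hypothetical and only illustrate the form of the computation; the actual comparisons were run on the full set of digitized series.

```python
def pearson_r(x, y):
    """Pearson correlation between two coders' values for the same data points."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def point_agreement(n_smaller, n_larger):
    """Percent agreement on the number of extracted data points,
    taken here as the smaller count divided by the larger count."""
    return 100.0 * n_smaller / n_larger

# Hypothetical paired extractions of the same graph by two coders
coder_1 = [20.1, 24.9, 30.2, 35.0, 40.3, 44.8]
coder_2 = [20.0, 25.0, 30.0, 35.2, 40.1, 45.0]
print(round(pearson_r(coder_1, coder_2), 4))  # close to 1.0 for near-identical traces
print(round(point_agreement(1330, 1362), 2))  # 97.65, close to the 97.70% agreement reported above
```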
A final point worth noting is the similarity between the authors' reported means/percentages/ranges and UnGraph's extracted data. The average correlation between the original and extracted data (averaging baseline and treatment phases) for the nine articles (means/percentages/intervals) in this study was r = .99, p < .01. Specifics can be seen in Table 24 in the Appendix. Given the overall pattern of findings, it would appear that UnGraph can reliably reproduce graphed data.

Study Limitations

Several limitations were described above. Those recognized at the outset of the study are: (1) not having access to WWC resources, (2) not knowing the exact variance-covariance matrix used in the manual, (3) not having access to all raw data, and (4) not having an ideal ES estimate to compare between studies. Additional limitations that arose in the course of carrying out the study plan include: (1) failures to converge using MLF estimation, (2) only two authors responding to requests for raw data, and (3) the primary researcher making judgment calls due to the differences between the PNDs.

Conclusions

This dissertation had one primary and three secondary research questions. The primary research question asked: does quantification and subsequent statistical analysis of the ABAB graphs produce conclusions similar to visual analyses? The overall answer is: not always. HLM results and author claims were largely consistent. The only exception involved the task performance variable from the Mavropoulou and colleagues' (2011) study. The authors questioned whether there was a treatment impact, but the HLM analyses yielded a statistically significant p value. This may be an indication that HLM techniques tend to yield liberal decisions when using standard p value cutoffs, and this may make sense given the standardized effect size estimates. The model using task performance as the outcome variable did not converge, and the differing treatment impact results should be interpreted cautiously. Other studies had similar issues with convergence. For example, in the Cole and Levinson (2002) article, the statistical test determined the treatment was effective, p < .05, but again, without convergence this p value should be interpreted cautiously. The SMDs suggested an ineffective treatment intervention for Keith (1.42) as well as for Wally (.96). These results similarly support the view that HLM was too liberal given the ES estimates.

The examination of Level-2 subject characteristics via HLM techniques appears to hold promise in terms of yielding new insights into single-case data. In this work, four such variables were found to explain variation in the dependent variable data. Several Level-2 variables were tested (a complete list of all Level-2 variables tested can be seen in the Appendix) and provided additional information beyond the original work in the Lambert and colleagues (2006) study. It could be that the original authors were simply not interested in these variables at that time. Nonetheless, these predictors can provide information beyond what descriptive statistics alone could provide. Of course, these predictors are more exploratory in nature and should be interpreted with caution given the small sample sizes of these articles. The effect size component of this work was also considered a type of sensitivity analysis, focusing on any differences between claims made by the original authors and results based on the extracted data.
The SMD and PND were, for the most part, the estimates used to describe effect size in the nine articles in this study. Again, comparing effect sizes in SCDs is difficult because, while some authors report them numerically, others provide only an 'effective/not effective' treatment judgment. Further, the PND seemed to produce inconsistent effect sizes depending on the method chosen for calculation (i.e., visual inspection versus Excel formulas applied to the extracted data).

Finally, the reliability and validity of the UnGraph procedures were examined by making comparisons between: (1) the raw and extracted data, (2) two independent coders, and (3) findings and data presented in the research articles (i.e., means and ranges) and the UnGraph results. The average correlation between the raw and extracted data (averaging baseline and treatment phases) for the nine articles in this study was r = .99, p < .001. Also, 100% agreement was found between the original authors and the researcher of this work concerning the expected number of data points to collect for the two studies that provided raw data. Two coders were used to assess the accuracy of UnGraph, yielding 97.70% agreement on the number of data points to extract. The overall average correlation between coders and the average correlation for the reported means/percentages/intervals were r = .9957 and r = .99, p < .01, respectively.

Recommendations

Although this work is based on a small number of studies, visual analyses appear to yield more conservative results than HLM procedures. This is consistent with Raudenbush and Bryk's (2002) position that estimation procedures for small sample sizes may yield biased estimates, and it is recommended that the variance-covariance structures available for repeated measures designs be researched further. While visual analysis is quick, efficient, and cost-beneficial, quantification efforts can be used to re-examine visual analysis results and to provide additional Level-2 subject characteristic information. Of course, quantification of single-case studies is tedious and time-consuming, and it requires knowledge of several software applications and estimation procedures. Overdispersion continues to be an ongoing issue in SCDs. In general, it is recommended that further work in SCDs focus on the different estimation options available beyond those in the Nagler and colleagues (2008) manual. These other estimation options are available in software such as Stata and would bypass estimation limitations in HLM6; Stata may provide better parameter estimation and model fit for smaller samples.

Level-2 information could be analyzed by both qualitative and quantitative means (Hitchcock, Nastasi, & Summerville, 2010). However, quantification may provide new information and prompt future work. Maggin and colleagues (2011a) stated that some researchers provided limited subject characteristics or none at all. Similarly, some authors in this work did not report the sex of the child, disability status, or any information pertaining to the data collectors (i.e., gender of the teachers/primary researcher, years of experience). This reiterates the need to collect more descriptive data, since it may not be only the behavior of the child influencing results but characteristics of the child as well. Effect size estimates in SCDs, while popular for describing associations, were found to be worrisome in this study. It should be noted that not all original authors reported all effect size estimates, and this study included only nine articles.
Perhaps a team should be used in calculating effect size estimates (possibly using more than one estimate to verify results). Again, not all effect size estimates in this dissertation were exactly what the original authors found, which affected individual effectiveness. It would be beneficial to check one effect size calculation to another given the fact that more than one 178 method is available. It is recommended to use the PND or IRD for effect size estimates until a better estimate is developed for SCDs. Lastly, this work agrees with a call by McDougall and colleagues (2011) to require authors to provide raw data when publishing SCD results. This would dismiss hesitation concerning UnGraph, and other extracting tools, and allow closer estimation approximations for model building purposes. 179 References Abell, P. (2009). History, case studies, statistics, and causal inference. European Sociological Review, 25(5), 561-567. Agresti, A. (1996). An introduction to categorical data analysis. Gainesville, FL: John Wiley & Sons. Allison, D. B., & Gorman, B. S. (1994). Making things as simple as possible, but no Simpler: A rejoinder to Scruggs and Mastropieri. Behaviour Research and Therapy, 32, 885–890. doi: 10.1016/0005-7967(94)90170-8 Amato-Zech, N. A. Hoff, K. E., & Doepke, Karla J. (2006). Increasing on-task behavior in the classroom: extension of self-monitoring strategies. Psychology in the Schools, 43(2), 211-221. American Psychological Association Washington DC US (2002). American Psychologist, 57, 1052-1059. doi: 10.1037/0003-066X.57.12.1052 Baer, D. M. (1977). Perhaps it would be better not to know everything. Journal of Applied Behavior Analysis, 10(1), 167-172. Beeson, P. M., & Robey, R. R. (2006). Evaluating single-subject treatment research: Lessons learned from the aphasia literature. Neuropsychology Review, 16(4), 161169. Biosoft (2004). UnGraph for Windows (Version 5.0). Cambridge. U.K.: Author. 180 Bowman-Perrott, L., Greenwood, C. R., & Tapia, Y. (2007). The efficacy of CWPT used in secondary alternative school classrooms with small teacher/pupil ratios and students with emotional and behavioral disorders. Education & Treatment of Children (West Virginia University Press), 30(3), 65-87. Brand, A., Bradley, M. T., Best, L. A., & Stoica, G. (2011). Multiple trials may yield exaggerated effect size estimates. The Journal of General Psychology, 138(1), 1-11. Brandstätter, E. (1999). Confidence intervals as an alternative to significance testing. Methods of Psychological Research Online, 4(2), 33-46. Retrieved from http://www.dgps.de/fachgruppen/methoden/mpronline/issue7/art2/brandstaetter.p df Brossart, D. F., Parker, R. I., Olson, E. A., & Mahadevan, L. (2006). The relationship between visual analysis and five statistical analyses in a simple AB single-case research design. Behavior Modification, 30, 531-563. doi:10.1177/0145445503261167 Bryk, A. S., Raudenbush, S. W., Congdon, R. T., & Seltzer, M. (1988). An introduction to HLM: Computer program and user’s guide [Computer software]. Chicago, IL: University of Chicago. Busk, P. L., & Marascuilo, L. A. (1992). Statistical analysis in single-subject research: Issues, procedures, and recommendations, with application to multiple behaviors. In T. Kratochwill, & J. Levin (Eds.), Single-case research design and analysis: New directions for psychology and education (pp. 159-185). Hillsdale, NJ: Lawrence Erlbaum. 181 Busk, P. L., & Serlin, R. C. (1992). Meta-analysis for single-case research. In T. Kratochwill, & J. 
Levin, (Eds.), Single-case research design and analysis: New directions for psychology and education (pp. 187-212). Hillsdale, NJ: Erlbaum. Bryk, A. S., Raudenbush, S. W., Congdon, R. T., & Seltzer, M. (1988). An introduction to HLM: Computer program and user’s guide [Computer software]. Chicago, IL: University of Chicago. Carver, R. (1978). The case against statistical significance testing. Harvard Educational Review, 48(1), 378-399. Cochrane Collaboration (2006). Cochrane handbook for systematic reviews of interventions. Retrieved March 19, 2012, from http://www.cochrane.org/index_authors_researchers.htm Coe, R. (2002, September). It’s the effect size, stupid: What effect size is and why it is important. Paper presented at the Annual Conference of the British Educational Research Association, England. Cohen, J. (1992). A power primer. Psychological Bulletin, 112, 155–159. doi:10.1037/0033-2909.112.1.155 Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates. Cohen, J. (1969). Statistical power analysis for the behavioral sciences. New York: Academic Press. Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45(12), 13041312. 182 Cole, C. L., & Levinson, T. R. (2002). Effects of within-activity choices on the challenging behavior of children with severe developmental disabilities. Journal of Positive Behavior Interventions, 4(1), 29-37. Crosbie, J. (1987). The inability of the binomial test to control type I error with single subject data. Behavioral Assessment, 9(2), 141-150. Danov, S. E., & Symons, F. J. (2008). A survey evaluation of the reliability of visual inspection and functional analysis graphs. Behavior Modification, 32(6), 828-839. DiCarlo, C. F., & Reid, D. H. (2004). Increasing pretend toy play of toddlers with disabilities in an inclusive setting. Journal of Applied Behavior Analysis, 37(2), 197207. Edgington, E. S. (1987). Randomized single-subject experiments and statistical tests. Journal of Counseling Psychology, 34(4), 437-442. Edgington, E. S. (1996). Randomized single-subject experimental designs. Behavior Research and Therapy, 34(7), 567-574. Edgington, E. S. (1995). Randomization tests: Revised and expanded (3rd ed.). New York: Marcel Deekker. Fox, J. (1991). Regression diagnostics. Newbury Park, California: Sage. Garson, G. D. (2012). Hierarchical linear modeling: Guide and applications. North Carolina: Sage. Ghahfarokhi, M. A. B., Iravani, H., & Sepehri, M. R. (2008). Application of katz family of distributions for detecting and testing overdispersion in poisson regression models. World Academy of Science: Engineering & Technology, 44(1), 544-549. 183 Grissom, R. J., & Kim, J. J. (2005). Effect sizes for research: A broad practical approach. Mahwah, NJ: Erlbaum. Hedges, L. V. (1981). Distribution theory for Glass's estimator of effect size and related estimators. Journal of Educational Statistics, 6, 107–128. doi:10.3102/10769986006002107 Hitchcock, J. H., Nastasi, B. K., & Summerville, M. (2010). Single case designs and qualitative methods: Applying a mixed methods research perspective. MidWestern Educational Researcher, 23(2), 49-58. Horner, R. H., Carr, E. G., Halle, J., McGee, G., Odom, S., & Wolery, M. (2005). The use of single subject research to identify evidence based practices in special education. Exceptional Children, 71(2), 165-179. Howell, D. C. (2009, March 7). Permutation tests for factorial ANOVA designs. 
Retrieved from http://www.uvm.edu/~dhowell/StatPages/More_Stuff/Permutation%20Anova/Per mTestsAnova.html Hunt, P., Soto, G., Maier, J., & Doering, K. (2003). Collaborative teaming to support students at risk and students with severe disabilities in general education classrooms. Exceptional Children, 69(3), 315-332. Individuals with Disabilities Education Act. (2012). http://www.ode.state.oh.us Jenson, W. R., Clark, E., Kircher, J. C., & Kristjansson, S. D. (2007). Statistical reform: 184 Evidence-based practice, meta-analyses, and single subject designs. Psychology in the Schools, 44(5), 483-493. Kauffman, J. M., & Landrum, T. J. (2009). Characteristics of emotional and behavioral disorders of children and youth (9th ed.). Upper Saddle River, NJ: Prentice Hall. Kauffman, J. M., & Lloyd, J. W. (1995). A sense of place: The importance of placement issues in contemporary special education. In J. Kauffman, J. Lloyd, D. Hallahan, & T. Astuto (Eds.), Issues in educational placement: Students with emotional and behavioral disorders (pp. 3-19). Mahwah, NJ: Lawrence Erlbaum Associates. Kazdin, A. E. (1992). Research designs in clinical psychology (2nd ed.). Boston: Allyn & Bacon. Kirk, R. E. (1996), Practical significance: A concept whose time has come, Educational and Psychological Measurement, 56(5), 746-759. Kratochwill, T. R., & Brody, G. H. (1978). Single subject designs: A perspective on the controversy over employing statistical inference and implications for research and training in behavior modification. Behavior Modification, 2, 291-307. doi: 10.1177/002221947901200407 Kratochwill, T. R., Hitchcock, J., Horner, R. H., Levin, J. R., Odom, S. L., Rindskopf, D. M., & Shadish, W. R. (2010). SCDs technical documentation. Retrieved from What Works Clearinghouse website: http://ies.ed.gov/ncee/wwc/pdf/wwc_scd.pdf. 185 Kratochwill, T. R., & Levin, J. R. (2010). Enhancing the scientific credibility of singlecase intervention research: Randomization to the rescue. Psychological Methods, 15(2), 124-144. Kratochwill, T. R., & Stoiber, K. C. (2002). Evidence-based interventions in school psychology: Conceptual foundations of the procedural and coding manual of division 16 and the society for the study of school psychology task force. School Psychology Quarterly, 17(4), 341-389. Krishef, C. H. (1991). Fundamental approaches to single subject design and analysis. Malabar, FL: Krieger. Kromrey, J., & Foster-Johnson, L. (1996). Determining the efficacy of intervention: the use of effect sizes for data analysis in single-subject research. The Journal of Experimental Education, 65, 73-93. doi: 10.1080/00220973.1996.9943464 Lambert, M., Cartledge, G., Heward, W. L., & Lo, Y. (2006). Effects of response cards on disruptive behavior and academic responding during math lessons by fourthgrade urban students. Journal of Positive Behavior Interventions, 8(2), 88-99. Lazar, N. A. (2004). A short survey on causal inference, with implications for context of learning studies of second language acquisition. Studies in Second Language Acquisition, 26, 329-347. doi:10.1017/S0272263104262088 Liu, X., & Raudenbush, S. (2004). A note on the noncentrality parameter and effect size estimates for the F test in ANOVA. Journal of Educational and Behavioral Statistics, 29(2), 251-255. 186 Maggin, D. M., Chafouleas, S. M, Goddard, K. M., & Johnson, A. H. (2011). A systematic evaluation of token economies as a classroom management tool for students with challenging behavior. School Psychology, 49(5), 529-554. Maggin, D. 
M., O’Keeffe, B. V., & Johnson, A. H. (2011). A quantitative synthesis of methodology in the meta-analysis of single-subject research for students with disabilities: 1985-2009. Exceptionality, 19, 109-135. doi: 10.1080/09362835.2011.565725 Maggin, D. M., Swaminathan, H., Rogers, H. J., O’Keefe, B. V., Sugai, G., & Horner, R. H. (2011). A generalized least squares regression approach for computing effect sizes in single-case research: Application examples. School Psychology, 49(3), 301321. Manolov, R., Arnau, J., Solanas, A., & Bono, R. (2010). Regression-based techniques for statistical decision making in single-case designs. Psicothema, 22(4), 10261032. Mavropoulou, S., Papadopoulou, E., & Kakana, D. (2011). Effects of task organization on the independent play of students with autism spectrum disorders. Journal of Autism and Developmental Disorders, 41(7), 913-925. McDougall, D., Narkon, D., & Wells, J. (2011). The case for listing raw data articles describing single-case interventions: How long must we wait? Education, 132(1), 149-163. 187 McDougall, D., Skouge, J., Farrell, C. A., & Hoff, K. (2006). Research on selfmanagement techniques used by students with disabilities in general education settings: A promise fulfilled? Journal of the American Academy of Special Education Professionals, 1(2), 36-73. Mitchell, C., & Hartmann, D. P. (1981). A cautionary note on the use of omega squared to evaluate the effectiveness of behavioral treatments. Behavioral Assessment, 3, 93-100. Morgan, D. L., & Morgan, R. K. (2009). Single-case research methods for the behavioral and health sciences. Thousand Oaks, CA: Sage. Murphy, K. A., Theodore, L. A., Aloiso, D., Alric-Edwards, J. M., & Hughes, T. L. (2007). Interdependent group contingency and mystery motivators to reduce preschool disruptive behavior. Psychology in the Schools. 44(1), 53-63. Nagler, E., Rindskophf, D., & Shadish, W. (2008). Analyzing data from small N designs using multilevel models: A procedural handbook (Grant No. 75588-00-01). U.S. Department of Education. Orme, J. G., & Cox, M. E. (2001). Analyzing single-subject design data using statistical process control charts. Social Work Research, 25(2), 115-127. Ottenbacher, K. J. (1990). When is a picture worth a thousand p values? A comparison of visual and quantitative methods to analyze single subject data. Journal of Special Education, 23, 436-449. doi: 10.1177/002246699002300407 188 Parker, R., & Brossart, D. (2003). Evaluating single-case research data: A comparison of seven statistical methods. Behavior Therapy, 34(2), 189-211. doi: 10.1016/S0005-7894(03)80013-8 Parker, R., Brossart, D., Vannest, K., Long, J., Garcia De-Alba, R., Baugh, F. G., & Sullivan, J. (2005). Effect sizes in single-case research: How large is large? School Psychology Review, 34(1), 116-132. Parker, R. I., Cryer, J., & Byrns, G. (2006). Controlling baseline trend in single-case research. School Psychology Quarterly, 21(4), 418-444. Parker, R. I., Vannest, K. J., & Brown, L. (2009). The improvement rate difference for single-case research. Exceptional Children, 75(2), 135-150. Parsonson, B. S., & Baer, D. M. (1986). The graphic analysis of data. In A. Poling, & R. Fuqua (Eds.), Research methods in applied behavior analysis: Issues and advances (pp. 157-186). New York: Plenum Press. Pearl, J. (1995). Causal diagrams for empirical research. Biometrika, 82(4), 669-688. Pinheiro, J., & Bates, D. (1995). Approximations to the log-likelihood function in the nonlinear mixed-effects model. 
Journal of Computational and Graphical Statistics, 4(1), 12-35. Ramsey, M. L., Jolivette, K., Puckett-Patterson, D., & Kennedy, C. (2010). Using choice to increase time on-task, task-completion, and accuracy for students with emotional/behavior disorders in a residential facility. Education and Treatment of Children, 33(1), 1-21. 189 Restori, A. F., Gresham, F. M., Chang,T., Howard L. B., & Laija-Rodriquez, W. (2007). Functional assessment-based interventions for children at-risk for emotional and behavioral disorders. California School Psychologist, 12(1), 9-30. Raudenbush, S. W., Bryk, A. S., & Congdon, R. (2004). HLM 6 for Windows (Version 6) [Computer software]. Lincolnwood, IL: Scientific Software International, Inc. Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data analysis (2nd ed.). Thousand Oaks, CA: Sage. Raudenbush, S. W., & Liu, X. (2000). Statistical power and optimal design for multisite randomized trials. Psychological Methods, 5(3), 199-213. Parker, R. I., Cryer, J., & Byrns, G. (2006). Controlling baseline trend in single-case research. School Psychology Quarterly, 21(4), 418-444. Rosnow, R., & Rosenthal, R. (1989). Statistical procedures and the justification of knowledge in psychological science. American Psychologist, 44, 1276-1284. doi: 10.1037/0003-066X.44.10.1276 Rousson, V., Gasser, T., & Seifer, B. (2002). Assessing intrarater, interrater and test– retest reliability of continuous measurements. Statistics in Medicine, 21(22), 3431-3446. Sackett, D. L., Richardson, W. S., Rosenberg, W., & Haynes, R. B. (1997). Evidence-based medicine: How to practice and teach EBM. London: Churchill Livingstone. Scheffé, H. (1959). The analysis of variance. New York: Wiley. 190 Schmidt, F. L., & Hunter, J. E. (1997). Eight common but false objections to the discontinuation of significance testing in the analysis of research data. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 37-64). Hillsdale: Lawrence Erlbaum. Scruggs, T. E., & Mastropieri, M. A. (2001). How to summarize single-participant research: Ideas and application. Exceptionality, 9(4), 227–244. Scruggs, T. E., & Mastropieri, M. A. (1998). Synthesizing single-subject research: Issues and applications. Behavior Modification, 22(1), 221-242. Segool, N. K., Brinkman, T. M., & Carlson, J. S. (2007). Enhancing accountability in behavioral consultation through the use of SCDs. International Journal of Behavioral Consultation and Therapy, 3(2), 310-321. Sekhon, J. S. (2010). The statistics of causal inference in the social sciences [PDF document]. Retrieved from Lecture Notes Online Web site: http://sekhon.berkeley.edu/causalinf/causalinf.print.pdf Shadish, W. R., Brasil, I. C. C., Illingworth, D. A., White, K., Galindo, R., Nagler, E. D., & Rindskopf, D. M. (2009). Using UnGraph® to Extract Data from Image Files: Verification of Reliability and Validity. Behavior Research Methods, 41(1), 177183. Shadish, W., Cook, T., & Campbell, D. (2002). Experimental and quasi-experimental designs for generalized causal inference. Boston: Houghton Miffin. 191 Shadish, W. R., Rindskopf, D. M., & Hedges, L. V. (2008). The state of the science in the meta-analysis of single-case experimental designs (Grant No. H324U05000106). Washington DC: Institute of Education Science. Shadish, W. R., Sullivan, K. J., Hedges, L., & Rindskpof, D. (2010). A d-estimator for single-case designs. (Grant No. R305D 100046 and R305D100033). 
University of California: Institute for Educational Sciences.
Sidman, M. (1960). Tactics of scientific research: Evaluating experimental data in psychology. New York: Basic Books.
Singer, J. D., & Willett, J. B. (2003). Applied longitudinal data analysis: Modeling change and event occurrence. New York: Oxford University Press.
Skrondal, A., & Rabe-Hesketh, S. (2007). Redundant overdispersion parameters in multilevel models for categorical responses. Journal of Educational and Behavioral Statistics, 32(4), 419-430.
Snijders, T., Bosker, R., & Guldemond, H. (2007). PinT: The program PINT for determining sample sizes in two-level modeling (Version 2.12) [Computer software]. Retrieved from http://stat.gamma.rug.nl/multilevel.htm
Steiger, J. H., & Fouladi, R. T. (1997). Noncentrality interval estimation and the evaluation of statistical models. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Hillsdale: Lawrence Erlbaum.
Stuart, R. B. (1967). Behavioral control of overeating. Behaviour Research and Therapy, 5(4), 357-365.
Suen, H. K., & Ary, D. (1987). Autocorrelation in applied behavior analysis: Myth or reality? Behavioral Assessment, 9, 125-130.
Swanson, H. L., & Sachse-Lee, C. (2000). A meta-analysis of single-subject-design intervention research for students with LD. Journal of Learning Disabilities, 33(2), 114–136.
Swoboda, C. M., Kratochwill, T. R., & Levin, J. R. (2010). Conservative dual-criterion method for single-case research: A guide for visual analysis of AB, ABAB, and multiple-baseline designs (WCER Working Paper No. 2010-13). Retrieved from University of Wisconsin–Madison, Wisconsin Center for Education Research website: http://www.wcer.wisc.edu/publications/workingPapers/papers.php
Tate, R. L., McDonald, S., Perdices, M., Togher, L., Schultz, R., & Savage, S. (2008). Rating the methodological quality of single-subject designs and n-of-1 trials: Introducing the single-case experimental design (SCED) scale. Neuropsychological Rehabilitation, 18, 385-401. doi:10.1080/09602010802009201
Theodore, L. A., Bray, M. A., Kehle, T. J., & Jenson, W. R. (2001). Randomization of group contingencies and reinforcers to reduce classroom disruptive behavior. Journal of School Psychology, 39(3), 267-277.
U.S. Department of Education. (2008). Analyzing data from small N designs using multilevel models: A procedural handbook (Grant No. 75588-00-01).
Van den Noortgate, W., & Onghena, P. (2003). Hierarchical linear models for the quantitative integration of effect sizes in single-case research. Behavior Research Methods, Instruments, & Computers, 35(1), 1-10.
Van den Noortgate, W., & Onghena, P. (2008). A multilevel meta-analysis of single-subject experimental design studies. Evidence-Based Communication Assessment and Intervention, 2(3), 142–151.
Vannest, K. J., Parker, R. I., & Gonan, O. (2011). Single Case Research: Web-based calculators for SCR analysis (Version 1.0) [Web-based application]. College Station, TX: Texas A&M University.
Waldmann, M. R. (2000). Competition among causes but not effects in predictive and diagnostic learning. Journal of Experimental Psychology, 26, 53-76. doi: 10.1037//0278-7393.26.1.53
White, D. M., Rusch, F. R., Kazdin, A. E., & Hartmann, D. P. (1989). Applications of meta-analysis in individual subject research. Behavioral Assessment, 11, 281-296.
Williamson, B. D., Campbell-Whatley, G., & Lo, Y. (2009).
Using a random dependent group contingency to increase on-task behaviors of high school students with high incidence disabilities. Psychology in the Schools, 46(10), 1074-1083. Wolery, M., Busick, M., Reichow, B., & Barton, E. E. (2010). Comparison of overlap methods for quantitatively synthesizing single subject data. The Journal of Special Education, 44(1), 18-28. WWC (n.d.). What Works Clearinghouse. Retrieved from http://ies.ed.gov/ncee/wwc/. WWC Evidence review protocol (2011). Interventions for children classified as having an emotional disturbance. Retrieved from http://ies.ed.gov/ncee/wwc/pdf/reference_resources/wwc_ebd_protocol_v2.pdf 194 Appendix A: Tables Table 20 List of Effect Sizes from the Original Article Compared to Calculated Effect Sizes Using the PND, SMD, and R2 Citation Amato-Zech, N. A. Hoff, K.E.; Doepke, Karla J. (2006). Increasing on-task behavior in the classroom: extension of self-monitoring strategies. Psychology in the Schools, 43(2), 211-221. Cole, C.L. & Levinson, T.R. (2002). Effects of within-activity choices on the challenging behavior of children with developmental disabilities. Journal of Positive Behavior Interventions, 4(1), 29-37. Lambert, M., Cartledge, G., Heward, W. L., & Lo, Y. (2006). Effects of response cards on disruptive behavior and academic responding during math lessons by fourth-grade urban students. Journal of Positive Behavior Interventions, 8(2), 88-99. Mavropoulou, S., Papadopoulou, E., & Kakana, D. (2011). Effects of task organization on the independent play of students with autism DV On-task Behavior ID Reported Effect Size PND Jack David Allison 100% 100% 93.75% PND SMD R2 100% 4.5 100% 3.3 93.75% 2.8 .51 .25 Percents/Means Percent of Task Analysis Steps Keith’s range: A1 B1 A2 B2 7.7% - 61.5% 0% - 30.8% n/a 16% 51.66% 1.42 Wally’s range: A1 B1 A2 B2 14.3%-81.8% 0%-30.8% n/a n/a 28.60% .96 PND A1 Disruptive A2 Behavior A3 A4 B1 B2 B3 B4 B5 92.8% 100% n/a n/a 94.1% 100% 94.4% n/a n/a 77.78% 100% 81.25% 75.71% 61.36% 100% 94.4% 0% 33.33% 3.04 5.80 2.70 2.80 2.57 3.70 3.90 1.05 1.27 50% 30% 45% 1.27 .19 1.32 .59 PND Vaggelis On-task Teacher prompt Performance 50% n/a 45% .34 .01 .39 195 Table 20 (Continued) Citation DV ID spectrum disorders. Journal of Autism and Developmental Disorders, 41(7), 913-925. Yiannis On-task Teacher prompt Performance SMD Murphy, K. A., Theodore, L.A., Aloiso, D. Alric-Edwards, S1 J.M., & Hughes, T.L. (2007). S2 Interdependent group Disruptive S3 contingency and mystery Behavior S4 motivators to reduce S5 preschool disruptive behavior. S6 Psychology in the Schools. S7 44(1), 53-63. S8 Ramsey, M.L., Jolivette, K., Puckett-Patterson, D., & Kennedy, C. (2010). Using choice to increase time on-task, taskcompletion and accuracy for students with emotional/behavior disorders in a residential facility. Education and Treatment of Children, 33(1), 1-21. 
Reported Effect Size R2 PND SMD 75% n/a 70% 75% 40% 60% .74 2.15 1.70 7.71 3.04 2.36 2.06 1.58 1.59 .99 2.64 50% 58.3% 42.85% 0% 87.5% 35.7% 35.7% 100% 2.88 1.99 1.34 1.13 1.33 1.16 .61 1.98 .21 1.24 1.48 1.60 .33 .34 .22 Percents Abby On-task Task Complete Accuracy 0% complete 0% 0% complete 0% 90% most positive point 75% Sara On-task Task Complete Accuracy B2 100% 94.44% 1.80 B2 100% 100% 6.40 50% most positive point 100% 6.90 Trey On-task Task Complete Accuracy A1 No data exceeded A1 33.33% A1 66.7% 50% .92 77.77% 2.3 72.22% .21 Chris On-task Task Complete Accuracy 40% 100% 71% 66.4% 100% 70.7% Katie On-task Task Complete Accuracy 60% Most positive point 18.33% 1.80 33% 31.66% 2.40 13% 30% 3.11 3.64 11.6 5.40 196 Table 20 (Continued) Citation DV ID Reported Effect Size PND Mean Restori, A.F., Gresham, F.M., Chang,T., Howard L.B. & Laija-Rodriquez, W., (2007). Functional assessment-based interventions for children at-risk for emotional and behavioral disorders. California School Psychologist, 12, 9-30. Overall Academic Engagement Overall Disruptive Behavior Theodore, L.A., Bray, M.A., Kehle, T.J., & Jenson, W.R. (2001). Randomization Disruptive of group contingencies Behavior and reinforcers to reduce classroom disruptive behavior. Journal of School Psychology, 39(3), 267-77. Disruptive / Academic A1 B1 A2 B2 22.35 86.90 37.41 83.08 A1 A2 A3 A4 100% / 100% 100% / 100% 100% / 100% 100% / 100% A1 B1 A2 B2 59.53 6.77 38.31 6.49 C1 100% / 92.8% C2 0% / 50% C3 91.6% / 91.6% C4 100% / 91.6% SMD R2 Disruptive/ Academic 2.08 / 3.87 .77 2.84 / 4.22 3.71 / 20.74 2.43 / 3.42 1.52 / 3.93 1.51 / 2.27 2.56 / 3.68 4.8 / 4.91 .63 100% 100% 100% dropped dropped 5.3 4.4 2.8 dropped dropped .78 40.4% 76.4% 61.1% 55.55% 50.0% 27.78% 1.83 2.37 30.5 2.27 .90 1.31 SMD S1 S2 S3 S4 S5 5.2 4.7 2.6 3.8 4.2 Mean Percentages Williamson, B. D., Campbell-Whatley, G., & Lo, Y. (2009). Using a random dependent group contingency to increase on-task behaviors of high school students with high incidence disabilities. Psychology in the Schools, 46(10), 1074-1083. On-task Behavior A1 43.4% B1 83.7% A2 63.3% B2 66.9% S1 S2 S3 S4 S5 S6 .29 DV = Dependent Variable; ID = person identification, PND = Percentage of non-overlapping data; SMD = Standardized Mean Difference; R2 = proportion of variance accounted for between the independent variable and the dependent variable, n/a = not available from the original author(s) 197 Table 21 List of Effect Sizes from the Original Article Compared to Calculated Effect Sizes Using the IRD Citation Amato-Zech, N. A. Hoff, K.E.; Doepke, Karla J. (2006). Increasing on-task behavior in the classroom: extension of self-monitoring strategies. Psychology in the Schools, 43(2), 211-221. Cole, C.L. & Levinson, T.R. (2002). Effects of within-activity choices on the challenging behavior of children with developmental disabilities. Journal of Positive Behavior Interventions, 4(1), 29-37. Lambert, M., Cartledge, G., Heward, W. L., & Lo, Y. (2006). Effects of response cards on disruptive behavior and academic responding during math lessons by fourth-grade urban students. Journal of Positive Behavior Interventions, 8(2), 88-99. Mavropoulou, S., Papadopoulou, E., & Kakana, D. (2011). Effects of task organization on the independent play of students with autism spectrum disorders. Journal of Autism and Developmental Disorders, 41(7), 913-925. 
DV ID Reported Effect Size IRD PND On-task Behavior Jack David Allison 100% 100% 93.75% Jack David Allison 100% 100% 93.75% Percents/Means Percent of Task Analysis Steps Keith’s range: A1 B1 A2 B2 7.7% - 61.5% 0% - 30.8% n/a 16% Keith 54.15% Wally’s range: A1 B1 A2 B2 14.3%-81.8% 0%-30.8% n/a n/a Wally 69.00% A1 A2 A3 A4 B1 B2 B3 B4 B5 93.75% 100% 81.25% 75.7% 87.85% 100% 94.4% 54.45% 59.95% A1 Disruptive A2 Behavior A3 A4 B1 B2 B3 B4 B5 PND 92.8% 100% n/a n/a 94.1% 100% 94.4% n/a n/a PND Vaggelis On-task Teacher prompt Performance 50% n/a 45% Vaggelis On-task 70.45% Prompt 45% Performance 36.35% Yiannis On-task Teacher prompt Performance 75% n/a 70% Yiannis On-task 75% Prompt 85.45% Performance 55.45% 198 Table 21 (Continued) Citation DV ID SMD Murphy, K. A., Theodore, L.A., Aloiso, D. Alric-Edwards, S1 J.M., & Hughes, T.L. (2007). S2 Interdependent group Disruptive S3 contingency and mystery Behavior S4 motivators to reduce S5 preschool disruptive behavior. S6 Psychology in the Schools. S7 44(1), 53-63. S8 Ramsey, M.L., Jolivette, K., Puckett-Patterson, D., & Kennedy, C. (2010). Using choice to increase time on-task, taskcompletion and accuracy for students with emotional/behavior disorders in a residential facility. Education and Treatment of Children, 33(1), 1-21. Reported Effect Size 7.71 3.04 2.36 2.06 1.58 1.59 .99 2.64 IRD S1 S2 S3 S4 S5 S6 S7 S8 65.4% 91.65% 84.5% 77.09% 93.75% 59.82% 75% 100% Percents Abby On-task Task Complete Accuracy Abby 0% complete On-task 87.5% 0% complete Complete 90.2% 90% most positive point Accuracy 83.93% Sara On-task Task Complete Accuracy Sara B2 100% On-task 91.65% B2 100% Complete 100% 50% most positive point Accuracy 77.85% Trey On-task Task Complete Accuracy A1 No data exceeded A1 33.33% A1 66.7% Trey On-task 44.65% Complete 30.90% Accuracy 20% Chris On-task Task Complete Accuracy 40% 100% 71% Chris On-task 94.45% Complete 100% Accuracy 100% Katie On-task Task Complete Accuracy Katie 60% Most positive point On-task 96.15% 33% 31.66% Complete 92.31% 13% Accuracy 83.3% 199 Table 21 (Continued) Citation DV ID Reported Effect Size Mean Restori, A.F., Gresham, F.M., Chang,T., Howard L.B. & Laija-Rodriquez, W., (2007). Functional assessment-based interventions for children at-risk for emotional and behavioral disorders. California School Psychologist, 12, 9-30. Overall Academic Engagement Overall Disruptive Behavior IRD Academic Engagement Disruptive Behavior A1 B1 A2 B2 22.35 86.90 37.41 83.08 A1 100% A2 100% A3 100% A4 100% A1 100% A2 100% A3 100% A4 100% A1 B1 A2 B2 59.53 6.77 38.31 6.49 C1 100% C2 93.75% C3 91.65% C4 91.65% C1 100% C2 72.96% C3 91.65% C4 100% ______________________________________________________________________________________ Theodore, L.A., Bray, M.A., Kehle, T.J., & Jenson, W.R. (2001). Randomization Disruptive of group contingencies Behavior and reinforcers to reduce classroom disruptive behavior. Journal of School Psychology, 39(3), 267-77. SMD S1 S2 S3 S4 S5 5.2 4.7 2.6 3.8 4.2 S1 S2 S3` S4 S5 100% 100% 100% dropped dropped Mean Percentages Williamson, B. D., Campbell-Whatley, G., & Lo, Y. (2009). Using a random dependent group contingency to increase on-task behaviors of high school students with high incidence disabilities. Psychology in the Schools, 46(10), 1074-1083. 
On-task Behavior A1 43.4% B1 83.7% A2 63.3% B2 66.9% S1 S2 S3 S4 S5 S6 71.43% 93.75% 72.2% 88.89% 55.5% 61.1% DV = Dependent Variable; ID = person identification, PND = Percentage of non-overlapping data; IRD = Improvement Rate Difference, n/a = not available from the original author(s) 200 Table 22 A Comparison of the Reported Data to the Extracted Citation DV ID Reported Data Extracted Data Percentage of intervals (mean) Amato-Zech and colleagues (2006) On-task Behavior Jack a1 = 53% b1 = 79% a2 = 74% b2 = 91% a1 = 53% b1 = 79% a2 = 73% b2 = 91% David a1 = 55% b1 = 79% a2 = 76% b2 = 93% a1 = 54% b1 = 78% a2 = 75% b2 = 93% Allison a1 = 56% b1 = 89% a2 = 84% b2 = 96% a1 = 56% b1 = 90% a2 = 85% b2 = 96% Range Cole and colleagues (2003) Percent of Task-Analysis Steps Keith a1 = 7.7 b1 = 0 - 30.8 a2 = n/a b2 = n/a a1 = 8 – 61.4 b1 = 0 - 31 a2 = 15 – 92.4 b2 = 7 – 30.4 Wally a1 = 14.3 - 81.8 b1 = 0 – 30.8 a2 = n/a b2 = n/a a1 = 14.2 - 80 b1 = 0 - 30 a2 = 0 – 33.2 b2 = 0 – 21 Mean Number of Disruptive Behaviors Lambert and colleagues (2006) Disruptive Behavior SSR/Baseline = 6.8 RC/Treatment = 1.3 SSR/Baseline = 6.83 RC/Treatment = 1.63 Mavropoulou and colleagues (2011) Prompting On-task Baseline = 63.75 Treatment = 84.35 Baseline = 61.77 Treatment = 83.74 Baseline = 18.28 Treatment = 16.47 Baseline = 17.66 Treatment = 16.24 Baseline = n/a Treatment = n/a Baseline = 31.66 Treatment = 61.14 Performance 201 Table 22 (continued) Citation DV Reported Data Extracted Data Mean Intervals of Disruptive Behavior Murphy and colleagues (2007) Disruptive Behavior Baseline = 20.59 Treatment = 4.29 Baseline = 18.26 Treatment = 3.76 Mean Intervals Ramsey and colleagues (2010) On-task Task Complete Accuracy Baseline = 38.98 Treatment = 80.75 Baseline = 21.67 Treatment = 68.23 Baseline = 16.55 Treatment = 47.26 Baseline = 37.53 Treatment = 78.72 Baseline = 21.27 Treatment = 65.09 Baseline = 14.99 Treatment = 45.12 Mean Intervals Restori and colleagues (2007) Disruptive Behavior Baseline = 48.92 Treatment = 6.63 Baseline =50.81 Treatment =5.49 Academic Engagement Baseline = 29.88 Treatment = 84.99 Baseline = 27.52 Treatment = 86.78 Mean Intervals Theodore and colleagues (2001) Disruptive Behavior Baseline = 38.67 Treatment = 3.33 Baseline = 41.64 Treatment = 2.82 Mean Intervals Williamson and colleagues (2009) On-task Behavior n/a = not available from the original author(s) Baseline = 52.9 Treatment = 75.18 Baseline = 53.35 Treatment = 75.3 202 Table 23 List of Population Type, Independent Variables, Dependent Variables, and Reported Effect Sizes Population Identifier Independent Variable Dependent Variable Effect Sizes Amato-Zech, N. A. Hoff, Both K.E.; Doepke, Karla J. (2006). Increasing on-task behavior in the classroom: extension of self-monitoring strategies. Psychology in the Schools, 43(2), 211-221. Emotional Disturbances (E/BD) Self-Monitoring On-task Behavior PND Cole, C.L. & Levinson, Low T.R. (2002). Effects of within-activity choices on the challenging behavior of children with severe developmental disabilities. Journal of Positive Behavior Interventions, 4(1), 29-37. Cognitive Delay (CD) Choice/No Choice Percent of Task Analysis Steps Percents Lambert, M., Cartledge, G., N/A Heward, W. L., & Lo, Y. (2006). Effects of response cards on disruptive behavior and academic responding during math lessons by fourth-grade urban students. Journal of Positive Behavior Interventions, 8(2), 88-99. At-risk Response Cards Disruptive Behavior Means Mavropoulou, S., Papadopoulou, E., & Kakana, D. (2011). 
Effects of task organization on the independent play of students with autism spectrum disorders. Journal of Autism and Developmental Disorders, 41(7), 913-925. Autism (ASD) Visual prompting On-task behavior PND Citation Population Incidence High Specific Learning Disability (SLD) Teacher Prompt Performance 203 Table 23 (Continued) Population Identifier Independent Variable Dependent Variable Effect Sizes Murphy, K. A., Theodore, High L.A., Aloiso, D. Alric-Edwards, J.M., & Hughes, T.L. (2007). Interdependent group contingency and mystery motivators to reduce preschool disruptive behavior. Psychology in the Schools. 44(1), 53-63. At-risk Group Reward System Disruptive Behavior SMD Ramsey, M.L., Jolivette, Both K., Puckett-Patterson, D., & Kennedy, C. (2010). Using choice to increase time on-task, task-completion, and accuracy for students with emotional/behavior disorders in a residential facility. Education and Treatment of Children, 33(1), 1-21. E/BD Choice / No Choice On-task behavior Restori, A.F., Gresham, N/A F.M., Chang,T., Howard L.B. & Laija-Rodriquez, W., (2007). Functional assessment -based interventions for Children at-risk for emotional and behavioral disorders California School Psychologist, 12, 9-30. At-risk Theodore, L.A., Bray, High M.A., Kehle, T.J., & Jenson, W.R. (2001). Randomization of group contingencies and reinforcers to reduce classroom disruptive behavior. Journal of School Psychology, 39(3), 267-77. SE/BD Citation Population Incidence PND Task-completion Accuracy Self-Monitoring Academic Achievement Percents Disruptive Behavior Group Reward System Disruptive Behavior SMD 204 Table 23 (Continued) Citation Population Incidence Williamson, B. D., Both Campbell-Whatley, G., & Lo, Y. (2009). Using a random dependent group contingency to increase on-task behaviors of high school students with high incidence disabilities. Psychology in the Schools, 46(10), 1074-1083. Population Identifier Independent Variable Dependent Variable Effect Sizes Multiple OHI E/BD SLD Group Reward System On-task Behavior Percents PND = Percentage of non-overlapping data; SMD = Standardized mean difference, no assumptions model; SE/BD = Severe Emotional and Behavior Disorder; OHI = Other Health Impaired, n/a = not available from the original author(s) 205 Table 24 List of Level-2 Variables Used in the Exploratory Analyses Study Level-2 Variables / Definition (exploratory analyses) Significant Level-2 Variables (1) Amato-Zech, Hoff & Doepke (2006). Increasing On-Task Behavior in the Classroom: Extension of Self-Monitoring Strategies. Age Disability Gender N/A N/A N/A (2) Cole & Levinson (2002). Choices on the Challenging Effects of Within-Activity Behavior of Children with Severe Developmental Disabilities. Age Disability Gender N/A N/A N/A (3) Lambert, Cartledge, Heward, & Lo (2006). Effects of Response Cards on Disruptive Behavior and Academic Responding during Math Lessons by Fourth-Grade Urban Students. Age ClassB / classroom a child attended Race Gender Pre-grade / math grade prior to study N/A - 0.90** N/A N/A N/A (4) Mavropoulou, Papadopoulou, & Kakana (2011). Effects of Task Organization on the Independent Play of Students with Autism Spectrum Disorders. Age Disability Gender IQ / Using Weschler Intelligence Scale for Children Speech Therapy / therapy or no therapy N/A N/A N/A N/A N/A N/A 206 Table 24 (Continued) Study Level-2 Variables / Definition (exploratory analyses) Significant Level-2 Variables Age N/A N/A 2.06** N/A N/A (6) Ramsey, Jolivette, Puckett-Patterson, & Kennedy (2010). 
Using Choice to Increase Time On-Task Completion, and Accuracy for Students with Emotional/ Behavior Disorders in a Residential Facility. Age Disability Race GAF Score / Global Assessment of functioning Gender Grade Level N/A N/A N/A N/A (7) Restori, Gresham, Chang, Lee, & LaijaRodriquez (2007). Functional AssessmentBased Interventions for Children At-Risk for Emotional and Behavioral Disorders. Academic Achievement Disruptive Behavior Gender Intervention Type / antecedent-based or consequent-based (8) Theodore, Bray, Kehle, & Jenson (2001). Randomization of Group Behavior Contingencies and Reinforcers to Reduce Classroom Disruptive. DBS / delinquent behavior scale Disability Race Gender N/A N/A N/A N/A (9) Williamson, Campbell- Whatley, & Lo (2009). Using a Random Dependent Group Contingency to Increase Race Student 3 / dummy coded N/A N/A (5) Murphy, Theodore, Alric-Edwards, & Hughes (2007). Interdependent Group Contingency and Mystery Motivators to Reduce Preschool Disruptive Behavior. Gender OnTrack / to first grade Student 2 / dummy coded Student 8 / dummy coded N/A N/A N/A disruptive behavior (dv) = 1.40* academic achievement (dv) = -1.90* 207 Table 24 (Continued) Study Level-2 Variables / Definition (exploratory analyses) Significant Level-2 Variables On-task Behaviors of High School Students with High Incidence Disabilities. * p < .05, ** p < .01; N/A = study contained no significant Level-2 variables, dv = dependent variable, n/a = not available from the original author(s) 208 Appendix B: Graphs, Extracted Data from SPSS and Excel with Codes “All graphs were reproduced by the dissertation author using extracted, digitized data.” Amato-Zech, Hoff, & Doepke (2006). Percent of intervals of on-task behavior Jack 120 100 80 60 40 20 0 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 Session Percent of intervals of on-task behavior David 120 100 80 60 40 20 0 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 Session 209 Percent of intervals of on-task behavior Allison 120 100 80 60 40 20 0 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 Session ID Age Gender Disability Patient 1 = Jack Patient 2 = David Patient 3 = Allison 11 11 11 0=m 0=m 1=f 1 = SLD = Specific Learning Disability 1= SLD 0 = ED = Emotionally Disturbed Patient = p Session = s p s b t 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 9.00 10.00 15.00 17.00 18.00 20.00 22.00 24.00 26.00 50.21 47.41 47.01 49.84 49.85 61.10 52.27 57.10 56.71 57.52 67.61 70.44 67.64 65.24 80.12 84.96 84.98 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 Behavior = b Treatment = t 210 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 27.00 28.00 29.00 30.00 31.00 32.00 33.00 34.00 35.00 36.00 37.00 38.00 39.00 41.00 42.00 43.00 44.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 9.00 11.00 12.00 15.00 17.00 18.00 20.00 22.00 23.00 24.00 25.00 27.00 29.00 30.00 31.00 32.00 33.00 34.00 35.00 94.62 93.03 80.18 77.78 78.20 71.78 75.00 66.57 64.98 88.28 85.08 91.11 90.72 89.93 91.55 94.37 99.60 54.88 61.74 42.81 49.67 63.40 57.85 51.98 47.09 57.55 59.85 52.34 76.86 67.41 69.70 76.90 79.53 77.58 81.18 84.45 92.63 78.93 79.91 81.55 74.38 71.44 72.11 69.50 1 1 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 211 2.00 2.00 2.00 2.00 2.00 2.00 2.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 
3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 36.00 37.00 38.00 39.00 42.00 43.00 44.00 2.00 3.00 5.00 6.00 7.00 8.00 9.00 10.00 11.00 12.00 13.00 14.00 15.00 16.00 18.00 19.00 20.00 21.00 23.00 24.00 25.00 26.00 27.00 28.00 29.00 30.00 31.00 32.00 33.00 34.00 35.00 36.00 37.00 38.00 39.00 40.00 42.00 86.49 94.34 92.71 94.68 92.41 92.42 97.00 51.42 63.72 38.23 55.37 68.12 45.26 55.37 58.01 67.68 56.69 55.81 60.21 55.37 55.37 78.66 73.39 75.15 86.13 97.56 95.80 98.00 94.04 99.32 99.32 91.41 91.41 87.45 83.94 83.06 78.66 75.59 98.00 98.00 97.56 98.44 88.33 97.56 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 1 1 1 1 1 1 212 3.00 3.00 43.00 95.36 1 44.00 98.00 1 Cole & Levinson (2002). 90 80 70 60 50 40 30 20 10 0 1 2 3 4 5 6 7 8 9 1011121314151617181920212223242526272829 Session Percentage of task steps with challenging behvaior Percentage of task steps with challenging behavior Keith 100 Wally 90 80 70 60 50 40 30 20 10 0 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 Session 213 ID Patient 1=Keith Patient 2=Wally Gender 0=m 0 Disability CD=0 other=1 Keith Patient = p Session = s p s b t 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 2.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 9.00 10.00 11.00 12.00 13.00 14.00 15.00 16.00 17.00 18.00 19.00 20.00 21.00 22.00 23.00 24.00 25.00 26.00 27.00 28.00 29.00 1.00 16.66 7.95 14.90 61.41 23.34 38.51 31.04 13.37 .00 .00 20.06 15.57 .00 28.49 15.55 86.43 15.03 38.40 92.38 22.96 7.04 12.75 14.23 12.73 12.97 30.37 20.17 9.46 8.96 19.93 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 Behavior = b Treatment = t 214 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 9.00 10.00 11.00 12.00 13.00 14.00 15.00 16.00 17.00 18.00 19.00 20.00 21.00 22.00 23.00 24.00 25.00 26.00 27.00 28.00 29.00 30.00 31.00 32.00 37.19 80.01 42.92 36.82 14.23 35.42 35.21 52.46 7.99 14.92 .00 30.00 8.15 20.73 8.23 13.68 14.94 .00 15.02 15.06 20.51 21.04 30.43 33.17 8.13 7.67 7.71 21.03 13.69 .00 .00 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 215 Lambert, Cartledge, Heward, & Lo (2006). 
Number of intervals of disruptive behavior a1 12 10 8 6 4 2 0 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 Session Number of intervals of disruptive behavior a2 12 10 8 6 4 2 0 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 Session 216 Number of intervals of disruptive behavior a3 12 10 8 6 4 2 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 Session Number of intervals of disruptive behavior a4 12 10 8 6 4 2 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 Session 217 Number of intervals of disruptive behavior b1 12 10 8 6 4 2 0 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 Session Number of intervals of disruptive behavior b2 9 8 7 6 5 4 3 2 1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 Session 218 Number of intervals of disruptive behavior b3 12 10 8 6 4 2 0 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 Session Number of intervals of disruptive behavior b4 10 8 6 4 2 0 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 Session 219 Number of intervals of disruptive behavior b5 12 10 8 6 4 2 0 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 Session ID Gender age (year.month) Race Class Grade Math grade based on a 4.0 scale a1 a2 a3 a4 b5 b6 b7 b8 b9 1=f 1 0=m 0 0 1 1 1 0 9.7 9.4 9.10 9.4 10.2 10.1 10.8 9.5 10.1 0=AA 1=white 0 0 0 0 0 0 0 0 = class A 0 0 0 1= class B 1 1 1 1 1.5 3 1 .5 1 2 .5 1.49 3 Patient = p Session = s p s b t 1 1 1 1 1 1 1 1 1 .79 1.97 2.85 3.74 4.62 5.70 6.69 7.57 8.56 7.35 9.26 8.24 6.47 7.50 4.56 5.44 10.00 3.09 0 0 0 0 0 0 0 0 1 Behavior = b Treatment = t 220 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 9.64 11.61 12.49 13.48 14.36 15.34 16.33 17.31 18.20 19.28 20.26 21.15 22.23 23.21 24.10 25.08 26.07 27.05 27.93 29.11 30.10 .81 1.83 3.86 4.88 6.00 6.91 7.83 8.84 9.96 10.88 11.99 13.01 13.92 14.94 15.86 16.97 17.89 18.90 20.02 20.94 21.85 22.97 23.78 1.18 2.21 1.18 1.32 3.82 8.24 8.24 6.62 10.15 10.15 10.00 8.24 3.82 4.71 2.06 3.82 2.94 4.85 1.18 2.06 1.03 8.18 7.12 7.27 8.18 6.36 7.27 8.94 3.64 1.82 .76 4.39 .91 .91 8.03 9.09 10.00 7.27 8.94 10.00 8.18 10.00 1.82 1.82 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 221 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 24.80 25.92 27.04 27.95 28.87 30.09 31.00 1.01 3.04 4.86 5.88 6.89 7.90 9.83 10.94 11.95 12.87 13.88 14.89 15.80 16.82 18.94 19.96 20.87 21.88 22.79 23.81 24.82 25.83 26.95 27.96 28.87 29.89 .89 2.93 3.85 4.86 5.88 6.89 7.91 8.83 9.95 10.87 11.99 .91 5.45 3.64 6.36 .91 .76 2.73 10.31 6.41 6.25 9.22 6.25 10.16 .78 1.56 1.72 .78 .78 5.31 7.34 10.16 5.31 10.00 9.22 10.31 4.38 6.41 5.31 7.19 .78 .78 .78 1.72 9.86 6.28 4.50 8.08 8.09 9.14 9.89 3.62 6.32 .95 .95 1 1 1 1 1 1 1 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 1 1 1 1 222 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 13.92 14.83 15.84 16.75 18.88 19.90 20.81 21.84 22.85 23.88 24.89 25.92 28.86 29.98 31.00 .95 1.69 2.78 3.77 4.74 5.83 6.81 7.89 8.87 9.72 10.83 11.93 13.03 13.88 14.86 15.96 16.93 17.78 18.87 19.84 20.69 21.79 22.89 24.00 24.74 25.83 26.92 28.02 28.88 1.86 3.65 8.14 9.93 9.94 10.10 10.10 5.48 6.38 1.91 5.64 1.17 1.03 1.04 1.94 10.35 6.56 9.50 4.67 5.37 9.51 6.76 10.56 9.36 9.53 4.88 3.68 4.89 4.90 1.97 .77 3.71 5.78 8.54 10.62 10.45 10.63 6.50 3.91 .82 2.72 4.79 2.04 1.01 1 0 0 0 0 0 0 0 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 1 1 1 1 1 1 223 5 5 5 5 5 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 
6 6 6 6 6 6 6 7 7 7 7 7 7 7 7 7 7 7 7 29.97 30.94 32.05 33.02 34.00 .85 1.83 2.68 5.73 6.95 7.92 8.90 9.99 10.97 11.94 13.04 14.01 16.82 17.91 18.89 19.86 21.94 22.91 24.74 25.84 26.93 27.91 28.88 29.98 31.93 33.03 34.12 .73 2.56 5.73 6.82 7.80 8.90 9.87 10.85 11.94 13.04 13.77 14.87 1.88 3.78 .85 1.89 1.21 7.26 4.52 5.16 7.10 8.39 4.52 8.23 8.06 .97 .81 .81 .81 5.32 7.10 6.29 4.52 6.29 5.48 .97 .97 .97 2.58 .97 1.13 .97 1.13 1.13 6.33 6.50 8.00 9.00 10.17 9.00 8.00 1.00 1.83 2.67 1.83 1.67 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 1 1 1 1 1 224 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 15.84 16.82 17.79 18.89 19.86 20.84 21.81 22.79 23.76 24.74 25.71 27.78 28.88 29.86 30.95 32.90 34.00 .87 1.83 2.81 3.90 4.87 5.85 6.94 7.91 8.99 9.84 10.93 13.00 13.97 14.94 15.80 16.89 17.87 18.96 19.94 21.02 21.86 22.71 24.05 24.90 25.88 26.84 27.94 .83 2.67 4.67 4.67 5.50 8.33 8.50 7.50 2.00 .83 2.67 1.67 .67 1.83 .50 1.83 .83 8.36 2.12 4.74 6.70 6.37 7.51 8.33 8.32 1.27 2.74 .93 1.25 1.24 2.71 6.65 5.49 6.64 5.65 8.59 4.65 1.04 2.68 1.85 2.83 6.60 1.18 2.65 1 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 1 1 1 1 1 225 8 8 8 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 28.90 29.88 30.97 .97 1.70 2.67 3.89 4.86 5.95 6.80 8.01 8.74 9.96 10.93 11.90 12.87 13.96 14.94 16.03 17.85 18.94 19.91 20.89 21.86 22.83 23.92 25.86 26.84 27.93 29.02 29.87 30.96 32.06 33.03 33.88 1.01 1.82 1.82 9.17 5.50 4.50 2.50 3.67 10.17 4.50 10.33 8.33 8.17 .50 2.67 1.67 3.50 1.00 .67 3.50 .83 2.50 7.50 7.50 2.83 .83 1.67 .67 2.50 2.67 4.33 .83 1.00 1.67 1.67 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 226 Mavropoulou, Papadopoulou, & Kakana (2011). 
Percentage of intervals with on-task behavior Vaggelis 120 100 80 60 40 20 0 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 Session Percentage of intervals with on-task behavior Yiannis 100 90 80 70 60 50 40 30 20 10 0 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 Session 227 Percentage of intervals with performance behavior Vaggelis 120 100 80 60 40 20 0 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 Session Percentage of intervals with performance behavior Yiannis 100 90 80 70 60 50 40 30 20 10 0 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 Session 228 Yiannis Percentage of intervals with prompting behavior 40 35 30 25 20 15 10 5 0 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 Session Gender Patient1=Vaggelis Patient2=Yiannis Patient = p 0=Male 0 Session = s IQ Age Disability 71 82 7.6 years 7 0=ASD (mild) 0=two hrs 1=other 1=4.2 hrs On-task =b1 Performance=b2 Prompt=b3 Treatment = t Speech Service 229 p s b1 b2 b3 t 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 9.00 10.00 11.00 12.00 13.00 14.00 15.00 16.00 17.00 18.00 19.00 20.00 21.00 22.00 23.00 24.00 25.00 26.00 27.00 28.00 29.00 30.00 31.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 9.00 10.00 11.00 64.42 21.80 21.79 23.79 69.42 59.68 28.79 59.66 67.03 67.02 55.94 63.98 66.32 91.14 82.07 87.43 60.24 95.46 95.45 93.09 89.06 62.20 75.28 77.28 66.53 50.42 92.35 86.30 90.65 70.17 83.92 20.30 65.44 38.46 80.57 75.72 85.10 66.91 64.78 82.95 80.82 82.94 24.34 10.71 .29 2.44 15.81 22.51 21.18 35.89 23.07 30.56 20.95 43.15 38.34 57.60 54.67 32.49 33.57 99.62 99.90 46.43 71.03 35.75 38.17 71.33 24.01 28.30 56.65 73.77 64.42 58.28 75.67 39.12 27.14 15.65 39.61 27.63 33.74 21.27 32.27 24.45 48.66 28.12 16.77 25.49 12.39 20.77 32.51 34.51 16.71 17.04 19.04 18.70 16.67 25.72 19.00 32.41 16.63 12.26 27.69 16.60 14.24 10.21 9.86 20.59 16.22 16.21 14.18 25.58 12.49 11.81 12.47 34.27 20.84 23.94 12.71 6.34 13.00 12.38 14.50 9.94 24.17 14.77 14.16 23.54 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 230 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 12.00 13.00 14.00 15.00 16.00 17.00 18.00 19.00 20.00 21.00 22.00 23.00 24.00 25.00 26.00 27.00 28.00 29.00 30.00 31.00 87.17 80.19 91.09 82.60 84.41 84.70 84.69 73.47 73.46 80.42 56.17 75.86 76.15 80.08 73.41 82.79 93.39 80.35 93.98 91.85 84.60 41.81 25.18 68.22 54.52 85.82 79.71 63.33 33.50 72.13 67.73 25.92 68.22 49.14 59.17 92.42 65.04 54.28 60.39 47.68 21.71 35.04 28.06 19.26 12.28 15.61 1.96 16.80 8.01 10.12 10.41 30.40 10.09 10.38 10.98 4.00 3.69 13.07 10.64 10.33 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 1 1 1 1 1 231 Murphy, Theodore, Alric-Edwards, & Hughes (2007). 
Percentage of disruptive intervals Student 1 14 12 10 8 6 4 2 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 Session Percentage of disruptive intervals Student 2 70 60 50 40 30 20 10 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 Session 232 Percentage of disruptive intervals Student 3 45 40 35 30 25 20 15 10 5 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 Session Percentage of disruptive intervals Student 4 14 12 10 8 6 4 2 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 Session 233 Percentage of disruptive intervals Student 5 45 40 35 30 25 20 15 10 5 0 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 Session Percentage of disruptive intervals Student 6 18 16 14 12 10 8 6 4 2 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 Session 234 Percentage of disruptive intervals Student 7 70 60 50 40 30 20 10 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 Session Percentage of disruptive intervals Student 8 120 100 80 60 40 20 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 Session ID Age Gender On-track (to kindergarten) 1 2 3 4 5 6 7 8 4 3 4 4 4 5 4 3 1=f 0=m 1 1 1 0 1 0 0=on-track 1=not on track 0 0 0 0 0 1 235 Patient = p Session = s p s b t 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 1.00 3.00 4.00 10.00 11.00 12.00 13.00 15.00 16.00 19.00 20.00 21.00 22.00 23.00 24.00 26.00 27.00 28.00 29.00 30.00 31.00 32.00 33.00 1.00 2.00 3.00 4.00 5.00 7.00 8.00 10.00 11.00 12.00 15.00 16.00 19.00 20.00 21.00 22.00 23.00 5.88 8.40 9.24 .00 2.52 2.52 .00 .00 .00 1.68 .00 1.68 11.76 3.36 1.68 .00 .00 .00 .00 2.52 .00 .00 .00 37.99 39.59 37.88 39.48 31.16 60.81 64.06 1.15 26.71 9.31 4.19 4.97 4.81 4.76 8.84 55.06 19.48 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 Behavior = b Treatment = t 236 2.00 2.00 2.00 2.00 2.00 2.00 2.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00 24.00 26.00 27.00 28.00 29.00 31.00 32.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 10.00 11.00 12.00 13.00 14.00 15.00 16.00 19.00 20.00 21.00 22.00 23.00 24.00 26.00 27.00 28.00 29.00 30.00 31.00 32.00 33.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 15.29 3.62 .00 5.16 4.29 1.70 1.65 22.69 39.50 26.05 21.85 15.13 19.33 12.61 23.53 5.88 6.72 2.52 15.97 10.08 4.20 5.88 10.08 5.88 9.24 .00 9.24 15.97 .84 4.20 3.36 .00 3.36 .00 2.52 .00 12.48 9.30 8.48 9.98 9.16 .00 7.50 10.57 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 237 4.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 10.00 11.00 12.00 13.00 15.00 16.00 17.00 19.00 20.00 21.00 22.00 23.00 24.00 26.00 27.00 28.00 29.00 30.00 31.00 32.00 33.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 10.00 11.00 12.00 13.00 14.00 15.00 16.00 17.00 19.00 20.00 21.00 22.00 23.00 24.00 26.00 .00 2.62 2.57 2.52 1.65 .00 .00 1.45 .00 2.92 .00 1.26 2.77 .00 .00 .00 .00 .92 .00 .00 .00 13.07 17.66 13.79 19.15 10.66 12.95 39.84 20.59 3.62 5.90 8.19 6.63 1.99 12.73 6.55 5.00 7.26 8.77 12.59 27.95 21.78 7.91 .00 1 
1 1 1 1 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 1 238 5.00 5.00 5.00 5.00 5.00 5.00 5.00 6.00 6.00 6.00 6.00 6.00 6.00 6.00 6.00 6.00 6.00 6.00 6.00 6.00 6.00 6.00 6.00 6.00 6.00 6.00 6.00 6.00 6.00 6.00 6.00 6.00 6.00 6.00 6.00 6.00 7.00 7.00 7.00 7.00 7.00 7.00 7.00 7.00 27.00 28.00 29.00 30.00 31.00 32.00 33.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 10.00 11.00 12.00 13.00 15.00 16.00 17.00 19.00 20.00 21.00 22.00 23.00 24.00 26.00 27.00 28.00 29.00 30.00 31.00 32.00 33.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 .00 5.50 2.40 .00 2.35 .79 3.08 10.69 16.03 4.58 9.16 6.87 7.63 12.21 15.27 .00 2.29 8.40 4.58 9.92 3.82 2.29 3.05 6.11 6.87 9.92 6.11 3.05 2.29 .00 .00 6.11 .00 2.29 3.05 4.58 17.45 59.76 6.31 10.53 3.63 8.38 20.00 20.51 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 239 7.00 7.00 7.00 7.00 7.00 7.00 7.00 7.00 7.00 7.00 7.00 7.00 7.00 7.00 7.00 7.00 7.00 7.00 7.00 7.00 7.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 10.00 11.00 12.00 13.00 15.00 16.00 17.00 19.00 20.00 21.00 22.00 23.00 24.00 26.00 27.00 28.00 29.00 30.00 31.00 32.00 33.00 1.00 2.00 3.00 4.00 5.00 10.00 11.00 12.00 13.00 15.00 16.00 17.00 19.00 20.00 21.00 22.00 23.00 24.00 26.00 27.00 28.00 29.00 30.00 3.02 .00 2.46 .00 .00 .00 .00 .00 .00 1.25 20.81 3.34 .00 .00 .00 .00 .00 .00 .00 .00 .00 51.89 61.68 96.60 87.80 48.96 19.20 20.79 12.55 29.98 16.22 6.88 16.67 20.94 28.54 48.16 62.86 60.08 19.59 4.19 .32 3.00 17.16 12.19 1 1 1 1 1 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 1 1 1 1 1 1 1 0 0 0 0 0 0 1 1 1 1 1 240 8.00 8.00 31.00 2.30 32.00 4.99 1 1 Ramsey, Jolivette, Puckett-Patterson, & Kennedy (2010). Abby Percentage of time on-task 120 100 80 60 40 20 0 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 Session Percentage of time on-task Chris 120 100 80 60 40 20 0 1 2 3 4 5 6 7 8 9 1011121314151617181920212223242526272829 Session 241 Percentage of time on-task Katie 120 100 80 60 40 20 0 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 Session Percentage of time on-task Sara 120 100 80 60 40 20 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 Session 242 Percentage of time on-task Trey 120 100 80 60 40 20 0 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 Session Percentage of task-completion Abby 120 100 80 60 40 20 0 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 Session 243 Percentage of task-completion Chris 90 80 70 60 50 40 30 20 10 0 1 2 3 4 5 6 7 8 9 1011121314151617181920212223242526272829 Session Percentage of task-completion Katie 60 50 40 30 20 10 0 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 Session 244 Percentage of task-completion Sara 120 100 80 60 40 20 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 Session Percentage of task-completion Trey 120 100 80 60 40 20 0 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 Session 245 Abby Percentage of accuracy 120 100 80 60 40 20 0 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 Session Chris Percentage of accuracy 70 60 50 40 30 20 10 0 1 2 3 4 5 6 7 8 9 1011121314151617181920212223242526272829 Session 246 Katie Percentage of accuracy 30 25 20 15 10 5 0 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 Session Percentage of accuracy Sara 90 80 70 60 50 40 30 20 10 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 Session 247 Trey Percentage of accuracy 80 70 60 50 40 30 20 10 0 1 3 5 7 9 11 
13 15 17 19 21 23 25 27 29 31 33 35 37 Session Person Age Grade Gender Race Disability GAF Score* P1 Abby P2 Chris P3 Katie P4 Sara P5 Trey 14 16 13 15 15 7 9 7 8 9 1=other 0=white 0=white 1=other 1=other 0=Y 0 0 0 0 65=1 55=0 20=0 55=0 65=1 1 0 1 1 0 *The Global Assessment of Functioning Score is used to determine the lowest impaired behavior (health/illness) for the child. Patient = p Session = s On-task =b1 Treatment = t Task-completion=b2 Accuracy=b3 p s b1 b2 b3 t 1.00 1.00 1.00 1.00 1.00 1.00 1.00 2.00 3.00 4.00 5.00 6.00 57.74 53.11 33.96 80.52 99.34 53.11 69.62 59.06 .28 89.43 99.67 60.05 49.48 49.81 .28 75.24 89.76 41.23 0 0 0 0 0 0 248 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 7.00 8.00 9.00 10.00 11.00 12.00 13.00 14.00 15.00 16.00 17.00 18.00 19.00 20.00 21.00 22.00 23.00 24.00 25.00 26.00 27.00 28.00 29.00 30.00 31.00 32.00 33.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 9.00 10.00 11.00 12.00 13.00 14.00 15.00 16.00 17.00 99.01 89.10 74.25 0 82.83 49.81 .61 0 90.09 74.58 70.28 1 99.67 99.67 90.42 1 99.67 100.00 95.05 1 100.00 100.00 75.24 1 100.00 99.67 89.76 1 100.00 100.00 95.05 1 99.67 99.67 99.67 1 100.00 99.67 80.52 1 100.00 100.00 75.24 1 100.00 99.67 95.05 1 96.37 99.67 100.00 1 100.00 100.00 95.05 1 99.67 100.00 90.42 1 90.42 99.67 99.67 1 26.70 .61 .00 0 66.32 48.82 25.38 0 63.02 49.81 25.71 0 59.72 .61 .00 0 99.67 99.01 69.62 1 70.61 74.91 70.28 1 76.56 74.58 50.14 1 83.16 74.91 77.00 1 91.08 100.00 89.43 1 76.23 74.91 85.47 1 95.71 90.09 95.71 1 60.31 .00 .00 0 40.63 12.81 .00 0 59.69 .00 .00 0 30.31 9.38 8.75 0 .00 .00 .00 0 41.56 17.50 .00 0 64.69 49.69 .00 1 90.31 65.00 19.69 1 60.94 54.69 .00 1 95.31 69.06 .00 1 90.94 74.69 29.38 1 88.13 74.38 29.69 1 90.00 77.19 50.00 1 80.00 65.31 50.31 1 89.06 69.38 50.31 1 95.94 77.50 50.31 1 50.00 42.19 29.06 0 249 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 18.00 19.00 20.00 21.00 22.00 23.00 24.00 25.00 26.00 27.00 28.00 29.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 9.00 10.00 11.00 12.00 13.00 14.00 15.00 16.00 17.00 18.00 19.00 20.00 21.00 22.00 23.00 24.00 25.00 26.00 27.00 28.00 29.00 30.00 31.00 32.00 48.44 36.88 40.94 22.50 47.81 .00 90.00 23.13 34.06 25.00 80.31 45.31 91.88 59.38 76.88 56.25 88.75 60.00 93.13 74.69 87.50 70.00 92.81 74.69 61.12 14.93 28.55 9.94 29.54 4.95 .00 .00 .00 .00 .00 .00 .00 .00 60.07 4.91 .00 .00 9.22 .00 .00 .00 55.39 14.18 52.06 9.52 .00 .00 99.23 49.38 30.77 10.17 47.37 10.16 .00 .00 60.65 15.46 94.54 40.04 31.73 9.46 .00 .00 99.50 44.67 57.29 25.72 100.00 44.65 41.32 15.41 .00 .00 65.90 14.40 33.66 .00 .00 .00 .00 .00 71.52 9.71 23.13 19.69 .00 25.31 20.94 25.31 40.31 29.69 45.00 50.00 50.00 59.69 9.95 .00 .00 .00 .00 .00 .00 .00 .00 .00 .00 5.21 5.20 .00 9.84 .00 .00 .00 .00 15.12 .00 .00 15.10 10.44 .00 .00 1.44 .00 .00 1.09 1.08 .00 0 0 0 0 0 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 250 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00 5.00 5.00 5.00 5.00 5.00 33.00 34.00 35.00 36.00 37.00 38.00 39.00 40.00 41.00 42.00 43.00 44.00 1.00 2.00 3.00 4.00 5.00 
6.00 7.00 8.00 9.00 10.00 11.00 12.00 13.00 14.00 15.00 16.00 17.00 18.00 19.00 20.00 21.00 22.00 23.00 24.00 25.00 26.00 27.00 1.00 2.00 3.00 4.00 5.00 13.69 .00 .00 .00 .00 .00 66.18 34.94 10.02 66.83 .00 .00 77.46 24.96 15.99 71.80 29.94 25.28 48.20 10.65 5.34 .00 .00 1.34 .00 .00 .00 .00 .00 .00 .00 .00 .00 5.63 .00 .00 20.12 25.27 30.42 30.09 20.11 25.58 60.03 34.59 24.92 47.46 25.23 30.39 35.85 20.07 40.68 50.01 24.57 25.53 99.60 85.10 70.93 99.91 89.92 49.99 83.47 79.93 49.33 24.84 94.41 70.25 87.95 85.70 74.43 100.00 89.88 49.62 100.00 100.00 45.42 100.00 100.00 50.56 100.00 100.00 75.02 56.33 49.57 49.89 47.94 35.38 44.40 23.13 25.38 25.71 36.00 39.22 25.37 20.53 10.22 29.87 43.71 34.04 25.35 89.43 74.94 50.14 98.44 84.91 76.21 96.49 90.37 76.20 100.00 95.19 50.42 100.00 90.35 75.21 100.00 99.68 75.20 .00 .00 .00 100.00 74.34 34.84 60.69 44.97 25.05 .00 .00 .00 66.76 44.63 25.05 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 251 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 6.00 7.00 8.00 9.00 10.00 11.00 12.00 13.00 14.00 15.00 16.00 17.00 18.00 19.00 20.00 21.00 22.00 23.00 24.00 25.00 26.00 27.00 28.00 29.00 30.00 31.00 32.00 33.00 34.00 35.00 36.00 37.00 78.27 54.42 25.05 49.81 29.10 .00 .00 .00 .00 .00 .00 .00 44.70 34.84 .00 67.05 44.29 24.37 25.04 .00 .00 .00 .00 .00 99.87 49.02 .00 97.49 64.55 25.05 84.61 50.03 24.71 100.00 75.02 49.36 90.35 69.95 45.64 85.26 74.00 49.69 90.33 79.07 65.56 94.73 84.13 69.95 99.80 98.65 70.29 .00 .00 .00 32.04 13.91 .00 .00 .00 .00 55.05 25.05 24.37 .00 .00 .00 34.38 .00 .00 76.36 59.49 24.71 100.00 74.68 50.03 85.00 85.14 50.03 90.00 89.87 25.39 94.62 99.32 69.61 85.13 94.60 70.29 79.36 75.02 74.34 93.57 98.99 69.95 98.98 100.00 70.29 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 252 Restori, Gresham, Chang, Lee, & Laija-Rodriquez (2007). 
Percent of intervals of academic achievement (antecedent intervention) A1 120 100 80 60 40 20 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 Session Percent of intervals of academic achievement (antecedent intervention) A2 120 100 80 60 40 20 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 Session 253 Percent of intervals of academic achievement (antecedent intervention) A3 120 100 80 60 40 20 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Session A4 Percent of intervals of academic achievement (antecedent intervention) 120 100 80 60 40 20 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 Session 254 Percent of intervals of academic achievement (consequent intervention) C1 120 100 80 60 40 20 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 Session Percent of intervals of academic achievement (consequent intervention) C2 120 100 80 60 40 20 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 Session 255 (consequent intervention) Percent of intervals of academic achievement C3 80 70 60 50 40 30 20 10 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 Session Percent of intervals of academic achievement (consequent intervention) C4 120 100 80 60 40 20 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 Session 256 Percent of intervals of disruptive behavior (antecedent intervention) A1 120 100 80 60 40 20 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 Session Percent of intervals of disruptive behavior (antecedent intervention) A2 120 100 80 60 40 20 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 Session 257 Percent of intervals of disruptive behavior (antecedent intervention) A3 90 80 70 60 50 40 30 20 10 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Session Percent of intervals of disruptive behavior (antecedent intervention) A4 100 80 60 40 20 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 Session 258 Percent of intervals of disruptive behavior (antecedent intervention) C1 90 80 70 60 50 40 30 20 10 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 Session Percent of intervals of disruptive behavior (antecedent intervention) C2 120 100 80 60 40 20 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 Session 259 Percent of intervals of disruptive behavior (antecedent intervention) C3 120 100 80 60 40 20 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 Session Percent of intervals of disruptive behavior (antecedent intervention) C4 70 60 50 40 30 20 10 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 Session ID Treatment A1 A2 A3 A4 C1 C2 C3 C4 0 = antecedent 0 0 0 1= consequent 1 1 1 260 Patient = p Session = s Academic Achievement = b1 Disruptive Behavior = b2 p s b1 b2 t 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 9.00 10.00 11.00 12.00 13.00 14.00 15.00 16.00 17.00 18.00 19.00 20.00 21.00 22.00 23.00 24.00 25.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 9.00 10.00 11.00 12.00 13.00 22.57 4.80 16.07 41.86 30.01 18.16 43.41 18.65 .00 71.29 87.94 95.44 78.22 60.99 37.85 57.19 30.28 27.03 92.61 99.04 94.17 93.61 98.97 94.11 97.31 44.85 21.58 28.58 8.56 12.32 10.67 83.62 77.65 97.62 84.62 .00 32.14 61.31 51.60 95.66 69.30 42.93 67.64 53.64 54.69 81.55 99.27 18.61 11.60 1.90 8.87 28.20 28.18 23.32 19.53 41.01 3.89 1.19 3.85 
5.98 .00 3.78 .00 44.85 56.72 56.69 92.34 69.62 89.05 6.32 9.54 .00 7.87 99.19 43.49 20.22 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 1 1 1 1 1 1 1 0 0 0 0 0 0 1 1 1 1 0 0 0 Treatment = t 261 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00 14.00 15.00 16.00 17.00 18.00 19.00 20.00 21.00 22.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 9.00 10.00 11.00 12.00 13.00 14.00 15.00 16.00 17.00 18.00 19.00 20.00 1.00 2.00 3.00 4.00 5.00 7.00 8.00 9.00 9.00 10.00 11.00 12.00 13.00 14.00 15.00 46.00 44.52 35.31 43.42 69.88 12.04 84.45 1.75 90.91 .00 90.88 1.11 82.75 11.40 98.94 1.11 92.97 5.41 9.52 78.57 11.90 44.64 8.33 62.50 10.12 75.00 17.86 69.05 97.62 .00 97.62 .00 98.21 1.11 86.31 5.36 8.33 53.57 27.38 26.79 29.76 46.43 29.17 16.07 19.05 31.55 100.00 .00 88.10 .00 100.00 .00 95.24 .00 91.07 .00 96.43 .00 29.82 50.88 32.16 62.57 3.51 37.43 1.75 43.27 .00 93.57 41.52 41.52 4.09 54.39 99.42 .00 99.42 .00 98.25 .00 95.91 .00 96.49 .00 40.94 43.86 59.65 34.50 38.60 45.61 0 0 1 1 1 1 1 1 1 0 0 0 0 0 1 1 1 1 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 262 4.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 6.00 6.00 6.00 6.00 6.00 6.00 6.00 6.00 6.00 6.00 6.00 6.00 16.00 17.00 18.00 19.00 20.00 21.00 22.00 23.00 24.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 9.00 10.00 11.00 12.00 13.00 14.00 15.00 16.00 17.00 18.00 19.00 20.00 21.00 22.00 23.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 9.00 10.00 11.00 12.00 45.61 42.69 0 41.52 30.99 0 93.57 1.17 1 98.25 1.17 1 95.91 1.17 1 97.08 1.17 1 94.74 4.09 1 97.66 1.75 1 95.32 .00 1 25.70 37.70 0 5.10 77.10 0 34.22 39.36 0 38.20 22.20 0 32.46 46.17 0 13.00 83.86 0 15.84 70.12 0 81.52 11.24 1 94.64 3.79 1 96.33 2.62 1 74.02 6.02 1 56.28 32.85 0 69.96 27.11 0 56.80 32.22 0 47.06 35.06 0 22.46 34.46 0 95.58 .00 1 77.27 12.70 1 84.10 9.24 1 90.36 9.22 1 91.47 6.33 1 85.16 7.45 1 70.85 19.43 1 22.52 78.59 0 29.99 49.06 0 95.83 3.92 0 20.63 26.99 0 49.49 36.78 0 6.66 76.61 0 .00 100.26 0 19.29 53.97 0 100.00 .00 1 100.00 .00 1 82.14 11.62 1 87.87 6.95 1 263 6.00 6.00 6.00 6.00 6.00 6.00 6.00 6.00 6.00 6.00 6.00 6.00 7.00 7.00 7.00 7.00 7.00 7.00 7.00 7.00 7.00 7.00 7.00 7.00 7.00 7.00 7.00 7.00 7.00 7.00 7.00 7.00 7.00 7.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 13.00 14.00 15.00 16.00 17.00 18.00 19.00 20.00 21.00 22.00 23.00 24.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 9.00 10.00 11.00 12.00 13.00 14.00 15.00 16.00 17.00 18.00 19.00 20.00 21.00 22.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 9.00 10.00 31.76 1.70 0 .00 100.49 0 .00 91.20 0 2.70 8.49 0 51.21 41.97 0 66.77 13.59 1 98.51 .00 1 85.17 7.13 1 99.57 .00 1 98.36 .00 1 100.00 .00 1 99.42 .00 1 28.14 52.02 0 18.59 56.57 0 46.73 41.41 0 26.63 74.24 0 16.08 83.84 0 15.58 44.44 0 15.58 60.61 0 56.78 22.73 1 71.86 27.27 1 71.36 11.11 1 74.87 11.62 1 37.69 33.33 0 .00 97.47 0 31.16 13.13 0 21.11 45.45 0 12.56 49.49 0 57.79 2.02 1 37.69 14.14 1 69.85 10.10 1 52.76 11.11 1 74.87 12.12 1 62.81 9.09 1 30.62 53.08 0 27.08 42.38 0 39.86 49.55 0 20.50 62.33 0 15.93 49.60 0 16.46 57.28 0 92.00 2.71 1 89.98 4.27 1 60.42 20.11 1 88.50 4.32 1 264 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 11.00 12.00 13.00 14.00 15.00 16.00 17.00 18.00 19.00 20.00 21.00 33.43 61.51 27.86 61.56 42.20 58.04 66.23 80.03 89.75 97.94 81.12 25.77 34.98 
18.68 38.09 32.50 15.18 10.11 7.58 2.50 1.51 11.74 0 0 0 0 0 1 1 1 1 1 1 Theodore, Bray, Kehle, & Jenson (2001). Percentage of disruptive intervals Student 1 80 70 60 50 40 30 20 10 0 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 Session 265 Percentage of disruptive intervals Student 2 70 60 50 40 30 20 10 0 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 19 21 23 25 27 29 Session Percentage of disruptive intervals Student 3 70 60 50 40 30 20 10 0 1 3 5 7 9 11 13 15 17 Session Patient = p Session = s p s b 1.00 1.00 1.00 1.00 2.00 3.00 53.72 0 68.68 0 64.88 0 t Disruptive behavior = b Treatment = t 266 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 4.00 5.00 6.00 7.00 8.00 9.00 10.00 11.00 12.00 13.00 14.00 15.00 16.00 17.00 18.00 19.00 20.00 21.00 22.00 23.00 24.00 25.00 26.00 27.00 28.00 29.00 30.00 31.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 9.00 10.00 11.00 12.00 13.00 14.00 15.00 16.00 63.60 48.56 49.77 42.23 59.69 55.90 45.86 70.82 54.53 59.49 .00 1.91 .62 .59 .00 36.72 45.42 39.14 35.35 45.31 49.02 3.98 1.45 11.41 8.86 .00 6.28 .00 37.93 37.93 44.83 48.28 35.63 39.08 42.53 42.53 34.48 60.92 40.23 49.43 5.75 .00 5.75 5.75 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 267 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 17.00 18.00 19.00 20.00 21.00 22.00 23.00 24.00 25.00 26.00 27.00 28.00 29.00 30.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 9.00 10.00 11.00 12.00 13.00 14.00 15.00 16.00 17.00 18.00 19.00 20.00 21.00 22.00 23.00 24.00 25.00 26.00 27.00 28.00 29.00 30.00 8.05 2.30 44.83 25.29 21.84 31.03 28.74 33.33 4.60 .00 4.60 5.75 8.05 2.30 31.65 43.04 44.30 31.65 39.24 26.58 37.97 35.44 29.11 64.56 32.91 31.65 .00 5.06 3.80 .00 2.53 3.80 22.78 18.99 27.85 22.78 20.25 1.27 .00 .00 .00 .00 .00 .00 1 1 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 1 1 1 1 1 1 1 268 ID Gender Race Disability Student 1 Student 2 Student 3 0 = Male 0 0 0 = White 0 0 0 = SED (Severe Emotional Disturbance) 0 0 Williamson, Campbell-Whatley, & Lo-Ya (2009). 
Percent of intervals of on-task behavior Student 1 120 100 80 60 40 20 0 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 Session Percent of intervals of on-task behavior Student 2 120 100 80 60 40 20 0 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 Session 269 Percent of intervals of on-task behavior Student 3 120 100 80 60 40 20 0 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 Session Percent of intervals of on-task behavior Student 4 120 100 80 60 40 20 0 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 Session 270 Percent of intervals of on-task behavior Student 5 120 100 80 60 40 20 0 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 Session Percent of intervals of on-task behavior Student 6 120 100 80 60 40 20 0 1 s 1.0 1.0 1.0 2.0 1.0 3.0 1.0 4.0 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 Session Patient = p p 3 Session = s b t 41.14 21.98 21.93 43.95 0.0 0.0 0.0 0.0 Disruptive behavior = b Treatment = t 271 1.0 5.0 40.96 0.0 1.0 6.0 20.34 0.0 1.0 8.0 40.85 0.0 1.0 9.0 58.46 0.0 1.0 10.0 80.47 1.0 1.0 11.0 64.25 1.0 1.0 12.0 100.97 1.0 1.0 13.0 62.7 1.0 1.0 14.0 102.36 1.0 1.0 15.0 80.27 1.0 1.0 16.0 81.69 1.0 1.0 17.0 81.65 1.0 1.0 18.0 61.03 1.0 1.0 19.0 62.45 0.0 1.0 20.0 59.47 0.0 1.0 21.0 60.91 0.0 1.0 22.0 59.39 0.0 1.0 23.0 78.46 0.0 1.0 24.0 59.31 0.0 1.0 25.0 60.73 0.0 1.0 26.0 59.22 0.0 1.0 27.0 60.66 0.0 1.0 28.0 79.74 1.0 1.0 29.0 79.7 1.0 1.0 32.0 79.58 1.0 1.0 33.0 60.42 1.0 1.0 34.0 60.38 1.0 1.0 35.0 77.98 1.0 1.0 36.0 42.64 1.0 2.0 1.0 41.1 0.0 2.0 2.0 42.47 0.0 2.0 3.0 18.87 0.0 2.0 4.0 37.9 0.0 2.0 5.0 58.41 0.0 2.0 6.0 20.09 0.0 2.0 8.0 39.04 0.0 272 2.0 9.0 38.97 2.0 10.0 59.48 2.0 11.0 77.03 2.0 12.0 59.3 2.0 13.0 81.28 2.0 14.0 81.2 2.0 15.0 98.77 2.0 16.0 78.09 2.0 17.0 79.49 2.0 18.0 80.87 2.0 19.0 58.73 2.0 20.0 58.65 2.0 21.0 58.56 2.0 23.0 58.4 2.0 24.0 37.72 2.0 25.0 56.79 2.0 26.0 56.69 2.0 27.0 59.55 2.0 28.0 75.65 2.0 29.0 97.62 2.0 30.0 78.36 2.0 32.0 76.8 2.0 33.0 60.55 2.0 34.0 78.08 2.0 35.0 78.0 2.0 36.0 58.8 3.0 1.0 59.39 3.0 2.0 57.9 3.0 4.0 59.27 3.0 5.0 59.23 3.0 6.0 59.19 3.0 7.0 59.15 3.0 8.0 59.11 3.0 9.0 59.07 3.0 10.0 82.22 3.0 11.0 80.73 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 273 3.0 12.0 99.52 3.0 13.0 98.03 3.0 14.0 97.99 3.0 15.0 99.4 3.0 16.0 99.36 3.0 17.0 97.86 3.0 18.0 99.28 3.0 19.0 78.95 3.0 20.0 78.91 3.0 21.0 58.58 3.0 22.0 77.37 3.0 23.0 96.17 3.0 24.0 78.74 3.0 25.0 78.71 3.0 26.0 59.82 3.0 27.0 78.63 3.0 28.0 77.14 3.0 29.0 97.38 3.0 30.0 97.34 3.0 31.0 97.3 3.0 32.0 58.13 3.0 33.0 59.53 3.0 34.0 59.49 3.0 35.0 37.72 3.0 36.0 37.69 4.0 1.0 36.81 4.0 2.0 36.85 4.0 3.0 16.29 4.0 4.0 16.34 4.0 5.0 16.37 4.0 7.0 19.4 4.0 8.0 34.14 4.0 9.0 56.24 4.0 10.0 78.34 4.0 11.0 79.85 4.0 12.0 97.55 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 274 4.0 13.0 79.94 4.0 14.0 78.51 4.0 15.0 80.02 4.0 16.0 77.12 4.0 17.0 78.63 4.0 18.0 97.79 4.0 19.0 37.54 4.0 20.0 37.59 4.0 21.0 40.56 4.0 22.0 58.25 4.0 23.0 39.18 4.0 24.0 56.86 4.0 25.0 40.73 4.0 26.0 56.94 4.0 27.0 56.98 4.0 28.0 57.03 4.0 29.0 57.07 4.0 30.0 76.22 4.0 31.0 58.62 4.0 32.0 60.13 4.0 33.0 60.17 4.0 34.0 60.22 4.0 35.0 60.25 4.0 36.0 39.7 5.0 1.0 57.18 5.0 2.0 58.65 5.0 3.0 60.12 5.0 4.0 38.73 5.0 5.0 38.77 5.0 6.0 55.95 5.0 7.0 38.85 5.0 8.0 18.89 5.0 9.0 17.5 5.0 10.0 58.96 5.0 11.0 59.01 5.0 12.0 79.05 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 
Student 5 (sessions 1–36)
b: 57.18, 58.65, 60.12, 38.73, 38.77, 55.95, 38.85, 18.89, 17.5, 58.96, 59.01, 79.05, 80.52, 79.12, 97.74, 79.21, 96.39, 80.71, 79.33, 59.36, 60.84, 59.46, 80.92, 59.52, 59.57, 59.61, 59.65, 59.68, 58.29, 59.77, 78.38, 59.85, 61.32, 44.21, 57.1, 41.44
Student 6 (sessions 1–36)
b: 58.21, 40.3, 77.61, 61.19, 38.81, 38.81, 40.3, 40.3, 59.7, 83.58, 83.58, 83.58, 82.09, 100.0, 83.58, 83.58, 101.49, 101.49, 102.99, 82.09, 61.19, 61.19, 61.19, 80.6, 59.7, 59.7, 59.7, 82.09, 82.09, 83.58, 83.58, 80.6, 43.28, 59.7, 58.21, 38.81
ID            Special Education   Race
Persons 1–6   0 = Y (all)         0 = AA (all)
Appendix C: All Models (Working and Non-Working)
Amato-Zech, Hoff, & Doepke (2006)
Full Non-Linear Model with Slopes (with overdispersion)
Note: since the sigma is around 1, there is no need to check the overdispersion option for these models.
Full Model with Overdispersion
Each non-significant variable was removed from the model, which was then re-run; only the treatment variable remained significant (i.e., it was the only variable carried forward into the simple model below).
Simple Non-Linear Model without Slopes (without overdispersion)
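To make the distinction between the "full" and "simple" specifications concrete, the lines below sketch rough Stata analogues. The original models were estimated in HLM, so this is only an illustration; it assumes the outcome has been expressed as a count suitable for a Poisson-family model, reuses the variable key from the data listings (b, t, s, p), and treats a negative binomial specification as one stand-in for HLM's overdispersion option.
* Full model: treatment effect plus a session slope, with a random intercept by patient.
xtmepoisson b t s || p:
* Rough analogue of allowing overdispersion: a mixed-effects negative binomial
* (menbreg is available in more recent Stata releases).
menbreg b t s || p:
* Simple model without slopes: treatment indicator only.
xtmepoisson b t || p: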
Lambert, Cartledge, Heward, & Lo (2006)
Full Non-Linear Model with Slopes (with overdispersion)
Simple Non-Linear Model without Slopes (with overdispersion)
Simple Non-Linear Model with CLASSB on Intercept
Simplified Non-Linear Model without Slopes with CLASSB on INTCPT (no overdispersion)
Gender on TRT
Murphy, Theodore, Alric-Edwards, & Hughes (2007)
Full Non-Linear Model with Slopes (with overdispersion)
Simple Non-Linear Model without Slopes (with overdispersion)
Simple Non-Linear Model with AGE on Intercept (with overdispersion)
Simple Non-Linear Model with AGE on Intercept (without overdispersion)
Simple Non-Linear Model with AGE on TRT (with overdispersion)
Simple Non-Linear Model with AGE on TRT (without overdispersion)
Simple Non-Linear Model with ONTRACK on Intercept (with overdispersion)
Theodore, Bray, Kehle, & Jenson (2001)
Full Non-Linear Model with Slopes (with overdispersion)
Full Non-Linear Model with Slopes (without overdispersion)
Simple Non-Linear Model without Slopes (with overdispersion)
Simple Non-Linear Model without Intercept (with overdispersion)
Simple Non-Linear Model without Intercept (without overdispersion)
Simple Non-Linear Model with DISABILITY on Intercept
Ramsey, Jolivette, Puckett-Patterson, & Kennedy (2010)
(ON-TASK)
Full Non-Linear Model with Slopes (with overdispersion)
Simple Non-Linear Model without Slopes (with overdispersion)
Simple Non-Linear Model without Slopes (without overdispersion)
Simple Non-Linear Model without Slopes (without overdispersion)
(TASK COMPLETION)
Full Non-Linear Model with Slopes (with overdispersion)
Simple Non-Linear Model without Slopes (with overdispersion)
Simple Non-Linear Model without Slopes (with overdispersion)
Simple Non-Linear Model without Slopes (with overdispersion); DV: task completion
(ACCURACY)
Full Non-Linear Model with Slopes (with overdispersion)
Simple Non-Linear Model without Slopes (with overdispersion)
Simple Non-Linear Model without Slopes (without overdispersion)
Restori, Gresham, Chang, Lee, & Laija-Rodriguez (2007)
Disruptive Behavior = Freq10
(Freq10) Full Non-Linear Model with Slopes (with overdispersion)
(Freq10) Full Non-Linear Model with Slopes (without overdispersion)
(Freq10) Simple Non-Linear Model without Slopes (without overdispersion)
(Freq10) Simple Non-Linear Model with INTERVENTION on Intercept and S1ORD (without overdispersion)
(Freq10) Simple Non-Linear Model with INTERVENTION on Intercept and S1ORD (with overdispersion)
(Freq10) Simple Non-Linear Model with INTERVENTION on TRT (with overdispersion)
(Freq10) Simple Non-Linear Model with INTERVENTION on TRT (with overdispersion)
Academic Engagement = ACA10
(FreqACA10) Full Non-Linear Model with Slopes (with overdispersion)
(FreqACA10) Full Non-Linear Model with Slopes (without overdispersion)
(FreqACA10) Simple Non-Linear Model without Slopes (with overdispersion)
(FreqACA10) Simple Non-Linear Model without Slopes (without overdispersion)
(FreqACA10) Simple Non-Linear Model without Slopes (with overdispersion)
(FreqACA10) Simple Non-Linear Model with INTERVENTION on TRT (with overdispersion)
(FreqACA10) Simple Non-Linear Model with INTERVENTION on TRT (without overdispersion)
Williamson, Campbell-Whatley, & Lo (2009)
Full Non-Linear Model with Slopes (with overdispersion)
Full Non-Linear Model with Slopes (without overdispersion)
Simple Non-Linear Model without Slopes (with overdispersion)
Student 3 was tested on S1TRTORD and the effect was not significant either with or without overdispersion.
Simple Non-Linear Model without Slopes (without overdispersion)
Simple Non-Linear Model without Slopes (without overdispersion); no L-2 predictors
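Several of the models above place a patient-level characteristic (e.g., AGE, GENDER, CLASSB, DISABILITY, INTERVENTION) on the intercept or on the treatment effect (TRT). In mixed-model terms, a level-2 predictor "on TRT" is a cross-level interaction. The Stata-style sketch below is illustrative only, since the original models were run in HLM; age is a hypothetical patient-level covariate, and b, t, and p follow the earlier data key.
* Level-2 predictor on the intercept: enter the patient-level covariate as a main effect.
xtmepoisson b t age || p:
* Level-2 predictor on TRT: let the treatment effect vary with the covariate
* through a cross-level interaction term.
gen t_age = t*age
xtmepoisson b t age t_age || p: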
Cole & Levinson (2002).
Mavropoulou, Papadopoulou, & Kakana (2011)
On-task needs overdispersion; xtmepoisson will account for the overdispersion needs.
nbreg ontask60 TRT SESS1
Performance needs overdispersion.
Prompt does not need overdispersion.
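For readers unfamiliar with the two Stata commands referenced above, the lines below sketch how they might be called for these outcomes. This is illustrative only: ontask60, TRT, and SESS1 come from the note above, while id and prompt60 are hypothetical stand-ins for the study's actual participant identifier and prompt outcome variable.
* Single-level negative binomial, which models overdispersion directly.
nbreg ontask60 TRT SESS1
* Mixed-effects Poisson with a random intercept per participant
* (id is a hypothetical cluster identifier).
xtmepoisson ontask60 TRT SESS1 || id:
* An outcome judged not to need overdispersion (e.g., prompts) could use a
* plain Poisson specification (prompt60 is a hypothetical variable name).
poisson prompt60 TRT SESS1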