Examining Treatment Effects for Single-Case ABAB
Designs through Sensitivity Analyses
A dissertation presented to
the faculty of
The Patton College of Education
of Ohio University
In partial fulfillment
of the requirements for the degree
Doctor of Philosophy
Christine A. Crumbacher
May 2013
© 2013 Christine A. Crumbacher. All Rights Reserved.
This dissertation titled
Examining Treatment Effects for Single-Case ABAB Designs
Through Sensitivity Analyses
by
CHRISTINE A. CRUMBACHER
has been approved for
the Department of Educational Studies
and The Patton College of Education by
John H. Hitchcock
Associate Professor of Educational Studies
Renée A. Middleton
Dean, The Patton College of Education
Abstract
CRUMBACHER, CHRISTINE A., Ph.D., May 2013, Educational Research and
Evaluation
Examining Treatment Effects for Single-Case ABAB Designs through Sensitivity
Analyses
Director of Dissertation: John H. Hitchcock
Single-case designs (SCDs) are often used to examine the impact of an
intervention over brief periods of time (Kratochwill & Stoiber, 2002; Segool, Brinkman,
& Carlson, 2007). The majority of SCDs are inspected using visual analysis (Kromrey &
Foster-Johnson, 1996; Morgan & Morgan, 2009). Although the single-case literature
suggests that visual analyses have merit (Brossart, Parker, Olson, & Mahadevan, 2006;
Kratochwill & Brody, 1978) there are concerns regarding the reliability of the procedure
(Shadish et al., 2009). Recent advances in hierarchical linear models (HLM) allow for
statistical analyses of treatment effects (Nagler, Rindskopf, & Shadish, 2008), thus
making it possible to compare and contrast results from HLM and visual analyses to
ascertain if the different methods yield consistent conclusions. This work performed a
series of sensitivity analyses while also exploring ways in which HLM can be used to
examine new and different questions when dealing with published single-case data.
The work applied analyses to ABAB designs only. In addition to reporting the
results of visual analysis performed by the original authors, it also utilized recently
published guidelines by the What Works Clearinghouse (WWC) that standardize visual
analysis processes (Kratochwill, Hitchcock, Horner, Levin, Odom, Rindskopf, & Shadish,
2010). The comparisons presented here are based on nine single-case studies that meet
WWC design standards. All studies examined intervention impacts on behavioral
outcomes. UnGraph digitizing software was used to quantify results from ABAB graphs
and HLM and STATA software were used to perform statistical analyses. In addition to
applying a statistical procedure to check conclusions about treatment effects based on
visual analyses, HLM was used to examine between-subject variation of performance on
outcome measures. In order to statistically describe treatment impacts, effect size
estimates were calculated using four methods: (a) the percentage of nonoverlapping data
(Morgan & Morgan, 2009), (b) the Standardized Mean Difference (Busk & Serlin, 1992),
(c) the improvement rate difference, and (d) R2, in order to assess the proportion of variance
in the dependent variable that can be explained by treatment exposure.
Acknowledgments
Deepest thanks to the faculty who advised me on this dissertation. Thank you to
all committee members, Dr. John Hitchcock, Dr. Gordon Brooks, Dr. Bruce Carlson, and
Dr. Jerry Johnson. A special thanks to Dr. John Hitchcock, my academic advisor and
dissertation chair for his continued support and guidance.
I would also like to thank The Patton College of Education for the opportunity to
study at Ohio University.
Table of Contents
Page
Abstract ............................................................................................................................... 3
Acknowledgments............................................................................................................... 5
List of Tables ...................................................................................................................... 8
List of Figures ................................................................................................................... 10
Chapter One: Introduction ................................................................................................ 11
Background of the Study .............................................................................................. 11
SCDs and the What Works Clearinghouse ................................................................... 13
Descriptive Statistical Analysis .................................................................................... 19
Hierarchical General Linear Modeling in SCDs ........................................................... 22
Effect Sizes ................................................................................................................... 25
Statement of the Problem .............................................................................................. 28
Research Questions ....................................................................................................... 30
Primary Question: ......................................................................................................... 30
Secondary Questions:.................................................................................................... 30
Significance of Study .................................................................................................... 30
Delimitations and Limitations of the Study .................................................................. 33
Definition of Terms ...................................................................................................... 36
Organization of the Study ............................................................................................. 39
Chapter Two: Review of Literature .................................................................................. 41
Review of Philosophical Issues .................................................................................... 41
Permutation Tests ......................................................................................................... 43
Other Types of SCDs .................................................................................................... 46
Minimizing Threats to Internal Validity Using the ABAB Design .............................. 47
Reliability and Validity of the UnGraph Procedure ..................................................... 49
HLM Applications to SCD ........................................................................................... 50
Three HLM Models for ABAB Designs....................................................................... 57
Variances in two-level models. ................................................................................. 60
Estimation procedures. .............................................................................................. 61
Hierarchical Generalized Linear Models (HGLMs) ..................................................... 63
Some Effect Size Options in SCDs............................................................................... 65
Effect Sizes for Meta-Analyses and Comparisons of SCD and Group Studies. ....... 70
Type I and II error rates in SCDs .................................................................................. 72
Chapter Summary ......................................................................................................... 74
Chapter Three: Methodology ............................................................................................ 76
Article Selection and Descriptions.................................................................................... 77
Digitizing and Modeling the Data ................................................................................ 83
Comparing and Contrasting Visual Analysis, HLM/STATA, and Author Reports ..... 91
Level-2 Predictors ......................................................................................................... 92
Alternative Approaches for Exploring Variation .......................................................... 94
Effect Sizes ................................................................................................................... 96
UnGraph Validity and Reliability ................................................................................. 99
Chapter Summary ....................................................................................................... 101
Chapter Four: Results ..................................................................................................... 103
Primary Question: ....................................................................................................... 103
Secondary Questions:.................................................................................................. 103
Section 1: The Pilot..................................................................................................... 104
Section 2: Sensitivity Analysis Examining Treatment Effectiveness ......................... 119
Results for Research Question 1 ................................................................................. 119
Results for Research Question 2 and 3 ....................................................................... 127
Results for Research Question 4 ................................................................................. 156
Chapter Five: Discussion, Conclusions, and Recommendations .................................... 159
Discussion of the Results ............................................................................................ 160
Conclusions ................................................................................................................. 174
Recommendations ....................................................................................................... 176
References ....................................................................................................................... 179
Appendix A: Tables ........................................................................................................ 194
Appendix B: Graphs, Extracted Data from SPSS and Excel with Codes ....................... 208
Appendix C: All Models (Working and Non-Working) ................................................. 278
List of Tables
Page
Table 1. Title of Articles and WWC Standards .................................................................80
Table 2. Results of Sensitivity Analyses Pertaining to
Statement of a Treatment Impact .........................................................................92
Table 3. Results of the Multi-level Models and Level-2 Contributors ..............................94
Table 4. Effect Size Methods, Criteria, and Software .......................................................97
Table 5. Simple Non-Linear Model without Slopes for the
Lambert and Colleagues’ (2006) Study .............................................................107
Table 6. Final Estimation of Variance Components for the Lambert and Colleagues’
(2006) Study.......................................................................................................110
Table 7. Possible Level-2 Predictors from the Exploratory Analysis
for the Lambert and Colleagues (2006) Study ...................................................111
Table 8. Simple Non-Linear Model without Slopes with
CLASSB on Intercept for the Lambert and Colleagues’ (2006) Study .............113
Table 9. Final Model: Lambert and Colleagues (2006) ...................................................114
Table 10. Final Estimation of Variance Components for the Lambert,
Cartledge and Colleagues’ (2006) Study ..........................................................117
Table 11. Final Results of Sensitivity Analyses Pertaining to Statement
of a Treatment Impact ......................................................................................124
Table 12. Final Model: Murphy and Colleagues (2006)..................................................131
Table 13. Final Model: Mavropoulou and Colleagues (2006) .........................................137
Table 14. Final Model: Ramsey and Colleagues (2006)..................................................141
Table 15. Final Model: Restori and Colleagues (2007) ...................................................143
Table 16. Final Model: Williamson and Colleagues (2009) ............................................148
Table 17. Final Model: Amato-Zech and Colleagues (2006) ..........................................150
Table 18. A Comparison of Effect Sizes from the PND: Visual Analysis Versus
Extracted Data from UnGraph .........................................................................153
Table 19. A Comparison of Descriptive Statistics between
Raw Data and Extracted for Two Studies ................................................................................. 158
Table 20. List of Effect Sizes from the Original Article Compared
to Calculated Effect Sizes Using the PND, SMD, and R2 ..............................194
Table 21. List of Effect Sizes from the Original Article Compared
to Calculated Effect Sizes Using the IRD .......................................................197
Table 22. A Comparison of the Reported Data to the Extracted .....................................200
Table 23. List of Population Type, Independent Variables,
Dependent Variables, and Reported Effect Sizes ............................................202
Table 24. List of Level-2 Variables Used in the Exploratory Analyses ..........................205
List of Figures
Page
Figure 1. ABAB Design with Disruptive Behavior as the Outcome Variables .................13
Figure 2. An Example of a Multiple-Baseline Design .......................................................47
Chapter One: Introduction
Background of the Study
Single-case research started in the early 20th century and was developed primarily
to assess the impacts of treatments, often in the context of Applied Behavior Analysis
(Morgan & Morgan, 2009). Single-case designs (SCDs) use an experimental process
where treatment access is systematically manipulated by a researcher, performance is
monitored over time, and the units of interest serve as their own control (Horner, Carr,
Halle, McGee, Odom, & Wolery, 2005; Kratochwill et al., 2010; Segool, Brinkman, &
Carlson, 2007). Many of these investigations are done to test treatment impacts in
applied settings (Kratochwill & Stoiber, 2002; Morgan & Morgan, 2009). Although
SCDs can yield causal findings, they typically utilize small sample sizes and tend to not
generalize well to an underlying population of interest or other settings (Edgington, 1987;
Jenson, Clark, Kircher, & Kristjansson, 2007; Kratochwill et al., 2010). These studies
can however be aggregated and examined within a meta-analytic framework (Scruggs &
Mastropieri, 1998), and researchers have recently become interested in re-analyzing
SCDs using new statistical techniques so as to support such work, as well as to examine
data in ways that have typically not been available using more classic methods (Iwakabe
& Gazzola, 2009; Nagler et al., 2008; Shadish, Rindskopf, & Hedges, 2011).
Part of the recent methodological interest in SCDs may be due to growing
demands for evidence-based practices by the U.S. Department of Education, which has
recently favored studies that can yield causal findings (Jenson et al., 2007). Although
causal arguments rely on logic and not necessarily statistical inference, modeling data is
typically used to address a causal question (Shadish, Cook, & Campbell, 2002). Yet,
single-case research does not synchronize well with statistical tests that generally require
large sample sizes, equal variances across study conditions, independent errors, and
approximately normal distributions (Edgington, 1987). SCDs almost always have small
sample sizes due to their focus on a single person or group (Morgan & Morgan, 2009);
furthermore, repeated observation of a single entity yields a violation of the independent
errors assumption (Krishef, 1991).
Randomization in SCDs supports causal inference because allowing chance to
dictate the start and withdrawal of a treatment should, in expectation, minimize data
trends (Edgington, 1987). Randomized SCDs can use permutation tests (a nonparametric approach) to determine if performance on the dependent variable is any
different during treatment compared to a control/baseline condition (Edgington, 1987).
Randomization is, however, often difficult to apply in SCDs (Edgington, 1987, 1995;
Kratochwill & Levin, 2010). This is because behaviorally-based interventions typically
compel researchers to treat people on the basis of need. When dealing with students who
exhibit self-injurious behavior for example, any desire to randomize with a new treatment
would likely be a secondary concern to providing them with help. In a context in which
randomization has not been used, permutation tests can still provide information on
whether or not there is a treatment effect, much in the same way that independent t-tests
can be applied in quasi-experimental group designs that did not randomize study
participants to treatment conditions. They are nevertheless limited by the fact that they
cannot model subject characteristic data (Howell, 2009), which may be of interest to
researchers if they wish to examine if treatment effects appear to be more powerful across
classrooms or types of students. A recent alternative to permutation tests is the
application of hierarchical generalized linear models (HGLM), which not only allows
classic inferential approaches to be applied to examine whether a treatment effect is
present, but can also be used to ascertain whether subject characteristics explain
variance in a dependent variable of interest while accounting for clustered data that
clearly do not fit a normal distribution (Nagler et al., 2008). Primarily for these reasons,
techniques based on HGLMs will be examined in the current study.
SCDs and the What Works Clearinghouse
Single-case research designs consist of three major types: within-series (i.e., AB,
ABA, and ABAB designs), between-series (i.e., alternating treatments designs), and
combined-series (i.e., multiple baseline design) (Segool et al., 2007). Within-series
designs compare baseline performance on a dependent variable (A) to performance in the
presence of a treatment/intervention (B) (Morgan & Morgan, 2009). A fictitious example
of an ABAB design is shown in Figure 1.
Figure 1. An example of an ABAB design with disruptive behavior as the outcome. (The figure plots counts of disruptive behavior, 0 to 12, on the vertical axis against sessions on the horizontal axis, across alternating A [baseline] and B [treatment] phases.)
The presence of the treatment was manipulated by a researcher who had an a
priori expectation that the level of problematic behavior would be higher when the
treatment was not in place. The vertical (Y) axis presents the number of times a disruptive
behavior was observed. The horizontal (X) axis represents data collection times
during baseline (A phases) and in the presence of the treatment (B phases). The dark vertical
lines represent changes in study condition (i.e., treatment exposure). Figure 1 shows that
five data collection points were collected in the first baseline phase, five were collected in
the first treatment phase, and so on. Overall, the figure is meant to depict a causal
argument that the treatment was responsible for drops in problematic behavior, even in
the absence of randomization (Horner et al., 2005). This is because the researcher
confirmed that the disruptive behavior decreased during treatment phases and increased after
removal of the treatment, and this pattern of effects has been replicated.
Designs demonstrating such patterns of effects are legion (Horner et al., 2005)
and the U.S. Department of Education's Institute of Education Sciences (IES) has taken
an interest in providing a clearinghouse that describes interventions that yield positive
academic and behavioral outcomes for children. The What Works Clearinghouse
(WWC) was developed in 2002 and it involves a network of standards, guidelines, and
criteria specific to single-case research. The WWC has thus developed criteria for
judging whether designs can reasonably make a causal argument about the impact of a
treatment by considering different design features and visual analyses (Kratochwill et al.,
2010). To clarify, the WWC offers criteria that separate out SCDs with designs that yield
reasonable internal validity from those that do not, and of the designs that meet such
criteria, visual analyses are applied to determine if there is indeed an effect (some studies
may be internally valid but the treatment was judged to not have made a difference).
As a quick aside, visual analysis is a method used for determining treatment
effectiveness by visually examining SCD graphs while considering various features such
as mean performance change from baseline to treatment. In terms of design criteria, all
articles used in the current study should meet the WWC standards and Kratochwill and
colleagues’ (2010) steps for assessing design. These standards include:
• reason to believe there was systematic manipulation of the independent variable over time;
• inter-assessor agreement was examined on at least 20% of all observations, at least once within each phase, and an agreement rate of 0.80 was achieved (at least 0.60 if measured by Cohen’s kappa);
• a minimum of three data points were collected in each study phase; and
• there was an opportunity to demonstrate at least three intervention effects at three different points in time.
A typical ABAB design can meet the last criterion because the introduction of the
treatment after baseline (going from A to B) yields an opportunity to demonstrate an
effect. Removing the treatment (B to A) allows for a second chance to demonstrate a
treatment impact, and re-introducing the treatment (A back to B) yields a third
opportunity. An implication here is that several variants of the ABAB design cannot
demonstrate three treatment impacts (e.g., AB, ABA, BAB) and thus are not able to meet
WWC standards. Put another way, shorter designs do not allow for enough replication of
a treatment effect to allow for a reasonable causal argument. Once a study is deemed to
have the capacity to generate a strong causal argument (i.e., passes standards), the WWC
will consider the evidence the study produces using visual analyses. Again, this means
the WWC will not consider visual analyses from designs that do not pass standards. At
the conclusion of visual analyses, reviewers render one of three judgments, ‘Strong
Evidence,’ ‘Moderate Evidence,’ or ‘No Evidence’ of a causal relation (Kratochwill et
al., 2010).
Interpreting graphs in single-subject research dates back to the 1970s (Brossart,
Parker, Olson, & Mahadevan, 2006; Kratochwill & Brody, 1978). Today, visual analyses
are used to analyze treatment effectiveness in the majority of studies (Brossart et al.,
2006; Busk & Marascuilo, 1992; Horner et al., 2005). Indeed, Kromrey and Foster-Johnson (1996) estimate that 80% of SCDs rely on visual analyses.
Examining trend, level, and variability can allow for an assessment of whether the
treatment appeared to work (Horner et al., 2005; Kratochwill et al., 2010; Morgan &
Morgan, 2009). According to Kratochwill and colleagues (2010) the WWC considers six
features when examining the presence of a causal relationship in the context of a SCD:
• “level;
• trend;
• variability;
• immediacy of the effect;
• overlap; and
• consistency of data patterns across similar phases” (p. 18).
Level refers to the performance mean within each phase. Consideration of this feature is
analogous to an unstandardized effect size (i.e., simple mean difference) in a classic
experimental design with a single treatment and control group. If the average
performance in a treatment phase is better than in a non-treatment phase, then there is
evidence that the intervention worked. Trend is in essence the slope of the regression line
(line of best fit found within each phase). Variability in the data provides information
about fluctuations of the dependent variable (Morgan & Morgan, 2009). The remaining
features are immediacy of the effect, overlap of data points (i.e., performance) across
phases, and consistency of data patterns across similar phases. Immediacy of the effect is
characterized by any change in level of performance exhibited by the last three data
points in one phase compared to the first three data points in the following phase. This is
considered under the logic that the more immediate an observed impact, the more likely
one can attribute it to the presence (or removal) of a treatment. Overlap is the percentage
of data in one phase that overlaps with data in the next phase, with greater separation of the data
signaling a greater likelihood of a treatment effect. Lastly, consistency of data in
similar phases involves pairing the baseline phases and treatment phases (e.g., the first A
phase compared to the second A phase, and the first B phase compared to the second B
phase) and looking for consistent patterns. The more consistent the data patterns across similar
phases, the more likely it is that a causal relationship exists.
A key advantage to utilizing WWC visual analysis procedures when reanalyzing
the results from the original studies is that the approach is standardized. By contrast,
visual analyses may well vary from one author to the next and oftentimes are not reported
in detail. Since a primary goal of this work is to compare HLM and visual analyses to
see if each approach yields comparable results, it is beneficial to use a consistent
visual analysis approach that incorporates the most recent thinking on best
practices. The WWC procedures provide such consistency.
Despite the advantages of standardized visual analyses, researchers disagree on
whether to solely use the approach when statistical procedures are available (cf. Baer,
1977; Kazdin, 1992; Morgan & Morgan, 2009; Parsonson & Baer, 1986). Concerns with
visual analyses lie in the fact that some graphs are not easily interpreted and raters can
reasonably disagree on whether a treatment effect is present. Additional limitations
discussed in the literature include: (a) human error can occur while reading graphs, (b)
lack of rater training, and (c) the lack of formal standardized criteria for treatment effects
(Kromrey & Foster-Johnson, 1996; Morgan & Morgan, 2009). Statistical procedures can
of course be valuable when raters disagree (Danov & Symons, 2008) and it is argued that
performing sensitivity analyses may mitigate disagreements (Kratochwill et al., 2010).
This sets the stage for sensitivity analyses. Should disparate but reasonable analytic
techniques yield similar conclusions, this provides evidence that different traditions
would concur on the basic information practitioners need; that is, whether a treatment
worked. Perhaps visual analyses and statistical tests can be used in conjunction to lessen
the magnitude of human error while examining treatment effects (Morgan & Morgan,
2009).
Descriptive Statistical Analysis
Current SCD analytic practice does not eschew statistical analyses. Although the
literature indicates that visual analysis is the dominant approach when examining
treatment effects, researchers often use descriptive statistical approaches to analyze
treatment effectiveness. Descriptive approaches are relatively easy to apply and provide
information about trends and treatment impacts. These include calculation of the mean
(level) and the percentage of nonoverlapping data (PND) statistic. Calculating the mean
entails computing the mean for the baseline and treatment phase(s) and calculating their
differences (Morgan & Morgan, 2009). The percentage of nonoverlapping data (PND)
statistic is calculated as the percentage of treatment-phase data points that do not overlap with
the baseline data (Morgan & Morgan, 2009). The PND is not used in an inferential statistical
test; it is a visually based procedure (the reverse is calculated as well; that is, the number of
treatment data points that fall below the lowest data point observed at baseline). The PND uses
features associated with visual analysis to assess treatment effectiveness. An effective rating
could be a graph that has little data overlap, limited variability, distinguishable levels, and
immediacy of effect (i.e., the last three data points in the first baseline and the first three
in the treatment are not close in numerical value). A questionable rating may be a graph
that is not as discernible; that is, it may have a few of the features associated with an
effective rating and a few that are rather questionable to a trained visual expert. A
non-effective judgment could be a graph that has many overlapping data points, large
variability, levels that are close in numerical value, and data that lack an immediate
change or visible gap between phases.
Criteria for the PND suggest that any percent equal to or greater than 90% (i.e.,
90% of the data in a treatment phase does not overlap with data observed at baseline)
indicates a very effective treatment; 70%-90% indicates an effective treatment (Scruggs
& Mastropieri, 1998). A range of 50%-70% indicates questionable effectiveness, while a
range of lower than 50% is interpreted as an ineffective treatment (Scruggs &
Mastropieri, 1998). Although the approach is straightforward, it does have limitations
(described in Chapter Two). The PND uses the features associated with visual analysis to
determine effective / questionable / not effective renderings.
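To make the PND computation concrete, the following is a minimal sketch in Python (used here purely for illustration; the analyses in this dissertation relied on UnGraph, SPSS, Excel, HLM, and STATA). The function names and fictitious data are illustrative assumptions, not drawn from any of the studies analyzed here; the sketch assumes a behavior that is expected to decrease during treatment.

```python
def pnd(baseline, treatment, decrease_expected=True):
    """Percentage of treatment-phase points that do not overlap the baseline."""
    if decrease_expected:
        # Behavior should drop: count treatment points below the lowest baseline point.
        boundary = min(baseline)
        nonoverlap = sum(1 for y in treatment if y < boundary)
    else:
        # Behavior should rise: count treatment points above the highest baseline point.
        boundary = max(baseline)
        nonoverlap = sum(1 for y in treatment if y > boundary)
    return 100.0 * nonoverlap / len(treatment)

def interpret_pnd(value):
    """Interpretive ranges per Scruggs and Mastropieri (1998)."""
    if value >= 90:
        return "very effective"
    if value >= 70:
        return "effective"
    if value >= 50:
        return "questionable"
    return "ineffective"

# Fictitious disruptive-behavior counts, in the spirit of Figure 1.
a1 = [10, 9, 11, 10, 12]   # first baseline phase
b1 = [6, 4, 3, 2, 2]       # first treatment phase
score = pnd(a1, b1)
print(score, interpret_pnd(score))   # 100.0 very effective
```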
Statistical methods are available to examine treatment effectiveness, including
adding regression lines (trend), creating statistical process control (SPC) charts, and
generating dual-criteria analyses (Morgan & Morgan, 2009). The regression line
procedure identifies a line of best fit between the data points in each phase to display
trends in the data; the regression line can facilitate determination of treatment effects if
the line differs in intercept or slope relative to the baseline (Morgan & Morgan, 2009).
Further, observing the trend in baseline might help researchers to predict where other data
points within the baseline may lie (Morgan & Morgan, 2009).
The SPC charts consider outliers in data, using standard deviations, and
investigate whether such outliers are best explained by a treatment effect. A formula for
calculating the standard deviation in SCD is as follows:
SD = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}},    (1)

Where,
x = a single data point
\bar{x} = mean of all data points
n = total number of data points.
This formula states that the standard deviation equals the square root of the sum of the
squared deviations of each data point (raw score) from the mean of all data points in the study,
divided by the total number of data points minus one. Data points that deviate by two
standard deviations across phases (above or below) are considered to be atypical (Morgan
& Morgan, 2009). If at least two repeated data points fall outside this range then a
treatment effect is plausible (Orme & Cox, 2001).
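As a rough illustration of the two-standard-deviation screen described above, the following sketch applies formula (1) to fictitious observations and flags points falling outside the band (Python is used only for illustration; the data and variable names are hypothetical).

```python
import statistics

# Illustrative SPC-style screen: compute the standard deviation of all data
# points per formula (1) and flag points more than two SDs from the overall mean.
data = [10, 9, 11, 10, 12, 6, 4, 3, 2, 2]    # fictitious A1 + B1 observations
mean = statistics.mean(data)
sd = statistics.stdev(data)                   # sqrt(sum((x - mean)**2) / (n - 1))

lower, upper = mean - 2 * sd, mean + 2 * sd
atypical = [x for x in data if x < lower or x > upper]

# Per Orme and Cox (2001), at least two repeated points outside the band are
# needed before a treatment effect is considered plausible.
print(mean, sd, atypical)
```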
Dual-criteria analysis is a visual aid technique that considers both the mean and
trend of baseline data. These sources of information are used to extrapolate a line that
depicts counterfactual performance and allows for easier comparisons with performance
in the treatment phase. The more treatment-phase points that lie beyond the extrapolated line
(in the direction of expected improvement), the more likely it is that the treatment is
responsible for causing a change (Morgan & Morgan, 2009). A more conservative approach
has also been tested in which the extrapolated line is shifted up or down by 0.25 standard
deviations (Morgan & Morgan, 2009), which is a useful method as well.
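The following is a minimal sketch of the dual-criteria idea under assumed, fictitious data: the baseline mean and trend lines are extrapolated into the treatment phase, and treatment points falling on the improvement side of both lines are counted. This illustrates only the logic, not the procedure as implemented in any particular software package.

```python
import numpy as np

# Fit a mean line and a trend line to the baseline, extend both into the
# treatment phase, and count treatment points beyond both (behavior expected
# to decrease, so "improvement" means falling below both lines).
a1 = np.array([10, 9, 11, 10, 12], dtype=float)     # baseline observations
b1 = np.array([6, 4, 3, 2, 2], dtype=float)         # treatment observations
t_a = np.arange(len(a1))                             # baseline session index
t_b = np.arange(len(a1), len(a1) + len(b1))          # treatment session index

slope, intercept = np.polyfit(t_a, a1, 1)            # baseline trend line
trend_line = intercept + slope * t_b                 # extrapolated trend
mean_line = np.full_like(b1, a1.mean())              # extrapolated mean level

below_both = np.sum((b1 < trend_line) & (b1 < mean_line))
print(int(below_both), "of", len(b1), "treatment points fall below both criteria")
```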
Hierarchical General Linear Modeling in SCDs
The above techniques are all variants of descriptive analyses and do not provide
an opportunity to test statistical significance. HGLM (this work will also use the more
common phrase "hierarchical linear modeling," or HLM) can, however, provide a test for
statistical significance if quantified data are available. Two-level models in SCDs consist
of the extracted/raw graph data (Level-1) and subject characteristic data (Level-2). A
statistical sensitivity analysis of SCD graphs using two-level models with raw data is not
available in the existing literature; the lack of such work may be due to several issues
including design and availability of information. As noted above, use of statistical
inference in the context of SCDs can be problematic. Problems associated with statistical
analyses using SCDs include autocorrelation, overdispersion, and small sample sizes
(Nagler et al., 2008). Overdispersion and concerns around small sample sizes are
discussed below. An understanding of autocorrelation however provides a sense of how
ordinary least square (OLS) approaches are problematic when statistically analyzing
single-case data. Since data in ABAB designs (and other types of SCDs) have
observations nested by case (Kratochwill et al., 2010; Raudenbush & Bryk, 2002),
autocorrelation becomes a concern. In the context of SCDs, autocorrelation is in essence
a form of serial dependence, where observations within the study yield information that
allows for prediction of subsequent observations (Krishef, 1991). This presents an
interesting problem because violation of the assumption of independence can, at times,
represent a desirable outcome in the context of visual analysis. This is because stable
baseline performance should predict little or no change in status in absence of treatment,
and introduction of treatment should be the primary reason for any given change (Horner
et al., 2005). Put another way, more severe autocorrelation within a study phase can at
times facilitate visual analysis but undermine OLS statistical analyses. Multilevel
modeling can, however, address violations of the independence assumption by allowing
error terms to randomly vary when modeling each unit's individual growth trajectory
(this is Level-1 of a multi-level model; at Level-2, subject characteristics are typically
used to predict the Level-1 coefficients). That is, the repeated measures of an entity over
time are treated as nested, and autocorrelation, as well as other difficulties associated with
varying numbers of measurements and different spacing of measurements (Raudenbush &
Bryk, 2002), are handled in the analytic model by accounting for the degree of
association between nested observations. The approach also allows researchers to test
multiple hypotheses that pertain to a particular study and to examine subject characteristics
that are unique.
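As a small illustration of the serial dependence at issue, the lag-1 autocorrelation of a single case's repeated observations can be computed directly (fictitious data; Python is used only for illustration). Values near zero are friendly to OLS assumptions, while strongly positive values signal the kind of within-phase stability that helps visual analysis but violates the independent errors assumption.

```python
import numpy as np

# Lag-1 autocorrelation of one case's repeated observations (illustrative data).
y = np.array([10, 9, 11, 10, 12, 6, 4, 3, 2, 2], dtype=float)
deviations = y - y.mean()
lag1_autocorr = np.sum(deviations[1:] * deviations[:-1]) / np.sum(deviations ** 2)
print(round(float(lag1_autocorr), 3))   # strongly positive for these data
```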
The use of hierarchical linear modeling to analyze published SCDs requires some
prerequisite steps (Nagler et al., 2008) because raw data are typically not provided in
reports. McDougal, Narkon, and Wells (2011) call for providing raw data when
presenting SCD reports so as to ease comparison of results across studies, and decry the
fact that this is not the norm. UnGraph is one type of digitizing software that solves this
problem: it quantifies the X and Y coordinates of a published graph and exports data
into Excel or SPSS formats (it should be noted that few studies testing the reliability of
UnGraph were found in the literature; the reliability of the procedure is discussed further
below). Once data are exported, the researcher can compute new variables relevant to a
particular study. UnGraph in essence makes
raw data available for re-analysis. With such data, the researcher can compute variables
for the time series (e.g., scaling the dependent variable for proportion or interval data),
create interaction effects, and code Level-2 participant characteristics. After defining all
variables in SPSS, one can import the data into HLM and run models to provide estimates
for coefficients. This can lead to analyses in which variation between students can be
tested for its contribution to the model (i.e., using dummy codes to separate one person
from the group). In sum, specific units/schools/classrooms/subject characteristics can be
added to the model in hopes of differentiating or explaining behavior patterns.
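The following hypothetical sketch illustrates the data-preparation step just described: digitized coordinates are arranged into a person-period (Level-1) file with a treatment-phase indicator, and a separate Level-2 file holds dummy-coded subject characteristics. The column names (e.g., classb) and data are illustrative; the dissertation itself performed these steps in SPSS and Excel before importing the files into the HLM software.

```python
import pandas as pd

# Level-1 file: one row per observation, long format, with a phase indicator.
records = [
    # student, session, phase ('A' or 'B'), count of disruptive behaviors
    ("s1", 1, "A", 10), ("s1", 2, "A", 9), ("s1", 3, "B", 4), ("s1", 4, "B", 2),
    ("s2", 1, "A", 7),  ("s2", 2, "A", 8), ("s2", 3, "B", 3), ("s2", 4, "B", 1),
]
level1 = pd.DataFrame(records, columns=["student", "session", "phase", "count"])
level1["treatment"] = (level1["phase"] == "B").astype(int)   # 0 = baseline, 1 = treatment

# Level-2 file: one row per student, with dummy-coded characteristics.
level2 = pd.DataFrame({
    "student": ["s1", "s2"],
    "classb": [0, 1],     # e.g., 1 = enrolled in Classroom B
    "male":   [1, 0],
})

level1.to_csv("level1.csv", index=False)   # imported into HLM as the Level-1 file
level2.to_csv("level2.csv", index=False)   # imported into HLM as the Level-2 file
```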
Raudenbush and Bryk (2002) stated that hypothesis testing in HLM depends in
part on the distributional shape of the data. When data are not normal, the use of linear
Level-1 models would be unsuitable. This is common with count data (all articles used in
this study involve count data), which mainly follow one of two distributions, Poisson or
binomial (Nagler et al., 2008; Raudenbush & Bryk, 2002). SCDs with count data therefore
need to assume one of these distributions, which requires a mathematical adjustment when
using HLM software
(Raudenbush, Bryk, Cheong, Congdon, & du Toit, 2004). The Poisson distribution is
often used to model the number of events in a precise time period, such as classroom
session or week. Binomial distributions are used to model the number of events that
occurred out of a fixed, total number of possible occurrences (Nagler, et al., 2008;
Raudenbush & Bryk, 2002). Examples of variables that have binomial distributions
could be passed/failed a test, did/did not display behavior, or whether a baseball player
managed to get a hit when at bat (Raudenbush & Bryk, 2002). In studies where behavior
is coded as either occurring or non-occurring events (typically coded 0 or 1), the binomial
distribution should be used (Nagler et al., 2008). Any post-hoc tests in HLM are
associated with finding mean differences between groups and within the individual
(Nagler et al., 2008).
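As a rough, single-level analogue of the kind of model described here, a Poisson regression of a count outcome on a treatment-phase indicator can be fit as follows (illustrative only; the dissertation's analyses fit two-level HGLMs in the HLM and STATA packages, which additionally account for the nesting of observations within cases).

```python
import numpy as np
import statsmodels.api as sm

# Poisson regression of a fictitious count outcome on a phase indicator.
counts = np.array([10, 9, 11, 10, 12, 6, 4, 3, 2, 2])    # outcome (counts)
treatment = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])      # 0 = A phase, 1 = B phase
X = sm.add_constant(treatment)

model = sm.GLM(counts, X, family=sm.families.Poisson()).fit()
print(model.params)    # exp(coefficient on treatment) = rate ratio for B vs. A
```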
Exploratory tests, which in these analyses involve the Level-2 variables, can help
explain variance in the data of a given set of studies. One advantage of using a nested
design approach is the ability to determine how Level-2 predictors explain individual
variation. Although multi-level models yield p values that can help inform the researcher
if the treatment was effective and how Level-2 variables influence the study, they cannot
synthesize effect sizes to demonstrate the strength of association between outcome and
intervention. Furthermore, it is difficult to generalize findings to other settings or
populations due to the small sample size typically associated with these analyses.
Effect Sizes
The degree to which a variable (or set of variables) influences outcomes is referred
to as the effect size of the analysis (Cohen, 1988, 1992). According to Parker, Vannest,
and Brown (2009), reporting effect sizes along with visual analysis results is helpful for
four reasons. Doing so (a) promotes objectivity when raters disagree, (b) promotes
precision when discussing treatment impacts, (c) provides further insight about whether
results are due to chance alone, and (d) offers a means to communicate findings. Effect
sizes can be calculated for any given single-case study, but it is suggested that a
minimum of three data points per phase (five for the ABAB design) are needed to
determine an effect (Beeson & Robey, 2006; Horner et al., 2005; Kratochwill et al., 2010;
Shadish, Sullivan, Hedges, & Rindskopf, 2010).
Generating and interpreting effect sizes in SCDs is complicated by the fact that
standardized mean differences (SMDs) are not readily comparable to effect sizes
generated from group-based studies (e.g., randomized controlled trials and quasi-experiments).
This is because the variances of SCDs tend to be small relative to group-based designs, and
mean performance shifts from baseline for an individual can yield large numerators. In SCDs,
one can simply divide phase-mean differences by within-phase variability using some variant
of the formula:
SMD = \dfrac{\bar{X}_E - \bar{X}_C}{s}    (2)

Where,
\bar{X}_E = mean score for the experimental group
\bar{X}_C = mean score for the control group
s = pooled standard deviation.
There are variations of this general formula where the methods of obtaining the
denominator are altered such as using the control group standard deviation (Grissom &
Kim, 2005) or assuming that the population standard deviations are equal, therefore using
the pooled estimates (Nagler et al., 2008; Van den Noortgate & Onghena, 2003). In
SCDs, this is somewhat akin to using variance estimates based on intervention and/or
baseline phases. The result is often a standardized effect size that is considerably larger than
Cohen's general benchmarks of 0.2, 0.5, and 0.8 (Cohen, 1988). Of course, these values are
heuristics and context should drive determination of whether an effect is large, but
standardized effects from SCDs can be so large that it is difficult to compare them to
group-based studies (Kratochwill et al., 2010; Shadish, Rindskopf, & Hedges, 2011).
An effect size formula that effectively represents single-case designs and parallels
group-based studies is still being developed. Such formulas would help researchers
compare effect sizes across SCD studies.
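A minimal sketch of formula (2) using a pooled standard deviation follows (fictitious data; in the SCD context the two "groups" are the baseline and treatment phases of the same case). The large value it produces illustrates the point above that SCD standardized effects routinely dwarf Cohen's benchmarks.

```python
import statistics

def pooled_sd(x, y):
    """Pooled standard deviation of two samples."""
    nx, ny = len(x), len(y)
    vx, vy = statistics.variance(x), statistics.variance(y)
    return (((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2)) ** 0.5

def smd(baseline, treatment):
    """Standardized mean difference per formula (2)."""
    return (statistics.mean(treatment) - statistics.mean(baseline)) / pooled_sd(baseline, treatment)

a1 = [10, 9, 11, 10, 12]   # baseline phase
b1 = [6, 4, 3, 2, 2]       # treatment phase
print(round(smd(a1, b1), 2))   # large negative value: behavior dropped during treatment
```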
A recently developed procedure, the d-estimator (Shadish, Sullivan, Hedges, &
Rindskopf, 2010), uses the baseline standard deviation as opposed to the pooled standard
deviation in the denominator. The d-estimator promises to generate estimates of SCD
treatment impacts that are comparable to group-based designs (Shadish et al., 2010), but
the procedure is still under development (Kratochwill et al., 2010; Shadish et al., 2010).
Other methods of measuring relationships between the independent and
dependent variables exist. For example, R2 can be interpreted as the proportion of
variation explained, but it must be adjusted for use with categorical or continuous predictors
and in instances where data trends are controlled for in the analyses (Brossart et al.,
2006; Kirk, 1996). Therefore, there are some limitations in interpreting R2 in SCDs due
to baseline and trend effects inherent in the design. For example, failure to account for a
baseline trend can reduce R2, and this may in turn undermine interpretation (Parker et al.,
2006). The proportion of the variation explained by phase differences is one way to
interpret R2; however, other interpretations are available in single-case research, and the
choice should be contingent on the design's function (Parker & Brossart, 2003).
Phase differences are the shifts that occur between the baseline (A1) and treatment (B1)
intervals. Testing these shifts is similar to statistically comparing the baseline and
treatment means.
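The phase-difference interpretation of R2 can be illustrated by regressing the outcome on a baseline/treatment indicator and computing the proportion of variance explained (fictitious data; no baseline-trend adjustment is applied here, which, as noted above, can distort R2 in real applications).

```python
import numpy as np

# R^2 as the proportion of outcome variance explained by phase membership.
y = np.array([10, 9, 11, 10, 12, 6, 4, 3, 2, 2], dtype=float)
phase = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1], dtype=float)

X = np.column_stack([np.ones_like(phase), phase])   # intercept + phase dummy
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta
r_squared = 1 - residuals.var() / y.var()
print(round(float(r_squared), 3))
```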
Formulas for effect size will yield overestimated effects in SCDs (Brand, Best, &
Stoica, 2011). Hence, a way to calibrate effect size estimates in SCDs is needed to
compare to more standard approaches such as Cohen’s d. Having said that, developing
such approaches is not the focus of this work; rather, the focus is on comparing HLM
results, focusing on p values, with visual analyses that describe whether there is a
treatment effect (to clarify, these two features in essence focus on the same issue, and that
is whether a treatment effect is present; not on how large the effect may be). This
discussion is offered only to clarify that SCD effect size calculation is inherently
problematic. Nevertheless, this work will calculate treatment impacts using conventional
methods so as to describe treatment effects and to compare them to the original authors'
reported effect size estimates via sensitivity analyses.
Statement of the Problem
A motivating question behind this work is: how can quantitative methods
supplement visual analyses? Quantitative methods to assess treatments are becoming
more popular in the literature (Maggin, O’Keeffe, & Johnson, 2011). Visual analysis
does not use p values to assess treatment effectiveness; therefore, sensitivity analyses need
to be conducted, not to compare the two methods, but to verify the conclusions reached
by study authors. Evaluating treatment effectiveness is a critical component to any SCD
designed to assess intervention effects. As noted above, visual analyses represent the
preferred (or most used) method for determining if there is a treatment effect,
but statistical approaches have recently been developed and are promising. Therefore,
more research should be conducted to compare results of visual and statistical analyses.
In addition to examining whether visual and statistical analyses yield comparable
conclusions, Nagler and colleagues (2008) described a value-added component in HLM
through the use of Level-2 data. Testing the influence of Level-2 information on student
performance might yield additional information about circumstances under which
treatment effects are present. Lastly, there is an opportunity to assess effect size after
digitizing graphs using the PND, IRD, SMD, and R2. Although this work is not a
meta-analysis, testing these procedures and comparing the effect sizes to the original work will
provide a sensitivity analysis. Later, data from this study can potentially contribute to the
quantitative syntheses of SCDs using formulas such as the d-estimator.
The primary purpose of this study is to compare results from three different
analytic procedures: (a) claims made by original study authors; (b) WWC-informed
visual analyses as applied for purposes of this dissertation (note that this is not necessarily
comparable to formal WWC procedures, which entail independent coding by two trained
coders and reconciliation with a senior methodologist); and (c) results of HLM. In
the process, this work will also test the concept of quantifying graphs found in single-case
research. With new digitizing software, researchers are now able to obtain
coordinates of graphs and may test phase effects (i.e., differences between baselines and
treatments). Secondary purposes include (a) examining whether Level-2 information
explains data variance using HLM procedures in a way that might yield new insights about
published SCD data, (b) application and exploration of effect size estimates, and (c)
exploring the reliability and validity of UnGraph.
Research Questions
This is a methodological dissertation focused on testing new single-case analysis
procedures. It is guided by four research questions, one primary and three secondary:
Primary question. (1) Does quantification and subsequent statistical analysis of
selected ABAB graphs produce conclusions similar to visual analyses? Comparisons will
be made between WWC visual analyses and those of the studies' original authors. WWC
visual analyses will be employed so as to standardize the procedure across the various
studies used to inform this work.
Secondary questions. (2) Do any of the subject characteristics (Level-2 data)
explain between-subjects variation? If so, can this information be used to yield new
findings? The procedures advanced by Nagler and colleagues (2008) allow for statistical
analyses of Level-2 variables that can yield new findings; the approach can basically be
used to extend findings of the original SCD authors. Therefore, testing these procedures
would seem to be a worthwhile endeavor. (3) Do the PND and IRD (nonparametric
indices), the SMD (a parametric index), and R2 yield effect sizes similar to those of the
original studies? (4) Is UnGraph a valid and reliable tool for digitizing ABAB graphs?
Significance of Study
SCD analyses are primarily conducted through visual analysis by teams of experts, yet
scholars have expressed concern over potential bias with the procedure and scenarios that
involve “close calls.” It therefore seems reasonable to examine whether HLM analyses
routinely yield comparable results with visual analyses. Nagler and colleagues (2008)
created a handbook concerning the use of multi-level modeling in HLM for small sample
sizes. The methods used in this handbook are employed for this work. Ideally, there will
be few if any differences between the different procedures. This would provide an early
indication (i.e., early in the sense that only a small number of studies were examined due
to the computational intensity of Nagler and colleagues’ methods) that statistical and
visual procedures coincide. In the event that they do not, this work will not necessarily
be able to recommend one approach over another. That would be a task for later study.
Nevertheless, failure of the two methods to converge would yield a warning for the
emerging field of statistical data analyses of SCDs. In addition, Level-2 predictors may
provide information for prediction purposes. Statistically significant Level-2 variables
would explain variation across subjects (Nagler et al., 2008). Without statistical testing
of the Level-2 variables, only subjective claims can be made concerning subject
characteristics and the outcome variable. Unfortunately there does not appear to be a
competing statistical approach (at least so far as the researcher is aware) that answers
questions pertaining to Level-2 data in SCDs. In addition, visual analysis and descriptive
statistics do not yield statistical information about the phases beyond means and ranges.
Since regression-based statistical approaches are recommended in available technical
literature, it seems worthwhile to test Level-2 contributions, even when sample sizes are
small.
The basic point behind the Level-2 variable analyses is that the SCD literature tends
not to synthesize results, and when it does, it does not go beyond making some logical
inferences based on visual work. Visual analysis does not rely on statistical analyses to
synthesize results and make group comparisons (e.g., boys tended to respond to treatment
better than girls, kids in Classroom A tended to do better than kids in Classroom B). A
lesser justification for the Level-2 work is that the approach has been advanced in the
methodological literature. Accordingly, testing Level-2 variables seems logical in this
sensitivity analysis simply because it is recommended as an option, and the original
authors' statements concerning these variables can be compared to this work's findings
(no study chosen for this work used quantification to analyze results).
Further justification for testing Level-2 significance comes from Nagler and
colleagues (2008). They found significance under similar circumstances and thus
provided evidence that Level-2 analyses can be reasonably interpreted. Furthermore,
significant Level-2 variables can be interpreted as influencing the baseline or treatment
phase in some way. That is, if a Level-2 variable (e.g., ethnic background, coded 0 =
Caucasian, 1 = African American) is significant for the intercept in SCD work, this should
be interpreted as the two groups differing in performance at the intercept (i.e., baseline).
ethnic background is significant for the treatment phase, then one group may be
performing differently during the treatment phase. Finally, calculating effect size
estimates can help yield new insights about treatment impacts of the studies re-analyzed
in this dissertation.
Delimitations and Limitations of the Study
This dissertation is delimited in several aspects. First, the study consisted of nine
ABAB, behaviorally-based studies. Identifying articles that matched the criteria for the
WWC for acceptable designs and limiting the article search to students with behavioral
issues restricted the number of articles to quantify. Furthermore, the HLM procedures
used in this work are intensive and time-consuming, making it difficult to apply them to
a larger group of studies. More importantly, this work focuses on testing
an emerging application of HLM and obtaining an early sense of whether results are
congruent with established visual analysis procedures. It in essence is an attempt to
independently examine an emerging methodology. This work does not attempt to make a
substantive contribution to the knowledge base on treating students with emotional-behavioral
disorders. In short, this is a methodological dissertation with the intent to
make a contribution to the methodological literature; this justifies the use of a small
number of studies. A second delimitation to this work is that the methodology being
tested focuses only on ABAB designs, and not on other types of SCDs such as
multiple-baseline, alternating treatment, and changing criterion designs, which could be analyzed
using similar methods. Using ABAB designs will limit threats to validity and possibly
allow more data to be used in analyses. A third delimitation is that this work is primarily
a sensitivity analysis. In the event that differences are routinely found across these
procedures, this might serve as warning to SCD methodologists but beyond that the work
will not attempt to exhaustively examine why the differences were found. Ideally, the
alternatives will routinely converge, and results would indicate that there is early
evidence that the different techniques generate consistent results.
In terms of the primary research question, limitations to this study include using
WWC visual analysis procedures without having access to WWC resources. There is no
guarantee that the application of the approaches would yield the same conclusions had these
studies gone through formal WWC review. Kratochwill and colleagues (2010) show that
each single-case study is independently coded and analyzed by trained, doctoral-level
methodologists and differences are reconciled by a senior methodologist who in turn has
access to content advisors. In short, considerable resources used by the WWC are not
available for this work and there is limited assurance that the visual analyses used here
would yield the same results seen in a WWC review. Of course, the very rationale for the
use of WWC approaches is to standardize the technique when performing sensitivity
analyses. Having said that, the visual analyses were checked by the dissertation chair; he
co-authored the report on WWC visual analyses and is a senior methodologist on the
WWC project.
For the first and second research questions, the manual did not designate the exact
matrix (i.e., identity, unstructured, diagonal) to model the data. Nagler and colleagues
(2008) discuss options for constraining random effects and it is assumed that the no
constraints model is the same as the unstructured matrix, which is the default in HLM
(Garson, 2012). For this reason, an unstructured variance-covariance matrix is used in
these analyses, and these matrices can be seen in Appendix C in the tau dimensions. These
matrices indicate the unstructured variance-covariance matrix, but because this is an
assumption, it is a limitation. Each coefficient in the unstructured matrix estimates
heterogeneous variances on the diagonal and non-zero covariances on the off-diagonal
(Garson, 2012; Raudenbush & Bryk, 2002).
another limitation to this study is that although the magnitude of treatment impact of each
SCD will be described using different effect size calculations, none of the current
procedures are ideal. There is nothing statistically wrong with calculating a SMD but, as
discussed above, this approach typically yields results that are difficult to compare with
effect sizes from group-based designs (Kratochwill et al., 2010). Some technical
difficulties with both the PND and R2 procedures are evident. The IRD or “risk
difference” is a newer effect size for summarizing single-case designs, and this will be
calculated as well (Parker, Vannest, & Brown, 2009). The IRD is a difference of two
proportions of data overlap (the intervention minus the baseline). The IRD is different
from other effect size calculations because it is based on risk indices (Maggin et al.,
2011b). The other estimates described in this dissertation are not derived from rates. The
IRD was not used in the original nine articles, but it can be compared to the PND. These
effect size estimates and associated issues will be discussed in Chapter Two. Finally,
some attempt was made to assess the reliability and validity of the UnGraph extraction
procedure. For the latter issue, an ideal validation effort would be to obtain raw data
from additional study authors and confirm that quantified graphs yield similar means,
variances, and so on. However, it is often difficult to obtain raw data, thereby limiting this
aspect of the work. All original authors were contacted to obtain raw data; for any
responses to this request, comparisons will be made.
Definition of Terms
Phase Difference
A phase difference is defined as the shift in the performance of a dependent
variable that occurs between the baseline phase (first A in the ABAB design) and the
treatment implementation (first B). For example, if a child shows a large number of
tantrums (the dependent variable of interest) at baseline, and this is reduced after a
treatment is introduced, there is a phase difference.
Sensitivity Analysis
A sensitivity analysis is a process where two different analytic procedures are
pursued to determine if they have the same finding. The procedure attempts to ascertain
if a set of findings are sensitive to the methods used.
Eta-Squared
Eta-squared is an effect size measure that is equivalent to R2, typically seen in
ANOVA.
Error (R)
This statistical notation, R, is described as the error term which allows each
student to vary from the grand mean.
Random Error (ε)
The unexplainable error found in measurements between the observed value and
the predicted value in a model.
Coefficient of Determination, R2
The coefficient of determination or R2 is a type of effect size that shows how
much variance in a dependent variable is explained by the predictor variable. In the
context of SCDs, R2 represents the proportion of the variation explained by phase
differences.
Hierarchical Generalized Linear Models (HGLMs)
Hierarchical generalized linear models (HGLMs) are extensions of hierarchical linear modeling used when the assumptions of normality and linearity are not tenable for the given data and no transformation will correct the data (Raudenbush & Bryk, 2002).
Given the data, special functions (e.g., logit, log) can facilitate the incorporation of
different distributional shapes (Raudenbush & Bryk, 2002).
Treatment Effects
A treatment is deemed effective in ABAB designs if the design and resulting
evidence allows for causal arguments per WWC criteria. From a statistical perspective, a
treatment is deemed to be effective if baseline and treatment means in the study are
significantly different from each other, p < .05. Equivalently, phase effects are the
average rates of change in log odds (binomial distribution) and log count (Poisson) as a
student switches from baseline to treatment phases (Nagler et al., 2008).
Poisson and Binomial Distributions
Poisson and binomial distributions are common when using count data, which
involves the interpretation of the dependent variable. For example, if the dependent
variable is number of tantrums displayed, then the distribution chosen should be Poisson.
On the other hand, if the dependent variable is the percentage of time on-task then the
distribution that would accurately represent the data is binomial. Both have different
assumptions concerning the mean and variance. In Poisson and binomial distributions, the variance is a function of the mean (Nagler et al., 2008). The mean and variance are equal in Poisson distributions, so as the mean increases, so does the variance. For the binomial
distribution, the variance is largest when the proportion is 0.5 (Nagler et al., 2008). The
Poisson distribution is generally used to model the number of events in a fixed time
period. Binomial distributions are used to model the number of events that occurred
where the total number of possible events is known; for example, the dependent variable
is from a fixed number of binary (0, 1) observations (Nagler et al., 2008; Raudenbush &
Bryk, 2002), where the variance is p(1 – p), and p equals the probability of success
(defined as an event occurring). An important feature of both distributions is they
provide some guidance around how much variance one might expect in observed data,
which is important when examining the presence of overdispersion.
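To make these variance expectations concrete, the following sketch (Python, used here purely for illustration; the analyses in this work were run in HLM and STATA, and all values below are invented) compares the observed variance of a count series to the variance expected under Poisson and binomial assumptions:

import numpy as np

# Hypothetical counts of a target behavior per session (invented values)
counts = np.array([6, 4, 7, 5, 8, 3, 6, 5])
observed_var = counts.var(ddof=1)
poisson_var = counts.mean()              # under a Poisson model, the variance equals the mean

# If each count is instead successes out of n = 20 observation intervals (binomial case)
n = 20
p = (counts / n).mean()
binomial_var = n * p * (1 - p)           # expected variance of the count under a binomial model

# An observed variance well above these expectations would suggest overdispersion
print(observed_var, poisson_var, binomial_var)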
Overdispersion
In multi-level modeling, overdispersion occurs if variability in the Level-1
outcome is greater than what is expected from the Level-1 sampling model (Raudenbush
& Bryk, 2002). The SCDs examined here all use count data. Again, the presence of
count data requires models to assume either a Poisson or binomial distribution and these
distributions provide a sense of how much variability in the data can be expected. When
the observed variability of the count data is greater than one might expect when using
these distributions, overdispersion is typically thought to be present (Nagler et al., 2008).
Such overdispersion may complicate the capacity to assess if a treatment effect is present
using statistical procedures, as the baseline mean and treatment means should be far
enough apart to determine if they are in fact different. There are however corrective steps
that one can take and the overall issues are discussed in greater detail in Chapter Two.
Autocorrelation
Autocorrelation occurs when the baseline displays a trend, which can make phase differences statistically indiscernible (Parker et al., 2006). Serial-dependency is another term for autocorrelation, where the future behavior of a person is predictable from prior instances; hence a trend develops (Krishef, 1991). Autocorrelation or
serial-dependency can alter effect size magnitudes and significance levels (Parker et al.,
2006); additionally, it violates the independence assumption of data in regression (Fox,
1991).
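As a rough illustration of how serial dependency might be screened (a generic lag-1 autocorrelation check on invented baseline values, not the procedure used by the cited authors):

import numpy as np

# Hypothetical baseline observations showing an upward trend (invented values)
baseline = np.array([2, 3, 3, 4, 5, 5, 6, 7])

# Lag-1 autocorrelation: correlate the series with itself shifted by one session
lag1 = np.corrcoef(baseline[:-1], baseline[1:])[0, 1]
print(round(lag1, 2))    # values well above zero suggest serial dependency or trend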
Organization of the Study
Chapter One establishes the purpose and significance of the research questions, as
well as study delimitations and limitations. Chapter Two begins with a general overview
of treatment interventions that are used to help children who display behavioral
difficulties. Articles used in the current work involve children with identified disabilities
or children who are at-risk of being identified. The criteria for identifying children with
disabilities are therefore described. The chapter also reviews differences between visual and statistical analyses, different types of SCDs, UnGraph software procedures, and finally, statistical concerns encountered when dealing with SCDs, such as overdispersion and autocorrelation, as well as effect size estimates.
Chapter Two concludes with an overview of subject characteristics that can explain
between subject variations and how that may add value to the original articles in this
study. Chapter Three reports research design, HLM interpretations, data collection and
data analyses. Chapter Four includes the main results from HLM analyses. Discussion,
conclusions and recommendations are in Chapter Five. References and Appendices (e.g.,
sensitivity analyses, effect size calculations) are attached at the end of this study.
Chapter Two: Review of Literature
Review of Philosophical Issues
There has been, as of late, an increased emphasis on the search for evidence-based
instruction (Iwakabe & Gazzola, 2009). SCDs offer an important class of techniques for
identifying evidence-based instruction, mainly because they can deal with small samples
and highly contextualized treatments. Iwakabe and Gazzola (2009) argue that attempts to
synthesize and aggregate single-case studies may help to develop evidence-based
treatment interventions for populations of people with specific needs. Meta-analysts have
taken an interest in statistical examinations of SCDs (e.g. permutations tests, HLM)
because doing so can promote synthesis and generalization of findings. On the other
hand, SCD research communities who favor localized treatment plans tend to not be as
interested in probabilities or generalization (Jenson et al., 2007; Morgan & Morgan,
2009). In addition, some studies use multiple SCDs and often use very different students
(e.g., students who are typically developing and students with disabilities) or contexts
(e.g., teaching students in general education or self-contained settings). This makes
efforts to understand treatment impacts and their generalization a complex endeavor
(Kauffman & Lloyd, 1995).
This work attempts to help address these disparate issues by testing and applying
an emergent methodology for analyzing SCD data developed by Nagler and colleagues
(2008). More specifically, the work focuses on comparing the results of emerging HLM
applications and standard visual analyses when determining if a SCD produced evidence
of a treatment effect. The work also applies a technique that can test for treatment
differences among types of students and contexts. As noted in Chapter One, these
sensitivity analyses are replicated across nine studies. Although the purpose of this work
is not to directly contribute to the treatment literature, it seemed reasonable to work with
a corpus of studies that have a similar goal since radically different types of studies might
complicate the sensitivity analyses. The decision was to focus on studies that examine
treatment effects of students with behavior disorders both because this could yield a large
number of SCDs that can pass WWC standards (which was necessary so standardized
visual analyses could be applied) and because it seems likely that future meta-analyses of
SCDs would examine this particular literature base. There is after all a current effort by
the WWC to review this topic and on-going calls in the literature to identify treatments
that work well with students with emotional behavioral disorders (Kauffman & Landrum,
2009).
Several of the studies used in this work involve students with emotional
disturbances. Severe Emotional Disturbance (SED) is defined according to IDEA (2012,
p. 7-8) as "a condition exhibiting one or more of the following characteristics over a long
period of time and to a marked degree, which adversely affects educational performance: (a) an inability to learn which cannot be explained by intellectual, sensory, or health factors; (b) an inability to build or maintain satisfactory interpersonal relationships with peers and teachers; (c) inappropriate types of behavior or feelings under normal circumstances; (d) a general pervasive mood of unhappiness or depression; or (e) a tendency to develop physical symptoms or fears associated with personal or school problems" (IDEA, 2012, p. 7-8).
Statistical vs. Visual Analysis
Whether or not one should use statistical inference in SCDs represents an ongoing
controversy in the literature (Morgan & Morgan, 2009). Visual analysis proponents
argue that techniques specific to SCDs mimic more widely accepted procedures used in
statistical inference testing (Morgan & Morgan, 2009), but visual inspection of the data
will not yield the commonly used p value index, and the approach may be subject to unreliable interpretation by the analyst (Ottenbacher, 1990).
It is likely that visual analysis remains in the field due to its usefulness in determining treatment effectiveness (Kratochwill et al., 2010) and because statistical inference is often viewed as incommensurate with single-case research given the use of small samples (Edgington, 1995). Proponents believe that visual analysis techniques must remain a viable option, given careful examination of trend and variability alongside parallel statistical data interpretation (e.g., percentage of nonoverlapping data, confidence interval bands).
Furthermore, visual inspection of graphs is thought to be effective and swift. Of course,
there remain concerns about the potentially subjective nature of the process and
occasionally low inter-rater reliabilities, even when experts are involved (Morgan &
Morgan, 2009).
Permutation Tests
Permutation tests are described here to help justify the use of HLM since they
tend to have assumption issues within SCDs. Permutation tests, also known as
randomization tests, are a subset of non-parametric tests which involve re-sampling the
original data for all possible values of the test statistic (Edgington, 1987). In order to
perform permutations of ABAB data, random assignment of treatment blocks to
treatments should have been performed by the researcher (Edgington, 1987) which is
difficult in SCDs. Permutation tests can be used in SCDs, but certain assumptions of the
data must be met before they are interpreted. One assumption would be there is no
baseline trend. Another assumption would be randomization of units to treatment
settings (e.g., days).
Assuming randomization was conducted, permutation tests could be applied in the
context of a SCD since they do not require assumptions about the data distribution, and
provide a p value that can be used to assess treatment effectiveness (Edgington, 1987;
Kratochwill & Levin, 2010). The null hypothesis in a randomization test is the
expectation that at each treatment time, performance would be the same had an
alternative treatment study condition been given at that time (Edgington, 1987). A p
value is used in tandem with the MA-MB test statistic, which is the difference of the
average values between Method A and Method B. This test yields the nonparametric
exact significance level (Edgington, 1995) based on rank data. Since the test uses all
available data using iterations, permutations are advisable for conditions where sample
sizes are small. Unfortunately, the smaller samples also increase the likelihood of
committing a Type II error (Edgington, 1995). One must also consider if differential
carry-over effects occurred. These effects can, for example, be seen when
drugs/treatments are used in tandem and the researcher does not know which caused the
behavior(s) to change (Edgington, 1995). Furthermore, the results of permutation tests overlook any covariation of treatment effectiveness with subject characteristics (a limitation that can be addressed via HLM analyses), cannot account for baseline trends, and few researchers can randomly assign treatment blocks to treatments (Edgington, 1995). For those who prefer using permutations, certain software programs will test for the presence of a baseline trend. When the baseline is not flat, several corrective tests, such as the Allison and Gorman procedures that correct for trend in the mean (ALLIS-M), trend in slope (ALLIS-T), and trend in both mean and slope (ALLIS-MT), facilitate trend control (Parker, Cryer, & Byrns, 2006). Newer methods are available but are not covered here
(refer to Parker et al., 2006).
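A minimal sketch of the permutation logic described above, assuming (hypothetically) that treatment blocks were randomly assigned to conditions; the block structure and data are invented, and the sketch simply compares the observed MA – MB difference to its permutation distribution:

import itertools
import numpy as np

# Hypothetical ABAB data grouped into four blocks (invented values)
blocks = [np.array([7, 6, 8]),   # A1
          np.array([3, 2, 4]),   # B1
          np.array([6, 7, 6]),   # A2
          np.array([2, 3, 2])]   # B2
labels = (0, 1, 0, 1)            # 0 = baseline (A), 1 = treatment (B)

def mean_diff(lbls):
    # MA - MB test statistic for a given assignment of blocks to conditions
    a = np.concatenate([b for b, l in zip(blocks, lbls) if l == 0])
    t = np.concatenate([b for b, l in zip(blocks, lbls) if l == 1])
    return a.mean() - t.mean()

observed = mean_diff(labels)
perms = set(itertools.permutations(labels))   # all distinct assignments of two A and two B blocks
diffs = [mean_diff(p) for p in perms]
p_value = np.mean([abs(d) >= abs(observed) for d in diffs])
print(observed, p_value)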
Proponents of nonrandomized single-subject studies suggest however that the
researcher(s) should not lose the ability to present and withdraw the treatment. There is
also a practical consideration since studies with patients who have severe behavioral
issues, for example, may make the use of randomization problematic (Edgington, 1987;
Morgan & Morgan, 2009). One interesting aspect to SCDs is that simply increasing the
number of trials (total number of possible trials on each day) for individual studies
(permutations do this since they increase the length of the study) can shift a binomial
distribution towards a normal distribution (Raudenbush & Bryk, 2002). Using normal
distributions can minimize threats associated with quantification, but small sample sizes
will influence interpretations (i.e., probabilities of behavior, assumptions concerning
normality).
Other Types of SCDs
Although only ABAB designs were used for this study, there are other types of
SCDs. Testing two or more different treatments in one study would generally entail the
use of an alternating-treatment design, where the comparison of subject performance
under each condition can be monitored (Morgan & Morgan, 2009). The systematic
manipulation of the treatments over time allows for examination of which treatment is
more effective for the patient(s) (Morgan & Morgan, 2009). Alternating-treatment
designs are most often used for specific people with individual needs, where the
researcher can quickly assess different treatments or independent variables for custom
treatment plans (Morgan & Morgan, 2009).
Multiple-baseline designs are constructed so that treatments are staggered across
time. The staggered onset of treatment can address various threats to internal validity
such as history, regression to the mean, maturation, instrumentation, and so on. This is
because if one of these factors is responsible for observed performance change, then it is
likely that these will be seen across different baselines. For example, if maturation were
the driver of performance change, one would expect to see improvement in baselines
before the onset of treatment. However, if changes in performance occur only after the
implementation of treatment, then one can have confidence that the
treatment/intervention is causing the behavior change (see Figure 2).
Figure 2. An example of a multiple-baseline design (with percent of intervals of on-task
behavior as the outcome variable).
Minimizing Threats to Internal Validity Using the ABAB Design
Internal validity is the degree to which the researcher is certain that changes in a
dependent variable are due to changes in an independent variable and not other factors
(Morgan & Morgan, 2009; Shadish et al., 2002). The ABAB design can yield strong
internal validity. The design can eliminate confounding, where the researcher is not sure
whether a specific intervention is responsible for observed performance change or if
some hidden variable is causing the effect (Morgan & Morgan, 2009). Consider for
example maturation and history threats. Maturation in ABAB designs can be examined
via reversal of the treatment condition and replication because if maturation is driving
performance change, there should be no difference in performance. History threats occur
when some other uncontrolled factor(s) outside of the study are influencing the dependent
variable (Morgan & Morgan, 2009). The ABAB design can handle this threat by
observing the different treatment phases. If some uncontrolled factor is responsible for
performance, then there should be few, if any changes associated with removal and
reintroduction of the intervention (Morgan & Morgan, 2009).
Regression to the mean, data loss, and instrumentation threats can also be dealt
with via replication. Regression to the mean is a statistical phenomenon where data
converge to the mean and this represents a standard rationale for the need to examine
counterfactual conditions in any design that is used to investigate causal inference. In
studies where subjects are chosen on the basis of high or low scores, common in single-case research, this can be a plausible threat. But again, if regression to the mean were
driving performance change in an ABAB, then there should not be an apparent treatment
effect with each introduction and subsequent removal of the tested intervention.
Instrumentation can occur when the researchers unknowingly change the treatment
constructs, when bias due to observation occurs or via repeated testing (Kratochwill et al.,
2010). But again, these threats can be rendered implausible using the same logic
discussed above. Attrition in single-case research can occur for several reasons and
subjects leaving too early may not provide enough information to continue the
experiment. Five data points per phase and three phase repetitions are needed to meet
standards set by the WWC. Here, losing data might render a study that does not meet
standards, meaning internal validity is thought to be suspect, and the evidence produced
by the study will not be examined.
ABAB designs do have to contend with subjectivity during visual analyses. One
rater may claim an effect upon inspection of data and another may not (Danov &
Symons, 2008). In the parlance of null hypothesis tests of statistical significance, rater
disagreement can yield a problem that is similar to Type I and Type II errors. That is, if
raters disagree, there is still a chance that a treatment intervention is deemed effective
when in fact it is not, or ineffective when it is. Although the definition and the required amount of inter-rater agreement vary, the specifics of the data can cause raters to disagree whether they are experts or novices (Danov & Symons, 2008). A general approach for handling this concern is to train visual analysts to use a replicable and systematic procedure for judging data.
Reliability and Validity of the UnGraph Procedure
Shadish and colleagues (2009) investigated data extraction reliability (inter-rater)
and validity of UnGraph using three coders and 91 graphs. The correlation between the
extracted values for the original coder and the two coders averaged .96 (median = .99).
The reliability between the phases (baseline to treatment) was also tested. The average
difference between phases for the two coders had a correlation of r = .95. To explore the
validity of the UnGraph procedure, means and standard deviations from the original data
were compared to baseline means from the extracted data. Also, five different methods
of testing validity between extracted data and original articles were employed for the
mean of each phase of the study considering the outcome variable: (1) looking at mean
percentages; (2) mean time or duration; (3) mean number correct; (4) mean; and (5)
percentage correct from the different types of studies. By averaging the extracted and the
reported data over all phases and correlating these averages, the validity correlations for the five kinds of data ranged from .97 to .99. The authors suggested that any issues concerning
reading data and mistakes were from human error (e.g., extracting the wrong number of
data points, issues reading overlapping data points).
HLM Applications to SCD
Techniques and model fit criteria needed to test models associated with SCDs are
available. The assumptions behind HLM when analyzing single-case research designs
are not that different from the assumptions of the standard HLM models, except the
distributions change in count data and they are considered non-linear models. The study
of individual change assumes individual’s trajectory is a function of parameters and
random error (Raudenbush & Bryk, 2002, p. 186). Since the data are nested within
persons, they should not be thought of as a fixed set for all participants (Jenson et al.,
2007; Raudenbush & Bryk, 2002). A key interest is using HLM to determine whether
there are phase differences in ABAB designs (Nagler et al., 2008; Raudenbush & Bryk,
2002). Phase differences are defined as statistically significant findings between baseline
and treatment phases indicating treatment effectiveness for a given study (Nagler et al.,
2008). A p value is given in HLM to describe these phase differences, suggesting on
average a difference between the treatment and baseline phases.
Several different tests are used to determine model fit (Jenson et al., 2007).
According to Van den Noortgate and Onghena (2003) the simplest hierarchical model for
SCDs is the two-level model where measures are nested within persons. For AB or
ABAB designs in single-case research (recall that this study pertains to ABAB designs
only) each model can provide the researcher with information concerning the impact of
change for the group of observations within each phase. A typical set of analyses will
cluster several ABAB studies together (often, a single article will report on several at a
time). In general, a model for the overall change in time with behavior is recorded for
each student. Last, analyses entail modeling the change in time with behavior recorded
between phases. More complex models can be used to test whether the student specific
characteristics contribute any information regarding treatment impacts for a particular
student (Nagler et al., 2008). This is a key point in SCD research, as Kratochwill and
colleagues (2010) stated, these designs can yield convincing causal information and
results for a particular student in the study.
The binomial distribution is often used in the context of SCDs given the presence
of data from a fixed interval (i.e., probability). For example, each trial can yield a binary response, and each time period can comprise a fixed number of trials. Count data can
therefore be considered as an umbrella term to describe both numbers (i.e., frequency of
behavior) and proportion (i.e., fixed interval or probability). Further, data can deal with
the number of successes “Y” in n independent trials, where the probability of success “π” is constant over trials: Y ~ Binomial(n, π) (Skrondal & Rabe-Hesketh, 2007). In
generalized linear models, the distribution is determined for the count of success Yi in ni
trials for unit i and is conditional on covariates xi,
Yi | xi ~ Binomial(ni, πi).
The covariates determine the probability πi seen below using
πi ≡ E(Pi|xi) = h(ηi), ηi = xi′β,
where Pi = Yi / ni, h(·) is the inverse of a link function (i.e., logit or probit), and ηi is the
linear predictor (Skrondal & Rabe-Hesketh, 2007). Therefore, the variance of the observed proportion is
Var(Pi | xi) = πi(1 – πi) / ni .
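For example, with a hypothetical success probability of πi = 0.30 and ni = 10 trials in a session, the expected variance of the observed proportion would be 0.30(0.70)/10 = 0.021.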
The independence assumption is sometimes violated when overdispersion occurs in
single-level models (Skrondal & Rabe-Hesketh, 2007). Oftentimes data in SCDs are
clustered or close together within phases. Dispersion refers to the variability in a data set,
as well as expected variation around a population parameter. It is commonly measured
by estimating the variance of a variable. Recall, overdispersion is defined as Level-1 outcome variability that is greater than anticipated under either a Poisson or binomial distribution (Raudenbush & Bryk, 2002); it can be diagnosed by looking at the variance components output, where values of sigma-squared (σ²) over one indicate overdispersion in the model. In behavioral research, the outcome
variable can be on several scales (e.g., count, dichotomous). Recall in Poisson and
binomial distributions, the variance is a function of the mean (Nagler et al., 2008). The
mean and variance are equal in Poisson distributions, and as the variance increases so
does the mean (Nagler et al., 2008). Traditionally, overdispersion occurs when there is
heterogeneity among subjects (Agresti, 1996) or when the “observed random variation is
greater than the expected random variation” (Ghahfarokhi, Iravani, & Sepehri, 2008, p.
544). In SCDs, overdispersion has a similar definition but when count data have more
than expected variability by the binomial or Poisson distributions, overdispersion is
typically (but not always) present (Nagler et al., 2008). When σ² is closer to one, (i.e., the
criteria for determining overdispersion) then overdispersion is not an issue (Raudenbush
& Bryk, 2002). HLM has an overdispersion option before a model is run that accounts
for overdispersion present. By checking this option before running a model, likelihood
estimates can be compared across models (Raudenbush & Bryk, 2002; Raudenbush et al.,
2004) given the presence of overdispersion. Overdispersion continues to be a possible
concern in single-case research due to the estimation of the variances at each level. If the
estimation procedure used overestimates or underestimates either level of the multi-level
model based on the distribution, overdispersion will be present and this highlights the
need for better parameter estimations in small samples. All models in this work used
binomial distributions and overdispersion can be recognized when the observed variance
is larger than the expected binomial variance (Skrondal & Rabe-Hesketh, 2007).
Overdispersion can be handled by using one of two methods: (a) introduce a Level-1 random effect in the linear predictor (Nagler et al., 2008; Raudenbush & Bryk, 2002; Skrondal & Rabe-Hesketh, 2007) or (b) use a different estimation procedure (e.g., quasi-likelihood) with a modified variance function (see Raudenbush & Bryk, 2002). Skrondal and Rabe-Hesketh (2007) recommended adding a random effect (error term) at Level-1.
Like overdispersion, underdispersion can be an issue as well (although not as
common). Underdispersion is defined as having less variation in the data than predicted
(Raudenbush & Bryk, 2002). It should be noted that underdispersion is not possible for a
random effect model when using HLM software since the number of trials has to be
greater than one (Skrondal & Rabe-Hesketh, 2007). When overdispersion is suspected,
HLM software allows for a simple correction by providing an option in the Basic Settings
tab and the software in essence introduces a random effect (Option A above).
Furthermore, if the researcher suspects that overdispersion is present, then models can be run twice in HLM, once assuming overdispersion and once without overdispersion.
researcher can then check to see if estimates of the fixed effects are unwavering (Nagler
et al., 2008). Also, the likelihood function for model fit between the two options (i.e.,
with or without overdispersion) will produce one model with a lower value, suggesting
the better fitting model.
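Outside of the HLM software workflow just described, the same diagnostic idea can be approximated in other packages. The sketch below (Python with statsmodels, illustration only; the variable names and data are invented) fits a binomial model to session-level data and computes the Pearson dispersion ratio, where values well above one suggest overdispersion:

import numpy as np
import statsmodels.api as sm

# Hypothetical data: successes out of 20 intervals per session, with a phase dummy
successes = np.array([4, 5, 3, 6, 14, 15, 13, 16])
trials = np.full(8, 20)
phase = np.array([0, 0, 0, 0, 1, 1, 1, 1])
X = sm.add_constant(phase)

# Binomial GLM on (successes, failures)
endog = np.column_stack([successes, trials - successes])
fit = sm.GLM(endog, X, family=sm.families.Binomial()).fit()

# Pearson chi-square divided by residual df approximates the dispersion parameter
dispersion = fit.pearson_chi2 / fit.df_resid
print(round(dispersion, 2))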
Although models are analyzed for ABAB designs in HLM, conceptually they
generally share similar elements for testing fixed and random slopes and intercepts. One
distinct difference is that in single-case research, measures are within persons. In the
ABAB design, multiple observations on each individual are nested within the individual
(Nagler et al., 2008); that is, observations represent Level-1 data and student
characteristics represent Level-2 data. A linear regression model is described next
because of the commonality of such a model, and then approaches used for this work are
described. This is done so that the reader is comfortable with one method and can easily
transition into modeling in SCDs, which follows the same logic but uses different distributions. For one individual, a Level-1 linear regression model in HLM would look
like:
Yt = P0 + P1(Xt) + Rt
where,
Yt = the person's observed score at time t
Xt = a dichotomous variable for phase, where 0 = baseline and 1 = treatment
P0 = mean of the baseline phase
P1 = difference between baseline mean and treatment mean
Rt = error term (assumed to be normally distributed and independent)
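A minimal sketch of this Level-1 phase model for one hypothetical student, fit with ordinary least squares purely to make the coefficients concrete (the models actually used in this work are binomial and were estimated in HLM; all values are invented):

import numpy as np
import statsmodels.api as sm

# Hypothetical ABAB series for one student (invented values)
y = np.array([8, 7, 9, 8, 3, 2, 4, 3, 8, 9, 7, 8, 2, 3, 2, 4], dtype=float)
phase = np.array([0] * 4 + [1] * 4 + [0] * 4 + [1] * 4)   # 0 = baseline, 1 = treatment

fit = sm.OLS(y, sm.add_constant(phase)).fit()
# fit.params[0] estimates P0 (baseline mean); fit.params[1] estimates P1
# (the difference between the treatment and baseline means)
print(fit.params)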
The Level-1 model will take all the data obtained and reduce it into two scores for each subject, one for the mean of the baseline phase, β0j, and the other for the difference between the baseline mean and treatment mean, β1j (Jenson et al., 2007). Then a separate linear
model is produced for each:
β0j = P00 + R0j
β1j = P10 + R1j .
According to Jenson and colleagues (2007) the equation for β0j states that the
“baseline mean for subject j equals the grand mean baseline level, P00, plus a random
effect, R0j” (p. 488). Similarly, β1j equals “the grand mean difference between baseline
and treatment phases, P10, plus a random effect, R1j” (p. 488). The difference between
the treatment effect and the grand mean treatment effect is specified as R1j. In the context
of a two-level model, a statistically significant slope would indicate a treatment effect. If
there is an indication that remaining variation exists in baseline or treatment, then
predictor variables can be added to the Level-2 model in an attempt to explain the
observed variance among subjects (Jenson et al., 2007). Reliable variance for the
predictors will yield significant p values. For example, if a study was using two
classrooms (A = 0 and B = 1), then a variable like CLASS can be created and added to
the Level-2 equation to see if the treatment was more or less effective for class A than
class B. An exploratory analysis in HLM can tell the researcher where to put the CLASS
variable, mainly whether it should be placed on the intercept or predictor variable. Below
is an example of a Level-2 predictor (CLASS) added to the model,
β1j = P10 + P11(CLASS) + R1j
Where,
P10 = Grand mean difference between baseline and treatment phases for Class A
P11 = The difference between the mean treatment effect for Class A and Class B.
One can also separately analyze the baseline means. For example, add sex (0 = male and 1 = female) to the Level-2 equation for β0j. One could test whether the mean baseline behavior increases or decreases as a function of sex (i.e., whether P01 differs from zero):
β0j = P00 + P01(SEX) + R0j
where,
P00 = grand mean baseline level for males
P01 = the difference in mean baseline level between females and males.
When assessing model fit, several approaches are available to test variation in
behavior both within and between persons and modeling alternate covariance structures
(Raudenbush & Bryk, 2002). For example, constraining the random effects can be used
to explore the within subject variation and how it interacts with the intercepts and slopes
(Nagler et al., 2008).
In HLM, modeling count or proportional data would yield equations similar to the
above, except the distributional shape would change the meaning of the output. For
example, the binomial distribution would describe the log odds (logit) scale and the
Poisson would describe the log scale of the behavior (Nagler et al., 2008).
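To make the two-level structure concrete before turning to the specific models, the sketch below fits a random-intercept, random-slope model across several hypothetical students with a linear mixed model; it is a simplified Gaussian stand-in for the binomial HGLMs actually used in this work, and every name and value is invented:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulate four hypothetical students, each with 16 ABAB observations
rng = np.random.default_rng(1)
rows = []
for student in range(4):
    for t in range(16):
        phase = (t // 4) % 2                      # blocks of four sessions: A, B, A, B
        y = 8 - 5 * phase + rng.normal(0, 1)      # hypothetical drop of 5 during treatment
        rows.append({"student": student, "phase": phase, "y": y})
data = pd.DataFrame(rows)

# Observations (Level 1) nested within students (Level 2); the phase slope varies by student
model = smf.mixedlm("y ~ phase", data, groups="student", re_formula="~phase")
fit = model.fit()
print(fit.params["phase"])    # average baseline-to-treatment shift across students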
Three HLM Models for ABAB Designs
The models described previously, although similar to the linear models typically
used in software for multi-level model, will slightly change in interpretation depending
on the distribution used. They were described to provide a reference and a starting point
for further discussion. The remaining models are based upon one and two-level models
with the binomial distribution. Nagler and colleagues (2008) discuss three types of
models used to answer questions pertaining to strengths of associations, variations, and
predictors associated with the dependent variable in ABAB designs. For the purposes of
this discussion, a dependent variable is discussed in terms of whether a behavior occurred
or not, but in principle other types of dependent variables can be examined with this
modeling approach. These three models will be applied to this dissertation.
The Full Non-Linear Model with Slopes includes all Level-1 predictors but no
Level-2 predictors (Raudenbush & Bryk, 2002). The aim is to find variables that are not
contributing anything to understanding the data and remove them to promote parsimony
(Nagler et al., 2008). Here, we would expect the slope associated with time to be not
significant. This suggests that the baseline trend is flat, or not changing over time
(Nagler et al., 2008). Furthermore, a null result for any interaction involving change in sessions/days between phases suggests the trend during treatment is also flat (i.e., does not change over time).
Next, the Simple Non-Linear Model without Slopes would retain any variables
that were significant in the Full model and allow the error term for all students to vary
from the grand mean (Raudenbush & Bryk, 2002). Coefficients in this model allow the
researcher to compute probabilities for observing target behaviors during baseline and
treatment phases. This test will allow the researcher to see if any Level-2 variable(s)
contribute to the model in an exploratory fashion (Nagler et al., 2008). This exploratory
analysis will determine whether any Level-2 variables contribute to understanding
variance in the outcome. If so, then these Level-2 variables will be added to the last
model (Nagler et al., 2008).
Last, the Simple Non-Linear Model with any Level-2 predictors included is
tested. If any Level-2 predictors are significant, then probabilities for the target behavior
can be calculated under the Level-2 conditions (e.g., Classroom A = 0, Classroom B = 1).
That is, probabilities of the behavior occurring in each class can be computed and
compared. Further, Nagler and colleagues (2008) indicate that the change in between-subject variation from one model to the next can be observed from the variance
components tables in the HLM output. In the ‘final estimation’ of the variance
components, any between subject variations in estimates of the intercept will yield a p
value which tests the null hypothesis that baseline averages for all subjects are similar. A
significant p value suggests that the variance is too big to assume only estimation error.
The between subject variances in phase effect produces a p value that tests the null
hypothesis that on average, the probability of a behavior occurring is similar for all
subjects. Additionally, any Level-2 predictors that are significant may indicate that they
contribute to the estimation of the outcome measure.
Other models could be designed to test how much subjects vary from average
expectations between subjects (Nagler et al., 2008). Described before, constraining
random effects to zero, where intercepts and phases are not allowed to vary across
subjects will allow the researcher to compare model fit and decide if other constraints are
necessary (Nagler et al., 2008). Some statistical evidence suggests that standard errors
and estimation of the fixed effects are robust to violations of homogeneous residual errors
at Level-1, var(rij) = σ² (Raudenbush & Bryk, 2002). However, the random slopes,
heterogeneous Level-1 variance model has a different variance at each time point and
tests if variances depend systematically as a function of any Level-1 or Level-2 predictors
(Raudenbush & Bryk, 2002).
Variances in Two-Level Models
The variance-covariance matrices in these models are based on the variances and
covariances at Level-1 and Level-2. Maximum likelihood estimation will be affected by
the fixed and random effects incorporated into the model (Raudenbush & Bryk, 2002).
Maximum likelihood is an estimation procedure that selects values that make the
observed results most probable. Depending on whether the actual data are balanced or
unbalanced, the covariance structure will be estimated differently (Raudenbush & Bryk,
2002). To have balanced designs, each Level-2 unit must have the same sample size and
the predictors need identical distributions (Raudenbush & Bryk, 2002). In SCDs, where
predictor variables (otherwise known as time-varying covariates) can take on different sets of values for each person, it may be that no two participants have the same behavioral pattern at time t (Raudenbush & Bryk, 2002). That is, at one time point, there
may be only one person having a behavior. Hence, the model allows for a heterogeneous
variance-covariance matrix across persons as a function of individual variation in
exposure time. In SCDs, the number of sessions or x-axis data points can differ between
people. This may be due to students missing school days, or extended baselines for some
individuals (oftentimes baselines are extended if initial problem behaviors are not
observed or if data patterns are highly unstable).
Data extracted from SCDs do not have approximately normal sampling
distributions or the same variances (Raudenbush & Bryk, 2002). Most commonly, the
Level-1 and Level-2 error residuals are assumed to be independently and normally
distributed with a mean of zero and constant variance, σ² (Raudenbush & Bryk, 2002;
Van den Noortgate & Onghena, 2003). The mean and variance do however depend on
the distributional shape of the data. Given the nature of behavioral studies, the Poisson
distribution with constant exposure is chosen when dealing with count data and a
binomial distribution is used when the outcome is a proportion (Nagler et al., 2008). The
distributional differences will affect interpretation in HLM as Poisson distributions
produce estimates on a log scale and binomial distributions estimates are on a log odds or
logit scale (Nagler et al., 2008).
Estimation procedures. In the past, the two most commonly used estimation
procedures have been Full maximum likelihood (MLF) and Restricted maximum
likelihood (MLR) (Raudenbush & Bryk, 2002). Even though MLR is more accurate with
smaller sample sizes, Raudenbush and Bryk (2002) recommend using MLF in order to
ascertain model fits using likelihoods so as to handle the (typically) small number of
Level-2 units. When there is a small number of Level-2 units (J), the MLF variance
estimates will be smaller than MLR by a factor of approximately (J-F) / J, where F is the
total number of elements in the fixed effects vector (Raudenbush & Bryk, 2002). It may
help to point out that a vector in mixed models can be composed of observations, fixed
effects, random effects, and random errors where the fixed effects are specific to the
individual. If the number of observations (nj) for which each variance components (σj²)
is being estimated, usually the sample size in each level needs to be large. Using
ordinary least squares (OLS) is too conservative for small sample sizes, and Raudenbush
and Bryk (2002) warn both estimation procedures, MLF and MLR, are too liberal.
Another approach is to use an exact t-test distribution under the null hypothesis where
OLS is used to compute β estimates, but here likelihood functions cannot be created to
compare models. Either way, the standard approaches for linear model estimation
procedures cannot be used for two-level or three-level “nonlinear” models, instead closer
approximations to ML are advised.
Closer approximations to ML are available in software packages that have more
options for estimation procedures, such as the Gauss-Hermite quadrature, where the
random effect of μ is centered on an approximate posterior mode rather than the mean of
zero (Pinheiro & Bates, 1995). If the random effects are large, then this procedure
provides more accurate results for parameter estimation (Raudenbush & Bryk, 2002).
But, in keeping with two-level hierarchical generalized linear models, the Penalized
quasi-likelihood (PQL) and MLF are more appropriate with the algorithm being
“expectation-maximization” (EM) or Laplace approximation with Fisher scoring
(Raudenbush & Bryk, 2002). The estimation procedures all maximize differently, for
example, the PQL maximizes the likelihood based on the Level-2 fixed effects based on
initial parameter estimates and variance-components (Raudenbush & Bryk, 2002).
Recall that for two-level hierarchical analysis, three kinds of parameters need to
be estimated: (1) fixed effects, (2) random Level-1 coefficients, and (3) variance-covariance components (Raudenbush & Bryk, 2002). The estimating procedures for
these parameters entail the use of MLF or MLR. In SCD research using the ABAB
design, examples of Level-1 predictors are a case/student identifier, behavior (outcome
variable), time (independent variable), phases (baseline, treatment), and any interaction
terms (Nagler et al., 2008). Examples of Level-2 predictors are any subject
characteristics such as cognitive ability, chronological age, or sex that could contribute or
explain the dependent variable.
In sum, for a small number of subjects in single-case research, HLM software
provides two options, MLR and MLF, with MLR as the better option (Nagler et al., 2008;
Raudenbush & Bryk, 2002). The MLF and MLR estimates will produce similar Level-1
variances; however, differences lie with the estimation of Level-2 variances (Raudenbush
& Bryk, 2002). When the Level-2 sample is small, the MLF variance estimates will be
smaller than MLR estimates making the MLR the preferred estimation procedure with
small sample sizes (Raudenbush & Bryk, 2002). Again, to get likelihoods for model fit
analyses, MLF must be used (Nagler et al., 2008; Raudenbush & Bryk, 2002).
Hierarchical Generalized Linear Models (HGLMs)
Hierarchical Generalized Linear Models (HGLMs) are extensions to hierarchical
models for count data where the models allow the interpretation of probabilities
(Raudenbush & Bryk, 2002). Hierarchical generalized linear models are a broad class of
models that can be expressed in a common way. Individual models, such as the logistic,
Poisson, linear, and so on, can be represented by linear functions and can be
distinguished by a link function and by the probability distribution used to model errors.
Link functions are transformations for the Level-1 predicted value so that they are
constrained within given intervals. According to Raudenbush and Bryk (2002) the
binomial sampling is the log odds of success, modeled with the following formula:
Yij = log[φij/(1-φij)] ,
where Yij is the log odds of success. If the probability of success, φij, is 0.5, the odds of
success equals 1.0 and the log-odds is log(1), which equals 0 (p. 295). With the P coefficients incorporated into the equation below,
Yij = P0j + P1jX1ij + P2jX2ij + … + PpjXpij .
It is possible to generate a predicted log-odd (Yij) for any case. These predicted log-odds
can be converted to odds by taking the exp(Yij) (Raudenbush & Bryk, 2002). A
probability can be computed from the following formula,
φij = 1 / (1 + exp{-Yij}) .   (3)
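For example, a hypothetical predicted log-odds of Yij = 1.10 corresponds to odds of exp(1.10) ≈ 3.00 and a probability of φij = 1 / (1 + exp(–1.10)) ≈ .75.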
Another threat to single-case research is autocorrelation or serial-dependency
(Krishef, 1991; Parker et al., 2006). Autocorrelation not only represents a violation of the
independent errors assumption, but can also alter effect size interpretation (Parker et al.,
2006). Mainly, autocorrelation occurs when the baseline is unstable or displays a trend
(Parker et al., 2006). If the autocorrelations are positive, the standard errors will be
reduced and Type I error rate is increased (Crosbie, 1987; Manolov, Arnau, Solanas, &
Bono, 2010; Parker et al., 2006; Scheffé, 1959; Suen & Ary, 1987). Statistical tests
within HLM for baseline and treatment trend were described previously in the context of testing the null hypothesis of flatness. In the current study, if the data reveal a trend in baseline,
the article will be rejected. Recall, this is customary given the standards and guidelines
already in place for SCDs by the WWC.
Some Effect Size Options in SCDs
Effect size estimates in SCDs are the standardized difference of behavior change
between phases (Parker, Vannest, & Brown, 2009). Effect size estimate formulas for
SCDs are still being developed (Shadish et al., 2010), and Maggin, Chafouleas, Goddard,
and Johnson (2011a) suggest comparing the results of multiple approaches. Each effect
size estimate is calculated differently because researchers either visually assess graphs or
statistically analyze data. Disparate findings are not unusual, and even statistical
approaches can produce inconsistent results. In a meta-analysis of quantitative methods
for SCDs involving students with disabilities, Maggin and colleagues (2011b) assessed
effect size estimates and subject characteristics for 68 SCDs. Mainly, the meta-analysis
was an inventory of the methods used, the subject characteristics reported, and the effect
sizes published to determine treatment effectiveness. Among the results was the
increased interest in high and low incidence disabilities in SCDs, variability in the
methods used to assess SCDs, and finding two primary effect size estimates reported in
the literature: (1) the PND, and (2) the SMD. Other effect size indices were used in the
Maggin and colleagues’ study (2011b), but no other index accounted for more than 10% of the synthesis.
Further, in a manner similar to this dissertation, methodological rigor was tied to meeting WWC criteria; only 24 articles were used in the Maggin and colleagues (2011a) study and no Level-2 data were collected. The Maggin and colleagues (2011a) study analyzed treatment interventions using token economies, drawing on AB, reversal, and multiple-baseline designs, and did not analyze anything further than effect size estimates and treatment effectiveness. Four
effect size estimates were compared. The two nonparametric effect size estimates were
the PND and the Improvement Rate Difference (IRD) and the parametric effect size
estimates used followed the SMD and raw-data multilevel effect size. For the
nonparametric procedures, the PND and IRD can be used to estimate treatment
effectiveness. The PND is an older, nonparametric effect size index.
The PND is calculated using this formula:
(1 – Percent of Treatment Overlap Points) × 100 = PND .   (4)
Percent of Treatment Overlap Points is defined as the total number of treatment data
points that are “less extreme or equal to the greatest data point in baseline divided by the
total number of data points in the treatment phase” (Maggin et al., 2011a, p.10).
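A minimal sketch of the PND computation under this definition, using invented on-task percentages in which higher values indicate improvement (Python, illustration only):

import numpy as np

baseline = np.array([20, 30, 25, 35])        # hypothetical baseline percentages
treatment = np.array([50, 60, 30, 55, 65])   # hypothetical treatment percentages

# Treatment points that fail to exceed the largest baseline point count as overlapping
overlap = np.mean(treatment <= baseline.max())
pnd = (1 - overlap) * 100
print(pnd)    # 80.0 for these invented values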
The PND is widely reported in the SCD literature and school psychology
(Scruggs & Mastropieri, 2001), but it has weak qualities (Kratochwill et al., 2010;
Maggin et al., 2011a; Shadish, Rindskopf, & Hedges, 2008). Computationally the PND
is simple, but the procedure requires researcher judgment in the context of close calls and
decisions about whether a data point overlaps can be inconsistent. Another limitation is
that, as the number of data points increases, the PND value tends to decrease since there
is more of an opportunity to demonstrate an overlap. This makes it difficult to compare
PND results from one study to another unless they have the same number of observations
(Allison & Gorman, 1994). In addition, the PND does not correlate highly with other
effect size indices (Maggin et al., 2011a; Shadish, Rindskopf, & Hedges, 2008). The IRD
may be a better approach because it uses an improvement rate difference. The IRD is
calculated by the difference between improvement rates (IR) for the treatment and
baseline phases (Maggin, Swaminathan, Rogers, O’Keeffe, Sugai, & Horner, 2011),
referred to in this dissertation as IRT and IRB, respectively. IRT is calculated as the number of
data points indicating a study participant is performing better, relative to baseline, divided
by the total number of observations in the given treatment phase (Parker, Vannest, &
Brown, 2009). An ABAB design would yield two IRT estimates and these are averaged.
In the presence of a study with no overlapping data points, IRT would equal 100%. The
IRB is derived by dividing the number of baseline data points where a student performs equal to or better than in the subsequent treatment phase by the total number of baseline observations. A study with no
overlapping data points between phases would yield an IRB of zero. IRB estimates are
also averaged. The difference between the two improvement rates yields IRD (Cochrane
Collaboration, 2006; Sackett, Richardson, Rosenberg, & Haynes, 1997). The formula
below displays the IRD calculation:
IRT – IRB = IRD   (5)
where,
IR = improvement rate
T = treatment phase
B = baseline phase.
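For a hypothetical ABAB study in which the averaged improvement rates were IRT = .90 and IRB = .10, the IRD would be .90 – .10 = .80, or 80%.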
A 100% on the IRD would mean that data in the baseline and treatment phases do
not overlap and this indicates a highly effective treatment. By contrast, when the IRD is
50%, there is considerable overlap in performance between phases (Parker et al., 2009).
The IRD was calculated for this study even though no article used this estimate. This
was mainly done to compare it to the other effect size estimates.
The parametric procedures used by Maggin and colleagues (2011a) to calculate
effect sizes were the SMD and a regression-based, multilevel approach (RDMES). The
first parametric effect size used between baseline and treatment phases was the SMD
(Maggin et al., 2011a). The SMD was calculated by taking the difference between the treatment mean (XT) and the baseline mean (XB) and dividing it by the baseline standard deviation (Busk & Serlin, 1992). The formula is as follows:
(XT – XB) / s = SMD   (6)
Where,
XT = treatment average
XB = baseline average
s = baseline standard deviation.
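As a hypothetical illustration, with a treatment average of XT = 55, a baseline average of XB = 25, and a baseline standard deviation of s = 10, the SMD would be (55 – 25) / 10 = 3.0, which would exceed the 2.0 criterion for an effective intervention noted below.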
The baseline standard deviation is used instead of the pooled standard deviation
(which is characteristic of group-based studies for effect sizes) because it allows certain
assumptions concerning the normality and independence of the data to be relaxed (Busk
& Serlin, 1992; Maggin et al., 2011a; White, Rusch, Kazdin, & Hartmann, 1989). The
criteria of ‘effective intervention’ would be a calculation of 2.0 or above on the SMD
(Jenson et al., 2007).
Although R2 results are presented below, they should be interpreted cautiously in
SCDs. R2 was calculated to compare the proportion of the variation explained by phase
differences using OLS regression (see Parker & Hagan-Burke, 2007). Typically, in linear
regression, R2 is a measure of goodness of fit (Parker & Hagan-Burke, 2007). A 1.0
indicates a perfect fit of the regression line to the actual data. The R2 formula is as follows: R2 = 1 – (SSerr / SStot), where SSerr is the sum of squares of the residuals and SStot is the total sum of squares (i.e., the variability of the data around the sample mean). Again, count data
and R2 require different interpretations for SCDs. For this study, R2 was calculated using
OLS regression and interpreted as the proportion of the variation explained by the phase
differences (i.e., between baseline and treatment averages).
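A minimal sketch of this R2 computation, regressing invented outcome values on a phase dummy with OLS (illustration only; the data correspond to no study in this work):

import numpy as np
import statsmodels.api as sm

# Hypothetical ABAB outcome values and their phase codes (invented)
y = np.array([20, 30, 25, 35, 50, 60, 30, 55, 65, 22, 28, 26, 58, 62, 54], dtype=float)
phase = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1])

fit = sm.OLS(y, sm.add_constant(phase)).fit()
print(round(fit.rsquared, 2))   # proportion of variance explained by phase differences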
The literature suggests using multiple effect size metrics for comparative purposes
(Kratochwill et al., 2010; Maggin et al., 2011a). The results of the Maggin and
colleagues (2011a) study found effective interventions and comparable effect sizes using
four effect size estimates. The class was the unit of analysis in the token economy study;
all effect size estimates indicated treatment effectiveness, with the PND at 83.09%, the IRD at 63.92%, the SMD at 4.57, and the RDMES at 9.37 (Maggin et al., 2011a). The regression
technique included separating out the different baseline and treatment conditions and
dividing those standardized scores by the root mean squared error (Maggin et al., 2011a;
Van den Noortgate & Onghena, 2008). Recommendations from the authors indicate the
need for more stringent standards for methodological design to meet more WWC criteria
and for reporting more descriptive information (Level-2 data) in general. It was suggested
by Maggin and colleagues (2011a) that other researchers should provide more subject
level characteristic data, so the likelihood that a student will respond to token economies
is known. Also, even though the study focused on token economies, some studies had
different implementation techniques (Maggin et al., 2011a) thus limiting treatment
generalizability.
Additionally, Parker and colleagues (2009) took the IRD from 166 published data
series contrasts (AB) and correlated it to R2, Kruskal-Wallis W, and the PND. The
Kruskal-Wallis W is the most powerful nonparametric technique available (Parker et al.,
2009). The IRD correlated with the Kruskal-Wallis W at .86, with R2 at .86, and with the PND at .83. The strongest correlation was between R2 and Kruskal-Wallis W
(.92) and the weakest was between R2 and the PND (.75). It should be noted that
although some of these procedures have similar interpretations, it is uncharacteristic to
compare between the estimates (Parker et al., 2009). In fact, of the 166 published data series, two thirds failed to meet equal variance or normality assumptions and two thirds had autocorrelation issues (Parker et al., 2009).
Effect sizes for meta-analyses and comparisons of SCD and group studies.
The d-estimator (Shadish et al., 2010) should be separated from the other effect size (ES)
estimates as a meta-analytic approach because it is specifically developed to generate
SCD effect sizes that can be compared to group-base studies (it is however still under
development). This estimate standardizes the between-case variability, not the within-case variability (Shadish et al., 2010). Future work on this formula could adjust for
autocorrelation issues and incorporate pooled standard deviations in the denominator
(Shadish et al., 2010). Across disciplines, the magnitude of treatment effects varies
(Parker, Brossart, Vannest, Long, Garcia De-Alba, Baugh, & Sullivan, 2005), thus limiting generalization. Rosnow and Rosenthal (1989) agree, adding that ES may be context-dependent, not to mention different across similar treatments (Brand et al., 2011). One
way to compare across studies would be to include raw data (McDougal, Narkon, &
Wells, 2011). Researchers would be able to compare more studies, especially the ones
who failed to calculate an effect size or report descriptive statistics. In an examination of
behavioral self-management (BSM) techniques, McDougall, Skouge, Farrell, and Hoff
(2006) found that only 1 of 38 studies using an SCD calculated an ES. Afterwards,
McDougall and colleagues (2006) suggested that all researchers either report an ES or
provide the data necessary to calculate the ES. Most notably, the original researchers did
not include descriptive statistics, such as pooled or within-phase standard deviations
necessary to calculate ES indices. Maggin and colleagues (2011b) found a lack of
reported data in their meta-analyses of SCDs that focused on interventions for students
with disabilities. The synthesis started with 682 abstracts and from those 87 candidate
studies matched specific criteria (e.g., publication date, disability related, single-subject).
Application of the inclusion criteria eventually eliminated 26 more studies due to the
lack of quality (e.g., not a review, disability status not clear, effect sizes not reported).
Approximately 30.77% (n = 8) of the eliminated articles did not report ES. The
remaining 61 articles were selected to review for patterns, themes, and similar subject
characteristics and an ancestral search included 7 more studies (n = 68). The most
frequent way of assessing treatment effect was by calculating a mean of the baseline and
treatment phases (29.41%, n = 20) and the second most common method of estimating a
treatment effect was to calculate effect size at the subject level, not aggregated across
subject participants (20.59%, n = 14).
Power is low in single-case research given the small sample size (Manolov et al.,
2010; Nagler et al., 2008) and Monte Carlo applications are providing helpful design
techniques to better understand the relationship between studies (Coe, 2002; Jenson et al.,
2007; Manolov et al., 2010; Raudenbush & Liu, 2004). Power analysis software called PinT (Snijders, Bosker, & Guldemond, 2007) can determine standard errors and suggest optimal sample sizes for two-level models in HLM. The newest (Windows) version, 2.12, can be downloaded from Snijders' website.
Type I and II error rates in SCDs
Several Monte Carlo applications have been used in the context of single-case
research, including examination of statistical power (Manolov et al., 2010; Raudenbush
& Liu, 2004) and Type I error rates (Coe, 2002; Jenson et al., 2007). Controlling for
Type I error rates remains a concern because it would be undesirable to state an
intervention is effective when it is in fact not. Monte Carlo simulations have been
conducted to understand different features of SCDs. Jenson and colleagues (2007) used
Monte Carlo simulation to vary the number of data points in the baseline and treatment
phases, the type and magnitude of autocorrelation, and the number of subjects. Findings suggest that including random effects in Level-2 equations will provide some protection against Type I errors across several different scenarios (again, based on different autocorrelation magnitudes and numbers of subjects).
Further, Manolov and colleagues (2010) conducted a simulation to generate AB
design data to test the performance of four regression-based techniques including general
trend and autocorrelation. Primarily, estimation procedures were compared to detect
existing effects and false alarm rates. The procedures compared were ordinary least
squares (OLS), generalized least squares estimation (GLS), differencing analysis (DA),
and trend analysis (TA) using three different distributions: normal, negative exponential,
and Laplace. Results indicated that OLS estimation of the regression parameters could not be recommended for short series designs. Choosing the correct regression model and
controlling autocorrelation and trend did not guarantee useful p values for treatment
effects. OLS and GLS estimation procedures were useful only when there was no trend
in the data and data series were independent, but OLS appeared to be more sensitive to
treatment effects than TA or DA. In TA, the correction for both trend and autocorrelation was so severe that it overcorrected the data and removed variability produced by the intervention. Lastly, DA had trouble detecting treatment effects when the effect was a change in slope, although it did not falsely identify ineffective interventions as effective, even when more measurements were available.
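To make the general logic of such simulations concrete, the following minimal sketch (not the actual procedure used by Jenson et al., 2007, or Manolov et al., 2010) generates short AB series with lag-1 autocorrelation under a true null effect and estimates the false alarm (Type I error) rate of a simple OLS-style phase comparison; the phase lengths, autocorrelation value, and number of replications are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)

def simulate_ab(n_a=5, n_b=5, phi=0.3):
    """One AB series under the null (no treatment effect) with lag-1 autocorrelation phi."""
    n = n_a + n_b
    e = np.empty(n)
    e[0] = rng.normal()
    for t in range(1, n):
        e[t] = phi * e[t - 1] + rng.normal()
    phase = np.array([0] * n_a + [1] * n_b)
    return phase, e  # the outcome is pure noise, so any detected "effect" is a false alarm

def phase_p_value(phase, y):
    """p value for the phase difference (equivalent to OLS with a single phase dummy)."""
    return stats.ttest_ind(y[phase == 1], y[phase == 0]).pvalue

n_sims = 5000
false_alarms = sum(phase_p_value(*simulate_ab()) < .05 for _ in range(n_sims))
print(false_alarms / n_sims)  # with positive autocorrelation this tends to drift above .05
```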
In conclusion, power in SCDs is influenced by several characteristics: trend,
autocorrelation, type of distribution, and estimation procedures in place. Power is an
enduring issue in SCDs, but Manolov and colleagues (2010) suggested that positive autocorrelation alters the Type I error rates of the GLS estimate, which indicates the need to apply a data correction procedure iteratively rather than all at once.
Chapter Summary
Visual analysis is a powerful technique in determining treatment effect and is still
the dominant mode of analysis in SCDs. Treatment effectiveness can be assessed
visually and with parametric procedures. In comparison with visual analysis, multi-level
modeling may provide an alternative and quantifiable approach to determining treatment
effect. Using multi-level models requires that subjects/classrooms/units are nested
(Raudenbush & Bryk, 2002). These analyses provide accurate estimates of standard errors and yield more information about the outcome variable and the contribution of predictor variables (Raudenbush & Bryk, 2002). For this study, the Level-1 data in HLM contain multiple data points on a single person, which are then matched to the Level-2 data (via an identifier variable). Of course, the Level-1 data set is larger than the Level-2 data set, since UnGraph counts each point in the ABAB graph as an observation at each exposure time. Data from each subject are eventually aggregated together to form one file, making the Level-1 files larger.
Using multi-level modeling is tedious. Only nine articles were used in the current study due to the time-consuming collection of data and the fact that Level-2 data were also analyzed, making the process even more demanding given the use of multi-level modeling.
Several parametric threats to validity such as autocorrelation, overdispersion, and power
are still being tested with Monte Carlo simulation for studies with small samples (Nagler
et al., 2008).
Used in tandem, visual analysis and statistical modeling can determine the magnitude of treatment effectiveness and provide more than one approach for the assessment of SCDs (Kratochwill et al., 2010; Maggin et al., 2011a). Unfortunately, with blended data collection procedures, interpretation can become unclear. Effect size estimates per study are still a concern, and efforts concerning meta-analysis techniques in SCDs are ongoing as well. Of the three effect size estimates (the PND, IRD, and SMD), the PND is the most widely used effect size in SCD research (Maggin et al., 2011a) and was also the most frequently reported estimate in the articles chosen, so it will be used in this sensitivity analysis. The SMD will
also be used in the current study because this measure was reported in some of the
original articles, providing an opportunity to conduct sensitivity analyses and a separate
check to address Research Question 4 (if SMDs from digitized data are similar to those
reported in base articles then this would help validate the use of UnGraph). Again, the
IRD will be calculated, but mainly as an added benefit for further sensitivity analyses
since it was not calculated for any of the nine original articles.
Chapter Three: Methodology
This chapter describes the research design and analytic procedures used to
address the research questions. The purpose of this study was to conduct a sensitivity
analysis comparing standardized visual analyses (based on recent WWC
recommendations), multi-level models, and results presented by original authors. The
key focus of the sensitivity analysis was on whether the three sources of information
consistently report the presence of a treatment effect. Secondary questions examined
Level-2 predictors (whether they explain variance in treatment impacts across subjects),
documented the magnitude of treatment impacts, and checked on the validity and
reliability of the UnGraph procedure. This last effort was made in an attempt to add to
the evidence base on whether digitizing data could be done consistently across different
analysts.
As described in Chapter One, the research questions were as follows:
Primary Question
1) Does quantification and subsequent statistical analysis of the selected ABAB graphs produce conclusions similar to visual analyses? Recall,
comparisons were made between independent WWC visual analyses and those of
the study’s original authors.
Secondary Questions
2) Do any of the subject characteristics (Level-2 data) explain between-subjects
variation? If so, can this information be used to yield new findings?
3) Do the PND and IRD (nonparametric indices), SMD (parametric indices) and R2
yield effect sizes that are similar to the ones reported in the original studies?
4) Is UnGraph a valid and reliable tool for digitizing ABAB graphs?
The analyses conducted for this work followed a series of steps.
Article Selection and Descriptions
ABAB single-case designs that meet WWC standards were selected for this work.
The reason for this is that the WWC only performs visual analyses on studies that meet design criteria, which standardizes the application of visual analysis procedures across each report. To clarify, there is no way to be sure that the original authors consistently applied visual analysis procedures or that they used current recommendations. The WWC approach, by contrast, standardizes procedures, which sets the stage for comparing and contrasting results across the two key methods of interest. Identifying articles that should meet standards is not a trivial step; seventy-eight articles were originally found, and only nine were chosen for this dissertation. There were two general reasons why some articles were not included in this work:
(a) Some articles did not meet a WWC standard, for example by having fewer than three data points per phase or by failing to report inter-observer agreement. Keep in mind the desire to use standardized visual analysis approaches; the WWC does not analyze studies that do not meet its design standards.
(b) Some articles reported results for only one student. Reporting on only one
subject prevents addressing the first research question since the HLM analyses will not
converge while working with a single participant. Also, the second research question
concerning whether subject characteristics explain variance in the dependent variable
could not be answered.
As noted above, only articles that examine the impacts of interventions on
children with behavioral issues were used. Characteristics of the children in the selected articles include social delinquency and aggressive, withdrawn, depressed, disruptive, disobedient, or acting-out tendencies.
Since this work only examined articles that should meet WWC standards, it is
important to review study features that would meet the standards. These include:
• The systematic manipulation of the independent variable.
• Inter-rater agreement (generally called interobserver agreement) documented on the basis of a statistical measure of consistency between raters, for example percentage/proportional agreement (0.80 - 0.90 on average) or Cohen's kappa coefficient (0.60); percentage agreement reflects simple observer agreement, while Cohen's kappa accounts for chance agreement.
• A minimum of three data points per phase. Note that studies with as few as three data points in a phase can meet standards "with reservations," so studies with this limitation were included.
• An attempt to demonstrate an effect at three different points in time.
Again, only articles that could meet these standards were included in the study, to remove any question around whether WWC analysis procedures could be applied to these studies.
ERIC, Academic Search Complete, PsycINFO, and Social Science (Education
Full Text) were searched using the key words: SCD, ABAB design, behavior and
emotional disturbance. Seventy-eight articles focusing on classroom-based behavioral
interventions were found, and of these, nine used an ABAB design and satisfied WWC
criteria. The articles selected for this study are mainly concerned with decreasing off-task behavior or increasing participation in the classroom. Several of the articles loosely define children with disabilities, and some include children with disabilities integrated with typically developing children. The use of only nine articles reflected a combination of the time-consuming procedures required for this study, the nature of a methodological dissertation, and the scarcity of articles that meet all the required WWC standards and the specifics of the ABAB design. Further, this work was primarily interested in providing
sensitivity analyses, not generalizations.
The articles chosen, and the design rating that was applied using the WWC
standards, are displayed in Table 1. Three of the articles that met standards with reservation had fewer than five data points per phase, and one study had 18% (instead of 20%) agreement in phases. Due to the limited number of articles that meet all the requirements of the WWC, the decision was made to include these articles.
Table 1
Titles of Articles and WWC Standards

1) Amato-Zech, Hoff, & Doepke (2006). Increasing On-Task Behavior in the Classroom: Extension of Self-Monitoring Strategies. N = 3. Meets standards.
2) Cole & Levinson (2002). Effects of Within-Activity Choices on the Challenging Behavior of Children with Severe Developmental Disabilities. N = 2. Meets standards.
3) Lambert, Cartledge, Heward, & Lo (2006). Effects of Response Cards on Disruptive Behavior and Academic Responding during Math Lessons by Fourth-Grade Urban Students. N = 9. Meets standards.
4) Mavropoulou, Papadopoulou, & Kakana (2011). Effects of Task Organization on the Independent Play of Students with Autism Spectrum Disorders. N = 2. Meets standards with reservation.
5) Murphy, Theodore, Alric-Edwards, & Hughes (2007). Interdependent Group Contingency and Mystery Motivators to Reduce Preschool Disruptive Behavior. N = 8. Meets standards with reservation.
6) Ramsey, Jolivette, Puckett-Patterson, & Kennedy (2010). Using Choice to Increase Time On-Task, Task-Completion, and Accuracy for Students with Emotional/Behavior Disorders in a Residential Facility. N = 5. Meets standards with reservation.
7) Restori, Gresham, Chang, Lee, & Laija-Rodriquez (2007). Functional Assessment-Based Interventions for Children At-Risk for Emotional and Behavioral Disorders. N = 8. Meets standards with reservation.
8) Theodore, Bray, Kehle, & Jenson (2001). Randomization of Group Contingencies and Reinforcers to Reduce Classroom Disruptive Behavior. N = 3. Meets standards.
9) Williamson, Campbell-Whatley, & Lo (2009). Using a Random Dependent Group Contingency to Increase On-task Behaviors of High School Students with High Incidence Disabilities. N = 6. Meets standards.
Visual Analyses of Selected Articles
Standards put forth by the WWC list six features for assessing intervention
impacts by visually examining between and within phase patterns of data. These are
described above but are summarized again for this chapter. The term “between phase”
denotes the idea of comparing and contrasting data between adjacent phases in a study.
For example, does the number of tantrums a child exhibits at baseline differ from the
number seen after onset of treatment? Within phase analyses consider information such
as data trends and variability within a given study phase (Horner et al., 2005), such as
baseline or a given treatment phase. Visual analyses consider a series of perspectives
which include: (1) level, (2) trend, (3) variability, (4) immediacy of the effect, (5)
overlap, and (6) consistency of data patterns across similar phases. The six features
assess patterns across all phases of the design (Kratochwill et al., 2010). Levels are the
means within phases whereas trend refers to the slope of the regression line, or line of
best fit found within each phase. Consideration of variability is no different from the
commonly understood definition of the term, keeping in mind of course that it is often
assessed via visual approaches as opposed to the application of descriptive statistics.
Immediacy of effect is the comparison of the last three points in the baseline phase to the
first three points in the intervention phase. Immediacy of the effect differentiates patterns
between phases. Assessing immediacy of the effect helps establish the presence of a treatment effect because rapid change in performance after the onset or removal of treatment makes it easier to discern that the treatment caused changes in the dependent variable. Overlap deals with the percentage of overlapping data between one phase and the next; the larger the separation between phases, the more likely there is a treatment effect (Kratochwill et al., 2010). To clarify, high levels of data overlap suggest
limited or no change in performance from treatment to non-treatment phases. If this is
the case, there is no evidence of a treatment impact. Lastly, consistency of data in similar
phases involves comparing the baseline phases with each other (e.g., comparing baseline
1 and baseline 2), and treatment phases with each other and looking for consistent
patterns (Kratochwill et al., 2010). The more consistency across similar phases the more
plausible a causal relation exists (Kratochwill et al., 2010).
Patterns are closely observed in visual analysis for trends. A key trend that was analyzed was the trend in the first baseline phase. Typically, a stable baseline must be present before an intervention is implemented because a pre-existing downward trend in the behavior of concern can make it difficult to distinguish levels between phases. In other words, should problematic
target behavior be diminishing on its own before treatment, it would be difficult to
conclude that introduction of an intervention is responsible for any improvement.
Differentiating an effective treatment intervention from one that is not entails
consideration of the six features already described for the visual analysis method. For
this dissertation, all treatments will be considered either non-effective or effective. In
order to be conservative any questionable treatment interventions will be considered noneffective. This is done to simplify interpretation.
Digitizing and Modeling the Data
Nagler and colleagues' (2008) manual, Analyzing Data from Small N Designs Using Multilevel Models: A Procedural Handbook, addresses the analysis of small samples in multilevel models. The manual was followed for this work, so its general analytic steps are described here. In the manual, one ABAB design was scanned from existing literature and
UnGraph was used to digitize the coordinates. Once the data were obtained, Nagler and
colleagues (2008) used SPSS to create dummy coded variables and interaction terms, and
then used HLM software to statistically compare baseline to treatment phases. Dummy
coding sets the stage for examining the presence of treatment effects; A and B phases
were distinguished using a 0/1 coding scheme. In the handbook several types of designs
were analyzed, but the ABAB design is of interest here because it can yield strong
evidence of a treatment effect given that it included a reversal to baseline. The
procedures and HLM analyses described in Chapters One and Two will be used to answer
84
part of the research questions, the effect size estimates were analyzed separately and were
compared to the original studies.
Before hierarchical models were analyzed in HLM/STATA software, SPSS was
used to transform variables. The variables that required transformation were the outcome
measures and independent variables (sessions). The outcome measure was rounded up or
down if the outcome measure was indicated as a whole number in the original article
(e.g., number of off-task behaviors). Otherwise, the outcome measure was not changed.
Recall, the dependent variables in these articles all reflect some percentage of the target
behavior in a given time period. The session variable was rounded to whole numbers.
Session was the x-axis of the graph(s) and represented the number of sessions in the
original study. Session was re-centered so that a zero was the final session of each
baseline phase. This was done for HLM interpretation, so that zeros on all Level-1
variables (e.g., phase, session, interactions) indicated the final baseline session. The
variables added to SPSS included trials, session (time), treatment, order, and interaction
terms (treatment by order, session by treatment, session by order, and a three-way
interaction between sessions, treatment, and order). A trial indicated the total number of
possible trials on each day and was not really a variable, but a constant. Order was
created to distinguish the first AB pair (0) from the second AB pair (1). This was done to permit a statistical test between the first introduction of the treatment and the second.
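As an illustration of these transformation steps, the sketch below builds the dummy-coded and interaction variables from digitized ABAB data for one subject; the data frame, its values, and the column names (subject, session, phase, outcome) are hypothetical stand-ins rather than the actual SPSS variables.

```python
import pandas as pd

# Hypothetical digitized ABAB data for one subject (values are illustrative).
df = pd.DataFrame({
    "subject": 1,
    "session": range(1, 13),
    "phase":   ["A1"] * 3 + ["B1"] * 3 + ["A2"] * 3 + ["B2"] * 3,
    "outcome": [7, 8, 7, 3, 2, 2, 6, 7, 6, 1, 2, 1],  # counts out of a fixed number of trials
})

# TRT: 0 = baseline (A phases), 1 = treatment (B phases).
df["trt"] = df["phase"].isin(["B1", "B2"]).astype(int)

# ORDER: 0 = first AB pair, 1 = second AB pair.
df["order"] = df["phase"].isin(["A2", "B2"]).astype(int)

# Re-center session so that 0 marks the final session of each baseline phase.
last_base = (
    df[df["trt"] == 0]
    .groupby(["subject", "order"], as_index=False)["session"].max()
    .rename(columns={"session": "last_base"})
)
df = df.merge(last_base, on=["subject", "order"], how="left")
df["sess1"] = df["session"] - df["last_base"]

# Two-way and three-way interaction terms.
df["trtord"]   = df["trt"] * df["order"]
df["s1trt"]    = df["sess1"] * df["trt"]
df["s1ord"]    = df["sess1"] * df["order"]
df["s1trtord"] = df["sess1"] * df["trt"] * df["order"]
```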
All of the articles produced data that followed a binomial distribution; all
outcomes were some form of a “proportion correct” variable, such as the percentage of
time a student was on-task. Use of the binomial distribution in modeling was therefore
expected and appropriate. For all analyses, standard critical p values were used, p < .05.
STATA was used when there were only two subjects in an article (there were two
such articles) because this software package was able to provide closer approximations to
estimate the parameters (results did not converge in HLM for these studies). Raudenbush
and Bryk (2002) have suggested HGLMs using Fisher scoring as the estimation
procedure (for closer approximations to ML estimates) which could aid with convergence
issues. This estimation procedure is not available in HLM for measures within persons.
All other analyses were performed using MLF, in order to compare likelihood functions
for model fit.
Recall that the binomial distribution models the number of successes Y in n trials when the trials are independent and the probability of success π is constant over trials: Y ~ Binomial(n, π) (Skrondal & Rabe-Hesketh, 2007). In generalized linear models, the distribution describes the count of successes (Yi) in ni trials for unit i, conditional on covariates xi:
Yi | xi ~ Binomial(ni, πi).
All of the articles in this study were modeled with the binomial distribution. Three types
of models were run in both HLM and STATA that mirrored procedures in Nagler and
colleagues’ (2008) manual. Overall, obtaining a model with good model fit (likelihood
function closest to zero), while considering overdispersion was the objective.
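As a minimal sketch of the assumed data-generating process (with illustrative coefficient values, not estimates from any of the nine articles), the snippet below simulates session-level counts from a binomial distribution whose log odds shift when treatment begins:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

n_trials = 10                                           # observation intervals per session (illustrative)
trt = np.array([0] * 5 + [1] * 5 + [0] * 5 + [1] * 5)   # ABAB phase indicator

# Illustrative log odds: the behavior is likely at baseline and drops under treatment.
beta0, beta_trt = 0.8, -2.5
log_odds = beta0 + beta_trt * trt
pi = 1 / (1 + np.exp(-log_odds))                        # inverse logit gives the success probability

# Y_ti | x_ti ~ Binomial(n_ti, pi_ti): counts of the target behavior in each session.
y = rng.binomial(n_trials, pi)
print(round(float(pi[0]), 2), round(float(pi[5]), 2))   # about 0.69 at baseline vs 0.15 under treatment
```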
In an ABAB design for single-case research, the Full Non-Linear Model with
Slopes included all Level-1 predictors but no Level-2 predictors. An example of a Full
Non-Linear Model with Slopes is as follows:
Level-1
Yti = P0i + P1i(SESS1ti) + P2i(TRTti) + P3i(ORDERti) + P4i(TRTORDti) + P5i(S1TRTti) +
P6i(S1ORDti) + P7i(S1TRTORDti)
Level-2
P0i = β00 + R0i
P1i = β10 + R1i
P2i = β20 + R2i
P3i = β30 + R3i
P4i = β40 + R4i
P5i = β50 + R5i
P6i = β60 + R6i
P7i = β70 + R7i
Where,
Yti =
The log odds of the dependent variable occurring (or expected number of
behaviors, out of maximum possible intervals).
P0i =
The average log odds at final baseline session for all subjects (β00), plus an error
term to allow each student to vary from this grand mean (R0).
P1i =
The average rate of change in log odds per day of observation (SESS1) during
each baseline/treatment pair (β10), plus an error term to allow each student to vary
from this grand mean effect (R1). 0 = last baseline observation.
P2i =
The average rate of change in log odds as a subject switches from baseline to
treatment phase (TRT) for all students (β20), plus an error term to allow each
student to vary from this grand mean (R2).
P3i=
The average rate of change in log odds as a subject switches from observation in
the first AB pair to observations in the second AB pair (β30), plus an error term to
allow each student to vary from this grand mean (R3).
P4i =
The average change in treatment effect as a subject switches from the first AB
pair to the second AB pair (TRTORD) for all students (β40), plus an error term to
allow each student to vary from this grand mean (R4).
P5i =
The average change in session effect (i.e., time slope) as a subject switches from
baseline to treatment phase (S1TRT) for all students (β50), plus an error term to
allow each student to vary from this grand mean (R5).
P6i =
The average change in session effect (i.e., time slope) as a subject switches from
the first AB pair to the second AB pair (S1ORD) for all students (β60), plus an
error term to allow each student to vary from the grand mean (R6).
P7i =
The average change in the differing slopes in baseline vs. treatment phases, as
subject switches from the first AB pair to the second AB pair (S1TRTORD) (β70),
plus an error term to allow each student to vary from this grand mean (R7).
The full Level-1 formula states that the log odds of any dependent variable (or the
expected number of days where the behavior was observed, out of X possible intervals) is
the sum of eight parts: the log odds at the intercept (the final baseline session), plus a
term accounting for the rate of change in log odds with implementation of the
intervention (TRT), plus a term accounting for the rate of change in log odds with time
(SESS1), plus a term accounting for the rate of change in log odds in phases from the first
phase pair (A1, B1) to the second phase pair (A2, B2) (ORDER), plus four interaction
terms (three 2-way interactions and one 3-way interaction) (Nagler et al., 2008). Because the purpose for now is only to provide an example, the definitions of these average changes in variable effects are presented exactly as Nagler and colleagues (2008) presented them in their manual. As noted in Chapter Two, the inclusion of random effects in the Level-2 equations generally protects against Type I errors; for this reason, all error terms were activated in the above Level-2 model.
In general, the aim in the Full Non-Linear Model with Slopes is to find variables
that are not contributing anything to the model and remove them for the sake of
parsimony. One modeling strategy described by Nagler and colleagues (2008) is to examine
the slope of measurement occasion in models; this slope can be referred to as “time.” The
expectation is that the baseline slope will not be statistically significant, which would
suggest that the baseline trend is flat. Furthermore, if there is no interaction between
change in session/days between phases and this slope, then this would suggest the trend is
flat within phases. This analysis is useful because it can inform researchers of subtle data trends that may be hard to discern using visual analysis alone.
A second model is the Simple Non-Linear Model without Slopes. This model
retains variables that are significant in the Full Non-Linear Model with Slopes. An
example of the Simple Non-Linear Model without Slopes from Nagler and colleagues’
manual shows only a treatment variable was retained:
Level-1
Yti = P0i + P1i(TRTti)
Level-2
P0i = β00 + R0i
P1i = β10 + R1i
where:
Yti =
The log odds of the dependent variable occurring.
P0i =
The average log odds during baseline for all subjects (β00), plus an error term to
allow each student to vary from the grand mean (R0).
P1i =
The average rate of change in log odds as a subject switches from baseline (TRT
= 0) to treatment phase (TRT = 1) for all students (β10), plus an error term to
allow each student to vary from the grand mean (R1).
The Level-2 equations model the intercepts and phase changes. The Level-1 equation
now states that the log odds of the dependent variable is the sum of two parts: the log
odds at the intercept (in this case, the baseline phase overall, since trend was found to be
flat and removed), plus a term accounting for the rate of change in log odds with a phase
change (treatment phase). This model allows error terms for all students to vary from the
grand mean (Raudenbush & Bryk, 2002) and can yield a p value to determine if there was
a treatment effect. The coefficients in this model also allow for the computation of
probabilities such as the probability of observing a behavior during baseline and
treatment phases (e.g., 15 observed disruptive behaviors in the first baseline compared to
5 observations of a disruptive behavior in the first treatment phase).
As a follow-up analysis, Level-2 variables can be used to analyze whether subject
characteristics appear to explain variance in performance. Note that the use of Level-2
information is a departure from a typical visual analysis. HLM provides a t statistic for each candidate variable, which indicates its contribution to the estimation of the outcome. The third and last model includes the Level-2 predictor(s) with the largest t value(s) from the second model's exploratory analysis.
Nagler and colleagues refer to this third and last model as the Simple Non-Linear
Model with any Level-2 predictors. If no Level-2 predictors are found, then the model is
called the Simplified L-1 Model with No Level-2 Predictors. Again, coefficients in this
model can be used to determine the probability of behaviors occurring at baseline and
treatment, respectively. An example of this model is as follows (assuming a Level-2 variable called "Class B," which refers to the class a child is assigned to):
Level-1
Yti = P0i + P1i(TRTti)
Level-2
P0i = β00 + β01(CLASSBi) + R0i
P1i = β10 + R1i
Where,
Yti =
The log odds of the dependent variable occurring.
P0i=
The average log odds during baseline for all students (β00), plus a term to allow for students in Class B to have a different baseline level (β01), plus an error term
to allow each student to vary from this grand mean (R0).
P1i=
The average rate of change in log odds as a subject switches from baseline (TRT
= 0) to treatment phase (TRT = 1) for all students (β10), plus an error term to
allow each student to vary from this grand mean (R1).
The Level-1 equation still reads the same as in the Simple Non-Linear Model without
Slopes, but now the Level-2 equations model the baselines and phase changes.
In their example, Nagler and colleagues calculated between subject variances
from the variance components tables in the HLM output. In the ‘final estimation’ of the
variance components, any between subject variations in estimates of the intercept yielded
a p value which tested the null hypothesis that baseline averages for all subjects were
similar. A significant p value suggested that the variance was too large to be attributed to estimation error alone. The between-subject variance in the phase effect produced a p value that
tested the null hypothesis that on average, the probability of observing disruptive
behavior was similar for all subjects. Further, any Level-2 predictors that were
significant were included for the contribution to the estimation of the outcome measure
for the final model. If the predictors reduce the variance in the model, then probabilities
of the behavior occurring in both the baseline and treatment phases can be calculated and
comparisons between the two can be made. If a Level-2 variable is significant, then
probabilities for observing a target behavior (more generally, the outcome variable)
should be computed.
Comparing and Contrasting Visual Analysis, HLM/STATA, and Author Reports
Table 2 is meant to convey how results across methods were compared (actual
results are in Chapter Four). The table yields a quick reference pertaining to the
consistency of results across methods. A sensitivity analysis on the effect sizes is also
described in Chapter Four.
Table 2
Results of Sensitivity Analyses Pertaining to Statement of a Treatment Impact

Columns: Study and Author(s); WWC Independent Visual Analysis*; HLM/STATA Results (p values)**; Dependent Variable; Published Statement. Rows (1), (2), and so on list the individual studies.

*Only one visual analysis expert was used in the current study.
** For a treatment to be deemed effective, p < .05.
As noted in Chapter One, an ideal outcome would be to find no differences across the different methods in the overall statements of whether there was a treatment effect. Should author statements diverge from WWC and HLM results, this might
reasonably be attributed to disparate application of the visual analysis technique. Should
there be differences between WWC and HLM results, it would be of interest to SCD
researchers since two different but reasonable analytic strategies yielded inconsistent
findings.
Research Question 2: Level-2 Predictors
As noted above, the Simple Non-Linear Model without Slopes further tests Level-2 variable(s) in an exploratory fashion (Raudenbush & Bryk, 2002). This exploratory analysis examined the capacity of a Level-2 variable, or set of variables, to explain variation in observed performance across a group of subjects (that is, a group of similar ABAB data series).
In the event any significant Level-2 variables are identified, a third model, the
Simple Non-Linear Model including any Level-2 predictors can be used to describe how
participant characteristics explain variation in the outcome variable. In the ‘final
estimation’ of the variance components, any between subject variations in estimates
yields a p value which tests the null hypothesis that baseline averages for all subjects are
similar (Nagler et al., 2008). Again, according to Nagler and colleagues (2008), a significant p value (p < .05) suggests that the variance is too large to be attributed to estimation error alone. The between-subject variance across phase effects produces a p value that tests
the null hypothesis that on average, the probability of observing a target behavior is
similar for all subjects. If the Level-2 variables did not contribute anything to the model
(p > .05) they are dropped from the analysis, and the associated study is considered to
have ‘no Level-2 predictors’ that explained variation in the model. Note that the L-2
analyses require there to be at least three people in the study or the model will not
converge due to small sample size. Therefore this procedure was limited to 5 studies.
To clarify the goals of the analyses, Table 3 displays how results are to be depicted (actual results for each article are presented in Chapter Four).
Table 3
Results of the Multilevel Models and Level-2 Contributors

Columns: Articles (Study 1, Study 2, and so on); Overdispersion / without Overdispersion; Simplified Level-1 model with Level-2 predictors on intercept (TRT); Coefficients; Variance Components (SD); Likelihood Function (MLF).

* = p < .05, ** = p < .01
Alternative Approaches for Exploring Variation
As suggested by Nagler and colleagues (2008), there are alternative approaches
for exploring the variation among intercepts and phase effects. These approaches allow
the researcher to observe how much subjects vary from average expectations. Recall, the
models in this dissertation use an exploratory analysis to aid in the detection of Level-2
variables and variation between subjects. That is, the models use the variance components column and associated p values to determine treatment effectiveness and between-subject variation, which is the same method employed by Nagler and colleagues
(2008). Since the between-subject variation was already detected via exploratory Level-2
variable contributions, it was not necessary to use the alternative analyses.
The alternative approach is to constrain the random effects in four different ways
and compare likelihood functions: (1) have no constraints on random effects, (2) restrict
the intercepts to zero so they do not vary across subjects, (3) restrict the phase effects to
zero across subjects and (4) restrict both intercepts and phase effects to zero. The manual
followed for this dissertation uses the Level-2 exploratory analysis to explain the
between-subject variance rather than the constraints just described. It is up to the researcher whether to use the exploratory method (placing the Level-2 variable(s) on the intercepts and phase effects to test for between-subject variance) or to constrain the random effects. Both methods need not be employed, especially given the nature of a sensitivity analysis.
Further, these data are considered repeated, which complicates error structures
needed to model the data. Recall, that all the slopes and intercepts vary across Level-2
units and are not fixed (i.e., error terms are activated) for the three models used in this
dissertation allowing for the grand means to vary, which would be most consistent with a
model without constraints. However, these types of repeated data designs have variance
covariance matrices that are considered unstructured or identity based, among other types
of structures, such as the diagonal matrix (Singer & Willett, 2003). For the unstructured matrix, three parameters are estimated: the variance of the intercepts, the variance of the slopes, and the covariance between slopes and intercepts. The benefit of using the identity matrix
over unstructured is that one gains a degree of freedom because there is no estimation
required for covariance (Singer & Willett, 2003). However, since Nagler and colleagues
(2008) indicate they used a no constraints model, it is assumed that their analyses
followed an unstructured matrix. Since no restrictions were placed on the variance /
covariance matrix in the Nagler and colleagues (2008) manual, this dissertation followed
their model building strategies and no further testing was needed for assessing model fit
using alternative approaches. Again, alternative strategies were suggested by Nagler and
colleagues (2008) to explore variation, however, this was not the focus of this dissertation
nor is it necessary. It is worth mentioning, Nagler and colleagues (2008) do imply that
more research is needed concerning between and within-subject variation, and the
appropriate estimation procedures needed to fit models with small sample sizes.
Effect Sizes
Research Question 3 dealt with calculating effect sizes using the PND, IRD, and
SMD. When possible, the calculations were compared to what was reported in the
original studies. R2 was also calculated to determine the proportion of the variation explained by phase differences via ordinary least squares (OLS) regression, with the dependent measure(s) and treatment variable only. It was suggested by Manolov and
colleagues (2010) that OLS estimation procedures were sufficient to determine
proportion of variance when there was no trend in the data and data series were
independent. Calculations for the four estimates used in this study are provided in
Chapter Four.
For the current study, the PND statistic and the SMD were checked for
similarities to the original study. Also, the PND was compared to the extracted data PND
to assess consistency between the two methods.
So as to provide background information needed for the following discussion,
Table 4 provides all formulas used for effect size calculations.
Table 4
Effect Size Methods, Criteria, and Software

Visual Analysis. Method: assessing level, trend, variability, immediacy of effect, overlap, and consistency of data patterns. Criteria: see Chapter Two.
PND. Method: PND = (1 – percent of treatment overlap points) * 100. Criteria: 90% or more, very effective treatment; 70%-90%, effective treatment; 50%-70%, questionable effectiveness; below 50%, ineffective treatment (Scruggs & Mastropieri, 1998). Program: SPSS/Excel.
IRD. Method: IRD = IR Treatment – IR Baseline. Criteria: 100% (1.00), very effective; 50% (.50), half of the phases (AB) overlap or chance-level improvement (Vannest et al., 2011). Program: IRD online calculator.
SMD. Method: SMD = (treatment average – baseline average) / baseline standard deviation. Criteria: approximately 2.0 indicates an effective treatment (Jenson et al., 2007). Program: SPSS.
R2. Method: OLS regression; R2 = 1 – (sum of squared differences between actual and predicted Y values) / (sum of squared differences between actual Y values and their mean). Criteria: 1.0 is a perfect fit; interpreted as the proportion of the variation explained by the phase differences. Program: SPSS.
The PND statistic was analyzed by assessing the percentage of data that did not
overlap, by choosing the most extreme value in the baseline and calculating the amount
of data that did not overlap in the treatment phase (Morgan & Morgan, 2009). The IRD
was calculated by taking the difference between two improvement rates (IRs). According
to Parker, Vannest, and Brown (2009), the IR is defined as "the number of improved data points divided by total data points in each phase" (p. 139):

IR = (number of improved data points) / (total data points in the phase)     (7)
Recall from Chapter Two that the IRD is simply the difference between the improvement rates of the treatment and baseline phases. The improved points can be identified visually or with software for calculating the IRD. The IRD formula can be seen below:

IRT – IRB = IRD     (8)
Again, the difference between the two improvement rates for the phases (the T stands for
treatment and B is the baseline) is the definition of the IRD (Cochrane Collaboration,
2006; Sackett, Richardson, Rosenberg, & Haynes, 1997). Fortunately, software is
available for free to calculate the IRD and multiple contrasts can be compared (Vannest,
Parker, & Gonen, 2011). Contrasts are determined by the researcher. The two most
common contrasts for ABAB designs were calculated for this study, although only one
was needed for the calculation of the IRD. Recall that the baselines were coded as A1
and A2 and the treatments were B1 and B2. One contrast is: A1 versus B1, B1 versus A2,
and A2 versus B2. Another option is to compare the A’s to the B’s, meaning A1A2 versus
B1B2. Both contrasts were calculated and compared to the reported PND from the
original articles even though only one contrast option is needed to calculate the IRD. The
SMD was calculated by taking the difference between the mean baselines (MB) and mean
treatments (MT) and dividing it by the first baseline standard deviation (Busk & Serlin,
1992).
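To make these calculations concrete, the following is a minimal numpy sketch of how the PND, IRD, SMD, and R2 could be computed for a single baseline-treatment contrast; the data values are hypothetical, the example assumes lower scores indicate improvement, and the simplified IRD rule shown here does not reproduce every convention of the online IRD calculator.

```python
import numpy as np

# Illustrative data: percentage of intervals with disruptive behavior,
# where lower values indicate improvement (an assumption for this example).
baseline  = np.array([70., 80., 75., 65., 70.])
treatment = np.array([20., 15., 25., 10., 15.])

# PND: percentage of treatment points that do not overlap the most
# extreme (here, the lowest) baseline value.
pnd = 100 * np.mean(treatment < baseline.min())

# IRD: improvement rate of the treatment phase minus that of the baseline phase;
# an "improved" point here is one better than every point in the other phase
# (a simplified rule for illustration only).
ir_treatment = np.mean(treatment < baseline.min())
ir_baseline  = np.mean(baseline  < treatment.min())
ird = ir_treatment - ir_baseline

# SMD: (treatment mean - baseline mean) / baseline standard deviation
# (here, the standard deviation of the first baseline phase).
smd = (treatment.mean() - baseline.mean()) / baseline.std(ddof=1)

# R^2 from an OLS regression of the outcome on a phase dummy; with a single
# dummy predictor this equals the squared correlation between outcome and phase.
y = np.concatenate([baseline, treatment])
phase = np.concatenate([np.zeros_like(baseline), np.ones_like(treatment)])
r2 = np.corrcoef(y, phase)[0, 1] ** 2

print(round(pnd, 1), round(ird, 2), round(smd, 2), round(r2, 2))
```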
UnGraph Validity and Reliability
Research Question 4 involved conducting a reliability analysis on the extraction
of digitized data from the original articles using UnGraph. Furthermore, the validity of
the process was assessed by obtaining raw data from the original authors. All study
authors were contacted twice to obtain original data. Note that other researchers are
interested in the general idea of recreating raw data. The reliability and validity of the
UnGraph procedure was examined at three levels: (1) between the original “raw” data
correlated to UnGraph’s extracted data, (2) between the researcher of this work and a
second coder, and (3) between the author’s reported means/percentages/ranges compared
to UnGraph’s extracted data. This is not the first attempt to examine the reliability of the
UnGraph procedure. Shadish and colleagues (2009) examined 91 studies of instructional
or behavioral interventions with a variety of student populations. Three undergraduate
students extracted the data from around 30 graphs each and the inter-rater reliability was
calculated using the percent agreement between the three raters. Each person was paired
with another, so that if someone was the original data extractor, then the other two people
would test the scanned data against the original coder. In that study, there was agreement
on 92.31% of the graphs (n = 84 of 91) regarding the number of data points to extract.
Agreement on the number of data points to extract was calculated as the total number of scanned data points agreed upon (counting each datum as one point) divided by the total number of data points. It is also noteworthy
that there was a correlation of r = .99 (p < .001) between coders in terms of the number of
data points to extract. The correlation between the extracted values for the original coder
and the two coders averaged 0.96 (median = .99) across all studies. The reliability
between the phases (baseline to treatment) was also tested. The average difference
between phases for the two coders had a correlation of r = .95. To explore the validity of
the data, means and standard deviations from the original data were compared to the baseline means from the extracted data. Since validity of the data was reported on more than one phase per study, forty-four studies yielded 152 data points. Five different methods of testing validity were employed, looking at mean percentages, mean time, mean number correct, means, and percentage correct from the different types of studies. Correlations of the average extracted data, first over all phases and then for each of the five kinds of data validity, ranged from .97 to .99. Any issues concerning reading data and
mistakes made by the coders in Shadish and colleagues’ (2009) article were presented as
human error.
Although this prior work supports the use of the UnGraph procedure it seems
prudent to independently check on the reliability and validity of the approach. This study
recorded the percentage of data agreement (total number of data points that agree
between the extracted data and the original data), and Pearson product-moment
correlations between the two coders. One coder was the primary extractor, and a second
coder was another graduate student in the same program of study.
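As a rough sketch of the reliability checks just described, the snippet below computes agreement on the number of extracted points and the Pearson product-moment correlation between two coders' digitized values for a single graph; the arrays are hypothetical.

```python
import numpy as np

# Hypothetical digitized values for the same graph from two independent coders.
coder1 = np.array([72.1, 68.4, 75.0, 22.3, 18.9, 20.5, 70.2, 73.8, 21.1, 19.4])
coder2 = np.array([71.8, 68.9, 74.6, 22.0, 19.2, 20.8, 69.9, 74.1, 21.5, 19.0])

# Agreement on the number of data points extracted (per graph this is a yes/no
# decision; across many graphs it becomes a percentage of agreement).
same_count = len(coder1) == len(coder2)

# Pearson product-moment correlation between the two coders' extracted values.
r = np.corrcoef(coder1, coder2)[0, 1]

print(same_count, round(r, 3))
```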
To assess the validity of the data, descriptive statistics from the extracted data were compared to those reported for the original graphs. For example, means, percentages, and ranges from the original data were compared to the extracted data. Of course, digitized results were also compared with available raw data. No existing comparison of raw data to extracted data was found in a search of ERIC, Academic Search Complete, or Social Sciences Index. Therefore, this appears to be the first sensitivity analysis between raw data and extracted data for the software UnGraph.
Chapter Summary
Several articles were used to perform sensitivity analyses that focused on
comparing and contrasting results from HLM and visual analysis for the purpose of comparing the original authors' work to WWC visual analysis, and then to quantification. Data were collected using UnGraph, a digitizing software package. SPSS was used to create interaction terms and any subject characteristic data (Level-2 data).
The PND statistic, IRD, SMD, and R2 were checked for similarities to the original
study. The PND statistic was analyzed by assessing the percentage of data that did not
overlap, by choosing the most extreme value in the baseline and calculating the amount
of data that did not overlap in the treatment phase (Morgan & Morgan, 2009). The IRD
was the difference between the two improvement rates for the phases (baseline and
treatment) (Parker et al., 2009). The SMD was calculated by taking the difference
between the mean baselines (MB) and mean treatments (MT) and dividing it by the
baseline standard deviation (Busk & Serlin, 1992). R2 was also calculated to determine
an effect as the proportion of the variation explained by phase differences using ordinary
least squares (OLS) regression, where the formula for R2 equals one minus the sum of
squared differences between the actual and predicted Y values divided by the sum of
squared differences between the actual Y values and their mean. The interpretation of R2 was described briefly in Chapter Two and is reported in Chapters Four and Five in an effort to introduce the use of R2 in SCDs; however, because there are several different interpretations of R2, it is not widely used or accepted as the best method for effect size estimation in this field.
Lastly, the reliability and validity of UnGraph were checked via inter-rater reliability between two coders. In addition, the extracted data were compared to the original authors' reported descriptive statistics as a further sensitivity check. When available, raw data from original authors were compared to extracted data to assess the sensitivity of UnGraph.
Chapter Four: Results
The results from the sensitivity analyses are summarized in this chapter. Recall
the research questions for this dissertation are:
Primary Question
1) Does quantification and subsequent statistical analyses of selected ABAB graphs produce
conclusions similar to visual analyses?
Secondary questions
2) Do any of the subject characteristics (Level-2 data) explain between-subjects variation?
If so, can this information be used to yield new findings?
3) Do the PND, IRD (nonparametric indices), SMD (parametric indices) and R2 yield
similar effect sizes as the original studies?
4) Is UnGraph a valid and reliable tool for digitizing ABAB graphs?
This dissertation focused on comparing two distinct methodologies used for
analyzing single-case designs. For this reason, this chapter is organized into two major
sub-sections. The first section offers detailed descriptions of a pilot of the statistical
procedures used here. The pilot is modeled on the specific approach described by Nagler
and colleagues (2008). Detailed descriptions of each step needed to address parts of
Questions 1 and 2 are offered while working with the study presented in the Nagler and
colleagues (2008) publication. The pilot endeavored to replicate the Nagler and
colleagues work, starting with the data digitization process. This was done to ensure the
procedures could be independently applied. Once the process was replicated (allowing
for rounding error and small differences that can be attributed to the digitization process),
it was applied again with eight more studies. The results of these analyses are
summarized in Section 2 of the chapter. The primary purpose of the second section of
this chapter is to provide the overall results pertaining to each research question. To
clarify, the intent is to first explain the statistical approaches used and then discuss
analyses and results used to address the four questions this dissertation was designed to
address.
Section 1: The Pilot
The pilot data for this study came from the article by Lambert, Cartledge, Heward
and Lo (2006) called, Effects of Response Cards on Disruptive Behavior and Academic
Responding during Math Lessons by Fourth-Grade Urban Students. The original authors
were interested in the impact that response cards (white boards used by the students to
write answers based on questions from the teacher) had on disruptive behavior during a
lesson, compared to baseline teaching which entailed traditional hand-raising and waiting
for the teacher to call on the student. The dependent variable was the proportion of
disruptive behaviors, relative to on-task behaviors, during a five second interval. That is,
each student’s (n = 9) behavior was recorded during a five second interval, where
disruptive behavior was recorded as having occurred or did not. In total, behavior was
recorded across ten intervals. Since the outcome variable focused on the proportion of
observed disruptive behavior to all behavior (i.e., the percentage of times a student was
disruptive), a binomial distribution of the data was assumed. Disruptive behavior was
considered to have occurred during an observation interval if the student displayed any of the following behaviors: talking, provoking others, looking around the classroom, misusing the white boards, writing notes to friends, drawing pictures on the white boards,
sucking on fingers, or leaving their seat. The authors indicated the treatment was
effective after using visual analysis. The phase changes that support this assertion can be
seen in Appendix B, where the graphs for each article can be found. Note that these data
were produced from the digitization process using UnGraph. This point is revisited when
discussing Research Question 4. For now, the focus is on the application of HLM
techniques to statistically analyze data, whereas the original authors relied primarily on
visual procedures.
The Full Non-Linear Model states that the log odds of disruptive behavior (or the expected number of intervals where disruptive behavior was observed, out of 10 possible intervals) was the sum of eight parts:
(1) The log odds of disruptive behavior occurring at the intercept (the final
baseline session).
(2) A term accounting for the average rate of change in the log odds after
implementation of the intervention (TRT), which is the average impact
between baselines and treatments.
(3) A term accounting for the rate of change in log odds over time (SESS1).
(4) A term accounting for the rate of change in log odds from the first pair of
phases (A1, B1) to the second pair of phases (A2, B2) was included to
examine differential treatment impacts. This term is referred to as “order”
(and designated as ORDER in tables).
Several 2-way interactions were tested in the model. These included interactions
between (1) session slope and treatment, (2) treatment and order, and (3) session slope
and order. The session slope and treatment interaction tested whether slopes differed
across sessions and treatment (i.e., recall that a session is the interval or the time variable
in the study and treatment is coded 0 = baseline and 1 = treatment) (Nagler et al., 2008).
The interaction between treatment and order tested if the intervention effect observed at
the first AB pair was significantly different from the second AB pair. This is because
sometimes the first pair (first baseline then treatment) and the second pair (second
baseline and treatment) will be affected by repeated exposure to the treatment. The order
variable can be used to determine if the level or amount of target behavior at the start of
the study is different than at the end. Lastly, the interaction between session and order
tested the interaction between the time variable and if the first AB pair and second AB
pair (order) was significant. This 3-way interaction tested the combination of the session
slope, treatment, and order. In order to build a parsimonious model, all non-significant
variables (p > .05) were removed except for the treatment variable (TRT), which was
significant. The next model tested was the Simple Non-Linear Model without Slopes,
Level-1 equation, which stated the log odds of disruptive behavior was the sum of two
parts: (1) the log odds at the intercept (in this case, the baseline phase overall, since trend
was found to be flat), and (2) a term accounting for the rate of change in log odds given a
phase change (baseline to treatment). To clarify, statistical analyses indicate that there
was no baseline trend, meaning the rate of disruptive behavior was consistent prior to the
onset of treatment so there was no need to include a term that accounted for baseline
changes. However, the presence of disruptive behavior could be expected to alter after
treatment onset, so a term was included that captured the log odds of performance change
during treatment.
The Level-2 equations model the intercepts and treatment impacts (to clarify, the phase change from an A phase to a B phase is the treatment impact). The intercept is modeled as the average log odds during baseline for all subjects, plus an error term to allow each student to vary from the grand mean; the treatment impact is modeled as the average rate of change in log odds as a subject switched from baseline to the treatment phase, plus an error term that again allows each student to vary from the grand mean.
Table 5 describes the final estimation of fixed effects. The expected probability
of behavior in the baseline and treatment phases was calculated from these data.
Table 5
Simple Non-Linear Model without Slopes for the Lambert and Colleagues' (2006) Study

Fixed Effect                          Coefficient   Standard error   t ratio   Approx. d.f.   p value
For INTRCPT1, P0: INTRCPT2, β00          0.82            .19           4.15          8          .004
For TRT Slope, P1: INTRCPT2, β10        -2.53            .16         -15.77          8          .000
When treatment = 0 (i.e., the baseline phase), the overall average log odds of
exhibiting a disruptive behavior for all students was 0.82 (β00). Additional formulas for
determining the expected probability (in baseline and switching from baseline to
treatment) are available and results can be derived from information presented in Table 5.
The formula for the expected probability of observing disruptive behavior at baseline is
the exponent of the baseline regression coefficient divided by one plus that number. So,
the expected probability of observing a disruptive behavior during the baseline phase was
0.69, [exp(0.82) = 2.27; 2.27 / 3.27 = 0.69]. In other words, there was high likelihood
that children would be observed engaging in a disruptive behavior before being treated.
This is a critical point because a basic tenant of single-case designs is that there must be
evidence of a problem behavior at baseline in order to have an opportunity to demonstrate
a treatment impact. Visual analyses clearly demonstrated the presence of a problem, so at
this point the process of digitizing the data and subsequent modeling has established a
statistical means for describing baseline performance. Now one can examine whether
there was a treatment effect; that is, whether the use of response cards caused a reduction in baseline levels of disruptive behavior. The average rate of change in log odds as a
student switched from baseline (TRT = 0) to treatment (TRT = 1) was - 2.53 (see β10 in
Table 5). This represents a treatment effect, and it was significant, as the associated p
value was less than .05. When applying the formula for examining the log odds of
disruptive behavior during the treatment phase (by taking the exponent of the sum of the regression coefficients and dividing it by one plus that number), the calculations show: [exp(0.82 - 2.53) =
exp(-1.71) = 0.18; 0.18 / 1.18 = 0.15]. Therefore, the expected probability of observing a
disruptive behavior during the treatment phase was 0.15. In other words, the probability
of observing a disruptive behavior before treatment was 0.69, and after treatment this
dropped to 0.15. Visual comparisons were made between the different phases, and in this
case, a reduction of the target behavior was evident. At this point, consider that the visual data presented in the published report yield a similar conclusion to the one described here: there is a drop in the probability of disruptive behavior after the onset of treatment. When working with designs that allow for reasonable levels of confidence that drops in problematic behavior are a function of the treatment, these analyses allow researchers to (a) statistically describe changes in performance as a function of treatment, and (b) apply a statistical perspective on the probability that the changes observed in these data reflect more than the sampling variability expected with a small sample. To add to this point, the analyses are able to handle a number of problems associated with OLS regression, including autocorrelation, problematic distributional assumptions (specifically, the use of a binomial distribution in lieu of a normal distribution), and different measurement times. In sum, this step corresponds with Research Question 1, which examines the degree of correspondence (i.e., percent agreement) between visual analyses and statistical modeling.
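As an illustration of the arithmetic used throughout this chapter, the brief sketch below converts the Table 5 coefficients into expected probabilities. The sketch is written in Python purely for exposition; Python was not one of the software packages used in this dissertation, and the function name is arbitrary.

    import math

    def expected_probability(log_odds):
        # exp(x) / (1 + exp(x)): converts a log odds value to a probability
        return math.exp(log_odds) / (1 + math.exp(log_odds))

    b00 = 0.82    # average baseline log odds (Table 5)
    b10 = -2.53   # change in log odds from baseline to treatment (Table 5)

    print(round(expected_probability(b00), 2))        # baseline: approximately 0.69
    print(round(expected_probability(b00 + b10), 2))  # treatment: approximately 0.15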
Moving on to Research Question 2, while still working with the pilot study, Table
6 displays the final estimation of variance components (that accounts for performance
across all ABABs in the article) for the Simple Non-Linear Model without Slopes.
Table 6
Final Estimation of Variance Components for the Lambert and Colleagues (2006) Study

Random Effect        Standard Deviation   Variance Component   df   Chi-square   p value
INTRCPT1, R0                .53                  .28             8     43.81       .000
TRT Slope, R1               .18                  .03             8     11.51       .174
Level-1, E                 1.49                 2.22
The between-subjects variance on intercepts was estimated to be .28, which
corresponds to a standard deviation of .53. The associated p value is a result of a null
hypothesis test, where the null condition states that baseline averages for all subjects were
similar in terms of their frequency of disruptive behavior. This hypothesis was rejected.
In other words, the variance was too large to assume differences may be due to only
sampling error, meaning some students were better behaved than others at baseline. The significant p value (for the intercept) suggests there should be further testing for variation in the baseline phase. Since the p value is not significant for the treatment phase, no Level-2 variables are needed to explain variation in the treatment effect. Keep in mind that if no statistically significant between-subjects variation were found for either baseline or treatment, one would stop here. In this case, since there may be some Level-2 variable(s) that help explain some of the remaining variance in the baseline, another model is introduced.
In order to explore the possibility that certain subject characteristic (Level-2)
variables might account for some of the between-subject variation, an exploratory
analysis of the potential contributions of Level-2 variables was added to the intercept.
Table 7 displays the potential Level-2 predictors and their associated t values.
Table 7
Possible Level-2 Predictors from the Exploratory Analysis for the Lambert and Colleagues (2006) Study

Level-1 Coefficient: INTRCPT, β0

Potential Level-2 Predictor   Coefficient   Standard Error   t value
CLASSB                            -.79           .21          -3.60
WHITE                              .65           .53           1.21
AGE                               -.49           .32          -1.55
PREGRADE                          -.14           .19           -.70
Table 7 indicates that Class B (the indicator that a student was in Class B instead of Class A) might help to explain some of the between-subject variance in intercepts since it had the largest t value. This means that class can be used to explain variation in the baseline (since the intercept variance was significant in the final estimation of variance components). Therefore, the class variable was added to the last model used in the analysis.
The final model, the Simple Non-Linear Model with Class B tested on the intercept (i.e., the Level-2 variable is tested only where significant variation remained in the previous model), reads the same as the Simple Non-Linear Model without Slopes at Level 1, but now the Level-2 equations model the baselines and treatment impacts. Recall that the variance components output suggested there was a significant amount of variation unaccounted for in the baseline (there could be variation left in either the baseline or treatment, but because the variation around the treatment variable was non-significant, Level-2 variables should only be used to examine baseline variation). From the prior test (see Table 6), variation around the intercept (i.e., baseline) was significant; therefore the class variable can be examined. Now, the log odds of behavior at baseline is interpreted as the average log odds for all students (β00), plus a term that allows students in Class B to have a different baseline level (β01), plus an error term that allows each student to vary from this grand mean (R0). In other words, a statistically significant β00 indicates that the average baseline log odds differs from zero, and a statistically significant β01 would signify that the classroom the child came from helps explain baseline differences and should be included in the final model. P1 is the average rate of change in log odds as a subject switches from baseline (TRT = 0) to the treatment phase (TRT = 1) for all students (β10), plus an error term that allows each student to vary from this grand mean (R1).
Table 8 displays the final model, the Simple Non-Linear Model without Slopes
with Class B on intercept. From this, the expected probability of observing disruptive behavior at baseline was calculated in a similar fashion to the first model.
Table 8
Simple Non-Linear Model without Slopes with CLASSB on Intercept for the Lambert and Colleagues’ (2006) Study

Fixed Effect                     Coefficient   Standard error   T ratio    Approx. d.f.   p value
For INTRCPT1, P0
  INTRCPT2, β00                      1.33           .19            7.10          7          .000
  CLASSB, β01                       -0.91           .22           -4.01          7          .006
For TRT Slope, P1
  INTRCPT2, β10                     -2.53           .16          -16.14          8          .000
The final working model was the Simple Non-Linear Model with the class variable included on the intercept. Seen another way, Table 9 summarizes the final model, the outcome variable, the coefficients (with the regression equation included), and the variance components (in standard deviations). The likelihood function was used to determine whether this model is more effective at explaining the data than other models (i.e., this model works best in comparison to one that did not include the class variable). Although the estimations for the parameters generate slightly different numbers than what was originally reported by Nagler and colleagues (2008), the results for the pilot match the manual. Recall that the working and non-working models for each article used in this dissertation can be seen in Appendix B.
Table 9
Final Model: Lambert and Colleagues (2006)

Final Model                 Overdispersion; Simplified L-1 model with CLASSB on INTCPT
Outcome Variable            Disruptive Behavior
Variables                   INTCPT: P0 = β00 + β01*(CLASSB) + R0
                            TRT:    P1 = β10 + R1
Coefficient Estimates       β00 =  1.33**
                            β01 = -0.91**
                            β10 = -2.53**
Variance Components (SD)    Sigma (σ²) = 2.24
                            R0 = 0.29**
                            R1 = 0.13
Likelihood Function (MLF)   -485.52

Log(Y) = 1.33 - .91(CLASSB) - 2.53(TRT)
* p < .05, ** p < .01, Y = dependent variable
Although the odds-ratio will be the same for Class A and Class B since no
interaction was present, the probabilities of observing disruptive behavior are still
calculated to yield descriptive information for each class separately. The concept here
needs clarification because examining the class variable in this instance entails including
it as a main effect, not a simple main effect. Stated another way, a constant shift exists
for the classrooms because there was no interaction in the model. Basically, for one class
there is an increased probability of observing disruptive behavior compared to the other
even though the effectiveness of the treatment was the same for both classes. The manual
and this dissertation simply wish to convey that the authors were descriptively telling the
readers that a student from one classroom was behaving differently than the other, and
the probability calculation procedures described here provide a new way of looking at
this performance differential.
From the final model, the probabilities of observing disruptive behavior were
computed to describe performance in both baseline and treatment phases. When
treatment = 0 (i.e., the baseline phase), the overall average log odds of a student
exhibiting disruptive behavior in Class A (CLASSB = 0) was 1.33 (β00); [exp(1.33) =
3.78; 3.78 / 4.78 = 0.79]. That is, the expected probability of observing a disruptive
behavior during the baseline phase for a student in Class A (when treatment = 0) was
0.79. For a student in Class B the overall average log odds of exhibiting a disruptive behavior was 0.42 (1.33 - .91 = 0.42), [exp(0.42) = 1.52; 1.52 / 2.52 = 0.60]. So, the
expected probability of observing a disruptive behavior during the baseline phase for a
student in Class B was 0.60, a drop that is associated with a statistically significant p
value. It appears that students in Class B tended to be better behaved compared to those
in Class A at baseline. This new information could be of interest to the original
researchers especially since there are separate teachers for each classroom.
As seen in Table 9, the significance test that compares the differences between the
baseline and treatment phases demonstrates the intervention’s treatment effectiveness.
The average rate of change in log odds as a student switches from baseline to treatment
was -2.53 (β10). This value was significant as the p value for β10 was less than .05.
Now, for Class A the expected probability of observing a disruptive behavior
during the treatment phase was 0.23, [exp(1.33 - 2.53) = exp(-1.20) = 0.30; 0.30 / 1.30 =
0.23]. For Class B, the calculation must consider the coefficients for Class B and the
treatment variable, [exp(1.33 - 0.91 - 2.53) = exp(-2.11) = 0.12; 0.12 / 1.12 = 0.11]. The
expected probability of observing a disruptive behavior during the treatment phase was
0.11. Combined, the modeling results reveal that students in Class B started with lower
rates of problematic behavior and also showed better performance after the treatment.
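The same conversion can be applied to every combination of class and phase in the final model. The sketch below, again in Python and offered only as an illustration of the calculations reported above, uses the Table 9 coefficients; because the model contains no class-by-treatment interaction, the same treatment coefficient is applied to both classes.

    import math

    def expected_probability(log_odds):
        return math.exp(log_odds) / (1 + math.exp(log_odds))

    b00, b01, b10 = 1.33, -0.91, -2.53  # Table 9: intercept, CLASSB shift, treatment effect

    for classb in (0, 1):                 # 0 = Class A, 1 = Class B
        for trt in (0, 1):                # 0 = baseline, 1 = treatment
            log_odds = b00 + b01 * classb + b10 * trt
            label = "Class B" if classb else "Class A"
            phase = "treatment" if trt else "baseline"
            print(label, phase, round(expected_probability(log_odds), 2))
    # Expected output: Class A baseline 0.79, Class A treatment 0.23,
    # Class B baseline 0.60, Class B treatment 0.11 (matching the values above).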
Estimates of the variance components for this model, seen in Table 10, indicated
that there may still be between-subject variation in estimates of the intercept. This is
demonstrated by the significant p value for the intercept (.009). Differences in treatment
effect were not found (p = .19).
Table 10
Final Estimation of Variance Components for the Lambert, Cartledge and Colleagues’ (2006) Study

Random Effect        Standard Deviation   Variance Component   df   Chi-square   p value
INTRCPT1, R0                .29                  .08             7     18.85       .009
TRT Slope, R1               .13                  .02             8     11.19       .191
Level-1, E                 1.49                 2.24
Looking back at Table 8, the treatment was effective (p < .01) and the subject characteristic Class B (which classroom the students came from) was also significant (p = .006) when considering performance at the intercept. A student from Class A had a higher expected probability of being observed engaging in disruptive behavior during the baseline phase than a student who came from Class B. Therefore, the Simple Non-Linear Model without Slopes with Class B on the intercept, with overdispersion, was the final model (Table 9). One can go back after the final model is chosen and test overdispersion by comparing the likelihood functions (i.e., one model with overdispersion and another without), but if sigma squared is over 1.0, overdispersion must be accounted for in the model. Once the researcher tests for overdispersion, the model will either account for it or it will be determined to be unnecessary when estimating standard errors.
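The decision rule just described can be summarized in a few lines. The sketch below is only an illustration of the logic (the function names and example values are mine); in practice the sigma squared estimate and the likelihood functions come directly from the HLM output.

    def overdispersion_required(sigma_squared):
        # Rule of thumb described above: retain the overdispersion parameter
        # whenever the Level-1 sigma squared exceeds 1.0.
        return sigma_squared > 1.0

    def preferred_model(loglik_with, loglik_without):
        # The model whose likelihood function is closest to zero is preferred.
        return "with overdispersion" if abs(loglik_with) < abs(loglik_without) else "without overdispersion"

    print(overdispersion_required(2.24))  # True for the pilot study (Table 9)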
All effect size estimates collected by the original authors and the calculated
comparisons can be seen in Tables 21 and 22 in Appendix A. Table 23, also seen in
Appendix A, lists sample details, independent variables, dependent variables, and original
effect sizes (reported by the authors) for each article included in the sensitivity analysis.
The PND was reported in the original article as an effect size criterion, making it easy to compare the extracted data to this study, which also calculated the PND.
The pilot work concluded that the treatment intervention was effective. A Level-2 variable, class membership, explained some of the variation in performance across students. These findings were replicated after digitizing the graphed data and then analyzing these data using HLM techniques. As was hoped, the results also matched (i.e., allowing for rounding error) those from Nagler and colleagues' (2008) work. Remember that the Lambert and colleagues' (2006) study had nine students. SMD and R2 estimates could not be compared to the originally reported results as the study authors did not calculate these statistics. PND effect size estimates were reported for five students (four were not reported). In terms of the five that were reported, three were matched by this dissertation work. Two suggested effectiveness in the original study (92.80% and 94.10%) but were questionable according to the calculated PND statistics (77.78% and 61.36%). Seven out of nine SMD calculations were indicative of effective treatment interventions, ranging from 2.57 to 5.80. The last two students' (B4 and B5) SMDs were calculated at 1.05 and 1.27, suggesting the treatment intervention was not effective for these students. The R2 (proportion of the variation explained by the phase differences) was 59%.
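For reference, the three descriptive effect size estimates referred to throughout this chapter can be computed from digitized phase data roughly as sketched below. This is an illustration under common working definitions (PND as the percentage of treatment-phase points that do not overlap the most extreme baseline point, SMD in the Busk and Serlin sense of the mean phase difference divided by the baseline standard deviation, and R2 as the proportion of variance explained by a phase indicator); the data in the example are invented, and the exact conventions used for each published study are described elsewhere in this document.

    import statistics

    def pnd(baseline, treatment, lower_is_better=True):
        # Percentage of treatment points beyond the most extreme baseline point.
        if lower_is_better:
            cutoff = min(baseline)
            nonoverlap = [x for x in treatment if x < cutoff]
        else:
            cutoff = max(baseline)
            nonoverlap = [x for x in treatment if x > cutoff]
        return 100 * len(nonoverlap) / len(treatment)

    def smd(baseline, treatment):
        # Mean phase difference divided by the baseline standard deviation.
        return abs(statistics.mean(treatment) - statistics.mean(baseline)) / statistics.stdev(baseline)

    def r_squared(baseline, treatment):
        # Proportion of variance in the outcome explained by the phase indicator.
        y = baseline + treatment
        grand = statistics.mean(y)
        ss_total = sum((v - grand) ** 2 for v in y)
        ss_model = (len(baseline) * (statistics.mean(baseline) - grand) ** 2
                    + len(treatment) * (statistics.mean(treatment) - grand) ** 2)
        return ss_model / ss_total

    # Invented percentages of intervals with disruptive behavior:
    A = [70, 65, 72, 68]   # baseline phase
    B = [30, 25, 35, 20]   # treatment phase
    print(pnd(A, B), round(smd(A, B), 2), round(r_squared(A, B), 2))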
Replicating the work from the manual was necessary to have confidence in working with this new methodology. The pilot results also served an educative function, as these procedures are innovative. That is, the detailed explanations offered above provide a framework that allows the analyses of the eight remaining studies to be explained more briefly. All necessary information pertaining to significant Level-2 predictors
(i.e., p values and calculated probabilities for behaviors in baseline and treatment phases)
will be provided in each article summary. Eight more ABAB articles were collected for
the current study. Before the individual articles are described, Section 2 provides the
results for Research Question One.
Section 2: Sensitivity Analysis Examining Treatment Effectiveness
Results for Research Question 1. To address question one, comparisons
were made between the results of the original authors, independent visual
analyses based on WWC procedures (discussed in Chapter 2), and HLM. The
treatment variable tested in each HLM analysis was the difference (and associated
p value) in observed performance between mean baseline and mean treatment
phases.
Data were analyzed via HLM techniques using two software programs (HLM 6
and STATA 11.0). Data were drawn from nine published articles (the eight that were
collected for this work and the study described in the above pilot section). Two of the
nine articles had small sample sizes (n ≤ 2) and could not be analyzed using HLM
software because this yielded inadequate degrees of freedom and results would not
converge. STATA was therefore used since this package included more estimation
procedures that could be applied. More specifically, STATA could use a Penalized
quasi-likelihood for binary outcomes (also called Laplace’s estimation approximation)
and provide log-likelihoods with respect to the parameters using Fisher scoring (Manolov
et al., 2010; Raudenbush & Bryk, 2002). Note that details associated with use of these
procedures are described in Chapter Three.
A key finding deals with the level of agreement between visual and statistical
analyses used for examining the presence of a treatment effect, as well as statements
made by original study authors. Of the 14 dependent variables examined (remember that
some articles had more than one dependent variable) there was a 93% (n = 13 of 14)
agreement rate between the original authors and HLM modeling results. Mavropoulou
and colleagues (2011) considered the treatment to have questionable effectiveness on one
dependent variable (the independent visual analyses reached this conclusion as well and
determined it as not effective); yet HLM procedures yielded a significant p value. There
was less agreement between HLM results and independent visual analyses. In all, there was only a 64.30% (n = 9 of 14) agreement rate between the two procedures in terms of whether there was a treatment effect. There was a 71.40% (n = 10 of 14) agreement rate
between visual analyses and reports of the original study authors.
Table 11, presented below, lists details specific to Research Question 1. Before the table is presented, the results of the visual analyses completed for each article are described. The intervention described in article one in Table 11 was determined to
be effective because there was little overlapping data across baseline and treatment
phases. Trends suggested the target behavior improved during the implementation of
treatment and there was a change in the means across phases. The intervention in article
two was considered to be ineffective because while one person (Keith) seemed to respond
to the treatment implementation, his performance showed high variability and there was
overlapping data across study phases. For the second person, Wally, the same concerns
existed; there was almost complete overlapping data for the first treatment
implementation to the second baseline phase.
The intervention described in article three was considered to be effective since six
of the nine people yielded little overlapping data across phases and mean performance
changes favored the intervention. Also there was little variability in performance in the
baselines, and the intervention effect was immediately apparent (recall, one would want to see a gap between the data in the baseline and treatment phases, indicating a change in behavior). There were two students and three outcome measures in article
four, and their intervention was considered to be ineffective for all of them. The three
dependent variables were: (1) on-task behavior, (2) required number of teacher prompts,
and (3) task performance (i.e., executing the correct steps/actions in each task such as
placing a domino card in the correct place). The on-task behavior for the first student in
this article, Vaggelis, was highly variable, the analyst could not document an immediate
effect, and there were many overlapping data points. For the second person, Yiannis, the
trend and level improved after onset of treatment, but there was still overlapping data
when looking at on-task behaviors. For teacher prompting, there was considerable
overlapping data present for both students. And lastly, for the performance variable the
data had high variability and overlap.
The visual analyses conducted while examining results from article five
determined the treatment was effective since there was overall low variability in
performance, the treatment effect immediately yielded an improvement, and almost no
problematic behaviors were seen in treatment phases. Since this outcome measure was the percentage of disruptive intervals, one would expect a favorable treatment to reduce this percentage during the treatment phase, which it did for almost all students. The intervention examined in article six was found to be effective at improving on-task behavior and task completion. There was limited overlap of performance across phases, and levels and trends suggested the baseline and treatment phases were consistent with the intended direction of the slope (i.e., trend). The intervention did not show an impact on work
accuracy as there was high variability in performance within each phase, associated
overlap in performance across phases, and the analyst could not document an immediate
treatment effect (i.e., last two data points in baselines in comparison to the first few data
points in treatment phases were similar).
The intervention assessed in article seven was considered to be effective when
measuring academic achievement and disruptive behavior. For academic achievement,
there was low variability and performance trends and levels were consistent with a
treatment effect (i.e., the trend went in the intended direction and the average
performance was higher during treatment phases). Only one person out of eight showed
any overlapping data and this was considered to be minimal.
The intervention in article eight was found to be effective; data showed an
immediate effect, low variability in performance, and no overlapping data. Finally, the
intervention in article nine was considered to not be effective because there were too many overlapping data points between phases and changes in performance levels across phases were minimal.
Table 11 provides the key information needed to address the first research
question addressed by this dissertation, which considers if visual and statistical analyses
using HLM procedures (described in detail in Section 1 of the chapter) yield similar
conclusions. The table lists each study and the dependent variable(s) examined in each article (the dependent variables are listed after each article's citation; for example, the Amato-Zech and colleagues (2006) article has the dependent variable percent of intervals
of on-task behavior). The table also displays overall claims made by the original study
authors pertaining to whether the treatment was effective and if overall results from
visual and HLM analyses agree. The table also summarizes the visual analysis results
described above. In terms of HLM results, an intervention was considered to be effective
if the p value associated with mean change in performance across study phases was less
than .05. Recall, all dependent variables are percentages or percent of intervals; no
article used count data for the purposes of this dissertation.
To provide a quick interpretive guide, the original study authors, the independent
visual analyses and HLM results all agree that the intervention in the first article (Amato-Zech, Hoff, & Doepke, 2006) was effective. There was complete agreement across
methods. This was not the case for the second article (Cole & Levinson, 2002), and so
on. A discussion of the somewhat low agreement rate between the independent visual analyses and both author claims and HLM results is provided in Chapter Five.
Table 11
Final Results of Sensitivity Analyses Pertaining to Statement of a Treatment Impact

Study / Dependent Variable(s)                               Author(s)       Independent     HLM
                                                            Published       Visual          Results
                                                            Statement       Analysis*       (p values)**

(1) Amato-Zech, Hoff, & Doepke (2006). Increasing
On-Task Behavior in the Classroom: Extension of
Self-Monitoring Strategies.
    Percent of intervals of on-task behavior.               Effective       Effective       Effective

(2) Cole & Levinson (2002). Effects of Within-Activity
Choices on the Challenging Behavior of Children with
Severe Developmental Disabilities.
    Percentage of task steps with challenging behavior.     Effective       Not Effective   Effective

(3) Lambert, Cartledge, Heward, & Lo (2006). Effects of
Response Cards on Disruptive Behavior and Academic
Responding during Math Lessons by Fourth-Grade
Urban Students.
    Number of intervals of disruptive behavior.             Effective       Effective       Effective

(4) Mavropoulou, Papadopoulou, & Kakana (2011).
Effects of Task Organization on the Independent Play
of Students with Autism Spectrum Disorders.
    Percentage of intervals with on-task behavior.          Effective       Not Effective   Effective
    Percentage of intervals with prompting behavior.        Not Effective   Not Effective   Not Effective
    Percentage of intervals with performance behavior.      Not Effective   Not Effective   Effective

(5) Murphy, Theodore, Alric-Edwards, & Hughes (2007).
Interdependent Group Contingency and Mystery
Motivators to Reduce Preschool Disruptive Behavior.
    Percentage of disruptive intervals.                     Effective       Effective       Effective

(6) Ramsey, Jolivette, Puckett-Patterson, & Kennedy
(2010). Using Choice to Increase Time On-Task,
Task-Completion, and Accuracy for Students with
Emotional/Behavior Disorders in a Residential Facility.
    Percentage of time on-task.                             Effective       Effective       Effective
    Percentage of task-completion.                          Effective       Effective       Effective
    Percentage of accuracy.                                 Effective       Not Effective   Effective

(7) Restori, Gresham, Chang, Lee, & Laija-Rodriquez
(2007). Functional Assessment-Based Interventions for
Children At-Risk for Emotional and Behavioral Disorders.
    Percent of intervals of academic achievement.           Effective       Effective       Effective
    Percent of intervals of disruptive behavior.            Effective       Effective       Effective

(8) Theodore, Bray, Kehle, & Jenson (2001).
Randomization of Group Behavior Contingencies and
Reinforcers to Reduce Classroom Disruptive Behavior.
    Percentage of disruptive intervals.                     Effective       Effective       Effective

(9) Williamson, Campbell-Whatley, & Lo (2009). Using a
Random Dependent Group Contingency to Increase
On-task Behaviors of High School Students with High
Incidence Disabilities.
    Percent of intervals of on-task behavior.               Effective       Not Effective   Effective

*Only one visual analysis expert was used in the current study.
** For a treatment to be deemed effective, p < .05.
Results for Research Questions 2 and 3
To address the second research question, a summary of the statistical models used
for analyzing data from the nine articles is offered here. The model choices are all based
on the procedures described in the above pilot section. Each table presented in this subsection provides the final models used to analyze data from each article, the likelihood
functions,5 and estimates of variance components. Finally, each table lists the dependent variables from each article (i.e., on-task behavior, disruptive behavior, the percentage of task analysis steps or how many procedures a student successfully completed, accuracy of academic tasks during academic courses, and performance or implementing the correct number of tasks/procedures). Although specific details regarding measurement schemes vary from one article to another, on-task behavior can generally be described as a measure of the percentage of time a student was attending to a teacher or actively engaged in academic work. Disruptive behavior refers to the frequency with which a student was engaged in unsanctioned, problematic behavior. Recall that a binomial distribution was used for each of these variables.

5 The model with the likelihood function closest to zero is the best model suited for the data.
All Level-2 data available in the original articles were tested to explain variation
in the dependent variable, except when n < 3. Statistically significant Level-2 variables
were found in three studies. In sum, four Level-2 variables helped explain variance in a given model (p < .05). These variables are CLASSB (which class a child came from, as described in the above pilot section) from the Lambert and colleagues' (2006) article, ONTRACK (whether a child was on-track to attend the next grade level) from the Murphy and colleagues' (2007) article, and the intervention variable (the type of treatment plan given to a child), which was significant twice, once for each of two different outcome measures (disruptive behavior and academic achievement), from the Restori and colleagues' (2007) article. A complete list of the
Level-2 variables used in the analyses presented here can be seen in Table 26 in
Appendix A.
Recall that research question three pertains to effect sizes. A detailed description of the articles and the original statements concerning treatment effectiveness (Research Question One), compared to the results of the sensitivity analyses, is also revisited in light of various effect size calculations.
Cole and Levinson (2002) studied the impact of offering children a chance to
make their own choices on reducing disruptive behavior. Behaviors were considered
problematic if the students demonstrated aggression, tantrums, and noncompliance (i.e.,
throwing objects, walking away from desk, or screaming). Two boys who attended a
school that served students with emotional/behavioral disorders were included in this
study. Each instructional routine lasted for thirty minutes and occurred for twenty-nine
days for Keith and thirty-two days for Wally. Each student had a different choice option
tailored to the specific need of the child’s IEP (e.g., vocational or daily living skills such
as washing hands). To present a choice, rather than using a phrase like "put the soap in your hands," interventionists asked a question like "do you want to use the bar soap or the pump soap?" (Cole & Levinson, 2002). The rationale behind the study was that choice
making might influence a child’s willingness to comply with directions. The authors
concluded that choice making was an effective intervention (Cole & Levinson, 2002).
Statistical analyses of digitized data (using STATA) indicated the intervention
was effective (p < .05) and no Level-2 predictors were found in the multi-level analysis to
be significant (p > .05). The final model presented is therefore a simplified Level-1
model, with no Level-2 predictors. The final model’s results are not presented here
because it did not converge. Some details can be seen in Appendix C. Even though
some parameters are available, the model failed to converge, so they should be interpreted cautiously due to the small sample size and potentially biased estimates. As noted in Table 11, the independent visual analyses determined that the intervention was not effective. The original authors reported each phase's percent of task-analysis steps with challenging behavior. The PND effect sizes suggest the treatment was of questionable effectiveness for Keith (51.66%) and not effective for Wally (28.60%). Overall, the statistical test determined the treatment effective, p < .05, but again, without convergence this p value should be cautiously interpreted. The SMD for Keith suggested an ineffective treatment intervention (1.42), as did the SMD for Wally (.96). The R2 (proportion of the variation explained
by the phase differences) was 25%.
Murphy, Theodore, Aloiso, Alric-Edwards and Hughes (2007) used intermittent
mystery motivators in an interdependent group contingency to reduce disruptive behavior
in the classroom. Disruptive behaviors were defined as touching other students, leaving a
designated learning area, (i.e., a rug) and standing or lying down on the rug. Fifteen
second intervals were recorded for fifteen minutes for eight days during baseline and
treatment. This study included eight students. Classrooms were comprised of children
with a diverse range of behavioral skills. The intermittent mystery motivator treatment
intervention was determined to be effective by original authors, the independent analyses,
and HLM procedures. However, HLM techniques provided more information concerning
the behavior of the students. The Level-2 variable, whether a student was on-track to enter first grade (ONTRACK), was statistically significant, p < .01. This means that the two groups of students in the study were performing differently in the baseline phase. See Table 12.
Table 12
Final Model: Murphy and Colleagues (2007)

Final Model                 Overdispersion; Simple Non-Linear Model with ONTRACK on INTCPT
Outcome Variable            Disruptive Behavior
Variables                   INTCPT: P0 = β00 + β01*(ONTRACK) + R0
                            TRT:    P1 = β10 + R1
                            ORDER:  P2 = β20 + R2
Coefficient Estimates       β00 = -1.79**
                            β01 =  2.06**
                            β10 = -1.93**
                            β20 = -1.12**
Variance Components (SD)    Sigma (σ²) = 3.07
                            R0 = 0.43**
                            R1 = 0.47*
                            R2 = 0.27
Likelihood Function (MLF)   -444.29

Log(Y) = - 1.79 + 2.06(ONTRACK) - 1.93(TRT) - 1.12(ORDER)
* p < .05, ** p < .01, Y = dependent variable, disruptive behavior
As before, some useful descriptive analyses can be gleaned from the table. When treatment = 0 (i.e., the baseline phase) and order = 0 (i.e., A1B1) and on-track = 0, the overall average log odds of exhibiting a disruptive behavior for students on-track to graduate preschool in the baseline was - 1.79 (β00), [exp(-1.79) = 0.17; 0.17 / 1.17 = 0.15]. In other words, the expected probability of observing a disruptive behavior during the baseline phase for an on-track student was 0.15. When treatment = 0 (i.e., the baseline phase) and order = 1 (i.e., the second AB pairing) and on-track = 0, the overall average log odds of exhibiting a disruptive behavior for these students was -2.91 (-1.79 - 1.12), [exp(-2.91) = 0.05; 0.05 / 1.05 = 0.05]. In other words, the expected probability of observing a disruptive behavior during the baseline phase for an on-track student in the second AB pairing was 0.05. When treatment = 0 (i.e., the baseline phase) and order = 0 (i.e., A1B1) and on-track = 1, the overall average log odds of exhibiting a disruptive behavior for students not on-track to graduate preschool at baseline was 0.27 (-1.79 + 2.06), [exp(0.27) = 1.31; 1.31 / 2.31 = 0.57]. In other words, the expected probability of observing a disruptive behavior during the baseline phase for a student not on-track in the first AB pairing was 0.57. Lastly, when treatment = 0 (i.e., the baseline phase) and order = 1 (i.e., A2B2) and on-track = 1, the overall average log odds of exhibiting a disruptive behavior for students not on-track to graduate preschool at baseline in the second AB pairing was -0.85 (-1.79 - 1.12 + 2.06), [exp(-0.85) = 0.43; 0.43 / 1.43 = 0.30]. In other words, the expected probability of observing a disruptive behavior during the baseline phase for a student not on-track in the second AB pairing was 0.30.
The significance test that compares the differences between the baseline and
treatment phases demonstrates the study’s treatment effectiveness. The average rate of
change in log odds as a student switches from baseline (TRT = 0) to treatment (TRT = 1) was - 1.93 (β10). This phase effect was significant as the p value for β10 was less than .01, so the treatment variable was considered to be statistically significant.
For the treatment phase, when treatment = 1 and order = 1 and on-track = 1, the overall average log odds of exhibiting a disruptive behavior was -2.78 (-1.79 - 1.93 - 1.12 + 2.06), [exp(-2.78) = 0.06; 0.06 / 1.06 = 0.06]. In other words, the expected probability of observing a disruptive behavior during the treatment phase for a student not on-track in the second AB pairing was 0.06. When treatment = 1 and order = 1 and on-track = 0, the overall average log odds of exhibiting a disruptive behavior was -4.84 (-1.79 - 1.93 - 1.12), [exp(-4.84) = 0.008; 0.008 / 1.008 = 0.008]. In other words, the expected probability of observing a disruptive behavior during the treatment phase for an on-track student in the second AB pairing was 0.008. When treatment = 1 and order = 0 and on-track = 1, the overall average log odds of exhibiting a disruptive behavior was -1.66 (-1.79 - 1.93 + 2.06), [exp(-1.66) = 0.19; 0.19 / 1.19 = 0.16]. In other words, the expected probability of observing a disruptive behavior during the treatment phase for a student not on-track in the first AB pairing was 0.16. Lastly, when treatment = 1 and order = 0 and on-track = 0, the overall average log odds of exhibiting a disruptive behavior was -3.72 (-1.79 - 1.93), [exp(-3.72) = 0.02; 0.02 / 1.02 = 0.02]. In other words, the expected probability of observing a disruptive behavior during the treatment phase for an on-track student in the first AB pairing was 0.02.
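The eight expected probabilities walked through above can be generated in one pass from the Table 12 coefficients. The sketch below (Python, illustration only) enumerates every combination of ONTRACK, ORDER, and TRT; small differences from the values reported above are due to rounding in the two-step hand calculations.

    import math

    def expected_probability(log_odds):
        return math.exp(log_odds) / (1 + math.exp(log_odds))

    b00, b01, b10, b20 = -1.79, 2.06, -1.93, -1.12  # Table 12 coefficients

    for ontrack in (0, 1):
        for order in (0, 1):
            for trt in (0, 1):
                log_odds = b00 + b01 * ontrack + b10 * trt + b20 * order
                print(f"ONTRACK={ontrack} ORDER={order} TRT={trt}: "
                      f"{expected_probability(log_odds):.3f}")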
Murphy and colleagues (2007) calculated SMDs and they ranged from .99 to
7.71. The treatment was found to be effective for three out of eight students according to
the calculated SMD, ranging from .61 to 2.88. The R2 (proportion of the variation
explained by the average baseline to treatment differences) was 21%.
Mavropoulou, Papadopoulou, and Kakana (2011) studied a visually based
intervention focusing on task organization while working with two boys, Vaggelis and
Yiannis. Both boys had difficulties completing tasks independently and they were
diagnosed with autism spectrum disorders. On-task behavior was defined as attending to
materials. Off-task was defined as throwing materials or not using them properly for the
task at hand (Mavropoulou et al., 2011).
Three dependent variables were measured (1) on-task behavior, (2) teacher
prompting, and (3) task performance. According to the study authors, on-task behavior
was measured by attending to materials, paying attention visually and manipulating
materials appropriately. Teacher prompting (measured as percentage of intervals of
teacher prompting) dealt with re-directing a study by cueing or saying the child’s name.
Lastly, task performance was defined as placing a picture in a correct category (e.g.,
placing a picture of a piece of clothing on a doll) which yielded a binary score (1 =
correct and 0 = incorrect). Task performance was recorded as a percentage. All
treatment sessions were fifteen minutes long and data were collected in ten second
intervals during an observational period. For each student a total of forty-five intervals
were assessed.
Table 13 (below) yields data that can be used to compute the expected probabilities for on-task behavior in the baseline and treatment phases. The models used to analyze the other variables (teacher prompting and task performance) did not converge and therefore are not analyzed any further; however, the on-task model did converge.
Therefore, when session = 0 and treatment = 0, the expected probability of observing a student with on-task behavior during the baseline phase was 0.97, [exp(3.60) = 36.60; 36.60 / 37.60 = 0.97]. The average rate of change in log odds as a student switched from baseline to treatment was .22. Therefore, when session = 0 and treatment = 1, the expected probability of observing a student with on-task behavior during the treatment phase was 0.98, [exp(3.60 + 0.22) = exp(3.82) = 45.6; 45.6 / 46.6 = 0.98]. The average rate of change in log odds per day of observation was 0.01. This increase was significant (p < .01); therefore it was concluded that the baseline trend is not flat, but increases over time (days). Therefore, the log odds of on-task behavior increased from the start of the baseline to the end of treatment.
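Because this model also contains a session (time) term, the expected probability of on-task behavior changes gradually across observation days as well as across phases. The sketch below (Python, offered only as an illustration, using the Table 13 coefficients) traces that trend over the first ten sessions for both phases.

    import math

    def expected_probability(log_odds):
        return math.exp(log_odds) / (1 + math.exp(log_odds))

    b00, b_sess, b_trt = 3.60, 0.01, 0.22  # Table 13: intercept, session trend, treatment effect

    for sess in range(10):
        baseline = expected_probability(b00 + b_sess * sess)
        treatment = expected_probability(b00 + b_sess * sess + b_trt)
        print(f"session {sess}: baseline {baseline:.3f}, treatment {treatment:.3f}")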
The study authors calculated PND statistics producing a score of 50% for on-task
behavior and a score of 45% for accurate task performance. The PND for teacher
prompting was not reported, but the authors did say the treatment was not effective.
These values, as do the study authors, suggest the intervention was not effective in terms of impacting Vaggelis' on-task behavior and task accuracy (Mavropoulou et al., 2011).
For Yiannis, a PND of 75% was calculated for on-task behavior, and 70% for task
accuracy. This suggests that the intervention effectively improved performance on both
variables (Mavropoulou et al., 2011). Results differed across the three approaches for
examining treatment effectiveness on these outcomes. For on-task behavior the original
authors and re-analyses using the software STATA found the treatment to be effective,
whereas the independent visual analysis determined it was not effective. As noted in
Table 11, the original authors and re-analyses found the treatment not effective when dealing with the teacher prompt variable, as did the independent visual analysis. When examining task performance, the original
authors and independent visual analyses found the treatment not effective, whereas
statistical analyses yielded the opposite conclusion. Effect size estimates were
independently calculated and all PND statistics matched except for one (task
performance), which did not change the overall interpretation (i.e., reported 70% and
calculated 60%, both of which would be considered questionable effectiveness). The
SMDs are not fully consistent with the PNDs; the PND results indicate that the intervention was not effective for Vaggelis (50%, 30%, and 45% for on-task, teacher prompting, and performance, respectively), whereas the corresponding PNDs suggested it was more effective for Yiannis (75%, 40%, and 60% for on-task, teacher prompting, and performance, respectively). The R2
(proportion of the variation explained by the phase differences) was 34% for on-task
behavior, 1% for teacher prompting, and 39% for performance.
Table 13 provides results from HLM analyses.
Table 13
Final Model: Mavropoulou and Colleagues (2011)

Final Model                 Overdispersion; Simplified L-1 model, No L-2 Predictors
Outcome Variable            On-Task Behavior
Variables                   INTCPT: P0 = β00 + R0
                            SESS:   P1 = β10 + R1
                            TRT:    P2 = β20 + R2
Coefficient Estimates       β00 = 3.60**
                            β10 = 0.01**
                            β20 = 0.22**
Variance Components (SD)    Sigma (σ²) = 0.03
                            R0 = 0.09
                            R1 = 0.01
                            R2 = 0.07
Likelihood Function (MLF)   - 241.86

On-Task: Log(Y) = 3.60 + 0.01(SESS) + 0.22(TRT)
* p < .05, ** p < .01, Y = dependent variable
The Theodore, Bray, Kehle and Jenson (2001) study originally had five students,
but considerable amounts of data were missing from two and they were dropped from
analyses presented here. The intervention entailed listing classroom rules in the front of
the classroom as well as on each student’s desk. The teacher instituted an award system,
with reinforcement contingent upon group behavior that allowed students to select
various options. Disruptive behaviors were defined as a failure if the student displayed
one of the following behaviors: use of obscene words, touching or talking to other
students at inappropriate times, verbal put-downs, not facing the teacher, or listening to
music too loudly.
The intervention (posting rules) was conducted for two 45-minute time blocks during the school day for two weeks during regular classroom instruction; during this period no rewards were given for behavior, and it was considered the baseline phase. The study consisted of fifteen-second observation intervals for a period of twenty minutes every other day. The treatment was considered by the authors, the independent visual analyses, and the statistical procedures used here to be effective for all the students. The intercept data did not converge during analyses and therefore no expected probabilities pertaining to the dependent variable were computed for this study.
In the Theodore and colleagues' (2001) study the reported SMDs ranged from 2.6 to 5.2 and the calculated SMDs ranged from 2.8 to 5.3. PND results were consistent with an
effective intervention (all were 100%). The R2 (proportion of the variation explained by
the phase differences) was 78%.
The Ramsey, Jolivette, Puckett-Patterson, and Kennedy (2010) study had three
dependent variables of interest: on-task behavior, task-completion and accuracy. All five
students in the study were classified with an emotional / behavior disorder (E / BD). On-task behavior was defined as an observation that students: (1) were examining an
assignment, (2) writing questions related to the assignment, (3) followed directions, (4)
did not use obscene language, and / or (5) did not touch others. Task-completion was
calculated by taking the total number of times a task was determined to be correctly
completed divided by the total number of attempts. Lastly, accuracy was defined as the
percentage of tasks completed correctly. According to the authors, each observation
session was conducted for fifteen minutes during the student’s independent work time.
The sessions were conducted across consecutive weekdays, twice a day. During the
baseline phase, teachers gave assignments during independent practice time in math and
English classes. Students were told to complete a specific assignment (no choice). During the treatment (choice) phase, students were given the
option of which of two assignments they wanted to complete first (Ramsey et al., 2010).
The original study used the PND as a criterion for determining treatment effect.
The students (Sara, Chris, Trey, and Abby) demonstrated higher percentages of
time on-task, task-completion, and accuracy during treatment whereas Katie’s
percentages were lower (Ramsey et al., 2010). According to Ramsey and colleagues
(2010), the treatment worked for four of the five students during independent academic
tasks. It should be noted that when students all have the same characteristics (in this case
an E / BD classification), discriminating performance across students using HLM
procedures becomes difficult.
Table 14 is the final working model for this article. For all three dependent
variables, the main effect of treatment was significant (p < .05). From this table, we can
compute the expected probabilities for each dependent variable, starting with accuracy, or the number of task steps completed correctly, for baseline and treatment phases. That is,
during baseline (when treatment = 0), the expected probability of observing an accurate
response from a student in baseline was 0.11, [exp(-2.09) = 0.12; 0.12 / 1.12 = 0.11]. The
average rate of change as a person switches from baseline (treatment = 0) to treatment
(treatment = 1) was 1.82, [exp(- 2.09 + 1.82) = exp(-0.27) = 0.76; 0.76 / 1.76 = 0.43]. So,
the expected probability of observing a student completing accurate steps during the
treatment phase was 0.43. The next outcome variable was on-task behaviors. During
baseline (when treatment = 0), the expected probability of observing a student with on-task behavior was 0.39, [exp(-0.46) = 0.63; 0.63 / 1.63 = 0.39]. The average rate of change as a student switched from baseline (treatment = 0) to treatment (treatment =
1) was 2.22, [exp(-0.46 + 2.22) = exp(1.76) = 5.81; 5.81 / 6.81 = 0.85]. So, the expected
probability of observing a student with on-task behavior during the treatment phase
increased to 0.85. Lastly, one can compute the expected probabilities for task completion
for baseline and treatment phases. That is, during baseline (when treatment = 0), the expected probability of observing a completed task was 0.21, [exp(-1.49) = 0.26; 0.26 / 1.26 = 0.21]. The average rate of change as a person switches from baseline (treatment = 0) to treatment (treatment = 1) was 2.53, [exp(-1.49 + 2.53) = exp(1.04) = 2.83; 2.83 / 3.83 = 0.74]. So, the
expected probability of observing a student complete a task during the treatment phase
was 0.74.
All three procedures for assessing treatment impacts matched up, except when
dealing with the dependent variable, accuracy. The independent visual analyses did not
deem the treatment to be effective, but the original authors claimed there was an effect
and this was supported by HLM results. For all but two people (Abby and Trey) the
intervention was effective according to the SMD for all outcome variables. The R2
(proportion of the variation explained by the phase differences) was 33% when
examining the on-task variable, 34% for the task completion variable, and 22% for the
accuracy variable.
Table 14
Final Model: Ramsey and Colleagues (2010)

                            Accuracy                   On-Task Behavior           Task Completion
Final Model                 Overdispersion;            Overdispersion;            Overdispersion;
                            Simplified L1 model,       Simplified L1 model,       Simplified L1 model,
                            No L-2 Predictors          No L-2 Predictors          No L-2 Predictors
Variables                   P0 = β00 + R0 INTCPT       P0 = β00 + R0 INTCPT       P0 = β00 + R0 INTCPT
                            P1 = β10 + R1 TRT          P1 = β10 + R1 TRT          P1 = β10 + R1 TRT
Coefficient Estimates       β00 = -2.09*               β00 = -0.46                β00 = -1.49*
                            β10 =  1.82**              β10 =  2.22**              β10 =  2.53**
Variance Components (SD)    Sigma (σ²) = 4.36          Sigma (σ²) = 8.50          Sigma (σ²) = 5.39
                            R0 = 1.41**                R0 = 0.59**                R0 = 1.33**
                            R1 = 0.36                  R1 = 0.55*                 R1 = 0.15
Likelihood Function (MLF)   -378.21                    -430.24                    -394.27

Accuracy:        Log(Y) = - 2.09 + 1.82(TRT)
On-Task:         Log(Y) = - 0.46 + 2.22(TRT)
Task Completion: Log(Y) = - 1.49 + 2.53(TRT)
* p < .05, ** p < .01, Y = dependent variable
The next article used in the current study was Restori, Gresham, Chang, Lee, and
Laija-Rodriquez (2007). The study examined the effect of self-monitoring and
differential reinforcement of behaviors on academic and disruptive behavior. A
particular focus of the article was comparing antecedent and consequent-based treatment
strategies, where the former of the two tends to be more proactive in orientation and the
latter essentially relies on reactive applications of a behavior plan. The article examined
treatment impacts on eight students. When students correctly worked on assigned
material, answered questions correctly and sought assistance, they were viewed as being
engaged; whether students were rated as being engaged constituted the academic
dependent variable of interest in this study. Disruptive behavior was defined as out-of-seat behavior, making disruptive noises, disturbing others, and talking without
permission. Academic engagement was defined as the student correctly working on
assigned academic material (i.e., working on material, answering questions correctly, and
seeking assistance when appropriate). Participants were observed in 10 second intervals
for 15 minutes, where each observation session was considered a single data point. The
authors' results indicated that treatments primarily seen as “antecedent-based were more
effective than treatment strategies that were primarily consequent-based for reducing
disruptive behavior and increasing academic engagement for all participants” (Restori et
al., 2007, p. 26). These two conditions are referred to as interventions. The independent
visual analyses and HLM results supported this assertion. For the Restori and colleagues
(2007) article, both outcome variables (disruptive behavior and academic achievement) had the same final working model, the Simple Non-Linear Model with the Level-2 variable
(intervention) on the treatment slope (see Table 15).
Table 15
Final Model: Restori and Colleagues (2007)

                            Disruptive Behavior                 Academic Achievement
HLM Approach                Overdispersion;                     Overdispersion;
                            Simple Non-Linear model             Simple Non-Linear model
                            with INTERVENTION on TRT            with INTERVENTION on TRT
Variables                   P0 = β00 + β01*(INTERVENTION)       P0 = β00 + β01*(INTERVENTION)
                                 + R0 INTCPT                         + R0 INTCPT
                            P1 = β10 + R1 TRT                   P1 = β10 + R1 TRT
                            P2 = β20 + R2 SESS1                 ----
Coefficient Estimates       β00 =  0.12                         β00 = -0.97**
                            β10 = -3.26*                        β10 =  3.55**
                            β01 =  1.34**                       β01 = -1.10*
                            β20 = -0.06*                        β20 = ----
Variance Components (SD)    Sigma (σ²) = 1.22                   Sigma (σ²) = 1.29
                            R0 = 0.01                           R0 = 0.11
                            R1 = 0.00                           R1 = 0.52
                            R2 = 0.18                           R2 = ----
Likelihood Function (MLF)   -276.68                             -284.38

Disruptive Behavior:   Log(Y) = 0.12 – 3.26(TRT) + 1.34(INTERVENTION) – 0.06(SESS1)
Academic Achievement:  Log(Y) = - .97 + 3.55(TRT) – 1.10(INTERVENTION)
* p < .05, ** p < .01, Y = dependent variable
At baseline, the overall average log odds of exhibiting a disruptive behavior for a student in the antecedent-based treatment (intervention = 0; recall that intervention is a Level-2 variable) was 0.12 (β00), [exp(0.12) = 1.13; 1.13 / 2.13 = 0.53] when treatment = 0 and session = 0. The expected probability of observing a disruptive behavior for a student in the antecedent-based treatment was 0.53 during the final session of the baseline phase. At baseline, the overall average log odds of exhibiting a disruptive behavior for a student in the consequent-based treatment during the final session of the baseline phase was 1.46 (β00 + β01), [exp(0.12 + 1.34) = 4.30; 4.30 / 5.30 = 0.81] when treatment = 0 and session = 0. The expected probability of observing a disruptive behavior during the final session of the baseline phase for a student in the consequent-based treatment was 0.81.
The significance test that compares the differences between the baseline and treatment phases (for both dependent variables) demonstrates that the treatment was effective for both outcome variables. The average rate of change in log odds as a student switches from baseline (TRT = 0) to treatment (TRT = 1) was -3.26 (β10). This phase effect was significant as the p value for β10 was less than .05, so the treatment variable was statistically significant. For a person in the antecedent-based treatment, the expected probability of observing a disruptive behavior during the treatment phase (at session = 0) was 0.04, [exp(0.12 - 3.26) = exp(- 3.14) = 0.04; 0.04 / 1.04 = 0.04]. For a person in the consequent-based treatment, the expected probability of observing a disruptive behavior during the treatment phase (at session = 0) was 0.15, [exp(0.12 - 3.26 + 1.34) = exp(- 1.80) = 0.17; 0.17 / 1.17 = 0.15]. Lastly, the average rate
of change in log odds per day of observation is - 0.06. This decrease was significant (p <
.05), therefore it was concluded that the baseline trend is not flat, or changes over time
(days). Therefore, disruptive behavior decreased in the log odds from the start of the
baseline to the end of treatment.
The Level-2 variable intervention appeared to influence the academic
achievement outcome variable in the Restori and colleagues (2007) study as well. At
baseline, the overall average log odds of exhibiting academic achievement (recall this has
to do with the student working on appropriate academic materials) for a student in the
antecedent-based treatment (intervention = 0) was - 0.97 (β00), [exp(-0.97) = 0.38; 0.38 /
1.38 = 0.28] when treatment = 0. The expected probability of observing academic
achievement during the baseline phase for a student in the antecedent-based treatment
was 0.28. At baseline, the overall average log odds of exhibiting academic achievement for a student in the consequent-based treatment was -2.07 (β00 + β01), [exp(- 0.97 – 1.10) = 0.13; 0.13 / 1.13 = 0.12] when treatment = 0. The expected probability of observing academic achievement during the baseline phase for a student in the consequent-based treatment was 0.12.
The significance test that compares the differences between the baseline and
treatment phases demonstrates that both interventions are effective. The average rate of
change in log odds as a student switches from baseline (TRT = 0) to treatment (TRT = 1)
was 3.55 (β10). This phase effect was significant as the p value for β10 was less than .05,
so the treatment variable was statistically significant. For a person in the antecedent-based treatment (intervention = 0), the expected probability of observing academic
achievement during the treatment phase was 0.93, [exp(- 0.97 + 3.55) = exp(2.58) =
13.20; 13.20 / 14.20 = 0.93]. For a person in the consequent-based treatment, the
expected probability of observing academic achievement during the treatment phase was
0.81, [exp(- 0.97 + 3.55 - 1.10) = exp(1.48) = 4.40; 4.40 / 5.40 = 0.81].
The original authors stated antecedent-based treatment interventions were more
effective than consequent-based for reducing disruptive behavior, even though both were
found to lower behavior. The statistical analyses performed here agree with these points.
No effect size statistics, only means, were provided in the Restori and colleagues’
(2007) study. In the current study, the PND for overall disruptive behavior indicates that
the treatment worked for seven out of eight students (for one student the PND = 0%). In
terms of the academic achievement variable, the PND indicates the treatment was
effective for all but one student (PND = 50%). SMD values show similar results. Using
this effect size estimate, the intervention appeared to have questionable effectiveness for
one student (SMD = 1.51) and was effective for another (SMD = 2.27). The R2
(proportion of the variation explained by the phase differences) for disruptive behavior
was 63% and 77% for academic engagement.
Next, Williamson, Campbell-Whatley, and Lo (2009) examined another group
contingency reward system. The purpose of the study was to see if a group contingency
worked for six students in a resource room with high incidence disabilities. Students one through three had more on-task behavior and therefore may have been the most affected. This reward system included putting names in a jar and choosing one name after 25 minutes. If the child had at least four of five plus marks by his or her name, this was indicative of on-task behavior and the whole class would receive a reward (Williamson et al., 2009). The teacher would not reveal whose name was drawn, and the whole class determined the reward for on-task behavior. All six study participants were African American students with disabilities. The outcome measure was on-task behavior, which was defined as students keeping their eyes and head oriented toward coursework, working appropriately with the given materials, being quiet, and remaining in the assigned
area. Observations were conducted for the last 25 minutes of the period, in five second
intervals.
Table 16 presents the final working model for this article. From this table,
expected probabilities can be computed for on-task behavior for baseline and treatment
phases. A point worth raising here is that the variable order was statistically significant,
which means that, on average, responding differed between the first AB pairing (A1B1)
and the second (A2B2). Here we see that both treatment (p < .001) and order (p < .001)
were significant. That is, when treatment = 0 and order = 0, the expected probability of
observing an on-task behavior from a student in baseline in the first AB was 0.42,
[exp(-0.31) = 0.73; 0.73 / 1.73 = 0.42]. And, when treatment = 0 and order = 1, the
expected probability of observing an on-task behavior from a student in baseline in the
second AB was 0.63, [exp(-0.31 + 0.83) = exp(0.52) = 1.68; 1.68 / 2.68 = 0.63]. The
average rate of change in log odds as a student switched from baseline to treatment
was 1.98. The expected probability of observing a student with on-task behavior during
the treatment phase in the first AB was 0.84, [exp(-0.31 + 1.98) = exp(1.67) = 5.31;
5.31 / 6.31 = 0.84] when order = 0. The expected probability of observing a student with
on-task behavior during the treatment phase in the second AB, once the treatment-by-order
interaction is included, was 0.66, [exp(-0.31 + 1.98 + 0.83 - 1.84) = exp(0.66) = 1.93;
1.93 / 2.93 = 0.66] when order = 1. This reflects the significant interaction between
treatment and order, p < .001.
According to the authors, the intervention was effective for three of the six
participants. The original authors and HLM results indicate the overall treatment was
effective, yet the independent visual analysis concluded the intervention effect was
questionable. The final model was the Simplified Level One with No Level-2 Predictors
without Overdispersion.
Table 16
Final Model: Williamson and Colleagues (2009)

Outcome variable: On-Task Behavior
Final model: No overdispersion; simplified Level-1 model; no Level-2 predictors

Variables:
  P0 = β00 + R0 (INTCPT)
  P1 = β10 + R1 (TRT)
  P2 = β20 + R2 (ORDER)
  P3 = β30 + R3 (TRTORD)

Coefficient estimates: β00 = -0.31, β10 = 1.98**, β20 = 0.83**, β30 = -1.84**
Variance components (SD): Sigma (σ2) = n/a, R0 = 0.12, R1 = 0.00, R2 = 0.00, R3 = 0.09
Likelihood function (MLF): -239.42

Log(Y) = -0.31 + 1.98(TRT) + 0.83(ORDER) - 1.84(TRTORD)
* p < .05, ** p < .01, Y = dependent variable, n/a = no overdispersion
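To reproduce the expected probabilities discussed above, the fitted equation at the bottom of Table 16 can be converted to a probability for any combination of the TRT and ORDER indicators. The sketch below does this in Python (not the HLM6 or Stata software used in this work); the coefficients come directly from Table 16, while the function name is illustrative.

    import math

    def expected_probability(trt, order):
        """Convert the Table 16 fitted log odds to an expected probability.

        trt and order are 0/1 indicators; trt * order supplies the TRTORD
        interaction term in the final model.
        """
        log_odds = -0.31 + 1.98 * trt + 0.83 * order - 1.84 * (trt * order)
        return math.exp(log_odds) / (1 + math.exp(log_odds))

    print(round(expected_probability(0, 0), 2))  # baseline, first AB: about 0.42
    print(round(expected_probability(1, 0), 2))  # treatment, first AB: about 0.84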
The PND effect size estimates indicate the treatment was effective for one student
(PND = 76.4%), questionable for three (PND = 61.1%, 55.55% and 50%) and not
effective for two (PND = 40.4% and 27.78%). The SMD effect size estimates somewhat
verified these findings; however, the three students with questionable PNDs (61.1%,
55.55%, and 50%) had quite different SMD results (30.5, 2.29, and .90). Recall that a
50% cutoff is used to render a decision about treatment effectiveness for the PND, and
that values around 2.0 or higher indicate an effective treatment for the SMD. The R2
(proportion of the variation explained by the phase
differences) was 29%.
Table 17 presents the final working model for the Amato-Zech and Colleagues’
(2006) article. Here, the authors used an electronic cue as a treatment intervention that
vibrated every three minutes. The cue would prompt three students to record if they were
on-task or not. The observation period was fifteen minutes a day, two to three times per
week for forty-four sessions. Baseline sessions were described as a typical classroom
environment where on-task behavior was defined as students actively or passively paying
attention to instruction. From Table 17, we can compute the expected probabilities for
on-task behavior for baseline and treatment phases. At baseline (treatment = 0), the
expected probability of observing an on-task behavior from a student was 0.39, [exp(-0.46) = 0.63; 0.63 / 1.63 = 0.39]. The average rate of change in log odds as a student
switched from baseline (treatment = 0) to treatment (treatment = 1) was 0.32, so the
expected probability of observing a student completing an on-task behavior during the
treatment phase was 0.47, [exp(-0.46 + 0.32) = exp(-0.14) = 0.87; 0.87 / 1.87 = 0.47].
According to the authors, the treatment appeared to be effective for all students.
The independent visual analyses and HLM results concurred.
Table 17
Final Model: Amato-Zech and Colleagues (2006)

Outcome variable: On-Task Behavior
Final model: No overdispersion; simplified Level-1 model; no Level-2 predictors

Variables:
  P0 = β00 + R0 (INTCPT)
  P1 = β10 + R1 (TRT)

Coefficient estimates: β00 = -0.46**, β10 = 0.32**
Variance components (SD): Sigma (σ2) = n/a, R0 = 0.02, R1 = 0.01
Likelihood function (MLF): -159.49

Log(Y) = -0.46 + 0.32(TRT)
* p < .05, ** p < .01, Y = dependent variable, n/a = no overdispersion
PND effect size estimates calculated here match the original report. The SMDs
ranged from 2.8 to 4.5 suggesting effective treatments for all three students, and the R2
(proportion of the variation explained by the phase differences) was 51%.
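For reference, the SMD values cited throughout this chapter express the baseline-to-treatment mean difference in baseline standard deviation units. A minimal sketch in the style of Busk and Serlin (1992), assuming the commonly used no-assumptions variant (treatment mean minus baseline mean, divided by the baseline standard deviation), is given below; the sample values are illustrative.

    import statistics

    def smd(baseline, treatment):
        """Standardized mean difference: (treatment mean - baseline mean)
        divided by the baseline standard deviation."""
        return (statistics.mean(treatment) - statistics.mean(baseline)) / statistics.stdev(baseline)

    # Values around 2.0 or larger were interpreted as an effective treatment.
    print(round(smd([20, 25, 22, 24, 30, 28], [32, 35, 30, 36]), 2))  # about 2.27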
Improvement Rate Difference
The IRD was calculated separately and is therefore not integrated in the above
findings. Recall that an IRD of 100% would suggest that all data points in the treatment
phase exceed those in the baseline phase (i.e., no overlap), indicating a highly effective
treatment; when the IRD is 50%, only chance-level improvement from the baseline to the
treatment phase exists (Parker et al., 2009). The IRD was not reported in any of the nine
articles, so it was computed here only to compare it to the other effect size estimates.
Two contrast methods were proposed by Parker and colleagues (2009) to calculate the
IRD. The first method was the comparison of the three contrasts (A1 versus B1, B1 versus
A2, and A2 versus B2) and the second was calculated using two contrasts (A1A2 versus
B1B2). Both options are feasible, but only one effect is needed per person/ABAB graph
(Parker et al., 2009). One way to evaluate the IRD in this sensitivity analysis is to
compare it to the reported PND (i.e., the PND originally reported in each article); the
literature suggests a strong correlation between the two, r = .83 (Parker et al., 2009). For
this study, the correlation between the reported PNDs and the IRDs was r = .92.
Comparison of the PNDs based on Visual Analyses to PNDs based on Extracted
Data
Comparing effect sizes in the SCDs examined here was complicated by the fact
that even the same estimate yielded inconsistent results. Four out of the nine articles had
at least one inconsistent PND when comparing the visual analysis with the extracted
data from UnGraph. This is not necessarily surprising given how UnGraph extracts data:
UnGraph traces the line or allows the user to point and click on each data point, which
may be slightly inconsistent, especially if the graph is of poor quality. When this occurred,
the PND was re-calculated and compared to both the extracted and the visual PND;
whichever of the two was verified by the re-calculated value was chosen as the correct
PND. The PND from the extracted data was calculated in Excel using the PND formula,
also found in Chapter Two. The corrected PND column was determined by the researcher
as the best PND for the data. The inconsistencies (there are only a few) are described
in full in Chapter Five.
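To make the spreadsheet calculation just described concrete, a minimal sketch of the PND computation is shown below (in Python rather than Excel). The function name and sample values are illustrative, and the higher_is_better argument handles outcomes such as disruptive behavior, where improvement means lower values.

    def pnd(baseline, treatment, higher_is_better=True):
        """Percentage of treatment-phase points that do not overlap the baseline,
        i.e., that fall beyond the most extreme baseline value in the
        therapeutic direction."""
        if higher_is_better:
            cutoff = max(baseline)
            non_overlap = sum(1 for y in treatment if y > cutoff)
        else:
            cutoff = min(baseline)
            non_overlap = sum(1 for y in treatment if y < cutoff)
        return 100 * non_overlap / len(treatment)

    # One of four treatment points overlaps the baseline, so the PND is 75.0.
    print(pnd([10, 12, 15, 11], [14, 20, 22, 25]))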
Table 18
A Comparison of Effect Sizes from the PND: Visual Analysis versus Extracted Data from
UnGraph

Each entry lists the PND from visual analysis, the PND from the extracted data, and, where the two differed, the PND judged correct.

Amato-Zech, N. A., Hoff, K. E., & Doepke, K. J. (2006). Increasing on-task behavior in the classroom: Extension of self-monitoring strategies. DV: On-task Behavior.
  Jack: 100% (visual analysis); 100% (extracted data)
  David: 100% (visual analysis); 100% (extracted data)
  Allison: 93.75% (visual analysis); 93.75% (extracted data)

Cole, C. L., & Levinson, T. R. (2002). Effects of within-activity choices on the challenging behavior of children with developmental disabilities. DV: Percent of Task Analysis Steps.
  Keith: 51.66% (visual analysis); 51.66% (extracted data)
  Wally: 28.60% (visual analysis); 42.85% (extracted data); 42.85% (correct)

Lambert, M., Cartledge, G., Heward, W. L., & Lo, Y. (2006). Effects of response cards on disruptive behavior and academic responding during math lessons by fourth-grade urban students. DV: Disruptive Behavior.
  A1: 77.78% (visual analysis); 77.78% (extracted data)
  A2: 100% (visual analysis); 100% (extracted data)
  A3: 81.25% (visual analysis); 81.25% (extracted data)
  A4: 75.71% (visual analysis); 75.71% (extracted data)
  B1: 61.36% (visual analysis); 61.36% (extracted data)
  B2: 100% (visual analysis); 100% (extracted data)
  B3: 94.4% (visual analysis); 94.4% (extracted data)
  B4: 0% (visual analysis); 0% (extracted data)
  B5: 33.33% (visual analysis); 33.33% (extracted data)

Mavropoulou, S., Papadopoulou, E., & Kakana, D. (2011). Effects of task organization on the independent play of students with autism spectrum disorders. DVs: On-task, Teacher Prompt, Performance.
  Vaggelis: on-task 50%, teacher prompt 30%, performance 45% (visual analysis); 50%, 30%, 45% (extracted data)
  Yiannis: on-task 75%, teacher prompt 40%, performance 60% (visual analysis); 75%, 40%, 60% (extracted data)

Murphy, K. A., Theodore, L. A., Aloiso, D., Alric-Edwards, J. M., & Hughes, T. L. (2007). Interdependent group contingency and mystery motivators to reduce preschool disruptive behavior. DV: Disruptive Behavior.
  S1: 50% (visual analysis); 50% (extracted data)
  S2: 58.3% (visual analysis); 58.3% (extracted data)
  S3: 42.85% (visual analysis); 42.85% (extracted data)
  S4: 0% (visual analysis); 0% (extracted data)
  S5: 87.5% (visual analysis); 93.75% (extracted data); 93.75% (correct)
  S6: 35.7% (visual analysis); 66.95% (extracted data); 66.95% (correct)
  S7: 35.7% (visual analysis); 50% (extracted data); 35.7% (correct)
  S8: 100% (visual analysis); 100% (extracted data)

Ramsey, M. L., Jolivette, K., Puckett-Patterson, D., & Kennedy, C. (2010). Using choice to increase time on-task, task-completion and accuracy for students with emotional/behavior disorders in a residential facility. DVs: On-task, Task Complete, Accuracy.
  Abby: on-task 50%, task complete 50%, accuracy 75% (visual analysis); 50%, 50%, 67.85% (extracted data); accuracy 75% (correct)
  Sara: on-task 94.44%, task complete 100%, accuracy 100% (visual analysis); 94.44%, 100%, 100% (extracted data)
  Trey: on-task 50%, task complete 77.77%, accuracy 72.22% (visual analysis); 50%, 77.77%, 72.22% (extracted data)
  Chris: on-task 66.4%, task complete 100%, accuracy 70.7% (visual analysis); 73.5%, 100%, 70.7% (extracted data); on-task 66.4% (correct)
  Katie: on-task 26.6%, task complete 35%, accuracy 30% (visual analysis); 26.6%, 31.66%, 30% (extracted data); task complete 31.66% (correct)

Restori, A. F., Gresham, F. M., Chang, T., Howard, L. B., & Laija-Rodriquez, W. (2007). Functional assessment-based interventions for children at-risk for emotional and behavioral disorders. DVs: Overall Disruptive Behavior and Overall Academic Engagement (values given as Disruptive / Academic).
  A1: 100% / 100% (visual analysis); 100% / 100% (extracted data)
  A2: 100% / 100% (visual analysis); 100% / 100% (extracted data)
  A3: 100% / 100% (visual analysis); 100% / 100% (extracted data)
  A4: 100% / 100% (visual analysis); 100% / 100% (extracted data)
  C1: 100% / 92.8% (visual analysis); 100% / 100% (extracted data); 92.8% (correct)
  C2: 0% / 50% (visual analysis); 0% / 50% (extracted data)
  C3: 91.6% / 91.6% (visual analysis); 91.6% / 91.6% (extracted data)
  C4: 100% / 91.6% (visual analysis); 100% / 91.6% (extracted data)

Theodore, L. A., Bray, M. A., Kehle, T. J., & Jenson, W. R. (2001). Randomization of group contingencies and reinforcers to reduce classroom disruptive behavior. DV: Disruptive Behavior.
  S1–S5: 100% (visual analysis); 100% (extracted data)

Williamson, B. D., Campbell-Whatley, G., & Lo, Y. (2009). Using a random dependent group contingency to increase on-task behaviors of high school students with high incidence disabilities. DV: On-task Behavior.
  S1: 40.4% (visual analysis); 40.4% (extracted data)
  S2: 76.4% (visual analysis); 76.4% (extracted data)
  S3: 61.1% (visual analysis); 61.1% (extracted data)
  S4: 55.55% (visual analysis); 55.55% (extracted data)
  S5: 50.0% (visual analysis); 50.0% (extracted data)
  S6: 27.78% (visual analysis); 27.78% (extracted data)
DV = Dependent Variable; ID = subject identifier; PND (visual analysis) = Percentage of non-overlapping
data analyzed by a visual process determined by the features described in this dissertation; PND (extracted
data) = Percentage of non-overlapping data analyzed by the numerical data scanned using UnGraph. PND
formula = (1 – proportion of overlapping treatment data points) × 100.
Results for Research Question 4
The validity of using UnGraph to extract data has been examined in the past
(Shadish et al., 2009). The concerns addressed here deal with the ability of UnGraph to
correctly extract the data and whether two independent raters would scan the data similarly. To
address the first point, attempts were made in this study to contact original authors to
obtain original datasets. Only two responded and these data were compared to the
UnGraph results. In short, analyses for this research question were done in three ways:
(1) a comparison of the extracted data to the raw data, (2) a comparison between the
primary researcher and a second coder, and (3) a comparison between the extracted data
and the original authors’ descriptive data.
Two authors responded to the data request from the articles: Mavropoulou and
colleagues (2011) and Amato-Zech and colleagues (2006). Data on four dependent
variables (two on-task outcomes, prompting, and performance) were included in the
comparison of extracted data to original. N = 38 indicates the total number of data points
obtained from the raw data provided by the original authors. The
correlations between the UnGraph results and raw data for the Mavropoulou and
colleagues (2011) study were both 1.00 for on-task and performance, and .99 for
prompting. For the Amato-Zech and colleagues (2006) study, on-task behavior yielded a
correlation of .99 between the original data and extracted data. For the independent
variable (sessions), both studies had a correlation of 1.00 between the original author and
scanned data. Lastly, there was 100% agreement for both studies on the number of data
points to extract.
Further, a second coder was used in this study to check the reliability of the
primary researcher and the ability to accurately extract data from UnGraph. The second
coder independently digitized the same nine studies. The overall average correlation for
the nine studies was r = .996 (p < .001). The average outcome variable correlation was r
= .994 and the average session correlation was r = .998. For the outcome variable, three
correlations were 1.00, one was .999, three were .998, and the lowest was .967 for the
Cole and Levinson (2002) study. For the session variable, six variables had 1.00
correlations, with the lowest being .98 for the Lambert and colleagues (2006) study.
Between the two coders, the percent of agreement of extracted data points (i.e., total
number of data points selected by the secondary coder divided by the primary coder) was
97.70% (1,330 out of 1,362 data points). The trained graduate student accidentally
extracted data for generalization phases in one article, which caused the denominator to
be 32 points higher than it should have been. Generalization phases extend beyond the ABAB design in an
effort to collect additional behavioral data and verify that the outcome measure, in fact,
represents an increase/decrease as a function of the treatment.
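The two reliability checks just described reduce to a Pearson correlation between the coders' digitized values and a percent agreement based on the counts of extracted points. A minimal sketch under those assumptions is given below; the sample values are illustrative.

    from statistics import correlation  # Pearson's r; available in Python 3.10+

    def percent_agreement(primary_count, secondary_count):
        """Percent agreement on the number of extracted data points: the
        secondary coder's count divided by the primary coder's count."""
        return 100 * secondary_count / primary_count

    # 1,330 of the primary coder's 1,362 points were matched by the second coder.
    print(round(percent_agreement(1362, 1330), 1))  # 97.7

    # Correlation between the two coders' digitized outcome values (illustrative).
    primary = [12.0, 15.5, 20.1, 33.2, 40.0]
    secondary = [12.1, 15.4, 20.3, 33.0, 40.2]
    print(round(correlation(primary, secondary), 3))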
In addition, descriptive data from the original authors was compared to extracted
data. Table 22 in the Appendix shows the studies’ reported data concerning the outcome
variables (measured by mean intervals, ranges, and percentage of intervals) and compares
those statistics to the extracted data. The average correlation between the original data
and extracted data (average baseline and treatment phases) for the nine articles in this
study was .99. The articles independently listed, with the reported and extracted data, can
be seen in the Appendix. Below, Table 19 displays the mean and variance of both the raw
data and the extracted data (from the authors who responded to the request), further
demonstrating the validity of UnGraph.
Table 19
A Comparison of Descriptive Statistics between Raw Data and Extracted Data

            Raw Data    Extracted Data
Mean        48.38       47.96
Variance    959.68      968.68

N = 38
Chapter Five: Discussion, Conclusions, and Recommendations
This chapter discusses results from the analyses that were conducted to address
each research question. This is followed by a summary of the findings, a series of
conclusions, and a discussion of suggestions for future research.
The key point of this work was to compare results across two different analytic
traditions. The work is a type of sensitivity analysis in that the primary question
considers whether overall treatment effects are found regardless of whether HLM
techniques or visual analyses are used. Put another way, the work examines whether a
treatment effect is found using one approach or the other (i.e., are findings and conclusions sensitive to
method?). Strong correspondence between different approaches would be encouraging
and low correspondence suggests the need for further investigation that goes beyond the
scope of this dissertation. Secondary questions dealt with the examination of Level-2
subject characteristics in hierarchical statistical models. This is of some interest because
the literature (Nagler et al., 2008) describes a method where regression techniques can
examine the influence of such variables, even in the context of single-case designs which
work with quite small sample sizes; furthermore, application of these methods can
potentially push analyses in new directions and reveal findings that are not possible with
visual analysis approaches. Finally, calculating effect size estimates (using different
methods), comparing them to the original authors’ results, and assessing the reliability
and validity of the UnGraph procedure for digitizing coordinates were also of interest.
Discussion of the Results
The primary research question was: Do quantification and subsequent statistical
analyses of selected ABAB graphs produce conclusions similar to visual analyses?
Comparisons of the original authors’ conclusions (all based on a form of visual analysis)
with statistical analyses of the ABAB graphs produced similar findings. There was a 93%
agreement rate between the original authors and the re-analyses using HLM techniques.
This was across 14 dependent analyses, focusing on whether there was a treatment effect.
Only one difference was found between the original articles and the re-analyses. In terms
of this one exception, Mavropoulou and colleagues (2011) considered the intervention to
be ineffective (for task performance), but analyses using HLM techniques rendered a
significant p value (p < .05). While the original authors questioned the treatment
effectiveness, it is noteworthy that the statistical approach suggested there was a
treatment impact. These results should be interpreted carefully as models based on data
drawn from this article did not converge. It is also the case that HLM approaches always
concurred with claims made by authors when they stated there was a treatment impact.
A 71.4% agreement rate was found when comparing the original authors’ claims
and results from re-analyses using independent visual analysis approaches based on
WWC criteria. These two findings could be explained in several different ways. It could
mean that the WWC renders conservative opinions about treatment impacts in the context
of single-case designs. A related limitation from the beginning of this work was that
some WWC approaches were not followed. In the WWC project, three doctoral-level
researchers trained in visual analyses render an opinion and this level of attention was not
possible here. This is not trivial as seeking such corroboration is done by others in the
field. Maggin and colleagues (2011a) recommended using a team of individuals trained
in visual analysis to report findings for each individual case. It is difficult to determine if
the single visual analyst used in this work was overly conservative when rendering
decisions about treatment impacts or whether others with similar training would reach
similar conclusions.
There was also a 62.3% agreement between WWC-based analysis and HLM
analyses. For the most part, HLM analyses routinely found treatment effects (there was
one exception), and it seems likely that some of the same limitations for using SMDs
(recall the procedure tends to yield very large results) apply here. It could be that HLM
estimation procedures used for this dissertation (i.e., full maximum likelihood) yielded
inflated parameters. This is worrisome and might suggest that full maximum likelihood
may not be the best option given the data. It is also unfortunate to not have the ability to
compare models using restricted maximum likelihood, as this estimation procedure does
not render a likelihood function.
Three articles were analyzed using models that did not converge, and so their
results were not included in Chapter Four. These articles were: (a) Cole and Levinson
(2002), studying disruptive behavior; (b) Theodore and colleagues (2001), concerning
disruptive behavior; and (c) Mavropoulou and colleagues (2011), for teacher prompting
and task performance. The failure to converge may have occurred for several reasons but
it would be reasonable to assume that the small sample sizes, large between- and
within-subject variability, and parameter estimation using MLF caused the problem. These models were
dropped from Chapter Four but are available in Appendix C.
The second research question was: Do any of the subject characteristics (Level-2 data)
explain between-subject variation? If so, can this information be used to yield new
findings? There were four Level-2 predictors that explained (p < .05) variance between
subjects. The Lambert and colleagues (2006) study had a Level-2 variable that was a
statistically significant predictor of baseline variability. This variable was Class B, which
captured classroom membership (Class A = 0, Class B = 1). The authors stated that both
classrooms displayed a decrease in disruptive behavior during the response card
intervention but later it was revealed that classroom teacher ‘A’ had some difficulty
recording the responses at the onset of the study. No statistical analysis of differences
between the two classrooms was conducted in the original article. Here, as found by
Nagler and colleagues (2008) and verified by this study, the variable Class B (which
classroom the child came from) was a statistically significant (p < .05) predictor of
variance at baseline. This new finding could be explained by several possibilities,
including but not limited to the following: (a) the children were coming into the classroom with prior
knowledge of using white boards to answer questions, (b) the two classrooms had
students who performed differently, or (c) the teachers implementing the intervention had
differing teaching strategies.
For the Murphy and colleagues’ (2007) study, the original authors reported
differences between the two types of students (those on track to advance to the next
grade level and those who were not) in the discussion section of the article. The authors were
limited to descriptive statistics, and stated that the onset of skills may be attributed to
starting educational levels or the “heterogeneous nature of the classroom” (p. 60). This is
consistent with the expected probabilities for observing disruptive behavior at baseline and
treatment phases. The average overall expected probabilities for observing behavior
matched the concerns that the authors had with not fully achieving a return to baseline
(second A) and large reductions in behavior change when implementing the second
treatment (second B phase). The authors attributed this to the starting educational levels
and the mix of students used in the study and possibly the students who came with more
skills (i.e., remembering how to behave from a prior experience).
The Restori and colleagues (2007) study investigated functional assessments
using typically developing children in a general education setting. It appeared that
children exhibiting poor academic engagement and extreme rates of disruptive behavior
were associated with avoidant or attention seeking behaviors. The authors reported a
significant increase in academic achievement in both treatment phases, and a significant
reduction in disruptive behaviors. Also, students assigned to antecedent-based
interventions had increased academic engagement in comparison to those assigned to consequent-based interventions.
The findings reported by the original authors compared to results from HLM analyses
were consistent. For example, the treatment variable was a statistically significant
predictor of academic achievement, p < .01, as well as disruptive behavior, p < .01. In
addition, Restori and colleagues (2007) stated that “antecedent-based treatment
interventions were more effective than consequent-based interventions for reducing
disruptive behavior” (p. 26), even though both were found to lower behavior. The
statistical analyses found that both treatment interventions were successful for the group,
and confirmed antecedent interventions as more effective, even though odds-ratios would
be the same.
For the most part, statistical analyses enhanced interpretation of the original
articles. Calculating the expected probabilities of behavior somewhat verified claims
made by the study authors, but also provided more information concerning baseline and
treatment phases. In one case, even though the researchers claimed that both
interventions worked effectively (which, according to visual analyses, they did), statistical
work revealed that one was a better option (given the data). Caution is advised here
because differences can occur when using different estimation procedures. Further,
since the sample sizes for these studies were so small, generalization is neither viable nor
suggested. For now, the overall conclusion reached here is that Level-2 analyses are
useful, but more work on their procedures and properties is warranted. Given the sample
sizes in SCDs, it is advised that estimation of the parameters should be studied more,
especially when using MLF. Finally, the variance-covariance matrix chosen for this
dissertation was unstructured, since this is the default in HLM (Garson, 2012), and after
producing the variance-covariance output, the matrices indicated an unstructured matrix.
Again, there are several options for restricting the variance components, and this and other
parameter estimation techniques should be studied in the future, including autoregressive
parameters, according to the manual’s authors. Also, the HLM6 software does not yield
the variance-covariance matrix by default, but the output can be recovered
(Raudenbush, Bryk, & Congdon, 2004).
Model building was necessary to answer Research Questions 1 (HLM analyses)
and 2 (Level-2 contributions). Although some issues arose with model convergence and
interaction terms, Nagler and colleagues (2008) state: “no model is going to completely
fit actual data because of great within-phase and within-subject variation” (p. 11, Section
5). Nagler and colleagues (2008) continue to express that in real analyses, inferences
between both the treatment effect sizes and estimates from the models would be
compared for final judgments concerning treatment impact.
The third research question was: Do the PND and IRD (nonparametric indices), the SMD
(a parametric index), and R2 yield effect sizes similar to those reported in the original studies? To address
this question, effect sizes calculated from either extracted data or graphs were compiled
and compared to ones originally described in the articles. All effect size similarities and
inconsistencies can be seen in Table 20 (see Chapter Four) and Table 21 in Appendix A.
Comparing multiple effect sizes for individual studies may be considered unwarranted
due to the computational differences found within each estimate. Generating and
interpreting effect sizes in SCDs is complicated by the fact that standardized mean
differences (SMDs) are not readily comparable to effect sizes generated from group-based studies (e.g., randomized controlled trials and quasi-experiments). This is because
the variances of SCDs tend to be small relative to group-based designs and there can be
cases where mean performance shifts in baseline for an individual can yield large
numerators.
It might, however, be enlightening to calculate and display results from other methods
as a way to verify the conclusions of the original articles, and to see if effect sizes had discernible
patterns. Maggin and colleagues (2011a) suggested using more than one effect size
estimate to understand the relationship between the outcome and treatment variable, even
though this was not commonly seen in the literature.
The PND, although not considered the best estimate developed, continues to
dominate the field of SCDs in school psychology (Maggin et al., 2011a). Until another
estimate is developed, it will continue to be reported. The PND is simple to calculate;
however, according to Allison and Gorman (1994), as the number of data points increases,
the PND value trends toward zero, making it difficult to compare PND results from
one study to another. In addition, the PND does not correlate highly with other effect
size indexes (Maggin et al., 2011a; Shadish, Rindskopf, & Hedges, 2008). Nonetheless,
for this dissertation 33% (n = 3 of 9) of the articles used the PND as an effect size
estimate. It just may be that the effect size estimates in SCDs present on-going issues in
the field.
Amato-Zech and colleagues (2006) reported a PND, which matched the
calculated effect size PND from the primary researcher of this sensitivity analysis. The
SMD suggested the intervention was effective for each subject (ranging from 2.8 – 4.5,
recall any value close to 2.0 is considered effective). Lambert and colleagues’ (2006)
article reported a PND for five students but did not calculate it for four of them. Only
three of these were confirmed, and only two of the PND results suggested the
intervention was effective. By contrast, all SMD calculations suggested the treatment
was effective for each person, except for the last two individuals (B4 and B5). Again,
although comparing between effect size methods is not common, it could be helpful when
software is not available, when time is limited, or when data has high variability. In
addition, Mavropoulou and colleagues (2011) reported four PNDs in their article
although six could have been calculated. Of the four reported, three were confirmed.
The one that did not match was Yiannis’ performance result; however, the results
calculated here did not substantively alter the overall conclusion that the treatment’s
effectiveness was questionable for this student. As an aside, HLM analyses yielded a
significant p value when examining the question of whether there was a treatment impact
on the behavioral outcome, even though the original study authors questioned its
effectiveness, as did independent visual analyses. Further, the SMD was calculated at
.39, which suggested the treatment was not effective for the individual, Yiannis.
Murphy and colleagues (2007) reported results using the SMD approach where
the treatment was deemed to be effective for five out of eight subjects. The calculated
SMD from the extracted data for this dissertation did not agree with the results of the
original authors. According to the article, four students experienced effective treatments
(using the standard cut-off of around 2.0 or higher); however, two were just slightly
below the threshold, 1.99 and 1.98, respectively. The PND suggested that five students
did not experience an effective intervention. Theodore and colleagues (2001) reported
SMD results in their article and these were all confirmed. The PND results, calculated by
the dissertation author, also suggest an effective rating. Mainly, the effect size estimates
are highly variable, do not always match, and should be further studied.
The IRD is included in this study as a means to compare it to the PND from both the
originally reported and the extracted data. This is mainly because the literature does not
offer IRD ranges for determining treatment effectiveness (unlike the PND or the SMD).
Recall that an IRD of 100% indicates that all data points in the treatment phase
exceed those in the baseline phase, suggesting an effective treatment. By contrast, when
the IRD is 50%, only chance-level improvement from the baseline to the treatment phase
exists (Parker et al., 2009). Recall that Parker and colleagues (2009) found a .83
correlation between the IRD and the PND in 166 AB series data sets. The contrast
method (A1A2 versus B1B2) yielded a correlation of .92 between the IRD and the reported
PND. This may suggest that the IRD and PND are somewhat comparable effect size
estimates for these types of behavioral-based, ABAB designs. Further, the IRD, as a
newer estimate, should be used in the future, possibly in place of the PND, until a more
viable effect size estimate is devised for SCD data. For now, there is no way to
generalize these effect size findings to other settings. It can only be concluded that the
effect size metrics yielded inconsistencies and the single-case community should know
about this finding.
The last measure of association, the R2, can be seen in the last column of Table 20
in the Appendix. The proportion of the variation explained by the phase differences is
one way to interpret R2; however, other interpretations are available in single-case
research and consideration should be contingent on the design’s function (Parker &
Brossart, 2003). This last measure can give a sense of how the phase differences vary,
with the larger proportions indicating more variation between baseline and treatment
phases. In an effort to correlate different effect size measures, Parker and colleagues
(2009) found the strongest correlation was between R2 and Kruskal-Wallis W and the
weakest was between R2 and the PND. This highlights that caution is needed when using
effect sizes in SCDs. Although comparing different effect size estimates is not common,
this dissertation work suggests the extra steps may be useful since they can yield
inconsistent results. Certain effect size procedures tend to be consistent (e.g., IRD and
PND) whereas others are not (e.g., R2 and the PND). Furthermore, even comparisons
between the exact same estimates (i.e., PNDs using the visual analysis technique versus
the extracted data PNDs) were inconsistent. The visual analysis PNDs and the extracted
data PNDs are displayed in Table 20 in Chapter Four.
Recall that visual analysis uses the features and criteria described in Chapter Two for
determining treatment effect (i.e., it is a visual process that is analyzed by multiple senior
research methodologists assessing trend, level, overlapping data). The data in the column
labeled ‘PND extracted data’ in Table 20 contains the exact data extracted by the researcher of
this dissertation (these data can also be found in Appendix B). This column was calculated using the
PND formula. The column labeled ‘PND correct’ is the suggested PND effect size
estimate based on the information gathered (by the dissertation author) after re-analyzing
the differences (discussed below). It is worrisome that some of the PNDs did not match
exactly. After a re-analysis of the two methods of calculating the PND (one based on
visual inspection and the other on extracted data), the value matching one of the two
methods was chosen as the correct PND. The column labeled PND correct was added to
suggest which PND option may be the best. Any other researcher would likely agree that
a second look at the two inconsistencies in an effort to yield the same estimate is
valuable. Finding an exact match alleviates concerns, but failure to do so provides the
field with possible areas of future study: (a) why effect sizes do not match in SCDs and
(b) how even the same estimate can be different between visual inspection and
quantification using UnGraph. In general, the extracted UnGraph data and independent
visual analysis PND (remember that, even when graphs are assessed visually, the PND
involves calculation, since it is the percentage of non-overlapping data) did provide some different
numbers because trivial differences existed when clicking on a specific data point.
UnGraph itself can be problematic in that one can get a false sense of security,
particularly when dealing with hard to read graphs. When dealing with software that has
such calibration systems for extracting data from graphs, it may be advisable to have
more than one trained extractor check on the results or simply round decimals to whole
numbers.
For the Cole and Levinson (2002) article, one student (Wally) yielded two data
points (last two data points found in the second AB pairing) that visually demonstrated no
overlap, but the extracted data considered those two points as overlapping. This is
important to mention because the PND can be interpreted differently. The PND does not
count data points that are exactly equal (i.e., by definition, the PND counts only data that
fall above / below the highest / lowest baseline data point in the intended direction of the
treatment intervention); therefore, the researcher would consider the data found in the
Cole and Levinson article as overlap. This is unfortunate for data that fall at zero or at the
highest possible value of the outcome measure.
For the Murphy and colleagues (2007) article, three of the eight estimated PND
values differed between the two datasets, and all three were higher when using the
extracted data. For one student, the second AB pairing yielded a data
point that was not overlapping, but this was not the case when basing the PND on visual
analyses (the data point in UnGraph registered 5.50 in treatment two and 7.26 in baseline
two). This same general issue occurred with the other students. Consequently, slight
differences will yield different conclusions. Thus, the final PND column in the table
suggests the recommended effect size estimate based on the data comparison between the
two methods. The best method was chosen by the researcher of this dissertation based on
a second look at the data in SPSS and visual analyses. It is not that UnGraph is
inconsistent; rather, it is that the PND method can be applied differently.
The next inconsistent PND result was found when analyzing data from a student
in the Murphy and colleagues’ (2007) study. Visually it was clear that five points fell
below the lowest number in the first AB pairing, but the quantification found all seven
points (out of seven) overlapping. Therefore, visual analysis was the better method
because it was clear visually they did not overlap. Similarly, for the Ramsey and
colleagues (2010) article, three effect size estimates were inconsistent. Since an effect
was readily apparent from visual analysis, the PND based on this procedure was deemed
to be correct. Finally, for the Restori and colleagues (2007) study, one person was
visually difficult to assess for the outcome measure, overall disruptive behavior. The
researcher of this dissertation had to make a judgment call based on the PND differences,
which is a limitation of this dissertation. All other articles and outcome measures were
exactly the same between the two methods.
This all leads to an important suggestion for conducting further research, which is to
use the two methods in tandem. This comparison should help researchers see that even
one inconsistent data point could change the effect size estimate, making it necessary to
compare multiple effect sizes (possibly within the same estimate, as this
dissertation demonstrated). That is, even a comparison of the same effect size estimate
(PND), yielded inconsistent results. More research should be conducted for effect size
estimates in the single-case community.
The fourth research question was: Is UnGraph a valid and reliable tool for
digitizing ABAB graphs? Examining the UnGraph procedure is important because this
dissertation is predicated on the assumption that the UnGraph procedure can reliably
reproduce published data presented in graph form. The reliability and validity of the
UnGraph procedure was examined at three levels: (1) the correlation between the original
“raw” data and UnGraph’s extracted data, (2) agreement between the researcher of this
work and a second coder, and (3) the authors’ reported means/percentages/ranges
compared to UnGraph’s extracted data. First, authors from two studies provided raw data in this
study: Mavropoulou and colleagues (2011) and Amato-Zech and colleagues (2006).
Ideally, all authors would have been able to respond to requests for raw data and the fact
that only two responded represents a study limitation. Correlations between the raw and
extracted data for the Mavropoulou and colleagues (2011) study were both r = 1.00 for
on-task and performance, and r = .99 for prompting. For the Amato-Zech and colleagues
(2006) study, on-task behavior yielded a correlation of r = .99 between raw and extracted
data. For the independent variable (sessions), both studies had a correlation of r = 1.00
between the original author and scanned data. Lastly, there was 100% agreement for
both studies on the number of data points to extract between the original authors and the
researcher in this study. This shows that UnGraph was able to reproduce data. In
addition, an independent coder also digitized data and results were compared to the
extraction efforts of the dissertation author. The overall average correlation between the
two coders (for the nine studies) was, r = .9957. The average outcome variable
correlation was r = .9938 and the average session correlation was r = .9976. Between the
two coders, the percent of agreement of extracted data points (i.e., the total number of
data points clicked on by the secondary coder divided by the primary coder) was 97.70%.
In all, the number of data points available to extract (1,362) was based on the primary
researcher’s extraction. Any divergence in the percent of agreement of extracted data
points was due to an error on the part of the second coder, who did not follow directions
and extracted phases not needed for this study.
A final point worth noting is the similarity between the authors’ reported
means/percentages/ranges and UnGraph’s extracted data. The average
correlation between the original data and extracted data (average baseline and treatment
phases) for the nine articles (means/percentages/intervals) in this study was r = .99, p <
.01. Specifics can be seen in Table 24 in the Appendix. Given the overall pattern of
findings, it would appear that UnGraph can reliably reproduce graphed data.
Study Limitations
Several limitations were described above. The ones recognized at the outset of
the study are: (1) not having access to WWC resources, (2) not knowing the exact
variance-covariance matrix used in the manual, (3) not having access to all raw data and
(4) not having an ideal ES estimate to compare between studies. Additional limitations
that occurred as a result of proceeding with the study plan include: (1) failure to converge
using MLF estimation, (2) only two authors responding to requests for raw data and (3)
the primary researcher making judgment calls due to the differences between the PNDs.
Conclusions
This dissertation had one primary and three secondary research questions. The
primary research question asked: Do quantification and subsequent statistical analyses
of the ABAB graphs produce conclusions similar to visual analyses? The overall answer
is: not always. HLM results and author claims were largely consistent. The only
exception was when dealing with the task performance variable from the Mavropoulou
and colleagues’ (2011) study. The authors questioned whether there was a treatment
impact but HLM analyses yielded a statistically significant p value. This may be an
indication that HLM techniques tend to yield liberal decisions when using standard p
value cutoffs, and this may make sense given the standardized effect size estimates. The
model using task performance as the outcome variable did not converge and the differing
treatment impact results should be cautiously interpreted. Also, other studies had similar
issues with convergence. For example, in the Cole and Levinson (2002) article, the
statistical test determined the treatment was effective, p < .05, but again, without
convergence this p value should be cautiously interpreted. The SMD for Keith (1.42)
suggested an ineffective treatment intervention, as did the SMD for Wally (.96). These results
similarly support HLM being too liberal given the ES estimates.
The examination of Level-2 subject characteristics via HLM techniques appears
to hold promise in terms of yielding new insights into single-case data. In this work, four
such variables were found to explain variation of dependent variable data. Several Level-2 variables were tested (a complete list of all Level-2 variables tested can be seen in the
Appendix) and found to provide additional information beyond the original work in the
Lambert and colleagues (2006) study. It could be that the original authors were not
interested in these variables at that time. Nonetheless, these predictors can provide
information beyond what descriptive information could provide. Of course, these
predictors are more exploratory in nature, and should be interpreted with caution due to
the small sample sizes of these articles.
The effect size component of this work was also considered as a type of
sensitivity analysis, focusing on any differences between claims made by the original
authors and results based on the extracted data. The SMD and PND were, for the most
part, used to describe effect size in the nine articles in this study. Again, comparing
effect sizes in SCDs is difficult because, while some authors report numerical values,
others only provide an ‘effective/not effective’ treatment judgment. Further, the PND seemed to
produce inconsistent effect sizes depending on the method chosen to calculate it (i.e., visual
inspection versus Excel formulas using extracted data).
Finally, examining reliability and validity of UnGraph procedures was done by
making comparisons between: (1) raw and extracted data, (2) two independent coders,
and (3) findings and data presented in research articles (i.e., means and ranges) and
UnGraph results. The average correlation between the originally reported and extracted
data (average baseline and treatment phases) for the nine articles in this study was r = .99, p < .001.
Also, 100% agreement was found between the original authors and the researcher of this
work concerning the expected number of data points to collect for these two studies.
Two coders were used to assess the accuracy of UnGraph, providing 97.70% agreement
on the number of data points to extract. The overall average correlation between the two
coders was r = .9957, and the correlation for the reported means/percentages/intervals was r = .99, p < .01.
Recommendations
Although this work is based on a small number of studies, it appears that visual
analyses yield more conservative results when compared to what is gleaned from
HLM procedures. This is consistent with Raudenbush and Bryk’s (2002) stance that
estimation procedures for small sample sizes may yield biased estimates and it is
recommended to research variance-covariance matrices available for repeated designs.
While visual analysis is quick, efficient, and cost beneficial, quantification efforts
can be used to re-examine visual analysis results and provide additional Level-2 subject
characteristic information. Of course, quantification of single-case studies is tedious,
time consuming, and requires the knowledge of several software applications and
estimation procedures. Overdispersion continues to be an ongoing issue in SCDs. In
general, it is recommended that further research concerning the different estimation
procedures available, beyond the options in Nagler and colleagues’ (2008) manual,
be the focus of any further work in SCDs. These other estimation options are
available in software like Stata and would bypass estimation limitations in HLM6. Stata
may provide better parameter estimation and model fit for smaller samples.
Level-2 information could be analyzed both by qualitative and quantitative means
(Hitchcock, Nastasi & Summerville, 2010). However, quantification may provide new
information and prompt future work. Maggin and colleagues (2011a) stated that some
researchers did not provide subject characteristics or reported only limited ones. Similarly,
some authors in this work did not report the sex of the child, disability status, or any
information pertaining to the data collectors (i.e., gender of the teachers/primary researcher,
years of experience). This reiterates the need for collection of more descriptive data, since
it may not be the behavior of the child influencing results, but characteristics of the child themselves.
Effect size estimates in SCDs, while popular for describing associations, were
found to be worrisome in this study. It should be noted that not all original authors
reported all effect size estimates, and this study included only nine articles. Perhaps a team
should be used in calculating effect size estimates (possibly using more than one estimate
to verify results). Again, not all effect size estimates in this dissertation were exactly
what the original authors found, which affected conclusions about effectiveness for
individual participants. It would be beneficial to check one effect size calculation against
another, given that more than one method is available. It is recommended to use the PND
or IRD for effect size estimates
until a better estimate is developed for SCDs.
Lastly, this work agrees with a call by McDougall and colleagues (2011) to
require authors to provide raw data when publishing SCD results. This would alleviate
hesitation concerning UnGraph and other extraction tools, and allow closer approximations
for model-building purposes.
References
Abell, P. (2009). History, case studies, statistics, and causal inference. European
Sociological Review, 25(5), 561-567.
Agresti, A. (1996). An introduction to categorical data analysis. Gainesville, FL: John
Wiley & Sons.
Allison, D. B., & Gorman, B. S. (1994). Making things as simple as possible, but no
simpler: A rejoinder to Scruggs and Mastropieri. Behaviour Research and
Therapy, 32, 885–890. doi: 10.1016/0005-7967(94)90170-8
Amato-Zech, N. A., Hoff, K. E., & Doepke, K. J. (2006). Increasing on-task behavior
in the classroom: Extension of self-monitoring strategies. Psychology in the
Schools, 43(2), 211-221.
American Psychological Association. (2002). American Psychologist, 57, 1052-1059.
doi: 10.1037/0003-066X.57.12.1052
Baer, D. M. (1977). Perhaps it would be better not to know everything. Journal of
Applied Behavior Analysis, 10(1), 167-172.
Beeson, P. M., & Robey, R. R. (2006). Evaluating single-subject treatment research:
Lessons learned from the aphasia literature. Neuropsychology Review, 16(4), 161-169.
Biosoft (2004). UnGraph for Windows (Version 5.0). Cambridge. U.K.: Author.
Bowman-Perrott, L., Greenwood, C. R., & Tapia, Y. (2007). The efficacy of CWPT used
in secondary alternative school classrooms with small teacher/pupil ratios and
students with emotional and behavioral disorders. Education & Treatment of
Children (West Virginia University Press), 30(3), 65-87.
Brand, A., Bradley, M. T., Best, L. A., & Stoica, G. (2011). Multiple trials may yield
exaggerated effect size estimates. The Journal of General Psychology, 138(1), 1-11.
Brandstätter, E. (1999). Confidence intervals as an alternative to significance testing.
Methods of Psychological Research Online, 4(2), 33-46. Retrieved from
http://www.dgps.de/fachgruppen/methoden/mpronline/issue7/art2/brandstaetter.pdf
Brossart, D. F., Parker, R. I., Olson, E. A., & Mahadevan, L. (2006). The relationship
between visual analysis and five statistical analyses in a simple AB single-case
research design. Behavior Modification, 30, 531-563.
doi:10.1177/0145445503261167
Bryk, A. S., Raudenbush, S. W., Congdon, R. T., & Seltzer, M. (1988). An introduction
to HLM: Computer program and user’s guide [Computer software]. Chicago, IL:
University of Chicago.
Busk, P. L., & Marascuilo, L. A. (1992). Statistical analysis in single-subject research:
Issues, procedures, and recommendations, with application to multiple behaviors. In
T. Kratochwill, & J. Levin (Eds.), Single-case research design and analysis: New
directions for psychology and education (pp. 159-185). Hillsdale, NJ: Lawrence
Erlbaum.
Busk, P. L., & Serlin, R. C. (1992). Meta-analysis for single-case research. In T.
Kratochwill, & J. Levin, (Eds.), Single-case research design and analysis: New
directions for psychology and education (pp. 187-212). Hillsdale, NJ: Erlbaum.
Carver, R. (1978). The case against statistical significance testing. Harvard Educational
Review, 48(1), 378-399.
Cochrane Collaboration (2006). Cochrane handbook for systematic reviews of
interventions. Retrieved March 19, 2012, from
http://www.cochrane.org/index_authors_researchers.htm
Coe, R. (2002, September). It’s the effect size, stupid: What effect size is and why it is
important. Paper presented at the Annual Conference of the British Educational
Research Association, England.
Cohen, J. (1992). A power primer. Psychological Bulletin, 112, 155–159.
doi:10.1037/0033-2909.112.1.155
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.).
Hillsdale, NJ: Lawrence Erlbaum Associates.
Cohen, J. (1969). Statistical power analysis for the behavioral sciences. New York:
Academic Press.
Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45(12), 1304-1312.
Cole, C. L., & Levinson, T. R. (2002). Effects of within-activity choices on the
challenging behavior of children with severe developmental disabilities. Journal of
Positive Behavior Interventions, 4(1), 29-37.
Crosbie, J. (1987). The inability of the binomial test to control type I error with single
subject data. Behavioral Assessment, 9(2), 141-150.
Danov, S. E., & Symons, F. J. (2008). A survey evaluation of the reliability of visual
inspection and functional analysis graphs. Behavior Modification, 32(6), 828-839.
DiCarlo, C. F., & Reid, D. H. (2004). Increasing pretend toy play of toddlers with
disabilities in an inclusive setting. Journal of Applied Behavior Analysis, 37(2), 197-207.
Edgington, E. S. (1987). Randomized single-subject experiments and statistical tests.
Journal of Counseling Psychology, 34(4), 437-442.
Edgington, E. S. (1996). Randomized single-subject experimental designs. Behavior
Research and Therapy, 34(7), 567-574.
Edgington, E. S. (1995). Randomization tests: Revised and expanded (3rd ed.). New
York: Marcel Dekker.
Fox, J. (1991). Regression diagnostics. Newbury Park, California: Sage.
Garson, G. D. (2012). Hierarchical linear modeling: Guide and applications. North
Carolina: Sage.
Ghahfarokhi, M. A. B., Iravani, H., & Sepehri, M. R. (2008). Application of Katz family
of distributions for detecting and testing overdispersion in Poisson regression
models. World Academy of Science: Engineering & Technology, 44(1), 544-549.
Grissom, R. J., & Kim, J. J. (2005). Effect sizes for research: A broad practical
approach. Mahwah, NJ: Erlbaum.
Hedges, L. V. (1981). Distribution theory for Glass's estimator of effect size and related
estimators. Journal of Educational Statistics, 6, 107–128.
doi:10.3102/10769986006002107
Hitchcock, J. H., Nastasi, B. K., & Summerville, M. (2010). Single case designs and
qualitative methods: Applying a mixed methods research perspective. Mid-Western Educational Researcher, 23(2), 49-58.
Horner, R. H., Carr, E. G., Halle, J., McGee, G., Odom, S., & Wolery, M. (2005). The
use of single subject research to identify evidence based practices in special
education. Exceptional Children, 71(2), 165-179.
Howell, D. C. (2009, March 7). Permutation tests for factorial ANOVA designs.
Retrieved from
http://www.uvm.edu/~dhowell/StatPages/More_Stuff/Permutation%20Anova/Per
mTestsAnova.html
Hunt, P., Soto, G., Maier, J., & Doering, K. (2003). Collaborative teaming to support
students at risk and students with severe disabilities in general education
classrooms. Exceptional Children, 69(3), 315-332.
Individuals with Disabilities Education Act. (2012).
http://www.ode.state.oh.us
Jenson, W. R., Clark, E., Kircher, J. C., & Kristjansson, S. D. (2007). Statistical reform:
Evidence-based practice, meta-analyses, and single subject designs. Psychology
in the Schools, 44(5), 483-493.
Kauffman, J. M., & Landrum, T. J. (2009). Characteristics of emotional and behavioral
disorders of children and youth (9th ed.). Upper Saddle River, NJ:
Prentice Hall.
Kauffman, J. M., & Lloyd, J. W. (1995). A sense of place: The importance of placement
issues in contemporary special education. In J. Kauffman, J. Lloyd, D. Hallahan,
& T. Astuto (Eds.), Issues in educational placement: Students with emotional and
behavioral disorders (pp. 3-19). Mahwah, NJ: Lawrence Erlbaum Associates.
Kazdin, A. E. (1992). Research designs in clinical psychology (2nd ed.). Boston: Allyn
& Bacon.
Kirk, R. E. (1996). Practical significance: A concept whose time has come. Educational
and Psychological Measurement, 56(5), 746-759.
Kratochwill, T. R., & Brody, G. H. (1978). Single subject designs: A perspective on the
controversy over employing statistical inference and implications for research and
training in behavior modification. Behavior Modification, 2, 291-307. doi:
10.1177/002221947901200407
Kratochwill, T. R., Hitchcock, J., Horner, R. H., Levin, J. R., Odom, S. L., Rindskopf, D. M.,
& Shadish, W. R. (2010). SCDs technical documentation. Retrieved from What
Works Clearinghouse website: http://ies.ed.gov/ncee/wwc/pdf/wwc_scd.pdf.
Kratochwill, T. R., & Levin, J. R. (2010). Enhancing the scientific credibility of single-case intervention research: Randomization to the rescue. Psychological Methods,
15(2), 124-144.
Kratochwill, T. R., & Stoiber, K. C. (2002). Evidence-based interventions in school
psychology: Conceptual foundations of the procedural and coding manual of
division 16 and the society for the study of school psychology task force. School
Psychology Quarterly, 17(4), 341-389.
Krishef, C. H. (1991). Fundamental approaches to single subject design and analysis.
Malabar, FL: Krieger.
Kromrey, J., & Foster-Johnson, L. (1996). Determining the efficacy of intervention: the
use of effect sizes for data analysis in single-subject research. The Journal of
Experimental Education, 65, 73-93. doi: 10.1080/00220973.1996.9943464
Lambert, M., Cartledge, G., Heward, W. L., & Lo, Y. (2006). Effects of response cards
on disruptive behavior and academic responding during math lessons by fourth-grade urban students. Journal of Positive Behavior Interventions, 8(2), 88-99.
Lazar, N. A. (2004). A short survey on causal inference, with implications for context of
learning studies of second language acquisition. Studies in Second Language
Acquisition, 26, 329-347. doi:10.1017/S0272263104262088
Liu, X., & Raudenbush, S. (2004). A note on the noncentrality parameter and effect size
estimates for the F test in ANOVA. Journal of Educational and Behavioral
Statistics, 29(2), 251-255.
Maggin, D. M., Chafouleas, S. M., Goddard, K. M., & Johnson, A. H. (2011). A
systematic evaluation of token economies as a classroom management tool for
students with challenging behavior. Journal of School Psychology, 49(5), 529-554.
Maggin, D. M., O’Keeffe, B. V., & Johnson, A. H. (2011). A quantitative synthesis of
methodology in the meta-analysis of single-subject research for students with
disabilities: 1985-2009. Exceptionality, 19, 109-135. doi:
10.1080/09362835.2011.565725
Maggin, D. M., Swaminathan, H., Rogers, H. J., O'Keefe, B. V., Sugai, G., & Horner, R.
H. (2011). A generalized least squares regression approach for computing effect
sizes in single-case research: Application examples. Journal of School Psychology, 49(3), 301-321.
Manolov, R., Arnau, J., Solanas, A., & Bono, R. (2010). Regression-based techniques
for statistical decision making in single-case designs. Psicothema, 22(4), 1026-1032.
Mavropoulou, S., Papadopoulou, E., & Kakana, D. (2011). Effects of task organization
on the independent play of students with autism spectrum disorders. Journal of
Autism and Developmental Disorders, 41(7), 913-925.
McDougall, D., Narkon, D., & Wells, J. (2011). The case for listing raw data articles
describing single-case interventions: How long must we wait? Education, 132(1),
149-163.
McDougall, D., Skouge, J., Farrell, C. A., & Hoff, K. (2006). Research on self-management techniques used by students with disabilities in general education
settings: A promise fulfilled? Journal of the American Academy of Special
Education Professionals, 1(2), 36-73.
Mitchell, C., & Hartmann, D. P. (1981). A cautionary note on the use of omega squared
to evaluate the effectiveness of behavioral treatments. Behavioral Assessment, 3,
93-100.
Morgan, D. L., & Morgan, R. K. (2009). Single-case research methods for the behavioral
and health sciences. Thousand Oaks, CA: Sage.
Murphy, K. A., Theodore, L. A., Aloiso, D., Alric-Edwards, J. M., & Hughes, T. L.
(2007). Interdependent group contingency and mystery motivators to reduce
preschool disruptive behavior. Psychology in the Schools, 44(1), 53-63.
Nagler, E., Rindskopf, D., & Shadish, W. (2008). Analyzing data from small N designs
using multilevel models: A procedural handbook (Grant No. 75588-00-01). U.S.
Department of Education.
Orme, J. G., & Cox, M. E. (2001). Analyzing single-subject design data using statistical
process control charts. Social Work Research, 25(2), 115-127.
Ottenbacher, K. J. (1990). When is a picture worth a thousand p values? A comparison of
visual and quantitative methods to analyze single subject data. Journal of Special
Education, 23, 436-449. doi: 10.1177/002246699002300407
Parker, R., & Brossart, D. (2003). Evaluating single-case research data: A comparison
of seven statistical methods. Behavior Therapy, 34(2), 189-211. doi:
10.1016/S0005-7894(03)80013-8
Parker, R., Brossart, D., Vannest, K., Long, J., Garcia De-Alba, R., Baugh, F.
G., & Sullivan, J. (2005). Effect sizes in single-case research: How large is large?
School Psychology Review, 34(1), 116-132.
Parker, R. I., Cryer, J., & Byrns, G. (2006). Controlling baseline trend in single-case
research. School Psychology Quarterly, 21(4), 418-444.
Parker, R. I., Vannest, K. J., & Brown, L. (2009). The improvement rate difference for
single-case research. Exceptional Children, 75(2), 135-150.
Parsonson, B. S., & Baer, D. M. (1986). The graphic analysis of data. In A. Poling, & R.
Fuqua (Eds.), Research methods in applied behavior analysis: Issues and
advances (pp. 157-186). New York: Plenum Press.
Pearl, J. (1995). Causal diagrams for empirical research. Biometrika,
82(4), 669-688.
Pinheiro, J., & Bates, D. (1995). Approximations to the log-likelihood function in the
nonlinear mixed-effects model. Journal of Computational and Graphical Statistics,
4(1), 12-35.
Ramsey, M. L., Jolivette, K., Puckett-Patterson, D., & Kennedy, C. (2010). Using choice
to increase time on-task, task-completion, and accuracy for students with
emotional/behavior disorders in a residential facility. Education and Treatment of
Children, 33(1), 1-21.
Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and
data analysis (2nd ed.). Thousand Oaks, CA: Sage.
Raudenbush, S. W., Bryk, A. S., & Congdon, R. (2004). HLM 6 for Windows (Version
6) [Computer software]. Lincolnwood, IL: Scientific Software International, Inc.
Raudenbush, S. W., & Liu, X. (2000). Statistical power and optimal design for multisite
randomized trials. Psychological Methods, 5(3), 199-213.
Restori, A. F., Gresham, F. M., Chang, T., Howard, L. B., & Laija-Rodriquez, W. (2007).
Functional assessment-based interventions for children at-risk for emotional and
behavioral disorders. California School Psychologist, 12(1), 9-30.
Rosnow, R., & Rosenthal, R. (1989). Statistical procedures and the justification of
knowledge in psychological science. American Psychologist, 44, 1276-1284. doi:
10.1037/0003-066X.44.10.1276
Rousson, V., Gasser, T., & Seifert, B. (2002). Assessing intrarater, interrater and test–
retest reliability of continuous measurements. Statistics in Medicine, 21(22),
3431-3446.
Sackett, D. L., Richardson, W. S., Rosenberg, W., & Haynes, R. B. (1997).
Evidence-based medicine: How to practice and teach EBM. London: Churchill
Livingstone.
Scheffé, H. (1959). The analysis of variance. New York: Wiley.
Schmidt, F. L., & Hunter, J. E. (1997). Eight common but false objections to the
discontinuation of significance testing in the analysis of research data. In L.
Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests?
(pp. 37-64). Hillsdale: Lawrence Erlbaum.
Scruggs, T. E., & Mastropieri, M. A. (2001). How to summarize single-participant
research: Ideas and application. Exceptionality, 9(4), 227–244.
Scruggs, T. E., & Mastropieri, M. A. (1998). Synthesizing single-subject research: Issues
and applications. Behavior Modification, 22(1), 221-242.
Segool, N. K., Brinkman, T. M., & Carlson, J. S. (2007). Enhancing accountability in
behavioral consultation through the use of SCDs. International Journal of
Behavioral Consultation and Therapy, 3(2), 310-321.
Sekhon, J. S. (2010). The statistics of causal inference in the social sciences [PDF
document]. Retrieved from Lecture Notes Online Web site:
http://sekhon.berkeley.edu/causalinf/causalinf.print.pdf
Shadish, W. R., Brasil, I. C. C., Illingworth, D. A., White, K., Galindo, R., Nagler, E. D.,
& Rindskopf, D. M. (2009). Using UnGraph® to Extract Data from Image Files:
Verification of Reliability and Validity. Behavior Research Methods, 41(1), 177-183.
Shadish, W., Cook, T., & Campbell, D. (2002). Experimental and quasi-experimental
designs for generalized causal inference. Boston: Houghton Mifflin.
Shadish, W. R., Rindskopf, D. M., & Hedges, L. V. (2008). The state of the science in
the meta-analysis of single-case experimental designs (Grant No.
H324U05000106). Washington DC: Institute of Education Science.
Shadish, W. R., Sullivan, K. J., Hedges, L., & Rindskopf, D. (2010). A d-estimator for
single-case designs (Grant No. R305D100046 and R305D100033). University
of California: Institute for Educational Sciences.
Sidman, M. (1960). Tactics of scientific research: Evaluating experimental data in
psychology. New York: Basic Books.
Singer, J. D. & Willett, J. B. (2003). Applied longitudinal data analysis: Modeling
change and event occurrence. New York: Oxford University Press.
Skrondal, A., & Rabe-Hesketh, S. (2007). Redundant overdispersion parameters in
multilevel models for categorical responses. Journal of Educational and
Behavioral Statistics, 32(4), 419-430.
Snijders T., Bosker, R., & Guldemond, H. (2007). PinT: The program PINT for
determining sample sizes in two-level modeling Version (2.12) [Computer
software]. Retrieved from: http://stat.gamma.rug.nl/multilevel.htm
Steiger, J. H., & Fouladi, R. T. (1997). Noncentrality interval estimation and the
evaluation of statistical models. In L. Harlow, S. Mulaik, & J. Steiger (Eds.),
What if there were no significance tests? (pp. 221-257). Hillsdale: Lawrence
Erlbaum.
Stuart, R. B. (1967). Behavioral control of overeating. Behavior Research and Therapy, 5(4),
357-365.
Suen, H. K., & Ary, D. (1987). Autocorrelation in applied behavior analysis: Myth or
reality? Behavioral Assessment, 9, 125-130.
Swanson, H. L., & Sachse-Lee, C. (2000). A meta-analysis of single-subject-design
intervention research for students with LD. Journal of Learning Disabilities, 33(2),
114–136.
Swoboda, C. M., Kratochwill, T. R., & Levin, J. R. (2010). Conservative dual-criterion
method for single-case research: A guide for visual analysis of AB, ABAB, and multiple-baseline designs (WCER Working Paper No. 2010-13). Retrieved from University of
Wisconsin–Madison, Wisconsin Center for Education Research website:
http://www.wcer.wisc.edu/publications/workingPapers/papers.php
Tate, R. L., McDonald, S., Perdices, M., Togher, L., Schultz, R., & Savage, S. (2008).
Rating the methodological quality of single-subject designs and n-of-1 trials:
Introducing the single-case experimental design (SCED) scale. Neuropsychological
Rehabilitation, 18, 385-401. doi:10.1080/09602010802009201
Theodore, L. A., Bray, M. A., Kehle, T. J., & Jenson, W. R. (2001). Randomization of
group contingencies and reinforcers to reduce classroom disruptive behavior.
Journal of School Psychology, 39(3), 267-77.
U.S. Department of Education. (2008). Analyzing data from small N designs using
multilevel models: A procedural handbook (Grant No. 75588-00-01).
Van den Noortgate, W., & Onghena, P. (2003). Hierarchical linear models for the
quantitative integration of effect sizes in single-case research. Behavior Research
Methods, Instruments, & Computers, 35(1), 1-10.
Van den Noortgate, W., & Onghena, P. (2008). A multilevel meta-analysis of single-subject experimental design studies. Evidence-Based Communication Assessment
and Intervention, 2(3), 142–151.
Vannest, K. J., Parker, R. I., & Gonan, O. (2011). Single Case Research: web based
calculators for SCR analysis (Version 1.0) [Web-based application]. College
Station, TX: Texas A & M University.
Waldmann, M. R. (2000). Competition among causes but not effects in predictive and
diagnostic learning. Journal of Experimental Psychology, 26, 53-76. doi:
10.1037//0278-7393.26.1.53
White, D. M., Rusch, F. R., Kazdin, A. E., & Hartmann, D. P. (1989). Applications of
meta-analysis in individual subject research. Behavioral Assessment, 11, 281-296.
Williamson, B. D., Campbell-Whatley, G., & Lo, Y. (2009). Using a random dependent
group contingency to increase on-task behaviors of high school students with high
incidence disabilities. Psychology in the Schools, 46(10), 1074-1083.
Wolery, M., Busick, M., Reichow, B., & Barton, E. E. (2010). Comparison of overlap
methods for quantitatively synthesizing single subject data. The Journal of
Special Education, 44(1), 18-28.
WWC (n.d.). What Works Clearinghouse. Retrieved from http://ies.ed.gov/ncee/wwc/.
WWC Evidence review protocol (2011). Interventions for children classified as having
an emotional disturbance. Retrieved from
http://ies.ed.gov/ncee/wwc/pdf/reference_resources/wwc_ebd_protocol_v2.pdf
Appendix A: Tables
Table 20
List of Effect Sizes from the Original Article Compared to Calculated Effect Sizes
Using the PND, SMD, and R2
Citation
Amato-Zech, N. A. Hoff,
K.E.; Doepke, Karla J.
(2006). Increasing on-task
behavior in the classroom:
extension of self-monitoring
strategies. Psychology in
the Schools, 43(2), 211-221.
Cole, C.L. & Levinson,
T.R. (2002). Effects of
within-activity choices
on the challenging
behavior of children
with developmental
disabilities. Journal of
Positive Behavior
Interventions, 4(1), 29-37.
Lambert, M., Cartledge, G.,
Heward, W. L., & Lo, Y.
(2006). Effects of response
cards on disruptive behavior
and academic responding
during math lessons by
fourth-grade urban students.
Journal of Positive Behavior
Interventions, 8(2), 88-99.
Mavropoulou, S.,
Papadopoulou, E.,
& Kakana, D. (2011).
Effects of task
organization on the
independent play of
students with autism
DV
On-task
Behavior
ID Reported Effect Size
PND
Jack
David
Allison
100%
100%
93.75%
PND
SMD
R2
100%
4.5
100%
3.3
93.75% 2.8
.51
.25
Percents/Means
Percent
of Task
Analysis
Steps
Keith’s
range:
A1
B1
A2
B2
7.7% - 61.5%
0% - 30.8%
n/a
16%
51.66% 1.42
Wally’s
range:
A1
B1
A2
B2
14.3%-81.8%
0%-30.8%
n/a
n/a
28.60% .96
PND
A1
Disruptive A2
Behavior A3
A4
B1
B2
B3
B4
B5
92.8%
100%
n/a
n/a
94.1%
100%
94.4%
n/a
n/a
77.78%
100%
81.25%
75.71%
61.36%
100%
94.4%
0%
33.33%
3.04
5.80
2.70
2.80
2.57
3.70
3.90
1.05
1.27
50%
30%
45%
1.27
.19
1.32
.59
PND
Vaggelis
On-task
Teacher prompt
Performance
50%
n/a
45%
.34
.01
.39
Table 20 (Continued)
Citation
DV
ID
spectrum disorders.
Journal of Autism and
Developmental Disorders,
41(7), 913-925.
Yiannis
On-task
Teacher prompt
Performance
SMD
Murphy, K. A., Theodore,
L.A., Aloiso, D. Alric-Edwards,
S1
J.M., & Hughes, T.L. (2007).
S2
Interdependent group
Disruptive S3
contingency and mystery
Behavior S4
motivators to reduce
S5
preschool disruptive behavior.
S6
Psychology in the Schools.
S7
44(1), 53-63.
S8
Ramsey, M.L., Jolivette,
K., Puckett-Patterson, D.,
& Kennedy, C. (2010).
Using choice to increase
time on-task, task-completion, and accuracy
for students with
emotional/behavior
disorders in a residential
facility. Education and
Treatment of Children,
33(1), 1-21.
Reported Effect Size
R2
PND
SMD
75%
n/a
70%
75%
40%
60%
.74
2.15
1.70
7.71
3.04
2.36
2.06
1.58
1.59
.99
2.64
50%
58.3%
42.85%
0%
87.5%
35.7%
35.7%
100%
2.88
1.99
1.34
1.13
1.33
1.16
.61
1.98
.21
1.24
1.48
1.60
.33
.34
.22
Percents
Abby
On-task
Task Complete
Accuracy
0% complete
0%
0% complete
0%
90% most positive point 75%
Sara
On-task
Task Complete
Accuracy
B2 100%
94.44% 1.80
B2 100%
100%
6.40
50% most positive point 100%
6.90
Trey
On-task
Task Complete
Accuracy
A1 No data exceeded
A1 33.33%
A1 66.7%
50%
.92
77.77% 2.3
72.22% .21
Chris
On-task
Task Complete
Accuracy
40%
100%
71%
66.4%
100%
70.7%
Katie
On-task
Task Complete
Accuracy
60% Most positive point 18.33% 1.80
33%
31.66% 2.40
13%
30%
3.11
3.64
11.6
5.40
Table 20 (Continued)
Citation
DV
ID Reported Effect Size PND
Mean
Restori, A.F., Gresham,
F.M., Chang,T., Howard
L.B. & Laija-Rodriquez,
W., (2007). Functional
assessment-based
interventions for children
at-risk for emotional
and behavioral disorders.
California School
Psychologist, 12, 9-30.
Overall
Academic
Engagement
Overall
Disruptive
Behavior
Theodore, L.A., Bray,
M.A., Kehle, T.J., & Jenson,
W.R. (2001). Randomization Disruptive
of group contingencies
Behavior
and reinforcers to reduce
classroom disruptive behavior.
Journal of School Psychology,
39(3), 267-77.
Disruptive /
Academic
A1
B1
A2
B2
22.35
86.90
37.41
83.08
A1
A2
A3
A4
100% / 100%
100% / 100%
100% / 100%
100% / 100%
A1
B1
A2
B2
59.53
6.77
38.31
6.49
C1 100% / 92.8%
C2 0% / 50%
C3 91.6% / 91.6%
C4 100% / 91.6%
SMD
R2
Disruptive/
Academic
2.08 / 3.87 .77
2.84 / 4.22
3.71 / 20.74
2.43 / 3.42
1.52 / 3.93
1.51 / 2.27
2.56 / 3.68
4.8 / 4.91
.63
100%
100%
100%
dropped
dropped
5.3
4.4
2.8
dropped
dropped
.78
40.4%
76.4%
61.1%
55.55%
50.0%
27.78%
1.83
2.37
30.5
2.27
.90
1.31
SMD
S1
S2
S3
S4
S5
5.2
4.7
2.6
3.8
4.2
Mean Percentages
Williamson, B. D.,
Campbell-Whatley, G.,
& Lo, Y. (2009). Using
a random dependent group
contingency to increase
on-task behaviors of high
school students with high
incidence disabilities.
Psychology in the Schools,
46(10), 1074-1083.
On-task
Behavior
A1 43.4%
B1 83.7%
A2 63.3%
B2 66.9%
S1
S2
S3
S4
S5
S6
.29
DV = Dependent Variable; ID = person identification; PND = Percentage of non-overlapping data; SMD =
Standardized Mean Difference; R2 = proportion of variance in the dependent variable accounted for by the
independent variable; n/a = not available from the original author(s)
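To make the indices reported in Table 20 concrete, the sketch below shows one way PND, the no-assumptions SMD, and a phase-membership R2 can be computed for a single baseline-to-treatment contrast in Python. The function names and the example scores are illustrative assumptions, not the software or the data used for the analyses reported in this dissertation.

import statistics

def pnd(baseline, treatment, increase=True):
    # Percentage of treatment-phase points that exceed the most extreme
    # baseline point (above the baseline maximum when improvement means
    # an increase, below the baseline minimum when it means a decrease).
    if increase:
        exceed = sum(1 for y in treatment if y > max(baseline))
    else:
        exceed = sum(1 for y in treatment if y < min(baseline))
    return 100 * exceed / len(treatment)

def smd(baseline, treatment):
    # Busk and Serlin (1992) no-assumptions standardized mean difference:
    # (treatment mean - baseline mean) / baseline standard deviation.
    return ((statistics.mean(treatment) - statistics.mean(baseline))
            / statistics.stdev(baseline))

def r_squared(baseline, treatment):
    # Proportion of outcome variance associated with phase membership:
    # the squared correlation between a 0/1 phase indicator and the outcome.
    y = list(baseline) + list(treatment)
    x = [0] * len(baseline) + [1] * len(treatment)
    return statistics.correlation(x, y) ** 2   # requires Python 3.10+

# Hypothetical on-task percentages for one A-B contrast (not study data).
a_phase = [53, 48, 50, 55]
b_phase = [79, 82, 85, 90]
print(pnd(a_phase, b_phase), smd(a_phase, b_phase), r_squared(a_phase, b_phase))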
Table 21
List of Effect Sizes from the Original Article Compared to Calculated Effect Sizes Using
the IRD
Citation
Amato-Zech, N. A. Hoff,
K.E.; Doepke, Karla J.
(2006). Increasing on-task
behavior in the classroom:
extension of self-monitoring
strategies. Psychology in
the Schools, 43(2), 211-221.
Cole, C.L. & Levinson,
T.R. (2002). Effects of
within-activity choices
on the challenging
behavior of children
with developmental
disabilities. Journal of
Positive Behavior
Interventions, 4(1), 29-37.
Lambert, M., Cartledge, G.,
Heward, W. L., & Lo, Y.
(2006). Effects of response
cards on disruptive behavior
and academic responding
during math lessons by
fourth-grade urban students.
Journal of Positive Behavior
Interventions, 8(2), 88-99.
Mavropoulou, S.,
Papadopoulou, E.,
& Kakana, D. (2011).
Effects of task
organization on the
independent play of
students with autism
spectrum disorders.
Journal of Autism and
Developmental Disorders,
41(7), 913-925.
DV
ID
Reported Effect Size
IRD
PND
On-task
Behavior
Jack
David
Allison
100%
100%
93.75%
Jack
David
Allison
100%
100%
93.75%
Percents/Means
Percent
of Task
Analysis
Steps
Keith’s
range:
A1
B1
A2
B2
7.7% - 61.5%
0% - 30.8%
n/a
16%
Keith
54.15%
Wally’s
range:
A1
B1
A2
B2
14.3%-81.8%
0%-30.8%
n/a
n/a
Wally
69.00%
A1
A2
A3
A4
B1
B2
B3
B4
B5
93.75%
100%
81.25%
75.7%
87.85%
100%
94.4%
54.45%
59.95%
A1
Disruptive A2
Behavior A3
A4
B1
B2
B3
B4
B5
PND
92.8%
100%
n/a
n/a
94.1%
100%
94.4%
n/a
n/a
PND
Vaggelis
On-task
Teacher prompt
Performance
50%
n/a
45%
Vaggelis
On-task
70.45%
Prompt
45%
Performance 36.35%
Yiannis
On-task
Teacher prompt
Performance
75%
n/a
70%
Yiannis
On-task 75%
Prompt
85.45%
Performance 55.45%
Table 21 (Continued)
Citation
DV
ID
SMD
Murphy, K. A., Theodore,
L.A., Aloiso, D. Alric-Edwards,
S1
J.M., & Hughes, T.L. (2007).
S2
Interdependent group
Disruptive S3
contingency and mystery
Behavior S4
motivators to reduce
S5
preschool disruptive behavior.
S6
Psychology in the Schools.
S7
44(1), 53-63.
S8
Ramsey, M.L., Jolivette,
K., Puckett-Patterson, D.,
& Kennedy, C. (2010).
Using choice to increase
time on-task, task-completion, and accuracy
for students with
emotional/behavior
disorders in a residential
facility. Education and
Treatment of Children,
33(1), 1-21.
Reported Effect Size
7.71
3.04
2.36
2.06
1.58
1.59
.99
2.64
IRD
S1
S2
S3
S4
S5
S6
S7
S8
65.4%
91.65%
84.5%
77.09%
93.75%
59.82%
75%
100%
Percents
Abby
On-task
Task Complete
Accuracy
Abby
0% complete
On-task 87.5%
0% complete
Complete 90.2%
90% most positive point Accuracy 83.93%
Sara
On-task
Task Complete
Accuracy
Sara
B2 100%
On-task 91.65%
B2 100%
Complete 100%
50% most positive point Accuracy 77.85%
Trey
On-task
Task Complete
Accuracy
A1 No data exceeded
A1 33.33%
A1 66.7%
Trey
On-task 44.65%
Complete 30.90%
Accuracy 20%
Chris
On-task
Task Complete
Accuracy
40%
100%
71%
Chris
On-task 94.45%
Complete 100%
Accuracy 100%
Katie
On-task
Task Complete
Accuracy
Katie
60% Most positive point On-task 96.15%
33% 31.66%
Complete 92.31%
13%
Accuracy 83.3%
Table 21 (Continued)
Citation
DV
ID Reported Effect Size
Mean
Restori, A.F., Gresham,
F.M., Chang,T., Howard
L.B. & Laija-Rodriquez,
W., (2007). Functional
assessment-based
interventions for children
at-risk for emotional
and behavioral disorders.
California School
Psychologist, 12, 9-30.
Overall
Academic
Engagement
Overall
Disruptive
Behavior
IRD
Academic
Engagement
Disruptive
Behavior
A1
B1
A2
B2
22.35
86.90
37.41
83.08
A1 100%
A2 100%
A3 100%
A4 100%
A1 100%
A2 100%
A3 100%
A4 100%
A1
B1
A2
B2
59.53
6.77
38.31
6.49
C1 100%
C2 93.75%
C3 91.65%
C4 91.65%
C1 100%
C2 72.96%
C3 91.65%
C4 100%
______________________________________________________________________________________
Theodore, L.A., Bray,
M.A., Kehle, T.J., & Jenson,
W.R. (2001). Randomization Disruptive
of group contingencies
Behavior
and reinforcers to reduce
classroom disruptive behavior.
Journal of School Psychology,
39(3), 267-77.
SMD
S1
S2
S3
S4
S5
5.2
4.7
2.6
3.8
4.2
S1
S2
S3`
S4
S5
100%
100%
100%
dropped
dropped
Mean Percentages
Williamson, B. D.,
Campbell-Whatley, G.,
& Lo, Y. (2009). Using
a random dependent group
contingency to increase
on-task behaviors of high
school students with high
incidence disabilities.
Psychology in the Schools,
46(10), 1074-1083.
On-task
Behavior
A1 43.4%
B1 83.7%
A2 63.3%
B2 66.9%
S1
S2
S3
S4
S5
S6
71.43%
93.75%
72.2%
88.89%
55.5%
61.1%
DV = Dependent Variable; ID = person identification; PND = Percentage of non-overlapping data; IRD =
Improvement Rate Difference; n/a = not available from the original author(s)
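Table 21 reports the improvement rate difference (IRD) for the same contrasts. The sketch below is one reasonable reading of the Parker, Vannest, and Brown (2009) procedure: find the cut score that requires the fewest removals to eliminate overlap between phases, then take the difference between the two improvement rates. Tie handling and the removal rule vary across published calculators, so treat this as an illustration rather than the exact routine behind the values above.

def ird(baseline, treatment, increase=True):
    # Improvement rate difference for one A-B contrast, sketched by brute
    # force over candidate cut scores. Baseline points at or above the cut
    # are treated as "improved" baseline points; treatment points below the
    # cut are treated as not improved.
    if not increase:                      # flip so that higher is always better
        baseline = [-y for y in baseline]
        treatment = [-y for y in treatment]
    best = None
    for cut in sorted(set(baseline) | set(treatment)):
        removed_b = sum(1 for y in baseline if y >= cut)
        removed_t = sum(1 for y in treatment if y < cut)
        if best is None or removed_b + removed_t < sum(best):
            best = (removed_b, removed_t)
    removed_b, removed_t = best
    ir_treatment = (len(treatment) - removed_t) / len(treatment)
    ir_baseline = removed_b / len(baseline)
    return 100 * (ir_treatment - ir_baseline)

# Hypothetical disruptive-behavior counts (not study data).
print(ird([8, 6, 7, 9], [2, 3, 1, 4], increase=False))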
Table 22
A Comparison of the Reported Data to the Extracted Data

Amato-Zech and colleagues (2006), On-task Behavior, percentage of intervals (mean):
Jack: Reported a1 = 53%, b1 = 79%, a2 = 74%, b2 = 91%; Extracted a1 = 53%, b1 = 79%, a2 = 73%, b2 = 91%
David: Reported a1 = 55%, b1 = 79%, a2 = 76%, b2 = 93%; Extracted a1 = 54%, b1 = 78%, a2 = 75%, b2 = 93%
Allison: Reported a1 = 56%, b1 = 89%, a2 = 84%, b2 = 96%; Extracted a1 = 56%, b1 = 90%, a2 = 85%, b2 = 96%

Cole and colleagues (2002), Percent of Task-Analysis Steps, range:
Keith: Reported a1 = 7.7, b1 = 0 - 30.8, a2 = n/a, b2 = n/a; Extracted a1 = 8 - 61.4, b1 = 0 - 31, a2 = 15 - 92.4, b2 = 7 - 30.4
Wally: Reported a1 = 14.3 - 81.8, b1 = 0 - 30.8, a2 = n/a, b2 = n/a; Extracted a1 = 14.2 - 80, b1 = 0 - 30, a2 = 0 - 33.2, b2 = 0 - 21

Lambert and colleagues (2006), Disruptive Behavior, mean number of disruptive behaviors:
Reported SSR/Baseline = 6.8, RC/Treatment = 1.3; Extracted SSR/Baseline = 6.83, RC/Treatment = 1.63

Mavropoulou and colleagues (2011):
On-task: Reported Baseline = 63.75, Treatment = 84.35; Extracted Baseline = 61.77, Treatment = 83.74
Prompting: Reported Baseline = 18.28, Treatment = 16.47; Extracted Baseline = 17.66, Treatment = 16.24
Performance: Reported Baseline = n/a, Treatment = n/a; Extracted Baseline = 31.66, Treatment = 61.14

Murphy and colleagues (2007), Disruptive Behavior, mean intervals of disruptive behavior:
Reported Baseline = 20.59, Treatment = 4.29; Extracted Baseline = 18.26, Treatment = 3.76

Ramsey and colleagues (2010), mean intervals:
On-task: Reported Baseline = 38.98, Treatment = 80.75; Extracted Baseline = 37.53, Treatment = 78.72
Task Complete: Reported Baseline = 21.67, Treatment = 68.23; Extracted Baseline = 21.27, Treatment = 65.09
Accuracy: Reported Baseline = 16.55, Treatment = 47.26; Extracted Baseline = 14.99, Treatment = 45.12

Restori and colleagues (2007), mean intervals:
Disruptive Behavior: Reported Baseline = 48.92, Treatment = 6.63; Extracted Baseline = 50.81, Treatment = 5.49
Academic Engagement: Reported Baseline = 29.88, Treatment = 84.99; Extracted Baseline = 27.52, Treatment = 86.78

Theodore and colleagues (2001), Disruptive Behavior, mean intervals:
Reported Baseline = 38.67, Treatment = 3.33; Extracted Baseline = 41.64, Treatment = 2.82

Williamson and colleagues (2009), On-task Behavior, mean intervals:
Reported Baseline = 52.9, Treatment = 75.18; Extracted Baseline = 53.35, Treatment = 75.3

n/a = not available from the original author(s)
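One way to read Table 22 is as a check on extraction fidelity: the closer the means recovered from the UnGraph-digitized points fall to the authors' reported means, the more confidence the later effect size and HLM results warrant. The short sketch below summarizes such discrepancies as raw and percentage differences; the four pairs are copied from the table, and the script itself is an illustration rather than part of the original analysis.

pairs = {                                  # (reported mean, extracted mean)
    "Murphy baseline": (20.59, 18.26),
    "Murphy treatment": (4.29, 3.76),
    "Theodore baseline": (38.67, 41.64),
    "Theodore treatment": (3.33, 2.82),
}
for label, (reported, extracted) in pairs.items():
    diff = extracted - reported
    print(f"{label}: difference = {diff:+.2f} ({100 * diff / reported:+.1f}%)")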
Table 23
List of Population Type, Independent Variables, Dependent Variables, and Reported
Effect Sizes
Population
Identifier
Independent
Variable
Dependent
Variable
Effect
Sizes
Amato-Zech, N. A. Hoff,
Both
K.E.; Doepke, Karla J.
(2006). Increasing on-task
behavior in the classroom:
extension of self-monitoring
strategies. Psychology in
the Schools, 43(2), 211-221.
Emotional
Disturbances
(E/BD)
Self-Monitoring
On-task
Behavior
PND
Cole, C.L. & Levinson,
Low
T.R. (2002). Effects of
within-activity choices
on the challenging
behavior of children
with severe
developmental disabilities.
Journal of Positive Behavior
Interventions, 4(1), 29-37.
Cognitive
Delay (CD)
Choice/No Choice
Percent of
Task Analysis
Steps
Percents
Lambert, M., Cartledge, G., N/A
Heward, W. L., & Lo, Y.
(2006). Effects of response
cards on disruptive behavior
and academic responding
during math lessons by
fourth-grade urban students.
Journal of Positive Behavior
Interventions, 8(2), 88-99.
At-risk
Response Cards
Disruptive
Behavior
Means
Mavropoulou, S.,
Papadopoulou, E.,
& Kakana, D. (2011).
Effects of task
organization on the
independent play of
students with autism
spectrum disorders.
Journal of Autism and
Developmental Disorders,
41(7), 913-925.
Autism (ASD) Visual prompting
On-task
behavior
PND
Citation
Population
Incidence
High
Specific Learning
Disability (SLD)
Teacher Prompt
Performance
Table 23 (Continued)
Population
Identifier
Independent
Variable
Dependent
Variable
Effect
Sizes
Murphy, K. A., Theodore, High
L.A., Aloiso, D.
Alric-Edwards,
J.M., & Hughes, T.L.
(2007). Interdependent
group contingency and
mystery motivators
to reduce preschool
disruptive behavior.
Psychology in the Schools.
44(1), 53-63.
At-risk
Group Reward
System
Disruptive
Behavior
SMD
Ramsey, M.L., Jolivette,
Both
K., Puckett-Patterson, D.,
& Kennedy, C. (2010).
Using choice to increase
time on-task, task-completion,
and accuracy for students
with emotional/behavior
disorders in a residential
facility. Education and
Treatment of Children,
33(1), 1-21.
E/BD
Choice / No Choice On-task
behavior
Restori, A.F., Gresham,
N/A
F.M., Chang,T., Howard
L.B. & Laija-Rodriquez, W.,
(2007). Functional assessment
-based interventions for
Children at-risk for emotional
and behavioral disorders
California School Psychologist,
12, 9-30.
At-risk
Theodore, L.A., Bray,
High
M.A., Kehle, T.J., & Jenson,
W.R. (2001). Randomization
of group contingencies and
reinforcers to reduce classroom
disruptive behavior. Journal
of School Psychology, 39(3),
267-77.
SE/BD
Citation
Population
Incidence
PND
Task-completion
Accuracy
Self-Monitoring
Academic
Achievement
Percents
Disruptive
Behavior
Group Reward
System
Disruptive
Behavior
SMD
Table 23 (Continued)
Citation
Population
Incidence
Williamson, B. D.,
Both
Campbell-Whatley, G.,
& Lo, Y. (2009). Using
a random dependent group
contingency to increase
on-task behaviors of high
school students with high
incidence disabilities.
Psychology in the Schools,
46(10), 1074-1083.
Population
Identifier
Independent
Variable
Dependent
Variable
Effect
Sizes
Multiple
OHI
E/BD
SLD
Group Reward
System
On-task
Behavior
Percents
PND = Percentage of non-overlapping data; SMD = Standardized mean difference, no assumptions model;
SE/BD = Severe Emotional and Behavior Disorder; OHI = Other Health Impaired, n/a = not available from
the original author(s)
Table 24
List of Level-2 Variables Used in the Exploratory Analyses
Study
Level-2 Variables / Definition
(exploratory analyses)
Significant Level-2
Variables
(1) Amato-Zech, Hoff
& Doepke (2006).
Increasing On-Task
Behavior in the
Classroom: Extension
of Self-Monitoring
Strategies.
Age
Disability
Gender
N/A
N/A
N/A
(2) Cole & Levinson
(2002).
Effects of Within-Activity
Choices on the Challenging
Behavior of Children
with Severe Developmental
Disabilities.
Age
Disability
Gender
N/A
N/A
N/A
(3) Lambert, Cartledge,
Heward, & Lo
(2006).
Effects of Response Cards
on Disruptive Behavior
and Academic Responding
during Math Lessons by
Fourth-Grade Urban Students.
Age
ClassB / classroom a child attended
Race
Gender
Pre-grade / math grade prior to study
N/A
- 0.90**
N/A
N/A
N/A
(4) Mavropoulou,
Papadopoulou,
& Kakana (2011).
Effects of Task
Organization
on the Independent
Play of Students with
Autism Spectrum Disorders.
Age
Disability
Gender
IQ / Using Weschler
Intelligence Scale for Children
Speech Therapy / therapy or no therapy
N/A
N/A
N/A
N/A
N/A
N/A
Table 24 (Continued)
Study
Level-2 Variables / Definition
(exploratory analyses)
Significant Level-2
Variables
Age
N/A
N/A
2.06**
N/A
N/A
(6) Ramsey, Jolivette,
Puckett-Patterson,
& Kennedy (2010).
Using Choice to Increase
Time On-Task
Completion,
and Accuracy for
Students with Emotional/
Behavior Disorders in
a Residential Facility.
Age
Disability
Race
GAF Score / Global Assessment
of functioning
Gender
Grade Level
N/A
N/A
N/A
N/A
(7) Restori, Gresham,
Chang, Lee, & Laija-Rodriquez
(2007).
Functional Assessment-Based Interventions
for Children At-Risk
for Emotional and
Behavioral Disorders.
Academic Achievement
Disruptive Behavior
Gender
Intervention Type /
antecedent-based
or consequent-based
(8) Theodore, Bray,
Kehle, & Jenson (2001).
Randomization of Group
Contingencies and Reinforcers
to Reduce Classroom
Disruptive Behavior.
DBS / delinquent behavior scale
Disability
Race
Gender
N/A
N/A
N/A
N/A
(9) Williamson,
Campbell- Whatley,
& Lo (2009).
Using a Random
Dependent Group
Contingency to Increase
Race
Student 3 / dummy coded
N/A
N/A
(5) Murphy, Theodore,
Alric-Edwards, &
Hughes (2007).
Interdependent Group
Contingency and Mystery
Motivators to Reduce
Preschool Disruptive
Behavior.
Gender
OnTrack / to first grade
Student 2 / dummy coded
Student 8 / dummy coded
N/A
N/A
N/A
disruptive behavior (dv) = 1.40*
academic achievement (dv) = -1.90*
Table 24 (Continued)
Study
Level-2 Variables / Definition
(exploratory analyses)
Significant Level-2
Variables
On-task Behaviors
of High School Students
with High Incidence Disabilities.
* p < .05, ** p < .01; N/A = study contained no significant Level-2 variables, dv = dependent variable, n/a
= not available from the original author(s)
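Table 24 lists the participant-level (Level-2) variables screened in the exploratory analyses and flags the few that reached significance. For readers who want to see the shape of such a model in code, the sketch below specifies a two-level model with a random treatment effect and a cross-level interaction using Python's statsmodels as a stand-in. The file name and the age covariate are hypothetical, and this is not the software used for the dissertation analyses.

import pandas as pd
import statsmodels.formula.api as smf

# Long-format data: one row per session, with p = participant, s = session,
# b = outcome, t = phase indicator (0 = baseline, 1 = treatment), plus a
# participant-level covariate such as age (column names are assumptions).
df = pd.read_csv("abab_long.csv")

# Level 1: outcome as a function of the treatment indicator within participant.
# Level 2: random intercept and treatment slope across participants, with a
# cross-level interaction asking whether the treatment effect varies with age.
model = smf.mixedlm("b ~ t + age + t:age", data=df,
                    groups=df["p"], re_formula="~t")
result = model.fit()
print(result.summary())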
Appendix B: Graphs, Extracted Data from SPSS and Excel with Codes
“All graphs were reproduced by the dissertation author using extracted, digitized data.”
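The graph placeholders in this appendix stand in for ABAB panels that were redrawn from the digitized values. As an illustration of how one such panel can be reproduced from a long-format series, the sketch below plots a short, made-up run of sessions by phase; it is not the procedure used to produce the original figures.

import pandas as pd
import matplotlib.pyplot as plt

# Made-up session values for a single participant (b = percent of intervals
# on task, t = 0 for baseline sessions and 1 for treatment sessions).
df = pd.DataFrame({
    "s": [1, 2, 3, 4, 5, 6, 7, 8],
    "b": [52, 48, 55, 50, 79, 85, 88, 91],
    "t": [0, 0, 0, 0, 1, 1, 1, 1],
})

fig, ax = plt.subplots()
for phase, grp in df.groupby("t"):
    ax.plot(grp["s"], grp["b"], marker="o",
            label="Baseline" if phase == 0 else "Treatment")
ax.set_xlabel("Session")
ax.set_ylabel("Percent of intervals of on-task behavior")
ax.legend()
plt.show()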
Amato-Zech, Hoff, & Doepke (2006).
[Graphs: percent of intervals of on-task behavior by session, one ABAB panel each for Jack, David, and Allison.]
Patient 1 = Jack: Age 11; Gender 0 = m; Disability 1 = SLD (Specific Learning Disability)
Patient 2 = David: Age 11; Gender 0 = m; Disability 1 = SLD
Patient 3 = Allison: Age 11; Gender 1 = f; Disability 0 = ED (Emotionally Disturbed)
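The listings that follow give each digitized series in long format, one row per observed session, using the codes shown in each key (for this study, p = patient, s = session, b = behavior value, t = treatment phase). A minimal sketch of how phase means can be recovered from that layout is shown below; the values in the small data frame are made up for illustration, and the phase means compared in Table 22 were computed from the full listings, not from this script.

import pandas as pd

# Long-format toy example mirroring the p / s / b / t layout of the listings.
df = pd.DataFrame({
    "p": [1, 1, 1, 1, 1, 1],
    "s": [1, 2, 3, 4, 5, 6],
    "b": [52.0, 49.5, 47.0, 80.5, 86.0, 91.5],
    "t": [0, 0, 0, 1, 1, 1],
})

# Baseline and treatment means per participant.
print(df.groupby(["p", "t"])["b"].mean())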
Patient = p
Session = s
p
s
b
t
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
2.00
3.00
4.00
5.00
6.00
7.00
8.00
9.00
10.00
15.00
17.00
18.00
20.00
22.00
24.00
26.00
50.21
47.41
47.01
49.84
49.85
61.10
52.27
57.10
56.71
57.52
67.61
70.44
67.64
65.24
80.12
84.96
84.98
0
0
0
0
0
0
0
0
0
0
1
1
1
1
1
1
1
Behavior = b Treatment = t
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
27.00
28.00
29.00
30.00
31.00
32.00
33.00
34.00
35.00
36.00
37.00
38.00
39.00
41.00
42.00
43.00
44.00
1.00
2.00
3.00
4.00
5.00
6.00
7.00
8.00
9.00
11.00
12.00
15.00
17.00
18.00
20.00
22.00
23.00
24.00
25.00
27.00
29.00
30.00
31.00
32.00
33.00
34.00
35.00
94.62
93.03
80.18
77.78
78.20
71.78
75.00
66.57
64.98
88.28
85.08
91.11
90.72
89.93
91.55
94.37
99.60
54.88
61.74
42.81
49.67
63.40
57.85
51.98
47.09
57.55
59.85
52.34
76.86
67.41
69.70
76.90
79.53
77.58
81.18
84.45
92.63
78.93
79.91
81.55
74.38
71.44
72.11
69.50
1
1
0
0
0
0
0
0
0
1
1
1
1
1
1
1
1
0
0
0
0
0
0
0
0
0
0
0
1
1
1
1
1
1
1
1
1
0
0
0
0
0
0
0
2.00
2.00
2.00
2.00
2.00
2.00
2.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
36.00
37.00
38.00
39.00
42.00
43.00
44.00
2.00
3.00
5.00
6.00
7.00
8.00
9.00
10.00
11.00
12.00
13.00
14.00
15.00
16.00
18.00
19.00
20.00
21.00
23.00
24.00
25.00
26.00
27.00
28.00
29.00
30.00
31.00
32.00
33.00
34.00
35.00
36.00
37.00
38.00
39.00
40.00
42.00
86.49
94.34
92.71
94.68
92.41
92.42
97.00
51.42
63.72
38.23
55.37
68.12
45.26
55.37
58.01
67.68
56.69
55.81
60.21
55.37
55.37
78.66
73.39
75.15
86.13
97.56
95.80
98.00
94.04
99.32
99.32
91.41
91.41
87.45
83.94
83.06
78.66
75.59
98.00
98.00
97.56
98.44
88.33
97.56
1
1
1
1
1
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
1
1
1
1
1
1
1
1
1
0
0
0
0
0
0
0
1
1
1
1
1
1
3.00
3.00
43.00 95.36 1
44.00 98.00 1
Cole & Levinson (2002).
[Graphs: percentage of task steps with challenging behavior by session, one ABAB panel each for Keith and Wally.]
Patient 1 = Keith: Gender 0 = m
Patient 2 = Wally: Gender 0
Disability coding: CD = 0, other = 1
Patient = p
Session = s
p
s
b
t
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
2.00
1.00
2.00
3.00
4.00
5.00
6.00
7.00
8.00
9.00
10.00
11.00
12.00
13.00
14.00
15.00
16.00
17.00
18.00
19.00
20.00
21.00
22.00
23.00
24.00
25.00
26.00
27.00
28.00
29.00
1.00
16.66
7.95
14.90
61.41
23.34
38.51
31.04
13.37
.00
.00
20.06
15.57
.00
28.49
15.55
86.43
15.03
38.40
92.38
22.96
7.04
12.75
14.23
12.73
12.97
30.37
20.17
9.46
8.96
19.93
0
0
0
0
0
0
1
1
1
1
1
1
1
1
1
0
0
0
0
1
1
1
1
1
1
1
1
1
1
0
Behavior = b Treatment = t
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
3.00
4.00
5.00
6.00
7.00
8.00
9.00
10.00
11.00
12.00
13.00
14.00
15.00
16.00
17.00
18.00
19.00
20.00
21.00
22.00
23.00
24.00
25.00
26.00
27.00
28.00
29.00
30.00
31.00
32.00
37.19
80.01
42.92
36.82
14.23
35.42
35.21
52.46
7.99
14.92
.00
30.00
8.15
20.73
8.23
13.68
14.94
.00
15.02
15.06
20.51
21.04
30.43
33.17
8.13
7.67
7.71
21.03
13.69
.00
.00
0
0
0
0
0
0
0
0
1
1
1
1
1
1
1
0
0
0
0
0
0
0
0
0
1
1
1
1
1
1
1
Lambert, Cartledge, Heward, & Lo (2006).
[Graphs: number of intervals of disruptive behavior by session, one ABAB panel each for participants a1, a2, a3, a4, b1, b2, b3, b4, and b5.]
ID: a1, a2, a3, a4, b5, b6, b7, b8, b9
Gender: 1 = f, 1, 0 = m, 0, 0, 1, 1, 1, 0
Age (year.month): 9.7, 9.4, 9.10, 9.4, 10.2, 10.1, 10.8, 9.5, 10.1
Race: 0 = AA, 1 = white, 0, 0, 0, 0, 0, 0, 0
Class: 0 = class A, 0, 0, 0, 1 = class B, 1, 1, 1, 1
Math grade prior to the study (4.0 scale): 1.5, 3, 1, .5, 1, 2, .5, 1.49, 3
Patient = p
Session = s
p
s
b
t
1
1
1
1
1
1
1
1
1
.79
1.97
2.85
3.74
4.62
5.70
6.69
7.57
8.56
7.35
9.26
8.24
6.47
7.50
4.56
5.44
10.00
3.09
0
0
0
0
0
0
0
0
1
Behavior = b Treatment = t
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
9.64
11.61
12.49
13.48
14.36
15.34
16.33
17.31
18.20
19.28
20.26
21.15
22.23
23.21
24.10
25.08
26.07
27.05
27.93
29.11
30.10
.81
1.83
3.86
4.88
6.00
6.91
7.83
8.84
9.96
10.88
11.99
13.01
13.92
14.94
15.86
16.97
17.89
18.90
20.02
20.94
21.85
22.97
23.78
1.18
2.21
1.18
1.32
3.82
8.24
8.24
6.62
10.15
10.15
10.00
8.24
3.82
4.71
2.06
3.82
2.94
4.85
1.18
2.06
1.03
8.18
7.12
7.27
8.18
6.36
7.27
8.94
3.64
1.82
.76
4.39
.91
.91
8.03
9.09
10.00
7.27
8.94
10.00
8.18
10.00
1.82
1.82
1
1
1
1
0
0
0
0
0
0
0
0
1
1
1
1
1
1
1
1
1
0
0
0
0
0
0
0
1
1
1
1
1
1
0
0
0
0
0
0
0
0
1
1
2
2
2
2
2
2
2
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
4
4
4
4
4
4
4
4
4
4
4
24.80
25.92
27.04
27.95
28.87
30.09
31.00
1.01
3.04
4.86
5.88
6.89
7.90
9.83
10.94
11.95
12.87
13.88
14.89
15.80
16.82
18.94
19.96
20.87
21.88
22.79
23.81
24.82
25.83
26.95
27.96
28.87
29.89
.89
2.93
3.85
4.86
5.88
6.89
7.91
8.83
9.95
10.87
11.99
.91
5.45
3.64
6.36
.91
.76
2.73
10.31
6.41
6.25
9.22
6.25
10.16
.78
1.56
1.72
.78
.78
5.31
7.34
10.16
5.31
10.00
9.22
10.31
4.38
6.41
5.31
7.19
.78
.78
.78
1.72
9.86
6.28
4.50
8.08
8.09
9.14
9.89
3.62
6.32
.95
.95
1
1
1
1
1
1
1
0
0
0
0
0
0
1
1
1
1
1
0
0
0
0
0
0
0
1
1
1
1
1
1
1
1
0
0
0
0
0
0
0
1
1
1
1
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
13.92
14.83
15.84
16.75
18.88
19.90
20.81
21.84
22.85
23.88
24.89
25.92
28.86
29.98
31.00
.95
1.69
2.78
3.77
4.74
5.83
6.81
7.89
8.87
9.72
10.83
11.93
13.03
13.88
14.86
15.96
16.93
17.78
18.87
19.84
20.69
21.79
22.89
24.00
24.74
25.83
26.92
28.02
28.88
1.86
3.65
8.14
9.93
9.94
10.10
10.10
5.48
6.38
1.91
5.64
1.17
1.03
1.04
1.94
10.35
6.56
9.50
4.67
5.37
9.51
6.76
10.56
9.36
9.53
4.88
3.68
4.89
4.90
1.97
.77
3.71
5.78
8.54
10.62
10.45
10.63
6.50
3.91
.82
2.72
4.79
2.04
1.01
1
0
0
0
0
0
0
0
1
1
1
1
1
1
1
0
0
0
0
0
0
0
0
0
0
1
1
1
1
1
1
0
0
0
0
0
0
0
1
1
1
1
1
1
5
5
5
5
5
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
7
7
7
7
7
7
7
7
7
7
7
7
29.97
30.94
32.05
33.02
34.00
.85
1.83
2.68
5.73
6.95
7.92
8.90
9.99
10.97
11.94
13.04
14.01
16.82
17.91
18.89
19.86
21.94
22.91
24.74
25.84
26.93
27.91
28.88
29.98
31.93
33.03
34.12
.73
2.56
5.73
6.82
7.80
8.90
9.87
10.85
11.94
13.04
13.77
14.87
1.88
3.78
.85
1.89
1.21
7.26
4.52
5.16
7.10
8.39
4.52
8.23
8.06
.97
.81
.81
.81
5.32
7.10
6.29
4.52
6.29
5.48
.97
.97
.97
2.58
.97
1.13
.97
1.13
1.13
6.33
6.50
8.00
9.00
10.17
9.00
8.00
1.00
1.83
2.67
1.83
1.67
1
1
1
1
1
0
0
0
0
0
0
0
0
1
1
1
1
0
0
0
0
0
0
1
1
1
1
1
1
1
1
1
0
0
0
0
0
0
0
1
1
1
1
1
7
7
7
7
7
7
7
7
7
7
7
7
7
7
7
7
7
8
8
8
8
8
8
8
8
8
8
8
8
8
8
8
8
8
8
8
8
8
8
8
8
8
8
8
15.84
16.82
17.79
18.89
19.86
20.84
21.81
22.79
23.76
24.74
25.71
27.78
28.88
29.86
30.95
32.90
34.00
.87
1.83
2.81
3.90
4.87
5.85
6.94
7.91
8.99
9.84
10.93
13.00
13.97
14.94
15.80
16.89
17.87
18.96
19.94
21.02
21.86
22.71
24.05
24.90
25.88
26.84
27.94
.83
2.67
4.67
4.67
5.50
8.33
8.50
7.50
2.00
.83
2.67
1.67
.67
1.83
.50
1.83
.83
8.36
2.12
4.74
6.70
6.37
7.51
8.33
8.32
1.27
2.74
.93
1.25
1.24
2.71
6.65
5.49
6.64
5.65
8.59
4.65
1.04
2.68
1.85
2.83
6.60
1.18
2.65
1
0
0
0
0
0
0
0
1
1
1
1
1
1
1
1
1
0
0
0
0
0
0
0
0
0
0
1
1
1
1
1
0
0
0
0
0
0
0
1
1
1
1
1
8
8
8
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
9
28.90
29.88
30.97
.97
1.70
2.67
3.89
4.86
5.95
6.80
8.01
8.74
9.96
10.93
11.90
12.87
13.96
14.94
16.03
17.85
18.94
19.91
20.89
21.86
22.83
23.92
25.86
26.84
27.93
29.02
29.87
30.96
32.06
33.03
33.88
1.01
1.82
1.82
9.17
5.50
4.50
2.50
3.67
10.17
4.50
10.33
8.33
8.17
.50
2.67
1.67
3.50
1.00
.67
3.50
.83
2.50
7.50
7.50
2.83
.83
1.67
.67
2.50
2.67
4.33
.83
1.00
1.67
1.67
1
1
1
0
0
0
0
0
0
0
0
0
0
1
1
1
1
1
1
0
0
0
0
0
0
1
1
1
1
1
1
1
1
1
1
Mavropoulou, Papadopoulou, & Kakana (2011).
[Graphs: percentage of intervals by session for Vaggelis and Yiannis; panels show on-task behavior (both participants), performance behavior (both participants), and prompting behavior (Yiannis).]
ID: Patient 1 = Vaggelis, Patient 2 = Yiannis
Gender: 0 = Male, 0
IQ: 71, 82
Age: 7.6 years, 7
Disability: 0 = ASD (mild), 1 = other
Speech Service: 0 = two hrs, 1 = 4.2 hrs
Variable coding: Patient = p; Session = s; On-task = b1; Performance = b2; Prompt = b3; Treatment = t
p
s
b1
b2
b3
t
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
1.00
2.00
3.00
4.00
5.00
6.00
7.00
8.00
9.00
10.00
11.00
12.00
13.00
14.00
15.00
16.00
17.00
18.00
19.00
20.00
21.00
22.00
23.00
24.00
25.00
26.00
27.00
28.00
29.00
30.00
31.00
1.00
2.00
3.00
4.00
5.00
6.00
7.00
8.00
9.00
10.00
11.00
64.42
21.80
21.79
23.79
69.42
59.68
28.79
59.66
67.03
67.02
55.94
63.98
66.32
91.14
82.07
87.43
60.24
95.46
95.45
93.09
89.06
62.20
75.28
77.28
66.53
50.42
92.35
86.30
90.65
70.17
83.92
20.30
65.44
38.46
80.57
75.72
85.10
66.91
64.78
82.95
80.82
82.94
24.34
10.71
.29
2.44
15.81
22.51
21.18
35.89
23.07
30.56
20.95
43.15
38.34
57.60
54.67
32.49
33.57
99.62
99.90
46.43
71.03
35.75
38.17
71.33
24.01
28.30
56.65
73.77
64.42
58.28
75.67
39.12
27.14
15.65
39.61
27.63
33.74
21.27
32.27
24.45
48.66
28.12
16.77
25.49
12.39
20.77
32.51
34.51
16.71
17.04
19.04
18.70
16.67
25.72
19.00
32.41
16.63
12.26
27.69
16.60
14.24
10.21
9.86
20.59
16.22
16.21
14.18
25.58
12.49
11.81
12.47
34.27
20.84
23.94
12.71
6.34
13.00
12.38
14.50
9.94
24.17
14.77
14.16
23.54
0
0
0
0
0
0
0
0
0
0
0
1
1
1
1
1
1
1
1
1
1
0
0
0
0
0
1
1
1
1
1
0
0
0
0
0
0
0
0
0
0
0
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
12.00
13.00
14.00
15.00
16.00
17.00
18.00
19.00
20.00
21.00
22.00
23.00
24.00
25.00
26.00
27.00
28.00
29.00
30.00
31.00
87.17
80.19
91.09
82.60
84.41
84.70
84.69
73.47
73.46
80.42
56.17
75.86
76.15
80.08
73.41
82.79
93.39
80.35
93.98
91.85
84.60
41.81
25.18
68.22
54.52
85.82
79.71
63.33
33.50
72.13
67.73
25.92
68.22
49.14
59.17
92.42
65.04
54.28
60.39
47.68
21.71
35.04
28.06
19.26
12.28
15.61
1.96
16.80
8.01
10.12
10.41
30.40
10.09
10.38
10.98
4.00
3.69
13.07
10.64
10.33
1
1
1
1
1
1
1
1
1
1
0
0
0
0
0
1
1
1
1
1
Murphy, Theodore, Alric-Edwards, & Hughes (2007).
[Graphs: percentage of disruptive intervals by session, one ABAB panel each for Students 1 through 8.]
ID: 1, 2, 3, 4, 5, 6, 7, 8
Age: 4, 3, 4, 4, 4, 5, 4, 3
Gender: 1 = f, 0 = m, 1, 1, 1, 0, 1, 0
On-track (to kindergarten): 0 = on-track, 1 = not on track, 0, 0, 0, 0, 0, 1
Patient = p
Session = s
p
s
b
t
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
1.00
3.00
4.00
10.00
11.00
12.00
13.00
15.00
16.00
19.00
20.00
21.00
22.00
23.00
24.00
26.00
27.00
28.00
29.00
30.00
31.00
32.00
33.00
1.00
2.00
3.00
4.00
5.00
7.00
8.00
10.00
11.00
12.00
15.00
16.00
19.00
20.00
21.00
22.00
23.00
5.88
8.40
9.24
.00
2.52
2.52
.00
.00
.00
1.68
.00
1.68
11.76
3.36
1.68
.00
.00
.00
.00
2.52
.00
.00
.00
37.99
39.59
37.88
39.48
31.16
60.81
64.06
1.15
26.71
9.31
4.19
4.97
4.81
4.76
8.84
55.06
19.48
0
0
0
1
1
1
1
1
1
0
0
0
0
0
0
1
1
1
1
1
1
1
1
0
0
0
0
0
0
0
1
1
1
1
1
0
0
0
0
0
Behavior = b Treatment = t
2.00
2.00
2.00
2.00
2.00
2.00
2.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
4.00
4.00
4.00
4.00
4.00
4.00
4.00
4.00
24.00
26.00
27.00
28.00
29.00
31.00
32.00
1.00
2.00
3.00
4.00
5.00
6.00
7.00
8.00
10.00
11.00
12.00
13.00
14.00
15.00
16.00
19.00
20.00
21.00
22.00
23.00
24.00
26.00
27.00
28.00
29.00
30.00
31.00
32.00
33.00
1.00
2.00
3.00
4.00
5.00
6.00
7.00
8.00
15.29
3.62
.00
5.16
4.29
1.70
1.65
22.69
39.50
26.05
21.85
15.13
19.33
12.61
23.53
5.88
6.72
2.52
15.97
10.08
4.20
5.88
10.08
5.88
9.24
.00
9.24
15.97
.84
4.20
3.36
.00
3.36
.00
2.52
.00
12.48
9.30
8.48
9.98
9.16
.00
7.50
10.57
0
1
1
1
1
1
1
0
0
0
0
0
0
0
0
1
1
1
1
1
1
1
0
0
0
0
0
0
1
1
1
1
1
1
1
1
0
0
0
0
0
0
0
0
4.00
4.00
4.00
4.00
4.00
4.00
4.00
4.00
4.00
4.00
4.00
4.00
4.00
4.00
4.00
4.00
4.00
4.00
4.00
4.00
4.00
5.00
5.00
5.00
5.00
5.00
5.00
5.00
5.00
5.00
5.00
5.00
5.00
5.00
5.00
5.00
5.00
5.00
5.00
5.00
5.00
5.00
5.00
5.00
10.00
11.00
12.00
13.00
15.00
16.00
17.00
19.00
20.00
21.00
22.00
23.00
24.00
26.00
27.00
28.00
29.00
30.00
31.00
32.00
33.00
1.00
2.00
3.00
4.00
5.00
6.00
7.00
8.00
10.00
11.00
12.00
13.00
14.00
15.00
16.00
17.00
19.00
20.00
21.00
22.00
23.00
24.00
26.00
.00
2.62
2.57
2.52
1.65
.00
.00
1.45
.00
2.92
.00
1.26
2.77
.00
.00
.00
.00
.92
.00
.00
.00
13.07
17.66
13.79
19.15
10.66
12.95
39.84
20.59
3.62
5.90
8.19
6.63
1.99
12.73
6.55
5.00
7.26
8.77
12.59
27.95
21.78
7.91
.00
1
1
1
1
1
1
1
0
0
0
0
0
0
1
1
1
1
1
1
1
1
0
0
0
0
0
0
0
0
1
1
1
1
1
1
1
1
0
0
0
0
0
0
1
5.00
5.00
5.00
5.00
5.00
5.00
5.00
6.00
6.00
6.00
6.00
6.00
6.00
6.00
6.00
6.00
6.00
6.00
6.00
6.00
6.00
6.00
6.00
6.00
6.00
6.00
6.00
6.00
6.00
6.00
6.00
6.00
6.00
6.00
6.00
6.00
7.00
7.00
7.00
7.00
7.00
7.00
7.00
7.00
27.00
28.00
29.00
30.00
31.00
32.00
33.00
1.00
2.00
3.00
4.00
5.00
6.00
7.00
8.00
10.00
11.00
12.00
13.00
15.00
16.00
17.00
19.00
20.00
21.00
22.00
23.00
24.00
26.00
27.00
28.00
29.00
30.00
31.00
32.00
33.00
1.00
2.00
3.00
4.00
5.00
6.00
7.00
8.00
.00
5.50
2.40
.00
2.35
.79
3.08
10.69
16.03
4.58
9.16
6.87
7.63
12.21
15.27
.00
2.29
8.40
4.58
9.92
3.82
2.29
3.05
6.11
6.87
9.92
6.11
3.05
2.29
.00
.00
6.11
.00
2.29
3.05
4.58
17.45
59.76
6.31
10.53
3.63
8.38
20.00
20.51
1
1
1
1
1
1
1
0
0
0
0
0
0
0
0
1
1
1
1
1
1
1
0
0
0
0
0
0
1
1
1
1
1
1
1
1
0
0
0
0
0
0
0
0
7.00
7.00
7.00
7.00
7.00
7.00
7.00
7.00
7.00
7.00
7.00
7.00
7.00
7.00
7.00
7.00
7.00
7.00
7.00
7.00
7.00
8.00
8.00
8.00
8.00
8.00
8.00
8.00
8.00
8.00
8.00
8.00
8.00
8.00
8.00
8.00
8.00
8.00
8.00
8.00
8.00
8.00
8.00
8.00
10.00
11.00
12.00
13.00
15.00
16.00
17.00
19.00
20.00
21.00
22.00
23.00
24.00
26.00
27.00
28.00
29.00
30.00
31.00
32.00
33.00
1.00
2.00
3.00
4.00
5.00
10.00
11.00
12.00
13.00
15.00
16.00
17.00
19.00
20.00
21.00
22.00
23.00
24.00
26.00
27.00
28.00
29.00
30.00
3.02
.00
2.46
.00
.00
.00
.00
.00
.00
1.25
20.81
3.34
.00
.00
.00
.00
.00
.00
.00
.00
.00
51.89
61.68
96.60
87.80
48.96
19.20
20.79
12.55
29.98
16.22
6.88
16.67
20.94
28.54
48.16
62.86
60.08
19.59
4.19
.32
3.00
17.16
12.19
1
1
1
1
1
1
1
0
0
0
0
0
0
1
1
1
1
1
1
1
1
0
0
0
0
0
1
1
1
1
1
1
1
0
0
0
0
0
0
1
1
1
1
1
8.00
8.00
31.00 2.30
32.00 4.99
1
1
Ramsey, Jolivette, Puckett-Patterson, & Kennedy (2010).
[Graphs: ABAB panels by session for Abby, Chris, Katie, Sara, and Trey showing percentage of time on-task, percentage of task-completion, and percentage of accuracy.]
Person: P1 Abby, P2 Chris, P3 Katie, P4 Sara, P5 Trey
Age: 14, 16, 13, 15, 15
Grade: 7, 9, 7, 8, 9
Gender: 1, 0, 1, 1, 0
Race: 1 = other, 0 = white, 0 = white, 1 = other, 1 = other
Disability: 0 = Y, 0, 0, 0, 0
GAF Score*: 65 = 1, 55 = 0, 20 = 0, 55 = 0, 65 = 1
*The Global Assessment of Functioning Score is used to determine the lowest impaired behavior (health/illness) for the child.
Variable coding: Patient = p; Session = s; On-task = b1; Task-completion = b2; Accuracy = b3; Treatment = t
p
s
b1
b2
b3
t
1.00
1.00
1.00
1.00
1.00
1.00
1.00
2.00
3.00
4.00
5.00
6.00
57.74
53.11
33.96
80.52
99.34
53.11
69.62
59.06
.28
89.43
99.67
60.05
49.48
49.81
.28
75.24
89.76
41.23
0
0
0
0
0
0
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
7.00
8.00
9.00
10.00
11.00
12.00
13.00
14.00
15.00
16.00
17.00
18.00
19.00
20.00
21.00
22.00
23.00
24.00
25.00
26.00
27.00
28.00
29.00
30.00
31.00
32.00
33.00
1.00
2.00
3.00
4.00
5.00
6.00
7.00
8.00
9.00
10.00
11.00
12.00
13.00
14.00
15.00
16.00
17.00
99.01 89.10 74.25 0
82.83 49.81 .61
0
90.09 74.58 70.28 1
99.67 99.67 90.42 1
99.67 100.00 95.05 1
100.00 100.00 75.24 1
100.00 99.67 89.76 1
100.00 100.00 95.05 1
99.67 99.67 99.67 1
100.00 99.67 80.52 1
100.00 100.00 75.24 1
100.00 99.67 95.05 1
96.37 99.67 100.00 1
100.00 100.00 95.05 1
99.67 100.00 90.42 1
90.42 99.67 99.67 1
26.70 .61
.00
0
66.32 48.82 25.38 0
63.02 49.81 25.71 0
59.72 .61
.00
0
99.67 99.01 69.62 1
70.61 74.91 70.28 1
76.56 74.58 50.14 1
83.16 74.91 77.00 1
91.08 100.00 89.43 1
76.23 74.91 85.47 1
95.71 90.09 95.71 1
60.31 .00
.00
0
40.63 12.81 .00
0
59.69 .00
.00
0
30.31 9.38 8.75 0
.00
.00
.00
0
41.56 17.50 .00
0
64.69 49.69 .00
1
90.31 65.00 19.69 1
60.94 54.69 .00
1
95.31 69.06 .00
1
90.94 74.69 29.38 1
88.13 74.38 29.69 1
90.00 77.19 50.00 1
80.00 65.31 50.31 1
89.06 69.38 50.31 1
95.94 77.50 50.31 1
50.00 42.19 29.06 0
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
18.00
19.00
20.00
21.00
22.00
23.00
24.00
25.00
26.00
27.00
28.00
29.00
1.00
2.00
3.00
4.00
5.00
6.00
7.00
8.00
9.00
10.00
11.00
12.00
13.00
14.00
15.00
16.00
17.00
18.00
19.00
20.00
21.00
22.00
23.00
24.00
25.00
26.00
27.00
28.00
29.00
30.00
31.00
32.00
48.44 36.88
40.94 22.50
47.81 .00
90.00 23.13
34.06 25.00
80.31 45.31
91.88 59.38
76.88 56.25
88.75 60.00
93.13 74.69
87.50 70.00
92.81 74.69
61.12 14.93
28.55 9.94
29.54 4.95
.00
.00
.00
.00
.00
.00
.00
.00
60.07 4.91
.00
.00
9.22 .00
.00
.00
55.39 14.18
52.06 9.52
.00
.00
99.23 49.38
30.77 10.17
47.37 10.16
.00
.00
60.65 15.46
94.54 40.04
31.73 9.46
.00
.00
99.50 44.67
57.29 25.72
100.00 44.65
41.32 15.41
.00
.00
65.90 14.40
33.66 .00
.00
.00
.00
.00
71.52 9.71
23.13
19.69
.00
25.31
20.94
25.31
40.31
29.69
45.00
50.00
50.00
59.69
9.95
.00
.00
.00
.00
.00
.00
.00
.00
.00
.00
5.21
5.20
.00
9.84
.00
.00
.00
.00
15.12
.00
.00
15.10
10.44
.00
.00
1.44
.00
.00
1.09
1.08
.00
0
0
0
0
0
1
1
1
1
1
1
1
0
0
0
0
0
0
0
0
0
0
0
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0
0
0
0
0
0
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
4.00
4.00
4.00
4.00
4.00
4.00
4.00
4.00
4.00
4.00
4.00
4.00
4.00
4.00
4.00
4.00
4.00
4.00
4.00
4.00
4.00
4.00
4.00
4.00
4.00
4.00
4.00
5.00
5.00
5.00
5.00
5.00
33.00
34.00
35.00
36.00
37.00
38.00
39.00
40.00
41.00
42.00
43.00
44.00
1.00
2.00
3.00
4.00
5.00
6.00
7.00
8.00
9.00
10.00
11.00
12.00
13.00
14.00
15.00
16.00
17.00
18.00
19.00
20.00
21.00
22.00
23.00
24.00
25.00
26.00
27.00
1.00
2.00
3.00
4.00
5.00
13.69 .00
.00
.00
.00
.00
66.18 34.94 10.02
66.83 .00
.00
77.46 24.96 15.99
71.80 29.94 25.28
48.20 10.65 5.34
.00
.00
1.34
.00
.00
.00
.00
.00
.00
.00
.00
.00
5.63 .00
.00
20.12 25.27 30.42
30.09 20.11 25.58
60.03 34.59 24.92
47.46 25.23 30.39
35.85 20.07 40.68
50.01 24.57 25.53
99.60 85.10 70.93
99.91 89.92 49.99
83.47 79.93 49.33
24.84 94.41 70.25
87.95 85.70 74.43
100.00 89.88 49.62
100.00 100.00 45.42
100.00 100.00 50.56
100.00 100.00 75.02
56.33 49.57 49.89
47.94 35.38 44.40
23.13 25.38 25.71
36.00 39.22 25.37
20.53 10.22 29.87
43.71 34.04 25.35
89.43 74.94 50.14
98.44 84.91 76.21
96.49 90.37 76.20
100.00 95.19 50.42
100.00 90.35 75.21
100.00 99.68 75.20
.00
.00
.00
100.00 74.34 34.84
60.69 44.97 25.05
.00
.00
.00
66.76 44.63 25.05
0
0
1
1
1
1
1
1
1
1
1
1
0
0
0
0
0
0
1
1
1
1
1
1
1
1
1
0
0
0
0
0
0
1
1
1
1
1
1
0
0
0
0
0
5.00
5.00
5.00
5.00
5.00
5.00
5.00
5.00
5.00
5.00
5.00
5.00
5.00
5.00
5.00
5.00
5.00
5.00
5.00
5.00
5.00
5.00
5.00
5.00
5.00
5.00
5.00
5.00
5.00
5.00
5.00
5.00
6.00
7.00
8.00
9.00
10.00
11.00
12.00
13.00
14.00
15.00
16.00
17.00
18.00
19.00
20.00
21.00
22.00
23.00
24.00
25.00
26.00
27.00
28.00
29.00
30.00
31.00
32.00
33.00
34.00
35.00
36.00
37.00
78.27 54.42 25.05
49.81 29.10 .00
.00
.00
.00
.00
.00
.00
44.70 34.84 .00
67.05 44.29 24.37
25.04 .00
.00
.00
.00
.00
99.87 49.02 .00
97.49 64.55 25.05
84.61 50.03 24.71
100.00 75.02 49.36
90.35 69.95 45.64
85.26 74.00 49.69
90.33 79.07 65.56
94.73 84.13 69.95
99.80 98.65 70.29
.00
.00
.00
32.04 13.91 .00
.00
.00
.00
55.05 25.05 24.37
.00
.00
.00
34.38 .00
.00
76.36 59.49 24.71
100.00 74.68 50.03
85.00 85.14 50.03
90.00 89.87 25.39
94.62 99.32 69.61
85.13 94.60 70.29
79.36 75.02 74.34
93.57 98.99 69.95
98.98 100.00 70.29
0
0
0
0
0
0
0
0
1
1
1
1
1
1
1
1
1
0
0
0
0
0
0
1
1
1
1
1
1
1
1
1
Restori, Gresham, Chang, Lee, & Laija-Rodriquez (2007).
[Graphs: ABAB panels by session for participants A1 through A4 (antecedent intervention) and C1 through C4 (consequent intervention) showing percent of intervals of academic achievement and percent of intervals of disruptive behavior.]
ID Treatment
A1 0
A2 0
A3 0
A4 0
C1 1
C2 1
C3 1
C4 1
Coding: Treatment 0 = antecedent, 1 = consequent
Patient = p
Session = s
Academic Achievement = b1
Disruptive Behavior = b2
p
s
b1
b2
t
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
1.00
2.00
3.00
4.00
5.00
6.00
7.00
8.00
9.00
10.00
11.00
12.00
13.00
14.00
15.00
16.00
17.00
18.00
19.00
20.00
21.00
22.00
23.00
24.00
25.00
1.00
2.00
3.00
4.00
5.00
6.00
7.00
8.00
9.00
10.00
11.00
12.00
13.00
22.57
4.80
16.07
41.86
30.01
18.16
43.41
18.65
.00
71.29
87.94
95.44
78.22
60.99
37.85
57.19
30.28
27.03
92.61
99.04
94.17
93.61
98.97
94.11
97.31
44.85
21.58
28.58
8.56
12.32
10.67
83.62
77.65
97.62
84.62
.00
32.14
61.31
51.60
95.66
69.30
42.93
67.64
53.64
54.69
81.55
99.27
18.61
11.60
1.90
8.87
28.20
28.18
23.32
19.53
41.01
3.89
1.19
3.85
5.98
.00
3.78
.00
44.85
56.72
56.69
92.34
69.62
89.05
6.32
9.54
.00
7.87
99.19
43.49
20.22
0
0
0
0
0
0
0
0
0
1
1
1
1
0
0
0
0
0
1
1
1
1
1
1
1
0
0
0
0
0
0
1
1
1
1
0
0
0
Treatment = t
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
4.00
4.00
4.00
4.00
4.00
4.00
4.00
4.00
4.00
4.00
4.00
4.00
4.00
4.00
4.00
14.00
15.00
16.00
17.00
18.00
19.00
20.00
21.00
22.00
1.00
2.00
3.00
4.00
5.00
6.00
7.00
8.00
9.00
10.00
11.00
12.00
13.00
14.00
15.00
16.00
17.00
18.00
19.00
20.00
1.00
2.00
3.00
4.00
5.00
7.00
8.00
9.00
9.00
10.00
11.00
12.00
13.00
14.00
15.00
46.00 44.52
35.31 43.42
69.88 12.04
84.45 1.75
90.91 .00
90.88 1.11
82.75 11.40
98.94 1.11
92.97 5.41
9.52 78.57
11.90 44.64
8.33 62.50
10.12 75.00
17.86 69.05
97.62 .00
97.62 .00
98.21 1.11
86.31 5.36
8.33 53.57
27.38 26.79
29.76 46.43
29.17 16.07
19.05 31.55
100.00 .00
88.10 .00
100.00 .00
95.24 .00
91.07 .00
96.43 .00
29.82 50.88
32.16 62.57
3.51 37.43
1.75 43.27
.00
93.57
41.52 41.52
4.09 54.39
99.42 .00
99.42 .00
98.25 .00
95.91 .00
96.49 .00
40.94 43.86
59.65 34.50
38.60 45.61
0
0
1
1
1
1
1
1
1
0
0
0
0
0
1
1
1
1
0
0
0
0
0
1
1
1
1
1
1
0
0
0
0
0
0
0
1
1
1
1
1
0
0
0
4.00
4.00
4.00
4.00
4.00
4.00
4.00
4.00
4.00
5.00
5.00
5.00
5.00
5.00
5.00
5.00
5.00
5.00
5.00
5.00
5.00
5.00
5.00
5.00
5.00
5.00
5.00
5.00
5.00
5.00
5.00
5.00
6.00
6.00
6.00
6.00
6.00
6.00
6.00
6.00
6.00
6.00
6.00
6.00
16.00
17.00
18.00
19.00
20.00
21.00
22.00
23.00
24.00
1.00
2.00
3.00
4.00
5.00
6.00
7.00
8.00
9.00
10.00
11.00
12.00
13.00
14.00
15.00
16.00
17.00
18.00
19.00
20.00
21.00
22.00
23.00
1.00
2.00
3.00
4.00
5.00
6.00
7.00
8.00
9.00
10.00
11.00
12.00
45.61 42.69 0
41.52 30.99 0
93.57 1.17 1
98.25 1.17 1
95.91 1.17 1
97.08 1.17 1
94.74 4.09 1
97.66 1.75 1
95.32 .00
1
25.70 37.70 0
5.10 77.10 0
34.22 39.36 0
38.20 22.20 0
32.46 46.17 0
13.00 83.86 0
15.84 70.12 0
81.52 11.24 1
94.64 3.79 1
96.33 2.62 1
74.02 6.02 1
56.28 32.85 0
69.96 27.11 0
56.80 32.22 0
47.06 35.06 0
22.46 34.46 0
95.58 .00
1
77.27 12.70 1
84.10 9.24 1
90.36 9.22 1
91.47 6.33 1
85.16 7.45 1
70.85 19.43 1
22.52 78.59 0
29.99 49.06 0
95.83 3.92 0
20.63 26.99 0
49.49 36.78 0
6.66 76.61 0
.00
100.26 0
19.29 53.97 0
100.00 .00
1
100.00 .00
1
82.14 11.62 1
87.87 6.95 1
6.00
6.00
6.00
6.00
6.00
6.00
6.00
6.00
6.00
6.00
6.00
6.00
7.00
7.00
7.00
7.00
7.00
7.00
7.00
7.00
7.00
7.00
7.00
7.00
7.00
7.00
7.00
7.00
7.00
7.00
7.00
7.00
7.00
7.00
8.00
8.00
8.00
8.00
8.00
8.00
8.00
8.00
8.00
8.00
13.00
14.00
15.00
16.00
17.00
18.00
19.00
20.00
21.00
22.00
23.00
24.00
1.00
2.00
3.00
4.00
5.00
6.00
7.00
8.00
9.00
10.00
11.00
12.00
13.00
14.00
15.00
16.00
17.00
18.00
19.00
20.00
21.00
22.00
1.00
2.00
3.00
4.00
5.00
6.00
7.00
8.00
9.00
10.00
31.76 1.70 0
.00
100.49 0
.00
91.20 0
2.70 8.49 0
51.21 41.97 0
66.77 13.59 1
98.51 .00
1
85.17 7.13 1
99.57 .00
1
98.36 .00
1
100.00 .00
1
99.42 .00
1
28.14 52.02 0
18.59 56.57 0
46.73 41.41 0
26.63 74.24 0
16.08 83.84 0
15.58 44.44 0
15.58 60.61 0
56.78 22.73 1
71.86 27.27 1
71.36 11.11 1
74.87 11.62 1
37.69 33.33 0
.00
97.47 0
31.16 13.13 0
21.11 45.45 0
12.56 49.49 0
57.79 2.02 1
37.69 14.14 1
69.85 10.10 1
52.76 11.11 1
74.87 12.12 1
62.81 9.09 1
30.62 53.08 0
27.08 42.38 0
39.86 49.55 0
20.50 62.33 0
15.93 49.60 0
16.46 57.28 0
92.00 2.71 1
89.98 4.27 1
60.42 20.11 1
88.50 4.32 1
8.00
8.00
8.00
8.00
8.00
8.00
8.00
8.00
8.00
8.00
8.00
11.00
12.00
13.00
14.00
15.00
16.00
17.00
18.00
19.00
20.00
21.00
33.43
61.51
27.86
61.56
42.20
58.04
66.23
80.03
89.75
97.94
81.12
25.77
34.98
18.68
38.09
32.50
15.18
10.11
7.58
2.50
1.51
11.74
0
0
0
0
0
1
1
1
1
1
1
Theodore, Bray, Kehle, & Jenson (2001).
[Re-created ABAB graphs for Theodore et al. (2001) are omitted here: percentage of disruptive intervals, plotted by session for Students 1 through 3.]
Patient = p
Session = s
Disruptive behavior = b
Treatment = t
p s b t
1.00 1.00 53.72 0
1.00 2.00 68.68 0
1.00 3.00 64.88 0
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
4.00
5.00
6.00
7.00
8.00
9.00
10.00
11.00
12.00
13.00
14.00
15.00
16.00
17.00
18.00
19.00
20.00
21.00
22.00
23.00
24.00
25.00
26.00
27.00
28.00
29.00
30.00
31.00
1.00
2.00
3.00
4.00
5.00
6.00
7.00
8.00
9.00
10.00
11.00
12.00
13.00
14.00
15.00
16.00
63.60
48.56
49.77
42.23
59.69
55.90
45.86
70.82
54.53
59.49
.00
1.91
.62
.59
.00
36.72
45.42
39.14
35.35
45.31
49.02
3.98
1.45
11.41
8.86
.00
6.28
.00
37.93
37.93
44.83
48.28
35.63
39.08
42.53
42.53
34.48
60.92
40.23
49.43
5.75
.00
5.75
5.75
0
0
0
0
0
0
0
0
0
0
1
1
1
1
1
0
0
0
0
0
0
1
1
1
1
1
1
1
0
0
0
0
0
0
0
0
0
0
0
0
1
1
1
1
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
2.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
3.00
17.00
18.00
19.00
20.00
21.00
22.00
23.00
24.00
25.00
26.00
27.00
28.00
29.00
30.00
1.00
2.00
3.00
4.00
5.00
6.00
7.00
8.00
9.00
10.00
11.00
12.00
13.00
14.00
15.00
16.00
17.00
18.00
19.00
20.00
21.00
22.00
23.00
24.00
25.00
26.00
27.00
28.00
29.00
30.00
8.05
2.30
44.83
25.29
21.84
31.03
28.74
33.33
4.60
.00
4.60
5.75
8.05
2.30
31.65
43.04
44.30
31.65
39.24
26.58
37.97
35.44
29.11
64.56
32.91
31.65
.00
5.06
3.80
.00
2.53
3.80
22.78
18.99
27.85
22.78
20.25
1.27
.00
.00
.00
.00
.00
.00
1
1
0
0
0
0
0
0
1
1
1
1
1
1
0
0
0
0
0
0
0
0
0
0
0
0
1
1
1
1
1
1
0
0
0
0
0
1
1
1
1
1
1
1
ID Gender Race Disability
Student 1 0 0 0
Student 2 0 0 0
Student 3 0 0 0
Coding: Gender 0 = Male; Race 0 = White; Disability 0 = SED (Severe Emotional Disturbance)
Williamson, Campbell-Whatley, & Lo (2009).
[Re-created ABAB graphs for Williamson et al. (2009) are omitted here: percent of intervals of on-task behavior, plotted by session for Students 1 through 6.]
Patient = p
Session = s
On-task behavior = b
Treatment = t
p s b t
1.0 1.0 41.14 0.0
1.0 2.0 21.98 0.0
1.0 3.0 21.93 0.0
1.0 4.0 43.95 0.0
1.0 5.0 40.96 0.0
1.0 6.0 20.34 0.0
1.0 8.0 40.85 0.0
1.0 9.0 58.46 0.0
1.0 10.0 80.47 1.0
1.0 11.0 64.25 1.0
1.0 12.0 100.97 1.0
1.0 13.0 62.7 1.0
1.0 14.0 102.36 1.0
1.0 15.0 80.27 1.0
1.0 16.0 81.69 1.0
1.0 17.0 81.65 1.0
1.0 18.0 61.03 1.0
1.0 19.0 62.45 0.0
1.0 20.0 59.47 0.0
1.0 21.0 60.91 0.0
1.0 22.0 59.39 0.0
1.0 23.0 78.46 0.0
1.0 24.0 59.31 0.0
1.0 25.0 60.73 0.0
1.0 26.0 59.22 0.0
1.0 27.0 60.66 0.0
1.0 28.0 79.74 1.0
1.0 29.0 79.7 1.0
1.0 32.0 79.58 1.0
1.0 33.0 60.42 1.0
1.0 34.0 60.38 1.0
1.0 35.0 77.98 1.0
1.0 36.0 42.64 1.0
2.0 1.0 41.1 0.0
2.0 2.0 42.47 0.0
2.0 3.0 18.87 0.0
2.0 4.0 37.9 0.0
2.0 5.0 58.41 0.0
2.0 6.0 20.09 0.0
2.0 8.0 39.04 0.0
2.0 9.0 38.97
2.0 10.0 59.48
2.0 11.0 77.03
2.0 12.0 59.3
2.0 13.0 81.28
2.0 14.0 81.2
2.0 15.0 98.77
2.0 16.0 78.09
2.0 17.0 79.49
2.0 18.0 80.87
2.0 19.0 58.73
2.0 20.0 58.65
2.0 21.0 58.56
2.0 23.0 58.4
2.0 24.0 37.72
2.0 25.0 56.79
2.0 26.0 56.69
2.0 27.0 59.55
2.0 28.0 75.65
2.0 29.0 97.62
2.0 30.0 78.36
2.0 32.0 76.8
2.0 33.0 60.55
2.0 34.0 78.08
2.0 35.0 78.0
2.0 36.0 58.8
3.0 1.0 59.39
3.0 2.0 57.9
3.0 4.0 59.27
3.0 5.0 59.23
3.0 6.0 59.19
3.0 7.0 59.15
3.0 8.0 59.11
3.0 9.0 59.07
3.0 10.0 82.22
3.0 11.0 80.73
0.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
1.0
1.0
3.0 12.0 99.52
3.0 13.0 98.03
3.0 14.0 97.99
3.0 15.0 99.4
3.0 16.0 99.36
3.0 17.0 97.86
3.0 18.0 99.28
3.0 19.0 78.95
3.0 20.0 78.91
3.0 21.0 58.58
3.0 22.0 77.37
3.0 23.0 96.17
3.0 24.0 78.74
3.0 25.0 78.71
3.0 26.0 59.82
3.0 27.0 78.63
3.0 28.0 77.14
3.0 29.0 97.38
3.0 30.0 97.34
3.0 31.0 97.3
3.0 32.0 58.13
3.0 33.0 59.53
3.0 34.0 59.49
3.0 35.0 37.72
3.0 36.0 37.69
4.0 1.0 36.81
4.0 2.0 36.85
4.0 3.0 16.29
4.0 4.0 16.34
4.0 5.0 16.37
4.0 7.0 19.4
4.0 8.0 34.14
4.0 9.0 56.24
4.0 10.0 78.34
4.0 11.0 79.85
4.0 12.0 97.55
1.0
1.0
1.0
1.0
1.0
1.0
1.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
1.0
1.0
1.0
4.0 13.0 79.94
4.0 14.0 78.51
4.0 15.0 80.02
4.0 16.0 77.12
4.0 17.0 78.63
4.0 18.0 97.79
4.0 19.0 37.54
4.0 20.0 37.59
4.0 21.0 40.56
4.0 22.0 58.25
4.0 23.0 39.18
4.0 24.0 56.86
4.0 25.0 40.73
4.0 26.0 56.94
4.0 27.0 56.98
4.0 28.0 57.03
4.0 29.0 57.07
4.0 30.0 76.22
4.0 31.0 58.62
4.0 32.0 60.13
4.0 33.0 60.17
4.0 34.0 60.22
4.0 35.0 60.25
4.0 36.0 39.7
5.0 1.0 57.18
5.0 2.0 58.65
5.0 3.0 60.12
5.0 4.0 38.73
5.0 5.0 38.77
5.0 6.0 55.95
5.0 7.0 38.85
5.0 8.0 18.89
5.0 9.0 17.5
5.0 10.0 58.96
5.0 11.0 59.01
5.0 12.0 79.05
1.0
1.0
1.0
1.0
1.0
1.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
1.0
1.0
1.0
5.0 13.0 80.52
5.0 14.0 79.12
5.0 15.0 97.74
5.0 16.0 79.21
5.0 17.0 96.39
5.0 18.0 80.71
5.0 19.0 79.33
5.0 20.0 59.36
5.0 21.0 60.84
5.0 22.0 59.46
5.0 23.0 80.92
5.0 24.0 59.52
5.0 25.0 59.57
5.0 26.0 59.61
5.0 27.0 59.65
5.0 28.0 59.68
5.0 29.0 58.29
5.0 30.0 59.77
5.0 31.0 78.38
5.0 32.0 59.85
5.0 33.0 61.32
5.0 34.0 44.21
5.0 35.0 57.1
5.0 36.0 41.44
6.0 1.0 58.21
6.0 2.0 40.3
6.0 3.0 77.61
6.0 4.0 61.19
6.0 5.0 38.81
6.0 6.0 38.81
6.0 7.0 40.3
6.0 8.0 40.3
6.0 9.0 59.7
6.0 10.0 83.58
6.0 11.0 83.58
6.0 12.0 83.58
1.0
1.0
1.0
1.0
1.0
1.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
1.0
1.0
1.0
6.0 13.0 82.09 1.0
6.0 14.0 100.0 1.0
6.0 15.0 83.58 1.0
6.0 16.0 83.58 1.0
6.0 17.0 101.49 1.0
6.0 18.0 101.49 1.0
6.0 19.0 102.99 0.0
6.0 20.0 82.09 0.0
6.0 21.0 61.19 0.0
6.0 22.0 61.19 0.0
6.0 23.0 61.19 0.0
6.0 24.0 80.6 0.0
6.0 25.0 59.7 0.0
6.0 26.0 59.7 0.0
6.0 27.0 59.7 0.0
6.0 28.0 82.09 1.0
6.0 29.0 82.09 1.0
6.0 30.0 83.58 1.0
6.0 31.0 83.58 1.0
6.0 32.0 80.6 1.0
6.0 33.0 43.28 1.0
6.0 34.0 59.7 1.0
6.0 35.0 58.21 1.0
6.0 36.0 38.81 1.0
ID Special Education Race
Person 1 0 0
Person 2 0 0
Person 3 0 0
Person 4 0 0
Person 5 0 0
Person 6 0 0
Coding: Special Education 0 = Y (all); Race 0 = AA (all)
Appendix C: All Models (Working and Non-Working)
Amato-Zech, Hoff, & Doepke (2006)
Full Non-Linear Model with Slopes (with overdispersion)
*Because sigma is approximately 1, there is no need to select the overdispersion option for these models.
Full with Overdispersion
Each non-significant variable was removed from the model, which was then re-run; only the treatment variable remained significant (i.e., it was the only variable carried forward to the simple model below).
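The reduction described above can be illustrated with a brief sketch. The models in this appendix were estimated with the HLM software; the commands below are only an analogous specification in Stata, with hypothetical variable names (y for the count outcome, trt for the phase indicator, sess for session, and id for the case identifier).

* Hedged sketch (hypothetical variable names): an analogous full-to-simple
* reduction in Stata rather than the HLM software used for this appendix.

* Full model: treatment, session, and their interaction as fixed effects,
* with a random intercept and a random treatment effect by case.
generate trt_sess = trt*sess
xtmepoisson y trt sess trt_sess || id: trt, covariance(unstructured)

* Simple model without slopes: non-significant terms dropped, leaving only
* the treatment indicator and a random intercept.
xtmepoisson y trt || id: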
Simple Non-Linear Model without Slopes (without Overdispersion)
Lambert, Cartledge, Heward, & Lo (2006)
Full Non-Linear Model with Slopes (with overdispersion)
Simple Non-Linear Model without Slopes (with Overdispersion)
Simple Non-Linear Model with CLASSB on Intercept
Simple Non-Linear Model without Slopes with CLASSB on Intercept (without overdispersion)
Gender on TRT
Murphy, Theodore, Alric-Edwards, & Hughes (2007)
Full Non-Linear Model with Slopes (with Overdispersion)
Simple Non-Linear Model without Slopes (with overdispersion)
Simple Non-Linear Model with AGE on Intercept (with overdispersion)
Simple Non-Linear Model with AGE on Intercept (without overdispersion)
Simple Non-Linear Model with AGE on TRT (with overdispersion)
Simple Non-Linear Model with AGE on TRT (without overdispersion)
Simple Non-Linear Model with ONTRACK on Intercept (with overdispersion)
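For readers translating these HLM specifications into a single equation: a level-2 covariate placed on the intercept enters as a main effect, while a covariate placed on TRT enters as a cross-level interaction with the treatment indicator. A minimal Stata sketch under that reading, with hypothetical variable names (y, trt, age, id); this is not the HLM output itself.

* Hedged sketch (hypothetical variable names): HLM "covariate on the intercept"
* versus "covariate on TRT" written in single-equation form.

* AGE on the intercept: age enters only as a level-2 main effect.
xtmepoisson y trt age || id:

* AGE on TRT: age also moderates the treatment effect (cross-level interaction).
generate trt_age = trt*age
xtmepoisson y trt age trt_age || id: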
Theodore, Bray, Kehle, & Jenson (2001)
Full Non-Linear Model with Slopes (with overdispersion)
Full Non-Linear Model with Slopes (without overdispersion)
Simple Non-Linear Model without Slopes (with overdispersion)
Simple Non-Linear Model without Intercept (with overdispersion)
Simple Non-Linear Model without Intercept (without overdispersion)
Simple Non-Linear Model with DISABILITY on Intercept
Ramsey, Jolivette, Puckett-Patterson, & Kennedy (2010)
(ON-TASK) Full Non-Linear Model with Slopes (with overdispersion)
Simple Non-Linear Model without Slopes (with overdispersion)
Simple Non-Linear Model without Slopes (without overdispersion)
Simple Non-Linear Model without Slopes (without overdispersion)
(TASK COMPLETION) Full Non-Linear Model with Slopes (with overdispersion)
Simple Non-Linear Model without Slopes (with overdispersion)
Simple Non-Linear Model without Slopes (with overdispersion)
Simple Non-Linear Model without Slopes (with overdispersion) DV: Task completion
(ACCURACY) Full Non-Linear Model with Slopes (with overdispersion)
Simple Non-Linear Model without Slopes (with overdispersion)
Simple Non-Linear Model without Slopes (without overdispersion)
Restori, Gresham, Chang, Lee, & Laija-Rodriguez (2007)
Disruptive Behavior = Freq10
(Freq10) Full Non-Linear Model with Slopes (with overdispersion)
(Freq10) Full Non-Linear Model with Slopes (without overdispersion)
(Freq10) Simple Non-Linear Model without Slopes (without overdispersion)
(Freq10) Simple Non-Linear Model with INTERVENTION on intercept and S1ORD
(without overdispersion)
(Freq10) Simple Non-Linear Model with INTERVENTION on intercept and S1ORD
(with overdispersion)
(Freq10) Simple Non-Linear Model with Intervention on TRT (with overdispersion)
(Freq10) Simple Non-Linear Model with Intervention on TRT (with overdispersion)
Academic Engagement = ACA10
(FreqACA10) Full Non-Linear Model with Slopes (with overdispersion)
(FreqACA10) Full Non-Linear Model with Slopes (without overdispersion)
(FreqACA10) Simple Non-Linear Model without Slopes (with overdispersion)
(FreqACA10) Simple Non-Linear Model without Slopes (without overdispersion)
(FreqACA10) Simple Non-Linear Model without Slopes (with overdispersion)
(FreqACA10) Simple Non-Linear Model with INTERVENTION on TRT (with
Overdispersion)
(FreqACA10) Simple Non-Linear Model with INTERVENTION on TRT (without
Overdispersion)
Williamson, Campbell-Whatley, & Lo (2009)
Full Non-Linear Model with Slopes (with overdispersion)
Full Non-Linear Model with Slopes (without overdispersion)
Simple Non-Linear Model without Slopes (with overdispersion)
Student 3 was tested on S1TRTORD, which was not significant either with or without overdispersion.
Simple Non-Linear Model without Slopes (without overdispersion)
Simple Non-Linear Model without Slopes (without overdispersion) No L-2 Predictors
Cole & Levinson (2002).
Mavropoulou, Papadopoulou, & Kakana (2011)
The on-task outcome needs overdispersion; xtmepoisson will account for it.
nbreg ontask60 TRT SESS1
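A brief sketch of the two options named above, assuming the digitized data are in long format with the variables ontask60, TRT, and SESS1 from the command line above plus a hypothetical case identifier id:

* Hedged sketch: mixed-effects Poisson versus negative binomial for an
* overdispersed count outcome; id is a hypothetical case identifier.

* Random intercept by case absorbs part of the extra-Poisson variation.
xtmepoisson ontask60 TRT SESS1 || id:

* Negative binomial models the overdispersion directly through alpha.
nbreg ontask60 TRT SESS1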
The performance outcome needs overdispersion.
The prompt outcome does not need overdispersion.