Reappraising the Arguments Against Randomized Experiments in Education: An Analysis of the
Culture of Evaluation in American Schools of Education
Abstract
This paper notes the rarity with which researchers who teach in American schools of education
do policy-relevant experiments on schooling. In contrast, more school experiments are done by
scholars with appointments in public health, psychology and the policy sciences. Existing
theories of educational evaluation offer many arguments against experimentation, and these
arguments are critically analyzed in this paper. They vary in cogency. The better arguments
suggest that experimental designs will be improved if issues of implementation quality, program
theory and the utilization of evaluation results are addressed more seriously. The worse
arguments are those contending that better methods than the experiment already exist for
assessing the effectiveness of educational innovations. We surmise that educational evaluators
are not likely to adopt experimentation any time soon since it does not accord with the ontology,
epistemology or methodology they prefer, they do not believe it is practical, and they do not
consider it to be compatible with their preferred metaphor of schools as complex social
organizations that should be improved using the tools of management consulting more than
science. For more experiments to take place, school reform will require different intellectual
assumptions from those now prevalent among evaluation specialists in schools of education.
Preamble
To do good research on educational technologies requires different strategies at different
stages in the evolution of what I’ll call a product. This paper deals only with the last stages of
product development when claims are made about its effectiveness.
The argument I advance is that such claims should be based on experimental
demonstrations (or good quasi-experiments). Experimental designs compare students exposed to
the product with those exposed to a competing product (as in the Consumer Reports format) or
they compare students who have had the product to individuals who do not have it but are free to
do whatever they want, including gaining some exposure to similar products. No method
provides as good a counterfactual as the experiment. This implies that the market should not
decide alone; nor should decisions be made on the basis of formative studies alone; nor should
they be made on the basis of various pattern matching strategies that have recently been
espoused—e.g., theory-based evaluations.
Since many of those doing research on educational technology have academic
backgrounds in education, I reexamine the arguments against doing experiments advanced in
education in general and in educational evaluation in particular. This paper reviews these
arguments sympathetically, but nonetheless concludes they do not carry enough weight to
preclude using experiments to test claims about the effectiveness of educational technology. It is
not easy to do such experiments, particularly when the unit of assignment is schools rather than
classrooms or students. But without experiments claims about productivity gains due to
technology will not gain the acceptance they might otherwise deserve.
Paper for Presentation at the SRI International Design Conference on Design Issues in
Evaluating Educational Technologies
Reappraising the Arguments Against Randomized Experiments in Education: An Analysis of the
Culture of Evaluation in American Schools of Education
Thomas D. Cook
Northwestern University
Thanks for feedback on an earlier draft are due to Tony Bryk, Lee Cronbach, Joseph Durlak,
Christopher Jencks, and Paul Lingenfelter.
Introduction
The American education system is highly decentralized compared with most other
industrialized nations. Educational goals and practices vary enormously, not only by state and
school district, but also by schools within districts. Even current plans to improve schools must
seem heterogeneous. Politicians, business leaders and policy pundits have recently called for
school improvement through mechanisms as diverse as school-based management, charter
schools, school vouchers, more effective teaching practices, higher standards, increased
accountability, smaller schools and classes, new educational technologies, and better trained
teachers and principals. To many foreigners, the American system must look like a cacophony of
local experimentation.
However, experimentation connotes more than just generating diverse options. It also
connotes systematically evaluating them. This is usually through a process that can be
decomposed into four decisions that have to be made--about the alternatives to be compared, the
criteria they are to be compared on, how information is to be collected about each criterion, and
how a decision is to be reached once all the alternatives have been compared on all the criteria.
To evaluate like this requires considerable researcher control. In the laboratory sciences, control
is used to exclude all irrelevant causal forces from the context of study and to provide
experimenters with almost total say over the design and introduction of the alternatives to be
compared. In settings more open than the laboratory, control is most often associated with using
random assignment to determine which units (schools, classrooms or students) are to experience
each alternative or, as we will call it, each treatment. Such assignment rules out the possibility
that any differences observed at the end of a study are due to pre-existing group differences rather
than to the treatment contrast under investigation.
Education uses experiments in both the laboratory and open system senses. But since our
focus in this paper is on evaluating school reforms, we will concentrate on random assignment
and its use in schools. Its superiority for drawing inferences about the causal consequences of
deliberately planned changes is routinely acknowledged in philosophy, medicine, public health,
agriculture, statistics, micro-economics, psychology, marketing and those parts of political
science and sociology dealing with the improvement of opinion surveys. The same advocacy is
evident in elementary methods texts in education, though not--as we will see--in specialized texts
on educational evaluation.
Random assignment is so esteemed because: (1) it provides a logically defensible no-cause
baseline since the selection process is completely known; (2) the design and statistical
alternatives to random assignment are empirically inadequate because they either do not produce
the same effect sizes as randomized experiments on the same topic (e.g., Mosteller, Gilbert &
McPeak, 1980; Lalonde, 1986; and Fraker & Maynard, 1987) or, when they do, the variation in
results between the quasi-experimental studies on a topic is greater than the variation between
randomized experiments on the same topic (Lipsey & Wilson, 1993); and (3) the costs of being
incorrect in the causal conclusions one makes are often far from trivial. What if we concluded
that Catholic schools are superior to public ones when they are not; or that vouchers stimulate
academic achievement when they do not; or that school desegregation does not increase minority
achievement when it in fact does? Taken together, these three arguments provide the
conventional justification for random assignment as the method of choice when a causal question
is paramount.
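
To make the logic of point (1) concrete, consider a minimal simulation sketch in Python (all numbers are invented for illustration, not drawn from any study). An unobserved aptitude drives both self-selection into a hypothetical program and later achievement, so the self-selected comparison overstates the program's effect, whereas the tossed coin of random assignment recovers it.

import numpy as np

rng = np.random.default_rng(0)
n = 10_000
aptitude = rng.normal(size=n)        # unobserved confounder (illustrative)
true_effect = 0.20                   # assumed true effect of the program

# Self-selection: higher-aptitude students are more likely to opt in
opt_in = rng.random(n) < 1 / (1 + np.exp(-aptitude))
y_sel = true_effect * opt_in + aptitude + rng.normal(size=n)
naive_diff = y_sel[opt_in].mean() - y_sel[~opt_in].mean()        # biased upward

# Random assignment: the coin toss is unrelated to aptitude
assigned = rng.random(n) < 0.5
y_rand = true_effect * assigned + aptitude + rng.normal(size=n)
experimental_diff = y_rand[assigned].mean() - y_rand[~assigned].mean()  # near 0.20

print(naive_diff, experimental_diff)
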
American education promotes experimentation in the sense of implementing many
variants. But it does not promote experimentation in the sense of using random assignment.
Indeed, Nave, Miech and Mosteller (1999) showed that not even 1% of dissertations in education
or of the studies archived in ERIC Abstracts involved such assignment. Casual review of the
major journals on school improvement tells a similar story, whether in the American Educational
Research Journal or Educational Evaluation and Policy Analysis. Since my own research
interests are in whole school reform rather than discrete school features like curricula, I wrote to
a nationally renowned colleague who designs and evaluates curricula. She replied that
randomized experiments are also extremely rare in curriculum studies. The same is certainly true
of studies of school desegregation, where a review organized by the then National Institute of
Education (1984) included just one experiment (Zdep, 1971). I know of no experiments on
standards setting. The effective schools literature reveals no experiments systematically varying
those school practices that correlational studies suggested were effective. Recent studies of
school-based management reveal only two experiments, both on Comer’s School Development
Program (Cook, Habib, Phillips, Settersten, Shagle & Degirmencioglu, 1999; Cook, Hunt &
Murphy, 1999). There seem to be no experiments on Catholic or Accelerated or Total Quality
Management schools. On vouchers I know of only one completed study, often re-analyzed
(Witte, 1998; Greene, Peterson, Du, Boeger & Frazier, 1996; Rouse, 1998), plus three other
projects now underway (Peterson, Greene, Howell, & McCready, 1998; Peterson, Myers &
Howell, 1998). On charter schools I know of no relevant experiments. On smaller class sizes, I
know of six experiments, the best known being the Tennessee class size study (Finn & Achilles,
1990; Mosteller, Light & Sachs, 1996). On smaller schools I know of only one randomized
experiment, currently underway (Kemple & Rock, 1996). On teacher training I know of no
relevant experiments. So, our knowledge of the effectiveness of the educational policies most
discussed today has to depend on methods less esteemed than random assignment.
Of the experiments cited above, nearly all were conducted by scholars with primary
organizational affiliation outside of education. The best-known class size experiment was done
by educators (Finn & Achilles, 1990), but popularized by statisticians (Mosteller, Light & Sachs,
1996) and reanalyzed by an economist (Ferguson, 1998). The Milwaukee voucher study was
done by political scientists (Witte, 1998) and re-analyzed as a randomized experiment by political
scientists (Greene, et al 1996) and economists (Rouse, 1998). The Comer studies were
conducted by sociologists and psychologists (Cook et al, 1999; Cook, Hunt & Murphy, 1999).
The ongoing experiment on academies within high schools (Kemple & Rock, 1996) is being
done by economists, and the work on school choice programs in Washington, New York and
Cleveland is also being done by political scientists (Peterson et al, 1998). What is striking is the
multi-year paucity of experimental studies conducted by scholars with appointments in Schools of
Education.
That they are not done there does not reflect ignorance. Nearly all writers on educational
research methods understand the logic of experimentation, know of its theoretical benefits, and
appreciate how esteemed it is in science writ large. However, most of them (including Alkin,
Cronbach, Eisner, Fetterman, Fullan, Guba, House, Hubermann, Lincoln, Miles, Provus, Sanders,
Schwandt, Stake, Stufflebeam and Worthen) denigrate formal experiments at some time or
another in their writings. A few pay lip-service to them while carrying out non-experimental
studies, thereby modeling non-experimental virtues for others (e.g., C. H. Weiss). This distaste stands in
stark contrast to the preference for experiments exhibited by school researchers with affiliations
outside of Schools of Education who primarily seek to learn how to improve student health, to
prevent school violence, or to reduce teen use of tobacco, drugs and alcohol (e.g., Peters &
McMahon, 1996; Cook, Anson & Walchli, 1993; Durlak & Wells, 1997a, b and 1998; St. Pierre,
Cook & Straw, 1981; Connell, Turner & Mason, 1985). Such studies routinely use random
assignment, with Durlak’s two reviews of studies prior to 1991 including about 190 randomized
experiments and 120 other studies. The number of experiments has certainly increased since
then, given the rapid growth of school-based prevention studies. So, experiments are common in
schools, but not on the topics preferred by researchers who have been exposed to evaluation
training in Schools of Education.
This paper describes some things that are special about the intellectual culture of
evaluation in Schools of Education. It is a culture that actively rejects random assignment in
favor of alternatives that the larger research community judges to be technically inferior. Yet
educational evaluators have lived for at least 30 years knowing they have rejected what science
apotheosizes, and they will not adopt experimentation merely by learning more about it. They
believe in a set of mutually reinforcing propositions that provides an overwhelming rationale for
rejecting experiments on a variety of ontological, epistemological, methodological, practical and
ethical grounds. Although the writers on educational evaluation we listed earlier differ in their
reasons for rejecting a scientific model of knowledge growth, anyone in educational evaluation
during the last 30 years will have encountered arguments against experiments that seemed both
cogent and comprehensive. Indeed, even raising the issue of experimentation probably has a
"deja vu" quality, reminding older educational evaluators of a battle they thought they had won
long ago against positivism and the privilege it accords to experiments and to the associated
research and development (R&D) model, whose origins lie in agriculture, medicine and
public health--not education. Still, we want to re-examine the reasons educational evaluators
have forwarded for rejecting random assignment in school-based research.
Three incidental benefits follow from this purpose. First, we provide a rationale for
random assignment that is somewhat different from the conventional validity rationale outlined
earlier. We take seriously some of the objections raised by critics in education and so do not
argue that random assignment provides a gold standard guaranteeing impeccable causal
decisions. (However, we do believe it is superior to any alternative currently available.) Nor do
we argue that the alternatives to random assignment are invariably fallible. That would be silly.
With the exception of psychology, the laboratory sciences routinely provide causal knowledge
yet rarely assign at random. Moreover, common sense has arrived at many correct causal
propositions without the advantage of experiments (e.g., that incest causes aberrant births or that
out-group threat causes in-group cohesion). And, while Lipsey and Wilson’s (1993) report on 74
meta-analyses showed that the average effect sizes did not differ between experiments and other
types of studies, the variation in effect sizes was much larger for the non-experiments. This
suggests that experiments are more efficient as probes of causal hypotheses, and so they are
especially needed when few reports on a topic are available—as is the case with almost all of the
recent assertions about effective school reforms. A revised rationale for experiments stresses (1)
their greater validity relative to other cause-probing methods even though they do not constitute a
"gold standard"; (2) their efficiency at reaching defensible causal conclusions sooner than other
methods; and (3) their greater credibility in the eyes of nearly all the scholarly community.
A second incidental benefit is the light our analysis throws on the R&D model guiding
educational research today. It is different from the science-based evolutionary model used in
other fields (Flay, 1986). This specifies that theory, insight and serendipity should be used to
generate novel ideas about possible improvements to health or welfare. The assumptions behind
these ideas should then be tested in laboratory settings, often in test tubes or with animal
populations. Novel ideas surviving this examination should then be implemented in efficacy
trials designed to test the treatment at its strongest in carefully controlled field settings. Only
ideas that survive this next stage enter into trials that probe their effectiveness under conditions
like those of eventual application where faithful implementation of the treatment details cannot
be taken for granted. Only ideas surviving this last stage should then be replicated to assess their
generalizability across different populations of students and teachers, across different kinds of
schools, and even across different moments in time.
Contrast this R&D model with the one most education researchers prefer. It comes from
management consulting and its crucial assumptions are: (1) each organization (e.g. school or
school district) is so complex in its structure and functions that it is difficult to implement any
changes that involve its central functions; (2) each organization is unique in its goals and culture,
entailing that the same stimulus is likely to elicit quite variable responses depending on a
school’s history, organization, personnel and politics; and (3) suggestions for change should
reflect the creative blending of knowledge from many different sources--especially from deep
historical and ethnographic insight into the district or school under study, as well as craft
knowledge about what is likely to improve this particular school or district, plus the application
of broad theories of general organizational change (Lindblom & Cohen, 1980). Seen from this
perspective, scientific knowledge about school effectiveness is inappropriate because it obscures
each school’s uniqueness, it oversimplifies the multivariate and non-linear nature of full
explanation, and it is naive about the mostly indirect ways in which social science is used in
policy. Most educational evaluators see themselves at the forefront of a post-positivist,
democratic and craft-based model of knowledge growth that is superior to the elitist scientific
model they believe fails to create knowledge that school staff can actually use to improve their
practice.
The third benefit that arises from understanding the attacks on randomized experiments is
that some of the criticisms help identify how experiments can be improved. We later outline
appropriate roles within experiments for the kinds of inquiry that many education researchers
now prefer as alternatives to random assignment. We maintain that unless these improvements
are made we cannot hope to see more experiments being done either by those who currently
teach evaluation in education or by their students. Nor can we expect support from those mid-level employees in government, foundations and contract research firms who have been
influenced by what educational evaluators believe about experiments. Such support is not
necessary for generating more experiments. But taking the critics’ concerns more seriously will
tend to improve experiments in education, whoever does them. It is time to circumvent the
unproductive current stand-off between scientific and craft models of knowledge growth by
illustrating how experiments will be designed better because serious attention has been paid to
the intellectual content of the attack on them.
Beliefs Adduced to Reject Random Assignment
1. Causation is ontologically more complex than a simple causal arrow from A to B. With a few
exceptions (e.g., Guba & Lincoln, 1982), educational researchers seem willing to postulate that a
real world exists, however imperfectly it can be known. Randomized experiments test the
influence of only a small subset of potential causes from within this world, often only a single
one. And at their most elegant they can responsibly test only a modest number of interactions
between treatments. This implies that randomized experiments are best when a causal question is
simple, sharply focused and easily justified. The theory of causation most relevant to this task has
been variously called the manipulability, activity or recipe theory (Collingwood, 1940; Gasking,
1955; Whitbeck, 1977). It seeks to describe the consequences of a set of activities whose
components can be listed as though they were ingredients in a recipe, thus permitting them to be
actively manipulated as a group in order to ascertain whether they achieve a desired purpose.
This is fundamentally a descriptive task; explanation is promoted only to the extent that the
manipulations help discriminate between competing theories or that assessments are made of any
processes that might mediate between a cause and effect (Mackie, 1974). Otherwise, random
assignment is irrelevant to explanation; it merely describes the effects of a multidimensional set
of activities that have been deliberately varied in package form.
Contrast this theory of causation with the one most esteemed in science. In some form or
another, the preference is for full explanation. One understanding of this emphasizes identifying
those "generative processes" (Bhaskar, 1975; Harre, 1981) that are capable of bringing about
effects in a wide variety of circumstances. Examples would be gravity as it affects falling, or a
specific genetic defect as it induces phenylketonuria, or time on task as it facilitates learning. But
even these examples are causally contingent--the genetic defect does not induce phenylketonuria
if an appropriate diet is adopted early in life; and time on task does not induce learning if a
student is disengaged or the curriculum materials are meaningless. So, a more general
understanding of causation requires specifying all the contingencies that impact on a given
consequence or that follow from a particular event (Mackie, 1974).
Cronbach and his colleagues (1980) maintain that Mackie’s theory of cause is more
relevant to schools than the activity theory. They believe that multiple causal factors are
implicated in all student or teacher change, and often in complex non-linear ways. Their model
of real world causation is more akin to a pretzel (or intersecting pretzels) than to an arrow from A
to B. No educational intervention fully explains an outcome; at most, it is just one more
determinant. Nor do interventions have effects that are so general we can assume a constant
effect size across student and teacher populations, across types of schools and time periods.
Contingency is the watchword, and the activity theory seems largely irrelevant because so few of
the causally implicated variables can be simultaneously manipulated. Believing that experiments
cannot faithfully represent a real world of multivariate, non-linear (and often reciprocal) causal
links, Cronbach and Snow (1976) initiated the search for aptitude-treatment interactions--specifications that a treatment’s effect depends on student or teacher characteristics. But they
discovered few that were robust. It is not clear to what extent this reflects an ontological truth
that undermines a contingent understanding of causation or whether it reflects more mundane
method problems associated with under-specified theories, partially invalid measures,
imperfectly implemented treatments and truncated distributions.
The utility of the activity theory of causation is best illustrated by noticing that many
educational researchers write today as though some non-contingent causal connections are true
enough to be useful --e.g., that small schools are better than large ones; that time-on-task raises
achievement; that summer school raises test scores; that school desegregation hardly affects
achievement; and that assigning and grading homework raises achievement. They also seem
willing to accept some causal propositions that are bounded by few (and manageable)
contingencies--e.g., that reducing class size increases achievement, provided that the amount of
change is "sizable" and to a level under 20; or that Catholic schools are superior to public ones,
but only in the inner-city. Commitment to an explanatory theory of causation has not stopped
school researchers from acting as though education in the USA today can be characterized in
terms of some dependable main effects and some very simple non-linearities, each more like a
causal arrow than intersecting pretzels.
It is also worth remembering that experiments were not designed to meet grand
explanatory goals or to ensure verisimilitude. In Bacon’s words, they were invented to "twist
Nature by the tail", to pull apart what is usually confounded. This is not to deny that
verisimilitude is also important. Else why would Fisher aspire to take experimental methods
outside of the laboratory and into fields, schools and hospitals? But while verisimilitude is part of
the Pantheon of experimental desiderata, it is always secondary. Premier is achieving a clear
picture of whether the link between two variables is causal in the activity theory sense
(Campbell, 1957). So, a special onus falls on advocates of experimentation. To justify their
preferred method, they have to make the case that any planned treatment is so important to policy
or theory and so likely to be general in its effects that it is worthwhile investing in an experiment
whose payoff for explanation or verisimilitude may be sub-optimal.
In this connection, it is important to remember that some causal contingencies are of
minor relevance to educational policy, however useful they might be for full explanation. The
most important contingencies are those that, within normal ranges, modify the sign of a causal
relationship and not just its magnitude. Identifying such causal sign changes informs us where a
treatment is directly harmful as opposed to where it simply has smaller benefits for some
children than others. This distinction is important, since it is not always possible to assign
different treatments to different populations, and so policy-makers are often willing to advocate a
change that has different effects for different types of student so long as the effects are rarely
negative. When the concern is to improve educational practice we can ignore many variables that
moderate a cause-effect relationship, even though they are part of a full explanatory theory of the
relationship in question.
We have considerable sympathy for those opponents of experimentation who believe that
biased answers to big explanatory questions are more important than unbiased answers to smaller
causal-descriptive questions. However, we disagree that learning about the consequences of
educational reforms involves small questions, though we acknowledge that they are usually
smaller than questions about causal mechanisms that are broadly transferable across multiple
settings—the identification of which constitutes the traditional Holy Grail of science. Still,
proponents of random assignment need to acknowledge that the method they prefer depends on a
theory of causation that is less comprehensive and less esteemed than the one most researchers
espouse. Acknowledging this does not undermine the justification for experiments, and it should
have the added benefit of encouraging the design of future experiments with a greater explanatory yield than
the black-box experiments of yesteryear.
2. Random assignment depends on discredited epistemological premises. To philosophers of
science, positivism connotes a rejection of realism, the formulation of theories in mathematical
form, the primacy of prediction over explanation, and the belief that entities do not exist
independently of their measurement. This epistemology has long been discredited. But many
educational researchers still use the word positivism and seem to mean by it the quantification
and hypothesis-testing that are essential to experimentation.
Kuhn’s work (1970) is at the forefront of the reasons cited for rejecting positivist science
in the second sense above. He argued two things of relevance. First is his claim about the
"incommensurability of theories"--the notion that theories cannot be formulated so specifically
that definitive falsification results. Second is his claim about the "theory-ladenness of
observations"--the notion that all measures are impregnated with researchers’ theories, hopes,
wishes and expectations, thus undermining their neutrality for discriminating between truth
claims. In refuting the possibility of totally explicit theories and totally neutral observations,
Kuhn’s work was interpreted as undermining science in general and random assignment in
particular. This is why his writings are often cited by educational evaluators like Cronbach,
Eisner, Fetterman, Guba, House, Lincoln, Stake and Stufflebeam. Some of them also invoke
philosophers like Lakatos, Harre and Feyerabend whose views partially overlap with Kuhn’s on
these issues. Also often cited are those sociologists of science whose empirical research reveals
practicing laboratory scientists whose on-the-job behavior deviates markedly from scientific
norms (e.g. Latour & Woolgar, 1979). All these points are designed to illustrate that meta-science
has demystified science, revealing it to be a very naked emperor indeed.
This epistemological critique is overly simplistic. Even if observations are never theory-neutral, this does not deny that many observations have stubbornly recurred whatever the
researcher’s predilections. As theories replace each other, most fact-like statements from the
older theory are incorporated into the newer one, surviving the change in theoretical
superstructure. Even if there are no "facts" we can independently know to be certain, there are
still many propositions with such a high degree of facticity they can be confidently treated as
though they were true. For practicing experimenters, the trick is to make sure that observations
are not impregnated with a single theory. This entails building multiple theories into one’s
collection of relevant observations, especially those of one’s theoretical opponents. It also means
valuing independent replications of causal claims, provided the replications do not share the
same direction of bias (Cook, 1985). Kuhn’s work complicates what a "fact" means, but it does
not deny that some claims to a fact-like status are very strong.
Kuhn is probably also correct that theoretical statements are never definitively tested
(Quine, 1951; 1969), including statements about the effects of an educational program. But this
does not mean that individual experiments fail to probe theories and causal hypotheses. For
example, when a study produces negative results the program developers (and other advocates)
are likely to surface methodological and substantive contingencies that might have changed the
experimental result--perhaps if a different outcome measure had been used or a different
population examined. Subsequent studies can probe these contingency formulations and, if they
again prove negative, lead to yet another round of probes of whatever more complicated
contingency hypotheses have emerged to explain the latest disconfirmation. After a time, this
process runs out of steam, so particularistic are the contingencies that remain to be examined. It
is as though the following consensus emerges: "The program was demonstrably not effective
under many conditions, and those that remain to be examined are so circumscribed that the
reform will not be worth much even if it does turn out to be effective under these conditions”.
Kuhn is correct that this process is social and not exclusively logical; and he is correct that it
arises because the underlying program theory is not sufficiently explicit that it can be definitively
confirmed or rejected. But the reality of elastic theory does not mean that decisions about causal
hypotheses are only social and so are devoid of all empirical or logical content.
Responding to Kuhn’s critique leads to down-playing the value of individual studies and
advocating more programs of experimental research (without implying that such programs are
perfect--only relatively better). However, critics of the experiment have not been interested in
developing a non-positivist epistemological justification for experiments, probably seeing such
efforts as futile. Instead, they have turned their attention to developing methods for studying
educational reforms based on quite different epistemological assumptions and hence technical
practices. Most of these assumptions led to stressing qualitative methods and hypothesis
discovery over hypothesis testing and quantitative methods (e.g. Guba & Lincoln, 1982; Stake,
1967 and 1975; Cronbach, 1980 and 1982; House, 1993; Fetterman, 1984). Others seek to trace
whether the mediating processes specified in the substantive theory of a program actually occur
in the time sequence predicted (Connell, Kubisch, Schorr & Weiss, 1995; Chen, 1990). But
before analyzing these alternatives, we should acknowledge some of the more practical attacks
made on random assignment.
3. Random assignment has been tried in education, and has failed. Education researchers were at
the forefront of the flurry of social experimentation that took place at the end of the 1960’s and
the 1970’s. Evaluations of Head Start (Cicirelli & Associates, 1969), Follow Through (Stebbins,
St. Pierre, Proper, Anderson & Cerva, 1978), and Title I (Wargo, Tallmadge, Michaels, Lipe &
Morris 1972) concluded that there were few, if any, positive effects and there might even be
some negative ones. These disappointing results engendered considerable dispute about methods,
and many educational evaluators concluded that quantitative evaluation had been tried and failed.
So, they turned to other methods, propelled in the same direction by their reading of Kuhn. Other
scholars responded differently, stressing the need to study school management and program
implementation in the belief that poor management and implementation had led to the
disappointing results (Berman & McLaughlin, 1977; Elmore & McLaughlin, 1983; Cohen &
Garet, 1975).
However, none of the most heavily criticized studies involved random assignment. In
education, criticism of experiments was undertaken by Cronbach et al (1980) who re-analyzed
some studies from the long lists Riecken & Boruch (1974) and Boruch (1974) had generated in
order to demonstrate the feasibility of randomized experiments in complex field settings. In his
critique, Cronbach paid particular attention to the Vera Institute’s Bail Bond Experiment (Ares,
Rankin & Sturz, 1963) and to the New Jersey Negative Income Tax Experiment (Rossi & Lyall,
1976). He was able to show to his own and his followers' satisfaction that each was flawed both
in how it was implemented as a randomized experiment and in the degree of correspondence
achieved between its sampling particulars and the likely conditions of its application as new
policy. For these reasons primarily, the case was made that the experiments cited by Boruch did
not constitute a compelling justification for evaluating educational innovations experimentally.
However, the studies Cronbach analyzed were not from education. I know of only three
randomized experiments on “large” educational topics available at the time, though there were
many small experiments on highly specific teaching and learning practices. One experiment was
of Sesame Street in its second year (Bogatz & Ball, 1971) when television cable capacity was
randomly assigned to homes in order to promote differences in the opportunity to view the show.
A second was the Perry Preschool Project (Schweinhart, Barnes & Weikart, 1993), and the third
involved only 12 youngsters randomly assigned to be in a desegregated school (Zdep, 1971).
Only Zdep's study involved primary or secondary schools, and so it was not accurate to claim in
the 1970’s that randomized experiments had been tried in education and had failed.
But even so, it may be no easier to implement randomized experiments in schools than in
the substantive domains Cronbach et al did examine. Implementation shortfalls have arisen in
many of the more recent experiments in education. Schools have sometimes dropped out of the
different treatment conditions in different proportions, largely because a new principal wants to
change what a predecessor recently did (e.g., Cook, Hunt & Murphy, 1999). Even in the
Tennessee class size study it is striking that the number of classrooms in the final analysis differs
by more than 20% across the three treatment conditions, though the original design may have
called for more similar percentages. Then, there are inadvertent treatment crossovers, as
happened in Cook et al (1999) where one principal in a treatment condition was married to
someone teaching in a control school; where one control principal really liked the treatment and
learned more about it for himself and tried to implement parts of it in his school; and where one
teacher in a control school was the daughter of a program official who visited her school several
times to address the staff about strategies for improving the building. Fortunately, these
crossovers involved only three of the 23 schools, and none of them entailed borrowing the
treatment’s most unique and presumptively powerful components. Still, the planned experimental
contrast was diluted to some degree. Something similar may have happened with the Tennessee
experiment where classrooms of different size were compared within the same school rather than
between schools. How did teachers assigned to larger classes interpret some of their colleagues
teaching smaller classes? Were they dispirited, and did they therefore try less? All experiments have to
struggle with how to create and maintain random assignment and treatment independence. They
do not come easily.
We have seen that random assignment is much more prevalent when evaluating in-school
prevention programs than educational curricula or whole school reforms. Why is this? Some
reasons are organizational, having to do with the priority researchers from public health and
psychology have learned to attribute to clear causal inferences--a priority reinforced by their
journal editors and funders (mostly NIH, CDC and the Robert Wood Johnson Foundation). The
contrast here is with the almost non-existent priority of random assignment in federal and state
Offices of Education and the many foundations funding education reform, though the recent
entry of NSF and NICHD into educational grant-making may change this. Other reasons for the
discipline difference in school-based experiments are substantive. Prevention studies are mostly
concerned with evaluating novel curricula rather than, say, whole school reform; and the
curricula they develop are usually circumscribed in time, rarely lasting even a single year.
Moreover, teachers do not have to be trained to deliver the experimental materials. Researchers
typically do this. And prevention rarely requires the school itself to adopt new forms of staff
coordination. But many educational reforms demand just this and so entail all the stress that goes
with it.
The very real difficulties that arise when implementing experiments in schools suggest a
subtle modification to the usual rationale for experiments. In the real world of education it is not
easy to implement experiments perfectly. The initial randomization process can be faulty;
differential attrition can occur; and the control group can decrease its performance if some of the
individuals within it realize they are being treated differently from others. Over the past decades,
many lessons have been learned about how to minimize these problems (see Boruch 1997). But
selective attrition still occurs; and knowledge of the different treatment conditions still seeps
across planned treatment demarcations. Random assignment creates a logically superior
counterfactual to its alternatives. But it does not follow from this that the technique is perfect in
actual research practice. No rhetoric about experiments as "gold standards" is justified if it
implies that experiments invariably generate impeccable causal inferences. Experimenters
always need to check on how well random assignment and treatment independence have been
achieved and maintained. They cannot be taken for granted.
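
One way to operationalize such checks is sketched below in Python, with a hypothetical data file and column names that are not from any real study: compare baseline scores and attrition rates across arms after the fact, so that large imbalances flag a randomization that has broken down or been eroded by differential dropout.

import pandas as pd
from scipy import stats

# Hypothetical data: one row per school, with columns arm, baseline_score, dropped_out
df = pd.read_csv("schools.csv")

# Baseline balance: did randomization equate the arms before treatment began?
treat = df[df["arm"] == "treatment"]
ctrl = df[df["arm"] == "control"]
t_stat, p_value = stats.ttest_ind(treat["baseline_score"], ctrl["baseline_score"], equal_var=False)
print("baseline difference p-value:", round(p_value, 3))

# Differential attrition: did schools drop out at different rates by arm?
print(df.groupby("arm")["dropped_out"].mean())
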
4. Random assignment is not feasible in education. The small number of randomized
experiments in education may reflect, not researchers' distaste for them, but how difficult they are
to mount. School district officials do not like the focused inequities in school structures or
resources that random assignment generates, fearing negative reactions from parents and staff.
They prefer for individual schools to choose the changes they will make, or else they prefer
changes to be made district-wide. Principals and other school staff share these preferences and
have additional administrative concerns about disrupting ongoing routines. And then there are
the usual ethical concerns about withholding potentially helpful treatments from those in need.
Given this climate, random assignment needs to be locally advocated by people committed to it
and able to counter the objections usually raised (Gueron, 1999). So, it is a significant drawback
that the professional norms fueling the widespread use of random assignment in medicine and
elsewhere are absent in education. If educational evaluators are not likely to fight for more
experimentation in education, then researchers from other disciplines will have to do so.
What does it take to mount randomized experiments in schools? In the Cook et al (1999)
study, random assignment was sponsored by the school district and all district middle schools
had to comply. Principals had no choice about participating in the study or in the treatment they
eventually received. The district took this step because a foundation-funded network of very
prestigious scholars--none from education--insisted on random assignment as a precondition for
funding the program and its evaluation. In a second case (Cook, Hunt & Murphy, 1999), the
principal investigator insisted on random assignment as a precondition for collaborating with the
program implementers, though in this case participation was restricted to schools where
principals wanted the program and said in advance that they were prepared to live with the results
of the randomization coin toss, whatever the outcome. Random assignment was a struggle in
each case, and schools would definitely have preferred to do without it. But no principal had any
difficulty appreciating the method’s logic and most lived with its consequences for up to six
years. (But not all of them. Four Chicago replacement principals from the original 24 insisted on
implementing their own reform program and eliminated the one under evaluation).
In implementing randomization, the roles of political will and disciplinary culture are very
important. Random assignment is common in the health sciences because it is institutionally
supported there by funding agencies, publishing outlets, graduate training programs and the
tradition of clinical trials. Prevention studies in schools tap into this same research culture.
Something similar is emerging in pre-school education where the many experiments are the
product of: (1) congressional requirements to assign at random, as with Early Head Start, the
Comprehensive Child Care Program and the plan for the new national Head Start study; (2) the
high political and scholarly visibility of experiments on the Perry Pre-School Program
(Schweinhart, Barnes & Weikart, 1993), the Abecedarian project (e.g., Campbell & Ramey, 1995)
and Olds’ home visiting nurse program (Olds et al, 1997); and (3) the involvement of researchers
trained in psychology, human development, medicine and micro-economics, these being fields
where random assignment is highly valued. Agriculture is another area of study whose culture
favors random assignment, even in schools (St. Pierre, Cook & Straw, 1981; Connell, Turner &
Mason, 1985). So, too, are marketing and research on the improvement of surveys.
Contrast this with education. Reports from the Office of Educational Research and
Improvement (OERI) are supposed to identify effective practices in primary and secondary
schools. But neither the work of Vinovskis (1998) nor my own haphazard reading of OERI
reports suggests any privilege being accorded to random assignment. Moreover, one recent
report I read on bilingual education repeated old saws about the impossibility of doing
such studies and claimed that alternatives are just as good--in this case quasi-experiments. And at
a recent foundation meeting on Teaching and Learning the representative of a reforming state
governor spoke about a list of best practices his state had developed that was being disseminated
to all schools. He did not care, and he believed no governors care, about the technical quality of
the research generating these lists. The main concern was to have a consensus of education
researchers endorsing each presumptively effective practice. When asked how many of these
best practices depended on randomized experiments, he guessed it would be zero. Several
nationally known educational researchers were present and agreed that such assignment probably
played no role in generating the list. No one felt any distress at this. I surmise that the will to use
experiments in education is so weak not out of ignorance but because researchers believe
that opportunities to conduct such studies are rare and that there is little need for them, given the
feasibility of less noxious alternatives whose internal validity is adequate for creating credible (if
not valid) conclusions. So long as such beliefs are widespread among the elite of educational
research, there can never be the kind of pan-support for experimentation that is currently found in
health, agriculture, or health-in-schools.
What are the research conditions most conducive to randomization? They include: When
the treatment is of shorter duration; when no extensive retraining of teachers is required; when
new patterns of coordination among school staff are minimal; when the demand for an
innovation outstrips the supply; when two or more treatments with similar goals can be
compared; when communication between the units receiving different treatment is not possible;
and when students are the unit of assignment (or perhaps classrooms) rather than entire schools.
Our guess, then, is that it will be more feasible to study different curricula at random; to
introduce new technologies at random; to give students tuition rebates for Catholic schools at
random; to assign more or different forms of homework at random (or by classroom); to assign
teachers trained in different ways to classes at random, etc. None of these studies would be easy;
but all should be feasible so long as the will exists to make random assignment work, so long as
researchers know the most recent lessons about implementing and maintaining random
assignment, and so long as strong fall-back options are built into research designs in case the
original treatment assignment breaks down. In my experience, it will be rare for the biases
resulting from such a breakdown to be greater than those that result from teachers, schools or
students self-selecting themselves into treatments. And the limitations of statistical selection
controls matter less, the smaller the initial selection bias and the better the selection process has
been directly observed (Holland, 1986).
5. Random assignment is premature because it needs strong theory and standardized treatment
implementation, neither of which is routinely available in education. Experimentation is easier
and more fruitful when an intervention is based on strong substantive theory, when it occurs
within well managed schools, when it is reasonable to assume that implementation quality does
not vary much between units, and when the implementation is faithful to program theory.
Unfortunately, these conditions are not often met. Schools tend to be large, complex social
organizations characterized by multiple simultaneously occurring programs, disputatious
building politics and conflicting stakeholder goals. Management is all too often weak and
removed from classroom practice, and day to day politics can swamp effective program planning
and monitoring. So, many reform initiatives are implemented highly variably across districts,
schools, classrooms and students. Indeed, when several different educational models are
contrasted in the same study, the between-model variation is usually small when compared to the
variation between schools implementing the same model (Rivlin & Timpane, 1975; Stebbins et
al, 1978). In school research, it is impossible to assume either standard program implementation
or total fidelity to program theory (Berman & McLaughlin, 1977). Randomized experiments
must seem premature to those whose operating premise is that schools are complex social
organizations with severe management and implementation problems.
However, educational research does not have to be so premised; nor is it impossible to do
experiments in schools that are poorly managed and do not implement well. An earlier model of
schools treated them as physical structures with many self-contained classrooms where teachers
tried to deliver effective curricula using instructional practices that have been shown to enhance
students’ academic performance. This approach privileged curriculum design and instructional
practice, not the school-wide factors that dominate within the current organizational framework--viz., strong leadership, clear and supportive links to the world outside of school, creating a
building-wide communitarian climate focused on learning, and engaging in multiple forms of
professional development, including those relevant to curriculum and teaching matters. Many
important consequences have followed from this intellectual shift in how schools are
conceptualized. One is the lesser profile accorded to curriculum and instructional practice and to
what happens once the teacher closes the classroom door; another is the view that random
assignment is premature, given its dependence on positive school management and quality
program implementation; and another is that quantitative techniques have only marginal utility
for understanding schools, since a school’s governance, culture and management are best
understood through intensive case studies, often ethnographic.
It is a mistake to believe that random assignment requires well-specified program
theories, or good management, or standard implementation, or treatments that are totally faithful
to program theory, though these features do make evaluation easier. Experiments primarily
protect against bias in causal estimates; and only secondarily against the imprecision in these
estimates due to extraneous variation. So, the complexity of schools leads to the need for
randomized experiments with larger sample sizes. Otherwise, experiments conducted in complex
settings tend towards no-difference findings. Cronbach (1982) has noted how rarely researchers
abandon their analysis if this is indeed their initial finding. Instead, they stratify schools by the
degree to which program particulars were implemented and then relate this to variation in
planned outcomes. Such internal analyses make any resulting causal claim the product of the very
non-experimental analyses whose weaknesses random assignment is designed to overcome. This
problem suggests orienting the research focus differently so as: (1) to avoid the need for such
internal analyses by having initial sample sizes that reflect the expected extraneous variation; (2)
to anticipate specific sources of variation and to reduce their influence by design or through
measurement and statistical manipulation; and (3) to study implementation quality as a
dependent variable to ascertain which types of schools and teachers implement the intervention
better. Variable implementation has implications for budgets, sample sizes and the relative
utility of research questions; but it does not by itself invalidate random assignment.
The aim of experiments is not to explain all sources of variation; it is merely to probe
whether a reform idea makes a marginal improvement in staff or student performance over and
above all the other background changes that occur. It is not an argument against random
assignment to claim that many reform theories are under-specified, some schools are chaotic,
treatment implementation is highly variable, and treatments are not completely theory-faithful.
Random assignment does not have to be postponed while we learn more about school
management and implementation. However, the more we know about these matters the better we
can randomize, the more reliable effects are likely to be, and the more experiments there will be
that make management and implementation issues worthy objects of study within the
experiments themselves. No advocate of random assignment will be credible in education who
assumes treatment homogeneity or setting invariance. School-level variation is large and may be
greater than in other fields where experiments are routine. So, it is a reasonable and politic
working assumption to assert that the complexity of schools requires large sample experiments
and careful measurement of implementation quality and other intervening variables as well as
careful documentation of all sources of extraneous variation influencing schools. Black box
experiments are out.
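
A rough sense of what "larger sample sizes" means in practice comes from the standard design-effect formula for cluster-randomized designs, sketched below in Python with invented parameter values: the more that achievement clusters within schools (the intraclass correlation), the more the student-level sample must be inflated when whole schools are the unit of assignment.

# Back-of-the-envelope sketch; all parameter values are assumptions for illustration.

def required_students(n_individual, students_per_school, icc):
    # Design effect for cluster randomization: 1 + (m - 1) * ICC
    design_effect = 1 + (students_per_school - 1) * icc
    return round(n_individual * design_effect)

n_simple = 800    # students needed if individuals could be assigned at random (assumed)
m = 100           # students measured per school (assumed)
for icc in (0.05, 0.10, 0.20):
    n = required_students(n_simple, m, icc)
    print(f"ICC={icc:.2f}: about {n} students, i.e. roughly {n // m} schools")
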
In this connection, it is important to remember two points. First, for many educational
policy purposes it is important to assess what an educational innovation achieves despite the
variation in treatment exposure within treatment groups. Few educational interventions will be
standardized once they are implemented as formal policy. So, why should we expect them to be
standardized in an experimental evaluation? Of course, treatment standardization is desirable
for those researchers interested in testing the substantive theory undergirding an intervention or
in assessing the intervention’s potential as policy (rather than its actual achievements). But
standardization is not a desideratum for educational studies if standard implementation cannot be
expected in the hurly-burly of real educational practice.
Second, one of the few circumstances where instrumental variables provide unbiased
causal inference is within a randomized experiment. What instrument can be better than random
assignment (Angrist, Imbens & Rubin, 1996)? So, given a true experiment, we can have
unbiased estimates of two entities of interest. One is of the effect of “the intent to treat”--the
original assignment variable of policy significance. The other is the effect of spontaneous
variation in exposure within the treatment group, a question that is of theoretical interest, and
dear to the hearts of program developers. Without random assignment, it is difficult to defend
conclusions about the causal effect of spontaneous variation in treatment exposure, given the
inevitability of selection problems and the inability to deal with them definitively. Thus, it is only
with random assignment that one can have one’s cake and eat it, assessing both the policy-relevant effects of treatments that are variably implemented and also the more theory-relevant
effects of spontaneous variation in the amount and type of exposure to program details.
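
A minimal sketch of these two estimates, using simulated data and assumed compliance rates rather than any real study, is given below in Python: the intent-to-treat contrast compares groups as randomized, and the Wald ratio uses random assignment as an instrument to recover the effect of actually receiving the program, in the spirit of Angrist, Imbens & Rubin (1996).

import numpy as np

rng = np.random.default_rng(1)
n = 5_000
assigned = rng.random(n) < 0.5                      # random assignment (the instrument)
# Imperfect implementation: 70% of assigned units receive the program and 10% of
# controls obtain it anyway (both rates are assumptions).
received = np.where(assigned, rng.random(n) < 0.7, rng.random(n) < 0.1)
true_effect_of_receipt = 0.30                       # assumed
y = true_effect_of_receipt * received + rng.normal(size=n)

itt = y[assigned].mean() - y[~assigned].mean()                       # intent to treat
first_stage = received[assigned].mean() - received[~assigned].mean()
wald_iv = itt / first_stage                          # effect of receipt among compliers

print(itt, wald_iv)   # itt is diluted by non-compliance; wald_iv is near 0.30
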
6. Random assignment entails trade-offs not worth making. Random assignment places the
priority on unbiased answers to descriptive causal questions. Few educational researchers share
this priority, and many believe that randomization often compromises more important research
goals. Indeed, Cronbach (1982) has strongly argued against the assertion that internal validity is
the sine qua non of experimentation (Campbell & Stanley, 1963), contending that this neglects
external validity. It is self-evident that experiments are often limited in time and space, with
nation-wide experiments being rare. Moreover, they are often limited to schools willing to
surrender choice over which treatment they will receive and willing to tolerate the measurement
burdens required for assessing implementation, mediating processes and individual outcomes.
What kinds of schools are these? Would it not be preferable, Cronbach asks, to sample from a
more representative population of schools even if less certain causal inferences result from this?
A second trade-off places scientific purity above utility. Why should experimenters value
uncertainty reduction about cause so much that they use a highly conservative statistical criterion
(p < .05)? In their own lives, would they not use a much more liberal risk calculus when deciding
whether to adopt a potentially life-saving therapy? And why should statistical traditions be so
strict that schools not implementing the treatment are included in the analysis as though they had
been treated? He sees these as two examples of a counterproductive conservative bias that is
designed to protect against wrongly concluding that a treatment is effective but that, in so doing,
runs the risk of failing to detect true treatment effects that would emerge by other criteria. He
also worries about the purism of experimental enthusiasts who refuse to explore the data for
unplanned treatment interactions with student or teacher characteristics, and who see unplanned
variation in implementation as a cause for concern rather than an opportunity to explore why the
variation occurred and what its consequences might be. He also notes how poorly some
experimental questions are framed, and argues for escaping from the original framing whenever
more helpful questions emerge during a study--even if unbiased answers to these new questions
are not possible. Better relevant than pure might be the motto. After all, many experiments take
so long to plan, mount and analyze that answering the questions they pose could be of more
antiquarian than contemporary interest.
Another trade-off experiments force, in his opinion, is between two conceptions of cause.
The most prized questions in science are about generative causal processes like gravity, relativity,
DNA, nuclear fusion, aspirin, personal identity, infant attachment, or engaged time on task. Such
constructs index instantiating processes capable of bringing about important effects in a
multitude of different settings. This makes them more general in application than, for example,
learning whether one way of organizing instruction affected achievement at a particular moment
in past time in the particular set of schools that volunteered to be in an experiment. Like most
other educational evaluators, Cronbach wants evaluations to explain why programs work and, to
this end, he is prepared to tolerate more uncertainty about whether they work. So, he rejects the
experimental tradition, believing that it often creates research conditions that do not closely
approximate educational practice and that it ignores the exploration of those explanatory
processes that might be transferable across a wide range of schools and students. Cronbach
wants evaluation to pursue the scholarly goal of full explanation, but not necessarily using the
quantitative methods generally preferred for this in science. In his view, the methods of the
historian, journalist and ethnographer are just as good for learning what happened in a reform and
why. The objectives of science remain privileged, but not its methods.
A final trade-off is between audiences for the research. Experiments seek to maximize the
truth about the consequences of a planned reform, and their intended audience is usually some
policy-making group. Experiments rarely seek to help those who work in the sampled schools.
Such persons rarely want what an experiment produces-- information summarizing what a reform
has achieved. Moreover, they can rarely wait to act until an experiment is completed. They need
continuous feedback about their own practice in their own school. Putting the needs of service
deliverers above those of amorphous policy-makers requires that evaluators prioritize utility
over truth and continuous feedback over waiting for a final report. A recent letter to the New
York Times captured this preference: “Alan Krueger ... claims to eschew value judgments and
wants to approach issues (about educational reform) empirically. Yet his insistence on
postponing changes in education policy until studies by researchers approach certainty is itself a
value judgment in favor of the status quo. In view of the tragic state of affairs in parts of public
education, his judgment is a most questionable one.” (W. M. Petersen, April 20, 1999).
These criticisms of the trade-offs implicit in experimentation remind us that experiments
should not be conducted unless there is a clear causal question that has been widely probed for its
presumed long-term utility to a wide range of policy actors, of whom school personnel are surely
one. They also remind us that, while experiments optimize on causal descriptive questions, this
need not preclude either examining the reasons for variation in implementation quality or seeking
to identify the processes through which a treatment influences an effect. The criticisms further
remind us that experiments do tend to be conservative, sometimes so preoccupied with bias
protection that other types of knowledge are secondary. But this need not be so. There is no
compelling need for such stringent alpha rates; only statistical convention is at play here, not
statistical wisdom. Nor need one restrict data analyses to the intent-to-treat group, though such
analyses are needed among the total set reported. Nor need one close one’s eyes to all statistical
interactions, so long as the probes are done with substantive theory and statistical power in mind,
so long as internal and external replications are used to check against profligate error rates, and
so long as conclusions are couched more tentatively than those that directly result from random
assignment. Researchers can also try to replicate any results using the best available non-experimental analyses of data collected from formally representative samples, though such
analyses would be difficult to interpret by themselves. Finally, many controlled experiments
would be improved by collecting ethnographic data in all the treatment groups. This would help
identify possible unintended outcomes and mediating processes, while also providing continuous
feedback to experimental and control schools alike. Experiments need not be as rigid as some of
the more compulsive texts on clinical trials paint them.
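To make the earlier point about alpha rates concrete, a brief illustration with invented numbers shows how relaxing the conventional .05 criterion raises the power to detect a modest effect in a fixed two-group design; it is a sketch only, using the power routines in statsmodels.

```python
# Illustration of the alpha/power trade-off for a two-group comparison
# (effect size and sample size are invented for the example).
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for alpha in (0.05, 0.10, 0.20):
    power = analysis.power(effect_size=0.2, nobs1=200, ratio=1.0, alpha=alpha)
    print(f"alpha = {alpha:.2f} -> power = {power:.2f}")
```

The looser criteria trade a higher risk of falsely declaring a treatment effective for a lower risk of overlooking one, which is precisely the risk calculus Cronbach has in mind.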
Even so, some bounds cannot be crossed. All the above suggestions involve adding to
experiments, not replacing them. To conduct a study with randomly selected schools but no
random assignment to treatments would be to gamble on achieving wide generalizability of what
may not be a dependable causal connection. Conversely, to embark on an experiment
presupposes the cardinal utility of causal connections; otherwise one would not do such a study
in the first place. Assuming adequate statistical power, experiments protect against
recommending ineffective changes and against overlooking effective ones. So the value of
experiments depends on avoiding the costs that would occur if one wrongly concluded that
school vouchers raise achievement, that small schools are superior to larger ones, or that Catholic
schools are superior to public ones in the inner city.
It is now 30 years since vouchers were proposed, and we still have no completed
experiments on them and no clear answers about them. It is 30 years since Comer began his
work on the School Development Program, and almost the same situation holds. It is 15 years
since Levin began accelerated schools, and here too we have no experiments and no answers.
But we need to tread carefully here lest our advocacy of experiments be overgeneralized.
Premature experimentation is a real danger. There is little point evaluating educational reforms
that are theoretically muddled or that are not implementable by ordinary school personnel. But
even so, these decade-long time lines are inexcusable. The Obey-Porter legislation cites Comer’s
program as a proven exemplar. But when the legislation passed, the only evidence about the
program available to Cook et al (1999) consisted of testimony, a dozen or so empirical studies by
the program’s own staff that used primitive quasi-experimental designs, and one large-scale
study that confounded the court-ordered introduction of the program with a simultaneously
ordered reduction in class sizes of 40% (Comer, 1988). To be restricted to such a data base
verges on the irresponsible. However, Comer’s program is not much different from any other in
this regard, Success for All being somewhat of an exception. The trade-off Cronbach is prepared
to make that favors generalization over causal knowledge runs just the risk exemplified by
Comer’s program. There are many studies covering over 20 years in many school districts and
times, but none of them is worth much as a study of cause. Experiments are not meant to be
representative; they are meant to be the strongest possible tests of causal hypotheses.
Yet causal statements are much more useful if they are the product of experiments and are
also demonstrably generalizable in their results across various types of students, teachers, settings
and times--or if the boundary conditions limiting generalization are empirically specified.
Generalizing causal connections is always a problem with single experiments (Cook, 1993), but
especially so in education. Unlike in medicine or public health, there is no tradition of multi-site
experiments with national reach. Typically we find single experiments of unclear reach, done
only in Milwaukee or Washington or Chicago or Tennessee. And with many kinds of school
reform there is no fixed protocol. So, it is possible to imagine implementing vouchers, charter
schools or Total Quality Management in many different ways both across and within districts.
Indeed, the Comer programs in Prince George’s County, Chicago and Detroit are different from
each other in many ways, given how much latitude districts are supposed to have in how they
define and implement many of the program specifics. So, the non-standardization we find
associated with many educational treatments requires even larger samples of settings than is
typical in clinical trials in medicine or public health. Getting cooperation from so many
schools is not easy, given the absence of a tradition of random assignment and the considerable
local control we find in American education today. Nonetheless, experiments can be conducted
involving more schools than is the case with the rare experiments we find today.
But very large experiments may not be wise. A sampling approach to causal
generalization emphasizes the goal of generating a causal estimate that describes whatever
population of schools, students, teachers or sites is of specific policy interest. To achieve this, the
emphasis is on sampling with known probability. As admirable as this traditional sampling
method is, it has rarely been the path to generalization in the experimental sciences. There are
many reasons for this. Many of the populations it is practical to achieve are of only parochial
interest; random sampling is hardly relevant to the need to estimate causal relationships across
time; and random sampling cannot be used to select the outcome measures and treatment variants
with which we try to represent general cause and effect constructs. So, bench scientists use a
generalization model that emphasizes the consistency with which a causal relationship replicates
across multiple sources of heterogeneity (Cook, 1993). Can the same causal relationship be
observed across different laboratories, time periods, regions of the country and ways of
operationalizing the cause and effect? This heterogeneity-of-replication model is what undergirds
clinical trials and meta-analysis, and it requires many smaller experiments done at different
times, in different settings, with different kinds of respondents and with different measures and
different local adaptations of a treatment. The physical sciences have progressed by means of
such replication because knowledge claims are often replicated in the next stage of research on a
phenomenon. But in research areas with weaker (and more expensive) traditions of replication--as in education--replication cannot be haphazard. Instead, it might be based on conducting a
number of smaller (but adequately statistically powered) experiments staggered over the years.
But whatever the merits of phased programs of experiments, the point is that experiments do not
have to be both small and stand-alone things. Single experiments will not produce definitive
answers, however large they are. And they certainly will not answer all ancillary questions about
causal contingencies. Like all social science, experiments need a cumulative context if
dependable knowledge is to result.
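One simple way results from such a phased program of smaller experiments might be combined is the familiar inverse-variance pooling used in meta-analysis; the sketch below uses invented effect estimates, not results from any study cited here.

```python
# Hypothetical pooling of effect estimates from several small experiments.
import numpy as np

effects = np.array([0.25, 0.10, 0.32, 0.18])   # estimated effects from four studies
ses = np.array([0.12, 0.15, 0.10, 0.20])       # their standard errors
w = 1.0 / ses**2                               # inverse-variance weights
pooled = np.sum(w * effects) / np.sum(w)       # fixed-effect pooled estimate
pooled_se = np.sqrt(1.0 / np.sum(w))
print(f"pooled effect = {pooled:.3f} (SE = {pooled_se:.3f})")
```

In practice a random-effects model would usually be preferred when the replications differ in settings, respondents and treatment variants, since it allows the true effects to vary across studies.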
Sampling concerns are less salient when causal generalization is understood, not as
achieving a single causal estimate or generating a set of effect sizes from heterogeneous studies
of the same general hypothesis, but as the identification of generative processes. Engaged time-on-task is an example of this, for it is supposed to stimulate student achievement through such
diverse activities as more homework, summer classes, longer school days, and more interesting
curricula. And many of these procedures can be implemented in any school district, in any
country, at any time. Of the methods available for identifying explanatory processes, some
require collecting data about each of the variables in the presumed mediating processes and then
using these measures in causal models. Others require historians or ethnographers to explore
what happened and why in each treatment group. Although the knowledge of explanatory
processes achieved by either of these methods will be tentative, it is still worthwhile as part of
the information base from which conclusions are eventually drawn about novel mechanisms that
bring about important results. Cronbach's advocacy of causal explanation over causal description
does not reflect a genuine duality. Experiments should explore all causal mechanisms that have
an obvious potential to be generative; they should explain the consequences of interventions as
well as describe them.
7. Experiments assume an invalid model of research utilization. Experimentation can be
construed as recreating the classical model of rational decision-making. In this model, one first
lays out the alternatives (the treatments); then one decides on decision criteria (the outcomes);
then one collects information on each criterion for each treatment (data collection), and finally
one uses the observed effect sizes and whatever utilities can be attached to them to make a
decision about the merits of the contending alternatives. Unfortunately, empirical work on the
use of social science data in policy reveals that such instrumental use is rare (Weiss & Bucuvalas,
1977; Weiss, 1988). Use is more diffuse, better described by an "enlightenment" model that
blends information from existing theories, personal testimony, extrapolations from surveys, the
consensus of a field, claims from experts who may or may not have interests to defend, and novel
concepts that are au courant and broadly applied--like the urban underclass or social capital in
sociology today. No privilege is extended to science in this conception; nor to experiments.
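For concreteness, the classical model sketched at the start of this section can be reduced to a small computation: score each alternative on each decision criterion, weight the scores by the utilities attached to the criteria, and choose the alternative with the highest total. The numbers below are invented purely to illustrate the logic that, the critics argue, is rarely followed in practice.

```python
# Toy version of the classical rational-choice model of research use.
# Rows are treatments (alternatives); columns are outcomes (decision criteria).
import numpy as np

effect_sizes = np.array([[0.20, 0.05],   # treatment A: achievement, attendance
                         [0.10, 0.30]])  # treatment B
utilities = np.array([0.7, 0.3])         # weights a decision maker attaches to each outcome
scores = effect_sizes @ utilities        # weighted value of each alternative
best = ["A", "B"][int(np.argmax(scores))]
print(f"scores = {scores}, preferred alternative = {best}")
```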
Empirical research on research utilization also notes that decisions are multiply rather
than singly determined, with central roles being played by politics, personality, windows of
opportunity and values. Given this, scientific information of the type experiments produce can
play at best a marginal role. Also, many decisions are not made in the systematic sense of that
verb, but instead are slipped into or accrete, with earlier small decisions constraining later large
ones. And we should also not forget that official decision-making bodies turn over in
composition, with new persons and issues replacing older ones. When studies take a long time to
complete--as with many experiments -- results may not be available until the policy agenda has
changed. So, many experiments turn out to be verdicts in history books rather than inputs into
current reform debates, making research utilization appear more complex than the rational choice
model implies.
Critics also point out that experiments rarely provide uncontested verdicts. Disputes
typically arise about whether the original causal question was framed as it should have been,
whether the claimed results are valid, whether all relevant outcomes were assessed, and whether
the proffered recommendations follow from the results. All social science findings tend to meet
with a disputed reception, if not about the quality of answers, then about the questions not
addressed. The logical control over selection that makes experiments so valuable does not mean
they are seen to put to rest all quibbles about causal claims. Consider in this regard the
reexaminations of the Milwaukee voucher study and the very different conclusions offered about
whether and where there were effects (Witte, 1998; Greene, Peterson, Du, Boeger & Frazier,
1996). Consider, also, the Tennessee class size experiment and the different effect sizes that
have been generated (Finn & Achilles, 1990; Mosteller, Light & Sachs, 1996; Hanushek, 1999).
Sometimes, real scholarly disagreements are at issue, while in other cases the disputes reflect
stakeholders protecting their interests. Policy insiders use multiple criteria for making decisions,
and scientific knowledge of causal influences is never uniquely determinative.
Moreover, close examination of cases where the claim was made that policy changed
because of experimental results suggests some oversimplification. The Tennessee class size
experiment has been examined from this perspective (see the special summer 1999 issue of
Educational Evaluation and Policy Analysis), and in this connection it is worth noting that the
results were in line with what a much-cited meta-analysis had earlier described (Glass & Smith,
1979). They were also consonant with theories that say children gain more if they are engaged
and on-task. They also conformed with teachers' hopes and expectations. Further, they were in
line with parents' commonsense notions and so were acceptable to them. And the results came
when the governor of the state had national political ambitions, was using education as an
individuating policy priority, had the state resources to increase educational investments, and
knew his actions would be popular both with the teachers’ unions (mostly Democrats) and with
business interests in his own Republican party. So, a close explanatory analysis suggests that his
policy change was probably not due to random assignment alone, though even closer analysis
may well show that such assignment played a marginal role in creating the use that followed.
Experiments exist on a smaller scale than would typically pertain if the services they test
were to be implemented state- or nation-wide. This scale issue is serious (Elmore, 1996).
Consider what happened when the class size results were implemented state-wide in Tennessee
and California. This new policy was begun at a time of a growing national teacher shortage due
to demographic shifts that exacerbated teacher-poaching across state and district lines and
allowed richer school districts to recruit more and better teachers. Since the innovation inevitably
requires many new teachers, some individuals were able to leave their old non-teaching jobs and
become fledgling teachers. Were these new entrants equal in quality to existing teachers?
Problems also arose in creating the additional classrooms that lower class sizes require, leading
to more use of trailers and dilapidated buildings. Are the benefits of smaller classes undercut by
worse physical plant? Small scale experiments cannot mimic all the details relevant to
implementing a policy on a broader scale, though they should seek to approximate these details
as best they can.
There is some substance to arguments about the misfit between the theory of use
undergirding randomized experiments and the ways in which social science data are actually
used. But the objections are exaggerated. Instrumental use does occur (Chelimsky, 1987), and
more often than the very low base rates readers might infer from most of the research denigrating
instrumental usage. Moreover, one reason why some study results are widely disseminated is
probably that random assignment confers credibility on them, at least in some quarters. This
certainly happened with the Tennessee class size study and the pre-school studies cited earlier.
Studies can even be important if political events have rendered their results obsolete. This is
because many policy initiatives get recycled --as happened with school vouchers-- and because
the texts used to train professionals in a field often contain the results of past studies, using them
for the light they throw on particulars of professional practice (Leviton & Cook, 1983).
There is no necessary trade-off between instrumental and enlightenment usage.
Experiments also contribute to general enlightenment--about the kinds of interventions most and
least implementable; about the influence principal turnover has on school management; about the
utility of theories that fail to specify and monitor what happens once the teacher closes her
classroom door; about the kinds of principals who are most attracted to school-based
management theories; about the kinds of teachers most amenable to professional development,
etc. The era of black box experiments is long past. Within experiments we now want and need
to learn about the determinants and consequences of implementation quality; we want to describe
and analyze constructs from the substantive theory undergirding a program; we want to collect
qualitative as well as quantitative data so long as the data collection protocol is identical in all
treatment groups; and we want to get stakeholder groups involved in the formulation of guiding
experimental questions and in interpreting the substantive relevance of findings. All these
procedures help generate enlightenment as well as instrumental usage.
8. Random assignment is not needed since better alternatives already exist. It is a truism that no
social science method will die, whatever its imperfections, unless a demonstrably better or
simpler method is available to replace it. Most researchers who evaluate educational reforms
believe that superior alternatives to the experiment already exist, and so they are willing to let it
die. These methods are superior, they believe, because they are more acceptable to school
personnel, because they reduce enough uncertainty about causation to be useful, and because they
are relevant to a broader array of issues than causation alone. Although no single alternative is
universally recommended, intensive case studies seem to be preferred most often.
a. Intensive Case Studies. The call for intensive case studies in educational evaluation came
from three primary sources. One was from scholars who first made their names as quantitative
researchers and then switched to qualitative case methods out of a disillusionment with positivist
science and the results of the earliest educational evaluations (e.g., Guba; Stake; House). Even
Cronbach (1982) came to assert that the appropriate methods for educational evaluation are those
of the historian, journalist and ethnographer, pointedly not the scientist. Converts often have a
special zeal and capacity to convince, and when former proponents of measurement and classical
experimental design were seen attacking what they had once espoused this may have had a strong
influence on young students of methods in education. The second source of influence was
evaluators within education who had been trained in qualitative methods, like Eisner or
Fetterman. Third were those scholars with doctorates in Sociology and Political Science who
entered schools of education to pursue the newly emergent research agenda to improve school
management and program implementation. These newcomers were nearly all qualitative
researchers with an organizational or policy research background, fresh from the
qualitative/quantitative method disputes in their original disciplines.
Some countervoices in support of experiments could have been raised. After all, some
psychometricians within education refused to be converted and the more elite graduate schools of
education hired scholars trained in statistics, economics and psychology. But none of these
groups worked hard in print to advocate for experiments--Boruch and Murnane excepted; and
none of them conducted experimental evaluations in education that would have demonstrated their
feasibility and utility. Instead, the economists in education tended to work on issues related to
higher education and the transition from school to work, while the methodologists from other
disciplines worked on topics tangential to causal inference--e.g., on novel psychometric
techniques, on improving meta-analysis, on re-conceptualizing the measurement of individual
change, and on developing hierarchical models that simultaneously include school, student and
intra-individual change factors. So, while three sets of voices converged to denigrate
experimental research in education, none of the likely countervoices were heard from.
A central belief of advocates of qualitative methods is that the techniques they prefer are
uniquely suited to providing feedback on the many different kinds of questions and issues
relevant to a school reform--questions about the quality of problem formulation, the quality of
program theory, the quality of program implementation, the determinants of quality
implementation, the proximal and distal effects on students, unanticipated side effects, teacher
and student subgroup effects and assessing the relevance of findings to different stakeholder
groups. This flexibility for generating knowledge about so many different topics is something
that randomized experiments cannot match. Their function is much more monolithic, and the
assignment process on which they depend can impose obvious limits to a study’s generalizability.
Advocates for qualitative methods also contend that they reduce some uncertainty about
causal connections. Cronbach (1982) agrees that this might be less than in an experiment, but he
contends that journalists, historians and ethnographers nonetheless regularly learn the truth about
what caused what, as do many lay persons in their everyday life. Needed for a causal diagnosis
are only a relevant hypothesis, observations that can be made about the implications of this
hypothesis, reflection on these observations, reformulations of the original hypotheses, a new
round of observations, a new round of reflections on these observations, and so on until closure is
eventually reached. Theorists of ethnography have long advocated such an empirical and
ultimately falsificationist procedure (e.g., Becker, 1958), claiming that the results achieved from
it will be richer than those from black box experiments because ethnography requires close
attention to the unfolding of explanatory processes at different stages in a program’s
development.
I do not doubt that hard thought and non-experimental methods can reduce some of the
uncertainty about cause. I also believe that they sometimes reduce all reasonable uncertainty,
though it will be difficult to know when this happens. However, I doubt whether intensive,
qualitative case studies can generally reduce as much uncertainty about cause as a true
experiment. This is because such methods rarely involve a convincing causal counterfactual.
Typically, they do not have comparison groups, making it difficult to know how a treatment
group would have changed in the absence of the reform under analysis. Adding control groups
helps, but unless they are randomly created it will not be clear whether the two groups would
have changed over time at similar rates. Whether intensive case methods reduce enough
uncertainty about cause to be generally useful is a poorly specified proposition we cannot answer
well, though it is central to Cronbach's case. Still, it forces us to note that experiments are only
justified when a very high standard of uncertainty reduction is required about a causal claim.
Three other factors favoring case studies are more clear. First, schools are less squeamish
about allowing in ethnographers than experimentalists (under the assumption that the
ethnographers are not in classes and staff meetings every day). Second, case studies produce
human stories that can be effectively used to communicate results to readers and listeners. And
finally, feeding interim evaluation results back to teachers and principals with whom an
ethnographer has an ongoing relationship is especially likely to generate local use of the data
collected. Such use is less grandiose than the usual aspiration for experiments--to guide policy
changes that will affect large numbers of districts and schools. But in the work of educational
evaluators like Stake, Guba and Lincoln, such local use is a desideratum, given how unsure they
are that policy dictates issued from central authorities will ever be complied with once the
classroom door closes.
The advantages of case study methods for evaluation are considerable. But we value these
methods for their role within experiments rather than as alternatives to them. Case study
methods complement experiments whenever a causal question is clearly central and it is not clear
how successful program implementation will be, why implementation shortfalls may occur, what
unexpected effects are likely to emerge, how respondents interpret the questions asked of them,
what the causal mediating processes are, etc. Since it will be rare for any of these issues to be
clear, qualitative methods have a central role to play in experimental work on educational
interventions. They cannot be afterthoughts.
b. Theory-Based Evaluations. It is currently fashionable in many foundations and some scholarly
circles to espouse a non-experimental theory of evaluation for use in complex social settings like
communities and schools (Connell et al, 1995). The method depends on: (1) explicating the
substantive theory behind a reform initiative and detailing all the flow-through relationships that
should occur if the intended intervention is to impact on some major distal outcome, like
achievement gains; (2) measuring each of the constructs specified in the substantive theory; and
(3) analyzing the data to assess the extent to which the postulated relationships have actually
occurred through time. For shorter time periods, the data analysis will involve only the first part
of a postulated causal chain; but over longer periods the complete model might be involved.
Obviously, this conception of evaluation places a primacy on highly specific substantive theory,
high quality measurement, and the valid analysis of multivariate explanatory processes as they
unfold in time.
Several features make it seem as though theory-based evaluation can function as an
alternative to random assignment. First, it is not necessary to construct a causal counterfactual
through random assignment or creating closely matched comparison groups. A study can involve
only the group experiencing a treatment. Second, obtaining data patterns congruent with
program theory is assumed to indicate the validity of that theory, an epistemology that does not
require explicitly rejecting alternative explanations--merely demonstrating a match between the
predicted and obtained data. Finally, the theory approach does not depend for its utility on
attaining the end points typically specified in educational research--usually a cause-effect
relationship involving student performance. Instead, if any part of the program theory is
corroborated or disconfirmed at any point along the causal chain, this can be used for interim
purposes--to inform program staff, to argue for maintaining the program, to provide a rationale
for acting as though the program will be effective if more distal criteria were to be collected and
to defend against premature summative evaluations that declare a program ineffective before
sufficient time has elapsed for all the processes to occur that are presumed necessary for ultimate
change. So, no requirement exists to measure the effect endpoints most often valued in
educational studies.
Few advocates of experimentation will argue against the greater use of substantive theory
in evaluation; and since most experiments can accommodate more process measurement, it
follows they will be improved thereby. Such measurement will make it possible to probe, first,
whether the intervention led to changes in the theoretically specified intervening processes and,
second, whether these processes could then have plausibly caused changes in distal outcomes.
The first of these tests will be unbiased because it relates each step in the causal model to the
planned treatment contrast. But the second test will entail selection bias if it depends only on
stratifying units according to the extent the postulated theoretical processes have occurred. Still,
quasi-experimental analyses of the second stage are worth doing provided that their results are
clearly labeled as more tentative than the results of planned experimental contrasts. The utility of
measuring and analyzing theoretical intervening processes is beyond dispute; the issue is whether
such measurement and analysis can alone provide an alternative to random assignment.
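A minimal sketch, with simulated data, of the two probes just described: the first regression relates a theoretically specified intervening process to the randomized treatment contrast and is unbiased, while the second relates the distal outcome to that process and is vulnerable to selection on unmeasured factors, so its results should be read more tentatively.

```python
# Sketch of probing program theory inside an experiment (simulated data).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 2_000
t = rng.integers(0, 2, n)                     # randomized treatment indicator
u = rng.normal(size=n)                        # unmeasured school characteristic
m = 0.5 * t + 0.4 * u + rng.normal(size=n)    # intervening process (e.g., implementation)
y = 0.6 * m + 0.4 * u + rng.normal(size=n)    # distal outcome (e.g., achievement)

# Probe 1 (unbiased): did the treatment move the theoretically specified process?
probe1 = sm.OLS(m, sm.add_constant(t)).fit()
# Probe 2 (selection-prone): the process-outcome link is inflated here by the
# unmeasured characteristic u, illustrating why it deserves more tentative labeling.
probe2 = sm.OLS(y, sm.add_constant(m)).fit()
print(probe1.params, probe2.params)
```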
I am skeptical. First, it has been my experience writing papers on the theory behind a
program with its developer (Anson, Cook, Habib, Grady, Haynes & Comer, 1991) that the theory
is not always very explicit. Moreover, it could be made more explicit in several different ways,
not just one. Is there a single theory of a program, or several possible versions of it? Second,
there is the problem that many of these theories seem to be too linear in their flow of influence,
rarely incorporating reciprocal feedback loops or external contingencies that might moderate the
entire flow of influence. It is all a little bit too neat for our more chaotic world. Third, few
theories are specific about timelines, specifying how long it should take for a given process to
affect some proximal indicator. Without such specifications it is difficult to know, when
apparently disconfirming results are at hand, whether the next step in the model has not yet
occurred or will never occur because the theory is wrong in this particular. Fourth, the method
places a great premium on knowing, not just when to measure, but how to measure. Failure to
corroborate the model could be the result of partially invalid measures as opposed to the
invalidity of the theory itself--though researchers can protect against this with more reliable
measures. Fifth is the epistemological problem that many different models can usually be fit to
any single pattern of data (Glymour, Spirtes & Scheines, 1987), implying that causal modeling is
more valid when multiple competing models are tested against each other rather than when a
single model is tested.
The biggest problem, though, is the absence of a valid counterfactual to inform us about
what would have happened to students or teachers had there been no treatment. As a result, it is
impossible to decide whether the events observed in a study genuinely result from the
intervention or whether they would have occurred anyway. One way to guard against this is with
"signed causes" (Scriven, 1976), predicting a multivariate pattern among variables that is so
unique it could not have occurred except for the reform. But signed causes depend on the
availability of much well-validated substantive theory and very high quality measurement (Cook
& Campbell, 1979). So, a better safeguard is to have at least one comparison group, and the best
comparison group is a randomly constructed one. So, we are back again with random assignment
and the proposition that theory-based evaluations are useful as complements to randomized
experiments but not as alternatives to them.
c. Quasi-Experiments. The vast majority of educational evaluators favor intensive case studies,
and a few of them plus some outsiders from psychology are now espousing theory-based
methods. But many educational researchers do not aspire to improve evaluation theory or
practice. They only want to test the effectiveness of changes in practice within their own
education sub-field. In our experience, such researchers tend to use quasi-experiments, as with
the early evaluations of Comer’s program noted earlier and those Herman (1998) detailed for
Success for All. Indeed, almost all the quantitative studies Herman (1998) cites to back claims
about effective whole-school reforms depend on quasi-experiments rather than experiments or
qualitative studies. This last is surprising, given how much emphasis theorists of educational
evaluation have placed on qualitative work. Has their advocacy had little influence on researchers
studying the effectiveness of educational interventions, or does their absence from the
effectiveness part of Herman’s analysis merely reflect bias on the part of her research team? We
do not know. But we do know that it is not easy to integrate qualitative case studies into the
summary picture of a school reform’s effectiveness when the database also includes some
quantitative studies.
Quasi-experiments are identical to experiments in their purpose and in most details of
their structure, the defining difference being the absence of random assignment and hence of a
demonstrably valid causal counterfactual. Quasi-experimentation is nothing more than the
search, more through design than statistical adjustment techniques, to create the best possible
approximation (or approximations) to the missing counterfactual. To this end, there are
invocations (Corrin & Cook, 1998; Shadish & Cook, in press) to create stably matched
comparison groups; to use age or sibling controls; to measure behavior at several time points
before a treatment begins, perhaps even to create a time-series of observations; to seek out
situations where units are assigned to treatment solely because of their score on some scale; to
assign the same treatment to different stably matched groups at different times so that they can
alternate in their functions as treatment and control groups; and to build multiple outcome
variables into studies, some of which should theoretically be influenced by a treatment and others
not. These are the most important elements from which quasi-experimental designs are created
through a mixing process that tailors the design achieved to the nature of the research problem
and the resources available.
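One of the design elements listed above, assigning units to treatment solely because of their score on some scale, is the basis of the regression-discontinuity design; the sketch below, with simulated data and an invented cutoff, shows the standard estimate of the jump in outcomes at that cutoff.

```python
# Regression-discontinuity sketch: assignment determined solely by a score and a cutoff.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 3_000
score = rng.uniform(0, 100, n)                  # assignment score (e.g., prior achievement)
treated = (score < 50).astype(int)              # units below the cutoff receive the program
y = 0.02 * score + 0.25 * treated + rng.normal(scale=0.5, size=n)

centered = score - 50                           # center the score at the cutoff
X = sm.add_constant(np.column_stack([treated, centered, treated * centered]))
fit = sm.OLS(y, X).fit()
print(f"estimated jump at the cutoff = {fit.params[1]:.3f}")   # close to the true 0.25
```

The credibility of such an estimate rests on units near the cutoff being comparable except for treatment status, which is why the design is usually ranked just below the randomized experiment.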
However good they are, quasi-experiments are always second-best to randomized
experiments. In some quarters, "quasi-experiment" has been used promiscuously to connote any
study that is not a true experiment, that seeks to test a causal hypothesis, and that has any type of
non-equivalent control group or pre-treatment observation. Yet Campbell and Stanley (1963) and
Cook and Campbell (1979) labeled some such studies as "generally causally uninterpretable".
Many of the studies educational researchers call "quasi-experiments" are of this last causally
uninterpretable kind and lag far behind the state of the art. Reading them indicates that little
thought has been given to the quality of the match when creating control groups; to the
possibility of multiple hypothesis tests rather than a single one; to the possibility of generating
data from several pre-treatment time points rather than a single one; or to having several
comparison groups per treatment, with one initially outperforming the treatment group and the
other underperforming so as to create bracketed controls. Reading quasi-experimental studies of
educational reform projects is dispiriting, so weak are the designs and so primitive are the
statistical analyses. Not all quasi-experimental designs and analyses are equal.
Although I am convinced that the best quasi-experiments give reasonable approximations
to the results of randomized experiments, to judge by the quality of the work on education I know
best, the average quasi-experiment is lamentable in the confidence it inspires in the causal
conclusions drawn. Recent advances in the design and analysis of quasi-experiments are not
getting into educational research where they are sorely needed. Nonetheless, the best estimate of
any quasi-experiment’s internal validity will always be to compare its results with those from a
randomized experiment on the same topic. When this is done systematically, it seems that quasi-experiments are less efficient than randomized experiments, providing more variable answers
from study to study (Lipsey and Wilson, 1993). So, in areas like education where few studies
exist on most of the reform ideas being currently debated, randomized experiments are
particularly needed and it will take fewer of them to arrive at what might be the same answer
better quasi-experiments would also achieve later. But irrespective of such speculation, we can
be sure of one thing: in the eyes of nearly all scholarly observers, the answers obtained from
experiments will be much more credible than those obtained from quasi-experiments.
Conclusions
1. Random assignment is very rare in research on the effectiveness of strategies to improve
student academic performance, though it is widely acknowledged as premier for answering
causal questions of this type.
2. Random assignment is common in schools when the topic is preventing negative behaviors or
feelings. Intellectual culture may be one explanation for this difference in prevalence. Prevention
researchers tend to be trained in public health and psychology where random assignment is
esteemed and where funders and journal editors are clear about their preference for the technique.
The case is quite different with evaluators who operate out of Schools of Education. Capacity
may be another explanation. Most school-based prevention experiments are of short duration,
involve curriculum changes, the implementers are usually researchers, and the research topics
may engage educators less than issues of school governance and teaching practice.
3. Nearly all educational evaluators seem to believe that experiments are not worthwhile. They
cite many reasons for this, including: The theory of causation that buttresses experiments is
naive; experiments are not practical; they require unacceptable trade-offs that lower the quality of
answers to non-causal questions; they deliver a kind of information rarely used to change policy;
and the evaluative information needed can be gained using different and more flexible methods.
4. Some of these beliefs are better justified than others. Beliefs about the viability of alternatives
to the experiment are particularly poorly warranted since no alternative provides as convincing a
causal counterfactual as random assignment. However, the better criticisms suggest
useful and practical additions that can be made to the bare bones structure and function of
experiments. Of special relevance here is the importance of describing implementation quality, of
relating implementation to outcome changes, of describing program theory, of measuring
whether theoretically specified intervening processes have occurred, and of relating these
intervening processes to student outcomes.
5. Random assignment should not be considered a "gold standard". It creates only a probabilistic
equivalence between groups at the pretest; treatment-correlated attrition is likely when treatments
differ in intrinsic desirability; treatments are not always independent; and the means used to
increase internal validity often reduce external validity. More appropriate rationales for random
assignment are that, even after all the limitations above are taken into account, (1) it still provides
a logically more valid causal counterfactual than its alternatives; (2) it almost certainly provides a
more efficient counterfactual; and (3) it provides a counterfactual that is more credible in nearly
all academic circles. Taken together, these are compelling rationales for random assignment,
though not for any hubris on the part of advocates of experimentation.
6. It will not be possible to persuade the current community of educational researchers to do
experiments simply by outlining the advantages of the technique and describing newer methods
for implementing randomization. Most educational evaluators share some of the antiexperimental beliefs outlined above and also think that a scientific model of knowledge growth is
out-of-date. Advocates of experimentation need to be explicit about its real limits and to consider
seriously whether the practice of experimentation can be improved by incorporating some of the
critics’ concerns about program theory, the quality of implementation, the value of qualitative
data, the need to probe causal contingencies, and meeting the information needs of school personnel
as well as other stakeholder groups. But even so, it will be difficult enlisting the current
generation of educational evaluators behind a banner promoting more experimentation.
7. However, they are not needed for this task. They are not part of the flurry of controlled
experimentation now occurring in schools. Moreover, Congress has shown its willingness to
mandate experiments, especially in early childhood education and job training. So, "end runs"
around the educational research community are possible, leaving future experiments to be carried
out by contract research firms, university faculty in the policy sciences, or education faculty now
lying fallow. It would be a pity if such an "end run" reduced access to those researchers who
now know most about school management, program implementation, and how educational
research is used. The next generation of experimenters should not have to re-learn the craft
knowledge that current education researchers have gained that genuinely complements
experiments and is in no way in intellectual opposition to them.
8. The major reason why researchers responsible for evaluating educational reforms feel no
urgency to conduct randomized experiments is that they conceptualize schools as complex social
organizations that should be studied with the case study tools typically used in organizational
research. In this tradition, internal change often occurs in an organization when management
contracts with external consultants who visit the institution, observe and interview in it, inspect
records, and then make recommendations for change based on both site visit particulars and their
own knowledge of theories of organizational change and presumptively effective teaching and
learning practices. This R&D model from management consulting is quite different from the one
used in medicine and agriculture. Until the operating metaphor of schools as complex social
organizations changes it will not be easy for educational researchers to adopt experimental
methods. They conceptualize educational change more holistically and see it as much more
complicated than whether an A causes some B. Yet even complex organizations can be studied
experimentally, as can many of the more important details within them.
References
Angrist, J.D., Imbens, G.W. & Rubin, D.B. (1996). Identification of causal effects using
instrumental variables. Journal of the American Statistical Association, 91, pp. 444-462.
Anson, A., Cook, T.D., Habib, F., Grady, M.K., Haynes, N. & Comer, J.P. (1991). The Comer
School Development Program: A theoretical analysis. Journal of Urban Education, 26, pp. 56-82.
Ares, C.E., Rankin, A. & Sturz, H. (1963). The Manhattan bail project: An interim report on the
use of pre-trial parole. New York University Law Review, 38, 67-95.
Becker, H.S. (1958). Problems of inference and proof in participant observation. American
Sociological Review, 23, pp. 652-660.
Berman, P. & McLaughlin, M.W. (1977). Federal programs supporting educational change.
Vol. 8: Factors affecting implementation and continuation. Santa Monica, CA: Rand Corp.
Bhaskar, R. (1975). A realist theory of science. Leeds, England: Leeds.
Bogatz, G.A. & Ball, S. (1972). The impact of “Sesame Street” on children’s first school
experience. Princeton, NJ: Educational Testing Service.
Boruch, R.F. (1974). Bibliography: Illustrated randomized field experiments for program
planning and evaluation. Evaluation, 2, pp. 83-87.
Boruch, R.F. (1997). Randomized experiments for planning and evaluation: A practical guide.
(Applied Social Research Methods Series, Volume 44.) Thousand Oaks, CA: Sage Publications.
Campbell, D.T. (1957). Factors relevant to the validity of experiments in social settings.
Psychological Bulletin, 54, 297-312.
Campbell, D.T. & Stanley, J.C. (1963). Experimental and quasi-experimental designs for research.
Chicago: Rand-McNally.
Campbell, F.A. & Ramey, C.T. (1995). Cognitive and school outcomes for high-risk African
American students at middle adolescence: Positive effects of early intervention. American
Educational Research Journal, Vol. 32, No. 4, pp. 743-772.
Chelimsky, E. (1987). The politics of program evaluation. In D.S. Cordray, H.S. Bloom & R.J.
Light (Eds.) Evaluation Practice in Review. San Francisco, CA: Jossey-Bass.
Chen, H. (1990). Theory-driven evaluations. Newbury Park, CA: Sage.
Cohen, D.K. & Garet, M.S. (1975). Reforming educational policy with applied social research.
Harvard Educational Review, 45.
Cicirelli, V.G. and Associates (1969). The Impact of Head Start: An Evaluation of the Effects of
Head Start on Children's Cognitive and Affective Development, Vol. 1 and 2. A Report to the
Office of Economic Opportunity. Athens: Ohio University and Westinghouse Learning
Corporation.
Collingwood, R.G. (1940). An essay on metaphysics. Oxford: Clarendon.
Comer, J.P. (1988). Educating poor minority children. Scientific American, 259(5), pp. 42-48.
Connell, D.B., Turner, R.R. & Mason. (1985). Summary of findings of the school health
education evaluation: Health promotion effectiveness, implementation and costs. Journal of
School Health, 55, pp. 316-321.
Connell, J.P., Kubisch, A.C., Schorr, L.B. & Weiss, C.H. (Eds.). (1995). New approaches to
evaluating community initiatives: Concepts, methods and contexts. Washington, D.C.: Aspen
Institute.
Cook, T.D. (1984). What have black children gained academically from school desegregation? A
review of the meta-analytic evidence. Special volume to commemorate Brown v. Board of
Education. In School Desegregation. Washington, D.C.: National Institute of Education.
Cook, T.D. (1985). Post-positivist critical multiplism. In R.L. Shotland & M.M. Mark (Eds.)
Social science and social policy, pp. 21-62. Beverly Hills, CA: Sage.
Cook, T.D. (1993). A quasi-sampling theory of the generalization of causal relationships. In L.
Sechrest and A.G. Scott (Eds.) New Directions for Program Evaluation: Understanding Causes
and Generalizing about Them, Vol. 57. San Francisco: Jossey-Bass Publishers.
Cook, T.D., Anson, A. & Walchli, S. (1993). From causal description to causal explanation:
Improving three already good evaluations of adolescent health programs. In S.G. Millstein, A.C.
Petersen & E.O. Nightingale (eds.) Promoting the Health of Adolescents: New Directions for the
Twenty-First Century. New York: Oxford University Press.
Cook, T.D. & Campbell, D.T. (1979). Quasi-experimentation: Design and analysis issues for
field settings. Boston: Houghton Mifflin.
Cook, T.D., Habib, F., Phillips, J., Settersten, R.A., Shagle, S.C. & Degirmencioglu, S.M. (1999, in
press). Comer's School Development Program in Prince George's County, Maryland: A theory-based evaluation. American Educational Research Journal.
Cook, T.D., Hunt, H.D. & Murphy, R.F. (1999). Comer's School Development Program in
Chicago: A theory-based evaluation. Working Paper, Institute for Policy Research,
Northwestern University.
Corrin , W.J. & Cook, T.D. (1998). Design elements of quasi-experimentation. Advances in
Educational Productivity, 7, pp. 35-57.
Cronbach, L.J. (1982). Designing evaluations of educational and social programs. San
Francisco: Jossey-Bass.
Cronbach, L.J., Ambron, S.R., Dornbusch, S.M., Hess, R.D., Hornik, R.C., Phillips, D.C.,
Walker, D.F. & Weiner, S.S. (1980). Toward reform of program evaluation. San Francisco:
Jossey-Bass.
Cronbach, L.J. & Snow, R.E. (1976). Aptitudes and instructional methods: A handbook for
research on interactions. New York: Irvington
Durlak, J.A. & Wells, A.M. (1997) (a). Primary prevention mental health programs for children
and adolescents: A meta-analytic review. Special Issue: Meta-analysis of primary prevention
programs. American Journal of Community Psychology, 25(2), pp. 115-152.
Durlak, J.A. & Wells, A.M. (1997) (b). Primary prevention mental health programs: The future
is exciting. American Journal of Community Psychology, Vol. 25, pp. 233-241.
Durlak, J.A. & Wells, A.M. (1998). Evaluation of indicated preventive intervention (Secondary
Prevention) Mental health programs for children and adolescents. American Journal of
Community Psychology, Vol. 26, No. 5, pp 775-802.
Elmore, R.F. (1996). Getting to scale with good educational practice. Harvard Educational
Review, 66, 1-26.
Elmore, R.F. & McLaughlin, M.W. (1983). The federal role in education: Learning from
experience. Education and Urban Society, 15, pp. 309-333.
Ferguson, R.F. (1998). Can Schools Narrow the Black-White Test Score Gap? In C. Jencks & M.
Phillips. The Black-White Test Score Gap. Washington, DC: Brookings. pp. 318-374.
Fetterman, D.M. (Ed.), (1984). Ethnography in educational evaluation. Beverly Hills, CA:
Sage.
Finn, J.D. & Achilles, C.M. (1990). Answers and questions about class size: A statewide
experiment. American Educational Research Journal, 27(3), pp. 557-577.
Flay, B.R. (1986). Efficacy and effectiveness trials (and other phases of research) in the
development of health promotion programs. Preventive Medicine, 15, pp. 451-474.
Fraker, T. & Maynard, R. (1987). Evaluating comparison group designs with employment-related programs. Journal of Human Resources, 22, 194-227.
Gasking, D. (1955). Causation and recipes. Mind, 64, pp. 479-487.
Gilbert, J.P., McPeek, B., & Mosteller, F. (1977). Statistics and ethics in surgery and anesthesia.
Science, 198, 684-689.
Glass, G.V. & Smith, M.L. (1979). Meta-analysis of research on the relationship of class size and
achievement. Educational Evaluation and Policy Analysis, 1, pp. 2-16.
Glymour, C., Spirtes, P. & Scheines, R. (1987). Discovering causal structure: Artificial intelligence,
philosophy of science and statistical modeling. Orlando: Academic Press.
Greene, J.P., Peterson, P.E., Du, J., Boeger, L. & Frazier, C.L. (1996). The effectiveness of school
choice in Milwaukee: A secondary analysis of data from the program's evaluation. University of
Houston mimeo. August, 1996.
Guba, E.G. & Lincoln, Y. (1982). Effective evaluation. San Francisco: Jossey-Bass.
Gueron, J. M. (1999). The politics of random assignment: Implementing studies and impacting
policy. Paper presented at the American Association of Arts and Sciences. Cambridge, Mass.
Hanushek, E.A. (1999). Evidence on class size. In S. Mayer & P. E. Peterson (Eds.) When
schools make a difference. Washington, D.C.: Brookings Institute.
Herman, R. (1998). Whole school reform: A review. Washington, D.C.: American Institutes for
Research.
Holland, P.W. (1986). Statistics and causal inference. Journal of the American Statistical
Association, 81, pp. 945-970.
House, E. (1993). Professional evaluation: Social impact and political consequences.
Newbury Park, CA: Sage
Kemple, J.J. & Rock, J.L. (1996). Career academies: Early implementation lessons from a 10-site evaluation. New York: Manpower Demonstration Research Corporation.
Kuhn, T.S. (1970). The structure of scientific revolutions. (2nd edition) Chicago: University of
Chicago Press.
LaLonde, R.J. (1986). Evaluating the econometric evaluations of training programs with
experimental data. American Economic Review, 76, 604-620.
Latour, B. & Woolgar, S. (1979). Laboratory life: The construction of scientific facts. Beverly
Hills: Sage Publications
Leviton, L.C. & Cook, T.D. (1983). Evaluation findings in education and social work textbooks.
Evaluation Review, 7, pp. 497-518.
Lindblom, C.E. & Cohen, D.K. (1980). Usable knowledge. New Haven, CT: Yale University
Press.
Lipsey, M.W. & Wilson, D.B. (1993). The efficacy of psychological, educational, and
behavioral treatment: Confirmation from meta-analysis. American Psychologist, 48, pp. 1181-1209.
Mackie, J.L. (1974). The cement of the universe. Oxford, England: Oxford University Press.
Mosteller, F., Light, R.J. & Sachs, J.A. (1996). Sustained inquiry in education: Lessons from
skill grouping and class size. Harvard Educational Review, 66, pp. 797-842.
Nave, B., Miech, E.J. & Mosteller, F. (1999). A rare design: The role of field trials in evaluating
school practices. Paper presented at the American Academy of Arts and Sciences and Harvard
University.
Olds, D.L., Eckenrode, D., Henderson, C.R., Kitzman, H., Powers, J., Cole, R., Sidora, K., Morris,
P., Pettitt, L.M. & Luckey, D. (1997). Long-term effects of home visitation on maternal life
course and child abuse and neglect. Journal of the American Medical Association, 278(8), pp. 637-643.
Peters, R.D.& McMahon, R.J. (Eds.) (1996). Preventing childhood disorders, substance abuse,
and delinquency. Banff International Science Series, Vol. 3, pp. 215-240. Thousand Oaks, CA:
Sage.
Peterson, P.E., Myers, D. & Howell, W.G. (1998). An evaluation of the New York City School
Choice Scholarships Program: The First Year. Paper prepared under the auspices of the
Program on Education Policy and Governance, Harvard University. PEPG98-12
Peterson, P.E., Greene, J.P., Howell, W.G. & McCready, W. (1998). Initial findings from an
evaluation of school choice programs in Washington, D.C. Paper prepared under the auspices of
the Program on Education Policy and Governance, Harvard University.
Popper, K.R. (1959). The logic of scientific discovery. New York: Basic Books.
Quine, W.V. (1951). Two dogmas of empiricism. Philosophical Review, 60, pp. 20-43.
Quine, W.V. (1969). Ontological relativity and other essays. New York: Columbia University
Press.
Ramey, C.T. & Campbell, F.A. (1991). Poverty, early childhood education, and academic
competence: The Abecedarian experiment. In A.C. Huston (Ed.) Children in poverty: Child
development and public policy. New York: Cambridge University Press, pp. 190-221.
Riecken, H.W. & Boruch, R. (1974). Social experimentation. New York: Academic Press.
Rivlin, A.M. & Timpane, M.M. (Eds.) (1975). Planned variation in education. Washington, DC:
Brookings Institution.
Rossi, P.H. & Lyall, K.C. (1976). Reforming public welfare. New York, NY: Russell Sage
Foundation.
Rouse, C.E. (1998). Private school vouchers and student achievement: An evaluation of the
Milwaukee parental choice program. The Quarterly Journal of Economics, pp. 553-602.
St. Pierre, R.G., Cook, T.D. & Straw, R.B. (1981). An evaluation of the Nutrition Education and
Training Program: Findings from Nebraska. Evaluation and Program Planning, 4, pp. 335-344.
Schweinhart, L.J., Barnes, H.V. & Weikart, D.P. (with W.S. Barnett and A.S. Epstein) (1993).
Significant benefits: The High/Scope Perry Preschool Study through age 27. Ypsilanti, MI:
High/Scope Press.
Scriven, M. (1976). Maximizing the power of causal investigation: The Modus Operandi
method. In G.V. Glass (Ed.), Evaluation Studies Review Annual, 1, pp. 101-118. Newbury Park,
CA: Sage Publications.
Shadish, W.R. & Cook, T.D. (in press). Design rules. Statistical Science.
Stake, R.E. (1967). The countenance of educational evaluation. Teachers College Record,
68, 523-540.
Stebbins, L.B., St. Pierre, R.G., Proper, E.C., Anderson, R.B. & Cerba, T.R. (1978). An
evaluation of Follow Through. In T.D. Cook (Ed.) Evaluation Studies Review Annual, 3, pp. 571-610. Beverly Hills, CA: Sage Publications.
Vinovskis, M.A. (1998). Changing federal strategies for supporting educational research,
development and statistics. Background paper prepared for the National Educational Research
Policy and Priorities Board, U.S. Department of Education.
Wargo, M.J., Tallmadge, G.K., Michaels, D.D., Lipe, D. & Morris, S.J. (1972). ESEA Title I: A
reanalysis and synthesis of evaluation data from fiscal year 1965 through 1970, Final Report,
Contract No. OEC-0-71-4766. Palo Alto, CA: American Institutes for Research.
Weiss, C.H. (1988). Evaluation for decisions: Is anybody there? Does anybody care? Evaluation
Practice, 9, pp. 5-20.
Weiss, C.H. & Bucuvalas, M.J. (1977). The challenge of social research to decision making. In
C.H. Weiss (Ed.), Using social research in public policy making. pp. 213-234. Lexington, MA:
Lexington Books.
Whitbeck, C. (1977). Causation in medicine: The disease entity model. Philosophy of Science,
44, pp. 619-637.
Witte, J.F. (1998). The Milwaukee voucher experiment. Educational Evaluation and Policy
Analysis, 20, pp. 229-251.
Zdep, S.M. (1971). Educating disadvantaged urban children in suburban schools: An evaluation.
Journal of Applied Social Psychology, 1(2), pp. 173-186.