Abstract

This paper notes the rarity with which researchers who teach in American schools of education do policy-relevant experiments on schooling. In contrast, more school experiments are done by scholars with appointments in public health, psychology and the policy sciences. Existing theories of educational evaluation offer many arguments against experimentation, and these arguments are critically analyzed in this paper. They vary in cogency. The better arguments suggest that experimental designs will be improved if issues of implementation quality, program theory and the utilization of evaluation results are addressed more seriously. The worse arguments are those contending that better methods than the experiment already exist for assessing the effectiveness of educational innovations. We surmise that educational evaluators are not likely to adopt experimentation any time soon since it does not accord with the ontology, epistemology or methodology they prefer, they do not believe it is practical, and they do not consider it to be compatible with their preferred metaphor of schools as complex social organizations that should be improved using the tools of management consulting more than science. For more experiments to take place, school reform will require different intellectual assumptions from those now prevalent among evaluation specialists in schools of education.

Preamble

To do good research on educational technologies requires different strategies at different stages in the evolution of what I'll call a product. This paper deals only with the last stages of product development, when claims are made about its effectiveness. The argument I advance is that such claims should be based on experimental demonstrations (or good quasi-experiments). Experimental designs compare students exposed to the product with those exposed to a competing product (as in the Consumer Reports format), or they compare students who have had the product to individuals who do not have it but are free to do whatever they want, including gaining some exposure to similar products. No method provides as good a counterfactual as the experiment. This implies that the market should not decide alone; nor should decisions be made on the basis of formative studies alone; nor should they be made on the basis of various pattern matching strategies that have recently been espoused--e.g., theory-based evaluations. Since many of those doing research on educational technology have academic backgrounds in education, I reexamine the arguments against doing experiments advanced in education in general and in educational evaluation in particular. This paper reviews these arguments sympathetically, but nonetheless concludes they do not carry enough weight to preclude using experiments to test claims about the effectiveness of educational technology. It is not easy to do such experiments, particularly when the unit of assignment is schools rather than classrooms or students. But without experiments, claims about productivity gains due to technology will not gain the acceptance they might otherwise deserve.

Paper for Presentation at the SRI International Design Conference on Design Issues in Evaluating Educational Technologies

Reappraising the Arguments Against Randomized Experiments in Education: An Analysis of the Culture of Evaluation in American Schools of Education

Thomas D. Cook
Northwestern University
Thanks for feedback on an earlier draft are due to Tony Bryk, Lee Cronbach, Joseph Durlak, Christopher Jencks, and Paul Lingenfelter.

Introduction

The American education system is very decentralized compared to that of most other industrialized nations. Educational goals and practices vary enormously, not only by state and school district, but also by schools within districts. Even current plans to improve schools are strikingly heterogeneous. Politicians, business leaders and policy pundits have recently called for school improvement through mechanisms as diverse as school-based management, charter schools, school vouchers, more effective teaching practices, higher standards, increased accountability, smaller schools and classes, new educational technologies, and better trained teachers and principals. To many foreigners, the American system must look like a cacophony of local experimentation.

However, experimentation connotes more than just generating diverse options. It also connotes systematically evaluating them. This is usually through a process that can be decomposed into four decisions that have to be made--about the alternatives to be compared, the criteria they are to be compared on, how information is to be collected about each criterion, and how a decision is to be reached once all the alternatives have been compared on all the criteria. To evaluate like this requires considerable researcher control. In the laboratory sciences, control is used to exclude all irrelevant causal forces from the context of study and to provide experimenters with almost total say over the design and introduction of the alternatives to be compared. In settings more open than the laboratory, control is most often associated with using random assignment to determine which units (schools, classrooms or students) are to experience each alternative or, as we will call it, each treatment. Such assignment rules out the possibility that any differences observed at the end of a study are due to pre-existing group differences rather than to the treatment contrast under investigation.

Education uses experiments in both the laboratory and open system senses. But since our focus in this paper is on evaluating school reforms, we will concentrate on random assignment and its use in schools. Its superiority for drawing inferences about the causal consequences of deliberately planned changes is routinely acknowledged in philosophy, medicine, public health, agriculture, statistics, micro-economics, psychology, marketing and those parts of political science and sociology dealing with the improvement of opinion surveys. The same advocacy is evident in elementary methods texts in education, though not--as we will see--in specialized texts on educational evaluation. Random assignment is so esteemed because: (1) it provides a logically defensible no-cause baseline since the selection process is completely known; (2) the design and statistical alternatives to random assignment are empirically inadequate because they either do not produce the same effect sizes as randomized experiments on the same topic (e.g., Mosteller, Gilbert & McPeak, 1980; Lalonde, 1986; and Fraker & Maynard, 1987) or, when they do, the variation in results between the quasi-experimental studies on a topic is greater than the variation between randomized experiments on the same topic (Lipsey & Wilson, 1993); and (3) the costs of being incorrect in the causal conclusions one makes are often far from trivial. What if we concluded that Catholic schools are superior to public ones when they are not; or that vouchers stimulate academic achievement when they do not; or that school desegregation does not increase minority achievement when it in fact does? Taken together, these three arguments provide the conventional justification for random assignment as the method of choice when a causal question is paramount.
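The first of these points can be made concrete with a small simulation. The sketch below is purely illustrative and mine rather than drawn from any study cited here; the population size, the true effect and the selection rule are invented. It shows that when assignment is random the treatment-control difference recovers the true effect, whereas when abler students select themselves into the treatment the same comparison is biased upward.

import numpy as np

rng = np.random.default_rng(0)
n = 100_000                                  # hypothetical students
prior_ability = rng.normal(0, 1, n)          # pre-existing differences between students
true_effect = 0.20                           # the effect the treatment really has

# Random assignment: receiving the treatment is independent of prior ability.
random_t = rng.random(n) < 0.5
# Self-selection: abler students are more likely to opt into the treatment.
selected_t = rng.random(n) < 1 / (1 + np.exp(-prior_ability))

def outcomes(treated):
    # Outcomes depend on prior ability, the treatment, and chance.
    return prior_ability + true_effect * treated + rng.normal(0, 1, n)

y_rand, y_self = outcomes(random_t), outcomes(selected_t)
print(y_rand[random_t].mean() - y_rand[~random_t].mean())      # close to 0.20
print(y_self[selected_t].mean() - y_self[~selected_t].mean())  # well above 0.20

Because the selection process in the first comparison is completely known, nothing but the treatment and chance separates the groups; in the second, pre-existing differences masquerade as a treatment effect.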
American education promotes experimentation in the sense of implementing many variants. But it does not promote experimentation in the sense of using random assignment. Indeed, Nave, Meich and Mosteller (1999) showed that not even 1% of dissertations in education or of the studies archived in ERIC Abstracts involved such assignment. Casual review of the major journals on school improvement tells a similar story, whether in the American Educational Research Journal or Educational Evaluation and Policy Analysis. Since my own research interests are in whole school reform rather than discrete school features like curricula, I wrote to a nationally renowned colleague who designs and evaluates curricula. She replied that randomized experiments are also extremely rare in curriculum studies. This is certainly the case in studies of school desegregation, where a review organized by the then National Institute of Education (1984) included just one experiment (Zdep, 1971). I know of no experiments on standards setting. The effective schools literature reveals no experiments systematically varying those school practices that correlational studies suggested were effective. Recent studies of school-based management reveal only two experiments, both on Comer's School Development Program (Cook, Habib, Phillips, Settersten, Shagle & Degirmencioglu, 1999; Cook, Hunt & Murphy, 1999). There seem to be no experiments on Catholic or Accelerated or Total Quality Management schools. On vouchers I know of only one completed study, often re-analyzed (Witte, 1998; Greene, Peterson, Du, Boeger & Frazier, 1996; Rouse, 1998), plus three other projects now underway (Peterson, Greene, Howell, & McCready, 1998; Peterson, Myers & Howell, 1998). On charter schools I know of no relevant experiments. On smaller class sizes, I know of six experiments, the best known being the Tennessee class size study (Finn & Achilles, 1990; Mosteller, Light & Sachs, 1996). On smaller schools I know of only one randomized experiment, currently underway (Kemple & Rock, 1996). On teacher training I know of no relevant experiments. So, our knowledge of the effectiveness of the educational policies most discussed today has to depend on methods less esteemed than random assignment.

Of the experiments cited above, nearly all were conducted by scholars with primary organizational affiliation outside of education. The best-known class size experiment was done by educators (Finn & Achilles, 1990), but popularized by statisticians (Mosteller, Light & Sachs, 1996) and reanalyzed by an economist (Ferguson, 1998). The Milwaukee voucher study was done by political scientists (Witte, 1998) and re-analyzed as a randomized experiment by political scientists (Greene et al., 1996) and economists (Rouse, 1998). The Comer studies were conducted by sociologists and psychologists (Cook et al., 1999; Cook, Hunt & Murphy, 1999).
The ongoing experiment on academies within high schools (Kemple & Rock, 1996) is being done by economists, and the work on school choice programs in Washington, New York and Cleveland is also being done by political scientists (Peterson et al., 1998). Striking is the multiyear paucity of experimental studies conducted by scholars with appointments in Schools of Education.

That they are not done there does not reflect ignorance. Nearly all writers on educational research methods understand the logic of experimentation, know of its theoretical benefits, and appreciate how esteemed it is in science writ large. However, most of them (including Alkin, Cronbach, Eisner, Fetterman, Fullan, Guba, House, Hubermann, Lincoln, Miles, Provus, Sanders, Schwandt, Stake, Stufflebeam and Worthen) denigrate formal experiments at some time or another in their writings. A few pay lip-service to them while carrying out non-experimental studies whose virtues they therefore model for others (e.g., C. H. Weiss). This distaste stands in stark contrast to the preference for experiments exhibited by school researchers with affiliations outside of Schools of Education who primarily seek to learn how to improve student health, to prevent school violence, or to reduce teen use of tobacco, drugs and alcohol (e.g., Peters & McMahon, 1996; Cook, Anson & Walchli, 1993; Durlak & Wells, 1997a, b and 1998; St. Pierre, Cook & Straw, 1981; Connell, Turner & Mason, 1985). Such studies routinely use random assignment, with Durlak's two reviews of studies prior to 1991 including about 190 randomized experiments and 120 other studies. The number of experiments has certainly increased since then, given the rapid growth of school-based prevention studies. So, experiments are common in schools, but not on the topics preferred by researchers who have been exposed to evaluation training in Schools of Education.

This paper describes some things that are special about the intellectual culture of evaluation in Schools of Education. It is a culture that actively rejects random assignment in favor of alternatives that the larger research community judges to be technically inferior. Yet educational evaluators have lived for at least 30 years knowing they have rejected what science apotheosizes, and they will not adopt experimentation merely by learning more about it. They believe in a set of mutually reinforcing propositions that provides an overwhelming rationale for rejecting experiments on a variety of ontological, epistemological, methodological, practical and ethical grounds. Although the writers on educational evaluation we listed earlier differ in their reasons for rejecting a scientific model of knowledge growth, anyone in educational evaluation during the last 30 years will have encountered arguments against experiments that seemed both cogent and comprehensive. Indeed, even raising the issue of experimentation probably has a "deja vu" quality, reminding older educational evaluators of a battle they thought they had won long ago against positivism and the privilege it accords to experiments and to the research and development (R&D) model associated with it, whose origins are in agriculture, medicine and public health--not education. Still, we want to re-examine the reasons educational evaluators have forwarded for rejecting random assignment in school-based research. Three incidental benefits follow from this purpose.
First, we provide a rationale for random assignment that is somewhat different from the conventional validity rationale outlined earlier. We take seriously some of the objections raised by critics in education and so do not argue that random assignment provides a gold standard guaranteeing impeccable causal decisions. (However, we do believe it is superior to any alternative currently available). Nor do we argue that the alternatives to random assignment are invariably fallible. That would be silly. With the exception of psychology, the laboratory sciences routinely provide causal knowledge yet rarely assign at random. Moreover, commonsense has arrived at many correct causal propositions without the advantage of experiments (e.g., that incest causes aberrant births or that out-group threat causes in-group cohesion). And, while Lipsey and Wilson's (1993) report on 74 meta-analyses showed that the average effect sizes did not differ between experiments and other types of studies, the variation in effect sizes was much larger for the non-experiments. This suggests that experiments are more efficient as probes of causal hypotheses, and so they are especially needed when few reports on a topic are available--as is the case with almost all of the recent assertions about effective school reforms. A revised rationale for experiments stresses (1) their greater validity relative to other cause-probing methods even though they do not constitute a "gold standard"; (2) their efficiency at reaching defensible causal conclusions sooner than other methods; and (3) their greater credibility in the eyes of nearly all the scholarly community.

A second incidental benefit is the light our analysis throws on the R&D model guiding educational research today. It is different from the science-based evolutionary model used in other fields (Flay, 1986). This specifies that theory, insight and serendipity should be used to generate novel ideas about possible improvements to health or welfare. The assumptions behind these ideas should then be tested in laboratory settings, often in test tubes or with animal populations. Novel ideas surviving this examination should then be implemented in efficacy trials designed to test the treatment at its strongest in carefully controlled field settings. Only ideas that survive this next stage enter into trials that probe their effectiveness under conditions like those of eventual application, where faithful implementation of the treatment details cannot be taken for granted. Only ideas surviving this last stage should then be replicated to assess their generalizability across different populations of students and teachers, across different kinds of schools, and even across different moments in time.

Contrast this R&D model with the one most education researchers prefer. It comes from management consulting and its crucial assumptions are: (1) each organization (e.g.
school or school district) is so complex in its structure and functions that it is difficult to implement any changes that involve its central functions; (2) each organization is unique in its goals and culture, entailing that the same stimulus is likely to elicit quite variable responses depending on a school's history, organization, personnel and politics; and (3) suggestions for change should reflect the creative blending of knowledge from many different sources--especially from deep historical and ethnographic insight into the district or school under study, as well as craft knowledge about what is likely to improve this particular school or district, plus the application of broad theories of general organizational change (Lindblom & Cohen, 1980). Seen from this perspective, scientific knowledge about school effectiveness is inappropriate because it obscures each school's uniqueness, it oversimplifies the multivariate and non-linear nature of full explanation, and it is naive about the mostly indirect ways in which social science is used in policy. Most educational evaluators see themselves at the forefront of a post-positivist, democratic and craft-based model of knowledge growth that is superior to the elitist scientific model they believe fails to create knowledge that school staff can actually use to improve their practice.

The third benefit that arises from understanding the attacks on randomized experiments is that some of the criticisms help identify how experiments can be improved. We later outline appropriate roles within experiments for the kinds of inquiry that many education researchers now prefer as alternatives to random assignment. We maintain that unless these improvements are made we cannot hope to see more experiments being done either by those who currently teach evaluation in education or by their students. Nor can we expect support from those midlevel employees in government, foundations and contract research firms who have been influenced by what educational evaluators believe about experiments. Such support is not necessary for generating more experiments. But taking the critics' concerns more seriously will tend to improve experiments in education, whoever does them. It is time to circumvent the unproductive current stand-off between scientific and craft models of knowledge growth by illustrating how experiments will be designed better because serious attention has been paid to the intellectual content of the attack on them.

Beliefs Adduced to Reject Random Assignment

1. Causation is ontologically more complex than a simple causal arrow from A to B.

With a few exceptions (e.g., Guba & Lincoln, 1982), educational researchers seem willing to postulate that a real world exists, however imperfectly it can be known. Randomized experiments test the influence of only a small subset of potential causes from within this world, often only a single one. And at their most elegant they can responsibly test only a modest number of interactions between treatments. This implies that randomized experiments are best when a causal question is simple, sharply focused and easily justified. The theory of causation most relevant to this task has been variously called the manipulability, activity or recipe theory (Collingwood, 1940; Gasking, 1955; Whitbeck, 1977).
It seeks to describe the consequences of a set of activities whose components can be listed as though they were ingredients in a recipe, thus permitting them to be actively manipulated as a group in order to ascertain whether they achieve a desired purpose. This is fundamentally a descriptive task, and explanation is only promoted to the extent that the manipulations help discriminate between competing theories or assessments are made of any processes that might mediate between a cause and effect (Mackie, 1974). Otherwise, random assignment is irrelevant to explanation; it merely describes the effects of a multidimensional set of activities that have been deliberately varied in package form.

Contrast this theory of causation with the one most esteemed in science. In some form or another, the preference is for full explanation. One understanding of this emphasizes identifying those "generative processes" (Bhaskar, 1975; Harre, 1981) that are capable of bringing about effects in a wide variety of circumstances. Examples would be gravity as it affects falling, or a specific genetic defect as it induces phenylketonuria, or time on task as it facilitates learning. But even these examples are causally contingent--the genetic defect does not induce phenylketonuria if an appropriate diet is adopted early in life; and time on task does not induce learning if a student is disengaged or the curriculum materials are meaningless. So, a more general understanding of causation requires specifying all the contingencies that impact on a given consequence or that follow from a particular event (Mackie, 1974). Cronbach and his colleagues (1980) maintain that Mackie's theory of cause is more relevant to schools than the activity theory. They believe that multiple causal factors are implicated in all student or teacher change, and often in complex non-linear ways. Their model of real world causation is more akin to a pretzel (or intersecting pretzels) than to an arrow from A to B. No educational intervention fully explains an outcome; at most, it is just one more determinant. Nor do interventions have effects that are so general we can assume a constant effect size across student and teacher populations, across types of schools and time periods. Contingency is the watchword, and the activity theory seems largely irrelevant because so few of the causally implicated variables can be simultaneously manipulated.

Believing that experiments cannot faithfully represent a real world of multivariate, non-linear (and often reciprocal) causal links, Cronbach and Snow (1976) initiated the search for aptitude-treatment interactions--specifications that a treatment's effect depends on student or teacher characteristics. But they discovered few that were robust. It is not clear to what extent this reflects an ontological truth that undermines a contingent understanding of causation or whether it reflects more mundane method problems associated with under-specified theories, partially invalid measures, imperfectly implemented treatments and truncated distributions. The utility of the activity theory of causation is best illustrated by noticing that many educational researchers write today as though some non-contingent causal connections are true enough to be useful--e.g., that small schools are better than large ones; that time-on-task raises achievement; that summer school raises test scores; that school desegregation hardly affects achievement; and that assigning and grading homework raises achievement.
They also seem willing to accept some causal propositions that are bounded by few (and manageable) contingencies--e.g., that reducing class size increases achievement, provided that the amount of change is "sizable" and to a level under 20; or that Catholic schools are superior to public ones, but only in the inner-city. Commitment to an explanatory theory of causation has not stopped school researchers from acting as though education in the USA today can be characterized in terms of some dependable main effects and some very simple non-linearities, each more like a causal arrow than intersecting pretzels.

It is also worth remembering that experiments were not designed to meet grand explanatory goals or to ensure verisimilitude. In Bacon's words, they were invented to "twist Nature by the tail", to pull apart what is usually confounded. This is not to deny that verisimilitude is also important. Else why would Fisher aspire to take experimental methods outside of the laboratory and into fields, schools and hospitals? But while verisimilitude is part of the pantheon of experimental desiderata, it is always secondary. Premier is achieving a clear picture of whether the link between two variables is causal in the activity theory sense (Campbell, 1957). So, a special onus falls on advocates of experimentation. To justify their preferred method, they have to make the case that any planned treatment is so important to policy or theory, and so likely to be general in its effects, that it is worthwhile investing in an experiment whose payoff for explanation or verisimilitude may be sub-optimal.

In this connection, it is important to remember that some causal contingencies are of minor relevance to educational policy, however useful they might be for full explanation. The most important contingencies are those that, within normal ranges, modify the sign of a causal relationship and not just its magnitude. Identifying such causal sign changes informs us where a treatment is directly harmful as opposed to where it simply has smaller positive benefits for some children than others. This distinction is important, since it is not always possible to assign different treatments to different populations, and so policy-makers are often willing to advocate a change that has different effects for different types of student so long as the effects are rarely negative. When the concern is to improve educational practice we can ignore many variables that moderate a cause-effect relationship, even though they are part of a full explanatory theory of the relationship in question.

We have considerable sympathy for those opponents of experimentation who believe that biased answers to big explanatory questions are more important than unbiased answers to smaller causal-descriptive questions. However, we disagree that learning about the consequences of educational reforms involves small questions, though we acknowledge that they are usually smaller than questions about causal mechanisms that are broadly transferable across multiple settings--the identification of which constitutes the traditional Holy Grail of science. Still, proponents of random assignment need to acknowledge that the method they prefer depends on a theory of causation that is less comprehensive and less esteemed than the one most researchers espouse.
Acknowledging this does not undermine the justification for experiments, and it should have the added benefit of designing future experiments with a greater explanatory yield than the black-box experiments of yesteryear.

2. Random assignment depends on discredited epistemological premises.

To philosophers of science, positivism connotes a rejection of realism, the formulation of theories in mathematical form, the primacy of prediction over explanation, and the belief that entities do not exist independently of their measurement. This epistemology has long been discredited. But many educational researchers still use the word positivism and seem to mean by it the quantification and hypothesis-testing that are essential to experimentation. Kuhn's work (1970) is at the forefront of the reasons cited for rejecting positivist science in the second sense above. He argued two things of relevance. First is his claim about the "incommensurability of theories"--the notion that theories cannot be formulated so specifically that definitive falsification results. Second is his claim about the "theory-ladenness of observations"--the notion that all measures are impregnated with researchers' theories, hopes, wishes and expectations, thus undermining their neutrality for discriminating between truth claims. In refuting the possibility of totally explicit theories and totally neutral observations, Kuhn's work was interpreted as undermining science in general and random assignment in particular. This is why his writings are often cited by educational evaluators like Cronbach, Eisner, Fetterman, Guba, House, Lincoln, Stake and Stufflebeam. Some of them also invoke philosophers like Lakatos, Harre and Feyerabend, whose views partially overlap with Kuhn's on these issues. Also often cited are those sociologists of science whose empirical research reveals practicing laboratory scientists whose on-the-job behavior deviates markedly from scientific norms (e.g., Latour & Woolgar, 1979). All these points are designed to illustrate that meta-science has demystified science, revealing it to be a very naked emperor indeed.

This epistemological critique is overly simplistic. Even if observations are never theory-neutral, this does not deny that many observations have stubbornly recurred whatever the researcher's predilections. As theories replace each other, most fact-like statements from the older theory are incorporated into the newer one, surviving the change in theoretical superstructure. Even if there are no "facts" we can independently know to be certain, there are still many propositions with such a high degree of facticity that they can be confidently treated as though they were true. For practicing experimenters, the trick is to make sure that observations are not impregnated with a single theory. This entails building multiple theories into one's collection of relevant observations, especially those of one's theoretical opponents. It also means valuing independent replications of causal claims, provided the replications do not share the same direction of bias (Cook, 1985). Kuhn's work complicates what a "fact" means, but it does not deny that some claims to a fact-like status are very strong.

Kuhn is probably also correct that theoretical statements are never definitively tested (Quine, 1951; 1969), including statements about the effects of an educational program. But this does not mean that individual experiments fail to probe theories and causal hypotheses.
For example, when a study produces negative results the program developers (and other advocates) are likely to surface methodological and substantive contingencies that might have changed the experimental result--perhaps if a different outcome measure had been used or a different population examined. Subsequent studies can probe these contingency formulations and, if they again prove negative, lead to yet another round of probes of whatever more complicated contingency hypotheses have emerged to explain the latest disconfirmation. After a time, this process runs out of steam, so particularistic are the contingencies that remain to be examined. It is as though the following consensus emerges: "The program was demonstrably not effective under many conditions, and those that remain to be examined are so circumscribed that the reform will not be worth much even if it does turn out to be effective under these conditions". Kuhn is correct that this process is social and not exclusively logical; and he is correct that it arises because the underlying program theory is not sufficiently explicit that it can be definitively confirmed or rejected. But the reality of elastic theory does not mean that decisions about causal hypotheses are only social and so are devoid of all empirical or logical content. Responding to Kuhn's critique leads to down-playing the value of individual studies and advocating more programs of experimental research (without implying that such programs are perfect--only relatively better).

However, critics of the experiment have not been interested in developing a non-positivist epistemological justification for experiments, probably seeing such efforts as futile. Instead, they have turned their attention to developing methods for studying educational reforms based on quite different epistemological assumptions and hence technical practices. Most of these assumptions led to stressing qualitative methods and hypothesis discovery over quantitative methods and hypothesis testing (e.g., Guba & Lincoln, 1982; Stake, 1967 and 1975; Cronbach, 1980 and 1982; House, 1993; Fetterman, 1984). Others seek to trace whether the mediating processes specified in the substantive theory of a program actually occur in the time sequence predicted (Connell, Kubisch, Schorr & Weiss, 1995; Chen, 1990). But before analyzing these alternatives, we should acknowledge some of the more practical attacks made on random assignment.

3. Random assignment has been tried in education, and has failed.

Education researchers were at the forefront of the flurry of social experimentation that took place at the end of the 1960s and the 1970s. Evaluations of Head Start (Cicirelli & Associates, 1969), Follow Through (Stebbins, St. Pierre, Proper, Anderson & Cerva, 1978), and Title I (Wargo, Tallmadge, Michaels, Lipe & Morris, 1972) concluded that there were few, if any, positive effects and there might even be some negative ones. These disappointing results engendered considerable dispute about methods, and many educational evaluators concluded that quantitative evaluation had been tried and failed. So, they turned to other methods, propelled in the same direction by their reading of Kuhn. Other scholars responded differently, stressing the need to study school management and program implementation in the belief that poor management and implementation had led to the disappointing results (Berman & McLaughlin, 1977; Elmore & McLaughlin, 1983; Cohen & Garet, 1975).
However, none of the most heavily criticized studies involved random assignment. In education, criticism of experiments was undertaken by Cronbach et al (1980), who re-analyzed some studies from the long lists Riecken & Boruch (1974) and Boruch (1974) had generated in order to demonstrate the feasibility of randomized experiments in complex field settings. In his critique, Cronbach paid particular attention to the Vera Institute's Bail Bond Experiment (Ares, Rankin & Sturz, 1963) and to the New Jersey Negative Income Tax Experiment (Rossi & Lyall, 1976). He was able to show to his own and his followers' satisfaction that each was flawed both in how it was implemented as a randomized experiment and in the degree of correspondence achieved between its sampling particulars and the likely conditions of its application as new policy. For these reasons primarily, the case was made that the experiments cited by Boruch did not constitute a compelling justification for evaluating educational innovations experimentally.

However, the studies Cronbach analyzed were not from education. I know of only three randomized experiments on "large" educational topics available at the time, though there were many small experiments on highly specific teaching and learning practices. One experiment was of Sesame Street in its second year (Bogatz & Ball, 1971), when television cable capacity was randomly assigned to homes in order to promote differences in the opportunity to view the show. A second was the Perry Preschool Project (Schweinhart, Barnes & Weikart, 1993), and the third involved only 12 youngsters randomly assigned to be in a desegregated school (Zdep, 1971). Only Zdep's study involved primary or secondary schools, and so it was not accurate to claim in the 1970s that randomized experiments had been tried in education and had failed.

But even so, it may be no easier to implement randomized experiments in schools than in the substantive domains Cronbach et al did examine. Implementation shortfalls have arisen in many of the more recent experiments in education. Schools have sometimes dropped out of the different treatment conditions in different proportions, largely because a new principal wants to change what a predecessor recently did (e.g., Cook, Hunt & Murphy, 1999). Even in the Tennessee class size study it is striking that the number of classrooms in the final analysis differs by more than 20% across the three treatment conditions, though the original design may have called for more similar percentages. Then, there are inadvertent treatment crossovers, as happened in Cook et al (1999), where one principal in a treatment condition was married to someone teaching in a control school; where one control principal really liked the treatment and learned more about it for himself and tried to implement parts of it in his school; and where one teacher in a control school was the daughter of a program official who visited her school several times to address the staff about strategies for improving the building. Fortunately, these crossovers involved only three of the 23 schools, and none of them entailed borrowing the treatment's most unique and presumptively powerful components. Still, the planned experimental contrast was diluted to some degree. Something similar may have happened with the Tennessee experiment, where classrooms of different size were compared within the same school rather than between schools.
How did teachers assigned to larger classes interpret the fact that some of their colleagues were teaching smaller classes? Were they dispirited, and did they try less as a result? All experiments have to struggle with how to create and maintain random assignment and treatment independence. They do not come easily.

We have seen that random assignment is much more prevalent when evaluating in-school prevention programs than educational curricula or whole school reforms. Why is this? Some reasons are organizational, having to do with the priority researchers from public health and psychology have learned to attribute to clear causal inferences--a priority reinforced by their journal editors and funders (mostly NIH, CDC and the Robert Wood Johnson Foundation). The contrast here is with the almost non-existent priority of random assignment in federal and state Offices of Education and the many foundations funding education reform, though the recent entry of NSF and NICHD into educational grant-making may change this. Other reasons for the discipline difference in school-based experiments are substantive. Prevention studies are mostly concerned with evaluating novel curricula rather than, say, whole school reform; and the curricula they develop are usually circumscribed in time, rarely lasting even a single year. Moreover, teachers do not have to be trained to deliver the experimental materials. Researchers typically do this. And prevention rarely requires the school itself to adopt new forms of staff coordination. But many educational reforms demand just this and so entail all the stress that goes with it.

The very real difficulties that arise when implementing experiments in schools suggest a subtle modification to the usual rationale for experiments. In the real world of education it is not easy to implement experiments perfectly. The initial randomization process can be faulty; differential attrition can occur; and the control group can decrease its performance if some of the individuals within it realize they are being treated differently from others. Over the past decades, many lessons have been learned about how to minimize these problems (see Boruch, 1997). But selective attrition still occurs; and knowledge of the different treatment conditions still seeps across planned treatment demarcations. Random assignment creates a logically superior counterfactual to its alternatives. But it does not follow from this that the technique is perfect in actual research practice. No rhetoric about experiments as "gold standards" is justified if it implies that experiments invariably generate impeccable causal inferences. Experimenters always need to check on how well random assignment and treatment independence have been achieved and maintained. They cannot be taken for granted.

4. Random assignment is not feasible in education.

The small number of randomized experiments in education may reflect, not researchers' distaste for them, but how difficult they are to mount. School district officials do not like the focused inequities in school structures or resources that random assignment generates, fearing negative reactions from parents and staff. They prefer for individual schools to choose the changes they will make, or else they prefer changes to be made district-wide. Principals and other school staff share these preferences and have additional administrative concerns about disrupting ongoing routines. And then there are the usual ethical concerns about withholding potentially helpful treatments from those in need.
Given this climate, random assignment needs to be locally advocated by people committed to it and able to counter the objections usually raised (Gueron, 1999). So, it is a significant drawback that the professional norms fueling the widespread use of random assignment in medicine and elsewhere are absent in education. If educational evaluators are not likely to fight for more experimentation in education, then researchers from other disciplines will have to do so.

What does it take to mount randomized experiments in schools? In the Cook et al (1999) study, random assignment was sponsored by the school district and all district middle schools had to comply. Principals had no choice about participating in the study or in the treatment they eventually received. The district took this step because a foundation-funded network of very prestigious scholars--none from education--insisted on random assignment as a precondition for funding the program and its evaluation. In a second case (Cook, Hunt & Murphy, 1999), the principal investigator insisted on random assignment as a precondition for collaborating with the program implementers, though in this case participation was restricted to schools where principals wanted the program and said in advance that they were prepared to live with the results of the randomization coin toss, whatever the outcome. Random assignment was a struggle in each case, and schools would definitely have preferred to do without it. But no principal had any difficulty appreciating the method's logic and most lived with its consequences for up to six years. (But not all of them. Four Chicago replacement principals from the original 24 insisted on implementing their own reform program and eliminated the one under evaluation).

In implementing randomization, the roles of political will and disciplinary culture are very important. Random assignment is common in the health sciences because it is institutionally supported there by funding agencies, publishing outlets, graduate training programs and the tradition of clinical trials. Prevention studies in schools tap into this same research culture. Something similar is emerging in pre-school education, where the many experiments are the product of: (1) congressional requirements to assign at random, as with Early Head Start, the Comprehensive Child Care Program and the plan for the new national Head Start study; (2) the high political and scholarly visibility of experiments on the Perry Pre-School Program (Schweinhart, Barnes & Weikart, 1993), the Abecedarian project (e.g., Campbell & Ramey, 1995) and Olds' home visiting nurse program (Olds et al, 1997); and (3) the involvement of researchers trained in psychology, human development, medicine and micro-economics, these being fields where random assignment is highly valued. Agriculture is another area of study whose culture favors random assignment, even in schools (St. Pierre, Cook & Straw, 1981; Connell, Turner & Mason, 1985). So, too, are marketing and survey research.

Contrast this with education. Reports from the Office of Educational Research and Improvement (OERI) are supposed to identify effective practices in primary and secondary schools. But neither the work of Vinovskis (1998) nor my own haphazard reading of OERI reports suggests any privilege being accorded to random assignment.
Moreover, one recent report I read on bilingual education repeated old saws about the impossibility of doing such studies and claimed that alternatives are just as good--in this case quasi-experiments. And at a recent foundation meeting on Teaching and Learning, the representative of a reforming state governor spoke about a list of best practices his state had developed that was being disseminated to all schools. He did not care, and he believed no governors care, about the technical quality of the research generating these lists. The main concern was to have a consensus of education researchers endorsing each presumptively effective practice. When asked how many of these best practices depended on randomized experiments, he guessed it would be zero. Several nationally known educational researchers were present and agreed that such assignment probably played no role in generating the list. No one felt any distress at this. I surmise that little will exists to utilize experiments in education--not out of ignorance, but because researchers believe that opportunities to conduct such studies are rare and that there is little need for them, given the feasibility of less noxious alternatives whose internal validity is adequate for creating credible (if not valid) conclusions. So long as such beliefs are widespread among the elite of educational research, there can never be the kind of pan-support for experimentation that is currently found in health, agriculture, or health-in-schools.

What are the research conditions most conducive to randomization? They include: when the treatment is of shorter duration; when no extensive retraining of teachers is required; when new patterns of coordination among school staff are minimal; when the demand for an innovation outstrips the supply; when two or more treatments with similar goals can be compared; when communication between the units receiving different treatments is not possible; and when students are the unit of assignment (or perhaps classrooms) rather than entire schools. Our guess, then, is that it will be more feasible to study different curricula at random; to introduce new technologies at random; to give students tuition rebates for Catholic schools at random; to assign more or different forms of homework at random (or by classroom); to assign teachers trained in different ways to classes at random, etc. None of these studies would be easy; but all should be feasible so long as the will exists to make random assignment work, so long as researchers know the most recent lessons about implementing and maintaining random assignment, and so long as strong fall-back options are built into research designs in case the original treatment assignment breaks down. In my experience, it will be rare for the biases resulting from such a breakdown to be greater than those that result from teachers, schools or students selecting themselves into treatments. And the limitations of statistical selection controls are smaller, the smaller the initial selection bias and the more directly the selection process has been observed (Holland, 1986).

5. Random assignment is premature because it needs strong theory and standardized treatment implementation, neither of which is routinely available in education.
Experimentation is easier and more fruitful when an intervention is based on strong substantive theory, when it occurs within well managed schools, when it is reasonable to assume that implementation quality does not vary much between units, and when the implementation is faithful to program theory. Unfortunately, these conditions are not often met. Schools tend to be large, complex social organizations characterized by multiple simultaneously occurring programs, disputatious building politics and conflicting stakeholder goals. Management is all too often weak and removed from classroom practice, and day-to-day politics can swamp effective program planning and monitoring. So, many reform initiatives are implemented highly variably across districts, schools, classrooms and students. Indeed, when several different educational models are contrasted in the same study, the between-model variation is usually small when compared to the variation between schools implementing the same model (Rivlin & Timpane, 1975; Stebbins et al, 1978). In school research, it is impossible to assume either standard program implementation or total fidelity to program theory (Berman & McLaughlin, 1977). Randomized experiments must seem premature to those whose operating premise is that schools are complex social organizations with severe management and implementation problems.

However, educational research does not have to be so premised; nor is it impossible to do experiments in schools that are poorly managed and do not implement well. An earlier model of schools treated them as physical structures with many self-contained classrooms where teachers tried to deliver effective curricula using instructional practices that have been shown to enhance students' academic performance. This approach privileged curriculum design and instructional practice, not the school-wide factors that dominate within the current organizational framework--viz., strong leadership, clear and supportive links to the world outside of school, creating a building-wide communitarian climate focused on learning, and engaging in multiple forms of professional development, including those relevant to curriculum and teaching matters. Many important consequences have followed from this intellectual shift in how schools are conceptualized. One is the lesser profile accorded to curriculum and instructional practice and to what happens once the teacher closes the classroom door; another is the view that random assignment is premature, given its dependence on positive school management and quality program implementation; and another is that quantitative techniques have only marginal utility for understanding schools, since a school's governance, culture and management are best understood through intensive case studies, often ethnographic.

It is a mistake to believe that random assignment requires well-specified program theories, or good management, or standard implementation, or treatments that are totally faithful to program theory, though these features do make evaluation easier. Experiments primarily protect against bias in causal estimates; and only secondarily against the imprecision in these estimates due to extraneous variation. So, the complexity of schools leads to the need for randomized experiments with larger sample sizes. Otherwise, experiments conducted in complex settings tend towards no-difference findings.
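The sample-size implication can be illustrated with a simple calculation. The sketch below is mine rather than drawn from any study cited here, and the intraclass correlations, school counts and class sizes in it are assumed for illustration only: when whole schools are the unit of assignment, between-school variation shrinks the amount of independent information an experiment actually contains.

def effective_sample_size(n_schools, students_per_school, icc):
    # Standard design-effect adjustment for clustering: 1 + (m - 1) * icc.
    design_effect = 1 + (students_per_school - 1) * icc
    return n_schools * students_per_school / design_effect

# Forty schools of 100 students each look like 4,000 observations, but modest
# between-school variation makes them worth far fewer independent ones.
for icc in (0.05, 0.15, 0.25):
    ess = effective_sample_size(40, 100, icc)
    print(f"intraclass correlation {icc:.2f}: about {ess:.0f} effective observations")

This is one reason why experiments in complex school settings drift towards no-difference findings unless they begin with many more schools than intuition suggests.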
Cronbach (1982) has noted how rarely researchers abandon their analysis if this is indeed their initial finding. Instead, they stratify schools by the degree to which program particulars were implemented and then relate this to variation in planned outcomes. Such internal analyses make any resulting causal claim the product of the very non-experimental analyses whose weaknesses random assignment is designed to overcome. This problem suggests orienting the research focus differently so as: (1) to avoid the need for such internal analyses by having initial sample sizes that reflect the expected extraneous variation; (2) to anticipate specific sources of variation and to reduce their influence by design or through measurement and statistical manipulation; and (3) to study implementation quality as a dependent variable to ascertain which types of schools and teachers implement the intervention better. Variable implementation has implications for budgets, sample sizes and the relative utility of research questions; but it does not by itself invalidate random assignment.

The aim of experiments is not to explain all sources of variation; it is merely to probe whether a reform idea makes a marginal improvement in staff or student performance over and above all the other background changes that occur. It is not an argument against random assignment to claim that many reform theories are under-specified, some schools are chaotic, treatment implementation is highly variable, and treatments are not completely theory-faithful. Random assignment does not have to be postponed while we learn more about school management and implementation. However, the more we know about these matters the better we can randomize, the more reliable effects are likely to be, and the more experiments there will be that make management and implementation issues worthy objects of study within the experiments themselves. No advocate of random assignment will be credible in education who assumes treatment homogeneity or setting invariance. School-level variation is large and may be greater than in other fields where experiments are routine. So, it is a reasonable and politic working assumption to assert that the complexity of schools requires large sample experiments and careful measurement of implementation quality and other intervening variables, as well as careful documentation of all sources of extraneous variation influencing schools. Black box experiments are out.

In this connection, it is important to remember two points. First, for many educational policy purposes it is important to assess what an educational innovation achieves despite the variation in treatment exposure within treatment groups. Few educational interventions will be standardized once they are implemented as formal policy. So, why should we expect them to be standardized in an experimental evaluation? Of course, treatment standardization is desirable for those researchers interested in testing the substantive theory undergirding an intervention or in assessing the intervention's potential as policy (rather than its actual achievements). But standardization is not a desideratum for educational studies if standard implementation cannot be expected in the hurly-burly of real educational practice. Second, one of the few circumstances where instrumental variables provide unbiased causal inference is within a randomized experiment. What instrument can be better than random assignment (Angrist, Imbens & Rubin, 1996)? So, given a true experiment, we can have unbiased estimates of two entities of interest. One is the effect of "the intent to treat"--the original assignment variable of policy significance. The other is the effect of spontaneous variation in exposure within the treatment group, a question that is of theoretical interest, and dear to the hearts of program developers. Without random assignment, it is difficult to defend conclusions about the causal effect of spontaneous variation in treatment exposure, given the inevitability of selection problems and the inability to deal with them definitively. Thus, it is only with random assignment that one can have one's cake and eat it, assessing both the policy-relevant effects of treatments that are variably implemented and also the more theory-relevant effects of spontaneous variation in the amount and type of exposure to program details.
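To illustrate how these two estimates relate, consider the following sketch. It is an illustration of my own rather than an analysis from any study cited here: the uptake rates and the size of the exposure effect are invented, and the instrumental-variable estimate shown is the simple Wald ratio of the intent-to-treat effect to the difference in uptake between the two assigned groups.

import numpy as np

rng = np.random.default_rng(1)
n = 50_000
assigned = rng.random(n) < 0.5                   # the randomized assignment
# Implementation is imperfect: not all assigned units take up the program,
# and a few control units obtain something like it on their own.
exposed = np.where(assigned, rng.random(n) < 0.7, rng.random(n) < 0.1)
exposure_effect = 0.30                           # true effect of actual exposure
y = exposure_effect * exposed + rng.normal(0, 1, n)

itt = y[assigned].mean() - y[~assigned].mean()              # effect of the intent to treat
uptake_gap = exposed[assigned].mean() - exposed[~assigned].mean()
iv_estimate = itt / uptake_gap                              # Wald estimate of the exposure effect

print(f"intent-to-treat effect: {itt:.2f}")                 # diluted by partial uptake
print(f"IV estimate of exposure effect: {iv_estimate:.2f}")  # recovers roughly 0.30

The first number is the policy-relevant quantity; the second uses assignment as an instrument to recover the effect of exposure itself, something a naive comparison of exposed and unexposed students could not do without selection bias.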
6. Random assignment entails trade-offs not worth making.

Random assignment places the priority on unbiased answers to descriptive causal questions. Few educational researchers share this priority, and many believe that randomization often compromises more important research goals. Indeed, Cronbach (1982) has strongly argued against the assertion that internal validity is the sine qua non of experimentation (Campbell & Stanley, 1963), contending that this neglects external validity. It is self-evident that experiments are often limited in time and space, with nation-wide experiments being rare. Moreover, they are often limited to schools willing to surrender choice over which treatment they will receive and willing to tolerate the measurement burdens required for assessing implementation, mediating processes and individual outcomes. What kinds of schools are these? Would it not be preferable, Cronbach asks, to sample from a more representative population of schools even if less certain causal inferences result from this?

A second trade-off places scientific purity above utility. Why should experimenters value uncertainty reduction about cause so much that they use a highly conservative statistical criterion (p < .05)? In their own lives, would they not use a much more liberal risk calculus when deciding whether to adopt a potentially life-saving therapy? And why should statistical traditions be so strict that schools not implementing the treatment are included in the analysis as though they had been treated? He sees these as two examples of a counterproductive conservative bias that is designed to protect against wrongly concluding that a treatment is effective but that, in so doing, runs the risk of failing to detect true treatment effects that would emerge by other criteria. He also worries about the purism of experimental enthusiasts who refuse to explore the data for unplanned treatment interactions with student or teacher characteristics, and who see unplanned variation in implementation as a cause for concern rather than an opportunity to explore why the variation occurred and what its consequences might be. He also notes how poorly some experimental questions are framed, and argues for escaping from the original framing whenever more helpful questions emerge during a study--even if unbiased answers to these new questions are not possible. Better relevant than pure might be the motto. After all, many experiments take so long to plan, mount and analyze that answering the questions they pose could be of more antiquarian than contemporary interest.

Another trade-off experiments force, in his opinion, is between two conceptions of cause.
The most prized questions in science are about generative causal processes like gravity, relativity, DNA, nuclear fusion, aspirin, personal identity, infant attachment, or engaged time-on-task. Such constructs index instantiating processes capable of bringing about important effects in a multitude of different settings. This makes them more general in application than, for example, learning whether one way of organizing instruction affected achievement at a particular moment in past time in the particular set of schools that volunteered to be in an experiment. Like most other educational evaluators, Cronbach wants evaluations to explain why programs work and, to this end, he is prepared to tolerate more uncertainty about whether they work. So, he rejects the experimental tradition, believing that it often creates research conditions that do not closely approximate educational practice and that it ignores the exploration of those explanatory processes that might be transferable across a wide range of schools and students. Cronbach wants evaluation to pursue the scholarly goal of full explanation, but not necessarily using the quantitative methods generally preferred for this in science. In his view, the methods of the historian, journalist and ethnographer are just as good for learning what happened in a reform and why. The objectives of science remain privileged, but not its methods. A final trade-off is between audiences for the research. Experiments seek to maximize the truth about the consequences of a planned reform, and their intended audience is usually some policy-making group. Experiments rarely seek to help those who work in the sampled schools. Such persons rarely want what an experiment produces--information summarizing what a reform has achieved. Moreover, they can rarely wait to act until an experiment is completed. They need continuous feedback about their own practice in their own school. Putting the needs of service deliverers above those of amorphous policy-makers requires that evaluators prioritize utility over truth and continuous feedback over waiting for a final report. A recent letter to the New York Times captured this preference: "Alan Krueger ... claims to eschew value judgments and wants to approach issues (about educational reform) empirically. Yet his insistence on postponing changes in education policy until studies by researchers approach certainty is itself a value judgment in favor of the status quo. In view of the tragic state of affairs in parts of public education, his judgment is a most questionable one." (W. M. Petersen, April 20, 1999). These criticisms of the trade-offs implicit in experimentation remind us that experiments should not be conducted unless there is a clear causal question that has been widely probed for its presumed long-term utility to a wide range of policy actors, of whom school personnel are surely one. They also remind us that, while experiments optimize on descriptive causal questions, this need not preclude either examining the reasons for variation in implementation quality or seeking to identify the processes through which a treatment influences an effect. The criticisms further remind us that experiments do tend to be conservative, sometimes so preoccupied with bias protection that other types of knowledge are secondary. But this need not be so. There is no compelling need for such stringent alpha rates; only statistical convention is at play here, not statistical wisdom. Nor need one restrict data analyses to the intent-to-treat group, though such analyses are needed among the total set reported. Nor need one close one's eyes to all statistical interactions, so long as the probes are done with substantive theory and statistical power in mind, so long as internal and external replications are used to check against profligate error rates, and so long as conclusions are couched more tentatively than those that directly result from random assignment. Researchers can also try to replicate any results using the best available non-experimental analyses of data collected from formally representative samples, though such analyses would be difficult to interpret by themselves. Finally, many controlled experiments would be improved by collecting ethnographic data in all the treatment groups. This would help identify possible unintended outcomes and mediating processes, while also providing continuous feedback to experimental and control schools alike. Experiments need not be as rigid as some of the more compulsive texts on clinical trials paint them.
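The earlier point about stringent alpha rates can be given a rough quantitative form. The sketch below is a back-of-the-envelope power calculation with hypothetical numbers (a standardized effect of 0.20 and 100 units per arm, ignoring for simplicity the clustering of students within schools); it shows how much ability to detect a true effect is surrendered by insisting on the conventional two-sided .05 criterion rather than a more liberal one.

```python
from statistics import NormalDist

norm = NormalDist()

def power_two_groups(effect_size, n_per_group, alpha):
    """Approximate power of a two-sided, two-sample z-test.

    effect_size is the standardized mean difference; the normal
    approximation is adequate for the group sizes used here.
    """
    z_crit = norm.inv_cdf(1 - alpha / 2)
    noncentrality = effect_size * (n_per_group / 2) ** 0.5
    return 1 - norm.cdf(z_crit - noncentrality) + norm.cdf(-z_crit - noncentrality)

# Hypothetical: a modest reform effect studied with 100 schools or classrooms
# per arm, evaluated at conventional and more liberal significance criteria.
for alpha in (0.05, 0.10, 0.20):
    print(f"alpha = {alpha:.2f}: power = {power_two_groups(0.20, 100, alpha):.2f}")
```

Whether the more liberal criterion is wise is a judgment about the relative costs of false positives and of missed effects, which is precisely the judgment Cronbach accuses experimenters of making by convention rather than by argument.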
Even so, some bounds cannot be crossed. All the above suggestions involve adding to experiments, not replacing them. To conduct a study with randomly selected schools but no random assignment to treatments would be to gamble on achieving wide generalizability of what may not be a dependable causal connection. Conversely, to embark on an experiment presupposes the cardinal utility of causal connections; otherwise one would not do such a study in the first place. Assuming adequate statistical power, experiments protect against recommending ineffective changes and against overlooking effective ones. So the value of experiments depends on avoiding the costs that would occur if one wrongly concluded that school vouchers raise achievement, that small schools are superior to larger ones, or that Catholic schools are superior to public ones in the inner city. It is now 30 years since vouchers were proposed, and we still have no completed experiments on them and no clear answers about them. It is 30 years since Comer began his work on the School Development Program, and almost the same situation holds. It is 15 years since Levin began accelerated schools, and here too we have no experiments and no answers. But we need to tread carefully here lest our advocacy of experiments be overgeneralized. Premature experimentation is a real danger. There is little point evaluating educational reforms that are theoretically muddled or that are not implementable by ordinary school personnel. But even so, these decade-long time lines are inexcusable. The Obey-Porter legislation cites Comer's program as a proven exemplar. But when the legislation passed, the only evidence about the program available to Cook et al. (1999) consisted of testimony, a dozen or so empirical studies by the program's own staff that used primitive quasi-experimental designs, and one large-scale study that confounded the court-ordered introduction of the program with a simultaneously ordered reduction in class sizes of 40% (Comer, 1988). To be restricted to such a database verges on the irresponsible. However, Comer's program is not much different from any other in this regard, Success for All being somewhat of an exception. The trade-off Cronbach is prepared to make that favors generalization over causal knowledge runs just the risk exemplified by Comer's program.
There are many studies covering over 20 years in many school districts and times, but none of them is worth much as a study of cause. Experiments are not meant to be representative; they are meant to be the strongest possible tests of causal hypotheses. Yet causal statements are much more useful if they are the product of experiments and are also demonstrably generalizable in their results across various types of students, teachers, settings and times--or if the boundary conditions limiting generalization are empirically specified. Generalizing causal connections is always a problem with single experiments (Cook, 1993), but especially so in education. Unlike in medicine or public health, there is no tradition of multi-site experiments with national reach. Typically we find single experiments of unclear reach, done only in Milwaukee or Washington or Chicago or Tennessee. And with many kinds of school reform there is no fixed protocol. So, it is possible to imagine implementing vouchers, charter schools or Total Quality Management in many different ways both across and within districts. Indeed, the Comer programs in Prince George's County, Chicago and Detroit are different from each other in many ways, given how much latitude districts are supposed to have in how they define and implement many of the program specifics. So, the non-standardization we find associated with many educational treatments requires even larger samples of settings than is typical in clinical trials in medicine or public health. Getting cooperation from so many schools is not easy, given the absence of a tradition of random assignment and the considerable local control we find in American education today. Nonetheless, experiments can be conducted involving more schools than is the case with the rare experiments we find today. But very large experiments may not be wise. A sampling approach to causal generalization emphasizes the goal of generating a causal estimate that describes whatever population of schools, students, teachers or sites is of specific policy interest. To achieve this, the emphasis is on sampling with known probability. As admirable as this traditional sampling method is, it has rarely been the path to generalization in the experimental sciences. There are many reasons for this. Many of the populations it is practical to achieve are of only parochial interest; random sampling is hardly relevant to the need to estimate causal relationships across time; and random sampling cannot be used to select the outcome measures and treatment variants with which we try to represent general cause and effect constructs. So, bench scientists use a generalization model that emphasizes the consistency with which a causal relationship replicates across multiple sources of heterogeneity (Cook, 1993). Can the same causal relationship be observed across different laboratories, time periods, regions of the country and ways of operationalizing the cause and effect? This heterogeneity-of-replication model is what undergirds clinical trials and meta-analysis, and it requires many smaller experiments done at different times, in different settings, with different kinds of respondents and with different measures and different local adaptations of a treatment. The physical sciences have progressed by means of such replication because knowledge claims are often replicated in the next stage of research on a phenomenon. But in research areas with weaker (and more expensive) traditions of replication--as in education--replication cannot be haphazard. Instead, it might be based on conducting a number of smaller (but adequately statistically powered) experiments staggered over the years. But whatever the merits of phased programs of experiments, the point is that experiments do not have to be both small and stand-alone things. Single experiments will not produce definitive answers, however large they are. And they certainly will not answer all ancillary questions about causal contingencies. Like all social science, experiments need a cumulative context if dependable knowledge is to result.
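The heterogeneity-of-replication logic can be made concrete with a small pooling exercise. In the sketch below, the effect sizes and standard errors are invented stand-ins for a handful of small, staggered experiments on the same reform; the DerSimonian-Laird moment estimator is used purely for illustration.

```python
import numpy as np

# Hypothetical effect sizes (standardized mean differences) and standard errors
# from several small experiments on the same reform, run in different districts.
effects = np.array([0.15, 0.30, 0.05, 0.22, 0.40])
ses     = np.array([0.10, 0.12, 0.08, 0.15, 0.20])

# Fixed-effect weights and the Q statistic for between-study heterogeneity.
w = 1 / ses**2
fixed_mean = np.sum(w * effects) / np.sum(w)
q = np.sum(w * (effects - fixed_mean) ** 2)

# DerSimonian-Laird moment estimate of the between-study variance tau^2.
df = len(effects) - 1
c = np.sum(w) - np.sum(w**2) / np.sum(w)
tau2 = max(0.0, (q - df) / c)

# Random-effects pooled estimate: each study reweighted by its total variance.
w_star = 1 / (ses**2 + tau2)
re_mean = np.sum(w_star * effects) / np.sum(w_star)
re_se = np.sqrt(1 / np.sum(w_star))

print(f"Heterogeneity Q = {q:.2f} on {df} df, tau^2 = {tau2:.3f}")
print(f"Random-effects summary effect = {re_mean:.2f} (SE {re_se:.2f})")
```

The pooled number matters less than the heterogeneity statistics: it is the consistency, or inconsistency, of the relationship across settings, respondents, measures and local adaptations that licenses or limits generalization.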
Sampling concerns are less salient when causal generalization is understood, not as achieving a single causal estimate or generating a set of effect sizes from heterogeneous studies of the same general hypothesis, but as the identification of generative processes. Engaged time-on-task is an example of this, for it is supposed to stimulate student achievement through such diverse activities as more homework, summer classes, longer school days, and more interesting curricula. And many of these procedures can be implemented in any school district, in any country, at any time. Of the methods available for identifying explanatory processes, some require collecting data about each of the variables in the presumed mediating processes and then using these measures in causal models. Others require historians or ethnographers to explore what happened and why in each treatment group. Although the knowledge of explanatory processes achieved by either of these methods will be tentative, it is still worthwhile as part of the information base from which conclusions are eventually drawn about novel mechanisms that bring about important results. Cronbach's advocacy of causal explanation over causal description does not reflect a genuine duality. Experiments should explore all causal mechanisms that have an obvious potential to be generative; they should explain the consequences of interventions as well as describe them.

7. Experiments assume an invalid model of research utilization.

Experimentation can be construed as recreating the classical model of rational decision-making. In this model, one first lays out the alternatives (the treatments); then one decides on decision criteria (the outcomes); then one collects information on each criterion for each treatment (data collection); and finally one uses the observed effect sizes and whatever utilities can be attached to them to make a decision about the merits of the contending alternatives. Unfortunately, empirical work on the use of social science data in policy reveals that such instrumental use is rare (Weiss & Bucuvalas, 1977; Weiss, 1988). Use is more diffuse, better described by an "enlightenment" model that blends information from existing theories, personal testimony, extrapolations from surveys, the consensus of a field, claims from experts who may or may not have interests to defend, and novel concepts that are au courant and broadly applied--like the urban underclass or social capital in sociology today. No privilege is extended to science in this conception; nor to experiments. Empirical research on research utilization also notes that decisions are multiply rather than singly determined, with central roles being played by politics, personality, windows of opportunity and values.
Given this, scientific information of the type experiments produce can play at best a marginal role. Also, many decisions are not made in the systematic sense of that verb, but instead are slipped into, or accrete, with earlier small decisions constraining later large ones. And we should also not forget that official decision-making bodies turn over in composition, with new persons and issues replacing older ones. When studies take a long time to complete--as with many experiments--results may not be available until the policy agenda has changed. So, many experiments turn out to be verdicts in history books rather than inputs into current reform debates, making research utilization appear more complex than the rational choice model implies. Critics also point out that experiments rarely provide uncontested verdicts. Disputes typically arise about whether the original causal question was framed as it should have been, whether the claimed results are valid, whether all relevant outcomes were assessed, and whether the proffered recommendations follow from the results. All social science findings tend to meet with a disputed reception, if not about the quality of answers, then about the questions not addressed. The logical control over selection that makes experiments so valuable does not mean they are seen to put to rest all quibbles about causal claims. Consider in this regard the reexaminations of the Milwaukee voucher study and the very different conclusions offered about whether and where there were effects (Witte, 1998; Greene, Peterson, Du, Boeger & Frazier, 1996). Consider, also, the Tennessee class size experiment and the different effect sizes that have been generated (Finn & Achilles, 1990; Mosteller, Light & Sachs, 1996; Hanushek, 1999). Sometimes, real scholarly disagreements are at issue, while in other cases the disputes reflect stakeholders protecting their interests. Policy insiders use multiple criteria for making decisions, and scientific knowledge of causal influences is never uniquely determinative. Moreover, close examination of cases where the claim was made that policy changed because of experimental results suggests some oversimplification. The Tennessee class size experiment has been examined from this perspective (see the special summer 1999 issue of Educational Evaluation and Policy Analysis), and in this connection it is worth noting that the results were in line with what a much-cited meta-analysis had earlier described (Glass & Smith, 1979). They were also consonant with theories that say children gain more if they are engaged and on-task. They also conformed with teachers' hopes and expectations. Further, they were in line with parents' commonsense notions and so were acceptable to them. And the results came when the governor of the state had national political ambitions, was using education as an individuating policy priority, had the state resources to increase educational investments, and knew his actions would be popular both with the teachers unions (mostly Democrats) and with business interests in his own Republican party. So, a close explanatory analysis suggests that his policy change was probably not due to random assignment alone, though even closer analysis may well show that such assignment played a marginal role in creating the use that followed. Experiments exist on a smaller scale than would typically pertain if the services they test were to be implemented state- or nation-wide. This scale issue is serious (Elmore, 1996).
Consider what happened when the class size results were implemented state-wide in Tennessee and California. This new policy was begun at a time of a growing national teacher shortage due to demographic shifts that exacerbated teacher-poaching across state and district lines and allowed richer school districts to recruit more and better teachers. Since the innovation inevitably requires many new teachers, some individuals were able to leave their old non-teaching jobs and become fledgling teachers. Were these new entrants equal in quality to existing teachers? Problems also arose in creating the additional classrooms that lower class sizes require, leading to more use of trailers and dilapidated buildings. Are the benefits of smaller classes undercut by worse physical plant? Small-scale experiments cannot mimic all the details relevant to implementing a policy on a broader scale, though they should seek to approximate these details as best they can. There is some substance to arguments about the misfit between the theory of use undergirding randomized experiments and the ways in which social science data are actually used. But the objections are exaggerated. Instrumental use does occur (Chelimsky, 1987), and more often than the very low base rates readers might infer from most of the research denigrating instrumental usage. Moreover, one reason why some study results are widely disseminated is probably because random assignment confers credibility on them, at least in some quarters. This certainly happened with the Tennessee class size study and the pre-school studies cited earlier. Studies can even be important if political events have rendered their results obsolete. This is because many policy initiatives get recycled--as happened with school vouchers--and because the texts used to train professionals in a field often contain the results of past studies, using them for the light they throw on particulars of professional practice (Leviton & Cook, 1983). There is no necessary trade-off between instrumental and enlightenment usage. Experiments also contribute to general enlightenment--about the kinds of interventions most and least implementable; about the influence principal turnover has on school management; about the utility of theories that fail to specify and monitor what happens once the teacher closes her classroom door; about the kinds of principals who are most attracted to school-based management theories; about the kinds of teachers most amenable to professional development, etc. The era of black box experiments is long past. Within experiments we now want and need to learn about the determinants and consequences of implementation quality; we want to describe and analyze constructs from the substantive theory undergirding a program; we want to collect qualitative as well as quantitative data so long as the data collection protocol is identical in all treatment groups; and we want to get stakeholder groups involved in the formulation of guiding experimental questions and in interpreting the substantive relevance of findings. All these procedures help generate enlightenment as well as instrumental usage.

8. Random assignment is not needed since better alternatives already exist.

It is a truism that no social science method will die, whatever its imperfections, unless a demonstrably better or simpler method is available to replace it.
Most researchers who evaluate educational reforms believe that superior alternatives to the experiment already exist, and so they are willing to let it die. These methods are superior, they believe, because they are more acceptable to school personnel, because they reduce enough uncertainty about causation to be useful, and because they are relevant to a broader array of issues than causation alone. Although no single alternative is universally recommended, intensive case studies seem to be preferred most often.

a. Intensive Case Studies. The call for intensive case studies in educational evaluation came from three primary sources. One was from scholars who first made their names as quantitative researchers and then switched to qualitative case methods out of a disillusionment with positivist science and the results of the earliest educational evaluations (e.g., Guba; Stake; House). Even Cronbach (1982) came to assert that the appropriate methods for educational evaluation are those of the historian, journalist and ethnographer, pointedly not the scientist. Converts often have a special zeal and capacity to convince, and when former proponents of measurement and classical experimental design were seen attacking what they had once espoused, this may have had a strong influence on young students of methods in education. The second source of influence was evaluators within education who had been trained in qualitative methods, like Eisner or Fetterman. Third were those scholars with doctorates in sociology and political science who entered schools of education to pursue the newly emergent research agenda to improve school management and program implementation. These newcomers were nearly all qualitative researchers with an organizational or policy research background, fresh from the qualitative/quantitative method disputes in their original disciplines. Some countervoices in support of experiments could have been raised. After all, some psychometricians within education refused to be converted, and the more elite graduate schools of education hired scholars trained in statistics, economics and psychology. But none of these groups worked hard in print to advocate for experiments--Boruch and Murnane excepted; and none of them conducted experimental evaluations in education, thus demonstrating their feasibility and utility. Instead, the economists in education tended to work on issues related to higher education and the transition from school to work, while the methodologists from other disciplines worked on topics tangential to causal inference--e.g., on novel psychometric techniques, on improving meta-analysis, on re-conceptualizing the measurement of individual change, and on developing hierarchical models that simultaneously include school, student and intra-individual change factors. So, while three sets of voices converged to denigrate experimental research in education, none of the likely countervoices were heard from. A central belief of advocates of qualitative methods is that the techniques they prefer are uniquely suited to providing feedback on the many different kinds of questions and issues relevant to a school reform--questions about the quality of problem formulation, the quality of program theory, the quality of program implementation, the determinants of quality implementation, the proximal and distal effects on students, unanticipated side effects, teacher and student subgroup effects, and the relevance of findings to different stakeholder groups.
This flexibility for generating knowledge about so many different topics is something that randomized experiments cannot match. Their function is much more monolithic, and the assignment process on which they depend can impose obvious limits on a study's generalizability. Advocates for qualitative methods also contend that they reduce some uncertainty about causal connections. Cronbach (1982) agrees that this might be less than in an experiment, but he contends that journalists, historians and ethnographers nonetheless regularly learn the truth about what caused what, as do many lay persons in their everyday life. Needed for a causal diagnosis are only a relevant hypothesis, observations that can be made about the implications of this hypothesis, reflection on these observations, reformulation of the original hypotheses, a new round of observations, a new round of reflections on these observations, and so on until closure is eventually reached. Theorists of ethnography have long advocated such an empirical and ultimately falsificationist procedure (e.g., Becker, 1958), claiming that the results achieved from it will be richer than those from black box experiments because ethnography requires close attention to the unfolding of explanatory processes at different stages in a program's development. I do not doubt that hard thought and non-experimental methods can reduce some of the uncertainty about cause. I also believe that they sometimes reduce all reasonable uncertainty, though it will be difficult to know when this happens. However, I doubt whether intensive, qualitative case studies can generally reduce as much uncertainty about cause as a true experiment. This is because such methods rarely involve a convincing causal counterfactual. Typically, they do not have comparison groups, making it difficult to know how a treatment group would have changed in the absence of the reform under analysis. Adding control groups helps, but unless they are randomly created it will not be clear whether the two groups would have changed over time at similar rates. Whether intensive case methods reduce enough uncertainty about cause to be generally useful is a poorly specified proposition we cannot answer well, though it is central to Cronbach's case. Still, it forces us to note that experiments are only justified when a very high standard of uncertainty reduction is required about a causal claim. Three other factors favoring case studies are clearer. First, schools are less squeamish about allowing in ethnographers than experimentalists (under the assumption that the ethnographers are not in classes and staff meetings every day). Second, case studies produce human stories that can be effectively used to communicate results to readers and listeners. And finally, feeding interim evaluation results back to teachers and principals with whom an ethnographer has an ongoing relationship is especially likely to generate local use of the data collected. Such use is less grandiose than the usual aspiration for experiments--to guide policy changes that will affect large numbers of districts and schools. But in the work of educational evaluators like Stake, Guba and Lincoln, such local use is a desideratum, given how unsure they are that policy dictates issued from central authorities will ever be complied with once the classroom door closes. The advantages of case study methods for evaluation are considerable. But we value these methods for their role within experiments rather than as alternatives to them.
Case study methods complement experiments whenever a causal question is clearly central and it is not clear how successful program implementation will be, why implementation shortfalls may occur, what unexpected effects are likely to emerge, how respondents interpret the questions asked of them, what the causal mediating processes are, etc. Since it will be rare for any of these issues to be clear, qualitative methods have a central role to play in experimental work on educational interventions. They cannot be afterthoughts.

b. Theory-Based Evaluations. It is currently fashionable in many foundations and some scholarly circles to espouse a non-experimental theory of evaluation for use in complex social settings like communities and schools (Connell et al., 1995). The method depends on: (1) explicating the substantive theory behind a reform initiative and detailing all the flow-through relationships that should occur if the intended intervention is to impact on some major distal outcome, like achievement gains; (2) measuring each of the constructs specified in the substantive theory; and (3) analyzing the data to assess the extent to which the postulated relationships have actually occurred through time. For shorter time periods, the data analysis will involve only the first part of a postulated causal chain; over longer periods the complete model might be involved. Obviously, this conception of evaluation places a primacy on highly specific substantive theory, high quality measurement, and the valid analysis of multivariate explanatory processes as they unfold in time. Several features make it seem as though theory-based evaluation can function as an alternative to random assignment. First, it is not necessary to construct a causal counterfactual through random assignment or closely matched comparison groups. A study can involve only the group experiencing a treatment. Second, obtaining data patterns congruent with program theory is assumed to indicate the validity of that theory, an epistemology that does not require explicitly rejecting alternative explanations--merely demonstrating a match between the predicted and obtained data. Finally, the theory approach does not depend for its utility on attaining the end points typically specified in educational research--usually a cause-effect relationship involving student performance. Instead, if any part of the program theory is corroborated or disconfirmed at any point along the causal chain, this can be used for interim purposes--to inform program staff, to argue for maintaining the program, to provide a rationale for acting as though the program will be effective if more distal criteria were to be collected, and to defend against premature summative evaluations that declare a program ineffective before sufficient time has elapsed for all the processes to occur that are presumed necessary for ultimate change. So, no requirement exists to measure the effect endpoints most often valued in educational studies. Few advocates of experimentation will argue against the greater use of substantive theory in evaluation; and since most experiments can accommodate more process measurement, it follows they will be improved thereby. Such measurement will make it possible to probe, first, whether the intervention led to changes in the theoretically specified intervening processes and, second, whether these processes could then have plausibly caused changes in distal outcomes. The first of these tests will be unbiased because it relates each step in the causal model to the planned treatment contrast. But the second test will entail selection bias if it depends only on stratifying units according to the extent to which the postulated theoretical processes have occurred. Still, quasi-experimental analyses of the second stage are worth doing, provided that their results are clearly labeled as more tentative than the results of planned experimental contrasts.
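The asymmetry between these two probes is easy to see in a minimal sketch with simulated data; the variable names, coefficients and the unmeasured confounder are all hypothetical. The first regression inherits the lack of bias of random assignment; the second is an internal, non-experimental analysis and picks up whatever else drives both the process measure and the outcome.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# Hypothetical experiment: z is random assignment to the reform; m is a measured
# intervening process from the program theory (say, observed instructional change);
# u is an unmeasured school characteristic that affects both m and achievement y.
z = rng.integers(0, 2, n)
u = rng.normal(0, 1, n)
m = 0.5 * z + 0.8 * u + rng.normal(0, 1, n)
y = 0.4 * m + 1.0 * u + rng.normal(0, 1, n)

def ols_slope(x, target):
    """Slope from a simple least-squares regression of target on x."""
    design = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef[1]

# Step 1 (experimental): did assignment move the theorized intervening process?
print(f"Effect of assignment on the process measure: {ols_slope(z, m):.2f}")

# Step 2 (non-experimental): relating outcomes to the process measure
# confounds the program theory with whatever else (u) drives both.
print(f"Naive process-outcome slope: {ols_slope(m, y):.2f}  (true value is 0.40)")
```

In the simulation the naive process-outcome slope comes out noticeably larger than the true value, which is exactly the kind of overstatement an internal analysis can produce without anyone noticing.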
The utility of measuring and analyzing theoretical intervening processes is beyond dispute; the issue is whether such measurement and analysis can alone provide an alternative to random assignment. I am skeptical. First, it has been my experience writing papers on the theory behind a program with its developer (Anson, Cook, Habib, Grady, Haynes & Comer, 1991) that the theory is not always very explicit. Moreover, it could be made more explicit in several different ways, not just one. Is there a single theory of a program, or several possible versions of it? Second, there is the problem that many of these theories seem to be too linear in their flow of influence, rarely incorporating reciprocal feedback loops or external contingencies that might moderate the entire flow of influence. It is all a little bit too neat for our more chaotic world. Third, few theories are specific about timelines, specifying how long it should take for a given process to affect some proximal indicator. Without such specifications, when apparently disconfirming results are at hand it is difficult to know whether the next step in the model has not yet occurred or will never occur because the theory is wrong in this particular. Fourth, the method places a great premium on knowing, not just when to measure, but how to measure. Failure to corroborate the model could be the result of partially invalid measures as opposed to the invalidity of the theory itself--though researchers can protect against this with more reliable measures. Fifth is the epistemological problem that many different models can usually be fit to any single pattern of data (Glymour, Spirtes & Scheines, 1987), implying that causal modeling is more valid when multiple competing models are tested against each other rather than when a single model is tested. The biggest problem, though, is the absence of a valid counterfactual to inform us about what would have happened to students or teachers had there been no treatment. As a result, it is impossible to decide whether the events observed in a study genuinely result from the intervention or whether they would have occurred anyway. One way to guard against this is with "signed causes" (Scriven, 1976), predicting a multivariate pattern among variables that is so unique it could not have occurred except for the reform. But signed causes depend on the availability of much well-validated substantive theory and very high quality measurement (Cook & Campbell, 1979). So, a better safeguard is to have at least one comparison group, and the best comparison group is a randomly constructed one. So, we are back again with random assignment and the proposition that theory-based evaluations are useful as complements to randomized experiments but not as alternatives to them.

c. Quasi-Experiments. The vast majority of educational evaluators favor intensive case studies, and a few of them plus some outsiders from psychology are now espousing theory-based methods. But many educational researchers do not aspire to improve evaluation theory or practice.
They only want to test the effectiveness of changes in practice within their own education sub-field. In our experience, such researchers tend to use quasi-experiments, as with the early evaluations of Comer's program noted earlier and those Herman (1998) detailed for Success for All. Indeed, almost all the quantitative studies Herman (1998) cites to back claims about effective whole-school reforms depend on quasi-experiments rather than experiments or qualitative studies. This last is surprising, given how much emphasis theorists of educational evaluation have placed on qualitative work. Has their advocacy had little influence on researchers studying the effectiveness of educational interventions; or does their absence from the effectiveness part of Herman's analysis reflect merely bias on the part of her research team? We do not know. But we do know that it is not easy to integrate qualitative case studies into the summary picture of a school reform's effectiveness when the database also includes some quantitative studies. Quasi-experiments are identical to experiments in their purpose and in most details of their structure, the defining difference being the absence of random assignment and hence of a demonstrably valid causal counterfactual. Quasi-experimentation is nothing more than the search, more through design than statistical adjustment techniques, to create the best possible approximation (or approximations) to the missing counterfactual. To this end, there are invocations (Corrin & Cook, 1998; Shadish & Cook, in press) to create stably matched comparison groups; to use age or sibling controls; to measure behavior at several time points before a treatment begins, perhaps even creating a time series of observations; to seek out situations where units are assigned to treatment solely because of their score on some scale; to assign the same treatment to different stably matched groups at different times so that they can alternate in their functions as treatment and control groups; and to build multiple outcome variables into studies, some of which should theoretically be influenced by a treatment and others not. These are the most important elements from which quasi-experimental designs are created through a mixing process that tailors the design achieved to the nature of the research problem and the resources available.
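One of these design elements--assigning units to treatment solely on the basis of a cutoff score--can be made concrete with a minimal regression-discontinuity sketch. The data and names below are hypothetical; the point is only that, when the assignment rule is completely known, fitting the outcome separately on each side of the cutoff and comparing the two fits at the cutoff can recover the treatment effect without randomization.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 3000

# Hypothetical design: schools below a poverty-index cutoff of 50 receive the
# program; the outcome also depends smoothly on the poverty index itself.
poverty_index = rng.uniform(0, 100, n)
treated = (poverty_index < 50).astype(int)
true_effect = 4.0
outcome = 70 - 0.3 * poverty_index + true_effect * treated + rng.normal(0, 5, n)

# Local linear fits within a bandwidth on each side of the cutoff,
# each evaluated at the cutoff itself.
cutoff, bandwidth = 50, 15
left = (poverty_index >= cutoff - bandwidth) & (poverty_index < cutoff)    # treated side
right = (poverty_index >= cutoff) & (poverty_index <= cutoff + bandwidth)  # untreated side

fit_left = np.polyfit(poverty_index[left], outcome[left], 1)
fit_right = np.polyfit(poverty_index[right], outcome[right], 1)

effect_at_cutoff = np.polyval(fit_left, cutoff) - np.polyval(fit_right, cutoff)
print(f"Regression-discontinuity estimate at the cutoff: {effect_at_cutoff:.2f}")
```

Even this, one of the strongest of the quasi-experimental elements, leans on an assumed smooth relationship between the assignment score and the outcome near the cutoff, and on having enough cases close to it.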
However good they are, quasi-experiments are always second-best to randomized experiments. In some quarters, "quasi-experiment" has been used promiscuously to connote any study that is not a true experiment, that seeks to test a causal hypothesis, and that has any type of non-equivalent control group or pre-treatment observation. Yet Campbell and Stanley (1963) and Cook and Campbell (1979) labeled some such studies as "generally causally uninterpretable". Many of the studies educational researchers call "quasi-experiments" are of this last, causally uninterpretable kind and lag far behind the state of the art. Reading them indicates that little thought has been given to the quality of the match when creating control groups; to the possibility of multiple hypothesis tests rather than a single one; to the possibility of generating data from several pre-treatment time points rather than a single one; or to having several comparison groups per treatment, with one initially outperforming the treatment group and the other underperforming, so as to create bracketed controls. Reading quasi-experimental studies of educational reform projects is dispiriting, so weak are the designs and so primitive are the statistical analyses. Not all quasi-experimental designs and analyses are equal. Although I am convinced that the best quasi-experiments give reasonable approximations to the results of randomized experiments, to judge by the quality of the work on education I know best, the average quasi-experiment is lamentable in the confidence it inspires in the causal conclusions drawn. Recent advances in the design and analysis of quasi-experiments are not getting into educational research, where they are sorely needed. Nonetheless, the best estimate of any quasi-experiment's internal validity will always come from comparing its results with those from a randomized experiment on the same topic. When this is done systematically, it seems that quasi-experiments are less efficient than randomized experiments, providing more variable answers from study to study (Lipsey & Wilson, 1993). So, in areas like education, where few studies exist on most of the reform ideas currently being debated, randomized experiments are particularly needed, and it will take fewer of them to arrive at what might be the same answer better quasi-experiments would also achieve later. But irrespective of such speculation, we can be sure of one thing: in the eyes of nearly all scholarly observers, the answers obtained from experiments will be much more credible than those obtained from quasi-experiments.

Conclusions

1. Random assignment is very rare in research on the effectiveness of strategies to improve student academic performance, though it is widely acknowledged as premier for answering causal questions of this type.

2. Random assignment is common in schools when the topic is preventing negative behaviors or feelings. Intellectual culture may be one explanation for this difference in prevalence. Prevention researchers tend to be trained in public health and psychology, where random assignment is esteemed and where funders and journal editors are clear about their preference for the technique. The case is quite different with evaluators who operate out of schools of education. Capacity may be another explanation. Most school-based prevention experiments are of short duration, involve curriculum changes, and are usually implemented by the researchers themselves, and the research topics may engage educators less than issues of school governance and teaching practice.

3. Nearly all educational evaluators seem to believe that experiments are not worthwhile. They cite many reasons for this, including: the theory of causation that buttresses experiments is naive; experiments are not practical; they require unacceptable trade-offs that lower the quality of answers to non-causal questions; they deliver a kind of information rarely used to change policy; and the evaluative information needed can be gained using different and more flexible methods.

4. Some of these beliefs are better justified than others. Beliefs about the viability of alternatives to the experiment are particularly poorly warranted, since no alternative provides as convincing a causal counterfactual as random assignment. However, the better criticisms suggest useful and practical additions that can be made to the bare-bones structure and function of experiments.
Of special relevance here is the importance of describing implementation quality, of relating implementation to outcome changes, of describing program theory, of measuring whether theoretically specified intervening processes have occurred, and of relating these intervening processes to student outcomes.

5. Random assignment should not be considered a "gold standard". It creates only a probabilistic equivalence between groups at the pretest; treatment-correlated attrition is likely when treatments differ in intrinsic desirability; treatments are not always independent; and the means used to increase internal validity often reduce external validity. More appropriate rationales for random assignment are that, even after all the limitations above are taken into account, (1) it still provides a logically more valid causal counterfactual than its alternatives; (2) it almost certainly provides a more efficient counterfactual; and (3) it provides a counterfactual that is more credible in nearly all academic circles. Taken together, these are compelling rationales for random assignment, though not for any hubris on the part of advocates of experimentation.

6. It will not be possible to persuade the current community of educational researchers to do experiments simply by outlining the advantages of the technique and describing newer methods for implementing randomization. Most educational evaluators share some of the anti-experimental beliefs outlined above and also think that a scientific model of knowledge growth is out of date. Advocates of experimentation need to be explicit about its real limits and to consider seriously whether the practice of experimentation can be improved by incorporating some of the critics' concerns about program theory, the quality of implementation, the value of qualitative data, the necessity of probing causal contingencies, and meeting the information needs of school personnel as well as other stakeholder groups. But even so, it will be difficult to enlist the current generation of educational evaluators behind a banner promoting more experimentation.

7. However, they are not needed for this task. They are not part of the flurry of controlled experimentation now occurring in schools. Moreover, Congress has shown its willingness to mandate experiments, especially in early childhood education and job training. So, "end runs" around the educational research community are possible, leaving future experiments to be carried out by contract research firms, university faculty in the policy sciences, or education faculty now lying fallow. It would be a pity if such an "end run" reduced access to those researchers who now know most about school management, program implementation, and how educational research is used. The next generation of experimenters should not have to re-learn the craft knowledge that current education researchers have gained, knowledge that genuinely complements experiments and is in no way in intellectual opposition to them.

8. The major reason why researchers responsible for evaluating educational reforms feel no urgency to conduct randomized experiments is that they conceptualize schools as complex social organizations that should be studied with the case study tools typically used in organizational research.
In this tradition, internal change often occurs in an organization when management contracts with external consultants who visit the institution, observe and interview in it, inspect records, and then make recommendations for change based on both site-visit particulars and their own knowledge of theories of organizational change and presumptively effective teaching and learning practices. This R&D model from management consulting is quite different from the one used in medicine and agriculture. Until the operating metaphor of schools as complex social organizations changes, it will not be easy for educational researchers to adopt experimental methods. They conceptualize educational change more holistically and see it as much more complicated than whether an A causes some B. Yet even complex organizations can be studied experimentally, as can many of the more important details within them.

References

Angrist, J.D., Imbens, G.W. & Rubin, D.B. (1996). Identification of causal effects using instrumental variables. Journal of the American Statistical Association, 91, pp. 444-462.
Anson, A., Cook, T.D., Habib, F., Grady, M.K., Haynes, N. & Comer, J.P. (1991). The Comer School Development Program: A theoretical analysis. Journal of Urban Education, 26, pp. 56-82.
Ares, C.E., Rankin, A. & Sturz, H. (1963). The Manhattan bail project: An interim report on the use of pre-trial parole. New York University Law Review, 38, pp. 67-95.
Becker, H.S. (1958). Problems of inference and proof in participant observation. American Sociological Review, 23, pp. 652-660.
Berman, P. & McLaughlin, M.W. (1977). Federal programs supporting educational change. Vol. 8: Factors affecting implementation and continuation. Santa Monica, CA: Rand Corp.
Bhaskar, R. (1975). A realist theory of science. Leeds, England: Leeds.
Bogatz, G.A. & Ball, S. (1972). The impact of "Sesame Street" on children's first school experience. Princeton, NJ: Educational Testing Service.
Boruch, R.F. (1974). Bibliography: Illustrated randomized field experiments for program planning and evaluation. Evaluation, 2, pp. 83-87.
Boruch, R.F. (1997). Randomized experiments for planning and evaluation: A practical guide. (Applied Social Research Methods Series, Vol. 44.) Thousand Oaks, CA: Sage Publications.
Campbell, D.T. (1957). Factors relevant to the validity of experiments in social settings. Psychological Bulletin, 54, pp. 297-312.
Campbell, D.T. & Stanley, J.C. (1963). Experimental and quasi-experimental designs for research. Chicago: Rand-McNally.
Campbell, F.A. & Ramey, C.T. (1995). Cognitive and school outcomes for high-risk African American students at middle adolescence: Positive effects of early intervention. American Educational Research Journal, 32(4), pp. 743-772.
Chelimsky, E. (1987). The politics of program evaluation. In D.S. Cordray, H.S. Bloom & R.J. Light (Eds.), Evaluation practice in review. San Francisco, CA: Jossey-Bass.
Chen, H. (1990). Theory-driven evaluations. Newbury Park, CA: Sage.
Cohen, D.K. & Garet, M.S. (1975). Reforming educational policy with applied social research. Harvard Educational Review, 45.
Cicirelli, V.G. and Associates (1969). The impact of Head Start: An evaluation of the effects of Head Start on children's cognitive and affective development, Vols. 1 and 2. A report to the Office of Economic Opportunity. Athens: Ohio University and Westinghouse Learning Corporation.
Collingwood, R.G. (1940). An essay on metaphysics. Oxford: Clarendon.
Comer, J.P. (1988). Educating poor minority children. Scientific American, 259(5), pp. 42-48.
Connell, D.B., Turner, R.R. & Mason. (1985). Summary of findings of the school health education evaluation: Health promotion effectiveness, implementation and costs. Journal of School Health, 55, pp. 316-321.
Connell, J.P., Kubisch, A.C., Schorr, L.B. & Weiss, C.H. (Eds.) (1995). New approaches to evaluating community initiatives: Concepts, methods and contexts. Washington, D.C.: Aspen Institute.
Cook, T.D. (1984). What have black children gained academically from school desegregation? A review of the meta-analytic evidence. In School desegregation (special volume to commemorate Brown v. Board of Education). Washington, D.C.: National Institute of Education.
Cook, T.D. (1985). Post-positivist critical multiplism. In R.L. Shotland & M.M. Mark (Eds.), Social science and social policy, pp. 21-62. Beverly Hills, CA: Sage.
Cook, T.D. (1993). A quasi-sampling theory of the generalization of causal relationships. In L. Sechrest & A.G. Scott (Eds.), New Directions for Program Evaluation: Understanding causes and generalizing about them, Vol. 57. San Francisco: Jossey-Bass.
Cook, T.D., Anson, A. & Walchli, S. (1993). From causal description to causal explanation: Improving three already good evaluations of adolescent health programs. In S.G. Millstein, A.C. Petersen & E.O. Nightingale (Eds.), Promoting the health of adolescents: New directions for the twenty-first century. New York: Oxford University Press.
Cook, T.D. & Campbell, D.T. (1979). Quasi-experimentation: Design and analysis issues for field settings. Boston: Houghton Mifflin.
Cook, T.D., Habib, F., Phillips, J., Settersten, R.A., Shagle, S.C. & Degirmencioglu, S.M. (1999, in press). Comer's School Development Program in Prince George's County, Maryland: A theory-based evaluation. American Educational Research Journal.
Cook, T.D., Hunt, H.D. & Murphy, R.F. (1999). Comer's School Development Program in Chicago: A theory-based evaluation. Working paper, Institute for Policy Research, Northwestern University.
Corrin, W.J. & Cook, T.D. (1998). Design elements of quasi-experimentation. Advances in Educational Productivity, 7, pp. 35-57.
Cronbach, L.J. (1982). Designing evaluations of educational and social programs. San Francisco: Jossey-Bass.
Cronbach, L.J., Ambron, S.R., Dornbusch, S.M., Hess, R.D., Hornik, R.C., Phillips, D.C., Walker, D.F. & Weiner, S.S. (1980). Toward reform of program evaluation. San Francisco: Jossey-Bass.
Cronbach, L.J. & Snow, R.E. (1976). Aptitudes and instructional methods: A handbook for research on interactions. New York: Irvington.
Durlak, J.A. & Wells, A.M. (1997a). Primary prevention mental health programs for children and adolescents: A meta-analytic review. Special issue: Meta-analysis of primary prevention programs. American Journal of Community Psychology, 25(2), pp. 115-152.
Durlak, J.A. & Wells, A.M. (1997b). Primary prevention mental health programs: The future is exciting. American Journal of Community Psychology, 25, pp. 233-241.
Durlak, J.A. & Wells, A.M. (1998). Evaluation of indicated preventive intervention (secondary prevention) mental health programs for children and adolescents. American Journal of Community Psychology, 26(5), pp. 775-802.
Elmore, R.F. (1996). Getting to scale with good educational practice. Harvard Educational Review, 66, pp. 1-26.
Elmore, R.F. & McLaughlin, M.W. (1983). The federal role in education: Learning from experience. Education and Urban Society, 15, pp. 309-333.
Ferguson, R.F. (1998). Can schools narrow the black-white test score gap? In C. Jencks & M. Phillips (Eds.), The black-white test score gap, pp. 318-374. Washington, DC: Brookings.
Fetterman, D.M. (Ed.) (1984). Ethnography in educational evaluation. Beverly Hills, CA: Sage.
Finn, J.D. & Achilles, C.M. (1990). Answers and questions about class size: A statewide experiment. American Educational Research Journal, 27(3), pp. 557-577.
Flay, B.R. (1986). Efficacy and effectiveness trials (and other phases of research) in the development of health promotion programs. Preventive Medicine, 15, pp. 451-474.
Fraker, T. & Maynard, R. (1987). Evaluating comparison group designs with employment-related programs. Journal of Human Resources, 22, pp. 194-227.
Gasking, D. (1955). Causation and recipes. Mind, 64, pp. 479-487.
Gilbert, J.P., McPeek, B. & Mosteller, F. (1977). Statistics and ethics in surgery and anesthesia. Science, 198, pp. 684-689.
Glass, G.V. & Smith, M.L. (1979). Meta-analysis of research on the relationship of class size and achievement. Educational Evaluation and Policy Analysis, 1, pp. 2-16.
Glymour, C., Spirtes, P. & Scheines, R. (1987). Discovering causal structure: Artificial intelligence, philosophy of science and statistical modeling. Orlando: Academic Press.
Greene, J.P., Peterson, P.E., Du, J., Boeger, L. & Frazier, C.L. (1996). The effectiveness of school choice in Milwaukee: A secondary analysis of data from the program's evaluation. University of Houston mimeo, August 1996.
Guba, E.G. & Lincoln, Y. (1982). Effective evaluation. San Francisco: Jossey-Bass.
Gueron, J.M. (1999). The politics of random assignment: Implementing studies and impacting policy. Paper presented at the American Academy of Arts and Sciences, Cambridge, Mass.
Hanushek, E.A. (1999). Evidence on class size. In S. Mayer & P.E. Peterson (Eds.), When schools make a difference. Washington, D.C.: Brookings Institute.
Herman, R. (1998). Whole school reform: A review. Washington, D.C.: American Institutes for Research.
Holland, P.W. (1986). Statistics and causal inference. Journal of the American Statistical Association, 81, pp. 945-970.
House, E. (1993). Professional evaluation: Social impact and political consequences. Newbury Park, CA: Sage.
Kemple, J.J. & Rock, J.L. (1996). Career academies: Early implementation lessons from a 10-site evaluation. New York: Manpower Demonstration Research Corporation.
Kuhn, T.S. (1970). The structure of scientific revolutions (2nd ed.). Chicago: University of Chicago Press.
LaLonde, R.J. (1986). Evaluating the econometric evaluations of training programs with experimental data. American Economic Review, 76, pp. 604-620.
Latour, B. & Woolgar, S. (1979). Laboratory life: The construction of scientific facts. Beverly Hills: Sage Publications.
Leviton, L.C. & Cook, T.D. (1983). Evaluation findings in education and social work textbooks. Evaluation Review, 7, pp. 497-518.
Lindblom, C.E. & Cohen, D.K. (1980). Usable knowledge. New Haven, CT: Yale University Press.
Lipsey, M.W. & Wilson, D.B. (1993). The efficacy of psychological, educational, and behavioral treatment: Confirmation from meta-analysis. American Psychologist, pp. 1181-1209.
Mackie, J.L. (1974). The cement of the universe. Oxford, England: Oxford University Press.
Mosteller, F., Light, R.J. & Sachs, J.A. (1996). Sustained inquiry in education: Lessons from skill grouping and class size. Harvard Educational Review, 66, pp. 797-842.
Nave, B., Miech, E.J. & Mosteller, F. (1999). A rare design: The role of field trials in evaluating school practices. Paper presented at the American Academy of Arts and Sciences and Harvard University.
Olds, D.L., Eckenrode, J., Henderson, C.R., Kitzman, H., Powers, J., Cole, R., Sidora, K., Morris, P., Pettitt, L.M. & Luckey, D. (1997). Long-term effects of home visitation on maternal life course and child abuse and neglect. Journal of the American Medical Association, 278(8), pp. 637-643.
Peters, R.D. & McMahon, R.J. (Eds.) (1996). Preventing childhood disorders, substance abuse, and delinquency. Banff International Science Series, Vol. 3, pp. 215-240. Thousand Oaks, CA: Sage.
Peterson, P.E., Myers, D. & Howell, W.G. (1998). An evaluation of the New York City School Choice Scholarships Program: The first year. Paper prepared under the auspices of the Program on Education Policy and Governance, Harvard University. PEPG 98-12.
Peterson, P.E., Greene, J.P., Howell, W.G. & McCready, W. (1998). Initial findings from an evaluation of school choice programs in Washington, D.C. Paper prepared under the auspices of the Program on Education Policy and Governance, Harvard University.
Popper, K.R. (1959). The logic of scientific discovery. New York: Basic Books.
Quine, W.V. (1951). Two dogmas of empiricism. Philosophical Review, 60, pp. 20-43.
Quine, W.V. (1969). Ontological relativity and other essays. New York: Columbia University Press.
Ramey, C.T. & Campbell, F.A. (1991). Poverty, early childhood education, and academic competence: The Abecedarian experiment. In A.C. Huston (Ed.), Children in poverty: Child development and public policy, pp. 190-221. New York: Cambridge University Press.
Riecken, H.W. & Boruch, R. (1974). Social experimentation. New York: Academic Press.
Rivlin, A.M. & Timpane, M.M. (Eds.) (1975). Planned variation in education. Washington, DC: Brookings Institution.
Rossi, P.H. & Lyall, K.C. (1976). Reforming public welfare. New York, NY: Russell Sage Foundation.
Rouse, C.E. (1998). Private school vouchers and student achievement: An evaluation of the Milwaukee Parental Choice Program. The Quarterly Journal of Economics, pp. 553-602.
St. Pierre, R.G., Cook, T.D. & Straw, R.B. (1981). An evaluation of the Nutrition Education and Training Program: Findings from Nebraska. Evaluation and Program Planning, 4, pp. 335-344.
Schweinhart, L.J., Barnes, H.V. & Weikart, D.P. (with W.S. Barnett & A.S. Epstein) (1993). Significant benefits: The High/Scope Perry Preschool Study through age 27. Ypsilanti, MI: High/Scope Press.
Scriven, M. (1976). Maximizing the power of causal investigation: The modus operandi method. In G.V. Glass (Ed.), Evaluation Studies Review Annual, 1, pp. 101-118. Newbury Park, CA: Sage Publications.
Shadish, W.R. & Cook, T.D. (in press). Design rules. Statistical Science.
Stake, R.E. (1967). The countenance of educational evaluation. Teachers College Record, 68, pp. 523-540.
Stebbins, L.B., St. Pierre, R.G., Proper, E.C., Anderson, R.B. & Cerba, T.R. (1978). An evaluation of Follow Through. In T.D. Cook (Ed.), Evaluation Studies Review Annual, 3, pp. 571-610. Beverly Hills, CA: Sage Publications.
Vinovskis, M.A. (1998). Changing federal strategies for supporting educational research, development and statistics. Background paper prepared for the National Educational Research Policy and Priorities Board, U.S. Department of Education.
Wargo, M.J., Tallmadge, G.K., Michaels, D.D., Lipe, D. & Morris, S.J. (1972). ESEA Title I: A reanalysis and synthesis of evaluation data from fiscal year 1965 through 1970. Final report, Contract No. OEC-0-71-4766. Palo Alto, CA: American Institutes for Research.
Weiss, C.H. (1988). Evaluation for decisions: Is anybody there? Does anybody care? Evaluation Practice, 9, pp. 5-20.
Weiss, C.H. & Bucuvalas, M.J. (1977). The challenge of social research to decision making. In C.H. Weiss (Ed.), Using social research in public policy making, pp. 213-234. Lexington, MA: Lexington.
Whitbeck, C. (1977). Causation in medicine: The disease entity model. Philosophy of Science, 44, pp. 619-637.
Witte, J.F. (1998). The Milwaukee voucher experiment. Educational Evaluation and Policy Analysis, 20, pp. 229-251.
Zdep, S.M. (1971). Educating disadvantaged urban children in suburban schools: An evaluation. Journal of Applied Social Psychology, 1(2), pp. 173-186.