EDUCATION RESEARCH MEETS THE GOLD STANDARD: STATISTICS, EDUCATION, AND RESEARCH METHODS AFTER "NO CHILD LEFT BEHIND"

Mack C. Shelley, II
Iowa State University
mshelley@iastate.edu

Presented at the Joint Statistical Meetings, August 7-11, 2005, Minneapolis, MN

Background
- This session is meant to help inform the national debate over the role of scientific standards for research in education, particularly as those research standards are influenced by statistical methods and theory.
- This session builds on a National Science Foundation award to me and Brian Hand (University of Iowa).

Background
- The panel is designed to meld research interests in statistics, education, and related disciplines, and to discuss the dramatically changing context of contemporary education research.
- Why, exactly, is the context changing for statistical research in education?

Background
Standards for acceptable research in education are affected greatly by:
- the recent creation of the Institute of Education Sciences in the U.S. Department of Education,
- passage of the No Child Left Behind Act of 2001, and
- passage of the Education Sciences Reform Act (H.R. 3801) in 2002.

Background
Together, these developments
- have reconstituted federal support for research and dissemination of information in education,
- are meant to foster "scientifically valid research," and
- have established what is referred to as the "gold standard" for research in education.

Background
- These and other developments mean that greater education research emphasis now is placed on quantification, the use of randomized trials, and the selection of valid control groups.

Background
- This panel is intended to be part of a sustained and expanded dialogue between the statistical community and those who implement the education research agenda, through a discussion of whether and how to implement the new standards for statistical work in the field of education research.

What Is The "Gold Standard"?
- U.S. Department of Education, Institute of Education Sciences, National Center for Education Evaluation and Regional Assistance
- Identifying and implementing educational practices supported by rigorous evidence: A user friendly guide
- http://www.ed.gov/about/offices/list/ies/news.html#guide

What Is The "Gold Standard"?
This publication emphasizes:
- evidence-based interventions
- educational outcomes that have been found to be effective in randomized controlled trials
- "research's 'gold standard' for establishing what works"
- following patterns of evidence use in medicine and welfare policy

What Is The "Gold Standard"?
- The quality of studies needed to establish "strong" evidence requires randomized controlled trials that are well-designed and implemented.
- The quantity of evidence needed spans trials showing effectiveness in two or more typical school settings, including a setting similar to that of the schools/classrooms of interest.

What Is The "Gold Standard"?
"Possible" evidence may include:
- randomized controlled trials whose quality/quantity are good but fall short of "strong" evidence, and/or
- comparison-group studies in which the intervention and comparison groups are very closely matched in academic achievement, demographics, and other characteristics.

What Is The "Gold Standard"?
Evaluating whether an intervention is backed by "strong" evidence of effectiveness hinges on:
- well-designed and well-implemented randomized controlled trials demonstrating that there are no systematic differences between intervention and control groups before the intervention
- the use of measures and instruments of proven validity
- "real-world" objective measures of the outcomes the intervention is designed to affect
- attrition of no more than 25% of the original sample
- effect size combined with statistical significance (see the sketch following this list)
- an adequate sample size to achieve statistical significance
- controlled trials implemented in more than one site, in schools that represent a cross-section of all schools
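The "effect size combined with statistical significance" criterion can be made concrete with a minimal sketch. The example below is not part of the guide; the data and numbers are invented, and it simply shows one common way to report both a significance test and a standardized effect size (Cohen's d) for a two-group comparison.

```python
# Minimal sketch: reporting an effect size alongside a significance test
# for a two-group (intervention vs. control) comparison of outcome scores.
# All data below are simulated for illustration only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2005)
treatment = rng.normal(loc=52.0, scale=10.0, size=120)  # hypothetical post-test scores
control = rng.normal(loc=48.0, scale=10.0, size=120)

# Two-sample t-test supplies the statistical-significance half of the evidence.
t_stat, p_value = stats.ttest_ind(treatment, control)

# Cohen's d (standardized mean difference) supplies the effect-size half.
n1, n2 = len(treatment), len(control)
pooled_sd = np.sqrt(((n1 - 1) * treatment.var(ddof=1) +
                     (n2 - 1) * control.var(ddof=1)) / (n1 + n2 - 2))
cohens_d = (treatment.mean() - control.mean()) / pooled_sd

print(f"t = {t_stat:.2f}, p = {p_value:.4f}, Cohen's d = {cohens_d:.2f}")
```

Reporting the two quantities together, rather than a p-value alone, is what distinguishes this criterion from the bare "adequate sample size to achieve statistical significance" requirement.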
No Child Left Behind
- Public Law 107–110 [H.R. 1], signed into law on January 8, 2002
- "An Act to close the achievement gap with accountability, flexibility, and choice, so that no child is left behind"
- the "No Child Left Behind Act of 2001" (NCLB)
- established standards for academic assessments in mathematics, reading or language arts, and science
- requires multiple up-to-date measures of student academic achievement, including measures that assess higher-order thinking skills and understanding
- These requirements for program assessment lead to many opportunities and circumstances for the application of statistical methods.

No Child Left Behind
The research program under NCLB was designed to examine the effect of the assessment and accountability systems on students, teachers, parents, families, schools, school districts, and States, including correlations between such systems and:
- student academic achievement
- progress toward meeting the State-defined level of proficiency
- progress toward closing the achievement gap
- changes in course offerings, teaching practices, course content, and instructional material
- teacher, principal, and pupil-services personnel turnover rates
- student dropout, grade-retention, and graduation rates
- students with disabilities
- student socioeconomic status
- level of student English proficiency
- student ethnicity and race

The Education Sciences Reform Act and IES
- "The Education Sciences Reform Act"
- "An Act to provide for improvement of Federal education research, statistics, evaluation, information, and dissemination, and for other purposes"
- H.R. 3801, signed into law on November 5, 2002
- reconstituted federal support for research and dissemination of information in education, to foster "scientifically valid research"
- established the Institute of Education Sciences (IES), replacing the Office of Educational Research and Improvement
- part of the Department of Education but functioning separately from it

The Education Sciences Reform Act and IES
- IES is the research arm of the Department of Education.
- Mission: to expand knowledge and provide information on
  - the condition of education
  - practices that improve academic achievement
  - the effectiveness of Federal and other education programs
- Goal: the transformation of education into an evidence-based field in which decision makers routinely seek out the best available research and data before adopting programs or practices that will affect significant numbers of students
- Consists of:
  - Office of the Director: Grover J. (Russ) Whitehurst, first Director, since November 2002
  - National Center for Education Research
  - National Center for Education Statistics
  - National Center for Education Evaluation and Regional Assistance
  - National Center for Special Education Research

The Education Sciences Reform Act and IES
H.R. 3801 defined "scientifically based research standards" to
- apply rigorous, systematic, and objective methodology to obtain reliable and valid knowledge relevant to education activities and programs, and
- present findings and make claims that are appropriate to and supported by the methods that have been employed.

The Education Sciences Reform Act and IES
"Scientifically based research" also includes:
- employing systematic, empirical methods that draw on observation or experiment
- involving data analyses that are adequate to support the general findings
- relying on measurements or observational methods that provide reliable data
- making claims of causal relationships only in random assignment experiments or other designs (to the extent such designs substantially eliminate plausible competing explanations for the obtained results)
- ensuring that studies and methods are presented in sufficient detail and clarity to allow for replication or, at a minimum, to offer the opportunity to build systematically on the findings of the research
- obtaining acceptance by a peer-reviewed journal or approval by a panel of independent experts through a comparably rigorous, objective, and scientific review
- using research designs and methods appropriate to the research question posed

The Education Sciences Reform Act and IES
"Scientifically valid education evaluation" means an evaluation that
- adheres to the highest possible standards of quality with respect to research design and statistical analysis
- provides an adequate description of the programs evaluated and, to the extent possible, examines the relationship between program implementation and program impacts
- provides an analysis of the results achieved by the program with respect to its projected effects
- employs experimental designs using random assignment, when feasible, and other research methodologies that allow for the strongest possible causal inferences when random assignment is not feasible
- may study program implementation through a combination of scientifically valid and reliable methods

What Works
What Works Clearinghouse (WWC)
- reviews and reports on existing studies of interventions (education programs, products, practices, and policies) in selected topic areas
- established in 2002 by IES to provide educators, policymakers, and the public with a central and trusted source of scientific evidence of what works in education
- administered by the U.S. Department of Education, through a contract to a joint venture of the American Institutes for Research and the Campbell Collaboration
- applies standards that follow scientifically valid criteria for determining the effectiveness of these interventions
Technical Advisory Group (TAG)
- leading experts in research design, program evaluation, and research synthesis
- advises on the standards for evaluation research reviews
- monitors and informs the methodological aspects of WWC reviews and reports
www.whatworks.ed.gov

What Works - TAG
- Dr. Larry V. Hedges, Chairperson, Stella M. Rowley Professor of Education, Psychology, Public Policy Studies, and Sociology, University of Chicago, and editorial board member of the American Journal of Sociology, the Review of Educational Research, and Psychological Bulletin.
- Dr. Betsy Jane Becker, Professor of Measurement and Quantitative Methods, College of Education, Michigan State University.
- Dr. Jesse A. Berlin, Professor of Biostatistics, University of Pennsylvania School of Medicine, and Director of Biostatistics at the university's Comprehensive Cancer Center.
- Dr. Douglas Carnine, Professor of Education, University of Oregon, and Director of the National Center to Improve the Tools of Educators.
- Dr. Thomas D. Cook, Professor of Sociology, Psychology, Education and Social Policy, Northwestern University, and Faculty Fellow at the Institute for Policy Research.
- Dr. David J. Francis, Professor of Quantitative Methods, Chairman of the Department of Psychology, and Director of the Texas Institute for Measurement, Evaluation, and Statistics, University of Houston.
- Dr. Robert L. Linn, Distinguished Professor of Education, University of Colorado at Boulder, and Co-Director of the National Center for Research on Evaluation, Standards, and Student Testing.
- Dr. Mark W. Lipsey, Senior Research Associate, Vanderbilt Institute for Public Policy Studies, and Director of the Center for Evaluation Research and Methodology.
- Dr. David Myers, Senior Fellow, Mathematica Policy Research, and former Director of the U.S. Department of Education's national evaluation of Upward Bound.
- Dr. Andrew C. Porter, Patricia and Rodes Hart Professor of Educational Leadership and Policy and Director of the Learning Sciences Institute at Vanderbilt University.
- Dr. David Rindskopf, Professor of Psychology and Educational Psychology, City University of New York Graduate Center, and elected Fellow of the American Statistical Association.
- Dr. Cecilia E. Rouse, Professor of Economics and Public Affairs, and joint appointee in the Economics Department and Woodrow Wilson School, Princeton University.
- Dr. William R. Shadish, Founding Faculty and Professor of Social Sciences, Humanities, and Arts at the University of California, Merced.

What Works Current Topics
The What Works Clearinghouse (WWC) prioritizes topics based on the following criteria:
- potential to improve important student outcomes;
- applicability to a broad range of students or to particularly important subpopulations;
- policy relevance and perceived demand within the education community; and
- likely availability of scientific studies.
Specifically, the topics were selected from nominations received through:
- emails from the public;
- meetings and presentations sponsored by the What Works Clearinghouse;
- the What Works Network;
- suggestions presented by senior members of education associations, policymakers, and the U.S. Department of Education; and
- reviews of existing research.
What Works Current Topics
Topics include:
- Math: Curriculum-Based Interventions for Increasing Middle School Math
- Reading: Interventions for Beginning Reading
- Character Education: Comprehensive Schoolwide Character Education Interventions: Benefits for Character Traits, Behavioral, and Academic Outcomes
- Dropout Prevention: Interventions for Preventing High School Dropout
- English Language Learning: Interventions for Elementary School English Language Learners: Increasing English Language Acquisition and Academic Achievement
- Math: Curriculum-Based Interventions for Increasing Elementary School Math
- Early Childhood: Interventions for Improving Preschool Children's School Readiness
- Delinquent, Disorderly, and Violent Behavior: Interventions to Reduce Delinquent, Disorderly, and Violent Behavior in Middle and High Schools
- Adult Literacy: Interventions for Increasing Adult Literacy
- Peer-Assisted Learning: Peer-Assisted Learning Interventions in Elementary Schools: Reading, Mathematics, and Science Gains

"Does Not Meet Evidence Screens"
Studies may not pass WWC screening requirements for the following reasons:
- Evaluation research design. The study did not meet certain design standards. Study designs that provide the strongest evidence of effects include:
  - randomized controlled trials
  - regression discontinuity designs
  - quasi-experimental designs (must use a similar comparison group and have no attrition or disruption problems)
  - single subject designs
- Topic area definition. The study did not meet the intervention definition developed by the WWC for a particular topic.
- Time period definition (generally, the last 20 years).
- Relevant outcome:
  - academic outcomes, not, for example, student self-confidence
  - a study needs to have only one relevant outcome to pass this screen
  - test reliability or validity
  - a sample or description of relevant test items if a study outcome test is not known or available
- Relevant student sample.

A Real Live Current Example
MATHEMATICS AND SCIENCE EDUCATION RESEARCH GRANTS PROGRAM
- CFDA (Catalog of Federal Domestic Assistance) NUMBER: 84.305
- RELEASE DATE: May 6, 2005
- REQUEST FOR APPLICATIONS NUMBER: NCER-0602
- Mathematics and Science Education Research Grants Program
- http://www.ed.gov/about/offices/list/ies/programs.html
- LETTER OF INTENT RECEIPT DATE: September 12, 2005
- APPLICATION RECEIPT DATE: November 3, 2005, 8:00 p.m. Eastern time

A Real Live Current Example
REVIEW CRITERIA FOR SCIENTIFIC MERIT
Significance
- Does the applicant make a compelling case for the potential contribution of the project to the solution of an education problem?
- Does the applicant present a strong rationale justifying the need to evaluate the selected intervention (e.g., does prior evidence suggest that the intervention is likely to substantially improve student learning and achievement)?
Research Plan
- Does the applicant present (a) clear hypotheses or research questions, (b) clear descriptions of and strong rationales for the sample, measures (including information on reliability and validity), data collection procedures, and research design, and (c) a detailed and well-justified data analysis plan?
- Does the research plan meet the requirements described in the section on the Requirements of the Proposed Research?
- Is the research plan appropriate for answering the research questions or testing the proposed hypotheses?
A Real Live Current Example
Applications under Goal Three (Efficacy and Replication Trials)
- Under Goal Three, the Institute requests proposals to test the efficacy of fully developed interventions that already have evidence of potential efficacy.
- By efficacy, the Institute means the degree to which an intervention has a net positive impact on the outcomes of interest in relation to the program or practice to which it is being compared.

A Real Live Current Example
Methodological requirements
(i) Sample
- The applicant should define, as completely as possible, the sample to be selected and the sampling procedures to be employed for the proposed study.
- Additionally, the applicant should describe strategies to ensure that participants will remain in the study over the course of the evaluation.

A Real Live Current Example
(ii) Design
- Applicants should describe how potential threats to internal and external validity will be addressed.
- Studies using randomized assignment to treatment and comparison conditions are strongly preferred.
- When a randomized trial is used, the applicant should clearly state the unit of randomization (e.g., students, classroom, teacher, or school). Choice of randomizing unit or units should be grounded in a theoretical framework.
- Applicants should explain the procedures for assignment of groups (e.g., schools, classrooms) or participants to treatment and comparison conditions.

A Real Live Current Example
(ii) Design (continued)
- Only in circumstances in which a randomized trial is not possible may alternatives that substantially minimize selection bias, or allow it to be modeled, be employed.
- Applicants … must make a compelling case that randomization is not possible.
- Acceptable alternatives include appropriately structured regression-discontinuity designs or other well-designed quasi-experimental designs that come close to true experiments in minimizing the effects of selection bias on estimates of effect size.

A Real Live Current Example
(ii) Design (continued)
A well-designed quasi-experiment reduces substantially the potential influence of selection bias on membership in the intervention or comparison group. This involves:
- demonstrating equivalence between the intervention and comparison groups at program entry on the variables measuring program outcomes (e.g., math achievement test scores), or obtaining such equivalence through statistical procedures such as propensity score balancing or regression (a hedged propensity-score sketch appears after the power discussion below)
- demonstrating equivalence or removing statistically the effects of other variables on which the groups may differ and that may affect intended outcomes of the program being evaluated (e.g., demographic variables, experience and level of training of teachers, motivation of parents or students)
- a design for the initial selection of the intervention and comparison groups that minimizes selection bias or allows it to be modeled

A Real Live Current Example
(iii) Power
- Applicants should clearly address the power of the evaluation design to detect a reasonably expected and minimally important effect.
- For determining the sample size, applicants need to consider the number of clusters, the number of individuals within clusters, the potential adjustment from covariates, the desired effect, the intraclass correlation (i.e., the variance between clusters relative to the total variance between and within clusters), the desired power of the design, one-tailed vs. two-tailed tests, repeated observations, attrition of participants, etc.
- Applicants should anticipate the degree to which the magnitude of the expected effect may vary across the primary outcomes of interest.
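To make these power considerations concrete, the sketch below (not part of the RFA; all numbers are invented) shows an approximate power calculation for a cluster-randomized design, using the standard design-effect adjustment 1 + (m - 1) * ICC for clusters of size m.

```python
# Minimal sketch (illustrative numbers only): approximate power for a
# cluster-randomized trial, using the design effect 1 + (m - 1) * ICC.
from math import sqrt
from scipy.stats import norm

def cluster_power(effect_size, n_clusters_per_arm, cluster_size, icc,
                  alpha=0.05, two_tailed=True):
    """Normal-approximation power to detect a standardized effect (Cohen's d)
    when clusters (e.g., classrooms) are randomized rather than students."""
    design_effect = 1 + (cluster_size - 1) * icc      # inflation due to clustering
    n_per_arm = n_clusters_per_arm * cluster_size     # students per arm
    effective_n = n_per_arm / design_effect           # "effective" sample size per arm
    se = sqrt(2.0 / effective_n)                      # SE of the standardized mean difference
    z_crit = norm.ppf(1 - alpha / 2) if two_tailed else norm.ppf(1 - alpha)
    return norm.cdf(effect_size / se - z_crit)

# Example: 20 classrooms per arm, 25 students each, ICC = 0.15, d = 0.25
print(round(cluster_power(0.25, 20, 25, 0.15), 2))
```

With the invented numbers shown, the estimated power falls well below the conventional 0.80 target, which illustrates why the number of clusters, cluster size, and intraclass correlation all have to be weighed before data collection begins.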
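Returning to the quasi-experimental design requirement above, the following sketch is a minimal, hypothetical illustration of estimating propensity scores and checking covariate balance via inverse-probability weighting. It is not the RFA's prescribed procedure, the variable names and data are invented, and matching or stratification on the propensity score are equally acceptable alternatives.

```python
# Minimal, hypothetical sketch: estimate propensity scores for a
# non-randomized intervention and check covariate balance after weighting.
# Column names and data are invented for illustration.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "pretest": rng.normal(50, 10, n),       # prior math achievement
    "frl": rng.integers(0, 2, n),           # free/reduced-price lunch indicator
    "teacher_exp": rng.normal(8, 4, n),     # years of teaching experience
})
# Treatment uptake depends on covariates (the source of selection bias).
logit = 0.03 * (df["pretest"] - 50) + 0.5 * df["frl"]
df["treated"] = rng.random(n) < 1 / (1 + np.exp(-logit))

X = df[["pretest", "frl", "teacher_exp"]]
ps = LogisticRegression(max_iter=1000).fit(X, df["treated"]).predict_proba(X)[:, 1]

# Inverse-probability weights; balanced groups should show similar weighted covariate means.
w = np.where(df["treated"], 1 / ps, 1 / (1 - ps))
for col in X.columns:
    t_mean = np.average(df.loc[df["treated"], col], weights=w[df["treated"]])
    c_mean = np.average(df.loc[~df["treated"], col], weights=w[~df["treated"]])
    print(f"{col}: treated {t_mean:.2f} vs comparison {c_mean:.2f}")
```

Checking balance on the pretest and demographic variables after weighting is one way an applicant could document the "equivalence at program entry" that the RFA asks quasi-experiments to demonstrate.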
A Real Live Current Example
(iv) Measures
Investigators should include:
- relevant standardized measures of student achievement (e.g., standardized measures of mathematics achievement)
- other measures of student learning and achievement (e.g., researcher-developed measures)
- measures of teacher practices
- information on the reliability, validity, and appropriateness of proposed measures

A Real Live Current Example
(v) Fidelity of implementation of the intervention
The applicant should
- specify how the implementation of the intervention will be documented and measured
- either indicate how the intervention will be maintained consistently across multiple groups (e.g., classrooms and schools) over time, or describe the parameters under which variations in the implementation may occur
- propose research designs that permit the identification and assessment of factors impacting the fidelity of implementation

A Real Live Current Example
(vi) Comparison group, where applicable
The applicant should
- describe strategies to avoid contamination between treatment and comparison groups
- include procedures for describing practices in the comparison groups
- be able to compare intervention and comparison groups on the implementation of key features of the intervention
- Using a business-as-usual comparison group is acceptable; applicants should specify the treatment or treatments received in the comparison group.
- Applicants should account for the ways in which what happens in the comparison group is important to understanding the net impact of the experimental treatment.

A Real Live Current Example
(vii) Mediating and moderating variables
- Mediating and moderating variables that are measured in the intervention condition and that are also likely to affect outcomes in the comparison condition should be measured in the comparison condition (e.g., student time-on-task, teacher experience/time in position).
- The evaluation should account for sources of variation in outcomes across settings (i.e., to account for what might otherwise be part of the error variance).
(viii) Data analysis
- Specific statistical procedures should be described.
- The relation between hypotheses, measures, and independent and dependent variables should be clear.
- The effects of clustering must be accounted for in the analyses, even when individuals are randomly assigned to condition (see the sketch below).
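As a final illustration of the clustering requirement, the sketch below (invented data and variable names; not part of the RFA) fits a simple two-level model with a random intercept for classroom, one standard way to keep standard errors honest when students are nested within classrooms.

```python
# Minimal sketch (invented data): accounting for clustering of students within
# classrooms via a random-intercept (multilevel) model.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_classrooms, n_students = 40, 25
classroom = np.repeat(np.arange(n_classrooms), n_students)
treated = np.repeat(rng.integers(0, 2, n_classrooms), n_students)  # treatment assigned by classroom
classroom_effect = np.repeat(rng.normal(0, 4, n_classrooms), n_students)
score = 50 + 3 * treated + classroom_effect + rng.normal(0, 10, n_classrooms * n_students)

df = pd.DataFrame({"score": score, "treated": treated, "classroom": classroom})

# Random intercept for classroom; the treatment effect's standard error now
# reflects the number of classrooms, not just the number of students.
model = smf.mixedlm("score ~ treated", data=df, groups=df["classroom"]).fit()
print(model.summary())
```

Cluster-robust standard errors or other design-based adjustments are common alternatives when a full multilevel model is not required; the essential point, as the RFA states, is that the analysis must not treat clustered students as independent observations.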