Rachel A. Gordon
University of Illinois at Chicago

Kerry G. Hofer
Peabody Research Institute, Vanderbilt University

Presentation in the Presidential Session on Universal Preschool: What Have We Learned, and What Does It Mean for Practice and Policy? Annual Meeting of the American Educational Research Association (April 6, 2014).

•Principal Investigators: Rachel Gordon, Kerry Hofer
•Other Investigators: Sandra Wilson, Everett Smith
•Graduate Students: Elisabeth Stewart, Jenny Kushto-Hoban, Hillary Rowe, Anna Colaner, Rowena Crabbe, Fang Peng, Danny Lambouths, Ken Fujimoto
•Consultant: Betsy Becker
•Institute of Education Sciences Grant #R305A130118

We present some results from our preliminary investigations in this new project. Although we have confidence in what is presented here, these analyses are the first steps toward a more thorough look at the validity of two quality measures. The results may change as we move forward, including as we revise the details of the regression models and the meta-analytic techniques used and as we take the products through peer review.

•Principal Investigator: Rachel Gordon
•Other Investigators: Everett Smith, Robert Kaestner, Sanders Korenman
•Graduate Students: Ken Fujimoto, Kristin Abner, Anna Colaner, Nicole Colwell, Xue Wang
•IES R305A090065
•NIH R01HD060711

Policy initiatives focus on high-quality preschool…
•high quality early childhood education
•high-quality early learning programs
Source: http://www.whitehouse.gov/issues/education/early-childhood

This question seems simple at first glance, but upon reflection is difficult to answer. In the case of early care and education, thinking carefully about this question, and how to answer it, is increasingly important.
With the push toward expanding access to high quality preschool, measures designed for other purposes have been adopted for high stakes use.

•ECERS-R: Average overall score: At least 4.5 with no classroom below a 4.0, verified by on-site independent assessment
•CLASS: Emotional support and classroom organization average scores above 5.0 with no classroom below 4.0, as verified by on-site independent assessment
Source: http://www.excelerateillinois.com/docman/resources/2-gold-excelerate-illinois-chart/file

I will review the case that:
•Both were developed for other purposes.
•Both have limitations for these high stakes uses.

And discuss the rationale for:
•Rethinking how we verify program quality.
•Establishing research-policy-practice partnerships to support a “next generation” of quality measures.

I will begin by examining evidence regarding whether the scales predict child outcomes. This aspect of validity is relevant to policy, to the extent that public investments in early care and education are meant to promote optimal child development and school readiness.

There may be many different reasons for these small associations of ECERS-R and CLASS scores with child achievement outcomes. One possibility is low validity in other, more fundamental features of the measures… at least for assessing aspects of quality that promote school readiness in ways suitable for high stakes uses.

ECERS-R
•Items mix different aspects of quality.
•Standard scoring makes it difficult to “pull out” aspects of quality most relevant for school readiness.

CLASS
•Highly inferential rating process may limit inter-rater reliability.
•Limited empirical evidence for theoretical dimensions.

•Developed in 1970s from a checklist to help practitioners improve the quality of their settings.
•Reflects the early childhood education field’s concept of developmentally appropriate practice: predominance of child-initiated activities selected from a wide array of options; a “whole child” approach that integrates physical, emotional, social and cognitive development; teacher facilitation of development by being responsive to children’s age-related and individual needs.
•Over 400 indicators
•The standard “stop scoring” structure reflects these checklist, practice and philosophical origins.

•43 items!
•Categories from 1 to 7 have several “indicators”
•Conditions in the indicators of lower scores must be met before indicators of higher scores are evaluated.
•Especially within some items, indicators are often organized around contexts of practice and reflect multiple aspects of quality.
Source: Harms, T., Clifford, R.M., & Cryer, D. (1998). Early Childhood Environment Rating Scale, Revised Edition. New York, NY: Teachers College Press.

If higher scores reflect higher quality, then average quality scores should be higher for centers rated in higher categories versus lower categories. In item response theory models, the thresholds between categories should also show a stair-step progression, if they are ordered so that higher categories mark higher quality.
Source: Gordon, Rachel A., Ken Fujimoto, Robert Kaestner, Sanders Korenman, and Kristin Abner. 2013. “An Assessment of the Validity of the ECERS-R with Implications for Assessments of Child Care Quality and its Relation to Child Development.” Developmental Psychology, 49: 146-160

Study Name (Initial Year): Focal Population
•3-City Study (1999): Low-income families from low-income neighborhoods in Boston, Chicago and San Antonio.
•ECLS-B (2001): Nationally representative sample drawn from birth records in 46 states.
•Early Head Start Research Eval Project (1996-1998): New Early Head Start applicants with a child under 12 months of age.
•FACES 1997 (1997), FACES 2000 (2000), FACES 2003 (2003), Head Start Impact Study (2002): New Head Start 3- and 4-year-old participants.
•Fragile Families (1998-2000): Birth records sampled from hospitals in twenty large U.S. cities.
•PCER (2003): Twelve sites implemented curricula in preschool programs. Each site had 14-20 programs.
•QUINCE (2004): Twenty-four CCR&R agencies in five states (CA, IA, MN, NE, NC)

[Figure: ECERS-R Item 10, Meals/Snacks. Indicator difficulty (vertical axis) plotted by indicator (horizontal axis), with indicators grouped under Score 1, Score 3, Score 5 and Score 7. Indicator labels: Sanitary condition; Well-balanced meals; Schedule appropriate; Nonpunitive atmosphere; Sanitary conditions usually maintained; Eat independently; Pleasant atmosphere; Staff sits with children; Conversation; Child-sized utensils; Children help; Acceptable nutritional value; Appropriate meal schedule; Positive atmosphere.]
Source: Gordon, Rachel, Kerry Hofer, Ken Fujimoto, Nicole Colwell, Robert Kaestner, Sanders Korenman. “Measuring Aspects of Child Care Quality Specific to Domains of Child Development: An Indicator-level Analysis of the ECERS-R.” Presented in the Paper Symposium "Measuring Early Care and Education Quality: New Insights about the Early Childhood Environment System Rating Scale - Revised" (Chair: Rachel Gordon; Discussant: Margaret Burchinal) (Saturday, April 20, 2013, Seattle, WA).

Unlike the checklist and practice origins of the ECERS-R several decades ago, the CLASS was developed more recently based on “developmental theory and research suggesting that interactions between students and adults are the primary mechanism of student development and learning.” (Pianta, La Paro & Hamre, p. 1) Its predecessor was part of a research study, and it was aimed at professional development and coaching use before being adopted in high stakes policy contexts.

The CLASS manual requires observers to assimilate what they see in order to assign scores to just a few items. The manual advises: “Because of the highly inferential nature of the CLASS, scores should never be given without referring to the manual.” (Pianta, La Paro & Hamre, p.
17, bold in original)
Source: Pianta, R. C., La Paro, K. M., & Hamre, B. K. (2008). Classroom Assessment Scoring System Manual, PreK. Baltimore MD: Brookes Publishing.

A recent publication from the CLASS developers (Cash, Hamre, Pianta, & Myers, 2012) reveals:
•Exact reliability is low: 41% overall exact agreement with master score in training of 2,093 Head Start staff.
•Black and Latino raters placed their Instructional Support scores farther from the master score, as did raters who disagreed with intentional teaching beliefs.

The CLASS developers also recently found (Hamre, Hatfield, Pianta & Jamil, in press):
•A bi-factor structure with one general dimension (responsive teaching) and two specific dimensions (proactive management and routines; cognitive facilitation).
•These differ from the subscales written into policy.

In our work, we are replicating these results, and also examining the targeting and content of items with IRT models.

This body of evidence highlights the way in which measures developed for other purposes have been adopted for high stakes policy uses. Not surprisingly, there are limitations in the validity of these measures for this high stakes purpose.

Consistent with the latest Standards for Educational and Psychological Testing, we contend there is a need to step back and consider the intents of these policy uses, build in continuous and local validation of measures selected for these uses, and allow for the refinement of measures over place and time. We need to consider big picture questions like:
•What are the goals of public investments in preschool?
•How do we design quality measures to help assure we are meeting those specific goals?
http://www.apa.org/science/programs/testing/standards.aspx

As a concrete example, if it is desirable to distinguish classrooms that fall above and below specific thresholds, as in current policy uses, then measures with very high information (and low error) at those thresholds are needed.
If instead it is desirable to invest public dollars in improving quality through coaching, then we would like to have a measure (or two linked measures) that cover the continuum of quality over which growth is expected.

As another example, it is essential to think carefully about variation in quality across children, classrooms, times of day, days of week, and weeks of year. We currently have very little evidence about such variation, or about the extent to which choices about when and how to observe classrooms affect measure validity, including for high stakes uses.

Even with these challenges, there is much potential to build new evidence.