Can Research Design Explain Variation in Head Start Research Results? A Meta-analysis of Cognitive and Achievement Outcomes Hilary M. Shager University of Wisconsin-Madison Holly S. Schindler Center on the Developing Child Harvard University Katherine A. Magnuson University of Wisconsin-Madison Greg J. Duncan Department of Education University of California-Irvine Hirokazu Yoshikawa Harvard Graduate School of Education Cassandra M. D. Hart University of California, Davis Submitted to EEPA 1-20-2011 Can Research Design Explain Variation in Head Start Research Results? A Meta-analysis of Cognitive and Achievement Outcomes Abstract This meta-analysis explores the extent to which differences in research design explain the heterogeneity in program evaluation findings from Head Start impact studies. We predicted average effect sizes for cognitive and achievement outcomes as a function of the type and rigor of research design, quality and timing of dependent measure, activity level of control group, and attrition. Across 28 evaluations, the average program-level effect size was .27. About 41 percent of the variation in impacts across evaluations can be explained by the measures of research design features, including the extent to which the control group experienced other forms of early care or education, and 11 percent of the variation within programs can be explained by features of the outcomes. 1 Can Research Design Explain Variation in Head Start Research Results? A Meta-analysis of Cognitive and Achievement Outcomes The recognition that poor children start school well behind their more advantaged peers in terms of academic skills has focused attention on early childhood education (ECE) as a potential vehicle for remediating such early achievement gaps. Indeed, educators and advocates have successfully argued that greater public funds should be spent to support early education programs for disadvantaged children, and investments in a variety of ECE programs have increased tremendously in the past 20 years. This increased attention and funding has also produced a proliferation of ECE program evaluations. These studies have the potential to yield important information about differences in the effectiveness of particular program models; as a result, it is important for researchers and policy makers to be effective designers and informed consumers of such research. Researchers and policy makers interested in the effectiveness of ECE programs are faced with a difficult task in trying to make sense of findings across studies. Evaluations vary greatly in methods, quality, and results, which leads to an “apples vs. oranges” comparison problem. Previous reviews of ECE programs have described differences in research design as a part of their subjective, narrative analyses, and have suggested that such features might be important (Ludwig & Phillips, 2007; Magnuson & Shager, 2010; McKey et al., 1985). However, there has been little systematical empirical investigation of the role of study design features in explaining differing results. This study investigates the importance of study design features in explaining variability in program evaluation results for a particular model of ECE, Head Start. A centerpiece of President Lyndon B. 
Johnson's "War on Poverty," Head Start was designed as a holistic intervention to improve economically disadvantaged, preschool-aged children's cognitive and social development by providing a comprehensive set of educational, health, nutritional, and social services, as well as opportunities for parent involvement (Zigler & Valentine, 1979). Program guidelines require that at least 90 percent of the families served in each Head Start program be poor (with incomes below the federal poverty threshold).1 Since its inception in 1965, the federally funded program has enrolled over 27 million children (US DHHS, 2010). 1 Recent legislation gives programs some discretion to serve children near poverty. Children from families receiving public assistance (TANF or SSI) or children who are in foster care are eligible for Head Start or Early Head Start regardless of family income level. Ten percent of program slots are reserved for children with disabilities, also regardless of income. Tribal Head Start programs and programs serving migrant families may also have more open eligibility requirements (US DHHS, Administration for Children and Families, 2010). Despite the program's longevity, the question of whether children who attend Head Start experience substantial positive gains in their academic achievement has been the subject of both policy and academic debate since the 1969 Westinghouse study, which found that initial positive gains experienced by program participants "faded out" by the time children reached early elementary school (Westinghouse Learning Corporation, 1969). The design of that study was criticized because the comparison group may have been composed of children not as disadvantaged as those attending Head Start, thus leading to a possible underestimation of program impacts (McGroder, 1990). The lack of experimental studies of Head Start, coupled with evidence of positive short- and long-term effects from non-experimental studies (Currie & Thomas, 1995; Deming, 2009; Garces, Thomas, & Currie, 2002; Ludwig & Miller, 2007), perpetuated debate and precluded definitive conclusions about the program's impact. In 2005, when the results of the first large national experiment evaluating Head Start were reported (US DHHS, 2005), and again in 2010 when the follow-up study was released (US DHHS, ACF, 2010), some decried the comparatively small effects that again appeared to fade out over time, and contrasted them with much larger effects from other early education demonstration programs and state prekindergarten programs. For example, Besharov and Higney (2007) wrote, "It seems reasonable to compare Head Start's impact to that of both pre-K programs and center-based child care. According to recent studies, Head Start's immediate or short-term impacts on a host of developmental indicators seem not nearly as large as those of either pre-K or center-based child care" (p. 686-687). Yet, several scholars have argued that methodological differences may at least in part explain the differences in program impacts across such ECE studies (Cook & Wong, 2007). How large a role these design features might play, however, has been only a matter of conjecture. By using published or reported impact estimates as the unit of study, a meta-analysis provides a unique opportunity to estimate associations between research design characteristics and results.
In this paper, we employ this methodology, using information from Head Start impact studies conducted between 1965 and 2007, to estimate the average effect size for children's cognitive and achievement outcomes across our sample of Head Start studies. Then, we explore how research design features are related to variation in effect size. Head Start serves as a particularly effective case study for such a methodological exercise. It has provided a fairly standardized set of services to a relatively homogeneous population of children over a long period of time; yet evaluations have yielded varying results regarding the program's effectiveness in improving children's academic skills. By holding constant program features such as funding, program structure and requirements, and family socioeconomic background of children served, our analysis can better estimate the independent effects of study design in explaining variation in findings, while ensuring that such differences cannot easily be attributed to these other potentially confounding factors. Our paper proceeds as follows: first, we review the previous literature regarding the relationship between research design factors and effect sizes measuring the impact of ECE programs on children's cognitive and achievement outcomes; second, we describe our method, meta-analysis; third, we present our results; and finally, we discuss the implications of our findings for future research and policy. Literature Review In most ECE studies it is difficult a priori to know how variations in research design will affect estimates of program effectiveness, as such effects may be specific to the particular aspects and features of the program and population being studied. Understanding the role of study design is rarely the intent of the analysis, and most often meta-analyses rely on a single omnibus indicator of design quality, which typically combines a number of study design features related to issues of internal validity. Meta-analyses employing such an approach have yielded mixed evidence of significant associations between research design factors and the magnitude of effect sizes for children's cognitive and achievement outcomes. The only existing meta-analysis of Head Start research, conducted over 25 years ago by McKey and colleagues (1985), included studies completed between 1965 and 1982, and found positive program impacts on cognitive test scores in the short term (effect sizes = .31 to .59), but not the long term (two or more years after program completion; effect sizes = -.03 to .13). Initial descriptive analyses of methodological factors such as quality of study design, sampling, and attrition revealed only slight influences on the magnitude and direction of effect sizes; therefore, these variables were not included in the main analyses. The authors found, however, that studies with pre-/post-test designs that lacked a comparison group (and thus may not have adequately controlled for maturation effects) tended to produce larger effect sizes than studies that included a comparison group. More general meta-analyses of ECE programs, which include some Head Start studies as well as evaluations of other ECE models, are difficult to compare because of varying study inclusion criteria and differing composite measures of study quality. For example, in a meta-analysis of 123 ECE evaluations, Camilli and colleagues (2010) found that studies with higher-quality design yielded larger effect sizes for cognitive outcomes.
The measure of design quality was a single dummy variable encompassing such factors as attrition, baseline equivalence of treatment and control groups, high implementation fidelity, and adequate information provided for coders. In contrast, Gorey (2001), in his meta-analysis of 23 ECE programs with long-term follow-up studies, found no significant link between an index of internal validity (including factors such as design type, sample size, and attrition) and effect size magnitude. Nelson, Westhues and MacLeod (2003), in their meta-analysis of 34 longitudinal preschool studies, also found no significant link between effect size and a total methodological quality score, based on factors such as whether studies were randomized, whether inclusion and exclusion criteria were clearly defined, whether baseline equivalence between treatment and control groups was established, whether outcome measures were blinded, outcome measure reliability, and year published. Interestingly, several meta-analyses of other medical and social service interventions have found that low-quality study designs yield larger effect sizes than high-quality ones (Moher et al., 1998; Schulz et al., 1995; Sweet & Appelbaum, 2004). Prior research also suggests that other features of study design may matter—in particular, the quality and type of measures selected. Meta-analyses of various K-12 educational interventions provide evidence that researcher-developed tests yield larger average effect sizes than standardized tests (for a review see Rosenshine, 2010). One suggested reason for this finding is that researcher-developed tests may assess skills more closely related to those specifically taught in a particular intervention, while standardized tests may tap more general skills (Gersten et al., 2009). Suggestive evidence that measures that are more closely aligned with the practices of classroom instruction may be more sensitive to early education is found in Wong and colleagues’ (2008) regression discontinuity study of five state prekindergarten programs. All of the programs measured both children’s receptive vocabulary (Peabody 6 Picture Vocabulary Test) as well as their level of print awareness. Across the five programs, effects on print awareness were several times greater than the effects on receptive vocabulary. Two overlooked aspects of outcome measures are reliability and whether outcomes are based on objective measures of a child’s performance or an observational rating. Assessments of children’s cognitive development and academic achievement vary greatly in their rigor and potential for bias. For example, standardized assessments that have been carefully designed to have high levels of reliability are likely to introduce less measurement error than either teacher or parent reports of children’s cognitive skills; however, both types of measures are regularly used as outcomes. Some scholars argue that observational ratings may be likely to introduce bias. Hoyt and colleagues’ (1999) meta-analysis of psychological studies on rater bias found that rating bias was likely to be very low in measures that included explicit attributes (counts of particular behaviors) but quite prevalent in measures that required raters to make inferences (global ratings of achievement or skills). 
If ratings of young children are a mix of these types of measures, then such scales might be more biased than performance ratings; however, the expected direction of bias remains unclear, and may differ by who is doing the rating (research staff versus teacher) and whether the rater is blind to the participants’ treatment status (Hoyt et al., 1999). Also overlooked in this literature is the activity level of the control group, despite concerns about this raised in discussions of recent Head Start studies (National Forum on Early Childhood Policy and Programs, 2010). In the case of ECE evaluations, we can define this as participation in center-based care or preschool among control-group children. Control group activity varies considerably across ECE studies, particularly in more recent evaluations, given the relatively high rates of participation in early care and education programs among children of working parents, and the expansion of ECE programs in the decades since Head Start began. Cook (2006) compared the level of center-based early education in several studies of preschool programs, and found that rates of participation of the control group in these 7 settings during the preschool “treatment” year varied from just over 20 percent in state prekindergarten studies (Wong et al., 2008) to nearly 50 percent in the Head Start Impact Study (US DHHS, 2005). He argued that these differences might in part explain the larger effect sizes associated with recent prekindergarten evaluations compared to the findings reported in the Head Start Impact Study, and more generally that comparing effect sizes across programs when both the program and study design differ leaves one unable to make any comparative judgment about program effectiveness. Research Questions While some previous meta-analyses suggest a link between research design characteristics and results, some potentially important aspects of studies have not been considered. Furthermore, previous meta-analyses have generally employed a single design quality composite score rather than estimating the comparative role of each particular design factor in question. A more detailed empirical test of the importance of particular research design characteristics is needed to enable scholars and policy makers to better understand findings from prior ECE studies, as well as to inform future studies. Our study tests whether the following research design characteristics explain heterogeneity in the estimated effects of Head Start on children’s cognitive and achievement outcomes: type and rigor of design, quality and timing of dependent measure, activity level of control group, and attrition. Because Head Start primarily serves disadvantaged children, the concern with many prior studies of the program is that analysts were unable to find a comparison group that was as disadvantaged as the treatment group; thus, the independent influence of such disadvantage might downwardly bias estimates of Head Start effectiveness (Currie & Thomas, 1995). Therefore, we hypothesize that studies that used more rigorous methods to ensure similarity between treatment and control groups prior to program participation, particularly random assignment, will produce larger effect sizes. 8 We also investigate whether three aspects of outcome measures affect the magnitude of effect sizes. 
The first is whether the measure assessed types of skills that are more closely aligned with early education instruction, such as pre-reading and pre-math skills, which we expect to yield larger effect sizes than measures of more abstract and global cognitive skills, because these skills have been demonstrated to be particularly sensitive to instruction (Christian, Morrison, Frazier, & Massetti, 2000; Wong et al., 2008). The second is whether the measure was a performance test, rating (by teacher or caregiver), or observation (often by a researcher). Given that ratings and observations have the potential to introduce rater bias, and it is not clearly understood how this error might affect estimates, we do not have a clear hypothesis about how estimates of effect sizes will differ across these types of measures. Third, we explore the role of a measure’s reported reliability. If high reliability indicates low measurement error, we might expect somewhat larger estimates from more reliable measures. However, prior studies show that larger effect sizes result from researcher-developed tests, which may be explained by better alignment between outcome measures and skills taught in a particular intervention, and relatively lower reliability of these measures, compared to standardized tests (Rosenshine, 2010). Thus, we might also expect more reliable measures to predict smaller effect sizes. Based on findings from previous Head Start evaluations, which suggest “fade out” of effects over time, we expect the length of time between the end of the Head Start program and outcome measurement to be negatively associated with average effect size (Currie & Thomas, 1995; McKey et al., 1985). Given research suggesting that children derive cognitive gains from other types of ECE programs (Zhai, Brooks-Gunn, & Waldfogel, 2010), we also expect a negative relationship between effect size and the activity level of the control group (i.e., a measure of whether control group members sought alternative ECE services on their own). Finally, we predict that studies with higher levels of 9 overall attrition may yield smaller effect sizes, since disadvantaged students, who are most likely to benefit from the program, are also most likely to be lost to follow-up.2 Research Methods Meta-analysis. To understand how specific features of research design may account for the heterogeneity in estimated Head Start effects, we conducted a meta-analysis, a method of quantitative research synthesis that uses prior study results as the unit of observation (Cooper & Hedges, 2009). To combine findings across studies, estimates are transformed into a common metric, an “effect size,” which expresses treatment/control differences as a fraction of the standard deviation of the given outcome. Outcomes from individual studies can then be used to estimate the average effect size across studies. Additionally, meta-analysis can be used to test whether average effect size differs by, for example, characteristics of the studies and study samples. After defining the problem of interest, metaanalysis proceeds in the following steps, described below: 1) literature search, 2) data evaluation, and 3) data analysis. Literature Search. The Head Start studies analyzed in this paper compose a subset of studies from a large meta-analytic database being compiled by the National Forum on Early Childhood Policy and Programs. 
This database includes studies of child and family policies, interventions, and prevention programs provided to children from the prenatal period to age five, building on previous meta-analytic databases created by Abt Associates, Inc. and the National Institute for Early Education Research (NIEER) (Camilli et al., 2010; Jacob, Creps & Boulay, 2004; Layzer, Goodson, Bernstein & Price, 2001). An important first step in a meta-analysis is to identify all relevant evaluations that meet one's programmatic and methodological criteria for inclusion; therefore, a number of search strategies were used to locate as many published and unpublished Head Start evaluations conducted between 1965 and 2007 as possible. First, we conducted key word searches in ERIC, PsycINFO, EconLit, and Dissertation Abstracts databases, resulting in 304 Head Start evaluations.3 Next, we manually searched the websites of several policy institutes (e.g., Rand, Mathematica, NIEER) and state and federal departments (e.g., U.S. Department of Health and Human Services) as well as references mentioned in collected studies and other key Head Start reviews. This search resulted in another 134 possible reports for inclusion in the database. In sum, 438 Head Start evaluations were identified, in addition to the 126 previously coded by Abt and NIEER. Data Evaluation. The next step in the meta-analysis process was to determine whether identified studies met our established inclusion criteria. To be included in our database, studies must have i) a comparison group (either an observed control or alternative treatment group); and ii) at least 10 participants in each condition, with attrition of less than 50 percent in each condition. Evaluations may be experimental or quasi-experimental, using one of the following methods: regression discontinuity, fixed effects (individual or family), residualized or other longitudinal change models, difference in difference, instrumental variables, propensity score matching, or interrupted time series. Quasi-experimental evaluations not using one of the former analytic strategies are also screened in if they include a comparison group plus pre- and post-test information on the outcome of interest or demonstrate adequate comparability of groups on baseline characteristics. These criteria are more rigorous than those applied by McKey et al. (1985) and Abt/NIEER; for example, they eliminate all pre/post-only (no comparison group) studies, as well as regression-based studies with baseline nonequivalent treatment and control groups. 2 We recognize, however, that in the presence of differential attrition across the control and treatment groups, the effects of attrition may be more complicated, depending on how patterns of attrition differ across the groups. 3 The original Abt/NIEER database included ECE programs evaluated between 1960 and 2003 and used similar search techniques; therefore, we did not re-search for Head Start evaluations conducted during these years, with the exception of 2003. We conducted searches for evaluations completed between 2003 and 2007, as well as for programs not targeted by the original Abt/NIEER search strategies. For this particular study, which is focused on impact evaluations of Head Start, we impose some additional inclusion criteria. We include only studies that measure differences between Head Start participants and control groups that were assigned to receive no other services.
For example, studies that randomly assigned children to Head Start versus another type of early education program or Head Start add-on program are excluded. However, studies are not excluded if families assigned to a no-treatment control group sought services of their own volition. In addition, we include only Head Start studies that provide at least one measure of children's cognitive or achievement outcomes. Furthermore, to improve comparability across findings, we impose limitations regarding the timing of study outcome measures. We limit our analysis to studies in which children received at least 75 percent of the intended Head Start treatment and for which outcomes were measured 12 or fewer months post-treatment. The screening process, based on the above criteria, resulted in the inclusion of 57 Head Start publications or reports (See Appendix A: Database References).4 Of the 126 Head Start reports originally included in the Abt/NIEER database, 29 were eliminated from the database because they did not meet our research design criteria. The majority of the 438 additional reports identified by the research team's search were excluded after reading the abstract (N=243), indicating that they did not meet our inclusion criteria for obvious reasons (e.g., they were not quantitative evaluations of Head Start or did not have a comparison group). Of the 98 Head Start evaluation reports that were excluded after full-text screening, 90 were excluded because they did not meet our research design criteria; 53 of these specifically because they did not include a comparison group. Eight other reports were excluded due to other eligibility criteria (e.g., they only reported results for students with disabilities or did not report results for outcomes we were interested in coding). Our additional inclusion criteria for this paper (e.g., short-term cognitive or achievement outcomes only, no alternative treatment or curricular add-on studies) excluded 120 reports that remain coded in the database, but are not included in this analysis.5 4 Because some of our inclusion criteria differed from Abt's and NIEER's original criteria, we re-screened all of the studies included in the original database as well as the new ones identified by the research team. 5 Despite extensive efforts, we were unable to locate 17 reports identified in our searches. Coding Studies. For reports that met our inclusion criteria, the research team developed a protocol to codify information about study design, program and sample characteristics, as well as statistical information needed to compute effect sizes. This protocol serves as the template for the database and delineates all the information about an evaluation that we want to describe and analyze. A team of 10 graduate research assistants was trained as coders during a 3- to 6-month process that included instruction in evaluation methods, using the coding protocol, and computing effect sizes. Trainees were paired with experienced coders in multiple rounds of practice coding. Before coding independently, research assistants also passed a reliability test composed of randomly selected codes from a randomly selected study. In order to pass the reliability test, researchers had to calculate 100 percent of the effect sizes correctly and achieve 80 percent reliability with a master coder for the remaining codes. In instances when research assistants were just under the threshold for effect sizes, but were reliable on the remaining codes, they underwent additional effect size training before coding independently and were subject to periodic checks during their transition. Questions about coding were resolved in weekly research team conference calls.
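As an illustration of the two reliability criteria just described, the short sketch below checks a hypothetical trainee against a master coder. The function name, tolerance, and example codes are invented for illustration and do not come from the study's coding protocol.

```python
def coder_passes(trainee_codes, master_codes, trainee_es, master_es, tol=0.005):
    """Apply the two screening rules described in the text (illustrative only)."""
    # Rule 1: every practice effect size must match the master coder's value.
    es_ok = all(abs(a - b) <= tol for a, b in zip(trainee_es, master_es))
    # Rule 2: at least 80 percent agreement with the master coder on the remaining codes.
    agreement = sum(a == b for a, b in zip(trainee_codes, master_codes)) / len(master_codes)
    return es_ok and agreement >= 0.80

# Hypothetical example: ten categorical codes and three computed effect sizes.
passed = coder_passes(
    trainee_codes=["rct", "active", "low", "published", "test", "cognitive", "age3", "yes", "no", "yes"],
    master_codes=["rct", "active", "low", "published", "test", "achievement", "age3", "yes", "no", "yes"],
    trainee_es=[0.31, 0.12, 0.45],
    master_es=[0.31, 0.12, 0.45],
)
```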
Database. The resulting database is organized in a four-level hierarchy (from highest to lowest): the study, the program, the contrast, and the effect size (See Table 1: Key Meta-Analysis Terms and Ns). A "study" is defined as a collection of comparisons in which the treatment groups are drawn from the same pool of subjects. A study may include evaluations of multiple "programs"; i.e., a particular type of Head Start or Head Start in a particular location. Each study also produces a number of "contrasts," defined as a comparison between one group of children who received Head Start and another group of children who received no other services as a result of the study. Evaluations of programs within studies may include multiple contrasts; for example, results may be presented using more than one analytic method (e.g., OLS and fixed effects) or separate groups of children (e.g., three- and four-year-olds), and these are coded as different contrasts nested within one program, within one study. The 33 Head Start programs included in our meta-analysis include 40 separate contrasts.6 In turn, each contrast consists of a number of individual "effect sizes" (estimated standard deviation unit difference in an outcome between the children who experienced Head Start and those who did not). The 40 contrasts in this meta-analytic database provide a total of 313 effect sizes.7 These effect sizes combine information from a total of over 160,000 observations. Effect Size Computation. Outcome information was reported in evaluations using a number of different statistics, which were converted to effect sizes (Hedges' g) with the commercially available software package Comprehensive Meta-Analysis (Borenstein, Hedges, Higgins, & Rothstein, 2005). Hedges' g is an effect size statistic that adjusts the standardized mean difference (Cohen's d) to account for bias in the d estimator when sample sizes are small. Dependent Variable. Descriptive information for the dependent measure (effect size) and key independent variables is provided in Table 2. To account for the varying precision among effect size estimates, as well as the number of effect sizes within each program, these descriptive statistics and all subsequent analyses are weighted by the inverse of the variance of each effect size multiplied by the inverse of the number of effect sizes per program (Cooper & Hedges, 2009; Lipsey & Wilson, 2001). 6 We excluded all "sub-contrasts," i.e., analyses of sub-groups of the main contrast (e.g., by race or gender), as this would have provided redundant information. See Appendix B for a description of the Head Start studies, programs and contrasts included in our analyses. 7 In several studies, outcomes were mentioned in the text, but not enough information was provided to calculate effect sizes; for example, references were made to non-significant findings, but no numbers were reported. Excluding such effect sizes could lead to upward bias of treatment effects; therefore, we coded all available information for such measures, but coded actual effect sizes as missing. These effect sizes (N=72) were then imputed four different ways and included in subsequent robustness checks. Specifics regarding imputation are discussed in the robustness check section of this paper. In cases where "nested" measures were coded (e.g., an overall IQ score plus several IQ sub-scale scores), we retained the unique sub-scales and excluded the redundant overall scores.
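To make the effect size metric and weighting scheme described above concrete, the sketch below computes Hedges' g, its sampling variance, and the analysis weight (inverse variance multiplied by the inverse of the number of effect sizes a program contributes). The authors computed effect sizes with the Comprehensive Meta-Analysis package; the formulas here are the standard small-sample-corrected ones, and the example values are hypothetical.

```python
import math

def hedges_g(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """Standardized mean difference with Hedges' small-sample correction."""
    df = n_t + n_c - 2
    sd_pooled = math.sqrt(((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2) / df)
    d = (mean_t - mean_c) / sd_pooled                 # Cohen's d
    j = 1 - 3 / (4 * df - 1)                          # small-sample correction factor
    g = j * d
    var_d = (n_t + n_c) / (n_t * n_c) + d**2 / (2 * (n_t + n_c))
    return g, j**2 * var_d                            # g and its sampling variance

def analysis_weight(var_g, k_effects_in_program):
    # Weight described in the text: inverse sampling variance times the
    # inverse of the number of effect sizes contributed by the program.
    return (1.0 / var_g) * (1.0 / k_effects_in_program)

# Hypothetical contrast: 60 Head Start children vs. 60 comparison children.
g, var_g = hedges_g(mean_t=102.0, mean_c=97.5, sd_t=15.0, sd_c=15.0, n_t=60, n_c=60)
w = analysis_weight(var_g, k_effects_in_program=10)
```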
The dependent variables in these analyses are the effect sizes measuring the standardized difference in assessment of children's cognitive skills and achievement between children who attended Head Start and the comparison group. Effect sizes ranged from -0.49 to 1.05, with a weighted mean of 0.18. Independent Variables. Several of the key independent variables measure features of the particular outcome assessments employed. We distinguished between effect sizes measuring achievement outcomes, such as reading, math, letter recognition, and numeracy skills, which may be more sensitive to typical classroom instruction, and those measuring cognitive outcomes less sensitive to instruction, including IQ, vocabulary, theory of mind, attention, task persistence, and syllabic segmentation, such as rhyming (see Christian et al., 2000 for discussion of this distinction). The majority of effect sizes (67 percent) are from the cognitive domain. Using a series of dummy variables, we categorized effect sizes according to the type of measure employed by the researcher, indicating whether it is a performance test (reference category), rating by someone the child knows (e.g., a teacher or parent), or observational rating by a researcher. The majority of outcome measures are performance tests (93 percent). We also included a continuous measure of the timing of the outcome, measured in months post-treatment, which, given our screening criteria, ranges from -2.47 to 12. Other key independent variables represent facets of each program's research design. These study and Head Start characteristics do not vary within a program, so we present both program-level and effect size-level descriptive information for these measures in Table 2. We created a series of dummy variables indicating the type of design: randomized (reference category) or quasi-experimental.8 The majority of effect sizes (74 percent) come from quasi-experimental studies, although differences between program-level and effect size-level means suggest that randomized trials tended to have more outcome measures per study than those with quasi-experimental designs. We created a dummy variable to indicate whether baseline covariates were included in the analysis. Although the majority of programs (86 percent) do not include baseline covariates in their analyses, those that do have a large number of outcome measures. 8 A few programs (N=2) had designs that changed post-hoc; i.e., the study was originally randomized, but for various reasons, became quasi-experimental in nature. In our primary specifications, these studies were coded as quasi-experimental. An alternative specification, coding these studies specifically "design changed," is discussed in the robustness checks section.
We also coded the activity level of the control group using the following categories: passive (reference category), meaning that control group children received no alternative services; or active, meaning some of the control group members sought services of their own volition; as well as a dummy variable indicating whether information regarding control group activity was missing from the report.9 Although the majority of effect sizes (53 percent) come from studies with passive control groups, studies in which the control group actively sought alternative services, specifically attendance at other center-based child care facilities or early education programs, tend to have more effect sizes per study. Studies that reported active control groups indicated that between 18 and 48 percent of the control group attended these types of programs.10 9 Reports in which there was no mention of control group activity were coded as having missing information on this variable. 10 In theory, we might also want to categorize use of parenting programs or family support programs as active control group participation; however, studies generally did not report on this type of activity. We also created a series of dummy variables indicating levels of overall attrition. Keeping in mind that attrition was truncated at 50 percent based on our screening criteria, attrition levels were constructed using quartile scores and defined as follows: low attrition (reference category), less than or equal to 10 percent (representing quartiles 1, 2, and 3); high attrition, greater than 10 percent (representing quartile 4); or missing overall attrition information. The majority of effect sizes come from studies with 10 percent attrition or less. Although Head Start is guided by a set of federal performance standards and other regulations, these have changed over time and may not reflect the experience of participants in all studies. A dummy variable was coded to indicate whether the program was a "modern" Head Start program, defined as post-1974, when the first set of Head Start quality guidelines was implemented.11 Although the majority of programs (75 percent) are older, 44 percent of effect sizes come from studies of modern Head Start programs. In addition, recognizing that the first iteration of Head Start was a shortened, 6- to 8-week summer program, we also created a continuous variable indicating length of treatment measured in months, and re-centered at two months, so that the resulting coefficient indicates the effect of receiving a full academic year of Head Start versus a summer program. Finally, we created a dummy variable indicating whether the evaluation was an article published in a peer refereed journal. The reference category is an unpublished report or dissertation, or book chapter. 11 In an alternative specification, instead of the modern Head Start variable, we included a continuous variable for the year each Head Start program was studied, re-centered at 1965, the program's initial year of operation. This did not qualitatively change results. Statistical Analysis. Our key research question is whether heterogeneity in effect size is predicted by methodological aspects of the study design in the programs and effect sizes. The nested structure of the data (effect sizes nested within programs) requires a multivariate, multi-level approach to modeling these associations (de la Torre, Camilli, Vargas, & Vernon, 2007).
The level-1 model (effect size level) is: (1) $ES_{ij} = \beta_{0i} + \beta_{1i}x_{1ij} + \ldots + \beta_{ki}x_{kij} + e_{ij}$. In this equation, the effect size j in program i is modeled as a function of the intercept ($\beta_{0i}$), which represents the average (covariate adjusted) effect size for all programs; a series of key independent variables and related coefficients of interest ($\beta_{1i}x_{1ij} + \ldots + \beta_{ki}x_{kij}$), which estimate the association between the effect size and aspects of the study design that vary at the effect size level; and a within-program error term ($e_{ij}$). Study design covariates at this level include timing of outcome, type of outcome (rating or observation), whether or not baseline covariates are included, and domain of outcome (skills more or less sensitive to instruction). The level-2 equation (program level) models the intercept as a function of the grand mean effect size ($\beta_{00}$), a series of covariates that represent aspects of study design and Head Start features that vary only at the program level ($\beta_{01}x_{1i} + \ldots + \beta_{0k}x_{ki}$), and a between-program random error term ($u_i$): (2) $\beta_{0i} = \beta_{00} + \beta_{01}x_{1i} + \ldots + \beta_{0k}x_{ki} + u_i$. Study design covariates at this level include type of research design, activity level of control group, attrition, and whether the effect size came from a peer refereed journal article. Head Start program feature covariates include length of program and whether the program was implemented post-1974. This "mixed effects" model assumes that there are two sources of variation in the effect size distribution, beyond subject-level sampling error: 1) the "fixed" effects of variables that measure key features of the methods and other covariates; and 2) remaining "random" unmeasured sources of variation between and within programs.12 To account for differences in effect size estimate precision, as well as the number of effect sizes within a particular program, all regressions were weighted by the inverse variance of each effect size multiplied by the inverse of the number of effect sizes per program (Cooper & Hedges, 2009; Lipsey & Wilson, 2001). Analyses were conducted in SAS, using the PROC MIXED procedure. We began by entering each design factor independently, and then included all relevant design covariates at the same time in our primary specification. We also tested several variations of the primary model specification; for example, we conducted separate analyses including imputed missing effect sizes, without weights, and excluding the National Head Start Impact Study, the largest study in our sample. 12 In our primary specifications, we ignore the third and fourth levels of nesting, contrasts within programs and programs within studies, due to the small number of studies with multiple contrasts (N=8) and multiple programs (N=4). In alternative specifications, we found that clustering at the contrast level or study level, instead of the program level, did not qualitatively change the results.
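As an illustration of the estimation strategy described in this section, the sketch below fits a precision-weighted meta-regression of effect sizes on two design indicators, with standard errors clustered by program. It is a simplified stand-in for the authors' weighted two-level model, which was estimated with SAS PROC MIXED; the data frame, variable names, and values are hypothetical, and the between-program random effect is approximated here by program-level clustering rather than estimated directly.

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical coded data: one row per effect size, nested within programs.
df = pd.DataFrame({
    "es":         [0.31, 0.12, 0.45, 0.05, 0.27, 0.18, 0.40, 0.22],  # Hedges' g
    "var_es":     [0.02, 0.01, 0.05, 0.01, 0.03, 0.02, 0.04, 0.02],  # sampling variance
    "program":    [1, 1, 2, 2, 3, 3, 4, 4],
    "active_cg":  [0, 0, 1, 1, 0, 0, 1, 0],   # active control group indicator
    "randomized": [1, 1, 0, 0, 1, 0, 0, 1],   # randomized design indicator
})

# Weight = inverse sampling variance x inverse of effect sizes per program.
k_per_program = df.groupby("program")["es"].transform("size")
weights = (1.0 / df["var_es"]) * (1.0 / k_per_program)

X = sm.add_constant(df[["active_cg", "randomized"]])
fit = sm.WLS(df["es"], X, weights=weights).fit(
    cov_type="cluster", cov_kwds={"groups": df["program"]})
print(fit.params)  # covariate-adjusted average effect size and design coefficients
```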
We also tested alternative specifications using a series of dummy variables indicating outcome measure reliability levels, a more nuanced set of research design variables, a continuous measure of the date of operation of the Head Start program being studied, an indicator of whether a randomized study experienced crossovers between control and treatment group, and a continuous measure of the number of control group children attending center-based childcare. Results of the main analyses and these robustness checks are discussed below. Results Bivariate Results. The results from an “empty model,” with no predictor variables, yields an intercept (average program-level effect size) of .27, which is significantly different from 0. Keeping this in mind, we began by exploring the relationships between single design factors and average effect size using a series of multilevel regressions, the results of which are presented in Table 3.13 Regressions including categorical variables (Table 3, columns 1-8) were run without intercepts; thus, the resulting coefficients indicate the average effect size for programs in each category. These results suggest that studies of modern Head Start programs produce a smaller average effect size (.23) than studies of Head Start conducted prior to 1975 (.29), although this difference is not statistically significant. This finding is also potentially complicated by the fact that modern Head Start studies are more likely to have active control groups, and programs in which the control group actively seeks alternative ECE services produce a smaller average effect size (.08) than studies with passive control groups (.31, p=.05) or missing information on this variable (.29, n.s.). Both study design types yield significant positive effect sizes (quasi-experimental=.26; randomized=.33). Programs in which baseline covariates were used in analyses yield a smaller average effect size (.20) than those in which covariates were not included (.29), although this difference is not statistically significant. 13 These bivariate analyses and subsequent multivariate specifications include 241 non-missing effect sizes in 28 programs. 19 Looking at different aspects of dependent measures, we find that ratings (.45) and observations (.55) yield significantly larger effect sizes than performance tests (.24). Consistent with our hypothesis, measures of skills more sensitive to instruction yield a significantly larger average effect size (.40) than those measuring more broad cognitive skills (.25). We find significantly positive and similar average effect sizes for measures with both high and low attrition (.28 and .33, respectively); however, the average effect size for measures with missing attrition information is smaller and not significant (.14). Finally, we find that the average effect size from a study published in a peer refereed journal (.43) appears larger than one produced by an unpublished study or book chapter (.23); this difference is marginally significant. Multilevel regressions including continuous measures of research design were run with an intercept; therefore, we include this estimate in columns 10 and 11 of Table 3 to show the relationship between an incremental increase in each continuous design variable and average effect size. None of the coefficients for continuous measures, including length of program in months and months between the end of treatment and outcome measure, is statistically significant. 
While most of these findings are consistent with our initial hypotheses regarding the influence of various design factors on average effect size, this analytic approach ignored the potential important confounds of other design variables, and thus might yield biased results. Therefore, in our primary specification, we included all design variables at once to investigate the independent and comparative role of each in impacting average effect size. Multivariate Results. Results from our primary specification are presented in Column 1 of Table 4; coefficients indicate the strength of associations between our independent variables (measuring facets of research design) and effect sizes (differences between treatment and control groups expressed as a fraction of a standard deviation). In terms of program and study characteristics, we find that attending a 20 full academic year of Head Start (10 months) is marginally associated with a .16 standard deviation unit larger effect than attending a summer Head Start program (2 months). Studies published in peer refereed journals also tend to yield effect sizes .28 standard deviations larger than those found in unpublished reports, dissertations, or book chapters; thus, confirming that there may be a tendency for publication bias to be operating in this field. We also find a negative but statistically insignificant association between effect size and the variable indicating that the Head Start program being evaluated was in operation in 1975 or later. Our exploration of program-level research design factors shows a large negative association between effect size and having an active control group in which families independently seek alternative services (-.33). Other features of study design are not statistically significant, although most are in the expected direction. Perhaps most surprisingly, we do not find a significant difference between effect sizes of quasi-experimental and randomized studies, when other features of study design are held constant. As expected, higher levels of attrition (and missing information on this variable) are associated with smaller effect sizes; however, the differences are modest in magnitude and not statistically significant. A number of dependent measure characteristics also predict effect size. Compared to performance tests, both ratings by teachers and parents as well as observational ratings by researchers yield larger effect sizes (.16 and .32, respectively). As expected, measures of skills more sensitive to instruction produce larger effect sizes than those for less teachable cognitive skills (.13). Counter to our hypothesis, however, length of time between treatment and outcome measure is not associated with smaller effect size. If one considers a performance test a more reliable measure of children’s skills, compared to ratings by others, the findings described in the previous paragraph are somewhat surprising. To test the 21 role of measure reliability more directly, in an alternative specification, we removed the variables indicating “type” of dependent measure (i.e., rating, observation) and instead included a series of dummy variables indicating the level of reliability of the outcome measure, based on coded reliability coefficients.14 Consistent with our primary specification findings, but contrary to our proposed hypothesis, we found that less reliable measures yield larger effect sizes (See Table 4, Column 2). Robustness Checks. 
One concern with our models is that they omit correlates of program design that might be associated with effect sizes. Our choice of looking only within Head Start programs was intended to limit the differences in programming that would have been found in a wider set of ECE programs; however, a possible remaining source of heterogeneity is the demographic make-up of the sample. Unfortunately, there is a surprisingly large amount of missing data on the demographic characteristics of the study samples. For example, only 52 percent of effect sizes have information about the gender and between 46 and 52 percent about the racial composition of the sample. Nevertheless, we explored whether effect sizes might be predicted by the gender and by the racial composition of the sample (percent boys versus girls; percent black, Hispanic, White). Bivariate analyses suggested that effect sizes were not significantly predicted by these characteristics, nor were they predictive in our multivariate models. Given these null findings, the large percentage of missing data for these variables, and the limited statistical power in our multivariate analyses, we opted to leave these variables out of our multivariate models. 14 Categories were constructed based on quartile scores and defined as follows: high reliability (reference category), greater than or equal to .92 (representing quartile 4); medium reliability, .91 to .75, (representing quartiles 2 and 3); low reliability, less than .75, (representing quartile 1). Our preference was to code any reliability coefficient provided for the specific study population; however, this information was rarely reported. If no coefficient was provided in the report, we attempted to find a reliability estimate from test manuals or another study. Any available type of reliability coefficient was recorded, although the most were measures of internal consistency (Cronbach’s alpha). Because of this variability in source information and coefficient type, and the fact that we were still left with missing reliability coefficients for 38 percent of our effect sizes, we offer these results with caution. 22 We undertook several additional analyses to determine the sensitivity of our findings to alternative model specifications. Most importantly, in some cases, authors reported that groups were compared on particular tests, but did not report the results of these tests, or did not provide enough numerical information to compute an effect size. Leaving out these “missing” effect sizes (N=72) could upwardly bias our average effect size estimate, since we know that unreported effect sizes are more likely to be smaller and statistically insignificant (Pigott, 2009). We imputed these missing effect sizes four ways and report results in Table 5. Our baseline results (Table 5, column 1) assumed that if precise magnitudes of differences were not reported, and there was no indication of whether differences were significant, there was no difference between the groups (g=0). If the results were reported as significant, we assumed that they were significant at the p=.05 level. We also checked the robustness of these results to different assumptions about the magnitude of the effects in these cases. We ran three alternative specifications. The first assumed that if an author reported which group was favored for a particular test, but not whether the effect size was statistically significant, that the differences approached marginal significance (p=.11). 
If authors did not report that either group was favored, we again assumed that g=0 (Table 5, column 2). The second scenario assumed that all results favored the treatment group as much as possible, consistent with author reports of which group actually fared better (p=.11 if treatment group was favored or if there was no indication of which group was favored; p=.99 otherwise; Table 5, column 3). Conversely, the third scenario assumed that all results favored the control group as much as possible, consistent with author reports of which group actually performed better (p=.11 if treatment group was favored; p=.99 otherwise; Table 5, column 4). As demonstrated in Table 5, including the imputed effect sizes did not yield substantive changes in our coefficient estimates, regardless of the imputation assumptions.15 We therefore conclude that excluding these missing effect sizes from our primary analyses is unlikely to introduce bias in our results. A second concern is that results were being driven primarily by our inclusion of the National Head Start Impact Study, which includes 40 effect sizes, and is heavily weighted due to its large sample size. When we excluded this study from our analysis, however, we again obtained results qualitatively similar to those from our primary specification (results available upon request). The magnitude of most coefficients stayed the same, although due to loss of statistical power, some of them became statistically insignificant.16 These findings suggest that the relationships between effect sizes and research design factors are not strongly influenced by the specific findings from the National Head Start Impact Study. We also found qualitatively similar results for unweighted analyses, suggesting that studies with larger samples are not driving our findings either. Recognizing that not all quasi-experimental designs are equally rigorous, we also tested a more nuanced set of research design indicators, adding to our original design variables separate categories for studies with true quasi-experimental designs (matching on outcomes or demographics and change models) and those that were included due only to having baseline comparable treatment and control groups. We also added an indicator for studies in which the design was changed post-hoc (i.e., it was originally randomized, but for various reasons, became a quasi-experimental study). Again, our findings remained robust; none of the design indicator variables was statistically significant (compared to the reference category, randomization), with other design features included in the model. In another specification, we also included a variable indicating the presence of crossovers between the control and treatment groups (situations in which control group members attended Head Start programs or treatment group members did not), which could potentially reduce effect size independent of control group activity level.17 The crossover coefficient was not significant, and its inclusion did not change the pattern of results for other variables in the analysis. 15 These analyses included 313 effect sizes nested in 40 contrasts. 16 Specifically, the coefficient for rating (.09) was no longer significant, and the coefficient for observation (.28) became only marginally significant. 17 For example, in the National Head Start Impact Study, approximately 18 percent of 3-year-olds and 13 percent of 4-year-olds in the control groups received Head Start services.
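The imputation scenarios above assign assumed p-values (.05, .11, or .99) to results whose magnitudes were not reported. The report does not spell out the conversion it used, but one standard way to back out a standardized mean difference from an assumed two-sided p-value and the group sizes is to invert the t distribution, as in the hypothetical sketch below; the group sizes and function name are illustrative assumptions.

```python
import math
from scipy import stats

def g_from_pvalue(p_two_sided, n_t, n_c, favors_treatment=True):
    """Back out an approximate Hedges' g from an assumed two-sided p-value."""
    df = n_t + n_c - 2
    t_abs = stats.t.ppf(1 - p_two_sided / 2, df)   # |t| implied by the p-value
    d = t_abs * math.sqrt(1 / n_t + 1 / n_c)       # convert t to Cohen's d
    g = (1 - 3 / (4 * df - 1)) * d                 # small-sample correction
    return g if favors_treatment else -g

# An unreported result assumed to be marginal (p = .11) and to favor treatment,
# for hypothetical group sizes of 40 and 40; a null result with no direction
# reported is simply imputed as g = 0.
g_marginal = g_from_pvalue(0.11, n_t=40, n_c=40, favors_treatment=True)
g_null = 0.0
```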
Finally, we explored whether a more specific measure of the activity level of the control group would predict effect sizes. Active control groups were found in 4 programs, and in these programs between 19 and 48 percent of the children in these groups received center-based child care. As an alternative measure of control group activity, we also tried including a continuous variable capturing the percent of control group children in center-based care. As expected, we found that this measure significantly predicted effect sizes, with each additional percent of the control group experiencing center-based child care corresponding to an effect size .005 smaller (p<.05). Discussion This study provides an important contribution to the field of ECE research, in that it uses a unique, new meta-analytic database to estimate the overall average effect of Head Start on children's cognitive skills and achievement, and explores the role of methodological factors in explaining variation in those effect size estimates. Overall, we found a statistically significant average effect size of .27, suggesting that Head Start is generally effective in improving children's short-term (less than one year post-treatment) cognitive and achievement outcomes. This effect on short-term cognitive outcomes is somewhat smaller than those found in the previous meta-analysis of Head Start conducted by McKey et al. (1985), but somewhat larger than those reported in the first-year findings from the recent national Head Start Impact Study (US DHHS, 2005). The .27 estimate is also within the range of the overall average effect sizes on cognitive outcomes found in Camilli et al. (2010), measured across various ECE models, and in Wong et al. (2008), measured in state prekindergarten programs, but somewhat smaller than the short-term cognitive effect sizes found in meta-analyses of more intensive programs with longitudinal follow-ups (Gorey, 2001; Nelson et al., 2003). These comparisons suggest that Head Start program effects on children's cognitive and achievement outcomes are on par with the effects of other general early education programs. We find that, indeed, several design factors significantly predict the heterogeneity of effect sizes, and these factors account for approximately 41 percent of the explainable variation between evaluation findings and 11 percent of the explainable variation within evaluation findings. This information can be used by researchers and policy makers to become better consumers and designers of Head Start and other ECE evaluations, and thus facilitate better policy and program development. One of our substantively largest findings is that having an active control group—one in which children experienced other forms of center-based care—is associated with much smaller effect sizes (a difference of .33) than those produced by studies in which the control group is "passive" (i.e., receives no alternative services). Given that a variety of models of ECE programs have been shown to increase children's cognitive skills and achievement (Gormley, Phillips, & Gayer, 2008; Henry, Gordon, & Rickman, 2006), it is perhaps not surprising that effect sizes for studies in which a significant portion of the control group is receiving alternative ECE services are smaller than those produced by studies in which control group children receive no ECE services.
The nature of the counterfactual, then, is important to consider. For example, in today's policy context, in which almost 70 percent of 4-year-olds and 40 percent of 3-year-olds attend some form of ECE (Cook, 2006), it is probably not reasonable to expect the same kinds of large effect sizes produced by older model programs that were studied when few alternative ECE options were available (e.g., Perry Preschool, Schweinhart et al., 2005; Abecedarian, Ramey & Campbell, 1984). More generally, comparisons between Head Start studies with low rates of control group activity and those with higher rates should be approached with caution. The same may also be true for studies of other forms of ECE, although it is unclear whether our findings will generalize beyond Head Start studies. If these findings are replicated with other types of ECE studies, it would suggest that it is not reasonable to compare effects from, for example, the National Head Start Impact Study, which had relatively high rates of center-care attendance in the control group, with recent regression discontinuity design (RDD) evaluations of state pre-kindergarten (pre-k) programs that have lower levels of such attendance in the control group (Cook, 2006). Our findings suggest that instead of asking whether Head Start is effective at all, we must ask how effective Head Start is compared to the range of other ECE options available.

Another important finding is that the type of dependent measure used by the researcher may be systematically related to effect size and must be considered when interpreting evaluation results. Consistent with previous research, we find that achievement-based skills such as early reading, early math, and letter recognition appear to be more sensitive to Head Start attendance than cognitive skills such as IQ, vocabulary, and attention, which are less sensitive to classroom instruction (Christian et al., 2000; Wong et al., 2008). This finding has important implications for designers and evaluators of early intervention programs; namely, expectations for effects on omnibus measures such as vocabulary or IQ should be lowered. At a minimum, these sets of skills should be tested and considered separately.

Our finding that less reliable dependent measures yield larger effect sizes also warrants caution when interpreting effect sizes without first considering the quality of the measures. Non-standardized measures developed by researchers may tap into behaviors that are among those most directly targeted by the intervention services; it is therefore not surprising that such measures tend to yield larger effect sizes. Ratings by parents, teachers, and researchers may also be subject to bias, however, because these individuals are likely to be aware of children's participation in Head Start as well as the study purpose.

We also find that effect sizes from studies published in peer-refereed journals are larger than those found in non-published reports and book chapters. While research published in peer-refereed journals may be more rigorous than that found in non-published sources, this result may also be a sign of the "file drawer" problem (i.e., that negative or null findings are less likely to be published) long lamented by meta-analysts (Lipsey & Wilson, 2001). This finding suggests that meta-analysts must be exhaustive in their searches for both published and unpublished ("grey") literature, and should carefully code information regarding study quality (Rothstein & Hopewell, 2009).
A somewhat surprising finding from the current study is that the type of overall design (e.g., randomized vs. quasi-experimental) did not predict effect size. We remind the reader, however, that our inclusion criteria regarding study design were typically more rigorous than those of previous meta-analyses of ECE programs. By limiting our study sample in this way, we give up some of the variation in design that might indeed predict effect sizes. Nevertheless, these findings are in alignment with recent research suggesting that, in certain circumstances, rigorous quasi-experimental methods can produce causal estimates similar to those produced by randomized controlled trials (Cook, Shadish, & Wong, 2008), and they further support the use of such methods to evaluate programs when randomized controlled trials are not feasible, as is often the case in education research (Schneider et al., 2005).

We also predicted that attrition and the time between intervention and outcome measurement would be negatively associated with effect size; however, neither factor was statistically significant. The fact that the range of each measure was truncated in this study (attrition to less than 50 percent and post-treatment outcome timing to 12 or fewer months) may explain the lack of significant findings. Whether similar relationships between methodological factors and effect sizes hold for studies with long-term outcomes is a question for future research. Timing of the study (pre- or post-implementation of the first Head Start quality guidelines in 1974) also did not predict effect size, nor did a continuous measure of the year in which the study was conducted. Future research using measures that are better able to distinguish program quality from other factors associated with the historical context of the study is warranted.

In addition to the limitations noted above, we offer our results with a few other caveats. We recognize that variation in research design is naturally occurring; such variation may therefore be correlated with other unmeasured aspects of Head Start studies that we were not able to capture. Furthermore, although our multilevel models account for the nesting of effect sizes within programs, there were additional sources of non-independence that we were simply unable to model. Nevertheless, we believe meta-analysis to be a useful and robust method for exploring our research questions.

In sum, this study makes an important contribution to the field of ECE research in that we are able to empirically and comparatively test which research design characteristics explain heterogeneity in effect sizes in Head Start evaluations. We find that several facets of research design explain some of the variation in Head Start participants' short-term cognitive and achievement outcomes; consumers of ECE evaluations must therefore ask themselves a series of design-related questions when interpreting results across evaluations with differing methods. What is the counterfactual experience of the control group? What skills are being measured? How reliable are the measures being used? What is the source of information? By becoming more critical consumers and designers of such research, we can improve ECE services and realize their full potential as an intervention strategy for improving children's life chances.

References

Balk, E. M., Bonis, P. A., Moskowitz, H., Schmid, C. H., Ioannidis, J. P., Wang, C., & Lau, J. (2002). Correlation of quality measures with estimates of treatment effect in meta-analyses of randomized controlled trials. Journal of the American Medical Association, 287(22), 2973-2982.
Besharov, D. J., & Higney, C. A. (2007). Head Start: Mend it, don't expand it (yet). Journal of Policy Analysis and Management, 26(3), 678-681.
Borenstein, M., Hedges, L., Higgins, J., & Rothstein, H. (2005). Comprehensive Meta-analysis, Version 2. Englewood, NJ: Biostat.
Camilli, G., Vargas, S., Ryan, S., & Barnett, W. S. (2010). Meta-analysis of the effects of early education interventions on cognitive and social development. Teachers College Record, 112(3).
Christian, K., Morrison, F. J., Frazier, J. A., & Massetti, G. (2000). Specificity in the nature and timing of cognitive growth in kindergarten and first grade. Journal of Cognition and Development, 1(4), 429-448.
Cook, T. (2006). What works in publicly funded pre-kindergarten education? Retrieved March 23, 2010, from http://www.northwestern.edu/ipr/events/briefingmay06-cook/slide1.html.
Cook, T. D., Shadish, W. R., & Wong, V. C. (2008). Three conditions under which experiments and observational studies produce comparable causal estimates: New findings from within-study comparisons. Journal of Policy Analysis and Management, 27(4), 724-750.
Cook, T. D., & Wong, V. C. (2007). The warrant for universal pre-k: Can several thin reeds make a strong policy boat? Social Policy Report, 21(3), 14-15.
Cooper, H., & Hedges, L. V. (2009). Research synthesis as a scientific process. In H. Cooper, L. V. Hedges, & J. C. Valentine (Eds.), The handbook of research synthesis and meta-analysis, 2nd edition (pp. 3-17). New York: Russell Sage Foundation.
Currie, J., & Thomas, D. (1995). Does Head Start make a difference? American Economic Review, 85, 341-364.
De la Torre, J., Camilli, G., Vargas, S., & Vernon, R. F. (2007). Illustration of a multilevel model for meta-analysis. Measurement and Evaluation in Counseling and Development, 40, 169-180.
Deming, D. (2009). Early childhood intervention and life-cycle skill development: Evidence from Head Start. American Economic Journal: Applied Economics, 1(3), 111-34.
Garces, E., Thomas, D., & Currie, J. (2002). Longer term effects of Head Start. The American Economic Review, 92, 999-1012.
Gorey, K. M. (2001). Early childhood education: A meta-analytic affirmation of the short- and long-term benefits of educational opportunity. School Psychology Quarterly, 16(1), 9-30.
Gormley, W. T., Jr., Phillips, D., & Gayer, T. (2008). Preschool programs can boost school readiness. Science, 320, 1723-24.
Henry, G. T., Gordon, C. S., & Rickman, D. K. (2006). Early education policy alternatives: Comparing quality and outcomes of Head Start and state prekindergarten. Educational Evaluation and Policy Analysis, 28, 77-99.
Hoyt, W. T., & Kerns, M.-D. (1999). Magnitude and moderators of bias in observer ratings: A meta-analysis. Psychological Methods, 4(4), 403-424.
Jacob, R. T., Creps, C. L., & Boulay, B. (2004). Meta-analysis of research and evaluation studies in early childhood education. Cambridge, MA: Abt Associates Inc.
Layzer, J. I., Goodson, B. D., Bernstein, L., & Price, C. (2001). National evaluation of family support programs, volume A: The meta-analysis, final report. Cambridge, MA: Abt Associates Inc.
Lipsey, M. W., & Wilson, D. B. (2001). Practical meta-analysis. Thousand Oaks, CA: Sage Publications.
Ludwig, J., & Miller, D. L. (2007). Does Head Start improve children's life chances? Evidence from a regression discontinuity design. Quarterly Journal of Economics, 122, 159-208.
Ludwig, J., & Phillips, D. (2007). The benefits and costs of Head Start. Social Policy Report, 21(3), 3-18.
Magnuson, K., & Shager, H. (2010). Early education: Progress and promise for children from low-income families. Children and Youth Services Review, 32(9), 1186-1198.
McGroder, S. (1990). Head Start: What do we know about what works? Report prepared for the U.S. Office of Management and Budget. Retrieved May 3, 2010, from http://aspe.hhs.gov/daltcp/reports/headstar.pdf.
McKey, R. H., Condelli, L., Ganson, H., Barrett, B. J., McConkey, C., & Plantz, M. C. (1985). The impact of Head Start on children, families and communities: Final report of the Head Start Evaluation, Synthesis and Utilization Project. Washington, DC: CSR, Incorporated.
Moher, D., Pham, B., Jones, A., Cook, D. J., Jadad, A. R., Moher, M., Tugwell, P., & Klassen, T. P. (1998). Does quality of reports of randomized trials affect estimates of intervention efficacy reported in meta-analyses? The Lancet, 352, 609-613.
Nathan, R. P. (Ed.). (2007). How should we read the evidence about Head Start? Three views. Journal of Policy Analysis and Management, 26(3), 673-689.
National Forum on Early Childhood Policy and Programs. (2010). Understanding the Head Start Impact Study. Retrieved January 9, 2012, from http://www.developingchild.harvard.edu/
Nelson, G., Westhues, A., & MacLeod, J. (2003). A meta-analysis of longitudinal research on preschool prevention programs for children. Prevention and Treatment, 6, 1-34.
Pigott, T. (2009). Handling missing data. In H. Cooper, L. V. Hedges, & J. C. Valentine (Eds.), The handbook of research synthesis and meta-analysis, 2nd edition (pp. 399-416). New York: Russell Sage Foundation.
Ramey, C. T., & Campbell, F. A. (1984). Preventive education for high-risk children: Cognitive consequences of the Carolina Abecedarian Project. American Journal of Mental Deficiency, 88(5), 515-523.
Rosenshine, B. (2010, March 6). Researcher-developed tests and standardized tests: A review of ten meta-analyses. Paper presented at the Society for Research on Educational Effectiveness Conference, Washington, DC.
Rothstein, H. R., & Hopewell, S. (2009). Grey literature. In H. Cooper, L. V. Hedges, & J. C. Valentine (Eds.), The handbook of research synthesis and meta-analysis, 2nd edition (pp. 103-125). New York: Russell Sage Foundation.
Schulz, K. F., Chalmers, I., Hayes, R. J., & Altman, D. G. (1995). Empirical evidence of bias: Dimensions of methodological quality associated with estimates of treatment effects in controlled trials. Journal of the American Medical Association, 273(5), 408-412.
Schweinhart, L. J., Montie, J., Xiang, Z., Barnett, W. S., Belfield, C. R., & Nores, M. (2005). Lifetime effects: The High/Scope Perry Preschool study through age 40 (Monographs of the High/Scope Educational Research Foundation, 14). Ypsilanti, MI: High/Scope Press.
Shonkoff, J. P., & Phillips, D. (2000). From neurons to neighborhoods: The science of early childhood development. Washington, DC: National Academy Press.
Sweet, M. A., & Appelbaum, M. I. (2004). Is home visiting an effective strategy? A meta-analytic review of home visiting programs for families with young children. Child Development, 75(5), 1435-1456.
U.S. Department of Health and Human Services, Administration for Children and Families. (January 2010). Head Start impact study: Final report. Washington, DC.
U.S. Department of Health and Human Services, Administration for Children and Families, Office of Head Start. (2010). Head Start program fact sheet. Retrieved July 1, 2011, from http://www.acf.hhs.gov/programs/ohs/about/fy2010.html.
U.S. Department of Health and Human Services, Administration for Children and Families, Office of Head Start. (2010). Head Start performance standards & other regulations. Retrieved April 26, 2010, from http://www.acf.hhs.gov/programs/ohs/legislation/index.html.
U.S. Department of Health and Human Services, Administration for Children and Families. (May 2005). Head Start impact study: First year findings. Washington, DC.
United States General Accounting Office. (1997). Head Start: Research provides little information on impact of current program (GAO/HEHS-97-59). Washington, DC: U.S. General Accounting Office.
Westinghouse Learning Corporation. (1969). The impact of Head Start: An evaluation of the effects of Head Start on children's cognitive and affective development. A report presented to the Office of Economic Opportunity. Athens, OH: Westinghouse Learning Corporation and Ohio University. (Distributed by Clearinghouse for Federal Scientific and Technical Information, U.S. Department of Commerce, National Bureau of Standards, Institute for Applied Technology, PB 184328.)
Wong, V. C., Cook, T. D., Barnett, W. S., & Jung, K. (2008). An effectiveness-based evaluation of five state pre-kindergarten programs. Journal of Policy Analysis and Management, 27, 122-154.
Zhai, F., Brooks-Gunn, J., & Waldfogel, J. (2010, March 6). Head Start and urban children's school readiness: A birth cohort study in 18 cities. Paper presented at the Society for Research on Educational Effectiveness Conference, Washington, DC.
Zigler, E., & Valentine, J. (Eds.). (1979). Project Head Start: A legacy of the war on poverty. New York: The Free Press.

Table 1: Key Meta-Analysis Terms and Ns

Report (N=57): Written evaluation of Head Start (e.g., a journal article, government report, book chapter).
Study (N=27): Collection of comparisons in which the treatment groups are drawn from the same pool of subjects.
Program (N=33): Particular type of Head Start, or Head Start within a particular location.
Contrast (N=40): Comparison between one group of children who received Head Start and another group of children who received no other services as a result of the study (although they may have sought services themselves).
Effect Size (N=313): Measure of the difference in cognitive outcomes between the children who experienced Head Start and those who did not, expressed in standard deviation units (Hedges' g).

Table 2: Descriptive Information for Effect Sizes and Independent Variables
Variable: Min. | Max. | Mean | SD

Study and Program Characteristics
Modern Head Start program (post-1974), effect size level: 0 | 1 | .44 | .50
Modern Head Start program (post-1974), program level: 0 | 1 | .25 | .44
Length of program (months, centered at 2), effect size level: 0 | 8 | 6.20 | 3.12
Length of program (months, centered at 2), program level: 0 | 8 | 5.18 | 3.70
Peer refereed journal, effect size level: 0 | 1 | .17 | .37
Peer refereed journal, program level: 0 | 1 | .21 | .42

Design Characteristics
Active control group, effect size level: 0 | 1 | .35 | .48
Active control group, program level: 0 | 1 | .14 | .36
Passive control group, effect size level: 0 | 1 | .53 | .50
Passive control group, program level: 0 | 1 | .68 | .48
Missing control group activity, effect size level: 0 | 1 | .12 | .32
Missing control group activity, program level: 0 | 1 | .18 | .39
Randomized controlled trial, effect size level: 0 | 1 | .26 | .44
Randomized controlled trial, program level: 0 | 1 | .18 | .39
Quasi-experimental study, effect size level: 0 | 1 | .74 | .44
Quasi-experimental study, program level: 0 | 1 | .82 | .39
Baseline covariates included, effect size level: 0 | 1 | .49 | .50
Baseline covariates included, program level: 0 | 1 | .14 | .36

Dependent Measure Characteristics
Rating (by someone who knows child): 0 | 1 | .06 | .24
Observation (by researcher): 0 | 1 | .01 | .09
Performance measure: 0 | 1 | .93 | .25
Skills sensitive to instruction: 0 | 1 | .67 | .47
Skills not sensitive to instruction: 0 | 1 | .33 | .47
Months post-treatment: -2.47 | 12 | 1.89 | 4.12

Attrition (always <50%)
High attrition (>10%): 0 | 1 | .36 | .48
Low attrition (<=10%): 0 | 1 | .42 | .50
Missing attrition information: 0 | 1 | .22 | .41

Reliability
High reliability (coefficient >=.92): 0 | 1 | .21 | .41
Medium reliability (coefficient =.75-.91): 0 | 1 | .35 | .48
Low reliability (coefficient <.75): 0 | 1 | .18 | .39
Missing reliability coefficient: 0 | 1 | .26 | .44

Effect size: -.49 | 1.05 | .18 | .22

Note: Descriptive information for non-missing effect sizes (N=241), weighted by the inverse variance of the effect size multiplied by the inverse of the number of effect sizes per program.

Table 3: Summary of Results from Regressions of Head Start Evaluation Effect Sizes on Single Research Design Factors
(Each of columns 1-6 reports a separate model containing a single design factor; entries are average effect sizes for each category, with standard errors in parentheses; significant within-factor differences are noted alongside each category. See the note following the continuation of this table.)

Column 1, program era:
Modern HS (post-1974): .23* (.08)
Not modern HS (pre-1975): .29** (.05)
Column 2, control group activity:
Active control group: .08 (.10); differs from passive (†)
Passive control group: .31** (.05); differs from active (†)
Missing control group activity: .29* (.10)
Column 3, overall design:
Randomized controlled trial: .33* (.10)
Quasi-experimental study: .26** (.05)
Column 4, source of dependent measure:
Rating (by adult who knows child): .45** (.07); differs from performance (**)
Observation (by researcher): .55** (.16); differs from performance (†)
Performance test: .24** (.04); differs from rating (**) and from observation (†)
Column 5, skill type:
Skills sensitive to instruction: .40** (.05); differs from skills not sensitive (**)
Skills not as sensitive to instruction: .25** (.05); differs from skills sensitive (**)
Column 6, attrition:
High attrition (>10%): .28** (.05)
Low attrition (<=10%): .33** (.06)
Missing attrition information: .14 (.11)
Table 3 Continued: Summary of Results from Regressions of Head Start Evaluation Effect Sizes on Single Research Design Factors

Column 8, baseline covariates:
Baseline covariates included: .20* (.03)
Baseline covariates not included: .29** (.04)
Column 9, publication status:
Peer refereed journal: .43** (.09); differs from not peer refereed (†)
Not peer refereed journal: .23** (.04); differs from peer refereed (†)
Column 10, length of program (months, centered at 2): .01 (.01); intercept .21* (.07)
Column 11, months post-treatment: -.00 (.01); intercept .28** (.04)

Note: Multilevel models were estimated with N=241 effect sizes nested in 28 programs; **=p<.001, *=p<.05, †=p<.10. For the categorical factors (columns 1-6 and 8-9), no intercept was estimated; the resulting coefficients therefore represent the average effect size for programs in each category. The "differs from" entries identify within-factor category means that are statistically significantly different from the indicated category. The continuous design measures (columns 10 and 11) were modeled with an intercept; their estimates therefore show the relationship between an incremental increase in each continuous design variable and the average effect size.

Table 4: Summary of Results from Multivariate Regressions of Head Start Evaluation Effect Sizes on Multiple Research Design Factors
(Entries are coefficients, with standard errors in parentheses, from two model specifications; cells marked "no estimate" have no coefficient reported.)

Variable: Model 1 | Model 2
Intercept: .30* (.14) | .33* (.15)
Modern Head Start program (post-1974): -.04 (.12) | -.16 (.14)
Length of program (months, centered at 2): .02† (.01) | .02† (.01)
Peer refereed journal: .28* (.09) | .31* (.11)
Active control group: -.33* (.12) | -.35* (.14)
Missing control group activity: -.01 (.10) | -.07 (.12)
Quasi-experimental study: -.14 (.11) | -.26* (.12)
Baseline covariates included: -.11 (.09) | -.06 (.10)
High attrition (>10%): -.09 (.07) | -.09 (.07)
Missing attrition information: -.04 (.13) | .01 (.15)
Rating (by someone who knows child): .16* (.06) | no estimate
Observation (by researcher): .32* (.15) | no estimate
Skills sensitive to instruction: .13** (.03) | .17** (.03)
Months post-treatment: .00 (.01) | .00 (.01)
Medium reliability (coefficient =.75-.91): no estimate | .09† (.05)
Low reliability (coefficient <.75): no estimate | .20** (.06)
Missing reliability coefficient: no estimate | .07† (.04)

Note: Multilevel models were estimated with N=241 effect sizes nested in 28 programs; **=p<.001, *=p<.05, †=p<.10.

Table 5: Summary of Results from Regressions of Head Start Evaluation Effect Sizes on Multiple Research Design Factors, Including Imputed Missing Effect Sizes
(Entries are coefficients, with standard errors in parentheses; columns 1-4 correspond to the four imputation assumptions described in the note that follows.)

Variable: Column 1 | Column 2 | Column 3 | Column 4
Intercept: .34* (.12) | .29* (.12) | .33* (.12) | .32* (.15)
Modern Head Start program (post-1974): .06 (.11) | .05 (.11) | -.06 (.11) | .19 (.14)
Length of program (months, centered at 2): .00 (.01) | .01 (.01) | .02 (.01) | -.00 (.01)
Peer refereed journal: .29* (.09) | .26* (.09) | .28* (.09) | .26* (.11)
Active control group: -.34* (.12) | -.31* (.12) | -.32* (.11) | -.35* (.15)
Missing control group activity: -.00 (.10) | .04 (.09) | -.03 (.09) | .08 (.12)
Quasi-experimental study: -.17 (.10) | -.12 (.10) | -.15 (.10) | -.14 (.12)
Baseline covariates included: -.04 (.09) | -.08 (.09) | -.09 (.08) | -.01 (.11)
High attrition (>10%): -.11* (.05) | -.07 (.05) | -.09† (.05) | -.11* (.05)
Missing attrition information: -.05 (.11) | -.11 (.11) | .02 (.11) | -.24† (.13)
Rating (by someone who knows the child): .19** (.05) | .18* (.06) | .16* (.05) | .20** (.06)
Observation (by researcher): .35* (.14) | .33* (.15) | .30* (.13) | .38* (.14)
Skills sensitive to instruction: .12** (.03) | .12** (.03) | .12** (.03) | .11** (.03)
Months post-treatment: -.00 (.00) | -.00 (.00) | .00 (.00) | -.00 (.00)

Note: Multilevel models were estimated with N=313 effect sizes (ESs) nested in 33 programs; **=p<.001, *=p<.05, †=p<.10.
We imputed 72 ESs based on the following assumptions. Column 1: If there was no report of the magnitude of the ESs or of whether the ESs were statistically significant, we assumed that there was no difference between the groups (g=0); if the ESs were reported as statistically significant, we assumed p=.05. Column 2: If an author reported the direction of the ESs (which group was favored) but not whether the ESs were statistically significant, we assumed the ES approached marginal significance (p=.11); if authors did not report that either group was favored, we again assumed that g=0. Column 3: We assumed that all ESs favored the treatment group as much as possible (p=.11 if the treatment group was favored or if there was no indication of which group was favored; p=.99 otherwise). Column 4: We assumed that all ESs favored the control group as much as possible (p=.11 if the treatment group was favored; p=.99 otherwise).

Appendix A: Database References

Abbott-Shim, M., Lambert, R., & McCarty, F. (2000). A study of Head Start effectiveness using a randomized design. Paper presented at the Fifth Head Start National Research Conference, Washington, DC.
Abbott-Shim, M., Lambert, R., & McCarty, F. (2003). A comparison of school readiness outcomes for children randomly assigned to a Head Start program and the program's wait list. Journal of Education for Students Placed at Risk, 8, 191-214.
Abelson, W. D. (1974). Head Start graduates in school: Studies in New Haven, Connecticut. In S. Ryan (Ed.), A report on longitudinal evaluations of preschool programs: Volume 1 (pp. 1-14). Washington, DC: Office of Child Development, US Department of Health, Education and Welfare.
Allerhand, M. (1965). Impact of summer 1965 Head Start on children's concept attainment during kindergarten. Cleveland, OH: Western Reserve University.
Allerhand, M. (1967). Effectiveness of parents of Headstart children as administrators of psychological tests. Journal of Consulting Psychology, 31, 286-290.
Allerhand, M., Gaines, E., & Sterioff, S. (1966). Headstart follow-up study manual (rating concept attainment). Cleveland, OH: Western Reserve University.
Allerhand, M. E. (1966). Headstart operational field analysis: Progress report III. Cleveland, OH: Western Reserve University.
Allerhand, M. E. (1966). Headstart operational field analysis: Progress report II. Cleveland, OH: Western Reserve University.
Allerhand, M. E. (1966). Headstart operational field analysis: Progress report I. Cleveland, OH: Western Reserve University.
Allerhand, M. E. (1966). Headstart operational field analysis: Progress report IV. Cleveland, OH: Western Reserve University.
Arenas, S., & Trujillo, L. A. (1982). A success story: The evaluation of four Head Start bilingual multicultural curriculum models. Denver, CO: InterAmerica Research Associates.
Barnow, B. S., & Cain, G. G. (1977). A reanalysis of the effect of Head Start on cognitive development: Methodology and empirical findings. The Journal of Human Resources, 12, 177-197.
Bereiter, C., & Engelmann, S. (1966). Teaching disadvantaged children in the preschool. Englewood Cliffs, NJ: Prentice-Hall, Inc.
Bridgeman, B., & Shipman, V. C. (1975). Predictive value of measures of self-esteem and achievement motivation in four- to nine-year-old low-income children. Disadvantaged children and their first school experiences: ETS-Head Start Longitudinal Study. Princeton, NJ: Educational Testing Service.
Chesterfield, R., & Chaves, R. (1982). An evaluation of the Head Start Bilingual Bicultural Curriculum Development Project. Technical reports, Vols. I and II, executive summary. Los Angeles, CA: Juarez and Associates.
Chesterfield, R., et al. (1979). Pilot study results and child assessment measures. Los Angeles, CA: Juarez and Associates.
Chesterfield, R., et al. (1979). An evaluation of the Head Start Bilingual Bicultural Curriculum Development Project. Report of the pilot study results and the training of fieldworkers for the ethnographic/observational component. Los Angeles, CA: Juarez and Associates.
Chesterfield, R., et al. (1982). An evaluation of the Head Start Bilingual Bicultural Curriculum Development Project: Final report. Los Angeles, CA: Juarez and Associates.
Cicarelli, V. G., Cooper, W. H., & Granger, R. L. (1969). The impact of Head Start: An evaluation of the effects of Head Start on children's cognitive and affective development. Volume 2. Office of Economic Opportunity. Athens, OH: Westinghouse Learning Corporation and Ohio University.
Cicarelli, V. G., Cooper, W. H., & Granger, R. L. (1969). The impact of Head Start: An evaluation of the effects of Head Start on children's cognitive and affective development. Volume 1, text and appendices A to E. Office of Economic Opportunity. Athens, OH: Westinghouse Learning Corporation and Ohio University.
Cline, M. G., & Dickey, M. (1968). An evaluation and follow-up study of summer 1966 Head Start children in Washington, DC. Washington, DC: Howard University.
Engelmann, S. (1969). Preventing failure in the primary grades. New York: Simon and Schuster.
Engelmann, S., & Osborn, J. (1976). Distar Language I: An instructional system. Teacher's guide, 2nd edition. Chicago: Science Research Associates, Inc.
Engelmann, S., & Carnine, D. (1975). Distar Arithmetic I, 2nd edition. Chicago: Science Research Associates.
Erickson, E. L., McMillan, J., Bonnell, J., Hofman, L., & Callahan, O. D. (1969). Experiments in Head Start and early education: The effects of teacher attitude and curriculum structure on preschool disadvantaged children. Washington, DC: Office of Economic Opportunity.
Esteban, M. D. (1987). A comparison of Head Start and non-Head Start reading readiness scores of low-income kindergarten children of Guam. Ann Arbor, MI: UMI Dissertation Services.
Henderson, R. W., Rankin, R. J., & Frosbisher, M. W. (1969). Positive effects of a bicultural preschool program on the intellectual performance of Mexican-American children. Paper presented at the annual meeting of the American Educational Research Association, Los Angeles.
Hodes, M. R. (1966). An assessment and comparison of selected characteristics among culturally disadvantaged kindergarten children who attended Project Head Start (summer program 1965), culturally disadvantaged kindergarten children who did not attend Project Head Start, and kindergarten children who were not culturally disadvantaged. Glassboro, NJ: Glassboro State College.
Howard, J. L., & Plant, W. T. (1967). Psychometric evaluation of an operant Head Start program. Journal of Genetic Psychology, 11, 281-288.
Huron Institute. (1974). Short-term cognitive effects of Head Start programs: A report on the third year of planned variation. Cambridge, MA.
Hyman, I. A., & Kliman, D. S. (1967). First grade readiness of children who have had summer Headstart programs. Training School Bulletin, 63, 163-167.
Krider, M. A., & Petsche, M. (1967). An evaluation of Head Start pre-school enrichment programs as they affect the intellectual ability, the social adjustment, and the achievement level of five-year-old children enrolled in Lincoln, Nebraska. Lincoln, NE: University of Nebraska.
Larson, D. E. (1972). Stability of gains in intellectual functioning among white children who attended a preschool program in rural Minnesota: Final report. Office of Education (DHEW). Mankato, MN: Mankato State College.
Larson, D. F. (1969). The effects of a preschool experience upon intellectual functioning among four-year-old, white children in rural Minnesota. Mankato: Minnesota State University, College of Education.
Lee, V. E., Brooks-Gunn, J., Schnur, E., & Liaw, F. R. (1990). Are Head Start effects sustained? A longitudinal follow-up comparison of disadvantaged children attending Head Start, no preschool, and other preschool programs. Child Development, 61, 495-507.
Lee, V. E., Schnur, E., & Brooks-Gunn, J. (1988). Does Head Start work? A 1-year follow-up comparison of disadvantaged children attending Head Start, no preschool, and other preschool programs. Developmental Psychology, 24, 210-222.
Ludwig, J., & Phillips, D. (2007). The benefits and costs of Head Start. Social Policy Report, 21, 3-13.
McNamara, J. R. (1968). Evaluation of the effects of Head Start experience in the area of self-concept, social skills, and language skills (pre-publication draft). Miami, FL: Dade County Board of Public Instruction.
Miller, L. B., et al. (1970). Experimental variation of current approaches. University of Louisville.
Miller, L. B., et al. (1972). Four preschool programs: Their dimensions and effects. University of Louisville.
Morris, B., & Morris, G. L. (1966). Evaluation of changes occurring in children who participated in Project Head Start. Kearney, NE: Kearney State College.
Nummedal, S. G., & Stern, C. (1971). Head Start graduates: One year later. Paper presented at the annual meeting of the American Educational Research Association, New York, NY.
U.S. Department of Health and Human Services. (2003). Building futures: The Head Start impact study interim report. Washington, DC.
Porter, P. J., Leodas, C., Godley, R. A., & Budroff, M. (1965). Evaluation of Head Start educational program in Cambridge, Massachusetts: Final report. Cambridge, MA: Harvard University.
Sandoval-Martinez, S. (1982). Findings from the Head Start bilingual curriculum development effort. NABE: The Journal for the National Association for Bilingual Education, 7, 1-12.
Schnur, E., & Brooks-Gunn, J. (1988). Who attends Head Start? A comparison of Head Start attendees and nonattendees from three sites in 1969-1970. Princeton, NJ: Educational Testing Service.
Shipman, V. (1970). Disadvantaged children and their first school experiences: ETS-Head Start longitudinal study. Preliminary description of the initial sample prior to school enrollment: Summary report. Princeton, NJ: Educational Testing Service.
Shipman, V. C. (1972). Disadvantaged children and their first school experience. Princeton, NJ: Educational Testing Service.
Shipman, V. C. (1976). Stability and change in family status, situational, and process variables and their relationship to children's cognitive performance. Princeton, NJ: Educational Testing Service.
Shipman, V. C., et al. (1976). Notable early characteristics of high and low achieving black and low-SES children. Disadvantaged children and their first school experiences: ETS-Head Start longitudinal study. Princeton, NJ: Educational Testing Service.
Smith, M. S., & Bissell, J. S. (1970). Report analysis: The impact of Head Start. Harvard Educational Review, 40, 51-104.
Sontag, M., Sella, A. P., & Thorndike, R. L. (1969). The effect of Head Start training on the cognitive growth of disadvantaged children. The Journal of Educational Research, 62, 387-389.
Tamminen, A. W., Weatherman, R. F., & McKain, C. W. (1967). An evaluation of a preschool training program for culturally deprived children: Final report. U.S. Department of Health, Education, and Welfare. Duluth: University of Minnesota.
Thorndike, R. L. (1966). Head Start Evaluation and Research Center, Teachers College, Columbia University. Annual report (1st), September 1966-August 1967. New York, NY: Columbia University Teachers College.
U.S. Department of Health and Human Services. (2005). Head Start impact study: First year findings. Washington, DC.
Young-Joo, K. (2007). The return to private school and education-related policy. University of Wisconsin (dissertation).
Zigler, E. F., Ableson, W. D., Trickett, P. K., & Seitz, V. (1982). Is an intervention program necessary in order to improve economically disadvantaged children's IQ scores? Child Development, 53, 340-348.

Appendix B: Head Start Studies, Programs, and Contrasts Included in Analysis

Start Date | Study Description | Programs/Contrasts Included*
1968 | Regular HS v. Direct Instruction Head Start v. control | 1) Bereiter/Englemann curriculum (Direct Instruction) in Head Start v. no preschool control; 2) Enrichment (Standard) Head Start v. no preschool control
1965 | National Head Start Program, 1965-1968 | 1) Full year Head Start participation v. no Head Start services
1966** | New Haven Head Start evaluation | 1) Head Start (from Zigler & Butterfield study) v. no Head Start (Zigler/Butterfield study)
1965 | Camden, NJ Summer Head Start | 1) Culturally disadvantaged, attended Camden Summer Head Start v. culturally disadvantaged, did not attend Camden Summer Head Start
1969 | Comparison of children enrolled in Head Start, other preschool, or no preschool in two cities | 1) Children enrolled in HS v. children with no preschool
1998 | Southeastern Head Start program of high quality | 1) Children in Head Start v. children on wait list for Head Start
1965 | New Jersey Summer Head Start, one or two years | 1) One or two treatments of summer Head Start v. no summer Head Start
1967** | Evaluation of the effect of Head Start program on cognitive growth of disadvantaged children | 1) Head Start program v. children about to enter Head Start
1965 | Kearney, NE Summer Head Start | 1) Children who attended Summer Head Start v. matched children who did not attend Summer Head Start
1965 | Summer Head Start program evaluation, San Jose, CA; Howard, J. | 1) Children who attended summer Head Start program v. children who did not receive Head Start services
1968 | Rural Minnesota Head Start | 1) Head Start v. eligible for Head Start but did not enroll; 2) Head Start v. students not enrolled in any preschool
1965 | Cambridge, MA Summer Head Start | 1) (Summer) Head Start v. Operation Checkup (medical exam)
1966** | Head Start Effects on Self-Concept, Social Skills, and Language Skills, 1968 | 1) Head Start v. no preschool
1979 | Head Start Bilingual Bicultural Development Project | 1) Bilingual HS v. stay at home; 2) Comparison HS v. stay at home
1965 | Duluth Summer Head Start | 1) Head Start v. stay at home
1965 | Impact of 1965 Summer Head Start on Children's Concept Attainment, Allerhand | 1) Summer Head Start v. no Head Start
1966 | Bicultural Preschool Program | 1) Mexican American kids in Head Start v. Mexican American kids with no preschool
1967 | Head Start effects on children's behavior and cognitive functioning one year later, Nummedal, 1971 | 1) Full year Head Start v. no Head Start
1980** | New Haven Head Start | 1) Children who received New Haven Head Start v. non-Head Start comparison group
1966 | A follow-up study of a summer Head Start program in Washington, DC | 1) Head Start v. no Head Start
1965 | Lincoln, NE Summer Head Start | 1) Head Start, matched pairs v. stay at home, matched pairs; 2) Head Start, unmatched pairs v. stay at home, unmatched pairs
2002 | National Head Start Impact Study First Year | 1) Head Start 3 years (weighted, controlling for demographics and pretest scores; based on OLS) v. no Head Start 3 years; 2) Head Start 4 years (weighted, controlling for demographics and pretest scores; based on OLS) v. no Head Start 4 years; 3) Head Start 3 years (treatment on treated, Ludwig & Phillips analysis) v. no Head Start 3 years (treatment on treated, Ludwig & Phillips analysis); 4) Head Start 4 years (treatment on treated, Ludwig & Phillips analysis) v. no Head Start 4 years (treatment on treated, Ludwig & Phillips analysis)
1966** | New Haven Head Start, Abelson, Zigler & Levine | 1) Head Start (from Abelson, Zigler & Levine study) v. no Head Start
1997 | ECLS-K Head Start Study | 1) Whites attending Head Start v. all other white students (likely mix of stay at home and other childcare); 2) Blacks attending Head Start v. all other black students (likely mix of stay at home and other childcare); 3) Hispanics attending Head Start v. all other Hispanic students (likely mix of stay at home and other childcare)
1971 | Planned Variation in Head Start | 1) Planned Variation Head Start, 1971-1972 v. no Head Start, 1971-1972; 2) Standard Head Start, 1971-1972 v. no Head Start, 1971-1972
1968 | Louisville Head Start Curriculum Comparison | 1) Bereiter-Engelmann Head Start v. no pre-k; 2) DARCEE Head Start v. no pre-k; 3) Montessori Head Start v. no pre-k; 4) Traditional Head Start v. no pre-k
1985 | Comparison of Head Start vs. non-Head Start reading readiness scores of low income children in Guam | 1) Head Start v. no Head Start

*A horizontal line indicates separate programs within studies; contrasts within a single program are listed in the same cell.
**If an actual start date for the program was not provided, we estimated the start date to be two years prior to report publication.