PRELIMINARY AND INCOMPLETE - Association for Education

Can Research Design Explain Variation in Head Start Research Results?
A Meta-analysis of Cognitive and Achievement Outcomes
Hilary M. Shager
University of Wisconsin-Madison
Holly S. Schindler
Center on the Developing Child
Harvard University
Katherine A. Magnuson
University of Wisconsin-Madison
Greg J. Duncan
Department of Education
University of California-Irvine
Hirokazu Yoshikawa
Harvard Graduate School of Education
Cassandra M. D. Hart
University of California, Davis
Submitted to EEPA 1-20-2011
This meta-analysis explores the extent to which differences in research design explain the
heterogeneity in program evaluation findings from Head Start impact studies. We predicted average
effect sizes for cognitive and achievement outcomes as a function of the type and rigor of research
design, quality and timing of dependent measure, activity level of control group, and attrition. Across
28 evaluations, the average program-level effect size was .27. About 41 percent of the variation in
impacts across evaluations can be explained by the measures of research design features, including the
extent to which the control group experienced other forms of early care or education, and 11 percent of
the variation within programs can be explained by features of the outcomes.
The recognition that poor children start school well behind their more advantaged peers in terms
of academic skills has focused attention on early childhood education (ECE) as a potential vehicle for
remediating such early achievement gaps. Indeed, educators and advocates have successfully argued
that greater public funds should be spent to support early education programs for disadvantaged children,
and investments in a variety of ECE programs have increased tremendously in the past 20 years. This
increased attention and funding has also produced a proliferation of ECE program evaluations. These
studies have the potential to yield important information about differences in the effectiveness of
particular program models; as a result, it is important for researchers and policy makers to be effective
designers and informed consumers of such research.
Researchers and policy makers interested in the effectiveness of ECE programs are faced with a
difficult task in trying to make sense of findings across studies. Evaluations vary greatly in methods,
quality, and results, which leads to an “apples vs. oranges” comparison problem. Previous reviews of
ECE programs have described differences in research design as a part of their subjective, narrative
analyses, and have suggested that such features might be important (Ludwig & Phillips, 2007;
Magnuson & Shager, 2010; McKey et al., 1985). However, there has been little systematic empirical
investigation of the role of study design features in explaining differing results.
This study investigates the importance of study design features in explaining variability in
program evaluation results for a particular model of ECE, Head Start. A centerpiece of President
Lyndon B. Johnson’s “War on Poverty,” Head Start was designed as a holistic intervention to improve
economically disadvantaged, preschool-aged children’s cognitive and social development by providing a
comprehensive set of educational, health, nutritional, and social services, as well as opportunities for
parent involvement (Zigler & Valentine, 1979). Program guidelines require that at least 90 percent of
the families served in each Head Start program be poor (with incomes below the federal poverty
threshold).1 Since its inception in 1965, the federally funded program has enrolled over 27 million
children (US DHHS, 2010).
Despite the program’s longevity, the question of whether children who attend Head Start
experience substantial positive gains in their academic achievement has been the subject of both policy
and academic debate since the 1969 Westinghouse study, which found that initial positive gains
experienced by program participants “faded out” by the time children reached early elementary school
(Westinghouse Learning Corporation, 1969). The design of that study was criticized because the
comparison group may have been composed of children not as disadvantaged as those attending Head
Start, thus leading to a possible underestimation of program impacts (McGroder, 1990). The lack of
experimental studies of Head Start, coupled with evidence of positive short and long-term effects from
non-experimental studies (Currie & Thomas, 1995; Deming, 2009; Garces, Thomas, & Currie, 2002;
Ludwig & Miller, 2007), perpetuated debate and precluded definitive conclusions about the program’s
effectiveness.
In 2005, when the results of the first large national experiment evaluating Head Start were
reported (US DHHS, 2005), and again in 2010 when the follow-up study was released (USDHHS, ACF,
2010), some decried the comparatively small effects that again appeared to fade out over time, and
contrasted them with much larger effects from other early education demonstration programs and state
prekindergarten programs. For example, Besharov and Higney (2007) wrote, “It seems reasonable to
compare Head Start’s impact to that of both pre-K programs and center-based child care. According to
recent studies, Head Start’s immediate or short-term impacts on a host of developmental indicators seem
not nearly as large as those of either pre-K or center-based child care” (pp. 686-687). Yet, several
scholars have argued that methodological differences may at least in part explain the differences in
program impacts across such ECE studies (Cook & Wong, 2007). How large a role these design
features might play, however, has been only a matter of conjecture.
1 Recent legislation gives programs some discretion to serve children near poverty. Children from families receiving public
assistance (TANF or SSI) or children who are in foster care are eligible for Head Start or Early Head Start regardless of
family income level. Ten percent of program slots are reserved for children with disabilities, also regardless of income.
Tribal Head Start programs and programs serving migrant families may also have more open eligibility requirements (US DHHS,
Administration for Children and Families, 2010).
By using published or reported impact estimates as the unit of study, a meta-analysis provides a
unique opportunity to estimate associations between research design characteristics and results. In this
paper, we employ this methodology, using information from Head Start impact studies conducted
between 1965 and 2007, to estimate the average effect size for Head Start studies for children’s
cognitive and achievement outcomes in our sample of studies. Then, we explore how research design
features are related to variation in effect size.
Head Start serves as a particularly effective case study for such a methodological exercise. It has
provided a fairly standardized set of services to a relatively homogeneous population of children over a
long period of time; yet evaluations have yielded varying results regarding the program’s effectiveness
in improving children’s academic skills. By holding constant program features such as funding,
program structure and requirements, and family socioeconomic background of children served, our
analysis can better estimate the independent effects of study design in explaining variation in findings,
while ensuring that such differences cannot easily be attributed to these other potentially
confounding factors.
Our paper proceeds as follows: first, we review the previous literature regarding the relationship
between research design factors and effect sizes measuring the impact of ECE programs on children’s
cognitive and achievement outcomes; second, we describe our method, meta-analysis; third, we present
our results; and finally, we discuss the implications of our findings for future research and policy.
Literature Review
In most ECE studies it is difficult a priori to know how variations in research design will affect
estimates of program effectiveness, as such effects may be specific to the particular aspects and features
of the program and population being studied. Understanding the role of study design rarely is the intent
of the analysis, and most often meta-analyses rely on a single omnibus indicator of design quality, which
typically combines a number of study design features related to issues of internal validity. Meta-analyses
employing such an approach have yielded mixed evidence of significant associations between research
design factors and the magnitude of effect sizes for children’s cognitive and achievement outcomes.
The only existing meta-analysis of Head Start research, conducted over 25 years ago by McKey
and colleagues (1985), included studies completed between 1965 and 1982, and found positive program
impacts on cognitive test scores in the short term (effect sizes=.31 to .59), but not the long term (two or
more years after program completion; effect sizes= -.03 to .13). Initial descriptive analyses of
methodological factors such as quality of study design, sampling, and attrition revealed only slight
influences on the magnitude and direction of effect sizes; therefore, these variables were not included in
the main analyses. The authors found, however, that studies with pre-/post-test designs that lacked a
comparison group (and thus may not have adequately controlled for maturation effects) tended to
produce larger effect sizes than studies that included a comparison group.
More general meta-analyses of ECE programs, which include some Head Start studies as well as
evaluations of other ECE models, are difficult to compare because of varying study inclusion criteria
and differing composite measures of study quality. For example, in a meta-analysis of 123 ECE
evaluations, Camilli and colleagues (2010) found that studies with higher-quality design yielded larger
effect sizes for cognitive outcomes. The measure of design quality was a single dummy variable
encompassing such factors as attrition, baseline equivalence of treatment and control groups, high
implementation fidelity, and adequate information provided for coders.
In contrast, Gorey (2001), in his meta-analysis of 23 ECE programs with long-term follow-up
studies, found no significant link between an index of internal validity (including factors such as design
type, sample size, and attrition) and effect size magnitude. Nelson, Westhues and MacLeod (2003), in
their meta-analysis of 34 longitudinal preschool studies, also found no significant link between effect
size and a total methodological quality score, based on factors such as whether studies were randomized,
whether inclusion and exclusion criteria were clearly defined, whether baseline equivalence between
treatment and control groups was established, whether outcome measures were blinded, outcome
measure reliability, and year published. Interestingly, several meta-analyses of other medical and social
service interventions have found that low-quality study designs yield larger effect sizes than high-quality
ones (Moher et al., 1998; Schulz et al., 1995; Sweet & Appelbaum, 2004).
Prior research also suggests that other features of study design may matter—in particular, the
quality and type of measures selected. Meta-analyses of various K-12 educational interventions provide
evidence that researcher-developed tests yield larger average effect sizes than standardized tests (for a
review see Rosenshine, 2010). One suggested reason for this finding is that researcher-developed tests
may assess skills more closely related to those specifically taught in a particular intervention, while
standardized tests may tap more general skills (Gersten et al., 2009). Suggestive evidence that measures
that are more closely aligned with the practices of classroom instruction may be more sensitive to early
education is found in Wong and colleagues’ (2008) regression discontinuity study of five state
prekindergarten programs. All of the programs measured both children’s receptive vocabulary (Peabody
Picture Vocabulary Test) as well as their level of print awareness. Across the five programs, effects on
print awareness were several times greater than the effects on receptive vocabulary.
Two overlooked aspects of outcome measures are reliability and whether outcomes are based on
objective measures of a child’s performance or an observational rating. Assessments of children’s
cognitive development and academic achievement vary greatly in their rigor and potential for bias. For
example, standardized assessments that have been carefully designed to have high levels of reliability
are likely to introduce less measurement error than either teacher or parent reports of children’s
cognitive skills; however, both types of measures are regularly used as outcomes. Some scholars argue
that observational ratings may be likely to introduce bias. Hoyt and colleagues’ (1999) meta-analysis of
psychological studies on rater bias found that rating bias was likely to be very low in measures that
included explicit attributes (counts of particular behaviors) but quite prevalent in measures that required
raters to make inferences (global ratings of achievement or skills). If ratings of young children are a mix
of these types of measures, then such scales might be more biased than performance ratings; however,
the expected direction of bias remains unclear, and may differ by who is doing the rating (research staff
versus teacher) and whether the rater is blind to the participants’ treatment status (Hoyt et al., 1999).
Also overlooked in this literature is the activity level of the control group, despite concerns about
this raised in discussions of recent Head Start studies (National Forum on Early Childhood Policy and
Programs, 2010). In the case of ECE evaluations, we can define this as participation in center-based
care or preschool among control-group children. Control group activity varies considerably across ECE
studies, particularly in more recent evaluations, given the relatively high rates of participation in early
care and education programs among children of working parents, and the expansion of ECE programs in
the decades since Head Start began. Cook (2006) compared the level of center-based early education in
several studies of preschool programs, and found that rates of participation of the control group in these
settings during the preschool “treatment” year varied from just over 20 percent in state prekindergarten
studies (Wong et al., 2008) to nearly 50 percent in the Head Start Impact Study (US DHHS, 2005). He
argued that these differences might in part explain the larger effect sizes associated with recent
prekindergarten evaluations compared to the findings reported in the Head Start Impact Study, and more
generally that comparing effect sizes across programs when both the program and study design differ
leaves one unable to make any comparative judgment about program effectiveness.
Research Questions
While some previous meta-analyses suggest a link between research design characteristics and
results, some potentially important aspects of studies have not been considered. Furthermore, previous
meta-analyses have generally employed a single design quality composite score rather than estimating
the comparative role of each particular design factor in question. A more detailed empirical test of the
importance of particular research design characteristics is needed to enable scholars and policy makers
to better understand findings from prior ECE studies, as well as to inform future studies. Our study tests
whether the following research design characteristics explain heterogeneity in the estimated effects of
Head Start on children’s cognitive and achievement outcomes: type and rigor of design, quality and
timing of dependent measure, activity level of control group, and attrition.
Because Head Start primarily serves disadvantaged children, the concern with many prior studies
of the program is that analysts were unable to find a comparison group that was as disadvantaged as the
treatment group; thus, the independent influence of such disadvantage might downwardly bias estimates
of Head Start effectiveness (Currie & Thomas, 1995). Therefore, we hypothesize that studies that used
more rigorous methods to ensure similarity between treatment and control groups prior to program
participation, particularly random assignment, will produce larger effect sizes.
We also investigate whether three aspects of outcome measures affect the magnitude of effect
sizes. The first is whether the measure assessed types of skills that are more closely aligned with early
education instruction, such as pre-reading and pre-math skills, which we expect to yield larger effect
sizes than measures of more abstract and global cognitive skills, because these skills have been
demonstrated to be particularly sensitive to instruction (Christian, Morrison, Frazier, & Massetti, 2000;
Wong et al., 2008). The second is whether the measure was a performance test, rating (by teacher or
caregiver), or observation (often by a researcher). Given that ratings and observations have the potential
to introduce rater bias, and it is not clearly understood how this error might affect estimates, we do not
have a clear hypothesis about how estimates of effect sizes will differ across these types of measures.
Third, we explore the role of a measure’s reported reliability. If high reliability indicates low
measurement error, we might expect somewhat larger estimates from more reliable measures. However,
prior studies show that larger effect sizes result from researcher-developed tests, which may be
explained by better alignment between outcome measures and skills taught in a particular intervention,
and relatively lower reliability of these measures, compared to standardized tests (Rosenshine, 2010).
Thus, we might also expect more reliable measures to predict smaller effect sizes.
Based on findings from previous Head Start evaluations, which suggest “fade out” of effects
over time, we expect the length of time between the end of the Head Start program and outcome
measurement to be negatively associated with average effect size (Currie & Thomas, 1995; McKey et al.,
1985). Given research suggesting that children derive cognitive gains from other types of ECE
programs (Zhai, Brooks-Gunn, & Waldfogel, 2010), we also expect a negative relationship between
effect size and the activity level of the control group (i.e., a measure of whether control group members
sought alternative ECE services on their own). Finally, we predict that studies with higher levels of
overall attrition may yield smaller effect sizes, since disadvantaged students, who are most likely to
benefit from the program, are also most likely to be lost to follow-up.2
Research Methods
Meta-analysis. To understand how specific features of research design may account for the
heterogeneity in estimated Head Start effects, we conducted a meta-analysis, a method of quantitative
research synthesis that uses prior study results as the unit of observation (Cooper & Hedges, 2009). To
combine findings across studies, estimates are transformed into a common metric, an “effect size,”
which expresses treatment/control differences as a fraction of the standard deviation of the given
outcome. Outcomes from individual studies can then be used to estimate the average effect size across
studies. Additionally, meta-analysis can be used to test whether average effect size differs by, for
example, characteristics of the studies and study samples. After defining the problem of interest,
meta-analysis proceeds in the following steps, described below: 1) literature search, 2) data evaluation,
and 3) data analysis.
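As a sketch of the common metric described above, the effect size for a single outcome can be written as a standardized mean difference between the treatment (T) and control (C) groups. The symbols are illustrative, and the pooled standard deviation in the denominator is an assumption here; the text specifies only that differences are scaled by the standard deviation of the given outcome.

```latex
ES = \frac{\bar{Y}_{T} - \bar{Y}_{C}}{s_{\text{pooled}}},
\qquad
s_{\text{pooled}} = \sqrt{\frac{(n_{T}-1)\,s_{T}^{2} + (n_{C}-1)\,s_{C}^{2}}{n_{T}+n_{C}-2}}
```

Under this definition, an effect size of .27 means that the treatment group scored about a quarter of a standard deviation above the comparison group on that outcome.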
Literature Search. The Head Start studies analyzed in this paper compose a subset of studies
from a large meta-analytic database being compiled by the National Forum on Early Childhood Policy
and Programs. This database includes studies of child and family policies, interventions, and prevention
programs provided to children from the prenatal period to age five, building on previous meta-analytic
databases created by Abt Associates, Inc. and the National Institute for Early Education Research
(NIEER) (Camilli et al., 2010; Jacob, Creps & Boulay, 2004; Layzer, Goodson, Bernstein & Price,
An important first step in a meta-analysis is to identify all relevant evaluations that meet one’s
programmatic and methodological criteria for inclusion; therefore, a number of search strategies were
used to locate as many published and unpublished Head Start evaluations conducted between 1965 and
2007 as possible. First, we conducted key word searches in ERIC, PsycINFO, EconLit, and
Dissertation Abstracts databases, resulting in 304 Head Start evaluations.3 Next, we manually searched
the websites of several policy institutes (e.g., Rand, Mathematica, NIEER) and state and federal
departments (e.g., U.S. Department of Health and Human Services) as well as references mentioned in
collected studies and other key Head Start reviews. This search resulted in another 134 possible reports
for inclusion in the database. In sum, 438 Head Start evaluations were identified, in addition to the 126
previously coded by Abt and NIEER.
2 We recognize, however, that in the presence of differential attrition across the control and treatment groups, the effects of
attrition may be more complicated, depending on how patterns of attrition differ across the groups.
Data Evaluation. The next step in the meta-analysis process was to determine whether identified
studies met our established inclusion criteria. To be included in our database, studies must have i) a
comparison group (either an observed control or alternative treatment group); and ii) at least 10
participants in each condition, with attrition of less than 50 percent in each condition. Evaluations may
be experimental or quasi-experimental, using one of the following methods: regression discontinuity,
fixed effects (individual or family), residualized or other longitudinal change models, difference in
difference, instrumental variables, propensity score matching, or interrupted time series.
Quasi-experimental evaluations not using one of the preceding analytic strategies are also screened in if they
include a comparison group plus pre- and post-test information on the outcome of interest or
demonstrate adequate comparability of groups on baseline characteristics. These criteria are more
rigorous than those applied by McKey et al. (1985) and Abt/NIEER; for example, they eliminate all
pre/post-only (no comparison group) studies, as well as regression-based studies with baseline
nonequivalent treatment and control groups.
3 The original Abt/NIEER database included ECE programs evaluated between 1960 and 2003 and used similar search
techniques; therefore, we did not re-search for Head Start evaluations conducted during these years, with the exception of
2003. We conducted searches for evaluations completed between 2003 and 2007, as well as for programs not targeted by the
original Abt/NIEER search strategies.
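The database-level inclusion criteria described above can be sketched as a simple screening function. This is only an illustration: the `Evaluation` record, its field names, and the design labels are hypothetical and are not taken from the authors' coding protocol.

```python
from dataclasses import dataclass

# Quasi-experimental methods named in the text (labels are hypothetical).
ACCEPTED_QUASI_DESIGNS = {
    "regression_discontinuity", "fixed_effects", "longitudinal_change",
    "difference_in_difference", "instrumental_variables",
    "propensity_score_matching", "interrupted_time_series",
}


@dataclass
class Evaluation:
    has_comparison_group: bool
    n_treatment: int
    n_control: int
    attrition_treatment: float   # proportion lost, 0.0 to 1.0
    attrition_control: float     # proportion lost, 0.0 to 1.0
    design: str                  # "experimental", an accepted label, or other
    has_pre_post_data: bool = False
    baseline_comparable: bool = False


def meets_inclusion_criteria(ev: Evaluation) -> bool:
    """Apply the screening rules as stated in the text (assumed encoding)."""
    if not ev.has_comparison_group:
        return False
    # At least 10 participants in each condition.
    if min(ev.n_treatment, ev.n_control) < 10:
        return False
    # Attrition of less than 50 percent in each condition.
    if max(ev.attrition_treatment, ev.attrition_control) >= 0.50:
        return False
    if ev.design == "experimental" or ev.design in ACCEPTED_QUASI_DESIGNS:
        return True
    # Other quasi-experimental designs need pre/post data or baseline comparability.
    return ev.has_pre_post_data or ev.baseline_comparable
```

For example, a pre/post-only study with no comparison group fails the first check regardless of its sample size.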
For this particular study, which is focused on impact evaluations of Head Start, we impose some
additional inclusion criteria. We include only studies that measure differences between Head Start
participants and control groups that were assigned to receive no other services. For example, studies that
randomly assigned children to Head Start versus another type of early education program or Head Start
add-on program are excluded. However, studies are not excluded if families assigned to a no-treatment
control group sought services of their own volition. In addition, we include only Head Start studies that
provide at least one measure of children’s cognitive or achievement outcomes. Furthermore, to improve
comparability across findings, we impose limitations regarding the timing of study outcome measures.
We limit our analysis to studies in which children received at least 75 percent of the intended Head Start
treatment and for which outcomes were measured 12 or fewer months post-treatment.
The screening process, based on the above criteria, resulted in the inclusion of 57 Head Start
publications or reports (See Appendix A: Database References).4 Of the 126 Head Start reports
originally included in the Abt/NIEER database, 29 were eliminated from the database because they did
not meet our research design criteria. The majority of the 438 additional reports identified by the
research team’s search were excluded after reading the abstract (N=243), indicating that they did not
meet our inclusion criteria for obvious reasons (e.g., they were not quantitative evaluations of Head Start
or did not have a comparison group). Of the 98 Head Start evaluation reports that were excluded after
full-text screening, 90 were excluded because they did not meet our research design criteria; 53 of these
specifically because they did not include a comparison group. Eight other reports were excluded due to
other eligibility criteria (e.g., they only reported results for students with disabilities or did not report
results for outcomes we were interested in coding). Our additional inclusion criteria for this paper (e.g.,
short-term cognitive or achievement outcomes only, no alternative treatment or curricular add-on studies)
excluded 120 reports that remain coded in the database, but are not included in this analysis.5
4 Because some of our inclusion criteria differed from Abt’s and NIEER’s original criteria, we re-screened all of the studies
included in the original database as well as the new ones identified by the research team.
Coding Studies. For reports that met our inclusion criteria, the research team developed a
protocol to codify information about study design, program and sample characteristics, as well as
statistical information needed to compute effect sizes. This protocol serves as the template for the
database and delineates all the information about an evaluation that we want to describe and analyze. A
team of 10 graduate research assistants was trained as coders during a 3- to 6-month process that
included instruction in evaluation methods, using the coding protocol, and computing effect sizes.
Trainees were paired with experienced coders in multiple rounds of practice coding. Before coding
independently, research assistants also passed a reliability test composed of randomly selected codes
from a randomly selected study. In order to pass the reliability test, researchers had to calculate 100
percent of the effect sizes correctly and achieve 80 percent reliability with a master coder for the
remaining codes. In instances when research assistants were just under the threshold for effect sizes, but
were reliable on the remaining codes, they underwent additional effect size training before coding
independently and were subject to periodic checks during their transition. Questions about coding were
resolved in weekly research team conference calls.
Database. The resulting database is organized in a four-level hierarchy (from highest to lowest):
the study, the program, the contrast, and the effect size (See Table 1: Key Meta-Analysis Terms and Ns).
A “study” is defined as a collection of comparisons in which the treatment groups are drawn from the
same pool of subjects. A study may include evaluations of multiple “programs”; i.e., a particular type of
Head Start or Head Start in a particular location. Each study also produces a number of “contrasts,”
defined as a comparison between one group of children who received Head Start and another group of
children who received no other services as a result of the study. Evaluations of programs within studies
may include multiple contrasts; for example, results may be presented using more than one analytic
method (e.g., OLS and fixed effects) or separate groups of children (e.g., three- and four-year-olds), and
these are coded as different contrasts nested within one program, within one study. The 33 Head Start
programs included in our meta-analysis include 40 separate contrasts.6 In turn, each contrast consists of
a number of individual “effect sizes” (estimated standard deviation unit difference in an outcome
between the children who experienced Head Start and those who did not). The 40 contrasts in this
meta-analytic database provide a total of 313 effect sizes.7 These effect sizes combine information from a total
of over 160,000 observations.
5 Despite extensive efforts, we were unable to locate 17 reports identified in our searches.
Effect Size Computation. Outcome information was reported in evaluations using a number of
different statistics, which were converted to effect sizes (Hedges’ g) with the commercially available
software package Comprehensive Meta-Analysis (Borenstein, Hedges, Higgins, & Rothstein, 2005).
Hedges’ g is an effect size statistic that adjusts the standardized mean difference (Cohen’s d) to account
for bias in the d estimator when sample sizes are small.
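The small-sample adjustment described above can be sketched as follows. This is a minimal illustration using the standard textbook formulas for Cohen's d, the correction factor J, and their approximate variances; it is not the Comprehensive Meta-Analysis implementation, and the function and argument names are hypothetical.

```python
import math


def hedges_g(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """Return (g, var_g) for one treatment/control comparison."""
    df = n_t + n_c - 2
    # Pooled standard deviation across the two groups.
    sd_pooled = math.sqrt(((n_t - 1) * sd_t ** 2 + (n_c - 1) * sd_c ** 2) / df)
    # Cohen's d: the standardized mean difference.
    d = (mean_t - mean_c) / sd_pooled
    # Approximate small-sample correction factor J.
    j = 1 - 3 / (4 * df - 1)
    # Approximate sampling variance of d.
    var_d = (n_t + n_c) / (n_t * n_c) + d ** 2 / (2 * (n_t + n_c))
    return j * d, j ** 2 * var_d
```

Because J is below 1 and approaches 1 as the sample grows, g is always slightly smaller in magnitude than d, and the two are nearly identical in large studies.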
Dependent Variable. Descriptive information for the dependent measure (effect size) and key
independent variables is provided in Table 2. To account for the varying precision among effect size
estimates, as well as the number of effect sizes within each program, these descriptive statistics and all
subsequent analyses are weighted by the inverse of the variance of each effect size multiplied by the
inverse of the number of effect sizes per program (Cooper & Hedges, 2009; Lipsey & Wilson, 2001).
The dependent variables in these analyses are the effect sizes measuring the standardized difference in
assessment of children’s cognitive skills and achievement between children who attended Head Start
and the comparison group. Effect sizes ranged from -0.49 to 1.05, with a weighted mean of 0.18.
6 We excluded all “sub-contrasts,” i.e., analyses of sub-groups of the main contrast (e.g., by race or gender), as this would
have provided redundant information. See Appendix B for a description of the Head Start studies, programs and contrasts
included in our analyses.
7 In several studies, outcomes were mentioned in the text, but not enough information was provided to calculate effect sizes;
for example, references were made to non-significant findings, but no numbers were reported. Excluding such effect sizes
could lead to upward bias of treatment effects; therefore, we coded all available information for such measures, but coded
actual effect sizes as missing. These effect sizes (N=72) were then imputed four different ways and included in subsequent
robustness checks. Specifics regarding imputation are discussed in the robustness check section of this paper. In cases where
“nested” measures were coded (e.g., an overall IQ score plus several IQ sub-scale scores), we retained the unique sub-scales
and excluded the redundant overall scores.
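The weighting scheme described in this section (inverse variance multiplied by the inverse of the number of effect sizes per program) can be sketched as a weighted mean. The data layout and function name below are assumptions made for illustration, not the authors' code.

```python
from collections import Counter


def weighted_mean_effect(effects):
    """effects: list of (program_id, effect_size, variance) tuples."""
    # Number of effect sizes contributed by each program.
    n_per_program = Counter(program_id for program_id, _, _ in effects)
    # Weight = (1 / variance) * (1 / number of effect sizes in the program).
    weights = [1.0 / var / n_per_program[pid] for pid, _, var in effects]
    weighted_sum = sum(w * es for w, (_, es, _) in zip(weights, effects))
    return weighted_sum / sum(weights)
```

With equal variances, a program reporting two effect sizes counts the same in total as a program reporting one, so programs with many outcome measures do not dominate the average.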
Independent Variables. Several of the key independent variables measure features of the
particular outcome assessments employed. We distinguished between effect sizes measuring
achievement outcomes, such as reading, math, letter recognition, and numeracy skills, which may be
more sensitive to typical classroom instruction, and those measuring cognitive outcomes less sensitive to
instruction, including IQ, vocabulary, theory of mind, attention, task persistence, and syllabic
segmentation, such as rhyming (see Christian et al., 2000 for discussion of this distinction). The majority
of effect sizes (67 percent) are from the cognitive domain.
Using a series of dummy variables, we categorized effect sizes according to the type of measure
employed by the researcher, indicating whether it is a performance test (reference category), rating by
someone the child knows (e.g., a teacher or parent), or observational rating by a researcher. The
majority of outcome measures are performance tests (93 percent). We also included a continuous
measure of the timing of the outcome, measured in months post-treatment, which, given our screening
criteria, ranges from -2.47 to 12.
Other key independent variables represent facets of each program’s research design. These study
and Head Start characteristics do not vary within a program, so we present both program-level and effect
size-level descriptive information for these measures in Table 2. We created a series of dummy
variables indicating the type of design: randomized (reference category) or quasi-experimental.8 The
majority of effect sizes (74 percent) come from quasi-experimental studies, although differences
between program level and effect size level means suggest that randomized trials tended to have more
outcome measures per study than those with quasi-experimental designs.
8. A few programs (N=2) had designs that changed post-hoc; i.e., the study was originally randomized, but for various
reasons, became quasi-experimental in nature. In our primary specifications, these studies were coded as quasi-experimental.
An alternative specification, coding these studies specifically “design changed,” is discussed in the robustness checks section.
We created a dummy variable to indicate whether baseline covariates were included in the
analysis. Although the majority of programs (86 percent) do not include baseline covariates in their
analyses, those that do have a large number of outcome measures.
We also coded the activity level of the control group using the following categories: passive
(reference category), meaning that control group children received no alternative services; or active,
meaning some of the control group members sought services of their own volition; as well as a dummy
variable indicating whether information regarding control group activity was missing from the report.9
Although the majority of effect sizes (53 percent) come from studies with passive control groups, studies
in which the control group actively sought alternative services, specifically attendance at other center-based
child care facilities or early education programs, tend to have more effect sizes per study. Studies
that reported active control groups indicated that between 18 and 48 percent of the control group attended these
types of programs.10
We also created a series of dummy variables indicating levels of overall attrition. Keeping in
mind that attrition was truncated at 50 percent based on our screening criteria, attrition levels were
constructed using quartile scores and defined as follows: low attrition (reference category), less than or
equal to 10 percent (representing quartiles 1, 2, and 3); high attrition, greater than 10 percent
(representing quartile 4); or missing overall attrition information. The majority of effect sizes come from
studies with 10 percent attrition or less.
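The dummy coding described above, with low attrition as the omitted reference category, can be sketched as follows. This is an illustration only; the function name and the use of `None` to represent a report with no attrition information are assumptions made for the example.

```python
def attrition_dummies(attrition_pct):
    """Return (high_attrition, missing_attrition) indicator values.

    Low attrition (10 percent or less) is the reference category, so it is
    represented by both indicators being zero.
    """
    if attrition_pct is None:      # report gave no overall attrition figure
        return (0, 1)
    if attrition_pct > 10:         # high attrition (top quartile in the source)
        return (1, 0)
    return (0, 0)                  # low attrition: reference category

# Hypothetical studies with 4 percent, 23 percent, and unreported attrition
coded = [attrition_dummies(a) for a in (4, 23, None)]
```

Because the reference category has no indicator of its own, each coefficient on these dummies is read as a difference from low-attrition studies.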
Although Head Start is guided by a set of federal performance standards and other regulations,
these have changed over time and may not reflect the experience of participants in all studies. A dummy
variable was coded to indicate whether the program was a “modern” Head Start program, defined as
post-1974, when the first set of Head Start quality guidelines were implemented.11 Although the
majority of programs (75 percent) are older, 44 percent of effect sizes come from studies of modern
Head Start programs.
9. Reports in which there was no mention of control group activity were coded as having missing information on this variable.
10. In theory, we might also want to categorize use of parenting programs or family support programs as active control group
participation; however, studies generally did not report on this type of activity.
In addition, recognizing that the first iteration of Head Start was a shortened, 6 to 8 week
summer program, we also created a continuous variable indicating length of treatment measured in
months, and re-centered at two months, so that the resulting coefficient indicates the effect of receiving a
full academic year of Head Start versus a summer program.
Finally, we created a dummy variable indicating whether the evaluation was an article published
in a peer refereed journal. The reference category is an unpublished report or dissertation, or book chapter.
Statistical Analysis. Our key research question is whether heterogeneity in effect size is predicted
by methodological aspects of the study design in the programs and effect sizes. The nested structure of
the data (effect sizes nested within programs) requires a multivariate, multi-level approach to modeling
these associations (de la Torre, Camilli, Vargas, & Vernon, 2007). The level-1 model (effect size level) is:
(1) $ES_{ij} = \beta_{0i} + \beta_{1i}x_{1ij} + \dots + \beta_{ki}x_{kij} + e_{ij}$
In this equation, the effect size j in program i is modeled as a function of the intercept ($\beta_{0i}$), which
represents the average (covariate adjusted) effect size for all programs; a series of key independent
variables and related coefficients of interest ($\beta_{1i}x_{1ij} + \dots + \beta_{ki}x_{kij}$), which estimate the association
between the effect size and aspects of the study design that vary at the effect size level; and a within-program
error term ($e_{ij}$). Study design covariates at this level include timing of outcome, type of
outcome (rating or observation), whether or not baseline covariates are included, and domain of outcome
(skills more or less sensitive to instruction).
11. In an alternative specification, instead of the modern Head Start variable, we included a continuous variable for the year
each Head Start program was studied, re-centered at 1965, the program’s initial year of operation. This did not qualitatively
change results.
The level-2 equation (program level) models the intercept as a function of the grand mean effect size
($\beta_{00}$), a series of covariates that represent aspects of study design and Head Start features that vary only
at the program level ($\beta_{01i}x_{1i} + \dots + \beta_{0ki}x_{ki}$), and a between-program random error term ($u_i$):
(2) $\beta_{0i} = \beta_{00} + \beta_{01i}x_{1i} + \dots + \beta_{0ki}x_{ki} + u_i$
Study design covariates at this level include type of research design, activity level of control group,
attrition, and whether the effect size came from a peer refereed journal article. Head Start program
feature covariates include length of program and whether the program was implemented post-1974.
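Substituting equation (2) into equation (1) gives the combined form of the two-level model. The expression below is only algebra on the two stated equations, written with the same symbols; note that the source uses the index k for both the effect size-level and the program-level covariates, although the two sets need not be the same size.

```latex
ES_{ij} = \beta_{00}
        + \beta_{01i}x_{1i} + \dots + \beta_{0ki}x_{ki}   % program-level covariates
        + \beta_{1i}x_{1ij} + \dots + \beta_{ki}x_{kij}   % effect size-level covariates
        + u_i + e_{ij}                                    % between- and within-program errors
```

Written this way, the two error terms make the two sources of unexplained variation explicit: $u_i$ is shared by all effect sizes from program i, while $e_{ij}$ is specific to a single effect size.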
This “mixed effects” model assumes that there are two sources of variation in the effect size
distribution, beyond subject-level sampling error: 1) the “fixed” effects of variables that measure key
features of the methods and other covariates; and 2) remaining “random” unmeasured sources of
variation between and within programs.12 To account for differences in effect size estimate precision, as
well as the number of effect sizes within a particular program, all regressions were weighted by the
inverse variance of each effect size multiplied by the inverse of the number of effect sizes per program
(Cooper & Hedges, 2009; Lipsey & Wilson, 2001). Analyses were conducted in SAS, using the PROC
MIXED procedure.
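The weighting scheme described above can be sketched as follows. This is a minimal Python illustration rather than the SAS code used for the analyses; the function name, the record layout, and the example variances are assumptions introduced here.

```python
from collections import Counter

def analysis_weights(records):
    """Compute one weight per effect size.

    records: list of (program_id, sampling_variance) pairs.
    Each weight is the inverse variance of the effect size multiplied by the
    inverse of the number of effect sizes contributed by its program.
    """
    per_program = Counter(program_id for program_id, _ in records)
    return [(1.0 / variance) * (1.0 / per_program[program_id])
            for program_id, variance in records]

# Hypothetical data: program A reports two outcomes, program B reports one
weights = analysis_weights([("A", 0.04), ("A", 0.04), ("B", 0.02)])
```

Dividing by the number of effect sizes per program keeps a program that reports many outcomes from dominating the estimates simply because it contributes more rows.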
We began by entering each design factor independently, and then included all relevant design
covariates at the same time in our primary specification. We also tested several variations of the
primary model specification; for example, we conducted separate analyses including imputed missing
effect sizes, without weights, and excluding the National Head Start Impact Study, the largest study in
our sample. We also tested alternative specifications using a series of dummy variables indicating
outcome measure reliability levels, a more nuanced set of research design variables, a continuous
measure of the date of operation of the Head Start program being studied, an indicator of whether a
randomized study experienced crossovers between control and treatment group, and a continuous
measure of the number of control group children attending center-based childcare. Results of the main
analyses and these robustness checks are discussed below.
12. In our primary specifications, we ignore the third and fourth levels of nesting, contrasts within programs, and programs
within studies, due to the small number of studies with multiple contrasts (N=8) and multiple programs (N=4). In alternative
specifications, we found that clustering at the contrast level or study level, instead of the program level, did not qualitatively
change the results.
Bivariate Results. The results from an “empty model,” with no predictor variables, yield an
intercept (average program-level effect size) of .27, which is significantly different from 0. Keeping this
in mind, we began by exploring the relationships between single design factors and average effect size
using a series of multilevel regressions, the results of which are presented in Table 3.13 Regressions
including categorical variables (Table 3, columns 1-8) were run without intercepts; thus, the resulting
coefficients indicate the average effect size for programs in each category. These results suggest that
studies of modern Head Start programs produce a smaller average effect size (.23) than studies of Head
Start conducted prior to 1975 (.29), although this difference is not statistically significant. This finding
is also potentially complicated by the fact that modern Head Start studies are more likely to have active
control groups, and programs in which the control group actively seeks alternative ECE services
produce a smaller average effect size (.08) than studies with passive control groups (.31, p=.05) or
missing information on this variable (.29, n.s.). Both study design types yield significant positive effect
sizes (quasi-experimental=.26; randomized=.33). Programs in which baseline covariates were used in
analyses yield a smaller average effect size (.20) than those in which covariates were not included (.29),
although this difference is not statistically significant.
13. These bivariate analyses and subsequent multivariate specifications include 241 non-missing effect sizes in 28 programs.
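To see why a regression run without an intercept returns one average per category, consider a simplified single-level weighted regression on mutually exclusive category dummies. Each coefficient then reduces to the weighted mean of the effect sizes in that category. The sketch below illustrates that identity only; it is not the multilevel model estimated in the paper, and the function name and numbers are hypothetical.

```python
def no_intercept_dummy_fit(effect_sizes, categories, weights):
    """Weighted least squares on category dummies with no intercept.

    With mutually exclusive dummies the cross-product matrix is diagonal, so
    each coefficient equals sum(w * y) / sum(w) within its category.
    """
    totals = {}
    for y, category, w in zip(effect_sizes, categories, weights):
        weighted_sum, weight_sum = totals.get(category, (0.0, 0.0))
        totals[category] = (weighted_sum + w * y, weight_sum + w)
    return {category: weighted_sum / weight_sum
            for category, (weighted_sum, weight_sum) in totals.items()}

# Hypothetical effect sizes from one active and two passive control group studies
coefficients = no_intercept_dummy_fit(
    [0.1, 0.3, 0.5], ["active", "passive", "passive"], [1.0, 1.0, 3.0]
)
```

With an intercept included, one category would instead become the reference and the remaining coefficients would be differences from it.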
Looking at different aspects of dependent measures, we find that ratings (.45) and observations
(.55) yield significantly larger effect sizes than performance tests (.24). Consistent with our hypothesis,
measures of skills more sensitive to instruction yield a significantly larger average effect size (.40) than
those measuring more broad cognitive skills (.25). We find significantly positive and similar average
effect sizes for measures with both high and low attrition (.28 and .33, respectively); however, the
average effect size for measures with missing attrition information is smaller and not significant (.14).
Finally, we find that the average effect size from a study published in a peer refereed journal (.43)
appears larger than one produced by an unpublished study or book chapter (.23); this difference is
marginally significant.
Multilevel regressions including continuous measures of research design were run with an
intercept; therefore, we include this estimate in columns 10 and 11 of Table 3 to show the relationship
between an incremental increase in each continuous design variable and average effect size. None of the
coefficients for continuous measures, including length of program in months and months between the
end of treatment and outcome measure, is statistically significant.
While most of these findings are consistent with our initial hypotheses regarding the influence of
various design factors on average effect size, this analytic approach ignores the potentially important
confounding influence of other design variables, and thus might yield biased results. Therefore, in our primary
specification, we included all design variables at once to investigate the independent and comparative
role of each in influencing average effect size.
Multivariate Results. Results from our primary specification are presented in Column 1 of Table
4; coefficients indicate the strength of associations between our independent variables (measuring facets
of research design) and effect sizes (differences between treatment and control groups expressed as a
fraction of a standard deviation). In terms of program and study characteristics, we find that attending a
full academic year of Head Start (10 months) is marginally associated with a .16 standard deviation unit
larger effect than attending a summer Head Start program (2 months). Studies published in peer
refereed journals also tend to yield effect sizes .28 standard deviations larger than those found in
unpublished reports, dissertations, or book chapters, suggesting that publication bias may be
operating in this field. We also find a negative but statistically insignificant
association between effect size and the variable indicating that the Head Start program being evaluated
was in operation in 1975 or later.
Our exploration of program-level research design factors shows a large negative association
between effect size and having an active control group in which families independently seek alternative
services (-.33). Other features of study design are not statistically significant, although most are in the
expected direction. Perhaps most surprisingly, we do not find a significant difference between effect
sizes of quasi-experimental and randomized studies, when other features of study design are held
constant. As expected, higher levels of attrition (and missing information on this variable) are
associated with smaller effect sizes; however, the differences are modest in magnitude and not
statistically significant.
A number of dependent measure characteristics also predict effect size. Compared to
performance tests, both ratings by teachers and parents as well as observational ratings by researchers
yield larger effect sizes (.16 and .32, respectively). As expected, measures of skills more sensitive to
instruction produce larger effect sizes than those for less teachable cognitive skills (.13). Counter to our
hypothesis, however, length of time between treatment and outcome measure is not associated with
smaller effect size.
If one considers a performance test a more reliable measure of children’s skills, compared to
ratings by others, the findings described in the previous paragraph are somewhat surprising. To test the
role of measure reliability more directly, in an alternative specification, we removed the variables
indicating “type” of dependent measure (i.e., rating, observation) and instead included a series of
dummy variables indicating the level of reliability of the outcome measure, based on coded reliability
coefficients.14 Consistent with our primary specification findings, but contrary to our proposed
hypothesis, we found that less reliable measures yield larger effect sizes (See Table 4, Column 2).
Robustness Checks. One concern with our models is that they omit correlates of program design
that might be associated with effect sizes. Our choice of looking only within Head Start programs was
intended to limit the differences in programming that would have been found in a wider set of ECE
programs; however, a possible remaining source of heterogeneity is the demographic make-up of the
sample. Unfortunately, there is a surprisingly large amount of missing data on the demographic
characteristics of the study samples. For example, only 52 percent of effect sizes have information about
the gender composition of the sample, and between 46 and 52 percent have information about its racial
composition. Nevertheless, we
explored whether effect sizes might be predicted by the gender and by the racial composition of the
sample (percent boys versus girls; percent black, Hispanic, White). Bivariate analyses suggested that
effect sizes were not significantly predicted by these characteristics, nor were they predictive in our
multivariate models. Given these null findings, the large percentage of missing data for these variables,
and the limited statistical power in our multivariate analyses, we opted to leave these variables out of our
multivariate models.
14. Categories were constructed based on quartile scores and defined as follows: high reliability (reference category), greater
than or equal to .92 (representing quartile 4); medium reliability, .91 to .75, (representing quartiles 2 and 3); low reliability,
less than .75, (representing quartile 1). Our preference was to code any reliability coefficient provided for the specific study
population; however, this information was rarely reported. If no coefficient was provided in the report, we attempted to find
a reliability estimate from test manuals or another study. Any available type of reliability coefficient was recorded, although
most were measures of internal consistency (Cronbach’s alpha). Because of this variability in source information and
coefficient type, and the fact that we were still left with missing reliability coefficients for 38 percent of our effect sizes, we
offer these results with caution.
We undertook several additional analyses to determine the sensitivity of our findings to
alternative model specifications. Most importantly, in some cases, authors reported that groups were
compared on particular tests, but did not report the results of these tests, or did not provide enough
numerical information to compute an effect size. Leaving out these “missing” effect sizes (N=72) could
upwardly bias our average effect size estimate, since we know that unreported effect sizes are more
likely to be smaller and statistically insignificant (Pigott, 2009). We imputed these missing effect sizes
four ways and report results in Table 5. Our baseline results (Table 5, column 1) assumed that if precise
magnitudes of differences were not reported, and there was no indication of whether differences were
significant, there was no difference between the groups (g=0). If the results were reported as significant,
we assumed that they were significant at the p=.05 level.
We also checked the robustness of these results to different assumptions about the magnitude of
the effects in these cases. We ran three alternative specifications. The first assumed that if an author
reported which group was favored for a particular test, but not whether the effect size was statistically
significant, the differences approached marginal significance (p=.11). If authors did not report that
either group was favored, we again assumed that g=0 (Table 5, column 2). The second scenario assumed
that all results favored the treatment group as much as possible, consistent with author reports of which
group actually fared better (p=.11 if treatment group was favored or if there was no indication of which
group was favored; p=.99 otherwise; Table 5, column 3). Conversely, the third scenario assumed that all
results favored the control group as much as possible, consistent with author reports of which group
actually performed better (p=.11 if treatment group was favored; p=.99 otherwise; Table 5, column 4).
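The source does not spell out how an assumed p-value was turned into an effect size. One common conversion for a two-group comparison is sketched below, using a normal approximation to the test statistic and no small-sample correction. The function name, its arguments, and the group sizes are assumptions made for this illustration.

```python
import math
from statistics import NormalDist

def impute_effect_size(p_value, n_t, n_c, favors_treatment=True):
    """Back out a standardized mean difference from a two-tailed p-value.

    Assumes the test statistic is approximately standard normal, so that
    d is roughly z * sqrt((n_t + n_c) / (n_t * n_c)).
    """
    z = NormalDist().inv_cdf(1 - p_value / 2)
    d = z * math.sqrt((n_t + n_c) / (n_t * n_c))
    # The sign follows the group the report says was favored
    return d if favors_treatment else -d

# Hypothetical contrast with 50 children per group under three assumed p-values
significant = impute_effect_size(0.05, 50, 50)
marginal = impute_effect_size(0.11, 50, 50)
near_zero = impute_effect_size(0.99, 50, 50, favors_treatment=False)
```

Under this conversion an assumed p=.99 yields an effect size very close to zero, while p=.11 yields a value somewhat below the one implied by p=.05 for the same sample sizes.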
As demonstrated in Table 5, including the imputed effect sizes did not yield substantive changes
in our coefficient estimates, regardless of the imputation assumptions.15 We therefore conclude that
excluding these missing effect sizes from our primary analyses is unlikely to introduce bias in our results.
A second concern is that results were being driven primarily by our inclusion of the National
Head Start Impact Study, which includes 40 effect sizes, and is heavily weighted due to its large sample
size. When we excluded this study from our analysis, however, we again obtained results qualitatively
similar to those from our primary specification (results available upon request). The magnitude of most
coefficients stayed the same, although due to loss of statistical power, some of them became statistically
insignificant.16 These findings suggest that the relationships between effect sizes and research design
factors are not strongly influenced by the specific findings from the National Head Start Impact Study.
We also found qualitatively similar results for unweighted analyses, suggesting that studies with larger
samples are not driving our findings either.
Recognizing that not all quasi-experimental designs are equally rigorous, we also tested a more
nuanced set of research design indicators, adding to our original design variables separate categories for
studies with true quasi-experimental designs (matching on outcomes or demographics and change
models) and those that were included due only to having baseline comparable treatment and control
groups. We also added an indicator for studies in which the design was changed post-hoc (i.e., it was
originally randomized, but for various reasons, became a quasi-experimental study). Again, our findings
remained robust; none of the design indicator variables was statistically significant (compared to the
reference category, randomization), with other design features included in the model.
In another specification, we also included a variable indicating the presence of crossovers
between the control and treatment groups (situations in which control group members attended Head
Start programs or treatment group members did not), which could potentially reduce effect size
independent of control group activity level.17 The crossover coefficient was not significant, and its
inclusion did not change the pattern of results for other variables in the analysis.
15. These analyses included 313 effect sizes nested in 40 contrasts.
16. Specifically, the coefficient for rating (.09) was no longer significant, and the coefficient for observation (.28) became only
marginally significant.
Finally, we explored whether a more specific measure of the activity level of the control group
would predict effect sizes. Active control groups were found in 4 programs, and in these programs
between 19 and 48 percent of the children in these groups received center-based child care. As an
alternative measure of control group activity, we also tried including a continuous variable capturing the
percent of control group children in center-based care. As expected, we found that this measure
significantly predicted effect sizes, with an additional percent of the control group experiencing center-based
child care corresponding to an effect size that is .005 smaller (p<.05).
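To give a sense of scale for this coefficient, the short sketch below applies it linearly across the range of control group participation reported for the active control group programs. The linear reading and the function name are assumptions of this illustration; the coefficient is the one reported above.

```python
COEF_PER_PERCENT = -0.005  # reported change in effect size per additional percent

def predicted_shift(percent_in_center_care):
    """Predicted change in effect size relative to a control group with no center-based care."""
    return COEF_PER_PERCENT * percent_in_center_care

# Lower and upper ends of the range reported for active control groups
shift_low = predicted_shift(19)
shift_high = predicted_shift(48)
```

Read this way, a control group with 48 percent of children in center-based care would be expected to show an effect size about .24 smaller than one in which no control children received such care.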
This study contributes to the field of ECE research by using a
unique, new meta-analytic database to estimate the overall average effect of Head Start on children’s
cognitive skills and achievement, and by exploring the role of methodological factors in explaining variation
in those effect sizes.
Overall, we found a statistically significant average effect size of .27, suggesting that Head Start
is generally effective in improving children’s short-term (less than one year post-treatment) cognitive
and achievement outcomes. This is a somewhat smaller effect on short term cognitive outcomes than
those found in the previous meta-analysis of Head Start conducted by McKey et al. (1985), but
somewhat larger than those reported in the first year findings from the recent national Head Start Impact
Study (US DHHS, 2005). The .27 estimate is also within the range of the overall average effect sizes on
cognitive outcomes found in Camilli et al. (2010) measured across various ECE models, and in Wong et
al. (2008), measured in state prekindergarten programs, but somewhat smaller than the short-term
cognitive effect sizes found in meta-analyses of more intensive programs with longitudinal follow-ups
(Gorey, 2001; Nelson et al., 2003). These comparisons suggest that Head Start program effects
on children’s cognitive and achievement outcomes are on par with the effects of other general early
education programs.
17. For example, in the National Head Start Impact Study, approximately 18 percent of 3-year-olds and 13 percent of 4-year-olds
in the control groups received Head Start services.
We find that, indeed, several design factors significantly predict the heterogeneity of effect sizes,
and these factors account for approximately 41 percent of the explainable variation between evaluation
findings and 11 percent of the explainable variation within evaluation findings. This information can be
used by researchers and policy makers to become better consumers and designers of Head Start and
other ECE evaluations, and thus facilitate better policy and program development.
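The source does not state how these percentages were derived. A common approach in multilevel models is the proportional reduction in a variance component between the empty model and the model with predictors, sketched below. The function name and the variance values are hypothetical and were chosen only so that the example reproduces the 41 percent figure.

```python
def proportion_explained(var_empty, var_full):
    """Proportional reduction in a variance component after adding predictors."""
    if var_empty <= 0:
        raise ValueError("the empty-model variance component must be positive")
    return (var_empty - var_full) / var_empty

# Hypothetical between-program variance components (not values from the paper)
between_share = proportion_explained(0.100, 0.059)
```

The same calculation applied to the within-program variance component would give the within-program share.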
One of our substantively largest findings is that having an active control group, one in which
children experienced other forms of center-based care, is associated with much smaller effect sizes
(-.33) than those produced by studies in which the control group is “passive” (i.e., receives no alternative
services). Given that a variety of models of ECE programs have been shown to increase children’s
cognitive skills and achievement (Gormley, Phillips, & Gayer, 2008; Henry, Gordon, & Rickman, 2006),
it is perhaps not surprising that effect sizes for studies in which a significant portion of the control group
is receiving alternative ECE services are smaller than those produced by studies in which control group
children receive no ECE services.
The nature of the counterfactual, then, is important to consider. For example, in today’s policy
context, in which almost 70 percent of 4-year-olds and 40 percent of 3-year-olds attend some form of
ECE (Cook, 2006), it is probably not reasonable to expect the same kinds of large effect sizes produced
by older model programs studied when few alternative ECE options were available (e.g., Perry
Preschool, Schweinhart et al. 2005; and Abecedarian, Ramey & Campbell, 1984). More generally,
comparisons between Head Start studies with low rates of control group activity and those with higher rates should
be approached with caution. The same may also be true for studies of other forms of ECE, although it is
unclear if our findings will generalize beyond Head Start studies. If these findings are replicated with
other types of ECE studies, this would suggest that it is not reasonable to compare effects from, for example,
the National Head Start Impact Study, which had relatively high rates of center-care attendance in the
control group, with recent regression discontinuity design (RDD) evaluations of state pre-kindergarten
(pre-k) programs that have lower levels of such attendance in the control group (Cook, 2006). Our
findings suggest that instead of asking whether Head Start is effective at all, we must ask how effective
Head Start is compared to the range of other ECE options available.
Another important finding is that the type of dependent measure used by the researcher may be
systematically related to effect size, and must be considered when interpreting evaluation results.
Consistent with previous research, we find that achievement-based skills such as early reading, early
math, and letter recognition skills appear to be more sensitive to Head Start attendance than cognitive
skills such as IQ, vocabulary, and attention, which are less sensitive to classroom
instruction (Christian et al., 2000; Wong et al., 2008). This finding has important implications for
designers and evaluators of early intervention programs; namely, that expectations for effects on
omnibus measures such as vocabulary or IQ should be lowered. At minimum, these sets of skills should
be tested and considered separately.
Our finding that less reliable dependent measures yield larger effect sizes also warrants caution
when interpreting effect sizes from studies without first considering the quality of the measures. Non-standardized
measures developed by researchers may tap into behaviors that are among the most directly
targeted by the intervention services; therefore, it is not surprising that such measures tend to yield
larger effect sizes. Ratings by parents, teachers, and researchers may also be subject to bias, however,
because these individuals are likely to be aware of the children’s participation in Head Start as well as
the study purpose.
We also find that effect sizes from studies published in peer refereed journals are larger than
those found in non-published reports and book chapters. While research published in peer-refereed
journals may be more rigorous than that found in non-published sources, this result may also be a sign of
the “file drawer” problem (i.e., that negative or null findings are less likely to be published) long
lamented by meta-analysts (Lipsey & Wilson, 2001). This finding suggests that meta-analysts must be
exhaustive in their searches for both published and unpublished (“grey”) literature, and should carefully
code information regarding study quality (Rothstein & Hopewell, 2009).
A somewhat surprising finding from the current study is that type of overall design (e.g.,
randomized vs. quasi-experimental) did not predict effect size. We remind the reader, however, that our
inclusion criteria regarding study design were typically more rigorous than those of previous meta-analyses of
ECE programs. By limiting our study sample in this way, we give up some of the variation in design
that might indeed predict effect sizes. Nevertheless, these findings are in alignment with recent research
suggesting that in certain circumstances, rigorous quasi-experimental methods can produce causal
estimates similar to those produced by randomized controlled trials (Cook, Shadish, & Wong, 2008),
and further support the use of such methods to evaluate programs when randomized controlled trials are
not feasible, as is often the case in education research (Schneider et al., 2005).
We also predicted that attrition and time between intervention and outcome measure would be
negatively associated with effect size; however, neither factor was statistically significant. The fact that
the range of each measure was truncated in this study (attrition to less than 50 percent and post-treatment
outcome measure timing to 12 or fewer months) may explain this lack of findings. Whether
similar relationships are found between methodological factors and effect sizes for studies with long-term
outcomes is a question for future research. Timing of study (pre- or post-implementation of the
first Head Start quality guidelines in 1974) also did not predict effect size, nor did a continuous measure
of the year the study was conducted. Future research using measures that are better able to discern between
program quality and other factors that may be associated with the historical context of the study is
needed.
In addition to the limitations noted above, we offer our results with a few other caveats. We
recognize that variation in research design is naturally occurring; therefore, such variation may be
correlated with other unmeasured aspects of Head Start studies that we were not able to capture.
Furthermore, although our multilevel models account for the nesting of effect sizes within programs,
there were additional sources of non-independence that we were simply unable to model. Nevertheless,
we believe meta-analysis to be a useful and robust method to explore our research questions.
In sum, this study makes an important contribution to the field of ECE research, in that we are
able to empirically and comparatively test which research design characteristics explain heterogeneity in
effect sizes in Head Start evaluations. We find that several facets of research design explain some of the
variation in Head Start participants’ short-term cognitive and achievement outcomes; therefore,
consumers of ECE evaluations must ask themselves a series of design-related questions when
interpreting results across evaluations with differing methods. What is the control counterfactual?
What skills are being measured? How reliable are the measures being used? What is the source of
information? By becoming more critical consumers and designers of such research, we can improve
ECE services and realize their full potential as an intervention strategy for improving children’s life
References
Balk, E. M., Bonis, P. A., Moskowitz, H., Schmid, C. H., Ioannidis, J. P., Wang, C., & Lau, J. (2002).
Correlation of quality measures with estimates of treatment effect in meta-analyses of
randomized controlled trials. Journal of the American Medical Association, 287(22), 2973-2982.
Besharov, D. J., & Higney, C. A. (2007). Head Start: Mend it, don’t expand it (yet). Journal of
Policy Analysis and Management, 26(3), 678-681.
Borenstein, M., Hedges, L., Higgins, J., & Rothstein, H. (2005). Comprehensive Meta-analysis, Version
2. Englewood, NJ: Biostat.
Camilli, G., Vargas, S., Ryan, S., & Barnett, W. S. (2010). Meta-analysis of the effects of early
education interventions on cognitive and social development. Teachers College Record, 112(3).
Christian, K., Morrison, F. J., Frazier, J. A., & Massetti, G. (2000). Specificity in the nature and timing
of cognitive growth in kindergarten and first grade. Journal of Cognition and Development, 1(4),
Cook, T. (2006). What works in publicly funded pre-kindergarten education? Retrieved March 23, 2010
Cook, T. D., Shadish, W. R., & Wong, V. C. (2008). Three conditions under which experiments and
observational studies produce comparable causal estimates: New findings from within-study
comparisons. Journal of Policy Analysis and Management, 27(4), 724-750.
Cook, T. D., & Wong, V. C. (2007). The warrant for universal pre-k: Can several thin reeds make a
strong policy boat? Social Policy Report, 21(3), 14-15.
Cooper, H., & Hedges, L. V. (2009). Research synthesis as a scientific process. In H. Cooper, L. V.
Hedges, & J. C. Valentine (Eds.), The handbook of research synthesis and meta-analysis, 2nd
edition, (pp. 3-17). New York: Russell Sage Foundation.
Currie, J., & Thomas, D. (1995). Does Head Start make a difference? American Economic Review, 85,
De la Torre, J., Camilli, G., Vargas, S., & Vernon, R. F. (2007). Illustration of a multilevel model for
meta-analysis. Measurement and Evaluation in Counseling and Development, 40, 169-180.
Deming, D. (2009). Early childhood intervention and life-cycle skill development: Evidence from
Head Start. American Economic Journal: Applied Economics, 1(3), 111-34.
Garces, E., Thomas, D., & Currie, J. (2002). Longer term effects of Head Start. The American
Economic Review, 92, 999-1012.
Gorey, K. M. (2001). Early childhood education: A meta-analytic affirmation of the short- and long-term benefits of educational opportunity. School Psychology Quarterly, 16(1), 9–30.
Gormley, W. T., Jr., Phillips, D., & Gayer, T. (2008). Preschool programs can boost school
readiness. Science, 320, 1723-24.
Henry, G. T., Gordon, C. S., & Rickman, D. K. (2006). Early education policy alternatives: Comparing
quality and outcomes of Head Start and state prekindergarten. Educational
Evaluation and Policy Analysis, 28, 77-99.
Hoyt, W. T., & Kerns, M-D. (1999). Magnitude and moderators of bias in observer ratings: A meta-analysis. Psychological Methods, 4(4), 403-424.
Jacob, R. T., Creps, C. L., & Boulay, B. (2004). Meta-analysis of research and evaluation studies in
early childhood education. Cambridge, MA: Abt Associates Inc.
Layzer, J. I., Goodson, B. D., Bernstein, L., & Price, C. (2001). National evaluation of family support
programs, volume A: The meta-analysis, final report. Cambridge, MA: Abt Associates Inc.
Lipsey, M. W., & Wilson, D. B. (2001). Practical meta-analysis. Thousand Oaks, CA: Sage
Ludwig, J., & Miller, D. L. (2007). Does Head Start improve children’s life chances? Evidence from a
regression discontinuity design. Quarterly Journal of Economics, 122, 159-208.
Ludwig, J., & Phillips, D. (2007). The benefits and costs of Head Start. Social Policy Report, 21(3), 3-18.
Magnuson, K., & Shager, H. (2010). Early education: Progress and promise for children from low-income families. Children and Youth Services Review, 32(9), 1186-1198.
McGroder, S. (1990). Head Start: What do we know about what works? Report prepared for the U. S.
Office of Management and Budget. Retrieved May 3, 2010, from
McKey, R. H., Condelli, L., Ganson, H., Barrett, B. J., McConkey, C., & Plantz, M. C. (1985). The
impact of Head Start on children, families and communities: Final report of the Head Start
Evaluation, Synthesis and Utilization Project. Washington, D. C.: CSR, Incorporated.
Moher, D., Pham, B., Jones, A., Cook, D. J., Jadad, A. R., Moher, M., Tugwell, P., & Klassen, T. P.
(1998). Does quality of reports of randomized trials affect estimates of intervention efficacy
reported in meta-analyses? The Lancet, 352, 609-613.
Nathan, R. P. (Ed.) (2007). How should we read the evidence about Head Start?: Three views. Journal
of Policy Analysis and Management, 26(3), 673-689.
National Forum on Early Childhood Policy and Programs. (2010). Understanding the Head Start Impact
Study. Retrieved January 9, 2012, from
Nelson, G., Westhues, A., & MacLeod, J. (2003). A meta-analysis of longitudinal research on
preschool prevention programs for children. Prevention and Treatment, 6, 1–34.
Pigott, T. (2009). Handling missing data. In H. Cooper, L. V. Hedges, & J. C. Valentine (Eds.), The
handbook of research synthesis and meta-analysis, 2nd edition, (pp. 399-416). New York: Russell
Sage Foundation.
Ramey, C. T. & Campbell, F. A. (1984). Preventive education for high-risk children: Cognitive
consequences of the Carolina Abecedarian Project. American Journal of Mental Deficiency,
88(5), 515–23.
Rosenshine, B. (2010, March 6). Researcher-developed tests and standardized tests: A review of ten
meta-analyses. Paper presented at the Society for Research on Educational Effectiveness
Conference, Washington, D. C.
Rothstein, H. R., & Hopewell, S. (2009). Grey literature. In H. Cooper, L. V. Hedges, & J. C. Valentine
(Eds.), The handbook of research synthesis and meta-analysis, 2nd edition, (pp. 103-125). New
York: Russell Sage Foundation.
Schulz, K. F., Chalmers, I., Hayes, R. J., & Altman, D. G. (1995). Empirical evidence of bias:
Dimensions of methodological quality associated with estimates of treatment effects in
controlled trials. Journal of the American Medical Association, 273(5), 408-412.
Schweinhart, L. J., Montie, J., Xiang, Z., Barnett, W. S., Belfield, C. R., & Nores, M. (2005).
Lifetime effects: The High/Scope Perry Preschool study through age 40. (Monographs of the
High/Scope Educational Research Foundation, 14). Ypsilanti, MI: High/Scope Press.
Shonkoff, J. P., & Phillips, D. (2000). From neurons to neighborhoods: The science of early
childhood development. Washington, DC: National Academy Press.
Sweet, M. A., & Appelbaum, M. I. (2004). Is home visiting an effective strategy? A meta-analytic
review of home visiting programs for families with young children. Child Development, 75(5),
U.S. Department of Health and Human Services, Administration for Children and Families (January
2010). Head Start impact study: Final report. Washington, DC.
U.S. Department of Health and Human Services, Administration for Children and Families,
Office of Head Start. (2010). Head Start program fact sheet. Retrieved July 1, 2011, from
U.S. Department of Health and Human Services, Administration for Children and Families,
Office of Head Start. (2010). Head Start performance standards & other regulations. Retrieved
April 26, 2010, from
U.S. Department of Health and Human Services, Administration for Children and Families. (May 2005).
Head Start impact study: First year findings. Washington, DC.
United States General Accounting Office. (1997). Head Start: Research provides little information on
impact of current program. GAO/HEHS-97-59. Washington, D. C.: U. S. General Accounting Office.
Westinghouse Learning Corporation. (1969). The impact of Head Start: An evaluation of the effects of
Head Start on children's cognitive and affective development. In: A report presented to the Office
of Economic Opportunity, Ohio University (1969) (Distributed by Clearinghouse for Federal
Scientific and Technical Information, U.S. Department of Commerce, National Bureau of
Standards, Institute for Applied Technology. PB 184328).
Wong, V. C., Cook, T. D., Barnett, W. S. & Jung, K. (2008). An effectiveness-based evaluation
of five state pre-kindergarten programs. Journal of Policy Analysis and Management, 27, 122-154.
Zhai, F., Brooks-Gunn, J., & Waldfogel, J. (2010, March 6). Head Start and urban children's school
readiness: A birth cohort study in 18 cities. Paper presented at the Society for Research on
Educational Effectiveness Conference, Washington, D. C.
Zigler, E., & Valentine, J. (Eds.) (1979). Project Head Start: A legacy of the war on poverty. New York:
The Free Press.
Table 1: Key Meta-Analysis Terms and Ns
Written evaluation of Head Start (e.g., a journal article, government report, book chapter)
Collection of comparisons in which the treatment groups are drawn from the same pool of subjects
Particular type of Head Start or Head Start within a particular location
Comparison between one group of children who received Head Start and another group of children who received no other services as a result of the study (although they may have sought services themselves)
Effect Size: Measure of the difference in cognitive outcomes between the children who experienced Head Start and those who did not, expressed in standard deviation units (Hedges’ g)
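As an illustration of the effect size definition in Table 1, the following is a minimal sketch of a Hedges' g computation from group summary statistics. The function name, argument names, and the example values are hypothetical, and the small-sample correction uses the common approximation J = 1 - 3/(4df - 1); the paper does not specify which computational formulas were applied to each study.

```python
import math


def hedges_g(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """Return (g, variance of g) for a treatment-control contrast.

    Hypothetical helper: inputs are group means, standard deviations,
    and sample sizes for the treatment (t) and control (c) groups.
    """
    df = n_t + n_c - 2
    # Pooled standard deviation across the two groups.
    pooled_sd = math.sqrt(((n_t - 1) * sd_t ** 2 + (n_c - 1) * sd_c ** 2) / df)
    # Standardized mean difference before the small-sample correction.
    d = (mean_t - mean_c) / pooled_sd
    # Approximate small-sample correction factor.
    j = 1 - 3 / (4 * df - 1)
    # Large-sample approximation to the sampling variance of d.
    var_d = (n_t + n_c) / (n_t * n_c) + d ** 2 / (2 * (n_t + n_c))
    return j * d, j ** 2 * var_d


# Illustrative (made-up) values: a 5-point difference on a test with SD 15.
g, var_g = hedges_g(105.0, 100.0, 15.0, 15.0, 50, 50)
print(round(g, 3), round(var_g, 4))
```

The returned variance is what an inverse-variance weighting scheme would use, so the two values are computed together.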
Table 2: Descriptive Information for Effect Sizes and Independent Variables
Study and Program Characteristics
Modern Head Start program (post-1974), effect size level
Modern Head Start program (post-1974), program level
Length of program (months, centered at 2), effect size level
Length of program (months, centered at 2), program level
Peer refereed journal, effect size level
Peer refereed journal, program level
Design Characteristics
Active control group, effect size level
Active control group, program level
Passive control group, effect size level
Passive control group, program level
Missing control group activity, effect size level
Missing control group activity, program level
Randomized controlled trial, effect size level
Randomized controlled trial, program level
Quasi-experimental study, effect size level
Quasi-experimental study, program level
Baseline covariates included, effect size level
Baseline covariates included, program level
Dependent Measure Characteristics
Rating (by someone who knows child)
Observation (by researcher)
Performance measure
Skills sensitive to instruction
Skills not sensitive to instruction
Months post-treatment
Attrition (always <50%)
High attrition (>10%)
Low attrition (<=10%)
Missing attrition information
High reliability (coefficient >=.92)
Medium reliability (coefficient =.75-.91)
Low reliability (coefficient <.75)
Missing reliability coefficient
Effect size
Max. Mean
Descriptive information for non-missing effect sizes (N=241), weighted by inverse variance of effect
size multiplied by inverse of number of effect sizes per program.
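The weighting described in the note to Table 2 can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the function names and the three example effect sizes are hypothetical, and each weight is simply the inverse variance of the effect size multiplied by the inverse of the number of effect sizes contributed by its program.

```python
from collections import Counter


def descriptive_weights(variances, program_ids):
    """Weight = (1 / variance) * (1 / number of effect sizes in the program)."""
    counts = Counter(program_ids)
    return [(1.0 / v) * (1.0 / counts[p]) for v, p in zip(variances, program_ids)]


def weighted_mean(values, weights):
    """Weighted average of the values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Made-up example: program "A" contributes two effect sizes, program "B" one.
effect_sizes = [0.2, 0.4, 0.3]
variances = [0.01, 0.01, 0.02]
programs = ["A", "A", "B"]

weights = descriptive_weights(variances, programs)
print(weights, weighted_mean(effect_sizes, weights))
```

Dividing by the number of effect sizes per program keeps a program with many reported outcomes from dominating the descriptive statistics.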
Table 3: Summary of Results from Regressions of Head Start Evaluation Effect Sizes on Single
Research Design Factors
Modern HS (post-1974)
Not Modern HS (pre-1975)
Active control group
Passive control group
Missing control group activity
Randomized controlled trial
Quasi-experimental study
Rating (by adult who knows child)
Observation (by researcher)
Performance test
Skills sensitive to instruction
Skills not as sensitive to instruction
Skills not sensitive**
Skills sensitive**
High attrition (>10%)
Low attrition (<=10%)
Missing attrition information
Multilevel models were estimated with N=241 effect sizes nested in 28 programs; **=p<.001, *=p<.05, t=p<.10.
For columns 1-6 and 8-9 no intercept was estimated; therefore, the resulting coefficients represent the
average effect size for programs in each category. Columns 7 and 12 list within-factor categorical means
that are statistically significantly different from the indicated category. Multilevel regressions with continuous
measures of research design were run with an intercept; therefore, estimates in columns 10 and 11 show the
relationship between an incremental increase in each continuous design variable and average effect size.
Table 3 Continued: Summary of Results from Regressions of Head Start Evaluation Effect Sizes
on Research Design Factors
Baseline covariates included
Baseline covariates not included
Peer refereed journal
Not peer refereed journal
Length of program (months, centered at 2)
Months post-treatment
Not peer refereed^t
Peer refereed^t
.21* .28**
(.07) (.04)
Multilevel models were estimated with N=241 effect sizes nested in 28 programs; **=p<.001, *=p<.05,
t=p<.10. For columns 1-6 and 8-9 no intercept was estimated; therefore, the resulting coefficients
represent the average effect size for programs in each category. Columns 7 and 12 list within-factor
categorical means that are statistically significantly different from the indicated category. Multilevel
regressions with continuous measures of research design were run with an intercept; therefore, estimates
in columns 10 and 11 show the relationship between an incremental increase in each continuous
design variable and average effect size.
Table 4: Summary of Results from Multivariate Regressions of Head Start Evaluation
Effect Sizes on Multiple Research Design Factors
Modern Head Start program (post-1974)
Length of program (months, centered at 2)
Peer refereed journal
Active control group
Missing control group activity
Quasi-experimental study
Baseline covariates included
High attrition (>10%)
Missing attrition information
Rating (by someone who knows child)
Observation (by researcher)
Skills sensitive to instruction
Months post-treatment
Medium reliability (coefficient = .75-.91)
Low reliability (coefficient <.75)
Missing reliability coefficient
Multilevel models were estimated with N=241 effect sizes nested in 28 programs; **=p<.001,
*=p<.05, t=p<.10.
Table 5: Summary of Results from Regressions of Head Start Evaluation Effect Sizes on Multiple
Research Design Factors, Including Imputed Missing Effect Sizes
Modern Head Start program (post-1974)
Length of program (months, centered at 2)
Peer refereed journal
Active control group
Missing control group activity
Quasi-experimental study
Baseline covariates included
High attrition (>10%)
Missing attrition information
Rating (by someone who knows the child)
Observation (by researcher)
Skills sensitive to instruction
Months post-treatment
Multilevel model estimated with N=313 effect sizes (ESs) nested in 33 programs; **=p<.001, *=p<.05,
t=p<.10. We imputed 72 ESs, based on the following assumptions: Column 1: If there was no report of the
magnitudes of ESs or whether ESs were significant, we assumed that there was no difference between the
groups (g=0). If the ESs were reported as statistically significant, we assumed p=.05. Column 2: If an author
reported the direction of the ESs (which group was favored) but not whether ESs were statistically significant,
we assumed the ES approached marginal significance (p=.11). If authors did not report that either group was
favored, we again assumed that g=0. Column 3: We assumed that all ESs favored the treatment group as much
as possible (p=.11 if treatment group was favored or if there was no indication of which group was favored;
p=.99 otherwise). Column 4: We assumed that all ESs favored the control group as much as possible (p=.11 if
treatment group was favored; p=.99 otherwise).
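The imputation rules in the note to Table 5 assign an assumed p-value (for example, p=.05 or p=.11) to effect sizes that were not fully reported. A minimal sketch of how an assumed two-tailed p-value could be turned into a standardized mean difference is given below. It rests on assumptions not stated in the paper: the function name and example group sizes are hypothetical, and a normal approximation to the test statistic is used in place of the t distribution, which a full analysis would more likely use.

```python
import math
from statistics import NormalDist


def effect_size_from_p(p_two_tailed, n_t, n_c, favors_treatment=True):
    """Approximate standardized mean difference implied by a two-tailed p-value.

    Hypothetical helper using a normal approximation: the test statistic z
    is recovered from p, then scaled by sqrt(1/n_t + 1/n_c).
    """
    z = NormalDist().inv_cdf(1 - p_two_tailed / 2)
    d = z * math.sqrt(1 / n_t + 1 / n_c)
    # The sign records which group was favored.
    return d if favors_treatment else -d


# Made-up group sizes of 50 children per group.
print(effect_size_from_p(0.05, 50, 50))          # reported as significant
print(effect_size_from_p(0.11, 50, 50))          # direction reported only
print(effect_size_from_p(0.11, 50, 50, False))   # control group favored
```

Under this sketch, a larger assumed p-value yields a smaller imputed effect size for the same group sizes, which is why the p=.11 and p=.99 assumptions bound the imputed values from above and below.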
Appendix A: Database References
Abbott-Shim, M., Lambert, R., & McCarty, F. (2000). A study of Head Start effectiveness using a
randomized design. Paper presented at the Fifth Head Start National Research Conference, Washington, DC.
Abbott-Shim, M., Lambert, R., & McCarty, F. (2003). A comparison of school readiness outcomes for
children randomly assigned to a Head Start program and the program's wait list. Journal of Education
for Students Placed at Risk, 8, 191-214.
Abelson, W.D. (1974). Head Start graduates in school: Studies in New Haven, Connecticut. In S. Ryan
(Ed.), A report on longitudinal evaluations of preschool programs: Volume 1 (1-14). Washington, DC:
Office of Child Development, US Department of Health, Education and Welfare.
Allerhand, M. (1965). Impact of summer 1965 Head Start on children's concept attainment during
kindergarten. Cleveland, OH: Western Reserve University.
Allerhand, M. (1967). Effectiveness of parents of Headstart children as administrators of psychological
tests. Journal of Consulting Psychology, 31, 286-290.
Allerhand, M., Gaines, E., & Sterioff, S. (1966). Headstart follow-up study manual (rating concept
attainment). Cleveland, OH: Western Reserve University.
Allerhand, M.E. (1966). Headstart operational field analysis: Progress report III. Cleveland, OH:
Western Reserve University.
Allerhand, M.E. (1966). Headstart operational field analysis: Progress report II. Cleveland, OH:
Western Reserve University.
Allerhand, M.E. (1966). Headstart operational field analysis: Progress report I. Cleveland, OH:
Western Reserve University.
Allerhand, M.E. (1966). Headstart operational field analysis: Progress report IV. Cleveland, OH:
Western Reserve University.
Arenas, S., & Trujillo, L.A. (1982). A success story: The evaluation of four Head Start bilingual
multicultural curriculum models. Denver, CO: InterAmerica Research Associates.
Barnow, B.S., & Cain, G.G. (1977). A reanalysis of the effect of Head Start on cognitive development:
Methodology and empirical findings. The Journal of Human Resources, 12, 177-197.
Bereiter, C., & Engelmann, S. (1966). Teaching disadvantaged children in the preschool. Englewood
Cliffs, NJ: Prentice-Hall, Inc.
Bridgeman, B., & Shipman, V.C. (1975). Predictive value of measures of self-esteem and achievement
motivation in four- to nine-year-old low income children. Disadvantaged children and their first school
experiences: ETS-Head Start Longitudinal study. Princeton, NJ: Educational Testing Service.
Chesterfield, R., & Chaves, R. (1982). An evaluation of the Head Start Bilingual Bicultural Curriculum
Development Project. Technical reports, Vols. I and II, executive summary. Los Angeles, CA: Juarez
and Associates.
Chesterfield, R., et al. (1979). Pilot study results and child assessment measures. Los Angeles, CA:
Juarez and Associates.
Chesterfield, R., et al. (1979). An evaluation of the Head Start Bilingual Bicultural Curriculum
Development Project. Report of the pilot study results and the training of fieldworkers for the
ethnographic/observational component. Los Angeles, CA: Juarez and Associates.
Chesterfield, R., et al. (1982). An Evaluation of the Head Start Bilingual Bicultural Curriculum
Development Project: Final Report. Los Angeles, CA: Juarez and Associates.
Cicirelli, V.G., Cooper, W.H., & Granger, R.L. (1969). The impact of Head Start: An evaluation of the
effects of Head Start on children’s cognitive and affective development. Volume 2. Office of Economic
Opportunity. Athens, OH: Westinghouse Learning Corporation and Ohio University.
Cicirelli, V.G., Cooper, W.H., & Granger, R.L. (1969). The impact of Head Start: An evaluation of the
effects of Head Start on children’s cognitive and affective development. Volume 1, text and appendices A
to E. Office of Economic Opportunity. Athens, OH: Westinghouse Learning Corporation and Ohio
University.
Cline, M.G., & Dickey, M. (1968). An evaluation and follow-up study of summer 1966 Head Start
children in Washington, DC. Washington, DC: Howard University.
Engelmann, S. (1969). Preventing failure in the primary grades. New York: Simon and Schuster.
Engelmann, S., & Osborn, J. (1976). Distar Language I: An instructional system. Teacher's guide, 2nd
edition. Chicago: Science Research Associates, Inc.
Engelmann, S., & Carnine, D. (1975). Distar Arithmetic I, 2nd edition. Chicago: Science Research
Associates, Inc.
Erickson, E.L., McMillan, J., Bonnell, J., Hofman, L., & Callahan, O.D. (1969). Experiments in Head
Start and early education: The effects of teacher attitude and curriculum structure on preschool
disadvantaged children. Washington, DC: Office of Economic Opportunity.
Esteban, M.D. (1987). A comparison of Head Start and non-Head Start reading readiness scores of low-income kindergarten children of Guam. Ann Arbor, MI: UMI Dissertation Services.
Henderson, R.W., Rankin, R.J., & Frosbisher, M.W. (1969). Positive effects of a bicultural preschool
program on the intellectual performance of Mexican-American children. Paper presented at the annual
meeting of the American Educational Research Association, Los Angeles.
Hodes, M.R. (1966). An assessment and comparison of selected characteristics among culturally
disadvantaged kindergarten children who attended Project Head Start (summer program 1965),
culturally disadvantaged kindergarten children who did not attend Project Head Start; and kindergarten
children who were not culturally disadvantaged. Glassboro, NJ: Glassboro State College.
Howard, J.L., & Plant, W.T. (1967). Psychometric evaluation of an operant Head Start program. Journal
of Genetic Psychology, 11, 281-288.
Huron Institute. (1974). Short-term cognitive effects of Head Start programs: A report on the third year
of planned variation. Cambridge, MA.
Hyman, I.A., & Kliman, D.S. (1967). First grade readiness of children who have had summer Headstart
programs. Training School Bulletin, 63, 163-167.
Krider, M.A., & Petsche, M. (1967). An evaluation of Head Start pre-school enrichment programs as
they affect the intellectual ability, the social adjustment, and the achievement level of five-year-old
children enrolled in Lincoln, Nebraska. Lincoln, NE: University of Nebraska.
Larson, D.E. (1972). Stability of gains in intellectual functioning among white children who attended a
preschool program in rural Minnesota: Final report. Office of Education (DHEW). Mankato, MN:
Mankato State College.
Larson, D.F. (1969). The effects of a preschool experience upon intellectual functioning among four-year-old, white children in rural Minnesota. Mankato: Minnesota State University, College of
Lee, V.E., Brooks-Gunn, J., Schnur, E., & Liaw, F.R. (1990). Are Head Start effects sustained? A
longitudinal follow-up comparison of disadvantaged children attending Head Start, no preschool, and
other preschool programs. Child Development, 61, 495-507.
Lee, V.E., Schnur, E., & Brooks-Gunn, J. (1988). Does Head Start work? A 1-year follow-up
comparison of disadvantaged children attending Head Start, no preschool, and other preschool
programs. Developmental Psychology, 24, 210-222.
Ludwig, J., & Phillips, D. (2007). The benefits and costs of Head Start. Social Policy Report, 21, 3-13.
McNamara, J.R. (1968). Evaluation of the effects of Head Start experience in the area of self-concept,
social skills, and language skills (pre-publication draft). Miami, FL: Dade County Board of Public
Miller, L. B., et al. (1970). Experimental variation of current approaches. University of Louisville.
Miller, L. B., et al. (1972). Four preschool programs: Their dimensions and effects. University of
Morris, B., & Morris, G.L. (1966). Evaluation of changes occurring in children who participated in
project Head Start. Kearney, NE: Kearney State College.
Nummedal, S.G., & Stern, C. (1971). Head Start graduates: One year later. Paper presented at the
annual meeting of the American Educational Research Association, New York, NY.
U.S. Department of Health and Human Services. (2003). Building futures: The Head Start impact study
interim report. Washington, DC.
Porter, P.J., Leodas, C., Godley, R.A., & Budroff, M. (1965). Evaluation of Head Start educational
program in Cambridge, Massachusetts: Final report. Cambridge, MA: Harvard University.
Sandoval-Martinez, S. (1982). Findings from the Head Start bilingual curriculum development effort.
NABE: The Journal for the National Association for Bilingual Education, 7, 1-12.
Schnur, E., & Brooks-Gunn, J. (1988). Who attends Head Start? A comparison of Head Start attendees
and nonattendees from three sites in 1969-1970. Princeton, NJ: Educational Testing Service.
Shipman, V. (1970). Disadvantaged children and their first school experiences: ETS-Head Start
longitudinal study. Preliminary description of the initial sample prior to school enrollment: Summary
report. Princeton, NJ: Educational Testing Service.
Shipman, V.C. (1972). Disadvantaged children and their first school experience. Princeton, NJ:
Educational Testing Service.
Shipman, V.C. (1976). Stability and change in family status, situational, and process variables and their
relationship to children's cognitive performance. Princeton, NJ: Educational Testing Service.
Shipman, V.C., et al. (1976). Notable early characteristics of high and low achieving black and low-SES
children. Disadvantaged children and their first school experiences: ETS-Head Start longitudinal study.
Princeton, NJ: Educational Testing Service.
Smith, M.S., & Bissell, J.S. (1970). Report analysis: The impact of Head Start. Harvard Educational
Review, 40, 51-104.
Sontag, M., Sella, A.P., & Thorndike, R.L. (1969). The effect of Head Start training on the cognitive
growth of disadvantaged children. The Journal of Educational Research, 62, 387-389.
Tamminen, A.W., Weatherman, R.F., & McKain, C.W. (1967). An evaluation of a preschool training
program for culturally deprived children: Final report. U.S. Department of Health, Education, and
Welfare. Duluth: University of Minnesota.
Thorndike, R.L. (1966). Head Start Evaluation and Research Center, Teachers College, Columbia
University. Annual report (1st), September 1966-August 1967. New York, NY: Columbia University
Teachers College.
U.S. Department of Health and Human Services. (2005). Head Start impact study: First year findings.
Washington, DC.
Young-Joo, K. (2007). The return to private school and education-related policy. University of
Wisconsin (dissertation).
Zigler, E.F., Abelson, W.D., Trickett, P.K., & Seitz, V. (1982). Is an intervention program necessary in
order to improve economically disadvantaged children's IQ scores? Child Development, 53, 340-348.
Appendix B: Head Start Studies, Programs, and Contrasts Included in Analysis
Start Date
Study Description
Programs/Contrasts Included*
Regular HS v. Direct Instruction
Head Start v. control
1) Bereiter/Engelmann curriculum (Direct instruction) in
Head Start v. no preschool control
2) Enrichment (Standard) Head Start v. no preschool
National Head Start Program,
1) Full year Head Start participation v. no Head Start
New Haven Head Start evaluation
1) Head Start (from Zigler & Butterfield study) v. No
Head Start (Zigler/Butterfield study)
Camden, NJ Summer Head Start
1) Culturally Disadvantaged; Attended Camden Summer
Head Start v. Culturally Disadvantaged; did not attend
Camden Summer Head Start
comparison of children enrolled
in Head Start, other preschool, or
no preschool in two cities
1) Children enrolled in HS v. Children with no Preschool
Southeastern Head Start
program of high quality
1) Children in Head Start v. Children on wait list for
Head Start
New Jersey Summer Head Start
one or two years
1) One or two treatments of summer Head Start v. No
summer Head Start
Evaluation of the effect of Head
Start program on cognitive growth
of disadvantaged children
1) Head Start program v. children about to enter Head
Kearney, NE Summer Head Start
1) Children who attended Summer Head Start v. Matched
children who did not attend Summer Head Start
Summer Head Start program
Evaluation, San Jose, CA;
Howard, J.
1) Children who attended summer Head Start program v.
children who did not receive Head Start services
Rural Minnesota Head Start
1) Head Start v. Eligible for Head Start but did not enroll
2) Head Start v. Students not enrolled in any preschool
Cambridge, MA Summer Head
1) (summer) Head Start v. Operation Checkup (medical
Head Start Effects on Self-Concept, Social Skills, and
Language Skills, 1968
1) Head Start v. no preschool
Head Start Bilingual Bicultural
Development Project
1) Bilingual HS v. Stay at home
2) Comparison HS v. Stay at home
Duluth Summer Head Start
1) Head Start v. Stay at home
Impact of 1965 Summer Head Start
on Children's Concept Attainment,
1) Summer Head Start v. No Head Start
Bicultural Preschool Program
1) Mexican American Kids in Head Start v. Mexican
American Kids with no preschool
Head Start effects on children's
behavior and cognitive functioning
one year later, Nummedal, 1971
1) Full Year Head Start v. No Head Start
New Haven Head Start
1) Children who received New Haven Head Start v. non-Head Start comparison group
A follow-up study of a summer
Head Start program in Washington,
1) Head Start v. no Head Start
Lincoln, NE Summer Head Start
1) Head Start, matched pairs v. Stay at home, matched
2) Head Start, unmatched pairs v. Stay at home,
unmatched pairs
National Head Start Impact Study
First Year
1) Head Start 3 years (weighted, controlling for
demographics, pretest scores)- based on OLS v. No Head
Start 3 years
2) Head Start 4 years (weighted, controlling for
demographics, pretest scores) - based on OLS v. No
Head Start 4 years
3) Head Start 3 years (Treatment on Treated - Ludwig
Phillips analysis) v. no Head Start 3 years (Treatment on
Treated - Ludwig Phillips analysis)
4) Head Start 4 years (Treatment on Treated - Ludwig
Phillips analysis) v. no Head Start 4 years (Treatment on
Treated - Ludwig Phillips analysis)
New Haven Head Start Abelson,
Zigler & Levine
1) Head Start (from Abelson, Zigler & Levine study) v. no
Head Start
ECLS-K Head Start Study
1) Whites attending Head Start v. All other white students
(likely mix of stay at home and other childcare)
2) Blacks attending Head Start v. All other black students
(likely mix of stay at home and other childcare)
3) Hispanics attending Head Start v. All other Hispanic
students (likely mix of stay at home and other childcare)
Planned Variation in Head Start
1) Planned Variation Head Start, 1971-1972 v. No Head
Start 1971-1972
2) Standard Head Start, 1971-1972 v. No Head Start 1971-1972
Louisville Head Start Curriculum
1) Bereiter-Engelmann Head Start v. no pre-k
2) DARCEE Head Start v. no pre-k
3) Montessori Head Start v. no pre-k
4) Traditional Head Start v. no pre-k
Comparison of Head Start vs. Non-Head Start reading readiness scores
of low income children in Guam
1) Head Start v. No Head Start
*A horizontal line indicates separate programs within studies; contrasts within a single program are
listed in the same cell.
**If an actual start date for the program was not provided, we estimated the start date to be two years
prior to report publication.