National Poverty Center Working Paper Series #07-09, February 2007

The Benefits And Costs Of Head Start

Jens Ludwig, Georgetown University and NBER
Deborah A. Phillips, Georgetown University

February 28, 2007

This paper is available online at the National Poverty Center Working Paper Series index at:
http://www.npc.umich.edu/publications/working_papers/

Any opinions, findings, conclusions, or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the view of the National Poverty Center or any sponsoring agency.
The authors are grateful for the support provided by the Buffett Early Childhood Fund
and the McCormick Tribune Foundation to the National Forum on Early Childhood
Program Evaluation through the National Scientific Council on the Developing Child.
Thanks to Kathryn Clabby for outstanding research assistance. Very helpful comments
were provided by Philip Cook, Janet Currie, William Dickens, Greg Duncan, Dave
Frisvold, Katherine Magnuson, Gillian Najarian, Matthew Neidell, Helen Raikes, Jack
Shonkoff, Hiro Yoshikawa, and Marty Zaslow. Special thanks to Ronna Cook at Westat
for making available additional information about the first-year randomized Head Start
evaluation. Any errors and all opinions are of course our own.
THE BENEFITS AND COSTS OF HEAD START
ABSTRACT
In this essay we review what is known about Head Start and argue that the program is
likely to generate benefits to participants and society as a whole that are large enough to
justify the program’s costs. Our conclusions differ importantly from those offered in
some previous reviews because we use a more appropriate standard to judge the success
of Head Start (namely, benefit-cost analysis), draw on new accumulating evidence for
Head Start’s long-term effects on early cohorts of program participants, and discuss why
common interpretations of a recent randomized experimental evaluation of Head Start’s
short-term impacts may be overly pessimistic. While in principle there could be more
beneficial ways of deploying Head Start resources, the benefits of such changes remain
uncertain and there is some downside risk.
JEL Codes: I2, I3, H43
Jens Ludwig
Georgetown Public Policy Institute
Georgetown University
3520 Prospect Street, NW
Washington, DC 20007
and NBER
ludwigj@georgetown.edu
Deborah A. Phillips
Department of Psychology
Georgetown University
37th and O Streets, NW
Washington, DC 20057
deborah.dap4@georgetown.edu
I. INTRODUCTION
Few social programs seem to enjoy the widespread popular support of Head
Start. 1 Poor children represent a sympathetic target population for social programs
[Mayer, 1997]. The powerful correlation between the socio-economic status of a child’s
family and the child's long-term life chances raises concerns about the fairness of American
society as well as the social costs associated with outcomes such as school dropout or
criminal behavior [Holzer et al., 2007]. And the fact that the educational deficits of poor
children show up very early in the life course, even well before children start
kindergarten, provides a powerful logic for intervening early [Knudsen et al., 2006].
Despite the obvious emotional and intuitive appeal of Head Start, questions about
Head Start’s effectiveness have dogged the program since its inception. The first study
claiming that Head Start’s benefits fade out was published in 1966 – just one year after
the program was launched as part of President Lyndon B. Johnson’s War on Poverty
[Westinghouse, 1969; Vinovskis, 2005]. Skepticism about the program persists in some
quarters to the present day. The perspective of the Cato Institute’s Darcy Ann Olsen is
nicely summarized by the title of her policy brief: It’s Time to Stop Head Start. 2
Douglas J. Besharov’s assessment is slightly more charitable, but only slightly, as
suggested by the title of his policy brief: Head Start’s Broken Promise. 3 Understanding
Head Start's impacts on poor children seems particularly urgent given that the program
is up for re-authorization this year in the U.S. Congress.

1
For example, in a 1995 opinion poll, two-thirds of voters in Colorado agreed with the statement "School districts should be required to provide special preschool classes to children from low income households to get them prepared for grade school" (Memorandum to the Donnell-Kay Foundation from Public Opinion Strategies, September 7, 2005; available at www.pos.org/presentations/77/preschool.pdf). The fraction of voters agreeing with this statement exceeds the fraction of Colorado votes in the 1996 presidential election that went to Bill Clinton by nearly 22 percentage points. As another example, Head Start funding has increased dramatically under both Republican and Democratic presidential administrations (particularly the administrations of George H.W. Bush and Bill Clinton); see Haskins [2004].

2
Darcy Ann Olsen, "It's Time to Stop Head Start," Human Events, September 1, 2000.

3
Douglas J. Besharov, "Head Start's Broken Promise," American Enterprise Institute for Public Policy Research On the Issues, October 2005.
This essay reviews what is known about the value of Head Start. Our bottom line
is that the best available evidence suggests Head Start passes a benefit-cost test. While
there remain some important limitations to the available evidence on Head Start, we
believe the weight of the evidence points in this direction. In principle there might be
ways to increase the cost-effectiveness of current Head Start funding, including changes
to Head Start’s design or funding alternatives such as state pre-K programs. However the
benefits of such changes remain uncertain and they entail some downside risk.
Our essay seeks to develop five main arguments that lead us to these conclusions.
First, much of the debate about Head Start stems from confusion about how to judge the
magnitude of program impacts. We argue that the most appropriate standard for judging
the program’s success is benefit-cost analysis.
Second, over the past several years new evidence has been accumulating about
the long-term impacts of Head Start on early cohorts of program participants, as well as
about the short-term program impacts on more recent cohorts of children. Research on
Head Start’s long-term impacts suggests the program passed a benefit-cost test during the
first few decades of operation [Currie and Thomas, 1995; Garces, Thomas and Currie,
2002; Ludwig and Miller, 2007]. These findings counter the view that only very
intensive (and expensive) early childhood interventions can generate long-term benefits,
and also run counter to the perception that Head Start has been a failure from its
inception. However these results are not directly informative about whether today’s
version of Head Start passes a benefit-cost test, since Head Start and the counterfactual
developmental environments poor children would otherwise experience are both
changing over time. This is a generic challenge to understanding the long-term impacts
of contemporaneous government programs – we can only estimate long-term impacts for
people who participated in the program a long time ago.
The best evidence currently available on Head Start as it operates today comes
from a recent randomized experimental evaluation of Head Start’s impacts measured
within one year of random assignment, which was sponsored by the federal government
and carried out by Westat [Puma et al., 2005]. Public discussions of the experimental
results have typically focused on the effects of being assigned to the experiment’s
treatment group rather than the control group, known in the program evaluation literature
as the “intent to treat” (ITT) impact. These impacts are presented by Westat separately
for 3 and 4 year old program participants and are usually in the direction consistent with
some beneficial impact of Head Start on children’s short term outcomes, but are often not
statistically significant.
The third objective of our paper is to provide some benchmarks for how large
these short-term impacts would need to be in order to believe that any long-term benefits
generated by today’s Head Start program will be enough to justify the program’s costs.
This exercise is complicated by the fact that there is currently limited evidence about how
the cognitive and non-cognitive skills of young children translate into long-term life
outcomes. With this caveat in mind, the evidence that is available suggests that given
Head Start’s costs (around $7,000 per child on average), the program would pass a
benefit-cost test if the short-term impacts on achievement test scores were equal to
around .1 to .2 standard deviations, or maybe even much smaller still.
If our achievement test-score benchmarks are correct, then the expected benefits
from Head Start are likely to exceed the program’s costs. The effects of being assigned
to the Head Start experimental treatment group (the intent to treat effects) are typically on
the order of .1 to .2 standard deviations. However the estimated effects of Head Start
participation itself (the “effects of treatment on the treated,” or TOT) are estimated to be
around 1.5 times as large as the intent-to-treat impacts. The difference between the
effects of being assigned to the experimental treatment group and the effects of Head
Start participation per se (that is, ITT versus TOT) stems from the fact that not all
treatment-group children actually enrolled in Head Start, while some children assigned to
the control group wound up enrolling in Head Start on their own.
Some observers have focused on the fact that many of the estimated impacts in
the recent randomized experimental evaluation of Head Start are not statistically
significant, and so follow the usual scientific convention of assuming that any estimates
that cannot be statistically distinguished from zero are zero. Our own calculations
suggest that if Westat had conducted analyses that pooled the 3 and 4 year old samples,
instead of only presenting estimates separately for 3 versus 4 year olds, almost all of the
estimated Head Start impacts on the main cognitive outcomes of interest would be
statistically significant. But more importantly, as Cook and Ludwig [2006] argue, the
expected value of a program’s benefits and costs may be a more relevant framework for
making policy decisions than statistical significance, and the expected net value of Head
Start is positive. That is, odds are that Head Start passes a benefit-cost test for current
cohorts of children.
Finally, we close with some discussion about recent suggestions that have been
made for increasing the cost-effectiveness of Head Start funding, including changing the
program design, making the program more like some of the newer state pre-K programs
in operation around the country, or even diverting some Head Start funding to these state
programs. There is in our view some uncertainty about both the short- and long-term
benefits associated with these changes. There are also downside risks, particularly if one
recognizes that there is some opportunity cost associated with the resources required to
implement some of the proposed changes to Head Start. Given available evidence the
expected net value of changing Head Start is ambiguous.
II. THE BENEFITS OF BENEFIT-COST ANALYSIS
The argument that we should judge the magnitude of Head Start’s impacts by how
the dollar value of these benefits compares to the cost of the program will not seem like a
new idea to economists and policy analysts. Yet much of the public debate about the
value of Head Start reflects some basic confusion on this point.
One benchmark that has been used to gauge the size of Head Start's impacts is
the scale of the social problem that is being addressed. For example Besharov
[2005] reviews the Westat report and argues “these small gains will not do much to close
the achievement gap between poor children (particularly minority children) and the
general population. We should expect more of a program that serves almost 900,000
children at a cost of $9 billion a year.”
But the right standard of success for a public program is not the elimination of a
social problem. Consider, for example, that mortality rates from lung cancer in the U.S.
in 2003 remained quite high – equal to 71.9 deaths per 100,000 people for males and 41.2
deaths per 100,000 for females [Thun and Jemal, 2006, p. 346]. The fact that thousands
of Americans continue to die each year from lung cancer does not mean that the large
decline in tobacco smoking observed during the last half of the 20th century should be
considered a public health “failure,” particularly since diverting smokers from smoking
appears to make them happier as well as healthier [Gruber and Koszegi, 2002; Gruber
and Mullainathan, 2002].
Psychologists and education researchers often use the typology offered by Jacob
Cohen [1977], who argues that an “effect size” (that is, program impact expressed as a
share of a control group standard deviation) of .2 should be considered “small,” while
effect sizes of .5 should be considered “medium” and those of .8 or more are “large.”
Lipsey [1990] conducts a meta-analysis that draws on results from 6,700 studies in
education and other related areas and finds that the empirical distribution of estimated
effect sizes roughly corresponds to Cohen’s categorization [see also Bloom, 2005]. This
is the convention adopted by Westat in their report on the short-term results of the recent
randomized Head Start experiment. 4
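To make the effect-size convention concrete: the calculation simply divides the treatment-control difference in mean outcomes by the control group's standard deviation. A minimal sketch in Python, using made-up scores purely for illustration (these are not data from any Head Start study):

```python
import statistics

def effect_size(treatment_scores, control_scores):
    """Program impact expressed as a share of the control group
    standard deviation (the convention described above)."""
    gap = statistics.mean(treatment_scores) - statistics.mean(control_scores)
    return gap / statistics.stdev(control_scores)

# Made-up scores, purely for illustration
treated = [99, 95, 104, 96, 98]
controls = [98, 95, 103, 96, 97]
print(round(effect_size(treated, controls), 2))  # 0.19, "small" in Cohen's typology
```

By Cohen's typology the resulting .19 would be labeled "small," which is precisely why a cost side is needed to interpret it.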
Yet any assessment of what a program accomplishes should take into account not
just the program’s benefits but also its costs, which necessarily requires conversion of
both into some common metric – that is, benefit-cost analysis. A program that improved
test scores by .8 standard deviations – “large” in Jacob Cohen’s [1977] scheme – but cost
a total of $10 trillion per year would be difficult to support, since undertaking such an
early childhood intervention would absorb the majority of the nation’s gross domestic
product with very little left to house, clothe, feed and protect the nation’s child (and
adult) population. At the other extreme a program that generated impacts on the order of
.2 standard deviations but cost only a nickel per child per year would be difficult to
oppose. We should expect social programs to generate net benefits, not miraculous
benefits. 5

4
The Westat report calls effect sizes below 0.2 standard deviations "small," while those of 0.2 to 0.5 standard deviations are "medium" and those over 0.5 are "large" [see Puma et al., 2005, p. ii, footnote 1].

5
For an excellent discussion of these points see Duncan and Magnuson [2006]. Harris [2007] presents a cost-effectiveness framework for judging program impacts and suggests that any intervention that generates increased test scores of .025 standard deviations per child per $1,000 spending should be considered "large." The implication is that Head Start impacts of .175 standard deviations would be "large" under this framework, roughly consistent with our own benchmarks presented below.
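Spelling out the arithmetic behind footnote 5's implication:

```latex
\[
0.025\ \text{SD per } \$1{,}000 \;\times\; \frac{\$7{,}000}{\$1{,}000} \;=\; 0.175\ \text{SD}
\]
```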
III. EVIDENCE ON HEAD START’S LONG-TERM IMPACTS
While researchers have been studying Head Start for over 40 years, only in recent
years have social scientists made real headway in identifying the causal impacts of the
program on participating children. There is now credible evidence on Head Start's long-term impacts that suggests the program passed a benefit-cost test over the first few
decades of operation, but the program is changing over time and so these impacts might
not be relevant to today’s Head Start. Short-term impacts from the recent randomized
experimental study of Head Start by themselves are not directly informative about
whether the program’s long-term benefits justify program costs.
Long-term effects of Head Start can obviously only be identified for those
children who participated in the program a long time ago. The main challenge in
identifying the long-term effects of Head Start on earlier cohorts of children comes from
the problem of trying to figure out what the outcomes of Head Start participants would
have been had they not enrolled in the program. Simply comparing the long-term
outcomes of children who did participate with those who did not may provide misleading
answers to the key causal question of interest. If, for example, more disadvantaged
families participate in Head Start, then simple comparisons of Head Start recipients to
other children may understate the program’s effectiveness if researchers are unable to
adequately measure all aspects of family disadvantage. The opposite bias may result if
instead the more motivated and effective parents are the ones who are able to get their
children into (or are selected by program administrators for) scarce Head Start slots.
Economists Eliana Garces, Duncan Thomas and Janet Currie [2002] evaluate
Head Start by comparing the experiences of siblings who did and did not participate in
the program. The analytic sample consists of children who would have participated in
Head Start in 1980 or earlier. These sorts of within-family across-sibling comparisons
help to eliminate the confounding influence of unmeasured family attributes that are
common to all children within the home.
The research design employed by Garces and colleagues represents a substantial
improvement over previous research, although there necessarily remains some
uncertainty about why some children within a family but not others participate in Head
Start, and whether whatever is responsible for this within-family variation in program
enrollment might also be relevant for children’s outcomes. For example sibling
comparisons might overstate (or understate) Head Start’s impacts if parents enroll their
more (or less) able children in the program.
The Garces study might also understate Head Start’s impacts if there are positive
spillover effects of participating in the program on other members of the family, since in
this case the control group for the analysis (i.e. siblings who do not enroll in Head Start
themselves) will be partially treated (i.e. benefit to some degree from having a sibling
participate in Head Start). In addition, their study relies on retrospective self reports of
Head Start participation by people who have reached adulthood, which some people may
misremember or misreport. If this measurement error is uncorrelated with respondents’
characteristics and potential outcomes then misreporting will cause their estimated Head
Start impacts to be attenuated to some degree (that is, biased towards zero).
With these caveats in mind, Garces, Thomas and Currie report that non-Hispanic
white children who were in Head Start are about 22 percentage points more likely to
complete high school than their siblings who were in some other form of preschool, and
about 19 percentage points more likely to attend some college. These impact estimates
are equal to around one-quarter and one-half of the "control mean." For African-Americans the estimated Head Start impact on schooling attainment is small and not
statistically significant, but for this group Head Start relative to other preschool
experience is estimated to reduce the chances of being arrested and charged with a crime
by around 12 percentage points, which, as with the schooling effect for whites, is a very
large effect. 6
6
The share of all children ever booked or charged with a crime in their data is 9.7% for the full sample and 10% for the sibling sample. These figures do not imply that Head Start achieves more than a 100% reduction in crime for program participants, since the right comparison for the estimated Head Start effect on African-American participants is the average arrest rate for the siblings of these children, which does not seem to be reported in the study.
Ludwig and Miller [2007] use a different research design to overcome the
selection bias problems in evaluating the long-term effects of Head Start and generate
qualitatively similar findings for schooling attainment, although unlike Garces et al. they
find evidence for impacts for blacks as well as whites. Their design exploits a
discontinuity in Head Start funding across counties generated by the way that the
program was launched in 1965. Specifically, the Office of Economic Opportunity (OEO)
provided technical grant-writing assistance for Head Start funding to the 300 counties
with the highest 1960 poverty rates in the country, but not to other counties. The result is
that Head Start participation and funding rates are 50 to 100% higher in the counties with
poverty rates that just barely put them into the group of the 300 poorest counties
compared to those counties with poverty rates just below this threshold. So long as other
determinants of children’s outcomes vary smoothly by the 1960 poverty rate across these
counties, any discontinuities (or “jumps”) in outcomes for those children who grew up in
counties just above versus below the county poverty-rate cutoff for grant-writing
assistance can be attributed to the effects of the extra Head Start funding.
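A stylized sketch of this regression discontinuity logic, with simulated data; the variable names and the 0.5 point "jump" below are inventions for illustration, not estimates from Ludwig and Miller [2007]:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Simulated county data: the running variable is the 1960 poverty rate
# centered at the cutoff separating the 300 poorest counties (which
# received OEO grant-writing assistance) from all other counties.
n = 600
running = rng.uniform(-10, 10, n)        # poverty rate minus the cutoff
assisted = (running > 0).astype(float)   # poorer than the cutoff -> assisted
# Outcomes vary smoothly with poverty, plus a jump of 0.5 at the cutoff
schooling = 12.0 - 0.05 * running + 0.5 * assisted + rng.normal(0, 1, n)

# Local linear regression with separate slopes on either side of the cutoff
X = sm.add_constant(np.column_stack([assisted, running, assisted * running]))
fit = sm.OLS(schooling, X).fit()
print(fit.params[1])  # estimated discontinuity, close to the true 0.5
```

In practice one would also restrict attention to counties near the cutoff (a bandwidth choice) and probe the smoothness assumption; the sketch only shows the mechanics.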
Using this regression discontinuity design, Ludwig and Miller find that a 50-100% increase in Head Start funding is associated with an increase in schooling
attainment of about one-half year, and an increase in the likelihood of attending some
college of about 15% of the control mean. Importantly, the estimated effects of extra
Head Start funding on educational attainment are found for both blacks and whites.
These estimates are calculated for children who would have participated in Head Start
during the 1960s or 1970s, and cannot be calculated for more recent cohorts of program
participants since the Head Start funding discontinuity across counties at the heart of this
research design seems to have dissipated over time.
Taken together, these impact estimates suggest that Head Start as it operated in
the 1960s through 1980s generated benefits in excess of program costs, with a benefit-cost ratio that might be at least as large as the 7-to-1 figure often cited for model early
childhood programs such as Perry Preschool. Currie [2001] notes that the short-term
benefits of Head Start to parents in the form of high-quality child care together with
medium-term benefits from reductions in special education placements and grade
retention might together offset between 40 and 60 percent of the program’s costs.
Ludwig and Miller’s [2007] estimates imply that each extra dollar of Head Start funding
in a county generates benefits from reductions in child mortality and increases in
schooling attainment that easily outweigh the extra program spending. 7 In addition
Frisvold [2007] provides some evidence that Head Start might reduce childhood obesity.
7
Ludwig and Miller [2007] estimate the impact of an additional $400 per four year old in Head Start funding in a county. The dollar value of the decline in child mortality is equal to around $120 per four year old in the county. They also estimate an increase in schooling attainment of around one-half year per child. Card [1999] suggests an extra year of schooling increases earnings by 5 to 10 percent. We conservatively assume the extra $400 in Head Start funding raises lifetime earnings by 2 percent per child, which Krueger [2003] shows is worth at least $15,000 in present value using a 3 percent discount rate (even assuming no productivity growth over time). The benefits would be even larger if we accounted for the fact that increased schooling also seems to reduce involvement with crime [Lochner and Moretti, 2004], and that the costs of crime to society are enormous – perhaps as much as $2 trillion per year [Ludwig, 2006].
These findings run counter to the common view that only very intensive and
expensive early childhood interventions are capable of generating long-term benefits.
The origin of this conventional wisdom is itself not entirely clear, since there is no logical
reason that lower-cost programs will necessarily have lower benefit-cost ratios than
higher-cost programs. The impacts of Head Start on children will depend
on the difference in the developmental quality of the program versus the quality of the
environments that low-income children would have experienced otherwise. During its
early years Head Start did not score well on commonly used indicators of early childhood
program quality, such as teacher educational attainment. This stemmed in part from Head
Start's origin as part of the Community Action Program of the War on Poverty, with
its emphasis on involvement of the poor in the design and implementation of new social
programs [Vinovskis, 2005], including roles as classroom teachers and aides. But for
poor children in the 1960s through 1980s, the evaluation studies described above imply
that the environments Head Start children would have experienced if not enrolled in the
program were less developmentally productive than Head Start.
One implication of this last point is that the effects of Head Start on poor children
may be changing over time in ways that are difficult to predict, and so the long-term
impacts of Head Start on previous cohorts of children may not represent the long-term
effects of the program on today’s participants. Over time the Head Start program has
improved in quality, but arguably so has the alternative to Head Start for poor children
since parent educational attainments and real incomes have increased since the 1960s and
state-funded pre-school programs have been introduced. It is not clear which
environment is improving more rapidly in this horse race.
Fortunately the federal government has recently sponsored the first-ever
randomized experimental evaluation of Head Start, the Head Start National Impact Study
(HSNIS, hereafter “the randomized Head Start experiment”), with first-year results that
are now available from Westat, the evaluation sub-contractor [Puma et al., 2005].
Starting in 2002 nearly 4,700 three and four year old children whose parents applied for
Head Start were randomly assigned to a Head Start treatment group or a control group
that was not offered Head Start through the experiment, but could participate in other
local preschool programs if slots were available. The 84 Head Start centers participating
in the experiment were selected to be representative of all programs in operation across
the country that had waiting lists.
The experiment seems to have been done well – randomization was implemented
properly, and careful assessments were made of a wide variety of children’s cognitive
and non-cognitive outcomes, and parents were also studied. Response rates for both the
child and parent assessments were usually around 10 percentage points lower for the
control than treatment group. 8 Note that by randomly assigning income-eligible children
to the treatment and control conditions, the Head Start experiment uncovers the effects of
making Head Start available to all eligible children. If, in practice, Head Start centers
focus on enrolling the most disadvantaged of the eligible children that apply, and if the
impacts of Head Start are more pronounced for more disadvantaged children, then the
experimental impact estimates may understate the effect of Head Start on the average
program participant in the nation at large.
8
Puma et al. [2005, p. 1-18] report that for the first data collection wave in Fall 2002, child response rates equaled 85% for the treatment group and 72% for the control group, and for parents equaled 89% and 81% for the treatment and control groups, respectively. For the Spring 2003 follow-up response rates for children equaled 88% and 77% for the treatment and control groups, and 86% and 79% for parents.
While the Head Start experiment is informative about a number of key policy
questions about the program, in the absence of time travel we cannot directly measure the
long-term impacts of Head Start for the children who only recently participated in the
program. As a result we are forced to rely on indirect evidence about what the short-term
impact estimates from the recent Head Start experiment might imply about the program’s
long-term effects. This point is taken up in the next section.
IV. SHORT-TERM BENCHMARKS FOR LONG-TERM SUCCESS
Head Start, as the program currently operates, spends about $7,000 per
participating child. 9 How large would Head Start’s short-term impacts need to be, and in
what outcome domains, for us to believe that the program’s long-term benefits justify the
program expenditures? We try to answer this question in two ways, first by examining
the short-term impacts that have been found for studies of other early childhood
interventions where there is also evidence for long-term benefits in excess of program
costs, and then by trying to assess directly the dollar value of a standard deviation
increase in early childhood test scores. Both approaches are subject to some uncertainty
given important limitations with the available evidence. But with that qualification in
mind, we believe there is a reasonable case to be made that positive impacts on
achievement test scores on the order of .1 to .2 standard deviations (and perhaps even
much smaller than that) would be large enough to generate long-term dollar-value
benefits that would outweigh the program’s costs.
9
http://www.nhsa.org/download/advocacy/fact/HSBasics.pdf
A. Short-term Impacts of Yesteryear’s Head Start and Perry Preschool
The findings from Garces et al. [2002] and Ludwig and Miller [2007] cited above
suggest that Head Start as the program operated in the 1960s through 1980s seems to
have generated long-term benefits that were larger than the program’s costs. How large
were the short-term impacts of Head Start on participating children, and in what outcome
domains? If the short-term impacts of today’s Head Start were about as large as the
short-term impacts of yesterday’s program, and if the latter passes a benefit-cost test,
there would be some reason to believe that the same might be true of the current program.
Using the same sibling-difference design as in Garces et al. [2002], Currie and
Thomas [1995] study children who would have been in Head Start in the 1980s or earlier
and find that Head Start participation seems to increase scores on the PPVT vocabulary
test by around .25 standard deviations in the short term for both white and African-
American children. These impacts persist for whites, but fade out within three or four
years for blacks. 10 Head Start's impacts on PIAT math scores might be around half as
large and are not statistically significant [p. 345, fn 10]. 11 Ludwig and Miller [2007]
find that a 50-100% increase in Head Start funding does not lead to statistically
significant increases in student achievement test scores in 8th grade in either math or
reading, although they cannot rule out impacts smaller than around .2 standard
deviations. Nor do they have adequate sample sizes to examine impacts on test scores
separately for blacks and whites. Unfortunately not much is currently known about Head
Start’s causal effects on short-term non-cognitive outcomes for earlier cohorts of program
participants. 12
While there remains some debate about the relative importance of different early
childhood cognitive or non-cognitive skills in predicting subsequent outcomes, 13 the
literature as a whole is consistent with the idea that there are multiple pathways to long-term success. For example while Duncan et al. [2005] find that early math skills are the
strongest predictor of subsequent academic achievement, early reading and attention
skills also predict later test scores but just not quite as strongly as do early math skills. 14
10
Currie and Thomas [1995, Table 6] use a sibling-difference research design and estimate a short-term effect of Head Start on PPVT test scores of nearly 7 percentile points in the national distribution for both blacks and whites. The standard deviation of percentile ranking scores (i.e. a uniform distribution with values between 1 and 100) will be around 29 points, implying short-term effect sizes in the Currie and Thomas study of around one-quarter of a standard deviation.
11
Currie and Thomas [1995], p. 345, footnote 10, note the PIAT math results are not statistically
significant, but that version of the study does not report the math point estimates themselves. However an
earlier version of the study, Currie and Thomas [1993], reports results for PIAT math, PIAT reading and
PPVT scores but not results interacted with age, so we cannot recover short- versus long-term effects.
However the overall impacts for whites for PIAT math scores are about half as large as the PPVT results,
and PIAT reading scores are about 15% of the PPVT impacts.
12
Currie and Thomas [1995, Table 4] do find some evidence that Head Start might reduce grade retention
for white children who participated in the program in the 1980s or earlier.
13
For example Duncan et al. [2005] do not find much evidence that non-cognitive outcomes measured
during early childhood (aside from attention skills) predict later test scores, although other correlational
studies have found that socio-emotional outcomes, notably aggressive behavior, do seem to contribute to
children’s achievement trajectories [Hinshaw, 1992; Jimerson, Egeland, and Teo, 1999; Miles and Stipek,
2006; Tremblay et al., 1992].
14
These correlational data of course have important limitations in illuminating the causal relationships of
early childhood outcomes with later outcomes. For example suppose that most parents read to their
children, but what really distinguishes the most scholastically motivated parents from their peers is that the
former try to impart math skills to their children even during the early childhood period. In this case the
relatively strong correlation between early math and later scores could simply be a stand-in for the
influence of parent motivation to help their children learn, and so an increase in early math skills induced
by some intervention would yield longer-term impacts that are smaller than Duncan et al.’s correlations
would suggest. Alternatively one can also imagine that children with early childhood socio-emotional
problems receive a variety of compensatory resources from their parents and schools to offset these early
developmental challenges. In this case any intervention that improved children's early socio-emotional
skills – holding all else constant – might have larger impacts than the Duncan et al. correlations would
imply.
The fact that early childhood programs like Head Start achieve long-term behavioral
impacts despite “fade out” of initial achievement test score gains suggests that lasting
program impacts on non-cognitive skills might be the key drivers of long-term program
impacts on outcomes such as school completion or employment [see for example
Carneiro and Heckman, 2003]. But it is possible that short-term boosts in academic skills
are a key mechanism for improving non-cognitive skills such as motivation and
persistence by for instance increasing children’s confidence in school [Barnett, Young
and Schweinhart, 1998]. If we interpret short-term test scores as a proxy for the bundle
of early skills that promote long-term outcomes, then the previous research on earlier
Head Start cohorts suggests that short-term impacts of around .25 standard deviations for
vocabulary and perhaps .1 for math might be large enough to generate long-term benefits
in excess of program costs.
We can also look at the short- versus long-term impacts of the widely-cited Perry
Preschool program, which provided poor 3 and 4 year old children with two years of
services at a total per-child cost of about twice that of Head Start. 15 At the end of the
second year of services, Perry had increased PPVT vocabulary scores by around .91
standard deviations and scores on a test of nonverbal intellectual performance (the Leiter
International Performance test) by around .77 standard deviations [Schweinhart et al.,
2005, p. 61]. By age 9, the impact on vocabulary scores had faded out entirely, while
around half of the original impact on nonverbal performance had dissipated. By age 14
impacts on reading and math scores are just over .3 standard deviations. Despite this
partial fade-out of test score impacts, Perry Preschool shows large long-term impacts on
schooling, crime and other outcomes measured through age 40 [Schweinhart et al., 2005].
The dollar value of Perry Preschool's long-term benefits (in present dollars) ranges from
nearly $100,000 calculated using a 7 percent discount rate to nearly $270,000 using a 3
percent discount rate [Belfield et al., 2006, p. 180-1].
Suppose that short-term test score impacts are proportional to the dollar value of
long-term program benefits. In this case, even if we used a conservative 7 percent
discount rate Head Start’s short-term impacts would need to be at most around 7 percent
as large ($7,000 / $100,000) as those of Perry Preschool (that is, around .08 and .05
standard deviations for vocabulary and nonverbal performance, respectively) to generate
benefits that are large enough to outweigh Head Start’s costs of around $7,000 per child.
If we use a 3 percent discount rate instead, the necessary short-term impacts may be more
on the order of .03 and .02 standard deviations, respectively.
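In symbols, the proportionality assumption used here is (our notation, not a formula from the studies cited):

```latex
\[
\text{required Head Start impact} \;\approx\;
\text{Perry impact} \times
\frac{\text{Head Start cost per child}}{\text{PV of Perry benefits per child}}
\;=\; \text{Perry impact} \times \frac{\$7{,}000}{\$100{,}000}
\]
```

Using the 3 percent discount rate instead puts nearly $270,000 in the denominator, which is what drives the smaller figures in the final sentence.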
Of course it might be possible that long-term gains are not strictly proportional to
short-term impacts. For example, it could be the case that some minimum short-term
impact is necessary in order to generate lasting cognitive or non-cognitive benefits. It
could also be the case that the behavioral consequences of achievement impacts on the
low-IQ sample of Michigan children in Perry Preschool are different from those arising
from similar-sized impacts on a more representative Head Start population. But, at a
minimum, the Perry Preschool data raise the possibility that “small” short-term impacts
might be sufficient for a program with the costs of Head Start to pass a benefit-cost test.
15
Currie [2001] cites Perry costs of $12,884 per child in 1999 dollars.

B. The Value of Increasing Early Childhood Test Scores
Another way to think about how large Head Start’s short-term impacts would
need to be in order for the program to pass a benefit-cost test is to measure directly the
value of a 1 standard deviation increase in early childhood test scores. Because few
studies have followed people from early childhood all the way through adulthood, this
exercise is necessarily subject to some uncertainty. But the evidence that is available
suggests that short-term effect sizes of .15 to .2 might be more than enough for Head
Start to pass a benefit-cost test, consistent with the evidence from the previous section.
The British National Child Development Study (NCDS) is one of the few datasets
available for this purpose, and includes achievement test scores measured at age 7 and
earnings measured at age 33 for a sample of people born in the U.K. in 1958. Krueger
[2003] argues that analyses of these data suggest that an increase in early childhood test
scores in either reading or math of 1 standard deviation might plausibly be associated
with higher lifetime earnings of about 8 percent. 16 If Krueger’s argument is correct, then
the short-term impacts on reading or math that would be needed to generate $7,000 in
benefits from increased future earnings would be on the order of around .07 (using a 3
percent discount rate and assuming no productivity growth). 17 If we assume productivity
growth of 2 percent, then the short-term impact on reading or math scores necessary to
generate $7,000 in benefits could be as little as .04 standard deviations.
16
Krueger [2003] notes that Currie and Thomas’ [1999] analyses of these data imply that a 1 standard
deviation increase in test scores increases lifetime earnings by around 8 percent. This impact is smaller
than what has been estimated for a 1 standard deviation increase in test scores measured during adolescence
for more recent US samples, which typically suggest earnings gains of around 20 percent. The difference is
presumably due, as Krueger notes, to some combination of differences in the time period studied, the US vs
UK labor markets, and the fact that Currie and Thomas control for both reading and math scores simultaneously
while most US studies examine one type of test score at a time in their effects on earnings.
17
Krueger [2003] reports increased lifetime earnings from a .2 standard deviation increase in test scores
using a 3 percent discount rate and assuming no productivity growth of $15,174 in 1998 dollars, equal to
around $18,800 in current dollars. So the effect size required to generate $7,000 in benefits is equal to
($7,000 / $18,800)*.2 = .37*.2 = .07.
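Footnote 17's calculation is easy to check directly; a short sketch in Python, with all dollar figures taken from the text:

```python
# Checking the arithmetic in footnote 17 (all figures from the text)
value_of_02_sd = 18800  # PV of lifetime earnings gain from a .2 SD score
                        # increase: 3% discount rate, no productivity
                        # growth, Krueger [2003] updated to current dollars
head_start_cost = 7000  # approximate Head Start cost per child
required_effect = (head_start_cost / value_of_02_sd) * 0.2
print(round(required_effect, 2))  # 0.07 standard deviations
```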
One possible objection is that we are trying to use non-experimental correlations
between early test scores and adult earnings to extrapolate to earnings gains from short-term experimental impacts, which fade over time. But as noted above there is fade out in
non-experimental achievement test advantage as well – that is, test scores measured in
early childhood and adolescence are correlated, but imperfectly. 18 Alternatively the
correlation between early childhood cognitive test scores and subsequent earnings might
be different from the earnings impacts associated with a program-induced change in
cognitive test scores. One reason is that the correlation between achievement test scores
and earnings observed in population data may reflect, in part, the association of
achievement test scores and IQ scores, for which early childhood programs have had a
harder time generating lasting impacts [see for example Schweinhart et al., 2005].
18
Jencks and Phillips [1998, p. 28] think a plausible estimate is that the correlation between 1st and 12th grade test scores is around .52. The implication is that a child starting at the 16th percentile of the test score distribution in first grade will on average be at the 27th percentile of the distribution in 12th grade.
On the other hand, the calculations presented above assume that the only benefit
from increased early test scores is higher adult earnings. But anything that increases
early childhood test scores and subsequently future earnings could affect other outcomes
as well. Crime is one of the most important of these other outcomes, given that the social
costs of crime might be on the order of $2 trillion per year [Ludwig, 2006]. In the Perry
Preschool experiment, around two-thirds of the total dollar-value of the program’s
benefits came from crime reductions [Belfield et al., 2006].
There is to date no entirely satisfactory way of determining how early test score
impacts relate to longer life outcomes. But the two different approaches used here both
suggest that short-term impacts that would be considered very small by the usual
standards of education research – on the order of .05 standard deviations or less – could
potentially generate long-term benefits that would at least equal Head Start’s cost per
participant (around $7,000). Given the uncertainties with these calculations, a more
conservative approach would be to require that Head Start improve short-term test scores
by .1 to .2 standard deviations in order to believe that the program generates long-term
benefits that are large enough to justify the costs.
V. HOW LARGE ARE HEAD START’S CURRENT SHORT-TERM IMPACTS?
The best available evidence on current Head Start’s impacts on children comes
from the Head Start National Impact Study carried out by Westat for the U.S.
Department of Health and Human Services, which we will refer to for convenience as
“the randomized Head Start experiment.” The results of this experiment have been
characterized as “disappointingly small” [Besharov, 2005, p. 1], although much of the
public discussion of these findings seems to confuse the intent to treat effects emphasized
in Westat’s report on the experimental results with the effects of Head Start participation
per se (that is, the effects of treatment on the treated). The short-term impacts of Head
Start participation are usually equal to or greater than the .1 or .2 standard deviation
benchmark that is necessary for Head Start to pass a benefit-cost test. While many of the
point estimates that Westat calculates separately for 3 and 4 year old program participants
are not statistically significant, our calculations suggest that pooling data for 3 and 4 year
olds leads impact estimates for almost all of the main cognitive outcome measures
emphasized in the Westat report’s Executive Summary to be statistically significant. But
more importantly the expected value of the program is positive.
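The precision gain from pooling is mechanical: combining the two age groups roughly doubles the sample and so shrinks standard errors. A stylized illustration in Python, using inverse-variance weighting of subgroup estimates with made-up numbers (the actual pooled calculations use the underlying micro data):

```python
import math

def pool(est_a, se_a, est_b, se_b):
    """Inverse-variance-weighted average of two subgroup impact
    estimates, and the standard error of the pooled estimate."""
    w_a, w_b = 1 / se_a**2, 1 / se_b**2
    est = (w_a * est_a + w_b * est_b) / (w_a + w_b)
    se = math.sqrt(1 / (w_a + w_b))
    return est, se

# Made-up effect sizes: each age group insignificant on its own
# (z = 1.67 and 1.63), but pooling sharpens the estimate.
est, se = pool(0.15, 0.09, 0.13, 0.08)
print(round(est / se, 2))  # 2.32, above the conventional 1.96 threshold
```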
A. Intent-to-Treat Effects vs. the Effects of Head Start Participation
One common source of confusion about the recent randomized Head Start
experiment stems from the fact that the main results, particularly those in the executive
summary to the several-hundred-page report, are not intended to reflect the effects of
actual Head Start participation. The executive summary and most of the tables in the
body of the report itself focus on the causal effects of offering children the chance to
participate in Head Start by assigning them to the Head Start experimental group – that is,
the intent to treat impact. These results are often discussed as if they represent the effects
of Head Start participation. They do not.
In practice not everyone who is offered the chance to participate in Head Start
will actually enroll – parents, for example, might decide that Head Start will not meet
their own or their children's needs, or better alternative opportunities might present
themselves. If some people assigned to the experimental treatment group do not
participate in the program, and, relatedly, if some people assigned to the control group
enroll in Head Start on their own, then the effects of Head Start participation (the effect
of treatment on the treated) can be different – sometimes quite different – from the effects
of treatment-group assignment.
The problems of drawing inferences about Head Start participation from the
effects of treatment-group assignment can be easily seen by imagining an example in
which everyone assigned to the treatment group participates in Head Start ... but because
of their own efforts, so does everyone in the control group. If the average quality of the
Head Start programs experienced by children in the treatment and control groups were
the same, the effects of treatment group assignment (the intent-to-treat estimate) would
be equal to exactly zero. It would obviously be incorrect to infer from these estimates
that Head Start does nothing to improve the life chances of participating children. The
central point is that if Head Start participation rates are less than 100% among children
assigned to the treatment group or greater than 0% among those in the control group, or
both, then the effects of actual Head Start enrollment (the effect of treatment on the
treated) will be larger than the estimated effect of being assigned to the treatment group
(the intent-to-treat effect).
In the Head Start experimental data we see that around 86% of 4 year olds
assigned to the experimental treatment group enrolled in Head Start, while 18% of 4 year
olds assigned to the control group wound up in Head Start on their own [p. 3-7, Puma et
al., 2005]. 19 The body of the report does mention that the intent-to-treat estimates will
understate the effects of actually participating in Head Start. But the report’s description
of how it tries to convert the intent-to-treat estimates into something like an estimate for
the effect of Head Start participation is confusing and the actual approach they employ
might be misleading. In any case these results are relegated to one of the appendices and
perhaps as a result seem to have been largely ignored in public discussions compared to
the intent-to-treat estimates included in the Executive Summary. 20
19
The figures for 3 year olds assigned to the treatment and control groups equal 89% and 21%, respectively.

20
The report describes the Bloom [1984] procedure for handling "no shows" in the treatment group, but does not use this procedure to handle the problem of control group members who wind up in Head Start on their own [p. 4-29, 4-35]. Instead the report seems to drop control group families who wind up in Head Start on their own and then re-weight the remaining control group members; see pp. 4-35,6. The report mentions the Bloom [1984] approach we use to calculate TOT impacts accounting for compliance rates in both the treatment and control groups on p. 4-36 but notes only that Westat will explore how findings from this procedure compare to their default procedure in future reports.
More than 20 years ago, Howard Bloom [1984] proposed a method for translating
intent-to-treat effects into estimates for the effects of treatment on the treated. He noted
that under some conditions we can learn about the effects of treatment participation – in
this case, Head Start enrollment – by scaling differences in the treatment and control
groups in average outcomes by the difference in the treatment and control groups in
treatment participation rates. This procedure assumes random assignment is in fact
random, and that treatment group assignment has no effect on children who do not
participate in Head Start. 21 In addition, the Bloom procedure assumes that everyone who
would participate in Head Start if assigned to the control group would also participate if
they had been assigned to the treatment group instead. It further assumes that the average
quality of the Head Start programs attended by children assigned to the treatment versus
control groups is comparable. This latter assumption may be more problematic, but even
fairly large differences in Head Start program quality between Head Start enrollees in the
treatment and control group would impart relatively modest bias to the estimates derived
from Bloom’s procedure.
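In code, the Bloom correction is a single division. A minimal sketch in Python, using the experiment's enrollment rates for 4 year olds; the .10 intent-to-treat effect size is a made-up number for illustration:

```python
def bloom_tot(itt, enroll_treat, enroll_control):
    """Bloom [1984]: divide the intent-to-treat estimate by the
    treatment-control difference in Head Start enrollment rates."""
    return itt / (enroll_treat - enroll_control)

# 86% of treatment-group 4 year olds enrolled vs. 18% of controls,
# so TOT estimates are about 1/0.68, or roughly 1.5 times, the ITT.
print(round(bloom_tot(0.10, 0.86, 0.18), 3))  # 0.147
```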
Why focus on the effects of actually participating in Head Start rather than the
intent-to-treat estimates? One answer is that the effect sizes for the Head Start
experiment’s intent-to-treat estimates are often compared to estimates from Perry
Preschool, Carolina Abedarian and the results of more recent evaluations of universal
state pre-K programs, all of which estimate the effects of actually participating in these
other programs. This sort of apples-to-oranges (TOT versus ITT) comparison will obviously
understate the relative effectiveness of Head Start.
21
Stated differently, the latent propensity to participate in Head Start if assigned to the treatment group is assumed to be equivalent for children who were, in fact, assigned to the treatment and control groups. This should be true if random assignment was in fact random, since the propensity to participate in Head Start – like all other baseline characteristics – will be equally distributed between treatment and control groups (subject to sampling error).
A more important reason for focusing on estimates for the effects of actually
participating in Head Start (treatment on the treated) is to avoid confusion in conducting a
benefit-cost analysis of Head Start. In public discussions about Head Start’s costs, the
focus is always on the costs per actual enrollee. The benefit measure that should be
compared with this cost is then the dollar value of the benefits per enrollee – that is, the
dollar-value of the gains from actually participating in Head Start.
It is possible that Westat’s report on the Head Start experiment focuses more on
intent-to-treat impacts than on the effects of treatment on the treated because either
Westat or the U.S. Department of Health and Human Services might be uncomfortable
with presenting treatment-on-the-treated estimates that do require imposing some
additional assumptions on the data beyond those necessary to calculate the intent-to-treat
effects. Our own view is that even if some of the assumptions for calculating the
treatment-on-the-treated impacts are not strictly true (for example if there is some
difference in program quality for children enrolled in Head Start in the experiment’s
treatment versus control groups), the treatment-on-the-treated estimates still provide more
useful approximations for the effects of actually enrolling in Head Start, and help avoid
confusion along the lines described above.
Readers who are uncomfortable with the assumptions required for the treatment-on-the-treated calculations can conduct their own benefit-cost analysis using the intent-to-treat impact estimates, but must then be careful to adjust the cost side of the equation
appropriately. If the difference in Head Start enrollment rates between the Head Start
experiment’s treatment and control groups equals (86% - 18%) = 68%, then the right
“cost” for comparison to the intent-to-treat “benefit” would equal the average Head Start
cost per child assigned to the HSNIS treatment group minus the average Head Start cost
per child assigned to the control group. This cost figure equals 68% * $7,000 = $4,760. 22
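In code, the cost adjustment just described (a sketch; figures from the text):

```python
cost_per_enrollee = 7000            # approximate cost per Head Start child
enroll_treat, enroll_control = 0.86, 0.18
itt_cost = (enroll_treat - enroll_control) * cost_per_enrollee
print(round(itt_cost))  # 4760, the cost to set against ITT dollar benefits
```

Because benefits and costs scale by the same 68 percentage point factor, an ITT-based and a TOT-based benefit-cost comparison must agree (the point of footnote 22 below).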
B. Head Start’s Short-Term Impacts
In Table 1 we show the ITT impacts on each of the cognitive outcome domains
reported in the Executive Summary of Westat’s report for the first-year findings of the
Head Start experiment [Puma et al., 2005]. While the published Westat report did not
show standard errors for impact estimates, Ronna Cook at Westat has very generously
made these available to us. In Table 1 we present point estimates and standard errors that
are converted into effect size terms (i.e. expressed as a share of the control group
standard deviation for that outcome measure). 23
Table 1 also presents our own estimates for the effects of actually participating in
Head Start (the effects of treatment on the treated) derived using Bloom’s approach
together with information about Head Start enrollment rates in the experiment’s treatment
and control groups. In the Head Start experiment, the difference in Head Start
participation rates between the treatment and control groups is around 68 percentage
points and so, using the Bloom procedure, we would estimate that the effects of Head
Start enrollment on children are about 1.5 times as large as the intent-to-treat effects that
are commonly misinterpreted to represent the effects of Head Start participation. These
results are best interpreted as providing a range within which the "true" effects of Head
Start likely fall. If the average Head Start program quality is somewhat higher for the
treatment than control groups then our Bloom-style estimates for the effects of treatment
on the treated might be biased upward somewhat.

22
It is easy to see that since both the costs and benefits of the ITT calculation are proportional to the TOT calculations by the treatment-control difference in Head Start enrollment rates, evidence for benefits in excess of costs for the ITT approach implies the same must be true with the TOT approach and vice versa. The important thing is to avoid comparing the dollar value of the ITT impact estimates with the costs per Head Start enrollee.

23
In the body of the report Westat presents a series of different impact estimates for each outcome domain, including those that do not adjust for baseline characteristics, those that adjust for baseline socio-demographic characteristics only, and those that also adjust for fall outcome measures in looking at spring test scores. Because the fall outcome measures are collected mostly by mid-November (collected over the period October to December), in principle controlling for these measures could understate Head Start's impacts due to program effects that arise during the early parts of the academic year. Table 1 presents Westat's own preferred regression-adjusted point estimates and standard errors, based on Westat's examination of whether there is any evidence of program gains between the beginning of the school year and when the fall outcome measures are collected.
Note also that our estimates for the effects of Head Start participation assume
that the 10 percentage point difference in response rates between the Head Start
experiment treatment and control groups [Puma et al. 2005, p. 1-18] does not impart any
bias to the basic intent-to-treat estimates. Of course if there is selective sample attrition
that biases the basic intent-to-treat estimates, this would represent a more fundamental
problem with the Head Start experiment that cannot be solved by focusing on the intentto-treat effects rather than the effects of treatment on the treated.
Table 1 shows that at least for cognitive skills all of the Head Start impact
estimates point in the direction consistent with beneficial program impacts, although
many of these point estimates are not statistically significant and in general the point
estimates are larger (both absolutely and in relation to their standard errors) for 3 year
olds than 4 year olds. For rhetorical convenience we focus on the treatment-on-the-treated
estimates because we believe they are likely to be much closer approximations
of the true effect of Head Start participation per se than are the intent-to-treat estimates.
Nevertheless, it should be understood that the true impact is probably somewhere in
between the ITT and TOT estimates.
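To make the Bloom adjustment and the associated cost accounting concrete, the following sketch (a minimal illustration of our own, with function and variable names of our choosing, using only figures reported in the text, Table 1, and Puma et al. [2005]) reproduces the conversion for one outcome:

```python
# A minimal sketch of the Bloom (1984) adjustment: TOT estimates are the
# ITT estimates divided by the treatment-control difference in Head Start
# enrollment rates.

def bloom_tot(itt, itt_se, enroll_treat, enroll_control):
    """Scale an intent-to-treat estimate (and its standard error) into an
    estimated effect of treatment on the treated."""
    delta = enroll_treat - enroll_control   # enrollment-rate difference
    return itt / delta, itt_se / delta

# 3 year olds: Head Start enrollment rates of .894 (treatment group) and
# .213 (control group), so the divisor is .681 (Exhibit 3.3, Puma et al.).
tot, tot_se = bloom_tot(0.235, 0.074, 0.894, 0.213)
print(f"WJ letter identification, 3 year olds: TOT {tot:.3f} (SE {tot_se:.3f})")
# -> TOT 0.345 (SE 0.109), matching Table 1 up to rounding.

# Costs scale by the same factor, which is why ITT and TOT accounting give
# the same benefit-cost verdict (see footnote 22):
cost_per_enrollee = 7000                        # dollars per child enrolled
cost_per_assignee = 0.68 * cost_per_enrollee    # = 4760 dollars per child assigned
```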
For vocabulary, pre-reading and pre-writing skills Head Start’s effects (the effects
of treatment on the treated) range from .15 to .35 standard deviations, while for 4 year
olds the impacts are one-third to one-half as large as for 3 year olds on the PPVT and
smaller for pre-reading and pre-writing. Parent-reported literacy skills show much more
pronounced Head Start impacts, equal to .5 and .4 standard deviations for 3 and 4 year
olds, respectively. There are reasons to believe that the results from direct student
assessments in this outcome domain may be more reliable than those derived from parent
reports. 24
Given the findings by Greg Duncan and his colleagues that early math scores are
the strongest predictor of subsequent achievement test scores, one particular concern with
the Head Start experiment results has been that the impact estimates on early math scores
(measured by the Woodcock-Johnson applied problems test) are not statistically
significant. Head Start’s impact on this test equals .18 and .15 standard deviations for 3
and 4 year olds, respectively. Duncan’s study also finds that attention skills are important
in predicting future test scores. The closest measure to this in the HSNIS is a variable for
hyperactive behavior, where we see a Head Start impact of -.26 standard deviations for 3
year olds but a zero point estimate for 4 year olds.
A different concern that has been raised about these impact estimates comes from
the ability of the available assessments to detect reliable impacts of this size in young
children. One criterion we have for cognitive or non-cognitive assessments is that they
are reliable – that is, they generate similar results when applied on different occasions. A
standard concern is that assessments of very young children may not be very reliable, for
reasons that will be obvious to anyone who has ever been the parent of a young child
(short attention span, variability in temperament and willingness to cooperate, and so on).
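In classical test theory terms (our gloss, not language taken from the sources cited here), an observed score X can be written as a true score T plus noise e, and reliability is the share of observed score variance attributable to the true score:

```latex
X = T + e, \qquad
\rho_{XX'} = \frac{\mathrm{Var}(T)}{\mathrm{Var}(T) + \mathrm{Var}(e)}
```

On this reading, a reliability of .8 means that roughly a fifth of the variance in measured scores is noise of the sort just described.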
24
Rock and Stenner [2005, p. 21] note that for the Early Childhood Longitudinal Study of the
Kindergarten Class of 1998-99 (ECLS-K) parent reports of children’s social competence and skills have
not proven reliable, with “the main concern [being] that parents often have little basis for determining
whether behavior is age appropriate.” Analogous concerns could in principle apply to parent reports about
their children’s literacy skills.
Reliability scores for achievement tests administered to adolescents are usually on the
order of .8 to .9 [see for example Murnane et al., 1995]. Westat shared with us the
reliability scores for the cognitive outcomes used in the Head Start experiment and these
are typically on the same order but sometimes a bit lower. They are also lower for
measures of non-cognitive skills compared to cognitive outcomes [see also Rock and
Stenner, 2005]. 25
If the limitations of available assessments simply introduce random noise into
children’s outcome scores, then the dependent variables in the Head Start experimental
analysis will suffer from classical measurement error and the result would simply be less
precise estimation of Head Start impacts (i.e., larger standard errors). This concern
would provide a candidate explanation for why so many of the Head Start experimental
impact estimates are not statistically significant, but it does not pose a threat to the
interpretation of those impact estimates that are statistically significant.
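A small simulation (our own illustration, with made-up sample sizes and noise levels) makes the point: classical noise added to the outcome leaves the average impact estimate unchanged but widens its standard error.

```python
# Simulate a randomized evaluation with and without measurement noise in
# the outcome, and compare average estimates and standard errors.
import numpy as np

rng = np.random.default_rng(0)
n, true_effect, reps = 2000, 0.15, 500

def one_trial(noise_sd):
    """Return (impact estimate, standard error) from one simulated experiment."""
    treat = rng.integers(0, 2, n)                       # random assignment
    score = true_effect * treat + rng.normal(0, 1, n)   # true outcome, SD = 1
    score = score + rng.normal(0, noise_sd, n)          # add measurement noise
    diff = score[treat == 1].mean() - score[treat == 0].mean()
    se = np.sqrt(score[treat == 1].var(ddof=1) / (treat == 1).sum() +
                 score[treat == 0].var(ddof=1) / (treat == 0).sum())
    return diff, se

for noise_sd in (0.0, 0.5):
    draws = np.array([one_trial(noise_sd) for _ in range(reps)])
    print(f"noise SD {noise_sd}: mean estimate {draws[:, 0].mean():.3f}, "
          f"mean SE {draws[:, 1].mean():.3f}")
# The mean estimate stays near .15 either way; the standard error grows with
# the noise, making statistical insignificance more likely at any true effect.
```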
C. Statistically Insignificant Impact Estimates
For policy purposes what we want to know is whether Head Start passes a benefit-cost
test. Most of the estimates for the effects of Head Start participation presented in
Table 1 are above the .1 to .2 standard deviation threshold we think necessary for Head
Start to pass a benefit-cost test. But many of these point estimates are not statistically
significant. So what should policy makers conclude about the program?
The Head Start experiment enrolled nearly 4,700 children, which is large by the
standards of many social program evaluations but tiny compared to many randomized
clinical trials in medicine, and in any case means that the resulting point estimates are
subject to some non-trivial sampling uncertainty. This uncertainty is compounded by the
fact that Westat’s report on the Head Start experiment further splits the sample by
showing results separately for 3 and 4 year olds, particularly in the main table of results
shown in the executive summary. While this splitting of the analytic sample makes sense
for developmental reasons (program impacts may differ by age; see for example Knudsen
et al. [2006]), it further reduces statistical power.
25
The reliabilities of the different cognitive and non-cognitive tests used in the Head Start experiment are
as follows (3 year old figure shown first in parentheses, followed by the figure for 4 year olds; only
reliabilities for the pooled 3 and 4 year old samples are available for the non-cognitive outcomes): WJ
Word (.87, .90); Letter naming (.96, .97); McCarthy Drawing Score (.65, .73); WJ Spelling (.74, .78);
PPVT (.66, .80); Color naming (.94, .94); WJ Oral Comprehension (.80, .88); WJ Applied Problems (.90,
.91); Social skills and approaches to learning (.62); Social competencies (.58); Total problem behavior
(.74); Hyperactive behavior (.58); Aggressive behavior (.6); and Withdrawn behavior (.45).
Table 1 illustrates the basic issue confronting policymakers. Recall that the
estimates presented by Greg Duncan and his colleagues [2006] suggest that early math
scores might be the strongest predictors of children’s later cognitive outcomes. Section
III above suggests that short-term impacts of .1 to .2 standard deviations might be large
enough for Head Start to pass a benefit-cost test. The effects of Head Start participation implied by the
Head Start experimental data for early math scores for 3 and 4 year olds equal .18 and .15
standard deviations, respectively – neither of which is statistically significant!
One alternative analytic approach would have been to pool the 3 and 4 year old
samples in the Head Start experiment. While the Westat report does not present these
analyses, with data on the separate impact estimates, sample sizes and standard deviations
for the 3 and 4 year old samples we can approximate what the impact estimates and
standard errors would be if Westat had pooled the two age groups together for analysis.
Our calculations only allow us to compute standard errors that do not benefit from the
improved precision afforded by adjusting for baseline covariates, and so our pooled
standard errors are if anything conservative. Our calculations suggest that for a pooled
sample of 3 and 4 year olds the Head Start impact estimate would be statistically
significant for every cognitive outcome domain shown in Table 1 except oral
comprehension.
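One standard way to approximate such a pooled estimate is an inverse-variance-weighted average of the age-specific estimates. The sketch below (our own illustration, which may differ in detail from the approximation described above) applies it to the Woodcock-Johnson applied problems ITT estimates from Table 1:

```python
# Fixed-effects (precision-weighted) pooling of two independent estimates.
import math

def pool(est_a, se_a, est_b, se_b):
    """Inverse-variance-weighted average of two independent impact estimates."""
    w_a, w_b = 1 / se_a**2, 1 / se_b**2     # precision weights
    pooled = (w_a * est_a + w_b * est_b) / (w_a + w_b)
    return pooled, math.sqrt(1 / (w_a + w_b))

# Woodcock-Johnson applied problems ITT estimates from Table 1.
est, se = pool(0.124, 0.083, 0.100, 0.070)
print(f"pooled estimate {est:.3f}, SE {se:.3f}, t-ratio {est / se:.2f}")
# -> roughly .110 with SE .054, a t-ratio near 2: pooling the two age groups
# pushes this estimate to the margin of statistical significance.
```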
Perhaps more importantly, while scientific convention is to ignore estimates that
are not statistically significant at the conventional 5 percent level (that is, assume they are
zero), we believe that a more productive way to proceed for policy purposes is to focus
on the expected value of the program benefits and costs, as suggested by Cook and
Ludwig [2006]. The reason is that following the course of action associated with the null
hypothesis of no statistically significant impact is itself a policy decision that winds up
being overly privileged if we only follow through on point estimates that meet the usual
standard for statistical significance.
To see the difference, we revisit the hypothetical program we introduced in
Section II above, which we assume increases children’s test scores by .2 standard
deviations at a cost of just a nickel per child. Suppose that a randomized experimental
evaluation of this intervention yielded a point estimate for a treatment effect of .2
standard deviations, but that the standard error was somewhat large and so the p-value for
this estimate was equal to .8. While no referee worth her salt would endorse a scientific
manuscript that claimed that this intervention “works,” at the same time she would surely
wish that her own child’s school district jumped at the chance to adopt this program.
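The arithmetic behind this intuition is simple; the sketch below spells it out, treating the dollar value of a one standard deviation test-score gain as a purely hypothetical placeholder:

```python
# Expected net benefit per child, taking the point estimate at face value.
DOLLARS_PER_SD = 1000.0   # hypothetical dollar value of a 1 SD gain per child

def expected_net_benefit(effect_sd, cost_per_child):
    """Expected benefits minus costs per child; the p-value never enters."""
    return effect_sd * DOLLARS_PER_SD - cost_per_child

# The hypothetical program above: a .2 SD point estimate at a nickel per child.
# Even with a p-value of .8, the expected net benefit is clearly positive.
print(expected_net_benefit(0.2, 0.05))   # -> 199.95
```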
This sort of expected value framework suggests that Head Start as it currently
operates is likely to pass a benefit-cost test. There are good reasons to believe that short-term
impacts on reading and math scores on the order of .1 to .2 standard deviations, and
perhaps much smaller than that, would be large enough for Head Start to generate
benefits in excess of costs. Table 1 shows that most of the point estimates for Head
Start’s effects on cognitive skills for both 3 and 4 year olds are of about this magnitude,
even when these estimates are not statistically significant for the two age samples
considered separately.
VI. HEAD START ALTERNATIVES
The fact that the current incarnation of Head Start seems to pass a benefit-cost test
does not rule out the possibility that there could be even more cost-effective ways of
deploying Head Start resources. One possibility that has figured prominently in debates
about Head Start is to make the program more academically oriented, rather than focused
on providing a broad range of academic, health, nutrition, and social services to
disadvantaged children. The assumption is that focusing a greater share of children’s
time in the program on academic instruction will generate stronger achievement
outcomes. Some observers point to larger impact estimates that have been reported from
recent studies of new universal state pre-K programs, which are more narrowly focused
on instructional activities. They suggest that we should make Head Start operate more
like those programs, particularly with respect to the state pre-K requirements that
teachers have four-year college degrees, or even divert funding from Head Start to the
state programs. These proposals hold some intuitive appeal. However, the benefits
associated with these changes in practice are uncertain and there is some downside risk,
and so the expected value of these proposed changes to Head Start remains unclear at the
present time.
The recent Head Start experimental evaluation provides rigorous information
about the short-term impacts of Head Start as it has operated since the program’s inception,
as a comprehensive program focused on nutrition, physical and mental health, parenting
and social services as well as education. The studies of Currie and Thomas [1995],
Garces, Thomas and Currie [2002], and Ludwig and Miller [2007] also provide what is to
us persuasive evidence for the long-term impacts of Head Start as the program was
originally designed. To date there is no evaluation evidence available about what would
be achieved for current recipients by a new version of Head Start that was more
academically oriented.
Several recent studies of universal state pre-K programs suggest impressively
large impact estimates. Gormley et al. [2005] evaluate the effects of Tulsa, Oklahoma’s
pre-K program and report TOT estimates equal to .8 standard deviations for the
Woodcock-Johnson-Revised (WJ-R) letter-word identification test (more than twice as
large as those found in the recent Head Start experiment), with effect sizes of .65 for the
WJ-R spelling test (almost three times as large as those reported for four year olds in the
Head Start experiment) and of .38 for the WJ-R applied problems math test (more than
twice as large as for four year olds in the Head Start experiment), all of which are
statistically significant. Barnett et al. [2005] examine pre-K programs in five separate
states and report effect sizes of .26 for the PPVT vocabulary test and .28 for the WJ-R
applied problems test, both of which are statistically significant.
What explains the difference in impact estimates between these state pre-K
programs and Head Start? One candidate explanation is that the pre-K programs that
have been evaluated to date require all teachers to hold four-year college degrees, while
Head Start does not impose that requirement. Teachers in these state pre-K programs
will presumably also have higher salaries than Head Start teachers, given the difference
in average educational attainment. But most of the state pre-K programs report average
per-student costs below those of Head Start, 26 perhaps because they employ larger class
sizes (for example 10:1 in the Tulsa program compared with around 6 or 7 to 1 in Head
Start) or potentially because of differences in how cost estimates for the two types of
programs account for fixed costs.
26
For example Gormley and Gayer [2005] report per-pupil costs for 2005 for Tulsa pre-K of $3,500 to
$6,000 for the full-day version of the program, which is less than the $7,000 figure for Head Start that
represents an average of half- and full-day students.
An alternative explanation comes from the fact that the state pre-K programs that
have been recently evaluated are universal while Head Start is targeted mostly at very
low-income children. If there are positive spillover effects from attending school with
more affluent or higher-achieving children then “peer effects” could account for part of
the difference in impacts between pre-K and Head Start.
A third candidate explanation for the difference in impact estimates for state
universal pre-K programs and Head Start is the possibility of bias within the recent
evaluations of state pre-K programs. While these recent state pre-K studies are major
improvements over anything that has been done to examine such programs in the past,
they are nonetheless all derived using a research design that may be susceptible to bias of
unknown sign and magnitude. Specifically, these recent studies all use a regression
discontinuity design that compares fall semester tests for kindergarten children who
participated in pre-K the previous year and have birthdates close to the cutoff for having
enrolled last year with fall tests of children who are just starting pre-K by virtue of having
birthdates that just barely excluded them from participating the previous year. One
identifying assumption here is that the selection process of children into pre-K is
“smooth” around the birthday enrollment cutoff, but this need not be the case since there
is a discrete change at the birthday threshold in terms of the choice set that families face
in making this decision.
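A stylized simulation (our own construction, with entirely hypothetical numbers, not code from these studies) may help fix ideas about the design and the selection concern discussed next:

```python
# Toy regression discontinuity contrast around a birthday enrollment cutoff:
# children just on one side attended pre-K last year, those just on the
# other side did not. With smooth selection, the contrast recovers the effect.
import numpy as np

rng = np.random.default_rng(1)
n, window = 5000, 30                        # births within 30 days of cutoff

days_from_cutoff = rng.integers(-window, window + 1, n)
attended_pre_k = days_from_cutoff >= 0      # made last year's cutoff

effect = 0.4                                # hypothetical pre-K effect, SD units
score = effect * attended_pre_k + rng.normal(0, 1, n)

rd = score[attended_pre_k].mean() - score[~attended_pre_k].mean()
print(f"RD estimate: {rd:.2f}")             # ~ .40 when selection is smooth
# If the most motivated families just below the cutoff instead left for
# private programs (and so drop out of the comparison group), this contrast
# would be biased, as the example below describes.
```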
For instance, suppose that among the children whose birthdays just barely
excluded them from enrolling in pre-K during the previous year, those with the most
motivated parents wound up being sent the previous year to private programs that are
analogous to the public pre-K program and are then enrolled in private kindergarten
programs in the fall semester in which the pre-K study outcome measures are collected. This
type of selection would reduce the share of children with more motivated parents in the control
group in the pre-K studies and lead those studies to overstate the benefits of pre-K participation.
Moreover the pre-K evaluations that have been done to date focus on those states
that are leaders in this area. The experiences of pre-K programs in these states may or
may not reflect the average pre-K effect we would observe if we made a wholesale shift
of resources from Head Start to pre-K.
The critical policy question is whether such a shift would create the possibility of
greater benefits or of harm. Presently, this is an unanswerable question. The recent Head
Start experimental evaluation, as well as the on-going evaluation of Early Head Start,
has pointed in the direction of beneficial impacts on both cognitive and non-cognitive
outcome domains (e.g., social, emotional, and health outcomes), even if not all of the
impact estimates are statistically significant. Previous studies have also found beneficial
Head Start impacts on health outcomes and on crime reduction [Garces et al., 2002;
Ludwig and Miller, 2007; Frisvold, 2007]. Changing Head Start’s design to make the
program more academic, or to look more like existing universal state pre-K programs, or
even to shift Head Start funding to state programs that sometimes rely on mixed delivery
systems could potentially generate improved academic outcomes, but the possible
impacts on these other important domains of development remain unknown. While
evaluations of high-quality, intensive early childhood interventions have found positive
short- and long-term impacts on social-emotional outcomes, studies focusing on
community-based child care have found some unfavorable social outcomes with greater
participation, especially in center-based care [Magnuson, Ruhm and Waldfogel, 2004;
Zaslow, 2006]. Studies of state-funded universal pre-K programs have not yet reported
findings for social-emotional outcomes. As a result, policy actions that would shift or
withdraw resources from Head Start are risky.
It is important to recognize that a different kind of risk from changing Head Start
comes from the fact that the resources required to implement some of the proposed
changes have some opportunity cost, since the funding in question could in principle have
been devoted to other uses, including other social programs. For instance, an
increasingly common proposal is to require Head Start teachers to hold four-year college
degrees. This change would require higher salaries to recruit and retain more highly
educated teachers, which would require either more spending for the Head Start program
as a whole or else reductions in other parts of the Head Start budget. Even knowing that
requiring BA-level teachers leads to improved student outcomes would not be sufficient
to endorse this policy from an economist’s perspective. We would want to know how
these gains compare to what could be achieved from devoting those extra resources to
other uses such as further reducing class sizes in Head Start, 27 expanding the program’s
coverage to more eligible low-income children, or improving pre-natal health and
outreach services to low-income women. Given these downside risks and opportunity
costs, it is not yet possible to determine whether the alternative uses of Head Start
funding that have been proposed have positive or negative expected value.
27
Currie and Neidell [forthcoming] suggest that redirecting resources to increase teacher qualifications and
salaries within the existing Head Start budget at the expense of small class sizes would on net lead to
worse student outcomes.
VII. CONCLUSIONS
There is credible evidence that Head Start generates long-term benefits and passes
a benefit-cost test, at least for children who participated during the first few decades of
the program. For the current version of Head Start, we have rigorous evidence of short-term
impacts from a recent experimental evaluation but no direct data on long-term
effects since experimental subjects have just recently finished participating in the
program. However there are reasons to believe that with a cost of $7,000 per child Head
Start does not need to yield very large short-term test score impacts in order to pass a
benefit-cost test. Effect sizes of .1 or .2 might be enough, and impacts even smaller than
this, perhaps much smaller, might be sufficient. The estimated effects of Head Start
enrollment on children – the effects of treatment on the treated – implied by the recent
experimental study of the program typically exceed this threshold. Many point estimates
are not statistically significant when the results are presented separately for 3 and 4 year
old participants, but the expected value of the program is still positive.
We certainly do not mean to claim that Head Start is a perfect program that
cannot be improved. It is possible that modifying the program in some of the ways that
have been discussed in recent years, such as increasing the program’s academic focus to
better target those skills that predict later literacy [Zaslow, 2006], or requiring teachers to
hold a four-year college degree, could make the program more effective or even more
cost-effective. But there is some uncertainty about the benefits that would be achieved
by such changes, and there is some downside risk associated with each of these proposals
– particularly when one recognizes that the resources required to implement them entail
some opportunity cost.
In sum, the available evidence suggests to us that the Head Start program as it
currently operates probably passes a benefit-cost test. Changing the program in various
ways that have figured prominently in recent policy discussions may not make the
program any better, and could make things worse.
Table 1: Intent-to-treat (ITT) Effect Sizes from the National Head Start Impact
Study and Estimated Effects of Treatment on the Treated (TOT)

Outcome                            3 year olds           4 year olds
                                   ITT       TOT         ITT       TOT
Woodcock-Johnson letter            .235*     .346*       .215*     .319*
  identification                   (.074)    (.109)      (.099)    (.147)
Letter naming                      .196*     .288*       .243*     .359*
                                   (.080)    (.117)      (.085)    (.126)
McCarthy draw-a-design             .134*     .197*       .111      .164
                                   (.051)    (.075)      (.067)    (.100)
Woodcock-Johnson spelling          .090      .132        .161*     .239*
                                   (.066)    (.096)      (.065)    (.097)
PPVT vocabulary                    .120*     .170*       .051      .075
                                   (.052)    (.077)      (.052)    (.076)
Color naming                       .098*     .144*       .108      .159
                                   (.043)    (.064)      (.071)    (.107)
Parent-reported literacy skills    .340*     .499*       .293*     .435*
                                   (.066)    (.097)      (.075)    (.112)
Oral comprehension                 .025      .036        -.058     -.086
                                   (.062)    (.091)      (.052)    (.077)
Woodcock-Johnson applied           .124      .182        .100      .147
  problems                         (.083)    (.122)      (.070)    (.103)

First and third columns reproduce ITT impact estimates for all cognitive outcomes reported in Westat’s
Executive Summary of the first year findings report from the National Head Start Impact Study, reported
as effect sizes, i.e., program impacts divided by the control group standard deviation [Puma et al., 2005].
Standard errors are shown in parentheses, also in effect size terms; these were not included in the Westat
report but were generously shared with us by Ronna Cook of Westat. Second and fourth columns are our
own estimates of the effects of treatment on the treated (TOT), derived using the approach of Bloom
[1984], which divides the ITT point estimates and standard errors by the treatment-control difference in
Head Start enrollment rates. For 3 year olds the adjustment divides the ITT figures by (.894 - .213) =
.681; for 4 year olds it divides them by (.856 - .181) = .675 (see Exhibit 3.3, Puma et al., 2005, p. 3-7).
* = Statistically significant at the 5 percent level.
References
Barnett, W. Steven and Kenneth B. Robin. (Undated) “How much does quality preschool
cost?” Rutgers University, National Institute for Early Education Research.
Barnett, W. Steven, Cynthia Lamy and Kwanghee Jung (2005) “The effects of state
prekindergarten programs on young children’s school readiness in five states.” Rutgers
University, National Institute for Early Education Research.
Barnett, W. Steven, J.W. Young and L.J. Schweinhart (1998) “How preschool education
influences long-term cognitive development and school success.” In W.S. Barnett and
S.S. Boocock, Eds. Early care and education for children in poverty: Promises,
programs, and long-term results (pp. 167-184). Albany: State University of New York
Press.
Belfield, Clive R., Milagros Nores, W. Steven Barnett and Lawrence J. Schweinhart
(2006) “The High/Scope Perry Preschool Program: Cost-Benefit Analysis Using Data
from the Age-40 Followup.” Journal of Human Resources. XLI(1): 162-190.
Besharov, Douglas J. (2005) “Head Start’s Broken Promise.” American Enterprise
Institute, On the Issues.
Bloom, Howard S. (1984) “Accounting for no-shows in experimental evaluation
designs.” Evaluation Review. 8(2): 225-46.
Bloom, Howard S. (2005) “Randomizing groups to evaluate place-based programs.” In
Learning More from Social Experiments: Evolving Analytic Approaches, edited by
Howard S. Bloom. New York: Russell Sage Foundation.
Card, David (2001) “Estimating the returns to schooling: Progress on some persistent
econometric problems.” Econometrica. 69(5): 1127-60.
Carneiro, Pedro and James J. Heckman (2003) “Human Capital Policy.” In Inequality in
America: What Role for Human Capital Policies? James J. Heckman and Alan B.
Krueger. Cambridge, MA: MIT Press.
Cohen, Jacob (1977) Statistical Power Analysis for the Behavioral Sciences. New York:
Academic Press.
Cook, Philip J. and Jens Ludwig (2006) “Aiming for evidence-based gun policy.”
Journal of Policy Analysis and Management. 25(3): 691-736.
Currie, Janet (2001) “Early Childhood Education Programs.” Journal of Economic
Perspectives. 15(2): 213-238.
Currie, Janet and Duncan Thomas (1993) “Does Head Start Make a Difference?”
Cambridge, MA: NBER Working Paper 4406.
Currie, Janet and Duncan Thomas (1995) “Does Head Start Make a Difference?”
American Economic Review. 85(3): 341-364.
Currie, Janet and Duncan Thomas (1999) “Early test scores, socioeconomic status, and
future outcomes.” NBER Working Paper 6943.
Currie, Janet and Matthew Neidell (forthcoming) “Getting Inside the ‘Black Box’ of
Head Start Quality: What Matters and What Doesn’t?” Economics of Education Review.
Duncan, Greg J., Chantelle J. Dowsett, Amy Claessens, Katherine Magnuson, Aletha C.
Huston, Pamela Klebanov, Linda Pagani, Leon Feinstein, Mimi Engel, Jeanne
Brooks-Gunn, Holly Sexton, Kathryn Duckworth, and Crista Japel (2006) “School
Readiness and Later Achievement.” Working Paper, Northwestern University.
Duncan, Greg J. and Katherine Magnuson (forthcoming) “Penny wise and effect size
foolish.” Child Development Perspectives.
Frisvold, David (2007) “Head Start Participation and Childhood Obesity.” Paper
presented at the Allied Social Science Association Meetings, January 2007, Chicago.
Garces, Eliana, Duncan Thomas, and Janet Currie (2002) “Longer Term Effects of Head
Start.” American Economic Review. 92(4): 999-1012.
Gormley, William T. and Ted Gayer (2005) “Promoting School Readiness in Oklahoma:
An Evaluation of Tulsa’s Pre-K Program.” Journal of Human Resources. XL(3): 533-558.
Gormley, William T., Ted Gayer, Deborah Phillips and Brittany Dawson (2005) “The
effects of universal pre-K on cognitive development.” Working Paper, Georgetown
University, Center for Research on Children in the United States.
Gruber, Jonathan and Botond Koszegi (2002) “A Theory of Government Regulation of
Addictive Bads: Optimal Levels and Tax Incidence for Cigarette Excise Taxation.”
NBER Working Paper 8777.
Gruber, Jonathan and Sendhil Mullainathan (2002) “Do Cigarette Taxes Make Smokers
Happier?” NBER Working Paper 8872.
Harris, Douglas N. (2007) “New benchmarks for interpreting effect sizes: Combining
effects with costs.” Working Paper, University of Wisconsin at Madison.
Haskins, Ron (2004) “Competing Visions.” Education Next.
Hinshaw, Stephen P. (1992) “Externalizing behavior problems and academic
underachievement in childhood and adolescence: Causal relationships and underlying
mechanisms.” Psychological Bulletin. 111: 127-154.
Holzer, Harry, Diane Whitmore Schanzenbach, Greg J. Duncan, and Jens Ludwig (2007)
The Economic Costs of Poverty. Washington, DC: Center for American Progress.
Jencks, Christopher and Meredith Phillips (1998) “The black-white test score gap: An
introduction.” The Black-White Test Score Gap, edited by Christopher Jencks and
Meredith Phillips. Washington, DC: Brookings Institution Press. pp. 1-54.
Jimerson, S., B. Egeland, and A. Teo (1999) “A longitudinal study of achievement
trajectories: Factors associated with change.” Journal of Educational Psychology. 91(1):
116-126.
Knudsen, Eric I., James J. Heckman, Judy L. Cameron, and Jack P. Shonkoff (2006)
“Economic, neurobiological, and behavioral perspectives on building America’s future
workforce.” Proceedings of the National Academy of Sciences. 103: 10155-10162.
Krueger, Alan B. (2003) “Economic considerations and class size.” Economic Journal.
Lochner, Lance and Enrico Moretti (2004) “The effect of education on crime: Evidence
from prison inmates, arrests, and self-reports.” American Economic Review. 94(1): 155-189.
Ludwig, Jens (2006) “The Costs of Crime.” Testimony to the U.S. Senate Judiciary
Committee, September, 2006.
Ludwig, Jens and Douglas L. Miller (2007) “Does Head Start Improve Children’s Life
Chances? Evidence from a Regression-Discontinuity Design.” Quarterly Journal of
Economics. 122(1): 159-208.
Magnuson, Katherine A., Christopher J. Ruhm, and Jane Waldfogel (2004) “Does
prekindergarten improve school preparation and performance?” NBER Working Paper
10452.
Mayer, Susan E. (1997) What Money Can’t Buy. Cambridge, MA: Harvard University Press.
Miles, Sarah B. and Deborah Stipek (2006) “Contemporaneous and longitudinal
associations between social behavior and literacy achievement in a sample of low-income
elementary school children.” Child Development. 77(1): 103-117.
Murnane, Richard J., John B. Willett, and Frank Levy (1995) “The growing importance
of cognitive skills in wage determination.” Review of Economics and Statistics. 77(2):
251-266.
Phillips, Deborah, Kathleen McCartney, and Amy Sussman (2006) “Child care and early
development.” In The Handbook of Early Child Development, edited by Kathleen
McCartney and Deborah Phillips. Blackwell Publishers.
Puma, Michael, Stephen Bell, Ronna Cook, Camilla Heid, Michael Lopez, et al. (2005)
Head Start Impact Study: First Year Findings. Westat. Report Prepared for the U.S.
Department of Health and Human Services.
Rock, Donald A. and A. Jackson Stenner (2005) “Assessment Issues in the Testing of
Children at School Entry.” The Future of Children. 15(1): 15-34.
Schanzenbach, Diane Whitmore (forthcoming) “What have researchers learned from
Project STAR?” Brookings Papers on Education Policy.
Schweinhart, Lawrence J., Jeanne Montie, Zongping Xiang, W. Steven Barnett, Clive R.
Belfield, and Milagros Nores (2005) Lifetime Effects: The High/Scope Perry Preschool
Study Through Age 40. Ypsilanti, MI: High/Scope Press.
Thun, Michael J. and Ahmedin Jemal (2006) “How much of the decrease in cancer death
rates in the United States is attributable to reductions in tobacco smoking?” Tobacco
Control. 15: 345-347.
Vinovskis, Maris A. (2005) The Birth of Head Start: Preschool Education Policies in the
Kennedy and Johnson Administrations. Chicago: University of Chicago Press.
Westinghouse Learning Corporation (1969) The Impact of Head Start: An Evaluation of
the Effects of Head Start on Children’s Cognitive and Affective Development. Executive
Summary. Ohio University Report to the Office of Economic Opportunity. Washington,
DC: Clearinghouse for Federal Scientific and Technical Information, June 1969.
Zaslow, Martha (2006) “Issues for the Learning Community from the First Year Results
of the Head Start Impact Study.” Plenary Presentation to the Head Start Eighth National
Research Meeting, June 27, 2006.