IDB WORKING PAPER SERIES No. IDB-WP-438

Theory and Evidence on Teacher Policies in Developed and Developing Countries

Emiliana Vegas
Alejandro J. Ganimian*

Inter-American Development Bank
Education Division
August 2013

Cataloging-in-Publication data provided by the Inter-American Development Bank Felipe Herrera Library

Vegas, Emiliana.
Theory and evidence on teacher policies in developed and developing countries / Emiliana Vegas, Alejandro Ganimian.
p. cm. (IDB working paper series ; 438)
Includes bibliographical references.
1. Teacher effectiveness. 2. Teachers—Training of. 3. Teachers—Rating of. 4. Teachers—Recruiting. 5. Teachers—Selection and appointment. I. Ganimian, Alejandro. II. Inter-American Development Bank. Education Division. III. Title. IV. Series.
IDB-WP-438

http://www.iadb.org

The opinions expressed in this publication are those of the authors and do not necessarily reflect the views of the Inter-American Development Bank, its Board of Directors, or the countries they represent. The unauthorized commercial use of Bank documents is prohibited and may be punishable under the Bank's policies and/or applicable laws.

Copyright © 2013 Inter-American Development Bank. This working paper may be reproduced for any non-commercial purpose. It may also be reproduced in any academic journal indexed by the American Economic Association's EconLit, with previous consent by the Inter-American Development Bank (IDB), provided that the IDB is credited and that the author(s) receive no income from the publication.

Abstract

The past decade has seen the emergence of numerous rigorous impact evaluations of teacher policies. This paper reviews the economic theory and empirical evidence on eight teacher policy goals: (1) setting clear expectations for teachers; (2) attracting the best into teaching; (3) preparing teachers with useful training and experience; (4) matching teachers' skills with students' needs; (5) leading teachers with strong principals; (6) monitoring teaching and learning; (7) supporting teachers to improve instruction; and (8) motivating teachers to perform. The paper also discusses key concepts and methods in econometrics to understand existing studies and offers some directions for future research.

JEL codes: I250 Education and Development; J450 Public Sector Labor Markets; I210 Analysis of Education.

* The authors would like to thank Susanna Loeb, Pilar Romaguera, Agustina Paglayán, Analía Jaimovich, Andrew Trembley, and Nicole Goldstein, who were key members of the World Bank's SABER Teacher Project team, from which we draw extensively in this paper. Emiliana Vegas is Chief of the Education Division at the Inter-American Development Bank (evegas@iadb.org). Alejandro J. Ganimian is a Doctoral Candidate in Quantitative Policy Analysis in Education at the Harvard Graduate School of Education, and a Doctoral Fellow in the Multidisciplinary Program in Inequality and Social Policy at the Harvard Kennedy School of Government (alejandro_ganimian@mail.harvard.edu).

Introduction

The past decade has seen the emergence of numerous rigorous impact evaluations of teacher policies. These studies differ from previous research in three ways that make them useful to inform policy decisions.
First, instead of relying on proxies for student learning and teacher effectiveness, such as student enrollment or teacher certification rates, they capitalize on data on student learning from national and international assessments. Second, rather than identifying associations between policies and outcomes, they employ methods that allow them to distinguish the effects of interventions from other factors that may confound those effects. Finally, while they assess the impact of specific reforms, they also explore how these reforms interact with one another. This paper distills the main lessons from the most recent of these studies on teacher policies. The first section provides an analytical framework to structure the review of the evidence. Using this framework, the second section discusses the economic theory that is driving empirical work on teacher policies and the key findings of the most rigorous studies in this area. The final section concludes by making sense of the theory and evidence on the impact of various teacher policies on student outcomes, and by highlighting the main findings and questions for further research.

Analytical Framework: Teacher Policies as a System

One way to think about the issues that have motivated research on teacher policies is to organize them according to the challenges that education systems face at different stages of teachers' careers—ranging from when individuals seek to enter teaching to when they join a school as teaching staff and become classroom instructors. The main policy issues that education systems face in managing teachers effectively are: (1) setting clear expectations for teachers; (2) attracting the best into teaching; (3) preparing teachers with useful training and experience; (4) matching teachers' skills with students' needs; (5) leading teachers with strong principals; (6) monitoring teaching and learning; (7) supporting teachers to improve instruction; and (8) motivating teachers to perform. While there are other ways to organize the evidence, this framework reminds us that policies such as certification requirements or evaluations are designed to achieve goals, and should thus be evaluated on the extent to which they reach them. The degree to which an education system addresses the challenges in one stage of the teacher pipeline can make it easier or harder to tackle the challenges in subsequent stages. Also, the extent to which an education system deals with these challenges for a stock of teachers at a point in time will make it easier or harder for the system to deal with challenges for the incoming flow of teachers. As we argue elsewhere, teacher policies form a dynamic system—changes in one area have repercussions on all others (Vegas et al. 2011).

Theory and Evidence: Teacher Labor Markets

This section draws on economic theory and empirics to discuss the importance of the teacher policy goals above and review the most rigorous available evidence on interventions involving each one of these goals. Economics has played an important role in recent rigorous research on teacher policies, including the problems identified and the solutions proposed to address these problems. So while there are many ways to examine the evidence on teacher policies, reviewing the economic theory that has driven empirical work can help us understand why the evidence has developed the way that it has. There are several aspects that distinguish recent research on teacher policies from previous empirical work.
Perhaps the most distinctive feature of recent studies is their capacity to estimate the effects of interventions on student and teacher outcomes. At first, the fact that this has not been the norm may seem surprising: all impact evaluations would appear to require studying these outcomes, given that the objective of interventions is to promote positive changes in them. However, until recently, most empirical work in education was unable to disentangle the effects of an intervention (known as "treatment effects") from the effects associated with the characteristics of those participating in the program ("selection effects"). This problem, known as "selection bias," has limited the impact of educational research on policy (Murnane and Willett 2011). Over the past decade, however, education researchers have made remarkable progress in their capacity to address selection bias. In particular, they have gotten better at providing reliable answers to two questions of prime interest to policymakers: (1) how would participants in a program have fared in the absence of the program? and (2) how would nonparticipants have fared in the presence of the program? Answering these questions definitively is impossible because at any point in time an individual is either participating or not participating in a program. Approximating an answer by comparing program participants before and after their participation is not good enough, since several things could have happened to those individuals while they participated in the program that could have influenced the outcomes to be measured. This is known as the "program evaluation problem" (Duflo, Glennerster and Kremer 2008). The innovation in recent research has been to use methods to find an adequate comparison group for program participants and to offer a credible estimate of the two "counterfactual" questions above.
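To see why a naive comparison of participants and nonparticipants fails, it helps to write the problem down. The following is a standard decomposition from the program evaluation literature, not notation used in the original studies: Y(1) and Y(0) denote a person's outcomes with and without the program, and D indicates participation.

```latex
% Difference in observed means between participants and nonparticipants:
\underbrace{E[Y(1)\mid D=1] - E[Y(0)\mid D=0]}_{\text{observed difference}}
= \underbrace{E[Y(1)-Y(0)\mid D=1]}_{\text{treatment effect on participants}}
+ \underbrace{E[Y(0)\mid D=1] - E[Y(0)\mid D=0]}_{\text{selection bias}}
```

The second term is zero only when participants would have fared the same as nonparticipants in the absence of the program, which is precisely the condition the methods reviewed below try to guarantee or approximate.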
The latest, most rigorous generation of studies on teacher policies in both developed and developing countries uses econometric methods such as randomized controlled trials, differences-in-differences analysis, instrumental variables estimation, and regression discontinuity designs to offer unbiased estimates of the effects of teacher policies. This review focuses on rigorous studies that assess the impact of teacher policies on student learning, as measured by standardized tests. The aim is not to imply that increased student learning is the only outcome of a well-functioning education system.3 Instead, the focus is driven by the increasing interest of both developed and developing country governments in education policies that raise student achievement, and by a body of evidence that links learning to economic development (Hanushek and Woessmann 2007). Reference is made to less-rigorous studies (using propensity score matching or fixed effects) in two cases: (1) when there are no rigorous studies on a given teacher policy; and/or (2) when the objective is to draw attention to potential differences in the evidence between developed and developing countries. The boxes in this paper feature brief overviews of the methods employed in the studies as well as some of the commonly used statistical concepts. The aim is to bring policymakers and other stakeholders into the discussion of the evidence and make them active contributors.

3 Rigorous research has documented the impact of education on important short-term outcomes such as student satisfaction and long-term outcomes such as labor market outcomes, educational attainment, and crime. These outcomes are not always captured by test scores (see Chetty et al. 2011; Deming 2011; Kane and Staiger 2010, 2012; Kemple and Willner 2008).

The effects of programs are mostly reported in standard deviation units, which are commonly used in statistics to measure how spread out values are within a distribution (e.g., how much students' test scores vary in a class). A standard deviation is calculated through a process that might seem complicated and arbitrary to a nontechnical audience. Yet, standard deviations are used because they offer a common metric to compare the effects of different programs on outcomes such as student test scores. In the social sciences in general, effect sizes of .80 of a standard deviation are considered to be large, effects of .50 moderate, and effects of .20 small (Cohen 1988). As noted by Murnane and Willett (2011), however, even the most successful interventions in education have small effects by these standards. In interpreting standard deviations, it helps to think about them in one of five ways. One way is in terms of what they say about how much student learning improves, typically in student achievement tests of math and reading. If one assumes that students' skills follow a "normal" distribution so that most students are clustered around an average skill level and there are few at the lowest and highest levels—and the psychometricians who design tests make sure this is the case—then an effect size in standard deviations can be easily translated into a change in a student's percentile rank (Box 1). For example, an improvement by .25 of a standard deviation means that a student previously performing at the 50th percentile now performs at the 60th percentile.

Box 1. What Are Percentile Ranks?
In statistics, percentile ranks are often used to understand where an individual falls within a distribution. Calculating a percentile rank is straightforward, but it is not necessary to understand how to do so in order to understand what percentile ranks mean. The rankings are simply the percentage of people below that individual. For example, if a test is administered and a student scores in the 10th percentile, that means that 10 percent of the student's peers scored below him or her—or equivalently, that 90 percent of the student's peers outscored him or her. Similarly, an improvement from the 10th to the 15th percentile means that while a student initially outscored 10 percent of his/her peers, the student now outscores 15 percent of them.
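The translation from effect sizes to percentile ranks is simple arithmetic under the normality assumption described above. A minimal sketch in Python (all numbers are illustrative, not drawn from any study):

```python
# Converting an effect size into a percentile-rank change, assuming
# test scores follow a normal distribution (as described above).
import numpy as np
from scipy.stats import norm

scores = np.array([52.0, 61.0, 47.0, 58.0, 55.0])  # hypothetical class scores
sd = scores.std(ddof=1)                  # the "spread" the text refers to
print(f"standard deviation: {sd:.1f}")

effect_size = 0.25                       # improvement in standard deviation units
# A student at the 50th percentile sits at the mean of a normal distribution
# (z = 0); adding the effect size and converting back gives the new rank.
new_rank = norm.cdf(norm.ppf(0.50) + effect_size)
print(f"new percentile rank: {new_rank:.0%}")   # ~60%, i.e., the 60th percentile
```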
Another way to make sense of effect sizes is to compare them to reference points. Many studies in the United States compare the effects of a program to the black-white achievement gap, which is between .65 and .75 of a standard deviation (Reardon 2011). Others compare effect sizes to "grade equivalents"—i.e., how much learning takes place in a year of school. Kane and Staiger (2012) equate .25 of a standard deviation to nine months of schooling in grades 4 through 8. Yet others understand these effects in terms of the tangible benefits that they have for children. Krueger (2003) has estimated that a 1 standard deviation increase in elementary math scores is associated with an 8 percent increase in adult earnings. It is also possible to compare effect sizes of new policies to those of widely known interventions. In the United States, many compare the effects of new interventions to that of a high-profile class-size reduction policy discussed later in this paper, which yielded a .20 of a standard deviation effect (Hanushek 2003).

Setting Clear Expectations for Teachers

Economic theory indicates that setting clear expectations is important from the point of view of both recruiting and managing effective teachers. First, expectations influence how potential entrants into teaching perceive the profession. As Roy (1951) and Borjas (1987) argued, individuals "self-select" into job markets by choosing the one that is most likely to give them the highest expected earnings, based on their assessment of the skills that they possess and those that are demanded by the job. Professional standards can give employers a mechanism by which to "screen" those who are most likely to succeed (Stiglitz 1975). A profession with low or unclear standards creates an imbalance in the information that employers and potential employees have about the skills of the latter. This is known in economics as "information asymmetry." It is problematic because it encourages low-skilled individuals to try to enter professions for which they might be unprepared, discouraging high-skilled entrants—a problem that economists refer to as "adverse selection" (Akerlof 1970; Greenwald 1986). Second, expectations can guide teachers' work. As Baker (1992) and Prendergast (2002) indicated, clear professional standards can minimize "principal-agent problems" in which a principal (e.g., an education system) hires an agent (e.g., a teacher) to perform tasks that are in the interest of the principal but that are costly to the agent and difficult to observe (e.g., raising student learning). Professional standards can mitigate these problems by serving as the basis for a "contract" in which the employer and employee agree on the tasks to be performed and align the ways in which performance will be measured with the employee's compensation. A body of research focuses on the extent to which expectations for teachers' work are clear, fair, and/or ambitious enough to produce the desired levels of student learning. These expectations include the explicit guidelines for teachers' work in the classroom, such as curricula or learning standards, and characteristics of the system, such as the time allotted for instruction or the number of students per classroom. These guidelines implicitly dictate the conditions under which teachers are expected to conduct their work effectively. Rigorously evaluating expectations for teachers' work is challenging because such expectations generally apply to all students in a national or subnational education system. Therefore, as for other universal policies, it is hard for researchers to find an adequate comparison group. Yet, researchers have made important inroads in this field of study by tinkering with existing expectations and randomly assigning groups of schools within an education system to business-as-usual and new conditions—a method discussed below. Table 1 summarizes three types of interventions that have been rigorously evaluated: (1) scaffolding teachers' work; (2) increasing instructional time; and (3) reducing class size. The last two types of interventions might not seem as related to setting clear expectations as the first one, but they are included here because they inform our understanding of the adequacy of current expectations for teachers' work.
Scaffolding Teachers' Work

One way in which education systems have sought to improve student and teacher performance is by offering teachers more structured guidelines for their classroom work. One intervention in the United States that offers scaffolding for teachers is "Success for All" (SFA), a reading program with a highly structured school-wide curriculum that uses novels and basal readers, periodically regroups students across age and grade boundaries, and requires students to engage in reading at home. In 2000, Borman et al. (2007) evaluated SFA using a Randomized Controlled Trial (RCT) (Box 2). The authors randomly assigned 41 high-poverty schools into a grade K–2 SFA treatment or a grade 3–5 SFA treatment and compared kindergarten and first grade students in these two groups. They found that the program had positive effects on three literacy outcomes. The effect sizes ranged from .21 of a standard deviation on passage comprehension to .33 on the word attack measure. These results suggest that changing expectations in schools that serve disadvantaged children can potentially affect achievement by making teachers' jobs more manageable.

Box 2. What Is a Randomized Controlled Trial?
Randomized Controlled Trials (RCTs) have become increasingly popular. Their main benefit is that they allow researchers to obtain a "clean" estimate of a program's average effect by comparing the outcomes of individuals who participated in the program (the "treatment" group) to those of individuals who did not (the "control" group). When a program is randomly assigned, the only factor that distinguishes persons in the treatment and control groups is chance; neither group systematically differs in characteristics that affect the outcomes of interest. As a result, the threat of selection bias disappears and the effect can be estimated as the difference between the average outcomes of the treatment group and those of the control group.
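In code, the estimator Box 2 describes is nothing more than a difference in group means. A minimal sketch with simulated data (the sample size and the true effect are hypothetical):

```python
# With random assignment, the average treatment effect is simply the
# difference between the treatment and control group means.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000
treated = rng.integers(0, 2, size=n)              # random assignment (0 or 1)
true_effect = 0.20                                # in standard deviation units
scores = rng.normal(0, 1, size=n) + true_effect * treated

effect = scores[treated == 1].mean() - scores[treated == 0].mean()
se = np.sqrt(scores[treated == 1].var(ddof=1) / (treated == 1).sum()
             + scores[treated == 0].var(ddof=1) / (treated == 0).sum())
print(f"estimated effect: {effect:.2f} (standard error {se:.2f})")
```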
There are more rigorous evaluations of initiatives that increase the scaffolding provided to teachers in developing countries, where teacher capacity tends to be much lower. In fact, two recent studies in India suggest these types of interventions can have a considerable impact on teacher pedagogy and student learning. In Maharashtra, He, Linden, and MacLeod (2007) used an RCT to evaluate a program designed to teach English that could be implemented either through a specially designed machine or flashcards. They found that both versions of this program yielded gains in English achievement of about .30 of a standard deviation, and that they proved particularly effective with older and lower-performing students. They also found that the version of the program taught by teachers not only improved students' English scores, but also their math scores. This suggests investing in improving teachers' pedagogy (rather than in replacing it) had some positive spillover effects. The other study in India suggests, however, that while scaffolding might be effective at producing simple changes in teacher pedagogy, the low capacity of the teaching force in some schools might limit the extent to which more complex changes can be achieved. In 2004, in Mumbai, He, Linden, and MacLeod (2009) randomly assigned students in pre-schools, primary schools, and stand-alone reading classes either to a program that changed the reading curriculum and provided new activities for teachers, or to no treatment at all. The program produced gains in students' reading scores of .26–.70 of a standard deviation, and it was most effective with pre-school and low-performing students. Yet, the version that was taught out of school was more effective than the one during school time, yielding an additional .24 of a standard deviation effect. The greater impact of the program for out-of-school hours suggests scaffolding might not be enough to produce the types of changes in teacher pedagogy needed to radically improve student learning. Even relatively simple changes in classroom activities have been shown to improve learning in developing countries. In the Tarlac province of the Philippines in 2009, 5,510 fourth grade students were randomly assigned either to a treatment group that received in-service training for teachers, reading materials, and a reading marathon, or to a group that received no intervention (Abeberese, Kumler, and Linden 2011). One month after the program was implemented, the number of books students read increased from 2.3 to 9.5 and students' reading scores increased by .13 of a standard deviation. Further, the effects persisted with time. Three months after the reading marathon, treated students still read 3.1 books more than those in the control group and had reading scores that were .06 of a standard deviation higher. Two studies in Madagascar indicate, however, that systemic problems can limit the extent to which improvements in pedagogy can be taken to scale. In 2005, the country provided teachers with tools that specified their duties in detail (e.g., an operations manual). This initiative was evaluated using an RCT: researchers recruited 30 districts for the evaluation and randomly assigned 15 of them to receive this intervention, while letting the others continue to run their schools as they usually would. In a first study of this intervention, Lassibille et al. (2010) found that after two years schools where the program had been implemented had higher student attendance and lower repetition rates than control schools, but only when coupled with interventions at the district and subdistrict levels that improved workflow. However, even with these interventions, the test scores of the treatment schools were not statistically significantly higher than those of control schools. (See Box 3 for an explanation of statistically significant differences.) These results suggest that improving teacher management might not be enough when there are other systemic obstacles.

Box 3. What Are Statistically Significant Differences?
In simple terms, when a difference in the outcomes for two groups is "statistically significant," it means it is unlikely to have occurred by chance. The phrase has little to do with the size of a difference between two groups. In an evaluation, the group of students who participated in a program might have statistically significantly higher outcomes than the group of students who did not participate, but these outcomes might be only marginally higher. In order to determine whether the difference between two outcomes is statistically significant, statisticians have developed tests that calculate whether the difference could have been found by pure chance. If this probability is too high, researchers will be reluctant to claim that differences in outcomes are stable. Defining what it means for these probabilities to be "too high," however, is a somewhat arbitrary process. In fact, researchers in education typically report three levels of statistical significance: 10, 5, and 1 percent. A 10 percent significance level means that there is a .10 probability of encountering a difference like the one observed by chance; that is, that there is a 1 in 10 chance that the difference observed between two groups is an artifact of the sample selected for the study.
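The test Box 3 alludes to can be run in a few lines. A sketch with simulated data, using a standard two-sample t-test (the group sizes and score distributions are hypothetical; the 10, 5, and 1 percent thresholds are the conventional ones mentioned above):

```python
# How likely is a difference this large if the program truly had no effect?
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
treatment = rng.normal(0.10, 1, size=500)   # hypothetical outcome data
control = rng.normal(0.00, 1, size=500)

t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
print(f"p-value: {p_value:.3f}")
if p_value < 0.05:
    print("difference is statistically significant at the 5 percent level")
```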
In a second study of the same intervention, Glewwe and Maïga (2011) found that the average program effect was the same for civil service and contract teachers. The main difference between these two groups was that the former had tenure while the latter were hired through annual renewable contracts. This is an interesting question because Madagascar has hired a large number of contract teachers, who typically have less training but who may have higher incentives to work hard in order to get their contracts renewed. If contract type is a good proxy for teacher experience or aptitude, these results imply that improving the efficiency of inexperienced teachers (who arguably need it most) may not be enough to raise student achievement. These studies illustrate the potential of interventions that clarify expectations for teachers' work by producing improvements in pedagogy. But they also serve as a reminder that such initiatives might be hard to scale or ineffective in the presence of systemic obstacles.

Increasing Instructional Time

The United States has experimented with policies that expand the school day and year, and the results of these interventions are encouraging. Linden, Herrera, and Grossman (2011) evaluated a program in Washington, DC that, beginning in 2006, offered academic instruction, enrichment activities, and mentoring after school and during the summer to middle-school students. They found the program raised students' test scores by .09 of a standard deviation in reading and by .12 in math by the second year. They also found that students participating in the program were more likely to be proactive about enrolling in high school by seeking information about schools, visiting schools, and talking to adults and peers about schools and where to apply. In Chicago in 1996, the school district began mandating all students in grades 3, 6, and 8 who had not passed their end-of-year tests to attend a six-week remedial education program during the summer and then retake the exams. Students who still failed were held back a grade. Jacob and Lefgren (2004a) evaluated the effects of this summer program. Students were not randomly assigned to the program, so the researchers compared students who failed their exams by only a few points to those who passed their exams by a few points. Since students around the passing cutoff are likely to have a comparable level of skills—and given that all students who failed had to attend the summer camp while those who passed did not—comparing these two groups of students could yield a reliable estimate of the effect of the program for students around the passing cutoff. This empirical strategy is known as a Regression Discontinuity Design (Box 4).

Box 4. What Is a Regression Discontinuity Design?
Regression Discontinuity Designs (RDDs) are commonly used when individuals are assigned to a program according to whether they reach a cutoff in some type of score, such as student achievement tests or teacher certification exams. In these cases, researchers can get an estimate of a program's effects by comparing the outcomes of individuals who just missed the threshold to those who were just above it. If one makes the reasonable assumption that individuals around the cutoff can be expected to have comparable outcomes (e.g., if one assumes that students just passing and just failing a test have similar abilities), then comparing the outcomes of these two groups after the program is enacted will give an estimate of its average effect. Importantly, however, the estimate will only apply to individuals around the cutoff, which is why it is commonly called the "local" average treatment effect.
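A sketch of the comparison Box 4 describes, in the spirit of the summer school example: students just below the passing cutoff receive the program, students just above do not. The data, cutoff, bandwidth, and effect size below are all hypothetical:

```python
# Regression discontinuity: compare outcomes just on either side of a cutoff.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 5_000
score = rng.uniform(-50, 50, size=n)           # test score, cutoff centered at 0
treated = (score < 0).astype(float)            # failing students get the program
outcome = 0.02 * score + 0.15 * treated + rng.normal(0, 1, size=n)

# Local comparison: keep observations near the cutoff and fit
# outcome ~ treated + score, letting the slope differ on each side.
near = np.abs(score) < 10
X = sm.add_constant(np.column_stack([treated[near], score[near],
                                     treated[near] * score[near]]))
fit = sm.OLS(outcome[near], X).fit()
print(f"local treatment effect: {fit.params[1]:.2f}")   # close to 0.15
```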
Third graders who attended the summer camp gained about 20 percent of a year's worth of learning in the first year, with a 25–40 percent fadeout in the second year. Third graders who were held back also scored better than those who were not. Yet, Jacob and Lefgren (2004a) found no effects for sixth graders. These results are intriguing because the study offers no good explanation of why the program had an effect only at some grade levels. Further investigation is thus required. Studies such as that of Jacob and Lefgren suggest that increasing time in school can be an effective strategy to improve the learning outcomes of the most disadvantaged students. However, it is important to note that such studies examine targeted programs that serve disadvantaged schools or students and do not necessarily speak to the effectiveness of across-the-board increases in class time.

Reducing Class Size

Interventions reducing class size have been highly contentious. Proponents argue that decreasing student-teacher ratios can make classes more manageable and provide disadvantaged students with the personal attention they need. Opponents contend that merely lowering the number of students in a classroom will not boost learning unless it triggers changes in teacher pedagogy, which are unlikely to occur, and that hiring more teachers per student raises costs considerably. Studying the effects of lowering class size is challenging because students are hardly ever randomly assigned to classes of different sizes. One way to obtain an estimate of the impact of these interventions is to exploit "class-size caps" that set a maximum number of students per class to conduct an RDD and compare the performance of students in classes on either side of that rule. Angrist and Lavy (1999) took advantage of a longstanding rule in Israel that set the maximum class size at 40 students. If a 40-student class enrolled an additional student, the rule required that the class be broken into two classes of 20.5 students on average. By comparing students on either side of the 40-student cutoff, the authors found that being assigned to a smaller class size had a positive effect on student achievement. This finding, however, was later challenged by Urquiola and Verhoogen (2009).6

6 Urquiola and Verhoogen studied the case of Chile, where a rule stipulates a maximum class size of 45 students. They found that the data followed a pattern close to that in Israel. Yet, they also found that most private schools tried to avoid enrolling the number of students that would require them to act upon a class-size cap—either by adjusting their fees or by not admitting that additional student. (The one exception consisted of schools in which parents valued small classes, which were also schools in which students tended to come from wealthier backgrounds.) Thus, the authors showed that the strategy used by Angrist and Lavy (1999) could lead to biased estimates of the impact of class size reductions, since students assigned to small classes are not comparable to those in regular classes.

Fortunately, governments in developed and developing countries have become increasingly interested in whether reducing class size improves student learning and have invested in RCTs to get a definitive answer. By far the best-known rigorous evaluation of a class size reduction policy in a developed country was conducted in Tennessee in the United States. There, in 1985, the state government adopted a class-size reduction program called the Student/Teacher Achievement Ratio, which came to be known as Project STAR.
The program randomly assigned 11,600 students and their teachers to one of three types of classes: (1) small classes (13–17 students); (2) regular-size classes (22–25 students); and (3) regular-size classes with a full-time teacher's aide. Krueger (1999) evaluated the results. He found that, on average, students who attended small classes performed four percentage points higher in their first year (about .20 of a standard deviation) and that their advantage increased by about one percentage point per year in subsequent years. These effects were larger for minority and poor students. Teacher's aides, however, had little effect on student achievement. While the magnitude of these effects has been the subject of much debate (Hanushek 2003), the effect size in the study has remained a benchmark for other impact evaluations in education. The study is seen as demonstrating the potential of large class-size reductions in a developed country setting. The main contribution of Krueger's study was that it used an RCT to evaluate a class-size reduction. Yet, what experiments such as Project STAR gain in internal validity they can lose in external validity (Box 5). Many have argued that while Krueger offered a rigorous answer to the question of whether Project STAR boosted learning, it is less clear whether his findings carry over to contexts other than Tennessee.

Box 5. What Are Internal and External Validity?
Internal validity refers to whether a study obtained an unbiased estimate of the effect of an intervention. External validity refers to whether the effect can be generalized to a population of interest. Take the example of a pre-service training program. Internal validity would be concerned with issues such as whether teachers were randomly assigned to the program (selection bias), whether these are novice teachers who would have improved even in the absence of the program (maturation), or whether teachers who left the program before the outcomes were measured were the ones who benefited the least (differential attrition). External validity would be concerned with whether the program or its main beneficiaries are sufficiently similar to others so that an evaluation of the latter would yield comparable results.

Hoxby (2000) contributed to the discussion of the external validity of Project STAR by evaluating the impact of class size in Connecticut using two rigorous methods. First, she compared students assigned to small or regular classes due to minimum/maximum rules using an RDD. Then she exploited changes in population trends. Given that such changes affect student-teacher ratios but are unlikely to influence student learning, Hoxby used such changes to conduct an Instrumental Variables Estimation (Box 6).
Using these two methods, she found that class size did not have a statistically significant effect on student achievement. In fact, she was able to rule out even modest effects (.02–.04 of a standard deviation for a 10 percent reduction in class size). While Hoxby could not benefit from random assignment in Connecticut, her study is widely considered to be an important qualification of the implications of the STAR findings.

Box 6. What Is Instrumental Variables Estimation?
Instrumental Variables Estimation (IVE) is often used whenever there has been no random assignment of individuals to a treatment group and there is no discontinuity that determines who participated in a treatment group and who did not. It is based on the observation that variation in program participation is partly endogenous (i.e., related to the participants' characteristics) and partly exogenous (i.e., related to the program's characteristics). Thus, one way to evaluate such a program is to obtain data on a variable that is related to the exogenous variation but not to the endogenous variation in outcomes. One can then estimate the effect of the program by using this variable. It is not easy to identify such a variable (the "instrument") and obtain data on it. Yet, there are a number of commonly used instruments, such as geographic features that influence treatment (e.g., distance between participants and a program), legal institutions that determine the intensity of treatment (e.g., different laws in neighboring similar states), and natural experiments (e.g., birth dates that determine whether a person has access to a treatment group or not), that can be utilized. Finally, random assignment to a treatment group can also be used as an instrument, since it is uncorrelated with individuals' characteristics but influences take-up.
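The two-stage logic behind IVE can be made concrete in a few lines of code. A sketch with simulated data (the instrument, coefficients, and sample size are hypothetical; a real application would also correct the second-stage standard errors, which this shortcut gets wrong):

```python
# Instrumental variables via two explicit stages: stage 1 predicts the
# endogenous treatment from the instrument alone; stage 2 regresses the
# outcome on that prediction. The coefficient in stage 2 is the IV estimate.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 10_000
ability = rng.normal(size=n)                  # unobserved confounder
instrument = rng.integers(0, 2, size=n)       # e.g., a population shock
class_size = 30 - 5 * instrument + 2 * ability + rng.normal(size=n)
scores = -0.02 * class_size + 0.5 * ability + rng.normal(size=n)

stage1 = sm.OLS(class_size, sm.add_constant(instrument)).fit()
stage2 = sm.OLS(scores, sm.add_constant(stage1.fittedvalues)).fit()
print(f"IV estimate of class-size effect: {stage2.params[1]:.3f}")  # ~ -0.02
```

A naive regression of scores on class size would be biased here, because the unobserved confounder moves both variables; the instrument shifts class size but is unrelated to ability, which is what lets the second stage recover the causal coefficient.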
The other question is whether the results of Project STAR apply to developing nations, where class sizes are much larger than in the United States. Duflo, Dupas, and Kremer (2007) evaluated a program in Busia and Teso, Kenya, that randomly assigned 210 primary schools to a program that (1) lowered student-teacher ratios on average from 80 to 46; (2) combined class-size reductions with improved incentives for teachers (by hiring local teachers on short-term contracts or increasing parental oversight); or (3) combined class-size reductions with tracking by students' achievement. Reducing student-teacher ratios, in the absence of other reforms, actually led to lower teacher effort and to small and statistically insignificant increases in test scores. However, combining class-size reductions with improved incentives led to significantly larger test score increases (.19 of a standard deviation), and combining class-size reductions with tracking led to even larger increases (.25–.31 of a standard deviation). This study draws attention to the importance of policy interactions by suggesting that class-size reductions are unlikely to improve student achievement without incentives for teachers to change their behavior in the classroom. These studies offer a nuanced answer to the question of whether class-size reductions improve student learning. The answer seems to be that lowering student-teacher ratios can boost achievement, but only insofar as it changes teachers' pedagogy and thus the educational experience of children in the classroom. This does not happen automatically and may depend on the extent to which improvements can be made in classroom management (i.e., initial class size) and on the mix of incentives to produce changes in teacher behavior.

Attracting the Best into Teaching

Economic theory suggests that individuals with outstanding academic and/or professional achievements are more likely to be effective teachers. On one hand, "human capital theory," developed by Becker (1964), Schultz (1963), and Mincer (1962), proposes that employers should hire individuals with more schooling because additional education increases productivity by raising workers' ability to understand and use new information. On the other hand, "market signaling theory," developed by Spence (1974) and Arrow (1963), among others, posits that, even if the education that potential employees acquire does not raise their productivity, high-skilled individuals will pursue it to send a signal to their employers that they possess skills that others lack and deserve to be paid more. While the theories differ on why employers should hire applicants with exceptional credentials, they both imply that attracting individuals with higher credentials will result in a higher-skilled teaching force. Studies in economics suggest that luring top talent into teaching can also have a multiplier effect. Becker and Murphy (2000) argued that people's actions often influence their neighbors' incentives or information, and Montgomery (1991) showed that this occurs in various ways in labor markets. They call these interactions "social multiplier effects." These models give reason to believe that if teaching is able to attract qualified people, then competitive candidates who had not considered teaching might be drawn to it. One way in which education systems can attract talented individuals is by offering competitive wages. Not surprisingly, economists have a lot to say about the role of pay. As Akerlof (1982) and Shapiro and Stiglitz (1984) noted, wages serve two functions: they allocate labor and provide incentives for employee effort. This, they argue, is the reason why firms seeking to attract outstanding candidates might want to pay their employees an "efficiency wage," i.e., more than what competitors offer. Lazear and Rosen (1981), however, argued that firms can avoid some of the problems inherent in efficiency wages by paying their employees commensurate with their performance. This idea has recently regained relevance in the education literature. Specifically, Barlevy and Neal (2011) showed that a "pay-for-percentile" approach, in which teachers are rewarded for improvements in the performance of their students relative to peers with similar achievement levels at the beginning of the school year, incentivizes teachers to allocate socially optimal levels of effort to all students.
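To illustrate the mechanics of pay-for-percentile, here is a toy sketch, not Barlevy and Neal's formal mechanism: the data, the ten baseline bands, and the bonus rate are all hypothetical.

```python
# Each student is compared only with peers who started the year at a similar
# level; the teacher is paid for the student's within-group percentile, so
# effort on low and high achievers is rewarded symmetrically.
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
df = pd.DataFrame({
    "teacher": rng.integers(0, 20, size=2_000),
    "baseline": rng.normal(size=2_000),
})
df["endline"] = 0.8 * df["baseline"] + rng.normal(size=2_000)

# Band students by baseline achievement, then rank end-of-year scores
# within each band.
df["band"] = pd.qcut(df["baseline"], 10, labels=False)
df["percentile"] = df.groupby("band")["endline"].rank(pct=True)

# Each teacher's bonus is proportional to her students' average percentile.
bonus = 1_000 * df.groupby("teacher")["percentile"].mean()  # hypothetical rate
print(bonus.round(0).head())
```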
Many studies have sought to understand the extent to which requirements for entry into the teaching profession motivate talented individuals to aspire to become teachers and identify those who will succeed as classroom instructors. Table 2 summarizes four types of interventions that have been rigorously evaluated: (1) setting requirements for entry into teaching; (2) relaxing entry requirements for outstanding individuals; (3) rewarding advanced educational qualifications and experience; and (4) increasing teacher pay. Rigorous studies on these topics are scarce because teachers are not randomly assigned to characteristics (e.g., race or gender) or qualifications (e.g., degrees or experience) and students are hardly ever randomly assigned to their teachers. The studies reviewed here are among the most rigorous in this line of research, but much remains to be done to obtain more conclusive answers to the questions that they pose. To minimize the risk of conflating selection and treatment effects, most research on teacher requirements uses a strategy known in econometrics as "fixed effects" (Box 7).

Box 7. What Are Fixed Effects?
Fixed Effects (FE) are commonly used in research that identifies correlations between two factors. Their main purpose is to account for factors that cannot be observed or for which there are no data. This is done by breaking up the sample into subgroups (e.g., subgroups of students, schools, and teachers) and comparing units in each subgroup among themselves over time. FE control for characteristics that are "fixed" (i.e., those that do not change) over time, but they have no way to account for time-varying factors. Unlike the other empirical strategies reviewed in this paper, FE, regardless of their sophistication, generally cannot conclusively establish cause-and-effect relations. Rather, they are used to increase the reliability of correlations.
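The "within" comparison Box 7 describes amounts to subtracting group means before running a regression. A sketch on a simulated school-by-year panel (variable names and coefficients are hypothetical):

```python
# Fixed effects via the within transformation: demeaning by school removes
# anything about a school that is constant over time (e.g., its unobserved
# quality), leaving only within-school variation.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(5)
n_schools, n_years = 100, 5
df = pd.DataFrame({
    "school": np.repeat(np.arange(n_schools), n_years),
    "year": np.tile(np.arange(n_years), n_schools),
})
quality = rng.normal(size=n_schools)[df["school"]]       # unobserved, fixed
df["experience"] = rng.uniform(0, 10, len(df)) + 2 * quality
df["scores"] = 0.05 * df["experience"] + quality + rng.normal(0, 1, len(df))

# Subtract each school's mean from its own observations.
within = df[["experience", "scores"]] - df.groupby("school")[
    ["experience", "scores"]].transform("mean")

fit = sm.OLS(within["scores"], within["experience"]).fit()
print(f"within-school estimate: {fit.params.iloc[0]:.3f}")  # close to 0.05
```

Note that the demeaning removes the fixed school quality but would do nothing about a confounder that changes from year to year, which is exactly the limitation the box flags.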
Setting Requirements for Entry into Teaching

Virtually all education systems require individuals to meet some qualifications to become a teacher. These requirements are intended to set minimum standards for the profession, but in many education systems, traditional requirements have been called into question for being poor predictors of teacher effectiveness and for making entry into teaching too costly—particularly for top candidates who already face a high opportunity cost for going into teaching. Most research on this topic has focused on whether certified teachers have students who perform, on average, better than those of uncertified teachers. The evidence suggests that the predictive effect of certification is usually small. Kane, Rockoff, and Staiger (2006) used data from the 1998–99 to the 2004–05 school years to examine the relationship between student achievement and teacher certification for reading and math teachers in grades four to eight in New York City. This "panel data" allowed the authors to observe a teacher's effectiveness within the same school, grade, and year over time and use combinations of FE to account for factors that confound the effect of certification. The effects of teacher certification were, at best, small. In fact, in math, students assigned to teachers without a certification performed, on average, no differently than peers assigned to traditionally certified teachers. Yet, these results differed somewhat according to the group of uncertified teachers to which traditionally certified teachers were compared. A potential explanation for the lack of effects of certification in New York City may be that certification is a better proxy for teacher quality at some levels and in some subjects than in others, or that the quality of teacher certification in places such as New York is not stringent enough to predict teacher effectiveness. This would imply that teacher certification should be made more stringent rather than abandoned altogether. This is what a study by Clotfelter, Ladd, and Vigdor (2007) in North Carolina suggested. The authors measured the predictive effect of a number of teacher qualifications (including certification) on student achievement at the high school level from 1999–2002 using permutations of student, school, subject, and year FE. Consistent with the hypothesis that certification might play a more important role in teacher effectiveness at the high school level, Clotfelter and his collaborators found that having a teacher with an alternative certification reduces student achievement by .06 of a standard deviation, and that having a teacher with a certification that is neither traditional nor alternative negatively affects student achievement by .05 of a standard deviation. The authors also found support for the case for higher certification standards by detecting small but positive effects of having a teacher with a National Board of Professional Teaching Standards (NBPTS) certification, a national certification program widely believed to have higher standards than traditional certification programs in the United States. NBPTS requires teachers to submit classroom videos and write essays and is both labor- and time-intensive. The mixed evidence on certification led governments to commission an RCT on whether certification is worth the investment. Cantrell et al. (2008) compared the performance of classrooms of elementary students in Los Angeles in 2003 randomly assigned to NBPTS applicants and to non-applicants. Students of NBPTS-certified teachers scored .22 of a standard deviation above those of teachers who applied but were not certified. The students of successful NBPTS applicants also achieved greater learning gains than those of unsuccessful applicants. However, NBPTS-certified teachers were not more effective than teachers who did not apply to become NBPTS-certified. This suggests that while NBPTS might be successful at identifying the most effective teachers among its applicants, it may not be attracting the most effective applicants. Another entry requirement that education systems are increasingly resorting to is exams, which typically assess teachers' subject-matter knowledge or field-specific pedagogical knowledge. The evidence on the usefulness of these tests is still evolving, but it suggests that they show some promise for identifying individuals who will be successful classroom instructors. In Peru, Metzler and Woessmann (2010) took advantage of the fact that students and teachers were tested in the same year on two subjects to determine whether the same student taught by the same teacher in two different subjects performs better in one of those two subjects if the teacher's knowledge is relatively better in that subject. The authors used FE to focus on the within-teacher, within-student variation in student outcomes while controlling for the fixed characteristics of students, teachers, and subjects. They found suggestive evidence that a teacher's subject-matter knowledge affects student achievement: a 1 standard deviation increase in teacher test scores raised student test scores by .10 standard deviation units. This means that if a student switched from having a teacher at the 5th percentile of the distribution of subject-matter knowledge to one at the 50th percentile, the student's achievement would increase by .17 of a standard deviation by the end of the year. The results of a recent U.S. study differed considerably from those in Peru.
Cantrell and Kane (2013) used correlations to examine whether the Content Knowledge for Teaching (CKT) tests, which assess teachers' understanding of how students acquire and understand subject-specific skills in math and reading, correlated with measures of teachers' value-added scores in these subjects in elementary and middle school in six U.S. districts: Charlotte-Mecklenburg, North Carolina; Dallas, Texas; Denver, Colorado; Hillsborough County, Florida; New York City; and Memphis, Tennessee. To account for the effect of teacher assignment to students, the authors used teachers' CKT scores to predict their average performance on a series of indicators across two different sections. Cantrell and Kane found CKT tests were correlated with subject-specific classroom observation protocols. Yet, they were not correlated with teachers' value-added scores, regardless of whether these were measured using state tests or "audit" tests that measure more complex skills. Studies on entry requirements thus seem to find slim support for the effectiveness of traditional requirements (e.g., certification) in developed countries, and encouraging results for the use of alternative requirements (e.g., entry exams) in developing countries. Yet, important questions remain about the generalizability and robustness of these findings.

Relaxing Entry Requirements for Outstanding Individuals

It is perhaps this lack of clarity on the ideal set of teacher entry requirements that has prompted some education systems to relax traditional requirements for talented individuals seeking entry into the profession, in hopes of making teaching an attractive career choice for them. One program seeking to lure talented youth into teaching in the United States is Teach for America (TFA), which recruits outstanding college graduates who majored in a wide variety of fields to teach in high-need schools for two years. TFA's dual mission is to raise the quality of instruction and at the same time serve as a transformative experience for its corps members so that, after their two-year commitment, they become champions of education reform in whatever field or sector they choose to work. Since its founding in 1989, TFA has inspired replicas in 28 other countries. The most rigorous study on TFA's impact was conducted in 2001 in grades 1–5 in six of the 15 regions where the program places its members (Baltimore, Chicago, Los Angeles, Houston, New Orleans, and the Mississippi Delta). Decker, Mayer, and Glazerman (2004) randomly assigned students to TFA or control teachers at the same school and grade and compared their achievement after one year. Control teachers included all those to whom students would have been assigned in the absence of TFA teachers (who were neither necessarily certified nor experienced). In fact, TFA teachers were more likely to have attended competitive colleges, but less likely to have acquired an education degree, education-specific training, or student-teaching experience than control teachers. Yet, overall, students of TFA teachers performed .15 of a standard deviation higher in math than those of control teachers and .26 higher than those of novice control teachers (with 1–3 years of experience). There was no difference, however, in reading. Recent evidence suggests that the effect of such programs might be greater in developing countries, where average teacher effectiveness is lower.
Alfonso, Santiago, and Bassi (2011) evaluated the impact of an adaptation of TFA in Chile called Enseña Chile ("Teach Chile" or eCh) during the 2009–10 and 2010–11 school years. Alfonso and her collaborators were not able to randomly assign eCh teachers to students, so they used Propensity Score Matching (Box 8). Like FE, this technique is an improvement over simple correlations in that it tries to account for observable characteristics that might bias estimates of the impact of a particular policy, but it cannot distinguish selection from treatment effects. After a year, students in schools that received an eCh teacher scored .22–.51 of a standard deviation higher in Spanish and .17–.43 higher in math. After two years, students scored .75 of a standard deviation higher in Spanish and .33 higher in math. These estimates seem too large for an educational intervention and hint at the existence of selection bias, but they offer a good motivation to evaluate these types of programs more rigorously in developing countries.

Box 8. What Is Propensity Score Matching?
Propensity Score Matching (PSM) is an empirical strategy often employed when it is not possible to use any of the rigorous evaluation methods mentioned earlier in this paper. It cannot be used to claim that a given program caused a particular impact. Rather, it is simply a way to mitigate the evaluation problem mentioned earlier. The approach is simple. Researchers first identify the factors that made individuals more prone to self-select into a treatment group and then create a comparison group by matching each "treated" individual with one (or more) "untreated" individuals with comparable observable characteristics. In this way, researchers can compare the outcomes of both sets of individuals after the treatment and attribute the difference to the treatment. The downside of these types of studies is that researchers can only match individuals on what they can observe. If there are unobservable characteristics that made some individuals more prone to select into a program, there is no way for researchers to know that, so their estimate will be biased.
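A sketch of the two steps Box 8 describes, on simulated data (the covariates, sample size, and true effect are hypothetical; real applications check covariate balance and use more careful matching):

```python
# Propensity score matching: estimate each unit's probability of treatment
# from observables, then pair each treated unit with the untreated unit
# whose score is closest.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(6)
n = 2_000
x = rng.normal(size=(n, 2))                       # observable characteristics
p_treat = 1 / (1 + np.exp(-(x[:, 0] + x[:, 1])))  # selection on observables
treated = rng.random(n) < p_treat
outcome = x[:, 0] + 0.25 * treated + rng.normal(size=n)

# Step 1: propensity scores from a logistic regression of treatment on x.
scores = LogisticRegression().fit(x, treated).predict_proba(x)[:, 1]

# Step 2: match each treated unit to the untreated unit with the closest score.
nn = NearestNeighbors(n_neighbors=1).fit(scores[~treated].reshape(-1, 1))
_, idx = nn.kneighbors(scores[treated].reshape(-1, 1))
matched_controls = outcome[~treated][idx.ravel()]

att = (outcome[treated] - matched_controls).mean()
print(f"matched estimate of the treatment effect: {att:.2f}")  # near 0.25
```

As the box warns, this recovers the effect here only because selection was generated entirely from the observed x; an unobserved driver of selection would bias the estimate with no warning.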
These studies suggest that while it is not known which entry requirements are most effective at predicting teacher effectiveness, neither is there any compelling case for doing away with such requirements altogether. A better understanding is needed of which entry requirements (if any) serve to ensure a minimum "floor" of teaching quality and of whether (and, if so, how) entry requirements can be used more broadly without relying on other ways of identifying effective teachers that may be more costly but also more effective. These studies, however, focus on teachers entering the profession through highly selective alternative pathways. The question thus arises of whether the bulk of alternatively certified teachers, who enter the profession through far less selective pathways, will be less effective than regular teachers. Exactly this question prompted Constantine et al. (2009) to compare the impact of traditionally and alternatively certified teachers on student achievement in the United States in 2003 by randomly assigning students in the same school and grade to one or the other of the two types of teachers. The study was conducted with 2,600 students in 63 schools in 20 school districts. The authors found that the difference between traditionally and alternatively certified teachers might be less clear-cut than policy debates suggest because (1) both varied widely in the hours of pre-service training they had undertaken; (2) on average they did not differ in their scores on college entrance exams, the selectivity of their college, or their educational attainment; (3) neither group was more effective on average in raising student achievement; and (4) neither the amount nor content of the courses that alternatively certified teachers took made a difference in their effectiveness. These findings suggest that alternatively certified teachers who do not enter the profession through selective programs are not all that different from regular teachers.

Rewarding Advanced Educational Qualifications and Experience

Many education systems pay their teachers more for graduate studies or years of experience. The rationales for these pay differentials are to attract better educated individuals and to retain seasoned instructors. The implicit assumption in these policies is that these are more effective teachers, so research on this topic assesses whether this is the case. Hanushek et al. (2005) combined the Texas Schools Microdata Panel with data on classroom assignment from an anonymous large urban school district in the state to determine whether certain teacher characteristics, including having a master's degree or experience, predicted higher student achievement. The study used school and student FE. The authors found that students of teachers with master's degrees performed no differently, on average, than those whose teachers did not have such a degree. They also found that first-year teachers had much lower performance on average than other teachers (the achievement of their students was .12–.16 of a standard deviation lower), but that additional years of teaching experience had little impact on student achievement. Until recently, the external validity of research on traditional teacher credentials in the United States was also problematic, since most studies focused on a single school district. In 2012, Kane and Staiger published a report that examined whether teachers with more experience or master's degrees have students who perform better on math and reading tests. As in the Cantrell and Kane (2013) study mentioned earlier, the Kane and Staiger (2012) study looked at school districts in Charlotte-Mecklenburg, North Carolina; Dallas, Texas; Denver, Colorado; Hillsborough County, Florida; New York City; and Memphis, Tennessee. In the absence of random assignment of teachers to students, the authors tried to account for factors that confound the effect of credentials by using the average performance of students in two different sections taught by the same teacher as their outcome variable. They found that students of teachers with 12 or more years of experience achieved gains that were .01 of a standard deviation higher in math and .02 higher in reading than those of teachers with fewer than three years of experience.
Similarly, they found that students of teachers with a master's degree saw gains that were .03 of a standard deviation higher in math and .02 lower in reading.7

The authors benchmark the predictive power of these traditional teacher credentials against a measure of teacher quality that combines a teacher's value added, student survey results, and classroom observation scores from another section (i.e., another group of students). They find that this combined measure does a much better job at predicting student achievement gains: students with a teacher who is in the top 25 percent on this combined measure make gains that are .22 of a standard deviation higher in math and .07 higher in reading on the state tests than their peers with a teacher in the bottom 25 percent on the combined measure.8

7 Given the existing concerns in the United States with the quality of state tests, the authors also examined how well teaching experience and master's degrees predicted achievement gains in "audit" tests of math and reading. They found that the predictive power of master's degrees was slightly higher with these tests: students of teachers with master's degrees made gains that were .05 of a standard deviation higher in math and .03 higher in reading than their peers with teachers without master's degrees. The predictive effect of teaching experience with the audit tests, however, was more mixed: students of teachers with 12 or more years of experience made gains that were .05 of a standard deviation lower in math and .03 higher in reading.

8 The results were similar using the audit tests: the first group of students made gains that were .13 of a standard deviation higher in math and .13 higher in reading than the second group.

Together, these studies offer compelling evidence that educational qualifications and experience are very weak proxies for teacher effectiveness and that more thoughtful measures can do a better job at predicting teaching quality. The findings suggest that education systems would be well advised not to rely on pay differentials for educational qualifications and experience to attract better teachers.

Increasing Teacher Pay

Finally, a policy commonly touted as important to attract talented individuals into teaching is to raise teachers' salaries. The effectiveness of high salaries in attracting better teachers is difficult to evaluate because teachers are not randomly assigned to salaries. Hoxby and Leigh (2004), however, noted that unionization compressed wages. They used laws that either helped or hindered teachers' unionization to find out whether the decrease in teacher aptitude in the United States over recent decades can be explained by an increase in opportunities for women outside the profession or by compressed wages within the profession. This allowed them to use legislation as an "instrument" for pay compression in an IVE. Pay compression increased the share of the lowest-aptitude female college graduates who became teachers by 9 percentage points and decreased the share of the highest-aptitude female college graduates who became teachers by 12 percentage points. Hoxby and Leigh also found that improvements in pay parity reduced the share of women who taught by 3.2 percentage points for the highest-aptitude group and by essentially zero for the lowest-aptitude groups. These findings suggest that wage compression played a key role in the decline of aptitude in the teaching profession.
The lesson from this study is not that wages should be raised across the board, but rather that paying better teachers more may help attract and retain them.

Preparing Teachers with Useful Training and Experience

Theoretical and empirical work in economics has shown that high-skilled workers value training more than low-skilled workers. Salop and Salop (1976) and later Autor (2001) called this phenomenon the "complementarity" between skills and training. Thus, offering training can promote positive self-selection on skills that employers cannot typically observe. This is of interest for teacher labor markets, in which there are very few observable skills that make an individual more likely to be an effective teacher.

If human capital theory is correct and training enables individuals to acquire new skills, education systems should subsidize pre-service teacher training. As Becker's (1964) model suggested, however, this should be true only if the skills that are necessary to be an effective teacher are specific, rather than general. He argued that since training in general skills makes employees more likely to succeed not only at the firm or industry that provides the training, but also at others, firms are unwilling to invest in such training. Yet the incentives for governments to subsidize pre-service teacher training, and for potential entrants to take it, are strong because this training almost invariably leads to a job, avoiding many of the so-called "contracting" problems with specific-skills provision present in private firms (Prendergast 1993). But as Autor (2001) noted, even if education is pure signaling, training could be a worthy investment as a screening mechanism. Training can elicit otherwise unobservable information about the skills of potential employees, which firms can use to make hiring decisions. This is particularly relevant for teacher policy, given that teaching seems to fit the profile of the occupations described by Terviö (2003), in which initial experience on the job is a fairly accurate predictor of subsequent performance.

Although pre-service training is one of the most common ways in which governments try to increase the capacity of teachers, there is little rigorous research on its impact. This is mostly due to self-selection into training, which makes selection and treatment effects hard to disentangle. The very small body of rigorous research on this issue is composed mainly of empirical work assessing changes to traditional forms of training or interventions early in teachers' careers that might complement pre-service training. Table 3 summarizes two types of interventions that have been rigorously evaluated: (1) assigning mentors to new teachers; and (2) including a practice component in teacher preparation.

Assigning Mentors to New Teachers

One consistent finding in the teacher effectiveness literature is that teachers are less successful in their first few years on the job. This has led some education systems to assign mentors or coaches to new teachers to accelerate their learning in what are often called "induction" programs. The evidence suggests that such interventions can improve student learning, but that the quality and dosage of mentoring matter in ways that are not easy to anticipate when designing these programs.

Rockoff (2008) evaluated a mentoring program in New York City in 2004.
In order to measure the impact of the program, he compared the performance of teachers with prior experience (who were not targeted by the program) to that of new teachers (who were targeted by the program, and therefore were more likely to be assigned to a mentor). This strategy is known as Differences-in-Differences Analysis (Box 9). Rockoff found that novice teachers assigned to mentors were 4.5 percent more likely to complete their first year than teachers without mentors, but no more likely to stay at their schools or to raise student achievement.

Box 9. What Is Differences-in-Differences Analysis?

Differences-in-Differences Analysis (DDA) is often used when other rigorous empirical strategies cannot be conducted. It estimates the impact of a program by obtaining the difference in outcomes for a group of individuals who received a treatment, measured before and after the treatment (the "first difference"), obtaining the difference in outcomes over the same period for a similar group that was not exposed to the treatment (the "second difference"), and then subtracting the latter from the former. For example, if test scores in treated schools rise from 50 to 60 points while scores in comparable untreated schools rise from 50 to 56, the DDA estimate of the program's impact is (60 − 50) − (56 − 50) = 4 points. While one may try to calculate a program's impact by using only the first difference, it is likely that other things happened at the same time as the treatment that would confound its effect. Using the second difference allows researchers to account for any other events that would affect all individuals (known as "secular trends"). If nothing else changed between the two groups at the time of the treatment, and the composition of the two groups remained the same during the treatment, a DDA offers an unbiased estimate of a program's impact.

Rockoff also used an IVE to look at whether mentors' experience, the number of hours mentors spent with each teacher, and teachers' perceptions of mentors made a difference. He found teachers were less likely to leave a school when they had mentors with prior experience at that school, which he interpreted to suggest that a key part of mentoring may be the provision of school-specific knowledge. He also found that the number of hours mentors spent with a teacher had a positive impact on student achievement: for every 10 hours of mentoring, students of mentored teachers scored .05 of a standard deviation higher in math and .04 higher in reading.

The importance of the quality and dosage of mentoring programs led Glazerman et al. (2010) to evaluate the impact of such programs in 17 urban districts in the United States. They randomly assigned 418 elementary schools to one of two categories: (1) schools with a "comprehensive" teacher induction program, in which beginning teachers received monthly professional training, opportunities to observe veteran teachers, and a full-time mentor who had access to updated training and materials; or (2) schools whose beginning teachers received the district's usual, less intensive induction. In 10 of the 17 districts, the services were offered to treatment schools for one year; in the remaining seven districts, services were offered for two years. However, districts were not randomly assigned to the length of the services, so findings for the two sets of districts should be interpreted separately. Ninety percent of teachers in districts assigned to comprehensive induction had a mentor in their first year, compared to 72 percent of teachers in districts with business-as-usual induction.
Interestingly, however, teachers who received comprehensive induction were no more effective than their control peers, as measured by classroom observations and student achievement. Further, the achievement of students with teachers receiving two years of comprehensive induction showed no improvement during the teachers' first two years on the job, but by the teachers' third year their students performed .11 of a standard deviation better in reading and .20 better in math than students of control teachers. Neither induction program had a positive effect on teacher retention or satisfaction. This complements the Rockoff study by providing a sobering view of mentoring, in which improving quality and dosage is not just a matter of increasing treatment intensity.

Including a Practice Component in Teacher Preparation

Research suggests teacher performance improves considerably during the first few years on the job. Some have attributed this to the importance of first-hand experience in mastering classroom practices, leading some education systems to incorporate "clinical" (practice-based) components in their teacher education programs. One such initiative is the Boston Teacher Residency (BTR) program, which, as stated in its promotional literature, recruits "talented college graduates, career changers and community members of all ages" to take an intensive summer training program and a one-year "residency" in a classroom in the Boston public schools. During that year, participants work with a mentor four days a week and attend seminars once a week. Upon graduation, they receive a master's degree and a "dual" license, including one in special education. They are placed in schools alongside other BTR alumni and receive one-on-one induction and professional development. Chicago and Denver adopted similar programs in 2002 and 2004, respectively.

Papay et al. (2011) used a combination of FE to conduct the first independent evaluation of BTR, covering the 2001–10 school years. BTR graduates were more racially diverse than other Boston public school novices, more likely to teach math and science, and more likely to remain in the profession after their fifth year. BTR graduates were no more effective in the classroom, as measured by their value added in student achievement tests, but they improved faster after their second year, and by their fifth year they outperformed even veteran teachers. These findings suggest that clinical experience might offer teachers tools to improve their performance faster on the job.

Matching Teachers' Skills with Students' Needs

From the perspective of economics, job matching minimizes turnover, which is costly to employers and employees, and maximizes what is called "allocative efficiency" (Hicks 1939; Kaldor 1939). Economic theory offers a framework to examine why a school that is a good match for a teacher can make that teacher more productive. In "search theory" or "matching theory" (Pissarides 1979), economists argue that it is useful to think of jobs as no different from other goods that are consumed: when choosing a job, individuals consider their options and select one based on their preferences and the constraints that they face. Yet there is an important distinction between the consumption of, say, typical household goods and the choice of a job, a distinction that draws on the work of Nelson (1970). Typical household goods are "search goods" that are easy to evaluate before acquiring them (Lucas and Prescott 1974).
In the case of search goods, mismatches between people's preferences and constraints and the good that best addresses them are due to imperfect information and can be dealt with by improving the provision of information. Jobs, by contrast, are "experience goods"—it is not easy for people to evaluate whether a particular job is the most productive fit for them without taking and holding the job for a while (Johnson 1978). This is why economists find that when people switch jobs, they earn more (typically because they switch to a job in which they are more productive) and why job-switching declines with tenure and experience (because with time people eventually find the "right" job that best fits them) (Altonji and Shakotko 1987; Bartel and Borjas 1981; Jovanovic 1979a and b; Topel and Ward 1992).

Economists also justify job matching on the grounds of "job market segmentation." As Reich, Gordon, and Edwards (1973) argued, there is a wealth of empirical work suggesting that workers' skills are not the only determinant of wages, and that workers with similar skill levels get paid different wages in different labor markets. Proponents of segmented labor markets argue that the "law of one price"—which dictates that the same good cannot be sold in two markets for different prices because doing so would create arbitrage opportunities that would quickly equalize the price in both markets—does not hold in labor markets. If they are right, then job matching could be instrumental in correcting these market failures. The precise mechanisms for how matching should take place, however, have been the subject of considerable debate in the field (Lang and Dickens 1987).

Table 4 summarizes three types of interventions that have been rigorously evaluated: (1) offering bonuses for teachers to work in high-need schools; (2) offering bonuses for teachers to teach critical shortage subjects; and (3) improving working conditions.

Offering Bonuses for Teachers to Work in Schools with the Greatest Need

Steele, Murnane, and Willett (2010) evaluated an experiment in 2000–02 in California that offered $20,000 signing bonuses to attract talented, novice teachers to low-performing schools and retain them for at least four years. They took advantage of the fact that this policy was enacted for only two years to conduct an IVE. Bonus recipients would have been less likely to teach in low-performing schools than observably comparable counterparts had the bonus not existed; however, because of the bonus, the probability that its recipients taught in low-performing schools increased by 28 percentage points, and 75 percent of both bonus and nonbonus recipients who began working in low-performing schools stayed in such schools for at least four years.

Offering Bonuses for Teachers to Teach Critical Shortage Subjects

Clotfelter et al. (2008) assessed a program in North Carolina in 2001 that offered a $1,800 annual bonus to retain teachers certified in math, science, and special education in high-poverty or failing schools. They compared the turnover of eligible and ineligible teachers before and after the program to conduct a DDA. The bonus increased year-on-year retention by 10–13 percent. This effect, however, was mainly driven by math teachers: those who received the bonus were 18 percent less likely to leave their schools than those who did not. Science and special education teachers who received bonuses were no less likely to leave than those who did not.
The effects were concentrated in middle schools: teachers at this level were 27 percent less likely to leave if they had received a bonus, but bonus recipients in high schools showed no difference in attrition. Considering that in the first year of implementation the bonus was only 4–5 percent of an average teacher's salary, even modest financial incentives seem to influence teachers' decisions to stay in hard-to-staff schools.

Improving Working Conditions

A number of recent papers shed light on the effect of improvements in working conditions on teachers' productivity. Jackson (2013) used data on third through fifth graders in North Carolina in 1995–2006 and employed year, school, and teacher FE to estimate the changes in teachers' productivity (as measured by their value-added scores) as a result of transferring from one school to another. He found that teachers who switched schools were more effective—by .09 of a standard deviation in math and .07 in reading—than before they moved, which he interpreted as suggestive of "match effects." These effects accounted for 10–40 percent of what is typically estimated as teacher quality, suggesting that a sizable portion of teacher quality is not portable. In fact, Jackson also found that certain types of teachers achieved much better outcomes at certain types of schools, such that an optimal matching of teachers to schools could raise outcomes for all students.

While this study suggests matching teachers to schools is important, it does not indicate which factors would raise teachers' productivity. Jackson and Bruegmann (2009), however, conducted a study indicating that peer learning (i.e., teachers learning from each other) is an important mechanism through which match quality improves teacher productivity. The authors used the same database as Jackson (2013) and employed student, teacher, school, and year FE. A 1 standard deviation improvement in the observable characteristics of a teacher's peers (e.g., experience and licensure) was associated with a .10 of a standard deviation increase in the scores of that teacher's students in math and reading. Teachers who experienced a 1 standard deviation increase in the value-added scores of their peers had students who performed .05 of a standard deviation better in math and reading.

While there are only a few studies on matching teachers to schools, the message from these studies is clear: the match between teachers and their schools is an important factor in teacher retention and effectiveness, and monetary incentives have shown promising results in allocating teachers where they are most needed. A key gap in the literature is the lack of rigorous studies of these initiatives in developing countries, where they are most needed.

Leading Teachers with Strong Principals

Although economists do not have a lot to say about the importance of administrators, they traditionally justify the assignment of workers to different jobs within a firm on the grounds that each worker has a "comparative advantage"—i.e., he or she is best at doing a certain task and is therefore better off specializing in that task (Sattinger 1975; Rosen 1978). Thus, economists argue that teachers with outstanding management skills are best suited for principal posts, while those with outstanding classroom effectiveness would be best employed as teachers or coaches. Several extensions of this approach also note that whenever jobs must be staffed by a single worker, high-ability workers must be assigned to jobs that value ability more highly (Rosen 1978).
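To make the logic of assignment by comparative advantage concrete, consider a stylized example (the productivity figures are hypothetical, chosen only for illustration, and do not come from any study reviewed here). Suppose educator A would add .20 of a standard deviation to student achievement as a classroom teacher or .15 as a principal, while educator B would add .10 as a teacher or .12 as a principal. A is better than B at both jobs, yet total output is maximized by keeping A in the classroom:

\[
\underbrace{0.20 + 0.12}_{A \text{ teaches, } B \text{ manages}} \;>\; \underbrace{0.15 + 0.10}_{A \text{ manages, } B \text{ teaches}},
\]

because A's advantage is proportionally larger in teaching ($0.20/0.15 \approx 1.33$) than B's ($0.10/0.12 \approx 0.83$). In Rosen's (1978) terms, each worker should be assigned to the job that values his or her ability more highly, even when one worker is absolutely better at everything.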
Research on policies affecting principals' work is very recent and there are only a handful of rigorous studies. Yet this emerging body of evidence is suggestive of the types of policies that might be more conducive to the effectiveness of principals. Table 5 summarizes three types of interventions that have been rigorously evaluated: (1) hiring more effective principals; (2) setting requirements for principal positions; and (3) granting principals more authority over staffing decisions.

Hiring More Effective Principals

Research on the effectiveness of principals is new and still evolving, but there are some studies on whether (and if so, how) one can estimate a principal's value added based on the achievement of students at his or her school. Coelli and Green (2012) exploited the fact that principals in British Columbia, Canada, are rotated across schools by districts to observe the performance of different principals at the same school. Using school, neighborhood, and peer group FE, they found sizable heterogeneity in school principal quality affecting the English test scores of 12th grade students. Getting a principal who is 1 standard deviation better in the principal-effects distribution implied that graduation rates were 2.6 percentage points higher (roughly .33 of a standard deviation of the cross-school graduation rate distribution) and English exam scores were 2.5 percentage points higher (roughly equivalent to 1 standard deviation).

Branch, Hanushek, and Rivkin (2012) introduced longitudinal data to the estimation of principal effects. They used data from Texas for 1995–2001 to estimate the productivity of principals using principal, school, and year FE. They found the annual impact of having an effective principal was .05–.21 of a standard deviation, depending on the method used. They also explored a potential mechanism through which effective principals can improve student achievement: removing ineffective teachers. Consistent with this hypothesis, they found that teachers who left schools with the most successful principals were much more likely to have been among the least effective teachers in that school than teachers leaving schools run by less successful principals. The authors also looked into whether principals who switched schools were more effective than those who stayed, and found no relationship between principal quality and probability of transfer. Principals in the lowest performance quartile were least likely to remain in their position and most likely to leave public schools entirely. By contrast, principals in the highest performance quartile were more likely to remain in their position and less likely to leave public schools entirely than their peers in the lowest quartile, but were more likely to leave their position and less likely to leave public schools than those in the middle two quartiles.

As Grissom, Kalogrides, and Loeb (2011) argued, however, different models produce estimates that span a wide range. The authors used data from Miami-Dade County, Florida for 2003–10 in grades 3–10 to show that, while some models identified principal effects as large as .15 of a standard deviation in math and .11 in reading, others found effects as low as .02 of a standard deviation in both subjects for the same principals.
Interestingly, the most conceptually unappealing models, which attributed too large a share of school effects to principals, aligned more closely with nontest measures than approaches that more convincingly separated the effect of the principal from the effects of other school inputs. The main take-away from the study is that while there is evidence that principal quality matters, precisely measuring the effectiveness of principals will require further research.

Setting Requirements for Becoming a Principal

While there is a consensus that effective principals matter, how to identify effective principals is less clear. There are two studies on the relationship between principal characteristics and effectiveness, and both were done in the United States, which raises questions about their external validity. Clark, Martorell, and Rockoff (2009) explored this question using employment and student achievement data and school FE from New York City for 1998–2006. They found little evidence of a relationship between school performance and the selectivity of a principal's undergraduate or graduate institution. They also found little relationship between performance and a principal's prior work experience. However, among very inexperienced principals, school performance was higher among those who had previously been assistant principals at their current school. There was also a positive relationship between principal experience and school performance: principals with three years of experience had students whose math scores were .04 of a standard deviation higher, and principals with five or more years of experience had students whose math scores were .06 of a standard deviation higher, than those of students with principals in their first year. Finally, there was mixed evidence on the relationship between principal training and professional development programs and school performance.

Grissom and Loeb (2011) used data for 2008 from Miami-Dade County, Florida, including surveys of principals and assistant principals, grades assigned by parents to their children's schools, and administrative data (e.g., school performance ratings and state tests). Principals with stronger organizational skills were in charge of schools with greater gains in student achievement: for all schools, a 1 standard deviation increase in principals' self-rating in "organization management" was associated with a .12 point increase in school accountability performance on the original scale, or .10 of a standard deviation. Principals' self-ratings on their organizational skills were also related to parental ratings of their children's schools, but not consistently related to teacher satisfaction. Finally, ratings given by assistant principals of principals' organizational skills were also associated with math and reading gains.

Granting Principals More Authority over Staffing Decisions

The most rigorous study on granting principals greater authority over staffing decisions was conducted in New York City by Rockoff et al. (2011). They assessed the impact of a pilot program in the 2007–08 school year in which principals were selected from a group of volunteers to receive "value-added" performance measures for teachers at their schools and training on the methods used to construct these measures. Because principals were selected at random, the authors could carry out an RCT. Principals' prior beliefs about teacher effectiveness matched value-added measures, and they updated their views when they received new performance data.
In fact, after data were provided, principals were more likely to keep teachers with higher value-added scores and less likely to keep teachers with lower scores, leading to small improvements in productivity at their schools. Yet, importantly, the provision of value-added measures did not "crowd out" information about teacher effectiveness that principals collected through classroom observations. These results suggest that principals hold accurate views about the effectiveness of their teachers and can use performance data wisely to make staffing decisions.

A program in Chicago in 2004 gave principals authority to use an expedited process to dismiss probationary teachers (i.e., those with fewer than five years of experience) for any reason. Jacob (2010a, b) used a DDA to assess this reform by comparing changes in teacher absences before and after the policy for probationary and tenured teachers. He found that the policy reduced absences among probationary teachers by 10 percent and reduced absences among teachers with 15 or more absences by 20 percent. He also found evidence that the policy increased student achievement at the elementary level. Interestingly, these changes were only partly explained by the reduction in teacher absenteeism, suggesting the policy affected teacher behavior in class. Finally, principals considered teacher absences and value-added measures in choosing whom to dismiss. This suggests that, when given flexibility to dismiss teachers, principals can make sound decisions.

These studies, while tentative, offer clear implications. Principals with strong management skills and relevant experience are most effective. In fact, principals (at least in developed countries) seem ready to take on additional authority and responsibility to improve their schools' performance.

Monitoring Teaching and Learning

Economists see monitoring as useful for two purposes: keeping employees motivated and identifying talented employees for promotion. Discussions of performance monitoring in economics have typically been framed within a principal-agent problem. Originally, it was a problem of inducing worker effort, in which the employer knew what to ask employees to do in a context of asymmetric information (Ross 1973). Then, economists acknowledged that employers do not always know how their employees could be most productive, so it became a problem of first figuring out adequate employee behavior and then inducing it (Baker 1992). More recently, the field has noted that for some jobs it is best to monitor performance through output-based rather than input-based measures (Prendergast 2002). This might be of relevance to teaching, where there is considerable debate over the teacher actions that work best to raise student learning.

Economists also see performance monitoring as useful for hiring. Whenever credentials or screening devices are poor predictors of productivity, firms can minimize hiring costs while maximizing worker productivity by creating "internal labor markets"—that is, by hiring employees for lower-ranked positions, observing their performance, and deciding who should stay where they are or be promoted (Williamson 1975; Osterman 1984). The way in which many education systems promote their teachers is deeply embedded in this approach, which suggests there is a strong belief that experience is the best way for teachers to become principals and for principals to obtain school-specific knowledge.
Table 6 summarizes four types of interventions that have been rigorously evaluated: (1) increasing parental and community involvement in school affairs; (2) grading and/or ranking schools based on student achievement; (3) monitoring teacher effort; and (4) monitoring teacher performance.

Increasing Community and Parental Involvement in School Affairs

Initiatives seeking to increase parental and community involvement in education have been rigorously evaluated in several contexts. In Uttar Pradesh, India, villages were randomly assigned to a control group or to one of the following: (1) an initiative that explained the role of village committees to their members; (2) a program that trained volunteers to administer and report on reading, writing, and math assessments; or (3) a project that trained local volunteers to provide remedial reading classes after school. None of these interventions affected community involvement in public schools or student achievement. However, the third intervention had a positive effect on youth involvement, getting young people to volunteer to teach, and subsequently improved students' reading test scores by about .02 of a standard deviation. The authors interpreted their findings as suggesting that individuals face considerable constraints to improving the quality of their public schools, even when they care about it and are willing to get involved.

However, evidence from a similar experiment in India suggests that the format and context of these initiatives matter. Pandey, Goyal, and Sundararaman (2009) evaluated an initiative in Karnataka, Madhya Pradesh, and Uttar Pradesh in 2006 that randomly assigned villages to either a campaign that disseminated information to communities about their state-mandated roles and responsibilities in school management or to no intervention. The information campaign increased the number of meetings of the village education committees and member participation by 25 percent in Uttar Pradesh, and it increased the percentage of parents who talked to teachers about the quality of education in Madhya Pradesh. The campaign also raised teacher attendance by 11 percent in Uttar Pradesh and classroom activity by 30 percent in Madhya Pradesh. However, the campaign led to only modest increases in reading achievement. These findings suggest that while information campaigns can raise teacher and community effort, their prospects for raising learning are limited.

These results seem to echo those of a similar study in a developed country. Avvisati et al. (2008) used an RCT to evaluate an initiative that convened meetings with parents to discuss helping their children with homework, grades on report cards, and other school issues. The aim was to increase parental involvement at school and at home. The children of treated families developed more positive behavior and attitudes in school and had fewer literacy problems. In fact, there were large spillovers onto the classmates of children from treated families. These findings lend further support to the promise of interventions seeking to increase community participation in order to influence the behavior of families, although they raise the question of the extent to which these changes improve learning.

Grading or Ranking Schools Based on Student Achievement

Another initiative that has been subject to rigorous evaluation is the grading or ranking of schools according to student achievement.
The only RCT of this policy, conducted in Punjab, Pakistan in 2004, evaluated the provision of school- and student-level report cards. Andrabi, Das, and Khwaja (2009) found that villages in which report cards were provided had test scores that were .10 of a standard deviation higher than those in which none were distributed, but that the learning gains were concentrated in low-performing schools. In fact, report cards produced a .34 of a standard deviation increase in test scores in schools that initially scored below the median at baseline, but had no effects on those initially above the median. Finally, report cards decreased private school fees by 18 percent. Given that few students switched schools, providing report cards appears to have generated competitive pressure on all schools to make their quality commensurate with the prices they charged.

A recent study sheds some light on other often overlooked consequences of efforts to rank or grade schools. When the state of Florida suddenly changed the way it graded its schools in 2002, Feng, Figlio, and Sass (2010) took the opportunity to conduct a DDA. They found that teachers in schools that received a lower-than-expected grade were 11 percent more likely to leave their school, and those in schools that got a higher-than-expected grade were 2.3 percent less likely to leave. These effects were clearest in schools that not only received a lower-than-expected grade, but were downgraded to a failing grade. Teachers at these schools were in fact 42 percent more likely to leave their schools and 67 percent more likely to move to a school in the same district. Importantly, however, the positive and negative effects of this policy on teacher mobility left the composition of teachers at downgraded schools unchanged: teachers who remained behind increased their effort, raising student achievement by .05 of a standard deviation, but teachers leaving these schools were also on average more effective. These findings suggest that school accountability can sometimes affect teacher mobility without that mobility improving student learning.

A similar policy was instituted in New York City in 2007 to grade schools on an A-to-F scale and to tie these classifications to rewards and consequences, including possible school closure. An evaluation indicated that many of the findings about such policies depend on the specifics of the policies themselves. Rockoff and Turner (2010) used an RDD to assess the effect of school grades under the New York initiative, which were arguably released too late in the school year to prompt any significant changes in school activities or personnel. However, they found that schools that received Ds or Fs raised math achievement and that those that received Fs also improved English scores. Additionally, they found that parental evaluations of school quality improved in schools that received D or F ratings. These findings suggest that school accountability policies that go beyond providing information may successfully spur improvements in school quality.

Again, however, there is evidence that the details of these types of interventions matter. Mizala and Urquiola (2007) evaluated an initiative in Chile that since 1996 has ranked schools according to their performance, adjusted for students' socioeconomic status. The program offered a monetary incentive to schools if they performed above a certain threshold. The authors took advantage of this threshold to assess the effects of the program using an RDD.
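For readers interested in the mechanics behind an RDD such as the one used by Mizala and Urquiola, the sketch below estimates the jump in outcomes at an award threshold using simulated data. Everything here is hypothetical—the cutoff, bandwidth, variable names, and the "true" effect of .10 of a standard deviation are invented for illustration—so this is a minimal demonstration of the logic, not a replication of the Chilean study.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 2000

# Hypothetical running variable: each school's performance index,
# centered at the (illustrative) award cutoff of zero.
score = rng.normal(0, 1, n)
won_bonus = (score >= 0).astype(int)  # schools above the cutoff get the award

# Simulated follow-up achievement: a smooth function of the running
# variable plus an assumed jump of .10 of a standard deviation at the cutoff.
outcome = 0.4 * score + 0.10 * won_bonus + rng.normal(0, 1, n)

df = pd.DataFrame({"score": score, "won_bonus": won_bonus, "outcome": outcome})

# Local linear RDD: keep schools near the cutoff and let the slope differ
# on each side; the coefficient on won_bonus estimates the discontinuity.
near = df[df["score"].abs() < 0.5]
model = smf.ols("outcome ~ won_bonus + score + won_bonus:score", data=near).fit()
print(model.params["won_bonus"])  # estimated effect at the threshold
```

The key idea is that schools just below and just above the cutoff should be nearly identical except for the award, so any discontinuity in outcomes at the threshold can plausibly be attributed to the program.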
The policy had no effect on the learning outcomes of bonus-recipient schools. This suggests that public information about school quality, even when attached to rewards and consequences, does not always prompt improvements in student achievement.

Monitoring Teacher Effort

Evidence on initiatives devised specifically to monitor teacher effort is much clearer than evidence on ranking schools. Duflo, Hanna, and Ryan (2010) evaluated an initiative in Rajasthan, India in 2003 that monitored teacher absenteeism on a daily basis using cameras, and that made teachers' salaries a function of their attendance, by randomly assigning schools to either this treatment or no treatment. The intervention reduced teacher absenteeism from 42 to 23 percent. After a year, test scores in treatment schools were .17 of a standard deviation higher than in control schools, and grade completion increased. These findings indicate that initiatives to induce improvements in teacher effort that specifically target the behavior they try to change can achieve remarkable results.

Monitoring Teacher Performance

A recent study examined the extent to which teachers' effectiveness in one classroom (classroom A)—as measured by a number of indicators including a master's degree, experience, subject-specific pedagogical knowledge, NBPTS certification, principal ratings, student surveys, classroom observations, and value added—was predictive of their value added in another classroom (classroom B). In particular, the study found that teachers identified as effective through a composite of three measures (student surveys, classroom observations, and value added) in classroom A tended to also have high value-added scores in classroom B (Kane and Staiger 2010, 2012). In fact, Kane et al. (2013) found that a student-level standard deviation in the composite measure of teacher effectiveness in one year corresponded to roughly a student-level standard deviation in value added the following year, when comparing teachers teaching the same subject and grade in the same school. The studies used data for grades 4–8 in 2009–11 in Charlotte-Mecklenburg, North Carolina; Dallas, Texas; Denver, Colorado; Hillsborough County, Florida; New York City; and Memphis, Tennessee.

Mihaly et al. (2013) compared four ways of weighting the three measures of effective teaching: (1) weighting them for maximum accuracy in predicting next-year gains on state tests (which resulted in an 81 percent weight on value added, a 2 percent weight on classroom observations, and a 17 percent weight on student surveys); (2) a 50 percent weight on value added and 25 percent each on classroom observations and student surveys; (3) equal weights (33 percent) on all three components; and (4) a 50 percent weight on classroom observations. The first model maximized the predictive power of the composite, as measured by its correlation with next-year value added, while the third and fourth models maximized reliability (i.e., the proportion of variation in the scores on the composite reflecting consistent differences in practice between individual teachers). The authors did not recommend an "ideal" set of weights, but argued that the weights of the different measures should correspond to the main goal of the teacher evaluation system and its associated school and teacher accountability system.
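The sketch below illustrates how such a composite might be assembled and compared across weighting schemes. The data are simulated—each measure is modeled as true effectiveness plus noise, with noise levels chosen arbitrarily—so the output illustrates the logic of the exercise rather than the actual results of these studies.

```python
import numpy as np

rng = np.random.default_rng(1)
n_teachers = 500

# Simulated "true" effectiveness plus three noisy measures of it; the
# noise levels are arbitrary assumptions, not estimates from the studies.
true_effect = rng.normal(0, 1, n_teachers)
value_added = true_effect + rng.normal(0, 0.8, n_teachers)
observations = true_effect + rng.normal(0, 1.2, n_teachers)
surveys = true_effect + rng.normal(0, 1.0, n_teachers)
next_year_va = true_effect + rng.normal(0, 0.8, n_teachers)  # target to predict

def standardize(x):
    return (x - x.mean()) / x.std()

measures = [standardize(m) for m in (value_added, observations, surveys)]

# Weights are listed in the order (value added, observations, surveys);
# the first set mirrors the accuracy-maximizing weights cited in the text.
schemes = {
    "accuracy-weighted (.81/.02/.17)": (0.81, 0.02, 0.17),
    "equal weights (.33/.33/.33)": (1 / 3, 1 / 3, 1 / 3),
}

for label, weights in schemes.items():
    composite = sum(w * m for w, m in zip(weights, measures))
    r = np.corrcoef(composite, next_year_va)[0, 1]
    print(f"{label}: correlation with next-year value added = {r:.2f}")
```

In real applications, the choice of weights changes both the predictive power and the reliability of the composite, which is why Mihaly et al. tie the choice of weights to the goals of the evaluation system.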
Supporting Teachers to Improve Instruction

Drawing on the work of Greenwald (1986) and Acemoglu and Pischke (1998), economists have noted that when a firm invests in general skills training, it engages in "asymmetric learning"—that is, it learns more than its competitors about the effect of the training on the worker's productivity. Thus, the firm's decision to invest in such training depends entirely on the effect that the training has on its own productivity. These insights can provide a rationale for education systems to invest in training their teachers once they have entered the profession. In the case of education, the threat of turnover is mitigated by the fact that the education system as a whole bears the costs of training and that much of the competition for teachers is between schools, rather than between education and other fields. Table 7 shows two types of interventions that have been rigorously evaluated: (1) providing or improving learning materials; and (2) providing in-service training.

Providing or Improving Learning Materials

Efforts to provide teachers with supplementary teaching materials have not been as successful as expected. Many initiatives have consisted of trying to incorporate computers into the classroom. Barrera-Osorio and Linden (2009) conducted an RCT to evaluate a program in Colombia in 2006 that sought to integrate computers, donated by the private sector, into the teaching of reading in public schools. The program had virtually no effect on test scores or other outcomes, and the results were consistent across grade levels, subjects, and students' gender. Surveys administered to teachers suggested that the main reason behind the limited effect of the program was that few teachers actually incorporated the donated computers into their classroom practices. These findings indicate that input-based policies are likely to have little impact on student learning if they do not change business-as-usual pedagogical processes.

Yet not all of these interventions have been unsuccessful. Banerjee et al. (2007) used an RCT to evaluate a computer-assisted learning program in Vadodara, India, in 2002. Under the program, fourth graders shared a computer with a peer for two hours per week to play games solving math problems of age-appropriate difficulty. The program increased math scores by .35 of a standard deviation in the first year and by .47 of a standard deviation in the second year, and it was equally effective for all students. Interestingly, however, the program had no effect on students' reading scores, suggesting limited spillover effects. Further, effects seemed to fade shortly after the program ended. The difference between this program and the one in Colombia seems to be that it represented a meaningful change in classroom instruction.

Evidence from developed countries seems to support this conclusion. Rouse and Krueger (2004) used an RCT in 2001 to evaluate the impact of a reading instruction computer program on students with reading difficulties in an anonymous district in the United States. While the program improved some aspects of students' learning skills, these gains did not translate into a broader measure of language acquisition or into reading skills. Together, the three evaluations reviewed here suggest that the impact of computer-assisted learning varies considerably, depending on the details of specific initiatives.

Other types of input-based policies have been even less successful in raising student learning.
Borkum, He, and Linden (2009) conducted an RCT in Bangalore, India in 2007 to assess the impact of a primary school library program staffed by trained librarians. The program was implemented using a "hub and spoke" system in which there were physical libraries in "hub" schools, while "spoke" schools were served by a mobile librarian. The program failed to increase language competency. In fact, the estimates were sufficiently precise to rule out effects larger than .11 and .13 of a standard deviation, based on the confidence intervals (Box 10). These findings indicate that improving school libraries is unlikely to have a positive impact on student achievement.

Box 10. What Is a Confidence Interval?

A confidence interval gives the range of possible values that an estimate could plausibly take if the same evaluation were to be done over and over again. For example, suppose that a study estimates that a given intervention has an effect of .20 of a standard deviation. The confidence interval could say that, if the evaluation were to be repeated many times, the effect could be as small as .10 of a standard deviation or as large as .30. While many readers might not have had to interpret confidence intervals in the context of impact evaluations, they have likely heard of them in opinion polls. When poll results are reported, their "margins of error" are often mentioned. These margins are nothing more than the width of the confidence interval on each side of the estimated effect. In the example above, the margin of error would be +/– .10 of a standard deviation.

In fact, at least in developing countries, additional books seem to have little effect. Glewwe, Kremer, and Moulin (2009) used an RCT to evaluate a program in Busia and Teso, Kenya that provided free textbooks to schools. The intervention had no impact on average test scores or student attendance. Interestingly, the textbooks increased the test scores of already high-achieving students by .06 of a standard deviation. But few children in lower grades (15–29 percent) were actually able to read these books, since they were in English, which was the students' third language. This study is a reminder that input-based policies are likely to have a limited impact when there are good reasons to believe that the inputs provided will not be appropriately used.

A study in Busia and Teso, Kenya, conducted by Glewwe et al. (2004), found that the limited impact of input-based policies held not only for inputs aimed at students, but also for those aimed at teachers. The authors used an RCT to evaluate the impact of free flipcharts on student learning. The flipcharts included science charts and a teacher's guide for science, charts for health, another set for math, and a wall map—mostly for grades 7 and 8. There was no evidence that the charts increased test scores. This finding was particularly interesting, since other less rigorous studies had found an effect on test scores of .20 of a standard deviation. This evaluation illustrates the importance of rigorous research for making critical policy decisions.

Providing In-service Training

One intervention geared toward supporting teachers' work that has been rigorously evaluated is in-service teacher training. Jacob and Lefgren (2004b) evaluated a reform in Chicago in 1996 that placed 71 of the district's 489 primary schools on academic probation and provided them with funding for in-service teacher training.
The authors conducted an RDD, exploiting the fact that eligibility for probation was determined according to a cutoff in students' standardized reading scores. They found that marginal increases in in-service training had no effect on either reading or math achievement. While more rigorous research is needed on the impact of in-service teacher training, these findings are a reminder that although school districts invest heavily in this type of training, there is little rigorous evidence to show that it works.

Similarly, in Jerusalem, Israel, Angrist and Lavy (2001) conducted a DDA in 1995 of a handful of public schools that received a special infusion of funds primarily earmarked for in-service training. In-service teacher training in secular (i.e., nonreligious) schools improved students' test scores by .20–.40 of a standard deviation, but the effects in religious schools were less robust, meaning that they depended on the way in which the authors modeled the relationship between the treatment and student achievement. Angrist and Lavy speculated that this differential effect might be explained by the fact that the intervention started later and was implemented on a smaller scale in religious schools. The authors compared this intervention to alternative school improvement strategies and concluded that it was a relatively inexpensive way of improving achievement.

It is challenging to draw conclusions from the rigorous studies on different forms of teacher support because the evidence is uneven. However, there is good reason to believe that input-based policies (studied mainly in developing countries) have a limited impact on student achievement, especially when they do not influence classroom instruction, and that the impact of in-service teacher training depends at least on the context in which it is implemented and probably also on its format.

Motivating Teachers to Perform

Economists typically think about incentives in the framework of principal-agent problems. The key insight in this framework, as presented in the work of Mirrlees (1976), Holmstrom (1979), and Shavell (1979), is that, in choosing how to remunerate their employees, employers face a tradeoff between offering a base salary that guarantees a minimum level of compensation and making pay conditional on performance in order to induce effort. An "efficient" contract is one that balances these two goals of "full insurance" and "first-best incentives" (a simple formalization appears below).

Economists have noted that there might be several unintended consequences of incentive contracting. A key insight put forth by Holmstrom and Milgrom (1991) and Baker (1992) is that a worker's measured performance might be quite different from the worker's total contribution to firm value, and that incentive schemes that reward the former might discourage employees' contributions to the work of their peers or the long-run effects of their actions (Gibbons and Waldman 1999). The other drawback of incentive contracting is the tradeoff between intrinsic and extrinsic motivation—that is, mechanisms that provide extrinsic rewards to employees for actions from which they derive intrinsic motivation might attenuate the latter (Bénabou and Tirole 2003). Table 8 shows two types of interventions that have been rigorously evaluated: (1) hiring contract teachers; and (2) paying teachers for raising student achievement.
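Before turning to that evidence, the insurance–incentives tradeoff can be stated compactly. In the standard textbook formalization associated with this literature (a stylized rendering, not a formula taken from the studies reviewed here), a teacher's pay is a linear function of measured performance:

\[
w = \alpha + \beta y,
\]

where $\alpha$ is the base salary, $y$ is measured performance (for example, student test score gains), and $\beta$ is the strength of the incentive. Setting $\beta = 0$ provides full insurance but no incentive to exert effort; raising $\beta$ strengthens incentives but shifts the risk embedded in $y$—which reflects many factors outside the teacher's control—onto the teacher, who must be compensated for bearing it. An efficient contract chooses $\beta$ to balance these two forces, and the programs reviewed below can be read as experiments with different definitions and sizes of $\beta$ and $y$.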
Hiring Contract Teachers

In many school systems, teachers are public employees, and once they obtain permanent employment status they can only be dismissed in extreme circumstances, through a process that is typically time-consuming. Therefore, instead of hiring new teachers as public employees, some countries have implemented reforms to hire teachers on annual, renewable contracts. The most recent evaluation of this type of initiative was conducted by Muralidharan and Sundararaman (2010) in Andhra Pradesh, India, starting in 2005. The program provided an extra contract teacher to 100 randomly chosen government-run primary schools. At the end of two years, students in schools with an extra contract teacher outperformed those in comparison schools by .15 of a standard deviation in math and .13 in reading. While all students benefited from the program, those in their first year in school and in remote schools benefited the most. Contract teachers were absent less frequently than regular teachers (16 percent as opposed to 27 percent of the time) and were more likely to be found teaching (49 percent as opposed to 43 percent of the time). Finally, while the students of contract and regular teachers made similar achievement gains, contract teachers cost only one-fifth of what regular teachers cost.

In some contexts, contract teachers have been shown to positively affect achievement even when they lack formal training. In 2001, Banerjee et al. (2007) used an RCT to evaluate an initiative in Mumbai and Vadodara, India that provided government schools with an extra teacher to work with third and fourth graders who were struggling academically. These teachers, typically young women, were recruited from local communities and had only finished secondary school. Yet they improved students' test scores by .14 of a standard deviation in the first year and .28 in the second year of the program. In fact, treated schools still scored .10 of a standard deviation higher than control schools one year after the program had ended. Perhaps most importantly, children at the bottom of the initial test score distribution and those receiving remedial teaching made the largest gains.

Paying Teachers to Improve Student Achievement

By far one of the most popular and rigorously evaluated ways in which education systems have recently tried to motivate teachers is by making part of their pay conditional on the achievement of their students. In 2007, New York City adopted a teacher incentive program in more than 200 high-need schools. The program awarded a school up to $3,000 for every full-time unionized teacher if it met the annual performance target set by the city's Department of Education based on school report card scores. The target was based on student achievement and progress on the state exams for primary and middle schools, a high school leaving exam, student attendance, and a learning environment survey administered to teachers, parents, and students. The program also gave schools $1,500 per full-time unionized teacher if they met at least 75 percent of their mandated target. Schools that received the bonus could choose to distribute it at their discretion. Fryer (2011) took advantage of the fact that schools were randomly assigned to the program to evaluate its impact and found no effect on student achievement, nor any evidence that the program affected teacher retention, absenteeism, or the learning environment.
The author considered several reasons why the incentive program might have been ineffective, including the possibilities that the incentives were not large enough, that the incentive scheme was too complex, that group-based incentives may invite free-riding, and that teachers may not know how to improve student achievement. Results from an experiment in Nashville, Tennessee were equally discouraging. Springer et al. (2010) evaluated the impact of the Project on Incentives in Teaching (POINT), an initiative that offered middle school math teachers bonuses of up to $15,000 per year for getting their students to make unusually large gains on the state exam—the Tennessee Comprehensive Assessment Program (TCAP)—during the 2006–08 school years. The authors conducted an RCT in which half of the teachers who volunteered for the study were randomly assigned to a group that was eligible for the bonuses and half were assigned to a group that was not eligible. Springer and his collaborators found that students assigned to eligible teachers on average performed no differently than those assigned to noneligible teachers. While fifth graders of eligible teachers outperformed their peers in the second and third years of the study, this effect did not persist once these students moved on to the next grade.

Most other rigorous studies of performance pay plans for teachers were conducted in developing countries. Muralidharan and Sundararaman (2009) carried out an RCT to evaluate a teacher incentive program in the Indian state of Andhra Pradesh. The program offered bonuses of about 3 percent of teachers' annual pay, based on the average improvement of their students' test scores on independently administered tests. After two years, the bonus program increased student achievement by .28 of a standard deviation in math and .16 in reading. Interestingly, incentive schools performed better on both mechanical and conceptual components of the tests, suggesting these improvements were not driven by "teaching to the test." In fact, students in treated schools also did better in nonincentivized subjects such as science and social studies. Finally, the results of individual- and group-based incentives were not different in the first two years of the program. This study suggests that well-designed merit pay plans can have important effects on student achievement in developing countries.

A study in Kenya, however, drew attention to the importance of what these programs measure and reward. Glewwe, Ilias, and Kremer (2010) conducted an RCT of a program in Busia and Teso, Kenya that rewarded primary school teachers with in-kind prizes based on students' scores on district exams. The program assigned low scores to students who did not take the exam so as to eliminate perverse incentives for schools to discourage low-scoring students from taking the exams. Students in treated schools scored higher than those in control schools in the first year. Yet most of the gains were concentrated in the aspects measured by the reward formula, and the program had no effect on teacher behavior (including teacher attendance, the amount of homework assigned, and teacher pedagogy). The bonus increased test preparation by 4.2 percent in the first year and by 7.4 percent in the second year. These findings illustrate that, unless they are adequately designed, incentive programs can lead teachers to focus on test preparation but produce little change that is conducive to better learning.
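The Kenya findings are a textbook illustration of the multitasking problem flagged earlier in the discussion of Holmstrom and Milgrom (1991) and Baker (1992). In a stylized rendering (an illustration of the general argument, not a model estimated in the study), suppose a teacher with a fixed effort budget $E$ divides effort between test preparation, $e_1$, and broader instruction, $e_2$, so that $e_1 + e_2 \le E$, and that

\[
y_{\text{rewarded}} = e_1, \qquad \text{learning} = \theta_1 e_1 + \theta_2 e_2 .
\]

If pay rewards only $y_{\text{rewarded}}$, the teacher shifts effort toward $e_1$: measured scores rise, but learning can stagnate or even fall whenever broader instruction is more productive ($\theta_2 > \theta_1$). This is the pattern Glewwe, Ilias, and Kremer observed—more test preparation, unchanged teaching behavior, and gains concentrated in what the reward formula measured.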
In Chile, Rau and Contreras (2009) used DDA, RDD, and PSM to evaluate a program that offered bonuses to all teachers in public or publicly subsidized primary and middle schools based on students' performance and progress on the national student achievement test and other school factors. The bonus was small but far from negligible (about 10–40 percent of a teacher's monthly wage). The program increased average learning outcomes by .07–.12 of a standard deviation. Interestingly, however, there was no evidence that winning the bonus led to any additional gains. These findings hint at the possibility that while the introduction of the program might have led schools to work harder to raise student learning through easy efficiency gains, the complexity (and, therefore, unpredictability) of the bonus might have given them little clue about how to accomplish improvements beyond these initial gains.

Some of the evaluations of teacher incentives also shed light on aspects that can be useful in the design of these initiatives. For example, Lavy (2008, 2009) used an RDD to evaluate a program in Israel that awarded monetary bonuses to individual teachers based on students' average scores and passing rates on matriculation exams, relative to predicted scores (adjusted for students' socioeconomic status). The bonus was large (70–300 percent of a teacher's monthly wage) and teachers were likely to receive it (about 48 percent of teachers got some award). The program resulted in students in treated schools having pass rates that were 14 percent higher in math and 5 percent higher in English. These students also had 10 percent higher scores in math and 4 percent higher scores in English. Yet Lavy found that while the effects of the bonus did not differ by gender, female teachers were more pessimistic about their likelihood of getting it. This study suggests that even when bonuses are predictable, different groups of teachers might react to their predictability differently.

An evaluation of a teacher incentive program in Mexico shed light on the relative importance of the different aspects being rewarded and on the persistence of rewards in a merit pay program. McEwan and Santibáñez (2005) used an RDD to evaluate the program, which awarded bonuses to teachers and principals who scored above a cutoff on an index that considered their education, experience, and students' test scores, among other factors. The awards were not only substantial; they also persisted throughout teachers' careers. However, there was no evidence that student test scores improved as a result of the reform. The authors speculated that this might have been because the bulk of the award (about 80 percent) was determined by the background of teachers and principals rather than by their performance on the job. This suggests that, in designing incentive programs, education systems should ensure that bonuses are linked to the types of changes in employee behavior and productivity that they seek to promote.
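Several of the studies above rely on an RDD: teachers (or schools) just above an award cutoff are compared with those just below it. The following minimal sketch shows the logic of the estimator on simulated data with a known effect; it is illustrative only and does not reproduce any of the evaluations discussed here.

```python
# A minimal regression-discontinuity sketch, loosely in the spirit of the
# Israeli and Mexican evaluations above: units just above a cutoff receive
# an award, and the outcome jump at the cutoff estimates the award's effect.
# The data and the true effect (0.10 SD) are simulated for illustration.
import numpy as np

rng = np.random.default_rng(1)
n = 4000
score = rng.uniform(-1, 1, n)        # index relative to the award cutoff at 0
award = (score >= 0).astype(float)   # treatment: scoring above the cutoff
outcome = 0.5 * score + 0.10 * award + rng.normal(0, 0.3, n)

# Local linear regression within a bandwidth on each side of the cutoff,
# allowing the slope to differ across the cutoff.
h = 0.25
w = np.abs(score) <= h
X = np.column_stack([np.ones(w.sum()), score[w], award[w], score[w] * award[w]])
beta, *_ = np.linalg.lstsq(X, outcome[w], rcond=None)
print(f"RDD estimate of the award effect: {beta[2]:.3f} SD (true value 0.10)")
```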
Finally, a study in Kenya offers important insight on the potential of input-based rather than output-based teacher incentives. Kremer et al. (2001) used an RCT to evaluate an individual bonus program at the pre-school level that offered principals resources to award teachers bonuses for good attendance. The bonus was sizable, constituting up to 300 percent of teachers' monthly wages. Importantly, however, principals distributed the bonus among all teachers in their schools, regardless of their actual attendance. The bonus had no impact on teacher absenteeism (on average, about 29 percent), student attendance, or achievement. In addition to lending further support to the perils of free-riding raised by the study of the New York City bonus, this study suggests that input-based bonuses hold little promise for increasing student learning.

In sum, the studies on interventions seeking to raise teacher effort and productivity yield at least two clear lessons. First, hiring contract teachers seems to be a promising way for developing countries to expand their teacher workforces. Second, while the effects of merit pay programs in developed countries seem mixed, the evidence in developing countries is much more encouraging. Yet key design features (e.g., group versus individual bonuses, coverage, performance measures, award processes, predictability, and bonus size) are crucial in mediating the impact of these plans (Bruns and Santibáñez 2011).

Making Sense of the Theory and Evidence: Take-Aways and Caveats

This paper has tried to provide an analytical framework to make sense of the evidence on teacher policies in developed and developing countries and to engage readers in a conversation about what is known and how well it is known. Unlike many prior reviews, this paper has not sought to "distill" the key lessons from the available evidence for a nontechnical audience. Rather, the aim has been to explain the methods used in studies of teacher policies in hopes of promoting an informed dialogue that includes all stakeholders, from citizens to teachers, government officials, experts, and others. This concluding section offers some final observations about the state of the evidence on teacher policies and puts forth some directions for future research.

First, evidence on teacher policies is uneven, and one should be cautious about how to interpret it. Much is known about some policies and very little or nothing about others. This is so for several reasons. One is that there are certain areas in which there have not been opportunities to conduct rigorous research, mostly because rules and regulations make it difficult for researchers to implement some of the methodological strategies reviewed here. Another reason is that some interventions are still quite new, and it will take some time for researchers to evaluate their results. In the absence of rigorous studies on a particular intervention, it can be tempting to look to the best available study, even if it cannot distinguish the impact of an intervention from other factors that may affect the outcomes of interest. This perspective is understandable: as stated at the outset of this paper, policy decisions need to be made on a daily basis, and governments cannot wait for the most rigorous studies to be published before they make a decision. However, it is also clear that this "next-best research" approach has not always served the education community well in the past. There have been several important teacher policies for which initial evidence looked promising until rigorous studies indicated otherwise. Rather than relying on nonrigorous studies, governments should design new interventions in ways that render them amenable to rigorous evaluation. In this way, a policy itself can inform future decisions.

Second, while the focus here has been on the impact of teacher policies on student achievement, important outcomes that governments need to take into account have been left out. As several studies on the economics of education have shown, there can often be important "sleeper" (i.e., dormant) effects of policies that cannot be identified until children reach adolescence or even adulthood (e.g., effects on health, crime, family planning, earnings, and civic participation). Governments need to put in place robust data systems that allow researchers to assess these key long-term outcomes.
Third, this paper has identified a need to better understand not only the effects of each teacher policy in isolation, but also the interactions between teacher policies. Several studies have begun to do this by designing evaluations that assess the impact of different combinations of interventions (e.g., class-size reductions by themselves, class-size reductions with teacher monitoring, and class-size reductions with teacher monitoring and ability tracking). We applaud these efforts and encourage others to follow suit.

Finally, we believe there is a pressing need for more cost-benefit considerations in impact evaluations. It is often the case that several interventions can tackle the same problem effectively, but information about their relative costs and benefits is lacking, making it difficult to identify which intervention promises a bigger "bang for the buck." Some scholars have begun to pay attention to these issues in other areas of education (Dhaliwal et al. 2011), but more energy should be devoted to examining cost issues in the evaluation of teacher policies in particular.
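As a purely hypothetical illustration of the kind of cost-effectiveness comparison called for above, the following sketch ranks two invented interventions by learning gains per $100 spent per student; the effect sizes and costs are made up and do not describe any study in this review.

```python
# Illustrative cost-effectiveness comparison of the kind the text calls for:
# learning gains per unit of spending. Both interventions and all numbers
# are hypothetical; they do not describe any study reviewed in this paper.
interventions = {
    "extra contract teacher": {"effect_sd": 0.15, "cost_per_student": 5.0},
    "teacher bonus program":  {"effect_sd": 0.20, "cost_per_student": 12.0},
}

for name, v in interventions.items():
    # SD of learning gained per $100 spent per student
    ratio = v["effect_sd"] / v["cost_per_student"] * 100
    print(f"{name}: {ratio:.1f} SD per $100 per student")
```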
Table 1. Evidence on Setting Clear Expectations for Teachers

Scaffolding Teachers' Work

- Intervention: Scripted lessons (pre-school, grade 2) with a structured curriculum, reading activities, ability tracking, and compulsory reading at home. Location: United States. Study: Borman et al. (2007). Method: Very strong (RCT). Results: Effect sizes ranged from .21 SD on the passage comprehension literacy measure to .33 SD on the word attack measure.

- Intervention: Pre-packaged curriculum and activities (pre-school and grade 1) for reading. Location: Mumbai, India. Study: He, Linden, and MacLeod (2009). Method: Very strong (RCT). Results: Gains in students' reading scores of .26–.70 SD, proving most effective in pre-school classes and with low-performing students. The version of the program taught out of school time was more effective than the school-day version, yielding an additional .24 SD effect over the .26 SD effect.

- Intervention: Pre-packaged classroom activities (grades 1–5) with either a machine or flashcards. Location: Maharashtra, India. Study: He, Linden, and MacLeod (2007). Method: Very strong (RCT). Results: Gains in English achievement of .30 SD. Older and lower-performing students benefited the most from the program. The version of the program implemented by teachers improved not only students' English scores but also their math scores, suggesting positive spillovers.

- Intervention: Reading marathon (grade 4) accompanied by a training program for teachers and provision of reading materials. Location: Tarlac, Philippines. Study: Abeberese, Kumler, and Linden (2011). Method: Very strong (RCT). Results: Increased the number of books read in the month after the program from 2.3 to 9.5. Students' reading scores increased by .13 SD. Three months after the reading marathon, treated students still read 3.1 more books than the comparison group and had reading scores that were .06 SD higher.

- Intervention: Manual of operations and school-level report cards (grades 3–4). Location: Madagascar. Study: Lassibille et al. (2010); Glewwe and Maïga (2011). Method: Very strong (RCT). Results: No statistically significant effects on test scores. The impact did not vary by teacher type (i.e., civil service, contract, or student teachers). The program raised attendance and reduced grade repetition only when coupled with interventions at the subdistrict and district levels.

Increasing Instructional Time

- Intervention: After-school program and summer school (grades 5–6), including academic and enrichment activities and mentoring after school and during the summer. Location: Washington, DC. Study: Linden, Herrera, and Grossman (2011). Method: Very strong (RCT). Results: Higher test scores by .09 SD (reading) and .12 SD (problem-solving) by the second year. Participating students were also more likely to visit a high school, get information, and talk to adults and peers about high schools to decide where to apply.

- Intervention: Summer school (grades 3 and 6), compulsory for all students who do not pass a competency exam to move on to the next grade. Location: Chicago. Study: Jacob and Lefgren (2002). Method: Strong (RDD + IVE). Results: .11–.13 SD higher test scores in reading and math in grade 3, but no higher test scores in grade 6.

Reducing Class Size

- Intervention: Class-size reductions (pre-school to grade 3) through small classes or regular-size classes with a teacher aide. Location: Tennessee. Study: Krueger (1999). Method: Very strong (RCT). Results: The average per-student effect was about .048 SD in reading and math. The effect was strongest for minority students and those with low socioeconomic status.

- Intervention: Class-size reductions (grade 1) by hiring a contract teacher, tracking students by initial ability, and/or empowering parents to monitor teacher performance. Location: Busia and Teso, Kenya. Study: Duflo, Dupas, and Kremer (2007). Method: Very strong (RCT). Results: Reducing class size (on average from 80 to 46 students), in the absence of any other reform, led to lower teacher effort and no discernible improvement in student achievement. Combining class-size reductions with improved incentives led to significantly higher test scores—either by hiring contract teachers and increasing parental oversight (an improvement of .19 SD in literacy and numeracy) or by hiring contract teachers and tracking students (an improvement of .25–.31 SD).

- Intervention: Class-size reductions (grades 1–6) emerging from idiosyncratic variations in student population or from maximum/minimum rules. Location: Connecticut. Study: Hoxby (2000). Method: Strong (IVE + RDD). Results: No effect on student achievement. It is possible to rule out even modest effects, such as .02–.04 SD for a 10% reduction in class size.

- Intervention: Class-size reductions (grades 3–5) through a rule that capped the maximum number of students per class at 40. Location: Israel. Study: Angrist and Lavy (1999). Method: Strong (RDD). Results: The per-student effect was .017–.019 SD in fourth grade and .036–.071 SD in fifth grade, combining student achievement in both subjects.

Source: Prepared by the authors.
Note: IVE = instrumental variables estimation; RCT = randomized control trial; RDD = regression discontinuity design; SD = standard deviation.
Table 2. Evidence on Attracting the Best People into Teaching

Setting Requirements for Entry into Teaching

- Intervention: Professional certification (grades 2–5) based on classroom videos and essays. Location: Los Angeles. Study: Cantrell et al. (2008). Method: Very strong (RCT). Results: Students of teachers with professional certification scored .22 SD higher than those of teachers who applied to become certified but did not receive the certificate. However, certified teachers were no more effective than nonapplicant teachers. Students of successful applicants also achieved greater gains than those of unsuccessful applicants, but not greater than those of nonapplicants. Students of high-scoring successful applicants outperformed those of low-scoring successful applicants.

- Intervention: Content knowledge tests (grade 6) for teachers already in service. Location: Peru. Study: Metzler and Woessmann (2010). Method: Tentative (student, teacher, and subject fixed effects). Results: A 1 SD increase in teacher test scores in math and reading increases student test scores by .10 SD.

- Intervention: Traditional teaching credentials (grades 3–5), including teaching experience, test scores, and licensure. Location: North Carolina. Study: Clotfelter, Ladd, and Vigdor (2007a). Method: Tentative (student, school, subject, and year fixed effects). Results: A teacher's experience, test scores, and regular licensure have positive effects on student learning, with larger effects for math than for reading. The magnitudes are large when compared to the effects of changes in class size or of the socioeconomic characteristics of students.

- Intervention: Teacher certification (grades 4–8). Location: New York City. Study: Kane, Rockoff, and Staiger (2006). Method: Tentative (school, grade, and year fixed effects). Results: The certification status of a teacher has a small effect on student performance. Among teachers with the same certification status, there are large and persistent differences in teacher effectiveness. Even high-turnover groups from alternative pathways into teaching would have to be only slightly more effective in their first year to offset the effects of their high exit rates.

- Intervention: Content knowledge tests (grades 4–8) for teachers already in service. Location: Charlotte-Mecklenburg, North Carolina; Dallas, Texas; Denver, Colorado; Hillsborough County, Florida; New York City; and Memphis, Tennessee. Study: Cantrell and Kane (2013). Method: Tentative (correlations over two sections of students). Results: Tests of teachers' subject-specific pedagogical knowledge in math and reading are correlated with teachers' performance on subject-specific observations, but not with teachers' value added. The tests did not correlate with teachers' value added at any point of the value-added distribution (i.e., they were not useful in identifying bottom-, average-, or top-performing teachers).

Relaxing Entry Requirements for Outstanding Individuals

- Intervention: Alternative pathway into teaching (grades 1–5) that recruits outstanding college graduates to teach in high-need schools for two years. Location: United States. Study: Decker, Mayer, and Glazerman (2004). Method: Very strong (RCT). Results: The program attracted teachers who were more likely to have attended a competitive college and to attain their education degree while teaching, and less likely to have education-specific training and student-teaching experience, than others in their schools. The students of teachers in the program performed .15 SD higher in math than students of other teachers (or 10% of a grade equivalent), but no differently in reading. Compared to students of other novice teachers, students of teachers in the program performed .26 SD higher in math, but no differently in reading.

- Intervention: Alternative pathway into teaching (pre-school to grade 5) that combines all nontraditionally trained teachers. Location: United States. Study: Constantine et al. (2009). Method: Very strong (RCT). Results: Traditionally and alternatively certified teachers (TC and AC, respectively) varied widely in their hours of pre-service training. TC and AC teachers did not differ in their scores on college entrance exams, the selectivity of their college, or their educational attainment. TC and AC teachers did not differ in their effectiveness in raising student learning. The amount or content of teacher training did not seem to affect the effectiveness of AC teachers.

- Intervention: Alternative pathway into teaching (grades 7–9) that recruits outstanding college graduates to teach in high-need schools for two years. Location: Chile. Study: Alfonso, Santiago, and Bassi (2011). Method: Tentative (PSM). Results: Increased scores by .22–.51 SD in reading and by .17–.43 SD in math in participating schools in its first year, and by .75 SD in Spanish and .33 SD in math in its second year. Participating schools also had students with higher measures of self-esteem, self-efficacy, and intellectual and meta-cognitive skills.

Rewarding Advanced Educational Qualifications and Experience

- Intervention: Teaching experience (grades 4–8) from 3 to 12 years. Location: Charlotte-Mecklenburg, North Carolina; Dallas, Texas; Denver, Colorado; Hillsborough County, Florida; New York City; and Memphis, Tennessee. Study: Kane and Staiger (2012). Method: Tentative (school, grade, and subject fixed effects, two sections of students). Results: Teachers with 12 or more years of experience have students with .01 SD higher scores in math and .02 SD higher scores in reading on state tests compared to students of teachers with fewer than three years of experience. These results were lower when math and reading were measured on supplemental tests. Teachers with a master's degree had students with .03 SD higher scores in math and .02 SD lower scores in reading on state tests compared to teachers without a master's. These results were higher when math and reading were measured on supplemental tests. Gains from experience were considerably smaller than gains on a combined measure of teacher effectiveness, which included teachers' value added as well as their performance on a student survey and on classroom observations.

- Intervention: Advanced degrees and certification (grades 3–8). Location: Texas. Study: Hanushek et al. (2005). Method: Tentative (school and student fixed effects). Results: Teacher quality appears to be unrelated to advanced degrees or certification, but experience in the first year of teaching seems to matter. Good teachers tend to be effective with all student ability levels, but there is a positive value to matching students and teachers by race.

Increasing Teacher Pay

- Intervention: Improvements in teacher pay (elementary and secondary school) in terms of levels and distribution. Location: United States. Study: Hoxby and Leigh (2004). Method: Strong (IVE). Results: Pay compression increased the share of lowest-aptitude female college graduates who became teachers by 9 percentage points and decreased the share of highest-aptitude female college graduates who became teachers by 12 percentage points. Improvements in pay parity reduced the share of women who taught by 3.2 percentage points for the highest-aptitude group and by 0 for the three lower-aptitude groups.

Source: Prepared by the authors.
Note: AC = alternatively certified teachers; IVE = instrumental variables estimation; PSM = propensity score matching; RCT = randomized control trial; SD = standard deviation; TC = traditionally certified teachers.
Table 3. Evidence on Preparing Teachers with Useful Training and Experience

Providing Mentoring to New Teachers

- Intervention: Compulsory mentoring (grades 4–8) in which new teachers with less than one year of experience met weekly with a mentor. Location: New York City. Study: Rockoff (2008). Method: Strong (DDA + IVE). Results: Teachers without prior teaching experience were 4.5% more likely to complete their first year if they had been assigned a mentor. Retention in a school is higher when the mentor has previous experience at that school. For every 10 hours of mentoring, students of mentored teachers scored .05 SD higher in math and .04 SD higher in reading.

- Intervention: Comprehensive mentoring (grades 2–5) with intensive mentorship support for new teachers, orientations, professional development, classroom observations, and feedback. Location: United States. Study: Glazerman et al. (2010). Method: Very strong (RCT). Results: 90% of new teachers in districts assigned to a comprehensive mentoring program had a mentor assigned to them, compared to 72% of those in districts with a regular mentoring program. Teachers receiving comprehensive mentoring for one year were no more effective, as measured by observations and student achievement, than those with business-as-usual mentoring. Teachers with two years of comprehensive mentoring raised student achievement by .11 SD in reading and .20 SD in math in the third year of the study. Comprehensive mentoring had no effects on teacher retention or satisfaction.

Including a Practice Component in Teacher Preparation

- Intervention: Teacher residency (grades 4–8) with local recruitment, a year-long residency with a mentor four days a week, gradual classroom responsibilities, and a three-year commitment. Location: Boston. Study: Papay et al. (2011). Method: Tentative (school, student, grade, and year fixed effects). Results: Participating teachers are more racially diverse than other novice teachers in their district, more likely to teach math and science, and more likely to remain teaching in their district through their fifth year. Initially, these teachers are no more effective at raising student test scores than other novice teachers in language, and less effective than them in math, but their effectiveness improves rapidly over time, such that by their fourth and fifth years they outperform veteran teachers.

Source: Prepared by the authors.
Note: DDA = differences-in-differences analysis; IVE = instrumental variables estimation; RCT = randomized control trial; SD = standard deviation.

Table 4. Evidence on Matching Teachers' Skills with Students' Needs

Offering Bonuses for Teachers to Work in High-need Schools

- Intervention: Bonuses to attract and retain talented novice teachers in low-performing schools (all grades) for at least four years, with recipients selected based on grade point averages, recommendation letters, CVs, essays, and interviews. Location: California. Study: Steele, Murnane, and Willett (2010). Method: Strong (IVE). Results: A $20,000 signing bonus increased by 28 percentage points the probability that its recipients taught in low-performing schools. 75% of both bonus and nonbonus recipients who began working in low-performing schools stayed in such schools for at least four years.

Offering Bonuses for Teachers to Teach Critical Shortage Areas

- Intervention: Bonuses to retain certified teachers in math, science, and special education (grades 10–12) in low-income or low-performing schools. Location: North Carolina. Study: Clotfelter et al. (2008). Method: Strong (DDA). Results: Increased year-on-year retention by 10–13%. Math teachers receiving the bonus were 18% less likely to depart the following year than those who did not get the bonus, but science and special education teachers were no less likely to leave. Middle-school teachers were 27% less likely to leave if they had received a bonus, but bonus-receiving high school teachers showed no statistically significant difference in attrition patterns.

Improving Working Conditions

- Intervention: Job matching (grades 3–5), as measured by teachers who switch schools. Location: North Carolina. Study: Jackson (2013). Method: Tentative (teacher, school, and year fixed effects). Results: Teachers' value-added scores improved by .09 SD in math and .07 SD in reading when they moved to a school that was a better match for them. Match quality was also negatively correlated with school-switching, unrelated to exit from the profession, and increased with experience.

- Intervention: Effective peers (grades 3–5), as measured by the prior student achievement of the peers of a teacher who switches schools. Location: North Carolina. Study: Jackson and Bruegmann (2009). Method: Tentative (student, teacher, school, and year fixed effects). Results: A 1 SD improvement in observable peer characteristics is associated with a .10 SD increase in the test scores of a teacher's students in math and reading. A 1 SD increase in a teacher's peers' mean value added increases student test scores by .05 SD in math and reading.

Source: Prepared by the authors.
Note: DDA = differences-in-differences analysis; IVE = instrumental variables estimation; SD = standard deviation.

Table 5. Evidence on Leading Teachers with Strong Principals

Hiring More Effective Principals

- Intervention: Hiring more effective principals (grades 3–8), as measured by the average performance of students in their schools. Location: Texas. Study: Branch, Hanushek, and Rivkin (2012). Method: Tentative (principal and school fixed effects). Results: The annual impact of having an effective rather than an ineffective principal is .05–.21 SD, depending on the method used to identify this effect. Teachers who leave schools with the most successful principals are more likely to have been among the least effective teachers in that school than teachers leaving schools run by less successful principals. It is not always the case that more effective principals are more likely to remain in their position and less likely to leave public schools.

- Intervention: Hiring more effective principals (grades 3–10), as measured by the average performance of students in their schools. Location: Miami. Study: Grissom, Kalogrides, and Loeb (2012). Method: Tentative (school, neighborhood, and peer fixed effects). Results: The choice of model to estimate principal effects is substantively important for assessment. While some models identify principal effects as large as .15 SD in math and .11 SD in reading, others find effects as low as .02 SD in both subjects for the same principals.

- Intervention: Hiring more effective principals (grades 3–10), as measured by the average performance of students in their schools. Location: British Columbia, Canada. Study: Coelli and Green (2012). Method: Tentative (school, neighborhood, and peer fixed effects). Results: Getting a principal who is 1 SD better in the principal effects distribution implies that graduation rates will be .33 SD higher and English scores will be 1 SD higher in grade 12.

Setting Requirements for Becoming a Principal

- Intervention: Surveys of parents, teachers, and assistant principals on principal effectiveness. Location: Miami. Study: Grissom and Loeb (2011). Method: Tentative (teacher and student fixed effects). Results: A 1 SD increase in principals' self-rating on organizational skills is associated with a .10 SD increase in school accountability performance. Principals' self-rating on organizational skills is also related to parental ratings of their children's schools, but not consistently to teacher satisfaction. Ratings of principals' organizational skills given by assistant principals are also associated with math and reading gains.

- Intervention: Principal experience and education, as measured by the selectivity of undergraduate and graduate institutions, prior work experience, and professional development. Location: New York City. Study: Clark, Martorell, and Rockoff (2009). Method: Tentative (school effects). Results: Little relationship between school performance and the selectivity of a principal's undergraduate or graduate institution. Little relationship between performance and a principal's prior work experience. Principals with three years of experience had students whose math scores were .04 SD higher than those of students with principals in their first year, and principals with five or more years of experience had students whose math scores were .06 SD higher. Mixed evidence on the relationship between principal training and professional development programs and school performance.

Granting Principals More Authority over Staffing Decisions

- Intervention: Providing teacher value-added scores (grades 4–8), in which principals received performance measures of their teachers based on student test score outcomes. Location: New York City. Study: Rockoff et al. (2011). Method: Very strong (RCT). Results: Increased the likelihood that teachers with low performance estimates exited their schools after the information was provided, causing a small improvement in teacher productivity at these schools. Receipt of "hard" performance data did not "crowd out" information that principals collect through classroom observation.

- Intervention: Discretion for principals to dismiss probationary teachers (grades 3, 5, and 8) for any reason in an expedited way. Location: Chicago. Study: Jacob (2010a and b). Method: Strong (DDA + matching). Results: Reduced absences among probationary teachers by roughly 10% and reduced the prevalence of teachers with 15 or more absences by 20%. Increased student achievement in elementary schools. The effects are only partly explained by changes in teacher absenteeism, suggesting that the policy affected teacher behavior. There is tentative evidence that principals consider teacher absences and value-added measures, along with several demographic characteristics, in choosing whom to dismiss.

Source: Prepared by the authors.
Note: DDA = differences-in-differences analysis; RCT = randomized control trial; SD = standard deviation.

Table 6. Evidence on Monitoring Teaching and Learning

Increasing Community and Parental Involvement in School Affairs

- Intervention: Increasing community participation in school management or support (grades 1–6) by explaining the role of village education committees to community members, training community volunteers to administer assessments, or training community volunteers to provide remedial instruction. Location: Uttar Pradesh, India. Study: Banerjee et al. (2010). Method: Very strong (RCT). Results: A program that explained the role of village education committees to their members and their communities increased committee members' knowledge of their role by .39 SD. A program that trained volunteers to administer and report on reading, writing, and math assessments increased committee members' knowledge of their role by .35 SD. A program that trained local volunteers to provide remedial reading classes after school increased committee members' knowledge of their role by .32 SD and increased students' reading test scores by about .02 SD.

- Intervention: Providing information to communities about their role in school management (grades 1–5) through state-mandated roles and responsibilities in school management. Location: Karnataka, Madhya Pradesh, and Uttar Pradesh, India. Study: Pandey, Goyal, and Sundararaman (2009). Method: Very strong (RCT). Results: Increased the number of meetings of village education committees and member participation in school inspections by 25% in Uttar Pradesh, and increased the percentage of parents who talked to teachers about the quality of education in Madhya Pradesh. The program also raised teacher attendance by 11% in Uttar Pradesh and increased teacher classroom activity by 30% in Madhya Pradesh. The program increased reading achievement in grade 3 in both Uttar Pradesh and Madhya Pradesh.

- Intervention: Organizing meetings with parents (grade 6) to discuss helping their children with homework, report cards, and other school issues. Location: France. Study: Avvisati et al. (2008). Method: Very strong (RCT). Results: Increased parental involvement at school and at home. Children of treated families developed more positive behavior and attitudes in school and had fewer literacy problems. There were large spillover effects on classmates of treated families.

Grading or Ranking Schools Based on Student Achievement

- Intervention: School- and student-level report cards (grade 3) with notes and rankings in English, math, and Urdu, with other schools and villages used as benchmarks. Location: Punjab, Pakistan. Study: Andrabi, Das, and Khwaja (2009). Method: Very strong (RCT). Results: Higher learning by .10 SD. Gains are heterogeneous: report cards produce a .34 SD increase in schools that initially scored below the median at baseline, have no effect on schools that initially scored above the median, and cause a .10 SD increase in scores in government schools. Private school fees decrease by 18%.

- Intervention: Assigning grades to schools (grades 3–10) using an A-to-F scale based on proficiency rates in reading, writing, and math. Location: Florida. Study: Feng, Figlio, and Sass (2010). Method: Strong (DDA). Results: Teachers working in schools that received a lower grade than expected were 11% more likely to leave their school, and those in schools that received a higher grade than expected were 2.3% less likely to leave. Teachers at schools downgraded to a failing grade were 42% more likely to leave and 67% more likely to move to a school in the same district. The effectiveness of teachers staying in downgraded schools increased by .05 SD in student achievement, but since the teachers leaving these schools were of higher quality, the net effect on teacher composition was nil.

- Intervention: Grading schools and linking grades to rewards and consequences (grades 4–8), including possible school closure. Location: New York City. Study: Rockoff and Turner (2010). Method: Strong (RDD). Results: Increased student achievement in math and English and improved parental evaluations of school quality.
- Intervention: Ranking schools by student achievement (grades 4, 8, and 10), accounting for their socioeconomic status, and offering them a monetary incentive for good performance. Location: Chile. Study: Mizala and Urquiola (2007). Method: Strong (RDD). Results: No consistent effect on the learning outcomes of top-ranked schools that receive the award.

Monitoring Teacher Effort

- Intervention: Monitoring and rewarding teacher attendance (grades 1–6) using tamper-proof cameras and making salary a function of attendance. Location: Rajasthan, India. Study: Duflo, Hanna, and Ryan (2010). Method: Very strong (RCT). Results: Rural schools that monitored teacher attendance with cameras and made individual teachers' salaries a direct function of their attendance reduced teacher absenteeism from 42% to 23%. After a year, test scores in treatment schools were .17 SD higher than in control schools, and grade completion increased.

Monitoring Teacher Performance

- Intervention: Combining multiple measures of teacher effectiveness (grades 4–8), including classroom observations, student surveys, and principal surveys. Location: Charlotte-Mecklenburg, North Carolina; Dallas, Texas; Denver, Colorado; Hillsborough County, Florida; New York City; and Memphis, Tennessee. Study: Kane et al. (2013). Method: Very strong (RCT). Results: A composite measure of teacher effectiveness that combines student surveys, classroom observations, and a teacher's track record of student achievement gains on state tests identified teachers who produced higher average student achievement following random assignment of teachers to classrooms. The magnitude of the achievement gains produced by teachers identified as effective prior to random assignment was proportional to their performance on the composite measure. A subset of teachers identified as effective prior to randomization also had students who performed better on "audit" tests of math and reading.

Source: Prepared by the authors.
Note: DDA = differences-in-differences analysis; RCT = randomized control trial; RDD = regression discontinuity design; SD = standard deviation.

Table 7. Evidence on Supporting Teachers to Improve Instruction

Providing or Improving Learning Materials

- Intervention: Providing computers to public schools (grades 3–9) to be integrated into the teaching of reading. Location: Colombia. Study: Barrera-Osorio and Linden (2009). Method: Very strong (RCT). Results: Little effect on students' test scores and other outcomes. The results are consistent across grade levels, subjects, and students' gender. Surveys indicate that few teachers incorporated the computers into their curriculum.

- Intervention: Providing computer-adaptive software (grade 4) in math, during and after school (two hours per week), with local instructors. Location: Vadodara, India. Study: Banerjee et al. (2007). Method: Very strong (RCT). Results: Increased math scores by .36 SD in the first year, but had no effect on reading, suggesting limited spillover effects.

- Intervention: Providing language/reading software (grades 3–6). Location: United States. Study: Rouse and Krueger (2004). Method: Very strong (RCT). Results: Improved some aspects of students' learning skills, but these gains did not translate into a broader measure of language acquisition or into actual reading skills.

- Intervention: Introducing libraries (grades 1–6) through a "hub-and-spoke" system and trained librarians who assist students. Location: Bangalore, India. Study: Borkum, He, and Linden (2009). Method: Very strong (RCT). Results: No increase in language competency scores.

- Intervention: Providing textbooks (grades 3–8) in English, math, and science. Location: Busia and Teso, Kenya. Study: Glewwe, Kremer, and Moulin (2009). Method: Very strong (RCT). Results: No effect on average test scores or student attendance. Test scores increased by .06 SD for already high-achieving students, but few children in lower grades were able to read the (English-language) texts (15–29% of median students).

- Intervention: Providing flipchart materials (grades 3–8), including science charts, teachers' manuals, health charts, math charts, and geography maps. Location: Busia and Teso, Kenya. Study: Glewwe et al. (2004). Method: Very strong (RCT). Results: No effect on student test scores.

Providing In-Service Teacher Training

- Intervention: Providing in-service training (grades 2–8) through funding for low-performing schools. Location: Chicago. Study: Jacob and Lefgren (2004b). Method: Strong (RDD). Results: No effect on either reading or math achievement.

- Intervention: Providing in-service training (grade 4) on a weekly basis during school hours by external trainers, focused on Hebrew, math, and English teaching. Location: Jerusalem, Israel. Study: Angrist and Lavy (2001). Method: Strong (DDA + matching). Results: Teacher training in secular schools led to an improvement in student achievement of .20–.40 SD. Estimates for religious schools are not clear-cut, perhaps because training in religious schools started later and was implemented on a smaller scale.

Source: Prepared by the authors.
Note: DDA = differences-in-differences analysis; RCT = randomized control trial; RDD = regression discontinuity design; SD = standard deviation.

Table 8. Evidence on Motivating Teachers to Perform

Hiring Contract Teachers

- Intervention: Hiring contract teachers (grades 1–5), chosen by school committees on an annual basis, without labor protections and at a lower salary. Location: Andhra Pradesh, India. Study: Muralidharan and Sundararaman (2010). Method: Very strong (RCT). Results: Test scores were .15 SD higher in math and .13 SD higher in reading. Contract teachers were absent 16% of the time, while regular teachers were absent 27% of the time. Contract teachers were found teaching 49% of the time, regular teachers 43% of the time. There is tentative evidence that the students of contract and regular teachers make the same achievement gains, although contract teachers cost one-fifth of what regular teachers cost.

- Intervention: Hiring contract teachers (grades 3–4) who pull children out of their regular classrooms to give them remedial instruction. Location: Mumbai and Vadodara, India. Study: Banerjee et al. (2007). Method: Very strong (RCT). Results: Test scores were .14 SD higher in the first year and .28 SD higher in the second year than those of control schools. Treated schools still scored .10 SD higher one year after the program ended. Children at the bottom of the initial test score distribution and those receiving remedial teaching made the largest gains.

Paying Teachers to Raise Student Achievement

- Intervention: Bonuses for performance and inputs (grades 3–12) based on attendance, climate, and the level of and progress in student achievement; schools decide how to distribute the bonuses. Location: New York City. Study: Fryer (2011). Method: Very strong (RCT). Results: No increase in student achievement. No evidence that teacher incentives affect teacher retention, absenteeism, or the learning environment.

- Intervention: Bonuses for performance (grades 5–8) for reaching a minimum performance level in math. Location: Nashville, Tennessee. Study: Springer et al. (2010). Method: Very strong (RCT). Results: Middle-school students assigned to teachers eligible to receive bonuses for improving student achievement performed, on average, no differently than those assigned to noneligible teachers. Fifth-grade students of eligible teachers outperformed their peers in the second and third years of the study, but this effect did not persist once these students moved on to the next grade.

- Intervention: Bonuses for individual and group performance (grades 2–5) for reaching a minimum threshold and for additional increases. Location: Andhra Pradesh, India. Study: Muralidharan and Sundararaman (2009). Method: Very strong (RCT). Results: Improved student achievement by .28 SD (math) and .16 SD (reading) after two years. Incentive schools performed better on both mechanical and conceptual components of the test, suggesting little evidence of adverse consequences. Students in incentive schools also did better in nonincentivized subjects, such as science and social studies, suggesting positive spillover effects. The results of individual- and group-based incentives did not differ significantly in either the first or the second year.

- Intervention: Bonuses for group performance (grades 4–8) based on rankings by performance level and change. Location: Busia and Teso, Kenya. Study: Glewwe, Ilias, and Kremer (2010). Method: Very strong (RCT). Results: Schools where school committees gave bonuses to teachers whose students did well on exams scored .14 SD higher than control schools in the first year. The program had no effect on teacher attendance, the amount of homework that teachers assigned, or teacher pedagogy. The program increased test preparation by 4.2% in the first year and 7.4% in the second year.

- Intervention: Bonuses for group performance and inputs (grades 4, 8, and 10) based on student learning outcomes and teacher or school inputs. Location: Chile. Study: Rau and Contreras (2009). Method: Strong (DDA + PSM + RDD). Results: A .07–.12 SD increase in average learning outcomes. No evidence that winning the bonus led to additional gains.

- Intervention: Bonuses for individual performance (grades 10–12) based on students' passing rates and average scores on their high school matriculation exams. Location: Israel. Study: Lavy (2008, 2009). Method: Strong (RDD + matching). Results: Students in treated schools had 14% higher pass rates and 10% higher scores in math, and 5% higher pass rates and 4% higher scores in English. The effects of the bonus did not differ by gender, although female teachers were more pessimistic about their possibility of receiving an award.

- Intervention: Bonuses for individual performance and inputs (grades 3–6) based on educational credentials, experience, professional development, peer evaluations, subject matter knowledge, and student achievement. Location: Mexico. Study: McEwan and Santibáñez (2005). Method: Strong (RDD). Results: No evidence that student test scores improved as a result of the performance-based pay reform.

- Intervention: Bonuses for group performance and other factors (grades 9–11) based on school rankings on credits, graduation and dropout rates, and student achievement. Location: Israel. Study: Lavy (2002). Method: Strong (RDD). Results: A .13 SD improvement in student achievement and modest increases in credits earned and the percentage of students taking matriculation exams.

- Intervention: Bonuses for individual inputs (pre-school), including absenteeism; principals decide how to allocate the bonuses. Location: Busia and Teso, Kenya. Study: Kremer et al. (2001). Method: Very strong (RCT). Results: No effect on teacher absenteeism, student attendance, or achievement.

Source: Prepared by the authors.
Note: DDA = differences-in-differences analysis; PSM = propensity score matching; RCT = randomized control trial; RDD = regression discontinuity design; SD = standard deviation.

References

Abdulkadiroglu, A., J.D. Angrist, S.M. Dynarski, T.J. Kane, and P.A. Pathak. 2011.
Accountability and Flexibility in Public Schools: Evidence from Boston's Charters and Pilots. Quarterly Journal of Economics 126: 699–748.
Abeberese, A.B., T.J. Kumler, and L.L. Linden. 2011. Improving Reading Skills by Encouraging Children to Read: A Randomized Evaluation of the Sa Aklat Sisikat Reading Program in the Philippines. Cambridge, MA: Innovations for Poverty Action (IPA).
Acemoglu, D., and J.-S. Pischke. 1998. Why Do Firms Train? Theory and Evidence. Quarterly Journal of Economics 113(1): 79–119.
Akerlof, G.A. 1970. The Market for "Lemons": Quality Uncertainty and the Market Mechanism. Quarterly Journal of Economics 84(3): 488–500.
Akerlof, G.A. 1982. Labor Contracts as Partial Gift Exchange. Quarterly Journal of Economics 97(4): 543–69.
Alcázar, L., F.H. Rogers, N. Chaudhury, J. Hammer, M. Kremer, and K. Muralidharan. 2006. Why Are Teachers Absent? Probing Service Delivery in Peruvian Primary Schools. International Journal of Educational Research 45: 117–36.
Alfonso, M., A. Santiago, and M. Bassi. 2011. Estimating the Impact of Placing Top University Graduates in Vulnerable Schools in Chile. Washington, DC: Inter-American Development Bank.
Andrabi, T., J. Das, and A. Khwaja. 2009. Report Cards: The Impact of Providing School and Child Test Scores on Educational Markets. Washington, DC: World Bank.
Angrist, J.D., and J. Guryan. 2008. Does Teacher Testing Raise Teacher Quality? Evidence from State Certification Requirements. Economics of Education Review 27(5): 483–503.
Angrist, J.D., and V. Lavy. 1999. Using Maimonides' Rule to Estimate the Effect of Class Size on Scholastic Achievement. Quarterly Journal of Economics 114(2): 533–75.
Angrist, J.D., and V. Lavy. 2001. Does Teacher Training Affect Pupil Learning? Evidence from Matched Comparisons in Jerusalem Public Schools. Journal of Labor Economics 19(2): 343–69.
Arrow, K.J. 1963. Uncertainty and the Welfare Economics of Medical Care. American Economic Review 53(5): 941–73.
Autor, D.H. 2003a. Lecture Note: Self-Selection – The Roy Model. MIT 14.661 (November 14). Massachusetts Institute of Technology.
Autor, D.H. 2003b. Lecture Note: Monitoring, Measurement and Risk. MIT 14.661 (November 13). Massachusetts Institute of Technology.
Avvisati, F., M. Gurgand, N. Guyon, and E. Maurin. 2008. Quels Effets Attendre d'une Politique d'Implication des Parents d'Élèves dans les Collèges? Les Enseignements d'une Expérimentation Contrôlée. Paris: École d'Économie de Paris.
Baker, G. 1992. Incentive Contracts and Performance Measurement. Journal of Political Economy 100: 598–614.
Banerjee, A.V., R. Banerji, E. Duflo, and M. Walton. 2011. What Helps Children to Learn? Evaluation of Pratham's Read India Program in Bihar and Uttarakhand. Cambridge, MA: Abdul Latif Jameel Poverty Action Lab (JPAL).
Banerjee, A.V., S. Cole, E. Duflo, and L. Linden. 2007. Remedying Education: Evidence from Two Randomized Experiments in India. Quarterly Journal of Economics 122(3): 1235–264.
Banerjee, A.V., R. Banerji, E. Duflo, R. Glennerster, and S. Khemani. 2010. Pitfalls of Participatory Programs: Evidence from a Randomized Evaluation in Education in India. American Economic Journal: Economic Policy 2(1): 1–30.
Barlevy, G., and D. Neal. 2011. Pay for Percentile. NBER Working Paper No. 17194. Cambridge, MA: National Bureau of Economic Research.
Barrera-Osorio, F., and L.L. Linden. 2009. The Use and Misuse of Computers in Education: Evidence from a Randomized Controlled Trial of a Language Arts Program. Cambridge, MA: Abdul Latif Jameel Poverty Action Lab (JPAL).
Becker, G.S. 1964. Human Capital: A Theoretical and Empirical Analysis, with Special Reference to Education. Chicago: University of Chicago Press.
Becker, G.S., and K. Murphy. 2000. Social Economics. Cambridge, MA: Harvard University Press.
Bénabou, R., and J. Tirole. 2003. Intrinsic and Extrinsic Motivation. Review of Economic Studies 70(3): 489–520.
Borjas, G. 1987. Self-Selection and the Earnings of Immigrants. American Economic Review 77: 531–53.
Borkum, E., F. He, and L. Linden. 2009. School Libraries and Language Skills in Indian Primary Schools: A Randomized Evaluation of the Akshara Library Program. Cambridge, MA: Abdul Latif Jameel Poverty Action Lab (JPAL).
Borman, G., R.E. Slavin, A. Cheung, A. Chamberlain, N.A. Madden, and B. Chambers. 2007. Final Reading Outcomes of the National Randomized Field Trial of Success for All. American Educational Research Journal 44(3): 701–31.
Boyd, D., H. Lankford, and S. Loeb. 2005. The Draw of Home: How Teachers' Preferences for Proximity Disadvantage Urban Schools. Journal of Policy Analysis and Management 24(1): 113–32.
Boyd, D., H. Lankford, S. Loeb, and J. Wyckoff. 2011. Teacher Layoffs: An Empirical Illustration of Seniority versus Measures of Effectiveness. Education Finance and Policy 6(3): 439–54.
Boyd, D., P.L. Grossman, H. Lankford, S. Loeb, and J. Wyckoff. 2009. Teacher Preparation and Student Achievement. Educational Evaluation and Policy Analysis 31(4): 416–40.
Boyd, D., P. Grossman, K. Hammerness, H. Lankford, S. Loeb, M. Ronfeldt, and J. Wyckoff. 2010a. Recruiting Effective Math Teachers: How Do Math Immersion Teachers Compare? Evidence from New York City. NBER Working Paper 16017. Cambridge, MA: National Bureau of Economic Research.
Boyd, D., H. Lankford, S. Loeb, M. Ronfeldt, and J. Wyckoff. 2010b. The Role of Teacher Quality in Retention and Hiring: Using Applications-to-Transfer to Uncover Preferences of Teachers and Schools. NBER Working Paper 15966. Cambridge, MA: National Bureau of Economic Research.
Boyd, D., P. Grossman, M. Ing, H. Lankford, S. Loeb, R. O'Brien, and J. Wyckoff. 2011. The Effectiveness and Retention of Teachers with Prior Career Experience. Economics of Education Review 30: 1229–241.
Bruns, B., D. Filmer, and H.A. Patrinos. 2011. Making Schools Work: New Evidence on Accountability Reforms. Washington, DC: World Bank.
Bruns, B., and L. Santibáñez. 2011. Making Teachers Accountable. In Making Schools Work: New Evidence on Accountability Reforms, ed. by B. Bruns, D. Filmer, and H.A. Patrinos. Washington, DC: World Bank.
Cantrell, S., and T.J. Kane. 2013. Ensuring Fair and Reliable Measures of Effective Teaching: Culminating Findings from the MET Project's Three-Year Study. Seattle: Bill and Melinda Gates Foundation.
Cantrell, S., J. Fullerton, T.J. Kane, and D.O. Staiger. 2008. National Board Certification and Teacher Effectiveness: Evidence from a Random Assignment Experiment. NBER Working Paper 14608. Cambridge, MA: National Bureau of Economic Research.
Chetty, R., J.N. Friedman, N. Hilger, E. Saez, D.W. Schanzenbach, and D. Yagan. 2011. How Does Your Kindergarten Classroom Affect Your Earnings? Evidence from Project STAR. Quarterly Journal of Economics 126(4): 1593–660.
Clark, D., P. Martorell, and J. Rockoff. 2009. School Principals and School Performance. CALDER Working Paper No. 38. Washington, DC: National Center for Analysis of Longitudinal Data in Educational Research (CALDER), the Urban Institute.
Clotfelter, C.T., H.F. Ladd, and J.L. Vigdor. 2007a.
Teacher Credentials and Student Achievement: Longitudinal Analysis with Student Fixed Effects. Economics of Education Review 26(6): 673–82.
Clotfelter, C.T., H.F. Ladd, and J.L. Vigdor. 2007b. Are Teacher Absences Worth Worrying About in the U.S.? NBER Working Paper 13648. Cambridge, MA: National Bureau of Economic Research.
Clotfelter, C., E. Glennie, H. Ladd, and J. Vigdor. 2008. Would Higher Salaries Keep Teachers in High-Poverty Schools? Evidence from a Policy Intervention in North Carolina. Journal of Public Economics 92(5–6): 1352–370.
Cohen, J. 1988. Statistical Power Analysis for the Behavioral Sciences. Hillsdale, NJ: Lawrence Erlbaum Associates.
Constantine, J., D. Player, T. Silva, K. Hallgreen, M. Grider, and J. Deke. 2009. An Evaluation of Teachers Trained Through Different Routes to Certification: Final Report. Washington, DC: Mathematica Policy Research/Institute of Education Sciences.
Das, J., S. Dercon, J. Habyarimana, and P. Krishnan. 2007. Teacher Shocks and Student Learning: Evidence from Zambia. Journal of Human Resources 42(4): 820–62.
Decker, P.T., D.P. Mayer, and S. Glazerman. 2004. The Effects of Teach for America on Students: Findings from a National Evaluation. Princeton, NJ: Mathematica Policy Research, Inc.
Dee, T.S. 2004. Teachers, Race and Student Achievement in a Randomized Experiment. The Review of Economics and Statistics 86(1): 195–210.
Dee, T.S. 2007. Teachers and the Gender Gaps in Student Achievement. Journal of Human Resources 42(3): 528–54.
Deming, D.J. 2011. Better Schools, Less Crime? Quarterly Journal of Economics 126(4): 2063–115.
Dhaliwal, I., E. Duflo, R. Glennerster, and C. Tulloch. 2011. Comparative Cost-Effectiveness Analysis to Inform Policy in Developing Countries: A General Framework with Applications for Education. Cambridge, MA: Abdul Latif Jameel Poverty Action Lab (JPAL).
Dobbie, W., and R.G. Fryer. 2009. Are High Quality Schools Enough to Close the Achievement Gap? Evidence from a Social Experiment in Harlem. NBER Working Paper 15473. Cambridge, MA: National Bureau of Economic Research.
Dobbie, W., and R.G. Fryer. 2011. The Impact of Youth Service on Future Outcomes: Evidence from Teach for America. NBER Working Paper 17402. Cambridge, MA: National Bureau of Economic Research.
Duflo, E., P. Dupas, and M. Kremer. 2007. Peer Effects, Pupil-Teacher Ratios, and Teacher Incentives: Evidence from a Randomized Evaluation in Kenya. Cambridge, MA: Abdul Latif Jameel Poverty Action Lab (JPAL).
Duflo, E., P. Dupas, and M. Kremer. 2009. Additional Resources versus Organizational Changes in Education: Experimental Evidence from Kenya. Cambridge, MA: Abdul Latif Jameel Poverty Action Lab (JPAL).
Duflo, E., R. Hanna, and S. Ryan. 2010. Incentives Work: Getting Teachers to Come to School. Cambridge, MA: Abdul Latif Jameel Poverty Action Lab (JPAL).
Feng, L., D.N. Figlio, and T. Sass. 2010. School Accountability and Teacher Mobility. NBER Working Paper 16070. Cambridge, MA: National Bureau of Economic Research.
Ferraz, C., and B. Bruns. Forthcoming. Incentives to Teach: The Effects of Performance Pay in Brazilian Schools. Washington, DC: World Bank.
Fryer, R.G. 2011. Teacher Incentives and Student Achievement: Evidence from New York City Public Schools. NBER Working Paper 16850. Cambridge, MA: National Bureau of Economic Research.
Ganimian, A.J., and E. Vegas. 2011. What Are the Different Profiles of Successful Teacher Policy Systems? SABER-Teachers Background Paper No. 5. Washington, DC: World Bank.
Gibbons, R., and M. Waldman. 1999. Careers in Organizations: Theory and Evidence.
In Handbook of Labor Economics, Vol. 3, ed. by O. Ashenfelter and D. Card. Amsterdam: Elsevier.
Glazerman, S., and A. Seifullah. 2012. An Evaluation of the Chicago Teacher Advancement Program (Chicago TAP) After Four Years: Final Report. Washington, DC: Mathematica Policy Research.
Glazerman, S., E. Isenberg, S. Dolfin, M. Bleeker, A. Johnson, M. Grider, and M. Jacobus. 2010. Impacts of Comprehensive Teacher Induction: Final Results from a Randomized Controlled Study. Washington, DC: Mathematica Policy Research/Institute of Education Sciences.
Glewwe, P., and E. Maïga. 2011. The Impacts of School Management Reforms in Madagascar: Do the Impacts Vary by Teacher Type? Cambridge, MA: Abdul Latif Jameel Poverty Action Lab (JPAL).
Glewwe, P., N. Ilias, and M. Kremer. 2010. Teacher Incentives. American Economic Journal: Applied Economics 2(3): 205–27.
Glewwe, P., M. Kremer, and S. Moulin. 2009. Many Children Left Behind? Textbooks and Test Scores in Kenya. American Economic Journal: Applied Economics 1(1): 112–35.
Glewwe, P., M. Kremer, S. Moulin, and E. Zitzewitz. 2004. Retrospective vs. Prospective Analyses of School Inputs: The Case of Flip Charts in Kenya. Journal of Development Economics 74: 251–68.
Goodman, S., and L. Turner. 2010. Teacher Incentive Pay and Educational Outcomes: Evidence from the NYC Bonus Program. Paper prepared for the conference on "Merit Pay: Will It Work? Is It Politically Viable?" Program for Education Policy and Governance, Harvard Kennedy School, June 3–4.
Greenwald, B.C. 1986. Adverse Selection in the Labour Market. Review of Economic Studies 53: 325–47.
Grissom, J.A., and S. Loeb. 2011. Triangulating Principal Effectiveness: How Perspectives of Parents, Teachers and Assistant Principals Identify the Central Importance of Managerial Skills. American Educational Research Journal 48(5): 1–33.
Hanushek, E.A. 2003. The Failure of Input-Based Schooling Policies. The Economic Journal 113: F64–F98.
Hanushek, E.A., and L. Woessmann. 2007. Education Quality and Economic Growth. Washington, DC: World Bank.
Hanushek, E.A., J.F. Kain, D.M. O'Brien, and S.G. Rivkin. 2005. The Market for Teacher Quality. NBER Working Paper 11154. Cambridge, MA: National Bureau of Economic Research.
He, F., L.L. Linden, and M. MacLeod. 2007. Helping Teach What Teachers Don't Know: An Assessment of the Pratham English Language Program. Cambridge, MA: Abdul Latif Jameel Poverty Action Lab (JPAL).
He, F., L.L. Linden, and M. MacLeod. 2009. A Better Way to Teach Children to Read? Evidence from a Randomized Controlled Trial. Cambridge, MA: Abdul Latif Jameel Poverty Action Lab (JPAL).
Hicks, J. 1939. The Foundations of Welfare Economics. The Economic Journal 49(196): 696–712.
Holmstrom, B. 1979. Moral Hazard and Observability. Bell Journal of Economics 10(1): 74–91.
Holmstrom, B., and P. Milgrom. 1991. Multitask Principal-Agent Analyses: Incentive Contracts, Asset Ownership and Job Design. Journal of Law, Economics and Organization 7: 24–52.
Horng, E.L., D. Klasik, and S. Loeb. 2010. Principal's Time Use and School Effectiveness. American Journal of Education 116: 491–523.
Hoxby, C.M. 2000. The Effects of Class Size on Student Achievement: New Evidence from Population Variation. Quarterly Journal of Economics 115(4): 1239–285.
Hoxby, C.M., and A. Leigh. 2004. Pulled Away or Pushed Out? Explaining the Decline of Teacher Aptitude in the United States. American Economic Review 94(2): 236–40.
Jackson, C.K. 2013. Match Quality, Worker Productivity and Worker Mobility: Direct Evidence from Teachers.
Jackson, C.K., and E. Bruegmann. 2009. Teaching Students and Teaching Each Other: The Importance of Peer Learning for Teachers. American Economic Journal: Applied Economics 1(4): 85–108.
Jacob, B.A. 2010a. The Effect of Employment Protection on Worker Effort: Evidence from Public Schooling. NBER Working Paper 15655. Cambridge, MA: National Bureau of Economic Research.
Jacob, B.A. 2010b. Do Principals Fire the Worst Teachers? NBER Working Paper 15715. Cambridge, MA: National Bureau of Economic Research.
Jacob, B.A., and L. Lefgren. 2004a. Remedial Education and Student Achievement: A Regression-Discontinuity Analysis. Review of Economics and Statistics 86(1): 226–44.
Jacob, B.A., and L. Lefgren. 2004b. The Impact of Teacher Training on Student Achievement: Quasi-Experimental Evidence from School Reform Efforts in Chicago. Journal of Human Resources 39(1): 50–79.
Jacob, B.A., and L. Lefgren. 2005. Principals as Agents: Subjective Performance Measurement in Education. NBER Working Paper 11463. Cambridge, MA: National Bureau of Economic Research.
Jacob, B.A., and L. Lefgren. 2007. What Do Parents Value in Education? An Empirical Investigation of Parents’ Revealed Preferences for Teachers. Quarterly Journal of Economics 122(4): 1603–637.
Jacob, B.A., and S.D. Levitt. 2003. Rotten Apples: An Investigation of the Prevalence and Predictors of Teacher Cheating. Quarterly Journal of Economics 118(3): 843–78.
Johnson, W. 1978. A Theory of Job Shopping. Quarterly Journal of Economics 92: 261–77.
Kaldor, N. 1939. Welfare Propositions in Economics and Interpersonal Comparisons of Utility. The Economic Journal 49(195): 549–52.
Kane, T.J., D.F. McCaffrey, T. Miller, and D.O. Staiger. 2013. Have We Identified Effective Teachers? Validating Measures of Effective Teaching Using Random Assignment. Seattle: Bill and Melinda Gates Foundation.
Kane, T.J., and D.O. Staiger. 2010. Learning about Teaching: Initial Findings from the Measures of Effective Teaching Project. Seattle: Bill and Melinda Gates Foundation.
Kane, T.J., and D.O. Staiger. 2012. Gathering Feedback for Teaching: Combining High-Quality Observations with Student Surveys and Achievement Gains. Seattle: Bill and Melinda Gates Foundation.
Kane, T.J., J.E. Rockoff, and D.O. Staiger. 2006. What Does Certification Tell Us About Teacher Effectiveness? Evidence from New York City. NBER Working Paper 12155. Cambridge, MA: National Bureau of Economic Research.
Kane, T.J., E.S. Taylor, J.H. Tyler, and A.L. Wooten. 2010. Identifying Effective Classroom Practices Using Student Achievement Data. NBER Working Paper 15803. Cambridge, MA: National Bureau of Economic Research.
Kemple, J.J., and C.J. Willner. 2008. Career Academies: Long-Term Impacts on Labor Market Outcomes, Educational Attainment and Transitions to Adulthood. New York: MDRC.
Kremer, M.E., D. Chen, P. Glewwe, and S. Moulin. 2001. Interim Report on a Teacher Incentive Program in Kenya. Cambridge, MA: Harvard University.
Krueger, A. 1999. Experimental Estimates of Education Production Functions. Quarterly Journal of Economics 114: 497–532.
Lang, K., and W.T. Dickens. 1987. Neoclassical and Sociological Perspectives on Segmented Labor Markets. NBER Working Paper 2127. Cambridge, MA: National Bureau of Economic Research.
Lassibille, G., J.-P. Tan, C. Jesse, and T.V. Nguyen. 2010. Managing for Results in Primary Education in Madagascar: Evaluating the Impact of Selected Workflow Interventions. The World Bank Economic Review 24(2): 303–29.
Lavy, V. 2002. Evaluating the Effect of Teachers’ Group Performance Incentives on Pupil Achievement. Journal of Political Economy 110(6): 1286–317.
Lavy, V. 2008. Gender Differences in Market Competitiveness in a Real Workplace: Evidence from Performance-Based Pay Tournaments among Teachers. Paper prepared for the conference on “Economic Incentives: Do They Work in Education?” Program for Education Policy and Governance, Harvard Kennedy School, May 16–17.
Lavy, V. 2009. Performance Pay and Teachers’ Effort, Productivity, and Grading Ethics. American Economic Review 99(5): 1979–2011.
Lavy, V. 2010. Do Differences in Schools’ Instruction Time Explain International Achievement Gaps in Math, Science, and Reading? Evidence from Developed and Developing Countries. NBER Working Paper 16227. Cambridge, MA: National Bureau of Economic Research.
Lazear, E.P., and S. Rosen. 1981. Rank-Order Tournaments as Optimum Labor Contracts. Journal of Political Economy 89(5): 841–64.
Linden, L.L., C. Herrera, and J. Grossman. 2011. Achieving Academic Success After School: A Randomized Evaluation of the Higher Achievement Program. Cambridge, MA: Abdul Latif Jameel Poverty Action Lab (JPAL).
Lucas, R.E., and E.C. Prescott. 1974. Equilibrium Search and Unemployment. Journal of Economic Theory 7(2): 188–209.
McEwan, P., and L. Santibáñez. 2005. Teacher and Principal Incentives in Mexico. In Incentives to Improve Teaching: Lessons from Latin America, ed. by E. Vegas. Washington, DC: World Bank.
Metzler, J., and L. Woessmann. 2010. The Impact of Teacher Subject Knowledge on Student Achievement: Evidence from Within-Teacher Within-Student Variation. IZA Discussion Paper 4999. Bonn, Germany: Institute for the Study of Labor (IZA).
Mihaly, K., D.F. McCaffrey, D.O. Staiger, and J.R. Lockwood. 2013. A Composite Estimator of Effective Teaching. Seattle: Bill and Melinda Gates Foundation.
Miller, R.T., R.J. Murnane, and J.B. Willett. 2007. Do Teacher Absences Impact Student Achievement? Longitudinal Evidence from One Urban School District. NBER Working Paper 13356. Cambridge, MA: National Bureau of Economic Research.
Mincer, J. 1962. On-the-Job Training: Costs, Returns and Some Implications. Journal of Political Economy 70: 50–79.
Mirrlees, J. 1976. The Optimal Structure of Incentives and Authority within an Organization. Bell Journal of Economics 7: 105–31.
Mizala, A., and M. Urquiola. 2007. School Markets: The Impact of Information Approximating Schools’ Effectiveness. NBER Working Paper 13676. Cambridge, MA: National Bureau of Economic Research.
Montgomery, J. 1991. Equilibrium Wage Dispersion and Interindustry Wage Differentials. Quarterly Journal of Economics 106: 163–79.
Muralidharan, K., and V. Sundararaman. 2009. Teacher Performance Pay: Experimental Evidence from India. NBER Working Paper 15323. Cambridge, MA: National Bureau of Economic Research.
Muralidharan, K., and V. Sundararaman. 2010. Contract Teachers: Experimental Evidence from India. Washington, DC: World Bank.
Murnane, R.J. 2010. Progress and Puzzles in Educational Policy Research. Harvard Education Letter 26(2).
Murnane, R.J., and J.B. Willett. 2011. Methods Matter: Improving Causal Inference in Educational and Social Science Research. New York: Oxford University Press.
Myung, J., S. Loeb, and E.L. Horng. 2011. Tapping the Principal Pipeline: Identifying Talent for Future School Leadership in the Absence of Formal Succession Management Programs. Educational Administration Quarterly 47(5): 695–727.
Nelson, R.R., and E.S. Phelps. 1966. Investment in Humans, Technological Diffusion and Economic Growth. American Economic Review 56(1/2): 69–75.
Osterman, P. 1984. Internal Labor Markets. Cambridge, MA: MIT Press.
Pandey, P., S. Goyal, and V. Sundararaman. 2009. Community Participation in Public Schools: Impact of Information Campaigns in Three Indian States. Education Economics 17(3): 355–75.
Papay, J.P., M.R. West, J.B. Fullerton, and T.J. Kane. 2011. Does Practice-Based Teacher Preparation Increase Student Achievement? Early Evidence from the Boston Teacher Residency. NBER Working Paper 17646. Cambridge, MA: National Bureau of Economic Research.
Pissarides, C.A. 1979. Job Matching with State Employment Agencies and Random Search. The Economic Journal 89(356): 818–33.
Prendergast, C. 1993. The Role of Promotion in Inducing Specific Human Capital Acquisition. Quarterly Journal of Economics 108(2): 523–34.
Prendergast, C. 2002. The Tenuous Trade-off between Risk and Incentives. Journal of Political Economy 110(5): 1071–102.
Rau, T., and D. Contreras. 2009. Tournaments, Gift Exchanges, and the Effect of Monetary Incentives for Teachers: The Case of Chile. Department of Economics Working Paper 305. Santiago: University of Chile.
Reardon, S.F. 2011. The Widening Academic Achievement Gap between the Rich and the Poor: New Evidence and Possible Explanations. In Whither Opportunity? Rising Inequality, Schools and Children’s Life Chances, ed. by G.J. Duncan and R.J. Murnane. New York: Russell Sage Foundation and Spencer Foundation.
Reich, M., D.M. Gordon, and R.C. Edwards. 1973. A Theory of Labor Market Segmentation. American Economic Review 63(2): 359–65.
Rockoff, J.E. 2008. Does Mentoring Reduce Turnover and Improve Skills of New Employees? Evidence from Teachers in New York City. NBER Working Paper 13868. Cambridge, MA: National Bureau of Economic Research.
Rockoff, J.E., and L.J. Turner. 2010. Short Run Impacts of Accountability on School Quality. American Economic Journal: Economic Policy 2(4): 119–47.
Rockoff, J.E., B.A. Jacob, T.J. Kane, and D.O. Staiger. 2008. Can You Recognize an Effective Teacher When You Recruit One? NBER Working Paper 14485. Cambridge, MA: National Bureau of Economic Research.
Rockoff, J.E., D.O. Staiger, T.J. Kane, and E.S. Taylor. 2011. Information and Employee Evaluation: Evidence from a Randomized Intervention in Public Schools. New York: Columbia Business School.
Rosen, S. 1978. Substitution and Division of Labor. Economica 45: 235–50.
Ross, S. 1973. The Economic Theory of Agency: The Principal’s Problem. American Economic Review 63(2): 134–39.
Rouse, C.E., and A.B. Krueger. 2004. Putting Computerized Instruction to the Test: A Randomized Evaluation of a “Scientifically Based” Reading Program. Economics of Education Review 23(4): 323–38.
Roy, A. 1951. Some Thoughts on the Distribution of Earnings. Oxford Economic Papers 3(2): 135–46.
Sattinger, M. 1975. Comparative Advantage and the Distributions of Earnings and Abilities. Econometrica 43: 455–68.
Schultz, T.W. 1963. The Economic Value of Education. New York, NY: Columbia University Press.
Shapiro, C., and J. Stiglitz. 1984. Equilibrium Unemployment as a Worker Discipline Device. American Economic Review 74: 433–44.
Shavell, S. 1979. Risk Sharing and Incentives in the Principal and Agent Relationship. Bell Journal of Economics 10: 55–73.
Spence, A.M. 1974. Market Signaling: Informational Transfer in Hiring and Related Screening Processes. Cambridge, MA: Harvard University Press.
Springer, M.G., D. Ballou, L. Hamilton, V.-N. Le, J.R. Lockwood, D. McCaffrey, M. Pepper, and B.M. Stecher. 2010. Teacher Pay for Performance: Experimental Evidence from the Project on Incentives in Teaching. Pittsburgh: The RAND Corporation.
Steele, J.L., R.J. Murnane, and J.B. Willett. 2010. Do Financial Incentives Help Low-Performing Schools Attract and Keep Academically Talented Teachers? Evidence from California. Journal of Policy Analysis and Management 29(3): 451–58.
Stiglitz, J. 1975. The Theory of Screening, Education and the Distribution of Income. American Economic Review 65: 283–300.
Suryadarma, D., A. Suryahadi, S. Sumarto, and F.H. Rogers. 2006. Improving Student Performance in Public Primary Schools in Developing Countries: Evidence from Indonesia. Education Economics 14(4): 401–29.
Taylor, E.S., and J.H. Tyler. 2011. The Effect of Evaluation on Performance: Evidence from Longitudinal Student Achievement Data of Mid-Career Teachers. NBER Working Paper 16877. Cambridge, MA: National Bureau of Economic Research.
Terviö, M. 2003. Mediocrity in Talent Markets. Berkeley, CA: Haas School of Business, University of California, Berkeley.
Tyler, J.H. 2011. If You Build It, Will They Come? Teacher Use of Student Performance Data on a Web-Based Tool. NBER Working Paper 17486. Cambridge, MA: National Bureau of Economic Research.
Urquiola, M., and E. Verhoogen. 2009. Class-Size Caps, Sorting and the Regression-Discontinuity Design. American Economic Review 99(1): 179–215.
Vegas, E., A.J. Ganimian, N. Goldstein, A. Paglayan, S. Loeb, and P. Romaguera. 2011. What Are Teacher Policy Goals, How Can Education Systems Reach Them and How Will We Know When They Do? SABER-Teachers Background Paper No. 2. Washington, DC: World Bank.
Williamson, O. 1975. Understanding the Employment Relation: The Analysis of Idiosyncratic Exchange. Bell Journal of Economics 6: 250–80.