Policy Applications of Teacher Performance Measures: A Review of the Evidence
Brian Jacob
Presented at the Policy Forum on Teacher Effectiveness in New Jersey
March 22, 2011

My Task
• Summarize existing evidence on the impact of policy applications of teacher performance measures
• I will focus on teacher-level programs including:
  – Teacher value-added measures (VAM) and/or
  – Classroom observation protocols
• I will focus on programs designed for evaluation rather than for only diagnostic purposes

Outline
1. The use of VAM to determine teacher compensation, hire/transfer, and tenure/dismissal
2. Classroom Observation Protocols
3. Holistic teacher evaluation “systems” (e.g., TAP, Minnesota Q-Comp)
4. Some thoughts on implementation details

Teacher VAM and Compensation
• Most common use of VAM to date involves teacher pay, usually bonuses
• Two goals of these policies:
  – Increase teacher effort in the short/medium-term
  – Change the composition of the teaching force in the longer-term
• Important underlying assumptions:
  – Teachers are not working on the “efficiency frontier” in the existing system
  – Teachers have the capacity/support to change behavior
  – The labor supply of potentially effective teachers is relatively elastic

Evidence on VAM and Teacher Pay
• Studies in other countries show positive impact on student performance
  – Probably not very applicable to U.S.
    context
• Nashville, TN: individual bonuses for middle school math teachers up to $15,000
  – No effect on achievement; few effects on any other intermediate outcomes
  – Period of tremendous achievement growth in Nashville: student achievement in both treatment and control groups improved substantially
• New York City: bonuses assigned to schools on the basis of school performance goals, with an average bonus of $3,000/teacher for meeting achievement goals
  – No effect on student achievement or a host of other student and teacher measures
  – Some evidence of more positive effects in smaller schools

VAM and Other Teacher Policies
• No existing programs use VAM for hiring
• “Talent Transfer” program bases transfers/bonuses on VAM
  – Identifies high-performing teachers and offers them incentives to stay/move to high-needs schools for up to two years
  – Large financial incentives: $10,000/yr for two years to move
  – Teachers identified through VA analysis and schools identified on the basis of low performance
  – Charlotte-Mecklenburg started in 2008; LA and Miami in 2010
• No existing policy dismisses and/or denies tenure to teachers based on VAM alone

Classroom Observation Protocols (COPs)
• Why use classroom observations to measure teacher performance?
  – Additional measure increases reliability of overall rating
  – Captures different aspects of effectiveness
  – By focusing on teacher practices, COPs can (a) mitigate gaming/strategic behavior, and (b) provide useful feedback to teachers
• How do current COPs differ from traditional teacher evaluations conducted by administrators?
  – Based on set of pre-specified, observable behaviors
  – Outside observers; multiple unannounced observations per year
• Variation in COPs
  – Danielson: student-teacher relationships, instructional approaches
  – CLASS: focus on emotional climate of the classroom
  – MQI: accuracy and richness of teacher’s mathematical knowledge
  – PLATO: focus on instructional strategies specific to secondary ELA

Research on COPs
• Growing research base suggests COP scores are positively associated with student achievement
  – Danielson: 200+ teachers in Cincinnati, positive relationship with student achievement
  – PLATO: 24 middle school ELA teachers in NYC; 2nd vs. 4th quartile
  – MQI: 24 middle school math teachers in Southwestern district; correlations of .2 to .5 with VAM
• Caveats/Limitations
  – Very small samples; timing of observations and student achievement
  – COPs are *less* predictive of current student achievement than prior VAM
  – COPs are *not* generally more predictive than informal/holistic supervisor ratings
  – Gates-funded MET project (Measures of Effective Teaching) will explore this issue with much larger samples, using much more sophisticated analysis
• Recent study in Cincinnati shows that participation in evaluation itself increases teacher VA by .1 s.d.

Teacher Evaluation “Systems”
• No evidence that merit pay alone works in the short-run
• Some evidence that COPs are associated with teacher VAM, and the evaluation process itself may be beneficial
• => Comprehensive programs: development and training opportunities, data and feedback, incentives and evaluation
• Examples include: D.C.
  IMPACT, Teacher Advancement Program (TAP), Q-Comp in Minnesota

Teacher Advancement Program (TAP)
• Four components:
  – (1) performance-based compensation
  – (2) master/mentor teachers
  – (3) framework for teacher evaluation
  – (4) ongoing trainings and collaborative groups
• Springer, Ballou and Peng (2008)
  – Uses student-level NWEA data in 2 states to compare achievement gains within schools over time: do gains increase after a school adopts TAP?
  – Some positive effects for grades 1-5, but negative effects for grades 9 and 10
• Mathematica Study of Chicago TAP, Yr 2 (2010)
  – Performance pay for principals and other school staff; average teacher bonus was $2,000; mentor (master) teachers received $7,000 ($15,000)
  – Hybrid design: random assignment of 16 schools + matching
  – No effects on student achievement or teacher retention

Denver Pro-Comp
• Consists of 4 components:
  – Market incentives (hard-to-staff schools/subjects)
  – Student growth (VAM)
  – Knowledge and skills (completion of degrees or PD)
  – Professional evaluations
• Voluntary for teachers hired before 2006; mandatory for teachers hired on/after Jan 1, 2006
• Growth in math and reading since 02-03 districtwide, but hard to attribute this to Pro-Comp alone
• Some tentative evidence suggesting there might be some very small positive effects
  – Productivity effects and composition effects
• Ongoing work looks at retention in hard-to-serve schools

Minnesota’s Q-Comp
• Voluntary district-level program
• District plan must include multiple teacher career paths, job-embedded professional development, teacher evaluation, performance pay, and a revised teacher salary schedule
• Districts get $190 per pupil in state aid + $70 per pupil extra
• Since inception in 2005, 46 of Minnesota’s 337 traditional districts and several charter schools have opted in
• Considerable variation in content of district plans
• Recent evaluation found few positive impacts on student achievement
  – No effect on math achievement
  – Potential small
    effects on reading achievement in some districts that focused on actions/outcomes rather than subjective evaluations

The Devil is in the (Implementation) Details
• Much of the debate focuses on what measures will be included and how heavily they will weigh in the overall evaluation
• This misses at least 3 critical implementation choices:
  – Where to set cutoffs for different levels?
  – How to average across components (VAM, COP)?
  – What actions are tied to evaluation outcomes?

Setting Cutoffs to Avoid the Widget Effect
• Report by TNTP documents that traditional teacher evaluation systems overstate the number of exemplary teachers and understate the number of mediocre or ineffective teachers

Distribution of Teacher Ratings Across Districts (Source: Widget Effect)
District     Distinguished  Proficient  Basic  Unsatisfactory
Chicago      68.7%          24.9%       6.1%   0.4%
Cincinnati   57.8%          34.7%       6.9%   0.6%

Distribution of Teacher Ratings in DC IMPACT 09-10
               Highly Effective  Effective  Minimally Effective  Ineffective
Overall Score  16%               66%        16%                  2%

The Effect of Averaging
• Given the 4-point scales commonly used, components such as school performance and “other” outcomes that comprise a small fraction (~10%) of overall rating will generally have no impact on teacher’s overall rating
• Suppose that a system includes only VAM and COP, classifies teachers on a 4-pt scale, weights both equally, and that VAM and COP are correlated .3
  – If 2% of teachers receive ratings of “1” on each, then only .4% will receive a final rating of “1”
  – If 10% of teachers receive ratings of “1” on each, then only 3.5% will receive lowest rating
• Possible remedies?
  – Triage: score of “1” on any measure triggers review

Teacher Non-Renewal Policy in the Chicago Public Schools (CPS)
• Starting in 2004-05, new collective bargaining agreement in CPS stipulated that principals could dismiss (i.e., non-renew) any probationary teacher (years 1-4) outside the RIF process and without the typical process associated with dismissal for cause
• Streamlined process: on-line system allows principals to “click” teachers to non-renew
• Effect of this policy provides some insight on potential impact of COPs and TESs
  – How frequently will principals dismiss teachers?
  – What teacher characteristics do principals value?
  – What are the effects on teacher and student outcomes?

Teacher Non-Renewal in Chicago
• Principals do seem to consider (proxies for) teacher productivity in determining which teachers to dismiss.
  – Principals are more likely to dismiss teachers who are frequently absent, failed the teacher certification exam at least once, or have received worse evaluations in the past.
  – Principals are less likely to dismiss teachers who attended a more competitive college and have an MA degree.
  – Elementary teachers who were dismissed had lower VAM than their peers who were not dismissed.
• Policy reduced absences among probationary teachers by roughly 10-20 percent

Teacher Non-Renewal in Chicago
• Roughly 40% of principals did not dismiss any of their probationary teachers over the first 3 years of the policy
  – Includes many principals in the lowest performing schools in the district
• Potential explanations
  – Teacher labor supply
  – Social norms
• Implications
  – Managerial discretion alone will not necessarily change personnel practices
  – Principal training?
  – Changing the default?

Extra slides

Cincinnati Teacher Evaluation System
• Adapted from Danielson framework; started in 2000-01
• 4 evaluations per year by outside peer experts; 1 additional observation by school administrator
• 4 domains, only 2 of which assessed by observation: learning environment (D2) and teaching for learning (D3)
  – Teachers receive scores from 1-4 in each of 32 elements, which are aggregated up to 15 standards and then the 4 domains
• End-of-year scores in each standard/domain are holistic/subjective determination of outside evaluators
  – Based on “preponderance of the evidence” and can account for growth over the year and/or extenuating circumstances

Evidence on Cincinnati TES
• TES scores predict student achievement even after controlling for various student characteristics (including prior achievement)
  – 1 point increase in TES => 0.10-0.15 s.d. increase in scores
  – Top vs. bottom quartile teacher => 3 percentage point diff
• Conditional on overall score, teachers who score high in classroom environment (Domain 2) relative to teaching practices (Domain 3) appear more effective in math
• Conditional on overall score, teachers whose instruction focuses on questions/discussion relative to standards/content appear more effective in reading

The Widget Effect
Distribution of Teacher Ratings in Cincinnati TES – Since 2001
Domain                 Distinguished  Proficient  Basic  Unsatisfactory
Classroom Environment  64.1%          31.4%       3.7%   0.7%
Teaching Strategies    46.1%          47.4%       6.4%   0.1%
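
Worked sketch for “The Effect of Averaging”
The averaging example turns on how often a teacher lands in the bottom category on both VAM and COP when the two measures are correlated .3. The snippet below is one way to compute that joint share. It assumes (the slides do not state this) that the two underlying scores follow a standard bivariate normal distribution and that a final rating of “1” requires a “1” on both measures. The function name share_lowest_on_both and its parameters are hypothetical. Because the slides give neither their distributional assumptions nor their rounding rule, the output need not reproduce the .4% and 3.5% figures; the sketch only illustrates that the joint share is far smaller than the share on either measure alone.

```python
import math
from statistics import NormalDist


def share_lowest_on_both(p_each: float, rho: float, steps: int = 200) -> float:
    """Share of teachers in the bottom p_each of BOTH measures.

    Assumption (not from the slides): the two scores are standard
    bivariate normal with correlation rho, and the cutoff on each
    measure is the p_each quantile.
    """
    # Common cutoff h such that P(score < h) = p_each on each measure.
    h = NormalDist().inv_cdf(p_each)

    def joint_density_at_cutoff(r: float) -> float:
        # Bivariate normal density evaluated at (h, h) with correlation r.
        return math.exp(-h * h / (1.0 + r)) / (2.0 * math.pi * math.sqrt(1.0 - r * r))

    # The joint probability grows with the correlation at a rate equal to
    # the joint density at the cutoff, so integrate that density from 0 to
    # rho (Simpson's rule; steps must be even) and add the independent case.
    width = rho / steps
    total = joint_density_at_cutoff(0.0) + joint_density_at_cutoff(rho)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * joint_density_at_cutoff(i * width)
    return p_each ** 2 + total * width / 3.0


if __name__ == "__main__":
    for p in (0.02, 0.10):
        joint = share_lowest_on_both(p, rho=0.3)
        print(f"{p:.0%} rated '1' on each measure -> {joint:.2%} rated '1' on both")
```

Setting rho to 0 recovers the independent case (p_each squared), and raising rho toward 1 moves the joint share toward p_each, which is why a “triage” rule that reviews any teacher with a “1” on a single measure flags many more teachers than an averaged rating does.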