Revisiting the Impact of Teachers
Jesse Rothstein
University of California, Berkeley and NBER
rothstein@berkeley.edu
April 2015

Value-added modeling (VAM) and bias
- Value-added models aim to identify teachers' causal effects on students' test scores.
  - In observational data, without control over assignment. The key control is the student's prior-year score.
- Each VAM relies on a strong implicit assumption about the assignment rule (Rothstein 2010). If this is wrong, VAM scores may be biased.
- A burgeoning literature tries to quantify any biases.
- One measure of the magnitude of biases:

    Ω ≡ V[causal effect] / (V[causal effect] + V[bias])

The state of play (part 1)
- Early VAM literature didn't push very hard on causality.
- Rothstein (2010) finds effects of grade-g teachers on students' grade g−2 scores. This implies V[bias] > 0, but the magnitude is unclear.
- Rothstein (2009) simulates realistic DGPs to estimate the range of potential bias. Plausible range: Ω ∈ [0.6, 1.0].

The state of play (part 2)
- Rothstein (2009) plausible range: Ω ∈ [0.6, 1.0].
- Experiments: How well do observational VAM scores forecast experimental teacher effects? (Kane and Staiger, with co-authors)
  - Ω̂ = 0.91 (0.18) math; 1.09 (0.29) ELA (Kane-Staiger 2008).
  - Ω̂ = 1.04 (0.14) math; 0.70 (0.21) ELA (Kane et al. 2013).
  - Small samples and noncompliance create big SEs.
- Quasi-experiment (Chetty, Friedman, Rockoff 2014a,b):
  - Generalization of the experimental tests to "teacher switching" quasi-experiments, with much larger N.
  - Similar "quasi-experiments" used by Chetty-Hendren and Finkelstein-Gentzkow-Williams (2014).
  - Using the standard VAM: Ω̂ = 0.84 (0.05).
  - Adjusting the VA model to account for "drift" gives Ω̂ = 0.97 (0.03). [Or is it Ω̂ = 0.88 (0.03)?]

This paper
- Revisit Chetty et al. (2014a,b), "Measuring the Impact of Teachers":
  - CFR-I: "Teacher switching" quasi-experimental test for bias.
  - CFR-II: Effect of teacher VA on students' long-run outcomes.
- I replicate the CFR-I and CFR-II analyses in North Carolina data, using CFR's methods and code.
- All key results are replicated, essentially perfectly. (See also Bacher-Hicks, Kane, and Staiger, 2014, in L.A.)
- But...

But...
Further analysis indicates:
- Teacher switching is correlated with changes in student composition, violating the quasi-experimental assumption.
- Estimates that adjust for changes in observables indicate moderate bias in VA scores:
  - The CFR specification yields Ω̂ = 0.89 (0.02).
  - My preferred specification yields Ω̂ = 0.80 (0.02).
- Long-run effects are not robust:
  - CFR specifications under-control for observables.
  - Estimates are very sensitive to controls.
  - A zero effect is plausible with richer controls.

CFR-I's key result: Using teacher switching to estimate Ω
Figure: Bin scatter plot of year-over-year changes in mean school-grade-subject scores on changes in mean predicted VA in NYC. (CFR-I, Figure 4A.)

Preview of results I: Replication of CFR-I
Figure: Bin scatter plot of changes in mean end-of-year scores on changes in mean teacher predicted value-added in North Carolina. Fitted line: ΔScore = α_st + 0.981 (0.017) × ΔVA.

Preview of results II: Falsification test
Figure: Bin scatter plot of changes in mean prior-year scores on changes in mean teacher predicted value-added. Fitted line: ΔScore = α_st + 0.134 (0.017) × ΔVA.

Preview of results III: Selection-corrected quasi-experiment
Figure: Bin scatter plot of changes in score growth (change in mean end-of-year score less change in mean prior-year score) on changes in mean teacher predicted value-added. Fitted line: ΔScore = α_st + 0.847 (0.014) × ΔVA.

Outline
1. Measuring VAM bias using experiments & quasi-experiments
2. Reproduction in North Carolina
3. Evaluating the quasi-experimental design
4. Long-run outcomes
5. Conclusion

The VAM specification
- Let A_ijmgst be the test score of student i with teacher j in subject m, grade g, school s, year t.
- Regress

    A_ijmgst = α_jm + X_ijmgst β_m + e_ijmgst.

  - α_jm is a teacher-subject fixed effect (common across classrooms).
  - X_ijmgst includes:
    - A cubic in prior-year scores in each subject.
    - Ethnicity, gender, age, special education, etc.
    - Class- (jmgst) and school-year- (st) means of these.
    - Class size.
  - Note: the aggregate coefficients are identified only from within-teacher, between-class variation.
- Form the residual A_ijmgst − X_ijmgst β̂_m = α̂_jm + ê_ijmgst.
- Average these residuals to the classroom level; call this µ̃_jmsgt.

Shrinkage / Empirical Bayes
- Suppose µ̃ = µ* + e, with:
  - µ* ⊥ e,
  - µ* ~ N(0, σ²_µ*), and
  - e ~ N(0, σ²_e).
- Then the Empirical Bayes estimate of µ* is

    µ̂^EB ≡ [σ²_µ* / (σ²_µ* + σ²_e)] µ̃ = E[µ* | µ̃].

- This is the posterior mean, an unbiased predictor (E[µ* | µ̂^EB] = µ̂^EB), the best linear predictor, and a "shrinkage" estimator.
- Note: If reliability varies across observations, so does the shrinkage ratio. If reliability = 0 (i.e., we have no information), we assign the grand mean, µ̂^EB = E[µ*].
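The shrinkage formula can be illustrated with a short simulation. This is my own sketch, not CFR's code, and the variance components are made-up numbers chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (made-up) variance components.
sigma2_mu = 0.04   # variance of true teacher effects, V[mu*]
sigma2_e = 0.12    # variance of classroom-level noise, V[e]
n = 200_000

mu_star = rng.normal(0.0, np.sqrt(sigma2_mu), n)             # true effects
mu_tilde = mu_star + rng.normal(0.0, np.sqrt(sigma2_e), n)   # noisy VA signal

# EB estimate: shrink toward the grand mean (zero) by the reliability ratio.
reliability = sigma2_mu / (sigma2_mu + sigma2_e)
mu_eb = reliability * mu_tilde

# Prediction unbiasedness: regressing mu* on the EB estimate gives slope ~1,
# i.e., E[mu* | mu_eb] = mu_eb.
slope = np.cov(mu_star, mu_eb)[0, 1] / np.var(mu_eb)
print(round(reliability, 3), round(slope, 2))  # reliability = 0.25, slope ≈ 1
```

A noisier signal (larger `sigma2_e`) lowers the reliability and shrinks the estimate further toward the grand mean, but the slope stays at 1 by construction.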
Defining bias in VAM
- Suppressing subscripts, decompose teacher j's VA in year t as

    µ̃_jt = µ*_j + b_j + e_jt.

  - µ*_j is the causal effect of interest.
  - b_j is bias due to student assignments.
  - e_jt is sampling error, annual shocks, and transitory bias.
- We are interested in V[b_j].
- b_j reflects persistent student assignments; transitory non-random assignments show up in e_jt.

Bias with EB estimates
- The feasible EB estimate is µ̂^EB_jt ≡ λ µ̃_jt, where

    λ ≡ cov(µ̃_jt, µ̃_jt′) / V[µ̃_jt] = V[µ*_j + b_j] / V[µ*_j + b_j + e_jt].

- µ̂^EB_jt is an unbiased predictor of µ*_j + b_j.
- It is a biased predictor of µ*_j if V[µ*_j + b_j] ≠ V[µ*_j]:

    ∂E[µ*_j | µ̂^EB_j] / ∂µ̂^EB_j = V[µ*_j] / V[µ*_j + b_j] ≠ 1.

- CFR define µ̂^EB as "forecast biased" or "forecast unbiased" according to whether this derivative is < 1 or = 1.

The Kane-Staiger experimental strategy for estimating Ω
1. Construct an observational estimate of teacher j's VA in year t, µ̃_jt = µ*_j + b_j + e_jt.
2. "Shrink" to get µ̂_jt, using the between-year covariance to estimate V[µ*_j + b_j].
3. Use random assignment to get a second estimate of the teacher's VA in year t′ ≠ t, µ̌_jt′ = µ*_j + b̌_j + ě_jt′, with orthogonal bias and error: (b̌_j + ě_jt′) ⊥ (µ*_j, b_j + e_jt).
4. Then a regression of the (unshrunken) µ̌_jt′ on µ̂_jt yields

    cov(µ̌_jt′, µ̂_jt) / V[µ̂_jt] = V[µ*_j] / V[µ*_j + b_j] = Ω.

Note: This assumes cov(µ*_j, b_j) = 0 (Horvath 2014).

CFR-I: Teacher switching as a quasi-experiment
- Experiments are very difficult. Need another approach.
- If the VAM is right, then when a high-VA teacher replaces a low-VA teacher, scores should rise.
- Focus on school-grade-year aggregates to abstract from within-school sorting.
- Prediction unbiasedness ⟹ average scores rise 1-for-1 with the change in average predicted VA.
- Let Q_sgt represent the mean shrunken VA across all teachers at school s, grade g, year t, based on all VA observations from before t−1 or after t.
- Test whether

    ∂ΔA_sgt / ∂ΔQ_sgt ≡ ∂(A_sgt − A_sg,t−1) / ∂(Q_sgt − Q_sg,t−1) = 1.

"Drift" in true VA
- Recall the EB formula: ŷ^EB ≡ [σ²_y* / (σ²_y* + σ²_e)] y.
- Variance components are usually based on the between-year covariance and variance.
- If the framework is right,

    ∂E[µ̃_jt | µ̂^EB_j,{−t}] / ∂µ̂^EB_j,{−t} = 1.

- When you do this in the CFR sample, the derivative is less than 1.
- One explanation: "drift." µ*_j or b_j evolves over time.
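The Kane-Staiger regression logic can be checked in a simulation. This is a hedged sketch with made-up variance components; in the population, the slope of the experimental estimate on the shrunken observational score equals Ω = V[µ*] / (V[µ*] + V[b]):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300_000

# Made-up variance components: true effect, persistent bias, noise.
v_mu, v_b, v_e = 0.030, 0.010, 0.050
mu = rng.normal(0, np.sqrt(v_mu), n)   # mu*_j, the causal effect
b = rng.normal(0, np.sqrt(v_b), n)     # persistent assignment bias
e1 = rng.normal(0, np.sqrt(v_e), n)    # year-t noise

# Observational VA and its EB shrinkage. The between-year covariance
# identifies V[mu* + b], so that is what the shrinkage target uses.
mu_tilde = mu + b + e1
lam = (v_mu + v_b) / (v_mu + v_b + v_e)
mu_hat = lam * mu_tilde

# Experimental re-estimate in t': same mu*, fresh (orthogonal) bias and noise.
mu_check = mu + rng.normal(0, np.sqrt(v_b), n) + rng.normal(0, np.sqrt(v_e), n)

# Regressing the experimental estimate on the shrunken score recovers Omega.
slope = np.cov(mu_check, mu_hat)[0, 1] / np.var(mu_hat)
omega = v_mu / (v_mu + v_b)
print(round(slope, 2), round(omega, 2))  # slope ≈ Omega = 0.75
```

With `v_b = 0` the slope is 1: forecast unbiasedness is exactly the statement that the persistent-bias variance is zero.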
Shrinkage with "drift" in true VA
- CFR redefine µ̂ as the predicted value from (implied) regressions of this year's µ̃_jt on each past or future µ̃_js, s < t−1 or s > t.
- This ensures unbiased predictions of observed VA.
- Implementation:
  - They assume cov(µ̃_jt′, µ̃_j,t′+s) ≡ σ_s.
  - This allows estimation of the regression coefficients without running separate regressions for each teacher (or each set of available measures).
  - No drift is equivalent to σ_1 = σ_2 = σ_3 = ….

Implementation of the drift estimator
- Let σ_s ≡ cov(µ̃_jt′, µ̃_j,t′+s).
- Consider the prediction for year t_0 of a teacher observed in S ≡ {t_1, t_2, t_3}:

    µ̂^EB_jm,t_0 ≡ ψ_|t_0−t_1| µ̃_j,t_1 + ψ_|t_0−t_2| µ̃_j,t_2 + ψ_|t_0−t_3| µ̃_j,t_3,

  where

    [ψ_|t_0−t_1|]   [σ_0          σ_|t_1−t_2|   σ_|t_1−t_3|]⁻¹ [σ_|t_0−t_1|]
    [ψ_|t_0−t_2|] ≡ [σ_|t_2−t_1|  σ_0           σ_|t_2−t_3|]   [σ_|t_0−t_2|]
    [ψ_|t_0−t_3|]   [σ_|t_3−t_1|  σ_|t_3−t_2|   σ_0        ]   [σ_|t_0−t_3|]

- The ψ coefficients vary with both t_0 and S, but the σ_s pool over all available pairs.

CFR-I's basic quasi-experimental results
Dependent variable: change from t−1 to t in mean scores at the school-grade-subject level, ΔĀ_msgt.

                                            CFR (2011)   CFR (2014a)
                                            Year FEs     Year FEs    School-Year FEs
  ΔQ_msgt                                   0.84         0.97        0.96
                                            (0.05)       (0.03)      (0.03)
  No. of school-grade-subject-year cells    24,887       59,770      59,770

- Interpretation: uncorrected VA predictions are biased, but drift-corrected predictions are not.
- Assumption: ΔQ_msgt is randomly assigned at the m-s-g-t level.

Outline
1. Measuring VAM bias using experiments & quasi-experiments
2. Reproduction in North Carolina
3. Evaluating the quasi-experimental design
4. Long-run outcomes
5. Conclusion

The North Carolina data
- CFR-I: NYC administrative data & IRS records for parent characteristics & long-run outcomes.
- I use North Carolina administrative data covering 1997-2011, from the North Carolina Education Research Data Center.
- Tests in grades 3-8 (plus a start-of-3rd-grade "pre-test"). Teacher identifiers are for the teacher who proctored the test.
  - Focus on grades 3-5.
  - Dummy out bad/suspicious matches.
- For long-run outcomes, use HS graduation, GPA, and college-going plans (not available for all cohorts).
- Limited parent characteristics.
- N = 77,177 school-grade-year-subject cells (vs. 59,770 for CFR).
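The ψ weights in the drift estimator solve a small linear system. A minimal sketch, with a hypothetical autocovariance sequence σ_s (the values below are illustrative, not estimates from either data set):

```python
import numpy as np

# Hypothetical autocovariances sigma_s = cov(mu~_jt, mu~_j,t+s) for s = 0..4.
# sigma_0 also includes the noise variance, so it exceeds the drift covariances.
sigma = np.array([0.17, 0.040, 0.035, 0.030, 0.026])

# Predict year t0 = 4 from scores observed in S = {0, 1, 2} (lags 4, 3, 2).
obs_years, t0 = [0, 1, 2], 4

# Build the system Sigma_SS @ psi = sigma_S0 from the pooled autocovariances.
Sigma_SS = np.array([[sigma[abs(a - c)] for c in obs_years] for a in obs_years])
sigma_S0 = np.array([sigma[abs(t0 - a)] for a in obs_years])
psi = np.linalg.solve(Sigma_SS, sigma_S0)

# The prediction is a weighted sum of the teacher's observed residualized scores.
mu_tilde_obs = np.array([0.10, 0.05, 0.12])  # hypothetical teacher history
mu_hat_t0 = psi @ mu_tilde_obs
print(psi.round(3), round(mu_hat_t0, 4))
```

Because the covariances decline with lag length, the most recent observation gets the largest weight; with no drift (all off-diagonal σ_s equal) the weights collapse to the equal-weighted shrinkage of the standard EB formula.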
Replication: Autocorrelation of µ̃_jt within teacher, across years
Figure: Autocorrelation vector in elementary school for English and math scores, plotted against years between classes (0-10), for CFR and NC (CFR English, NC English, CFR Math, NC Math). These autocovariances are used in accounting for "drift" in forming µ̂_j.

Contributions to the variance of ΔQ
The "treatment" here is not just teacher retirements and hiring.

  Component                                          Share of V(ΔQ)
  Differences between stayers' µ̂_jt and µ̂_j,t−1     10%
  Unreliable teacher-student matches                  7%
  Grade switchers (within schools)                   31%
  School switchers (within districts)                 8%
  District switchers (within North Carolina)          6%
  Temporary leavers from/returners to system         13%
  Permanent leavers from & new arrivals to system    24%

Replication of quasi-experimental estimates
Sample: teachers with non-missing EB VA.

                        (1)          (2)          (3)
  Dependent variable:   ΔĀ           ΔĀ           Δ Ā̄
  CFR-I                 0.97 (0.03)  0.96 (0.03)  0.00 (0.01)
  NC replication        1.05 (0.02)  0.98 (0.02)  0.01 (0.01)
  Year FEs              X
  School-year FEs                    X            X

Teachers without shrunken VA scores
- Q_sgt is the average shrunken VA score of all teachers in the s-g-t cell, based on data from outside {t−1, t}.
- For teachers observed only in {t−1, t}, there is nothing to shrink.
- The CFR-I sample excludes these teachers from Q_sgt and excludes their students from Ā_sgt.
- In an alternative specification, set µ̂_jt = µ̂_j,t+1 = 0 for these teachers, and include them in Q_sgt and their students in Ā_sgt.

Replication of quasi-experimental estimates
Sample: non-missing EB VA in (1)-(3); all teachers in (4).

                        (1)          (2)          (3)          (4)
  Dependent variable:   ΔĀ           ΔĀ           Δ Ā̄          ΔĀ
  CFR-I                 0.97 (0.03)  0.96 (0.03)  0.00 (0.01)  0.88 (0.03)
  NC replication        1.05 (0.02)  0.98 (0.02)  0.01 (0.01)  0.87 (0.02)
  Year FEs              X                                      X
  School-year FEs                    X            X

What is going on?
- CFR-I argue that this reflects measurement error in the augmented ΔQ_sgt.
- But µ̂_j,t−1 = µ̂_jt = 0 is the correct EB estimate for teachers not seen outside {t−1, t}:
  - Complete shrinkage when the reliability of the signal is 0.
  - Posterior mean = prior mean.
- Empirically, the coefficient falls because of the dependent variable:

                                  (1)     (2)     (3)     (4)
  Include all teachers on RHS?    N       Y       Y       N
  Include all students on LHS?    N       Y       N       Y
  NC replication                  1.05    0.87    1.20    0.66
                                  (0.02)  (0.02)  (0.03)  (0.02)

A sample selection explanation
- Teachers' VA is correlated with their students' prior scores:

  Dep. var.: Prior score    (1) CFR-I          (2) NC
  Teacher EB VA             0.0078 (0.0004)    0.0054 (0.0003)

- Consider an s-g cell with teachers A and B in t−1 and A and C in t, with µ̂_C missing.
- ΔQ is µ̂_A − 0.5(µ̂_A + µ̂_B) ≈ 0.5(µ̂_A − µ̂_B).
- Then ΔQ > 0 ⟺ µ̂_A > µ̂_B. Probabilistically:
  - µ̂_A > 0.
  - A's students have higher prior-year scores than C's in year t.
  - Excluding C's students from the year-t mean score biases it (and the t−1 to t change in mean scores) upwards.

Outline
1. Measuring VAM bias using experiments & quasi-experiments
2. Reproduction in North Carolina
3. Evaluating the quasi-experimental design
4. Long-run outcomes
5. Conclusion

A falsification test
- The quasi-experimental strategy is based on the idea that ΔQ_sgt is (as good as) randomly assigned with respect to ΔA_sgt. Is it?
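The sample-selection mechanism can be demonstrated numerically. In this made-up simulation teachers have zero causal effect, so any positive slope of the score change on ΔQ is pure selection; the sorting strength and variances are invented:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000  # simulated school-grade cells

# Hypothetical cell: teachers A and B in t-1; A and C in t, with C's VA missing.
mu_A, mu_B = rng.normal(0, 0.1, n), rng.normal(0, 0.1, n)

# Assumed sorting: a class's mean prior score loads on its teacher's VA.
sort = 0.5
p_A1 = sort * mu_A + rng.normal(0, 0.1, n)  # A's class, year t-1
p_B1 = sort * mu_B + rng.normal(0, 0.1, n)  # B's class, year t-1
p_A2 = sort * mu_A + rng.normal(0, 0.1, n)  # A's class, year t
p_C2 = rng.normal(0, 0.1, n)                # C's class, year t (no VA score)

# Change in mean VA: Q_t uses A only (C is missing); Q_{t-1} averages A and B.
dQ = mu_A - 0.5 * (mu_A + mu_B)

# Give teachers ZERO causal effect: end-of-year score = prior score.
dA_cfr = p_A2 - 0.5 * (p_A1 + p_B1)                  # drop C's students
dA_all = 0.5 * (p_A2 + p_C2) - 0.5 * (p_A1 + p_B1)   # keep all students

def slope(y, x):
    return np.cov(y, x)[0, 1] / np.var(x)

print(round(slope(dA_cfr, dQ), 2), round(slope(dA_all, dQ), 2))
```

Dropping C's students roughly doubles the spurious slope relative to keeping everyone, even though nothing causal is happening: the excluded classroom's low prior scores never enter the year-t mean.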
- Random assignment has testable implications: ΔQ_sgt should be uncorrelated with anything prior.
- CFR present some limited evidence of this. Following Rothstein (2010), I focus on prior-year scores. Let:
  - LA_sgt be the mean score in t−1 (generally in grade g−1) of students who are in grade g in year t;
  - ΔLA_sgt ≡ LA_sgt − LA_sg,t−1.

Falsification test bin-scatter
Figure: Bin scatter plot of changes in mean prior-year scores on changes in mean teacher predicted value-added. Fitted line: ΔScore = α_st + 0.134 (0.017) × ΔVA.

Teacher switching quasi-experiment: Effects on prior-year scores
Table: Dependent variable is ΔLA_sgt.

                                  (1)     (2)     (3)     (4)
  ΔQ_sgt                          0.13    0.04    0.12    0.08
                                  (0.02)  (0.02)  (0.03)  (0.02)
  Include all students in LHS?    N       Y       N       Y
  Include all teachers in RHS?    N       N       Y       Y

Note: All columns include school-year FEs.

Quasi-experiment estimates with controls
Excluding classrooms with missing teacher EB scores.

                             (1)          (2)          (3)          (4)
  CFR-I: ΔQ_sgt              0.96 (0.03)  0.95 (0.02)
  NC replication: ΔQ_sgt     0.98 (0.02)  0.96 (0.02)  0.90 (0.02)  0.89 (0.02)
  Leads and lags in ΔQ_sgt                X            X            X
  ΔLA_sgt                                 Cubic        Cubic        0.68 (0.00)

Note: All columns include school-year FEs.

Quasi-experiment estimates with controls
Including classrooms with missing teacher EB scores.

  CFR-I:
    ΔQ_sgt = 0.88 (0.03), year FEs.
    ΔQ_sgt = 0.82 (0.02) with ΔLA_sgt control (coefficient 0.29 (0.01)), year FEs.
  NC replication:
    ΔQ_sgt = 0.87 (0.02), year FEs.
    ΔQ_sgt = 0.82 (0.02) with ΔLA_sgt control, year FEs.
    ΔQ_sgt = 0.80 (0.02) with ΔLA_sgt control (coefficient 0.27 (0.01)), school-year FEs.

Potential mechanical explanations
- Issues:
  - Data from t−2 are used both to form the t−1 VA scores and to construct the prior-year scores for t−1.
  - The t and t−1 teachers may have taught the same students in earlier grades.
- Simple solutions:
  - Exclude t−2 data when predicting VA in t−1 and t.
  - Instrument for ΔQ with a version that zeroes out teachers who taught the t−1 or t cohort previously.

                                      Prior score    End-of-year score
  Baseline                            0.08 (0.02)    0.80 (0.02)
  Leave-three-out VA scores           0.10 (0.03)    0.79 (0.03)
  IV with non-followers               0.18 (0.05)    0.77 (0.03)
  School-year-subject FEs             0.04 (0.04)    0.81 (0.03)
  S-Y-M FEs, IV with non-followers    0.03 (0.05)    0.80 (0.03)

Outline
1. Measuring VAM bias using experiments & quasi-experiments
2. Reproduction in North Carolina
3. Evaluating the quasi-experimental design
4. Long-run outcomes
5. Conclusion

Effects of teachers on students' long-run outcomes
- Want to know whether high-VA teachers have longer-lasting effects.
- Effects on test scores fade out quickly.
- CFR-II use tax data to construct long-run outcomes: college enrollment, earnings at age 28, teen birth, etc.
- Two types of estimates:
  - "Observational" (classroom-level) regressions, with controls.
  - Quasi-experimental estimates.
- Defining the causal effect of interest is tricky!

CFR-II estimates

                                            (1)     (2)     (3)     (4)
  Dep. var.: College attendance at age 20
  Teacher VA (EB, in SDs)                   0.82    0.71    0.74    0.73
                                            (0.07)  (0.06)  (0.09)  (0.25)
  Dep. var.: Annual earnings at age 28
  Teacher VA (EB, in SDs)                   $350    $286    $309
                                            (92)    (88)    (110)
  Baseline controls                         X       X       X
  Parent chars. controls                            X
  Twice-lagged scores                                       X
  Quasi-experiment                                                  X

Note: CFR-II "controls" in (1)-(3) are implemented in an unusual way.
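The role of the ΔLA_sgt control can be illustrated with a stylized simulation (my own sketch; all coefficients are invented for illustration, not taken from either data set). When switching is correlated with composition changes, the uncontrolled regression overstates the forecast coefficient, and adding ΔLA recovers it:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000  # simulated school-grade-year cells

# Made-up DGP: teacher switching (dQ) is correlated with changes in student
# composition, which show up in prior-year scores (dLA).
dLA = rng.normal(0, 0.05, n)
dQ = 0.3 * dLA + rng.normal(0, 0.05, n)  # violates random assignment

# True forecast coefficient on dQ is 0.8; prior scores persist into year t.
dA = 0.8 * dQ + 0.6 * dLA + rng.normal(0, 0.05, n)

# Uncontrolled quasi-experimental regression overstates the coefficient...
X1 = np.column_stack([np.ones(n), dQ])
b1 = np.linalg.lstsq(X1, dA, rcond=None)[0]

# ...while controlling for dLA recovers the true 0.8.
X2 = np.column_stack([np.ones(n), dQ, dLA])
b2 = np.linalg.lstsq(X2, dA, rcond=None)[0]
print(round(b1[1], 2), round(b2[1], 2))
```

The placebo version of the same check is a regression of dLA itself on dQ: under random assignment that slope is zero, so a nonzero estimate signals exactly the correlation that inflates `b1`.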
Implementing observational estimates with controls
- Want to estimate the effect of µ* on Y, controlling for Z.
- CFR-II method for observational estimates:
  - Regress Y on Z, with teacher FEs.
  - Regress (Y − Zβ̂) on µ̂^EB. µ̂^EB is not residualized against Z.
  - This yields an unbiased estimate only under very restrictive conditions.
  - The approach uses the within-teacher covariance between Z (parent income) and Y (child outcomes). It doesn't control for the more important between-teacher (and between-school) variation.
- So why not an OLS regression of Y on µ̂^EB and Z?
  - µ̂^EB is an unconditionally unbiased predictor of µ* + b, but a biased predictor conditional on Z.
  - OLS may be biased, but this can be avoided by instrumenting for µ̃_jt with µ̃_jt′.

Revisiting the long-run analysis in NC data
- NC data don't have adult outcomes.
- But they can be linked to high school measures. I extract five (each set to missing if the student is not in the HS data):
  1. Graduate high school
  2. Plan to attend college
  3. Plan to attend 4-year college
  4. GPA (4-point scale)
  5. High school class rank (100 = top)
- I don't have CFR's richer controls. Focus on method, with VA-model controls (individual & class mean lagged scores).

Observational estimates of effects of teacher VA on long-run outcomes

  Controls:                 None     CFR 2-step  OLS      IV
                            (1)      (2)         (3)      (4)
  Graduate HS (%)           0.74     0.38        0.21     0.21
                            (0.05)   (0.05)      (0.04)   (0.04)
  Plan college (%)          0.86     0.35        0.21     0.22
                            (0.07)   (0.06)      (0.06)   (0.06)
  Plan 4-year coll. (%)     3.42     1.21        0.63     0.64
                            (0.14)   (0.10)      (0.09)   (0.09)
  GPA (4-pt. scale)         0.046    0.021       0.014    0.015
                            (0.003)  (0.002)     (0.002)  (0.002)
  Class rank (0-100)        1.34     0.62        0.33     0.34
                            (0.07)   (0.06)      (0.05)   (0.05)

Quasi-experimental estimates of effects of teacher VA on long-run outcomes

  Controls:                 None     ΔLA_sgt
                            (1)      (2)
  Graduate HS (%)           0.36*    0.20
                            (0.18)   (0.19)
  Plan college (%)          0.35     0.19
                            (0.22)   (0.23)
  Plan 4-year coll. (%)     −0.02    −0.35
                            (0.33)   (0.33)
  GPA (4-pt. scale)         0.010    −0.003
                            (0.008)  (0.008)
  Class rank (0-100)        0.34     0.03
                            (0.24)   (0.24)

Outline
1. Measuring VAM bias using experiments & quasi-experiments
2. Reproduction in North Carolina
3. Evaluating the quasi-experimental design
4. Long-run outcomes
5. Conclusion

Conclusions
- The basic results of CFR-I (and, mostly, CFR-II) are successfully reproduced in North Carolina data.
- But the so-called "quasi-experiment" fails: teacher switches are correlated with changes in student preparedness.
- Estimates that control for this indicate modest bias:
  - Forecast bias coefficient (a.k.a. reliability ratio) around 0.8.
  - In the middle of Rothstein's (2009) plausible range of 0.6-1.0.
- Inference about long-run effects is not robust:
  - Estimates are very sensitive to observables.
  - The quasi-experiment with controls yields no significant effects.
- CFR's reported specifications would not detect any of this.

Where does VAM stand?
- In an important sense, none of this matters:
  - NYC and NC are low stakes (at the teacher level).
  - Modest bias is not inconsistent with VAM being useful.
- But:
  - The failure of the MET study indicates agents are not indifferent to class assignments.
  - The NYC and NC evidence indicates these assignments bias VAM scores.
  - This is a source of unfairness in VAM-based evaluations that undercuts their "face validity."
  - We should expect assignments to be distorted under high stakes, worsening bias and perhaps harming educational outcomes.
- Moreover, these studies (like others) consider best cases: extensive controls, excluding hard cases.
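A back-of-the-envelope check on the two-step controls critique (a stylized simulation with invented numbers, in the spirit of a backup slide): teacher VA has zero causal effect but is correlated with the between-teacher component of a control Z, and Z matters more between teachers than within. The CFR-style two-step, which residualizes Y against Z using only within-teacher variation, shows a large spurious "effect"; plain OLS with Z included shrinks it, though here neither is unbiased:

```python
import numpy as np

rng = np.random.default_rng(4)
J, K = 2000, 5  # teachers, classes per teacher

# Stylized DGP (made-up): teacher "VA" mu_j is correlated with the
# between-teacher part of Z and has ZERO causal effect on Y.
mu = np.repeat(rng.normal(0, 1, J), K)
z_between = mu + np.repeat(rng.normal(0, 1, J), K)
z_within = rng.normal(0, 1, J * K)
Z = z_between + z_within
# Z matters more between teachers (1.0) than within (0.2); no mu term in Y.
Y = 1.0 * z_between + 0.2 * z_within + rng.normal(0, 1, J * K)

teacher = np.repeat(np.arange(J), K)

def demean_by(x, g, ngroups):
    means = np.bincount(g, weights=x, minlength=ngroups) / np.bincount(g, minlength=ngroups)
    return x - means[g]

# Step 1: Y on Z with teacher FEs -> within-teacher beta (~0.2).
Zw, Yw = demean_by(Z, teacher, J), demean_by(Y, teacher, J)
beta_within = (Zw @ Yw) / (Zw @ Zw)

# Step 2: regress (Y - Z*beta_within) on mu; between-teacher Z is uncontrolled.
resid = Y - Z * beta_within
two_step = np.linalg.lstsq(np.column_stack([np.ones(J * K), mu]), resid, rcond=None)[0][1]

# Plain OLS of Y on (1, mu, Z) for comparison.
ols = np.linalg.lstsq(np.column_stack([np.ones(J * K), mu, Z]), Y, rcond=None)[0][1]
print(round(two_step, 2), round(ols, 2))  # two-step is the larger spurious "effect"
```

The gap between the two estimates is the between-teacher covariance of Z and Y that the two-step never removes, which is the mechanism the slides point to.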