Revisiting the Impact of Teachers
Jesse Rothstein
University of California, Berkeley and NBER
rothstein@berkeley.edu
April 2015
Value-added modeling (VAM) and bias
- Value-added models aim to identify teachers' causal effects on students' test scores.
  - In observational data, without control over assignment.
  - Key control is the student's prior-year score.
- Each VAM relies on a strong implicit assumption about the assignment rule (Rothstein 2010). If this is wrong, VAM scores may be biased.
- A burgeoning literature tries to quantify any biases.
- One measure of the magnitude of biases:

  Ω ≡ V[causal effect] / (V[causal effect] + V[bias])
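As a concrete illustration (not from the original slides), here is a minimal simulation of this ratio under purely hypothetical variance components; sd_causal and sd_bias are assumptions chosen only for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_teachers = 10_000

# Hypothetical standard deviations, in student test-score SDs.
sd_causal = 0.15   # spread of true causal effects
sd_bias = 0.07     # spread of persistent bias from non-random assignment

causal = rng.normal(0, sd_causal, n_teachers)
bias = rng.normal(0, sd_bias, n_teachers)

# Omega = V[causal] / (V[causal] + V[bias]); Omega = 1 means no bias.
omega = causal.var() / (causal.var() + bias.var())
print(f"Omega = {omega:.2f}")   # about 0.82 with these assumed values
```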
The state of play (part 1)
- The early VAM literature didn't push very hard on causality.
- Rothstein (2010) finds effects of grade-g teachers on students' grade g − 2 scores. This implies V[bias] > 0, but the magnitude is unclear.
- Rothstein (2009) simulates realistic DGPs to estimate the range of potential bias. Plausible range: Ω ∈ [0.6, 1.0].
The state of play (part 2)
- Rothstein (2009) plausible range: Ω ∈ [0.6, 1.0].
- Experiments: How well do observational VAM scores forecast experimental teacher effects? (Kane and Staiger, with co-authors)
  - Ω̂ = 0.91 (0.18) math; 1.09 (0.29) ELA (Kane-Staiger 2008).
  - Ω̂ = 1.04 (0.14) math; 0.70 (0.21) ELA (Kane et al. 2013).
  - Small samples and noncompliance create big SEs.
- Quasi-experiment (Chetty, Friedman, Rockoff 2014a,b):
  - Generalization of the experimental tests to "teacher switching" quasi-experiments, with much larger N.
  - Similar "quasi-experiments" are used by Chetty-Hendren and Finkelstein-Gentzkow-Williams (2014).
  - Using the standard VAM: Ω̂ = 0.84 (0.05).
  - Adjusting the VA model to account for "drift" gives Ω̂ = 0.97 (0.03). [Or is it Ω̂ = 0.88 (0.03)?]
This paper
- Revisit Chetty et al. (2014a,b), "Measuring the Impacts of Teachers":
  - CFR-I: the "teacher switching" quasi-experimental test for bias.
  - CFR-II: the effect of teacher VA on students' long-run outcomes.
- I replicate the CFR-I and CFR-II analyses in North Carolina data, using CFR's methods and code.
- All key results are replicated, essentially perfectly. (See also Bacher-Hicks, Kane, and Staiger, 2014, in L.A.)
- But...
But...
Further analysis indicates:
- Teacher switching is correlated with changes in student composition, violating the quasi-experimental assumption.
- Estimates that adjust for changes in observables indicate moderate bias in VA scores:
  - The CFR specification yields Ω̂ = 0.89 (0.02).
  - My preferred specification yields Ω̂ = 0.80 (0.02).
- Long-run effects are not robust:
  - CFR specifications under-control for observables.
  - Estimates are very sensitive to controls.
  - A zero effect is plausible with richer controls.
CFR-I’s key result: Using teacher switching to estimate Ω
Figure: Bin scatter plot of year-over-year changes in mean
school-grade-subject scores on changes in mean predicted VA in NYC.
(CFR-I, Figure 4A.)
Preview of results I: Replication of CFR-I

Figure: Bin scatter plot of changes in mean end-of-year scores on changes in mean teacher predicted value-added in North Carolina. Fitted line: ΔScore = α_st + 0.981 (0.017) × ΔVA.
Preview of results II: Falsification test

Figure: Bin scatter plot of changes in mean prior-year scores on changes in mean teacher predicted value-added. Fitted line: ΔScore = α_st + 0.134 (0.017) × ΔVA.
Preview of results III: Selection-corrected quasi-experiment

Figure: Bin scatter plot of changes in score growth (change in avg. end-of-year score less change in avg. prior-year score) on changes in mean teacher predicted value-added. Fitted line: ΔScore = α_st + 0.847 (0.014) × ΔVA.
Outline
1. Measuring VAM bias using experiments & quasi-experiments
2. Reproduction in North Carolina
3. Evaluating the quasi-experimental design
4. Long-run outcomes
5. Conclusion
The VAM specification
- Let A_ijmgst be the test score of student i with teacher j in subject m, grade g, school s, year t.
- Regress A_ijmgst = α_jm + X_ijmgst β_m + e_ijmgst.
  - α_jm is a teacher-subject FE (common across classrooms).
  - X_ijmgst has:
    - A cubic in prior-year scores in each subject.
    - Ethnicity, gender, age, special education, etc.
    - Class- (jmgst) and school-year- (st) means of these.
    - Class size.
  - Note: Aggregate coefficients are identified only from within-teacher, between-class variation.
- Form the residual A_ijmgst − X_ijmgst β̂_m = α̂_jm + ê_ijmgst.
- Average these residuals to the classroom level; call this µ̃_jmsgt.
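A stylized sketch of this two-step construction on simulated data (this is not CFR's code; all variable names and parameter values here are hypothetical, and one classroom per teacher is assumed for brevity):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 5_000
df = pd.DataFrame({
    "teacher": rng.integers(0, 200, n),
    "prior_score": rng.normal(0, 1, n),
    "female": rng.integers(0, 2, n),
})
df["score"] = 0.6 * df["prior_score"] + rng.normal(0, 1, n)
df["prior2"] = df["prior_score"] ** 2
df["prior3"] = df["prior_score"] ** 3

# Estimate the control coefficients with teacher fixed effects absorbed (here via dummies).
fit = smf.ols("score ~ prior_score + prior2 + prior3 + female + C(teacher)", data=df).fit()

# Residualize against the controls only, leaving the teacher component in the residual,
# then average residuals within teacher to get mu_tilde.
controls = ["prior_score", "prior2", "prior3", "female"]
df["resid"] = df["score"] - sum(fit.params[c] * df[c] for c in controls)
mu_tilde = df.groupby("teacher")["resid"].mean()
print(mu_tilde.describe())
```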
Shrinkage / Empirical Bayes
- Suppose µ̃ = µ* + e, with:
  - µ* ⊥ e,
  - µ* ~ N(0, σ²_µ*), and
  - e ~ N(0, σ²_e).
- Then the Empirical Bayes estimate of µ* is

  µ̂_EB ≡ [σ²_µ* / (σ²_µ* + σ²_e)] µ̃ = E[µ* | µ̃].

- This is the posterior mean, an unbiased predictor (E[µ* | µ̂_EB] = µ̂_EB), the best linear predictor, and a "shrinkage" estimator.
- Note: If reliability varies across observations, so does the shrinkage ratio. If reliability = 0 (i.e., we have no information), we assign the grand mean, µ̂_EB = E[µ*].
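A minimal numeric sketch of the shrinkage formula, assuming known (hypothetical) variance components; in practice σ²_µ* and σ²_e are estimated, e.g. from between-year covariances.

```python
import numpy as np

rng = np.random.default_rng(2)
n_teachers = 1_000
sigma2_mu, sigma2_e = 0.02, 0.05    # hypothetical variance components

mu_star = rng.normal(0, np.sqrt(sigma2_mu), n_teachers)
mu_tilde = mu_star + rng.normal(0, np.sqrt(sigma2_e), n_teachers)

# Shrinkage factor (reliability): sigma2_mu / (sigma2_mu + sigma2_e).
reliability = sigma2_mu / (sigma2_mu + sigma2_e)
mu_hat_eb = reliability * mu_tilde

# Check the unbiased-predictor property: a regression of mu_star on mu_hat_eb has slope ~1.
slope = np.cov(mu_star, mu_hat_eb)[0, 1] / mu_hat_eb.var()
print(f"reliability = {reliability:.2f}, slope of mu_star on mu_hat_eb = {slope:.2f}")
```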
Defining bias in VAM
- Suppressing subscripts, decompose teacher j's VA in year t as:

  µ̃_jt = µ*_j + b_j + e_jt

  - µ*_j is the causal effect of interest.
  - b_j is bias due to student assignments.
  - e_jt is sampling error, annual shocks, and transitory bias.
- We are interested in V[b_j].
- b_j reflects persistent non-random student assignments; transitory non-random assignments show up in e_jt.
Bias with EB estimates
- The feasible EB estimate is µ̂^EB_jt ≡ λ µ̃_jt, where

  λ ≡ cov(µ̃_jt, µ̃_jt') / V[µ̃_jt] = V[µ*_j + b_j] / V[µ*_j + b_j + e_jt].

- µ̂^EB_jt is an unbiased predictor of µ*_j + b_j.
- It is a biased predictor of µ*_j if V[µ*_j + b_j] ≠ V[µ*_j]:

  ∂E[µ*_j | µ̂^EB_j] / ∂µ̂^EB_j = V[µ*_j] / V[µ*_j + b_j] ≠ 1.

- CFR define µ̂^EB as "forecast biased" or "forecast unbiased" according to whether this derivative is < 1 or = 1.
The Kane-Staiger experimental strategy for estimating Ω
1. Construct an observational estimate of teacher j's VA in year t: µ̃_jt = µ*_j + b_j + e_jt.
2. "Shrink" to get µ̂_jt, using the between-year covariance to estimate V[µ*_j + b_j].
3. Use random assignment to get a second estimate of the teacher's VA in t' ≠ t, µ̌_jt' = µ*_j + b̌_j + ě_jt', with orthogonal bias and error: (b̌_j + ě_jt') ⊥ (µ*_j, b_j + e_jt).
4. Then a regression of the (unshrunken) µ̌_jt' on µ̂_jt yields

   cov(µ̌_jt', µ̂_jt) / V[µ̂_jt] = V[µ*_j] / V[µ*_j + b_j] = Ω.

Note: Assumes cov(µ*_j, b_j) = 0 (Horvath 2014).
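A small simulation sketch of this logic (illustrative variance components, not estimates from any dataset): regressing an experimental VA estimate on the shrunken observational estimate recovers Ω when the experimental bias and error are orthogonal to the observational ones.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000
v_mu, v_b, v_e = 0.02, 0.005, 0.03     # hypothetical variance components

mu_star = rng.normal(0, np.sqrt(v_mu), n)
b = rng.normal(0, np.sqrt(v_b), n)
mu_tilde = mu_star + b + rng.normal(0, np.sqrt(v_e), n)   # observational VA estimate
lam = (v_mu + v_b) / (v_mu + v_b + v_e)                    # shrinkage factor
mu_hat = lam * mu_tilde                                    # shrunken observational VA

mu_check = mu_star + rng.normal(0, np.sqrt(v_e), n)        # experimental VA, orthogonal error

# Regression slope of the experimental estimate on the shrunken observational estimate.
slope = np.cov(mu_check, mu_hat)[0, 1] / mu_hat.var()
print(f"slope = {slope:.2f}, Omega = {v_mu / (v_mu + v_b):.2f}")
```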
CFR-I: Teacher switching as a quasi-experiment
- Experiments are very difficult. Need another approach.
- If the VAM is right, then when a high-VA teacher replaces a low-VA teacher, scores should rise.
- Focus on school-grade-year aggregates to abstract from within-school sorting.
- Prediction unbiasedness =⇒ average scores rise 1-for-1 with the change in average predicted VA.
- Let Q_sgt represent the mean shrunken VA across all teachers at school s, grade g, year t, based on all VA observations from before t − 1 or after t.
- Test whether

  ∂ΔA_sgt / ∂ΔQ_sgt ≡ ∂(A_sgt − A_sg,t−1) / ∂(Q_sgt − Q_sg,t−1) = 1.
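A sketch of this test on a simulated school-grade-subject-year panel (the column names and data-generating process are assumptions made only for illustration; by construction the coefficient here is close to 1):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical cell-level panel: school-grade-subject cells observed over years.
rng = np.random.default_rng(4)
idx = pd.MultiIndex.from_product(
    [range(100), [3, 4, 5], ["math", "ela"], range(2000, 2010)],
    names=["school", "grade", "subject", "year"])
cells = pd.DataFrame(index=idx).reset_index()
cells["mean_pred_va"] = rng.normal(0, 0.05, len(cells))
cells["mean_score"] = cells["mean_pred_va"] + rng.normal(0, 0.05, len(cells))

# Year-over-year changes within each school-grade-subject cell.
cells = cells.sort_values(["school", "grade", "subject", "year"])
grp = cells.groupby(["school", "grade", "subject"])
cells["dA"] = grp["mean_score"].diff()      # change in mean end-of-year score
cells["dQ"] = grp["mean_pred_va"].diff()    # change in mean predicted VA

df = cells.dropna(subset=["dA", "dQ"])
fit = smf.ols("dA ~ dQ + C(year)", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["school"]})
print(fit.params["dQ"], fit.bse["dQ"])      # coefficient of 1 => forecast unbiased
```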
"Drift" in true VA
- Recall the EB formula: ŷ_EB ≡ [σ²_y* / (σ²_y* + σ²_e)] y.
- The variance components are usually based on the between-year covariance and variance.
- If the framework is right,

  ∂E[µ̃_jt | µ̂^EB_j,{−t}] / ∂µ̂^EB_j,{−t} = 1.

- When you do this in the CFR sample, the derivative is less than 1.
- One explanation: "drift." µ*_j or b_j evolves over time.
Shrinkage with "drift" in true VA
- CFR redefine µ̂ as the predicted value from (implied) regressions of this year's µ̃_jt on each past or future µ̃_js, s < t − 1 or s > t.
- This ensures unbiased predictions of observed VA.
- Implementation:
  - They assume cov(µ̃_jt', µ̃_j,t'+s) ≡ σ_s.
  - This allows estimation of the regression coefficients without running separate regressions for each teacher (or set of available measures).
  - No drift is equivalent to σ_1 = σ_2 = σ_3 = . . . .
Implementation of the drift estimator
- Let σ_s ≡ cov(µ̃_jt', µ̃_j,t'+s).
- Consider the prediction for t0 of a teacher observed in S ≡ {t1, t2, t3}:

  µ̂^EB_jm,t0 ≡ ψ_|t0−t1| µ̃_jt1 + ψ_|t0−t2| µ̃_jt2 + ψ_|t0−t3| µ̃_jt3,

  where:

  [ ψ_|t0−t1| ]   [ σ_0        σ_|t1−t2|  σ_|t1−t3| ]⁻¹ [ σ_|t0−t1| ]
  [ ψ_|t0−t2| ] ≡ [ σ_|t2−t1|  σ_0        σ_|t2−t3| ]   [ σ_|t0−t2| ]
  [ ψ_|t0−t3| ]   [ σ_|t3−t1|  σ_|t3−t2|  σ_0       ]   [ σ_|t0−t3| ]

- The ψ coefficients vary with both t0 and S, but the σ_s pool over all available pairs.
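A sketch of this calculation for one hypothetical teacher (the σ_s values below are made up; in practice they are estimated by pooling over all teacher-year pairs):

```python
import numpy as np

# Assumed autocovariances sigma_s = cov(mu_tilde_t, mu_tilde_{t+s}), illustrative only.
sigma = {0: 0.040, 1: 0.025, 2: 0.022, 3: 0.020, 4: 0.018}

t0 = 2008
obs_years = [2006, 2007, 2010]            # years with an observed mu_tilde for this teacher
mu_tilde = np.array([0.10, 0.05, 0.12])   # the teacher's annual VA residual means

# Covariance matrix of the observed years, and their covariances with the target year t0.
G = np.array([[sigma[abs(a - b)] for b in obs_years] for a in obs_years])
c = np.array([sigma[abs(t0 - a)] for a in obs_years])

psi = np.linalg.solve(G, c)               # regression weights (the psi coefficients)
mu_hat_t0 = psi @ mu_tilde                # drift-adjusted EB prediction for year t0
print(psi, mu_hat_t0)
```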
CFR-I's basic quasi-experimental results
Dependent variable: change from t − 1 to t in mean scores at the school-grade-subject level, ∆Ā_msgt.

                                          CFR (2011)   CFR (2014a)   CFR (2014a)
                                          Year FEs     Year FEs      School-Year FEs
∆Q_msgt                                   0.84         0.97          0.96
                                          (0.05)       (0.03)        (0.03)
No. of school-grade-subject-year cells    24,887       59,770        59,770

- Interpretation: uncorrected VA predictions are biased, but drift-corrected predictions are not.
- Assumption: ∆Q_msgt is randomly assigned at the m-s-g-t level.
Outline
1. Measuring VAM bias using experiments & quasi-experiments
2. Reproduction in North Carolina
3. Evaluating the quasi-experimental design
4. Long-run outcomes
5. Conclusion
The North Carolina data
- CFR-I: NYC administrative data & IRS records for parent characteristics & long-run outcomes.
- I use North Carolina administrative data covering 1997-2011, from the North Carolina Education Research Data Center.
- Tests in grades 3-8 (plus a start-of-3rd-grade "pre-test").
- Teacher identifiers are for the teacher who proctored the test.
  - Focus on grades 3-5.
  - Dummy out bad/suspicious matches.
- For long-run outcomes, use HS graduation, GPA, and college-going plans (not all cohorts).
- Limited parent characteristics.
- N = 77,177 school-grade-year-subject cells (vs. 59,770 for CFR).
Replication: Autocorrelation of µ̃_jt within teacher, across years

[Figure: Autocorrelation vector in elementary school for English and math scores, plotted against years between classes (0-10), for CFR and NC, English and math. These autocorrelations are the inputs used in accounting for "drift" when forming µ̂_j.]
Contributions to the variance of ∆Q
The "treatment" here is not just teacher retirements and hiring.

Component                                          Share of V(∆Q)
Differences between stayers' µ̂_jt and µ̂_j,t−1       10%
Unreliable teacher-student matches                   7%
Grade switchers (within schools)                    31%
School switchers (within districts)                  8%
District switchers (within North Carolina)           6%
Temporary leavers from/returners to system          13%
Permanent leavers from & new arrivals to system     24%
Replication of quasi-experimental estimates
Sample: classrooms with non-missing EB VA.

                      (1)        (2)        (3)
Dependent variable    ∆Ā         ∆Ā         ∆Â
Year FEs              X
School-year FEs                  X          X
CFR-I                 0.97       0.96       0.00
                      (0.03)     (0.03)     (0.01)
NC replication        1.05       0.98       0.01
                      (0.02)     (0.02)     (0.01)
Teachers without shrunken VA scores
- Q_sgt is the average shrunken VA score of all teachers in the s-g-t cell, based on data from outside {t − 1, t}.
- For those observed only in {t − 1, t}, there's nothing to shrink.
- The CFR-I sample excludes these teachers from Q_sgt and excludes their students from Ā_sgt.
- In an alternative specification, set µ̂_j,t−1 = µ̂_jt = 0 for these teachers, and include them in Q_sgt and their students in Ā_sgt.
Replication of quasi-experimental estimates
Sample: classrooms with non-missing EB VA in (1)-(3); all classrooms in (4).

                      (1)        (2)        (3)        (4)
Dependent variable    ∆Ā         ∆Ā         ∆Â         ∆Ā
Year FEs              X                                X
School-year FEs                  X          X
CFR-I                 0.97       0.96       0.00       0.88
                      (0.03)     (0.03)     (0.01)     (0.03)
NC replication        1.05       0.98       0.01       0.87
                      (0.02)     (0.02)     (0.01)     (0.02)
What is going on?
- CFR-I argue that this reflects measurement error in the augmented ∆Q_sgt.
- But µ̂_j,t−1 = µ̂_jt = 0 is the correct EB estimate for teachers not seen outside {t − 1, t}:
  - Complete shrinkage when the reliability of the signal is 0.
  - Posterior mean = prior mean.
- Empirically, the coefficient falls because of the dependent variable:

                                      (1)      (2)      (3)      (4)
  Include all teachers on RHS?        N        Y        Y        N
  Include all students on LHS?        N        Y        N        Y
  NC replication                      1.05     0.87     1.20     0.66
                                      (0.02)   (0.02)   (0.03)   (0.02)
A sample selection explanation
- Teachers' VA is correlated with students' prior scores:

  Dep. var.: Prior score     (1) CFR-I    (2) NC
  Teacher EB VA              0.0078       0.0054
                             (0.0004)     (0.0003)

- Consider an s-g cell with teachers A and B in t − 1 and A and C in t, with µ̂_C missing.
- ∆Q is µ̂_A − 0.5(µ̂_A + µ̂_B) ≈ 0.5(µ̂_A − µ̂_B).
- Then ∆Q > 0 ⟺ µ̂_A > µ̂_B. Probabilistically:
  - µ̂_A > 0.
  - A's students have higher prior-year scores than C's in t.
  - Excluding C's students from the t mean score biases it (and the t − 1 to t change in mean scores) upwards.
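A simulation sketch of this selection story (all parameters are hypothetical): when class-mean prior scores rise with the assigned teacher's VA and entering teacher C has no VA score, the change in mean prior scores is positively related to ∆Q, and more strongly so when C's students are dropped from the year-t mean.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100_000
sd_va, sort = 0.10, 0.5          # SD of teacher VA; strength of sorting of students to teachers

mu_A, mu_B, mu_C = (rng.normal(0, sd_va, n) for _ in range(3))

def class_prior(mu):             # class mean prior score rises with the teacher's VA
    return sort * mu + rng.normal(0, 0.1, n)

prior_A_tm1, prior_B_tm1 = class_prior(mu_A), class_prior(mu_B)   # year t-1 classes (A and B)
prior_A_t, prior_C_t = class_prior(mu_A), class_prior(mu_C)       # year t classes (A and C)

dQ = mu_A - 0.5 * (mu_A + mu_B)  # C's VA is missing, so Q_t = mu_A

def slope(y, x):
    return np.cov(y, x)[0, 1] / x.var()

d_prior_excl_C = prior_A_t - 0.5 * (prior_A_tm1 + prior_B_tm1)
d_prior_incl_C = 0.5 * (prior_A_t + prior_C_t) - 0.5 * (prior_A_tm1 + prior_B_tm1)
print("excluding C's students:", round(slope(d_prior_excl_C, dQ), 2))
print("including C's students:", round(slope(d_prior_incl_C, dQ), 2))
```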
Outline
1. Measuring VAM bias using experiments & quasi-experiments
2. Reproduction in North Carolina
3. Evaluating the quasi-experimental design
4. Long-run outcomes
5. Conclusion
A falsification test
- The quasi-experimental strategy is based on the idea that ∆Q_sgt is (as good as) randomly assigned with respect to ∆A_sgt. Is it?
- Random assignment has testable implications: ∆Q_sgt should be uncorrelated with anything prior.
- CFR present some limited evidence of this.
- Following Rothstein (2010), I focus on prior-year scores. Let:
  - LA_sgt be the mean score in t − 1 (generally in grade g − 1) of students who are in grade g in year t;
  - ∆LA_sgt ≡ LA_sgt − LA_sg,t−1.
Falsification test bin-scatter

Figure: Bin scatter plot of changes in mean prior-year scores on changes in mean teacher predicted value-added. Fitted line: ΔScore = α_st + 0.134 (0.017) × ΔVA.
Teacher switching quasi-experiment: Effects on prior-year scores
Table: Dependent variable is ∆LA_sgt.

                                  (1)      (2)      (3)      (4)
∆Q_sgt                            0.13     0.04     0.12     0.08
                                  (0.02)   (0.02)   (0.03)   (0.02)
Include all students in LHS?      N        Y        N        Y
Include all teachers in RHS?      N        N        Y        Y

Note: All columns include school-year FEs.
Quasi-experiment estimates with controls
Excluding classrooms with missing teacher EB scores. All columns include school-year FEs.

                              (1)      (2)      (3)      (4)
CFR-I
  ∆Q_sgt                      0.96     0.95
                              (0.03)   (0.02)
  Leads and lags in ∆Q_sgt             X
  ∆LA_sgt                              Cubic
NC replication
  ∆Q_sgt                      0.98     0.96     0.90     0.89
                              (0.02)   (0.02)   (0.02)   (0.02)
  Leads and lags in ∆Q_sgt             X
  ∆LA_sgt                              Cubic    Cubic    0.68
                                                         (0.00)
Quasi-experiment estimates with controls
Including classrooms with missing teacher EB scores. Columns (1)-(4); year or school-year FEs as indicated.

CFR-I, ∆Q_sgt:           0.88 (0.03) with year FEs;  0.87 (0.02) with school-year FEs.
NC replication, ∆Q_sgt:  0.82 (0.02) with year FEs and a ∆LA_sgt control (∆LA coefficient 0.29 (0.01));
                         0.80 (0.02) with school-year FEs and a ∆LA_sgt control (∆LA coefficient 0.27 (0.01));
                         0.82 (0.02) without the ∆LA_sgt control.
Potential mechanical explanations
- Issues:
  - Data from t − 2 are used both to form the t − 1 VA scores and to construct the prior-year scores for t − 1.
  - t and t − 1 teachers may have taught the same students in earlier grades.
- Simple solutions:
  - Exclude t − 2 data when predicting VA in t − 1 and t.
  - Instrument for ∆Q with a version that zeroes out teachers who taught the t − 1 or t cohort previously.

                                     Prior score   End-of-year score
Baseline                             0.08 (0.02)   0.80 (0.02)
Leave-three-out VA scores            0.10 (0.03)   0.79 (0.03)
IV with non-followers                0.18 (0.05)   0.77 (0.03)
School-year-subject FEs              0.04 (0.04)   0.81 (0.03)
S-Y-M FEs, IV with non-followers     0.03 (0.05)   0.80 (0.03)
Outline
1. Measuring VAM bias using experiments & quasi-experiments
2. Reproduction in North Carolina
3. Evaluating the quasi-experimental design
4. Long-run outcomes
5. Conclusion
Effects of teachers on students' long-run outcomes
- Want to know whether high-VA teachers have longer-lasting effects.
- Effects on test scores fade out quickly.
- CFR-II use tax data to construct long-run outcomes: college enrollment, earnings at age 28, teen birth, etc.
- Two types of estimates:
  - "Observational" (classroom-level) regressions, with controls.
  - Quasi-experimental estimates.
- Defining the causal effect of interest is tricky!
CFR-II estimates

                                  (1)      (2)      (3)      (4)
Dep. var.: College attendance at age 20
Teacher VA (EB, in SDs)           0.82     0.71     0.74     0.73
                                  (0.07)   (0.06)   (0.09)   (0.25)
Dep. var.: Annual earnings at age 28
Teacher VA (EB, in SDs)           $350     $286     $309
                                  (92)     (88)     (110)
Baseline controls                 X        X        X
Parent chars. controls                     X
Twice-lagged scores                                 X
Quasi-experiment                                             X

Note: CFR-II "controls" in (1)-(3) are implemented in an unusual way.
Implementing observational estimates with controls
- Want to estimate the effect of µ* on Y, controlling for Z.
- CFR-II method for observational estimates:
  - Regress Y on Z, with teacher FEs.
  - Regress (Y − Zβ̂) on µ̂_EB.
  - µ̂_EB is not residualized against Z.
  - This yields an unbiased estimate only under very restrictive conditions.
  - This approach uses the within-teacher covariance between Z (parent income) and Y (child outcomes). It doesn't control for the more important between-teacher (and between-school) variation.
- So why not an OLS regression of Y on µ̂_EB and Z?
  - µ̂_EB is an unconditionally unbiased predictor of µ* + b, but a biased predictor conditional on Z.
  - OLS may be biased, but this can be avoided by instrumenting for µ̃_jt with µ̃_jt'.
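A simulated sketch of the final bullet (this is not CFR's procedure; all parameter values are hypothetical and the data are collapsed to the teacher level): OLS of the outcome on the noisy VA measure and the control Z is attenuated, while instrumenting this year's VA residual with another year's recovers the true coefficient.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)
J = 20_000
v_mu, v_e = 0.02, 0.04          # hypothetical variances of true VA and of estimation error
beta, gamma = 1.0, 0.5          # true effects of VA (mu*) and of the control Z on Y

mu_star = rng.normal(0, np.sqrt(v_mu), J)
Z = 0.5 * mu_star + rng.normal(0, 0.2, J)            # control correlated with teacher VA
mu_t1 = mu_star + rng.normal(0, np.sqrt(v_e), J)     # VA residual, year 1
mu_t2 = mu_star + rng.normal(0, np.sqrt(v_e), J)     # VA residual, year 2
Y = beta * mu_star + gamma * Z + rng.normal(0, 0.2, J)
df = pd.DataFrame({"Y": Y, "Z": Z, "mu_t1": mu_t1, "mu_t2": mu_t2})

# OLS of Y on this year's VA residual and Z: attenuated by measurement error in mu_t1.
ols = smf.ols("Y ~ mu_t1 + Z", data=df).fit()

# 2SLS by hand: instrument mu_t1 with mu_t2 (point estimate only; SEs would need adjustment).
df["mu_t1_hat"] = smf.ols("mu_t1 ~ mu_t2 + Z", data=df).fit().fittedvalues
iv = smf.ols("Y ~ mu_t1_hat + Z", data=df).fit()

print(f"OLS coefficient:  {ols.params['mu_t1']:.2f}")
print(f"2SLS coefficient: {iv.params['mu_t1_hat']:.2f}")
```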
Revisiting the long-run analysis in NC data
- NC data don't have adult outcomes.
- But they can be linked to high school measures. I extract five (each set to missing if the student is not in the HS data):
  1. Graduate high school
  2. Plan to attend college
  3. Plan to attend a 4-year college
  4. GPA (4-point scale)
  5. High school class rank (100 = top)
- I don't have CFR's richer controls. Focus on method, with VA-model controls (individual & class mean lagged scores).
Observational estimates of effects of teacher VA on long-run outcomes

Controls:               None      CFR 2-step   OLS       IV
                        (1)       (2)          (3)       (4)
Graduate HS (%)         0.74      0.38         0.21      0.21
                        (0.05)    (0.05)       (0.04)    (0.04)
Plan college (%)        0.86      0.35         0.21      0.22
                        (0.07)    (0.06)       (0.06)    (0.06)
Plan 4-year coll. (%)   3.42      1.21         0.63      0.64
                        (0.14)    (0.10)       (0.09)    (0.09)
GPA (4-pt. scale)       0.046     0.021        0.014     0.015
                        (0.003)   (0.002)      (0.002)   (0.002)
Class rank (0-100)      1.34      0.62         0.33      0.34
                        (0.07)    (0.06)       (0.05)    (0.05)
Quasi-experimental estimates of effects of teacher VA on long-run outcomes

Controls:               None       ∆LA_sgt
                        (1)        (2)
Graduate HS (%)         0.36*      0.20
                        (0.18)     (0.19)
Plan college (%)        0.35       0.19
                        (0.22)     (0.23)
Plan 4-year coll. (%)   -0.02      -0.35
                        (0.33)     (0.33)
GPA (4-pt. scale)       0.010      -0.003
                        (0.008)    (0.008)
Class rank (0-100)      0.34       0.03
                        (0.24)     (0.24)
Outline
1. Measuring VAM bias using experiments & quasi-experiments
2. Reproduction in North Carolina
3. Evaluating the quasi-experimental design
4. Long-run outcomes
5. Conclusion
Conclusions
- Basic results of CFR-I (and, mostly, CFR-II) are successfully reproduced in North Carolina data.
- But the so-called "quasi-experiment" fails: teacher switches are correlated with changes in student preparedness.
- Estimates that control for this indicate modest bias:
  - Forecast bias coefficient (a.k.a. reliability ratio) around 0.8.
  - In the middle of Rothstein's (2009) plausible range of 0.6-1.
- Inference about long-run effects is not robust:
  - Estimates are very sensitive to observables.
  - The quasi-experiment with controls yields no significant effects.
- CFR's reported specifications would not detect any of this.
Where does VAM stand?
- In an important sense, none of this matters:
  - NYC and NC are low stakes (at the teacher level).
  - Modest bias is not inconsistent with VAM being useful.
- But:
  - The failure of the MET study indicates agents are not indifferent to class assignments.
  - NYC and NC evidence indicates these assignments bias VAM scores.
  - This is a source of unfairness in VAM-based evaluations that undercuts their "face validity."
  - We should expect assignments to be distorted under high stakes, worsening bias and perhaps harming educational outcomes.
  - Moreover, these studies (like others) consider best cases: extensive controls, excluding hard cases.