Baker NJPSA - Foundation for Educational Administration

The Quantitative Side of Teacher Evaluation in New Jersey
Bruce D. Baker
Graduate School of Education
Modern Teacher Evaluation Policies
Making Certain Distinctions with Uncertain Information
• First, the modern teacher evaluation template requires that objective measures of student achievement growth necessarily be considered in a weighting system of parallel components. (A minimal sketch of such a composite follows below.)
  – Placing the measures alongside one another in a weighting scheme assumes all measures in the scheme to be of equal validity and reliability but of varied importance (utility), that is, varied weight.
• Second, the modern teacher evaluation template requires that teachers be placed into effectiveness categories by assigning arbitrary numerical cutoffs to the aggregated, weighted evaluation components.
  – That is, a teacher at or below the 25th percentile when combining all evaluation components might be assigned a rating of "ineffective," whereas the teacher at the 26th percentile might be labeled effective.
• Third, the modern teacher evaluation template places inflexible timelines on the conditions for removal of tenure.
  – Typical legislation dictates that teacher tenure either can or must be revoked and the teacher dismissed after 2 consecutive years of being rated ineffective (where tenure can only be achieved after 3 consecutive years of being rated effective).
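As a minimal sketch of how such a template operates (the weights, component scores, and cutoffs below are illustrative assumptions, not the actual TEACH NJ parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
n_teachers = 1000

# Hypothetical component scores on a common 0-100 scale.
growth = rng.normal(50, 15, n_teachers)       # SGP/VAM-style growth measure
observation = rng.normal(50, 10, n_teachers)  # principal observation rubric

# Parallel components combined by assumed weights: the scheme treats both
# as equally valid and reliable, differing only in assigned importance.
composite = 0.35 * growth + 0.65 * observation

# Arbitrary cutoffs then collapse the continuous composite into categories.
cutoffs = np.percentile(composite, [25, 50, 75])
category = np.digitize(composite, cutoffs)  # 0 = "ineffective" ... 3 = "highly effective"
print(np.bincount(category))  # roughly 250 teachers per category, by construction
```
Note that the categories exist purely because the cutoffs were drawn; nothing in the underlying scores marks the boundaries as meaningful.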
Due Process Concerns
• Attribution/Validity
  – Face
    • There are many practical challenges, including whether teachers should be held responsible for summer learning in an annual assessment model, or how to parse the influence of teacher teams and/or teachers with assistants.
    • SGPs have their own face validity problem, by their authors' own admission: one cannot reasonably evaluate someone on a measure not attributable to them.
  – Statistical
    • There may be, and with an SGP likely will be, significant persistent bias (other stuff affecting growth, such as student assignment, classroom conditions, class sizes, etc.) which renders the resulting estimate NOT attributable to the teacher.
• Reliability
  – The lack of reliability of these measures, jumping around from year to year, suggests also that the measures are not a valid representation of actual teacher quality.
• Arbitrary/Unjustifiable Decisions
  – Cut-points imposed throughout the system make invalid assumptions regarding the statistics: that a 1-point differential is meaningful.
Legal Parallels
• NJ Anti-Bullying Statute
  – Damned if you do, damned if you don't!
  – The bullying statute forces district officials to make distinctions/apply labels which may not be justifiably applied ("bully" or "bullying behavior")
  – Application of these labels leads to damages to the individual to whom they are applied (liberty interests/property interests)
  – Result?
    • Due process challenges
    • District backtracking (avoidance)
• Measuring good teaching is no more precise an endeavor than measuring bullying! (And state mandates cannot make it so!)
TEACH NJ Issues
• Arbitrary weightings: assumes equal validity, but varied importance
• Arbitrary reduction/categorization of (forced) normally distributed data
  – Statistically inappropriate to declare +.49 sd categorically different from +.50 sd! [especially with noisy data]
• Arbitrary, statistically unjustifiable collapsing to a 4-point scale, then multiplied by an arbitrary weight
  – Assumes all parts equally valid & reliable, but of varied importance (weight)
(A simulation of the cut-point problem follows below.)
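A minimal simulation makes the cut-point problem concrete. The standard error of 0.20 sd is an illustrative assumption for a noisy single-year estimate:

```python
import numpy as np

rng = np.random.default_rng(1)
cut = 0.50       # category boundary, in sd units
se = 0.20        # assumed standard error of a single-year estimate
n_draws = 100_000

for true_effect in (0.49, 0.50):
    estimates = true_effect + rng.normal(0, se, n_draws)  # estimate = truth + noise
    share_above = (estimates >= cut).mean()
    print(f"true effect {true_effect:+.2f} sd -> rated above the cut {share_above:.1%} of the time")

# Both teachers cross the cut roughly half the time (about 48% vs. 50%):
# the 0.01 sd difference in true effect is swamped by estimation noise,
# so the category boundary distinguishes nothing real.
```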
• At best, this applies to 20 to 30% of teachers
  – At the middle school level! (which would be the highest)
[Chart: Average NJ Middle School, 2008-2011 Fall Staffing Reports; staff counts by assignment: Elementary Generalist, Special Ed, English/LAL, Math, Science, Social Studies, Art, Music, Business, World Language, Vocational Ed, Family/Consumer Sci, Industrial Arts, Health/PE, Support Services]
1. Requires differential contracts by staffing type
2. Some states/school districts "resolve" this problem by assigning all other teachers the average of those rated:
   1. Significant "attribution" concern (due process)
   2. Induces absurd practices
3. This problem undermines "reform" arguments that in cases of RIF, quality, not seniority, should prevail, because the supposed "quality" measures only apply to those positions least likely to be reduced.
Usefulness of Teacher Effect Measures
• Estimating teacher effects
– Basic attribution problems
• Seasonality
• Spill-over
– Stability Issues
• Decomposing the Signal and the Noise
• SGP and VAM
– Attribution?
– False signals in NJ SGP data
• Debunking Disinformation
A little stat-geeking
• What's in a growth or VAM estimate?
  – The largest part is random noise; that is, if we look from year to year across the same teachers, estimates jump around a lot, varying in "unexplained" and seemingly unpredictable ways.
  – The other two parts are:
    • False Signal: predictable patterns driven not by anything the teacher is doing, but by other stuff outside the teacher's control that happens to have predictable influence
      – Student sorting, classroom conditions, summer experiences, test form/scale, and the starting position of students on that scale.
    • True Signal: that piece of the predictability of change in test score from time 1 to time 2 that might fairly be attributed to the role of the teacher in the classroom.
(The simulation sketched below illustrates the decomposition.)
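As a minimal sketch, here is a simulation with assumed variance shares (10% true signal, 25% false signal, 65% noise; the split is illustrative, not an estimate from NJ data):

```python
import numpy as np

rng = np.random.default_rng(2)
n_teachers, n_years = 2000, 2

# Illustrative variance shares, summing to 1.
true_signal = rng.normal(0, np.sqrt(0.10), n_teachers)   # stable, teacher-caused
false_signal = rng.normal(0, np.sqrt(0.25), n_teachers)  # stable, NOT teacher-caused
noise = rng.normal(0, np.sqrt(0.65), (n_years, n_teachers))

estimates = true_signal + false_signal + noise  # what the evaluation system sees

r = np.corrcoef(estimates[0], estimates[1])[0, 1]
print(f"year-to-year correlation: {r:.2f}")  # about 0.35, the total stable share
# The correlation reflects true PLUS false signal combined; it cannot, by
# itself, tell us how much of the stable part is attributable to the teacher
# rather than to persistent classroom and school conditions.
```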
Distilling Signal from Noise
[Diagram: Total Variation splits into (1) unknown and seemingly unpredictable error (random noise) and (2) predictable variation (the stable component); the stable component splits further into variation attributable to the teacher and variation attributable to other persistent attributes, which are difficult if not implausible to accurately parse.]
Making high stakes personnel decisions on the basis of either Noise or False Signal is problematic! [And that may be the majority of the variation.]
The SGP Difference!
SGPs & New Jersey
• Student Growth Percentiles are not designed for
inferring teacher influence on student outcomes.
• Student Growth Percentiles do not (even try to) control
for various factors outside of the teacher’s control.
• Student Growth Percentiles are not backed by research
on estimating teacher effectiveness. By contrast,
research on SGPs has shown them to be poor at
isolating teacher influence.
• New Jersey’s Student Growth Percentile measures, at
the school level, are significantly statistically biased
with respect to student population characteristics and
average performance level.
In the author's words…
Damian Betebenner:
“Unfortunately Professor Baker conflates the data (i.e. the
measure) with the use. A primary purpose in the development of
the Colorado Growth Model (Student Growth Percentiles/SGPs) was
to distinguish the measure from the use: To separate the
description of student progress (the SGP) from the attribution of
responsibility for that progress.”
http://www.ednewscolorado.org/voices/student-growth-percentiles-and-shoe-leather
Playing Semantics…
• When pressed on the point that SGPs are not designed for attributing student gains to their teachers, those defending their use in teacher evaluation will often say…
  – "SGPs are a good measure of student growth, and shouldn't teachers be accountable for student growth?"
• Let's be clear here: one cannot be accountable for something that is not rightly attributable to them!
The Bottom Line?
• Policymakers seem to be moving forward on implementation of policies that display complete disregard for basic statistical principles:
  – that one simply cannot draw precise conclusions (and thus make definitive decisions) based on imprecise information.
    • One can't draw a strict cut point through messy data. The same applies to high-stakes cut scores for kids.
  – that one cannot make assertions about the accuracy of the position of any one point among thousands, based on the loose patterns we find in these types of data.
• Good data-informed decision making requires a deep, nuanced understanding of statistics and measures: what they mean… and most importantly WHAT THEY DON'T! (and can't possibly)
Reasonable Alternatives?
• To the extent these data can produce some true signal amidst the false signal and noise, central office data teams in large districts might be able to use several (not just one) rich but varied models to screen for variations that warrant further exploration (sketched below).
• This screening approach, much like high-error-rate rapid diagnostic tests, might tell us where to focus some additional energy (that is, classroom and/or school observation).
• We may then find that the signal was false, or that it really does tell us something, either about how we've mismatched teachers and assignments, or about the preparedness of some teachers.
• But the initial screening information should NEVER dictate the final decision (as it will under Toxic Trifecta models).
• And if we find that the data-driven analysis more often sends us down inefficient pathways, we might decide it's just not worth it.
• This cannot be achieved by centralized policy or through contractual agreements.
• Unfortunately, current policies and recent contractual agreements prohibit thoughtful, efficient strategies!
[Diagram: Screening → Observation → Validation (or NOT)]
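One minimal sketch of such a screen, assuming a district has estimates from several distinct growth models in hand (the variance shares and the bottom-decile threshold are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
n_teachers, n_models = 500, 3

# Hypothetical data: each model sees a shared stable component plus its
# own model-specific noise (in practice these might be an SGP plus one-
# and two-step VAMs).
stable = rng.normal(0, 1, n_teachers)
estimates = stable + rng.normal(0, 1, (n_models, n_teachers))

# Standardize each model's estimates, then flag only teachers who fall in
# the bottom decile under EVERY model. Agreement across varied models is
# modest evidence of something worth observing, never a final verdict.
z = (estimates - estimates.mean(axis=1, keepdims=True)) / estimates.std(axis=1, keepdims=True)
cut = np.percentile(z, 10, axis=1, keepdims=True)
flagged = (z < cut).all(axis=0)

print(f"{flagged.sum()} of {n_teachers} teachers flagged for follow-up observation")
```
The design choice is deliberate: requiring agreement across models shrinks the flagged pool to something a human observation process can actually validate, or reject.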
& Questions?
New York City Examples
English Language Arts Grades 4 to 8
[Scatterplot: teacher value-added estimates in consecutive years (x-axis: Value Added 2008-09); point categories: Good, Average, Bad, Good to Bad, Bad to Good, Other; correlation = .327]
9 to 15% (of those who were "good" or were "bad" in the previous year) move all the way from good to bad or bad to good. 20 to 35% who were "bad" stayed "bad" and 20 to 35% who were "good" stayed "good." And this is between the two years that show the highest correlation for ELA.
Mathematics Grades 4 to 8
[Scatterplot: teacher value-added estimates in consecutive years (x-axis: Value Added 2008-09); point categories: Good, Average, Bad, Good to Bad, Bad to Good, Other; correlation = .5046]
For math, only about 7% of teachers jump all the way from being bad to good or good to bad (of those who were "good" or "bad" the previous year), and about 30 to 50% who were good remain good, or who were bad, remain bad.
(The simulation below shows that correlations of this size produce exactly this kind of churn.)
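As a minimal sketch, simulating two years of ratings as correlated noise (the quintile definition of "good"/"bad" is an illustrative assumption) reproduces churn of this order:

```python
import numpy as np

rng = np.random.default_rng(4)
n, r = 200_000, 0.5  # r is roughly the math year-to-year correlation above

# Two years of ratings drawn as bivariate normal with correlation r.
year1 = rng.normal(size=n)
year2 = r * year1 + np.sqrt(1 - r**2) * rng.normal(size=n)

# Label top/bottom quintiles "good"/"bad" each year (assumed definition).
good1, bad1 = year1 > np.quantile(year1, 0.8), year1 < np.quantile(year1, 0.2)
good2, bad2 = year2 > np.quantile(year2, 0.8), year2 < np.quantile(year2, 0.2)

extreme = good1 | bad1
flip = ((good1 & bad2) | (bad1 & good2))[extreme].mean()
stay = ((good1 & good2) | (bad1 & bad2))[extreme].mean()
print(f"flipped good<->bad: {flip:.1%}; stayed in the same extreme: {stay:.1%}")
# At r = 0.5 this yields flip/stay rates of the same order as the NYC data,
# without any teacher changing anything at all.
```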
But is the signal we find real or false?
• Math – Likelihood of being labeled "good"
  – 15% less likely to be good in a school with a higher attendance rate
  – 7.3% less likely to be good for each 1-student increase in school average class size
  – 6.5% more likely to be good for each additional 1% proficient in math
• Math – Likelihood of being repeatedly labeled "good"
  – 19% less likely to be sequentially good in a school with a higher attendance rate (gr 4 to 8)
  – 6% less likely to be sequentially good in a school with 1 additional student per class (gr 4 to 8)
  – 7.9% more likely to be sequentially good in a school with a 1% higher math proficiency rate
• Math flip side – Likelihood of being labeled "bad"
  – 14% more likely to be bad in a school with a higher attendance rate
  – 7.9% more likely to be sequentially bad for each additional student in average class size (gr 4 to 8)
(A sketch of this kind of bias check appears below.)
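One way to run this kind of check, sketched on simulated data (the covariates, effect sizes, and "good" threshold are invented for illustration; this is not the NYC analysis itself):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 3000

# Hypothetical school-level covariates attached to each teacher.
attendance = rng.normal(0.93, 0.03, n)    # school attendance rate
class_size = rng.normal(24, 3, n)         # school average class size
proficiency = rng.normal(0.65, 0.15, n)   # school share proficient

# If "good" labels were pure teacher signal, school covariates should not
# predict them. Here we build in a dependence on proficiency to mimic a
# false signal (an assumption for the demo).
latent = 2.0 * proficiency + rng.normal(0, 1, n)
labeled_good = (latent > np.percentile(latent, 80)).astype(int)

X = sm.add_constant(np.column_stack([attendance, class_size, proficiency]))
fit = sm.Logit(labeled_good, X).fit(disp=0)
print(fit.summary(xname=["const", "attendance", "class_size", "proficiency"]))
# A significant coefficient on any school covariate is evidence of bias:
# the label is predictable from conditions outside the teacher's control.
```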
About those "Great" Irreplaceable Teachers!
Figure 1 – Who is irreplaceable in 2006-07 after being irreplaceable in 2005-06?
[Scatterplot: "Awesomeness" percentile in 2006-07 vs. %ile 2005-06, both 0-100; "Awesome x 2" marks teachers in the top 20% both years; markers distinguish Awesome Teachers from OK to Stinky Teachers]
Important tangent: Note how spreading data into percentiles makes the pattern messier!
Figure 2 – Among those 2005-06 irreplaceables, how do they reshuffle between 2006-07 & 2007-08?
[Scatterplot: percentile in 2007-08 vs. %ile 2006-07, both 0-100; "Awesome x 3" marks teachers in the top 20% three years running; markers distinguish Awesome Teachers from OK to Stinky Teachers]
Figure 3 – How many of those teachers who were totally awesome in 2007-08 were still totally awesome in 2008-09?
[Scatterplot: percentile in 2008-09 vs. %ile 2007-08, both 0-100; "Awesome x 4?" marks the top-20% region (but some may have dropped out one prior year); markers distinguish Awesome Teachers from OK to Stinky Teachers]
Figure 4 – How many of those teachers who were totally awesome in 2008-09 were still totally awesome in 2009-10?
[Scatterplot: percentile in 2009-10 vs. %ile 2008-09, both 0-100; markers distinguish Awesome Teachers from OK to Stinky Teachers]
Persistently Irreplaceable?
Of the thousands of teachers for whom ratings exist for each year in NYC, there are 14 in math and 5 in ELA who stay in the top 20% for each year!
Sure hope they don't leave!
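A quick back-of-the-envelope check puts that count in perspective. Under pure chance with independent ratings each year (the panel size and five-year span are assumptions for illustration):

```python
# Probability of landing in the top 20% in every one of five rated years
# if ratings were independent draws (pure noise):
p_always_top = 0.2 ** 5                 # = 0.00032
n_teachers = 10_000                     # assumed panel size, for illustration
print(p_always_top * n_teachers)        # about 3 teachers by luck alone
# And since ratings are NOT independent (year-to-year correlations of
# roughly .3 to .5, as shown earlier), chance plus persistent false signal
# can easily account for counts of this size.
```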
Distilling Signal & Noise in
New Jersey MGPs
New Jersey SGPs & Performance Level
[Scatterplot: school median growth percentile vs. % Proficient 7th Grade Math, schools including 7th grade; correlation = .54]
Is it really true that the most effective teachers are in the schools that already have high proficiency rates?
Strong FALSE signal (bias)
New Jersey SGPs & Performance Level
[Scatterplot: school median growth percentile vs. % Proficient 7th Grade Language Arts, schools including 7th grade; correlation = .57]
Is it really true that the most effective teachers are in the schools that already have high proficiency rates?
Strong FALSE signal (bias)
Middle School MGP Racial Bias
[Scatterplot: school median growth percentile vs. % Black or Hispanic, grades 06-08 schools; correlation = -.4755]
Is it really true that the most effective teachers are in the schools that serve the fewest minority students?
Strong FALSE signal (bias)
Middle School MGP Racial Bias
[Scatterplot: school median growth percentile vs. % Black or Hispanic, grades 06-08 schools; correlation = -.3260]
Is it really true that the most effective teachers are in the schools that serve the fewest minority students?
Strong FALSE signal (bias)
Okay… so is it really true that the most
effective teachers are in the schools that
serve the fewest non-proficient special
education students?
Significant FALSE signal (bias)
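The pattern in these plots can be tested directly. A sketch on simulated school-level data (the variable names, magnitudes, and built-in bias are assumptions standing in for the NJ file):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n_schools = 400

# Simulated school characteristics with a built-in dependence, so the
# bias test below has something to find.
pct_minority = rng.uniform(0, 1, n_schools)
pct_proficient = np.clip(0.85 - 0.4 * pct_minority + rng.normal(0, 0.1, n_schools), 0, 1)
mgp = 50 + 20 * (pct_proficient - 0.6) + rng.normal(0, 8, n_schools)

# If median growth percentiles captured only teacher effectiveness, school
# demographics and prior performance should not predict them.
X = sm.add_constant(np.column_stack([pct_minority, pct_proficient]))
fit = sm.OLS(mgp, X).fit()
print(fit.summary(xname=["const", "pct_minority", "pct_proficient"]))
print("corr(MGP, % proficient):", round(np.corrcoef(mgp, pct_proficient)[0, 1], 2))
```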
And what if the underlying measures are junk?
Ceilings, Floors and Growth Possibilities?
One CANNOT create variation where there is none!
[Histogram: percent of students by NJASK Math 2011 scale score (100-300), grade 4 (Math 4)]
One CANNOT create variation where there is none!
[Histogram: percent of students by NJASK Math 2011 scale score (100-300), grade 7 (Math 7)]
(The simulation below shows what a score ceiling does to measured growth.)
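As a minimal sketch of the ceiling problem (the scale, score distribution, and growth numbers are invented):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5000

# True ability grows for everyone, but the test scale tops out at 300.
true_2010 = rng.normal(250, 30, n)
true_2011 = true_2010 + 15 + rng.normal(0, 10, n)   # real growth of ~15 points
obs_2010 = np.clip(true_2010, 100, 300)             # observed scores hit the ceiling
obs_2011 = np.clip(true_2011, 100, 300)

near_ceiling = obs_2010 >= 290
growth = obs_2011 - obs_2010
print(f"students at/near the ceiling in 2010:  {near_ceiling.mean():.0%}")
print(f"their mean observed 'growth':          {growth[near_ceiling].mean():.1f}")
print(f"everyone else's mean observed growth:  {growth[~near_ceiling].mean():.1f}")
# Students already near 300 have nowhere left to go on the scale, so any
# growth measure built on these scores cannot register their learning,
# and no statistical model can recreate the variation the test erased.
```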
NJASK Math Grade 3 to 4 Cohort
[Scatterplot: Sample DFG FG district; consecutive-year scale scores for the grade 3 to 4 cohort, 100-300 scale (x-axis: Math 2010)]
NJASK Math Grade 4 to 5 Cohort
[Scatterplot: Sample DFG FG district; consecutive-year scale scores for the grade 4 to 5 cohort, 100-300 scale (x-axis: Math 2011)]
NJASK Math Grade 6 to 7 Cohort
[Scatterplot: Sample DFG FG district; consecutive-year scale scores for the grade 6 to 7 cohort, 100-300 scale (x-axis: Math 2010)]
NJASK Math Grade 7 to 8 Cohort
[Scatterplot: Sample DFG FG district; consecutive-year scale scores for the grade 7 to 8 cohort, 100-300 scale (x-axis: Math 2011)]
Debunking Disinformation
Misrepresentations
• NJ Commissioner Christopher Cerf explained:
  – "You are looking at the progress students make and that fully takes into account socio-economic status," Cerf said. "By focusing on the starting point, it equalizes for things like special education and poverty and so on."[17] (emphasis added)
• Why this statement is untrue:
  – First, comparisons of individual students don't actually address what happens when a group of students is aggregated to their teacher and the teacher is assigned the median student's growth score to represent his/her effectiveness, since teachers don't all have an evenly distributed mix of kids who started at similar points (relative to other teachers). So, in one sense, this statement doesn't even address the issue.
  – Second, this statement is simply factually incorrect, even regarding the individual student. The statement is not supported by research on estimating teacher effects, which largely finds that sufficiently precise student, classroom and school level factors do relate to variations not only in initial performance level but also in performance gains.
[17] http://www.wnyc.org/articles/new-jersey-news/2013/mar/18/everything-you-need-know-about-students-baked-their-test-scores-new-jersy-education-officials-say/
Further research on this point…
• Two recent working papers compare SGP and VAM estimates for teacher
and school evaluation and both raise concerns about the face validity and
statistical properties of SGPs.
– Goldhaber and Walch (2012) conclude: “For the purpose of starting
conversations about student achievement, SGPs might be a useful tool, but
one might wish to use a different methodology for rewarding teacher
performance or making high-stakes teacher selection decisions” (p. 30).[6]
– Ehlert and colleagues (2012) note: “Although SGPs are currently employed for
this purpose by several states, we argue that they (a) cannot be used for
causal inference (nor were they designed to be used as such) and (b) are the
least successful of the three models [Student Growth Percentiles, One-Step
VAM & Two-Step VAM] in leveling the playing field across schools” (p. 23).[7]
[6] Goldhaber, D., & Walch, J. (2012). Does the model matter? Exploring the relationship between different student achievement-based teacher assessments. University of Washington Bothell, Center for Education Data & Research. CEDR Working Paper 2012-6.
[7] Ehlert, M., Koedel, C., Parsons, E., & Podgursky, M. (2012). Selecting growth measures for school and teacher evaluations. National Center for Analysis of Longitudinal Data in Education Research (CALDER). Working Paper #80.
Misrepresentations
• "The Christie administration cites its own research to back up its plans, the most favored being the recent Measures of Effective Teaching (MET) project funded by the Gates Foundation, which tracked 3,000 teachers over three years and found that student achievement measures in general are a critical component in determining a teacher's effectiveness."[23]
• The Gates Foundation MET project did not study the use of Student Growth Percentile models. Rather, the MET project studied the use of value-added models, applying those models under the direction of leading researchers in the field, testing their effects on fall-to-spring gains and on alternative forms of assessment. Even with these more thoroughly vetted value-added models, the MET project uncovered, though largely ignored, numerous serious concerns regarding the use of value-added metrics. External reviewers of the MET project reports pointed out that while the MET researchers maintained their support for the method, the actual findings of their report cast serious doubt on its usefulness.[24]
[23] http://www.njspotlight.com/stories/13/03/18/fine-print-overview-of-measures-for-tracking-growth/
[24] Rothstein, J. (2011). Review of "Learning About Teaching: Initial Findings from the Measures of Effective Teaching Project." Boulder, CO: National Education Policy Center. http://nepc.colorado.edu/thinktank/review-learning-about-teaching [accessed 2-May-13]
Misrepresentations
• But… even though these ratings look unstable from year to year, they are about as stable as baseball batting averages from year to year, and clearly batting average is a "good" statistic for making baseball decisions?
• Not so, say the baseball stat geeks:
  – "Not surprisingly, Batting Average comes in at about the same consistency for hitters as ERA for pitchers. One reason why BA is so inconsistent is that it is highly correlated to Batting Average on Balls in Play (BABIP), .79, and BABIP only has a year-to-year correlation of .35."
  – "Descriptive statistics like OBP and SLG fare much better, coming in at .62 and .63 respectively. When many argue that OBP is a better statistic than BA, it is for a number of reasons, but one is that it's more reliable in terms of identifying a hitter's true skill since it correlates more year-to-year."
http://www.beyondtheboxscore.com/2011/9/1/2393318/what-hitting-metrics-are-consistent-year-to-year
Put simply, VAM estimates ARE about as useful as batting average: NOT VERY!
About the Chetty Study…
• The Chetty, Friedman and Rockoff studies suggest that variation across NYC classrooms of kids many years ago, absent high-stakes assessment, is associated with small wage differences for those kids at age 30 (thus arguing that teaching quality, as measured by variation in classroom-level student gains, matters). But this study has no direct implications for what might work in hiring, retaining and compensating teachers.
• The presence of variation across thousands of teachers, even if loosely correlated with other stuff, provides little basis for identifying any one single teacher as persistently good or bad.