Issues in Teacher Evaluation and Validity:
Conceptual, Methodological, and Practical
Jose Felipe Martinez
University of California, Los Angeles
Graduate School of Education
New Mexico Teacher Evaluation Advisory Council
(NMTEACH)
New Mexico Public Education Department
Overview
• Teacher Evaluation: The Policy Context
• Teacher Evaluation: Conceptual/Methodological Issues (Why, What, How)
  • Constructs and methods
• Teacher Evaluation with Multiple Measures
  • Multiple Measures and Validity
  • Models for combining indicators
  • Validation Frameworks and Sources of Evidence
  • Consequences, additional issues
Teacher Evaluation:
The Policy Context
Teacher Evaluation: A New Silver Bullet?
• Teacher evaluation systems are undergoing reform
• Tied to perceptions of performance in national and international evaluations
  • A reverse Lake Wobegon: everyone below average (Feuer, 2012)
• …to assumptions about the role of “good/bad” teachers in explaining and improving those results, and
• …about our ability to identify these teachers
• Related to perceptions of the teaching profession
• …and of the quality of existing teacher evaluation systems
Many Prominent Examples
• United States
  • Los Angeles, New York, Chicago (2012)
  • Denver (2010)
  • Tennessee (1992, 2012)
  • Toledo, Cincinnati (1990s)
• Worldwide
  • Singapore (2006)
  • Chile (2003)
  • Mexico (1993, 2009)
  • Australia (2013)
Teacher Evaluation:
Conceptual/Methodological Issues
Why Evaluate?
• Motivations, inferences, and uses
  • Identify struggling teachers to help them improve
  • Identify persistently struggling teachers for sanction
  • Provide incentives to the best teachers
  • Inform school practice and district policies on Teacher Preparation and Professional Development
  • Identify and scale effective teacher practice
  • Or, typically, a combination… (e.g. NMTEACH)
Teacher Evaluation
Conceptual/Methodological Issues:
Why, What, How
What to Evaluate?
• Teacher competence (Reynolds, 1999):
  • Knowledge: subject, pedagogical
  • Skill: ability, applied knowledge
  • Disposition: attitudes, perceptions, beliefs
  • Practice: classroom processes (e.g. instruction, assessment, management)
• And…
  • Seniority, credentials
  • School citizenship, contributions to community…
  • “Effectiveness”: ability to raise student test scores
What to Evaluate? All of the above?
• “We fully understand that standardized
tests don't capture all of the subtle qualities
of successful teaching. That's why we call
for multiple measures in evaluating
teachers. In an ideal world, that data should
also drive instruction and drive useful
professional development.”
Arne Duncan
U.S. Secretary of Education
How to Evaluate?
Teacher Constructs (What?) → Measures (How?)
• Knowledge (subject, pedagogical); Skills (ability, applied knowledge) → Multiple-choice tests; Performance assessments; Vignettes
• Practice, classroom performance (instruction, assessment, management) → Surveys, logs; Classroom observations, video; Artifacts, portfolios
• Disposition (beliefs, attitudes) → Surveys, interviews
• Citizenship (contributions to community) → Surveys, interviews, self-assessment
• Effectiveness (contribution to student achievement) → Student test score gains; “value added”
(Reynolds, 1999)
Which is Best? Which Should We Use?
• No method is inherently preferable
  • Each illuminates a different aspect of Teacher [insert euphemism here]
  • Different kinds of information from different sources
  • Pros and cons in reliability, validity, credibility…
• Here I will briefly discuss:
  • Value Added Models
  • Observations
  • Surveys
  • Portfolios
Value Added Models
• Culture changing toward using student achievement to evaluate teachers
• Simple logic:
  • Students do better (grow more) in some classrooms (Weisberg et al. 2009; Kane et al. 2011)
  • Student learning should be a (the?) key criterion to evaluate teacher quality
• Seemingly simple method:
  • With longitudinal data, compare teachers on the progress of their students, not their achievement levels
  • Estimate each teacher’s unique contribution to student academic growth, net of factors outside the teacher’s control
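As an illustration only, here is a minimal sketch of one simple covariate-adjustment value-added model; it is not the specific model behind TVAAS, growth percentiles, or any system named in this talk, and the column names (score_post, score_pre, frl, teacher_id) are hypothetical.

```python
# Minimal covariate-adjustment value-added sketch (hypothetical column names).
# Regress current scores on prior scores and a poverty indicator, plus teacher
# indicators; each teacher coefficient is read as that teacher's average
# contribution to growth, relative to the omitted reference teacher.
import pandas as pd
import statsmodels.formula.api as smf

def estimate_value_added(df: pd.DataFrame) -> pd.Series:
    """df: one row per student with columns score_post, score_pre, frl, teacher_id."""
    fit = smf.ols("score_post ~ score_pre + frl + C(teacher_id)", data=df).fit()
    # Keep only the teacher fixed-effect coefficients.
    return fit.params.filter(like="C(teacher_id)")
```

Operational systems add multiple prior-year scores, shrinkage, and classroom or school covariates; the sketch only illustrates that “net of factors outside teacher control” is itself a modeling choice.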
Value Added Models
• A family of statistical models
  • e.g. TVAAS, growth percentiles, (variable) persistence models
  • Estimates correlated across models; the achievement measures used also matter (Lockwood et al. 2007)
• A variety of issues:
  • Partial view of student learning (Baker et al. 2010)
  • Unstable estimates (Schochet & Chiang, 2010)
  • Descriptive, not causal (Stuart, Rubin, & Zanutto, 2004), nor explanatory/diagnostic (Goe, 2011)
  • Available only for some teachers (30-40% in the US)
• “…VAM estimates best used in combination with other indicators” (Braun et al., 2010)
Classroom Observations
• Widely used to assess the quality of teaching practice
• Explanatory and formative counterpart to VAM
  • Identify areas in need of improvement → inform PD
• Expensive if standardized (training, time)
• Error from complex rubrics and human judgment
  • Bias/subjectivity in construct definition and emphasis
  • Lower reliability than traditional instruments (live or video)
• Weak correlations with other indicators, including student achievement (Kane et al. 2010)
Classroom Observation: Constructs

Singapore’s Competencies
• Nurturing the Whole Child (Core Competency!)
  • Share values with the student
  • Take action to develop the student
  • Act consistently in the student’s interest
• Cultivating Knowledge
  • Subject Mastery
  • Analytical Thinking
  • Initiative
  • Teaching Creatively
• Winning Hearts and Minds
  • Understanding the Environment
  • Developing Others
• Working with Others
  • Partnering with Parents
  • Working in Teams
• Knowing Self and Others
  • Emotional Intelligence

Danielson Framework
• Planning and Preparation
  • Demonstrating Knowledge of Content and Pedagogy
  • Demonstrating Knowledge of Students
  • Selecting Instructional Goals
  • Demonstrating Knowledge of Resources
  • Designing Coherent Instruction
  • Assessing Student Learning
• Classroom Environment
  • Creating an Environment of Respect and Rapport
  • Establishing a Culture for Learning
  • Managing Classroom Procedures
  • Managing Student Behavior
  • Organizing Physical Space
• Instruction
  • Communicating Clearly and Accurately
  • Using Questioning and Discussion Techniques
  • Engaging Students in Learning
  • Providing Feedback to Students
  • Demonstrating Flexibility and Responsiveness
• Professional Responsibilities
Classroom Observation: Reliability
(Source: Bill and Melinda Gates Foundation, 2011)
Teacher Surveys
• Common method for collecting data on teacher (classroom) practice on a large scale
• Good coverage; low cost; low burden for teachers
• Adequate reliability
• Questionable validity
  • Error from inconsistency in interpretation of questions
  • …and from social desirability (e.g. in reporting emphasis on higher-order thinking)
• Weak correlations with other indicators, including student achievement (Kane et al. 2010)
Student Surveys
• Increasingly popular for teacher evaluation
  • Coverage; cost; perceived validity
• Adequate reliability when aggregated by classroom
• Correlated with student achievement as much as or more than teacher surveys (Kane et al. 2010)
• Additional information at the student level
  • Within-classroom variance reflects differentiated teacher practice with different students (Martínez, 2012; Muthén, 1995)
  • Correlated with achievement also within classrooms
Student Surveys: Remaining Issues
• Memory errors, inconsistency in interpretation
• Particularly with younger children
• Concerns for high stakes teacher evaluation
• Social desirability, pressure, other validity issues
• Cost Issues
• Unit of measurement, construct invariance
• “My teacher asks me to read books”
• vs. “Our teacher asks us to read books”
Student Surveys: Correlation with Achievement
Teacher Portfolios
• Compile evidence of teacher practice over a period of time

What’s in a Teacher Portfolio?
• Classroom Artifacts (lesson plans, assignments, samples of student work, etc.)
• Teacher Reflections (on the practice reflected in the artifacts)
• Student/Teacher Survey/Log (classroom practice, attitudes, perceptions)

vs. Surveys
+ Richer, better validity, PD value
− Higher cost, rater/rubric error, burden on teachers

vs. Observations
• Debate taking form
Portfolios vs. Observations
• 1. Cost to collect and score?
  • Similar to or lower than observations
• 2. Score reliability?
  • Similar to observations/video (see the MET study)
  • May need to re-examine ideas of “acceptable reliability”
  • Better coverage and validity for some aspects of practice
  • Interesting possibilities with newer technologies
• 3. More burdensome for teachers?
  • Yes, much more so (a 20-30+ hour effort)
  • But with the burden comes Professional Development
  • So far used mostly for “National Certification”
  • Growing interest?: EdTPA, PACT
  • May be feasible as an integral part of an evaluation/PD cycle
Teacher Evaluation and
Multiple Measures
(Validity)
Validity
• How do we know we are doing a good job of
evaluating teachers?
• Are our inferences and decisions valid?
“An integrated evaluative judgment of the degree to
which empirical evidence and theoretical rationales
support the adequacy and appropriateness of
inferences and actions based on test scores or
other modes of assessment.”
Messick (1989)
• “In educational settings, a decision or characterization
that will have major impact [on a student] should not
be made on the basis of a single score. Other relevant
information should be taken into account if it will
enhance the overall validity of the decision.”
Standards for Educational and Psychological Testing, Standard 13.7
(AERA, APA, & NCME, 1999)
What to Evaluate? All of the above
• New Mexico’s teacher evaluation system should
utilize a matrix in which multiple components of
a teacher’s evaluation combine to determine a
teacher’s overall effectiveness rating.
• Effectiveness levels should only be assigned
after careful consideration of multiple measures,
including student achievement data,
observations, and other proven measures
[emphasis added]
New Mexico Effective Teaching Task Force
Multiple Measures:
Logic and Assumptions
• General assumption:
  • Combining multiple measures leads to better-informed (more valid) decisions about teachers and teaching
1. Accuracy
   - Teachers classified into finer, more stable categories (De Pascale, 2012; Steele et al. 2010)
2. Validity
   - More complete picture of performance (Goe, 2011)
   - Less incentive for test preparation (Steele et al. 2010)
3. Feedback
   - Information to help teachers adjust and improve instruction and classroom strategies (Duncan, 2011)
4. Relevance
   - Greater confidence in the results of the evaluation among the public and stakeholders (Glazerman et al. 2011)
Combining Multiple Measures:
Conceptual Issues
• When and where do these assumptions hold? In what situations? It depends on several factors:
  • Assumptions about the nature of the constructs involved
  • Intended inferences and uses
  • What exactly is meant by “combining” (Brookhart, 2009)
• Not self-explanatory: a variety of models is available
  • Substantial literature in psychology, personnel evaluation, and student assessment
  • Only starting to be applied to teacher evaluation
Models for Combining Multiple Measures
• Conjunctive: must meet criteria (pass) on all measures
• Disjunctive: must meet criteria (pass) on k of the measures
• Compensatory: based on a composite of the measures; a high level on one measure compensates for low levels on others
• Hybrid: e.g. compensatory-conjunctive, sequential
(Mehrens, 1989; Chester, 2003)
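To make the distinctions concrete, here is a rough sketch of how the first three rules could be operationalized; the measure names, cut scores, and weights are placeholders, not values from NMTEACH or any other system.

```python
# Sketch of conjunctive, disjunctive, and compensatory rules.
# Measure names, cut scores, and weights are placeholders only.
from typing import Dict

CUTS = {"observation": 3.0, "student_survey": 3.0, "student_growth": 3.0}
WEIGHTS = {"observation": 0.4, "student_survey": 0.2, "student_growth": 0.4}

def conjunctive(scores: Dict[str, float]) -> bool:
    # Pass only if every measure meets its cut score.
    return all(scores[m] >= CUTS[m] for m in CUTS)

def disjunctive(scores: Dict[str, float], k: int = 2) -> bool:
    # Pass if at least k of the measures meet their cut scores.
    return sum(scores[m] >= CUTS[m] for m in CUTS) >= k

def compensatory(scores: Dict[str, float], composite_cut: float = 3.0) -> bool:
    # Pass if the weighted composite meets the cut; a high score on one
    # measure can offset a low score on another.
    return sum(WEIGHTS[m] * scores[m] for m in WEIGHTS) >= composite_cut
```

With these placeholder values, a teacher scoring 4 / 2 / 4 fails the conjunctive rule but passes the disjunctive (k = 2) and compensatory rules, which is exactly the kind of divergence across models discussed below.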
Combination Model 0:
Do Not Combine!
• Consider not combining the indicators at all!
  • Summary indices are not essential to formative or summative evaluation
  • Key measures may be collected, maintained, and reported separately
  • Each used to illuminate one side of the picture (improving teaching, communication, citizenship, achievement?)
  • And used jointly as needed where summative judgments are sought
  (Mehrens, 1989; Brookhart, 2009)
• Making combined use of multiple indicators ≠ combining multiple indicators
Combination Model 1:
Conjunctive, Disjunctive
[Diagram: separate indicators — Classroom Observation, Portfolio, Student Survey, Teacher Test, Student Achievement, Other Indicators — each evaluated against its own criterion]
Decision Rules and Reliability
• Error in multiple measures may cancel out or compound
  • Assume Teacher A’s true scores on measures T1 and T2 are both passes
  • Because of unreliability, the probability of an observed passing score is estimated at 0.80 and 0.90, respectively
  • Probability of passing scores on both measures (conjunctive model): 0.8 × 0.9 = 0.72
  • Probability of a passing score on either measure (disjunctive model): 1 − (0.2 × 0.1) = 0.98
(see e.g. Cronbach, Linn, Brennan, & Haertel, 1997; Douglas & Mislevy, 2010)
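The same arithmetic extends to any number of measures; a quick sketch, assuming errors are independent across measures (itself an assumption worth checking):

```python
# How classification error compounds (conjunctive) or cancels (disjunctive)
# for a teacher whose true scores pass every measure, assuming independent errors.
from math import prod

def p_pass_all(p_observed_pass):
    return prod(p_observed_pass)                      # conjunctive rule

def p_pass_any(p_observed_pass):
    return 1 - prod(1 - p for p in p_observed_pass)   # disjunctive: pass at least one

print(p_pass_all([0.80, 0.90]))   # approximately 0.72, as in the example above
print(p_pass_any([0.80, 0.90]))   # approximately 0.98
```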
Decision Rules and Reliability
• This is a simplistic scenario; more complex rules are often used in practice, depending on policy context and goals
  • E.g.: teachers must pass Measure 1 or 2, AND not rank lowest on Measure 3 (e.g. New Haven)
• The choice of decision rule matters more for accuracy and validity than the reliability of the component measures chosen (Chester, 2003)
• Importantly: models are not “objective”; each involves judgment
  • Why satisfy k criteria and not k−1? Why those criteria?
Hybrid System: e.g. New Haven
• Synthesizes three component measures (each on a 5-point scale):
  • Teacher instructional practice
  • Teacher professional values
  • Student learning outcomes
Combination Model 2 (Compensatory):
Principal Components / Factor Analysis
[Diagram: Portfolio, Classroom Observation, Student/Parent Survey, Teacher Survey, Student Achievement, and Other Measures as indicators of a single Global Construct]
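One common way to implement a compensatory composite of this kind is to standardize the indicators and take their first principal component as the “global construct”; a sketch using scikit-learn, with placeholder indicator names:

```python
# Sketch: first principal component of standardized indicators used as a single
# compensatory composite. Indicator (column) names are placeholders.
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def global_composite(df: pd.DataFrame) -> pd.Series:
    """df: one row per teacher; columns such as portfolio, observation,
    student_survey, teacher_survey, value_added."""
    z = StandardScaler().fit_transform(df)
    pca = PCA(n_components=1).fit(z)
    composite = pca.transform(z)[:, 0]
    # pca.components_[0] holds the empirical weight each indicator receives.
    return pd.Series(composite, index=df.index, name="global_construct")
```

Note that the weights here are determined empirically by the correlations among the indicators in a particular sample, not by a policy judgment about what should count most.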
Combination Model 3 (Compensatory):
Optimal Weights (Achievement as Criterion)
[Diagram: Artifacts/Portfolio, Classroom Observation, Student/Parent Survey, Teacher Survey, and Other Measures as indicators of a Teacher Construct that predicts Student Achievement; each indicator carries a regression weight β]
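In this model the β weights are simply regression coefficients obtained by predicting the criterion (here, a student achievement or value-added measure) from the other indicators; a sketch with hypothetical column names:

```python
# Sketch of criterion-based ("optimal") weights: regress the achievement
# criterion on the other indicators; the fitted coefficients are the weights.
# Column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

def criterion_weights(df: pd.DataFrame) -> pd.Series:
    """df: one row per teacher with columns value_added (criterion),
    observation, portfolio, student_survey, teacher_survey."""
    fit = smf.ols(
        "value_added ~ observation + portfolio + student_survey + teacher_survey",
        data=df,
    ).fit()
    return fit.params  # intercept plus the beta weight for each indicator
```

As noted later in the deck, this presumes the criterion measure exists and is itself trusted, and such weights tend to shift when re-estimated in new samples (validity shrinkage).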
MM Combination Model 4 (Compensatory):
PC/FA with Student Achievement as an Indicator
[Diagram: Artifacts/Portfolio, Classroom Observation, Student/Parent Survey, Teacher Survey, Student Achievement, and Other Measures all as indicators of a single Teacher Construct]
MM Combination Model 5 (Compensatory):
SEM / Canonical Correlates
[Diagram: Artifacts/Portfolio, Classroom Observation, Student/Parent Survey, Teacher Survey, and Other Measures as indicators of a Teacher Construct, related to a Student Outcomes construct measured by Student Measure #1, Student Measure #2, and other (e.g. noncognitive) measures]
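A rough analogue of this model can be explored with canonical correlation, relating the block of teacher measures to the block of student outcomes; a sketch only, with placeholder variable blocks (a full SEM would normally be specified in dedicated latent-variable software):

```python
# Sketch: canonical correlation between a block of teacher measures and a
# block of student outcome measures. Blocks are placeholders.
import numpy as np
from sklearn.cross_decomposition import CCA

def relate_blocks(teacher_block: np.ndarray, student_block: np.ndarray):
    """teacher_block: (n_teachers, n_teacher_measures)
    student_block:  (n_teachers, n_student_outcomes)"""
    cca = CCA(n_components=1)
    teacher_scores, student_scores = cca.fit_transform(teacher_block, student_block)
    # The correlation between the two sets of scores summarizes how strongly the
    # combined teacher measures relate to the combined student outcomes.
    r = np.corrcoef(teacher_scores[:, 0], student_scores[:, 0])[0, 1]
    return cca, r
```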
MM Combination Model 6 (Darlington, 1970):
Unmeasured Criterion, Theoretical Weights
[Diagram: Artifacts/Portfolio, Classroom Observation, Student/Parent Survey, Teacher Survey, Student Achievement, and Other Measures weighted on theoretical/policy grounds to approximate an Unmeasured Teacher Construct]
Empirical vs. Theoretical Weighting
• Model 6 is the most likely scenario in practice
  • Policy assumptions/values (consensual) inform the system, alongside technical considerations
• It really is the only feasible scenario
  • Empirical weights cannot be derived: the ultimate criterion measure is NOT available
  • Note that Model 3 assumes such a measure is available
    • But it does not yield the “correct” weight for the criterion itself
    • And is exposed to validity shrinkage (weights change over time)
Multiple Measures and Validity
• Different models may lead to different inferences
• Little guidance available, so…
  • LOCAL VALIDITY STUDIES NEEDED (lots of them)
• As with single measures, we need to set up testable validation hypotheses (Kane, 2006)
  • Whatever the construct: Teacher [euphemism]
  • 1. Describe intended inferences, uses, AND CONSEQUENCES
  • 2. Collect empirical evidence to support them
• The 2012 and 2013 MET reports will be influential; they may force the field to broaden its lens and revise assumptions and expectations
  • No getting around conducting local validation studies
What KINDS of EVIDENCE?
• All of them: Validity is a unitary notion
  • Theoretical support
  • Consistency and accuracy (Reliability)
  • Correlations, internal structure
  • Predictive power
  • Consequences of use
• Validity becomes a rather empty academic topic if the consequences are not considered
  • Or if they differ markedly from expectation
What consequences?
• Intended and unintended effects
  • On teaching practice
  • On different student outcomes
  • On recruitment and retention
  • On motivation, competition, fraud
  • On perceptions of validity, fairness, utility
  • On the dynamics of relationships with parents and community
  • Etc., etc.
Final Remarks. Teacher Evaluation:
Why are we doing this again?
• Some good reasons
  • Make student achievement a priority
  • Monitor and assess teacher performance
  • Develop a culture of accountability
    • and of reflection and improvement
  • Inform PD to improve teacher performance
• However
  • Multiple fallible indicators do not automatically yield better, less fallible inferences. But they always yield more complex ones
  • Using indicators in combination involves technical, but also conceptual and policy, assumptions
Final Remarks. Teacher Evaluation:
Why are we doing this again?
• Because “the stakes are high, and the future of our children is at stake” (insert public official name here, circa 2012), we should proceed carefully and deliberately
  • Good measures take time to develop
  • Solid systems based on these measures take longer to test and implement
  • The consequences of implementing these systems are unknown and will take longer still to assess
• Experience suggests that moving too fast to implement may shortchange the system
Final Remarks. Teacher Evaluation:
Why are we doing this again?
• The most important goal, in my view, is not only to avoid unfair decisions and negative unintended consequences (though the potential for both should give us pause)
• The greatest risk is missing an opportunity to enact sound teacher evaluation policy with great potential to positively impact educational practice and outcomes
Thank you
jfmtz@ucla.edu