Issues in Teacher Evaluation and Validity: Conceptual, Methodological, and Practical
Jose Felipe Martinez
University of California, Los Angeles, Graduate School of Education
New Mexico Teacher Evaluation Advisory Council (NMTEACH)
New Mexico Public Education Department
UCLA Graduate School of Education & Information Studies

Overview
• Teacher Evaluation: The Policy Context
• Teacher Evaluation
   • Conceptual/Methodological Issues: Why, What, How
   • Constructs and methods
• Teacher Evaluation with Multiple Measures
   • Multiple Measures and Validity
   • Models for combining indicators
   • Validation Frameworks and Sources of Evidence
   • Consequences, additional issues

Teacher Evaluation: The Policy Context

Teacher Evaluation: A New Silver Bullet?
• Teacher evaluation systems undergoing reform
• Tied to perceptions of performance in national or international evaluations
   • Reverse Lake Wobegon; all below avg. (Feuer, 2012)
   • …assumptions about the role of “good/bad” teachers in explaining/improving the results and
   • …about our ability to identify these teachers
• Related to perceptions of teaching profession
   • …quality of existing teacher evaluation systems

Many Prominent Examples
• United States
   • Los Angeles, New York, Chicago (2012)
   • Denver (2010)
   • Tennessee (1992, 2012)
   • Toledo, Cincinnati (1990s)
• Worldwide
   • Singapore (2006)
   • Chile (2003)
   • Mexico (1993, 2009)
   • Australia (2013)

Teacher Evaluation: Conceptual/Methodological Issues

Why Evaluate?
• Motivations, inferences and uses
   • Identify struggling teachers to help them improve
   • Identify recurrently struggling teachers for sanction
   • Provide incentives to the best teachers
   • Inform school practice/district policies on Teacher Preparation and Professional Development
   • Identify and scale effective teacher practice
• Or typically a combination... (e.g. NMTEACH)

Teacher Evaluation Conceptual/Methodological Issues: Why, What, How

What to Evaluate?
• Teacher competence (Reynolds, 1999):
   • Knowledge: Subject, Pedagogical
   • Skill: Ability, applied knowledge
   • Disposition: Attitudes, Perceptions, Beliefs
   • Practice: Classroom processes (e.g. instruction, assessment, management)
• And...
   • Seniority, Credentials
   • School citizenship, contributions to community…
• “Effectiveness”: Ability to raise student test scores

What to Evaluate? All of the above?
• “We fully understand that standardized tests don't capture all of the subtle qualities of successful teaching. That's why we call for multiple measures in evaluating teachers. In an ideal world, that data should also drive instruction and drive useful professional development.”
Arne Duncan, U.S. Secretary of Education

How to Evaluate?
Teacher Constructs (What?) | Measures (How?)
Knowledge (subject, pedagogical); Skills (ability, applied knowledge) | Multiple Choice Tests; Performance Assessments; Vignettes
Practice, Classroom Performance (instruction, assessment, management) | Surveys, Logs; Classroom Observations, Video; Artifacts, Portfolios
Disposition (beliefs, attitudes) | Survey, Interview
Citizenship (contributions to community) | Surveys, Interview, Self Assessment
Effectiveness (contribution to student achievement) | Student Test Score Gains; “Value Added”
(Reynolds, 1999)

Which is Best?
Which should we use?
• No method is inherently preferable
• Each illuminates a different aspect of Teacher [insert euphemism here]
   • Different kind of information from different sources
   • Pros and cons in reliability, validity, credibility…
• Here I will briefly discuss:
   • Value Added Models
   • Observations
   • Surveys
   • Portfolios

Value Added Models
• Culture changing towards using student achievement to evaluate teachers
• Simple Logic:
   • Students do better (grow more) in some classrooms (Weisberg et al., 2009; Kane et al., 2011)
   • Student learning should be a (the?) key criterion to evaluate teacher quality
• Seemingly Simple Method:
   • With longitudinal data… compare teachers on the progress of their students, not their achievement
   • Estimate teachers' unique contributions to student academic growth, net of factors outside teacher control

Value Added Models
• A family of statistical models
   • e.g. TVAAS, Growth percentiles, (variable) Persistence
   • Correlated; measures used + important (Lockwood et al., 2007)
• A variety of issues:
   • Partial view of student learning (Baker et al., 2010)
   • Unstable estimates (Schochet & Chiang, 2010)
   • Descriptive, not causal (Stuart, Rubin, Zanutto, 2004), nor explanatory/diagnostic (Goe, 2011)
   • Available only for some teachers (30-40% US)
• “…VAM estimates best used in combination with other indicators” (Braun et al., 2010)

Classroom Observations
• Widely used to assess quality teaching practice
• Explanatory + Formative counterpart to VAM
   • Identify areas in need of improvement; inform PD
• Expensive if standardized (training, time)
• Error from complex rubrics, human judgment
• Bias/Subjectivity in construct definition/emphasis
• Lower reliability than traditional instruments (live or video)
• Weak correlations with other indicators including student achievement (Kane et al.
2010)

Classroom Observation: Constructs

Singapore’s Competencies
• Nurturing the Whole Child
   • Core Competency!
   • Share values with student
   • Take action to develop the student
   • Act consistently in the student’s interest
• Cultivating Knowledge
   • Subject Mastery
   • Analytical Thinking
   • Initiative
   • Teaching Creatively
• Working with Others
   • Partnering with Parents
   • Working in Teams
• Winning Hearts and Minds
   • Understanding the Environment
   • Developing Others
• Knowing Self and Others
   • Emotional Intelligence

Danielson Framework
• Planning and Preparation
   • Demonstrating Knowledge of Content and Pedagogy
   • Demonstrating Knowledge of Students
   • Selecting Instructional Goals
   • Demonstrating Knowledge of Resources
   • Designing Coherent Instruction
   • Assessing Student Learning
• Classroom Environment
   • Creating an Environment of Respect and Rapport
   • Establishing a Culture for Learning
   • Managing Classroom Procedures
   • Managing Student Behavior
   • Organizing Physical Space
• Instruction
   • Communicating Clearly and Accurately
   • Using Questioning and Discussion Techniques
   • Engaging Students in Learning
   • Providing Feedback to Students
   • Demonstrating Flexibility and Responsiveness
• Professional Responsibilities

Classroom Observation: Reliability
(Source: Bill and Melinda Gates Foundation, 2011)

Teacher Surveys
• Common method for collecting data on teacher (classroom) practice on a large scale
• Good coverage; low cost; low burden for teachers
• Adequate reliability
• Questionable validity
   • Error from inconsistency in interpretation of questions
   • …and social desirability
   • e.g. Emphasis on higher order thinking
• Weak correlations with other indicators including student achievement (Kane et al.
2010)

Student Surveys
• Increasingly popular for teacher evaluation
   • Coverage; cost; perceived validity
   • Adequate reliability aggregated by classroom
   • Correlated w/student achievement as much or more than teacher surveys (Kane et al., 2010)
• Additional information at the student level
   • Variance reflects differentiated teacher practice with different students (Martínez, 2012; Muthen, 1995)
   • Correlated w/achievement also within classrooms

Student Surveys: Remaining Issues
• Memory errors, inconsistency in interpretation
   • Particularly with younger children
• Concerns for high stakes teacher evaluation
   • Social desirability, pressure, other validity issues
• Cost Issues
• Unit of measurement, construct invariance
   • “My teacher asks me to read books”
   • vs. “Our teacher asks us to read books”

Student Surveys: Correlation to Achievement

Teacher Portfolios
• Compile evidence of teacher practice over a period of time
What’s in a Teacher Portfolio?
• Classroom Artifacts (lesson plans, assignments, samples of student work, etc.)
• Teacher Reflections (on practice reflected in artifacts)
• Student/Teacher Survey/Log (classroom practice, attitudes, perceptions)
vs. Surveys
+ Richer, better validity, PD value
- Higher cost, rater/rubric error, burden on teachers
vs. Observations
• Debate taking form

Portfolios vs. Observations
1. Cost to Collect & Score?
   • Similar or lower than observations
2. Score Reliability?
   • Similar to observations/video (see MET study)
   • May need to re-examine ideas of “acceptable reliability”
   • Better coverage, validity for some aspects of practice
   • Interesting possibilities with newer technologies
3. More burdensome for teachers?
   • Yes, much more so (20-30+ hour effort)
   • But, with burden comes Professional Development
• So far used mostly for “National Certification”
• Growing interest?: edTPA, PACT
• May be feasible as integral to an evaluation/PD cycle

Teacher Evaluation and Multiple Measures (Validity)

Validity
• How do we know we are doing a good job of evaluating teachers?
• Are our inferences and decisions valid?
“An integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of assessment.”
Messick (1989)

• “In educational settings, a decision or characterization that will have major impact [on a student] should not be made on the basis of a single score. Other relevant information should be taken into account if it will enhance the overall validity of the decision.”
Standards for Educational and Psychological Testing, Standard 13.7 (AERA, APA, & NCME, 1999)

What to Evaluate? All of the above
• New Mexico’s teacher evaluation system should utilize a matrix in which multiple components of a teacher’s evaluation combine to determine a teacher’s overall effectiveness rating.
• Effectiveness levels should only be assigned after careful consideration of multiple measures, including student achievement data, observations, and other proven measures [emphasis added]
New Mexico Effective Teaching Task Force

Multiple Measures: Logic and Assumptions
• General Assumption:
   • Combining multiple measures leads to better informed (more valid) decisions about teachers and teaching
1. Accuracy
   • Teachers classified into finer, more stable categories (De Pascale, 2012; Steele et al.
2010)
2. Validity
   • More complete picture of performance (Goe, 2011)
   • Less incentive for test preparation (Steele et al., 2010)
3. Feedback
   • Information to help teachers adjust and improve instruction and classroom strategies (Duncan, 2011)
4. Relevance
   • Greater confidence in results of evaluation among the public and stakeholders (Glazerman et al., 2011)

Combining Multiple Measures: Conceptual Issues
• When/where do these assumptions hold, and in what situations? Depends on several factors:
   • Assumptions about nature of constructs involved
   • Intended inferences and uses
• What is meant exactly by combining (Brookhart, 2009)
   • Not self-explanatory. A variety of models is available
   • Substantial literature in psychology, personnel evaluation, and student assessment
   • Only starting to be applied to Teacher Evaluation

Models for Combining Multiple Measures
Model | Description
Conjunctive | Must meet criteria (pass) for all measures
Disjunctive | Must meet criteria (pass) for k measures
Compensatory | Based on composite measures. High level in one measure compensates for low levels in others
Hybrid | e.g. Compensatory-conjunctive, Sequential
(Mehrens, 1989; Chester, 2003)

Combination Model 0: Do not Combine!
• May consider not combining the indicators!
   • Summary indices not essential to formative or summative evaluation
• Key measures may be collected, maintained, and reported separately
   • All used to illuminate a side of the picture (improve teaching, communication, citizenship, achievement?)
   • And used jointly as needed where summative judgments are sought (Mehrens, 1989; Brookhart, 2009)
• Making combined use of multiple indicators ≠ Combining multiple indicators

Combination Model 1: Conjunctive, Disjunctive
[Diagram: Classroom Observation, Portfolio, Student Survey, Teacher Test, Other Indicators, Student Achievement]
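
The conjunctive, disjunctive, and compensatory rules in the models table can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the measure names, the 1-5 score scale, the cutoffs, and the weights are all hypothetical and do not come from any actual evaluation system.

```python
from typing import Dict

Scores = Dict[str, float]


def conjunctive(scores: Scores, cutoffs: Scores) -> bool:
    """Pass only if every measure meets its cutoff."""
    return all(scores[m] >= cutoffs[m] for m in cutoffs)


def disjunctive(scores: Scores, cutoffs: Scores, k: int = 1) -> bool:
    """Pass if at least k measures meet their cutoffs."""
    passed = sum(scores[m] >= cutoffs[m] for m in cutoffs)
    return passed >= k


def compensatory(scores: Scores, weights: Scores, composite_cutoff: float) -> bool:
    """Pass if the weighted composite meets a single cutoff."""
    composite = sum(weights[m] * scores[m] for m in weights)
    return composite >= composite_cutoff


# Hypothetical teacher scored on three measures, each on a 1-5 scale.
scores = {"observation": 4.0, "portfolio": 2.0, "student_survey": 3.0}
# Hypothetical cutoffs and weights, chosen only for illustration.
cutoffs = {"observation": 3.0, "portfolio": 3.0, "student_survey": 3.0}
weights = {"observation": 0.5, "portfolio": 0.25, "student_survey": 0.25}

print(conjunctive(scores, cutoffs))        # False: the portfolio score is below its cutoff
print(disjunctive(scores, cutoffs, k=2))   # True: two of the three measures pass
print(compensatory(scores, weights, 3.0))  # True: the composite is 3.25
```

The same teacher fails under one rule and passes under the other two, which shows that the choice of rule, cutoffs, and weights is itself a judgment that shapes the result.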

Decision Rules and Reliability
• Error in Multiple Measures may cancel out or compound
• Assume Teacher A True Scores in T1, T2 are passes
• Because of unreliability, the probability of pass Observed Scores is estimated at 0.80 and 0.90, respectively
   • Probability of pass scores in both tests (Conjunctive Model), assuming independent errors: 0.8 × 0.9 = 0.72
   • Probability of pass scores in either test (Disjunctive Model), assuming independent errors: 1 − (0.2 × 0.1) = 0.98
(see e.g. Cronbach, Linn, Brennan, & Haertel, 1997; Douglas and Mislevy, 2010)

Decision Rules and Reliability
• Simplistic scenario. Complex rules often used in practice according to policy context and goals
   • E.g.: Teachers must pass Measure 1 or 2, AND not rank lowest in Measure 3 (e.g. New Haven)
• Choice of decision rule more important for accuracy and validity than the reliability of the component measures chosen (Chester, 2003)
• Importantly: Models are not “objective”; each involves judgment
   • Why satisfy k criteria, not k-1? Why those criteria?

Hybrid System: e.g. New Haven
• Synthesizes three component measures (each on 5-pt.
scale):
   • Teacher instructional practice
   • Teacher professional values
   • Student learning outcomes

Combination Model 2 (Compensatory): Principal Components / Factor Analysis
[Diagram: Portfolio, Classroom Observation, Student/Parent Survey, Teacher Survey, Student Achievement, Other Measures; Global Construct]

Combination Model 3 (Compensatory): Optimal Weight (Achievement as Criterion)
[Diagram: Artifacts/Portfolio, Classroom Observation, Student/Parent Survey, Teacher Survey, Other Measures; Teacher Construct; Student Achievement. A second version of the diagram labels five paths with weights β]

MM Combination Model 4 (Compensatory): PC/FA: Student Achievement as Indicator
[Diagram: Artifacts/Portfolio, Classroom Observation, Student/Parent Survey, Teacher Survey, Student Achievement, Other Measures; Teacher Construct]

MM Combination Model 5 (Compensatory): SEM/Canonical Correlates
[Diagram: Artifacts/Portfolio, Classroom Observation, Student/Parent Survey, Teacher Survey, Other Measures; Teacher Construct; Student Outcomes: Student Measure #1, Student Measure #2, Other (e.g. noncognitive)]

MM Combination Model 6 (Darlington, 1970): Unmeasured Criterion, Theoretical Weights
[Diagram: Artifacts/Portfolio, Classroom Observation, Student/Parent Survey, Teacher Survey, Student Achievement, Other Measures; Unmeasured Teacher Construct]

Empirical vs.
Theoretical Weighting
• Model 6 is most likely scenario in practice
   • Policy assumptions/values (consensual) inform the system, alongside technical considerations
• It really is the only feasible scenario
   • Empirical weights cannot be derived
   • Ultimate criterion measure is NOT available
• Note model 3 assumes such measure is available
   • But does not give “correct” weight for criterion
   • Exposure to validity shrinkage (weight change over time)

Multiple Measures and Validity
• Models may lead to different inferences
• Little guidance available; so…
• LOCAL VALIDITY STUDIES NEEDED (lots of them)
• As with single measures, need to set up testable validation hypotheses (Kane, 2006)
• Whatever the construct: Teacher [euphemism]
   1. Describe intended inferences, uses, AND CONSEQUENCES
   2. Collect empirical evidence to support
• 2012, 2013 MET reports will be influential. May force field to broaden our lens and revise assumptions and expectations
• No getting around conducting local validation studies

What KINDS of EVIDENCE?
• All of them: Validity is a unitary notion
   • Theoretical support
   • Consistency and accuracy (Reliability)
   • Correlations, Internal structure
   • Predictive power
   • Consequences of use
• Validity becomes a rather empty academic topic if the consequences are not considered
   • Or if they differ markedly from expectation

What consequences?
• Intended and Unintended Effects
   • On teaching practice
   • On different student outcomes
   • On recruitment and retention
   • On motivation, competition, fraud
   • On perceptions of validity, fairness, utility
   • On dynamic of relationships with parents and community
• Etc.

Final Remarks. Teacher Evaluation: Why are we doing this again?
• Some good reasons
   • Make student achievement priority
   • Monitor & assess teacher performance
   • Develop a culture of accountability
   • and of reflection and improvement
   • Inform PD to improve teacher performance
• However
   • Multiple fallible indicators do not automatically yield better, less fallible inferences. But they always yield more complex ones
   • Using indicators in combination involves technical but also conceptual and policy assumptions

Final Remarks. Teacher Evaluation: Why are we doing this again?
• Because “the stakes are high, and the future of our children is at stake” (insert public official name here, circa 2012) we should proceed carefully and deliberately
   • Good measures take time to develop
   • Solid systems based on these measures take longer to test and implement
   • The consequences of implementing these systems are unknown and will take longer to assess
• Experience suggests moving too fast to implement may shortchange the system

Final Remarks. Teacher Evaluation: Why are we doing this again?
• Most important goal in my view is not only to avoid unfair decisions and negative unintended consequences (though the potential for both should give us pause)
• Greatest risk is missing an opportunity to enact sound teacher evaluation policy with great potential to positively impact educational practice and outcomes

Thank you
jfmtz@ucla.edu