COMBINING MULTIPLE INDICATORS OF TEACHER PERFORMANCE

Combining Multiple Measures of Teacher Practice and Performance: Technical and Conceptual Considerations for Teacher Evaluation

Evaluación Docente con Medidas Múltiples de Práctica y Desempeño: Consideraciones Técnicas y Conceptuales

José Felipe Martínez
University of California, Los Angeles, 2019B Moore Hall, Box 951521, Los Angeles, CA 90095-1521. jfmtz@ucla.edu

Abstract

Reform efforts in education systems in the United States and other countries place increasing emphasis on the promise of teacher evaluation systems to complement traditional school-level accountability systems in improving classroom instruction and student learning. This paper examines the key conceptual and methodological issues faced in measuring a construct as complex and multidimensional as teacher quality or effectiveness in the context of high stakes accountability policies. The common methods of assessing teacher practice and performance are reviewed, along with the appropriate methods for using these different indicators in combination for formative and summative evaluation. The focus here is on the validity of the inferences about teacher effectiveness that can be derived from these sources of information, as well as the implications and potential consequences for education policy and practice of implementing these systems in the field.

Keywords: Teacher Evaluation, Validity, Multiple Measures, Classroom Observation, Teacher Portfolios

Teacher Evaluation and Validity: Conceptual and Methodological Considerations in Multiple Indicator Systems

We fully understand that standardized tests don't capture all of the subtle qualities of successful teaching. That's why we call for multiple measures in evaluating teachers. In an ideal world, that data should also drive instruction and useful professional development.
Arne Duncan, U.S. Secretary of Education

Introduction

Teacher evaluation is an area of growing interest and relevance for local, state, and national educational systems around the world, reflecting mounting pressures to assess the performance of teachers and hold them more concretely accountable for the levels of academic achievement attained by their students. A variety of methods have been proposed and are being tested to assess key constructs and aspects of teacher performance. In addition to technical issues specific to each of these methods, the critical questions at the center of policy debates around teacher evaluation increasingly concern the appropriate ways of combining the variety of measures available for purposes of teacher performance appraisal. Unfortunately, growing attention from policymakers and the public is not necessarily being accompanied by a corresponding increase in awareness and understanding of the many complex technical issues involved in evaluating the work of teachers. Little specific guidance is available to researchers and policymakers to aid in addressing the variety of conceptual, technical, and policy issues that emerge in trying to answer this question. Much systematic work and discussion is therefore needed to address questions about the indicators to be collected, and the appropriate sources and uses of that information; the extent of uniqueness and overlap in the information provided by these indicators and sources; and the appropriate ways to use these indicators in combination to improve teacher performance.

In this paper I examine some of the critical conceptual and methodological issues faced by researchers and policymakers who try to measure teacher quality or effectiveness in the context of recent trends in teacher evaluation internationally.
I briefly discuss the most common methods used to assess teacher practice and performance, and focus mostly on the validity of the inferences about teacher effectiveness that can be derived from these sources of information, either alone or in combination, as well as the implications and potential consequences for education policy and practice.

Policy Context

Reform efforts in education systems in the United States and other countries increasingly emphasize the notion of performance review and accountability for individual teachers, as opposed to or in addition to the traditional focus on school-level accountability. Like prior waves of accountability-based reform, this one has its roots in perceptions about the (inadequate) performance of students in national or international assessments (e.g. NAEP, PISA, TIMSS). Feuer (2012) describes an interesting kind of reverse-Lake Wobegon effect taking hold worldwide, whereby states and countries consistently find reasons in the data to be pessimistic about the performance of their students relative to peer or competitor systems. Unlike reform targeted at the school level, however, these initiatives are also heavily informed by assumptions about and evidence of the importance of quality teaching and, more generally, effective teachers (or lack thereof) in explaining and potentially improving the outcomes observed. Additionally, they reflect a critique of a majority of existing teacher evaluation systems which, as many reports and authors have noted, became perfunctory rituals with no consequences or enforcement, and little formative or informative value for the teacher or the district (Loeb, Raudenbush, et al., 2010). Finally, they reflect a series of optimistic assumptions about the ability of the proposed revamped systems and procedures to reliably identify effective and ineffective teachers.
These rationales can be seen clearly underlying current policy efforts targeting teacher evaluation in large and medium size districts across the U.S. (e.g. New York, Los Angeles, Chicago, Tennessee). The same discourse and assumptions are increasingly common to reform efforts targeting teacher evaluation internationally, including for example national systems in Singapore, Chile, and Mexico, and nascent efforts in the United Kingdom and Australia.

Teacher Evaluation: Why, What, How

Why Evaluate. Educational theory, mounting empirical evidence, and common sense suggest that effective teachers can exert an important influence on the levels of achievement of their students (Baker et al., 2010; Rowe, 2003). Teacher evaluation systems can have different origins, motivations, and goals depending on local policy context, but typically seek information on aspects of teacher practice believed to be aligned with improved system outcomes, crucially student achievement, which is used for both formative and summative purposes. These include, among others, identifying struggling teachers for intervention to help them improve their teaching and classroom practices; identifying teachers who are persistent underperformers for remedial action, sanction, or dismissal; providing incentives to the best teachers; informing school practice and district policies on teacher preparation and professional development; and identifying effective teacher practices in order to develop models of effective instruction to scale up for implementation in classrooms across the system. Typically, teacher evaluation systems are conceived to support a combination of these intended uses and goals, in turn using the information collected formatively to guide teacher education and professional development, and summatively for decisions related to career advancement, remuneration, and retention.
Unfortunately, policy and political priorities can often be at odds with careful design and thorough consideration of long-term goals, uses, and consequences. The opening quote from Secretary Duncan, for example, reflects a decision to move forward with high stakes teacher evaluation, irrespective of whether we live in an ideal world where data from the evaluation is used to drive instruction and professional development. This effectively turns what should be the core formative component of any such system into an optional luxury that may or may not be available in the real world, at least initially.

What to Evaluate. While researchers, educators, and policymakers increasingly agree about the importance of developing sensible approaches to evaluation to support teacher development and accountability, reaching a consensus around the specific aspects of teachers’ professional practice that should be included in the evaluation is more challenging. Teaching and teacher practice are inherently complex, multidimensional constructs. Teaching involves a variety of processes and interactions that take place in the classroom and outside; some are substantive in nature, others relate to practical aspects of classroom work (daily routines, classroom management), and yet others pertain to psychological aspects of teacher-student interactions (e.g. motivation, respect, feedback). Teacher practice more broadly defined further includes a multitude of aspects of the work of a teacher outside the classroom, including among others communication with parents, administrators, and other teachers at the school, school citizenship, and contributions to the broader community. Thus, although the notion of assessing teacher effectiveness has simple intuitive appeal, in practice it involves selecting, defining, collecting information about, and making inferences involving dozens of complex component constructs (Peterson, 1987).
Terms like Teaching Quality, Teacher Practice, Teacher Effectiveness, or Teacher Performance are often treated as interchangeable, but they are more appropriately seen as closely related, with important areas of overlap but also of uniqueness. Evaluating teacher quality or effectiveness entails first developing an agreed-upon definition for each of these components. The notion of teacher competences (Reynolds, 1999) provides a particularly useful global heuristic for teacher quality that identifies four main components of competence: teacher knowledge (e.g. subject and pedagogical), skill (e.g. applied knowledge), disposition (e.g. attitudes, perceptions, beliefs), and practice (e.g. instruction, assessment, management). Notably, this definition excludes indicators like teacher education, credentials, experience (seniority), or contributions to student achievement or other non-cognitive outcomes. While indicators like these can be important and informative, and are often considered for formal teacher evaluation, they are seen here as correlates more than components of teacher competence.

The discussion above suggests first that the most appropriate definition of teacher quality or competence depends on the intended uses and context. It also suggests that, other things being equal, the richer the definition of the construct the better. A simple answer to the question of what constructs to evaluate when considering how best to appraise teacher performance could thus be “all of the above” or at least “as many of the above as is affordable”. To the extent some of them are excluded, the definition of teacher quality or competence is narrower, and so by extension is the evaluation.

How to Evaluate. Having decided that teacher evaluation ought to encompass as many of the relevant constructs as is feasible, attention turns to the methodological questions that arise in measuring each of them in practice.
A variety of methods can be used to capture each of the constructs that compose teacher quality or competence. Teacher knowledge and skills, either subject matter or pedagogical, are often assessed by means of standardized tests (e.g. the Praxis I, or MKT tests), through alternative open-ended performance assessments (e.g. the Praxis II assessment), or through vignettes of classroom practice (Stecher, Le, et al., 2006). Information about dispositions and beliefs is typically collected directly from teachers by means of surveys or interviews (Mayer, 1999). For constructs related to school citizenship and contributions to the broader community, surveys can also be used to collect information from teachers and other sources (e.g. parents, administrators).

Classroom Observation. Finally, a variety of methods can be used to develop indicators of classroom practice. Traditionally, direct observation has been the default method for studying and monitoring the work of teachers in the classroom. Observation (either live or through videotape) has considerable face validity as a method for seeing teaching as it happens in classrooms, and providing direct evidence for identifying areas in need of improvement to inform professional development (Pianta and Hamre, 2009). It remains a staple of teacher evaluation systems old and new as the central explanatory and formative complement to the summative evidence offered by measures based on summaries of student achievement. On the other hand, observation in classrooms presents significant challenges, starting with identifying and defining the many distinct but closely interconnected notions that collectively compose the broader construct of classroom practice. Table 1 shows the constructs targeted for observation in Singapore side by side with a typical list of constructs from an observation system based on the Danielson framework in the United States.
While the table shows some areas of overlap across countries, it also highlights the subjectivity and culture-specificity inherent to all definitions of good teaching. Thus, classroom observation in Singapore, a high performing country in international assessments whose educational system has been the subject of much study and praise in recent years, emphasizes aspects of classroom life (e.g. nurturing the whole child, winning hearts and minds) that get little attention in American systems, which typically focus on more technical aspects of classroom practice (e.g. instruction, lesson planning).

----------------Insert Table 1 about here -----------------

Moreover, as a method for large-scale standardized data collection, classroom observation faces challenges in understanding, quantifying, and monitoring the extent of human rating error in the resulting measures. It requires specialized measurement techniques, and a large investment of resources for training, deploying, and monitoring the work of groups of professionals who serve as standardized observers across a state or district. Even with available resources, however, developing and maintaining reliable measures of classroom practice on a large scale remains a significant challenge. The disappointing results of a recent widely publicized study that tested some of the best known observation rubrics on a large scale are a sobering reminder of the extent of the challenge if these measures are meant to support inferences and decisions involving individual teachers (Kane et al., 2011).

Surveys. A variety of alternative methods have been used to collect information about classroom practice, which may offer substantive, logistic, or cost advantages compared to direct observation in classrooms.
In particular, teacher surveys offer a cost-efficient alternative that makes it possible to collect information about large numbers of practices and constructs with little added cost or burden to teachers. As with other surveys, researchers can construct teacher surveys so that the resulting aggregate indicators reflect key constructs of classroom practice of interest, and show adequate levels of reliability by common psychometric standards. On the other hand, teacher surveys have significant limitations in their ability to produce consistent and valid information about classroom processes. Surveys are prone to errors from lost or inaccurate memory, and from inconsistency in teacher interpretations of the content and focus of items. Moreover, they are subject to strong social desirability effects, particularly in a situation of relatively high stakes and subjective areas of practice. Thus, three teachers who report that they always emphasize higher order cognitive skills in their instruction could be over or underestimating the true frequencies; alternatively, they could each be reporting truthfully and accurately but mean different things by always, by emphasizing, or by higher order. Finally, teachers could knowingly under or over report the true frequency based on their perception of the desirability of the practice in question (Mayer, 1999). As a result, teacher surveys have been notoriously inconsistent as predictors of student outcomes (Bill and Melinda Gates Foundation, 2010).

Student surveys have gained in popularity recently as an alternative that can address some of the limitations faced with teacher surveys. In particular, student surveys can be used to create classroom aggregates that are as reliable as those obtained from teacher surveys, and more strongly predictive of student achievement (Kane et al., 2011; Martinez, 2012).
Because they are not as susceptible to social desirability, student surveys are also often seen as more valid for teacher evaluation purposes (Ferguson, 2010). Finally, student surveys provide additional useful information typically not available from teachers; specifically, within-classroom variance in student reports can be used to monitor differentiated or individualized instruction (Martinez, 2012; Muthen, 1995). At the same time, collecting information from students has potential drawbacks in the accuracy and consistency of the reports, particularly with young children, whereas with older children there may be concerns about potential bias and engagement. Student surveys also face methodological challenges in designing and constructing indicators of the right constructs at the right level (Schweig, 2012). Notably, a question that asks how often “My teacher asks me to read books in the classroom” may behave differently psychometrically than one that asks how often “Our teacher asks us to read books”. Finally, the use of student surveys for high stakes teacher evaluation has not been thoroughly tested and could create issues of validity and bias.

Portfolios. Teacher portfolios are another method gaining prominence as an alternative for collecting information about classroom practice. Teachers use portfolios to compile, annotate, and reflect on artifacts of classroom practice like lesson plans, assignments, and quizzes, over a period of time. Research suggests that teacher portfolios can be used to collect rich information about classroom practice with reliability comparable to classroom observations (Martinez, Borko, and Stecher, 2011). Moreover, because they require sustained effort and mental engagement from teachers, portfolios are seen as promising professional development mechanisms for monitoring and improving classroom practice (Shulman, 1997).
This is exemplified recently by the rapidly expanding use of EdTPA, a portfolio-based assessment system for pre-service teachers developed at Stanford and endorsed by 25 states in the United States (Teacher Performance Assessment Consortium, 2012). On the downside, portfolios require considerable resources for development, collection, and review, and are very burdensome for teachers if not embedded in the formal professional development cycle. Portfolios are also limited in capturing interactive or verbal aspects of instruction like on-the-fly questioning.

Value Added Models. Student achievement is not part of the competence framework outlined above, where it is at most a key correlate or byproduct, but it is central to the notion of effectiveness at the center of recent policy reforms involving teachers in school districts in the United States. This has led to the growing popularity of Value Added Models (VAMs) for estimating the contribution of individual teachers to student achievement, and to much debate in research and policy circles. A variety of critical concerns have been raised about VAM estimates, including their limited scope (Baker et al., 2010) and lack of explanatory or diagnostic value (Goe, Bell, and Little, 2011) for formative uses, and their instability (Schochet & Chiang, 2010) and ultimately non-causal nature on the summative side (Rubin, Stuart, and Zanutto, 2004). Taken together, these concerns have led to a widespread view that VAMs cannot be used alone to assess teachers, but only alongside other measures within a broader approach to evaluation (National Research Council, 2010).
Teacher Evaluation with Multiple Measures

The inherent limitations of each of the approaches listed above make it apparent that none of them is intrinsically preferable or more informative than the others; each has advantages and limitations and is most suitable for illuminating different important aspects of teacher quality or effectiveness. From this it follows that no single method can provide sufficient information to support a valid evaluation of teacher performance. Instead, the consensus among researchers and policy makers is that teacher evaluation must rely on a variety of indicators in order to be valid, comprehensive, and useful (Baker et al., 2010; National Research Council, 2010). The notion has long been established for the case of student assessments in the standards for the measurement profession:

In educational settings, a decision or characterization that will have major impact [on students] should not be made on the basis of a single score. Other relevant information should be taken into account if it will enhance the overall validity of the decision. (AERA, APA, NCME, 1999)

Indeed, in the United States growing numbers of districts and states are developing evaluation systems to support high-stakes inferences and decisions about teachers including hiring and tenure decisions, career advancement, and in some cases compensation.¹ Public debate around these systems has focused largely on their approach to estimating teacher contributions to student achievement (e.g. Value Added Indicators). In fact, the systems always rely on multiple measures (typically three or more) and the majority of a teacher’s rating often depends on indicators other than student achievement. These indicators may include information from classroom observations, principal reports, parent or student surveys, classroom artifacts, and official records, among others.²

¹ The list includes the three largest districts in the country (New York, Los Angeles, and Chicago), which collectively educate more than 2.5 million students and employ more than one hundred and fifty thousand teachers; additional prominent examples exist in Washington DC, Tennessee, Colorado, and Pittsburgh, among others.

² See e.g. Sartain, Stoelinga, et al. (2011) for Chicago; Strunk, Weinstein, Makkonen, and Furedi (2012) for Los Angeles; and Marsh, Springer, et al. (2011) for New York.

The emerging consensus around this multi-pronged approach is based on evidence and a series of assumptions about the consequences and benefits of evaluating teachers using multiple indicators. Thus, among other benefits, multiple measures are expected to provide a more complete picture of teacher performance (Goe, Holdheide, and Miller, 2011); allow teachers to be classified into finer, more stable categories (De Pascale, 2012; Steele et al., 2010); minimize incentives for test preparation (Steele et al., 2010); provide information to help teachers adjust and improve instruction and classroom strategies (Duncan, 2011); and generate greater confidence in the results of evaluation among all stakeholders, particularly teachers and the public (Glazerman et al., 2011). These and other assumptions have been investigated in the context of student assessment (see e.g. Henderson, Julian, and Yen, 2003; Schafer, 2003), but the extent to which they collectively hold in practical application for teacher evaluation is not well understood. This will depend on several factors, including the nature of the constructs involved, the intended inferences and uses of the measures and, importantly, the specific methods used to combine them (Brookhart, 2009). The last point in particular merits attention in light of the growing push for high stakes teacher evaluation in the United States and around the world.
Interestingly, the consensus around the use of multiple measures in this context is as wide as it is vague. There are of course a variety of approaches for combining measures for the purpose of evaluating teachers, and which one we choose can be of consequence for the properties of the resulting indicators, and the inferences we ultimately draw about teachers. At least four approaches have been proposed in the literature in psychology and student assessment for combining multiple measures that reflect different attributes of a broader target construct. These include conjunctive and disjunctive evaluation models, a variety of compensatory linear models, and hybrid approaches that combine more than one of these (Henderson, Julian, and Yen, 2003).

Conjunctive Models. This approach integrates information from multiple measures by specifying a decision rule that requires subjects to satisfy (pass) a minimum criterion level of performance in each of the measures involved. Thus, a conjunctive model may specify that beginning teachers must receive scores of 3 or higher in all (or a certain number of) observation scales, be rated satisfactory in all (or a certain number) of student survey scales, and not rank in the bottom 20% of the distribution of Value Added scores in order to advance to tenured status. Similar decision rules may be specified for other purposes (e.g. further career advancement, incentives) by varying the measures included and the target standard of performance for each of them. Figure 1 depicts a conjunctive model with multiple measures graphically, simplifying the assessment to a series of binary questions for each of the indicators. This model is most appropriate for minimizing false positives, or where adequate performance is necessary in separate domains or components of a broader construct.

Disjunctive Models.
These are also discrete decision rules for combining indicators, where the requirement is to satisfy the performance criteria for a minimum of q out of p indicator measures in the system (Mehrens, 1989). In the example above, teachers would be required to meet at least two of the three criteria in order to succeed at a certain evaluation step. Figure 1 can also represent disjunctive decision models, since these are also constructed by combining a series of discrete decisions. The disjunctive model is appropriate where we want to minimize false negatives, where repeated measures of the same construct are being collected, or where not all the domains of a broader construct are equally critical to achieve. A special case of the disjunctive model that requires passing any one of p measures has sometimes been called a complementary model (Brookhart, 2009) and is used to maximize the opportunity to pass in repeated testing situations.

----------------Insert Figure 1 about here -----------------

Compensatory Models. In these models, high performance on one or more measures is allowed to compensate for lower performance on others by creating composite or aggregate indicators that synthesize the information available in all the measures. In most cases it is additionally assumed that a single trait underlies the component measures, and a summary index is created as a simple or weighted average (Brookhart, 2009). Alternatively, compensatory models can be seen as linear combinations of measures that seek to maximize certain properties of the resulting indicators, or their relationships to other measures. These may include empirical correlations among indicators (canonical or factor analysis models), and relationships to measured criteria (Aamodt & Kimbrough, 1985), or unmeasured criteria (Darlington, 1970).
Figure 2 shows a graphical representation of a compensatory model where each of the measures is conceived as an indicator or item in a factor analysis. This model would allow us to investigate the pattern of interrelationship among the measures and estimate empirical factor loadings to create a combination of indicators that maximizes the validity of inferences about an underlying teacher construct.

----------------Insert Figure 2 about here -----------------

A different kind of compensatory model can be seen in Figure 3. This is a common model in current teacher accountability research and policy, where student achievement is treated not as an indicator, but as the criterion measure which the remaining indicators are expected to predict. This is the default model in the Measures of Effective Teaching study, for example (Kane et al., 2011), and it seeks to obtain Beta regression weights to create an optimal linear combination of indicators that maximizes the predictive power over the criterion or outcome: student achievement, either static or in the comparative growth metric of value added models. In defining standardized tests as the ultimate criterion, however, this model would seem to contradict the stated goal of evaluating teachers based on multiple measures to reflect the multidimensional nature of the teaching profession, and the many important aspects of the work of teachers in and out of schools and classrooms. Moreover, since the ultimate criterion is assumed known and available, the inclusion of other indicators is at best unnecessary (redundant in relation to the final criterion already at hand), and it could be counterproductive (by diluting the high quality information available with indirect indicators of the target criterion).
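As a rough illustration of how the conjunctive, disjunctive, and compensatory rules described above differ in practice, the following minimal Python sketch applies each rule to one hypothetical teacher. The class and function names, the 1-4 score scales, the cutoffs, the rescaling to a 0-1 range, and the weights are all assumptions made for this example; they are not taken from any actual evaluation system.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TeacherScores:
    """Hypothetical indicator scores for one teacher (scales are assumed)."""
    observation: float     # mean observation rubric score, assumed 1-4 scale
    student_survey: float  # mean student survey scale score, assumed 1-4 scale
    vam_percentile: float  # percentile rank of the value-added estimate, 0-100


# Illustrative cutoffs only; a real system would set these by policy.
MIN_OBSERVATION = 3.0
MIN_SURVEY = 3.0
VAM_BOTTOM_PERCENT = 20.0


def criteria_met(t: TeacherScores) -> List[bool]:
    """One pass/fail judgment per indicator."""
    return [
        t.observation >= MIN_OBSERVATION,
        t.student_survey >= MIN_SURVEY,
        t.vam_percentile > VAM_BOTTOM_PERCENT,
    ]


def conjunctive(t: TeacherScores) -> bool:
    """Pass only if every criterion is met."""
    return all(criteria_met(t))


def disjunctive(t: TeacherScores, q: int = 2) -> bool:
    """Pass if at least q of the p criteria are met."""
    return sum(criteria_met(t)) >= q


def composite(t: TeacherScores, weights: Dict[str, float]) -> float:
    """Weighted average of the indicators after rescaling each to 0-1."""
    rescaled = {
        "observation": (t.observation - 1.0) / 3.0,
        "student_survey": (t.student_survey - 1.0) / 3.0,
        "vam_percentile": t.vam_percentile / 100.0,
    }
    total_weight = sum(weights.values())
    return sum(weights[k] * rescaled[k] for k in rescaled) / total_weight


def compensatory(t: TeacherScores, weights: Dict[str, float], cutoff: float) -> bool:
    """Pass if the composite reaches the cutoff; strengths offset weaknesses."""
    return composite(t, weights) >= cutoff


# Assumed weights, chosen by judgment rather than estimated from data.
WEIGHTS = {"observation": 0.4, "student_survey": 0.2, "vam_percentile": 0.4}

teacher = TeacherScores(observation=3.5, student_survey=2.5, vam_percentile=60.0)
print("conjunctive: ", conjunctive(teacher))
print("disjunctive: ", disjunctive(teacher, q=2))
print("compensatory:", compensatory(teacher, WEIGHTS, cutoff=0.6))
```

In this sketch the same teacher fails under the conjunctive rule (the student survey score is below its cutoff) but passes under the two-of-three disjunctive rule and under the compensatory rule, where the stronger observation and value-added scores offset the weaker survey score.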
----------------Insert Figure 3 about here -----------------

Finally, Figure 4 presents a model with an unmeasured criterion and theoretical weights relating it to the various indicators of teacher performance (Darlington, 1970). This model shares characteristics of those in Figures 2 and 3. Like the optimal weight model, it sets up a regression, with beta weights specifying the relationships between each indicator and a criterion. As in the factor analysis model, the criterion is an unmeasured underlying construct that can only be inferred from the measured indicators, including student achievement. The difference in the model described by Darlington is the approach followed to determine the most appropriate weight to give each measure. This is an increasingly important question for policymakers, and one that this view suggests cannot be answered from a strictly technical and scientific standpoint; the appropriate weights must be assigned theoretically, that is, from a consensus among key stakeholders about the goals and priorities of an evaluation system, and the value placed on each of its component parts. Of necessity, this is the model being implemented in districts around the country, as the alternative facing decision makers is to make critical decisions involving teacher career advancement or compensation using empirically determined weights that may fluctuate across districts, and within districts over time, and which would have to be presented to teacher unions and the public. Far from a limitation, then, theoretical weights are in fact best suited for use in high stakes teacher evaluation on a large scale (Darlington, 1970).

----------------Insert Figure 4 about here ----------------

Hybrid Models. The three general approaches just described can be integrated into hybrid evaluation models if desired.
For example, drawing on the examples listed before, a hybrid conjunctive-disjunctive model could specify that all teachers must meet the student achievement criterion and at least one of the other two criteria (classroom observation and student survey). A conjunctive-compensatory model could instead require teachers to satisfy a set of criteria, where one of the criteria is a linear composite of three component measures from teacher and student surveys.

Formative Model. Importantly, an alternative approach that is often overlooked rests on the idea that multiple measures should not necessarily be combined into a single score or decision, but should instead be used alongside one another to provide richer and more useful information for formative evaluation (Schmidt & Kaplan, 1971). As Mehrens (1989) has argued, collecting, maintaining, reporting, and using the measures separately seems like the natural choice for teacher evaluation where the key purposes center on formative processes and professional development for improving teaching practice. In a formative model, each measure illuminates and guides improvement efforts related to the aspect(s) of teacher practice and performance about which it is most directly informative. Finally, if desired, the component measures are still available to be combined for summative purposes according to any of the models above.

Multiple Measures and Reliability. The conjunctive and disjunctive models are intuitively appealing and in principle can be implemented without the use of advanced statistical techniques. However, the reliability of the resulting indicators and inferences under each of these approaches can vary substantially and should be examined closely to ensure the system offers sufficiently accurate information for the intended purposes.
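Because these decision rules involve only comparisons against cut scores, they can be written down directly. The Python sketch below expresses the conjunctive, disjunctive, and hybrid conjunctive-disjunctive rules described above; the measure names, scores, and cut scores are hypothetical and chosen only for illustration.

```python
def passes(scores, cuts):
    """Map each measure to True or False depending on whether its cut score is met."""
    return {name: scores[name] >= cuts[name] for name in cuts}

def conjunctive(scores, cuts):
    """The teacher must meet the cut score on every measure."""
    return all(passes(scores, cuts).values())

def disjunctive(scores, cuts):
    """The teacher must meet the cut score on at least one measure."""
    return any(passes(scores, cuts).values())

def hybrid(scores, cuts, required, alternatives):
    """The teacher must meet the required measure and at least one alternative."""
    met = passes(scores, cuts)
    return met[required] and any(met[name] for name in alternatives)

# Hypothetical cut scores and one hypothetical teacher's observed scores.
cuts = {"achievement": 50, "observation": 3.0, "survey": 3.5}
teacher = {"achievement": 55, "observation": 2.5, "survey": 4.0}

conj = conjunctive(teacher, cuts)
disj = disjunctive(teacher, cuts)
hyb = hybrid(teacher, cuts, "achievement", ["observation", "survey"])
```

In this illustration the hypothetical teacher fails the conjunctive rule, because the observation score falls below its cut, but passes both the disjunctive and the hybrid rules. The simplicity of the rules says nothing about the accuracy of the resulting classifications, which depends on the measurement error in each observed score.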
Specifically, despite the common assumption that using multiple measures will improve the reliability of inferences, it can be shown that under the conjunctive model the reliability of a decision is limited by that of the least reliable of the component measures (Chester, 2003). Assume, for example, that Teacher A's true scores are sufficient to meet the desired criteria on measures M1 and M2; that is, the teacher should pass both measures. However, because the measures are not perfectly reliable, Teacher A could be erroneously classified as not passing either measure. If we know the true scores for Teacher A, we can estimate the probability of misclassification directly from the standard error of measurement for each measure. For Teacher A these hypothetical probabilities are estimated at 0.25 and 0.15 for M1 and M2, respectively; that is, the probability of correct classification is 0.75 and 0.85. From this it follows that, assuming the measurement errors on the two measures are independent, the probability of passing both measures in practice (based on the observed, not true, scores) under a conjunctive model is 0.75 × 0.85 ≈ 0.64, whereas the probability of seeing at least one pass under a disjunctive model is 1 - (0.25 × 0.15) ≈ 0.96. While Douglas and Mislevy (2010) suggest that the differences may not be quite as dramatic under compensatory models, the example underscores the importance of understanding errors of measurement in the context of multiple-measure evaluation, particularly where complex or hybrid decision rules are involved (Douglas & Mislevy, 2010; Cronbach, Linn, Brennan, & Haertel, 1997).

Multiple Measures and Validity. As with reliability, we are only in the early stages of validity research under the various approaches for combining the growing variety of measures of teacher performance available.
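Returning briefly to the Teacher A example in the reliability discussion, the arithmetic can be reproduced with a short Python sketch. It assumes that measurement errors are independent across measures, which is what multiplying the probabilities requires; the function names are hypothetical.

```python
from math import prod

def prob_pass_all(p_correct):
    """Conjunctive rule: probability that a truly passing teacher is observed to pass every measure."""
    return prod(p_correct)

def prob_pass_any(p_correct):
    """Disjunctive rule: probability that the same teacher is observed to pass at least one measure."""
    return 1 - prod(1 - p for p in p_correct)

# Probabilities of correct classification for Teacher A on M1 and M2.
p_correct = [0.75, 0.85]

conjunctive_p = prob_pass_all(p_correct)  # 0.6375, about 0.64
disjunctive_p = prob_pass_any(p_correct)  # 0.9625, about 0.96
```

Adding further imperfectly reliable measures to the list lowers the conjunctive probability and raises the disjunctive one, so the choice of decision rule matters more as the number of measures grows.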
As these systems start to come online and are used in the field, more extensive empirical validity research is urgently needed to investigate the assumptions, implications, requirements, and potential consequences of using each of the models for teacher performance appraisal purposes (Brookhart & Loadman, 1992).

As discussed above, how to combine multiple measures of teacher performance is only in part a technical issue. Teacher evaluation systems must first make explicit their definition of effectiveness and the value they place on each measure, based on theoretical or empirical support and in light of the local context, program goals, and priorities. In considering these multiple elements and decisions, it is useful to conceive of them as components of a validity argument for specific inferences to be derived from the indicators for specific purposes (Kane, 2006). As with any single measure, the validity framework applied to multiple measures highlights the need to rely on assumptions about the nature of the theoretical construct being measured (i.e., teacher quality or effectiveness) and the ways in which it may be operationalized in practice. It also makes clear that different uses will lead to different validity arguments and require different sources of evidence and support. In particular, inferences and uses that carry serious consequences require the greatest extent of theoretical and empirical support. As outlined by Kane (2006), validity is unitary and purpose dependent; validation entails outlining an interpretive argument for the intended inferences, uses, and consequences, and collecting evidence to support this argument.
This evidence may include a variety of sources of theoretical support for the intended constructs and conceptual framework, as well as empirical evidence of consistency and accuracy (reliability), expected patterns of inter-correlation and internal structure for the measures, and predictive power over intended criterion measures, among others. Importantly, the framework also requires explicit consideration of the planned consequences and outcomes that are expected to derive from the proposed interpretation and use of the measures. Clearly, evidence of intended and potentially unintended consequences is also the most important consideration from the policy perspective; validity becomes a rather academic topic if the consequences of use are not considered.

Conclusion

A new wave of reform efforts emphasizing teacher-focused accountability is taking hold in education systems around the world. The possibility of replacing current cursory and hollow evaluation systems with more meaningful efforts to assess and monitor teacher practice and performance should be viewed with great interest. The new multiple-measure systems have the potential to become engines driving a change towards a culture of reflection, improvement, and accountability among teachers, and to support formative evaluation systems that feed directly into professional development channels, leading to meaningful improvements in teaching and ultimately student learning. However, these systems face considerable conceptual and methodological challenges in design and implementation, and they evolve within complex policy environments. It was argued here that combining multiple fallible indicators does not automatically yield better, less fallible inferences, but it always yields more complex inferences.
Developers should carefully consider the assumptions and consequences (intended and unintended) of the various approaches for combining measures, or for using them in combination. Because the stakes are high for teachers and students, it behooves developers and decision makers to focus on the validity of intended uses and inferences, and to take the necessary steps to develop a system that supports these inferences. Ultimately, to avoid becoming the latest policy silver bullet to quickly take root and flame out (or dilute into meaninglessness), multi-measure systems of teacher evaluation should ensure sufficient theoretical and empirical support throughout development and implementation. It is our job as researchers and methodologists to remind policymakers that good measures take time to develop, solid systems based on these measures take longer to test and implement, and the consequences of specific uses of these systems are largely unknown and will take longer still to assess. The greatest risk rests not in the potential for unfair decisions involving individual teachers, although the potential for these should give policymakers pause; instead, the risk is missing a critical opportunity to enact sound policy with great potential to positively impact educational practice and outcomes.

REFERENCES

Aamodt, M. G., & Kimbrough, W. W. (1985). Comparison of four methods for weighting multiple predictors. Educational and Psychological Measurement, 45, 477-482.
AERA, APA, & NCME. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
Baker, E. L., Barton, P. E., Darling-Hammond, L., Haertel, E., Ladd, H. F., Linn, R. L., Ravitch, D., et al. (2010). Problems with the use of student test scores to evaluate teachers. Washington, DC: Economic Policy Institute.
Bill & Melinda Gates Foundation. (2010). Teachers' perceptions and the MET project.
Retrieved from http://www.metproject.org/downloads/Teacher_Perceptions_092110.pdf
Brookhart, S. (2009). The many meanings of multiple measures. Educational Leadership, 67(3).
Brookhart, S. M., & Loadman, W. E. (1992). School-university collaboration: Across cultures. Teaching Education, 4(2), 53-68.
Chester, M. D. (2003). Multiple measures and high-stakes decisions: A framework for combining measures. Educational Measurement: Issues and Practice, 22, 32-41.
Cronbach, L. J., Linn, R. L., Brennan, R. L., & Haertel, E. H. (1997). Generalizability analysis for performance assessments of student achievement or school effectiveness. Educational and Psychological Measurement, 57, 373-399.
Darlington, R. B. (1970). Some techniques for maximizing a test's validity when the criterion variable is unobserved. Journal of Educational Measurement, 7, 1-14.
De Pascale, C. (2012). Managing multiple measures. Principal, 91(5).
Duncan, A. (2011). Duncan tells teachers: Change is hard. Homeroom. Retrieved December 09, 2012 from http://www.ed.gov/blog/2012/08/duncan-tells-teachers-change-is-hard/
Douglas, K. M., & Mislevy, R. J. (2010). Estimating classification accuracy for complex decision rules based on multiple scores. Journal of Educational and Behavioral Statistics, 35, 1-27.
Ferguson, R. (2010, October 14). Student perceptions of teaching effectiveness. Retrieved from http://www.gse.harvard.edu/ncte/news/Using_Student_Perceptions_Ferguson.pdf
Feuer, M. (2012). No country left behind: Rhetoric and reality of international large-scale assessment. Princeton, NJ: Educational Testing Service.
Glazerman, S., Goldhaber, D., Loeb, S., Raudenbush, S., Staiger, D. O., & Whitehurst, G. J. (2011). Passing muster: Evaluating evaluation systems. Washington, DC: Brown Center on Education Policy at Brookings.
Goe, L., Bell, C., & Little, O. (2008). Approaches to evaluating teacher effectiveness: A research synthesis.
Washington, DC: National Comprehensive Center for Teacher Quality.
Goe, L., Holdheide, L., & Miller, T. (2011). A practical guide to designing comprehensive teacher evaluation systems. Washington, DC: National Comprehensive Center for Teacher Quality.
Henderson-Montero, D., Julian, M., & Yen, W. (2003). Multiple perspectives on multiple measures. Educational Measurement: Issues and Practice, 22(2).
Kane, M. T. (2006). Validation. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 17-64). Westport, CT: American Council on Education and Praeger Publishers.
Kane, T. J., Staiger, D. O., McCaffrey, D., Cantrell, S., Archer, J., Buhayar, S., et al. (2012). Gathering feedback for teaching: Combining high-quality observations with student surveys and achievement gains. Seattle, WA: Bill & Melinda Gates Foundation.
Glazerman, S., Goldhaber, D., Loeb, S., Raudenbush, S., Staiger, D. O., & Whitehurst, G. J. (2010). Evaluating teachers: The important role of value-added. Washington, DC: Brown Center on Education Policy at Brookings.
Marsh, J. A., Springer, M. G., McCaffrey, D. F., Yuan, K., Epstein, S., Koppich, J., Kalra, N., DiMartino, C., & Peng, A. X. (2011). A big apple for educators: New York City's experiment with schoolwide performance bonuses. Santa Monica, CA: RAND.
Martínez, J. F. (2012). Consequences of omitting the classroom in multilevel models of schooling: An illustration using the effects of opportunity to learn on reading achievement. School Effectiveness and School Improvement. DOI: 10.1080/09243453.2012.678864
Martínez, J. F., Borko, H., & Stecher, B. (2011). Measuring instructional practices in middle school science using classroom artifacts. Journal of Research in Science Teaching, 41(1), 38-67. DOI: 10.1002/tea.20447
Mayer, D. (1999). Measuring instructional practice: Can policy makers trust survey data? Educational Evaluation and Policy Analysis, 21(1), 29-45.
Mehrens, W. (1989). Combining evaluation data from multiple sources. In Millman & L.
Darling-Hammond (Eds.), The new handbook of teacher evaluation: Assessment of elementary and secondary school teachers (pp. 322-336). Newbury Park, CA: Sage.
Muthén, B. O., Huang, L. C., Jo, B., Khoo, S. T., Goff, G. N., Novak, J., & Shih, J. (1995). Opportunity-to-learn effects on achievement: Analytical aspects. Educational Evaluation and Policy Analysis, 17, 371-403.
National Research Council. (2010). Getting value out of value-added. Braun, Chudowsky, & Koenig (Eds.). Washington, DC: National Academies Press.
Peterson, K. (1987). Teacher evaluation with multiple and variable lines of evidence. American Educational Research Journal, 24(2).
Pianta, R. C., & Hamre, B. K. (2009). Conceptualization, measurement, and improvement of classroom processes: Standardized observation can leverage capacity. Educational Researcher, 38, 109-119.
Reynolds, M. (1999). Standards and professional practice: The TTA and initial teacher training. British Journal of Educational Studies, 47(3), 247-260.
Rowe, K. (2003). The importance of teacher quality as a key determinant of students' experiences and outcomes of schooling. Paper presented at the ACER Research Conference, Melbourne.
Rubin, D. B., Stuart, E. A., & Zanutto, E. L. (2004). A potential outcomes view of value-added assessment in education. Journal of Educational and Behavioral Statistics, 29(1), 103-116.
Sartain, L., Stoelinga, S., & Brown, E. (2011). Rethinking teacher evaluation in Chicago: Lessons from classroom observations, principal-teacher conferences, and district implementation. University of Chicago Consortium on School Research. Retrieved December 08, 2012 from http://ccsr.uchicago.edu/sites/default/files/publications/Teacher%20Eval%20Report%20FINAL.pdf
Schafer, W. D.
(2003). A state perspective on multiple measures in school accountability. Educational Measurement: Issues and Practice, 22, 27-31.
Schmidt, F. L., & Kaplan, L. B. (1971). Composite vs. multiple criteria: A review and resolution of the controversy. Personnel Psychology, 24, 419-434.
Schochet, P. Z., & Chiang, H. S. (2010). Error rates in measuring teacher and school performance based on student test score gains (NCEE 2010-4004). Washington, DC: National Center for Educational Evaluation and Regional Assistance, Institute of Education Sciences, United States Department of Education.
Schweigh, J. (2012). Cross-level measurement invariance in school and classroom environment variables. Manuscript in preparation.
Shulman, L. (1998). Teacher portfolios: A theoretical activity. In N. Lyons (Ed.), With portfolio in hand (pp. 23-37). New York: Teachers College Press.
Stecher, B., Le, V. N., Hamilton, L., Ryan, G., Robyn, A., & Lockwood, J. R. (2006). Using structured classroom vignettes to measure instructional practices in mathematics. Educational Evaluation and Policy Analysis, 28(2), 101-130.
Steele, J., Hamilton, L. S., & Stecher, B. M. (2010). Incorporating student performance measures into teacher evaluation systems. Santa Monica, CA: The RAND Corporation.
Strunk, K., Weinstein, T., Makkonen, R., & Furedi, D. (2012). Three lessons emerge from Los Angeles Unified School District's implementation of a new system for teacher evaluation, growth, and development. Phi Delta Kappan, 94(3). Retrieved December 08, 2012.
Teacher Performance Assessment Consortium. (2012). Retrieved December 08, 2012 from http://edtpa.aacte.org/

Table 1. Constructs in Singapore and Danielson Observation Protocols
Figure 1. Conjunctive and Disjunctive Decision Models
Figure 2.
Factor Analysis Model for Multiple Measures (Compensatory)
Figure 3. Optimal Prediction Weights with Achievement as Criterion (Compensatory)
Figure 4. Unmeasured Criterion with Theoretical Weights (Compensatory)