Combining Multiple Measures of Teacher Practice and Performance:
Technical and Conceptual Considerations for Teacher Evaluation
Evaluación Docente con Medidas Múltiples de Práctica y Desempeño:
Consideraciones Técnicas y Conceptuales
José Felipe Martínez
University of California, Los Angeles, 2019B Moore Hall, Box 951521, Los Angeles, CA 90095-1521. jfmtz@ucla.edu
Abstract
Reform efforts in education systems in the United States and other countries place increasing emphasis on the promise of teacher evaluation systems to complement traditional school-level accountability and to improve classroom instruction and student learning. This paper examines key conceptual and methodological issues faced in measuring a construct as complex and multidimensional as teacher quality or effectiveness in the context of high-stakes accountability policies. The common methods of assessing teacher practice and performance are reviewed, along with appropriate methods for using these different indicators in combination for formative and summative evaluation. The focus is on the validity of the inferences about teacher effectiveness that can be derived from these sources of information, as well as the implications and potential consequences for education policy and practice of implementing these systems in the field.
Keywords: Teacher Evaluation, Validity, Multiple Measures, Classroom Observation, Teacher
Portfolios
Teacher Evaluation and Validity: Conceptual and Methodological
Considerations in Multiple Indicator Systems
We fully understand that standardized tests don't capture all of the subtle qualities of
successful teaching. That's why we call for multiple measures in evaluating teachers. In an
ideal world, that data should also drive instruction and useful professional development.
Arne Duncan
U.S. Secretary of Education
Introduction
Teacher evaluation is an area of growing interest and relevance for local, state, and
national educational systems around the world, reflecting mounting pressures to assess the
performance of teachers, and hold them more concretely accountable for the levels of academic
achievement attained by their students. A variety of methods have been proposed and are being
tested to assess key constructs and aspects of teacher performance. In addition to technical issues
specific to each of these methods, the critical questions at the center of policy debates around
teacher evaluation increasingly concern the appropriate ways of combining the variety of measures available for teacher performance appraisal.
Unfortunately, growing attention from policymakers and the public is not necessarily being accompanied by a corresponding increase in awareness and understanding of the many complex technical issues involved in evaluating the work of teachers. Little specific guidance is available to researchers and policymakers to aid in addressing the variety of conceptual, technical, and policy issues that emerge in answering these questions. Much systematic work
and discussion is therefore needed to address questions about the indicators to be collected, and
the appropriate sources and uses of that information; the extent of uniqueness and overlap in the
information provided by these indicators and sources; and the appropriate ways to use these
indicators in combination to improve teacher performance.
In this paper I examine some of the critical conceptual and methodological issues faced
by researchers and policymakers who try to measure teacher quality or effectiveness in the
context of recent trends in teacher evaluation internationally. I briefly discuss the most common
methods used to assess teacher practice and performance, and focus mostly on the validity of the
inferences about teacher effectiveness that can be derived from these sources of information,
either alone or in combination, as well as the implications and potential consequences
for education policy and practice.
Policy Context
Reform efforts in education systems in the United States and other countries increasingly
emphasize the notion of performance review and accountability for individual teachers, as
opposed or in addition to the traditional focus on school-level accountability. Like prior waves of
accountability-based reform, this one has its roots in perceptions about the (inadequate)
performance of students in national or international assessments (e.g. NAEP, PISA, TIMSS).
Feuer (2012) describes an interesting kind of reverse-Lake Wobegon effect taking hold
worldwide whereby states and countries consistently find reasons in the data to be pessimistic
about the performance of their students, relative to peer or competitor systems.
Unlike reform targeted at the school level, however, these initiatives are also heavily
informed by assumptions about and evidence of the importance of quality teaching and, more
generally, effective teachers (or lack thereof) in explaining and potentially improving the
outcomes observed. Additionally, they reflect a critique of a majority of existing teacher
evaluation systems which, as many reports and authors have noted became perfunctory rituals
with no consequences or enforcement, and little formative or informative value for the teacher or
the district (Loeb, Raudenbush, et al 2010). Finally, they reflect a series of optimistic
assumptions about the ability of the proposed revamped system and procedures to reliably
identify effective and ineffective teachers. These rationales can be seen clearly underlying
current policy efforts targeting teacher evaluation in large and medium-sized districts across the
U.S. (e.g. New York, Los Angeles, Chicago, Tennessee). The same discourse and assumptions
are increasingly common to reform efforts targeting teacher evaluation internationally, including
for example national systems in Singapore, Chile, and Mexico, and nascent efforts in the United
Kingdom and Australia.
Teacher Evaluation: Why, What, How
Why Evaluate. Educational theory, mounting empirical evidence, and common sense
suggest that effective teachers can exert an important influence on the levels of achievement of
their students (Baker et al., 2010; Rowe, 2003). Teacher evaluation systems can have different origins, motivations, and goals depending on the local policy context, but typically seek information
on aspects of teacher practice believed to be aligned with improved system outcomes, crucially
student achievement, which is used for both formative and summative purposes. These include,
among others, identifying struggling teachers for intervention to help them improve their
teaching and classroom practices; identifying teachers who are persistent underperformers for
remedial action, sanction, or dismissal; providing incentives to the best teachers; informing
school practice and district policies on teacher preparation and professional development;
identifying effective teacher practices in order to develop models of effective instruction to scale
up for implementation in classrooms across the system.
Typically, teacher evaluation systems are conceived to support a combination of these
intended uses and goals, in turn using the information collected formatively to guide teacher
education and professional development, and summatively for decisions related to career
advancement, remuneration, and retention. Unfortunately, policy and political priorities can often
be at odds with careful design and thorough consideration of long-term goals, uses, and
consequences. The opening quote from Secretary Duncan, for example, reflects a decision to
move forward with high-stakes teacher evaluation, irrespective of whether we live in an ideal world where data from the evaluation are used to drive instruction and professional development. This effectively turns what should be the core formative component of any such system into an
optional luxury that may or may not be available in the real world, at least initially.
What to Evaluate.
While researchers, educators, and policymakers increasingly agree about the importance
of developing sensible approaches to evaluation to support teacher development and
accountability, reaching a consensus around the specific aspects teachers’ professional practice
that should be included in the evaluation is more challenging. Teaching and teacher practice are
inherently complex, multidimensional constructs. Teaching involves a variety of processes and
interactions that take place in the classroom and outside; some substantive in nature, others
related to practical aspects of classroom work (daily routines, classroom management), yet others
pertaining to psychological aspects of teacher-student interactions (e.g. motivation, respect,
feedback). Teacher practice more broadly defined further includes a multitude of aspects of the
work of a teacher outside the classroom, including among others communication with parents,
administrators, and other teachers at the school, school citizenship, and contributions to the
broader community. Thus, although the notion of assessing teacher effectiveness has simple
intuitive appeal, in practice it involves selecting, defining, collecting information about, and
making inferences involving dozens of complex component constructs (Peterson, 1987). Terms
like Teaching Quality, Teacher Practice, Teacher Effectiveness, or Teacher Performance are
often treated as interchangeable, but they are more appropriately seen as closely related, with
important areas of overlap, but also uniqueness.
Evaluating teacher quality or effectiveness entails first developing an agreed-upon definition
for each of these components. The notion of teacher competences (Reynolds, 1999) provides a
particularly useful global heuristic for teacher quality that identifies four main components of
competence: teacher knowledge (e.g. subject and pedagogical), skill (e.g. applied knowledge),
disposition (e.g. attitudes, perceptions, beliefs), and practice (e.g. instruction, assessment,
management). Notably, this definition excludes indicators like teacher education, credentials,
experience (seniority), or contributions to student achievement or other non-cognitive outcomes.
While indicators like these can be important and informative, and are often considered for formal
teacher evaluation, they are seen here as correlates more than components of teacher
competence.
The discussion above suggests first that the most appropriate definition of teacher quality
or competence depends on the intended uses and context. It also suggests, ultimately, that other
things being equal, the richer the definition of the construct the better. A simple answer to the
question of what constructs to evaluate when considering how best to appraise teacher
performance could thus be “all of the above” or at least “as many of the above as is affordable”.
To the extent that some of them are excluded, the definition of teacher quality or competence is narrower, and so, by extension, is the evaluation.
How to Evaluate.
Having decided that teacher evaluation ought to encompass as many of the relevant
constructs as is feasible, attention turns to the methodological questions that arise in measuring
each of them in practice. A variety of methods can be used to capture each of the constructs that
compose teacher quality or competence. Teacher knowledge and skill, whether subject-matter or pedagogical, is often assessed by means of standardized tests (e.g. the Praxis I or MKT tests), open-ended performance assessments (e.g. the Praxis II assessment), or vignettes of classroom practice (Stecher, Le, et al., 2006). Information about dispositions and beliefs is typically collected directly from teachers by means of surveys or interviews (Mayer, 1999). For constructs related to school citizenship and contributions to the broader community, surveys can also be used to collect information from teachers and from other sources (e.g. parents, administrators).
Classroom Observation. A variety of methods can be used to develop indicators
of classroom practice. Traditionally, direct observation has been the default method for studying
and monitoring the work of teachers in the classroom. Observation (either live or through
videotape) has considerable face validity as a method for seeing teaching as it happens in
classrooms, and providing direct evidence for identifying areas in need of improvement to
inform professional development (Pianta and Hamre, 2009). It remains a staple of teacher
evaluation systems old and new as the central explanatory and formative complement to the
summative evidence offered by measures based on summaries of student achievement. On the
other hand, observation in classrooms presents significant challenges, starting with identifying
and defining the many distinct but closely interconnected notions that collectively compose the
broader construct of classroom practice. Table 1 shows the constructs targeted for observation in
Singapore side by side with a typical list of constructs from an observation system based on the
Danielson framework in the United States. While the table shows some areas of overlap across
countries, it also highlights the subjectivity and culture-specificity inherent to all definitions of
good teaching. Thus, classroom observation in Singapore, a high performing country in
international assessments whose educational system has been the subject of much study and
praise in recent years, emphasizes aspects of classroom life (e.g. nurturing the whole child,
winning hearts and minds) that get little attention in American systems, which typically focus on
more technical aspects of classroom practice (e.g. instruction, lesson planning).
---------------- Insert Table 1 about here ----------------
Moreover, as a method for large-scale standardized data collection, classroom observation faces challenges in understanding, quantifying, and monitoring the extent of human rating error in the resulting measures. It requires specialized measurement techniques and a large investment of resources for training, deploying, and monitoring the work of the groups of professionals who serve as standardized observers across a state or district. Even with available resources, however, developing and maintaining reliable measures of classroom practice on a large scale remains a significant challenge. The disappointing results of a recent widely publicized study that tested some of the best known observation rubrics on a large scale are a
sobering reminder of the extent of the challenge if these measures are meant to support
inferences and decisions involving individual teachers (Kane et al., 2011).
Surveys. A variety of alternative methods have been used to collect information about
classroom practice, which may offer substantive, logistic, or cost advantages compared to direct
observation in classrooms. In particular, teacher surveys offer a cost-efficient alternative that makes it possible to collect information about large numbers of practices and constructs with little added cost or burden to teachers. As with other surveys, researchers can construct teacher surveys so
that the resulting aggregate indicators reflect key constructs of classroom practice of interest, and
show adequate levels of reliability by common psychometric standards. On the other hand,
teacher surveys have significant limitations in their ability to produce consistent and valid
information about classroom processes. Surveys are prone to errors from incomplete or inaccurate recall, and from inconsistency in how teachers interpret the content and focus of items. Moreover, they are subject to strong social desirability effects, particularly in relatively high-stakes situations and for subjective areas of practice. Thus, three teachers who report that they always emphasize higher-order cognitive skills in their instruction could be over- or underestimating the true frequencies; alternatively, they could each be reporting truthfully and accurately but mean different things by "always," by "emphasizing," or by "higher order." Finally, teachers could knowingly under- or over-report the true frequency based on their perception of the
desirability of the practice in question (Mayer, 1999). As a result, teacher surveys have been
notoriously inconsistent as predictors of student outcomes (Bill and Melinda Gates Foundation,
2010).
Student surveys have gained in popularity recently as an alternative that can address
some of the limitations faced with teacher surveys. In particular, student surveys can be used to
create classroom aggregates that are as reliable as those obtained from teacher surveys, and more
strongly predictive of student achievement (Kane et al., 2011; Martínez, 2012). Because they are
not as susceptible to social desirability, student surveys are also often seen as more valid for
teacher evaluation purposes (Ferguson, 2010). Finally, student surveys provide additional useful
information typically not available from teachers; specifically, within-classroom variance in student reports can be used to monitor differentiated or individualized instruction (Martínez, 2012; Muthén et al., 1995). At the same time, collecting information from students has potential
drawbacks in accuracy and consistency of the reports, particularly with young children, whereas
with older children there may be concerns about potential bias and engagement. Student surveys
also face methodological challenges in designing and constructing indicators of the right constructs
at the right level (Schweig, 2012). Notably, a question that asks how often “My teacher asks me
to read books in the classroom” may behave differently psychometrically than one that asks how
often “Our teacher asks us to read books”. Finally, the use of student surveys for high stakes
teacher evaluation has not been thoroughly tested and could create issues of validity and bias.
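To make the within-classroom variance idea concrete, the following is a minimal sketch (not from the original analysis) of how student survey responses might be aggregated by classroom; the data layout, column names, and response scale are hypothetical.

```python
# Illustrative sketch: aggregate hypothetical student survey responses by
# classroom, keeping both the classroom mean (the usual indicator) and the
# within-classroom spread (a rough signal of differentiated instruction).
import pandas as pd

responses = pd.DataFrame({
    "teacher_id": ["T1", "T1", "T1", "T1", "T2", "T2", "T2", "T2"],
    "item_score": [3, 4, 3, 4, 1, 4, 2, 4],   # e.g., a 1-4 agreement scale
})

summary = responses.groupby("teacher_id")["item_score"].agg(
    classroom_mean="mean",     # classroom-level aggregate used as the indicator
    within_class_sd="std",     # spread across students within the classroom
)
print(summary)
```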
Portfolios. Teacher portfolios are another method gaining prominence as an alternative
for collecting information about classroom practice. Teachers use portfolios to compile,
annotate, and reflect on artifacts of classroom practice like lesson plans, assignments, and
quizzes, over a period of time. Research suggests that teacher portfolios can be used to collect
rich information about classroom practice with reliability comparable to classroom observations
(Martinez, Borko, and Stecher, 2011). Moreover, because they require sustained effort and
mental engagement from teachers, portfolios are seen as promising mechanisms for professional
development and for monitoring and improving classroom practice (Shulman, 1997). This is
exemplified recently by the rapidly expanding use of EdTPA, a portfolio-based assessment
system for pre-service teachers developed at Stanford and endorsed by 25 states in the United
States (Teacher Performance Assessment Consortium, 2012). On the downside, portfolios
require considerable resources for development, collection, and review, and are very burdensome for teachers if not embedded in the formal professional development cycle. Portfolios are also limited in capturing interactive or verbal aspects of instruction, like on-the-fly questioning.
Value Added Models. While student achievement is not part of the competence framework outlined above, where it figures at most as a key correlate or byproduct, it is central to the notion of effectiveness at the center of recent policy reforms involving teachers in school districts in the
United States. This has led to the growing popularity of Value Added models (VAMs) for
estimating the contribution of individual teachers to student achievement, and to much debate in
research and policy circles. A variety of critical concerns have been raised about VAM estimates,
including their limited scope (Baker et.al., 2010) and lack of explanatory or diagnostic value
(Goe, Bell, and Little, 2011) for formative uses, and their instability (Schochet & Chiang, 2010)
and ultimately non-causal nature on the summative side (Rubin, Stuart, and Zanutto, 2004).
Taken together, these concerns have led to a widespread view that VAMs cannot be used alone
to assess teachers, but only alongside other measures within a broader approach to evaluation
(National Research Council, 2010).
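As a rough illustration only, the snippet below sketches the bare logic of a covariate-adjustment value-added estimate: regress current achievement on prior achievement plus teacher indicators and read the teacher coefficients as crude contrasts. Operational VAMs use far richer controls, multiple cohorts, and shrinkage; the variable names and toy data here are hypothetical.

```python
# Minimal value-added sketch (illustrative only, toy data).
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "post_score": [55, 61, 48, 70, 66, 52, 58, 64],
    "pre_score":  [50, 58, 47, 60, 62, 49, 55, 61],
    "teacher_id": ["A", "A", "A", "B", "B", "B", "C", "C"],
})

# Teacher coefficients are crude value-added contrasts relative to teacher A.
model = smf.ols("post_score ~ pre_score + C(teacher_id)", data=df).fit()
print(model.params.filter(like="teacher_id"))
```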
Teacher Evaluation with Multiple Measures
The inherent limitations of each of the approaches listed above make it apparent that none
of them is intrinsically preferable or more informative than the others; each has advantages and
limitations and is most suitable for illuminating different important aspects of teacher quality or
effectiveness. From this it follows that no single method can provide sufficient information to
support a valid evaluation of teacher performance. Instead, the consensus among researchers and
policy makers is that teacher evaluation must rely on a variety of indicators in order to be valid,
comprehensive, and useful (Baker et al., 2010; National Research Council, 2010). The notion has long been established for the case of student assessment in the standards for the measurement
profession:
In educational settings, a decision or characterization that will have major impact [on
students] should not be made on the basis of a single score. Other relevant information
should be taken into account if it will enhance the overall validity of the decision. (AERA,
APA, NCME, 1999)
Indeed, in the United States growing numbers of districts and states are developing
evaluation systems to support high-stakes inferences and decisions about teachers, including hiring
and tenure decisions, career advancement, and in some cases compensation.1 Public debate
around these systems has focused largely on their approach to estimate teacher contributions to
student achievement (e.g. Value Added Indicators). In fact, the systems always rely on multiple
measures (typically three or more) and the majority of a teacher’s rating often depends on
indicators other than student achievement. These indicators may include information from
classroom observations, principal reports, parent or student surveys, classroom artifacts, and
official records, among others.2
The emerging consensus around this multi-pronged approach is based on evidence and a
series of assumptions about the consequences and benefits of evaluating teachers using multiple
1 The list includes the three largest districts in the country (New York, Los Angeles, and Chicago), which collectively educate more than 2.5 million students and employ more than one hundred and fifty thousand teachers; additional prominent examples exist in Washington DC, Tennessee, Colorado, and Pittsburgh, among others.
2 See e.g. Sartain, Stoelinga, et al. (2011) for Chicago; Strunk, Weinstein, Makkonen, and Furedi (2012) for Los Angeles; and Marsh, Springer, et al. (2011) for New York.
indicators. Thus, among other benefits, multiple measures are expected to provide a more complete picture of teacher performance (Goe, Holdheide, and Miller, 2011); allow teachers to be classified into finer, more stable categories (De Pascale, 2012; Steele et al., 2010); minimize incentives for test preparation (Steele et al., 2010); provide information to help teachers adjust and improve instruction and classroom strategies (Duncan, 2011); and generate greater confidence in the results of evaluation among all stakeholders, particularly teachers and the public (Glazerman et al., 2011). These and other assumptions have been investigated in the
context of student assessment (see e.g. Henderson, Julian, and Yen, 2003; Schafer, 2003), but the
extent to which they collectively hold in practical application for teacher evaluation is not well
understood. This will depend on several factors, including the nature of the constructs involved,
the intended inferences and uses of the measures and, importantly, the specific methods used to
combine them (Brookhart, 2009).
The last point in particular merits attention in light of the growing push for high stakes
teacher evaluation in the United States and around the world. Interestingly, the consensus around
the use of multiple measures in this context is as wide as it is vague. There are of course a variety
of approaches for combining measures for the purpose of evaluating teachers, and which one we
choose can be of consequence for the properties of the resulting indicators, and the inferences we
ultimately draw about teachers. At least four approaches have been proposed in the literature in
psychology and student assessment for combining multiple measures that reflect different
attributes of a broader target construct. These include conjunctive and disjunctive evaluation
models, a variety of compensatory linear models, and hybrid approaches that combine more than
one of these (Henderson, Julian, and Yen, 2003).
Conjunctive Models. This approach integrates information from multiple measures by
specifying a decision rule that requires subjects to satisfy (pass) a minimum criterion level of
performance in each of the measures involved. Thus, a conjunctive model may specify that
beginning teachers must receive scores of 3 or higher in all (or a certain number of) observation
scales, be rated satisfactory in all (or a certain number) of student survey scales, and not rank in
the bottom 20% of the distribution of Value Added scores in order to advance to tenured status.
Similar decision rules may be specified for other purposes (e.g. further career advancement,
incentives) by varying the measures included and the target standard of performance for each of
them. Figure 1 depicts a conjunctive model with multiple measures graphically, simplifying the
assessment to a series of binary questions for each of the indicators. This model is most
appropriate for minimizing false positives, or where adequate performance is necessary in
separate domains or components of a broader construct.
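A minimal sketch of how such a conjunctive rule might be encoded is shown below; the cut scores mirror the hypothetical example in the text, and the function and argument names are invented for illustration.

```python
# Conjunctive rule sketch: the teacher advances only if every criterion is met.
def conjunctive_pass(observation_scores, survey_ratings, vam_percentile):
    meets_observation = all(score >= 3 for score in observation_scores)
    meets_surveys = all(r == "satisfactory" for r in survey_ratings)
    meets_vam = vam_percentile > 20        # i.e., not in the bottom 20%
    return meets_observation and meets_surveys and meets_vam

print(conjunctive_pass([3, 4, 3], ["satisfactory"] * 4, vam_percentile=35))  # True
print(conjunctive_pass([3, 4, 2], ["satisfactory"] * 4, vam_percentile=35))  # False
```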
Disjunctive Models. These are also discrete decision rules for combining indicators,
where the requirement is to satisfy the performance criteria for a minimum of q out of p indicator
measures in the system (Mehrens, 1989). In the example above, for instance, teachers would be required to meet at least two of the three criteria in order to succeed at a certain evaluation step. Figure 1 can also represent disjunctive decision models, since these are also constructed by combining a series of discrete decisions. The disjunctive model is appropriate where we want to
minimize false negatives, where repeated measures of the same construct are being collected, or
where not all the domains of a broader construct are equally critical to achieve. A special case of
the disjunctive model that requires passing any one of p measures has sometimes been called a
complementary model (Brookhart, 2009) and is used to maximize the opportunity to pass in
repeated testing situations.
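By contrast, a disjunctive (q-of-p) rule can be sketched as follows, again with hypothetical inputs: each element of the list is the outcome of one per-measure decision like those above.

```python
# Disjunctive rule sketch: pass if at least q of the p component criteria are met.
def disjunctive_pass(criteria_met, q=2):
    return sum(criteria_met) >= q

print(disjunctive_pass([True, True, False], q=2))   # True: 2 of 3 criteria met
print(disjunctive_pass([False, False, True], q=1))  # True: the "complementary" case, q = 1
```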
---------------- Insert Figure 1 about here ----------------
Compensatory Models. In these models, high performance on one or more measures is
allowed to compensate for lower performance on others by creating composite or aggregate
indicators that synthesize the information available in all the measures. In most cases it is additionally assumed that a single trait underlies the component measures, and a summary index is created as a simple or weighted average (Brookhart, 2009). Alternatively, compensatory models
can be seen as linear combinations of measures that seek to maximize certain properties of the
resulting indicators, or their relationships to other measures. These may include empirical
correlations among indicators (canonical or factor analysis models), and relationships to
measured criteria (Aamodt & Kimbrough, 1985), or unmeasured criteria (Darlington, 1970).
Figure 2 shows a graphical representation of a compensatory model where each of the measures is conceived as an indicator or item in a factor analysis. This model would make it possible to investigate the pattern of interrelationships among the measures and estimate empirical factor loadings to create a combination of indicators that maximizes the validity of inferences about an underlying teacher construct.
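The sketch below illustrates this kind of compensatory composite with a one-factor model, under the simplifying assumption of a single underlying trait; the teacher-by-indicator data are fabricated for illustration and no substantive loadings should be read into them.

```python
# Compensatory composite sketch: standardize indicators, fit a one-factor
# model, and use the factor scores as a single summary per teacher.
import numpy as np
from scipy.stats import zscore
from sklearn.decomposition import FactorAnalysis

# rows: teachers; columns: observation, student survey, portfolio, value-added
indicators = np.array([
    [3.2, 3.5, 2.9,  0.10],
    [2.1, 2.8, 2.5, -0.30],
    [3.8, 3.9, 3.4,  0.25],
    [2.9, 3.1, 3.0,  0.05],
    [2.5, 2.6, 2.7, -0.10],
    [3.5, 3.3, 3.1,  0.15],
])
z = zscore(indicators, axis=0)                 # put measures on a common scale

fa = FactorAnalysis(n_components=1, random_state=0).fit(z)
loadings = fa.components_.ravel()              # empirical weight for each measure
composite = fa.transform(z).ravel()            # compensatory summary per teacher
print(loadings)
print(composite)
```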
---------------- Insert Figure 2 about here ----------------
A different kind of compensatory model can be seen in Figure 3. This is a common
model in current teacher accountability research and policy, where student achievement is treated
not as an indicator, but as the criterion measure which the remaining indicators are expected to
predict. This is the default model in the Measures of Effective Teaching study, for example (Kane et al., 2011), and it seeks to obtain beta regression weights that create an optimal linear combination of indicators to maximize predictive power over the criterion or outcome: student achievement, either static or in the comparative growth metric of value-added models. In defining standardized tests as the ultimate criterion, however, this model would seem to contradict the stated goal of evaluating teachers with multiple measures that reflect the multidimensional nature of the teaching profession and the many important aspects of the work of teachers in and out of schools and classrooms. Moreover, since the ultimate criterion is assumed known and available, the inclusion of other indicators is at best unnecessary (redundant in relation to the final criterion already at hand), and it could be counterproductive (diluting the high-quality information available with indirect indicators of the target criterion).
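A bare-bones sketch of this criterion-prediction logic follows; the standardized indicator values and value-added scores are fabricated, and a real analysis would involve many more teachers and careful model specification.

```python
# Criterion-prediction sketch (cf. Figure 3): choose regression weights that
# best predict the achievement criterion from the remaining indicators.
import numpy as np

# columns: observation, student survey, portfolio (already standardized)
X = np.array([
    [ 0.5,  0.2,  0.1],
    [-0.8, -0.4, -0.6],
    [ 1.1,  0.9,  0.7],
    [ 0.0,  0.3, -0.2],
    [-0.6, -0.9, -0.3],
    [ 0.4,  0.1,  0.5],
])
value_added = np.array([0.10, -0.30, 0.25, 0.05, -0.20, 0.12])

X1 = np.column_stack([np.ones(len(X)), X])            # add an intercept column
betas, *_ = np.linalg.lstsq(X1, value_added, rcond=None)
composite = X1 @ betas                                # "optimal" linear composite
print(betas)       # intercept followed by the three indicator weights
print(composite)   # predicted criterion for each teacher
```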
---------------- Insert Figure 3 about here ----------------
Finally, Figure 4 presents a model with an unmeasured criterion and theoretical weights
relating it to the various indicators of teacher performance (Darlington, 1970). This model shares
characteristics of those in Figures 2 and 3; like the optimal weight model it sets up a regression,
with beta weights specifying the relationships between each indicator and a criterion. As in the
factor analysis model, the criterion is an unmeasured underlying construct that can only be
inferred from the measured indicators, including student achievement. The difference in the
model described by Darlington is the approach followed to determine the most appropriate
weight to give each measure. This is an increasingly important question for policymakers, and
one that, in this view, cannot be answered from a strictly technical and scientific standpoint; the appropriate weights must be assigned theoretically, that is, from a consensus among key
stakeholders about the goals and priorities of an evaluation system, and the value placed on each
of its component parts. Of necessity, this is the model being implemented in districts around the
country, as decision makers face the alternative of making critical decisions involving teacher
career advancement or compensation using empirically determined weights that may fluctuate
across districts, and within districts over time, and which will have to be presented to teacher
unions and the public. Far from a limitation, then, theoretical weights are in fact best suited for use in high-stakes teacher evaluation on a large scale (Darlington, 1970).
---------------- Insert Figure 4 about here ----------------
Hybrid Models. The three general approaches just described can be integrated into hybrid
evaluation models if desired. In the examples above, a hybrid conjunctive-disjunctive model could specify that all teachers must meet the student achievement criterion, and
at least one of the other two criteria (classroom observation and student survey). A conjunctive-
compensatory model could instead require teachers to satisfy a set of criteria, where one of the
criteria is a linear composite of three component measures from teacher and student surveys.
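For illustration, a hybrid conjunctive-compensatory rule of this second kind could be sketched as below; the weights, cut scores, and measure names are hypothetical.

```python
# Hybrid rule sketch: a hard requirement on the achievement criterion combined
# with a compensatory (weighted-average) composite of survey measures.
def hybrid_pass(vam_percentile, survey_scores, weights=(0.4, 0.3, 0.3),
                composite_cut=3.0):
    meets_achievement = vam_percentile > 20
    composite = sum(w * s for w, s in zip(weights, survey_scores))
    return meets_achievement and composite >= composite_cut

print(hybrid_pass(vam_percentile=45, survey_scores=[3.5, 2.8, 3.2]))  # True
print(hybrid_pass(vam_percentile=15, survey_scores=[3.5, 3.8, 3.6]))  # False: fails the conjunctive part
```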
Formative Model. Importantly, an alternative approach that is often overlooked rests on
the idea that multiple measures should not necessarily be combined in any way, but instead
should be used in combination to provide richer and more useful information to inform a
formative evaluation (Schmidt & Kaplan, 1971). As Mehrens (1989) has argued, collecting,
maintaining, reporting, and using the measures separately seems like the natural choice for
teacher evaluation where the key purposes center around formative processes and professional
development for improving teaching practice. In a formative model each of the measures
illuminates and guides improvement efforts related to the aspect(s) of teacher practice and
performance it is most directly informative of. Finally, if desired, the component measures are
still available to be combined for summative purposes according to any of the models above.
Multiple measures and Reliability. The conjunctive and disjunctive models are
intuitively appealing and in principle can be implemented without the use of advanced statistical
techniques. However, the reliability of the resulting indicators and inferences under each of these
approaches can vary substantially and should be examined closely to ensure the system offers
sufficiently accurate information for the intended purposes. Specifically, despite the common
assumption that using multiple measures will improve the reliability of inferences, it can be
shown that under the conjunctive model the reliability of a decision can be no higher than that of the least reliable of the component measures (Chester, 2003). Assume, for example, that Teacher A's true scores are sufficient to meet the desired criteria on measures M1 and M2; that is, the teacher should pass both measures. However, because the measures are not perfectly reliable, Teacher A could be erroneously classified as failing one or both measures. If we know the true scores for Teacher A,
we can estimate the probability of misclassification directly from the standard error of measurement for each measure. For Teacher A, these hypothetical probabilities are estimated at 0.25 and 0.15 for M1 and M2, respectively; that is, the probabilities of correct classification are 0.75 and 0.85. From this it follows that the probability of passing both measures in practice (based on the observed, not the true, scores) under a conjunctive model is 0.75 × 0.85 ≈ 0.64, whereas the probability of passing at least one measure under a disjunctive model is 1 − (0.25 × 0.15) ≈ 0.96. While
Douglas and Mislevy (2010) suggest that the differences may not be quite as dramatic under
compensatory models, the example underscores the importance of understanding errors of
measurement in the context of multiple measures evaluation, particularly where complex or
hybrid decision rules are involved (Douglas and Mislevy, 2010; Cronbach, Linn, Brennan, &
Haertel, 1997).
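The classification arithmetic in this example can be reproduced directly, as in the short sketch below (the per-measure probabilities are the hypothetical values from the text).

```python
# Reproduce the example: probability of correct classification is 0.75 on M1
# and 0.85 on M2 for a teacher whose true scores pass both measures.
p_correct = [0.75, 0.85]

conjunctive = 1.0          # must pass every measure
for p in p_correct:
    conjunctive *= p

p_all_fail = 1.0           # disjunctive: pass at least one measure
for p in p_correct:
    p_all_fail *= (1 - p)
disjunctive = 1 - p_all_fail

print(conjunctive)   # 0.6375 (about 0.64)
print(disjunctive)   # 0.9625 (about 0.96)
```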
Multiple measures and Validity. As with reliability, we are only in the early stages of
validity research under the various approaches for combining the growing variety of measures of
teacher performance available. As these systems come online and are used in the field, more extensive empirical validity research is urgently needed to investigate the assumptions, implications, requirements, and potential consequences of using each of the models for teacher performance appraisal purposes (Brookhart and Loadman, 1992).
As discussed above, how to combine multiple measures of teacher performance is only in
part a technical issue. Teacher evaluation systems must first make explicit their definition of
effectiveness and the value they place on the measures based on theoretical or empirical support,
all in light of the specific local context, program goals, and priorities. In considering these
multiple elements and decisions it is useful to conceive of them as components of a validity
argument for specific inferences to be derived from the indicators for specific purposes (Kane,
2006). As with any single measure, the validity framework applied to multiple measures
highlights the need to rely on assumptions about the nature of the theoretical construct being
measured (i.e. teacher quality or effectiveness) and the ways in which it may be operationalized
in practice. It also makes clear that different uses will lead to different validity arguments, and
require different sources of evidence and support. In particular, inferences and uses that carry
serious consequences require the greatest extent of theoretical and empirical support.
As outlined by Kane (2006) validity is unitary and purpose dependent; validation entails
outlining an interpretive argument for the intended inferences, uses, and consequences, and
collecting evidence to support this argument. This evidence may include a variety of sources of
theoretical support for the intended constructs and conceptual framework, as well as empirical
evidence of consistency and accuracy (reliability), expected patterns of intercorrelation and
internal structure for the measures, and predictive power over intended criterion measures,
among others. Importantly, the framework also requires explicit consideration of the planned consequences and outcomes that are expected to derive from the proposed interpretation and use of the measures. Evidence of intended and potentially unintended consequences is also the most important consideration from the policy perspective; without it, validity becomes a rather academic topic.
Conclusion
A new wave of reform efforts emphasizing teacher-focused accountability is taking hold
in education systems around the world. The possibility of replacing current cursory and hollow
evaluation systems with more meaningful efforts to assess and monitor teacher practice and
performance should be viewed with great interest. The new multiple-measure systems have the
potential to become engines driving a change towards a culture of reflection, improvement, and
accountability among teachers, and to support formative evaluation systems that feed directly
into professional development channels, leading to meaningful improvements in teaching and
ultimately student learning. However, these systems face considerable conceptual and
methodological challenges in design and implementation, and they evolve interwoven with complex
policy environments. It was argued here that combining multiple fallible indicators does not
automatically yield better, less fallible inferences, but it always yields more complex inferences.
Developers should carefully consider the assumptions and consequences (intended and
unintended) of the various approaches for combining measures—or using them in combination.
Because the stakes are high for teachers and students, it behooves developers and decision
makers to focus on the validity of intended uses and inferences, and take the necessary steps to
develop a system to support these inferences.
Ultimately, to avoid becoming the latest policy silver bullet to quickly take root and
flame out (or dilute into meaninglessness), multi-measure systems of teacher evaluation should
ensure sufficient theoretical and empirical support throughout development and implementation.
It is our job as researchers and methodologists to remind policymakers that good measures take
time to develop, solid systems based on these measures take longer to test and implement, and
the consequences of specific uses of these systems are largely unknown and will take longer to
assess. The greatest risk rests not in the potential for unfair decisions involving individual
teachers, although the potential for these should give policymakers pause; instead the risk is
missing a critical opportunity to enact sound policy with great potential to positively impact
educational practice and outcomes.
REFERENCES
Aamodt, M. G. & Kimbrough, W. W. (1985). Comparison of four methods for weighting
multiple predictors. Educational and Psychological Measurement, 45, 477-482.
AERA, APA, & NCME (1999). Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association.
Baker, E. L., Barton, P. E., Darling-Hammond, L., Haertel, E., Ladd, H. F., Linn, R. L., Ravitch,
D., et al. (2010). Problems with the use of student test scores to evaluate teachers. Washington,
DC: Economic Policy Institute.
Bill & Melinda Gates Foundation (2010) Teachers’ perceptions and the MET project.
Retrieved from http://www.metproject.org/downloads/Teacher_Perceptions_092110.pdf
Brookhart, S. (2009). The many meanings of multiple measures. Educational Leadership, 67(3).
Brookhart, S.M., & Loadman, W. E. (1992). School-university collaboration: across cultures.
Teaching Education. 4, (2): 53-68.
Chester, M. D. (2003), Multiple Measures and High-Stakes Decisions: A Framework for
Combining Measures. Educational Measurement: Issues and Practice, 22: 32–41.
Cronbach, L. J., Linn, R. L., Brennan, R. L, & Haertel, E. H. (1997). Generalizability analysis
for performance assessments of student achievement or school effectiveness. Educational and
Psychological Measurement, 57, 373-399.
Darlington, R. B. (1970), Some Techniques for maximizing a test’s validity when the criterion
variable is unobserved. Journal of Educational Measurement, 7: 1–14.
De Pascale, C. (2012) Managing Multiple Measures. Principal, 91(5)
Duncan, A. (2011) Duncan Tells Teachers: Change is Hard. Homeroom, Retrieved December,
09, 2012 from http://www.ed.gov/blog/2012/08/duncan-tells-teachers-change-is-hard/
Douglas, K. M., & Mislevy, R. J. (2010). Estimating classification accuracy for
complex decision rules based on multiple scores. Journal of Educational and
Behavioral Statistics, 35, 1–27.
Ferguson, R. (2010, October 14). Student perceptions of teaching effectiveness. Retrieved
from http://www.gse.harvard.edu/ncte/news/Using_Student_Perceptions_Ferguson.pdf
Feuer, M. (2012) No Country Left Behind: Rhetoric and Reality of International Large-Scale
Assessment. Princeton, NJ, Educational Testing Service.
Glazerman, S., Goldhaber, D., Loeb, S., Raudenbush, S., Staiger, D. O., & Whitehurst, G. J.
(2011). Passing muster: Evaluating evaluation systems. Washington, DC: Brown Center on
Education Policy at Brookings.
Goe, L., Bell, C., & Little, O. (2008). Approaches to evaluating teacher effectiveness: A research
synthesis. Washington, DC: National Comprehensive Center for Teacher Quality.
Goe, L., Holdheide, L., & Miller, T. (2011). A practical guide to designing comprehensive teacher evaluation systems. Washington, DC: National Comprehensive Center for Teacher Quality.
Henderson-Montero, D., Julian, M., & Yen, W. (2003). Multiple perspectives on multiple
Measures. Educational Measurement: Issues and Practice, 22(2).
Kane, M. T. (2006). Validation. In R. L. Brennan (Ed.), Educational Measurement, 4th Ed. (pp.
17-64). Westport, CT: American Council on Education and Praeger Publishers
Kane, T. J., Staiger, D. O., McCaffrey, D., Cantrell, S., Archer, J., Buhayar, S., et al. (2012).
Gathering Feedback for Teaching: Combining High-Quality Observations with Student Surveys
and Achievement Gains. Seattle: Bill & Melinda Gates foundation.
Glazerman, S., Goldhaber, D., Loeb, S., Raudenbush, S., Staiger, D. O., & Whitehurst, G. J.
(2010). Evaluating teachers: The important role of value-added. Washington, DC: Brown Center on Education Policy at Brookings.
Marsh, J. A., Springer, M. G., McCaffrey, D. F., Yuan, K., Epstein, S., Koppich, J.,
Kalra, N., DiMartino, C. & Peng, A. X. (2011). A big apple for educators: New York City's
experiment with schoolwide performance bonuses. Santa Monica, California: RAND.
Martínez, J. F. (2012). Consequences of omitting the classroom in multilevel models of schooling:
An illustration using the effects of opportunity to learn on reading achievement. School
Effectiveness and School Improvement DOI:10.1080/09243453.2012.678864
Martínez, J. F., Borko, H., & Stecher, B. (2011). Measuring instructional practices in middle school science using classroom artifacts. Journal of Research in Science Teaching, 41(1), 38-67. DOI 10.1002/tea.20447
Mayer, D. (1999). Measuring Instructional Practice: Can Policy Makers Trust Survey Data?
Educational Evaluation and Policy Analysis 21(1): 29-45.
Mehrens, W (1989) Combining Evaluation Data from Multiple Sources. In Millman and L.
Darling-Hammond (Eds) The new handbook of teacher evaluation: Assessment of elementary
and secondary school teachers (pp. 322-336). Newbury Park, CA: Sage
Muthén, B. O., Huang, L. C., Jo, B., Khoo, S. T., Goff, G. N., Novak, J. & Shih, J. (1995).
Opportunity-to-learn effects on achievement: Analytical aspects. Educational Evaluation and
Policy Analysis, 17, 371-403
National Research Council (2010). Getting value out of value-added. Braun, Chudowsky and
Koenig (Eds.) Washington, DC: National Academies Press.
Peterson, K. (1987) Teacher evaluation with multiple and variable lines of evidence. American
Educational Research Journal, 24(2).
Pianta, R. C., & Hamre, B. K. (2009). Conceptualization, measurement, and improvement of classroom processes: Standardized observation can leverage capacity. Educational Researcher, 38, 109-119.
Reynolds, M. (1999). Standards and professional practice: The TTA and initial teacher training. British Journal of Educational Studies, 47(3), 247-260.
Rowe, K. (2003). The importance of teacher quality as a key determinant of students' experiences and outcomes of schooling. Paper presented at the ACER Research Conference, Melbourne.
Rubin, D.B., Stuart, E.A., and Zanutto, E.L. (2004) A Potential Outcomes View of Value-Added
Assessment in Education. Journal of Educational and Behavioral Statistics 29 (1): 103-116.
Sartain, L., Stoelinga, S., & Brown, E. (2011). Rethinking teacher evaluation in Chicago: Lessons from classroom observations, principal-teacher conferences, and district implementation. University of Chicago Consortium on School Research. Retrieved December 8, 2012, from http://ccsr.uchicago.edu/sites/default/files/publications/Teacher%20Eval%20Report%20FINAL.pdf
Schafer, W. D. (2003), A State Perspective on Multiple Measures in School Accountability.
Educational Measurement: Issues and Practice, 22: 27–31.
Schmidt, F. L., & Kaplan, L. B. (1971). Composite vs. multiple criteria: A review and
resolution of the controversy. Personnel Psychology, 24, 419-434.
Schochet, P. Z., & Chiang, H. S. (2010). Error rates in measuring teacher and school
performance based on student test score gains (NCEE 2010-4004). Washington, DC: National
Center for Educational Evaluation and Regional Assistance, Institute of Education Sciences,
United States Department of Education.
Schweig, J. (2012). Cross-level measurement invariance in school and classroom environment variables. Manuscript in preparation.
Shulman, L. (1998) "Teacher Portfolios: A Theoretical Activity" in N. Lyons (ed.) With
Portfolio in Hand. (pp. 23-37) New York: Teachers College Press.
Stecher, B., Le, V.N., Hamilton, L., Ryan, G., Robyn, A., & Lockwood, J. R. (2006). Using
structured classroom vignettes to measure instructional practices in mathematics. Educational
Evaluation and Policy Analysis, 28(2), 101-130.
Steele, J., Hamilton, L. S., & Stecher, B. M. (2010). Incorporating student performance measures into teacher evaluation systems. Santa Monica, CA: RAND Corporation.
Strunk, K., Weinstein, T., Makkonen, R., & Furedi, D. (2012). Three lessons emerge from Los Angeles Unified School District's implementation of a new system for teacher evaluation, growth, and development. Phi Delta Kappan, 94(3). Retrieved December 8, 2012.
Teacher Performance Assessment Consortium (2012). Retrieved December 8, 2012, from http://edtpa.aacte.org/
Table 1. Constructs in Singapore and Danielson Observation Protocols
Figure 1. Conjunctive and Disjunctive Decision Models
Figure 2. Factor Analysis Model for Multiple Measures (Compensatory)
Figure 3. Optimal Prediction Weights with Achievement as Criterion (Compensatory)
Figure 4. Unmeasured Criterion with Theoretical Weights (Compensatory)