Building Principals’ Instructional Capacity through Focused State and Local Support
Executive Summary
Our national trend toward using Value-Added Modeling (VAM) to measure educator effectiveness has provided administrators with hard data to use as they evaluate teachers, analyze professional development needs, and support teacher professional growth. Unfortunately, although many states have made great strides using the data provided to evaluate teachers, there are still gaps in the level of professional development site-based administrators receive to ensure they understand high-quality instruction, evaluate teachers appropriately, and provide clear, specific instructional feedback that leads to improved teacher practice.

“Leadership is second only to teaching among school influences on student success.” - Clifford & Ross (2012)
Modern technology allows for an analysis of student test scores that provides detailed information about how much growth a student achieves during one school year. In North Carolina, this data has revealed a distinct discrepancy between teacher evaluation ratings and the corresponding value-added scores, a discrepancy that indicates a need for deeper, sustained professional development. One strategy to support principals, as they work to evaluate teachers fairly and effectively and to gain the pedagogical understanding needed to provide accurate feedback about classroom instruction, is to develop a partnership between the state education agency and district leadership. At the state level, stronger inter-rater reliability and calibration systems are needed for accurate and meaningful evaluations. At the district level, support is necessary to provide principals with sustained, differentiated professional development targeted toward specific recommendations for teacher growth. This dual-support approach will create a ripple effect that impacts not only principal instructional leadership but also teacher effectiveness and, ultimately, student learning outcomes.
Figure 1: State/District Dual-Support Approach

State:
• Develop a calibration system to ensure inter-rater reliability
• Create an evaluation certification program to identify principals who demonstrate proficiency in teacher evaluation
• Facilitate ongoing regional Principals’ PLCs

District:
• Identify trends in evaluation data
• Facilitate monthly data meetings with principals and develop SMART goals
• Provide time for collegial conversations and evaluation coaching sessions using collaborative walkthroughs with an emphasis on instructional feedback
Introduction
Tony Flach (2014), National Practice Director at Houghton Mifflin Harcourt, contends that instructional leadership should be defined as “the ability to guide adults to improve instruction through the creation of favorable learning environments, building of adult content and pedagogical knowledge, and explicit monitoring of the learning of both adults and students.” The principal as ‘instructional leader’ is a relatively new concept that began to materialize during the early 1980s. Prior to this shift in responsibilities, most principals functioned as managers and operational leaders. The movement toward academic and instructional leadership was influenced by research of the time indicating that effective schools most often had principals who understood and articulated the importance of instructional leadership (Brookover & Lezotte, 1982). Another major shift in education was the ability to access data that clearly and accurately revealed, without bias, student achievement, discipline, demographic disparities, and retention rates. Suddenly, public education became much more public with widespread access to data. This movement further perpetuated the need for an instructional leader in the principalship.
The high demands of the principalship are difficult to prioritize, and historically principals have overlooked the importance of the evaluation process. Now, in the second decade of the 21st century, educational leaders have access to even more data that highlights the alignment, and misalignment, between evaluation ratings and student outcomes. Teacher evaluations are more important than ever, with high-stakes decisions, such as tenure and performance pay, connected to evaluation results. Principals, assistant principals, and other educational leaders designated to evaluate teachers are responsible for accurate and reliable evaluations (NGA Center for Best Practices, 2011). The benefits of teacher evaluation are not usually immediate, and too often principals are consumed by problems that demand immediate attention and immediate solutions. Numerous research sources provide evidence that rigorous classroom observations combined with additional data measures will provide an accurate evaluation of teacher effectiveness (Bill & Melinda Gates Foundation, 2013; Ho & Kane, 2013; Taylor & Tyler, 2012).
A focus on accurate teacher evaluation is necessary. Ongoing professional development provided through a dual-support model between the state education agency and local district leadership is crucial to improving the quality of instructional feedback administrators can provide teachers. This model, if implemented successfully, will result in improved teacher effectiveness. A conceptual understanding of how a principal’s leadership impacts teaching and student learning outcomes is the first step to understanding the need for change in principal preparation and development programs. Furthermore, developing a framework of state- and district-level support for principals will ensure principals hone the skills needed to improve and inform instruction. Ongoing, aligned, and monitored professional development will also lead to more reliable and valid evaluations. Principals will develop a deeper understanding of content standards, pedagogy, and instructional design, enabling them to provide clear, specific, and constructive feedback to teachers that leads to improved student learning outcomes. One-shot professional development is not enough to inform principals’ understanding of instructional leadership.
Research clearly supports the need for teacher evaluation, but little attention has been given to ensuring that evaluators are trained and certified to make subjective decisions regarding a teacher’s performance. During the 2009-10 school year, the North Carolina Department of Public Instruction conducted statewide, two-day train-the-trainer professional development on the new North Carolina Professional Teaching Standards, the evaluation process, and “look-fors” during classroom observations. Since that time, training of new administrators and review sessions for experienced administrators have become the responsibility of each school district. Policy TCP-C-004, which establishes the Teacher Performance Appraisal process, states under Component 1 (Training) that “Before participating in the evaluation process, all teachers, principals and peer evaluators must complete training on the evaluation process.” The consistency, quality, and fidelity of these trainings are unknown.
The evaluation of teachers must be purposeful, reliable, and valid. The comprehensive study by Yoon et al. (2007) indicates that not only does the duration of professional development play an important role in the success of an initiative, but also that follow-up and ongoing, job-embedded opportunities for discussion, feedback, and continued emphasis are integral parts of a successful implementation. Creating experiences where principals can learn from reflecting on their experiences is an important part of the learning process. Standards one, two, and three of the North Carolina Educator Evaluation Standards for North Carolina School Executives require principals to reflect on their practice in the areas of strategic leadership, instructional leadership, and cultural leadership (North Carolina School Executive: Principal and Assistant Principal Evaluation Process Manual, 2012).
One integral way district leaders can ensure principals have the opportunity to support teachers and grow as instructional and cultural leaders is to provide ongoing professional learning opportunities aligned with state goals, either as part of their regularly scheduled professional learning communities (PLCs) or in follow-up sessions designed around reflection, sharing, and feedback.
The goals of the state/district partnership would be to develop strong inter-rater reliability and instructional leadership among site-based administrators across the state. Currently, in North Carolina, there are major discrepancies between evaluation ratings and teachers’ VAM scores. To ensure principals have a deep understanding of the standards, how to rate teachers, and how to provide strong instructional feedback, changes must be made in the current model of support.
Background
North Carolina principals currently face a variety of previously unknown challenges. In 2010, North Carolina adopted new standards for every content area and grade level: the Common Core State Standards (CCSS) in English Language Arts and Mathematics, as well as the Common Core State Standards for Literacy in History, Social Studies, Science, and Technical Subjects. In addition to these standards, North Carolina adopted the North Carolina Essential Standards (NCES) for all other grade levels and content areas (In the States, 2012). According to the North Carolina Department of Public Instruction (NCDPI) (2011), the new standards are based on a philosophy of teaching and learning consistent with current research, best practices, and new national standards. The NCDPI contends that the new North Carolina Standard Course of Study (NCSCoS) is designed to support the state’s educators as they provide the most challenging education possible for North Carolina students. The ultimate goal of these new standards is to prepare all students for a career and/or college. Not only do principals have the important task of providing teachers with high-quality instructional feedback to improve their performance, but they must also take time to learn the new content standards across the board. Now, more than ever, principals need sustained and focused professional development to support them as they evaluate the quality of teaching and provide instructional feedback that improves student learning outcomes.
For almost two decades, quality teaching has been consistently identified by researchers as the most important
school-based factor in student achievement (McCaffrey, Lockwood, Koretz, & Hamilton, 2003; Rivkin,
Hanushek, & Kain, 2000; Rowan, Correnti & Miller, 2002; Wright, Horn, & Sanders, 1997). Instructional
guidance, support, and feedback from principals are imperative to improving teachers’ practice.
Research has proven that evaluation is more effective when the evaluators are trained (Darling-Hammond et al.,
2011). Trainings should include resources that support the evaluation process (McGuinn, 2012). A high-quality
professional development partnership is the key to successful implementation of any teacher evaluation system.
States must rethink the way evaluators have been trained in the past and develop a new model designed to grow
instructional leaders through ongoing training, modeling, collaboration, and support.
The Widget Effect, reported by Weisberg et al. (2009), describes school districts’ assumption that teacher effectiveness is the same from teacher to teacher. Teachers are not viewed as individual professionals but rather as “interchangeable parts.” This report suggests that better evaluation will not only improve teaching to benefit students, but will also benefit teachers by treating them as professionals. Characteristics of the Widget Effect in teacher evaluation include:
• All teachers are rated good or great.
• Excellence goes unrecognized.
• Inadequate professional development is provided, and no special attention is given to novice teachers.
• Poor performance goes unaddressed.
The Widget Effect is simply another indicator that principal instructional leadership has taken a back seat to
managerial and organizational components of the principal’s role. Without a clear emphasis on evaluation from
the state and the ongoing opportunities for discourse, professional development, and support, the quality and
accuracy of teacher evaluations is not likely to improve.
As more and more states turn to measuring student growth using VAM systems such as the Education Value-Added Assessment System (EVAAS) from the SAS Institute, the Widget Effect is more prominent than ever. Through VAM data, notable discrepancies between evaluation and student learning outcomes have been revealed. Value-added assessment systems such as EVAAS provide individual teacher, school, district, and state growth data. The North Carolina State Board of Education implemented EVAAS data as part of teacher and principal evaluations during the 2011-12 school year. EVAAS determines the effectiveness of teachers, schools, and districts with regard to student achievement and provides multiple reports aimed at analyzing student and teacher performance on standardized assessments. A multifactorial correlation study was conducted to analyze the correlation between teacher performance evaluation ratings and EVAAS student achievement data. The dataset included 11,430 North Carolina teachers in 35 local education agencies (LEAs) having both EVAAS scores and performance evaluation ratings assigned in the 2010-11 school year. Although 46,000 teachers had evaluation data for 2010-11, only around 11,000 of those also gave an end-of-grade (EOG) or end-of-course (EOC) assessment; these 11,000 received an EVAAS data score. The study found a narrow distribution of evaluation ratings: out of 11,000 teachers, the 100 teachers with the best student achievement data received the same ratings as the 100 teachers with the worst achievement data. The study did not find a correlation between performance evaluation data and EVAAS data (Batton, Britt, DeNeal, & Hales, 2012). This finding alone demonstrates a deep need to better prepare evaluators to provide instructional feedback on content, pedagogy, and instructional design. A comprehensive system of support from district and state agencies is mandatory.
Conceptual Frameworks
In 2012, the American Institutes for Research published a report titled “The Ripple Effect” to examine principals’ influence on student achievement. The report provided a conceptual framework for understanding the role of the principal in terms of instruction, the direct effects of the principal’s practice on both teachers and the school as a whole, and the indirect effects that a principal’s practice has on classroom instruction and learning. A modified, personalized iteration of “The Ripple Effect” framework helps to identify the significance of a state-district partnership to improve principals’ practice, improve instructional quality, and impact student learning outcomes (Clifford et al., 2012). This iteration of the framework suggests that principals need strong, focused professional learning opportunities and support to directly impact teacher quality, which ultimately can have an indirect effect on student achievement (Figure 2).

Figure 2: Adaptation of “The Ripple Effect”

To ensure that the professional development and support provided by state and local agencies will result in improved student learning outcomes, both agencies must make a concerted effort to address the components of planning, implementation, and evaluation (PIE) in the professional development cycle (Figure 3).
Figure 3: The PIE Cycle of Professional Development
One-shot, fragmented workshops lasting 14 hours or less show no statistically significant effect on student learning (Darling-Hammond, Wei, Andree, Richardson, & Orphanos, 2009). Effective professional development programs are job-embedded and provide participants with five critical elements:
• Collaborative learning: opportunities to learn in supportive groups where content is organized
• Clear, evident links among curriculum, assessment, and professional-learning decisions as related to specific teaching contexts: emphasis on the importance of developing content knowledge, specifically in math and science, as well as pedagogical understandings specific to content areas (Blank, de las Alas, & Smith, 2008; Blank & de las Alas, 2009; Heller, Daehler, Wong, Shinohara, & Miratrix, 2012)
• Active learning: application of new knowledge and opportunities to receive feedback, and use of ongoing data to reflect on how teaching practices influence student learning over time
• Emphasis on deep content knowledge and pedagogy: direct ties to how to teach content through new techniques and strategies
• Sustained learning over multiple days and weeks: engagement in 30 to 100 hours of learning over six to twelve months to increase student achievement
Every educational institution in our country plans professional development in hopes of improving the quality of teaching and student learning outcomes. However, research has shown that very specific and deliberate steps must be followed in the PIE cycle to ensure not only that participants benefit from professional development but also that student learning outcomes improve. Several important components that have traditionally been overlooked must be addressed, including the allocation of funding, time, and follow-up, to ensure participants receive rich instruction and the ongoing support necessary for overarching positive changes in teacher practice.
Little research in the field measures the impact of professional development on student learning outcomes.
However, a great deal of research has been conducted around creating and evaluating high-quality professional
development. An examination of the three meta-analyses provides a great deal of compelling empirical evidence regarding the qualities of professional development that lead to student gains. Those qualities can be represented in the three distinct elements of the PIE cycle: the outer ring represents the ongoing components of high-quality professional development that increase student achievement, and the three inner components represent the phases districts or schools must follow to ensure success.
No longer is it acceptable practice to make assumptions about the impact of professional development. Advances in technology include more online assessment opportunities for students, an increase in state and federal student testing mandates, and VAM models that provide clear and undeniable data regarding the impact of instruction on student growth. These data points provide the opportunity to assess the impact of professional development more accurately than ever before. This deeper understanding uncovers a need for monumental change in training, support systems and structures, and follow-up for practicing principals. According to the Learning Forward Center for Results, “policies, resources, calendars, daily schedules, coaches, and budgets influence the quality and results of collaborative professional learning and may need to be discussed and altered for alignment” (Standards Assessment Inventory, n.d.). State and district leaders must think creatively and collaboratively to address the need to develop instructional leadership capacity in principals and to create structures that ensure success.
Central Research Questions
• How can the North Carolina Department of Public Instruction and school districts develop and implement a sustainable plan to ensure that teacher evaluators have the requisite knowledge and skills to evaluate and provide feedback to ensure teacher growth and effectiveness?
• How can state education departments support the professional development needed for improved rater agreement in the teacher evaluation process?
• How can districts support principals to ensure they become instructional leaders?
• What must happen before, during, and after professional development that leads to improved student learning outcomes?
Methodology
NCDPI conducted a multifactorial correlation study to identify a correlation coefficient for composite teacher evaluation ratings and value-added scores from the Education Value-Added Assessment System (EVAAS). A statewide analysis, conducted by Dayne Batten, used data from the 2011-12 school year to determine the relationship between North Carolina Educator Evaluation ratings and student growth for 26,260 teachers. Statewide and district correlations were calculated to identify trends and potential errors. Teacher evaluation data consisted of ratings on each of the five standards; career-status teachers were evaluated on standards one and four only. All available rating data were averaged to assign each teacher a mean evaluation score. All available EVAAS scores for a teacher were likewise averaged to identify a composite score; the standardized values for each class had a mean of zero and a standard deviation of one. Value-added scores from EVAAS were calculated for all teachers who administered End-of-Grade assessments (grades 4-8), End-of-Course assessments (grades 9-12), or a Career and Technical Education Post-Assessment (Batton, Britt, DeNeal, & Hales, 2012).
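To make the computation concrete, the sketch below is a minimal, hypothetical illustration of the analysis described above, not NCDPI’s actual pipeline: the teacher identifiers, column names, and values are invented. It averages each teacher’s available ratings into a composite, standardizes value-added estimates within each class to a mean of zero and a standard deviation of one, and reports the Pearson correlation between the two composites.

```python
# Minimal sketch of the correlation analysis described above. Illustrative only:
# teacher IDs, column names, and values are invented, not NCDPI's data or code.
import pandas as pd

# One row per teacher-class combination (hypothetical records)
data = pd.DataFrame({
    "teacher_id":  [1, 1, 2, 2, 3, 3],
    "class_id":    ["a", "b", "a", "b", "a", "b"],
    "rating":      [3.0, 3.0, 5.0, 5.0, 2.0, 2.0],     # mean of available standard ratings
    "value_added": [1.2, 0.8, -0.3, 0.1, -1.0, -0.6],  # raw EVAAS-style growth estimates
})

# Standardize value-added scores within each class: mean 0, standard deviation 1
data["va_std"] = data.groupby("class_id")["value_added"].transform(
    lambda s: (s - s.mean()) / s.std(ddof=0)
)

# Composite per teacher: mean evaluation rating and mean standardized value-added
per_teacher = data.groupby("teacher_id").agg(
    eval_composite=("rating", "mean"),
    va_composite=("va_std", "mean"),
)

# Pearson correlation between the two composites
r = per_teacher["eval_composite"].corr(per_teacher["va_composite"])
print(f"Pearson r = {r:.3f}")
```

A correlation near zero from a computation like this, applied to tens of thousands of teachers, is what signals the evaluation-versus-growth disconnect the study reported.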
Educational Administration Quarterly published a report, Examining Teacher Evaluation Validity and Leadership Decision Making Within a Standards-Based Evaluation System, examining a school district’s implementation of a standards-based teacher evaluation system and the variation in its validity (Kimball & Milanowski, 2009). Using mixed-methods research, 23 school leaders with “more” and “less” valid results were identified using teacher evaluation ratings and value-added student achievement data. Interviews with all 23 school leaders were conducted to learn about their attitudes toward teacher evaluation, their decision-making strategies, and their school contexts. Eight of the principals were examined as a subset, all having two consistent years of validity scores to analyze (n = 4 more valid and n = 4 less valid).
In 2011-12, 337 teachers were provided cameras by the Bill and Melinda Gates Foundation to videotape their lessons for a correlational study to determine inter-rater agreement among evaluators. These lessons would contribute to a video library of teaching practices. Teachers were asked to video their classroom lessons twenty-five times during the 2011-12 school year, using the digital video cameras and microphones provided. One hundred six of these teachers were from Hillsborough County, Florida, and sixty-seven Hillsborough teachers consented to having their lessons scored by administrators and peers, with scoring following the district’s observation protocol. Administrators and peer observers were recruited to participate; in the end, 129 raters (53 principals/assistant principals and 76 peer raters) scored videos. Ho and Kane (2013) reported the results of inter-rater agreement in their research report, The Reliability of Classroom Observations by School Personnel.
Yoon et al. (2007) conducted one of the most significant research studies in the field of professional development. Reviewing the Evidence on How Teacher Professional Development Affects Student Achievement examined over 1,300 studies against the What Works Clearinghouse Evidence Standards for Reviewing Studies. A team of researchers from the American Institutes for Research reviewed and examined findings from these studies. Discouragingly, of all of these studies, only nine met the What Works Clearinghouse (WWC) evidence standards (Figure 4).
Figure 4: Nine Studies that Met the What Works Clearinghouse Evidence Standards
1. Carpenter et al., 1989 (randomized controlled trial)
2. Cole, 1992 (randomized controlled trial)
3. Duffy et al., 1986 (randomized controlled trial)
4. Marek & Methven, 1991 (quasi-experimental design)
5. McCutchen et al., 2002 (quasi-experimental design)
6. McGill-Franzen et al., 1999 (randomized controlled trial)
7. Saxe et al., 2001 (quasi-experimental design)
8. Sloan, 1993 (randomized controlled trial)
9. Tienken, 2003 (randomized controlled trial with group equivalence problems)

Source: Authors’ synthesis of studies described in the text (Yoon et al., 2007).
Although these studies revolved around teacher professional development, the ultimate goal was to impact student learning outcomes. From the nine studies that met evidence standards, two methodologies emerged as most appropriate for ensuring success: randomized controlled trials and quasi-experimental designs. The average effect size of the nine studies was 0.54, with sizes ranging from -0.53 to 2.39 (Appendix A). Professional development, for teachers or administrators, should be focused not only on participant learning but also on impacting student achievement. Thus, findings from the Yoon study are significant to consider when developing a research method for evaluating the effectiveness of an inter-rater agreement professional development program.
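For readers interpreting these effect sizes alongside the improvement-index column in Appendix A, the WWC’s standard conversion turns an effect size into the expected percentile-point gain for an average control-group student. The sketch below is our illustration of that conversion, not code from the WWC or from Yoon et al.:

```python
# Sketch of the WWC improvement-index conversion: the percentile-point gain an
# average control-group student would be expected to make under the treatment.
from statistics import NormalDist

def improvement_index(effect_size: float) -> float:
    # Phi(d) * 100 - 50: percentile of the average treated student, minus 50
    return NormalDist().cdf(effect_size) * 100 - 50

print(round(improvement_index(0.41)))     # 16, matching Appendix A
print(round(improvement_index(-0.53)))    # -20, matching Appendix A
print(round(improvement_index(0.54), 1))  # ~20.5 for the average effect size
```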
A variety of research methods have been used to examine the observation and rating process for classroom
teachers. It is important to note that while correlational and mixed methods studies are appropriate designs for
evaluating inter-rater agreement, both quasi-experimental and randomized controlled trials are most appropriate
for measuring the effectiveness of professional development on student learning outcomes.
Author(s) | Year | Methodology | Title
Hollingsworth | 2012 | Qualitative Case Study | Empowering Teachers to Implement Formative Assessment
Weisberg et al. | 2009 | Mixed Methods | The Widget Effect
Taylor & Tyler | 2012 | Quasi-Experimental | Can Teacher Evaluation Improve Teaching?
Ho & Kane | 2013 | Generalizability Study | The Reliability of Classroom Observations by School Personnel
Clifford et al. | 2012 | Qualitative Meta-Analysis | The Ripple Effect
Results
Teacher performance evaluation is at the center of school reform efforts nationwide. Understanding the link between evaluation and teacher performance is key to improving both instruction and student learning outcomes. Taylor and Tyler (2012) studied Ohio mid-career teachers to examine value-added student achievement data before, during, and after evaluation. For many years, teachers were not evaluated, yet student achievement data was collected on their students in the form of end-of-year assessments.
Figure 5: Improvement Through Evaluation
Key findings included that teachers are more productive in post-evaluation years, which supports the conclusion that evaluation is a significant factor in teacher growth (Figure 5). Evaluations consisted of “multiple, highly structured classroom observations conducted by experienced peer teachers and administrators.” The research indicates that teachers could increase knowledge and information based on the “formal scoring and feedback routines of an evaluation program.” The authors found that evaluation could also inspire classroom teachers to be “more self-reflective, regardless of the evaluative criteria.” Finally, the study revealed that having an evaluation process could create more opportunities for instructional conversations with other teachers and administrators about effective pedagogy. This study is significant in establishing a link between high-quality teacher evaluation and performance. As indicated by Figure 5, teachers not only performed better during the year they were evaluated but also continued to demonstrate growth on student assessments.
North Carolina Professional Teaching Standards
I – Teachers demonstrate leadership.
II – Teachers establish a respectful environment for a diverse population of students.
III – Teachers know the content they teach.
IV – Teachers facilitate learning for their students.
V – Teachers reflect on their practice.
VI – Teachers contribute to the academic success of students.**
** New standard (2011-12)

North Carolina is also in the midst of an evaluation reform effort to include value-added data in a teacher’s status. Previously, North Carolina depended solely upon principal evaluations to determine a teacher’s status from year to year. Beginning in school year (SY) 2011-12, teachers were rated on a new standard (Standard 6), based solely on measuring growth using a value-added score, and were assigned an overall effectiveness status based on their value-added growth. North Carolina identified three possible statuses for teachers: does not meet expected growth, meets expected growth, or exceeds expected growth. North Carolina determined that teachers who do not meet expected growth for three consecutive years may lose their employment within the LEA. Moreover, principals and assistant principals also receive an overall effectiveness status on standard 8 of the North Carolina Principal Evaluation; this standard is populated based on the overall value-added data of the school.
Tom Tomberlin, Director of District Human Resources Support at the North Carolina Department of Public Instruction, developed Figure 6 to demonstrate the correlation between each of the North Carolina Professional Teaching Standards (I-V) and the index, Standard 6. Standard 6 has a low correlation with each of the other standards (between .173 and .205 in 2011-12 and between .167 and .198 in 2012-13). However, the chart also identifies high correlations among the other five standards (all around 0.70). The strong correlation among standards 1 through 5 indicates that when principals evaluate teachers, they tend to rate teachers the same on all five standards instead of considering each standard separately (Tomberlin, 2014).
Figure 6: NCPTS Correlation 2011-12 & 2012 -13
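The pattern Tomberlin describes can be made concrete with a small simulation. The sketch below is purely hypothetical (invented ratings, not NCDPI data or Tomberlin’s analysis): a single overall impression drives ratings on standards 1-5, while standard 6 is largely independent, reproducing a correlation matrix of roughly .70 among standards 1-5 and roughly .17 between each of them and standard 6.

```python
# Hypothetical simulation of halo-style rating (invented data, not NCDPI's).
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)
n = 10_000  # simulated teachers

# One global impression per teacher drives ratings on standards 1-5 ...
impression = rng.normal(size=n)
ratings = pd.DataFrame(
    {f"standard_{k}": impression + 0.6 * rng.normal(size=n) for k in range(1, 6)}
)
# ... while standard 6 (the value-added index) is largely independent of it.
ratings["standard_6"] = 0.2 * impression + rng.normal(size=n)

# Standards 1-5 correlate near .70 with one another; standard 6 correlates
# near .17 with each, mirroring the pattern reported for Figure 6.
print(ratings.corr().round(2))
```

In real evaluation data, a matrix with this shape is the statistical signature of halo-style rating rather than standard-by-standard judgment.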
Ho and Kane (2013) reported several key findings in their research report, The Reliability of Classroom Observations by School Personnel, that relate to the need for multiple measures for rating and additional professional development for administrators:
• Observers rarely used the top or bottom categories in the four-point rating scale.
• Administrators rated their own teachers 0.1 higher than administrators from other institutions and 0.2 higher than peers.
• First impressions make a difference: administrators who had an initial negative impression of a teacher tended to score that teacher lower in subsequent observations.
These findings indicate a need for using multiple measures to evaluate teachers to account for human error and bias. Furthermore, principals need ongoing professional development to evaluate accurately (Figure 7).
Figure 7: Inter-Rater Agreement Findings
Discussion and Conclusions
College and Career Readiness (CCR) standards for students emphasize the need for quality teachers in classrooms teaching our 21st-century students. Many states, including North Carolina, have rolled out new evaluation systems for teachers and principals. Even with new evaluations in place, there is still motivation for continued reform, given that states rate 99 percent of their teachers as effective or better (Reform Support Network, 2013). Additional training for evaluators to rate teachers accurately is crucial. Inter-rater reliability and inter-rater agreement are key professional development topics for evaluators of teachers, and the two are often confused. Inter-rater reliability is the relative similarity between multiple sets of raters; inter-rater agreement is the frequency with which multiple raters rate the same (Graham, 2011). Inter-rater agreement is a form of calibration that is significant in ensuring that teachers get accurate feedback on performance. Performance review calibration improves the reliability of a rating system for evaluation. Calibration promotes “honesty and fairness” in the ratings system for the evaluation of teachers: it is a process in which multiple observers or evaluators collaborate and discuss performance ratings based on objective evidence (“Performance Review Calibration,” n.d.). The calibration process provides a platform of common language and understanding of the professional teaching standards and instructional practices. Currently, North Carolina lacks a robust calibration system, and principal support for evaluation is delivered in short professional development sessions that are often attended by a small percentage of principals across the state. Furthermore, there is both a disconnect and wide variation in the professional development provided for principals in regard to teacher evaluation. Because there are no current policies or procedures mandating principal preparation for evaluation, no level of consistency exists among LEAs to provide principals with the training they need to evaluate educators effectively.
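The distinction can be made concrete with a toy example. The sketch below uses invented ratings (our simplification, not Graham’s method): reliability is estimated as the correlation between two raters’ scores, and agreement as the share of observations on which they assign the identical rating.

```python
# Toy contrast between inter-rater reliability and inter-rater agreement
# (invented ratings; a simplified reading of the definitions above).
import numpy as np

rater_a = np.array([3, 4, 2, 5, 3, 4, 2, 3, 4, 5])
rater_b = np.array([4, 5, 3, 5, 4, 5, 3, 4, 5, 5])  # consistently about one point higher

# Reliability: do the two raters order teachers the same way? (correlation)
reliability = np.corrcoef(rater_a, rater_b)[0, 1]

# Agreement: how often do the two raters assign the identical rating?
agreement = np.mean(rater_a == rater_b)

print(f"Inter-rater reliability (correlation): {reliability:.2f}")  # high (~0.94)
print(f"Inter-rater agreement (exact match):   {agreement:.0%}")    # low (20%)
```

A pair of raters can therefore be highly reliable, ranking teachers the same way, while rarely agreeing on the actual rating; that gap is exactly what calibration is meant to close.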
Inter-rater reliability: the relative similarity between multiple sets of raters.
Inter-rater agreement: the frequency with which multiple raters rate the same.
(Graham, 2011)

Principals must be prepared to evaluate teachers. Edward Thorndike was the first person to study the Halo Effect, a bias in which one characteristic of a person influences an observer’s assessment of additional characteristics. “Also known as the physical attractiveness stereotype and the ‘what is beautiful is good’ principle, the Halo Effect, at the most specific level, refers to the habitual tendency of people to rate attractive individuals more favorably for their personality traits or characteristics than those who are less attractive. The Halo Effect is also used in a more general sense to describe the global impact of likeable personality or some specific desirable trait in creating biased judgments of the target person on any dimension. Thus, feelings generally overcome cognitions when we appraise others” (Standing, 2004).
Thorndike applied his theory to teacher evaluation. His goal was to understand how one quality of a teacher defined or influenced the assessment of other characteristics during evaluation. Similar to Tomberlin’s analysis of teacher ratings, Thorndike revealed that high ratings of one particular quality or element correlated with similarly high ratings of other, unrelated characteristics (see Figure 6). Moreover, negative ratings of a specific characteristic or quality likewise led to lower ratings of other characteristics. The Halo Effect underlies the psychometric flaw of rater bias. The teaching and learning process is complex, and the subjectivity of what good teaching looks like lends itself to evaluation bias. The Halo Effect offers one reason why ratings on North Carolina teaching standards one through five are so strongly correlated.
Now that high-stakes decisions are linked to evaluation, concern has been raised about the lack of correlation between teacher ratings based on principal evaluation and student learning outcomes. Undeniably, principals must identify areas of strength and areas for improvement for teachers, as outlined by the evaluation standards or goals. Crucial information can be exchanged when educators collaborate on improving effectiveness. Thus, principals need intensive training on evaluating teachers and giving feedback (Reform Support Network, 2013).
Race to the Top (RttT), a $4.35 billion United States Department of Education initiative to reform K-12 education, was awarded to twelve states, and many RttT recipients have voiced an interest in building video libraries of classroom instruction. Video brings “real world” visuals of the classroom to a professional development opportunity. Inter-rater reliability training could be supported with the integration of video and validated ratings to use as guiding documentation for agreement. A completed rubric would be considered a guiding rubric for a specific video and would have explicit documentation to support the ratings for the teacher. For example, a group of evaluators could view a 30-minute classroom lesson and respond by using a teacher evaluation tool to document teacher behavior as identified by the tool. The video would have already been viewed and deeply analyzed for evidence that would support the rating of the teacher as outlined by the guiding rubric. There are limitless opportunities to use video libraries for professional development for all stakeholders. Professional development should also include collaborative conversations around the standards, how to interpret data, and practice scoring (Reform Support Network, 2013, p. 1). Numerous states already use video to train their evaluators (Reform Support Network, 2013, p. 2). The key to these trainings would be to ensure that they follow the PIE model for professional development so that appropriate and sufficient coaching, follow-up, duration, and reflection take place to produce change in practice.

“To develop skilled classroom observers, training must be thorough, careful, and well structured. Observers’ understanding of the application of the rubric must be reviewed frequently, and feedback that corrects misunderstandings must be given as soon as possible” (McClellan et al., 2012).
Research has proven that evaluation is more effective when evaluators are trained (Darling-Hammond et al., 2011). A variety of professional development formats, including face-to-face delivery and online modules, are currently being used to train principals. Trainings should include resources that support the evaluation process (McGuinn, 2012). Ongoing, sustained professional development will be the key to successful implementation of any evaluation system.
Yoon et al. (2007) report that professional development trainings lasting between 30 and 100 hours “were more likely to have an impact on participants’ student achievement than programs that provided fewer hours” (p. 6). Furthermore, studies “that had more than 14 hours of professional development showed a positive and significant effect on student achievement from professional development.”
According to McClellan et al. (2012), principals need evaluation training that addresses observer bias, provides opportunities to analyze and use the evaluation tool, and uses video of real classrooms to help principals gain a greater understanding of calibration. The more practice principals have, the more accurately they are likely to evaluate. Furthermore, the more candid and honest the conversations during training, the more likely principals are to build inter-rater agreement within and across LEAs.
Recommendations/Implications
States must rethink the way evaluators have been trained under previous models and expand this training to focus on inter-rater and intra-rater reliability, while ensuring adequate time is provided for implementation and ongoing evaluation.
The principal’s active role in attending, supporting, understanding, implementing, and evaluating professional
development is crucial to the success of any professional development endeavor. However, for principals to
build this type of instructional leadership capacity, the state and local districts must work together to ensure
principals have the training, support, and understanding to affect student learning outcomes through their work
with teachers. Successful school reform is impossible without the principal (Day, 2000). Leadership studies cite
that the school leader is an undeniably crucial component in the successful completion of any change initiative
(Leithwood et al., 2006). Without the principal’s active support, professional development initiated by the
district is unlikely to be successful. Principals must be actively involved in the process and must be held
accountable for the professional development. Schlechty (2001) contends that principals must function as part of
the district team and reminds site-based administrators that they are as responsible for classroom instruction as
teachers themselves. Ensuring principals understand their role as instructional leaders, and are provided with the skills needed to assume this role, is an expectation that too often goes unspoken in states and school districts. We can no longer assume principals are inherently instructional leaders and quality evaluators. It is the responsibility of universities, the state, and the local district to continue to support principal growth by providing ongoing opportunities to evaluate, discuss, collaborate, and reflect on classroom instruction and evaluation ratings.
A variety of researchers (Andrews & Soder, 1987; Hallinger & Heck, 1996; Hallinger et al., 1996; Leithwood et al., 2006; Waters et al., 2003) have concluded that principals do have some degree of impact on student achievement, but the strength of this relationship remains unknown (Mees, 2008). We now have data sources that can help us measure this impact more accurately.
The North Carolina Department of Public Instruction and Local Education Agencies have individual and collaborative work in front of them regarding the validity and reliability of teacher evaluation and growth measures. The state agency must provide training criteria for the best possible professional development, including train-the-trainer sessions and video calibration, so principals can build their inter-rater agreement. Principals should be trained to recognize the evidence and behaviors described by the Professional Teaching Standards and to evaluate teachers accurately. Most importantly, NCDPI should implement a certification process for evaluators to ensure their ability to recognize evidence and assess teachers accurately. LEAs share in the responsibility of training principals to recognize evidence, including classroom practices and behaviors, that warrants clear, specific feedback and support for improved instructional practices to meet the needs of students (Figure 8).
Figure 8: Effective Partnership
Need for Further Research
Currently, there is a need to implement a comprehensive state-district principal evaluation support partnership
program and evaluate the effectiveness of that program by measuring the correlation among teacher evaluation
ratings, teacher value-added data, and student surveys.
Furthermore, there is a need to develop a means to triangulate data by incorporating a student survey component.
According to the MET project, the data collected from student surveys yield more consistent results than either
classroom observation data or value-added measures (Asking students about teaching, 2012). “Research
indicates that students are the most qualified sources to report on the extent to which the learning experience was
productive, informative, satisfying, or worthwhile” (Theall and Franklin, 2001). Research to analyze and
evaluate the data collected from principal evaluations of teachers, student growth, and student surveys could lead
to more robust and comprehensive principal preparation programs.
Finally, research regarding principal evaluation certification programs that have yielded high correlations between teacher evaluation and achievement data needs to be analyzed to determine how to develop the best professional development, support, and certification programs for North Carolina.
References
Asking students about teaching. (2012). www.metproject.org. Retrieved March 30, 2014, from
http://www.metproject.org/downloads/Asking_Students_Practitioner_Brief.pdf
Batton, D., Britt, C., DeNeal, J., & Hales, L. (2012). NC teacher evaluations & teacher effectiveness: Exploring the relationship between value-added data and teacher evaluations (Project 6.4). Retrieved from http://www.ncpublicschools.org/docs/intern-research/reports/teachereval.pdf
Bill & Melinda Gates Foundation. (2013). Ensuring fair and reliable measures of effective teaching: Culminating findings from the
MET Project’s three year study. Retrieved from
http://metproject.org/downloads/MET_Ensuring_Fair_and_Reliable_Measures_Practitioner_Brief.pdf
Blank, R. K., & de las Alas, N. (2009). Effects of teacher professional development on gains in student achievement: How metaanalysis provides scientific evidence useful to educational leaders. Washington, DC: Council of Chief State Officers.
Brookover, W. B., & Lezotte, L. (1982). Creating effective schools. Holmes Beach, FL: Learning Publication.
Clifford, M., Behrstock-Sherratt, E., & Fetters, J. (2012). The Ripple Effect: A Synthesis of Research on Principal Influence to Inform
Performance Evaluation Design. A Quality School Leadership Issue Brief. American Institutes for Research.
http://files.eric.ed.gov/fulltext/ED530748.pdf
Covey, S. R. (1989). The seven habits of highly effective people: restoring the character ethic. New York: Simon and Schuster.
Darling-Hammond, L., Amrein-Beardsley, A., Haertel, E. H., & Rothstein, J. (2011). Getting teacher evaluation right: A background
paper for policy makers. Retrieved from
http://iaase.org/Documents/Ctrl_Hyperlink/Session_30c_GettingTeacherEvaluationRight_uid9102012952462.pdf
Day, C. (2000). Beyond transformational leadership. Educational Leadership, 57(7), 56-59.
Flach, Tony. (2014, February). Leadership and Data. North Carolina Association of Supervision and Curriculum Development Annual
Conference. Pinehurst, NC.
Ho, A., & Kane, T. (2013). The reliability of classroom observations by school personnel. Retrieved from
http://www.metproject.org/downloads/MET_Reliability%20of%20Classroom%20Observations_Research%20Paper.pdf:
In the States. (n.d.). Common Core State Standards Initiative. Retrieved November 17, 2013, from http://www.corestandards.org/in-the-states
Kimball, S. M., & Milanowski, A. (2009). Examining Teacher Evaluation Validity and Leadership Decision Making Within a
Standards-Based Evaluation System. Educational Administration Quarterly, 45(1), 34-70.
Leithwood, K., & Jantzi, D. (2006). Transformational school
leadership for large-scale reform: Effects on students, teachers, and their classroom practices. School
Effectiveness and School Improvement, 17(2), 201-227.
McCaffrey, J. R., Lockwood, D. F., Koretz, D. M., & Hamilton, L. S. (2003). Evaluating value added models for teacher accountability
[Monograph]. Santa Monica, CA: RAND Corporation. Retrieved from
http://www.rand.org/pubs/monographs/2004/RAND_MG158.pdf
McClellan, C., Atkinson, M., & Danielson, C. (2012). Teacher evaluation training and certification: Lessons learned from the
measures of effective teaching project. [White paper]. Retrieved March 30, 2014,
http://www.teachscape.com/binaries/content/assets/teachscape-marketingwebsite/resources/march_13whitepaperteacherevaluatortraining.pdf.
McGuinn, P. (2012). The state of teacher evaluation reform: State education agency capacity and the implementation of new teacher-evaluation systems. Retrieved from http://www.americanprogress.org/wp-content/uploads/2012/11/McGuinn_TheStateofEvaluation-1.pdf
Mees, G. (2008). The relationships among principal leadership, school culture, and student achievement in Missouri middle schools. National Association of Secondary School Principals. Retrieved March 24, 2014, from https://www.principals.org/Portals/0/content/59554.pdf
Mullins, H. (2014) The PIE cycle of effective professional development. Retrieved February 14, 2014 from:
http://edlstudio.wikispaces.com/Heather+Mullins
National Association of Elementary School Principals & National Association of Secondary School Principals. (2012). Rethinking
principal evaluation: A new paradigm informed by research and practice. Alexandria: Gail Connelly & JoAnn D. Bartoletti.
NGA Center for Best Practices. (2011). Preparing principals to evaluate teachers. Retrieved from http://www.nga.org/cms/home/nga-center-for-best-practices/center-publications/page-edu-publications/col2-content/main-content-list/preparing-principals-to-evaluate.html
North Carolina school executive evaluation process manual. (2012). North Carolina educator evaluation system. Retrieved November 12, 2013, from http://ncees.ncdpi.wikispaces.net/file/view/Principal%20Process%20Manual%202012.pdf/389359046/Principal%20Process%20Manual%202012.pdf
Performance review calibration-building an honest appraisal. (n.d.). Retrieved from
http://www.successfactors.com/en_us/lp/articles/performance-review-calibration.html
Reform Support Network. (2013). Promoting Evaluation Rating Accuracy: Strategic Options for States. Retrieved from
http://www2.ed.gov/about/inits/ed/implementation-support-unit/tech-assist/evaluation-rating-accuracy.pdf
Rivkin, S. G., Hanushek, E. A., & Kain, J. F. (2000). Teachers, schools, and academic achievement (Working Paper W6691).
Cambridge, MA: National Bureau of Economic Research.
Rowan, B., Correnti, R., & Miller, R. J. (2002). What large-scale survey research tells us about teacher effects on student achievement:
Insights from the Prospects study of elementary schools. Teachers College Record, 104, 1525-1567.
Schwab, R. L. (1991). Research-based teacher evaluation: a special issue of the Journal of personnel evaluation in education. Boston:
Kluwer Academic.
Standards assessment inventory 2 - Recommendations: The leadership standard. (n.d.). Learning Forward: The professional learning association. Retrieved March 24, 2014, from http://learningforward.org/docs/sai/leadershiprecommendations.pdf?sfvrsn=2
Standing, L. G. (2004). In The SAGE encyclopedia of social science research methods (Vol. 1). Thousand Oaks, CA: Sage.
Taylor, E. S., & Tyler, J. H. (2012, Fall 2012). Can teacher evaluation improve teaching? Education Next, 12. Retrieved from
http://educationnext.org/
Theall, M., & Franklin, J. (2001). Looking for bias in all the wrong places – A search for truth or a witch hunt in student ratings of
instruction?. New Directions for Institutional Research, 27(5), 45-56.
Tomberlin, T. (2014). READY Principals Spring 2014. NCEES. Retrieved March 26, 2014, from
http://ncees.ncdpi.wikispaces.net/READY+Principals+Spring+2014
Weisberg, D., Sexton, S., Mulhern, J., Keeling, D., Schunck, J., Palcisco, A., & Morgan, K. (2009). The widget effect: Our national
failure to acknowledge and act on differences in teacher effectiveness. Retrieved from
http://widgeteffect.org/downloads/TheWidgetEffect.pdf
Wright, S. P., Horn, S. P., & Sanders, W. L. (1997). Teachers and classroom context effects on student achievement: Implications for
teacher evaluation. Journal of Personnel Evaluation in Education, 11, 57-67.
Yoon, K. S., Duncan, T., Lee, S. W. Y., Scarloss, B., & Shapley, K. (2007). Reviewing the evidence on how teacher professional development affects student achievement (Issues & Answers Report, REL 2007-No. 033). Washington, DC: U.S. Department of Education, Institute of Education Sciences, National Center for Education Evaluation and Regional Assistance, Regional Educational Laboratory Southwest. Retrieved from http://ies.ed.gov/ncee/edlabs
Appendix A: Effectiveness of Professional Development: Features and Effects by Study
Carpenter et al., 1989 (RCT) | Math | 1st grade
Student outcomes examined: computation and math problem-solving scores on the Iowa Test of Basic Skills, Level 7
• Iowa Test of Basic Skills Level 7, computation: effect size 0.41; not significant, but substantively important; improvement index 16
• Iowa Test of Basic Skills Level 7, problem-solving: effect size 0.41; not significant, but substantively important; improvement index 16

Cole, 1992 (RCT) | Math, Reading, and English/Language Arts | 4th grade
Student outcomes examined: students’ computation and math problem-solving scores on the Iowa Test of Basic Skills, Level 7; students’ reading comprehension test scores on the Gates-MacGinitie Test
• Average for math: effect size 0.50; statistically significant; improvement index 19
• Average for reading: effect size 0.82; statistically significant; improvement index 29
• Average for language: effect size 0.24; not significant; improvement index 9

Duffy et al., 1986 (RCT) | Reading and English/Language Arts | 5th grade
• Gates-MacGinitie Reading Test: effect size 0.00; not significant; improvement index 0

Marek & Methven, 1991 (QED) | Science | Kindergarten - 3rd grade, 5th grade
Student outcomes examined: students’ conservation reasoning as measured by Piagetian cognitive tasks
• Average for conservation test: effect size 0.39; statistically significant; improvement index 15

McCutchen et al., 2002 (QED) | Reading and English/Language Arts | Kindergarten - 1st grade
Student outcomes examined: students’ alphabetics (Test of Phonological Awareness), orthographic fluency (a timed alphabetic writing task), comprehension (the comprehension subtest of the Gates-MacGinitie Reading Tests), and writing skills (a composition task)
• Gates-MacGinitie Word Reading Subtest: effect size 0.39; statistically significant; improvement index 15

McGill-Franzen et al., 1999 (RCT) | Reading and English/Language Arts | Kindergarten
Student outcomes examined: students’ receptive language skills (the Peabody Picture Vocabulary Test) and early literacy skills (subtests of the Concepts about Print and Diagnostic Survey)
• Concepts about print: effect size 1.11; statistically significant; improvement index 37
• Letter identification: effect size 0.69; statistically significant; improvement index 25
• Writing vocabulary: effect size 0.32; not significant, but substantively important; improvement index 13
• Ohio Word Test: effect size 0.66; not significant, but substantively important; improvement index 24
• Hearing the sounds in words: effect size 0.97; statistically significant; improvement index 33
• Peabody Picture Vocabulary Test: effect size 0.12; not significant; improvement index 5

Saxe et al., 2001 (QED) | Math | 4th - 5th grade
Student outcomes examined: students’ concepts and computation of fractions, as assessed by a 29-item, 40-minute timed measure developed by the authors
• Fraction concepts: effect size 2.39; statistically significant; improvement index 49
• Fractions computation: effect size -0.53; not significant, but substantively important; improvement index -20

Sloan, 1993 (RCT) | Math, Science, and English/Language Arts | 4th - 5th grade
Student outcomes examined: students’ reading, math, and science scores, measured by the Comprehensive Test of Basic Skills
• Comprehensive Test of Basic Skills, reading: effect size 0.68; not significant, but substantively important; improvement index 25
• Comprehensive Test of Basic Skills, math: effect size 0.26; not significant, but substantively important; improvement index 10
• Comprehensive Test of Basic Skills, science: effect size 0.63; not significant, but substantively important; improvement index 23

Tienken, 2003 (RCT with group equivalence problems) | Reading and English/Language Arts | 4th grade
Student outcomes examined: students’ narrative writing, as measured by content/organization scores on a standardized writing test administered as part of New Jersey’s Elementary School Proficiency Assessment
• Content/organization score on narrative writing test: effect size 0.41; not significant, but substantively important; improvement index 16

Source: Adapted from Yoon et al. (2007)
Appendix B: Adaptation of “The Ripple Effect”