Building Principals’ Instructional Capacity through Focused State and Local Support

Executive Summary

“Leadership is second only to teaching among school influences on student success.” - Clifford & Ross (2012)

Our national trend toward using Value-Added Modeling (VAM) to measure educator effectiveness has provided administrators with hard data to use as they evaluate teachers, analyze professional development needs, and support teacher professional growth. Unfortunately, although many states have made great strides in using these data to evaluate teachers, there are still gaps in the professional development site-based administrators receive to ensure they understand high-quality instruction, evaluate teachers appropriately, and provide clear, specific instructional feedback that leads to improved teacher practice. Modern technology allows for an analysis of student test scores that provides detailed information about how much growth a student achieves during one school year. In North Carolina, these data have revealed a distinct discrepancy between teacher evaluation ratings and the corresponding value-added scores, indicating a need for deeper, sustained professional development.

One strategy to support principals as they evaluate teachers fairly and effectively, and as they gain the pedagogical understanding needed to provide accurate feedback about classroom instruction that will improve student learning outcomes, is a partnership between the state education agency and district leadership. At the state level, stronger inter-rater reliability and calibration systems are needed for accurate and meaningful evaluations. At the district level, support is necessary to provide principals with sustained, differentiated professional development targeted toward specific recommendations for teacher growth. This dual-support approach will create a ripple effect that impacts not only principal instructional leadership but also teacher effectiveness and, ultimately, student learning outcomes.

Figure 1: State/District Dual-Support Approach

State:
• Develop a calibration system to ensure inter-rater reliability
• Create an evaluation certification program to identify principals who demonstrate proficiency in teacher evaluation
• Facilitate ongoing regional principals’ PLCs

District:
• Identify trends in evaluation data
• Facilitate monthly data meetings with principals and develop SMART goals
• Provide time for collegial conversations and evaluation coaching sessions using collaborative walkthroughs, with an emphasis on instructional feedback

Introduction

Tony Flach (2014), National Practice Director at Houghton Mifflin Harcourt, contends that instructional leadership should be defined as “the ability to guide adults to improve instruction through the creation of favorable learning environments, building of adult content and pedagogical knowledge, and explicit monitoring of the learning of both adults and students.” The principal as “instructional leader” is a relatively new concept that began to materialize during the early 1980s. Prior to this shift in responsibilities, most principals functioned as managers and operational leaders. The movement toward academic and instructional leadership was influenced by research of the era indicating that effective schools most often had principals who understood and articulated the importance of instructional leadership (Brookover & Lezotte, 1982).
Another major shift in education was the ability to access data that clearly and accurately revealed, without bias, student achievement, discipline, demographic disparities, and retention rates. Suddenly, public education became much more public, with widespread access to data. This movement further underscored the need for an instructional leader in the principalship.

The high demands of the principalship are difficult to prioritize, and historically principals have overlooked the importance of the evaluation process. Now, in the second decade of the 21st century, educational leaders have access to even more data that highlights the alignment, and misalignment, between evaluation and rating. Teacher evaluations are more important than ever, with high-stakes decisions, such as tenure and performance pay, connected to evaluation results. Principals, assistant principals, and other educational leaders designated to evaluate teachers are responsible for accurate and reliable evaluations (NGA Center for Best Practices, 2011). The benefits of teacher evaluation are rarely immediate, however; too often, principals are handling problems that demand immediate attention and an immediate solution, and evaluation loses its urgency. Numerous studies provide evidence that rigorous classroom observations combined with additional data measures will yield an accurate evaluation of teacher effectiveness (Bill & Melinda Gates Foundation, 2013; Ho & Kane, 2013; Taylor & Tyler, 2012). A focus on accurate teacher evaluation is necessary.

Ongoing professional development provided through a dual-support model between the state education agency and local district leadership is crucial to improving the quality of instructional feedback administrators can provide teachers. This model, if implemented successfully, will result in improved teacher effectiveness. A conceptual understanding of how a principal’s leadership impacts teaching, learning, and student learning outcomes is the first step to understanding the need for change in principal preparation and development. Furthermore, a framework of state- and district-level support will ensure principals hone the skills needed to improve and inform instruction. Ongoing, aligned, and monitored professional development will also lead to more reliable and valid evaluations. Principals will develop a deeper understanding of content standards, pedagogy, and instructional design, allowing them to provide the clear, specific, and constructive feedback that leads to improved student learning outcomes. One-shot professional development is not enough to inform principals’ understanding of instructional leadership.

Research clearly supports the need for teacher evaluation, but little attention has been given to ensuring that evaluators are trained and certified to make subjective decisions regarding a teacher’s performance. During the 2009-10 school year, the North Carolina Department of Public Instruction conducted statewide, two-day, train-the-trainer professional development on the new North Carolina Professional Teaching Standards, the evaluation process, and “look-fors” during classroom observations. Since that time, training of new administrators and review sessions for experienced administrators have become the responsibility of each school district.
State Board of Education Policy TCP-C-004, which establishes the Teacher Performance Appraisal process, states under Component 1 (Training): “Before participating in the evaluation process, all teachers, principals and peer evaluators must complete training on the evaluation process.” The consistency, quality, and fidelity of these trainings are unknown, yet the evaluation of teachers must be purposeful, reliable, and valid. The comprehensive study by Yoon et al. (2007) indicates that the duration of professional development plays an important role in the success of an initiative, and that follow-up and ongoing, job-embedded opportunities for discussion, feedback, and continued emphasis on the professional development are all integral to successful implementation.

Creating experiences where principals can learn from reflecting on their practice is an important part of the learning process. Standards one, two, and three of the North Carolina Educator Evaluation Standards for North Carolina School Executives require principals to reflect on their practice in the areas of strategic leadership, instructional leadership, and cultural leadership (North Carolina School Executive: Principal and Assistant Principal Evaluation Process Manual, 2012). One integral way district leaders can ensure principals have the opportunity to support teachers and grow as instructional and cultural leaders is to provide ongoing professional learning opportunities aligned with state goals, either as part of their regularly scheduled professional learning communities (PLCs) or in follow-up sessions designed around reflection, sharing, and feedback. The goals of the state/district partnership would be to develop strong inter-rater reliability and instructional leadership between and among site-based administrators across the state. Currently, in North Carolina, there are major discrepancies between evaluation ratings and teachers’ VAM scores. To ensure principals have a deep understanding of the standards, how to rate teachers, and how to provide strong instructional feedback, changes must be made in the current model of support.

Background

North Carolina principals currently face a variety of challenges previously unknown to the role. In 2010, North Carolina adopted new standards for every content area and grade level: the Common Core State Standards (CCSS) in English Language Arts and Mathematics, as well as the Common Core State Standards for Literacy in History, Social Studies, Science, and Technical Subjects. In addition to these standards, North Carolina adopted the North Carolina Essential Standards (NCES) for all other grade levels and content areas (In the States, n.d.). According to the North Carolina Department of Public Instruction (NCDPI) (2011), the new standards are based on a philosophy of teaching and learning consistent with current research, best practices, and new national standards. The NCDPI contends that the new North Carolina Standard Course of Study (NCSCoS) is designed to support the state’s educators as they provide the most challenging education possible for North Carolina students. The ultimate goal of these new standards is to prepare all students for a career and/or college. Not only do principals have the important task of providing teachers with high-quality instructional feedback to improve their performance, but they must also take time to learn the new content standards across the board.
Now, more than ever, principals need sustained and focused professional development to support them as they evaluate the quality of teaching and provide instructional feedback that improves student learning outcomes. For almost two decades, quality teaching has been consistently identified by researchers as the most important school-based factor in student achievement (McCaffrey, Lockwood, Koretz, & Hamilton, 2003; Rivkin, Hanushek, & Kain, 2000; Rowan, Correnti, & Miller, 2002; Wright, Horn, & Sanders, 1997). The instructional guidance, support, and feedback that teachers receive from principals are imperative in improving their practice. Research has shown that evaluation is more effective when the evaluators are trained (Darling-Hammond et al., 2011), and trainings should include resources that support the evaluation process (McGuinn, 2012). A high-quality professional development partnership is the key to successful implementation of any teacher evaluation system. States must rethink the way evaluators have been trained in the past and develop a new model designed to grow instructional leaders through ongoing training, modeling, collaboration, and support.

The Widget Effect (Weisberg et al., 2009) describes school districts’ assumption that teacher effectiveness is the same from teacher to teacher. Teachers are not viewed as individual professionals but rather as “interchangeable parts.” The report suggests that better evaluation will not only improve teaching to benefit students, but will also benefit teachers by treating them as professionals.

Characteristics of the Widget Effect in teacher evaluation:
• All teachers are rated good or great.
• Excellence goes unrecognized.
• Inadequate professional development is provided.
• No special attention is given to novice teachers.
• Poor performance goes unaddressed.

The Widget Effect is simply another indicator that principal instructional leadership has taken a back seat to the managerial and organizational components of the principal’s role. Without a clear emphasis on evaluation from the state, along with ongoing opportunities for discourse, professional development, and support, the quality and accuracy of teacher evaluations is not likely to improve. As more and more states turn to measuring student growth using VAM, such as the Education Value-Added Assessment System (EVAAS) from the SAS Institute, the Widget Effect is more prominent than ever. VAM data have revealed notable discrepancies between evaluation ratings and student learning outcomes.

Value-added assessment systems such as EVAAS provide individual teacher, school, district, and state growth data. The North Carolina State Board of Education implemented EVAAS data as part of teacher and principal evaluations during the 2011-12 school year. EVAAS estimates the effectiveness of teachers, schools, and districts with regard to student achievement and provides multiple reports aimed at analyzing student and teacher performance on standardized assessments. A multifactorial correlation study was conducted to analyze the relationship between teacher performance evaluation ratings and EVAAS student achievement data. The dataset included 11,430 North Carolina teachers in 35 local education agencies (LEAs) having both EVAAS scores and performance evaluation ratings assigned in the 2010-11 school year.
Although 46,000 teachers had evaluation data for 2010-11, only around 11,000 of those also gave an end-of-grade (EOG) or end-of-course (EOC) assessment, and only those teachers received an EVAAS score. The study found a narrow distribution of evaluation ratings: out of the 11,000 teachers, the 100 teachers with the best student achievement data received the same ratings as the 100 teachers with the worst achievement data. The study did not find a correlation between performance evaluation data and EVAAS data (Batton, Britt, DeNeal, & Hales, 2012). This finding alone demonstrates a deep need to better prepare evaluators to provide instructional feedback on content, pedagogy, and instructional design. A comprehensive system of support from district and state agencies is mandatory.

Conceptual Frameworks

In 2012, the American Institutes for Research published a report titled “The Ripple Effect” to examine principals’ influence on student achievement. The report provided a conceptual framework for understanding the role of the principal in terms of instruction, the direct effects of the principal’s practice on both teachers and the school as a whole, and the indirect effects of the principal’s impact on classroom instruction and learning. A current, personalized iteration of “The Ripple Effect” framework helps to identify the significance of a state-district partnership in improving principals’ practice, improving instructional quality, and impacting student learning outcomes (Clifford et al., 2012). This iteration of the framework suggests that principals need strong, focused professional learning opportunities and support to directly impact teacher quality, which ultimately can have an indirect effect on student achievement (Figure 2).

Figure 2: Adaptation of “The Ripple Effect”

To ensure that the professional development and support provided by state and local agencies result in improved student learning outcomes, both agencies must make a concerted effort to address the components of planning, implementation, and evaluation (PIE) in the professional development cycle (Figure 3).

Figure 3: The PIE Cycle of Professional Development

One-shot, fragmented workshops lasting 14 hours or less show no statistically significant effect on student learning (Darling-Hammond, Wei, Andree, Richardson, & Orphanos, 2009). Effective professional-development programs are job-embedded and provide participants with five critical elements:
• Collaborative learning: Opportunities to learn in supportive groups where content is organized
• Clear, evident links among curriculum, assessment, and professional-learning decisions as related to specific teaching contexts: Emphasis on the importance of developing content knowledge, specifically in math and science, as well as pedagogical understandings specific to content areas (Blank, de las Alas, & Smith, 2008; Blank & de las Alas, 2009; Heller, Daehler, Wong, Shinohara, & Miratrix, 2012)
• Active learning: Application of new knowledge and opportunities to receive feedback; use of ongoing data to reflect on how teaching practices influence student learning over time
• Emphasis on deep content knowledge and pedagogy: Direct ties to how to teach content through new techniques and strategies
• Sustained learning over multiple days and weeks: Engagement in 30 to 100 hours of learning over six to twelve months to increase student achievement

Every educational institution in our country plans professional development in hopes of improving the quality of teaching and student learning outcomes. However, research has shown that very specific and deliberate steps must be followed in the PIE cycle to ensure not only that participants benefit from professional development but also that student learning outcomes improve. Several important components that have traditionally been overlooked must be addressed, including the allocation of funding, time, and follow-up to ensure participants receive rich instruction and the ongoing support necessary to produce overarching positive changes in teacher practice.

Little research in the field measures the impact of professional development on student learning outcomes. However, a great deal of research has been conducted on creating and evaluating high-quality professional development. An examination of the three meta-analyses cited above provides compelling empirical evidence regarding the qualities of professional development that lead to student gains. Those qualities can be represented in the three distinct elements of the PIE cycle: the outer ring of the cycle represents the ongoing components of high-quality professional development that increases student achievement, and the three inner components represent the phases that districts or schools must adhere to in order to ensure success.

It is no longer acceptable practice to make assumptions about the impact of professional development. Advances in technology include more online assessment opportunities for students, an increase in state and federal student testing mandates, and VAM models that provide clear and undeniable data regarding the impact of instruction on student growth. These data points provide the opportunity to assess the impact of professional development more accurately than ever before. This deeper understanding uncovers a need for a monumental change in training, support systems and structures, and follow-up for practicing principals. According to the Learning Forward Center for Results, “policies, resources, calendars, daily schedules, coaches, and budgets influence the quality and results of collaborative professional learning and may need to be discussed and altered for alignment” (Standards Assessment Inventory, n.d.). State and district leaders must think creatively and collaboratively to develop instructional leadership capacity in principals and to create structures that ensure success.

Central Research Questions

• How can the North Carolina Department of Public Instruction and school districts develop and implement a sustainable plan to ensure that teacher evaluators have the requisite knowledge and skills to evaluate and provide feedback that ensures teacher growth and effectiveness?
• How can state education departments support the professional development needed for improved rater agreement within the teacher evaluation process?
• How can districts support principals to ensure they become instructional leaders?
• What must happen before, during, and after professional development that leads to improved student learning outcomes?

Methodology

NCDPI conducted a multi-factorial correlation study to identify a correlation coefficient between composite teacher evaluation ratings and value-added scores from the Education Value-Added Assessment System (EVAAS). A statewide analysis, conducted by Mr. Dayne Batten, used data from the 2011-12 school year to determine the relationship between North Carolina Educator Evaluation ratings and student growth for 26,260 teachers. Statewide and district correlations were calculated to identify trends and potential errors. Teacher evaluation data consisted of ratings on each of the five standards or, for career-status teachers, ratings on standards one and four only. All available rating data were used to assign each teacher a mean evaluation score, and all available EVAAS scores for a teacher were averaged to identify a composite score. Value-added scores were standardized within each class to a mean of zero and a standard deviation of one. Value-added scores from EVAAS were calculated for all teachers who administered end-of-grade assessments (grades 4-8), end-of-course assessments (grades 9-12), or a Career and Technical Education post-assessment (Batton, Britt, DeNeal, & Hales, 2012).

Educational Administration Quarterly published “Examining Teacher Evaluation Validity and Leadership Decision Making Within a Standards-Based Evaluation System” (Kimball & Milanowski, 2009), a study of one school district’s implementation of a standards-based teacher evaluation system and the variation in validity across evaluators. Using mixed-methods research, 23 school leaders with “more” and “less” valid results were identified using teacher evaluation ratings and value-added student achievement data. Interviews with all 23 school leaders were conducted to learn about their attitudes toward teacher evaluation, their decision-making strategies, and their school contexts. Eight of the principals were examined as a subset of the data, all having two consistent years of validity scores to analyze (n = 4 more valid and n = 4 less valid).

In 2011-12, 337 teachers were provided cameras by the Bill and Melinda Gates Foundation to videotape their lessons for a correlational study to determine inter-rater agreement among evaluators; these lessons would contribute to a video library of teaching practices. Teachers were asked to record their classroom lessons twenty-five times during the 2011-12 school year, using the digital video cameras and microphones provided. One hundred six of these teachers were from Hillsborough County, Florida, and sixty-seven of the Hillsborough teachers consented to having their lessons scored by administrators and peers, following the district’s observation protocol. Administrators and peer observers were recruited to participate; in the end, 129 raters (53 principals and assistant principals and 76 peer raters) scored videos. Ho and Kane (2013) reported the results of this inter-rater agreement study in their research report, The Reliability of Classroom Observations by School Personnel.
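To make the correlation procedure in the NCDPI analysis above concrete, the sketch below mirrors its basic steps on synthetic data: average each teacher's ratings into a composite, standardize the value-added scores to a mean of zero and a standard deviation of one, and compute a Pearson correlation. The column names and data are hypothetical illustrations, not NCDPI's actual dataset or code, and the standardization is done globally here rather than within each class.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 500

# Hypothetical teacher records: a composite evaluation rating (mean of
# the rated standards, on a 1-5 scale) and a raw value-added estimate.
df = pd.DataFrame({
    "mean_eval_rating": rng.normal(3.8, 0.4, n).clip(1, 5),
    "value_added": rng.normal(0.0, 1.2, n),
})

# Standardize value-added scores to mean 0, standard deviation 1.
df["value_added_z"] = (
    df["value_added"] - df["value_added"].mean()
) / df["value_added"].std()

# Pearson correlation between composite ratings and standardized growth.
r = df["mean_eval_rating"].corr(df["value_added_z"])
print(f"r = {r:.3f}")
```

Because the two synthetic columns are drawn independently, r lands near zero here, which is qualitatively the pattern the Batton et al. (2012) study reported for real evaluation ratings and EVAAS scores.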
Yoon et al. (2007) conducted one of the most significant research studies in the field of professional development. Their report, Reviewing the Evidence on How Teacher Professional Development Affects Student Achievement, examined more than 1,300 studies against the What Works Clearinghouse Evidence Standards for Reviewing Studies. A team of researchers from the American Institutes for Research reviewed and examined the findings from these studies. Discouragingly, only nine of the studies met the What Works Clearinghouse (WWC) evidence standards (Figure 4).

Figure 4: Nine Studies that Met the What Works Clearinghouse Evidence Standards
1. Carpenter et al., 1989 (randomized controlled trial)
2. Cole, 1992 (randomized controlled trial)
3. Duffy et al., 1986 (randomized controlled trial)
4. Marek & Methven, 1991 (quasi-experimental design)
5. McCutchen et al., 2002 (quasi-experimental design)
6. McGill-Franzen et al., 1999 (randomized controlled trial)
7. Saxe et al., 2001 (quasi-experimental design)
8. Sloan, 1993 (randomized controlled trial)
9. Tienken, 2003 (randomized controlled trial with group equivalence problems)
Source: Authors’ synthesis of studies described in the text (Yoon et al., 2007).

Although these studies revolved around teacher professional development, the ultimate goal was to impact student learning outcomes. From the nine studies that met evidence standards, two methodologies emerged as most appropriate for ensuring success: the randomized controlled trial and the quasi-experimental design. The average effect size across the nine studies was 0.54, with sizes ranging from -0.53 to 2.39 (Appendix A). Professional development, for teachers or administrators, should be focused not only on participant learning but also on impacting student achievement. Thus, findings from the Yoon study are significant to consider when developing a research method for evaluating the effectiveness of an inter-rater agreement professional development program.

A variety of research methods have been used to examine the observation and rating process for classroom teachers. It is important to note that while correlational and mixed-methods studies are appropriate designs for evaluating inter-rater agreement, quasi-experimental designs and randomized controlled trials are most appropriate for measuring the effectiveness of professional development on student learning outcomes.

Author(s) | Year | Methodology | Title
Hollingsworth | 2012 | Qualitative case study | Empowering Teachers to Implement Formative Assessment
Weisberg et al. | 2009 | Mixed methods | The Widget Effect
Taylor & Tyler | 2012 | Quasi-experimental | Can Teacher Evaluation Improve Teaching?
Ho & Kane | 2013 | Generalizability study | The Reliability of Classroom Observations by School Personnel
Clifford et al. | 2012 | Qualitative meta-analysis | The Ripple Effect

Results

Teacher performance evaluation is at the center of school reform efforts nationwide, and understanding the link between evaluation and teacher performance is key to improving both instruction and student learning outcomes. Taylor and Tyler (2012) studied Ohio mid-career teachers to determine the correlation between value-added data and student achievement before, during, and after evaluation. For many years, teachers were not evaluated, yet student achievement data were collected for their students in the form of end-of-year assessments.

Figure 5: Improvement Through Evaluation

Key findings included that teachers are more productive in post-evaluation years, which supports the conclusion that evaluation is a significant factor in teacher growth (Figure 5).
Evaluations consisted of “multiple, highly structured classroom observations conducted by experienced peer teachers and administrators.” The research indicates that teachers could increase knowledge and information through the “formal scoring and feedback routines of an evaluation program.” Taylor and Tyler found that evaluation could also inspire classroom teachers to be “more self-reflective, regardless of the evaluative criteria.” Finally, the study revealed that having an evaluation process could create more opportunities for instructional conversations with other teachers and administrators about effective pedagogy. This study is significant in establishing a link between high-quality teacher evaluation and performance. As indicated by Figure 5, teachers not only performed better during the year they were evaluated but also continued to demonstrate growth on student assessments.

North Carolina Professional Teaching Standards
I – Teachers demonstrate leadership.
II – Teachers establish a respectful environment for a diverse population of students.
III – Teachers know the content they teach.
IV – Teachers facilitate learning for their students.
V – Teachers reflect on their practice.
VI – Teachers contribute to the academic success of students.**
** New standard (2011-12)

North Carolina is also in the midst of an evaluation reform effort to include value-added data in a teacher’s status. Previously, North Carolina depended solely upon principal evaluations to determine a teacher’s status from year to year. Beginning in school year (SY) 2011-12, teachers were rated on a new standard, Standard 6, based solely on growth as measured by a value-added score, and were assigned an overall effectiveness status based on that growth. North Carolina identified three possible statuses for teachers: does not meet expected growth, meets expected growth, or exceeds expected growth. North Carolina determined that teachers who do not meet expected growth for three consecutive years may lose their employment within the LEA. Moreover, principals and assistant principals also receive an overall effectiveness status on Standard 8 of the North Carolina Principal Evaluation, which is populated based on the overall value-added data of the school.

Tom Tomberlin, Director of District Human Resources Support at the North Carolina Department of Public Instruction, developed Figure 6 to demonstrate the correlation between each of the North Carolina Professional Teaching Standards (I-V) and Standard 6. Standard 6 has a low correlation with each of the other standards (between .173 and .205 in 2011-12 and between .167 and .198 in 2012-13). However, the chart also shows high correlations among the other five standards (all around 0.70). The strong correlations among standards 1 through 5 indicate that when principals evaluate teachers, they tend to rate teachers the same on all five standards instead of considering each standard separately (Tomberlin, 2014).

Figure 6: NCPTS Correlation 2011-12 & 2012-13

Ho and Kane (2013) reported several key findings in their research report, The Reliability of Classroom Observations by School Personnel, that relate to the need for multiple measures in rating and for additional professional development for administrators:
• Observers rarely used the top or bottom categories of the four-point rating scale.
• Administrators rated their own teachers 0.1 points higher than administrators from other schools did, and 0.2 points higher than peer raters did.
• First impressions make a difference:
Administrators who had an initial negative impression of a teacher tended to score that teacher lower in subsequent observations.

These findings indicate a need for multiple measures in teacher evaluation to account for human error and bias. Furthermore, principals need ongoing professional development to evaluate accurately (Figure 7).

Figure 7: Inter-Rater Agreement Findings

Discussion and Conclusions

College and Career Readiness (CCR) standards for students emphasize the need for quality teachers in classrooms teaching our 21st-century students. Many states, including North Carolina, have rolled out new evaluation systems for teachers and principals. Even with new evaluations in place, there is still motivation for continued reform, as states rate 99 percent of their teachers as effective or better (Reform Support Network, 2013). Additional training for evaluators to rate teachers efficiently and accurately is crucial.

Inter-rater reliability and inter-rater agreement are essential professional development topics for evaluators of teachers, and the two are often confused. Inter-rater reliability is the relative similarity between multiple sets of raters; inter-rater agreement is the frequency with which multiple raters assign the same rating (Graham, 2011). The sketch following this section illustrates the distinction. Inter-rater agreement is a form of calibration that is significant in ensuring that teachers get accurate feedback on performance. Performance review calibration improves the reliability of a rating system for evaluation. Calibration promotes “honesty and fairness” in the rating system; it is a process in which multiple observers or evaluators collaborate and discuss performance ratings based on objective evidence (“Performance Review Calibration,” n.d.). The calibration process provides a platform of common language and understanding of the professional teaching standards and instructional practices.

Currently, North Carolina lacks a robust calibration system, and principal support for evaluation is delivered in short professional development sessions that are often attended by a small percentage of principals across the state. Furthermore, the professional development provided to principals regarding teacher evaluation is disconnected and varies widely. Because no current policies or procedures mandate principal preparation for evaluation, no consistency exists among LEAs in providing principals with the training they need to evaluate educators effectively. Principals must be prepared to evaluate teachers.
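A minimal sketch of the reliability/agreement distinction, using synthetic ratings: the one-point offset between the two hypothetical raters is an illustrative assumption chosen to make the two statistics diverge sharply.

```python
import numpy as np

# Two raters scoring the same ten lessons on a 1-5 scale; rater B runs
# exactly one point higher than rater A (an illustrative assumption).
rater_a = np.array([3, 4, 2, 4, 3, 4, 3, 2, 4, 3])
rater_b = rater_a + 1

# Inter-rater reliability: relative similarity -- do the raters rank the
# lessons the same way? A constant offset leaves the correlation perfect.
reliability = np.corrcoef(rater_a, rater_b)[0, 1]

# Inter-rater agreement: how often do the raters assign the same rating?
# The one-point offset means they never agree exactly.
agreement = np.mean(rater_a == rater_b)

print(f"reliability (Pearson r): {reliability:.2f}")  # 1.00
print(f"exact agreement rate:    {agreement:.0%}")    # 0%
```

Two raters can thus be perfectly reliable while never agreeing, which is why calibration work targets agreement on the rating scale itself, not just consistency of rank ordering.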
Edward Thorndike was the first person to study the Halo Effect, a bias in which one characteristic of a person influences an observer’s assessment of that person’s other characteristics. “Also known as the physical attractiveness stereotype and the ‘what is beautiful is good’ principle, the Halo Effect, at the most specific level, refers to the habitual tendency of people to rate attractive individuals more favorably for their personality traits or characteristics than those who are less attractive. The Halo Effect is also used in a more general sense to describe the global impact of likeable personality, or some specific desirable trait, in creating biased judgments of the target person on any dimension. Thus, feelings generally overcome cognitions when we appraise others” (Standing, 2004).

Thorndike applied his theory to teacher evaluation. His goal was to understand how one quality of a teacher defined or influenced the assessment of other characteristics during evaluation. Similar to Tomberlin’s analysis of teacher ratings, Thorndike found that high ratings of one particular quality or element correlated with similarly high ratings of other, unrelated characteristics (see Figure 6). Moreover, negative ratings of a specific characteristic or quality likewise led to lower ratings of other characteristics. The Halo Effect exemplifies the psychometric flaw of rater bias: the teaching and learning process is complex, and the subjectivity of what good teaching looks like lends itself to biased evaluation. The Halo Effect offers one reason why North Carolina teacher ratings on standards one through five correlate so strongly. Now that high-stakes decisions are linked to the evaluation, concern has been raised about the lack of correlation between teacher ratings based on principal evaluation and student learning outcomes. Undeniably, principals must identify areas of strength and areas for improvement for teachers, as outlined by the evaluation standards or goals. Crucial information can be exchanged when educators collaborate on improving effectiveness. Thus, principals need intensive training on evaluating teachers and giving feedback (Reform Support Network, 2013).

Race to the Top (RttT), a $4.35 billion United States Department of Education initiative to reform K-12 education, was awarded to twelve states. Many RttT recipients have voiced an interest in building video libraries of classroom instruction. Video brings “real world” visuals of the classroom to a professional development opportunity. Inter-rater reliability training could be supported by integrating video with validated ratings that serve as guiding documentation for agreement. A completed rubric would be considered the guiding rubric for a specific video and would include explicit documentation to support the ratings for the teacher. For example, a group of evaluators could view a 30-minute classroom lesson and use a teacher evaluation tool to document the teacher behaviors it identifies. The video would have already been viewed and deeply analyzed for evidence supporting the rating of the teacher as outlined by the guiding rubric. There are limitless opportunities to use video libraries for professional development for all stakeholders; a calibration check against a master-scored video might look like the sketch below.
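As a hypothetical illustration of scoring against a guiding rubric, this sketch compares one evaluator's ratings of a training video with the validated master ratings. The element names, the scores, and the within-one-level "adjacent agreement" convention are assumptions made for illustration, not any state's actual certification rules.

```python
# Validated master ratings for one training video (hypothetical values),
# and one trainee evaluator's scores on the same five standards.
master  = {"standard_1": 4, "standard_2": 3, "standard_3": 4,
           "standard_4": 3, "standard_5": 4}
trainee = {"standard_1": 4, "standard_2": 4, "standard_3": 4,
           "standard_4": 2, "standard_5": 5}

# Exact agreement: the trainee matched the master rating outright.
exact = sum(trainee[k] == master[k] for k in master)

# Adjacent agreement: the trainee landed within one rating level.
adjacent = sum(abs(trainee[k] - master[k]) <= 1 for k in master)

print(f"exact agreement:    {exact}/{len(master)}")     # 2/5
print(f"adjacent agreement: {adjacent}/{len(master)}")  # 5/5
```

A certification process of the kind recommended later in this paper might, for example, require a minimum exact-agreement rate across several master-scored videos before an evaluator rates teachers independently.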
“To develop skilled classroom observers, training must be thorough, careful, and well structured. Observers’ understanding of the application of the rubric must be reviewed frequently, and feedback that corrects misunderstandings must be given as soon as possible” (McClellan et al., 2012).

Professional development should also include collaborative conversations around the standards, how to interpret data, and practice scoring (Reform Support Network, 2013, p. 1). Numerous states already use video to train their evaluators (Reform Support Network, 2013, p. 2). The key to these trainings is ensuring that they follow the PIE model for professional development, so that appropriate and sufficient coaching, follow-up, duration, and reflection take place to produce change in practice.

Research has shown that evaluation is more effective when the evaluators are trained (Darling-Hammond et al., 2011). A variety of professional development formats, including face-to-face delivery and online modules, are currently used to train principals. Trainings should include resources that support the evaluation process (McGuinn, 2012). Ongoing, sustained professional development will be the key to successful implementation of any evaluation system. Yoon et al. (2007) relate that professional development trainings lasting between 30 and 100 hours “were more likely to have an impact on participants’ student achievement than programs that provided fewer hours” (p. 6). Furthermore, “studies that had more than 14 hours of professional development showed a positive and significant effect on student achievement from professional development.” According to McClellan et al. (2012), principals need evaluation training that addresses observer bias, provides opportunities to analyze and use the evaluation tool, and uses video of real classrooms to help principals gain a greater understanding of calibration. The more practice principals have, the more accurately they are likely to evaluate. Furthermore, the more candid and honest the conversations during training, the more likely principals are to build inter-rater agreement within and across LEAs.

Recommendations/Implications

States must rethink the way evaluators have been trained in previous models, expanding training to focus on inter-rater and intra-rater reliability and ensuring that adequate time is provided for implementation and ongoing evaluation. The principal’s active role in attending, supporting, understanding, implementing, and evaluating professional development is crucial to the success of any professional development endeavor. However, for principals to build this type of instructional leadership capacity, the state and local districts must work together to ensure principals have the training, support, and understanding to affect student learning outcomes through their work with teachers.

Successful school reform is impossible without the principal (Day, 2000). Leadership studies cite the school leader as an undeniably crucial component in the successful completion of any change initiative (Leithwood et al., 2006). Without the principal’s active support, professional development initiated by the district is unlikely to be successful. Principals must be actively involved in the process and must be held accountable for the professional development. Schlechty (2001) contends that principals must function as part of the district team and reminds site-based administrators that they are as responsible for classroom instruction as the teachers themselves.
Ensuring that principals understand their role as instructional leaders, and that they are provided with the skills needed to assume this role, might seem an expectation that could go unspoken in every state and school district. However, we can no longer assume principals are inherently instructional leaders and quality evaluators. It is the responsibility of universities, the state, and the local district to continue to support principal growth by providing ongoing opportunities to evaluate, discuss, collaborate, and reflect on classroom instruction and evaluation ratings. A variety of researchers (Andrews & Soder, 1987; Hallinger & Heck, 1996; Hallinger et al., 1996; Leithwood et al., 2006; Waters et al., 2003) have concluded that principals do have some degree of impact on student achievement, but the strength of this relationship remains unknown (Mees, 2008). We now have data sources that can help us measure this impact more accurately.

The North Carolina Department of Public Instruction and local education agencies have individual and collaborative work in front of them regarding the validity and reliability of teacher evaluation and growth. The state agency must provide training criteria for the best possible professional development, including train-the-trainer sessions and video calibration, so that principals can build their inter-rater agreement. Principals should be trained to recognize the evidence and behaviors described in the Professional Teaching Standards and to evaluate teachers accurately. Most importantly, NCDPI should implement a certification process for evaluators to ensure their ability to recognize evidence and assess teachers accurately. LEAs share in the responsibility of training principals to recognize evidence, including classroom practices and behaviors, and to respond with clear, specific feedback and support for improved instructional practices that meet the needs of students (Figure 8).

Figure 8: Effective Partnership

Need for Further Research

Currently, there is a need to implement a comprehensive state-district principal evaluation support partnership program and to evaluate its effectiveness by measuring the correlation among teacher evaluation ratings, teacher value-added data, and student surveys. Furthermore, there is a need to triangulate data by incorporating a student survey component. According to the MET project, the data collected from student surveys yield more consistent results than either classroom observation data or value-added measures (Asking Students about Teaching, 2012). “Research indicates that students are the most qualified sources to report on the extent to which the learning experience was productive, informative, satisfying, or worthwhile” (Theall & Franklin, 2001). Research analyzing the data collected from principal evaluations of teachers, student growth, and student surveys could lead to more robust and comprehensive principal preparation programs. Finally, research on principal evaluation certification programs that have yielded high correlations between teacher evaluation and achievement data should be analyzed to determine how to develop the best professional development, support, and certification programs for North Carolina.

References

Asking students about teaching. (2012). Retrieved March 30, 2014, from http://www.metproject.org/downloads/Asking_Students_Practitioner_Brief.pdf

Batton, D., Britt, C., DeNeal, J., & Hales, L. (2012).
NC teacher evaluations & teacher effectiveness: Exploring the relationship between value-added data and teacher evaluations (Project 6.4). Retrieved from http://www.ncpublicschools.org/docs/internresearch/reports/teachereval.pdf

Bill & Melinda Gates Foundation. (2013). Ensuring fair and reliable measures of effective teaching: Culminating findings from the MET Project’s three-year study. Retrieved from http://metproject.org/downloads/MET_Ensuring_Fair_and_Reliable_Measures_Practitioner_Brief.pdf

Blank, R. K., & de las Alas, N. (2009). Effects of teacher professional development on gains in student achievement: How meta-analysis provides scientific evidence useful to educational leaders. Washington, DC: Council of Chief State School Officers.

Brookover, W. B., & Lezotte, L. (1982). Creating effective schools. Holmes Beach, FL: Learning Publications.

Clifford, M., Behrstock-Sherratt, E., & Fetters, J. (2012). The ripple effect: A synthesis of research on principal influence to inform performance evaluation design (A Quality School Leadership Issue Brief). American Institutes for Research. Retrieved from http://files.eric.ed.gov/fulltext/ED530748.pdf

Covey, S. R. (1989). The seven habits of highly effective people: Restoring the character ethic. New York: Simon and Schuster.

Darling-Hammond, L., Amrein-Beardsley, A., Haertel, E. H., & Rothstein, J. (2011). Getting teacher evaluation right: A background paper for policy makers. Retrieved from http://iaase.org/Documents/Ctrl_Hyperlink/Session_30c_GettingTeacherEvaluationRight_uid9102012952462.pdf

Day, C. (2000). Beyond transformational leadership. Educational Leadership, 57(7), 56-59.

Flach, T. (2014, February). Leadership and data. Presented at the North Carolina Association of Supervision and Curriculum Development Annual Conference, Pinehurst, NC.

Ho, A., & Kane, T. (2013). The reliability of classroom observations by school personnel. Retrieved from http://www.metproject.org/downloads/MET_Reliability%20of%20Classroom%20Observations_Research%20Paper.pdf

In the States. (n.d.). Common Core State Standards Initiative. Retrieved November 17, 2013, from http://www.corestandards.org/inthe-states

Kimball, S. M., & Milanowski, A. (2009). Examining teacher evaluation validity and leadership decision making within a standards-based evaluation system. Educational Administration Quarterly, 45(1), 34-70.

Leithwood, K., & Jantzi, D. (2006). Transformational school leadership for large-scale reform: Effects on students, teachers, and their classroom practices. School Effectiveness and School Improvement, 17(2), 201-227.

McCaffrey, D. F., Lockwood, J. R., Koretz, D. M., & Hamilton, L. S. (2003). Evaluating value-added models for teacher accountability [Monograph]. Santa Monica, CA: RAND Corporation. Retrieved from http://www.rand.org/pubs/monographs/2004/RAND_MG158.pdf

McClellan, C., Atkinson, M., & Danielson, C. (2012). Teacher evaluation training and certification: Lessons learned from the Measures of Effective Teaching project [White paper]. Retrieved March 30, 2014, from http://www.teachscape.com/binaries/content/assets/teachscape-marketingwebsite/resources/march_13whitepaperteacherevaluatortraining.pdf

McGuinn, P. (2012). The state of teacher evaluation reform: State education agency capacity and the implementation of new teacher-evaluation systems. Retrieved from http://www.americanprogress.org/wpcontent/uploads/2012/11/McGuinn_TheStateofEvaluation-1.pdf

Mees, G. (2008). The relationships among principal leadership, school culture, and student achievement in Missouri middle schools.
National Association of Secondary School Principals. Retrieved March 24, 2014, from https://www.principals.org/Portals/0/content/59554.pdf

Mullins, H. (2014). The PIE cycle of effective professional development. Retrieved February 14, 2014, from http://edlstudio.wikispaces.com/Heather+Mullins

National Association of Elementary School Principals & National Association of Secondary School Principals. (2012). Rethinking principal evaluation: A new paradigm informed by research and practice. Alexandria, VA: Gail Connelly & JoAnn D. Bartoletti.

NGA Center for Best Practices. (2011). Preparing principals to evaluate teachers. Retrieved from http://www.nga.org/cms/home/ngacenter-for-best-practices/center-publications/page-edu-publications/col2-content/main-content-list/preparing-principals-toevaluate.html

North Carolina school executive evaluation process manual. (2012). North Carolina Educator Evaluation System. Retrieved November 12, 2013, from http://ncees.ncdpi.wikispaces.net/file/view/Principal%20Process%20Manual%202012.pdf/389359046/Principal%20Process%20Manual%202012.pdf

Performance review calibration: Building an honest appraisal. (n.d.). Retrieved from http://www.successfactors.com/en_us/lp/articles/performance-review-calibration.html

Reform Support Network. (2013). Promoting evaluation rating accuracy: Strategic options for states. Retrieved from http://www2.ed.gov/about/inits/ed/implementation-support-unit/tech-assist/evaluation-rating-accuracy.pdf

Rivkin, S. G., Hanushek, E. A., & Kain, J. F. (2000). Teachers, schools, and academic achievement (Working Paper W6691). Cambridge, MA: National Bureau of Economic Research.

Rowan, B., Correnti, R., & Miller, R. J. (2002). What large-scale survey research tells us about teacher effects on student achievement: Insights from the Prospects study of elementary schools. Teachers College Record, 104, 1525-1567.

Schwab, R. L. (1991). Research-based teacher evaluation: A special issue of the Journal of Personnel Evaluation in Education. Boston: Kluwer Academic.

Standards assessment inventory 2 - Recommendations: The leadership standard. (n.d.). Learning Forward: The Professional Learning Association. Retrieved March 24, 2014, from http://learningforward.org/docs/sai/leadershiprecommendations.pdf?sfvrsn=2

Standing, L. G. (2004). In The SAGE encyclopedia of social science research methods (Vol. 1). Thousand Oaks, CA: Sage.

Taylor, E. S., & Tyler, J. H. (2012, Fall). Can teacher evaluation improve teaching? Education Next, 12. Retrieved from http://educationnext.org/

Theall, M., & Franklin, J. (2001). Looking for bias in all the wrong places: A search for truth or a witch hunt in student ratings of instruction? New Directions for Institutional Research, 27(5), 45-56.

Tomberlin, T. (2014). READY principals spring 2014. NCEES. Retrieved March 26, 2014, from http://ncees.ncdpi.wikispaces.net/READY+Principals+Spring+2014

Weisberg, D., Sexton, S., Mulhern, J., Keeling, D., Schunck, J., Palcisco, A., & Morgan, K. (2009). The widget effect: Our national failure to acknowledge and act on differences in teacher effectiveness. Retrieved from http://widgeteffect.org/downloads/TheWidgetEffect.pdf

Wright, S. P., Horn, S. P., & Sanders, W. L. (1997). Teachers and classroom context effects on student achievement: Implications for teacher evaluation. Journal of Personnel Evaluation in Education, 11, 57-67.

Yoon, K. S., Duncan, T., Lee, S. W. Y., Scarloss, B., & Shapley, K.
(2007). Reviewing the evidence on how teacher professional development affects student achievement (Issues & Answers Report, REL 2007-No. 033). Washington, DC: U.S. Department of Education, Institute of Education Sciences, National Center for Education Evaluation and Regional Assistance, Regional Educational Laboratory Southwest. Retrieved from http://ies.ed.gov/ncee/edlabs

Appendix A: Effectiveness of Professional Development: Features and Effects by Study

Carpenter et al., 1989 (randomized controlled trial). Math, 1st grade. Student outcomes examined: computation and math problem-solving scores on the Iowa Test of Basic Skills, Level 7.
• Iowa Test of Basic Skills Level 7, computation: effect size 0.41; not significant, but substantively important; improvement index 16
• Iowa Test of Basic Skills Level 7, problem-solving: effect size 0.41; not significant, but substantively important; improvement index 16

Cole, 1992 (randomized controlled trial). Math and Reading/English Language Arts, 4th grade. Student outcomes examined: students’ computation and math problem-solving scores on the Iowa Test of Basic Skills, Level 7, and reading comprehension scores on the Gates-MacGinitie Test.
• Average for math: effect size 0.50; statistically significant; improvement index 19
• Average for reading: effect size 0.82; statistically significant; improvement index 29
• Average for language: effect size 0.24; not significant; improvement index 9

Duffy et al., 1986 (randomized controlled trial). Reading/English Language Arts, 5th grade.
• Gates-MacGinitie Reading Test: effect size 0.00; not significant; improvement index 0

Marek & Methven, 1991 (quasi-experimental design). Science, kindergarten-3rd grade and 5th grade. Student outcomes examined: students’ conservation reasoning as measured by Piagetian cognitive tasks.
• Average for conservation test: effect size 0.39; statistically significant; improvement index 15

McCutchen et al., 2002 (quasi-experimental design). Reading/English Language Arts, kindergarten-1st grade. Student outcomes examined: students’ alphabetics (Test of Phonological Awareness), orthographic fluency (a timed alphabetic writing task), comprehension (the comprehension subtest of the Gates-MacGinitie Reading Tests), and writing skills (a composition task).
• Gates-MacGinitie Word Reading subtest: effect size 0.39; statistically significant; improvement index 15

McGill-Franzen et al., 1999 (randomized controlled trial). Reading/English Language Arts, kindergarten. Student outcomes examined: students’ receptive language skills (the Peabody Picture Vocabulary Test) and early literacy skills (subtests of the Concepts About Print and Diagnostic Survey).
• Concepts about print: effect size 1.11; statistically significant; improvement index 37
• Letter identification: effect size 0.69; statistically significant; improvement index 25
• Writing vocabulary: effect size 0.32; not significant, but substantively important; improvement index 13
• Ohio Word Test: effect size 0.66; not significant, but substantively important; improvement index 24
• Hearing the sounds in words: effect size 0.97; statistically significant; improvement index 33
• Peabody Picture Vocabulary Test: effect size 0.12; not significant; improvement index 5

Saxe et al., 2001 (quasi-experimental design). Math, 4th-5th grade. Student outcomes examined: students’ concepts and computation of fractions, as assessed by a 29-item, 40-minute timed measure developed by the authors.
• Fraction concepts: effect size 2.39; statistically significant; improvement index 49
• Fractions computation: effect size -0.53; not significant, but substantively important; improvement index -20

Sloan, 1993 (randomized controlled trial). Math, Science, and English/Language Arts, 4th-5th grade. Student outcomes examined: students’ reading, math, and science scores, measured by the Comprehensive Test of Basic Skills.
• Comprehensive Test of Basic Skills, reading: effect size 0.68; not significant, but substantively important; improvement index 25
• Comprehensive Test of Basic Skills, math: effect size 0.26; not significant, but substantively important; improvement index 10
• Comprehensive Test of Basic Skills, science: effect size 0.63; not significant, but substantively important; improvement index 23

Tienken, 2003 (randomized controlled trial with group equivalence problems). Reading/English Language Arts, 4th grade. Student outcomes examined: students’ narrative writing, as measured by content/organization scores on a standardized writing test administered as part of New Jersey’s Elementary School Proficiency Assessment.
• Content/organization score on narrative writing test: effect size 0.41; not significant, but substantively important; improvement index 16

Source: Adapted from Yoon et al. (2007).

Appendix B: Adaptation of “The Ripple Effect”
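A reading note on the Appendix A entries above: the improvement index values are consistent with the What Works Clearinghouse convention of converting an effect size into the expected percentile-point change for an average comparison-group student, 100 * (Phi(ES) - 0.5), where Phi is the standard normal cumulative distribution function. The short check below, written for illustration, reproduces the table's values.

```python
# Improvement index as the expected percentile-point change for an average
# comparison-group student: 100 * (Phi(ES) - 0.5), with Phi the standard
# normal CDF (the What Works Clearinghouse convention).
from scipy.stats import norm

for es in (0.41, 0.82, 1.11, 2.39, -0.53):
    print(f"effect size {es:+.2f} -> improvement index {100 * (norm.cdf(es) - 0.5):+.0f}")

# effect size +0.41 -> improvement index +16
# effect size +0.82 -> improvement index +29
# effect size +1.11 -> improvement index +37
# effect size +2.39 -> improvement index +49
# effect size -0.53 -> improvement index -20
```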