Reliability of Using Student Learning Objectives for Teacher Evaluation

Shuqiong Lin, Xuejun Ji, Wen Luo
Texas A&M University

Abstract

This study examined the reliability of using Student Learning Objectives (SLOs) to rank and classify teachers by effectiveness. Teacher-student ratio, ICC, test score reliability, entry-level achievement, and target setting model were manipulated as design factors to determine how they affect the accuracy of this teacher evaluation approach. Results showed fairly good reliability of teacher rankings and classifications based on SLO scores. The study also found that teacher-student ratio, test score reliability, ICC, and target setting model had at least a moderate effect on SLO-based evaluation, whereas the influence of entry-level achievement was very limited.

Keywords: Student Learning Objectives, Teacher Evaluation, Reliability of Evaluation

Introduction

The concept of assessing teaching effectiveness based on students' achievement growth was first explored scientifically in 1971 (Hanushek, 1971). As the concept has matured, educators and policy makers have widely adopted it. Value-added models, first introduced by William Sanders, have become one of the most popular and well-developed implementations of this concept. However, because they require state-wide standardized tests to measure students' growth, value-added models are typically applied only to reading and math teachers in grades 4 through 8 (Gill, Bruch, & Booker, 2013), who make up only 31% of all teachers (Prince et al., 2006). That is, approximately 69% of teachers from pre-K through grade 12 cannot be assessed reliably using value-added models. The call for a more flexible teacher evaluation approach based on students' achievement growth gave rise to the Student Learning Objectives (SLO) program, an alternative indicator of teacher accountability.

A student learning objective is an academic goal that educators establish for each individual student or subgroup of students (Marion et al., 2012). The extent to which the goals have been achieved represents students' academic growth, which is then linked to teacher effectiveness on the premise that high-performing teachers help their students make larger learning gains. Operationally, the SLO program requires teachers to set measurable academic growth targets for all students at the start of a course. The targets are set by teachers and project leaders (e.g., principals) based on students' prior achievement, such as test scores from previous years, students' strengths and weaknesses, and any available trend data. Throughout the semester, teachers reflect on students' academic progress and classroom practice and strive to help students achieve the goals. At the end of the course, each student receives an SLO score equal to his or her end-of-course achievement score minus the pre-set target. A student with a non-negative SLO score is marked as reaching the target; otherwise, the student is marked as failing to reach it. Based on the percentage of students reaching the target within each teacher's classes, teachers are ranked and sorted into categories. Teachers with a high target reaching rate are classified as high-performing and may be considered for promotion or a raise in pay.
Teachers with a low target reaching rate are classified as low-performing and may need intensive professional development ("Reach SLO Manual", 2013; "Student Learning Objectives Operations Manual", 2013).

As an alternative tool for teacher evaluation, the SLO program is notable for its flexibility. First, learning growth in an SLO program can be measured with both formal (state-wide standardized) and less formal (locally determined) tests. Unlike value-added models, which assess only reading and math teachers in grades 4 through 8, SLO programs can be applied to any subject and any grade from kindergarten through high school. Second, various target setting methods are available to accommodate the diverse situations and needs of courses and teachers. A review of the three most commonly used target setting methods is provided in the Method section.

In addition to its flexibility, the extant literature has documented other advantages of the SLO program. In general, studies indicate that setting learning objectives itself has a positive impact on learning. For example, a meta-analysis reported a high effect size for setting objectives in classroom instruction (Beesley & Apthorp, 2010). A longitudinal study showed that schools with an SLO program improved their passing rates on the Texas Assessment of Knowledge and Skills (TAKS) over time more than schools without one (Schmitt, Lamb, Cornetto, & Courtemanche, 2013).

Because of these desirable features, some large school districts have started to apply the SLO method to instruction and teacher evaluation. For example, Ohio's Teacher Evaluation System (OTES) uses SLOs for schools to establish local achievement measures and for teachers without value-added data. New York State incorporates SLOs in its Teacher and Leader Effectiveness (TLE) system and has developed practical implementation guidance, becoming a vital component of the national Student Learning Objectives (SLO) Work Group. The SLO program has also gained popularity in many other states and districts, including Georgia, Indiana, Connecticut, Delaware, Tennessee, Denver Public Schools, District of Columbia Public Schools, Jeffco Public Schools, and Austin Independent School District.

Despite these promising features and the emerging interest from school districts, the psychometric properties of SLO measures of teacher performance are not well studied. Gill, Bruch, and Booker (2013) concluded that little evidence exists on the statistical properties of using SLOs for teacher evaluation. They identified seven studies that examined SLO programs, almost all of which focused on the impact of SLO programs on the improvement of students' test scores and teachers' performance (i.e., Goldhaber & Walch, 2011; Community Training and Assistance Center, 2004; Community Training and Assistance Center, 2013; Proctor, Walters, Reichardt, Goldhaber, & Walch, 2011; Schmitt et al., 2013; Tennessee Department of Education, 2012; Terry, 2008). Gill, Bruch, and Booker (2013) also noted that whether the ability of SLOs to distinguish among teachers reflects true differences in teacher performance remains to be determined.
The purpose of this study is to examine how the accuracy of teacher rankings and classifications based on SLO measures is affected by the following factors: (1) test score reliability, (2) teacher-student ratio, (3) intra-class correlation (ICC) of test scores, (4) entry-level achievement, and (5) target setting model.

Method

To investigate the impact of the five factors on the performance of SLO measures of teacher effectiveness, we conducted a Monte Carlo study in which we generated two-level SLO data with students nested within teachers. The five design factors (i.e., teacher-student ratio, ICC, test score reliability, entry-level achievement, and target setting model) were manipulated. Below, we describe the data generating process and the settings of the design factors, followed by the analysis of the meta-data (i.e., the data set containing statistics computed from each generated sample). All data generation and analysis were performed in SAS 9.3 (SAS Institute Inc., 2012).

Data Generation

The sample data were generated to mimic a typical public school. Class size and school size depend heavily on school location (i.e., city, suburb, town, or rural) and school level (i.e., primary, secondary, or high school). Generally speaking, the class size of a typical public school is around 20 to 30, the school size is around 1,000 to 1,500, and each teacher usually teaches 2 to 3 different classes. Accordingly, we generated data for a school with 1,200 students in total. The number of teachers and the number of students per teacher were determined by the design factor teacher-student ratio.

For each student, two true scores (i.e., scores without measurement error) were created: the true pre-test score (X*ij) and the true end-of-course score (Y*ij). The true pre-test score of each student was generated from a normal distribution, X*ij ~ N(μX*, δX*). The true end-of-course score also followed a normal distribution [Y*ij ~ N(μY*, δY*)] and was generated using the following two-level linear model:

Level 1: Y*ij = β0j + β1j X*ij + rij   (1)
Level 2: β0j = γ00 + U0j   (2)
         β1j = γ10   (3)

with rij ~ N(0, δr) and U0j ~ N(0, δU0), where i indexes students and j indexes teachers. Here γ00 is the intercept, γ10 is the effect of the pre-test true scores on the end-of-course true scores, and U0j is the random effect of the jth teacher on Y*ij, that is, the jth teacher's true contribution to students' growth. A higher U0j means a greater contribution of the teacher to the student's end-of-course true score. Finally, rij is the random error for the ith student taught by the jth teacher.

After all pre- and end-of-course true scores were generated, the observed pre-test and end-of-course scores (i.e., scores with measurement error) were obtained by adding measurement errors (Eij_X, Eij_Y) to the true scores:

Xij = X*ij + Eij_X   (4)
Yij = Y*ij + Eij_Y   (5)

with Eij_X ~ N(0, δE_X) and Eij_Y ~ N(0, δE_Y).

In the above data generation models, some parameters took different values across the levels of the design factors, some had fixed values, and some were derived from other parameters based on certain assumptions. Next we describe the design factors, followed by the parameters with fixed values, and finally the derived parameters.
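Although all simulations in this study were run in SAS 9.3, the generating process in equations (1) through (5) is compact enough to sketch in a few lines. The following Python fragment is a minimal illustration rather than the study's own code; the function name and defaults are ours, with parameter values borrowed from one design cell (ICC = 0.1, ρ = 0.85, μX* = 65; the derived values are explained in the Derived Parameters subsection below).

```python
import numpy as np

rng = np.random.default_rng(2024)

def generate_sample(n_teachers=20, n_per_teacher=60, mu_x=65.0, sd_x=5.0,
                    gamma00=30.0, gamma10=0.8, sd_u0=0.95, sd_r=2.85,
                    sd_e=2.10):
    """Draw one two-level sample: students (i) nested within teachers (j)."""
    teacher = np.repeat(np.arange(n_teachers), n_per_teacher)
    u0 = rng.normal(0.0, sd_u0, n_teachers)             # true teacher effects U0j
    x_true = rng.normal(mu_x, sd_x, teacher.size)       # true pre-test scores X*ij
    r = rng.normal(0.0, sd_r, teacher.size)             # student-level residuals rij
    y_true = gamma00 + gamma10 * x_true + u0[teacher] + r   # equations (1)-(3)
    x_obs = x_true + rng.normal(0.0, sd_e, teacher.size)    # equation (4)
    y_obs = y_true + rng.normal(0.0, sd_e, teacher.size)    # equation (5)
    y_obs = np.minimum(y_obs, 100.0)    # scores over 100 are set to 100 (see below)
    return teacher, x_obs, y_obs, u0
```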
Design Factors

Teacher-student ratio. We chose two levels of teacher-student ratio: 1:60 and 1:40. Because the total number of students is fixed at 1,200, in the 1:60 condition a sample contains 20 teachers with 60 students nested within each teacher, whereas in the 1:40 condition it contains 30 teachers with 40 students nested within each teacher. (In reality, each student learns different subjects from different teachers, so students are cross-classified within several teachers. For simplicity, we study teachers who teach the same subject in different grades, so that each student is nested within only one teacher. The teacher-student ratio here is therefore not the overall ratio of a school but the ratio for a specific course; the overall ratio of each simulated school is much higher than these values.)

Intra-class correlation (ICC). The ICC is a class-level characteristic that reflects how strong the clustering effect is for the test scores. We selected the 0.1 and 0.2 levels because these values are the most commonly seen in educational settings (Hedges & Hedberg, 2007).

Test score reliability (ρ). Test score reliability refers to the reliability of students' pre- and post-test scores. Given that many applications of SLO programs are in grades and subjects without state-wide standardized tests, the reliability of the tests employed may vary considerably with the subject and the developer. For example, the reliability of reading and science scores from cooperative educational services such as the SAT and ACT is usually higher than 0.9 (Ormrod, 2010, p. 527), whereas the reliability of ACT writing test scores was only 0.64 (ACT writing test technical report, 2009). Beyond such tests, the reliabilities of locally determined assessments, such as district-created assessments or teacher-created tasks, fluctuate more and are relatively low. Therefore, we selected four levels (0.65, 0.75, 0.85, and 0.95) to represent reliabilities from low to high.

Entry-level achievement. Entry-level achievement is the mean of the pre-test true scores (μX*). It is common for some classes to perform better at the beginning of a course than others, and this factor may influence students' target reaching rates under some target setting models. For example, under the class-wide target setting model (e.g., the same mastery goal for all students in a class), the target reaching rate for classes with low μX* may be lower than for classes with high μX*. We selected two levels of μX* (65 and 75) to represent low and medium entry-level performance, given that the highest possible test score is 100 points.

Target setting model. Students' target reaching rates are directly related to the targets: even for the same group of students in the same course, different targets lead to different target reaching rates, and hence different teacher rankings. The Half-Split model, the Banded model, and the Class-Wide model are the three most commonly used target setting models for SLO programs. For example, New York and Georgia use all three approaches, Ohio adopted the Banded model, and the Austin Independent School District was allowed to use only the Half-Split model.

The Half-Split model sets the target at the individual level. It is an appropriate approach when pre-test scores (Xij) are available and the end-of-course test is on a scale similar to the pre-test. Under this model, the target score is THalf-Split = Xij + (100 - Xij)/2; for example, a student with a pre-test score of 70 receives a target of 85.
The Banded model categorizes students into bands according to their Xij and assigns each band a target. The required growth differs across bands, because students with low Xij typically have more room for improvement than students who have already earned high scores. The Banded model is well suited to courses with tiered levels of expectations for students. In this study, the banded target (TBanded) was set to 60 for students whose Xij ranged from 0 to 30, 70 for Xij from 31 to 50, 80 for Xij from 51 to 70, and 90 for Xij above 70. These bands reflect the expectation that the possible room for growth, and the required growth, is larger for students with low Xij than for those with high Xij. (In practice, states using the Banded model follow this general rule, but the specific ranges vary by state depending on their assessments and students' abilities; see the implementations in New York and Ohio ["Critical Decisions within SLOs: Target Setting Models", 2013; "A Guide to Using SLOs as a Locally-Determined Measure of Students Growth", 2013].)

Finally, the Class-Wide model sets a single target for all students in a class. This shared target can be set high or low to represent a "mastery" or a "pass" level of performance ("Critical Decisions within Student Learning Objectives (SLOs): Target Setting Models", 2013). Given the means of the pre-test scores (65 or 75), we adopted 85 as the class-wide target (TClass-Wide = 85).

Crossing the levels of the five design factors yields 2 (teacher-student ratios) × 4 (test score reliabilities) × 2 (ICCs) × 2 (entry-level performances) × 3 (target setting models) = 96 conditions. The first four factors are between-subject factors and the last (target setting model) is a within-subject factor. We ran 1,000 replications in each condition.
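A minimal sketch of how the three target setting rules translate into student-level SLO scores and teacher-level target reaching rates is given below; the code is illustrative Python rather than the study's SAS implementation, the function names are ours, and the band cutoffs are the ones defined above.

```python
import numpy as np

def slo_indicators(x_obs, y_obs):
    """Score each student (1 = target reached) under the three target models."""
    t_half = x_obs + (100.0 - x_obs) / 2.0                  # Half-Split target
    t_band = np.select([x_obs <= 30, x_obs <= 50, x_obs <= 70],
                       [60.0, 70.0, 80.0], default=90.0)    # bands used in this study
    t_class = 85.0                                          # Class-Wide target

    def reached(target):
        return (y_obs - target >= 0).astype(int)            # non-negative SLO score

    return reached(t_half), reached(t_band), reached(t_class)

def reaching_rate(teacher, slo):
    """Aggregate student indicators into each teacher's target reaching rate."""
    return np.bincount(teacher, weights=slo) / np.bincount(teacher)
```

Together with generate_sample above, for example, reaching_rate(teacher, slo_indicators(x_obs, y_obs)[0]) would yield each teacher's target reaching rate under the Half-Split model.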
Parameters with Fixed Values

In equation (3), the fixed effect γ10 was set to 0.8, representing a relatively high correlation between the pre- and end-of-course tests. This is reasonable because the literature indicates that students' prior achievement strongly affects the acquisition of new knowledge and, in turn, their end-of-course achievement. For example, prior knowledge has been found to support working memory and the acquisition of concepts (Cook, 2006; Hewson & Hewson, 2003), both of which play important roles in learning new material and performing well on tests.

In equation (2), the fixed effect γ00 was set to 30 so that the mean of Y*ij equals 82 when μX* = 65 and 90 when μX* = 75 (see equation 6 for the calculation). This value was chosen so that, under the Class-Wide target setting model, the mean of the post-test is not too far from the target of 85. If the mean of the post-test scores were far from the target, it would be equivalent to setting an unrealistically high or low class-wide target, resulting in too many teachers with 0% or 100% target reaching rates, which is undesirable from a psychometric point of view because of low discrimination.

Finally, the standard deviations of the pre-test and end-of-course true scores were set to 5 (i.e., δX* = δY* = 5). With this value, even when the mean of Y*ij is as high as 90, 95% of the scores still fall within the 0-100 range. Scores over 100 were set to 100, which creates a slight ceiling effect when the mean of Y*ij is 90; this is realistic, however, as ceiling effects are often encountered in real-life exams.

Derived Parameters

Based on the above parameter values, μY* was computed using the following equation:

μY* = E(Y*ij) = E(β0j) + E(β1j X*ij) = γ00 + γ10 μX*   (6)

Therefore μY* equals 82 and 90 when μX* is 65 and 75, respectively. δU0 and δr were obtained using the following equations:

δU0 = [(δY*² - γ10² δX*²) × ICC]^(1/2)   (7)
δr = [(δY*² - γ10² δX*²) × (1 - ICC)]^(1/2)   (8)

Hence, for ICC = 0.1, δU0 is 0.95 and δr is 2.85; for ICC = 0.2, δU0 is 1.34 and δr is 2.68.

The computation of δE_X and δE_Y (the standard deviations of the measurement errors) drew on the definition of test score reliability (ρ) as the variance of the true scores divided by the variance of the observed scores, where the latter is the sum of the true score variance and the measurement error variance. The standard deviation of the measurement errors can therefore be calculated as:

δE_X = [((1 - ρ)/ρ) × δX*²]^(1/2)   (9)

Thus, δE_X was 1.15 for ρ = 0.95, 2.10 for ρ = 0.85, 2.89 for ρ = 0.75, and 3.67 for ρ = 0.65. Because the true pre-test and end-of-course scores had the same variance and reliability, δE_Y equaled δE_X in all conditions; we use δE to denote both from here on.
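As a worked check of equations (7) through (9), the following illustrative Python snippet (names are ours) reproduces the derived values for one design cell:

```python
import math

def derived_parameters(icc, rho, sd_x=5.0, sd_y=5.0, gamma10=0.8):
    """Solve equations (7)-(9) for the teacher-, student-, and error-level SDs."""
    resid_var = sd_y**2 - gamma10**2 * sd_x**2      # variance unexplained by X*ij
    sd_u0 = math.sqrt(resid_var * icc)              # equation (7)
    sd_r = math.sqrt(resid_var * (1.0 - icc))       # equation (8)
    sd_e = math.sqrt((1.0 - rho) / rho * sd_x**2)   # equation (9)
    return sd_u0, sd_r, sd_e

print(derived_parameters(0.1, 0.85))   # approx. (0.95, 2.85, 2.10)
```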
Table 1 summarizes the parameter values. Because the two teacher-student ratio conditions and the three target setting models share the same set of parameter values, each distinct combination of ICC, ρ, and μX* appears only once in the table.

Insert Table 1 About Here

Analyses of Data

Analysis of generated sample data. To obtain the SLO score for each student, the observed end-of-course score (Yij) was compared to the target score. If the target was met, the student's SLO = 1; otherwise SLO = 0. Individual students' SLO scores were then aggregated at the teacher level to compute each teacher's target reaching rate. Because there are three target setting methods, each student has three SLO scores (one under each method) and each teacher has three target reaching rates. For instance, if 40 of the 60 students nested within a teacher had SLO = 1 under the Half-Split method, that teacher's target reaching rate under the Half-Split method is 40/60 = 66.67%. Teachers were then ranked according to the three observed target reaching rates and according to the true teacher effect U0j. Therefore, each sample yields three observed teacher rankings (RHalf-Split, RBanded, and RClass-Wide) and one true teacher ranking (RU0j).

In addition to ranking teachers, it is also important for a teacher evaluation method to correctly identify excellent and poor teachers. We used three categories: excellent (top 15%), ordinary (middle 70%), and poor (bottom 15%). Teachers were classified into these groups based on the observed target reaching rates and on the true effect U0j. Hence, each generated data set yields three observed classifications (CHalf-Split, CBanded, and CClass-Wide) and one true classification (CU0j).

Outcomes of interest. We are interested in the correlation between the three observed rankings and the true ranking, and in the congruence of the three observed classifications with the true classification, because these measures represent the degree to which the SLO methods can correctly rank and classify teachers. To measure the correlation between the true ranking and each observed ranking, we computed Spearman's rho (ρs): ρHalf-Split, ρBanded, and ρClass-Wide. To measure the congruence between each observed classification and the true classification, we computed the weighted Kappa statistic: κHalf-Split, κBanded, and κClass-Wide.

Analysis of the meta-data. To examine the effects of the design factors on the outcomes, we ran two sets of multivariate analysis of variance (MANOVA). The first set used the three Spearman rho's as response variables and the second set used the three weighted Kappas. The explanatory variables included the four between-subject factors (teacher-student ratio, test score reliability, ICC, and entry-level achievement) and the one within-subject factor (target setting method). Given that the purpose of the MANOVA in the present study was descriptive rather than inferential, the p values of the F tests, although reported in Tables 3 and 5, were not used for interpretation. Instead, the eta-squared (η²) effect size was computed and reported as a measure of practical significance, and only effects with η² greater than 0.05 were interpreted. The between-subject η² was computed using equation (10), and the within-subject and interaction effect η² using equation (11):

η² = SS_effect / (SS_effect + SS_error)   (10)
η² = 1 - Λ^(1/s)   (11)

where Λ refers to Wilks' Lambda and s is the number of response variables (the three Spearman rho's or the three weighted Kappas).
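A sketch of how the two outcome measures could be computed for one generated sample is shown below. It is illustrative Python (the study's analyses were run in SAS); the 15/70/15 cut points follow the classification rule above, and, because the text does not state the weighting scheme for the weighted Kappa, linear weights are assumed here.

```python
import numpy as np
from scipy.stats import rankdata, spearmanr
from sklearn.metrics import cohen_kappa_score

def classify(scores):
    """Top 15% = 2 (excellent), bottom 15% = 0 (poor), middle 70% = 1 (ordinary)."""
    pct = rankdata(scores) / len(scores)
    return np.where(pct > 0.85, 2, np.where(pct <= 0.15, 0, 1))

def reliability_outcomes(u0, rate):
    """Spearman's rho between rankings; weighted kappa between classifications."""
    rho_s = spearmanr(u0, rate).correlation
    kappa = cohen_kappa_score(classify(u0), classify(rate), weights="linear")
    return rho_s, kappa
```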
Results

Teacher Rankings

Table 2 presents the means and standard deviations of the Spearman rho's broken down by the levels of the design factors. Overall, the correlation between the true and the SLO-based teacher rankings was high (mean Spearman's rho = 0.73). The MANOVA results (see Table 3) showed that ICC was the most influential factor, explaining 29% of the between-subjects variation in the dependent variables; as ICC increased, the correlations between the observed and true rankings increased. Test score reliability accounted for about 10% of the between-subjects variation, with higher reliabilities producing higher correlations. Teacher-student ratio also mattered: compared with the 1:40 conditions (SHalf-Split = 0.77, SBanded = 0.70, SClass-Wide = 0.66), the 1:60 conditions produced higher Spearman rho's (SHalf-Split = 0.80, SBanded = 0.76, SClass-Wide = 0.71). Finally, target setting method accounted for about 58% of the within-subjects variation: the Half-Split method showed the highest mean Spearman rho (0.78), followed by the Banded method (0.73), with the Class-Wide method lowest (0.68).

Insert Tables 2 & 3 About Here

Teacher Classification

Table 4 presents the means and standard deviations of the weighted Kappa broken down by the levels of the design factors. Overall, the agreement between the true and observed teacher classifications was moderate (mean weighted Kappa = 0.46), indicating moderate reliability of SLO-based teacher classification. The MANOVA results (see Table 5) showed patterns similar to those for the teacher rankings. ICC had a moderate impact on classification accuracy (effect size = 0.13): data sets with ICC = 0.1 had lower KHalf-Split, KBanded, and KClass-Wide (0.42, 0.42, and 0.37, respectively) than data sets with ICC = 0.2 (0.53, 0.53, and 0.48). Target setting method also affected classification accuracy (effect size = 0.11); in general, the Half-Split and Banded methods (KHalf-Split = KBanded = 0.48) performed better than the Class-Wide method (0.43). The interaction between target setting method and teacher-student ratio was also moderate (effect size = 0.13): the difference in classification reliability between the 1:60 and 1:40 ratios was larger under the Half-Split model than under the Banded and Class-Wide models. In addition, classification reliability improved when more students were nested within a teacher, and it increased with test score reliability; for instance, as ρ increased from 0.65 to 0.95, KHalf-Split rose from 0.44 to 0.47, 0.49, and 0.51. When entry-level achievement was 65, KHalf-Split, KBanded, and KClass-Wide were 0.49, 0.48, and 0.44, all slightly larger than the corresponding values of 0.46, 0.47, and 0.41 when entry-level achievement was 75.

Insert Tables 4 & 5 About Here

Discussion

This study examined the reliability of SLO scores as a measure of teacher effectiveness and how teacher rankings and classifications are affected by teacher-student ratio, test score reliability, ICC, entry-level achievement, and target setting method. Generally speaking, teacher rankings and classifications based on SLO scores are fairly reliable. Goldhaber and Walch (2011) observed high correlations between value-added estimates of teacher effects and students' achievement growth (0.8 for math and 0.6 for reading); our simulation results showed an average correlation of 0.73 between SLO-based rankings and the true teacher effects (0.78 under the Half-Split model). This suggests that SLO-based assessment of teacher effectiveness is statistically sound and can be fairly reliable when applied appropriately.

As expected, the reliability of teacher rankings and classifications increased with test score reliability and with the variability of the true teacher effects (i.e., ICC). For both ranking and classification, the 1:60 ratio yielded better reliability. The number of students nested within a teacher is that teacher's sample size; a larger sample size stabilizes the teacher's SLO score (i.e., the target reaching rate of his or her students), which in turn increases the reliability of SLO-based teacher evaluation.

Comparing the three target setting methods, the Half-Split and Banded methods produced more reliable teacher rankings and classifications than the Class-Wide method. Targets under the Half-Split and Banded approaches are based on pre-course test scores and are specific to individual students. With a class-wide target, by contrast, students with low pre-test scores may fail to reach the target even after substantial growth. In that case, individual students' SLO scores do not adequately represent their growth, and teacher rankings and classifications based on those scores are less reliable.
Implications

Based on our findings, we recommend that school district leaders choose achievement tests with high reliability (0.85 and above) and set individual-specific academic goals when using an SLO program to measure teacher effectiveness. Caution should be exercised when the number of students within a teacher is small (less than 40).

Limitations and Future Research Directions

This study assumed that all students have a pre-test score. In reality, students in pre-K and the early elementary grades may not have a baseline score for a given course, in which case other target setting approaches are needed. Another limitation is the assumption that the pre-test and end-of-course test were similar tests with the same reliability; in practice, they may be different tests with different reliabilities. Future studies can examine the relative importance of pre-test and post-test reliability.

References

ACT writing test technical report. (2009). Retrieved from http://www.act.org/aap/writing/pdf/TechReport.pdf

A guide to using SLOs as a locally-determined measure of student growth. (2013). Retrieved from education.ohio.gov/getattachment/Topics/Teaching/Educator-Evaluation-System/Ohio-s-Teacher-Evaluation-System/Student-Growth-Measures/Student-Learning-Objective-Examples/071513_SLO_Guidebook_FINAL.docx.aspx

Barge, D. J. (2013, August 26). Student learning objectives operations manual. Retrieved December 23, 2013, from http://www.gadoe.org/School-Improvement/Teacher-and-Leader-Effectiveness/Documents/2013%20SLO%20Manual.pdf

Beesley, A., & Apthorp, H. (2010). Classroom instruction that works, second edition: Research report. Denver, CO: Mid-continent Research for Education and Learning.

Community Training and Assistance Center. (2004). Catalyst for change: Pay for performance in Denver: Final report. Boston: Author. Retrieved May 4, 2012, from http://www.ctacusa.com/publications/catalyst-for-change-pay-for-performance-in-denver-final-report/

Community Training and Assistance Center. (2013). It's more than the money. Boston: Author. Retrieved February 27, 2013, from http://www.ctacusa.com/PDFs/MoreThanMoneyreport.pdf

Cook, M. P. (2006). Visual representations in science education: The influence of prior knowledge and cognitive load theory on instructional design principles. Science Education, 90(6), 1073-1091.

Critical decisions within student learning objectives (SLOs): Target setting models [Video file]. (2013, October 22). EngageNY. Retrieved November 28, 2013, from http://www.engageny.org/resource/critical-decisions-within-student-learning-objectivesslos-target-setting-modele

Geldhof, G. J., Preacher, K. J., & Zyphur, M. J. (2013). Reliability estimation in a multilevel confirmatory factor analysis framework. Psychological Methods. doi:10.1037/a0032138

Gill, B., Bruch, J., & Booker, K. (2013). Using alternative student growth measures for evaluating teacher performance: What the literature says (REL 2013-002). Washington, DC: U.S. Department of Education, Institute of Education Sciences, National Center for Education Evaluation and Regional Assistance, Regional Educational Laboratory Mid-Atlantic.

Goldhaber, D., & Walch, J. (2011). Strategic pay reform: A student outcomes-based evaluation of Denver's ProComp teacher pay initiative. Economics of Education Review, 31(6), 1067-1083.

Hanushek, E. (1971). Teacher characteristics and gains in student achievement: Estimation using micro data. American Economic Review, 61(2), 280-288.
Hedges, L., & Hedberg, E. C. (2007). Intraclass correlation values for planning group-randomized trials in education. Educational Evaluation and Policy Analysis, 29, 60-87.

Hewson, M. G., & Hewson, P. W. (2003). Effect of instruction using students' prior knowledge and conceptual change strategies on science learning. Journal of Research in Science Teaching, 40, S86-S98.

Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.

Marion, S., DePascale, C., Domaleski, C., Gong, B., & Diaz-Biello, E. (2012). Considerations for analyzing educators' contributions to student learning in non-tested subjects and grades with a focus on student learning objectives. Dover, NH: Center for Assessment.

Ormrod, J. E. (2010). Educational psychology: Developing learners (7th ed.). Boston, MA: Pearson.

Prince, C., Schuermann, P., Guthrie, J., Witham, P., Milanowski, A., & Thorn, C. (2006). The other 69 percent: Fairly rewarding the performance of teachers of non-tested subjects and grades. Washington, DC: U.S. Department of Education, Office of Elementary and Secondary Education.

Proctor, D., Walters, B., Reichardt, R., Goldhaber, D., & Walch, J. (2011). Making a difference in education reform: ProComp external evaluation report 2006-2010. Prepared for the Denver Public Schools. Denver: The Evaluation Center, University of Colorado.

Reach SLO manual. (2013). Retrieved December 19, 2013, from http://www.austinisd.org/sites/default/files/dept/reach/SLO_Manual_20132014FinalRevisedJ_0.pdf

SAS Institute Inc. (2012). SAS/STAT 9.3 user's guide. Cary, NC: Author.

Schmitt, L., Lamb, L., Cornetto, K., & Courtemanche, M. (2013). AISD REACH program update: 2012-2013 student learning objectives (SLOs) (Department of Program Evaluation Publication 12.83). Austin: Austin Independent School District.

Tennessee Department of Education. (2012). Teacher evaluation in Tennessee: A report on year 1 implementation. Nashville, TN: Author. Retrieved December 14, 2012, from http://www.tn.gov/education/doc/yr_1_tchr_eval_rpt.pdf

Terry, B. D. (2008). Paying for results: Examining incentive pay in Texas schools. Austin: Texas Public Policy Foundation. Retrieved December 13, 2012, from http://broadeducation.org/asset/1128-paying%20for%20results.pdf
Table 1
List of parameters (shared by the two teacher-student ratios and the three target setting models)

ICC   ρ      μX*   μY*   δX*   δY*   δU0    δr     δE
0.1   0.95   75    90    5     5     0.95   2.85   1.15
0.1   0.95   65    82    5     5     0.95   2.85   1.15
0.1   0.85   75    90    5     5     0.95   2.85   2.10
0.1   0.85   65    82    5     5     0.95   2.85   2.10
0.1   0.75   75    90    5     5     0.95   2.85   2.89
0.1   0.75   65    82    5     5     0.95   2.85   2.89
0.1   0.65   75    90    5     5     0.95   2.85   3.67
0.1   0.65   65    82    5     5     0.95   2.85   3.67
0.2   0.95   75    90    5     5     1.34   2.68   1.15
0.2   0.95   65    82    5     5     1.34   2.68   1.15
0.2   0.85   75    90    5     5     1.34   2.68   2.10
0.2   0.85   65    82    5     5     1.34   2.68   2.10
0.2   0.75   75    90    5     5     1.34   2.68   2.89
0.2   0.75   65    82    5     5     1.34   2.68   2.89
0.2   0.65   75    90    5     5     1.34   2.68   3.67
0.2   0.65   65    82    5     5     1.34   2.68   3.67

Table 2
Description of Spearman rho across five design factors

Design factor      N       Mean SHalf-Split   Mean SBanded   Mean SClass-Wide
Ratio
  1:60             16000   0.80 (0.11)        0.76 (0.12)    0.71 (0.14)
  1:40             16000   0.77 (0.11)        0.70 (0.12)    0.66 (0.14)
ICC
  0.1              16000   0.73 (0.12)        0.67 (0.13)    0.61 (0.14)
  0.2              16000   0.83 (0.08)        0.79 (0.10)    0.74 (0.11)
μX*
  65               16000   0.77 (0.11)        0.73 (0.13)    0.66 (0.14)
  75               16000   0.79 (0.12)        0.73 (0.13)    0.69 (0.15)
ρ
  0.65             8000    0.73 (0.12)        0.69 (0.13)    0.65 (0.14)
  0.75             8000    0.77 (0.11)        0.72 (0.13)    0.67 (0.14)
  0.85             8000    0.80 (0.10)        0.74 (0.12)    0.69 (0.14)
  0.95             8000    0.83 (0.09)        0.77 (0.11)    0.70 (0.14)
Target setting     32000   0.78 (0.11)        0.73 (0.13)    0.68 (0.14)

Note: Standard deviations are in parentheses.

Table 3
MANOVA and effect sizes for teacher ranking reliability based on Spearman rho

Effect                      F (df)                 p         Effect size (η²)
Within effect
  Targets                   22035.80 (2, 31980)    <0.0001   0.5795
Between effects
  Ratio                     3127.27 (1)            <0.0001   0.0891
  ICC                       13077.50 (1)           <0.0001   0.2902
  ρ                         1236.22 (3)            <0.0001   0.1039
  μ                         304.48 (1)             <0.0001   0.0094
  Ratio × ICC               66.70 (1)              <0.0001   0.0021
  Ratio × ρ                 1.88 (3)               0.1311    0.0002
  Ratio × μ                 2.61 (1)               0.1063    0.0001
  ICC × ρ                   4.99 (3)               0.0018    0.0005
  ICC × μ                   0.44 (1)               0.5069    0.0000
  ρ × μ                     4.07 (3)               0.0067    0.0004
Between-within effects
  Targets × Ratio           120.35 (2, 31980)      <0.0001   0.0075
  Targets × ICC             278.01 (2, 31980)      <0.0001   0.0171
  Targets × ρ               232.30 (6, 63960)      <0.0001   0.0422
  Targets × μ               235.14 (2, 31980)      <0.0001   0.0145
  Targets × Ratio × ICC     1.54 (2, 31980)        0.2141    0.0001
  Targets × Ratio × ρ       4.27 (6, 63960)        0.0003    0.0009
  Targets × Ratio × μ       2.02 (2, 31980)        0.1321    0.0001
  Targets × ICC × ρ         13.87 (6, 63960)       <0.0001   0.0026
  Targets × ICC × μ         0.67 (2, 31980)        0.5135    0.0000
  Targets × ρ × μ           3.04 (6, 63960)        0.0057    0.0006

Note: Numbers in parentheses are degrees of freedom. The total number of observations for the repeated-measures MANOVA was 32,000.
Table 4
Description of Weighted Kappa across five design factors

Design factor      N       Mean KHalf-Split   Mean KBanded   Mean KClass-Wide
Ratio
  1:60             16000   0.56 (0.21)        0.50 (0.21)    0.45 (0.21)
  1:40             16000   0.40 (0.18)        0.45 (0.18)    0.40 (0.18)
ICC
  0.1              16000   0.42 (0.21)        0.42 (0.19)    0.37 (0.19)
  0.2              16000   0.53 (0.20)        0.53 (0.19)    0.48 (0.19)
μX*
  65               16000   0.49 (0.21)        0.48 (0.18)    0.44 (0.20)
  75               16000   0.46 (0.21)        0.47 (0.20)    0.41 (0.20)
ρ
  0.65             8000    0.44 (0.21)        0.43 (0.20)    0.40 (0.20)
  0.75             8000    0.47 (0.20)        0.46 (0.20)    0.42 (0.20)
  0.85             8000    0.49 (0.21)        0.49 (0.19)    0.44 (0.19)
  0.95             8000    0.51 (0.21)        0.52 (0.19)    0.45 (0.20)
Target setting     32000   0.48 (0.21)        0.48 (0.20)    0.43 (0.20)

Note: Standard deviations are in parentheses.

Table 5
MANOVA and effect sizes for teacher classification reliability based on Weighted Kappa

Effect                      F (df)                 p         Effect size (η²)
Within effect
  Targets                   1995.19 (2, 31980)     <0.0001   0.1110
Between effects
  Ratio                     3366.16 (1)            <0.0001   0.0952
  ICC                       4783.94 (1)            <0.0001   0.1301
  ρ                         360.53 (3)             <0.0001   0.0327
  μ                         139.20 (1)             <0.0001   0.0043
  Ratio × ICC               1.14 (1)               0.2847    0.0000
  Ratio × ρ                 9.59 (3)               <0.0001   0.0009
  Ratio × μ                 1.32 (1)               0.2513    0.0000
  ICC × ρ                   1.14 (3)               0.3331    0.0001
  ICC × μ                   2.09 (1)               0.1479    0.0001
  ρ × μ                     3.24 (3)               0.0213    0.0003
Between-within effects
  Targets × Ratio           2386.24 (2, 31980)     <0.0001   0.1299
  Targets × ICC             0.49 (2, 31980)        0.6129    0.0000
  Targets × ρ               33.28 (6, 63960)       <0.0001   0.0062
  Targets × μ               63.92 (2, 31980)       <0.0001   0.0040
  Targets × Ratio × ICC     0.53 (2, 31980)        0.5863    0.0000
  Targets × Ratio × ρ       37.05 (6, 63960)       <0.0001   0.0069
  Targets × Ratio × μ       8.01 (2, 31980)        0.0003    0.0005
  Targets × ICC × ρ         0.54 (6, 63960)        0.7746    0.0000
  Targets × ICC × μ         1.09 (2, 31980)        0.3367    0.0000
  Targets × ρ × μ           1.53 (6, 63960)        0.1623    0.0003

Note: Numbers in parentheses are degrees of freedom. The total number of observations for the repeated-measures MANOVA was 32,000.