Reliability of Using Student Learning Objectives for Teacher Evaluation
Shuqiong Lin, Xuejun Ji, Wen Luo
Texas A&M University
Abstract
This study examined the reliability of using Student Learning Objectives (SLOs) to rank and classify teacher effectiveness. Teacher-student ratio, intra-class correlation (ICC), test score reliability, entry-level achievement, and target setting model were manipulated as design factors to detect how they affect the accuracy of this teacher evaluation approach. Results showed fairly good reliability of teacher ranking and classification based on SLO scores. The study also found that teacher-student ratio, test score reliability, ICC, and target setting model each had at least a moderate effect on SLO-based evaluation, while the influence of different levels of entry-level achievement on evaluation reliability was very limited.
Keywords: Student Learning Objectives, Teacher Evaluation, Reliability of Evaluation
Introduction
In 1971, the concept of assessing teaching effectiveness based on students' achievement growth was first explored scientifically (Hanushek, 1971). As the concept matured, educators and policy makers widely adopted it. Value-added models, first introduced by William Sanders, have become one of the most popular and well-developed implementations of this concept. However, because they require state-wide standardized tests to measure students' growth, value-added models are typically applied only to reading and math teachers in grades 4 through 8 (Gill, Bruch, & Booker, 2013), who make up only 31% of all teachers
(Prince et al., 2006). That is, approximately 70% of teachers from pre-K through grade 12 cannot be assessed reliably using value-added models. The call for a more flexible teacher evaluation approach based on students' achievement growth gave rise to the Student Learning Objectives (SLO) program, an alternative indicator of teacher accountability.
A student learning objective is an academic goal that educators establish for each individual student or subgroup of students (Marion et al., 2012). The extent to which the goals have been
achieved represents students’ academic growth. Students’ academic growth is then linked to
teacher effectiveness based on the idea that high-performing teachers help their students to make
larger learning gains. Operationally, the SLO program requires teachers to set measurable academic growth targets for all students at the start of their course. The targets are set by teachers and project leaders (e.g., principals) based on students' prior achievement, such as test scores from previous years, students' strengths and weaknesses, and any available trend data. Throughout the semester,
teachers reflect on students’ academic progress and classroom practice, and strive to help
students to achieve the goals. At the end of the course, each student gets an SLO score which
equals his/her end-of-course achievement score minus his/her pre-set target. If a student gets a non-negative SLO score, he/she is marked as reaching the target; otherwise, he/she is marked as
failing to reach the target. Based on the percentage of students reaching their targets within each individual teacher's classes, teachers are ranked and sorted into different categories. Teachers with high target-reaching rates are classified as high-performing and may be considered for promotion or a raise in pay; teachers with low target-reaching rates are classified as low-performing and may need intensive professional development ("Reach SLO Manual", 2013; "Student Learning Objectives Operations Manual", 2013).
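The scoring rule above can be illustrated with a minimal sketch (Python, with hypothetical scores; the cited manuals define the actual operational procedures):

```python
# Illustration of the SLO scoring rule described above (hypothetical numbers).

def slo_score(end_of_course: float, target: float) -> int:
    """Return 1 if the student reached the pre-set target, 0 otherwise."""
    return 1 if end_of_course - target >= 0 else 0

# One teacher's students as (end-of-course score, pre-set target) pairs:
# the first student exceeds the target, the second falls short, the third ties.
students = [(88, 85), (82, 85), (90, 90)]
reached = [slo_score(y, t) for y, t in students]
rate = sum(reached) / len(reached)      # the teacher's target-reaching rate
print(reached, f"{rate:.1%}")           # [1, 0, 1] 66.7%
```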
As an alternative tool for teacher evaluation, the SLO program is notable for its flexibility. For one thing, learning growth in an SLO program can be measured using both formal (state-wide standardized) and less formal (locally determined) tests. Unlike value-added models, which only assess reading and math teachers in grades 4 through 8, SLO programs can be applied to any subject and any grade from kindergarten through high school. For another, various target setting methods are available to accommodate the diverse situations and needs of courses and teachers. A review of the three most commonly used target setting methods is provided in the Method section.
In addition to its flexibility, extant literature has also documented other advantages of
the SLO program. In general, studies indicate that setting learning objectives itself has a positive impact on learning. For example, a meta-analysis showed a high effect size for setting objectives in classroom instruction (Beesley & Apthorp, 2010), and a longitudinal study showed that schools with SLO programs improved their passing rates on the Texas Assessment of Knowledge and Skills (TAKS) more over time than schools without them (Schmitt, Lamb, Cornetto, & Courtemanche, 2013).
Due to these desirable features, some large school districts have started to apply the SLO program to instruction and teacher evaluation. For example, Ohio's Teacher Evaluation System (OTES) uses the SLO program for schools to establish local achievement measures and for teachers without value-added data, and New York State includes the SLO program in its Teacher and Leader Effectiveness (TLE) system with detailed operational guidance, forming a vital component of the national Student Learning Objectives (SLO) Work Group. The SLO program has also gained popularity in many other states and school districts, including Georgia, Indiana, Connecticut, Delaware, Tennessee, Denver Public Schools, District of Columbia Public Schools, Jeffco Public Schools, and the Austin Independent School District.
Despite the promising features of the SLO program and emerging interest from school districts, the psychometric properties of such measures for assessing teachers' performance are not well studied. Gill, Bruch, and Booker (2013) concluded that little evidence exists on the statistical properties of using SLO programs for teacher evaluation. They identified seven studies that examined the statistical properties of SLO programs, almost all of which focused on the impact of SLO programs on the improvement of students' test performance and teachers' performance (i.e., Goldhaber & Walch, 2011; Community Training and Assistance Center, 2004; Community Training and Assistance Center, 2013; Proctor, Walters, Reichardt, Goldhaber, & Walch, 2011; Schmitt et al., 2013; Tennessee Department of Education, 2012; Terry, 2008). Gill, Bruch, and Booker (2013) also noted that whether the ability of SLOs to distinguish among teachers represents true differences in teacher performance remains to be determined.
The purpose of this study is to examine how the accuracy of teacher ranking and
classification based on SLO measures is affected by the following factors: (1) test score
reliability, (2) teacher-student ratio, (3) intra-class correlation (ICC) of test scores, (4) entry-level
achievement, and (5) target setting models.
Method
To investigate the impact of the five factors on the performance of SLO measures of teacher
effectiveness, we conducted a Monte Carlo study in which we generated two-level SLO data
with students nested within teachers. The five design factors (i.e., teacher-student ratio, ICC, test score reliability, entry-level achievement, and target setting model) were manipulated. Below, we describe the data generating process and the design factor settings, followed by the analysis of the meta-data (i.e., the data set that contains statistics computed from each generated sample data set). All data generation and analyses were performed in SAS 9.3 (SAS Institute Inc., 2012).
Data Generation
The sample data were generated to mimic a typical public school. Class size and school size depend highly on school location (i.e., city, suburb, town, or rural) and school level (i.e., primary, secondary, or high school). Generally speaking, the class size for a typical public school is around 20 to 30, the school size is around 1,000 to 1,500, and each teacher usually teaches 2 to 3 different classes. Accordingly, we generated data based on a school with 1,200 students in total. The numbers of teachers and of students per teacher were determined by the teacher-student ratio design factor.
For each student, two true scores (i.e., scores without measurement error) needed to be created: the true pre-test score (X*ij) and the true end-of-course test score (Y*ij). The true pre-test score of each student was generated from a normal distribution, X*ij ~ N(μX*, δX*). The true end-of-course score also followed a normal distribution, Y*ij ~ N(μY*, δY*), and was generated using the following two-level linear model (i: ith student; j: jth teacher):

Level 1: Y*ij = β0j + β1j X*ij + rij   (1)

Level 2: β0j = γ00 + U0j   (2)

β1j = γ10   (3)

with rij ~ N(0, δr) and U0j ~ N(0, δU0),
where γ00 is the intercept, γ10 is the effect of the pre-test true scores on the end-of-course true scores, and U0j is the random effect of the jth teacher on Y*ij, which is also the jth teacher's true contribution to students' growth. A higher U0j means a greater contribution of the teacher to the student's end-of-course true score. Finally, rij represents the random error for the ith student taught by the jth teacher.
After all the true pre-test and end-of-course scores were generated, the observed pre-test and end-of-course scores (i.e., scores with measurement error) were generated by adding the measurement errors (Eij_X, Eij_Y) to the true scores:

Xij = X*ij + Eij_X   (4)

Yij = Y*ij + Eij_Y   (5)

with Eij_X ~ N(0, δE_X) and Eij_Y ~ N(0, δE_Y).
In the above data generation models, some parameters took different values based on the
levels in the design factors, some had fixed values, and some were derived from other parameters
based on certain assumptions. Next, we describe the design factors, followed by the parameters with fixed values and, finally, the derived parameters.
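All data generation and analyses in the study were performed in SAS 9.3. For readers who want to experiment, a minimal Python sketch of the generating process in equations (1) through (5) might look as follows; the parameter values are taken from one cell of the design described below (1:60 ratio, ICC = 0.1, ρ = 0.95, μX* = 65), and all variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# One design cell: 1:60 ratio (20 teachers x 60 students = 1,200 students),
# ICC = 0.1, test score reliability rho = 0.95, entry-level mean 65.
n_teachers, n_per_teacher = 20, 60
gamma00, gamma10 = 30.0, 0.8            # fixed effects in equations (2)-(3)
mu_x, sd_x = 65.0, 5.0                  # mean and SD of true pre-test scores
sd_u0, sd_r = 0.95, 2.85                # from equations (7)-(8) with ICC = 0.1
sd_e = 1.15                             # from equation (9) with rho = 0.95

# Level 2: one random effect per teacher, the teacher's true contribution U0j.
u0 = rng.normal(0.0, sd_u0, size=n_teachers)

# Level 1: true pre-test scores, then true end-of-course scores (equation 1).
x_true = rng.normal(mu_x, sd_x, size=(n_teachers, n_per_teacher))
y_true = (gamma00 + u0[:, None] + gamma10 * x_true
          + rng.normal(0.0, sd_r, size=x_true.shape))

# Observed scores = true scores + measurement error (equations 4-5),
# with end-of-course scores capped at 100 as described later in the text.
x_obs = x_true + rng.normal(0.0, sd_e, size=x_true.shape)
y_obs = np.minimum(y_true + rng.normal(0.0, sd_e, size=y_true.shape), 100.0)
```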
Design Factors
Teacher-student ratio. We chose two levels of teacher-student ratio: 1:60 and 1:40. Because the total number of students is fixed at 1,200, in the 1:60 condition there are 20 teachers with 60 students nested within each teacher in a sample, whereas in the 1:40 condition there are 30 teachers with 40 students nested within each teacher.¹

¹ In reality, each student learns different subjects from different teachers, meaning that students are cross-classified within several teachers. For simplicity, we study only teachers who teach the same subject in different grades; that is, each student is nested within only one teacher. Thus, the teacher-student ratio here is not the teacher-student ratio for each high school but the teacher-student ratio for a specific course in each high school. The teacher-student ratio for each simulated high school is much higher than these values.
Intra-class correlation (ICC). The ICC is a class-level characteristic that reflects how strong the clustering effect is for the test scores. We selected 0.1 and 0.2 because these values are the most commonly seen in educational settings (Hedges & Hedberg, 2007).
Test score reliability (ρ). Test score reliability refers to the reliability of students' pre- and post-test scores. Given that many applications of SLO programs are in grades and subjects without state-wide standardized tests, the reliability of the tests employed might vary significantly depending on the subjects and developers. For example, the reliabilities of reading and science test scores from testing programs like the SAT and ACT are usually higher than 0.9 (Ormrod, 2010, p. 527), whereas the reliability of scores from the ACT writing test was only 0.64 (ACT writing test technical report, 2009). Beyond such testing programs, the reliabilities of locally determined assessments, such as district-created assessments or teacher-created tasks, fluctuate more and are relatively low. Therefore, we selected four levels (i.e., 0.65, 0.75, 0.85, and 0.95) to represent reliabilities from low to high.
Entry-level achievement. Entry-level achievement is the mean of the pre-test true scores (μX*). It is common for some classes to perform relatively better at the beginning of a course than other classes. This factor might influence students' target-reaching rates under some target setting models. For example, under the class-wide target setting model (e.g., the same mastery goal for all the students in a class), the target-reaching rate for classes with low μX* might be lower than for classes with high μX*. We selected two levels of μX* (65 and 75) to represent low and medium entry-level performance, given that the highest possible score on the test is 100 points.
Target Setting Model. Students’ reaching target rates is directly related to the targets. Even
for a same group of students in the same course, using different targets will lead to different
target reaching rates, which brings about different teacher ranking outcomes. Half-split model,
Banded model and Class-wide Model are the three most commonly used target setting models
for SLO programs. For example, New York State and Georgia State used all three targets setting
approaches, Ohio State adopted the banded model, and Austin independent school district was
allowed to use only Half-split model.
The Half-Split model specifies targets at the individual level. It is an appropriate target setting approach when pre-test scores (Xij) are available and the end-of-course test is on a similar scale as the pre-test. In this model, the target score is THalf-Split = Xij + (100 − Xij)/2.
The Banded model categorizes students into different bands according to their Xij, and each band is assigned a target. The required growth for each band is unequal because the subgroup of students with low Xij often has more room for improvement than students who have already earned high scores on Xij. The Banded model is well suited to courses with tiered levels of expectations for students. For this study, students' banded target (TBanded) was set to 60 for Xij ranging from 0 to 30; 70 for Xij between 31 and 50; 80 for Xij between 51 and 70; and 90 for Xij above 70. These bands reflect the expectation that the possible room and required growth for students with low Xij are larger than for those with high Xij.²

² In reality, states using the Banded model all follow this general rule, but the specific ranges vary by state based on their own assessments and students' abilities. For more information, see the implementations in New York and Ohio ("Critical Decisions within SLOs: Target Setting Models", 2013; "A Guide to Using SLOs as a Locally-Determined Measure of Students Growth", 2013).
Finally, the Class-Wide model sets only one target for all students in a class. This shared target can be either high or low, representing the "mastery" level or the "pass" level of performance ("Critical Decisions within Student Learning Objectives (SLOs): Target Setting Models", 2013). Given the means of the pre-test scores (i.e., 65 or 75), we adopted 85 as the class-wide target score (i.e., TClass-Wide = 85).
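As a concrete illustration, the three target setting rules can be sketched in Python as follows (the band cutoffs and the class-wide target of 85 are those adopted in this study; the function names are ours):

```python
import numpy as np

def target_half_split(x_pre: np.ndarray) -> np.ndarray:
    """Half-Split: the target closes half of the distance to the maximum score."""
    return x_pre + (100.0 - x_pre) / 2.0

def target_banded(x_pre: np.ndarray) -> np.ndarray:
    """Banded: tiered targets using the bands adopted in this study."""
    return np.select([x_pre <= 30, x_pre <= 50, x_pre <= 70],
                     [60.0, 70.0, 80.0], default=90.0)

def target_class_wide(x_pre: np.ndarray) -> np.ndarray:
    """Class-Wide: one shared target (85 here) for every student in the class."""
    return np.full_like(x_pre, 85.0)

pre = np.array([25.0, 45.0, 65.0, 85.0])
print(target_half_split(pre))   # [62.5 72.5 82.5 92.5]
print(target_banded(pre))       # [60. 70. 80. 90.]
print(target_class_wide(pre))   # [85. 85. 85. 85.]
```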
Combining the different levels of the five design factors yields 2 (teacher-student ratio) × 4 (test score reliability) × 2 (ICC) × 2 (entry-level performance) × 3 (target setting) = 96 conditions. The first four factors are between-subject factors, and the last (target setting method) is the within-subject factor. 1,000 replications were run in each condition.
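For concreteness, the full crossing can be enumerated in a few lines (an illustrative Python sketch; the condition labels are ours):

```python
from itertools import product

# The 2 x 4 x 2 x 2 x 3 = 96-condition design described above.
ratios        = ["1:60", "1:40"]
reliabilities = [0.65, 0.75, 0.85, 0.95]
iccs          = [0.1, 0.2]
entry_means   = [65, 75]
target_models = ["Half-Split", "Banded", "Class-Wide"]

conditions = list(product(ratios, reliabilities, iccs, entry_means, target_models))
print(len(conditions))  # 96
```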
Parameters with fixed values
In equation (3), the fixed effect γ10 was set to 0.8, representing a relatively high correlation between the pre- and end-of-course tests. This is reasonable because the literature indicates that students' prior achievement strongly affects the absorption of new knowledge and, in turn, their end-of-course achievement. For example, prior knowledge has been found to support working memory and the acquisition of concepts (Cook, 2006; Hewson & Hewson, 2003), both of which play important roles in gaining new knowledge and achieving better test performance.
In equation (2), the fixed effect γ00 was set to 30 so that the mean of Y*ij equals 82 when μX* = 65 and 90 when μX* = 75 (see equation 6 for the calculation). This value was chosen to ensure that, under the Class-Wide target setting model, the mean of the post-test is not too far from the target (i.e., 85). If the mean of the post-test scores is too far from the target, it is equivalent to setting an unrealistically high or low class-wide target, which results in too many teachers with 0% or 100% target-reaching rates. That is undesirable from a psychometric point of view because of low discrimination.
Finally, the standard deviations of the pre-test and end-of-course true scores were set to 5 (i.e., δX* = δY* = 5). With this value, even when the mean of Y*ij is as high as 90, 95% of the scores still fall within the range of 0-100. Scores over 100 were set to 100. There is a slight ceiling effect when the mean of Y*ij is 90; however, this is realistic, as ceiling effects are often encountered in real-life exams.
Derived Parameters
Based on the above parameter values, μY* was computed using the following equation:

μY* = E(Y*ij) = E(β0j) + E(β1j X*ij)   (6)

Therefore, μY* equals 82 and 90 when μX* is 65 and 75, respectively. δU0 and δr were obtained using the following equations:

δU0 = [(δY*² − γ10² δX*²) × ICC]^(1/2)   (7)

δr = [(δY*² − γ10² δX*²) × (1 − ICC)]^(1/2)   (8)

Hence, for ICC = 0.1, δU0 is 0.95 and δr is 2.85; for ICC = 0.2, δU0 is 1.34 and δr is 2.68.
The computation of δE_X and δE_Y (i.e., the standard deviations of the measurement errors) drew on the definition of test score reliability (ρ). Because ρ is defined as the variance of the true scores divided by the variance of the observed scores, which is the sum of the variance of the true scores and the variance of the measurement errors, the standard deviation of the measurement errors can be calculated as follows:

δE_X = [((1 − ρ)/ρ) × δX*²]^(1/2)   (9)

Thus, for ρ = 0.95, δE_X was 1.15; for ρ = 0.85, δE_X was 2.10; for ρ = 0.75, δE_X was 2.89; and for ρ = 0.65, δE_X was 3.67. Because the true pre-test and end-of-course scores had the same variance and reliability, δE_Y was the same as δE_X in all conditions. We use δE to represent both from here on.
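Because equations (7) through (9) follow mechanically from the fixed parameter values, the derived quantities can be verified with a short sketch (Python; the function name is ours):

```python
import math

gamma10, sd_x, sd_y = 0.8, 5.0, 5.0  # fixed values from the preceding sections

def derived_sds(icc: float, rho: float) -> tuple[float, float, float]:
    """Equations (7)-(9): teacher-effect SD, residual SD, measurement-error SD."""
    resid_var = sd_y**2 - gamma10**2 * sd_x**2  # variance not explained by pre-test
    sd_u0 = math.sqrt(resid_var * icc)
    sd_r = math.sqrt(resid_var * (1.0 - icc))
    sd_e = math.sqrt((1.0 - rho) / rho * sd_x**2)
    return sd_u0, sd_r, sd_e

print(derived_sds(0.1, 0.95))  # (0.95, 2.85, 1.15), rounded
print(derived_sds(0.2, 0.65))  # (1.34, 2.68, 3.67), rounded
```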
Table 1 summarizes the parameter values used across the 96 conditions. Because the teacher-student ratio and target setting conditions share the same set of parameter values, each row of the table covers several conditions.
Insert Table 1 About Here
Analyses of data
Analysis of generated sample data. To obtain the SLO score for each student, the observed end-of-course score (Yij) was compared with the target score. If the target was met, the student's SLO = 1; otherwise, SLO = 0. Individual students' SLO scores were then aggregated at the teacher level to compute the target-reaching rate for each teacher. Note that because there are three target setting methods, each student has three SLO scores (one under each method) and each teacher has three target-reaching rates. For instance, suppose a teacher has 60 students nested within him/her, and under the Half-Split method 40 of those 60 students had SLO = 1; the target-reaching rate for this teacher based on the Half-Split method is then 40/60 = 66.67%. Teachers were then ranked according to the three observed target-reaching rates and the true teacher effect U0j. Therefore, in each sample data set there are three observed teacher rankings (i.e., RHalf-Split, RBanded, and RClass-Wide) and one true teacher ranking (RU0j).
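A minimal Python sketch of this computation is shown below, using randomly generated placeholder scores and targets rather than the simulated samples themselves (scipy's spearmanr is one way to obtain Spearman's rho):

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(seed=2)

# Placeholder inputs: observed end-of-course scores and target scores
# (teachers x students) plus the true teacher effects U0j.
y_obs = rng.normal(85.0, 5.0, size=(20, 60))
targets = rng.normal(83.0, 3.0, size=(20, 60))
u0 = rng.normal(0.0, 1.0, size=20)

slo = (y_obs - targets >= 0).astype(int)  # SLO = 1 if the target was met
reaching_rate = slo.mean(axis=1)          # one target-reaching rate per teacher

# Spearman's rho between the observed ranking and the true ranking.
rho_s, _ = spearmanr(reaching_rate, u0)
```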
In addition to teacher rankings, it is also important for a teacher evaluation method to correctly identify excellent and poor teachers. We used three categories: excellent (i.e., top 15%), ordinary (i.e., middle 70%), and poor (i.e., bottom 15%). Teachers were classified into the three groups based on the observed target-reaching rates and the true effect (i.e., U0j). Hence, there are three observed classifications in each generated data set (i.e., CHalf-Split, CBanded, and CClass-Wide) and one true classification (CU0j).
Outcomes of interest. We are interested in the correlation between the three observed rankings and the true ranking, and in the congruence between the three observed classifications and the true classification, because these measures represent the degree to which the SLO methods can correctly rank order and classify teachers. To measure the correlation between the true ranking and the three observed rankings, we computed Spearman's rho (ρs): ρHalf-Split, ρBanded, and ρClass-Wide. To measure the congruence between the three observed classifications and the true classification, we computed the weighted Kappa statistic: κHalf-Split, κBanded, and κClass-Wide.
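The classification and agreement computations can be sketched as follows (Python; the 15/70/15 cut points follow the text, whereas the linear Kappa weights are an assumption of ours, since the weighting scheme is not specified):

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def classify(values: np.ndarray) -> np.ndarray:
    """0 = poor (bottom 15%), 1 = ordinary (middle 70%), 2 = excellent (top 15%)."""
    lo, hi = np.quantile(values, [0.15, 0.85])
    return np.digitize(values, [lo, hi])

rng = np.random.default_rng(seed=3)
u0 = rng.normal(0.0, 1.0, size=20)             # true teacher effects
observed = u0 + rng.normal(0.0, 0.5, size=20)  # noisy SLO-based measure

true_class = classify(u0)
obs_class = classify(observed)

# Weighted Kappa between the observed and the true classifications
# (linear weights assumed here).
kappa = cohen_kappa_score(true_class, obs_class, weights="linear")
```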
Analysis of the meta-data. To examine the effects of the design factors on the outcomes,
we ran two sets of Multivariate Analysis of Variance (MANOVA). The first set included the
response variables of the three Spearman rho’s and the second set included the response
variables of the three Weighted Kappa. The explanatory variables included four between-subject
factors (i.e., teacher-student ratio, test score reliability, ICC, and entry-level achievement) and one within-subject factor (i.e., target setting method). Given that the purpose of using MANOVA in the present study was descriptive rather than inferential, the p values of the F-tests were not interpreted. Instead, the eta-squared (η²) effect size was computed and reported as a measure of practical significance. Only effects with η² greater than 0.05 were interpreted. The between-subject η² is computed using equation (10), and the within-subject and interaction effects' η² using equation (11):

η² = SSeffect / (SSeffect + SSerror)   (10)

η² = 1 − Λ^(1/s)   (11)³

³ Λ refers to Wilks' Lambda; s represents the number of response variables, which is either the Spearman rho's or the weighted Kappas.
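Both effect size formulas reduce to one-line computations, as the brief sketch below shows (Python, with hypothetical inputs):

```python
def eta_sq_between(ss_effect: float, ss_error: float) -> float:
    """Equation (10): between-subjects eta-squared from sums of squares."""
    return ss_effect / (ss_effect + ss_error)

def eta_sq_multivariate(wilks_lambda: float, s: int) -> float:
    """Equation (11): eta-squared from Wilks' Lambda with s response variables."""
    return 1.0 - wilks_lambda ** (1.0 / s)

# Hypothetical values, only to show the arithmetic.
print(eta_sq_between(29.0, 71.0))      # 0.29
print(eta_sq_multivariate(0.125, 3))   # 0.5
```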
Results
Teacher Rankings
Table 2 presents the means and standard deviations of the Spearman rho's broken down by the levels of the design factors. Overall, the correlation between the true and the SLO-based teacher rankings was high (mean Spearman's rho = 0.73). The MANOVA results (see Table 3) showed that ICC was the most influential factor, explaining 29% of the between-subjects variation in the dependent variables; as ICC increased, the correlations between the observed and true rankings also increased. Test score reliability accounted for about 10% of the between-subjects variation: as test score reliability increased, the correlations increased. Teacher-student ratio also mattered: compared with conditions with a teacher-student ratio of 1:40 (SHalf-Split = 0.77, SBanded = 0.70, and SClass-Wide = 0.66), higher Spearman rho values were observed for SHalf-Split (0.80), SBanded (0.76), and SClass-Wide (0.71) when the teacher-student ratio was 1:60. Finally, target setting method accounted for about 58% of the within-subjects variation. The Half-Split method showed the highest mean Spearman rho (0.78), followed by the Banded method (0.73); the Class-Wide method had the lowest mean correlation (0.68).
Insert Tables 2 & 3 About Here
Teacher Classification
Table 4 presents the means and standard deviations of the weighted Kappa broken down by the levels of the design factors. Overall, the agreement between the true and the observed teacher classifications was moderate (mean weighted Kappa = 0.46), indicating moderate reliability of SLO-based teacher classification. The MANOVA results (see Table 5) showed patterns similar to those for the teacher rankings. ICC had a moderate impact on classification accuracy (effect size = 0.13): higher ICC resulted in better agreement between the observed and true classifications, with data sets with ICC = 0.1 showing lower KHalf-Split, KBanded, and KClass-Wide (0.42, 0.42, and 0.37, respectively) than data sets with ICC = 0.2 (KHalf-Split = 0.53, KBanded = 0.53, and KClass-Wide = 0.48). Target setting method also affected classification accuracy (effect size = 0.11): in general, the Half-Split and Banded methods performed better (KHalf-Split = KBanded = 0.48) than the Class-Wide method (0.43). The interaction between target setting method and teacher-student ratio influenced classification reliability moderately as well: the difference between the 1:60 and 1:40 ratios was larger under the Half-Split model than under the Banded and Class-Wide models. In addition, when more students were nested within a teacher, classification reliability improved, and increases in test score reliability produced greater KHalf-Split, KBanded, and KClass-Wide (e.g., as ρ increased from 0.65 to 0.95, KHalf-Split equaled 0.44, 0.47, 0.49, and 0.51). Finally, when entry-level achievement was 65, KHalf-Split, KBanded, and KClass-Wide were 0.49, 0.48, and 0.44, all slightly larger than the corresponding values when entry-level achievement was 75 (0.46, 0.47, and 0.41).
Insert Tables 4 & 5 About Here
Discussion
The study examined the reliability of SLO scores as a measure of teacher effectiveness, and
how teacher rankings and classifications were affected by factors including teacher-student ratio,
test score reliability, ICC, entry-level achievement, and target setting methods.
Generally speaking, teacher rankings and classifications based on SLO scores are fairly reliable. Goldhaber and Walch (2011) observed high correlations between value-added estimates of teacher effects and students' achievement growth (0.8 for math and 0.6 for reading). Our simulation results showed an average correlation of 0.73 between the rankings based on the SLO measures and the true teacher effects, rising to 0.78 under the Half-Split target setting model. This suggests that SLO-based assessment of teacher effectiveness is statistically sound and can be fairly reliable when applied appropriately.
As expected, the reliability of teacher rankings and classifications increased with the test score reliability and with the variability of the true teacher effects (i.e., ICC). For both ranking and classification reliability, the 1:60 ratio produced better results. For any one teacher, the number of students nested within him/her is the sample size for that teacher, and a larger sample size stabilizes the teacher's SLO score (i.e., the target-reaching rate of his/her students), which in turn increases the reliability of SLO-based teacher evaluation.
Comparing the three target setting methods, the Half-Split and Banded methods produced more reliable teacher rankings and classifications than the Class-Wide method. Target scores under both the Half-Split and Banded approaches are set from pre-course test scores and are specific to individual students. With a class-wide target, by contrast, students with low pre-test scores may be unable to reach the target even when they make large gains. In that case, individual students' SLO scores do not adequately represent their growth, and teacher rankings and classifications based on those SLO scores are less reliable.
Implications
Based on our findings, we recommend that school district leaders choose achievement tests with high reliability (0.85 and above) and set individual-specific academic goals when using an SLO program to measure teacher effectiveness. Caution should be exercised when the number of students within a teacher is small (fewer than 40).
Limitations and Future Research Directions
This study assumed that all students have a pre-test score. In reality, however, students in pre-K and the early elementary grades may not have a baseline score for a given course; in that case, other target setting approaches need to be used. Another issue is that we assumed the pre-test and end-of-course tests were similar tests with the same reliability, whereas in practice they may be different tests with different reliabilities. Future studies can examine the relative importance of pre-test and post-test reliability.
References
ACT writing test technical report. (2009). Retrieved from http://www.act.org/aap/writing/pdf/TechReport.pdf
A Guide to Using SLOs as a Locally-Determined Measure of Students Growth. (2013). Retrieved from education.ohio.gov/getattachment/Topics/Teaching/Educator-Evaluation-System/Ohio-sTeacher-Evaluation-System/Student-Growth-Measures/Student-Learning-ObjectiveExamples/071513_SLO_Guidebook_FINAL.docx.aspx
Barge, D. J. (2013, August 26). Student Learning Objectives Operations Manual. Retrieved
December 23, 2013, from http://www.gadoe.org/School-Improvement/Teacher-and-LeaderEffectiveness/Documents/2013%20SLO%20Manual.pdf
Beesley, A., & Apthorp, H. (2010). Classroom instruction that works, second edition: Research
report. Denver, CO: Mid-continent Research for Education and Learning.
Community Training and Assistance Center. (2004). Catalyst for change: Pay for performance in Denver: Final report. Boston: Author. Retrieved May 4, 2012, from http://www.ctacusa.com/publications/catalyst-for-change-pay-for-performance-in-denverfinal-report/
Community Training and Assistance Center. (2013). It's more than the money. Boston: Author. Retrieved February 27, 2013, from http://www.ctacusa.com/PDFs/MoreThanMoneyreport.pdf
Cook, M. P. (2006). Visual representations in science education: The influence of prior knowledge and cognitive load theory on instructional design principles. Science Education, 90(6), 1073-1091.
Critical Decisions within Student Learning Objectives (SLOs): Target Setting Models [Video file]. (2013, October 22). EngageNY. Retrieved November 28, 2013, from http://www.engageny.org/resource/critical-decisions-within-student-learning-objectivesslos-target-setting-modele
Geldhof, G. J., Preacher, K. J., & Zyphur, M. J. (2013). Reliability estimation in a multilevel
confirmatory factor analysis framework. Psychological Methods,
doi:http://dx.doi.org/10.1037/a0032138
Gill, B., Bruch, J., & Booker, K. (2013). Using alternative student growth measures for evaluating teacher performance: What the literature says (REL 2013-002). Washington, DC: U.S. Department of Education, Institute of Education Sciences, National Center for Education Evaluation and Regional Assistance, Regional Educational Laboratory Mid-Atlantic.
Goldhaber, D., & Walch, J. (2011). Strategic pay reform: A student outcomes-based evaluation of Denver's ProComp teacher pay initiative. Economics of Education Review, 31(6), 1067-1083.
Hanushek, E. (1971). Teacher Characteristics and Gains in Student Achievement: Estimation
Using Micro Data. American Economic Review, 61(2), 280-288.
Hedges, L., & Hedberg, E. C. (2007). Intraclass correlation values for planning group-randomized trials in education. Educational Evaluation and Policy Analysis, 29, 60-87.
20
Hewson, M. G., & Hewson, P. W. (2003). Effect of instruction using students' prior knowledge
and conceptual change strategies on science learning. Journal of Research in Science
Teaching, 40, S86-S98.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Marion, S., DePascale, C., Domaleski, C., Gong, B., & Diaz-Biello, E. (2012). Considerations
for Analyzing Educators’ Contributions to Student Learning in Non-tested Subjects and
Grades with a Focus on Student Learning Objectives. Dover, NH: Center for Assessment.
Ormrod, J. E. (2010). Educational psychology: Developing learners (7th ed.). Boston, MA: Pearson.
Prince, C., Schuermann, P., Guthrie, J., Witham, P., Milanowski, A., & Thorn, C. (2006). The other 69 percent: Fairly rewarding the performance of teachers of non-tested subjects and grades. Washington, DC: U.S. Department of Education, Office of Elementary and Secondary Education.
Proctor, D., Walters, B., Reichardt, R., Goldhaber, D., & Walch, J. (2011). Making a difference in education reform: ProComp external evaluation report 2006-2010. Prepared for the Denver Public Schools. Denver: The Evaluation Center, University of Colorado.
Reach SLO Manual. (2013). Retrieved December 19, 2013, from
http://www.austinisd.org/sites/default/files/dept/reach/SLO_Manual_20132014FinalRevisedJ_0.pdf
SAS Institute Inc. (2012). SAS/STAT 9.3 user’s guide. Cary, NC: Author.
Schmitt, L., Lamb, L., Cornetto, K., & Courtemanche, M. (2013). AISD REACH program update: 2012-2013 student learning objectives (SLOs) (Department of Program Evaluation Publication 12.83). Austin, TX: Austin Independent School District.
Tennessee Department of Education. (2012). Teacher evaluation in Tennessee: A report on year
1 implementation. Nashville, TN: Author. Retrieved December 14, 2012, from
http://www.tn.gov/education/doc/yr_1_tchr_eval_rpt.pdf
Terry, B. D. (2008). Paying for results: Examining incentive pay in Texas schools. Austin: Texas
Public Policy Foundation. Retrieved December 13, 2012, from
http://broadeducation.org/asset/1128-paying%20for%20results.pdf
Table 1 List of parameters for two teacher-student ratios

ICC   ρ     μX*   μY*   δX*   δY*   δU0    δr     δE
0.1   0.95  75    90    5     5     0.95   2.85   1.15
0.1   0.95  65    82    5     5     0.95   2.85   1.15
0.1   0.95  55    74    5     5     0.95   2.85   1.15
0.1   0.85  75    90    5     5     0.95   2.85   2.10
0.1   0.85  65    82    5     5     0.95   2.85   2.10
0.1   0.85  55    74    5     5     0.95   2.85   2.10
0.1   0.75  75    90    5     5     0.95   2.85   2.89
0.1   0.75  65    82    5     5     0.95   2.85   2.89
0.1   0.75  55    74    5     5     0.95   2.85   2.89
0.1   0.65  75    90    5     5     0.95   2.85   3.67
0.1   0.65  65    82    5     5     0.95   2.85   3.67
0.1   0.65  55    74    5     5     0.95   2.85   3.67
0.2   0.95  75    90    5     5     1.34   2.68   1.15
0.2   0.95  65    82    5     5     1.34   2.68   1.15
0.2   0.95  55    74    5     5     1.34   2.68   1.15
0.2   0.85  75    90    5     5     1.34   2.68   2.10
0.2   0.85  65    82    5     5     1.34   2.68   2.10
0.2   0.85  55    74    5     5     1.34   2.68   2.10
0.2   0.75  75    90    5     5     1.34   2.68   2.89
0.2   0.75  65    82    5     5     1.34   2.68   2.89
0.2   0.75  55    74    5     5     1.34   2.68   2.89
0.2   0.65  75    90    5     5     1.34   2.68   3.67
0.2   0.65  65    82    5     5     1.34   2.68   3.67
0.2   0.65  55    74    5     5     1.34   2.68   3.67
Table 2 Description of Spearman rho across five design factors

Design factor        N       Mean SHalf-Split   Mean SBanded   Mean SClass-Wide
Ratio   1:60         16000   0.80 (0.11)        0.76 (0.12)    0.71 (0.14)
        1:40         16000   0.77 (0.11)        0.70 (0.12)    0.66 (0.14)
ICC     0.1          16000   0.73 (0.12)        0.67 (0.13)    0.61 (0.14)
        0.2          16000   0.83 (0.08)        0.79 (0.10)    0.74 (0.11)
μX*     65           16000   0.77 (0.11)        0.73 (0.13)    0.66 (0.14)
        75           16000   0.79 (0.12)        0.73 (0.13)    0.69 (0.15)
ρ       0.65         8000    0.73 (0.12)        0.69 (0.13)    0.65 (0.14)
        0.75         8000    0.77 (0.11)        0.72 (0.13)    0.67 (0.14)
        0.85         8000    0.80 (0.10)        0.74 (0.12)    0.69 (0.14)
        0.95         8000    0.83 (0.09)        0.77 (0.11)    0.70 (0.14)
Target setting       32000   0.78 (0.11)        0.73 (0.13)    0.68 (0.14)
Table 3 MANOVA and effect sizes for teacher ranking reliability based on Spearman rho

Effect                      F (df)                 p          Effect size
Within effect
Targets                     22035.80 (2, 31980)    <0.0001    0.5795
Between effects
Ratio                       3127.27 (1)            <0.0001    0.0891
ICC                         13077.50 (1)           <0.0001    0.2902
ρ                           1236.22 (3)            <0.0001    0.1039
μ                           304.48 (1)             <0.0001    0.0094
Ratio × ICC                 66.70 (1)              <0.0001    0.0021
Ratio × ρ                   1.88 (3)               0.1311     0.0002
Ratio × μ                   2.61 (1)               0.1063     0.0001
ICC × ρ                     4.99 (3)               0.0018     0.0005
ICC × μ                     0.44 (1)               0.5069     0.0000
ρ × μ                       4.07 (3)               0.0067     0.0004
Between-within effects
Targets × Ratio             120.35 (2, 31980)      <0.0001    0.0075
Targets × ICC               278.01 (2, 31980)      <0.0001    0.0171
Targets × ρ                 232.30 (6, 63960)      <0.0001    0.0422
Targets × μ                 235.14 (2, 31980)      <0.0001    0.0145
Targets × Ratio × ICC       1.54 (2, 31980)        0.2141     0.0001
Targets × Ratio × ρ         4.27 (6, 63960)        0.0003     0.0009
Targets × Ratio × μ         2.02 (2, 31980)        0.1321     0.0001
Targets × ICC × ρ           13.87 (6, 63960)       <0.0001    0.0026
Targets × ICC × μ           0.67 (2, 31980)        0.5135     0.0000
Targets × ρ × μ             3.04 (6, 63960)        0.0057     0.0006

Note: Numbers in parentheses are degrees of freedom. The total number of observations for the repeated-measures MANOVA was 32,000.
Table 4 Description of Weighted Kappa across five design factors

Design factor        N       Mean KHalf-Split   Mean KBanded   Mean KClass-Wide
Ratio   1:60         16000   0.56 (0.21)        0.50 (0.21)    0.45 (0.21)
        1:40         16000   0.40 (0.18)        0.45 (0.18)    0.40 (0.18)
ICC     0.1          16000   0.42 (0.21)        0.42 (0.19)    0.37 (0.19)
        0.2          16000   0.53 (0.20)        0.53 (0.19)    0.48 (0.19)
μX*     65           16000   0.49 (0.21)        0.48 (0.18)    0.44 (0.20)
        75           16000   0.46 (0.21)        0.47 (0.20)    0.41 (0.20)
ρ       0.65         8000    0.44 (0.21)        0.43 (0.20)    0.40 (0.20)
        0.75         8000    0.47 (0.20)        0.46 (0.20)    0.42 (0.20)
        0.85         8000    0.49 (0.21)        0.49 (0.19)    0.44 (0.19)
        0.95         8000    0.51 (0.21)        0.52 (0.19)    0.45 (0.20)
Target setting       32000   0.48 (0.21)        0.48 (0.20)    0.43 (0.20)
Table 5 MANOVA and effect sizes for teacher classification reliability based on Weighted Kappa

Effect                      F (df)                 p          Effect size
Within effect
Targets                     1995.19 (2, 31980)     <0.0001    0.1110
Between effects
Ratio                       3366.16 (1)            <0.0001    0.0952
ICC                         4783.94 (1)            <0.0001    0.1301
ρ                           360.53 (3)             <0.0001    0.0327
μ                           139.20 (1)             <0.0001    0.0043
Ratio × ICC                 1.14 (1)               0.2847     0.0000
Ratio × ρ                   9.59 (3)               <0.0001    0.0009
Ratio × μ                   1.32 (1)               0.2513     0.0000
ICC × ρ                     1.14 (3)               0.3331     0.0001
ICC × μ                     2.09 (1)               0.1479     0.0001
ρ × μ                       3.24 (3)               0.0213     0.0003
Between-within effects
Targets × Ratio             2386.24 (2, 31980)     <0.0001    0.1299
Targets × ICC               0.49 (2, 31980)        0.6129     0.0000
Targets × ρ                 33.28 (6, 63960)       <0.0001    0.0062
Targets × μ                 63.92 (2, 31980)       <0.0001    0.0040
Targets × Ratio × ICC       0.53 (2, 31980)        0.5863     0.0000
Targets × Ratio × ρ         37.05 (6, 63960)       <0.0001    0.0069
Targets × Ratio × μ         8.01 (2, 31980)        0.0003     0.0005
Targets × ICC × ρ           0.54 (6, 63960)        0.7746     0.0000
Targets × ICC × μ           1.09 (2, 31980)        0.3367     0.0000
Targets × ρ × μ             1.53 (6, 63960)        0.1623     0.0003

Note: Numbers in parentheses are degrees of freedom. The total number of observations for the repeated-measures MANOVA was 32,000.