The Cross-Year Stability of Teacher Quality Metrics and Predictors of their Change

Abstract

Prior research has consistently found marked cross-year instability in teachers’ value-added scores, but additional evidence is needed regarding the stability of other key metrics used in updated teacher evaluation systems. Furthermore, little research has explored the predictors of changes in scores. Using a new dataset, we replicate prior work measuring the intertemporal stability of value-added, observation, and student perception metrics of teacher quality, and we investigate predictors of cross-year changes. In addition to providing further evidence for the latter two metrics, we extend the literature by investigating whether changes can be predicted by variables found outside of traditional administrative datasets. We specifically explore the effects of teacher learning opportunities, changes in instruction, and teachers’ own perceptions regarding the ease of educating a given classroom of students, and, notably, we find that teachers’ perceptions predict changes in scores. However, the predictors explain only a very small proportion of the variation in teacher quality scores.

Keywords: assessment, consequential validity, teacher evaluation


The Cross-Year Stability of Teacher Quality Metrics and Predictors of their Change

Recent federal policies have incentivized states and districts to use multiple metrics to assess teacher and teaching quality,¹ including scores created from value-added models, standardized observational instruments, and student-reported perceptions of classroom quality. Because these metrics play a central role in revamped teacher evaluation and accountability systems, researchers and policymakers have invested considerable resources to better understand their properties. Though a substantial proportion of these efforts has explored measure bias (e.g., Authors, 2015a; Chetty, Friedman, & Rockoff, 2014a; Kane, McCaffrey, Miller, & Staiger, 2013; Kane & Staiger, 2008; Rothstein, 2009), another important property of interest is the intertemporal stability of scores, or the extent to which these measures identify the same teachers as effective or ineffective over adjacent academic years.

In fact, existing research often indicates marked shifts in individual teachers’ value-added scores between academic years (e.g., Aaronson, Barrow, & Sander, 2007; Ballou, 2005; Koedel & Betts, 2007; McCaffrey, Sass, Lockwood, & Mihaly, 2009). This inconsistency has led some researchers to question the utility of value-added metrics for both formative and summative purposes (e.g., Baker et al., 2010). Yet despite the fact that new teacher evaluation scores typically include multiple data sources, this literature has largely focused on the stability of value-added estimates, with some exceptions (e.g., Polikoff, 2015). Additional evidence is needed regarding the stability of the other measures states and districts use to evaluate teachers, as metric instability can complicate efforts to identify teachers for termination, remediation, and reward.

Another underexamined issue in this literature is the extent to which year-to-year changes in teacher scores can be predicted, for instance by the resources provided to teachers, by the amount of class time allotted to test preparation reported by teachers, by the composition of students in teachers’ classrooms, or even by teachers themselves, as they perceive and react to changes in their student population. If a sizeable portion of the variance in scores between years can be explained by such sources, concerns regarding the validity and utility of scores would be at least partially ameliorated, especially if such results could be used to inform improvements to instruction or outcomes (e.g., by providing additional professional development or coaching experiences).

To gain insight into these issues, we use data from fourth- and fifth-grade teachers of mathematics and their students to answer the following research questions:

(1) What is the cross-year stability of key teacher quality metrics?

(2) Can we identify classroom-level predictors that are significantly associated with within-teacher cross-year variability in scores?

(3) How much of the variance in within-teacher scores can be explained by these predictors?

In what follows, we review the literature surrounding the cross-year stability of teacher quality metrics derived from student performance on standardized assessments (value-added metrics), standardized observational instruments applied to recorded or live lessons (observation metrics), and student survey data (student perception metrics). We then describe the sources of our data, our methods for data reduction and analysis, and our results. Overall, we observed that within-teacher cross-year score stability ranged from 0.24 to 0.47, with substantial movement in teacher ranking (i.e., a change in performance quintile greater than one) occurring for up to 43% of teachers across years, depending on the metric considered. Teachers’ perceptions of their students’ abilities relative to the prior year were the most consistent predictor of cross-year changes in teacher scores; however, this and other predictors explained a small proportion of the variability in such changes. We conclude our paper by addressing the implications of our findings for policymakers and future research.

Background

Recent federal and state policies, including Race to the Top and waivers to the No Child Left Behind Act, have encouraged the adoption of more stringent teacher evaluation metrics and multifaceted accountability systems. These changes occurred following evidence that previous evaluation systems failed to meaningfully distinguish among teachers (Weisberg et al., 2009), despite numerous studies showing that teachers substantially affect both student test scores (e.g., Hanushek & Rivkin, 2010; Nye, Konstantopoulos, & Hedges, 2004) and long-term outcomes, such as college enrollment and lifetime earnings (Chetty, Friedman, & Rockoff, 2014b). In many locations, new accountability systems use data from several different sources: student performance on standardized assessments, aggregated within value-added models to produce teacher scores; student performance on Student Learning Objectives, aggregated to the teacher level; observations of classroom instruction scored using standardized instruments, such as the Framework for Teaching (Danielson, 2011); and student-reported perceptions of teacher and classroom quality, usually collected via surveys and aggregated to the teacher level (e.g., the Tripod; Ferguson, 2008). In this section, we review existing research on cross-year stability in the three metrics we consider in our analyses.

The Cross-Year Stability of Teacher Quality Metrics

The stability of measures aggregated from student test scores has been of concern since the early 1970s, when researchers began to identify more and less effective teaching practices based on a comparison of classroom instruction with students’ gains on basic skills assessments. In these studies, often referred to collectively as the process-product literature, correlations between adjacent-year gains in student test scores ranged from 0.2 to 0.4 (Brophy, 1973; Brophy, Coulter, Crawford, Evertson, & King, 1975; Good & Grouws, 1975). However, teacher-level scores in this literature often represented simple aggregated gains (i.e., the average student post-test performance differenced from the pre-test performance), and did not take into account other influential factors, such as student background or peer characteristics.

More recently, scholars have examined the cross-year stability of teacher value-added scores. Unlike scores used in analyses of the early process-product literature, value-added metrics typically measure teacher impacts on student test performance after controlling for other observable characteristics of the student or classroom that might influence outcomes. Cross-year correlations for these scores vary from study to study, but largely arrive at similar conclusions when estimated from similarly controlled value-added models (Koedel, Mihaly, & Rockoff, 2015). For example, McCaffrey et al. (2009) used panel data from five Florida school districts and found most cross-year correlations in the range of 0.2 to 0.7. The Measures of Effective Teaching (MET) study found cross-year correlations of 0.2 for English language arts (ELA) and roughly 0.5 for mathematics (Kane & Staiger, 2012). Goldhaber and Hansen (2013) analyzed ten years of panel data from North Carolina and found that the average cross-year correlation between teacher scores was 0.55. Notably, averaging teacher value-added scores into three-year bundles and then correlating those bundles improved this estimate to 0.65.

Transition matrices provide additional insight into the cross-year stability of value-added models by describing within-teacher changes in adjacent-year value-added ranks or categorizations (e.g., quintiles). Koedel and Betts (2007), for instance, used data from San Diego schools to show that only 20% to 35% of teachers, depending on quintile and model specification, remained in the same performance quintile in consecutive years; in one model, 13% moved from the first to last quintile, or vice versa. Using data from schools in Chicago, Aaronson et al. (2007) found that 23% to 57% of teachers remained in the same quartile across years. The authors also found that 10% of the highest-ranked teachers in one year were ranked in the lowest quartile the prior year, and 8% of the lowest-ranked teachers were ranked in the highest quartile the prior year. Ballou (2005) presented results from the Tennessee Value-Added Assessment System that were also consistent with the studies described above.

The process-product studies of the 1970s also investigated the stability of the “process” side of the equation—measures of teachers’ classroom behaviors. Some scholars (Brophy et al., 1975; Marshall, Green, Hartsough, & Lawrence, 1977) examined stability in observation metrics across lessons within a given school year, noting that, for many areas, considerable instability existed across time of day, subject matter, and lessons. This led to the application of generalizability theory (Shavelson & Webb, 1991) in attempts to recover within-year estimates of the stability of teacher behavior. Studies applying generalizability theory to several modern instruments have estimated that, depending on the instrument and the practices being assessed, between 13% and 40% of the variance in observation scores lies at the teacher level (Bell et al., 2012; Hill, Charalambous, & Kraft, 2012), leading many to recommend that such scores be based on multiple lessons assessed by multiple raters in order to improve overall reliability.

Cross-year stability in these observational metrics, however, has been less often investigated. Brophy and colleagues (1975) estimated cross-year stability correlations for a set of classroom indicators observed four times in the first year and 14 times in the second. Correlations were in the 0.5 to 0.7 range for items capturing negative and positive teacher affect, clarity of the presentation, and teacher-initiated problem-solving. Polikoff (2015) found a wide range of cross-year stability coefficients for scores from observation instruments used in the MET study, which collected up to four lessons per year per teacher. Based on the coefficients from regressions of standardized current-year scores on standardized prior-year scores (standardized regression coefficients), the stability of scores for subject-independent behaviors (e.g., managing student behavior, creating a positive environment for learning) varied from as low as 0.25 to as high as 0.55; the stability of scores for subject-dependent behaviors (e.g., the prevalence of mathematical errors during instruction) varied from 0.04 to 0.35. Transition matrices from these analyses indicated that, though considerable cross-year variability existed in within-teacher scores on several observation instruments, teachers switched from the bottom quintile to the top quintile, or vice versa, in fewer than five percent of possible observations.

Polikoff (2015) argued that these estimates may have been influenced by the MET’s less-than-ideal scoring conditions. For example, teachers in the MET were scored using truncated versions of longer instruments; the MET scoring design also allowed for scoring only 30 minutes of instruction per lesson, and only one rater viewed and scored that 30-minute clip. Most scoring advice and designs for observational instruments recommend having multiple raters score as much instruction as possible on the complete instrument (see Bell et al., 2012; Hill et al., 2012) to increase the amount of signal variance in scores. Under more comprehensive scoring designs, a reasonable expectation would be to find higher cross-year correlations for observation scores.

Research on the stability of student surveys aggregated and used as teacher evaluation instruments is scarce. In the MET study, the stability of different scales from the Tripod student survey instrument (Ferguson, 2008) from December to March in the same school year ranged from 0.7 to 0.85 (Kane & Cantrell, 2010); this statistic was reported after correcting for measurement error. Polikoff (2015) used MET data and found uncorrected, cross-year standardized regression coefficients for teacher Tripod scores in the 0.32 to 0.45 range, depending on the subject; in fewer than 11% of observations did teachers switch from the bottom to the top performance quintile, or vice versa, across years. In his review of research around student perceptions, however, Polikoff noted that his stability estimates were lower than those reported for student evaluations of teaching in higher education (e.g., Marsh, 2007; Marsh & Hocevar, 1991).

Explanations for Cross-Year Instability

Metrics of teacher quality may be unstable across years for two reasons (Koedel et al., 2015; McCaffrey et al., 2009; Polikoff, 2015). First, measurement error in outcomes (i.e., student test or classroom observation scores) adds noise to teacher scores and can attenuate cross-year correlations; given that reliability estimates for teacher metrics are often quite low, cross-year correlations are likely depressed. Second, scores can also be unstable because teachers’ actual capacity to deliver instruction and improve student outcomes may change across time periods, either for year-specific reasons (e.g., teaching a particularly cohesive classroom in one year) or because of long-term time trends (e.g., actual improvement due to increased experience).

Despite these clear reasons for instability, few scholars have explored the factors that might explain this instability, and the studies that exist tend to focus only on factors that can be obtained from administrative datasets. Two separate studies using panel data from North Carolina (Goldhaber & Hansen, 2013; Jackson & Bruegmann, 2009) found that teachers’ value-added scores can be modestly predicted by the value-added scores of their peers. Goldhaber and Hansen further showed that peer absences predicted teachers’ value-added scores, and found small associations in the expected direction between teacher value-added scores and class size, percent of students in the class with free-lunch eligibility, percent of the class that is minority, and teacher absences. Papay and Kraft (2015), also using panel data from North Carolina, isolated an effect of teacher experience on changes in value-added scores.

Similar to this initial evidence investigating value-added metrics, year-to-year changes in student composition may also yield differences in instructional quality, as perceived by external raters or students; student composition may in particular affect the behavioral climate of a classroom and the pacing of new material, as teachers adjust instruction to fit the needs of different sets of students. In analyses of MET observation and student survey data, however, few significant predictors of cross-year variability emerged when considering changes in the racial or gender composition of the classroom, the percentage of students with disabilities or considered English language learners, or average student prior achievement (Polikoff, 2015).

Beyond variables that can be easily gleaned from administrative datasets, several other substantive explanations for cross-year instability stand out. Some focus on professional-learning resources that theoretically lead to improvement in teachers’ scores, including the receipt of coaching, the presence of peer collaboration, and participation in professional development. In studies of specific programs, each has been demonstrated to improve instruction and/or student outcomes (e.g., for coaching, see Allen, Hafen, Gregory, Mikami, & Pianta, 2015; for peer collaboration, see Gersten, Dimino, Jayanthi, Kim, & Santoro, 2010; for professional development, see a review by Yoon, Duncan, Lee, Scarloss, & Shapley, 2007). However, the more typical effect of such teacher experiences on year-to-year differences in scores is not known. Resources specific to schools themselves, such as curriculum materials, access to services for students in need, or collegiality, may also vary by year, further affecting teacher scores; in prior work, such variables have been associated with outcomes such as teacher turnover and satisfaction (e.g., Johnson, Kraft, & Papay, 2012).

Teacher quality metrics may also be sensitive to teachers’ instructional changes across years. In particular, teachers may adjust the amount of test preparation activity in which they engage from year to year in response to prior-year accountability scores, changes in formal accountability requirements, or the perceived needs of students. It is difficult to predict the effect of test preparation activity; additional test preparation may improve value-added scores on state tests if well-targeted, but negatively impact student test scores if other valuable instruction is crowded out. It is also logical to think that test preparation activities might negatively impact observational scores, as teachers replace rich content instruction with test preparation strategies or similar behaviors (e.g., Hannaway & Hamilton, 2008; Plank & Condliffe, 2013).

Finally, teachers themselves may be able to predict changes in their classroom instruction and student outcome scores. Teachers may view particular cohorts of students as stronger at entry into the classroom or easier to educate while there, or as weaker at entry, perhaps because of students’ behavioral challenges and learning needs, and may adjust their instruction accordingly. If teachers can predict changes in their own instruction and student outcome scores, it suggests that unmeasured factors may play a role in the production of both instruction and student outcomes.

In sum, the review of the literature suggests considerable interest in the stability of classroom and student outcomes, especially in light of the use of these metrics in current teacher evaluation systems. Prior empirical evidence has accumulated regarding the stability of value-added scores, but the research base around observation and student perception metrics is relatively thin. To this end, we replicate Polikoff’s (2015) analysis of value-added, observation, and student perception score stability with a different set of data, and then extend his analysis to ask whether teacher learning opportunities, test preparation behavior, and teachers’ own reports of students predict cross-year score fluctuations.

Data

Data Sources

We drew on three years of data collected during the 2010-11, 2011-12, and 2012-13 academic years in support of a study investigating the relationships between teacher characteristics, instructional quality, and test performance. Our sample included fourth- and fifth-grade elementary mathematics teachers and their students from four urban East Coast public school districts. The study recruited only teachers in schools containing at least two mathematics teachers in either fourth or fifth grade. Ultimately, study staff invited 583 teachers to join the study, with 328, or 56%, matriculating into the data collection phase.

The sources of data in the study included: (a) up to three video-recorded lessons of instruction per teacher per year, for a total of 1,721 lessons across three years and an average of 5.4 videos per teacher; (b) teacher surveys administered twice per year, with response rates ranging from 95% to 98% across years; (c) a student survey based on the Tripod (Ferguson, 2008) administered once per year in the spring;² and (d) student administrative data, including state mathematics and ELA standardized test scores, student scores on an alternative mathematics assessment administered by the project, and demographic information. For most teachers in our sample, we had multiple years of data; hereafter, we refer to each teacher-year of data as a classroom.

Sample

Our main analysis compares the estimates of the cross-year stability of different teacher quality metrics to those found in prior literature and to one another. In order to ensure such within-study comparisons were not a function of different sample sizes owing to variably missing data, we restricted the sample to include only classrooms with both a current- and prior-year score for all quality measures (described below). This restriction resulted in a main analysis sample comprising 251 classrooms taught by 181 unique teachers. Table 1 below compares the characteristics of classrooms in our restricted sample to those outside of the sample.

[Insert Table 1]

We found that some characteristics of the classrooms in our sample differed from those excluded from our sample. For example, difference-in-means t-tests assuming equal variances suggested that the two groups of classrooms were statistically different in terms of average size, average percentage of white students, average percentage of Hispanic students, average percentage of students classified as English language learners (ELL), average percentage of students eligible for free- or reduced-price lunch (FRPL), and prior test performance. The raw means of these characteristics were generally very similar across the groups, however, and, importantly, single-year estimates of the value-added scores for the classrooms across these two groups did not differ significantly, suggesting samples of similar quality.³
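To make this comparison concrete, the following is a minimal sketch, not our analysis code, of the kind of equal-variance t-test used to compare in-sample and excluded classrooms; the file and column names (classrooms.csv, in_sample, pct_frpl, etc.) are hypothetical placeholders.

```python
import pandas as pd
from scipy import stats

# Hypothetical classroom-level file with an in-sample indicator and
# classroom characteristics (class size, demographic percentages, etc.).
classrooms = pd.read_csv("classrooms.csv")

def compare_groups(df: pd.DataFrame, characteristic: str):
    """Equal-variance t-test comparing in-sample and excluded classrooms."""
    in_sample = df.loc[df["in_sample"] == 1, characteristic].dropna()
    excluded = df.loc[df["in_sample"] == 0, characteristic].dropna()
    return stats.ttest_ind(in_sample, excluded, equal_var=True)

for col in ["class_size", "pct_white", "pct_hispanic", "pct_ell", "pct_frpl"]:
    t_stat, p_value = compare_groups(classrooms, col)
    print(f"{col}: t = {t_stat:.2f}, p = {p_value:.3f}")
```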

For our analyses attempting to explain year-to-year changes in scores, we used a subset of teachers and classrooms. This sample included 244 of the original 251 classrooms, taught by 177 of the 181 teachers. These classrooms possessed scores on all predictors of change (described below) in addition to having scores for all quality metrics in any given year.

Data Reduction

Value-added metrics. We calculated value-added scores for each teacher using two different measures of student achievement: standardized state mathematics assessments (state VA) and an alternative, low-stakes mathematics test (study VA) administered to students by the study and co-developed by project staff with Educational Testing Service. Investigation into the characteristics of student scores from both tests suggested acceptable levels of reliability, with student performance on both tests exhibiting internal consistency estimates greater than 0.82, depending on the grade and year of administration (see Authors, 2012; Authors, 2015b). To recover teacher scores, we estimated the following multilevel equation for each year of the study, where students are nested within teachers:

$$a_{jckgdt} = \beta_0 + \beta_1 A_{j,t-1} + \beta_2 X_{jt} + \gamma_{gt} + \delta_d + \mu_k + \varepsilon_{jckgdt} \qquad (1)$$

The outcome, $a_{jckgdt}$, represents student $j$'s rank-standardized score on either the state or alternative mathematics exam at time $t$. Our model controlled for: (1) $A_{j,t-1}$, a vector of prior test performance for student $j$ at time $t-1$, including linear, quadratic, and cubic terms for student $j$'s mathematics exam score at time $t-1$ and a linear term for student $j$'s state ELA exam score at time $t-1$; (2) $X_{jt}$, a vector of student demographic indicators including the modal gender or race for student $j$, and student $j$'s ELL, FRPL, and special education (SPED) status at time $t$; and (3) $\gamma_{gt}$ and $\delta_d$, fixed effects for grade-by-year interactions and district, respectively.

The equation parameter $\mu_k$, a random effect for being taught by teacher $k$, represents teacher $k$'s state VA or study VA in a given year, adjusted for measurement error due to differences between teachers in the number of students taught in a year. Value-added vendors for several notable school districts, including the District of Columbia, Pittsburgh, and Florida, use similar models to the one we employ (Goldhaber & Theobald, 2012).

To be included in our model, a student's tested grade level at time $t$ had to be exactly one grade higher than his or her tested grade at time $t-1$.⁴ Furthermore, we excluded classes in which more than 50% of students were categorized as SPED, classes in which more than 50% of students had missing baseline test scores, and classes that, after all other restrictions, contained fewer than five students. These restrictions were implemented to ensure the typicality of the sample of students from which value-added scores were being estimated.
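To make the estimation step concrete, the sketch below shows one way a teacher random-intercept model in the spirit of Equation 1 could be fit with Python's statsmodels. It is illustrative only, not the code used in our analyses, and all file and column names (student_year.csv, math_score, teacher_id, etc.) are hypothetical placeholders.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical student-level file: one row per student-year, with
# rank-standardized outcomes, prior test scores, demographics, and IDs.
students = pd.read_csv("student_year.csv")

# Controls mirroring Equation 1: linear, quadratic, and cubic prior math
# scores, prior ELA score, demographics, and district fixed effects.
# Because the model is fit separately by year, the grade-by-year
# interaction reduces to grade indicators within each year.
formula = (
    "math_score ~ prior_math + I(prior_math**2) + I(prior_math**3) "
    "+ prior_ela + female + C(race) + ell + frpl + sped "
    "+ C(district) + C(grade)"
)

va_by_year = {}
for year, df in students.groupby("year"):
    model = smf.mixedlm(formula, data=df, groups=df["teacher_id"])
    result = model.fit()
    # The empirical Bayes estimate of each teacher's random intercept plays
    # the role of mu_k; shrinkage reflects the number of students taught.
    va_by_year[year] = {
        teacher: effects.iloc[0]
        for teacher, effects in result.random_effects.items()
    }
```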

Observation metrics. As noted above, study teachers video-recorded their instruction up to three times per year over the three years of the study. Teachers selected the lessons to be recorded, though project guidelines discouraged recording on days with test preparation or exams. Most lessons lasted between 45 minutes and one hour. These video data were split into segments⁵ and scored by two raters using the Mathematical Quality of Instruction (MQI) observation instrument (Hill et al., 2008), and by one rater on the Classroom Assessment Scoring System (CLASS) instrument (Pianta, LaParo, & Hamre, 2007). All raters in the study participated in biweekly calibration meetings to ensure the standardization of scoring processes and accuracy in the application of the instruments.

The MQI is designed to capture mathematics-specific aspects of teaching, such as the presence of student mathematical reasoning and explanation, and the dense use of disciplinary language among teachers and students. The CLASS is a subject-independent observation instrument that captures more general classroom phenomena, such as climate and organization.

Previous factor analyses suggested that teacher performance across the MQI and CLASS instruments formed the following four factors (Authors, 2015c): (a) ambitious instruction, which captures elements of mathematics instruction such as linking between mathematical representations, providing mathematical explanations, remediating student mathematical difficulties, and providing opportunities for students to engage in Common Core-aligned practices; (b) errors, which captures the extent to which teachers' instruction contains mathematical errors, imprecisions, or a lack of clarity; (c) classroom climate, which captures the emotional and instructional support provided to students by the teacher as well as student engagement levels during instruction; and (d) classroom organization, which captures the productivity (e.g., time management) and negative climate (reverse scored) of the classroom, as well as the teacher's behavior management skills.

For these analyses, we calculated teacher scores for each factor within each school year. First, we averaged video segment-level scores across the items composing each factor. We then aggregated these scores to the lesson level, and averaged lesson-level scores for the four factors—ambitious instruction, errors, classroom climate, and classroom organization—to each teacher. For project estimates of the reliability of the observation metrics used in our analyses, please see Table A1 in the Appendix.
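As an illustration of this segment-to-lesson-to-teacher aggregation (a minimal sketch, not our scoring code; the item and column names are hypothetical placeholders):

```python
import pandas as pd

# Hypothetical segment-level scores: one row per video segment, with item
# scores and identifiers for lesson, teacher, and school year.
segments = pd.read_csv("segment_scores.csv")

# Placeholder item-to-factor mapping; the actual MQI/CLASS items differ.
factors = {
    "ambitious_instruction": ["linking_reps", "explanations", "remediation"],
    "errors": ["math_errors", "imprecision", "lack_of_clarity"],
}

# Step 1: average items within each segment to form segment factor scores.
for factor, items in factors.items():
    segments[factor] = segments[items].mean(axis=1)

# Step 2: average segment scores up to the lesson level.
lessons = (
    segments.groupby(["teacher_id", "year", "lesson_id"])[list(factors)]
    .mean()
    .reset_index()
)

# Step 3: average lesson scores to produce teacher-by-year factor scores.
teacher_scores = lessons.groupby(["teacher_id", "year"])[list(factors)].mean()
print(teacher_scores.head())
```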

Student perception metrics. We used data from student perception surveys to calculate a third category of teacher quality metric. These surveys included a subset of items from the Tripod survey (Ferguson, 2008), and asked students to report on seven teacher characteristics and activities: care for students, classroom management ability, efforts to clarify ideas and lessons, efforts to challenge students to work and think hard, ability to captivate students in the classroom, discussions with students about mathematics, and efforts to consolidate the material learned. Twenty-six total items from the student surveys were subjected to an exploratory factor analysis; that analysis suggested that a single factor (Tripod) best fit the data. Across the three study years, the average Cronbach's alpha estimate was 0.91 for this single factor. To calculate Tripod scores for each teacher in a given year, we first averaged student responses across all 26 items, then aggregated these scores within teacher.

Predictors of teacher quality metrics. We created 11 different within-year teacher variables to predict changes in teacher quality metrics. Some of the variables we considered, such as changes in classroom composition, have been investigated by prior analyses; new elements included measures of teacher resources, test-related activities, and teacher perceptions of students' relative ability. To create classroom composition predictors of change, we differenced the values for the demographic (i.e., proportion FRPL, ELL, or SPED) and academic composition (i.e., baseline math test performance) of teachers' students across years.

For the other predictors of change, we created composite scores by first conducting an exploratory factor analysis on items within stems on the teacher survey to investigate patterns in teacher responses; these analyses generally confirmed that hypothesized structures were present, and, where not, we modified scales to exclude misfitting items. Next, we calculated Cronbach's alphas ($\alpha$, reported below) to ensure the internal consistency of composites created from the suggested factor structures, then took the average of teacher responses across items within each factor in a year to produce teacher scores (a minimal scoring sketch appears after the predictor descriptions below). Predictors capturing changes in learning opportunities across years included:

Coaching scores, estimated from two items measuring the frequency in the previous year that a teacher observed a district mathematics coach and that a teacher was observed by a district mathematics coach (2011-12 $\alpha = 0.76$; 2012-13 $\alpha = 0.78$);

Collaboration scores, estimated from three items measuring the frequency in the previous year of collaborative lesson planning, analysis of student assessment results with a mathematics coach or other teachers, and analysis of student work with a mathematics coach or other teachers (2011-12 $\alpha = 0.82$; 2012-13 $\alpha = 0.71$); and

Professional development scores, estimated from six items measuring the frequency in the previous year that a teacher studied or learned about: how students learn mathematics; mathematics pedagogy; mathematics curriculum materials; results of state or district tests in any subject; and general pedagogy (2011-12 $\alpha = 0.85$; 2012-13 $\alpha = 0.86$).


The above metrics were developed from teachers' retrospective reports of engagement with these professional learning resources in the prior year. A set of additional items asked teachers to report on current instructional activities and school conditions; because we were interested in whether changes in these factors were related to changes in teacher value-added scores, these metrics were constructed as differences between survey years (the Cronbach's alphas reported reflect the internal consistency of the measure scores, not of the differences):

Test prep activities scores, which averaged responses across four items measuring the frequency that a teacher: used released standardized test problems to prepare for the state test; used problems that were formatted like those on state tests; focused on students right below performance levels for the state test; and taught specific test-taking strategies (2010-11 $\alpha = 0.68$; 2011-12 $\alpha = 0.74$; 2012-13 $\alpha = 0.75$);

Testing impact on scope and sequence scores, which averaged responses across six items measuring the extent to which, during instruction, a teacher: decreased time on non-tested topics; increased time for tested topics; limited deep discussion; limited special projects; limited demanding problems; and sequenced topics such that tested topics would be covered before testing (2010-11 $\alpha = 0.82$; 2011-12 $\alpha = 0.84$; 2012-13 $\alpha = 0.73$); and

School resources scores, which averaged responses across nine items measuring a teacher's perceptions of whether his or her school: had limited resources (reverse coded); did not allow use of professional judgment in class (reverse coded); was enjoyable to be teaching in; was characterized by frequent interruptions to instruction (reverse coded); provided materials needed for class; provided access to professional development; made it difficult to get students services (reverse coded); was poorly maintained (reverse coded); and created an environment where work was respected and valued (2010-11 $\alpha = 0.80$; 2011-12 $\alpha = 0.78$; 2012-13 $\alpha = 0.75$).

We estimated scores for our final predictor of cross-year variability in teacher quality scores, teacher perceptions of students' relative abilities, by taking a teacher's responses to four items asking him or her to compare this year's students to last year's students. The points of comparison were student learning difficulties, preparedness, behavior problems, and ease of being taught (2011-12 $\alpha = 0.85$; 2012-13 $\alpha = 0.84$). The scale was scored such that higher values reflected more positive relative opinions of this year's students.
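The sketch below illustrates the composite-scoring step referenced above: computing Cronbach's alpha for a scale, averaging its items, and differencing year-over-year values within teacher. It is a minimal illustration with hypothetical file and column names (teacher_survey.csv, observed_coach, test_prep, etc.), not our analysis code.

```python
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Cronbach's alpha for a block of item columns (rows = respondents)."""
    items = items.dropna()
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)

# Hypothetical teacher-year survey file with item-level responses.
survey = pd.read_csv("teacher_survey.csv")

# Composite score: check internal consistency, then average the items.
coaching_items = ["observed_coach", "observed_by_coach"]  # placeholder names
print("coaching alpha:", round(cronbach_alpha(survey[coaching_items]), 2))
survey["coaching"] = survey[coaching_items].mean(axis=1)

# Change-score predictors: difference current-year values from prior-year
# values within teacher (e.g., test prep, school resources, percent FRPL).
survey = survey.sort_values(["teacher_id", "year"])
for col in ["test_prep", "school_resources", "pct_frpl", "baseline_math"]:
    survey[f"d_{col}"] = survey.groupby("teacher_id")[col].diff()
```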

Analyses

Research Question 1: What is the cross-year stability of key teacher quality metrics?

To answer our first research question, we closely replicated the analytic approach used by Polikoff (2015) in investigating the cross-year stability of MET measures of teacher quality. We estimated the following regression, clustering standard errors at the teacher level:

$$\mathrm{TQ}_{ky} = \beta_0 + \beta_1 \mathrm{TQ}_{k,y-1} + \chi_y + \eta + \varepsilon_{ky,e} \qquad (2)$$

The outcome, $\mathrm{TQ}_{ky}$, is teacher $k$'s teacher quality score in year $y$. The model includes fixed effects for year $y$, $\chi_y$, and district, $\eta$. Because $\mathrm{TQ}_{k,y-1}$ captures teacher $k$'s teacher quality score in the prior year, $\beta_1$ is our coefficient of interest. We standardized both $\mathrm{TQ}_{ky}$ and $\mathrm{TQ}_{k,y-1}$ so that $\beta_1$ estimated a standardized regression coefficient, which approximates the partial correlation between prior-year and current-year teacher quality scores after controlling for the fixed effects. We expected more stable measures to exhibit larger $\beta_1$ estimates.

To better illustrate the variability in scores across years, we also explored changes in the within-teacher quintile categorizations of quality for the different metrics across years.
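As a concrete illustration, the following minimal sketch shows how a standardized stability coefficient of this kind could be estimated with year and district fixed effects and teacher-clustered standard errors. It is not the code used for our analyses, and the file and column names (teacher_quality_panel.csv, tq, tq_lag, teacher_id) are hypothetical.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical teacher-year panel with current and prior-year scores for a
# single teacher quality metric, plus year, district, and teacher IDs.
panel = pd.read_csv("teacher_quality_panel.csv")

# Standardize current and prior scores so the slope is a standardized
# regression coefficient, as in Equation 2.
for col in ["tq", "tq_lag"]:
    panel[f"{col}_z"] = (panel[col] - panel[col].mean()) / panel[col].std()

# Year and district fixed effects; standard errors clustered by teacher.
model = smf.ols("tq_z ~ tq_lag_z + C(year) + C(district)", data=panel)
result = model.fit(cov_type="cluster", cov_kwds={"groups": panel["teacher_id"]})
print(result.params["tq_lag_z"], result.bse["tq_lag_z"])
```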

Research Question 2: Can we identify classroom-level predictors that are significantly associated with within-teacher cross-year variability in scores?

Research Question 3: How much of the variance in within-teacher scores can be explained by these predictors?

To explore the classroom-level predictors associated with cross-year differences in within-teacher scores, we estimated the following regression, clustering standard errors at the teacher level:

$$\mathrm{TQ}_{ky} = \beta_0 + \beta_1 \mathrm{TQ}_{k,y-1} + \beta_2 \tau_{ky} + \chi_y + \eta + \varepsilon_{ky,e} \qquad (3)$$

The model we estimated replicates that represented by Equation 2, except that we added an additional vector of controls, $\tau_{ky}$, which included our theoretical predictors of change, grouped into three main categories: (a) predictors capturing year-to-year changes in teachers' learning opportunities, test-related practices, or resources (Model 1); (b) predictors capturing year-to-year changes in classroom composition (Model 2); and (c) teacher perceptions (Model 3). From the estimates of the model, we explored which predictors in $\tau_{ky}$ were significantly associated with our outcome, teacher $k$'s teacher quality score in year $y$ ($\mathrm{TQ}_{ky}$). Following the estimation of each model, we also conducted a Wald test of the joint significance of each category of variables for teacher quality scores.
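To illustrate the joint test, the sketch below adds a block of hypothetical classroom-composition change variables to the stability regression and subjects them to a Wald test of joint significance; variable and file names are placeholders, and this is not our estimation code.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical teacher-year panel with standardized current and prior scores
# and differenced classroom-composition predictors (Model 2-style variables).
panel = pd.read_csv("teacher_quality_panel.csv")

predictors = ["d_pct_frpl", "d_pct_ell", "d_pct_sped", "d_baseline_math"]
formula = (
    "tq_z ~ tq_lag_z + " + " + ".join(predictors) + " + C(year) + C(district)"
)
result = smf.ols(formula, data=panel).fit(
    cov_type="cluster", cov_kwds={"groups": panel["teacher_id"]}
)

# Wald test of the joint null that all composition-change coefficients are 0.
constraint = ", ".join(f"({p} = 0)" for p in predictors)
print(result.wald_test(constraint))
```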

To answer the third research question, which explored the amount of variance in scores for each metric explained by the entire set of predictors, we compared the $R^2$ value from a model predicting the metric using all eleven predictors of change (Model 4) to the $R^2$ value from a model containing only the prior score, $\mathrm{TQ}_{k,y-1}$, and fixed effects for year ($\chi_y$) and district ($\eta$) (Model 0). If our predictors from Equation 3 explained a large proportion of the variance in year-to-year changes in teacher quality scores, we would expect the change in $R^2$ across models to be large.
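A minimal sketch of this R-squared comparison, using the same hypothetical panel and variable names as above (the predictor list is illustrative, not the actual set of 11):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical panel with standardized scores and change predictors.
panel = pd.read_csv("teacher_quality_panel.csv")
all_predictors = [
    "coaching", "collaboration", "prof_dev", "d_test_prep", "d_test_scope",
    "d_school_resources", "d_pct_frpl", "d_pct_ell", "d_pct_sped",
    "d_baseline_math", "perceptions",
]

base = "tq_z ~ tq_lag_z + C(year) + C(district)"            # Model 0
full = base + " + " + " + ".join(all_predictors)            # Model 4

r2_base = smf.ols(base, data=panel).fit().rsquared
r2_full = smf.ols(full, data=panel).fit().rsquared
print("Change in R^2:", round(r2_full - r2_base, 3))
```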

Results

Research Question 1

Table 2 summarizes the findings of our analysis of the cross-year stability of the teacher metrics.

[Insert Table 2.]

For value-added measures, the stability of within-teacher scores across years fell within the range presented by prior research, with our standardized regression coefficient on current-year teacher quality by prior-year teacher quality being 0.47 for the state VA metric and 0.31 for the study VA metric. The stability of teacher Tripod scores ($\beta_1 = 0.45$) also matched that of Polikoff's (2015) analysis of teachers in the MET project; Polikoff found a cross-year relationship for this student perception metric of 0.41.

When comparing our estimates of the cross-year stability of teacher quality as judged by observation scores on the CLASS and MQI, however, we found notable differences. Specifically, we observed more stable estimates for teacher scores on the two dimensions of the MQI instrument—teachers' ambitious instruction and errors—than those found for teacher scores in the MET data. The standardized regression coefficient for prior-year single-dimension MQI scores for teachers in the MET was 0.12; for our data, we found a coefficient of 0.41 for ambitious instruction and 0.29 for errors. Conversely, teacher performance on a single-dimension score for the CLASS was related at 0.55 for MET teachers, but at 0.26 and 0.24 for the classroom climate and classroom organization dimensions of the CLASS in our data. These conflicting findings further support the need for additional investigation into the cross-year stability of observation measures.

[Insert Table 3.]

In Table 3, we depict metric stability using transition matrices. From the table, we see that stability within each metric is largely related to the cross-year correlations, sensibly yielding a similar ordering of metric stability. Specifically, lower percentages of classrooms experienced a large change in performance categorization (i.e., a quintile shift greater than one) for the ambitious instruction, Tripod, and state VA metrics. However, no clear parallel pattern emerged for the most and least effective classrooms; across metrics, between 2% and 14% of teachers exchanged the top for the bottom quintile (or vice versa).
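For readers who wish to construct similar transition matrices, the following is a minimal sketch using the same hypothetical panel as above; quintiles are assigned within year, and prior-year quintiles are approximated from the lagged score within the analysis sample.

```python
import pandas as pd

# Hypothetical teacher-year panel with current and lagged scores for a metric.
panel = pd.read_csv("teacher_quality_panel.csv")

# Assign performance quintiles (1-5) within each year for current and
# prior-year scores, then cross-tabulate prior against current quintiles.
panel["q_now"] = panel.groupby("year")["tq"].transform(
    lambda s: pd.qcut(s, 5, labels=False) + 1
)
panel["q_prior"] = panel.groupby("year")["tq_lag"].transform(
    lambda s: pd.qcut(s, 5, labels=False) + 1
)
transition = pd.crosstab(panel["q_prior"], panel["q_now"], normalize="index")
print(transition.round(2))
```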

Research Question 2

Table 4 summarizes the relationship between our predictors of change and teacher quality scores.

[Insert Table 4.]

From Model 1, we found that teachers' reports of professional learning opportunities, including coaching, collaboration, and professional development, were not related to student performance on either the state or project-administered tests. The same was true for aggregated Tripod scores. However, teachers who received more coaching in the prior year improved in their classroom organization skills, and teachers who attended more professional development in the prior year made fewer mathematical errors and imprecisions in their instruction (i.e., such teachers had "lower" error scores). No other relationships were observed between professional learning opportunities and instructional characteristics as rated by external observers.

Model 1 also shows that teacher engagement with test preparation activities was marginally significantly related to state VA, with a magnitude of 0.10 teacher-level standard deviations (SDs), but not to the project-administered assessment, as would be expected given that teachers were likely not directing effort toward improving student performance on that low-stakes test. Also matching intuition, teachers who reported that testing impacted the scope and sequence of instruction declined in their delivery of ambitious mathematical instruction, as judged by external raters. Neither testing-related variable was related to classroom climate and organization, errors, or the Tripod measure.

Between-year changes in school resources were related to classroom climate and organization—but in the opposite direction from what we expected. Teacher reports of improved school environments were associated with lower scores on these two dimensions. However, the school resources measure was marginally positively related to students' aggregated Tripod reports.

The overall lack of strong predictive power of these variables for classroom and student outcomes makes it difficult to reach decisive conclusions from Model 1. Furthermore, learning resources, testing-related measures, and school resources did not collectively appear important for predicting changes in teacher quality scores; only for the classroom organization metric did the Wald test indicate joint significance.

A similar pattern emerges from Model 2 in Table 4, which reports the association between changes in individual classroom composition variables and changes in classroom and student outcomes. Changes in the proportion of ELLs did not predict changes in any teacher quality metric. Increases in the proportion of FRPL-eligible students were associated with lower value-added scores (i.e., a 0.10 increase in the proportion of FRPL-eligible students was associated with a 0.11-SD decrease in state value-added scores), but only for state VA was the relationship significant at the p < 0.05 level. Increases in the proportion of SPED students were associated with improved classroom organization scores, but not with any other metric. Finally, increases in the overall baseline test achievement of teachers' students from one year to the next predicted increases in mathematical errors and imprecisions during instruction, in students' perceptions of teacher quality, and in study VA; for the errors measure, the direction was surprising, and only for teacher-level Tripod scores was the relationship significant at the p < 0.05 level. Unlike the changes-in-resources variables, however, we found the Wald test to be significant for several teacher quality metrics; specifically, for ambitious instruction, Tripod, state VA, and study VA, the test suggested that changes in classroom composition as a whole might jointly predict changes in scores.

In contrast to the above findings, teacher perceptions of the relative quality of their current students appeared to be a more consistent significant predictor of change. Five of our seven measures showed a positive relationship, meaning that teachers who reported that current-year students were easier to work with had stronger classroom organization, ambitious instruction, Tripod, and state and study assessment value-added scores. The size of these differences was often remarkable; for instance, a one-SD difference in teacher perceptions was associated with an increase of 0.25 SDs in aggregated Tripod scores and a 0.15-SD change in teachers' classroom organization scores—a coefficient surprisingly close to the effect of teachers' prior-year classroom organization scores. With regard to the value-added measures in particular, the precise causal explanation for the relationship is unclear. Teachers may accurately perceive unobserved and uncontrolled-for characteristics of their students that impact those students' test performance; alternatively, teachers' perceptions of students might alter student-teacher interactions in ways that lead to improved test score gains.

Though we could not test the unobservable student characteristics that teachers may perceive, we could subject the teacher perception variable to tighter controls than in our standard model, in order to gauge whether teachers were simply echoing classroom-level compositional changes. To do so, we ran models predicting changes in state and study value-added scores from teacher perceptions while also controlling for changes in the demographic and academic composition of teachers' classrooms. We found that teachers' perceptions continued to significantly predict changes in state VA ($\beta = 0.15$, $p < 0.05$), even when controlling for changes in the classroom demographic and academic composition of students across years, suggesting an additional effect of perceptions beyond the effect of actual observed composition on outcomes. However, correlations did show that classrooms where teachers perceived students as having higher relative abilities had lower proportions of FRPL-eligible students ($\rho = -0.19$, $p < 0.01$) and higher-performing students in mathematics at baseline ($\rho = 0.48$, $p < 0.001$), relative to the previous classroom. As well, the effect of teachers' perceptions on the change in study VA was attenuated and became insignificant when the classroom composition change variables were included.

Research Question 3

For our final research question, Table 4 presents results regarding the amount of variance in teacher quality scores that can be explained for each indicator by our entire set of predictors, after controlling for teachers' performance on the quality metrics in the previous year. In the bottom rows of the table, we report the $R^2$ statistic for each measure, or the proportion of variance in cross-year changes in teacher quality scores that can be explained without (Model 0) and with (Model 4) all 11 predictors. Our analyses suggested that a substantial amount of variation remains to be explained. We found that the change in $R^2$ from the model with no predictors to the model with all 11 predictors was always below 0.10, even before adjusting for the decrease in degrees of freedom.

Conclusion

Prior research exploring the metrics used to assess teacher quality in evaluation systems has largely focused on the properties of value-added metrics. Some have used the instability of these metrics across years as evidence against their use for summative or formative purposes (e.g., Baker et al., 2010). The same reasoning, however, could be applied to other sources used to measure teacher quality, such as classroom observations or student surveys. Observation and student perception metrics have received less scrutiny than test-based metrics of quality (Corcoran & Goldhaber, 2013), resulting in a weaker understanding of their properties. Moreover, existing research suggests that these scores are as unstable as, or more unstable than, value-added scores (Polikoff, 2015).

In our examination of the cross-year stability of teacher quality measures, we corroborated prior findings on value-added metrics and a popular student perception metric, but found evidence for the stability of observational metrics that conflicted with findings from another large-scale research project on teacher quality (the MET; Polikoff, 2015). Specifically, we found stronger evidence for the stability of teacher scores on the MQI observational instrument, and weaker evidence for the stability of teacher scores on the CLASS observational instrument. One potential explanation for this difference may be the variability in scoring design across projects. Though both the MET project and our study had similar scoring designs for the CLASS instrument (i.e., one rater scoring on the full instrument), notable differences existed in the scoring design for the MQI instrument. In our study, teachers were assessed on 14 items from the MQI instrument, compared to just six for the MET, and by two raters, compared to just one.

The additional data points used to estimate teacher quality thus likely contributed to the greater observed stability of scores across years. This further highlights the need to understand the context in which estimates of cross-year stability are reported, both in research and in practice. It also suggests that designers of high-stakes teacher evaluation systems would be wise to dedicate additional training and resources to the scoring systems that now exist (for a review, see Herlihy et al., 2014).

Our study was also able to provide additional evidence on predictors of change in teacher quality scores. To date, most research investigating the predictors of change has been restricted to variables captured by administrative datasets. Through additional data collection, however, our study was able to measure other theoretical contributors to changes in scores. With one exception—teacher perceptions of students' relative abilities—these predictors did not consistently predict intertemporal variability across the different teacher quality metrics. Furthermore, the entire set of predictors jointly explained only a negligible amount of variance in scores.

The relatively weak explanatory power of our entire set of predictors indicates that further research needs to be done. The instability of assessments of teacher effectiveness or ineffectiveness has led many stakeholders in education systems to challenge their validity and use. Yet an inability to consistently identify why scores change may also lead to challenges to such systems, as unexplained instability suggests to educators that they cannot improve their performance in tangible ways. Conversely, if future research can isolate the pathways by which teachers can improve their quality over time, measure instability may become less of an issue.


References

Aaronson, D., Barrow, L., & Sander, W. (2007). Teachers and student achievement in the

Chicago public high schools. Journal of Labor Economics , 25 (1), 95-135.

Allen, J. P., Hafen, C. A., Gregory, A. C., Mikami, A. Y., & Pianta, R. C. (2015). Enhancing

27 secondary school instruction and student achievement: Replication and extension of the

My Teaching Partner–Secondary intervention. Journal of Research on Educational

Effectiveness .

Authors. (2012). [Title omitted for blind review].

Authors. (2015a). [Title omitted for blind review].

Authors. (2015b). [Title omitted for blind review].

Authors. (2015c). [Title omitted for blind review].

Baker, E. L., Barton, P. E., Darling-Hammond, L., Haertel, E., Ladd, H. F., Linn, R. L., ... &

Shepard, L. A. (2010).

Problems with the use of student test scores to evaluate teachers

(EPI Briefing Paper #278). Washington, DC: Economic Policy Institute.

Ballou, D. (2005). Value-added assessment: Lessons from Tennessee. In R. Lissitz (Ed.), Value added models in education: Theory and applications (pp. 272-297). Maple Grove, MN:

JAM Press.

Bell, C. A., Gitomer, D. H., McCaffrey, D. F., Hamre, B. K., Pianta, R. C., & Qi, Y. (2012). An argument approach to observation protocol validity. Educational Assessment , 17 (2-3),

62-87.

Brophy, J. E. (1973). Stability of teacher effectiveness. American Educational Research Journal ,

245-252.

CROSS-YEAR STABILITY OF TEACHER QUALITY METRICS 28

Brophy, J. E., Coulter, C. L., Crawford, W. J., Evertson, C. M., & King, C. E. (1975). Classroom observation scales: Stability across time and context and relationships with student learning gains. Journal of Educational Psychology , 67 (6), 873-881.

Chetty, R., Friedman, J. N., & Rockoff, J. E. (2014a). Measuring the impacts of teachers I:

Evaluating bias in teacher value-added estimates. The American Economic

Review , 104 (9), 2593-2632.

Chetty, R., Friedman, J. N., & Rockoff, J. E. (2014b). Measuring the impacts of teachers II:

Teacher value-added and student outcomes in adulthood. The American Economic

Review , 104 (9), 2633-2679.

Corcoran, S., & Goldhaber, D. (2013). Value added and its uses: Where you stand depends on where you sit. Education Finance and Policy , 8 (3), 418-434.

Danielson, C. (2011). Enhancing professional practice: A framework for teaching . Alexandria,

VA: Association for Supervision and Curriculum Development.

Ferguson, R. F. (2008). The Tripod Project framework . Cambridge, MA: Harvard University.

Gersten, R., Dimino, J., Jayanthi, M., Kim, J. S., & Santoro, L. E. (2010). Teacher study group impact of the professional development model on reading instruction and student outcomes in first grade classrooms. American Educational Research Journal , 47 (3), 694-

739.

Goldhaber, D., & Hansen, M. (2013). Is it just a bad class? Assessing the long ‐ term stability of estimated teacher performance. Economica , 80 (319), 589-612.

Goldhaber, D., & Theobald, R. (2012). Do different value-added models tell us the same things?

Stanford, CA: Carnegie Knowledge Network.

CROSS-YEAR STABILITY OF TEACHER QUALITY METRICS

Good, T. L., & Grouws, D. A. (1975). Process-product relationships in fourth grade mathematics classrooms . Washington, DC: National Institute of Education.

Hannaway, J., & Hamilton, L. (2008). Accountability policies: Implications for school and classroom practices . Washington, DC: The Urban Institute.

Hanushek, E. A., & Rivkin, S. G. (2010). Generalizations about using value-added measures of teacher quality. The American Economic Review , 100 (2), 267-271.

Herlihy, C., Karger, E., Pollard, C., Hill, H. C., Kraft, M. A., Williams, M., & Howard, S.

(2014). State and local efforts to investigate the validity and reliability of scores from teacher evaluation systems. Teachers College Record, 116 (1).

29

Hill, H. C., Blunk, M. L., Charalambous, C. Y., Lewis, J. M., Phelps, G. C., Sleep, L., & Ball,

D. L. (2008). Mathematical Knowledge for Teaching and the Mathematical Quality of

Instruction: An exploratory study. Cognition and Instruction , 26 (4), 430-511.

Hill, H. C., Charalambous, C. Y., & Kraft, M. A. (2012). When rater reliability is not enough:

Teacher observation systems and a case for the generalizability study. Educational

Researcher , 41 (2), 56-64.

Jackson, C. K., & Bruegmann, E. (2009). Teaching students and teaching each other: The importance of peer learning for teachers. American Economic Journal: Applied

Economics , 1 (4), 85-108.

Johnson, S. M., Kraft, M. A., & Papay, J. P. (2012). How context matters in high-need schools:

The effects of teachers’ working conditions on their professional satisfaction and their students’ achievement.

Teachers College Record , 114 (10), 1-39.

Kane, T. J., & Cantrell, S. (2010). Learning about teaching: Initial findings from the Measures of Effective Teaching project (MET project research paper). Seattle, WA: Bill & Melinda Gates Foundation.

Kane, T. J., McCaffrey, D. F., Miller, T., & Staiger, D. O. (2013). Have we identified effective teachers? Validating measures of effective teaching using random assignment (MET project research paper). Seattle, WA: Bill & Melinda Gates Foundation.

Kane, T. J., & Staiger, D. O. (2008). Estimating teacher impacts on student achievement: An experimental evaluation (No. w14607). Cambridge, MA: National Bureau of Economic Research.

Kane, T. J., & Staiger, D. O. (2012). Gathering feedback for teaching: Combining high-quality observations with student surveys and achievement gains (MET project research paper). Seattle, WA: Bill & Melinda Gates Foundation.

Koedel, C., & Betts, J. R. (2007). Re-examining the role of teacher quality in the educational production function. Nashville, TN: National Center on Performance Incentives, Vanderbilt, Peabody College.

Koedel, C., Mihaly, K., & Rockoff, J. E. (2015). Value-added modeling: A review. Economics of Education Review, 47, 180-195.

Marsh, H. W. (2007). Do university teachers become more effective with experience? A multilevel growth model of students' evaluations of teaching over 13 years. Journal of Educational Psychology, 99(4), 775-790.

Marsh, H. W., & Hocevar, D. (1991). Students' evaluations of teaching effectiveness: The stability of mean ratings of the same teachers over a 13-year period. Teaching and Teacher Education, 7(4), 303-314.

Marshall, H. H., Green, J. L., Hartsough, C. S., & Lawrence, M. T. (1977). Stability of classroom variables as measured by a broad range observational system. The Journal of Educational Research, 70(6).

McCaffrey, D. F., Sass, T. R., Lockwood, J. R., & Mihaly, K. (2009). The intertemporal variability of teacher effect estimates. Education Finance and Policy, 4(4), 572-606.

Nye, B., Konstantopoulos, S., & Hedges, L. V. (2004). How large are teacher effects? Educational Evaluation and Policy Analysis, 26(3), 237-257.

Papay, J. P., & Kraft, M. A. (2015). Productivity returns to experience in the teacher labor market: Methodological challenges and new evidence on long-term career improvement. Journal of Public Economics.

Pianta, R. C., LaParo, K. M., & Hamre, B. K. (2007). Classroom Assessment Scoring System (CLASS) manual. Baltimore, MD: Brookes Publishing.

Plank, S. B., & Condliffe, B. F. (2013). Pressures of the season: An examination of classroom quality and high-stakes accountability. American Educational Research Journal, 50(5), 1152-1182.

Polikoff, M. S. (2015). The stability of observational and student survey measures of teaching effectiveness. American Journal of Education, 121(2), 183-212.

Rothstein, J. (2009). Student sorting and bias in value-added estimation: Selection on observables and unobservables. Education Finance and Policy, 4(4), 537-571.

Shavelson, R. J., & Webb, N. M. (1991). Generalizability theory: A primer. Newbury Park, CA: Sage Publications.

Weisberg, D., Sexton, S., Mulhern, J., Keeling, D., Schunck, J., Palcisco, A., & Morgan, K. (2009). The widget effect: Our national failure to acknowledge and act on differences in teacher effectiveness. Brooklyn, NY: New Teacher Project.

Yoon, K. S., Duncan, T., Lee, S. W.-Y., Scarloss, B., & Shapley, K. (2007). Reviewing the evidence on how teacher professional development affects student achievement (Issues & Answers Report, REL 2007–No. 033). Washington, DC: U.S. Department of Education, Institute of Education Sciences, National Center for Education Evaluation and Regional Assistance, Regional Educational Laboratory Southwest. Retrieved from http://ies.ed.gov/ncee/edlabs


Footnotes

1 For simplicity, in our paper we refer to both together as “teacher quality”, unless otherwise specified.

2 Across the three years of the study's data collection, 94% of students who were in study classrooms in the spring responded to the survey.

3 We compare teacher quality across samples using only value-added estimates because the out-of-sample group also includes teachers who did not participate in the study; these classrooms therefore do not have observation or student perception metrics of quality.

4 For the model predicting student performance on the state test, this restriction compared students' current tested grade to their tested grade in the prior year. For the model predicting student performance on the alternative test, it required students' current tested grade to equal their tested grade in the prior semester; additionally, for students who were in study classrooms in consecutive years, the current tested grade was required to be one greater than the tested grade in the previous year.
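As a rough illustration, the alternative-test restriction could be implemented as a data filter along the following lines; the column names (tested_grade, tested_grade_prior_semester, tested_grade_prior_year, in_study_prior_year) are hypothetical, and the study's actual implementation is not shown here.

    import pandas as pd

    def apply_grade_restriction(df: pd.DataFrame) -> pd.DataFrame:
        # Keep students whose current tested grade equals the tested grade
        # in the prior semester (hypothetical column names).
        keep = df["tested_grade"].eq(df["tested_grade_prior_semester"])
        # For students in study classrooms in consecutive years, additionally
        # require the current tested grade to be one greater than the tested
        # grade in the previous year.
        consecutive = df["in_study_prior_year"]
        keep &= ~consecutive | df["tested_grade"].eq(df["tested_grade_prior_year"] + 1)
        return df[keep]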

5 Videos were broken down into 7.5-minute segments to be scored on the MQI, and into 15-minute segments to be scored on the CLASS.

Table 1. Summary Statistics for Sample and Out-of-sample Classrooms

                                       Sample          Out-of-sample    Two-sample
                                       (n = 251)       (n = 4574)       t-test p-value
Single-year value-added                0.00 (0.21)     0.00 (0.20)      0.94
Classroom composition
  Size                                 22.20 (4.49)    20.33 (6.60)     0.00
  % Male                               49.08 (12.59)   48.42 (12.12)    0.40
  % White                              24.38 (21.96)   21.83 (22.75)    0.08
  % African-American                   39.82 (23.97)   37.29 (27.32)    0.15
  % Asian                              9.11 (12.92)    8.57 (12.66)     0.51
  % Hispanic                           22.14 (19.89)   28.27 (24.05)    0.00
  % ELL                                19.00 (22.96)   21.56 (23.47)    0.09
  % FRPL                               62.50 (25.14)   66.18 (28.48)    0.04
  % SPED                               10.78 (10.19)   10.63 (10.52)    0.83
  Prior math test performance          0.09 (0.57)     0.02 (0.60)      0.06
  Prior ELA test performance           0.10 (0.57)     0.02 (0.57)      0.02
  % Missing prior math test
    performance                        8.87 (8.48)     8.41 (8.05)      0.38
  % Missing prior ELA test
    performance                        9.57 (9.85)     9.17 (9.25)      0.52

Note: Standard deviations are reported in parentheses.

Table 2. Stability of Teacher Quality Metrics from Standardized Regression Coefficients

                   Classroom   Classroom      Ambitious
                   climate     organization   instruction   Errors    Tripod    State VA   Study VA
Prior-year β₁      0.26***     0.24***        0.41***       0.29***   0.45***   0.47***    0.31***

Note: All models include district and school year fixed effects. The number of classrooms is 251, and the number of teachers is 181. Standard errors are clustered at the teacher level.
~ p < .10, * p < .05, ** p < .01, *** p < .001.
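For concreteness, the sketch below shows one way a stability model of this form could be estimated; it is illustrative only, and the dataframe and column names (score, prior_score, district, school_year, teacher_id) are hypothetical rather than drawn from the study's data files.

    import pandas as pd
    import statsmodels.formula.api as smf

    def stability_coefficient(df: pd.DataFrame) -> float:
        # Standardize the current- and prior-year scores so the slope on the
        # prior-year score is a standardized regression coefficient.
        d = df.copy()
        for col in ["score", "prior_score"]:
            d[col] = (d[col] - d[col].mean()) / d[col].std()
        # Regress the current-year score on the prior-year score with district
        # and school-year fixed effects; cluster standard errors by teacher.
        fit = smf.ols("score ~ prior_score + C(district) + C(school_year)", data=d).fit(
            cov_type="cluster", cov_kwds={"groups": d["teacher_id"]}
        )
        return fit.params["prior_score"]

A larger coefficient on the prior-year score indicates that teachers' relative standing on that metric is more stable across adjacent years.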

Table 3. Stability of Teacher Quality Metrics from Transition Matrices

[Seven 5 × 5 transition matrices cross-tabulating prior-year performance quintile (rows 1-5) against current-year performance quintile (columns 1-5), grouped into observation metrics and student perception and value-added metrics: Classroom climate (36), Classroom organization (43), Ambitious instruction (32), Errors (37); Tripod (35), State VA (31), Study VA (39).]

Note: Cells represent the percentage of classrooms; rows and columns approximately add up to 100. Depicted in parentheses next to each metric is the percentage of classrooms whose absolute difference in performance quintiles across years is greater than one.
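As an illustration of how such a transition matrix can be constructed, the sketch below assigns classrooms to performance quintiles in each year and cross-tabulates them; the column names (prior_score, current_score) are hypothetical, and the study's exact construction may differ.

    import pandas as pd

    def transition_matrix(d: pd.DataFrame) -> pd.DataFrame:
        # Quintile 1 is the lowest fifth of scores; quintiles are formed
        # separately for the prior-year and current-year scores.
        prior_q = pd.qcut(d["prior_score"], 5, labels=[1, 2, 3, 4, 5]).astype(int)
        current_q = pd.qcut(d["current_score"], 5, labels=[1, 2, 3, 4, 5]).astype(int)
        counts = pd.crosstab(prior_q, current_q)
        # Convert each row to the percentage of classrooms in that prior-year quintile.
        pct = counts.div(counts.sum(axis=1), axis=0).mul(100).round(0)
        # Share of classrooms whose quintile changed by more than one across years.
        unstable = (current_q - prior_q).abs().gt(1).mean() * 100
        print(f"{unstable:.0f}% of classrooms moved more than one quintile")
        return pct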

Table 4. Predictors of Teacher Quality Metrics Controlling for Prior-Year Scores

                                          Classroom   Classroom      Ambitious
                                          climate     organization   instruction   Errors    Tripod     State VA   Study VA
Model 1:
  Coaching                                0.02        0.16*          -0.04         0.03      -0.02      0.00       0.06
  Collaboration                           -0.06       -0.04          0.09          -0.03     0.08       0.03       0.00
  Professional development                -0.01       0.03           -0.06         -0.13*    0.06       0.09       0.06
  Test prep activities                    0.02        0.06           0.01          0.03      0.03       0.10~      0.42
  Testing impact on scope and sequence    0.02        0.06           -0.11*        0.04      -0.05      -0.01      0.01
  School resources                        -0.10~      -0.11~         0.07          0.08      0.08~      0.11       0.05
  Wald test, p-value                      0.48        0.04           0.22          0.42      0.28       0.03       0.69
Model 2:
  Proportion FRPL                         -0.12       -0.52          -0.26         0.54      -0.50      -1.13**    -0.73~
  Proportion ELL                          -0.27       0.12           -0.36         -0.63     0.44       0.54       0.02
  Proportion SPED                         0.26        0.84~          -0.61         -0.57     0.44       0.30       0.78
  Average prior math score                -0.04       0.20           0.18          0.32~     0.46**     0.11       0.30~
  Wald test, p-value                      0.88        0.16           0.04          0.12      0.02       0.05       0.03
Model 3:
  Teacher perceptions of students'
    relative abilities                    0.05        0.15**         0.11*         -0.01     0.25***    0.17**     0.13*
Model 0: R²                               0.18        0.12           0.37          0.15      0.27       0.22       0.09
Model 4: R²                               0.21        0.21           0.41          0.21      0.36       0.29       0.15

Note: All models include prior-year teacher quality scores and district and school year fixed effects. The number of classrooms is 244, and the number of teachers is 177. Standard errors are clustered at the teacher level. Model 0 controls for only prior score and fixed effects. Model 4 controls for all predictors, in addition to prior score and fixed effects.
~ p < .10, * p < .05, ** p < .01, *** p < .001.
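The Model 0 versus Model 4 comparison and the joint Wald tests reported above can be sketched as follows; this is a simplified illustration under assumed predictor and column names (coaching, collaboration, and so on), not the study's estimation code.

    import statsmodels.formula.api as smf

    BASE = "score ~ prior_score + C(district) + C(school_year)"
    PREDICTORS = ["coaching", "collaboration", "prof_dev", "test_prep",
                  "testing_impact", "school_resources"]  # hypothetical names

    def compare_models(df):
        groups = df["teacher_id"]
        # Model 0: prior-year score and fixed effects only.
        fit0 = smf.ols(BASE, data=df).fit(cov_type="cluster", cov_kwds={"groups": groups})
        # Model 4: add the full set of predictors.
        fit4 = smf.ols(BASE + " + " + " + ".join(PREDICTORS), data=df).fit(
            cov_type="cluster", cov_kwds={"groups": groups}
        )
        # Joint (Wald) test that all added predictors have zero coefficients.
        wald = fit4.wald_test(", ".join(f"{p} = 0" for p in PREDICTORS), use_f=True)
        print("Wald p-value:", float(wald.pvalue))
        print("R-squared, Model 0 vs Model 4:",
              round(fit0.rsquared, 2), round(fit4.rsquared, 2))
        return fit0, fit4

The gap between the two R-squared values indicates how much additional variation in current-year scores the predictors explain beyond the prior-year score and fixed effects.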

Appendix

Table A1. Reliability Estimates for Observation Metrics of Teacher Quality

                          Cronbach's alpha                           Adjusted intraclass correlations
                N items   2010-11   2011-12   2012-13   Average      2010-11   2011-12   2012-13   Average
Ambitious
  instruction   11        0.83      0.77      0.72      0.77         0.59      0.51      0.46      0.52
Errors          3         0.70      0.66      0.51      0.62         0.41      0.38      0.30      0.36
Classroom
  climate       9         0.70      0.77      0.70      0.72         0.40      0.30      0.13      0.28
Classroom
  organization  3         0.89      0.92      0.90      0.90         0.60      0.40      0.13      0.38

Note: Intraclass correlations are adjusted using the modal number of lessons videotaped for teachers within each year.
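For readers interested in the mechanics behind these reliability figures, the sketch below shows a generic Cronbach's alpha calculation and a Spearman-Brown-style adjustment of a single-lesson intraclass correlation to the reliability of an average over a given number of lessons; it is illustrative only and is not the study's estimation procedure.

    import numpy as np

    def cronbach_alpha(item_scores: np.ndarray) -> float:
        # item_scores: 2-D array with rows as classrooms (or lessons) and
        # columns as instrument items.
        k = item_scores.shape[1]
        item_variances = item_scores.var(axis=0, ddof=1).sum()
        total_variance = item_scores.sum(axis=1).var(ddof=1)
        return (k / (k - 1)) * (1 - item_variances / total_variance)

    def adjusted_icc(single_lesson_icc: float, n_lessons: int) -> float:
        # Spearman-Brown adjustment: reliability of a score averaged over
        # n_lessons lessons, given the single-lesson intraclass correlation.
        r = single_lesson_icc
        return n_lessons * r / (1 + (n_lessons - 1) * r)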
