The Impact of Selection of Student Achievement Measurement Instrument on Teacher Value-Added Measures

James L. Woodworth, Quantitative Research Assistant, Hoover Institution, Stanford University, Stanford, California
Wen-Juo Lo, Assistant Professor, University of Arkansas, Fayetteville
Joshua B. McGee, Vice President for Public Accountability Initiatives, Laura and John Arnold Foundation, Houston, Texas
Nathan C. Jensen, Research Specialist, Northwest Evaluation Association, Portland, Oregon

Abstract

In this paper, we analyze the impact on the variance of teacher value-added measures arising from the use of different instruments to evaluate student achievement. The psychometric characteristics of the student evaluation instrument may affect the amount of variance in the estimates of student achievement, which in turn affects the amount of variance in teacher value-added measures. The goal of this paper is to contribute to the body of information available to policy makers on the implications of the selection of student measurement instruments for teacher value-added measures. The results demonstrate that well-designed value-added measures based on proper measurements of student achievement can provide reliable value-added estimates of teacher performance.

Introduction

Value-added measures (VAM) have generated much discussion in the education policy realm. Opponents raise concerns over poor reliability and worry that statistical noise will cause many teachers to be misidentified as poor teachers when they are actually performing at an acceptable level (Hill, 2009; Baker, Barton, Darling-Hammond, Haertel, Ladd, Linn, et al., 2010). Supporters hail properly designed VAMs as a way to quantitatively measure the input of schools and individual teachers (Glazerman, Loeb, Goldhaber, Staiger, Raudenbush, & Whitehurst, 2010). Much of the statistical noise associated with VAM originates from the characteristics of the testing instruments used to measure student performance.
In this analysis, we use a series of Monte Carlo simulations to demonstrate how changes to the characteristics of two measurement instruments, the Texas Assessment of Knowledge and Skills (TAKS) and Northwest Evaluation Association's (NWEA) Measures of Academic Progress (MAP), affect the reliability of the VAMs.

Sources of Error in Measurement Instruments

One of the larger sources of statistical noise in a VAM is a lack of sensitivity in the student measurement instrument. A measurement instrument with a high level of error relative to the changes in the parameter it is meant to measure will increase the variance of VAMs based on that instrument (Thompson, 2008). Measurement instrument error occurs for a number of reasons, such as test design, vertical alignment, and student sample size. To have useful VAMs, researchers must find student measurement instruments with the smallest possible ratio of error to growth parameter.

Test design. Proficiency tests such as those used by many states may be particularly noisy, i.e., unreliable, at measuring growth in student achievement. This is because measuring growth is not the purpose for which proficiency tests are designed. Proficiency tests are designed to differentiate between two broad categories, proficient and not proficient (Anderson, 1972). To accomplish this, proficiency tests need be reliable only around the dividing point between proficient and not proficient, but they must be as reliable as possible around that point. To gain this reliability, test developers modify the test to increase its ability to discriminate between proficient and not proficient. Increasing instrument discrimination around this point can be accomplished in several ways. One is to increase the number of items, i.e., questions, on the test. Increasing the number of items across the entire spectrum of student ability, however, would require the test to become burdensomely long.
Excessively long tests have their own downfalls. In addition to the amount of instructional time lost to taking long tests, student attention spans also limit the number of items which can be added to a test without sacrificing student effort and thereby introducing another source of statistical noise. To lessen the problems of long tests, psychometricians increase the number of items just around the proficient/not proficient dividing point of the criterion-referenced test. Focusing items around this single point increases the test reliability at the proficiency point but does not improve reliability in the upper and lower ends of the distribution of test scores. When tests are structured in this manner, they are less reliable the further a student's ability is from the focus point. Tests which are highly focused around a specific point have heteroskedastic conditional standard errors (CSEs) (May, Perez-Johnson, Haimson, Sattar, & Gleason, 2009). The differences in CSEs across the spectrum of student ability can be dramatic. Figure 1 shows the distribution of CSEs from the 2009 TAKS 5th-grade reading test above and below the measured value of students' achievement. As can be seen in Figure 1, there is a large amount of variance in the size of the measurement error as the values move away from the focus point.

Figure 1: Distribution of Conditional Standard Errors - 2009 TAKS Reading, Grade 5

Focusing the test can increase the reliability enough to make the test an effective measure of proficiency, but the same test would be a poor measure of growth across the full spectrum of student ability. The traditional paper-and-pencil state-level proficiency tests are often very brief; for example, the TAKS contains as few as 40 items.
Again, if item difficulty is tightly focused around a single point of proficiency, a test may be able to reliably differentiate performance just above that point from performance just below it; however, such a test will not be able to differentiate between performances two or three standard deviations above the proficiency point. Reliable value-added measures require a reliable measure of all student performance regardless of where a student's score is located within the ability distribution. Even paper-and-pencil tests which have been designed to measure growth are generally limited to measuring performance at a particular grade level and will therefore have limited items with which to increase reliability. Computer adaptive tests can avoid the reliability problems related to the limited item pool available on traditional paper-and-pencil tests (Weiss, 1982). Using adaptive algorithms, a computer adaptive test can select, from a much larger pool, items that are appropriate to the ability level of each student. The effect of this differentiated item selection is to create a test which is focused around the student's level of true ability. Placing the focus of the test items at the student's ability point instead of at the proficiency point minimizes the testing error around the student's true ability, thereby giving a more reliable measure of the student's ability.

Vertical alignment. Value-added models which are meant to measure growth from year to year must be vertically aligned. As we discussed above, criterion-referenced tests such as the state proficiency tests often used in value-added measures are designed to measure student mastery of a specific set of grade-level knowledge. As this knowledge changes from year to year, scores on these tests are not comparable unless great care is taken to align scale values on the tests from year to year.
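The adaptive item-selection idea described above can be sketched in a few lines. This toy example is entirely our illustration, with a made-up item bank and a crude shrinking-step ability update in place of the maximum-likelihood scoring a real CAT would use; it simply administers, at each step, the item closest in difficulty to the current ability estimate.

```python
import math
import random

random.seed(1)  # arbitrary seed for reproducibility

def p_correct(theta, difficulty):
    """Rasch-model probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

def adaptive_test(true_theta, pool, n_items=40, step=0.7):
    """Toy CAT: give the item whose difficulty is nearest the current
    ability estimate, then nudge the estimate after each response."""
    est = 0.0                      # every examinee starts at the scale midpoint
    remaining = list(pool)
    for k in range(1, n_items + 1):
        item = min(remaining, key=lambda d: abs(d - est))
        remaining.remove(item)
        correct = random.random() < p_correct(true_theta, item)
        # Shrinking step size: a crude stand-in for an ML/EAP update.
        est += (step / math.sqrt(k)) * (1 if correct else -1)
    return est

pool = [-4 + 8 * i / 499 for i in range(500)]   # hypothetical large item bank
mean_est = sum(adaptive_test(2.5, pool) for _ in range(200)) / 200
print(round(mean_est, 2))   # estimates cluster near the true ability of 2.5
```

Because the administered items track the examinee's own ability, the test's information ends up centered on each student rather than on a single proficiency cut score.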
While scale alignment does not affect norm-referenced VAMs, Briggs and Weeks (2009) found that underlying vertical scale issues can affect threshold classifications of the effectiveness of schools. Additionally, vertically aligning the scales of multiple grade levels is not the only alignment issue. Even if different grade-level scales are aligned, there is some discussion among psychometricians as to whether the instruments are truly aligned just because they are based on Item Response Theory (IRT) (Ballou, 2009). The TAKS is based on IRT and contains a vertically aligned scale across grades (TEA, 2010a). The MAP is based on a large pool of items which are individually aligned to ensure a consistent, single interval scale of measurement across multiple levels or grades of ability (NWEA, 2011). The level of internal alignment of the tests may be an additional source of statistical noise in the student measurement which is transferred to the VAM.

Student sample size. The central limit theorem states that the reliability of any estimate based on mean values of a parameter increases with the number of members within the sample (Kirk, 1995). Opponents of VAMs often express concern that the class size served by the typical teacher is insufficient to produce a reliable measure of teacher performance (Braun, 2005). The central limit theorem also implies that with a large enough sample size, even a poor estimator of teacher performance will eventually reach an acceptable level of reliability. The question around VAMs then becomes one of efficiency. How does the error related to the student measurement instrument affect the efficiency of the VAM? How large a sample is necessary to permit the VAM to reliably measure teacher performance? As typical class sizes in the elementary grades are around 25 students, we evaluated the effect of different student measurement instruments on the efficiency of teacher VAMs.
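The efficiency question can be made concrete with a back-of-the-envelope calculation. If each student's gain score carries measurement noise from two test administrations, the standard error of a class-average gain shrinks as 1/√n, so the class size needed to resolve growth to a given precision depends directly on the growth/CSE ratio. The sketch below is our illustration, not the paper's method; it plugs in the TAKS average standard error of 38.96 with expected growth of 24, and the MAP CSE of 3.5 with growth of 5.06, figures reported later in the paper, and asks how large a class is needed before roughly 95% of class means land within half a year of true growth.

```python
import math

def se_of_mean_gain(cse, n):
    """SE of a class-average gain when each student's pre and post
    scores each carry independent measurement error of size cse."""
    per_student_sd = math.sqrt(2.0) * cse   # noise from two tests per student
    return per_student_sd / math.sqrt(n)

def n_for_target_se(cse, target_se):
    """Smallest class size whose mean-gain SE falls below the target."""
    n = 1
    while se_of_mean_gain(cse, n) > target_se:
        n += 1
    return n

# Growth and error figures taken from the paper; the 1.96 multiplier
# (a two-sided 95% criterion) is our assumption.
for label, growth, cse in [("TAKS", 24.0, 38.96), ("MAP", 5.06, 3.5)]:
    half_year = growth / 2.0                      # the half-year threshold
    n = n_for_target_se(cse, half_year / 1.96)
    print(f"{label}: need roughly n={n} students")
```

Under these assumptions the TAKS needs a class of roughly 80 students while the MAP needs fewer than 20, which is consistent in spirit with the simulation results reported below.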
We analyze this by comparing results from each simulation using samples with different numbers of students. Those tests with a smaller ratio between expected growth and CSE (growth/CSE) introduce more noise into the value-added measure and will require a larger sample size to produce reliable VAMs of teacher performance.

The Data

TAKS - Texas Assessment of Knowledge and Skills

We used TAKS data for these analyses because the TAKS is vertically aligned and we were able to obtain the psychometric characteristics from the TEA. Additionally, the TAKS scores have a slight ceiling effect which we exploited to analyze the effects of ceiling effects on instrument variance. To obtain TAKS data for the analysis, we used Reading scores for grade 5 from the 2009 TAKS administration. The TEA reports both scale scores and vertically aligned scale scores for all tests. We used the vertical scale scores for all TAKS calculations. We computed the descriptive statistics for fifth-grade vertical scale scores based on the score frequency distribution tables (TEA, 2010a) publicly available from the Texas Education Agency (TEA) website. Table 1 shows the descriptive statistics for the data set. For this study, we used the CSEs for each possible score as they were reported by TEA (2010b).

Table 1: Statistical Summary 2009 TAKS Reading, Grade 5

Statistic    N        Mean    Std. Deviation   Skewness   Minimum   Maximum
Value        323,507  701.49  100.24           .273       188       922

In our analysis of the TAKS data, in addition to using the normally distributed ~N(0,1) simulated data set described below, we also created a second simulated data set by fitting 100 data points to the actual distribution from the 2009 TAKS 5th-grade reading test. Differences between the two data sets were used to analyze how ceiling effects affect value-added measures.
MAP - Measures of Academic Progress

Data for the MAP were retrieved from "RIT Scale Norms: For Use with Measures of Academic Progress" and "Measures of Academic Progress® and Measures of Academic Progress for Primary Grades®: Technical Manual."1 The MAP uses an equidistant scale score, referred to as a RIT score, for all tests. We computed the RIT mean and standard deviation for reading scores on the MAP from frequency distributions reported by NWEA. We used the average CSEs as reported by NWEA in the same documents.

Methodology

Data Generation

We used a Monte Carlo simulation for this analysis. A Monte Carlo simulation is a computer simulation which draws repeatedly from a set of random values which are then applied to an algorithm to perform a deterministic computation based on the inputs. The results of the simulation are then aggregated to produce an asymptotic estimation of the event being studied. The benefit of using the Monte Carlo study is that it allowed us to isolate the instrument measurement error of different psychometric characteristics from other sources of error. To accomplish this isolation, we control the measurement error within a simulation of data points. To generate the synthesized sample for the Monte Carlo simulations, we used SPSS to generate a normally distributed random sample of 10,000 z-scores (see Table 2 below). In order to simulate group sizes equal to those which would be served by teachers at the high school, middle school, and elementary school levels, we randomly selected from this 10,000 data point sample three nested groups of 100, 50, and 25 individuals.

1 Access to both documents was made possible through a data grant from the Kingsbury Center.
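The data-generation step can be mirrored in a few lines of code. This sketch uses Python rather than the SPSS procedure the authors describe, and the seed and sampling calls are our assumptions; it draws the 10,000 standard-normal scores and the three nested groups.

```python
import random

random.seed(42)  # arbitrary seed for reproducibility

# 10,000 standard-normal "students", mirroring the paper's SPSS step.
z_scores = [random.gauss(0.0, 1.0) for _ in range(10_000)]

# Draw 100 at random, then nest the smaller groups inside it so the
# n=25 group sits within the n=50 group within the n=100 group.
sample_100 = random.sample(z_scores, 100)
sample_50 = sample_100[:50]
sample_25 = sample_50[:25]

print(len(sample_100), len(sample_50), len(sample_25))
```

The nesting matters: because the n=25 roster is a subset of the n=50 and n=100 rosters, increasing n behaves like adding additional classes of the same teacher, as the paper assumes.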
Because all our samples are normally distributed, and the samples of 25 were randomly selected and nested within the sample of 50, which was randomly selected and nested within the sample of 100, we considered each level the equivalent of adding additional classes to the value-added models for the elementary and middle school levels. We therefore feel it is acceptable to consider the n=50 simulations to be usable estimates for a value-added measure for elementary teachers using two years' worth of data, and the n=100 simulations for four years' worth of data.

Table 2: Statistical Summary, Starting z-Scores

Statistic    N       Mean   Std. Deviation   Skewness   Minimum   Maximum
Value        10,000  0.008  .999             0.0        -3.71     3.70

Table 3: Statistical Summary, z-Score Samples by n

Statistic        n=100   n=50    n=25
Mean             -.13    -.09    .01
Std. Deviation   .97     .97     1.00
Skewness         -.12    .18     .10
Minimum          -2.34   -1.85   -1.77
Maximum          2.09    2.09    2.09

The z-scores we generated (S) were used to represent the number of standard deviations above or below the mean of each test the starting scores would be for the data points within the sample. Each z-score was then multiplied by the test standard deviation and added to the test mean in order to generate the individual controlled pre-scores (P1). The same samples were used for each simulation except the TAKS actual distribution simulation. For example, in every simulation except the TAKS actual model, the starting score (P1) for case 1 in the n=100 sample is always 2.43 standard deviations below the mean.

Controlled P1i = μ + (Si • σ)

For all simulations, we created a controlled post-score (P2) by adding an expected year's growth for the given test to the controlled pre-score. The expected growth for one year was a major consideration for the different models of the simulation. For the TAKS analysis, we used two different values to represent one year of growth.
The first value used was 24 vertical scale points, which was the difference between "Met Standard" on the 5th-grade reading test and "Met Standard" on the 6th-grade reading test (TEA, 2010a). The other value used in the TAKS analysis was 34 vertical scale points, which was the difference between "Commended" on the 5th-grade reading test and "Commended" on the 6th-grade reading test. The difference between the average scores for the TAKS 5th-grade reading and TAKS 6th-grade reading was 25.89 vertical scale points. We did not run a separate analysis for this value since it is so close to the "Met Standard" value of 24 vertical scale points. For the MAP, we used the norming population's average fifth-grade one-year growth of 5.06 RIT scale points for expected growth (NWEA, 2008).

Controlled P2i = P1i + Expected Growth

In order to simulate testing variance, we used Stata 11 to multiply the reported conditional standard error (CSE) for each data point by a computer-generated random number ~N(0,1) and added that value to the pre-score. We followed the same procedure using a second computer-generated random number ~N(0,1) to generate the post-score. We then subtracted the pre-score from the post-score to generate a simple value-added score. Because all parameters are simulated, we were able to limit variance in the scores to that due to variance from the test instrument.

Simulated Growth = (P2 + (Random2 • CSE)) - (P1 + (Random1 • CSE))

Conditional standard errors for the TAKS were taken from "Technical Digest for the Academic Year 2008-2009: TAKS 2009 Conditional Standard Error of Measurement (CSEM)" (TEA, 2010b). To eliminate bias from grade-to-grade alignment on the TAKS, we used a fall-to-spring within-grade testing design for all the TAKS simulations except for the grade transition model. This means the same set of CSEs was used to compute both the pre- and post-scores.
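The two controlled-score equations and the noise step can be combined into one function per student. The sketch below is our Python reconstruction rather than the authors' Stata code, and its use of the single average CSE of 38.96 is a simplification, since the paper applies score-level CSEs; it simulates one class's average gain under the TAKS "Met Standard" growth of 24 points.

```python
import random

random.seed(0)  # arbitrary seed for reproducibility

MU, SIGMA = 701.49, 100.24   # 2009 TAKS grade-5 reading (Table 1)
GROWTH = 24.0                # one year of growth at "Met Standard"
CSE = 38.96                  # average CSE; a simplification of score-level CSEs

def simulated_growth(z):
    """Simulated gain for one student: controlled pre/post scores
    plus independent N(0,1) error draws scaled by the CSE."""
    p1 = MU + z * SIGMA                         # Controlled P1 = mu + S*sigma
    p2 = p1 + GROWTH                            # Controlled P2 = P1 + growth
    observed_pre = p1 + random.gauss(0, 1) * CSE
    observed_post = p2 + random.gauss(0, 1) * CSE
    return observed_post - observed_pre

gains = [simulated_growth(random.gauss(0, 1)) for _ in range(25)]
print(sum(gains) / len(gains))   # class-average gain for one simulated class
```

Note that the controlled pre-score cancels out of the gain, so every simulated gain is the true growth of 24 plus pure instrument noise, which is exactly the isolation the Monte Carlo design aims for.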
We did conduct one grade-to-grade simulation for the TAKS in which 5th-grade CSEs were used to compute the simulated pre-score and 6th-grade CSEs were used to compute the simulated post-score. Because the MAP is a computer adaptive test, each individual test has a separate CSE. This necessitated that we use an average CSE for the MAP simulations. Also, since the MAP is computer adaptive, the tests do not focus around a specific proficiency point. This means the CSEs are very stable over the vast majority of the distribution of test scores. Instead of the distribution of CSEs being bimodal, the distribution is uniform except for the left tail of the bottom grades and the right tail of the top grades. The reported average CSEs for the range of RIT scores used in this study fall between 2.5 and 3.5 RIT scale points. We applied the value of 3.5 RIT scale points to the main simulation based on MAP scores. We chose to use the higher end of the range to ensure we were not "cherry picking" values. Additionally, while it would be highly unlikely for every student to complete a test with a CSE of 5, which is outside the expected average value range, we ran a "worst case" simulation using this maximized CSE value of 5 for all data points. As the MAP uses a single, equidistant point scale across all grade ranges, it does not have issues with grade-to-grade scale alignment.

Monte Carlo Simulation

For each simulation, we ran 1,000 iterations in Stata 11. The 1,000 iterations simulated the same set of 100 students taking the test 1,000 times. Each iteration was identified by an index number and each individual was identified by a case number. We filtered the results by case number to identify our three sample groups of 100, 50, and 25 data points within each iteration.
We then used the collapse command in Stata 11 to combine the data by index number, which gave us an average growth score for the sample for each iteration.2 Finally, we generated two variables to identify the iterations in each simulation in which growth was more than one-half year above or more than one-half year below the controlled growth. The threshold values were based on evidence from the body of research on teacher performance. Current research suggests that student gains of about .10 to .18 standard deviations are moderate gains (Kane, 2004; Schochet, 2008; Schochet and Chang, 2010). The within-grade standard deviation on the TAKS was 100 vertical scale points, while one year of expected growth was 24 points. We feel, then, that growth of .12 standard deviations, or one-half year above expected growth (12 points), would be considered a moderate effect size for an educational intervention (likewise for gains one-half year below expected growth). To maintain consistency in the comparisons, we applied a one-half year threshold to the MAP simulations as well. Because all iterations had a controlled value of one year of growth, the one-half year below and one-half year above variables are identified as false negatives and false positives, respectively, for iterations with average gain scores one-half year below or one-half year above expected growth. A false negative in the simulation would represent a teacher whose students made the predefined one year of growth but, due to measurement error, was identified as having students with growth of less than one-half the expected one-year growth.

2 Recall that the controlled pre-scores on which all other values are based have the same z-score for each iteration in every simulation except the TAKS actual simulation.
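Putting the pieces together, the iteration loop and the half-year classification can be sketched end to end. This is our Python reconstruction, not the authors' Stata code, and it uses the single average CSE of 38.96 rather than the score-level CSEs, so its counts will only roughly approximate the reported results.

```python
import random

random.seed(7)  # arbitrary seed for reproducibility

MU, SIGMA = 701.49, 100.24          # 2009 TAKS grade-5 reading (Table 1)
GROWTH, CSE = 24.0, 38.96           # "Met Standard" growth; average CSE
LOW, HIGH = GROWTH / 2, GROWTH * 1.5   # half-year thresholds: 12 and 36 points

def run_simulation(n_students, iterations=1000):
    """Count iterations whose class-average simulated gain falls more than
    half a year below (false negative) or above (false positive) the
    controlled one-year growth."""
    roster = [random.gauss(0, 1) for _ in range(n_students)]  # fixed z-scores
    false_neg = false_pos = 0
    for _ in range(iterations):
        gains = []
        for z in roster:
            p1 = MU + z * SIGMA
            p2 = p1 + GROWTH
            pre = p1 + random.gauss(0, 1) * CSE
            post = p2 + random.gauss(0, 1) * CSE
            gains.append(post - pre)
        avg = sum(gains) / n_students
        if avg < LOW:
            false_neg += 1
        elif avg > HIGH:
            false_pos += 1
    return false_neg, false_pos

for n in (100, 50, 25):
    fn, fp = run_simulation(n)
    print(f"n={n}: false negatives={fn}, false positives={fp}")
```

Even in this simplified form, the misidentification counts climb sharply as the sample shrinks from 100 to 25, the same qualitative pattern as in the results below.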
A false positive in the simulation would represent a teacher whose students made the predefined one year of growth but, due to measurement error, was identified as having students with growth of more than one and one-half years. For example, for the TAKS simulation in which 24 vertical scale points was one year of growth, a false negative iteration had growth of less than 12 vertical scale points and a false positive had growth of more than 36 vertical scale points. We determine the impact of instrument measurement error on the value-added measure by comparing the number of misidentified teachers for each model.

Results

As expected, we find that the results of the Monte Carlo simulations were highly dependent on the characteristics of the test instruments and the size of the sample. We ran multiple versions of the simulations to determine how changing various psychometric characteristics of the test instrument impacted outcomes. These variations included changing the distribution of data point values (ceiling effects), the expected growth, and the size of the samples. For distributions, we used a normal distribution for most of the simulations, but included a distribution fitted to the actual distribution of student scores from the TAKS to determine what impact ceiling effects would have on the results. We also ran a simulation which used the expected growth at the "Commended" level instead of "Met Standard". Finally, we ran the simulations using a variety of sample sizes (n=25, n=50, and n=100). These values were chosen as they could represent not only multiple years of data for an elementary teacher, but also expected annual teacher loads for middle and high school teachers. The variable of interest in our results was the number of teachers misidentified by the various simulations. We were able to look at different variants of the simulation to determine the consistency of the simulation in predicting the known outcomes of the growth measures.
To put this in context, using an inferential statistics test with the typical α=.05, we would expect up to 5% of values to fall outside our critical values in the distribution before we declared the sample significantly different from the reference population. Applying this standard, the simulations were fairly stable and differed little from the controlled population when n=100. All the variations of the TAKS had more than 5.0% misidentifications at n=50 and n=25. The only version of the MAP simulation which generated more than 5.0% misidentifications was the n=25 MAP simulation which used maximized CSEs.

TAKS - Texas Assessment of Knowledge and Skills

The results for the simulations based on the TAKS were the most unstable. The TAKS simulations did moderately well at correctly identifying teachers when the sample size was n=100. Table 4 shows the simulation results for various permutations of the TAKS when sample n=100. The actual distribution of student test scores on the TAKS 5th-grade reading test was negatively skewed (see Figure 2). We conducted a simulation with our starting values fitted to the actual distribution of student scores on the TAKS to evaluate whether the ceiling effects of the distribution would interact with the heteroskedastic nature of the CSEs and thereby amplify the test instrument errors. This simulation generated 17 false negatives and 25 false positives for a total of 42 misidentifications. Changing the sample n for the actual distribution models had an effect similar to that in the normal distribution model. The false negatives, false positives, and total misidentifications were 74, 96, and 170 for n=50 and 161, 184, and 345 for n=25. The differences in the percent of teachers misidentified between the TAKS actual model and the TAKS normal model were very small for the n=50 and n=25 sample sizes, but were almost twice as large for the n=100 sample sizes.
This would indicate that the ceiling effects were primarily impacting the efficiency of the models. Another issue related to distributional effects and VAMs is the non-random assignment of students to classrooms. The non-random assignment of students to classrooms introduces a large amount of error as well as bias into VAM (Rothstein, 2009). The value-added estimates for teachers who teach students drawn predominantly from the tails of the achievement distribution, such as those who work with advanced placement or remedial students, were far less reliable than the randomly assigned models.

Figure 2: Frequency Distribution TAKS Reading, Grade 5

The increased error was a result of the heteroskedastic nature of the CSEs (see Figure 1). To estimate the impact, we used the values from the TAKS normal distribution of students but selected the rightmost portion of the distribution for our n=50 and n=25 samples. This distribution generated 9 false negatives and 18 false positives for a total of 27 misidentifications using the full n=100 sample. When we reduced the sample size to n=50, the sample consisting of the right half of the distribution produced 99 false negatives and 87 false positives for a total of 186 misidentifications. Taking the rightmost quartile for the n=25 sample generated 222 false negatives and 212 false positives for a total of 434 misidentifications. This means that teachers who taught their school's highest performing students would be misidentified by our VAM 43.5% of the time purely due to noise on the TAKS. The results for the non-random samples on the VAM based on the MAP were similar to those for the randomly selected samples. This was the expected outcome, as the CSE on the MAP is not correlated with the RIT scale score.
Table 4: Monte Carlo Simulation Results, n=100

Model                                           % False Negative   % False Positive   % Total Correct ID
TAKS Actual Distribution                        1.7                2.5                95.8
TAKS Normal Non-Random Sample                   .9                 1.8                97.3
TAKS Normal Distribution at "Meets" Level       .9                 1.8                97.3
TAKS Normal Distribution Avg SE                 1.2                1.8                97.0
TAKS Normal Distribution at "Commended" Level   .8                 .2                 99.0
TAKS Normal Grade Transition                    1.4                2.1                96.5
MAP Normal                                      0.0                0.0                100.0
MAP Max CSE                                     0.0                0.0                100.0

The extreme level of error associated with non-random samples would make continued discussion of the differences between traditional proficiency measures and computer adaptive measures unproductive. To avoid creating a straw man argument, we used random, normally distributed samples for the remaining TAKS simulations in the study to make them more comparable to the MAP simulations. When using the annual growth at the "Meets" level and a normally distributed population, the TAKS normal model generated 9 false negatives and 18 false positives for a total of 27 misidentifications. When we reduced the sample size to n=50, the consistency of the TAKS normal model quickly diminished. The simulation generated 66 false negatives and 84 false positives, misidentifying 150 teachers. At n=25, the simulation misidentified 348 teachers, over one-third of the total, with the misidentifications fairly evenly split between false negatives and false positives. In order to analyze the impact of the heteroskedastic nature of the CSEs, we recomputed the number of misidentifications using the simulation data from the TAKS normal simulation. Instead of using score-point-level CSEs, we used the average standard error value of 38.96. Using the average standard error, the TAKS normal model generated 12 false negatives and 18 false positives for a total of 30 misidentifications at n=100. When we ran the model with sample n=50, there were 57 false negatives and 74 false positives, totaling 131 misidentifications.
Finally, at n=25, the counts were 145 false negatives, 160 false positives, and 305 total misidentifications.

Table 5: Monte Carlo Simulation Results, n=50

Model                                           % False Negative   % False Positive   % Total Correct ID
TAKS Actual Distribution                        7.4                9.6                83.0
TAKS Normal Non-Random Sample                   9.9                8.7                81.4
TAKS Normal Distribution at "Meets" Level       6.6                8.4                85.0
TAKS Normal Distribution Avg SE                 5.7                7.4                86.9
TAKS Normal Distribution at "Commended" Level   4.4                1.7                93.9
TAKS Normal Grade Transition                    6.5                8.1                85.4
MAP Normal                                      0.0                0.0                100.0
MAP Max CSE                                     .7                 .6                 98.7

The simulations above used a fall-to-spring testing assumption, which means we used the same set of CSEs on both tests. Many districts, however, use a spring-to-spring testing schedule. In order to examine the impact of the vertical alignment between grades, we altered the simulation to use fifth-grade CSEs in the pre-score computation and sixth-grade CSEs in the post-score computation. The CSE values from year to year had a pairwise correlation of .96. As can be seen in Tables 4-6, the results of the normal grade transition model were almost identical to those of the normal model. The only model of the TAKS simulations which demonstrated major differences from the other models across all sample sizes was the normal distribution with the expected growth at the "Commended" level. As expected, due to having a larger ratio between expected growth and CSE, this model had only about half as many misidentifications as the other TAKS models. By using the higher value of 34 for expected growth, the commended model was much more reliable at the n=100 level, having only 8 false negatives and 2 false positives for a total of 10 misidentifications. While this model was much more reliable, it still showed that value-added measures grow increasingly unreliable as the sample n decreases.
The model generated 44 false negatives and 17 false positives at the n=50 level, and 102 false negatives and 77 false positives at the n=25 level. It is worth noting that this model consistently generated more false negatives than false positives, whereas the other TAKS models generated more false positives than false negatives in 8 out of 9 instances.

Table 6: Monte Carlo Simulation Results, n=25

Model                                     % False Negative   % False Positive   % Total Correct ID
TAKS Actual Distribution                  16.1               18.4               65.5
TAKS Normal Non-Random Sample             22.2               21.2               56.6
TAKS Normal Distribution at "Meets"       16.8               18.0               65.2
TAKS Normal Distribution Avg SE           14.5               16.0               69.5
TAKS Normal Distribution at "Commended"   10.2               7.7                82.1
TAKS Normal Grade Transition              18.6               18.2               63.2
MAP Normal                                .5                 .5                 99.0
MAP Max CSE                               3.0                4.2                92.8

MAP - Measures of Academic Progress

Both of the MAP models produced estimates closely in line with the controlled average MAP growth of 5.06 RITs. The MAP normally distributed model produced no misidentifications in either the n=100 or n=50 simulations. Even when the sample size was reduced to n=25, the simulation produced only 5 false negatives and 5 false positives for a total of 10 misidentifications. In the grade range studied, the CSEs on the MAP are highly consistent. They average below 3.5 RIT points across the entire range of scores. In order to evaluate the effect of increased CSEs, we ran a simulation with an extremely high MAP CSE of 5 for all data points. This simulation still generated no misidentifications at n=100; only 7 false negatives and 6 false positives for a total of 13 misidentifications out of 1,000 iterations at n=50; and, lastly, 30 false negatives and 42 false positives for a total of 72 misidentifications at n=25. The poorer performance of the high-CSE MAP simulation was consistent with the changes we found in the TAKS when we altered the relationship between the size of the expected growth and the test instrument CSEs.
Sample Size

The impact of sample size can be easily seen by examining Table 7 below. With the exception of the TAKS normal "Commended" level simulation, which had a different expected growth, all the TAKS models had an average growth near the controlled value of 24 vertical scale points; however, the standard deviations grow rapidly as the sample n is halved. The same increase is seen in the statistics from the MAP simulations as well. These differences are expected according to the central limit theorem; however, it is still informative to discuss the degree to which these changes impact the reliability of VAMs. Table 7 shows the differences by sample n for the TAKS normal model and the MAP normal model. The MAP simulations not only produced fewer false negatives than the TAKS, but the reliability of the MAP was sustained even with a reduction in the sample size. This is due to the larger ratio between the expected growth and standard errors on the MAP than on the TAKS. Table 8 shows that a VAM based on the MAP will be more efficient than a VAM based on the TAKS. This permits a VAM based on the MAP to produce more reliable estimates of teacher performance while using smaller student samples.

Table 7: Descriptive Statistics Based on Sample N

                                          Controlled   n=100 Avg            n=50 Avg             n=25 Avg
Model                                     Growth       Sim. Growth   SD     Sim. Growth   SD     Sim. Growth   SD
TAKS Normal Distribution at "Meets"       24           24.08         5.45   24.45         8.37   24.14         12.39
TAKS Normal Distribution Avg SE           24           24.19         5.45   24.61         8.03   24.59         11.47
TAKS Normal Distribution at "Commended"   34           33.85         5.60   34.15         8.12   34.92         11.87
TAKS Actual Distribution                  24           24.29         6.02   24.26         8.78   24.18         12.28
TAKS Normal Grade Transition              24           24.08         5.59   24.24         8.59   24.15         12.85
MAP Normal                                5.06         5.07          .49    5.12          .72    5.12          1.03
MAP Max CSE                               5.06         5.05          .71    5.05          .99    5.08          1.37

Table 8: TAKS Normal vs.
MAP Normal Based on Sample N

Test                                      % Misidentified
                                          n=100    n=50    n=25
TAKS Normal Distribution at “Meets”         2.7     15.0    34.8
MAP Normal                                  0.0      0.0     7.2

Conclusions

The critical characteristics of a test instrument for reliably computing teacher value-added models are the relationship between the expected growth and the CSEs of the test, together with the size of the student sample being tested. As the size of the average CSE approaches, or on some tests even exceeds, the size of the average expected growth, the ability of the instrument to reliably measure student growth is greatly compromised. If value-added measures are going to be useful additions to teacher evaluation programs, they must be based on tests with a level of reliability sufficient for the task at sample sizes similar to those found in typical classrooms.

Pencil and paper student proficiency tests are particularly ineffective at measuring student growth. There are a number of reasons this is true. The primary one is the reliability of the tests. Proficiency tests are generally meant to discern between two fairly broad categories, proficient or not proficient. They can also be used with a reasonable expectation of reliability to identify those students who are in the upper and lower ends of the distribution, but generally these tests are not very efficient at measuring fine increases in achievement. The minimum CSE found on the 5th grade TAKS was 24 points. This smallest CSE was equal to the one-year expected growth. Essentially, for any student there was at best a 68% likelihood that their score on the TAKS was within one year's worth of growth of the score which represents their true ability. For the 13.5% of students performing at the top end of the scale, the level of error is 74 points, or three years of growth. This reliability issue means proficiency tests such as the TAKS are not very good measures on which to base value-added measures.
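The likelihood figures above follow from treating the CSE as the standard deviation of a normal measurement error around the student's true score: the probability that an observed score falls within a band of its true value is then a standard normal calculation. A minimal check (the MAP-style growth and CSE values in the last line are illustrative assumptions):

```python
from math import erf, sqrt

def prob_within(score_band, cse):
    """Probability that a normally distributed observed score falls within
    +/- score_band of the student's true score, given standard error cse."""
    z = score_band / cse
    return erf(z / sqrt(2))  # two-sided normal coverage of +/- z SDs

# 5th-grade TAKS figures from the text: minimum CSE = 24 scale points,
# equal to one year's expected growth.
print(f"within one year's growth (24 pts), CSE 24: {prob_within(24, 24):.1%}")
# For top-of-scale students the text reports a CSE of 74 points:
print(f"within one year's growth (24 pts), CSE 74: {prob_within(24, 74):.1%}")
# MAP-style comparison (illustrative): ~5 RITs of growth, CSE ~3.5:
print(f"within 5 RITs, CSE 3.5: {prob_within(5, 3.5):.1%}")
```

The first case gives roughly 68%, the familiar one-standard-error coverage; for top-of-scale students the coverage collapses to about a quarter, which is the reliability problem the text describes.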
The source of this error in traditional proficiency tests is their tendency to be focused around the proficiency point. Test makers focus proficiency tests in this manner to permit them to get similarly reliable results with a shorter test. In order to get higher levels of reliability across the entire spectrum of ability, the tests would require many additional items. Added items mean added production costs and added instructional time spent on testing; however, the efficiency gained by focusing the test around a specific point comes at the cost of reliability in the tails of the student distribution. This creates heteroskedasticity in the CSE structure which, as our results show, amplifies the overall error of the value-added measures.

Well designed computer adaptive tests can greatly reduce both the test instrument reliability problem and the heteroskedasticity problem. The mechanism of adaptive tests allows each student to receive a test focused on their level of ability rather than a general proficiency point. This tailored approach places each student at the center of their own personal distribution of CSEs. With tests based on Rasch models, the lowest CSEs are located at the point on which the test is focused. By doing this, computer adaptive tests are able to greatly increase the precision of the student ability measurement, which in turn improves the reliability of the VAMs which use measures of student achievement as the major independent variables. Another benefit of computer adaptive tests is their ability to apply a single scale across multiple grades. The creators of the TAKS did an admirable job of vertically aligning performance between the fifth and sixth grades. Our simulations showed very little difference between the TAKS normal model, which simulated fall-to-spring testing, and the TAKS grade transition model, which simulated spring-to-spring growth.
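The precision mechanism described above can be sketched with the Rasch model: an item's Fisher information peaks when its difficulty matches the student's ability, and the standard error of the ability estimate shrinks as information accumulates, so items targeted at the student beat items clustered around a distant cutoff. The item pools and ability values below are illustrative assumptions, not NWEA's actual parameters.

```python
import math

def rasch_p(theta, b):
    """Rasch model: probability that a student with ability theta
    answers an item of difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def item_information(theta, b):
    """Fisher information of a Rasch item: p * (1 - p), maximized
    when item difficulty b equals ability theta."""
    p = rasch_p(theta, b)
    return p * (1.0 - p)

def next_item(theta_hat, pool):
    """CAT item-selection sketch: pick the item whose difficulty is
    most informative at the current ability estimate."""
    return max(pool, key=lambda b: item_information(theta_hat, b))

def standard_error(theta, administered):
    """SE of the ability estimate: 1 / sqrt(total item information)."""
    info = sum(item_information(theta, b) for b in administered)
    return 1.0 / math.sqrt(info)

# Items targeted at a high-ability student (theta = 2) yield a smaller
# SE than the same number of items focused on a proficiency cutoff at 0:
theta = 2.0
targeted = [1.8, 2.0, 2.2] * 10        # 30 items near the student's ability
cutoff_focused = [-0.2, 0.0, 0.2] * 10  # 30 items near the cutoff
print(standard_error(theta, targeted) < standard_error(theta, cutoff_focused))  # True
```

This is exactly the heteroskedasticity argument in miniature: a cutoff-focused test leaves students in the tails with less information and therefore larger CSEs, while an adaptive test centers the information on each student.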
Since most districts currently test only once per year, the alignment between grades is a major concern, as poor alignment has the potential to inject large amounts of noise into the measures of student achievement, noise which will be passed along to the VAMs. Having a single scale eliminates the need to align different scales; however, it is also critical that all items in the item pool of a computer adaptive test be properly rated for difficulty to ensure that item selection at different levels of performance does not introduce a similar type of noise into the computer adaptive test.

Policy Implications

Why Are More States Not Using Computer Adaptive Testing?

There are a number of reasons more states are not using CAT even though CAT has been demonstrated to be a much more reliable measure of student achievement. Pelton and Pelton (2006, p. 143) state, “…many schools and districts are waiting for positive examples of practical applications and observable benefits before adopting a CAT.” The major reason seems to be a lack of awareness. Whether an individual supports testing and VAMs or opposes them, there seems to be little understanding on the part of many state-level educators and legislators of the mechanics of test reliability and measurement error. Overcoming these hesitations and shortcomings in understanding will require not just more studies on the subject, but also better dissemination of the information. While it can be expected that companies providing CAT will be actively involved in this type of outreach, departments of statistics and research units located at institutions of higher learning can also play an important role in increasing understanding and acceptance of CAT among state-level education policy makers and school district leaders. Another barrier to increasing the use of CAT is the perceived cost. There are perceived financial costs as well as perceived costs in instructional time for increased testing.
The perception of increased instructional time is incorrect. Due to the relative efficiency of CAT compared to traditional paper and pencil tests, CAT can greatly reduce the amount of instructional time spent on testing while increasing the amount of available data on student achievement. Because of this shorter testing time, CAT can be used for formative as well as summative assessment without increasing the overall time spent on testing during a school year. Additionally, since results from CAT can be made available much faster than those from traditional paper and pencil tests, educators can use the results to modify instructional plans to better meet student needs. Depending on the design, CAT could be more or less costly than traditional paper and pencil tests. Based on documentation from the TEA (2010c), the average cost of administering the TAKS for the 2010-2011 school year is $14.88 per pupil. This is comparable to the current per-pupil prices for administering CAT. Further, it is reasonable to expect the cost of CAT to drop as more testing companies enter the CAT market. Administering CAT does require access to adequate computer equipment and network infrastructure. The hardware demands of CAT are generally modest. For example, both the server and client side of NWEA's MAP test require only a Pentium II (233 MHz) processor with 256 MB of memory. These specifications are very minimal when compared to most modern computers. Unlike traditional paper and pencil tests, CAT does not require all students to test at the exact same moment. This allows schools to process entire student bodies without requiring a 1:1 computer-to-student ratio. While the hardware requirements for CAT are minimal, some schools will require system upgrades to be able to complete schoolwide computer adaptive testing.
States wishing to move to a CAT environment will need to guarantee funding to ensure all schools are able to meet the minimal hardware requirements. This issue will certainly be a concern, as both consortia developing the Common Core assessments have stated that those assessments will be CAT instruments.

What level of teacher misidentification is acceptable?

Another major issue which must be determined by each state or district is the acceptable level of teacher misidentification. All measures of teacher performance will contain a level of statistical error. The level of error that is acceptable is something that will have to be determined by each state and school district. In locations with strong teachers' unions, these levels will likely have to be negotiated as part of the collective bargaining agreement. Regardless of the agreed upon acceptable level, CAT are more likely than traditional paper and pencil tests to meet or exceed the requirements for reliability. This is especially true when using student sample sizes associated with the traditional classroom. We think value-added models can provide useful information about a teacher's performance, but not when the reliability of the instrument measuring their students' performance is ± a full year's growth. We would recommend that districts or states wishing to pursue VAMs as a portion of teacher evaluations carefully investigate the characteristics of the student measurement instruments. Instruments with a low expected-growth to standard-error ratio should be rejected, as they are an unsuitable foundation for VAMs. While concerns about the impacts of the reliability of the student testing instrument are somewhat mitigated by increasing the sample size, districts and states should always opt for the most reliable measurement instrument available when computing VAMs.

References

Anderson, R. (1972). How to construct achievement tests to assess comprehension. Review of Educational Research, 42(2), 145-170.
Baker, E., Barton, P., Darling-Hammond, L., Haertel, E., Ladd, H., Linn, R., Ravitch, D., Rothstein, R., Shavelson, R., and Shepard, L. (2010). Problems with the Use of Student Test Scores to Evaluate Teachers. EPI Briefing Paper #278. Washington, DC: Economic Policy Institute.

Ballou, D. (2009). Test scaling and value-added measurement. Education Finance and Policy, 4(4), 351-383.

Braun, H. (2005). Using Student Progress to Evaluate Teachers: A Primer on Value-Added Models. Princeton, NJ: Educational Testing Service. Retrieved March 12, 2011 from: http://www.ets.org/Media/Research/pdf/PICVAM.pdf

Briggs, D. and Weeks, J. (2009). The sensitivity of value-added modeling to the creation of a vertical score scale. Education Finance and Policy, 4(4), 384-414.

Glazerman, S., Loeb, S., Goldhaber, D., Staiger, D., Raudenbush, S., and Whitehurst, G. (2010). Evaluating Teachers: The Important Role of Value-Added. Washington, DC: Brown Center on Education Policy at Brookings.

Hill, H. (2009). Evaluating value-added models: A validity argument approach. Journal of Policy Analysis and Management, 28, 700-709.

Kane, T. (2004). The Impact of After-School Programs: Interpreting the Results of Four Recent Evaluations. Working Paper. University of California, Los Angeles.

Kirk, R. (1995). Experimental Design: Procedures for the Behavioral Sciences (3rd ed.). Pacific Grove, CA: Brooks/Cole Publishing Company.

May, H., Perez-Johnson, I., Haimson, J., Sattar, S., and Gleason, P. (2009). Using State Tests in Education Experiments: A Discussion of the Issues (NCEE 2009-013). Washington, DC: National Center for Education Evaluation and Regional Assistance, Institute of Education Sciences, U.S. Department of Education.

Northwest Evaluation Association. (2011). Measures of Academic Progress® and Measures of Academic Progress for Primary Grades®: Technical Manual. Portland, OR: Author.

Northwest Evaluation Association. (2008). RIT Scale Norms: For Use with Measures of Academic Progress.
Portland, OR: Author.

Pelton, T. and Pelton, L. (2006). Introducing a computer-adaptive testing system to a small school district. In S. L. Howell and M. Hricko (Eds.), Online Assessment and Measurement: Case Studies from Higher Education, K-12, and Corporate (pp. 143-156). Hershey, PA: Information Science Publishing.

Rothstein, J. (2009). Student sorting and bias in value added estimation: Selection on observables and unobservables. NBER Working Paper 14666. Cambridge, MA: National Bureau of Economic Research. Retrieved February 7, 2012 from: http://www.nber.org/papers/w14666

Schochet, P. (2008). Statistical power for random assignment evaluations of education programs. Journal of Educational and Behavioral Statistics, 33(1), 62-87.

Schochet, P. and Chiang, H. (2010). Error Rates in Measuring Teacher and School Performance Based on Student Test Score Gains (NCEE 2010-4004). Washington, DC: National Center for Education Evaluation and Regional Assistance, Institute of Education Sciences, U.S. Department of Education.

Texas Education Agency. (2010a). Technical Digest for the Academic Year 2008-2009: TAKS 2009 Scale Distributions and Statistics by Subject and Grade. Austin, TX: Author. Retrieved February 15, 2011 from: http://www.tea.state.tx.us/student.assessment/techdigest/yr0809/

Texas Education Agency. (2010b). Technical Digest for the Academic Year 2008-2009: TAKS 2009 Conditional Standard Error of Measurement (CSEM). Austin, TX: Author. Retrieved February 15, 2011 from: http://www.tea.state.tx.us/student.assessment/techdigest/yr0809/

Texas Education Agency. (2010c). Approval of Costs of Administering the 2010-2011 Texas Assessment of Knowledge and Skills (TAKS) to Private School Students. Retrieved April 12, 2011 from: http://www.tea.state.tx.us/index4.aspx?id=2147489185

Thompson, T. (2008). Growth, Precision, and CAT: An Examination of Gain Score Conditional SEM.
Paper presented at the 2008 annual meeting of the National Council on Measurement in Education, New York, NY.

Weiss, D. (1982). Improving measurement quality and efficiency with adaptive testing. Applied Psychological Measurement, 6(4), 479-492.