The Impact of Selection of Student Achievement
Measurement Instrument on Teacher Value-Added
Measures
James L. Woodworth
Quantitative Research Assistant
Hoover Institution, Stanford University, Stanford, California
Wen-Juo Lo
Assistant Professor
University of Arkansas, Fayetteville
Joshua B. McGee
Vice President for Public Accountability Initiatives
Laura and John Arnold Foundation, Houston, Texas
Nathan C. Jensen
Research Specialist
Northwest Evaluation Association, Portland, Oregon
Abstract
In this paper, we analyze how the choice of instrument used to measure student achievement affects the variance of teacher value-added measures. The psychometric characteristics of the student evaluation instrument may affect the amount of variance in the estimates of student achievement, which in turn affects the amount of variance in teacher value-added measures. The goal of this paper is to contribute to the body of information available to policy makers on the implications of the selection of student measurement instruments for teacher value-added measures.
The results demonstrate that well-designed value-added measures based on appropriate measures of student achievement can provide reliable estimates of teacher performance.
Introduction
Value-added measures (VAM) have generated much discussion in the education policy
realm. Opponents raise concerns over poor reliability and worry that statistical noise will cause
many teachers to be misidentified as poor teachers when they are actually performing at an
acceptable level (Hill, 2009; Baker, Barton, Darling-Hammond, Haertel, Ladd, Linn, et al.,
2010). Supporters of VAMs hail properly designed VAMs as a way to quantitatively measure the
input of schools and individual teachers (Glazerman, Loeb, Goldhaber, Staiger, Raudenbush, &
Whitehurst, 2010).
Much of the statistical noise associated with VAM originates from the characteristics of
the testing instruments used to measure student performance. In this analysis, we use a series of
Monte Carlo simulations to demonstrate how changes to the characteristics of two measurement
instruments, the Texas Assessment of Knowledge and Skills (TAKS) and Northwest Evaluation
Association’s (NWEA) Measures of Academic Progress (MAP), affect the reliability of the
VAMs.
Sources of Error in Measurement Instruments
One of the larger sources of statistical noise in the VAM comes from the lack of
sensitivity in the student measurement instrument. A measurement instrument with a high level
of error in relation to the changes in the parameter it is meant to measure will increase the
variance of VAMs based on that particular instrument (Thompson, 2008). Measurement
instrument error occurs for a number of reasons, such as test design, vertical alignment, and
student sample size. To have useful VAMs, researchers must find student measurement
instruments with the smallest ratio of error to growth parameter possible.
Test design.
Proficiency tests such as those used by many states may be particularly noisy, i.e.
unreliable, at measuring growth in student achievement. This is because measuring growth is not
the purpose for which proficiency tests are designed. Proficiency tests are designed to
differentiate between two broad categories, proficient or not proficient (Anderson, 1972). To accomplish this, a proficiency test needs to be reliable only around the dividing point between proficient and not proficient, but it must be as reliable as possible at that point. To gain this reliability, test developers modify the test to increase its ability to discriminate between the two categories.
Increasing instrument discrimination around this point can be accomplished in several
ways. One is to increase the number of items, i.e., questions, on the test. Increasing the number of items across the entire spectrum of student ability, however, would require the test to become burdensomely long. Excessively long tests have their own drawbacks. In addition to the amount
of instructional time lost to taking long tests, student attention spans also limit the number of
items which can be added to a test without sacrificing student effort and thereby introducing
another source of statistical noise.
To lessen the problems of long tests, psychometricians increase the number of items just
around the proficient/not proficient dividing point of the criterion referenced test. Focusing items
around this single point increases the test reliability at the proficiency point but does not improve
reliability in the upper and lower ends of the distribution of test scores. When tests are structured
in this manner, they are less reliable the further a student’s ability is from the focus point.
Tests which are highly focused around a specific point have heteroskedastic conditional
standard errors (CSEs) (May, Perez-Johnson, Haimson, Sattar, & Gleason, 2009). The
differences in CSEs across the spectrum of student ability can be dramatic. Figure 1 shows the
distribution of CSEs from the 2009 TAKS 5th-grade reading test above and below the measured
value of students’ achievement. As can be seen in Figure 1, there is a large amount of variance in
the size of the measurement error as the values move away from the focus point.
Figure 1: Distribution of Conditional Standard Errors - 2009 TAKS Reading, Grade 5
Focusing the test can increase the reliability enough to make the test an effective measure
of proficiency, but the same test would be a poor measure of growth across the full spectrum of
student ability.
The traditional paper and pencil state-level proficiency tests are often very brief; for
example, the TAKS contains as few as 40 items. Again, if item difficulty is tightly focused
around a single point of proficiency, a test may be able to reliably differentiate performance just
above that point from performance just below that point; however, such a test will not be able to
differentiate between performances two or three standard deviations above the proficiency point.
Reliable value-added measures require a reliable measure of all student performance regardless
of where a student’s score is located within the ability distribution.
Even paper and pencil tests which have been designed to measure growth are generally
limited to measuring performance at a particular grade level and will therefore have a limited item pool with which to increase reliability. Computer adaptive tests can avoid the reliability problems related to the
limited item pool available on traditional paper and pencil tests (Weiss, 1982). Using adaptive
algorithms, a computer adaptive test can select from a much larger pool of items those items that
are appropriate to the ability level for each student. The effect of this differentiated item selection
is to create a test which is focused around the student’s level of true ability. Placing the focus of
the test items at the student’s ability point instead of at the proficiency point minimizes the
testing error around the student’s true ability as opposed to around the proficiency point and
thereby gives a more reliable measure of the student’s ability.
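As a rough illustration of this mechanism (a sketch of our own, not NWEA's algorithm; the item pool, the ability-update rule, and all names below are hypothetical), the following code selects each successive item as the one whose Rasch difficulty is closest to the current ability estimate, which is where such an item is most informative:

```python
import numpy as np

def administer_adaptive_test(true_ability, item_difficulties, n_items=20, rng=None):
    """Toy adaptive test: give the unused item whose difficulty is closest to the
    current ability estimate, score it with a Rasch response model, and nudge the
    estimate toward the observed result."""
    if rng is None:
        rng = np.random.default_rng()
    remaining = list(range(len(item_difficulties)))
    theta = 0.0  # start the ability estimate at the scale mean
    for step in range(n_items):
        # the most informative Rasch item is the one whose difficulty is nearest theta
        idx = min(remaining, key=lambda i: abs(item_difficulties[i] - theta))
        remaining.remove(idx)
        difficulty = item_difficulties[idx]
        p_correct = 1.0 / (1.0 + np.exp(-(true_ability - difficulty)))
        answered_correctly = rng.random() < p_correct
        # crude stochastic update standing in for a full maximum-likelihood estimate
        theta += (1.0 if answered_correctly else -1.0) / (step + 1)
    return theta

# An item pool spread across the whole ability range, unlike a proficiency-focused test.
pool = np.linspace(-3, 3, 500)
print(administer_adaptive_test(true_ability=1.2, item_difficulties=pool))
```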
Vertical alignment.
Tests used in value-added models which are meant to measure growth from year to year must be vertically aligned. As we discussed above, criterion-referenced tests such as the state proficiency tests often used in value-added measures are designed to measure student mastery of a specific set of
grade-level knowledge. As this knowledge changes from year to year, scores on these tests are
not comparable unless great care is taken to align scale values on the tests from year to year.
While scale alignment does not affect norm referenced VAMs, Briggs and Weeks (2009) found
that underlying vertical scale issues can affect threshold classifications of the effectiveness of
schools.
Additionally, vertically aligning scales across multiple grade levels is not the only alignment issue. Even if different grade-level scales are aligned, there is some discussion among psychometricians as to whether the instruments are truly aligned just because they are based on Item Response Theory (IRT) (Ballou, 2009). The TAKS is based on IRT and contains a
vertically aligned scale across grades (TEA, 2010a). The MAP is based on a large pool of items
which are individually aligned to ensure a consistent interval, single scale of measurement across
multiple levels or grades of ability (NWEA, 2011). The level of internal alignment on the tests
may be an additional source of statistical noise in the student measurement which is transferred
to the VAM.
Student sample size.
The central limit theorem states that the reliability of any estimate based on mean values
of a parameter increases with the number of members within the sample (Kirk, 1995). Opponents
of VAMs often express concern that the class size served by the typical teacher is insufficient to
produce a reliable measure of teacher performance (Braun, 2005). The same principle also implies that, with a large enough sample size, even a noisy estimator of teacher performance will eventually reach an acceptable level of reliability.
The question around VAMs then becomes one of efficiency. How does the error related
to the student measurement instrument affect the efficiency of the VAM? How large a sample is
necessary to permit the VAM to reliably measure teacher performance? As typical class sizes in the elementary grades are around 25 students, we evaluated the effect of different student measurement instruments on the efficiency of teacher VAMs. We analyzed this by comparing
results from each simulation using samples with different numbers of students. Those tests with a
smaller ratio between expected growth and CSE (growth/CSE) introduce more noise into the
value-added measure and will require a larger sample size to produce reliable VAMs of teacher
performance.
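As a concrete illustration (a minimal sketch of our own; the function name is ours), the growth-to-error ratios for the two instruments, using values reported elsewhere in this paper, work out as follows:

```python
def growth_to_error_ratio(expected_growth, average_cse):
    """Ratio of one year's expected growth to the instrument's standard error;
    smaller ratios mean more of any observed gain is measurement noise."""
    return expected_growth / average_cse

# Values reported later in this paper: 24 TAKS vertical scale points of annual growth
# against an average standard error of 38.96, versus 5.06 RIT points of annual MAP
# growth against an average CSE of 3.5.
print("TAKS:", round(growth_to_error_ratio(24, 38.96), 2))   # ~0.62
print("MAP: ", round(growth_to_error_ratio(5.06, 3.5), 2))   # ~1.45
```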
The Data
TAKS - Texas Assessment of Knowledge and Skills
We used TAKS data for these analyses because the TAKS is vertically aligned and we
were able to obtain the psychometric characteristics from the TEA. Additionally, the TAKS
scores have a slight ceiling effect, which we exploited to analyze the impact of ceiling effects on
instrument variance.
To obtain TAKS data for the analysis, we used Reading scores for grade 5 from the 2009
TAKS administration. The TEA reports both scale scores and vertically aligned scale scores for
all tests. We used the vertical scale scores for all TAKS calculations. We computed the
descriptive statistics for fifth grade vertical scale scores based on the score frequency distribution
tables (TEA, 2010a) publicly available from the Texas Education Agency (TEA) website. Table
1 shows the descriptive statistics for the data set. For this study, we used the CSEs for each
possible score as they were reported by TEA (2010b).
Table 1: Statistical Summary 2009 TAKS Reading, Grade 5

Statistic         Value
N                 323,507
Mean              701.49
Std. Deviation    100.24
Skewness          .273
Minimum           188
Maximum           922
In our analysis of the TAKS data, in addition to using the normally distributed ~N(0,1)
simulated data set described below, we also created a second simulated data set by fitting 100
data points to the actual distribution from the 2009 TAKS 5th-grade reading test. Differences
between the two data sets were used to analyze how ceiling effects affect value-added measures.
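One way such a fitted sample could be constructed (our sketch; the exact fitting procedure is an assumption, and the miniature frequency table below is hypothetical) is to place the 100 points at evenly spaced quantiles of the empirical distribution implied by the published score frequency table:

```python
import numpy as np

def fit_points_to_empirical(scores, frequencies, n_points=100):
    """Place n_points at evenly spaced quantiles of the empirical score distribution
    defined by a (score, frequency) table, so the synthetic sample mirrors the
    observed shape, including any ceiling effect."""
    scores = np.asarray(scores, dtype=float)
    cdf = np.cumsum(frequencies) / np.sum(frequencies)        # empirical CDF
    targets = (np.arange(n_points) + 0.5) / n_points          # evenly spaced quantiles
    return scores[np.searchsorted(cdf, targets)]

# Hypothetical miniature frequency table standing in for the published TEA table.
sample = fit_points_to_empirical([600, 650, 700, 750, 800], [5, 15, 40, 30, 10])
print(round(sample.mean(), 1), sample.max())
```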
MAP - Measures of Academic Progress
Data for the MAP were retrieved from “RIT Scale Norms: For Use with Measures of
Academic Progress” and “Measures of Academic Progress® and Measures of Academic
Progress for Primary Grades ®: Technical Manual”1. The MAP uses an equidistant scale score
referred to as a RIT score for all tests. We computed the RIT mean and standard deviation for
reading scores for the MAP from frequency distributions reported by NWEA. We used the
average CSEs as reported by MAP in the same documents.
Methodology
Data Generation
We used a Monte Carlo simulation for this analysis. Monte Carlo simulation is a
computer simulation which draws repeatedly from a set of random values which are then applied
to an algorithm to perform a deterministic computation based on the inputs. The results of the
simulation are then aggregated to produce an asymptotic estimation of the event being studied.
1 Access to both documents was made possible through a data grant from the Kingsbury Center.
The benefit of using the Monte Carlo study is that it allowed us to isolate the instrument
measurement error of different psychometric characteristics from other sources of error. To
accomplish this isolation, we control the measurement error within a simulation of data points.
To generate the synthesized sample for the Monte Carlo simulations, we used SPSS to
generate a normally distributed random sample of 10,000 z-scores (see Table 2 below). In order
to simulate group sizes equal to those which would be served by teachers at the high school,
middle school, and elementary school levels, we randomly selected from this 10,000 data point
sample three nested groups with 100 individuals, 50 individuals, and 25 individuals. Because all
our samples are normally distributed and the samples of 25 were randomly selected and are
nested within the sample of 50 which was randomly selected and nested within the sample of
100, we considered each level the equivalent of adding additional classes to the value-added models for the elementary and middle school levels. We therefore feel it is acceptable to consider the n=50 simulations to be usable estimates for a value-added measure for elementary teachers using two years' worth of data, and the n=100 simulations for four years' worth of data.
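The following sketch reproduces this step in Python rather than SPSS (the seed and variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(seed=2011)            # arbitrary seed for reproducibility

# 10,000 simulated students as standard-normal z-scores, as summarized in Table 2.
population_z = rng.standard_normal(10_000)

# Nested samples: the 25 cases sit inside the 50, which sit inside the 100 (Table 3).
sample_100 = rng.choice(population_z, size=100, replace=False)
sample_50 = rng.choice(sample_100, size=50, replace=False)
sample_25 = rng.choice(sample_50, size=25, replace=False)

for label, s in (("n=100", sample_100), ("n=50", sample_50), ("n=25", sample_25)):
    print(label, round(s.mean(), 2), round(s.std(ddof=1), 2))
```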
Table 2: Statistical Summary, Starting z-Scores

Statistic         Value
N                 10,000
Mean              0.008
Std. Deviation    .999
Skewness          0.0
Minimum           -3.71
Maximum           3.70
Table 3: Statistical Summary, z-Score Samples by n

Statistic         n=100     n=50      n=25
N                 100       50        25
Mean              -.13      -.09      .01
Std. Deviation    .97       .97       1.00
Skewness          -.12      .18       .10
Minimum           -2.34     -1.85     -1.77
Maximum           2.09      2.09      2.09
The z-scores we generated (S) were used to represent the number of standard deviations
above or below the mean of each test the starting scores would be for the data points within the
sample. The z-score was then multiplied by the test standard deviation and added to the test
mean in order to generate the individual controlled pre-scores (P1). The same samples were used
for each simulation except the TAKS actual distribution simulation. For example, in every
simulation except the TAKS actual model, the starting score (P1) for case 1 in the n=100 sample
is always 2.43 standard deviations below the mean.
Controlled P1i = μ + (Si • σ)
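In code, the controlled pre-score computation is a single line; the sketch below (names ours) uses the Table 1 TAKS mean and standard deviation:

```python
import numpy as np

def controlled_pre_scores(z_scores, test_mean, test_sd):
    """Controlled P1 = mu + (S * sigma): map each z-score onto the test's scale."""
    return test_mean + np.asarray(z_scores, dtype=float) * test_sd

# TAKS grade 5 reading vertical scale values from Table 1: mean 701.49, SD 100.24.
print(controlled_pre_scores([-2.43, 0.0, 1.5], test_mean=701.49, test_sd=100.24))
```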
For all simulations, we created a controlled post-score (P2) by adding an expected year’s
growth for the given test to the controlled pre-score. The expected growth for one year was a
major consideration for different models of the simulation. For the TAKS analysis, we used two
different values to represent one year of growth. The first value used was 24 vertical scale points
which was the difference between “Met Standard” on the 5th-grade reading test and “Met
Standard” on the 6th-grade reading test (TEA, 2010a). The other value used in the TAKS
analysis was 34 vertical scale points which was the difference between “Commended” on the
5th-grade reading test and “Commended” on the 6th-grade reading test. The difference between
the average scores for the TAKS 5th-grade reading and TAKS 6th-grade reading was 25.89
vertical scale points. We did not run a separate analysis for this value since it is so close to the
“Met Standard” value of 24 vertical scale points. For the MAP we used the norming population’s
average fifth grade one year growth of 5.06 RIT scale points for expected growth (NWEA,
2008).
Controlled P2i = P1i + Expected Growth
In order to simulate testing variance we used Stata11 to multiply the reported conditional
standard error (CSE) for each data point by a computer generated random number ~N(0,1) and
added that value to the pre-score. We followed the same procedure using a second computer
generated random number ~N(0,1) to generate the post score. We then subtracted the pre-score
from the post-score to generate a simple value-added score. Because all parameters are simulated
we were able to limit variance in the scores to that due to variance from the test instrument.
Simulated Growth = (P2 + (Random2 • CSE)) - (P1 + (Random1 • CSE))
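A sketch of this error-injection step (our Python standing in for the Stata code; the example CSE values are hypothetical) follows the formula above:

```python
import numpy as np

def simulated_growth(p1, expected_growth, cse, rng):
    """Simulated Growth = (P2 + e2*CSE) - (P1 + e1*CSE) with e1, e2 ~ N(0,1).
    Only test-instrument error is simulated; every student's true gain is fixed
    at expected_growth."""
    p1 = np.asarray(p1, dtype=float)
    cse = np.asarray(cse, dtype=float)
    p2 = p1 + expected_growth                                  # controlled post-score
    observed_pre = p1 + rng.standard_normal(p1.shape) * cse
    observed_post = p2 + rng.standard_normal(p1.shape) * cse
    return observed_post - observed_pre

rng = np.random.default_rng(1)
pre_scores = np.array([650.0, 701.5, 790.0])
cses = np.array([40.0, 24.0, 60.0])                            # hypothetical score-level CSEs
print(simulated_growth(pre_scores, expected_growth=24, cse=cses, rng=rng))
```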
Conditional standard errors for the TAKS were taken from “Technical Digest for the
Academic Year 2008-2009: TAKS 2009 Conditional Standard Error of Measurement (CSEM)”
(TEA, 2010b). To eliminate bias from grade-to-grade alignment on the TAKS, we used a fall-to-spring, within-grade testing design for all the TAKS simulations except for the grade transition model. This means the same set of CSEs was used to compute both the pre- and post-scores. We did conduct one grade-to-grade simulation for the TAKS in which 5th-grade CSEs were used to compute the simulated pre-score and 6th-grade CSEs were used to compute the simulated post-score.
Because the MAP is a computer adaptive test, each individual test has a separate CSE.
This necessitated using an average CSE for the MAP simulations. Also, since the MAP is
computer adaptive, the tests do not focus around a specific proficiency point. This means the
CSEs are very stable over the vast majority of the distribution of test scores. Instead of the
distribution of CSEs being bimodal, the distribution is uniform except for the left tail of the
bottom grades and right tail of the top grades. The reported average CSEs for the range of RIT
scores used in this study fall between 2.5 and 3.5 RIT scale points. We applied the value of 3.5
RIT scale points to the main simulation based on MAP scores. We chose to use the higher end of
the range to ensure we were not “cherry picking” values. Additionally, while it would be highly
unlikely for every student to complete a test with a CSE of 5 which is outside the expected
average value range, we ran a “worst case” simulation using this maximized CSE value of 5 for
all data points. As the MAP uses a single, equidistant point scale across all grade ranges, it does
not have issues with grade-to-grade scale alignment.
Monte Carlo Simulation
For each simulation, we ran 1,000 iterations in Stata11. The 1,000 iterations simulated
the same set of 100 students taking the test 1,000 times. Each iteration was identified by an index
number and each individual was identified by a case number. We filtered the results by case
number to identify our three sample groups of 100, 50, and 25 data points within each iteration.
We then used the collapse command in Stata11 to combine the data by index number which gave
us an average growth score for the sample for each iteration.2
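The iteration-and-collapse logic can be reproduced with an ordinary loop in place of Stata's collapse command; in the sketch below (all names ours, and taking the first n case numbers as the nested subsamples is a simplification), each of the 1,000 iterations yields one mean gain per sample size:

```python
import numpy as np

def run_monte_carlo(p1, expected_growth, cse, n_iterations=1000,
                    sample_sizes=(100, 50, 25), seed=0):
    """Rerun the simulated test n_iterations times for the same controlled scores
    and 'collapse' each iteration to the mean gain of the nested samples."""
    rng = np.random.default_rng(seed)
    mean_gains = {n: [] for n in sample_sizes}
    for _ in range(n_iterations):
        noise_pre = rng.standard_normal(p1.shape) * cse
        noise_post = rng.standard_normal(p1.shape) * cse
        gain = (p1 + expected_growth + noise_post) - (p1 + noise_pre)
        for n in sample_sizes:
            # taking the first n cases stands in for filtering by case number
            mean_gains[n].append(gain[:n].mean())
    return {n: np.array(v) for n, v in mean_gains.items()}

rng0 = np.random.default_rng(2009)
p1 = 701.49 + rng0.standard_normal(100) * 100.24       # controlled TAKS-like pre-scores
results = run_monte_carlo(p1, expected_growth=24, cse=np.full(100, 38.96))
print({n: round(g.mean(), 2) for n, g in results.items()})
```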
Finally, for each simulation we generated two variables flagging iterations in which the average growth was more than one-half year above or more than one-half year below the controlled growth. The threshold values were based on evidence from the body of
research on teacher performance. Current research suggests that student gains of about .10 to .18
standard deviations are moderate gains (Kane, 2004; Schochet, 2008; Schochet and Chiang,
2010). The within grade standard deviation on the TAKS was 100 vertical scale points while one
year expected growth was 24 points. We feel then that a growth of .12 standard deviations or
one-half a year above expected growth, 12 points, would be considered a moderate effect size for
an educational intervention (likewise for gains one-half year below expected growth). To maintain consistency in the comparisons, we applied a one-half year threshold to the MAP simulations as well.

2 Recall that the controlled pre-scores on which all other values are based have the same z-score for each iteration in every simulation except the TAKS actual simulation.
Because all iterations had a controlled value of one year growth, these one-half year
below and one-half year above variables are identified as false negatives and false positives
respectively for iterations with average gain scores of one-half a year below or one-half a year
above expected growth. A false negative in the simulation would represent a teacher whose
students made the predefined one year growth, but due to measurement error, was identified as
having students with growth less than one-half the expected one year growth. A false positive in
the simulation would represent a teacher whose students made the predefined one year growth,
but due to measurement error, was identified as having students with growth more than one and
one-half years growth. For example, for the TAKS simulation in which 24 vertical scale points
was one year of growth, the false negative iteration had growth of less than 12 vertical scale
points and the false positive had growth of more than 36 vertical scale points. We determine the
impact of instrument measurement error on the value-added measure by comparing the number
of misidentified teachers for each model.
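Classifying each collapsed iteration against the half-year thresholds is then a simple comparison; the sketch below (names ours) counts false negatives and false positives for the 24-point TAKS growth model, whose cutoffs are 12 and 36 vertical scale points:

```python
def count_misidentifications(mean_gains, expected_growth):
    """Flag iterations whose mean gain falls more than half a year below the
    controlled growth (false negative) or more than half a year above it
    (false positive)."""
    half_year = expected_growth / 2.0
    false_negatives = sum(g < expected_growth - half_year for g in mean_gains)
    false_positives = sum(g > expected_growth + half_year for g in mean_gains)
    return false_negatives, false_positives

# TAKS "Met Standard" model: one year = 24 points, so the cutoffs are 12 and 36.
print(count_misidentifications([10.0, 24.0, 30.0, 37.5], expected_growth=24))  # (1, 1)
```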
Results
As expected, we found that the results of the Monte Carlo simulations were highly
dependent on the characteristics of the test instruments and the size of the sample. We ran
multiple versions of the simulations to determine how changing various psychometric
characteristics of the test instrument impacted outcomes. These variations included changing the
distribution of data point values (ceiling effects), expected growth, and size of the samples. For
distributions, we used a normal distribution for most of the simulations, but included a
distribution fitted to the actual distribution of student scores from the TAKS to determine what
impact ceiling effects would have on the results. We also ran a simulation which used the
expected growth at the “commended” level instead of “met expectations”. Finally we ran the
simulations using a variety of sample sizes (n=25, n=50, and n=100). These values were chosen
as they could represent not only multiple years of data for an elementary teacher, but also
expected annual teacher loads for middle and high school teachers.
The variable of interest in our results was the number of teachers misidentified by the
various simulations. We were able to look at different variants of the simulation to determine the
consistency of the simulation in predicting the known outcomes of the growth measures. To put
this in context, using an inferential statistics test with the typical α=.05, we would expect up to
5% of values to fall outside our critical values in the distribution before we declared the sample
significantly different from the reference population. Applying this standard, the simulations
were fairly stable and differed little from the controlled population when n=100. All the
variations of the TAKS had more than 5.0% misidentifications at n=50 and n=25. The only
version of the MAP simulation which generated more than 5.0% misidentifications was the n=25
MAP simulation which used maximized CSEs.
TAKS - Texas Assessment of Knowledge and Skills
The results for the simulations based on the TAKS were the most unstable. The TAKS
simulations did moderately well at correctly identifying teachers when the sample sizes were
n=100. Table 4 shows the simulation results for various permutations of the TAKS when sample
n=100.
The actual distributions of the student test scores on the TAKS 5th grade reading tests
were negatively skewed (see figure 2). We conducted a simulation with our starting values fitted
to the actual distribution of student scores on the TAKS to evaluate if the ceiling effects of the
distribution would interact with the heteroskedastic nature of the CSEs and thereby amplify the
test instrument errors. This simulation generated 17 false negatives and 25 false positives for a
total of 42 misidentifications. Changing the sample n for the actual distribution models had a
similar effect as the normal distribution model. The false negatives, false positives, and total
misidentifications were 74, 96, 170 and 161, 184, 345 for n=50 and n=25 respectively. The
differences in the percent of teachers misidentified between the TAKS actual model and the
TAKS normal model were very small for the n=50 and n=25 sample sizes, but were almost twice
as large for the n=100 sample sizes. This would indicate that the ceiling effects were primarily
impacting the efficiency of the models.
Another issue related to distributional effects and VAMs is the non-random assignment
of students to classrooms. The non-random assignment of students to classrooms introduces a
large amount of error as well as bias into VAM (Rothstein, 2009). The value-added estimates for
teachers who teach students taken predominantly from the tails of the achievement distribution,
such as those who work with advanced placement or remedial students, were far less reliable
than the randomly assigned models.
Figure 2: Frequency Distribution TAKS Reading, Grade 5
The increased error was a result of the heteroskedastic nature of the SEM (see figure 1).
To estimate the impact, we used the values from the TAKS normal distribution of students but
selected the rightmost portion of the distribution for our n=50 and n=25 samples. This
distribution generated 9 false negatives and 18 false positives for a total of 27 misidentifications
using the full n=100 sample. When we reduced the sample size to n=50, the sample consisting of
the right half of the distribution produced 99 false negatives and 87 false positives for a total of
186 misidentifications. Taking the rightmost quartile for the n=25 sample generated 222
false negatives and 212 false positives for a total of 434 misidentifications. This means that
teachers who taught their school's highest performing students would be misidentified by our
VAM 43.5% of the time purely due to noise on the TAKS. The results for the non-random
samples on the VAM based on the MAP were similar to the randomly selected samples. This
was the expected outcome as the SEM on the MAP is not correlated with the RIT scale score.
Table 4: Monte Carlo Simulation Results, n=100

Model                                             % False Negative   % False Positive   % Total Correct ID
TAKS Actual Distribution                          1.7                2.5                95.8
TAKS Normal Non-Normal Distribution               .9                 1.8                97.3
TAKS Normal Distribution at “Meets” Level         .9                 1.8                97.3
TAKS Normal Distribution Avg SE                   1.2                1.8                97.0
TAKS Normal Distribution at “Commended” Level     .8                 .2                 99.0
TAKS Normal Grade Transition                      1.4                2.1                96.5
MAP Normal                                        0.0                0.0                100.0
MAP Max CSE                                       0.0                0.0                100.0
The extreme level of error associated with non-random samples would make continued
discussion of the differences between traditional proficiency measures and computer adaptive
measures unproductive. To avoid creating a straw man argument, we used random, normally
distributed samples for the remaining TAKS simulations in the study to make them more
comparable to the MAP simulations.
When using the annual growth at “meets expectations” and a normally distributed
population, the TAKS normal generated 9 false negatives and 18 false positives for a total of 27
misidentifications. When we reduced the sample size to n=50, the consistency of the TAKS
normal model quickly diminished. The simulation generated 66 false negatives and 84 false
positives, misidentifying 150 teachers. At n=25, the simulation misidentified 348 teachers, over one-third of the total, with the misidentifications fairly evenly split between false negatives and false positives.
In order to analyze the impact of the heteroskedastic nature of the CSEs, we recomputed
the number of misidentifications using the simulation data from the TAKS normal simulation.
Instead of using score point level CSEs, we used the average standard error value of 38.96.
Using the average standard error, the TAKS normal model generated 12 false negatives, 18 false
positives for a total of 30 misidentifications at n=100. When we ran the model with sample n=50,
there were 57 false negatives and 74 false positives totaling 131 misidentifications. Finally, at
n=25, the counts were 145 false negatives, 160 false positives, and 305 total misidentifications.
Table 5: Monte Carlo Simulation Results, n=50

Model                                             % False Negative   % False Positive   % Total Correct ID
TAKS Actual Distribution                          7.4                9.6                83.0
TAKS Normal Non-Normal Distribution               9.9                8.7                81.4
TAKS Normal Distribution at “Meets” Level         6.6                8.4                85.0
TAKS Normal Distribution Avg SE                   5.7                7.4                86.9
TAKS Normal Distribution at “Commended” Level     4.4                1.7                93.9
TAKS Normal Grade Transition                      6.5                8.1                85.4
MAP Normal                                        0.0                0.0                100.0
MAP Max CSE                                       .7                 .6                 98.7
The simulations above used a fall-to-spring testing assumption which means we used the
same set of CSEs on both tests. Many districts, however, use a spring-to-spring testing schedule.
In order to examine the impact of the vertical alignment between grades, we altered the
simulation to use fifth grade CSEs in the pre-score computation and sixth grade CSEs in the
post-score computation. The CSE values from year to year had a pairwise correlation of .96. As
can be seen in tables 4 - 6, the results of the normal grade transition model were almost identical
to those of the normal model.
The only model of the TAKS simulations which demonstrated major differences from the
other models across all sample sizes was the normal distribution with the expected growth at the “commended” level. As expected, due to having a larger ratio between expected growth and CSE, this model had only about half as many misidentifications as the other TAKS models. By
using the higher value of 34 for expected growth, the commended model was much more reliable
at the n=100 level, having only 8 false negatives and 2 false positives for a total of 10
misidentifications. While this model was much more reliable, it still showed value-added
measures grow increasingly unreliable as the sample n decreased. The model generated 44 false
negatives and 17 false positives at the n=50 level and 102 false negatives and 77 false positives
at the n=25 level. It is worth noting that this model consistently generated more false negatives than false positives, whereas the other TAKS models generated more false positives than false negatives in 8 out of 9 instances.
Table 6: Monte Carlo Simulation Results, n=25

Model                                             % False Negative   % False Positive   % Total Correct ID
TAKS Actual Distribution                          16.1               18.4               65.5
TAKS Normal Non-Normal Distribution               22.2               21.2               56.6
TAKS Normal Distribution at “Meets”               16.8               18.0               65.2
TAKS Normal Distribution Avg SE                   14.5               16.0               69.5
TAKS Normal Distribution at “Commended”           10.2               7.7                82.1
TAKS Normal Grade Transition                      18.6               18.2               63.2
MAP Normal                                        .5                 .5                 99.0
MAP Max CSE                                       3.0                4.2                92.8
MAP - Measures of Academic Progress
Both of the MAP models produced estimates closely in line with the controlled average MAP growth of 5.06 RITs. The MAP normally distributed model produced no
misidentifications at either the n=100 or n=50 simulations. Even when the sample size was
reduced to n=25, the simulation produced only 5 false negatives and 5 false positives for a total
of 10 misidentifications.
In the grade range studied, the CSEs on the MAP are highly consistent. They average
below 3.5 RIT points across the entire range of scores. In order to evaluate the effect of increased
CSEs, we ran a simulation with an extremely high MAP CSE=5 for all data points. This
simulation still generated no misidentifications at n=100, only 7 false negatives and 6 false
positives for a total of 13 misidentifications out of 1,000 data points at n=50, and lastly 30 false
negatives and 42 false positives for a total of 72 misidentifications at n=25. The poorer performance of the high-CSE MAP simulation was consistent with the changes we found
in the TAKS when we altered the relationship between the size of the expected growth and the
test instrument CSEs.
Sample Size
The impact of sample size can be easily seen by examining Table 7 below. With the
exception of the TAKS normal “commended” level simulation, which had a different growth, all
the TAKS models had an average growth near the controlled value of 24 vertical scale points;
however, the standard deviations grow rapidly as the sample n is halved. The same increase is
seen in the statistics from the MAP simulations as well. These differences are expected
according to central tendency theory; however, it is still informative to discuss the degree to
which these changes impact the reliability of VAMs. Table 7 shows the differences by sample n
for the TAKS normal model and the MAP normal model. The MAP simulations not only
produced fewer false negatives than the TAKS, but the reliability of the MAP was sustained even with a reduction in the sample size. This is due to the larger ratio between the expected growth and the standard errors on the MAP than on the TAKS. Table 8 shows that VAMs based on the MAP are more efficient than VAMs based on the TAKS. This permits the
VAM based on the MAP to produce more reliable estimates of teacher performance while using
smaller student samples.
Table 7: Descriptive Statistics Based on Sample N

                                            Controlled   n=100 Avg.         n=50 Avg.          n=25 Avg.
Model                                       Growth       Sim. Growth  SD    Sim. Growth  SD    Sim. Growth  SD
TAKS Normal Distribution at “Meets”         24           24.08        5.45  24.45        8.37  24.14        12.39
TAKS Normal Distribution Avg SE             24           24.19        5.45  24.61        8.03  24.59        11.47
TAKS Normal Distribution at “Commended”     34           33.85        5.60  34.15        8.12  34.92        11.87
TAKS Actual Distribution                    24           24.29        6.02  24.26        8.78  24.18        12.28
TAKS Normal Grade Transition                24           24.08        5.59  24.24        8.59  24.15        12.85
MAP Normal                                  5.06         5.07         .49   5.12         .72   5.12         1.03
MAP Max CSE                                 5.06         5.05         .71   5.05         .99   5.08         1.37
Table 8: TAKS Normal vs. MAP Normal Based on Sample N

Test                                        % Misidentified   % Misidentified   % Misidentified
                                            at n=100          at n=50           at n=25
TAKS Normal Distribution at “Meets”         2.7               15.0              34.8
MAP Normal                                  0.0               0.0               7.2
Conclusions
The critical characteristics of a test instrument for reliably computing teacher value-added models are the relationship between the expected growth and the CSEs of the test, and the size of the sample being tested. As the size of the average CSE approaches, or on some tests even exceeds, the size
of the average expected growth, the ability of the instrument to reliably measure student growth
is greatly compromised. If value-added measures are going to be useful additions to teacher
evaluation programs, they must be based on tests with a level of reliability sufficient for the task
with sample sizes similar to those found in typical classrooms.
Pencil and paper student proficiency tests are particularly ineffective at measuring
student growth. There are a number of reasons this is true. The primary one is the reliability of
the tests. Proficiency tests are generally meant to discern between two fairly broad categories,
proficient or not proficient. They can also be used with a reasonable expectation of reliability to
identify those students who are in the upper and lower ends of the distribution as well, but
generally these tests are not very efficient at measuring fine increases in achievement. The
minimum CSE found on the 5th grade TAKS was 24 points. This smallest CSE was equal to the
one year expected growth. Essentially, for any student there was at best a 67% likelihood that
their score on the TAKS was within one year’s worth of growth of the score which represents
their true ability. For the 13.5% of students performing at the top end of the scale, the level of
error is 74 points or three years of growth. This reliability issue means proficiency tests such as
the TAKS are not very good measures on which to base value-added measures.
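To make the arithmetic behind this figure explicit (a worked version of the claim under the standard assumption that the observed score is normally distributed around the true score with standard deviation equal to the CSE):

P(|observed score - true score| ≤ CSE) = Φ(1) - Φ(-1) ≈ 0.68

Because 24 points is the smallest CSE on the test, roughly two chances in three is the best case; for students whose CSEs are larger, the probability is lower still.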
The source of this error in traditional proficiency tests is their tendency to be focused
around the proficiency point. Test makers focus proficiency tests in this manner to permit them
to get similarly reliable results with a shorter test. In order to get higher levels of reliability
across the entire spectrum of ability, the tests would require many additional items. Added items
mean added production costs and added instructional time spent on testing; however, the
efficiency gained by focusing the test around a specific point comes at the cost of reliability in
the tails of the student distribution. This creates heteroskedasticity in the CSE structure which, as our results show, amplifies the overall error of the value-added measures.
Well designed computer adaptive tests can greatly reduce both the test instrument
reliability problem and heteroskedasticity problem. The mechanism of adaptive tests allows each
student to receive a test focused on their level of ability rather than a general proficiency point.
This tailored approach places each student in the center of their own personal distribution of
CSEs. With tests based on Rasch models, the lowest CSEs are located at the point on which the
test is focused. By doing this, computer adaptive tests are able to greatly increase the precision of
the student ability measurement, which in turn also improves the reliability of the VAMs which
use measures of student achievement as the major independent variables.
Another benefit of computer adaptive tests is their ability to apply a single scale across
multiple grades. The creators of the TAKS did an admirable job of vertically aligning
performance between the fifth and sixth grades. Our simulations showed very little difference between the TAKS normal model, which simulated fall-to-spring testing, and the TAKS grade transition
model which simulated spring-to-spring growth. Since most districts currently test only once per
year, the alignment between grades is a major concern as poor alignment has the potential to
inject large amounts of noise into the measures of student achievement which will be passed
along to the VAMs. Having a single scale eliminates the need to align different scales; however,
it is also critical that all items in the item pool of a computer adaptive test be properly rated for
difficulty to ensure item selection at different levels of performance does not introduce a similar
type of noise to the computer adaptive test.
Policy Implications
Why Are More States Not Using Computer Adaptive Testing?
There are a number of reasons more states are not using CAT even though CAT has been
demonstrated to be a much more reliable measure of student achievement. Pelton and Pelton
(2006, p. 143) state, “…many schools and districts are waiting for positive examples of practical
applications and observable benefits before adopting a CAT.” The major reason seems to be lack
of awareness. Whether the individual supports testing and VAMs or opposes them, there seems
to be little understanding on the part of many state level educators and legislators of the
mechanics of test reliability and measurement error. Overcoming these hesitations and
shortcomings in understanding will require not just more studies on the subject, but also better
dissemination of the information. While it can be expected that companies providing CAT will
be actively involved in this type of outreach, departments of statistics and research units located
at institutions of higher learning can also play an important role in increasing understanding and
acceptance of CAT among state-level education policy makers and school district leaders.
Another problem with increasing the use of CAT is the perceived cost. There are
perceived financial costs as well as perceived costs in instructional time for increasing testing.
The perceived cost in instructional time is a misperception. Due to the relative efficiency of CAT
compared to traditional paper and pencil tests, CAT can greatly reduce the amount of
instructional time spent on testing while increasing the amount of available data on student
achievement. Because of this shorter testing time, CAT can be used as formative assessments as
well as summative assessments without increasing the overall time spent on testing during a
school year. Additionally since results from CAT can be made available much faster than
traditional paper and pencil tests, educators can use the results to modify instructional plans to
better meet student needs.
Depending on the design, CAT could be more or less costly than traditional paper and
pencil tests. Based on documentation from the TEA (2010c), the average cost of administering
the TAKS for the 2010-2011 school year was $14.88 per pupil. This is comparable to the current
per pupil prices for administering CAT. Further, it is reasonable to expect the cost of CAT to
drop as more testing companies enter the CAT market.
Administering CAT does require access to adequate computer equipment and network
infrastructure. The hardware demands of CAT are generally modest. For example, both the
server and client side of NWEA’s MAP test require only a Pentium II (226 MHz) processor with
256 MB of memory. These specifications are very minimal when compared to most modern
computers. Unlike traditional paper and pencil tests, CAT does not require all students to test at the exact same moment. This allows schools to process entire student bodies without a 1:1 computer-to-student ratio. While the hardware requirements for CAT are minimal, some schools will require system upgrades to be able to complete school-wide computer adaptive testing. States wishing to move to a CAT environment will need to
guarantee funding to ensure all schools are able to meet the minimal hardware requirements.
This issue will certainly be a concern as both consortia developing the Common Core
assessments have stated that those assessments will be CAT instruments.
What level of teacher misidentification is acceptable?
Another major issue which must be determined by each state or district is the acceptable
level of teacher misidentification. All measures of teacher performance will contain a level of
statistical error. The level of error that is acceptable is something that will have to be determined
by each state and school district. In locations with strong teachers’ unions, these levels will likely
have to be negotiated as part of the collective bargaining agreement. Regardless of the agreed
upon acceptable level, CAT are more likely than traditional paper and pencil tests to meet or
exceed the requirements for reliability. This is especially true when using student sample sizes
associated with the traditional classroom.
We think value-added models can provide useful information about a teacher’s
performance, but not when the measurement error of the instrument measuring their students’ performance is plus or minus a full year’s growth. We would recommend that districts or states wishing to
pursue VAMs as a portion of teacher evaluations carefully investigate the characteristics of the
student measurement instruments. Instruments with a low expected-growth to standard-error
ratio should be rejected as they are an unsuitable foundation for VAMs. While concerns about
the impacts of the reliability of the student testing instrument are somewhat mitigated by
increasing the sample size, districts and states should always opt for the most reliable
measurement instrument available when computing VAMs.
References
Anderson, R. (1972). How to construct achievement tests to assess comprehension. Review of
Educational Research, 42(2), 145-170.
Baker, E., Barton, P., Darling-Hammond, L., Haertel, E., Ladd, H., Linn, R., Ravitch, D.,
Rothstein, R., Shavelson, R., and Shepard, L. (2010). Problems with the Use of Student Test
Scores to Evaluate Teachers. EPI Briefing Paper #278. Washington, DC: Economic Policy
Institute.
Ballou, D. (2009). Test scaling and value-added measurement. Education Finance and Policy,
4(4), 351-383.
Braun, H. (2005). Using Student Progress to Evaluate Teachers: A Primer on Value-Added
Models. Princeton, NJ: Educational Testing Service. Retrieved on March 12, 2011 from:
http://www.ets.org/Media/Research/pdf/PICVAM.pdf
Briggs, D. and Weeks, J. (2009). The sensitivity of value-added modeling to the creation of a
vertical score scale. Education Finance and Policy, 4(4), 384-414.
Glazerman, S., Loeb, S., Goldhaber, D., Staiger, D., Raudenbush, S., and Whitehurst, G. (2010).
Evaluating Teachers: The Important Role of Value-Added. Washington, DC: Brown Center on
Education Policy at Brookings.
Hill, H. (2009). Evaluating value-added models: A validity argument approach. Journal of Policy
Analysis and Management, 28, 700-709.
Kane, T. (2004). The Impact of After-School Programs: Interpreting the Results of Four Recent
Evaluations. Working Paper. University of California, Los Angeles.
Kirk, R. (1995). Experimental Design: Procedures for the Behavioral Sciences (3ed). Pacific
Grove, CA: Brooks/Cole Publishing Company.
May, H., Perez-Johnson, I., Haimson, J., Sattar, S., and Gleason, P. (2009). Using State Tests in
Education Experiments: A Discussion of the Issues. (NCEE 2009-013). Washington, DC:
National Center for Education Evaluation and Regional Assistance, Institute of Education
Sciences, U.S. Department of Education.
Northwest Evaluation Association. (2011). Measures of Academic Progress® and Measures
of Academic Progress for Primary Grades®: Technical Manual. Portland, OR: Author.
Northwest Evaluation Association. (2008). RIT Scale Norms: For Use with Measures of
Academic Progress. Portland, OR: Author.
Pelton, T. and Pelton, L. (2006). Introducing a computer-adaptive testing system to a small school district.
In S. L. Howell and M. Hricko (Eds.), Online assessment and measurement: case studies from higher
education, K-12, and corporate (pp. 143-156). Hershey, PA: Information Science Publishing.
Rothstein, J. (2009). Student sorting and bias in value added estimation: selection on observables
and unobservables. NBER Working Paper 14666. Cambridge, MA: National Bureau of
Economic Research. Retrieved February 7, 2012 from: http://www.nber.org/papers/w14666
Schochet, P. (2008). Statistical power for random assignment evaluations of education programs.
Journal of Educational and Behavioral Statistics, 33(1), 62-87.
Schochet, P. and Chiang, H. (2010). Error Rates in Measuring Teacher and School Performance
Based on Student Test Score Gains (NCEE 2010-4004). Washington, DC: National Center for Education Evaluation and Regional Assistance, Institute of Education Sciences, U.S.
Department of Education.
Texas Education Agency. (2010a). Technical Digest for the Academic Year 2008-2009: TAKS
2009 Scale Distributions and Statistics by Subject and Grade. Austin, TX: Author. Retrieved
February 15, 2011 from: http://www.tea.state.tx.us/student.assessment/techdigest/yr0809/
Texas Education Agency. (2010b). Technical digest for the academic year 2008-2009: TAKS
2009 conditional standard error of measurement (CSEM). Austin, TX: Author. Retrieved
February 15, 2011 from: http://www.tea.state.tx.us/student.assessment/techdigest/yr0809/
Texas Education Agency. (2010c). Approval of Costs of Administering the 2010-2011 Texas
Assessment of Knowledge and Skills (TAKS) to Private School Students. Retrieved on April 12,
2011 from: http://www.tea.state.tx.us/index4.aspx?id=2147489185
Thompson, T. (2008). Growth, Precision, and CAT: An Examination of Gain Score Conditional
SEM. Paper presented at the 2008 annual meeting of the National Council on Measurement in
Education, New York, NY.
Weiss, D. (1982). Improving measurement quality and efficiency with adaptive testing. Applied
Psychological Measurement, 6(4), 479-492.