The Case for Performance-Based Tasks without Equating
Paper Presented at the National Council on Measurement in Education, Vancouver, British Columbia, Canada
Walter D. Way
Daniel Murphy
Sonya Powers
Leslie Keng
April 2012
Abstract
Significant momentum exists for next-generation assessments to increasingly utilize technology
to develop and deliver performance-based assessments. Many traditional challenges with this
assessment approach still apply, including psychometric concerns related to performance-based
tasks (PBTs), which include low reliability, efficiency of measurement, and the comparability of
different tasks. This paper proposes a model for performance-based assessments that assumes
random selection of PBTs from a large pool and treats tasks as comparable without equating.
The model assumes that if a large number of PBTs can be randomly assigned,
then task-to-task variation across individuals will average out at the group (i.e., classroom and
school) level. The model was evaluated empirically using simulations involving a re-analysis of
data from a statewide assessment. A set of G-theory analyses was conducted to assess the
reliability of average school performance on PBTs and evaluate how variance due to the
randomly assigned tasks compared to other sources of variation. Analyses based on the linear
student growth percentile (SGP) model were used to assess the degree to which the model's
assumption of randomly equivalent tasks held by comparing school classifications based on PBT
growth estimates with classifications based on three alternative school-level measures. The study
findings support the viability of the proposed model for next-generation performance-based
assessments when scores are used to make group-level inferences.
Keywords: performance-based tasks, G theory, student growth percentiles, Common Core
The Case for Performance-Based Tasks without Equating
Significant momentum exists in the United States for next-generation assessments that go
beyond traditional multiple-choice and constructed-response item types. The focus on college
and career readiness and the increased emphasis on new skills and abilities that are needed to
succeed in the 21st century have propelled a renewed interest in performance-based assessment
(PBA; Darling-Hammond & Pecheone, 2010). This movement is further supported by the
increasingly broad ways that technology is being used to present instruction and assess learning.
It is therefore not surprising that both consortia that have been funded by the federal government
to develop assessments measuring the Common Core standards – the Partnership for Assessment
of Readiness for College and Careers (PARCC) and the SMARTER Balanced Assessment
Consortium (SBAC) – plan to include performance tasks as part of their summative tests.
Although next-generation PBA has potential, many of the traditional challenges with this
assessment approach still apply. When PBA is combined with objectively-scored assessments to
produce summative scores, which both the PARCC and SBAC plan to do, complex issues related
to aggregating scores arise (Wise, 2011). These issues are exacerbated by psychometric
concerns related to performance-based tasks, which include low reliability, efficiency of
measurement (i.e., the amount of testing time needed to achieve a desired level of reliability),
and the comparability of different tasks. Comparability from task to task has long been a concern
with performance-based assessments. Green (1995) was frank in concluding that a performance-based assessment is not well suited to “maintaining the aspects of a testing procedure that are
congenial to the equating process.” He summarized the concern as follows:
In the language of factor analysis, each test item or task usually has a small
amount of common variance and has substantial specific variance. A test combines the
results of many items or tasks, building up the common variance and washing out the
specific variance, which is usually treated as a source of error. The fewer items, or tasks,
there are, the less advantage can be taken of the immense power of aggregation to
overwhelm such error. (Green, 1995, p. 14)
The traditional notion of performance-based tasks, developed in the late 1980s and early
1990s, assumed that for a given assessment program, students would typically take the same
tasks at the same time. Thus, when an assessment was repeated with a new task, either for the
same students at a later time or for a new cohort of students, comparisons of student performance
were extremely difficult because of task-to-task variation.
The next generation of PBA has the opportunity to approach task variation differently
because of advances in both technology and psychometric approaches. For example, evidence-centered design (ECD; Huff, Steinberg, & Matts, 2010) approaches can reduce task-to-task
variation and provide templates to aid task development. Technology can further assist
development efforts both by automatically generating variants of tasks and by making it possible
to randomly select specific tasks from an available pool of tasks. However, to take full advantage
of these features, changes in traditional models for PBA will be needed. In this paper, we
propose and evaluate a model for performance-based assessments that assumes random selection
of performance-based tasks (PBTs) from a large pool, and that assumes tasks are comparable
without equating for the purposes of aggregating scores at the group (e.g., classroom or school)
level. These two assumptions capitalize on an underlying expectation that task-to-task variation
across individuals will average out at the group level.
A Model for Performance-Based Assessments
It seems reasonable to assume that next-generation development of performance-based
tasks will be strongly influenced by ECD. For example, Mislevy and Haertel speak of “the
exploitation of efficiencies from reuse and compatibility” (Mislevy & Haertel, 2006, p. 22) that
is afforded by ECD. Luecht and his colleagues (cf. Luecht, Burke, & Shu, 2010) have coined
the term “Assessment Engineering” (AE), which involves construct maps, evidence models, task
models and templates as a means of generating extremely large numbers of complex
performance exercises. The Literacy Design Collaborative (LDC) is a Gates Foundation project
whose purpose is to develop literacy template tasks that can be filled with curriculum content
from varied subjects. These templates can be used for teaching in the classroom but also extend
to performance task design. The potential to develop large numbers of tasks from specified task
templates is further aided by technology, which provides tools that test developers can use to
support assessment task authoring (Liu & Haertel, 2011).
If performance-based tasks are to measure deep thinking and 21st century skills, they will
still require significant assessment time. Thus, although ECD approaches can reduce task-to-task
variation, this variation will still exist at the individual student level because each student can
only take a limited number of PBTs given practical time limitations. However, if a large
number of PBTs can be randomly assigned, then task-to-task variation across individuals will
average out when scores are aggregated across students. This is an important test design
consideration for the Common Core assessments, which must produce student achievement data
and student growth data for determinations of school, principal, and teacher effectiveness. With
large pools of PBTs, random assignment can represent a kind of domain sampling, and scores
across classes and schools can represent estimates of average domain scores. From this
viewpoint, equating is not necessary because the aggregated scores are comparable within
calculable estimates of standard errors.
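To make the averaging argument concrete, the following minimal Monte Carlo sketch (not part of the original analyses; all quantities are hypothetical, assumed values) illustrates how task effects that are sizable for an individual student shrink in a school mean based on randomly assigned tasks, with an error that is calculable from the task and person variance components.

import numpy as np

rng = np.random.default_rng(2012)

# Hypothetical quantities (assumed for illustration only).
n_tasks_in_pool = 500     # size of the PBT pool
task_effect_sd = 1.0      # task-to-task variation in raw-score units
person_sd = 3.0           # person variation within a school
school_true_mean = 8.0    # the school's average domain score

task_effects = rng.normal(0.0, task_effect_sd, n_tasks_in_pool)

def observed_school_mean(n_students):
    """Mean observed score for one school when each student is randomly assigned one task."""
    assigned = rng.integers(0, n_tasks_in_pool, n_students)
    scores = (school_true_mean
              + task_effects[assigned]
              + rng.normal(0.0, person_sd, n_students))
    return scores.mean()

# The spread of school means around the domain mean shrinks roughly as
# sqrt((task_var + person_var) / n_students), so aggregated scores are
# comparable within a calculable standard error even though individual
# students took different, unequated tasks.
for n_students in (10, 25, 50, 100):
    means = [observed_school_mean(n_students) for _ in range(2000)]
    expected_se = np.sqrt((task_effect_sd**2 + person_sd**2) / n_students)
    print(f"n={n_students:3d}  simulated SE={np.std(means):.3f}  approx SE={expected_se:.3f}")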
An additional benefit of having a large pool of PBTs is significantly lessened security
concerns. With a large enough task pool, it would be possible to disclose tasks along with the
actual student work and the resulting scores. Furthermore, the PBT raw scores could be
interpreted directly in terms of the applied scoring rubrics. This would greatly facilitate another
requirement of the Common Core assessments, which is to produce data that informs teaching,
learning and program improvement.
Research Questions
The purpose of this paper is to describe and illustrate a model for performance-based
assessments that assumes randomly selected assessment tasks from a large pool of PBTs are
comparable such that equating is not necessary. The model was evaluated empirically using
simulations involving a re-analysis of data from a statewide assessment. The analyses sought to
answer two main research questions. The first research question focused on the reliability of test
scores under the proposed PBT model. Specifically, how does the impact of task-to-task
variation on the reliability of PBT scores at an aggregate (e.g., school) level compare to other
sources of variation? Also, states often make inferences about schools by placing them in
performance categories based on their students’ achievement growth using standardized
measures. Our second research question therefore asked, to what degree does the assumption of
randomly-equivalent tasks hold when classifying schools based on their students’ growth on test
scores that incorporate PBTs?
Method
Data Source
An empirical simulation was conducted to explore the impact of unequated performance-based
tasks used for summative assessment purposes. The empirical simulation used real
response data from statewide mathematics and science tests administered in grade 10 in 2009
and grade 11 in 2010. Students were matched across years, so that the data set included
data for each of the four tests.
Because the assessments consisted of only multiple-choice items, the performance-based
task scores were “simulated” by combining randomly selected subsets of items from the math
and science tests. For each test, 50 random samples of 12 items were selected with replacement
from the complete tests. These sets represented the simulated PBTs. The simulated PBTs were
used in different ways for the two sets of analyses that followed. For the generalizability analyses
(to be described below) student scores on all 50 sets were included. For the growth analyses, it
was assumed that students took one of the sets of 12 items and the remaining test items were
considered the simulated summative test. Table 1 presents the number of items and coefficient
alpha reliabilities of the full tests that were used for the simulations, as well as the Spearman-Brown projected reliabilities of the shortened tests and the simulated performance tasks.
Table 1. Test Length and Reliability of the Full and Shortened Tests

                       Full Test        Shortened Test     Simulated Task
Test                  # items   ρxx'    n items   ρ*xx'    n items   ρ*xx'
Science Grade 10         55     0.91       43     0.89        12     0.69
Math Grade 10            56     0.93       44     0.91        12     0.74
Science Grade 11         55     0.88       43     0.85        12     0.61
Math Grade 11            60     0.90       48     0.88        12     0.65

Note. ρxx′ refers to coefficient alpha reliability. ρ*xx′ refers to Spearman-Brown adjusted reliability.
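The projected reliabilities in Table 1 follow from the Spearman-Brown formula applied with the ratio of the new test length to the full test length. A brief sketch (values taken from Table 1; the function name is ours):

def spearman_brown(rho_full, n_items_new, n_items_full):
    """Project reliability when test length changes by the factor k = new / full."""
    k = n_items_new / n_items_full
    return (k * rho_full) / (1.0 + (k - 1.0) * rho_full)

# Grade 10 science (Table 1): full test of 55 items with coefficient alpha 0.91.
print(round(spearman_brown(0.91, 43, 55), 2))  # shortened test  -> 0.89
print(round(spearman_brown(0.91, 12, 55), 2))  # simulated task  -> 0.69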
Data Generation
For each test at each grade level, the data were generated for the study as follows (a code sketch of these steps appears after the list):
1. Fifty sets of 12 items were selected at random and with replacement from the full test.
Each of these 50 sets was assumed to represent a simulated PBT.
2. The 0/1 responses for each of the 50 simulated PBTs were summed and saved.
3. For each student, the entire set of 50 simulated PBTs were utilized for the generalizability
analyses.
4. For each student, one of the 50 simulated PBTs was randomly selected for use in the
growth analyses.
5. For the growth analyses, the 12 items contributing to the assigned simulated PBT were
coded as “not presented” in the student’s response data matrix for each summative test.
6. For each summative test, Rasch item parameters obtained operationally were used to
recalibrate student abilities using the data matrices with the 12-item sets excluded. For
each test, the resulting ability estimates were rescaled to have a mean of 500. Note that
the operational tests were not vertically scaled between grades 10 and 11.
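The sketch below illustrates steps 1 through 5 for one test in code; it is an interpretation of the procedure, not the authors' actual program. Function and variable names are ours, items are sampled without replacement within a set (so no PBT repeats an item) while the 50 sets are drawn independently, and the Rasch rescoring in step 6 is assumed to be done separately.

import numpy as np

rng = np.random.default_rng(0)

def generate_pbt_data(responses, n_pbts=50, pbt_len=12):
    """Sketch of steps 1-5: build simulated PBTs from a 0/1 response matrix.

    responses: (n_students, n_items) scored 0/1 matrix for one full test.
    Returns PBT raw scores on all simulated PBTs, each student's randomly
    assigned PBT, and the response matrix with that PBT's items masked
    ("not presented") for re-estimating the summative score.
    """
    n_students, n_items = responses.shape

    # Step 1: draw 50 sets of 12 items; items do not repeat within a set,
    # and the sets are drawn independently (our reading of "with replacement").
    pbt_items = [rng.choice(n_items, size=pbt_len, replace=False) for _ in range(n_pbts)]

    # Step 2: sum the 0/1 responses on each simulated PBT (used in the G study, step 3).
    pbt_scores = np.column_stack([responses[:, items].sum(axis=1) for items in pbt_items])

    # Step 4: randomly assign one simulated PBT per student for the growth analyses.
    assigned = rng.integers(0, n_pbts, size=n_students)

    # Step 5: code the assigned PBT's items as not presented (NaN) in the
    # summative response matrix before Rasch rescoring (step 6, done elsewhere).
    masked = responses.astype(float)
    for s, p in enumerate(assigned):
        masked[s, pbt_items[p]] = np.nan

    return pbt_scores, assigned, masked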
Thus, for the growth analyses, each student received four scores for each subject area
(math and science): a simulated PBT raw score and an equated summative test scale score
for each of the two years of administration (2009 and 2010). The use of data across years allowed student
growth to be calculated from 2009 to 2010. Table 2 provides descriptive statistics for the
simulated PBT raw scores and equated summative test scale scores.
Table 2. Descriptive Statistics for Simulated PBTs and Summative Tests (Student Level)

                              PBT (Raw Score)                 Summative Test (Scale Score)
Subject     N        Grade  Mean (SD)    Minimum  Maximum   Mean (SD)   Minimum  Maximum
Math      246,017      10   8.32 (2.69)     0       12      500 (100)      12      806
                       11   8.87 (2.33)     0       12      500 (100)     111      841
Science   245,438      10   8.60 (2.46)     0       12      500 (100)      31      849
                       11   8.93 (2.18)     0       12      500 (100)     -43      877
In addition, the availability of each student’s campus identifier allowed us to aggregate
the simulated test results at the school level. Table 3 provides descriptive information for the
simulated PBT raw scores and equated summative test scale scores aggregated at the school
level.
Table 3. Descriptive Statistics for Simulated PBTs and Summative Tests (School Level)

                             PBT Raw Score                    Summative Test Scale Score
Subject     N      Grade  Mean (SD)    Minimum  Maximum   Mean (SD)        Minimum  Maximum
Math      1,086      10   8.15 (0.94)    4.73    11.33    492.61 (37.94)   381.81   673.38
                     11   8.77 (0.77)    5.53    11.32    494.00 (36.69)   360.65   674.83
Science   1,086      10   8.48 (0.87)    5.78    10.81    493.28 (38.64)   382.24   637.40
                     11   8.84 (0.71)    5.97    10.88    494.30 (36.29)   364.81   629.86
Generalizability Analyses of PBT Scores
A number of analyses were conducted on the simulated data to address the research
questions. Generalizability theory (G-theory; Cronbach, Gleser, Nanda & Rajaratnam, 1972;
Feldt & Brennan, 1989) analyses were applied to assess the reliability of average school
performance on the PBTs. (See also Kane & Brennan, 1977; Kane, Gillmore, & Crooks, 1976 for
a discussion of G-theory in the context of estimating the reliability of class means.)
A generalizability (G) study was designed which included students (persons [p]), schools
(s), PBTs (t), and grade level (occasions [o]) as measurement facets. The structure of the study
data was such that students were nested within schools, and tasks were nested within occasions.
Because student scores were matched across two years (i.e., grade 10 to grade 11), all students
included in the analyses took all items in both occasions. The G study design can be abbreviated
as: (p:s) x (t:o). In order to have a balanced design, and to be consistent with the criteria used in
the growth model analysis (see next subsection), schools with fewer than 30 students were
eliminated. In schools with more than 30 students, 30 students were randomly sampled for
inclusion in the study. The variance component for persons was therefore estimated based on 30
replications. There were 1,008 schools with 30 or more students for math and 1,016 schools for
science that were used to estimate the variability due to schools. In addition, for this portion of
the study, scores on each of the 50 PBTs used in the simulation were calculated for each student.
As a result, for both subjects, 50 simulated PBTs were used to estimate task variability. Finally,
the science and math assessments were given at grade 10 and grade 11, resulting in 2 replications
for the occasions facet.
Variance components were estimated using the mGENOVA software (Brennan, 2001b).
The multivariate counterpart to the univariate (p:s) x (t:o) design was used, which, following the
notation of Brennan (2001a), is represented as (p•:s•) x tº, where solid circles (•) indicate facets
crossed with the occasions facet and open circles (º) indicate facets nested within the occasions
facet.
Several decision (D) studies were used to evaluate the reliability that can be expected
under a variety of measurement replications. The replications that were considered were those
that might be possible operationally. Although 50 simulated tasks were used to estimate
variance components, PBTs are time consuming to administer and score. Therefore, D studies
were conducted using one to four PBTs per person. Also, it was expected that as the number of
students increased within a school, the average school performance would be more reliable
because of decreased sampling variability. Thirty students per school were used to estimate the
variability due to persons, but for D studies, sample sizes of 10, 25, 50, 75, and 100 were also
considered. Ten students might represent a particularly small class size, 25 represented an
average class size for a school with a single class per grade level, and 50, 75, and 100
represented schools with two to four classrooms per grade level. Additionally, reliability
estimates were calculated based on the Grade 10 test (occasion 1) only, the Grade 11 test
(occasion 2) only, and based on both occasions. Because schools are often rank ordered for
comparisons, the generalizability coefficient was used as the estimate of reliability for a single
occasion, and the composite generalizability coefficient was used as the estimate of reliability
across the two occasions. The index of dependability (also known as the phi coefficient) was
also calculated for each condition, as this index would be more appropriate for situations where
schools are held to an absolute criterion, such as adequate yearly progress (AYP).
Growth Analyses
To provide data for the growth analyses, composite scores that combined the PBT raw
score and summative scale score for each student were created. Two composite scores that
standardized the PBT and summative test measures were considered: 1) a composite score
that was the (unweighted) sum of the two standardized scores, and 2) a weighted composite score
in which the standardized summative test score was weighted to count three times as much as the
standardized PBT score.
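A brief sketch of the two composites follows, assuming each measure is standardized over the full student sample and that the weighted composite is a weighted sum; the exact standardization and combination rule beyond the 3-to-1 weighting are our assumptions.

import numpy as np

def composite_scores(pbt_raw, summative_scale, weight=3.0):
    """Standardize the PBT raw score and the summative scale score, then combine.

    Returns (unweighted_composite, weighted_composite); the weighted composite
    counts the standardized summative score `weight` times as much as the
    standardized PBT score (weight = 3 in this study).
    """
    z_pbt = (pbt_raw - pbt_raw.mean()) / pbt_raw.std()
    z_sum = (summative_scale - summative_scale.mean()) / summative_scale.std()
    return z_pbt + z_sum, z_pbt + weight * z_sum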
A linear student growth percentile (SGP) model was used to estimate student and school
growth. The SGP model uses quantile regression (Koenker, 2005) to estimate a conditional linear
quantile function,
Q(τ | X = x) = x′β(τ)                                                             (1)
where Q(τ | X = x) is the τth conditional quantile of the random variable Y given X = x, and β(τ) is the set of regression
coefficients. The quantile regression procedure minimizes an asymmetric loss function for each
τ in a specified set T ⊂ (0, 1); in particular, for this analysis, T = {.01, .02, .03, …, .99}. In this
analysis, each student received an estimated growth percentile, which was the τ that minimized
the distance between the student’s observed grade 11 score and a predicted grade 11 score based
on the model. To measure school growth, the students’ growth percentiles were aggregated
within each school and the median growth percentile was calculated. Schools with fewer than 30
students were removed from the analyses.
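The sketch below shows how linear SGPs of this kind can be estimated with off-the-shelf quantile regression; it uses statsmodels' QuantReg as a stand-in for the software actually used, and the function names and the school-level aggregation helper are ours.

import numpy as np
import pandas as pd
import statsmodels.api as sm

def student_growth_percentiles(prior_score, current_score):
    """For each student, find the tau in {.01, ..., .99} whose fitted conditional
    quantile Q(tau | prior score) is closest to the observed current-year score."""
    taus = np.linspace(0.01, 0.99, 99)
    X = sm.add_constant(np.asarray(prior_score))            # x'beta(tau) with an intercept
    preds = np.column_stack([
        sm.QuantReg(current_score, X).fit(q=tau).predict(X) for tau in taus
    ])
    best = np.abs(preds - np.asarray(current_score)[:, None]).argmin(axis=1)
    return np.rint(taus[best] * 100).astype(int)             # growth percentiles 1-99

def school_median_sgp(sgp, school_id, min_n=30):
    """Median growth percentile per school, dropping schools with fewer than min_n students."""
    df = pd.DataFrame({"sgp": sgp, "school": school_id})
    counts = df.groupby("school")["sgp"].transform("size")
    return df[counts >= min_n].groupby("school")["sgp"].median()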
The next step divided the schools into 5 equally sized groups and assigned them a grade
of A, B, C, D, or F based on the median growth percentiles for each of the measures. We
compared the school grade classifications based on students’ PBT growth estimates with the
classifications based on their growth estimates using three alternative school-level measures: the
mean summative test scale score, the mean (unweighted) composite score, and the mean
weighted composite score. It was hypothesized that the variation across PBTs would cancel out
when aggregated at the school level, in which case school classifications based on PBTs would
lead to inferences similar to those based on the summative and composite measures.
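The classification and agreement summaries can be computed as sketched below; the letter ordering (F for the lowest growth quintile through A for the highest) and the reading of "within 2 categories" as exactly two categories apart and "extreme variation" as three or more apart are our assumptions, chosen so the agreement rates reported later in Tables 8 and 9 sum to 100%.

import pandas as pd

def assign_grades(median_sgp):
    """Five equally sized groups by median growth percentile, labeled F (lowest) to A (highest)."""
    return pd.qcut(median_sgp, 5, labels=["F", "D", "C", "B", "A"]).astype(str)

def agreement_rates(grades_a, grades_b):
    """Agreement between two school classifications (e.g., PBT growth vs. summative growth)."""
    order = {"F": 0, "D": 1, "C": 2, "B": 3, "A": 4}
    diff = (grades_a.map(order) - grades_b.map(order)).abs()
    return {
        "exact": (diff == 0).mean(),
        "adjacent": (diff == 1).mean(),
        "exact_plus_adjacent": (diff <= 1).mean(),
        "within_two_categories": (diff == 2).mean(),   # exactly two categories apart (assumed reading)
        "extreme_variation": (diff >= 3).mean(),       # e.g., A on one measure, D or F on the other
    }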
Results
Generalizability Analyses
Variance component estimates resulting from the generalizability analyses are provided
in Table 4 for math and in Table 5 for science. The school level correlation between PBT scores
in Grades 10 and 11 was 0.95 for math and 0.94 for science, indicating that student performance
was very similar from year to year within a school. Universe score variance was greater in
Grade 10 than in Grade 11 for both science and math, but the error variance was similar in the
two grades. This led to higher reliability estimates in Grade 10. A comparison of the variance
estimates indicated that relative to other sources, there was very little variability in performance
across PBTs. Likewise, there was very little interaction between schools and PBTs. The major
source of error variance came from the variability of students within schools. These data
indicated that there was more variability in student performance within a school than variability
across schools. Thus, the G study results suggested that the most efficient way to decrease error
variance would be to include as many students as possible in the averages used to evaluate
school-level performance.
Table 4. Variance Estimates for Math

Variance Component          Occasion 1 (Grade 10)   Occasion 2 (Grade 11)
School                              0.69                    0.39
Person : School                     4.58                    3.03
Task                                0.23                    0.25
School x Task                       0.03                    0.02
(Person : School) x Task            1.50                    1.47
Table 5. Variance Estimates for Science

Variance Component          Occasion 1 (Grade 10)   Occasion 2 (Grade 11)
School                              0.64                    0.38
Person : School                     3.55                    2.63
Task                                0.16                    0.16
School x Task                       0.03                    0.03
(Person : School) x Task            1.51                    1.42
Generalizability coefficients provide an estimate of reliability based on relative error
variance. These coefficients were provided because in many cases schools are evaluated based
on their rank order. However, in the case of AYP, schools are also evaluated against an absolute
criterion, making the index of dependability, which is based on absolute error variance, the more
conceptually appropriate estimate of reliability. Both coefficients are provided in Table 6 for
math and Table 7 for science. For the D study designs considered below, more sources of error
contribute to the calculation of absolute error variance than contribute to the calculation of
relative error variance. For this reason, the index of dependability (phi coefficient) is always
lower than the generalizability coefficient. The implication of this is that additional replications
of the measurement procedure should be used when making school comparisons based on an
absolute criterion like AYP to achieve the same reliability obtained when making normative
comparisons of schools.
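For a single occasion, the coefficients in Tables 6 and 7 can be reproduced from the variance components in Tables 4 and 5 using the usual D-study error terms for school means under this design; the composite (two-occasion) coefficients require the multivariate results and are not shown. The function below is a sketch with our own naming.

def school_level_reliability(var_s, var_ps, var_t, var_st, var_pst, n_persons, n_tasks):
    """Single-occasion generalizability (GC) and dependability (Phi) coefficients for
    school means: relative error averages the person and interaction components over
    n_persons and n_tasks; absolute error also includes the task main effect."""
    rel_error = var_ps / n_persons + var_st / n_tasks + var_pst / (n_persons * n_tasks)
    abs_error = rel_error + var_t / n_tasks
    return var_s / (var_s + rel_error), var_s / (var_s + abs_error)

# Math, occasion 1 (Table 4): 100 students per school and 4 tasks per student.
gc, phi = school_level_reliability(0.69, 4.58, 0.23, 0.03, 1.50, n_persons=100, n_tasks=4)
print(round(gc, 2), round(phi, 2))   # 0.93 0.86, matching the first row of Table 6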
Generalizability and phi coefficients were provided for the five levels of student sample
size (10, 25, 50, 75, and 100), two of the task conditions (1 and 4), and for Grade 10, Grade 11,
and the composite of the two years. As previously mentioned, the Grade 10 scores were slightly
more reliable. The composite scores (based on the PBTs across years) were only slightly more
reliable than the scores for either of the two years because the correlation between grade 10 and
11 school-level scores was so high. The reliability of school-level scores was above 0.90 for
schools with 100 students or more, given four tasks taken by each student per occasion. As
the number of students and the number of tasks decreased, the reliability also decreased. For
schools with 10 students, the reliability was quite low, even with four tasks.
Table 6. Generalizability and Phi Coefficients for Math

                       Occasion 1 (Grade 10)   Occasion 2 (Grade 11)       Composite
N Persons   N Tasks       GC        Phi            GC        Phi          GC       Phi
   100         4         0.93      0.86           0.91      0.79         0.93     0.88
               1         0.89      0.69           0.85      0.55         0.91     0.75
    75         4         0.91      0.84           0.89      0.78         0.91     0.87
               1         0.87      0.68           0.83      0.54         0.89     0.74
    50         4         0.87      0.81           0.84      0.74         0.87     0.83
               1         0.83      0.65           0.78      0.52         0.85     0.71
    25         4         0.77      0.73           0.74      0.66         0.78     0.75
               1         0.72      0.58           0.66      0.46         0.74     0.64
    10         4         0.58      0.55           0.53      0.49         0.59     0.57
               1         0.52      0.45           0.45      0.35         0.55     0.49
Table 7. Generalizability and Phi Coefficients for Science

                       Occasion 1 (Grade 10)   Occasion 2 (Grade 11)       Composite
N Persons   N Tasks       GC        Phi            GC        Phi          GC       Phi
   100         4         0.93      0.88           0.91      0.83         0.94     0.90
               1         0.89      0.73           0.84      0.62         0.91     0.79
    75         4         0.91      0.87           0.89      0.81         0.92     0.89
               1         0.87      0.72           0.81      0.61         0.89     0.78
    50         4         0.88      0.84           0.85      0.78         0.89     0.86
               1         0.83      0.69           0.77      0.58         0.85     0.75
    25         4         0.80      0.76           0.75      0.69         0.80     0.78
               1         0.73      0.62           0.66      0.52         0.76     0.68
    10         4         0.62      0.59           0.55      0.52         0.62     0.61
               1         0.55      0.48           0.46      0.39         0.57     0.52
Generalizability coefficients are plotted with black lines in Figure 1 for each combination
of student sample size and number of tasks, for math. Phi coefficients are plotted for each
combination using red lines. The same information is provided in Figure 2 for science. It is
clear from Figures 1 and 2 that school sample size had a substantial impact on the reliability of
school-level scores. The increase in reliability from schools with 10 students to schools with 25
students was around 0.2. However, the difference between the reliability obtained with 75
students and the reliability obtained with 100 students was much smaller. A further increase in
sample size would have negligible impact on the reliability of school-level scores.
Increasing the number of tasks had a much less dramatic impact on reliability. This is
expected given that the variability attributable to differences among the PBTs was much smaller
than the variability amongst students. Increasing the number of PBTs from one to four increased
phi coefficients more than generalizability coefficients because of the differences in how the task
variability contributed to the calculation of error variance for the two coefficients. However, the
improvement in reliability from including more than 2 tasks was very modest. These results
suggest that few PBTs are needed to obtain reliable information about school-level performance
as long as the number of students included in school-level scores is sufficient.
[Figure 1: line plot of reliability (y-axis, 0.4 to 1.0) against number of tasks (x-axis, 1 to 4), with generalizability coefficients (GC) plotted in black and phi coefficients in red for school sizes of 10, 25, 50, 75, and 100 students.]
Figure 1. Generalizability and Phi Coefficients for Math PBTs by School Size and
Number of Tasks.
[Figure 2: line plot of reliability (y-axis, 0.5 to 1.0) against number of tasks (x-axis, 1 to 4), with generalizability coefficients (GC) plotted in black and phi coefficients in red for school sizes of 10, 25, 50, 75, and 100 students.]
Figure 2. Generalizability and Phi Coefficients for Science PBTs by School Size and
Number of Tasks.
Growth Analyses
The histograms in Figures 3 and 4 illustrate the school median growth percentile
distributions across the four measures (summative scale score, PBT raw score, composite score,
weighted composite score) for math and science respectively. One aspect of the SGP analysis of
PBT growth to note is that the restricted range of scores across the PBT assessments did not
supply enough score points to make use of the full distribution of student growth percentiles. For
example, the PBT math student growth percentiles included only 50 of the possible 99
percentiles, and the PBT science student growth percentiles included only 48. A result of this
range restriction is evident in the histograms for the math and science PBTs in Figures 3 and 4,
where the school median growth percentile distributions of PBTs do not approximate normality
as well as do those of the other measures.
Figure 3. Histograms depicting school median growth percentiles for math across the four
simulated measures.
Figure 4. Histograms depicting school median growth percentiles for science across the four
simulated measures.
The restriction of PBT raw score range is also evident in Figure 5, which presents
scatterplots of the PBT and summative test student growth percentiles. The vertical gaps in the
scatter plots represent the unassigned student growth percentiles. These plots suggest that
although SGP models are well suited to the raw score scales of PBTs, in practice some thought
should be given to the range of quantiles used. It is possible that less fine-grained SGP analysis
(e.g., by using deciles) may be adequate when modeling growth for instruments with few score
points.
Nevertheless, a comparison of Figures 5 and 6 indicates that, as expected, when a large
number of PBTs was randomly assigned, task‐to‐task variation that could be considerable at the
student level tended to average out at the school level. The correlation between the summative
test and PBT student growth percentiles at the student level was .23 for math and .22 for science, as
depicted in Figure 5. By contrast, the correlations between the summative test and PBT school
median growth percentiles increased to .67 for math and .64 for science as depicted in Figure 6.
Figure 5. Scatter plots depicting the relationship between the student growth
percentiles for the math (left) and science (right) PBTs and summative tests.
Figure 6. Scatter plots depicting the relationship between the school median growth
percentiles for the math (left) and science (right) PBTs and summative tests.
Because differences in student-level variation tended to average out at the school level, it
was expected that inferences based on PBT growth would be similar to inferences based on
summative or composite measure growth. The school classification agreement rates based on the
median growth percentiles for math and science across the different measures are presented in
Tables 8 and 9 respectively.
Table 8. School Performance Agreement Rates Based on Median Growth Percentiles for Math

                                     PBT Growth Compared to:
                         Summative Test   Weighted Composite    Composite Measure
Agreement Type               Growth        Measure Growth            Growth
Exact                         38%                42%                   50%
Adjacent                      41%                43%                   40%
Exact + Adjacent              80%                85%                   91%
Within 2 Categories           15%                12%                    8%
Extreme Variation              5%                 3%                    1%
Table 9. School Performance Agreement Rates Based on Median Growth Percentiles for Science

                              Performance Based Task Growth Compared to:
                         Summative Test   Weighted Composite    Composite Measure
Agreement Type               Growth        Measure Growth            Growth
Exact                         40%                40%                   44%
Adjacent                      41%                43%                   47%
Exact + Adjacent              81%                83%                   91%
Within 2 Categories           14%                13%                    8%
Extreme Variation              5%                 3%                    1%
The exact agreement rates among median growth percentiles based on PBTs and the
other measures (e.g., schools rated an A for growth under the PBT and other measures) examined
in the study ranged from 38% to 50%. Therefore, classifications based on median growth
percentiles for PBTs would be likely to place schools into different performance categories than
classifications based on median growth percentiles for summative tests and composite measures.
However, there seemed to be a reasonable amount of consistency among the classifications. The
percentages of schools classified to either the same or within one category of each other (e.g.,
schools rated an A for growth under the PBT and a B for growth under the other measures)
ranged from 80% to 91% across conditions.
Furthermore, none of the comparisons demonstrated high rates of extreme variation in
school ratings (e.g., a school rated an A based on PBT growth being rated a D or F on other
measures). Therefore, the results suggested that, in practice, inferences based on PBT growth
would be similar to those based on growth using summative measures or measures that combine the
two types of scores.
Discussion
The results of the analyses help answer the two research questions of interest about the
proposed model for performance-based assessments that assumes randomly equivalent PBTs
selected from a large pool such that equating is not necessary.
First, results from the G theory analyses indicate very little variability in performance at
the school level due to PBTs or the interaction between schools and PBTs. Therefore, the results
imply that the impact of task-to-task variation on the reliability of school-level PBT scores is
small. The results also suggest that few PBTs were needed to obtain reliable information about
school-level performance. The primary source of error variance identified in the G study was
attributed to variability of student performance on the PBTs within schools. This finding is
consistent with previous research on the reliability of class means (Kane, Gillmore, & Crooks,
1976), and suggests that the best way to increase the reliability of school level PBT measurement
is to increase the number of students observed within schools. Reliability estimates were low for
samples of 25 students per school but reached acceptable levels for samples of 50 students or
more.
Second, in comparing the school classifications based on the four different growth
estimates (PBT raw score, summative scale score, composite score, and weighted composite
score), the study found that inferences based on PBT growth estimates would be similar in
practice to inferences based on growth estimates from the other three types of measures.
Therefore, the growth analysis results support the G-study finding that school-level measurement
using randomly equivalent PBTs appears to be a viable option given sufficient sample size per
school. This is an important test design consideration for the Common Core assessments, which
must produce student achievement data and student growth data for determinations of school,
principal, and teacher effectiveness. The results suggest that random assignment from large pools
of PBTs can represent a kind of domain sampling, and scores across classes and schools can
represent estimates of average domain scores.
Next-generation assessments will increasingly utilize technology to develop and deliver
PBTs, supported by new assessment design approaches such as ECD. By generating large pools
of PBTs according to task models and templates, the psychometric assumptions associated with
performance-based task scoring, reporting, and aggregation can be simplified. This paper serves
as an initial investigation into the use of a performance-based assessment model in which
equating is not required. The findings support the viability of such a model with the potential to
support next-generation performance-based assessments where scores are aggregated across
groups to make inferences about teacher or school performance.
It is important to recognize that the results of this study are specific to the conditions
evaluated. It should be noted that only one student cohort was evaluated (albeit, across two
years) in this study and one content area was considered at a time. A more complex G study
design could incorporate additional cohorts and examine content area as a facet. In addition, the
study was limited in that PBTs were simulated from multiple-choice response data. Clearly,
further investigation of the model using real PBTs is needed. Finally, although results support the
reliability of randomly equivalent PBTs when used to measure performance at the aggregate
level, the results suggested relatively unreliable results at the student level using the PBTs.
Further research regarding the appropriate use of PBT measurement at the student level,
particularly within summative assessment systems, is warranted.
References
Betebenner, D. W. (2009). Norm- and criterion-referenced student growth. Educational
Measurement: Issues and Practice, 28, 42-51.
Brennan, R. L. (2001a). Generalizability theory. New York: Springer-Verlag.
Brennan, R. L. (2001b). mGENOVA [Computer software and manual]. Iowa City, IA: Center for
Advanced Studies in Measurement and Assessment, The University of Iowa. (Available
on http://www.education.uiowa.edu/casma).
Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of
behavioral measures: Theory of generalizability for scores and profiles. New York:
Wiley.
Darling-Hammond, L., & Pecheone, R. (2010, March). Developing an internationally
comparable balanced assessment system that supports high-quality learning. Paper
presented at the National Conference on Next-Generation K-12 Assessment Systems.
Available at: http://www.k12center.org/rsc/pdf/DarlingHammondPechoneSystemModel.pdf.
Feldt, L. S., & Brennan, R. L. (1989). Reliability. In R. L. Linn (Ed.), Educational measurement
(3rd ed.), pp. 105-146. New York: Macmillan.
Green, B.F. (1995). Comparability of scores from performance assessments. Educational
Measurement: Issues and Practice, 14, 13-15.
Huff, K., Steinberg, L., & Matts, T. (2010). The promises and challenges of implementing
evidence-centered design in large-scale assessment. Applied Measurement in Education,
23, 310-324.
Kane, M. T., & Brennan, R. L. (1977). The generalizability of class means. Review of
Educational Research, 47, 267-292.
Kane, M. T., Gillmore, G. M., & Crooks, T. J. (1976). Student evaluations of teaching: The
generalizability of class means. Journal of Educational Measurement, 13, 171-183.
Koenker, R. (2005). Quantile regression. New York, NY: Cambridge University Press.
Liu, M., & Haertel, G. (2011). Design patterns: A tool to support assessment task authoring
(Draft Large-Scale Assessment Technical Report 11). Menlo Park, CA: SRI
International.
Luecht, R., Burke, M., & Shu, Z. (2010, April). Controlling difficulty and security for complex
computerized performance exercises using Assessment Engineering. Paper presented at
the annual meeting of the National Council on Measurement in Education, Denver, CO.
Mislevy, R., & Haertel, G. (2006). Implications of evidence-centered design for educational
testing (Draft PADI Technical Report 17). Menlo Park, CA: SRI International.
U. S. Department of Education. Overview information: Race to the Top Fund Assessment
Program; Notice inviting applications for new awards for fiscal year (FY) 2010. 75
Federal Register, 18171-18185. (April 9, 2010).
Wise, L.L. (2011, February). Picking up the pieces: Aggregating results from through-course
assessments. Paper presented at the Invitational Research Symposium on Through-Course Assessments. Available at:
http://www.k12center.org/rsc/pdf/TCSA_Symposium_Final_Paper_Wise.pdf.