The Case for Performance-Based Tasks without Equating

Paper Presented at the National Council on Measurement in Education
Vancouver, British Columbia, Canada

Walter D. Way, Daniel Murphy, Sonya Powers, Leslie Keng

April 2012

Abstract

Significant momentum exists for next-generation assessments to increasingly utilize technology to develop and deliver performance-based assessments. Many traditional challenges with this assessment approach still apply, including psychometric concerns related to performance-based tasks (PBTs), such as low reliability, efficiency of measurement, and the comparability of different tasks. This paper proposes a model for performance-based assessments that assumes random selection of PBTs from a large pool and that treats tasks as comparable without equating. The model assumes that if a large number of PBTs can be randomly assigned, then task-to-task variation across individuals will average out at the group (i.e., classroom and school) level. The model was evaluated empirically using simulations involving a re-analysis of data from a statewide assessment. A set of G-theory analyses was conducted to assess the reliability of average school performance on PBTs and to evaluate how variance due to the randomly assigned tasks compared to other sources of variation. An analysis based on the linear student growth percentile (SGP) model was used to assess the degree to which the model assumption of randomly equivalent tasks held by comparing school classifications based on PBT growth estimates with three alternative school-level measures. The study findings support the viability of the proposed model for next-generation performance-based assessments for uses related to group-level inferences.

Keywords: performance-based tasks, G theory, student growth percentiles, Common Core

The Case for Performance-Based Tasks without Equating

Significant momentum exists in the United States for next-generation assessments that go beyond traditional multiple-choice and constructed-response item types. The focus on college and career readiness and the increased emphasis on new skills and abilities that are needed to succeed in the 21st century have propelled a renewed interest in performance-based assessment (PBA; Darling-Hammond & Pecheone, 2010). This movement is further supported by the increasingly broad ways that technology is being used to present instruction and assess learning. It is therefore not surprising that both consortia that have been funded by the federal government to develop assessments measuring the Common Core standards – the Partnership for Assessment of Readiness for College and Careers (PARCC) and the SMARTER Balanced Assessment Consortium (SBAC) – plan to include performance tasks as part of their summative tests.

Although next-generation PBA has potential, many of the traditional challenges with this assessment approach still apply. When PBA is combined with objectively-scored assessments to produce summative scores, which both PARCC and SBAC plan to do, complex issues related to aggregating scores arise (Wise, 2011). These issues are exacerbated by psychometric concerns related to performance-based tasks, which include low reliability, efficiency of measurement (i.e., the amount of testing time needed to achieve a desired level of reliability), and the comparability of different tasks. Comparability from task to task has long been a concern with performance-based assessments.
Green (1995) was frank in concluding that a performance-based assessment is not well-suited to "maintaining the aspects of a testing procedure that are congenial to the equating process." He summarized the concern as follows:

    In the language of factor analysis, each test item or task usually has a small amount of common variance and has substantial specific variance. A test combines the results of many items or tasks, building up the common variance and washing out the specific variance, which is usually treated as a source of error. The fewer items, or tasks, there are, the less advantage can be taken of the immense power of aggregation to overwhelm such error. (Green, 1995, p. 14)

The traditional notion of performance-based tasks, developed in the late 1980s and early 1990s, assumed that for a given assessment program, students would typically take the same tasks at the same time. Thus, when an assessment was repeated with a new task, either for the same students at a later time or for a new cohort of students, comparisons of student performance were extremely difficult because of task-to-task variation.

The next generation of PBA has the opportunity to approach task variation differently because of advances in both technology and psychometric approaches. For example, evidence-centered design (ECD; Huff, Steinberg, & Matts, 2010) approaches can reduce task-to-task variation and provide templates to aid task development. Technology can further assist development efforts both by automatically generating variants of tasks and by making it possible to randomly select specific tasks from an available pool of tasks. However, to take full advantage of these features, changes in traditional models for PBA will be needed.

In this paper, we propose and evaluate a model for performance-based assessments that assumes random selection of performance-based tasks (PBTs) from a large pool, and that assumes tasks are comparable without equating for the purposes of aggregating scores at the group (e.g., classroom or school) level. These two assumptions capitalize on an underlying expectation that task-to-task variation across individuals will average out at the group level.

A Model for Performance-Based Assessments

It seems reasonable to assume that next-generation development of performance-based tasks will be strongly influenced by ECD. For example, Mislevy and Haertel speak of "the exploitation of efficiencies from reuse and compatibility" (Mislevy & Haertel, 2006, p. 22) that is afforded by ECD. Luecht and his colleagues (cf. Luecht, Burke, & Shu, 2010) have coined the term "Assessment Engineering" (AE), which involves construct maps, evidence models, task models, and templates as a means of generating extremely large numbers of complex performance exercises. The Literacy Design Collaborative (LDC) is a Gates Foundation project whose purpose is to develop literacy template tasks that can be filled with curriculum content from varied subjects. These templates can be used for teaching in the classroom but also extend to performance task design. The potential to develop large numbers of tasks from specified task templates is further aided by technology, which provides tools that test developers can use to support assessment task authoring (Liu & Haertel, 2011). If performance-based tasks are to measure deep thinking and 21st century skills, they will still require significant assessment time.
Thus, although ECD approaches can reduce task-to-task variation, this variation will still exist at the individual student level because each student can take only a limited number of PBTs given practical time limitations. However, if a large number of PBTs can be randomly assigned, then task-to-task variation across individuals will average out when scores are aggregated across students. This is an important test design consideration for the Common Core assessments, which must produce student achievement data and student growth data for determinations of school, principal, and teacher effectiveness. With large pools of PBTs, random assignment can represent a kind of domain sampling, and scores across classes and schools can represent estimates of average domain scores. From this viewpoint, equating is not necessary because the aggregated scores are comparable within calculable estimates of standard errors.

An additional benefit of having a large pool of PBTs is significantly lessened security concerns. With a large enough task pool, it would be possible to disclose tasks along with the actual student work and the resulting scores. Furthermore, the PBT raw scores could be interpreted directly in terms of the applied scoring rubrics. This would greatly facilitate another requirement of the Common Core assessments, which is to produce data that informs teaching, learning, and program improvement.

Research Questions

The purpose of this paper is to describe and illustrate a model for performance-based assessments that assumes randomly selected assessment tasks from a large pool of PBTs are comparable such that equating is not necessary. The model was evaluated empirically using simulations involving a re-analysis of data from a statewide assessment. The analyses sought to answer two main research questions. The first research question focused on the reliability of test scores under the proposed PBT model. Specifically, how does task-to-task variation compare to other sources of variation in its impact on the reliability of PBT scores at an aggregate (e.g., school) level? Also, states often make inferences about schools by placing them in performance categories based on their students' achievement growth using standardized measures. Our second research question therefore asked, to what degree does the assumption of randomly equivalent tasks hold when classifying schools based on their students' growth on test scores that incorporate PBTs?

Method

Data Source

An empirical simulation was conducted to explore the impact of unequated performance-based tasks used for summative assessment purposes. The empirical simulation used real response data from statewide mathematics and science tests administered in grade 10 in 2009 and grade 11 in 2010. Students were matched across years, so that the data set included data from each of the four tests. Because the assessments consisted of only multiple-choice items, the performance-based task scores were "simulated" by combining randomly selected subsets of items from the math and science tests. For each test, 50 random samples of 12 items were selected with replacement from the complete test. These sets represented the simulated PBTs. The simulated PBTs were used in different ways for the two sets of analyses that followed. For the generalizability analyses (to be described below), student scores on all 50 sets were included.
For the growth analyses, it was assumed that students took one of the sets of 12 items and the remaining test items were considered the simulated summative test. Table 1 presents the number of items and coefficient alpha reliabilities of the full tests that were used for the simulations, as well as the Spearman-Brown projected reliabilities of the shortened tests and the simulated performance tasks.

Table 1. Test Length and Reliability of the Full and Shortened Tests

                       Full Test          Shortened Test     Simulated Task
Test                   # items   ρxx′     # items   ρ*xx′    # items   ρ*xx′
Science Grade 10       55        0.91     43        0.89     12        0.69
Math Grade 10          56        0.93     44        0.91     12        0.74
Science Grade 11       55        0.88     43        0.85     12        0.61
Math Grade 11          60        0.90     48        0.88     12        0.65

Note. ρxx′ refers to coefficient alpha reliability. ρ*xx′ refers to Spearman-Brown adjusted reliability.

Data Generation

For each test at each grade level, the data were generated for the study as follows (steps 1 through 5 are sketched in code at the end of this subsection):

1. Fifty sets of 12 items were selected at random and with replacement from the full test. Each of these 50 sets was assumed to represent a simulated PBT.
2. The 0/1 responses for each of the 50 simulated PBTs were summed and saved.
3. For each student, the entire set of 50 simulated PBTs was utilized for the generalizability analyses.
4. For each student, one of the 50 simulated PBTs was randomly selected for use in the growth analyses.
5. For the growth analyses, the 12 items contributing to the assigned simulated PBT were coded as "not presented" in the student's response data matrix for each summative test.
6. For each summative test, Rasch item parameters obtained operationally were used to recalibrate student abilities using the data matrices with the 12-item sets excluded. For each test, the resulting ability estimates were rescaled to have a mean of 500.

Note that the operational tests were not vertically scaled between grades 10 and 11. Thus, for the growth analyses, each student received four scores for each subject area (math and science): a simulated PBT raw score and an equated summative test scale score for each of the two years of administration (2009 and 2010). The use of data across years allowed student growth to be calculated from 2009 to 2010.
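To make the procedure concrete, the following sketch (Python) illustrates steps 1 through 5 on a synthetic 0/1-scored response matrix. All names and sizes are illustrative, it is assumed that items are drawn without replacement within each 12-item set (so that exactly 12 items are removed from the summative test), and the Rasch recalibration in step 6 is not shown. A small helper reproducing the Spearman-Brown projections reported in Table 1 is included at the end.

```python
import numpy as np

rng = np.random.default_rng(2012)

# Synthetic stand-in for one test: a 0/1-scored response matrix with one row per
# student and one column per operational item (sizes and names are illustrative).
n_students, n_items = 1000, 56            # e.g., the grade 10 math test
responses = rng.integers(0, 2, size=(n_students, n_items))

n_tasks, task_length = 50, 12

# Step 1: fifty 12-item sets drawn at random from the full test. Items are drawn
# without replacement within a set (each simulated PBT has 12 distinct items)
# but may recur across sets.
task_items = [rng.choice(n_items, size=task_length, replace=False)
              for _ in range(n_tasks)]

# Step 2: the 0/1 responses for each simulated PBT are summed (raw scores 0-12).
pbt_scores = np.column_stack([responses[:, items].sum(axis=1)
                              for items in task_items])

# Step 3: all 50 simulated PBT scores per student feed the generalizability analyses.

# Step 4: one simulated PBT is randomly assigned to each student for the growth analyses.
assigned = rng.integers(0, n_tasks, size=n_students)
assigned_pbt_score = pbt_scores[np.arange(n_students), assigned]

# Step 5: the assigned PBT's items are coded as "not presented" in the summative matrix.
summative = responses.astype(float)
for student, task in enumerate(assigned):
    summative[student, task_items[task]] = np.nan
# (Step 6, recalibrating Rasch abilities from the masked matrices, is not shown.)

# Spearman-Brown projection used for the shortened-test and simulated-task
# reliabilities in Table 1 (k is the ratio of new to original test length).
def spearman_brown(alpha_full, n_new, n_full):
    k = n_new / n_full
    return k * alpha_full / (1 + (k - 1) * alpha_full)

print(round(spearman_brown(0.91, 12, 55), 2))   # ~0.69, the grade 10 science simulated task
```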
Table 2 provides descriptive statistics for the simulated PBT raw scores and equated summative test scale scores.

Table 2. Descriptive Statistics for Simulated PBTs and Summative Tests (Student Level)

                              PBT Raw Score                 Summative Test Scale Score
Subject    N         Grade    Mean (SD)      Min    Max     Mean (SD)    Min     Max
Math       246,017   10       8.32 (2.69)    0      12      500 (100)    12      806
Math       246,017   11       8.87 (2.33)    0      12      500 (100)    111     841
Science    245,438   10       8.60 (2.46)    0      12      500 (100)    31      849
Science    245,438   11       8.93 (2.18)    0      12      500 (100)    -43     877

In addition, the availability of each student's campus identifier allowed us to aggregate the simulated test results at the school level. Table 3 provides descriptive statistics for the simulated PBT raw scores and equated summative test scale scores aggregated at the school level.

Table 3. Descriptive Statistics for Simulated PBTs and Summative Tests (School Level)

                            PBT Raw Score                    Summative Test Scale Score
Subject    N       Grade    Mean (SD)     Min      Max       Mean (SD)        Min       Max
Math       1,086   10       8.15 (0.94)   4.73     11.33     492.61 (37.94)   381.81    673.38
Math       1,086   11       8.77 (0.77)   5.53     11.32     494.00 (36.69)   360.65    674.83
Science    1,086   10       8.48 (0.87)   5.78     10.81     493.28 (38.64)   382.24    637.40
Science    1,086   11       8.84 (0.71)   5.97     10.88     494.30 (36.29)   364.81    629.86

Generalizability Analyses of PBT Scores

A number of analyses were conducted on the simulated data to address the research questions. Generalizability theory (G-theory; Cronbach, Gleser, Nanda, & Rajaratnam, 1972; Feldt & Brennan, 1989) analyses were applied to assess the reliability of average school performance on the PBTs. (See also Kane & Brennan, 1977, and Kane, Gillmore, & Crooks, 1976, for a discussion of G-theory in the context of estimating the reliability of class means.) A generalizability (G) study was designed that included students (persons [p]), schools (s), PBTs (t), and grade level (occasions [o]) as measurement facets. The structure of the study data was such that students were nested within schools, and tasks were nested within occasions. Because student scores were matched across two years (i.e., grade 10 to grade 11), all students included in the analyses took all items in both occasions. The G study design can be abbreviated as (p:s) x (t:o).

In order to have a balanced design, and to be consistent with the criteria used in the growth model analysis (see next subsection), schools with fewer than 30 students were eliminated. In schools with more than 30 students, 30 students were randomly sampled for inclusion in the study. The variance component for persons was therefore estimated based on 30 replications. There were 1,008 schools with 30 or more students for math and 1,016 schools for science that were used to estimate the variability due to schools. In addition, for this portion of the study, scores on each of the 50 PBTs used in the simulation were calculated for each student. As a result, for both subjects, 50 simulated PBTs were used to estimate task variability. Finally, the science and math assessments were given at grade 10 and grade 11, resulting in two replications for the occasions facet. Variance components were estimated using the mGENOVA software (Brennan, 2001b). The multivariate counterpart to the univariate (p:s) x (t:o) design was used, which, following the notation of Brennan (2001a), is represented as (p•:s•) x t°, where solid circles (•) indicate facets crossed with the occasions facet and open circles (°) indicate facets nested within the occasions facet.

Several decision (D) studies were used to evaluate the reliability that can be expected under a variety of measurement replications. The replications that were considered were those that might be possible operationally. Although 50 simulated tasks were used to estimate variance components, PBTs are time consuming to administer and score. Therefore, D studies were conducted using one to four PBTs per person. Also, it was expected that as the number of students increased within a school, the average school performance would be more reliable because of decreased sampling variability. Thirty students per school were used to estimate the variability due to persons, but for D studies, sample sizes of 10, 25, 50, 75, and 100 were also considered. Ten students represented a particularly small class size, 25 represented an average class size for a school with a single class per grade level, and 50, 75, and 100 represented schools with two to four classrooms per grade level. Additionally, reliability estimates were calculated based on the Grade 10 test (occasion 1) only, the Grade 11 test (occasion 2) only, and based on both occasions. Because schools are often rank ordered for comparisons, the generalizability coefficient was used as the estimate of reliability for a single occasion, and the composite generalizability coefficient was used as the estimate of reliability across the two occasions. The index of dependability (also known as the phi coefficient) was also calculated for each condition, as this index would be more appropriate for situations where schools are held to an absolute criterion, such as adequate yearly progress (AYP).
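For a single occasion, these school-level coefficients can be written in standard G-theory notation as sketched below. This is the familiar form for school (group) means under a (p:s) x t design with n_p students per school and n_t tasks per student; it is not taken from the paper itself, and the composite coefficients across both occasions come from the multivariate analysis and are not shown.

```latex
% Error variances and reliability-like coefficients for school means under a
% (p:s) x t design, with n_p students per school and n_t tasks per student.
\sigma^{2}(\delta) = \frac{\sigma^{2}(p{:}s)}{n_{p}}
  + \frac{\sigma^{2}(st)}{n_{t}}
  + \frac{\sigma^{2}(pt{:}s)}{n_{p}\,n_{t}},
\qquad
\sigma^{2}(\Delta) = \sigma^{2}(\delta) + \frac{\sigma^{2}(t)}{n_{t}}

E\rho^{2} = \frac{\sigma^{2}(s)}{\sigma^{2}(s) + \sigma^{2}(\delta)},
\qquad
\Phi = \frac{\sigma^{2}(s)}{\sigma^{2}(s) + \sigma^{2}(\Delta)}
```

Under this formulation, the task main effect contributes only to the absolute error variance, which is why the phi coefficient is never larger than the generalizability coefficient in the results that follow.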
Growth Analyses

To provide data for the growth analyses, composite scores that combined the PBT raw score and summative scale score for each student were created. Two composite scores that standardized the PBT and summative test measures were considered: 1) a composite score that was the (unweighted) sum of the two standardized scores, and 2) a weighted composite score in which the standardized summative test score was weighted to count three times as much as the standardized PBT score.

A linear student growth percentile (SGP) model (Betebenner, 2009) was used to estimate student and school growth. The SGP model uses quantile regression (Koenker, 2005) to estimate a conditional linear quantile function,

Q_Y(τ | X = x) = x′β(τ),    (1)

where Q_Y(τ | X = x) is the τth conditional quantile of the random variable Y and β(τ) is the set of regression coefficients. The quantile regression procedure minimizes an asymmetric loss function for each τ in a specified set T ⊂ (0,1); in particular, for this analysis T = {.01, .02, .03, …, .99}. In this analysis, each student received an estimated growth percentile, which was the τ that minimized the distance between the student's observed grade 11 score and a predicted grade 11 score based on the model.

To measure school growth, the students' growth percentiles were aggregated within each school and the median growth percentile was calculated. Schools with fewer than 30 students were removed from the analyses. The next step divided the schools into five equally sized groups and assigned them a grade of A, B, C, D, or F based on the median growth percentiles for each of the measures. We compared the school grade classifications based on students' PBT growth estimates with the classifications based on their growth estimates using three alternative school-level measures: the mean summative test scale score, the mean (unweighted) composite score, and the mean weighted composite score. It was hypothesized that the variation across PBTs would cancel out when aggregated at the school level, in which case school classifications based on PBTs would lead to similar inferences as those based on the summative and composite measures.
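The following sketch illustrates the estimation steps just described. The software used in the study is not reproduced here; the sketch uses the QuantReg implementation in statsmodels purely as an illustration, and the data frame, column names, school identifiers, and synthetic scores are placeholders rather than the study data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical matched file: one row per student, grade 10 and grade 11 scores
# on one measure, plus a school identifier (all names and data are illustrative).
df = pd.DataFrame({
    "score_g10": np.random.normal(500, 100, 2000),
    "score_g11": np.random.normal(500, 100, 2000),
    "school":    np.random.randint(0, 40, 2000),
})

taus = np.arange(0.01, 1.00, 0.01)                 # the 99 conditional quantiles
model = smf.quantreg("score_g11 ~ score_g10", df)

# Fit one linear quantile regression per tau and predict grade 11 scores.
predictions = np.column_stack(
    [model.fit(q=t).predict(df) for t in taus]     # shape: (n_students, 99)
)

# A student's growth percentile is the tau whose prediction is closest to the
# observed grade 11 score.
closest = np.abs(predictions - df["score_g11"].to_numpy()[:, None]).argmin(axis=1)
df["sgp"] = closest + 1                            # percentiles 1-99

# School growth: the median SGP within each school (schools with < 30 students dropped).
counts = df.groupby("school")["sgp"].transform("size")
school_mgp = df[counts >= 30].groupby("school")["sgp"].median()

# Assign A-F grades by splitting schools into five equally sized groups.
grades = pd.qcut(school_mgp.rank(method="first"), 5, labels=list("FDCBA"))
```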
Results

Generalizability Analyses

Variance component estimates resulting from the generalizability analyses are provided in Table 4 for math and in Table 5 for science. The school-level correlation between PBT scores in Grades 10 and 11 was 0.95 for math and 0.94 for science, indicating that student performance was very similar from year to year within a school. Universe score variance was greater in Grade 10 than in Grade 11 for both science and math, but the error variance was similar in the two grades. This led to higher reliability estimates in Grade 10. A comparison of the variance estimates indicated that, relative to other sources, there was very little variability in performance across PBTs. Likewise, there was very little interaction between schools and PBTs. The major source of error variance came from the variability of students within schools. These data indicated that there was more variability in student performance within a school than variability across schools. Thus, the G study results suggested that the most efficient way to decrease error variance would be to include as many students as possible in the averages used to evaluate school-level performance.

Table 4. Variance Estimates for Math

Variance Component           Occasion 1 (Grade 10)    Occasion 2 (Grade 11)
School                       0.69                     0.39
Person : School              4.58                     3.03
Task                         0.23                     0.25
School x Task                0.03                     0.02
(Person : School) x Task     1.50                     1.47

Table 5. Variance Estimates for Science

Variance Component           Occasion 1 (Grade 10)    Occasion 2 (Grade 11)
School                       0.64                     0.38
Person : School              3.55                     2.63
Task                         0.16                     0.16
School x Task                0.03                     0.03
(Person : School) x Task     1.51                     1.42

Generalizability coefficients provide an estimate of reliability based on relative error variance. These coefficients were provided because in many cases schools are evaluated based on their rank order. However, in the case of AYP, schools are also evaluated against an absolute criterion, making the index of dependability, which is based on absolute error variance, the more conceptually appropriate estimate of reliability. Both coefficients are provided in Table 6 for math and Table 7 for science. For the D study designs considered below, more sources of error contribute to the calculation of absolute error variance than contribute to the calculation of relative error variance. For this reason, the index of dependability (phi coefficient) is always lower than the generalizability coefficient. The implication is that additional replications of the measurement procedure should be used when making school comparisons based on an absolute criterion like AYP to achieve the same reliability obtained when making normative comparisons of schools.

Generalizability and phi coefficients were provided for the five levels of student sample size (10, 25, 50, 75, and 100), two of the task conditions (1 and 4), and for Grade 10, Grade 11, and the composite of the two years. As previously mentioned, the Grade 10 scores were slightly more reliable. The composite scores (based on the PBTs across years) were only slightly more reliable than the scores for either of the two years because the correlation between grade 10 and 11 school-level scores was so high. The reliability of school-level scores was above 0.90 for schools with 100 students or more, given four or more tasks taken by students per occasion. As the number of students and the number of tasks decreased, the reliability also decreased. For schools with 10 students, the reliability was quite low, even with four tasks.
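As a rough check, the single-occasion coefficients reported in Table 6 below can be approximately reproduced from the Grade 10 math variance components in Table 4 using the school-mean formulas sketched in the Method section. The sketch below (Python; names are illustrative) agrees with the tabled values to within about 0.01, reflecting the rounding of the reported variance components; the two-occasion composite coefficients require the multivariate analysis and are not reproduced.

```python
# Approximate check of the Grade 10 (occasion 1) math coefficients in Table 6,
# using the Table 4 variance components and the school-mean formulas sketched
# in the Method section.
var = {"s": 0.69, "p_s": 4.58, "t": 0.23, "st": 0.03, "pt_s": 1.50}

def school_coefficients(n_persons, n_tasks, v=var):
    rel_err = v["p_s"] / n_persons + v["st"] / n_tasks + v["pt_s"] / (n_persons * n_tasks)
    abs_err = rel_err + v["t"] / n_tasks                 # adds the task main effect
    gc = v["s"] / (v["s"] + rel_err)                     # generalizability coefficient
    phi = v["s"] / (v["s"] + abs_err)                    # index of dependability (phi)
    return gc, phi

for n_p in (100, 75, 50, 25, 10):
    for n_t in (4, 1):
        gc, phi = school_coefficients(n_p, n_t)
        print(f"n_persons={n_p:3d}  n_tasks={n_t}  GC={gc:.2f}  Phi={phi:.2f}")
```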
Table 6. Generalizability and Phi Coefficients for Math

                       Occasion 1 (Grade 10)    Occasion 2 (Grade 11)    Composite
N Persons   N Tasks    GC       Phi             GC       Phi             GC       Phi
100         4          0.93     0.86            0.91     0.79            0.93     0.88
100         1          0.89     0.69            0.85     0.55            0.91     0.75
75          4          0.91     0.84            0.89     0.78            0.91     0.87
75          1          0.87     0.68            0.83     0.54            0.89     0.74
50          4          0.87     0.81            0.84     0.74            0.87     0.83
50          1          0.83     0.65            0.78     0.52            0.85     0.71
25          4          0.77     0.73            0.74     0.66            0.78     0.75
25          1          0.72     0.58            0.66     0.46            0.74     0.64
10          4          0.58     0.55            0.53     0.49            0.59     0.57
10          1          0.52     0.45            0.45     0.35            0.55     0.49

Table 7. Generalizability and Phi Coefficients for Science

                       Occasion 1 (Grade 10)    Occasion 2 (Grade 11)    Composite
N Persons   N Tasks    GC       Phi             GC       Phi             GC       Phi
100         4          0.93     0.88            0.91     0.83            0.94     0.90
100         1          0.89     0.73            0.84     0.62            0.91     0.79
75          4          0.91     0.87            0.89     0.81            0.92     0.89
75          1          0.87     0.72            0.81     0.61            0.89     0.78
50          4          0.88     0.84            0.85     0.78            0.89     0.86
50          1          0.83     0.69            0.77     0.58            0.85     0.75
25          4          0.80     0.76            0.75     0.69            0.80     0.78
25          1          0.73     0.62            0.66     0.52            0.76     0.68
10          4          0.62     0.59            0.55     0.52            0.62     0.61
10          1          0.55     0.48            0.46     0.39            0.57     0.52

Generalizability coefficients are plotted with black lines in Figure 1 for each combination of student sample size and number of tasks, for math. Phi coefficients are plotted for each combination using red lines. The same information is provided in Figure 2 for science. It is clear from Figures 1 and 2 that school sample size had a substantial impact on the reliability of school-level scores. The increase in reliability from schools with 10 students to schools with 25 students was around 0.2. However, the difference between the reliability obtained with 75 students and the reliability obtained with 100 students was much smaller. A further increase in sample size would have negligible impact on the reliability of school-level scores.

Increasing the number of tasks had a much less dramatic impact on reliability. This is expected given that the variability attributable to differences among the PBTs was much smaller than the variability among students. Increasing the number of PBTs from one to four increased phi coefficients more than generalizability coefficients because of the differences in how the task variability contributed to the calculation of error variance for the two coefficients. However, the improvement in reliability from including more than two tasks was very modest. These results suggest that few PBTs are needed to obtain reliable information about school-level performance as long as the number of students included in school-level scores is sufficient.

Figure 1. Generalizability and Phi Coefficients for Math PBTs by School Size and Number of Tasks.

Figure 2. Generalizability and Phi Coefficients for Science PBTs by School Size and Number of Tasks.

Growth Analyses

The histograms in Figures 3 and 4 illustrate the school median growth percentile distributions across the four measures (summative scale score, PBT raw score, composite score, weighted composite score) for math and science, respectively. One aspect of the SGP analysis of PBT growth to note is that the restricted range of scores across the PBT assessments did not supply enough score points to make use of the full distribution of student growth percentiles.
For example, the PBT math student growth percentiles included only 50 of the possible 99 percentiles, and the PBT science student growth percentiles included only 48. A result of this range restriction is evident in the histograms for the math and science PBTs in Figures 3 and 4, where the school median growth percentile distributions of PBTs do not approximate normality as well as do those of the other measures.

Figure 3. Histograms depicting school median growth percentiles for math across the four simulated measures.

Figure 4. Histograms depicting school median growth percentiles for science across the four simulated measures.

The restriction of PBT raw score range is also evident in Figure 5, which presents scatterplots of the PBT and summative test student growth percentiles. The vertical gaps in the scatter plots represent the unassigned student growth percentiles. These plots suggest that although SGP models are well suited to the raw score scales of PBTs, in practice some thought should be given to the range of quantiles used. It is possible that a less fine-grained SGP analysis (e.g., by using deciles) may be adequate when modeling growth for instruments with few score points. Nevertheless, a comparison of Figures 5 and 6 indicates that, as expected, when a large number of PBTs was randomly assigned, task-to-task variation that could be considerable at the student level tended to average out at the school level. The correlation between the summative test and PBT student growth percentiles at the student level was .23 for math and .22 for science, as depicted in Figure 5. By contrast, the correlations between the summative test and PBT school median growth percentiles increased to .67 for math and .64 for science, as depicted in Figure 6.

Figure 5. Scatter plots depicting the relationship between the student growth percentiles for the math (left) and science (right) PBTs and summative tests.

Figure 6. Scatter plots depicting the relationship between the school median growth percentiles for the math (left) and science (right) PBTs and summative tests.

Because differences in student-level variation tended to average out at the school level, it was expected that inferences based on PBT growth would be similar to inferences based on summative or composite measure growth. The school classification agreement rates based on the median growth percentiles for math and science across the different measures are presented in Tables 8 and 9, respectively.

Table 8. School Performance Agreement Rates Based on Median Growth Percentiles for Math

                          PBT Growth Compared to:
                          Summative Test    Weighted Composite    Composite
Agreement Type            Growth            Measure Growth        Measure Growth
Exact                     38%               42%                   50%
Adjacent                  41%               43%                   40%
Exact + Adjacent          80%               85%                   91%
Within 2 Categories       15%               12%                   8%
Extreme Variation         5%                3%                    1%

Table 9. School Performance Agreement Rates Based on Median Growth Percentiles for Science

                          PBT Growth Compared to:
                          Summative Test    Weighted Composite    Composite
Agreement Type            Growth            Measure Growth        Measure Growth
Exact                     40%               40%                   44%
Adjacent                  41%               43%                   47%
Exact + Adjacent          81%               83%                   91%
Within 2 Categories       14%               13%                   8%
Extreme Variation         5%                3%                    1%
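For clarity, the agreement categories in Tables 8 and 9 can be computed from two sets of school letter grades as in the following sketch (Python; names and the example call are illustrative). "Within 2 Categories" is interpreted here as exactly two categories apart and "Extreme Variation" as three or more categories apart, which is consistent with the first, second, fourth, and fifth rows summing to roughly 100%.

```python
import numpy as np
import pandas as pd

# Map letter grades to an ordinal scale so category distances can be computed.
order = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}

def agreement_rates(grades_pbt, grades_other):
    """Agreement categories between two sets of school letter grades (A-F)."""
    diff = np.abs(pd.Series(grades_pbt).map(order) - pd.Series(grades_other).map(order))
    return {
        "Exact":               np.mean(diff == 0),
        "Adjacent":            np.mean(diff == 1),
        "Exact + Adjacent":    np.mean(diff <= 1),
        "Within 2 Categories": np.mean(diff == 2),   # interpreted as exactly two apart
        "Extreme Variation":   np.mean(diff >= 3),   # e.g., an A versus a D or F
    }

# Example with made-up grades for five schools:
print(agreement_rates(["A", "B", "C", "D", "F"], ["B", "B", "F", "D", "A"]))
```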
The exact agreement rates among median growth percentiles based on PBTs and the other measures (e.g., schools rated an A for growth under both the PBT and the other measure) examined in the study ranged from 38% to 50%. Therefore, classifications based on median growth percentiles for PBTs would be likely to place schools into different performance categories than classifications based on median growth percentiles for summative tests and composite measures. However, there seemed to be a reasonable amount of consistency among the classifications. The percentages of schools classified to either the same or within one category of each other (e.g., schools rated an A for growth under the PBT and a B for growth under the other measures) ranged from 80% to 91% across conditions. Furthermore, none of the comparisons demonstrated high rates of extreme variation in school ratings (e.g., schools rated an A based on PBT growth but a D or F on the other measures). Therefore, the results suggested that, in practice, inferences based on PBT growth would be similar to those based on growth using summative measures or measures that combine the two types of scores.

Discussion

The results of the analyses help answer the two research questions of interest about the proposed model for performance-based assessments, which assumes randomly equivalent PBTs selected from a large pool such that equating is not necessary. First, results from the G-theory analyses indicate very little variability in performance at the school level due to PBTs or the interaction between schools and PBTs. Therefore, the results imply that the impact of task-to-task variation on the reliability of school-level PBT scores is small. The results also suggest that few PBTs were needed to obtain reliable information about school-level performance. The primary source of error variance identified in the G study was attributed to variability of student performance on the PBTs within schools. This finding is consistent with previous research on the reliability of class means (Kane, Gillmore, & Crooks, 1976), and suggests that the best way to increase the reliability of school-level PBT measurement is to increase the number of students observed within schools. Reliability estimates were low for samples of 25 students per school but reached acceptable levels for samples of 50 students or more.

Second, in comparing the school classifications based on the four different growth estimates (PBT raw score, summative scale score, composite score, and weighted composite score), the study found that inferences based on PBT growth estimates would be similar in practice to inferences based on growth estimates from the other three types of measures. Therefore, the growth analysis results support the G-study finding that school-level measurement using randomly equivalent PBTs appears to be a viable option given sufficient sample size per school.

This is an important test design consideration for the Common Core assessments, which must produce student achievement data and student growth data for determinations of school, principal, and teacher effectiveness. The results suggest that random assignment from large pools of PBTs can represent a kind of domain sampling, and scores across classes and schools can represent estimates of average domain scores. Next-generation assessments will increasingly utilize technology to develop and deliver PBTs, supported by new assessment design approaches such as ECD. By generating large pools of PBTs according to task models and templates, the psychometric assumptions associated with performance-based task scoring, reporting, and aggregation can be simplified.
This paper serves as an initial investigation into the use of a performance-based assessment model in which equating is not required. The findings support the viability of such a model, with the potential to support next-generation performance-based assessments where scores are aggregated across groups to make inferences about teacher or school performance. It is important to recognize that the results of this study are specific to the conditions evaluated. Only one student cohort was evaluated (albeit across two years), and one content area was considered at a time. A more complex G study design could incorporate additional cohorts and examine content area as a facet. In addition, the study was limited in that PBTs were simulated from multiple-choice response data. Clearly, further investigation of the model using real PBTs is needed. Finally, although the results support the reliability of randomly equivalent PBTs when used to measure performance at the aggregate level, they also suggested relatively unreliable results at the student level. Further research regarding the appropriate use of PBT measurement at the student level, particularly within summative assessment systems, is warranted.

References

Betebenner, D. W. (2009). Norm- and criterion-referenced student growth. Educational Measurement: Issues and Practice, 28, 42-51.

Brennan, R. L. (2001a). Generalizability theory. New York: Springer-Verlag.

Brennan, R. L. (2001b). mGENOVA [Computer software and manual]. Iowa City, IA: Center for Advanced Studies in Measurement and Assessment, The University of Iowa. Available at http://www.education.uiowa.edu/casma

Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York: Wiley.

Darling-Hammond, L., & Pecheone, R. (2010, March). Developing an internationally comparable balanced assessment system that supports high-quality learning. Paper presented at the National Conference on Next-Generation K-12 Assessment Systems. Available at http://www.k12center.org/rsc/pdf/DarlingHammondPechoneSystemModel.pdf

Feldt, L. S., & Brennan, R. L. (1989). Reliability. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 105-146). New York: Macmillan.

Green, B. F. (1995). Comparability of scores from performance assessments. Educational Measurement: Issues and Practice, 14, 13-15.

Huff, K., Steinberg, L., & Matts, T. (2010). The promises and challenges of implementing evidence-centered design in large-scale assessment. Applied Measurement in Education, 23, 310-324.

Kane, M. T., & Brennan, R. L. (1977). The generalizability of class means. Review of Educational Research, 47, 267-292.

Kane, M. T., Gillmore, G. M., & Crooks, T. J. (1976). Student evaluations of teaching: The generalizability of class means. Journal of Educational Measurement, 13, 171-183.

Koenker, R. (2005). Quantile regression. New York, NY: Cambridge University Press.

Liu, M., & Haertel, G. (2011). Design patterns: A tool to support assessment task authoring (Draft Large-Scale Assessment Technical Report 11). Menlo Park, CA: SRI International.

Luecht, R., Burke, M., & Shu, Z. (2010, April). Controlling difficulty and security for complex computerized performance exercises using Assessment Engineering. Paper presented at the annual meeting of the National Council on Measurement in Education, Denver, CO.
Mislevy, R., & Haertel, G. (2006). Implications of evidence-centered design for educational testing (Draft PADI Technical Report 17). Menlo Park, CA: SRI International.

U.S. Department of Education. (2010, April 9). Overview information: Race to the Top Fund Assessment Program; notice inviting applications for new awards for fiscal year (FY) 2010. Federal Register, 75, 18171-18185.

Wise, L. L. (2011, February). Picking up the pieces: Aggregating results from through-course assessments. Paper presented at the Invitational Research Symposium on Through-Course Assessments. Available at http://www.k12center.org/rsc/pdf/TCSA_Symposium_Final_Paper_Wise.pdf