Representative Samples and PARCC to MCAS Concordance Studies

This report describes the methods and outcomes for a) selecting representative samples of test-takers for MCAS and PARCC in 2015, and b) identifying estimated MCAS results for PARCC test-takers.

February 2016

Massachusetts Department of Elementary and Secondary Education
75 Pleasant Street, Malden, MA 02148-4906
Phone 781-338-3000  TTY: N.E.T. Relay 800-439-2370
www.doe.mass.edu

This document was prepared by the Massachusetts Department of Elementary and Secondary Education
Mitchell D. Chester, Ed.D.
Commissioner

The Massachusetts Department of Elementary and Secondary Education, an affirmative action employer, is committed to ensuring that all of its programs and facilities are accessible to all members of the public. We do not discriminate on the basis of age, color, disability, national origin, race, religion, sex, gender identity, or sexual orientation. Inquiries regarding the Department's compliance with Title IX and other civil rights laws may be directed to the Human Resources Director, 75 Pleasant St., Malden, MA 02148-4906. Phone: 781-338-6105.

© 2016 Massachusetts Department of Elementary and Secondary Education
Permission is hereby granted to copy any or all parts of this document for non-commercial educational purposes. Please credit the "Massachusetts Department of Elementary and Secondary Education."

This document printed on recycled paper

Table of Contents

Introduction ..... 1
Background and Purpose ..... 1
PART 1: SELECTING REPRESENTATIVE SAMPLES
  The Need for Representative Samples ..... 3
  Method to Identify Representative Samples ..... 5
  Results from the Representative Sample Study ..... 7
PART 2: CONCORDANCE TABLES AND GUIDANCE FOR USE OF DATA
  Concordance Tables Comparing MCAS to PARCC Results ..... 10
    Introduction ..... 10
    Methods for Generating ..... 10
    Composite Performance Index (CPI) Results for PARCC Schools and Districts ..... 13
  Guidance for Using Representative Samples and Concordance Tables ..... 15
    Concordance Tables ..... 15
    Conducting Analyses at the State Level with Representative Samples ..... 15
    Conducting Analyses that are Not State-Level ..... 16
References ..... 18
Appendix A: Proof-of-Concept Study ..... 19
  Counts ..... 19
  Balance ..... 19
  Replication of 2013–14 Psychometric Results ..... 21
  Replication of 2013–14 Student Growth Percentiles ..... 23
  Replication of 2013–14 Accountability Results ..... 24
  Summary of Results from the Proof-of-Concept Study ..... 25
Appendix B: Method Used to Select Representative Samples ..... 26
Appendix C: Logistic Regression Variables and Results ..... 29

Introduction

During the 2014–15 school year, school districts in Massachusetts were offered a choice regarding their grades 3–8 summative testing programs: whether to participate in MCAS or PARCC. In order to generate stable trends for the 2014–15 school year, the State embarked on two analytical studies. The first addressed non-equivalence in MCAS and PARCC samples of test-takers through the selection of representative samples from each group. The second estimated MCAS scores for PARCC test-takers to generate Composite Performance Index values (CPIs, which are measures of proficiency for schools and districts).

Although each test was taken by roughly half of the grades 3–8 examinees, demographic differences between the two groups of examinees remained. If left unaddressed, these demographic differences would distort state trends and other analyses. To reduce unintended differences between the two groups of examinees, the Department, with assistance from national testing experts (members of the MCAS Technical Assistance Committee), developed a method to select representative samples from the total samples of examinees taking MCAS and PARCC in 2015. This first analysis produced representative samples of examinees taking MCAS and PARCC that were significantly more similar to each other than the total samples were.

The second analysis used the representative samples produced in the first analysis to match MCAS scores for examinees, by grade and subject/test, to PARCC scores, using an equipercentile linking approach (which links scores across the distributions of the two tests). The resulting data were used to generate CPIs for students, schools, and districts.

This report details the methods used to identify representative samples for MCAS and PARCC test-takers and the methods used to estimate MCAS scores for PARCC examinees, and presents outcomes from both analyses to show how well each study worked. Guidance for using the representative samples is also provided.

Background and Purpose

Massachusetts has administered its Massachusetts Comprehensive Assessment System (MCAS) tests in English language arts and mathematics every year since 1998. In 2010 it joined the PARCC consortium to develop new tests aimed at measuring college and career readiness.
In 2013–14 Massachusetts participated in PARCC field testing, and in 2014–15 Massachusetts continued its trial of the PARCC test for a second year while continuing to administer the MCAS. For the spring 2015 test administration, Massachusetts public school districts serving grades 3 to 8 were offered the option to administer either the MCAS or PARCC tests in English language arts and mathematics.1 Because districts were not assigned randomly to take PARCC or MCAS, the groups of students who took MCAS were likely to be systematically different (i.e., higher- or lower-performing or having different demographic characteristics) from those who took PARCC. When samples systematically differ, it interferes with the ability to observe whether changes in state-level student achievement from one year to the next are due to actual changes in performance or to differences in the samples (or both), and simply combining results from the two assessments will not produce an accurate picture of statewide performance.

1 The state's three largest districts (Boston, Worcester, and Springfield) were offered the opportunity to choose PARCC or MCAS school by school rather than district-wide. All districts that selected PARCC had the option of administering the test online or on paper (i.e., choice by mode).

To address this issue, the State developed a methodology to identify samples of 2015 MCAS and PARCC test-takers that were representative of all students in the state. These students' performance would be used to determine how MCAS and PARCC results compared and could be linked. The purposes for doing so were: to report state-level results for 2015, including results from both MCAS and PARCC test-takers; to maintain trends for MCAS results relative to prior years; to calculate student growth percentiles (SGPs) for MCAS and PARCC test-takers; and to calculate accountability levels for all districts and schools. (PARCC accountability levels are calculated using concordance tables that identify associated MCAS score estimates for a range of PARCC scores.)

Part 1 of this report explains in further detail the need for representative samples, and describes the methodology the Department used to select them. Part 2 of the report explains the process for generating the concordance tables linking PARCC results to MCAS, and provides guidance about how to interpret and use assessment data from the 2015 school year.

PART 1: SELECTING REPRESENTATIVE SAMPLES

The Need for Representative Samples

As expected, the students taking MCAS and PARCC were not equivalent, with differences in prior performance and student demographic variables. In terms of numbers, although about 50% of the districts participated in each test,2 the number of PARCC test-takers was slightly higher. Table 1 compares the numbers of districts and grades 3–8 students that participated in PARCC and MCAS. The full list of district choices for the 2015 assessments is available on the State PARCC website, in the Excel file ("list by district").

Table 1: District Assessment Choices for Spring 2015*

Assessment   # of Districts   % of Districts   # of Students   % of Students
MCAS              230              55%            197,480           47%
PARCC             192              45%            225,572           53%
Total             422             100%            423,052          100%

*District counts do not include the three largest districts or any single-school district. Schools in the three largest districts (Boston, Springfield, and Worcester) were assigned either MCAS or PARCC. In single-school districts, 188 districts administered MCAS and 6 administered PARCC.
MCAS and PARCC 2015 test-takers scored similarly on MCAS in 2014, as shown in Table 2. In both English language arts and mathematics, the percentages scoring at each proficiency level are similar across the assessments, with the 2015 MCAS test-takers performing slightly higher at the Advanced level. 2 This estimate does not include single-school districts – 188 out of 194 single-school districts administered MCAS in 2015. 3 Table 2: 2014 MCAS Results for 2015 MCAS and PARCC Test-Takers Group Achievement Levels and SGP Differences, Grades 3–8 ELA Achievement Level: Advanced ELA Achievement Level: Proficient ELA Achievement Level: Needs Improvement ELA Achievement Level: Warning ELA Student Growth Percentile Total Number ELA Math Achievement Level: Advanced Math Achievement Level: Proficient Math Achievement Level: Needs Improvement Math Achievement Level: Warning Math Student Growth Percentile Total Number Math 2014 Average MCAS & PARCC Test-Takers 14.4% 52.6% 25.2% 7.8% 50.1 410811 24.7% 33.3% 27.5% 14.5% 50.2 412005 2014 MCAS Results of 2015 MCAS-takers 15.1% 52.8% 24.3% 7.9% 50.2 187465 25.5% 33.5% 26.7% 14.3% 50.6 187704 2014 MCAS Results of 2015 PARCC-takers 13.8% 52.4% 26.0% 7.8% 49.9 223346 24.1% 33.3% 28.1% 14.6% 49.8 224301 Table 3 compares MCAS and PARCC test-takers by demographic characteristics. The demographic differences between the two are somewhat larger than the achievement differences, driven in part by the decision in some large school districts to administer PARCC. Overall, students with higher needs are more heavily weighted in the PARCC sample. 4 Table 3: 2014 Demographics for 2015 MCAS and PARCC Test-Takers Group Demographic Differences, Across Grades Ever ELL High Needs* Free/Reduced Lunch** Race: AA/Black Race: Asian Race: Hispanic Race: White Race: More than One Race: Other Race: AA/Hispanic No Special Needs Services Minimal Hours Special Needs Services Low Hours Special Needs Services Moderate Hours Special Needs Services High Hours Special Needs Services Total N 2014 Overall Population 14.5% 49.1% 39.1% 8.3% 6.2% 16.4% 66.0% 3.0% 0.3% 24.6% 81.8% 2.7% 3.6% 9.6% 2.2% 442982 2015 MCAStakers 15.1% 47.5% 36.7% 5.1% 7.5% 17.3% 67.2% 2.9% 0.3% 22.5% 81.1% 2.7% 3.4% 9.3% 3.4% 202938 2015 PARCCtakers 17.0% 54.0% 44.8% 11.0% 6.0% 18.8% 60.7% 3.1% 0.4% 29.8% 81.3% 2.7% 3.5% 9.2% 3.3% 240044 *High Needs Students belong to at least one of these groups: current/former English Language Learner (ELL), low income, student with disabilities. **2014 Values, Imputed. Although the demographic differences between MCAS and PARCC test-takers are not great, they are large enough to call into question whether the two groups can fairly be compared without making an adjustment for selection bias. Method to Identify Representative Samples The process used to identify representative samples involved matching each of the 2015 testing populations (MCAS test-takers and PARCC test-takers) to the characteristics of the overall 2014 MCAS population using student-level data. (The Department chose 2014 as the target population because 2014 was the last year for which the state has statewide results on a single assessment: MCAS). By removing from each 2015 sample those test-takers who were most dissimilar to the 2014 test-takers, the Department was able to create two 2015 samples that are well-matched to the 2014 student population. By definition, the two 2015 samples are also roughly equivalent. This matching process is represented visually in the logic model in Figure 1. 
Figure 1: Logic Model for the Sample-Matching Study

The methodology for selecting representative samples is a variation of propensity score matching, a statistical technique commonly used to estimate the impact of a treatment when participants are not randomly assigned to it (Angrist & Pischke, 2009; Austin, 2011; Murnane & Willett, 2011; Rosenbaum, 2010). The situation here is not precisely analogous, as the self-selection into the MCAS or PARCC test is determined by districts, not by student characteristics. But the principle applies nonetheless: we can identify a representative sample of students who are similar to one another in all measurable ways except the assignment of taking MCAS or PARCC. We can then use these representative groups to estimate state findings.

The propensity score matching conducted in this analysis used prior MCAS results and student demographic variables to match test-takers in each sample (MCAS and PARCC) in the current year to the population of test-takers in the prior year. (It should be noted that prior MCAS results were emphasized in the analysis, resulting in better balance on prior achievement than on demographic variables, although it will be shown that the method worked to create better balance on both sets of variables.) The method worked by removing the test-takers who were most unlike the prior year's population of test-takers, creating two sets of representative samples composed of test-takers more like those of the prior year's population of students.

Results using this methodology were evaluated in a "proof-of-concept study" that applied the method to draw representative samples in 2014 that were equivalent to the population of examinees in 2013. If the method worked well, analyses run on the 2014 representative samples would reproduce the results actually reported in 2014, which they did. The four critical checks and their results were:

1) The prior achievement and key demographic variables looked similar across the samples and were similar to the prior year's data (2013).

2) The MCAS cut scores (i.e., the raw scores that correspond with the MCAS achievement levels of "220, Needs Improvement," "240, Proficient," and "260, Advanced") were replicated for the representative sample of examinees assigned to MCAS in 2014.3

3) The student growth percentiles (SGPs) had a uniform (flat) distribution with a median at or near 50.4 The majority of SGPs generated using the representative samples were the same as or very close to the actual SGPs.

4) School- and district-level accountability results were nearly equivalent to what was reported in 2014 for both samples.

3 Each year, the current year's MCAS results are linked to the prior year's results using a method called "equating." The equating method identifies the raw scores for each MCAS achievement level (e.g., 220 is Needs Improvement) that yield consistent measurements from the year prior. In other words, the equating method establishes consistency in the MCAS measurement scales.

The proof-of-concept study provided evidence that the methodology worked well. Consequently, the State should be able to use the representative samples as the data source for psychometric and analytical work and still obtain the same results as it would have if it had used the full sample. A full presentation of the evidence from the proof-of-concept study is presented in Appendix A.
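For readers who want to see the shape of such a matching step, the sketch below shows one way it could be set up in R with the MatchIt package (the package the Department reports using in Appendix B). It is a minimal, illustrative sketch, not the Department's production code: the data frames and column names (stu2014, stu2015_mcas, prior_ela_prof, and so on) are hypothetical placeholders, and the actual criterion variable, covariates, and discard rules are those described in Appendix B.

    # Illustrative sketch only; data frames and column names are hypothetical.
    library(MatchIt)

    # Stack the 2014 target population with one 2015 testing group and flag
    # membership in the 2014 population.
    stu2014$in_2014_pop      <- 1
    stu2015_mcas$in_2014_pop <- 0
    stacked <- rbind(stu2014, stu2015_mcas)

    # 1:1 nearest-neighbor matching on prior achievement and demographics.
    # Records without a close prior-year counterpart remain unmatched and
    # are excluded from the representative sample.
    m <- matchit(in_2014_pop ~ prior_ela_prof + prior_math_prof + ever_ell +
                   high_needs + low_income + race + sped_hours,
                 data = stacked, method = "nearest", ratio = 1)

    summary(m)                    # balance diagnostics before and after matching
    rep_sample <- match.data(m)   # retained, matched records

The same setup would then be repeated for the PARCC test-takers, yielding the two representative samples that are compared throughout this report.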
The proof-of-concept study also allowed the State to establish the methodology for selecting the samples prior to the generation of 2015 assessment data, to avoid any concern that the State might select a sampling strategy that would advantage students who took one or the other assessment. Using a slightly refined methodology, the same analysis used in the proof-of-concept study was conducted for 2015 to select representative samples of MCAS and PARCC test-takers from the 2015 administration, measuring their representativeness by the characteristics of the state in 2014. Further details on the matching methodology are provided in Appendix B.

Results from the Representative Sample Study

The number of overall test-takers and the number of students selected for each representative sample are shown in Table 4.

Table 4: PARCC and MCAS Samples for 2015

Grade     Total MCAS   MCAS Rep. Sample   MCAS % Removed   Total PARCC   PARCC Rep. Sample   PARCC % Removed
Grade 3      33,251         25,086             25%            39,534          29,704              25%
Grade 4      33,205         25,324             24%            39,114          30,026              23%
Grade 5      33,962         26,058             23%            39,828          30,416              24%
Grade 6      33,978         25,357             25%            40,284          30,198              25%
Grade 7      33,579         26,154             22%            40,327          30,624              24%
Grade 8      34,963         26,252             25%            40,957          31,209              24%
Total       202,938        154,231             24%           240,044         182,177              24%

Approximately 75 percent of test-takers were retained in each representative sample. Retaining a large N was important to minimize error, particularly for down-the-line calculations such as student growth percentiles that depend on a large amount of student data to be estimated accurately.

4 Student growth percentiles, by definition, have a flat, uniform distribution with a median of 50 and a roughly equal percentage of students in each percentile, from 1 to 99.

Looking first at how well the representative samples are matched to the population in 2014 and to each other, Tables 5 and 6 demonstrate that the MCAS and PARCC samples are well-matched to the state on students' prior performance and demographic characteristics. As shown in Table 5, the MCAS sample is nearly identical on prior performance to MCAS test-takers as a whole, but the PARCC representative sample selects disproportionately from higher-performing PARCC test-takers to make the sample more similar to the state.

Table 5: Comparison of Achievement Outcomes for 2015 MCAS and PARCC Test-Takers, by Grade and Sample, to the 2014 MCAS Population

                              2014 MCAS       All 2015 MCAS   All 2015 PARCC   2015 Rep.      2015 Rep.
                              Population Avg. Test-Takers     Test-Takers      Sample MCAS    Sample PARCC
Gr. 3*                             53%             54%              51%            53%            53%
Gr. 4–8**                          50%             51%              49%            51%            51%
Average Ach. Gr. 4–8***            58%             58%              55%            56%            56%

*2014 Achievement Outcome, Grade 3: Estimated percent scoring Proficient+ on MCAS ELA & Math, by school and demographic group
**2014 Achievement Outcome, Grades 4–8: Percent scoring Proficient+ on MCAS ELA & Math
***Average percent of examinees scoring Proficient+ on 2014 MCAS ELA and Math, separately

As shown in Table 6, the MCAS and PARCC representative samples are fairly equivalent across most demographic comparisons. The largest differences are identified in the Black/African American and High Needs categories, again likely stemming from the choice of some large school districts to administer PARCC.
The representative samples do balance this difference somewhat, but the PARCC representative sample still has slightly higher percentages of test-takers in these categories (along with fewer White students) than the 2014 Population and the 2015 representative sample for MCAS. In addition, the PARCC sample has slightly more examinees who were English language learners or who received free- or reduced-priced lunch in 2014. Table 6: Comparison of Demographics for 2015 MCAS and PARCC Test-Takers to 2014 Population of Examinees Comparison of 2015 Demographics to 2014 Examinee Population Demographic Ever ELL High Needs* Free Lunch (2014, imp.)** Race: Black/African American Race: Asian Race: Hispanic Race: White Race: Other Special Education 2014 Population 14.7% 47.2% 38.0% 8.5% 5.8% 15.3% 67.7% 0.3% 16.9% All 2015 MCAS-Takers 15.1% 47.5% 36.7% 5.1% 7.0% 17.3% 67.2% 0.3% 17.7% All 2015 PARCC-Takers 17.0% 54.0% 44.8% 11.0% 5.9% 18.8% 60.7% 0.4% 17.6% 2015 Rep. Sample MCAS 14.2% 46.0% 35.6% 5.7% 7.0% 16.0% 67.9% 0.3% 17.2% 2015 Rep. Sample PARCC 16.7% 47.9% 39.7% 10.8% 6.4% 16.0% 63.6% 0.4% 15.8% 8 *Students in the High Needs category belong to any of these groups: special education, low-income, and ELL or ever-ELL students **Free lunch values were estimated for students with missing values Student growth percentiles (SGPs) generated for 2015 MCAS and PARCC (provided in Table 7) show a median at or near 50 in all grades for the representative samples, while there is a greater departure from 50 for examinees not included in the representative samples. Across all test-takers in the state, SGPs hover at or near a median of 50, as expected. Table 7: Statewide ELA SGPs for 2015, by Sample Comparison of Median Student Growth Percentiles, by Testing Program and Sample PARCC Sample Group Selected UnSelected Total Grade 04 05 06 07 08 04 05 06 07 08 04 05 06 07 08 ELA Median Number 50 26321 50 27052 50 26625 50 26166 50 27127 52 8126 48 7566 49 8205 49 8095 47 7835 50 34447 50 34618 50 34830 50 34261 50 34962 Math Median Number 50 26289 50 27196 50 26656 50 26156 50 27070 49 8049 50 7621 50 8166 48 8128 50 7944 50 34338 50 34817 50 34822 50 34284 50 35014 MCAS Sample ELA Median Number 50 22176 50 23451 50 22640 50 23180 50 23267 53 7441 46 6945 50 7742 50 6932 50 7780 50 29617 49 30396 50 30382 50 30112 50 31047 Math Median Number 49 22283 50 23621 50 22597 51 23222 50 23297 50 7481 51 7055 52 7800 48 6914 50 7811 49 29764 50 30676 50 30397 50 30136 50 31108 9 PART 2: CONCORDANCE TABLES AND GUIDANCE FOR USE OF DATA Concordance Tables Comparing MCAS to PARCC Results Introduction This section of Part 2 describes the methods and outcomes for the MCAS and PARCC concordance studies. The selection of representative samples enabled greater accuracy when comparing MCAS to PARCC outcomes, and also allowed the State to calculate achievement trends for the 2015 test administrations. The concordance work allowed the State to directly compare PARCC to MCAS results using an equipercentile approach and concordance tables. Methods for Generating Concordance Tables Comparing MCAS to PARCC Results The representative samples were used to generate concordance tables that estimate MCAS scores based on PARCC test-takers’ results. The concordance tables serve two primary purposes: 1. to provide a better understanding of the relationships between the new PARCC test scores and MCAS scores 2. 
to enable use of PARCC results in the State's accountability formulas, which involve four-year trends.

The equipercentile method, which identifies comparable test scores across two different tests using student achievement percentiles generated from each set of test results,5 was used to generate the concordance tables. The equipercentile method is appropriate because a) the two tests measure similar educational standards (the PARCC assessments measure the Common Core State Standards, and the Massachusetts Curriculum Frameworks are based on the Common Core State Standards6), and b) the representative samples drawn from the prior year's population appear to satisfy the single-subject requirement (see "single-subject requirement," Kolen & Brennan, 2004, pp. 293–294).7 Additionally, the equipercentile method for estimating MCAS scores from PARCC scores also works under the given conditions of a) non-linear relationships between test scales, b) differences in test difficulty, and c) the need to have accurate estimated MCAS scores across the PARCC scaled-score continuum (Kolen & Brennan, 2004, p. 294).

With the equipercentile method, the representative sample of test-takers for each test is first ranked from lowest to highest, with scores matched to percentiles.8 The graphs in Figure 2 show the distribution of MCAS and PARCC scaled scores by percentile. For each test, as the percentile increases, the scaled score increases in a logistic manner (i.e., in a non-linear fashion that is bounded by 0 and 1, or in this case by percentiles 0 and 100).

5 Student achievement percentiles are synonymous with student score rankings. To generate the percentiles in this application, results on either test were ranked from low to high and assigned a percentile from 0 to 100 (this range was used to provide more differentiation at the ends of the score distribution on MCAS).
6 See: http://www.doe.mass.edu/candi/commoncore/, document #1.
7 The single-subject requirement stipulates that a single group of test-takers sit for both exams. Since almost all Massachusetts test-takers took MCAS in 2014, the State, using the representative samples approach, identified 2015 test-takers in both the MCAS and PARCC groups that were nearly equivalent to the population of examinees in 2014. As illustrated in Figure 1, once the representative samples are found to approximate the comparison population, they are also considered approximately equivalent to each other.
8 Percentiles were calculated on student ability measures (thetas), which underlie each of the scaled scores. Percentile buckets ranged from 0 to 100, with each bucket identifying one percentile. "0" and "100" were retained in the analysis to provide additional differentiation of student achievement at the ends (bottom and top) of the distribution.

Figure 2: Distribution of Grade 4 ELA and Math Scaled Scores by Percentile for MCAS and PARCC

This example illustrates two issues. First, the MCAS scale did not stretch out across all of the percentiles computed, so gaps are noted in Graph 1. To address these gaps, MCAS scores were repeated across percentiles so there was an estimated MCAS score for every percentile. Second, the scale did not stretch completely from the beginning to the end of the distribution for some grades and subjects (in grade 3 ELA, for example, the MCAS score that mapped to the "0" percentile is 206 and not 200, as shown in Table 8). The MCAS estimates for each percentile were then mapped onto the PARCC percentiles so that every student with a valid PARCC score also received an estimated MCAS score.
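The mapping just described can be sketched in a few lines of base R. This is an illustrative sketch only, not the code used to build the published tables: mcas_scaled and parcc_scaled are hypothetical vectors of scaled scores from the two representative samples for one grade and subject, and the operational work ranked ability estimates (thetas) and applied the rules for gaps and repeated scores described above.

    # Illustrative equipercentile-style sketch (hypothetical input vectors).
    probs <- seq(0, 1, by = 0.01)          # percentiles 0 through 100

    mcas_at_pct  <- quantile(mcas_scaled,  probs = probs, type = 6)
    parcc_at_pct <- quantile(parcc_scaled, probs = probs, type = 6)

    # Concordance table: the PARCC and MCAS scores that sit at the same
    # percentile of their respective distributions.
    concordance <- data.frame(percentile = 0:100,
                              parcc_ss   = round(parcc_at_pct),
                              mcas_ss    = round(mcas_at_pct))

    # Estimated MCAS score for any valid PARCC score: find its percentile rank
    # in the PARCC distribution, then read off the MCAS score at that percentile.
    estimate_mcas <- function(parcc_score) {
      p <- ecdf(parcc_scaled)(parcc_score)
      as.numeric(round(quantile(mcas_scaled, probs = p, type = 6)))
    }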
Table 8 shows a portion of the concordance table for grade 4. Estimated MCAS scores by PARCC scaled scores are shown for percentiles that range from 0 to 10 and from 51 to 61. Looking at the first row of results, PARCC fourth-grade test-takers who had a valid scaled score from 650 to 668 on the ELA exam received an estimated MCAS score of 206. In Math, fourth-grade test-takers with a valid PARCC score that ranged from 650 to 676 received an estimated MCAS scaled score of 206. Toward the middle of the PARCC scale, fourth-graders who received a PARCC ELA scaled score that ranged from 756 to 757 received an estimated MCAS scaled score of 240. The full set of concordance tables, by grade, is published in the "Spring 2015 Grades 3–8 MCAS and PARCC Concordance Tables."

Table 8: Segment of Concordance Table for PARCC and MCAS, Grade 4, ELA and Math

Percentile   PARCC ELA SS   MCAS ELA SS   PARCC Math SS   MCAS Math SS
0            650 to 668        206         650 to 676         206
1            668 to 681        208         676 to 685         210
2            681 to 689        210         685 to 690         214
3            689 to 694        214         690 to 694         214
4            694 to 698        214         694 to 697         214
5            698 to 701        214         697 to 699         216
6            701 to 704        216         699 to 701         216
7            704 to 707        216         701 to 704         216
8            707 to 709        216         704 to 705         218
9            709 to 711        218         705 to 707         218
10           711 to 713        218         707 to 709         218

Percentile   PARCC ELA SS   MCAS ELA SS   PARCC Math SS   MCAS Math SS
51           756 to 757        240         748 to 748         238
52           757 to 758        240         748 to 749         238
53           758 to 758        240         749 to 750         238
54           759 to 759        240         750 to 751         240
55           759 to 760        240         751 to 752         240
56           760 to 761        242         752 to 753         240
57           761 to 762        242         753 to 753         242
58           762 to 763        242         753 to 754         242
59           763 to 763        242         754 to 755         242
60           763 to 764        242         755 to 756         242
61           764 to 765        244         756 to 757         244

The graphs in Figure 3 display the relationship between PARCC and MCAS scaled scores for ELA and Math at grade 4. The graphs show the gradual increase of MCAS scores as PARCC scores increase, as well as the range of PARCC scores associated with each concordant MCAS score.

Figure 3: Relationship Between MCAS and PARCC Scaled Scores for Grade 4 ELA and Math

A similar equipercentile method, using the SGP and equate packages for the statistical platform R (Albano, 2014; Betebenner, 2015), was applied for eighth-graders taking the PARCC Algebra I test. This methodology better adjusted the MCAS math score estimates for the higher-achieving eighth-graders taking Algebra I by accounting for students' prior achievement. The resulting concordance table for the Algebra I test features ranges of estimated scores for both MCAS and PARCC, as shown by the segment provided in Table 9.
12 Table 9: Segment of PARCC to MCAS Concordance Table for Algebra 1, Grade 8 Concordance: MCAS and PARCC by Percentile, Algebra I Percentile 0 1 2 3 4 5 6 7 8 9 10 PARCC Math SS 677 to 694 695 to 706 706 to 714 714 to 720 720 to 723 723 to 726 726 to 727 727 to 731 731 to 733 733 to 735 735 to 737 MCAS Math SS 214 to 218 218 to 220 218 to 220 220 to 222 222 to 222 222 to 224 224 to 224 224 to 226 226 to 228 228 to 228 228 to 230 Percentile 51 52 53 54 55 56 57 58 59 60 61 PARCC Math SS 772 to 773 773 to 774 774 to 775 775 to 775 775 to 776 776 to 777 777 to 777 777 to 778 778 to 779 779 to 780 780 to 780 MCAS Math SS 252 to 254 254 to 254 254 to 254 254 to 254 254 to 254 254 to 256 256 to 256 256 to 256 256 to 256 256 to 256 256 to 256 Composite Performance Index (CPI) Results for PARCC Schools and Districts Estimated MCAS scores from the concordance study were used to generate Composite Performance Indices (CPIs) for school- and district-level accountability purposes. Although schools and districts taking PARCC for the first time in 2015 were “held harmless” from negative accountability decisions, the CPIs were reported and are part of the historical record. CPIs were generated from the estimated MCAS scores using the standard formula, as shown in the first column of Table 10. Table 10 provides the range of PARCC scores associated with each CPI level in ELA and Math for grades 3 through 8. Table 10: PARCC Values for CPIs 2015 PARCC Composite performance Index (CPI) Concordance Table CPI Points per Student 100 (240-280) 75 (230-238) 50 (220-228) 25 (210-218) 0 (200-208) Grade 3 ELA Math 745-850 720-745 691-720 668-691 650-668 735-850 724-735 708-724 667-708 650-667 Grade 4 ELA Math 754-850 737-754 717-737 681-717 650-681 750-850 729-750 709-729 676-709 650-676 PARCC Scaled Scores Grade 5 Grade 6 ELA Math ELA Math 743-850 725-743 711-725 677-711 650-677 740-850 728-740 712-728 686-712 650-686 741-850 726-741 713-726 674-713 650-674 741-850 725-741 701-725 662-700 650-662 Grade 7 ELA Math 746-850 737-746 726-737 692-723 650-692 746-850 737-746 726-737 692-723 650-692 ELA Grade 8 Math Alg 01 727-850 712-727 695-712 662-695 650-662 743-850 729-743 712-729 667-712 650-667 749-850 736-748 705-735 672-704 659-672 The average CPIs were compared by grade level, test, and testing group. Comparisons by CPIs across groups (representative sample vs. total sample), by grade, are provided in Tables 11 and 12. In all cases except for one, the CPI differences across the MCAS and PARCC examinees are smaller for the representative sample group than for the total sample. Differences for all representative sample groups are “1” or less, indicating that the CPIs are fairly comparable across the tests. 
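To illustrate how the estimated MCAS scores feed the accountability calculations, the short sketch below applies the score bands from the first column of Table 10 above. It is a minimal sketch, not the Department's accountability code: est_mcas is a hypothetical vector of concorded scores for the students in one school or district, and the aggregate CPI is taken here as the simple average of the points earned per student.

    # Illustrative CPI banding from Table 10 (hypothetical input vector).
    cpi_points <- function(score) {
      # 200-208 -> 0, 210-218 -> 25, 220-228 -> 50, 230-238 -> 75, 240-280 -> 100
      pts <- cut(score,
                 breaks = c(200, 210, 220, 230, 240, 281),
                 labels = c(0, 25, 50, 75, 100),
                 right = FALSE, include.lowest = TRUE)
      as.numeric(as.character(pts))
    }

    cpi <- mean(cpi_points(est_mcas), na.rm = TRUE)   # aggregate CPI, 0-100 scale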
13 Total Representative Sample Table 11: Comparison of CPIs by Test and Testing Group, ELA Comparison of CPIs, by Test and Testing Group, ELA* 03 04 05 06 07 08 03 04 05 06 07 08 Median 100 100 100 100 100 100 100 100 100 100 100 100 PARCC Mean 82.49 77.63 87.09 86.63 86.28 92.17 81.25 78.91 84.83 85.28 86.29 91.09 Valid N 28075 27859 28504 28006 28005 28647 36901 36431 37105 37204 36909 37368 Median 100 100 100 100 100 100 100 100 100 100 100 100 MCAS Mean 83.48 78.60 87.35 86.70 87.14 91.52 83.30 81.03 85.95 86.27 88.17 90.97 Valid N 24104 23908 24810 23968 24672 24481 32297 32256 32915 32806 32371 33116 PARCC - MCAS Difference ES* -0.99 -0.04 -0.97 -0.04 -0.26 -0.01 -0.07 0.00 -0.86 -0.04 0.65 0.03 -2.06 -0.08 -2.12 -0.08 -1.12 -0.05 -0.99 -0.04 -1.88 -0.08 0.12 0.01 *ES = Effect Size Total Representative Sample Table 12: Comparison of CPIs by Test and Testing Group, Math Comparison of CPIs, by Test and Testing Group, Math** 03 04 05 06 07 08 Alg. 1* 03 04 05 06 07 08 Alg. 1* Median 100 75 100 100 100 100 100 100 75 100 100 100 100 100 PARCC Mean 85.28 76.96 83.18 81.19 72.30 78.05 92.43 84.22 78.32 79.53 79.10 72.48 75.52 91.54 Valid N 28089 27880 28466 27995 27754 24956 3558 36942 36461 37085 37194 36593 32984 4264 Median 100 75 100 100 100 100 MCAS Mean 85.45 77.32 83.71 81.57 73.18 78.85 Valid N 24104 23908 24810 23968 24672 24481 100 100 100 100 100 100 85.43 79.90 81.42 81.11 75.82 78.40 32297 32256 32915 32806 32371 33116 PARCC - MCAS Difference ES** -0.17 -0.01 -0.36 -0.01 -0.53 -0.02 -0.38 -0.01 -0.88 -0.03 -0.80 -0.03 -1.21 -1.58 -1.88 -2.01 -3.33 -2.88 -0.05 -0.06 -0.07 -0.07 -0.11 -0.10 *Algebra 1 taken by eighth-graders. **ES = Effect Size The last column in both Table 11 and Table 12 shows the effect size of the differences between the two groups. Effect sizes indicate the “standardized mean difference” between two groups. Basic rules indicate that effect sizes of 0.2 or less indicate small differences, and effect sizes near 0 indicate almost no difference (Becker, 2000). It should be noted, however, that smaller differences tended to favor the MCAS examinee group (with the MCAS group showing slightly higher achievement than the PARCC group). 14 Guidance for Using Representative Samples and Concordance Tables Concordance Tables Locations Estimated MCAS results that correspond with PARCC scores are available in the concordance tables and are linked to PARCC results in several datasets, as shown in Table 13. Because CPIs for PARCC testtakers, schools, and districts are calculated using concordant MCAS scores, CPIs also provide information of MCAS concordance with PARCC. Table 13: Datasets Containing PARCC to MCAS Concordance Results Datasets Containing PARCC to MCAS Concordance Results Dataset Name Research Files (student level) School and District PARCC Results MCAS Achievement Distribution and Growth Reports Description De-identified student-level files Full set of PARCC data with MCAS concordance results and CPIs included A collection of reports that provide results by MCAS performance levels and by CPIs based on MCAS levels Location Request access: www.doe.mass.edu/infoservices/research School/District Dropboxes Edwin Analytics, see: PE334, PE434, PE305, PE405, among others Cautions There are several things to keep in mind when using concordance scores. 
First, because test-takers took only one of the tests (PARCC or MCAS), the concordant results approximate, but are not exactly the same as, the results test-takers would have gotten if the alternative test had been administered. Users are cautioned against making consequential decisions based on a single test score, a single two-year comparison, or a single analysis, particularly when estimated scores are being used.

Second, due to the requirements for conducting concordance studies (described on page 10), the concordance results are specifically applicable to 2015. A refinement of this approach will be used to generate concordance tables for 2016.9 It is anticipated that while the 2015 concordance tables apply primarily to 2015, the 2016 tables will be applicable to both 2015 and 2016. Analyses using concordance tables applied to non-designated years should be used with strong caution.

9 Refined concordance tables for 2016 will be published on the Department's website by summer 2016.

Third, concordance results for PARCC were identified without consideration of the mode of administration (paper or online); therefore, no adjustments were made for any differences that may be attributable to mode.

Conducting Analyses at the State Level with Representative Samples

In 2015, datasets and data reports with state-level results provide information on the representative samples either by reporting state-wide results only for representative samples (as is done in Edwin Analytics), or by providing a "representative samples flag" (a column of "1s" that denote the cases [students] that belong to the representative samples for each test). The representative samples are useful for comparing state-level results from 2015, either in their entirety or disaggregated by the demographic groups studied in this paper, and are useful for comparing to state-level results in prior or subsequent years. Data users conducting their own analyses are encouraged to apply the representative samples flags, which will be available in all state-wide data sets, when using 2015 state-level results in analyses. As shown in Figure 4, when representative samples are reported for state-level results, a superscript (¹) in the report links to a footnoted description of how those representative samples are used.

Figure 4: State-Level Results Based on Representative Samples in 2015

Conducting Analyses that are Not State-Level

Representative samples are not applicable to smaller units of analysis (e.g., analyses at the school or district level) because these samples were identified for state-level use only. In situations where students within a school or a district took the same test two years in a row, year-to-year comparisons can be made using typical procedures (e.g., comparing across student demographic groups using the scaled scores). Scaled-score comparisons should only be made using scores on the same scale. Therefore, if a school or district changed from administering the MCAS tests in 2014 to administering the PARCC tests in 2015, then scaled-score comparisons should be made by applying estimated results from the concordance tables for 2015 and/or the concordance tables for 2016. SGPs and CPIs can also be used in comparing groups or in evaluating trends over time. However, once again, caution is advised when rendering judgments based on small differences between transitional 2015 SGPs or CPIs and traditional SGPs or CPIs generated from MCAS data.

Data users may wish to take mode into consideration when conducting analyses with PARCC data, based on potential mode differences (paper versus online) resulting from variations in prior experience with online testing. For example, users may wish to take caution when comparing PARCC results across
Data users may wish to take mode into consideration when conducting analyses with PARCC data, based on potential mode differences (paper versus online) resulting from variations in prior experience with online testing. For example, users may wish to take caution when comparing PARCC results across 16 schools with different administration modes, or when comparing year-to-year results that involve different mode administrations. The Department has identified the 2015 PARCC test mode for each school and district in the file titled list by district. In addition, a variable denoting mode is provided in the 2015 research files, which can be requested here. 17 References Albano, A. D. (2014). equate: an R Package for Observed-Score Linking and Equating. Retrieved from: https://cran.r-project.org/web/packages/equate/vignettes/equatevignette.pdf. Angrist, J. D., & Pischke, J-S. (2009). Making regression make sense In J. D. Angrist and J-S. Pischke (Eds.) Mostly Harmless Econometrics, An Empiricist’s Companion (pp. 8094). Princeton, NJ: Princeton University Press. Betebenner, D. (2012). On the Precision of MCAS SGPs. Presentation given to the MCAS Technical Assistance Committee, April 2013. Betebenner, D. (2015, February 19). SGP: An R Package for the Calculation and Visualization of Student Growth Percentiles & Percentile Growth Trajectories. Retrieved from: https://cran.rproject.org/web/packages/SGP/SGP.pdf. Becker, L. A. (2000). Effect Size (ES). Retrieved from: http://www2.jura.unihamburg.de/instkrim/kriminologie/Mitarbeiter/Enzmann/Lehre/StatIIKrim/ EffectSizeBecker.pdf. Honaker, J., King, G., & Blackwell, M. (2014, November 14). Amelia: Multiple Imputation of Incomplete Multivariate Data. Retrieved from: https://cran.r-project.org/web/packages/Amelia/Amelia.pdf. Ho, D. E., Imai, K., King, G., & Stuart, E. A. (2007, January 31). Matching as nonparametric preprocessing for reducing model dependence in parametric causal inference. Political Analysis, 15, pp. 199–236. Retrieved from: http://gking.harvard.edu/files/matchp.pdf. Ho, D. E., Imai, K., King, G., & Stuart, E. A. (2011, June 28). MatchIt: Nonparametric Preprocessing for Parametric Causal Inference. Retrieved from: http://raptor1.bizlab.mtsu.edu/sdrive/TEFF/Rlib/library/MatchIt/doc/matchit.pdf. Kolen, M. J., & R. L. Brennan. (2004). Test Equating, Scaling, and Linking: Methods and Practices, 2nd Ed. New York, NY: Springer Science+Business Media, Inc. Murnane, R. J. & Willett, J. B. (2012). Dealing with selection bias in nonexperimental data In R. J. Murnane and J. B. Willette (Eds.) Methods Matter: Improving Causal Inference in Educational and Social Science Research (pp. 304–331). NY, NY: Oxford University Press. Rosenbaum, P. R. (2010). Design of Observational Studies. New York, NY: Springer Science+Business Media, Inc. Rosenthal, R, and Rubin, D.B. (1982). A simple, general purpose display of magnitude of experimental effect. Journal of Educational Psychology, 74, pp. 166–169. 18 Appendix A: Proof-of-Concept Study Results from the proof-of-concept study were examined to evaluate and refine the models. The proofof-concept study compared 2013–14 representative sample results with 2012–13 population-level results, allowing for a direct examination of how well the matching models worked with respect to four key factors: 1. Balance on prior achievement and key demographic variables between the 2013–14 representative samples and the 2012–13 population; a direct comparison between the MCAS and PARCC groups in 2013–14 was also conducted 2. 
Replication of 2013–14 MCAS psychometric results in the representative samples (the replication of cut scores was examined) 3. Replication of the 2013–14 student growth percentiles in the representative samples 4. Similarity of school- and district-level accountability results in 2013-14. Counts In the 1:1 matching of the 2013–14 examinees to the 2012–13 examinees, a designated percentage (~13–14%) of examinees is removed from the self-selected samples. The removed examinees are more dissimilar to the prior year’s population of examinees than those selected for the representative samples. The numbers of examinees included and removed from the representative samples in the proof-of-concept study, by grade, is shown in Table A1. Table A1: PARCC and MCAS Samples, Proof-of-Concept Study Grade 3 4 5 6 7 8 Total PARCC and MCAS Samples, 2014 Proof-of-Concept Study MCAS PARCC PARCC Total MCAS Rep. MCAS % Total Rep. MCAS Sample Removed PARCC Sample 29686 26119 12% 35579 31120 30719 26183 15% 35461 31054 30644 26318 14% 35584 31110 30041 26708 11% 35931 31194 31393 26636 15% 36456 31686 31850 26988 15% 36317 31775 184333 158952 14% 215328 187939 PARCC % Removed 13% 12% 13% 13% 13% 13% 13% Balance Balance refers to the comparability of groups according to the demographic variables studied. When two groups are equivalent across a large range of relevant demographic variables, they are said to be balanced or equivalent (Ho, Imai, King, and Stuart, 2007). The goal of drawing representative samples was to generate two sets of samples that were each more comparable to the prior year in terms of student demographics and prior achievement than were their respective overall groups of examinees. For the two 2014 representative samples, results on balance showed that the matching procedures resulted in better matching both to the previous year’s population and to each other, in terms of achievement and demographic variables. 19 Table A2 shows the results of matching for the criterion (achievement) variables. Achievement results did not differ much for examinees in the 2014 unmatched samples; however, the matching procedure achieved better balance on these variables. The columns in the table show the following achievement results: population average in 2013 average of all 2014 MCAS examinees average of selected (matched) MCAS examinees average of all 2014 PARCC examinees average of selected PARCC examinees The first two rows provide comparisons for the grade 3 and the grades 4–8 criterion variables. The third row provides the average for two achievement variables for grades 4–8. With respect to MCAS achievement variables, there is very little change achieved from the matching, as the overall MCAS group results were already close to the mean. For PARCC test-takers, however, very low-scoring examinees were removed from the PARCC sample, putting it more closely in line with both the MCAS sample and the 2013 population results. The selected samples are nearly identical to the population and to each other. Table A2: Comparison of 2014 Samples to Population Achievement Results, by Group Comparison to Achievement Outcomes for 2014 Test-Takers, by Grade and Sample to Population Gr. 3 * Gr. 4–8** Average Ach. Gr. 4–8*** 2013 Pop. Average 60% 51% 63% All 2014 MCAS Test-Takers 63% 52% 61% 2014 MCAS Rep. Sample 62% 50% 60% All 2014 PARCC Test-Takers 57% 50% 59% 2014 PARCC Rep. Sample 60% 51% 60% *2013 Achievement Outcome, Grade 3. 
Estimated percent scoring Proficient+ on MCAS ELA & Math, by school and demographic group **2013 Achievement Outcome Grade 4-8: Percent scoring Proficient+ on MCAS ELA & Math ***Average percent of students scoring Proficient+ on 2013 MCAS ELA and Math, separately Table A3 presents similar population and group outcomes for demographic comparisons. Adjustments to group demographic proportions are larger for some categories than for others. The largest group differences are noted on the Race: Black/African American, High Needs, and Free Lunch categories. After matching, the MCAS and PARCC samples are brought more closely in line to the population proportions, with the MCAS group having slightly fewer examinees in these categories and the PARCC group having slightly more. 20 Table A3: Comparison of 2014 Demographic Characteristics (Proportions), by Group Comparison of Group Demographics to the 2013 Population of Examinees Race: Black/African American Race: Hispanic Race: White Race: Asian Free Lunch (2013, imp)* High Needs** Special Education Ever-ELL 2013 Population 8.2% 15.8% 67.0% 5.9% 37.5% 47.5% 18.0% 14.1% All 2014 MCAS Test-Takers 5.1% 15.6% 69.6% 6.5% 34.4% 45.4% 18.1% 13.5% 2014 MCAS Rep. Sample 5.7% 16.5% 67.6% 6.8% 36.0% 47.4% 19.1% 14.0% All 2014 PARCC Test-Takers 11.0% 16.9% 63.1% 5.6% 41.8% 52.1% 18.0% 15.3% 2014 PARCC Rep. Sample 10.4% 16.3% 64.0% 5.8% 39.6% 49.9% 17.8% 15.0% *Free lunch values were estimated for students with missing values **Students in the High Needs category belong to any of these groups: special education, low income, and ELL or everELL students Replication of 2013–14 Psychometric Results Measured Progress, the testing contractor for MCAS, evaluated the impact of the representative samples approach on the cut scores assigned to MCAS for the 2014 test administration. The evaluation involved re-running the 2014 MCAS results using the representative samples from the proof-of-concept study. The comparison of the number of raw score points assigned to each cut score is provided in the last two columns of Table A4. In the table, the “2014 Actual” column provides the number of raw score points actually assigned to each cut score in 2014 and the “Rep. Sample” column (shaded green) indicates the number of raw score points that would be assigned to each cut score using the representative samples. Comparisons for four tests/grade combinations are presented. As is shown, the number of raw score points assigned to each cut score using the representative samples matched the actual raw score cuts in 2014 for the four test/grade combinations studied except in one instance (Math Grade 4, W/F to NI) where one raw score point difference is noted. Measured Progress psychometricians indicated that this small difference is anticipated with re-analyses and that the results were nearly equivalent. A second check of the representative samples was conducted by comparing graphs of student results for the four test/grade combinations studied. The graphs in Figure A1 depict examinees’ expected raw scores (Expected Total Score) by examinees’ ability measures (Theta) for the 2014 Actual population results (red dashed line) and the results based on the representative samples (‘2015 Matched’ – blue line – which denotes the 2014 test characteristic curves based on the representative samples). 
As is shown in the graphs, the examinees' expected results for the 2014 populations of students and the 2014 representative samples are nearly identical in all instances, indicating that the use of the representative samples yields results equivalent to those generated with the population-level data.

Table A4: Replication of 2014 MCAS Results with Representative Samples (Comparison of 2014 MCAS Actual Cut Scores and Cut Scores from Representative Samples)

Test           Cut            2014 Actual   Rep. Sample
ELA Grade 3    W/F to NI          23            23
ELA Grade 3    NI to Prof         37            37
ELA Grade 3    Prof to Adv        44            44
ELA Grade 7    W/F to NI          30            30
ELA Grade 7    NI to Prof         47            47
ELA Grade 7    Prof to Adv        64            64
Math Grade 4   W/F to NI          23            22
Math Grade 4   NI to Prof         39            39
Math Grade 4   Prof to Adv        48            48
Math Grade 8   W/F to NI          24            24
Math Grade 8   NI to Prof         37            37
Math Grade 8   Prof to Adv        48            48

Figure A1: Comparison of 2014 Expected MCAS Results by Ability (Theta), for Four Subject/Grade Combinations. (Four panels, ELA03, ELA07, MAT04, and MAT08, each plotting Expected Total Score against Theta for the "2014 Actual" and "2015 Matched" test characteristic curves.)

Replication of 2013–14 Student Growth Percentiles

The 2013–14 representative samples were used to recompute the SGPs for the students designated to take MCAS, to both evaluate the impact of the samples on the generation of SGPs for 2014–15 and to ensure that the SGPs could be replicated using the representative samples approach. The graphs in Figure A2 display differences in MCAS SGPs generated with the representative samples in ELA and Math. The majority of the recalculated SGPs were between -2 and 2 (94% for ELA and 92% for Math), far smaller than the expected standard error for SGPs, which is generally between 5 and 7 (Betebenner, 2013).

Figure A2: Replication of 2013–14 Student Growth Percentiles (SGPs)

Table A5 provides descriptive statistics for the recalculated SGPs, by sample. The anticipated mean and median SGP across the state was 50. Here we can see that the SGPs for the unselected sample vary more from the expected median of 50 than the recalculated SGPs for the selected sample. For the selected sample, the median and mean SGPs for all grades are within one point of 50. The total SGPs are also within one point of 50. These results confirm that the representative samples can be used to calculate population-level SGPs for the 2015 test administrations.

Table A5: Descriptive Statistics for Recalculated SGPs for MCAS Examinees, by Sample, Proof-of-Concept Study

                          Recalculated SGP, ELA                    Recalculated SGP, Math
Sample       Grade   Median  Mean  Min.  Max.  Number        Median  Mean  Min.  Max.  Number
UnSelected    04       55     53     1    99     3613          51     51     1    99     3614
UnSelected    05       47     48     1    99     3340          50     50     1    99     3377
UnSelected    06       51     51     1    99     2690          48     48     1    99     2696
UnSelected    07       51     51     1    99     3754          52     51     1    99     3750
UnSelected    08       50     50     1    99     3852          50     50     1    99     3872
Selected      04       49     50     1    99    20800          50     50     1    99    20843
Selected      05       50     50     1    99    21073          50     50     1    99    21152
Selected      06       50     50     1    99    21649          50     50     1    99    21669
Selected      07       50     50     1    99    21278          50     50     1    99    21288
Selected      08       50     50     1    99    21213          51     51     1    99    21322
Total         04       49     50     1    99    24413          50     50     1    99    24457
Total         05       50     50     1    99    24413          50     50     1    99    24529
Total         06       50     50     1    99    24339          50     50     1    99    24365
Total         07       50     50     1    99    25032          50     50     1    99    25038
Total         08       50     50     1    99    25065          51     50     1    99    25194

Replication of 2013–14 Accountability Results

The method for selecting representative samples was evaluated with a final check on how the method affected accountability results.
The 2014 CPI means for each representative sample (MCAS and PARCC) were compared to those of the examinees excluded from the representative samples. In both sets of comparisons (in ELA and in Math), mean CPIs for the representative samples matched exactly, while the mean CPIs for the excluded student samples differed considerably, as shown in Table A6. These results confirmed the use of the representative samples for calculating the Department’s accountability measures. Table A6: Comparison of 2014 CPI Calculations Comparison of 2014 CPIs for Representative Samples and Excluded Samples, by Test Group SAMPLE MCAS Representative Sample Students excluded from the MCAS rep. sample PARCC Representative Sample Students excluded from the PARCC rep. sample ELA 2014 CPI 85.4 89.3 85.4 82.7 Math 2014 CPI 78.9 83.6 78.9 74.1 24 Summary of Results from the Proof-of-Concept Study The proof-of-concept study examined the representative samples approach for generating results that were very close or identical to the actual results (population-level results) for the 2014 test administration. In each of the four areas investigated, this approach yielded results for the representative samples that were equivalent or nearly equivalent to the actual results in 2014: The comparisons of prior achievement and demographic averages indicated identical or more similar results (balanced) than the results for the total samples, indicating that the matching approach ameliorated differences by testing group (MCAS vs. PARCC). The 2014 cut scores on four MCAS tests (i.e., four grade/subject combinations) were replicated using the representative samples approach, indicating that this approach can be used to maintain testing trends in 2015. The 2014 SGPs were nearly replicated using the representative samples approach, indicating that SGPs can be generated for 2015 using this method. The 2014 accountability results (average CPIs in ELA and Math) were replicated using the representative samples approach, indicating that this approach can be used to generate accountability statistics for 2015. All results from the proof-of-concept study demonstrated that the representative sample results more consistently matched results for the testing population than did the overall sample group results; the representative samples yielded results nearly equivalent to the actual results in 2014. 25 Appendix B: Method Used to Select Representative Samples A derivation of propensity score matching was used to identify representative samples. Propensity score matching uses a host of background variables to match individuals across two conditions (e.g., intervention, treatment). The formula shown in Figure B1 computes a “propensity score” that captures the differences among examinees for each condition (e.g., treatment) with respect to the group of variables used to compute it. Typically, propensity scores (x) are used to match people who received an intervention (z) to students who have not, using a set of variables (X, covariates). The propensity score is an estimated probability (p), based on the background variables and the condition (e.g., intervention). The propensity scores are then used to match individuals across the two conditions (or more than two conditions). For our study, there is no intervention – the condition (z) being evaluated is assignment to a testing program, which is not really dependent on a student’s background variables; instead, assignment is determined by the district or school a student attends. 
For our study, there is no intervention: the condition (z) being evaluated is assignment to a testing program, which is not truly dependent on a student's background variables; instead, assignment is determined by the district or school a student attends. Consequently, estimating propensity scores based on testing assignment would yield a poorly fitting model and poor matching of students across conditions.

Figure B1: Typical Propensity Score Equation

e(x) = p(z = 1 | X)

where e(x) is the propensity score, z is the treatment assignment (e.g., test assignment), and X is the vector of covariates (e.g., student demographics).

The Department, in consultation with testing experts from the MCAS Technical Advisory Committee, matched students taking MCAS to students taking PARCC using prior MCAS results instead of test assignment (z = prior test results: Not Proficient vs. Proficient/Advanced). Using this approach, propensity scores (x) were generated outside of the matching program, as shown in Figure B2.

Figure B2: Derivation of the Propensity Score Equation

e(x) = p(z = 1 | X)

where e(x) is the propensity score, a probability that can be computed outside of the matching program; z is the prior MCAS result (0 = Not Proficient/Advanced, 1 = Proficient/Advanced); and X is the vector of covariates (e.g., student demographics).

The propensity scores are the probabilities of achieving Proficient or Advanced on the criterion variable used, given the examinee demographics. Propensity scores were generated using a binary logistic regression for each grade and subject.10

10 First, a binary logistic regression was run using the prior year's data (the prior year provides population-level results). Next, the population-level coefficients for each variable used in the model were applied to the current year's data to yield population-level results for the current year; in this case, coefficients generated using data from 2014 were applied to the 2015 data.

Two primary matching models were used; they differed largely in the achievement (criterion) variable used for the matching. These two models are described below.

1. The main model, which identified representative samples in grades 4–8, used the prior MCAS results in ELA (0 = W/NI, 1 = Prof/Adv) as the criterion and included the prior MCAS results in Math (0 = W/NI, 1 = Prof/Adv) as one of the covariates.

2. Since grade 3 has no prior achievement results, the criterion variable used to match examinees was a dichotomized (0, 1) average score, based on the results for prior students in that school, grade, and that student's demographic group, indicating whether or not those students scored Proficient or Advanced on both ELA and Math in the prior year (2014). For example, a grade 3 student in 2015 was matched to a grade 3 student in 2014 in the same school and grade according to the average MCAS results for that group (by race and by whether the examinee was enrolled in special education).

Model variables and coefficients are provided in Appendix C. Model fits for the grade-level models are shown in Table B1; for all three fit statistics, higher numbers indicate better fit. The column labeled "% Corr" shows how accurately each model classified students according to the criterion variable used (MCAS results, Proficient or not Proficient, in 2014). This statistic ranges from 0 to 100%, and the fits shown indicate moderately strong classification accuracy. The columns labeled "C & S" (Cox and Snell) and "Nagel." (Nagelkerke) are two additional model fit statistics that describe the proportions of variance (score spread) explained by the models, with higher proportions indicating better fit. Nagelkerke ranges from 0 to 1, and Cox and Snell ranges from 0 to a limit below 1. The model fits on these metrics indicate moderate fit for most of the models.11

Table B1: Model Fits for Matching Models
Model Fit: Logistic Regression

Model     % Corr   C & S   Nagel.
Grade 3   83.8     .453    .604
Grade 4   78.3     .352    .471
Grade 5   80.1     .345    .475
Grade 6   82.8     .370    .522
Grade 7   82.8     .348    .507
Grade 8   85.5     .308    .488

11 The higher model fits for grade 3 are largely an artifact of the criterion variable, which incorporates student demographic information. Because the remaining models used actual examinee results, those model fits provide better information for matching than does the grade 3 model, despite appearances.

Prior to generating the propensity scores, a multiple imputation procedure (from the R Amelia package) was used to substitute estimated values for all missing data used in the analysis. The matching was conducted with the R package MatchIt using the nearest neighbor matching algorithm, which conducts a 1:1 match. Each examinee included in each representative sample was therefore matched to an examinee from the prior year in that grade, using the propensity scores generated through the logistic regression. After the matching was conducted, all data were evaluated to determine how similar the matched groups were on the student demographic and achievement variables used in the models. Matched samples for 2014–15 were evaluated for balance on the demographic and prior achievement variables, as well as for comparability on SGP and accountability results. These results, presented in the body of this report, showed that the method generated samples that were substantially more similar to the prior year's population than the unmatched samples were. Further, the matched samples for MCAS and PARCC test-takers were more similar to each other than the unmatched samples were.
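The workflow above (imputation with Amelia, propensity scores from a prior-year logistic regression, then 1:1 nearest neighbor matching with MatchIt) can be sketched in R roughly as follows. This is a hedged illustration rather than the Department's actual code: the file names, data frame names, grouping variable, and abbreviated covariate list are assumptions, and it presumes a MatchIt version (4.0 or later) that accepts a numeric vector for the distance argument.

    library(Amelia)   # multiple imputation of missing covariates
    library(MatchIt)  # 1:1 nearest neighbor matching

    # Hypothetical inputs: prior-year (population-level) and current-year examinee files
    prior   <- read.csv("grade4_2014.csv")
    current <- read.csv("grade4_2015.csv")

    # 1. Stack the two years and impute missing covariate values
    #    (a single completed data set is used here for simplicity)
    pool <- rbind(transform(prior, current_year = 0),
                  transform(current, current_year = 1))
    imp  <- amelia(pool, m = 1, idvars = c("student_id", "current_year"))
    pool <- imp$imputations[[1]]

    # 2. Binary logistic regression fit on the prior-year records, with the
    #    population-level coefficients applied to every record (see footnote 10);
    #    variable names loosely follow Table C1 but are placeholders
    ps_model <- glm(eperf2014_imp ~ race_B + race_H + race_A + race_W + ever_ell +
                      highneeds + freelunch2014_imp + yrsinmass_imp,
                    data = subset(pool, current_year == 0), family = binomial)
    pool$pscore <- predict(ps_model, newdata = pool, type = "response")

    # 3. 1:1 nearest neighbor matching of current-year examinees to prior-year
    #    examinees on the externally computed propensity scores
    m_out <- matchit(current_year ~ race_B + race_H + race_A + race_W + ever_ell +
                       highneeds + freelunch2014_imp + yrsinmass_imp,
                     data = pool, method = "nearest", ratio = 1,
                     distance = pool$pscore)

    # 4. Balance diagnostics and the matched (representative) sample
    summary(m_out)
    matched <- match.data(m_out)

In the actual study this procedure was run separately for each grade-level model, and balance was then evaluated on the demographic and prior achievement variables as described above.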
Appendix C: Logistic Regression Variables and Results

Table C1 provides information about the variables included in the two logistic regression models used to generate the propensity scores.

Table C1: Variables Used in the Logistic Regression Models*

Type of Variable | Variable Name | Description | Model
Criterion | emperf2013_PA_mean_imp_cat | Proficient or Advanced on MCAS ELA/Math in the prior year, in that grade and school, by Race and Free/Reduced Lunch Status, Imputed, Dichotomized (0, 1) | Grade 3
Criterion | eperf2014_imp | Proficient or Advanced on MCAS ELA in the prior year | Grades 4–8
Covariate | race_B | Race/Ethnicity = African American/Black | Grades 3–8
Covariate | race_H | Race/Ethnicity = Hispanic/Latino | Grades 3–8
Covariate | race_A | Race/Ethnicity = Asian | Grades 3–8
Covariate | race_W | Race/Ethnicity = Caucasian/White | Grades 3–8
Covariate | race_M | Race/Ethnicity = Mixed | Grades 3–8
Covariate | ever_ell | Ever an English Language Learner | Grades 3–8
Covariate | highneeds | High Needs (Student with Disability + Free/Reduced Lunch Eligible) | Grades 3–8
Covariate | freelunch2014_imp | Free/Reduced Lunch Eligible (prior year), Imputed | Grades 3–8
Covariate | yrsinmass_imp | Number of years in Massachusetts schools, Imputed | Grades 3–8
Covariate | levelofneed0_B | Interaction: No Special Needs * race_B | Grades 3–8
Covariate | levelofneed0_H | Interaction: No Special Needs * race_H | Grades 3–8
Covariate | levelofneed0_W | Interaction: No Special Needs * race_W | Grades 3–8
Covariate | levelofneed0_A | Interaction: No Special Needs * race_A | Grades 3–8
Covariate | freelunch2014_BH | Interaction: freelunch2014_imp * race_BH | Grades 3–8
Covariate | freelunch2014_H | Interaction: freelunch2014_imp * race_H | Grades 3–8
Covariate | freelunch2014_A | Interaction: freelunch2014_imp * race_A | Grades 3–8
Covariate | freelunch2014_W | Interaction: freelunch2014_imp * race_W | Grades 3–8
Covariate | emperf2012_PA_mean_imp | Proficient or Advanced on MCAS ELA and Math, proportion for that school, grade, race, and free/reduced lunch category, two years prior, Imputed | Grades 4–8
Covariate | emperf2012_B | Interaction: emperf2012_PA_mean_imp * race_B | Grades 4–8
Covariate | emperf2012_H | Interaction: emperf2012_PA_mean_imp * race_H | Grades 4–8
Covariate | emperf2012_W | Interaction: emperf2012_PA_mean_imp * race_W | Grades 4–8
Covariate | emperf2012_A | Interaction: emperf2012_PA_mean_imp * race_A | Grades 4–8
Covariate | emperf2012_levelofneed0 | Interaction: emperf2012_PA_mean_imp * levelofneed0 | Grades 4–8
Covariate | emperf2013_PA_imp | Proficient or Advanced on MCAS ELA and Math, 2013, Imputed | Grades 4–8
Covariate | eperf2013_imp | Proficient or Advanced on MCAS ELA in the prior year | Grades 4–8
Covariate | mperf2013_imp | Proficient or Advanced on MCAS ELA, 2013, Imputed | Grades 4–8
Covariate | emperf2013_PA_mean_imp_cat | Proficient or Advanced on MCAS Math, 2013, Imputed | Grades 4–8

* Variables were used only in the Grade 3 model, only in the Grades 4–8 model, or in both models (Grades 3–8), as indicated by the far-right column.

Tables C2 and C3 provide information on the covariates used in the logistic regression models. Table C2 covers the models used in grades 3 through 5, and Table C3 covers grades 6 through 8. Within each set of grade-level results, the first column gives the population-level coefficient, the second column gives the standard error of the coefficient, and the third column gives the statistical significance of the coefficient. For propensity-matching purposes, variables should be retained in the model if the probability (significance) is less than 0.5; however, all variables were kept in for consistency across models.
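As a small aid to reading Tables C2 and C3 below, the sketch that follows shows how a significance value of the kind reported in the third column can be derived from a coefficient and its standard error via a two-sided Wald test, the usual output of binary logistic regression routines. This is an illustration of the standard calculation, not the Department's code; it simply reuses the grade 8 ever_ell values from Table C3.

    # Illustrative only: reproduce a "Sign." value from a coefficient and its
    # standard error using a two-sided Wald z-test. The values are the grade 8
    # ever_ell entries from Table C3.
    coef_est <- -0.609   # population-level coefficient
    std_err  <-  0.035   # standard error of the coefficient

    z_stat <- coef_est / std_err        # Wald z statistic
    p_val  <- 2 * pnorm(-abs(z_stat))   # two-sided significance
    round(p_val, 3)                     # prints 0, i.e., the ".000" shown in the table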
Table C2: Information for Model Variables, Grades 3–5
Coefficients for Covariates Used in the Logistic Regression Models, Grades 3–5

                            Grade 3                     Grade 4                     Grade 5
Variable                    Coef.    St. Err.  Sign.    Coef.    St. Err.  Sign.    Coef.    St. Err.  Sign.
race_B                      -3.135   0.234     .000     -0.225   0.231     .329     -0.073   0.219     .740
race_H                      -2.718   0.208     .000     -0.352   0.214     .101     -0.126   0.209     .548
race_A                      -0.682   0.217     .002     -0.028   0.252     .912      0.025   0.240     .916
race_W                      -3.400   0.179     .000     -0.005   0.193     .981     -0.052   0.194     .789
race_M                       0.297   0.163     .068      0.175   0.178     .327      0.216   0.177     .222
ever_ell                    -0.236   0.042     .000     -0.369   0.035     .000     -0.415   0.034     .000
highneeds                   -0.742   0.064     .000     -0.300   0.051     .000     -0.329   0.049     .000
freelunch2014_imp           -0.922   0.102     .000     -0.211   0.113     .062     -0.234   0.121     .053
yrsinmass_imp               -0.029   0.017     .098      0.029   0.012     .014      0.029   0.009     .002
levelofneed0_B               1.696   0.161     .000      0.895   0.118     .000      0.629   0.093     .000
levelofneed0_H               1.224   0.125     .000      1.071   0.089     .000      0.614   0.070     .000
levelofneed0_W               4.324   0.077     .000      0.328   0.055     .000      0.169   0.056     .002
levelofneed0_A               2.189   0.150     .000      0.565   0.172     .001      0.371   0.160     .020
freelunch2014_B              0.529   0.121     .000      0.020   0.137     .884     -0.082   0.144     .567
freelunch2014_H             -0.756   0.105     .000     -0.220   0.125     .079     -0.126   0.134     .347
freelunch2014_A             -0.700   0.125     .000     -0.135   0.139     .331     -0.290   0.152     .056
freelunch2014_W              0.174   0.094     .065     -0.051   0.109     .642     -0.034   0.120     .779
mperf2014_imp                –       –         –         1.851   0.020     .000      1.909   0.021     .000
emperf2013_PA_mean_imp       –       –         –         3.019   0.087     .000      2.873   0.094     .000
emperf2013_B                 –       –         –        -0.392   0.130     .003     -0.100   0.146     .492
emperf2013_H                 –       –         –        -0.074   0.113     .516     -0.181   0.124     .144
emperf2013_W                 –       –         –        -0.051   0.084     .538     -0.101   0.095     .287
emperf2013_A                 –       –         –        -0.091   0.122     .455     -0.079   0.133     .555
emperf2013_levelofneed0      –       –         –        -0.097   0.082     .239     -0.063   0.093     .500
Constant                     1.137   0.178     .000     -2.219   0.199     .000     -1.600   0.201     .000

Table C3: Information for Model Variables, Grades 6–8
Coefficients for Covariates Used in the Logistic Regression Models, Grades 6–8

                            Grade 6                     Grade 7                     Grade 8
Variable                    Coef.    St. Err.  Sign.    Coef.    St. Err.  Sign.    Coef.    St. Err.  Sign.
race_B                      -0.282   0.246     .251     -0.079   0.232     .733     -0.089   0.244     .715
race_H                      -0.144   0.235     .539     -0.075   0.226     .740      0.025   0.238     .917
race_A                      -0.183   0.277     .509     -0.111   0.263     .672      0.014   0.278     .960
race_W                       0.135   0.221     .541     -0.026   0.213     .901      0.231   0.225     .305
race_M                       0.234   0.205     .254     -0.182   0.196     .353      0.244   0.200     .224
ever_ell                    -0.379   0.035     .000     -0.413   0.034     .000     -0.609   0.035     .000
highneeds                   -0.403   0.052     .000     -0.280   0.051     .000     -0.394   0.053     .000
freelunch2014_imp           -0.405   0.136     .003      0.021   0.133     .874     -0.190   0.157     .226
yrsinmass_imp                0.055   0.008     .000      0.063   0.006     .000      0.072   0.006     .000
levelofneed0_B               0.780   0.094     .000      0.746   0.083     .000      0.743   0.080     .000
levelofneed0_H               0.593   0.071     .000      0.729   0.067     .000      0.739   0.063     .000
levelofneed0_W               0.059   0.061     .336      0.097   0.062     .114      0.203   0.063     .001
levelofneed0_A               0.452   0.184     .014      0.234   0.171     .172      0.305   0.175     .082
freelunch2014_B              0.333   0.159     .036     -0.210   0.155     .174      0.044   0.178     .804
freelunch2014_H              0.199   0.149     .183     -0.450   0.148     .002     -0.124   0.171     .468
freelunch2014_A             -0.111   0.176     .527     -0.365   0.178     .040     -0.282   0.205     .168
freelunch2014_W              0.133   0.135     .325     -0.361   0.133     .007     -0.189   0.157     .230
mperf2014_imp                2.147   0.023     .000      2.224   0.030     .000      2.259   0.037     .000
emperf2013_PA_mean_imp       3.308   0.108     .000      3.672   0.104     .000      3.580   0.116     .000
emperf2013_B                -0.273   0.152     .072     -0.373   0.157     .017     -0.151   0.158     .339
emperf2013_H                -0.123   0.148     .404     -0.083   0.140     .550     -0.174   0.146     .234
emperf2013_W                -0.153   0.116     .187     -0.230   0.109     .034     -0.003   0.122     .980
emperf2013_A                 0.058   0.156     .708     -0.166   0.153     .276     -0.060   0.166     .716
emperf2013_levelofneed0     -0.062   0.114     .586     -0.014   0.105     .891     -0.266   0.119     .026
Constant                    -1.802   0.228     .000     -1.476   0.218     .000     -1.104   0.231     .000
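To make the use of these coefficients concrete, the sketch below computes a propensity score for a single examinee by taking the inverse logit of the linear combination of coefficients and covariate values. Only a handful of the grade 4 coefficients from Table C2 are used; the full model applies every coefficient in the table, and the student profile shown is invented purely for illustration.

    # Illustrative only: propensity score e(x) = p(z = 1 | X) as the inverse logit
    # of the linear predictor. Coefficients are a subset of the grade 4 column of
    # Table C2; the covariate values describe a hypothetical examinee.
    b <- c(constant               = -2.219,
           ever_ell               = -0.369,
           highneeds              = -0.300,
           yrsinmass_imp          =  0.029,
           mperf2014_imp          =  1.851,
           emperf2013_PA_mean_imp =  3.019)

    x <- c(constant               = 1,    # intercept
           ever_ell               = 0,    # never an English language learner
           highneeds              = 1,    # high needs flag
           yrsinmass_imp          = 4,    # four years in Massachusetts schools
           mperf2014_imp          = 1,    # Proficient/Advanced in Math, prior year
           emperf2013_PA_mean_imp = 0.6)  # proportion Proficient/Advanced for the
                                          # examinee's school/grade/demographic group

    pscore <- plogis(sum(b * x))  # inverse logit; about 0.78 for this profile
    pscore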