Advances in Large-Scale Assessment: A PARCC Update
National Conference on Student Assessment
New Orleans
June 26, 2014

Introductions
Presenters:
Luis Saldivia, Senior Mathematics Advisor, ETS
Michelle Richard, Technology Solutions Manager, Pearson
Lora Monfils, Senior Psychometrician, ETS
Laurie Davis, Director, Solutions Implementation, Pearson
Discussant: Mike Russell, Senior Associate, Center for Assessment
Moderator: Kit Viator, Senior Executive, ETS

Advances in Large-Scale Assessment: A PARCC Update
Innovations in the Measurement of Mathematics Content
Luis Saldivia
Educational Testing Service

PARCC-Related Innovations
• Innovations explicitly sought by PARCC
• Additional innovations required to properly measure mathematics performance as required by the CCSS and PARCC's Evidence-Centered Design specifications

PARCC Math Innovations
1. Quality assessment of individual content standards with machine scoring of responses entered by computer interface
2. Practice-forward tasks
3. Tasks assessing conceptual understanding with machine scoring of responses entered by computer interface
4. Integrative tasks with machine scoring of responses entered by computer interface

PARCC Math Innovations
5. Fluency assessment with machine scoring of responses entered by computer interface
6. Expressing mathematical reasoning
7. Modeling / application
8. Technology-enhanced tasks

PARCC: Sample Mathematics Items
URL: http://practice.parcc.testnav.com/#
• Sample Set HS Math: # 9
• Sample Set Grades 6-8 Math: # 4
• Grade 3 Math EOY: # 17
• Geometry EOY/Calculator Section: # 18
• Algebra I EOY/Non-Calc Section: # 6

Summary
• We will need to be responsive to PARCC states' needs as they develop over time
• Two major purposes associated with the use of technological innovation:
  – Improve the precision of measurement of the intended constructs
  – Improve the tools available to students during the assessment process to support student engagement

Advances in Large-Scale Assessment: A PARCC Update
Innovations in Item Functionality and Scoring
Michelle Richard
Pearson

Innovations in Functionality
• Use a class attribute inside a standard QTI interaction to give it a different context (see the sketch below)
• Expresses functionality specific to the current system, but defaults to the basic interaction in other systems
  – matchInteraction / Table Grid
    • class="table-grid" changes the interaction from drag-and-drop functionality to a matrix with checkboxes or radio buttons
    • Scores exactly the same with or without the class
  – textEntryInteraction / Equation Editor
    • class="tei-ee" changes a text response box to a box embedded with palettes that can be used to create MathML-based responses
    • Scoring via rubric in both representations

[Screenshots] Match Interaction with class "table-grid" added; Match Interaction with no class attribute

[Screenshots] textEntry Interaction with class "tei-ee" added; textEntry Interaction with no class attribute

New Item Types
• Fraction Model
  – Allows equivalent fractions to be modeled
• Function Graph
  – Appearance of the graph is driven by button selections
  – Evaluated against an equation and other parameters using record cardinality
• Interactive Numberline
  – Evaluates a solution plotted on a number line
• Select in Passage / Text Highlighting
  – Allows text to be selected directly from the passage as the answer to a question, without the need for an alternate source file

Multiple Delivery Formats
• Standard Computer/Laptop
  – Uses mouse controls
  – Differing monitor sizes
• Tablet/iPad
  – Uses touchscreen controls
  – Limited screen size (7" to 10.1")
• Cross-Browser / OS Compatibility
  – Functionality of interactions can differ across browsers and operating systems
• Accessibility
  – Delivering items that are both accessible and innovative is a fine line to walk
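To make the class-attribute mechanism described above concrete, here is a minimal sketch of how a delivery engine might branch on the class value of a standard QTI interaction and fall back to the default presentation when the class is missing or unrecognized. The matchInteraction element and the class values come from the slides; the parsing code and rendering labels are hypothetical, not PARCC or TestNav internals.

```python
# Sketch: choosing a presentation from the QTI "class" attribute.
import xml.etree.ElementTree as ET

QTI_NS = "{http://www.imsglobal.org/xsd/imsqti_v2p1}"

SAMPLE_ITEM = """
<assessmentItem xmlns="http://www.imsglobal.org/xsd/imsqti_v2p1">
  <itemBody>
    <matchInteraction responseIdentifier="RESPONSE" class="table-grid" maxAssociations="4">
      <!-- simpleMatchSet elements omitted for brevity -->
    </matchInteraction>
  </itemBody>
</assessmentItem>
"""

def presentation_for(elem):
    """Pick a renderer from the class attribute; unknown or missing classes
    fall back to the default QTI interaction."""
    tag = elem.tag.replace(QTI_NS, "")
    css_class = elem.get("class", "")
    if tag == "matchInteraction" and css_class == "table-grid":
        return "grid of checkboxes/radio buttons"
    if tag == "textEntryInteraction" and css_class == "tei-ee":
        return "equation editor with MathML palettes"
    return f"default {tag} rendering"

root = ET.fromstring(SAMPLE_ITEM)
for elem in root.iter():
    if elem.tag.endswith("Interaction"):
        print(presentation_for(elem))   # -> grid of checkboxes/radio buttons
```

Because scoring is defined by the standard response processing (or a rubric), the item scores the same whichever presentation a system renders.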
Innovations in Scoring
• Composite Items
  – Multi-part items that appear on a single screen
  – Item parts can be a single interaction or multiple interactions
  – Part scores are summed for the total score
• Partial Credit
  – A single-part item that is worth multiple points, or a composite item
  – Scoring logic evaluates the overall correctness of the item (e.g., at least half correct, or one less than fully correct) and assigns a partial-credit score
• Dependent Scoring
  – The score from one part of the item depends on the correctness of another part
  – Used to evaluate whether a student can both respond and support the response

Innovations in Scoring
• Custom Operators
  – QTI/APIP allows customized scoring logic to be built within the scoring engine and called from the XML
  – Allows more complex scoring models to be authored
• A few PARCC customOperators (rough equivalents are sketched below):
  – stringToNumber: converts a string that contains commas to a float value (e.g., 14,342 to 14342)
  – SubstringBefore and SubstringAfter: take the characters before or after specific character(s) and split them for evaluation (e.g., '3/5' to '3' and '5')
  – CountPointsThatSatisfyEquation: takes a text-based equation provided by the author and evaluates student-entered points against it; the output is the count of correct points, which can be used to award full or partial credit (e.g., the points (0, 3) and (-2, -5) against the equation y = 4x + 3)
  – IsCorrectByQuantity: counts cloned sources in a target for correctness
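The operator descriptions above reduce to simple string and coordinate manipulations. The sketch below is a rough, hypothetical Python rendering of that logic, using the slide's own examples; the real customOperators are declared in QTI/APIP XML and executed by the scoring engine, and the function names here only mirror the operator names.

```python
# Rough equivalents of the PARCC custom operators listed above (illustration only).
def string_to_number(text: str) -> float:
    """stringToNumber: '14,342' -> 14342.0 (commas stripped, parsed as a float)."""
    return float(text.replace(",", ""))

def substring_before_after(text: str, sep: str) -> tuple[str, str]:
    """SubstringBefore / SubstringAfter: '3/5' with separator '/' -> ('3', '5')."""
    before, _, after = text.partition(sep)
    return before, after

def count_points_that_satisfy_equation(points, slope, intercept, tol=1e-9):
    """CountPointsThatSatisfyEquation for a line y = slope*x + intercept.
    The real operator accepts an author-supplied equation string; a parsed
    slope/intercept pair stands in for that here."""
    return sum(1 for x, y in points if abs(y - (slope * x + intercept)) <= tol)

print(string_to_number("14,342"))                  # 14342.0
print(substring_before_after("3/5", "/"))          # ('3', '5')
# Both (0, 3) and (-2, -5) lie on y = 4x + 3, so the count is 2,
# which scoring logic could then map to full or partial credit.
print(count_points_that_satisfy_equation([(0, 3), (-2, -5)], slope=4, intercept=3))
```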
Advances in Large-Scale Assessment: A PARCC Update
Field Test Design and Analysis
Lora Monfils
Educational Testing Service

Overview
• Background on the Field Test
  – Purpose/goals
  – Constraints
• Field Test Design
• Sampling
• Psychometric Analysis
  – Item Analysis
  – Scoring and Scaling Studies
  – Comparability Studies

Purposes and Constraints
• Three primary purposes of the Field Test:
  1) To obtain data to assemble alternate or parallel forms for operational use in subsequent years;
  2) To evaluate methods for scoring and scaling the PBA and EOY (including vertical scaling) such that the resulting student scores are appropriate for their intended purposes and are comparable across forms, administrations, and years; and
  3) To yield data that are appropriate and sufficient to support the required psychometric and item research studies.
• Constraints:
  – Testing time for individual students must be kept to a minimum.
  – Field test items will be used in operational forms, so security must be maximized.

Field Test Design
• To the extent possible, the Field Test was designed to reflect future operational administrations
  – Two separate administrations – PBA in March, EOY in April/May
  – Dual-mode administration
  – PBA and EOY field test forms constructed to the full operational test blueprints and requirements
• Data collection design
  – Two conditions: 1) Full summative (PBA + EOY), 2) PBA or EOY but not both
  – Linking through common items across forms and conditions, and randomly equivalent groups

Additional Design Considerations
• Timing of the PBA FT and EOY FT relative to state operational testing
• Individual state participation requirements
• Test burden for participating districts/schools and students
• Number of FT forms – construction, administration
• Data collection for special studies
• Initial FT design modified to address these considerations
  – Trade-offs between the ideal and the practical, advantages and disadvantages

FT Design: Condition 1
• Description
  – Students participate in both the PBA and EOY FT administrations. Each Condition 1 form has two parts (1 PBA + 1 EOY); each student takes part 1 and part 2 of the assigned form. Assignment to forms is through spiraling at the student level (see the sketch below).
• Purpose
  – This condition most closely replicates the operational summative assessment. Data support the scoring and scaling studies and other research studies, and will also contribute to item statistics for operational form construction.
• Notes on Linking Items
  – Forms include common items: on-grade items for within-grade linking and adjacent-grade items for vertical scaling
    • EOY and math PBA forms include external matrix sections with off-grade items
    • ELA PBA forms share internal items across grades
  – For HS math EOCs, common items link the Traditional and Integrated sequences
  – Designated CBT-PBT pairs link across modes

FT Design: Condition 2
• Description
  – 2A: Students participate in the PBA administration only. PBA FT forms are spiraled at the student level.
  – 2B: Students participate in the EOY administration only. EOY FT forms are spiraled at the student level.
• Purpose
  – By administering the test components separately, statistical data can be generated while limiting testing time for individual students. Data will be used to obtain item statistics for operational form construction.
• Notes on Linking Items
  – EOY and PBA forms include common on-grade items linking within and across conditions
    • PBT forms also include adjacent-grade items for vertical scaling
    • Integrated Math forms also include adjacent-grade and Traditional Math items

Field Test Design: Number of Forms per Grade or Traditional Math EOC

| Condition | Form Type | March | April | ELA/Literacy CBT | ELA/Literacy PBT | Math CBT | Math PBT |
|---|---|---|---|---|---|---|---|
| 1 | FS¹ | PBA portion | EOY portion | 6² FS | 1 FS | 6 FS | 1 FS |
| 2A | PBA/MYA | PBA | n/a | 18 (16) | 6² | 12 (10) | 6 (5) |
| 2B | EOY | n/a | EOY | 9 | 5 | 9 | 6 |

¹ The Full Summative (FS) test consists of two parts – Part 1 is the PBA portion and Part 2 is the EOY portion.
² Except Grade 3, where there will be five forms.

Field Test Design: Number of Forms per Integrated Math EOC

| Condition | Form Type | March | April | IM 1 CBT | IM 1 PBT | IM 2 CBT | IM 2 PBT | IM 3 CBT | IM 3 PBT |
|---|---|---|---|---|---|---|---|---|---|
| 1 | FS¹ | PBA portion | EOY portion | 2 | 1 | 1 | 1 | 2 | 1 |
| 2A | PBA/MYA | PBA | n/a | 2 | 2 | 2 | 2 | 2 | 2 |
| 2B | EOY | n/a | EOY | 2 | 3 | 2 | 2 | 2 | 2 |

¹ The Full Summative (FS) test consists of two parts – Part 1 is the PBA portion and Part 2 is the EOY portion.
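All three conditions assign field test forms by spiraling at the student level. The sketch below shows one way such an assignment can work (an assumption for illustration; the actual assignment was handled by the registration/delivery system): forms are handed out in rotation so that roughly equal, randomly equivalent groups of students take each form.

```python
# Sketch: student-level spiraling of field test forms.
from itertools import cycle

def spiral_forms(student_ids, form_ids):
    """Assign forms to students in rotation (student-level spiraling)."""
    rotation = cycle(form_ids)
    return {student: next(rotation) for student in student_ids}

# Hypothetical class of six students and three field test forms.
assignments = spiral_forms(["s01", "s02", "s03", "s04", "s05", "s06"],
                           ["FT-A", "FT-B", "FT-C"])
print(assignments)   # s01->FT-A, s02->FT-B, s03->FT-C, s04->FT-A, ...
```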
Sample Size
• To support IRT scaling, the target was a minimum sample size of 1,200 valid cases per item (test form)
  – To achieve the target, samples were oversampled by approximately 50% for Condition 1 and 20% for Condition 2 to allow for attrition, non-response, etc.
  – Separate samples were drawn for each content/grade and test mode
  – Students were to test in one subject only – either Math or ELA/Literacy
  – Where targets were not met, some adjustments were made in the number of forms
• Note: Linking items/tasks appeared in more than one form; therefore, 1,200 valid cases per form were targeted to support well-estimated item parameters from each form for common-item linking.

Sampling Overview
• Targets established for each state
  – Based on proportional representation with a 2% minimum
  – Grade-level enrollments from 2012 NCES data
  – Reflected state special participation requests – impact distributed proportionally
• For each content/grade, schools were placed into 5 strata based on the % Proficient reported in NCES data for that content/grade (% Proficient Math or % Proficient ELA) – see the sketch below
  – New schools (e.g., those that opened subsequent to the 2012 NCES data) formed a sixth stratum
  – Test-level targets were distributed across strata – proportional to the number of schools in each stratum
  – Within strata, schools were sampled to meet the designated sample size requirements

Sampling Overview, cont.
• Schools were randomly selected within strata to meet the test target sample size.
  – Each sampled school contributed two classes (estimated as 20 students per class, thus 40 students) if grade-level enrollment permitted
  – In states that required all classes to participate, the sampled school contributed the number of students in that grade.
• Sampling of schools continued until targets were met for the strata and the test overall
  – Assuming a sufficient number of schools to sample from, meeting targets resulted in some overage, because selection within schools entailed groups of students (2 classes or the entire grade) rather than individuals
  – If there was an insufficient number of schools, sampling resulted in not meeting targets
• Samples evaluated at the state and PARCC level
  – ELA Prof, Math Prof, Econ Dis, SWD, LEP, Gender, Ethnicity

Sampling Overview, cont.
• Adjustments to the standard procedures were implemented in certain cases
  – To reduce over-sampling when selecting entire grades, an algorithm was implemented to minimize the size of the last school randomly selected to meet the target
  – When there were limitations on the number of schools available to sample from for a given test, due to special requests and/or low-volume curricula as in the case of Integrated Mathematics, the targets used were adjusted to allow sampling of students for all conditions and modes, albeit in smaller numbers
• Sampled schools were sent to states for approval
• Replacements were identified for schools not approved and for schools that declined to participate
  – Iterative process, with 3 rounds of recruitment
• A major collaborative effort
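The core of the sampling procedure described above can be summarized in a simplified sketch: group schools into five strata by percent proficient, spread the student target across strata in proportion to the number of schools in each, and draw schools at random within each stratum, each contributing roughly two classes (~40 students), until the stratum target is met. This is an illustration under those assumptions only; the operational sampling also handled the new-school stratum, whole-grade participation, state approvals, and replacement rounds.

```python
# Simplified sketch of the stratified school-sampling procedure (not the
# operational sampling program).
import random

STUDENTS_PER_SCHOOL = 40   # two classes of roughly 20 students each

def sample_schools(schools, target_students, n_strata=5, seed=2014):
    """schools: list of dicts with 'id' and 'pct_proficient' (0-100)."""
    rng = random.Random(seed)
    strata = {k: [] for k in range(n_strata)}
    for school in schools:
        # Stratum 0 = lowest fifth of % proficient, stratum 4 = highest fifth.
        k = min(int(school["pct_proficient"] // (100 / n_strata)), n_strata - 1)
        strata[k].append(school)
    sampled = []
    for members in strata.values():
        # Stratum target proportional to the number of schools in the stratum.
        stratum_target = target_students * len(members) / len(schools)
        rng.shuffle(members)
        drawn_students = 0
        for school in members:
            if drawn_students >= stratum_target:
                break
            sampled.append(school["id"])
            drawn_students += STUDENTS_PER_SCHOOL   # some overage is expected
    return sampled
```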
Field Test Analyses: Overview
• Research questions to inform the operational assessments
  – Innovative items
    • Evaluate item/task performance
    • Implications for future operational forms
  – Scoring
    • Combining PBA and EOY to yield a summative score
    • Subscore reporting
  – Scaling
    • IRT model selection
    • Feasibility of a vertical scale
  – Special studies
    • Mode/device comparability
    • HS Math EOC comparability

Field Test Analyses – Classical Item and Test Analyses
• Evaluation of Field Test Item/Task Performance
  – Classical Item Analyses
    • Classical item difficulty indices (p-values; SR and CR items)
    • The percentage of students choosing each response option (SR items)
    • Item-total correlation (SR and CR items)
    • Distractor-total correlation (SR items)
    • Score point distribution (CR items)
  – Differential Item Functioning
    • Groups determined based on policy considerations – gender, ethnicity, special populations (SWD, ELL, EconDis)
    • Mantel-Haenszel procedures, logistic regression
  – Reliability
    • PBA, EOY, FS (PBA + EOY)
    • Total sample, plus subgroups of interest

Field Test Analyses – Dimensionality Analyses
• Prior to IRT scaling, dimensionality studies will be conducted within each summative test component (PBA and EOY) and grade, across the PBA and EOY within each grade, and across grades.
• Dimensionality analyses are necessary for determining a) evidence of essential unidimensionality for IRT scaling, b) the score aggregation method for the PBA and EOY tests, and c) the feasibility and structure of a vertical scale.
• Both exploratory and confirmatory analyses

Field Test Analyses – IRT Model Selection Considerations
• Analyses to inform IRT model selection (see the sketch below)
  – Underlying assumptions of the different IRT models
    • Dimensionality
    • Equal discrimination in Rasch/PC
    • Minimal guessing in Rasch and 2PL
    • Local independence / minimal testlet effect
  – Model simplicity or parsimony
  – Model fit
    • Goodness-of-fit tests
    • Plots of empirical data vs. model-based ICCs
  – Implications for vertical scales

Field Test Analyses – Explore Viability of Vertical Scale
• Issues related to a vertical scale for the Mathematics EOCs
  – Ideally, given sufficiently large representative samples, use Integrated Math 1, 2, 3
    • For Traditional Math, the progression Gr 8 -> Alg 1 -> Alg 2; Gr 8 -> Geom
    • Common ES statements/items across Integrated and Traditional EOCs (EOC comparability study)
    • Integrated Math samples smaller than planned; results will be preliminary
• Implications of the rates of CCSS implementation in Spring 2014
  – Variation in grade-to-grade performance within and across states
    • Due to state transition timelines
    • Due to district/school implementation and other factors
  – Impact on the relative difficulty and discrimination of items
    • From grade to grade in the vertical linking sets
    • Within grade on "operational core" items
  – Suggests results may differ substantially in 2015 and beyond until the CCSS are fully implemented
• Consider periodic evaluation of scale stability

Field Test Analyses – To Inform Operational Scores
• Analyses to inform operational summative scores
  – For Math, analyses to investigate combining the PBA and EOY into a single summative Math scale score
  – For ELA, analyses to investigate producing separate scale scores for Writing and Reading, and a single summative ELA/L scale score
  – Considerations for the choice of score scale
• Investigation of estimation procedures to support subscore reporting
• Comparability across mode/device
• Comparability across HS Math EOCs for Traditional and Integrated courses
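For reference, the model-based item characteristic curves (ICCs) behind the "empirical data vs. model-based ICCs" fit plots mentioned under IRT model selection have a simple closed form. The sketch below uses generic IRT notation (ability theta, discrimination a, difficulty b, lower asymptote c) with invented parameter values; it is not PARCC's calibration software. Fixing a common a and c = 0 corresponds to the Rasch family, freeing a gives the 2PL, and adding c gives the 3PL.

```python
# Item characteristic curves for the candidate IRT models (illustrative values).
import math

def icc(theta, a=1.0, b=0.0, c=0.0):
    """Probability of a correct response at ability theta (logistic metric)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

for theta in (-2, -1, 0, 1, 2):
    print(theta,
          round(icc(theta, a=1.0, b=0.0), 3),           # Rasch-type item
          round(icc(theta, a=1.4, b=0.5), 3),           # 2PL item
          round(icc(theta, a=1.4, b=0.5, c=0.2), 3))    # 3PL item
```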
Questions?
• Questions?
• Thank you!

Advances in Large-Scale Assessment: A PARCC Update
PARCC Mode and Device Comparability Research
Laurie Davis
Pearson

Why Conduct Comparability Research?
• PARCC's ultimate goal is digital delivery of the ELA/Literacy and Mathematics assessments using the widest variety of devices that will support interchangeable scores.
• Initially to include:
  – Desktop computers
  – Laptop computers
  – Tablets (9.7" or larger)
• Strict comparability (score interchangeability) across computer-based tests and paper-based tests is not a PARCC goal
• However… paper will be provided as an option for schools where the technology infrastructure is not ready for digital delivery

Mode vs. Device Comparability
• Mode Comparability
  – TEIs on computer only
  – Score interchangeability not expected
• Device Comparability
  – TEIs on all devices
  – Score interchangeability expected

Mode: Computer vs. Paper and Pencil
• Initial comparability studies planned as part of the PARCC field test analyses
• All grades and subjects
• Schools assigned to either paper or computer mode
• Goal: Evaluate the degree to which comparability can be obtained through scaling items onto a single metric, linking, or concordance

Mode Comparability: Item and Test Level Analyses
• Classical item analysis
  – Differences, rank order of p-values
• DIF
  – CBT reference, PBT focal
• Factor structure, dimensionality
• Reliability
• IRT analysis – informed by the dimensionality analysis
  – Separate calibrations
  – Link PBT to CBT with the Stocking-Lord (S-L) procedure
• Score adjustment
• Evaluate the resulting score distributions

Device: Computer vs. Tablet
• A 2-part research effort
• Part I: Cognitive lab (qualitative) – summer 2013
  – 72 students in grades 4, 8, and 11 from CO and AR
• Part II: Comparability study (quantitative) – 2014, using field test data
• Goal: Determine the statistical and practical significance of any device effects

Device Comparability: Item and Test Level Analyses
Item/Task Level Analyses
1. Comparison of task p-values/means across conditions
2. Comparison of Item Response Theory (IRT) item difficulties across conditions
3. Differential item functioning (DIF) analysis
Test Level Analyses
1. Reliability
2. Validity – relationship of PARCC scores to external measures
3. Score interpretations
  • Differences in estimated scale scores across device conditions
  • Statistically significant difference = greater than 2 SEs of the linking (see the sketch below)

Device Comparability: Grades and Subjects Studied
• The 2014 device comparability study will include:
  • Grade 4 ELA/Literacy
  • Grade 4 Mathematics
  • Grade 8 ELA/Literacy
  • Grade 8 Mathematics
  • Grade 10 ELA/Literacy
  • Geometry

Device Comparability Study: What We Planned
• Used data entered by states/schools into the Technology Readiness Tool to evaluate tablet availability
• Results indicate fewer than 5% of devices in classrooms are tablets
• If randomly distributed across forms within the field test, approximately 60 students per form would be on a tablet
• Targeted sampling for the tablet sample is needed (n = 600 per grade/subject)
• The computer sample will come from the field test
• Groups will be matched prior to analysis

Device Comparability Study: What We Got
• The Grade 8 and high school studies use random assignment (computer and tablet) of students from Burlington, MA
  – Approximately 250 students per grade/subject
  – ~125 students per study condition
• The Grade 4 study uses a matched sample from LA, AR, and MA
  – Students assigned to the tablet condition are matched to students who tested on computer in the field test
  – Approximately 300-400 students per subject

A Sampling of Student Survey Results
PARCC PBA Field Test Administration
Burlington, MA Grade 8 Students
[Survey result charts; no text available.]
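The device-comparability criterion quoted above ("statistically significant difference = greater than 2 SEs of the linking") can be read as a simple decision rule. The sketch below applies that rule to one grade/subject comparison; the function and the numbers are hypothetical illustrations, not the study's analysis code.

```python
# Sketch: flag a device effect when the scale-score difference exceeds
# two standard errors of the linking.
def flag_device_effect(mean_computer, mean_tablet, se_linking, criterion_se=2.0):
    """Return (difference, flagged?) for one grade/subject comparison."""
    diff = mean_tablet - mean_computer
    return diff, abs(diff) > criterion_se * se_linking

# Hypothetical scale-score means and linking SE, for illustration only.
difference, flagged = flag_device_effect(mean_computer=742.0,
                                         mean_tablet=739.5,
                                         se_linking=1.8)
print(f"difference = {difference:+.1f}, flagged = {flagged}")
```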
Advances in Large-Scale Assessment: A PARCC Update
Discussion: Five Thoughts
Mike Russell
Center for Assessment

1. Perspective
Freedom 7, 1961 (Alan Shepard); Science Fiction; Space Colonization

1. Perspective
Gaming; Testing

2. Innovation & Measurement Value
SS Savannah, 1819 – First Trans-Atlantic Steamship Crossing

3. Informed by Research

4. Interoperability

5. Competing Tensions
CCSS/Content; Innovation; Interoperability; Time