A Tale of Two Tests: STANAG and CEFR
Comparing the Results of Side-by-Side Testing of Reading Proficiency
BILC Conference, May 2010, Istanbul, Turkey
Dr. Elvira Swender, ACTFL

With apologies to the author
We had a “Dickens of a time” with this study.

Overview
- Two systems: STANAG and CEFR
- Two tests of reading proficiency
  - BAT-Reading
  - Leipzig Test of Reading Proficiency (LTRP)
- The side-by-side study
- Observations
- Questions

Two Systems
Why is there a need to relate STANAG and CEFR?
- To recognize linguistic abilities of military personnel in civilian society
- To provide a framework to military institutions in nation states operating STANAG qualifications who need to equate them with CEFR for the purpose of gaining civilian recognition of military qualifications
- To provide guidance to employers, trainers, and non-language experts on how to interpret/evaluate CEFR qualifications
- To identify competence gaps and thereby determine whether an individual is capable of undertaking a job requiring a given SLP
- To allow informed decisions to be made on appropriate linguistic competence

“Birds of a Feather”

Broad Questions
- Can the two systems be compared?
- Are the two systems related?
- Can the two systems be aligned?
- Can the two systems be equated?

Comparing CEFR and STANAG: Similarities
- Describe language abilities on a scale from little or no ability to that of a highly articulate speaker
  - CEFR: A1, A2, B1, B2, C1, C2
  - STANAG: 0+, 1, 1+, 2, 2+, 3, 3+, 4, 4+, 5
- Criterion referenced
- Address speaking, listening, reading, and writing
- Contain can-do statements
- Describe tasks (functions), contexts, and expectations for accuracy
  - CEFR: all criteria, some of the time
  - STANAG: all criteria, all of the time

A Summary of the Major Contrasts
CEFR
- The primary purpose is to check learners’ progress in developing communicative competence within a specific course of study.
STANAG
- The primary purpose is to test individuals’ general proficiency across a wide range of topics, regardless of their course of study.
CEFR
- The primary users of the information are teachers and students.
- By design, the CEFR is underspecified for testing general, real-world proficiency.
STANAG
- The primary users of the information are teachers, administrators, and employers.
- By design, STANAG is underspecified for measuring step-by-step progress within a specific curriculum.

About this Study
- University of Leipzig, April 19-23, 2010
- Proctored online tests in a computer lab
- Goal was to involve five groups with 20 participants each
  - Levels A1, A2, B1, B2, C1 according to course enrolled
- Split test design
  - Half of the participants in each group took the BAT-R test first; the other half took the RPT-E first
- Tests taken on different days
  - 2 to 3 days apart, depending on group
- 90 minutes per test

Characteristics of Participants
- Gender: Female 65%; Male 35%
- Age: Average 25 (range: 19-63)
- First language: German (85%)
  - Arabic, Russian, Polish, Brazilian, Chinese, Thai
- Mean number of years of English study in school
  - German students: 8.7 years
  - Foreign students: 5.1 years
- Enrolled in 1 of 5 different levels
  - English Language Institute to English teacher trainees

BAT Reading Test
- Test of English reading proficiency
- Advisory scores for calibrating national proficiency tests
- STANAG 6001 (version 3), Levels 1, 2, 3
- Internet-delivered and computer scored
- Developed by BILC Test Working Group
- Delivered by ACTFL

Format
- Criterion-referenced tests
  - Allow for direct application of the STANAG Proficiency Scale
- Texts and tasks are aligned by level
- Each proficiency level is tested separately
- Test takers take all items for Levels 1, 2, 3
- 20 texts at each level
- One item with 4 multiple-choice responses per text

Scoring Criteria
- The proficiency rating is assigned based on two separate scores
- Must show “mastery” at a level to be assigned that level
- “Floor”: sustained ability across a range of tasks and contexts specific
  to one level
- “Ceiling”: non-sustained ability at the next higher proficiency level
- Non-compensatory scoring
- Performance at the next higher level provides evidence of random, emerging, or developing proficiency at the next higher level.
- Developing proficiency at the next higher level indicates a + rating.

Leipzig Test of Reading Proficiency
- Test of English reading proficiency for entering and exiting students at universities in the state of Saxony, Germany
- To determine proficiency levels from A1 to C1 according to the CEFR
- For placement and certification purposes
  - Entrance and exit requirements in all subjects
- Developed by the University of Leipzig under a grant from the state of Saxony

Format
- 5 texts with 3 questions each per level
- Multiple-choice questions
  - 15 items per level
  - One correct answer and three distracters
- Entire series of tests
  - Combine 2 or 3 adjoining levels: A1-B1 or B1-B2 or B1-C1
- Version of the test used in this study: B1-C1

Level A1
- 5 texts: 60-100 words each
- Major tasks and functions: Topic recognition and comprehension of simple single facts
- Content: Basic personal and social needs
- Text type: Very short, simple, straightforward texts: notes, postcards, simple instructions and directions
- 3 MC questions per text: global, selective, detail
[Screenshot of A1 item to come]

Level C1
- 5 texts: 200-300 words each
- Major tasks and functions: Complex information processing including inferences, hypotheses, and nuances
- Content: Academic, professional, and literary material
- Text type: Op/ed pieces, analyses and commentaries, detailed technical reports, literary texts
- 3 MC questions per text: global, detail, inference

Scoring Criteria
- Total number of points
- Rate highest levels that have a combined total of at least 18 points, with the lower level having at least 11 points (70%)
  - 18-24 points (60-80%) = lower level
  - 25-30 points (81-100%) = higher level

Findings
BAT-R rating (rows) by LTRP level (columns)

BAT-R    A1   A2   B1   B2   C1   TOTAL
0         1    .    .    .    .       1
1         2    4    1    .    .       7
1+        .    4    6    .    .      10
2         .    1   16    6    3      26
2+        .    .    .    6    1       7
3         .    .    .    5   10      15
TOTAL     3    9   23   17   14      66
[Figure: Scatter Plot of Total Raw Scores; axes: BAT-R Total Score and LTRP Total Score]
Correlation of Total Raw Scores: r = .905, p < .001

With the current data, one could say
- At the lowest and highest ends of the scales there is alignment
  - No one who was rated 1 was also rated B2 or C1
  - No one who was rated 3 was rated A1, A2, or B1
- The middle ranges are where there is the least amount of alignment
  - A BAT-R 2 can be anything from A2 to C1

BAT-R rating (rows) by LTRP level (columns)

BAT-R    A1   A2   B1   B2   C1   TOTAL
0         1    .    .    .    .       1
1         2    4    1    .    .       7
1+        .    4    6    .    .      10
2         .    1   16    6    3      26
2+        .    .    .    6    1       7
3         .    .    .    5   10      15
TOTAL     3    9   23   17   14      66

With the current data, one could say

BAT-R    LTRP
0        0 or A1
1        A1 or A2 (mostly A2)
1+       A2 or B1 (mostly B1)
2        A2, B1, B2, or C1 (mostly B1)
2+       B2 or C1 (mostly B2)
3        B2 or C1 (mostly C1)

With the current data, one could say

LTRP     BAT-R
A1       0 or 1 (mostly 1)
A2       1, 1+ or 2 (mostly 1)
B1       1+ or 2 (mostly 2)
B2       2, 2+ or 3 (mostly 2)
C1       2, 2+ or 3 (mostly 3)

Estimated Probability
Estimated Probability of a BAT-R Rating Based on LTRP Rating

                 BAT-R Rating
LTRP Rating    0      1      1+     2      2+     3
0              0.93   0.07   .      .      .      .
A1             0.30   0.67   0.03   .      .      .
A2             0.01   0.49   0.40   0.09   .      .
B1             .      0.03   0.21   0.74   0.01   0.01
B2             .      .      0.01   0.57   0.23   0.18
C1             .      .      .      0.04   0.08   0.88

Shaded values are highest probability on the row.

What is the probability?
That a BAT-R 2 is also an LTRP:
  A2: 9%   B1: 74%   B2: 57%   C1: 5%

What is the probability?
That a BAT-R 3 is also an LTRP:
  B1: 9%   B2: 18%   C1: 88%

What is the probability?
That an LTRP B1 is also a BAT-R:
  1: 3%   1+: 21%   2: 74%   2+: 1%   3: 1%

What is the probability?
That an LTRP B2 is also a BAT-R:
  1+: 1%   2: 57%   2+: 23%   3: 18%

Answering the Broad Questions
- Can the two systems be compared? YES
- Are the two systems related? YES
- Can the two systems be aligned? Somewhat
- Can the two systems be equated? Probably not

[Figure: “Heat Chart”, STANAG 6001 and CEFR]

When comparing testing systems
- Ask about the purpose of the test
  - Placement, progress, prove a level, etc.
- Ask about what the test is testing
  - Is it a test of achievement, performance, proficiency?
  - Does it test spontaneous abilities or rehearsed performance?
- Ask about how the test scores are determined
  - Non-compensatory: prove a floor and ceiling
  - Total points
- Ask if research exists

Answers from a CEFR Expert
- CEFR is not one system.
- It is NOT intended to be used to transfer scores from one country to the next or from one language to another, but rather to set a framework within which educators can build curricula.
- Not a harmonisation project
- Alignment is problematic because we do not know what we are aligning.
- Not a matter of alignment or equivalency but a matter of relationship
- The scale is an origin for comparison.
- The scale functions as exemplars and activities.
- The scale is a meta-framework for learning and teaching.
Conversation with Nick Saville, Cambridge, England, April 15, 2010

In Closing
It is a far, far better thing that we do than we have ever done to know how to use test scores.

Questions?
Contact: eswender@actfl.org

Extra slides
Crosstabulation of Test Results
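
One way to check the Findings crosstabulation is to recompute its margins and the raw share of BAT-R ratings within a single LTRP level. The Python sketch below is illustrative only: the counts are transcribed from the Findings slide, blank cells are assumed to be zero, and the names COUNTS, row_totals, column_totals, and observed_shares are hypothetical helpers that do not come from the study.

```python
# Counts transcribed from the Findings crosstabulation.
# Assumption: cells left blank on the slide are treated as zero.
LTRP_LEVELS = ["A1", "A2", "B1", "B2", "C1"]
BAT_R_RATINGS = ["0", "1", "1+", "2", "2+", "3"]

# COUNTS[bat_r_rating] lists participant counts in LTRP_LEVELS order.
COUNTS = {
    "0": [1, 0, 0, 0, 0],
    "1": [2, 4, 1, 0, 0],
    "1+": [0, 4, 6, 0, 0],
    "2": [0, 1, 16, 6, 3],
    "2+": [0, 0, 0, 6, 1],
    "3": [0, 0, 0, 5, 10],
}


def row_totals(counts):
    """Number of participants at each BAT-R rating."""
    return {rating: sum(counts[rating]) for rating in BAT_R_RATINGS}


def column_totals(counts):
    """Number of participants at each LTRP level."""
    return {
        level: sum(counts[rating][i] for rating in BAT_R_RATINGS)
        for i, level in enumerate(LTRP_LEVELS)
    }


def observed_shares(counts, ltrp_level):
    """Raw share of each BAT-R rating among participants at one LTRP level."""
    i = LTRP_LEVELS.index(ltrp_level)
    column = {rating: counts[rating][i] for rating in BAT_R_RATINGS}
    total = sum(column.values())
    return {rating: n / total for rating, n in column.items()}


if __name__ == "__main__":
    print("BAT-R totals:", row_totals(COUNTS))
    print("LTRP totals:", column_totals(COUNTS))
    for rating, share in observed_shares(COUNTS, "B1").items():
        print(f"LTRP B1 -> BAT-R {rating}: {share:.2f}")
```

These raw shares come from only 66 participants and need not match the Estimated Probability table, since the slides do not say how those estimates were produced.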