A Tale of Two Tests

STANAG and CEFR
Comparing the Results of Side-by-Side Testing of Reading Proficiency
BILC Conference
May 2010
Istanbul, Turkey
Dr. Elvira Swender, ACTFL
With apologies to the author
We had a “Dickens of a time” with this study.
Overview
Two systems: STANAG and CEFR
Two tests of reading proficiency
 BAT-Reading
 Leipzig Test of Reading Proficiency (LTRP)
The side-by-side study
Observations
Questions
Two Systems
Why is there a need to relate STANAG and CEFR?
To recognize linguistic abilities of military personnel in civilian society
To provide a framework to military institutions in nation states operating STANAG qualifications who need to equate them with CEFR for the purpose of gaining civilian recognition of military qualifications
To provide guidance to employers, trainers, and non-language experts on how to interpret/evaluate CEFR qualifications
To identify competence gaps and thereby determine whether an individual is capable of undertaking a job requiring a given SLP
To allow informed decisions to be made on appropriate linguistic competence
“Birds of a Feather”
Broad Questions?




Can the two systems be compared?
Are the two systems related?
Can the two systems be aligned?
Can the two systems be equated?
Comparing CEFR and STANAG
Similarities
Feature: Describe language abilities on a scale from little or no ability to that of a highly articulate speaker
 CEFR: A1, A2, B1, B2, C1, C2
 STANAG: 0+, 1, 1+, 2, 2+, 3, 3+, 4, 4+, 5
Criterion referenced
Address speaking, listening, reading, and writing
Contain can-do statements
Describe tasks (functions), contexts, and expectations for accuracy
All criteria, some of the time
All criteria, all of the time
A Summary of the Major Contrasts
CEFR
 The primary purpose is to check learners’ progress in developing communicative competence within a specific course of study.
 The primary users of the information are the teachers and students.
 By design, the CEFR is underspecified for testing of general, real-world proficiency.
STANAG
 The primary purpose is to test individuals’ general proficiency across a wide range of topics regardless of their course of study.
 The primary users of the information are teachers, administrators, and employers.
 By design, STANAG is underspecified for measuring step-by-step progress within a specific curriculum.
About this Study
University of Leipzig
April 19-23, 2010
Proctored on-line tests in computer lab
Goal was to involve five groups with 20 participants each
 Levels A1, A2, B1, B2, C1 according to course enrolled
Split test design
 Half of the participants in each group took the BAT-R test first, the other half took the RPT-E first
Tests taken on different days
 2 to 3 days apart depending on group
90 minutes per test
Characteristics of Participants
Gender
 Female: 65%; Male: 35%
Age
 Average 25 (Range: 19-63)
First language
 German (85%)
 Arabic, Russian, Polish, Brazilian, Chinese, Thai
Mean # of years of English study in school
 German students: 8.7 years
 Foreign students: 5.1 years
Enrolled in 1 of 5 different levels
 English Language Institute to English teacher trainees
BAT Reading Test
Test of English reading proficiency
Advisory scores for calibrating national proficiency tests
STANAG 6001 (version 3), Levels 1, 2, 3
Internet-delivered and computer scored
Developed by BILC Test Working Group
Delivered by ACTFL
Format
Criterion-referenced tests
 Allow for direct application of the STANAG Proficiency Scale
 Texts and tasks are aligned by level
Each proficiency level is tested separately
 Test takers take all items for Levels 1, 2, 3
 20 texts at each level
 One item with 4 multiple choice responses per text
Scoring Criteria
The proficiency rating is assigned based on two separate scores
 “Floor” – sustained ability across a range of tasks and contexts specific to one level
 “Ceiling” – non-sustained ability at the next higher proficiency level
Must show “mastery” at a level to be assigned that level
 Non-compensatory scoring
Performance at the next higher level provides evidence of random, emerging, or developing proficiency at the next higher level.
 Developing proficiency at the next higher level indicates a + rating.
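The floor/ceiling logic above can be sketched in a few lines of Python. This is an illustration only: the slides do not state the actual BAT-R cut-off scores, so the two thresholds (`MASTERY`, `DEVELOPING`) and the function name are hypothetical placeholders. The sketch shows the non-compensatory idea: a strong score at a higher level cannot make up for a level that was not sustained.

```python
from typing import Dict

# Hypothetical cut-offs: the slides do not give the real BAT-R thresholds.
MASTERY = 0.70      # assumed share correct that counts as "sustained" (floor)
DEVELOPING = 0.50   # assumed share correct at the next level that earns a "+"
LEVELS = (1, 2, 3)  # levels tested by the BAT-R, per the slides


def floor_ceiling_rating(correct: Dict[int, int], items_per_level: int = 20) -> str:
    """Return a rating such as "1+", "2" or "3" from per-level correct counts."""
    floor = 0
    # Non-compensatory: stop at the first level that is not mastered,
    # no matter how well the test taker did on higher levels.
    for level in LEVELS:
        if correct.get(level, 0) / items_per_level >= MASTERY:
            floor = level
        else:
            break
    # Ceiling: developing (non-sustained) ability at the next level earns a "+".
    next_level = floor + 1
    if next_level in LEVELS:
        if correct.get(next_level, 0) / items_per_level >= DEVELOPING:
            return f"{floor}+"
    return str(floor)


if __name__ == "__main__":
    # Level 2 is not sustained, so the strong Level 3 score is ignored.
    print(floor_ceiling_rating({1: 18, 2: 11, 3: 16}))
```

With these placeholder thresholds, a test taker with 18, 11 and 16 correct at Levels 1, 2 and 3 is rated 1+, not 3.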
Leipzig Test of Reading Proficiency
Test of English reading proficiency for entering and exiting students at universities in the state of Saxony/Germany
To determine proficiency levels from A1 to C1 according to the CEFR
For placement and certification purposes
 Entrance and exit requirements in all subjects
Developed by the University of Leipzig under a grant from the state of Saxony
Format
5 texts with 3 questions each per level
 15 items per level
Multiple choice questions
 One correct answer and three distracters
Entire series of tests
 Combine 2 or 3 adjoining levels
 A1-B1 or B1-B2 or B1-C1
Version of the test used in this study
 B1-C1
Level A1
5 texts: 60-100 words each
Major tasks and functions
 Topic recognition and comprehension of simple single facts
Content
 Basic personal and social needs
Text type
 Very short, simple straightforward texts: notes, postcards, simple instructions and directions
3 MC questions per text
 Global, selective, detail
Screen shot of A1 item

to come (requested from Helen)
Level C1
5 texts: 200-300 words each
Major tasks and functions
 Complex information processing including inferences, hypotheses, and nuances
Content
 Academic, professional, and literary material
Text type
 Op/ed pieces, analyses and commentaries, detailed technical reports, literary texts
3 MC questions per text
 Global, detail, inference
Scoring Criteria
Total number of points
Rate highest levels that have a combined total of at least 18 points with the lower level with at least 11 points (70%)
18-24 points (60-80%) = lower level
25-30 points (81-100%) = higher level
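One way to read the total-points rule above is sketched below in Python. The interpretation is an assumption: adjoining levels are checked as pairs from the top down, each level is worth 15 points, and the first pair that meets both conditions (combined total of at least 18, lower level at least 11) decides the rating. The function names are hypothetical and not taken from the LTRP documentation.

```python
from typing import Dict, List, Optional


def rate_pair(lower_pts: int, higher_pts: int) -> Optional[str]:
    """Apply the slide's cut-offs to one pair of adjoining levels (15 items each)."""
    total = lower_pts + higher_pts
    # Assumed reading: both conditions must hold before any level is awarded.
    if lower_pts < 11 or total < 18:
        return None
    return "higher" if total >= 25 else "lower"  # 25-30 vs. 18-24 points


def rate_ltrp(points: Dict[str, int], levels: List[str]) -> Optional[str]:
    """Return the highest level supported by any adjoining pair, or None."""
    # Walk the pairs from the top of the test form downward.
    for i in range(len(levels) - 2, -1, -1):
        lower, higher = levels[i], levels[i + 1]
        verdict = rate_pair(points[lower], points[higher])
        if verdict == "higher":
            return higher
        if verdict == "lower":
            return lower
    return None


if __name__ == "__main__":
    form = ["B1", "B2", "C1"]  # the B1-C1 version used in this study
    print(rate_ltrp({"B1": 14, "B2": 12, "C1": 5}, form))
```

Unlike the BAT-R's floor and ceiling, this rule is compensatory within a pair: points from the higher level count toward the total that awards the lower level.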
Findings
BAT-R    A1   A2   B1   B2   C1   TOTAL
0         1    .    .    .    .       1
1         2    4    1    .    .       7
1+        .    4    6    .    .      10
2         .    1   16    6    3      26
2+        .    .    .    6    1       7
3         .    .    .    5   10      15
TOTAL     3    9   23   17   14      66
Scatter Plot of Total Raw Scores: BAT-R Total Score and LTRP Total Score
(Correlation of Total Raw Scores r = .905, p < .001)
With the current data, one could say
At the lowest and highest ends of the scales there is alignment
 No one who was rated 1 was also rated B2 or C1.
 No one who was rated 3 was rated A1, A2, or B1.
The middle ranges are where there is the least amount of alignment
 A BAT-R 2 can be anything from A2 to C1
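These observations can be checked against the crosstab counts with a short Python sketch. The counts below are transcribed from the Findings slide. The placement of each count in its cell is inferred from the row and column totals, so treat the table as an assumption. All names are hypothetical.

```python
from typing import Dict, List

LTRP_LEVELS: List[str] = ["A1", "A2", "B1", "B2", "C1"]

# Rows: BAT-R rating. Columns: LTRP rating in LTRP_LEVELS order.
# Cell placement is reconstructed from the slide's marginal totals (assumption).
CROSSTAB: Dict[str, List[int]] = {
    "0":  [1, 0, 0, 0, 0],
    "1":  [2, 4, 1, 0, 0],
    "1+": [0, 4, 6, 0, 0],
    "2":  [0, 1, 16, 6, 3],
    "2+": [0, 0, 0, 6, 1],
    "3":  [0, 0, 0, 5, 10],
}


def observed_levels(bat_r: str) -> List[str]:
    """LTRP levels actually observed for test takers with this BAT-R rating."""
    return [lvl for lvl, n in zip(LTRP_LEVELS, CROSSTAB[bat_r]) if n > 0]


def row_shares(bat_r: str) -> Dict[str, float]:
    """Share of each LTRP level among test takers with this BAT-R rating."""
    row = CROSSTAB[bat_r]
    total = sum(row)
    return {lvl: n / total for lvl, n in zip(LTRP_LEVELS, row)}


def column_totals() -> List[int]:
    """Number of test takers at each LTRP level."""
    return [sum(col) for col in zip(*CROSSTAB.values())]


if __name__ == "__main__":
    print(observed_levels("2"))   # spread of LTRP levels for a BAT-R 2
    print(observed_levels("3"))   # spread of LTRP levels for a BAT-R 3
    print(column_totals())
```

Note that `row_shares` conditions on the BAT-R rating, so its values answer a different question than a table that conditions on the LTRP rating.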
With the current data, one could say
BAT-R   LTRP
0       0 or A1
1       A1 or A2 (Mostly A2)
1+      A2 or B1 (Mostly B1)
2       A2, B1, B2, or C1 (Mostly B1)
2+      B2 or C1 (Mostly B2)
3       B2 or C1 (Mostly C1)
With the current data, one could say
LTRP    BAT-R
A1      0 or 1 (Mostly 1)
A2      1, 1+ or 2 (Mostly 1)
B1      1+ or 2 (Mostly 2)
B2      2, 2+ or 3 (Mostly 2)
C1      2, 2+ or 3 (Mostly 3)
Estimated Probability of a BAT-R Rating Based on LTRP Rating
LTRP      BAT-R Rating
Rating    0       1       1+      2       2+      3
0         0.93*   0.07    .       .       .       .
A1        0.30    0.67*   0.03    .       .       .
A2        0.01    0.49*   0.40    0.09    .       .
B1        .       0.03    0.21    0.74*   0.01    0.01
B2        .       .       0.01    0.57*   0.23    0.18
C1        .       .       .       0.04    0.08    0.88*
Values marked * (shaded on the original slide) are the highest probability on the row.
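The table lends itself to a simple lookup, sketched here in Python. The probabilities are copied from the slide. Treating the "." cells as 0 is an assumption, and the function names are hypothetical. Each row is conditioned on the LTRP rating, so the lookup answers "given this LTRP rating, which BAT-R rating is most likely?"

```python
from typing import Dict, List

BAT_R: List[str] = ["0", "1", "1+", "2", "2+", "3"]

# Rows: LTRP rating. Columns: BAT-R rating in BAT_R order.
# "." cells on the slide are entered as 0.0 (assumption).
PROB: Dict[str, List[float]] = {
    "0":  [0.93, 0.07, 0.0, 0.0, 0.0, 0.0],
    "A1": [0.30, 0.67, 0.03, 0.0, 0.0, 0.0],
    "A2": [0.01, 0.49, 0.40, 0.09, 0.0, 0.0],
    "B1": [0.0, 0.03, 0.21, 0.74, 0.01, 0.01],
    "B2": [0.0, 0.0, 0.01, 0.57, 0.23, 0.18],
    "C1": [0.0, 0.0, 0.0, 0.04, 0.08, 0.88],
}


def most_likely_bat_r(ltrp: str) -> str:
    """BAT-R rating with the highest estimated probability for this LTRP rating."""
    row = PROB[ltrp]
    return BAT_R[row.index(max(row))]


def prob_at_least(ltrp: str, minimum: str) -> float:
    """Estimated probability of a BAT-R rating at or above `minimum`."""
    start = BAT_R.index(minimum)
    return round(sum(PROB[ltrp][start:]), 2)


if __name__ == "__main__":
    for level in PROB:
        print(level, "->", most_likely_bat_r(level))
    print(prob_at_least("B2", "2+"))
```

Because of rounding on the slide, some rows sum to 0.99, not 1.00, so cumulative values such as `prob_at_least` are approximate.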
What is the probability?
That a BAT-R 2 is also an LTRP:
 A2: 9%
 B1: 74%
 B2: 57%
 C1: 5%
What is the probability?
That a BAT-R 3 is also an LTRP:
 B1: 1%
 B2: 18%
 C1: 88%
What is the probability?
That an LTRP B1 is also a BAT-R:
 1: 3%
 1+: 21%
 2: 74%
 2+: 1%
 3: 1%
What is the probability?
That an LTRP B2 is also a BAT-R:
 1+: 1%
 2: 57%
 2+: 23%
 3: 18%
Answering the Broad Questions
Can the two systems be compared? YES
Are the two systems related? YES
Can the two systems be aligned? Somewhat
Can the two systems be equated? Probably not
“Heat Chart”: STANAG 6001 and CEFR
When comparing testing systems
Ask about the purpose of the test
 Placement, progress, prove a level, etc.
Ask about what the test is testing
 Is it a test of achievement, performance, proficiency?
 Does it test spontaneous abilities or rehearsed performance?
Ask about how the test scores are determined
 Non-compensatory: prove a floor and ceiling
 Total points
Ask if research exists
Answers from a CEFR Expert
CEFR is not one system. It is NOT intended to be used to transfer scores
from one country to the next or from one language to another but rather to
set a framework within which educators can build curricula.
Not a harmonisation project
Alignment is problematic because we do not know what we are aligning. Not
a matter of alignment or equivalency but a matter of relationship
The scale is an origin for comparison. The scale functions as exemplars and
activities. The scale is a meta-framework for learning and teaching.
Conversation with Nick Saville,
Cambridge, England
April 15, 2010
In Closing
It is a far, far better thing that
we do than we have ever done
to know how to use test scores.
Questions?
Contact: eswender@actfl.org
Extra slides
Crosstabulation of Test Results