30 Years of Evidence on the Comparability of Exam Standards: Myths, Fiascos and Unrealistic Expectations
Paul E. Newton
Centre for Evaluation & Monitoring, University of Durham
30th Anniversary Conference: 30 Years of Evidence in Education. 23 September 2014. London.
Statistics vs. Judgement: What Does 30 Years of Research Tell Us About the Best and Worst Way to Maintain Exam Standards?
What does it mean to ‘maintain’ an exam standard?
Grade Awarding
The process of identifying which marks on this year’s exam correspond to the levels of attainment (i.e. levels of knowledge, skill and understanding) that were associated with grade boundary marks on last year’s exam.
Why do exam boards need to move grade boundaries?
Because even exams that are designed to measure:
 exactly the same kind of attainment
 in exactly the same way
may end up being slightly different in terms of the overall difficulty of their questions (a minimal sketch follows below).
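To make the arithmetic concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the boundary of 40 marks, the 3-mark difficulty shift, and the adjust_boundary helper are invented for illustration.

```python
# Hypothetical sketch: offsetting a small change in paper difficulty by
# moving the grade boundary, so the same attainment earns the same grade.

def adjust_boundary(last_year_boundary: int, difficulty_shift_marks: int) -> int:
    """Return this year's boundary. A positive shift means this year's
    paper is harder, so the same level of attainment earns fewer marks
    and the boundary moves down."""
    return last_year_boundary - difficulty_shift_marks

# Suppose grade E sat at 40/100 last year, and this year's questions are
# judged roughly 3 marks harder at that level of attainment:
print(adjust_boundary(40, 3))  # -> 37: same standard, different mark
```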
Have we always maintained exam standards like this?
 30 years ago – in 1984?
 60 years ago – in 1954?
… yes, pretty much!
Attainment-referencing
From one examination to the next, corresponding grade boundaries should be located at marks associated with equivalent levels of attainment.
The myth
[Chart: HYPOTHETICAL A level pass-rates for UCLES, 1960 to 1986 (Summer examinations, Home candidates only); series: Latin, French, Physics, Biology; y-axis: pass-rate, 0-100%]
The myth… debunked
[Chart: A level pass-rates for the ‘Cambridge’ board, UCLES, 1960 to 1984 (Summer examinations, Home candidates only); series: Latin, Physics; y-axis: pass-rate, 50-100%]
How do you operationalise attainment-referencing?
[Chart: Cumulative percentage of A level Sociology students awarded grade E (blue) against total number of results awarded (red); All Boards, Summer Awards, All Modes, by Syllabus Group; 1990-2009]
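The statistic plotted above is straightforward to compute. The sketch below (Python) shows the calculation with an invented grade distribution; the grade set and the counts are assumptions, not real data.

```python
# Invented grade distribution for one year of an A level subject.
grade_counts = {"A": 1200, "B": 2400, "C": 4100, "D": 5200, "E": 6100, "U": 2000}

total = sum(grade_counts.values())
cumulative = 0
for grade in ["A", "B", "C", "D", "E", "U"]:  # best to worst
    cumulative += grade_counts[grade]
    print(f"Cum.% at grade {grade} or above: {100 * cumulative / total:.1f}")
```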
Scrutiny of scripts (undertaken by examiners)
Comparing levels of attainment ‘directly’, by inspecting performances in examination scripts.
a.k.a. ‘Judgement’
Scrutiny of data (undertaken by the Board)
[Chart: Cumulative percentage of A level Sociology students awarded grade E (blue) against total number of results awarded (red); All Boards, Summer Awards, All Modes, by Syllabus Group; 1990-2009]
Comparing levels of attainment indirectly, by ‘modelling’ the causal determinants of attainment.
a.k.a. ‘Statistics’
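As a hedged illustration of this indirect route, the sketch below predicts this year’s expected outcome from one modelled determinant of attainment, the cohort’s mean prior attainment. The data, the single-predictor linear model, and all variable names are assumptions made for the example; real statistical screening uses richer models.

```python
import numpy as np

# Invented history: cohort mean prior-attainment score against the
# cumulative % of candidates awarded grade E or above, five past years.
mean_prior_score = np.array([5.1, 5.3, 5.2, 5.6, 5.8])
cum_pct_grade_e = np.array([68.0, 71.5, 70.2, 75.8, 79.1])

# Fit a simple linear relationship between prior attainment and outcome.
slope, intercept = np.polyfit(mean_prior_score, cum_pct_grade_e, 1)

# Predict the expected outcome for this year's cohort profile; awarding
# can then check whether a proposed boundary yields a similar figure.
this_year_prior = 5.7
print(f"Predicted cum.% at grade E: {slope * this_year_prior + intercept:.1f}")
```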
Which is better – statistics or judgement?
[Chart: Cumulative percentage of A level Sociology students awarded grade E (blue) against total number of results awarded (red); All Boards, Summer Awards, All Modes, by Syllabus Group; 1990-2009]
The battle of grade awarding
Examiners: ‘We are just so impressed by the quality of performances that we see in our French exams.’
The Board: ‘But do you really have enough evidence to justify raising the pass-rate yet again? After all:
 pass-rates haven’t been rising in German or Spanish
 the French cohort is expanding massively’
What Does 30 Years of Research Tell Us About the Best and Worst Way to Maintain Exam Standards?
 Evidence from Exam Boards
 Evidence from Academia
 Evidence from Regulators
What have we learned since 1984?
We shouldn’t put too much confidence in statistics
[Chart: Cumulative % of candidates with grade E (or higher), averaged across 13 UCLES A level subjects, 1960-1984 (Summer examinations, Home candidates only, Main syllabuses only); y-axis: 0-100%]
 4 NEAB maths A levels: P&A, P&M, P&S, SMP
 multilevel modelling (MLM) to control for prior achievement, gender, etc. (see the sketch below)
 even after control, SMP still appeared too lenient
However, the SMP syllabus was:
 more motivating
 accompanied by excellent support materials
 more time-consuming
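A minimal sketch of that kind of multilevel analysis, using statsmodels. The data are randomly generated, and the variable names (grade_score, prior, gender, syllabus, centre) and the random-intercept structure are assumptions; the point is the shape of the model, not the numbers.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Invented candidate-level data: graded outcome, prior achievement,
# gender, syllabus sat, and examination centre (the grouping level).
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "grade_score": rng.normal(50, 10, n),
    "prior": rng.normal(5.5, 1.0, n),
    "gender": rng.choice(["F", "M"], n),
    "syllabus": rng.choice(["P&A", "P&M", "P&S", "SMP"], n),
    "centre": rng.integers(0, 20, n),
})

# Fixed effects for syllabus, prior achievement and gender, plus a
# random intercept per centre. A syllabus coefficient that survives the
# controls is the 'still appeared too lenient' kind of signal.
model = smf.mixedlm("grade_score ~ C(syllabus) + prior + gender", df,
                    groups=df["centre"])
print(model.fit().summary())
```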
We shouldn’t put too much confidence in judgement
Grade boundaries were set by examiner judgement alone:
 for two exam papers
 in the same subject
 at different tiers
 sat by the same candidates
Many more students ended up with higher grades on the lower-tier exam than on the higher tier (see the sketch below).
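When the same candidates hold a grade from each tier, the anomaly can be tabulated directly. This sketch uses invented paired grades and an invented grade ranking; it is an illustration, not the original study’s method.

```python
import pandas as pd

# Invented paired grades for candidates who sat both tiers.
results = pd.DataFrame({
    "higher_tier": ["C", "D", "C", "E", "D", "C"],
    "lower_tier":  ["B", "C", "C", "D", "C", "B"],
})

rank = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "U": 6}  # 1 = best
better_on_lower = (results["lower_tier"].map(rank)
                   < results["higher_tier"].map(rank))
print(f"{better_on_lower.mean():.0%} of candidates did better on the lower tier")
```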
Judgemental innovations
We have learned how to harness examiner judgement more effectively.
Statistical innovations
We have learned how to compute statistical analyses more effectively.
It is extremely hard to predict and control comparability threats.
The ‘fiascos’
 Summer 2002: the Curriculum 2000 anomaly
 Summer 2012: the GCSE English anomaly
January awarding, 2012: a clear tendency to ensure that students were marked ‘comfortably’ above historical boundaries.
June awarding, 2012: the same tendency, but many students were no longer ‘comfortably’ above the raised boundaries.
So, which is better – statistics or judgement?
[Chart: Cumulative percentage of A level Sociology students awarded grade E (blue) against total number of results awarded (red); All Boards, Summer Awards, All Modes, by Syllabus Group; 1990-2009]
Unrealistic expectations
Three ‘stages’ in understanding comparability
1. statistical auditing
 problems are routine
 solutions require ‘back of the envelope’ sums
2. scientific research
 problems are difficult
 solutions require rigorous and objective investigations
3. art criticism
 problems are perhaps insurmountable
 solutions require value judgements
(Bardell, Forrest and Shoesmith, 1978)
Realistic expectations + Persuasive justifications
Four ‘stages’ in understanding comparability
1. statistical auditing
2. scientific research
3. art criticism
4. engineering pragmatism
 many comparability problems are technically insurmountable… but some are less insurmountable than others and should be prioritised
 all comparability solutions are inevitably imperfect… but some are less imperfect than others and should be prioritised
 technically insurmountable problems and inevitably imperfect solutions highlight the fundamental importance of strong arguments in defence of policy and practice