30 Years of Evidence on the Comparability of Exam Standards: Myths, Fiascos and Unrealistic Expectations

Paul E. Newton
Centre for Evaluation & Monitoring (University of Durham) 30th Anniversary Conference: 30 Years of Evidence in Education. 23 September 2014, London.

Statistics vs. Judgement: What Does 30 Years of Research Tell Us About the Best and Worst Way to Maintain Exam Standards?

What does it mean to 'maintain' an exam standard?

Grade awarding is the process of identifying which marks on this year's exam correspond to levels of attainment (i.e. levels of knowledge, skill and understanding) that were associated with grade boundary marks on last year's exam.

Why do exam boards need to move grade boundaries at all? Because even exams that are designed to measure exactly the same kind of attainment, in exactly the same way, may end up slightly different in the overall difficulty of their questions.

Have we always maintained exam standards like this? 30 years ago, in 1984? 60 years ago, in 1954? Yes, pretty much.

Attainment-referencing: from one examination to the next, corresponding grade boundaries should be located at marks associated with equivalent levels of attainment.

The myth
[Chart: hypothetical A level pass-rates for UCLES, 1960 to 1986 (Summer examinations, Home candidates only); series for Latin, French, Physics and Biology on a 0 to 100 per cent scale.]

The myth... debunked
[Chart: actual A level pass-rates for the 'Cambridge' board, UCLES, 1960 to 1984 (Summer examinations, Home candidates only); Latin and Physics pass-rates plotted on a 50 to 100 per cent scale.]

How do you operationalise attainment-referencing?
[Chart: cumulative percentage of A level Sociology students awarded grade E (blue) against total number of results awarded (red), 1990 to 2009 (all boards, Summer awards, all modes, by syllabus group).]

Two broad approaches are available:

Scrutiny of scripts (undertaken by examiners): comparing levels of attainment 'directly', by inspecting performances in examination scripts. A.k.a. 'judgement'.

Scrutiny of data (undertaken by the board): comparing levels of attainment indirectly, by 'modelling' the causal determinants of attainment. A.k.a. 'statistics'.
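The slides do not spell out how the statistical route works in practice, so here is a minimal illustrative sketch in Python, assuming a simplified prediction-based approach: carry last year's relationship between prior attainment and outcomes forward, compute the pass-rate this year's cohort would be expected to achieve, and place the boundary at the mark that delivers it. Every number, band and name below is hypothetical.

```python
import numpy as np

# Reference-year relationship: P(grade E or better | prior-attainment band),
# estimated from last year's cohort. Hypothetical values.
p_pass_given_band = {"high": 0.95, "middle": 0.75, "low": 0.45}

# This year's cohort: a prior-attainment band and a raw mark per candidate.
# Simulated here; a real award would use actual candidate data.
rng = np.random.default_rng(0)
bands = rng.choice(["high", "middle", "low"], size=10_000, p=[0.3, 0.5, 0.2])
marks = rng.normal(loc=55, scale=12, size=10_000)

# Step 1: the pass-rate this cohort would be expected to achieve if the
# reference-year standard were carried forward unchanged.
expected_pass_rate = np.mean([p_pass_given_band[b] for b in bands])

# Step 2: sweep boundary marks downwards until the actual pass-rate matches
# the expectation; that mark becomes the recommended grade E boundary.
boundary = 100
while boundary > 0 and np.mean(marks >= boundary) < expected_pass_rate:
    boundary -= 1

print(f"expected pass-rate {expected_pass_rate:.1%}, boundary mark {boundary}")
```

The sketch also shows why boundaries have to move: a harder paper shifts the whole mark distribution downwards, so the mark that reproduces the expected pass-rate falls with it, even though the attainment standard stays put.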
Which is better – statistics or judgement?
[Chart: the A level Sociology grade E chart again, 1990 to 2009.]

The battle of grade awarding:

The examiners: 'We are just so impressed by the quality of performances that we see in our French exams.'

The Board: 'But do you really have enough evidence to justify raising the pass-rate yet again? After all, pass-rates haven't been rising in German or Spanish, and the French cohort is expanding massively.'

What does 30 years of research tell us about the best and worst way to maintain exam standards? There is evidence from exam boards, evidence from academia, and evidence from regulators.

What have we learned since 1984?

First, we shouldn't put too much confidence in statistics.
[Chart: cumulative percentage of candidates awarded grade E or higher, averaged across 13 UCLES A level subjects, 1960 to 1984 (Summer examinations, Home candidates only, main syllabuses only).]

Four NEAB maths A levels (P&A, P&M, P&S and SMP) were compared using multilevel modelling to control for prior achievement, gender, and so on. Even after these controls, SMP still appeared too lenient. However, the SMP syllabus was more motivating, had excellent support materials, and was more time-consuming, so the apparent leniency may have reflected genuinely higher attainment rather than a lower standard. A sketch of this kind of analysis follows.
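As a rough indication of what that multilevel-model check involves, here is a minimal sketch using statsmodels. The file, column names and model specification are hypothetical stand-ins, not the actual NEAB analysis.

```python
import pandas as pd
import statsmodels.formula.api as smf

# One row per candidate; the file and column names are hypothetical.
df = pd.read_csv("neab_maths_alevel.csv")

# Random intercept for the examination centre; fixed effects for syllabus
# plus the controls mentioned above (prior achievement, gender).
model = smf.mixedlm(
    "grade_points ~ C(syllabus) + prior_achievement + C(gender)",
    data=df,
    groups=df["centre_id"],
)
result = model.fit()
print(result.summary())

# A positive coefficient on C(syllabus)[T.SMP] would mean SMP candidates
# obtained higher grades than comparable candidates on the other syllabuses,
# i.e. that SMP 'appeared too lenient'.
```

The caveat on the slide applies directly: a coefficient like this cannot distinguish genuine leniency from legitimate advantages of the syllabus (motivation, support materials, time spent), which is exactly why the statistics deserve only limited confidence.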
Second, we shouldn't put too much confidence in judgement either. When grade boundaries were set by examiner judgement alone for two exam papers in the same subject but at different tiers, sat by the same candidates, many more students ended up with higher grades on the lower tier exam than on the higher tier.

Judgemental innovations: we have learned how to harness examiner judgement more effectively.

Statistical innovations: we have learned how to carry out statistical analyses more effectively.

Even so, it is extremely hard to predict and control comparability threats.

The 'fiascos':
- Summer 2002: the Curriculum 2000 anomaly.
- Summer 2012: the GCSE English anomaly. In the January 2012 awarding there was a clear tendency to ensure that students were marked 'comfortably' above historical boundaries. In the June 2012 awarding the same tendency persisted, but many students were no longer 'comfortably' above the raised boundaries.

So, which is better – statistics or judgement?
[Chart: the A level Sociology grade E chart again, 1990 to 2009.]

Unrealistic expectations. Three 'stages' in understanding comparability (Bardell, Forrest and Shoesmith, 1978):
1. Statistical auditing: problems are routine; solutions require 'back of the envelope' sums.
2. Scientific research: problems are difficult; solutions require rigorous and objective investigations.
3. Art criticism: problems are perhaps insurmountable; solutions require value judgements.

Realistic expectations, plus persuasive justifications. Four 'stages' in understanding comparability:
1. Statistical auditing.
2. Scientific research.
3. Art criticism.
4. Engineering pragmatism: many comparability problems are technically insurmountable, but some are less insurmountable than others, and those should be prioritised; all comparability solutions are inevitably imperfect, but some are less imperfect than others, and those should be prioritised. Technically insurmountable problems and inevitably imperfect solutions highlight the fundamental importance of strong arguments in defence of policy and practice.