Using Comparative Judgement to Assess Performance in the Critical Maths Curriculum

Chris Wheadon & Ian Jones
Abstract
The purpose of this study was to assess how well Comparative Judgement performs in relation to a
traditional marking approach on maths questions that require genuine problem solving. Ninety-two students completed one of two open-ended maths questions and their responses were marked using level of response mark schemes and judged using Comparative Judgement. The findings suggest that both traditional marking and Comparative Judgement can produce reliable and valid results for the cohort
used to trial these questions. The most obvious explanation for this result is that most of the students
struggled so much with the calculations they were required to make that they were unable to offer
analysis or interpretation of data. With little analysis offered, there was little subjectivity inherent in the
assessment, so the students’ work could be assessed objectively using either traditional marking or
Comparative Judgement.
Introduction
The DfE is funding MEI to investigate how Professor Sir Timothy Gowers's ideas might inform a
curriculum that could become the basis of a new course for students who do not currently study
mathematics post-16. The curriculum will be based on students engaging with realistic problems and
developing skills of analysing problems and thinking flexibly to solve them. As the intention of the curriculum is that questions are open-ended, it is important to understand whether traditional marking approaches are appropriate or whether alternatives are more suitable. This report considers the results of a small-scale trial of two questions developed for the curriculum, which were marked using traditional level of response mark schemes and also judged using Comparative Judgement.
Method
Approach
The approach to the trial was to develop two questions and associated level of response mark schemes which would then be trialled by students. Following the trial the work would be both marked traditionally and judged using Comparative Judgement. The original intention was to randomly assign markers and judges
from a single pool to avoid confounding experience and expertise with the mode of assessment, but
this proved logistically difficult. Instead the designers of the questions and mark schemes undertook
the marking themselves, while the judging was undertaken by a wider panel of judges. The different
expertise of the markers and judges therefore restricts the interpretation of the results.
Materials
The two questions that were developed for the trial are presented along with their associated mark schemes in Appendix 1. The Comparative Judgement was undertaken using www.nomoremarking.com, and judges were provided with the guidance presented in Appendix 2. Judges were given the mark schemes developed for the marking, along with detailed work-throughs for the School Trip question.
Participants
The students who answered the questions were, in the main, year 12 students. Three groups of year 12 students were taken from two 11-18 schools and one group from a sixth form college. A further group of 22 year 11 students from an 11-18 school completed the School Trip assessment. All of
the students had gained a grade C or better in Mathematics at GCSE. 53 students completed the
SPAM question, while 39 students completed the School Trip question. The two markers of the
questions were the designers of the assessments and mark schemes. Both markers marked all of the
questions. The judges were recruited on a voluntary basis, mainly from within MEI, and had varying degrees of mathematics expertise. The two markers also participated in the judging, but their judgements have been removed from the analysis that follows, except where indicated.
Analysis
All the analysis of the Comparative Judgement data was undertaken using the R package BradleyTerry2 [1]. For ease of interpretation the Comparative Judgement ability parameters were scaled to have the same mean and standard deviation as the average marks from both markers on each question. Inter-rater reliability of the judging was calculated by an iterative procedure of splitting the judges into two random groups, fitting the Bradley-Terry model to the decisions produced by each group separately, and calculating the rank order correlation between the ability parameters derived from the two separate models. The mean correlation of 100 iterations is reported. Judge variability is calculated using the Infit formula provided by Pollitt (2012) [2].
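
The analysis scripts are not reproduced in this report; as an illustration only, the procedure above might look roughly as follows in R with BradleyTerry2. The data frame decisions, its columns won, lost and judge, and the vector avg_marks are hypothetical names, and the Infit calculation follows our reading of Pollitt (2012); treat this as a sketch rather than the code actually used.

    library(BradleyTerry2)

    ## `decisions` is assumed to hold one row per judgement: the script
    ## chosen as better (won), the other script (lost), and the judge.
    ## Both script columns must be factors sharing one set of levels.
    scripts <- sort(unique(c(as.character(decisions$won),
                             as.character(decisions$lost))))
    decisions$won  <- factor(decisions$won,  levels = scripts)
    decisions$lost <- factor(decisions$lost, levels = scripts)

    ## Fit the Bradley-Terry model; outcome = 1 because `won` beat `lost`.
    fit     <- BTm(outcome = 1, player1 = won, player2 = lost,
                   data = decisions)
    ability <- BTabilities(fit)[, "ability"]

    ## Rescale the abilities to the mean and standard deviation of the
    ## average marks from the two markers (avg_marks).
    scaled <- mean(avg_marks) +
      sd(avg_marks) * (ability - mean(ability)) / sd(ability)

    ## Split-half inter-rater reliability: fit the model separately to
    ## the decisions of two random halves of the judges and correlate
    ## the two rank orders; the mean of 100 iterations is reported.
    split_half <- function(d) {
      judges <- unique(d$judge)
      half   <- sample(judges, length(judges) %/% 2)
      fit_a  <- BTm(outcome = 1, player1 = won, player2 = lost,
                    data = d[d$judge %in% half, ])
      fit_b  <- BTm(outcome = 1, player1 = won, player2 = lost,
                    data = d[!d$judge %in% half, ])
      cor(BTabilities(fit_a)[, "ability"],
          BTabilities(fit_b)[, "ability"],
          method = "spearman", use = "complete.obs")
    }
    reliability <- mean(replicate(100, split_half(decisions)))

    ## Judge Infit as an information-weighted mean square: every recorded
    ## outcome is a "win", so each decision's residual is 1 - p and its
    ## binomial variance is p(1 - p). Judges with Infit > 1.5 are flagged.
    p     <- fitted(fit)
    infit <- tapply((1 - p)^2, decisions$judge, sum) /
             tapply(p * (1 - p), decisions$judge, sum)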
Results
SPAM
Marking
The mean mark on the SPAM question was 2.68 for marker1 and 2.83 for marker2 (out of a maximum of 6), with standard deviations of 1.46 and 1.42 respectively. The inter-rater marking reliability was high, with a rank order correlation of 0.85 between the marks given by the two markers. The reason for the high agreement appears to be the tendency of both markers to award the central mark of 3, and their reluctance to award marks above 3 (see Figure 1). The restricted range of marks results in poor discrimination. The markers agreed exactly on the marks of 62 per cent of the scripts, which is far higher than would be expected for true open-ended questions. Referring to the mark scheme and the scripts, any subjectivity in marking would only be apparent at marks higher than 4, where students attempt a recommendation based on their analysis, but only 15 per cent of scripts scored higher than 4 marks.
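
For illustration, the marker agreement statistics reported here can be computed directly from the two sets of marks; marker1 and marker2 below are hypothetical vectors holding each marker's marks for the same scripts, in the same order.

    ## Rank order correlation between the two markers' marks.
    cor(marker1, marker2, method = "spearman")

    ## Proportion of scripts on which the markers agreed exactly.
    mean(marker1 == marker2)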
Judging
For the SPAM question there were 15 judges, each completing 90 judgements. Around twice as many
judges were recruited as were strictly required in order to calculate inter-rater reliability. One judge
was removed due to the variability of their judging (Infit > 1.5), leaving 14 judges and a total of 1260
judgements. The inter-rater reliability of the judges was 0.94, higher than that between the two markers. The rank order correlation between the average mark of the two markers and the judging scores was 0.89, and with each marker separately it was 0.87. The high correlation of the judges with the markers
suggests that the judging process was valid, with judges generally valuing the same qualities in the
scripts as the markers.
Figure 2 shows how the marks and the judgements are closely correlated, but also reveals a few
anomalies. A particularly interesting example is presented in Appendix 3. This script makes the
observation that 37 per cent of the blocked email is incorrectly blocked, which is correct, and shows a
good understanding of the problem. The script was ranked 5th by the judges, but 12th according to
the average mark awarded by the markers, who both gave a mark of 3 out of 6. To be awarded a mark higher than 3, the mark scheme requires an explicit recommendation. The judges, unconstrained by the mark scheme, could reward the implicit understanding more highly than the markers.
Overall the judging performed well for the SPAM question despite the paucity of interpretation and
analysis offered by the students. The high inter-rater reliability of the judges suggests that a
Comparative Judgement approach would work well with distributed panels of judges.
Figure 1: Density distributions of the Comparative Judgement scores (cj) and the marks for the SPAM question.
Figure 2: The Comparative Judgement scores (cj) with associated standard errors, and the marks for
the SPAM assessment.
School Trip
Marking
The mean mark on the School Trip question was very low for both markers (5.2 and 4.9 respectively, out of 15), with standard deviations of 3.8 and 3.2 respectively. The inter-rater marking reliability was higher than on the SPAM assessment, at 0.90. The markers agreed exactly on 44 per cent of the marks, which is again much higher than would be expected for an open-ended question with some inherent subjectivity. The distributions of marks (Figure 3) again suggest that the markers had a tendency towards the middle mark, while both markers awarded relatively few high marks. Again it seems that most students struggled to move beyond calculations into explanations and justifications, which restricted any subjectivity of interpretation of their answers.
Judging
For the School Trip question there were 17 judges, each completing 70 judgements. Around twice as
many judges were recruited as were strictly required in order to calculate inter-rater reliability. No
judges were removed due to the variability of their judging (Infit > 1.5), which means there was a total
of 1190 judgements. The inter-rater reliability of the judges was 0.91, similar to that between the two markers. The rank order correlation between the average mark of the two markers and the judging scores was 0.89, and with each marker separately it was 0.86 and 0.93. Again the high correlation of the judges with the markers suggests that the judging process was valid, with judges generally valuing the same qualities in the scripts as the markers.
There were fewer anomalies between the judges and the markers for the School Trip question,
although marker1 did find some quality in an explanation that didn’t impress the judges or marker2.
While such agreement seems a positive sign, a closer examination of the scripts suggests that the students were mainly reproducing learned approaches to the question, with little evidence of any genuine problem-solving. The best answers, reproduced in Appendix 4, are more complete than other answers, but show little originality.
Figure 3: Density distributions of the Comparative Judgement scores (cj) and the marks for the School Trip question.
Figure 4: The Comparative Judgement scores (cj) with associated standard errors, and the marks for
the School Trip assessment.
Discussion
The intention of this study was to examine how well Comparative Judgement performed in assessing
open-ended questions that require genuine problem-solving. Overall, Comparative Judgement
performed well in assessing the questions in the trial, but then so did traditional marking. The obvious
explanation for this result is that the variation in responses was due to mistakes made by the
students, rather than to variation in the quality of interpretation and analysis of the data they were
presented with. The failure of the questions to achieve their purpose is no reflection on those
involved in creating the questions but a reflection of how difficult it is to produce such questions.
Despite the attempts of the questions to elicit original thinking and answers, it was also clear that many of the students' approaches were reproductions of taught methods. Students in this trial struggled to move beyond the procedural mathematics required to complete the questions and to reflect on their approaches and the possible interpretations of their analyses. One possible improvement would be to make the questions less demanding in the calculations they require, and more demanding in interpretation and analysis.
Overall, the greatest challenge for this curriculum would seem to be creating questions that are accessible, mathematically valid, and allow genuine problem-solving approaches to shine; developing such questions may well prove harder than assessing the students' responses.

References

[1] Firth, David, and Heather L. Turner. "Bradley-Terry models in R: the BradleyTerry2 package." Journal of Statistical Software 48.9 (2012).
[2] Pollitt, Alastair. "The method of adaptive comparative judgement." Assessment in Education: Principles, Policy & Practice 19.3 (2012): 281-300.
Appendix 1: Questions and Mark Schemes
SPAM
A company is considering introducing a filter to its email system to remove unwanted advertising
mail which is sometimes referred to as SPAM. It is suggested that the filter will save time and
prevent email storage capacity being wasted.
The company receives approximately 100,000 emails per week, of which about 5% are SPAM. A new SPAM filter claims to be 97% accurate. The company take this to mean that 97% of SPAM is correctly filtered out, and only 3% of genuine emails are incorrectly filtered out.
Some of the company managers are worried that the SPAM filter may filter out urgent genuine emails and cause delays; others think that too much SPAM will still get through the system.
Use the information above to find out how many of the company’s emails are likely to be affected by
each of these worries.
State whether you would recommend the SPAM filter and give your reasons for this decision.
Solution

Solutions are likely to include the following points:
● Using data given to find representative frequencies.
● Recommendation with reason.

Marks

5-6: A clear, correct, complete solution which takes account of all the information given and includes a recommendation with a reason that follows from the work.
3-4: A solution which takes account of the information and includes a recommendation that follows on from the work. There may be some errors in the frequencies but these are not serious enough to invalidate the argument.
1-2: Some progress towards a solution.

Indicative content

The correct frequencies are as follows:
● In a typical week 5,000 emails are SPAM and 95,000 emails are OK.
● 4,850 SPAM emails rejected, 150 get through.
● 2,850 good emails rejected, 92,150 get through.
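
The indicative frequencies can be verified with a few lines of arithmetic; the R sketch below simply follows the figures given in the question.

    emails  <- 100000            # emails received per week
    spam    <- 0.05 * emails     # 5000 SPAM emails
    genuine <- emails - spam     # 95000 genuine emails

    spam_blocked    <- 0.97 * spam                # 4850 SPAM filtered out
    spam_through    <- spam - spam_blocked        # 150 SPAM get through
    genuine_blocked <- 0.03 * genuine             # 2850 good emails blocked
    genuine_through <- genuine - genuine_blocked  # 92150 delivered

    ## Share of all blocked email that is genuine (the observation made
    ## by the script discussed in the Results section):
    genuine_blocked / (spam_blocked + genuine_blocked)   # 2850/7700 ~ 0.37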
School Trip
A group of 50 students are given 5 possible venues for a school reward trip. To make recording easy, each venue is assigned a letter A, B, C, D or E.
The students are asked to place the venues in order of preference starting with their favourite at the
top.
When all of the lists are collected they are sorted into groups whose lists are identical. In total there are 6 groups. This information is given in the table below. For example, group 1 is the preference list submitted by 18 of the students.

    Group      1st   2nd   3rd   4th   5th   Frequency
    Group 1     A     D     E     C     B       18
    Group 2     B     E     D     C     A       12
    Group 3     C     B     E     D     A        8
    Group 4     D     C     E     B     A        7
    Group 5     E     B     D     C     A        3
    Group 6     E     C     D     B     A        2
    Total                                       50
The school plans to select just one venue.
Using the information in the table, suggest different ways to decide on a venue for the reward trip
and choose the best way to decide where to go. You should explain how you arrived at your
decision, and show your working to choose the venue for the trip.
School trip mark scheme

Solution

Solutions are likely to include the following points:
● A description of a method to make a decision.
● Consideration of more than one possible method and reasons for choosing a particular method.
● Working showing the chosen method in practice.
● A conclusion based on the working.

Marks

13-15: A clear, correct, complete solution which takes account of all the information given and includes a conclusion. The solution recognises that there is more than one possible way to arrive at a decision and gives reasons for rejecting alternative methods.
10-12: A clear, correct, complete solution which takes account of the information and includes a conclusion.
7-9: Substantial progress towards a complete solution. This level includes responses which correctly describe a decision method that takes account of all the information and start to apply the decision method, but there may be some errors or no conclusion is stated.
4-6: Some progress towards a complete solution. Responses at this level include some correct work, for example, describing one or more decision methods but not applying any of them. This level also includes responses that correctly use a very simple decision method with a reason.
1-3: Some relevant work. Responses at this level have at least one correct statement or some relevant working, but these are not part of a coherently communicated solution.

Indicative content

Choices of decision methods for the top three levels should take account of all the data in the table.

Example of 4-6 category: Most students chose A. Although more students put A last than put it first, they will understand a decision based on top votes easily and it is quick to use, so I would choose A.

Example of 1-3 category: Most students chose A but a lot of students also put it last.
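
As an illustration of the kind of decision methods the mark scheme anticipates, the sketch below applies two possible methods to the preference table: counting first choices only, and a Borda count. This is a hypothetical worked example, not part of the original mark scheme.

    ## Preference table from the question: one row per group; columns
    ## give the venue placed 1st to 5th. `freq` holds the group sizes.
    prefs <- rbind(c("A","D","E","C","B"),
                   c("B","E","D","C","A"),
                   c("C","B","E","D","A"),
                   c("D","C","E","B","A"),
                   c("E","B","D","C","A"),
                   c("E","C","D","B","A"))
    freq <- c(18, 12, 8, 7, 3, 2)

    ## Method 1: first past the post -- count first choices only.
    tapply(freq, prefs[, 1], sum)
    ##  A  B  C  D  E
    ## 18 12  8  7  5        -> A wins on first choices alone

    ## Method 2: Borda count -- 4 points for 1st down to 0 for 5th.
    pts <- 4:0
    sapply(c("A","B","C","D","E"), function(v)
      sum(freq * pts[apply(prefs, 1, function(row) which(row == v))]))
    ##   A   B   C   D   E
    ##  72  90  92 124 122  -> D wins when all preferences count

The two methods disagree (A wins on first choices, D once lower preferences are counted), which is precisely the tension the 4-6 band example acknowledges when it notes that more students put A last than put it first.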
Appendix 2: Instructions for Judges
Notes to Judges
You should have received an email which contains a link to the online assessment URL. When you click on this link you should see the screen below.

Click on the blue start button in the bottom right of the box and you should see two items of work for you to judge.

You can zoom in on either item of work by placing the mouse pointer on the item and left-clicking the mouse once. Use the button in the top right of the screen to revert to the normal two-item view.

Once you have decided which item is the better solution, use the right or left button (see below) to select this item.

After making your selection you should automatically obtain two more items to judge, and so on, until the screen informs you that you have completed your judgements (see below).
The questions which the students were asked to answer in the assessment trial are shown on the
next page. There are also examples of some of the possible calculations and/or systems which the
students may have chosen to use. This should save you some time.
Please note that this is not a mark scheme and you are encouraged to use your own judgement to identify, in each case, which of the two solutions is better.
Appendix 3: Scripts from SPAM
A Perfectly Marked Answer (6 out of 6 for both markers)
An Implicit Recommendation
Appendix 4: Scripts from School Trip
The Best Judged Answer
Appendix 5: Tables of results
SPAM
(Judging marks are Bradley-Terry ability estimates scaled to the marks; marker marks are out of 6.)

    id        judging mark   standard error   marker1   marker2
    mei-12        5.14            0.38            5         5
    mei-28        4.93            0.34            5         5
    mei-8         4.81            0.35            5         5
    mei-14        4.71            0.32            4         4
    mei-2         4.40            0.31            3         3
    mei-13        4.36            0.30            5         6
    mei-27        4.35            0.32            6         5
    mei-22        4.17            0.30            6         5
    mei-3         3.95            0.29            3         5
    mei-30        3.89            0.29            4         3
    mei-21        3.75            0.29            3         3
    mei-17        3.73            0.30            6         6
    mei-25        3.72            0.29            3         3
    mei-31        3.70            0.28            3         3
    mei-51        3.59            0.29            3         3
    mei-50        3.56            0.29            3         3
    mei-24        3.49            0.29            3         3
    mei-44        3.39            0.27            2         3
    mei-18        3.38            0.28            3         3
    mei-19        3.31            0.28            3         3
    mei-29        3.27            0.28            3         3
    mei-32        3.21            0.28            2         4
    mei-53        3.00            0.29            3         2
    mei-4         2.97            0.28            3         3
    mei-7         2.97            0.28            3         6
    mei-20        2.85            0.29            3         3
    mei-6         2.66            0.28            3         3
    mei-52        2.66            0.28            3         2
    mei-23        2.65            0.28            3         3
    mei-48        2.61            0.28            3         3
    mei-26        2.58            0.28            3         3
    mei-43        2.55            0.28            3         2
    mei-11        2.34            0.28            3         4
    mei-33        2.20            0.29            1         2
    mei-36        2.19            0.29            2         2
    mei-15        2.16            0.29            3         2
    mei-5         2.09            0.29            2         2
    mei-46        1.95            0.29            2         2
    mei-45        1.90            0.30            2         2
    mei-39        1.82            0.29            0         1
    mei-41        1.64            0.29            1         2
    mei-16        1.55            0.30            2         2
    mei-37        1.53            0.31            1         2
    mei-42        1.28            0.32            1         1
    mei-47        1.00            0.32            0         1
    mei-10        0.94            0.33            2         2
    mei-34        0.90            0.33            1         1
    mei-35        0.85            0.33            2         1
    mei-49        0.77            0.32            0         1
    mei-9         0.55            0.34            1         1
    mei-38        0.34            0.38            1         1
    mei-1         0.25            0.37            1         1
    mei-40       -0.04            0.40            1         1
School Trip
(Judging marks are Bradley-Terry ability estimates scaled to the marks; marker marks are out of 15.)

    id        judging mark   standard error   marker1   marker2
    mei-v6       11.10            1.05           11        12
    mei-v16      10.27            0.94            9         9
    mei-v7        9.22            0.85           12        12
    mei-v8        8.77            0.84           13         9
    mei-v5        8.77            0.84           11         9
    mei-v18       8.53            0.84            9         8
    mei-v27       8.46            0.84           10        10
    mei-v24       7.84            0.82           11         9
    mei-v15       7.83            0.84           10         9
    mei-v17       7.53            0.81            9         8
    mei-v26       7.40            0.81            9         6
    mei-v14       7.21            0.82            4         5
    mei-v33       6.90            0.80            4         5
    mei-v29       6.82            0.80            2         5
    mei-v12       6.45            0.80            4         4
    mei-v2        6.42            0.81            6         3
    mei-v3        6.36            0.80            7         5
    mei-v13       6.26            0.80            4         5
    mei-v11       5.65            0.80            4         5
    mei-v30       5.64            0.80            4         4
    mei-v34       5.20            0.80            4         4
    mei-v1        5.13            0.80            5         3
    mei-v28       5.04            0.80            2         5
    mei-v4        4.38            0.81            5         5
    mei-v37       4.25            0.81            7         5
    mei-v19       4.02            0.82            2         2
    mei-v23       3.80            0.82            2         2
    mei-v36       3.34            0.84            1         2
    mei-v25       3.20            0.82            9         4
    mei-v32       3.16            0.85            4         4
    mei-v35       3.10            0.84            1         2
    mei-v9        2.35            0.88            3         3
    mei-v38       1.74            0.92            1         1
    mei-v39       0.78            0.94            1         1
    mei-v22       0.77            0.92            1         1
    mei-v31       0.23            0.92            1         2
    mei-v20      -0.86            0.99            1         1
    mei-v21      -1.02            1.09            1         1
    mei-v10      -5.04            1.88            0         0