MULTIPLE CHOICE: A STATE OF THE ART REPORT

Robert Wood
University of London
CONTENTS

   INTRODUCTION
1. POLEMICS
2. RECALL, RECOGNITION AND BEYOND
      Higher Order Skills
      Summary
3. ITEM TYPES
      Item Types Other Than Simple Multiple Choice
         True-false
         Multiple true-false
         Multiple completion
         Assertion-reason
         Data necessity
         Data sufficiency
         Quantitative comparisons
      Summary
4. CONSTRUCTING ITEMS
      Number of Distracters
      The 'None of These' Option
      Violating Item Construction Principles
      Item Forms
      Summary
5. INSTRUCTIONS, SCORING FORMULAS AND RESPONSE BEHAVIOUR
      Changing answers
      Confidence weighting
      Ranking alternative answers
      Elimination scoring
      Weighting of item responses
      Summary
6. ITEM ANALYSIS
      Other Discrimination Indices
      Generalised Item Statistics
      The Item Characteristic Curve
      Probabilistic Models of Item Response
      Summary
7. ITEM SELECTION AND TEST CONSTRUCTION
      Constructing Group Tests
         Norm-referenced tests
      Arranging Items in the Test Form
      The Incline of Difficulty Concept
      Individualised Testing
      Testing for Other Than Individual Differences
         Criterion-referenced tests
      Choosing Items to Discriminate Between Groups
      Computer Programs for Item Analysis
      Summary

ACKNOWLEDGEMENTS
REFERENCES
Introduction
Potential readers of this book will want to know where it stands relative to predecessors like Vernon (1964), Brown (1966) and Macintosh and Morrison (1969). The answer is that it is meant to be a successor to Vernon's monograph which, first class though it still is, was thought to be in need of updating and expanding. It is therefore not a practical handbook like the other two books. Although I often give my opinion on what is good practice, the book concentrates on marshalling and evaluating the literature on multiple choice testing, the aim being to clarify what is known about this testing technique: in short, what is intended is a state of the art report.
Multiple choice is but one of a number of testing techniques. Anyone who wonders why it seems to dwarf the others in the attention it receives and the controversy it arouses might choose among the following reasons:
1. The technique originated in the USA and attracts irrational hostility on that account.
2. Choosing one out of a number of alternative answers is thought by some to be a trivial mental exercise.
3. The answer deemed to be correct can be obtained by blind guessing.
4. The format raises a number of methodological problems, real and imagined - setting, scoring, etc.
5. The data multiple choice tests produce lend themselves to elaborate statistical analysis and to the shaping of theories about response behaviour. Without multiple choice, modern test theory would not have come into existence (which would have been a blessing, some might think).
6. Because of a widespread belief that multiple choice tests are easily prepared for, there has come into being what amounts to an industry consisting of writers turning out books of multiple choice items, usually directed at specific examinations. Often the content of these collections is shoddy and untested, but their continual publication and reviewing keeps multiple choice before the public eye and affords hostile critics the opportunity to lambast the technique.
The opportunities for research investigations offered by 3, 4 and 5 above have sustained many American academics in their careers and their offerings have filled and continue to fill the pages of several journals, notably Educational and Psychological Measurement and the Journal of Educational Measurement. The absence of such specialised journals in Britain has meant less of an outpouring, although in recent years most subject-based educational journals, particularly those connected with the sciences, have carried at least one article concerned with 're-inventing the wheel', in apparent ignorance, wilful or otherwise, of developments elsewhere.
It is with multiple choice in the context of educational achievement testing that I am mainly concerned in this book. This concentration stems from my own background and current employment, although I also believe that it is with
achievement tests that multiple choice finds its most important application.
I work for the University of London School Examinations Department which is
one of the bodies in England charged with conducting GCE (General Certificate
of Education) examinations at Ordinary and Advanced levels. Ordinary or O-level is usually taken by students at the age of 16 and Advanced or A-level
at age 18. I hope this explanation will enable readers to understand the odd
references in the text to GCE or to 0- and A-level or to the London GCE board.
In the one place where CSE is mentioned this refers to the Certificate of
Secondary Education taken by less able students, also around the age of 16.
When referencing I have not attempted to be exhaustive, although neither have I been too selective. I hope I have mentioned most work of the last ten years; older work can be traced through the references I have cited. Where I think there is a good bibliography on a topic I have said so.
The weighing up of research evidence is a difficult business and I do not
pretend to have any easy solutions. The fact that numerous studies have used
American college students, often those studying psychology, throws doubt on
some of the literature but I take the view that it is possible to perceive
tendencies, to see where something definitely did not work and to judge where
promise lies. Often the accounts of experiments are less interesting than the
speculative writing of the investigators. Sometimes there are no experiments
at all but just polemical writing, nearly always hostile to multiple choice.
It is with these polemics that I start.
1. Polemics
"The Orangoutang score is that score on a standardised reading test that can be obtained by a well-trained Orangoutang under these special conditions. A slightly hungry Orangoutang is placed in a small cage that has an oblong window and four buttons. The Orangoutang has been trained that every time the reading teacher places a neatly typed multiple choice item from a reading test in the oblong window, all that he (the Orangoutang) has to do to get a bit of banana is to press a button, any of the buttons, which, incidentally, are labelled A, B, C and D." (Fry, 1971)
Although the quotation above is acid enough, no one has savaged the multiple
choice test quite like Banesh Hoffman and Jacques Barzun. Both are American
academics; both regard multiple choice as the enemy of intellectual standards
and creative expression. In their different ways they have made out a case
against multiple choice which must be taken seriously even if Hoffman's
diatribes have a superior, fanatical tone which soon grates.
In his various onslaughts, Hoffman (1962, 1967(a), 1967(b)) has insisted that
multiple choice "favours the picker of choices rather than the doer", and that
students he variously calls "gifted", "profound", "deep", "subtle" and "first
rate" are liable to see more in a question than the questioner intended, a
habit which, he claims, does not work to their advantage.
In favouring the "doer", Hoffman is expressing a preference, which he is entitled to do, but he produces no evidence for supposing there are distinct breeds
of "pickers" and "doers", just as he is unable to demonstrate that "picking"
is necessarily either a passive or a trivial activity. To choose an answer to
a question is to take a decision, even if it is a small one. In any case,
this is not the point; as I shall argue presently, why should not students
"pick" and "do"?
The fact is that much of the distaste for multiple choice
expressed by American critics like Hoffman and Barzun arises from fears about
the effects of using multiple choice tests exclusively in American school
testing programmes and the consequent lack of any opportunity for the student
to compose his own answers to questions. Recent reports from the USA (Binyon,
1976), which have linked what is seen as the growing inability of even university students to write competent English with the absence of essay tests, would
seem to justify such fears although no convincing analysis substantiating the
link has yet been offered and other factors in American society, such as the low value placed on writing outside school, may well be implicated. As far as
the British situation is concerned, examining boards are agreed that multiple
choice should be only one element of an examination, and often a minor element
at that; many examinations do not contain a multiple choice element at all, nor are they ever likely to. In practice, the highest weight multiple choice will
attract is 50 per cent and then only rarely; generally it is in the region of
30-40 per cent. (Although based on 1971 examinations, an unpublished Schools
Council survey (Schools Council, 1973) provides what is probably still a
reasonably accurate picture of the extent of objective test usage in the
United Kingdom and of the weightings given to these tests.)
The case for using multiple choice rests in large part on the belief that there
is room for an exercise in which candidates concentrate on giving answers to
questions free of the obligation to write up - some would say dress up - their
answers in extended form, a point made by Nuttall (1974, p.35) and by Pearce
(1974, p.52). Instead of asking candidates to do a little reading and a lot
of writing with (hopefully) some thinking interspersed - what, I suppose,
Hoffman would call "doing" - they are asked to read and think or listen and
think before "picking". I see nothing wrong in this. By and large there has
been an over-emphasis on writing skills in our examinations - unlike the USA - and the different approach to assessment represented by multiple choice serves
as a corrective. I would accept that a concentration on reading and thinking
is to some extent arbitrary in terms of priorities. After all, a strong case
could be made out on behalf of oral skills yet how little these feature in
external examinations, leaving aside language subjects. That they have been
so neglected is, of course, directly indicative of the over-emphasis that has
been placed on written work, which in turn can be traced to the conservatism
of examiners and teachers and to the difficulties of organising oral assessments.
If the quality of written work produced by the average candidate in the examination room was better than it is, one might be more impressed by the arguments of those who insist on written answers or "doing" in all circumstances.
But, as anybody knows, examination writing is far from being the highest form
of the art; how could it be when nervous individuals have to write against the
clock without a real opportunity to draft and work over their ideas, a practice
intrinsic to writing? As Wason (1970) has observed, one learns what one wants
to write as one goes along. Small wonder, then, that the typical examination
answer is an unsightly mess of half-baked ideas flung down on the paper in the
hope that some, at least, will induce a reward from the examiner.
No doubt the fault lies in the kind of examination that is set or in setting
examinations at all. Certainly examinations have been roundly condemned for
stultifying writing in the schools by placing too much emphasis on one kind of
writing only, namely the impersonal and transactional, at the expense of the
personal and expressive (Britton et al, 1975). Yet it sometimes seems that
whatever attempts are made to liberalise examinations, the response in the
schools is inexorably towards training pupils in the new ways so that these
soon become as routinised as the bad old ways. Elsewhere (Wood, 1976(a)) I
have written of the deadlock which exists between examiners, teachers and
students and the solution is not at all clear. At least with multiple choice
all parties understand what is required of them.
That some individuals will rise above the restrictive circumstances of a
written paper and demonstrate organisation, originality and so forth is not in
dispute. What they must beware of is being thought too clever or of producing answers which are regarded as interesting but irrelevant. The people who
frame multiple choice questions are, by and large, the same people who frame
and mark essay questions. Most will have graduated from one to the other. If
their thinking is "convergent" it will show in both cases. Vernon (1964, p.7)
may have had this in mind when he remarked that it is by no means certain that
conventional examinations are capable of eliciting what are often called the
higher qualities. He observed, quite correctly, that the typical examination
answer, at least at 15-16 years, tends to be marked more for accuracy and
number of facts than for organisation, originality etc., not least because
this ensures an acceptable level of reliability. This being so, it seems to
me that what Hoffman claims is true of teachers of "gifted" students - "that
such teachers often feel it necessary to warn precisely their intellectually
liveliest students not to think too precisely or deeply when taking mechanised
tests" (Hoffman, 1967(a), p.383) - might equally well be applied to essay
tests. If this reads like a "plague on both your houses", that is not my
intention. The point is that multiple choice is not alone in having deficiencies - they can be found in all the techniques used in public examinations.
As long as it serves a useful assessment function - and I have tried to
establish that it does - the weaknesses, which are technical or procedural,
can be attended to. In this respect I wish that the essay test had received
even half the attention bestowed on multiple choice.
If Hoffman's first charge is seen to be shallow, what of his second - that
the gifted, creative, non-conformist mind is apt to see more in multiple
choice questions than was intended, the consequence of which is to induce
uncertainty, perplexity and ultimately incorrect answers? All the examples
Hoffman produces are designed to show that to the "gifted" the question is not
meaningful, or contains more than one answer or else does not contain the
answer at all. "Only exceptional students are apt to see the deeper defects
of test items" (Hoffman, 1967(a), p. 383) he remarks at one point, but is not
a student exceptional precisely because he can see the deeper defects? What
Hoffman, who naturally includes himself among the "gifted", and other critics
forget, is that the items they take apart are meant for average 16 or 18 year
olds who do not possess their superior intellects. In these circumstances it
is hardly surprising that hard scrutiny of questions will reveal "ambiguities",
unseen by the average eye. Whether or not "gifted" persons find these ambiguities in the examination room and how they react to them, having found them,
is very much open to question. Too little work has been done on this subject
but the best study to date (Alker, Carlson and Hermann, 1969) concluded that
"first-rate" students were not, in general , upset and penalised by multiple
choice questions. They found that characteristics of both superficial and
deep thinking were associated with doing well on multiple choice questions. There the matter rests until more evidence comes along. A further study along
the lines of the one just discussed would be worth doing.
Whenever we refer to "ambiguities" we must bear in mind that knowledge is
always provisional and relative in character; most of us are given, and then
settle for, convenient approximations to "true" knowledge. Writing about
science, Ravetz (1971, Chapter 6) has remarked on the tendency of teachers,
aided and abetted by textbook writers, to rely on standardisation of scientific
material and for his purpose to introduce successive degrees of banality as
the teaching becomes further displaced from contact with research and high
class scientific debate. The end product is what he calls vulgarised knowledge
or what Kuhn (1962), more kindly, calls normal science.
Ravetz's analysis is hard to fault but the solution is not easy to see. Driver
(1975) is surely right that "there are no 'right answers' in technology", yet
when she writes "instead of accepting the teacher's authority as the ultimate
judge, the pupils can be encouraged to develop their own criteria for success;
to consider their own value systems and to make judgements on them" one
recoils, not because of the radical nature of the sentiment but because the
teaching programme implied seems too ambitious for the majority of 15 and 16 year olds, at any rate. Can you teach children to be sceptical about ideas
before they know anything to be sceptical about? Perhaps it can be done but
it needs a particular teaching flair to be able to present knowledge with the
right degree of uncertainty. In general it seems inevitable that most children
will be acquainted only with received knowledge and ideas which contribute to
an outdated view of the physical world. Whether they are willing or able to
update this view later will depend on training, temperament and opportunity.
The relevance of all this for multiple choice is obvious. For Hoffman and
other critics, multiple choice embodies vulgarised knowledge in its most
blatant form; it deals, in Barzun's (1959, p.139) term, with the "thought-cliche". Through the items they set, examiners make public their versions of
knowledge and through the medium they foster the impression that every problem
has a right answer. The sceptical mind is given no opportunity to function a choice must be made. Worse, the format may reinforce misconceptions.
Finally, not only does multiple choice reflect the transmission of standardised
knowledge, through so-called "backwash" effects, it encourages it. That, at
any rate, is what the critics say. Myself, I see no point in denying that
multiple choice embodies standardised knowledge. If that is what is taught,
then all examining techniques will reflect the fact. More than anything else,
examinations serve to codify what at any time passes as "approved" knowledge,
ideas and constructions. Those who find the spectacle distasteful attack
multiple choice because it is such a convenient target but what could be more
standardised than the official answer to a question like "It is often said
that Britain has an unwritten constitution. Discuss."
One place where the arguments about multiple choice as a representation of
objective knowledge have come to a head is over the English Language comprehension exercise set at Ordinary level by the London GCE board and similar tests
set by other examining boards. Basically there is a conflict between those
who insist that judgement about meaning is always subjective and who deny
that there are any "correct" interpretations (e.g. Honeyford, 1973) and those
(e.g. Davidson, 1974) who see multiple choice as a formalisation of a public
discussion about meaning. It seems to me that the resolution of this conflict
depends, once again, on who and what is being tested. Were the subject at
issue Advanced level English Literature, where one must expect candidates to
come forward with different interpretations of texts, I would want to give
Honeyford's argument a lot of weight, even if marking raises severe problems.
For if, as he maintains, comprehension is an essentially private experience it
is logical nonsense to attempt to standardise examiners' opinions. One of the
severest critics in print on multiple choice found himself caught in just this
dilemma when championing essay tests, "If you standardise essay tests, they
become as superficial as multiple choice; if you do not standardise them, they
measure not the abilities of the examinee but function rather as projective
tests of the graders' personalities" (La Fave, 1966). Presumably, the proper approach to marking in these circumstances is to allow different interpretations, providing they can be supported convincingly. This presupposes a broadmindedness on the part of examiners which may not exist, but I see no other solution. With Ordinary level English Language comprehension, on the other hand, the latitude for varying interpretations of meaning is not so great. The candidates are younger and the material is simpler. Sometimes it appears at first glance debatable whether a word or phrase conveys a meaning best, but on close analysis it usually turns out that the examiners have gone for finer distinctions in order to test the understanding of the more competent candidates. In doing so, however, they run the risk of provoking controversy; it is no accident that most of the complaints about multiple choice concern items where the discrimination called for is allegedly too fine or is
reckoned to be non-existent. Consider, for instance, the following O-level
English Language comprehension item set by the London board in June 1975 which
came in for some criticism in the correspondence columns of the Guardian and
The Times Educational Supplement.
The item refers to the following sentence which was part of a longer passage:
"The distinction of London Bridge station on the Chatham side is that
it is not a terminus but a junction where lives begin to fade and
blossom again as they swap trains in the rush hour and make for all
regions of South London and the towns of Kent."
The item was as follows:
The statement that London Bridge is a place "where lives begin to fade and blossom again" is best explained by saying that it is a place where people:

A  Grow tired of waiting for their trains and feel better when they have caught them.
B  Flag at the end of their day and revive as they travel homeward.
C  Leave behind the loneliness of the city and enjoy the company in a crowded carriage.
D  Escape from the unhealthy atmosphere of London and flourish in the country.
E  Forget about their daily work and look forward to enjoying their leisure.
According to one critic (Guardian, 17.6.75), there are "rules that pertain to
this type of question. One answer must clearly be perceived to be correct and
evidence must be forthcoming why this is so". In his view, the London board
broke that "rule" with this item, and indeed others in the same paper. The
crux of the matter is obviously the word "clearly" and here is where I part
company with the critic. The examiners have set candidates an item which calls
for rather closer attention to the text than might generally be the case. But
is this so wrong? A test where the right answer jumped out every time would
be a very dull test. As it happened, the reason why statement B was considered
to be the best answer was explained very nicely by another correspondent to the
Guardian. This is what she said:
"'The candidate does not need to read the examiner's mind, if he reads
the question. In the sentence you are not told:
A
C
D
E
Whether the people grow tired of waiting, or whether
The city is lonely, or whether
London is unhealthy, or whether
They will forget their work.
You are told that "lives begin to fade and blossom again" and statement
B best explains this by saying London Bridge is a place where people
flas at the end of their day and revive as they travel homeward."
(Guardian, 24.6.75)
BACKWASH
I would like to distinguish two kinds of backwash. The first concerns the
effect of an examining technique on the way subject matter is structured,
taught and learnt, the second concerns the way candidates prepare and are
prepared for the technique in question.
In the case of multiple choice this
involves developing what the Americans call "test-wiseness" - the capacity to
get the most marks from a test by responding to cues, knowing how to pace
oneself and so forth. In the case of essay tests, the comparable behaviour
would be knowing how to "spot" questions, how long to spend on questions and
generally knowing how to maximise one's chances.
Providing it is not overdone, the second kind of backwash need not be taken as
seriously as the first. After reviewing 80 or so studies, Mellenberg (1972)
observed that "there seems to be no evidence that study habits are strongly
affected by the type of test that students are expecting in examinations".
Vernon (1964) offers the view that "so long as the objective questions are
reasonably straightforward and brief, we know that the amount of improvement
brought about by coaching and practice is limited ... However, it is possible
(though there is little direct evidence) that facility in coping with more
complex items is more highly coachable and that pupils who receive practice
at these may gain an undue advantage". When we come to look at the more
complex item types in Chapter 3, readers may feel that Vernon has a point. The
sort of coaching which is most likely to go on involves the collections of
items I referred to somewhat disparagingly in the introduction. Teachers may
believe that having their students work through these productions is the best
way of preparing for the examination, but they may be deluding themselves.
Covering the subject matter is one thing, mastering the technique another.
These collections of items may leave the candidate short of both objectives.
Reviewing the impact of multiple choice on English Language testing in three
African nations, Ghana, Nigeria and Ethiopia, Forrest (1975) maintained that
the most regrettable effect everywhere is the amount of time teachers give to
working objective questions in class, but added that better trained teachers
find that multiple choice gives them scope for better teaching - it is the
weaker ones who resort to undesirable methods. Whether the net result of
multiple choice coaching activity is any more serious in scale or effect than
the preparations which are made for essay and other kinds of tests one simply
does not know. There is a greater need for coaching in writing skills if the
comments of examiners are anything to go by.
Coming now to the backwash which affects learning, it is sometimes claimed
that multiple choice perpetuates false concepts or, what amounts to the same
thing, over-simplifies events and relationships through the limitations of the
format. “If you teach history with a view to circulating some idea of the
toleration of the other person's point of view, not only does multiple choice
not test this but it tends to have the opposite effect, with harmful effects
on the proper study of history" was the comment of one teacher in the
discussion following Nuttall's (1974) paper. This comment, of course, harks
back to the earlier discussion of the relativity of knowledge, and the varying
degrees of sophistication with which it can be handled. I can understand this
particular teacher feeling sore at having to suffer what he would regard as a
regression to "black and white" judgements, but I wonder if he was triygered
off by one or two clumsily phrased items which I am afraid are often the ones
the public sees.
Whether or not multiple choice actually reinforces wrong answers is a moot
question. Taking as his point of departure Skinner's (1961) dictum that
"every wrong answer on a multiple choice test increases the probability that a
student will someday dredge out of his memory the wrong answer instead of the
right one", Preston (1965) attempted to test the influence of wrong answers
(and right answers) upon students' grasp of vocabulary within the same hour.
The conditioning effect of wrong selections of items was demonstrated for some
words but not for others. Karraker (1967) obtained a more positive result when
he found that a group exposed to plausible wrong responses without being told
the correct answers made more errors on a later test than another group who
were told the correct answers. Eklund (1968), having carried out a thorough
experimental study of the question, maintained that the use of multiple choice
in the earlier stages of the learning process may involve considerable risks of
negative effects but that later on these risks seem to become much less marked.
This is interesting when we consider the terminal nature of examinations and
the fact that they often signal discontinuities in learning. How much candidates remember after examinations is in any case debatable. Miller and Parlett
(1974, p.107) put forward the idea that examinations actually serve to clear
the memory rather than reinforce existing knowledge, correct or incorrect.
This may sound an odd function of an examination but Miller and Parlett
claim that, unless "rehearsed" or used, detailed recall of factual information
drops rapidly after an examination, a claim we might all echo from our experience.
The best way of mitigating forgetting is to give immediate feedback of results.
In what is rare unanimity, the sundry studies in this area (Berglund, 1969;
Zontine, Richards and Strang, 1972; Beeson, 1973; and references there are in
Strang and Rust, 1973; Betz and Weiss, 1976(a), 1976(b)) all claim that immediate knowledge of results, item by item or at the end of the test, enhances
learning. Even if it is not feasible in connection with public examinations,
in the classroom, where diagnosis and repair are the critical activities,
immediate feedback is certainly possible and should always be given.
I have not mentioned what for many people is the real objection to multiple
choice - the opportunity it offers for blind guessing. That a candidate can
deceive the examiner by obtaining the correct answer when in a state of ignorance cannot be denied - there is no way of stopping it - but as I shall make
clear in Chapter 5, I do not see this as a grave impediment. Besides, the
opportunity for guessing exists with the traditional type of questions,
although this is seldom remarked upon. In particular, traditional essays
invariably require the candidate to guess which parts of his knowledge are
going to appeal to the examiner (Cross, 1972).
As in other instances, multiple choice tests are vulnerable to the guessing
charge because statistical evidence can be adduced, whereas for essay papers
it is so much harder to come by.
Critics of multiple choice testing are inclined to apply double standards. Not
only do they expect multiple choice to be something it is not, but they subject
it to tougher criteria than they apply to other techniques. Dudley (1973),
for instance, in the medical context, criticises multiple choice on the grounds
that it fails to test adequately all aspects of an individual's knowledge.
This is about as fair as complaining about a stethoscope because it cannot be
used to examine eyes and ears. I do not say that multiple choice is above
reproach; what I do say is that it must be viewed in context, and fairly.
American critics are entitled to be worried about what they see as the adverse
effects of multiple choice in the USA, but when criticism turns into crude caricature and obsessive vilification we should know when to part company.
Besides, much work has gone into stretching the basic multiple choice form in
an effort to test what are sometimes called "higher order" skills. What these skills might be is the subject of the next chapter.
SUMMARY
1. Just as the multiple choice test originated in the USA, so most of the
strongest criticism has come from there, particularly from Banesh Hoffman and
Jacques Barzun. One reason for this is the exclusive use of multiple choice
in school and college testing programmes which deprives students of the opportunity to express themselves in writing.
2. The British situation is quite different. If anything,
too much emphasis
has been given to writing. Multiple choice seldom attracts as much as 50 per
cent weighting in external school examinations; generally the figure is in the
region of 30 to 40 per cent.
3. Multiple choice serves a distinct assessment function. It makes the candidate concentrate on thinking about problems without requiring the extended
writing which can often be irrelevant and worthless, given the time-trial
conditions of examinations. Yet critics want it to be something it is not,
complaining that it cannot measure things like "toleration of the other man's
point of view", when no one ever claimed that it could. Multiple choice has
faults but so do other techniques. One point in its favour is that it leaves
the candidate in no doubt about what he has to do, unlike the essay test where
he has to guess what the examiner expects from him.
4. Multiple choice is criticised for encouraging students to think of knowledge as cut and dried and for penalising clever students who see ambiguities
their duller colleagues do not. It should be remembered, however, that knowledge is always provisional and that what is a sophisticated viewpoint to one
group is simple-minded to a more mature group. Examinations codify what is
accepted as 'approved' knowledge at any given time. Because examiners reveal
themselves more openly through multiple choice, it provides a convenient target for critics who resist the idea that knowledge is packaged in standardised
form.
5. Concerning the "backwash" effects of multiple choice, few "hard" data are
available. We simply do not know if multiple choice helps to perpetuate false
concepts and misinformation or leads to more superficial learning than would
have occurred otherwise. Nor do we know how much coaching of multiple choice
answering techniques goes on nor what payoff accrues. Information on these
matters is not necessarily required but those who pontificate on the baleful
effects of multiple choice ought to realise how little is known.
2. Recall, Recognition and Beyond
"Taking an objective test is simply pointing. It culls for the least
effort of mind above that of keeping awake - recognition." (Barzm, 1959)
Hoffman and Barzun scorn multiple choice because in their minds it calls for
lowly recognition and nothing else. This analysis simply will not do. Apart
from playing down the psychological complexity of what recognition may entail,
it fails to account for what happens when candidates are obliged to work out
an answer and then search among the alternatives for it. Most mathematical
calculations and data interpretations come into this category. Even a solution
achieved by exhaustive substitution of the alternatives into an expression cannot be said to be a case of recognition. As far as I can see, the Hoffman/
Barzun critique is based on one kind of item only, the simplest form of multiple choice involving memory for facts, formulas, words etc. Here is an
example of the sort of item Barzun and Hoffman regard as the enemy (Barzun,
1959, p.139):
"'Emperor' is the name of (a) a string quartet
(c) a violin sonata."
(bJ a piano concerto
Readers who have not immediately spotted the flaw in this item should know
that while (b) is the official answer, one of Haydn's quartets is also called
the 'Emperor'.
Consider now an altogether different type of item, reproduced below. Only by
grotesque twisting of the meaning of the word could anyone seriously claim
that the answer can be reached by recognition.
Output per worker per annum

             Steel (tons)   Wheat (tons)
Urbania           10             40
Ruralia            2             30

The table shows the cost of wheat in terms of steel, or vice versa, before the opening of trade. Assume that these costs remain constant for all levels of output, and that there are no transport costs. Which of the following will take place?

A  Urbania will export steel
B  Ruralia will export steel
C  Urbania will export both steel and wheat
D  Urbania will export steel and Ruralia will export wheat
E  It is impossible to predict until the terms of trade are known
(University of London, A-level Economics Paper 2, Summer 1972)
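To make the point concrete, here is a minimal sketch, entirely my own illustration and not part of the original item or its marking scheme, of the working a candidate has to do: compare the opportunity costs implied by the output figures (as reconstructed above) to find the pattern of comparative advantage.

```python
# Illustrative only: figures follow the table as reconstructed above; the
# reasoning is the standard comparative-advantage argument, not an official key.
output = {
    "Urbania": {"steel": 10, "wheat": 40},
    "Ruralia": {"steel": 2, "wheat": 30},
}

for country, per_worker in output.items():
    # Opportunity cost of one ton of steel, in tons of wheat forgone.
    wheat_per_steel = per_worker["wheat"] / per_worker["steel"]
    print(f"{country}: 1 ton of steel costs {wheat_per_steel} tons of wheat")

# Urbania gives up 4 tons of wheat per ton of steel, Ruralia 15; steel is
# relatively cheap in Urbania and wheat relatively cheap in Ruralia, so each
# would export the good it produces at lower relative cost. Reaching that
# conclusion requires calculation, not recognition.
```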
"Recognition is easier because, under comparable conditions, the presence of a
target word facilitates access to stored information", remarks the editor of a
recent collection of papers by experimental psychologists (Brown, 1976) and
this view is supported by the results of psychometric studies (Heim and Watts,
1967) in which the same questions have been asked in open-ended and multiple
choice form. It might therefore be thought that if something can be recalled,
it can be recognised as a matter of course. Not so, according to Tulving (in
Brown, 1976) who reports that recognition failure of recallable words can
occur more often than might be supposed. Nor need this surprise us for if, as
modern experimental psychologists maintain, recognition and recall are distinct
processes, the easier activity is not necessarily contained in the more difficult.
It would be as well, then, not to write off recognition as trivial or rudimentary. Where the object is to test factual knowledge it has its place, as does
recall where questions are open-ended. But, of course, both recognition and
recall are memory functions. Where candidates are obliged to engage in mental
operations, as in the question above, recall on its own will not be enough.
To be sure, successful recall may supply elements which facilitate problem
solving but the realisation of a solution will depend on the activation of
other psychological processes. What these processes might be is anyone's
guess - it is customary to give them names like "quantitative reasoning" or
"concept formation" or "ability to interpret data" or, more generally, "higher
order" skills. It is to these that I want to turn my attention.
HIGHER ORDER SKILLS
"!%e long experience with objective tests has demonstrated that there
me hard& any of the so-called
'higher'mental processes that cannot
be tested
with objective
tests." (Ashford, 1972, p. 4211.
Although seldom expressed so blithely, the claim that multiple choice items can
test "higher order" skills is often encountered.
In the sense that multiple
choice can test more than memory, the claim is correct, as we have seen, but one
sometimes gets the impression that those who make the claim want to believe it
is true but underneath are uneasy about the supportive evidence. The trouble
is that while we all think we know what we mean by "higher order" skills, terms
like "originality" or "abstract reasoning" themselves beg questions which lead
all the way back to satisfactory definitions of what constitutes skill X or Y.
It is fair to say that the drive to test higher order skills via multiple choice dates from the publication, twenty years ago, of Volume 1 of the Taxonomy relating to the cognitive domain (Bloom et al, 1956). Certainly no other work has been so influential in shaping our thoughts about cognitive skills.
The trouble is that too many people have accepted the Taxonomy uncritically.
Knowledge, Comprehension, Application, Evaluation, and Synthesis are still
bandied about as if they were eternal verities instead of being hypothetical
constructs constantly in need of verification. Wilson (1970, p.23) has
expressed nicely the value and limitations of the Taxonomy, or versions of it.
Referring to skills, he writes, "They are extruded in a valiant attempt to
create some order out of the complexities of the situation. As tentative
crutches to test writing and curriculum development they are useful, but we
must beware of ascribing to them more permanence and reality than they deserve".
Prefacing examination syllabuses with some preamble like "Knowledge will attract 15 per cent of the marks, Comprehension 20 per cent" and so forth, even
with a qualification that these figures are only approximate, nevertheless
conveys a precision which is simply not justified in the present state of our
knowledge.
This is not the place to appraise the Taxonomy in depth. Another work in this
series (de Landsheere, 1977) does just that. What I cannot avoid though is to
examine the psychological status of the Taxonomy to see how far it constitutes
a plausible model of cognitive processes, and therefore of higher order skills.
At the present time, attitudes to the Taxonomy range from more or less uncritical acceptance, e.g. "Where Bloom's discipline-free Taxonomy of Educational Objectives is used the cognitive skills are unambiguously defined in respect of the thought processes which go on in an individual student's mind" (Cambridge Test Development and Research Unit, 1975, p.6), through wary endorsement along the lines "It may leave a lot to be desired but it is the best
Taxonomy we've got", to downright hostility (Sockett, 1971; Pring, 1971;
Ormell, 1974). By and large I would say that the Taxonomy's influence is now
as much on the decline as it was in the ascendant ten years ago, when indeed I
was promoting it myself (Wood, 1968) although not entirely without reservation.
The overriding criticism, as Sockett (1971, p.17) sees it, is that "the Taxonomy operates with a naive theory of knowledge which cannot be ignored however classificatory and neutral its intentions". In particular, he rejects the division into Knowledge and Intellectual Skills and Abilities, claiming that in the things we are said to "know"
there are necessarily embedded all manner of
"intellectual skills and abilities" e.g. to know a concept is to understand it.
One is bound to say that the organisation of the Taxonomy is remarkably ad hoc, not grounded in any psychological principles other than that knowledge is straightforward and anything involving mental operations is more difficult. As Sockett puts it (p.23-24), "to rank them (cognitive processes) in a simple-complex hierarchy either means that as a matter of fact people find reasoning
more difficult than remembering - which may or may not be true - or that there
are logical complexities in reasoning not present in remembering, which has not
been shown".
If we ask whether the proof is in the pudding, the evidence from the various
attempts at empirical validation is not impressive. The standard device of
asking judges to say what they think an item is measuring has revealed, as far
as the higher Bloom Taxonomy categories are concerned, that agreement is the
exception rather than the rule (Poole, 1972; Fairbrother, 1975). Since the
exercise is like asking people to sort fruit into apples, oranges, bananas etc.
without giving them more than the vaguest idea of what an apple or a banana
looks like, this is hardly surprising.
It follows that attempts to verify or
even re-constitute the hierarchical structure of the Taxonomy (Kropp et al,
1966; Seddon and Stolz, 1973) which have by and large failed to verify the
hypothesised structure, are doomed in any case for if the measures are "dirty"
to start with nothing "clean" is going to come out. Of course, there is always
the correlational approach to validating and classifying items, that is finding items which seem to cluster together and/or relate to some external criterion. The weakness of this approach, as Levy (1973) has observed, is that
"we might know less about the tests we drag in to help us understand the test
of interest than we already know about the test".
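The judging exercise itself is easy to set out in miniature. The sketch below is my own illustration, with invented figures and a bare pairwise agreement rate rather than any index used in the studies cited: each judge assigns every item to a Taxonomy category, and agreement is the proportion of items on which two judges choose the same category.

```python
from itertools import combinations

# Hypothetical classifications of five items by three judges (illustrative only).
judgements = {
    "judge A": ["Knowledge", "Comprehension", "Application", "Comprehension", "Analysis"],
    "judge B": ["Knowledge", "Application", "Application", "Knowledge", "Comprehension"],
    "judge C": ["Comprehension", "Comprehension", "Analysis", "Knowledge", "Analysis"],
}

# Proportion of items on which each pair of judges assigns the same category.
for (name1, cats1), (name2, cats2) in combinations(judgements.items(), 2):
    agreement = sum(c1 == c2 for c1, c2 in zip(cats1, cats2)) / len(cats1)
    print(f"{name1} vs {name2}: {agreement:.0%} agreement")
```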
The failure of these fishing expeditions to verify hypothesised models of
psychological processes indicates to me that we have been in too much of a
hurry to build systems. We have skipped a few stages in the development.
What we should have been doing was to fasten on to modest competencies like
"knowledge of physics concept X" and make sure we could measure them. Although
a systems-builder himself, Gagne (1970(a)) recognised the necessity for doing
this as the first order of business. For measurement to be authentic, he says,
it must be both distinctive and distortion-free.
The problem of distinctive
measurement is that of identifying precisely what is being measured. Keeping
measurement distortion-free means reducing, as far as possible, the "noise"
which factors such as marker error, quotidian variability among candidates and
blind guessing introduce into measurement operations.
Put like this, distinctiveness and freedom from distortion would appear to be
"validity" and "reliability" thinly disguised, and there is some truth in this,
particularly where freedom from distortion is concerned. With distinctiveness,
however, Gagne wishes to take a tougher attitude towards validity. Working
within a learning theory framework, he believes that only when suitable controls are employed can dependable conclusions be drawn about what is being
measured. "Distinctiveness in measurement has the aim of ruling out the
observation of one category of capability as opposed to some other capability"
(Gagné, 1970(a), p.111). Gagné imagines a two-stage measurement in which the
first stage acts as a control to ascertain whether prior information needed to
answer the second stage is present. Levy (1973, p.32) suggests a similar
procedure for investigating discrepancies in behaviour. If it is "knowledge
of principles" we are after we must make sure we are testing this and not
"knowledge of concepts". To use one of Gagng's examples, the principle that
"a parallelepiped is a prism whose bases are parallelograms" may not have been
learned because the learner has not acquired one or more of its component concepts, whether "prism", "base" or "parallelogram". Thus the first stage of
measurement would be to determine whether these concepts have been acquired.
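A minimal sketch of how this two-stage idea might be operationalised is given below. It is my own illustration, not a procedure taken from Gagné or from the text; the concept names simply follow the parallelepiped example, and the interpretation rule is only one way the control stage could be used.

```python
def interpret_two_stage(concepts_acquired: dict, principle_correct: bool) -> str:
    """Stage one checks the component concepts (e.g. 'prism', 'base',
    'parallelogram'); stage two is the item on the principle itself."""
    if not all(concepts_acquired.values()):
        # If a prerequisite concept is missing, a failure at stage two tells us
        # nothing distinctive about knowledge of the principle itself.
        return "uninterpretable: prerequisite concept(s) not acquired"
    return "principle learned" if principle_correct else "principle not learned"


# Example: the candidate shows knowledge of 'prism' and 'base' but not
# 'parallelogram', so the stage-two failure is not blamed on the principle.
print(interpret_two_stage(
    {"prism": True, "base": True, "parallelogram": False},
    principle_correct=False,
))
```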
The idea that items should measure one concept or principle or fact at a time,
and not a mixture of unknown proportions, is, of course, hardly new. Back in
1929, in a marvellous book which would still be instructive reading for examiners, Hamilton was hammering home the message:
“In deciding how much information he shall give the candidate, or how
much guidance by controlling clauses, the examiner will, of course, be
guided principally by the indication he wants the candidate's answers
to have. His chief aim in setting the question is to test the candidate's power of dealing with the volumes of such compound solids as the
sausage-shaped gas-bag, but if he omits to provide the formula, he will
clearly fail to test that power in a candidate who does not happen to
remember the formula." (Hamilton, 1929, Chapter 6)
Items which cannot be answered because too little information is given, laboratory-style items which can be answered without doing the experiments, English
Language comprehension items which can be answered without reading the passage
(Preston, 1964; Tuinman, 1972) and modern language comprehension items which
can be answered with little or no knowledge of the language - all these are
instances of lack of distinctiveness attributable to failure to assess learning
which is supposed to have occurred. Where comprehension items are concerned,
it has been suggested that a useful check on distinctiveness is to administer
the items without their associated passages in order to determine their passage-dependence (Pyrczak, 1972, 1974). Prescott (1970) makes a similar suggestion
in connection with modern language comprehension items, except that he wants
the items to be tried out on people who have not been taught the language so
that he can find out exactly how much the items depend on acquisition of the language. The idea is a good one but difficult to put over to experimental subjects.
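One crude way of quantifying such a check is sketched below. The index (a simple difference in proportions correct), the 0.20 cut-off and the figures are all my own, for illustration only; they are not Pyrczak's or Prescott's published procedures.

```python
def passage_dependence(p_with_passage: float, p_without_passage: float) -> float:
    """Difference in proportion correct when the passage is and is not supplied.
    Values near zero suggest the item can be answered without the passage."""
    return p_with_passage - p_without_passage


# Invented figures for two comprehension items, tried on comparable groups.
items = {"item 1": (0.82, 0.35), "item 2": (0.78, 0.74)}
for name, (p_with, p_without) in items.items():
    d = passage_dependence(p_with, p_without)
    # The 0.20 threshold is arbitrary, chosen here purely for illustration.
    verdict = "passage-dependent" if d >= 0.20 else "answerable without the passage"
    print(f"{name}: difference = {d:.2f} -> {verdict}")
```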
That the need for distinctiveness of measurement can be overlooked is illustrated by the following example from an item-writing text book (Brown, 1966,
p.27). The item and the accompanying commentary are reproduced below.
In cases of myopia, spectacles are needed which contain:
a. convergent lenses
b. coloured lenses to reduce glare
c. lenses to block ultra-violet light
d. divergent lenses
"A useful type, first since 'myopia' had to be correctly identified
with 'short sight', from which the principle of divergence to 'lengthen'
the sight had to be deduced. Option a was clearly a direct opposite
but b and c also distracted weaker pupils who had to guess the meaning
of 'myopia'."
To be fair, I ought to add that elsewhere in the book Brown shows a lively
awareness of the need for distinctiveness as when, writing about practical
science, he argues for items which can only be answered successfully if the
candidate has undergone a practical course in the laboratory.
It seems to me that the often heard criticism of the Taxonomy - that what is
"Comprehension" to one person is "Knowledge" to another or what is genuine
"Application" to one is routine to another - is attributable to a failure to
pay enough attention to distinctiveness. Granted we know precious little
about what happens when a person encounters an item (Fiske, 1968). Introspection studies (e.g. Handy and Johnstone, 1973) have been notably uninformative,
mainly because candidates have difficulty describing what they did some time
after the event. Yet too often items have been so loosely worded as to permit
a variety of problem-solving strategies and therefore a variety of opinions as
to what the item might have been measuring. Gagné (1970(a)) takes the Bloom
Taxonomy to task for perpetuating this kind of confusion, citing one of the
illustrative items which is meant to be measuring knowledge of a principle but
which might well be measuring no more than a verbal association.
But does it matter how a candidate arrives at an answer? Surely it is likely
to make little difference whether a distinction can or need be drawn between
the learning of a concept, say, and the learning of a principle since those who
"know more" are going to learn faster and achieve better ultimate performancess
regardless of what the particular components of their capabilities are. I
think it does matter. In the first place achievement may not be as ordered and
sequential as this proposition implies. We do not know enough about the growth
of skills to be sure that knowledge of X implies knowledge of Y. In the second
place, the sanguine be1ie.f that ability will show through regardless can have a
powerful effect on teaching; in particular, it may encourage teachers to present material prematurely or pitch it at too high a conceptual level, omitting
the intermediate steps. Shayer (1972) has remarked on these tendencies in
connection with the Nuffield O-level Physics syllabus. The kind of examination
question where examiners have not bothered to ascertain whether the basics have
been assimilated but have moved immediately to test higher levels of understanding only blurs the measurement. Unfortunately the practice of lumping
together performance on all items into a test score does little to encourage
belief in distinctiveness as something worth having. Were we to move towards
two-stage or multi-stage measurement rather than depend on the single item (alone or in collections) then, as Gagné points out (1970(a), p.124), we
would have to devise new scoring procedures and testing would become quite
different from what we are used to.
Valuable though the notions of distinctiveness and freedom from distortion are as measurement requirements, they are of limited use when it comes to relating higher order skills and systematising. The way forward depends, I believe, on developing a keener understanding of how learning cumulates in individuals. As I have already indicated, I do not think the Taxonomy provides an adequate description of psychological processes, much less promotes understanding of how behaviour is organised into abilities. Nor am I convinced that Gagné's own theory of learning (Gagné, 1970(b)) is the answer. I would not want to dismiss
any enterprise which attempts to understand basic processes of learning but
where complex constellations of skills are concerned I have doubts about the
utility of an atomistic model of learning. Anyone who has studied reports of
attempts to validate hypothesised hierarchical sequences of learning will know
how complicated and elaborate an analysis of the acquisition of even a simple
skill can be (see, for instance, Resnick, Siegel and Kresh, 1971).
Where does this leave us? If in Gagné's scheme the learning networks are so intricate that one is in danger of not seeing the wood for the trees, other models of intellectual growth seem all too loose and vague. Levy (1973) maintains that the simplex - which means a cumulative hierarchy like the Taxonomy - should be regarded as the model of growth but gives little indication as to how it might work out in practice. Anastasi (1970) makes some persuasive speculations about how traits or abilities develop and become differentiated, which help to clarify at a macro-level how learning may occur, but leaves us little wiser concerning the nature of abilities. There is Piaget's theory and its derivatives, of course, and in this connection there have been some interesting attempts to elucidate how scientific concepts develop in adolescents (e.g. Shayer, Küchemann and Wylam, 1975). The object of this work, which is to determine what to teach (test) when, seems to me absolutely right and offers, I am sure, the best chance of arriving at a coherent view of how abilities develop and articulate.
For the time being, though, I imagine we shall continue to proceed pragmatically, attempting to measure this ability or that skill - "ability to see
historical connections", "ability to read graphs" etc - whenever they seem
appropriate in the context of a particular subject area without necessarily
worrying how they relate, if at all. Actually this may be no bad thing
providing the analysis of what skills are important is penetrating. In this
connection, Wyatt's (1974) article makes suggestive reading. Writing of
university science examinations, he argues that for each student we might wish
to know: "Wow much subject matter he knows; how well he communicates both
orally and in writing; how well he reasons from and about the data and ideas
of the subject; how well he makes relevant observations of material; how far
he is familiar with and uses the literature and books; how well he can design
experiments; how well he can handle apparatus and improvise or make his own;
how far he can be trusted with materials; how skilled he is at exhibiting his
results; how skilled he is with mathematical, statistical and graphic manipulation of data". Obviously multiple choice cannot be used to measure each of
these skills or even most of them, but enumeration of lists like these at least makes it easier to decide which testing technique is likely to work
best for each skill.
In the next chapter I will discuss, with illustrations, how the simple multiple
choice form has been extended into different item types in a bid to measure
abilities other than factual recognition and recall. It will become evident
how the best of these item types succeed in controlling the candidate's
problem-solving behaviour but also what a ragbag of almost arbitrarily chosen
skills they appear to elicit, a state of affairs which only underlines that
we test what is easiest to test, knowing all the time that it is not enough.
SUMMARY
1. Multiple choice items can demand more than recognition, despite what the
more hostile critics say. Whenever candidates are obliged to work out an
answer and then search among the alternatives for it, processes other than
recognition, which we generally call higher order skills, are activated.
2. Attempts to describe and classify these higher order skills have amounted
to very little. Bloom's Taxonomy has promised more than it has delivered.
Generally speaking, denotation and measurement of higher order skills has
proceeded in an ad hoc fashion according to the subject matter. However the
failure to substantiate taxonomies of skills may not matter providing a
penetrating analysis of what students ought to be able to do is carried out.
It is suggested that more attention should be given to measuring what we
say we are measuring and in this connection Gagné's notions of distinctiveness and freedom from distortion are discussed.
3. Item Types
"Choice-typeiternscan be constructed to assess complex achievement
in any area of study." (Senathirajahand Weiss, 1971)
Two approaches to measuring skills other than factual recall, classification
or computation can be distinguished. One tries to make the most of the basic
multiple choice form by loading it with more data and making the candidate
reason, interpret and evaluate, while the other throws problems into different
forms or item types which oblige the candidate to engage in certain kinds of
thinking before choosing an answer in the usual way. The development of these
item types can be seen as an attempt to control and localise the deployment of
higher order skills.
The danger with increasing the information load is, of course, that items can
become turgid and even obscure. This item taken from the Cambridge Test
Development and Research Unit (TDRU) handbook for item writers (TDRU, 1975)
illustrates how the difficulty of giving candidates enough information to make
the problem believable, without swamping them, can be overcome.
If the county council responsible for the north west corner of Scotland had to choose between the construction of a furniture polish factory which would employ 50 people and a hydro-electric power station it should choose:

a. The factory, because the salty and humid temperature causes a rapid decay of exposed wood.
b. The factory, because the long-term gain in employment would be greater than that which the power station could provide.
c. The factory, because it would make use of the natural resources of the region to a greater extent than the power station could.
d. The power station, because it would result in a large number of highly paid construction workers being attracted into the region.
e. The power station, because the power production in the Highlands is insufficient to meet the needs of this part of Scotland.
The item seems to be measuring appreciation of the relative importance of
economic and social factors. If it seems too wordy the reader might look at
the illustrative items for the higher Bloom Taxonomy categories and consider
whether Vernon (1964, p.11) was being too kind when he suggested that many
readers will find them "excessively verbose, or even perversely complicated".
I happen to think that the item is not too wordy but there is no getting away
from the fact that items like this make considerable demands on candidates'
reading comprehension. This in turn can threaten the distinctiveness of the
measurement; if candidates have difficulty understanding the question or the
instructions - a point I will discuss when I come to other item types - there
must be doubt as to what their responses mean.
Now,
of course, it is perfectly true, as Vernon observes, that all examinations
involve a common element of reading comprehension, of understanding the questions and coping with the medium. At the same time it is desirable that the
candidate should be handicapped as little as possible by having to learn the
medium as well as the subject, the principle being that examinations should
take as natural a form as possible. This places multiple choice item writers
in something of a dilemma. On the one hand the need for distinctive measurement obliges them to exercise what Hamilton (1929) called "guidance by controlling clauses", yet the provision of this guidance inevitably demands more reading from candidates. Exactly the same dilemma faces the compiler of essay
questions. Nor is there any instant remedy. The hope must be that when formulating questions examiners will use language in a straightforward, cogent, and
effective manner, remembering that the cause of candidates is not advanced by
reducing the wording of questions to a minimum.
ITEM TYPES OTHER THAN SIMPLE MULTIPLE CHOICE
When considering an item type the first thing to ask is whether it performs
some special measurement function or whether, to put it bluntly, it has any
functional basis at all. An item type may be invented more from a desire for diversity and novelty than from a concern to satisfy a measurement need. Gagné (1970(a)), for one, has argued, rightly in my view, that we have a set of testing techniques and some measurement problems but that the two do not necessarily correspond. When evaluating an item type, we should ask ourselves "Does it do something different?" "Does it test something worth testing?" "Is the format comprehensible to the average candidate?" and, above all, "Could the problem be handled just as well within an existing item type, especially simple multiple choice?"
The first item type to be discussed - the true/false type - is different from
the rest in that it is a primitive form of multiple choice rather than an
embellishment.
True-false
Of all the alternatives to simple multiple choice the ordinary true-false (TF)
item has been subjected to most criticism. The reasons are obvious; the possibility of distorting measurements through guessing is great, or so it appears,
and there would seem to be limited opportunity to ask probing questions. For
some time now, Ebel (1970, 1971) has been promoting the TF item but his seems
to be a lone voice. As regards guessing, Ebel discounts it as a serious factor,
believing that when there is enough time and the questions are serious, people
will rarely guess. He also believes that TF items can measure profound thought,
his grounds being that the essence of acquiring knowledge is the successive
judgement of true-false propositions. This is a claim which readers will have
to evaluate for themselves. Personally I am sceptical. Where the acquisition of knowledge or skills can be programmed in the form of an algorithm e.g. the
assembly of apparatus, Ebel's claim has some validity but where knowledge comes
about through complex association and synthesis, as it often does, then a more
sophisticated explanation is required.
It is significant that Ebel believes TF items to be most effective in teaching/
learning situations. Inasmuch as the teacher may expect to get more honest
responses and to cover ground quickly, one can see what he means. Actually,
the whole multiple choice genre has to be viewed differently in the context of
a teaching situation compared to that of a public examination. In particular,
restrictions about wording can be relaxed because the teacher is presumably
at hand and willing to clarify items if necessary. Moreover, since the teacher
and not a machine will be doing the marking, the form can be extended to allow
candidates to volunteer responses either in defence of an answer and/or in
criticism of an option or options. This is multiple choice at a very informal
and informative level, and there is no reason why teachers should not use
true/false items as long as they know what they are doing.
(Cartoon reproduced by permission of United Features Syndicate Inc.)
One recent investigation into the setting of true-false items is perhaps worth
mentioning. Peterson and Peterson (1976) asked some students to read a prose
passage and then respond to items based on it which were either true or false
and were phrased either affirmatively or negatively. Thus, for example, the
facts which read: "The mud mounds so typical of flamingo nests elsewhere did
not appear in this colony; there was no mud with which to build them. Instead
the birds laid their eggs on the bare lava rock" yielded these four true/false
items:
1. The flamingoes in the colony laid their eggs on bare rock. (true affirmative)
2. The flamingoes in the colony built nests of mud. (false affirmative)
3. The flamingoes in the colony did not build nests of mud. (true negative)
4. The flamingoes in the colony did not lay their eggs on bare rock. (false negative)

It was found that true negatives yielded most errors, followed by false negatives, true affirmatives and false affirmatives. Peterson and Peterson concluded that if test constructors wish to make true-false items more difficult the correct policy is not to include more false than true statements in
the test, as Ebel (1971) suggested, but rather to include more statements
phrased negatively. It should be mentioned that the results of this study
differed somewhat from those of an earlier study by Wason (1961) who found
true affirmatives to be no easier than false affirmatives although on the
finding that true negatives are harder to verify than false negatives, the two
studies are in agreement. Unfortunately neither study can be regarded as
authoritative; Peterson and Peterson's, in particular, is almost a caricature
of the typical psychology experiment. You will see what I mean from their
description of the subjects: "Forty-four students (ten males and thirty-four
females) from the introductory psychology course at Northern Illinois University volunteered for the experiment and thereby added bonus points to their
course grade". Nor did these small numbers inhibit the investigators from
carrying out significance tests although mercifully they refrained from testing
for sex differences in response to the items.
Multiple true-false
The process of answering a multiple choice item can be thought of as comprising
a number of true-false decisions rolled into one, the understanding being that
one answer is true and the rest are false. By contrast, there is another type
of item - called by some "multiple true-false" (Hill and Woods, 1974) - where
each of the statements relating to a situation can be true or false. This is widely used in medical examining where it is sometimes known as the
"indeterminate" type e.g. in the University of London. Ten years or so ago it
enjoyed a vogue in connection with CSE experimental mathematics papers under
the name "multi-facet" (Schools Council, 1965). Here is an example (T, F and
D/K indicate True, False and Don't Know respectively):
A measurement, after a process of calculation, appears as '2.6038 metres'.

                                                              T    F    D/K
(a) The measurement is 2.60 cm. to the nearest ...
(b) The measurement is 2.04 mm. to the nearest ...
(c) The measurement is 2.60 m. to two significant figures.
(d) The measurement is 260.38 cm. to two significant figures.
(e) The measurement is 264 ... to three significant figures.
The attraction of exploiting a situation from a number of points of view is
obvious. One objection which has been raised against this item type is that
getting one facet right could well depend on getting another right. The
orthodox view, and this applies to items in general, is that efforts should be
made to keep items independent of each other in the sense that the probability
of a person getting an item right is not affected by his responses to other
items. How realistic this requirement is is anyone's guess; my feeling is
that items do and perhaps should inter-relate. To put it another way, are
items on the same subject matter ever truly discrete? Certainly, if Gagné's two-stage measurement procedure or something like it were to be realised, the items would be intimately related and new scoring formulas and much else would be required (Gagné, 1970(a), p.124).
Multiple completion
The multiple completion or selection type of item requires the candidate to
choose a pre-coded combination of statements, as in the following example from
the London GCE board (University of London, 1975, p.24).
In the question below, ONE or MORE of the responses given are correct. Decide which of the responses is (are) correct. Then choose:

A. If 1, 2 and 3 are all correct
B. If 1 and 2 only are correct
C. If 2 and 3 only are correct
D. If 1 only is correct
E. If 3 only is correct

Which of the following would be desirable if the level of unemployment was too high?

1. An increase in saving
2. A rise in exports
3. A rise in the school leaving age
The difference between this item type and the multiple true-false type is one
of function. The multiple true-false question is usually just a set of questions or propositions which, although relating to the same theme or situation,
are not necessarily joined structurally or organically whereas the multiple
completion item can be used to probe understanding of multiple causes and
consequences of events or complex relationships and therefore has much more
range. For this reason, I would rate the multiple completion type as more
rewarding in principle. However, much depends on the item writer's ability to
make the most of the combinations so that, if you like, the whole is more than
the sum of the parts. In this connection, the Cambridge handbook (TDRU, 1975,
p.20-21) has something apposite to say,
"Many item writers find the Multiple Selection type of item the easiest
kind to write, which is not surprising if one looks upon this type as
being little more than three true/false questions, linked, of course, by
a common theme. In writing Multiple Selection items, every effort should
be made to make them imaginative and to consider carefully how the
candidates will look upon, not only the statements, but also the possible
combination of statements, in order to aim for the highest possible
discrimination power."
The snag with multiple completion items is that they require the use of a
response code. If it is reckoned that in addition to coding his answer, the
candidate has to transfer it to an answer sheet, a task which has been shown
by Muller, Calhoun and Orling (1972) and others referred to therein to produce
more errors than occur when answers are marked directly in the test booklet,
the possibilities for error will be apparent. The likelihood of distortion is
increased by the fact that the coding structure contains information which
candidates can use to their advantage. If a candidate can definitely rule
out one statement, he can narrow down the choice between alternative answers
and the more statements there are the more clues are given away. To prevent
this sort of thing happening it has been suggested that "any other statement
or combination of statements" might be used as an option but the TDRU handbook claims that it is difficult to obtain statistically sound items with this
option as a key. I am not opposed to the use of this particular option but
I recognise that it may introduce an imbalance into the item which is liable
to threaten the coherence of the problem (see the discussion of the "none of these" option in Chapter 4).
Evidence that the coding procedure does introduce error, at least among the
less able, has been presented by Wright (1975). An unpublished study of my own
(Wood, 1974), which used GCE O-level items rather than the very easy items used
by Wright, revealed that the coding structure did work in favour of the more
able, as expected. The obvious way to dispense with coding structures would be
to ask candidates to make multiple responses directly on the answer sheet, and
to program the mark-sensor and score accumulator accordingly. This is common
practice in medical examining (see, for instance, Lever, Harden, Wilson and
Jolley, 1970). The result is a harder, and a fairer, item but would candidates
be confused if a multiple completion section requiring multiple marking were
to be placed in a test which otherwise required single marks in the conventional manner? My investigation, although not conclusive, suggests that mixing
the mode of response is unlikely to worry candidates any more than the other
switching they have to do in the course of a typical GCE 0- or A-level test.
Besides, the multiple completion section could always be placed at the end of
the test.
I am drawn to the view that without a system of multiple responding the multiple completion item type is too prone to give distorted results. The directions used by the London GCE board (see the example) are capable of improvement
but even at its most lucid and compact the rubric would still worry some candidates. Some would say that these candidates would make a hash of the items,
anyway, but even if this were true I see no reason to compound the superiority
of the cleverer candidates. Agreed we give scope to intelligence in many
different ways, often without realising it, but where the opportunity exists
to stop blatant advantage it should be taken.
Assertion-reason
The assertion-reason item was devised for the purpose of ascertaining candidates' grasp of causality. In terms of the strong feelings it arouses, this
item type is not far behind the true-false item, of which it is a variant.
Once again the candidate has to cope with involved directions but the more serious objections concern the logical status of the task itself. The directions used by the London GCE board are reproduced below together with a sample item (University of London, 1975).
Each of the following questions consists of a statement in the left-hand column followed by a statement in the right-hand column.
Decide whether the first statement is true or false.
Decide whether the second statement is true or false.
Then on the answer sheet mark:

A. If both statements are true and the second statement is a correct explanation of the first statement.
B. If both statements are true and the second statement is NOT a correct explanation of the first statement.
C. If the first statement is true but the second statement is false.
D. If the first statement is false but the second statement is true.
E. If both statements are false.
Directions Summarised

A.  True   True    2nd statement IS a correct explanation of the 1st
B.  True   True    2nd statement is NOT a correct explanation of the 1st
C.  True   False
D.  False  True
E.  False  False

FIRST STATEMENT: The growing season in south-west England is longer than in the south-east of England.

SECOND STATEMENT: Summers are warmer in the south-west than in the south-east of England.
The directions used by the Cambridge TDRU are much the same except that "correct" is replaced by "adequate", in my view an improvement since it makes the exercise less naive and dogmatic. On the other hand London has dropped the "BECAUSE" linking the "assertion" and the "reason", a necessary move otherwise all the options save A have no meaning. Statements such as:

The world is flat BECAUSE nature abhors a vacuum

or

Japan invaded Poland BECAUSE Hitler bombed Pearl Harbour

which would correspond to options D and E respectively, are, of course, nonsense. The weakness of this type of item is, in fact, that statements have to be considered not as an integrated entity but in two parts. This means that, as in my absurd examples, the two parts need not necessarily bear any relationship to each other, although the need for credibility usually ensures that they do. I am afraid that Banesh Hoffman would make mincemeat of some of the assertion-reason items I have seen and I would find it hard to quarrel with him.
On the statistical side, analysis of the multiple choice tests set by the
London board from 1971 to 1974 (Wood, 1973(a); Quinn, 1975) shows assertionreason items coming out consistently with lower average discrimination values
than other item types. The TDRU reports the same outcome (TDRU, 1976) and
suggests that the basic reason for this may lie in a failure to utilise all
the options in the response code to the same extent. In particular, the
option A (true-true, reason correct explanation of assertion) was keyed less
frequently than might be expected but was popular with candidates, often
proving to be the most popular of the incorrect options. Thus it would appear
that candidates are inclined to believe in the propositions put before them
yet item writers, particularly in Chemistry and Biology, says the TDRU, seem
to find genuinely correct propositions hard to contrive.
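To make the point about option usage concrete, the following sketch (in Python, with invented keys and responses rather than the London or TDRU data) shows the sort of tally which reveals a response position that is popular with candidates yet rarely keyed:

    from collections import Counter

    # Invented data: keyed answers for ten assertion-reason items and the
    # responses of three candidates (one string of answers per candidate).
    keys = list("BCDBEACDBC")
    responses = [
        list("ABDBEACABA"),
        list("BADAEACDBA"),
        list("ACDBAACDAC"),
    ]

    keyed_usage = Counter(keys)                                   # how often each option is the right answer
    chosen_usage = Counter(r for row in responses for r in row)   # how often candidates pick each option

    for option in "ABCDE":
        print(f"{option}: keyed {keyed_usage[option]} time(s), "
              f"chosen {chosen_usage[option]} time(s)")

In this invented set, option A is keyed only once but chosen twelve times, which is the pattern the TDRU describes.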
My own hunch about assertion-reason items is that they are less related to
school achievement and more related to intelligence (competence vs. ability,
see Chapter 7) than any other type of item. If this is true, their aspirations
to distinctiveness must be seriously questioned. It might be advisable in any
case to consider whether this kind of item can be rewritten in simple multiple
choice format. This can often be done to good effect as in the illustrative
item from the TDRU handbook used at the beginning of the chapter.
The remaining item types to be discussed are all used in mathematics testing,
although it is conceivable that they might be applied in other disciplines.
They are meant to test different aspects of mathematical work and the names of
the item types - data necessity, data sufficiency, quantitative comparisons give a good idea of what is demanded. In all cases, the burden of the question
is to be found in the directions so that the usual objections about undue
reliance on reading comprehension apply.
Data necessity
In this kind of problem the candidate is asked to find redundant information.
The directions used by the London board and an example are given below
(University of London, 1975, p.21).
Directions: Each of the following questions consists of a problem followed by four pieces of information. Do not actually solve the problem, but decide whether the problem could be solved if any of the pieces of information were omitted, and choose:

A if 1 could be omitted
B if 2 could be omitted
C if 3 could be omitted
D if 4 could be omitted
E if none of the pieces of information could be omitted

What fraction of a population of adult males has a height greater than 180cm?

1 The distribution of heights is normal
2 The size of the population is 12,000
3 The mean height of the population is 175cm
4 The standard deviation of heights is 7cm
Payne and Pennycuick (1975, p.16), whose collection of items is exempt from the
criticisms of this genre I made earlier, point out, and I agree with them,
that this item type lacks some of the variety of others for there are essentially only two sorts of problems - those requiring all the information for
their solution and those from which one piece can be omitted. Often an idea
earmarked for data necessity treatment can be better exploited in the multiple
completion format, emphasising again that the same function can often be
performed just as well by another item type. In general, I would suggest
avoiding the data necessity format.
Data sufficiency
As the example below shows (University of London, 1975, p.21) the directions
are formidable although Payne and Pennycuick (1975) show how they can be
simplified.
Directions: Each of the following questions consists of a problem and two statements, 1 and 2, in which certain data are given. You are not asked to solve the problem; you have to decide whether the data given in the statements are sufficient for solving the problem. Using the data given in the statements, choose:

A if EACH statement (i.e. statement 1 ALONE and statement 2 ALONE) is sufficient by itself to solve the problem
B if statement 1 ALONE is sufficient but statement 2 alone is not sufficient to solve the problem
C if statement 2 ALONE is sufficient but statement 1 alone is not sufficient to solve the problem
D if BOTH statements 1 and 2 TOGETHER are sufficient to solve the problem, but NEITHER statement ALONE is sufficient
E if statements 1 and 2 TOGETHER are NOT sufficient to solve the problem, and additional data specific to the problem are needed

Fig. 3.1 [diagram of a projectile's trajectory]

What is the initial velocity, V, of the projectile?

1  ... = 54m
2  ... = 42...
The concept of sufficiency is important in mathematics and this item type may
be the only way of testing it.
The London GCE board intends to use it in its
Advanced level mathematics multiple choice tests starting in 1977 but I would
not have thought it would be suitable for lower age groups studying mathematics.
Quantitative comparisons
As far as I know, this item type is not used in any British examinations or
tests. It was introduced into the Scholastic Aptitude Test (SAT) by the
College Board in the USA, partly as a replacement for the data necessity and
sufficiency types, the instructions for which, interestingly enough, were
considered too complicated for the average candidate to follow. After seeing
an example of the item type with instructions (reproduced below), readers can
judge for themselves
whether
the substitution was justified. The task presented to candidates is an easier one and even if the instructions might still
confuse some candidates, it should be possible to assimilate them more quickly
than those associated with other item types. I would think this item type
could be used profitably in examinations for 16-year olds.
Directions: Each question in this section consists of two quantities, one in Column A and one in Column B. You are to compare the two quantities and on the answer sheet blacken space:

A if the quantity in Column A is the greater;
B if the quantity in Column B is the greater;
C if the two quantities are equal;
D if the relationship cannot be determined from the information given.

Notes: (1) In certain questions, information concerning one or both of the quantities to be compared is centred above the two columns. (2) A symbol that appears in both columns represents the same thing in Column A as it does in Column B. (3) All numbers in this test are real numbers. Letters such as l, n, and k stand for real numbers.

                 Column A                    Column B

Question 15                   5x = 0
                    1                           x

Question 16      3 x 352 x 8                 4 x 352 x 6

Question 17      [quantities in x and y; not fully legible in this copy]

(College Board, 1976)
That the item types just discussed call for higher order skills, however that
term is defined, is incontrovertible. Being more or less memory-proof (by
which I mean answers cannot be recognised or recalled intact) they impel the
examinee to engage in distinctive reasoning processes. Although opportunities
for elimination still exist, particularly in the case of data necessity items,
candidates need to get a firm purchase on the problems in order to tackle them
successfully.
I have not described all the item types that exist. The matching type of item
has its place and I have not really much to say about it except that it can be
a lot of work for little result. There is an item type called relationship
analysis which the London board is experimenting with in its A-level mathematics
tests. I choose not to describe it here because it is too specialised to
mathematics (a description can be found in University of London, 1975 and
Payne and Pennycuick, 1975). Relationship analysis is one of the newer item
types studied, and in some cases devised, by the group responsible for constructing the experimental British Test of Academic Aptitude (TAA). One or two
of these item types have found their way into attainment tests but to date
the others have not caught on. Details can be found in Appendix E of
Choppin and Orr (1976).
If it is asked how the mathematical item types compare in terms of statistical
indices an analysis of the London board's 1976 A-level mathematics pretests
carried out by my colleague Carolyn Ferguson shows that simple multiple choice items were usually the easiest and also the most discriminating while the
relationship analysis item type proved most difficult and also least discriminating. Of the other item types, data necessity generally showed up as the
next easiest type after multiple choice and the next poorest discriminator
after relationship analysis. Data sufficiency items showed up reasonably well
in terms of discrimination but tended to be on the hard side. The finding
that simple multiple choice provides the highest average discrimination agrees
with the outcome of our analyses of O-level tests (Wood, 1973(a); Quinn, 1975)
and also with the TDRU analysis (TDRU, 1976). To some extent this is due to
multiple choice enjoying a slightly greater representation in the tests as a
whole so that it is tending to determine what is being measured and also to the fact that the correlations between scores on the subtests formed by the different item types are lowish (0.30 - 0.50). Whether the item types are measuring different skills reliably is another matter. All we have to go on at the moment are internal consistency estimates for small numbers of items and I would not want to place too much weight on them. I might add that the analysis just discussed was provoked by complaints from
schools and colleges, particularly the latter, that students, especially those of foreign origin, were
experiencing difficulties with the more complicated item types. We therefore
have to keep a close eye on how these item types go and whether they should
all be included in the operational examinations.
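For readers who want to know what these internal consistency estimates involve, here is a minimal sketch (in Python, on invented 0/1 item scores rather than the pretest data) of coefficient alpha for a short subtest:

    # Cronbach's coefficient alpha for a small set of dichotomously scored
    # items (rows = candidates, columns = items). Invented data.
    scores = [
        [1, 0, 1, 1, 0],
        [1, 1, 1, 0, 1],
        [0, 0, 1, 0, 0],
        [1, 1, 1, 1, 1],
        [0, 1, 0, 0, 0],
        [1, 1, 1, 1, 0],
    ]

    def variance(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    k = len(scores[0])                                    # number of items
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    total_var = variance([sum(row) for row in scores])    # variance of total scores

    alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
    print(f"coefficient alpha for {k} items: {alpha:.2f}")

With so few items and candidates the figure means very little, which is exactly the reservation expressed above.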
The problem of deciding whether item types are contributing enough to justify their inclusion in a test is indeed a difficult one. "Does the format of a question make any difference to a candidate's performance in terms of the final outcome?" is a question one is asked periodically. What people usually mean is "Do the item types measure the same thing?" The stock method of investigating this question is to correlate the scores on the different kinds of tests or item types. If high correlations, say 0.60 or more, result, then it is customary to conclude that the tests are "measuring the same things" (Choppin and Purves, 1969; Bracht and Hopkins, 1970 and references therein; Skurnik, 1973). This being so, one or more of the tests or item types must be redundant, in which case one or more of them, preferably the more troublesome ones, can be discarded. Or so the argument goes. On the other hand, if low correlations, of say 0.50 or less, result, the tests are said to be
"measuring different
things", and test constructors pat themselves on the back
for having brought this off.
As I have hinted, both interpretations are shaky. Low correlations may come about because the measures are unreliable. For instance, Advanced level Chemistry practical examinations show low correlations (around 0.30) with theory papers but no one can be sure whether the low correlation is genuine or whether it is due to the unreliability of the practical examination - candidates are assessed on two experiments only. It is true that correlations can be corrected for unreliability, assuming good measures of reliability are available, which they rarely are, but this correction is itself the subject of controversy, the problem being that it tends to be an "over-correction of unknown extent" (Lord and Novick, 1968, p.138). Thus corrected correlations can look higher than they really are, which is why Mellenbergh (1971), who scrutinised 80 or so studies, concluded cautiously that multiple choice and open-ended questions are sometimes operationalisations of the same variables, sometimes not.
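The correction referred to is the classical correction for attenuation; a minimal sketch (in Python, with invented figures rather than the Chemistry data) shows the formula and why a shaky reliability estimate can inflate the corrected value:

    import math

    def correct_for_attenuation(r_xy, rel_x, rel_y):
        """Estimated correlation between true scores, given the observed
        correlation and the reliabilities of the two measures."""
        return r_xy / math.sqrt(rel_x * rel_y)

    # Invented figures: an observed correlation of 0.30 between a practical
    # and a theory paper, with two guesses at the practical's reliability.
    for rel_practical in (0.40, 0.60):
        corrected = correct_for_attenuation(0.30, rel_practical, 0.85)
        print(f"practical reliability {rel_practical:.2f} -> "
              f"corrected correlation {corrected:.2f}")

The lower the assumed reliability, the larger the corrected coefficient, so an underestimate of reliability produces just the over-correction Lord and Novick warn about.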
With high correlations, the case would seem to be open and shut; one measure must be as good as the other. But this conclusion does not follow at all, as Choppin (1974(a)) has shown. Suppose, he says, that two measures X and Y correlate 0.98 and that X is found to correlate 0.50 with some other variable Z. Examination of the variance shared between the variables shows that the correlation between Y and Z may lie anywhere between 0.23 and 0.67. Thus the two measures are not necessarily interchangeable. That is a statistical argument but there are others. If the high correlation comes about because the open-ended questions are doing the same job as the multiple choice items - eliciting factual content etc. - then the essay paper is obviously not being used to advantage. It is not performing its special function. In science subjects there may be some truth in this. But if there are grounds for supposing that the processes called for by the two tests are different in kind, and separate functions are being satisfied, all that high correlation means is that relative to each other, persons produce the same kind of performance on both tests.
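The bound Choppin appeals to is just the requirement that the three correlations form a legitimate (positive semi-definite) correlation matrix; a minimal sketch in Python follows. I have used the figures of his example, and any small difference between what it prints and the interval quoted above presumably reflects rounding somewhere in the original.

    import math

    def corr_bounds(r_xy, r_xz):
        """Admissible range for corr(Y, Z) given corr(X, Y) and corr(X, Z),
        i.e. the values that keep the 3 x 3 correlation matrix positive
        semi-definite."""
        centre = r_xy * r_xz
        half_width = math.sqrt((1 - r_xy ** 2) * (1 - r_xz ** 2))
        return centre - half_width, centre + half_width

    low, high = corr_bounds(0.98, 0.50)
    print(f"corr(Y, Z) may lie anywhere between {low:.2f} and {high:.2f}")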
That various tests administered to the same children should produce high correlations need come as no surprise; as Levy (1973, p.6-7) remarks, children developing in a particular culture are likely to accrue knowledge, processes or whatever at different rates but in a similar order, a view also put by Anastasi (1970). What no one should do is to conclude from this that it is a waste of time to teach some aspect of a subject just because tests based on the subject matter correlate highly with tests based on other aspects of the subject. As Cronbach (1970, p.48-49) has pointed out, a subject in which one competence was developed at the expense of the other could go unnoticed since one or more schools could on average be high on one competence and low on the other without this showing up in the correlation between scores. It is the across-schools correlation, which is formed by correlating average scores for schools, that will expose uneven development of the two competences.
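A minimal sketch (in Python, with invented scores for three schools) of the contrast Cronbach is drawing: the pupil-level correlation between two competences can be high while the across-schools correlation, formed from school averages, is low or even negative:

    # Invented scores on two competences, grouped by school. Within each
    # school the competences go together; between schools the emphasis differs.
    schools = {
        "School 1": [(4, 5), (8, 9), (12, 13), (16, 17)],
        "School 2": [(5, 3), (9, 7), (13, 11), (17, 15)],
        "School 3": [(3, 4), (7, 8), (11, 12), (15, 16)],
    }

    def pearson(pairs):
        n = len(pairs)
        mx = sum(x for x, _ in pairs) / n
        my = sum(y for _, y in pairs) / n
        sxy = sum((x - mx) * (y - my) for x, y in pairs)
        sxx = sum((x - mx) ** 2 for x, _ in pairs)
        syy = sum((y - my) ** 2 for _, y in pairs)
        return sxy / (sxx * syy) ** 0.5

    pupils = [p for group in schools.values() for p in group]
    school_means = [(sum(x for x, _ in g) / len(g), sum(y for _, y in g) / len(g))
                    for g in schools.values()]

    print(f"pupil-level correlation:    {pearson(pupils):.2f}")
    print(f"across-schools correlation: {pearson(school_means):.2f}")

On these invented figures the pupil-level coefficient is about 0.95 while the across-schools coefficient is negative, so the uneven development is visible only in the latter.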
Arguments based on correlations are of strictly limited utility. Evaluation of the validity of item types must proceed along other lines. The first test must be one of acceptability - can the average candidate grasp what is required reasonably quickly? It may be that types like relationship analysis fail on that account. One can also ask if the task set makes sense. Perhaps assertion-reason fails on that score. Then, of course, one must ask whether the skill supposedly being measured is worth measuring and, if it is, whether the item type is being used to best effect. In this connection, the multiple completion type sometimes causes concern. As a general comment, I would say that the simple multiple choice form has a lot of elasticity left in it and that item writers should think hard and long before abandoning it for another item type.

With many tests now stratified into different item types as a matter of course, the danger is that these divisions will become permanently fixed when what is required is fluid allocation based on a more or less constant monitoring of the acceptability and measurement efficacy of the item types. The last thing I would want to do is to discourage experimentation with the multiple choice form but I am bound to say that the experience so far seems to indicate that the price for producing something different is a complicating of instructions to the point where some candidates are definitely handicapped.
SUMMARY

1. To get more out of the simple multiple choice form usually means increasing the information load. Care should be taken not to overdo the reading comprehension element.

2. Various item types other than simple multiple choice are available. The item writer should always check that he or she has chosen the appropriate item type and is using it properly. Often ideas can be handled quite well within the simple multiple choice form without resorting to fancy constructions. Except for true-false and multiple true-false, all the other item types have in common that the instructions are lengthy and apparently complicated. This leads to the criticism that ability to understand instructions is being tested as much as anything else. Some improvement is possible in the wording and presentation of the instructions generally used but it will never be possible to dispel the criticism entirely.
3. The claims made for true-false items by Robert Ebel do not convince and
this item type is not recommended for formal achievement testing. In the
classroom it is different and there is no reason why these items should not be
used there.
4. The multiple completion or selection item type suffers from the drawback
that candidates have to code their answers using a table before making a mark
on the answer sheet. Another shortcoming is that information is usually given
away by the coding table and candidates may use it to their advantage, either
consciously or unconsciously. As might be expected, the cleverer candidates
appear to derive most advantage from it.
5. At their worst, assertion-reason items can be very silly and good ones are
hard to write. Making use of all the response positions is a problem; in
particular propositions which are correct for the right reasons are apparently
hard to come by. This item type is generally not recommended. Notions of
causality can be tested using the simple multiple choice form.
6. The series of item types which have been introduced into mathematics
achievement tests - data necessity, data sufficiency, relationship analysis,
quantitative comparisons - must be regarded as still being on trial. It is
fairly certain that the first three can only be used with older, sophisticated
candidates and even then the criticism that the instructions are too complicated could hold. Data sufficiency and quantitative comparisons look to be the
most promising item types although this view is based on little more than a
hunch. It looks doubtful whether these item types will find an application
in other subject areas.
7. Trying to validate item types by correlational studies is a waste of time.
The interpretation of both high and low correlations is fraught with problems.
Validation is best carried out by asking common sense questions "Is the item
type performing a useful measurement function no other one can?" "Is the item
type acceptable to candidates?" etc.
4. Constructing Items

"Item writing continues to be an art to which some scientific procedures and experimentally derived judgements make only modest contributions." (Wesman, 1971)
The critical activity in item writing is, of course, the birth of the idea
followed by the framing of the item. The idea may come in the form of a particular item type or it may have to be shaped to fit an item type, depending on
the commission. Not everyone is happy relying on the item writer for ideas.
Later in the chapter I shall discuss the work of those who believe it is possible to generate items in such a way that what Bormuth (1970) calls the "arbitrary" practices of item writers are eliminated.
On the understanding that the way ideas come into being is not susceptible to
enquiry, most of the research on problems involved in item writing has been
about issues like the optimum number of distracters, the use of the 'none of
these' option, the production of distracters, the advisability of using negatively framed stems9 what Andrew harrison has called the "small change" of item
writing. tiaving studied the work which has been done on these topics one is
obliged to agree with Wesman (1971) that "relatively little significant
research has been published on problems involved in item writing". To be fair
there are reasons for this; more than in most other areas of research, the
investigator is faced with the difficulty of building in sufficient controls to
permit the degree of generalisability which would make the findings dependable
and useful. "Most research", writes Wesman, "reports what has been done by a
single writer with a single test; it does not present recipes that will enable
all who follow to obtain similar results. A study may show that one three-choice vocabulary test is just as good without two additional options per item;
it will not show that another three-choice vocabulary test with different words
and different distracters would not be improved substantially by the addition
of well-selected options."
Despite these strong reservations, Wesman does discuss the item writing
research which was done between roughly 1945 and 1970, although his treatment
is not exhaustive. As he warned, nothing definitive emerged and much the same
has to be said for later research. Let us consider the question of the number
of distracters first.
NUMBER OF DISTRACTERS
Conventional wisdom has it that to provide fewer than three distracters offers
candidates too much scope for elimination tactics. Yet, three-choice and two-choice items have their champions, as do true-false items. Three-choice items,
in particular, are thought by some to have interesting possibilities. After
randomly eliminating the fourth alternative from a sample of psychology achievement items, Costin (1970) administered tests constructed of both three- and
four-choice items to a sample of students. He found that his "artificial"
three-choice items were more discriminating, more difficult and more reliable
than the four-choice items from the same item pool. The outcome of a later
study (Costin, 1972) was much the same. As to why three-choice items did as
good a job as four-choice items Costin was inclined to believe that the explanation was more psychological than statistical. In the 1970 paper he offered
his results as empirical support for Tversky's (1964) mathematical proof that
three-choice items are optimal as far as discrimination is concerned, a result
also proved by Grier (1975), although it should be noted that both proofs depend on the somewhat shaky assumption that total testing time for a set of
items is proportional to the number of choices per item (see Lord, 1976(a)).
However, in the 1972 paper Costin was inclined to believe that the more choices
that are provided, the more cues candidates have available for answering items
they "don't know". This effect he saw as a greater threat to reliability and
perhaps also validity than reducing the number of alternatives. I am not sure
I accept this argument. If the extra alternatives are poor they may help to
give the answer away but generally I would have thought that encouraging candidates to utilise partial information would result in greater validity, nor
would it necessarily jeopardise reliability.
As always, there are so many factors involved in an issue like this. I have
mentioned the nature of the cues but there is the matter of how specific or
general the item is, what it is aimed at and also how it can be solved. Concerning this last factor, Choppin (1974(a)) maintains that when the number of
alternatives is reduced, items that can be solved by substitution or elimination - what he calls "backwards" items - are less valid than comparable "forward" items, that is, items that must be solved first before referring to the
alternatives. On the other hand, he finds random guessing patterns more prevalent for items of the "forwards" type, which is reasonable given the lack
of opportunity for eliminating alternatives. In general, Choppin finds that
reducing the number of alternatives does lower reliability (estimated by internal consistency) and validity and recommends that, whenever possible, items
with at least five alternative responses should be used.
To see what would happen when "natural" four-choice items were compared with
four-choice items formed by removing the least popular distractor from five-choice items, Ramos and Stern (1973) set up an experiment involving tests in
French and Spanish. Their results suggest that the two kinds of items are not
clearly distinguishable and that the availability of the fifth choice to the
candidate is not of major consequence. However, the usual qualifications concerning replicability apply. One thing Ramos and Stern did notice was a small
decrease in reliability when going from five to four choices and wondered
whether it might not have been better to eliminate the least discriminating
distractor. Since reliability and discrimination are directly related, they
probably had a point.
I feel myself that one can probably get away with reducing from five to four
alternatives. It is when the number is reduced to three or even two that the
soundness of the measurement is threatened. As far as the true-false type is
concerned, Ebel (1969) stated quite categorically that if a teacher can write,
and a student can answer, two true-false items in less time than is required
to write or to answer one four-choice item, preference should be given to
true-false. However, empirical studies suggest that things do not work out
this way. Oosterhof and Glasnapp (1974) reported that 2.1 to 4.3 times as many true-false as multiple choice items were needed in order to produce equivalent reliabilities, a ratio which was greater than the rate at which true-false items were answered relative to multiple choice. Frisbie (1973), while warning that no hard and fast rules can be formulated regarding the amount of time required to respond to different types of items without considering item content as well, nevertheless found that the ratio of true-false to multiple choice attempts was in the region of 1.4 rather than 2. It would seem that
Ebel's proposition does not hold up in practice, but again one can never be
sure.
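The comparison turns on the familiar Spearman-Brown prophecy formula; a minimal sketch (in Python, with illustrative reliabilities rather than Oosterhof and Glasnapp's data) shows how many times a test would have to be lengthened with true-false items to match a multiple choice test's reliability:

    def spearman_brown(reliability, k):
        """Predicted reliability when a test is lengthened by a factor k."""
        return k * reliability / (1 + (k - 1) * reliability)

    def factor_needed(current, target):
        """Lengthening factor needed to raise reliability from current to target."""
        return (target * (1 - current)) / (current * (1 - target))

    # Illustrative figures only: suppose a multiple choice test has reliability
    # 0.85 and a true-false test of the same length would manage only 0.70.
    mc_rel, tf_rel = 0.85, 0.70
    k = factor_needed(tf_rel, mc_rel)
    print(f"About {k:.1f} times as many true-false items would be needed "
          f"(check: predicted reliability {spearman_brown(tf_rel, k):.2f}).")

Unless true-false items can be answered that much faster, the time saved per item does not buy equivalent measurement, which is the burden of the studies just cited.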
Many people would argue, in any case, that the number of alternatives for any
particular item should be decided not by statistical considerations but rather
by the nature of the problem and its capacity for producing different mistakes.
Granted there are administrative grounds for keeping the number of alternatives
constant throughout a test but these can always be over-ridden if necessary.
The best way of finding out what the major errors are likely to be is to try
out the problem first in an open-ended form. Although it hardly warrants a
paper in a journal some workers have thought the idea needed floating in their
own subject area, e.g. Tamir (1971) in a biology context. If practicable, I
think it is a useful thing to do but one should not feel bound by the results
of the exercise. Candidates can make such crazy errors that I am sceptical
about the wisdom of using whatever they turn up. A candidate might produce
something which he would immediately recognise as wrong if it was presented
in an item. It might also be argued that item writers ought to be aware of
the more common errors anyway but this is not necessarily the case. Evidence
that distracters often fail to match alternatives generated by free responses
comes from studies by Nixon (1973) and by Bishop, Knapp and MacIntyre (1969).
THE 'NONE OF THESE' OPTION
The 'none of these' or 'none of the above' option arouses strong feelings
among item writers. Some refuse to use it under any circumstances, believing
that it weakens their items; others use it too much, as an easy way to make
up the number when short of distracters. The Cambridge item-writer's manual
(TDRU, 1975) maintains that 'none of these' is best avoided altogether and if
it must be used it should only be in cases where the other options are unequivocally right or wrong. In this connection, the study by Bishop, Knapp
and MacIntyre (1969) reported that the biggest difference between the distributions of responses to questions framed in multiple choice and open-ended
form was between the 'none of these' category in the multiple choice form and
the 'minor errors' in the open-ended form, which means that when placed alongside definite alternatives the 'none of these' option was not sufficiently
attractive to be effective.
Williamson and Hopkins (1967) reported that the use of the 'none of these'
option tended to make no difference one way or the other to the reliability
or validity of the tests concerned. After arriving at much the same results,
Choppin (1974(a)) concluded that "these findings offer little reason to employ
the null-option item type. They undoubtedly set a more complicated task to
the testee without increasing reliability or validity". My own feeling is
that 'none of these' is defensible when candidates are more likely to solve a
problem first without reference to the options, what Choppin called the "forwards" type of item. Thus I permit its use in multiple completion items,
especially as it denies candidates the opportunity to glean information from
the coding structure. Note, however, that as was pointed out in Chapter 3,
items of this kind often fail to satisfy statistical criteria. This is probably due to the imbalance created in an item when 'none of these' is the
correct answer. There may be many wrong answers but only four (if the item
is five-choice) are presented with the right one being found among the rest.
Whenever the 'none of these' option is used, the notion that all distracters
should be equally attractive ceases to apply, not that such items would necessarily be ideal, except in a statistical sense (see Weitzman, 1970).
VIOLATING ITEM CONSTRUCTION PRINCIPLES
There appear to be three studies of what happens when certain rules of what
might be called item-writing etiquette are violated. Following Dunn and
Goldstein (1959), and in very similar fashion, McMorris et al. (1972) and
Board and Whitney (1972) looked at the effects of providing candidates with
cues by (a) putting extraneous material in the item stem, (b) making the keyed
response stand out from the distracters through making it over-long or short
and (c) producing grammatical inconsistencies between the stem and the distractors. McMorris et al. obtained the same results as Dunn and Goldstein,
namely that the violations made items easier but did not affect the reliability
or validity of the instruments; Board and Whitney, however, reported quite
contrasting results. According to them, poor item writing practices serve to
obscure or reduce differences between good and poor students. It seems that
extraneous material in the stem makes the items easier for poor students but
more difficult for good students, the first group deriving clues from the
'window-dressing' and the second looking for more than is in the item (shades
of Hoffman!). Although grammatical inconsistencies between stem and keyed
response did not have a noticeable effect on item difficulties they did reduce
the validity of the test. Finally, making the keyed response different in
length from the distracters helped the poor students more than their abler
colleagues. It would seem that poor tests favour poor candidates!
My own feeling about these findings is that no self-respecting item writer or
editor should allow inconsistencies of the kind mentioned to creep into a test.
In this respect, these findings are of no particular concern. After all, no
one would conclude, on the basis of these studies, that it was now acceptable
to perpetrate grammatical inconsistencies or to make keys longer than distractors. Nor need one necessarily believe, on the basis of one study, that
it is harmful to do so. It is just that it is not advisable to give the 'testwise' the opportunity to learn too much from the layout of the test.
One item-writing rule which needs discussing is the one which warns against negative phrasing of items. The reason for having it is, of course, that the negative element, the 'NOT', can be overlooked and also that it can lead to awkward double negatives. What is not so often realised is that a negative
stem implies that all the distracters will be correct rather than incorrect,
as is usually the case. Farrington (1975), writing in the context of modern
language testing, has argued that this feature is wholly desirable, his
rationale being that it avoids presenting the candidate with many incorrect
pieces of language, a practice which leads to a mistake-obsessed view of language learning. The same argument applies to items which use the phrase 'One
of the following EXCEPT ...'. Indeed EXCEPT, being less negative, may be preferred to NOT, where feasible.
Whatever the effects on reliability and validity, it is pretty obvious that
the difficulty of items can be manipulated by varying their construction and
format. Dudycha and Carpenter (1973) who, incidentally, deplore the lack of
research on item writing strategies, found that items with incomplete stems
were more difficult than those with closed stems, that negatively phrased
items were more difficult than those phrased in a positive way and that items
with inclusive options such as 'none of these' were more difficult than those
with all-specific options, a result also reported by Choppin (1974(a)).
Dudycha and Carpenter did not study what effect placement of the keyed response
(A, B, C, D or E) might have on difficulty but Ace and Dawis (1973), who have
a good bibliography on the subject, provide evidence that this factor can
result in significant changes in difficulty level, at least for verbal analogy
items. On the other hand, an earlier study by Marcus (1963) revealed no tendency to favour one response location rather than another. This is a good
example of an unresolved issue in item writing.
Provided pretesting is carried out and estimates of difficulty are available
to the test constructor, the fact that varying the format may make one item
more difficult than another does not seem terribly important. It all depends
on what kind of test, educationally or statistically, is wanted. As for the
location of keyed responses, much can be done to eradicate the effect of
undesirable tendencies by randomising the keys so that each position is represented approximately by the same number of items. Note, however, that randomisation may not be feasible when the options are quantities (usually presented
in order of magnitude) or patterns of language. In these cases some juggling
of the content may be necessary. Suppose that one wanted to alter items which
turn out to be too hard. What effect might this have on the discrimination of
the items? If Dudycha and Carpenter (1973) are to be believed then discrimination is less susceptible to format changes than is difficulty. Accordingly,
they maintain that an item stem or orientation may be altered without lowering
discrimination, the exception being items which include 'none of these' as an
option.
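For what it is worth, the balancing of keyed positions described above need be nothing more elaborate than the following (a sketch in Python; the balanced-shuffle approach is my own illustration, not a prescribed procedure):

    import random

    def balanced_key_positions(n_items, positions="ABCDE"):
        """Assign keyed-response positions so that each position is used
        approximately the same number of times, in a shuffled order."""
        repeats = -(-n_items // len(positions))   # ceiling division
        keys = list(positions) * repeats
        random.shuffle(keys)
        return keys[:n_items]

    keys = balanced_key_positions(48)
    print(keys)
    print({p: keys.count(p) for p in "ABCDE"})    # each position used 8-10 times

Where the options have a natural order, as noted above, the shuffled assignment would of course have to give way to some juggling of content.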
The inconclusiveness of so much of the research into item writing should not,
in my view, be regarded as an invitation to engage in a new research blitz on
the various issues discussed. If when framing items the writer sticks to the
rules of etiquette laid down in the various manuals and succeeds in contriving four or five options, based on an appreciation of likely misapprehensions
among examinees, this is as much as can reasonably be hoped for. What should
not be encouraged is what might be called the 'armchair' approach to item
writing which can best be summed up as an over-reliance on mechanical formulas
for generating distracters, such as powers of ten or arithmetical sequences
or similar words, combined with a penchant for trick or nonsense distracters.
That is not to say that only items composed of empirically derived distracters
are admissible. Corbluth's (1975) thoughtful analysis and categorisation of
the kinds of distracters which might be appropriate for reading comprehension
exercises has persuaded me that an enlightened 'armchair' approach could work
quite well.
ITEM FORMS
It was because they believed that items are not sufficiently related to the
previous instruction, and also because they distrusted the free rein given to
item writers, that Bormuth (1970) and others set out to develop the radically
different methods of writing or generating items which I mentioned at the beginning. These turn on the notion of an item shell or form (Osburn, 1968).
An item form is a principle or procedure for generating a subclass of items having a definite syntactical structure. Item forms are composed of constant and variable elements, and as such, define classes of item sentences by specifying the replacement sets for the variable elements. Exhaustive generation
Exhaustive generation
of items is not necessary although in principle it can be done. For instance,
an item form rule might be 'What is the result of multiplying numbers by zero?'
in which case the items 'What is 0 x O?', 'What is 1 x O?', 'What is 2 x O?',
'What is 0.1 x O?' etc. would be generated. The sheer size of the 'domain'
of items so created will be appreciated. It is not supposed, of course, that
a candidate should attempt every item in order to show that he has 'mastery'
over this 'domain'; since it is assumed that all items in a domain are equivalent and interchangeable, he need only attempt a sample of items. As human
beings would contaminate the measurement were they to choose the sample, this
job is best left to a computer which can be programmed to select a random
sample from the domain.
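A minimal sketch (in Python; the names are mine, not Bormuth's or Osburn's notation) of an item form built on the multiplication-by-zero rule quoted above, with its constant frame, a replacement set for the variable element, and random sampling from the resulting domain:

    import random

    # Constant element: the sentence frame. Variable element: the first factor.
    FRAME = "What is {x} x 0?"
    REPLACEMENT_SET = [0, 1, 2, 0.1, 5, 10, 37, 0.25]   # could be made far larger

    def generate_domain():
        """Every item the form can generate (feasible here, enormous in general)."""
        return [FRAME.format(x=x) for x in REPLACEMENT_SET]

    def sample_items(n, seed=None):
        """Draw a random sample of n items from the domain, as the computer
        rather than a human being would be left to do."""
        rng = random.Random(seed)
        return rng.sample(generate_domain(), n)

    print(sample_items(3, seed=1))

The candidate is then credited with 'mastery' of the domain on the strength of the sample, which is where the doubts discussed below begin.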
The assumptions behind item forms can be tested and it is instructive to look
at the studies which have done so. Macready and Merwin (1973) put the case squarely, "... an item form will be considered as 'inadequate' for use in a diagnostic domain-referenced test if (a) the items within the item form are
not homogeneous, (b) the items are not of equivalent difficulty or (c) both
of the above". They then go on to test item forms of the kind I have just
illustrated. Although their conclusions are positive, an examination of their
paper suggests that there were more item forms which failed to meet the tests
than passed, hardly surprising given the very stiff requirements. Much the
same could be said of an earlier attempt at verification by Hively, Patterson
and Page (1968) who found that variance within item forms, which in theory
should be nil if items are truly homogeneous, did not differ as much from
variance between item forms as it should have done. Neither of these studies
(see also Macready, 1975) inspires much confidence that the item form concept
has anything to offer. To my mind, this ultra-mechanical procedure, conceived
in the cause of eliminating dreaded 'value judgements', carries within it the
seeds of its own destruction. An example supplied by Klein and Kosecoff (1973)
will help to make the point.
Consider the following objective: 'The student can compute the correct
product of two single digit numerals greater than 0 where the maximum
value of this product does not exceed 20'. The specificity of this objective is quite deceptive since there are 29 pairs of numerals that might
be used to assess student performance. Further, each of the resulting
290 combinations of pairs and item types could be modified in a variety
of ways that might influence whether the student answered them correctly.
Some of these modifications are:
- vary the sequence of numerals (e.g. 5 then 3 versus 3 then 5)
- use different item formats (e.g. multiple choice versus completion)
- change the mode of presentation (e.g. written versus oral)
- change the mode of response (e.g. written versus oral).
There are other question marks too. The theory seems to have little to say
about the generation of distracters - all problems seem to be free-response - yet a slight manipulation of the distracters can change the difficulty of an
item and destroy equivalence, as in the following example taken from Klein
and Kosecoff.
Multiple Choice:
Eight hundredths equals
A. 800
B. 80
C. 8
D. .08
Eight hundredths equals
A. 800
B. .80
C. .08
D. .008
Doubts also arise because the item construction rules depend largely on linguistic analyses which do not necessarily have any psychological or educational
relevance. This is particularly true of Bormuth's (1970) book and, as I have
said, there are few applied studies to clarify the situation. Anderson (1972),
in the course of an interesting paper which raises a lot of critical questions
e.g. 'Which of the innumerable things said to students should they be tested
on?', suggests that something like item forms could be employed with "domains
of knowledge expressed in a natural language" but has to admit that elementary
mathematics is a "relatively easy case". Amen to that.
There is another theory of item construction which must be mentioned, except
that coming from Louis Guttman it is much else besides. Guttman and
Schlesinger (1967) and Guttman (1970) have presented techniques for systematically constructing item responses and particularly distracters. Items are
then tested for homogeneity by a technique called smallest space analysis
(Schlesinger and Guttman, 1969) which is essentially a cluster analysis.
Violations and confirmations of homogeneity ought to lead to improvement of
the item generating rule. Note, however, that clustering techniques can be
criticised for the absence of an underlying model, thus reintroducing the
arbitrariness into item construction procedures it was hoped to dispel. I
should not need to say this but it is impossible to remove the influence of
human beings from the item writing process. It seems to me hugely ironic that
a system of testing meant to be child-centred, that is criterion-referenced
rather than norm-referenced, should rely so heavily on computers and random
selection of items to test learning. Above all testing should be purposeful,
adaptive and constantly informed by human intelligence - mechanistic devices
and aleatory techniques have no part to play in the forming of questions.
Yet at one time there were hopes that, in some fields at any rate, item writing
might be turned over to the computer and exploratory studies were conducted by
Richards (1967) and by Fremer and Anastasio (1969) and perhaps by others not
to be found in the published literature. It is illuminating that Richards
should write that "it soon became clear that developing a sensible procedure
for choosing distracters is the most difficult problem in writing tests on a
computer". Since he needed to generate alternatives for synonym items,
Richards was able to solve his problem by using Roget's Thesaurus but the low
level of the task will be evident, not to mention the unique advantage of
having the Thesaurus available. I can find no recent accounts of item writing
by computer so perhaps this line of work is dead.
Up to now, I have dealt with how items come into being and what they might be
measuring. In the next chapter I turn to a consideration of the different
ways items can be presented and answered and the various methods of scoring
answers that have been suggested.
SUMMARY
1. There are two schools of thought about item writing. The first, containing by far the most members, believes that the inspiration and the formulation
of the item should be left to the item writer. The other, rather extreme
school of thought maintains that item writing should be taken out of the hands
of the item writer and organised in a systematic way by generating items from
prepared structures which derive from an analysis of the material that has
been taught and on which items are based. Although the human touch is still
there at the beginning of the process it is removed later on by the automatic
generator. While I appreciate the motivation behind the idea - which is to
link teaching and testing more closely - I do not approve of the methods
employed. I suppose there may exist a third school of thought which regards
generated items on a take-it-or-leave-it basis, saving what looks useful and
ignoring the rest. There may be some merit in this idea but it is a clumsy
way of going about item writing.
2. The various technical aspects of item writing - number of options, use of
'none of these', use of negatives and so forth - have been studied intensively
but it is rare to find one where the results can be generalised with confidence.
If an investigator wants to know how something will work out he is best advised
to do an experiment himself, simulating the intended testing conditions as
closely as possible. Some findings have achieved a certain solidity. For
example, it is fairly certain that items containing 'none of these' as an
option will be more difficult than those with all-specific options. This
follows from the fact that 'none of these' is so many options rolled into one.
Generally speaking, opinion does not favour the use of 'none of these' although
I would permit it with multiple completion items.
3. The traditional view that item writers should use as many options as possible, certainly 4 or 5, continues to hold sway. Those who have promoted
the three-choice item, which has from time to time looked promising, have not
yet managed to substantiate their case. The same applies to Ebel's attempts
to show that twice as many true-false items can be answered in the time it
takes to answer a set of items with four options. As to whether one should
aim for four or five options I doubt if it makes much difference. The point
is not to be dogmatic about always having the same number, even if it is
convenient for data processing purposes.
4. I do not think the computer has any place in item writing and it is salutary
to note that the various attempts in the late 1960's to program computers to
write items appear to have fizzled out.
5. Instructions, Scoring Formulas and Response Behaviour

"If you're smart, you can pass a true or false test without being smart." (Linus in 'Peanuts' by Schulz)
If on encountering an item candidates were always to divide into those who knew
the right answer for certain and those who were so ignorant that they were
obliged to guess blindly in order to provide an answer, scoring would be a
simple matter. A correct answer could be awarded one point, anything else
zero, and a correction could be applied to cancel out the number of correct
answers achieved by blind guessing. Alas, life is not as simple as this. A
candidate who does not know the right answer immediately may act in any one
of the following ways:
1. Eliminate one or more of the alternatives and then, by virtue
of misinformation or incompetence, go for a particular wrong answer.
2. Eliminate one or more of the alternatives and then choose randomly
among the remainder.
3. Fail to eliminate any of the alternatives but choose a particular
wrong answer for the same reasons as in 1.
4. Make a random choice among all the alternatives, in other words,
do what is popularly known as guessing.
Actually, these possibilities are just bench marks on a continuum of response
behaviour anchored at one end with certain knowledge and at the other with complete ignorance. An individual's placement on this continuum with respect to
a particular item depends on the relevant knowledge he can muster and the confidence he has in it. It follows that the distinction between an informed
answer and a shrewd, intuitive guess, or between a wild hunch and a random
selection, is necessarily blurred; also that with enough misapprehension in his
head an individual can actually score less than he would have got by random
guessing on every item. In other words, one expects poor candidates to be
poor 'guessers'.
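The correction alluded to above is the familiar formula in which a fraction of the wrong answers is deducted from the number right; it is not written out in the text, so the sketch below (in Python) should be read as the standard formula rather than anything peculiar to this report.

    def corrected_score(rights, wrongs, k):
        # Standard correction for guessing on k-choice items: R - W/(k-1).
        # Omitted items attract neither credit nor penalty.
        return rights - wrongs / (k - 1)

    # Under purely blind guessing the expected corrected score is zero: a guess
    # is right with probability 1/k (gaining 1) and wrong with probability
    # (k-1)/k (losing 1/(k-1)).
    print(corrected_score(rights=30, wrongs=15, k=5))   # 30 - 15/4 = 26.25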
I suppose that blind guessing is most likely to occur when candidates find
themselves short of time and with a number of items still to be answered.
Some people believe that this situation occurs quite frequently and that when
it does occur candidates are prone to race through the outstanding items,
placing marks 'at random' on the answer sheet, thereby securing a certain
number of undeserved points. My view of the matter is that given appropriate
testing conditions this scenario will rarely, if ever, come about. Even in
the absence of those conditions, which I will come to next, it is by no means
certain that individuals are able to make random choices repeatedly. Apparently, people find it difficult to make a series of random choices without
falling into various kinds of response sets (Rabinowitz, 1970). They tend to
avoid repetitive pairs and triplets (e.g. AA, DDD), and to use backward (e.g.
EDC) but not forward series more than expected by chance. They also tend to
exhaust systematically the entire set of possible responses before starting
again, that is to say, they tend to cycle responses. If the correct answers
are distributed randomly so that each lettered option appears approximately
the same number of times, it follows that anyone in the grip of one or more
response sets and attempting to guess randomly at a number of consecutive
or near-consecutive items will almost certainly fail to obtain the marks
which genuine random guessing would have secured. Of course, candidates who
are bent on cheating or have just given up can always mark the same letter
throughout the test but in the London board's experience, at any rate, such
behaviour is rare.
What do I mean by 'appropriate testing conditions'? I mean that the test is
relatively unspeeded so that nearly all candidates have time to finish, that
the items deal with subject matter which the candidates have had an opportunity
to learn and that the items are not so difficult as to be beyond all but a few
candidates. If any or all of these conditions are violated, then the incidence
of blind guessing will rise and the remarks I shall be making will cease to
apply with the same force. However, I am working on the assumption that such
violations are unlikely to happen ; certainly the British achievement tests I
am familiar with, namely those set at GCE O- and A-levels, satisfy the three
conditions.
A study of the item statistics for any multiple choice test will show that the
majority of items contain at least one distractor which is poorly endorsed, and
so fails to distract. This constitutes strong evidence for the ability of the
mass of candidates to narrow down choice when ignorant of the correct answer
(see also Powell and Isbister, 1974). Were the choice among the remaining
alternatives to be decided randomly, there might be cause for alarm, for then
the average probability of obtaining a correct answer to a five-choice item
by 'guessing' would rise from 1/5 to perhaps 1/4 or even 1/3. But the likelihood of this happening in practice seems slight. Having applied what they
actually know, candidates are likely to be left with a mixture of misinformation and incompetence which will nudge them towards a particular distractor,
placed there for that purpose. Is there any evidence for the hypothesis that
'guessing' probabilities are less than predicted by chance? Gage and Damrin
(1950) constructed four parallel versions of the same test containing 2, 3, 4
and 5-choice items respectively. They were able to calculate that the average
chances of obtaining the right answer by guessing were 0.445, 0.243, 0.120
and 0.046 respectively, as compared with 0.500, 0.333, 0.250 and 0.200 respectively, which are the chances theory based on random guessing would have
predicted. This is only one study and it needs to be repeated in a number of
different contexts. But it is a result which tallies with intuition and coupled with the fact that there are always candidates who score below the chance
score level on multiple choice papers, it suggests that the average probability
of obtaining a correct answer to a five-choice item when in ignorance may well
be closer to 1/10 than 1/5.
It is also a mistake to assume that chance-level scores, e.g. 10 out of 50 for
a test made up of five-choice items, are necessarily the product of blind
guessing. Unless the candidates obtaining such scores were actually to guess
randomly at every item, which, as I have said, seems most improbable, the
chance-level score is just like any other score. Donlon (1971) makes the
point that chance-level scores may sometimes have predictive value, although
he does suggest that steps should be taken to check whether such scores
could have arisen as a result of blind guessing (for details of the method
suggested, see Donlon's paper).
It is one thing to be dubious about the incidence of blind guessing, another
to doubt that individuals differ in the extent to which they are willing to
'chance their arm' and utilise whatever information is at their disposal.
This propensity to chance one's arm is linked to what psychologists call 'risk
taking behaviour'. The notion is that timid candidates will fail to make the
best use of what they know and will be put off by instructions which carry a
punitive tone, while their bolder colleagues will chance their arm regardless.
The fact is that if the instructions for answering a test warn candidates that
they will be penalised for guessing (where what is meant by guessing may or
may not be specified) those who choose to ignore the instructions and have a
shot at every question will be better off - even after exaction of the penalty
- than those who abide by the instructions and leave alone items they are not
certain about, even though an informed 'guess' would probably lead them to the
right answer (Bayer, 1971; Diamond and Evans, 1973; Slakter et al. 1975).
Perhaps it was an instinctive grasp of this point which made the teachers in
Schofield's (1973) sample advise their candidates to have a shot at every
question, even though they were under the misapprehension that a guessing
penalty was in operation.
Exactly the same considerations apply to instructions which attempt to persuade
candidates to omit items they do not know the answer to by offering as automatic credit the chance score, i.e. 1/5 in the case of a five-choice item. On
the face of it, this seems a good way of controlling guessing but the snag is
that the more able candidates tend to heed the instructions more diligently
than the rest and so fail to do themselves justice. Because their probabilities
of success are in truth much greater than chance they are under-rewarded by
the automatic credit, whereas the weakest candidates actually benefit from
omitting because their probabilities of success are below the chance level.
That, at any rate, is the conclusion I drew from my study (Wood, 1976(d)). It
is supported by the results of a study in the medical examining field
(Sanderson, 1973) in which candidates were given the option of answering
'Don't know' to true-false items of the indeterminate type. 'Don't know' is
in effect an omit and Sanderson found that it was the more able candidates who
tended to withhold definite answers. I should add that there is a paper
(Traub and Hambleton, 1972) which favours the use of the automatic credit but
it is not clear what instructions were used in the experiment. The wording
of instructions to create the right psychological impact is, of course, decisive. Readers will understand that I am not denying that blind guessing
occurs; that would be rather foolish. Choppin's (1974(a), 1975) study, for
instance, provides incontrovertible evidence of blind guessing (as well as
some interesting differences between countries) but the items used were difficult and that makes all the difference. All I am saying is that when conditions are right, blind guessing is by no means as common as some of the
more emotional attacks on multiple choice would have us believe.
Cureton (1971) has put the issue of guessing in a nutshell; if a candidate has
a hunch he should play it for hunches are right with frequency greater than
chance. There is nothing wrong with playing hunches, despite what Rowley
(1974) says. He equates 'test-wiseness' with use of partial information and
hunches and argues that the advantages which accrue to the 'test-wise' should
be cancelled out. I believe this view is mistaken. Test-wiseness, as I understand it, is about how candidates utilise cues in items, and to the extent that
they benefit, this is the fault of the item writer or test constructor. A
fine distinction, maybe, but an important one. Incidentally, anyone interested
in test-wiseness might refer to Diamond and Evans (1972), Crehan, Koehler, and
Slakter (1974), Rowley (1974) and Nilsson and Wedman (1976).
I would add that
test-wiseness is not just associated with multiple choice as some people seem
to think.
Any guessing correction is properly called a correction for individual differences in confidence, as Gritten and Johnson (1941) pointed out a long time
ago. It is applied because some people attempt more items than others. Even
if these instructions are never 100 per cent successful I believe that they do
reduce omitting to a point where individual differences in confidence cannot
exert any real distorting effect on the estimation of ability. To turn the
question of whether candidates should always be advised to attempt items into
an ethical dilemma (see, for instance, Schofield, 1973), as if guessing were on
a par with euthanasia, strikes me as getting the whole thing out of proportion.
Changing answers
If individuals differ in their willingness to supply an answer at all, it is
easy to imagine them differing in their readiness to change their answers
having already committed themselves. The question of whether candidates are
likely to improve their scores by changing answers has been investigated by a
number of workers. The general view is that there are gains to be made which
are probably greater for better students than poor ones. Ten years ago,
Pippert (1966) thought he was having the final word on what he called the
"changed answer myth" - his view was that answers should be changed - but since
then there have been published studies by Copeland (1972), Foote and Belinsky
(1972), Reiling and Taylor (1972), Jacobs (1974) - this last reference contains
a bibliography of older work - Pascale (1974) and Lynch and Smith (1975).
These last investigators concluded that when candidates do not review their
answers, reliability and validity suffer and that directions to stick with the
first response to an item are misleading. It might be thought that hunches
would lose conviction when reviewed but the time span between the first response and the review is short and, besides, hunches are more right than wrong.
Confidence weighting
The decision to change an answer may be read as a sign of unwillingness to
invest one answer with, as it were, complete confidence. The idea that individuals might be asked to signify their degree of confidence in the answers
they make, or in the alternatives open to them, has excited a number of people
who have believed that such a move would not only constitute a more realistic
form of answering psychologically but would also yield more statistical information.
Known variously as confidence weighting, confidence testing, probabilistic
weighting, probabilistic testing or even subject weighted test taking procedure, the basic notion is that individuals should express, by some code or
other, their degree of confidence in the answer they believe to be correct,
or else, by way of refinement, in the correctness of the options presented to
them. Credit for laying the intellectual foundations of probabilistic weighting and for providing a psychometric application goes to de Finetti (1965),
although less sophisticated methods have been discussed in the educational
measurement literature for some forty years (see, Jacobs, 1971, for a comprehensive bibliography on the subject). Much energy has been expended on
devising methods of presentation, instructions and scoring rules which will
be comprehensible to the ordinary candidate (see Lord and Novick, 1968, Ch.14;
Echternacht, 1972; Boldt, 1974). In one method, for instance, candidates
are invited to distribute five stars (each representing a subjective probability of 0.20) across the options presented to them. It is assumed that the
individual's degree of belief or personal probability concerning the correctness of each alternative answer corresponds exactly with his personal probability distribution, restricted to sum to unity. The trouble is candidates may
not care about some of the alternatives offered to them in which case talk
of belief is fatuous. Empirical evidence from other fields suggests that
often individuals have a hard time distributing their personal probabilities
(Peterson and Beach, 1967); some fail to constrain their probabilities to add
to unity, although the stars scheme gets over this, while others tend to lump
the probability density on what they consider is the correct answer, which is
not necessarily the best strategy.
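As an illustration of the star scheme just described, the sketch below (Python) turns a five-star allocation into a personal probability distribution. The scoring rule shown - credit equal to the probability placed on the keyed answer - is only an assumption made for the purposes of the example; the scoring rules actually proposed in the literature vary.

    def star_response(stars_per_option):
        # Each star represents a subjective probability of 0.20; exactly five
        # stars must be distributed, so the probabilities sum to unity.
        if sum(stars_per_option.values()) != 5:
            raise ValueError("exactly five stars must be distributed")
        return {option: n * 0.20 for option, n in stars_per_option.items()}

    # Hypothetical candidate: fairly confident of B, hedging towards D.
    probabilities = star_response({"A": 0, "B": 3, "C": 0, "D": 2, "E": 0})

    # Illustrative scoring rule (an assumption, not one from the studies cited):
    # credit the probability assigned to the keyed answer.
    key = "B"
    print(probabilities[key])   # 0.60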
Apart from worries about whether candidates can handle the technique, concern
has been expressed that confidence test scores are influenced, to a measurable
degree, by personality variables. The worry is that individuals respond with
a characteristic certainty which cannot be accounted for on the basis of their
knowledge (Hansen, 1971). Or, as Koehler (1974) puts it, confidence response
methods produce variability in scores that cannot be attributed to knowledge of
subject matter. Not everyone accepts this assessment, of course, especially
Shuford who has been the foremost champion of confidence weighting (for a
recent promotional effort, see Shuford and Brown, 1975). Basically, he and
his associates argue that practice, necessary in any case because of the novel
features of the technique, removes 'undesirable' personality effects. A paper
by Echternacht, Boldt and Sellman (1972) suggests that this might indeed be
the case although their plea is more for an open verdict than anything else.
Suppose, for the sake of argument, that they are right. Is there psychometric
evidence which would suggest that it would be worth switching to confidence
weighting?
Having compared the validities of conventional testing and various confidence
testing procedures Koehler (1971) concluded that conventional testing is
preferable because it is easier to administer, takes less testing time and
does not require the training of candidates. A similar conclusion was reached
by Hanna and Owens (1973) who observed that greater validity could have been
attained by using the available time to lengthen the multiple choice test
rather than to confidence-mark items. Not satisfied with existing jargon,
Krauft and Beggs (1973) coined the phrase "subject weighted test taking
procedure" to describe a set-up in which candidates were permitted to distribute 4 points among 4 alternatives so as to represent their beliefs as to
the correctness of the alternatives. Total score was computed as the number
of points assigned to correct alternatives. After all this, they found that
the experimental procedure failed to encourage candidates to respond any
differently than they would have done to a conventional multiple choice test,
that is to say, no extra statistical discrimination was forthcoming. For the
most affirmative results we must turn to a paper by Pugh and Brunza (1975).
They claimed that by using a confidence scoring system the reliability of a
vocabulary test was increased without apparently altering the relative
difficulty level of the items. Moreover, no personality bias was found.
What has to be realised about results like this is that reliability is not
everything and that what may appear to be additional reliable variance may be
irrelevant variance attributable to response styles. That, at any rate, was
the view expressed by Hopkins, Hakstian and Hopkins (1973) who also pointed
out that far from increasing validity response-style variance may actually
diminish it.
Confidence testing has also found advocates in the medical examining field but
the latest paper I have been able to find on the subject is no more optimistic
than any of the rest. Palva and Korhonen (1973) investigated a scheme whereby
candidates were asked to choose the correct answer as usual and to check a 1
if they were very sure of their answer, a 2 if they were fairly sure and a 3
if they were guessing. After applying the following scoring scheme (due to
Rothman, 1969):
                            Very sure   Fairly sure   Guess
For a correct answer           4/3           1         2/3
For an incorrect answer       -1/3           0         1/3
these workers concluded that confidence testing does not give any substantial
information in addition to what is given by conventional scoring and so cannot
be justified. Their scoring scheme may be criticised for encouraging guessing
by allowing 1/3 of a mark even when the guess is incorrect (see the critique
by Paton, 1971) but given the direction of their results it is hard to imagine
that any modification would make much difference to the conclusion.
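A minimal sketch of the Rothman scheme as tabulated above, in Python; the example responses are invented.

    # Weights from the Rothman (1969) scheme quoted above, indexed by the
    # declared confidence level and by whether the chosen answer was correct.
    ROTHMAN_WEIGHTS = {
        "very sure":   {True: 4/3, False: -1/3},
        "fairly sure": {True: 1.0, False: 0.0},
        "guess":       {True: 2/3, False: 1/3},
    }

    def rothman_score(responses):
        # responses: list of (confidence_level, answered_correctly) pairs.
        return sum(ROTHMAN_WEIGHTS[level][correct] for level, correct in responses)

    # Hypothetical candidate over four items.
    print(rothman_score([("very sure", True), ("guess", False),
                         ("fairly sure", True), ("very sure", False)]))   # 2.33...

Note that even an incorrect guess earns 1/3 of a mark, which is precisely the feature criticised above.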
Ranking alternative answers
From time to time, the idea of asking candidates to rank alternatives in order
of plausibility is wheeled out. Certainly there are items, notably in economics and history, which seem to positively invite this mode of response. The
idea would be to score a ranking according to where the keyed option was placed
on, say, a 4-3-2-1-0 points basis, so that a keyed answer ranked second would
score 3. This is, in fact, the practice followed in self-scoring procedures,
where an individual is given immediate feedback and has to continue making
passes at the item until he comes up with the right answer, thus establishing
a full or partial ranking of options. Dalrymple-Alford (1970) has studied
this system from a theoretical point of view. Empirical investigations have
been made by Gilman and Ferry (1972) and Evans and Misfeldt (1974) who report
improvements in split-half reliability estimates, to which can be added the
benefits of immediate feedback discussed in Chapter 1. From the standpoint of
public examinations, these procedures suffer from the limitation that they can
only really be implemented by a computer-assisted test administration - available pencil-and-paper techniques seem too ponderous - but in the classroom
they are quite feasible and could be helpful in teaching.
If specifying a complete ranking of alternatives is thought to be too much to
ask some form of restricted ranking may be suitable. It is often the case
that there is an alternative which is palpably least plausible. A possible
scoring formula for this set-up might be to assign a score of 1 to the correct
response and a score of -X to a 'least correct' response, where 0 < X < 1. Lord
reports (Lord and Novick, 1968,p.314) that he investigated just such a scheme
but gained little from using the scoring formula. I would imagine that with
some subject areas agreement over what are the 'least correct responses' may
be hard to come by although, of course, it could be done by looking at pretest
statistics and choosing the options answered by the candidates with the lowest
average test score.
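The two schemes just described can be sketched in a few lines of Python; the value of X used for the restricted scheme is arbitrary and purely illustrative.

    def full_ranking_score(rank_of_key, n_options=5):
        # Complete ranking scored on a 4-3-2-1-0 basis: a keyed answer ranked
        # first scores 4, ranked second scores 3, and so on.
        return n_options - rank_of_key            # rank_of_key = 1 for first place

    def restricted_ranking_score(chosen, least_plausible, key, x=0.5):
        # Restricted ranking: 1 for choosing the key, -x (0 < x < 1) for
        # nominating the key as the 'least correct' option, 0 otherwise.
        if chosen == key:
            return 1.0
        if least_plausible == key:
            return -x
        return 0.0

    print(full_ranking_score(rank_of_key=2))                                   # 3
    print(restricted_ranking_score(chosen="C", least_plausible="A", key="A"))  # -0.5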
Elimination scoring
Of the response procedures which require the candidate to do something other
than mark what he believes to be the correct response, the one known as elimination scoring appears to have particular promise. All the examinee has to do
is to mark all the options he believes to be 'wrong'. The elimination score
is the number of incorrect options eliminated minus (k-l) for each correct
answer eliminated, where k is the number of alternatives per item. Thus the
maximum score on a five-choice item would be four and minimum score -4 which
a candidate would receive if he eliminated the correct answer and no other
option. The penalty imposed for eliminating a correct answer is put in to
control guessing. If a candidate decides at random whether or not to eliminate
a choice or, what comes to the same thing, nominates it as a distractor, then
his expected score from that choice is zero. Coombs, Milholland and Womer
(1956), who appear to have been the first to study this procedure, obtained
slightly higher reliability coefficients for elimination compared to conventional scoring. Rather more positive results were obtained in a later study
by Collet (1971) and it looks as if elimination scoring might be worth investigating further. One mark in its favour, according to Lord and Novick (1968,
p.315), is that it is likely to discourage blind guessing. Suppose, for
example, that faced with a four-choice item a candidate has eliminated two
alternatives he knows to be wrong. There remain two choices, one the answer
and one a distractor. If he has run out of knowledge and chooses to have a
go he is gambling an additional point credit against a three point loss. One
objection to elimination scoring might be that it is too negative, laying too
much stress on what is wrong rather than what is right. That remains to be
seen. As yet we know little about this method of scoring.
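The elimination scoring rule described above is easily expressed; the sketch below (Python) reproduces the maximum and minimum scores for a five-choice item.

    def elimination_score(eliminated, key, k=5):
        # One point for each incorrect option eliminated, minus (k - 1)
        # if the keyed answer itself is eliminated.
        score = sum(1 for option in eliminated if option != key)
        if key in eliminated:
            score -= (k - 1)
        return score

    # Five-choice item with key 'A'.
    print(elimination_score({"B", "C", "D", "E"}, key="A"))   #  4 (maximum)
    print(elimination_score({"A"}, key="A"))                  # -4 (minimum)
    print(elimination_score({"B", "C"}, key="A"))             #  2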
Weighting of item responses
Once item responses have been made in the normal manner, they can be subjected
to all kinds of statistical manipulations in an effort to produce more informative scores than those obtained by simply summing the number of correct
items. Individual items can be differentially weighted, groups of items can
be weighted, even options within items can be weighted. Sadly all these
efforts have amounted to very little although a recent study by Echternacht
(1976) reports more promising results. If the intercorrelations among items
or sections of tests are positive (as they invariably are) then differential
weighting of items or sections produces a rank order of scores which differs
little from the order produced by number right (Aiken, 1966). At the time
Aiken claimed that his analysis reinforced the results of previous empirical
and theoretical work and the position has changed little since. The most
authoritative paper of the period is the review by Stanley and Wang (1970);
other relevant papers are those by Sabers and White (1969), Hendrickson (1971),
Reilly and Jackson (1973) and Reilly (1975). All of these papers examine the
effects of empirical weighting - empirical because weights are allocated after
the event - via an iterative computer solution so as to maximise the reliability or validity of the scores. Thus candidates have no idea when they take
the test of the relevant score values of items. Not that this is anything new;
in the normal course of events items weight themselves according to their
discrimination values. The empirical weighting method, which incidentally was
first proposed by Guttman (1941), is, of course, open to the objection that
it is the candidates rather than the examiners who are in effect deciding
which options are most credit-worthy even to the extent of downgrading the
nominal key if the brightest group of candidates should for some reason happen
to be drawn to another option (although it is very unlikely that such an item
would survive as far as an operational test). There is also an element of
self-fulfilment present since candidates' scores are being used to adjust
candidates' scores; students who do well on the test as a whole have their
scores boosted on each item which only compounds their superiority.
Echternacht (1976) was not exaggerating when he observed that one would have
problems giving candidates a satisfying explanation of an empirical scoring
scheme.
If empirical weighting leaves something to be desired, what about a priori
weighting of the options? Here one asks informed persons, presumably examiners,
to put values on the options, perhaps with a simple 4-3-2-1-0 scheme. This has
already been broached in this chapter. Alternatively and preferably although
difficult, examiners could be asked to construct items in such a way that the
distracters were graded according to plausibility. One might argue they should
be doing this anyway. What happens when a priori or subjective weighting is
tried out? Echternacht (1976) using specially constructed items found that
the results it gave were not even as reliable as conventional number-right
scoring and certainly inferior to the results obtained from empirical weighting. He found in fact that with empirical weighting he registered an increase
in reliability equivalent to a 30 per cent increase in length of a conventionally scored test and also reported an increase in validity. Note however that
his items, which were quantitative aptitude items, cost 60 per cent more to
produce than items written in the usual way.
Recently my colleague Brian Quinn and I (Quinn and Wood, 1976) compared subjective and empirical weighting with conventional scoring in connection with the
Ordinary level English language comprehension test mentioned in Chapter 1.
First we attempted to rank options in an order of plausibility or likelihood
but it was soon apparent that few of the distracters could be deemed worthy
of any credit, or at least that they could not often be ranked in any meaningful order. For 33 of the 60 items we would recognise no merit in any of the
options but the key, for 23 items a second best ('near miss') option only
could be identified; for three items second and third best options could be
found, and for only one item could all five options be ranked. The arbitrary
marking scheme chosen was 1.0 for the key, 0.75 for the second best option,
0.5 for third best, 0.25 for fourth best, and zero for the fifth best, plus
all unranked options and omissions.
In the event neither subjective nor empirical weighting made much difference to
the original rank ordering of candidates produced by conventional scoring.
For subjective weighting the correlation between the derived scores and conventional scores was 0.99, although, of course, a high correlation was expected
owing to the fact that the same 33 items figured in both derived and conventional scores. The correlation for empirically weighted series was at first
rather lower (0.83) but on inspection this was found to be attributable to
omits being scored zero. Dubious though the logic may be it is clearly necessary to give rewards for omits otherwise graded scoring will tend to discriminate heavily between candidates who choose any response, no matter how poor,
and those who choose none at all. When omits were scored, the correlation between empirically weighted scores and conventional scores rose to 0.95. Any
extra discrimination which graded scoring gives will naturally tend to be at
the bottom end of the score range since these are the people who get a lot of
items 'wrong' and who stand to benefit from partial scoring. Our experience
has been that there is a little increase in discrimination in this sector but
nothing spectacular.
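By way of illustration only - the data below are invented, not those of the Quinn and Wood study - the following sketch (Python with numpy) shows how subjectively weighted scores of the 1.0/0.75/0.5/0.25/0 kind can be compared with conventional number-right scores.

    import numpy as np

    # option_weights[item][option]: key scores 1.0, 'near miss' 0.75, and so on.
    option_weights = [
        {"A": 1.0, "B": 0.75, "C": 0.0,  "D": 0.0, "E": 0.0},
        {"A": 0.0, "B": 0.0,  "C": 1.0,  "D": 0.0, "E": 0.0},   # no credit-worthy distracters
        {"A": 0.0, "B": 1.0,  "C": 0.75, "D": 0.5, "E": 0.0},
    ]
    keys = ["A", "C", "B"]

    responses = [["A", "C", "C"],     # candidate 1
                 ["B", "C", "C"],     # candidate 2
                 ["C", "A", "D"]]     # candidate 3

    conventional = np.array([[1 if r == k else 0 for r, k in zip(cand, keys)]
                             for cand in responses]).sum(axis=1)
    weighted = np.array([[option_weights[i][r] for i, r in enumerate(cand)]
                         for cand in responses]).sum(axis=1)

    print(conventional, weighted)
    print(np.corrcoef(conventional, weighted)[0, 1])   # high, as in the study above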
It ought to be said that the strength of the case for graded scoring varies
according to the subject matter.
In mathematics and the sciences, also the
social sciences, graded scoring may work; in French and English perhaps the
problems of interpretation are too great.
SUMMARY
1. Good candidates are good 'guessers', as Linus says. All candidates should
be encouraged to make use of all the knowledge at their disposal. When candidates choose to omit questions it is usually the more able ones who do not do
themselves justice, being reticent when they actually know the right answers.
2. Blind guessing does occur but only when the conditions are ripe for it.
Under appropriate testing conditions its effects can be reduced until it is
no longer a problem. The occasional hysterical outbursts from a teacher or
an examiner or a member of the public to the effect that multiple choice tests
are no more than gambling machines are quite unjustified. All the evidence is
that if tests are properly constructed, presented and timed, candidates will
take them seriously.
3. Despite all the ingenuity and effort which has gone into developing methods
for rewarding partial information, and no one has exceeded the Open University
in this respect (see their undated document 'CMA Instructions'), there is
little evidence that any one method provides measurable gains. Elimination
scoring has possibilities, as does self-scoring except that it is presently
too limited in scope. Confidence weighting is, I think, too elaborate and
beyond the average candidate. One is left with the conclusion that if the
items in a test are well constructed, if candidates are advised to go over
their answers since changing answers seems to pay, and if the testing conditions are such as to inhibit blind guessing with candidates being encouraged
to attempt all items, number right suffices for most needs.
6. Item Analysis
"Psychometricians appear to shed much of their psychological knowledge as they
concentrate on the minutiae of elegant statistical techniques." (Anastasi, 1967)
That there is substance in Anastasi's rebuke cannot be denied. Item analysis
has been the plaything of two or three generations of psychometricians, professional and amateur. Because the so-called classical methods of item analysis
are accessible to anyone who is at all numerate, and because there is room for
differences of opinion over how items shall be characterised statistically,
the literature teems with competing indices - according to Hales (1972), more
than 60 methods for calculating measures of discrimination have been proposed!
In devoting as much space as I shall to item analysis and test construction, I
run the risk of doing what Anastasi warns against, but I have no alternative
if I am to survey the field properly. There are, in any case, good grounds
for taking item analysis seriously. More than any other testing technique,
multiple choice is dependent on statistical analysis for legitimation. The
guarantee of pretesting is intended to reassure the public that items are, as
the London booklet says (University of London 1975, p.3), "free from ambiguity,
and are of an appropriate level of difficulty". Doubtless, we should expose
other testing techniques to the same close scrutiny, but, for one reason or
another, this does not seem to happen very often.
Actually, the term 'statistics' as applied to classical item analysis, is
something of a misnomer, because the statistics are not motivated by probabilistic assumptions, or if they are, these are not at all apparent. It is true
that in a sample the proportion getting an item correct is the best estimate
of the item difficulty in the population, and that other statistical statements
can be made, but, generally speaking, classical item analysis has developed
along pragmatic lines. I might mention that Guilford's (1954) textbook is
still far and away the best guide to the classical methods.
For a more truly statistical approach, we must turn to the modern methods of
item analysis which are based on theories of item response expressed in probabilistic terms (Lord and Novick, 1968). It is fair to say that these modern
methods are still generally unknown or poorly understood, although the Rasch
model, which is just about the simplest example of an item response model, is
bringing people into contact with modern thinking. How much utility these
methods have is another matter.
My aim is to present classical and modern methods of item analysis, but to do
so in such a way that, hopefully, the two approaches will be unified in the
reader's mind. Unless I specify to the contrary, I am writing about dichotomous items, i.e. those scored 0 or 1.
For any item, the raw response data consist of frequency counts of the numbers
of individuals choosing each option, together with the number not answering
the item at all, known as the 'omits'. From this information, it is immediately possible to calculate the proportion or percentage of individuals getting
the right answers. This statistic is known as the item difficulty or facility,
depending on which nomenclature you prefer. Facility is perhaps the more
felicitous term, since the higher the proportion correct, the easier is the
item.
For those who want a direct measure of item difficulty, the delta statistic
(Δ) is available. Delta is a nonlinear transformation of the proportion correct, arranged so as to have a mean of 13 and a standard deviation of 4,
giving it an effective range of 1 to 25. The formula is:

Δ = 4Φ⁻¹(1 - p) + 13

where p is the proportion correct or item facility, and Φ⁻¹ is the inverse
normal transformation (for details, see Henrysson, 1971, pp.139-140). The
point of choosing a nonlinear transformation is that proportions or percentages fall on a nonlinear scale, so that judgement of relative differences in
facilities is apt to be misleading. Thus, the difference in difficulty between items with facilities of .40 and .50 is quite small, whereas the
difference between items with facilities .10 and .20 is quite large. The
delta scale, however, is ostensibly linear, so that the difference in difficulty between items with deltas of 13 and 14 is taken to be the same as the
difference in difficulty between items with deltas of 17 and 18.
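A short sketch of the transformation (Python, using scipy's inverse normal); the check values follow directly from the formula, and the second reproduces, to rounding, the delta quoted for the item in Table 6.1 below.

    from scipy.stats import norm

    def delta(p):
        # Delta scale: mean 13, standard deviation 4; the higher the delta,
        # the harder the item.
        return 4 * norm.ppf(1 - p) + 13

    print(round(delta(0.50), 2))   # 13.0
    print(round(delta(0.46), 2))   # about 13.4 (13.42 is quoted for the Table 6.1 item)
    print(round(delta(0.10), 2))   # about 18.1 - a much harder item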
Figure 6.1 shows the approximate relationship between facility and delta. It
can be seen that a facility of 0.50 corresponds to a delta of 13, since
Φ⁻¹(0.5) equals 0.
What Δ does not have is a direct relationship with a test statistic like
number correct, but this is easily remedied by transforming the total score
so that it has the same mean and standard deviation as delta.
With test score as the categorising variable, it is possible to divide the
candidate population into ability groups or bands, and to observe how these
groups respond to the item. (Test score is chosen simply because it is
usually the best measure of the relevant ability, but if a better measure is
available, it should be used.) By sorting the item responses into a two-way
table of counts, with ability bands as rows and alternative answers as
columns, the data can be laid out for inspection. Table 6.1 shows what I mean.
Here the five ability bands - and five is a good number to use - have been
constructed so as to contain equal numbers of candidates, or as nearly equal
as possible, which means that, unless the distribution of scores is rectangular
(see Chapter 7), the score intervals will always be unequal. However, there
is no reason why the bands should not be defined in terms of equal score intervals or according to some assumption about the underlying distribution of scores.
If, for instance, one wanted to believe that the underlying score distribution
was normal, the bands could be constructed so as to have greatest numbers in
the middle bands and smallest in the outer bands. The problem then is that,
given small numbers, any untoward behaviour in the tails of the distribution
would be amplified or distorted. Also, interpretation of the table would be
more prone to error because of the varying numbers in the bands.
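The layout described can be produced in a few lines; the sketch below (Python with pandas, and with invented responses) simply cross-tabulates five roughly equal ability bands against the options chosen.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(1)
    n = 319
    total_score = rng.normal(26, 9, size=n)                 # criterion: total test score (invented)
    chosen = rng.choice(list("ABCDE") + ["omit"], size=n)   # option chosen on the item

    # Five bands containing, as nearly as possible, equal numbers of candidates.
    bands = pd.qcut(total_score, q=5,
                    labels=["lowest", "2nd", "3rd", "4th", "highest"])

    print(pd.crosstab(bands, chosen))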
Fig. 6.1  The approximate relationship between facility and delta.

TABLE 6.1  Responses to a five-choice item cross-tabulated by ability band: five
score bands of 63-64 candidates each (N = 319), with, for each option and for
the omits, the number and proportion of candidates choosing it and their mean
criterion (test) score.
The item generating the data in Table 6.1 belonged to a 50-item Chemistry test
taken by 319 candidates. Normally, of course, the candidate population would
be far larger than this but for my purposes there is some advantage to be
gained from smaller numbers. The correct (starred) answer was option A,
chosen by 146 candidates, which, as the number underneath indicates, was 0.46
of the entry. The facility of this item was, therefore, 0.46, and the difficulty (Δ) 13.42. Of the distracters, E was most popular (endorsed by 82 candidates, or 0.26 of the entry), followed by C, D and B. Only two candidates
omitted the item.
Turning to the body of the table, a pattern will be evident. Whereas, under
the correct answer A the count increases as the ability level rises, under
the distracters (excepting D, where the trend is unclear), the gradient runs
in the opposite direction. This is just as it should be if we want the item
to discriminate in terms of total test score. The pattern we should not want
to see would be one where the counts under A were equal or, worse, where the
count in each cell of the table was the same. As it is, the distribution of
answers tells us quite a lot. Relatively speaking, options B and C were much
more popular in the bottom ability band than in the rest, and in the bottom
band the correct answer was barely more popular than B and D, which were almost
totally rejected by the top two bands. Taken as a whole, the table underlines
the observation made in the last chapter that wrong answers are seldom, if
ever, distributed equally across the distracters, either viewing the candidate
population as a whole, or in bands. Nor is there any evidence of blind guessing, the sign of which would be an inflated number in the top left hand cell
of the table - the one containing a '9' - causing the gradient to flatten out
at the bottom, or even go in the other direction.
The notion of a gradient of difficulty is a useful way of representing another
characteristic of an item, its effectiveness in establishing the difference
between 'clever' and 'dull' candidates, or its discriminating power. The usual
approach to obtaining a measure of item discrimination is to calculate the
correlation between score on the item (1 or 0) and score on the test as a whole,
the idea being that the higher the correlation between candidates' score on an
item and their score on the test, the more effective the item is in separating
them. Naturally, this relationship is a relative one; when included in one
test, an item could have a higher item-test correlation than when included in
another, yet produce poorer discrimination.
The correlation I am talking about has a special name, point biserial, and
it is worth examining how it is calculated to see how much information from
the table it actually uses. The formula for the sample point biserial correlation can be set out in various ways, but the most convenient for my purposes
is as follows:
rpbis = [(Mp - M)/S] √[p/(1 - p)]

where Mp is the mean score on the test obtained by those who got the item
correct, M is the mean score on the test for the entire group, S is the standard deviation of the test scores for the entire group, and p is the proportion
getting the item right (the item facility). Evidently, rpbis serves as a
measure of separation through the action of the term Mp - M; it is also a
function of item facility, and the effect of this will be looked at presently.
To calculate rpbis for the item in Table 6.1, the values of Mp and M can be
found in the row directly underneath the body of the table labelled, 'Mean
criterion', where 'criterion' means test score. These mean test scores provide
useful supplementary information. Thus, the mean score on the test obtained
by those choosing A, the correct answer, was 30.79. This is Mp. Similarly,
the 13 candidates choosing B scored an average of 18.77 on the test, which
made them the lowest scoring of the four distractor groups. The mean score on
the test for the entire group, M, is given at the right end of the 'Mean
criterion' row, and is 26.02. The value of the standard deviation, S, which
is not given in the table, was 8.96. The expression for rpbis is, therefore,
30.79
- 26.02*
8.96
(r )
‘46
3
which turns out to be 0.49.
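The calculation is easily checked; the sketch below (Python) computes the point biserial both from raw item and test scores and from the summary quantities quoted for the Table 6.1 item.

    import numpy as np

    def point_biserial(item_scores, test_scores):
        # item_scores: 0/1 for each candidate; test_scores: total test scores.
        item_scores = np.asarray(item_scores, dtype=float)
        test_scores = np.asarray(test_scores, dtype=float)
        p = item_scores.mean()                       # item facility
        mp = test_scores[item_scores == 1].mean()    # mean score of those correct
        m = test_scores.mean()                       # mean score of the whole group
        s = test_scores.std()                        # s.d. of the test scores
        return (mp - m) / s * np.sqrt(p / (1 - p))

    # Check against the worked example: Mp = 30.79, M = 26.02, S = 8.96, p = 0.46.
    r = (30.79 - 26.02) / 8.96 * np.sqrt(0.46 / 0.54)
    print(round(r, 2))   # 0.49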
The question immediately arises, 'Is a value of 0.49 good, bad or indifferent?'
This is a fair question to ask of any correlation taking this value. If it
were an ordinary product moment correlation, one might interpret the value
within the limits -1 to +1 but, with the point biserial, this assumption may
be unjustified. In fact, as Wilmut (1975(a),p.30) demonstrates, the point
biserial coefficient, when applied to item analysis, is unlikely ever to exceed
0.75, or to fall below -0.10. In these circumstances, a value of 0.49 signifies quite effective discrimination. Of the many discrimination indices, the
chief competitor to the point biserial is the biserial correlation, and much
has been written on the subject of which statistic is preferable. Unlike the
point biserial, the biserial is not a product moment correlation; rather, it
should be thought of as a measure of association between performance on the
item and performance on the test or some other criterion. Also distinguishing
it from its competitor is the assumption that underlying the right-wrong
dichotomy imposed in scoring an item is a normally distributed latent variable
which may be thought of as representing the trait or traits that determine
success or failure on the item. Doubts about the tenability or appropriateness
of this assumption lead some people to have nothing to do with the biserial.
Equally, there are those, like myself, who find the assumption underlying the
point biserial - that a person either has the ability to get an item right or
has none at all - quite implausible. What has attracted people to the biserial
is the possibility that it may have certain desirable properties which make it
superior to the point biserial or any other discrimination index, namely, that
it is less influenced by item difficulty and also - an important property this
- that it holds stable, or is invariant, from one testing situation to another,
a property the point biserial definitely does not possess.
The formula for calculating the sample biserial correlation coefficient resembles that for the point biserial quite closely, being

rbis = [(Mp - M)/S] [p/h(p)]

where the terms are as before, except for h(p), which stands for the ordinate
or elevation of the normal curve at the point where it cuts off a proportion p
of the area under the curve. h(p) enters into the formula because of the
assumption about the normally distributed underlying variable. It is easily
looked up in any textbook containing statistical tables (see, for instance,
Glass and Stanley, 1970, Table 8).
The relationship between the biserial and point biserial formulae is simple,
being

rpbis = rbis × h(p)/√[p(1 - p)]

This means that the point biserial is equal to the biserial multiplied by a
factor that depends only on item difficulty, so that the point biserial will
always be less than the biserial.
In fact, Lord and Novick (1968, p.340)
show that the point biserial can never attain a value as high as four-fifths
of the biserial, and present a table showing how the fraction varies according
to item difficulty (see also Bowers, 1972). In theory, the biserial can take
any value between -1 and +1. Negative values usually indicate that the wrong
answer has been keyed. Values greater than 0.75 are rare, although, in exceptional circumstances, the biserial can exceed 1. This is usually due to some
peculiarity in the test score or criterion distribution (Glass and Stanley,
1970, p.171). For the item in Table 6.1, the biserial estimate was 0.62,
about which one would say the same as about the point biserial value, that it
signifies quite effective discrimination.
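For completeness, the sketch below (Python) recovers the biserial estimate for the Table 6.1 item from the same summary quantities and confirms the relationship with the point biserial given above.

    from math import sqrt
    from scipy.stats import norm

    def ordinate(p):
        # h(p): the ordinate of the normal curve at the point cutting off
        # a proportion p of the area.
        return norm.pdf(norm.ppf(p))

    p, mp, m, s = 0.46, 30.79, 26.02, 8.96
    r_bis = (mp - m) / s * (p / ordinate(p))
    r_pbis = r_bis * ordinate(p) / sqrt(p * (1 - p))

    print(round(r_bis, 2))    # about 0.62, as quoted above
    print(round(r_pbis, 2))   # 0.49, recovering the point biserial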
As Lord and Novick (1968, p.342) observe, the extent of biserial invariance is
necessarily a matter for empirical investigation. They themselves claim that
'biserial correlations tend to be more stable from group to group than point
biserials' and present some results which point in that direction. My view
is that this is still very much an open question. Experience at the London
examinations board indicates that even with ostensibly parallel groups of
candidates biserial estimates for the same item can 'bounce' around beyond
what would be expected from the estimated margin of error.
So, what is the answer to the burning question, 'Biserial or point biserial?'
The consensus among people who have studied this question (e.g. Bowers, 1972)
seems to be that as long as a markedly non-normal distribution of ability is
not anticipated, substantially the same items are selected or rejected whichever statistic is used to evaluate discrimination. It is true that the point
biserial is rather more dependent on the level of item difficulty but this
is not serious, since it only leads to rejection of very easy or very difficult items, which would be rejected anyway. For the practical user, my
advice is to fasten on to one or another statistic, learn about its behaviour,
and stick with it. Switching from one to the other, or trying to interpret
both simultaneously, is a waste of time.
OTHER DISCRIMINATION INDICES
Of all the discrimination indices which have been advanced, the simplest is
undoubtedly D, or net D, as it is sometimes called. If, for any item, Rh is
the proportion correct achieved by the 27 per cent highest scorers on the test,
and Rl is the corresponding figure for the 27 per cent lowest scorers, then
D = Rh - Rl. It may seem odd that just as good results can be obtained by
discarding the middle of the score distribution as by using the whole distribution, but providing the ability being measured is normally distributed, and
that is a big proviso, this is the case. The quantity of information may be
reduced, but the quality is improved, the result being groups which are, in
Kelley's (1939) words, "most indubitably different with respect to the trait
in question". Those interested in the statistical basis of the 27 per
rule might consult Kelley's original paper, or, more recently, that by
and Weitzman (1964). Incidentally, D'Agostino and Cureton (1975) have
recently that the correct percentage is more like 21 per cent, but add
the use of 27 per cent is not far from optimal.
cent
Ross
shown
that
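A sketch of the index (Python, with invented data); the 27 per cent groups are simply the top and bottom slices of the test score distribution.

    import numpy as np

    def d_index(item_scores, test_scores, fraction=0.27):
        # D = Rh - Rl: proportion correct in the top 'fraction' of test scorers
        # minus the proportion correct in the bottom 'fraction'.
        item_scores = np.asarray(item_scores, dtype=float)
        order = np.argsort(np.asarray(test_scores))      # lowest to highest score
        n_group = max(1, int(round(fraction * len(order))))
        r_low = item_scores[order[:n_group]].mean()      # Rl
        r_high = item_scores[order[-n_group:]].mean()    # Rh
        return r_high - r_low

    # Ten invented candidates; the item is answered correctly mainly by high scorers.
    item = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]
    test = [12, 15, 18, 20, 22, 25, 28, 30, 33, 40]
    print(d_index(item, test))   # groups of 3: (3/3) - (0/3) = 1.0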
It is important to remember that the D statistic was invented to fill a need
at the time for a short-cut manual method of estimating the discriminating
power of an item. Now that most users will have access to an item analysis
computer program short-cut methods have become pointless, although there is
a story that someone actually programmed the calculations for the D statistic!
For those who want or need an item analysis by hand - and it is still a good
way of getting the 'feel' of item performance - it can be said that D agrees
quite closely with biserial correlation estimates, even when the underlying
distribution is non-normal (Hales, 1972). Tables from which D can easily be
calculated (replacing the old Fan tables) have been compiled by Nuttall and
Skurnik (1969), who also provide other 'nuts and bolts' for a manual item
analysis.
Like the point biserial and even the biserial, the D index is dependent on
item facility. In particular, it decreases sharply as the facility approaches
0 or 1, when it must be interpreted with caution (Connaughton and Skurnik,
1969; Nuttall and Skurnik, 1969) but then, as I have said, the test constructor will probably not be interested in these items anyway. Those who prefer
a discrimination index which is independent of item difficulty might be
interested in the rank biserial correlation coefficient (Glass, 1966). However, there are problems with this index when frequent ties among test scores
occur, and it is, therefore, not recommended for large groups (n>50).
If most discrimination indices are affected by difficulty anyway, why not
deliberately combine difficulty and discrimination into one index? Ivens
(1971) and Hofmann (1975) have attempted to do this in different ways, working
within nonparametric and probabilistic frameworks respectively. For Ivens,
the best possible item will have a difficulty of 0.5 and perfect discrimination,
0.5 being chosen because this value maximises the number of discriminations an
item can make; for Hofmann, item efficiency, as he calls it, is defined as the
ratio of observed discrimination to maximum discrimination, the latter having
been determined from the difficulty of the item. Both workers make certain
claims for their indices, Hofmann's being rather more sweeping, but it is
too soon to say what substance these have. Significantly, Ivens admits that
there will be instances where his index will not be appropriate for item
selection as, for example, when the object is to construct a test to select
a small proportion of individuals.
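The remark about 0.5 maximising the number of discriminations an item can make is easily checked: an item passed by a proportion p of N candidates separates each passer from each failer, giving N x p multiplied by N x (1 - p) pairs, which is greatest at p = 0.5. A small illustrative check in Python (my own, not Ivens' formula):

    # Number of candidate pairs an item of facility p separates, out of N candidates.
    def discriminations(n, p):
        return (n * p) * (n * (1 - p))  # passers multiplied by failers

    for p in (0.1, 0.3, 0.5, 0.7, 0.9):
        print(p, discriminations(100, p))  # peaks at p = 0.5 (2500 pairs for N = 100)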
A more serious shortcoming of both indices, and this applies to others like
the rank biserial, is their failure to satisfy what Lord and Novick (1968,
p.328) call a basic requirement of an item parameter, namely, that it should
have 'a definite (preferably a clear and simple) relationship to some interesting total test score parameter'. To call this a basic requirement is no
exaggeration.
Item statistics are only useful insofar as they enable us to
predict the characteristics of a test composed of the items examined. Bowers
(1972) is absolutely right when he remarks that a comparison of values of
biserial and point biserial coefficients, or any other indices, begs the
question. What matters is to select items that lead to test score distributions which are best for a particular application. I shall elaborate on
this theme in the next chapter.
GENERALISED ITEM STATISTICS
So far, in calculating discrimination indices, only information about those
who got the item right (or wrong) has been used, whereas, of course, there is
information associated with each wrong answer, namely the mean test scores
achieved by the candidates who fall for the distractors. It is reasonable to
ask whether statistics could not be devised to take this information directly
into account, and so provide more accurate summaries of how an item behaves.
What is wanted are generalisations of the point biserial and biserial coefficients, and these have, in fact, been developed by Das Gupta (1960) and
Jaspen (1965) respectively. To calculate the point multiserial, as Das Gupta
terms his statistic, each response option, including the right answer, is
treated as a separate nominal category, as if each represented a character such
as eye colour. With the polyserial, on the other hand, being a generalisation
of the biserial, it is necessary that the distracters can be ordered or graded
in terms of degree of 'wrongness' or 'rightness', so that the assumption of an
underlying normally distributed trait can be better met. I must say that, in
my experience, items seldom lend themselves to this kind of ordering, at least
not on a large scale (see the discussion on a priori weighting in Chapter 5).
That is why I think the polyserial coefficient will generally find a more
suitable application when the polychotomised variable is something like an
examination grade or a rating, where there is a natural order of measurement.
The point multiserial is the more suitable statistic, but it is rather cumbersome to calculate (although not once it is programmed). Having used it myself,
I have never felt that it was any more informative than the ordinary point
biserial. My feeling is that these generalised statistics are not a great deal
of use in regular item analysis, although I remain willing to be convinced.
The user would be just as well off with biserial estimates calculated for each
distractor, and some item analysis programs do provide this information.
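The kind of distractor-level information referred to above can be produced very simply; the sketch below merely computes the mean total test score of the candidates choosing each option, which is the raw material for the multiserial statistics rather than the statistics themselves (the function and variable names are mine).

    from collections import defaultdict

    def option_means(chosen_options, total_scores):
        """Mean total test score of candidates selecting each option (key and distractors)."""
        sums, counts = defaultdict(float), defaultdict(int)
        for option, score in zip(chosen_options, total_scores):
            sums[option] += score
            counts[option] += 1
        return {option: sums[option] / counts[option] for option in sums}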
Whatever item statistics are dreamed up, being of a summary nature they are
bound to be less informative than we would like. As Wilmut (1975(b), p.2)
has observed, an infinite number of items can have different response
patterns, yet possess the same discrimination index. The difficulty index
has limitations too; "All we know is that if the respondent passes it, the
item is less difficult than his ability to cope with it, and if he fails it,
it is more difficult than his ability to cope with it" (Guilford, 1954, p.419).
The message is that it is a mistake to rely too heavily on item statistics.
Instead, attention must be fixed on item response patterns, or gradients, as
was done for the data in Table 6.1. This complicates matters, but is unavoidable if accurate predictions of test score distributions are to be made.
THE ITEM CHARACTERISTIC CURVE
The most instructive way of examining an item response pattern is to plot a
graph showing how success rate varies with candidates' ability, for which total
test score usually stands proxy. The result is called an item characteristic
curve. It is the coping stone of modern item analysis methods, but the idea is
as old as educational measurement itself, dating from 1905, when Binet and
Simon plotted curves to show how children's success rates on items varied with
age. The movement towards summarising item response patterns only came with
the streamlining of item selection procedures.
To plot an item characteristic curve, the most obvious method would seem to
be to plot success rates for as many groups as there are different test scores.
In practice, however, this method is not only finicky, but may also mislead,
the reason being that success rates calculated from very small numbers of
candidates obtaining certain test scores are unstable, and thus may give a
false impression of how an item performs. Since the relationship between the
assumed underlying ability and test score is unknown, and the test scores are
bound to be fallible, it is preferable to group candidates in terms of test
score intervals, the supposition being that all candidates within a group
possess roughly the same amount of the ability in question. When this is
done, a curve like that in Fig. 6.2 results. A step-by-step method for producing the curve is given in the Appendix of Wilmut (1975(b)). Since we cannot measure 'ability' directly, the unit of measurement for the ability dimension is test score expressed in standardised form.
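The grouping procedure can be sketched as follows; this is not Wilmut's own method, whose details are in his Appendix, but a minimal Python version which bands candidates by standardised test score and reports the proportion succeeding on the item in each band.

    def empirical_icc(item_correct, z_scores, band_width=0.5):
        """Proportion passing the item within each band of standardised test score."""
        bands = {}
        for right, z in zip(item_correct, z_scores):
            centre = round(z / band_width) * band_width   # centre of the band z falls in
            n, r = bands.get(centre, (0, 0))
            bands[centre] = (n + 1, r + right)
        # Return (band centre, proportion correct, group size), lowest ability first.
        return [(c, r / n, n) for c, (n, r) in sorted(bands.items())]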
Fig. 6.2 (item characteristic curve: P plotted against ability)
The curve in Fig. 6.2 is the classic form - steep in the middle and shallow
at the tails. Given anything like a normal distribution of ability, items
with this characteristic are needed to produce discrimination among the mass
of candidates in the middle of the score range. If, however, the focus of
discrimination is elsewhere, say at the lower end of the ability range, then
items with characteristic curves like that shown in Fig. 6.3 will be needed.
Fig. 6.3 (item characteristic curve steep at the lower end of the ability range: P plotted against ability)
There is not the space to demonstrate the variety of item characteristic
curves. Those with a special interest might consult Wood and Skurnik (1969,
p.122) and Wilmut (1975(b), p.4 and 5). I might add that it is quite feasible
to plot response curves for each distractor and to display them on the same
graph as the item characteristic curve. It is then possible to inspect the
behaviour of each distractor.
PROBABILISTIC MODELS OF ITEM RESPONSE
While the item characteristic curves should be displayed for inspection whenever possible, they are not in a suitable form for theoretical exploratory
and predictive work.
It would, therefore, be useful if these curves could be
represented by a mathematical function or functions. Repeated investigation
has shown that if item characteristic curves are well behaved they can be
fitted by functions of the exponential type. Such functions then constitute
a model of the item response process in which an individual's probability of
success on an item is said to be governed jointly by his ability and by the
difficulty and discrimination of the item.
Fig. 6.4 (P plotted against ability)
Fig. 6.5 (P plotted against ability)
Fig. 6.6 (P plotted against ability)
Various models have been proposed to fit different families of curves. If
items are extremely well-behaved and look like Fig. 6.4 - same discrimination,
but varying difficulties - they will fit what is called the Rasch model, i.e.
the one-parameter logistic model, in which the one parameter is the item difficulty. If items are not so well-behaved and look like Fig. 6.5 - varying
difficulties and discriminations - they will fit either the two-parameter
logistic model or the two-parameter normal ogive model, which are very similar.
Finally, if items look like what Levy (1973, p.3) calls 'reality' (Fig. 6.6),
then some will fit one model and some another, but not all will fit the same
model, however complicated it is made (within reason, of course). The item
analyst or constructor then has to decide whether or not to discard those
items which fail to fit his favourite model. With the most restrictive
model - that of Rasch - quite a number of items may have to be discarded or,
at least, put to one side; with the other models, which allow discrimination
to vary, not so many items should be rejected.
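For readers who want to see the functions concerned, the logistic family can be sketched as below, assuming the usual parameterisation (ability theta, item difficulty b, item discrimination a); the Rasch curve is simply the case in which a is the same for every item.

    import math

    def logistic_icc(theta, b, a=1.0):
        """Two-parameter logistic item characteristic curve; a = 1 for every item gives the Rasch case."""
        return 1.0 / (1.0 + math.exp(-a * (theta - b)))

    # Items of equal discrimination but varying difficulty (cf. Fig. 6.4) ...
    rasch_like = [logistic_icc(0.0, b) for b in (-1.0, 0.0, 1.0)]
    # ... and items varying in both difficulty and discrimination (cf. Fig. 6.5).
    two_parameter = [logistic_icc(0.0, b, a) for b, a in ((-1.0, 0.5), (0.0, 1.0), (1.0, 2.0))]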
The utility of these item response models obviously depends on how many items
are reasonably well-behaved and how many are like 'reality'. This is
especially true where the Rasch model is concerned. Even though the criteria
for acceptance are stiff, the more fervent champions of this model insist,
in Procrustean fashion, that the data should fit the model, rather than the
model fit the data (Wright, 1968; Willmott and Fowles, 1974), the reasoning being
that only items which fit can produce 'objective' measurement. The technical
conditions for ensuring objectivity are discussed by Rasch (1968) but, basically, to be 'objective', measurement should be 'sample-free', which means
that it should be possible to estimate an individual's ability with any set
of items, providing they fit the model. This property of 'sample-freeness'
is hailed as an important breakthrough by Rasch enthusiasts, and it is true,
if all goes well, 'sample-freeness' does work. What has tended to escape
attention are the cases where 'sample-freeness' breaks down, as when an item
behaves differently in different samples, or discriminates erratically across
the ability range, e.g. as in Fig. 6.4. Whitely and Dawis (1974, 1976) have
a good discussion of this point. In the 1976 paper they note that the assumptions of the item-parameter invariant models of latent traits may not always
correspond to the psychological properties of items. Thus test difficulty
may depend on the tendency of items to interact in context, as well as on their
individual difficulties.
'Sample-freeness' depends on an item behaving uniformly across the ability
range, but if you estimate the difficulty of an item from a high ability
group - as Rasch says you are entitled to - you can never be sure how the item
will work with the rest of the candidate population or sample. The only way
of finding this out is to test the item on a group of individuals formed by
sampling more or less evenly across the ability range, or, failing that, to
draw a random sample of individuals, just as one would if using the classical
methods of item analysis, which Rasch enthusiasts find wanting.
To my mind, the real importance of item response models lies in the estimation
of abilities. Given a set of items which fit one of the models, individuals'
abilities can be estimated more or less directly from their responses. Since
each item response is weighted according to the discriminating power of the
item, and the Rasch model takes all items to discriminate equally, the Rasch model gives estimates which correlate perfectly with total
test score. Thus, one could say that the Rasch model provides the necessary
logical underpinning for classical analysis, and the use of number correct
score. However, there is more to it than this. If all goes well, these
ability estimates possess certain properties which test scores do not have,
the most spectacular being that Rasch model estimates fall on a ratio scale,
so that statements like 'person X has twice as much ability as person Y' can
be made. To date, this property does not seem to have been much exploited,
although Choppin's (1976) paper is an exception. It is not that there has been
any shortage of people wanting to have a crack at fitting items to the Rasch
model, far from it, so voguish has this model become. The trouble is that,
having run the computer program and obtained results, people tend to be at a
loss as to what to do next. In fact, fitting items to the Rasch model, or
any other, is just the beginning. There remain the tricky problems of
identifying and validating traits so that meaning, which is after all the
central preoccupation, is injected into the measurement. I touch on some of
these problems in Wood (1976(b)).
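To give some substance to the phrase 'estimated more or less directly from their responses', here is a minimal sketch of maximum-likelihood ability estimation under the Rasch model, assuming the item difficulties are already known and calibrated; the Newton-Raphson iteration on the number correct score is one standard way of doing it, not a description of any particular program.

    import math

    def rasch_ability(raw_score, difficulties, tol=1e-6):
        """Maximum-likelihood ability for a number correct score, given Rasch item difficulties."""
        n = len(difficulties)
        assert 0 < raw_score < n, "zero and perfect scores have no finite estimate"
        theta = math.log(raw_score / (n - raw_score))      # crude starting value
        for _ in range(100):
            probs = [1.0 / (1.0 + math.exp(-(theta - b))) for b in difficulties]
            expected = sum(probs)                          # expected score at theta
            information = sum(p * (1 - p) for p in probs)  # test information at theta
            step = (raw_score - expected) / information
            theta += step
            if abs(step) < tol:
                break
        return theta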
As the reader will have gathered, I am sceptical about the efficacy of these
latent trait models. The fact is that, despite heavy promotion, especially in
the case of the Rasch model, they have yet to deliver the goods in terms of
practical utility. For instance, no one has yet demonstrated, to my satisfaction, how an examining board, running a multiple choice pretesting programme,
might profitably use the Rasch model in preference to the classical methods
presently used. My judgement is that, for standard group testing situations,
such as examinations, the gains to be had from these models are not enough to
justify going over wholesale to them. Where they do come into their own is in
connection with individualised, or tailored, testing, or more generally, whenever different students are given different tests and it is necessary to place
all students on the same measurement scale. I shall have something to say
about this in the next chapter.
SUMMARY
1. Item statistics are only useful insofar as they enable us to select items
that lead to test score distributions which are best for a particular application.
2. Used on their own, summary item statistics are not informative enough, and
can give misleading predictions of test score distributions. It is preferable
to inspect the entire item response pattern, even if this is more time-consuming. Item behaviour is best brought out by the characteristic curve, which
bridges the old and the new methods of item analysis.
3. Probabilistic item response models have yet to demonstrate their utility
in respect of routine item analysis programmes, or their capacity to illuminate
behaviour. On the other hand, the Rasch model, in particular, has been a force
for good in that it has made people who might otherwise have remained unenlightened, aware of measurement problems, and of ways of thinking about measurement.
It has also introduced a welcome rigour into what was formerly (and still is) a
jumble of ad hoc practices. My quibbles are those of one who is impatient to
leave the discovery stage behind, and to engage the important question, which
is, "What is being measured?"
7. Item Selection and Test Construction
“Fools rush in where angels fear to tread.”
The idea that a test is fixed in length and duration, comes in a fancy booklet
and is sat by large numbers of candidates is a product of the group testing
ethos which has dominated educational and psychological testing, especially
the former, the best example being public examinations. But when you think
about it, there is no reason, in principle, why a person or group of persons
should not be given different, although not necessarily exclusive, sets of
items. After all, it is an individual who takes a test, not a group. In any
group test, there will be some items that are so easy for some candidates that
they solve them without effort, and also some that are so difficult for others
that they cannot begin to answer them. In an individualised measurement procedure, items are chosen for each individual so as, in Thorndike's (1971)
words, "to define the precise limits of an examinee's competence". Thus group
and individualised testing call for rather different approaches to item selection. I shall deal with both approaches in this chapter although I shall
devote more space to the construction of group tests, since these are still far
and away the most widely used.
Traditionally, group tests have been designed to measure individual differences.
Such tests are known in the trade as norm-referenced tests. Suppose, however,
that the object is not to discriminate between individuals but, instead, to
find out whether they are able to satisfy certain criteria or, in the current
jargon, demonstrate minimal competencies. What is wanted here are criterion-referenced tests, about which so much has been written in the last ten years.
Later in the chapter, I will deal with the construction of these tests, and
also with the construction of tests designed to discriminate between groups of
individuals, interest in which has sprung up as a result of the accountability
drive in the USA. The chapter ends with a section on item analysis computer
programs.
CONSTRUCTING GROUP TESTS
Norm-referenced tests
For tests of attainment, such as GCE and CSE examinations, the measurement
objective is to discriminate maximally between candidates so that they may be
ordered as accurately and precisely as possible. Ideally, the distribution of
test scores should be uniform or rectangular, as drawn below (Fig. 7.1). The
worst possible distribution, given this objective, would be a vertical line,
since this would signify that all candidates had received the same score.
Fig. 7.1 (rectangular distribution of test scores, plotted against score)
Suppose that a test constructor sets out to achieve a rectangular distribution
of test scores. What sort of items are needed? There are two ways of approaching this question, one classical and one modern. According to Scott (1972),
writing in the classical tradition, there is presently no consensus concerning
the best method of obtaining a rectangular score distribution. Everyone
agrees that the correlations between items should be high, although exactly
what the range of values should be is disputed; it is over what difficulty
values items should have that the arguments occur. Scott, himself, having
looked into the matter thoroughly, concluded that maximum discrimination will
result if all items have p values of 0.50 (A = 13), and correlate equally at
around 0.33, where, of course, the p values apply to the relevant candidate
population. If the group being tested were very able, and the p values applied
to a typical population, the recommendation would not work. Naturally, the
distribution of ability in the group being tested affects the score distribution.
It may seem late in the day to bring in the idea of item intercorrelation, but
I deliberately chose not to discuss it in the last chapter, on the grounds that
it is not a statistic that pertains to any one item, rather to pairs of items.
In this sense, it fails to satisfy the Lord and Novick test of a useful item
parameter (see Chapter 6). However, in other respects, it is most useful. For
instance, reliability of the internal consistency variety depends entirely
upon the item intercorrelations, so that, given estimates of the latter,
internal consistency can be estimated. It is high internal consistency which
is reflected in the rectangular distribution; items which measure the same
competence over and over again will rank candidates in the same order and the
equal spacing characteristic of the rectangular distribution will emerge. It
is when item content and demand varies that candidates who have studied differently, and so are different anyway, are able to reach the same score by different routes, and scores pile up in the middle of the score range. One candidate
might know A and B, but not C, another A, but not B and C, and another C only.
This is just what happens in most attainment tests, where item intercorrelations are generally of the order of 0.10 to 0.20, rather than the 0.33 which
Scott suggests is necessary to really flatten out the test score distribution.
I will explain what I believe to be the reason for this later.
As long as total test score is used as the criterion for measuring discrimination, high internal consistency means high item intercorrelations, which, in
turn, mean high discrimination values, and vice-versa. Discrimination provides
the link between classical and modern methods. To achieve a rectangular score
distribution, what is needed are items with characteristic curves like the
one in Fig. 7.2 - steep over nearly the whole of the ability range, and, therefore, highly discriminating.
Fig. 7.2 (item characteristic curve steep over nearly the whole ability range: P plotted against ability)
To predict what the test properties will be, it is only necessary to add together the item characteristic curves to produce a test characteristic curve.
Thus, the result of accumulating curves like the one shown in Fig. 7.2 will be
a test characteristic curve identical to that curve. In practice, of course,
the item characteristic curves would vary in slope, so that the test characteristic curve would not be as steep as that shown above, and the test score distribution would be less flat. A useful set of graphs illustrating the relationship between test characteristic curves and test score distributions is
provided by Wilmut (1975(b), p.7). Since exact or even close matches will be
rare, the test constructor will have to exercise discretion, especially with
the 'poorly-behaved' items which discriminate effectively over one or more
parts of the ability range, but not elsewhere. Sometimes, if the test constructor can find a complementary item which discriminates where the other
does not, the two together should provide highly effective discrimination
across the whole ability range. The idea is demonstrated in Fig. 7.3.
Fig. 7.3 (two complementary item characteristic curves: P plotted against ability)
Incidentally, this example shows the value of having a relaxed view towards
item selection. Probably, neither of these items would fit an item response
model, but they still have something to offer in terms of discrimination.
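Adding curves together is as simple as it sounds: the expected test score at a given ability is just the sum of the item probabilities at that ability. The sketch below assumes logistic item characteristic curves purely for illustration.

    import math

    def test_characteristic(theta, items):
        """Expected test score at ability theta; items is a list of (difficulty, discrimination) pairs."""
        return sum(1.0 / (1.0 + math.exp(-a * (theta - b))) for b, a in items)

    # Two 'complementary' items of the kind sketched in Fig. 7.3: one discriminating
    # low down the ability range, the other higher up.
    pair = [(-1.0, 2.0), (1.0, 2.0)]
    curve = [(theta, test_characteristic(theta, pair)) for theta in (-2, -1, 0, 1, 2)]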
If low intercorrelations and, therefore, discrimination values frustrate
attempts to produce rectangular distributions of scores for multiple choice
attainment tests, should those responsible for putting together tests depart
at all from the optimum item selection strategy? They should not. Whether
or not they are aware of the consequences of low intercorrelations, they
should act as if rectangular distributions were realisable. That is to say,
they should follow the advice most generally given in text books and articles
on multiple choice (see, for instance, Macintosh and Morrison, 1969, p.66-67):
to choose items with facilities between 0.40 and 0.60, or A values between
12 and 14, and with discrimination values (usually biserial r) greater than
0.40, if possible. Only when items are very homogeneous, which means an
average intercorrelation greater than 0.33, should item facilities be distributed more evenly (Henrysson, 1971, p.152-153).
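The conventional rule just quoted amounts to a simple screen which might be sketched as follows; the 0.40-0.60 facility band and the 0.40 discrimination floor are treated as default parameters to be over-ridden on educational grounds, and the figures for the illustrative items are invented.

    def acceptable(facility, biserial, fac_range=(0.40, 0.60), min_disc=0.40):
        """Crude screen implementing the textbook rule of thumb for norm-referenced tests."""
        return fac_range[0] <= facility <= fac_range[1] and biserial >= min_disc

    items = {14: (0.52, 0.47), 27: (0.81, 0.35), 33: (0.45, 0.28)}  # item number: (facility, biserial)
    shortlist = [number for number, (f, r) in items.items() if acceptable(f, r)]  # -> [14]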
I am aware that this advice seems to run counter to common sense.
If all
items are of the same difficulty, they can only measure efficiently those whose
ability level corresponds to the difficulty level. Only if items are distributed across the difficulty range so that everyone has something they can
tackle, can everyone be measured reasonably efficiently. This argument is
impeccable - as far as it goes. The fact is that neither item selection
policy will give the best results, the first because it neglects the most and
least able candidates, and the second because, unless the test is to be grotesquely long, there are too few items at each point of the difficulty/ability
range to provide effective measurement. The equal difficulty strategy is simply the better of two poor alternatives for large candidate populations. In
practice, of course, test constructors like to include a few easy and a few
difficult items, the easy items "in order to help the candidate to relax",
while the difficult items "serve the function of stretching the more able
candidates" (Macintosh and Morrison, 1969) but, in effect, this is just a token
gesture. If candidates at the extremities are to be measured efficiently, what
is needed are tests tailored to their abilities, difficult tests for the
cleverest and easy tests for the dullest. Therein lies the motivation for
developing individualised testing procedures.
Fig. 7.4 Scatter plot of facility against biserial.
(GCE O-level History 'B' Paper 3 - June 1974).
When selecting items in practice, a handy way of displaying the available items
in terms of their statistical characteristics is to plot values of facility or
difficulty against values of the discrimination index, whatever that is. It is
conventional to plot difficulty along the horizontal axis and discrimination up
the vertical axis, with the position of items being signified by the item
number, and also, perhaps, by some coding, like a box or a circle or a colour,
to indicate different item types or content areas. On top of the plot can be
superimposed horizontal and vertical lines indicating the region within which
"acceptable" items are to be found. An example taken from Quinn's (1975)
survey will show what I mean (Fig. 7.4).
The test was a London board Ordinary level History paper, and the joint distribution of difficulty and discrimination values is not unusual. One feature is
the number of items which turned out to be too easy. This is likely to happen
when pretest item statistics are taken at face value, the point being that
candidates often find items easier in the examination proper than in the pretest, due, presumably, to extra motivation and a better state of preparedness.
(The London GCE board holds pretests one to two months before the examinations.)
For this reason, it is advisable to adjust informally pretest item facilities
upwards by five to ten percentage points, so as to get a more accurate idea
of how the items will perform in the operational situation. With discrimination values, there is no such rule-of-thumb, but it is a fairly safe generalisation to say that they too mostly improve from pretest to examination,
partly because there is usually a positive correlation between facility and
discrimination values, so that, as items get easier, discrimination improves,
and, also, because incomplete preparation at the time of the pretests will
tend to elicit answers that candidates would not offer in an examination. It
is, therefore, a good idea to set the lower boundary for discrimination at 0.3
instead of 0.4, as we have done in Fig. 7.4, thus letting in items which, on
the pretest information, look dubious but, in the operational setting, are
likely to turn out to be acceptable. If it is asked why some items should not
be pretested by including them in an examination paper without scoring them,
the answer is that it can be done (although the London board has not done so),
but there are likely to be objections from candidates and teachers that time
spent on the pretest items is time wasted on the operational items.
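The informal adjustments just described can be expressed as a small pre-processing step applied to pretest statistics before the accept/reject boundaries are consulted. The figures - five to ten percentage points on facility, a 0.3 rather than 0.4 discrimination floor - come from the text above; everything else in the sketch is my own framing.

    def adjust_pretest(facility, biserial, facility_shift=0.075):
        """Nudge pretest facility upwards (5-10 points; 7.5 used here) towards its likely operational value."""
        return min(1.0, facility + facility_shift), biserial

    def likely_acceptable(facility, biserial):
        adjusted_facility, adjusted_biserial = adjust_pretest(facility, biserial)
        # Relax the discrimination floor to 0.3 at the pretest stage.
        return 0.40 <= adjusted_facility <= 0.60 and adjusted_biserial >= 0.30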
It cannot be emphasised too often that accept/reject boundaries are not to be
kept to rigidly, if only because of the 'slippage' between pretest and operational values just remarked on. There are educational considerations involved
in item selection and test construction, and these must always be allowed to
over-ride statistical efficiency considerations. For example, even supposing
a rectangular distribution of scores could be produced, it is almost certain
that the items would be too similar to satisfy the content/skill specification,
and less discriminating but more relevant items would have to be introduced to
enrich the test. Generally, however, it is the low discriminating and the
hard items which pose the dilemma of inclusion or non-inclusion. Are they
simply poor items or are they perhaps items based on content which the examiners would like to see taught, but which is meeting resistance among teachers?
The trouble is that sometimes the notion of what ought to be tested (and,
therefore, taught) is not so securely founded as to lead to authoritative
decisions, and an uneasy compromise results, with, perhaps, the statistics
getting the upper hand.
As I have remarked, low discrimination values are the rule rather than the
exception. The reasons are not hard to find. Comprehensive syllabuses containing a variety of material have to be covered, and some sampling is
inevitable. This being so, some teachers will sample one way, some another.
With candidates sampling in their own way during examination preparation, the
net effect is that their knowledge is likely to be 'spotty' and inconsistent.
In these circumstances, one would not expect correlations between items to be
high. Interestingly enough, the two multiple-choice papers in Quinn's (1975)
survey to show the highest average discrimination values were the English
language and the French reading and listening comprehension tests. Whereas
in subjects like Physics and Mathematics the pieces of information tend to be
discrete and unrelated, in reading and listening comprehension there are interconnections and resonances within the language which help to make candidates'
performance more homogeneous. Rowley (1974), who would probably cite this as
another example of 'test-wiseness' (see Chapter 5), did, in fact, report that
'test-wiseness' was more marked on verbal comprehension tests than on quantitative tests.
Fig. 7.5 (score distribution, plotted against score from 0 to 60)
Figure 7.5 shows the score distribution for the History test, the items of
which were displayed in Fig, 7.4. You can see the dome shape, which is common
for the London board multiple-choice tests. This dome becomes a peak if the
discrimination values slip downwards.
ARRANGING ITEMS IN THE TEST FORM
In a conventional group test there is an issue concerning the way items should
be arranged. The received opinion is that items should be arranged from easy
to hard, E-H, the rationale being that anxiety is induced by encountering a
difficult item early in a test, and that the effect persists over time, and
causes candidates to fail items they would have answered correctly had anxiety
not interfered. This seems sensible, but, in practice, does E-H sequencing
make any difference to scores? There has been a string of enquiries which have
found that item and test statistics - difficulty, discrimination, KR20 internal
consistency - were little affected by re-arranging items from random to E-H or
vice-versa (Brenner, 1964; Flaugher, Melton and Myers, 1968; Shoemaker, 1970;
Huck and Bowers, 1972). Perhaps the most thorough enquiry into this issue was
carried out by Munz and Jacobs (1971), who also provide other references. They
concluded that an E-H arrangement did not appear to improve test performance
or reduce test-taking anxiety, as compared to an H-E arrangement, but that it
did leave students with a more positive feeling about the test afterwards
("easier and fairer") than did the H-E arrangement. Their view was that
arranging items according to candidates' perception of item difficulty - subjective item difficulty - constitutes the only justification for the E-H
arrangement. The snag is that the subjective item difficulty of any item will
vary according to the candidate. All the same, I would back the easy to hard
arrangement.
THE INCLINE OF DIFFICULTY CONCEPT
If items were to be arranged in E-H order, and candidates could somehow be
persuaded to stop once items got too difficult for them, a big step would have
been taken towards individualising testing and, of course, to making it more
efficient. This is the notion behind the so-called incline of difficulty concept. It differs in an important respect from the so-called multilevel format
which is used by certain American testing programs, for example, the Iowa Tests
of Basic Skills (Hieronymus and Lindquist, 1971). Whereas, with the ITBS, a
single test booklet covers the whole range of difficulty from, say, easy nine-year old to difficult sixteen-year old, and candidates are advised where to
start the test and where to stop it, in the incline of difficulty set-up, as
presently explicated (see Harrison, 1973), candidates start at the beginning
but are given no instructions as to when to stop, except, as I have said, when
they find the items getting too difficult. The consequence is that the weaker
candidates, perhaps understandably, tend to move on up the incline, sometimes
by leapfrogging, in the hope that they will encounter items which are within
their capabilities. In behaving like this, they will often be justified,
since the existence of interactions between items and individuals - pockets of
knowledge candidates are "not supposed to have" or "should have but don't" means that every candidate is likely to have his own incline of difficulty
which will not correspond to the "official" incline of difficulty. This is
just another way of saying that item statistics, on which the incline of difficulty would be based, can only take us so far in the prediction of individual
behaviour. Nevertheless, the incline of difficulty idea deserves further
researching, if only because any scheme which permits some individualisation
of measurement within the constraints of paper and pencil testing is worth
investigating. The same remark applies to the multilevel format, some of the
problems of which are discussed by Thorndike (1971, p.6).
INDIVIDUALISED TESTING
With fully individualised testing, the idea is to adapt a test to the individual, so that his ability can be assessed accurately and precisely, with as few
items as possible. Given some initial information about an individual's level
of ability, reliable or otherwise, he/she is presented with an item for which
his or her chances of success are reckoned to be 50:50. If the individual
gets the item right, the ability estimate is revised upwards, and he or she
then receives a more difficult item, while a wrong answer means that the ability estimate will be revised downwards, and an easier item will be presented
next. The zeroing-in process continues in see-saw fashion, but with decreasing
movement, until a satisfactory determination of ability is made, where what
constitutes 'satisfactory' has to be defined and, indeed, is an outstanding
technical problem. There are variations on this theme, such as presenting
items in blocks rather than singly, but the basic idea is as I have described
it. A review of my own (Wood, 1973(b)) gives the background and further explanation. The procedure is entirely dependent on latent trait methodology since
classical methods cannot handle the matching of ability level and item difficulty, nor the estimation of ability. Evidently, conventional number correct
scoring would not do, since two individuals could get the same number of items
correct and yet be quite different in ability.
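The zeroing-in procedure can be sketched as below, under the rough assumption that an item whose difficulty matches the current ability estimate gives something like a 50:50 chance of success; the shrinking step size is simply one way of making the see-saw movement decrease, not a recommendation, and answer_item stands for whatever administers and scores an item of the requested difficulty.

    def adaptive_test(answer_item, item_bank, start=0.0, start_step=1.0, n_items=10):
        """Crude adaptive testing loop; item_bank is a list of available item difficulties."""
        theta, step = start, start_step
        for _ in range(n_items):
            # Pick the unused item closest to the current estimate (roughly 50:50 odds of success).
            target = min(item_bank, key=lambda b: abs(b - theta))
            item_bank.remove(target)
            correct = answer_item(target)
            theta += step if correct else -step   # revise the estimate up or down
            step *= 0.7                           # decreasing see-saw movement
        return theta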
By common consent, individualised testing is best conducted in an interactive
computer-assisted set-up, but I am afraid the expense involved in doing so
will be out of the reach of many testing organisations, never mind individual
teachers. Lord (1976(b)) may be right that computer costs will come down but
when he claims that they will come down to the point where computer-based
adaptive testing on a large-scale will be economical, one is bound to ask
"For whom", and I suspect the answer will be organisations like the US Civil
Service Commission (McKillip and Urry, 1976), or the British Army (Killcross,
1974), and almost nobody else. To devote too much space to computer-assisted
adaptive testing would be wrong, anyway, given the scope of this book. Apart
from anything else, those in the van of developments are keen to go beyond
multiple-choice and program the computer to do things not possible with paper
and pencil tests, a point made forcefully by Green (1976), who cites the testing of verbal fluency as "a natural for the computer". More generally, he
makes the point, as I did myself in my review paper, that what is needed now
is more information in addition to the extra efficiency the computer already
supplies.
The Green and Lord papers are the best parts of a useful report which will
enable interested readers to get up to date with developments in the field.
Other worthwhile references are Lumsden (1976, p.275-276) who, like Green,
pulls no punches, and Weiss (1976), the last summing up four years of investigation into computer-assisted adaptive testing at the University of Minnesota.
So far, the nearest realisation of tailored testing in a paper and pencil form
is the flexilevel test (Lord, 1971(a), 1971(b)). In a flexilevel test, the
candidate knows immediately whether or not he got the right answer. He starts
by attempting an item of median difficulty. If correct, he moves to the easiest
item of above median difficulty; if incorrect, he moves to the hardest item of
below median difficulty. The candidate attempts only (N + 1)/2 items in the set,
where N is the total number of items in the test, which has a rectangular distribution of item facilities so as to provide measurement for all abilities.
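The routing rule can be set out as a sketch, assuming the N items are held in a list ordered from easiest to hardest with the median item in the middle; it follows the description above, so each candidate answers (N + 1)/2 items, and answer_item stands for whatever presents and scores an item.

    def flexilevel_route(answer_item, items_easy_to_hard):
        """Administer a flexilevel test; items_easy_to_hard is ordered easiest first."""
        n = len(items_easy_to_hard)                 # N is assumed odd, median item in the middle
        mid = n // 2
        easier = list(range(mid - 1, -1, -1))       # hardest of the easier items first
        harder = list(range(mid + 1, n))            # easiest of the harder items first
        score, position = 0, mid
        for _ in range((n + 1) // 2):
            correct = answer_item(items_easy_to_hard[position])
            score += correct
            if correct:
                position = harder.pop(0) if harder else easier.pop(0)
            else:
                position = easier.pop(0) if easier else harder.pop(0)
        return score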
In practice, the routing can be arranged in a number of ways. I myself (Wood,
1969) invited candidates to remove some opaque masking material corresponding
to the chosen response, in order to reveal the number of the item they should
tackle next. Other devices call for the candidate to rub out the masking
material with a special rubber. The Ford Motor Company (1972), for instance,
have used a scheme like this for testing registered technicians' skills,
except that there the emphasis was on remedial activity - wrong answers uncover
messages which enable the candidate to rectify his mistakes. This, of course,
is the notion behind programmed learning, to which, of course, tailored testing bears a resemblance.
In terms of statistical efficiency, Lord found that near the middle of the
ability range for which the test was designed, the flexilevel test was slightly
less effective than a conventional test composed of items with facilities
around 0.50, and with discrimination values as high as possible. Towards the
extremes of the ability range however, the flexilevel test produced more
precise measurements, a result one would expect, since it is the reason for
adopting these individualised tests in the first place. Weiss (1976, p.4),
however, is less kind towards the flexilevel test, and maintains that it offers
little improvement over the conventional test, besides being likely to induce
undesirable psychological resistance as a result of the branching strategy.
By this he means, I think, that candidates are not happy taking different
routes which effectively differentiate them, and also that they may experience
some difficulty following the routing instructions. These objections remain
to be substantiated.
It should be remembered that Lord's results stemmed from computer simulations,
and Weiss's from computer-assisted item administration. As far as I know,
there have been no thorough-going experiments in the paper and pencil mode;
my own rough and ready exercise (Wood, 1969) must be discounted. I am not
advocating that there should be a spate of such experiments, but it would be
good to have one or two. A strong point in favour of flexilevel testing is
that even though candidates are obliged to do items of different difficulties,
the number right score turns out to be an excellent estimate of ability
(Lumsden, 1976).
If the individualising of testing is to work properly so that the pay-off is
delivered, it is clear that much depends on the accuracy and precision of the
calibration of test items. With group tests, the existence of error in the
estimation of item parameters is not so critical; we are working with wider
margins of error, and are not expecting so much from the measurement. Unfortunately, the calibration of items is something of a grey area in measurement.
Certainly, the kind of casual calibration of items using "grab" groups which
certain Rasch model enthusiasts recommend, is not adequate (for more on this
see Wood (1976(b))).
There are three other problems pertaining to individualised testing I should
like to draw attention to. I have already remarked on the likely existence
of what I called item-individual interactions which result in outcomes that
are "not supposed to happen". If these effects can interfere with incline of
difficulty arrangements, they can certainly throw out and maybe even sabotage
individualised branching procedures. Furthermore, and this harks back to the
earlier introduction of the idea of subjective item difficulty, we do not know
yet whether or not difficulty levels appropriate to each individual's ability
level are the best ones for keeping motivation high and anxiety and frustration
low, although there is reason to expect that they will, at least, inhibit
anxiety (Betz, 1975).
The second problem, which is more important, concerns the meaning of the measurement. In attainment tests, where testing is usually based on a sampling
of a comprehensive syllabus, the point is to discover how much of that sampling
individuals can master so that appropriate generalisations can be made. Where
the decisive element is the difficulty of the item, as it is in individualised
testing, an item sequence presented to a particular individual may bear little
resemblance to the sample content specification; the items could even be very
similar in kind, although this is unlikely to happen in practice. The issue
at stake here is that of defining domains or universes of items, each of which
contains homogeneous items of the same description, and each of which is sampled according to some scheme (see Chapter 4). On the face of it, there is no
reason why this model should not apply in the context of, say, Ordinary level
examination papers, which, after all, are supposedly constructed from a specification grid, each cell of which could be said to form an item universe.
Unfortunately, there are severe demarcation issues over what constitutes a
cell, especially if one of the defining categories in the grid is Bloom's
taxonomy or a version of it (see Chapter 2). It is also true that systematic
universe sampling, using the flexilevel technique, would require far more
items than would be required for the conventional heterogeneous group test.
Finally, there is the reporting problem. Suppose you can measure individuals
on umpteen universes - knowledge of words, adverbs, adjectives, pronouns and
what have you - what do you do with the results? Profiles would be so unwieldy
as to be meaningless. I am afraid that with heterogeneous domains such as we
usually have to deal with in examinations like GCE Ordinary and Advanced level,
we can do no better than summary statements about achievement, even if it means
degraded measurement.
TESTING FOR OTHER THAN INDIVIDUAL DIFFERENCES
Criterion-referenced tests
The discussion so far has been about tests of individual differences, and how
to construct them. Because these tests are designed to measure a person in
relation to a normative group, they have been labelled norm-referenced tests.
They may be contrasted with criterion-referenced tests, which are designed "to
yield measurements that are directly interpretable in terms of specified performance standards" (Glaser and Nitko, 1971, p.653). In practice the differences between the two kinds of test may be more apparent than real, as I have
tried to explain elsewhere (Wood, 1976(c)), but there is no denying that there
is a fundamental difference in function and purpose. Carver (1974) points out
that all tests, to a certain extent, reflect both between-individual differences and within-individual growth, but that most tests will do a better job
in one area than another. The first element he proposes, reasonably enough,
to call the psychometric element or dimension, while, for the second dimension,
he coins the term edumetric. A test may be evaluated along either dimension.
Aptitude tests and, to a lesser extent, examinations focus on the psychometric
dimension, while teacher-made tests usually focus more on the edumetric dimension, a statement that applies generally to criterion-referenced tests. If
McClelland (1973) is right that schools should be testing for competence rather
than ability, and I think he is, teachers should be using criterion-referenced
tests rather than norm-referenced tests.
Much more could be said about criterion-referenced tests, but the above will
serve to set the stage for a discussion of the item selection and test construction procedures which are appropriate for these tests when they consist of
multiple-choice items. First, it must be noted that the value of pretesting
and item analysis is disputed by the more doctrinaire advocates of criterion-referenced testing (CRT) who argue that since the items are generated directly
from the item forms which represent objectives (see Chapter 4), the calculation
of discrimination indices and subsequent manipulations are irrelevant (Osburn,
1968). The items generated go straight into a test, and that is that. Someone
who believed fervently in the capacities of item writers could, of course, take
the same fundamental line. This position seems altogether too extreme; one is
bound to agree with Henrysson and Wedman (1974) that there will always be
subjective and uncertain elements in the formulation of objectives and, therefore, in the production of items which will render criterion-referenced tests
less than perfect.
If item analysis has a part to play in the construction of CRT's, the question
is, 'What kind of item analysis?'
Evidently, it must be different from conventional, psychometric item analysis. According to the usual conventions for
norm-referenced tests, items that everyone tends to get right or everyone tends
to get wrong are bound to have low discrimination values and will, therefore,
be discarded, but they might be just what the CRT person wants. Lewis (1974),
for example, argues that items with facilities as near as possible to 100 per
cent should be favoured above all others. Exactly what the point of giving
such a test would be, when it was known in advance that nearly everyone was
going to get nearly everything right, defeats me. Surely, it would be much
more in keeping with the spirit of CRT to set a test comprising items of 50 or
70 per cent facility, and then see how many of the groups in question could
score 100 per cent.
The orthodox CRT practitioner regards items which discriminate strongly between
individuals as of no use to him. Brennan (1972) has maintained that what is
wanted are items with high facilities and with "non-significant" item-test
correlations. Items that discriminate positively "usually indicate a need for
revision". Whether Brennan is correct or not is beside the point because discrimination indices like the biserial should not be used in the first place.
If the idea is to find items which are sensitive to changes within individuals,
then it is necessary to test items out on groups before and after they have
received instruction. Items showing little or no difference, indicating insensitivity to learning, would then be discarded. The best edumetric items,
according to Carver (1974), are those which have p values approaching 0 prior
to instruction, and p values approaching 1 subsequent to instruction. Various
refinements of this simple difference measure have been proposed (see Henrysson
and Wedman, 1974 for details). The most useful is an adjustment which takes
into account the fact that the significance of a difference varies according
to where on the percentage scale it occurs, the formula for the resulting
statistic being:
    (Pposttest - Ppretest) / (1 - Ppretest)
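The adjusted index expresses the observed gain as a fraction of the gain that was possible; a minimal sketch (the function name is mine):

    def gain_index(p_pre, p_post):
        """(Pposttest - Ppretest) / (1 - Ppretest): observed gain as a share of the possible gain."""
        if p_pre >= 1.0:
            return 0.0  # no gain was possible
        return (p_post - p_pre) / (1.0 - p_pre)

    # An item moving from 0.20 before instruction to 0.80 after scores 0.75;
    # the same raw difference of 0.60 starting from 0.40 scores 1.00.
    print(gain_index(0.20, 0.80), gain_index(0.40, 1.00))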
But will teachers, who, after all, are meant to be the prime users, be bothered
to go to the lengths of administering pretests and post tests, and then selecting items? One could say the same of conventional item analysis, of course,
indeed I have always thought that item analysis in the classroom was something
more talked about than practised. With CRT, however, the procedure is so much
more cumbersome that the pay-off seems hardly worthwhile. "One might argue
that the teacher's time could be better spent in other areas of the instructional process" writes Crehan (1974), and it is hard to disagree. Besides,
there is something improper about a teacher giving his students items before
they have had the opportunity to master the relevant subject matter. In these
circumstances, there is a real possibility that some students will be demoralised before they start. My belief is that the teacher is better advised to
rely on his own intuition and everyday observation, rather than engage in
statistical exercises. Above all, CRT should be informal in conception and
execution, and there is no purpose served in decking it out with elaborate
statistical trappings.
CHOOSING ITEMS TO DISCRIMINATE BETWEEN GROUPS
Not only may items be chosen to discriminate between and within individuals,
but also between groups of individuals. The practical importance of such a
measure lies in the evaluation of teaching programmes or instructional success,
an issue gaining increasing attention these days, particularly in the USA.
Suppose a number of classes within a school have been taught the same material,
and it is desired to set all the class members a test to find out which class
has learnt most. Lewy (1973) has shown that items which differentiate within
classes will not necessarily register differences between classes. This is
what one would expect, given that the basic units of observation - the individual score and the class average - are so different. For item selection to
differentiate between classes, the appropriate discrimination index is the
intraclass correlation (for details see Lewy's paper, also a paper by Rakow).
Using indices like the biserial will most likely result in tests which
are not sensitive to differences between class performance. Much of the criticism levelled at American studies which have claimed that school makes little
or no difference to achievement, like those of Coleman et al (1966) and Jencks
et al (1972), has hinged on the fact that norm-referenced tests constructed
according to the usual rules were used to make the measurements, whereas what
should have been used were tests built to reflect differences between school
performances (Madaus, Kellaghan and Rakow, 1975).
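For those who want to see the calculation, here is a rough sketch of a one-way analysis-of-variance intraclass correlation applied to item scores grouped by class; Lewy's own formulation may differ in detail, and the treatment of unequal class sizes below is a simplification.

    def intraclass_corr(groups):
        """One-way ANOVA intraclass correlation; groups is a list of lists of 0/1 item scores, one list per class."""
        k = len(groups)
        sizes = [len(g) for g in groups]
        n_total = sum(sizes)
        grand_mean = sum(sum(g) for g in groups) / n_total
        means = [sum(g) / len(g) for g in groups]
        ss_between = sum(n * (m - grand_mean) ** 2 for n, m in zip(sizes, means))
        ss_within = sum(sum((x - m) ** 2 for x in g) for g, m in zip(groups, means))
        ms_between = ss_between / (k - 1)
        ms_within = ss_within / (n_total - k)
        n_bar = n_total / k                      # average class size (a simplification for unequal classes)
        return (ms_between - ms_within) / (ms_between + (n_bar - 1) * ms_within)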
My own experience with using the intraclass correlation on item response data
from achievement tests - after the event, of course - has been that the highest values occur with items on topics that are either new to the syllabus or
are controversial.
If, as seems likely, these topics are taken up by only
some teachers, the effect will be to create a possibly spurious impression of
greater between-school variability than really exists. I have also found that
assertion-reason items show less variation between schools than items of any
other type, an outcome I interpret as evidence that assertion-reason items
measure ability more than competence, to borrow McClelland's distinction again
(see also the section on assertion-reason items in Chapter 3). The fact that
simple multiple choice items, which probably give the "purest" measures of
competence, showed greatest variation between schools in my analysis, tends
to support me in my view.
COMPUTER PROGRAMS FOR ITEM ANALYSIS
Compiling even an abbreviated list of available item analysis programs is not
an easy task. That is why I was pleased by the appearance recently of a paper
(Schnittjer and Cartledge (SC), 1976), which provides a comparative evaluation
of five programs originating in the USA. The coverage is not comprehensive,
nor does it pretend to be, and descriptions of many other programs of varying
scope can be found in the 'Computer Programs' section of Educational and
Psychological Measurement, which appears in every other issue. One program I
would have expected to find in the SC paper is FORTAP (Baker and Martin, 1969).
Among other features, it supplies estimates of item parameters for the distractors as well as for the key. It was also, I believe, the first commercially available program to provide estimates of the item parameters for the
normal ogive item response model, predating LOGOG (Kolakowski and Bock, 1974) and LOGIST
(Wingersky and Lord, 1973). As far as the Rasch model is concerned, the SC
paper describes MESAMAX, developed at the University of Chicago. There is
also Choppin's (1974(b)) program, which is based on his own treatment of the
Rasch model (Choppin, 1968). One program the SC paper could not be expected
to mention is one developed recently in the University of London School Examinations Department (Wilson and Wood, 1976). It will be known as TSFA, and,
among other things, will serve as a successor to the Chicago program
dealt with in the SC paper. The output for every item resembles that in
Table 6.1. Basic sample statistics and item parameter estimates are given for
main test and subtests, using the relevant test scores and/or optimal external
criterion scores. Sample statistics and item parameter estimates for different
subsets of persons can also be obtained. Further options allow the user to
plot test score/criterion biserials or point biserials against item difficulties or facilities for main test and subtests. For the large version only,
tetrachoric correlations between items can be obtained on request, and the
correlation matrix used as input for a factor analysis.
SUMMARY
1. Three types of test are identified: tests to measure individual differences,
tests to measure differences within individuals and tests to measure differences between groups. Within the first type, group tests are distinguished
from individualised tests. The different kinds of item analysis and test
selection appropriate to each are discussed. In the case of group tests,
the classical and modern approaches to test selection are contrasted.
2. The recommendation for designing group tests for large candidate populations is to choose items with difficulty levels around 0.50, and with discrimination values as high as possible, consistent with educational considerations.
Those who find this advice hard to understand might reflect that to provide
enough items at points on the difficulty range so as to secure the same efficiency of measurement across the ability range, and not just in the middle, would
mean an impossibly long test. In these circumstances the equal difficulty
strategy gives a closer representation of the true ordering of candidates than
does spreading the same number of items across the difficulty range, but the
real answer to the problem is to be found in individualised testing.
3. When selecting items for group tests, plotting difficulty and discrimination values against each other gives the test constructor a good idea of the
statistical characteristics of the available items. It should be remembered
that pretest difficulty and discrimination values are apt to be underestimates
of the actual examination values so that it is wise to make some allowance
for this when choosing items. It is also advisable not to take accept/reject
borderlines too seriously. There is nothing special about a biserial value of
0.30 or 0.40; what matters is to fill the test with items which can be justified on educational grounds, not always an easy thing to do.
4. Discrimination values for achievement tests are generally on the low side,
especially for subjects like Physics, where candidates' knowledge seems to be
spotty and inconsistent, causing low correlations between items. For reading
and listening comprehension tests the discrimination values are somewhat higher
suggesting that candidates are able to deal with the material in a more consistent fashion, due, perhaps, to connections within the language.
5. Presenting items in an easy-to-hard sequence seems to be the most congenial arrangement for candidates.
6. The incline of difficulty and multilevel concepts promise some individualisation within the group testing framework.
7. Fully individualised testing can only really be carried out with the help
of a computer. Flexilevel testing is the nearest approximation we have in
paper and pencil form, and it would be worth a further look under realistic
conditions.
8. The item analysis and test selection procedures appropriate for criterion-referenced tests should obviously be different in kind from those used in psychometric work, if they are necessary at all, as some believe. Ideas vary
as to what measures should be used, although there is general agreement that
the difference between performance prior to instruction and performance after
instruction is the critical factor. Whether teachers will be willing to
indulge in item analysis for criterion-referenced tests, given the work it
entails, is a moot point. (A simple pre/post difference index of this kind is sketched after the summary.)
9. Interest in selecting items which will discriminate between groups rather
than individuals seems to be growing. Much of the criticism levelled at
American studies which have claimed that school makes little or no difference
to achievement has hinged on the fact that norm-referenced tests constructed
according to the usual rules were used to make the measurement, whereas what
should have been used were tests built to reflect differences in performance between schools. Just as with criterion-referenced testing, a different set of
item analysis and test selection procedures is necessary for between-group
testing. In this case, the appropriate statistic is the intraclass correlation. It should be noted that between-group variability can sometimes appear greater than it is simply because the material on which an item is based has not been taught in some schools. (A sketch of the intraclass correlation computation also follows this summary.)
10. References to the more comprehensive item analysis computer programs are
given.
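As promised in point 3, the following sketch shows what a mechanical first pass over pretest statistics might look like when assembling a norm-referenced group test. It is written in Python for illustration only; the facility band, the biserial cut-off, the allowance for pretest values understating examination values, and the item data are all assumed figures, not recommended standards.

    def select_items(items, facility_band=(0.30, 0.70), min_biserial=0.30, allowance=0.05):
        # items: list of (label, pretest facility, pretest biserial) tuples
        chosen = []
        for label, facility, biserial in items:
            # pretest discriminations tend to understate examination values,
            # so give each item the benefit of a small allowance
            if facility_band[0] <= facility <= facility_band[1] and biserial + allowance >= min_biserial:
                chosen.append(label)
        return chosen

    pretest = [("Q1", 0.52, 0.41), ("Q2", 0.18, 0.35), ("Q3", 0.61, 0.27),
               ("Q4", 0.47, 0.22), ("Q5", 0.70, 0.33)]
    print(select_items(pretest))   # ['Q1', 'Q3', 'Q5']

The point of the sketch is what it leaves out: items the filter rejects, such as Q4 here, can still earn a place on educational grounds, and nothing hangs on the particular cut-offs chosen.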
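Points 8 and 9 refer to two statistics which replace the usual discrimination indices once the purpose is no longer to rank individuals: a pre/post instruction difference index for criterion-referenced items, and the intraclass correlation for items intended to separate schools. The sketch below uses invented data and the one-way analysis-of-variance form of the intraclass correlation; it is one plausible way of computing these quantities, not a prescribed procedure.

    import numpy as np

    def difference_index(pre, post):
        # proportion correct after instruction minus proportion correct before
        return np.mean(post) - np.mean(pre)

    def intraclass_correlation(groups):
        # groups: one array of 0/1 item responses per school
        k = len(groups)
        sizes = np.array([len(g) for g in groups], dtype=float)
        means = np.array([np.mean(g) for g in groups])
        grand = np.concatenate(groups).mean()
        ss_between = np.sum(sizes * (means - grand) ** 2)
        ss_within = sum(np.sum((np.asarray(g) - m) ** 2) for g, m in zip(groups, means))
        ms_between = ss_between / (k - 1)
        ms_within = ss_within / (sizes.sum() - k)
        n0 = (sizes.sum() - np.sum(sizes ** 2) / sizes.sum()) / (k - 1)   # effective group size
        return (ms_between - ms_within) / (ms_between + (n0 - 1) * ms_within)

    print(difference_index([0, 0, 1, 0], [1, 1, 1, 0]))        # 0.5
    print(intraclass_correlation([np.array([1, 1, 1, 0]),      # school A
                                  np.array([0, 0, 1, 0]),      # school B
                                  np.array([1, 0, 0, 0])]))    # school C

A large difference index marks an item that responds to instruction; a large intraclass correlation marks an item whose variation lies chiefly between schools rather than within them, which, as point 9 warns, may simply mean the material was never taught in some of them.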
Acknowledgements
I am grateful, above all, to Andrew Harrison whose challenging comments
on the manuscript helped me enormously.
I am grateful also to the
editors, especially Bruce Choppin, for suggesting improvements and for
their support, and to my colleague Keith Davidson for discussing the
manuscript with me. The Secretary of the London GCE board, A.R. Stephenson,
has encouraged me in the writing of this book and I would like to thank
him too.
Permission to reproduce test items has been given by the University of
London, the Test Development and Research Unit of the Oxford Delegacy
of Local Examinations, the University of Cambridge Local Examinations
Syndicate, the Oxford and Cambridge Schools Examination Board and the
Educational Testing Service, Princeton, New Jersey.
R. Wood
References
Ace, M.C. & Dawis, R.V. (1973), Item structure as a determinant of item difficulty in verbal analogies, Educ. Psychol. Measmt. 33, 143-149.
Aiken, L.R. (1966), Another look at weighting test items, Jour. Educ. Measmt.
3, 183-185.
Alker, H.A., Carlson, J.A. & Hermann, M.G. (1969), Multiple-choice questions
and student characteristics, Jour. Educ. Psychol. 60, 231-243.
Anastasi, A. (1967), Psychology, psychologists and psychological testing,
Amer. Psychol. 22, 297-306.
Anastasi, A. (1970), On the formation of psychological traits, Amer. Psychol.
25, 899-910.
Anderson, R.C. (1972), How to construct achievement tests to assess comprehension, Rev. Educ. Res. 42, 145-170.
Ashford, T.A. (1972), A brief history of objective tests, Jour. Chem. Educ.
49, 420-423.
Baker, F.B. & Martin, T.J. (1969), FORTAP: A Fortran test analysis package,
Educ. Psychol. Measmt. 29, 159-164.
Barzun, J. (1959), The House of Intellect, Harper and Row, New York.
Bayer, D.H. (1971), Effect of test instructions, test anxiety, defensiveness
and confidence in judgement on guessing behaviour in multiple-choice test
situations, Psychol. Sch. 8, 208-215.
Beeson, R.O. (1973), Immediate knowledge of results and test performance,
Jour. Educ. Res. 66, 224-226.
Berglund, G.W. (1969), Effect of knowledge of results on retention, Psychol.
Sch. 6, 420-421.
Betz, N.W. (1975), Prospects: New types of information and psychological
implications. In Computerised Adaptive Trait Measurement: Problems and
Prospects, Psychometric Methods Program, University of Minnesota.
Betz, N.E. & Weiss, D.J. (1976(a)), Effects of immediate knowledge of results
and adaptive testing on ability test performance, Research Report, 76-3,
Psychometric Methods Program, University of Minnesota.
Betz, N.E. & Weiss, D.J. (1976(b)), Psychological effects of immediate knowledge of results and adaptive ability testing, Research Report 76-4, Psychometric Methods Program, University of Minnesota.
Binyon, M. (1976), Concern mounts at fall in writing standards, Times Educ.
Suppl. February 13th.
Bishop, A.J., Knapp, T.R. & MacIntyre, D.I. (1969), A comparison of the results
of open-ended and multiple-choice versions of a mathematics test, Int. Jour.
Educ. Sci. 3, 147-154.
Board, C & Whitney, D.R. (1972), The effect of selected poor item-writing
practices on test difficulty, reliability and validity, Jour. Educ. Measmt.
9, 225-233.
Boldt, R.F. (1974), An approximately reproducing scoring scheme that aligns
random response and omission, Educ. Psychol. Measmt. 34, 57-61.
Bormuth, J. (1970), On the Theory of Achievement Test Items, University of
Chicago Press, Chicago.
Bowers, J. (1972), A note on comparing r-biserial and r-point biserial, Educ.
Psychol. Measmt, 32, 771-775.
Bracht, G.H. & Hopkins, K.D. (1970), The communality of essay and objective
tests of academic achievement, Educ. Psychol. Measmt, 30, 359-364.
Brennan, R.L. (1972), A generalised upper-lower item discrimination index,
Educ. Psychol. Measmt, 32, 289-303.
Brenner, M.H. (1964), Test difficulty, reliability, and discrimination as
functions of item difficulty order, Jour. Appl. Psychol. 48, 98-100.
Britton, J., Burgess, T., Martin, N., McLeod, A. & Rosen, H. (1975), The Development of Writing Abilities (11-18), Schools Council Research Studies,
Macmillan Education, London.
Brown, J. (1966), Objective Tests; Their Construction and Analysis: A Practical
Handbook for Teachers, Longmans, London.
Brown, J. (Ed.), (1976), Recall and Recognition, Wiley, London.
Carver, R.P. (1974), Two dimensions of tests: Psychometric and edumetric,
Amer. Psychol. 29, 512-518.
Choppin, B.H. (1968), An item bank using sample-free calibration, Nature, 219,
870-872.
Choppin, B.H. (1974(a)), The Correction for Guessing on Objective Tests, IEA
Monograph Studies, No. 4, Stockholm.
Choppin, B.H. (1974(b)), Rasch/Choppin pairwise analysis: Express calibration
by pair-X. National Foundation for Educational Research, Slough.
Choppin, B.H. (1975), Guessing the answer on objective tests, Brit. Jour. Educ.
Psychol. 45, 206-213.
Choppin, B.H. (1976), Recent developments in item banking: A review, In
de Gruitjer, D.N.M. & van der Kamp, L.J. Th., (Eds.),
Advances in Psychological and Educational Measurement. John Wiley, London.
Choppin, B.H. & Purves, A.C. (1969), Comparison of open-ended and multiple-choice items dealing with literary understanding, Res. Teach. Eng. 3,
15-24.
Choppin, B.H. & Orr, L. (1976), Aptitude Testing at Eighteen-Plus, National
Foundation for Educational Research, Slough.
Coleman, J.S. et al (1966), Equality of Educational Opportunity, Office of
Education, US Dept. of Health, Education and Welfare, Washington.
College Entrance Examination Board (1976), About the SAT - 1976-77. New York.
Collet, L.S. (1971), Elimination scoring: An empirical evaluation, Jour. Educ.
Measmt. 8, 209-214.
Connaughton, I.M. & Skurnik, L.S. (1969), The comparative effectiveness of
several short-cut item analysis procedures, Brit. Jour. Educ. Psychol. 39,
225-232.
Coombs, C.H., Milholland, J.E. & Womer, F.B. (1956), The assessment of partial
knowledge, Educ. Psychol. Measmt. 16, 13-37.
Copeland, D.A. (1972), Should chemistry students change answers on multiple-choice tests? Jour. Chem. Educ. 49, 258.
Corbluth, J. (1975), A functional analysis of multiple-choice questions for
reading comprehension, Eng. Lang. Teach. Jour. 29, 164-173.
Costin, F. (1970), The optimal number of alternatives in multiple-choice
achievement tests: Some empirical evidence for a mathematical proof,
Educ. Psychol. Measmt. 30, 353-358.
Costin, F. (1972), Three-choice versus four-choice items: Implications for
reliability and validity of objective achievement tests, Educ. Psychol.
Measmt. 32, 1035-1038.
Crehan, K.D. (1974), Item analysis for teacher-made mastery tests, Jour. Educ.
Measmt. 11, 255-262.
Cronbach, L.J. (1970), Validation of educational measures. In Proceedings of
the 1969 Invitational Conference on Testing Problems, Educational Testing
Service, Princeton.
Cross, M. (1972), The use of objective tests in government examinations, Vocational Aspect. 24, 133-139.
Cureton, E.E. (1971), Reliability of multiple-choice tests in the proportion
of variance which is true variance, Educ. Psychol. Measmt. 31, 827-829.
D'Agostino, R.B. & Cureton, E.E. (1975), The 27 percent rule revisited, Educ. Psychol. Measmt. 35, 41-50.
Dalrymple-Alford, E.C. (1970), A model for assessing multiple-choice test
performance, Brit. Jour. Math. Stat. Psychol. 23, 199-203.
Das Gupta, S. (1960), Point biserial correlation coefficient and its generalisation, Psychometrika, 25, 393-408.
Davidson, K. (1974), Objective text, The Use of English, 26, 12-78.
De Finetti, B. (1965), Methods for discriminating levels of partial knowledge concerning a test item, Brit. Jour. Math. Stat. Psychol. 18, 87-123.
De Landsheere, V. (1977), On defining educational objectives, Evaluation in Education: International Progress, 1, 2, Pergamon Press.
Diamond, J.J. & Evans, W.J. (1972), An investigation of the cognitive correlates of test-wiseness, Jour. Educ. Measmt. 9, 145-150.
Diamond, J.J. & Evans, W.J. (1973), The correction for guessing, Rev. Educ.
Res. 43, 181-192.
Donlon, T.F. (1971), Whose zoo? Fry's orangoutang score revisited, Read. Teach. 25, 7-10.
Driver, R. (1975), The name of the game. Sch. Sci.Rev. 56, 800-805.
Dudley, H.A.F. (1973), Multiple-choice tests, Lancet. 2, 195.
Dudycha, A.L. & Carpenter, J.B. (1973), Effects of item format on item discrimination and difficulty, Jour. Appl. Psychol. 58, 116-121.
Dunn, T.F. & Goldstein, L.G. (1959), Test difficulty, validity and reliability
as functions of selected multiple-choice item construction principles,
Educ. Psychol. Measmt. 19, 171-179.
Ebel, R.L. (1969), Expected reliability as a function of choices per item,
Educ. Psychol. Measmt. 29, 565-570.
Ebel, R.L. (1970), The case for true-false test items, School Rev. 78, 373-390.
Ebel, R.L. (1971), How to write true-false test items, Educ. Psychol. Measmt.
31, 417-426.
Echternacht, G.J. (1972), Use of confidence weighting in objective tests, Rev.
Educ. Res. 42, 217-236.
Echternacht, G.J. (1976), Reliability and validity of item option weighting
schemes, Educ. Psychol. Measmt. 36, 301-310.
Echternacht, G.J., Boldt, R.F. & Sellman, W.S. (1972), Personality influences
on confidence test scores, Jour. Educ. Measmt. 9, 235-241.
Eklund, H. (1968), Multiple Choice and Retention, Almqvist and Wiksells,
Uppsala.
Evans, R.M. & Misfeldt, K. (1974), Effect of self-scoring procedures on test
reliability, Percept. Mot. Skills. 38, 1246.
Fairbrother, R. (1975), The reliability of teachers' judgements of the
abilities being tested by multiple choice items, Educ. Res. 17, 202-210.
Farrington, B. (1975), What is knowing a language? Some considerations
arising from an Advanced level multiple-choice test in French, Modern
Languages. 56, 10-17.
Fiske, D.W. (1968), Items and persons: Formal duals and psychological differences, Mult. Behav. Res. 3, 393-402.
Flaugher, R.L., Melton, R.S. & Myers, C.T. (1968), Item rearrangement under
typical test conditions, Educ. Psychol. Measmt. 28, 813-824.
Foote, R. & Belinky, C. (1972), It pays to switch? Consequences of changing
answers on multiple-choice examinations, Psychol. Reps. 31, 667-673.
Ford Motor Company, (1972), Registered Technician Program, Bulletin 18, Brentwood, Essex.
Forrest, R. (1975), Objective examinations and the teaching of English, Eng.
Lang. Teach. Jour. 29, 240-246.
Fremer, J. & Anastasio, E. (1969), Computer-assisted item writing - I (Spelling
items), Jour. Educ. Measmt. 6, 69-74.
Frisbie, D.A. (1973), Multiple-choice versus true-false: A comparison of
reliabilities and concurrent validities, Jour. Educ. Measmt. 10, 297-304.
Fry, E. (1971), The orangoutang score, Read. Teach. 24, 360-362.
Gage, N.L. & Damrin, D.E. (1950), Reliability, homogeneity and number of
choices, Jour. Educ. Psychol. 41, 385-404.
Gagné, R.M. (1970(a)), Instructional variables and learning outcomes, In
M.C. Wittrock & D.E. Wiley (Eds.), The Evaluation of Instruction: Issues and
Problems, Holt, Rinehart and Winston, New York.
Gagné, R.M. (1970(b)), The Conditions of Learning, Holt, Rinehart and Winston,
New York.
Gilman, D.A. & Ferry, P. (1972), Increasing test reliability through self-scoring procedures, Jour. Educ. Measmt. 9, 205-208.
Glaser, R. & Nitko, A.J. (1971), Measurement in learning and instruction, In
Thorndike, R.L. (Ed.), Educational Measurement. American Council on Education
Washington.
Glass, G.V. (1966), Note on rank-biserial correlation, Educ. Psychol. Measmt.
26, 623-631.
Glass, G.V. & Stanley, J.C. (1970), Statistical Methods in Education and
Psychology, Prentice-Hall, Englewood Cliffs, N.J.
Green, B.F. (1976), Invited discussion, In Proceedings of the First Conference
on Computerised Adaptive Testing, U.S. Civil Service Commission, Washington.
Grier, J.B. (1975), The number of alternatives for optimum test reliability,
Jour. Educ. Measmt, 12, 109-112.
Gritten, F. & Johnson, D.M. (1941), Individual differences in judging multiple-choice questions, Jour. Educ. Psychol. 30, 423-430.
Guilford, J.P. (1954), Psychometric Methods, McGraw-Hill.
Guttman, L. (1941), The quantification of a class of attributes: A theory and
method of scale construction, In Horst, P. (Ed.), The Prediction of Personal
Adjustment, Social Science Research Council, New York.
Guttman, L. (1970), Integration of test design and analysis, In Proceedings of
the 1969 Invitational Conference on Testing Problems, Educational Testing
Service, Princeton.
Guttman, L. & Schlesinger, I.M. (1967), Systematic construction of distracters
for ability and achievement testing, Educ. Psychol. Measmt. 27, 569-580.
Hales, L.W. (1972), Method of obtaining the index of discrimination for item
selection and selected test characteristics: A comparative study, Educ.
Psychol. Measmt. 32, 929-937.
Hamilton, E.R. (1929), The Art of Interrogation, Kegan Paul, London.
Hanna, G.S. & Owens, R.E. (1973), Incremental validity of confidence weighting
of items, Calif. Jour. Educ. Res. 24, 165-168.
Hansen, R. (1971), The influence of variables other than knowledge on probabilistic tests, Jour. Educ. Measmt. 8, 9-14.
Handy, J. & Johnstone, A.H. (1973), How students reason in objective tests,
Educ. in Chem. 10, 99-100.
Harrison, A.W. (1973), Incline of difficulty experiment in French - Stages 1
and 2, Unpublished manuscript, Associated Examining Board, Aldershot.
Heim, A.W. & Watts, K.P. (1967), An experiment on multiple-choice versus open-ended answering in a vocabulary test, Brit. Jour. Educ. Psychol. 37,
339-346.
Hendrickson, G.F. (1971), The effect of differential option weighting on
multiple-choice tests, Jour. Educ. Measmt. 8, 291-296.
Henrysson, S. (1971), Gathering, analysing and using data on test items, In
Thorndike, R.L. (Ed.), Educational Measurement, American Council on Education
Washington.
Henrysson, S. & Wedman, I. (1974), Some problems in construction and evaluation
of criterion-referenced tests, Scand. Jour. Educ. Res. 18, 1-12.
Hieronymus, A.N. & Lindquist, E.F. (1971), Teacher's Guide for Administration,
Interpretation and Use: Iowa Tests of Basic Skills, Houghton Mifflin, Boston.
Hill, G.C. & Woods, G.T. (1974), Multiple true-false questions, Educ. in Chem.
11, 86-87.
Hively, W., Patterson, H.L. & Page, S.H. (1968), A "universe-defined" system
of arithmetic achievement tests, Jour. Educ. Measmt. 5, 275-290.
Hoffman, B. (1962), The Tyranny of Testing, Crowell-Collier, New York.
Hoffman, B. (1967(a)), Psychometric scientism, Phi Delta Kappan. 48, 381-386.
Hoffman, B. (1967(b)), Multiple-choice tests, Physics Educ. 2, 247-251.
Hofmann, R.J. (1975), The concept of efficiency in item analysis, Educ. Psychol.
Measmt. 35, 621-640.
Honeyford, R. (1973), Against objective testing, The Use of English. 25, 17-26.
Hopkins, K.D., Hakstian, A.R. & Hopkins, B.R. (1973), Validity and reliability
consequences of confidence weighting, Educ. Psychol. Measmt. 33, 135-14.
Huck, S.W. & Bowers, N.D. (1972), Item difficulty level and sequence effects
in multiple-choice achievement tests, Jour. Educ. Measmt. 9, 105-111.
Ivens, S.H. (1971), Nonparametric item evaluation index, Educ. Psychol. Measmt. 31, 843-849.
Jacobs, S.S. (1971), Correlates of unwarranted confidence in responses to
objective test items, Jour. Educ. Measmt. 8, 15-20.
Jacobs, S.S. (1972), Answer changing on objective tests: Some implications for
test validity, Educ. Psychol. Measmt. 32, 1039-1044.
Jaspen, N. (1965), Polyserial correlation programs in Fortran, Educ. Psychol.
Measmt. 25, 229-233.
Jencks, C.S. et.al. (1972), Inequality: A Reassessment of the Effect of Family
and Schooling in America, Basic Books, New York.
Karraker, R.J. (1967), Knowledge of results and incorrect recall of plausible
multiple-choice alternatives, Jour. Educ. Psychol. 58, 11-14.
Kelley, T.L. (1939), The selection of upper and lower groups for the validation
of test items, Jour. Educ. Psychol. 30, 17-24.
Killcross, M.C. (1974), A Tailored Testing System for Selection and Allocation
in the British Army, Paper presented at the 18th International Congress of
Applied Psychology, Montreal.
Klein, S.P. & Kosecoff, J. (1973), Issues and procedures in the development of
criterion-referenced tests, ERIC TM Report, 26.
Koehler, R.A. (1971), A comparison of the validities of conventional choice
testing and various confidence marking procedures, Jour. Educ. Measmt. 8,
297-303.
Koehler, R.A. (1974), Overconfidence on probabilistic tests, Jour. Educ. Measmt. 11, 101-108.
Kolakowski, D. & Bock, R.D. (1974), LOGOG: Maximum likelihood item analysis and test scoring - logistic model, National Educational Resources, Chicago.
Krauft, C.C. & Beggs, D.L. (1973), Test taking procedure, risk taking and
multiple-choice tests scores, Jour. Exper. Educ. 41, 74-77.
Kropp, R.P., Stoker, H.W. & Bashaw, W.L. (1966), The Construction and Validation of Tests of the Cognitive Processes as described in the Taxonomy of
Educational Objectives. Institute of Human Learning and Department of
Educational Research and Testing, Florida State University, Tallahassee.
Kuhn, T.S. (1962), The Structure of Scientific Revolutions, University of
Chicago Press, Chicago.
La Fave, L. (1966), Essay versus multiple-choice: Which test is preferable?
Psychol. Sch. 3, 65-69.
Lever, R.S., Harden, R. McG., Wilson, G.M. & Jolley, J.L. (1970), A simple
answer sheet designed for use with objective examinations, Brit. Jour. Med.
Educ. 4, 37-41.
Levy, P. (1973), On the relation between test theory and psychology. In
Kline, P. (Ed.), New Approaches in Psychological Measurement, Wiley, London.
Lewis, D.G. (1974), Assessment in Education, University of London Press, London.
Lewy, A. (1973), Discrimination among individuals vs. discrimination among
groups, Jour. Educ. Measmt. 10, 19-24.
Lord, F.M. (1971(a)), The self-scoring flexilevel test, Jour. Educ. Measmt. 8,
147-151.
Lord, F.M. (1971(b)), A theoretical study of the measurement effectiveness of
flexilevel tests, Educ. Psychol. Measmt. 31, 805-814.
Lord, F.M. (1976(a)), Optimal number of choices per item - a comparison of
four approaches, Research Bulletin 76-4, Educational Testing Service,
Princeton, N.J.
Lord, F.M. (1976(b)), Invited discussion, In Proceedings of the First Conference on Computerised Adaptive Testing, U.S. Civil Service Commission,
Washington.
Lord, F.M. & Novick, M.R. (1968), Statistical Theories of Mental Test Scores,
Addison Wesley, New York.
Lumsden, J. (1976), Test theory, Ann. Rev. Psychol. 27, 251-280.
Lynch, D.O. & Smith, B.C. (1975), Item response changes: Effects on test
scores, Meas. Eval. in Guidance. 7, 220-224.
Macintosh, H.G. & Morrison, R.B. (1969), Objective Testing, University of
London Press, London.
Macready, G.B. (1975), The structure of domain hierarchies found within a
domain referenced testing system, Educ. Psychol. Measmt. 35, 583-598.
Macready, G.B. & Memin, J.C. (1973), Homogeneity within item forms in domain
referenced testing, Educ. Psychol. Measmt. 33, 351-360.
Madaus, G., Kellaghan, T. & Rakow, E. (1975), A Study of the Sensitivity of
Measures of School Effectiveness, Report to the Carnegie Corporation.
Marcus, A. (1963), Effect of correct response location on the difficulty level
of multiple-choice questions, Jour. Appl. Psychol. 47, 48-51.
McClelland, D.C. (1973), Testing for competence rather than for intelligence,
Amer. Psychol. 28, 1-14.
McKillip, R.H. & Urry, V.W. (1976), Computer-assisted testing: An orderly
transition from theory to practice. In Proceedings of the First Conference
on Computerised Adaptive Testing, U.S. Civil Service Commission, Washington.
McMorris, R.F., Brown, J.A.,
Snyder, G.W. & Pruzek, R.M. (1972), Effects of
violating item construction principles, Jour. Educ. Measmt. 9, 287-296.
Mellenbergh, G.J. (1972), A comparison between different kinds of achievement
test items. Nederlands Tijdschrift voor de Psychologie en haar Grensgebieden.
27, 157-158.
Miller, C.M.L. & Parlett, M. (1974), Up to the Mark: A Study of the Examination
Game, Society for Research into Higher Education, London.
Muller, D., Calhoun, E. & Orling, R. (1972), Test reliability as a function of
answer sheet mode, Jour. Educ. Measmt. 9, 321-324.
Munz, D.C. & Jacobs, P.D. (1971), An evaluation of perceived item-difficulty
sequencing in academic testing, Brit. Jour. Educ. Psychol. 41, 195-205.
Nilsson, I. & Wedman, I. (1976), On test-wiseness and some related constructs,
Stand. Jour. Educ. Res. 20, 25-40.
Nixon, J.C. (1973), Investigation of the response foils of the Modified Rhyme
Hearing Test, J. Speech Hearing Res. 4, 658-666.
Nuttall, D.L. (1974), Multiple-choice objective tests - A reappraisal, In
Conference Report 11, University of London University Entrance and School
Examinations Council, London.
Nuttall, D.L. & Skurnik, L.S. (1969), Examination and Item Analysis Manual,
National Foundation for Educational Research, Slough.
Oosterhof, A.C. & Glasnapp, D.R. (1974), Comparative reliability and difficulties of the multiple-choice and true-false formats, Jour. Exper. Educ.
42, 62-64.
Open University, CMA Instructions, Undated document.
Ormell, C.P. (1974), Bloom's taxonomy and the objectives of education, Educ.
Res. 17, 3-18.
Osburn, H.G. (1968), Item sampling for achievement testing, Educ. Psychol.
Measmt. 28, 95-104.
Palva, I.P. & Korhonen, V. (1973), Confidence testing as an improvement of
multiple-choice examinations, Brit. Jour. Med. Educ. 7, 179-181.
Pascale, P.J. (1974), Changing answers on multiple-choice achievement tests,
Meas. Eval. in Guidance. 6, 236-238.
Paton, D.M. (1971), An examination of confidence testing in multiple-choice
examinations, Brit. Jour. Med. Educ. 5, 53-55.
Payne, R.W. & Pennycuick, D.B. (1975), Multiple Choice Questions on Advanced
Level Mathematics, Bell, London.
Pearce, J. (1974), Examinations in English Language. In Language, Classroom and
Examinations. Schools Council Programme in Linguistics and English Teaching
Papers Series II, Vol. 4, Longman.
Peterson, C.C. & Peterson, J.L. (1976), Linguistic determinants of the difficulty of true-false test items, Educ. Psychol. Measmt. 36, 161-164.
Peterson, C.R. & Beach, L.R. (1967), Man as an intuitive statistician, Psychol.
Bull. 68, 29-46.
Pippert, R. (1966), Final note on the changed answer myth, Clearing House, 38,
165-166.
Poole, R.L. (1972), Characteristics of the Taxonomy of Educational Objectives:
Cognitive domain - a replication, Psychol. Sch. 9, 83-88.
Powell, J.C. & Isbister, A. (1974), A comparison between right and wrong
answers on a multiple choice test, Educ. Psychol. Measmt. 34, 499-509.
Prescott, W.E. (1970), The use and influence of objective tests, In Examining
Modern Languages. Centre for Information of Language Teaching Reports and
Papers 4, London.
Preston, R.C. (1964), Ability of students to identify correct responses before
reading, Jour. Educ. Res. 58, 181-183.
Preston, R.C. (1965), Multiple-choice test as an instrument in perpetuating
false concepts, Educ. Psychol. Measmt, 25, 111-116.
Pring, R. (1971), Bloom's taxonomy: A philosophical critique (2), Camb. Jour.
Educ. 2, 83-91.
Pugh, R.C. & Brunza, J.J. (1975), Effects of a confidence weighted scoring
system on measures of test reliability and validity, Educ. Psychol. Measmt.
35, 73-78.
Pyrczak, F. (1972), Objective evaluation of the quality of multiple-choice
test items designed to measure comprehension of reading passages, Read. Res.
Quart. 8, 62-71.
Pyrczak, F. (1974), Passage-dependence of items designed to measure the
ability to identify the main ideas of paragraphs: Implications for validity,
Educ. Psychol. Measmt. 34, 343-348.
Quinn, B. (1975), A technical report on the multiple-choice tests set by the
London GCE Board 1973 and 1974, University of London School Examinations
Department, London. Unpublished manuscript.
Quinn, B. & Wood, R. (1974), Giving part marks for multiple-choice questions,
University of London School Examinations Department, London. Unpublished
manuscript.
Rabinowitz, F.M. (1970), Characteristic sequential dependencies in multiple-choice situations, Psychol. Bull. 74, 141-148.
Rakow, E.A. (1974), Evaluation of Educational Program Differences via Achievement Test Item Difficulties, Paper presented at the American Educational
Research Association, Chicago.
Ramos, R.A. & Stern, J, (1973), Item behaviour associated with changes in the
number of alternatives in multiple-choice items, Jour. Educ. Measmt. 10,
305-310.
Rasch, G. (1968), A Mathematical Theory of Objectivity and its Consequences
for Model Construction, Paper delivered at the European Meeting on Statistics,
Econometrics and Management Science, Amsterdam.
Ravetz, J.R. (1971), Scientific Knowledge and its Social Problems, Oxford
University Press.
Reiling, E. & Taylor, R. (1972), A new approach to the problem of changing
initial responses to multiple-choice questions, Jour. Educ. Measmt. 9, 67-70.
Reilly, R.R. (1975), Empirical option weighting with a correction for guessing,
Educ. Psychol. Measmt. 35, 613-619.
Reilly, R.R. & Jackson, R. (1973), Effects of empirical options weighting on
reliability and validity of an academic aptitude test, Jour. Educ. Measmt.
10, 185-194.
Resnick, L.B., Siegel, A.W. & Kresh, E. (1971), Transfer and sequence in
double classification skills, Jour. Exp. Child. Psychol. 11, 139-149.
Richards, J.M. (1967), Can computers write college admissions tests? Jour. Appl. Psychol. 51, 211-215.
Ross, J. & Weitzman, R.A. (1964), The twenty-seven percent rule, Ann. Math.
Stat. 35, 214-221.
Rothman, A.I.
(1969), Confidence testing: An examination of multiple-choice
testing, Brit. Jour. Med. Educ. 3, 237-239.
Rowley, G.L. (1974), Which examinees are most favoured by the use of multiple-choice tests? Jour. Educ. Measmt. 11, 15-23.
Sabers, D.L. & White, G.W. (1969), The effect of differential weighting of
individual item responses on the predictive validity and reliability of
an aptitude test, Jour. Educ. Measmt. 6, 93-96.
Sanderson, P.H. (1973), The 'don't know' option in MCQ examinations, Brit.
Jour. Med. Educ. 7, 25-29.
Schlesinger, I.M. & Guttman, L. (1969), Smallest space analysis of intelligence
and achievement tests, Psychol. Bull. 71, 95-100.
Schnittjer, C.J. & Cartledge, C.M. (1976), Item analysis programs: A comparative analysis of performance, Educ. Psychol. Measmt. 36, 183-188.
Schofield, R. (1973), Guessing on objective type test items, Sch. Sci. Rev. 55,
170-172.
Schools Council, (1965), The Certificate of Secondary Education: Experimental
Examinations - Mathematics, Examinations Bulletin 7, Schools Council,
London.
Schools Council, (1973), Objective test survey, Unpublished document. Schools
Council, London.
Scott, W.A. (1972), The distribution of test scores, Educ. Psychol. Measmt. 32,
725-735.
Seddon, G.M. & Stolz, C.J.S. (1973), The Validity of Bloom's Taxonomy of Educational Objectives for the Cognitive Domain, Unpublished manuscript, Chemical
Education Sector, University of East Anglia.
Senathirajah, N. & Weiss, J. (1971), Evaluation in Geography, Ontario Institute
for Studies in Education, Toronto.
Shayer, M. (1972), Conceptual demands in the Nuffield O-level Physics course,
Sch. Sci. Rev. 54, 26-34.
Shayer, M., Kuchemann, D.E. & Wylam, H. (1975), Concepts in Secondary Mathematics and Science, S.S.R.C. Project Report, Chelsea College, London.
Shoemaker, D.M. (1970), Test statistics as a function of item arrangement,
Jour. Exper. Educ. 39, 85-88.
Shuford, E. & Brown, T.A. (1975), Elicitation of personal probabilities and
their assessment, Instructional Science. 4, 137-188.
Skinner, B.F. (1963), Teaching machines, Scientific American, 90-102.
Skurnik, L.S. (1973), Examination folklore: Short answer and multiple-choice
questions, West African Jour. Educ. Voc. Measmt. 1, 6-12.
Slakter, M.J., Crehan, K.D. & Koehler, R.A. (1975), Longitudinal studies of
risk taking on objective examinations, Educ. Psychol. Measmt. 35, 97-105.
Sockett, H. (1971), Bloom's taxonomy: A philosophical critique (1), Camb. Jour.
Educ. 1, 16-35.
Stanley, J.C. & Wang, M.D. (1970), Weighting test items and test-item options:
An overview of the analytical and empirical literature, Educ. Psychol. Measmt.
30, 21-35.
Strang, H.R. & Rust, J.O. (1973), The effects of immediate knowledge of results
and task definition on multiple-choice answering, Jour. Exper. Educ. 42,
77-80.
Tamir, P. (1971), An alternative approach to the construction of multiple-choice test items, Jour. Biol. Educ. 5, 305-307.
Test Development and Research Unit, (1975), Multiple Choice Item Writing,
Occasional Publication 2, Cambridge.
Test Development and Research Unit, (1976), Report for 1975, Cambridge.
Thorndike, R.L. (1971), Educational measurement for the Seventies, In
Thorndike, R.L. (Ed.), Educational Measurement, American Council on Education,
Washington.
Traub, R.E. & Carleton, R.K. (1972), The effect of scoring instructions and
degree of speededness on the validity and reliability of multiple-choice
tests, Educ. Psychol. Measmt. 32, 737-758.
Tuinman, J.J.
(1972), Inspection of passages as a function of passage dependency of the test items, Jour. Read. Behav. 5, 186-191.
Tulving, E. (1976),
In Brown, J. (Ed.), Recall and Recognition, Wiley, London.
Tversky, A. (1964), On the optimal number of alternatives at a choice point, Jour. Math. Psychol. 1, 386-391.
University of London, (1975), Multiple-choice Objective Tests: Notes for the
Guidance of Teachers, University of London University Entrance and School
Examinations Council, London.
Vernon, P.E. (1964), The Certificate of Secondary Education: An Introduction
to Objective-type Examinations, Examinations Bulletin 4, Secondary Schools
Examinations Council, London.
Wason, P.C. (1961), Response to affirmative and negative binary statements,
Brit. Jour. Psychol. 52, 133-142.
Wason, P.C. (1970), On writing scientific papers, Physics Bull. 21, 407-408.
Weiss, D.J. (1976), Computerised Ability Testing 1972-1975, Psychometric
Methods Program, University of Minnesota.
Weitzman, R.A. (1970), Ideal multiple choice items, Jour. Amer. Stat. Assoc.
65, 71-89.
Wesman, A.G. (1971). Writing the test item, In Thorndike R.L. (Ed.)
Educational Measurement, American Council on Education, Washington.
Whitely, S.E. & Dawis, R.V. (1974), The nature of objectivity with the Rasch
model, Jour. Educ. Measmt. 11, 163-178.
Whitely, S.E. & Dawis, R.V. (1976), The influence of test context on item
difficulty, Educ. Psychol. Measmt. 36, 329-338.
Williamson, M.L. & Hopkins, K.D. (1967), The use of 'none of these' versus
homogeneous alternatives on multiple-choice tests: Experimental reliability
and validity comparisons, Jour. Educ. Measmt. 4, 53-58.
Willmott, A.S. & Fowles, D.E. (1974), The Objective Interpretation of Test
Performance: The Rasch Model Applied, National Foundation for Educational
Research, Slough.
Wilmut, J. (1975(a)), Objective test analysis: Some criteria for item selection, Res. in Educ. 13, 27-56.
Wilmut, J. (1975(b)), Selecting Objective Test Items, Associated Examining
Board, Aldershot.
Wilson, N. (1970), Objective Tests and Mathematical Learning, Australian Council for Educational Research, Sydney.
Wingersky, M.S. & Lord, F.M. (1973), A computer program for estimating examinee
ability and item characteristic curve parameters when there are omitted
responses, Research Memorandum 73-2. Educational Testing Service, Princeton.
Wood, R. (1968), Objectives in the teaching of mathematics, Educ. Res. 10, 83-98.
Wood, R. (1969), The efficacy of tailored testing, Educ. Res. 11, 219-222.
Wood, R. (1973(a)), A technical report on the multiple choice tests set by the
London GCE Board 1971 and 1972, Unpublished document, University of London
School Examinations Department, London.
Wood, R. (1973(b)), Response-contingent testing, Rev. Educ. Res. 43, 529-544.
Wood, R. (1974), Multiple-completion items: Effects of a restricted response
structure on success rates, Unpublished manuscript, University of London
School Examinations Department, London.
Wood, R. (1976(a)), Barking up the wrong tree? What examiners say about those they examine, Times Educ. Suppl. June 18.
Wood, R. (1976(b)), Trait measurement and item banks. In de Gruitjer, D.N.M.
& van der Kamp, L.J. Th. (Eds.) Advances in Psychological and Educational
Measurement, John Wiley, London.
Wood, R. (1976(c)), A critical note on Harvey's 'Some thoughts on norm-referenced and criterion-referenced measures', Res. in Educ. 15, 69-72.
Wood, R. (1976(d)), Inhibiting blind guessing: The effect of instructions,
Jour. Educ. Measmt. 13, 297.
Wood, R. & Skurnik, L.S. (1969), Item Banking, National Foundation for Educational
Research, Slough.
Wright, B.D. (1968), Sample-free test calibration and person measurement,
Proceedings of the 1967 Invitational Conference on Testing Problems,
Educational Testing Service, Princeton.
Wright, P. (1975), Presenting people with choices: The effect of format on
the comprehension of examination rubrics, Prog. Learn. Educ. Tech. 12,
109-114.
Wyatt, H.V. (1974), Testing out tests, Times Higher Educ. Suppl. June 28.
Zontine, P.L., Richards, H.C. & Strang, H.R. (1972), Effect of contingent
reinforcement on Peabody Picture Vocabulary test performance, Psychol.
Reports, 31, 615-622.