MULTIPLE CHOICE: A STATE OF THE ART REPORT

Robert Wood
University of London

CONTENTS

   INTRODUCTION  193
1. POLEMICS  195
2. RECALL, RECOGNITION AND BEYOND  203
     Higher Order Skills  204
     Summary  209
3. ITEM TYPES  210
     Item Types Other Than Simple Multiple Choice  211
     True-false  211
     Multiple true-false  213
     Multiple completion  213
     Assertion-reason  215
     Data necessity  217
     Data sufficiency  217
     Quantitative comparisons  218
     Summary  221
4. CONSTRUCTING ITEMS  223
     Number of Distracters  223
     The 'None of These' Option  225
     Violating Item Construction Principles  226
     Item Forms  227
     Summary  230
5. INSTRUCTIONS, SCORING FORMULAS AND RESPONSE BEHAVIOUR  232
     Changing answers  234
     Confidence weighting  234
     Ranking alternative answers  236
     Elimination scoring  237
     Weighting of item responses  237
     Summary  239
6. ITEM ANALYSIS  240
     Other Discrimination Indices  245
     Generalised Item Statistics  247
     The Item Characteristic Curve  247
     Probabilistic Models of Item Response  249
     Summary  251
7. ITEM SELECTION AND TEST CONSTRUCTION  253
     Constructing Group Tests  253
     Norm-referenced tests  253
     Arranging Items in the Test Form  258
     The Incline of Difficulty Concept  259
     Individualised Testing  260
     Testing for Other Than Individual Differences
     Criterion-referenced tests
     Choosing Items to Discriminate Between Groups  264
     Computer Programs for Item Analysis  265
     Summary  265

ACKNOWLEDGEMENTS  267
REFERENCES  268


Introduction

Potential readers of this book will want to know where it stands relative to predecessors like Vernon (1964), Brown (1966) and Macintosh and Morrison (1969). The answer is that it is meant to be a successor to Vernon's monograph which, first class though it still is, was thought to be in need of updating and expanding. It is therefore not a practical handbook like the other two books. Although I often give my opinion on what is good practice, the book concentrates on marshalling and evaluating the literature on multiple choice testing, the aim being to clarify what is known about this testing technique: in short, what is intended is a state of the art report.

Multiple choice is but one of a number of testing techniques. Anyone who wonders why it seems to dwarf the others in the attention it receives and the controversy it arouses might choose among the following reasons:

1. The technique originated in the USA and attracts irrational hostility on that account.

2. Choosing one out of a number of alternative answers is thought by some to be a trivial mental exercise.

3. The answer deemed to be correct can be obtained by blind guessing.

4. The format raises a number of methodological problems, real and imagined - setting, scoring, etc.

5. The data multiple choice tests produce lend themselves to elaborate statistical analysis and to the shaping of theories about response behaviour. Without multiple choice, modern test theory would not have come into existence (which would have been a blessing, some might think).

6. Because of a widespread belief that multiple choice tests are easily prepared for, there has come into being what amounts to an industry consisting of writers turning out books of multiple choice items, usually directed at specific examinations. Often the content of these collections is shoddy and untested, but their continual publication and reviewing keeps multiple choice before the public eye and affords hostile critics the opportunity to lambast the technique.
The opportunities for research investigations offered by 3, 4 and 5 above have sustained many American academics in their careers and their offerings have filled and continue to fill the pages of several journals, notably Educational and Psychological Measurement and the Journal of Educational Measurement. The absence of such specialised journals in Britain has meant less of an outpouring, although in recent years most subject-based educational journals, particularly those connected with the sciences, have carried at least one article concerned with 're-inventing the wheel', in apparent ignorance, wilful or otherwise, of developments elsewhere.

It is with multiple choice in the context of educational achievement testing that I am mainly concerned in this book. This concentration stems from my own background and current employment, although I also believe that it is with achievement tests that multiple choice finds its most important application. I work for the University of London School Examinations Department, which is one of the bodies in England charged with conducting GCE (General Certificate of Education) examinations at Ordinary and Advanced levels. Ordinary or O-level is usually taken by students at the age of 16 and Advanced or A-level at age 18. I hope this explanation will enable readers to understand the odd references in the text to GCE or to O- and A-level or to the London GCE board. In the one place where CSE is mentioned, this refers to the Certificate of Secondary Education taken by less able students, also around the age of 16.

When referencing I have not attempted to be exhaustive, although neither have I been too selective. I hope I have mentioned most work of the last ten years; older work can be traced through the references I have cited. Where I think there is a good bibliography on a topic I have said so.

The weighing up of research evidence is a difficult business and I do not pretend to have any easy solutions. The fact that numerous studies have used American college students, often those studying psychology, throws doubt on some of the literature, but I take the view that it is possible to perceive tendencies, to see where something definitely did not work and to judge where promise lies. Often the accounts of experiments are less interesting than the speculative writing of the investigators. Sometimes there are no experiments at all but just polemical writing, nearly always hostile to multiple choice. It is with these polemics that I start.


1. Polemics

"The Orangoutang score is that score on a standardised reading test that can be obtained by a well-trained Orangoutang under these special conditions. A slightly hungry Orangoutang is placed in a small cage that has an oblong window and four buttons. The Orangoutang has been trained that every time the reading teacher places a neatly typed multiple choice item from a reading test in the oblong window, all that he (the Orangoutang) has to do to get a bit of banana is to press a button, any of the buttons, which, incidentally, are labelled A, B, C and D." (Fry, 1971)

Although the quotation above is acid enough, no one has savaged the multiple choice test quite like Banesh Hoffman and Jacques Barzun. Both are American academics, both regard multiple choice as the enemy of intellectual standards and creative expression.
In their different ways they have made out a case against multiple choice which must be taken seriously, even if Hoffman's diatribes have a superior, fanatical tone which soon grates. In his various onslaughts, Hoffman (1962, 1967(a), 1967(b)) has insisted that multiple choice "favours the picker of choices rather than the doer", and that students he variously calls "gifted", "profound", "deep", "subtle" and "first rate" are liable to see more in a question than the questioner intended, a habit which, he claims, does not work to their advantage. In favouring the "doer", Hoffman is expressing a preference, which he is entitled to do, but he produces no evidence for supposing there are distinct breeds of "pickers" and "doers", just as he is unable to demonstrate that "picking" is necessarily either a passive or a trivial activity. To choose an answer to a question is to take a decision, even if it is a small one. In any case, this is not the point; as I shall argue presently, why should not students "pick" and "do"?

The fact is that much of the distaste for multiple choice expressed by American critics like Hoffman and Barzun arises from fears about the effects of using multiple choice tests exclusively in American school testing programmes and the consequent lack of any opportunity for the student to compose his own answers to questions. Recent reports from the USA (Binyon, 1976), which have linked what is seen as the growing inability of even university students to write competent English with the absence of essay tests, would seem to justify such fears, although no convincing analysis substantiating the link has yet been offered and other factors in American society, such as the low value placed on writing outside school, may well be implicated. As far as the British situation is concerned, examining boards are agreed that multiple choice should be only one element of an examination, and often a minor element at that; many examinations do not contain a multiple choice element at all, nor are they ever likely to. In practice, the highest weight multiple choice will attract is 50 per cent, and then only rarely; generally it is in the region of 30-40 per cent. (Although based on 1971 examinations, an unpublished Schools Council survey (Schools Council, 1973) provides what is probably still a reasonably accurate picture of the extent of objective test usage in the United Kingdom and of the weightings given to these tests.)

The case for using multiple choice rests in large part on the belief that there is room for an exercise in which candidates concentrate on giving answers to questions free of the obligation to write up - some would say dress up - their answers in extended form, a point made by Nuttall (1974, p.35) and by Pearce (1974, p.52). Instead of asking candidates to do a little reading and a lot of writing with (hopefully) some thinking interspersed - what, I suppose, Hoffman would call "doing" - they are asked to read and think or listen and think before "picking". I see nothing wrong in this. By and large there has been an over-emphasis on writing skills in our examinations - unlike the USA - and the different approach to assessment represented by multiple choice serves as a corrective. I would accept that a concentration on reading and thinking is to some extent arbitrary in terms of priorities. After all, a strong case could be made out on behalf of oral skills, yet how little these feature in external examinations, leaving aside language subjects.
That they have been so neglected is, of course, directly indicative of the over-emphasis that has been placed on written work, which in turn can be traced to the conservatism of examiners and teachers and to the difficulties of organising oral assessments.

If the quality of written work produced by the average candidate in the examination room was better than it is, one might be more impressed by the arguments of those who insist on written answers or "doing" in all circumstances. But, as anybody knows, examination writing is far from being the highest form of the art; how could it be when nervous individuals have to write against the clock without a real opportunity to draft and work over their ideas, a practice intrinsic to writing? As Wason (1970) has observed, one learns what one wants to write as one goes along. Small wonder, then, that the typical examination answer is an unsightly mess of half-baked ideas flung down on the paper in the hope that some, at least, will induce a reward from the examiner. No doubt the fault lies in the kind of examination that is set, or in setting examinations at all. Certainly examinations have been roundly condemned for stultifying writing in the schools by placing too much emphasis on one kind of writing only, namely the impersonal and transactional, at the expense of the personal and expressive (Britton et al, 1975). Yet it sometimes seems that whatever attempts are made to liberalise examinations, the response in the schools is inexorably towards training pupils in the new ways so that these soon become as routinised as the bad old ways. Elsewhere (Wood, 1976(a)) I have written of the deadlock which exists between examiners, teachers and students, and the solution is not at all clear. At least with multiple choice all parties understand what is required of them.

That some individuals will rise above the restrictive circumstances of a written paper and demonstrate organisation, originality and so forth is not in dispute. What they must beware of is being thought too clever or of producing answers which are regarded as interesting but irrelevant. The people who frame multiple choice questions are, by and large, the same people who frame and mark essay questions. Most will have graduated from one to the other. If their thinking is "convergent" it will show in both cases. Vernon (1964, p.7) may have had this in mind when he remarked that it is by no means certain that conventional examinations are capable of eliciting what are often called the higher qualities. He observed, quite correctly, that the typical examination answer, at least at 15-16 years, tends to be marked more for accuracy and number of facts than for organisation, originality etc., not least because this ensures an acceptable level of reliability. This being so, it seems to me that what Hoffman claims is true of teachers of "gifted" students - "that such teachers often feel it necessary to warn precisely their intellectually liveliest students not to think too precisely or deeply when taking mechanised tests" (Hoffman, 1967(a), p.383) - might equally well be applied to essay tests.

If this reads like a "plague on both your houses", that is not my intention. The point is that multiple choice is not alone in having deficiencies - they can be found in all the techniques used in public examinations.
As long as it serves a useful assessment function - and I have tried to establish that it does - the weaknesses, which are technical or procedural, can be attended to. In this respect I wish that the essay test had received even half the attention bestowed on multiple choice.

If Hoffman's first charge is seen to be shallow, what of his second - that the gifted, creative, non-conformist mind is apt to see more in multiple choice questions than was intended, the consequence of which is to induce uncertainty, perplexity and ultimately incorrect answers? All the examples Hoffman produces are designed to show that to the "gifted" the question is not meaningful, or contains more than one answer, or else does not contain the answer at all. "Only exceptional students are apt to see the deeper defects of test items" (Hoffman, 1967(a), p.383) he remarks at one point, but is not a student exceptional precisely because he can see the deeper defects? What Hoffman, who naturally includes himself among the "gifted", and other critics forget, is that the items they take apart are meant for average 16 or 18 year olds who do not possess their superior intellects. In these circumstances it is hardly surprising that hard scrutiny of questions will reveal "ambiguities" unseen by the average eye. Whether or not "gifted" persons find these ambiguities in the examination room, and how they react to them, having found them, is very much open to question. Too little work has been done on this subject but the best study to date (Alker, Carlson and Hermann, 1969) concluded that "first-rate" students were not, in general, upset and penalised by multiple choice questions. They found that characteristics of both superficial and deep thinking were associated with doing well on multiple choice questions. There the matter rests until more evidence comes along. A further study along the lines of the one just discussed would be worth doing.

Whenever we refer to "ambiguities" we must bear in mind that knowledge is always provisional and relative in character; most of us are given, and then settle for, convenient approximations to "true" knowledge. Writing about science, Ravetz (1971, Chapter 6) has remarked on the tendency of teachers, aided and abetted by textbook writers, to rely on standardisation of scientific material and, in the process, to introduce successive degrees of banality as the teaching becomes further displaced from contact with research and high class scientific debate. The end product is what he calls vulgarised knowledge or what Kuhn (1962), more kindly, calls normal science. Ravetz's analysis is hard to fault but the solution is not easy to see. Driver (1975) is surely right that "there are no 'right answers' in technology", yet when she writes "instead of accepting the teacher's authority as the ultimate judge, the pupils can be encouraged to develop their own criteria for success; to consider their own value systems and to make judgements on them" one recoils, not because of the radical nature of the sentiment but because the teaching programme implied seems too ambitious for the majority of 15 and 16 year olds, at any rate. Can you teach children to be sceptical about ideas before they know anything to be sceptical about? Perhaps it can be done, but it needs a particular teaching flair to be able to present knowledge with the right degree of uncertainty.
In general it seems inevitable that most children will be acquainted only with received knowledge and ideas which contribute to an outdated view of the physical world. Whether they are willing or able to update this view later will depend on training, temperament and opportunity.

The relevance of all this for multiple choice is obvious. For Hoffman and other critics, multiple choice embodies vulgarised knowledge in its most blatant form; it deals, in Barzun's (1959, p.139) term, with the "thought-cliché". Through the items they set, examiners make public their versions of knowledge, and through the medium they foster the impression that every problem has a right answer. The sceptical mind is given no opportunity to function; a choice must be made. Worse, the format may reinforce misconceptions. Finally, not only does multiple choice reflect the transmission of standardised knowledge; through so-called "backwash" effects, it encourages it. That, at any rate, is what the critics say.

Myself, I see no point in denying that multiple choice embodies standardised knowledge. If that is what is taught, then all examining techniques will reflect the fact. More than anything else, examinations serve to codify what at any time passes as "approved" knowledge, ideas and constructions. Those who find the spectacle distasteful attack multiple choice because it is such a convenient target, but what could be more standardised than the official answer to a question like "It is often said that Britain has an unwritten constitution. Discuss."

One place where the arguments about multiple choice as a representation of objective knowledge have come to a head is over the English Language comprehension exercise set at Ordinary level by the London GCE board and similar tests set by other examining boards. Basically there is a conflict between those who insist that judgement about meaning is always subjective and who deny that there are any "correct" interpretations (e.g. Honeyford, 1973) and those (e.g. Davidson, 1974) who see multiple choice as a formalisation of a public discussion about meaning. It seems to me that the resolution of this conflict depends, once again, on who and what is being tested. Were the subject at issue Advanced level English Literature, where one must expect candidates to come forward with different interpretations of texts, I would want to give Honeyford's argument a lot of weight, even if marking raises severe problems. For if, as he maintains, comprehension is an essentially private experience, it is logical nonsense to attempt to standardise examiners' opinions. One of the severest critics in print of multiple choice found himself caught in just this dilemma when championing essay tests: "If you standardise essay tests, they become as superficial as multiple choice; if you do not standardise them, they measure not the abilities of the examinee but function rather as projective tests of the graders' personalities" (La Fave, 1966). Presumably, the proper approach to marking in these circumstances is to allow different interpretations, providing they can be supported convincingly. This presupposes a broadmindedness on the part of examiners which may not exist, but I see no other solution. With Ordinary level English Language comprehension, on the other hand, the latitude for varying interpretations of meaning is not so great. The candidates are younger and the material is simpler.
Sometimes it appears at first glance debatable whether a word or phrase conveys a meaning best, but on close analysis it usually turns out that the examiners have gone for finer distinctions in order to test the understanding of the more competent candidates. In doing so, however, they run the risk of provoking controversy; it is no accident that most of the complaints about multiple choice concern items where the discrimination called for is allegedly too fine or is reckoned to be non-existent.

Consider, for instance, the following O-level English Language comprehension item set by the London board in June 1975, which came in for some criticism in the correspondence columns of the Guardian and The Times Educational Supplement. The item refers to the following sentence which was part of a longer passage:

"The distinction of London Bridge station on the Chatham side is that it is not a terminus but a junction where lives begin to fade and blossom again as they swap trains in the rush hour and make for all regions of South London and the towns of Kent."

The item was as follows:

The statement that London Bridge is a place "where lives begin to fade and blossom again" is best explained by saying that it is a place where people:

A  Grow tired of waiting for their trains and feel better when they have caught them.
B  Flag at the end of their day and revive as they travel homeward.
C  Leave behind the loneliness of the city and enjoy the company in a crowded carriage.
D  Escape from the unhealthy atmosphere of London and flourish in the country.
E  Forget about their daily work and look forward to enjoying their leisure.

According to one critic (Guardian, 17.6.75), there are "rules that pertain to this type of question. One answer must clearly be perceived to be correct and evidence must be forthcoming why this is so". In his view, the London board broke that "rule" with this item, and indeed others in the same paper. The crux of the matter is obviously the word "clearly" and here is where I part company with the critic. The examiners have set candidates an item which calls for rather closer attention to the text than might generally be the case. But is this so wrong? A test where the right answer jumped out every time would be a very dull test. As it happened, the reason why statement B was considered to be the best answer was explained very nicely by another correspondent to the Guardian. This is what she said:

"The candidate does not need to read the examiner's mind, if he reads the question. In the sentence you are not told:

A  Whether the people grow tired of waiting, or whether
C  The city is lonely, or whether
D  London is unhealthy, or whether
E  They will forget their work.

You are told that "lives begin to fade and blossom again" and statement B best explains this by saying London Bridge is a place where people flag at the end of their day and revive as they travel homeward." (Guardian, 24.6.75)

BACKWASH

I would like to distinguish two kinds of backwash. The first concerns the effect of an examining technique on the way subject matter is structured, taught and learnt; the second concerns the way candidates prepare and are prepared for the technique in question. In the case of multiple choice this involves developing what the Americans call "test-wiseness" - the capacity to get the most marks from a test by responding to cues, knowing how to pace oneself and so forth.
In the case of essay tests, the comparable behaviour would be knowing how to "spot" questions, how long to spend on questions and generally knowing how to maximise one's chances.

Providing it is not overdone, the second kind of backwash need not be taken as seriously as the first. After reviewing 80 or so studies, Mellenberg (1972) observed that "there seems to be no evidence that study habits are strongly affected by the type of test that students are expecting in examinations". Vernon (1964) offers the view that "so long as the objective questions are reasonably straightforward and brief, we know that the amount of improvement brought about by coaching and practice is limited ... However, it is possible (though there is little direct evidence) that facility in coping with more complex items is more highly coachable and that pupils who receive practice at these may gain an undue advantage". When we come to look at the more complex item types in Chapter 3, readers may feel that Vernon has a point. The sort of coaching which is most likely to go on involves the collections of items I referred to somewhat disparagingly in the introduction. Teachers may believe that having their students work through these productions is the best way of preparing for the examination, but they may be deluding themselves. Covering the subject matter is one thing, mastering the technique another. These collections of items may leave the candidate short of both objectives. Reviewing the impact of multiple choice on English Language testing in three African nations, Ghana, Nigeria and Ethiopia, Forrest (1975) maintained that the most regrettable effect everywhere is the amount of time teachers give to working objective questions in class, but added that better trained teachers find that multiple choice gives them scope for better teaching - it is the weaker ones who resort to undesirable methods. Whether the net result of multiple choice coaching activity is any more serious in scale or effect than the preparations which are made for essay and other kinds of tests one simply does not know. There is a greater need for coaching in writing skills, if the comments of examiners are anything to go by.

Coming now to the backwash which affects learning, it is sometimes claimed that multiple choice perpetuates false concepts or, what amounts to the same thing, over-simplifies events and relationships through the limitations of the format. "If you teach history with a view to circulating some idea of the toleration of the other person's point of view, not only does multiple choice not test this but it tends to have the opposite effect, with harmful effects on the proper study of history" was the comment of one teacher in the discussion following Nuttall's (1974) paper. This comment, of course, harks back to the earlier discussion of the relativity of knowledge, and the varying degrees of sophistication with which it can be handled. I can understand this particular teacher feeling sore at having to suffer what he would regard as a regression to "black and white" judgements, but I wonder if he was triggered off by one or two clumsily phrased items which, I am afraid, are often the ones the public sees. Whether or not multiple choice actually reinforces wrong answers is a moot question.
Taking as his point of departure Skinner's (1961) dictum that "every wrong answer on a multiple choice test increases the probability that a student will someday dredge out of his memory the wrong answer instead of the right one", Preston (1965) attempted to test the influence of wrong answers (and right answers) upon students' grasp of vocabulary within the same hour. The conditioning effect of wrong selections was demonstrated for some words but not for others. Karraker (1967) obtained a more positive result when he found that a group exposed to plausible wrong responses without being told the correct answers made more errors on a later test than another group who were told the correct answers. Eklund (1968), having carried out a thorough experimental study of the question, maintained that the use of multiple choice in the earlier stages of the learning process may involve considerable risks of negative effects but that later on these risks seem to become much less marked. This is interesting when we consider the terminal nature of examinations and the fact that they often signal discontinuities in learning.

How much candidates remember after examinations is in any case debatable. Miller and Parlett (1974, p.107) put forward the idea that examinations actually serve to clear the memory rather than reinforce existing knowledge, correct or incorrect. This may sound an odd function of an examination but Miller and Parlett claim that, unless "rehearsed" or used, detailed recall of factual information drops rapidly after an examination, a claim we might all echo from our experience. The best way of mitigating forgetting is to give immediate feedback of results. In what is rare unanimity, the sundry studies in this area (Berglund, 1969; Zontine, Richards and Strang, 1972; Beeson, 1973; and the references in Strang and Rust, 1973; Betz and Weiss, 1976(a), 1976(b)) all claim that immediate knowledge of results, item by item or at the end of the test, enhances learning. Even if it is not feasible in connection with public examinations, in the classroom, where diagnosis and repair are the critical activities, immediate feedback is certainly possible and should always be given.

I have not mentioned what for many people is the real objection to multiple choice - the opportunity it offers for blind guessing. That a candidate can deceive the examiner by obtaining the correct answer when in a state of ignorance cannot be denied - there is no way of stopping it - but as I shall make clear in Chapter 5, I do not see this as a grave impediment. Besides, the opportunity for guessing exists with the traditional type of questions, although this is seldom remarked upon. In particular, traditional essays invariably require the candidate to guess which parts of his knowledge are going to appeal to the examiner (Cross, 1972). As in other instances, multiple choice tests are vulnerable to the guessing charge because statistical evidence can be adduced, whereas for essay papers it is so much harder to come by.

Critics of multiple choice testing are inclined to apply double standards. Not only do they expect multiple choice to be something it is not, but they subject it to tougher criteria than they apply to other techniques. Dudley (1973), for instance, in the medical context, criticises multiple choice on the grounds that it fails to test adequately all aspects of an individual's knowledge.
This is about as fair as complaining about a stethoscope because it cannot be used to examine eyes and ears. I do not say that multiple choice is above reproach; what I do say is that it must be viewed in context, and fairly. American critics are entitled to be worried about what they see as the adverse effects of multiple choice in the USA, but when criticism turns into crude caricature and obsessive vilification we should know when to part company. Besides, much work has gone into stretching the basic multiple choice form in an effort to test what are sometimes called "higher order" skills. What these skills might be is the subject of the next chapter.

SUMMARY

1. Just as the multiple choice test originated in the USA, so most of the strongest criticism has come from there, particularly from Banesh Hoffman and Jacques Barzun. One reason for this is the exclusive use of multiple choice in school and college testing programmes, which deprives students of the opportunity to express themselves in writing.

2. The British situation is quite different. If anything, too much emphasis has been given to writing. Multiple choice seldom attracts as much as 50 per cent weighting in external school examinations; generally the figure is in the region of 30 to 40 per cent.

3. Multiple choice serves a distinct assessment function. It makes the candidate concentrate on thinking about problems without requiring the extended writing which can often be irrelevant and worthless, given the time-trial conditions of examinations. Yet critics want it to be something it is not, complaining that it cannot measure things like "toleration of the other man's point of view", when no one ever claimed that it could. Multiple choice has faults but so do other techniques. One point in its favour is that it leaves the candidate in no doubt about what he has to do, unlike the essay test where he has to guess what the examiner expects from him.

4. Multiple choice is criticised for encouraging students to think of knowledge as cut and dried and for penalising clever students who see ambiguities their duller colleagues do not. It should be remembered, however, that knowledge is always provisional and that what is a sophisticated viewpoint to one group is simple-minded to a more mature group. Examinations codify what is accepted as "approved" knowledge at any given time. Because examiners reveal themselves more openly through multiple choice, it provides a convenient target for critics who resist the idea that knowledge is packaged in standardised form.

5. Concerning the "backwash" effects of multiple choice, few "hard" data are available. We simply do not know if multiple choice helps to perpetuate false concepts and misinformation or leads to more superficial learning than would have occurred otherwise. Nor do we know how much coaching of multiple choice answering techniques goes on, nor what payoff accrues. Information on these matters is not necessarily required, but those who pontificate on the baleful effects of multiple choice ought to realise how little is known.


2. Recall, Recognition and Beyond

"Taking an objective test is simply pointing. It calls for the least effort of mind above that of keeping awake - recognition." (Barzun, 1959)

Hoffman and Barzun scorn multiple choice because in their minds it calls for lowly recognition and nothing else. This analysis simply will not do.
Apart from playing down the psychological complexity of what recognition may entail, it fails to account for what happens when candidates are obliged to work out an answer and then search among the alternatives for it. Most mathematical calculations and data interpretations come into this category. Even a solution achieved by exhaustive substitution of the alternatives into an expression cannot be said to be a case of recognition. As far as I can see, the Hoffman/Barzun critique is based on one kind of item only, the simplest form of multiple choice involving memory for facts, formulas, words etc. Here is an example of the sort of item Barzun and Hoffman regard as the enemy (Barzun, 1959, p.139):

"'Emperor' is the name of (a) a string quartet (b) a piano concerto (c) a violin sonata."

Readers who have not immediately spotted the flaw in this item should know that while (b) is the official answer, one of Haydn's quartets is also called the 'Emperor'. Consider now an altogether different type of item, reproduced below. Only by grotesque twisting of the meaning of the word could anyone seriously claim that the answer can be reached by recognition.

Output per worker per annum

                  Urbania    Ruralia
Steel (tons)         10          2
Wheat (tons)         40         30

The table shows the cost of wheat in terms of steel, or vice versa, before the opening of trade. Assume that these costs remain constant for all levels of output, and that there are no transport costs. Which of the following will take place?

A  Urbania will export steel
B  Ruralia will export steel
C  Urbania will export both steel and wheat
D  Urbania will export steel and Ruralia will export wheat
E  It is impossible to predict until the terms of trade are known

(University of London, A-level Economics Paper 2, Summer 1972)

"Recognition is easier because, under comparable conditions, the presence of a target word facilitates access to stored information", remarks the editor of a recent collection of papers by experimental psychologists (Brown, 1976), and this view is supported by the results of psychometric studies (Heim and Watts, 1967) in which the same questions have been asked in open-ended and multiple choice form. It might therefore be thought that if something can be recalled, it can be recognised as a matter of course. Not so, according to Tulving (in Brown, 1976), who reports that recognition failure of recallable words can occur more often than might be supposed. Nor need this surprise us, for if, as modern experimental psychologists maintain, recognition and recall are distinct processes, the easier activity is not necessarily contained in the more difficult.

It would be as well, then, not to write off recognition as trivial or rudimentary. Where the object is to test factual knowledge it has its place, as does recall where questions are open-ended. But, of course, both recognition and recall are memory functions. Where candidates are obliged to engage in mental operations, as in the question above, recall on its own will not be enough. To be sure, successful recall may supply elements which facilitate problem solving, but the realisation of a solution will depend on the activation of other psychological processes. What these processes might be is anyone's guess - it is customary to give them names like "quantitative reasoning" or "concept formation" or "ability to interpret data" or, more generally, "higher order" skills. It is to these that I want to turn my attention.
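Before doing so, it may help to spell out the working that the economics item above seems to demand, since this is precisely the kind of reasoning that cannot be done by recognition. The sketch below is my own reconstruction of the standard comparative advantage argument, not the board's published solution, and it relies on the table as reconstructed above.

```python
# A worked sketch (mine, not the examining board's) of the reasoning the
# economics item appears to call for: compare the opportunity cost of steel
# in the two countries and let comparative advantage decide the trade pattern.

output_per_worker = {
    "Urbania": {"steel": 10, "wheat": 40},
    "Ruralia": {"steel": 2, "wheat": 30},
}

for country, goods in output_per_worker.items():
    # Tons of wheat forgone for each ton of steel produced.
    cost_of_steel = goods["wheat"] / goods["steel"]
    print(f"{country}: 1 ton of steel costs {cost_of_steel:.1f} tons of wheat")

# Prints 4.0 for Urbania and 15.0 for Ruralia. Steel is relatively cheap in
# Urbania and wheat relatively cheap in Ruralia, so on the comparative
# advantage argument Urbania exports steel and Ruralia exports wheat.
```

On that reading the intended answer would be the option stating that Urbania will export steel and Ruralia will export wheat; the point here is only that the candidate has to derive it, not recognise it.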
HIGHER ORDER SKILLS

"The long experience with objective tests has demonstrated that there are hardly any of the so-called 'higher' mental processes that cannot be tested with objective tests." (Ashford, 1972, p.421)

Although seldom expressed so blithely, the claim that multiple choice items can test "higher order" skills is often encountered. In the sense that multiple choice can test more than memory, the claim is correct, as we have seen, but one sometimes gets the impression that those who make the claim want to believe it is true but underneath are uneasy about the supportive evidence. The trouble is that while we all think we know what we mean by "higher order" skills, terms like "originality" or "abstract reasoning" themselves beg questions which lead all the way back to satisfactory definitions of what constitutes skill X or Y.

It is fair to say that the drive to test higher order skills via multiple choice dates from the publication, twenty years ago, of Volume 1 of the Taxonomy relating to the cognitive domain (Bloom et al, 1956). Certainly no other work has been so influential in shaping our thoughts about cognitive skills. The trouble is that too many people have accepted the Taxonomy uncritically. Knowledge, Comprehension, Application, Evaluation and Synthesis are still bandied about as if they were eternal verities instead of being hypothetical constructs constantly in need of verification. Wilson (1970, p.23) has expressed nicely the value and limitations of the Taxonomy, or versions of it. Referring to skills, he writes, "They are extruded in a valiant attempt to create some order out of the complexities of the situation. As tentative crutches to test writing and curriculum development they are useful, but we must beware of ascribing to them more permanence and reality than they deserve". Prefacing examination syllabuses with some preamble like "Knowledge will attract 15 per cent of the marks, Comprehension 20 per cent" and so forth, even with a qualification that these figures are only approximate, nevertheless conveys a precision which is simply not justified in the present state of our knowledge.

This is not the place to appraise the Taxonomy in depth. Another work in this series (de Landsheere, 1977) does just that. What I cannot avoid, though, is to examine the psychological status of the Taxonomy to see how far it constitutes a plausible model of cognitive processes, and therefore of higher order skills. At the present time, attitudes to the Taxonomy range from more or less uncritical acceptance, e.g. "Where Bloom's discipline-free Taxonomy of Educational Objectives is used the cognitive skills are unambiguously defined in respect of the thought processes which go on in an individual student's mind" (Cambridge Test Development and Research Unit, 1975, p.6), through wary endorsement along the lines "It may leave a lot to be desired but it is the best Taxonomy we've got", to downright hostility (Sockett, 1971; Pring, 1971; Ormell, 1974). By and large I would say that the Taxonomy's influence is now as much on the decline as it was in the ascendant ten years ago, when indeed I was promoting it myself (Wood, 1968), although not entirely without reservation. The overriding criticism, as Sockett (1971, p.17) sees it, is that "the Taxonomy operates with a naive theory of knowledge which cannot be ignored however classificatory and neutral its intentions".
In particular, he rejects the division into Knowledge and Intellectual Skills and Abilities, claiming that in the things we are said to "know" there are necessarily embedded all manner of "intellectual skills and abilities", e.g. to know a concept is to understand it. One is bound to say that the organisation of the Taxonomy is remarkably ad hoc, not grounded in any psychological principles other than that knowledge is straightforward and anything involving mental operations is more difficult. As Sockett puts it (p.23-24), "to rank them (cognitive processes) in a simple-complex hierarchy either means that as a matter of fact people find reasoning more difficult than remembering - which may or may not be true - or that there are logical complexities in reasoning not present in remembering, which has not been shown".

If we ask whether the proof is in the pudding, the evidence from the various attempts at empirical validation is not impressive. The standard device of asking judges to say what they think an item is measuring has revealed, as far as the higher Bloom Taxonomy categories are concerned, that agreement is the exception rather than the rule (Poole, 1972; Fairbrother, 1975). Since the exercise is like asking people to sort fruit into apples, oranges, bananas etc. without giving them more than the vaguest idea of what an apple or a banana looks like, this is hardly surprising. It follows that attempts to verify or even re-constitute the hierarchical structure of the Taxonomy (Kropp et al, 1966; Seddon and Stolz, 1973), which have by and large failed to verify the hypothesised structure, are doomed in any case, for if the measures are "dirty" to start with nothing "clean" is going to come out. Of course, there is always the correlational approach to validating and classifying items, that is finding items which seem to cluster together and/or relate to some external criterion. The weakness of this approach, as Levy (1973) has observed, is that "we might know less about the tests we drag in to help us understand the test of interest than we already know about the test".

The failure of these fishing expeditions to verify hypothesised models of psychological processes indicates to me that we have been in too much of a hurry to build systems. We have skipped a few stages in the development. What we should have been doing was to fasten on to modest competencies like "knowledge of physics concept X" and make sure we could measure them. Although a systems-builder himself, Gagné (1970(a)) recognised the necessity for doing this as the first order of business. For measurement to be authentic, he says, it must be both distinctive and distortion-free. The problem of distinctive measurement is that of identifying precisely what is being measured. Keeping measurement distortion-free means reducing, as far as possible, the "noise" which factors such as marker error, quotidian variability among candidates and blind guessing introduce into measurement operations. Put like this, distinctiveness and freedom from distortion would appear to be "validity" and "reliability" thinly disguised, and there is some truth in this, particularly where freedom from distortion is concerned. With distinctiveness, however, Gagné wishes to take a tougher attitude towards validity. Working within a learning theory framework, he believes that only when suitable controls are employed can dependable conclusions be drawn about what is being measured.
"Distinctiveness in measurement has the aim of ruling out the observation of one category of capability as opposed to some other capability" (Gag&, 1970(a), p.111). Gag& imagines a two-stage measurement in which the first stage acts as a control to ascertain whether prior information needed to answer the second stage is present. Levy (1973, p.32) suggests a similar procedure for investigating discrepancies in behaviour. If it is "knowledge of principles" we are after we must make sure we are testing this and not "knowledge of concepts". To use one of Gagng's examples, the principle that "a parallelepiped is a prism whose bases are parallelograms" may not have been learned because the learner has not acquired one or more of its component concepts, whether "prism", "base" or "parallelogram". Thus the first stage of measurement would be to determine whether these concepts have been acquired. The idea that items should measure one concept or principle or fact at a time, and not a mixture of unknown proportions, is, of course, hardly new. Back in 1929, in a marvellous book which would still be instructive reading for examiners, Hamilton was hammering home the message: “In deciding how much information he shall give the candidate, or how much guidance by controlling clauses, the examiner will, of course, be guided principally by the indication he wants the candidate's answers to have. His chief aim in setting the question is to test the candidate's power of dealing with the volumes of such compound solids as the sausage-shaped gas-bag, but if he omits to provide the formula, he will clearly fail to test that power in a candidate who does not happen to remember the formula." (Hamilton, 1929, Chapter 6) Items which cannot be answered because too little information is given, laboratory-style items which can be answered without doing the experiments, English Language comprehension items which can be answered without reading the passage (Preston, 1964; Tuinman, 1972) and modern language comprehension items which can be answered with little or no knowledge of the language - all these are instances of lack of distinctiveness attributable to failure to assess learning which is supposed to have occurred. Where comprehension items are concerned, it has been suggested that a useful check on distinctiveness is to administer the items without their associated passages in order to determine their passagedependence (Pyrczak, 1972, 1974). Prescott (1970) makes a similar suggestion in connection with modern language comprehension items, except that he wants the items to be tried out on people who have not been taught the language so that he can find out exactly how much the items depend on acquisition of the Multiple Choice: A State of the Art Report language. subjects. 207 The idea is a good one but difficult to put over to experimental That the need for distinctiveness of measurement can be overlooked is illustrated by the following example from an item-writing text book (Brown, 1966, p.27). The item and the accompanying commentary are reproduced below. In cases of myopia, spectacles are needed which contain: a. b. c. d. convergent lenses coloured lenses to reduce glare lenses to block ultra-violet light divergent lenses "A useful type, first since 'myopia' had to be correctly identified with 'short sight', from which the principle of divergence to 'lengthen' the sight had to be deduced. Option a was clearly a direct opposite but b and c also distracted weaker pupils who had to guess the meaning of 'myopia'." 
To be fair, I ought to add that elsewhere in the book Brown shows a lively awareness of the need for distinctiveness, as when, writing about practical science, he argues for items which can only be answered successfully if the candidate has undergone a practical course in the laboratory.

It seems to me that the often heard criticism of the Taxonomy - that what is "Comprehension" to one person is "Knowledge" to another, or what is genuine "Application" to one is routine to another - is attributable to a failure to pay enough attention to distinctiveness. Granted, we know precious little about what happens when a person encounters an item (Fiske, 1968). Introspection studies (e.g. Handy and Johnstone, 1973) have been notably uninformative, mainly because candidates have difficulty describing what they did some time after the event. Yet too often items have been so loosely worded as to permit a variety of problem-solving strategies and therefore a variety of opinions as to what the item might have been measuring. Gagné (1970(a)) takes the Bloom Taxonomy to task for perpetuating this kind of confusion, citing one of the illustrative items which is meant to be measuring knowledge of a principle but which might well be measuring no more than a verbal association.

But does it matter how a candidate arrives at an answer? Surely it is likely to make little difference whether a distinction can or need be drawn between the learning of a concept, say, and the learning of a principle, since those who "know more" are going to learn faster and achieve better ultimate performances regardless of what the particular components of their capabilities are. I think it does matter. In the first place, achievement may not be as ordered and sequential as this proposition implies. We do not know enough about the growth of skills to be sure that knowledge of X implies knowledge of Y. In the second place, the sanguine belief that ability will show through regardless can have a powerful effect on teaching; in particular, it may encourage teachers to present material prematurely or pitch it at too high a conceptual level, omitting the intermediate steps. Shayer (1972) has remarked on these tendencies in connection with the Nuffield O-level Physics syllabus. The kind of examination question where examiners have not bothered to ascertain whether the basics have been assimilated but have moved immediately to test higher levels of understanding only blurs the measurement. Unfortunately the practice of lumping together performance on all items into a test score does little to encourage belief in distinctiveness as something worth having. Were we to move towards two-stage or multi-stage measurement rather than depend on the single item (alone or in collections) then, as Gagné points out (1970(a), p.124), we would have to devise new scoring procedures and testing would become quite different from what we are used to.

Valuable though the notions of distinctiveness and freedom from distortion are as measurement requirements, they are of limited use when it comes to relating higher order skills and systematising. The way forward depends, I believe, on developing a keener understanding of how learning cumulates in individuals. As I have already indicated, I do not think the Taxonomy provides an adequate description of psychological processes, much less promotes understanding of how behaviour is organised into abilities. Nor am I convinced that Gagné's own theory of learning (Gagné, 1970(b)) is the answer.
I would not want to dismiss any enterprise which attempts to understand basic processes of learning, but where complex constellations of skills are concerned I have doubts about the utility of an atomistic model of learning. Anyone who has studied reports of attempts to validate hypothesised hierarchical sequences of learning will know how complicated and elaborate an analysis of the acquisition of even a simple skill can be (see, for instance, Resnick, Siegel and Kresh, 1971).

Where does this leave us? If in Gagné's scheme the learning networks are so intricate that one is in danger of not seeing the wood for the trees, other models of intellectual growth seem all too loose and vague. Levy (1973) maintains that the simplex - which means a cumulative hierarchy like the Taxonomy - should be regarded as the model of growth but gives little indication as to how it might work out in practice. Anastasi (1970) makes some persuasive speculations about how traits or abilities develop and become differentiated, which help to clarify at a macro-level how learning may occur, but leaves us little wiser concerning the nature of abilities. There is Piaget's theory and its derivatives, of course, and in this connection there have been some interesting attempts to elucidate how scientific concepts develop in adolescents (e.g. Shayer, Küchemann and Wylam, 1975). The object of this work, which is to determine what to teach (test) when, seems to me absolutely right and offers, I am sure, the best chance of arriving at a coherent view of how abilities develop and articulate. For the time being, though, I imagine we shall continue to proceed pragmatically, attempting to measure this ability or that skill - "ability to see historical connections", "ability to read graphs" etc. - whenever they seem appropriate in the context of a particular subject area without necessarily worrying how they relate, if at all.

Actually this may be no bad thing providing the analysis of what skills are important is penetrating. In this connection, Wyatt's (1974) article makes suggestive reading. Writing of university science examinations, he argues that for each student we might wish to know: "How much subject matter he knows; how well he communicates both orally and in writing; how well he reasons from and about the data and ideas of the subject; how well he makes relevant observations of material; how far he is familiar with and uses the literature and books; how well he can design experiments; how well he can handle apparatus and improvise or make his own; how far he can be trusted with materials; how skilled he is at exhibiting his results; how skilled he is with mathematical, statistical and graphic manipulation of data". Obviously multiple choice cannot be used to measure each of these skills or even most of them, but enumeration of lists like these at least makes it easier to decide which testing technique is likely to work best for each skill.

In the next chapter I will discuss, with illustrations, how the simple multiple choice form has been extended into different item types in a bid to measure abilities other than factual recognition and recall. It will become evident how the best of these item types succeed in controlling the candidate's problem-solving behaviour, but also what a ragbag of almost arbitrarily chosen skills they appear to elicit, a state of affairs which only underlines that we test what is easiest to test, knowing all the time that it is not enough.

SUMMARY
1. Multiple choice items can demand more than recognition, despite what the more hostile critics say. Whenever candidates are obliged to work out an answer and then search among the alternatives for it, processes other than recognition, which we generally call higher order skills, are activated.

2. Attempts to describe and classify these higher order skills have amounted to very little. Bloom's Taxonomy has promised more than it has delivered. Generally speaking, denotation and measurement of higher order skills has proceeded in an ad hoc fashion according to the subject matter. However, the failure to substantiate taxonomies of skills may not matter providing a penetrating analysis of what students ought to be able to do is carried out. It is suggested that more attention should be given to measuring what we say we are measuring, and in this connection Gagné's notions of distinctiveness and freedom from distortion are discussed.


3. Item Types

"Choice-type items can be constructed to assess complex achievement in any area of study." (Senathirajah and Weiss, 1971)

Two approaches to measuring skills other than factual recall, classification or computation can be distinguished. One tries to make the most of the basic multiple choice form by loading it with more data and making the candidate reason, interpret and evaluate, while the other throws problems into different forms or item types which oblige the candidate to engage in certain kinds of thinking before choosing an answer in the usual way. The development of these item types can be seen as an attempt to control and localise the deployment of higher order skills.

The danger with increasing the information load is, of course, that items can become turgid and even obscure. This item, taken from the Cambridge Test Development and Research Unit (TDRU) handbook for item writers (TDRU, 1975), illustrates how the difficulty of giving candidates enough information to make the problem believable, without swamping them, can be overcome.

If the county council responsible for the north west corner of Scotland had to choose between the construction of a furniture polish factory which would employ 50 people and a hydro-electric power station it should choose:

a. The factory, because the salty and humid temperature causes a rapid decay of exposed wood.
b. The factory, because the long-term gain in employment would be greater than that which the power station could provide.
c. The factory, because it would make use of the natural resources of the region to a greater extent than the power station could.
d. The power station, because it would result in a large number of highly paid construction workers being attracted into the region.
e. The power station, because the power production in the Highlands is insufficient to meet the needs of this part of Scotland.

The item seems to be measuring appreciation of the relative importance of economic and social factors. If it seems too wordy, the reader might look at the illustrative items for the higher Bloom Taxonomy categories and consider whether Vernon (1964, p.11) was being too kind when he suggested that many readers will find them "excessively verbose, or even perversely complicated". I happen to think that the item is not too wordy, but there is no getting away from the fact that items like this make considerable demands on candidates' reading comprehension.
This in turn can threaten the distinctiveness of the measurement; if candidates have difficulty understanding the question or the instructions - a point I will discuss when I come to other item types - there must be doubt as to what their responses mean.

Now, of course, it is perfectly true, as Vernon observes, that all examinations involve a common element of reading comprehension, of understanding the questions and coping with the medium. At the same time it is desirable that the candidate should be handicapped as little as possible by having to learn the medium as well as the subject, the principle being that examinations should take as natural a form as possible. This places multiple choice item writers in something of a dilemma. On the one hand the need for distinctive measurement obliges them to exercise what Hamilton (1929) called "guidance by controlling clauses", yet the provision of this guidance inevitably demands more reading from candidates. Exactly the same dilemma faces the compiler of essay questions. Nor is there any instant remedy. The hope must be that when formulating questions examiners will use language in a straightforward, cogent and effective manner, remembering that the cause of candidates is not advanced by reducing the wording of questions to a minimum.

ITEM TYPES OTHER THAN SIMPLE MULTIPLE CHOICE

When considering an item type the first thing to ask is whether it performs some special measurement function or whether, to put it bluntly, it has any functional basis at all. An item type may be invented more from a desire for diversity and novelty than from a concern to satisfy a measurement need. Gagné (1970(a)), for one, has argued, rightly in my view, that we have a set of testing techniques and some measurement problems but that the two do not necessarily correspond. When evaluating an item type, we should ask ourselves "Does it do something different?", "Does it test something worth testing?", "Is the format comprehensible to the average candidate?" and, above all, "Could the problem be handled just as well within an existing item type, especially simple multiple choice?"

The first item type to be discussed - the true/false type - is different from the rest in that it is a primitive form of multiple choice rather than an embellishment.

True-false

Of all the alternatives to simple multiple choice the ordinary true-false (TF) item has been subjected to most criticism. The reasons are obvious; the possibility of distorting measurements through guessing is great, or so it appears, and there would seem to be limited opportunity to ask probing questions. For some time now, Ebel (1970, 1971) has been promoting the TF item but his seems to be a lone voice. As regards guessing, Ebel discounts it as a serious factor, believing that when there is enough time and the questions are serious, people will rarely guess. He also believes that TF items can measure profound thought, his grounds being that the essence of acquiring knowledge is the successive judgement of true-false propositions. This is a claim which readers will have to evaluate for themselves. Personally I am sceptical. Where the acquisition of knowledge or skills can be programmed in the form of an algorithm, e.g. the assembly of apparatus, Ebel's claim has some validity but where knowledge comes about through complex association and synthesis, as it often does, then a more sophisticated explanation is required.
It is significant that Ebel believes TF items to be most effective in teaching/learning situations. Inasmuch as the teacher may expect to get more honest responses and to cover ground quickly, one can see what he means. Actually, the whole multiple choice genre has to be viewed differently in the context of a teaching situation compared to that of a public examination. In particular, restrictions about wording can be relaxed because the teacher is presumably at hand and willing to clarify items if necessary. Moreover, since the teacher and not a machine will be doing the marking, the form can be extended to allow candidates to volunteer responses either in defence of an answer and/or in criticism of an option or options. This is multiple choice at a very informal and informative level, and there is no reason why teachers should not use true/false items as long as they know what they are doing.

(Reproduced by permission of United Features Syndicate Inc.)

One recent investigation into the setting of true-false items is perhaps worth mentioning. Peterson and Peterson (1976) asked some students to read a prose passage and then respond to items based on it which were either true or false and were phrased either affirmatively or negatively. Thus, for example, the facts which read: "The mud mounds so typical of flamingo nests elsewhere did not appear in this colony; there was no mud with which to build them. Instead the birds laid their eggs on the bare lava rock" yielded these four true/false items:

1. The flamingoes in the colony laid their eggs on bare rock. (true affirmative)
2. The flamingoes in the colony built nests of mud. (false affirmative)
3. The flamingoes in the colony did not build nests of mud. (true negative)
4. The flamingoes in the colony did not lay their eggs on bare rock. (false negative)

It was found that true negatives yielded most errors, followed by false negatives, true affirmatives and false affirmatives. Peterson and Peterson concluded that if test constructors wish to make true-false items more difficult the correct policy is not to include more false than true statements in the test, as Ebel (1971) suggested, but rather to include more statements phrased negatively. It should be mentioned that the results of this study differed somewhat from those of an earlier study by Wason (1961) who found true affirmatives to be no easier than false affirmatives, although on the finding that true negatives are harder to verify than false negatives the two studies are in agreement. Unfortunately neither study can be regarded as authoritative; Peterson and Peterson's, in particular, is almost a caricature of the typical psychology experiment. You will see what I mean from their description of the subjects: "Forty-four students (ten males and thirty-four females) from the introductory psychology course at Northern Illinois University volunteered for the experiment and thereby added bonus points to their course grade". Nor did these small numbers inhibit the investigators from carrying out significance tests, although mercifully they refrained from testing for sex differences in response to the items.

Multiple true-false

The process of answering a multiple choice item can be thought of as comprising a number of true-false decisions rolled into one, the understanding being that one answer is true and the rest are false.
By contrast, there is another type of item - called by some "multiple true-false" (Hill and Woods, 1974) - where each of the statements relating to a situation can be true or false. This type is widely used in medical examining, where it is sometimes known as the "indeterminate" type, e.g. in the University of London. Ten years or so ago it enjoyed a vogue in connection with CSE experimental mathematics papers under the name "multi-facet" (Schools Council, 1965). Here is an example (T, F and D/K indicate True, False and Don't Know respectively):

A measurement, after a process of calculation, appears as '2.6038 metres'.

T F D/K
(a) The measurement is 2.60m. to the nearest cm.
(b) The measurement is 2.604m. to the nearest mm.
(c) The measurement is 2.60m. to two significant figures.
(d) The measurement is 260.38cm. to two significant figures.
(e) The measurement is 260cm. to three significant figures.

The attraction of exploiting a situation from a number of points of view is obvious. One objection which has been raised against this item type is that getting one facet right could well depend on getting another right. The orthodox view, and this applies to items in general, is that efforts should be made to keep items independent of each other in the sense that the probability of a person getting an item right is not affected by his responses to other items. How realistic this requirement is is anyone's guess; my feeling is that items do and perhaps should inter-relate. To put it another way, are items on the same subject matter ever truly discrete? Certainly, if Gagné's two-stage measurement procedure or something like it were to be realised, the items would be intimately related and new scoring formulas and much else would be required (Gagné, 1970(a), p.124).

Multiple completion

The multiple completion or selection type of item requires the candidate to choose a pre-coded combination of statements, as in the following example from the London GCE board (University of London, 1975, p.24).

In the question below, ONE or MORE of the responses given are correct. Decide which of the responses is (are) correct. Then choose:

A. If 1, 2 and 3 are all correct
B. If 1 and 2 only are correct
C. If 2 and 3 only are correct
D. If 1 only is correct
E. If 3 only is correct

Which of the following would be desirable if the level of unemployment was too high?

1. An increase in saving
2. A rise in exports
3. A rise in the school leaving age

The difference between this item type and the multiple true-false type is one of function. The multiple true-false question is usually just a set of questions or propositions which, although relating to the same theme or situation, are not necessarily joined structurally or organically, whereas the multiple completion item can be used to probe understanding of multiple causes and consequences of events or complex relationships and therefore has much more range. For this reason, I would rate the multiple completion type as more rewarding in principle. However, much depends on the item writer's ability to make the most of the combinations so that, if you like, the whole is more than the sum of the parts. In this connection, the Cambridge handbook (TDRU, 1975, p.20-21) has something apposite to say, "Many item writers find the Multiple Selection type of item the easiest kind to write, which is not surprising if one looks upon this type as being little more than three true/false questions, linked, of course, by a common theme.
In writing Multiple Selection items, every effort should be made to make them imaginative and to consider carefully how the candidates will look upon, not only the statements, but also the possible combination of statements, in order to aim for the highest possible discrimination power."

The snag with multiple completion items is that they require the use of a response code. If it is reckoned that in addition to coding his answer, the candidate has to transfer it to an answer sheet, a task which has been shown by Muller, Calhoun and Orling (1972) and others referred to therein to produce more errors than occur when answers are marked directly in the test booklet, the possibilities for error will be apparent. The likelihood of distortion is increased by the fact that the coding structure contains information which candidates can use to their advantage. If a candidate can definitely rule out one statement, he can narrow down the choice between alternative answers, and the more statements there are the more clues are given away (a small illustration appears at the end of this section). To prevent this sort of thing happening it has been suggested that "any other statement or combination of statements" might be used as an option, but the TDRU handbook claims that it is difficult to obtain statistically sound items with this option as a key. I am not opposed to the use of this particular option but I recognise that it may introduce an imbalance into the item which is liable to threaten the coherence of the problem (see the discussion of the 'none of these' option in Chapter 4).

Evidence that the coding procedure does introduce error, at least among the less able, has been presented by Wright (1975). An unpublished study of my own (Wood, 1974), which used GCE O-level items rather than the very easy items used by Wright, revealed that the coding structure did work in favour of the more able, as expected. The obvious way to dispense with coding structures would be to ask candidates to make multiple responses directly on the answer sheet, and to program the mark-sensor and score accumulator accordingly. This is common practice in medical examining (see, for instance, Lever, Harden, Wilson and Jolley, 1970). The result is a harder, and a fairer, item but would candidates be confused if a multiple completion section requiring multiple marking were to be placed in a test which otherwise required single marks in the conventional manner? My investigation, although not conclusive, suggests that mixing the mode of response is unlikely to worry candidates any more than the other switching they have to do in the course of a typical GCE O- or A-level test. Besides, the multiple completion section could always be placed at the end of the test.

I am drawn to the view that without a system of multiple responding the multiple completion item type is too prone to give distorted results. The directions used by the London GCE board (see the example) are capable of improvement but even at its most lucid and compact the rubric would still worry some candidates. Some would say that these candidates would make a hash of the items anyway, but even if this were true I see no reason to compound the superiority of the cleverer candidates. Agreed we give scope to intelligence in many different ways, often without realising it, but where the opportunity exists to stop blatant advantage it should be taken.
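To make the point about clue leakage concrete, here is a minimal sketch in Python (my own illustration - the names RESPONSE_CODE and codes_still_possible are not taken from any board's documentation) of the London response code shown above and of how ruling out a single statement prunes the codes that remain available:

    # Minimal sketch of the London multiple completion response code.
    # Each letter maps to the set of statements it asserts to be correct.
    RESPONSE_CODE = {
        "A": {1, 2, 3},   # 1, 2 and 3 are all correct
        "B": {1, 2},      # 1 and 2 only are correct
        "C": {2, 3},      # 2 and 3 only are correct
        "D": {1},         # 1 only is correct
        "E": {3},         # 3 only is correct
    }

    def codes_still_possible(ruled_out):
        # Letters that remain once the candidate is sure certain
        # statements are wrong - the 'free' information in the code.
        return sorted(letter for letter, statements in RESPONSE_CODE.items()
                      if not (statements & ruled_out))

    # A candidate who can only rule out statement 1 is already down to two options.
    print(codes_still_possible({1}))   # ['C', 'E']

Even without any positive knowledge, eliminating one statement takes the candidate from five lettered options to two, which is precisely the sort of advantage a multiple marking arrangement removes.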
Assertion-reason

The assertion-reason item was devised for the purpose of ascertaining candidates' grasp of causality. In terms of the strong feelings it arouses, this item type is not far behind the true-false item, of which it is a variant. Once again the candidate has to cope with involved directions, but the more serious objections concern the logical status of the task itself. The directions used by the London GCE board are reproduced below together with a sample item (University of London, 1975).

Each of the following questions consists of a statement in the left-hand column followed by a statement in the right-hand column. Decide whether the first statement is true or false. Decide whether the second statement is true or false. Then on the answer sheet mark:

A. If both statements are true and the second statement is a correct explanation of the first statement.
B. If both statements are true and the second statement is NOT a correct explanation of the first statement.
C. If the first statement is true but the second statement is false.
D. If the first statement is false but the second statement is true.
E. If both statements are false.

Directions Summarised

A. True  True  2nd statement is a correct explanation of the 1st
B. True  True  2nd statement is NOT a correct explanation of the 1st
C. True  False
D. False True
E. False False

FIRST STATEMENT: The growing season in south-west England is longer than in the south-east of England.
SECOND STATEMENT: Summers are warmer in the south-west than in the south-east of England.

The directions used by the Cambridge TDRU are much the same except that "correct" is replaced by "adequate", in my view an improvement since it makes the exercise less naive and dogmatic. On the other hand London has dropped the "BECAUSE" linking the "assertion" and the "reason", a necessary move otherwise all the options save A have no meaning. Statements such as:

The world is flat BECAUSE nature abhors a vacuum

or

Japan invaded Poland BECAUSE Hitler bombed Pearl Harbour

which would correspond to options D and E respectively, are, of course, nonsense. The weakness of this type of item is, in fact, that statements have to be considered not as an integrated entity but in two parts. This means that, as in my absurd examples, the two parts need not necessarily bear any relationship to each other, although the need for credibility usually ensures that they do. I am afraid that Banesh Hoffman would make mincemeat of some of the assertion-reason items I have seen and I would find it hard to quarrel with him.

On the statistical side, analysis of the multiple choice tests set by the London board from 1971 to 1974 (Wood, 1973(a); Quinn, 1975) shows assertion-reason items coming out consistently with lower average discrimination values than other item types. The TDRU reports the same outcome (TDRU, 1976) and suggests that the basic reason for this may lie in a failure to utilise all the options in the response code to the same extent. In particular, the option A (true-true, reason a correct explanation of the assertion) was keyed less frequently than might be expected but was popular with candidates, often proving to be the most popular of the incorrect options. Thus it would appear that candidates are inclined to believe in the propositions put before them yet item writers, particularly in Chemistry and Biology says the TDRU, seem to find genuinely correct propositions hard to contrive. My own hunch about assertion-reason items is that they are less related to school achievement and more related to intelligence (competence vs. ability, see Chapter 7) than any other type of item.
If this is true, their aspirations to distinctiveness must be seriously questioned. It might be advisable in any case to consider whether this kind of item can be rewritten in simple multiple choice format. This can often be done to good effect, as in the illustrative item from the TDRU handbook used at the beginning of the chapter.

The remaining item types to be discussed are all used in mathematics testing, although it is conceivable that they might be applied in other disciplines. They are meant to test different aspects of mathematical work and the names of the item types - data necessity, data sufficiency, quantitative comparisons - give a good idea of what is demanded. In all cases, the burden of the question is to be found in the directions so that the usual objections about undue reliance on reading comprehension apply.

Data necessity

In this kind of problem the candidate is asked to find redundant information. The directions used by the London board and an example are given below (University of London, 1975, p.21).

Directions: Each of the following questions consists of a problem followed by four pieces of information. Do not actually solve the problem, but decide whether the problem could be solved if any of the pieces of information were omitted, and choose:

A if 1 could be omitted
B if 2 could be omitted
C if 3 could be omitted
D if 4 could be omitted
E if none of the pieces of information could be omitted

What fraction of a population of adult males has a height greater than 180cm?

1 The distribution of heights is normal
2 The size of the population is 12,000
3 The mean height of the population is 175cm
4 The standard deviation of heights is 7cm

Payne and Pennycuick (1975, p.16), whose collection of items is exempt from the criticisms of this genre I made earlier, point out, and I agree with them, that this item type lacks some of the variety of others, for there are essentially only two sorts of problems - those requiring all the information for their solution and those from which one piece can be omitted. Often an idea earmarked for data necessity treatment can be better exploited in the multiple completion format, emphasising again that the same function can often be performed just as well by another item type. In general, I would suggest avoiding the data necessity format.
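As an aside, the arithmetic in this particular example is easily checked with a few lines of code (a sketch of my own; it rests on the uncontroversial observation that the required fraction never depends on the population size, so that statement 2 is the redundant piece and B presumably the keyed answer):

    import statistics

    # The fraction of adult males taller than 180cm needs normality (1),
    # the mean (3) and the standard deviation (4) - but never the
    # population size (2), which is why 2 can be omitted.
    heights = statistics.NormalDist(mu=175, sigma=7)
    fraction_over_180 = 1 - heights.cdf(180)
    print(round(fraction_over_180, 2))   # roughly 0.24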
Data sufficiency

As the example below shows (University of London, 1975, p.21) the directions are formidable, although Payne and Pennycuick (1975) show how they can be simplified.

Directions: Each of the following questions consists of a problem and two statements, 1 and 2, in which certain data are given. You are not asked to solve the problem; you have to decide whether the data given in the statements are sufficient for solving the problem. Using the data given in the statements, choose:

A if EACH statement (i.e. statement 1 ALONE and statement 2 ALONE) is sufficient by itself to solve the problem
B if statement 1 ALONE is sufficient but statement 2 alone is not sufficient to solve the problem
C if statement 2 ALONE is sufficient but statement 1 alone is not sufficient to solve the problem
D if BOTH statements 1 and 2 TOGETHER are sufficient to solve the problem, but NEITHER statement ALONE is sufficient
E if statements 1 and 2 TOGETHER are NOT sufficient to solve the problem, and additional data specific to the problem are needed

Fig. 3.1 [projectile diagram]

What is the initial velocity, V, of the projectile?

1 ... = 54m
2 ... = 42...

The concept of sufficiency is important in mathematics and this item type may be the only way of testing it. The London GCE board intends to use it in its Advanced level mathematics multiple choice tests starting in 1977 but I would not have thought it would be suitable for lower age groups studying mathematics.

Quantitative comparisons

As far as I know, this item type is not used in any British examinations or tests. It was introduced into the Scholastic Aptitude Test (SAT) by the College Board in the USA, partly as a replacement for the data necessity and sufficiency types, the instructions for which, interestingly enough, were considered too complicated for the average candidate to follow. After seeing an example of the item type with instructions (reproduced below), readers can judge for themselves whether the substitution was justified. The task presented to candidates is an easier one and even if the instructions might still confuse some candidates, it should be possible to assimilate them more quickly than those associated with other item types. I would think this item type could be used profitably in examinations for 16-year olds.

Directions: Each question in this section consists of two quantities, one in Column A and one in Column B. You are to compare the two quantities and on the answer sheet blacken space:

A if the quantity in Column A is the greater;
B if the quantity in Column B is the greater;
C if the two quantities are equal;
D if the relationship cannot be determined from the information given.

Notes: (1) In certain questions, information concerning one or both of the quantities to be compared is centred above the two columns. (2) A symbol that appears in both columns represents the same thing in Column A as it does in Column B. (3) All numbers in this test are real numbers. Letters such as x, n, and k stand for real numbers.

Column A    Column B
[Questions 15-17: three specimen comparisons]
(College Board, 1976)

That the item types just discussed call for higher order skills, however that term is defined, is incontrovertible. Being more or less memory-proof (by which I mean answers cannot be recognised or recalled intact) they impel the examinee to engage in distinctive reasoning processes. Although opportunities for elimination still exist, particularly in the case of data necessity items, candidates need to get a firm purchase on the problems in order to tackle them successfully.

I have not described all the item types that exist. The matching type of item has its place and I have not really much to say about it except that it can be a lot of work for little result. There is an item type called relationship analysis which the London board is experimenting with in its A-level mathematics tests. I choose not to describe it here because it is too specialised to mathematics (a description can be found in University of London, 1975 and Payne and Pennycuick, 1975). Relationship analysis is one of the newer item types studied, and in some cases devised, by the group responsible for constructing the experimental British Test of Academic Aptitude (TAA). One or two of these item types have found their way into attainment tests but to date the others have not caught on. Details can be found in Appendix E of Choppin and Orr (1976).
If it is asked how the mathematical item types compare in terms of statistical indices, an analysis of the London board's 1976 A-level mathematics pretests carried out by my colleague Carolyn Ferguson shows that simple multiple choice were usually the easiest and also the most discriminating items while the relationship analysis item type proved most difficult and also least discriminating. Of the other item types, data necessity generally showed up as the next easiest type after multiple choice and the next poorest discriminator after relationship analysis. Data sufficiency items showed up reasonably well in terms of discrimination but tended to be on the hard side. The finding that simple multiple choice provides the highest average discrimination agrees with the outcome of our analyses of O-level tests (Wood, 1973(a); Quinn, 1975) and also with the TDRU analysis (TDRU, 1976). To some extent this is due to multiple choice enjoying a slightly greater representation in the tests as a whole, so that it is tending to determine what is being measured, and also to the fact that the correlations between scores on the subtests formed by the different item types are lowish (0.30 - 0.50). Whether the item types are measuring different skills reliably is another matter. All we have to go on at the moment are internal consistency estimates for small numbers of items and I would not want to place too much weight on them. I might add that the analysis just discussed was provoked by complaints from schools and colleges, particularly the latter, that students, especially those of foreign origin, were experiencing difficulties with the more complicated item types. We therefore have to keep a close eye on how these item types go and whether they should all be included in the operational examinations.

The problem of deciding whether item types are contributing enough to justify their inclusion in a test is indeed a difficult one. "Does the format of a question make any difference to a candidate's performance in terms of the final outcome?" is a question one is asked periodically. What people usually mean is "Do the item types measure the same thing?" The stock method of investigating this question is to correlate the scores on the different kinds of tests or item types. If high correlations, say 0.60 or more, result, then it is customary to conclude that the tests are "measuring the same things" (Choppin and Purves, 1969; Bracht and Hopkins, 1970 and references therein; Skurnik, 1973). This being so, one or more of the tests or item types must be redundant, in which case one or more of them, preferably the more troublesome ones, can be discarded. Or so the argument goes. On the other hand, if low correlations, of say 0.50 or less, result, the tests are said to be "measuring different things", and test constructors pat themselves on the back for having brought this off.

As I have hinted, both interpretations are shaky. Low correlations may come about because the measures are unreliable. For instance, Advanced level Chemistry practical examinations show low correlations (around 0.30) with theory papers but no one can be sure whether the low correlation is genuine or whether it is due to the unreliability of the practical examination - candidates are assessed on two experiments only. It is true that correlations can be corrected for unreliability, assuming good measures of reliability are available, which they rarely are, but this correction is itself the subject of controversy, the problem being that it tends to be an "over-correction of unknown extent" (Lord and Novick, 1968, p.138). Thus corrected correlations can look higher than they really are, which is why Mellenbergh (1971), who scrutinised 80 or so studies, concluded cautiously that multiple choice and open-ended questions are sometimes operationalisations of the same variables, sometimes not.

With high correlations, the case would seem to be open and shut; one measure must be as good as the other. But this conclusion does not follow at all, as Choppin (1974(a)) has shown. Suppose, he says, that two measures X and Y correlate 0.98 and that X is found to correlate 0.50 with some other variable Z. Examination of the variance shared between the variables shows that the correlation between Y and Z may lie anywhere between 0.23 and 0.67. Thus the two measures are not necessarily interchangeable.
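A sketch of the algebra behind Choppin's point (standard correlation theory rather than his own working): for three variables the correlation matrix must be positive semi-definite, which constrains the unknown correlation to

\[ r_{XY}\,r_{XZ} - \sqrt{(1-r_{XY}^{2})(1-r_{XZ}^{2})} \;\le\; r_{YZ} \;\le\; r_{XY}\,r_{XZ} + \sqrt{(1-r_{XY}^{2})(1-r_{XZ}^{2})} \]

so that even with $r_{XY}$ as high as 0.98, a moderate $r_{XZ}$ pins $r_{YZ}$ down only to a broad interval; that is the force of the argument.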
That is a statistical argument but there are others. If the high correlation comes about because the open-ended questions are doing the same job as the multiple choice items - eliciting factual content etc. - then the essay paper is obviously not being used to advantage. It is not performing its special function. In science subjects there may be some truth in this. But if there are grounds for supposing that the processes called for by the two tests are different in kind, and separate functions are being satisfied, all that high correlation means is that relative to each other, persons produce the same kind of performance on both tests. That various tests administered to the same children should produce high correlations need come as no surprise; as Levy (1973, p.6-7) remarks, children developing in a particular culture are likely to accrue knowledge, processes or whatever at different rates but in a similar order, a view also put by Anastasi (1970). What no one should do is to conclude from this that it is a waste of time to teach some aspect of a subject just because tests based on the subject matter correlate highly with tests based on other aspects of the subject. As Cronbach (1970, p.48-49) has pointed out, a subject in which one competence was developed at the expense of the other could go unnoticed since one or more schools could on average be high on one competence and low on the other without this showing up in the correlation between scores. It is the across-schools correlation, which is formed by correlating average scores for schools, that will expose uneven development of the two competences.

Arguments based on correlations are of strictly limited utility. Evaluation of the validity of item types must proceed along other lines. The first test must be one of acceptability - can the average candidate grasp what is required reasonably quickly? It may be that types like relationship analysis fail on that account. One can also ask if the task set makes sense. Perhaps assertion-reason fails on that score. Then, of course, one must ask whether the skill supposedly being measured is worth measuring and, if it is, whether the item type is being used to best effect. In this connection, multiple completion sometimes causes concern.
As a general comment, I would say that the simple multiple choice form has a lot of elasticity left in it and that item writers should think hard and long before abandoning it for another item type. With many tests now stratified into different item types as a matter of course, the danger is that these divisions will become permanently fixed when what is required is fluid allocation based on a more or less constant monitoring of the acceptability and measurement efficacy of the item types. The last thing I would want to do is to discourage experimentation with the multiple choice form but I am bound to say that the experience so far seems to indicate that the price for producing something different is a complicating of instructions to the point where some candidates are definitely handicapped.

SUMMARY

1. To get more out of the simple multiple choice form usually means increasing the information load. Care should be taken not to overdo the reading comprehension element.

2. Various item types other than simple multiple choice are available. The item writer should always check that he or she has chosen the appropriate item type and is using it properly. Often ideas can be handled quite well within the simple multiple choice form without resorting to fancy constructions. Except for true-false and multiple true-false, all the other item types have in common that the instructions are lengthy and apparently complicated. This leads to the criticism that ability to understand instructions is being tested as much as anything else. Some improvement is possible in the wording and presentation of the instructions generally used but it will never be possible to dispel the criticism entirely.

3. The claims made for true-false items by Robert Ebel do not convince and this item type is not recommended for formal achievement testing. In the classroom it is different and there is no reason why these items should not be used there.

4. The multiple completion or selection item type suffers from the drawback that candidates have to code their answers using a table before making a mark on the answer sheet. Another shortcoming is that information is usually given away by the coding table and candidates may use it to their advantage, either consciously or unconsciously. As might be expected, the cleverer candidates appear to derive most advantage from it.

5. At their worst, assertion-reason items can be very silly and good ones are hard to write. Making use of all the response positions is a problem; in particular, propositions which are correct for the right reasons are apparently hard to come by. This item type is generally not recommended. Notions of causality can be tested using the simple multiple choice form.

6. The series of item types which have been introduced into mathematics achievement tests - data necessity, data sufficiency, relationship analysis, quantitative comparisons - must be regarded as still being on trial. It is fairly certain that the first three can only be used with older, sophisticated candidates and even then the criticism that the instructions are too complicated could hold. Data sufficiency and quantitative comparisons look to be the most promising item types although this view is based on little more than a hunch. It looks doubtful whether these item types will find an application in other subject areas.

7. Trying to validate item types by correlational studies is a waste of time. The interpretation of both high and low correlations is fraught with problems.
Validation is best carried out by asking common sense questions: "Is the item type performing a useful measurement function no other one can?", "Is the item type acceptable to candidates?" etc.

4. Constructing Items

"Item writing continues to be an art to which some scientific procedures and experimentally derived judgements make only modest contributions." (Wesman, 1971)

The critical activity in item writing is, of course, the birth of the idea followed by the framing of the item. The idea may come in the form of a particular item type or it may have to be shaped to fit an item type, depending on the commission. Not everyone is happy relying on the item writer for ideas. Later in the chapter I shall discuss the work of those who believe it is possible to generate items in such a way that what Bormuth (1970) calls the "arbitrary" practices of item writers are eliminated.

On the understanding that the way ideas come into being is not susceptible to enquiry, most of the research on problems involved in item writing has been about issues like the optimum number of distracters, the use of the 'none of these' option, the production of distracters, the advisability of using negatively framed stems - what Andrew Harrison has called the "small change" of item writing. Having studied the work which has been done on these topics one is obliged to agree with Wesman (1971) that "relatively little significant research has been published on problems involved in item writing". To be fair there are reasons for this; more than in most other areas of research, the investigator is faced with the difficulty of building in sufficient controls to permit the degree of generalisability which would make the findings dependable and useful. "Most research", writes Wesman, "reports what has been done by a single writer with a single test; it does not present recipes that will enable all who follow to obtain similar results. A study may show that one three-choice vocabulary test is just as good without two additional options per item; it will not show that another three-choice vocabulary test with different words and different distracters would not be improved substantially by the addition of well-selected options."

Despite these strong reservations, Wesman does discuss the item writing research which was done between roughly 1945 and 1970, although his treatment is not exhaustive. As he warned, nothing definitive emerged and much the same has to be said for later research. Let us consider the question of the number of distracters first.

NUMBER OF DISTRACTERS

Conventional wisdom has it that to provide fewer than three distracters offers candidates too much scope for elimination tactics. Yet three-choice and two-choice items have their champions, as do true-false items. Three-choice items, in particular, are thought by some to have interesting possibilities. After randomly eliminating the fourth alternative from a sample of psychology achievement items, Costin (1970) administered tests constructed of both three- and four-choice items to a sample of students. He found that his "artificial" three-choice items were more discriminating, more difficult and more reliable than the four-choice items from the same item pool. The outcome of a later study (Costin, 1972) was much the same. As to why three-choice items did as good a job as four-choice items Costin was inclined to believe that the explanation was more psychological than statistical. In the 1970 paper he offered his results as empirical support for Tversky's (1964) mathematical proof that three-choice items are optimal as far as discrimination is concerned, a result also proved by Grier (1975), although it should be noted that both proofs depend on the somewhat shaky assumption that total testing time for a set of items is proportional to the number of choices per item (see Lord, 1976(a)).
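For readers who wonder where 'three' comes from in arguments of this kind, here is a sketch of the usual reasoning (a schematic reconstruction, not Tversky's own derivation): if answering a k-choice item takes time proportional to k, then in a fixed testing time one can administer $n \propto 1/k$ items, and treating each item as worth roughly $\log k$ units of information gives a total of

\[ I(k) \;\propto\; \frac{\ln k}{k}, \qquad \frac{d}{dk}\!\left(\frac{\ln k}{k}\right) = \frac{1-\ln k}{k^{2}} = 0 \;\Rightarrow\; k = e \approx 2.7, \]

so the best integer number of choices comes out as three - and the conclusion stands or falls with the proportional-time assumption just questioned.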
However, in the 1972 paper Costin was inclined to believe that the more choices that are provided, the more cues candidates have available for answering items they "don't know". This effect he saw as a greater threat to reliability and perhaps also validity than reducing the number of alternatives. I am not sure I accept this argument. If the extra alternatives are poor they may help to give the answer away, but generally I would have thought that encouraging candidates to utilise partial information would result in greater validity, nor would it necessarily jeopardise reliability.

As always, there are so many factors involved in an issue like this. I have mentioned the nature of the cues but there is the matter of how specific or general the item is, what it is aimed at and also how it can be solved. Concerning this last factor, Choppin (1974(a)) maintains that when the number of alternatives is reduced, items that can be solved by substitution or elimination - what he calls "backwards" items - are less valid than comparable "forward" items, that is, items that must be solved first before referring to the alternatives. On the other hand, he finds random guessing patterns more prevalent for items of the "forwards" type, which is reasonable given the lack of opportunity for eliminating alternatives. In general, Choppin finds that reducing the number of alternatives does lower reliability (estimated by internal consistency) and validity, and recommends that, whenever possible, items with at least five alternative responses should be used.

To see what would happen when "natural" four-choice items were compared with four-choice items formed by removing the least popular distractor from five-choice items, Ramos and Stern (1973) set up an experiment involving tests in French and Spanish. Their results suggest that the two kinds of items are not clearly distinguishable and that the availability of the fifth choice to the candidate is not of major consequence. However, the usual qualifications concerning replicability apply. One thing Ramos and Stern did notice was a small decrease in reliability when going from five to four choices and wondered whether it might not have been better to eliminate the least discriminating distractor. Since reliability and discrimination are directly related, they probably had a point. I feel myself that one can probably get away with reducing from five to four alternatives. It is when the number is reduced to three or even two that the soundness of the measurement is threatened.

As far as the true-false type is concerned, Ebel (1969) stated quite categorically that if a teacher can write, and a student can answer, two true-false items in less time than is required to write or to answer one four-choice item, preference should be given to true-false. However, empirical studies suggest that things do not work out this way.
Oosterhof and Glasnapp (1974) reported that 2.1 to 4.3 times as many true-false as multiple choice items were needed in order to produce equivalent reliabilities, a ratio which was greater than the rate at which true-false items were answered relative to multiple choice. Frisbie (1973), while warning that no hard and fast rules can be formulated regarding the amount of time required to respond to different types of items without considering item content as well, nevertheless found that the ratio of true-false to multiple choice attempts was in the region of 1.4 rather than 2. It would seem that Ebel's proposition does not hold up in practice, but again one can never be sure.
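One standard way of arriving at ratios of this kind (offered here as background; it is not necessarily the calculation these authors used) is the Spearman-Brown formula: if a true-false test of the same length as the multiple choice test has reliability $r_{TF}$, the lengthening factor $n$ needed to match the multiple choice reliability $r_{MC}$ is

\[ n \;=\; \frac{r_{MC}\,(1-r_{TF})}{r_{TF}\,(1-r_{MC})}. \]

With purely illustrative figures of $r_{TF} = 0.70$ and $r_{MC} = 0.85$, the true-false test would need to be about 2.4 times as long.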
Many people would argue, in any case, that the number of alternatives for any particular item should be decided not by statistical considerations but rather by the nature of the problem and its capacity for producing different mistakes. Granted there are administrative grounds for keeping the number of alternatives constant throughout a test, but these can always be over-ridden if necessary. The best way of finding out what the major errors are likely to be is to try out the problem first in an open-ended form. Although it hardly warrants a paper in a journal, some workers have thought the idea needed floating in their own subject area, e.g. Tamir (1971) in a biology context. If practicable, I think it is a useful thing to do but one should not feel bound by the results of the exercise. Candidates can make such crazy errors that I am sceptical about the wisdom of using whatever they turn up. A candidate might produce something which he would immediately recognise as wrong if it was presented in an item. It might also be argued that item writers ought to be aware of the more common errors anyway but this is not necessarily the case. Evidence that distracters often fail to match alternatives generated by free responses comes from studies by Nixon (1973) and by Bishop, Knapp and MacIntyre (1969).

THE 'NONE OF THESE' OPTION

The 'none of these' or 'none of the above' option arouses strong feelings among item writers. Some refuse to use it under any circumstances, believing that it weakens their items; others use it too much, as an easy way to make up the number when short of distracters. The Cambridge item-writer's manual (TDRU, 1975) maintains that 'none of these' is best avoided altogether and that if it must be used it should only be in cases where the other options are unequivocally right or wrong. In this connection, the study by Bishop, Knapp and MacIntyre (1969) reported that the biggest difference between the distributions of responses to questions framed in multiple choice and open-ended form was between the 'none of these' category in the multiple choice form and the 'minor errors' in the open-ended form, which means that when placed alongside definite alternatives the 'none of these' option was not sufficiently attractive to be effective. Williamson and Hopkins (1967) reported that the use of the 'none of these' option tended to make no difference one way or the other to the reliability or validity of the tests concerned. After arriving at much the same results, Choppin (1974(a)) concluded that "these findings offer little reason to employ the null-option item type. They undoubtedly set a more complicated task to the testee without increasing reliability or validity".

My own feeling is that 'none of these' is defensible when candidates are more likely to solve a problem first without reference to the options, what Choppin called the "forwards" type of item. Thus I permit its use in multiple completion items, especially as it denies candidates the opportunity to glean information from the coding structure. Note, however, that as was pointed out in Chapter 3, items of this kind often fail to satisfy statistical criteria. This is probably due to the imbalance created in an item when 'none of these' is the correct answer. There may be many wrong answers but only four (if the item is five-choice) are presented, with the right one being found among the rest. Whenever the 'none of these' option is used, the notion that all distracters should be equally attractive ceases to apply, not that such items would necessarily be ideal, except in a statistical sense (see Weitzman, 1970).

VIOLATING ITEM CONSTRUCTION PRINCIPLES

There appear to be three studies of what happens when certain rules of what might be called item-writing etiquette are violated. Following Dunn and Goldstein (1959), and in very similar fashion, McMorris et al. (1972) and Board and Whitney (1972) looked at the effects of providing candidates with cues by (a) putting extraneous material in the item stem, (b) making the keyed response stand out from the distracters through making it over-long or short and (c) producing grammatical inconsistencies between the stem and the distracters. McMorris et al. obtained the same results as Dunn and Goldstein, namely that the violations made items easier but did not affect the reliability or validity of the instruments; Board and Whitney, however, reported quite contrasting results. According to them, poor item writing practices serve to obscure or reduce differences between good and poor students. It seems that extraneous material in the stem makes the items easier for poor students but more difficult for good students, the first group deriving clues from the 'window-dressing' and the second looking for more than is in the item (shades of Hoffman!). Although grammatical inconsistencies between stem and keyed response did not have a noticeable effect on item difficulties they did reduce the validity of the test. Finally, making the keyed response different in length from the distracters helped the poor students more than their abler colleagues. It would seem that poor tests favour poor candidates!

My own feeling about these findings is that no self-respecting item writer or editor should allow inconsistencies of the kind mentioned to creep into a test. In this respect, these findings are of no particular concern. After all, no one would conclude, on the basis of these studies, that it was now acceptable to perpetrate grammatical inconsistencies or to make keys longer than distracters. Nor need one necessarily believe, on the basis of one study, that it is harmful to do so. It is just that it is not advisable to give the 'testwise' the opportunity to learn too much from the layout of the test.

One item-writing rule which needs discussing is the one which warns against the negative phrasing of items. The reason for having it is, of course, that the negative element, the 'NOT', can be overlooked and also that it can lead to awkward double negatives. What is not so often realised is that a negative stem implies that all the distracters will be correct rather than incorrect, as is usually the case.
Farrington (1975), writing in the context of modern language testing, has argued that this feature is wholly desirable, his rationale being that it avoids presenting the candidate with many incorrect pieces of language, a practice which leads to a mistake-obsessed view of language learning. The same argument applies to items which use the phrase 'One of the following EXCEPT ...'; indeed EXCEPT, being less negative, may be preferred to NOT, where feasible.

Whatever the effects on reliability and validity, it is pretty obvious that the difficulty of items can be manipulated by varying their construction and format. Dudycha and Carpenter (1973) who, incidentally, deplore the lack of research on item writing strategies, found that items with incomplete stems were more difficult than those with closed stems, that negatively phrased items were more difficult than those phrased in a positive way and that items with inclusive options such as 'none of these' were more difficult than those with all-specific options, a result also reported by Choppin (1974(a)). Dudycha and Carpenter did not study what effect placement of the keyed response (A, B, C, D or E) might have on difficulty but Ace and Dawis (1973), who have a good bibliography on the subject, provide evidence that this factor can result in significant changes in difficulty level, at least for verbal analogy items. On the other hand, an earlier study by Marcus (1963) revealed no tendency to favour one response location rather than another. This is a good example of an unresolved issue in item writing.

Provided pretesting is carried out and estimates of difficulty are available to the test constructor, the fact that varying the format may make one item more difficult than another does not seem terribly important. It all depends on what kind of test, educationally or statistically, is wanted. As for the location of keyed responses, much can be done to eradicate the effect of undesirable tendencies by randomising the keys so that each position is represented by approximately the same number of items. Note, however, that randomisation may not be feasible when the options are quantities (usually presented in order of magnitude) or patterns of language. In these cases some juggling of the content may be necessary.
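A minimal sketch of the sort of key balancing meant here (the function and its names are my own illustration, not a procedure used by any board):

    import random

    def balanced_keys(n_items, options="ABCDE", seed=None):
        # Repeat the option letters until the test length is covered,
        # trim to size, then shuffle so that no position is favoured
        # while each letter is keyed approximately equally often.
        rng = random.Random(seed)
        keys = (list(options) * (n_items // len(options) + 1))[:n_items]
        rng.shuffle(keys)
        return keys

    print(balanced_keys(40, seed=1))

Items whose options must stay in order of magnitude would, as the text says, still need their content juggled by hand.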
Suppose that one wanted to alter items which turn out to be too hard. What effect might this have on the discrimination of the items? If Dudycha and Carpenter (1973) are to be believed, then discrimination is less susceptible to format changes than is difficulty. Accordingly, they maintain that an item stem or orientation may be altered without lowering discrimination, the exception being items which include 'none of these' as an option.

The inconclusiveness of so much of the research into item writing should not, in my view, be regarded as an invitation to engage in a new research blitz on the various issues discussed. If when framing items the writer sticks to the rules of etiquette laid down in the various manuals and succeeds in contriving four or five options, based on an appreciation of likely misapprehensions among examinees, this is as much as can reasonably be hoped for. What should not be encouraged is what might be called the 'armchair' approach to item writing, which can best be summed up as an over-reliance on mechanical formulas for generating distracters, such as powers of ten or arithmetical sequences or similar words, combined with a penchant for trick or nonsense distracters. That is not to say that only items composed of empirically derived distracters are admissible. Corbluth's (1975) thoughtful analysis and categorisation of the kinds of distracters which might be appropriate for reading comprehension exercises has persuaded me that an enlightened 'armchair' approach could work quite well.

ITEM FORMS

It was because they believed that items are not sufficiently related to the previous instruction, and also because they distrusted the free rein given to item writers, that Bormuth (1970) and others set out to develop the radically different methods of writing or generating items which I mentioned at the beginning. These turn on the notion of an item shell or form (Osburn, 1968).

An item form is a principle or procedure for generating a subclass of items having a definite syntactical structure. Item forms are composed of constant and variable elements, and as such, define classes of item sentences by specifying the replacement sets for the variable elements. Exhaustive generation of items is not necessary although in principle it can be done. For instance, an item form rule might be 'What is the result of multiplying numbers by zero?' in which case the items 'What is 0 x 0?', 'What is 1 x 0?', 'What is 2 x 0?', 'What is 0.1 x 0?' etc. would be generated. The sheer size of the 'domain' of items so created will be appreciated. It is not supposed, of course, that a candidate should attempt every item in order to show that he has 'mastery' over this 'domain'; since it is assumed that all items in a domain are equivalent and interchangeable, he need only attempt a sample of items. As human beings would contaminate the measurement were they to choose the sample, this job is best left to a computer which can be programmed to select a random sample from the domain.
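By way of illustration, the multiplication-by-zero form just quoted can be written down in a few lines (a sketch in my own notation; the replacement values beyond those quoted in the text are invented):

    import random

    # An item form: a constant sentence frame plus a replacement set
    # for its single variable element.
    FRAME = "What is {x} x 0?"
    REPLACEMENT_SET = [0, 1, 2, 0.1, 7, 12, 365]   # the 'domain' in miniature

    def generate_domain(frame, replacements):
        # Exhaustive generation of every item the form defines.
        return [frame.format(x=r) for r in replacements]

    def sample_items(frame, replacements, n, seed=None):
        # The random sampling from the domain that the argument
        # insists should be left to the computer.
        rng = random.Random(seed)
        return rng.sample(generate_domain(frame, replacements), n)

    print(sample_items(FRAME, REPLACEMENT_SET, 3, seed=1))

Whether the items so generated really are equivalent and interchangeable is, of course, exactly the assumption that the studies discussed next call into question.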
The assumptions behind item forms can be tested and it is instructive to look at the studies which have done so. Macready and Merwin (1973) put the case squarely: "... an item form will be considered as 'inadequate' for use in a diagnostic domain-referenced test if (a) the items within the item form are not homogeneous, (b) the items are not of equivalent difficulty or (c) both of the above". They then go on to test item forms of the kind I have just illustrated. Although their conclusions are positive, an examination of their paper suggests that there were more item forms which failed to meet the tests than passed, hardly surprising given the very stiff requirements. Much the same could be said of an earlier attempt at verification by Hively, Patterson and Page (1968) who found that variance within item forms, which in theory should be nil if items are truly homogeneous, did not differ as much from variance between item forms as it should have done. Neither of these studies (see also Macready, 1975) inspires much confidence that the item form concept has anything to offer.

To my mind, this ultra-mechanical procedure, conceived in the cause of eliminating dreaded 'value judgements', carries within it the seeds of its own destruction. An example supplied by Klein and Kosecoff (1973) will help to make the point. Consider the following objective: 'The student can compute the correct product of two single digit numerals greater than 0 where the maximum value of this product does not exceed 20'. The specificity of this objective is quite deceptive since there are 29 pairs of numerals that might be used to assess student performance. Further, each of the resulting 290 combinations of pairs and item types could be modified in a variety of ways that might influence whether the student answered them correctly. Some of these modifications are:

- vary the sequence of numerals (e.g. 5 then 3 versus 3 then 5)
- use different item formats (e.g. multiple choice versus completion)
- change the mode of presentation (e.g. written versus oral)
- change the mode of response (e.g. written versus oral).

There are other question marks too. The theory seems to have little to say about the generation of distracters - all problems seem to be free-response - yet a slight manipulation of the distracters can change the difficulty of an item and destroy equivalence, as in the following example taken from Klein and Kosecoff.

Eight hundredths equals
A. 800
B. 80
C. 8
D. .08

Eight hundredths equals
A. 800
B. .80
C. .08
D. .008

Doubts also arise because the item construction rules depend largely on linguistic analyses which do not necessarily have any psychological or educational relevance. This is particularly true of Bormuth's (1970) book and, as I have said, there are few applied studies to clarify the situation. Anderson (1972), in the course of an interesting paper which raises a lot of critical questions, e.g. 'Which of the innumerable things said to students should they be tested on?', suggests that something like item forms could be employed with "domains of knowledge expressed in a natural language" but has to admit that elementary mathematics is a "relatively easy case". Amen to that.

There is another theory of item construction which must be mentioned, except that coming from Louis Guttman it is much else besides. Guttman and Schlesinger (1967) and Guttman (1970) have presented techniques for systematically constructing item responses and particularly distracters. Items are then tested for homogeneity by a technique called smallest space analysis (Schlesinger and Guttman, 1969) which is essentially a cluster analysis. Violations and confirmations of homogeneity ought to lead to improvement of the item generating rule. Note, however, that clustering techniques can be criticised for the absence of an underlying model, thus reintroducing the arbitrariness into item construction procedures it was hoped to dispel.

I should not need to say this but it is impossible to remove the influence of human beings from the item writing process. It seems to me hugely ironic that a system of testing meant to be child-centred, that is criterion-referenced rather than norm-referenced, should rely so heavily on computers and random selection of items to test learning. Above all testing should be purposeful, adaptive and constantly informed by human intelligence - mechanistic devices and aleatory techniques have no part to play in the forming of questions. Yet at one time there were hopes that, in some fields at any rate, item writing might be turned over to the computer and exploratory studies were conducted by Richards (1967) and by Fremer and Anastasio (1969) and perhaps by others not to be found in the published literature.
It is illuminating that Richards should write that "it soon became clear that developing a sensible procedure for choosing distracters is the most difficult problem in writing tests on a computer". Since he needed to generate alternatives for synonym items, Richards was able to solve his problem by using Roget's Thesaurus, but the low level of the task will be evident, not to mention the unique advantage of having the Thesaurus available. I can find no recent accounts of item writing by computer so perhaps this line of work is dead.

Up to now, I have dealt with how items come into being and what they might be measuring. In the next chapter I turn to a consideration of the different ways items can be presented and answered and the various methods of scoring answers that have been suggested.

SUMMARY

1. There are two schools of thought about item writing. The first, containing by far the most members, believes that the inspiration and the formulation of the item should be left to the item writer. The other, rather extreme school of thought maintains that item writing should be taken out of the hands of the item writer and organised in a systematic way by generating items from prepared structures which derive from an analysis of the material that has been taught and on which items are based. Although the human touch is still there at the beginning of the process it is removed later on by the automatic generator. While I appreciate the motivation behind the idea - which is to link teaching and testing more closely - I do not approve of the methods employed. I suppose there may exist a third school of thought which regards generated items on a take-it-or-leave-it basis, saving what looks useful and ignoring the rest. There may be some merit in this idea but it is a clumsy way of going about item writing.

2. The various technical aspects of item writing - number of options, use of 'none of these', use of negatives and so forth - have been studied intensively but it is rare to find a study where the results can be generalised with confidence. If an investigator wants to know how something will work out he is best advised to do an experiment himself, simulating the intended testing conditions as closely as possible. Some findings have achieved a certain solidity. For example, it is fairly certain that items containing 'none of these' as an option will be more difficult than those with all-specific options. This follows from the fact that 'none of these' is so many options rolled into one. Generally speaking, opinion does not favour the use of 'none of these' although I would permit it with multiple completion items.

3. The traditional view that item writers should use as many options as possible, certainly 4 or 5, continues to hold sway. Those who have promoted the three-choice item, which has from time to time looked promising, have not yet managed to substantiate their case. The same applies to Ebel's attempts to show that twice as many true-false items can be answered in the time it takes to answer a set of items with four options. As to whether one should aim for four or five options, I doubt if it makes much difference. The point is not to be dogmatic about always having the same number, even if it is convenient for data processing purposes.

4. I do not think the computer has any place in item writing and it is salutary to note that the various attempts in the late 1960's to program computers to write items appear to have fizzled out.
5. Instructions, Scoring Formulas and Response Behaviour

"If you're smart, you can pass a true or false test without being smart." (Linus in 'Peanuts' by Schulz)

If on encountering an item candidates were always to divide into those who knew the right answer for certain and those who were so ignorant that they were obliged to guess blindly in order to provide an answer, scoring would be a simple matter. A correct answer could be awarded one point, anything else zero, and a correction could be applied to cancel out the number of correct answers achieved by blind guessing. Alas, life is not as simple as this. A candidate who does not know the right answer immediately may act in any one of the following ways:

1. Eliminate one or more of the alternatives and then, by virtue of misinformation or incompetence, go for a particular wrong answer.
2. Eliminate one or more of the alternatives and then choose randomly among the remainder.
3. Fail to eliminate any of the alternatives but choose a particular wrong answer for the same reasons as in 1.
4. Make a random choice among all the alternatives, in other words, do what is popularly known as guessing.

Actually, these possibilities are just bench marks on a continuum of response behaviour anchored at one end with certain knowledge and at the other with complete ignorance. An individual's placement on this continuum with respect to a particular item depends on the relevant knowledge he can muster and the confidence he has in it. It follows that the distinction between an informed answer and a shrewd, intuitive guess, or between a wild hunch and a random selection, is necessarily blurred; also that with enough misapprehension in his head an individual can actually score less than he would have got by random guessing on every item. In other words, one expects poor candidates to be poor 'guessers'.

I suppose that blind guessing is most likely to occur when candidates find themselves short of time and with a number of items still to be answered. Some people believe that this situation occurs quite frequently and that when it does occur candidates are prone to race through the outstanding items, placing marks 'at random' on the answer sheet, thereby securing a certain number of undeserved points. My view of the matter is that given appropriate testing conditions this scenario will rarely, if ever, come about. Even in the absence of those conditions, which I will come to next, it is by no means certain that individuals are able to make random choices repeatedly. Apparently, people find it difficult to make a series of random choices without falling into various kinds of response sets (Rabinowitz, 1970). They tend to avoid repetitive pairs and triplets (e.g. AA, DDD), and to use backward (e.g. EDC) but not forward series more than expected by chance. They also tend to exhaust systematically the entire set of possible responses before starting again, that is to say, they tend to cycle responses. If the correct answers are distributed randomly so that each lettered option appears approximately the same number of times, it follows that anyone in the grip of one or more response sets and attempting to guess randomly at a number of consecutive or near-consecutive items will almost certainly fail to obtain the marks which genuine random guessing would have secured.
Of course, candidates who are bent on cheating or have just given up can always mark the same letter throughout the test but, in the London board's experience at any rate, such behaviour is rare.

What do I mean by 'appropriate testing conditions'? I mean that the test is relatively unspeeded so that nearly all candidates have time to finish, that the items deal with subject matter which the candidates have had an opportunity to learn and that the items are not so difficult as to be beyond all but a few candidates. If any or all of these conditions are violated, then the incidence of blind guessing will rise and the remarks I shall be making will cease to apply with the same force. However, I am working on the assumption that such violations are unlikely to happen; certainly the British achievement tests I am familiar with, namely those set at GCE O- and A-levels, satisfy the three conditions.

A study of the item statistics for any multiple choice test will show that the majority of items contain at least one distractor which is poorly endorsed, and so fails to distract. This constitutes strong evidence for the ability of the mass of candidates to narrow down choice when ignorant of the correct answer (see also Powell and Isbister, 1974). Were the choice among the remaining alternatives to be decided randomly, there might be cause for alarm, for then the average probability of obtaining a correct answer to a five-choice item by 'guessing' would rise from 1/5 to perhaps 1/4 or even 1/3. But the likelihood of this happening in practice seems slight. Having applied what they actually know, candidates are likely to be left with a mixture of misinformation and incompetence which will nudge them towards a particular distractor, placed there for that purpose.

Is there any evidence for the hypothesis that 'guessing' probabilities are less than predicted by chance? Gage and Damrin (1950) constructed four parallel versions of the same test containing 2, 3, 4 and 5-choice items respectively. They were able to calculate that the average chances of obtaining the right answer by guessing were 0.445, 0.243, 0.120 and 0.046 respectively, as compared with 0.500, 0.333, 0.250 and 0.200, which are the chances theory based on random guessing would have predicted. This is only one study and it needs to be repeated in a number of different contexts. But it is a result which tallies with intuition and, coupled with the fact that there are always candidates who score below the chance score level on multiple choice papers, it suggests that the average probability of obtaining a correct answer to a five-choice item when in ignorance may well be closer to 1/10 than 1/5.

It is also a mistake to assume that chance-level scores, e.g. 10 out of 50 for a test made up of five-choice items, are necessarily the product of blind guessing. Unless the candidates obtaining such scores were actually to guess randomly at every item, which, as I have said, seems most improbable, the chance-level score is just like any other score. Donlon (1971) makes the point that chance-level scores may sometimes have predictive value, although he does suggest that steps should be taken to check whether such scores could have arisen as a result of blind guessing (for details of the method suggested, see Donlon's paper).
It is one thing to be dubious about the incidence of blind guessing, another to doubt that individuals differ in the extent to which they are willing to 'chance their arm' and utilise whatever information is at their disposal. This propensity to chance one's arm is linked to what psychologists call 'risk taking behaviour'. The notion is that timid candidates will fail to make the best use of what they know and will be put off by instructions which carry a punitive tone, while their bolder colleagues will chance their arm regardless. The fact is that if the instructions for answering a test warn candidates that they will be penalised for guessing (where what is meant by guessing may or may not be specified) those who choose to ignore the instructions and have a shot at every question will be better off - even after exaction of the penalty - than those who abide by the instructions and leave alone items they are not certain about, even though an informed 'guess' would probably lead them to the right answer (Bayer, 1971; Diamond and Evans, 1973; Slakter et al. 1975). Perhaps it was an instinctive grasp of this point which made the teachers in Schofield's (1973) sample advise their candidates to have a shot at every question, even though they were under the misapprehension that a guessing penalty was in operation.

Exactly the same considerations apply to instructions which attempt to persuade candidates to omit items they do not know the answer to by offering as automatic credit the chance score, i.e. 1/5 in the case of a five-choice item. On the face of it, this seems a good way of controlling guessing but the snag is that the more able candidates tend to heed the instructions more diligently than the rest and so fail to do themselves justice. Because their probabilities of success are in truth much greater than chance they are under-rewarded by the automatic credit, whereas the weakest candidates actually benefit from omitting because their probabilities of success are below the chance level. That, at any rate, is the conclusion I drew from my study (Wood, 1976(d)). It is supported by the results of a study in the medical examining field (Sanderson, 1973) in which candidates were given the option of answering 'Don't know' to true-false items of the indeterminate type. 'Don't know' is in effect an omit and Sanderson found that it was the more able candidates who tended to withhold definite answers. I should add that there is a paper (Traub and Hambleton, 1972) which favours the use of the automatic credit but it is not clear what instructions were used in the experiment. The wording of instructions to create the right psychological impact is, of course, decisive.

Readers will understand that I am not denying that blind guessing occurs; that would be rather foolish. Choppin's (1974(a), 1975) study, for instance, provides incontrovertible evidence of blind guessing (as well as some interesting differences between countries) but the items used were difficult and that makes all the difference. All I am saying is that when conditions are right, blind guessing is by no means as common as some of the more emotional attacks on multiple choice would have us believe. Cureton (1971) has put the issue of guessing in a nutshell; if a candidate has a hunch he should play it, for hunches are right with frequency greater than chance. There is nothing wrong with playing hunches, despite what Rowley (1974) says.
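The arithmetic behind these two sets of instructions is worth setting out. The sketch below (Python; the probabilities are hypothetical) compares the expected mark from attempting a k-choice item under a conventional guessing penalty of 1/(k-1) per wrong answer, and under the automatic-credit scheme, with the mark from omitting; the only point it makes is that the break-even probability is 1/k in both cases, so anyone whose hunches do better than chance loses by holding back.

```python
def expected_attempt_with_penalty(p, k):
    """Formula scoring: +1 for a right answer, -1/(k-1) for a wrong one.
    Omitting scores 0, so attempting pays whenever p > 1/k."""
    return p - (1 - p) / (k - 1)

def expected_attempt_with_credit(p, k):
    """Automatic-credit scheme: attempting earns p on average, while
    omitting earns the fixed chance credit 1/k."""
    return p

k = 5
for p in (0.10, 0.20, 0.35, 0.60):
    print(f"p = {p:.2f}: penalty scheme {expected_attempt_with_penalty(p, k):+.3f} vs omit 0.000; "
          f"credit scheme {expected_attempt_with_credit(p, k):.2f} vs omit {1 / k:.2f}")
```

Only the weakest candidates, whose per-item probability of success really does fall below 1/k, gain anything by omitting - which is the conclusion reported above.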
Rowley equates 'test-wiseness' with use of partial information and hunches and argues that the advantages which accrue to the 'test-wise' should be cancelled out. I believe this view is mistaken. Test-wiseness, as I understand it, is about how candidates utilise cues in items, and to the extent that they benefit, this is the fault of the item writer or test constructor. A fine distinction, maybe, but an important one. Incidentally, anyone interested in test-wiseness might refer to Diamond and Evans (1972), Crehan, Koehler and Slakter (1974), Rowley (1974) and Nilsson and Wedman (1976). I would add that test-wiseness is not just associated with multiple choice, as some people seem to think.

Any guessing correction is properly called a correction for individual differences in confidence, as Gritten and Johnson (1941) pointed out a long time ago. It is applied because some people attempt more items than others. Even if instructions to attempt every item are never 100 per cent successful, I believe that they do reduce omitting to a point where individual differences in confidence cannot exert any real distorting effect on the estimation of ability. To turn the question of whether candidates should always be advised to attempt items into an ethical dilemma (see, for instance, Schofield, 1973), as if guessing were on a par with euthanasia, strikes me as getting the whole thing out of proportion.

Changing answers

If individuals differ in their willingness to supply an answer at all, it is easy to imagine them differing in their readiness to change their answers having already committed themselves. The question of whether candidates are likely to improve their scores by changing answers has been investigated by a number of workers. The general view is that there are gains to be made which are probably greater for better students than poor ones. Ten years ago, Pippert (1966) thought he was having the final word on what he called the "changed answer myth" - his view was that answers should be changed - but since then there have been published studies by Copeland (1972), Foote and Belinsky (1972), Reiling and Taylor (1972), Jacobs (1974) - this last reference contains a bibliography of older work - Pascale (1974) and Lynch and Smith (1975). These last investigators concluded that when candidates do not review their answers, reliability and validity suffer, and that directions to stick with the first response to an item are misleading. It might be thought that hunches would lose conviction when reviewed but the time span between the first response and the review is short and, besides, hunches are more right than wrong.

Confidence weighting

The decision to change an answer may be read as a sign of unwillingness to invest one answer with, as it were, complete confidence. The idea that individuals might be asked to signify their degree of confidence in the answers they make, or in the alternatives open to them, has excited a number of people who have believed that such a move would not only constitute a more realistic form of answering psychologically but would also yield more statistical information. Known variously as confidence weighting, confidence testing, probabilistic weighting, probabilistic testing or even subject weighted test taking procedure, the basic notion is that individuals should express, by some code or other, their degree of confidence in the answer they believe to be correct, or else, by way of refinement, in the correctness of the options presented to them.
Credit for laying the intellectual foundations of probabilistic weighting and for providing a psychometric application goes to de Finetti (1965), although less sophisticated methods have been discussed in the educational measurement literature for some forty years (see Jacobs, 1971, for a comprehensive bibliography on the subject). Much energy has been expended on devising methods of presentation, instructions and scoring rules which will be comprehensible to the ordinary candidate (see Lord and Novick, 1968, Ch.14; Echternacht, 1972; Boldt, 1974). In one method, for instance, candidates are invited to distribute five stars (each representing a subjective probability of 0.20) across the options presented to them. It is assumed that the individual's degree of belief or personal probability concerning the correctness of each alternative answer corresponds exactly with his personal probability distribution, restricted to sum to unity. The trouble is candidates may not care about some of the alternatives offered to them, in which case talk of belief is fatuous. Empirical evidence from other fields suggests that often individuals have a hard time distributing their personal probabilities (Peterson and Beach, 1967); some fail to constrain their probabilities to add to unity, although the stars scheme gets over this, while others tend to lump the probability density on what they consider is the correct answer, which is not necessarily the best strategy.

Apart from worries about whether candidates can handle the technique, concern has been expressed that confidence test scores are influenced, to a measurable degree, by personality variables. The worry is that individuals respond with a characteristic certainty which cannot be accounted for on the basis of their knowledge (Hansen, 1971). Or, as Koehler (1974) puts it, confidence response methods produce variability in scores that cannot be attributed to knowledge of subject matter. Not everyone accepts this assessment, of course, especially Shuford, who has been the foremost champion of confidence weighting (for a recent promotional effort, see Shuford and Brown, 1975). Basically, he and his associates argue that practice, necessary in any case because of the novel features of the technique, removes 'undesirable' personality effects. A paper by Echternacht, Boldt and Sellman (1972) suggests that this might indeed be the case although their plea is more for an open verdict than anything else.

Suppose, for the sake of argument, that they are right. Is there psychometric evidence which would suggest that it would be worth switching to confidence weighting? Having compared the validities of conventional testing and various confidence testing procedures, Koehler (1971) concluded that conventional testing is preferable because it is easier to administer, takes less testing time and does not require the training of candidates. A similar conclusion was reached by Hanna and Owens (1973) who observed that greater validity could have been attained by using the available time to lengthen the multiple choice test rather than to confidence-mark items. Not satisfied with existing jargon, Krauft and Beggs (1973) coined the phrase "subject weighted test taking procedure" to describe a set-up in which candidates were permitted to distribute 4 points among 4 alternatives so as to represent their beliefs as to the correctness of the alternatives.
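For readers who want to see what the 'five stars' format amounts to, here is a minimal sketch (Python; scoring the response as the probability placed on the keyed answer is just one simple rule chosen for illustration - it is not necessarily the rule used in any of the studies cited).

```python
def star_probabilities(stars):
    """Convert a 'five stars' response (star counts per option) into the
    candidate's personal probability distribution over the options; the
    format itself forces the probabilities to sum to unity."""
    if sum(stars.values()) != 5:
        raise ValueError("exactly five stars must be distributed")
    return {option: count / 5 for option, count in stars.items()}

def confidence_score(stars, key):
    """One simple illustrative rule: credit the probability the candidate
    placed on the keyed answer."""
    return star_probabilities(stars)[key]

response = {"A": 3, "B": 0, "C": 1, "D": 1, "E": 0}
print(confidence_score(response, key="A"))   # 0.6
```

Even this toy version makes the practical difficulty plain: the candidate must hold, and be willing to report, a genuine probability distribution over options he may not care about at all.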
Total score was computed as the number of points assigned to correct alternatives. After all this, they found that the experimental procedure failed to encourage candidates to respond any differently than they would have done to a conventional multiple choice test, that is to say, no extra statistical discrimination was forthcoming. For the most affirmative results we must turn to a paper by Pugh and Brunza (1975). They claimed that by using a confidence scoring system the reliability of a vocabulary test was increased without apparently altering the relative difficulty level of the items. Moreover, no personality bias was found. What has to be realised about results like this is that reliability is not everything and that what may appear to be additional reliable variance may be irrelevant variance attributable to response styles. That, at any rate, was the view expressed by Hopkins, Hakstian and Hopkins (1973) who also pointed out that, far from increasing validity, response-style variance may actually diminish it.

Confidence testing has also found advocates in the medical examining field but the latest paper I have been able to find on the subject is no more optimistic than any of the rest. Palva and Korhonen (1973) investigated a scheme whereby candidates were asked to choose the correct answer as usual and to check a 1 if they were very sure of their answer, a 2 if they were fairly sure and a 3 if they were guessing. After applying the following scoring scheme (due to Rothman, 1969):

                           Very sure   Fairly sure   Guess
For a correct answer          4/3           1         2/3
For an incorrect answer      -1/3           0         1/3

these workers concluded that confidence testing does not give any substantial information in addition to what is given by conventional scoring and so cannot be justified. Their scoring scheme may be criticised for encouraging guessing by allowing 1/3 of a mark even when the guess is incorrect (see the critique by Paton, 1971) but given the direction of their results it is hard to imagine that any modification would make much difference to the conclusion.

Ranking alternative answers

From time to time, the idea of asking candidates to rank alternatives in order of plausibility is wheeled out. Certainly there are items, notably in economics and history, which seem to positively invite this mode of response. The idea would be to score a ranking according to where the keyed option was placed, on say a 4-3-2-1-0 points basis, so that a keyed answer ranked second would score 3. This is, in fact, the practice followed in self-scoring procedures, where an individual is given immediate feedback and has to continue making passes at the item until he comes up with the right answer, thus establishing a full or partial ranking of options. Dalrymple-Alford (1970) has studied this system from a theoretical point of view. Empirical investigations have been made by Gilman and Ferry (1972) and Evans and Misfeldt (1974) who report improvements in split-half reliability estimates, to which can be added the benefits of immediate feedback discussed in Chapter 1. From the standpoint of public examinations, these procedures suffer from the limitation that they can only really be implemented by a computer-assisted test administration - available pencil-and-paper techniques seem too ponderous - but in the classroom they are quite feasible and could be helpful in teaching. If specifying a complete ranking of alternatives is thought to be too much to ask, some form of restricted ranking may be suitable.
It is often the case that there is an alternative which is palpably least plausible. A possible scoring formula for this set-up might be to assign a score of 1 to the correct response and a score of -X to a 'least correct' response, where 0 < X < 1. Lord reports (Lord and Novick, 1968, p.314) that he investigated just such a scheme but gained little from using the scoring formula. I would imagine that with some subject areas agreement over what are the 'least correct' responses may be hard to come by although, of course, it could be done by looking at pretest statistics and choosing the options answered by the candidates with the lowest average test score.

Elimination scoring

Of the response procedures which require the candidate to do something other than mark what he believes to be the correct response, the one known as elimination scoring appears to have particular promise. All the examinee has to do is to mark all the options he believes to be 'wrong'. The elimination score is the number of incorrect options eliminated minus (k-1) for each correct answer eliminated, where k is the number of alternatives per item. Thus the maximum score on a five-choice item would be four and the minimum score -4, which a candidate would receive if he eliminated the correct answer and no other option. The penalty imposed for eliminating a correct answer is put in to control guessing. If a candidate decides at random whether or not to eliminate a choice or, what comes to the same thing, nominates it as a distractor, then his expected score from that choice is zero. Coombs, Milholland and Womer (1956), who appear to have been the first to study this procedure, obtained slightly higher reliability coefficients for elimination compared to conventional scoring. Rather more positive results were obtained in a later study by Collet (1971) and it looks as if elimination scoring might be worth investigating further. One mark in its favour, according to Lord and Novick (1968, p.315), is that it is likely to discourage blind guessing. Suppose, for example, that faced with a four-choice item a candidate has eliminated two alternatives he knows to be wrong. There remain two choices, one the answer and one a distractor. If he has run out of knowledge and chooses to have a go he is gambling an additional point credit against a three point loss. One objection to elimination scoring might be that it is too negative, laying too much stress on what is wrong rather than what is right. That remains to be seen. As yet we know little about this method of scoring.

Weighting of item responses

Once item responses have been made in the normal manner, they can be subjected to all kinds of statistical manipulations in an effort to produce more informative scores than those obtained by simply summing the number of correct items. Individual items can be differentially weighted, groups of items can be weighted, even options within items can be weighted. Sadly, all these efforts have amounted to very little, although a recent study by Echternacht (1976) reports more promising results. If the intercorrelations among items or sections of tests are positive (as they invariably are) then differential weighting of items or sections produces a rank order of scores which differs little from the order produced by number right (Aiken, 1966). At the time Aiken claimed that his analysis reinforced the results of previous empirical and theoretical work and the position has changed little since.
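Aiken's point is easy to demonstrate for oneself. The following crude simulation (Python 3.10 or later; all the numbers are arbitrary assumptions of mine, not taken from any of the studies cited) generates responses driven by a single ability, so that items are positively intercorrelated, and then compares number-right scores with scores formed from arbitrary positive item weights.

```python
import math
import random
import statistics   # statistics.correlation needs Python 3.10+

random.seed(1)

def rank_order_comparison(n_candidates=500, n_items=40):
    """Simulate positively intercorrelated item responses and compare
    number-right scores with arbitrarily weighted scores."""
    difficulties = [random.gauss(0, 1) for _ in range(n_items)]
    weights = [random.uniform(0.5, 2.0) for _ in range(n_items)]

    number_right, weighted = [], []
    for _ in range(n_candidates):
        theta = random.gauss(0, 1)   # the common ability
        responses = [1 if random.random() < 1 / (1 + math.exp(-(theta - b))) else 0
                     for b in difficulties]
        number_right.append(sum(responses))
        weighted.append(sum(w * r for w, r in zip(weights, responses)))
    return statistics.correlation(number_right, weighted)

print(round(rank_order_comparison(), 2))   # typically 0.95 or higher
```

The weighted and unweighted scores order the candidates almost identically, which is all that Aiken's analysis claims.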
The most authoritative paper of the period is the review by Stanley and Wang (1970); other relevant papers are those by Sabers and White (1969), Hendrickson (1971), Reilly and Jackson (1973) and Reilly (1975). All of these papers examine the effects of empirical weighting - empirical because weights are allocated after the event - via an iterative computer solution so as to maximise the reliability or validity of the scores. Thus candidates have no idea when they take the test of the relative score values of items. Not that this is anything new; in the normal course of events items weight themselves according to their discrimination values. The empirical weighting method, which incidentally was first proposed by Guttman (1941), is, of course, open to the objection that it is the candidates rather than the examiners who are in effect deciding which options are most credit-worthy, even to the extent of downgrading the nominal key if the brightest group of candidates should for some reason happen to be drawn to another option (although it is very unlikely that such an item would survive as far as an operational test). There is also an element of self-fulfilment present since candidates' scores are being used to adjust candidates' scores; students who do well on the test as a whole have their scores boosted on each item, which only compounds their superiority. Echternacht (1976) was not exaggerating when he observed that one would have problems giving candidates a satisfying explanation of an empirical scoring scheme.

If empirical weighting leaves something to be desired, what about a priori weighting of the options? Here one asks informed persons, presumably examiners, to put values on the options, perhaps with a simple 4-3-2-1-0 scheme. This has already been broached in this chapter. Alternatively and preferably, although difficult, examiners could be asked to construct items in such a way that the distracters were graded according to plausibility. One might argue they should be doing this anyway. What happens when a priori or subjective weighting is tried out? Echternacht (1976), using specially constructed items, found that the results it gave were not even as reliable as conventional number-right scoring and certainly inferior to the results obtained from empirical weighting. He found in fact that with empirical weighting he registered an increase in reliability equivalent to a 30 per cent increase in length of a conventionally scored test and also reported an increase in validity. Note however that his items, which were quantitative aptitude items, cost 60 per cent more to produce than items written in the usual way.

Recently my colleague Brian Quinn and I (Quinn and Wood, 1976) compared subjective and empirical weighting with conventional scoring in connection with the Ordinary level English language comprehension test mentioned in Chapter 1. First we attempted to rank options in an order of plausibility or likelihood but it was soon apparent that few of the distracters could be deemed worthy of any credit, or at least that they could not often be ranked in any meaningful order. For 33 of the 60 items we would recognise no merit in any of the options but the key; for 23 items a second best ('near miss') option only could be identified; for three items second and third best options could be found, and for only one item could all five options be ranked.
The arbitrary marking scheme chosen was 1.0 for the key, 0.75 for the second best option, 0.5 for third best, 0.25 for fourth best, and zero for the fifth best, plus all unranked options and omissions. In the event neither subjective nor empirical weighting made much difference to the original rank ordering of candidates produced by conventional scoring. For subjective weighting the correlation between the derived scores and conventional scores was 0.99, although, of course, a high correlation was expected owing to the fact that the same 33 items figured in both derived and conventional scores. The correlation for empirically weighted scores was at first rather lower (0.83) but on inspection this was found to be attributable to omits being scored zero. Dubious though the logic may be, it is clearly necessary to give rewards for omits, otherwise graded scoring will tend to discriminate heavily between candidates who choose any response, no matter how poor, and those who choose none at all. When omits were scored, the correlation between empirically weighted scores and conventional scores rose to 0.95. Any extra discrimination which graded scoring gives will naturally tend to be at the bottom end of the score range since these are the people who get a lot of items 'wrong' and who stand to benefit from partial scoring. Our experience has been that there is a little increase in discrimination in this sector but nothing spectacular.

It ought to be said that the strength of the case for graded scoring varies according to the subject matter. In mathematics and the sciences, also the social sciences, graded scoring may work; in French and English perhaps the problems of interpretation are too great.

SUMMARY

1. Good candidates are good 'guessers', as Linus says. All candidates should be encouraged to make use of all the knowledge at their disposal. When candidates choose to omit questions it is usually the more able ones who do not do themselves justice, being reticent when they actually know the right answers.

2. Blind guessing does occur but only when the conditions are ripe for it. Under appropriate testing conditions its effects can be reduced until it is no longer a problem. The occasional hysterical outbursts from a teacher or an examiner or a member of the public to the effect that multiple choice tests are no more than gambling machines are quite unjustified. All the evidence is that if tests are properly constructed, presented and timed, candidates will take them seriously.

3. Despite all the ingenuity and effort which has gone into developing methods for rewarding partial information, and no one has exceeded the Open University in this respect (see their undated document 'CMA Instructions'), there is little evidence that any one method provides measurable gains. Elimination scoring has possibilities, as does self-scoring, except that it is presently too limited in scope. Confidence weighting is, I think, too elaborate and beyond the average candidate. One is left with the conclusion that if the items in a test are well constructed, if candidates are advised to go over their answers since changing answers seems to pay, and if the testing conditions are such as to inhibit blind guessing with candidates being encouraged to attempt all items, number right scoring suffices for most needs.

6. Item Analysis

"Psychometricians appear to shed much of their psychological knowledge as they concentrate on the minutiae of elegant statistical techniques." (Anastasi, 1967)
That there is substance in Anastasi's rebuke cannot be denied. Item analysis has been the plaything of two or three generations of psychometricians, professional and amateur. Because the so-called classical methods of item analysis are accessible to anyone who is at all numerate, and because there is room for differences of opinion over how items shall be characterised statistically, the literature teems with competing indices - according to Hales (1972) more than 60 methods for calculating measures of discrimination have been proposed! In devoting as much space as I shall to item analysis and test construction, I run the risk of doing what Anastasi warns against, but I have no alternative if I am to survey the field properly.

There are, in any case, good grounds for taking item analysis seriously. More than any other testing technique, multiple choice is dependent on statistical analysis for legitimation. The guarantee of pretesting is intended to reassure the public that items are, as the London booklet says (University of London, 1975, p.3), "free from ambiguity, and are of an appropriate level of difficulty". Doubtless, we should expose other testing techniques to the same close scrutiny, but, for one reason or another, this does not seem to happen very often.

Actually, the term 'statistics', as applied to classical item analysis, is something of a misnomer, because the statistics are not motivated by probabilistic assumptions, or if they are, these are not at all apparent. It is true that in a sample the proportion getting an item correct is the best estimate of the item difficulty in the population, and that other statistical statements can be made, but, generally speaking, classical item analysis has developed along pragmatic lines. I might mention that Guilford's (1954) textbook is still far and away the best guide to the classical methods. For a more truly statistical approach, we must turn to the modern methods of item analysis which are based on theories of item response expressed in probabilistic terms (Lord and Novick, 1968). It is fair to say that these modern methods are still generally unknown or poorly understood, although the Rasch model, which is just about the simplest example of an item response model, is bringing people into contact with modern thinking. How much utility these methods have is another matter. My aim is to present classical and modern methods of item analysis, but to do so in such a way that, hopefully, the two approaches will be unified in the reader's mind. Unless I specify to the contrary, I am writing about dichotomous items, i.e. those scored 0 or 1.

For any item, the raw response data consist of frequency counts of the numbers of individuals choosing each option, together with the number not answering the item at all, known as the 'omits'. From this information, it is immediately possible to calculate the proportion or percentage of individuals getting the right answer. This statistic is known as the item difficulty or facility, depending on which nomenclature you prefer. Facility is perhaps the more felicitous term, since the higher the proportion correct, the easier is the item. For those who want a direct measure of item difficulty, the delta statistic (Δ) is available. Delta is a nonlinear transformation of the proportion correct, arranged so as to have a mean of 13 and a standard deviation of 4, giving it an effective range of 1 to 25.
The formula is:

Δ = 4Φ⁻¹(1 - Q) + 13

where Q is the proportion correct or item facility, and Φ⁻¹ is the inverse normal transformation (for details, see Henrysson, 1971, pp.139-140). The deviate Φ⁻¹(1 - Q) is the point on the standard normal curve above which a proportion Q of the area lies, so that easy items receive low deltas and difficult items high ones. The point of choosing a nonlinear transformation is that proportions or percentages fall on a nonlinear scale, so that judgement of relative differences in facilities is apt to be misleading. Thus, the difference in difficulty between items with facilities of .40 and .50 is quite small, whereas the difference between items with facilities .10 and .20 is quite large. The delta scale, however, is ostensibly linear, so that the difference in difficulty between items with deltas of 13 and 14 is taken to be the same as the difference in difficulty between items with deltas of 17 and 18. Figure 6.1 shows the approximate relationship between facility and delta. It can be seen that a facility of 0.50 corresponds to a delta of 13, since Φ⁻¹(0.5) equals 0.

What Δ does not have is a direct relationship with a test statistic like number correct, but this is easily remedied by transforming the total score so that it has the same mean and standard deviation as delta.

With test score as the categorising variable, it is possible to divide the candidate population into ability groups or bands, and to observe how these groups respond to the item. (Test score is chosen simply because it is usually the best measure of the relevant ability, but if a better measure is available, it should be used.) By sorting the item responses into a two-way table of counts, with ability bands as rows and alternative answers as columns, the data can be laid out for inspection. Table 6.1 shows what I mean. Here the five ability bands - and five is a good number to use - have been constructed so as to contain equal numbers of candidates, or as nearly equal as possible, which means that, unless the distribution of scores is rectangular (see Chapter 7), the score intervals will always be unequal. However, there is no reason why the bands should not be defined in terms of equal score intervals or according to some assumption about the underlying distribution of scores. If, for instance, one wanted to believe that the underlying score distribution was normal, the bands could be constructed so as to have greatest numbers in the middle bands and smallest in the outer bands. The problem then is that, given small numbers, any untoward behaviour in the tails of the distribution would be amplified or distorted. Also, interpretation of the table would be more prone to error because of the varying numbers in the bands.

Fig. 6.1 The approximate relationship between facility and delta

TABLE 6.1 Responses to a five-choice Chemistry item, by ability band: counts for each option, option proportions, and mean criterion (test) scores [table not reproduced; the figures quoted below are taken from it]

The item generating the data in Table 6.1 belonged to a 50-item Chemistry test taken by 319 candidates. Normally, of course, the candidate population would be far larger than this but for my purposes there is some advantage to be gained from smaller numbers. The correct (starred) answer was option A, chosen by 146 candidates, which, as the number underneath indicates, was 0.46 of the entry. The facility of this item was, therefore, 0.46, and the difficulty (Δ) 13.42. Of the distracters, E was most popular (endorsed by 82 candidates, or 0.26 of the entry), followed by C, D and B. Only two candidates omitted the item.
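As a check on the arithmetic, the delta for the Table 6.1 item can be reproduced in a few lines of code (Python; the function is mine, written to the formula given above):

```python
from statistics import NormalDist

def delta(facility):
    """ETS-style delta: 13 + 4z, where z is the normal deviate above
    which a proportion equal to the facility lies. Easy items get low
    deltas, difficult items high ones."""
    return 13 + 4 * NormalDist().inv_cdf(1 - facility)

# The Table 6.1 item: 146 of 319 candidates chose the key.
p = 146 / 319
print(round(p, 2), round(delta(p), 2))   # 0.46 and about 13.4, agreeing
                                         # with the quoted 13.42 to within rounding
```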
Turning to the body of the table, a pattern will be evident. Whereas, under the correct answer A, the count increases as the ability level rises, under the distracters (excepting D, where the trend is unclear) the gradient runs in the opposite direction. This is just as it should be if we want the item to discriminate in terms of total test score. The pattern we should not want to see would be one where the counts under A were equal or, worse, where the count in each cell of the table was the same. As it is, the distribution of answers tells us quite a lot. Relatively speaking, options B and C were much more popular in the bottom ability band than in the rest, and in the bottom band the correct answer was barely more popular than B and D, which were almost totally rejected by the top two bands. Taken as a whole, the table underlines the observation made in the last chapter that wrong answers are seldom, if ever, distributed equally across the distracters, either viewing the candidate population as a whole, or in bands. Nor is there any evidence of blind guessing, the sign of which would be an inflated number in the top left hand cell of the table - the one containing a '9' - causing the gradient to flatten out at the bottom, or even go in the other direction.

The notion of a gradient of difficulty is a useful way of representing another characteristic of an item, its effectiveness in establishing the difference between 'clever' and 'dull' candidates, or its discriminating power. The usual approach to obtaining a measure of item discrimination is to calculate the correlation between score on the item (1 or 0) and score on the test as a whole, the idea being that the higher the correlation between candidates' score on an item and their score on the test, the more effective the item is in separating them. Naturally, this relationship is a relative one; when included in one test, an item could have a higher item-test correlation than when included in another, yet produce poorer discrimination.

The correlation I am talking about has a special name, point biserial, and it is worth examining how it is calculated to see how much information from the table it actually uses. The formula for the sample point biserial correlation can be set out in various ways, but the most convenient for my purposes is as follows:

rpbis = [(Mp - M)/S] √(p/(1 - p))

where Mp is the mean score on the test obtained by those who got the item correct, M is the mean score on the test for the entire group, S is the standard deviation of the test scores for the entire group, and p is the proportion getting the item right (the item facility). Evidently, rpbis serves as a measure of separation through the action of the term Mp - M; it is also a function of item facility, and the effect of this will be looked at presently.

To calculate rpbis for the item in Table 6.1, the values of Mp and M can be found in the row directly underneath the body of the table labelled 'Mean criterion', where 'criterion' means test score. These mean test scores provide useful supplementary information. Thus, the mean score on the test obtained by those choosing A, the correct answer, was 30.79. This is Mp. Similarly, the 13 candidates choosing B scored an average of 18.77 on the test, which made them the lowest scoring of the four distractor groups. The mean score on the test for the entire group, M, is given at the right end of the 'Mean criterion' row, and is 26.02.
The value of the standard deviation, S, which is not given in the table, was 8.96. The expression for rpbis is, therefore,

[(30.79 - 26.02)/8.96] √(0.46/0.54)

which turns out to be 0.49. The question immediately arises, 'Is a value of 0.49 good, bad or indifferent?' This is a fair question to ask of any correlation taking this value. If it were an ordinary product moment correlation, one might interpret the value within the limits -1 to +1 but, with the point biserial, this assumption may be unjustified. In fact, as Wilmut (1975(a), p.30) demonstrates, the point biserial coefficient, when applied to item analysis, is unlikely ever to exceed 0.75, or to fall below -0.10. In these circumstances, a value of 0.49 signifies quite effective discrimination.

Of the many discrimination indices, the chief competitor to the point biserial is the biserial correlation, and much has been written on the subject of which statistic is preferable. Unlike the point biserial, the biserial is not a product moment correlation; rather, it should be thought of as a measure of association between performance on the item and performance on the test or some other criterion. Also distinguishing it from its competitor is the assumption that underlying the right-wrong dichotomy imposed in scoring an item is a normally distributed latent variable which may be thought of as representing the trait or traits that determine success or failure on the item. Doubts about the tenability or appropriateness of this assumption lead some people to have nothing to do with the biserial. Equally, there are those, like myself, who find the assumption underlying the point biserial - that a person either has the ability to get an item right or has none at all - quite implausible. What has attracted people to the biserial is the possibility that it may have certain desirable properties which make it superior to the point biserial or any other discrimination index, namely, that it is less influenced by item difficulty and also - an important property this - that it holds stable, or is invariant, from one testing situation to another, a property the point biserial definitely does not possess.

The formula for calculating the sample biserial correlation coefficient resembles that for the point biserial quite closely, being

rbis = [(Mp - M)/S] [p/h(p)]

where the terms are as before, except for h(p), which stands for the ordinate or elevation of the normal curve at the point where it cuts off a proportion p of the area under the curve. h(p) enters into the formula because of the assumption about the normally distributed underlying variable. It is easily looked up in any textbook containing statistical tables (see, for instance, Glass and Stanley, 1970, Table 8).

The relationship between the biserial and point biserial formulae is simple, being

rpbis = rbis h(p)/√(p(1 - p))

This means that the point biserial is equal to the biserial multiplied by a factor that depends only on item difficulty, so that the point biserial will always be less than the biserial. In fact, Lord and Novick (1968, p.340) show that the point biserial can never attain a value as high as four-fifths of the biserial, and present a table showing how the fraction varies according to item difficulty (see also Bowers, 1972). In theory, the biserial can take any value between -1 and +1. Negative values usually indicate that the wrong answer has been keyed. Values greater than 0.75 are rare, although, in exceptional circumstances, the biserial can exceed 1.
This is usually due to some peculiarity in the test score or criterion distribution (Glass and Stanley, 1970, p.171). For the item in Table 6.1, the biserial estimate was 0.62, about which one would say the same as about the point biserial value, that it signifies quite effective discrimination.

As Lord and Novick (1968, p.342) observe, the extent of biserial invariance is necessarily a matter for empirical investigation. They themselves claim that 'biserial correlations tend to be more stable from group to group than point biserials' and present some results which point in that direction. My view is that this is still very much an open question. Experience at the London examinations board indicates that even with ostensibly parallel groups of candidates biserial estimates for the same item can 'bounce' around beyond what would be expected from the estimated margin of error.

So, what is the answer to the burning question, 'Biserial or point biserial?' The consensus among people who have studied this question (e.g. Bowers, 1972) seems to be that as long as a markedly non-normal distribution of ability is not anticipated, substantially the same items are selected or rejected whichever statistic is used to evaluate discrimination. It is true that the point biserial is rather more dependent on the level of item difficulty but this is not serious, since it only leads to rejection of very easy or very difficult items, which would be rejected anyway. For the practical user, my advice is to fasten on to one or another statistic, learn about its behaviour, and stick with it. Switching from one to the other, or trying to interpret both simultaneously, is a waste of time.

OTHER DISCRIMINATION INDICES

Of all the discrimination indices which have been advanced, the simplest is undoubtedly D, or net D, as it is sometimes called. If, for any item, Rh is the proportion correct achieved by the 27 per cent highest scorers on the test, and Rl is the corresponding figure for the 27 per cent lowest scorers, then D = Rh - Rl. It may seem odd that just as good results can be obtained by discarding the middle of the score distribution as by using the whole distribution, but providing the ability being measured is normally distributed, and that is a big proviso, this is the case. The quantity of information may be reduced, but the quality is improved, the result being groups which are, in Kelley's (1939) words, "most indubitably different with respect to the trait in question". Those interested in the statistical basis of the 27 per cent rule might consult Kelley's original paper, or, more recently, that by Ross and Weitzman (1964). Incidentally, D'Agostino and Cureton (1975) have recently shown that the correct percentage is more like 21 per cent, but add that the use of 27 per cent is not far from optimal.

It is important to remember that the D statistic was invented to fill a need at the time for a short-cut manual method of estimating the discriminating power of an item. Now that most users will have access to an item analysis computer program, short-cut methods have become pointless, although there is a story that someone actually programmed the calculations for the D statistic! For those who want or need an item analysis by hand - and it is still a good way of getting the 'feel' of item performance - it can be said that D agrees quite closely with biserial correlation estimates, even when the underlying distribution is non-normal (Hales, 1972).
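The point biserial and biserial values quoted for the Table 6.1 item are easily verified from the formulas above (a minimal Python sketch; only the standard library and the values quoted in the text are assumed):

```python
from statistics import NormalDist

def point_biserial(mp, m, s, p):
    """Sample point biserial: the separation term (Mp - M)/S scaled
    by sqrt(p/(1 - p))."""
    return (mp - m) / s * (p / (1 - p)) ** 0.5

def biserial(mp, m, s, p):
    """Sample biserial: the same separation term, scaled by p/h(p),
    where h(p) is the normal ordinate at the point cutting off a
    proportion p of the area under the curve."""
    z = NormalDist().inv_cdf(1 - p)      # the point above which a proportion p lies
    return (mp - m) / s * p / NormalDist().pdf(z)

# Values quoted for the Table 6.1 item.
mp, m, s, p = 30.79, 26.02, 8.96, 0.46
print(round(point_biserial(mp, m, s, p), 2))   # 0.49
print(round(biserial(mp, m, s, p), 2))         # 0.62
```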
Tables from which D can easily be calculated (replacing the old Fan tables) have been compiled by Nuttall and Skurnik (1969), who also provide other 'nuts and bolts' for a manual item analysis. Like the point biserial and even the biserial, the D index is dependent on item facility. In particular, it decreases sharply as the facility approaches 0 or 1, when it must be interpreted with caution (Connaughton and Skurnik, 1969; Nuttall and Skurnik, 1969) but then, as I have said, the test constructor will probably not be interested in these items anyway. Those who prefer a discrimination index which is independent of item difficulty might be interested in the rank biserial correlation coefficient (Glass, 1966). However, there are problems with this index when frequent ties among test scores occur, and it is, therefore, not recommended for large groups (n > 50).

If most discrimination indices are affected by difficulty anyway, why not deliberately combine difficulty and discrimination into one index? Ivens (1971) and Hofmann (1975) have attempted to do this in different ways, working within nonparametric and probabilistic frameworks respectively. For Ivens, the best possible item will have a difficulty of 0.5 and perfect discrimination, 0.5 being chosen because this value maximises the number of discriminations an item can make; for Hofmann, item efficiency, as he calls it, is defined as the ratio of observed discrimination to maximum discrimination, the latter having been determined from the difficulty of the item. Both workers make certain claims for their indices, Hofmann's being rather more sweeping, but it is too soon to say what substance these have. Significantly, Ivens admits that there will be instances where his index will not be appropriate for item selection as, for example, when the object is to construct a test to select a small proportion of individuals. A more serious shortcoming of both indices, and this applies to others like the rank biserial, is their failure to satisfy what Lord and Novick (1968, p.328) call a basic requirement of an item parameter, namely, that it should have 'a definite (preferably a clear and simple) relationship to some interesting total test score parameter'. To call this a basic requirement is no exaggeration. Item statistics are only useful insofar as they enable us to predict the characteristics of a test composed of the items examined. Bowers (1972) is absolutely right when he remarks that a comparison of values of biserial and point biserial coefficients, or any other indices, begs the question. What matters is to select items that lead to test score distributions which are best for a particular application. I shall elaborate on this theme in the next chapter.

GENERALISED ITEM STATISTICS

So far, in calculating discrimination indices, only information about those who got the item right (or wrong) has been used, whereas, of course, there is information associated with each wrong answer, namely the mean test scores achieved by the candidates who fall for the distractors. It is reasonable to ask whether statistics could not be devised to take this information directly into account, and so provide more accurate summaries of how an item behaves. What is wanted are generalisations of the point biserial and biserial coefficients, and these have, in fact, been developed by Das Gupta (1960) and Jaspen (1965) respectively.
To calculate the point multiserial, as Das Gupta terms his statistic, each response option, including the right answer, is treated as a separate nominal category, as if each represented a character such as eye colour. With the polyserial, on the other hand, being a generalisation of the biserial, it is necessary that the distracters can be ordered or graded in terms of degree of 'wrongness' or 'rightness', so that the assumption of an underlying normally distributed trait can be better met. I must say that, in my experience, items seldom lend themselves to this kind of ordering, at least not on a large scale (see the discussion on a priori weighting in Chapter 5). That is why I think the polyserial coefficient will generally find a more suitable application when the polychotomised variable is something like an examination grade or a rating, where there is a natural order of measurement. The point multiserial is the more suitable statistic, but it is rather cumbersome to calculate (although not once it is programmed). Having used it myself, I have never felt that it was any more informative than the ordinary point biserial. My feeling is that these generalised statistics are not a great deal of use in regular item analysis, although I remain willing to be convinced. The user would be just as well off with biserial estimates calculated for each distractor, and some item analysis programs do provide this information.

Whatever item statistics are dreamed up, being of a summary nature they are bound to be less informative than we would like. As Wilmut (1975(b), p.2) has observed, an infinite number of items can have different response patterns, yet possess the same discrimination index. The difficulty index has limitations too; "All we know is that if the respondent passes it, the item is less difficult than his ability to cope with it, and if he fails it, it is more difficult than his ability to cope with it" (Guilford, 1954, p.419). The message is that it is a mistake to rely too heavily on item statistics. Instead, attention must be fixed on item response patterns, or gradients, as was done for the data in Table 6.1. This complicates matters, but is unavoidable if accurate predictions of test score distributions are to be made.

THE ITEM CHARACTERISTIC CURVE

The most instructive way of examining an item response pattern is to plot a graph showing how success rate varies with candidates' ability, for which total test score usually stands proxy. The result is called an item characteristic curve. It is the coping stone of modern item analysis methods, but the idea is as old as educational measurement itself, dating from 1905, when Binet and Simon plotted curves to show how children's success rates on items varied with age. The movement towards summarising item response patterns only came with the streamlining of item selection procedures.

To plot an item characteristic curve, the most obvious method would seem to be to plot success rates for as many groups as there are different test scores. In practice, however, this method is not only finicky, but may also mislead, the reason being that success rates calculated from very small numbers of candidates obtaining certain test scores are unstable, and thus may give a false impression of how an item performs.
Since the relationship between the assumed underlying ability and test score is unknown, and the test scores are bound to be fallible, it is preferable to group candidates in terms of test score intervals, the supposition being that all candidates within a group possess roughly the same amount of the ability in question. When this is done, a curve like that in Fig. 6.2 results. A step-by-step method for producing the curve is given in the Appendix of Wilmut (1975(b)). Since we cannot measure 'ability' directly, the unit of measurement for the ability dimension is test score expressed in standardised form.

Fig. 6.2 (item characteristic curve: P plotted against ability)

The curve in Fig. 6.2 is the classic form - steep in the middle and shallow at the tails. Given anything like a normal distribution of ability, items with this characteristic are needed to produce discrimination among the mass of candidates in the middle of the score range. If, however, the focus of discrimination is elsewhere, say at the lower end of the ability range, then items with characteristic curves like that shown in Fig. 6.3 will be needed.

Fig. 6.3 (item characteristic curve: P plotted against ability)

There is not the space to demonstrate the variety of item characteristic curves. Those with a special interest might consult Wood and Skurnik (1969, p.122) and Wilmut (1975(b), p.4 and 5). I might add that it is quite feasible to plot response curves for each distractor and to display them on the same graph as the item characteristic curve. It is then possible to inspect the behaviour of each distractor.

PROBABILISTIC MODELS OF ITEM RESPONSE

While the item characteristic curves should be displayed for inspection whenever possible, they are not in a suitable form for theoretical exploratory and predictive work. It would, therefore, be useful if these curves could be represented by a mathematical function or functions. Repeated investigation has shown that if item characteristic curves are well behaved they can be fitted by functions of the exponential type. Such functions then constitute a model of the item response process in which an individual's probability of success on an item is said to be governed jointly by his ability and by the difficulty and discrimination of the item.

Fig. 6.4 (P plotted against ability)   Fig. 6.5 (P plotted against ability)   Fig. 6.6 (P plotted against ability)

Various models have been proposed to fit different families of curves. If items are extremely well-behaved and look like Fig. 6.4 - same discrimination, but varying difficulties - they will fit what is called the Rasch model, i.e. the one-parameter logistic model, in which the one parameter is the item difficulty. If items are not so well-behaved and look like Fig. 6.5 - varying difficulties and discriminations - they will fit either the two-parameter logistic model or the two-parameter normal ogive model, which are very similar. Finally, if items look like what Levy (1973, p.3) calls 'reality' (Fig. 6.6), then some will fit one model and some another, but not all will fit the same model, however complicated it is made (within reason, of course). The item analyst or constructor then has to decide whether or not to discard those items which fail to fit his favourite model. With the most restrictive model - that of Rasch - quite a number of items may have to be discarded or, at least, put to one side; with the other models, which allow discrimination to vary, not so many items should be rejected.
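As a rough illustration of the two ideas just described - the empirical curve obtained by banding candidates on total score, and the logistic functions used to model such curves - here is a minimal sketch in Python. The band width, item parameters and simulated data are assumptions made for the example, not anybody's published procedure.

    # Empirical item characteristic curve by score bands, and the logistic
    # item response functions used to model such curves. Toy data throughout.
    import numpy as np

    def empirical_icc(item_scores, total_scores, n_bands=10):
        """Proportion correct on one item within equal-width bands of total score."""
        edges = np.linspace(total_scores.min(), total_scores.max(), n_bands + 1)
        band = np.digitize(total_scores, edges[1:-1])        # band index 0 .. n_bands-1
        return [(edges[b], edges[b + 1], item_scores[band == b].mean())
                for b in range(n_bands) if (band == b).any()]

    def logistic_icc(ability, difficulty, discrimination=1.0):
        """Two-parameter logistic curve; with discrimination fixed at 1 it is the Rasch (one-parameter) form."""
        return 1.0 / (1.0 + np.exp(-discrimination * (ability - difficulty)))

    rng = np.random.default_rng(1)
    ability = rng.normal(size=500)
    item = (rng.random(500) < logistic_icc(ability, difficulty=0.0, discrimination=1.5)).astype(int)
    total = ability * 8 + 40 + rng.normal(scale=3, size=500)  # crude stand-in for total test score

    for lo, hi, p in empirical_icc(item, total):
        print(f"scores {lo:5.1f}-{hi:5.1f}: proportion correct {p:.2f}")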
The utility of these item response models obviously depends on how many items are reasonably well-behaved and how many are like 'reality'. This is especially true where the Rasch model is concerned. Even though the criteria for acceptance are stiff, the more fervent champions of this model insist, in Procrustean fashion, that the data should fit the model, rather than the model fit the data (Wright, 1968; Willmott and Fowles, 1974), the reasoning being that only items which fit can produce 'objective' measurement. The technical conditions for ensuring objectivity are discussed by Rasch (1968) but, basically, to be 'objective', measurement should be 'sample-free', which means that it should be possible to estimate an individual's ability with any set of items, providing they fit the model. This property of 'sample-freeness' is hailed as an important breakthrough by Rasch enthusiasts, and it is true, if all goes well, 'sample-freeness' does work. What has tended to escape attention are the cases where 'sample-freeness' breaks down, as when an item behaves differently in different samples, or discriminates erratically across the ability range, e.g. as in Fig. 6.4. Whitely and Dawis (1974, 1976) have a good discussion of this point. In the 1976 paper they note that the assumptions of the item-parameter invariant models of latent traits may not always correspond to the psychological properties of items. Thus test difficulty may depend on the tendency of items to interact in context, as well as on their individual difficulties.

'Sample-freeness' depends on an item behaving uniformly across the ability range, but if you estimate the difficulty of an item from a high ability group - as Rasch says you are entitled to - you can never be sure how the item will work with the rest of the candidate population or sample. The only way of finding this out is to test the item on a group of individuals formed by sampling more or less evenly across the ability range, or, failing that, to draw a random sample of individuals, just as one would if using the classical methods of item analysis, which Rasch enthusiasts find wanting.

To my mind, the real importance of item response models lies in the estimation of abilities. Given a set of items which fit one of the models, individuals' abilities can be estimated more or less directly from their responses. Since each item response is weighted according to the discriminating power of the item, the Rasch model gives estimates which correlate perfectly with total test score. Thus, one could say that the Rasch model provides the necessary logical underpinning for classical analysis, and the use of number correct score. However, there is more to it than this. If all goes well, these ability estimates possess certain properties which test scores do not have, the most spectacular being that Rasch model estimates fall on a ratio scale, so that statements like 'person X has twice as much ability as person Y' can be made. To date, this property does not seem to have been much exploited, although Choppin's (1976) paper is an exception. It is not that there has been any shortage of people wanting to have a crack at fitting items to the Rasch model, far from it, so voguish has this model become. The trouble is that, having run the computer program and obtained results, people tend to be at a loss as to what to do next. In fact, fitting items to the Rasch model, or any other, is just the beginning.
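The directness of the estimation can be illustrated in a few lines. The sketch below assumes that a set of items fits the Rasch model and that their difficulties have already been calibrated; Newton-Raphson then solves sum over items of P(theta, b) = raw score for the ability theta, which is why, under this model, number correct carries all the information. The difficulty values are invented, and the fragment is no substitute for a proper calibration program.

    # Rasch ability estimate from the number correct score, given calibrated
    # item difficulties (the difficulties below are invented for illustration).
    import math

    def rasch_p(theta, b):
        return 1.0 / (1.0 + math.exp(-(theta - b)))

    def ability_from_raw_score(raw, difficulties, n_iter=20):
        if raw <= 0 or raw >= len(difficulties):
            raise ValueError("zero and perfect scores have no finite ML estimate")
        theta = 0.0
        for _ in range(n_iter):
            p = [rasch_p(theta, b) for b in difficulties]
            expected = sum(p)                       # expected raw score at theta
            info = sum(q * (1 - q) for q in p)      # test information (derivative)
            theta += (raw - expected) / info        # Newton-Raphson step
        return theta

    difficulties = [-1.5, -1.0, -0.5, 0.0, 0.0, 0.5, 1.0, 1.5]
    for raw in range(1, len(difficulties)):
        print(f"raw score {raw}: estimated ability {ability_from_raw_score(raw, difficulties):+.2f}")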
There remain the tricky problems of identifying and validating traits so that meaning, which is after all the central preoccupation, is injected into the measurement. I touch on some of these problems in Wood (1976(b)).

As the reader will have gathered, I am sceptical about the efficacy of these latent trait models. The fact is that, despite heavy promotion, especially in the case of the Rasch model, they have yet to deliver the goods in terms of practical utility. For instance, no one has yet demonstrated, to my satisfaction, how an examining board, running a multiple choice pretesting programme, might profitably use the Rasch model in preference to the classical methods presently used. My judgement is that, for standard group testing situations, such as examinations, the gains to be had from these models are not enough to justify going over wholesale to them. Where they do come into their own is in connection with individualised, or tailored, testing, or more generally, whenever different students are given different tests and it is necessary to place all students on the same measurement scale. I shall have something to say about this in the next chapter.

SUMMARY

1. Item statistics are only useful insofar as they enable us to select items that lead to test score distributions which are best for a particular application.

2. Used on their own, summary item statistics are not informative enough, and can give misleading predictions of test score distributions. It is preferable to inspect the entire item response pattern, even if this is more time-consuming. Item behaviour is best brought out by the characteristic curve, which bridges the old and the new methods of item analysis.

3. Probabilistic item response models have yet to demonstrate their utility in respect of routine item analysis programmes, or their capacity to illuminate behaviour. On the other hand, the Rasch model, in particular, has been a force for good in that it has made people who might otherwise have remained unenlightened aware of measurement problems, and of ways of thinking about measurement. It has also introduced a welcome rigour into what was formerly (and still is) a jumble of ad hoc practices. My quibbles are those of one who is impatient to leave the discovery stage behind, and to engage the important question, which is, "What is being measured?"

7. Item Selection and Test Construction

"Fools rush in where angels fear to tread."

The idea that a test is fixed in length and duration, comes in a fancy booklet and is sat by large numbers of candidates is a product of the group testing ethos which has dominated educational and psychological testing, especially the former, the best example being public examinations. But when you think about it, there is no reason, in principle, why a person or group of persons should not be given different, although not necessarily exclusive, sets of items. After all, it is an individual who takes a test, not a group. In any group test, there will be some items that are so easy for some candidates that they solve them without effort, and also some that are so difficult for others that they cannot begin to answer them. In an individualised measurement procedure, items are chosen for each individual so as, in Thorndike's (1971) words, "to define the precise limits of an examinee's competence". Thus group and individualised testing call for rather different approaches to item selection.
I shall deal with both approaches in this chapter although I shall devote more space to the construction of group tests, since these are still far and away the most widely used. Traditionally, group tests have been designed to measure individual differences. Such tests are known in the trade as norm-referenced tests. Suppose, however, that the object is not to discriminate between individuals but, instead, to find out whether they are able to satisfy certain criteria or, in the current jargon, demonstrate minimal competencies. What is wanted here are criterion-referenced tests, about which so much has been written in the last ten years. Later in the chapter, I will deal with the construction of these tests, and also with the construction of tests designed to discriminate between groups of individuals, interest in which has sprung up as a result of the accountability drive in the USA. The chapter ends with a section on item analysis computer programs.

CONSTRUCTING GROUP TESTS

Norm-referenced tests

For tests of attainment, such as GCE and CSE examinations, the measurement objective is to discriminate maximally between candidates so that they may be ordered as accurately and precisely as possible. Ideally, the distribution of test scores should be uniform or rectangular, as drawn below (Fig. 7.1). The worst possible distribution, given this objective, would be a vertical line, since this would signify that all candidates had received the same score.

Fig. 7.1 (rectangular distribution: frequency plotted against score)

Suppose that a test constructor sets out to achieve a rectangular distribution of test scores. What sort of items are needed? There are two ways of approaching this question, one classical and one modern. According to Scott (1972), writing in the classical tradition, there is presently no consensus concerning the best method of obtaining a rectangular score distribution. Everyone agrees that the correlations between items should be high, although exactly what the range of values should be is disputed; it is over what difficulty values items should have that the arguments occur. Scott, himself, having looked into the matter thoroughly, concluded that maximum discrimination will result if all items have p values of 0.50 (Δ = 13), and correlate equally at around 0.33, where, of course, the p values apply to the relevant candidate population. If the group being tested were very able, and the p values applied to a typical population, the recommendation would not work. Naturally, the distribution of ability in the group being tested affects the score distribution.

It may seem late in the day to bring in the idea of item intercorrelation, but I deliberately chose not to discuss it in the last chapter, on the grounds that it is not a statistic that pertains to any one item, rather to pairs of items. In this sense, it fails to satisfy the Lord and Novick test of a useful item parameter (see Chapter 6). However, in other respects, it is most useful. For instance, reliability of the internal consistency variety depends entirely upon the item intercorrelations, so that, given estimates of the latter, internal consistency can be estimated. It is high internal consistency which is reflected in the rectangular distribution; items which measure the same competence over and over again will rank candidates in the same order and the equal spacing characteristic of the rectangular distribution will emerge.
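To make the dependence of internal consistency on intercorrelation concrete, here is a small sketch showing the standard formulas: a Spearman-Brown type expression giving standardised reliability from the average item intercorrelation, and KR-20 computed from item facilities and the test score variance. Neither formula is specific to this book, and the simulated response matrix is an assumption for the example.

    # Internal consistency from item intercorrelation, and KR-20 from a
    # (simulated) 0/1 response matrix of candidates x items.
    import numpy as np

    def alpha_from_mean_r(n_items, mean_r):
        return n_items * mean_r / (1 + (n_items - 1) * mean_r)

    def kr20(responses):
        n = responses.shape[1]
        p = responses.mean(axis=0)                          # item facilities
        total_var = responses.sum(axis=1).var(ddof=1)       # variance of total scores
        return (n / (n - 1)) * (1 - (p * (1 - p)).sum() / total_var)

    print(f"40 items, mean r 0.33: alpha = {alpha_from_mean_r(40, 0.33):.2f}")
    print(f"40 items, mean r 0.15: alpha = {alpha_from_mean_r(40, 0.15):.2f}")

    rng = np.random.default_rng(2)
    ability = rng.normal(size=(300, 1))
    responses = (rng.random((300, 40)) < 1 / (1 + np.exp(-(ability - rng.normal(size=40))))).astype(int)
    print(f"KR-20 for the simulated test: {kr20(responses):.2f}")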
It is when item content and demand varies that candidates who have studied differently, and so are different anyway, are able to reach the same score by different routes, and scores pile up in the middle of the score range. One candidate might know A and B, but not C, another A, but not B and C, and another C only. This is just what happens in most attainment tests, where item intercorrelations are generally of the order of 0.10 to 0.20, rather than the 0.33 which Scott suggests is necessary to really flatten out the test score distribution. I will explain what I believe to be the reason for this later. As long as total test score is used as the criterion for measuring discrimination, high internal consistency means high item intercorrelations, which, in turn, mean high discrimination values, and vice-versa.

Discrimination provides the link between classical and modern methods. To achieve a rectangular score distribution, what is needed are items with characteristic curves like the one in Fig. 7.2 - steep over nearly the whole of the ability range, and, therefore, highly discriminating.

Fig. 7.2 (item characteristic curve: P plotted against ability)

To predict what the test properties will be, it is only necessary to add together the item characteristic curves to produce a test characteristic curve. Thus, the result of accumulating curves like the one shown in Fig. 7.2 will be a test characteristic curve identical to that curve. In practice, of course, the item characteristic curves would vary in slope, so that the test characteristic curve would not be as steep as that shown above, and the test score distribution would be less flat. A useful set of graphs illustrating the relationship between test characteristic curves and test score distributions is provided by Wilmut (1975(b), p.7). Since exact or even close matches will be rare, the test constructor will have to exercise discretion, especially with the 'poorly-behaved' items which discriminate effectively over one or more parts of the ability range, but not elsewhere. Sometimes, if the test constructor can find a complementary item which discriminates where the other does not, the two together should provide highly effective discrimination across the whole ability range. The idea is demonstrated in Fig. 7.3.

Fig. 7.3 (two complementary item characteristic curves: P plotted against ability)

Incidentally, this example shows the value of having a relaxed view towards item selection. Probably, neither of these items would fit an item response model, but they still have something to offer in terms of discrimination. If low intercorrelations and, therefore, discrimination values frustrate attempts to produce rectangular distributions of scores for multiple choice attainment tests, should those responsible for putting together tests depart at all from the optimum item selection strategy? They should not. Whether or not they are aware of the consequences of low intercorrelations, they should act as if rectangular distributions were realisable. That is to say, they should follow the advice most generally given in text books and articles on multiple choice (see, for instance, Macintosh and Morrison, 1969, p.66-67), to choose items with facilities between 0.40 and 0.60, or Δ values between 12 and 14, and with discrimination values (usually biserial r) greater than 0.40, if possible. Only when items are very homogeneous, which means an average intercorrelation greater than 0.33, should item facilities be distributed more evenly (Henrysson, 1971, p.152-153).
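The 'adding up of curves' mentioned above can be sketched quite simply. In the fragment below, forty items of equal difficulty and discrimination are summed to give a test characteristic curve, and a population of abilities is pushed through it to show the shape of the resulting score distribution; all parameter values are invented for the illustration and do not come from any of the papers cited here.

    # Test characteristic curve as the sum of item characteristic curves,
    # and the score distribution that follows from it. Toy parameters.
    import numpy as np

    def icc(ability, difficulty, discrimination):
        return 1 / (1 + np.exp(-discrimination * (ability - difficulty)))

    rng = np.random.default_rng(3)
    difficulties = np.zeros(40)              # forty items, all of p = 0.50 difficulty for the typical candidate
    discriminations = np.full(40, 1.5)

    ability = rng.normal(size=2000)
    p = icc(ability[:, None], difficulties, discriminations)    # candidates x items
    expected = p.sum(axis=1)                                    # test characteristic curve, evaluated per candidate
    simulated = (rng.random(p.shape) < p).sum(axis=1)           # simulated number correct scores

    print(f"mean expected score {expected.mean():.1f}")
    hist, edges = np.histogram(simulated, bins=8, range=(0, 40))
    for lo, hi, count in zip(edges[:-1], edges[1:], hist):
        print(f"scores {int(lo):2d}-{int(hi):2d}: {count:4d}")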
I am aware that this advice seems to run counter to common sense. If all items are of the same difficulty, they can only measure efficiently those whose ability level corresponds to the difficulty level. Only if items are distributed across the difficulty range so that everyone has something they can tackle, can everyone be measured reasonably efficiently. This argument is impeccable - as far as it goes. The fact is that neither item selection policy will give the best results, the first because it neglects the most and least able candidates, and the second because, unless the test is to be grotesquely long, there are too few items at each point of the difficulty/ability range to provide effective measurement. The equal difficulty strategy is simply the better of two poor alternatives for large candidate populations. In practice, of course, test constructors like to include a few easy and a few difficult items, the easy items "in order to help the candidate to relax", while the difficult items "serve the function of stretching the more able candidates" (Macintosh and Morrison, 1969) but, in effect, this is just a token gesture. If candidates at the extremities are to be measured efficiently, what is needed are tests tailored to their abilities, difficult tests for the cleverest and easy tests for the dullest. Therein lies the motivation for developing individualised testing procedures.

Fig. 7.4 Scatter plot of facility against biserial (GCE O-level History 'B' Paper 3 - June 1974).

When selecting items in practice, a handy way of displaying the available items in terms of their statistical characteristics is to plot values of facility or difficulty against values of the discrimination index, whatever that is. It is conventional to plot difficulty along the horizontal axis and discrimination up the vertical axis, with the position of items being signified by the item number, and also, perhaps, by some coding, like a box or a circle or a colour, to indicate different item types or content areas. On top of the plot can be superimposed horizontal and vertical lines indicating the region within which "acceptable" items are to be found. An example taken from Quinn's (1975) survey will show what I mean (Fig. 7.4). The test was a London board Ordinary level History paper, and the joint distribution of difficulty and discrimination values is not unusual. One feature is the number of items which turned out to be too easy. This is likely to happen when pretest item statistics are taken at face value, the point being that candidates often find items easier in the examination proper than in the pretest, due, presumably, to extra motivation and a better state of preparedness. (The London GCE board holds pretests one to two months before the examinations.) For this reason, it is advisable to adjust informally pretest item facilities upwards by five to ten percentage points, so as to get a more accurate idea of how the items will perform in the operational situation.
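A tabular version of the same selection aid is sketched below: pretest facilities are nudged upwards (here by seven points), and each item is flagged according to whether it falls inside an acceptance region. The item values, the boundaries and the size of the adjustment are all invented for the illustration; they are not taken from the History paper in Fig. 7.4.

    # Flag items against an (assumed) acceptance region after adjusting
    # pretest facilities upwards to anticipate the operational examination.
    items = {                            # item number: (pretest facility, biserial)
        1: (0.35, 0.45), 2: (0.55, 0.28), 3: (0.62, 0.50),
        4: (0.80, 0.41), 5: (0.48, 0.36), 6: (0.20, 0.15),
    }

    FACILITY_RANGE = (0.40, 0.60)
    MIN_DISCRIMINATION = 0.30            # relaxed from 0.40 for pretest data
    PRETEST_ADJUSTMENT = 0.07

    for number, (facility, biserial) in sorted(items.items()):
        adjusted = min(facility + PRETEST_ADJUSTMENT, 1.0)
        ok = FACILITY_RANGE[0] <= adjusted <= FACILITY_RANGE[1] and biserial >= MIN_DISCRIMINATION
        print(f"item {number}: facility {facility:.2f} -> {adjusted:.2f}, "
              f"biserial {biserial:.2f}, {'accept' if ok else 'review'}")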
With discrimination values, there is no such rule-of-thumb, but it is a fairly safe generalisation to say that they too mostly improve from pretest to examination, partly because there is usually a positive correlation between facility and discrimination values, so that, as items get easier, discrimination improves, and, also, because incomplete preparation at the time of the pretests will tend to elicit answers that candidates would not offer in an examination. It is, therefore, a good idea to set the lower boundary for discrimination at 0.3 instead of 0.4, as we have done in Fig. 7.4, thus letting in items which, on the pretest information, look dubious but, in the operational setting, are likely to turn out to be acceptable. If it is asked why some items should not be pretested by including them in an examination paper without scoring them, the answer is that it can be done (although the London board has not done so), but there are likely to be objections from candidates and teachers that time spent on the pretest items is time wasted on the operational items.

It cannot be emphasised too often that accept/reject boundaries are not to be kept to rigidly, if only because of the 'slippage' between pretest and operational values just remarked on. There are educational considerations involved in item selection and test construction, and these must always be allowed to over-ride statistical efficiency considerations. For example, even supposing a rectangular distribution of scores could be produced, it is almost certain that the items would be too similar to satisfy the content/skill specification, and less discriminating but more relevant items would have to be introduced to enrich the test. Generally, however, it is the low discriminating and the hard items which pose the dilemma of inclusion or non-inclusion. Are they simply poor items or are they perhaps items based on content which the examiners would like to see taught, but which is meeting resistance among teachers? The trouble is that sometimes the notion of what ought to be tested (and, therefore, taught) is not so securely founded as to lead to authoritative decisions, and an uneasy compromise results, with, perhaps, the statistics getting the upper hand.

As I have remarked, low discrimination values are the rule rather than the exception. The reasons are not hard to find. Comprehensive syllabuses containing a variety of material have to be covered, and some sampling is inevitable. This being so, some teachers will sample one way, some another. With candidates sampling in their own way during examination preparation, the net effect is that their knowledge is likely to be 'spotty' and inconsistent. In these circumstances, one would not expect correlations between items to be high. Interestingly enough, the two multiple-choice papers in Quinn's (1975) survey to show the highest average discrimination values were the English language and the French reading and listening comprehension tests. Whereas in subjects like Physics and Mathematics the pieces of information tend to be discrete and unrelated, in reading and listening comprehension there are interconnections and resonances within the language which help to make candidates' performance more homogeneous. Rowley (1974), who would probably cite this as another example of 'test-wiseness' (see Chapter 5), did, in fact, report that 'test-wiseness' was more marked on verbal comprehension tests than on quantitative tests.

Fig. 7.5 (distribution of test scores: frequency plotted against score)
Figure 7.5 shows the score distribution for the History test, the items of which were displayed in Fig. 7.4. You can see the dome shape, which is common for the London board multiple-choice tests. This dome becomes a peak if the discrimination values slip downwards.

ARRANGING ITEMS IN THE TEST FORM

In a conventional group test there is an issue concerning the way items should be arranged. The received opinion is that items should be arranged from easy to hard, E-H, the rationale being that anxiety is induced by encountering a difficult item early in a test, and that the effect persists over time, and causes candidates to fail items they would have answered correctly had anxiety not interfered. This seems sensible, but, in practice, does E-H sequencing make any difference to scores? There has been a string of enquiries which have found that item and test statistics - difficulty, discrimination, KR20 internal consistency - were little affected by re-arranging items from random to E-H or vice-versa (Brenner, 1964; Flaugher, Melton and Myers, 1968; Shoemaker, 1970; Huck and Bowers, 1972). Perhaps the most thorough enquiry into this issue was carried out by Munz and Jacobs (1971), who also provide other references. They concluded that an E-H arrangement did not appear to improve test performance or reduce test-taking anxiety, as compared to an H-E arrangement, but that it did leave students with a more positive feeling about the test afterwards ("easier and fairer") than did the H-E arrangement. Their view was that arranging items according to candidates' perception of item difficulty - subjective item difficulty - constitutes the only justification for the E-H arrangement. The snag is that the subjective item difficulty of any item will vary according to the candidate. All the same, I would back the easy to hard arrangement.

THE INCLINE OF DIFFICULTY CONCEPT

If items were to be arranged in E-H order, and candidates could somehow be persuaded to stop once items got too difficult for them, a big step would have been taken towards individualising testing and, of course, to making it more efficient. This is the notion behind the so-called incline of difficulty concept. It differs in an important respect from the so-called multilevel format which is used by certain American testing programs, for example, the Iowa Tests of Basic Skills (Hieronymus and Lindquist, 1971). Whereas, with the ITBS, a single test booklet covers the whole range of difficulty from, say, easy nine-year old to difficult sixteen-year old, and candidates are advised where to start the test and where to stop it, in the incline of difficulty set-up, as presently explicated (see Harrison, 1973), candidates start at the beginning but are given no instructions as to when to stop, except, as I have said, when they find the items getting too difficult. The consequence is that the weaker candidates, perhaps understandably, tend to move on up the incline, sometimes by leapfrogging, in the hope that they will encounter items which are within their capabilities. In behaving like this, they will often be justified, since the existence of interactions between items and individuals - pockets of knowledge candidates are "not supposed to have" or "should have but don't" - means that every candidate is likely to have his own incline of difficulty which will not correspond to the "official" incline of difficulty.
This is just another way of saying that item statistics, on which the incline of difficulty would be based, can only take us so far in the prediction of individual behaviour. Nevertheless, the incline of difficulty idea deserves further researching, if only because any scheme which permits some individualisation of measurement within the constraints of paper and pencil testing is worth investigating. The same remark applies to the multilevel format, some of the problems of which are discussed by Thorndike (1971, p.6).

INDIVIDUALISED TESTING

With fully individualised testing, the idea is to adapt a test to the individual, so that his ability can be assessed accurately and precisely, with as few items as possible. Given some initial information about an individual's level of ability, reliable or otherwise, he/she is presented with an item for which his or her chances of success are reckoned to be 50:50. If the individual gets the item right, the ability estimate is revised upwards, and he or she then receives a more difficult item, while a wrong answer means that the ability estimate will be revised downwards, and an easier item will be presented next. The zeroing-in process continues in see-saw fashion, but with decreasing movement, until a satisfactory determination of ability is made, where what constitutes 'satisfactory' has to be defined and, indeed, is an outstanding technical problem. There are variations on this theme, such as presenting items in blocks rather than singly, but the basic idea is as I have described it. A review of my own (Wood, 1973(b)) gives the background and further explanation. The procedure is entirely dependent on latent trait methodology since classical methods cannot handle the matching of ability level and item difficulty, nor the estimation of ability. Evidently, conventional number correct scoring would not do, since two individuals could get the same number of items correct and yet be quite different in ability.

By common consent, individualised testing is best conducted in an interactive computer-assisted set-up, but I am afraid the expense involved in doing so will be out of the reach of many testing organisations, never mind individual teachers. Lord (1976(b)) may be right that computer costs will come down but when he claims that they will come down to the point where computer-based adaptive testing on a large scale will be economical, one is bound to ask "For whom?", and I suspect the answer will be organisations like the US Civil Service Commission (McKillip and Urry, 1976), or the British Army (Killcross, 1974), and almost nobody else. To devote too much space to computer-assisted adaptive testing would be wrong, anyway, given the scope of this book. Apart from anything else, those in the van of developments are keen to go beyond multiple-choice and program the computer to do things not possible with paper and pencil tests, a point made forcefully by Green (1976), who cites the testing of verbal fluency as "a natural for the computer". More generally, he makes the point, as I did myself in my review paper, that what is needed now is more information in addition to the extra efficiency the computer already supplies. The Green and Lord papers are the best parts of a useful report which will enable interested readers to get up to date with developments in the field.
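The zeroing-in procedure described at the start of this section can be caricatured in a few lines. The sketch below presents, at each turn, the unused item whose difficulty lies nearest the current ability estimate (so the chance of success is roughly 50:50), and then moves the estimate up or down by a shrinking step. A real adaptive system would use proper latent trait estimation rather than this crude updating rule, and the item bank and simulated candidate are assumptions for the illustration.

    # Crude adaptive testing loop: nearest-difficulty item selection with an
    # up/down estimate revised by a decreasing step. Toy bank and candidate.
    import math, random

    random.seed(4)
    bank = [i * 0.25 - 3.0 for i in range(25)]            # calibrated item difficulties, -3.0 to +3.0 (assumed)
    true_ability = 0.8                                    # the simulated candidate

    def answers_correctly(theta, b):                      # Rasch response model drives the simulation
        return random.random() < 1 / (1 + math.exp(-(theta - b)))

    estimate, step, used = 0.0, 1.0, set()
    for administered in range(1, 11):
        # present the unused item whose difficulty is nearest the current estimate
        idx = min((i for i in range(len(bank)) if i not in used),
                  key=lambda i: abs(bank[i] - estimate))
        used.add(idx)
        right = answers_correctly(true_ability, bank[idx])
        estimate += step if right else -step              # revise up or down ...
        step *= 0.7                                       # ... by a decreasing amount
        print(f"item {administered:2d}: difficulty {bank[idx]:+.2f}, "
              f"{'right' if right else 'wrong'}, ability estimate now {estimate:+.2f}")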
Other worthwhile references are Lumsden (1976, p.275-276) who, like Green, pulls no punches, and Weiss (1976), the last summing up four years of investigation into computer-assisted adaptive testing at the University of Minnesota.

So far, the nearest realisation of tailored testing in a paper and pencil form is the flexilevel test (Lord, 1971(a), 1971(b)). In a flexilevel test, the candidate knows immediately whether or not he got the right answer. He starts by attempting an item of median difficulty. If correct, he moves to the easiest item of above median difficulty; if incorrect, he moves to the hardest item of below median difficulty. The candidate attempts only (N + 1)/2 items in the set, where N is the total number of items in the test, which has a rectangular distribution of item facilities so as to provide measurement for all abilities.

In practice, the routing can be arranged in a number of ways. I myself (Wood, 1969) invited candidates to remove some opaque masking material corresponding to the chosen response, in order to reveal the number of the item they should tackle next. Other devices call for the candidate to rub out the masking material with a special rubber. The Ford Motor Company (1972), for instance, have used a scheme like this for testing registered technicians' skills, except that there the emphasis was on remedial activity - wrong answers uncover messages which enable the candidate to rectify his mistakes. This, of course, is the notion behind programmed learning, to which tailored testing bears a resemblance.

In terms of statistical efficiency, Lord found that near the middle of the ability range for which the test was designed, the flexilevel test was slightly less effective than a conventional test composed of items with facilities around 0.50, and with discrimination values as high as possible. Towards the extremes of the ability range, however, the flexilevel test produced more precise measurements, a result one would expect, since it is the reason for adopting these individualised tests in the first place. Weiss (1976, p.4), however, is less kind towards the flexilevel test, and maintains that it offers little improvement over the conventional test, besides being likely to induce undesirable psychological resistance as a result of the branching strategy. By this he means, I think, that candidates are not happy taking different routes which effectively differentiate them, and also that they may experience some difficulty following the routing instructions. These objections remain to be substantiated. It should be remembered that Lord's results stemmed from computer simulations, and Weiss's from computer-assisted item administration. As far as I know, there have been no thorough-going experiments in the paper and pencil mode; my own rough and ready exercise (Wood, 1969) must be discounted. I am not advocating that there should be a spate of such experiments, but it would be good to have one or two. A strong point in favour of flexilevel testing is that even though candidates are obliged to do items of different difficulties, the number right score turns out to be an excellent estimate of ability (Lumsden, 1976).

If the individualising of testing is to work properly so that the pay-off is delivered, it is clear that much depends on the accuracy and precision of the calibration of test items.
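For the record, here is a minimal sketch of the flexilevel routing as I have described it: items ordered by difficulty, a start at the median item, a move to the easiest unattempted harder item after a right answer and to the hardest unattempted easier item after a wrong one, with (N + 1)/2 items attempted in all. The simulated candidate and the difficulty values are assumptions; Lord's papers remain the authority on the design.

    # Flexilevel routing sketch with a simulated (Rasch-responding) candidate.
    import math, random

    random.seed(5)
    N = 21                                                     # odd number of items
    difficulties = sorted(round(-2.5 + 0.25 * i, 2) for i in range(N))   # easiest ... hardest (assumed values)
    median = N // 2
    true_ability = 0.5

    def answers_correctly(theta, b):
        return random.random() < 1 / (1 + math.exp(-(theta - b)))

    harder = list(range(median + 1, N))          # harder than median, easiest of these first
    easier = list(range(median - 1, -1, -1))     # easier than median, hardest of these first
    current, right_count = median, 0

    for attempt in range((N + 1) // 2):
        right = answers_correctly(true_ability, difficulties[current])
        right_count += right
        print(f"attempt {attempt + 1:2d}: item {current + 1:2d} "
              f"(difficulty {difficulties[current]:+.2f}) {'right' if right else 'wrong'}")
        if right and harder:
            current = harder.pop(0)              # move to the next harder item
        elif not right and easier:
            current = easier.pop(0)              # move to the next easier item
        else:
            current = (harder or easier).pop(0)  # one side exhausted; continue on the other
    print(f"number right score: {right_count} out of {(N + 1) // 2}")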
With group tests, the existence of error in the estimation of item parameters is not so critical; we are working with wider margins of error, and are not expecting so much from the measurement. Unfortunately, the calibration of items is something of a grey area in measurement. Certainly, the kind of casual calibration of items using "grab" groups which certain Rasch model enthusiasts recommend is not adequate (for more on this, see Wood, 1976(b)).

There are three other problems pertaining to individualised testing I should like to draw attention to. I have already remarked on the likely existence of what I called item-individual interactions which result in outcomes that are "not supposed to happen". If these effects can interfere with incline of difficulty arrangements, they can certainly throw out and maybe even sabotage individualised branching procedures. Furthermore, and this harks back to the earlier introduction of the idea of subjective item difficulty, we do not know yet whether or not difficulty levels appropriate to each individual's ability level are the best ones for keeping motivation high and anxiety and frustration low, although there is reason to expect that they will, at least, inhibit anxiety (Betz, 1975).

The second problem, which is more important, concerns the meaning of the measurement. In attainment tests, where testing is usually based on a sampling of a comprehensive syllabus, the point is to discover how much of that sampling individuals can master so that appropriate generalisations can be made. Where the decisive element is the difficulty of the item, as it is in individualised testing, an item sequence presented to a particular individual may bear little resemblance to the sample content specification; the items could even be very similar in kind, although this is unlikely to happen in practice. The issue at stake here is that of defining domains or universes of items, each of which contains homogeneous items of the same description, and each of which is sampled according to some scheme (see Chapter 4). On the face of it, there is no reason why this model should not apply in the context of, say, Ordinary level examination papers, which, after all, are supposedly constructed from a specification grid, each cell of which could be said to form an item universe. Unfortunately, there are severe demarcation issues over what constitutes a cell, especially if one of the defining categories in the grid is Bloom's taxonomy or a version of it (see Chapter 2). It is also true that systematic universe sampling, using the flexilevel technique, would require far more items than would be required for the conventional heterogeneous group test.

Finally, there is the reporting problem. Suppose you can measure individuals on umpteen universes - knowledge of words, pronouns, adverbs, adjectives, what have you - what do you do with the results? Profiles would be so unwieldy as to be meaningless. I am afraid that with heterogeneous domains such as we usually have to deal with in examinations like GCE Ordinary and Advanced level, we can do no better than summary statements about achievement, even if it means degraded measurement.

TESTING FOR OTHER THAN INDIVIDUAL DIFFERENCES

Criterion-referenced tests

The discussion so far has been about tests of individual differences, and how to construct them. Because these tests are designed to measure a person in relation to a normative group, they have been labelled norm-referenced tests.
They may be contrasted with criterion-referenced tests, which are designed "to yield measurements that are directly interpretable in terms of specified performance standards" (Glaser and Nitko, 1971, p.653). In practice the differences between the two kinds of test may be more apparent than real, as I have tried to explain elsewhere (Wood, 1976(c)), but there is no denying that there is a fundamental difference in function and purpose. Carver (1974) points out that all tests, to a certain extent, reflect both between-individual differences and within-individual growth, but that most tests will do a better job in one area than another. The first element he proposes, reasonably enough, to call the psychometric element or dimension, while, for the second dimension, he coins the term edumetric. A test may be evaluated along either dimension. Aptitude tests and, to a lesser extent, examinations focus on the psychometric dimension, while teacher-made tests usually focus more on the edumetric dimension, a statement that applies generally to criterion-referenced tests. If McClelland (1973) is right that schools should be testing for competence rather than ability, and I think he is, teachers should be using criterion-referenced tests rather than norm-referenced tests.

Much more could be said about criterion-referenced tests, but the above will serve to set the stage for a discussion of the item selection and test construction procedures which are appropriate for these tests when they consist of multiple-choice items. First, it must be noted that the value of pretesting and item analysis is disputed by the more doctrinaire advocates of criterion-referenced testing (CRT) who argue that since the items are generated directly from the item forms which represent objectives (see Chapter 4), the calculation of discrimination indices and subsequent manipulations are irrelevant (Osburn, 1968). The items generated go straight into a test, and that is that. Someone who believed fervently in the capacities of item writers could, of course, take the same fundamental line. This position seems altogether too extreme; one is bound to agree with Henrysson and Wedman (1974) that there will always be subjective and uncertain elements in the formulation of objectives and, therefore, in the production of items, which will render criterion-referenced tests less than perfect.

If item analysis has a part to play in the construction of CRTs, the question is, what kind of item analysis? Evidently, it must be different from conventional, psychometric item analysis. According to the usual conventions for norm-referenced tests, items that everyone tends to get right or everyone tends to get wrong are bound to have low discrimination values and will, therefore, be discarded, but they might be just what the CRT person wants. Lewis (1974), for example, argues that items with facilities as near as possible to 100 per cent should be favoured above all others. Exactly what the point of giving such a test would be, when it was known in advance that nearly everyone was going to get nearly everything right, defeats me. Surely, it would be much more in keeping with the spirit of CRT to set a test comprising items of 50 or 70 per cent facility, and then see how many of the groups in question could score 100 per cent. The orthodox CRT practitioner regards items which discriminate strongly between individuals as of no use to him.
Brennan (1972) has maintained that what is wanted are items with high facilities and with "non-significant" item-test correlations. Items that discriminate positively "usually indicate a need for revision". Whether Brennan is correct or not is beside the point because discrimination indices like the biserial should not be used in the first place. If the idea is to find items which are sensitive to changes within individuals, then it is necessary to test items out on groups before and after they have received instruction. Items showing little or no difference, indicating insensitivity to learning, would then be discarded. The best edumetric items, according to Carver (1974), are those which have p values approaching 0 prior to instruction, and p values approaching 1 subsequent to instruction. Various refinements of this simple difference measure have been proposed (see Henrysson and Wedman, 1974 for details). The most useful is an adjustment which takes into account the fact that the significance of a difference varies according to where on the percentage scale it occurs, the formula for the resulting statistic being:

(p(post-test) - p(pretest)) / (1 - p(pretest))

But will teachers, who, after all, are meant to be the prime users, be bothered to go to the lengths of administering pretests and post tests, and then selecting items? One could say the same of conventional item analysis, of course; indeed I have always thought that item analysis in the classroom was something more talked about than practised. With CRT, however, the procedure is so much more cumbersome that the pay-off seems hardly worthwhile. "One might argue that the teacher's time could be better spent in other areas of the instructional process" writes Crehan (1974), and it is hard to disagree. Besides, there is something improper about a teacher giving his students items before they have had the opportunity to master the relevant subject matter. In these circumstances, there is a real possibility that some students will be demoralised before they start. My belief is that the teacher is better advised to rely on his own intuition and everyday observation, rather than engage in statistical exercises. Above all, CRT should be informal in conception and execution, and there is no purpose served in decking it out with elaborate statistical trappings.

CHOOSING ITEMS TO DISCRIMINATE BETWEEN GROUPS

Not only may items be chosen to discriminate between and within individuals, but also between groups of individuals. The practical importance of such a measure lies in the evaluation of teaching programmes or instructional success, an issue gaining increasing attention these days, particularly in the USA. Suppose a number of classes within a school have been taught the same material, and it is desired to set all the class members a test to find out which class has learnt most. Lewy (1973) has shown that items which differentiate within classes will not necessarily register differences between classes. This is what one would expect, given that the basic units of observation - the individual score and the class average - are so different. For item selection to differentiate between classes, the appropriate discrimination index is the intraclass correlation (for details see Lewy's paper, also a paper by Rakow). Using indices like the biserial will most likely result in tests which are not sensitive to differences between class performance.
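Both indices mentioned in this section are easily computed. The sketch below gives the instruction-sensitivity formula just quoted and a one-way analysis of variance form of the intraclass correlation for judging whether an item separates classes rather than individuals; the class response data, class sizes and facility values are invented for the illustration.

    # Instruction-sensitivity index and a one-way ANOVA intraclass correlation
    # for between-class item selection. Invented data.
    import numpy as np

    def sensitivity_index(p_pre, p_post):
        return (p_post - p_pre) / (1 - p_pre)

    def intraclass_correlation(groups):
        """groups: list of arrays of 0/1 item scores, one array per class (equal sizes assumed)."""
        k = len(groups)
        n = np.mean([len(g) for g in groups])                  # class size
        grand = np.concatenate(groups).mean()
        msb = n * sum((g.mean() - grand) ** 2 for g in groups) / (k - 1)   # between-class mean square
        msw = np.mean([g.var(ddof=1) for g in groups])                     # within-class mean square
        return (msb - msw) / (msb + (n - 1) * msw)

    print(f"sensitivity for p rising from 0.30 to 0.80: {sensitivity_index(0.30, 0.80):.2f}")

    rng = np.random.default_rng(6)
    classes = [rng.binomial(1, p, size=30).astype(float) for p in (0.4, 0.6, 0.8)]
    print(f"intraclass correlation for the item: {intraclass_correlation(classes):.2f}")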
Much of the criticism levelled at American studies which have claimed that school makes little or no difference to achievement, like those of Coleman et al (1966) and Jencks et al (1972), has hinged on the fact that norm-referenced tests constructed according to the usual rules were used to make the measurements, whereas what should have been used were tests built to reflect differences between school performances (Madaus, Kellaghan and Rakow, 1975). My own experience with using the intraclass correlation on item response data from achievement tests - after the event, of course - has been that the highest values occur with items on topics that are either new to the syllabus or are controversial. If, as seems likely, these topics are taken up by only some teachers, the effect will be to create a possibly spurious impression of greater between-school variability than really exists. I have also found that assertion-reason items show less variation between schools than items of any other type, an outcome I interpret as evidence that assertion-reason items measure ability more than competence, to borrow McClelland's distinction again (see also the section on assertion-reason items in Chapter 3). The fact that simple multiple choice items, which probably give the "purest" measures of competence, showed greatest variation between schools in my analysis, tends to support me in my view.

COMPUTER PROGRAMS FOR ITEM ANALYSIS

Compiling even an abbreviated list of available item analysis programs is not an easy task. That is why I was pleased by the appearance recently of a paper (Schnittjer and Cartledge (SC), 1976), which provides a comparative evaluation of five programs originating in the USA. The coverage is not comprehensive, nor does it pretend to be, and descriptions of many other programs of varying scope can be found in the 'Computer Programs' section of Educational and Psychological Measurement, which appears in every other issue. One program I would have expected to find in the SC paper is FORTAP (Baker and Martin, 1969). Among other features, it supplies estimates of item parameters for the distractors as well as for the key. It was also, I believe, the first commercially available program to provide estimates of the item parameters for the normal ogive item response model; other programs which fit latent trait models are LOGOG (Kolakowski and Bock, 1974) and LOGIST (Wingersky and Lord, 1973). As far as the Rasch model is concerned, the SC paper describes MESAMAX, developed at the University of Chicago. There is also Choppin's (1974(b)) program, which is based on his own treatment of the Rasch model (Choppin, 1968).

One program the SC paper could not be expected to mention is one developed recently in the University of London School Examinations Department (Wilson and Wood, 1976). It will be known as TSFA, and, among other things, will serve as a successor to the Chicago program dealt with in the SC paper. The output for every item resembles that in Table 6.1. Basic sample statistics and item parameter estimates are given for main test and subtests, using the relevant test scores and/or optional external criterion scores. Sample statistics and item parameter estimates for different subsets of persons can also be obtained. Further options allow the user to plot test score/criterion biserials or point biserials against item difficulties or facilities for main test and subtests.
For the large version only, tetrachoric correlations between items can be obtained on request, and the correlation matrix used as input for a factor analysis.

SUMMARY

1. Three types of test are identified: tests to measure individual differences, tests to measure differences within individuals and tests to measure differences between groups. Within the first type, group tests are distinguished from individualised tests. The different kinds of item analysis and test selection appropriate to each are discussed. In the case of group tests, the classical and modern approaches to test selection are contrasted.

2. The recommendation for designing group tests for large candidate populations is to choose items with difficulty levels around 0.50, and with discrimination values as high as possible, consistent with educational considerations. Those who find this advice hard to understand might reflect that to provide enough items at points on the difficulty range so as to secure the same efficiency of measurement across the ability range, and not just in the middle, would mean an impossibly long test. In these circumstances the equal difficulty strategy gives a closer representation of the true ordering of candidates than does spreading the same number of items across the difficulty range, but the real answer to the problem is to be found in individualised testing.

3. When selecting items for group tests, plotting difficulty and discrimination values against each other gives the test constructor a good idea of the statistical characteristics of the available items. It should be remembered that pretest difficulty and discrimination values are apt to be underestimates of the actual examination values, so that it is wise to make some allowance for this when choosing items. It is also advisable not to take accept/reject borderlines too seriously. There is nothing special about a biserial value of 0.30 or 0.40; what matters is to fill the test with items which can be justified on educational grounds, not always an easy thing to do.

4. Discrimination values for achievement tests are generally on the low side, especially for subjects like Physics, where candidates' knowledge seems to be spotty and inconsistent, causing low correlations between items. For reading and listening comprehension tests the discrimination values are somewhat higher, suggesting that candidates are able to deal with the material in a more consistent fashion, due, perhaps, to connections within the language.

5. Presenting items in an easy to hard sequence seems to be the most congenial to candidates.

6. The incline of difficulty and multilevel concepts promise some individualisation within the group testing framework.

7. Fully individualised testing can only really be carried out with the help of a computer. Flexilevel testing is the nearest approximation we have in paper and pencil form, and it would be worth a further look under realistic conditions.

8. The item analysis and test selection procedures appropriate for criterion-referenced tests should obviously be different in kind to those used in psychometric work, if they are even necessary at all, as some believe. Ideas vary as to what measures should be used, although there is general agreement that the difference between performance prior to instruction and performance after instruction is the critical factor. Whether teachers will be willing to indulge in item analysis for criterion-referenced tests, given the work it entails, is a moot point.
9. Interest in selecting items which will discriminate between groups rather than individuals seems to be growing. Much of the criticism levelled at American studies which have claimed that school makes little or no difference to achievement has hinged on the fact that norm-referenced tests constructed according to the usual rules were used to make the measurements, whereas what should have been used were tests built to reflect differences between school performance. Just as with criterion-referenced testing, a different set of item analysis and test selection procedures is necessary for between-group testing. In this case, the appropriate statistic is the intraclass correlation. It should be noted that between-group variability can sometimes appear greater than it is simply because the material on which an item is based has not been taught in some schools.

10. References to the more comprehensive item analysis computer programs are given.

Acknowledgements

I am grateful, above all, to Andrew Harrison whose challenging comments on the manuscript helped me enormously. I am grateful also to the editors, especially Bruce Choppin, for suggesting improvements and for their support, and to my colleague Keith Davidson for discussing the manuscript with me. The Secretary of the London GCE board, A.R. Stephenson, has encouraged me in the writing of this book and I would like to thank him too. Permission to reproduce test items has been given by the University of London, the Test Development and Research Unit of the Oxford Delegacy of Local Examinations, the University of Cambridge Local Examinations Syndicate, the Oxford and Cambridge Schools Examination Board and the Educational Testing Service, Princeton, New Jersey.

R. Wood

References

Ace, M.C. & Dawis, R.V. (1973), Item structure as a determinant of item difficulty in verbal analogies, Educ. Psychol. Measmt. 33, 143-149.

Aiken, L.R. (1966), Another look at weighting test items, Jour. Educ. Measmt. 3, 183-185.

Alker, H.A., Carlson, J.A. & Hermann, M.G. (1969), Multiple-choice questions and student characteristics, Jour. Educ. Psychol. 60, 231-243.

Anastasi, A. (1967), Psychology, psychologists and psychological testing, Amer. Psychol. 22, 297-306.

Anastasi, A. (1970), On the formation of psychological traits, Amer. Psychol. 25, 899-910.

Anderson, K.C. (1972), How to construct achievement tests to assess comprehension, Rev. Educ. Res. 42, 145-170.

Ashford, T.A. (1972), A brief history of objective tests, Jour. Chem. Educ. 49, 420-423.

Baker, F.B. & Martin, T.J. (1969), FORTAP: A Fortran test analysis package, Educ. Psychol. Measmt. 29, 159-164.

Barzun, J. (1959), The House of Intellect, Harper and Row, New York.

Bayer, D.H. (1971), Effect of test instructions, test anxiety, defensiveness and confidence in judgement on guessing behaviour in multiple-choice test situations, Psychol. Sch. 8, 208-215.

Beeson, R.O. (1973), Immediate knowledge of results and test performance, Jour. Educ. Res. 66, 224-226.

Berglund, G.W. (1969), Effect of knowledge of results on retention, Psychol. Sch. 6, 420-421.

Betz, N.W. (1975), Prospects: New types of information and psychological implications. In Computerised Adaptive Trait Measurement: Problems and Prospects, Psychometric Methods Program, University of Minnesota.

Betz, N.E. & Weiss, D.J. (1976(a)), Effects of immediate knowledge of results and adaptive testing on ability test performance, Research Report 76-3, Psychometric Methods Program, University of Minnesota.
Betz, N.E. & Weiss, D.J. (1976(b)), Psychological effects of immediate knowledge of results and adaptive ability testing, Research Report 76-4, Psychometric Methods Program, University of Minnesota.

Binyon, M. (1976), Concern mounts at fall in writing standards, Times Educ. Suppl., February 13th.

Bishop, A.J., Knapp, T.R. & MacIntyre, D.I. (1969), A comparison of the results of open-ended and multiple-choice versions of a mathematics test, Int. Jour. Educ. Sci. 3, 147-154.

Board, C. & Whitney, D.R. (1972), The effect of selected poor item-writing practices on test difficulty, reliability and validity, Jour. Educ. Measmt. 9, 225-233.

Boldt, R.F. (1974), An approximately reproducing scoring scheme that aligns random response and omission, Educ. Psychol. Measmt. 34, 57-61.

Bormuth, J. (1970), On the Theory of Achievement Test Items, University of Chicago Press, Chicago.

Bowers, J. (1972), A note on comparing r-biserial and r-point biserial, Educ. Psychol. Measmt. 32, 771-775.

Bracht, G.H. & Hopkins, K.D. (1970), The communality of essay and objective tests of academic achievement, Educ. Psychol. Measmt. 30, 359-364.

Brennan, R.L. (1972), A generalised upper-lower item discrimination index, Educ. Psychol. Measmt. 32, 289-303.

Brenner, M.H. (1964), Test difficulty, reliability, and discrimination as functions of item difficulty order, Jour. Appl. Psychol. 48, 98-100.

Britton, J., Burgess, T., Martin, N., McLeod, A. & Rosen, H. (1975), The Development of Writing Abilities (11-18), Schools Council Research Studies, Macmillan Education, London.

Brown, J. (1966), Objective Tests: Their Construction and Analysis: A Practical Handbook for Teachers, Longmans, London.

Brown, J. (Ed.), (1976), Recall and Recognition, Wiley, London.

Carver, R.P. (1974), Two dimensions of tests: Psychometric and edumetric, Amer. Psychol. 29, 512-518.

Choppin, B.H. (1968), An item bank using sample-free calibration, Nature, 219, 870-872.

Choppin, B.H. (1974(a)), The Correction for Guessing on Objective Tests, IEA Monograph Studies, No. 4, Stockholm.

Choppin, B.H. (1974(b)), Rasch/Choppin pairwise analysis: Express calibration by pair-X, National Foundation for Educational Research, Slough.

Choppin, B.H. (1975), Guessing the answer on objective tests, Brit. Jour. Educ. Psychol. 45, 206-213.

Choppin, B.H. (1976), Recent developments in item banking: A review. In de Gruitjer, D.N.M. & van der Kamp, L.J. Th. (Eds.), Advances in Psychological and Educational Measurement, John Wiley, London.

Choppin, B.H. & Purves, A.C. (1969), Comparison of open-ended and multiple-choice items dealing with literacy understanding, Res. Teach. Eng. 3, 15-24.

Choppin, B.H. & Orr, L. (1976), Aptitude Testing at Eighteen-Plus, National Foundation for Educational Research, Slough.

Coleman, J.S. et al (1966), Equality of Educational Opportunity, Office of Education, US Dept. of Health, Education and Welfare, Washington.

College Entrance Examination Board (1976), About the SAT - 1976-77, New York.

Collet, L.S. (1971), Elimination scoring: An empirical evaluation, Jour. Educ. Measmt. 8, 209-214.

Connaughton, I.M. & Skurnik, L.S. (1969), The comparative effectiveness of several short-cut item analysis procedures, Brit. Jour. Educ. Psychol. 39, 225-232.

Coombs, C.H., Milholland, J.E. & Womer, F.B. (1956), The assessment of partial knowledge, Educ. Psychol. Measmt. 16, 13-37.

Copeland, D.A. (1972), Should chemistry students change answers on multiple-choice tests? Jour. Chem. Educ. 49, 258.
Costin, F. (1970), The optimal number of alternatives in multiple-choice achievement tests: Some empirical evidence for a mathematical proof, Educ. Psychol. Measmt. 30, 353-358.
Costin, F. (1972), Three-choice versus four-choice items: Implications for reliability and validity of objective achievement tests, Educ. Psychol. Measmt. 32, 1035-1038.
Crehan, K.D. (1974), Item analysis for teacher-made mastery tests, Jour. Educ. Measmt. 11, 255-262.
Cronbach, L.J. (1970), Validation of educational measures. In Proceedings of the 1969 Invitational Conference on Testing Problems, Educational Testing Service, Princeton.
Cross, M. (1972), The use of objective tests in government examinations, Vocational Aspect. 24, 133-139.
Cureton, E.E. (1971), Reliability of multiple-choice tests in the proportion of variance which is true variance, Educ. Psychol. Measmt. 31, 827-829.
D'Agostino, R.B. & Cureton, E.E. (1975), The 27 percent rule revisited, Educ. Psychol. Measmt. 35, 41-50.
Dalrymple-Alford, E.C. (1970), A model for assessing multiple-choice test performance, Brit. Jour. Math. Stat. Psychol. 23, 199-203.
Das Gupta, S. (1960), Point biserial correlation coefficient and its generalisation, Psychometrika, 25, 393-408.
Davidson, K. (1974), Objective text, The Use of English, 26, 12-78.
De Finetti, B. (1965), Methods for discriminating levels of partial knowledge concerning a test item, Brit. Jour. Math. Stat. Psychol. 18, 87-123.
De Landsheere, V. (1977), On defining educational objectives, Evaluation in Education: International Progress. 1, 2, Pergamon Press.
Diamond, J.J. & Evans, W.J. (1972), An investigation of the cognitive correlates of test-wiseness, Jour. Educ. Measmt. 9, 145-150.
Diamond, J.J. & Evans, W.J. (1973), The correction for guessing, Rev. Educ. Res. 43, 181-192.
Donlon, T.F. (1971), Whose zoo? Fry's orangoutang score revisited, Read. Teach. 25, 7-10.
Driver, R. (1975), The name of the game, Sch. Sci. Rev. 56, 800-805.
Dudley, H.A.F. (1973), Multiple-choice tests, Lancet. 2, 195.
Dudycha, A.L. & Carpenter, J.B. (1973), Effects of item format on item discrimination and difficulty, Jour. Appl. Psychol. 58, 116-121.
Dunn, T.F. & Goldstein, L.G. (1959), Test difficulty, validity and reliability as functions of selected multiple-choice item construction principles, Educ. Psychol. Measmt. 19, 171-179.
Ebel, R.L. (1969), Expected reliability as a function of choices per item, Educ. Psychol. Measmt. 29, 565-570.
Ebel, R.L. (1970), The case for true-false test items, School Rev. 78, 373-390.
Ebel, R.L. (1971), How to write true-false test items, Educ. Psychol. Measmt. 31, 417-426.
Echternacht, G.J. (1972), Use of confidence weighting in objective tests, Rev. Educ. Res. 42, 217-236.
Echternacht, G.J. (1976), Reliability and validity of item option weighting schemes, Educ. Psychol. Measmt. 36, 301-310.
Echternacht, G.J., Boldt, R.F. & Sellman, W.S. (1972), Personality influences on confidence test scores, Jour. Educ. Measmt. 9, 235-241.
Eklund, H. (1968), Multiple Choice and Retention, Almqvist and Wiksells, Uppsala.
Evans, R.M. & Misfeldt, K. (1974), Effect of self-scoring procedures on test reliability, Percept. Mot. Skills. 38, 1246.
Fairbrother, R. (1975), The reliability of teachers' judgements of the abilities being tested by multiple choice items, Educ. Res. 17, 202-210.
Farrington, B. (1975), What is knowing a language? Some considerations arising from an Advanced level multiple-choice test in French, Modern Languages. 56, 10-17.
Fiske, D.W. (1968), Items and persons: Formal duals and psychological differences, Mult. Behav. Res. 3, 393-402.
Flaugher, R.L., Melton, R.S. & Myers, C.T. (1968), Item rearrangement under typical test conditions, Educ. Psychol. Measmt. 28, 813-824.
Foote, R. & Belinky, C. (1972), It pays to switch? Consequences of changing answers on multiple-choice examinations, Psychol. Reps. 31, 667-673.
Ford Motor Company (1972), Registered Technician Program, Bulletin 18, Brentwood, Essex.
Forrest, R. (1975), Objective examinations and the teaching of English, Eng. Lang. Teach. Jour. 29, 240-246.
Fremer, J. & Anastasio, E. (1969), Computer-assisted item writing - I (Spelling items), Jour. Educ. Measmt. 6, 69-74.
Frisbie, D.A. (1973), Multiple-choice versus true-false: A comparison of reliabilities and concurrent validities, Jour. Educ. Measmt. 10, 297-304.
Fry, E. (1971), The orangoutang score, Read. Teach. 24, 360-362.
Gage, N.L. & Damrin, D.E. (1950), Reliability, homogeneity and number of choices, Jour. Educ. Psychol. 41, 385-404.
Gagné, R.M. (1970(a)), Instructional variables and learning outcomes. In M.C. Wittrock & D.E. Wiley (Eds.), The Evaluation of Instruction: Issues and Problems, Holt, Rinehart and Winston, New York.
Gagné, R.M. (1970(b)), The Conditions of Learning, Holt, Rinehart and Winston, New York.
Gilman, D.A. & Ferry, P. (1972), Increasing test reliability through self-scoring procedures, Jour. Educ. Measmt. 9, 205-208.
Glaser, R. & Nitko, A.J. (1971), Measurement in learning and instruction. In Thorndike, R.L. (Ed.), Educational Measurement, American Council on Education, Washington.
Glass, G.V. (1966), Note on rank-biserial correlation, Educ. Psychol. Measmt. 26, 623-631.
Glass, G.V. & Stanley, J.C. (1970), Statistical Methods in Education and Psychology, Prentice-Hall, Englewood Cliffs, N.J.
Green, B.F. (1976), Invited discussion. In Proceedings of the First Conference on Computerised Adaptive Testing, U.S. Civil Service Commission, Washington.
Grier, J.B. (1975), The number of alternatives for optimum test reliability, Jour. Educ. Measmt. 12, 109-112.
Gritten, F. & Johnson, D.M. (1941), Individual differences in judging multiple-choice questions, Jour. Educ. Psychol. 30, 423-430.
Guilford, J.P. (1954), Psychometric Methods, McGraw-Hill.
Guttman, L. (1941), The quantification of a class of attributes: A theory and method of scale construction. In Horst, P. (Ed.), The Prediction of Personal Adjustment, Social Science Research Council, New York.
Guttman, L. (1970), Integration of test design and analysis. In Proceedings of the 1969 Invitational Conference on Testing Problems, Educational Testing Service, Princeton.
Guttman, L. & Schlesinger, I.M. (1967), Systematic construction of distracters for ability and achievement testing, Educ. Psychol. Measmt. 27, 569-580.
Hales, L.W. (1972), Method of obtaining the index of discrimination for item selection and selected test characteristics: A comparative study, Educ. Psychol. Measmt. 32, 929-937.
Hamilton, E.R. (1929), The Art of Interrogation, Kegan Paul, London.
Hanna, G.S. & Owens, R.E. (1973), Incremental validity of confidence weighting of items, Calif. Jour. Educ. Res. 24, 165-168.
Hansen, R. (1971), The influence of variables other than knowledge on probabilistic tests, Jour. Educ. Measmt. 8, 9-14.
Handy, J. & Johnstone, A.H. (1973), How students reason in objective tests, Educ. in Chem. 10, 99-100.
Harrison, A.W. (1973), Incline of difficulty experiment in French - Stages 1 and 2, Unpublished manuscript, Associated Examining Board, Aldershot.
Heim, A.W. & Watts, K.P. (1967), An experiment on multiple-choice versus open-ended answering in a vocabulary test, Brit. Jour. Educ. Psychol. 37, 339-346.
Hendrickson, G.F. (1971), The effect of differential option weighting on multiple-choice tests, Jour. Educ. Measmt. 8, 291-296.
Henrysson, S. (1971), Gathering, analysing and using data on test items. In Thorndike, R.L. (Ed.), Educational Measurement, American Council on Education, Washington.
Henrysson, S. & Wedman, I. (1974), Some problems in construction and evaluation of criterion-referenced tests, Scand. Jour. Educ. Res. 18, 1-12.
Hieronymus, A.N. & Lindquist, E.F. (1971), Teacher's Guide for Administration, Interpretation and Use: Iowa Tests of Basic Skills, Houghton Mifflin, Boston.
Hill, G.C. & Woods, G.T. (1974), Multiple true-false questions, Educ. in Chem. 11, 86-87.
Hively, W., Patterson, H.L. & Page, S.H. (1968), A "universe-defined" system of arithmetic achievement tests, Jour. Educ. Measmt. 5, 275-290.
Hoffman, B. (1962), The Tyranny of Testing, Crowell-Collier, New York.
Hoffman, B. (1967(a)), Psychometric scientism, Phi Delta Kappan. 48, 381-386.
Hoffman, B. (1967(b)), Multiple-choice tests, Physics Educ. 2, 247-251.
Hofmann, R.J. (1975), The concept of efficiency in item analysis, Educ. Psychol. Measmt. 35, 621-640.
Honeyford, R. (1973), Against objective testing, The Use of English. 25, 17-26.
Hopkins, K.D., Hakstian, A.R. & Hopkins, B.R. (1973), Validity and reliability consequences of confidence weighting, Educ. Psychol. Measmt. 33, 135-14.
Huck, S.W. & Bowers, N.D. (1972), Item difficulty level and sequence effects in multiple-choice achievement tests, Jour. Educ. Measmt. 9, 105-111.
Ivens, S.H. (1971), Nonparametric item evaluation index, Educ. Psychol. Measmt. 31, 843-849.
Jacobs, S.S. (1971), Correlates of unwarranted confidence in responses to objective test items, Jour. Educ. Measmt. 8, 15-20.
Jacobs, S.S. (1972), Answer changing on objective tests: Some implications for test validity, Educ. Psychol. Measmt. 32, 1039-1044.
Jaspen, N. (1965), Polyserial correlation programs in Fortran, Educ. Psychol. Measmt. 25, 229-233.
Jencks, C.S. et al. (1972), Inequality: A Reassessment of the Effect of Family and Schooling in America, Basic Books, New York.
Karraker, R.J. (1967), Knowledge of results and incorrect recall of plausible multiple-choice alternatives, Jour. Educ. Psychol. 58, 11-14.
Kelley, T.L. (1939), The selection of upper and lower groups for the validation of test items, Jour. Educ. Psychol. 30, 17-24.
Killcross, M.C. (1974), A Tailored Testing System for Selection and Allocation in the British Army. Paper presented at the 18th International Congress of Applied Psychology, Montreal.
Klein, S.P. & Kosecoff, J. (1973), Issues and procedures in the development of criterion-referenced tests, ERIC TM Report 26.
Koehler, R.A. (1971), A comparison of the validities of conventional choice testing and various confidence marking procedures, Jour. Educ. Measmt. 8, 297-303.
Koehler, R.A. (1974), Overconfidence on probabilistic tests, Jour. Educ. Measmt. 11, 101-108.
Kolakowski, D. & Bock, R.D. (1974), LOGOG: Maximum likelihood item analysis and test scoring - logistic model, National Educational Resources, Chicago.
Krauft, C.C. & Beggs, D.L. (1973), Test taking procedure, risk taking and multiple-choice test scores, Jour. Exper. Educ. 41, 74-77.
Kropp, R.P., Stoker, H.W. & Bashaw, W.L. (1966), The Construction and Validation of Tests of the Cognitive Processes as Described in the Taxonomy of Educational Objectives, Institute of Human Learning and Department of Educational Research and Testing, Florida State University, Tallahassee.
Kuhn, T.S. (1962), The Structure of Scientific Revolutions, University of Chicago Press, Chicago.
La Fave, L. (1966), Essay versus multiple-choice: Which test is preferable? Psychol. Sch. 3, 65-69.
Lever, R.S., Harden, R.McG., Wilson, G.M. & Jolley, J.L. (1970), A simple answer sheet designed for use with objective examinations, Brit. Jour. Med. Educ. 4, 37-41.
Levy, P. (1973), On the relation between test theory and psychology. In Kline, P. (Ed.), New Approaches in Psychological Measurement, Wiley, London.
Lewis, D.G. (1974), Assessment in Education, University of London Press, London.
Lewy, A. (1973), Discrimination among individuals vs. discrimination among groups, Jour. Educ. Measmt. 10, 19-24.
Lord, F.M. (1971(a)), The self-scoring flexilevel test, Jour. Educ. Measmt. 8, 147-151.
Lord, F.M. (1971(b)), A theoretical study of the measurement effectiveness of flexilevel tests, Educ. Psychol. Measmt. 31, 805-814.
Lord, F.M. (1976(a)), Optimal number of choices per item - a comparison of four approaches, Research Bulletin 76-4, Educational Testing Service, Princeton, N.J.
Lord, F.M. (1976(b)), Invited discussion. In Proceedings of the First Conference on Computerised Adaptive Testing, U.S. Civil Service Commission, Washington.
Lord, F.M. & Novick, M.R. (1968), Statistical Theories of Mental Test Scores, Addison-Wesley, New York.
Lumsden, J. (1976), Test theory, Ann. Rev. Psychol. 27, 251-280.
Lynch, D.O. & Smith, B.C. (1975), Item response changes: Effects on test scores, Meas. Eval. in Guidance. 7, 220-224.
Macintosh, H.G. & Morrison, R.B. (1969), Objective Testing, University of London Press, London.
Macready, G.B. (1975), The structure of domain hierarchies found within a domain referenced testing system, Educ. Psychol. Measmt. 35, 583-598.
Macready, G.B. & Merwin, J.C. (1973), Homogeneity within item forms in domain referenced testing, Educ. Psychol. Measmt. 33, 351-360.
Madaus, G., Kellaghan, T. & Rakow, E. (1975), A Study of the Sensitivity of Measures of School Effectiveness, Report to the Carnegie.
Marcus, A. (1963), Effect of correct response location on the difficulty level of multiple-choice questions, Jour. Appl. Psychol. 47, 48-51.
McClelland, D.C. (1973), Testing for competence rather than for intelligence, Amer. Psychol. 28, 1-14.
McKillip, R.H. & Urry, V.W. (1976), Computer-assisted testing: An orderly transition from theory to practice. In Proceedings of the First Conference on Computerised Adaptive Testing, U.S. Civil Service Commission, Washington.
McMorris, R.F., Brown, J.A., Snyder, G.W. & Pruzek, R.M. (1972), Effects of violating item construction principles, Jour. Educ. Measmt. 9, 287-296.
Mellenbergh, G.J. (1972), A comparison between different kinds of achievement test items, Nederlands Tijdschrift voor de Psychologie en haar Grensgebieden. 27, 157-158.
Miller, C.M.L. & Parlett, M. (1974), Up to the Mark: A Study of the Examination Game, Society for Research into Higher Education, London.
Muller, D., Calhoun, E. & Orling, R. (1972), Test reliability as a function of answer sheet mode, Jour. Educ. Measmt. 9, 321-324.
Munz, D.C. & Jacobs, P.D. (1971), An evaluation of perceived item-difficulty sequencing in academic testing, Brit. Jour. Educ. Psychol. 41, 195-205.
Nilsson, I. & Wedman, I. (1976), On test-wiseness and some related constructs, Scand. Jour. Educ. Res. 20, 25-40.
Nixon, J.C. (1973), Investigation of the response foils of the Modified Rhyme Hearing Test, J. Speech Hearing Res. 4, 658-666.
Nuttall, D.L. (1974), Multiple-choice objective tests - A reappraisal. In Conference Report 11, University of London University Entrance and School Examinations Council, London.
Nuttall, D.L. & Skurnik, L.S. (1969), Examination and Item Analysis Manual, National Foundation for Educational Research, Slough.
Oosterhof, A.C. & Glasnapp, D.R. (1974), Comparative reliability and difficulties of the multiple-choice and true-false formats, Jour. Exper. Educ. 42, 62-64.
Open University, CMA Instructions, Undated document.
Ormell, C.P. (1974), Bloom's taxonomy and the objectives of education, Educ. Res. 17, 3-18.
Osburn, H.G. (1968), Item sampling for achievement testing, Educ. Psychol. Measmt. 28, 95-104.
Palva, I.P. & Korhonen, V. (1973), Confidence testing as an improvement of multiple-choice examinations, Brit. Jour. Med. Educ. 7, 179-181.
Pascale, P.J. (1974), Changing answers on multiple-choice achievement tests, Meas. Eval. in Guidance. 6, 236-238.
Paton, D.M. (1971), An examination of confidence testing in multiple-choice examinations, Brit. Jour. Med. Educ. 5, 53-55.
Payne, R.W. & Pennycuick, D.B. (1975), Multiple Choice Questions on Advanced Level Mathematics, Bell, London.
Pearce, J. (1974), Examinations in English Language. In Language, Classroom and Examinations, Schools Council Programme in Linguistics and English Teaching Papers Series II, Vol. 4, Longman.
Peterson, C.C. & Peterson, J.L. (1976), Linguistic determinants of the difficulty of true-false test items, Educ. Psychol. Measmt. 36, 161-164.
Peterson, C.R. & Beach, L.R. (1967), Man as an intuitive statistician, Psychol. Bull. 68, 29-46.
Pippert, R. (1966), Final note on the changed answer myth, Clearing House, 38, 165-166.
Poole, R.L. (1972), Characteristics of the Taxonomy of Educational Objectives: Cognitive domain - a replication, Psychol. Sch. 9, 83-88.
Powell, J.C. & Isbister, A. (1974), A comparison between right and wrong answers on a multiple choice test, Educ. Psychol. Measmt. 34, 499-509.
Prescott, W.E. (1970), The use and influence of objective tests. In Examining Modern Languages, Centre for Information on Language Teaching Reports and Papers 4, London.
Preston, R.C. (1964), Ability of students to identify correct responses before reading, Jour. Educ. Res. 58, 181-183.
Preston, R.C. (1965), Multiple-choice test as an instrument in perpetuating false concepts, Educ. Psychol. Measmt. 25, 111-116.
Pring, R. (1971), Bloom's taxonomy: A philosophical critique (2), Camb. Jour. Educ. 2, 83-91.
Pugh, R.C. & Erunza, J.J. (1975), Effects of a confidence weighted scoring system on measures of test reliability and validity, Educ. Psychol. Measmt. 35, 73-78.
Pyrczak, F. (1972), Objective evaluation of the quality of multiple-choice test items designed to measure comprehension of reading passages, Read. Res. Quart. 8, 62-71.
Pyrczak, F. (1974), Passage-dependence of items designed to measure the ability to identify the main ideas of paragraphs: Implications for validity, Educ. Psychol. Measmt. 34, 343-348.
Quinn, B. (1975), A technical report on the multiple-choice tests set by the London GCE Board 1973 and 1974, Unpublished manuscript, University of London School Examinations Department, London.
Quinn, B. & Wood, R. (1974), Giving part marks for multiple-choice questions, Unpublished manuscript, University of London School Examinations Department, London.
Rabinowitz, F.M. (1970), Characteristic sequential dependencies in multiple-choice situations, Psychol. Bull. 74, 141-148.
Rakow, E.A. (1974), Evaluation of Educational Program Differences via Achievement Test Item Difficulties. Paper presented at the American Educational Research Association, Chicago.
Ramos, R.A. & Stern, J. (1973), Item behaviour associated with changes in the number of alternatives in multiple-choice items, Jour. Educ. Measmt. 10, 305-310.
Rasch, G. (1968), A Mathematical Theory of Objectivity and its Consequences for Model Construction. Paper delivered at European Meeting on Statistics, Econometrics and Management Science, Amsterdam.
Ravetz, J.R. (1971), Scientific Knowledge and its Social Problems, Oxford University Press.
Reiling, E. & Taylor, R. (1972), A new approach to the problem of changing initial responses to multiple-choice questions, Jour. Educ. Measmt. 9, 67-70.
Reilly, R.R. (1975), Empirical option weighting with a correction for guessing, Educ. Psychol. Measmt. 35, 613-619.
Reilly, R.R. & Jackson, R. (1973), Effects of empirical option weighting on reliability and validity of an academic aptitude test, Jour. Educ. Measmt. 10, 185-194.
Resnick, L.B., Siegel, A.W. & Kresh, E. (1971), Transfer and sequence in double classification skills, Jour. Exp. Child. Psychol. 11, 139-149.
Richards, J.M. (1967), Can computers write college admissions tests? Jour. Appl. Psychol. 51, 211-215.
Ross, J. & Weitzman, R.A. (1964), The twenty-seven percent rule, Ann. Math. Stat. 35, 214-221.
Rothman, A.I. (1969), Confidence testing: An examination of multiple-choice testing, Brit. Jour. Med. Educ. 3, 237-239.
Rowley, G.L. (1974), Which examinees are most favoured by the use of multiple-choice tests? Jour. Educ. Measmt. 11, 15-23.
Sabers, D.L. & White, G.W. (1969), The effect of differential weighting of individual item responses on the predictive validity and reliability of an aptitude test, Jour. Educ. Measmt. 6, 93-96.
Sanderson, P.H. (1973), The 'don't know' option in MCQ examinations, Brit. Jour. Med. Educ. 7, 25-29.
Schlesinger, I.M. & Guttman, L. (1969), Smallest space analysis of intelligence and achievement tests, Psychol. Bull. 71, 95-100.
Schnittjer, C.J. & Cartledge, C.M. (1976), Item analysis programs: A comparative analysis of performance, Educ. Psychol. Measmt. 36, 183-188.
Schofield, R. (1973), Guessing on objective type test items, Sch. Sci. Rev. 55, 170-172.
Schools Council (1965), The Certificate of Secondary Education: Experimental Examinations - Mathematics, Examinations Bulletin 7, Schools Council, London.
Schools Council (1973), Objective test survey, Unpublished document, Schools Council, London.
Scott, W.A. (1972), The distribution of test scores, Educ. Psychol. Measmt. 32, 725-735.
Seddon, G.M. & Stolz, C.J.S. (1973), The Validity of Bloom's Taxonomy of Educational Objectives for the Cognitive Domain, Unpublished manuscript, Chemical Education Sector, University of East Anglia.
Senathirajah, N. & Weiss, J. (1971), Evaluation in Geography, Ontario Institute for Studies in Education, Toronto.
Shayer, M. (1972), Conceptual demands in the Nuffield O-level Physics course, Sch. Sci. Rev. 54, 26-34.
Shayer, M., Kuchemann, D.E. & Wylam, H. (1975), Concepts in Secondary Mathematics and Science, S.S.R.C. Project Report, Chelsea College, London.
Shoemaker, D.M. (1970), Test statistics as a function of item arrangement, Jour. Exper. Educ. 39, 85-88.
Shuford, E. & Brown, T.A. (1975), Elicitation of personal probabilities and their assessment, Instructional Science. 4, 137-188.
Skinner, B.F. (1963), Teaching machines, Scientific American, 90-102.
Skurnik, L.S. (1973), Examination folklore: Short answer and multiple-choice questions, West African Jour. Educ. Voc. Measmt. 1, 6-12.
Slakter, M.J., Crehan, K.D. & Koehler, R.A. (1975), Longitudinal studies of risk taking on objective examinations, Educ. Psychol. Measmt. 35, 97-105.
Sockett, H. (1971), Bloom's taxonomy: A philosophical critique (1), Camb. Jour. Educ. 1, 16-35.
Stanley, J.C. & Wang, M.D. (1970), Weighting test items and test-item options: An overview of the analytical and empirical literature, Educ. Psychol. Measmt. 30, 21-35.
Strang, H.R. & Rust, J.O. (1973), The effects of immediate knowledge of results and task definition on multiple-choice answering, Jour. Exper. Educ. 42, 77-80.
Tamir, P. (1971), An alternative approach to the construction of multiple-choice test items, Jour. Biol. Educ. 5, 305-307.
Test Development and Research Unit (1975), Multiple Choice Item Writing, Occasional Publication 2, Cambridge.
Test Development and Research Unit (1976), Report for 1975, Cambridge.
Thorndike, R.L. (1971), Educational measurement for the Seventies. In Thorndike, R.L. (Ed.), Educational Measurement, American Council on Education, Washington.
Traub, R.E. & Carleton, R.K. (1972), The effect of scoring instructions and degree of speededness on the validity and reliability of multiple-choice tests, Educ. Psychol. Measmt. 32, 737-758.
Tuinman, J.J. (1972), Inspection of passages as a function of passage dependency of the test items, Jour. Read. Behav. 5, 186-191.
Tulving, E. (1976), In Brown, J. (Ed.), Recall and Recognition, Wiley, London.
Tversky, A. (1964), On the optimal number of alternatives at a choice point, Jour. Math. Psychol. 1, 386-391.
University of London (1975), Multiple-choice Objective Tests: Notes for the Guidance of Teachers, University of London University Entrance and School Examinations Council, London.
Vernon, P.E. (1964), The Certificate of Secondary Education: An Introduction to Objective-type Examinations, Examinations Bulletin 4, Secondary Schools Examinations Council, London.
Wason, P.C. (1961), Response to affirmative and negative binary statements, Brit. Jour. Psychol. 52, 133-142.
Wason, P.C. (1970), On writing scientific papers, Physics Bull. 21, 407-408.
Weiss, D.J. (1976), Computerised Ability Testing 1972-1975, Psychometric Methods Program, University of Minnesota.
Weitzman, R.A. (1970), Ideal multiple choice items, Jour. Amer. Stat. Assoc. 65, 71-89.
Wesman, A.G. (1971), Writing the test item. In Thorndike, R.L. (Ed.), Educational Measurement, American Council on Education, Washington.
Whitely, S.E. & Dawis, R.V. (1974), The nature of objectivity with the Rasch model, Jour. Educ. Measmt. 11, 163-178.
Whitely, S.E. & Dawis, R.V. (1976), The influence of test context on item difficulty, Educ. Psychol. Measmt. 36, 329-338.
Williamson, M.L. & Hopkins, K.D. (1967), The use of 'none of these' versus homogeneous alternatives on multiple-choice tests: Experimental reliability and validity comparisons, Jour. Educ. Measmt. 4, 53-58.
Willmott, A.S. & Fowles, D.E. (1974), The Objective Interpretation of Test Performance: The Rasch Model Applied, National Foundation for Educational Research, Slough.
Wilmut, J. (1975(a)), Objective test analysis: Some criteria for item selection, Res. in Educ. 13, 27-56.
Wilmut, J. (1975(b)), Selecting Objective Test Items, Associated Examining Board, Aldershot.
Wilson, N. (1970), Objective Tests and Mathematical Learning, Australian Council for Educational Research, Sydney.
Wingersky, M.S. & Lord, F.M. (1973), A computer program for estimating examinee ability and item characteristic curve parameters when there are omitted responses, Research Memorandum 73-2, Educational Testing Service, Princeton.
Wood, R. (1968), Objectives in the teaching of mathematics, Educ. Res. 10, 83-98.
Wood, R. (1969), The efficacy of tailored testing, Educ. Res. 11, 219-222.
Wood, R. (1973(a)), A technical report on the multiple choice tests set by the London GCE Board 1971 and 1972, Unpublished document, University of London School Examinations Department, London.
Wood, R. (1973(b)), Response-contingent testing, Rev. Educ. Res. 43, 529-544.
Wood, R. (1974), Multiple-completion items: Effects of a restricted response structure on success rates, Unpublished manuscript, University of London School Examinations Department, London.
Wood, R. (1976(a)), Barking up the wrong tree? What examiners say about those they examine, Times Educ. Suppl. June 18.
Wood, R. (1976(b)), Trait measurement and item banks. In de Gruijter, D.N.M. & van der Kamp, L.J. Th. (Eds.), Advances in Psychological and Educational Measurement, John Wiley, London.
Wood, R. (1976(c)), A critical note on Harvey's 'Some thoughts on norm-referenced and criterion-referenced measures', Res. in Educ. 15, 69-72.
Wood, R. (1976(d)), Inhibiting blind guessing: The effect of instructions, Jour. Educ. Measmt. 13, 297.
Wood, R. & Skurnik, L.S. (1969), Item Banking, National Foundation for Educational Research, Slough.
Wright, B.D. (1968), Sample-free test calibration and person measurement. In Proceedings of the 1967 Invitational Conference on Testing Problems, Educational Testing Service, Princeton.
Wright, P. (1975), Presenting people with choices: The effect of format on the comprehension of examination rubrics, Prog. Learn. Educ. Tech. 12, 109-114.
Wyatt, H.V. (1974), Testing out tests, Times Higher Educ. Suppl. June 28.
Zontine, P.L., Richards, H.C. & Strang, H.R. (1972), Effect of contingent reinforcement on Peabody Picture Vocabulary Test performance, Psychol. Reports, 31, 615-622.