Multiple Choice Items: How to Gain the Most Out of Them

PINCHAS TAMIR
School of Education and Israel Science Teaching Center
Hebrew University
Jerusalem, Israel

Introduction

Multiple choice questions have often been blamed for a variety of negative educational outcomes, such as rote or superficial learning, under-development of communication skills, deficient ability to develop and present an argument, and more. The major justifications offered for their widespread use, especially in the United States, are: (1) they permit coverage of a wide range of topics in a relatively short time; (2) they are scored objectively and are therefore more reliable; (3) they are easily and quickly scored and lend themselves to computer marking; and (4) they avoid unjustified penalties to students who know their subject matter but are poor writers.

The purpose of this article is to show how multiple choice items can be designed and used as an effective diagnostic tool by avoiding their pitfalls and by taking advantage of their potential benefits. The following issues will be discussed: (a) 'correct' versus 'best' answers; (b) construction of diagnostic multiple choice items; (c) the problem of guessing; (d) the use of justifications to choices; and (e) positive versus negative items.

Correct Versus Best Answers

It is relatively easy to design multiple choice items in which one option is correct and the rest (the distractors) are incorrect. Items of this kind tend to be of a lower cognitive level, requiring most often no more than memorization of particular facts. Since most teacher-made tests consist of such items, they do indeed deserve the harsh criticism directed at them. However, as shown by many authors (ref 1; e.g., Schwab, 1963), as the focus turns away from correct/incorrect to the best answer, the picture changes dramatically.
Now the student is faced with the task of carefully analyzing the various options, each of which may present factually correct information, and selecting the answer which best fits the context and the data given in the item's stem. Multiple choice items of this kind cater for a wide range of cognitive abilities. When compared with open-ended questions, they admittedly do not require the student to formulate an answer, but they do impose the additional requirement of weighing the evidence provided by the different options. Take, for example, the following item:

The dry weight of corn plants at the end of their growth is 6 tons per acre. All of this crop was produced from

 A  Water and minerals absorbed from the soil (1)
 B  Minerals from the soil and oxygen from the air (13)
*C  Water and minerals from the soil and carbon dioxide from the air (64)
 D  Water, minerals and organic substances from the soil (22)

The asterisk (*) indicates that C is the best answer. The figures in parentheses show the percentage of students choosing each option among those who took the low-level matriculation examination in biology in Israel in 1985 (n = 2405). Had this been an open question asking "How has all this crop been produced?" the expected answer would have been: "By photosynthesis." However, in the actual context of the multiple choices, the students had to know that the carbon dioxide from the air is a raw material in photosynthesis, as well as to rule out the notion that plants absorb their organic substances from the soil. They also had to realize that although the information in option (A) is not incorrect, this is not the best answer to the question. As for diagnostic purposes, it is certainly worth knowing that 22% believe that plants obtain their organic matter from the soil and that 13% somehow confuse photosynthesis with respiration.

BIOCHEMICAL EDUCATION 19(4) 1991
In a way, the distractors in a multiple choice item function much like one of the standard procedures in a classical Piagetian interview, in which the interviewer is not fully satisfied even when the child gives a correct answer and proceeds to check the understanding by suggesting a competing answer. Take for example the famous interview regarding area conservation. When the child indicates that the cows have the same grazing area, regardless of the manner in which the houses are scattered in the field, the interviewer keeps pressing: "Yesterday another child told me that here (where the houses are scattered all over) the cows have more food than here (where the houses are close together). What do you say?" Children who do not really understand that the two areas are equal may fall into the trap. Thus, the distractors in a good multiple choice item serve as such traps. It may be concluded that wisely designed multiple choice items have a high diagnostic potential.

Construction of Diagnostic Multiple Choice Items

There are two ways to go about constructing diagnostic multiple choice items: (a) using known misconceptions as distractors; and (b) using students' answers to open-ended questions as a basis for constructing distractors (ref 2).

The research literature in science education is full of studies which identify students' misconceptions in relation to a variety of topics. These misconceptions may serve as excellent distractors. For example, option D in the item cited above represents a common belief of many students in many countries that plants obtain their food, including organic matter, from the soil, much like animals which obtain their food by eating (refs 2-7). When such items are used, the results quickly indicate not only how many students chose the best answer but also how many students hold particular misconceptions.
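As a minimal sketch of how such results can be tabulated, the following Python fragment tallies the options chosen for the corn item and attaches to each option the conception it is taken to indicate. The response list is made up to mirror the percentages reported above, and the names OPTION_MEANING and option_percentages, as well as the short labels, are illustrative assumptions rather than part of the original study.

```python
from collections import Counter

# Hypothetical labels: what choosing each option of the corn item may signal.
# The option letters and the best answer (C) come from the item itself; the
# wording of the labels is a paraphrase for illustration only.
OPTION_MEANING = {
    "A": "not incorrect, but incomplete (water and minerals only)",
    "B": "confuses photosynthesis with respiration (oxygen as a source)",
    "C": "best answer (water, minerals and carbon dioxide)",
    "D": "misconception: organic matter is taken up from the soil",
}


def option_percentages(responses):
    """Return the percentage of respondents who chose each option."""
    counts = Counter(responses)
    total = len(responses)
    return {opt: 100 * counts.get(opt, 0) / total for opt in OPTION_MEANING}


# Invented class of 100 responses mirroring the reported 1, 13, 64, 22 split.
responses = ["A"] * 1 + ["B"] * 13 + ["C"] * 64 + ["D"] * 22
report = option_percentages(responses)

for opt, pct in report.items():
    print(f"{opt}: {pct:5.1f}%  {OPTION_MEANING[opt]}")
```

Reading the report by distractor rather than by total score is the point of the exercise: the share choosing D estimates how widespread the 'food from the soil' belief is in the group tested.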
In spite of extensive research there are, and will continue to be, many topics and concepts for which there is no a priori information regarding misconceptions. In such cases the alternative approach suggested by Tamir (ref 2) is still viable. According to this approach, teachers who administer open questions to their students may collect, while assessing the papers, typical student answers, correct and incorrect. These answers, which represent the ways the students think about given questions, reveal certain conceptions, including misconceptions, which are excellent sources for item options. This approach has been used in the study of student conceptions about natural selection (ref 8).

The Problem of Guessing

It is generally recognized that multiple choice items lend themselves to guessing, so that the probability of obtaining a correct answer by purely random selection in an item with four options is 25%. However, different evaluators have taken different positions regarding the way this problem should be dealt with. Those who consider guessing as 'noise' causing measurement error tend to use a formula according to which incorrect choices carry a penalty in the form of lost points (marks). Under such a procedure, students who do not know the correct answer are advised not to respond to that particular item, since a nil response does not result in losing points.

My position is the following: as long as we deal with cognitively low level recall items, in which one option is clearly correct while the distractors are factually incorrect, guessing should indeed be discouraged. However, when cognitively high level items are considered, where we ask for the best answer, the situation is totally different. Here the students have to think, compare, weigh evidence, apply, analyze, synthesize or evaluate; in short, they have to solve a problem by utilizing their knowledge and intellectual skills.
Under these circumstances, choices are often made by 'educated guess', which, in my opinion, should be encouraged. It may be concluded that, in this kind of multiple choice test, correction for guessing is neither necessary nor desirable, and students should be advised to attempt all items. At the same time, however, it would be worth knowing to what extent guessing is indeed 'educated' rather than totally random. The following procedure offers at least a partial solution: the students are asked to make two responses to each item: first, to choose the best answer and, second, to indicate whether they are sure or not sure of their choice. The following marking scheme is used:

correct: sure          2 points
correct: not sure      1 point
incorrect: not sure    1/2 point
incorrect: sure        0

This marking scheme has been found to be very useful in two ways. First, its reward hierarchy facilitates honest reporting by students; and second, it provides very important feedback to the teacher as well as to the student. Thus, for example, if most students in the class are not sure about a particular item, the teacher may conclude that there is a need to revisit the relevant subject matter in class. As for the students, they learn how to self-evaluate their knowledge. If many mismatches occur, the student may attempt to find the reasons and adjust his/her learning strategies. Conveniently, this procedure lends itself easily to machine scoring. Finally, test reliability increases substantially.

The Use of Justifications

In the context of this article the term justification is assigned to reasons and arguments given by a respondent to a multiple choice item for the choice she/he has made. Very little is reported in the literature about the use of justifications, mainly because very little use has been made of this approach.
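The sure/not sure marking scheme described in the section on guessing can be sketched in a few lines of Python, which also illustrates why it lends itself to machine scoring. The point values are those of the scheme; the answer key, the responses, the function name score_test and the decision to count a 'mismatch' as any item where correctness and confidence disagree are assumptions made for illustration.

```python
# Points for each (answer is correct, student is sure) combination,
# taken from the marking scheme in the text.
POINTS = {
    (True, True): 2.0,    # correct, sure
    (True, False): 1.0,   # correct, not sure
    (False, False): 0.5,  # incorrect, not sure
    (False, True): 0.0,   # incorrect, sure
}


def score_test(answers, key):
    """Score (choice, sure) pairs against an answer key.

    Returns the total points and the number of mismatches, where a mismatch
    is assumed to mean: correct but not sure, or incorrect but sure.
    """
    total = 0.0
    mismatches = 0
    for (choice, sure), best in zip(answers, key):
        correct = choice == best
        total += POINTS[(correct, sure)]
        if correct != sure:
            mismatches += 1
    return total, mismatches


# Invented four-item example.
key = ["C", "A", "D", "B"]
answers = [("C", True), ("A", False), ("B", False), ("D", True)]
total, mismatches = score_test(answers, key)
print(total, mismatches)
```

The mismatch count is what the student can use for self-evaluation, while the same tally aggregated per item tells the teacher which topics the class is unsure about.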
The main reason for avoiding the use of justifications is, most probably, that certain advantages associated with objective items, namely high reliability, machine scoring, and economical coverage of a wide range of topics, are lost. In essence the items become very similar to short essay questions. If so, why do we need the multiple choice item?

There are at least two important reasons for using justifications with multiple choice rather than using short essay questions. First, as already explained in relation to the example given in the first section of this article, the distractors serve as traps. When students are required to justify their choice, they have to consider the data in all the options and explain why a certain option is better than the others. By including well-chosen options, both as the best answer and as distractors, we 'force' the students to consider specific matters and to express their position in writing. Thus, in the particular item cited, it is not enough to know that the dry weight of corn plants is mainly a result of photosynthesis; in addition, the student needs to address the following: (a) the role of carbon dioxide in the process, (b) that minerals and water have a share, (c) that plants do not obtain their organic matter from the soil, and (d) that oxygen is not a source of that organic matter.

The second reason for requiring justifications for multiple choice items is the 'back-wash' effect. Students who know that they may be asked to justify their choices will attempt to learn their subject matter in a more meaningful way and in more depth, so that they will be prepared to write an adequate and complete justification.

Treagust and Haslam (ref 9) designed two-tier items: in the first step the student chooses the option, and then is presented with several possible justifications from which she or he has to choose the best.
Although this approach has the advantage of objective and quick scoring, the relationships between the options in the first tier and those in the second tier are often quite awkward.

18791468, 1991, 4, Downloaded from https://onlinelibrary.wiley.com/doi/10.1016/0307-4412(91)90094-O, Wiley Online Library on [28/11/2023]. See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions) on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License

The results of the Israeli matriculation examination in biology reveal substantial gaps between the scores on the multiple choice responses and those on the justifications. For example, in the four items included in the 1985 examination, the percentages of students correctly choosing the best answers were 78, 65, 81 and 38, and the percentages of students providing satisfactory justifications were 59, 48, 61 and 29, respectively. On the average the gap between choice and justification scores has reached a whole standard deviation. This gap indicates that a considerable number of students who correctly choose the best answer do not really understand the relevant subject matter. This in itself is worth noting. However, the most important contribution of the justifications is that they provide information about students' conceptions and reasoning patterns beyond that which can be obtained by the various procedures outlined above. For more details see Tamir (ref 10).

Positive and Negative Multiple Choice Items: How Different are They?

The usual form of a multiple choice item in most countries is a stem followed by 4 or 5 options, one of which represents the best or correct answer. The task of the student is to identify the best answer. However, the world is full of surprises. While visiting Australia I discovered that all multiple choice items included in the biology matriculation examinations used in the State of Victoria were of the negative type.
It was explained to me that there was an explicit decision to prefer this format because of the belief that it is better for students to be exposed to correct information than to incorrect information. The rationale is the following: since responding to a test is in itself a learning experience, why not include many correct facts which will reinforce students' knowledge and restrict incorrect information to the minimum necessary? An additional argument was that it is easier to construct a good multiple choice item which has only one incorrect answer whereas, on the other hand, it is quite difficult to invent good distractors. Thus, natural conditions were created for an interesting study, as described below.

The purpose of this study was to examine certain issues related to the different item modes, namely Negative (N) and Positive (P), in an attempt to gain a better insight into the underlying reasoning processes involved in responding to the N and P modes.

Most test constructors object to N items, arguing that "the danger of confusion inherent in negative items outweighs any possible value" (Tinkelman, ref 11, p 58). Similarly, Wesman (ref 12) writes: "One occasionally finds a stem phrased in a negative context ... This may lead the students to respond with the wrong answer because they have been tripped up by the tricky or careless item writing rather than through lack of knowledge" (p 96).

Cassels & Johnstone (ref 13) investigated the effect of language and context on students' performance in multiple choice tests. They found that in some cases a change of one word in the stem improved the performance in certain items by about 15%. Another finding relates directly to the problem of our study: questions in chemistry set in a positive form brought better performance from pupils than negative ones. If questions contained double negatives (one in the stem and one in the options) the performance was very poor.
Johnstone (ref 14) discusses these results and writes: "Linguistic literature (Wason, refs 15, 16) has shown that ideas in a negative form occupy twice as much space in the working memory as positive forms. Double negatives may even occupy four times the space occupied by a positive form. It is little wonder that negative questions fail so badly in tests in that they leave less space in the working memory for thought" (p 115).

Seventy multiple choice items selected from biology matriculation examinations in Victoria by the local chief examiner were mailed from Australia to the author. Thirty-five items were selected and translated by the author into Hebrew. The accuracy of the translation was checked independently by three biology educators. The translated test consisted of negative items. A corresponding form consisting of matching positive items was prepared by the author. An attempt was made to use, as far as possible, the same options in the two modes and to design distractors which would be as similar as possible to their matching correct options in terms of the content and concepts included. The Appendix presents an example.

Out of the 35 items the first 20 items were shared by all. In these 20 items students had to choose either the best (correct) answer in positive items or the least acceptable (incorrect) answer in the negative items. The remaining 15 items were of high cognitive level and required the students to choose and justify their choices (ref 10).

The tests were administered to 254 Israeli 12th grade students from nine high schools all over the country, by teachers who had agreed to do so, in April-May 1990, just a short time before the date of their matriculation examination. The results were analyzed using regular test scoring procedures yielding reliability indices, frequency distributions, means and standard deviations, and point biserial correlations. The justifications were subjected to two analyses.
Firstly, each was evaluated on a 3-point scale in which 1 = incorrect; 2 = partially correct; and 3 = correct and complete. Secondly, the justifications were content-analysed and appropriate categories were created to accommodate the various arguments. Having established the categories, two independent evaluators read all the justifications and classified them into the agreed-upon categories so that frequencies could be calculated.

On the average there were no significant differences in the scores of items of low cognitive level. On the other hand, in items of higher cognitive levels, the scores on P items were substantially higher than on N items. This result was explained in terms of reasoning patterns developed in the cognitive structure of students as a result of long experience with P items, as well as by the larger space required to process negative items in the working memory. There were practically no gender differences. Justification scores were substantially lower than multiple choice scores. Providing satisfactory justifications in the N mode was on the average more difficult than in the P mode.

Substantive differences were found between the kinds of justifications provided to the P and N items. An example is presented in the Appendix. The majority of students who chose the correct answer in the positive mode justified their choice by saying: "In tubes 4, 5 the substrate is the same, the pH is different and the products are different. This indicates that the pH had an effect."
The majority of students who chose the correct answer in the negative mode justified their choice by saying that option C is incorrect since "the amount of product depends on the amount of substrate, not on the amount of enzyme". As may be seen in the Appendix, about a third of those responding to the N mode chose option B. The justification provided by most of those choosing option B was "since a different compound is composed of different substances the end products must be different". These students had failed to notice that tubes 1, 2, which contained different compounds, yielded the same end product. In this case one may speculate that the positive mode was easier since students had known from their experiences with enzymes that pH was an important factor which usually affects enzymes' activity. On the other hand, the decision regarding options B and C in the negative mode required careful evaluation of the meaning of the information provided, and reliance on prior knowledge was not enough.

Based on the results of this pioneering study there appear to be a variety of differences pertaining to student performance in P and N modes of multiple choice items. The main findings and conclusions of this study are the following:

(1) In items of low cognitive level there are, on the average, no differences in performance between the N and the P modes.
(2) In items which require high cognitive reasoning the N mode is, on the average, more difficult than the P mode.
(3) This difference between low and high cognitive items may lend support to the hypothesis that processing N items requires more space in the working memory.
(4) There may be interactions between performance in the P/N mode and the items' content.
(5) In items with which algorithms are used, such as the checkerboard used to solve crosses in genetics, there are no P/N mode differences.
(6) Multiple choice scores are positively correlated with the extent of offering justifications as well as with the justification scores. Stated differently, a student correctly choosing the best answer in both P/N modes is more likely to offer a justification and, as well, more likely to have a higher justification score.
(7) On the average, justification scores in the P mode are higher than in the N mode even when the contents of the items and the actual options are very similar. The detailed analysis of the justifications lends support to the assertion that information processing in the N mode is more complex and involves more steps than in the P mode.
(8) There are no gender interactions in any of the measures and processes related to the P/N mode effects identified in this study.
(9) The level of performance on the various measures is positively correlated with the school grade in biology. The magnitude of the correlations in the P mode is very similar to that of the N mode. If we consider the school grade as a measure of concurrent validity we may conclude that the two modes are equally valid. Hence, the two modes may be regarded as equally valid measures of students' performance, even though they may differ in their difficulty level.
(10) A detailed content analysis of the justifications shows that a plausible explanation for the higher difficulty level of N items is that the necessary information processing involves more steps and is more complex than in the P mode. The data also lend support to the hypothesis that processing negative items occupies more space in the working memory.

It still remains to be seen whether or not the performance of Australian students who have been used to the N mode will be different from that of the Israeli students, who like most students in other countries have been used to the P mode.
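The point biserial correlations mentioned among the test scoring procedures relate success on a single item (scored 0 or 1) to a continuous score. A minimal sketch in Python is given below, using the standard formula r = (M1 - M0) / s * sqrt(p * (1 - p)), where M1 and M0 are the mean scores of those who answered the item correctly and incorrectly, s is the population standard deviation of all scores and p is the proportion answering correctly. The article does not say which continuous score was used for each correlation, so pairing an item with the total test score, like the function name and the data, is an assumption for illustration.

```python
from math import sqrt
from statistics import mean, pstdev


def point_biserial(item_correct, total_scores):
    """Point biserial correlation between a 0/1 item score and a total score."""
    right = [t for c, t in zip(item_correct, total_scores) if c == 1]
    wrong = [t for c, t in zip(item_correct, total_scores) if c == 0]
    if not right or not wrong:
        raise ValueError("need at least one correct and one incorrect response")
    sd = pstdev(total_scores)  # population standard deviation of all scores
    if sd == 0:
        raise ValueError("total scores must not all be equal")
    p = len(right) / len(item_correct)  # proportion answering correctly
    return (mean(right) - mean(wrong)) / sd * sqrt(p * (1 - p))


# Invented data: five students, one item, and their total test scores.
item = [1, 1, 1, 0, 0]
totals = [18, 16, 15, 10, 9]
r = point_biserial(item, totals)
print(round(r, 3))
```

A high positive value indicates that students who do well on the test as a whole tend to choose the best answer on this item; comparing such values for matching P and N items is one way of checking whether the two modes discriminate similarly.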
References

1. Schwab, J J (1963) The Biology Teacher Handbook, Wiley, New York
2. Tamir, P (1971) 'An alternative approach to the construction of multiple choice test items', J Biol Educ 5, 305-307
3. Bell, B (1985) 'Students' ideas about plant nutrition: what are they?', J Biol Educ 19, 213-219
4. Simpson, M and Arnold, B (1980) 'An investigation of the development of the concept of photosynthesis to SCE 'O' grade', Aberdeen College of Education, Aberdeen, Scotland
5. Smith, E L and Lott, G W (1983) 'Teaching for conceptual change: some ways to go wrong', in Helm, H and Novak, J D (Editors), International Seminar on Misconceptions in Science and Mathematics, Cornell University Press, Ithaca, pp 57-66
6. Stavy, R, Eisen, Y and Yaakobi, D (1987) 'How Israeli students aged 13-25 understood photosynthesis', Int J Science Education 9, 105-115
7. Wandersee, J H (1983) 'Students' misconceptions about photosynthesis: a cross age study', in Helm, H and Novak, J D (Editors), International Seminar on Misconceptions in Science and Mathematics, Cornell University Press, Ithaca, pp 441-463
8. Brumby, M (1979) 'Problems in learning the concept of natural selection', J Biol Educ 13, 119-122
9. Treagust, D F and Haslam, F (1986) 'Evaluating secondary students' misconceptions of photosynthesis and respiration in plants using a two-tier diagnostic instrument', Paper presented at the annual meeting of the National Association of Research in Science Teaching, San Francisco
10. Tamir, P (1990) 'Some issues related to the use of justifications to multiple choice items', J Biol Educ 24, 285-292
11. Tinkelman, S N (1971) 'Planning the Objective Test', in Thorndike, R L (Editor), Educational Measurement, American Council of Education, Washington, DC, pp 46-80
12. Wesman, A (1971) 'Writing the Test Item', in Thorndike, R L (Editor), Educational Measurement, American Council of Education, Washington, DC, pp 81-129
13. Cassels, J R T and Johnstone, A H (1980) 'Understanding of Nontechnical Words in Science', London, Royal Society of Chemistry
14. Johnstone, A H (1983) 'Training teachers to be aware of the student learning difficulties', in Tamir, P, Holstein, A and Ben Peretz, M (Editors), Preservice and Inservice Education of Science Teachers, Balaban International Science Services, Rehovot-Philadelphia, pp 109-116
15. Wason, P C (1959) 'The processing of positive and negative information', Quarterly J Experimental Psychology 11, 92-107
16. Wason, P C (1961) 'Response to the affirmative and negative binary statements', Brit J Psychol 52, 133-142

Appendix: An item featuring differences in justifications

Mammalian liver tissue was finely ground, filtered and treated so that only enzymes remained in the solution. Some of the solution was added to a series of test tubes (see below) and incubated at 37°C for one hour. The treatments and results were as follows:

Tube no   Compound added   pH           Compounds detected after 1 hour
1         tryptophan       not tested   nicotinic acid
2         kynurenine       not tested   nicotinic acid
3         histidine        not tested   glutamic acid + formic acid
4         maltose          6.6          glucose
5         maltose          10           maltose
6         protein          5            various amino acids

From these results it may be concluded that:

Negative
 A  (6)    the amount of glucose in tube 4 will depend on the amount of maltose put there
 B  (32)   adding a seventh tube with a different compound would not necessarily result in a different end product
*C  (38)   addition of more enzyme solution to tube 2 will increase the amount of nicotinic acid
 D  (4)    at least one of the reactions indicated is likely to be affected by pH
Positive
 A  (15)   addition of more enzyme solution to tube 2 will increase the amount of nicotinic acid
 B  (2)    adding a seventh tube with a different compound would result in a different end product
*C  (66)   at least one of the reactions is likely to be affected by pH
 D  (4)    addition of more maltose to tube 5 will increase the amount of glucose

* = correct answer; the figures indicate the percentage of students choosing the corresponding option

Reshaping the preclinical medical curriculum: modest proposal

BRUCE G CHARLTON
Department of Anatomy
University of Glasgow
Glasgow G12 8QQ, UK

Introduction

If you are happy with the current preclinical medical training in British Medical Schools, then you need read no further. If, on the other hand, you consider it to be a dull anachronism, consisting of too many 'facts', overtaught, encouraging passive learning, insufficiently interactive between staff and students, lacking in clinical relevance, unscientific and boring, then you may consider that we should be looking for ways to improve it.

There has been no shortage of suggestions for improvement dating back over more than a century (ref 1), but most of these have been flawed by excessive idealism. For example, the complete integration of the pre- and the clinical components, with much basic science being taught by practicing clinicians, is one excellent idea; but it is probably logistically impossible in an established medical school without an unrealistic investment of time, money and dedication to the project. Another idea is the universal extension of the course by a year, so that every student does a bachelor's degree in medical science (instead of just a selected minority, as at present).
But this would be too expensive, would expand an already bloated training programme, would not be to all students' tastes; and anyway would leave the problems of the existing curriculum untouched, an extra year merely serving to undo the bad habits of the previous two.

What is required is a simple method of reducing the bulk of compulsory basic 'factual' material, and replacing it with the kind of challenging, in-depth study which is the norm (or at least the ideal) for many other university