Improving Classroom Multiple-Choice Tests: A Worked Example Using Statistical Criteria

Judith R. Emslie and Gordon R. Emslie
Department of Psychology
Ryerson University, Toronto

Copyright © 2005 Emslie

Abstract

The first draft of a classroom multiple-choice test is an object in need of improvement (T. M. Haladyna, 1999; S. J. Osterlind, 1998). Teachers who build each test anew fail to capitalize on their previous efforts. This paper describes an iterative approach and provides specific numerical criteria for item and test adequacy. Indices are selected for aptness, simplicity, and mutual compatibility. A worked example, suitable for teachers of all disciplines, demonstrates the process of raising test quality to a level that meets student needs. Reuse after refinement does not threaten test security.

Keywords: Multiple-choice tests; Teacher-made tests; Item analysis; Psychometrics; Test construction; Teacher education

Ellsworth, Dunnell, and Duell (1990) present 37 guidelines for writing multiple-choice items. They find that approximately 60% of the items published in instructor guides for educational psychology texts violate one or more item-writing guidelines. Hansen and Dexter (1997) report that, by the same criterion, 75% of auditing (accounting) test bank items are faulty. And in practice, “item writers can expect that about 50% of their items will fail to perform as intended” (Haladyna, 1999, p. 214). It is safe to assume that teacher-constructed items fare no better. Therefore, a classroom multiple-choice test is an object in need of improvement.

Following item-writing advice helps, but most guidelines are based on expert opinion rather than empirical evidence or theory (Haladyna, 1999; Osterlind, 1998). Even a structurally sound item might be functionally flawed. For example, the item might be too difficult or off topic. Therefore, teachers must assume personal responsibility for test quality. They must assess item functioning in the context of their test’s intended domain and target population. Teachers must conduct an iterative functional analysis of the test. They must scrutinize, refine, and reuse items. They must not assume any collection of likely looking items will suffice.

Psychometric interpretation guidelines are available in classic texts (e.g., Crocker & Algina, 1986; Cronbach, 1990; Magnusson, 1966/1967; Nunnally, 1978). However, the information is not in a brief, pragmatic form with unequivocal criteria for item acceptance or rejection. Consequently, teachers neglect statistical information that could improve the quality of their multiple-choice tests.

This paper provides “how to” information and a demonstration test using artificial data. It is of use to teachers of all disciplines, particularly those awed by statistical concepts or new to multiple-choice testing. Experienced teachers might find it useful for distribution to teaching assistants or as a pedagogic aid. This paper de-emphasizes technical terminology without misrepresenting “classical” psychometric concepts. (For simplicity, “modern” psychometric theory relating to criterion-referenced items and tests is not considered.) To allow reproduction of the entire data set, statistical ideals are relaxed (e.g., the demonstration test length and student sample size are unrealistically small). Teachers can customize the information by processing the data through their own test scoring systems.
This paper enumerates criteria for the interpretation of elementary indices of item and test quality. Where appropriate, letters in parentheses signify alternative or related expressions of the given criterion. The specific numerical values of the indices are selected for aptness, simplicity, and compatibility. A tolerable, not ideal, standard of acceptability is assumed. The demonstration test, based on knowledge not specific to any discipline, provides a medium for (a) describing the indices, (b) explaining the rationale for these indices, and (c) illustrating the process of test refinement.

A Demonstration Multiple-Choice Test

A Light-hearted Illustrative Classroom Evaluation (ALICE, see Table 1) about Lewis Carroll's literary work (1965a, 1965b) is administered to ten hypothetical students. The scored student responses are given in Table 2.

Table 1
The Demonstration Test
______________________________________________________________________________
Instructions: Answer the following questions about Lewis Carroll’s literary work.

1. When Alice met Humpty Dumpty, he was sitting on a
   *a. wall.   b. horse.   c. king.   d. chair.   e. tove.
2. Alice's prize in the caucus-race was a
   a. penny.   *b. thimble.   c. watch.   d. beheading.   e. glove.
3. Tweedledum and Tweedledee's battle was halted by a
   a. sheep.   b. lion.   *c. crow.   d. fawn.   e. walrus.
4. The balls used in the Queen's croquet game were
   a. serpents.   b. teacups.   c. cabbages.   *d. hedgehogs.   e. apples.
5. The Mock Turtle defined uglification as a kind of
   a. raspberry.   b. calendar.   c. envelope.   d. whispering.   *e. arithmetic.
6. The White Queen was not able to
   *a. think.   b. knit.   c. subtract.   d. sew.   e. calculate.
7. The Cook's tarts were mostly made of
   a. barley.   *b. pepper.   c. treacle.   d. camomile.   e. vinegar.
8. In real life, Alice's surname was
   a. Carroll.   b. Hopeman.   *c. Liddell.   d. Dodgson.   e. Lewis.
______________________________________________________________________________
Note. Asterisks are added to identify targeted alternatives (correct answers).

Table 2
Scored Student Responses and Indices of Item Quality
______________________________________________________________________________
                               Item number
                 1      2      3      4      5      6      7      8
______________________________________________________________________________
Top half of the class: High scorers on the test
Ann              a      b      c      d      e      a      b      b
Bob              a      b      a      d      NR     a      b      c
Cam              a      b      e      d      e      c      b      a
Don              a      b      c      d      b      e      b      d
Eve              NR     b      d      d      e      R>1    b      c
Bottom half of the class: Low scorers on the test
Fay              a      b      c      a      c      a      c      d
Guy              a      e      c      b      NR     a      b      b
Hal              a      b      b      c      NR     a      b      a
Ian              a      c      c      e      d      a      c      e
Joy              a      a      c      e      a      R>1    c      e
______________________________________________________________________________
Index
p              .90#1   .70    .60    .50    .30    .60    .70    .20
pq             .09#1   .21    .24    .25    .21    .24    .21    .16
ptop           .80    1.00    .40   1.00    .60    .40   1.00    .40
pbottom       1.00     .40    .80    .00    .00    .80    .40    .00
r             -.33#2   .49   -.57#2  .60    .26#2 -.21#2  .49    .08#2
______________________________________________________________________________
Note. The letters a through e = alternative selected; NR = no response (omission); R>1 = more than one response (multiple response). A response is correct when it matches the targeted alternative shown in Table 1 (items 1 through 8: a, b, c, d, e, a, b, c). p = pass rate for the entire class (N = 10), the proportion of students answering the item correctly; pq = item variance; ptop = pass rate for the top half of the class; pbottom = pass rate for the bottom half of the class; r = item-test correlation. The psychometric shortcomings of the demonstration test are indicated by the numerical flags #1 through #5, corresponding to the criteria discussed in this paper.
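Readers who want to follow the calculations can reproduce the scoring step with a few lines of code. The sketch below is a minimal illustration in Python (the names KEY, RESPONSES, and score are hypothetical and are not tied to any particular test-scoring package); it encodes the targets from Table 1 and the raw responses from Table 2 and converts them to a 0/1 score matrix, with omissions (NR) and multiple responses (R>1) scored as incorrect.

```python
# A minimal sketch (Python; names are hypothetical): scoring the ALICE responses.
# "NR" = no response (omission), "R>1" = multiple responses; both score 0 (incorrect).

KEY = ["a", "b", "c", "d", "e", "a", "b", "c"]   # targeted alternatives, items 1-8 (Table 1)

RESPONSES = {                                     # raw responses from Table 2
    "Ann": ["a", "b", "c", "d", "e", "a", "b", "b"],
    "Bob": ["a", "b", "a", "d", "NR", "a", "b", "c"],
    "Cam": ["a", "b", "e", "d", "e", "c", "b", "a"],
    "Don": ["a", "b", "c", "d", "b", "e", "b", "d"],
    "Eve": ["NR", "b", "d", "d", "e", "R>1", "b", "c"],
    "Fay": ["a", "b", "c", "a", "c", "a", "c", "d"],
    "Guy": ["a", "e", "c", "b", "NR", "a", "b", "b"],
    "Hal": ["a", "b", "b", "c", "NR", "a", "b", "a"],
    "Ian": ["a", "c", "c", "e", "d", "a", "c", "e"],
    "Joy": ["a", "a", "c", "e", "a", "R>1", "c", "e"],
}

def score(responses, key):
    """Return {student: [1 or 0 per item]}, 1 meaning the response matches the key."""
    return {name: [int(answer == target) for answer, target in zip(row, key)]
            for name, row in responses.items()}

if __name__ == "__main__":
    for name, row in score(RESPONSES, KEY).items():
        print(f"{name}: {row}  total = {sum(row)}")   # totals range from 7 (Ann) to 2 (Joy)
```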
Item Acceptability

Criterion 1

(a) An item is acceptable if its pass rate, p, is between .20 and .80. (b) An item is acceptable if its failure rate, q, is between .20 and .80. (c) An item is acceptable if its variance (pq) is .16 or higher.

A basic assumption in testing is that people differ. Therefore, the first requirement of a test item is that not all respondents give the same answer. The item must differentiate. A multiple-choice item distinguishes two groups, those who pass and those who fail. The pass rate is p, the proportion of students who select the target (correct answer). The failure rate is q, or 1 - p, the proportion who fail to answer, give multiple answers, or select a decoy (an incorrect answer). Differentiation is highest when the pass and fail groups are equal and decreases as these groups diverge in size. For example, if five of ten students pass (p = .5) and five fail (q = .5), then each passing student is differentiated from each failing student and there are 5 x 5 = 25 discriminations. If everyone passes and no one fails (p = 1, q = 0), there are 10 x 0 = 0 discriminations. Differentiation becomes inadequate if one group is more than four times the size of the other. Therefore, both p and q should be within the range .2 to .8. It follows that pq should be no lower than .2 x .8 = .16. The product pq is the item variance (often labeled s² or VAR).

Counting the correct responses in each column of Table 2 gives the number of students who answer each question correctly. The column count divided by the total number of students gives the item pass rate. Item 1, where Humpty Dumpty sat, is too easy, p = .9. Consequently, its variance is unacceptably low, pq = .9 x .1 = .09. Unless this item can be made more difficult, it should be dropped.
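For teachers whose scoring system does not report these indices, p, q, and pq are easy to compute from the 0/1 score matrix. A minimal sketch in Python (the variable names are hypothetical; the scores matrix is the one implied by Tables 1 and 2):

```python
# A minimal sketch (Python; names are hypothetical): Criterion 1 indices p, q, and pq.
# `scores` is the 0/1 matrix implied by Tables 1 and 2 (rows = students Ann ... Joy).

scores = [
    [1, 1, 1, 1, 1, 1, 1, 0], [1, 1, 0, 1, 0, 1, 1, 1], [1, 1, 0, 1, 1, 0, 1, 0],
    [1, 1, 1, 1, 0, 0, 1, 0], [0, 1, 0, 1, 1, 0, 1, 1], [1, 1, 1, 0, 0, 1, 0, 0],
    [1, 0, 1, 0, 0, 1, 1, 0], [1, 1, 0, 0, 0, 1, 1, 0], [1, 0, 1, 0, 0, 1, 0, 0],
    [1, 0, 1, 0, 0, 0, 0, 0],
]

n_students = len(scores)
n_items = len(scores[0])

for item in range(n_items):
    p = sum(row[item] for row in scores) / n_students   # pass rate
    q = 1 - p                                            # failure rate
    pq = p * q                                           # item variance
    verdict = "" if 0.20 <= p <= 0.80 else "  <-- fails Criterion 1"
    print(f"Item {item + 1}: p = {p:.2f}, q = {q:.2f}, pq = {pq:.2f}{verdict}")
```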
Criterion 2

An item is acceptable if its item-test correlation (r) is positive and .30 or higher.

It is not enough that an item differentiates; it must differentiate appropriately. Well-informed students should pass the item and uninformed students should fail it. If an item is so easy that only one person answers incorrectly, that person should be the one with the least knowledge. Item 1 fails in this respect. Joy has the least knowledge (the lowest ALICE score), but the student who fails this item is Eve, a high scorer.

In Table 2 the students are ranked in descending order according to their total ALICE scores. Correct responses should be more frequent in the top half of the table (i.e., in the top half of the class) than in the bottom half. This is not the case for items 1, 3, and 6. For example, only two of the top five students pass item 6 but four of the bottom five students get it correct. The pass rates calculated separately for the top and bottom halves of the class (ptop and pbottom) confirm that the “wrong” students get items 1, 3, and 6 correct. Students who know nothing about Lewis Carroll’s work but know their nursery rhymes could answer the Humpty Dumpty question (item 1). For item 3, what halted Tweedledum and Tweedledee’s battle, perhaps students selected the target, c. crow, not through knowledge but because they chose the exception (the only non-mammal) or because they followed the adage, “when in doubt, choose c”. In item 6, the White Queen’s inability, alternative c. subtract is a more accurate description than the teacher’s target, a. think. The White Queen sometimes was unable to think but always was unable to subtract. Item 6 is mis-keyed. Off with the teacher’s head!

The relationship between passing or failing an item and doing well or doing poorly on the test as a whole is assessed by an item-test correlation coefficient (r). The possible range is from -1 to +1. A mid-range value (-.30 < r < +.30) indicates that the relationship is weak or absent. The higher the positive correlation, the stronger the tendency for students who do well on an item to also do well on the test as a whole (appropriate differentiation). The more negative the correlation, the stronger the evidence that passing the item is associated with a low score on the test (inappropriate differentiation). In Table 2, negative correlation coefficients flag items 1, 3, and 6 for attention, as anticipated. More to the point, the correlation coefficient detects weaknesses overlooked by the pass/fail rate statistics. Items 5 and 8 differentiate appropriately but insufficiently. The correlation of item 8, Alice’s real-life surname, with the test is practically zero. It should be deleted. The teacher should consider rewording item 5, the definition of uglification. Perhaps a more homogeneous set of alternatives (e.g., all school subjects) would increase its item-test correlation.

Bear in mind that an unsatisfactory item-test correlation indicates a problem either in the item or in the test (or both). This index is a measure of item quality only if the total test score is meaningful (see General Remarks below).
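Scoring programs report the item-test correlation under various names and with minor computational differences. One common variant, sketched below in Python (hypothetical names), is the point-biserial correlation between the item score and the total of the remaining items (the corrected item-total correlation). This corrected form appears to be consistent with the values in Table 2, but a program that leaves the item in the total will give somewhat different figures.

```python
# A minimal sketch (Python; names are hypothetical): item-test correlations (Criterion 2).
# Computes the point-biserial correlation between each item and the total of the
# REMAINING items (the "corrected" item-total correlation).

scores = [  # 0/1 matrix from Tables 1 and 2; rows = students Ann ... Joy, columns = items 1-8
    [1, 1, 1, 1, 1, 1, 1, 0], [1, 1, 0, 1, 0, 1, 1, 1], [1, 1, 0, 1, 1, 0, 1, 0],
    [1, 1, 1, 1, 0, 0, 1, 0], [0, 1, 0, 1, 1, 0, 1, 1], [1, 1, 1, 0, 0, 1, 0, 0],
    [1, 0, 1, 0, 0, 1, 1, 0], [1, 1, 0, 0, 0, 1, 1, 0], [1, 0, 1, 0, 0, 1, 0, 0],
    [1, 0, 1, 0, 0, 0, 0, 0],
]

def pearson(x, y):
    """Plain Pearson correlation; with a 0/1 variable it equals the point-biserial r."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

for item in range(len(scores[0])):
    item_scores = [row[item] for row in scores]
    rest_totals = [sum(row) - row[item] for row in scores]   # total score without this item
    r = pearson(item_scores, rest_totals)
    flag = "" if r >= 0.30 else "  <-- fails Criterion 2"
    print(f"Item {item + 1}: r = {r:+.2f}{flag}")
```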
Criterion 3

An item with d decoys is acceptable if 20/d% to 80/d% of the students select each decoy.

An efficient item has effective decoys; that is, wrong answers that are sufficiently plausible as to be selected by uninformed students. Effectiveness is maximal when the incorrect responses are evenly distributed across the decoys (showing that none is redundant). By Criterion 1, an item failure rate between 20% and 80% is acceptable. In ALICE, the number of decoys (d) per item is 4. Therefore, each wrong answer should be chosen by 20/4% to 80/4% of the students, that is, by 5% to 20%. With a sample size of ten, acceptable decoy frequencies are 1 or 2; a frequency of 0, or of 3 or more, falls outside the desired range. The response frequencies are given in Table 3.

Table 3
Frequency of Student Responses and Indices of Item Quality
______________________________________________________________________________
                               Item number
Response         1      2      3      4      5      6      7      8
______________________________________________________________________________
a                9      1      1      1      1      6      0#3    2
b                0#3    7      1      1      1      0#3    7      2
c                0#3    1      6      1      1      1      3#3    2
d                0#3    0#3    1      5      1      0#3    0#3    2
e                0#3    1      1      2      3      1      0#3    2
NR               1#4    0      0      0      3#4    0      0      0
R>1              0      0      0      0      0      2#4    0      0
______________________________________________________________________________
Note. Ten students wrote the test. The letters a through e = option chosen; NR = no response (omission); R>1 = more than one response (multiple response). For each item, the frequency of the targeted (correct) alternative appears in the row matching that item's key (items 1 through 8: a, b, c, d, e, a, b, c). The psychometric shortcomings of the demonstration test are indicated by the numerical flags #3 and #4, corresponding to criteria discussed in this paper.

Items 1, 2, 6, and 7 fail to meet the criterion. For item 1, where Humpty Dumpty sat, the frequency of selection of all decoys is zero. But the teacher should not direct test refinement efforts at the apparent violation before identifying the root cause. The real problem here is this item’s high p value. An excessively easy target depletes the response rate for the decoys. For item 2, no one selects the implausible decoy d. beheading as the prize in the caucus-race. For item 6, the White Queen’s inability, the decoys b. knit and d. sew are never selected. These decoys are highly related, and therefore it might seem that either both are correct or neither is. In item 7, the Cook’s tarts, only one of the four decoys is ever chosen. Consequently, the pass rate is inflated by guessing. The teacher should generate plausible new decoys and re-evaluate these items.
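Decoy effectiveness is easiest to judge from a full response-frequency table such as Table 3, which also supplies the omission and multiple-response counts needed for Criterion 4 below. A minimal sketch in Python (hypothetical names; the raw responses are those of Table 2):

```python
# A minimal sketch (Python; names are hypothetical): response frequencies for Criterion 3.
# Tallies each alternative plus "NR" (omission) and "R>1" (multiple response), and flags
# decoys chosen by fewer than 5% or more than 20% of the class (outside 1-2 when N = 10).
from collections import Counter

KEY = ["a", "b", "c", "d", "e", "a", "b", "c"]       # targets, items 1-8 (Table 1)
RESPONSES = [                                         # raw responses, students Ann ... Joy
    ["a", "b", "c", "d", "e", "a", "b", "b"], ["a", "b", "a", "d", "NR", "a", "b", "c"],
    ["a", "b", "e", "d", "e", "c", "b", "a"], ["a", "b", "c", "d", "b", "e", "b", "d"],
    ["NR", "b", "d", "d", "e", "R>1", "b", "c"], ["a", "b", "c", "a", "c", "a", "c", "d"],
    ["a", "e", "c", "b", "NR", "a", "b", "b"], ["a", "b", "b", "c", "NR", "a", "b", "a"],
    ["a", "c", "c", "e", "d", "a", "c", "e"], ["a", "a", "c", "e", "a", "R>1", "c", "e"],
]

n_students = len(RESPONSES)
low, high = 0.05 * n_students, 0.20 * n_students      # 20/d% to 80/d% with d = 4 decoys

for item, target in enumerate(KEY):
    counts = Counter(row[item] for row in RESPONSES)
    weak = [alt for alt in "abcde"
            if alt != target and not low <= counts.get(alt, 0) <= high]
    note = f"  <-- ineffective decoys: {', '.join(weak)}" if weak else ""
    print(f"Item {item + 1}: {dict(counts)}{note}")
```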
Criterion 4

An item is acceptable if ≤ 5% of the students omit answers and/or give multiple answers.

An important requirement is that the item presents a well-structured task. Obviously, the clarity of the items should be assessed during test construction. But unforeseen problems are revealed when a sizable proportion of the students (5% or more) fails to respond, or gives multiple answers (or both). With a sample size of 10, frequencies of 1 or above are outside the acceptable range.

Items 1 and 5 show excessive omissions (see Table 3). Maybe the high scorer Eve omitted item 1, where Humpty Dumpty sat, because she thought the answer, a. wall, so obvious that it must be a trick question. For item 5, perhaps poorly prepared students simply gave up. For them the concocted word, uglification, had no associated thoughts. The teacher could try new decoys based on uglification’s resemblance to English words such as ugliness and nullification. (See Criterion 2 for an alternative strategy.) The frequency of omissions for items 1 and 5 suggests that some students feared they would be penalized for an incorrect guess.

Item 6 has too many multiple answers. The stem might be confusing because it contravenes recommended practice: word the stem positively or at least emphasize negative words by capitalizing or underlining. The alternatives might be confusing because they overlap in meaning (knit with sew, subtract with calculate, and calculate with think). Puzzled, students left more than one alternative marked. Perhaps these students assumed they would receive partial credit. The teacher should rewrite the item with clear-cut alternatives.

Test Acceptability

Criterion 5

(a) A test is acceptable if the internal consistency coefficient (K-R 20) is at least .64. (b) A test is acceptable if the correlation between observed and true scores is at least .80. (c) A test is acceptable if the true variance proportion is at least .64 of the total. (d) A test is acceptable if the error variance proportion is no more than .36 of the total.

Any test of more than one item requires the assumption that item scores can be added to produce a meaningful single test score. In an internally consistent test all the items work in unison to produce a stable assessment of student performance. When the test is internally inconsistent, performance varies markedly according to the particular items considered. Someone who excels in one part of the test might do very badly, do well, or be average on another part. The test gives a mixed message. The additivity requirement applies when all the items in the test are intended to measure the same topic. If the test covers more than one topic, each topic is considered a test in its own right (see General Remarks below).

In Table 2, Bob (the second highest ALICE scorer) gets 50% of the odd-numbered items correct and 100% of the even-numbered items correct. Joy (the lowest scorer) also gets 50% of the odd-numbered items correct but 0% of the even-numbered questions. These two students are equally knowledgeable in terms of the odd items, but at opposite ends of the knowledge spectrum in terms of the even items. Similar discrepancies are present for other students and other item groups. For example, Don gets 100% in the first half of the test but only 25% in the second half.

The Kuder-Richardson formula 20 (K-R 20) is a measure of internal consistency. Theoretical values range from zero (each item measures something different from every other item) to +1 (all the items measure the same thing). An internal consistency coefficient of .64 or higher is adequate. The K-R 20 for ALICE is .15 (see Table 4). Therefore, the eight items do not form a coherent set. However, this does not preclude the possibility of one or more coherent subsets of items.

The test score is an estimate of the student’s true knowledge. The student’s true score is unknowable but, in a statistical sense, it is the mean of the scores that would be obtained if the individual were tested repeatedly. The positive square root of the test's internal consistency coefficient (K-R 20) estimates the correlation between students' observed and true scores. A test is acceptable if this correlation estimate is at least √.64 = .80. For ALICE, the estimated correlation between observed and true scores is √.15 = .39. The observed scores are insufficiently related to the students' true scores.
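Teachers whose scoring program does not report K-R 20 can compute it directly from the 0/1 score matrix. A minimal sketch (Python; names are hypothetical) applies the usual formula, K-R 20 = (k/(k - 1))(1 - sum of item variances / total-score variance); the total-score variance here divides by N, which matches the variance of 1.85 reported for ALICE in Table 4.

```python
# A minimal sketch (Python; names are hypothetical): Kuder-Richardson formula 20.
# K-R 20 = (k / (k - 1)) * (1 - sum of item variances / total-score variance).
# Variances are taken over N (division by 10), matching the variance of 1.85 in Table 4.

scores = [  # 0/1 matrix from Tables 1 and 2; rows = students Ann ... Joy, columns = items 1-8
    [1, 1, 1, 1, 1, 1, 1, 0], [1, 1, 0, 1, 0, 1, 1, 1], [1, 1, 0, 1, 1, 0, 1, 0],
    [1, 1, 1, 1, 0, 0, 1, 0], [0, 1, 0, 1, 1, 0, 1, 1], [1, 1, 1, 0, 0, 1, 0, 0],
    [1, 0, 1, 0, 0, 1, 1, 0], [1, 1, 0, 0, 0, 1, 1, 0], [1, 0, 1, 0, 0, 1, 0, 0],
    [1, 0, 1, 0, 0, 0, 0, 0],
]

def kr20(matrix):
    n, k = len(matrix), len(matrix[0])
    totals = [sum(row) for row in matrix]
    mean = sum(totals) / n
    total_variance = sum((t - mean) ** 2 for t in totals) / n
    item_ps = [sum(row[i] for row in matrix) / n for i in range(k)]
    sum_pq = sum(p * (1 - p) for p in item_ps)
    return (k / (k - 1)) * (1 - sum_pq / total_variance)

consistency = kr20(scores)
print(f"K-R 20 for ALICE = {consistency:.2f}")                                   # about .15
print(f"Estimated observed-true score correlation = {consistency ** 0.5:.2f}")   # about .39
```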
When an achievement test is administered to a group of students, the spread of scores (the total test variance) is determined in part by genuine differences in the students' knowledge (true variance) and in part by inadequacies of measurement (error variance). The K-R 20 coefficient represents the proportion attributable to true variance. Therefore, the error proportion is 1 - (K-R 20). A test is acceptable if the proportion of true variance is at least .64 and, equivalently, if the proportion of error variance is no more than .36 of the total variance. ALICE fails to reach the required standard. Its true variance proportion is .15. The error proportion is .85.

To convert to the measurement units of a particular test, the true and error proportions are multiplied by the total test variance. ALICE has a total variance of 1.85 (see Table 4). Therefore, in raw score units, the true variance is .15 x 1.85 = .28 and the error variance is (1 - .15) x 1.85 = 1.57. The square root of the error variance, the standard error of measurement (SEM), measures the margin of error in assessing an individual student's true score. The probability is approximately 95% that the student's true score lies within 2 SEM of the obtained score. The SEM for ALICE is √1.57 = 1.26. For Fay, whose ALICE score is 4, the margin of error is 4 ± 2(1.26). That is, her true ALICE score might be anywhere between 1.48 and 6.52, in effect anywhere between 1 and 7. Given that there are only eight items, the imprecision is obvious.
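The error variance, SEM, and the resulting margin of error can be chained together in a few lines. A minimal sketch (Python; names are hypothetical; the inputs are the rounded ALICE figures quoted above, so small discrepancies with Table 4 are rounding):

```python
# A minimal sketch (Python; names are hypothetical): true variance, error variance, SEM,
# and the approximate 95% band around one student's observed score.

kr20 = 0.15                 # internal consistency of ALICE (Table 4)
total_variance = 1.85       # total-score variance of ALICE (Table 4)

true_variance = kr20 * total_variance            # about .28 in raw-score units
error_variance = (1 - kr20) * total_variance     # about 1.57
sem = error_variance ** 0.5                      # about 1.25-1.26, depending on rounding

observed = 4                                     # Fay's ALICE score
low, high = observed - 2 * sem, observed + 2 * sem
print(f"True variance = {true_variance:.2f}, error variance = {error_variance:.2f}")
print(f"SEM = {sem:.2f}; Fay's true score probably lies between {low:.2f} and {high:.2f}")
```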
General Remarks

Test scoring programs generally provide the foregoing information in a composite printout such as shown for ALICE in Table 4.

Table 4
Sample Computer Printout and Indices of Item and Test Quality

Psychometric Analysis of the ALICE Demonstration Test
______________________________________________________________________________
                               Item number
                 1      2      3      4      5      6      7      8
______________________________________________________________________________
Response
a              .90#1   .10    .10    .10    .10    .60    .00#3   .20
b              .00#3   .70    .10    .10    .10    .00#3  .70     .20
c              .00#3   .10    .60    .10    .10    .10    .30#3   .20
d              .00#3   .00#3  .10    .50    .10    .00#3  .00#3   .20
e              .00#3   .10    .10    .20    .30    .10    .00#3   .20
NR             .10#4   .00    .00    .00    .30#4  .00    .00     .00
R>1            .00     .00    .00    .00    .00    .20#4  .00     .00
Index
pq             .09#1   .21    .24    .25    .21    .24    .21     .16
r             -.33#2   .49   -.57#2  .60    .26#2 -.21#2  .49     .08#2
______________________________________________________________________________
Test statistics: Mean = 4.50; Variance = 1.85; Standard error of measurement = 1.26; Number of students = 10.
Test indices: Kuder-Richardson formula 20 = .15#5; True variance = 0.28 = 15%#5; Error variance = 1.57 = 85%#5; Observed, true score correlation = .39#5.
______________________________________________________________________________
Note. Entries in the top part of the table are the proportional response frequencies for the test items. The letters a through e = alternative selected; NR = no response (omission); R>1 = more than one response (multiple response). For each item, the proportion for the targeted (correct) alternative appears in the row matching that item's key (items 1 through 8: a, b, c, d, e, a, b, c). pq = item variance; r = item-test correlation. The psychometric shortcomings of the test are indicated by the numerical flags #1 through #5, corresponding to criteria discussed in this paper.

Only item 4, croquet balls, passes muster. The other items must be modified or dropped. An indication of the power of test refinement is that the mere omission of the worst item in terms of item-test correlation (item 3, Tweedledum and Tweedledee) raises the ALICE K-R 20 coefficient from .15 to .52.

But there are additional problems. First, the teacher’s directions do not fully disclose the test requirements (see Table 1). As a result, individual differences in students’ willingness to omit, guess, or give multiple answers contribute to error variance. Better instructions, such as the following, might prevent items 1, 5, and 6 from running afoul of Criterion 4 (omissions and multiple answers):

Answer all the questions. For each item, circle the single best alternative. There is no penalty for wrong answers. If you guess correctly you will receive one mark. If you guess incorrectly you will neither gain nor lose marks. There is no credit for multiple answers even if one of them is correct.

Second, the teacher failed to define the test domain clearly. The instructions state that the test assesses students’ knowledge of Lewis Carroll’s literary work but do not specify which works. Items 2, 4, 5, and 7 are from Alice's Adventures in Wonderland. Items 1, 3, and 6 are from Through the Looking-Glass and What Alice Found There. Item 8, the real Alice’s surname, is biographical and does not relate to Carroll’s literary work. The assumption of a single domain is suspect. Perhaps ALICE is three tests. This partitioning is supported by the item-test correlation pattern observed in Table 4. The four Wonderland items have positive item-test correlations. The three Looking-Glass items all correlate negatively with ALICE. The biographical item has an essentially zero correlation with ALICE. Therefore, the teacher should analyze the Wonderland and Looking-Glass items as independent tests. (This new information means that the item-test correlations and the K-R 20 for the original ALICE are inappropriate.) Even without modification, the four Wonderland items make an internally consistent test (K-R 20 = .84). The three Looking-Glass items show promise (K-R 20 = .54). Students who do well on Wonderland items tend to do poorly on Looking-Glass items (the correlation between tests is -.57). It seems that the requirement to read both books was inadequately communicated or misunderstood. Curiouser and curiouser!

If the overall quality of the test (or tests) is still unacceptable after modification of the existing items, the next step is to write new items. Additional items generally increase a test’s internal consistency. Estimate the required test length by multiplying the current test length by the quotient (D - CD)/(C - CD), where C is the consistency coefficient obtained for the Current test and D is the consistency coefficient Desired for the new test. Therefore, to upgrade the Looking-Glass test from a current K-R 20 of .54 to a desired K-R 20 of .64, the new test must be about (.64 - .54 x .64)/(.54 - .54 x .64) = 1.5 times as long as the existing 3-item test. Five items should suffice.
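The lengthening estimate is easy to script and to reuse for other subtests. A minimal sketch (Python; the function name length_multiplier is hypothetical) applies the quotient (D - CD)/(C - CD) from the text to the Looking-Glass subtest:

```python
# A minimal sketch (Python; the function name is hypothetical): estimating how much longer
# a test must be to reach a desired internal consistency, using the quotient in the text.

def length_multiplier(current, desired):
    """(D - CD) / (C - CD): factor by which the test length must be multiplied."""
    return (desired - current * desired) / (current - current * desired)

current_items = 3                                  # the Looking-Glass subtest
factor = length_multiplier(current=0.54, desired=0.64)
print(f"Multiplier = {factor:.2f}; about {current_items * factor:.1f} items are needed")
# Prints a multiplier of about 1.51, i.e. about 4.5 items; five items should suffice.
```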
A final consideration in refining a test is to ensure that the item set as a whole has desirable characteristics. Do the items cover the entire domain? Are there omitted topics or redundancies? Is there an appropriate mix of factual and application items? Is the difficulty level appropriate? Is the test structurally sound? ALICE violates several guidelines. For example, the stems are in sentence-completion form instead of question form and the alternatives are placed horizontally instead of vertically (see Haladyna, 1999).

Refutations and Conclusions

A critic might argue that the reuse-after-refinement approach compromises test security: “I return the tests to students for study purposes so I need a new test every time.” But assuming that the teacher wants to encourage students to review conceptually rather than memorize specific information, there is no requirement that the distributed review items should be the same as those used in class tests. Besides, most of the items would be modified or replaced before the test is reused.

Teachers who refuse to modify or reuse items can create their own items. But this is inefficient because it fails to capitalize on previous work. Moreover, it places a heavy demand on the teacher’s creativity and item-writing skills, abilities that are never examined.

Alternatively, the teacher can compose successive tests by selecting new items from published test banks. But these teachers use items that have been and will be used by other teachers. If their students interact, item security is compromised. Furthermore, a test bank is an exhaustible source of items of mediocre quality. If, as is likely, the teacher selects the “best bet” items first, then subsequent tests will be of deteriorating quality. In sharp contrast, the reuse-after-refinement approach generates tests of improving quality. Therefore, the major threat is not reuse after refinement but reuse without refinement.

Teachers who argue that “the psychometric indices are important for commercial tests but not for classroom tests” miss the point. There is a difference between relaxing the rigor and abandoning the process. The level of processing might vary, but all tests share the need for refinement. Before testing, the teacher should write or select items according to item-writing guidelines. After testing, the teacher should evaluate item performance according to the psychometric indices. The requirement is to establish a classroom test of quality. The level of quality must meet the needs of students and teacher, not those of commercial test publishers. Extra vetting takes extra time, but the investment is recouped by a fairer and more precise assessment of student performance: surely the essential purpose of any test procedure.

To conclude, good multiple-choice tests are not likely to occur if teachers select questions indiscriminately from published test banks or rely on their own first drafts of original items. An iterative approach to test construction improves test quality.

References

Carroll, L. (1965a). Alice's adventures in Wonderland (A Centennial Edition). New York: Random House.
Carroll, L. (1965b). Through the looking-glass and what Alice found there (A Centennial Edition). New York: Random House.
Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. New York: Holt, Rinehart, & Winston.
Cronbach, L. J. (1990). Essentials of psychological testing (5th ed.). New York: HarperCollins.
Ellsworth, R. A., Dunnell, P., & Duell, O. K. (1990). Multiple-choice test items: What are textbook authors telling teachers? Journal of Educational Research, 83, 289-293.
Haladyna, T. M. (1999). Developing and validating multiple-choice test items (2nd ed.). Mahwah, NJ: Erlbaum.
Hansen, J. D., & Dexter, L. (1997). Quality multiple-choice test questions: Item-writing guidelines and an analysis of auditing testbanks. Journal of Education for Business, 73, 94-97.
Magnusson, D. (1967). Test theory (H. Mabon, Trans.). Reading, MA: Addison-Wesley. (Original work published 1966)
Nunnally, J. C. (1978). Psychometric theory (2nd ed.). New York: McGraw-Hill.
Osterlind, S. J. (1998).
Constructing test items: Multiple-choice, constructed-response, performance, and other formats (2nd ed.). Boston: Kluwer.