Improving Classroom Multiple-Choice Tests: A Worked Example Using Statistical Criteria
Judith R. Emslie and Gordon R. Emslie
Department of Psychology
Ryerson University, Toronto
Copyright © 2005 Emslie
Abstract
The first draft of a classroom multiple-choice test is an object in need of improvement
(T. M. Haladyna, 1999; S. J. Osterlind, 1998). Teachers who build each test anew fail to
capitalize on their previous efforts. This paper describes an iterative approach and provides
specific numerical criteria for item and test adequacy. Indices are selected for aptness, simplicity,
and mutual compatibility. A worked example, suitable for teachers of all disciplines,
demonstrates the process of raising test quality to a level that meets student needs. Reuse after
refinement does not threaten test security.
Keywords: multiple-choice tests; teacher-made tests; item analysis; psychometrics; test construction; teacher education
Ellsworth, Dunnell, and Duell (1990) present 37 guidelines for writing multiple-choice
items. They find that approximately 60% of the items published in instructor guides for
educational psychology texts violate one or more item-writing guidelines. Hansen and Dexter
(1997) report that, by the same criterion, 75% of auditing (accounting) test bank items are faulty.
And in practice, “item writers can expect that about 50% of their items will fail to perform as
intended” (Haladyna, 1999, p. 214). It is safe to assume that teacher-constructed items fare no
better. Therefore, a classroom multiple-choice test is an object in need of improvement.
Following item-writing advice helps but most guidelines are based on expert opinion rather than
empirical evidence or theory (Haladyna, 1999; Osterlind, 1998). Even a structurally sound item
might be functionally flawed. For example, the item might be too difficult or off topic.
Therefore, teachers must assume personal responsibility for test quality. They must assess item
functioning in the context of their test’s intended domain and target population. Teachers must
conduct an iterative functional analysis of the test. They must scrutinize, refine, and reuse items.
They must not assume any collection of likely looking items will suffice. Psychometric
interpretation guidelines are available in classic texts (e.g., Crocker & Algina, 1986; Cronbach,
1990; Magnusson, 1966/1967; Nunnally, 1978). However, the information is not in a brief,
pragmatic form with unequivocal criteria for item acceptance or rejection. Consequently,
teachers neglect statistical information that could improve the quality of their multiple-choice
tests. This paper provides “how to” information and a demonstration test using artificial data. It
is of use to teachers—of all disciplines—particularly those awed by statistical concepts or new to
multiple-choice testing. Experienced teachers might find it useful for distribution to teaching
assistants or as a pedagogic aid. This paper de-emphasizes technical terminology without
misrepresenting “classical” psychometric concepts. (For simplicity, “modern” psychometric
theory relating to criterion-referenced items and tests is not considered.) To allow reproduction
of the entire data set, statistical ideals are relaxed (e.g., the demonstration test length and student
sample size are unrealistically small). Teachers can customize the information by processing the
data through their own test scoring systems.
This paper enumerates criteria for the interpretation of elementary indices of item and test
quality. Where appropriate, letters in parentheses signify alternative or related expressions of the
given criterion. The specific numerical values of the indices are selected for aptness, simplicity,
and compatibility. A tolerable, not ideal, standard of acceptability is assumed. The demonstration
test—based on knowledge not specific to any discipline—provides a medium for (a) describing
the indices, (b) explaining the rationale for these indices, and (c) illustrating the process of test
refinement.
A Demonstration Multiple-Choice Test
A Light-hearted Illustrative Classroom Evaluation (ALICE, see Table 1) about Lewis
Carroll's literary work (1965a, 1965b) is administered to ten hypothetical students. The scored
student responses are given in Table 2.
Item Acceptability
Criterion 1
(a) An item is acceptable if its pass rate, p, is between .20 and .80.
(b) An item is acceptable if its failure rate, q, is between .20 and .80.
(c) An item is acceptable if its variance (pq) is .16 or higher.
A basic assumption in testing is that people differ.
Table 1
The Demonstration Test
______________________________________________________________________________
Instructions: Answer the following questions about Lewis Carroll’s literary work.
1. When Alice met Humpty Dumpty, he was sitting on a
*a. wall.
b. horse.
c. king.
d. chair.
e. tove.
2. Alice's prize in the caucus-race was a
a. penny.
*b. thimble.
c. watch.
d. beheading.
e. glove.
3. Tweedledum and Tweedledee's battle was halted by a
a. sheep.
b. lion.
*c. crow.
d. fawn.
e. walrus.
4. The balls used in the Queen's croquet game were
a. serpents.
b. teacups.
c. cabbages.
*d. hedgehogs.
e. apples.
5. The Mock Turtle defined uglification as a kind of
a. raspberry.
b. calendar.
c. envelope.
d. whispering.
*e. arithmetic.
6. The White Queen was not able to
*a. think.
b. knit.
c. subtract.
d. sew.
e. calculate.
7. The Cook's tarts were mostly made of
a. barley.
*b. pepper.
c. treacle.
d. camomile.
e. vinegar.
8. In real life, Alice's surname was
a. Carroll.
b. Hopeman.
*c. Liddell.
d. Dodgson.
e. Lewis.
______________________________________________________________________________
Note. Asterisks are added to identify targeted alternatives (correct answers).
Table 2
Scored Student Responses and Indices of Item Quality
______________________________________________________________________________
                                    Item number
         _____________________________________________________________________
Student       1       2       3       4       5       6       7       8
______________________________________________________________________________
Top half of the class: High scorers on the test
Ann          a✓      b✓      c✓      d✓      e✓      a✓      b✓      b✗
Bob          a✓      b✓      a✗      d✓      NR      a✓      b✓      c✓
Cam          a✓      b✓      e✗      d✓      e✓      c✗      b✓      a✗
Don          a✓      b✓      c✓      d✓      b✗      e✗      b✓      d✗
Eve          NR      b✓      d✗      d✓      e✓      R>1     b✓      c✓
Bottom half of the class: Low scorers on the test
Fay          a✓      b✓      c✓      a✗      c✗      a✓      c✗      d✗
Guy          a✓      e✗      c✓      b✗      NR      a✓      b✓      b✗
Hal          a✓      b✓      b✗      c✗      NR      a✓      b✓      a✗
Ian          a✓      c✗      c✓      e✗      d✗      a✓      c✗      e✗
Joy          a✓      a✗      c✓      e✗      a✗      R>1     c✗      e✗
______________________________________________________________________________
Index
p           .90 #1  .70     .60     .50     .30     .60     .70     .20
pq          .09 #1  .21     .24     .25     .21     .24     .21     .16
ptop        .80    1.00     .40    1.00     .60     .40    1.00     .40
pbottom    1.00     .40     .80     .00     .00     .80     .40     .00
r          -.33 #2  .49    -.57 #2  .60     .26 #2 -.21 #2  .49     .08 #2
______________________________________________________________________________
Note. The letters a through e = alternative selected; NR = no response (omission); R>1 = more
than one response (multiple response); ✓ = correct response; ✗ = incorrect response. p = pass
rate for the entire class (N = 10), the proportion of students answering the item correctly;
pq = item variance; ptop = pass rate for the top half of the class; pbottom = pass rate for the
bottom half of the class; r = item-test correlation. The psychometric shortcomings of the
demonstration test are indicated by the flags #1 and #2, which correspond to criteria discussed
in this paper.
Therefore, the first requirement of a test item is that not all respondents give the same answer. The item must differentiate. A multiple-choice
item distinguishes two groups, those who pass and those who fail. The pass rate is p, the
proportion of students that selects the target (correct answer). The failure rate is q, or 1 - p, the
proportion that fails to answer, gives multiple answers, or selects a decoy (incorrect answers).
Differentiation is highest when the pass and fail groups are equal and decreases as these groups
diverge in size. For example, if five of ten students pass (p = .5) and five fail (q = .5), then each
passing student is differentiated from each failing student and there are 5 x 5 = 25
discriminations. If everyone passes and no one fails (p = 1, q = 0), there are 10 x 0 = 0
discriminations. Differentiation becomes inadequate if one group is more than four times the size
of the other. Therefore, both p and q should be within the range .2 to .8. It follows that pq should
be no lower than .2 x .8 = .16. The product pq is the item variance (often labeled s2 or VAR).
Summing the check marks in the columns of Table 2 gives the number of students who
answer each question correctly. The column sum divided by the total number of students gives
the item pass rate. Item 1, where Humpty Dumpty sat, is too easy, p = .9. Consequently, its
variance is unacceptably low, pq = .9 x .1 = .09. Unless this item can be made more difficult, it
should be dropped.
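Teachers who want to verify these figures can do so with a few lines of code. The following Python sketch (illustrative only, not part of any scoring package) transcribes the scored responses of Table 2 as 1 = correct and 0 = incorrect, omitted, or multiple response, then computes p and pq for each item and flags violations of Criterion 1.

scores = [  # rows = students (Ann through Joy), columns = items 1 to 8
    [1, 1, 1, 1, 1, 1, 1, 0],  # Ann
    [1, 1, 0, 1, 0, 1, 1, 1],  # Bob
    [1, 1, 0, 1, 1, 0, 1, 0],  # Cam
    [1, 1, 1, 1, 0, 0, 1, 0],  # Don
    [0, 1, 0, 1, 1, 0, 1, 1],  # Eve
    [1, 1, 1, 0, 0, 1, 0, 0],  # Fay
    [1, 0, 1, 0, 0, 1, 1, 0],  # Guy
    [1, 1, 0, 0, 0, 1, 1, 0],  # Hal
    [1, 0, 1, 0, 0, 1, 0, 0],  # Ian
    [1, 0, 1, 0, 0, 0, 0, 0],  # Joy
]
n_students = len(scores)
for item in range(8):
    p = sum(row[item] for row in scores) / n_students  # pass rate
    pq = p * (1 - p)                                    # item variance
    flag = "" if 0.20 <= p <= 0.80 else "  <- fails Criterion 1"
    print(f"Item {item + 1}: p = {p:.2f}, pq = {pq:.2f}{flag}")

Running the sketch reproduces the p and pq rows of Table 2 and flags only item 1.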
Criterion 2
An item is acceptable if its item-test correlation (r) is positive and .30 or higher.
It is not enough that an item differentiates; it must differentiate appropriately. Well-informed
students should pass and uninformed students should fail the item. If an item is so easy that only
one person answers incorrectly, that person should be the one with least knowledge. Item 1 fails
in this respect. Joy has the least knowledge (the lowest ALICE score) but the student who fails
this item is Eve, a high scorer.
In Table 2 the students are ranked in descending order according to their total ALICE
scores. Correct responses should be more frequent in the top half of the table (i.e., in the top half
of the class) than in the bottom half. This is not the case for items 1, 3, and 6. For example, only
two of the top five students pass item 6 but four of the bottom five students get it correct. The
pass rates calculated separately for the top and bottom halves of the class (ptop and pbottom)
confirm that the “wrong” students get items 1, 3, and 6 correct. Students who know nothing
about Lewis Carroll’s work but know their nursery rhymes could answer the Humpty Dumpty
question (item 1). For item 3, what halted Tweedledum and Tweedledee’s battle, perhaps
students selected the target, c. crow, not through knowledge but because they chose the
exception—the only non-mammal—or because they followed the adage, “when in doubt, choose
c”. In item 6, the White Queen’s inability, alternative c. subtract is a more accurate description
than the teacher’s target, a. think. The White Queen sometimes was unable to think but always
was unable to subtract. Item 6 is mis-keyed. Off with the teacher’s head!
The relationship between passing or failing an item and doing well or doing poorly on the
test as a whole is assessed by an item-test correlation coefficient (r). The possible range is from
-1 to +1. A mid-range value (-.30 < r < +.30) indicates that the relationship is weak or absent.
The higher the positive correlation, the stronger the tendency for students who do well on an
item to also do well on the test as a whole (appropriate differentiation). The more negative the
correlation, the stronger the evidence that passing the item is associated with a low score on the
test (inappropriate differentiation). In Table 2 negative correlation coefficients flag items 1, 3,
and 6 for attention as anticipated. More to the point, the correlation coefficient detects
weaknesses overlooked by the pass/fail rate statistics. Items 5 and 8 differentiate appropriately
but insufficiently. The correlation of item 8, Alice’s real life surname, with the test is practically
zero. It should be deleted. The teacher should consider rewording item 5, the definition of
uglification. Perhaps a more homogeneous set of alternatives (e.g., all school subjects) would
increase its item-test correlation.
Bear in mind that an unsatisfactory item-test correlation indicates a problem either in the
item or in the test (or both). This index is a measure of item quality only if the total test score is
meaningful (see General Remarks below).
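The coefficients in Table 2 can be reproduced with a corrected point-biserial correlation, in which each item is correlated with the total score computed from the remaining items. The sketch below, which reuses the scores matrix defined earlier, is one plausible implementation; a scoring system that correlates the item with the full total (item included) will report somewhat higher values.

from statistics import mean, pstdev

def item_test_r(scores, item):
    # Corrected point-biserial: correlate the item with the rest of the test.
    rest = [sum(row) - row[item] for row in scores]   # total excluding the item
    p = mean(row[item] for row in scores)             # item pass rate
    passers = [rest[i] for i, row in enumerate(scores) if row[item] == 1]
    return (mean(passers) - mean(rest)) / pstdev(rest) * (p / (1 - p)) ** 0.5

for item in range(8):
    r = item_test_r(scores, item)
    flag = "" if r >= 0.30 else "  <- fails Criterion 2"
    print(f"Item {item + 1}: r = {r:+.2f}{flag}")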
Criterion 3
An item with d decoys is acceptable if 20/d% to 80/d% of the students select each decoy.
An efficient item has effective decoys; that is, wrong answers that are sufficiently plausible as to
be selected by uninformed students. Effectiveness is maximal when the incorrect responses are
evenly distributed across the decoys (showing that none is redundant). By Criterion 1, an item
failure rate between 20% and 80% is acceptable. In ALICE, the number of decoys (d) per item is
4. Therefore, each wrong answer should be chosen by 20/4% to 80/4% of the students. With a
sample size of ten, frequencies below 1 or above 2 are outside the desired (5% to 20%) range.
The response frequencies are given in Table 3. Items 1, 2, 6, and 7 fail to meet the criterion. For
item 1, where Humpty Dumpty sat, the frequency of selection of all decoys is zero. But the
teacher should not direct test refinement efforts at the apparent violation before identifying the
root cause. The real problem here is this item’s high p value. An excessively easy target depletes
the response rate for the decoys. For item 2, no one selects the implausible decoy d. beheading as
the prize in the caucus-race. For item 6, the White Queen’s inability, the decoys b. knit and d.
Table 3
Frequency of Student Responses and Indices of Item Quality
______________________________________________________________________________
                                    Item number
          ____________________________________________________________________
Response      1       2       3       4       5       6       7       8
______________________________________________________________________________
a            9*       1       1       1       1       6*      0 #3    2
b            0 #3     7*      1       1       1       0 #3    7*      2
c            0 #3     1       6*      1       1       1       3 #3    2*
d            0 #3     0 #3    1       5*      1       0 #3    0 #3    2
e            0 #3     1       1       2       3*      1       0 #3    2
NR           1 #4     0       0       0       3 #4    0       0       0
R>1          0        0       0       0       0       2 #4    0       0
______________________________________________________________________________
Note. Ten students wrote the test. The letters a through e = option chosen; NR = no response
(omission); R>1 = more than one response (multiple response). For each item, the frequency of
selection of the correct response is marked with an asterisk. The psychometric shortcomings of
the demonstration test are indicated by the flags #3 and #4, which correspond to criteria
discussed in this paper.
sew are never selected. These decoys are highly related and therefore it might seem that either
both are correct or neither is. In item 7, the Cook’s tarts, only one of the four decoys is ever
chosen. Consequently, the pass rate is inflated by guessing. The teacher should generate
plausible new decoys and re-evaluate these items.
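A sketch of the Criterion 3 tally follows. It assumes the raw (unscored) responses of Table 2 and the answer key of Table 1; the names responses and key are illustrative rather than part of any scoring system.

responses = {  # item number -> the ten students' raw responses, Ann through Joy
    1: ["a", "a", "a", "a", "NR", "a", "a", "a", "a", "a"],
    2: ["b", "b", "b", "b", "b", "b", "e", "b", "c", "a"],
    3: ["c", "a", "e", "c", "d", "c", "c", "b", "c", "c"],
    4: ["d", "d", "d", "d", "d", "a", "b", "c", "e", "e"],
    5: ["e", "NR", "e", "b", "e", "c", "NR", "NR", "d", "a"],
    6: ["a", "a", "c", "e", "R>1", "a", "a", "a", "a", "R>1"],
    7: ["b", "b", "b", "b", "b", "c", "b", "b", "c", "c"],
    8: ["b", "c", "a", "d", "c", "d", "b", "a", "e", "e"],
}
key = {1: "a", 2: "b", 3: "c", 4: "d", 5: "e", 6: "a", 7: "b", 8: "c"}
n, d = 10, 4                     # class size and decoys per item
low, high = 0.20 / d, 0.80 / d   # each decoy should draw 5% to 20% of the class
for item, answers in responses.items():
    for decoy in "abcde":
        if decoy == key[item]:
            continue
        share = answers.count(decoy) / n
        if not (low <= share <= high):
            print(f"Item {item}, decoy {decoy}: chosen by {share:.0%}  <- fails Criterion 3")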
Criterion 4
An item is acceptable if fewer than 5% of the students omit answers or give multiple answers.
An important requirement is that the item presents a well-structured task. Obviously, the clarity
of the items should be assessed during test construction. But unforeseen problems are revealed
when a sizable proportion of the students (5% or more) fails to respond, or gives multiple
answers (or both). With a sample size of 10, frequencies of 1 or above are outside the acceptable
range. Items 1 and 5 show excessive omissions (see Table 3). Maybe the high scorer Eve omitted
item 1, where Humpty Dumpty sat, because she thought the answer, a. wall, so obvious that it
must be a trick question. For item 5, perhaps poorly prepared students simply gave up. For them
the concocted word, uglification, had no associated thoughts. The teacher could try new decoys
based on uglification’s resemblance to English words such as ugliness and nullification. (See
Criterion 2 for an alternative strategy.) The frequency of omissions for items 1 and 5 suggests
that some students feared they would be penalized for an incorrect guess. Item 6 has too many
multiple answers. The stem might be confusing because it contravenes recommended practice:
word the stem positively or at least emphasize negative words by capitalizing or underlining. The
alternatives might be confusing because they overlap in meaning (knit with sew, subtract with
calculate, and calculate with think). Puzzled, students left more than one alternative marked.
Perhaps these students assumed they would receive partial credit. The teacher should rewrite the
item with clear-cut alternatives.
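The Criterion 4 check is a short tally over the same responses dictionary sketched above; items 1, 5, and 6 are flagged, matching Table 3.

for item, answers in responses.items():
    problem = sum(a in ("NR", "R>1") for a in answers) / len(answers)
    if problem >= 0.05:   # 5% or more omissions or multiple answers
        print(f"Item {item}: {problem:.0%} omissions or multiple answers  <- fails Criterion 4")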
Test Acceptability
Criterion 5
(a) A test is acceptable if the internal consistency coefficient (K-R 20) is at least .64.
(b) A test is acceptable if the correlation between observed and true scores is at least .80.
(c) A test is acceptable if the true variance proportion is at least .64 of the total.
(d) A test is acceptable if the error variance proportion is no more than .36 of the total.
Any test of more than one item requires the assumption that item scores can be added to produce
a meaningful single test score. In an internally consistent test all the items work in unison to
produce a stable assessment of student performance. When the test is internally inconsistent,
performance varies markedly according to the particular items considered. Someone who excels
in one part of the test might do very badly, do well, or be average on another part. The test gives
a mixed message. The additivity requirement applies when all the items in the test are intended
to measure the same topic. If the test covers more than one topic, each topic is considered a test
in its own right (see General Remarks below).
In Table 2, Bob (the second highest ALICE scorer) gets 50% of the odd numbered items
correct and 100% of the even numbered items correct. Joy (the lowest scorer) also gets 50% of
the odd numbered items correct but 0% of the even numbered questions. These two students are
equally knowledgeable in terms of the odd items, but at opposite ends of the knowledge
spectrum in terms of the even items. Similar discrepancies are present for other students and
other item groups. For example, Don gets 100% in the first half of the test but only 25% in the
second half.
The Kuder-Richardson formula 20 (K-R 20) is a measure of internal consistency.
Theoretical values range from zero, each item measures something different from every other
item, to +1, all the items measure the same thing. An internal consistency coefficient of .64 or
higher is adequate. The K-R 20 for ALICE is .15 (see Table 4). Therefore, the eight items do not
form a coherent set. However, this does not preclude the possibility of one or more coherent
subsets of items.
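The K-R 20 coefficient can be computed directly from the scored responses. The sketch below reuses the scores matrix from the Criterion 1 sketch; population (N-denominator) variances are used, which reproduces the printout in Table 4.

from statistics import mean, pvariance

def kr20(scores):
    k = len(scores[0])                                        # number of items
    totals = [sum(row) for row in scores]                     # students' test scores
    ps = [mean(row[i] for row in scores) for i in range(k)]   # item pass rates
    sum_pq = sum(p * (1 - p) for p in ps)                     # sum of item variances
    return (k / (k - 1)) * (1 - sum_pq / pvariance(totals))

print(f"K-R 20 = {kr20(scores):.2f}")                            # .15 for ALICE
print(f"Observed-true correlation = {kr20(scores) ** 0.5:.2f}")  # .39 for ALICE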
The test score is an estimate of the student’s true knowledge. The student’s true score is
unknowable but, in a statistical sense, it is the mean of the scores that would be obtained if the
individual were tested repeatedly. The positive square root of the test's internal consistency
coefficient (K-R 20) estimates the correlation between students' observed and true scores. A test
is acceptable if this correlation estimate is at least √.64 = .80. For ALICE, the estimated
correlation between observed and true scores is √.15 = .39. The observed scores are
insufficiently related to the students' true scores.
When an achievement test is administered to a group of students, the spread of scores (the
total test variance) is determined in part by genuine differences in the students' knowledge (true
variance) and in part by inadequacies of measurement (error variance). The K-R 20 coefficient
represents the proportion attributable to true variance. Therefore, the error proportion is
1 - (K-R 20). A test is acceptable if the proportion of true variance is at least .64 and,
equivalently, if the proportion of error variance is no more than .36 of the total variance. ALICE
fails to reach the required standard. Its true variance proportion is .15. The error proportion is
.85.
To convert to the measurement units of a particular test, the true and error proportions are
multiplied by the total test variance. ALICE has a total variance of 1.85 (see Table 4). Therefore
in raw score units, the true variance is .15 (1.85) = .28 and the error variance is
(1 - .15) (1.85) = 1.57.
The square root of the error variance—the standard error of measurement (SEM)—
measures the margin of error in assessing an individual student's true score. The probability is
approximately 95% that the student's true score lies within 2 SEM of the obtained score. The
SEM for ALICE is √1.57 = 1.26. For Fay, whose ALICE score is 4, the margin of error is
4 ± 2(1.26). That is, her true ALICE score might be anywhere between 1.48 and 6.52—in effect
anywhere between 1 and 7. Given that there are only eight items, the imprecision is obvious.
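Continuing the sketch, the true variance, error variance, SEM, and a student's 95% band follow from the same quantities (the band printed below differs from the 1.48 to 6.52 quoted above only because the text rounds the SEM first).

from statistics import pvariance

totals = [sum(row) for row in scores]
total_var = pvariance(totals)            # 1.85 for ALICE
rel = kr20(scores)                       # .15
true_var = rel * total_var               # about .28
error_var = (1 - rel) * total_var        # about 1.57
sem = error_var ** 0.5                   # about 1.26
fay = 4                                  # Fay's observed ALICE score
print(f"SEM = {sem:.2f}")
print(f"Fay's 95% band: {fay - 2 * sem:.2f} to {fay + 2 * sem:.2f}")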
General Remarks
Test scoring programs generally provide the foregoing information in a composite
printout such as shown for ALICE in Table 4. Only item 4, croquet balls, passes muster. The
other items must be modified or dropped. An indication of the power of test refinement is that
the mere omission of the worst item in terms of item-test correlation (item 3, Tweedledum and
Tweedledee) raises the ALICE K-R 20 coefficient from .15 to .52. But there are additional
problems.
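The effect of dropping item 3 can be confirmed with the kr20 helper sketched earlier:

without_item3 = [[x for i, x in enumerate(row) if i != 2] for row in scores]
print(f"K-R 20 without item 3 = {kr20(without_item3):.2f}")   # rises from .15 to .52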
First, the teacher’s directions do not fully disclose the test requirements (see Table 1). As
a result, individual differences in students’ willingness to omit, guess, or give multiple answers
contribute to error variance. Better instructions, such as follow, might prevent items 1, 5, and 6
from running afoul of Criterion 4 (omissions and multiple answers):
Answer all the questions. For each item, circle the single best alternative. There is
no penalty for wrong answers. If you guess correctly you will receive one mark.
Table 4
Sample Computer Printout and Indices of Item and Test Quality
Psychometric Analysis of the Alice Demonstration Test
_____________________________________________________________________________
                                    Item number
          ____________________________________________________________________
Index         1       2       3       4       5       6       7       8
______________________________________________________________________________
Response
a           .90* #1  .10     .10     .10     .10     .60*    .00 #3  .20
b           .00 #3   .70*    .10     .10     .10     .00 #3  .70*    .20
c           .00 #3   .10     .60*    .10     .10     .10     .30 #3  .20*
d           .00 #3   .00 #3  .10     .50*    .10     .00 #3  .00 #3  .20
e           .00 #3   .10     .10     .20     .30*    .10     .00 #3  .20
NR          .10 #4   .00     .00     .00     .30 #4  .00     .00     .00
R>1         .00      .00     .00     .00     .00     .20 #4  .00     .00
pq          .09 #1   .21     .24     .25     .21     .24     .21     .16
r          -.33 #2   .49    -.57 #2  .60     .26 #2 -.21 #2  .49     .08 #2
______________________________________________________________________________
Test statistics                             Test indices
Mean                          =  4.50       Kuder-Richardson formula 20      = .15 #5
Variance                      =  1.85       True variance   = 0.28 = 15% #5
Standard error of measurement =  1.26       Error variance  = 1.57 = 85% #5
Number of students            =    10       Observed, true score correlation = .39 #5
______________________________________________________________________________
Note. Entries in the top part of the table are the proportional response frequencies for the test
items; the letters a through e = alternative selected; NR = no response (omission); R>1 = more
than one response (multiple response). For each item, the proportion of students selecting the
correct alternative is marked with an asterisk. pq = item variance; r = item-test correlation. The
psychometric shortcomings of the test are indicated by the flags #1 through #5, which correspond
to criteria discussed in this paper.
If you guess incorrectly you will neither gain nor lose marks. There is no credit
for multiple answers even if one of them is correct.
Second, the teacher failed to define the test domain clearly. The instructions state that the
test assesses students’ knowledge of Lewis Carroll’s literary work but do not specify which
works. Items 2, 4, 5, and 7 are from Alice's Adventures in Wonderland. Items 1, 3, and 6 are from
Through the Looking-Glass and What Alice Found There. Item 8, the real Alice’s surname, is
biographical and does not relate to Carroll’s literary work. The assumption of a single domain is
suspect. Perhaps ALICE is three tests. This partitioning is supported by the item-test correlation
pattern observed in Table 4. The four Wonderland items have positive item-test correlations. The
three Looking-Glass items all correlate negatively with ALICE. The biographical item has an
essentially zero correlation with ALICE. Therefore, the teacher should analyze the Wonderland
and Looking-Glass items as independent tests. (This new information means that the item-test
correlations and the K-R 20 for the original ALICE are inappropriate.)
Even without modification, the four Wonderland items make an internally consistent test
(K-R 20 = .84). The three Looking-Glass items show promise (K-R 20 = .54). Students who do
well on Wonderland items tend to do poorly on Looking-Glass items (the correlation between
tests is -.57). It seems that the requirement to read both books was inadequately communicated
or misunderstood. Curiouser and curiouser!
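The subtest analysis can be run with the same kr20 helper; the between-test correlation below uses statistics.correlation, which requires Python 3.10 or later.

from statistics import correlation

wonderland = [[row[i] for i in (1, 3, 4, 6)] for row in scores]   # items 2, 4, 5, 7
looking_glass = [[row[i] for i in (0, 2, 5)] for row in scores]   # items 1, 3, 6
print(f"Wonderland K-R 20 = {kr20(wonderland):.2f}")              # about .84
print(f"Looking-Glass K-R 20 = {kr20(looking_glass):.2f}")        # about .54
w_totals = [sum(row) for row in wonderland]
lg_totals = [sum(row) for row in looking_glass]
print(f"Between-test correlation = {correlation(w_totals, lg_totals):.2f}")   # about -.57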
If the overall quality of the test (or tests) is still unacceptable after modification of the
existing items, the next step is to write new items. Additional items generally increase a test’s
internal consistency. Estimate the required test length by multiplying the current test length by
the quotient (D - CD)/(C - CD) where C is the consistency coefficient obtained for the Current
test and D is the consistency coefficient Desired for the new test. Therefore, to upgrade the
Looking-Glass test from a current K-R 20 of .54 to a desired K-R 20 of .64, the new test must be
about (.64 - .54 x .64)/(.54 - .54 x .64) = 1.5 times as long as the existing 3-item test. Five items
should suffice.
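As a sketch, the lengthening estimate is a two-line function of the current and desired consistency coefficients:

def lengthening_factor(current, desired):
    # (D - CD) / (C - CD): how many times longer the test must be.
    return (desired - current * desired) / (current - current * desired)

factor = lengthening_factor(0.54, 0.64)
print(f"Lengthen by a factor of {factor:.1f}: about {factor * 3:.1f} items")   # 1.5 x 3 = 4.5, so 5 items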
A final consideration in refining a test is to ensure that the item set as a whole has
desirable characteristics. Do the items cover the entire domain? Are there omitted topics or
redundancies? Is there an appropriate mix of factual and application items? Is the difficulty level
appropriate? Is the test structurally sound? ALICE violates several guidelines. For example, the
stems are in sentence completion form instead of question form and the alternatives are placed
horizontally instead of vertically (see Haladyna, 1999).
Refutations and Conclusions
A critic might argue that the reuse-after-refinement approach compromises test security.
“I return the tests to students for study purposes so I need a new test every time.” But assuming
that the teacher wants to encourage students to review conceptually rather than memorize
specific information, there is no requirement that the distributed review items should be the same
as those used in class tests. Besides, most of the items would be modified or replaced before the
test is reused.
Teachers who refuse to modify or reuse items can create their own items. But this is
inefficient because it fails to capitalize on previous work. Moreover, it places a heavy demand on
the teacher’s creativity and item-writing skills—abilities that are never examined.
Alternatively, the teacher can compose successive tests by selecting new items from
published test banks. But these teachers use items that have been and will be used by other
teachers. If their students interact, item security is compromised. Furthermore, a test bank is an
exhaustible supply source of mediocre quality. If, as is likely, the teacher selects the “best bet”
items first, then subsequent tests will be of deteriorating quality. In sharp contrast, the reuse-after-refinement approach generates tests of improving quality. Therefore, the major threat is not
reuse after refinement but reuse without refinement.
Teachers who argue that, “the psychometric indices are important for commercial tests
but not for classroom tests” miss the point. There is a difference between relaxing the rigor and
abandoning the process. The level of processing might vary but all tests share the need for
refinement. Before testing, the teacher should write or select items according to item-writing
guidelines. After testing, the teacher should evaluate item performance according to the
psychometric indices. The requirement is to establish a classroom test of quality. The level of
quality must meet the needs of students and teacher, not those of commercial test publishers.
Extra vetting takes extra time but the investment is recouped by a fairer and more precise
assessment of student performance—surely the essential purpose of any test procedure.
To conclude, good multiple-choice tests are not likely to occur if teachers select questions
indiscriminately from published test banks or rely on their own first drafts of original items. An
iterative approach to test construction improves test quality.
References
Carroll, L. (1965a). Alice's adventures in wonderland (A Centennial Edition). New York:
Random House.
Carroll, L. (1965b). Through the looking-glass and what Alice found there (A Centennial
Edition). New York: Random House.
Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. New York:
Holt, Rinehart, & Winston.
Cronbach, L. J. (1990). Essentials of psychological testing (5th ed.). New York: HarperCollins.
Ellsworth, R. A., Dunnell, P., & Duell, O. K. (1990). Multiple-choice test items: What are
textbook authors telling teachers? Journal of Educational Research, 83, 289-293.
Haladyna, T. M. (1999). Developing and validating multiple-choice test items (2nd ed.).
Mahwah, NJ: Erlbaum.
Hansen, J. D., & Dexter, L. (1997). Quality multiple-choice test questions: Item-writing
guidelines and an analysis of auditing testbanks. Journal of Education for Business, 73,
94-97.
Magnusson, D. (1967). Test theory (H. Mabon, Trans.). Reading, MA: Addison-Wesley. (Original
work published 1966)
Nunnally, J. C. (1978). Psychometric theory (2nd ed.). New York: McGraw-Hill.
Osterlind, S. J. (1998). Constructing test items: Multiple-choice, constructed-response,
performance, and other formats (2nd ed.). Boston: Kluwer.