Chang, S.-H., Lin, P.-C., & Lin, Z. C. (2007). Measures of Partial Knowledge and Unexpected Responses in Multiple-Choice Tests. Educational Technology & Society, 10 (4), 95–109.
Measures of Partial Knowledge and Unexpected Responses in Multiple-Choice
Tests
Shao-Hua Chang
Department of Applied English, Southern Taiwan University, Tainan, Taiwan // shaohua@mail.stut.edu.tw
Pei-Chun Lin
Department of Transportation and Communication Management Science, National Cheng Kung University, Taiwan
peichunl@mail.ncku.edu.tw
Zih-Chuan Lin
Department of Information Management, National Kaohsiung First University of Science & Technology, Taiwan
u9324819@ccms.nkfust.edu.tw
ABSTRACT
This study investigates differences in the partial scoring performance of examinees in elimination testing and
conventional dichotomous scoring of multiple-choice tests implemented on a computer-based system.
Elimination testing that uses the same set of multiple-choice items rewards examinees with partial knowledge
over those who are simply guessing. This study provides a computer-based test and item analysis system to
reduce the difficulty of grading and item analysis following elimination tests. The Rasch model, based on item
response theory for dichotomous scoring, and the partial credit model, based on graded item response for
elimination testing, are the kernel of the test-diagnosis subsystem to estimate examinee ability and item difficulty parameters. This study draws the following conclusions: (1) examinees taking computer-based tests
(CBTs) have the same performance as those taking paper-and-pencil tests (PPTs); (2) conventional scoring does
not measure the same knowledge as partial scoring; (3) the partial scoring of multiple choice lowers the number
of unexpected responses from examinees; and (4) the different question topics and types do not influence the
performance of examinees in either PPTs or CBTs.
Keywords
Computer-based tests, Elimination testing, Unexpected responses, Partial knowledge, Item response theory
Introduction
The main missions of educators are determining learning progress and diagnosing difficulty experienced by students
when studying. Testing is a conventional means of evaluating students, and testing scores can be adopted to observe
learning outcomes. Multiple-choice (MC) items continue to dominate educational testing owing to their ability to
effectively and simply measure constructs such as ability and achievement. Measurement experts and testing
organizations prefer the MC format to others (e.g., short-answer, essay, constructed-response) for the following
reasons:
• Content sampling is generally superior to other formats, and the application of MC formats normally leads to highly content-valid test-score interpretations.
• Test scores can be extremely reliable with a sufficient number of high-quality MC items.
• MC items can be easily pre-tested, stored, used, and reused, particularly with the advent of low-cost, computerized item-banking systems.
• Objective, high-speed test scoring is achievable.
• Diagnostic subscores are easily obtainable.
• Test theories (i.e., item response, generalizability, and classical) easily accommodate binary responses.
• Most content can be tested using this format, including many types of higher-level thinking (Haladyna & Downing, 1989).
However, the conventional MC examination scheme requires examinees to evaluate each option and select one
answer. Examinees are often absolutely certain that some of the options are incorrect, but still unable to identify the
correct response (Bradbard, Parker, & Stone, 2004). From the viewpoint of learning, knowledge is accumulated
continuously rather than on an all-or-nothing basis. The conventional scoring format of the MC examination cannot
distinguish between partial knowledge (Coombs, Milholland, & Womer, 1956) and the absence of knowledge. In
conventional MC tests, students choose only one response. The number of correctly answered questions is counted,
and the scoring method is called number scoring (NS). Akeroyd (1982) stated that NS makes the simplifying
assumption that all of the wrong answers of students are the results of random guesses, thus neglecting the existence
of partial knowledge. Coombs et al. (1956) first proposed an alternative method for administering MC tests. In their
procedure, students are instructed to mark as many incorrect options as they can identify. This procedure is referred
to as elimination testing (ET). Bush (2001) presented a multiple-choice test format that permits an examinee who is
uncertain of the correct answer to a question to select more than one answer. Incorrect selections are penalized by
negative marking. The aim of both the Bush and Coombs schemes is to reward examinees with partial knowledge
over those who are simply guessing.
Education researchers have been continuously concerned not only about how to evaluate students’ partial knowledge
accurately but also about how to reduce the number of unexpected responses. The number of correctly answered
questions is composed of two numbers: the number of questions to which the students actually know the answer, and
the number of questions to which the students correctly guess the answer (Bradbard et al., 2004). A higher frequency
of the second case indicates a less reliable learning performance evaluation. Chan & Kennedy (2002) compared
student scores on MC and equivalent constructed-response questions, and found that students do indeed score better
on constructed-response questions for particular MC questions. Although constructed-response testing produces
fewer unexpected responses than the conventional dichotomous scoring method, the change of item constructs raises
the complexity of both creating the test and of the post-test item grading and analysis, whereas ET uses the same set
of MC items and makes guessing a futile effort.
Bradbard et al. (2004) suggested that the greatest obstacle in implementing ET is the complexity of grading and the
analysis of test items following traditional paper assessment. Accordingly, examiners are not very willing to adopt
ET. To overcome this problem, this study provides an integrated computer-based test and item-analysis system to
reduce the difficulty of grading and item analysis following testing. Computer-based tests (CBTs) offer several
advantages over traditional paper-and-pencil tests (PPTs). The benefits of CBTs include reduced costs of data entry,
improved rate of disclosure, ease of data conversion into databases, and reduced likelihood of missing data (Hagler,
Norman, Radick, Calfas, & Sallis, 2005). Once set up, CBTs are easier to administer than PPTs. CBTs offer the
possibility of instant grading and automatic tracking and averaging of grades. In addition, they are easier to
manipulate to reduce cheating (Inouye & Bunderson, 1986; Bodmann & Robinson, 2004).
Most CBTs measure test item difficulty based on the percentage of correct responses. A higher percentage of correct
responses implies an easier test item. This approach to test item analysis disregards the relationship between the examinee’s ability and item difficulty. For instance, if the percentage of correct responses for test item A is quite small, then the test item analysis system categorizes it as “difficult.” However, the statistics may also reveal that more failing
examinees than passing examinees answer item A correctly. Therefore, the design of test item A may be
inappropriate, misleading, or unclear, and should be further studied to aid future curriculum designers to compose
high-quality items. To avoid the fallacy of percentage of correct responses, this study constructs a CBT system that
applies the Rasch model based on item response theory for dichotomous scoring and the partial credit model based
on graded item response for ET to estimate the examinee ability and item difficulty parameters (Baker, 1992;
Hambleton & Swaminathan, 1985; Zhu & Cole, 1996; Wright & Stone, 1979; Zhu, 1996; Wright & Masters, 1982).
Before computer-based ET is broadly adopted, we still need to examine whether any discrepancy exists between the performance of examinees who take elimination tests on paper and the performance of those who take CBTs. This study compares the scores of examinees taking tests under the NS dichotomous scoring method and under the partial scoring of ET, using the same set of MC items in CBT and PPT settings, where the content
subject is operations management. This study has the following specific goals:
1. Evaluate whether the partial scoring for the MC test produces fewer unexpected responses of examinees.
2. Compare the examinee performance on conventional PPTs with their performance on CBTs.
3. Analyze whether different question content, such as calculation and concept, influences the performance of
examinees on PPTs and CBTs.
4. Investigate the relationship between an examinee’s ability and the item difficulty, to help the curriculum
designers compose high-quality items.
The rest of this paper will first present a brief literature review on partial knowledge, testing methods, scoring
methods, multiple choice, and CBTs. Then this paper will describe the configuration of a computer-based assessment
system, formulate the research hypotheses, and provide the research method, experimental design, and data
collection in detail. The statistical analysis and hypothesis testing results will be presented subsequently. Conclusions
are finally drawn in the last section.
Related literature
This study first defines related domain knowledge, discusses studies on conventional scoring and partial scoring, compares scoring modes, investigates the principles of designing MC items, and finally summarizes the
pros and cons of CBT systems.
Partial knowledge
Reducing the opportunities to guess and measuring partial knowledge improve the psychometric properties of a test.
These methods can be classified by their ability to identify partial knowledge on a given test item (Alexander,
Bartlett, Truell, & Ouwenga, 2001). Coombs et al. (1956) stated that the conventional scoring format of the MC
examination cannot distinguish between partial knowledge and absence of knowledge. Ben-Simon, Budescu, &
Nevo (1997) classify examinees’ knowledge for a given item as full knowledge (identifies all of the incorrect
options), partial knowledge (identifies some of the incorrect options), partial misinformation (identifies the correct
answer and some incorrect options), full misinformation (identifies only the correct answer), and absence of
knowledge (either omits the item or identifies all options). Bush (2001) conducted a study that allows examinees to
select more than one answer to a question if they are uncertain of the correct one. Negative marking is used to
penalize incorrect selections. The aim is to explicitly reward examinees who possess partial knowledge as compared
with those who are simply guessing.
Number-scoring (NS) of multiple choice
Students choose only one response. The number of correctly answered questions is composed of the number of
questions to which the student knows the answer and the number of questions to which the student correctly guesses
the answer. According to the classification of Ben-Simon et al. (1997), NS can only distinguish between full
knowledge and absence of knowledge. A student’s score on an NS section with 25 MC questions and three points per
correct response is in the range 0–75.
Elimination testing (ET) of multiple choice
Alternative schemes proposed for administering MC tests increase the complexity of responding and scoring, and the
available information about student understanding of material (Coombs et al., 1956; Abu-Sayf, 1979; Alexander et
al., 2001). Since partial knowledge is not captured in conventional NS format of an MC examination, Coombs et al.
(1956) describe a procedure that instructs students to mark as many incorrect options as they can identify. One point
is awarded for each incorrect choice identified, but k points are deducted (where k equals the number of options
minus one) if the correct option is identified as incorrect. Consequently, a question score is in the range (–3, +3) on a
question with four options, and a student’s score on an ET section with 25 MC questions with four options each is in
the range (–75, +75). Bradbard and Green (1986) have classified ET scoring as follows: completely correct score
(+3), partially correct score (+2 or +1), no-understanding score (0), partially incorrect score (–1 or –2), completely
incorrect score (–3).
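As a concrete illustration, the following minimal sketch (an assumed implementation, not code from the study) scores a single four-option ET item under the Coombs rule described above: one point for each distractor eliminated, and a deduction of k points (k = number of options minus one) if the correct option is eliminated.

```python
# Minimal sketch of Coombs-style elimination scoring for one MC item.
def et_item_score(eliminated, correct_option, n_options=4):
    """Score one ET item from the set of options the examinee eliminated."""
    k = n_options - 1
    score = 0
    for option in eliminated:
        if option == correct_option:
            score -= k          # eliminating the keyed answer costs k points
        else:
            score += 1          # each distractor identified earns one point
    return score

# Example: a four-option item keyed "B".
print(et_item_score({"A", "C", "D"}, "B"))  # +3, completely correct score
print(et_item_score({"A", "C"}, "B"))       # +2, partially correct score
print(et_item_score(set(), "B"))            # 0, no-understanding score
print(et_item_score({"A", "B"}, "B"))       # -2, partially incorrect score
```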
Subset selection testing (SST) of multiple choice
Rather than identifying incorrect options, the examinee attempts to construct subsets of item options that include the
correct answer (Jaradat & Sawaged, 1986). The scoring for an item with four options is as follows: if the correct
97
response is identified, then the score is 3; while if the subset of options identified includes the correct response and
other options, then the item score is 3 – n (n = 1, 2, or 3), where n denotes the number of other options included. If
subsets of options that do not include the correct option are identified, then the score is –n, where n is the number of
options included. SST and ET are probabilistically equivalent.
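The rule above can be sketched in the same way; the function below is an assumed illustration of the Jaradat and Sawaged scheme for a four-option item, not code taken from their study.

```python
# Minimal sketch of subset selection scoring for one four-option MC item.
def sst_item_score(selected, correct_option):
    """Score one SST item from the subset of options the examinee selected."""
    if correct_option in selected:
        return 3 - (len(selected) - 1)   # 3 minus the n other options included
    return -len(selected)                # the n-option subset misses the answer

print(sst_item_score({"B"}, "B"))            # 3: only the correct response
print(sst_item_score({"A", "B"}, "B"))       # 2: answer plus one other option
print(sst_item_score({"A", "C", "D"}, "B"))  # -3: subset excludes the answer
```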
Comparison studies of scoring methods
Coombs et al. (1956) observed that tests using ET are somewhat more reliable than NS tests and measure the same
abilities as NS scoring. Dressel & Schmid (1953) compared SST with NS using college students in a physical science
course. They observed that the reliability of the SST test, at 0.67, was slightly lower than that of the NS test at 0.70.
They also noted that academically high-performing students scored better than average compared to low-performing
students, with respect to full knowledge, regardless of the difficulty of the items. Jaradat and Tollefson (1988)
compared ET with SST, using graduate students enrolled in an educational measurement course. No significant
differences in terms of reliability were observed between the methods. Jaradat and Tollefson reported that the
majority of students felt that ET and SST were better measures of their knowledge than conventional NS, but they
still preferred NS. Bradbard et al. (2004) concluded that ET scoring is useful whenever there is concern about
improving the accuracy of measuring a student’s partial knowledge. ET scoring may be particularly helpful in
content areas where partial or full misinformation can have life-threatening consequences. This study adopted ET
scoring as the measurement scheme for partial scoring.
Design of multiple choice
An MC item is composed of a correct answer and several distractors. The design of distractors is the largest
challenge in constructing an MC item (Haladyna & Downing, 1989). Haladyna & Downing summarized the common
rules of design found in many references. One such rule is that all the option choices should adopt parallel grammar
to avoid giving clues to the correct answer. The option choices should address the same content, and the distractors
should all be reasonable choices for a student with limited or incorrect information. Items should be as clear and
concise as possible, both to ensure that students know what is being asked, and to minimize reading time and the
influence of reading skills on performance. Haladyna and Downing recommended some guidelines for developing
distractors:
• Employ plausible distractors; avoid illogical distractors.
• Incorporate common student errors into distractors.
• Adopt familiar yet incorrect phrases as distractors.
• Use true statements that do not correctly answer the items.
Kehoe (1995) recommended improving tests by maintaining and developing a pool of “good” items from which
future tests are drawn in part or in whole. This approach is particularly valuable for instructors who teach the same course
more than once. The proportion of students answering an item correctly also affects its discrimination power. Items
answered correctly (or incorrectly) by a large proportion of examinees (more than 85%) have a markedly low power
to discriminate. In a good test, most items are answered correctly by 30% to 80% of the examinees. Kehoe described
the following three methods to enhance the ability of items to discriminate among abilities:
• Items that correlate less than 0.15 with total test score should probably be restructured.
• Distractors that are not chosen by any examinees should be replaced or eliminated.
• Items that virtually all examinees answer correctly are unhelpful for discriminating among students and should be replaced by harder items.
Pros and cons of CBTs
CBTs have several benefits:
• The single-item presentation is not restricted to text, is easy to read, and allows combining with pictures, voice, image, and animation.
• CBTs shorten testing time, give instantaneous results, and increase test security. The online testing format of an exam can be configured to enable instantaneous feedback to the student, and can be more easily scheduled and administered than PPTs (Gretes & Green, 2000; Bugbee, 1996).
• The test requires no paper, eliminating the battle for the copy machine. The option of printing the test is always available.
• The student can take the test when and where appropriate. Even taking the test at home is an option if the student has Internet access at home. Students can wait until they think they have mastered the material before being tested on it.
• The test is a valuable teaching tool. CBTs provide immediate feedback, requiring the student to get the correct answer before moving on.
• CBTs save 100% of the time taken to distribute the test to test-takers (a CBT never has to be handed out).
• CBTs save 100% of the time taken to create different versions of the same test (re-sequencing questions to prevent an examinee from cheating by looking at the test of the person next to him).
However, CBTs have some disadvantages:
• The test format is normally limited to true/false and MC. Computer-based automatic grading cannot easily judge the accuracy of constructed-response questions such as short-answer, problem-solving exercises, and essay questions.
• When holding an onsite test, instructors must prepare many computers for examinees, and be prepared for the difficulties caused by computer crashes.
• The computer display is not suitable for question items composed of numerous words, since the resolution might make the text difficult to read (Mazzeo & Harvey, 1988).
• Most items in mathematics and chemistry testing need manual calculation. The need to write down and calculate answers on draft paper might decrease the answering speed (Ager, 1993).
Research method
This study first constructs a computer-based assessment system, and then adopts experimental design to record an
examinee’s performance with different testing tools and scoring approaches. This section describes research
hypotheses, variables, the experimental design, and data collection.
System configuration
As well as providing a platform for computer-based ET, this system implements the Rasch one-parameter logistics
item characteristics curve (ICC) model for dichotomous scoring and the grade-response model for partial scoring of
ET to estimate the item and ability parameters (Hambleton & Swaminathan, 1985; Wright & Masters, 1982; Wright
& Stone, 1979; Zhu, 1996; Zhu & Cole, 1996). Furthermore, item difficulty analysis allows the system to maintain a
high-quality test bank. Figure 1 shows the system configuration, which comprises databases and three main subsystems: (1)
The computer-based assessment system is the platform used by examinees to take tests, and enables learning
progress to be tracked, grades to be queried, and examinees’ response patterns to be recorded in detail; (2) the
testbank management system, the platform used by instructors to manage testing items and examinee accounts; (3)
the test diagnosis system, which collects data from the answer record and the gradebook database to analyze the
difficulty of test items and the ability of examinees. These subsystems are described in detail as follows (Figure 1):
Computer-based assessment system
The computer-based assessment system links to the answer record and gradebook database, collects the complete answer record, and incorporates a feedback mechanism that enables examinees to check their own learning progress and increase their learning efficiency by means of constructive interaction. The main functions of the computer-based assessment system are as follows: the system first verifies the examinee’s eligibility for the test and then displays the test information, including allowed answering time, scoring methods, and test items. The system permits examinees to write down keywords or make marks on the test sheet, as in a paper test. The complete answering processes
are stored in the answer record and the gradebook database. The answer record and gradebook database enable the
computer-based assessment system to collect information for examinees during tests, improving examinees’
understanding of their own learning status. Examinees can understand the core of a problem by reading the remarks
they themselves made during a test to identify any errors and improve in areas of poor understanding.
[Figure 1 shows the system configuration: the instructor interface connects to the testbank management system (test item management, exam management, account management) and the examinee interface connects to the computer-based assessment system (online testing, progress tracking, grade querying); both draw on the testbank database and the answer record and gradebook database, which feed the test diagnosis system (scoring, sorting examinees’ responses, estimating ability and item parameters, item fit analysis, person fit analysis, testing unexpected responses, defining difficulty, and setting up commonly used tests).]
Figure 1. System configuration
Testbank management system
The testbank management system is normally accessed by instructors who are designing test items, revising test
items, designing examinations, and reusing tests. The main advantage of item banking lies in test development. This system allows a curriculum designer to edit multiple-choice questions, constructed-response items, and true/false
items and specify scoring modes. Designed test items are stored in the testbank database. Test items can be displayed
in text format or integrated with multimedia images. This system supports three parts of the preparation of
examinations: (1) creating original questions, (2) browsing questions from a testbank, and (3) using questions from a
testbank by selecting question items at random. The system provides dichotomous scoring and partial scoring for
MC items. After the item sheet has been prepared, the curriculum designer must specify “examinee eligibility,”
“testing date,” “scoring mode,” “time allowed,” and “question values.” The computer-based assessment system
permits only eligible examinees to take the test at the specified time. Although the system can automatically and
instantly grade multiple-choice and true/false questions, the announcement of testing scores is delayed until all
examinees have finished the test.
Test diagnosis system
The test diagnosis system analyzes items and examinees’ ability based on scoring methods (dichotomous scoring and
partial scoring) and the data retrieved from the answer record and gradebook database. The diagnostic process is
summarized as follows: (1) Sorting the results from the test records. The system first sorts the item-response patterns
for all examinees. The answer records are rearranged according to each examinee’s ability and the number of
students who have given the correct answer to a given question. (2) Matrix reduction. The system then deletes
useless item responses and examinee responses, such as those questions to which all examinees gave correct answers
or wrong answers, or received a full or a zero score, since these data do not help to assess the ability of examinees
(Baker, 1992; Bond & Fox, 2001). Matrix reduction can reduce the resources and time required to conduct the
calculation. (3) The JMLE procedures (Hambleton & Swaminathan, 1985; Wongwiwatthananukit, Popovich, and
Bennett, 2000) depicted in Figure 2 are applied to estimate the ability and item parameters. Then the diagnosis
procedures are followed by the item and person fit analysis, and the procedures are described as follows: First, let
random variable $X_{ni}$ denote examinee $n$'s response on item $i$, in which $X_{ni} = 1$ signifies the correct answer; $\theta_n$ is the personal parameter of examinee $n$; and $b_i$ denotes the item parameter, which determines the item location and is called the item difficulty in attainment tests. The expected value and variance of $X_{ni}$ are shown in equations (1) and (2).

(1) $E(X_{ni}) = \exp(\theta_n - b_i)/[1 + \exp(\theta_n - b_i)] = \pi_{ni}$

(2) $\operatorname{Var}(X_{ni}) = \pi_{ni}(1 - \pi_{ni}) = W_{ni}$

The standardized residual and kurtosis of $X_{ni}$ are shown in equations (3) and (4).

(3) $Z_{ni} = [X_{ni} - E(X_{ni})]/[\operatorname{Var}(X_{ni})]^{1/2}$

(4) $C_{ni} = (1 - \pi_{ni})^4 \pi_{ni} + (0 - \pi_{ni})^4 (1 - \pi_{ni})$

The person fit analysis shows the mean square (MNSQ) and standardized weighted mean square (Zstd), as represented in equations (5) and (6).

(5) $\mathrm{MNSQ} = \sum_i W_{ni} Z_{ni}^2 \Big/ \sum_i W_{ni} = \nu_n$

(6) $\mathrm{Zstd} = t_n = (\nu_n^{1/3} - 1)(3/q_n) + (q_n/3)$, where $q_n^2 = \sum_i (C_{ni} - W_{ni}^2) \Big/ \Big(\sum_i W_{ni}\Big)^2$

The item fit analysis shows the mean square (MNSQ) and standardized weighted mean square (Zstd), as represented in equations (7) and (8).

(7) $\mathrm{MNSQ} = \sum_n W_{ni} Z_{ni}^2 \Big/ \sum_n W_{ni} = \nu_i$

(8) $\mathrm{Zstd} = t_i = (\nu_i^{1/3} - 1)(3/q_i) + (q_i/3)$, where $q_i^2 = \sum_n (C_{ni} - W_{ni}^2) \Big/ \Big(\sum_n W_{ni}\Big)^2$
Figure 2. JMLE procedures for estimating both ability and item parameters
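Because the exact steps of Figure 2 are not reproduced here, the sketch below illustrates one common form of the JMLE procedure for the dichotomous Rasch model: alternating Newton-Raphson updates of the ability and difficulty parameters after matrix reduction. The convergence tolerance, centring convention, and simulated data are assumptions for illustration only.

```python
import numpy as np

def rasch_jmle(X, n_iter=100, tol=1e-4):
    """JMLE for the Rasch model; X is a 0/1 matrix (persons x items) with no
    all-correct or all-wrong rows or columns (see the matrix-reduction step)."""
    theta = np.log(X.mean(axis=1) / (1 - X.mean(axis=1)))   # initial abilities
    b = -np.log(X.mean(axis=0) / (1 - X.mean(axis=0)))      # initial difficulties
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-(theta[:, None] - b[None, :])))     # pi_ni, eq. (1)
        w = p * (1 - p)                                           # W_ni, eq. (2)
        theta_new = theta + (X - p).sum(axis=1) / w.sum(axis=1)   # person step
        b_new = b - (X - p).sum(axis=0) / w.sum(axis=0)           # item step
        b_new -= b_new.mean()                                     # anchor the scale
        converged = max(np.abs(theta_new - theta).max(),
                        np.abs(b_new - b).max()) < tol
        theta, b = theta_new, b_new
        if converged:
            break
    return theta, b

# Illustrative run on simulated responses (100 examinees, 10 items).
rng = np.random.default_rng(0)
true_theta, true_b = rng.normal(size=100), np.linspace(-1.5, 1.5, 10)
prob = 1 / (1 + np.exp(-(true_theta[:, None] - true_b[None, :])))
responses = (rng.random((100, 10)) < prob).astype(int)
# Matrix reduction: drop perfect or zero rows and columns before estimation.
rows = (responses.sum(axis=1) > 0) & (responses.sum(axis=1) < responses.shape[1])
cols = (responses.sum(axis=0) > 0) & (responses.sum(axis=0) < responses.shape[0])
abilities, difficulties = rasch_jmle(responses[rows][:, cols])
```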
Table 1 lists the principles to distinguish examinees and items from unexpected responses, summarized by Bond and
Fox (2001), and Linacre and Wright (1994). The acceptable range for MNSQ is between 0.75 and 1.3, and the Zstd
value should be in the range (-2, +2). Person and item fitness outside the range are considered unexpected responses.
Incorporating the calculation module in the online testing system is helpful for curriculum designers to maintain the
quality of test items by modifying or removing test items with unexpected responses.
Table 1. Fit statistics
MNSQ      Zstd      Variation     Misfit type
> 1.3     > 2.0     Too much      Underfit
< 0.75    < −2.0    Too little    Overfit
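A short sketch of how equations (1)–(8) and the Table 1 cut-offs can be computed is given below; the variable names follow the text, while the function interfaces themselves are assumptions rather than the system's actual code.

```python
import numpy as np

def rasch_fit_statistics(X, theta, b):
    """Person and item infit statistics for a 0/1 response matrix X."""
    pi = 1 / (1 + np.exp(-(theta[:, None] - b[None, :])))   # eq. (1)
    W = pi * (1 - pi)                                        # eq. (2)
    Z = (X - pi) / np.sqrt(W)                                # eq. (3)
    C = (1 - pi) ** 4 * pi + pi ** 4 * (1 - pi)              # eq. (4)

    def infit(axis):
        mnsq = (W * Z ** 2).sum(axis=axis) / W.sum(axis=axis)             # (5)/(7)
        q = np.sqrt((C - W ** 2).sum(axis=axis) / W.sum(axis=axis) ** 2)  # (6)/(8)
        zstd = (np.cbrt(mnsq) - 1) * (3 / q) + q / 3
        return mnsq, zstd

    person_mnsq, person_zstd = infit(axis=1)
    item_mnsq, item_zstd = infit(axis=0)
    return person_mnsq, person_zstd, item_mnsq, item_zstd

def flag_unexpected(mnsq, zstd):
    """Table 1 rule: MNSQ outside [0.75, 1.3] or |Zstd| > 2 marks a misfit."""
    return (mnsq > 1.3) | (mnsq < 0.75) | (np.abs(zstd) > 2.0)
```

Counting the flagged persons or items with, for example, `flag_unexpected(person_mnsq, person_zstd).sum()` gives the kind of unexpected-response totals reported later in Table 8.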
Research hypotheses
This study proposes four hypotheses based on literature reviews. First, Alexander et al. (2001) explored students in a
computer technology course who completed either a PPT or a CBT in a proctored computer lab. The test scores were
similar, but students in the computer-based group, particularly freshmen, completed the test in the least amount of
time. Bodmann and Robinson (2004) investigated the effect of several different modes of test administration on scores
and completion times, and the results of the study indicate that undergraduates completed the computer-based tests
faster than the paper-based tests with no difference in scores. Stated formally:
Hypothesis 1. The average score of CBTs is equivalent to that of PPTs.
Second, the statistical analysis of Bradbard et al. (2004) indicates that the Coombs procedure is a viable alternative to
the standard scoring procedure. In Ben-Simon et al.’s (1997) classification, NS performed very poorly in
discriminating between full knowledge and absence of knowledge. Stated formally:
Hypothesis 2. ET detects partial knowledge of examinees more effectively than NS.
Third, Bradbard and Green (1986) indicated that elimination testing lowers the amount of guesswork, and the
influence increases throughout the grading period. Stated formally:
Hypothesis 3. ET lowers the number of unexpected responses for examinees more effectively than NS.
Finally, most items in mathematics and chemistry testing need manual calculation. The need to write down and
calculate answers on draft paper might lower the answering speed (Ager, 1993). Stated formally:
Hypothesis 4. Different types of question content, such as calculation and concept, influence the performance of
examinees on PPTs or CBTs.
Research variables
The independent variables used in this study and the operational definitions are as follows:
¾ Scoring mode: including partial scoring and conventional dichotomous scoring to analyze the influence of
different answering and scoring schemes on partial knowledge.
¾ Testing tool: comparing conventional PPTs with CBTs and recognizing appropriate question types for CBTs.
The dependent variable adopted in this study is the students’ performance, which is determined by test scores.
Experimental design
Tests were provided according to the two scoring modes and two testing tools, which were combined to form four
treatments. Table 2 lists the multifactor design. Treatment 1 (T1) was CBTs, using the ET scoring method;
Treatment 2 (T2) was CBTs, using the NS scoring method; Treatment 3 (T3) was PPTs, using the ET scoring method; and Treatment 4 (T4) was PPTs, using the NS scoring method.
Table 2. Multifactor design
        CBTs    PPTs
ET      T1      T3
NS      T2      T4
Data collection
The subjects of the experiment were 102 students in an introductory operations management module, which is a
required course for students of the two junior classes in the Department of Information Management at National
Kaohsiung First University of Science and Technology in Taiwan. All students were required to take all four exams
in the course. A randomized block design, separating each class into two cohorts, was adopted for data collection
before the first test. The students were thus separated into the following four groups: class A, cohort 1 (A1); class A,
cohort 2 (A2); class B, cohort 1 (B1); and class B, cohort 2 (B2). The four tests were implemented separately using
CBTs and PPTs. Each test was worth 25% of the final grade. The item contents were concept oriented and
calculation oriented. Because all subjects participating in this study needed to take both NS and ET scored tests, they
were given a lecture describing ET and given several opportunities to practice, to prevent bias in personal scores due
to unfamiliarity with the answering method, thus enhancing the reliability of this study. Students in each group were
all given exactly the same questions.
Data analysis
Reliability
The split-half reliability coefficient was calculated to ensure internal consistency within a single test for each group.
Linn and Gronlund (2000) proposed setting the reliability coefficient between 0.60 and 0.85. Table 3 lists the
reliability coefficient for each treatment and cohort combination, and all Cronbach α values were in this range.
Table 3. Reliability coefficients (Cronbach α) for each group
Test 1    .6858 (T2, A1)    .6684 (T3, A2)    .6307 (T4, B1)    .6901 (T1, B2)
Test 2    .6649 (T4, A1)    .7011 (T1, A2)    .6053 (T2, B1)    .7791 (T3, B2)
Test 3    .7174 (T1, A1)    .7450 (T4, A2)    .8078 (T3, B1)    .7043 (T2, B2)
Test 4    .7270 (T3, A1)    .7358 (T2, A2)    .6341 (T1, B1)    .7041 (T4, B2)
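The internal-consistency index reported above can be computed from each group's item-score matrix; the function below is a small illustrative sketch of Cronbach's α, not the authors' own analysis code.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha; scores has one row per examinee, one column per item."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                              # number of items
    item_var = scores.var(axis=0, ddof=1).sum()      # sum of item variances
    total_var = scores.sum(axis=1).var(ddof=1)       # variance of total scores
    return (k / (k - 1)) * (1 - item_var / total_var)
```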
Correlation analysis
The Pearson and Spearman correlation coefficients were derived to determine whether response patterns of the
examinees were consistent. Table 4 lists the correlation coefficient of ET and NS on each test, and Table 5 presents
the correlation coefficients of CBTs and PPTs. The analytical results in both tables demonstrate a highly positive
correlation for each scoring mode and test tool.
Table 4. Correlation coefficient of ET vs. NS
          Pearson                 Spearman
          CBTs       PPTs         CBTs       PPTs
Test 1    0.894**    0.903**      0.887**    0.902**
Test 2    0.698**    0.692**      0.679**    0.722**
Test 3    0.747**    0.787**      0.676**    0.833**
Test 4    0.871**    0.768**      0.887**    0.743**
** indicates significance at the 0.01 level for a two-tailed test.
Table 5. Correlation coefficient of CBTs vs. PPTs
          Pearson                 Spearman
          NS         ET           NS         ET
Test 1    .868**     .836**       .875**     .800**
Test 2    .759**     .846**       .791**     .803**
Test 3    .840**     .805**       .802**     .766**
Test 4    .832**     .830**       .842**     .796**
** indicates significance at the 0.01 level for a two-tailed test.
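The consistency check behind Tables 4 and 5 amounts to correlating two score vectors for the same examinees; the sketch below uses SciPy's Pearson and Spearman routines on illustrative numbers, not the study's data.

```python
from scipy.stats import pearsonr, spearmanr

# Hypothetical ET and NS scores for the same eight examinees.
et_scores = [62, 48, 55, 70, 41, 66, 58, 52]
ns_scores = [60, 45, 57, 72, 39, 63, 61, 50]

pearson_r, pearson_p = pearsonr(et_scores, ns_scores)
spearman_rho, spearman_p = spearmanr(et_scores, ns_scores)
print(f"Pearson r = {pearson_r:.3f} (p = {pearson_p:.3f})")
print(f"Spearman rho = {spearman_rho:.3f} (p = {spearman_p:.3f})")
```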
Hypothesis testing
One-way analysis of variance (ANOVA) was adopted to test Hypothesis 1, which states that the average score of the CBTs is equivalent to that of the PPTs. Table 6 shows that all p-values are greater than 0.05. No individual comparison is statistically significant at the 5% level, so there is insufficient evidence to reject Hypothesis 1.
Table 6. One-way ANOVA
          Scoring mode   Mean of score                  F-statistic   p-value
Test 1    NS             45.60 (PPTs), 42.84 (CBTs)     1.622         .209
          ET             27.16 (PPTs), 32.08 (CBTs)     1.509         .225
Test 2    NS             55.12 (PPTs), 55.44 (CBTs)     0.16          .901
          ET             43.32 (PPTs), 39.72 (CBTs)     .671          .417
Test 3    NS             47.04 (PPTs), 49.44 (CBTs)     .495          .485
          ET             44.12 (PPTs), 45.70 (CBTs)     .143          .707
Test 4    NS             41.12 (PPTs), 36.48 (CBTs)     1.688         .200
          ET             30.72 (PPTs), 24.40 (CBTs)     2.223         .143
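Each row of Table 6 corresponds to a one-way ANOVA comparing the PPT and CBT groups under one scoring mode; a minimal sketch with SciPy is shown below, using illustrative score lists rather than the study's data.

```python
from scipy.stats import f_oneway

ppt_scores = [45, 51, 38, 60, 42, 47, 55, 40]   # hypothetical PPT group scores
cbt_scores = [43, 49, 41, 58, 39, 50, 52, 37]   # hypothetical CBT group scores

f_stat, p_value = f_oneway(ppt_scores, cbt_scores)
print(f"F = {f_stat:.3f}, p = {p_value:.3f}")    # p > 0.05: no detected difference
```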
Hypothesis 2: ET can detect partial knowledge of examinees more effectively than NS. According to the scoring
classification of Bradbard and Green (1986), and Ben-Simon et al.’s classification (1997), Table 7 summarizes the
average number of items for NS and ET scoring taxonomy. For Test 1, the average number of correct answers for
class A, cohort 1 was 14.28; the number of answers revealing full knowledge for class A, cohort 2 was 11.67; the
number of answers indicating partial knowledge for class A, cohort 2 was 2.38; the number of answers revealing
absence of knowledge for class A, cohort 2 was 1.42; the number of answers indicating partial misinformation for
class A, cohort 2 was 9.25; and the number of answers revealing full misinformation for class A, cohort 2 was 0.29.
Table 7. Average number of items for scoring taxonomy
          NS correct     ET full knowledge   ET partial knowledge   ET absence of knowledge   ET partial misinformation   ET full misinformation
Test 1    14.28 (A1)     11.67 (A2)          2.38                   1.42                      9.25                        0.29
          15.20 (B1)     12.48 (B2)          2.56                   1.36                      8.44                        0.16
Test 2    18.38 (A1)     14.16 (A2)          2.40                   2.20                      6.20                        0.04
          18.48 (B1)     15.12 (B2)          2.40                   2.24                      5.08                        0.16
Test 3    15.88 (A2)     15.33 (A1)          3.46                   1.13                      4.83                        0.25
          16.48 (B2)     15.80 (B1)          2.12                   0.96                      6.08                        0.04
Test 4    12.16 (A2)     11.08 (A1)          3.63                   2.83                      7.38                        0.08
          13.16 (B2)     10.36 (B1)          2.56                   2.68                      9.08                        0.32
To test whether ET can effectively detect partial knowledge of examinees, the number of correct items of NS was
compared to the number of full-knowledge items of ET in each test. Examinees who did not know the correct answer to an
NS test item would have either guessed or given up answering. The number of correctly answered items is composed
of (1) correct response by lucky blind guesses, and (2) correct response by examinees’ knowledge. Burton (2002)
proposed that the conventional MC scoring can be described by equation B = (K + k + R), where B is the number of
correct items; K is the number of correct items for which an examinee possesses accurate knowledge; k is the number
of correct items for which an examinee possesses partial knowledge and guesses correctly; R is the number of correct
items for which the examinee has no knowledge but makes a lucky blind guess. Burton considered that the examinee
could delete distractors and increase the proportion of correct guesses based on partial knowledge. Our study
randomly assigns subjects to each group, so the groups' abilities should be approximately equivalent. Table 7 demonstrates that the number of correct NS items for each of the four tests is greater than the number of full-knowledge ET items, which
shows that ET can distinguish between full knowledge and partial knowledge by partial scoring.
Hypothesis 3: ET lowers the number of unexpected responses for examinees more effectively than NS. Table 8
shows the number of unexpected responses based on the calculation described in the research method section,
equations (1) to (8), and reveals that the number of unexpected responses in NS is greater than ET for each test,
which demonstrates that ET reduces the unexpected responses of examinees.
Table 8. Number of unexpected responses
          NS    ET
Test 1    25    17
Test 2    10     9
Test 3    17     7
Test 4    29    15
Next, one-way ANOVA was adopted to test Hypothesis 4, that different question contents, such as calculation and
concept, influence the performance of examinees on PPTs or CBTs. Table 9 shows that p-value > 0.05 for each
subgroup. Thus, the experimental results do not support Hypothesis 4, indicating that different question content does not influence the performance of examinees on PPTs or CBTs. This result did not confirm Ager's study (1993). The first reason might be the course content: an introductory course focused on principles and concepts of operations management rather than on complex calculation. The second explanation is that the subjects in this study are all students from the information management department. These students are quite used to interacting with computer
interfaces; consequently, the performance difference between PPTs and CBTs is insignificant.
Table 9. One-way ANOVA
          Tool   Scoring   Mean                                  F-statistic   p-value
Test 1    PPTs   NS        1.97 (concept), 1.87 (calculation)    .163          .688
          PPTs   ET        1.22 (concept), 1.45 (calculation)    .744          .394
          CBTs   NS        1.08 (concept), .90 (calculation)     .206          .666
          CBTs   ET        .385 (concept), .39 (calculation)     .000          .989
Test 2    PPTs   NS        2.15 (concept), 2.13 (calculation)    .014          .907
          PPTs   ET        1.83 (concept), 1.64 (calculation)    .552          .462
          CBTs   NS        2.34 (concept), 2.45 (calculation)    .304          .591
          CBTs   ET        1.49 (concept), 1.46 (calculation)    .007          .936
Test 3    PPTs   NS        1.91 (concept), 1.95 (calculation)    .054          .818
          PPTs   ET        1.85 (concept), 1.80 (calculation)    .072          .789
          CBTs   NS        2.24 (concept), 2.20 (calculation)    .005          .944
          CBTs   ET        2.17 (concept), 2.06 (calculation)    .052          .831
Test 4    PPTs   NS        1.62 (concept), 1.51 (calculation)    .248          .622
          PPTs   ET        1.37 (concept), .90 (calculation)     3.222         .084
          CBTs   NS        1.68 (concept), 1.40 (calculation)    1.229         .281
          CBTs   ET        1.05 (concept), 1.08 (calculation)    .005          .947
Conclusion
This study demonstrates the feasibility of adopting ET and CBTs to replace conventional NS and PPTs. Under the
same MC testing item construct, the researchers investigate the performance difference among examinees between
partial scoring by the elimination testing and the conventional dichotomous scoring method. This study first builds a
computer-based assessment system, then adopts experimental design to record an examinee’s performance with
different testing tools and scoring approaches. One-way ANOVA does not show sufficient proof to reject the
hypothesis that the performance of students taking CBTs is the same as the performance of students taking PPTs.
This finding is in agreement with Alexander et al. (2001) and Bodmann and Robinson (2004). We conclude that no discrepancy exists between the performance of examinees who take PPTs and the performance of those who take CBTs when ET is used.
Next, the number of correct NS items was compared to the number of full-knowledge ET items in each test. Data analysis demonstrates that the number of correct NS items is greater than the full-knowledge number of ET for each test, which shows that ET can distinguish between full knowledge and partial knowledge by partial scoring. The number of unexpected responses, calculated with the Rasch model (based on item response theory for dichotomous scoring) and the partial credit model (based on graded item response for elimination testing) used to estimate the examinee ability and item difficulty parameters, is greater in NS than in ET for each test, which demonstrates that ET reduces the unexpected responses of examinees. ET scoring is helpful whenever examinees' partial knowledge and unexpected responses are of concern. Moreover, instructors can more
accurately assess examinees’ partial knowledge by adopting ET and CBTs, which is not only helpful in teaching, but
also increases examinees’ eagerness and willingness to learn.
Experimental results also indicate that different question content does not influence the performance of examinees on
PPTs or CBTs. This result did not confirm Ager's study (1993). The first reason might be the course content: an introductory course focused on principles and concepts of operations management rather than on complex calculation. The second explanation is that the subjects in this study are all students from the information management department. These students are quite used to interacting with computer interfaces; consequently, the difference in their performance on PPTs and CBTs is insignificant. We remain aware that the validity of any
experimental study is limited to the scope of the experiment. Since the study involved only two classes and one
course, more comparison tests could be performed on other content subjects to identify those suited to adopting ET
and CBTs to replace conventional NS and PPTs.
The trend in e-learning technologies and system development is toward the creation of standards-based distributed computing applications. To maintain a high-quality testbank and promote the sharing and reuse of test items, further research should consider incorporating the IMS question and test interoperability (QTI) specification, which describes a basic structure for the representation of assessment data groups, questions, and results, and allows the CBT system to go further by using the shareable content object reference model (SCORM) and the IMS metadata specification.
References
Abu-Sayf, F. K. (1979). Recent developments in the scoring of multiple-choice items. Educational Review, 31, 269–
270.
Ager, T. (1993). Online placement testing in mathematics and chemistry. Journal of Computer-Based Instruction,
20 (2), 52–57.
Akeroyd, F. M. (1982). Progress in multiple-choice scoring methods. Journal of Further and Higher Education, 6,
87–90.
Alexander, M. W., Bartlett, J. E., Truell, A. D., & Ouwenga, K. (2001). Testing in a computer technology course: An
investigation of equivalency in performance between online and paper and pencil methods. Journal of Career and
Technical Education, 18 (1), 69–80.
Baker, F. B. (1992). Item response theory: Parameter estimation techniques, New York, NY: Marcel Dekker.
Ben-Simon, A., Budescu, D. V., & Nevo, B. (1997). A comparative study of measures of partial knowledge in
multiple-choice tests. Applied Psychological Measurement, 21 (1), 65–88.
Bodmann, S. M. & Robinson, D. H. (2004). Speed and performance differences among computer-based and paper-pencil tests. Journal of Educational Computing Research, 31 (1), 51–60.
Bond, T. G. & Fox, C. M. (2001), Applying the Rasch model: Fundamental measurement in the human sciences,
Mahwah, New Jersey: Lawrence Erlbaum Associates.
Bradbard, D. A. & Green, S. B. (1986). Use of the Coombs elimination procedure in classroom tests. Journal of
Experimental Education, 54, 68–72.
Bradbard, D. A., Parker, D. F., & Stone, G. L. (2004). An alternate multiple-choice scoring procedure in a
macroeconomics course. Decision Sciences Journal of Innovative Education, 2 (1), 11–26.
Bugbee, A. C. (1996). The equivalence of PPTs and computer-based testing. Journal of Research on Computing in
Education, 28 (3), 282–299.
Burton, R. F. (2002). Misinformation, partial knowledge and guessing in true/false tests. Medical Education, 36,
805–811.
Bush, M. (2001). A multiple choice test that rewards partial knowledge. Journal of Further and Higher Education,
25 (2), 157–163.
Chan, N., & Kennedy, P. E. (2002). Are multiple-choice exams easier for economics students? A comparison of
multiple-choice and “equivalent” constructed-response exam questions. Southern Economic Journal, 68 (4), 957–
971.
Coombs, C. H., Milholland, J. E., & Womer, F. B. (1956). The assessment of partial knowledge. Educational and
Psychological Measurement, 16, 13–37.
Dressel, P. L., & Schmid, J. (1953). Some modifications of the multiple-choice item. Educational and Psychological
Measurement, 13, 574–595.
Gretes, J. A., & Green, M. (2000). Improving undergraduate learning with computer-assisted assessment. Journal of
Research on Computing in Education, 33 (1), 46–4.
Hagler, A. S., Norman, G. J., Radick, L. R., Calfas, K. J., & Sallis, J. F. (2005). Comparability and reliability of
paper- and computer-based measures of psychosocial constructs for adolescent fruit and vegetable and dietary fat
intake. Journal of the American Dietetic Association, 105 (11), 1758–1764.
Haladyna, T. M. & Downing, S. M. (1989). A taxonomy of multiple-choice item-writing rules. Applied Measurement
in Education, 2, 37–50.
Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and applications, Boston, MA:
Kluwer-Nijhoff.
Inouye, D. K., & Bunderson, C. V. (1986). Four generations of computerized test administration. Machine-Mediated
Learning, 1, 355–371.
Jaradat, D. & Sawaged, S. (1986). The subset selection technique for multiple-choice tests: An empirical inquiry.
Journal of Educational Measurement, 23 (4), 369–376.
Jaradat, D. & Tollefson, N. (1988). The impact of alternative scoring procedures for multiple-choice items on test
reliability, validity, and grading. Educational and Psychological Measurement, 48, 627–635.
Kehoe, J. (1995). Basic item analysis for multiple-choice tests. Practical Assessment, Research & Evaluation, 4 (10),
retrieved October 15, 2007, from http://PAREonline.net/getvn.asp?v=4&n=10.
Linacre, J. M., & Wright, B. D. (1994). Chi-square fit statistics. Rasch Measurement Transactions, 8 (2), 360,
retrieved October 15, 2007, from http://rasch.org/rmt/rmt82.htm.
Linn, R. L., & Gronlund, N. E. (2000). Measurement and assessment in teaching (8th Ed.), Upper Saddle River, NJ:
Prentice-Hall.
Mazzeo, J., & Harvey, A. L. (1988). The equivalence of scores from automated and conventional education and
psychological test, New York, NY: College Board Publications.
Wongwiwatthananukit, S., Popovich, N. G., & Bennett, D. E. (2000). Assessing pharmacy student knowledge on
multiple-choice examinations using partial-credit scoring of combined-response multiple-choice items. American
Journal of Pharmaceutical Education, 64 (1), 1–10.
Wright, B. D., & Masters, G. N. (1982). Rating scale analysis, Chicago, IL: MESA Press.
Wright, B. D., & Stone, M. H. (1979). Best test design, Chicago, IL: MESA Press.
Zhu, W. (1996). Should total scores from a rating scale be used directly? Research Quarterly for Exercise and Sport,
67 (3), 363–372.
Zhu, W., & Cole, E. L. (1996). Many-faceted Rasch calibration of a gross motor instrument. Research Quarterly for
Exercise and Sport, 67 (1), 24–34.