Chang, S.-H., Lin, P.-C., & Lin, Z. C. (2007). Measures of Partial Knowledge and Unexpected Responses in Multiple-Choice Tests. Educational Technology & Society, 10 (4), 95-109.

Measures of Partial Knowledge and Unexpected Responses in Multiple-Choice Tests

Shao-Hua Chang, Department of Applied English, Southern Taiwan University, Tainan, Taiwan // shaohua@mail.stut.edu.tw
Pei-Chun Lin, Department of Transportation and Communication Management Science, National Cheng Kung University, Taiwan // peichunl@mail.ncku.edu.tw
Zih-Chuan Lin, Department of Information Management, National Kaohsiung First University of Science & Technology, Taiwan // u9324819@ccms.nkfust.edu.tw

ABSTRACT
This study investigates differences in the partial scoring performance of examinees in elimination testing and conventional dichotomous scoring of multiple-choice tests implemented on a computer-based system. Elimination testing that uses the same set of multiple-choice items rewards examinees with partial knowledge over those who are simply guessing. This study provides a computer-based test and item analysis system to reduce the difficulty of grading and item analysis following elimination tests. The Rasch model, based on item response theory for dichotomous scoring, and the partial credit model, based on graded item response for elimination testing, are the kernel of the test-diagnosis subsystem used to estimate examinee ability and item difficulty parameters. This study draws the following conclusions: (1) examinees taking computer-based tests (CBTs) have the same performance as those taking paper-and-pencil tests (PPTs); (2) conventional scoring does not measure the same knowledge as partial scoring; (3) the partial scoring of multiple choice lowers the number of unexpected responses from examinees; and (4) different question topics and types do not influence the performance of examinees in either PPTs or CBTs.

Keywords
Computer-based tests, Elimination testing, Unexpected responses, Partial knowledge, Item response theory

Introduction

The main missions of educators are determining learning progress and diagnosing the difficulties experienced by students when studying. Testing is a conventional means of evaluating students, and test scores can be adopted to observe learning outcomes. Multiple-choice (MC) items continue to dominate educational testing owing to their ability to effectively and simply measure constructs such as ability and achievement. Measurement experts and testing organizations prefer the MC format to others (e.g., short-answer, essay, constructed-response) for the following reasons:
• Content sampling is generally superior to other formats, and the application of MC formats normally leads to highly content-valid test-score interpretations.
• Test scores can be extremely reliable with a sufficient number of high-quality MC items.
• MC items can be easily pre-tested, stored, used, and reused, particularly with the advent of low-cost, computerized item-banking systems.
• Objective, high-speed test scoring is achievable.
• Diagnostic subscores are easily obtainable.
• Test theories (i.e., item response, generalizability, and classical) easily accommodate binary responses.
• Most content can be tested using this format, including many types of higher-level thinking (Haladyna & Downing, 1989).

However, the conventional MC examination scheme requires examinees to evaluate each option and select one answer.
Examinees are often absolutely certain that some of the options are incorrect, but still unable to identify the correct response (Bradbard, Parker, & Stone, 2004). From the viewpoint of learning, knowledge is accumulated continuously rather than on an all-or-nothing basis. The conventional scoring format of the MC examination cannot distinguish between partial knowledge (Coombs, Milholland, & Womer, 1956) and the absence of knowledge. In conventional MC tests, students choose only one response. The number of correctly answered questions is counted, and the scoring method is called number scoring (NS). Akeroyd (1982) stated that NS makes the simplifying assumption that all of the wrong answers of students are the results of random guesses, thus neglecting the existence of partial knowledge.

Coombs et al. (1956) first proposed an alternative method for administering MC tests. In their procedure, students are instructed to mark as many incorrect options as they can identify. This procedure is referred to as elimination testing (ET). Bush (2001) presented a multiple-choice test format that permits an examinee who is uncertain of the correct answer to a question to select more than one answer. Incorrect selections are penalized by negative marking. The aim of both the Bush and Coombs schemes is to reward examinees with partial knowledge over those who are simply guessing.

Education researchers have been continuously concerned not only with how to evaluate students' partial knowledge accurately but also with how to reduce the number of unexpected responses. The number of correctly answered questions is composed of two numbers: the number of questions to which the students actually know the answer, and the number of questions to which the students correctly guess the answer (Bradbard et al., 2004). A higher frequency of the second case indicates a less reliable evaluation of learning performance. Chan and Kennedy (2002) compared student scores on MC and equivalent constructed-response questions, and found that students do indeed score better on constructed-response questions for particular MC questions. Although constructed-response testing produces fewer unexpected responses than the conventional dichotomous scoring method, changing the item construct increases the complexity both of creating the test and of post-test grading and item analysis, whereas ET uses the same set of MC items and makes guessing a futile effort. Bradbard et al. (2004) suggested that the greatest obstacle to implementing ET is the complexity of grading and of analyzing test items following traditional paper assessment. Accordingly, examiners are not very willing to adopt ET.
To overcome this problem, this study provides an integrated computer-based test and item-analysis system to reduce the difficulty of grading and item analysis following testing. Computer-based tests (CBTs) offer several advantages over traditional paper-and-pencil tests (PPTs). The benefits of CBTs include reduced costs of data entry, improved rate of disclosure, ease of data conversion into databases, and reduced likelihood of missing data (Hagler, Norman, Radick, Calfas, & Sallis, 2005). Once set up, CBTs are easier to administer than PPTs. CBTs offer the possibility of instant grading and automatic tracking and averaging of grades. In addition, they are easier to manipulate to reduce cheating (Inouye & Bunderson, 1986; Bodmann & Robinson, 2004).

Most CBTs measure test item difficulty based on the percentage of correct responses: a higher percentage of correct responses implies an easier test item. This approach to test item analysis disregards the relationship between the examinee's ability and item difficulty. For instance, if the percentage of correct responses for test item A is quite small, then the test item analysis system categorizes it as "difficult." However, the statistics may also reveal that more failing examinees than passing examinees answer item A correctly. In that case the design of test item A may be inappropriate, misleading, or unclear, and should be studied further to help future curriculum designers compose high-quality items. To avoid this fallacy of the percentage of correct responses, this study constructs a CBT system that applies the Rasch model based on item response theory for dichotomous scoring and the partial credit model based on graded item response for ET to estimate the examinee ability and item difficulty parameters (Baker, 1992; Hambleton & Swaminathan, 1985; Zhu & Cole, 1996; Wright & Stone, 1979; Zhu, 1996; Wright & Masters, 1982).

Before ET implemented on a computer-based system is broadly adopted, we still need to examine whether any discrepancy exists between the performance of examinees who take elimination tests on paper and the performance of those who take CBTs. This study compares the scores of examinees taking tests under the NS dichotomous scoring method and the partial scoring of ET, using the same set of MC items in CBT and PPT settings, where the content subject is operations management. This study has the following specific goals:
1. Evaluate whether partial scoring of the MC test produces fewer unexpected responses from examinees.
2. Compare examinee performance on conventional PPTs with their performance on CBTs.
3. Analyze whether different question content, such as calculation and concept, influences the performance of examinees on PPTs and CBTs.
4. Investigate the relationship between an examinee's ability and item difficulty, to help curriculum designers compose high-quality items.

The rest of this paper first presents a brief literature review on partial knowledge, testing methods, scoring methods, multiple choice, and CBTs. The paper then describes the configuration of a computer-based assessment system, formulates the research hypotheses, and provides the research method, experimental design, and data collection in detail. The statistical analysis and hypothesis testing results are presented subsequently. Conclusions are drawn in the last section.
Related literature

This study first defines related domain knowledge, discusses studies that apply conventional scoring and partial scoring, compares scoring modes, investigates the principles of designing MC items, and finally summarizes the pros and cons of CBT systems.

Partial knowledge

Reducing the opportunities to guess and measuring partial knowledge improve the psychometric properties of a test. These methods can be classified by their ability to identify partial knowledge on a given test item (Alexander, Bartlett, Truell, & Ouwenga, 2001). Coombs et al. (1956) stated that the conventional scoring format of the MC examination cannot distinguish between partial knowledge and absence of knowledge. Ben-Simon, Budescu, & Nevo (1997) classify an examinee's knowledge of a given item as full knowledge (identifies all of the incorrect options), partial knowledge (identifies some of the incorrect options), partial misinformation (identifies the correct answer and some incorrect options), full misinformation (identifies only the correct answer), and absence of knowledge (either omits the item or identifies all options). Bush (2001) conducted a study that allows examinees to select more than one answer to a question if they are uncertain of the correct one. Negative marking is used to penalize incorrect selections. The aim is to explicitly reward examinees who possess partial knowledge as compared with those who are simply guessing.

Number-scoring (NS) of multiple choice

Students choose only one response. The number of correctly answered questions is composed of the number of questions to which the student knows the answer, and the number of questions to which the student correctly guesses the answer. According to the classification of Ben-Simon et al. (1997), NS can only distinguish between full knowledge and absence of knowledge. A student's score on an NS section with 25 MC questions and three points per correct response is in the range 0 to +75.

Elimination testing (ET) of multiple choice

Alternative schemes proposed for administering MC tests increase the complexity of responding and scoring, and the available information about student understanding of the material (Coombs et al., 1956; Abu-Sayf, 1979; Alexander et al., 2001). Since partial knowledge is not captured in the conventional NS format of an MC examination, Coombs et al. (1956) describe a procedure that instructs students to mark as many incorrect options as they can identify. One point is awarded for each incorrect choice identified, but k points are deducted (where k equals the number of options minus one) if the correct option is identified as incorrect. Consequently, a question score is in the range -3 to +3 on a question with four options, and a student's score on an ET section of 25 MC questions with four options each is in the range -75 to +75. Bradbard and Green (1986) classified ET scoring as follows: completely correct score (+3), partially correct score (+2 or +1), no-understanding score (0), partially incorrect score (-1 or -2), completely incorrect score (-3).
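To make the ET rule above concrete, here is a minimal sketch (ours, not the authors' implementation) that scores a single four-option item under the Coombs elimination rule and maps the score to the Bradbard and Green (1986) categories; the function names and the example response are illustrative only.

```python
def et_item_score(eliminated, correct_option, n_options=4):
    """Coombs elimination scoring for one MC item.

    eliminated     -- set of option labels the examinee marked as incorrect
    correct_option -- label of the keyed answer
    Awards +1 per distractor correctly eliminated and deducts k = n_options - 1
    if the keyed answer is eliminated, so a four-option item scores -3 to +3.
    """
    k = n_options - 1
    distractors_eliminated = len(eliminated - {correct_option})
    penalty = k if correct_option in eliminated else 0
    return distractors_eliminated - penalty


def bradbard_green_category(score, n_options=4):
    """Map an ET item score to the Bradbard and Green (1986) score classes."""
    k = n_options - 1
    if score == k:
        return "completely correct"    # full knowledge
    if 0 < score < k:
        return "partially correct"     # partial knowledge
    if score == 0:
        return "no understanding"
    if -k < score < 0:
        return "partially incorrect"   # partial misinformation
    return "completely incorrect"      # full misinformation


# An examinee eliminates two of the three distractors on an item keyed "C":
score = et_item_score({"A", "B"}, "C")
print(score, bradbard_green_category(score))   # 2 partially correct
```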
Subset selection testing (SST) of multiple choice

Rather than identifying incorrect options, the examinee attempts to construct a subset of item options that includes the correct answer (Jaradat & Sawaged, 1986). The scoring for an item with four options is as follows: if only the correct response is identified, then the score is 3; if the subset of options identified includes the correct response and other options, then the item score is 3 - n (n = 1, 2, or 3), where n denotes the number of other options included. If a subset of options that does not include the correct option is identified, then the score is -n, where n is the number of options included. SST and ET are probabilistically equivalent.

Comparison studies of scoring methods

Coombs et al. (1956) observed that tests using ET are somewhat more reliable than NS tests and measure the same abilities as NS scoring. Dressel and Schmid (1953) compared SST with NS using college students in a physical science course. They observed that the reliability of the SST test, at 0.67, was slightly lower than that of the NS test at 0.70. They also noted that academically high-performing students scored better than average compared to low-performing students with respect to full knowledge, regardless of the difficulty of the items. Jaradat and Tollefson (1988) compared ET with SST, using graduate students enrolled in an educational measurement course. No significant differences in terms of reliability were observed between the methods. Jaradat and Tollefson reported that the majority of students felt that ET and SST were better measures of their knowledge than conventional NS, but they still preferred NS. Bradbard et al. (2004) concluded that ET scoring is useful whenever there is concern about improving the accuracy of measuring a student's partial knowledge. ET scoring may be particularly helpful in content areas where partial or full misinformation can have life-threatening consequences. This study adopted ET scoring as the measurement scheme for partial scoring.

Design of multiple choice

An MC item is composed of a correct answer and several distractors. The design of distractors is the largest challenge in constructing an MC item (Haladyna & Downing, 1989). Haladyna and Downing summarized the common design rules found in many references. One such rule is that all the option choices should adopt parallel grammar to avoid giving clues to the correct answer. The option choices should address the same content, and the distractors should all be reasonable choices for a student with limited or incorrect information. Items should be as clear and concise as possible, both to ensure that students know what is being asked, and to minimize reading time and the influence of reading skills on performance. Haladyna and Downing recommended some guidelines for developing distractors:
• Employ plausible distractors; avoid illogical distractors.
• Incorporate common student errors into distractors.
• Adopt familiar yet incorrect phrases as distractors.
• Use true statements that do not correctly answer the items.

Kehoe (1995) recommended improving tests by maintaining and developing a pool of "good" items from which future tests are drawn in part or in whole. This approach is particularly valuable for instructors who teach the same course more than once. The proportion of students answering an item correctly also affects its discrimination power. Items answered correctly (or incorrectly) by a large proportion of examinees (more than 85%) have a markedly low power to discriminate. In a good test, most items are answered correctly by 30% to 80% of the examinees. Kehoe described the following three methods to enhance the ability of items to discriminate among abilities (a small sketch of these checks follows the list):
• Items that correlate less than 0.15 with total test score should probably be restructured.
• Distractors that are not chosen by any examinees should be replaced or eliminated.
• Items that virtually all examinees answer correctly are unhelpful for discriminating among students and should be replaced by harder items.
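As a rough illustration of these classical item-analysis checks (our sketch, not part of the authors' system), the code below computes each item's proportion correct and its corrected item-total correlation from a 0/1 response matrix and flags items against Kehoe's thresholds; the toy data and threshold arguments are assumptions for demonstration.

```python
import numpy as np

def flag_items(responses, low_corr=0.15, high_p=0.85):
    """Classical item analysis on a 0/1 matrix (rows = examinees, columns = items).

    Flags items whose corrected item-total correlation falls below `low_corr`,
    or whose proportion correct lies outside [1 - high_p, high_p], following
    Kehoe's (1995) guidelines quoted above.
    """
    responses = np.asarray(responses, dtype=float)
    p_correct = responses.mean(axis=0)
    report = []
    for i in range(responses.shape[1]):
        rest = responses.sum(axis=1) - responses[:, i]   # total score excluding item i
        r = np.corrcoef(responses[:, i], rest)[0, 1]
        reasons = []
        if r < low_corr:
            reasons.append(f"item-total correlation {r:.2f} < {low_corr}")
        if not (1 - high_p) <= p_correct[i] <= high_p:
            reasons.append(f"proportion correct {p_correct[i]:.2f} outside "
                           f"{1 - high_p:.2f}-{high_p:.2f}")
        report.append((i, round(p_correct[i], 2), round(r, 2), reasons or ["ok"]))
    return report

# Toy responses for six examinees on four items (made up for illustration).
demo = np.array([[1, 1, 0, 1],
                 [1, 0, 1, 1],
                 [0, 1, 0, 1],
                 [1, 1, 1, 1],
                 [0, 0, 0, 1],
                 [1, 0, 0, 0]])
for row in flag_items(demo):
    print(*row)
```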
Pros and cons of CBTs

CBTs have several benefits:
• The single-item presentation is not restricted to text, is easy to read, and allows text to be combined with pictures, voice, images, and animation.
• CBTs shorten testing time, give instantaneous results, and increase test security.
• The online testing format of an exam can be configured to enable instantaneous feedback to the student, and can be more easily scheduled and administered than PPTs (Gretes & Green, 2000; Bugbee, 1996).
• The test requires no paper, eliminating the battle for the copy machine; the option of printing the test is always available.
• The student can take the test when and where appropriate. Even taking the test at home is an option if the student has Internet access at home. Students can wait until they think they have mastered the material before being tested on it.
• The test is a valuable teaching tool. CBTs provide immediate feedback, requiring the student to get the correct answer before moving on.
• CBTs save 100% of the time taken to distribute the test to test-takers (a CBT never has to be handed out) and 100% of the time taken to create different versions of the same test (re-sequencing questions to prevent an examinee from cheating by looking at the test of the person next to him).

However, CBTs have some disadvantages:
• The test format is normally limited to true/false and MC. Computer-based automatic grading cannot easily judge the accuracy of constructed-response questions such as short-answer, problem-solving exercises, and essay questions.
• When holding an onsite test, instructors must prepare many computers for examinees, and be prepared for the difficulties caused by computer crashes.
• The computer display is not suitable for question items composed of numerous words, since the resolution might make the text difficult to read (Mazzeo & Harvey, 1988).
• Most items in mathematics and chemistry testing need manual calculation. The need to write down and calculate answers on draft paper might decrease answering speed (Ager, 1993).

Research method

This study first constructs a computer-based assessment system, and then adopts an experimental design to record examinees' performance with different testing tools and scoring approaches. This section describes the research hypotheses, variables, experimental design, and data collection.

System configuration

As well as providing a platform for computer-based ET, the system implements the Rasch one-parameter logistic item characteristic curve (ICC) model for dichotomous scoring and the graded-response model for the partial scoring of ET to estimate the item and ability parameters (Hambleton & Swaminathan, 1985; Wright & Masters, 1982; Wright & Stone, 1979; Zhu, 1996; Zhu & Cole, 1996). Furthermore, item difficulty analysis allows the system to maintain a high-quality test bank.
Figure 1 shows the system configuration, which comprises the databases and three main subsystems: (1) the computer-based assessment system, the platform used by examinees to take tests, which enables learning progress to be tracked, grades to be queried, and examinees' response patterns to be recorded in detail; (2) the testbank management system, the platform used by instructors to manage test items and examinee accounts; and (3) the test diagnosis system, which collects data from the answer record and gradebook database to analyze the difficulty of test items and the ability of examinees. These subsystems are described in detail as follows (Figure 1).

Computer-based assessment system

The computer-based assessment system links to the answer record and gradebook database, collects the complete answer record, and incorporates a feedback mechanism that enables examinees to check their own learning progress and increase their learning efficiency by means of constructive interaction. The main functions of the computer-based assessment system are as follows. The system first verifies the examinee's eligibility for the test and then displays the test information, including the allowed answering time, scoring method, and test items. The system permits examinees to write down keywords or mark the test sheet, as in a paper test. The complete answering processes are stored in the answer record and gradebook database. The answer record and gradebook database enable the computer-based assessment system to collect information about examinees during tests, improving examinees' understanding of their own learning status. Examinees can understand the core of a problem by reading the remarks they themselves made during a test, identify any errors, and improve in areas of poor understanding.

[Figure 1. System configuration: the instructor interface to the testbank management system (test item management, exam management, account management); the examinee interface to the computer-based assessment system (online testing, progress tracking, grade querying); the testbank and answer record/gradebook databases; and the test diagnosis system (defining difficulty, setting up commonly used tests, testing unexpected responses, item fit analysis, person fit analysis, estimating ability and item parameters, scoring, sorting examinees' responses).]

Testbank management system

The testbank management system is normally accessed by instructors who are designing test items, revising test items, designing examinations, and reusing tests. The main advantage of item banking is in test development. The system allows a curriculum designer to edit multiple-choice, constructed-response, and true/false items and to specify scoring modes. Designed test items are stored in the testbank database. Test items can be displayed in text format or integrated with multimedia images. The system supports three ways of preparing examinations: (1) creating original questions, (2) browsing questions from the testbank, and (3) using questions from the testbank by selecting question items at random. The system provides dichotomous scoring and partial scoring for MC items. After the item sheet has been prepared, the curriculum designer must specify "examinee eligibility," "testing date," "scoring mode," "time allowed," and "question values." The computer-based assessment system permits only eligible examinees to take the test at the specified time. Although the system can automatically and instantly grade multiple-choice and true/false questions, the announcement of test scores is delayed until all examinees have finished the test.
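The paper does not publish the testbank's actual schema, so the following is only a hypothetical sketch of how an item record and the "select question items at random" assembly option might look; every field and function name here is an assumption.

```python
import random
from dataclasses import dataclass

@dataclass
class TestItem:
    """Hypothetical testbank record; the real system's fields are not documented."""
    item_id: str
    stem: str
    options: dict        # e.g. {"A": "...", "B": "...", "C": "...", "D": "..."}
    correct_option: str
    topic: str           # "concept" or "calculation"
    scoring_mode: str    # "NS" or "ET"

def assemble_exam(testbank, n_items, topic=None, seed=None):
    """Draw items at random from the testbank, optionally filtered by topic."""
    pool = [item for item in testbank if topic is None or item.topic == topic]
    if len(pool) < n_items:
        raise ValueError("testbank does not contain enough matching items")
    return random.Random(seed).sample(pool, n_items)
```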
Test diagnosis system

The test diagnosis system analyzes items and examinees' ability based on the scoring methods (dichotomous scoring and partial scoring) and the data retrieved from the answer record and gradebook database. The diagnostic process is summarized as follows:
(1) Sorting the results from the test records. The system first sorts the item-response patterns of all examinees. The answer records are rearranged according to each examinee's ability and the number of students who gave the correct answer to a given question.
(2) Matrix reduction. The system then deletes uninformative item and examinee responses, such as questions to which all examinees gave correct or wrong answers, and examinees who received a full or a zero score, since these data do not help to assess the ability of examinees (Baker, 1992; Bond & Fox, 2001). Matrix reduction reduces the resources and time required for the calculation.
(3) The JMLE procedures (Hambleton & Swaminathan, 1985; Wongwiwatthananukit, Popovich, & Bennett, 2000) depicted in Figure 2 are applied to estimate the ability and item parameters.

The diagnosis then proceeds to item and person fit analysis, described as follows. Let the random variable $X_{ni}$ denote examinee n's response to item i, where $X_{ni} = 1$ signifies a correct answer; $\theta_n$ is the person parameter of examinee n; and $b_i$ denotes the item parameter, which determines the item location and is called the item difficulty in attainment tests. The expected value and variance of $X_{ni}$ are given in equations (1) and (2):

(1)  $E(X_{ni}) = \dfrac{\exp(\theta_n - b_i)}{1 + \exp(\theta_n - b_i)} = \pi_{ni}$

(2)  $\mathrm{Var}(X_{ni}) = \pi_{ni}(1 - \pi_{ni}) = W_{ni}$

The standardized residual and kurtosis of $X_{ni}$ are given in equations (3) and (4):

(3)  $Z_{ni} = \dfrac{X_{ni} - E(X_{ni})}{[\mathrm{Var}(X_{ni})]^{1/2}}$

(4)  $C_{ni} = (1 - \pi_{ni})^4 \pi_{ni} + (0 - \pi_{ni})^4 (1 - \pi_{ni})$

The person fit analysis reports the mean square (MNSQ) and the standardized weighted mean square (Zstd), as given in equations (5) and (6):

(5)  $\mathrm{MNSQ} = \dfrac{\sum_i W_{ni} Z_{ni}^2}{\sum_i W_{ni}} = \nu_n$

(6)  $\mathrm{Zstd} = t_n = (\nu_n^{1/3} - 1)\dfrac{3}{q_n} + \dfrac{q_n}{3}$, where $q_n^2 = \dfrac{\sum_i (C_{ni} - W_{ni}^2)}{\left(\sum_i W_{ni}\right)^2}$

The item fit analysis reports the corresponding mean square (MNSQ) and standardized weighted mean square (Zstd):

(7)  $\mathrm{MNSQ} = \dfrac{\sum_n W_{ni} Z_{ni}^2}{\sum_n W_{ni}} = \nu_i$

(8)  $\mathrm{Zstd} = t_i = (\nu_i^{1/3} - 1)\dfrac{3}{q_i} + \dfrac{q_i}{3}$, where $q_i^2 = \dfrac{\sum_n (C_{ni} - W_{ni}^2)}{\left(\sum_n W_{ni}\right)^2}$

[Figure 2. JMLE procedures for estimating both ability and item parameters]

Table 1 lists the principles used to identify examinees and items with unexpected responses, summarized from Bond and Fox (2001) and Linacre and Wright (1994). The acceptable range for MNSQ is between 0.75 and 1.3, and the Zstd value should be in the range (-2, +2). Person and item fit values outside these ranges are considered unexpected responses. Incorporating the calculation module into the online testing system helps curriculum designers maintain the quality of test items by modifying or removing test items with unexpected responses.

Table 1. Fit statistics
  MNSQ     Zstd      Variation    Misfit type
  > 1.3    > +2.0    Too much     Underfit
  < 0.75   < -2.0    Too little   Overfit
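As an illustration only (the authors' implementation is not published), the sketch below evaluates equations (1)-(8) with NumPy for a 0/1 response matrix, given ability and difficulty estimates such as those produced by the JMLE step, and flags persons and items whose MNSQ or Zstd falls outside the Table 1 ranges; the function name and cut-off arguments are ours.

```python
import numpy as np

def rasch_infit(X, theta, b, mnsq_lo=0.75, mnsq_hi=1.3, zstd_lim=2.0):
    """Weighted (infit) fit statistics for the dichotomous Rasch model.

    X     -- 0/1 response matrix, rows = examinees n, columns = items i
    theta -- ability estimates theta_n (e.g. from the JMLE step)
    b     -- item difficulty estimates b_i
    Evaluates equations (1)-(8) and flags values outside the Table 1 ranges.
    """
    X = np.asarray(X, dtype=float)
    theta = np.asarray(theta, dtype=float)[:, None]
    b = np.asarray(b, dtype=float)[None, :]

    pi = np.exp(theta - b) / (1.0 + np.exp(theta - b))   # (1) expected response
    W = pi * (1.0 - pi)                                  # (2) variance
    Z = (X - pi) / np.sqrt(W)                            # (3) standardized residual
    C = (1 - pi) ** 4 * pi + pi ** 4 * (1 - pi)          # (4) kurtosis

    def fit(axis):
        v = (W * Z ** 2).sum(axis) / W.sum(axis)                 # (5)/(7) MNSQ
        q = np.sqrt((C - W ** 2).sum(axis) / W.sum(axis) ** 2)   # q_n or q_i
        zstd = (np.cbrt(v) - 1.0) * (3.0 / q) + q / 3.0          # (6)/(8) Zstd
        flag = (v < mnsq_lo) | (v > mnsq_hi) | (np.abs(zstd) > zstd_lim)
        return v, zstd, flag

    return {"person": fit(axis=1), "item": fit(axis=0)}

# Tiny made-up example: 4 examinees, 3 items.
X = np.array([[1, 0, 1], [1, 1, 0], [0, 0, 1], [1, 1, 1]])
result = rasch_infit(X, theta=[0.5, 0.2, -0.8, 1.4], b=[-0.3, 0.4, 0.1])
print(result["person"][2], result["item"][2])   # misfit flags
```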
Research hypotheses

This study proposes four hypotheses based on the literature reviewed. First, Alexander et al. (2001) studied students in a computer technology course who completed either a PPT or a CBT in a proctored computer lab. The test scores were similar, but students in the computer-based group, particularly freshmen, completed the test in the least amount of time. Bodmann and Robinson (2004) investigated the effect of several different modes of test administration on scores and completion times, and the results indicate that undergraduates completed the computer-based tests faster than the paper-based tests, with no difference in scores. Stated formally:

Hypothesis 1. The average score of CBTs is equivalent to that of PPTs.

Second, the statistical analysis of Bradbard et al. (2004) indicates that the Coombs procedure is a viable alternative to the standard scoring procedure. In Ben-Simon et al.'s (1997) classification, NS performed very poorly in discriminating between full knowledge and absence of knowledge. Stated formally:

Hypothesis 2. ET detects the partial knowledge of examinees more effectively than NS.

Third, Bradbard and Green (1986) indicated that elimination testing lowers the amount of guesswork, and that the influence increases throughout the grading period. Stated formally:

Hypothesis 3. ET lowers the number of unexpected responses of examinees more effectively than NS.

Finally, most items in mathematics and chemistry testing need manual calculation, and the need to write down and calculate answers on draft paper might lower the answering speed (Ager, 1993). Stated formally:

Hypothesis 4. Different types of question content, such as calculation and concept, influence the performance of examinees on PPTs or CBTs.

Research variables

The independent variables used in this study and their operational definitions are as follows:
• Scoring mode: partial scoring and conventional dichotomous scoring, to analyze the influence of different answering and scoring schemes on partial knowledge.
• Testing tool: conventional PPTs versus CBTs, to recognize appropriate question types for CBTs.
The dependent variable adopted in this study is the students' performance, which is determined by test scores.

Experimental design

Tests were provided according to the two scoring modes and two testing tools, which were combined to form four treatments. Table 2 lists the multifactor design. Treatment 1 (T1) was CBTs using the ET scoring method; Treatment 2 (T2) was CBTs using the NS scoring method; Treatment 3 (T3) was PPTs using the ET scoring method; and Treatment 4 (T4) was PPTs using the NS scoring method.

Table 2. Multifactor design
        CBTs   PPTs
  ET    T1     T3
  NS    T2     T4

Data collection

The subjects of the experiment were 102 students in an introductory operations management module, a required course for the two junior classes in the Department of Information Management at National Kaohsiung First University of Science and Technology in Taiwan. All students were required to take all four exams in the course. A randomized block design, separating each class into two cohorts, was adopted for data collection before the first test. The students were thus separated into the following four groups: class A, cohort 1 (A1); class A, cohort 2 (A2); class B, cohort 1 (B1); and class B, cohort 2 (B2). The four tests were implemented separately using CBTs and PPTs. Each test was worth 25% of the final grade. The item contents were concept oriented and calculation oriented.
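To illustrate the randomized block assignment just described (a sketch under assumed rosters, not the authors' actual procedure), each class serves as a block and is split at random into two cohorts; the roster names and seed are hypothetical.

```python
import random

def split_into_cohorts(roster, seed=None):
    """Randomly split one class (the block) into two cohorts of roughly equal size."""
    students = list(roster)
    random.Random(seed).shuffle(students)
    half = len(students) // 2
    return students[:half], students[half:]

# Hypothetical 51-student roster for class A; class B would be handled the same way.
class_a = [f"A-{i:02d}" for i in range(1, 52)]
a1, a2 = split_into_cohorts(class_a, seed=2007)
print(len(a1), len(a2))   # 25 26
```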
Because all subjects participating in this study needed to take both NS- and ET-scored tests, they were given a lecture describing ET and several opportunities to practice, to prevent bias in personal scores due to unfamiliarity with the answering method, thus enhancing the reliability of this study. Students in each group were all given exactly the same questions.

Data analysis

Reliability

The split-half reliability coefficient was calculated to ensure internal consistency within a single test for each group. Linn and Gronlund (2000) proposed setting the reliability coefficient between 0.60 and 0.85. Table 3 lists the reliability coefficient for each treatment and cohort combination, and all Cronbach α values were in this range.

Table 3. Reliability coefficients (Cronbach α) for each group
  Test 1:  .6858 (T2, A1)   .6684 (T3, A2)   .6307 (T4, B1)   .6901 (T1, B2)
  Test 2:  .6649 (T4, A1)   .7011 (T1, A2)   .6053 (T2, B1)   .7791 (T3, B2)
  Test 3:  .7174 (T1, A1)   .7450 (T4, A2)   .8078 (T3, B1)   .7043 (T2, B2)
  Test 4:  .7270 (T3, A1)   .7358 (T2, A2)   .6341 (T1, B1)   .7041 (T4, B2)

Correlation analysis

The Pearson and Spearman correlation coefficients were derived to determine whether the response patterns of the examinees were consistent. Table 4 lists the correlation coefficients of ET and NS on each test, and Table 5 presents the correlation coefficients of CBTs and PPTs. The analytical results in both tables demonstrate a highly positive correlation for each scoring mode and test tool.

Table 4. Correlation coefficients of ET vs. NS
            Pearson               Spearman
            CBTs       PPTs       CBTs       PPTs
  Test 1    0.894**    0.903**    0.887**    0.902**
  Test 2    0.698**    0.692**    0.679**    0.722**
  Test 3    0.747**    0.787**    0.676**    0.833**
  Test 4    0.871**    0.768**    0.887**    0.743**
  ** indicates significance at the 0.01 level (two-tailed test).

Table 5. Correlation coefficients of CBTs vs. PPTs
            Pearson               Spearman
            NS         ET         NS         ET
  Test 1    .868**     .836**     .875**     .800**
  Test 2    .759**     .846**     .791**     .803**
  Test 3    .840**     .805**     .802**     .766**
  Test 4    .832**     .830**     .842**     .796**
  ** indicates significance at the 0.01 level (two-tailed test).

Hypothesis testing

One-way analysis of variance (ANOVA) was adopted to test Hypothesis 1, that the average score of CBTs is equivalent to that of PPTs. Table 6 shows that all p-values are greater than 0.05: no comparison is statistically significant at the 5% level, so there is insufficient evidence to reject Hypothesis 1.

Table 6. One-way ANOVA comparing PPT and CBT mean scores
            Scoring mode   Mean (PPTs)   Mean (CBTs)   F-statistic   p-value
  Test 1    NS             45.60         42.84         1.622         .209
            ET             27.16         32.08         1.509         .225
  Test 2    NS             55.12         55.44         0.16          .901
            ET             43.32         39.72         .671          .417
  Test 3    NS             47.04         49.44         .495          .485
            ET             44.12         45.70         .143          .707
  Test 4    NS             41.12         36.48         1.688         .200
            ET             30.72         24.40         2.223         .143
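For readers who want to reproduce this kind of comparison, the sketch below runs a two-group one-way ANOVA with SciPy; the score lists are invented for illustration, since the per-student scores behind Table 6 are not published.

```python
from scipy import stats

# Hypothetical NS scores for one test; the study's raw scores are not published.
ppt_scores = [48, 51, 39, 45, 54, 42, 47, 50]
cbt_scores = [44, 49, 41, 46, 38, 52, 45, 40]

# With two groups, one-way ANOVA is equivalent to an independent-samples t-test.
# A p-value above 0.05 would, as in Table 6, give no grounds to reject
# Hypothesis 1 that the PPT and CBT means are equivalent.
f_stat, p_value = stats.f_oneway(ppt_scores, cbt_scores)
print(f"F = {f_stat:.3f}, p = {p_value:.3f}")
```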
Hypothesis 2: ET detects the partial knowledge of examinees more effectively than NS. According to the scoring classification of Bradbard and Green (1986) and the classification of Ben-Simon et al. (1997), Table 7 summarizes the average number of items in each NS and ET scoring category. For Test 1, for example, the average number of correct answers for class A, cohort 1 was 14.28; for class A, cohort 2, the average number of answers revealing full knowledge was 11.67, the number indicating partial knowledge was 2.38, the number revealing absence of knowledge was 1.42, the number indicating partial misinformation was 9.25, and the number revealing full misinformation was 0.29.

Table 7. Average number of items per scoring category
            NS            ET full       ET partial    ET absence      ET partial         ET full
            correct       knowledge     knowledge     of knowledge    misinformation     misinformation
  Test 1    14.28 (A1)    11.67 (A2)    2.38          1.42            9.25               0.29
            15.20 (B1)    12.48 (B2)    2.56          1.36            8.44               0.16
  Test 2    18.38 (A1)    14.16 (A2)    2.40          2.20            6.20               0.04
            18.48 (B1)    15.12 (B2)    2.40          2.24            5.08               0.16
  Test 3    15.88 (A2)    15.33 (A1)    3.46          1.13            4.83               0.25
            16.48 (B2)    15.80 (B1)    2.12          0.96            6.08               0.04
  Test 4    12.16 (A2)    11.08 (A1)    3.63          2.83            7.38               0.08
            13.16 (B2)    10.36 (B1)    2.56          2.68            9.08               0.32

To test whether ET can effectively detect the partial knowledge of examinees, the number of correct NS items was compared with the number of full-knowledge ET items in each test. Examinees who did not know the correct answer to an NS test item would have either guessed or given up answering. The number of correctly answered items is therefore composed of (1) correct responses from lucky blind guesses, and (2) correct responses based on the examinees' knowledge. Burton (2002) proposed that conventional MC scoring can be described by the equation B = K + k + R, where B is the number of correct items; K is the number of correct items for which an examinee possesses accurate knowledge; k is the number of correct items for which an examinee possesses partial knowledge and guesses correctly; and R is the number of correct items for which the examinee has no knowledge but makes a lucky blind guess. Burton considered that the examinee could delete distractors and thereby increase the proportion of correct guesses based on partial knowledge. Our study randomly assigns subjects to each group, so the groups' abilities should be approximately equivalent. Table 7 demonstrates that the number of correct NS items for each of the four tests is greater than the number of full-knowledge ET items, which shows that ET can distinguish between full knowledge and partial knowledge by partial scoring.
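The gap between the NS "correct" and ET "full knowledge" columns can be illustrated numerically with Burton's decomposition; the sketch below is our arithmetic illustration, and the assumed split of 25 items into knowledge states is hypothetical.

```python
def expected_ns_correct(full_known, partial, unknown, n_options=4):
    """Expected number-right score B = K + k + R (Burton, 2002).

    full_known -- items answered from full knowledge (contributes K)
    partial    -- items where all but two options can be eliminated, so a
                  guess succeeds with probability 1/2 (contributes k)
    unknown    -- items answered by blind guessing at 1/n_options (contributes R)
    The two-option elimination assumption is ours, for illustration only.
    """
    return full_known + partial * 0.5 + unknown / n_options

# 25 items: 12 fully known, 8 held with partial knowledge, 5 blind guesses.
# NS is expected to credit about 17 items even though only 12 reflect full
# knowledge -- the pattern in Table 7, where the NS correct column exceeds
# the ET full-knowledge column on every test.
print(expected_ns_correct(12, 8, 5))   # 17.25
```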
Hypothesis 3: ET lowers the number of unexpected responses of examinees more effectively than NS. Table 8 shows the number of unexpected responses, based on the calculation described in the research method section (equations (1) to (8)), and reveals that the number of unexpected responses under NS is greater than under ET for each test, which demonstrates that ET reduces the unexpected responses of examinees.

Table 8. Number of unexpected responses
            NS    ET
  Test 1    25    17
  Test 2    10     9
  Test 3    17     7
  Test 4    29    15

Next, one-way ANOVA was adopted to test Hypothesis 4, that different question contents, such as calculation and concept, influence the performance of examinees on PPTs or CBTs. Table 9 shows that the p-value is greater than 0.05 for each subgroup. Thus, the experimental results do not support Hypothesis 4, and indicate that different question content does not influence the performance of examinees on PPTs or CBTs. This result did not confirm Ager's (1993) study. The first reason might be the course content: an introductory course focused on the principles and concepts of operations management rather than on extensive calculation. The second explanation is that the subjects in this study are all students from the information management department. These students are quite used to interacting with computer interfaces; consequently, the performance difference between PPTs and CBTs is insignificant.

Table 9. One-way ANOVA comparing concept and calculation items
                    Scoring mode   Mean (concept)   Mean (calculation)   F-statistic   p-value
  Test 1   PPTs     NS             1.97             1.87                 .163          .688
                    ET             1.22             1.45                 .744          .394
           CBTs     NS             1.08             .90                  .206          .666
                    ET             .385             .39                  .000          .989
  Test 2   PPTs     NS             2.15             2.13                 .014          .907
                    ET             1.83             1.64                 .552          .462
           CBTs     NS             2.34             2.45                 .304          .591
                    ET             1.49             1.46                 .007          .936
  Test 3   PPTs     NS             1.91             1.95                 .054          .818
                    ET             1.85             1.80                 .072          .789
           CBTs     NS             2.24             2.20                 .005          .944
                    ET             2.17             2.06                 .052          .831
  Test 4   PPTs     NS             1.62             1.51                 .248          .622
                    ET             1.37             .90                  3.222         .084
           CBTs     NS             1.68             1.40                 1.229         .281
                    ET             1.05             1.08                 .005          .947

Conclusion

This study demonstrates the feasibility of adopting ET and CBTs to replace conventional NS and PPTs. Using the same MC test item constructs, the researchers investigated the difference in examinee performance between partial scoring by elimination testing and the conventional dichotomous scoring method. This study first built a computer-based assessment system, and then adopted an experimental design to record examinees' performance with different testing tools and scoring approaches.

One-way ANOVA does not show sufficient evidence to reject the hypothesis that the performance of students taking CBTs is the same as the performance of students taking PPTs. This finding agrees with Alexander et al. (2001) and Bodmann and Robinson (2004). We conclude that no discrepancy exists between the performance of examinees who take PPTs and that of those who take CBTs when ET is used. Next, the number of correct NS items was compared with the number of full-knowledge ET items in each test. The data analysis demonstrates that the number of correct NS items is greater than the number of full-knowledge ET items for each test, which shows that ET can distinguish between full knowledge and partial knowledge by partial scoring. The number of unexpected responses, calculated by the Rasch model based on item response theory for dichotomous scoring and by the partial credit model based on graded item response for elimination testing to estimate the examinee ability and item difficulty parameters, reveals that the number of unexpected responses under NS is greater than under ET for each test, which demonstrates that ET reduces the unexpected responses of examinees. ET scoring is helpful whenever examinees' partial knowledge and unexpected responses are of concern. Moreover, instructors can assess examinees' partial knowledge more accurately by adopting ET and CBTs, which is not only helpful in teaching but also increases examinees' eagerness and willingness to learn.

Experimental results also indicate that different question content does not influence the performance of examinees on PPTs or CBTs. This result did not confirm Ager's (1993) study. The first reason might be the course content: an introductory course focused on the principles and concepts of operations management rather than on extensive calculation.
The second explanation is that the subjects in this study are all students from the information management department. These students are quite used to interacting with computer interfaces; consequently, the difference in their performance on PPTs and CBTs is insignificant.

We remain aware that the validity of any experimental study is limited to the scope of the experiment. Since the study involved only two classes and one course, more comparison tests could be performed on other content subjects to identify those suited for adopting ET and CBTs in place of conventional NS and PPTs. The trend in e-learning technologies and system development is toward the creation of standards-based distributed computing applications. To maintain a high-quality testbank and promote the sharing and reuse of test items, further research should consider incorporating the IMS Question and Test Interoperability (QTI) specification, which describes a basic structure for the representation of assessment data, groups, questions, and results, and would allow the CBT system to go further by using the Shareable Content Object Reference Model (SCORM) and the IMS metadata specification.

References

Abu-Sayf, F. K. (1979). Recent developments in the scoring of multiple-choice items. Educational Review, 31, 269–270.

Ager, T. (1993). Online placement testing in mathematics and chemistry. Journal of Computer-Based Instruction, 20 (2), 52–57.

Akeroyd, F. M. (1982). Progress in multiple-choice scoring methods. Journal of Further and Higher Education, 6, 87–90.

Alexander, M. W., Bartlett, J. E., Truell, A. D., & Ouwenga, K. (2001). Testing in a computer technology course: An investigation of equivalency in performance between online and paper and pencil methods. Journal of Career and Technical Education, 18 (1), 69–80.

Baker, F. B. (1992). Item response theory: Parameter estimation techniques. New York, NY: Marcel Dekker.

Ben-Simon, A., Budescu, D. V., & Nevo, B. (1997). A comparative study of measures of partial knowledge in multiple-choice tests. Applied Psychological Measurement, 21 (1), 65–88.

Bodmann, S. M., & Robinson, D. H. (2004). Speed and performance differences among computer-based and paper-pencil tests. Journal of Educational Computing Research, 31 (1), 51–60.

Bond, T. G., & Fox, C. M. (2001). Applying the Rasch model: Fundamental measurement in the human sciences. Mahwah, NJ: Lawrence Erlbaum Associates.

Bradbard, D. A., & Green, S. B. (1986). Use of the Coombs elimination procedure in classroom tests. Journal of Experimental Education, 54, 68–72.

Bradbard, D. A., Parker, D. F., & Stone, G. L. (2004). An alternate multiple-choice scoring procedure in a macroeconomics course. Decision Sciences Journal of Innovative Education, 2 (1), 11–26.

Bugbee, A. C. (1996). The equivalence of PPTs and computer-based testing. Journal of Research on Computing in Education, 28 (3), 282–299.

Burton, R. F. (2002). Misinformation, partial knowledge and guessing in true/false tests. Medical Education, 36, 805–811.

Bush, M. (2001). A multiple choice test that rewards partial knowledge. Journal of Further and Higher Education, 25 (2), 157–163.

Chan, N., & Kennedy, P. E. (2002). Are multiple-choice exams easier for economics students? A comparison of multiple-choice and "equivalent" constructed-response exam questions. Southern Economic Journal, 68 (4), 957–971.

Coombs, C. H., Milholland, J. E., & Womer, F. B. (1956). The assessment of partial knowledge. Educational and Psychological Measurement, 16, 13–37.
Dressel, P. L., & Schmid, J. (1953). Some modifications of the multiple-choice item. Educational and Psychological Measurement, 13, 574–595.

Gretes, J. A., & Green, M. (2000). Improving undergraduate learning with computer-assisted assessment. Journal of Research on Computing in Education, 33 (1), 46–4.

Hagler, A. S., Norman, G. J., Radick, L. R., Calfas, K. J., & Sallis, J. F. (2005). Comparability and reliability of paper- and computer-based measures of psychosocial constructs for adolescent fruit and vegetable and dietary fat intake. Journal of the American Dietetic Association, 105 (11), 1758–1764.

Haladyna, T. M., & Downing, S. M. (1989). A taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2, 37–50.

Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and applications. Boston, MA: Kluwer-Nijhoff.

Inouye, D. K., & Bunderson, C. V. (1986). Four generations of computerized test administration. Machine-Mediated Learning, 1, 355–371.

Jaradat, D., & Sawaged, S. (1986). The subset selection technique for multiple-choice tests: An empirical inquiry. Journal of Educational Measurement, 23 (4), 369–376.

Jaradat, D., & Tollefson, N. (1988). The impact of alternative scoring procedures for multiple-choice items on test reliability, validity, and grading. Educational and Psychological Measurement, 48, 627–635.

Kehoe, J. (1995). Basic item analysis for multiple-choice tests. Practical Assessment, Research & Evaluation, 4 (10), retrieved October 15, 2007, from http://PAREonline.net/getvn.asp?v=4&n=10.

Linacre, J. M., & Wright, B. D. (1994). Chi-square fit statistics. Rasch Measurement Transactions, 8 (2), 360, retrieved October 15, 2007, from http://rasch.org/rmt/rmt82.htm.

Linn, R. L., & Gronlund, N. E. (2000). Measurement and assessment in teaching (8th ed.). Upper Saddle River, NJ: Prentice-Hall.

Mazzeo, J., & Harvey, A. L. (1988). The equivalence of scores from automated and conventional educational and psychological tests. New York, NY: College Board Publications.

Wongwiwatthananukit, S., Popovich, N. G., & Bennett, D. E. (2000). Assessing pharmacy student knowledge on multiple-choice examinations using partial-credit scoring of combined-response multiple-choice items. American Journal of Pharmaceutical Education, 64 (1), 1–10.

Wright, B. D., & Masters, G. N. (1982). Rating scale analysis. Chicago, IL: MESA Press.

Wright, B. D., & Stone, M. H. (1979). Best test design. Chicago, IL: MESA Press.

Zhu, W. (1996). Should total scores from a rating scale be used directly? Research Quarterly for Exercise and Sport, 67 (3), 363–372.

Zhu, W., & Cole, E. L. (1996). Many-faceted Rasch calibration of a gross motor instrument. Research Quarterly for Exercise and Sport, 67 (1), 24–34.