Item Analysis Assumptions

FACT SHEET 24A
Title: Item Analysis Assumptions (Difficulty & Discrimination Indexes)
Date: June 2008
Details: Mr. Imran Zafar, Database Administrator, Assessment Unit, Dept of Medical
Education. Ext. 47142
Introduction:
It is widely believed that "Assessment drives the curriculum". Hence it can be argued that if
the quality of teaching, training, and learning is to be upgraded, assessment is the obvious
starting point. However, upgrading assessment is a continuous process. The cycle of planning
and constructing assessment tools, followed by testing, validating, and reviewing, has to be
repeated continuously.
When tests are developed for instructional purposes, to assess the effects of educational
programs, or for educational research purposes, it is very important to conduct item and test
analyses. These analyses evaluate the quality of the items and of the test as a whole. Such
analyses can also be employed to revise and improve both items and the test as a whole.
Quantitative item analysis is a technique that enables us to assess the quality or utility of an
item. It does so by identifying distractors or response options that are underperforming.
Item-analysis procedures are intended to maximize test reliability. Because maximization of
test reliability is accomplished by determining the relationship between individual items and
the test as a whole, it is important to ensure that the overall test is measuring what it is supposed
to measure. If this is not the case, the total score will be a poor criterion for evaluating each
item.
The use of a multiple-choice format for hour exams at many institutions leads to a deluge of
statistical data, which are often neglected or completely ignored. This paper will introduce
some of the terms encountered in the analysis of test results, so that these data may become
more meaningful and therefore more useful.
Need for Item Analysis
1) Provision of information about how the quality of test items compares. This comparison is necessary if subsequent tests of the same material are to be better.
2) Provision of diagnostic information about the types of items that students most often get incorrect. This information can be used as a basis for making instructional decisions.
3) Provision of a rational basis for discussing test results with students.
4) Communication to the test developer of which items need to be improved or eliminated and replaced with better items.
What is the Output of Item Analysis?
Item analysis could yield the following outputs:
• Distribution of responses for each distractor of each item, or frequencies of responses (histogram) – see the sketch after this list.
• Difficulty index for each item of the test.
• Discrimination index for each item of the test.
• Measure of exam internal consistency reliability.
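For illustration, here is a minimal sketch (in Python, using an entirely hypothetical set of responses and answer key) of how the response distribution for a single five-option item can be tallied:

    from collections import Counter

    # Hypothetical choices of 20 students on one five-option item (correct key = "C").
    responses = list("CCBCACDCCBCCECCACCDC")

    counts = Counter(responses)                  # frequency of each chosen option
    for option in "ABCDE":
        print(option, counts.get(option, 0))     # A 2, B 2, C 13, D 2, E 1

A distractor that is almost never chosen is exactly the kind of underperforming response option that quantitative item analysis is meant to flag.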
Total Score Frequencies
Issues to consider when interpreting the distribution of students’ total scores:
Distribution:
• Is this the distribution you expected?
• Was the test easier or more difficult than you anticipated?
• How does the mean score of this year's class compare to scores from previous classes?
• Is there a ceiling effect – that is, are all scores close to the top?
• Is there a floor effect – that is, are all scores close to the lowest possible score?
Spread of Scores:
• Is the spread of scores large?
• Are there students who are scoring low marks compared to the majority of the students?
• Can you determine why they are not doing as well as most other students? Can you provide any extra assistance?
• Is there a group of students who are well ahead of the other students?
Difficulty Index
The difficulty index tells us how easy the item was for the students in that particular group. The higher
the difficulty index, the easier the question; the lower the difficulty index, the more difficult
the question. The difficulty index is, in effect, an "easiness index".
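As a minimal sketch (assuming items have already been scored 0/1 and using made-up data), the difficulty index of an item is simply the proportion of students in the group who answered it correctly:

    # Hypothetical 0/1 scores of 10 students on a single item (1 = correct).
    item_scores = [1, 1, 0, 1, 1, 1, 0, 1, 1, 0]

    difficulty_index = sum(item_scores) / len(item_scores)   # proportion correct
    print(round(difficulty_index, 2))                        # 0.7 -> "Easy" in the table below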
Issues to consider in relation to the Difficulty Index
• Are the easiest items in your test, i.e. those with the lowest difficulty ranking, the first items in the test?
• If the more difficult items occur at the start of the test, students can become upset because they feel, early on, that they cannot succeed.
The literature quotes the following (generalized) interpretation of the Difficulty Index:

Index Range    Inference about the Question
0.85 – 1.00    Very Easy
0.70 – 0.84    Easy
0.30 – 0.69    Optimum
0.15 – 0.29    Hard
0.00 – 0.14    Very Hard
Item Discrimination Index (DI)
This is calculated by subtracting the proportion of students who answered the item correctly in the
lower group from the proportion correct in the upper group. It is assumed that persons in the top third
on total scores should have a greater proportion answering the item correctly than those in the lower third.
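A minimal sketch of this upper-minus-lower calculation, using hypothetical (total score, item score) pairs and thirds of the ranked group:

    # Hypothetical (total_score, item_correct) pairs for 9 students.
    students = [(95, 1), (88, 1), (80, 1),
                (72, 1), (65, 0), (60, 1),
                (55, 0), (48, 0), (40, 1)]

    students.sort(key=lambda s: s[0], reverse=True)   # rank students by total test score
    n = len(students) // 3                            # size of the upper and lower thirds
    upper = [item for _, item in students[:n]]        # item scores of the top third
    lower = [item for _, item in students[-n:]]       # item scores of the bottom third

    di = sum(upper) / n - sum(lower) / n              # difference of proportions correct
    print(round(di, 2))                               # 1.00 - 0.33 = 0.67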
The calculation of the index is an approximation of a correlation between the scores on an
item and the total score. Therefore, the DI is a measure of how successfully an item
discriminates between students of different abilities on the test as a whole. Any item which
does not discriminate between the lower and upper groups of students would have a DI = 0. An
item where the lower group performed better than the upper group would have a negative DI.
The discrimination index is affected by the difficulty of an item, because by definition, if an
item is very easy everyone tends to get it right and it does not discriminate. Likewise, if it is
very difficult everyone tends to get it wrong. Such items can be important to have in a test
because they help define the range of difficulty of concepts assessed. Items should not be
discarded just because they do not discriminate.
Issues to consider in relation to the Item Discrimination Index
• Are there any items with a negative discrimination index (DI)? That is, items where students in the lower third of the group did better than students in the upper third of the group?
• Was this a deceptively easy item?
• Was the correct answer key used?
• Are there any items that do not discriminate between the students, i.e. where the DI is 0.00 or very close to 0.00?
• Are these items either very hard or very easy, and therefore items where you could expect a DI of 0?
The literature quotes the following (generalized) interpretation of the Discrimination Index:

Index Range    Inference about the Question
Below 0.19     Poor
0.20 – 0.29    Dubious
0.30 – 1.00    Okay
FACT SHEET 24B
Title: Item Analysis Assumptions
Measure of exam internal consistency (reliability) Kuder-Richardson 20 (KR20)
Date: June 2008
Details: Mr. Imran Zafar, Database Administrator, Assessment Unit, Dept of Medical
Education. Ext. 47142
Test Validity and Reliability
The two factors that determine overall test quality are test validity and test reliability.
Test reliability measures the accuracy, stability, and consistency of the test scores. Reliability
is affected by the characteristics of the students, the characteristics of the test, and the
conditions affecting test administration and scoring.
Test validity is the appropriateness of the test for the subject area and students being tested.
Validity cannot be measured by a computer. It is up to the instructor to design valid test
items that best measure the intended subject area. By definition, valid tests are reliable.
However, a reliable test is not necessarily valid. For example, a math test composed entirely
of word problems may be measuring verbal skill as much as math ability.
The reliability of a test refers to the extent to which the test is likely to produce consistent
scores. The KR-20 index is the appropriate index of test reliability for multiple-choice
examinations.
What does the KR-20 measure?
The KR-20 is a measure of internal consistency reliability or how well your exam measures a
single cognitive factor. If you administer an Embryology exam, you hope all test items relate
to this broad construct. Similarly, an Obstetrics/Gynecology test is designed to measure this
medical specialty.
Reliability coefficients theoretically range in value from zero (no reliability) to 1.00 (perfect
reliability). In practice, their approximate range is from .50 to .90 for about 95% of the
classroom tests.
High reliability means that the questions of a test tended to "pull together." Students who
answered a given question correctly were more likely to answer other questions correctly. If a
parallel test were developed by using similar items, the relative scores of students would
show little change.
Low reliability means that the questions tended to be unrelated to each other in terms of who
answered them correctly. The resulting test scores reflect peculiarities of the items or the
testing situation more than students' knowledge of the subject matter.
The KR-20 formula includes:
(1) the number of test items on the exam,
(2) student performance on every test item, and
(3) the variance (standard deviation squared) for the set of student test scores.
The index ranges from 0.00 to 1.00. A value close to 0.00 means you are measuring many
unknown factors but not what you intended to measure. You are close to measuring a single
factor when your KR-20 is near 1.00. Most importantly, we can be confident that an exam
with a high KR-20 has yielded student scores that are reliable (i.e., reproducible or consistent;
or as psychometricians say, the true score). A medical school test should have a KR-20 of
0.60 or better to be acceptable.
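A minimal sketch of how these three ingredients combine, assuming the standard KR-20 formula, KR-20 = k/(k-1) x (1 - sum(p*q) / variance of total scores), and a hypothetical 0/1 score matrix:

    # Hypothetical 0/1 score matrix: 5 students (rows) x 4 items (columns).
    scores = [[1, 1, 1, 0],
              [1, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 0, 0, 0],
              [1, 1, 1, 1]]

    k = len(scores[0])                                              # number of test items
    totals = [sum(row) for row in scores]                           # each student's total score
    mean = sum(totals) / len(totals)
    variance = sum((t - mean) ** 2 for t in totals) / len(totals)   # variance of total scores

    pq_sum = 0.0
    for i in range(k):
        p = sum(row[i] for row in scores) / len(scores)             # proportion correct on item i
        pq_sum += p * (1 - p)                                       # q = 1 - p

    kr20 = (k / (k - 1)) * (1 - pq_sum / variance)
    print(round(kr20, 2))                                           # about 0.70 for these made-up data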
How do you interpret the KR-20 value?
Reliability      Interpretation
.90 and above    Excellent reliability; at the level of the best standardized tests.
.80 – .90        Very good for a classroom test.
.70 – .80        Good for a classroom test; in the range of most. There are probably a few items which could be improved.
.60 – .70        Somewhat low. This test needs to be supplemented by other measures (e.g., more tests) to determine grades. There are probably some items which could be improved.
.50 – .60        Suggests need for revision of the test, unless it is quite short (ten or fewer items). The test definitely needs to be supplemented by other measures (e.g., more tests) for grading.
.50 or below     Questionable reliability. This test should not contribute heavily to the course grade, and it needs revision.
Standard Error of Measurement
The standard error of measurement is directly related to the reliability of the test. It is an
index of the amount of variability in an individual student's performance due to random
measurement error. If it were possible to administer an infinite number of parallel tests, a
student's score would be expected to change from one administration to the next due to a
number of factors. For each student, the scores would form a "normal" (bell-shaped)
distribution. The mean of the distribution is assumed to be the student's "true score," and
reflects what he or she "really" knows about the subject. The standard deviation of the
distribution is called the standard error of measurement and reflects the amount of change in
the student's score which could be expected from one test administration to another.
Whereas the reliability of a test always varies between 0.00 and 1.00, the standard error of
measurement is expressed in the same scale as the test scores. For example, multiplying all
test scores by a constant will multiply the standard error of measurement by that same
constant, but will leave the reliability coefficient unchanged.
A general rule of thumb to predict the amount of change which can be expected in individual
test scores is to multiply the standard error of measurement by 1.5. Only rarely would one
expect a student's score to increase or decrease by more than that amount between two such
similar tests. The smaller the standard error of measurement, the more accurate the
measurement provided by the test.
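The fact sheet does not give the formula, but the standard error of measurement is conventionally computed as the standard deviation of the test scores multiplied by the square root of (1 - reliability). A minimal sketch with hypothetical values:

    import math

    # Hypothetical values: standard deviation of the test scores and KR-20 reliability.
    score_sd = 8.0
    kr20 = 0.84

    sem = score_sd * math.sqrt(1 - kr20)   # conventional SEM formula
    print(round(sem, 2))                   # 3.2 score points
    print(round(1.5 * sem, 2))             # 4.8 -> rule-of-thumb change to expect (1.5 x SEM)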
A CAUTION in Interpreting Item Analysis Results
Each of the various item statistics provides information which can be used to improve
individual test items and to increase the quality of the test as a whole. Such statistics must
always be interpreted in the context of the type of test given and the individuals being tested.
W. A. Mehrens and I. J. Lehmann provide the following set of cautions in using item analysis
results (Measurement and Evaluation in Education and Psychology. New York: Holt,
Rinehart and Winston, 1973, 333-334):
1. Item analysis data are not synonymous with item validity. An external criterion is required to accurately judge the validity of test items. By using the internal criterion of total test score, item analyses reflect the internal consistency of items rather than validity.
2. The discrimination index is not always a measure of item quality. There is a variety of reasons an item may have low discriminating power:
   a) extremely difficult or easy items will have low ability to discriminate, but such items are often needed to adequately sample course content and objectives;
   b) an item may show low discrimination if the test measures many different content areas and cognitive skills. For example, if the majority of the test measures "knowledge of facts," then an item assessing "ability to apply principles" may have a low correlation with total test score, yet both types of items are needed to measure attainment of course objectives.
3. Item analysis data are tentative. Such data are influenced by the type and number of students being tested, the instructional procedures employed, and chance errors. If repeated use of items is possible, statistics should be recorded for each administration of each item.
Conclusions
Developing the perfect test is the unattainable goal for anyone in an evaluative position. Even
when guidelines for constructing fair and systematic tests are followed, a plethora of factors
may enter into a student's perception of the test items. Looking at an item's difficulty and
discrimination will assist the test developer in determining what is wrong with individual
items. Item and test analysis provide empirical data about how individual items and whole
tests are performing in real test situations.
One of the principal advantages of having MCQ tests scored and analyzed by computer at
College of Medicine, King Saud bin Abdulaziz University for Health Sciences is the
feedback available on how well the test has performed. Careful consideration of the results of
item analysis can lead to significant improvements in the quality of exams.
Taken in parts from:
1. Office of Educational Assessment, University of Washington. http://www.washington.edu/oea/score1.htm
2. Jerard Kehoe, "Basic Item Analysis for Multiple-Choice Tests," Virginia Polytechnic Institute and State University.
3. Susan Matlock-Hetzel, "Basic Concepts in Item and Test Analysis," Texas A&M University, January 1997.
4. Christina Ballantyne, "Multiple Choice Tests: Test Scoring and Analysis." http://www.tlc.murdoch.edu.au/eddev/evaluation/mcq/score.html
5. http://chemed.chem.purdue.edu/chemed/stats.html
6. A paper published in the Journal of Chemical Education, 1980, 57, 188-190.
Since this collection of brief tips and examples is drawn from a number of websites and from the literature, the authorship of the contents rests with the original writers.
Imran Zafar
June 2008