Item Analysis Assumptions

FACT SHEET 24A
Title: Item Analysis Assumptions (Difficulty & Discrimination Indexes)
Date: June 2008
Details: Mr. Imran Zafar, Database Administrator, Assessment Unit, Dept of Medical
Education. Ext. 47142
Introduction:
It is widely believed that "Assessment drives the curriculum". Hence it can be argued that if
the quality of teaching, training, and learning is to be upgraded, assessment is the obvious
starting point. However, upgrading assessment is a continuous process. The cycle of planning
and constructing assessment tools, followed by testing, validating, and reviewing, has to be
repeated continuously.
When tests are developed for instructional purposes, to assess the effects of educational
programs, or for educational research purposes, it is very important to conduct item and test
analyses. These analyses evaluate the quality of the items and of the test as a whole. Such
analyses can also be employed to revise and improve both items and the test as a whole.
Quantitative item analysis is a technique that enables us to assess the quality or utility of an
item. It does so by identifying distractors or response options that are underperforming.
Item-analysis procedures are intended to maximize test reliability. Because maximization of
test reliability is accomplished by determining the relationship between individual items and
the test as a whole, it is important to ensure that the overall test is measuring what it is supposed
to measure. If this is not the case, the total score will be a poor criterion for evaluating each
item.
The use of a multiple-choice format for hour exams at many institutions leads to a deluge of
statistical data, which are often neglected or completely ignored. This paper will introduce
some of the terms encountered in the analysis of test results, so that these data may become
more meaningful and therefore more useful.
Need for Item Analysis
1) Provision of information about how the quality of test items compares. This comparison is necessary if subsequent tests of the same material are to be better.
2) Provision of diagnostic information about the types of items that students most often get incorrect. This information can be used as a basis for making instructional decisions.
3) Provision of a rational basis for discussing test results with students.
4) Communication to the test developer of which items need to be improved or eliminated and replaced with better items.
What is the Output of Item Analysis?
Item analysis could yield the following outputs:
• Distribution of responses for each distractor of each item, or frequencies of responses (histogram) – see the sketch after this list.
• Difficulty index for each item of the test.
• Discrimination index for each item of the test.
• Measure of exam internal consistency reliability.
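For illustration, here is a minimal sketch (in Python, using an entirely hypothetical set of responses and answer key) of how the response distribution for a single five-option item can be tallied:

    from collections import Counter

    # Hypothetical choices of 20 students on one five-option item (correct key = "C").
    responses = list("CCBCACDCCBCCECCACCDC")

    counts = Counter(responses)                  # frequency of each chosen option
    for option in "ABCDE":
        print(option, counts.get(option, 0))     # A 2, B 2, C 13, D 2, E 1

A distractor that is almost never chosen is exactly the kind of underperforming response option that quantitative item analysis is meant to flag.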
Total Score Frequencies
Issues to consider when interpreting the distribution of students’ total scores:
Distribution:
• Is this the distribution you expected?
• Was the test easier or more difficult than you anticipated?
• How does the mean score of this year's class compare to scores from previous classes?
• Is there a ceiling effect – that is, are all scores close to the top?
• Is there a floor effect – that is, are all scores close to the lowest possible score?
Spread of Scores:
• Is the spread of scores large?
• Are there students who are scoring low marks compared to the majority of the students?
• Can you determine why they are not doing as well as most other students? Can you provide any extra assistance?
• Is there a group of students who are well ahead of the other students?
Difficulty Index
The difficulty index tells us how easy the item was for the students in that particular group. The higher
the difficulty index, the easier the question; the lower the difficulty index, the more difficult
the question. The difficulty index is, in effect, an "easiness index".
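As a minimal sketch (assuming items have already been scored 0/1 and using made-up data), the difficulty index of an item is simply the proportion of students in the group who answered it correctly:

    # Hypothetical 0/1 scores of 10 students on a single item (1 = correct).
    item_scores = [1, 1, 0, 1, 1, 1, 0, 1, 1, 0]

    difficulty_index = sum(item_scores) / len(item_scores)   # proportion correct
    print(round(difficulty_index, 2))                        # 0.7 -> "Easy" in the table below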
Issues to consider in relation to the Difficulty Index
• Are the easiest items in your test, i.e. those with the lowest difficulty ranking, the first items in the test?
• If the more difficult items occur at the start of the test, students can become upset because they feel, early on, that they cannot succeed.
The literature quotes the following (generalized) interpretation of the Difficulty Index:

Index Range    Inference about the Question
0.85 – 1.00    Very Easy
0.70 – 0.84    Easy
0.30 – 0.69    Optimum
0.15 – 0.29    Hard
0.00 – 0.14    Very Hard
Item Discrimination Index (DI)
This is calculated by subtracting the proportion of students who answered the item correctly in the
lower group from the proportion correct in the upper group. It is assumed that persons in the top third
on total scores should have a greater proportion answering the item correctly than those in the lower third.
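A minimal sketch of this upper-minus-lower calculation, using hypothetical (total score, item score) pairs and thirds of the ranked group:

    # Hypothetical (total_score, item_correct) pairs for 9 students.
    students = [(95, 1), (88, 1), (80, 1),
                (72, 1), (65, 0), (60, 1),
                (55, 0), (48, 0), (40, 1)]

    students.sort(key=lambda s: s[0], reverse=True)   # rank students by total test score
    n = len(students) // 3                            # size of the upper and lower thirds
    upper = [item for _, item in students[:n]]        # item scores of the top third
    lower = [item for _, item in students[-n:]]       # item scores of the bottom third

    di = sum(upper) / n - sum(lower) / n              # difference of proportions correct
    print(round(di, 2))                               # 1.00 - 0.33 = 0.67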
The calculation of the index is an approximation of a correlation between the scores on an
item and the total score. Therefore, the DI is a measure of how successfully an item
discriminates between students of different abilities on the test as a whole. Any item which
does not discriminate between the lower and upper groups of students would have a DI = 0. An
item where the lower group performed better than the upper group would have a negative DI.
The discrimination index is affected by the difficulty of an item, because by definition, if an
item is very easy everyone tends to get it right and it does not discriminate. Likewise, if it is
very difficult everyone tends to get it wrong. Such items can be important to have in a test
because they help define the range of difficulty of concepts assessed. Items should not be
discarded just because they do not discriminate.
Issues to consider in relation to the Item Discrimination Index
• Are there any items with a negative discrimination index (DI)? That is, items where students in the lower third of the group did better than students in the upper third of the group?
• Was this a deceptively easy item?
• Was the correct answer key used?
• Are there any items that do not discriminate between the students, i.e. where the DI is 0.00 or very close to 0.00?
• Are these items either very hard or very easy, and therefore items where you could expect a DI of 0?
The literature quotes the following (generalized) interpretation of the Discrimination Index:

Index Range    Inference about the Question
Below 0.19     Poor
0.20 – 0.29    Dubious
0.30 – 1.00    Okay
FACT SHEET 24B
Title: Item Analysis Assumptions
Measure of exam internal consistency (reliability) Kuder-Richardson 20 (KR20)
Date: June 2008
Details: Mr. Imran Zafar, Database Administrator, Assessment Unit, Dept of Medical
Education. Ext. 47142
Test Validity and Reliability
The two factors that determine overall test quality are test validity and test reliability.
Test reliability measures the accuracy, stability, and consistency of the test scores. Reliability
is affected by the characteristics of the students, the characteristics of the test, and the
conditions affecting test administration and scoring.
Test validity is the appropriateness of the test for the subject area and students being tested.
Validity cannot be measured by a computer. It is up to the instructor to design valid test
items that best measure the intended subject area. By definition, valid tests are reliable.
However, a reliable test is not necessarily valid. For example, a math test composed entirely
of word problems may be measuring verbal skill as much as math ability.
The reliability of a test refers to the extent to which the test is likely to produce consistent
scores. The KR-20 index is the appropriate index of test reliability for multiple-choice
examinations.
What does the KR-20 measure?
The KR-20 is a measure of internal consistency reliability or how well your exam measures a
single cognitive factor. If you administer an Embryology exam, you hope all test items relate
to this broad construct. Similarly, an Obstetrics/Gynecology test is designed to measure this
medical specialty.
Reliability coefficients theoretically range in value from zero (no reliability) to 1.00 (perfect
reliability). In practice, their approximate range is from .50 to .90 for about 95% of the
classroom tests.
High reliability means that the questions of a test tended to "pull together." Students who
answered a given question correctly were more likely to answer other questions correctly. If a
parallel test were developed by using similar items, the relative scores of students would
show little change.
Low reliability means that the questions tended to be unrelated to each other in terms of who
answered them correctly. The resulting test scores reflect peculiarities of the items or the
testing situation more than students' knowledge of the subject matter.
The KR-20 formula includes:
(1) the number of test items on the exam,
(2) student performance on every test item, and
(3) the variance (standard deviation squared) for the set of student test scores.
The index ranges from 0.00 to 1.00. A value close to 0.00 means you are measuring many
unknown factors but not what you intended to measure. You are close to measuring a single
factor when your KR-20 is near 1.00. Most importantly, we can be confident that an exam
with a high KR-20 has yielded student scores that are reliable (i.e., reproducible or consistent;
or as psychometricians say, the true score). A medical school test should have a KR-20 of
0.60 or better to be acceptable.
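A minimal sketch of how these three ingredients combine, assuming the standard KR-20 formula, KR-20 = k/(k-1) x (1 - sum(p*q) / variance of total scores), and a hypothetical 0/1 score matrix:

    # Hypothetical 0/1 score matrix: 5 students (rows) x 4 items (columns).
    scores = [[1, 1, 1, 0],
              [1, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 0, 0, 0],
              [1, 1, 1, 1]]

    k = len(scores[0])                                              # number of test items
    totals = [sum(row) for row in scores]                           # each student's total score
    mean = sum(totals) / len(totals)
    variance = sum((t - mean) ** 2 for t in totals) / len(totals)   # variance of total scores

    pq_sum = 0.0
    for i in range(k):
        p = sum(row[i] for row in scores) / len(scores)             # proportion correct on item i
        pq_sum += p * (1 - p)                                       # q = 1 - p

    kr20 = (k / (k - 1)) * (1 - pq_sum / variance)
    print(round(kr20, 2))                                           # about 0.70 for these made-up data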
How do you interpret the KR-20 value?
Reliability      Interpretation
.90 and above    Excellent reliability; at the level of the best standardized tests.
.80 – .90        Very good for a classroom test.
.70 – .80        Good for a classroom test; in the range of most. There are probably a few items which could be improved.
.60 – .70        Somewhat low. This test needs to be supplemented by other measures (e.g., more tests) to determine grades. There are probably some items which could be improved.
.50 – .60        Suggests need for revision of the test, unless it is quite short (ten or fewer items). The test definitely needs to be supplemented by other measures (e.g., more tests) for grading.
.50 or below     Questionable reliability. This test should not contribute heavily to the course grade, and it needs revision.
Standard Error of Measurement
The standard error of measurement is directly related to the reliability of the test. It is an
index of the amount of variability in an individual student's performance due to random
measurement error. If it were possible to administer an infinite number of parallel tests, a
student's score would be expected to change from one administration to the next due to a
number of factors. For each student, the scores would form a "normal" (bell-shaped)
distribution. The mean of the distribution is assumed to be the student's "true score," and
reflects what he or she "really" knows about the subject. The standard deviation of the
distribution is called the standard error of measurement and reflects the amount of change in
the student's score which could be expected from one test administration to another.
Whereas the reliability of a test always varies between 0.00 and 1.00, the standard error of
measurement is expressed in the same scale as the test scores. For example, multiplying all
test scores by a constant will multiply the standard error of measurement by that same
constant, but will leave the reliability coefficient unchanged.
A general rule of thumb to predict the amount of change which can be expected in individual
test scores is to multiply the standard error of measurement by 1.5. Only rarely would one
expect a student's score to increase or decrease by more than that amount between two such
similar tests. The smaller the standard error of measurement, the more accurate the
measurement provided by the test.
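The fact sheet does not give the formula, but the standard error of measurement is conventionally computed as the standard deviation of the test scores multiplied by the square root of (1 - reliability). A minimal sketch with hypothetical values:

    import math

    # Hypothetical values: standard deviation of the test scores and KR-20 reliability.
    score_sd = 8.0
    kr20 = 0.84

    sem = score_sd * math.sqrt(1 - kr20)   # conventional SEM formula
    print(round(sem, 2))                   # 3.2 score points
    print(round(1.5 * sem, 2))             # 4.8 -> rule-of-thumb change to expect (1.5 x SEM)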
A CAUTION in Interpreting Item Analysis Results
Each of the various item statistics provides information which can be used to improve
individual test items and to increase the quality of the test as a whole. Such statistics must
always be interpreted in the context of the type of test given and the individuals being tested.
W. A. Mehrens and I. J. Lehmann provide the following set of cautions in using item analysis
results (Measurement and Evaluation in Education and Psychology. New York: Holt,
Rinehart and Winston, 1973, 333-334):
1. Item analysis data are not synonymous with item validity. An external criterion is required to accurately judge the validity of test items. By using the internal criterion of total test score, item analyses reflect the internal consistency of items rather than validity.
2. The discrimination index is not always a measure of item quality. There is a variety of reasons an item may have low discriminating power:
   a) extremely difficult or easy items will have low ability to discriminate, but such items are often needed to adequately sample course content and objectives;
   b) an item may show low discrimination if the test measures many different content areas and cognitive skills. For example, if the majority of the test measures "knowledge of facts," then an item assessing "ability to apply principles" may have a low correlation with total test score, yet both types of items are needed to measure attainment of course objectives.
3. Item analysis data are tentative. Such data are influenced by the type and number of students being tested, the instructional procedures employed, and chance errors. If repeated use of items is possible, statistics should be recorded for each administration of each item.
Conclusions
Developing the perfect test is the unattainable goal for anyone in an evaluative position. Even
when guidelines for constructing fair and systematic tests are followed, a plethora of factors
may enter into a student's perception of the test items. Looking at an item's difficulty and
discrimination will assist the test developer in determining what is wrong with individual
items. Item and test analysis provide empirical data about how individual items and whole
tests are performing in real test situations.
One of the principal advantages of having MCQ tests scored and analyzed by computer at
College of Medicine, King Saud bin Abdulaziz University for Health Sciences is the
feedback available on how well the test has performed. Careful consideration of the results of
item analysis can lead to significant improvements in the quality of exams.
Taken in parts from:
1. Office of Educational Assessment, University of Washington. http://www.washington.edu/oea/score1.htm
2. Jerard Kehoe, "Basic Item Analysis for Multiple-Choice Tests," Virginia Polytechnic Institute and State University.
3. Susan Matlock-Hetzel, "Basic Concepts in Item and Test Analysis," Texas A&M University, January 1997.
4. Christina Ballantyne, "Multiple Choice Tests: Test Scoring and Analysis." http://www.tlc.murdoch.edu.au/eddev/evaluation/mcq/score.html
5. http://chemed.chem.purdue.edu/chemed/stats.html
6. A paper published in the Journal of Chemical Education, 1980, 57, 188-190.
Since this collection of brief tips and examples is drawn from a number of websites and from the literature, the authorship of the contents rests with the original writers.
Imran Zafar
June 2008