Test Bias and Differential Item Functioning

Psychological tests are often designed to measure differences in favorable characteristics (e.g., intelligence, college aptitude) among people. However, certain ethnic groups tend, on average, to score lower on these tests. Developing a test that measures innate intelligence without introducing cultural bias is extremely difficult and has so far proved virtually impossible. Some have argued that these differences are due to:
1. Environmental factors
2. Biological factors

The central question is: Are the tests biased, or are they simply reflecting true differences? Are standardized tests as valid for minority examinees as they are for white examinees? Some psychologists argue that the tests are differentially valid for African-Americans and whites. However, differences in test performance among ethnic groups do not necessarily indicate test bias. The question is truly one of validity: does the test have different meanings for different ethnic groups?

Most attempts to reduce cultural bias in testing have depended on the content validity of items. Differential Item Functioning (DIF) is one such approach. It is a statistical test that determines whether an item functions differentially in favor of one group after matching the two groups on overall ability, as measured by the other test items.

Introduction to Item Response Theory (IRT)

IRT is model-based measurement in which an estimate of a person's underlying trait (e.g., ability, personality, organizational efficacy) depends both on that person's responses and on the properties of the items used to measure the trait. IRT is the basis for almost all large-scale psychological measurement, yet most test users are unfamiliar with IRT and are unaware that the basis of testing has changed. When most psychologists or researchers think of measurement, they think of the principles of classical test theory (CTT), such as reliability. Although IRT is derived from many of the principles of CTT, some of the "rules of measurement" have changed. On the other hand, without a large number of examinees it is not possible to apply IRT, so CTT and extensions of CTT can and should be used in those instances.

Limitations of Classical Measurement Models

1. Examinee and test characteristics are confounded; each can be interpreted only in the context of the other.
   a. An examinee's "true score" is defined as the expected observed score for that examinee on the test of interest.
   b. The difficulty of an item is defined as the proportion of examinees in a group of interest that answered the item correctly.
   This makes it difficult to compare examinees who take different tests and to compare item characteristics obtained from different groups of examinees.
2. It is nearly impossible to create parallel tests, yet the main concept in the CTT conceptual framework, reliability, is defined as the correlation between scores on parallel forms of the test.
3. The standard error of measurement in CTT is constant for all examinees, which is implausible.
4. CTT is test oriented rather than item oriented, so it is impossible to predict how an individual will perform on a particular item.

A Comparison of CTT and IRT

In CTT the standard error of measurement (SEM) applies to all scores in a particular population; in IRT the SEM not only differs across scores, but generalizes across populations.
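To make this contrast concrete, the sketch below computes the IRT standard error as a function of ability under a two-parameter logistic (2PL) model, where the SEM at a given ability level is the inverse square root of the test information at that level. The item parameters used here are made up purely for illustration and are not drawn from any particular test.

```python
import numpy as np

def p_correct(theta, a, b):
    """Probability of a correct response under a 2PL model."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def test_information(theta, a, b):
    """Test information: sum over items of a^2 * P * (1 - P)."""
    p = p_correct(theta, a[:, None], b[:, None])   # items x ability levels
    return (a[:, None] ** 2 * p * (1.0 - p)).sum(axis=0)

# Hypothetical item parameters: discriminations (a) and difficulties (b).
a = np.array([1.2, 0.8, 1.5, 1.0, 0.9])
b = np.array([-1.5, -0.5, 0.0, 0.5, 1.5])

thetas = np.linspace(-3, 3, 7)
sem = 1.0 / np.sqrt(test_information(thetas, a, b))  # SEM varies with ability

for t, s in zip(thetas, sem):
    print(f"theta = {t:+.1f}   SEM = {s:.2f}")
```

With these illustrative parameters the SEM is smallest near the middle of the ability range, where the items are concentrated, and grows toward the extremes, in contrast to the single SEM reported under CTT.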
The SEM describes the expected differences in scores that are due to measurement error and is critical to score interpretation at both the individual and the group level.

In CTT, longer tests are more reliable than shorter tests; in IRT, shorter tests can be more reliable than longer tests. In CTT, the Spearman-Brown prophecy formula implies that as test length increases, true score variance increases faster than error variance, so reliability increases. In IRT, given an item pool large enough to contain many items at different difficulty levels, adaptive tests can be constructed so that each examinee receives the items that are most informative at his or her ability level (i.e., neither too easy nor too hard), thereby decreasing the standard error of measurement for all examinees.

In CTT, comparing test scores across forms is best when the forms are parallel; in IRT, comparing test scores (i.e., ability estimates) across forms is best when the forms are not parallel and test difficulty varies between persons. When two groups of examinees receive different forms of a test designed to measure the same trait, differences between the two groups' ability distributions and the item characteristics are controlled for by equating the measures. In CTT, strict conditions must be met before test scores can be equated; specifically, the forms must be parallel, which requires equal means and variances across forms. These conditions are seldom met in practical testing situations, and oftentimes we are interested in comparing scores from quite different measures. It is possible to use regression to find comparable scores from one test form to another, but errors in equating are strongly influenced by differences in test difficulty. In IRT, test forms can be adapted so that each examinee receives only those items that are optimally appropriate for that person (i.e., neither too difficult nor too easy); ability is then estimated separately for each person using an IRT model that controls for differences in item difficulty, and the estimates of the latent trait, rather than overall test scores, are compared.

In CTT, item characteristics will be unbiased only if they come from a representative sample; in IRT, item characteristics will be unbiased even if they come from an unrepresentative sample. In CTT, item statistics can vary dramatically across samples if the samples are unrepresentative of the population.

In CTT, the "meaning" of test scores is obtained by comparing scores to those obtained from a "norm" group; in IRT, the "meaning" of test scores (i.e., ability estimates) is obtained from the distance between a person's ability estimate and the items themselves. In IRT, item characteristics and ability estimates are on the same scale, so it is possible to locate a person's ability estimate on a continuum that orders items by their difficulty. This can be interpreted as follows: a person has a probability greater than 0.5 of answering correctly any item that falls below that person's ability estimate, and a probability less than 0.5 of answering correctly any item that falls above it. It should be noted that it is still possible in IRT to obtain norm-referenced information.

In CTT, interval scale properties can be achieved only by a nonlinear transformation of observed scores, specifically by normalizing them; in IRT, interval scale properties are achieved by applying justifiable measurement models.
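Because persons and items share one scale, the person-item comparison can be illustrated directly. The sketch below evaluates Rasch (one-parameter logistic) response probabilities for a hypothetical ability estimate against a set of made-up item difficulties; items located below the ability estimate have probabilities above 0.5 and items above it have probabilities below 0.5.

```python
import numpy as np

def rasch_p(theta, b):
    """P(correct) under the Rasch model: depends only on theta - b."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

item_difficulty = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])  # hypothetical difficulties
theta = 0.5                                               # a person's ability estimate

for b in item_difficulty:
    p = rasch_p(theta, b)
    side = "below" if b < theta else "above"
    print(f"item b = {b:+.1f} ({side} ability): P(correct) = {p:.2f}")
# Items below the ability estimate have P > .5; items above have P < .5.
```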
In CTT, mixed item formats (i.e., multiple-choice and open-ended) have an unbalanced impact on the total test score; in IRT, mixed item formats can yield optimal scores. In CTT, increasing the number of response categories for an item changes total test scores, which may lead to different conclusions being drawn about an examinee's ability; this may or may not be what is wanted. In IRT, there are models specifically designed for tests with mixed item formats and for items with differing numbers of response categories.

In CTT, difference scores cannot be meaningfully compared when the scores obtained at the first time point differ; in IRT, difference scores can be meaningfully compared even when the scores obtained at the first time point differ. For difference (change) scores to be meaningful when initial scores differ, they must come from an interval level of measurement. In CTT this is not easily justified, whereas in IRT interval-level measurement is directly justified by the model.
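The following sketch ties the equating and interval-scale points together: it computes maximum-likelihood ability estimates under a Rasch model for two examinees who answered different, non-parallel item sets. The item difficulties, response patterns, and the choice of SciPy's bounded scalar optimizer are illustrative assumptions rather than a prescribed procedure.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def neg_log_lik(theta, responses, b):
    """Negative log-likelihood of a 0/1 response pattern under the Rasch model."""
    p = 1.0 / (1.0 + np.exp(-(theta - b)))
    return -np.sum(responses * np.log(p) + (1 - responses) * np.log(1 - p))

def estimate_theta(responses, b):
    """Maximum-likelihood ability estimate, searched over a bounded range."""
    result = minimize_scalar(neg_log_lik, bounds=(-4, 4),
                             args=(responses, b), method="bounded")
    return result.x

# Examinee A took an easy form, examinee B a hard form (hypothetical data).
b_easy = np.array([-2.0, -1.5, -1.0, -0.5, 0.0])
b_hard = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
resp_A = np.array([1, 1, 1, 0, 1])
resp_B = np.array([1, 1, 0, 1, 0])

print("theta_A =", round(estimate_theta(resp_A, b_easy), 2))
print("theta_B =", round(estimate_theta(resp_B, b_hard), 2))
# Because item difficulty is modeled explicitly, the two estimates lie on the
# same scale even though the forms were not parallel, so differences between
# them (or change scores over time) can be compared directly.
```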