DIFferent Approaches to Cross-Cultural Validation

Ana Ćosić Pilepić, Tamara Mohorić, Vladimir Takšić
Faculty of Social Sciences and Humanities, University of Rijeka

TERMINOLOGY
• measurement bias – one of the threats to the validity of interpretations and inferences based on psychological testing
• low construct validity of a test – the test contains items measuring constructs other than the one it was intended to measure; a potential source of bias against a specific group of participants
• item bias – the probability of success on an item differs for examinees with the same level of ability

TERMINOLOGY
• IMPACT – true group differences in the probability of answering an item correctly (a valid test is the basis for analyzing impact)
• differential item functioning (DIF) – a somewhat neutral term referring to differences in the statistical properties of an item between groups of examinees of equal ability
• DIF ≠ item bias
• DIF is a necessary but not sufficient condition for item bias

CHARACTERISTICS COMMON TO ALL DIF METHODS
• focal and reference group
• matching variable – examinees in different groups are matched on ability/latent trait/proficiency (the basis for detecting DIF rather than impact); it can be internal or external
• predictive validity approach vs. internal criteria approach (DIF indices)
• internal criteria: the total test score or other items
• purification step – used to remove DIF items that might contaminate the matching criterion; the remaining DIF-free items (anchor items) can then be used for ability matching

DIFFERENT CLASSIFICATIONS OF DIF METHODS
• the null DIF hypothesis: observed-score vs. latent-variable null DIF
• the studied item score: methods for dichotomous vs. polytomous items
• the number of groups compared: two or more
• parametric vs. nonparametric methods
(Mapuranga, 2008)

CTT
• STAND
• smoothed STAND
• DIF dissection
• M-H procedure
• CMH
• Cox's β

IRT
• hierarchical logistic regression
• logistic mixed model
• mixture model
• HGLM
• DFIT
• TestGraf
• Scrams–McLeod
• MIMIC model
• Lagrangian multiplier tests
• RCML
• McDonald's

CTT & IRT
• SIBTEST
• kernel smoothed SIBTEST
• MULTISIB

No test theory
• Liu–Agresti estimator
• logistic regression

CLASSICAL TEST THEORY METHODS

Traditional methods
• the transformed item difficulty index (delta plot method)
• adjustment: residualized TID indices (Shepard et al., 1985)
• analysis of variance
• correlational methods:
  - rank-order correlation of p-values
  - item-test point-biserial correlations
  - exploratory factor analysis

Traditional methods
• methods based on the concept of differential item difficulty
• classical indices are sample dependent
• classical item p-values confound item difficulty with group mean differences and item discrimination (Angoff, 1982; Hunter, 1975; Lord, 1977)
• whenever two groups are not equal on the trait being measured, highly discriminating items will appear to be biased, simply because they do a better job of distinguishing low-scoring from high-scoring examinees
• frequent Type I and Type II errors

Standardization (STAND)
• Dorans & Kulick (1983)
• the idea is to compute, at each score level, the difference between the proportions of focal- and reference-group examinees who answer the item correctly, attaching more weight to score levels with more examinees (sketched below)
• unsigned proportion difference and signed proportion difference (the standardized p-difference)
• requires large sample sizes and offers no significance test

Mantel–Haenszel procedure
• based on the odds of answering an item correctly in the focal group relative to the reference group; a large difference means DIF is present
• odds ratios of success are estimated at each ability level and then averaged over all ability levels, yielding the Mantel–Haenszel odds ratio index (α_MH)
• β_MH – the logit transformation of α_MH
• Mantel–Haenszel delta difference (MH D-DIF) – used to classify items as showing negligible, slight to moderate, or moderate to large DIF (sketched below)
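A minimal sketch of the standardized p-difference from the STAND slide above, assuming a dichotomously scored item and per-score-level counts as input; all numbers are hypothetical:

```python
# Standardized p-difference (STD P-DIF): weight the focal-reference
# difference in proportion correct at each total-score level by the
# number of focal-group examinees at that level.
import numpy as np

def std_p_dif(n_correct_f, n_total_f, n_correct_r, n_total_r):
    p_f = np.asarray(n_correct_f, float) / np.asarray(n_total_f, float)
    p_r = np.asarray(n_correct_r, float) / np.asarray(n_total_r, float)
    w = np.asarray(n_total_f, float)                       # focal-group counts as weights
    signed = np.sum(w * (p_f - p_r)) / np.sum(w)           # signed p-difference
    unsigned = np.sum(w * np.abs(p_f - p_r)) / np.sum(w)   # unsigned variant
    return signed, unsigned

# Hypothetical counts at five total-score levels:
signed, unsigned = std_p_dif(
    n_correct_f=[12, 30, 55, 70, 44], n_total_f=[40, 70, 90, 85, 50],
    n_correct_r=[20, 45, 80, 95, 58], n_total_r=[50, 80, 100, 100, 60])
print(f"STD P-DIF (signed) = {signed:+.3f}, unsigned = {unsigned:.3f}")
```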
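A companion sketch of the Mantel–Haenszel statistics from the slide above. The A/B/C labels apply only the common ETS size thresholds on |MH D-DIF|; the full classification rule also involves significance tests, and the counts are again hypothetical:

```python
# Mantel-Haenszel common odds ratio (alpha_MH), its log (beta_MH), and
# the MH D-DIF delta metric, computed from a 2x2 table at each score level.
import numpy as np

def mantel_haenszel(r_correct, r_wrong, f_correct, f_wrong):
    A, B = np.asarray(r_correct, float), np.asarray(r_wrong, float)  # reference group
    C, D = np.asarray(f_correct, float), np.asarray(f_wrong, float)  # focal group
    T = A + B + C + D                                 # examinees at each score level
    alpha_mh = np.sum(A * D / T) / np.sum(B * C / T)  # common odds ratio
    beta_mh = np.log(alpha_mh)                        # logit transformation
    mh_d_dif = -2.35 * beta_mh         # negative values disadvantage the focal group
    size = abs(mh_d_dif)
    label = ("A: negligible DIF" if size < 1.0 else
             "B: slight to moderate DIF" if size < 1.5 else
             "C: moderate to large DIF")
    return alpha_mh, beta_mh, mh_d_dif, label

print(mantel_haenszel(r_correct=[20, 45, 80, 95, 58], r_wrong=[30, 35, 20, 5, 2],
                      f_correct=[12, 30, 55, 70, 44], f_wrong=[28, 40, 35, 15, 6]))
```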
ITEM RESPONSE THEORY METHODS
The matching variable is an estimated ability level or latent trait, θ.

BASIC CONCEPTS OF IRT
• item characteristic curve (ICC)
• three parameters: item difficulty (b), item discrimination (a), and a guessing factor (c)
• one-parameter logistic model (1PL; the Rasch model), two-parameter logistic model (2PL), and three-parameter logistic model (3PL); e.g., the 3PL ICC is P(θ) = c + (1 − c) / (1 + exp(−a(θ − b)))

DIF detection
• basic idea: if DIF is present, the reference and focal groups will have different ICCs
• a multitude of approaches exist for identifying between-group differences in parameters:
  - comparison of item parameters across groups (Linacre & Wright, 1986; Lord, 1980)
  - estimating the area between the ICCs of the two groups (Raju, 1988); a numerical sketch appears below
  - testing the improvement in model fit when separate group parameter estimates are allowed (Thissen, Steinberg, & Wainer, 1993)

Pros and cons
• sample-independent parameters
• more precise than CTT methods
• conceptually clear, but computationally demanding
• the graphical representation (the ICC) offers an easy visual inspection tool
• requires big samples: the 3PL model needs more than 1000 examinees per group

MIXED METHODS
SIBTEST, kernel smoothed SIBTEST, MULTISIB

Simultaneous item bias test (SIBTEST)
• a nonparametric method to detect test bias or differential test functioning (DTF)
• conceptually similar to standardization, but it offers a significance test AND its matching variable is a latent trait, not an observed score
• allows evaluation of DIF amplification or cancellation effects across items within a testlet or bundle
• allows detection of both uniform and nonuniform DIF

NO TEST THEORY METHODS
Logistic regression, Liu–Agresti estimator

Logistic regression
• Swaminathan and Rogers (1990)
• a link between contingency-table methods (odds ratio methods) and IRT methods
• predicts success on a specific item from the total test score alone (no DIF); from the total test score and group membership (uniform DIF); or from the total test score, group membership, and their interaction term (nonuniform DIF) – see the sketch below
• can be used with relatively small sample sizes
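A minimal sketch of this three-model comparison, on simulated data (all values hypothetical); each likelihood-ratio test compares adjacent nested models:

```python
# Swaminathan-Rogers logistic regression DIF: three nested models, with
# likelihood-ratio tests for uniform and nonuniform DIF.
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(0)
n = 400
group = rng.integers(0, 2, n)               # 0 = reference, 1 = focal
total = rng.normal(20 + 2 * group, 5, n)    # total test score (matching variable)
# Simulate an item with some uniform DIF against the focal group:
logit = 0.25 * (total - 20) - 0.6 * group
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

def fit(X):
    return sm.Logit(y, sm.add_constant(X)).fit(disp=0)

m1 = fit(np.column_stack([total]))                        # score only (no DIF)
m2 = fit(np.column_stack([total, group]))                 # + group (uniform DIF)
m3 = fit(np.column_stack([total, group, total * group]))  # + interaction (nonuniform DIF)

lr_uniform = 2 * (m2.llf - m1.llf)      # tests the group main effect
lr_nonuniform = 2 * (m3.llf - m2.llf)   # tests the interaction term
print(f"uniform DIF:    LR = {lr_uniform:.2f}, p = {chi2.sf(lr_uniform, 1):.4f}")
print(f"nonuniform DIF: LR = {lr_nonuniform:.2f}, p = {chi2.sf(lr_nonuniform, 1):.4f}")
```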
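And, reaching back to the IRT slides, a numerical sketch of the area between two ICCs (Raju, 1988); the 2PL item parameters are hypothetical, and Raju's closed-form expressions are replaced here by simple grid integration:

```python
# Area between reference- and focal-group ICCs, approximated on a theta grid.
import numpy as np

def icc(theta, a, b, c=0.0):
    """Three-parameter logistic ICC; c = 0 reduces it to the 2PL."""
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

theta = np.linspace(-6, 6, 2001)
dtheta = theta[1] - theta[0]
p_ref = icc(theta, a=1.2, b=-0.2)   # reference-group parameters (hypothetical)
p_foc = icc(theta, a=1.2, b=0.4)    # focal-group parameters (hypothetical)

signed_area = np.sum((p_ref - p_foc) * dtheta)
unsigned_area = np.sum(np.abs(p_ref - p_foc) * dtheta)
# With equal a and c, the signed area equals b_focal - b_reference (here 0.6).
print(f"signed area = {signed_area:.3f}, unsigned area = {unsigned_area:.3f}")
```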
CONCLUSION
• The choice of a specific method must rest on several considerations:
  - the instrument under study
  - the items under study
  - sample size
  - resources (money, time)
  - ...

PURPOSE OF THE INTENDED STUDY:
• to test the efficacy of different DIF methods
• to compare different test theory frameworks and determine whether the approaches differ in the information they provide
• to compare different DIF indices
• to examine whether any items are problematic with respect to DIF under different conditions (sex, culture)
• Sample: >5000 examinees from 10 different countries (a cross-cultural study)
• Instrument: ESCQ (Likert scale)

THANK YOU FOR YOUR ATTENTION!
Please give any comments or suggestions regarding the doctoral dissertation research design...