A Permutation Test for the Detection of Differential Item Functioning

Supplemental Digital Content 2: Model description and effect size definition
The IRT analyses conducted in this article used Samejima's graded response model to obtain item parameters. This model assumes that, for the ith item ITEMi taking the ordered responses k = 1, 2, 3, the probability of the response ITEMi = k can be modeled as
$$\Pr(\mathrm{ITEM}_i = k) = \frac{1}{1 + \exp[-a_i(\theta - b_{ik})]} - \frac{1}{1 + \exp[-a_i(\theta - b_{i,k+1})]}$$
in which θ is the unobserved latent variable (the patient's perception of his or her provider) that is being measured by the items, a_i is the slope or discrimination parameter for item i, and b_ik is the item location for response k. Higher values of the latent variable θ indicate that a person has more of the construct being measured by the items. The magnitude of the discrimination parameter indicates how strongly the item is related to the latent construct and how quickly the probability of endorsing response category k increases with the latent variable, while the location parameter indicates the point on the latent continuum at which that response category becomes likely to be endorsed.
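As an illustration of how these category probabilities can be computed, the following Python/NumPy sketch (not part of the original analyses; the parameter values in the example are illustrative only) evaluates the graded response model for a single item with responses k = 1, 2, 3:

import numpy as np

def grm_category_prob(theta, a, b):
    # Category probabilities for one item under Samejima's graded response
    # model with ordered responses k = 1, 2, 3.
    #   theta : latent trait value(s)
    #   a     : discrimination (slope) parameter a_i
    #   b     : the two finite location parameters [b_i2, b_i3]
    theta = np.atleast_1d(np.asarray(theta, dtype=float))
    # Cumulative probabilities P(ITEM_i >= k): P(>=1) = 1 and P(>=4) = 0.
    cum = [np.ones_like(theta)]
    for bk in b:
        cum.append(1.0 / (1.0 + np.exp(-a * (theta - bk))))
    cum.append(np.zeros_like(theta))
    # The probability of responding exactly k is the difference between
    # adjacent cumulative curves.
    return np.column_stack([cum[k] - cum[k + 1] for k in range(3)])

# Example: probabilities of responses 1, 2, 3 at theta = 0
print(grm_category_prob(0.0, a=1.5, b=[-0.5, 0.8]))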
Appendix C2 provides the full results of the permutation steps proposed in this article, starting from the parameter estimates of the constrained model, then the estimates from the non-constrained models for Spanish speakers as well as English speakers, followed by the permutation test p-values. These results are presented for all three rounds, which led to the inference that four items could be used as anchor items. As the goal of this article was to describe how the permutation tests can be used to make these inferences, in Appendix C3 we reported, for Item 2 (round 1), the graphs of the distributions of the differences in the parameter estimates between the pseudo Spanish and English groups under the ideal scenario of no differential item functioning, with participants randomly assigned to the two groups. A vertical line is also added to each plot depicting the difference estimate obtained from the true Spanish and English groups observed in the CAHPS data. This vertical line allows one to visually assess the p-value statistic, that is, the likelihood that the difference observed in the CAHPS data could have arisen just by chance. The graphs showed that, although the differences in the parameter estimates of a (0.02) and b2 (-0.20) are not exactly equal to 0, differences of this size could occur just by chance, whereas the difference in b1 (1.16) is too large to be observed by chance. As this permutation test can be applied to any statistic, we also presented the distribution of the likelihood ratio statistic under the null hypothesis of no DIF, along with the estimate obtained from the CAHPS data.
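The permutation scheme described above can be sketched in a few lines of Python. In the sketch below, param_diff_fn stands for a hypothetical routine (an assumption for illustration, not code from the article) that fits the non-constrained models in the two groups and returns the Spanish minus English difference in the parameter or statistic of interest; the p-value is the proportion of permuted differences at least as extreme as the observed one.

import numpy as np

def permutation_pvalue(responses, group_labels, param_diff_fn, n_perm=1000, seed=0):
    # Two-sided permutation p-value for a group difference in an item
    # parameter (or any other statistic).
    #   responses     : item response data for all participants
    #   group_labels  : array of 'Spanish' / 'English' labels
    #   param_diff_fn : hypothetical helper returning the (Spanish - English)
    #                   difference after fitting the model in each group
    rng = np.random.default_rng(seed)
    observed = param_diff_fn(responses, group_labels)

    null_diffs = np.empty(n_perm)
    for p in range(n_perm):
        # Randomly reassign participants to pseudo "Spanish"/"English" groups,
        # which enforces the no-DIF scenario by construction.
        permuted = rng.permutation(group_labels)
        null_diffs[p] = param_diff_fn(responses, permuted)

    # Proportion of permuted differences at least as extreme as the observed one.
    pvalue = np.mean(np.abs(null_diffs) >= np.abs(observed))
    return observed, null_diffs, pvalue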
In Appendix D, we presented the expected item score plots for the items with DIF; according to Raju et al [39], these scores will be similar if the items function similarly in the two groups. This expected score, also called the true score t_is for item i and participant s, is computed at the construct level θ_s as

$$t_{is}(\theta_s) = \sum_{k=1}^{3} k\, P_{ik}(\theta_s),$$

where P_ik(θ_s) is the probability of responding to category k (k = 1, 2, 3), each weighted in the sum by the category score, which we set to k. Summing these expected item scores over all the items in the scale yields the scale true test score. But as the DIF in the items was observed in different directions, the scale true test score resulted in DIF cancellation at the scale level. In order to appreciate the impact of the DIF at the item level, we also reported a standardized difference in the expected item score between Spanish and English speakers, which we called the standardized effect size, at the construct levels θ_s = 0 and θ_s = -1, defined as

$$\frac{t_{is}(\theta_s \mid \mathrm{Spanish}) - t_{is}(\theta_s \mid \mathrm{English})}{sd_i},$$

where sd_i is the common standard deviation of item i in the studied sample.
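As a worked illustration of the expected item score and the standardized effect size defined above, the following Python sketch reuses the grm_category_prob function from the earlier sketch; the group parameters and the standard deviation sd_i in the example are hypothetical values chosen only to show the computation.

import numpy as np

def expected_item_score(theta, a, b):
    # True (expected) item score t_is(theta_s) = sum_k k * P_ik(theta_s),
    # with category scores set to k = 1, 2, 3.
    probs = grm_category_prob(theta, a, b)  # defined in the earlier sketch
    return probs @ np.array([1.0, 2.0, 3.0])

def standardized_effect_size(theta, params_spanish, params_english, sd_item):
    # Standardized difference in expected item scores between the Spanish-
    # and English-speaking groups, divided by the common item SD sd_i.
    t_spanish = expected_item_score(theta, *params_spanish)
    t_english = expected_item_score(theta, *params_english)
    return (t_spanish - t_english) / sd_item

# Example at the construct levels theta_s = 0 and theta_s = -1
# (all parameter values below are illustrative, not estimates from the article).
for theta in (0.0, -1.0):
    es = standardized_effect_size(theta,
                                  params_spanish=(1.4, [-0.6, 0.7]),
                                  params_english=(1.5, [-0.4, 0.9]),
                                  sd_item=0.8)
    print(theta, es)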