Supplemental Digital Content 2: Model description and effect size definition

The IRT analyses conducted in this article used Samejima's graded response model to obtain item parameters. This model assumes that, for the ith item ITEM_i taking the ordered responses k = 1, 2, 3, the probability of the response ITEM_i = k can be modeled as

\[
\Pr(\mathrm{ITEM}_i = k) = \frac{1}{1 + \exp[-a_i(\theta - b_{ik})]} - \frac{1}{1 + \exp[-a_i(\theta - b_{i,k+1})]}
\]

in which \(\theta\) is the unobserved latent variable (the patient's perception of his or her provider) being measured by the items, \(a_i\) is the slope or discrimination parameter for item i, and \(b_{ik}\) is the item location for response k. Higher values of the latent variable indicate that a person has more of the construct being measured by the items. The magnitude of the discrimination parameter indicates how strongly the item is related to the latent construct and how quickly the probability of endorsing response category k rises as the latent trait increases, while the location parameter estimates the likelihood of endorsing that item response.

Appendix C2 provides the full estimates from the permutation steps proposed in this article: the parameter estimates of the constrained model, then the estimates from the non-constrained models for Spanish speakers as well as English speakers, followed by the permutation test p-values. These results are presented for all three rounds, which led to the inference that four items could be used as anchor items. As the goal of this article was to describe how permutation tests can be used to make these inferences, in Appendix C3 we report, for Item 2 (round 1), the graphs of the distributions of the parameter differences between the pseudo Spanish and English groups under the ideal scenario of no differential item functioning, with participants randomly assigned to the two groups.
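To make the model concrete, the category probabilities above can be computed as differences of adjacent cumulative logistic curves. The sketch below uses hypothetical parameter values (not estimates from the article) purely for illustration.

```python
import numpy as np

def grm_category_prob(theta, a, b):
    """Category probabilities under Samejima's graded response model.

    theta : latent trait value
    a     : discrimination (slope) parameter a_i
    b     : ordered location parameters [b_i2, b_i3] for categories 2 and 3
    Returns probabilities for the ordered responses k = 1, 2, 3.
    """
    # Cumulative probability of responding in category k or higher;
    # the boundaries for the lowest and highest categories are 1 and 0.
    cum = [1.0] + [1.0 / (1.0 + np.exp(-a * (theta - bk))) for bk in b] + [0.0]
    # The category probability is the difference of adjacent cumulative curves.
    return np.array([cum[k] - cum[k + 1] for k in range(3)])

# Hypothetical parameters for illustration only
probs = grm_category_prob(theta=0.5, a=1.8, b=[-1.0, 0.4])
print(probs, probs.sum())  # the three category probabilities sum to 1
```

Because the cumulative curves are bounded by 1 and 0 and decrease across categories, the three differences are nonnegative and always sum to 1.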
A vertical line is added to each plot depicting the difference estimate obtained from the true Spanish and English groups observed in the CAHPS data. This vertical line allows one to visually assess the p-value statistic, that is, the likelihood that the difference observed in the CAHPS data could have arisen by chance. The graphs show that, although the parameter estimate differences for a (0.02) and b2 (-0.20) are not exactly equal to 0, they could plausibly have occurred by chance, whereas the difference in b1 (1.16) is too large to be observed by chance. As this permutation approach can be applied to any statistic, we also present the distribution of the likelihood ratio statistic under the null hypothesis of no DIF, along with the estimate obtained from the CAHPS data.

In Appendix D, we present the expected item score plots for the items with DIF; according to Raju et al [39], these scores will be similar if the items function similarly in the two groups. This expected score, also called the true score \(t_{is}\) for item i and participant s, is computed at the construct level \(\theta_s\) as

\[
t_{is}(\theta_s) = \sum_{k=1}^{3} k\, P_{ik}(\theta_s),
\]

where \(P_{ik}(\theta_s)\) is the probability of responding in category k (k = 1, 2, 3), weighted by the category score, which we set to k. Summing these expected item scores over all the items in the scale yields the scale true test score. But because the DIF in the items occurred in different directions, the DIF canceled out in the true test score at the scale level. To appreciate the impact of the DIF at the item level, we therefore also report a standardized difference in expected item scores between Spanish and English speakers, which we call the standardized effect size, at the construct levels \(\theta_s = 0\) and \(\theta_s = -1\), defined as

\[
\frac{t_{is}(\theta_s \mid \text{Spanish}) - t_{is}(\theta_s \mid \text{English})}{sd_i},
\]

where \(sd_i\) is the common standard deviation of item i in the studied sample.
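The permutation logic described above (building a null distribution by randomly re-assigning participants to pseudo groups and recomputing the group difference) can be sketched generically. In this sketch the statistic is a simple stand-in (difference in mean item response); in the article the plugged-in statistic is the Spanish-English difference in an IRT parameter estimate (a, b1, b2) or the likelihood ratio, and the data are simulated, not CAHPS data.

```python
import numpy as np

rng = np.random.default_rng(0)

def permutation_pvalue(stat_fn, responses, labels, n_perm=2000):
    """Two-sided permutation p-value for a group-difference statistic.

    stat_fn(responses, labels) returns a group-difference estimate; the
    group labels are permuted to build its null distribution under the
    'no DIF' hypothesis, mirroring the procedure described in the text.
    """
    observed = stat_fn(responses, labels)
    null = np.empty(n_perm)
    for j in range(n_perm):
        null[j] = stat_fn(responses, rng.permutation(labels))
    return observed, float(np.mean(np.abs(null) >= abs(observed)))

# Stand-in statistic for illustration: difference in mean item response.
def mean_diff(responses, labels):
    return responses[labels == 1].mean() - responses[labels == 0].mean()

# Simulated ordered responses (k = 1, 2, 3); 1 = Spanish, 0 = English
responses = rng.integers(1, 4, size=400)
labels = np.repeat([1, 0], 200)
obs, p = permutation_pvalue(mean_diff, responses, labels)
print(obs, p)
```

The vertical line in the article's plots corresponds to `obs` placed against the permutation null distribution; the reported p-value is the share of permuted differences at least as extreme.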
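The true score and standardized effect size formulas above translate directly into code. The sketch below computes \(t_{is}(\theta_s)\) and the standardized Spanish-English difference at the two construct levels reported in the article; the group-specific parameters and the item standard deviation are hypothetical values for illustration only.

```python
import numpy as np

def grm_category_prob(theta, a, b):
    """Category probabilities P_ik(theta) under the graded response model."""
    cum = [1.0] + [1.0 / (1.0 + np.exp(-a * (theta - bk))) for bk in b] + [0.0]
    return np.array([cum[k] - cum[k + 1] for k in range(3)])

def expected_item_score(theta, a, b):
    """True score t_is(theta_s) = sum over k of k * P_ik(theta_s), k = 1..3."""
    k = np.arange(1, 4)
    return float(np.sum(k * grm_category_prob(theta, a, b)))

def standardized_effect_size(theta, pars_spanish, pars_english, sd_i):
    """(t_is(theta | Spanish) - t_is(theta | English)) / sd_i."""
    t_sp = expected_item_score(theta, *pars_spanish)
    t_en = expected_item_score(theta, *pars_english)
    return (t_sp - t_en) / sd_i

# Hypothetical group-specific (a, [b2, b3]) parameters and item SD
sp = (1.6, [-1.2, 0.3])
en = (1.8, [-0.9, 0.4])
for theta in (0.0, -1.0):  # the construct levels used in the article
    print(theta, standardized_effect_size(theta, sp, en, sd_i=0.7))
```

Because each category score is between 1 and 3, the expected item score lies in [1, 3], so the unstandardized group difference is bounded by 2 in absolute value before division by the item standard deviation.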