Appendix A
Details on Analyses of Data with Missing Responses

Not all phases of the analysis could accommodate participant responses with missing item data. While the estimation of parameters (i.e., the linking itself) and the confirmatory analyses took advantage of all available data for all participants, the classical statistics analysis and the evaluation of linking accuracy were based on participants with complete responses. These analyses were therefore based on slightly smaller sample sizes: 85% of the total (N = 622) for the HAQ-DI (sum-20; 0 to 60 scoring) linking, 90% of the total (N = 660) for the HAQ-DI (max-8; 0 to 3 scoring), and 91% of the total (N = 656) for the SF-36 PF.

Appendix B
Details on Linking Methods

Multi-method approach. The PROsetta Stone methodology applies multiple linking methods, including methods based on item response theory (IRT) as well as more traditional equipercentile methods.(1) Such a multi-method approach is recommended by Kolen and Brennan (2004),(2) as it serves to ensure that any violations of assumptions do not distort the results.

Fixed-parameter calibration. Item responses in each of the three linking samples (PROMIS PF + HAQ-DI [max-8], PROMIS PF + HAQ-DI [sum-20], and PROMIS PF + SF-36 PF) were calibrated in a single run with the PROMIS parameters fixed at their previously published values.(3) The item parameters of the legacy instruments were thus estimated subject to the metric defined by the PROMIS item parameters, yielding legacy item parameters on the PROMIS metric.

Separate calibration with linking constants. The second IRT-based method we applied was separate calibration followed by the computation of transformation constants. This procedure uses the discrepancy between the established PROMIS parameters (3) and freely calibrated estimates of the PROMIS parameters to place the legacy parameters on the established PROMIS metric. This is useful because it avoids imposing the constraints inherent in fixed-parameter calibration. We applied four procedures to obtain the linking constants: mean/mean, mean/sigma, Haebara,(4) and Stocking–Lord.(5) These IRT linking methods were implemented using the plink package (6) in R.(7) We ran all IRT calibrations using MULTILOG 7.03.(8)

Comparing IRT linking methods. We obtained four sets of IRT parameters for each legacy measure. To compare methods, we examined the differences between the test characteristic curves (TCCs). If the differences between the expected raw summed-score values were small (within 1 raw score point), we considered the methods interchangeable. When this occurred, we defaulted to the simpler fixed-parameter method for obtaining scores for each participant.

Appendix C
Unidimensionality Analysis and Details on Classical Item Statistics

Method

Confirmatory factor analyses (CFA) were conducted on the raw data, treating the indicator variables as ordinal and using the WLSMV estimator of Mplus.(9) This model posited that all items load highly on a single factor. We calculated fit statistics to help quantify the fit of a unidimensional model, including the Comparative Fit Index (CFI), the Tucker–Lewis Index (TLI), and the root mean square error of approximation (RMSEA). We used the following model fit criteria: RMSEA < .08,(10) TLI > .95,(11) and CFI > .95.(11, 12) We applied these somewhat arbitrary criteria as informative guides for judging the relative unidimensionality of item responses.

We also estimated the proportion of total variance attributable to a general factor, known as omega hierarchical (OmegaH (13, 14)), using the psych package (15) in R.(7) This method estimates OmegaH from the general factor loadings derived from an exploratory factor analysis and a Schmid–Leiman transformation.(16) Values of 0.70 or higher for OmegaH suggest that the item set is sufficiently unidimensional for most purposes.(17, 18)
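For readers who wish to reproduce this style of dimensionality analysis in R, a minimal sketch follows. It is not the code used in this study: the CFAs reported here were run in Mplus,(9) so the lavaan package is a substitute; item_data is a placeholder data frame of item responses; and nfactors = 3 is an illustrative default for psych::omega.

    # Sketch: single-factor CFA with ordinal indicators, plus OmegaH.
    # 'item_data' is a placeholder data frame of item responses.
    library(lavaan)
    library(psych)

    items <- colnames(item_data)
    model <- paste("PF =~", paste(items, collapse = " + "))

    # Ordinal indicators with WLSMV estimation; lavaan stands in here
    # for the Mplus analysis described above.
    fit <- cfa(model, data = item_data, ordered = items, estimator = "WLSMV")
    fitMeasures(fit, c("cfi", "tli", "rmsea"))  # compare to the criteria above

    # OmegaH from an exploratory factor analysis followed by a
    # Schmid-Leiman transformation, as implemented in psych::omega.
    omega(item_data, nfactors = 3)$omega_h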
Results

For the three combined PROMIS and legacy item sets, CFA fit statistics suggested very good fit. For PROMIS PF and the HAQ-DI (sum-20) (96 items), fit values were CFI = 0.97, TLI = 0.97, and RMSEA = 0.04. For PROMIS PF and the HAQ-DI (max-8) (84 items), they were CFI = 0.97, TLI = 0.97, and RMSEA = 0.04. For PROMIS PF and the SF-36 PF (86 items), fit values were CFI = 0.98, TLI = 0.98, and RMSEA = 0.04. These results suggest adequate data–model fit for a unidimensional model.

As shown in Table C1, values of OmegaH were relatively high, ranging from .77 to .82. These values suggest the presence of a dominant general factor for each instrument pair.(17) Table C1 also shows classical item statistics, including item-total correlations and Cronbach's alpha.

Table C1
Classical Item Analysis for Individual and Combined Instruments

                                          Item-Total Correlations
Instruments                     Items     Min.   Mean   Max.     Alpha   OmegaH
PROMIS PF                       76        .52    .73    .87      .99     .76
HAQ-DI (sum-20)                 20        .44    .67    .77      .94     .81
PROMIS PF & HAQ-DI (sum-20)     96        .37    .71    .86      .99     .79
HAQ-DI (max-8)                   8        .59    .72    .80      .91     .84
PROMIS PF & HAQ-DI (max-8)      84        .52    .73    .87      .99     .82
SF-36 PF                        10        .50    .74    .83      .93     .80
PROMIS PF & SF-36 PF            86        .51    .73    .87      .99     .77

Note. Estimates are for those participants with no missing data. This represents 85% of the total (N = 622) for the HAQ-DI (sum-20) linking, 90% of the total (N = 660) for the HAQ-DI (max-8), and 91% of the total (N = 656) for the SF-36 PF. Although the PROMIS PF sample composition was slightly different for each linkage, we report here only the statistics for the HAQ-DI (sum-20) linking sample, as the estimates differ by no more than 0.01 as the sample composition changes. HAQ-DI = Health Assessment Questionnaire – Disability Index; PROMIS PF = PROMIS Physical Function; SF-36 PF = Short Form 36 Physical Function.

Appendix D
Evaluation of Linking Accuracy

Method

We used a single-group design, the strongest of the available linking designs.(19) In this approach, items from each instrument are administered to all participants, and scores are obtained for each respondent on each measure to be linked. This is convenient when evaluating the validity of the linking method, because actual scores on the anchor or reference measure (PROMIS PF scores in the current study) are obtained, as well as those generated by linking through the target measures (HAQ-DI and SF-36 PF). We evaluated the accuracy of linking by comparing (estimated) linked scores to the actual scores. We computed Pearson product-moment correlations between estimated and actual scores and calculated the mean and standard deviation of the differences in scores. Our expectation was that the correlations would be high (e.g., .80 or higher) and that the mean of the differences would be close to zero.

Users of the cross-walk tables may wish to know that the size of their samples affects the error associated with linking. To estimate how this error decreases as sample size increases, we conducted a resampling procedure to obtain multiple random, small subsets of cases (n = 25, 50, and 75). Samples were drawn randomly with replacement over 10,000 replications (bootstrapping). For each replication, the mean difference between the actual and linked PROMIS T-scores was computed. This resulted in a distribution of the differences that might be expected with different sample sizes. The means and standard deviations of these distributions were computed over replications as estimates of bias and empirical standard error, respectively.
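A minimal R sketch of this resampling procedure is shown below. It illustrates the general approach rather than reproducing the study scripts; actual_t and linked_t are placeholder vectors holding the actual and cross-walked PROMIS T-scores for the same complete-data participants.

    # Sketch: bootstrap estimates of linking bias and empirical standard error.
    # 'actual_t' and 'linked_t' are placeholder vectors of actual and
    # linked PROMIS T-scores for the same participants.
    set.seed(1)
    reps <- 10000
    for (n in c(25, 50, 75)) {
      diffs <- replicate(reps, {
        idx <- sample(seq_along(actual_t), size = n, replace = TRUE)
        mean(actual_t[idx] - linked_t[idx])
      })
      cat(sprintf("n = %d: bias = %.2f, empirical SE = %.2f\n",
                  n, mean(diffs), sd(diffs)))
    }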
Results

To evaluate the accuracy of our linking methods, we compared the linked PROMIS T-scores to the actual PROMIS T-scores, using correlations, mean differences, and standard deviations of the difference scores (see Table D1). The correlations were quite high (r ≥ .80), and the means of the difference scores were close to zero for each of the three links. There was a high degree of similarity between the methods (equipercentile vs. IRT), suggesting that small violations of IRT assumptions did not distort the results.

Results of resampling with small subsets (n = 25, 50, and 75) demonstrated the expected magnitude of linking error associated with each sample size. For the HAQ-DI (max-8) link, the empirical standard error decreases from 1.2 to .79 to .63 as the sample size increases from 25 to 50 to 75. For the HAQ-DI (sum-20), the corresponding errors are 1.1, .78, and .62. For the SF-36 PF, the corresponding errors are .80, .55, and .45.

These standard errors can be used to create confidence intervals around linking results. If the PROsetta Stone cross-walk table were used to estimate PROMIS scores from a sample of 75 HAQ-DI (max-8) scores, there would be a 95% probability that the difference between the mean of the linked PROMIS scores and the mean of the PROMIS PF T-scores (if obtained) would be within ±1.2 T-score points (1.96 × the .63 standard error of the HAQ-DI [max-8]). Such a small T-score range suggests that, with adequate sample size, the error introduced by linking is likely to be acceptable to most users of the cross-walk tables.

Table D1
Correlations, Mean Differences, and Standard Deviations of Differences between Actual vs. Linked PROMIS PF T-Scores, Using Cross-Walk Tables

                                Correlation   Mean of Differences   SD of Differences
PROMIS PF & HAQ-DI (sum-20)
  IRT fixed-parameter           0.80          -0.04                 5.77
  Equipercentile                0.80           0.28                 5.78
PROMIS PF & HAQ-DI (max-8)
  IRT fixed-parameter           0.80           0.07                 5.78
  Equipercentile                0.80           0.33                 5.78
PROMIS PF & SF-36 PF
  IRT fixed-parameter           0.91           0.03                 4.11
  Equipercentile                0.91          -0.15                 4.16

Note. The mean and SD of differences are on the T-score metric of the PROMIS PF. HAQ-DI = Health Assessment Questionnaire – Disability Index; PROMIS PF = PROMIS Physical Function; SF-36 PF = Short Form 36 Physical Function.
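As a worked illustration of the confidence-interval calculation described above, the following sketch reproduces the HAQ-DI (max-8), n = 75 case; the standard error is taken from the resampling results in this appendix.

    # Sketch: 95% confidence interval for linking error, HAQ-DI (max-8), n = 75.
    se <- 0.63                   # empirical standard error from the resampling above
    margin <- qnorm(0.975) * se  # 1.96 * 0.63, approximately 1.2 T-score points
    c(lower = -margin, upper = margin)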
Reference List

1. Lord FM. The Standard Error of Equipercentile Equating. Journal of Educational Statistics. 1982;7(3):165-74. doi:10.2307/1164642
2. Kolen MJ, Brennan RL. Test equating, scaling, and linking: methods and practices. New York: Springer; 2004.
3. Rose M, Bjorner JB, Gandek B, Bruce B, Fries JF, Ware JE Jr. The PROMIS Physical Function item bank was calibrated to a standardized metric and shown to improve measurement efficiency. J Clin Epidemiol. 2014;67(5):516-26. doi:10.1016/j.jclinepi.2013.10.024
4. Haebara T. Equating logistic ability scales by a weighted least squares method. Japanese Psychological Research. 1980;22:144-9.
5. Stocking ML, Lord FM. Developing a Common Metric in Item Response Theory. Appl Psychol Meas. 1983;7(2):201-10. doi:10.1177/014662168300700208
6. Weeks JP. plink: An R package for linking mixed-format tests using IRT-based methods. J Stat Softw. 2010;35(12):1-33.
7. R Development Core Team. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2011.
8. Thissen D, Chen WH, Bock D. MULTILOG 7.03. Lincolnwood, IL: Scientific Software International, Inc.; 2003.
9. Muthen LK, Muthen BO. Mplus User's Guide. Los Angeles, CA: Muthen & Muthen; 2006.
10. Browne MW, Cudeck R. Alternative ways of assessing model fit. In: Bollen KA, Long JS, editors. Testing Structural Equation Models. Newbury Park, CA: Sage Publications; 1993. p. 136-62.
11. Hu L, Bentler PM. Fit indices in covariance structure modeling: Sensitivity to underparameterized model misspecification. Psychol Methods. 1998;3(4):424-53.
12. Hu LT, Bentler PM. Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Struct Equ Modeling. 1999;6(1):1-55. doi:10.1080/10705519909540118
13. McDonald RP. Test Theory: A unified treatment. Mahwah, NJ: Lawrence Erlbaum Associates, Inc.; 1999.
14. Zinbarg R, Revelle W, Yovel I, Li W. Cronbach's α, Revelle's β, and McDonald's ωH: their relations with each other and two alternative conceptualizations of reliability. Psychometrika. 2005;70(1):123-33. doi:10.1007/s11336-003-0974-7
15. Revelle W. psych: Procedures for Personality and Psychological Research (R package version 1.2-1). Evanston, IL: Northwestern University; 2013.
16. Schmid J, Leiman J. The development of hierarchical factor solutions. Psychometrika. 1957;22(1):53-61. doi:10.1007/bf02289209
17. Reise SP, Scheines R, Widaman KF, Haviland MG. Multidimensionality and Structural Coefficient Bias in Structural Equation Modeling: A Bifactor Perspective. Educ Psychol Meas. 2013;73(1):5-26. doi:10.1177/0013164412449831
18. Reise SP, Bonifay WE, Haviland MG. Scoring and Modeling Psychological Measures in the Presence of Multidimensionality. J Pers Assess. 2012;95(2):129-40. doi:10.1080/00223891.2012.725437
19. Dorans NJ. Linking Scores from Multiple Health Outcome Instruments. Qual Life Res. 2007;16(Suppl 1):85-94. doi:10.2307/40212575