Appendix A
Details on Analyses of Data with Missing Responses
Not all phases of the analysis could accommodate participant responses with missing
item data. While the estimation of parameters (i.e., the linking itself) and confirmatory analyses
took advantage of all available data for all participants, the classical statistics analysis and
evaluation of linking accuracy are based on participants with complete responses. Therefore,
these analyses were based on slightly smaller sample sizes: 85% of total (N = 622) for the HAQ-DI (sum-20; 0 to 60 scoring) linking; 90% of total (N = 660) for the HAQ-DI (max-8; 0 to 3 scoring); and 91% of
total for the SF-36 PF (N = 656).
Appendix B
Details on Linking Methods
Multi-method approach. The PROsetta Stone methodology applies multiple linking
methods, including those based in item response theory (IRT), as well as more traditional
equipercentile methods.(1) Such a multi-method approach is recommended by Kolen and
Brennan (2004),(2) as it serves to ensure that any violations of assumptions do not distort the
results.
Fixed-parameter calibration. Item responses in each linking sample (PROMIS PF +
HAQ-DI [max-8], PROMIS PF + HAQ-DI [sum-20], and PROMIS PF + SF-36 PF) were
calibrated in a single run with PROMIS parameters fixed at their previously published values.(3)
Thus, the item parameters of the legacy instruments were estimated subject to the metric
defined by the PROMIS item parameters, which yielded item parameters for the legacy instruments
on the PROMIS metric.
Separate calibration with linking constants. The second IRT-based method we applied
was separate calibration followed by the computation of transformation constants. This
procedure uses the discrepancy between the established PROMIS parameters (3) and freely
calibrated estimation of PROMIS parameters to place the legacy parameters on the established
PROMIS metric. This is useful because it avoids imposing the constraints inherent in the fixed-parameter calibration. We applied four procedures to obtain the linking constants: mean/mean,
mean/sigma, Haebara,(4) and Stocking-Lord.(5) These IRT linking methods were implemented using the package plink (6) in
R.(7) We ran all IRT calibrations using MULTILOG 7.03.(8)
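As an illustration of the two moment-based procedures, the following minimal sketch computes mean/sigma and mean/mean linking constants and applies them to a legacy item. It is not the plink implementation used in the study. All parameter values, the variable names, and the convention that theta on the established metric equals A times theta on the free metric plus B are assumptions made for the example.

```python
from statistics import mean, pstdev

def mean_sigma(b_free, b_fixed):
    """Mean/sigma constants, computed from location (threshold) parameters only."""
    A = pstdev(b_fixed) / pstdev(b_free)
    B = mean(b_fixed) - A * mean(b_free)
    return A, B

def mean_mean(a_free, b_free, a_fixed, b_fixed):
    """Mean/mean constants: slope from discriminations, intercept from locations."""
    A = mean(a_free) / mean(a_fixed)
    B = mean(b_fixed) - A * mean(b_free)
    return A, B

def to_established_metric(a, thresholds, A, B):
    """Rescale one item's discrimination and thresholds with the constants."""
    return a / A, [A * b + B for b in thresholds]

# Hypothetical anchor (PROMIS) parameters on the established metric.
# For polytomous items, all thresholds would be pooled into one list.
a_fixed = [2.1, 1.8, 2.6, 3.0]
b_fixed = [-1.9, -1.2, -0.4, 0.3]

# Simulate a free calibration whose metric differs by known constants.
A_true, B_true = 1.25, -0.30
a_free = [a * A_true for a in a_fixed]
b_free = [(b - B_true) / A_true for b in b_fixed]

A_ms, B_ms = mean_sigma(b_free, b_fixed)
A_mm, B_mm = mean_mean(a_free, b_free, a_fixed, b_fixed)

# A hypothetical legacy item estimated alongside the free PROMIS calibration.
legacy_a, legacy_b = to_established_metric(2.5, [-1.0, 0.0, 0.8], A_ms, B_ms)
print(A_ms, B_ms, A_mm, B_mm, legacy_a, legacy_b)
```

Because the simulated free calibration is an exact linear rescaling, both procedures recover the same constants here. With real estimates they generally differ, which is one reason for comparing several procedures.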
Comparing IRT linking methods. We obtained four sets of IRT parameters for each
legacy measure. To compare methods, we examined the differences between the test
characteristic curves (TCCs). If the differences between the expected raw summed score values
were small (1 raw score point), we considered the methods interchangeable. When this occurred,
we defaulted to the simpler fixed-parameter method for obtaining scores for each participant.
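The TCC comparison can be sketched as follows. The example assumes a graded response model with categories scored from 0, and the item parameters, theta grid, and function names are hypothetical. It shows how the largest gap in expected raw summed score between two parameter sets might be checked against the 1-point criterion.

```python
import math

def expected_item_score(theta, a, thresholds):
    """Expected score (0..m) for one graded-response item at theta.

    With categories scored 0..m, the expected score equals the sum of the
    cumulative probabilities P(X >= k) for k = 1..m.
    """
    return sum(1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds)

def tcc(theta, items):
    """Test characteristic curve: expected raw summed score at theta."""
    return sum(expected_item_score(theta, a, bs) for a, bs in items)

def max_tcc_difference(items_1, items_2, lo=-4.0, hi=4.0, steps=81):
    """Largest absolute TCC gap over an evenly spaced theta grid."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return max(abs(tcc(t, items_1) - tcc(t, items_2)) for t in grid)

# Hypothetical legacy item parameters obtained from two linking methods.
# Each item is (discrimination, [thresholds]).
fixed_parameter = [(2.0, [-1.5, -0.5, 0.4]), (1.6, [-2.0, -1.0, 0.0])]
linking_constants = [(2.1, [-1.45, -0.5, 0.45]), (1.5, [-2.05, -1.0, 0.05])]

gap = max_tcc_difference(fixed_parameter, linking_constants)
interchangeable = gap <= 1.0  # criterion of 1 raw score point from the text
print(round(gap, 3), interchangeable)
```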
Appendix C
Unidimensionality Analysis and Details on Classical Item Statistics
Confirmatory factor analyses (CFA) were conducted on the raw data treating the
indicator variables as ordinal and using the WLSMV estimator of Mplus.(9) This model posited
that all items load highly on a single factor. We calculated fit statistics to help quantify the fit of
a unidimensional model; these included the Comparative Fit Index (CFI), the Tucker-Lewis
Index (TLI), and the root mean square error of approximation (RMSEA). We used the following
model fit criteria: RMSEA < .08,(10) TLI > .95,(11) and CFI > .95.(11, 12) We applied these
somewhat arbitrary criteria as informative guides for judging the relative unidimensionality of
item responses.
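The cutoffs above amount to three simple comparisons. The short sketch below applies them to one set of fit values; the function name is hypothetical, and the inputs are the CFI, TLI, and RMSEA reported below for the PROMIS and HAQ-DI (sum-20) item set.

```python
def meets_fit_criteria(cfi, tli, rmsea):
    """Apply the stated cutoffs; return per-index results and an overall flag."""
    checks = {
        "CFI > .95": cfi > 0.95,
        "TLI > .95": tli > 0.95,
        "RMSEA < .08": rmsea < 0.08,
    }
    return checks, all(checks.values())

checks, acceptable = meets_fit_criteria(cfi=0.97, tli=0.97, rmsea=0.04)
print(checks, acceptable)
```

Since the criteria are treated as informative guides rather than strict rules, the per-index results are returned alongside the overall flag.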
We also estimated the proportion of total variance attributable to a general factor, known
as omega hierarchical (OmegaH),(13, 14) using the psych package (15) in R.(7) This method
estimates OmegaH from the general factor loadings derived from an exploratory factor analysis
and a Schmid–Leiman transformation.(16) Values of 0.70 or higher for OmegaH suggest that
the item set is sufficiently unidimensional for most purposes.(17, 18)
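To make the quantity concrete, the sketch below computes OmegaH from a set of orthogonal, standardized general and group factor loadings of the kind a Schmid–Leiman transformation produces. It is a simplified illustration, not the psych package routine used in the study, and the loadings and function name are hypothetical.

```python
def omega_hierarchical(general, groups):
    """OmegaH from orthogonal, standardized loadings.

    general: general-factor loading for each item
    groups:  list of group factors, each a list with one loading per item
    """
    n_items = len(general)
    general_part = sum(general) ** 2                 # variance due to general factor
    group_part = sum(sum(g) ** 2 for g in groups)    # variance due to group factors
    unique_part = sum(                               # item uniquenesses
        1.0 - general[i] ** 2 - sum(g[i] ** 2 for g in groups)
        for i in range(n_items)
    )
    total_variance = general_part + group_part + unique_part
    return general_part / total_variance

# Hypothetical loadings for six items and two group factors.
general = [0.80, 0.75, 0.70, 0.78, 0.72, 0.76]
groups = [
    [0.30, 0.35, 0.25, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.28, 0.33, 0.30],
]

omega_h = omega_hierarchical(general, groups)
print(round(omega_h, 3), omega_h >= 0.70)
```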
For the three combined PROMIS and legacy item sets, CFA fit statistics suggested very good
fit. For PROMIS and the HAQ-DI (sum-20) (96 items), fit values were: CFI = 0.97, TLI = 0.97, and
RMSEA = 0.04. For PROMIS and the HAQ-DI (max-8) (84 items), fit values were: CFI = 0.97, TLI = 0.97, and
RMSEA = 0.04. For PROMIS and the SF-36 PF (86 items), fit values were: CFI = 0.98, TLI =
0.98, and RMSEA = 0.04. These results suggest adequate fit of the data to a unidimensional model. As
shown in Table C1, values of OmegaH were relatively high, ranging from .77 to .82. These
values suggest the presence of a dominant general factor for each instrument pair.(17) Table C1
also shows classical item statistics, including item-total correlations and Cronbach’s alpha.
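For readers unfamiliar with these classical statistics, the following sketch computes Cronbach's alpha and item-total correlations for a small complete-case response matrix. The responses are hypothetical, and the use of corrected (item-rest) correlations is an assumption, since the variant is not specified here.

```python
from statistics import mean, pvariance

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def cronbach_alpha(rows):
    """Cronbach's alpha from rows of item responses (one row per person)."""
    k = len(rows[0])
    item_vars = [pvariance([r[j] for r in rows]) for j in range(k)]
    total_var = pvariance([sum(r) for r in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

def corrected_item_total(rows):
    """Correlation of each item with the sum of the remaining items."""
    k = len(rows[0])
    return [
        pearson([r[j] for r in rows], [sum(r) - r[j] for r in rows])
        for j in range(k)
    ]

# Hypothetical responses from six participants with no missing data.
rows = [
    [0, 0, 1, 0],
    [1, 1, 1, 0],
    [1, 2, 2, 1],
    [2, 2, 3, 2],
    [3, 3, 3, 2],
    [3, 2, 3, 3],
]

alpha = cronbach_alpha(rows)
item_total = corrected_item_total(rows)
print(round(alpha, 3), [round(r, 2) for r in item_total])
```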
Table C1
Classical Item Analysis for Individual and Combined Instruments
Item-Total Correlations
HAQ-DI (sum-20)
PROMIS PF & HAQ-DI (sum-20)
HAQ-DI (max-8)
PROMIS PF & HAQ-DI (max-8)
SF-36 PF
Estimates are for those participants with no missing data. This represents 85% of total (N = 622)
for the HAQ-DI (sum-20) linking, 90% of total (N = 660) for HAQ-DI (max-8), and 91% of total
for the SF-36 PF (N = 656). Although the PROMIS PF sample composition was slightly different
for each linkage, we report here only the statistics for the HAQ-DI (sum-20) linking sample, as
the estimates differ by no more than 0.01 as the sample composition changes.
HAQ-DI = Health Assessment Questionnaire – Disability Index; PROMIS PF = PROMIS
Physical Function; SF-36 PF = Short Form 36 Physical Function
Appendix D
Evaluation of Linking Accuracy
We used a single-group design, the strongest of available linking designs.(19) In this
approach, items from each instrument are administered to all participants and scores are obtained
for each respondent on each measure to be linked. This is convenient when evaluating the
validity of the linking method, because actual scores on the anchor or reference measure
(PROMIS PF scores in the current study) are obtained, as well as those generated by linking
through the target measures (HAQ-DI and SF-36 PF).
We evaluated the accuracy of linking by comparing (estimated) linked scores to their
actual scores. We computed Pearson product-moment correlations between estimated and actual
scores, and calculated the mean and standard deviation of the differences in scores. Our
expectation was that correlations would be high (e.g., .80 or higher) and that the mean of the
differences would be close to zero.
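The three accuracy summaries can be computed as in this minimal sketch. The T-scores are hypothetical, and the sign convention (linked minus actual) is an assumption.

```python
from statistics import mean, stdev

def linking_accuracy(actual, linked):
    """Return (Pearson r, mean difference, SD of differences)."""
    ma, ml = mean(actual), mean(linked)
    sxy = sum((a - ma) * (l - ml) for a, l in zip(actual, linked))
    sxx = sum((a - ma) ** 2 for a in actual)
    syy = sum((l - ml) ** 2 for l in linked)
    r = sxy / (sxx * syy) ** 0.5
    diffs = [l - a for a, l in zip(actual, linked)]  # linked minus actual
    return r, mean(diffs), stdev(diffs)

# Hypothetical actual and linked PROMIS PF T-scores for eight participants.
actual = [32.1, 38.4, 41.0, 45.5, 47.2, 50.3, 53.8, 58.9]
linked = [33.0, 37.5, 42.2, 44.1, 48.0, 51.5, 52.6, 57.9]

r, mean_diff, sd_diff = linking_accuracy(actual, linked)
print(round(r, 3), round(mean_diff, 2), round(sd_diff, 2))
```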
Users of the cross-walk tables may wish to know that the size of their samples affects the
error associated with linking. To estimate how this error decreases as sample size increases, we
conducted a resampling procedure to obtain multiple, random, small subsets of cases (n = 25, 50,
and 75). Samples were drawn randomly with replacement over 10,000 replications
(bootstrapping). For each replication, the mean difference between the actual and linked
PROMIS T-score was computed. This resulted in a distribution of differences that might be
expected with different sample sizes. The means and the standard deviations of these
distributions were computed over replications as an estimate of bias and empirical standard error,
respectively.
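The resampling procedure can be sketched as below. The per-person differences are simulated and purely hypothetical, the function name and seeds are illustrative, and the example runs fewer replications than the 10,000 described above to keep it quick.

```python
import random
from statistics import mean, stdev

def resample_mean_differences(diffs, n, replications=10_000, seed=0):
    """Bootstrap the mean actual-vs-linked difference for samples of size n.

    Returns (bias, empirical_se): the mean and SD of the replicate means.
    """
    rng = random.Random(seed)
    replicate_means = [
        sum(rng.choices(diffs, k=n)) / n  # draw n cases with replacement
        for _ in range(replications)
    ]
    return mean(replicate_means), stdev(replicate_means)

# Hypothetical per-person differences between actual and linked T-scores.
data_rng = random.Random(42)
diffs = [data_rng.gauss(0.0, 6.0) for _ in range(600)]

# The text describes 10,000 replications; 2,000 are used here for speed.
results = {
    n: resample_mean_differences(diffs, n, replications=2_000)
    for n in (25, 50, 75)
}
for n, (bias, se) in results.items():
    print(n, round(bias, 2), round(se, 2))
```

In this simulation the empirical standard error shrinks roughly in proportion to one over the square root of the subset size, which is the pattern the procedure is designed to reveal.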
To evaluate the accuracy of our linking methods, we compared the linked PROMIS T-scores to actual PROMIS T-scores based on correlations, mean differences, and standard
deviations of difference scores for the linked and actual scores (see Table D1). The correlations
were quite high (r ≥ .80) and the means of the difference scores were close to zero for each of the
three links. There was a high degree of similarity between the methods (equipercentile vs. IRT),
suggesting that small violations of IRT assumptions did not distort the results.
Results of resampling with small subsets (n = 25, 50, and 75) demonstrated that, as expected, the
magnitude of linking error was associated with sample size. For the HAQ-DI (max-8) link,
the empirical standard error decreases from 1.2 to .79 to .63 as sample size increases from 25 to
50 to 75. For the HAQ-DI (sum-20), the corresponding errors are 1.1, .78, and .62. For the SF-36 PF, the corresponding errors are .80, .55, and .45.
These standard errors can be used to create confidence intervals around linking results. If
the PROsetta Stone cross-walk table were used to estimate PROMIS scores from a sample of 75
HAQ-DI (max-8) scores, there would be a 95% probability that the difference between the mean
of these linked PROMIS scores and the mean of the actual PROMIS PF T-scores (if obtained) would be
within ± 1.2 T-score points (1.96 × the .63 standard error for the HAQ-DI [max-8] link). Such a small
T-score range suggests that, with adequate sample size, the error introduced by linking is likely
to be acceptable to most users of the cross-walk tables.
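The same arithmetic extends to the other links and sample sizes. The sketch below multiplies each empirical standard error reported above by 1.96 to obtain the half-width of an approximate 95% interval; the dictionary layout and variable names are illustrative only.

```python
Z_95 = 1.96  # normal quantile for a two-sided 95% interval

# Empirical standard errors reported in the text, by link and sample size.
reported_se = {
    "HAQ-DI (max-8)": {25: 1.2, 50: 0.79, 75: 0.63},
    "HAQ-DI (sum-20)": {25: 1.1, 50: 0.78, 75: 0.62},
    "SF-36 PF": {25: 0.80, 50: 0.55, 75: 0.45},
}

# Half-width of the interval around the mean difference, in T-score points.
half_width = {
    link: {n: Z_95 * se for n, se in by_n.items()}
    for link, by_n in reported_se.items()
}

for link, by_n in half_width.items():
    print(link, {n: round(w, 1) for n, w in by_n.items()})
```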
Table D1
Correlations, Mean Differences, and Standard Deviations of Differences between Actual vs.
Linked PROMIS PF T-Scores, using Cross-Walk Tables
Mean of
SD of
PROMIS PF & HAQ-DI (sum-20)
IRT fixed-parameter
PROMIS PF & HAQ-DI (max-8)
IRT fixed-parameter
IRT fixed-parameter
The mean and SD of differences values are on the T-score metric of the PROMIS PF.
HAQ-DI = Health Assessment Questionnaire – Disability Index; PROMIS PF = PROMIS
Physical Function; SF-36 PF = Short Form 36 Physical Function
Reference List
1. Lord FM. The Standard Error of Equipercentile Equating. Journal of Educational Statistics. 1982;7(3):165-74. doi:10.2307/1164642
2. Kolen MJ, Brennan RL. Test equating, scaling, and linking: methods and practices. New York: Springer; 2004.
3. Rose M, Bjorner JB, Gandek B, Bruce B, Fries JF, Ware Jr JE. The PROMIS Physical Function item bank was calibrated to a standardized metric and shown to improve measurement efficiency. J Clin Epidemiol. 2014;67(5):516-26.
4. Haebara T. Equating logistic ability scales by a weighted least squares method. Japanese Psychological Research. 1980;22(3):144-9.
5. Stocking ML, Lord FM. Developing a Common Metric in Item Response Theory. Appl Psychol Meas. 1983;7(2):201-10. doi:10.1177/014662168300700208
6. Weeks JP. Plink: An R package for linking mixed-format tests using IRT-based methods. J Stat Softw. 2010;35(12):1-33.
7. R Core Development Team. R: a language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2011.
8. Thissen D, Chen WH, Bock D. Multilog 7.03. Lincolnwood, IL: Scientific Software International, Inc.; 2003.
9. Muthen LK, Muthen BO. Mplus User's Guide. Los Angeles, CA: Muthen & Muthen;
10. Browne MW, Cudeck R. Alternative ways of assessing model fit. In: Bollen KA, Long JS, editors. Testing Structural Equation Models. Newbury Park, CA: Sage Publications; 1993. p. 136-62.
11. Hu L, Bentler PM. Fit indices in covariance structure modeling: Sensitivity to underparameterized model misspecification. Psychol Methods. 1998;3(4):424-53.
12. Hu LT, Bentler PM. Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Struct Equ Modeling. 1999;6(1):1-55.
13. McDonald RP. Test Theory: A unified treatment. Mahwah, NJ: Lawrence Erlbaum Associates, Inc.; 1999.
14. Zinbarg R, Revelle W, Yovel I, Li W. Cronbach’s α, Revelle’s β, and McDonald’s ωH: their relations with each other and two alternative conceptualizations of reliability. Psychometrika. 2005;70(1):123-33. doi:10.1007/s11336-003-0974-7
15. Revelle W. psych: Procedures for Personality and Psychological Research. (R Package Version 1.2-1) Evanston, Illinois, USA: Northwestern University; 2013.
16. Schmid J, Leiman J. The development of hierarchical factor solutions. Psychometrika. 1957;22(1):53-61. doi:10.1007/bf02289209
17. Reise SP, Scheines R, Widaman KF, Haviland MG. Multidimensionality and Structural Coefficient Bias in Structural Equation Modeling: A Bifactor Perspective. Educ Psychol Meas. 2013;73(1):5-26. doi:10.1177/0013164412449831
18. Reise SP, Bonifay WE, Haviland MG. Scoring and Modeling Psychological Measures in the Presence of Multidimensionality. J Pers Assess. 2012;95(2):129-40.
19. Dorans NJ. Linking Scores from Multiple Health Outcome Instruments. Qual Life Res. 2007;16(Supplement 1):85-94. doi:10.2307/40212575