Supplemental Material B - Detailed Methods for Evaluation of Clinimetric Properties

The following criteria were used to evaluate the clinimetric properties of the selected scales.

Content validity (the extent to which the domain of interest is comprehensively sampled by the items in the questionnaire) was assessed by determining whether patients were involved in item selection and/or item reduction, and whether patients, as well as expert clinicians in the field, were consulted for reading and comprehension. Content validity was rated as: “+” (patients and an investigator or expert were involved in scale development), “±” (only patients were involved in constructing the rating scale), “-” (no patient involvement in scale development), or “?” (no information found on content validity).

Readability and comprehension were rated as: “+” (readability was tested or reported and the result was good, i.e., the scale reads at a 5th-grade level or below), “-” (inadequate testing of the scale, or a reading level above the 5th grade), or “?” (no information found on readability and comprehension).

Internal consistency (the extent to which items in a (sub)scale are intercorrelated), a measure of the homogeneity of a (sub)scale, was examined by confirming that factor analysis was applied to provide empirical support for the dimensionality of the questionnaire and that Cronbach's alpha was greater than 0.80 for every dimension/subscale. There is no gold standard for how high Cronbach's alpha should be for “good” score reliability, but some rough guidelines exist: >=0.70 adequate; >=0.80 very good; >=0.90 excellent [69, 70]. Internal consistency was rated as: “+” (adequate design and method, with factor analysis and alpha > 0.80), “±” (doubtful method used), “-” (inadequate internal consistency), or “?” (no information found on the internal consistency of the rating scale). A computational sketch of this check appears after the acceptability paragraph below.

Construct validity (the extent to which scores on the questionnaire relate to other measures in a manner consistent with theoretically derived hypotheses concerning the domains being measured) was evaluated for each rating scale by confirming that hypotheses were formulated and that the results were acceptable in accordance with those hypotheses. Finally, the validation process was assessed to determine that an adequate measure was used. Construct validity was rated as: “+” (adequate design, method, and result), “±” (doubtful method used), “-” (inadequate construct validity), or “?” (no information found on construct validity).

Acceptability/floor and ceiling effects (the degree to which scores adequately represent the distribution of health in the sampled population) were assessed by confirming that descriptive statistics of the distribution of scores were presented and by evaluating whether 15% or more of respondents achieved the highest or lowest possible score. The rating scale was then graded as: “+” (no floor/ceiling effects), “-” (more than 15% of respondents achieved the highest or lowest possible score), or “?” (no information found on floor and ceiling effects).
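For illustration only, the following minimal Python sketch shows how the two quantities above could be computed from a respondents-by-items score matrix. The data, scale range, and function names are hypothetical and are not drawn from the source studies.

import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for a 2-D array: rows = respondents, columns = items."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                          # number of items
    item_vars = scores.var(axis=0, ddof=1)       # sample variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of the total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def floor_ceiling_pct(totals, min_score, max_score):
    """Percentage of respondents at the lowest/highest possible total score."""
    totals = np.asarray(totals)
    return (100 * np.mean(totals == min_score),
            100 * np.mean(totals == max_score))

# Hypothetical example: 5 respondents, 4 items each scored 0-4 (totals 0-16).
scores = np.array([[0, 1, 2, 1],
                   [4, 4, 4, 4],
                   [2, 3, 2, 3],
                   [1, 2, 1, 2],
                   [3, 3, 4, 3]])
alpha = cronbach_alpha(scores)                        # about 0.95 here
floor_pct, ceiling_pct = floor_ceiling_pct(scores.sum(axis=1), 0, 16)
# alpha > 0.80 would support a "+" rating for internal consistency; here
# ceiling_pct is 20%, which would trigger a "-" rating for ceiling effects.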
Test-retest reliability/agreement (the extent to which the same results are obtained on repeated administrations of the same questionnaire when no change in physical functioning has occurred) was assessed by confirming that an intraclass correlation coefficient (ICC) was calculated, that the ICC was > 0.70, and that the time interval and confidence intervals were presented; a computational sketch of the ICC, SEM, and SRM appears at the end of this supplement. The scale was rated as: “+” (adequate design, method, and ICC > 0.70), “±” (doubtful method used), “-” (inadequate reliability), or “?” (no information found on test-retest reliability). In evaluative scales, the ability to produce exactly the same scores on repeated measurements was assessed by ascertaining that the limits of agreement, kappa, or the standard error of measurement (SEM) was presented for the scale. Agreement was rated as: “+” (adequate design, method, and result), “±” (doubtful method used), “-” (inadequate agreement), or “?” (no information found on agreement).

Responsiveness (the ability to detect important change over time in the concept being measured) was determined by confirming that hypotheses regarding responsiveness were formulated, that the results agreed with those hypotheses, and that an adequate measure was used (effect size [ES], standardized response mean [SRM], or comparison with an external standard). The scale was rated as: “+” (adequate design, method, and result), “±” (doubtful method used), “-” (inadequate responsiveness), or “?” (no information found on responsiveness).

Interpretability (the degree to which one can assign qualitative meaning to quantitative scores) was assessed by ascertaining whether the authors provided information on the interpretation of scores, e.g.: 1) means and SDs of scores before and after treatment were reported; 2) comparative data on the distribution of scores in relevant subgroups were reported; 3) information on the relationship of scores to well-known functional measures or clinical diagnoses was presented; 4) information on the association between changes in score and patients' global ratings of the magnitude of change they experienced was provided. Each scale was rated as: “+” (two or more of the above types of information were presented), “±” (doubtful method used or doubtful description), or “?” (no information found on interpretability).

Minimal clinically important difference (MCID) (the smallest difference in score in the domain of interest that patients perceive as beneficial and that would mandate a change in the patient's management) was assessed by confirming that information was provided about what (difference in) score would be clinically meaningful. The scale was rated as: “+” (MCID presented) or “-” (no MCID presented).

Time to administer, or the time needed to complete the questionnaire, was also assessed. In research, longer completion times may be acceptable, but in clinical care, scales that take longer than five to ten minutes to complete tend to be burdensome. The selected scales were therefore rated as: “+” (less than 10 minutes to complete), “-” (more than 10 minutes to complete), or “?” (no information found on time to complete the scale).

Administration burden, or the ease of the method used to calculate the scale's score, was rated as: “+” (easy, e.g., summing the items), “±” (moderate, e.g., a visual analogue scale (VAS) or a simple formula), “-” (difficult, e.g., a VAS in combination with a formula, or a complex formula), or “?” (no information found on the scoring method).

Conclusions

After the clinimetric evaluation, each scale was rated according to the following definitions: a scale was “Recommended” if it had been used in PD in reports other than the original scale description and had been studied clinimetrically and considered valid, reliable, and sensitive. “Suggested” scales met at least part of the above criteria but fell short of meeting all of them.
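For illustration only, a companion Python sketch of the test-retest and responsiveness statistics referenced above: a two-way random-effects, absolute-agreement intraclass correlation (Shrout-Fleiss ICC(2,1)), the SEM, and the SRM. The data and function names are hypothetical and are not drawn from the source studies.

import numpy as np

def icc_2_1(data):
    """ICC(2,1): rows = subjects, columns = repeated administrations."""
    data = np.asarray(data, dtype=float)
    n, k = data.shape
    grand = data.mean()
    ms_r = k * ((data.mean(axis=1) - grand) ** 2).sum() / (n - 1)  # subjects
    ms_c = n * ((data.mean(axis=0) - grand) ** 2).sum() / (k - 1)  # occasions
    resid = (data - data.mean(axis=1, keepdims=True)
                  - data.mean(axis=0, keepdims=True) + grand)
    ms_e = (resid ** 2).sum() / ((n - 1) * (k - 1))                # residual
    return (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n)

def sem_agreement(data, icc):
    """Standard error of measurement, here taken as SD * sqrt(1 - ICC)."""
    return np.asarray(data).std(ddof=1) * np.sqrt(1 - icc)

def srm(baseline, follow_up):
    """Standardized response mean: mean change / SD of the change scores."""
    change = np.asarray(follow_up, dtype=float) - np.asarray(baseline, dtype=float)
    return change.mean() / change.std(ddof=1)

# Hypothetical example: 6 subjects scored twice with no true change between.
test_retest = np.array([[10, 11], [14, 13], [20, 21],
                        [8, 9], [17, 16], [12, 12]])
icc = icc_2_1(test_retest)              # about 0.98 here; > 0.70 rates "+"
sem = sem_agreement(test_retest, icc)
change_srm = srm(test_retest[:, 0], test_retest[:, 1])
# change_srm is near zero here, as expected when no true change occurred.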