Supplemental Material B - Detailed Methods for Evaluation of Clinimetric Properties
The following criteria were used to evaluate the clinimetric properties of the selected
scales.
Content validity (the extent to which the domain of interest is comprehensively sampled by the items in the questionnaire) was assessed by determining whether patients were involved during item selection and/or item reduction, and whether both patients and expert clinicians in the field were consulted on reading and comprehension. Content validity was rated as: “+” (patients and an investigator or expert were involved in scale development), “±” (only patients were involved in constructing the rating scale), “-” (no patient involvement in scale development), or “?” (no information found on content validity).
Readability & comprehension was rated as: “+” (readability was tested or reported and the result was good, i.e., the scale reads at or below the 5th grade level), “-” (inadequate testing of the scale or a reading level above the 5th grade), or “?” (no information found on readability and comprehension).
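The source does not state which readability formula was used to establish grade level. As an illustration only, the following Python sketch estimates a reading grade level using the standard Flesch-Kincaid grade-level formula with a crude vowel-group syllable heuristic; the example item is hypothetical.

```python
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: count groups of consecutive vowels (at least 1).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text: str) -> float:
    # FKGL = 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59

item = "How much difficulty do you have walking one block?"  # hypothetical item
print(f"Estimated grade level: {flesch_kincaid_grade(item):.1f}")  # "+" if <= 5
```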
Internal consistency (the extent to which items in a (sub)scale are intercorrelated), a measure of the homogeneity of a (sub)scale, was examined by checking that factor analysis was applied to provide empirical support for the dimensionality of the questionnaire and that Cronbach's alpha exceeded 0.80 for every dimension/subscale. Although there is no ‘gold standard’ for how high Cronbach’s alpha should be for ‘good’ score reliability, some rough guidelines are: ≥0.70, adequate; ≥0.80, very good; ≥0.90, excellent.69, 70 Internal consistency was rated as: “+” (adequate design, method, and factor analysis, with alpha > 0.80), “±” (doubtful method used), “–” (inadequate internal consistency), or “?” (no information found on internal consistency of the rating scale).
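For illustration, Cronbach's alpha can be computed directly from an item-score matrix with its standard formula. The sketch below uses hypothetical respondent data, not data from the reviewed studies.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for a (respondents x items) score matrix.

    alpha = k/(k-1) * (1 - sum(item variances) / variance of total score)
    """
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical 5-item subscale scored by 6 respondents (0-4 per item).
scores = np.array([
    [4, 3, 4, 4, 3],
    [2, 2, 1, 2, 2],
    [3, 3, 3, 4, 3],
    [1, 0, 1, 1, 1],
    [4, 4, 3, 4, 4],
    [2, 1, 2, 2, 2],
])
print(f"alpha = {cronbach_alpha(scores):.2f}")  # rate "+" only if alpha > 0.80
```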
Construct validity (the extent to which scores on the questionnaire relate to other measures in a manner consistent with theoretically derived hypotheses concerning the domains being measured) was evaluated for each rating scale by confirming that hypotheses were formulated and that the results were acceptable in accordance with those hypotheses. Finally, the validation process was assessed to determine that an adequate measure was used. Construct validity was rated as: “+” (adequate design, method, and result), “±” (doubtful method used), “-” (inadequate construct validity), or “?” (no information found on construct validity).
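As a minimal sketch of testing a prespecified convergent-validity hypothesis, one might correlate the new scale against an established comparator measure. The data, variable names, and the 0.60 threshold below are hypothetical illustrations, not values from the source.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical data: scores on the new scale vs. an established comparator.
new_scale  = np.array([10, 14, 22, 8, 30, 18, 25, 12])
comparator = np.array([12, 15, 20, 9, 28, 16, 27, 11])

# Prespecified (hypothetical) hypothesis: convergent validity requires rho >= 0.60.
rho, p = spearmanr(new_scale, comparator)
print(f"rho = {rho:.2f}, p = {p:.3f}")
print("hypothesis supported" if rho >= 0.60 else "hypothesis not supported")
```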
Acceptability/floor & ceiling effects (the degree to which scores adequately represent the distribution of health in the sampled population) were assessed by confirming that descriptive statistics of the distribution of scores were presented and by evaluating whether 15% or more of respondents achieved the highest or lowest possible score. The rating scale was then graded as: “+” (no floor/ceiling effects), “-” (more than 15% of respondents achieved the highest or lowest possible score), or “?” (no information found on floor and ceiling effects).
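A minimal sketch of this check, using hypothetical total scores, simply counts the percentage of respondents at the scale's minimum and maximum.

```python
import numpy as np

def floor_ceiling(scores: np.ndarray, min_score: int, max_score: int) -> tuple:
    """Percentage of respondents at the lowest and highest possible scores."""
    n = len(scores)
    floor_pct = 100 * np.sum(scores == min_score) / n
    ceiling_pct = 100 * np.sum(scores == max_score) / n
    return floor_pct, ceiling_pct

# Hypothetical total scores on a 0-40 scale.
scores = np.array([0, 3, 12, 40, 40, 40, 21, 40, 5, 40, 33, 40, 40, 18, 40, 40])
floor_pct, ceil_pct = floor_ceiling(scores, 0, 40)
print(f"floor {floor_pct:.0f}%, ceiling {ceil_pct:.0f}%")
# Rate "-" if either exceeds 15%; here the ceiling effect fails the criterion.
```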
Test-retest reliability/agreement (the extent to which the same results are obtained on repeated administrations of the same questionnaire when no change in physical functioning has occurred) was assessed by confirming that an intraclass correlation coefficient (ICC) was calculated, that the ICC was > 0.70, and that the time interval and confidence intervals were presented. The scale was rated as: “+” (adequate design, method, and ICC > 0.70), “±” (doubtful method used), “-” (inadequate reliability), or “?” (no information found on test-retest reliability). In evaluative scales, the ability to produce exactly the same scores on repeated measurements was assessed by ascertaining that the limits of agreement, kappa, or standard error of measurement (SEM) were presented for the scale. Agreement was rated as: “+” (adequate design, method, and result), “±” (doubtful method used), “-” (inadequate agreement), or “?” (no information found on agreement).
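The source does not state which ICC model the reviewed studies used. For illustration, the sketch below computes ICC(2,1) (two-way random effects, absolute agreement, single measure, per Shrout and Fleiss) from a hypothetical test-retest matrix, and derives the SEM with the common approximation SEM = SD × √(1 − ICC).

```python
import numpy as np

def icc_2_1(x: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single measure.

    x is a (subjects x sessions) matrix of test-retest scores.
    """
    n, k = x.shape
    m = x.mean()
    msr = k * ((x.mean(axis=1) - m) ** 2).sum() / (n - 1)  # between-subjects
    msc = n * ((x.mean(axis=0) - m) ** 2).sum() / (k - 1)  # between-sessions
    sse = ((x - m) ** 2).sum() - (n - 1) * msr - (k - 1) * msc
    mse = sse / ((n - 1) * (k - 1))                        # residual
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Hypothetical test-retest scores (8 subjects, 2 sessions).
x = np.array([[12, 13], [25, 24], [8, 9], [30, 28],
              [17, 18], [22, 21], [14, 14], [27, 29]], dtype=float)
icc = icc_2_1(x)
sem = x.std(ddof=1) * np.sqrt(1 - icc)  # standard error of measurement
print(f"ICC = {icc:.2f}, SEM = {sem:.2f}")  # rate "+" if ICC > 0.70
```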
Responsiveness (the ability to detect important change over time in the concept being measured) was determined by confirming that hypotheses regarding responsiveness were formulated, that the results were in agreement with them, and that an adequate measure was used (effect size (ES), standardized response mean (SRM), or comparison with an external standard). The scale was rated as: “+” (adequate design, method, and result), “±” (doubtful method used), “-” (inadequate responsiveness), or “?” (no information found on responsiveness).
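As a worked illustration with hypothetical pre/post scores, the ES and SRM reduce to two one-line formulas: mean change divided by the baseline SD, and mean change divided by the SD of the change scores, respectively.

```python
import numpy as np

# Hypothetical pre/post treatment scores for 8 patients (lower = better).
pre  = np.array([30, 25, 28, 22, 35, 27, 31, 24], dtype=float)
post = np.array([22, 20, 21, 18, 26, 23, 24, 19], dtype=float)
change = post - pre

es  = change.mean() / pre.std(ddof=1)     # effect size: mean change / baseline SD
srm = change.mean() / change.std(ddof=1)  # standardized response mean
print(f"ES = {es:.2f}, SRM = {srm:.2f}")  # |value| >= 0.80 is often called large
```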
Interpretability (the degree to which one can assign qualitative meaning to quantitative scores) was assessed by ascertaining whether the authors provided information on the interpretation of scores, e.g.: 1) means and SDs of scores before and after treatment were reported, 2) comparative data on the distribution of scores in relevant subgroups were reported, 3) information on the relationship of scores to well-known functional measures or clinical diagnoses was presented, or 4) information on the association between changes in score and patients' global ratings of the magnitude of change they have experienced was reported. Each scale was rated as: “+” (two or more of the above types of information were presented), “±” (doubtful method used or doubtful description), or “?” (no information found on interpretation).
Minimal clinically important difference (MCID) (the smallest difference in score in the domain of interest which patients perceive as beneficial and which would mandate a change in the patient's management) was assessed by confirming that information was provided about what (difference in) score would be clinically meaningful. The scale was rated as “+” (MCID presented) or “–” (no MCID presented).
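The source does not describe how MCIDs were derived in the reviewed studies. A common anchor-based approach, sketched below with hypothetical data, takes the mean change score among patients whose global rating of change indicates slight improvement.

```python
import numpy as np

# Hypothetical change scores paired with patients' global ratings of change.
change = np.array([-2, -5, -4, -1, -8, -3, -6, -2, -7, -4], dtype=float)
global_rating = np.array(["none", "slight", "slight", "none", "much",
                          "slight", "much", "none", "much", "slight"])

# Anchor-based MCID: mean change among patients reporting slight improvement.
mcid = change[global_rating == "slight"].mean()
print(f"anchor-based MCID estimate: {mcid:.1f} points")
```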
Time to administer, or the time needed to complete the questionnaire, was assessed. In research, longer completion times may be acceptable, but in clinical care, scales that take longer than five to ten minutes to complete tend to be burdensome. The selected scales were thus rated as: “+” (less than 10 minutes to complete), “-” (more than 10 minutes to complete), or “?” (no information found on time to complete the scale).
Administration burden, or the ease of the method used to calculate the scale’s score, was rated as: “+” (easy, e.g., summing of the items), “±” (moderate, e.g., visual analogue scale (VAS) or simple formula), “-” (difficult, e.g., VAS in combination with a formula, or a complex formula), or “?” (no information found on the rating method).
Conclusions. After the clinimetric evaluation, each scale was rated according to the following definitions: a scale was “Recommended” if it had been used in PD in reports other than the original scale description, had been studied clinimetrically, and was considered valid, reliable, and sensitive. “Suggested” scales met at least part of the above criteria but fell short of meeting all of them.