Evidence of Validity for the Hip Outcome Score

RobRoy L. Martin, Ph.D., P.T., C.S.C.S., Bryan T. Kelly, M.D., and Marc J. Philippon, M.D.
Purpose: The purpose of this study was to offer evidence of validity for the Hip Outcome Score
(HOS) based on internal structure, test content, and relation to other variables. Methods: The study
population consisted of 507 subjects with a labral tear. Internal structure was evaluated by use of
factor analysis and coefficient α. Test content was evaluated by use of item response theory. Pearson
correlation coefficients were used to assess relations between the Short Form 36 and the HOS.
Results: The mean subject age was 38 years (range, 13 to 66 years), with 232 male and 273 female
subjects. Of the subjects, 263 (52%) underwent arthroscopic surgery. Factor analysis found that 17
of 19 items on the activities-of-daily-living (ADL) subscale loaded on 1 factor. The 2 items that did
not fit the 1-factor model were omitted from further testing. All 9 items on the sports subscale loaded
on 1 factor. The coefficient α values were .96 and .95 for the ADL and sports subscales, respectively.
The errors associated with a single measure were ±4.6 and ±3.8 points for the ADL and sports
subscales, respectively. Item response theory found that all items contributed to their test information
curves and were potentially responsive. The correlations between the HOS and Short Form 36
measures of physical function were significantly different than their correlation to measures of mental
functioning (P < .005). Conclusions: The results of this study provide evidence of validity to support
the use of the HOS ADL and sports subscales for individuals with labral tears. This includes
individuals who underwent arthroscopic surgery, as well as those who did not. Specifically, the
results of this study found that the HOS ADL and sports subscales were unidimensional, had adequate
internal consistency, were potentially responsive across the spectrum of ability, and contributed
information across the spectrum of ability. In addition, scores obtained by the HOS related to
measures of function and did not relate to measures of mental health. Level of Evidence: Level III,
development of diagnostic criteria with nonconsecutive patients. Key Words: Hip Outcome Score—
Labral tear—Hip arthroscopy—Outcome instrument—Validity.
From the Department of Physical Therapy, Duquesne University
(R.L.M.), Pittsburgh, Pennsylvania; Hospital for Special Surgery,
New York-Presbyterian Hospital, Weill Medical College of Cornell
University (B.T.K.), New York, New York; Steadman Hawkins
Clinic, Steadman Hawkins Research Foundation (M.J.P.), Vail,
Colorado; and University of Pittsburgh Medical Center (M.J.P.),
Pittsburgh, Pennsylvania, U.S.A.
Supported in part by a grant from the Orthopaedic Section of the
American Physical Therapy Association and the Steadman
Hawkins Research Foundation. Within the last 12 months, B.T.K.
and M.J.P. have received financial support exceeding $500 from
Smith & Nephew, Andover, MA.
Address correspondence and reprint requests to RobRoy L.
Martin, Ph.D., P.T., C.S.C.S., Department of Physical Therapy,
Duquesne University, 114 Rangos School of Health Sciences, Pittsburgh, PA 15282, U.S.A. E-mail: martinr280@duq.edu
© 2006 by the Arthroscopy Association of North America
0749-8063/06/2212-5197$32.00/0
doi:10.1016/j.arthro.2006.07.027
Arthroscopy: The Journal of Arthroscopic and Related Surgery, Vol 22, No 12 (December), 2006: pp 1304-1311
Note: To access the supplementary Appendix accompanying
this report, visit the December issue of Arthroscopy at
www.arthroscopyjournal.org.
Musculoskeletal hip disorders and hip arthroscopy are areas of growing interest within the
field of orthopaedics. As physicians and other health
care practitioners become more involved in these areas, research that defines the expected outcomes for
various treatments will be needed. This will include
continuing to define the outcomes of both arthroscopic
surgical treatment and nonsurgical treatment for individuals with acetabular labral tears. A number of
self-report evaluative instruments have been developed for individuals with hip pathology.1-8 All of
these instruments have deficiencies that may negatively impact their ability to assess the effect of treatment interventions for individuals with labral tears
who may be functioning throughout a wide range of
ability.
The usefulness of an instrument can be determined
based on concepts associated with contemporary validity theory. Important concepts to consider include
evidence for test content, internal structure, and relation to other variables.9,10 To be useful in the realm of
sports medicine and hip arthroscopy, an instrument
should have adequate representation of items questioning an individual’s proficiency with a wide range
of ability. This would include activities requiring a
high level of ability (i.e., sports participation). If an
instrument does not have this adequate representation,
a ceiling effect and inadequate sensitivity to change
may occur when individuals are only limited at the
high end of ability. The instruments that are currently
available contain only a limited number of items that
assess activity and participation at the higher end of
ability. Objectively evaluating the individual items in
their ability to contribute information and be potentially responsive, particularly at the high end of ability, can be done with item response theory (IRT).
The purpose of this study was to create an instrument, the Hip Outcome Score (HOS), that could be
used to assess the outcome of treatment intervention
for individuals with acetabular labral tears who may be
functioning throughout a wide range of ability. It was
hypothesized that the newly created HOS would be
unidimensional, have adequate internal consistency,
be potentially responsive across the spectrum of ability, and contribute information across the spectrum of
ability. In addition, scores obtained by the HOS would
relate to concurrent measures of function while not
unduly relating to concurrent measures of mental
health.
METHODS
Creating the Interim HOS
Item content for the HOS was derived from input
from physicians and physical therapists who treat individuals with musculoskeletal hip disorders. The purpose of this instrument was to assess self-reported
functional status. Therefore, according to the terms
defined by the International Classification of Functioning, Disability and Health model, items that related
to activity and participation were included whereas
items relating to body structure and function (i.e.,
symptoms) were not considered.11 An effort was made
to include functional activities that cover a full spectrum of ability, including sports-related activities. On
the basis of these criteria, a total of 28 items were
initially considered and developed. It was believed
that all 28 items were appropriate and should be
included in the HOS.
A decision was made to create 2 subscales, the
activities-of-daily-living (ADL) and sports subscales.
The ADL subscale contained 19 items pertaining to
basic daily activities, and the sports subscale contained 9 items pertaining to higher-level activities,
such as those required in athletics. In addition to the 5
potential responses, ranging from “unable to do” to
“no difficulty,” a response of “nonapplicable” was
also added. This response allows subjects to indicate that
something other than their hip problem limits their
activity. As a result, neither missing responses nor
nonapplicable responses could be scored. This
model of 2 subscales, as well as the use of a nonapplicable response, was based on the successful results
associated with an instrument developed for the foot
and ankle.10
Procedure for Data Collection
We used a cross-sectional study design. Potential
subjects consisted of patients who were under the care
of a single orthopaedic surgeon who specializes in the
treatment of musculoskeletal hip-related disorders
and, particularly, acetabular labral tears. Subjects
were given the HOS and Short Form 36 (SF-36) to
complete during a regularly scheduled clinical visit.
Patients who could not read English or who did not
have a labral tear were excluded. On the basis of the
methods used in previous studies,12 a decision was
made to exclude subjects who had a high number of
items that could not be scored. Subjects were included
if they had at least 14 of 19 and 7 of 9 items that could
be scored on the ADL and sports subscales, respectively. Demographic information was recorded from
the computer database and medical records. This
study was approved by the institutional review board,
and all subjects gave their consent for participation in
this study.
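The scorable-item rule described above can be sketched in a few lines. This is only an illustration: the function names are hypothetical, and it assumes that a missing or nonapplicable response is stored as None.

```python
from typing import Optional, Sequence

# Thresholds taken from the text: at least 14 of 19 ADL items and
# at least 7 of 9 sports items must be scorable.
MIN_SCORABLE_ADL = 14
MIN_SCORABLE_SPORTS = 7


def count_scorable(responses: Sequence[Optional[int]]) -> int:
    """Count scorable responses; None marks a missing or nonapplicable item."""
    return sum(1 for r in responses if r is not None)


def is_included(adl: Sequence[Optional[int]],
                sports: Sequence[Optional[int]]) -> bool:
    """Return True when both subscales meet their scorable-item thresholds."""
    return (count_scorable(adl) >= MIN_SCORABLE_ADL
            and count_scorable(sports) >= MIN_SCORABLE_SPORTS)


# Hypothetical subject: 15 scorable ADL items and 7 scorable sports items.
adl_example = [3] * 15 + [None] * 4
sports_example = [2] * 7 + [None] * 2
included = is_included(adl_example, sports_example)
```

The sketch treats the two thresholds as jointly required, which is how the sentence above reads.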
Data Analysis
Evidence for Test Content
Psychometric procedures associated with IRT were
used to obtain evidence for test content. This analysis
included an assessment of unidimensionality, construction of item characteristic curves, and production
of test information functions.13
Assumption of Unidimensionality: The assumption of unidimensionality must be met before IRT can
be used.13 Evidence for this would be provided by a
factor that accounts for a large amount of the variance
as indicated by the production of only 1 eigenvalue
with a value greater than 1. Exploratory factor analysis was completed by use of PRELIS (Scientific Software International, Chicago, IL). Eigenvalues and factor loading patterns were used to identify and extract
factors. Items with the lowest factor loading to the
principal component were sequentially deleted until
only 1 eigenvalue was produced that had a value
greater than 1.
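The sequential deletion procedure can be sketched as a loop. The sketch assumes a hypothetical callable, fit, that stands in for the factor analysis software: given the retained item names, it returns the eigenvalues and each item's loading on the first component. Nothing here reproduces PRELIS itself.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# A fit returns (eigenvalues, loading of each item on the first component).
FitResult = Tuple[List[float], Dict[str, float]]


def count_factors(eigenvalues: Sequence[float], threshold: float = 1.0) -> int:
    """Number of eigenvalues above the threshold (greater than 1 in the text)."""
    return sum(1 for value in eigenvalues if value > threshold)


def prune_to_one_factor(items: Sequence[str],
                        fit: Callable[[List[str]], FitResult]
                        ) -> Tuple[List[str], List[str]]:
    """Drop the weakest-loading item until only 1 eigenvalue exceeds 1."""
    kept = list(items)
    dropped: List[str] = []
    while len(kept) > 1:
        eigenvalues, loadings = fit(kept)  # refit after every deletion
        if count_factors(eigenvalues) <= 1:
            break
        weakest = min(kept, key=lambda name: loadings[name])
        kept.remove(weakest)
        dropped.append(weakest)
    return kept, dropped
```

Refitting inside the loop matters because loadings and eigenvalues change each time an item is removed.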
Item Characteristic Curves: MULTILOG (Scientific Software International) was used to perform IRT
and calibrate the items by use of the 2-parameter
graded response model. The results of IRT allow for
item characteristic curves to be constructed in an
Excel spreadsheet (Microsoft, Redmond, WA) for
each item by use of difficulty and discrimination parameters generated by MULTILOG. An appropriate
item characteristic curve with 5 potential responses,
with each response describing a level of proficiency
with the activity in question, should have 5 distinct
and separate curves. Each curve should have 1 peak,
and together, the 5 curves should span the spectrum of
ability (theta).13 Items that did not have appropriate
item characteristic curves were considered for
elimination.
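The curves described above can be sketched from a discrimination parameter and a set of difficulty (threshold) parameters. The sketch assumes the common logistic form of the graded response model; the exact parameterization used by MULTILOG is not stated in the text, and the parameter values below are hypothetical.

```python
import math
from typing import List, Sequence


def boundary_probability(theta: float, a: float, b: float) -> float:
    """Probability of responding in or above the category whose threshold is b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def category_probabilities(theta: float, a: float,
                           thresholds: Sequence[float]) -> List[float]:
    """Probabilities of the len(thresholds) + 1 ordered response categories."""
    cumulative = ([1.0]
                  + [boundary_probability(theta, a, b) for b in thresholds]
                  + [0.0])
    return [cumulative[k] - cumulative[k + 1]
            for k in range(len(thresholds) + 1)]


# Hypothetical parameters for one 5-category item (not values from the study).
A_EXAMPLE = 2.0
B_EXAMPLE = [-2.0, -1.0, 0.0, 1.0]

# One entry per ability level: the five curve heights at that theta.
curve_points = {theta: category_probabilities(theta, A_EXAMPLE, B_EXAMPLE)
                for theta in (-3.0, -1.5, 0.0, 1.5, 3.0)}
```

Plotting each of the 5 columns against theta gives the 5 curves of one item characteristic curve. With well-separated thresholds, each category is the most likely response somewhere along the ability range.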
TABLE 1. Item Response Pattern for ADL Subscale
                                                  Unable        Extreme       Moderate      Slight        No                          Missing
Item                                              to Do         Difficulty    Difficulty    Difficulty    Difficulty    Nonapplicable Response
Standing for 15 min                               7 (1.4%)      29 (5.7%)     129 (25.4%)   132 (26%)     206 (40%)     2 (0.4%)      3 (0.4%)
Getting into and out of an average car            0             46 (9.1%)     118 (23.3%)   189 (37.3%)   153 (30.2%)   0             1 (0.2%)
Putting on socks and shoes                        5 (1%)        60 (11.8%)    104 (20.5%)   156 (30.8%)   180 (35.5%)   0             2 (0.4%)
Walking up steep hills                            20 (3.9%)     69 (13.6%)    161 (31.8%)   133 (26.2%)   112 (22.1%)   10 (2%)       2 (0.4%)
Walking down steep hills                          13 (2.6%)     48 (9.5%)     147 (29%)     140 (27.6%)   142 (28%)     11 (2.2%)     6 (1.2%)
Going up 1 flight of stairs                       1 (0.2%)      38 (7.5%)     100 (19.7%)   158 (31.2%)   209 (41.2%)   1 (0.2%)      0
Going down 1 flight of stairs                     1 (0.2%)      17 (3.4%)     91 (17.9%)    164 (32.3%)   230 (45.4%)   1 (0.2%)      3 (0.6%)
Going up and down curbs                           1 (0.2%)      9 (1.8%)      66 (13%)      134 (26.4%)   292 (57.6%)   3 (0.6%)      2 (0.4%)
Deep squatting                                    67 (13.2%)    105 (20.7%)   123 (24.3%)   114 (22.5%)   72 (14.2%)    16 (3.2%)     10 (2%)
Getting into and out of a bath                    10 (2%)       21 (4.1%)     83 (16.4%)    136 (26.8%)   184 (36.3%)   64 (12.6%)    9 (1.8%)
Sitting for 15 min                                2 (0.4%)      28 (5.5%)     80 (15.8%)    148 (29.2%)   248 (48.9%)   0             1 (0.2%)
Walking initially                                 8 (1.6%)      25 (4.9%)     100 (19.7%)   170 (33.5%)   199 (39.3%)   0             5 (1%)
Walking for approximately 10 min                  12 (2.4%)     41 (8.1%)     107 (21.1%)   154 (30.4%)   190 (37.5%)   0             3 (0.6%)
Walking for 15 min or greater                     33 (6.5%)     82 (16.2%)    137 (27%)     113 (22.3%)   136 (26.8%)   4 (0.8%)      2 (0.4%)
Twisting/pivoting on involved leg                 49 (9.7%)     107 (21.1%)   139 (27.4%)   136 (26.8%)   65 (12.8%)    5 (1%)        6 (1.2%)
Rolling over in bed                               7 (1.4%)      47 (9.3%)     87 (17.2%)    176 (34.7%)   185 (36.5%)   1 (0.2%)      4 (0.8%)
Light to moderate work (standing and walking)     11 (2.2%)     33 (6.5%)     110 (21.7%)   181 (35.7%)   167 (32.9%)   1 (0.2%)      4 (0.8%)
Heavy work (pushing/pulling, climbing, carrying)  64 (12.6%)    114 (22.5%)   143 (28.2%)   114 (22.5%)   52 (10.3%)    18 (3.6%)     4 (0.8%)
Recreational activities                           98 (19.3%)    98 (19.3%)    139 (27.4%)   106 (20.9%)   39 (7.7%)     18 (3.6%)     9 (1.8%)
Test Information Function: The results of IRT
provide information values generated by MULTILOG
for each item at 9 ability levels, ranging from −2.0 to
2.0. The item information values for each item at the 9
ability levels were summed to produce the test information function. The target test information function for an
evaluative instrument should provide information across all
ability ranges.13 Items that did not contribute to the test
information function were considered for elimination.
Evidence of Internal Structure
The Cronbach coefficient α value was calculated by
use of the SPSS program (version 11.5; SPSS, Chicago, IL) to assess internal consistency. The standard
error of measure (SEM) was calculated as follows:
\[ \mathrm{SEM} = \sigma \sqrt{1 - r} \]
in which σ was the SD of the scores and r was the
coefficient α. A 90% confidence interval (CI) was
then calculated to determine the error associated with
a score at a single point in time.
Evidence of Convergent and Divergent Validity
Convergent evidence was examined by assessment
of the associations between the HOS and SF-36 physical function subscale and physical component summary score by use of Pearson correlation coefficients.
Divergent evidence was examined by assessment of
the associations between the HOS and SF-36 mental
health subscale and the mental health component summary score. Testing for differences in the correlation
coefficients between the HOS and concurrent measures of physical function and mental health was done
based on the equation of Meng et al.14 The a priori
type I error rate was set at .005 to account for the
multiple comparisons.
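A minimal sketch of this comparison follows, using the test for correlated correlation coefficients as it is commonly stated for the cited equation: both correlations are Fisher transformed and their difference is scaled by a term that depends on the correlation between the two comparison measures. That intercorrelation (r12 below) is not reported in this article, so the value used in the example is purely hypothetical. The statistic is referred here to the standard normal distribution.

```python
import math


def meng_z(r1: float, r2: float, r12: float, n: int) -> float:
    """Z statistic for comparing two correlations that share one variable.

    r1, r2: correlations of the shared variable with measures 1 and 2.
    r12:    correlation between measures 1 and 2.
    n:      number of subjects.
    """
    z1, z2 = math.atanh(r1), math.atanh(r2)      # Fisher transformation
    mean_r_squared = (r1 ** 2 + r2 ** 2) / 2.0
    f = min((1.0 - r12) / (2.0 * (1.0 - mean_r_squared)), 1.0)
    h = (1.0 - f * mean_r_squared) / (1.0 - mean_r_squared)
    return (z1 - z2) * math.sqrt((n - 3) / (2.0 * (1.0 - r12) * h))


def two_sided_p(z: float) -> float:
    """Two-sided P value under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# r1 and r2 are the ADL correlations reported in the Results;
# r12 = 0.2 is a hypothetical placeholder, not a value from the study.
z_example = meng_z(r1=0.76, r2=0.27, r12=0.2, n=507)
p_example = two_sided_p(z_example)
```

The resulting P value can then be compared with the a priori type I error rate of .005.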
RESULTS
Subjects
Included in the data analysis from October 2003 to
December 2004 were 507 subjects with a labral tear as
their primary diagnosis. The mean subject age was 38
years (SD, 13 years; range, 13 to 66 years), with 232
male and 273 female subjects. The mean duration of
symptoms was 3.4 years (SD, 5 years; range, 11 days
to 29 years). The subjects’ reported current level of
function was normal in 3%, nearly normal in 26%,
abnormal in 51%, and severely abnormal in 20%. Of
the subjects, 263 (52%) underwent arthroscopic surgery. The mean length of time between surgery and
completion of the questionnaires in these individuals
was 6.7 months (range, 2 days to 3.86 years). With
respect to comorbidities, all subjects noted that their
hip condition was their primary limiting factor.
Item Response Patterns for ADL
and Sports Subscales
The response patterns for the individual items are
presented in Tables 1 and 2 for the ADL and sports
subscales, respectively. For the ADL subscale, item
10 (getting into and out of a bath) had the highest
number of nonapplicable and missing responses: 14.4% of individuals had data that could not be
scored. For the remaining items, fewer than 6% of
individuals had data that could not be scored for each
item. Compared with the ADL subscale, the sports
subscale had a larger number of nonapplicable responses. The number of missing responses was only
slightly greater for the sports subscale.
TABLE 2. Item Response Pattern for Sports Subscale
                                                  Unable        Extreme       Moderate      Slight        No                          Missing
Item                                              to Do         Difficulty    Difficulty    Difficulty    Difficulty    Nonapplicable Response
Running 1 mile                                    286 (56.4%)   62 (12.2%)    60 (11.8%)    41 (8.1%)     35 (6.9%)     21 (4.1%)     2 (0.4%)
Jumping                                           171 (33.7%)   96 (18.9%)    84 (16.6%)    81 (16%)      64 (12.6%)    8 (1.6%)      3 (0.6%)
Swinging objects like a golf club                 82 (16.2%)    50 (9.9%)     66 (13%)      99 (19.5%)    111 (21.9%)   92 (18.1%)    7 (1.4%)
Landing                                           110 (21.7%)   89 (17.6%)    98 (19.3%)    96 (18.9%)    77 (15.2%)    22 (4.3%)     15 (3%)
Starting and stopping quickly                     79 (15.6%)    118 (23.3%)   131 (25.8%)   107 (21.1%)   67 (13.2%)    3 (0.6%)      15 (3%)
Cutting/lateral movements                         105 (20.7%)   133 (26.2%)   113 (22.3%)   104 (20.5%)   33 (6.5%)     9 (1.8%)      10 (2%)
Low-impact activities like fast walking           74 (14.6%)    76 (15%)      115 (22.7%)   132 (26%)     104 (20.5%)   4 (0.8%)      2 (0.4%)
Ability to perform activity with your normal      116 (22.9%)   95 (18.7%)    116 (22.9%)   104 (20.5%)   61 (12%)      8 (1.6%)      7 (1.4%)
  technique
Ability to participate in your desired sport      260 (51.3%)   96 (18.9%)    64 (12.6%)    42 (8.3%)     29 (5.7%)     14 (2.8%)     2 (0.4%)
  as long as you would like
Assumption of Unidimensionality
PRELIS requires the use of complete data. Therefore 430 subjects (85%) and 343 subjects (68%) were
used to evaluate the ADL and sports subscales, respectively. For both the ADL and sports subscales,
analysis was done to assess for difference in gender,
age, duration of symptoms, time between surgery and
data collection, and current rating of function between
subjects with no missing responses compared with the
other subjects included in the study. The α value was
set at .05 but adjusted to .005 because of the 10
planned comparisons.
For the ADL subscale, a significant difference was
not found for gender (P = .94), age (P = .009),
duration of symptoms (P = .7), time between surgery
and data collection (P = .012), and current rating of
function (P = .18). For the sports subscale, a significant difference was not found for age (P = .58),
duration of symptoms (P = .37), time between surgery and data collection (P = .39), and current rating
of function (P = .56). A significant difference was
found for gender (P < .0005), because the ratio of
female subjects to male subjects was lower in the
group with no missing data compared with the group
with 1 or 2 missing responses.
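The adjustment for the 10 planned comparisons amounts to dividing the nominal α by the number of comparisons. The short sketch below applies that threshold to the P values reported above. The dictionary labels are illustrative, and the value reported as P < .0005 is stored as its upper bound.

```python
def bonferroni_alpha(family_alpha: float, n_comparisons: int) -> float:
    """Per-comparison threshold: nominal alpha divided by the comparisons."""
    return family_alpha / n_comparisons


ADJUSTED_ALPHA = bonferroni_alpha(0.05, 10)

# P values as reported in the text; "P < .0005" is stored as 0.0005.
REPORTED_P = {
    "ADL gender": 0.94,
    "ADL age": 0.009,
    "ADL duration of symptoms": 0.7,
    "ADL time between surgery and data collection": 0.012,
    "ADL current rating of function": 0.18,
    "sports gender": 0.0005,
    "sports age": 0.58,
    "sports duration of symptoms": 0.37,
    "sports time between surgery and data collection": 0.39,
    "sports current rating of function": 0.56,
}

significant = [label for label, p in REPORTED_P.items() if p < ADJUSTED_ALPHA]
```

Under this threshold the ADL results for age (.009) and time between surgery and data collection (.012) are not significant, although both would be at the unadjusted .05 level.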
Factor analysis of the 19-item ADL subscale indicated that the items loaded on 2 factors with eigenvalues of 12.4 and 1.2. The factor loadings of each
item on the 19-item ADL subscale to the first principal
component are reported in Table 3. Because 2 factors
with an eigenvalue greater than 1 were produced, the
factor analysis was repeated sequentially omitting
item 11 (sitting) and item 3 (putting on socks and
shoes). A 17-item ADL subscale loaded on 1 factor,
which accounted for 68% of the variance and had an
eigenvalue of 11.6. The factor loading of each item on
the 17-item ADL subscale to the first principal component is found in Table 3. Scores from this modified
17-item ADL subscale were then used for constructing
the item characteristic curves and test information
functions.
The 9-item sports subscale loaded on 1 factor. This
1 factor accounted for 80.3% of the variance and had
an eigenvalue of 7.1. The factor loadings of each item
to the first principal component are found in Table 4.
TABLE 3. Factor Loadings of Individual Items for ADL Subscale
                                                            Factor Loading
Item No.  Item Content                                      19 Item   17 Item
1         Standing for 15 min                               .82       .82
2         Getting into and out of an average car            .78       .76
3         Putting on socks and shoes                        .63       -
4         Walking up steep hills                            .84       .85
5         Walking down steep hills                          .84       .85
6         Going up 1 flight of stairs                       .85       .86
7         Going down 1 flight of stairs                     .86       .86
8         Going up and down curbs                           .84       .83
9         Deep squatting                                    .75       .75
10        Getting into and out of a bath                    .81       .80
11        Sitting for 15 min                                .55       -
12        Walking initially                                 .76       .75
13        Walking for approximately 10 min                  .86       .87
14        Walking for 15 min or greater                     .84       .85
15        Twisting/pivoting on involved leg                 .75       .74
16        Rolling over in bed                               .76       .74
17        Light to moderate work (standing and walking)     .89       .89
18        Heavy work (pushing/pulling, climbing, carrying)  .81       .82
19        Recreational activities                           .77       .77
TABLE 4. Factor Loadings of Individual Items of Sports Subscale
Item No.  Item Content                                      Factor Loading
1         Running 1 mile                                    .90
2         Jumping                                           .93
3         Swinging objects like a golf club                 .85
4         Landing                                           .94
5         Starting and stopping quickly                     .89
6         Cutting/lateral movements                         .88
7         Low-impact activities like fast walking           .86
8         Ability to perform activity with your normal      .83
            technique
9         Ability to participate in your desired sport      .87
            as long as you would like
Item Characteristic Curves
Inspection of the item characteristic curves for the
ADL subscale revealed that all but 4 items had well-fitting curves. Those 4 items pertained to the following: (1) getting into a car, (2) going up steps, (3) going
down steps, and (4) going up and down curbs. The
item characteristic curve for the item pertaining to
walking up hills is an example of an item that had
a well-fitting item characteristic curve and is presented in Fig 1. The item characteristic curves for
the items that did not have well-fitting item characteristic curves resembled that pertaining to going
up and down curbs, which is shown in Fig 2. Item
characteristic curves were also plotted for 9 items
on the sports subscale. All 9 had well-fitting curves
similar to that displayed in Fig 1.
Test Information Function
The test information function for the modified 17-item ADL subscale and the 9-item sports subscale can
be found in Fig 3. The 4 items without well-fitting
item characteristic curves (getting into a car, going up
steps, going down steps, and going up and down
curbs) were considered for elimination. The test information function was recalculated separately with
each of these items deleted. In each case a decrease in
information was noted throughout the range of ability.
Therefore these 4 items were retained to maximize the
instrument’s precision of measurement across the
range of ability.
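The summation and the item-deleted recalculation can be sketched as follows. The sketch assumes the logistic graded response model with its usual item information formula and an even 0.5 spacing of the 9 ability levels between −2.0 and 2.0. All item parameters are hypothetical, because the calibrated MULTILOG values are not reported here.

```python
import math
from typing import List, Sequence, Tuple

Item = Tuple[float, Sequence[float]]  # (discrimination a, ordered thresholds b)

# Nine ability levels from -2.0 to 2.0; the even 0.5 spacing is an assumption.
ABILITY_LEVELS = [-2.0 + 0.5 * i for i in range(9)]


def item_information(theta: float, a: float,
                     thresholds: Sequence[float]) -> float:
    """Information contributed by one graded-response item at ability theta."""
    cumulative = ([1.0]
                  + [1.0 / (1.0 + math.exp(-a * (theta - b)))
                     for b in thresholds]
                  + [0.0])
    info = 0.0
    for k in range(len(thresholds) + 1):
        prob = cumulative[k] - cumulative[k + 1]
        # Derivative of the category probability with respect to theta.
        slope = a * (cumulative[k] * (1.0 - cumulative[k])
                     - cumulative[k + 1] * (1.0 - cumulative[k + 1]))
        if prob > 0.0:
            info += slope ** 2 / prob
    return info


def scale_information(items: Sequence[Item],
                      levels: Sequence[float] = ABILITY_LEVELS) -> List[float]:
    """Sum the item information values at each ability level."""
    return [sum(item_information(theta, a, bs) for a, bs in items)
            for theta in levels]


def information_without(items: Sequence[Item], index: int,
                        levels: Sequence[float] = ABILITY_LEVELS) -> List[float]:
    """Recalculate the summed information with one item deleted."""
    reduced = [item for i, item in enumerate(items) if i != index]
    return scale_information(reduced, levels)


# Hypothetical item parameters, chosen only to make the sketch runnable.
EXAMPLE_ITEMS: List[Item] = [
    (2.0, [-2.0, -1.0, 0.0, 1.0]),
    (1.5, [-1.5, -0.5, 0.5, 1.5]),
    (2.5, [-1.0, 0.0, 1.0, 2.0]),
]
full_curve = scale_information(EXAMPLE_ITEMS)
```

Comparing full_curve with information_without(EXAMPLE_ITEMS, i) for each i mirrors the item-by-item recalculation described above.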
FIGURE 1. Item characteristic curve for item 4 (walking up steep hills) on ADL subscale. This represents the appropriate item characteristic curve for an item with 5 potential responses. It should be noted that there are 5 separate curves, each with 1 peak, and together, the 5 curves span the spectrum of ability (θ). (RESP0, “unable to do” response; RESP1, “extremely difficult” response; RESP2, “moderate difficulty” response; RESP3, “slight difficulty” response; RESP4, “no difficulty at all” response.) [Plot not reproduced; axes: probability of response versus ability (theta).]
FIGURE 2. Item characteristic curve for item 8 (going up and down curbs) on ADL subscale. This represents an item characteristic curve that may be potentially unacceptable. It should be noted that there are only 4 separate curves for an item that has 5 potential responses. (RESP0, “unable to do” response; RESP1, “extremely difficult” response; RESP2, “moderate difficulty” response; RESP3, “slight difficulty” response; RESP4, “no difficulty at all” response.) [Plot not reproduced; axes: probability of response versus ability (theta).]
FIGURE 3. Test information function (TIF) for ADL and sports subscales showing their potential to provide information across range of ability. The ADL subscale offers more information regarding function at the lower end of ability, whereas the sports subscale offers more information at the higher range of ability. [Plot not reproduced; axes: information versus ability; series: ADL TIF and Sports TIF.]
The 19-item ADL subscale and 9-item sports subscale can be found in Appendix 1 (online only, available at www.arthroscopyjournal.org). The ADL and
sports subscales are scored separately. The item related to sitting and the item related to putting on socks
and shoes are not scored. The response to each of the
other 17 items on the ADL subscale is scored from 4
to 0, with 4 indicating “no difficulty” and 0 indicating
“unable to do.” Nonapplicable responses are not
counted. The scores for each of the items are added
together to obtain the item score total. The total number of items with a response is multiplied by 4 to
obtain the highest potential score. If the subject answers all 17 items, the highest potential score is 68. If
1 item is not answered, then the highest score is 64; if
2 are not answered, then the highest score is 60; and so
on. The item score total is divided by the highest
potential score. This value is then multiplied by 100 to
obtain a percentage. The sports subscale is scored in a
similar manner, with the highest potential score being
36. A higher score represents a higher level of physical function for both the ADL and sports subscales.
The mean score for the 17-item ADL subscale and
9-item sports subscale was 67.8 (SD, 20.5; range, 4 to
100) and 41.9 (SD, 27.8; range, 0 to 100), respectively.
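The scoring rule just described can be written as a short function. This is a sketch with hypothetical names: responses are coded 4 ("no difficulty") to 0 ("unable to do"), and None stands for a nonapplicable or unanswered item. For the ADL subscale only the 17 scored items would be passed in.

```python
from typing import Optional, Sequence


def subscale_score(responses: Sequence[Optional[int]]) -> Optional[float]:
    """Percentage score for one subscale, or None if no item can be scored."""
    answered = [r for r in responses if r is not None]
    if not answered:
        return None
    item_score_total = sum(answered)
    highest_potential_score = 4 * len(answered)
    return 100.0 * item_score_total / highest_potential_score


# Hypothetical ADL responses: 15 items answered "slight difficulty" (3),
# 2 items marked nonapplicable.
adl_example = [3] * 15 + [None] * 2
adl_score = subscale_score(adl_example)   # 45 of a possible 60 points
```

Because the denominator shrinks with each unanswered item, a subject who answers fewer items is not penalized for the omissions.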
Evidence of Internal Structure
The assessment of internal consistency for the
17-item ADL subscale found a coefficient α value of
.96, with an SEM of 2.8 and a 90% CI of ±4.6 points.
For the sports subscale, the coefficient α value was
.95, with an SEM of 2.3 and a 90% CI of ±3.8 points.
Evidence of Convergent and Divergent Validity
The correlation coefficients between the 17-item
ADL subscale and SF-36 physical function subscale,
physical component summary score, mental health
subscale, and mental component summary score were
0.76, 0.74, 0.27, and 0.18, respectively. The correlation coefficients between the sports subscale and
SF-36 physical function subscale, physical component
summary score, mental health subscale, and mental
component summary score were 0.72, 0.68, 0.23, and
0.1, respectively. For both the ADL and sports subscales, the calculated t values
testing whether the correlations with measures of physical
functioning differed from the correlations with measures of
mental functioning were significant (P < .0005).
DISCUSSION
The results of this study offer evidence that the
HOS is a valid measure of self-reported physical function for individuals with acetabular labral tears who
are undergoing either arthroscopic surgical treatment
or nonsurgical treatment. Specifically, this study provides evidence for internal structure and test content
because the HOS represents the influence of labral
tears on activity and participation across the spectrum
of ability. Evidence for convergent and divergent validity was obtained because scores relate to other
measures of the same construct and do not relate to
measures of a different construct.
Missing data from a self-report instrument threaten
the potential accuracy and validity of the patient responses. In a clinical situation missing responses occur and, when present to a small degree, are thought to
be acceptable. Item 10 (getting into and out of a bath)
had a noticeably larger number of nonapplicable and
missing responses. It was therefore considered for
omission from the HOS. However, on the basis of its
factor loading pattern, coefficient α value, and item
characteristic curve, it was believed that this item
should remain on the instrument. The sports subscale
had a noticeably higher number of nonapplicable responses and therefore a lower number of items that
could be scored. Although individuals noted their hip
condition as their primary limiting factor, we have found
that higher-level activities can sometimes be limited by
factors other than hip pathology. A more detailed investigation of these other factors is outside of the scope of
the data collected in this study. However, future studies
are planned to examine this issue.
The factor analysis performed required complete
data sets without any nonapplicable or missing responses, whereas the IRT and evidence for convergent
and divergent validity analyses used incomplete data
sets. Analyses comparing demographic information
between individuals with complete data sets with
those with incomplete data sets were completed. For
the ADL subscale, even though P values approached
statistical significance for age (40 years v 40.7 years)
and time between surgery and collection (5.9 months v
3.5 months), there was probably not much clinical
significance. For the sports subscale, the significant
difference in the gender distribution between those
with complete data sets and those with incomplete
data sets may require further analysis and, at this
point, is difficult to interpret.
The analysis of internal structure by use of factor
analysis found that 2 items needed to be eliminated
in order for the ADL subscale to meet the assumption of unidimensionality. Items describing the individual’s ability to sit and to put on socks and
shoes may represent a different domain than the
other 17 items. Because this finding was somewhat
surprising, a poststudy analysis of internal consistency was done with the entire 19-item ADL subscale. Internal consistency was higher when the
items related to sitting and putting on socks and
shoes were deleted and confirmed that omitting
these items from the ADL score was appropriate.
Sitting and putting on socks and shoes may offer
valuable differential diagnosis information for pain
associated with femoral acetabular impingement
and stiffness associated with arthritis. However, on
the basis of this study, their inclusion on an instrument that assesses the influence of acetabular labral
tears on functional status is questioned.
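The comparison of internal consistency with and without particular items can be illustrated with a small coefficient α routine. The data below are hypothetical and exist only to make the sketch runnable. The routine uses the usual variance-based formula on complete data and is not a reproduction of the SPSS analysis.

```python
from statistics import variance
from typing import List, Sequence


def cronbach_alpha(rows: Sequence[Sequence[float]]) -> float:
    """Coefficient alpha for complete data: rows are subjects, columns items."""
    k = len(rows[0])
    item_variances = [variance([row[i] for row in rows]) for i in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1.0 - sum(item_variances) / total_variance)


def alpha_without(rows: Sequence[Sequence[float]],
                  omitted: Sequence[int]) -> float:
    """Coefficient alpha after deleting the item columns listed in omitted."""
    kept: List[List[float]] = [
        [value for i, value in enumerate(row) if i not in omitted]
        for row in rows
    ]
    return cronbach_alpha(kept)


# Hypothetical responses (0 to 4) from 6 subjects to 4 items; the last item
# is deliberately unrelated to the other three.
EXAMPLE_ROWS = [
    [4, 4, 3, 1],
    [3, 3, 3, 4],
    [2, 2, 1, 0],
    [1, 1, 1, 3],
    [0, 1, 0, 2],
    [4, 3, 4, 2],
]
alpha_all = cronbach_alpha(EXAMPLE_ROWS)
alpha_reduced = alpha_without(EXAMPLE_ROWS, omitted=[3])
```

In this toy data set, deleting the unrelated item raises α, which is the same direction of change reported above for the sitting and socks-and-shoes items.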
Objective evidence for content was obtained by the
psychometric procedures of IRT and the results of
internal consistency. Individuals with labral tears, including those who undergo hip arthroscopy, generally
function at a high level and may only be limited in
their ability to participate in sports. The results of the
IRT analysis offer evidence for adequate representation of items questioning activities in the higher range
of ability. This may offer an advantage over other
instruments currently available. To substantiate this,
further study will be required.
The coefficient α value was used to estimate the
precision of a measurement at a single point in time.
To help interpret these values, consider a patient who
scores 70 on the ADL subscale. One can be confident
that, 90% of the time, this patient will score between
65.4 and 74.6. In addition, one can be confident that,
at a single point in time, individuals who score above
74.6 or below 65.4 are performing at a different level
than an individual with an observed score of 70. This
information can be used when evaluating whether the
scores from 2 individuals are different at a single point
in time.
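The arithmetic behind this example can be sketched as follows. The multiplier of 1.645 for a two-sided 90% interval is an assumption (the text does not state the multiplier), although the reported half-widths of 4.6 and 3.8 points are consistent with 1.645 times the reported SEM values of 2.8 and 2.3.

```python
import math
from typing import Tuple

# Two-sided 90% multiplier from the standard normal distribution
# (an assumption; the text does not state which multiplier was used).
Z_90 = 1.645


def standard_error_of_measure(sd: float, reliability: float) -> float:
    """SEM = SD * sqrt(1 - r), with r taken as coefficient alpha."""
    return sd * math.sqrt(1.0 - reliability)


def ci_half_width(sem: float, z: float = Z_90) -> float:
    """Half-width of the confidence interval around a single score."""
    return z * sem


def score_interval(score: float, sem: float,
                   z: float = Z_90) -> Tuple[float, float]:
    """Lower and upper bounds around an observed score."""
    half = ci_half_width(sem, z)
    return score - half, score + half


# Reported SEM values: 2.8 (ADL) and 2.3 (sports).
adl_half = ci_half_width(2.8)
sports_half = ci_half_width(2.3)
low, high = score_interval(70.0, 2.8)   # the ADL example score of 70
```

Rounded to 1 decimal place, the bounds for a score of 70 match the 65.4 and 74.6 quoted in the example.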
In addition to evidence for test content and internal
structure, our results provide convergent and divergent
evidence of validity. As expected, the HOS was found to
have relatively high correlations with concurrent measures of physical function and relatively low correlations
with concurrent measures of mental health. This finding
provides evidence that the HOS is a measure of physical
function as opposed to mental function.
CONCLUSIONS
The results of this study provide evidence of validity
to support the use of the HOS ADL and sports subscales
in individuals with labral tears. This includes individuals
who have undergone arthroscopic surgery, as well as
those who have not. Specifically, the results of this study
found that the HOS ADL and sports subscales were
unidimensional, had adequate internal consistency, were
potentially responsive across the spectrum of ability, and
contributed information across the spectrum of ability. In
addition, scores obtained by the HOS related to measures
of function and did not relate to measures of mental
health.
REFERENCES
1. Bellamy N, Buchanan WW, Goldsmith CH, Campbell J, Stitt LW. Validation study of WOMAC: A health status instrument for measuring clinically important patient relevant outcomes to antirheumatic drug therapy in patients with osteoarthritis of the hip or knee. J Rheumatol 1988;15:1833-1840.
2. Nilsdotter AK, Lohmander LS, Klassbo M, Roos EM. Hip disability and osteoarthritis outcome score (HOOS): Validity and responsiveness in total hip replacement. BMC Musculoskelet Disord 2003;4:10.
3. Harris WH. Traumatic arthritis of the hip after dislocation and acetabular fractures: Treatment by mold arthroplasty. An end-result study using a new method of result evaluation. J Bone Joint Surg Am 1969;51:737-755.
4. Tugwell P, Bombardier C, Buchanan WW, Goldsmith CH, Grace E, Hanna B. The MACTAR Patient Preference Disability Questionnaire: An individualized functional priority approach for assessing improvement in physical disability in clinical trials in rheumatoid arthritis. J Rheumatol 1987;14:446-451.
5. Wright JG, Young NL, Waddell JP. The reliability and validity of the self-reported patient-specific index for total hip arthroplasty. J Bone Joint Surg Am 2000;82:829-837.
6. Binkley JM, Stratford PW, Lott SA, Riddle DL. The Lower Extremity Functional Scale (LEFS): Scale development, measurement properties, and clinical application. North American Orthopaedic Rehabilitation Research Network. Phys Ther 1999;79:371-383.
7. Hunsaker FG, Cioffi DA, Amadio PC, Wright JG, Caughlin B. The American Academy of Orthopaedic Surgeons outcomes instruments: Normative values from the general population. J Bone Joint Surg Am 2002;84:208-215.
8. Christensen CP, Althausen PL, Mittleman MA, Lee JA, McCarthy JC. The nonarthritic hip score: Reliable and validated. Clin Orthop Relat Res 2003:75-83.
9. Messick S. Meaning and values in test validation: The science and ethics of assessment. Educ Res 1989;18:5-11.
10. Martin RL, Irrgang JJ, Burdett RG, Conti SF, Van Swearingen JM. Evidence of validity for the Foot and Ankle Ability Measure (FAAM). Foot Ankle Int 2005;26:968-983.
11. International classification of functioning, disability and health (ICF). Geneva: World Health Organization, 2001.
12. Irrgang JJ, Anderson AF, Boland AL, et al. Development and validation of the international knee documentation committee subjective knee form. Am J Sports Med 2001;29:600-613.
13. Hambleton RK, Jones RW. Comparison of classical test theory and item response theory and their applications to test development. Educ Meas Issues Pract 1993;12:38-47.
14. Meng X, Rosenthal R, Rubin DB. Comparing correlated correlation coefficients. Psychol Bull 1992;111:172-175.