Evidence of Validity for the Hip Outcome Score

RobRoy L. Martin, Ph.D., P.T., C.S.C.S., Bryan T. Kelly, M.D., and Marc J. Philippon, M.D.

Purpose: The purpose of this study was to offer evidence of validity for the Hip Outcome Score (HOS) based on internal structure, test content, and relation to other variables. Methods: The study population consisted of 507 subjects with a labral tear. Internal structure was evaluated by use of factor analysis and coefficient α. Test content was evaluated by use of item response theory. Pearson correlation coefficients were used to assess relations between the Short Form 36 and the HOS. Results: The mean subject age was 38 years (range, 13 to 66 years), with 232 male and 273 female subjects. Of the subjects, 263 (52%) underwent arthroscopic surgery. Factor analysis found that 17 of 19 items on the activities-of-daily-living (ADL) subscale loaded on 1 factor. The 2 items that did not fit the 1-factor model were omitted from further testing. All 9 items on the sports subscale loaded on 1 factor. The coefficient α values were .96 and .95 for the ADL and sports subscales, respectively. The errors associated with a single measure were ±4.6 and ±3.8 points for the ADL and sports subscales, respectively. Item response theory found that all items contributed to their test information curves and were potentially responsive. The correlations between the HOS and Short Form 36 measures of physical function were significantly different from the correlations between the HOS and measures of mental functioning (P < .005). Conclusions: The results of this study provide evidence of validity to support the use of the HOS ADL and sports subscales for individuals with labral tears. This includes individuals who underwent arthroscopic surgery, as well as those who did not.
Specifically, the results of this study found that the HOS ADL and sports subscales were unidimensional, had adequate internal consistency, were potentially responsive across the spectrum of ability, and contributed information across the spectrum of ability. In addition, scores obtained by the HOS related to measures of function and did not relate to measures of mental health. Level of Evidence: Level III, development of diagnostic criteria with nonconsecutive patients. Key Words: Hip Outcome Score; Labral tear; Hip arthroscopy; Outcome instrument; Validity.

From the Department of Physical Therapy, Duquesne University (R.L.M.), Pittsburgh, Pennsylvania; Hospital for Special Surgery, New York-Presbyterian Hospital, Weill Medical College of Cornell University (B.T.K.), New York, New York; Steadman Hawkins Clinic, Steadman Hawkins Research Foundation (M.J.P.), Vail, Colorado; and University of Pittsburgh Medical Center (M.J.P.), Pittsburgh, Pennsylvania, U.S.A.

Supported in part by a grant from the Orthopaedic Section of the American Physical Therapy Association and the Steadman Hawkins Research Foundation. Within the last 12 months, B.T.K. and M.J.P. have received financial support exceeding $500 from Smith & Nephew, Andover, MA.

Address correspondence and reprint requests to RobRoy L. Martin, Ph.D., P.T., C.S.C.S., Department of Physical Therapy, Duquesne University, 114 Rangos School of Health Sciences, Pittsburgh, PA 15282, U.S.A. E-mail: martinr280@duq.edu

© 2006 by the Arthroscopy Association of North America. 0749-8063/06/2212-5197$32.00/0. doi:10.1016/j.arthro.2006.07.027

Note: To access the supplementary Appendix accompanying this report, visit the December issue of Arthroscopy at www.arthroscopyjournal.org.

Musculoskeletal hip disorders and hip arthroscopy are areas of growing interest within the field of orthopaedics.
As physicians and other health care practitioners become more involved in these areas, research that defines the expected outcomes for various treatments will be needed. This will include continuing to define the outcomes of both arthroscopic surgical treatment and nonsurgical treatment for individuals with acetabular labral tears.

A number of self-report evaluative instruments have been developed for individuals with hip pathology.1-8 All of these instruments have deficiencies that may negatively impact their ability to assess the effect of treatment interventions for individuals with labral tears who may be functioning throughout a wide range of ability.

The usefulness of an instrument can be determined based on concepts associated with contemporary validity theory. Important concepts to consider include evidence for test content, internal structure, and relation to other variables.9,10 To be useful in the realm of sports medicine and hip arthroscopy, an instrument should have adequate representation of items questioning an individual's proficiency across a wide range of ability. This would include activities requiring a high level of ability (e.g., sports participation). If an instrument does not have this adequate representation, a ceiling effect and inadequate sensitivity to change may occur when individuals are limited only at the high end of ability. The instruments that are currently available contain only a limited number of items that assess activity and participation at the higher end of ability. Item response theory (IRT) can be used to evaluate objectively whether individual items contribute information and are potentially responsive, particularly at the high end of ability.

Arthroscopy: The Journal of Arthroscopic and Related Surgery, Vol 22, No 12 (December), 2006: pp 1304-1311
The purpose of this study was to create an instrument, the Hip Outcome Score (HOS), that could be used to assess the outcome of treatment intervention for individuals with acetabular labral tears who may be functioning throughout a wide range of ability. It was hypothesized that the newly created HOS would be unidimensional, have adequate internal consistency, be potentially responsive across the spectrum of ability, and contribute information across the spectrum of ability. In addition, scores obtained by the HOS would relate to concurrent measures of function while not unduly relating to concurrent measures of mental health.

METHODS

Creating the Interim HOS

Item content for the HOS was derived from input from physicians and physical therapists who treat individuals with musculoskeletal hip disorders. The purpose of this instrument was to assess self-reported functional status. Therefore, according to the terms defined by the International Classification of Functioning, Disability and Health model, items that related to activity and participation were included, whereas items relating to body structure and function (i.e., symptoms) were not considered.11 An effort was made to include functional activities that cover a full spectrum of ability, including sports-related activities. On the basis of these criteria, a total of 28 items were initially considered and developed. It was believed that all 28 items were appropriate and should be included in the HOS. A decision was made to create 2 subscales, the activities-of-daily-living (ADL) and sports subscales. The ADL subscale contained 19 items pertaining to basic daily activities, and the sports subscale contained 9 items pertaining to higher-level activities, such as those required in athletics. In addition to the 5 potential responses, ranging from "unable to do" to "no difficulty," a response of "nonapplicable" was added.
This allows subjects to designate that something other than their hip problem limits their activity. It also means that neither missing responses nor nonapplicable responses could be scored. This model of 2 subscales, as well as the use of a nonapplicable response, was based on the successful results associated with an instrument developed for the foot and ankle.10

Procedure for Data Collection

We used a cross-sectional study design. Potential subjects consisted of patients who were under the care of a single orthopaedic surgeon who specializes in the treatment of musculoskeletal hip-related disorders and, particularly, acetabular labral tears. Subjects were given the HOS and Short Form 36 (SF-36) to complete during a regularly scheduled clinical visit. Patients who could not read English or who did not have a labral tear were excluded. On the basis of the methods used in previous studies,12 a decision was made to exclude subjects who had a high number of items that could not be scored. Subjects were included if they had at least 14 of 19 and 7 of 9 items that could be scored on the ADL and sports subscales, respectively. Demographic information was recorded from the computer database and medical records. This study was approved by the institutional review board, and all subjects gave their consent for participation in this study.

Data Analysis

Evidence for Test Content

Psychometric procedures associated with IRT were used to obtain evidence for test content. This analysis included an assessment of unidimensionality, construction of item characteristic curves, and production of test information functions.13

Assumption of Unidimensionality: The assumption of unidimensionality must be met before IRT can be used.13 Evidence for this would be provided by a factor that accounts for a large amount of the variance as indicated by the production of only 1 eigenvalue with a value greater than 1.
Exploratory factor analysis was completed by use of PRELIS (Scientific Software International, Chicago, IL). Eigenvalues and factor loading patterns were used to identify and extract factors. Items with the lowest factor loading to the principal component were sequentially deleted until only 1 eigenvalue was produced that had a value greater than 1.

Item Characteristic Curves: MULTILOG (Scientific Software International) was used to perform IRT and calibrate the items by use of the 2-parameter graded response model. The results of IRT allow for item characteristic curves to be constructed in an Excel spreadsheet (Microsoft, Redmond, WA) for each item by use of difficulty and discrimination parameters generated by MULTILOG. An appropriate item characteristic curve with 5 potential responses, with each response describing a level of proficiency with the activity in question, should have 5 distinct and separate curves. Each curve should have 1 peak, and together, the 5 curves should span the spectrum of ability (theta).13 Items that did not have appropriate item characteristic curves were considered for elimination.

Test Information Function: The results of IRT provide information values generated by MULTILOG for each item at 9 ability levels, ranging from −2.0 to 2.0. At each of the 9 ability levels, the item information values were summed across items to produce the test information function. The target test information function for an evaluative instrument should provide information across all ability ranges.13 Items that did not contribute to the test information function were considered for elimination.

Evidence of Internal Structure

The Cronbach coefficient α value was calculated by use of the SPSS program (version 11.5; SPSS, Chicago, IL) to assess internal consistency. The standard error of measure (SEM) was calculated as follows:

$$\mathrm{SEM} = \sigma\sqrt{1 - r}$$

in which σ was the SD of the scores and r was the coefficient α. A 90% confidence interval (CI) was then calculated to determine the error associated with a score at a single point in time.

Evidence of Convergent and Divergent Validity

Convergent evidence was examined by assessment of the associations between the HOS and SF-36 physical function subscale and physical component summary score by use of Pearson correlation coefficients. Divergent evidence was examined by assessment of the associations between the HOS and SF-36 mental health subscale and the mental health component summary score. Testing for differences in the correlation coefficients between the HOS and concurrent measures of physical function and mental health was done based on the equation of Meng et al.14 The a priori type I error rate was set at .005 to account for the multiple comparisons.

RESULTS

Subjects

A total of 507 subjects with a labral tear as their primary diagnosis were included in the data analysis from October 2003 to December 2004. The mean subject age was 38 years (SD, 13 years; range, 13 to 66 years), with 232 male and 273 female subjects. The mean duration of symptoms was 3.4 years (SD, 5 years; range, 11 days to 29 years). The subjects' reported current level of function was normal in 3%, nearly normal in 26%, abnormal in 51%, and severely abnormal in 20%. Of the subjects, 263 (52%) underwent arthroscopic surgery. The mean length of time between surgery and completion of the questionnaires in these individuals was 6.7 months (range, 2 days to 3.86 years). With respect to comorbidities, all subjects noted that their hip condition was their primary limiting factor.

Item Response Patterns for ADL and Sports Subscales

The response patterns for the individual items are presented in Tables 1 and 2 for the ADL and sports subscales, respectively. For the ADL subscale, item 10 (getting into and out of a bath) had the highest number of nonapplicable and missing responses, with 14.4% of individuals having data that could not be scored. For each of the remaining items, fewer than 6% of individuals had data that could not be scored. Compared with the ADL subscale, the sports subscale had a larger number of nonapplicable responses. The number of missing responses was only slightly greater for the sports subscale.

TABLE 1. Item Response Pattern for ADL Subscale

Item | Unable to Do | Extreme Difficulty | Moderate Difficulty | Slight Difficulty | No Difficulty | Nonapplicable | Missing Response
Standing for 15 min | 7 (1.4%) | 29 (5.7%) | 129 (25.4%) | 132 (26%) | 206 (40%) | 2 (0.4%) | 3 (0.4%)
Getting into and out of an average car | 0 | 46 (9.1%) | 118 (23.3%) | 189 (37.3%) | 153 (30.2%) | 0 | 1 (0.2%)
Putting on socks and shoes | 5 (1%) | 60 (11.8%) | 104 (20.5%) | 156 (30.8%) | 180 (35.5%) | 0 | 2 (0.4%)
Walking up steep hills | 20 (3.9%) | 69 (13.6%) | 161 (31.8%) | 133 (26.2%) | 112 (22.1%) | 10 (2%) | 2 (0.4%)
Walking down steep hills | 13 (2.6%) | 48 (9.5%) | 147 (29%) | 140 (27.6%) | 142 (28%) | 11 (2.2%) | 6 (1.2%)
Going up 1 flight of stairs | 1 (0.2%) | 38 (7.5%) | 100 (19.7%) | 158 (31.2%) | 209 (41.2%) | 1 (0.2%) | 0
Going down 1 flight of stairs | 1 (0.2%) | 17 (3.4%) | 91 (17.9%) | 164 (32.3%) | 230 (45.4%) | 1 (0.2%) | 3 (0.6%)
Going up and down curbs | 1 (0.2%) | 9 (1.8%) | 66 (13%) | 134 (26.4%) | 292 (57.6%) | 3 (0.6%) | 2 (0.4%)
Deep squatting | 67 (13.2%) | 105 (20.7%) | 123 (24.3%) | 114 (22.5%) | 72 (14.2%) | 16 (3.2%) | 10 (2%)
Getting into and out of a bath | 10 (2%) | 21 (4.1%) | 83 (16.4%) | 136 (26.8%) | 184 (36.3%) | 64 (12.6%) | 9 (1.8%)
Sitting for 15 min | 2 (0.4%) | 28 (5.5%) | 80 (15.8%) | 148 (29.2%) | 248 (48.9%) | 0 | 1 (0.2%)
Walking initially | 8 (1.6%) | 25 (4.9%) | 100 (19.7%) | 170 (33.5%) | 199 (39.3%) | 0 | 5 (1%)
Walking for approximately 10 min | 12 (2.4%) | 41 (8.1%) | 107 (21.1%) | 154 (30.4%) | 190 (37.5%) | 0 | 3 (0.6%)
Walking for 15 min or greater | 33 (6.5%) | 82 (16.2%) | 137 (27%) | 113 (22.3%) | 136 (26.8%) | 4 (0.8%) | 2 (0.4%)
Twisting/pivoting on involved leg | 49 (9.7%) | 107 (21.1%) | 139 (27.4%) | 136 (26.8%) | 65 (12.8%) | 5 (1%) | 6 (1.2%)
Rolling over in bed | 7 (1.4%) | 47 (9.3%) | 87 (17.2%) | 176 (34.7%) | 185 (36.5%) | 1 (0.2%) | 4 (0.8%)
Light to moderate work (standing and walking) | 11 (2.2%) | 33 (6.5%) | 110 (21.7%) | 181 (35.7%) | 167 (32.9%) | 1 (0.2%) | 4 (0.8%)
Heavy work (pushing/pulling, climbing, carrying) | 64 (12.6%) | 114 (22.5%) | 143 (28.2%) | 114 (22.5%) | 52 (10.3%) | 18 (3.6%) | 4 (0.8%)
Recreational activities | 98 (19.3%) | 98 (19.3%) | 139 (27.4%) | 106 (20.9%) | 39 (7.7%) | 18 (3.6%) | 9 (1.8%)

TABLE 2. Item Response Pattern for Sports Subscale

Item | Unable to Do | Extreme Difficulty | Moderate Difficulty | Slight Difficulty | No Difficulty | Nonapplicable | Missing Response
Running 1 mile | 286 (56.4%) | 62 (12.2%) | 60 (11.8%) | 41 (8.1%) | 35 (6.9%) | 21 (4.1%) | 2 (0.4%)
Jumping | 171 (33.7%) | 96 (18.9%) | 84 (16.6%) | 81 (16%) | 64 (12.6%) | 8 (1.6%) | 3 (0.6%)
Swinging objects like a golf club | 82 (16.2%) | 50 (9.9%) | 66 (13%) | 99 (19.5%) | 111 (21.9%) | 92 (18.1%) | 7 (1.4%)
Landing | 110 (21.7%) | 89 (17.6%) | 98 (19.3%) | 96 (18.9%) | 77 (15.2%) | 22 (4.3%) | 15 (3%)
Starting and stopping quickly | 79 (15.6%) | 118 (23.3%) | 131 (25.8%) | 107 (21.1%) | 67 (13.2%) | 3 (0.6%) | 15 (3%)
Cutting/lateral movements | 105 (20.7%) | 133 (26.2%) | 113 (22.3%) | 104 (20.5%) | 33 (6.5%) | 9 (1.8%) | 10 (2%)
Low-impact activities like fast walking | 74 (14.6%) | 76 (15%) | 115 (22.7%) | 132 (26%) | 104 (20.5%) | 4 (0.8%) | 2 (0.4%)
Ability to perform activity with your normal technique | 116 (22.9%) | 95 (18.7%) | 116 (22.9%) | 104 (20.5%) | 61 (12%) | 8 (1.6%) | 7 (1.4%)
Ability to participate in your desired sport as long as you would like | 260 (51.3%) | 96 (18.9%) | 64 (12.6%) | 42 (8.3%) | 29 (5.7%) | 14 (2.8%) | 2 (0.4%)

Assumption of Unidimensionality

PRELIS requires the use of complete data. Therefore 430 subjects (85%) and 343 subjects (68%) were used to evaluate the ADL and sports subscales, respectively. For both the ADL and sports subscales, analysis was done to assess for differences in gender, age, duration of symptoms, time between surgery and data collection, and current rating of function between subjects with no missing responses and the other subjects included in the study.
The α value was set at .05 but adjusted to .005 because of the 10 planned comparisons. For the ADL subscale, a significant difference was not found for gender (P = .94), age (P = .009), duration of symptoms (P = .7), time between surgery and data collection (P = .012), or current rating of function (P = .18). For the sports subscale, a significant difference was not found for age (P = .58), duration of symptoms (P = .37), time between surgery and data collection (P = .39), or current rating of function (P = .56). A significant difference was found for gender (P < .0005), because the ratio of female subjects to male subjects was lower in the group with no missing data compared with the group with 1 or 2 missing responses.

Factor analysis of the 19-item ADL subscale indicated that the items loaded on 2 factors with eigenvalues of 12.4 and 1.2. The factor loadings of each item on the 19-item ADL subscale to the first principal component are reported in Table 3. Because 2 factors with an eigenvalue greater than 1 were produced, the factor analysis was repeated, sequentially omitting item 11 (sitting) and item 3 (putting on socks and shoes). A 17-item ADL subscale loaded on 1 factor, which accounted for 68% of the variance and had an eigenvalue of 11.6. The factor loading of each item on the 17-item ADL subscale to the first principal component is found in Table 3. Scores from this modified 17-item ADL subscale were then used for constructing the item characteristic curves and test information functions.

The 9-item sports subscale loaded on 1 factor. This 1 factor accounted for 80.3% of the variance and had an eigenvalue of 7.1. The factor loadings of each item to the first principal component are found in Table 4.

TABLE 3. Factor Loadings of Individual Items for ADL Subscale

Item No. | Item Content | Factor Loading, 19 Item | Factor Loading, 17 Item
1 | Standing for 15 min | .82 | .82
2 | Getting into and out of an average car | .78 | .76
3 | Putting on socks and shoes | .63 | omitted
4 | Walking up steep hills | .84 | .85
5 | Walking down steep hills | .84 | .85
6 | Going up 1 flight of stairs | .85 | .86
7 | Going down 1 flight of stairs | .86 | .86
8 | Going up and down curbs | .84 | .83
9 | Deep squatting | .75 | .75
10 | Getting into and out of a bath | .81 | .80
11 | Sitting for 15 min | .55 | omitted
12 | Walking initially | .76 | .75
13 | Walking for approximately 10 min | .86 | .87
14 | Walking for 15 min or greater | .84 | .85
15 | Twisting/pivoting on involved leg | .75 | .74
16 | Rolling over in bed | .76 | .74
17 | Light to moderate work (standing and walking) | .89 | .89
18 | Heavy work (pushing/pulling, climbing, carrying) | .81 | .82
19 | Recreational activities | .77 | .77

TABLE 4. Factor Loadings of Individual Items of Sports Subscale

Item No. | Item Content | Factor Loading
1 | Running 1 mile | .90
2 | Jumping | .93
3 | Swinging objects like a golf club | .85
4 | Landing | .94
5 | Starting and stopping quickly | .89
6 | Cutting/lateral movements | .88
7 | Low-impact activities like fast walking | .86
8 | Ability to perform activity with your normal technique | .83
9 | Ability to participate in your desired sport as long as you would like | .87

Item Characteristic Curves

Inspection of the item characteristic curves for the ADL subscale revealed that all but 4 items had well-fitting curves. Those 4 items pertained to the following: (1) getting into a car, (2) going up steps, (3) going down steps, and (4) going up and down curbs. The item characteristic curve for the item pertaining to walking up hills is an example of an item that had a well-fitting item characteristic curve and is presented in Fig 1. The item characteristic curves for the items that did not have well-fitting item characteristic curves resembled that pertaining to going up and down curbs, which is shown in Fig 2. Item characteristic curves were also plotted for the 9 items on the sports subscale. All 9 had well-fitting curves similar to that displayed in Fig 1.
Test Information Function

The test information function for the modified 17-item ADL subscale and the 9-item sports subscale can be found in Fig 3. The 4 items without well-fitting item characteristic curves (getting into a car, going up steps, going down steps, and going up and down curbs) were considered for elimination. The test information function was recalculated separately with each of these items deleted. In each case a decrease in information was noted throughout the range of ability. Therefore these 4 items were retained to maximize the instrument's precision of measurement across the range of ability.

FIGURE 1. Item characteristic curve for item 4 (walking up steep hills) on ADL subscale. This represents the appropriate item characteristic curve for an item with 5 potential responses. It should be noted that there are 5 separate curves, each with 1 peak, and together, the 5 curves span the spectrum of ability (θ). (RESP0, "unable to do" response; RESP1, "extremely difficult" response; RESP2, "moderate difficulty" response; RESP3, "slight difficulty" response; RESP4, "no difficulty at all" response.) [Plot: probability of response versus theta.]

FIGURE 2. Item characteristic curve for item 8 (going up and down curbs) on ADL subscale. This represents an item characteristic curve that may be potentially unacceptable. It should be noted that there are only 4 separate curves for an item that has 5 potential responses. (RESP0, "unable to do" response; RESP1, "extremely difficult" response; RESP2, "moderate difficulty" response; RESP3, "slight difficulty" response; RESP4, "no difficulty at all" response.) [Plot: probability of response versus theta.]

FIGURE 3. Test information function (TIF) for ADL and sports subscales showing their potential to provide information across range of ability. The ADL subscale offers more information regarding function at the lower end of ability, whereas the sports subscale offers more information at the higher range of ability. [Plot: information versus ability, from −2.0 to 2.0, for the ADL TIF and sports TIF.]

The 19-item ADL subscale and 9-item sports subscale can be found in Appendix 1 (online only, available at www.arthroscopyjournal.org). The ADL and sports subscales are scored separately. The item related to sitting and the item related to putting on socks and shoes are not scored. The response to each of the other 17 items on the ADL subscale is scored from 4 to 0, with 4 indicating "no difficulty" and 0 indicating "unable to do." Nonapplicable responses are not counted. The scores for each of the items are added together to obtain the item score total. The total number of items with a response is multiplied by 4 to obtain the highest potential score. If the subject answers all 17 items, the highest potential score is 68. If 1 item is not answered, then the highest score is 64; if 2 are not answered, then the highest score is 60; and so on. The item score total is divided by the highest potential score. This value is then multiplied by 100 to obtain a percentage. The sports subscale is scored in a similar manner, with the highest potential score being 36. A higher score represents a higher level of physical function for both the ADL and sports subscales. The mean score for the 17-item ADL subscale and 9-item sports subscale was 67.8 (SD, 20.5; range, 4 to 100) and 41.9 (SD, 27.8; range, 0 to 100), respectively.

Evidence of Internal Structure

The assessment of internal consistency for the 17-item ADL subscale found a coefficient α value of .96, with an SEM of 2.8 and a 90% CI of ±4.6 points.
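The scoring rule and the error band reported in this section can be sketched in a few lines. This is a minimal illustration, not the authors' software: the function names are hypothetical, unscored items (nonapplicable or missing) are represented as None, and the 90% interval is assumed to use the normal multiplier 1.645, which the text does not state. The SD that the authors entered into the SEM formula is not restated here, so the SD passed to sem() in the usage lines is a placeholder.

```python
import math

def subscale_score(responses):
    """Percentage score for one HOS subscale.

    responses: one entry per scored item, an int from 0 ("unable to do")
    to 4 ("no difficulty"), or None for a nonapplicable or missing response.
    """
    answered = [r for r in responses if r is not None]
    if not answered:
        raise ValueError("no scorable responses")
    if any(r < 0 or r > 4 for r in answered):
        raise ValueError("responses must be between 0 and 4")
    highest_potential = 4 * len(answered)  # 68 when all 17 ADL items are answered
    return 100.0 * sum(answered) / highest_potential

def sem(sd, alpha):
    """Standard error of measure: SD times the square root of (1 - coefficient alpha)."""
    return sd * math.sqrt(1.0 - alpha)

def ci_half_width(sem_value, z=1.645):
    """Half-width of the confidence interval; z = 1.645 (90%) is an assumption."""
    return z * sem_value

# Usage: 16 answered ADL items plus 1 nonapplicable response.
adl_responses = [4] * 10 + [2] * 6 + [None]
adl_score = subscale_score(adl_responses)  # 52 of a possible 64 points

# Reported SEM values reproduce the reported intervals after rounding.
adl_band = round(ci_half_width(2.8), 1)
sports_band = round(ci_half_width(2.3), 1)

# Placeholder SD, for illustration of the formula only.
example_sem = sem(10.0, 0.96)
```

Dividing by 4 times the number of answered items, rather than by a fixed 68 or 36, is what keeps a nonapplicable response from lowering the percentage.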
For the sports subscale, the coefficient α value was .95, with an SEM of 2.3 and a 90% CI of ±3.8 points.

Evidence of Convergent and Divergent Validity

The correlation coefficients between the 17-item ADL subscale and SF-36 physical function subscale, physical component summary score, mental health subscale, and mental component summary score were 0.76, 0.74, 0.27, and 0.18, respectively. The correlation coefficients between the sports subscale and SF-36 physical function subscale, physical component summary score, mental health subscale, and mental component summary score were 0.72, 0.68, 0.23, and 0.1, respectively. The calculated t values assessing differences between each subscale's correlations with measures of physical functioning and its correlations with measures of mental functioning were significant (P < .0005).

DISCUSSION

The results of this study offer evidence that the HOS is a valid measure of self-reported physical function for individuals with acetabular labral tears who are undergoing either arthroscopic surgical treatment or nonsurgical treatment. Specifically, this study provides evidence for internal structure and test content because the HOS represents the influence of labral tears on activity and participation across the spectrum of ability. Evidence for convergent and divergent validity was obtained because scores relate to other measures of the same construct and do not relate to measures of a different construct.

Missing data from a self-report instrument threaten the potential accuracy and validity of the patient responses. In a clinical situation missing responses occur and, when present to a small degree, are thought to be acceptable. Item 10 (getting into and out of a bath) had a noticeably larger number of nonapplicable and missing responses. It was therefore considered for omission from the HOS.
However, on the basis of its factor loading pattern, coefficient α value, and item characteristic curve, it was believed that this item should remain on the instrument. The sports subscale had a noticeably higher number of nonapplicable responses and therefore a lower number of items that could be scored. Although individuals noted their hip condition as their primary limiting factor, we have found that higher-level activities can sometimes be limited by factors other than hip pathology. A more detailed investigation of these other factors is outside the scope of the data collected in this study. However, future studies are planned to examine this issue.

The factor analysis performed required complete data sets without any nonapplicable or missing responses, whereas the IRT and the convergent and divergent validity analyses used incomplete data sets. Analyses were completed to compare demographic information between individuals with complete data sets and those with incomplete data sets. For the ADL subscale, even though P values approached statistical significance for age (40 years v 40.7 years) and time between surgery and data collection (5.9 months v 3.5 months), there was probably not much clinical significance. For the sports subscale, the significant difference in the gender distribution between those with complete data sets and those with incomplete data sets may require further analysis and, at this point, is difficult to interpret.

The analysis of internal structure by use of factor analysis found that 2 items needed to be eliminated in order for the ADL subscale to meet the assumption of unidimensionality. Items describing the individual's ability to sit and to put on socks and shoes may represent a different domain than the other 17 items. Because this finding was somewhat surprising, a poststudy analysis of internal consistency was done with the entire 19-item ADL subscale.
Internal consistency was higher when the items related to sitting and putting on socks and shoes were deleted, which confirmed that omitting these items from the ADL score was appropriate. Sitting and putting on socks and shoes may offer valuable differential diagnosis information for pain associated with femoral acetabular impingement and stiffness associated with arthritis. However, on the basis of this study, their inclusion on an instrument that assesses the influence of acetabular labral tears on functional status is questioned.

Objective evidence for content was obtained by the psychometric procedures of IRT and the results of internal consistency. Individuals with labral tears, including those who undergo hip arthroscopy, generally function at a high level and may only be limited in their ability to participate in sports. The results of the IRT analysis offer evidence for adequate representation of items questioning activities in the higher range of ability. This may offer an advantage over other instruments currently available. To substantiate this, further study will be required.

The coefficient α value was used, through the SEM, to estimate the precision of a measurement at a single point in time. To help interpret these values, consider a patient who scores 70 on the ADL subscale. One can be confident that, 90% of the time, this patient will score between 65.4 and 74.6. In addition, one can be confident that, at a single point in time, individuals who score above 74.6 or below 65.4 are performing at a different level than an individual with an observed score of 70. This information can be used when evaluating whether the scores from 2 individuals are different at a single point in time.

In addition to evidence for test content and internal structure, our results provide convergent and divergent evidence of validity.
As expected, the HOS was found to have relatively high correlations with concurrent measures of physical function and relatively low correlations with concurrent measures of mental health. This finding provides evidence that the HOS is a measure of physical function as opposed to mental function.

CONCLUSIONS

The results of this study provide evidence of validity to support the use of the HOS ADL and sports subscales in individuals with labral tears. This includes individuals who have undergone arthroscopic surgery, as well as those who have not. Specifically, the results of this study found that the HOS ADL and sports subscales were unidimensional, had adequate internal consistency, were potentially responsive across the spectrum of ability, and contributed information across the spectrum of ability. In addition, scores obtained by the HOS related to measures of function and did not relate to measures of mental health.

REFERENCES

1. Bellamy N, Buchanan WW, Goldsmith CH, Campbell J, Stitt LW. Validation study of WOMAC: A health status instrument for measuring clinically important patient relevant outcomes to antirheumatic drug therapy in patients with osteoarthritis of the hip or knee. J Rheumatol 1988;15:1833-1840.
2. Nilsdotter AK, Lohmander LS, Klassbo M, Roos EM. Hip disability and osteoarthritis outcome score (HOOS) - Validity and responsiveness in total hip replacement. BMC Musculoskelet Disord 2003;4:10.
3. Harris WH. Traumatic arthritis of the hip after dislocation and acetabular fractures: Treatment by mold arthroplasty. An end-result study using a new method of result evaluation. J Bone Joint Surg Am 1969;51:737-755.
4. Tugwell P, Bombardier C, Buchanan WW, Goldsmith CH, Grace E, Hanna B. The MACTAR Patient Preference Disability Questionnaire - An individualized functional priority approach for assessing improvement in physical disability in clinical trials in rheumatoid arthritis. J Rheumatol 1987;14:446-451.
5. Wright JG, Young NL, Waddell JP. The reliability and validity of the self-reported patient-specific index for total hip arthroplasty. J Bone Joint Surg Am 2000;82:829-837.
6. Binkley JM, Stratford PW, Lott SA, Riddle DL. The Lower Extremity Functional Scale (LEFS): Scale development, measurement properties, and clinical application. North American Orthopaedic Rehabilitation Research Network. Phys Ther 1999;79:371-383.
7. Hunsaker FG, Cioffi DA, Amadio PC, Wright JG, Caughlin B. The American Academy of Orthopaedic Surgeons outcomes instruments: Normative values from the general population. J Bone Joint Surg Am 2002;84:208-215.
8. Christensen CP, Althausen PL, Mittleman MA, Lee JA, McCarthy JC. The nonarthritic hip score: Reliable and validated. Clin Orthop Relat Res 2003:75-83.
9. Messick S. Meaning and values in test validation: The science and ethics of assessment. Educ Res 1989;18:5-11.
10. Martin RL, Irrgang JJ, Burdett RG, Conti SF, Van Swearingen JM. Evidence of validity for the Foot and Ankle Ability Measure (FAAM). Foot Ankle Int 2005;26:968-983.
11. International classification of functioning, disability and health (ICF). Geneva: World Health Organization, 2001.
12. Irrgang JJ, Anderson AF, Boland AL, et al. Development and validation of the international knee documentation committee subjective knee form. Am J Sports Med 2001;29:600-613.
13. Hambleton RK, Jones RW. Comparison of classical test theory and item response theory and their applications to test development. Educ Meas Issues Pract 1993;12:38-47.
14. Meng X, Rosenthal R, Rubin DB. Comparing correlated correlation coefficients. Psychol Bull 1992;111:172-175.