ARTICLE IN PRESS

Physical Therapy in Sport 8 (2007) 14–21
www.elsevier.com/locate/yptsp

Original research

Reliability of the Thomas test for assessing range of motion about the hip

J. Peeler (a,*), J.E. Anderson (b)

(a) Department of Kinesiology and Applied Health, University of Winnipeg, 515 Portage Ave., Winnipeg, Manitoba, Canada R3B 2E9
(b) Department of Human Anatomy and Cell Science, Faculty of Medicine, University of Manitoba, Winnipeg, Manitoba, Canada

Received 11 May 2006; received in revised form 22 August 2006; accepted 26 September 2006

Abstract

Objectives: Rehabilitative protocols and research are significantly influenced by the ability to perform reliable measures of specific physical attributes or functions. The hypothesis was that the Thomas test for evaluating range of motion about the hip joint is a reliable clinical assessment tool.
Subjects: Participants (n = 54) were between the ages of 18 and 45, and had no history of trauma.
Methods: Three Board-Certified Athletic Therapists assessed hip range of motion using pass/fail and goniometer scoring systems. A re-test session was completed seven to ten days later.
Results: Kappa values for pass/fail scoring (intra-rater R = 0.47, inter-rater R = 0.39) and ICC values for goniometer data (intra-rater R = 0.52, inter-rater R = 0.60) both indicated that the Thomas test demonstrated poor intra- and inter-rater reliability. However, measurement error values (SEM = 1°, ME = 2°, and CV = 15%) and Bland and Altman plots demonstrated that there was only a small degree of intra-rater variance for each examiner when executing the Thomas test in a clinical setting.
Conclusions: Results call into question the statistical reliability of the Thomas test, but provide clinicians with important information regarding the reliability limits of the Thomas test when used to clinically evaluate hip range of motion and ilio-psoas muscle flexibility in a physically active population.
More research is required to determine the variables that may confound the statistical reliability of this orthopaedic technique, which is commonly used in a clinical setting to assess hip function.

Keywords: Special tests; Orthopaedic evaluation; Flexibility measurement

* Corresponding author. Tel.: +12047891408; fax: +12047837866. E-mail address: j.peeler@uwinnipeg.ca (J. Peeler).
1466-853X/$ - see front matter © 2006 Elsevier Ltd. All rights reserved. doi:10.1016/j.ptsp.2006.09.023

1. Introduction

In orthopaedics, effective treatment and useful research are both dependent on the extent to which clinicians can perform reliable and accurate measures of a specific physical attribute or function (Portney & Watkins, 2000; Weir, 2005). Unreliable or inaccurate assessment confounds the use of a hypothesis-driven research model, compromises the clinician's ability to make informed decisions regarding treatment progression, and therefore complicates the effective prescription of treatment protocols (Atkinson & Nevill, 1998; Portney & Watkins, 2000).

Reliability, or consistency, refers to the extent that a measurement is reproducible and free of error, and assesses whether measurements are repeatable when all conditions are thought to be held constant (Bedard, Martin, Krueger, & Brazil, 2000; Portney & Watkins, 2000). A reliable examiner will be able to make repeated assessments as evidenced by consistent scoring. To establish rater reliability, the instrument and response variables are considered stable, with any observed differences between scores being attributed to rater error. Examiner reliability can be conceptualized as either intra-rater (or within-examiner) or inter-rater (or between-examiners) reliability. Intra-rater reliability refers to the reproducibility of the measurements by the same examiner (i.e.
the consistency with which one patient or subject is assessed by the same examiner over multiple examinations), and inter-rater reliability refers to the reproducibility of measurements taken by different examiners (i.e. the consistency with which one patient or subject is assessed by multiple examiners) (Bedard et al., 2000; Portney & Watkins, 2000; Vela, Tourville, & Hertel, 2003; Weir, 2005).

Goniometric assessment is a routine procedure used by clinicians to evaluate joint range of motion (ROM). It allows the quantification of movement using linear (inches or cm) or angular units (degrees of an arc). Because of its widespread use in an orthopaedic setting, the reliability of goniometric assessment (continuous data) has been rigorously investigated (Boone, Azen, Chun-Mei, Spence, Baron, & Lee, 1978; Low, 1976; Rothstein, Miller, & Roettger, 1983; Somers, Hanson, Kedzierski, Nestor, & Quinlivan, 1997). Clinicians also utilize pass/fail (or negative/positive) scoring systems (dichotomous data) to assess ROM about a particular joint. These orthopaedic tests (sometimes referred to as "special tests") help to determine whether a particular type of dysfunction or injury may be present (Magee, 2002). Detailed procedures and established benchmarks are used to assess motion as either a passing score (when a range of motion meets or exceeds a specified angle) or a failure (when a range of motion fails to meet the specified angle). Previous research has shown that many special tests lack sensitivity, in that a failing score is suggestive of dysfunction, while a passing score does not necessarily rule out or exclude dysfunction (Ross, Nordeen, & Barido, 2003).

In clinical orthopaedics, the Thomas test is commonly used by clinicians to assess ROM about the hip joint.
The face-validity of this assessment technique is confirmed by its inclusion in a number of prominent textbooks on orthopaedic physical assessment (Anderson & Hall, 1999; Kendall, McCreary, Provance, Rodgers, & Romani, 2005; Magee, 2002; Prentice, 2003; Reid, 1992; Richardson & Iglarsh, 1994), and its use as a measurement tool in research examining ilio-psoas muscle flexibility and ROM about the hip joint. Unfortunately, most of these studies are specific to one population (i.e., one type of athlete, sport, or disease) or collected data using scoring criteria whose reliability had not been confirmed (Bartlett, Wolf, Shurtleff, & Staheli, 1985; Glard, Launay, Viehweger, Guillaume, Jouve, & Bollini, 2005; Lee, Kerrigan, & Croce, 1997; Schache, Blanch, & Murphy, 2000; Staheli, 1977; Thurston, 2006; Tyler, Zook, Brittis, & Gleim, 1996). As a result, there is little normative data available on hip ROM and ilio-psoas muscle flexibility in the general population, and a reliability measurement for the Thomas test has not been reported.

2. Hypotheses and specific aims

The purpose of this investigation was to test the hypothesis that the Thomas test provides reliable assessment of hip range of motion. Specifically, the study had the following aims: (1) to investigate the intra-rater reliability of the Thomas test; (2) to investigate the inter-rater reliability of the Thomas test; and (3) to compare the reliability of goniometer (continuous data) and pass/fail (dichotomous data) scoring for the Thomas test.

3. Materials and methods

3.1. Data collection protocol

Little statistical information was available regarding the normal level of variance of the Thomas test when used to clinically assess range of motion and flexibility about the hip joint. A power analysis estimation revealed that a sample size of approximately 140 limbs would provide a 90% confidence level when analyzing data at a p < 0.05 level of significance (Hassard, 1991).
Following approval by the Research Ethics Board at the University of Manitoba, healthy, physically active subjects between 18 and 45 years of age with no history of surgery or trauma to the hip, knee, or lower leg region were recruited for the study. All subjects took part in an initial intake session prior to beginning the study. During this session, subjects were assigned an identification (I.D.) number, and asked to complete informed consent and participant information forms. They also completed a physical activity questionnaire to provide baseline information regarding their habitual activity patterns at work, leisure and play (Baecke, Burema, & Frijters, 1982); these data were used to confirm the sample was representative of a normal physically active population. Baseline anthropometric data such as height (m), weight (kg), and femoral length (defined as the distance (cm) measured from the head of the fibula, along the length of the femur through the greater trochanter, to the table top with the knee flexed to 90°) were also measured and recorded by a clinician with more than 13 years of clinical assessment experience in orthopaedics. Subjects were instructed to refrain from starting new activities or exercise regimes during the course of the study. At the conclusion of the intake session, subjects were scheduled for two separate sessions for assessment of hip range of motion, which occurred 7–10 days apart.

Three experienced examiners were recruited from the community to participate in the study. All were Board-Certified Athletic Therapists who possessed a minimum of 6 years of clinical assessment and treatment experience in musculoskeletal disorders, and who routinely used the Thomas test to evaluate hip function.
Prior to the start of the study, each examiner attended two 1-hr instructional workshops in order to become familiar with the testing protocol, and to reinforce the criteria defining pass/fail scoring on the Thomas test, and the standardized procedures for collecting goniometric data.

Assessments took place in the Rehabilitation Exercise Laboratory located in the School of Medical Rehabilitation at the University of Manitoba. Subjects were free to schedule their assessments (both test and retest sessions) over the lunch hour, or during early evening time slots. All testing was conducted in a standardized testing environment (i.e., consistent room temperature (20 °C), lighting, privacy, and plinth type), with no type of "warm up" exercise being completed prior to the initiation of testing. Subjects were instructed to wear shorts and T-shirts for all assessments, and to refrain from exercise for a minimum of four hours prior to testing sessions. Assessment by each examiner took approximately 5 min to complete; a maximum of four subjects were tested per half hour. Subjects underwent independent assessment by each of the three examiners in a random order.

Examiners assessed bilateral hip range of motion of each participant using the Thomas Test (Fig. 1). Examiners determined pass/fail scoring according to the protocol outlined in Magee's Orthopedic Physical Assessment textbook (Magee, 2002). The participant was positioned supine on the examination table, and the examiner passively flexed one hip (to a minimum of 90° of hip flexion), bringing the knee up to the chest in order to flatten the lumbar spine and stabilize the pelvis. During this maneuver, care was taken not to flex the hip excessively, so that the pelvis and lumbar spine did not move out of a neutral posture. The subject was instructed to hold the hip flexed against the chest.
The test was scored as a pass if the opposite hip and knee remained stationary and positioned flat against the examination table. The test was scored as a fail if the opposite hip flexed, and the knee lifted off the examination table (Magee, 2002).

Joint range of motion was quantified using an 18-inch flexible and adjustable plastic goniometer (Baseline™, Diagnostic and Measuring Instruments) that is commonly employed by health care practitioners working in a clinical setting (Rothstein et al., 1983). Goniometer measurements were made from the same joint angle that was scored by the examiner as either a pass or fail, and were carried out using visibly identifiable anatomical landmarks, thus avoiding procedures that would require examiners to estimate the exact centre of rotation about which the hip or knee joints move. Pilot testing demonstrated that the greater trochanter of the femur (hip) and the head of the fibula (knee) were the most readily identifiable landmarks of the region. The easy availability of these points facilitated efficient, accurate, and reliable surface landmarking, and helped to minimize the confounding effect of inconsistent surface landmarking on the part of examiners about the hip and knee joints (France & Nester, 2001).

Fig. 1. Thomas Test: visual representation of pass/fail scoring. (a) Pass score: participant's test leg remains on the plinth when the opposite hip is flexed to the chest; (b) Fail score: participant's test leg will rise off the plinth when the opposite hip is flexed to the chest; (c) Visual representation of goniometer scoring. An adhesive marker was placed over the head of the fibula, and examiners measured the distance between the plinth and the head of the fibula (#2). This distance, along with the participant's femoral length (#1), was used in a trigonometric equation to calculate the hip flexion angle (HFA) in degrees for the test leg (Reproduced and adapted from David J. Magee's Orthopedic Physical Assessment, 4th edition, p. 631).

Pilot testing also highlighted several difficulties in utilizing a goniometer to quantify the degree of hip flexion during the execution of the Thomas test. These problems included: (1) no identifiable superior landmark above the hip joint about which the arm of the goniometer could be aligned; (2) difficulty in aligning the axis of rotation of the goniometer with the center of motion for the hip joint; and (3) difficulty maintaining the inferior arm of the goniometer in alignment with the long axis of the limb. In an effort to minimize the confounding effect that these variables could have on the measurement of hip flexion, the degree of hip flexion was calculated using a trigonometric equation that was based on the previously measured femoral length and one single measurement about the knee that was made by each examiner. Prior to assessment, examiners placed an adhesive marker over the head of the fibula. During execution of the Thomas test (Fig. 1), examiners measured the perpendicular distance (PD) (cm) between the surface of the examination table and the inferior boundary of the adhesive marker over top of the head of the fibula. This value, and the corresponding measurement of femoral length (FL) (as previously defined) for the test leg, were later entered into the trigonometric equation to calculate the angle of hip flexion (HFA) for the test leg. This equation was defined as $\mathrm{HFA} = \sin^{-1}(\mathrm{PD}/\mathrm{FL})$, according to standard trigonometric definitions.
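As an illustration of the trigonometric step described above, the following minimal Python sketch computes HFA from PD and FL. The function name `hip_flexion_angle` and the example values are hypothetical and are not taken from the study data; the range check on PD/FL is an added assumption so that the arcsine is defined.

```python
import math


def hip_flexion_angle(pd_cm: float, fl_cm: float) -> float:
    """Return the hip flexion angle (HFA) in degrees from PD and FL in cm."""
    if fl_cm <= 0:
        raise ValueError("femoral length (FL) must be positive")
    ratio = pd_cm / fl_cm
    if not 0.0 <= ratio <= 1.0:
        # Assumed guard: the arcsine is only defined for ratios in [0, 1] here.
        raise ValueError("PD must lie between 0 and FL")
    # HFA = arcsin(PD / FL), converted from radians to degrees.
    return math.degrees(math.asin(ratio))


# Hypothetical measurements (not from the study): PD = 5 cm, FL = 45 cm.
angle = hip_flexion_angle(pd_cm=5.0, fl_cm=45.0)
print(round(angle, 1))  # about 6.4 degrees
```

Because the angle depends only on the ratio PD/FL, a 1 cm error in PD matters more for participants with a shorter femoral length.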
Thomas test scores (goniometer measurement to the nearest degree and a pass/fail score) were recorded for each subject by each examiner on a standardized data collection sheet for each test session. Examiners were blinded as to their scoring from the first test session, and to the scoring by other examiners. At the end of each test session, data sheets were collected and collated according to subject I.D. numbers.

3.2. Data analysis

Data were entered in a Microsoft Excel spreadsheet. Descriptive statistics (mean ± SD) organized by gender were generated for age, body weight and height measurements, calculated body mass index (BMI), physical activity levels (scored out of a total of 15), and hip joint range of motion.

Intraclass correlation coefficients (ICC) were calculated in order to evaluate the intra- and inter-rater reliability of goniometer scoring. An ICC (3, 1) model was used to evaluate the intra-rater reliability. ICC values were calculated using a two-way ANOVA and the equation $\mathrm{ICC} = (\mathrm{BMS} - \mathrm{EMS}) / (\mathrm{BMS} + (K - 1)\,\mathrm{EMS})$, where BMS is the between-subjects mean square, EMS is the error mean square, and K is the number of raters (Domholt, 2000; Holmback, Porter, Downham, & Lexell, 1999, 2001; Portney & Watkins, 2000). An ICC (2, 1) model was used to evaluate the inter-rater reliability. ICC values were again calculated using a two-way ANOVA and the equation $\mathrm{ICC} = (\mathrm{BMS} - \mathrm{EMS}) / (\mathrm{BMS} + (K - 1)\,\mathrm{EMS} + K(\mathrm{RMS} - \mathrm{EMS})/n)$, where BMS is the between-subjects mean square, EMS is the error mean square, RMS is the between-raters mean square, K is the number of raters, and n is the number of subjects tested (Domholt, 2000; Holmback et al., 1999, 2001; Portney & Watkins, 2000).

Intra- and inter-rater reliability of pass/fail scoring was measured using a Kappa statistic. This statistic uses a simple index of agreement, called percent agreement, to measure how often raters agree on scoring for each individual subject.
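The two ICC equations and the idea of chance-corrected agreement can be sketched in Python as follows. This is a minimal sketch, not the authors' analysis code: the function names, the layout of `scores` (one row per subject, one column per rater or session), and the example data are assumptions, and `cohen_kappa` implements the standard two-rating Cohen's kappa rather than the multiple-rating procedure of Haley and Osberg (1989).

```python
def mean_squares(scores):
    """Two-way ANOVA mean squares for an n-subjects x k-raters table."""
    n, k = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)  # subjects
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)  # raters
    ss_err = ss_total - ss_rows - ss_cols
    bms = ss_rows / (n - 1)
    rms = ss_cols / (k - 1)
    ems = ss_err / ((n - 1) * (k - 1))
    return bms, rms, ems, n, k


def icc_3_1(scores):
    """ICC(3,1) = (BMS - EMS) / (BMS + (K - 1) EMS)."""
    bms, _, ems, _, k = mean_squares(scores)
    return (bms - ems) / (bms + (k - 1) * ems)


def icc_2_1(scores):
    """ICC(2,1) = (BMS - EMS) / (BMS + (K - 1) EMS + K (RMS - EMS) / n)."""
    bms, rms, ems, n, k = mean_squares(scores)
    return (bms - ems) / (bms + (k - 1) * ems + k * (rms - ems) / n)


def cohen_kappa(first, second):
    """Chance-corrected agreement between two lists of pass/fail labels."""
    n = len(first)
    p_obs = sum(a == b for a, b in zip(first, second)) / n
    labels = set(first) | set(second)
    # Agreement expected by chance from each list's marginal proportions.
    p_exp = sum((first.count(lab) / n) * (second.count(lab) / n) for lab in labels)
    if p_exp == 1.0:  # both lists contain one identical label only
        return 1.0
    return (p_obs - p_exp) / (1.0 - p_exp)


# Illustrative scores only (6 subjects x 4 raters), not data from this study.
example_scores = [
    [9, 2, 5, 8], [6, 1, 3, 2], [8, 4, 6, 8],
    [7, 1, 2, 6], [10, 5, 6, 9], [6, 2, 4, 7],
]
print(round(icc_3_1(example_scores), 2))  # 0.71
print(round(icc_2_1(example_scores), 2))  # 0.29

session_1 = ["pass"] * 5 + ["fail"] * 5
session_2 = ["pass"] * 4 + ["fail"] * 5 + ["pass"]
print(round(cohen_kappa(session_1, session_2), 2))  # 0.6
```

In the illustrative table the raters differ systematically in their mean scores, which is why ICC(2,1), whose denominator includes the between-raters term, comes out well below ICC(3,1).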
The advantage of the Kappa statistic is that it examines the proportion of observed agreement, and also considers the proportion of agreement that might be expected by chance. Therefore, the coefficient of agreement (number of observations on which there is agreement divided by the number of pairs of scores that were obtained) produced by the Kappa test is corrected for chance (number of expected agreements divided by number of possible agreements). This calculation provides a reasonable estimate of the reliability of dichotomous pass/fail data (Haley & Osberg, 1989; Portney & Watkins, 2000).

As cited by several clinical research publications, ICC and Kappa values above 0.75 should be considered representative of high levels of reliability, while values between 0.4 and 0.75 are indicative of a fair to moderate level of reliability. ICC values below 0.4 should be considered representative of a poor level of reliability (Atkinson & Nevill, 1998; Domholt, 2000; Holmback et al., 1999; Portney & Watkins, 2000; Shrout & Fleiss, 1979).

Three forms of measurement error statistics (standard error of the measurement (SEM), method error (ME), and coefficient of variation (CV)) were used to examine the within-subject variation between testing sessions. The standard error of the measurement was defined by $\mathrm{SEM} = SD_1 (1 - \mathrm{ICC})^{0.5}$, where $SD_1$ is the standard deviation of all measurements, and the ICC value is derived from intra-rater analysis. Method error was defined as $\mathrm{ME} = SD_2 / \sqrt{2}$, where $SD_2$ is the standard deviation of the differences between the 2 measurements. The coefficient of variation was defined as $\mathrm{CV} = 100 \times \mathrm{ME} / X_1$, where $X_1$ is the mean for all observations from test sessions 1 and 2 (Holmback et al., 1999). Finally, Bland and Altman graphs provided a visual representation of the variation between scores by each of the examiners, and were used to study any systematic bias between testing sessions (Bland & Altman, 1986).

4. Results

Descriptive statistics for study participants are presented in Table 1. Participants had a mean age of 29 years (males: 29 ± 7.0; females: 28 ± 7.4), and were representative of a population that is young, healthy, and physically active in a wide variety of leisure and sporting opportunities. Fifty-seven (57) subjects volunteered to participate in the study over a 6-month period. Fifty-four (54) subjects completed both testing sessions. For analysis, the flexibility measurements from 108 limbs were used to investigate intra-rater reliability, while 222 flexibility measurements were available to examine inter-rater reliability.

Table 1
Participant anthropometric data (mean ± standard deviation) for the present study

                                 Male (n = 19)   Female (n = 38)   Total (n = 57)
Age (years)                      29 ± 7.0        28 ± 7.4          29 ± 7.3
Weight (kg)                      80 ± 10.0       64 ± 7.5**        69 ± 11.0
Height (m)                       1.77 ± 0.07     1.64 ± 0.07**     1.68 ± 0.09
Body mass index                  25.3 ± 3.0      23.9 ± 2.9        24.3 ± 3.0
Physical activity levels (/15)   8.4 ± 1.6       8.3 ± 1.1         8.5 ± 1.3

**p < 0.01.

Descriptive statistics for hip range of motion are presented in Table 2. The mean hip joint range of motion for all participants was 7° ± 2. On average, there was no gender difference when comparing the ROM about the hip joint during Thomas testing.

Table 2
Goniometer scoring of hip joint range of motion

Columns: Male (34 limbs), Female (74 limbs), All (108 limbs), each with Test and Retest scores. Values as extracted (cell positions uncertain):
Examiner 1: 7 ± 2, 7 ± 2, 7 ± 2, 7 ± 2, 7 ± 2, 7 ± 2
Examiner 2 and Examiner 3: 6 ± 1, 7 ± 2, 6 ± 2, 7 ± 2, 7 ± 1, 7 ± 2, 7 ± 1, 7 ± 2, 7 ± 1, 7 ± 2, 6 ± 1, 7 ± 2, 7 ± 1
Group average: 7 ± 2, 7 ± 2

Hip joint range of motion data (mean ± standard deviation) for Thomas testing obtained through goniometer measurements (in degrees) for the present study. The group average is representative of the mean score of all examiners across both assessments (test and retest).

Intra-class correlation coefficients and a chance-corrected Kappa statistic were used to evaluate the relative reliability of intra- and inter-rater scoring for the Thomas test. Intra-rater results are presented in Table 3. Pass/fail corrected Kappa values ranged from a low of 0.33 to a high of 0.72 among the three examiners. Goniometer ICC values ranged from a low of 0.43 to a high of 0.59 among the three examiners. The intra-rater results demonstrated that, on average, the goniometer method of scoring was slightly more consistent than the pass/fail method of scoring. As well, intra-rater results revealed that examiner #1 was generally the most reliable in scoring hip ROM during the test–retest protocol, independent of scoring method.

Table 3
Thomas test chance-corrected Kappa statistics for pass/fail scoring, and ICC (model 3, 1) values for goniometer scoring during the present study

Intra-rater            Pass/Fail   Goniometer   Mean
Examiner 1 (n = 108)   0.72        0.59         0.66
Examiner 2 (n = 108)   0.37        0.43         0.40
Examiner 3 (n = 108)   0.33        0.53         0.43
Mean                   0.47        0.52

Inter-rater results are presented in Table 4. The goniometer ICC values were, on average, higher than the pass/fail corrected Kappa values. However, analysis of between-examiner scores using a two-way ANOVA revealed significant variation (p < 0.01) in goniometer scoring among the three examiners.

Table 4
Thomas test chance-corrected Kappa statistics for pass/fail scoring, and ICC (model 2, 1) values for goniometer scoring during the present study

Inter-rater              Examiner 1   Examiner 2   Mean
Pass/Fail
  Examiner 2 (n = 222)   0.31         -
  Examiner 3 (n = 222)   0.47         0.38         0.39
Goniometer
  Examiner 2 (n = 222)   0.60**       -
  Examiner 3 (n = 222)   0.71         0.50**       0.60

**p < 0.01.

For the present study, measurement error was analyzed using the goniometer data from the three examiners. The standard error of the measurement for the Thomas test was 1°; the method error was 2°; and the coefficient of variation among examiners was 15 percent.
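The three measurement error statistics defined in the data analysis section (SEM, ME, and CV) can be sketched in Python as below. This is a minimal sketch under stated assumptions: the function name `measurement_error` and the example scores are hypothetical, the ICC is passed in rather than computed, and the sample (n - 1) standard deviation is assumed because the source does not say which form was used.

```python
import math
import statistics


def measurement_error(test, retest, icc):
    """Return (SEM, ME, CV) for paired test and retest scores."""
    all_scores = list(test) + list(retest)
    sd_all = statistics.stdev(all_scores)        # SD1: SD of all measurements
    diffs = [a - b for a, b in zip(test, retest)]
    sd_diff = statistics.stdev(diffs)            # SD2: SD of the differences
    grand_mean = statistics.mean(all_scores)     # X1: mean of all observations
    sem = sd_all * math.sqrt(1.0 - icc)          # SEM = SD1 (1 - ICC)^0.5
    me = sd_diff / math.sqrt(2.0)                # ME = SD2 / sqrt(2)
    cv = 100.0 * me / grand_mean                 # CV = 100 ME / X1
    return sem, me, cv


# Hypothetical goniometer scores in degrees (not data from this study).
session_1 = [6, 7, 8, 7]
session_2 = [7, 7, 7, 8]
sem, me, cv = measurement_error(session_1, session_2, icc=0.52)
print(round(sem, 2), round(me, 2), round(cv, 1))
```

SEM and ME are expressed in the units of the measurement (degrees here), whereas CV is a percentage of the mean, so CV is the only one of the three that can be compared directly across techniques with different scales.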
In general, measurement error values were small, and were representative of a tight distribution for test–retest scoring.

Bland and Altman graphs illustrate the consistency of each examiner's scoring over the two test sessions, as well as the variability of scoring between the three examiners. In Fig. 2(a)–(c), the mean assessment score for the two test sessions of each participant (x-axis) was plotted against the difference between the two test scores for the same participant (y-axis). The mean difference scores (test session #1 score minus test session #2 score) are equally distributed about zero, indicating that there was good test–retest scoring consistency, and that examiners were unbiased in scoring over the two testing sessions (i.e. a higher or lower score was just as likely to occur in test session #1 as test session #2). The combined Bland and Altman graph (Fig. 3) also provides a visual representation of the significant variation (p < 0.01) of between-examiner scoring for participants. The x-axis depicts a large range in measurements, with a number of outlying data points for each examiner, and is indicative of systematic examiner-dependent use of the Thomas test.

Fig. 2. Bland & Altman graphs provide visual confirmation that there was no intra-rater systematic bias between testing sessions #1 and #2 for (a) Examiner #1, (b) Examiner #2, and (c) Examiner #3.

5. Discussion

This study was conducted to examine the reliability of an orthopaedic assessment technique that is commonly used in the clinic to assess hip range of motion and ilio-psoas muscle tightness about the hip joint. To our knowledge, the reliability of this "special test" has not been previously reported within the scientific literature. The results call into question the statistical reliability of the Thomas test during both goniometer and pass/fail scoring.
However, results provide useful information to practitioners regarding the limits of reliability for this technique when used clinically by individual examiners to assess whether hip ROM and ilio-psoas muscle flexibility have changed, for example due to an intervention or pathology.

The results show that the Thomas test demonstrated poor statistical reliability for intra- and inter-rater comparisons among examiners during both goniometer and pass/fail scoring. It would appear that despite the use of well-defined methodology and examiner workshops that were designed to standardize the assessment protocol and define pass/fail criteria, each of the examiners used slightly different stringency (i.e., specified ROM) when grading ilio-psoas flexibility and hip joint range of motion as either a pass or fail. Beyond this, despite the use of a readily identifiable anatomical landmark, it would appear that inaccurate or inconsistent surface landmarking during goniometer evaluation may have contributed to the large amount of variation between examiners' scores for each participant. Because goniometric assessment evaluated joint ROM to within one degree, small measurement differences may have also resulted in an over-emphasis of the variation between examiners' scores.

Fig. 3. Bland & Altman graphs illustrate there was a large amount of inter-rater variation over the 2 test sessions.

Measurement error values for the goniometer data indicated that there was little variation in each examiner's scoring over the two testing sessions (this is confirmed by the goniometer data presented in Table 2). The SEM values illustrated that flexibility scores were tightly distributed, with 95% of the retest scores falling within 2° of the initial flexibility scores (i.e., Thomas test session #1 mean score = 7°, 95% chance that test session #2 mean score would be between 5° and 9°).
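The interval quoted above is consistent with the common convention of taking ±1.96 × SEM as an approximate 95% band around an observed score (1.96 × 1° ≈ 2°). The sketch below works through that arithmetic in Python; the function name `retest_interval` and the multiplier `z` are illustrative assumptions, and the source does not state how the band was computed.

```python
def retest_interval(score: float, sem: float, z: float = 1.96):
    """Approximate 95% band for a retest score, assuming a +/- z * SEM rule."""
    half_width = z * sem
    return score - half_width, score + half_width


# Values reported in the text: session #1 mean score of 7 degrees, SEM of 1 degree.
low, high = retest_interval(score=7.0, sem=1.0)
print(round(low), round(high))  # 5 9
```

Under this reading, a retest score that falls outside the band would be a candidate for a change larger than measurement error alone.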
This small amount of within-subject variation between testing sessions was confirmed by the ME values, which indicated a small degree of variation between test sessions for each examiner. Finally, the CV values provided a universal estimate (or percentage) of the within-subject variation over the two testing sessions for each of the flexibility assessment techniques. Because the CV values are expressed independent of the units of measurement, they account for differences in the magnitude of the mean and facilitate easy comparison of the results between methods (i.e., they provide a measure of relative variation among different assessment techniques). If the mean scores for the Thomas test (mean = 7°) are examined in conjunction with its respective CV value (Thomas = 15%), a clearer understanding of the relative variation of the technique is revealed (Thomas: 7° × 0.15 ≈ 1°). While the Thomas test has a large CV value, when the variation is expressed relative to the scoring variation observed among participants, it is apparent that values from the three measurement error tests are comparable. This information provides valuable insight into the clinical reliability limits of the Thomas test, and enables practitioners to make knowledgeable decisions regarding whether a "real" change has occurred between testing sessions, or whether the observed change is simply a product of measurement error.

The results have important implications for clinicians specializing in orthopaedics. The statistical data indicate that even experienced examiners who possess advanced orthopaedic assessment skills had difficulty attaining a high level of reliability when assessing hip joint ROM and ilio-psoas muscle flexibility using either pass/fail or goniometer scoring methods.
This finding has important implications for the education, application, and evaluation of clinical orthopaedic skills within the orthopaedic and rehabilitative science communities. Also, the study sample was representative of a population that was young, healthy and physically active. From a clinical standpoint, this type of patient would be hypothesized to provide the most accurate and consistent model for examining assessment techniques because it would limit confounding factors such as joint pathology, muscle contractures, and elevated BMI. If this notion is true, then one would predict that reliability values for the Thomas test could be very different when examining sedentary and sporting populations, or individuals who demonstrate specific joint pathology. This point warrants consideration by clinicians evaluating hip joint ROM and ilio-psoas muscle flexibility in specific populations, and in recording day-to-day progress in rehabilitation programs designed to increase function about the hip joint.

While this research project provided invaluable information on the reliability limits of the Thomas test when used in a clinical setting, it is important to acknowledge that participant variation both within and between assessment sessions (i.e., three consecutive Thomas test assessments conducted during each testing session; participant activities the day/week of assessment; the order of testing), as well as the procedures for executing the Thomas test, may have adversely affected the reliability scores for this technique. In order to limit or study the effect of these confounding variables, alternate methodological approaches could be investigated. The Thomas testing procedure could be modified to incorporate a method for standardizing the degree of hip flexion during Thomas testing (potentially with a belt that straps the hips to the table and prevents horizontal and longitudinal movement).
As well, assessment of hip ROM and ilio-psoas flexibility could be done from digital photos or film in order to minimize participant variation both within and between assessment sessions. These changes would serve to further standardize the data collection protocol, and thereby limit the number of variables that may confound rater reliability. The results of such a study would help to clarify the results of the present study and serve as a valuable comparison of the reliability differences between hands-on and secondary assessment of joint ROM and flexibility.

6. Conclusion

The results of this study provide important information to practitioners regarding the limits of reliability for an orthopaedic assessment technique (Thomas test) that is commonly used in a clinical setting. Statistically, the data call into question the reliability of the technique when used to score ROM and ilio-psoas muscle flexibility about the hip joint using both goniometer and pass/fail scoring methods. This means one measure may not be reproduced precisely on a second assessment, or during assessment by another clinician. However, clinically the results serve as a guide for practitioners when evaluating and deciding whether a change observed between testing sessions is "real", or simply a product of measurement error. Beyond this, the methodology employed for this study serves as a template to guide the evaluation or development of other clinically reliable musculoskeletal assessment techniques for the lower extremity. It should also assist in educating practitioners about "evidence-based" application and evaluation of clinical assessment skills used in the orthopaedic and rehabilitative sciences.

References

Anderson, M. K., & Hall, S. J. (1999). Thigh, hip, and pelvis injuries. In D. Balado (Ed.), Sports injury management (2nd ed., pp. 319–358). Philadelphia: Lippincott Williams & Wilkins.
Atkinson, G., & Nevill, A. M. (1998). Statistical methods for assessing measurement error (reliability) in variables relevant to sports medicine. Sports Medicine, 26, 217–238.
Baecke, J. A. H., Burema, J., & Frijters, J. E. R. (1982). A short questionnaire for the measurement of habitual physical activity in epidemiological studies. American Journal of Clinical Nutrition, 36, 936–942.
Bartlett, M. D., Wolf, L. S., Shurtleff, D. B., & Staheli, L. T. (1985). Hip flexion contractures: A comparison of measurement methods. Archives of Physical Medicine and Rehabilitation, 66, 620–625.
Bedard, M., Martin, N. J., Krueger, P., & Brazil, K. (2000). Assessing reproducibility of data obtained with instruments based on continuous measurements. Experimental Aging Research, 26, 353–365.
Bland, J. M., & Altman, D. G. (1986). Statistical methods for assessing agreement between two methods of clinical measurement. Lancet, 1, 307–310.
Boone, D. C., Azen, S. P., Chun-Mei, L., Spence, C., Baron, C., & Lee, L. (1978). Reliability of goniometric measurements. Physical Therapy, 58, 1355–1360.
Domholt, E. (2000). Physical therapy research: Principles and applications, vol. 23 (2nd ed., pp. 347–393). Philadelphia: W.B. Saunders Company.
France, L., & Nester, C. (2001). Effects of errors in the identification of anatomical landmarks on the accuracy of Q angle values. Clinical Biomechanics, 16, 710–713.
Glard, Y., Launay, F., Viehweger, E., Guillaume, J. M., Jouve, J. L., & Bollini, G. (2005). Hip flexion contracture and lumbar spine lordosis in myelomeningocele. Journal of Pediatric Orthopaedics, 25, 476–478.
Haley, S. M., & Osberg, J. S. (1989). Kappa coefficient calculation using multiple ratings per subject: A special communication. Physical Therapy, 69, 970–974.
Hassard, T. H. (1991). What sample size will I need? In Understanding biostatistics (pp. 167–182). St. Louis: Mosby Year Book.
Holmback, A. M., Porter, M. M., Downham, D., & Lexell, J. (1999). Reliability of isokinetic ankle dorsiflexor strength measurements in healthy young men and women. Scandinavian Journal of Rehabilitation Medicine, 31, 229–239.
Holmback, A. M., Porter, M. M., Downham, D., & Lexell, J. (2001). Ankle dorsiflexor muscle performance in healthy young men and women: Reliability of eccentric peak torque and work measurements. Journal of Rehabilitation Medicine, 33, 90–96.
Kendall, F. P., McCreary, E. K., Provance, P. G., Rodgers, M. M., & Romani, W. A. (2005). Lower extremity. In Muscles: Testing and function with posture and pain (5th ed., pp. 359–464). Baltimore, Maryland: Lippincott Williams & Wilkins.
Lee, L. W., Kerrigan, D. C., & Croce, U. D. (1997). Dynamic implications of hip flexion contractures. American Journal of Physical Medicine and Rehabilitation, 76, 502–508.
Low, J. L. (1976). The reliability of joint measurements. Physiotherapy, 62, 227–229.
Magee, D. J. (2002). Orthopedic physical assessment, vol. 11 (4th ed., pp. 1–66, 607–660). Philadelphia, Pennsylvania: W.B. Saunders Company.
Portney, L. G., & Watkins, M. P. (2000). Foundations of clinical research: Applications to practice, vol. 5 (2nd ed., pp. 61–77, 557–586). Upper Saddle River, NJ: Prentice Hall Health.
Prentice, W. E. (2003). The thigh, hip, groin, and pelvis. In Arnheim's principles of athletic training: A competency-based approach (11th ed., pp. 625–667). New York: McGraw-Hill.
Reid, D. C. (1992). Problems of the hip, pelvis, and sacroiliac joint. In Sports injury assessment and rehabilitation (2nd ed., pp. 601–670). Philadelphia: Churchill Livingstone.
Richardson, J. K., & Iglarsh, Z. A. (1994). Hip. In Clinical orthopaedic physical therapy (pp. 333–398). Philadelphia: W. B. Saunders Company.
Ross, M. D., Nordeen, M. H., & Barido, M. (2003). Test–retest reliability of Patrick's hip range of motion test in healthy college-aged men. Journal of Strength & Conditioning Research, 17, 156–161.
Rothstein, J. M., Miller, P.
J., & Roettger, R. F. (1983). Goniometric reliability in a clinical setting: Elbow and knee measurements. Physical Therapy, 63, 1611–1615. Schache, A. G., Blanch, P. D., & Murphy, A. T. (2000). Relation of anterior pelvic tilt during running to clinical and kinematic measures of hip extension. British Journal of Sports Medicine, 34, 279–283. Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86, 420–428. Somers, D. L., Hanson, J. A., Kedzierski, C. M., Nestor, K. L., & Quinlivan, K. Y. (1997). The influence of experience on the reliability of goniometric and visual measurement of the forefoot position. Journal of Orthopaedic & Sports Physical Therapy, 25, 192–202. Staheli, L. T. (1977). The prone hip extension test: A method of measuring hip flexion deformity. Clinical Orthopaedics, 12–15. Thurston, A. (2006). Assessment of fixed flexion deformity of the hip. Clinical Orthopaedics and Related Research, 186–189. Tyler, T., Zook, L., Brittis, D., & Gleim, G. (1996). A new pelvic tilt detection device: Roentgenographic validation and application to assessment of hip motion in professional ice hockey players. Journal of Orthopaedic & Sports Physical Therapy, 24, 303–308. Vela, L., Tourville, T. W., & Hertel, J. (2003). Physical examination of acutely injured ankles: An evidence based approach. Athletic Therapy Today, 8, 13–19. Weir, J. P. (2005). Quantifying test–retest reliability using the intraclass correlation coefficient and the SEM. Journal of Strength and Conditioning Research, 19, 231–240.
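To illustrate how a standard error of measurement (SEM) can inform the judgement of whether a change between testing sessions is real, the following minimal Python sketch applies the minimal difference approach described by Weir (2005), in which SEM = SD × √(1 − ICC) and MD = SEM × 1.96 × √2. It is an illustration only and is not the calculation reported in this study; the function names and all example values (an SEM of 5°, and sessions scored at 10°, 20° and 25°) are hypothetical.

```python
import math


def sem_from_icc(sd: float, icc: float) -> float:
    """Standard error of measurement from the score SD and an ICC.

    SEM = SD * sqrt(1 - ICC), as described by Weir (2005).
    """
    return sd * math.sqrt(1.0 - icc)


def minimal_difference(sem: float, z: float = 1.96) -> float:
    """Smallest between-session change that exceeds measurement error.

    MD = SEM * z * sqrt(2); z = 1.96 corresponds to a 95% confidence level.
    """
    return sem * z * math.sqrt(2.0)


def is_real_change(first: float, second: float, sem: float, z: float = 1.96) -> bool:
    """Return True when the absolute change is larger than the minimal difference."""
    return abs(second - first) > minimal_difference(sem, z)


if __name__ == "__main__":
    # Hypothetical values for illustration only (degrees of hip ROM).
    sem = sem_from_icc(sd=10.0, icc=0.75)
    print(f"SEM = {sem:.1f}, MD = {minimal_difference(sem):.1f}")
    print("10 -> 20 real change?", is_real_change(10.0, 20.0, sem))
    print("10 -> 25 real change?", is_real_change(10.0, 25.0, sem))
```

Under these assumed values, a change between sessions would need to exceed the minimal difference before a clinician could treat it as more than measurement error; a practitioner would substitute the SEM appropriate to the examiner and population being assessed.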