ARTICLE IN PRESS
Physical Therapy in Sport
8 (2007) 14–21
www.elsevier.com/locate/yptsp
Original research
Reliability of the Thomas test for assessing range of motion about the
hip
J. Peeler a,*, J.E. Anderson b
a Department of Kinesiology and Applied Health, University of Winnipeg, 515 Portage Ave., Winnipeg, Manitoba, Canada R3B 2E9
b Department of Human Anatomy and Cell Science, Faculty of Medicine, University of Manitoba, Winnipeg, Manitoba, Canada
Received 11 May 2006; received in revised form 22 August 2006; accepted 26 September 2006
Abstract
Objectives: Rehabilitative protocols and research are significantly influenced by the ability to perform reliable measures of specific physical
attributes or functions. The hypothesis was that the Thomas test for evaluating range of motion about the hip joint is a reliable clinical
assessment tool.
Subjects: Participants (n = 54) were between the ages of 18 and 45, and had no history of trauma.
Methods: Three Board-Certified Athletic Therapists assessed hip range of motion using pass/fail and goniometer scoring systems. A re-test
session was completed seven to ten days later.
Results: Statistically, Kappa values for pass/fail scoring (intra-rater R = 0.47, inter-rater R = 0.39) and ICC values (intra-rater R = 0.52, inter-rater R = 0.60) for goniometer data both indicated that the Thomas test demonstrated poor intra- and inter-rater reliability. However,
measurement error values (SEM = 1°, ME = 2°, and CV = 15%) and Bland and Altman plots demonstrated that there was only a small
degree of intra-rater variance for each examiner when executing the Thomas test in a clinical setting.
Conclusions: Results call into question
the statistical reliability of the Thomas test, but provide clinicians with important information regarding the reliability limits of the Thomas
test when used to clinically evaluate hip range of motion and ilio-psoas muscle flexibility in a physically active population. More research
is required in order to determine the variables that may confound statistical reliability of this orthopaedic technique that is commonly used
in a clinical setting to assess hip function.
1466-853X/$ - see front matter © 2006 Elsevier Ltd. All rights reserved.
doi:10.1016/j.ptsp.2006.09.023
Keywords: Special tests; Orthopaedic evaluation; Flexibility measurement
Corresponding author. Tel.: +1 204 789 1408; fax: +1 204 783 7866. E-mail address: j.peeler@uwinnipeg.ca (J. Peeler).
1. Introduction
In orthopaedics, effective treatment and useful research
are both dependent on the extent to which clinicians can
perform reliable and accurate measures of a specific
physical attribute or function (Portney & Watkins, 2000;
Weir, 2005). Unreliable or inaccurate assessment
confounds the use of a hypothesis-driven research model,
compromises the clinician’s ability to make informed
decisions regarding treatment progression, and therefore
complicates the effective prescription
of treatment protocols (Atkinson & Nevill, 1998; Portney &
Watkins, 2000). Reliability, or consistency, refers to the
extent that a measurement is reproducible and free of error,
and assesses whether measurements are repeatable when all
conditions are thought to be held constant (Bedard, Martin,
Krueger, & Brazil, 2000; Portney & Watkins, 2000). A
reliable examiner will be able to make repeated assessments
as evidenced by consistent scoring. To establish rater
reliability, the instrument and response variables are
considered stable, with any observed differences between
scores being attributed to rater error. Examiner reliability can
be conceptualized as either intra-rater (or within-examiner) or
inter-rater (or between-examiners) reliability. Intra-rater
reliability refers to the reproducibility of the measurements by
the same examiner (i.e. the consistency with which one
patient or subject is assessed by the same examiner over
multiple examinations), and inter-rater reliability refers to the
reproducibility of measurements taken by different examiners
(i.e. the consistency with which one patient or subject is
assessed by multiple examiners) (Bedard et al., 2000; Portney &
Watkins, 2000; Vela, Tourville, & Hertel, 2003; Weir, 2005).
Goniometric assessment is a routine procedure used by
clinicians to evaluate joint range of motion (ROM). It allows the
quantification of movement using linear (inches or cm) or
angular units (degrees of an arc). Because of its widespread use
in an orthopaedic setting, the reliability of goniometric
assessment (continuous data) has been rigorously investigated
(Boone, Azen, Chun-Mei, Spence, Baron, & Lee, 1978; Low,
1976; Rothstein, Miller, & Roettger, 1983; Somers, Hanson,
Kedzierski, Nestor, & Quinlivan, 1997). Clinicians also utilize
pass/fail (or negative/positive) scoring systems (dichotomous
data) to assess ROM about a particular joint. These orthopaedic
tests (sometimes referred to as ‘‘special tests’’) help to determine
whether a particular type of dysfunction or injury may be present
(Magee, 2002). Detailed procedures and established benchmarks
are used to assess motion as either a passing score (when a range
of motion meets or exceeds a specified angle) or a failure (when
a range of motion fails to meet the specified angle). Previous
research has shown that many special tests lack sensitivity, in that
a failing score is suggestive of dysfunction, while a passing score
does not necessarily rule out or exclude dysfunction (Ross,
Nordeen, & Barido, 2003).
In clinical orthopaedics, the Thomas test is commonly used by
clinicians to assess ROM about the hip joint. The face-validity of
this assessment technique is confirmed by its inclusion in a
number of prominent textbooks on orthopaedic physical
assessment (Anderson & Hall, 1999; Kendall, McCreary,
Provance, Rodgers, & Romani, 2005; Magee, 2002; Prentice,
2003; Reid, 1992; Richardson & Iglarsh, 1994), and its use as a
measurement tool in research examining ilio-psoas muscle
flexibility and ROM about the hip joint. Unfortunately, most of
these studies are specific to one population (i.e., in one type of
athlete, sport, or disease) or data were collected using scoring
criteria in the absence of confirming reliability (Bartlett, Wolf,
Shurtleff, & Staheli, 1985; Glard, Launay, Viehweger,
Guillaume, Jouve, & Bollini, 2005; Lee, Kerrigan, & Croce,
1997; Schache, Blanch, & Murphy, 2000; Staheli, 1977;
Thurston, 2006; Tyler, Zook, Brittis, & Gleim, 1996). As a result,
there is little normative data available on hip ROM and ilio-psoas
muscle flexibility in the general population, and a reliability
measurement for the Thomas test is not reported.
2. Hypotheses and specific aims
The purpose of this investigation was to test the hypothesis
that the Thomas test provides reliable assessment of hip range
of motion. Specifically, the study had the following aims: (1)
to investigate the intra-rater reliability of the Thomas test; (2)
to investigate the inter-rater reliability of the Thomas test; and
(3) to compare the reliability of goniometer (continuous data)
and pass/fail (dichotomous data) scoring for the Thomas test.
3. Materials and methods
3.1. Data collection protocol
Little statistical information was available regarding the
normal level of variance of the Thomas test when used to
clinically assess range of motion and flexibility about the
hip joint. A power analysis estimation revealed that a
sample size of approximately 140 limbs would provide a
90% confidence level when analyzing data at a p < 0.05 level
of significance (Hassard, 1991). Following approval by the
Research Ethics Board at the University of Manitoba,
healthy, physically active subjects between 18
and 45 years of age with no history of surgery or trauma to
the hip, knee, or lower leg region were recruited for the
study. All subjects took part in an initial intake session prior
to beginning the study. During this session, subjects were
assigned an identification (I.D.) number, and asked to
complete informed consent and participant information
forms. They also completed a physical activity
questionnaire to provide baseline information regarding
their habitual activity patterns at work, leisure and play
(Baecke, Burema, & Frijters, 1982); these data were used
to confirm the sample was representative of a normal
physically active population. Baseline anthropometric data
such as height (m), weight (kg), and femoral length
(defined as the distance (cm) measured from the head of the
fibula, along the length of the femur through the greater
trochanter, to the table top with the knee flexed to 90°) were
also measured and recorded by a clinician with more than
13 years of clinical assessment experience in orthopaedics.
Subjects were instructed to refrain from starting new
activities or exercise regimes during the course of the study.
At the conclusion of the intake session, subjects were
scheduled for two separate sessions for assessment of hip
range of motion, which occurred 7–10 days apart.
Three experienced examiners were recruited from the
community to participate in the study. All were Board-Certified Athletic Therapists who possessed a minimum of
6 years of clinical assessment and treatment experience in
musculoskeletal disorders, and who routinely use the
Thomas test to evaluate hip function. Prior to the start of
the study, each examiner attended two 1-hr instructional
workshops in order to become familiar with the testing
protocol, and to reinforce the criteria defining pass/fail
scoring on the Thomas test, and the standardized
procedures for collecting goniometric data.
Assessments took place in the Rehabilitation Exercise
Laboratory located in the School of Medical Rehabilitation
at the University of Manitoba. Subjects were free to
schedule their assessments (both test and retest sessions)
over the lunch hour, or during early evening time slots. All
testing was conducted in a standardized testing
environment (i.e., consistent room temperature (20 °C),
lighting, privacy, and plinth type), with no type of ‘‘warm up’’
exercise being completed prior to the initiation of testing.
Subjects were instructed to wear shorts and T-shirts for all
assessments, and to refrain from exercise a minimum of four
hours prior to testing sessions. Assessment by each examiner
took approximately 5 min to complete; a maximum of four
subjects were tested per half hour. Subjects underwent
independent assessment by each of the three examiners in a
random order.
Examiners assessed bilateral hip range of motion of each
participant using the Thomas Test (Fig. 1). Examiners
determined pass/fail scoring according to the protocol outlined
in Magee’s Orthopedic Physical Assessment textbook (Magee,
2002). The participant was positioned supine on the
examination table, and the examiner passively flexed one hip
(to a minimum of 90° of hip flexion), bringing the knee up to
the chest in order to flatten the lumbar spine and stabilize the
pelvis. During this maneuver, care was taken not to excessively
flex the hip to prevent the pelvis or lumbar spine from moving
out of a neutral posture. The subject was instructed to hold the
hip flexed against the chest. The test was scored as a pass if the
opposite hip and knee remained stationary and positioned flat
against the examination table. The test was scored as a fail if
the opposite hip flexed, and the knee lifted off the examination
table (Magee, 2002).
Joint range of motion was quantified using an 18-inch
flexible and adjustable plastic goniometer (Baseline™,
Diagnostic and Measuring Instruments) that is commonly
employed by health care practitioners working in a clinical
setting (Rothstein et al., 1983). Goniometer measurements
were made from the same joint angle that was scored by the
examiner as either a pass or fail, and were carried out using
visibly identifiable anatomical landmarks, thus avoiding
procedures that would require examiners to estimate the exact
centre of rotation about which the hip or knee joints move.
Pilot testing demonstrated that the greater trochanter of the
femur (hip) and the head of the fibula (knee) were the most
readily identifiable landmarks of the region. The easy
availability of these points facilitated efficient, accurate, and
reliable surface landmarking, and helped to minimize the
confounding effect of inconsistent surface landmarking on the
part of examiners about the hip and knee joints (France &
Nester, 2001).
Fig. 1. Thomas Test: visual representation of pass/fail scoring. (a) Pass
score: participant’s test leg remains on the plinth when the opposite hip is
flexed to the chest; (b) Fail score: participant’s test leg will rise off the plinth
when the opposite hip is flexed to the chest; (c) Visual representation of
goniometer scoring. An adhesive marker was placed over the head of the
fibula, and examiners measured the distance between the plinth and the head
of fibula (#2). This distance, along with the participant’s femoral length (#1),
was used in a trigonometric equation to calculate the hip flexion angle (HFA)
in degrees for the test leg (Reproduced and adapted from David J. Magee’s
Orthopedic Physical Assessment, 4th edition, p. 631).
Pilot testing also highlighted several difficulties in utilizing
a goniometer to quantify the degree of hip flexion during the
execution of the Thomas test. These problems included: (1) no
identifiable superior landmark above the hip joint about which
the arm of the goniometer could be aligned; (2) difficulty in
aligning the axis of rotation of the goniometer with the center of
motion for the hip joint; and (3) difficulty maintaining the
inferior arm of the goniometer in alignment with the long axis of
the limb. In an effort to minimize the confounding effect that
these variables could have on the measurement of hip flexion, the
degree of hip flexion was calculated using a trigonometric
equation that was based on the previously measured femoral
length and one single measurement about the knee that was made
by each examiner. Prior to assessment, examiners placed an
adhesive marker over the head of the fibula. During execution of
the Thomas test (Fig. 1), examiners measured the perpendicular
distance (PD) (cm) between the surface of the examination table
and the inferior boundary of the adhesive marker over top of the
head of the fibula. This value, and the corresponding
measurement of femoral length (FL) (as previously defined) for
the test leg were later entered into the trigonometric equation to
calculate the angle of hip flexion (HFA) for the test leg. This
equation was defined as $\mathrm{HFA} = \sin^{-1}(\mathrm{PD}/\mathrm{FL})$, according to
standard trigonometry definitions.
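The trigonometric step described above can be illustrated with a minimal Python sketch. The function name and the sample measurements are assumptions introduced here for illustration; they are not part of the study protocol or its data. The sketch takes the perpendicular distance (PD) and femoral length (FL), both in cm, and returns the hip flexion angle in degrees.

```python
import math


def hip_flexion_angle(pd_cm: float, fl_cm: float) -> float:
    """Return the hip flexion angle (degrees) as arcsin(PD / FL).

    pd_cm: perpendicular distance from the table to the fibular-head marker.
    fl_cm: femoral length of the test leg.
    """
    if fl_cm <= 0:
        raise ValueError("femoral length must be positive")
    ratio = pd_cm / fl_cm
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("PD must lie between 0 and FL")
    return math.degrees(math.asin(ratio))


# Hypothetical measurements (not taken from the study data).
angle = hip_flexion_angle(pd_cm=5.0, fl_cm=42.0)
print(round(angle))  # angle to the nearest degree
```

Because the angle depends only on the ratio PD/FL, a small error in either length measurement carries directly into the computed angle, which is one reason consistent surface landmarking matters for this approach.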
Thomas test scores (goniometer measurement to the nearest
degree and a pass/fail score) were recorded for each subject by
each examiner on a standardized data collection sheet for each
test session. Examiners were blinded as to their scoring from the
first test session, and to the scoring by other examiners. At the
end of each test session, data sheets were collected and collated
according to subject I.D. numbers.
3.2. Data analysis
Data were entered in a Microsoft Excel spreadsheet.
Descriptive statistics (mean ± SD) organized by gender were
generated for age, body weight and height measurements,
calculated body mass index (BMI), physical activity levels
(scored out of a total of 15), and hip joint range of motion. Intraclass correlation coefficients (ICC) were calculated in order to
evaluate the intra and inter-rater reliability of goniometer
scoring. An ICC (3, 1) model was used to evaluate the intra-rater
reliability. ICC values were calculated using a two-way ANOVA
and the equation: $\mathrm{ICC}(3,1) = (\mathrm{BMS} - \mathrm{EMS}) / (\mathrm{BMS} + (K - 1)\,\mathrm{EMS})$, where
BMS is the between-subjects mean square, EMS is the error mean
square, and K is the number of raters (Domholt, 2000; Holmback,
Porter, Downham, & Lexell, 1999, 2001; Portney & Watkins,
2000). An ICC (2, 1) model was used to evaluate the inter-rater
reliability. ICC values were again calculated using a two-way
ANOVA and the equation: $\mathrm{ICC}(2,1) = (\mathrm{BMS} - \mathrm{EMS}) / (\mathrm{BMS} + (K - 1)\,\mathrm{EMS} + K(\mathrm{RMS} - \mathrm{EMS})/n)$,
where BMS is the between-subjects
mean square, EMS is the error mean square, RMS is the between-raters mean square, K is the number of raters, and n is the number
of subjects tested (Domholt, 2000; Holmback et al., 1999, 2001;
Portney & Watkins, 2000). Intra and inter-rater reliability of
pass/fail scoring was measured using a Kappa statistic. This
statistic uses a simple index of agreement, called percent
agreement, to measure how often raters agree on scoring for
each individual subject. The advantage of the Kappa statistic
is that it examines the proportion of observed agreement, and
also considers the proportion of agreement that might be
expected by chance. Therefore, the coefficient of agreement
(proportion of observations on which there is agreement
divided by the number of pairs of scores that were obtained)
produced by the Kappa test is corrected for chance (number
of expected agreements divided by number of possible
agreements). This calculation provides a reasonable estimate
of the reliability of dichotomous pass/fail data (Haley &
Osberg, 1989; Portney & Watkins, 2000). As cited by several
clinical research publications, ICC and Kappa values above
0.75 should be considered representative of high levels of
reliability, while values between 0.4 and 0.75 are indicative
of a fair to moderate level of reliability. ICC values below 0.4
should be considered representative of a poor level of
reliability (Atkinson & Nevill, 1998; Domholt, 2000;
Holmback et al., 1999; Portney & Watkins, 2000; Shrout &
Fleiss, 1979). Three forms of measurement error statistics
(standard error of the measurement (SEM), method error
(ME), and coefficient of variation (CV)) were used to
examine the within-subject variation between testing
sessions. The standard error of the measurement was defined
by $\mathrm{SEM} = \mathrm{SD}_1 (1 - \mathrm{ICC})^{0.5}$, where $\mathrm{SD}_1$ is the standard deviation
of all measurements, and the ICC value is derived from intra-rater analysis. Method error was defined as $\mathrm{ME} = \mathrm{SD}_2 / \sqrt{2}$,
where $\mathrm{SD}_2$ is the standard deviation of the differences
between the 2 measurements. The coefficient of variation was
defined as $\mathrm{CV} = 100 \times \mathrm{ME} / X_1$, where $X_1$ is the mean for all
observations from test sessions 1 and 2 (Holmback et al.,
1999). Finally, Bland and Altman graphs provided a visual
representation of the variation between scores by each of the
examiners, and were used to study any systematic bias
between testing sessions (Bland & Altman, 1986).
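As a rough illustration of the relative reliability statistics described in this section, the following Python sketch computes the two-way ANOVA mean squares, the ICC (3, 1) and ICC (2, 1) coefficients, and a chance-corrected Kappa for two sets of pass/fail scores. All function names and the small data sets are assumptions made for illustration only; they are not the study data, and the sketch is not the authors' spreadsheet implementation.

```python
def two_way_mean_squares(scores):
    """scores[i][j] is the score of subject i from rater (or session) j."""
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]
    ss_subjects = k * sum((m - grand) ** 2 for m in row_means)
    ss_raters = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_subjects - ss_raters
    bms = ss_subjects / (n - 1)            # between-subjects mean square
    rms = ss_raters / (k - 1)              # between-raters mean square
    ems = ss_error / ((n - 1) * (k - 1))   # error mean square
    return bms, rms, ems, n, k


def icc_3_1(scores):
    bms, _, ems, _, k = two_way_mean_squares(scores)
    return (bms - ems) / (bms + (k - 1) * ems)


def icc_2_1(scores):
    bms, rms, ems, n, k = two_way_mean_squares(scores)
    return (bms - ems) / (bms + (k - 1) * ems + k * (rms - ems) / n)


def kappa(first, second):
    """Chance-corrected agreement between two lists of categorical scores."""
    n = len(first)
    observed = sum(a == b for a, b in zip(first, second)) / n
    labels = set(first) | set(second)
    expected = sum(first.count(c) / n * (second.count(c) / n) for c in labels)
    if expected == 1:
        return 1.0  # both lists use a single identical category
    return (observed - expected) / (1 - expected)


# Illustrative ratings: 6 subjects scored by 4 raters (hypothetical values).
ratings = [[9, 2, 5, 8], [6, 1, 3, 2], [8, 4, 6, 8],
           [7, 1, 2, 6], [10, 5, 6, 9], [6, 2, 4, 7]]
print(round(icc_3_1(ratings), 2), round(icc_2_1(ratings), 2))

# Illustrative pass ("P") / fail ("F") scores from two sessions.
session_1 = ["P", "P", "P", "P", "F", "F", "F", "F", "P", "F"]
session_2 = ["P", "P", "P", "F", "F", "F", "F", "P", "P", "F"]
print(round(kappa(session_1, session_2), 2))
```

In this illustrative matrix the raters differ systematically, so ICC (2, 1), which counts between-rater variance as error, comes out well below ICC (3, 1).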
4. Results
Descriptive statistics for study participants are presented
in Table 1. Participants had a mean age of 29 years (males:
29 ± 7.0; females: 28 ± 7.4), and were representative of a
population that is young, healthy, and physically active in a
wide variety of leisure and
sporting opportunities. Fifty-seven (57) subjects
volunteered to participate in the study over a 6-month
period. Fifty-four (54) subjects completed both testing
sessions. For analysis, the flexibility measurements from
108 limbs were used to investigate intra-rater reliability,
while 222 flexibility measurements were available to
examine inter-rater reliability.

Table 1
Participant anthropometric data (mean ± standard deviation) for the
present study

                                 Male (n = 19)   Female (n = 38)   Total (n = 57)
Age (years)                      29 ± 7.0        28 ± 7.4          29 ± 7.3
Weight (kg)                      80 ± 10.0       64 ± 7.5**        69 ± 11.0
Height (m)                       1.77 ± 0.07     1.64 ± 0.07**     1.68 ± 0.09
Body mass index                  25.3 ± 3.0      23.9 ± 2.9        24.3 ± 3.0
Physical activity levels (/15)   8.4 ± 1.6       8.3 ± 1.1         8.5 ± 1.3

**p < 0.01.

Descriptive statistics for hip range of motion are
presented in Table 2. The mean hip joint range of motion
for all participants was 7° ± 2°. On average, there was no
gender difference when comparing the ROM about the hip
joint during Thomas testing.

Table 2
Goniometer scoring of hip joint range of motion

                 Male (34 limbs)    Female (74 limbs)   All (108 limbs)
                 Test     Retest    Test     Retest     Test     Retest
Examiner 1       7 ± 2    7 ± 2     7 ± 2    7 ± 2      7 ± 2    7 ± 2
Examiner 2       6 ± 1    6 ± 2     7 ± 1    7 ± 1      7 ± 1    6 ± 1
Examiner 3       7 ± 2    7 ± 2     7 ± 2    7 ± 2      7 ± 2    7 ± 2
Group average    7 ± 2              7 ± 1               7 ± 2

Hip joint range of motion data (mean ± standard deviation) for Thomas
Testing obtained through goniometer measurements (in degrees) for the
present study. The group average is representative of the mean score of all
examiners across both assessments (test and retest).

Intra-class correlation coefficients and a chance
corrected Kappa statistic were used to evaluate the relative
reliability of intra and inter-rater scoring for the Thomas
test. Intra-rater results are presented in Table 3. Pass/fail
corrected Kappa values ranged from a low of 0.33 to a high
of 0.72 among the three examiners. Goniometer ICC values
ranged from a low of 0.43 to a high of 0.59 among the three
examiners. The intra-rater results demonstrated that on
average, the goniometer method of scoring was slightly
more consistent than the pass/fail method of scoring. As
well, intra-rater results revealed that examiner #1 was
generally the most reliable in scoring hip ROM during the
test–retest protocol, independent of scoring method.

Table 3
Thomas test chance corrected Kappa Statistics for pass/fail scoring, and ICC
(model 3, 1) values for goniometer scoring during the present study

Intra-rater            Pass/Fail   Goniometer   Mean
Examiner 1 (n = 108)   0.72        0.59         0.66
Examiner 2 (n = 108)   0.37        0.43         0.40
Examiner 3 (n = 108)   0.33        0.53         0.43
Mean                   0.47        0.52

Inter-rater results are presented in Table 4. The
goniometer ICC values were, on average, higher than the
pass/fail corrected Kappa values. However, analysis of
between-examiner scores using a two-way ANOVA revealed
significant variation (p < 0.01) in goniometer scoring among
the three examiners.

Table 4
Thomas test chance corrected Kappa Statistics for pass/fail scoring, and ICC
(model 2, 1) values for goniometer scoring during the present study

Inter-rater              Examiner 1   Examiner 2   Mean
Pass/Fail
  Examiner 2 (n = 222)   0.31         -            0.39
  Examiner 3 (n = 222)   0.47         0.38
Goniometer
  Examiner 2 (n = 222)   0.60**       -            0.60
  Examiner 3 (n = 222)   0.71         0.50**

**p < 0.01.

For the present study, measurement error was analyzed
using the goniometer data from the three examiners. The
standard error of the measurement for the Thomas test was
1°; the method error was 2°; and the coefficient of variation
among examiners was 15 percent. In general, measurement
error values were small, and were representative of a tight
distribution for test–retest scoring.
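The three measurement error statistics defined in the data analysis section can be sketched in Python as follows. The function name and the test and retest scores are hypothetical, and the sketch assumes sample standard deviations (n - 1 denominator), which the text does not specify; the ICC value passed in is the mean intra-rater goniometer value reported for this study.

```python
import math
import statistics


def measurement_error(test, retest, icc):
    """Return (SEM, ME, CV%) for paired test and retest scores."""
    all_scores = list(test) + list(retest)
    sd_all = statistics.stdev(all_scores)       # SD of all measurements
    diffs = [a - b for a, b in zip(test, retest)]
    sd_diff = statistics.stdev(diffs)           # SD of the paired differences
    mean_all = statistics.fmean(all_scores)     # mean of both sessions
    sem = sd_all * math.sqrt(1 - icc)
    me = sd_diff / math.sqrt(2)
    cv = 100 * me / mean_all
    return sem, me, cv


# Hypothetical hip flexion angles in degrees (not the study data).
test_scores = [6, 8, 7, 5, 9]
retest_scores = [7, 7, 8, 6, 8]
sem, me, cv = measurement_error(test_scores, retest_scores, icc=0.52)
print(f"SEM = {sem:.2f} deg, ME = {me:.2f} deg, CV = {cv:.1f}%")
```

Because SEM scales with the spread of all scores while ME depends only on the paired differences, the two can diverge when subjects differ widely from one another but each is scored consistently.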
Fig. 2. Bland & Altman graphs provide visual confirmation that there was no intra-rater systematic bias between testing sessions #1 and #2 for:
(a) Examiner #1, (b) Examiner #2, and (c) Examiner #3.
Bland and Altman graphs illustrate the consistency of each
examiner’s scoring over the two test sessions, as well as the
variability of scoring between the three examiners. In Fig.
2(a)–(c), the mean assessment score for the two test sessions
of each participant (x-axis) was plotted against the difference
between the two test scores for the same participant (y-axis).
The mean difference scores (test session #1 score minus test
session #2 score) are equally distributed about zero,
indicating that there was good test–retest scoring consistency,
and that examiners were unbiased in scoring over the two
testing sessions (i.e. a higher or lower score was just as likely
to occur in test session #1 as test session #2). The combined
Bland and Altman graph (Fig. 3) also provides a visual
representation of the significant variation (p < 0.01) of
between-examiner scoring for participants. The x-axis depicts
a large range in measurements, with a number of outlying data
points for each examiner, and is indicative of systematic
examiner-dependent use of the Thomas test.
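The quantities plotted in a Bland and Altman graph can be computed with a short Python sketch. The function name and scores are hypothetical. The 95% limits of agreement (mean difference ± 1.96 SD of the differences) follow the general Bland and Altman approach and are included as an assumption; the present study reports the plots but not these limits.

```python
import statistics


def bland_altman(session_1, session_2):
    """Return per-subject means and differences, the bias, and 95% limits."""
    means = [(a + b) / 2 for a, b in zip(session_1, session_2)]  # x-axis
    diffs = [a - b for a, b in zip(session_1, session_2)]        # y-axis
    bias = statistics.fmean(diffs)        # systematic difference between sessions
    sd = statistics.stdev(diffs)
    limits = (bias - 1.96 * sd, bias + 1.96 * sd)
    return means, diffs, bias, limits


# Hypothetical scores from one examiner over two sessions (degrees).
means, diffs, bias, limits = bland_altman([6, 8, 7, 5, 9], [7, 7, 8, 6, 8])
print(f"bias = {bias:.2f}, limits = ({limits[0]:.2f}, {limits[1]:.2f})")
```

A bias close to zero corresponds to difference scores that are evenly distributed about zero, which is the pattern described for each examiner above.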
5. Discussion
This study was conducted to examine the reliability of an
orthopaedic assessment technique that is commonly used in
the clinic to assess hip range of motion and iliopsoas muscle
tightness about the hip joint. To our knowledge, the reliability
of this ‘‘special test’’ has not been previously reported within
the scientific literature. The results call into question the
statistical reliability of the Thomas test during both
goniometer and pass/fail scoring. However, results provide
useful information to practitioners regarding the limits of
reliability for this technique when used clinically by
individual examiners to assess whether hip ROM and iliopsoas muscle flexibility have changed, for example due to an
intervention or pathology.
Fig. 3. Bland & Altman graphs illustrate there was a large amount of inter-rater variation over the 2 test sessions.
The results show that the Thomas test demonstrated poor
statistical reliability for intra and inter-rater comparisons
among examiners during both goniometer and pass/fail
scoring. It would appear that despite the use of well-defined
methodology and examiner workshops that were designed to
standardize the assessment protocol and define pass/fail
criteria, each of the examiners used slightly different
stringency (i.e., specified ROM) when grading ilio-psoas
flexibility and hip joint range of motion as either a pass or fail.
Beyond this, despite the use of a readily identifiable
anatomical landmark, it would appear that inaccurate or
inconsistent surface landmarking during goniometer
evaluation may have contributed to the large amount of
variation between examiners’ scores for each participant.
Because goniometric assessment evaluated joint ROM to
within one degree, small measurement differences may
have also resulted in an over-emphasis of the variation
between examiners’ scores. Measurement error values for
the goniometer data indicated that there was little variation
in each examiner’s scoring over the two testing sessions
(this is confirmed by the goniometer data presented in
Table 2). The SEM values illustrated that flexibility scores
were tightly distributed, with 95% of the retest scores
falling within 2° of the initial flexibility scores (i.e.,
Thomas test session #1 mean score = 7°, 95% chance that
test session #2 mean score would be between 5° and 9°).
This small amount of within-subject variation between
testing sessions was confirmed by the ME values, which
indicated a small degree of variation between test sessions
for each examiner. Finally, the CV values provided a
universal estimate (or percentage) of the within-subject
variation over the two testing sessions for each of the
flexibility assessment techniques. Because the CV values
are expressed independent of the units of measurement,
they account for differences in the magnitude of the mean
and facilitate easy comparison of the results between
methods (i.e., they provide a measure of relative variation
among different assessment techniques). If the mean scores
for the Thomas test (mean = 7°) are examined in
conjunction with its respective CV value (Thomas = 15%),
a clearer understanding of the relative variation of the
technique is revealed (Thomas: 7° × 0.15 ≈ 1°). While the
Thomas test has a large CV value, when the variation is
expressed relative to the scoring variation observed among
participants, it is apparent that values from the three
measurement error tests are comparable. This information
provides valuable insight into the clinical reliability limits
of the Thomas test, and enables practitioners to make
knowledgeable decisions regarding whether a ‘‘real’’
change has occurred between testing sessions, or whether
the observed change is simply a product of measurement
error.
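The arithmetic behind this interpretation can be made explicit with a small Python sketch. The function names are hypothetical, and the 2° threshold is taken from the retest band described above; treating it as a fixed cut-off for a "real" change is a simplifying assumption for illustration, not a rule stated by the study.

```python
def relative_variation(mean_deg, cv_percent):
    """Express a CV (percent) in the units of the mean score (degrees)."""
    return mean_deg * cv_percent / 100.0


def exceeds_measurement_error(initial_deg, retest_deg, limit_deg=2.0):
    """True if the change between sessions is larger than the error band."""
    return abs(retest_deg - initial_deg) > limit_deg


print(relative_variation(7, 15))            # variation in degrees for a 7 degree mean
print(exceeds_measurement_error(7, 9))      # change equal to the band
print(exceeds_measurement_error(7, 10))     # change larger than the band
```

Under these assumptions, a retest score within 2° of the initial score would be read as measurement error, while a larger difference would suggest that hip ROM had actually changed.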
The results have important implications for clinicians
specializing in orthopaedics. The statistical data indicate
that even experienced examiners who possess advanced
orthopaedic assessment skills, had difficulty attaining a
high level of reliability when assessing hip joint ROM and
ilio-psoas muscle flexibility using either pass/fail or
goniometer scoring methods. This finding has important
implications for the education, application, and evaluation
of clinical orthopaedic skills within the orthopaedic and
rehabilitative science communities. Also, the study sample
was representative of a population that was young, healthy
and physically active. From a clinical standpoint, this type
of patient would be hypothesized to provide the most
accurate and consistent model for examining assessment
techniques because it would limit confounding factors such
as joint pathology, muscle contractures, and elevated BMI.
If this notion is true, then one would predict that reliability
values for the Thomas test could be very different when
examining sedentary and sporting populations, or
individuals who demonstrate specific joint pathology. This
point warrants consideration by clinicians evaluating hip
joint ROM and ilio-psoas muscle flexibility in specific
populations, and in recording day-to-day progress in
rehabilitation programs designed to increase function about
the hip joint.
While this research project provided invaluable
information on the reliability limits of the Thomas test
when used in a clinical setting, it is important to
acknowledge that participant variation both within and
between assessment sessions (i.e., three consecutive
Thomas test assessments conducted during each testing
session; participant activities the day/week of assessment;
the order of testing), as well as the procedures for executing
the Thomas test, may have adversely affected the reliability
scores for this technique. In order to limit or study the effect
of these confounding variables, alternate methodological
approaches could be investigated. The Thomas testing
procedure could be modified to incorporate a method for
standardizing the degree of hip flexion during Thomas
testing (potentially with a belt that straps the hips to the
table and prevents horizontal and longitudinal movement).
As well, assessment of hip ROM and ilio-psoas flexibility
could be done from digital photos or film in order to
minimize participant variation both within and between
assessment sessions. These changes would serve to further
standardize the data collection protocol, and thereby limit
the number of variables that may confound rater reliability.
The results of such a study would help to clarify the results
of the present study and serve as a valuable comparison of
the reliability differences between hands-on and secondary
assessment of joint ROM and flexibility.
6. Conclusion
The results of this study provide important information to
practitioners regarding the limits of reliability for an
orthopaedic assessment technique (Thomas test) that is
commonly used in a clinical setting. Statistically, the data call
into question the reliability of the technique when used to
score ROM and ilio-psoas muscle flexibility about the hip
joint using both goniometer and pass/fail scoring methods.
This means one measure may not be reproduced precisely on
a second assessment, or during assessment by another
clinician. However, clinically the results serve as a guide for
practitioners when evaluating and deciding whether a change
observed between testing sessions is ‘‘real’’, or simply a
product of measurement error. Beyond this, the methodology
employed for this study serves as a template to guide the
evaluation or development of other clinically reliable
musculoskeletal assessment techniques for the lower
extremity. It should also assist in educating practitioners
about ‘‘evidence-based’’ application and evaluation of
clinical assessment skills used in the orthopaedic and
rehabilitative sciences.
References
Anderson, M. K., & Hall, S. J. (1999). Thigh, hip, and pelvis injuries. In D.
Balado (Ed.), Sports injury management (2nd ed., pp. 319–358).
Philadelphia: Lippincott Williams & Wilkins.
Atkinson, G., & Nevill, A. M. (1998). Statistical methods for assessing
measurement error (reliability) in variables relevant to sports medicine.
Sports Medicine, 26, 217–238.
Baecke, J. A. H., Burema, J., & Frijters, J. E. R. (1982). A short questionnaire
for the measurement of habitual physical activity in epidemiological
studies. American Journal of Clinical Nutrition, 36, 936–942.
Bartlett, M. D., Wolf, L. S., Shurtleff, D. B., & Staheli, L. T. (1985). Hip
flexion contractures: A comparison of measurement methods. Archives
of Physical Medicine and Rehabilitation, 66, 620–625.
Bedard, M., Martin, N. J., Krueger, P., & Brazil, K. (2000). Assessing
reproducibility of data obtained with instruments based on continuous
measurements. Experimental Aging Research, 26, 353–365.
Bland, J. M., & Altman, D. G. (1986). Statistical methods for assessing
agreement between two methods of clinical measurement. Lancet, 1,
307–310.
Boone, D. C., Azen, S. P., Chun-Mei, L., Spence, C., Baron, C., & Lee, L.
(1978). Reliability of goniometric measurements. Physical Therapy, 58,
1355–1360.
Domholt, E. (2000). Physical therapy research—Principles and applications,
vol. 23. (2nd ed.). (pp. 347–393). Philadelphia: W.B. Saunders Company.
France, L., & Nester, C. (2001). Effects of errors in the identification of
anatomical landmarks on the accuracy of Q angle values. Clinical
Biomechanics, 16, 710–713.
Glard, Y., Launay, F., Viehweger, E., Guillaume, J. M., Jouve, J. L., &
Bollini, G. (2005). Hip flexion contracture and lumbar spine lordosis in
myelomeningocele. Journal of Pediatric Orthopaedics, 25, 476–478.
Haley, S. M., & Osberg, J. S. (1989). Kappa coefficient calculation using
multiple ratings per subject: a special communication. Physical Therapy,
69, 970–974.
Hassard, T. H. (1991). What sample size will I need? In Understanding
biostatistics (pp. 167–182). St. Louis: Mosby Year Book.
Holmback, A. M., Porter, M. M., Downham, D., & Lexell, J. (1999).
Reliability of isokinetic ankle dorsiflexor strength measurements in
healthy young men and women. Scandinavian Journal of Rehabilitation
Medicine, 31, 229–239.
Holmback, A. M., Porter, M. M., Downham, D., & Lexell, J. (2001). Ankle
dorsiflexor muscle performance in healthy young men and women:
reliability of eccentric peak torque and work measurements. Journal of
Rehabilitation Medicine, 33, 90–96.
Kendall, F. P., McCreary, E. K., Provance, P. G., Rodgers, M. M., & Romani,
W. A. (2005). Lower extremity. In Muscles—testing and function with
posture and pain (5th ed., pp. 359–464). Baltimore, Maryland: Lippincott
Williams & Wilkins.
Lee, L. W., Kerrigan, D. C., & Croce, U. D. (1997). Dynamic implications
of hip flexion contractures. American Journal of Physical Medicine and
Rehabilitation, 76, 502–508.
Low, J. L. (1976). The reliability of joint measurements. Physiotherapy, 62,
227–229.
Magee, D. J. (2002). Orthopedic physical assessment, (4th ed., pp. 1–66,
607–660), vol. 11. W.B. Saunders Company: Philadelphia,
Pennsylvania.
Portney, L. G., & Watkins, M. P. (2000). Foundations of clinical research—
Applications to practice, (vol. 5). (pp. 61–77, 557–586), (2nd ed.).
Upper Saddle River, NJ: Prentice Hall Health.
Prentice, W. E. (2003). The thigh, hip, groin, and pelvis. In Arnheim’s
principles of athletic training: A competency-based approach (11th
ed., pp. 625–667). New York: McGraw-Hill.
Reid, D. C. (1992). Problems of the hip, pelvis, and sacroiliac joint. In
Sports injury assessment and rehabilitation (2nd ed., pp. 601–670).
Philadelphia: Churchill Livingstone.
Richardson, J. K., & Iglarsh, Z. A. (1994). Hip. In Clinical orthopaedic
physical therapy (pp. 333–398). Philadelphia: W. B. Saunders
Company.
Ross, M. D., Nordeen, M. H., & Barido, M. (2003). Test–retest reliability
of Patrick’s hip range of motion test in healthy college-aged men.
Journal of Strength & Conditioning Research, 17, 156–161.
Rothstein, J. M., Miller, P. J., & Roettger, R. F. (1983). Goniometric
reliability in a clinical setting: Elbow and knee measurements.
Physical Therapy, 63, 1611–1615.
Schache, A. G., Blanch, P. D., & Murphy, A. T. (2000). Relation of
anterior pelvic tilt during running to clinical and kinematic measures
of hip extension. British Journal of Sports Medicine, 34, 279–283.
Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing
rater reliability. Psychological Bulletin, 86, 420–428.
Somers, D. L., Hanson, J. A., Kedzierski, C. M., Nestor, K. L., &
Quinlivan, K. Y. (1997). The influence of experience on the reliability
of goniometric and visual measurement of the forefoot position.
Journal of Orthopaedic & Sports Physical Therapy, 25, 192–202.
Staheli, L. T. (1977). The prone hip extension test: A method of measuring
hip flexion deformity. Clinical Orthopaedics, 12–15.
Thurston, A. (2006). Assessment of fixed flexion deformity of the hip.
Clinical Orthopaedics and Related Research, 186–189.
Tyler, T., Zook, L., Brittis, D., & Gleim, G. (1996). A new pelvic tilt
detection device: Roentgenographic validation and application to
assessment of hip motion in professional ice hockey players. Journal
of Orthopaedic & Sports Physical Therapy, 24, 303–308.
Vela, L., Tourville, T. W., & Hertel, J. (2003). Physical examination of
acutely injured ankles: An evidence based approach. Athletic Therapy
Today, 8, 13–19.
Weir, J. P. (2005). Quantifying test–retest reliability using the intraclass
correlation coefficient and the SEM. Journal of Strength and
Conditioning Research, 19, 231–240.