Clinical Rehabilitation 1998; 12: 187–199
Reliability of assessment tools in rehabilitation: an
illustration of appropriate statistical analyses
Gabrielle Rankin and Maria Stokes Royal Hospital for Neuro-disability, London
Objective: To provide a practical guide to appropriate statistical analysis of a
reliability study using real-time ultrasound for measuring muscle size as an
example.
Design: Inter-rater and intra-rater (between-scans and between-days) reliability.
Subjects: Ten normal subjects (five male) aged 22–58 years.
Method: The cross-sectional area (CSA) of the anterior tibial muscle group
was measured using real-time ultrasonography.
Main outcome measures: Intraclass correlation coefficients (ICCs) and the
95% confidence interval (CI) for the ICCs, and the Bland and Altman method for
assessing agreement, which includes calculation of the mean difference
between measures (d̄), the 95% CI for d̄, the standard deviation of the
differences (SDdiff), the 95% limits of agreement and a reliability coefficient.
Results: Inter-rater reliability was high: ICC (3,1) was 0.92 with a 95% CI
of 0.72 → 0.98. There was reasonable agreement between measures on
the Bland and Altman test, as d̄ was –0.63 cm², the 95% CI for d̄ was
–1.4 → 0.14 cm², the SDdiff was 1.08 cm², the 95% limits of agreement were
–2.73 → 1.53 cm² and the reliability coefficient was 2.4. Between-scans
repeatability was high: ICCs (1,1) were 0.94 and 0.93 with 95% CIs of
0.8 → 0.99 and 0.75 → 0.98, for days 1 and 2 respectively. Measures showed
good agreement on the Bland and Altman test: d̄ for day 1 was 0.15 cm² and
for day 2 it was –0.32 cm²; the 95% CIs for d̄ were –0.51 → 0.81 cm² for day
1 and –0.98 → 0.34 cm² for day 2; SDdiff was 0.93 cm² for both days; the 95%
limits of agreement were –1.71 → 2.01 cm² for day 1 and –2.18 → 1.54 cm²
for day 2; the reliability coefficient was 1.80 for day 1 and 1.88 for day 2. The
between-days ICC (1,2) was 0.92 and the 95% CI 0.69 → 0.98. The d̄ was
–0.98 cm², the SDdiff was 1.25 cm² with 95% limits of agreement of
–3.48 → 1.52 cm² and the reliability coefficient 2.8. The 95% CI for d̄
(–1.88 → –0.08 cm²) and the distribution graph showed a bias towards a
larger measurement on day 2.
Conclusions: The ICC and Bland and Altman tests are appropriate for
analysis of reliability studies of similar design to that described, but neither
test alone provides sufficient information and it is recommended that both
are used.
Address for correspondence: Professor MJ Stokes, Research
Department, Royal Hospital for Neuro-disability, West Hill,
Putney, London SW15 3SW, UK.
© Arnold 1998
0269–2155(98)CR180OA
188 G Rankin and M Stokes
Introduction
Reliability studies of assessment tools in rehabilitation are necessary to ensure that the error
involved in measurement is small enough to
detect actual changes in what is being measured.
Such reliability studies are numerous in the literature but a diversity of statistical approaches is
used (see below). This makes comparison
between studies difficult and leaves the
researcher with the problem of deciding which
tests to use.
Different types of reliability data require different tests. Where a device is used to make
objective measurements of, for example, muscle
strength, the data produced are continuous, i.e.
derived from a range of possible values or an
underlying continuum. Other types of data are produced when the assessment involves categorizing the data. Where there is no obvious ordering to the categories (e.g. normal or abnormal; back pain, leg pain or arm pain), the data are termed nominal data. Where the data have a natural order, such as small, medium or large, the results are termed ordinal data.1
The kappa test1,2 is commonly used for examining the reliability of nominal data, and the
appropriate type of test for different situations,
e.g. number of raters, was discussed by Haas.3 A
weighted kappa test is used for ordinal data.2,3
There is less consensus regarding continuous data, which form the topic of the present paper. Several authors have discussed the inappropriateness of using tests such as Pearson’s correlation, t-tests, coefficient of variation (CV), per cent agreement and chi-square.1,3–8
The intraclass correlation (ICC9–11) and Bland
and Altman12 tests are now often used for reliability studies. There is little guidance as to which
ICC equations or Bland and Altman results to
use. Many papers do not explain the context in
which the tests are being used or even specify the
ICC equation used. Bland and Altman tests have been described for method comparison studies; similar methods for repeatability testing have been mentioned, but not in relation to inter-rater reliability testing.12–14
The overriding problem is a lack of consensus
as to which of these tests are most appropriate.
The present paper addresses the problem of
selecting an appropriate ICC equation, and clarifies some inconsistencies in the literature regarding analysis of reliability data.
The type of design used in the present study is
similar to that of many reliability studies in rehabilitation, and therefore serves as a useful example to demonstrate the appropriate statistics for
such studies. The intra-rater reliability (or
test–retest reliability) of measuring muscle size
was examined in one investigator over a period
of training in the ultrasound imaging technique,
as well as inter-rater reliability against a more
experienced operator.
Muscle size measurement provides a direct
assessment of muscle atrophy and hypertrophy,
and a close correlation between muscle cross-sectional area (CSA) and isometric strength has
been demonstrated in some muscle groups (see
ref. 8 for a review). Measurement of muscle size
is therefore useful for assessing and monitoring
changes in muscle with disease and injury, as well
as treatment outcome.
The present study investigated inter-rater reliability, and the between-scans and between-days
repeatability of CSA measurements of the anterior tibial muscle group using real-time ultrasound imaging. Appropriate statistical analyses
for a repeated measures design and inter-rater
reliability are demonstrated. The advantages of
using ICC equations and Bland and Altman tests
are discussed.
Method
Real-time ultrasound images were taken of the
anterior tibial muscle group, as described previously,15 and only brief details are given here.
Subjects
Ten normal subjects (five males) aged 22–58
years were studied. Exclusion criteria regarding history of disease or injury were as described previously.15 Written, informed consent
was obtained from each subject, and the study
was approved by Riverside Research Ethics
Committee.
Investigators
The first investigator (GR) is a research physiotherapist with recent training in ultrasound
imaging of muscle. The second (MS) is experienced in musculoskeletal ultrasonography and
also a research physiotherapist. Both therefore
had the necessary knowledge of musculoskeletal
anatomy.
Procedure
The basic principles of ultrasound imaging and
instrumentation were discussed in detail elsewhere.8 An ALOKA SSD 1200 ultrasound scanner and a 5 MHz linear array transducer were
used. The anterior tibial muscle group of the
dominant leg (preferred leg for kicking) was
scanned. The muscles in the anterior tibial group
include tibialis anterior, extensor hallucis longus,
extensor digitorum longus and the peronei.
The subject lay supine with the hip in neutral,
the knee extended and the ankle resting at 90° in
a padded metal splint. The level of scanning was
20% of the distance from the head of fibula to
the tip of the lateral malleolus. The scanning
level was traced onto a transparent sheet,
together with bony landmarks and any permanent skin blemishes, such as freckles and scars,
so that the scanning site could be relocated accurately on the second day of scanning.
A stand-off gel pad (Kitecko – 14 × 11 × 4 cm)
was used to provide good contact with body contours, increase the field of view and prevent skin
compression.16 Ultrasonic coupling gel was
applied to the skin area to be scanned, the transducer and to the upper surface of the stand-off
pad. The image was stored on the screen, and
real-time measurements of CSA were taken using
the scanner’s on-screen electronic callipers by
tracing around the muscle borders. Each scan was
photographed as a permanent record.
Protocol
Each subject was scanned at the same time of
day on two different days. On day 1, two images
were taken by GR (the scanning site remained
marked on the subject), and measurements made.
On day 2, the scanning site was relocated using
the transparent sheet, and two images were again
taken by GR, and one was taken by MS. The
investigators were blind as to any previous mea-
189
surements, and on-screen measurements could
not be seen by the investigator.
Analysis of data
Since the purpose of the present paper is to
investigate which statistical tests to use, some
background information is warranted (see
Appendix for details). The two types of analysis used were intraclass correlations (ICCs)10,11 (computed from an analysis of variance [ANOVA]) and Bland and Altman12 tests.
If the between-subjects and within-subjects
variance is being examined, a one-way ANOVA
is used. Where within-subjects variance is divided
into between-raters and residual components, a
two-way ANOVA is used. The mean square
(MS) for the sources of variance obtained by
ANOVA in the present study design are
between-subjects (BMS), within-subjects (WMS),
between-raters (RMS) and residual (EMS). The
mean squares from different sources of variance
are used in the ICC equations, and Shrout and
Fleiss10 described six equations for inter-rater
reliability studies (see Appendix), and Fleiss11
provided equations for repeated measurements
made by the same rater.
Inter-rater reliability
The CSA of scan 1 from day 2 (GR) was compared to the scan taken by MS on the same day.
Of the six ICC equations described by Shrout and
Fleiss,10 the (3,1) equation was used (see Appendix) and calculated from a two-way ANOVA
(carried out using SPSS for Windows version
7.5.1).
ICC (3,1) = (BMS – EMS) / [BMS + (k – 1)EMS]

where k is the number of raters.
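The calculation can be checked against the raw data. The following Python sketch (an illustration added here, not part of the original paper; NumPy is assumed for convenience) derives the two-way ANOVA mean squares and ICC (3,1) for the inter-rater data in Table 1:

```python
import numpy as np

# Day-2 scan 1 by each rater (CSA in cm^2, from Table 1)
gr = np.array([17.13, 16.08, 10.91, 14.96, 13.00, 18.27, 14.99, 15.64, 10.93, 16.48])
ms = np.array([18.78, 17.42, 10.73, 15.65, 11.52, 17.51, 15.81, 16.88, 12.19, 18.16])

def icc_3_1(data):
    """ICC (3,1) from a two-way ANOVA; rows are subjects, columns are raters."""
    n, k = data.shape
    grand = data.mean()
    bms = k * np.sum((data.mean(axis=1) - grand) ** 2) / (n - 1)  # between-subjects MS
    rms = n * np.sum((data.mean(axis=0) - grand) ** 2) / (k - 1)  # between-raters MS
    ss_total = np.sum((data - grand) ** 2)
    ems = (ss_total - bms * (n - 1) - rms * (k - 1)) / ((n - 1) * (k - 1))  # residual MS
    return (bms - ems) / (bms + (k - 1) * ems)

data = np.column_stack([gr, ms])
print(round(icc_3_1(data), 2))  # 0.92, matching Table 4
```

The intermediate mean squares (BMS ≈ 14.21, EMS ≈ 0.58) agree with the values reported in Table 2.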
Intra-rater reliability
Repeatability was only examined for investigator GR, using a one-way ANOVA and the following ICC equations. Between-scans repeatability compared scan 1 with scan 2 for day 1 and day 2 separately, and ICC (1,1) was used:
ICC (1,1) = (BMS – WMS) / [BMS + (k – 1)WMS]

where k is the number of measurements.
Between-days repeatability compared the
mean of scans 1 and 2 from day 1 with the mean
of scans 1 and 2 from day 2 using ICC (1,2). The
mean of the two scans was used for each day,
hence the second integer of 2 (see ‘Discussion’
and the Appendix).
ICC (1,2) = (BMS – WMS) / BMS
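The same check works for the intra-rater equations. This sketch (again an added illustration, using NumPy) computes the one-way ANOVA mean squares from the Table 1 data and applies both equations:

```python
import numpy as np

def oneway_ms(data):
    """One-way ANOVA mean squares; rows are subjects, columns are repeated measures."""
    n, k = data.shape
    grand = data.mean()
    subj = data.mean(axis=1)
    bms = k * np.sum((subj - grand) ** 2) / (n - 1)            # between-subjects MS
    wms = np.sum((data - subj[:, None]) ** 2) / (n * (k - 1))  # within-subjects MS
    return bms, wms

# Day 1, scans 1 and 2 by one operator (CSA in cm^2, from Table 1)
day1 = np.column_stack([
    [16.40, 13.73, 10.28, 14.36, 10.95, 16.96, 16.20, 13.94, 10.00, 18.12],
    [15.94, 13.48, 10.74, 14.44, 11.64, 15.84, 14.06, 14.82, 10.65, 17.84],
])
bms, wms = oneway_ms(day1)
k = day1.shape[1]
icc_1_1 = (bms - wms) / (bms + (k - 1) * wms)
print(round(icc_1_1, 2))  # 0.94, matching Table 4

# Between-days: the mean of the two scans on each day (from Table 1)
days = np.column_stack([
    [16.17, 13.61, 10.51, 14.40, 11.30, 16.40, 15.13, 14.38, 10.33, 17.98],
    [16.96, 16.20, 10.76, 14.83, 12.82, 18.42, 15.40, 15.43, 12.20, 17.00],
])
bms_d, wms_d = oneway_ms(days)
icc_1_2 = (bms_d - wms_d) / bms_d
print(round(icc_1_2, 2))  # 0.92, matching Table 4
```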
The reasons for selecting these specific ICC
equations are explained in the ‘Discussion’ section. The inter- and intra-rater reliability analyses were performed separately in this example, as
both raters did not make repeated measurements. A design in which a number of raters
make equal numbers of observations enables one
analysis to examine both types of reliability.17 An
ICC ratio of 1 indicates perfect reliability with no
measurement error, whilst 0 indicates no reliability.1 The 95% CI values for each ICC value
were calculated using ICC macros on the SPSS
World Wide Web site (see Appendix). Bland and Altman12 methods were used to assess agreement between-scans, between-days and between-raters. Calculations included the mean difference between measures (d̄), the 95% CI for d̄, the standard deviation of the differences (SDdiff), the reliability coefficient, and the 95% limits of agreement (see Appendix). Diagrams were also plotted to illustrate the distribution of results.
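These quantities are simple to compute directly. The sketch below (an added illustration using the day 1 between-scans data from Table 1; the t value of 2.262 for 9 degrees of freedom is assumed for the 95% CI) shows the core Bland and Altman calculations:

```python
import math

# Differences between scan 1 and scan 2 on day 1 (CSA in cm^2, from Table 1)
scan1 = [16.40, 13.73, 10.28, 14.36, 10.95, 16.96, 16.20, 13.94, 10.00, 18.12]
scan2 = [15.94, 13.48, 10.74, 14.44, 11.64, 15.84, 14.06, 14.82, 10.65, 17.84]
diffs = [a - b for a, b in zip(scan1, scan2)]
n = len(diffs)

d_bar = sum(diffs) / n                                        # mean difference
sd_diff = math.sqrt(sum((d - d_bar) ** 2 for d in diffs) / (n - 1))
se = sd_diff / math.sqrt(n)                                   # standard error of d_bar
t = 2.262                                                     # t value for 9 df, 95%
ci = (d_bar - t * se, d_bar + t * se)                         # 95% CI for d_bar
limits = (d_bar - 2 * sd_diff, d_bar + 2 * sd_diff)           # 95% limits of agreement

print(round(d_bar, 2), round(sd_diff, 2))   # 0.15 0.93
print([round(x, 2) for x in ci])            # approx -0.52 to 0.82
print([round(x, 2) for x in limits])        # approx -1.72 to 2.02
```

Rounded to two decimals these reproduce d̄ = 0.15 and SDdiff = 0.93 from Table 4; the CI and limits differ from the tabled values in the last digit, apparently because Table 4 used rounded intermediates.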
Results
The raw data are shown in Table 1, and the
ANOVA results are shown in Tables 2 and 3. An
example of an ICC calculation for the present
data is shown in the Appendix.
Inter-rater reliability
The ICC coefficient for inter-rater reliability
was high, with a ratio from the (3,1) equation of
0.92 (see Table 4 which also shows the 95% CI
for the coefficient). The Bland and Altman 12 test
results in Table 4 show reasonable agreement,
and the distribution of values in Figure 1 show
that in most cases, differences between measures
are less than 1.5 cm2. There is a tendency for
measurements taken by GR to be smaller than
those taken by MS.
Between-scans repeatability on day 1 and day 2
Between-scans repeatability was high for both days, with ICC (1,1) values of 0.94 and 0.93 respectively (Table 4). The results from Bland and Altman12 methods seen in Table 4 indicate good agreement. Figure 2 for day 1 shows that in all but one case, the difference is 1 cm² or less, and the points are distributed around zero. From Figure 3 for day 2 there appears to be better agreement, with the differences in seven cases lying close to zero, with one outlier.
Table 1 Cross-sectional area measurements of the anterior tibial muscle group (cm²)

           Inter-rater          Between-scans repeatability –        Between-days repeatability
           reliability          day 1 and day 2 measured by          measured by one operator (GR)
           (day 2)              one operator (GR)
           Scan 1    Scan 1     Day 1             Day 2              Day 1          Day 2
Subject    (GR)      (MS)       Scan 1   Scan 2   Scan 1   Scan 2    Mean of        Mean of
                                                                     scans 1 & 2    scans 1 & 2
1          17.13     18.78      16.40    15.94    17.13    16.78     16.17          16.96
2          16.08     17.42      13.73    13.48    16.08    16.31     13.61          16.20
3          10.91     10.73      10.28    10.74    10.91    10.60     10.51          10.76
4          14.96     15.65      14.36    14.44    14.96    14.70     14.40          14.83
5          13.00     11.52      10.95    11.64    13.00    12.63     11.30          12.82
6          18.27     17.51      16.96    15.84    18.27    18.57     16.40          18.42
7          14.99     15.81      16.20    14.06    14.99    15.81     15.13          15.40
8          15.64     16.88      13.94    14.82    15.64    15.22     14.38          15.43
9          10.93     12.19      10.00    10.65    10.93    13.46     10.33          12.20
10         16.48     18.16      18.12    17.84    16.48    17.51     17.98          17.00
Table 2 ANOVA table of results for inter-rater reliability – two-way mixed model, raters fixed

                                                                           For data in Table 1
Source of variation   Degrees of      Mean square   Expected mean          df    MS
                      freedom (df)    (MS)          square E(MS)
Between-subjects      n–1             BMS           σ²e + kσ²s             9     14.21
Between-raters        k–1             RMS           σ²e + nΣθ²j/(k–1)      1     1.96
Error                 (n–1)(k–1)      EMS           σ²e                    9     0.58

k is the number of raters; n is the number of subjects; σ² = variance; θj is the fixed effect of rater j. For expected mean square equations see Fleiss.11
Table 3 ANOVA table of results for intra-rater reliability – one-way random effects model

                                                                      For data in Table 1 (MS)
Source of          Degrees of     Mean square   Expected mean    df   Between-scans   Between-scans   Between-
variation          freedom (df)   (MS)          square E(MS)          day 1           day 2           days
Between-subjects   n–1            BMS           σ²e + kσ²s       9    13.57           11.55           12
Within-subjects    n(k–1)         WMS           σ²e              10   0.41            0.45            0.98

k is the number of times measurements were made; n is the number of subjects; σ² = variance.
Table 4 Results for inter-rater reliability, between-scans and between-days repeatability – ICCs and Bland and Altman tests

                          ICC                             Bland and Altman
                          Equation  Value  95% CI         d̄       SE of d̄   95% CI for d̄    SDdiff   Reliability   95% limits of
                                                          (cm²)   (cm²)     (cm²)           (cm²)    coefficient   agreement (cm²)
Inter-rater reliability   (3,1)     0.92   0.72 → 0.98    –0.63   0.34      –1.4 → 0.14     1.08     2.4           –2.73 → 1.53
Between-scans
repeatability day 1       (1,1)     0.94   0.8 → 0.99     0.15    0.29      –0.51 → 0.81    0.93     1.80          –1.71 → 2.01
Between-scans
repeatability day 2       (1,1)     0.93   0.75 → 0.98    –0.32   0.29      –0.98 → 0.34    0.93     1.88          –2.18 → 1.54
Between-days
repeatability             (1,2)     0.92   0.69 → 0.98    –0.98   0.40      –1.88 → –0.08   1.25ᵃ    2.8           –3.48 → 1.52

d̄ is the mean difference; SE of d̄ is the standard error of the mean difference; 95% CI for d̄ is the 95% confidence interval for the mean difference; SDdiff is the standard deviation of the differences.
ᵃ The standard deviation has been corrected to take into account mean measurement values.
Between-days repeatability
Repeatability was high with an ICC (1,2) of
0.92. Results from Bland and Altman12 seen in
Table 4 demonstrate reasonable agreement.
However, Figure 4 indicates a bias as all but one
point lie below zero, i.e. the differences (day 1 –
day 2) have a negative value. Therefore, nine of
the measurements taken on day 2 were larger
than those taken on day 1. This bias is also reflected in the 95% CI for d̄, which is –1.88 → –0.08 cm². Zero does not lie within the interval, which indicates a bias between the two measures.
Discussion
Measurements of the anterior tibial muscles appear to be reliable in the hands of the two investigators studied, but neither of the two statistical tests alone provides a full analysis.
Reliability of ultrasound measurements
Cross-sectional area measurement of the anterior tibial group using real-time ultrasound scanning has been demonstrated to have high repeatability, both between-scans and between-days, and high inter-rater reliability. Previous ultrasound reliability studies did not use ICCs or Bland and Altman12 tests to allow comparison.
It has been argued that if a technique shows high reliability between raters, examination of intra-rater repeatability is not important.1 However, it is useful to document that a technique is reliable in the hands of a particular investigator over time.

Figure 1 Distribution plot from Bland and Altman12 test showing mean measurements against differences between measurements for inter-rater reliability. IRMEAN (cm²) is the inter-rater mean of measurements taken by the two raters (GR and MS). IRDIFF is the inter-rater difference between measures taken by GR and MS.

Figure 2 Distribution plot from Bland and Altman12 test for between-scans repeatability day 1. BS1MEAN (cm²) is between-scans day 1, mean measurements from scan 1 and scan 2. BS1DIFF is between-scans day 1, difference between measurements (scan 1 – scan 2).

Figure 3 Distribution plot from Bland and Altman12 test for between-scans repeatability day 2. BS2MEAN (cm²) is between-scans day 2, mean measurements from scan 1 and scan 2. BS2DIFF is between-scans day 2, difference between measurements (scan 1 – scan 2).

Figure 4 Distribution plot from Bland and Altman12 test for between-days repeatability. BDMEAN (cm²) is between-days mean measurements from day 1 and day 2. BDDIFF is between-days difference between measurements (day 1 – day 2).
The Bland and Altman12 distribution graph
(Figure 4) indicated a bias between measurements made by GR on different days, with those
on day 2 being larger. It is interesting to note in
Table 1 of the raw data, that in the three subjects
where the largest differences were seen, the highest value was similar to that of the more experienced investigator (MS). This bias to the larger
value suggests a consistent change in the interpretation of the muscle border by investigator
GR on day 2. This bias was not evident from the
ICC results and was not large enough to produce
poor reliability results for the quantitative values
produced from the tests. It is, however, interesting to note, and may disappear on re-testing
when the investigator gains more experience.
Inappropriate statistical tests
Certain tests are often used incorrectly for
evaluating reliability. Pearson’s correlation coefficient is inappropriate because the strength of
linear association, and not agreement, is measured; it is possible to have a high degree of correlation when agreement is poor.6,7,18 A paired
t-test assesses whether there is any evidence that
two sets of measurements agree on average.
However, it is the difference between within-subjects scores that is of interest. Taking the mean
score of all subjects has potential to provide misleading estimates. A high scatter of individual differences can result in the difference between the
means being nonsignificant.6,14
Using the CV to calculate reliability is no
longer considered to be appropriate in most
cases, and reasons for this have been discussed
elsewhere.5,8 The CV is the standard deviation
divided by the mean. For reliability calculations,
the between-subjects standard deviation is often
used, but this is inappropriate. In some cases,
where the standard deviation is proportional to
the mean, it has been suggested that the CV is
suitable. If this is the case, the within-subjects
standard deviation should be calculated, rather
than between-subjects, and the data often need
to be transformed. Such methods are discussed
by Bland13 and Chinn.5 Results for CV are not
given in the present paper so as not to encourage their continued use.
Selection of an ICC equation
Inter-rater reliability
When using ICCs, the equation and the reasoning behind the choice of equation should be
clearly stated. This is dependent on the study
design and the intent of the reliability study. 10,11,18
For inter-rater reliability, the ICC (1,1) equation is unlikely to be of use clinically (see Appendix). The essential factor in choosing between
(2,1) and (3,1) is the eventual application of the
test or measuring system. If the aim is general
application in clinical practice or research trials,
ICC (2,1) is appropriate, and a greater number
of raters than in the present study would be
required. If testing is only to be performed by a
small number of raters who are the same raters
used in the reliability study, ICC (3,1) is the
choice, as is the case with the present study.
The application of a procedure to measure passive ankle dorsiflexion was clearly defined in a
study by Moseley and Adams,20 ‘it was intended
to generalize the present results to a variety of
judges’, and therefore ICC (2,1) was used. In the
present study it was only the reliability of specified raters that was of interest, as they will be the
raters obtaining measurements in a larger ultrasound study. The same applies to a reliability
study using ICC (3,1) described by Andrews et
al.21 The reliability of using a hand-held
dynamometer was reported for three raters who
then went on to collect normative values for isometric muscle force for 156 adults.
A measurement that is reported to have good
reliability for general application should be interpreted with caution. The raters used in the reliability study should be a random sample from a
larger population. This criterion was not fulfilled
in the study by Moseley and Adams,20 where five
physiotherapists working at the hospital volunteered to serve as raters. This is not a random
sample and potentially introduces bias in raters’
ability. The larger population of raters also needs
to be defined, as this will imply who will demonstrate similar reliability to the raters in the reliability study. For example, this could be all
physiotherapists of all abilities working in all
areas, only physiotherapy students, or only senior
physiotherapists specializing in a particular area.
In addition, the level of reliability is only applicable for measuring subjects demonstrating a similar range of measurement values to that of the subjects in the reliability study (see below).
The implications of using the wrong equation
need consideration. For the same set of data, ICC
(1,1) will usually give lower values than ICC (2,1)
or (3,1), and in most cases ICC (2,1) will result
in a lower value than when (3,1) is used.10 This
trend is seen in the results of the present study
but the difference between results from any of
the equations is minimal: ICC (1,1) = 0.90, (2,1)
= 0.90, (3,1) = 0.92. Results for ICC (1,1) and ICC
(2,1) were also identical in a reliability study to
measure passive ankle dorsiflexion,20 and Bohannon18 did not find much difference between calculations of ICC (2,1) and (3,1). This was not the
case, though, for the example presented by
Shrout and Fleiss,10 where the same data produced ICCs ranging from 0.17 to 0.91 according
to the equation used.
It was suggested by Moseley and Adams20 that if ICC (1,1) is mistakenly used, the result is not misleading because it is an underestimation of the true ICC value, and this conservativeness has been advanced as a reason for using the ICC (1,1) form.
While it may be better to underestimate rather
than overestimate reliability, gross underestimation may result in a potentially useful technique
being discarded. As importantly, choice of an
inappropriate equation shows a lack of understanding of design and application of reliability
studies.
Intra-rater reliability
The ICC equations used for intra-rater reliability of repeated measurements in the present
study, (1,1) and (1,2), are equivalent to the inter-rater reliability equations (1,1) and (1,k),
described by Shrout and Fleiss,10 but k represents
time (number of measurements) rather than
rater. For between-scans repeatability, the equation was obtained from Fleiss11 who described the
same study design as that used here. Although
the equation presented was the same as (1,1)
from an earlier paper by the same author10 this
was not pointed out. The system for labelling
the equations for inter-rater reliability does
not appear to be used for intra-rater reliability,
and a consistent system would be useful,
preferably distinguishing between inter- and
intra-rater reliability equations.
For between-days repeatability, equation ICC
(1,2) was used, as the mean of two measures
(k = 2) was included in the analysis, hence the
second digit of 2 (see Appendix).
Bland and Altman tests
Bland and Altman described a series of statistical methods for assessing agreement between
two methods of clinical measurement.12–14 In relation to repeatability it is suggested that the coefficient of repeatability is calculated. It is unclear
whether this should be used in isolation or to supplement the other tests. The method of calculation varies in two different references.12,13
Inter-rater reliability was not discussed.
Strengths and weaknesses of ICCs and Bland
and Altman tests
A comparison of ICCs and Bland and Altman
methods is shown in Table 5. These have differing advantages and disadvantages.1,19 When using
the ICCs, the choice of equation, study design
and intended application need to be defined
clearly.
A reliability coefficient such as the ICC appears easy to interpret: the closer to 1, the greater the reliability. However, interpretation is not that simple; the coefficient is just one point estimate of reliability based on one selected sample. The ICC in isolation cannot give a true picture of reliability and should be complemented by hypothesis testing and/or confidence interval construction.17
In addition, the ICC cannot be interpreted clinically because it gives no indication of the magnitude of disagreement between measurements.
It should therefore be complemented by calculation of the standard error of measurement
(SEM)17 or Bland and Altman 95% limits of
agreement tests.
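The SEM can be sketched with the commonly used formula SEM = SD √(1 − ICC), where SD is the standard deviation of the observations; the formula and the example below are an added illustration, not taken from the paper's Appendix:

```python
import math

# Day-1 CSA measurements (cm^2, from Table 1): 10 subjects x 2 scans
values = [16.40, 13.73, 10.28, 14.36, 10.95, 16.96, 16.20, 13.94, 10.00, 18.12,
          15.94, 13.48, 10.74, 14.44, 11.64, 15.84, 14.06, 14.82, 10.65, 17.84]
n = len(values)
mean = sum(values) / n
sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))

icc = 0.94  # ICC (1,1) for between-scans repeatability, day 1 (Table 4)
sem = sd * math.sqrt(1 - icc)  # standard error of measurement
print(round(sem, 2))
```

This gives roughly 0.6 cm² for these data: an error magnitude that, unlike the ICC itself, is expressed in the units of measurement.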
A major criticism of the ICC is the influence
of between-subjects variance on the ratio. In simple terms the ICC is the ratio of true score variance (between-subjects variance) to true score
variance plus error. If the true score variance is
sufficiently large, reliability will always appear
high and vice versa.
Therefore, for a group of subjects with a wide
range of CSA measurements, the ICC is likely to
Table 5 Comparison of intraclass correlations and Bland and Altman statistical methods for assessing agreement

ICCs
Advantages:
- Inter-rater reliability can allow for fixed or random effects
- Single reliability coefficient simple to understand (but see disadvantages)
- Calculations relatively simple for coefficient but more complex for 95% CI; easy to calculate for any number of raters, data sets or mean measures
Disadvantages:
- Influenced by magnitude of between-subjects variation (possible advantage)
- Potential to oversimplify if ICC quoted in isolation; potential to select wrong equation
- Gives no indication of actual measurement values or ranges, or of any bias in measurements, and cannot be interpreted clinically (but can be used with the SEM)

Bland and Altman methods
Advantages:
- Graph: data easily interpreted visually; easy to see size and range of differences in measurements, any bias or outliers, or the relation between the size of differences and the size of the mean
- 95% CI for mean of the differences: will indicate bias in measurements
- 95% limits of agreement: can relate to clinical acceptability
Disadvantages:
- More complex analyses if more than two raters or data sets, mean measures or data need to be transformed (not fully described in the literature)
- More complex to interpret than a single reliability coefficient (possible advantage)
- 95% CI for limits of agreement: need a sample of at least 50, otherwise limits will be very wide
- Independent of between-subjects variation (possible advantage)

Note: Neither ICC nor Bland and Altman results can be compared directly with those from other studies.
be greater than for a more homogeneous sample
group with similar muscle CSA measurements.
This criticism will be true for any type of reliability calculation based on between-subjects variance, such as the coefficient of variation5 (see
above). The Bland and Altman methods are
independent of the true variability in the observations.
There is a lack of consensus regarding this
issue since it has been suggested that reliability
should reflect true variability.1,7 It has been
argued that reliability is relative, it reflects how
well a measurement can differentiate individuals,
and therefore that reliability or measurement
error should be contrasted with the expected
variation among the subjects being tested.1 In
other words, if CSA measurements were to range
from 10–40 cm2, a measurement error of 1 cm2
may not be important, but if measurements
ranged from 5–10 cm2, an error of 1 cm2 might
not be acceptable.
This issue has yet to be settled, but in the
meantime the following points should be noted.
Bland and Altman 95% limits of agreement indicate a range of error, but this must be interpreted
with reference to the range of measurement val-
ues obtained. Therefore, Bland and Altman tests
should be complemented by raw data and/or
ranges. If using the ICC, the between-subjects
variation should be a meaningful index of reliability, i.e. the variation between subjects in the
selected sample must reflect the true population
of interest.
Regardless of which reliability tests are
selected, it appears that comparison of reliability
results between studies is not possible unless the
size and attributes of the samples tested in each
case are virtually identical.
A distinct advantage of the ICC with regard to inter-rater reliability in the present case is its ability to differentiate between random and fixed effects (see Appendix).
The Bland and Altman methods have two
advantages in comparison to the ICC: the powerful visual representation of the degree of
agreement, and the easy identification of bias,
outliers, and any relationship between the variance in measures with the size of the mean.
The current situation of a variety of statistical
tests being used, some of them inappropriate, is
clearly unhelpful to the field of rehabilitation
research. It appears that ICCs and/or Bland and
Altman methods are emerging as the statistical
analyses of choice, and it is suggested they both
be used in reliability studies of similar design to
that described in the present paper. It is also suggested that a standardized system for annotating
the different ICC formulae is needed and that, in
the absence of this, authors could be encouraged
to indicate the type of ICC used more clearly,
stating the formula and its origin as well as the
rationale for its use in the context of their study
design.
Several authors, including those of the present
paper, have put forward their suggestions for
appropriate statistical analysis of reliability studies. A consensus needs to be reached to establish
which tests, and which of the many results produced by these analyses, are the most relevant
ones to be adopted universally.
Acknowledgements
The authors thank the subjects who took part
in the study, Dr Jörg Huber (School of Life
Sciences, Roehampton Institute London) for
statistical advice, Dr Janine Gray (Centre
for Health & Medical Research, University of
Teesside) for her comments and the Living Again
Trust for financial support.
References
1 Streiner DL, Norman GR. Health measurement
scales: a practical guide to their development and use,
second edition. Oxford: Oxford University Press,
1995: 104–27.
2 Davies M, Fleiss JL. Measuring agreement for
multinominal data. Biomet 1982; 38: 1047–51.
3 Haas M. Statistical methodology for reliability
studies. J Manipulative Physiol Ther 1991; 14:
119–32.
4 Brennan P, Silman A. Statistical methods for
assessing observer variability in clinical measures.
BMJ 1992; 304: 1491–94.
5 Chinn S. The assessment of methods of
measurement. Statist Med 1990; 9: 351–62.
6 Maher C. Pitfalls in reliability studies: some
suggestions for change. Aust J Physiother 1993; 39:
5–7.
7 Riddle DL, Finucane SD, Rothstein JM, Walker
ML. Intrasession and intersession reliability of hand-held dynamometer measurements taken on brain-damaged patients. Phys Ther 1989; 69: 182–94.
8 Stokes M, Hides J, Nassiri DK. Musculoskeletal
ultrasound imaging: diagnostic and treatment aid in
rehabilitation. Phys Ther Rev 1997; 2: 73–92.
9 Bartko JJ. The intraclass correlation coefficient as a
measure of reliability. Psychol Rep 1966; 19: 3–11.
10 Shrout PE, Fleiss JL. Intraclass correlations: uses in
assessing rater reliability. Psychol Bull 1979; 86:
420–28.
11 Fleiss JL. Reliability of measurement. In: Fleiss JL
ed. Design and analysis of clinical experiments. New
York: John Wiley & Sons, 1986: 1–32.
12 Bland JM, Altman DG. Statistical methods for
assessing agreement between two methods of clinical
measurement. Lancet 1986; 1: 307–10.
13 Bland M. Clinical measurement. In: Bland M ed. An
introduction to medical statistics. Oxford: Oxford
University Press, 1987: 265–73.
14 Altman DG. Some common problems in medical
research. In: Altman DG ed. Practical statistics for
medical research, first edition. London: Chapman &
Hall, 1991: 398–403.
15 Martinson H, Stokes MJ. Measurement of anterior
tibial muscle size using real-time ultrasound imaging.
Eur J Appl Physiol 1991; 63: 250–54.
16 Kelly SJ, Stokes MJ. Symmetry of anterior tibial
muscle size measured by real-time ultrasound
imaging in young females. Clin Rehabil 1993; 7:
222–28.
17 Eliasziw M, Young SL, Woodbury MG, Fryday-Field
K. Statistical methodology for the concurrent
assessment of interrater and intrarater reliability.
Phys Ther 1994; 74: 89–100.
18 Bohannon RW. Commentary. Phys Ther 1989; 69:
190–92.
19 Krebs DE. Intraclass correlation coefficients: use
and calculation. Phys Ther 1984; 64: 1581–89.
20 Moseley A, Adams R. Measurement of passive
ankle dorsiflexion: procedure and reliability. Aust J
Physiother 1991; 37: 175–81.
21 Andrews AW, Thomas MW, Bohannon RW.
Normative values for isometric muscle force
measurements obtained with hand-held
dynamometers. Phys Ther 1996; 76: 248–59.
Appendix
The rationales for the intraclass correlation and Bland and Altman12 tests are outlined in the context
of reliability studies, together with the formulae used in different situations.
Intraclass correlation coefficients (ICCs)
The ICC analyses were specifically designed by Bartko9 to examine reliability, providing a reliability index that indicates the measurement error. Reliability is the ratio: variance of interest/(variance of interest + measurement error).
There are several formulae for ICCs which can give quite different results when applied to the
same data. Each formula is appropriate for specific situations which are defined by the experimental design and the potential use of the results.10,11
The ICCs are calculated from results obtained from analysis of variance (ANOVA) for repeated
measures. An ANOVA table for inter-rater data (see Table 2) shows different sources of variance
with their associated degrees of freedom, mean squares (MS) and errors.
Shrout and Fleiss10 described six forms of ICC for inter-rater reliability: (1,1), (2,1), (3,1), (1,k), (2,k) and (3,k). They used the term judge (JMS) where the present study uses rater (RMS). It should be noted that the SPSS macro for ICCs also uses JMS.
The first integer, 1, 2 or 3, relates to three different cases, cases 1, 2 and 3 respectively. It does not indicate the type of ANOVA used, as has been suggested.20 The second integer indicates how many units of analysis are included. If there is more than one, the second integer is greater than 1
(see below). The equations given below are from Shrout and Fleiss,10 and the associated variability
formulae are from Streiner and Norman.1
Case 1
Each subject is rated by a different set of k raters, randomly selected from a larger population of
raters. This analysis uses one-way ANOVA results:
ICC (1,1) = (BMS – WMS) / [BMS + (k–1)WMS]

or

subject variability / (subject variability + within-subjects variability)
This analysis is unlikely to be practical for clinical research purposes, as such a design is not common, but perhaps it could be appropriate for studying reliability before providing data from single-case studies or multicentre trials. The same equation was described by Fleiss11 in the context of intra-rater reliability.
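As a concrete illustration, ICC (1,1) can be computed from raw ratings via the one-way ANOVA mean squares. This is a minimal sketch in Python, assuming NumPy is available; the array layout (subjects in rows, raters in columns) is an assumption for the example, not part of the original analysis.

```python
import numpy as np

def icc_1_1(ratings):
    """ICC (1,1): each subject rated by a different set of k raters.

    ratings: n x k array (subjects in rows, raters in columns).
    BMS and WMS are the between- and within-subjects mean squares
    from a one-way ANOVA.
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand_mean = x.mean()
    subject_means = x.mean(axis=1)
    # Between-subjects mean square: subject SS / (n - 1)
    bms = k * np.sum((subject_means - grand_mean) ** 2) / (n - 1)
    # Within-subjects mean square: within SS / n(k - 1)
    wms = np.sum((x - subject_means[:, None]) ** 2) / (n * (k - 1))
    return (bms - wms) / (bms + (k - 1) * wms)
```

When the raters agree perfectly, WMS is zero and the ICC equals 1; disagreement inflates WMS and pulls the coefficient down.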
Case 2
A random sample of k raters is selected from a larger population of raters and each rater tests
each subject, i.e. each judge rates n subjects altogether. Case 2 considers the raters as random effects
and should be used if there is a need to generalize to other raters within the same population. It
measures the agreement of the raters and answers the question as to whether the raters are interchangeable. This analysis uses two-way ANOVA results and the variance due to the rater is included
in the equation:
ICC (2,1) = (BMS – EMS) / [BMS + (k–1)EMS + k(RMS–EMS)/n]

Alternatively,

ICC (2,1) = subject variability / (subject variability + observer variability + random error variability)
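The case 2 calculation can be sketched in the same way from a two-way ANOVA, where the rater variance now enters the denominator. This is a minimal illustration assuming NumPy; again, the subjects-by-raters array layout is an assumption for the example.

```python
import numpy as np

def icc_2_1(ratings):
    """ICC (2,1): k raters, treated as random effects, each rating n subjects.

    Mean squares come from a two-way repeated-measures ANOVA: BMS (between
    subjects), RMS (between raters) and EMS (residual error).
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()
    subj_means = x.mean(axis=1)
    rater_means = x.mean(axis=0)
    bms = k * np.sum((subj_means - grand) ** 2) / (n - 1)
    rms = n * np.sum((rater_means - grand) ** 2) / (k - 1)
    # Residual SS = total SS - subject SS - rater SS
    ss_total = np.sum((x - grand) ** 2)
    ems = (ss_total - bms * (n - 1) - rms * (k - 1)) / ((n - 1) * (k - 1))
    return (bms - ems) / (bms + (k - 1) * ems + k * (rms - ems) / n)
```

A constant bias between raters (e.g. one rater always scoring one unit higher) leaves the residual error at zero but still lowers ICC (2,1), because the rater variance term penalizes systematic disagreement.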
Case 3
Each subject is rated by each of the same k raters who are the only raters of interest, as in the
present study. In this case, raters are considered to be fixed, and the reliability will reflect the accuracy of measurement for the specified raters but cannot be applied generally to any other raters.
This analysis uses two-way ANOVA results but only the residual variance, not the between-raters
variance, comes into the equation. This is because the between-raters variance is fixed; it will always
contribute the same amount to the within-subjects variance and does not need to be factored out.
ICC (3,1) = (BMS – EMS) / [BMS + (k–1)EMS]

or

subject variability / (subject variability + random error variability)
It can be seen that this is the same as (2,1) but with k(RMS–EMS)/n removed from the denominator. The calculation for ICC (3,1) using the present data for inter-rater reliability (see Table 2), where k–1 = 1, is as follows:

ICC (3,1) = (14.21 – 0.58) / [14.21 + (k–1)0.58]
          = 13.63 / (14.21 + 0.58)
          = 13.63 / 14.79
          = 0.92
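The same arithmetic can be reproduced in a few lines of Python; this is a minimal sketch in which the mean squares 14.21 and 0.58 are the values from Table 2.

```python
def icc_3_1(bms, ems, k):
    """ICC (3,1) from two-way ANOVA mean squares, with raters fixed."""
    return (bms - ems) / (bms + (k - 1) * ems)

# Inter-rater data from Table 2: BMS = 14.21, EMS = 0.58, k = 2 raters
print(round(icc_3_1(14.21, 0.58, 2), 2))  # 0.92
```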
The second integer
Equations (1,k), (2,k) and (3,k) are used when the unit of analysis is the mean measurement
obtained either from more than one measurement or from more than one rater (k in this situation
does not always refer to the number of raters). The reliability of a mean rating will almost always
be greater than that of an individual rating.
ICC (1,k) = (BMS – WMS) / BMS

ICC (2,k) = (BMS – EMS) / [BMS + (JMS/RMS – EMS)/n]

ICC (3,k) = (BMS – EMS) / BMS
Some papers use only the mathematical symbols for the different types of variance in their equations, rather than the mean squares abbreviations, despite citing Shrout and Fleiss,10 which can cause some confusion when trying to determine which equation has been used. Both the MS abbreviations and the variance symbols are shown in Tables 2 and 3 for clarification.
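For completeness, the mean-rating forms can be sketched in the same way. This is a minimal illustration; the mean squares are assumed to come from the appropriate ANOVA, as above, and RMS stands in for Shrout and Fleiss's JMS.

```python
def icc_1_k(bms, wms):
    """ICC (1,k): reliability of the mean of k ratings, case 1."""
    return (bms - wms) / bms

def icc_2_k(bms, ems, rms, n):
    """ICC (2,k): raters random; RMS is JMS in Shrout and Fleiss's notation."""
    return (bms - ems) / (bms + (rms - ems) / n)

def icc_3_k(bms, ems):
    """ICC (3,k): raters fixed."""
    return (bms - ems) / bms

# With the mean squares from Table 2 (BMS = 14.21, EMS = 0.58), the
# reliability of the mean of the two raters' measurements is:
print(round(icc_3_k(14.21, 0.58), 2))  # 0.96
```

Note that 0.96 exceeds the single-rating ICC (3,1) of 0.92 for the same data, illustrating the point above that the reliability of a mean rating will almost always be greater than that of an individual rating.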
ICC macros on the SPSS World Wide Web site
The ICC equations mentioned above can be calculated very simply by hand using data from the
ANOVA. However, should the appropriate computer facilities be available, ICC macros (termed
ICCSF) can be downloaded from the SPSS Web site and used to calculate ICCs and related statistics. The calculations are those described by Shrout and Fleiss.10
There are three primary macros, ICCSF1.SPS, ICCSF2.SPS and ICCSF3.SPS, relating to cases 1, 2 and 3 respectively, which require SPSS 6.0 or higher, including the MATRIX procedure. There are
three abbreviated versions which only require a version of SPSS with macro capability and the
MATRIX procedure.
The unabbreviated macros can be used to calculate ICC values and 95% CIs both for single raters/measurements and for the mean of k raters/measurements. In addition, they perform an F-test of the null hypothesis that the population ICC is zero.
The help file for the ICC macros describes fully the different macro files and how to use them.
To perform the calculations, the relevant data file in SPSS needs to be open. A new syntax file is
opened and two lines of syntax are necessary (upper or lower case acceptable).
Line 1: include ‘ICCSF*.SPS’.
The relevant file name and location of the file is specified, for example, include ‘a:\ICCSF1.SPS’.
Line 2: icc vars=varlist.
Varlist refers to the relevant variables from the data file, which need to be listed, for example,
icc vars=dlsclgr, dlsc2gr. On selecting Run, then All, the results are reported.
Bland and Altman method of analysis
Bland and Altman described a method to assess agreement between clinical measurements.12–14
Their approach is based on analysis of differences between measurements and they suggest that estimation of the agreement between measures is more appropriate than a reliability coefficient or
hypothesis (significance) testing.
Several stages are described in analysing the data, providing different ways of expressing the results.

1) The differences between the two measures are plotted against the average of the two measurements, the mean. From this graph (e.g. Figures 1–4), the size of each difference, the range of differences and their distribution about zero (perfect agreement) can be seen clearly. The distribution can also indicate a bias in the measurement. The distribution may show that differences are related to the size of the mean; as the mean increases, so does the difference. The statistical analyses described assume a constant level of error, i.e. there should be constant variance with increasing means. If this is not the case, the data need to be transformed. This technique is described in detail elsewhere.4,5,12 Bland and Altman state that only log transformations should be applied; however, it has been suggested that other transformations are sometimes more appropriate.5

2) The mean of the differences (d̄) and the standard deviation of the differences (SDdiff) are calculated. The closer d̄ is to zero and the smaller the value of SDdiff, the better the agreement between measures.

3) It is also of interest to estimate the 'true' value of d̄, which is a measure of the bias between measures, and a 95% confidence interval can be calculated. The 95% confidence interval for d̄ is d̄ ± tn–1 × SE of d̄, where n is the number of subjects and SE = SDdiff/√n. If zero does not lie within the interval it can be concluded that a bias exists between the two measures.4 When mean measurements are compared, the estimate of the standard deviation needs to be corrected because some of the effect of the repeated measurement error has been removed (for details see ref. 12).

4) The coefficient of repeatability is calculated as 2√(Σd²/n), where d is the difference between two measures and it is assumed that the mean difference is zero.12

5) The 95% limits of agreement can be calculated as d̄ ± 2SDdiff, and 95% confidence intervals for these limits of agreement can also be calculated.12 If there are more than two repeated measures, the calculations are more complex.12 The sample size should be large enough, preferably greater than 50, to allow the limits of agreement to be estimated well,14 otherwise confidence intervals will be very wide. Due to the small sample size, CIs were not calculated in the present study.