Clinical Rehabilitation 1998; 12: 187–199

Reliability of assessment tools in rehabilitation: an illustration of appropriate statistical analyses

Gabrielle Rankin and Maria Stokes
Royal Hospital for Neuro-disability, London

Objective: To provide a practical guide to appropriate statistical analysis of a reliability study, using real-time ultrasound measurement of muscle size as an example.
Design: Inter-rater and intra-rater (between-scans and between-days) reliability.
Subjects: Ten normal subjects (five male) aged 22–58 years.
Method: The cross-sectional area (CSA) of the anterior tibial muscle group was measured using real-time ultrasonography.
Main outcome measures: Intraclass correlation coefficients (ICCs) and the 95% confidence interval (CI) for the ICCs, and the Bland and Altman method for assessing agreement, which includes calculation of the mean difference between measures (d̄), the 95% CI for d̄, the standard deviation of the differences (SDdiff), the 95% limits of agreement and a reliability coefficient.
Results: Inter-rater reliability was high: ICC (3,1) was 0.92 with a 95% CI of 0.72 → 0.98. There was reasonable agreement between measures on the Bland and Altman test: d̄ was –0.63 cm², the 95% CI for d̄ was –1.4 → 0.14 cm², the SDdiff was 1.08 cm², the 95% limits of agreement were –2.73 → 1.53 cm² and the reliability coefficient was 2.4. Between-scans repeatability was high: ICCs (1,1) were 0.94 and 0.93, with 95% CIs of 0.8 → 0.99 and 0.75 → 0.98, for days 1 and 2 respectively. Measures showed good agreement on the Bland and Altman test: d̄ was 0.15 cm² for day 1 and –0.32 cm² for day 2; the 95% CIs for d̄ were –0.51 → 0.81 cm² for day 1 and –0.98 → 0.34 cm² for day 2; SDdiff was 0.93 cm² for both days; the 95% limits of agreement were –1.71 → 2.01 cm² for day 1 and –2.18 → 1.54 cm² for day 2; and the reliability coefficient was 1.80 for day 1 and 1.88 for day 2. The between-days ICC (1,2) was 0.92 with a 95% CI of 0.69 → 0.98. The d̄ was –0.98 cm² and the SDdiff was 1.25 cm², with 95% limits of agreement of –3.48 → 1.52 cm² and a reliability coefficient of 2.8. The 95% CI for d̄ (–1.88 → –0.08 cm²) and the distribution graph showed a bias towards a larger measurement on day 2.
Conclusions: The ICC and Bland and Altman tests are appropriate for analysing reliability studies of similar design to that described, but neither test alone provides sufficient information, and it is recommended that both are used.

Address for correspondence: Professor MJ Stokes, Research Department, Royal Hospital for Neuro-disability, West Hill, Putney, London SW15 3SW, UK.

Introduction

Reliability studies of assessment tools in rehabilitation are necessary to ensure that the error involved in measurement is small enough to detect actual changes in what is being measured. Such reliability studies are numerous in the literature, but a diversity of statistical approaches is used (see below). This makes comparison between studies difficult and leaves the researcher with the problem of deciding which tests to use. Different types of reliability data require different tests. Where a device is used to make objective measurements of, for example, muscle strength, the data produced are continuous, i.e. derived from a range of possible values or an underlying continuum. Other types of data are produced when the assessment involves categorizing the data.
Where there is no obvious ordering to the categories (such as normal or abnormal; back pain, leg pain or arm pain), the data are termed nominal. Where the categories have a natural order, such as small, medium or large, the data are termed ordinal.1 The kappa test1,2 is commonly used for examining the reliability of nominal data, and the appropriate type of test for different situations, e.g. different numbers of raters, was discussed by Haas.3 A weighted kappa test is used for ordinal data.2,3 There is less consensus regarding continuous data, which form the topic of the present paper. Several authors have discussed the inappropriateness of tests such as Pearson's correlation, t-tests, the coefficient of variation (CV), per cent agreement and chi-square.1,3–8 The intraclass correlation (ICC9–11) and Bland and Altman12 tests are now often used for reliability studies, but there is little guidance as to which ICC equations or Bland and Altman results to use. Many papers do not explain the context in which the tests are being used, or even specify the ICC equation used. Bland and Altman tests have been described for method comparison studies, and similar methods for repeatability testing have been mentioned, but not in relation to inter-rater reliability testing.12–14 The overriding problem is a lack of consensus as to which of these tests are most appropriate.

The present paper addresses the problem of selecting an appropriate ICC equation, and clarifies some inconsistencies in the literature regarding analysis of reliability data. The design used in the present study is similar to that of many reliability studies in rehabilitation, and therefore serves as a useful example to demonstrate the appropriate statistics for such studies. The intra-rater reliability (or test–retest reliability) of measuring muscle size was examined in one investigator over a period of training in the ultrasound imaging technique, as well as inter-rater reliability against a more experienced operator. Muscle size measurement provides a direct assessment of muscle atrophy and hypertrophy, and a close correlation between muscle cross-sectional area (CSA) and isometric strength has been demonstrated in some muscle groups (see ref. 8 for a review). Measurement of muscle size is therefore useful for assessing and monitoring changes in muscle with disease and injury, as well as treatment outcome. The present study investigated inter-rater reliability, and the between-scans and between-days repeatability, of CSA measurements of the anterior tibial muscle group using real-time ultrasound imaging. Appropriate statistical analyses for a repeated measures design and inter-rater reliability are demonstrated, and the advantages of using ICC equations and Bland and Altman tests are discussed.

Method

Real-time ultrasound images were taken of the anterior tibial muscle group, as described previously,15 and only brief details are given here.

Subjects
Ten normal subjects (five male) aged 22–58 years were studied. Exclusion criteria regarding history of disease or injury were as described previously.15 Written, informed consent was obtained from each subject, and the study was approved by Riverside Research Ethics Committee.

Investigators
The first investigator (GR) is a research physiotherapist with recent training in ultrasound imaging of muscle. The second (MS) is also a research physiotherapist and is experienced in musculoskeletal ultrasonography.
Both therefore had the necessary knowledge of musculoskeletal anatomy.

Procedure
The basic principles of ultrasound imaging and instrumentation are discussed in detail elsewhere.8 An ALOKA SSD 1200 ultrasound scanner and a 5 MHz linear array transducer were used. The anterior tibial muscle group of the dominant leg (the preferred leg for kicking) was scanned. The muscles in the anterior tibial group include tibialis anterior, extensor hallucis longus, extensor digitorum longus and the peronei. The subject lay supine with the hip in neutral, the knee extended and the ankle resting at 90° in a padded metal splint. The level of scanning was 20% of the distance from the head of the fibula to the tip of the lateral malleolus. The scanning level was traced onto a transparent sheet, together with bony landmarks and any permanent skin blemishes, such as freckles and scars, so that the scanning site could be relocated accurately on the second day of scanning. A stand-off gel pad (Kitecko, 14 × 11 × 4 cm) was used to provide good contact with body contours, increase the field of view and prevent skin compression.16 Ultrasonic coupling gel was applied to the skin area to be scanned, to the transducer and to the upper surface of the stand-off pad. The image was stored on the screen, and real-time measurements of CSA were taken using the scanner's on-screen electronic callipers by tracing around the muscle borders. Each scan was photographed as a permanent record.

Protocol
Each subject was scanned at the same time of day on two different days. On day 1, two images were taken by GR (the scanning site remained marked on the subject) and measurements were made. On day 2, the scanning site was relocated using the transparent sheet, two images were again taken by GR, and one was taken by MS. The investigators were blind to any previous measurements, and on-screen measurements could not be seen by the investigator.

Analysis of data
Since the purpose of the present paper is to investigate which statistical tests to use, some background information is warranted (see Appendix for details). The two types of analysis used were intraclass correlations (ICCs)10,11 (on data produced from an analysis of variance [ANOVA]) and Bland and Altman12 tests. If the between-subjects and within-subjects variance is being examined, a one-way ANOVA is used. Where within-subjects variance is divided into between-raters and residual components, a two-way ANOVA is used. The mean squares (MS) for the sources of variance obtained by ANOVA in the present study design are between-subjects (BMS), within-subjects (WMS), between-raters (RMS) and residual (EMS). The mean squares from the different sources of variance are used in the ICC equations; Shrout and Fleiss10 described six equations for inter-rater reliability studies (see Appendix), and Fleiss11 provided equations for repeated measurements made by the same rater.

Inter-rater reliability
The CSA of scan 1 from day 2 (GR) was compared with the scan taken by MS on the same day. Of the six ICC equations described by Shrout and Fleiss,10 the (3,1) equation was used (see Appendix), calculated from a two-way ANOVA (carried out using SPSS for Windows version 7.5.1):

ICC (3,1) = (BMS – EMS) / [BMS + (k–1)EMS]

where k is the number of raters.
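To make this calculation concrete, the mean squares and the resulting ICC (3,1) can be computed directly from the day 2 inter-rater data given in Table 1 (see 'Results'). The following is a minimal sketch in Python with numpy; it is our own illustration, not part of the original analysis (which used SPSS), and the variable and function names are ours:

```python
import numpy as np

# CSA values (cm^2) for day 2, scan 1, by each rater, from Table 1.
gr = np.array([17.13, 16.08, 10.91, 14.96, 13.00,
               18.27, 14.99, 15.64, 10.93, 16.48])
ms = np.array([18.78, 17.42, 10.73, 15.65, 11.52,
               17.51, 15.81, 16.88, 12.19, 18.16])

def icc_3_1(ratings):
    """Shrout and Fleiss ICC (3,1): two-way ANOVA, raters fixed.
    `ratings` is an (n subjects x k raters) array."""
    n, k = ratings.shape
    grand = ratings.mean()
    # Mean squares of the two-way ANOVA decomposition (cf. Table 2).
    bms = k * ((ratings.mean(axis=1) - grand) ** 2).sum() / (n - 1)   # between-subjects
    rms = n * ((ratings.mean(axis=0) - grand) ** 2).sum() / (k - 1)   # between-raters
    ss_total = ((ratings - grand) ** 2).sum()
    ems = (ss_total - (n - 1) * bms - (k - 1) * rms) / ((n - 1) * (k - 1))  # residual
    return (bms - ems) / (bms + (k - 1) * ems)

print(round(icc_3_1(np.column_stack([gr, ms])), 2))  # 0.92, as in Table 4
```

Running this reproduces the mean squares of Table 2 (BMS = 14.21, RMS = 1.96, EMS = 0.58) and the ICC (3,1) of 0.92 reported in Table 4.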
Intra-rater reliability
Repeatability was examined only for investigator GR, using a one-way ANOVA and the following ICC equations.

Between-scans repeatability compared scan 1 with scan 2 for day 1 and day 2 separately, and ICC (1,1) was used:

ICC (1,1) = (BMS – WMS) / [BMS + (k–1)WMS]

where k is the number of measurements.

Between-days repeatability compared the mean of scans 1 and 2 from day 1 with the mean of scans 1 and 2 from day 2, using ICC (1,2). The mean of the two scans was used for each day, hence the second integer of 2 (see 'Discussion' and the Appendix):

ICC (1,2) = (BMS – WMS) / BMS

The reasons for selecting these specific ICC equations are explained in the 'Discussion' section. The inter-rater and intra-rater reliability analyses were performed separately in this example, as both raters did not make repeated measurements. A design in which a number of raters make equal numbers of observations enables one analysis to examine both types of reliability.17 An ICC of 1 indicates perfect reliability with no measurement error, whilst 0 indicates no reliability.1 The 95% CI values for each ICC were calculated using the ICC macros on the SPSS World Wide Web site (see Appendix).

Bland and Altman12 methods were used to assess agreement between scans, between days and between raters. Calculations included the mean difference between measures (d̄), the 95% CI for d̄, the standard deviation of the differences (SDdiff), the reliability coefficient and the 95% limits of agreement (see Appendix). Diagrams were also plotted to illustrate the distribution of results.

Results

The raw data are shown in Table 1, and the ANOVA results are shown in Tables 2 and 3. An example of an ICC calculation for the present data is shown in the Appendix.

Inter-rater reliability
The ICC for inter-rater reliability was high, with a coefficient from the (3,1) equation of 0.92 (see Table 4, which also shows the 95% CI for the coefficient). The Bland and Altman12 test results in Table 4 show reasonable agreement, and the distribution of values in Figure 1 shows that in most cases the differences between measures are less than 1.5 cm². There is a tendency for measurements taken by GR to be smaller than those taken by MS.

Between-scans repeatability on day 1 and day 2
Between-scans repeatability was high for both days, with coefficients from ICC (1,1) of 0.94 and 0.93 respectively (Table 4). The results from the Bland and Altman12 methods in Table 4 indicate good agreement. Figure 2, for day 1, shows that in all but one case the difference is 1 cm² or less and the points are distributed around zero. Figure 3, for day 2, suggests better agreement, with the differences in seven cases lying close to zero and one outlier.
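As a concrete check of the between-scans day 1 row of Table 4, the Bland and Altman quantities can be computed directly from the raw data in Table 1. This is a minimal sketch in Python with numpy and scipy (our own illustration; the variable names are ours, and small discrepancies with Table 4 reflect rounding in the published values):

```python
import numpy as np
from scipy import stats

# Day 1 scans 1 and 2 by GR, from Table 1 (cm^2).
scan1 = np.array([16.40, 13.73, 10.28, 14.36, 10.95,
                  16.96, 16.20, 13.94, 10.00, 18.12])
scan2 = np.array([15.94, 13.48, 10.74, 14.44, 11.64,
                  15.84, 14.06, 14.82, 10.65, 17.84])

d = scan1 - scan2
n = len(d)
d_bar = d.mean()                       # mean difference
sd_diff = d.std(ddof=1)                # SD of the differences
se = sd_diff / np.sqrt(n)              # standard error of d_bar
t = stats.t.ppf(0.975, n - 1)          # t value for a 95% CI with n-1 df
ci = (d_bar - t * se, d_bar + t * se)  # 95% CI for d_bar
loa = (d_bar - 2 * sd_diff, d_bar + 2 * sd_diff)  # 95% limits of agreement

print(f"d_bar={d_bar:.2f}, SDdiff={sd_diff:.2f}, "
      f"95% CI {ci[0]:.2f} to {ci[1]:.2f}, LoA {loa[0]:.2f} to {loa[1]:.2f}")
# d_bar=0.15, SDdiff=0.93, 95% CI -0.52 to 0.82, LoA -1.72 to 2.02
# (matching the day 1 row of Table 4 within rounding)
```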
Table 1  Cross-sectional area measurements of the anterior tibial muscle group (cm²)

                Inter-rater (day 2)     Between-scans (GR)            Between-days (GR)
                                        Day 1          Day 2          Mean of scans 1 & 2
Subject   Scan 1 (GR)  Scan 1 (MS)   Scan 1  Scan 2  Scan 1  Scan 2     Day 1    Day 2
1            17.13        18.78       16.40   15.94   17.13   16.78     16.17    16.96
2            16.08        17.42       13.73   13.48   16.08   16.31     13.61    16.20
3            10.91        10.73       10.28   10.74   10.91   10.60     10.51    10.76
4            14.96        15.65       14.36   14.44   14.96   14.70     14.40    14.83
5            13.00        11.52       10.95   11.64   13.00   12.63     11.30    12.82
6            18.27        17.51       16.96   15.84   18.27   18.57     16.40    18.42
7            14.99        15.81       16.20   14.06   14.99   15.81     15.13    15.40
8            15.64        16.88       13.94   14.82   15.64   15.22     14.38    15.43
9            10.93        12.19       10.00   10.65   10.93   13.46     10.33    12.20
10           16.48        18.16       18.12   17.84   16.48   17.51     17.98    17.00

All scans were taken by one operator (GR), except the inter-rater scan by MS.

Table 2  ANOVA results for inter-rater reliability – two-way mixed model, raters fixed

                                                                        For data in Table 1
Source of variation   Degrees of freedom (df)   Mean square (MS)   E(MS)                  df     MS
Between-subjects      n–1                       BMS                σ²e + kσ²s              9    14.21
Between-raters        k–1                       RMS                σ²e + [n/(k–1)]Σρ²j     1     1.96
Error                 (n–1)(k–1)                EMS                σ²e                     9     0.58

k is the number of raters; n is the number of subjects; σ² = variance. For the expected mean square equations see Fleiss.11

Table 3  ANOVA results for intra-rater reliability – one-way random effects model

                                                       For data in Table 1: MS
Source of variation   df       MS    E(MS)         df   Between-scans   Between-scans   Between-days
                                                        day 1           day 2
Between-subjects      n–1      BMS   σ²e + kσ²s     9   13.57           11.55           12
Within-subjects       n(k–1)   WMS   σ²e           10    0.41            0.45            0.98

k is the number of measurements; n is the number of subjects; σ² = variance.

Table 4  Results for inter-rater reliability and between-scans and between-days repeatability – ICCs and Bland and Altman tests

                             ICC                           Bland and Altman
                             Coefficient  95% CI           d̄ (cm²)  SE of d̄ (cm²)  95% CI for d̄ (cm²)  SDdiff (cm²)  Reliability coefficient  95% limits of agreement (cm²)
Inter-rater reliability      (3,1) 0.92   0.72 → 0.98      –0.63     0.34           –1.4 → 0.14           1.08         2.4                      –2.73 → 1.53
Between-scans day 1          (1,1) 0.94   0.8 → 0.99        0.15     0.29           –0.51 → 0.81          0.93         1.80                     –1.71 → 2.01
Between-scans day 2          (1,1) 0.93   0.75 → 0.98      –0.32     0.29           –0.98 → 0.34          0.93         1.88                     –2.18 → 1.54
Between-days repeatability   (1,2) 0.92   0.69 → 0.98      –0.98     0.40           –1.88 → –0.08         1.25ᵃ        2.8                      –3.48 → 1.52

d̄ is the mean difference; SE of d̄ is the standard error of the mean difference; 95% CI for d̄ is the 95% confidence interval for the mean difference; SDdiff is the standard deviation of the differences.
ᵃ The standard deviation has been corrected to take into account mean measurement values.

Between-days repeatability
Repeatability was high, with an ICC (1,2) of 0.92. The results from the Bland and Altman12 methods in Table 4 demonstrate reasonable agreement. However, Figure 4 indicates a bias, as all but one point lie below zero, i.e. the differences (day 1 – day 2) have a negative value. Thus nine of the measurements taken on day 2 were larger than those taken on day 1. This bias is also reflected in the 95% CI for d̄, which is –1.88 → –0.08 cm²; zero does not lie within the interval, which indicates a bias between the two measures.

Discussion

Measurements of the anterior tibial muscles appear to be reliable in the hands of the two investigators studied, but neither of the two statistical tests alone provides a full analysis.
Reliability of ultrasound measurements
Cross-sectional area measurement of the anterior tibial group using real-time ultrasound scanning has been demonstrated to have high repeatability, both between scans and between days, and high inter-rater reliability. Previous ultrasound reliability studies did not use ICCs or Bland and Altman12 tests to allow comparison.

[Figure 1  Distribution plot from the Bland and Altman12 test showing mean measurements against differences between measurements for inter-rater reliability. IRMEAN (cm²) is the inter-rater mean of the measurements taken by the two raters (GR and MS); IRDIFF is the inter-rater difference between the measures taken by GR and MS.]

[Figure 2  Distribution plot from the Bland and Altman12 test for between-scans repeatability, day 1. BS1MEAN (cm²) is the mean of the scan 1 and scan 2 measurements on day 1; BS1DIFF is the difference between the measurements (scan 1 – scan 2).]

[Figure 3  Distribution plot from the Bland and Altman12 test for between-scans repeatability, day 2. BS2MEAN (cm²) is the mean of the scan 1 and scan 2 measurements on day 2; BS2DIFF is the difference between the measurements (scan 1 – scan 2).]

[Figure 4  Distribution plot from the Bland and Altman12 test for between-days repeatability. BDMEAN (cm²) is the mean of the day 1 and day 2 measurements; BDDIFF is the difference between the measurements (day 1 – day 2).]

It has been argued that if a technique shows high reliability between raters, examination of intra-rater repeatability is not important.1 However, it is useful to document that a technique is reliable in the hands of a particular investigator over time. The Bland and Altman12 distribution graph (Figure 4) indicated a bias between measurements made by GR on different days, with those on day 2 being larger. It is interesting to note from the raw data in Table 1 that in the three subjects where the largest differences were seen, the higher value was similar to that of the more experienced investigator (MS). This bias towards the larger value suggests a consistent change in the interpretation of the muscle border by investigator GR on day 2. The bias was not evident from the ICC results and was not large enough to produce poor reliability results for the quantitative values produced by the tests. It is, however, worth noting, and may disappear on re-testing when the investigator gains more experience.

Inappropriate statistical tests
Certain tests are often used incorrectly for evaluating reliability. Pearson's correlation coefficient is inappropriate because it measures the strength of linear association, not agreement; it is possible to have a high degree of correlation when agreement is poor.6,7,18 A paired t-test assesses whether there is any evidence that two sets of measurements agree on average. However, it is the difference between within-subjects scores that is of interest, and taking the mean score of all subjects can provide misleading estimates: a high scatter of individual differences can result in the difference between the means being non-significant.6,14 Using the CV to calculate reliability is no longer considered appropriate in most cases, and the reasons for this have been discussed elsewhere.5,8 The CV is the standard deviation divided by the mean. For reliability calculations, the between-subjects standard deviation is often used, but this is inappropriate.
In some cases, where the standard deviation is proportional to the mean, it has been suggested that the CV is suitable. If so, the within-subjects standard deviation should be calculated rather than the between-subjects, and the data often need to be transformed. Such methods are discussed by Bland13 and Chinn.5 Results for the CV are not given in the present paper, so as not to encourage its continued use.

Selection of an ICC equation
Inter-rater reliability: When using ICCs, the equation and the reasoning behind the choice of equation should be stated clearly. The choice depends on the study design and the intent of the reliability study.10,11,18 For inter-rater reliability, the ICC (1,1) equation is unlikely to be of use clinically (see Appendix). The essential factor in choosing between (2,1) and (3,1) is the eventual application of the test or measuring system. If the aim is general application in clinical practice or research trials, ICC (2,1) is appropriate, and a greater number of raters than in the present study would be required. If testing is only to be performed by a small number of raters who are the same raters used in the reliability study, ICC (3,1) is the choice, as in the present study. The application of a procedure to measure passive ankle dorsiflexion was clearly defined in a study by Moseley and Adams20 ('it was intended to generalize the present results to a variety of judges'), and therefore ICC (2,1) was used. In the present study it was only the reliability of the specified raters that was of interest, as they will be the raters obtaining measurements in a larger ultrasound study. The same applies to a reliability study using ICC (3,1) described by Andrews et al.,21 in which the reliability of a hand-held dynamometer was reported for three raters, who then went on to collect normative values of isometric muscle force for 156 adults.

A measurement that is reported to have good reliability for general application should be interpreted with caution. The raters used in the reliability study should be a random sample from a larger population. This criterion was not fulfilled in the study by Moseley and Adams,20 where five physiotherapists working at the hospital volunteered to serve as raters; this is not a random sample and potentially introduces bias in raters' ability. The larger population of raters also needs to be defined, as this will imply who can be expected to demonstrate similar reliability to the raters in the reliability study. For example, this could be all physiotherapists of all abilities working in all areas, only physiotherapy students, or only senior physiotherapists specializing in a particular area. In addition, the level of reliability is only applicable when measuring subjects demonstrating a similar range of measurement values to that of the subjects in the reliability study (see below).

The implications of using the wrong equation need consideration. For the same set of data, ICC (1,1) will usually give lower values than ICC (2,1) or (3,1), and in most cases ICC (2,1) will give a lower value than (3,1).10 This trend is seen in the results of the present study, but the difference between the results from the different equations is minimal: ICC (1,1) = 0.90, (2,1) = 0.90, (3,1) = 0.92. Results for ICC (1,1) and ICC (2,1) were also identical in a reliability study measuring passive ankle dorsiflexion,20 and Bohannon18 did not find much difference between calculations of ICC (2,1) and (3,1).
This was not the case, though, for the example presented by Shrout and Fleiss,10 where the same data produced ICCs ranging from 0.17 to 0.91 according to the equation used. It was suggested by Moseley and Adams20 that if ICC (1,1) is mistakenly used, the result is not misleading because it underestimates the true ICC value, and this conservativeness has been advanced as a reason for using the ICC (1,1) form. While it may be better to underestimate than to overestimate reliability, gross underestimation may result in a potentially useful technique being discarded. As importantly, choice of an inappropriate equation shows a lack of understanding of the design and application of reliability studies.

Intra-rater reliability: The ICC equations used for intra-rater reliability of repeated measurements in the present study, (1,1) and (1,2), are equivalent to the inter-rater reliability equations (1,1) and (1,k) described by Shrout and Fleiss,10 but k represents time (the number of measurements) rather than raters. For between-scans repeatability, the equation was obtained from Fleiss,11 who described the same study design as that used here. Although the equation presented was the same as (1,1) from an earlier paper by the same author,10 this was not pointed out. The system for labelling the equations for inter-rater reliability does not appear to be used for intra-rater reliability; a consistent system would be useful, preferably one distinguishing between inter- and intra-rater reliability equations. For between-days repeatability, equation ICC (1,2) was used, as the mean of two measures (k = 2) was included in the analysis, hence the second digit of 2 (see Appendix).

Bland and Altman tests
Bland and Altman described a series of statistical methods for assessing agreement between two methods of clinical measurement.12–14 In relation to repeatability, it is suggested that the coefficient of repeatability is calculated, but it is unclear whether this should be used in isolation or to supplement the other tests, and the method of calculation varies between two different references.12,13 Inter-rater reliability was not discussed.

Strengths and weaknesses of ICCs and Bland and Altman tests
A comparison of ICCs and Bland and Altman methods is shown in Table 5; they have differing advantages and disadvantages.1,19 When using ICCs, the choice of equation, the study design and the intended application need to be defined clearly. A reliability coefficient such as the ICC appears easy to interpret: the closer to 1, the greater the reliability. However, interpretation is not that simple; the coefficient is just one point estimate of reliability based on one selected sample. The ICC in isolation cannot give a true picture of reliability and should be complemented by hypothesis testing and/or confidence interval construction.17 In addition, the ICC cannot be interpreted clinically because it gives no indication of the magnitude of disagreement between measurements. It should therefore be complemented by calculation of the standard error of measurement (SEM)17 or the Bland and Altman 95% limits of agreement. A major criticism of the ICC is the influence of between-subjects variance on the ratio. In simple terms, the ICC is the ratio of true score variance (between-subjects variance) to true score variance plus error. If the true score variance is sufficiently large, reliability will always appear high, and vice versa.
Table 5  Comparison of intraclass correlations and Bland and Altman statistical methods for assessing agreement

ICCs
  Advantages:
  - Inter-rater reliability can allow for fixed or random effects.
  - A single reliability coefficient is simple to understand (but see disadvantages).
  - Calculations are relatively simple for the coefficient, though more complex for the 95% CI; easy to calculate for any number of raters, data sets or mean measures.
  Disadvantages:
  - Influenced by the magnitude of between-subjects variation (possibly an advantage).
  - Potential to oversimplify if the ICC is quoted in isolation; potential to select the wrong equation.
  - Gives no indication of actual measurement values or ranges, or of any bias in measurements, and cannot be interpreted clinically (but can be used with the SEM).

Bland and Altman methods
  Advantages:
  - Graph: data are easily interpreted visually; easy to see the size and range of differences in measurements, any bias or outliers, and any relation between the size of the differences and the size of the mean.
  - 95% CI for the mean of the differences: will indicate bias in measurements.
  - 95% limits of agreement: can be related to clinical acceptability.
  - Independent of between-subjects variation (possibly an advantage).
  Disadvantages:
  - More complex analyses if there are more than two raters or data sets, if mean measures are used, or if the data need to be transformed (not fully described in the literature).
  - More complex to interpret than a single reliability coefficient (possibly an advantage).
  - 95% CI for the limits of agreement: a sample of at least 50 is needed, otherwise the limits will be very wide.

Note: Neither ICC nor Bland and Altman results can be compared directly with those from other studies.

Therefore, for a group of subjects with a wide range of CSA measurements, the ICC is likely to be greater than for a more homogeneous sample with similar muscle CSA measurements. This criticism holds for any type of reliability calculation based on between-subjects variance, such as the coefficient of variation5 (see above). The Bland and Altman methods are independent of the true variability in the observations. There is a lack of consensus regarding this issue, since it has been suggested that reliability should reflect true variability.1,7 It has been argued that reliability is relative: it reflects how well a measurement can differentiate individuals, and therefore reliability or measurement error should be contrasted with the expected variation among the subjects being tested.1 In other words, if CSA measurements were to range from 10 to 40 cm², a measurement error of 1 cm² may not be important, but if measurements ranged from 5 to 10 cm², an error of 1 cm² might not be acceptable. This issue has yet to be settled, but in the meantime the following points should be noted. The Bland and Altman 95% limits of agreement indicate a range of error, but this must be interpreted with reference to the range of measurement values obtained; Bland and Altman tests should therefore be complemented by raw data and/or ranges. If using the ICC, the between-subjects variation should be a meaningful index of reliability, i.e. the variation between subjects in the selected sample must reflect the true population of interest. Regardless of which reliability tests are selected, it appears that comparison of reliability results between studies is not possible unless the size and attributes of the samples tested in each case are virtually identical.
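The influence of between-subjects variance on the ICC, and the independence of the Bland and Altman limits of agreement from it, can be illustrated with a small simulation. This is our own sketch, not part of the original analysis; the CSA ranges, error size and sample size are hypothetical. The same 0.5 cm² measurement error yields a near-perfect ICC in a wide-ranging sample but a clearly lower ICC in a homogeneous one, while the limits of agreement stay at about ±1.4 cm² in both:

```python
import numpy as np

rng = np.random.default_rng(0)

def icc_1_1(ratings):
    """One-way random-effects ICC (1,1) for an (n x k) array."""
    n, k = ratings.shape
    grand = ratings.mean()
    bms = k * ((ratings.mean(axis=1) - grand) ** 2).sum() / (n - 1)
    wms = ((ratings - ratings.mean(axis=1, keepdims=True)) ** 2).sum() / (n * (k - 1))
    return (bms - wms) / (bms + (k - 1) * wms)

def limits_of_agreement(a, b):
    d = a - b
    return d.mean() - 2 * d.std(ddof=1), d.mean() + 2 * d.std(ddof=1)

for lo_r, hi_r in [(10, 40), (5, 10)]:            # wide vs homogeneous CSA range
    true_csa = rng.uniform(lo_r, hi_r, size=50)   # hypothetical 'true' CSAs (cm^2)
    m1 = true_csa + rng.normal(0, 0.5, size=50)   # two measurements with the
    m2 = true_csa + rng.normal(0, 0.5, size=50)   # same 0.5 cm^2 error SD
    low, high = limits_of_agreement(m1, m2)
    print(f"range {lo_r}-{hi_r} cm^2: ICC(1,1)={icc_1_1(np.column_stack([m1, m2])):.2f}, "
          f"LoA=({low:.2f}, {high:.2f})")
# Exact figures depend on the random seed, but the pattern is as described above.
```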
A distinct advantage of the ICC with regard to inter-rater reliability in the present case is its ability to allow for random or fixed effects (see Appendix). The Bland and Altman methods have two advantages over the ICC: the powerful visual representation of the degree of agreement, and the easy identification of bias, outliers and any relationship between the variance of the measures and the size of the mean.

The current situation, in which a variety of statistical tests are used, some of them inappropriate, is clearly unhelpful to the field of rehabilitation research. It appears that ICCs and/or Bland and Altman methods are emerging as the statistical analyses of choice, and it is suggested that both be used in reliability studies of similar design to that described in the present paper. It is also suggested that a standardized system for annotating the different ICC formulae is needed and that, in its absence, authors should be encouraged to indicate the type of ICC used more clearly, stating the formula and its origin as well as the rationale for its use in the context of their study design. Several authors, including those of the present paper, have put forward their suggestions for appropriate statistical analysis of reliability studies. A consensus needs to be reached to establish which tests, and which of the many results produced by these analyses, are the most relevant ones to be adopted universally.

Acknowledgements
The authors thank the subjects who took part in the study, Dr Jörg Huber (School of Life Sciences, Roehampton Institute London) for statistical advice, Dr Janine Gray (Centre for Health & Medical Research, University of Teesside) for her comments and the Living Again Trust for financial support.

References
1 Streiner DL, Norman GR. Health measurement scales: a practical guide to their development and use, second edition. Oxford: Oxford University Press, 1995: 104–27.
2 Davies M, Fleiss JL. Measuring agreement for multinomial data. Biometrics 1982; 38: 1047–51.
3 Haas M. Statistical methodology for reliability studies. J Manipulative Physiol Ther 1991; 14: 119–32.
4 Brennan P, Silman A. Statistical methods for assessing observer variability in clinical measures. BMJ 1992; 304: 1491–94.
5 Chinn S. The assessment of methods of measurement. Statist Med 1990; 9: 351–62.
6 Maher C. Pitfalls in reliability studies: some suggestions for change. Aust J Physiother 1993; 39: 5–7.
7 Riddle DL, Finucane SD, Rothstein JM, Walker ML. Intrasession and intersession reliability of hand-held dynamometer measurements taken on brain-damaged patients. Phys Ther 1989; 69: 182–94.
8 Stokes M, Hides J, Nassiri DK. Musculoskeletal ultrasound imaging: diagnostic and treatment aid in rehabilitation. Phys Ther Rev 1997; 2: 73–92.
9 Bartko JJ. The intraclass correlation coefficient as a measure of reliability. Psychol Rep 1966; 19: 3–11.
10 Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychol Bull 1979; 86: 420–28.
11 Fleiss JL. Reliability of measurement. In: Fleiss JL ed. Design and analysis of clinical experiments. New York: John Wiley & Sons, 1986: 1–32.
12 Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1986; 1: 307–10.
13 Bland M. Clinical measurement. In: Bland M ed. An introduction to medical statistics. Oxford: Oxford University Press, 1987: 265–73.
14 Altman DG. Some common problems in medical research. In: Altman DG ed.
Practical statistics for medical research, first edition. London: Chapman & Hall, 1991: 398–403.
15 Martinson H, Stokes MJ. Measurement of anterior tibial muscle size using real-time ultrasound imaging. Eur J Appl Physiol 1991; 63: 250–54.
16 Kelly SJ, Stokes MJ. Symmetry of anterior tibial muscle size measured by real-time ultrasound imaging in young females. Clin Rehabil 1993; 7: 222–28.
17 Eliasziw M, Young SL, Woodbury MG, Fryday-Field K. Statistical methodology for the concurrent assessment of interrater and intrarater reliability. Phys Ther 1994; 74: 89–100.
18 Bohannon RW. Commentary. Phys Ther 1989; 69: 190–92.
19 Krebs DE. Intraclass correlation coefficients: use and calculation. Phys Ther 1984; 64: 1581–89.
20 Moseley A, Adams R. Measurement of passive ankle dorsiflexion: procedure and reliability. Aust J Physiother 1991; 37: 175–81.
21 Andrews AW, Thomas MW, Bohannon RW. Normative values for isometric muscle force measurements obtained with hand-held dynamometers. Phys Ther 1996; 76: 248–59.

Appendix

The rationales for the intraclass correlation and Bland and Altman12 tests are outlined here in the context of reliability studies, together with the formulae used in different situations.

Intraclass correlations (ICCs)
The ICC analyses were specifically designed by Bartko9 to examine reliability, providing a reliability index that indicates the measurement error. Reliability is the ratio of the variance of interest to the sum of the variance of interest and the measurement error. There are several formulae for ICCs, which can give quite different results when applied to the same data. Each formula is appropriate for specific situations, defined by the experimental design and the potential use of the results.10,11 The ICCs are calculated from the results of an analysis of variance (ANOVA) for repeated measures. An ANOVA table for inter-rater data (see Table 2) shows the different sources of variance with their associated degrees of freedom, mean squares (MS) and errors.

Shrout and Fleiss10 described six forms of ICC for inter-rater reliability: (1,1), (2,1), (3,1), (1,k), (2,k) and (3,k). They used the term judge (JMS) where the present study uses rater (RMS); it should be noted that the SPSS macro for ICCs also uses JMS. The first integer, 1, 2 or 3, relates to three different cases (cases 1, 2 and 3 respectively); it does not indicate the type of ANOVA used, as has been suggested.20 The second integer indicates how many units of analysis are included; if there is more than one, the second integer is greater than 1 (see below). The equations given below are from Shrout and Fleiss,10 and the associated variability formulae are from Streiner and Norman.1

Case 1: Each subject is rated by a different set of k raters, randomly selected from a larger population of raters. This analysis uses one-way ANOVA results:

ICC (1,1) = (BMS – WMS) / [BMS + (k–1)WMS]

or

subject variability / (subject variability + within-subjects variability)

This design is unlikely to be practical for clinical research purposes, as it is not common, but it could perhaps be appropriate for studying reliability before providing data for single-case studies or multicentre trials. The same equation was described by Fleiss11 in the context of intra-rater reliability.

Case 2: A random sample of k raters is selected from a larger population of raters and each rater tests each subject, i.e. each rater rates n subjects altogether.
Case 2 considers the raters as random effects and should be used if there is a need to generalize to other raters within the same population. It measures the agreement of the raters and answers the question of whether the raters are interchangeable. This analysis uses two-way ANOVA results, and the variance due to the raters is included in the equation:

ICC (2,1) = (BMS – EMS) / [BMS + (k–1)EMS + k(RMS – EMS)/n]

or

subject variability / (subject variability + observer variability + random error variability)

Case 3: Each subject is rated by each of the same k raters, who are the only raters of interest, as in the present study. In this case the raters are considered to be fixed, and the reliability will reflect the accuracy of measurement for the specified raters but cannot be generalized to any other raters. This analysis uses two-way ANOVA results, but only the residual variance, not the between-raters variance, enters the equation. This is because the between-raters variance is fixed: it will always contribute the same amount to the within-subjects variance and does not need to be factored out.

ICC (3,1) = (BMS – EMS) / [BMS + (k–1)EMS]

or

subject variability / (subject variability + random error variability)

It can be seen that this is the same as (2,1) but with k(RMS – EMS)/n removed from the denominator. The calculation of ICC (3,1) using the present inter-rater data (see Table 2), where k – 1 = 1, is as follows:

ICC (3,1) = (14.21 – 0.58) / [14.21 + (k–1)0.58]
          = 13.63 / (14.21 + 0.58)
          = 13.63 / 14.79
          = 0.92

The second integer
Equations (1,k), (2,k) and (3,k) are used when the unit of analysis is a mean measurement, obtained either from more than one measurement or from more than one rater (k in this situation does not always refer to the number of raters). The reliability of a mean rating will almost always be greater than that of an individual rating.

ICC (1,k) = (BMS – WMS) / BMS
ICC (2,k) = (BMS – EMS) / [BMS + (RMS – EMS)/n]
ICC (3,k) = (BMS – EMS) / BMS

Some papers use only the mathematical symbols for the different types of variance in their equations, rather than the mean squares abbreviations, despite citing Shrout and Fleiss,10 which can cause confusion when trying to determine which equation has been used. Both the MS abbreviations and the variance symbols are shown in Tables 2 and 3 for clarification.
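As a cross-check of the values quoted in the Discussion (ICC (1,1) = 0.90, (2,1) = 0.90, (3,1) = 0.92), the three single-rater forms can be evaluated directly from the mean squares in Table 2. A minimal sketch in Python (our own; the pooling of WMS from the rater and residual sums of squares is a standard ANOVA identity, not given in the paper):

```python
# Mean squares from Table 2 (two-way ANOVA, n = 10 subjects, k = 2 raters).
BMS, RMS, EMS, n, k = 14.21, 1.96, 0.58, 10, 2

# The one-way WMS pools the between-raters and residual sums of squares.
WMS = ((k - 1) * RMS + (n - 1) * (k - 1) * EMS) / (n * (k - 1))

icc_11 = (BMS - WMS) / (BMS + (k - 1) * WMS)
icc_21 = (BMS - EMS) / (BMS + (k - 1) * EMS + k * (RMS - EMS) / n)
icc_31 = (BMS - EMS) / (BMS + (k - 1) * EMS)

print(round(icc_11, 2), round(icc_21, 2), round(icc_31, 2))  # 0.90 0.90 0.92
```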
ICC macros on the SPSS World Wide Web site
The ICC equations mentioned above can be calculated very simply by hand using data from the ANOVA. However, should the appropriate computer facilities be available, ICC macros (termed ICCSF) can be downloaded from the SPSS Web site and used to calculate ICCs and related statistics. The calculations are those described by Shrout and Fleiss.10 There are three primary macros, ICCSF1.SPS, ICCSF2.SPS and ICCSF3.SPS, relating to cases 1, 2 and 3 respectively, which require SPSS 6.0 or higher, including the MATRIX procedure. There are three abbreviated versions which only require a version of SPSS with macro capability and the MATRIX procedure. The unabbreviated macros can be used to calculate ICC values and 95% CIs, both for single raters/measurements and for the mean of k raters/measurements. In addition, they perform an F-test of the null hypothesis that the ICC = 0 in the true population. The help file for the ICC macros describes fully the different macro files and how to use them.

To perform the calculations, the relevant data file needs to be open in SPSS. A new syntax file is then opened, and two lines of syntax are necessary (upper or lower case is acceptable). Line 1: include 'ICCSF*.SPS', where the relevant file name and location are specified, for example include 'a:\ICCSF1.SPS'. Line 2: icc vars=varlist, where varlist refers to the relevant variables from the data file, for example icc vars=d1sc1gr, d1sc2gr. On the command Run followed by All, the results are reported.

Bland and Altman method of analysis
Bland and Altman described a method to assess agreement between clinical measurements.12–14 Their approach is based on analysis of the differences between measurements, and they suggest that estimation of the agreement between measures is more appropriate than a reliability coefficient or hypothesis (significance) testing. Several stages are described in analysing the data, providing different ways of expressing the results:

1) The differences between two measures are plotted against the average of the two measurements, the mean. From this graph (e.g. Figures 1–4), the size of each difference, the range of differences and their distribution about zero (perfect agreement) can be seen clearly. The distribution can also indicate a bias in the measurements, or show that the differences are related to the size of the mean, i.e. as the mean increases, so does the difference. The statistical analyses described assume a constant level of error (constant variance with increasing means); if this is not the case, the data need to be transformed, a technique described in detail elsewhere.4,5,12 Bland and Altman state that only log transformations should be applied, but it has been suggested that other transformations are sometimes more appropriate.5

2) The mean of the differences (d̄) and the standard deviation of the differences (SDdiff) are calculated. The closer d̄ is to zero and the smaller the value of SDdiff, the better the agreement between measures.

3) It is also of interest to estimate the 'true' value of d̄, which is a measure of the bias between measures, and a 95% confidence interval can be calculated: the 95% confidence interval for d̄ = d̄ ± t(n–1) × SE of d̄, where n is the number of subjects and SE = SDdiff/√n. If zero does not lie within the interval, it can be concluded that a bias exists between the two measures.4 When mean measurements are compared, the estimate of the standard deviation needs to be corrected, because some of the effect of the repeated measurement error has been removed (for details see ref. 12).

4) The coefficient of repeatability is calculated as 2√(Σd²/n), where d is the difference between two measures and it is assumed that the mean difference is zero12 (see the sketch after this list).

5) The 95% limits of agreement can be calculated as d̄ ± 2SDdiff, and 95% confidence intervals for these limits of agreement can also be calculated.12 If there are more than two repeated measures, the calculations are more complex.12 The sample size should be large enough, preferably greater than 50, to allow the limits of agreement to be estimated well,14 otherwise the confidence intervals will be very wide. Due to the small sample size, CIs for the limits were not calculated in the present study.
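For stage 4, the day 1 between-scans differences from Table 1 reproduce the reliability coefficient of 1.80 reported in Table 4. A minimal sketch (our own; the difference values are computed as scan 1 – scan 2 from Table 1):

```python
import numpy as np

# Day 1 between-scans differences (scan 1 - scan 2, cm^2), from Table 1.
d = np.array([0.46, 0.25, -0.46, -0.08, -0.69,
              1.12, 2.14, -0.88, -0.65, 0.28])

# Coefficient of repeatability, 2*sqrt(sum(d^2)/n), assuming the mean
# difference is approximately zero (stage 4 above).
coeff = 2 * np.sqrt((d ** 2).sum() / len(d))
print(round(coeff, 2))  # 1.80, as in Table 4
```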