Statistical Methods in Medical Research 2004; 13: 251–271

Sample size requirements for the design of reliability study: review and new results

MM Shoukri, Department of Epidemiology and Biostatistics, University of Western Ontario, London, Ontario, Canada, and Department of Biostatistics, Epidemiology and Scientific Computing, King Faisal Specialist Hospital and Research Centre, Riyadh, Kingdom of Saudi Arabia; MH Asyali, Department of Biostatistics, Epidemiology and Scientific Computing, King Faisal Specialist Hospital and Research Centre, Riyadh, Kingdom of Saudi Arabia; and A Donner, Department of Epidemiology and Biostatistics, University of Western Ontario, London, Ontario, Canada

Address for correspondence: MM Shoukri, Department of Biostatistics, Epidemiology and Scientific Computing, King Faisal Specialist Hospital and Research Centre, MBC-03, PO Box 3354, Riyadh 11211, Kingdom of Saudi Arabia. E-mail: shoukri@kfshrc.edu.sa

© Arnold 2004  10.1191/0962280204sm365ra

The reliability of continuous or binary outcome measures is usually assessed by estimation of the intraclass correlation coefficient (ICC). A crucial step for this purpose is the determination of the required sample size. In this review, we discuss the contributions made in this regard and derive the optimal allocation for the number of subjects k and the number of repeated measurements n that minimize the variance of the estimated ICC. Cost constraints are discussed for both normally and non-normally distributed responses, with emphasis on the case of dichotomous assessments. Tables showing optimal choices of k and n are given along with the guidelines for the efficient design of reliability studies.

1 Introduction

Measurement errors can seriously affect statistical analysis and interpretation; it therefore becomes important to assess the magnitude of such errors by calculating a reliability coefficient and assessing its precision. Although the topic of reliability has gained much attention in the literature,1–3 investigations into sample size requirements remain scarce. In this review, we revisit the issue of sample size requirements for reliability studies having either continuous or binary outcomes. In either case, the measurement of reliability must distinguish within-subject variation from between-subjects variation. A widely recognized index that possesses this property is the intraclass correlation coefficient (ICC), defined as

    ρ = σs² / (σs² + σe²)

where σs² and σe² are the among-subjects and within-subjects components of variance, respectively. It is seen that ρ is the proportion of between-subject variation relative to the total variation. In the most frequently adopted design, k subjects are each rated by the same n raters (for inter-rater reliability). A similar approach, however, can also be adopted when a single subject is assessed repeatedly on each of several occasions (test–retest reliability), or when replicates consisting of different occasions are taken on different subjects by a single rater.4 In each of these cases, and for both continuous and binary assessments, ρ can be estimated from an appropriate one-way analysis of variance (ANOVA).5,6 The question then arises as to the optimal combination (n, k) that permits an accurate estimation of ρ.
One of the limiting factors in designing and implementing reliability studies in many settings is the difficulty of arranging for the replicated observations. In clinical situations, for instance, there are typically only a few specialists available in a hospital or clinic who are willing to participate in a study and who are qualified to make the observations. For example, Walter et al.7 reported a study in which physiotherapists were required to evaluate, from videotapes, the gross motor function of children with Down's syndrome. Owing to respondent burden and other practicalities of scheduling, it was not possible to arrange for more than three or four therapists to assess a given child. Had live observations been made, it would have been unreasonable to increase the number of observers beyond this point, owing to the likely intimidation of the child, particularly since evaluations are often done in the child's home. Repeated observations may also lead to fatigue and aversion effects. Similar considerations apply in many reliability studies, and n is typically limited by a maximum tolerable respondent burden, such as in an interview or a physical assessment of clinical condition. The scientific question then concerns the optimal allocation of (n, k) for conducting a reliability study.

A second example concerns the reliability of diagnosing dysplasia in the urothelial lining of the bladder, which may be an indicator of poor prognosis in patients with superficial bladder cancer. In a British Medical Research Council (BMRC) randomized multicentre clinical trial of treatment for superficial bladder cancer, considerable discrepancies appeared to exist between the assessments of dysplasia made by local pathologists and by the reference pathologist for the trial.8 It was therefore decided to establish a board of five pathologists specializing in urology to evaluate the extent of their agreement on the assessment of dysplasia. At the time the study was designed, no formal methods for calculating appropriate sample sizes were available. However, the investigators independently assessed 100 slides, a number judged sufficient to estimate the reliability coefficient with a prespecified level of precision. We note that the number of replicates (five per slide) is fixed a priori, and the issue of sample size in this situation is entirely different from that of the previous example.

The third example is based on a study by Awni et al.9 of bioavailability/bioequivalence (BA/BE). In a typical BA/BE trial, the area under the blood concentration–time curve (AUC) is considered an important parameter. To assess the intra-subject variability with respect to a specific drug formulation, the ICC is used, and therefore several measurements of the AUC are needed from each subject. Given that in a BA/BE study subjects are paid volunteers, the study design must address issues of cost and time constraints while still producing a precise estimate of the ICC.

For a fixed number of replicates, Donner and Eliasziw10 have provided contours of exact power for selected values of k and n, while Eliasziw and Donner11 used these results to identify optimal designs that minimize the study costs. Walter et al.7 developed a simple approximation to these exact results that allows the calculation of the required value of k when n is fixed. We note, however, that reliability studies are primarily designed to estimate the level of observer agreement, with their results invariably reported as measures of such agreement.
Yet, power considerations in the design of reliability studies necessarily require specification of the hypotheses to be tested. Rejection of H0: ρ = 0 based on results from a reliability study is particularly unhelpful, since the investigator needs to know more than the fact that the observed level of reliability is unlikely to be due to chance.12 Given that reliability studies are essentially estimation procedures, it is natural to base the sample size calculations on the attainment of a specified level of precision for the estimated ρ. For example, Bonett,13 focusing on the case of a continuous outcome measure, provided sample size requirements based on the need to achieve a specified expected width for a confidence interval (CI) on ρ. These results are parallel to those of Donner,14 who focused on estimating the number of subjects required to construct a CI with fixed width about the intraclass kappa coefficient for the case of a binary outcome measure.

In this review, we revisit the literature on sample size requirements when interest is focused on estimating the ICC from a single sample of subjects. In Section 2, we discuss issues of power, efficiency and fixed-length CIs when the response variable is normally distributed. Methods that use the calculus of optimization to find the combination (n, k) that minimizes the variance of the estimated ICC when the response variable is continuous are investigated in detail. Issues of cost are investigated in the situation when both k and n are determined so that the variance of the estimated ICC is minimized subject to cost constraints. We devote Section 3 to the issue of optimal design when the assessments are binary, and present an overall discussion of our results in Section 4.

2 Continuous outcome measures

2.1 Normal case

2.1.1 Power considerations

The most commonly used model for estimating reliability is the one-way random effects model

    yij = μ + si + eij    (1)

where μ is the grand mean of all measurements in the population, si reflects the effect of subject i, and eij is the error of measurement, i = 1, 2, ..., k; j = 1, 2, ..., n. It is assumed that the subject effects {si} are normally and identically distributed with mean 0 and variance σs², the errors {eij} are normally and identically distributed with mean 0 and variance σe², and that the {si} and {eij} are independent. Letting MSA and MSW denote the among-subjects and within-subjects mean squares, respectively, the ANOVA estimator of ρ is given by

    ρ̂ = (MSA − MSW) / {MSA + (n − 1)MSW} = (F − 1) / (F + n − 1)    (2)

where F = MSA/MSW.
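To make Equations (1) and (2) concrete, the following sketch (not from the paper; the variance components 0.2 and 0.1 and the grand mean of 5 are arbitrary illustrative choices) simulates balanced one-way random effects data and computes the ANOVA estimator ρ̂ from the two mean squares.

```python
import numpy as np

def anova_icc(y):
    """One-way ANOVA estimate of the ICC, Equation (2).
    y is a k x n array: k subjects, n replicate measurements each."""
    k, n = y.shape
    subject_means = y.mean(axis=1)
    grand_mean = y.mean()
    msa = n * np.sum((subject_means - grand_mean) ** 2) / (k - 1)    # among-subjects mean square
    msw = np.sum((y - subject_means[:, None]) ** 2) / (k * (n - 1))  # within-subjects mean square
    F = msa / msw
    return (F - 1) / (F + n - 1)

# Illustration under model (1): true rho = 0.2 / (0.2 + 0.1) ≈ 0.67
rng = np.random.default_rng(1)
k, n = 30, 3
s = rng.normal(0, np.sqrt(0.2), size=(k, 1))   # subject effects with variance sigma_s^2
e = rng.normal(0, np.sqrt(0.1), size=(k, n))   # measurement errors with variance sigma_e^2
y = 5.0 + s + e
print(round(anova_icc(y), 3))
```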
Donner and Eliasziw10 investigated the values of k and n required to test H0: ρ = ρ0 versus H1: ρ > ρ0, where ρ0 is a specified criterion value of ρ. For the case n = 2, that is, test–retest data, we may use Fisher's normalizing transformation,6 analogous to the well known Fisher transformation of the Pearson product–moment or interclass correlation. Fisher showed that u = (1/2) ln{(1 + ρ̂)/(1 − ρ̂)} is nearly normally distributed with mean μ(ρ) = (1/2) ln{(1 + ρ)/(1 − ρ)} and variance 1/(k − 1.5). Let zα and zβ denote the values of the standard normal distribution corresponding to the chosen level of significance α and power 1 − β. The required number of subjects for testing H0: ρ = ρ0 versus H1: ρ = ρ1 > ρ0 is obtained directly from the previous theory as

    k = 1.5 + [(zα + zβ) / {μ(ρ0) − μ(ρ1)}]²    (3)

Table 1 gives the required values of k according to the values of ρ0 and ρ1, with α = 0.05 and β = 0.20 (zα = 1.64, zβ = 0.84).

Table 1  Number of subjects k at n = 2 (α = 0.05 and β = 0.20)

    ρ0     ρ1     k
    0.2    0.6    27
    0.2    0.8     9
    0.4    0.6    86
    0.8    0.9    46
    0.6    0.8    39

The results in Table 1 indicate that the required sample size k depends critically on the values of ρ0 and ρ1, and in particular on their difference. For example, much more effort is required to distinguish values that differ by 0.1 compared with those that differ by 0.2. Note also that larger samples are required to detect relatively small values of ρ1 for a given difference ρ1 − ρ0.

Walter et al.7 developed a simple approximation that allows the calculation of the required number of subjects k for an arbitrary number of replicates n. The interest here is in testing H0: ρ = ρ0 versus H1: ρ = ρ1 > ρ0 using F = MSA/MSW = {1 + (n − 1)ρ̂}/(1 − ρ̂), where ρ̂ is given by Equation (2). The critical value for the test statistic is C Fα,ν1,ν2, where C = 1 + nρ0/(1 − ρ0) and Fα,ν1,ν2 is the 100(1 − α) per cent point of the cumulative F-distribution with degrees of freedom ν1 = k − 1 and ν2 = k(n − 1). Their approximation uses a simple formula, avoiding the intensive numerical work required to implement exact methods and providing the investigator with increased flexibility for exploring various design options. As described by Donner and Eliasziw,10 the test of H0: ρ = ρ0 has power given by

    1 − β = Pr[F ≥ C0 Fα,ν1,ν2]    (4)

where β is the type II error and C0 = (1 + nφ0)/(1 + nφ1), with φ0 = ρ0/(1 − ρ0) and φ1 = ρ1/(1 − ρ1). To solve Equation (4), Walter et al.7 used a result by Fisher6 regarding the asymptotic distribution of Z = (1/2) ln F. Omitting details, the estimated number of subjects is given by

    k = 1 + nA(α, β) / {(n − 1)(ln C0)²}    (5)

where A(α, β) = 2(zα + zβ)².

Table 2  Approximate sample size (α = 0.05 and β = 0.10)

    ρ0     ρ1     n     |ln C0|    k
    0.2    0.4    10    0.784     32
    0.4    0.8    20    1.732      6
    0.6    0.8    10    0.941     22
    0.8    0.9    10    0.797     31
    0.2    0.6     2    0.981     36

Table 2 shows the required values of k for typical values of n and according to the given values of ρ0 and ρ1, at α = 0.05 and β = 0.10. The values indicate that the required sample size k depends on the values of ρ0 and ρ1, and particularly on their difference. The remarks made concerning the values of k in Table 1 also apply to the values of k in Table 2.
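The two sample size formulas above are simple to script directly. The following sketch (ours, not the authors' code) implements Equations (3) and (5) with one-sided normal quantiles; small differences from Tables 1 and 2 can arise from the rounding of zα and zβ.

```python
from math import log
from scipy.stats import norm

def subjects_fisher(rho0, rho1, alpha=0.05, beta=0.20):
    """Equation (3): required k for n = 2, based on Fisher's transformation."""
    mu = lambda r: 0.5 * log((1 + r) / (1 - r))
    za, zb = norm.ppf(1 - alpha), norm.ppf(1 - beta)
    return 1.5 + ((za + zb) / (mu(rho0) - mu(rho1))) ** 2

def subjects_walter(rho0, rho1, n, alpha=0.05, beta=0.10):
    """Equation (5): Walter et al. approximation for general n."""
    za, zb = norm.ppf(1 - alpha), norm.ppf(1 - beta)
    phi0, phi1 = rho0 / (1 - rho0), rho1 / (1 - rho1)
    lnC0 = log((1 + n * phi0) / (1 + n * phi1))
    return 1 + 2 * n * (za + zb) ** 2 / ((n - 1) * lnC0 ** 2)

print(subjects_fisher(0.2, 0.6))        # about 27, as in Table 1
print(subjects_walter(0.2, 0.4, n=10))  # about 32, as in Table 2
```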
2.1.2 Specified width of a CI

Giraudeau and Mary12 and Bonett13 argued that the hypothesis testing approach may not be appropriate when planning a reliability study, because one has to specify values for both ρ0 and ρ1, which may in practice be a difficult task. An alternative approach is to focus instead on the width of the CI for ρ. Indeed, in single sample problems, the results of a reliability study are usually expressed as a point estimate of ρ and its associated CI. The sample size calculations are then aimed at achieving an interval estimate that has sufficient precision. The approximate width of a 95% CI on ρ is equal to 2 zα/2 {var(ρ̂)}^(1/2), where

    var(ρ̂) = 2(1 − ρ)²{1 + (n − 1)ρ}² / {kn(n − 1)}    (6)

is the approximate variance of the ICC estimator ρ̂ as derived by Fisher,6 and zα/2 is the critical value of the standard normal distribution exceeded with probability α/2. However, Equation (6) requires samples of at least moderate size (e.g., k ≥ 30) to ensure its validity. An approximation to the sample size that yields a CI for ρ having desired width w is obtained by setting w = 2 zα/2 {var(ρ̂)}^(1/2), with ρ replaced by a 'planning value' ρ*, and then solving for k to give

    k = 8 zα/2² (1 − ρ*)²{1 + (n − 1)ρ*}² / {w² n(n − 1)}    (7)

which is then rounded up to the nearest integer. The approximation suggested by Bonett13 is k* = k + 1, where k is given by Equation (7). Table 3 gives the required sample size for typical planning values ρ*, with w = 0.2, α = 0.05 and various values of n.

Table 3  Values of k* for planning values ρ*, with w = 0.2

            ρ* = 0.6    0.7    0.8    0.9
    n = 2       158     101     52     15
    n = 3       100      67     36     11
    n = 5        71      51     28      9
    n = 10       57      42     24      8

As can be seen from Table 3, the value of k* is a decreasing function of n for any given value of ρ*. Thus, if the cost of sampling an additional subject is relatively high, it may be less costly to increase the number of replicates per subject than to increase the number of subjects. Besides the usual advantages of interval estimation over hypothesis testing, Bonett13 argued that the effect of an inaccurate planning value (required for both procedures) is more serious in the context of hypothesis testing. For example, to test H0: ρ = 0.7 at α = β = 0.05 with n = 3, the required sample size obtained using the approach of Walter et al.7 is about 3376, 786 and 167 for ρ1 = 0.725, 0.75 and 0.80, respectively. In comparison, the sample size required to estimate ρ with a 95% CI having width 0.2 is 60, 52 and 37 for ρ* = 0.725, 0.75 and 0.80, respectively.
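Equation (7) and Bonett's correction k* = k + 1 can be sketched as follows; this is our own illustration, and its rounding convention may differ slightly from that used for some Table 3 entries.

```python
from math import ceil
from scipy.stats import norm

def subjects_ci_width(rho_star, n, width=0.2, alpha=0.05):
    """Equations (6)-(7): subjects needed so that the CI for rho has
    (approximately) the desired width, plus Bonett's +1 correction."""
    z = norm.ppf(1 - alpha / 2)
    k = 8 * z**2 * (1 - rho_star)**2 * (1 + (n - 1) * rho_star)**2 / (width**2 * n * (n - 1))
    return ceil(k) + 1

print(subjects_ci_width(0.9, n=2))   # 15, matching Table 3
print(subjects_ci_width(0.6, n=10))  # 57, matching Table 3
```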
2.1.3 Efficiency requirements

In this section, it is assumed that the investigator is interested in determining the number of replicates n per subject needed to minimize the variance of the estimated ρ, where we assume the total number of measurements N = nk is fixed due to practical constraints. Following Shoukri et al.,15 substitution of N = nk into Equation (6) gives

    var(ρ̂) = f(n, ρ) = 2(1 − ρ)²{1 + (n − 1)ρ}² / {N(n − 1)}    (8)

A necessary condition for f(n, ρ) to have a minimum is that ∂f/∂n = 0, with a sufficient condition given by ∂²f/∂n² > 0.16 Differentiating f(n, ρ) with respect to n, equating to zero and solving for n, we obtain

    n0 = (1 + ρ)/ρ    (9)

Moreover, (∂²f/∂n²)|n=n0 = 4ρ³(1 − ρ)²/N > 0, and the sufficient condition for a unique minimum is therefore satisfied. Note that the range of ρ is taken as strictly positive, since within the framework of reliability studies negative values are usually not of interest. Equation (9) indicates that, when ρ = 1, n0 = 2 is the minimum number of replicates needed per subject. The smaller the value of ρ, the larger the required n, and hence a smaller number of subjects k = N/n would be needed. Table 4 shows the optimal combination (n, k) that minimizes the variance of ρ̂ for different values of ρ.

Table 4  Optimal combinations of (n, k) which minimize the variance of ρ̂ (entries are k, with var(ρ̂) in parentheses)

    ρ (n0)              N = 60           N = 90           N = 120
    0.1 (n0 = 11)       5.45 (0.011)     8.18 (0.007)     10.9 (0.005)
    0.2 (n0 = 6)        10 (0.017)       15 (0.011)       20 (0.008)
    0.3 (n0 = 4.3)      13.8 (0.020)     20.8 (0.013)     27.7 (0.010)
    0.4 (n0 = 3.5)      17.1 (0.019)     25.7 (0.013)     34.3 (0.010)
    0.5 (n0 = 3)        20 (0.017)       30 (0.011)       40 (0.008)
    0.6 (n0 = 2.7)      22.5 (0.013)     33.75 (0.008)    45 (0.006)
    0.7 (n0 = 2.4)      24.7 (0.008)     37 (0.006)       49.4 (0.004)
    0.8 (n0 = 2.25)     26.7 (0.004)     40 (0.003)       53.3 (0.002)
    0.9 (n0 = 2.1)      28.4 (0.001)     42.6 (0.001)     56.8 (0.001)

Remarks

1. Because N = nk is fixed, a larger number of replicates n leads to a much smaller number of recruited subjects, which may limit the generalizability of the study. However, if ρ is expected to be moderately high (>0.6), not more than two or three replicates per subject are required.
2. The previous remark is similar to the conclusion reached by Giraudeau and Mary,12 who based their sample size requirements on the achievement of a specified width for the 95% CI. These guidelines are also consistent with the results reported in Table 3 of Walter et al.7
3. In practice, only integer values of (n, k) are used and, because N = nk is fixed a priori, the optimum value of n should first be rounded to the nearest integer and then k = N/n rounded to the nearest integer as well. Shoukri et al.15 showed that the resulting net loss/gain in efficiency of the estimated reliability coefficient is negligible. For example, when ρ = 0.7 and N = 60, the optimal allocations are n = 2.43 and k = 24.69, giving var(ρ̂) = 0.0084, whereas the rounded integer allocations are n = 2 and k = 30, giving var(ρ̂) = 0.0087 (i.e., a net loss in efficiency of 3.7%).
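The fixed-N allocation of Equations (8) and (9) is easy to tabulate directly; the sketch below (not part of the paper) reproduces a few Table 4 entries for N = 60.

```python
def fixed_n_allocation(rho, N):
    """Equations (8)-(9): optimal n and k when the total number of
    measurements N = nk is fixed, with the resulting variance of rho-hat."""
    n0 = (1 + rho) / rho
    k0 = N / n0
    var = 2 * (1 - rho)**2 * (1 + (n0 - 1) * rho)**2 / (N * (n0 - 1))
    return n0, k0, var

for rho in (0.5, 0.7, 0.9):
    n0, k0, v = fixed_n_allocation(rho, N=60)
    print(f"rho={rho}: n0={n0:.2f}, k={k0:.1f}, var={v:.4f}")
# rho=0.5: n0=3.00, k=20.0, var=0.0167   (cf. Table 4)
# rho=0.7: n0=2.43, k=24.7, var=0.0084
# rho=0.9: n0=2.11, k=28.4, var=0.0012
```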
2.1.4 Incorporation of cost constraints

Funding constraints will often determine the cost of recruiting subjects for a reliability study. Although too small a sample may lead to a study that produces an imprecise estimate of the reliability coefficient, too large a sample may result in a waste of resources. Thus, a critical decision in many such studies is to balance the cost of recruiting subjects against the need to obtain a reasonably precise estimate of ρ. There have been attempts to address the issue of statistical power in the presence of funding constraints. Eliasziw and Donner11 presented a method for determining the number of subjects k and the number of replicates n that minimize the overall cost of conducting a reliability study, while providing acceptable power for tests of hypotheses concerning ρ. They also provided tables showing optimal choices of k and n under various cost constraints. Shoukri et al.15 addressed the issue of obtaining the combinations (n, k) that minimize the variance of ρ̂ subject to cost constraints. In constructing a flexible cost function, they adhered to the general guidelines identified by Flynn et al.17 and Eliasziw and Donner.11 First, one has to identify, at least approximately, the sampling costs and the overhead costs. The sampling cost depends primarily on the size of the sample, and includes data collection costs, travel costs, and management and other staff costs. Overhead costs, on the other hand, remain fixed regardless of sample size, including, for example, the cost of setting up the data collection forms. Following Sukhatme et al.,18 it is assumed that the overall cost function is given as

    C = c0 + k c1 + n k c2    (10)

where c0 is the fixed cost, c1 the cost of recruiting a subject, and c2 the cost of making a single observation.

Using the method of Lagrange multipliers,16 the objective function G is given as

    G = var(ρ̂) + λ(C − c0 − k c1 − n k c2)    (11)

where var(ρ̂) is given by Equation (8) and λ is the Lagrange multiplier. The necessary conditions for the minimization of G are ∂G/∂n = 0, ∂G/∂k = 0 and ∂G/∂λ = 0, with the sufficient condition for G to have a constrained relative minimum given by a theorem in Rao.16 Differentiating G with respect to n, k and λ, and equating to zero, we obtain

    n³ρc2 − n²c2(1 + ρ) − nc1(2 − ρ) + (1 − ρ)c1 = 0    (12)

and

    k = (C − c0) / (c1 + n c2)    (13)

The explicit expression for the optimal solution of these equations was given by Shoukri et al.15 as

    nopt = [ (1 + ρ)/ρ + A^(1/3)/ρ − B ] / 3    (14)

where

    A = 3ρ [3R{(R + 1)²ρ⁴ − (6R² + 4R − 2)ρ³ + 12R(R + 1)ρ² − (8R² + 10R + 2)ρ − R − 1}]^(1/2) + 9R(ρ³ − ρ² + ρ) + (ρ + 1)³,

    B = {3Rρ(ρ − 2) − (ρ + 1)²} / {ρ A^(1/3)}

and R = c1/c2. Once the value of nopt is determined for a given ρ, the corresponding optimal k for given C − c0, c1 and c2 is obtained by substitution into Equation (13). The results of the optimization procedure appear in Table 5, where, without loss of generality, we assume that C − c0 = 100. It is apparent from Table 5 that when c1 (the cost per subject) increases, the required number of subjects k decreases, while the required number of replicates per subject n increases. However, when c2 increases, both k and n decrease. On the other hand, when c1 and c2 are fixed, an increase in ρ results in a decline in the required value of n and an increase in k. This trend reflects two intuitive facts: first, it is sensible to decrease the number of items associated with the higher cost while increasing those with the lower cost; second, when ρ is large (high reproducibility) fewer replicates per subject are needed, while a higher number of subjects should be recruited. Note that this is similar to the conclusion reached in the previous section, when cost was not explicitly considered. We also note that at the higher levels of c1 and c2 the optimal allocation is quite stable with respect to changes in sampling cost. This is advantageous in practice, since it is often difficult to forecast the exact cost prior to the initiation of the study. Finally, it can be noted that by setting c1 = 0 and c2 = 1 in Equation (12) we obtain nopt = (1 + ρ)/ρ, as in Equation (9). This means that a special cost structure is implied by the optimal allocation procedure discussed in the previous section. Moreover, setting ρ = 1 in Equation (12) gives nopt = 1 + (1 + R)^(1/2), emphasizing that the ratio R = c1/c2 is an important factor in determining the optimal allocation of (n, k).
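In practice one can bypass the closed form (14) and solve the cubic (12) numerically. The sketch below is our own illustration, not the authors' code; the root-selection rule is a simple heuristic that assumes a single admissible real root greater than one.

```python
import numpy as np

def optimal_design_under_cost(rho, c1, c2, budget):
    """Solve Equation (12) for n numerically and obtain k from Equation (13);
    budget = C - c0."""
    coeffs = [rho * c2, -c2 * (1 + rho), -c1 * (2 - rho), (1 - rho) * c1]
    roots = np.roots(coeffs)
    # keep the real root that gives a sensible number of replicates (> 1)
    n = min(r.real for r in roots if abs(r.imag) < 1e-8 and r.real > 1)
    k = budget / (c1 + n * c2)
    return n, k

n, k = optimal_design_under_cost(0.9, c1=1.0, c2=1.0, budget=100)
print(round(n, 2), round(k, 1))   # about 2.57 and 28, matching the c1 = c2 = 1, rho = 0.9 cell of Table 5
```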
Table 5  Optimal values of n and k that minimize var(ρ̂) (C − c0 = 100; each cell shows nopt/kopt)

    c2      ρ      c1=0.25      c1=0.5       c1=1         c1=3        c1=5        c1=15       c1=25
    0.25    0.7    3.00/100     3.40/73      4.10/49      6.00/22     7.33/15     11.73/6     14.80/4
            0.8    2.76/106     3.15/78      3.77/51      5.45/23     6.65/15     10.60/6     13.35/4
            0.9    2.57/112     2.90/81      3.48/53      5.00/24     6.07/15      9.64/6     12.12/4
    0.5     0.7    2.74/62      3.00/50      3.44/37      4.69/19     5.60/13      8.68/5     10.82/3
            0.8    2.52/66      2.76/53      3.15/39      4.27/19     5.10/13      7.86/5      9.78/4
            0.9    2.36/70      2.57/56      2.92/41      3.93/20     4.67/14      7.16/5      8.90/4
    1       0.7    2.60/35      2.74/31      3.00/25      3.80/15     4.42/11      6.54/5      8.03/3
            0.8    2.40/38      2.53/33      2.76/26      3.48/15     4.03/11      5.93/5      7.28/3
            0.9    2.24/40      2.36/35      2.57/28      3.21/16     3.71/11      5.43/5      6.65/3
    3       0.7    2.50/13      2.54/12      2.64/11      3.00/8      3.30/7       4.42/4      5.25/2
            0.8    2.30/14      2.35/13      2.44/12      2.76/9      3.00/7       4.03/4      4.77/3
            0.9    2.10/15      2.20/14      2.28/13      2.57/9      2.81/7       3.72/4      4.38/3
    5       0.7    2.50/8       2.50/8       2.56/7       2.79/6      3.00/5       3.80/3      4.42/2
            0.8    2.30/8       2.30/8       2.37/8       2.58/6      2.76/5       3.48/3      4.03/2
            0.9    2.10/9       2.20/9       2.22/8       2.40/7      2.57/6       3.21/3      3.72/2
    15      0.7    2.40/3       2.40/3       2.47/3       2.56/2      2.64/2       3.00/2      3.30/2
            0.8    2.30/3       2.30/3       2.29/3       2.37/3      2.44/2       2.76/2      3.03/2
            0.9    2.10/3       2.10/3       2.15/3       2.22/3      2.28/3       2.57/2      2.81/2
    25      0.7    2.40/2       2.40/2       2.45/2       2.51/2      2.56/2       2.80/2      3.00/1
            0.8    2.30/2       2.30/2       2.27/2       2.32/2      2.37/2       2.58/2      2.76/1
            0.9    2.10/2       2.10/2       2.10/2       2.17/2      2.22/2       2.40/2      2.57/1

Example 1
This example is given in Shoukri et al.15 In a prospective evaluation of patients with aortic stenosis, designed to assess the accuracy of Doppler echocardiography (DE) in determining aortic valve area (AVA), an investigator wishes to demonstrate a high degree of reliability (ρ = 90%) in estimating AVA using the 'velocity integral method'. Suppose that the total cost of conducting the study is fixed at US$1600. We assume that the travel cost for a patient in going from the health centre to the tertiary hospital (where the procedure is done) is US$15, and that the administrative cost of the procedure and the cost of using the DE is US$15 per visit. It is assumed that c0, the overhead cost, is absorbed by the hospital. From Table 5, nopt for R = 1 and ρ = 0.9 is 2.57, which is rounded up to 3. From Equation (13),

    kopt = 1600 / (15 + 3 × 15) ≅ 27

That is, we need 27 patients, with three measurements each. The minimized value of var(ρ̂) is 0.00097.
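As a quick arithmetic check of Example 1 (ours, not the paper's), the quoted variance follows directly from Equation (6) applied to the rounded design.

```python
def icc_variance(rho, n, k):
    """Equation (6): large-sample variance of the ANOVA ICC estimator."""
    return 2 * (1 - rho)**2 * (1 + (n - 1) * rho)**2 / (k * n * (n - 1))

# Example 1: k = 27 patients, n = 3 measurements, rho = 0.9
print(round(icc_variance(0.9, 3, 27), 5))   # 0.00097
```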
2.2 Non-normal case

As indicated earlier, the sampling distribution and the formula for the variance of the estimated ICC rely on an assumption of normality, which in practice can only be approximately satisfied. In this regard, it should be noted that for statistical inferences in the one-way random effects ANOVA model, the distribution of the ratio of mean squares has been found to be quite robust with respect to non-normality. In particular, Scheffé19 concluded that the impact of non-normality on inferences for means is slight, but can be serious for inferences on variances of random effects whose kurtosis differs from zero. Although Scheffé's conclusions were based on inferences for the variance ratio φ = σs²/σe², they likely have similar implications for ρ = φ/(1 + φ).

2.2.1 Efficiency requirements

Tukey20 obtained the variance of the variance component estimates under various ANOVA models by employing 'polykays'. For the one-way random effects model, application of the delta method shows that, to a first order of approximation,21,22

    var(ρ̂) = 2(1 − ρ)²{1 + (n − 1)ρ}² / {kn(n − 1)} + ρ²(1 − ρ)² {γs/k + γe/(kn)}    (15)

where γs = E(si⁴)/σs⁴ and γe = E(eij⁴)/σe⁴. Note that when γs = γe = 0, var(ρ̂) reduces to the corresponding expression for the normal case, given by Equation (6). Differentiating Equation (15) with respect to n and equating to zero, the optimal value of n is obtained as

    n* = 1 + 1 / {ρ(1 + γs)^(1/2)}    (16)

Remarks

1. At γs = 0, n* is equal to n0 as given by Equation (9). Moreover, for large values of γs, reflecting increasing departures from normality, a smaller number of replicates is needed, with a correspondingly larger number of subjects. Thus the recommended strategy for choosing n and k is the same as that for the normal case.
2. The actual form of the error distribution does not influence the optimal number of replicates. However, both the error distribution and the between-subjects random effect distribution do affect the level of precision associated with ρ̂. Nonetheless, as can be seen from Equation (15), the influence of γe on the estimated precision is much smaller than the influence of γs, provided N = nk is large.

2.2.2 Incorporation of cost constraints

Again using the method of Lagrange multipliers, we construct the objective function as G = var(ρ̂) + λ(C − c0 − kc1 − nkc2), where var(ρ̂) is given by Equation (15) and λ is the Lagrange multiplier. Differentiating G with respect to n, k and λ, and equating to zero, we obtain

    n³ρc2 − n²c2(1 + ρ) − nc1(2 − ρ) + (1 − ρ)c1 + n²(n − 1)²c2ρ²(1 − ρ)²γs − (n − 1)²c1ρ²(1 − ρ)²γe = 0    (17)

as the equation for n, with k again given by Equation (13). We note that when γs = γe = 0, Equation (17) reduces to Equation (12), which was obtained for the normal case. Collecting powers of n, we obtain the following fourth-degree polynomial for the optimal n:

    c2ρ²(1 − ρ)²γs n⁴ + {ρc2 − 2c2ρ²(1 − ρ)²γs} n³ + {−c2(1 + ρ) + c2ρ²(1 − ρ)²γs − c1ρ²(1 − ρ)²γe} n² + {2c1ρ²(1 − ρ)²γe − c1(2 − ρ)} n + c1(1 − ρ) − c1ρ²(1 − ρ)²γe = 0

Although an explicit solution to this equation is available, the resulting expression is complicated and does not provide any useful insight. However, we may summarize the results of the optimization procedure as in Table 6, which provides the optimal n and k for various values of ρ, R and γs = γe. Once nopt is determined, it is substituted into Equation (13) to determine the corresponding optimal kopt as

    kopt = (C − c0)/(c1 + nopt c2) = {(C − c0)/c2} / (R + nopt)

In this brief table, where without loss of generality we set (C − c0)/c2 = 100, our principal aim is to establish the behaviour of the optimal (n, k) as ρ, R and γs = γe vary.

Table 6  Optimal values of k and n that minimize var(ρ̂) ((C − c0)/c2 = 100; each cell shows nopt/kopt)

    γs = γe    ρ      R = 0.1        R = 1          R = 10
    0.1        0.7    2.48/38.72     2.98/25.12     5.55/6.43
               0.8    2.31/41.58     2.75/26.64     5.08/6.63
               0.9    2.16/44.17     2.57/28.02     4.67/6.82
    0.5        0.7    2.43/39.46     2.91/25.55     5.36/6.51
               0.8    2.28/41.93     2.73/26.84     5.00/6.67
               0.9    2.16/44.26     2.56/28.07     4.65/6.82
    2          0.7    2.29/41.77     2.73/26.83     4.91/6.71
               0.8    2.22/43.12     2.64/27.50     4.77/6.77
               0.9    2.14/44.59     2.54/28.26     4.60/6.85
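The quartic displayed above can likewise be solved numerically rather than in closed form. The following sketch is our own illustration (the root-selection rule is a heuristic that assumes one admissible real root greater than one); it reproduces a Table 6 entry.

```python
import numpy as np

def optimal_n_nonnormal(rho, c1, c2, gs, ge):
    """Solve the fourth-degree polynomial derived from Equation (17)
    for the optimal number of replicates under non-normality."""
    a = rho**2 * (1 - rho)**2
    coeffs = [c2 * a * gs,
              rho * c2 - 2 * c2 * a * gs,
              -c2 * (1 + rho) + c2 * a * gs - c1 * a * ge,
              2 * c1 * a * ge - c1 * (2 - rho),
              c1 * (1 - rho) - c1 * a * ge]
    roots = np.roots(coeffs)
    return min(r.real for r in roots if abs(r.imag) < 1e-8 and r.real > 1)

n = optimal_n_nonnormal(0.7, c1=10, c2=1, gs=0.1, ge=0.1)
print(round(n, 2), round(100 / (10 + n), 2))   # about 5.55 and 6.43, cf. Table 6 at R = 10
```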
In Figures 1–3, we provide more detailed plots of optimal n versus R = c1/c2 for different values of ρ and γs = γe.

Figure 1  Optimal n versus R = c1/c2 for γs = γe = 0.1 and ρ = 0.7, 0.8, 0.9.
Figure 2  Optimal n versus R = c1/c2 for γs = γe = 0.5 and ρ = 0.7, 0.8, 0.9.
Figure 3  Optimal n versus R = c1/c2 for γs = γe = 2 and ρ = 0.7, 0.8, 0.9.

Most of the remarks made with respect to Table 5 hold for Table 6 as well. Moreover, from Table 6 and from the figures it is seen that, for fixed R and ρ, the optimal value of n decreases, while the optimal value of k increases, with increasing departure from normality.

3 Binary outcome measures

In assessing inter-rater reliability, a choice must be made on how to measure the condition under investigation. One of the practical aspects of this decision concerns the relative advantages of measuring the trait either on a continuous scale, as in the preceding sections, or on a categorical scale. In this review, attention is restricted to the case of dichotomous judgements. Kraemer23 pointed out that it is useful to distinguish between two conceptually distinct types of dichotomous scores. The first of these is a truly dichotomous variable, which categorizes members of a population into nonoverlapping subpopulations on a nominal scale. Examples would include a variable denoting the presence or absence of a morbid condition, such as sleep apnea. As the time and expense of monitoring the occurrence of this event on a long-term basis might be substantial, physicians might argue on practical grounds that a continuous 'surrogate variable' be used in its place. For example, the use of heart rate variability measures to examine the progress of the condition could result in a trial of shorter duration and lower cost.24 The second type of dichotomous variable arises when an inherently continuous variable is dichotomized to correspond closely to how the variable is used in clinical practice. A common example is the dichotomization of patient blood pressure scores into hypertensive and normotensive categories. Donner and Eliasziw25 investigated the statistical implications of the process of dichotomization as compared with the case of a truly dichotomous variable, and concluded that the effect of dichotomization on the efficiency of reliability estimates can be severe. In the following sections, we investigate the issues of efficiency and cost in designing reliability studies for dichotomous responses, where no distinction between the two types of variables is made.

3.1 Power considerations

We first outline sample size requirements for the case of a binary outcome measure when k subjects are each judged by two raters. An underlying model for this case was given by Mak.26 Let yij denote the binary observation for the jth rater on the ith subject, i = 1, 2, ..., k; j = 1, 2, and let p = Pr[yij = 1] be the probability that the trait is present. Then, for the special case considered here,

    P1(κ) = Pr[yi1 = 1, yi2 = 1] = p² + κp(1 − p)
    P2(κ) = Pr[yi1 = 1, yi2 = 0] + Pr[yi1 = 0, yi2 = 1] = 2p(1 − p)(1 − κ)    (18)
    P3(κ) = Pr[yi1 = 0, yi2 = 0] = (1 − p)² + κp(1 − p)

where κ may be interpreted as a coefficient of interobserver agreement.
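For simulation or planning purposes, the cell probabilities in Equation (18) can be coded directly; the sketch below (not from the paper) also draws the Table 7 frequencies for k subjects.

```python
import numpy as np

def cell_probabilities(p, kappa):
    """Cell probabilities of the common correlation model, Equation (18)."""
    p1 = p**2 + kappa * p * (1 - p)         # both raters score 1
    p2 = 2 * p * (1 - p) * (1 - kappa)      # raters disagree (either order)
    p3 = (1 - p)**2 + kappa * p * (1 - p)   # both raters score 0
    return p1, p2, p3

def simulate_counts(p, kappa, k, rng):
    """Draw the frequencies (n1, n2, n3) of Table 7 for k subjects."""
    return rng.multinomial(k, cell_probabilities(p, kappa))

rng = np.random.default_rng(0)
print(cell_probabilities(0.3, 0.6))        # (0.216, 0.168, 0.616)
print(simulate_counts(0.3, 0.6, 100, rng))
```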
This model is often referred to as the 'common correlation model', since it assumes that κi = κ, i = 1, 2, ..., k. It will also be shown in Section 3.2 that the model in Equation (18) is a special case of the beta-binomial distribution discussed by Haseman and Kupper.27 If the observed frequencies are as given in Table 7, then κ may be estimated by

    κ̂ = 1 − n2 / {2kp̂(1 − p̂)}    (19)

where p̂ = Σi Σj yij /(2k) = (2n1 + n2)/(2k) is the sample estimate of p.

Table 7  Frequencies for the two-rater agreement case

    Category            Frequency    Probability
    (1, 1)              n1           P1(κ)
    (1, 0) or (0, 1)    n2           P2(κ)
    (0, 0)              n3           P3(κ)
    Total               k            1

The estimator κ̂ has been shown under this model to be the maximum likelihood estimator of κ, with large-sample variance given by28

    var(κ̂) = {(1 − κ)/k} [(1 − κ)(1 − 2κ) + κ(2 − κ)/{2p(1 − p)}]    (20)

Donner and Eliasziw29 used the goodness-of-fit test procedure to facilitate sample size calculations that may be used to enroll a sufficient number of subjects in a study of interobserver agreement involving two raters. They showed that the number of subjects needed to test H0: κ = κ0 versus H1: κ = κ1 is given by

    k = A² { [p(1 − p)(κ1 − κ0)]² / [p² + p(1 − p)κ0] + 2[p(1 − p)(κ1 − κ0)]² / [p(1 − p)(1 − κ0)] + [p(1 − p)(κ1 − κ0)]² / [(1 − p)² + p(1 − p)κ0] }⁻¹    (21)

where A² = (z1−α/2 + z1−β)².

Example 2
Suppose that it is of interest to test H0: κ = 0.60 versus H1: κ ≠ 0.60, where κ0 = 0.60 corresponds to the value of kappa characterized as representing 'substantial' agreement.30 To ensure with 80% probability a significant result at α = 0.05 and p = 0.30 when κ1 = 0.90, the required number of subjects from Equation (21) is k = 66. In Table 8, we present values of the required number of subjects for different values of κ0, κ1 and p, at α = 0.05 and 1 − β = 0.80.

Table 8  Number of required subjects k at α = 0.05 and β = 0.20

    κ0     p      κ1 = 0.4   κ1 = 0.6   κ1 = 0.7   κ1 = 0.8   κ1 = 0.9
    0.4    0.1       –         404        179        101         64
           0.3       –         190         84         47         30
           0.5       –         165         73         41         26
    0.6    0.1      335          –       1339        334        149
           0.3      148          –        595        148         66
           0.5      125          –        502        126         55
    0.7    0.1      121        1090         –       1090        272
           0.3       52         474         –        474        118
           0.5       45         400         –        400        100
    0.8    0.1       49         195        770          –        779
           0.3       21          83        336          –        336
           0.5       18          71        282          –        282
    0.9    0.1       17          46        103        413          –
           0.3        7          20         44        177          –
           0.5        6          17         37        149          –

3.1.1 Specified width of a CI

As in the case of a continuous measurement, we may base the sample size calculation on the required width of a CI. Suppose that an interobserver agreement study is to be conducted such that the CI for the intraclass kappa statistic given by Equation (19) has a desired width w. Setting w = 2zα/2 {var(κ̂)}^(1/2), where var(κ̂) is given by Equation (20), replacing κ by a planned value κ*, and solving for k, the required number of subjects is given by

    k = {4zα/2² / w²} (1 − κ*) [(1 − κ*)(1 − 2κ*) + κ*(2 − κ*)/{2p(1 − p)}]    (22)

Example 3
Suppose that an interobserver agreement study involving two raters is designed around an anticipated value of κ* = 0.80. We also assume that the probability of a positive rating is 0.30, the desired width of the CI is w = 0.20, and the level of confidence is 0.95. Then we have

    k = {4(1.64)² / (0.2)²} (0.2) [(0.2)(−0.6) + (0.8)(1.2)/{2(0.3)(0.7)}] ≅ 117
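Equations (21) and (22) are straightforward to script; the sketch below (ours, not the authors') reproduces Example 2 and shows how Example 3's figure depends on the normal quantile used.

```python
from scipy.stats import norm

def kappa_subjects_power(kappa0, kappa1, p, alpha=0.05, beta=0.20):
    """Equation (21): subjects needed to test H0: kappa = kappa0 vs kappa = kappa1."""
    A2 = (norm.ppf(1 - alpha / 2) + norm.ppf(1 - beta)) ** 2
    q = p * (1 - p)
    num = (q * (kappa1 - kappa0)) ** 2
    s = (num / (p**2 + q * kappa0)
         + 2 * num / (q * (1 - kappa0))
         + num / ((1 - p)**2 + q * kappa0))
    return A2 / s

def kappa_subjects_ci(kappa_star, p, width=0.2, alpha=0.05):
    """Equation (22): subjects needed for a CI of given width about kappa."""
    z = norm.ppf(1 - alpha / 2)
    q = p * (1 - p)
    bracket = (1 - kappa_star) * (1 - 2 * kappa_star) + kappa_star * (2 - kappa_star) / (2 * q)
    return 4 * z**2 / width**2 * (1 - kappa_star) * bracket

print(round(kappa_subjects_power(0.6, 0.9, 0.3)))  # 66, as in Example 2
print(round(kappa_subjects_ci(0.8, 0.3)))          # about 166 with z = 1.96; Example 3 uses z = 1.64 and reports 117
```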
3.2 Efficiency requirements

We now focus attention on seeking the allocation that minimizes the variance of the estimator of ρ for fixed N = nk when the outcome measure is binary. Let yij denote the jth rating made on the ith subject, where yij = 1 if the condition is present and 0 otherwise. Landis and Koch,30 analogously to the continuous case, employed the one-way random effects model

    yij = μi + eij    (23)

where μi = μ + si for i = 1, 2, ..., k; j = 1, 2, ..., n. As in the case of a continuous outcome measure, we assume the {si} are iid with mean 0 and variance σs², the {eij} are iid with mean 0 and variance σe², and that the {si} and {eij} are independent. We may therefore write

    E(yij) = p = Pr[yij = 1] and σ² = var(yij) = p(1 − p)    (24)

Letting δ = Pr[yij = 1, yil = 1] = E(yij yil), it follows for j ≠ l and i = 1, 2, ..., k that

    δ = cov(yij, yil) + E(yij)E(yil) = ρp(1 − p) + p²    (25)

where ρ = (δ − p²)/{p(1 − p)}. It is clear from this set-up that the probability that two measurements taken on the same subject agree is Po = p² + (1 − p)² + 2ρp(1 − p). Substituting ρ = 0 in Po, we obtain the agreement expected by chance, Pe = p² + (1 − p)². Therefore, the beyond-chance agreement is given by κ = (Po − Pe)/(1 − Pe) = ρ. It is therefore clear that ρ is directly analogous to the versions of kappa described by Cohen31 and Fleiss and Cohen.32 Following Landis and Koch,30 let

    σs² = ρp(1 − p),  σe² = (1 − ρ)p(1 − p)    (26)

denote the variance components of yij. Then σ² = σs² + σe², and the corresponding estimator of ρ is given, analogously to Equation (2), by

    r* = (MSA − MSW) / {MSA + (n − 1)MSW}

where

    MSA = {Σi yi²/n − (Σi yi)²/(nk)} / (k − 1),
    MSW = {Σi yi − Σi yi²/n} / {k(n − 1)}

and yi = Σj yij. We note first that the statistic r* depends only on the subject totals yi, and not on the individual binary responses. Crowder33 and Haseman and Kupper27 demonstrated the equivalence of the ANOVA model given above and the well known beta-binomial model, which arises when, conditional on the subject effect μi, the subject total yi has a binomial distribution with conditional mean and variance given, respectively, by E(yi | μi) = nμi and var(yi | μi) = nμi(1 − μi). The parameter μi is assumed to follow the beta distribution

    f(μi) = {Γ(a + b) / (Γ(a)Γ(b))} μi^(a−1) (1 − μi)^(b−1)    (27)

where a = p(1 − ρ)/ρ and b = (1 − p)(1 − ρ)/ρ. Therefore, the ANOVA model and the beta-binomial model are virtually indistinguishable.34 Because the optimal number of replicates for the non-normal case under the former model was shown to be n* = 1 + 1/{ρ(1 + γs)^(1/2)}, and since γs is the kurtosis of the subject effect distribution, one may use the kurtosis of the beta distribution to determine the optimal number of replications. One can derive γs for the beta distribution from the recurrence

    μ′1 = p,  μ′l / μ′l−1 = {(l − 1)ρ + p(1 − ρ)} / {1 + (l − 2)ρ},  l = 2, 3, ...

where μ′l = E[μi^l]. Then γs = μ4/(μ2)², where μ4 = μ′4 − 4μ′3 μ′1 + 6μ′2 (μ′1)² − 3(μ′1)⁴ and μ2 = μ′2 − (μ′1)².21 Substituting γs into Equation (16), we obtain

    n* = 1 + p(1 − p) [(1 + ρ)(1 + 2ρ) / χ(p, ρ)]^(1/2)    (28)

where

    χ(p, ρ) = p[ρ + p(1 − ρ)][2ρ + p(1 − ρ)][3ρ + p(1 − ρ) − 4p(1 + 2ρ)] + (1 + ρ)(1 + 2ρ)[6p³(1 − p)ρ + 3p⁴ + p²(1 − p)²ρ²]

Table 9 shows the optimal number of replications n* and the corresponding optimal number of subjects k = N/n*. In contrast to the continuous measurement model, the optimal allocation in the case of a binary outcome measure depends on the mean of the response variable, p. We also note that, for fixed N, the optimal allocations are equivalent for p and 1 − p.

Table 9  Optimal allocation at N = 60 for a binary response variable (each cell shows n*/k)

    p      ρ = 0.4     ρ = 0.5     ρ = 0.6     ρ = 0.7     ρ = 0.8     ρ = 0.9
    0.1    1.81/33     1.64/37     1.53/39     1.46/41     1.40/43     1.36/44
    0.3    2.36/25     2.10/29     1.94/31     1.82/33     1.73/35     1.65/36
    0.5    2.53/24     2.25/27     2.08/29     1.95/31     1.85/32     1.77/34
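Equation (28) can be evaluated directly; the following sketch (our illustration, not the paper's code) reproduces several Table 9 entries for N = 60.

```python
def optimal_n_binary(p, rho):
    """Equation (28): optimal number of replicates for a binary trait."""
    chi = (p * (rho + p * (1 - rho)) * (2 * rho + p * (1 - rho))
             * (3 * rho + p * (1 - rho) - 4 * p * (1 + 2 * rho))
           + (1 + rho) * (1 + 2 * rho)
             * (6 * p**3 * (1 - p) * rho + 3 * p**4 + p**2 * (1 - p)**2 * rho**2))
    return 1 + p * (1 - p) * ((1 + rho) * (1 + 2 * rho) / chi) ** 0.5

N = 60
for p, rho in [(0.1, 0.4), (0.3, 0.7), (0.5, 0.9)]:
    n_star = optimal_n_binary(p, rho)
    print(f"p={p}, rho={rho}: n*={n_star:.2f}, k={N / n_star:.0f}")
# p=0.1, rho=0.4: n*=1.81, k=33   (cf. Table 9)
# p=0.3, rho=0.7: n*=1.82, k=33
# p=0.5, rho=0.9: n*=1.77, k=34
```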
Remarks

1. When p is small, as few as two replicates are required, with a correspondingly larger value for the required number of subjects. When p = 0.5, fewer subjects should be recruited, with not more than three replicates per subject.
2. There is a noted similarity between the results given in Tables 4 and 9. In both cases, higher values of ρ imply that as few as n = 2 replicates are needed and hence that a larger number of subjects should be recruited. In particular, when p = 0.5 and 0.6 ≤ ρ ≤ 0.8, the optimal allocation for the case of a binary outcome measure is close to that required in the case of a continuous outcome measure.

4 Discussion

A crucial decision facing a researcher at the design stage of a reliability study is the determination of the required values of n and k. When the investigator has prior knowledge of what is regarded as an acceptable level of reliability, the hypothesis testing approach may be used and sample size calculations can be performed using the results of Donner and Eliasziw10 and Walter et al.7 However, in many cases, values of the reliability coefficient under the null and alternative hypotheses may be difficult to specify. Moreover, the estimated value of the ICC depends on the level of heterogeneity of the sampled subjects: the greater the heterogeneity, the higher the value of the ICC. As most reliability studies focus on estimation of the ICC, a principal aim of this review has been to provide the values of (n, k) that maximize the precision of the estimated ICC. It is fortuitous that this approach produces estimates of sample size that are in close agreement with results from procedures based on power considerations.

An overall conclusion from the preceding results is that, for both continuous and binary outcome measures, the variance of the estimated ICC is minimized with only a small number of replicates, provided the true value of the ICC is reasonably high. In many clinical investigations an ICC of at least 0.60 is required as the minimal acceptable value. Under such circumstances, one can safely recommend only two or three replications per subject. Finally, it is noted that in practice the optimal allocations must be integer values, and that the net loss/gain in precision as a result of rounding the values of (n, k) is negligible. Ideally, one should adopt one of the available combinatorial optimization algorithms, often referred to as integer programming models. These models are suited to the optimal allocation problems that we have reviewed in this study, as the main concern is to find the best solution(s) in a well defined discrete space. This topic needs further investigation.

References

1. Dunn G. Design and analysis of reliability studies. Statistical Methods in Medical Research 1992; 1: 123–57.
2. Shoukri MM. Agreement. In Armitage P, Colton T, eds. Encyclopedia of biostatistics. New York: John Wiley & Sons, 1999: 117–30.
3. Shoukri MM. Agreement. In Gail MH, Benichou J, eds. Encyclopedia of epidemiologic methods. New York: John Wiley & Sons, 2000: 43–49.
4. Haggard ER. Intraclass correlation and the analysis of variance. New York: Dryden Press, 1958.
5. Elston R. Response to query: estimating 'heritability' of a continuous trait. Biometrics 1977; 33: 232–33.
6. Fisher RA. Statistical methods for research workers. London: Oliver & Boyd, 1925.
7. Walter SD, Eliasziw M, Donner A. Sample size and optimal design for reliability studies. Statistics in Medicine 1998; 17: 101–10.
8. Freedman LS, Parmar MKB, Baker SG. The design of observer agreement studies with binary assessments. Statistics in Medicine 1993; 12: 165–79.
9. Awni WM, Skaar DJ, Schwenk MH. Interindividual and intraindividual variability in labetalol pharmacokinetics. Journal of Clinical Pharmacology 1988; 28: 344–49.
10. Donner A, Eliasziw M. Sample size requirements for reliability studies. Statistics in Medicine 1987; 6: 441–48.
11. Eliasziw M, Donner A. A cost-function approach to the design of reliability studies. Statistics in Medicine 1987; 6: 647–55.
12. Giraudeau B, Mary JY. Planning a reproducibility study: how many subjects and how many replicates per subject for an expected width of the 95 per cent confidence interval of the intraclass correlation coefficient. Statistics in Medicine 2001; 20: 3205–14.
13. Bonett DG. Sample size requirements for estimating intraclass correlations with desired precision. Statistics in Medicine 2002; 21: 1331–35.
14. Donner A. Sample size requirements for interval estimation of the intraclass kappa statistic. Communications in Statistics – Simulation 1999; 28: 415–29.
15. Shoukri MM, Asyali MH, Walter SD. Issues of cost and efficiency in the design of reliability studies. Biometrics 2003; 59: 1107–12.
16. Rao SS. Optimization: theory and applications. New Delhi: Wiley Eastern Limited, 1984.
17. Flynn NT, Whitley E, Peters T. Recruitment strategy in a cluster randomized trial: cost implications. Statistics in Medicine 2002; 21: 397–405.
18. Sukhatme PV, Sukhatme BV, Sukhatme S, Asok C. Sampling theory of surveys with applications. Ames, IA: Iowa State University Press, 1984.
19. Scheffé H. The analysis of variance. New York: John Wiley & Sons, 1959.
20. Tukey JW. Variance of variance components I: balanced designs. Annals of Mathematical Statistics 1956; 27: 722–36.
21. Kendall M, Stuart A. The advanced theory of statistics, Vol 1. London: Griffin, 1986.
22. Hemmersley IM. The unbiased estimate and standard error of the intraclass variance. Metron 1949; 15: 189–205.
23. Kraemer HC. Ramification of a population model for κ as a coefficient of reliability. Psychometrika 1979; 44: 461–72.
24. Frederic R, Gaspoz JM, Fortune IC, Minini P, Pichot V, Duverney D, Lacow JR, Barthelemy JC. Screening of obstructive sleep apnea syndrome by heart rate variability analysis. Circulation 1999; 100: 1411–15.
25. Donner A, Eliasziw M. Statistical implications of the choice between a dichotomous or continuous trait in studies of interobserver agreement. Biometrics 1994; 50: 550–55.
26. Mak TK. Analyzing intraclass correlation for dichotomous variables. Applied Statistics 1988; 20: 37–46.
27. Haseman JK, Kupper LL. Analysis of dichotomous response data from certain toxicological experiments. Biometrics 1979; 35: 281–93.
28. Bloch DA, Kraemer HC. 2 × 2 kappa coefficients: measures of agreement or association. Biometrics 1989; 45: 269–87.
29. Donner A, Eliasziw M. The goodness-of-fit approach to inference procedures for the kappa statistic: confidence interval construction, significance testing, and sample size estimation. Statistics in Medicine 1992; 11: 1511–19.
30. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977; 33: 159–74.
31. Cohen JA. A coefficient of agreement for nominal scales. Educational and Psychological Measurement 1960; 20: 37–46.
32. Fleiss J, Cohen JA. The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and Psychological Measurement 1973; 33: 613–19.
33. Crowder M. Beta-binomial ANOVA for proportions. Applied Statistics 1978; 27: 34–37.
34. Cox DR, Snell EJ. Analysis of binary data. London: Chapman and Hall, 1989.