Statistical Methods in Medical Research 2004; 13: 251–271
Sample size requirements for the design of
reliability study: review and new results
MM Shoukri Department of Epidemiology and Biostatistics, University of Western Ontario,
London, Ontario, Canada and Department of Biostatistics, Epidemiology and Scientific
Computing, King Faisal Specialist Hospital and Research Centre, Riyadh, Kingdom of Saudi
Arabia, MH Asyali Department of Biostatistics, Epidemiology and Scientific Computing,
King Faisal Specialist Hospital and Research Centre, Riyadh, Kingdom of Saudi Arabia and
A Donner Department of Epidemiology and Biostatistics, University of Western Ontario,
London, Ontario, Canada
The reliability of continuous or binary outcome measures is usually assessed by estimation of the intraclass
correlation coefficient (ICC). A crucial step for this purpose is the determination of the required sample
size. In this review, we discuss the contributions made in this regard and derive the optimal allocation for
the number of subjects k and the number of repeated measurements n that minimize the variance of the
estimated ICC. Cost constraints are discussed for both normally and non-normally distributed responses,
with emphasis on the case of dichotomous assessments. Tables showing optimal choices of k and n are
given, along with guidelines for the efficient design of reliability studies.
1 Introduction
Measurement errors can seriously affect statistical analysis and interpretation;
it therefore becomes important to assess the magnitude of such errors by calculating
a reliability coefficient and assessing its precision. Although the topic of reliability
has gained much attention in the literature,1–3 investigations into sample size
requirements remain scarce. In this review, we revisit the issue of sample size
requirements for reliability studies having either continuous or binary outcomes.
In either case, the measurement of reliability must distinguish within-subject variation
from between-subjects variation. A widely recognized index that possesses this
property is the intraclass correlation coefficient (ICC), defined as ρ = σs²/(σs² + σe²),
where σs² and σe² are the among-subjects and within-subjects components of variance,
respectively.
Thus ρ is the proportion of the total variation that is attributable to variation between
subjects. In the most frequently adopted design, k subjects are each rated by the same n
raters (for inter-rater reliability). A similar approach, however, can also be adopted
when each subject is assessed repeatedly over several occasions (test–retest
reliability), or when replicates taken on different occasions are obtained for different
Address for correspondence: MM Shoukri, Department of Biostatistics, Epidemiology and Scientific
Computing, King Faisal Specialist Hospital and Research Centre, MBC-03, PO Box 3354, Riyadh
11211, Kingdom of Saudi Arabia. E-mail: shoukri@kfshrc.edu.sa
© Arnold 2004
10.1191/0962280204sm365ra
Downloaded from smm.sagepub.com at Orebro County Council on June 8, 2016
subjects by a single rater.4 In each of these cases, and for both continuous and binary
assessments, r can be estimated from an appropriate one-way analysis of variance
(ANOVA).5,6 The question then arises as to the optimal combination (n, k) that permits
an accurate estimation of r. One of the limiting factors in designing and implementing
reliability studies in many settings is the difficulty of arranging for the replicated
observations. In clinical situations, for instance, there are typically only a few specialists
available, in a hospital or clinic, who are willing to participate in a study and who are
qualified to make the observations.
For example, Walter et al.7 reported a study in which physiotherapists were required
to evaluate the gross motor function of children with Down’s syndrome, using
videotapes. Owing to respondent burden and other practicalities of scheduling, it was
not possible to arrange for more than three or four therapists to assess a given child.
Had live observations been made, it would have been unreasonable to increase the
number of observers beyond this point, due to the likely intimidation of the child,
particularly since evaluations are often done in the child's home. Repetitive observations
may also lead to fatigue and aversion effects. Similar considerations apply in many reliability
studies, and n is typically limited by a maximum tolerable respondent burden, such as in
an interview or physical assessment of clinical condition. The scientific question then
concerns the optimal allocation of (n, k) in conducting a reliability study.
A second example concerns the reliability of diagnosing dysplasia in the urothelial lining of the
bladder, which may be an indicator of poor prognosis in patients with superficial bladder
cancer. However, in a British Medical Research Council (BMRC) randomized multi-centre
clinical trial of treatment for superficial bladder cancer, considerable discrepancies appeared
to exist between the assessment of dysplasia by local pathologists and by the reference
pathologist for the trial.8 Therefore, it was decided to establish a board of five pathologists
who specialized in urology to evaluate the extent of their agreement on the assessment of
dysplasia. At the time of design of the study, no formal methods for calculating appropriate
sample sizes were available. However, the investigators independently assessed 100 slides, a
number chosen so as to estimate a reliability coefficient with a prespecified level of precision.
We note that the number of replicates (five per slide) is fixed a priori, and the issue of sample
size in this situation is entirely different from that of the previous example.
The third example is based on a bioavailability/bioequivalence (BA/BE) study by Awni et al.9
In a typical BA/BE trial, the area under the blood concentration–time
curve (AUC) is considered an important parameter. To assess the intra-subject variability
with respect to a specific drug formulation, the ICC is used, and therefore several
measurements of the AUC are needed from each subject. Given that in a BA/BE study
subjects are paid volunteers, the study design must address issues of cost and time
constraints while producing a precise estimate of the ICC. For a fixed number of replicates,
Donner and Eliasziw10 have provided contours of exact power for selected values of k
and n, while Eliasziw and Donner11 used these results to identify optimal designs that
minimize the study costs. Walter et al.7 developed a simple approximation to these exact
results that allows the calculation of the required value of k when n is fixed. We note,
however, that reliability studies are primarily designed to estimate the level of observer
agreement, with their results invariably reported as measures of such agreement. Yet
power considerations in the design of reliability studies necessarily require specification
of the hypotheses to be tested. Rejection of H0: ρ = 0 based on results from a reliability
study is particularly unhelpful, since the investigator needs to know more than the fact
that the observed level of reliability is unlikely to be due to chance.12
Given that reliability studies are essentially estimation procedures, it is natural to base
the sample size calculations on the attainment of a specified level of precision for the
estimated r. For example, Bonett,13 focusing on the case of a continuous outcome
measure, provided sample size requirements based on the need to achieve a specified
expected width for a confidence interval (CI) on ρ. These results parallel those of
Donner,14 who focused on estimating the number of subjects required to construct a CI
of fixed width about the intraclass kappa coefficient for the case of a binary outcome
measure.
In this review, we revisit the literature on sample size requirements when interest is
focused on estimating the ICC reliability from a single sample of subjects. In Section 2,
we discuss issues of power, efficiency and fixed length CIs when the response variable is
normally distributed. Methods that use the calculus of optimization to find the
combination (n, k) that minimizes the variance of the estimated ICC when the response
variable is continuous are investigated in detail. Issues of cost are investigated in the
situation when both k and n are determined so that the variance of the estimated ICC is
minimized subject to cost constraints. We devote Section 3 to the issue of optimal design
when the assessments are binary, and present an overall discussion of our results in
Section 4.
2 Continuous outcome measures
2.1 Normal case
2.1.1 Power considerations
The most commonly used model for estimating reliability is the one-way random
effects model,
  y_ij = μ + s_i + e_ij   (1)
where μ is the grand mean of all measurements in the population, s_i reflects the effect of
subject i, and e_ij is the error of measurement, i = 1, 2, …, k; j = 1, 2, …, n. It is assumed
that the subject effects {s_i} are normally and identically distributed with mean 0 and
variance σs², that the errors {e_ij} are normally and identically distributed with mean 0
and variance σe², and that the {s_i} and {e_ij} are independent. Letting MSA and MSW denote
the among-subjects and within-subjects mean squares, respectively, the ANOVA estimator of ρ is given by

  ρ̂ = (MSA − MSW)/{MSA + (n − 1)MSW} = (F − 1)/(F + n − 1)   (2)

where F = MSA/MSW.
Donner and Eliasziw10 investigated the values of k and n required to test H0: ρ = ρ0
versus H1: ρ > ρ0, where ρ0 is a specified criterion value of ρ. For the case n = 2, that is,
test–retest data, we may use Fisher’s normalizing transformation,6 analogous to the
well known Fisher transformation of the Pearson product–moment or interclass
correlation. Fisher showed that u = (1/2) ln{(1 + ρ̂)/(1 − ρ̂)} is nearly normally distributed with mean m(ρ) = (1/2) ln{(1 + ρ)/(1 − ρ)} and variance 1/(k − 1.5).
Let zα and zβ denote the values of the standard normal distribution corresponding to
the chosen level of significance α and power 1 − β. The required number of subjects for
testing H0: ρ = ρ0 versus H1: ρ = ρ1 > ρ0 is obtained directly from the previous theory as

  k = 1.5 + {(zα + zβ)/(m(ρ0) − m(ρ1))}²   (3)

Table 1 gives the required values of k according to the values of ρ0 and ρ1, with
α = 0.05 (zα = 1.64) and β = 0.20 (zβ = 0.84).
The results in Table 1 indicate that the required sample size k depends critically on
the values of ρ0 and ρ1 and, in particular, on their difference. For example, much more
effort is required to distinguish values that differ by 0.1 compared with those that differ
by 0.2. Note also that larger samples are required to detect relatively small values of ρ1
for a given difference ρ1 − ρ0.
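As a quick numerical check on Equation (3), the calculation can be scripted; the following sketch (function name ours, not from the paper) reproduces the entries of Table 1:

```python
import math

def subjects_test_retest(rho0, rho1, z_alpha=1.64, z_beta=0.84):
    """Equation (3): subjects needed to test H0: rho = rho0 versus
    H1: rho = rho1 > rho0 with n = 2 ratings per subject.
    Defaults give alpha = 0.05 (one-sided) and power 0.80."""
    m = lambda r: 0.5 * math.log((1 + r) / (1 - r))  # Fisher's transformation
    k = 1.5 + ((z_alpha + z_beta) / (m(rho0) - m(rho1))) ** 2
    return round(k)

print(subjects_test_retest(0.2, 0.6))  # 27, the first entry of Table 1
```

Rounding to the nearest integer reproduces the tabulated values.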
Walter et al.7 developed a simple approximation that allows the calculation of the
required number of subjects k for an arbitrary number of replicates n. The interest
here is in testing H0: ρ = ρ0 versus H1: ρ = ρ1 > ρ0 using F = MSA/MSW =
{1 + (n − 1)ρ̂}/(1 − ρ̂), where ρ̂ is given by Equation (2). The critical value for the test
statistic is C F(α; ν1, ν2), where C = 1 + {nρ0/(1 − ρ0)} and F(α; ν1, ν2) is the 100(1 − α) per cent
point of the cumulative F-distribution with degrees of freedom given by ν1 = k − 1 and
ν2 = k(n − 1). Their approximation uses a simple formula, avoiding the intensive
numerical work required to implement exact methods and giving the investigator
increased flexibility for exploring various design options.
As described by Donner and Eliasziw,10 the test of H0: ρ = ρ0 against H1: ρ = ρ1 has power given by

  1 − β = Pr[F ≥ C0 F(α; ν1, ν2)]   (4)

where β is the type II error and C0 = (1 + nφ0)/(1 + nφ), with φ0 = ρ0/(1 − ρ0)
and φ = ρ1/(1 − ρ1). To solve Equation (4), Walter et al.7 used a result by Fisher6
regarding the asymptotic distribution of Z = (1/2) ln F. Omitting details, the estimated
number of subjects is given by

  k = 1 + nA(α, β)/{(n − 1)(ln C0)²}   (5)

where A(α, β) = 2(zα + zβ)².
Table 1 Number of subjects k at n = 2 (α = 0.05 and β = 0.20)

  ρ0     ρ1     k
  0.2    0.6    27
  0.2    0.8     9
  0.4    0.6    86
  0.8    0.9    46
  0.6    0.8    39
Table 2 Approximate sample size (α = 0.05 and β = 0.10)

  ρ0     ρ1     n     ln C0    k
  0.2    0.4    10    0.784    32
  0.4    0.8    20    1.732     6
  0.6    0.8    10    0.941    22
  0.8    0.9    10    0.797    31
  0.2    0.6     2    0.981    36
Table 2 shows the required values of k for typical values of n, according to the
given values of ρ0 and ρ1, at α = 0.05 and β = 0.10. They indicate that the required
sample size k depends on the values of ρ0 and ρ1 and, particularly, on their difference.
The remarks made concerning the values of k in Table 1 also apply to the values of k
in Table 2.
2.1.2 Specified width of a CI
Giraudeau and Mary12 and Bonett13 argued that the hypothesis-testing approach
may not be appropriate when planning a reliability study, because one has to
specify values for both ρ0 and ρ1, which may in practice be a difficult task. An
alternative approach is instead to focus on the width of the CI for ρ. Indeed, in single
sample problems, the results of a reliability study are usually expressed as a point
estimate of r and its associated CI. The sample size calculations are then aimed at
achieving an interval estimate that has sufficient precision.
The approximate width of a 95% CI on ρ is equal to 2 z(α/2) {var(ρ̂)}^(1/2), where

  var(ρ̂) = 2(1 − ρ)²{1 + (n − 1)ρ}²/{kn(n − 1)}   (6)

is the approximate variance of the ICC estimator ρ̂ as derived by Fisher,6 and z(α/2) is the critical
value of the standard normal distribution exceeded with probability α/2. However,
Equation (6) requires samples of at least moderate size (e.g., k ≥ 30) to ensure its
validity. An approximation to the sample size that yields a CI for ρ having
desired width w is obtained by setting w = 2 z(α/2) {var(ρ̂)}^(1/2), with ρ replaced by a
'planning value' ρ*, and then solving for k to give

  k = 8 z(α/2)² (1 − ρ*)²{1 + (n − 1)ρ*}²/{w² n(n − 1)}   (7)
which is then rounded up to the nearest integer. The approximation suggested by
Bonett13 is k* = k + 1, where k is given by Equation (7).
Table 3 gives the required sample size for typical planning values ρ* with w = 0.2,
α = 0.05 and various values of n.
As can be seen from Table 3, the value of k* is a decreasing function of n for any
given value of ρ*. Thus, if the cost of sampling an additional subject is relatively high,
it may be less costly to increase the number of replicates per subject than to increase
Table 3 Values of k* for planned values of ρ*, with w = 0.2

  n      ρ* = 0.6   ρ* = 0.7   ρ* = 0.8   ρ* = 0.9
  2      158        101        52         15
  3      100        67         36         11
  5      71         51         28         9
  10     57         42         24         8
the number of subjects. Besides the usual advantages of interval estimation over
hypothesis testing, Bonett13 argued that the effect of an inaccurate planning value
(required for both procedures) is more serious in the context of hypothesis testing. For
example, to test H0: ρ = 0.7 at α = β = 0.05 with n = 3, the required sample size
obtained using the approach of Walter et al.7 is about 3376, 786 and 167 for
ρ1 = 0.725, 0.75 and 0.80, respectively. In comparison, the sample size required to
estimate ρ with a 95% CI having width 0.2 is 60, 52 and 37 for ρ* = 0.725, 0.75
and 0.80, respectively.
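Equation (7) with Bonett's correction k* = k + 1 can be sketched as follows (function name ours); it reproduces, for example, the Table 3 entries for ρ* = 0.9 at n = 2 and ρ* = 0.7 at n = 5:

```python
def subjects_ci_width(rho, n, w=0.2, z=1.96):
    """Equation (7) plus Bonett's adjustment k* = k + 1: subjects needed so a
    95% CI for rho has (approximate) width w, for planning value rho."""
    k = 8 * z**2 * (1 - rho)**2 * (1 + (n - 1) * rho)**2 / (w**2 * n * (n - 1))
    return round(k) + 1  # rounding conventions may differ by 1 from Table 3

print(subjects_ci_width(0.9, 2))  # 15
print(subjects_ci_width(0.7, 5))  # 51
```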
2.1.3 Efficiency requirements
In this section, it is assumed that the investigator is interested in determining
the number of replicates n per subject needed to minimize the variance of the
estimated ρ, where we assume the total number of measurements N = nk is fixed
due to practical constraints. Following Shoukri et al.,15 substitution of N = nk into
Equation (6) gives

  var(ρ̂) = f(n, ρ) = 2(1 − ρ)²{1 + (n − 1)ρ}²/{N(n − 1)}   (8)
(8)
A necessary condition for f(n, ρ) to have a minimum is that ∂f/∂n = 0, with a sufficient
condition given by ∂²f/∂n² > 0.16 Differentiating f(n, ρ) with respect to n, equating to
zero and solving for n, we obtain

  n0 = (1 + ρ)/ρ   (9)

Moreover, (∂²f/∂n²)|n=n0 = 4ρ³(1 − ρ)²/N > 0, and the sufficient condition for a
unique minimum is therefore satisfied. Note that the range of ρ is strictly positive,
since within the framework of reliability studies negative values are usually not of
interest. Equation (9) indicates that, when ρ = 1, n0 = 2 is the minimum number
of replicates needed per subject. The smaller the value of ρ, the larger the required
n, and hence the smaller the number of subjects k = N/n that would be needed. Table 4
shows the optimal combination (n, k) that minimizes the variance of ρ̂ for different
values of ρ.
Table 4 Optimal combinations of (n, k) that minimize the variance of ρ̂ (entries are k, with var(ρ̂) in parentheses)

  ρ (n0)        N = 60           N = 90           N = 120
  0.1 (11)      5.45 (0.011)     8.18 (0.007)     10.9 (0.005)
  0.2 (6)       10 (0.017)       15 (0.011)       20 (0.008)
  0.3 (4.3)     13.8 (0.020)     20.8 (0.013)     27.7 (0.010)
  0.4 (3.5)     17.1 (0.019)     25.7 (0.013)     34.3 (0.010)
  0.5 (3)       20 (0.017)       30 (0.011)       40 (0.008)
  0.6 (2.7)     22.5 (0.013)     33.75 (0.008)    45 (0.006)
  0.7 (2.4)     24.7 (0.008)     37 (0.006)       49.4 (0.004)
  0.8 (2.25)    26.7 (0.004)     40 (0.003)       53.3 (0.002)
  0.9 (2.1)     28.4 (0.001)     42.6 (0.001)     56.8 (0.001)
Remarks
1. Because N = nk is fixed, a larger number of replicates n leads to a much smaller
number of recruited subjects, which may limit the generalizability of the
study. However, if ρ is expected to be moderately high (>0.6), not more than two
or three replicates per subject are required.
2. The previous remark is similar to the conclusion made by Giraudeau and Mary,12
who based their sample size requirements on the achievement of a specific width for
the 95% CI. These guidelines are also consistent with the results reported in Table
3 of Walter et al.7
3. In practice, only integer values of (n, k) are used and, because N = nk is fixed a
priori, the optimum value of n should first be rounded to the nearest integer and then
k = N/n rounded to the nearest integer as well. Shoukri et al.15 showed that the
net loss/gain in efficiency of the estimated reliability coefficient is negligible. For
example, when ρ = 0.7 and N = 60, the optimal allocations are n = 2.43 and k = 24.69,
giving var(ρ̂) = 0.0084, whereas the rounded integer allocations are n = 2 and
k = 30, giving var(ρ̂) = 0.0087 (i.e., a net loss in efficiency of 3.7%).
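The trade-off described in Remark 3 can be checked directly from Equations (8) and (9); the sketch below reproduces the ρ = 0.7, N = 60 comparison (numbers as in the text, up to rounding):

```python
def var_icc(rho, n, N):
    """Equation (8): variance of the estimated ICC when N = nk total
    measurements are split into k = N/n subjects with n replicates."""
    return 2 * (1 - rho)**2 * (1 + (n - 1) * rho)**2 / (N * (n - 1))

rho, N = 0.7, 60
n_opt = (1 + rho) / rho          # Equation (9): about 2.43 replicates
v_opt = var_icc(rho, n_opt, N)   # about 0.0084
v_int = var_icc(rho, 2, N)       # rounded allocation n = 2, k = 30: about 0.0087
loss = v_int / v_opt - 1         # a few per cent efficiency loss
```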
2.1.4 Incorporation of cost constraints
Funding constraints will often determine the cost of recruiting subjects for a
reliability study. Although too small a sample may lead to a study that produces an
imprecise estimate of the reliability coefficient, too large a sample may result in a waste
of resources. Thus a critical decision in many such studies is to balance the cost of
recruiting subjects with the need to obtain a reasonably precise estimate of ρ.
There have been attempts to address the issue of statistical power in the presence of
funding constraints. Eliasziw and Donner11 presented a method for determining the
number of subjects k and the number of replicates n that minimize the overall cost of
conducting a reliability study, while providing acceptable power for tests of hypotheses
concerning r. They also provided tables showing optimal choices for k and n under
various cost constraints.
Shoukri et al.15 addressed the issue of obtaining the combinations (n, k) that
minimize the variance of r^ subject to cost constraints. In their attempt to construct a
flexible cost function, they adhered to the general guidelines identified by Flynn et al.17
and Eliasziw and Donner.11 First, one has to identify approximately the sampling costs
and overhead costs. The sampling cost depends primarily on the size of the sample, and
includes data collection costs, travel costs, management and other staff costs. On the
other hand, overhead costs remain fixed regardless of sample size and include, for
example, the cost of setting up the data collection form. Following Sukhatme et al.,18 it
is assumed that the overall cost function is given as

  C = c0 + kc1 + nkc2   (10)
where c0 is the fixed cost, c1 the cost of recruiting a subject, and c2 is the cost of making
a single observation. Using the method of Lagrange multipliers,16 the objective function
G is given as

  G = var(ρ̂) + λ(C − c0 − kc1 − nkc2)   (11)

where var(ρ̂) is given by Equation (8) and λ is the Lagrange multiplier. The necessary
conditions for the minimization of G are ∂G/∂n = 0, ∂G/∂k = 0 and ∂G/∂λ = 0, with
the sufficient condition for G to have a constrained relative minimum given by a theorem
in Rao.16 Differentiating G with respect to n, k and λ, and equating to zero, we obtain

  n³ρc2 − n²c2(1 + ρ) − nc1(2 − ρ) + (1 − ρ)c1 = 0   (12)

and

  k = (C − c0)/(c1 + nc2)   (13)
The explicit expression for the optimal solution to these equations was given by Shoukri
et al.15 as

  nopt = {(1 + ρ)/ρ + A^(1/3)/ρ − B}/3   (14)

where

  A = 3ρ[3R{(R + 1)²ρ⁴ − (6R² + 4R − 2)ρ³ + 12R(R + 1)ρ² − (8R² + 10R + 2)ρ − R − 1}]^(1/2)
      + 9R(ρ³ − ρ² + ρ) + (ρ + 1)³,

  B = {3Rρ(ρ − 2) − (ρ + 1)²}/{ρA^(1/3)}

and R = c1/c2. Once nopt is determined, the corresponding optimal k for given C − c0,
c1 and c2 is obtained by substitution into Equation (13). The results of the optimization
procedure appear in Table 5. Without loss of generality, we assume that C − c0 = 100.
It is apparent from Table 5 that when c1 (the cost per subject) increases, the required
number of subjects (k) decreases, while the required number of replicates per subject (n)
increases. However, when c2 increases, both k and n decrease. On the other hand, when
c1 and c2 are fixed, an increase in ρ results in a decline in the required value of n and an
increase in k. This trend reflects two intuitive facts: first, it is sensible to
decrease the number of items associated with a higher cost, while increasing those with
lower cost; second, when ρ is large (high reproducibility), fewer replicates
per subject are needed, while a higher number of subjects should be
recruited. Note that this is similar to the conclusion reached in the previous section,
when cost was not explicitly considered. We also note that at the higher levels of c1
and c2, the optimal allocation is quite stable with respect to changes in sampling cost.
This is advantageous in practice, since it is often difficult to forecast the exact cost prior
to the initiation of the study. Finally, it can be noted that by setting c1 = 0 and c2 = 1 in
Equation (12), we obtain nopt = (1 + ρ)/ρ, as in Equation (9). This means that a special
cost structure is implied by the optimal allocation procedure discussed in the previous
section. Moreover, setting ρ = 1 in Equation (12) gives nopt = 1 + (1 + R)^(1/2) ≥ 2,
emphasizing that the ratio R = c1/c2 is an important factor in determining the optimal
allocation of (n, k).
Example 1 This example is given in Shoukri et al.,15 where, in a prospective evaluation
of the accuracy of Doppler echocardiography (DE) in determining aortic valve area
(AVA) in patients with aortic stenosis, an investigator wishes to demonstrate a high
Table 5 Optimal values of n and k that minimize var(ρ̂) (C − c0 = 100; entries give nopt and the corresponding k from Equation (13))

               c1 = 0.25    c1 = 0.5     c1 = 1       c1 = 3       c1 = 5       c1 = 15      c1 = 25
  c2     ρ     n      k     n      k     n      k     n      k     n      k     n      k     n      k
  0.25   0.7   3.00   100   3.4    73    4.1    49    6      22    7.33   15    11.73  6     14.8   4
         0.8   2.76   106   3.15   78    3.77   51    5.45   23    6.65   15    10.60  6     13.35  4
         0.9   2.57   112   2.9    81    3.48   53    5      24    6.07   15    9.64   6     12.12  4
  0.5    0.7   2.74   62    3      50    3.44   37    4.69   19    5.6    13    8.68   5     10.82  3
         0.8   2.52   66    2.76   53    3.15   39    4.27   19    5.10   13    7.86   5     9.78   4
         0.9   2.36   70    2.57   56    2.92   41    3.93   20    4.67   14    7.16   5     8.90   4
  1      0.7   2.60   35    2.74   31    3      25    3.8    15    4.42   11    6.54   5     8.03   3
         0.8   2.40   38    2.53   33    2.76   26    3.48   15    4.03   11    5.93   5     7.28   3
         0.9   2.24   40    2.36   35    2.57   28    3.21   16    3.71   11    5.43   5     6.65   3
  3      0.7   2.5    13    2.54   12    2.64   11    3      8     3.30   7     4.42   4     5.25   2
         0.8   2.3    14    2.35   13    2.44   12    2.76   9     3.0    7     4.03   4     4.77   3
         0.9   2.1    15    2.20   14    2.28   13    2.57   9     2.81   7     3.72   4     4.38   3
  5      0.7   2.5    8     2.5    8     2.56   7     2.79   6     3.00   5     3.80   3     4.42   2
         0.8   2.3    8     2.3    8     2.37   8     2.58   6     2.76   5     3.48   3     4.03   2
         0.9   2.1    9     2.2    9     2.22   8     2.4    7     2.57   6     3.21   3     3.72   2
  15     0.7   2.4    3     2.4    3     2.47   3     2.56   2     2.64   2     3      2     3.30   2
         0.8   2.3    3     2.3    3     2.29   3     2.37   3     2.44   2     2.76   2     3.03   2
         0.9   2.1    3     2.1    3     2.15   3     2.22   3     2.28   3     2.57   2     2.81   2
  25     0.7   2.4    2     2.4    2     2.45   2     2.51   2     2.56   2     2.8    2     3.00   1
         0.8   2.3    2     2.3    2     2.27   2     2.32   2     2.37   2     2.58   2     2.76   1
         0.9   2.1    2     2.1    2     2.1    2     2.17   2     2.22   2     2.4    2     2.57   1
degree of reliability (ρ = 0.90) in estimating AVA using the 'velocity integral method'.
Suppose that the total cost of the study is fixed at US$1600, that the travel cost for a
patient going from the health centre to the tertiary hospital (where the procedure is
done) is US$15, and that the administrative cost of the procedure and of using the DE
is US$15 per visit. It is assumed that c0, the overhead cost, is absorbed by the hospital.
From Table 5, nopt for R = 1 and ρ = 0.9 is 2.57, which should be rounded up to 3.
From Equation (13),

  kopt = 1600/(15 + 3 × 15) ≈ 27

That is, we need 27 patients, with three measurements each. The minimized value of
var(ρ̂) is 0.00097.
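Instead of the closed form (14), Equation (12) can also be solved numerically for n; the sketch below (our code, using numpy's polynomial root finder) reproduces Example 1:

```python
import numpy as np

def optimal_allocation(rho, c1, c2, budget):
    """Solve the cubic Equation (12) for n, then Equation (13) for k.
    `budget` is C - c0; the admissible solution is the largest real root."""
    coeffs = [rho * c2, -c2 * (1 + rho), -c1 * (2 - rho), (1 - rho) * c1]
    roots = np.roots(coeffs)
    n = max(r.real for r in roots if abs(r.imag) < 1e-9)
    return n, budget / (c1 + n * c2)

n, k = optimal_allocation(0.9, 15, 15, 1600)
# n is about 2.57; rounding it up to 3 gives k = 1600/(15 + 3*15),
# i.e. about 27, in agreement with Example 1.
```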
2.2 Non-normal case
As indicated earlier, the sampling distribution and formula for the variance of the
estimated ICC rely on an assumption of normality, which in practice can only be
approximately satisfied. In this regard, it should be noted that for statistical inferences
in the one-way random effects ANOVA model, it has been found that the distribution
of the ratio of mean squares is often quite robust with respect to non-normality. In
particular, Scheffé19 concluded that the impact of non-normality on inferences for
means is slight, but can be serious for inferences on variances of random effects whose
kurtosis differs from zero. Although Scheffé's conclusions were based on inference for
the variance ratio φ = σs²/σe², they likely have similar implications for ρ = φ/(1 + φ).
2.2.1 Efficiency requirements
Tukey20 obtained the variance of the variance component estimates under various
ANOVA models by employing 'polykays'. For the one-way random effects model,
application of the delta method shows that, to a first-order approximation,21,22

  var(ρ̂) = 2(1 − ρ)²{1 + (n − 1)ρ}²/{kn(n − 1)} + ρ²(1 − ρ)²{γs/k + γe/(kn)}   (15)

where γs = E(s_i⁴)/σs⁴ − 3 and γe = E(e_ij⁴)/σe⁴ − 3 are the kurtoses of the subject-effect
and error distributions (so that γs = γe = 0 under normality). Note that when γs = γe = 0,
var(ρ̂) reduces to the corresponding expression for the normal case, given by Equation (6).
Differentiating Equation (15) with respect to n, with N = nk held fixed, and equating to
zero, the optimal value for n is obtained as

  n* = 1 + 1/{ρ(1 + γs)^(1/2)}   (16)
Remarks
1. At gs ¼ 0, n* is equal to n0, as given by Equation (9). Moreover, for large values of
gs, reflecting increasing departures from normality, a smaller number of replicates is
needed with a correspondingly larger number of subjects. Thus the recommended
strategy for choosing n and k is the same as that for the normal case.
2. The actual form of the error distribution does not influence the optimal number of
replicates. However, both the error distribution and the between-subjects random
effects distribution do affect the level of precision associated with ρ̂. Nonetheless, as
can be seen from Equation (15), the influence of γe on the estimated precision is
much smaller than the influence of γs, provided N = nk is large.
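Equation (16) is immediate to compute; a minimal sketch follows (our naming; γs is treated as the excess kurtosis, so γs = 0 recovers Equation (9)):

```python
def n_star(rho, gamma_s):
    """Equation (16): optimal replicates per subject under non-normal
    subject effects with kurtosis gamma_s."""
    return 1 + 1 / (rho * (1 + gamma_s) ** 0.5)

print(n_star(0.7, 0))  # the normal-case optimum n0 = (1 + rho)/rho
```

As the remarks state, n* decreases (and hence k = N/n increases) as γs grows.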
Table 6 Optimal values of k and n that minimize var(ρ̂), with (C − c0)/c2 = 100

                     γs = γe = 0.1     γs = γe = 0.5     γs = γe = 2
  R = c1/c2    ρ     n      k          n      k          n      k
  0.1          0.7   2.48   38.72      2.43   39.46      2.29   41.77
               0.8   2.31   41.58      2.28   41.93      2.22   43.12
               0.9   2.16   44.17      2.16   44.26      2.14   44.59
  1            0.7   2.98   25.12      2.91   25.55      2.73   26.83
               0.8   2.75   26.64      2.73   26.84      2.64   27.50
               0.9   2.57   28.02      2.56   28.07      2.54   28.26
  10           0.7   5.55   6.43       5.36   6.51       4.91   6.71
               0.8   5.08   6.63       5.00   6.67       4.77   6.77
               0.9   4.67   6.82       4.65   6.82       4.60   6.85
Figure 1 Optimal n versus R = c1/c2 for γs = γe = 0.1 and ρ = 0.7, 0.8, 0.9.
2.2.2 Incorporation of cost constraints
Again using the method of Lagrange multipliers, we construct the objective function
as G = var(ρ̂) + λ(C − c0 − kc1 − nkc2), where var(ρ̂) is now given by Equation (15) and λ is
the Lagrange multiplier. Differentiating G with respect to n, k and λ, and equating to
zero, we obtain

  n³ρc2 − n²c2(1 + ρ) − nc1(2 − ρ) + (1 − ρ)c1 + n²(n − 1)²c2ρ²(1 − ρ)²γs − (n − 1)²c1ρ²(1 − ρ)²γe = 0   (17)

as the equation for n, with k again given by Equation (13). We note that when γs = γe = 0,
Equation (17) reduces to Equation (12), which was obtained for the normal
case. Collecting powers of n, we obtain the following fourth-degree polynomial for the
optimal n:

  c2ρ²(1 − ρ)²γs n⁴ + {ρc2 − 2c2ρ²(1 − ρ)²γs}n³ + {c2ρ²(1 − ρ)²γs − c2(1 + ρ) − c1ρ²(1 − ρ)²γe}n²
  + {2c1ρ²(1 − ρ)²γe − c1(2 − ρ)}n + c1(1 − ρ) − c1ρ²(1 − ρ)²γe = 0
Figure 2 Optimal n versus R = c1/c2 for γs = γe = 0.5 and ρ = 0.7, 0.8, 0.9.
Although an explicit solution to this equation is available, the resulting expression is
complicated and does not provide any useful insight. However, we may summarize the
results of the optimization procedure as in Table 6, where we provide the optimal n and
k for various values of ρ, R and γs = γe. Once nopt is determined, it is substituted into
Equation (13) to determine the corresponding optimal kopt as

  kopt = (C − c0)/(c1 + nopt c2) = {(C − c0)/c2}/(R + nopt)

In this brief table, where without loss of generality we set (C − c0)/c2 = 100, our
principal aim is to establish the behaviour of the optimal (n, k) as ρ, R and γs = γe vary. In
Figures 1–3, we provide more detailed plots of optimal n versus R = c1/c2 for different
values of ρ and γs = γe. Most of the remarks made with respect to Table 5 hold for Table 6
as well. Moreover, from Table 6 and from the figures, it is seen that for fixed R and ρ,
the optimal value of n decreases, while the optimal value of k increases, with increasing
departure from normality.
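Rather than the complicated explicit solution, the quartic above can be solved numerically; the following sketch (our code, with c2 normalized to 1 so that c1 = R) reproduces the Table 6 entries for γs = γe = 0.1 at ρ = 0.7:

```python
import numpy as np

def optimal_n_nonnormal(rho, R, gs, ge):
    """Solve the fourth-degree polynomial (from Equation (17), with c2 = 1
    and c1 = R) and return the admissible root, i.e. the real root exceeding 1."""
    T = rho**2 * (1 - rho)**2
    coeffs = [T * gs,
              rho - 2 * T * gs,
              T * gs - (1 + rho) - R * T * ge,
              2 * R * T * ge - R * (2 - rho),
              R * (1 - rho) - R * T * ge]
    candidates = [r.real for r in np.roots(coeffs)
                  if abs(r.imag) < 1e-8 and r.real > 1]
    return min(candidates)

n = optimal_n_nonnormal(0.7, 1.0, 0.1, 0.1)  # about 2.98
k = 100 / (1.0 + n)                          # about 25.1 when (C - c0)/c2 = 100
```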
Figure 3 Optimal n versus R = c1/c2 for γs = γe = 2 and ρ = 0.7, 0.8, 0.9.
3 Binary outcome measures
In assessing inter-rater reliability, a choice must be made on how to measure the
condition under investigation. One of the practical aspects of this decision concerns
the relative advantages of measuring the trait either on a continuous scale, as in the
preceding sections, or on a categorical scale. In this review, attention is restricted to
the case of dichotomous judgments. Kraemer23 pointed out that it is useful to
distinguish between two conceptually distinct types of dichotomous scores. The first
of these is a truly dichotomous variable, which categorizes members of a population
into nonoverlapping subpopulations on a nominal scale. Examples would include a
variable denoting the presence or absence of a morbid condition, such as sleep apnea.
As the time and expense of monitoring the occurrence of this event on a long term basis
might be substantial, physicians might argue on practical grounds that a continuous
‘surrogate variable’ be used in its place. For example, the use of heart rate
variability measures to examine the progress of the condition could result in a trial
of shorter duration and lower cost.24 The second type of dichotomous variable arises
when an inherently continuous variable is dichotomized to correspond closely to how
the variable is used in clinical practice. A common example is the dichotomization of
patient blood pressure scores into hypertensive and normotensive categories. Donner
and Eliasziw25 investigated the statistical implications associated with the process of
dichotomization as compared with the case of a truly dichotomous variable, and
concluded that the effect of dichotomization on the efficiency of reliability estimates can
be severe.
In the following section, we investigate the issues of efficiency and cost in designing
reliability studies for dichotomous responses, where no distinction between the two
types of variables is made.
3.1 Power considerations
We now consider issues related to the choice of outcome measure. One of the
practical aspects of this decision concerns the relative advantages of measuring the trait
of interest on a continuous versus a dichotomous scale. First, we outline sample size
requirements for the case of binary outcome measure when k subjects are judged by two
raters. An underlying model for this case was given by Mak.26 Let yij denote the binary observation for the jth rater of the ith subject, i = 1, 2, …, k; j = 1, 2, and let p = Pr[yij = 1] be the probability that the trait is present. Then, for the special case considered here,
    p1(κ) = Pr[yi1 = 1, yi2 = 1] = p² + κp(1 − p)
    p2(κ) = Pr[yi1 = 0, yi2 = 1] + Pr[yi1 = 1, yi2 = 0] = 2p(1 − p)(1 − κ)    (18)
    p3(κ) = Pr[yi1 = 0, yi2 = 0] = (1 − p)² + κp(1 − p)
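A minimal check of the cell probabilities in Equation (18) (the function name is ours):

```python
def cell_probs(p, kappa):
    """Cell probabilities of the common correlation model, Equation (18)."""
    p1 = p**2 + kappa * p * (1 - p)        # both raters score 1
    p2 = 2 * p * (1 - p) * (1 - kappa)     # raters disagree, either order
    p3 = (1 - p)**2 + kappa * p * (1 - p)  # both raters score 0
    return p1, p2, p3

print(cell_probs(0.3, 0.6))  # ≈ (0.216, 0.168, 0.616); the three cells sum to 1
```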
where κ may be interpreted as a coefficient of interobserver agreement. This model is often referred to as the 'common correlation model', since it assumes that κi = κ, i = 1, 2, …, k. It will also be shown in Section 3.2 that the model in Equation (18) is a special case of the beta-binomial distribution discussed by Haseman and Kupper.27 If the observed frequencies are given as in Table 7, then κ may be estimated by
    κ̂ = 1 − n2/[2k p̂(1 − p̂)]    (19)

where p̂ = Σ_{i=1}^k Σ_{j=1}^2 yij/(2k) = (2n1 + n2)/(2k) is the sample estimate of p.
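Equation (19) can be sketched directly from the Table 7 frequencies (illustrative counts of our own choosing):

```python
def kappa_hat(n1, n2, n3):
    """Intraclass kappa estimate (Equation (19)) from the two-rater frequencies
    of Table 7: n1 concordant positive, n2 discordant, n3 concordant negative."""
    k = n1 + n2 + n3                  # number of subjects
    p_hat = (2 * n1 + n2) / (2 * k)   # sample estimate of p
    return 1 - n2 / (2 * k * p_hat * (1 - p_hat))

print(kappa_hat(40, 20, 40))  # 0.6
```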
Table 7 Frequencies for the two-rater agreement case

    Category          Frequency    Probability
    (1,1)             n1           P1(κ)
    (1,0) or (0,1)    n2           P2(κ)
    (0,0)             n3           P3(κ)
    Total             n            1
The estimator κ̂ has been shown under this model to be the maximum likelihood estimator of κ, with large-sample variance given by28

    var(κ̂) = [(1 − κ)/k]{(1 − κ)(1 − 2κ) + κ(2 − κ)/[2p(1 − p)]}    (20)
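Equation (20) as code (a sketch; the illustrative values κ = 0.8, p = 0.3, k = 117 are our own, chosen so that the implied interval width 2(1.64)·sd comes out close to 0.2):

```python
from math import sqrt

def var_kappa_hat(kappa, p, k):
    """Large-sample variance of the intraclass kappa estimate, Equation (20)."""
    return ((1 - kappa) / k) * (
        (1 - kappa) * (1 - 2 * kappa) + kappa * (2 - kappa) / (2 * p * (1 - p))
    )

sd = sqrt(var_kappa_hat(0.8, 0.3, 117))
print(round(2 * 1.64 * sd, 3))  # 0.2
```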
Donner and Eliasziw29 used the goodness-of-fit test procedure to facilitate sample size calculations that may be used to enroll a sufficient number of subjects in a study of interobserver agreement involving two raters. They showed that the number of subjects needed to test H0: κ = κ0 versus H1: κ = κ1 is given by

    k = A² { [p(1 − p)(κ1 − κ0)]²/[p² + p(1 − p)κ0]
           + 2[p(1 − p)(κ1 − κ0)]²/[p(1 − p)(1 − κ0)]
           + [p(1 − p)(κ1 − κ0)]²/[(1 − p)² + p(1 − p)κ0] }⁻¹    (21)

where A² = (z_{1−α/2} + z_{1−β})².
Example 2 Suppose that it is of interest to test H0: κ = 0.60 versus H1: κ ≠ 0.60, where κ0 = 0.60 corresponds to the value of kappa characterized as representing 'substantial' agreement.30 To ensure with 80% probability a significant result at α = 0.05 and p = 0.30 when κ1 = 0.90, the required number of subjects from Equation (21) is k = 66. In Table 8, we present values of the required number of subjects for different values of κ0, κ1 and p, at α = 0.05 and 1 − β = 0.80.
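Equation (21) can be sketched as follows (the function name is ours; `NormalDist` supplies the normal quantiles, and rounding to the nearest integer matches the tabulated values):

```python
from statistics import NormalDist

def subjects_for_kappa_test(kappa0, kappa1, p, alpha=0.05, power=0.80):
    """Subjects needed to test H0: kappa = kappa0 vs H1: kappa = kappa1
    with two raters, via the goodness-of-fit formula of Equation (21)."""
    z = NormalDist().inv_cdf
    A2 = (z(1 - alpha / 2) + z(power)) ** 2
    base = (p * (1 - p) * (kappa1 - kappa0)) ** 2
    delta = (base / (p**2 + p * (1 - p) * kappa0)
             + 2 * base / (p * (1 - p) * (1 - kappa0))
             + base / ((1 - p)**2 + p * (1 - p) * kappa0))
    return A2 / delta

# Example 2: kappa0 = 0.6, kappa1 = 0.9, p = 0.3 gives k = 66
print(round(subjects_for_kappa_test(0.6, 0.9, 0.3)))  # 66
# One entry of Table 8: kappa0 = 0.4, kappa1 = 0.6, p = 0.1 gives k = 404
print(round(subjects_for_kappa_test(0.4, 0.6, 0.1)))  # 404
```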
Table 8 Number of required subjects k at α = 0.05 and β = 0.20

                           κ1
    κ0      p      0.4     0.6     0.7     0.8     0.9
    0.4     0.1     –      404     179     101      64
            0.3     –      190      84      47      30
            0.5     –      165      73      41      26
    0.6     0.1    334      –     1339     335     149
            0.3    148      –      595     148      66
            0.5    126      –      502     125      55
    0.7     0.1    121    1090      –     1090     272
            0.3     52     474      –      474     118
            0.5     45     400      –      400     100
    0.8     0.1     49     195     770      –      779
            0.3     21      83     336      –      336
            0.5     18      71     282      –      282
    0.9     0.1     17      46     103     413      –
            0.3      7      20      44     177      –
            0.5      6      17      37     149      –

3.1.1 Specified width of a CI
As in the case of a continuous measurement, we may base our sample size calculation on the required width of a CI. Suppose that an interobserver agreement study is to be conducted such that a CI for the intraclass kappa statistic given by Equation (19) has a desired width w. Setting

    w = 2 z_{α/2} {var(κ̂)}^{1/2}

where var(κ̂) is given by Equation (20), replacing κ by a planned value κ*, and solving for k, the required number of subjects is given by

    k = (4 z²_{α/2}/w²)(1 − κ*){(1 − κ*)(1 − 2κ*) + κ*(2 − κ*)/[2p(1 − p)]}    (22)
Example 3 Suppose that an interobserver agreement study involving two raters is designed to achieve a value of κ* = 0.80. We also assume that the probability of a positive rating is 0.30, the desired width of the CI is w = 0.20, with level of confidence 0.95. Then we have

    k = [4(1.64)²/(0.2)²](0.2){(0.2)(−0.6) + (0.8)(1.2)/[2(0.3)(0.7)]} = 117
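Equation (22) as code (a sketch; we keep the z = 1.64 used in the worked example above and round up to the next integer):

```python
from math import ceil

def subjects_for_ci_width(kappa, p, w, z=1.64):
    """Subjects needed so that a CI for the intraclass kappa statistic has
    width w, Equation (22); z = 1.64 reproduces the worked example."""
    k = (4 * z**2 / w**2) * (1 - kappa) * (
        (1 - kappa) * (1 - 2 * kappa) + kappa * (2 - kappa) / (2 * p * (1 - p))
    )
    return ceil(k)

# Example 3: kappa* = 0.8, p = 0.3, w = 0.2
print(subjects_for_ci_width(0.8, 0.3, 0.2))  # 117
```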
3.2 Efficiency requirements
We now focus attention on seeking the allocation that minimizes the variance of the estimator of ρ for fixed N = nk when the outcome measure is binary.
Let yij denote the jth rating made on the ith subject, where yij = 1 if the condition is present and 0 otherwise. Landis and Koch,30 analogous to the continuous case, employed the one-way random effects model

    yij = μi + eij    (23)

where μi = μ + si for i = 1, 2, …, k; j = 1, 2, …, n. Analogous to the case of a continuous outcome measure, we assume the {si} are iid with mean 0 and variance σs², the {eij} are iid with mean 0 and variance σe², and that the {si} and {eij} are independent. We may therefore write E(yij) = p = Pr[yij = 1] and

    σ² = var(yij) = p(1 − p)    (24)
Letting δ = Pr[yij = 1, yil = 1] = E(yij yil), it follows for j ≠ l and i = 1, 2, …, k that

    δ = cov(yij, yil) + E(yij)E(yil) = ρp(1 − p) + p²    (25)

where ρ = (δ − p²)/[p(1 − p)]. It is clear from the previous set-up that the probability that two measurements taken from the same subject are in agreement is Po = p² + (1 − p)² + 2ρp(1 − p). Substituting ρ = 0 in Po, we obtain the agreement expected by chance, Pe = p² + (1 − p)². Therefore, the beyond-chance agreement is given by
κ = (Po − Pe)/(1 − Pe) = ρ. It is therefore clear that ρ is directly analogous to the components of kappa described by Cohen31 and Fleiss and Cohen.32 Following Landis and Koch,30 let

    σs² = ρp(1 − p)
    σe² = (1 − ρ)p(1 − p)    (26)

denote the variance components of yij. Then σ² = σs² + σe², and the corresponding estimator of ρ is given, analogous to Equation (2), by

    ρ* = (MSA − MSW)/[MSA + (n − 1)MSW]
where

    MSA = [1/(k − 1)]{ Σ_{i=1}^k yi²/n − (Σ_{i=1}^k yi)²/(nk) }

    MSW = [1/(k(n − 1))]{ Σ_{i=1}^k Σ_{j=1}^n yij − Σ_{i=1}^k yi²/n }

and yi = Σ_{j=1}^n yij.
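A small sketch of the estimator ρ* built from these mean squares (the toy 3 × 2 data set and function name are ours; since the ratings are binary, Σ Σ yij² = Σ Σ yij):

```python
def rho_star(y):
    """Intraclass correlation estimate rho* from a k x n table of binary
    ratings, via the one-way ANOVA mean squares MSA and MSW."""
    k, n = len(y), len(y[0])
    totals = [sum(row) for row in y]   # subject totals y_i
    grand = sum(totals)
    ss = sum(t * t for t in totals) / n
    msa = (ss - grand**2 / (n * k)) / (k - 1)
    msw = (grand - ss) / (k * (n - 1))
    return (msa - msw) / (msa + (n - 1) * msw)

# Toy data: 3 subjects each rated twice
print(rho_star([[1, 1], [0, 0], [1, 0]]))  # ≈ 0.5
```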
We note first that the statistic ρ* depends only on the totals yi, and not on the individual binary responses. Crowder33 and Haseman and Kupper27 demonstrated the equivalence of the ANOVA model given earlier to the well known beta-binomial model, which arises when, conditional on the subject effect μi, the subject's total yi has a binomial distribution with conditional mean and variance given, respectively, by E(yi | μi) = nμi and var(yi | μi) = nμi(1 − μi). The parameter μi is assumed to follow the beta distribution

    f(μi) = [Γ(a + b)/(Γ(a)Γ(b))] μi^(a−1) (1 − μi)^(b−1)    (27)

where a = p(1 − ρ)/ρ and b = (1 − p)(1 − ρ)/ρ. Therefore, the ANOVA model and the beta-binomial model are virtually indistinguishable.34 Because the optimal number of replicates for the non-normal case under the former model was shown to be n = 1 + 1/{ρ(1 + γs)^{1/2}}, and since γs is the kurtosis of the subject effect distribution, one may use the kurtosis of the beta distribution to determine the optimal number of replications.
One can derive γs for the beta distribution from the recurrence given by

    μ′1 = p,   μ′l/μ′(l−1) = [(l − 1)ρ + p(1 − ρ)]/[1 + (l − 2)ρ],   l = 2, 3, …

where μ′l = E[μi^l]; the denominator 1 + (l − 2)ρ reproduces, at l = 2, the second moment μ′2 = p² + ρp(1 − p) implied by Equation (25). Then γs = μ4/(μ2)², where μ4 = μ′4 − 4μ′3μ′1 + 6μ′2(μ′1)² − 3(μ′1)⁴ and μ2 = μ′2 − (μ′1)².21 Substituting γs into Equation (16) we obtain
    n = 1 + [p(1 − p)(1 − ρ)(1 − 2ρ)/c(p, ρ)]^{1/2}    (28)

where

    c(p, ρ) = p[ρ + p(1 − ρ)][2ρ + p(1 − ρ)][3ρ + p(1 − ρ) − 4p(1 + 2ρ)]
            + (1 + ρ)(1 + 2ρ)[6p³(1 − p)ρ + 3p⁴ + p²(1 − p)²ρ²]
Table 9 shows the optimal number of replications n* and the corresponding optimal number of subjects k = N/n*. In contrast to the continuous measurement model, the optimal allocation in the case of a binary outcome measure depends on the mean of the response variable p. We also note that for fixed N the optimal allocations are equivalent for p and 1 − p.
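The computation can be sketched as follows: raw beta moments from the recurrence (with denominator 1 + (l − 2)ρ, which reproduces μ′2 = p² + ρp(1 − p)), then γs = μ4/(μ2)² and n = 1 + 1/{ρ(1 + γs)^{1/2}}. This reproduces the ρ = 0.4 column of Table 9:

```python
from math import sqrt

def beta_moments(p, rho, upto=4):
    """Raw moments mu'_l of the beta subject-effect distribution,
    via the moment recurrence."""
    m = [1.0, p]  # mu'_0 = 1, mu'_1 = p
    for l in range(2, upto + 1):
        ratio = ((l - 1) * rho + p * (1 - rho)) / (1 + (l - 2) * rho)
        m.append(m[-1] * ratio)
    return m

def optimal_n(p, rho):
    """Optimal replicates n = 1 + 1/{rho * (1 + gamma_s)^(1/2)},
    with gamma_s = mu4/mu2^2 for the beta distribution."""
    m = beta_moments(p, rho)
    mu2 = m[2] - m[1] ** 2
    mu4 = m[4] - 4 * m[3] * m[1] + 6 * m[2] * m[1] ** 2 - 3 * m[1] ** 4
    gamma_s = mu4 / mu2 ** 2
    return 1 + 1 / (rho * sqrt(1 + gamma_s))

# Two entries of Table 9 (rho = 0.4):
print(round(optimal_n(0.1, 0.4), 2))  # 1.81
print(round(optimal_n(0.5, 0.4), 2))  # 2.53
```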
Remarks
1. When p is small, as few as two replicates are required, but with a correspondingly larger value for the required number of subjects. When p = 0.5, fewer subjects need be recruited, with not more than three replicates per subject.
2. There is a noted similarity between the results given in Tables 4 and 9. In both cases, higher values of ρ imply that as few as n = 2 replicates are needed and hence a larger number of subjects should be recruited. In particular, when p = 0.5 and 0.6 ≤ ρ ≤ 0.8, the optimal allocation for the case of a binary outcome measure is close to that required in the case of a continuous outcome measure.
Table 9 Optimal allocation at N = 60 for a binary response variable

          ρ = 0.4      ρ = 0.5      ρ = 0.6      ρ = 0.7      ρ = 0.8      ρ = 0.9
    p     n      k     n      k     n      k     n      k     n      k     n      k
    0.1   1.81   33    1.64   37    1.53   39    1.46   41    1.40   43    1.36   44
    0.3   2.36   25    2.10   29    1.94   31    1.82   33    1.73   35    1.65   36
    0.5   2.53   24    2.25   27    2.08   29    1.95   31    1.85   32    1.77   34
4 Discussion
A crucial decision facing a researcher in the design stage of a reliability study is the
determination of the required values of n and k. When the investigator has prior
knowledge of what is regarded as an acceptable level of reliability, the hypothesis
testing approach may be used and sample size calculations can be performed using the
results of Donner and Eliasziw10 and Walter et al.7 However, in many cases, values of
the reliability coefficient under the null and alternative hypotheses may be difficult to
specify. Moreover, the estimated value of the ICC depends on the level of heterogeneity of the sampled subjects: the greater the heterogeneity, the higher the value of the ICC. As most reliability studies focus on estimation of the ICC, a principal aim of this review is to provide the values of (n, k) that maximize the precision of the estimated ICC. It is fortuitous that this approach produces estimates of sample size that are in close agreement with results from procedures based on power considerations.
An overall conclusion from the earlier results is that for both continuous and binary
outcome measures, the variance of the estimated ICC is minimized with only a small
number of replicates, provided the true value of the ICC is reasonably high. In many clinical investigations an ICC of at least 0.60 is required as the minimal acceptable value. Under such circumstances, one can safely recommend only two or three replications per subject.
Finally, it is noted that in practice the optimal allocations must be integer values, and that the net loss/gain in precision resulting from rounding the values of (n, k) is negligible. Ideally, one should adopt one of the available combinatorial optimization algorithms, often referred to as integer programming models. These models are well suited to the optimal allocation problems reviewed in this study, as the main concern was to find the best solution(s) in a well-defined discrete space. This topic needs further investigation.
References

1 Dunn G. Design and analysis of reliability studies. Statistical Methods in Medical Research 1992; 1: 123–57.
2 Shoukri MM. Agreement. In Armitage P, Colton T, eds. Encyclopedia of biostatistics. New York: John Wiley & Sons, 1999: 117–30.
3 Shoukri MM. Agreement. In Gail MH, Benichou J, eds. Encyclopedia of epidemiologic methods. New York: John Wiley & Sons, 2000: 43–49.
4 Haggard ER. Intraclass correlation and the analysis of variance. New York: Dryden Press, 1958.
5 Elston R. Response to query: estimating 'heritability' of a continuous trait. Biometrics 1977; 33: 232–33.
6 Fisher RA. Statistical methods for research workers. London: Oliver & Boyd, 1925.
7 Walter DS, Eliasziw M, Donner A. Sample size and optimal design for reliability studies. Statistics in Medicine 1998; 17: 101–10.
8 Freedman LS, Parmar MKB, Baker SG. The design of observer agreement studies with binary assessments. Statistics in Medicine 1993; 12: 165–79.
9 Awni WM, Skaar DJ, Schwenk MH. Interindividual and intraindividual variability in labetalol pharmacokinetics. Journal of Clinical Pharmacology 1988; 28: 344–49.
10 Donner A, Eliasziw M. Sample size requirements for reliability studies. Statistics in Medicine 1987; 6: 441–48.
11 Eliasziw M, Donner A. A cost-function approach to the design of reliability studies. Statistics in Medicine 1987; 6: 647–55.
12 Giraudeau B, Mary JY. Planning a reproducibility study: how many subjects and how many replicates per subject for an expected width of the 95 per cent confidence interval of the intraclass correlation coefficient. Statistics in Medicine 2001; 20: 3205–14.
13 Bonett DG. Sample size requirements for estimating intraclass correlations with desired precision. Statistics in Medicine 2002; 21: 1331–35.
14 Donner A. Sample size requirements for interval estimation of the intraclass kappa statistic. Communications in Statistics – Simulation 1999; 28: 415–29.
15 Shoukri MM, Asyali MH, Walter SD. Issues of cost and efficiency in the design of reliability studies. Biometrics 2003; 59: 1107–12.
16 Rao SS. Optimization: theory and applications. New Delhi: Wiley Eastern Limited, 1984.
17 Flynn NT, Whitley E, Peters T. Recruitment strategy in a cluster randomized trial: cost implications. Statistics in Medicine 2002; 21: 397–405.
18 Sukhatme PV, Sukhatme BV, Sukhatme S, Asok C. Sampling theory of surveys with applications. Ames, IA: Iowa State University Press, 1984.
19 Scheffé H. The analysis of variance. New York: John Wiley & Sons, 1959.
20 Tukey JW. Variance of variance components I: balanced designs. Annals of Mathematical Statistics 1956; 27: 722–36.
21 Kendall M, Stuart A. The advanced theory of statistics, Vol 1. London: Griffin, 1986.
22 Hemmersley IM. The unbiased estimate and standard error of the intraclass variance. Metron 1949; 15: 189–205.
23 Kraemer CH. Ramification of a population model for κ as a coefficient of reliability. Psychometrika 1979; 44: 461–72.
24 Frederic R, Gaspoz JM, Fortune IC, Minini P, Pichot V, Duverney D, Lacow JR, Barthelemy JC. Screening of obstructive sleep apnea syndrome by heart rate variability analysis. Circulation 1999; 100: 1411–15.
25 Donner A, Eliasziw M. Statistical implications of the choice between a dichotomous or continuous trait in studies of interobserver agreement. Biometrics 1994; 50: 550–55.
26 Mak TK. Analyzing intraclass correlation for dichotomous variables. Applied Statistics 1988; 20: 37–46.
27 Haseman JK, Kupper LL. Analysis of dichotomous response data from certain toxicological experiments. Biometrics 1979; 35: 281–93.
28 Bloch DA, Kraemer HC. 2 × 2 kappa coefficients: measures of agreement or association. Biometrics 1989; 45: 269–87.
29 Donner A, Eliasziw M. The goodness-of-fit approach to inference procedures for the kappa statistic: confidence interval construction, significance testing, and sample size estimation. Statistics in Medicine 1992; 11: 1511–19.
30 Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977; 33: 159–74.
31 Cohen JA. A coefficient of agreement for nominal scales. Educational and Psychological Measurement 1960; 20: 37–46.
32 Fleiss J, Cohen JA. The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and Psychological Measurement 1973; 33: 613–19.
33 Crowder M. Beta-binomial ANOVA for proportions. Applied Statistics 1978; 27: 34–37.
34 Cox DR, Snell EJ. Analysis of binary data. London: Chapman and Hall, 1989.