Journal of Clinical Epidemiology 58 (2005) 859–862
Sample size calculation should be performed for design accuracy
in diagnostic test studies
Antoine Flahault a,b,*, Michel Cadilhac a,b, Guy Thomas b
a Unité de biostatistique et d'Informatique médicale, Hôpital Tenon, Paris, France
b Institut National de la Santé et de la Recherche Médicale (INSERM), unité 707, Université Pierre & Marie Curie, 27, rue Chaligny, F-75571 Paris cedex 12, France
Accepted 8 December 2004
Abstract
Background and Objectives: Guidelines for conducting studies and reading medical literature on diagnostic tests have been published; requirements for the selection of cases and controls, and for ensuring a correct reference standard, are now clarified. Our objective was to provide tables for sample size determination in this context.
Study Design and Setting: In the usual situation, where the prevalence Prev of the disease of interest is <0.50, one first determines the minimal number Ncases of cases required to ensure a given precision of the sensitivity estimate. Computations are based on the binomial distribution, for user-specified type I and type II error levels. The minimal number Ncontrols of controls is then derived so as to allow for representativeness of the study population, according to Ncontrols = Ncases × [(1 − Prev)/Prev].
Results: Tables give the values of Ncases corresponding to expected sensitivities from 0.60 to 0.99, acceptable lower 95% confidence
limits from 0.50 to 0.98, and 5% probability of the estimated lower confidence limit being lower than the acceptable level.
Conclusion: When designing diagnostic test studies, sample size calculations should be performed to guarantee design accuracy. © 2005 Elsevier Inc. All rights reserved.
Keywords: Sensitivity; Specificity; Sample size; Binomial distribution; Diagnostic test
1. Introduction
The importance of sample size calculation in medical
research is emphasized in all sets of good clinical practice
guidelines. Previous articles have dealt with this issue under various circumstances, in particular for two-group comparisons within clinical trials [1]. Simel et al. [2] deal with sample sizes based on desired confidence intervals for likelihood ratios. Knottnerus and Muris [3] deal with the whole strategy needed for the development of diagnostic tests, but do not provide practical tables for calculating sample sizes in the very situation clinician epidemiologists face when dealing with sensitivity or specificity confidence intervals.
From a statistical point of view, sample size issues for
diagnostic test assessment studies have formal counterparts
within the field of clinical trials, so that answers could be
derived from published equations and tables [4], at least in
principle; how to perform such derivations may not be clear
to clinicians.
* Corresponding author. Tel.: +33-1-44738441; fax: +33-1-44738454.
E-mail address: flahault.a@wanadoo.fr (A. Flahault).
0895-4356/05/$ – see front matter © 2005 Elsevier Inc. All rights reserved.
doi:10.1016/j.jclinepi.2004.12.009
Moreover, in the case of binary (yes/no) outcome tests,
a normal approximation to the binomial distribution is often
used [5]. Although the accuracy of the approximation is
usually good, modern software allows for exact calculations to be carried out at virtually no extra cost. For instance,
a SAS macro is available to compute exact binomial confidence limits [6].
Our objective was to describe the determination of sample
size for binary diagnostic test assessment studies, and to
provide exact tables based on the binomial distribution.
2. Methods
2.1. Definitions
Assessing a diagnostic test procedure with binary (yes/no)
outcome entails determining the operating characteristics of
the test with respect to some disease of interest. The intrinsic
characteristics of the test are sensitivity and specificity. Sensitivity (Se) is the probability that the test outcome is positive
in a patient who has the disease, and is estimated by the proportion of positive test results among a sample of patients with the
disease (cases). Specificity (Sp) is the probability that the test
outcome is negative in a subject who is free from the disease of
interest, and is estimated by the proportion of negative
results in a sample of disease-free subjects. The positive (or
negative) predictive value of the test in a given population
is the probability that a test positive (or negative) subject has
(or does not have) the disease. Although predictive values are
of obvious clinical and epidemiological relevance, they
are not intrinsic to the test, insofar as they also depend on
the prevalence of the disease in the population under study.
These issues are discussed by Altman and Bland [7,8].
In addition, to evaluate the accuracy of the sensitivity or
specificity estimate, the experimenter must further estimate
some confidence limit. The 1 − α lower confidence limit for Se (or Sp) can be thought of as the lowest value of Se (or Sp) that is not rejected by a one-sided test of level α of the null hypothesis Se = SeL (or Sp = SpL) against the alternative hypothesis Se > SeL (or Sp > SpL). Upper confidence limits are defined in an analogous manner, but are
irrelevant here, because the concern is that the test actually
performs worse, not better, than indicated by the observed
proportion of positive (or negative) outcomes in the trial
sample.
2.2. Number of cases
Assume first that we wish to determine the number of
cases to estimate the sensitivity of a new diagnostic test.
The process of determining the sample size in this context
is formally identical to that when comparing an observed
proportion to a known proportion. Prior to sample size
computation, the experimenter must thus specify (i) the expected sensitivity Se of the test and (ii) the maximal distance δ from Se within which the 1 − α lower confidence limit is required to fall, with probability 1 − β. Although α and 1 − β retain their usual meanings in terms of type I error and power, respectively, δ is analogous to the effect size, and Se − δ plays the role of the known proportion. Relying on the normal approximation to the binomial distribution, sample sizes could thus be determined according to equation (A1) in the Appendix. For small values of δ, sample sizes may be further approximated by halving the numbers given in Table 3.1 of Machin et al. [4], but this table is ill-suited to the context of diagnostic test assessment studies. Moreover, because in cases of interest the expected sensitivity will typically be close to one, the normal approximation may be somewhat inaccurate, and one should therefore fall back on exact equations based on the binomial distribution (see Appendix).
2.3. Number of controls
To determine the number of controls needed to estimate
the specificity of a diagnostic test, the procedure is identical
with that described in the preceding section, substituting
specificity for sensitivity.
In practice, the clinician will want to estimate both sensitivity and specificity within a study population containing
cases and controls. In this case, to ensure that the study
population is representative of the population to which the
test will be applied, the proportions of cases and controls
should take account of the prevalence Prev of the disease,
according to
Ncontrols = Ncases × [(1 − Prev)/Prev]    (1)
For the vast majority of diseases, Prev < 0.50, and so equation (1) yields Ncontrols > Ncases. In such instances, provided the accuracy requirements are similar for Sp and Se, the experimenter should first determine the minimal number of cases from the tables, and then compute the number of controls from equation (1). If Prev > 0.50, first read Ncontrols in the tables, then compute Ncases from equation (1).
Sample sizes were computed according to equation (A3) in
the Appendix, using Mathematica software [9]. The code is
available from the corresponding author.
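Equation (1) is a one-line computation. As a minimal illustration in Python (our own sketch, not the authors' Mathematica code; the function name is ours), rounding up to the next whole subject:

```python
import math

def n_controls(n_cases: int, prev: float) -> int:
    """Equation (1): Ncontrols = Ncases * (1 - Prev) / Prev, rounded up."""
    if not 0 < prev < 1:
        raise ValueError("prevalence must lie strictly between 0 and 1")
    return math.ceil(n_cases * (1 - prev) / prev)

# Worked example from Section 3: Ncases = 70, Prev = 0.10
print(n_controls(70, 0.10))  # 630, as in the text
```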
So far, we have considered situations where cases and
controls are sampled separately. Often, however, the investigator must sample from a population without prior knowledge of the individual case–control status. In such instances,
the sample size n must be determined such that, with high
probability (e.g. 95%), the sample contains sufficient numbers of cases and controls. To meet this requirement, a
possible strategy is to choose n as the smallest integer
such that
$$\sum_{x=N_{\mathrm{cases}}}^{n} \binom{n}{x}\,\mathrm{Prev}^{x}\,(1-\mathrm{Prev})^{\,n-x} \;\geq\; 0.95, \qquad (2)$$
where Prev is the population disease prevalence, and Ncases is determined from the tables. If Prev > 0.50, use the same equation with Ncontrols in place of Ncases.
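The search for the smallest n satisfying equation (2) can be automated. Below is a minimal stdlib-only Python sketch (our illustration; function names are ours, not from the paper). Because the probability of observing at least Ncases cases increases with n, a simple upward scan terminates:

```python
import math

def binom_tail(k: int, n: int, p: float) -> float:
    """Exact binomial tail P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, x) * p**x * (1 - p)**(n - x)
               for x in range(k, n + 1))

def min_blind_sample(n_cases: int, prev: float, coverage: float = 0.95) -> int:
    """Smallest n such that a sample of size n, drawn from a population with
    prevalence `prev`, contains at least `n_cases` cases with probability
    >= `coverage` -- equation (2)."""
    n = n_cases  # the tail probability is increasing in n, so scan upward
    while binom_tail(n_cases, n, prev) < coverage:
        n += 1
    return n
```

For the Section 3 example (Ncases = 70, Prev = 0.10), the required n is substantially larger than Ncases/Prev = 700, because sampling variability must be absorbed.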
3. Results
Sample sizes corresponding to lower 95% confidence limits, to be violated with probability <5%, are presented in Tables 1 and 2.
Whenever disease prevalence is <0.50, the following
guidelines should be followed. The first step requires an
assumption on the expected value of the new diagnostic
test sensitivity. The second step is to specify the minimum
acceptable lower confidence limit, together with the required
probability (which was set here at 0.95) that this limit is not
violated. The minimal sample size for the group of cases is
then read from the tables. The corresponding number of
controls is obtained from equation (1).
For example, suppose we wish to investigate a new diagnostic procedure with expected sensitivity and specificity of 0.90, in a population where the disease prevalence is 0.10, and we require the lower 95% confidence limit to be >0.75 with 0.95 probability. From Table 1, Ncases = 70; from equation (1), Ncontrols = 630.
Table 1
Number of cases (or controls) for expected sensitivities (or specificities) ranging from 0.60 to 0.95

Expected sensitivity          Minimal acceptable lower confidence limit
(or specificity)      0.50   0.55   0.60   0.65   0.70   0.75   0.80   0.85   0.90
0.60                   268  1,058
0.65                   119    262  1,018
0.70                    67    114    248    960
0.75                    42     62    107    230    869
0.80                    28     40     60     98    204    756
0.85                    18     26     33     52     85    176    624
0.90                    13     18     24     31     41     70    235    474
0.95                    11     12     14     16     24     34     50     93    298

The probability that the estimated 95% lower confidence limit is above the minimal acceptable value is 0.95.
4. Discussion
We have tabulated sample sizes needed for assessing
diagnostic procedures (Tables 1 and 2), to help physicians
and clinical epidemiologists when designing a study on diagnostic test assessment.
If one expected a sensitivity of 0.85 for a screening test
and considered that the lower 95% confidence limit should
not fall below 0.80, with 0.95 probability, the exact number
of cases is 624 (Table 1). The corresponding normal approximation yields 621 cases, which is only 0.5% lower than
the exact one. The approximation is obviously acceptable
here, but diagnostic tests are usually associated with higher
values of sensitivity and specificity. In such instances, the
normal approximation, although usually robust, may be misleading. For a sensitivity of 0.99, and a lower 95% confidence limit required to be >0.90, with 0.95 probability, the approximate number of cases is 54, which is 11% lower than the 61 cases derived by the exact method (Table 2).
Because we wished to present a ready-to-use approach to determining sample size for binary outcomes, a situation that frequently occurs in diagnostic test studies, we did not take into account in our calculations other refinements, such as polytomous outcomes, study design, or observer variation.
Most of these considerations are presented and discussed
elsewhere [3,10–12]. Knottnerus and Muris [3] dealt with
the whole strategy needed for development of diagnostic
tests, including issues related to the appropriate research
question, study design, the reference standard, study population, statistical aspects including sample size, and external
(clinical) validation. Jaeschke et al. [11,12] highlighted the
main principles for validity of diagnostic test studies. In
particular, three points were emphasized: (i) independent,
blind comparison with a reference standard, (ii) appropriate
spectrum of included patients to whom the test will be applied
in practice, and (iii) absence of influence of the test results
on the decision to perform the reference standard. There are
“points to consider,” published by the European Agency
for the Evaluation of Medicinal Products [10], that provide
useful guidelines for those who wish to develop diagnostic agents.
Requirements about sample size are also worth considering. In particular, point estimates do not provide sufficient
information about the validity of diagnostic tests, and lower
confidence limits should be provided. Additional methodologic developments are probably needed for polytomous outcomes, for which sample size methods do not yet seem to be widely available.
In conclusion, we recommend using the one-sided 95%
formulation of the lower confidence limit for sample size
calculations when designing routine studies on diagnostic
test assessment.
Appendix
1. Suppose the expected sensitivity of the diagnostic test is π, and we wish the 1 − α lower confidence limit for π to be greater than π − δ with probability 1 − β. Based on the normal approximation, the required number of subjects is

$$n \;=\; \frac{\left[\,z_{1-\beta}\sqrt{\pi(1-\pi)} \;+\; z_{1-\alpha}\sqrt{(\pi-\delta)(1-\pi+\delta)}\,\right]^{2}}{\delta^{2}} \qquad (A1)$$

where z_{1−α} and z_{1−β} are the 1 − α and 1 − β quantiles, respectively, of the standard normal distribution. Note that equation (A1) is simply equation (3.10) from Machin et al. [4], setting π1 = π − δ and π2 = π.
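For readers who prefer code to tables, equation (A1) translates directly into a short Python sketch (our own, stdlib only; `NormalDist.inv_cdf` supplies the normal quantiles):

```python
import math
from statistics import NormalDist

def approx_sample_size(pi: float, delta: float,
                       alpha: float = 0.05, beta: float = 0.05) -> int:
    """Equation (A1): normal-approximation sample size, rounded up."""
    z_a = NormalDist().inv_cdf(1 - alpha)  # z_{1-alpha}
    z_b = NormalDist().inv_cdf(1 - beta)   # z_{1-beta}
    num = (z_b * math.sqrt(pi * (1 - pi))
           + z_a * math.sqrt((pi - delta) * (1 - pi + delta)))
    return math.ceil((num / delta) ** 2)

# Discussion example: expected Se = 0.85, minimal acceptable limit 0.80 (delta = 0.05)
print(approx_sample_size(0.85, 0.05))  # 621, matching the text
```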
Table 2
Number of cases (or controls) for expected sensitivities (or specificities) ranging from 0.91 to 0.99

Expected sensitivity           Minimal acceptable lower confidence limit
(or specificity)   0.85   0.86   0.87   0.88   0.89   0.90   0.91   0.92   0.93   0.94   0.95   0.96   0.97   0.98
0.91                319    438    666  1,127  2,443  9,309
0.92                220    294    403    613  1,035  2,215  8,428
0.93                166    203    273    372    549    934  1,992  7,512
0.94                126    153    183    248    334    493    832  1,763  6,576
0.95                 93    109    137    169    217    298    434    729  1,524  5,626
0.96                 76     82     98    117    151    191    253    374    625  1,288  4,654
0.97                 59     63     79     85    105    129    158    224    309    519  1,036  3,643
0.98                 50     53     58     63     69     89    115    129    185    259    386    781  2,620
0.99                 50     50     50     51     56     61     68     77    109    127    181    261    521  1,567

The probability that the estimated 95% lower confidence limit is above the minimal acceptable value is 0.95.
2. Let x be an observation from the binomial distribution with parameters n and π. It can be shown [1] that an estimate π̂L of the 1 − α lower confidence limit for π is the solution to

$$\sum_{i=x}^{n} \binom{n}{i}\,\hat{\pi}_{L}^{\,i}\,(1-\hat{\pi}_{L})^{\,n-i} \;=\; \alpha. \qquad (A2)$$
Because π̂L is increasing in x, the β-quantile, γ, of the distribution of π̂L is the solution to

$$\sum_{i=q_{\beta}}^{n} \binom{n}{i}\,\gamma^{\,i}\,(1-\gamma)^{\,n-i} \;=\; \alpha, \qquad (A3)$$

where q_β is the β-quantile of the binomial distribution with parameters n and π.
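Equations (A2) and (A3) suggest a direct exact algorithm: for each candidate n, take the β-quantile q_β of the binomial distribution, solve (A2) at x = q_β by bisection, and accept the smallest n whose resulting lower limit exceeds the minimal acceptable value. The following Python sketch is our stdlib-only illustration of this method, not the authors' Mathematica program:

```python
import math

def binom_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(k, n + 1))

def binom_quantile(beta: float, n: int, p: float) -> int:
    """Smallest k with P(X <= k) >= beta."""
    cdf = 0.0
    for k in range(n + 1):
        cdf += math.comb(n, k) * p**k * (1 - p)**(n - k)
        if cdf >= beta:
            return k
    return n

def lower_limit(x: int, n: int, alpha: float = 0.05) -> float:
    """Solve equation (A2) for the exact 1-alpha lower confidence limit
    by bisection; binom_tail(x, n, p) is increasing in p."""
    if x == 0:
        return 0.0
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if binom_tail(x, n, mid) < alpha:
            lo = mid
        else:
            hi = mid
    return lo

def exact_sample_size(pi: float, limit: float,
                      alpha: float = 0.05, beta: float = 0.05) -> int:
    """Smallest n whose beta-quantile observation yields a lower
    confidence limit above `limit` (equation (A3))."""
    n = 1
    while True:
        q = binom_quantile(beta, n, pi)
        if lower_limit(q, n, alpha) > limit:
            return n
        n += 1
```

With π = 0.90 and a minimal acceptable limit of 0.50, the search returns 13 cases, in agreement with Table 1.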
References
[1] Campbell MJ, Julious SA, Altman DG. Estimating sample sizes for
binary, ordered categorical, and continuous outcomes in two group
comparisons. BMJ 1995;311:1145–8.
[2] Simel DL, Samsa GP, Matchar DB. Likelihood ratios with confidence:
sample size estimation for diagnostic test studies. J Clin Epidemiol
1991;44:763–70.
[3] Knottnerus JA, Muris JW. Assessment of the accuracy of diagnostic
tests: the cross-sectional study. J Clin Epidemiol 2003;56:1118–28.
[4] Machin D, Campbell M, Fayers P, Pinol A. Sample size tables for
clinical studies. 2nd ed. Oxford: Blackwell Science; 1997.
[5] Armitage P, Berry G. Statistical methods in medical research. 2nd ed.
Oxford: Blackwell Science; 1987:117–9, 133–4, 462.
[6] Daly L. Simple SAS macros for the calculation of exact binomial and
Poisson confidence limits. Comput Biol Med 1992;22:351–61.
[7] Altman DG, Bland JM. Diagnostic tests 1: sensitivity and specificity. BMJ 1994;308:1552.
[8] Altman DG, Bland JM. Diagnostic tests 2: predictive values. BMJ 1994;309:102.
[9] Wolfram S. Mathematica: a system for doing mathematics by computer. 2nd ed. Champaign, IL: Addison Wesley; 1993.
[10] The European Agency for the Evaluation of Medicinal Products
Committee for proprietary medicinal products. Points to consider on the
evaluation of diagnostic agents [Internet]. CPMP/EWP/1119/98 (2001).
Available at: http://www.emea.eu.int/pdfs/human/ewp/111998en.pdf.
[11] Jaeschke R, Guyatt G, Sackett DL. The Evidence-Based Medicine
Working Group. Users’ guides to the medical literature. III. How to
use an article about a diagnostic test. A. Are the results of the study
valid? JAMA 1994;271:389–91.
[12] Jaeschke R, Guyatt G, Sackett DL. The Evidence-Based Medicine
Working Group. Users’ guides to the medical literature. III. How to
use an article about a diagnostic test. B. What are the results and will
they help me in caring for my patients? JAMA 1994;271:703–7.