Journal of Clinical Epidemiology 58 (2005) 859–862

Sample size calculation should be performed for design accuracy in diagnostic test studies

Antoine Flahault (a,b,*), Michel Cadilhac (a,b), Guy Thomas (b)

(a) Unité de Biostatistique et d'Informatique Médicale, Hôpital Tenon, Paris, France
(b) Institut National de la Santé et de la Recherche Médicale (INSERM), Unité 707, Université Pierre & Marie Curie, 27, rue Chaligny, F-75571 Paris cedex 12, France

Accepted 8 December 2004

Abstract

Background and Objectives: Guidelines for conducting studies and reading medical literature on diagnostic tests have been published: requirements for the selection of cases and controls, and for ensuring a correct reference standard, are now clarified. Our objective was to provide tables for sample size determination in this context.

Study Design and Setting: In the usual situation, where the prevalence Prev of the disease of interest is <0.50, one first determines the minimal number Ncases of cases required to ensure a given precision of the sensitivity estimate. Computations are based on the binomial distribution, for user-specified type I and type II error levels. The minimal number Ncontrols of controls is then derived so as to allow for representativeness of the study population, according to Ncontrols = Ncases [(1 − Prev)/Prev].

Results: Tables give the values of Ncases corresponding to expected sensitivities from 0.60 to 0.99, acceptable lower 95% confidence limits from 0.50 to 0.98, and a 5% probability that the estimated lower confidence limit falls below the acceptable level.

Conclusion: When designing diagnostic test studies, sample size calculations should be performed in order to guarantee design accuracy. © 2005 Elsevier Inc. All rights reserved.

Keywords: Sensitivity; Specificity; Sample size; Binomial distribution; Diagnostic test

1. Introduction

The importance of sample size calculation in medical research is emphasized in all sets of good clinical practice guidelines.
Previous articles have dealt with this issue under various circumstances, in particular for two-group comparisons within clinical trials [1]. Simel et al. [2] deal with sample sizes based on desired confidence intervals for likelihood ratios. Knottnerus and Muris [3] deal with the whole strategy needed for the development of diagnostic tests, but do not provide practical tables for calculating sample sizes in the situation clinician epidemiologists actually face when dealing with confidence intervals for sensitivity or specificity. From a statistical point of view, sample size issues for diagnostic test assessment studies have formal counterparts within the field of clinical trials, so that answers could, at least in principle, be derived from published equations and tables [4]; how to perform such derivations, however, may not be clear to clinicians. Moreover, in the case of binary (yes/no) outcome tests, a normal approximation to the binomial distribution is often used [5]. Although the accuracy of the approximation is usually good, modern software allows exact calculations to be carried out at virtually no extra cost. For instance, a SAS macro is available to compute exact binomial confidence limits [6]. Our objective was to describe the determination of sample size for binary diagnostic test assessment studies, and to provide exact tables based on the binomial distribution.

* Corresponding author. Tel.: +33-1-44738441; fax: +33-1-44738454. E-mail address: flahault.a@wanadoo.fr (A. Flahault). doi: 10.1016/j.jclinepi.2004.12.009

2. Methods

2.1. Definitions

Assessing a diagnostic test procedure with binary (yes/no) outcome entails determining the operating characteristics of the test with respect to some disease of interest. The intrinsic characteristics of the test are sensitivity and specificity.
Sensitivity (Se) is the probability that the test outcome is positive in a patient who has the disease; it is estimated by the proportion of positive test results among a sample of patients with the disease (cases). Specificity (Sp) is the probability that the test outcome is negative in a subject who is free from the disease of interest; it is estimated by the proportion of negative results in a sample of disease-free subjects (controls). The positive (or negative) predictive value of the test in a given population is the probability that a test-positive (or test-negative) subject has (or does not have) the disease. Although predictive values are of obvious clinical and epidemiological relevance, they are not intrinsic to the test, insofar as they also depend on the prevalence of the disease in the population under study. These issues are discussed by Altman and Bland [7,8].

In addition, to evaluate the accuracy of the sensitivity or specificity estimate, the experimenter must further estimate some confidence limit. The 1 − α lower confidence limit for Se (or Sp) can be thought of as the lowest value SeL (or SpL) that is not rejected by a one-sided test of level α of the null hypothesis Se = SeL (or Sp = SpL) against the alternative hypothesis Se > SeL (or Sp > SpL). Upper confidence limits are defined in an analogous manner, but are irrelevant here, because the concern is that the test actually performs worse, not better, than indicated by the observed proportion of positive (or negative) outcomes in the trial sample.

2.2. Number of cases

Assume first that we wish to determine the number of cases needed to estimate the sensitivity of a new diagnostic test. The process of determining the sample size in this context is formally identical to that of comparing an observed proportion to a known proportion.
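As a brief illustration of the dependence of predictive values on prevalence noted above, both quantities follow from Se, Sp, and Prev by Bayes' theorem. The sketch below uses the standard formulas; the numerical values are illustrative only and are not taken from the tables in this article.

```python
def predictive_values(se: float, sp: float, prev: float) -> tuple[float, float]:
    """Positive and negative predictive values via Bayes' theorem:
    PPV = Se*Prev / (Se*Prev + (1 - Sp)*(1 - Prev))
    NPV = Sp*(1 - Prev) / (Sp*(1 - Prev) + (1 - Se)*Prev)
    """
    ppv = se * prev / (se * prev + (1 - sp) * (1 - prev))
    npv = sp * (1 - prev) / (sp * (1 - prev) + (1 - se) * prev)
    return ppv, npv

# A test with Se = Sp = 0.90 applied where Prev = 0.10:
ppv, npv = predictive_values(0.90, 0.90, 0.10)
print(round(ppv, 3), round(npv, 3))  # 0.5 0.988
```

Note how a test with quite good intrinsic characteristics yields a positive predictive value of only one-half at 10% prevalence, which is why predictive values cannot be read as properties of the test alone.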
Prior to sample size computation, the experimenter must thus specify (i) the expected sensitivity Se of the test and (ii) the maximal distance δ from Se within which the 1 − α lower confidence limit is required to fall, with probability 1 − β. Although α and 1 − β retain their usual meanings in terms of type I error and power, respectively, δ is analogous to the effect size, and Se − δ plays the role of the known proportion. Relying on the normal approximation to the binomial distribution, sample sizes could thus be determined according to equation (A1) in the Appendix. For small values of δ, sample sizes may be further approximated by halving the numbers given in Table 3.1 of Machin et al. [4], but this table is ill-suited to the context of diagnostic test assessment studies. Moreover, because in the cases of interest the expected sensitivity will typically be close to one, the normal approximation may be somewhat inaccurate, and one should therefore fall back on exact equations based on the binomial distribution (see Appendix).

2.3. Number of controls

To determine the number of controls needed to estimate the specificity of a diagnostic test, the procedure is identical to that described in the preceding section, substituting specificity for sensitivity. In practice, the clinician will want to estimate both sensitivity and specificity within a study population containing cases and controls. In this case, to ensure that the study population is representative of the population to which the test will be applied, the proportions of cases and controls should take account of the prevalence Prev of the disease, according to

Ncontrols = Ncases [(1 − Prev)/Prev].   (1)

For the vast majority of diseases, Prev < 0.50, and so equation (1) yields Ncontrols > Ncases.
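The normal-approximation formula of equation (A1) mentioned in Section 2.2 can be sketched in a few lines of Python (standard library only); the two approximate figures quoted later in the Discussion serve as a check. This is a minimal sketch, not the authors' own code.

```python
from math import ceil, sqrt
from statistics import NormalDist

def n_normal_approx(pi: float, delta: float, alpha: float = 0.05, beta: float = 0.05) -> int:
    """Sample size from the normal approximation, equation (A1):
    n = [z_{1-b}*sqrt(pi*(1-pi)) + z_{1-a}*sqrt((pi-d)*(1-pi+d))]^2 / d^2
    """
    z_a = NormalDist().inv_cdf(1 - alpha)
    z_b = NormalDist().inv_cdf(1 - beta)
    n = (z_b * sqrt(pi * (1 - pi)) + z_a * sqrt((pi - delta) * (1 - pi + delta))) ** 2 / delta ** 2
    return ceil(n)

# The two approximate figures quoted in the Discussion:
print(n_normal_approx(0.85, 0.05))  # 621 (expected Se 0.85, lower limit 0.80)
print(n_normal_approx(0.99, 0.09))  # 54  (expected Se 0.99, lower limit 0.90)
```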
In such instances, provided the accuracy requirements are similar for Sp and Se, the experimenter should first determine the minimal number of cases from the tables, and then compute the number of controls from equation (1). If Prev > 0.50, first read Ncontrols in the tables, then compute Ncases from equation (1).

Sample sizes were computed according to equation (A3) in the Appendix, using Mathematica software [9]. The code is available from the corresponding author.

So far, we have considered situations where cases and controls are sampled separately. Often, however, the investigator must sample from a population without prior knowledge of individual case–control status. In such instances, the sample size n must be determined such that, with high probability (e.g., 95%), the sample contains sufficient numbers of cases and controls. To meet this requirement, a possible strategy is to choose n as the smallest integer such that

Σ_{x=Ncases}^{n} C(n,x) Prev^x (1 − Prev)^{n−x} ≥ 0.95,   (2)

where Prev is the population disease prevalence and Ncases is determined from the tables. If Prev > 0.50, use the same equation with Ncontrols in place of Ncases.

3. Results

Sample sizes corresponding to lower 95% confidence limits, to be violated with probability <5%, are presented in Tables 1 and 2. Whenever the disease prevalence is <0.50, the following guidelines should be followed. The first step requires an assumption on the expected value of the new diagnostic test's sensitivity. The second step is to specify the minimum acceptable lower confidence limit, together with the required probability (set here at 0.95) that this limit is not violated. The minimal sample size for the group of cases is then read from the tables. The corresponding number of controls is obtained from equation (1).
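The exact calculation of equations (A2) and (A3), which the authors implemented in Mathematica, can also be sketched in Python: for a candidate n, take the β-quantile q of the Binomial(n, π) distribution, compute the exact lower confidence limit at x = q by bisection, and find the smallest n for which that limit reaches π − δ. This is a rough stdlib-only equivalent written for illustration, not the authors' code.

```python
from math import comb

def upper_tail(n: int, x: int, p: float) -> float:
    """P(X >= x) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(x, n + 1))

def exact_lower_limit(x: int, n: int, alpha: float = 0.05) -> float:
    """Solve equation (A2): the 1 - alpha lower limit pi_L satisfies
    P(Binomial(n, pi_L) >= x) = alpha. Found by bisection (the tail
    probability increases with p)."""
    if x == 0:
        return 0.0
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if upper_tail(n, x, mid) < alpha:
            lo = mid
        else:
            hi = mid
    return lo

def exact_sample_size(pi: float, delta: float, alpha: float = 0.05, beta: float = 0.05) -> int:
    """Smallest n such that the beta-quantile of the lower limit, equation (A3),
    is at least pi - delta, i.e. the limit falls below pi - delta with
    probability at most beta."""
    n = 1
    while True:
        # q = beta-quantile of Binomial(n, pi): smallest q with P(X <= q) >= beta
        q, cdf = 0, (1 - pi) ** n
        while cdf < beta:
            q += 1
            cdf += comb(n, q) * pi ** q * (1 - pi) ** (n - q)
        if exact_lower_limit(q, n, alpha) >= pi - delta:
            return n
        n += 1

print(exact_sample_size(0.99, 0.09))  # 61, the exact figure quoted in the Discussion
```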
For example, suppose we wish to investigate a new diagnostic procedure with expected sensitivity and specificity of 0.90, in a population where the disease prevalence is 0.10, and we require the lower 95% confidence limit to be >0.75 with 0.95 probability. From Table 1, Ncases = 70; from equation (1), Ncontrols = 630.

Table 1
Number of cases (or controls) for expected sensitivities (or specificities) ranging from 0.60 to 0.95

Expected Se    Minimal acceptable lower confidence limit
(or Sp)      0.50   0.55   0.60   0.65   0.70   0.75   0.80   0.85   0.90
0.60          268  1,058
0.65          119    262  1,018
0.70           67    114    248    960
0.75           42     62    107    230    869
0.80           28     40     60     98    204    756
0.85           18     26     33     52     85    176    624
0.90           13     18     24     31     41     70    235    474
0.95           11     12     14     16     24     34     50     93    298

The probability that the estimated 95% lower confidence limit is above the minimal acceptable value is 0.95.

4. Discussion

We have tabulated the sample sizes needed for assessing diagnostic procedures (Tables 1 and 2), to help physicians and clinical epidemiologists design diagnostic test assessment studies. If one expected a sensitivity of 0.85 for a screening test and considered that the lower 95% confidence limit should not fall below 0.80, with 0.95 probability, the exact number of cases is 624 (Table 1). The corresponding normal approximation yields 621 cases, which is only 0.5% lower than the exact figure. The approximation is obviously acceptable here, but diagnostic tests are usually associated with higher values of sensitivity and specificity, and in such instances the normal approximation, although usually robust, may be misleading. For a sensitivity of 0.99, with the lower 95% confidence limit required to be >0.90 with 0.95 probability, the approximate number of cases is 54, which is 11% lower than the 61 cases derived by the exact method (Table 2).
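Returning to the worked example above (Ncases = 70, Prev = 0.10): when the investigator must sample without prior knowledge of case–control status, equation (2) gives the total sample size n needed to contain at least Ncases cases with high probability. A minimal sketch, using the complementary lower tail to keep the sum short:

```python
from math import comb

def lower_tail(n: int, k: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(k + 1))

def total_sample_size(n_cases: int, prev: float, assurance: float = 0.95) -> int:
    """Smallest n with P(Binomial(n, prev) >= n_cases) >= assurance,
    i.e. equation (2): the undifferentiated sample is large enough to
    contain at least n_cases cases with high probability."""
    n = n_cases
    # P(X >= n_cases) = 1 - P(X <= n_cases - 1) increases with n, so scan upward.
    while 1 - lower_tail(n, n_cases - 1, prev) < assurance:
        n += 1
    return n

# Values from the worked example: 70 required cases, prevalence 0.10.
n = total_sample_size(70, 0.10)
print(n)  # noticeably larger than 70 / 0.10 = 700
```

The result exceeds the naive 70 + 630 = 700, since the expected number of cases in a sample of 700 is exactly 70 and the 95% assurance requires a margin above it.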
Because we wished to present a ready-to-use approach to determining sample size for binary outcomes, a situation that frequently occurs in diagnostic test studies, we did not take into account other refinements, such as polytomous outcomes, study design, or observer variation. Most of these considerations are presented and discussed elsewhere [3,10–12]. Knottnerus and Muris [3] dealt with the whole strategy needed for the development of diagnostic tests, including issues related to the appropriate research question, study design, the reference standard, the study population, statistical aspects including sample size, and external (clinical) validation. Jaeschke et al. [11,12] highlighted the main principles for the validity of diagnostic test studies. In particular, three points were emphasized: (i) independent, blind comparison with a reference standard; (ii) an appropriate spectrum of included patients, to whom the test will be applied in practice; and (iii) absence of influence of the test results on the decision to perform the reference standard. The "points to consider" published by the European Agency for the Evaluation of Medicinal Products [10] provide useful guidelines for those who wish to develop diagnostic agents. Requirements about sample size are also worth considering: in particular, point estimates do not provide sufficient information about the validity of diagnostic tests, and lower confidence limits should be provided. Additional research developments are probably needed for polytomous outcomes, for which methods do not yet seem widely available.

In conclusion, we recommend using the one-sided 95% formulation of the lower confidence limit for sample size calculations when designing routine studies on diagnostic test assessment.

Appendix

1. Suppose the expected sensitivity of the diagnostic test is π, and we wish the 1 − α lower confidence limit for π to be greater than π − δ with probability 1 − β.
Based on the normal approximation, the required number of subjects is

n = [z_{1−β} √(π(1−π)) + z_{1−α} √((π−δ)(1−π+δ))]² / δ²,   (A1)

where z_{1−α} and z_{1−β} are the 1 − α and 1 − β quantiles, respectively, of the standard normal distribution. Note that equation (A1) is simply equation (3.10) from Machin et al. [4], setting π1 = π − δ and π2 = π.

Table 2
Number of cases (or controls) for expected sensitivities (or specificities) ranging from 0.91 to 0.99

Expected Se    Minimal acceptable lower confidence limit
(or Sp)      0.85   0.86   0.87   0.88   0.89   0.90   0.91   0.92   0.93   0.94   0.95   0.96   0.97   0.98
0.91          319    438    666  1,127  2,443  9,309
0.92          220    294    403    613  1,035  2,215  8,428
0.93          166    203    273    372    549    934  1,992  7,512
0.94          126    153    183    248    334    493    832  1,763  6,576
0.95           93    109    137    169    217    298    434    729  1,524  5,626
0.96           76     82     98    117    151    191    253    374    625  1,288  4,654
0.97           59     63     79     85    105    129    158    224    309    519  1,036  3,643
0.98           50     53     58     63     69     89    115    129    185    259    386    781  2,620
0.99           50     50     50     51     56     61     68     77    109    127    181    261    521  1,567

The probability that the estimated 95% lower confidence limit is above the minimal acceptable value is 0.95.

2. Let x be an observation from the binomial distribution with parameters n and π. It can be shown [1] that the estimate π̂L of the 1 − α lower confidence limit for π is the solution to

Σ_{i=x}^{n} C(n,i) π̂L^i (1 − π̂L)^{n−i} = α.   (A2)

Because π̂L is increasing in x, the β-quantile, γ, of the distribution of π̂L is the solution to

Σ_{i=q_β}^{n} C(n,i) γ^i (1 − γ)^{n−i} = α,   (A3)

where q_β is the β-quantile of the binomial distribution with parameters n and π.

References

[1] Campbell MJ, Julious SA, Altman DG. Estimating sample sizes for binary, ordered categorical, and continuous outcomes in two group comparisons. BMJ 1995;311:1145–8.
[2] Simel DL, Samsa GP, Matchar DB. Likelihood ratios with confidence: sample size estimation for diagnostic test studies. J Clin Epidemiol 1991;44:763–70.
[3] Knottnerus JA, Muris JW.
Assessment of the accuracy of diagnostic tests: the cross-sectional study. J Clin Epidemiol 2003;56:1118–28.
[4] Machin D, Campbell M, Fayers P, Pinol A. Sample size tables for clinical studies. 2nd ed. Oxford: Blackwell Science; 1997.
[5] Armitage P, Berry G. Statistical methods in medical research. 2nd ed. Oxford: Blackwell Science; 1987. p. 117–9, 133–4, 462.
[6] Daly L. Simple SAS macros for the calculation of exact binomial and Poisson confidence limits. Comput Biol Med 1992;22:351–61.
[7] Altman DG, Bland JM. Diagnostic tests 1: sensitivity and specificity. BMJ 1994;308:1552.
[8] Altman DG, Bland JM. Diagnostic tests 2: predictive values. BMJ 1994;309:102.
[9] Wolfram S. Mathematica: a system for doing mathematics by computer. 2nd ed. Champaign, IL: Addison Wesley; 1993.
[10] The European Agency for the Evaluation of Medicinal Products, Committee for Proprietary Medicinal Products. Points to consider on the evaluation of diagnostic agents [Internet]. CPMP/EWP/1119/98 (2001). Available at: http://www.emea.eu.int/pdfs/human/ewp/111998en.pdf.
[11] Jaeschke R, Guyatt G, Sackett DL, for the Evidence-Based Medicine Working Group. Users' guides to the medical literature. III. How to use an article about a diagnostic test. A. Are the results of the study valid? JAMA 1994;271:389–91.
[12] Jaeschke R, Guyatt G, Sackett DL, for the Evidence-Based Medicine Working Group. Users' guides to the medical literature. III. How to use an article about a diagnostic test. B. What are the results and will they help me in caring for my patients? JAMA 1994;271:703–7.