What is the point of confidence intervals?

Stephen Gorard
Durham University
s.a.c.gorard@durham.ac.uk

Overview

A confidence interval (CI) is one of the most widely abused and misunderstood ideas in statistical analysis. It is an attempt to illustrate the uncertainty inherent in any estimate of a population value based on the value obtained from a random sample. This illustration is intended to help users and analysts judge how good the estimate is. Unfortunately, the logic underlying the way in which CIs are used is flawed, in the same way as the logic of significance testing is, and anyway there is rarely a real-life situation in which the assumptions necessary to calculate CIs are met. For those interested, this brief outline explains why. For everyone else, it is safe simply to ignore confidence intervals as irrelevant, overly complex, unrealistic and potentially misleading.

An illustration of a confidence interval

A 95% confidence interval for a mean measurement based on a large true random sample, where the measurements themselves are normally distributed, is calculated as the sample mean plus or minus 1.96 times the standard error of the mean. The value of 1.96 comes from the fact that 95% of the area under a normal curve lies within 1.96 standard deviations of its mean (and this value would be different for other CIs, such as 90% or 50%). This multiplier is applied to the standard error of the sample mean, which is estimated as the standard deviation of the sample divided by the square root of the number of cases in the sample. The standard deviation itself is the square root of the sum of the squared deviations of each score from the mean, divided by the number of cases in the sample.

Confidence intervals can also be calculated for other estimates of the population data, perhaps most commonly ‘effect’ sizes. The discussion here is about means, but the same argument applies to any use of CIs in social science.

Imagine a population of all of the schoolchildren aged 10 in one region. A sample of 81 schoolchildren is selected at random from this known population, and tested for their attainment in maths. Imagine also that all 81 children took part, that the tests were 100% accurate as an assessment of attainment, and that the 81 scores were normally distributed. Perhaps the mean attainment score of the 81 children was 50 marks, with a standard deviation of 18. The 95% confidence interval for this imaginary result would be from 50 - 1.96(18/9) to 50 + 1.96(18/9), or from +46.08 to +53.92. This is, and looks like, a small interval, and it would traditionally give an analyst reasonable confidence that the sample mean is a robust estimate of the population mean. But should it? What does all of this complicated calculation, with its squaring and square rooting, really tell us about the proximity of the sample mean to the real population mean?
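For readers who wish to see the arithmetic laid out, a minimal sketch of this calculation in Python, using only the hypothetical figures from the example above, is:

```python
import math

# Hypothetical figures from the worked example above:
# a true random sample of 81 children, mean maths score 50, SD 18.
n = 81
sample_mean = 50.0
sample_sd = 18.0

# Standard error of the mean: sample SD divided by the square root of n.
standard_error = sample_sd / math.sqrt(n)   # 18 / 9 = 2.0

# 95% of a normal curve lies within 1.96 standard deviations of its mean.
margin = 1.96 * standard_error              # 3.92

print(f"95% CI: {sample_mean - margin:.2f} to {sample_mean + margin:.2f}")
# -> 95% CI: 46.08 to 53.92
```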
The assumptions needed for confidence intervals

It would obviously be unnecessary to use CIs in the situation where the population mean is already known. Confidence intervals are not required and do not offer any sensible or comprehensible message for anyone working with population data. And they cannot make up for missing data in population datasets, because an incomplete population is not a random sample, of course. Nor do they address things like bias or errors in measurement.

It is also hard to envisage a situation where the standard deviation of the population was known but the population mean was not. In practice, then, any confidence interval is calculated as above, based on mathematical information about a perfect normal distribution, but using only the achieved empirical information about a specific sample mean and standard deviation.

It would be incorrect to use CIs in any situation where their underlying assumptions were not met. Any deviation from normality in the achieved sample data would mean that the mathematical basis for calculating a CI no longer applied. Thus, in the vast majority of social science situations, where the data does not describe a perfect or even near-perfect normal distribution, CIs would be misleading and should be avoided. Some commentators advise dealing with such problems by adjusting the CI formula to use the t distribution (as in the t-test statistic) instead of the normal distribution. However, as long as the sample is large the choice makes little difference, and it is as unlikely that the data follows a perfect t distribution as that it is normally distributed.

It would also be incorrect to use CIs when the sample is not randomly selected from the population. If the cases are selected non-randomly in any way, or if there is any non-response, then the sample is not random, by definition. This again means that in the vast majority of social science situations with incomplete or purposive samples, or attempted population data like the UK birth cohort studies, CIs cannot be used and would not make any sense if they were. It would be incorrect to use CIs to try to deal with the fact that the measurements taken from a random sample were less than 100% accurate (even if they were normally distributed).

In summary, there is no real-life research situation in which the assumptions necessary for the valid calculation of CIs will be met.

The meaning of a confidence interval

However, for the sake of argument, imagine a dataset like the one in the illustration above that is perfectly normal in shape, with 100% accuracy and 100% response. Even here, the CI does not state what most analysts, and even purported expert sources, imagine that it states. One CI for one sample says nothing at all about the mean of the population from which it was drawn. How could it? The achieved sample mean is the only estimate of the population mean, and so it is also the best estimate of it, by definition. This achieved sample mean could be close to the population mean or much larger or smaller than it. There is nothing in the measurements from the sample that can tell us what the true situation is, and it is a kind of magic-thinking or superstition when analysts imagine that a computer or calculator could derive a different or better estimate of the actual population mean just by using the CI formula. Apart from its invalid use in situations where key assumptions are not met, this magic-thinking is the most widespread and dangerous abuse of the idea of a confidence interval. It is important to recall that any CI for a sample mean does not state that the population mean has a 95% chance of being within that CI. The CI is based solely on the sample mean and in itself says nothing about the population mean. A confidence interval for a sample mean is really a recursive or even tautological construct.

Given a normal distribution for a random sample with complete response, complete measurement accuracy, and of a known size and standard deviation, the CI is about what happens when repeated samples of the same size are drawn. Of course, this is again hypothetical, for several reasons. If repeated random samples were drawn in practice, the best estimate of the population mean would be the overall mean of the repeated samples (the process would, in effect, simply provide a larger sample and so a better estimate). The use of CIs could not and would not improve this estimate. But in the hypothetical situation, if many repeated true random samples of the same size had their means and confidence intervals estimated, then it is assumed that 95% of the time the actual population mean will lie within those many various confidence intervals. Each time, a different CI is produced. So, even if the battery of unrealistic assumptions were met, a CI is defined only in terms of other repeated hypothetical CIs. And it is only one of these many different CIs for the same population mean. This makes it a very strange concept.
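This repeated-sampling definition can be demonstrated with a small simulation. The sketch below assumes exactly the idealised conditions described above (a known normal population, true random sampling, complete response); the population mean and standard deviation chosen here are arbitrary illustrative values, not taken from any real data:

```python
import random

random.seed(1)

POP_MEAN, POP_SD = 75.0, 18.0   # arbitrary 'true' population values
N, TRIALS = 81, 10_000

covered = 0
for _ in range(TRIALS):
    # One idealised random sample: complete, accurate, normally distributed.
    sample = [random.gauss(POP_MEAN, POP_SD) for _ in range(N)]
    mean = sum(sample) / N
    # Standard deviation as defined above: divide by the number of cases.
    sd = (sum((x - mean) ** 2 for x in sample) / N) ** 0.5
    margin = 1.96 * sd / N ** 0.5
    # Each trial yields a different CI, centred on its own sample mean.
    if mean - margin <= POP_MEAN <= mean + margin:
        covered += 1

print(f"CIs containing the population mean: {covered / TRIALS:.1%}")
# Close to 95% across the whole collection of hypothetical intervals,
# but this says nothing about whether any ONE interval contains it.
```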
Although it is true that 95% of the area of the normal curve will lie within 1.96 standard deviations of its mean (by definition), it must be recalled that it is not the population mean (nor its standard deviation) that is used to calculate CIs. Sample CIs are, by definition, always centred on the achieved sample mean. Thus, if the achieved sample mean were the true population mean, and the same size of sample were drawn repeatedly, then 95% of the repeated sample means would lie within 1.96 standard deviations of the original achieved sample mean (assuming that each sample is random and complete, and that the data from each sample is normally distributed, of course). This is the true meaning of a confidence interval.

But if the sample mean is not the true population mean (and why would it be?), then the CI calculations will be conducted with a mean that is not at the centre of the population normal distribution, and so 95% of the area will not lie within 1.96 standard deviations. In reality, very little of the normal curve might be near the achieved sample mean. Therefore, a specific sample CI cannot show how close a specific sample mean is to the population mean. Nor does it follow that the CI for a specific sample would be one of the 95% of samples that would contain the population mean. That is a shame, because this is what analysts want, and what most pretend that the CI provides for them.

Perhaps an easier way to see how useless CIs are in reality is to return to the example of a random sample of 81 children scoring an average of 50 with a standard deviation of 18 in a maths attainment task. Imagine further that the average score in maths for the population is actually 75. This means that the mean and CI calculated for the achieved sample, 50 with a CI from +46.08 to +53.92, are considerable underestimates. But the analyst would not know this in practice, because they would not know the population mean (else they would not need CIs). They might conclude, wrongly, that +46.08 to +53.92 is a tight range and that 50 is therefore a good estimate of the population mean.

Imagine now that the population mean was really 40, not 75. What difference does this make to the CIs for the sample? It makes no difference, because there is no relationship between the calculation of the sample CIs and the actual population mean. In this second example, the sample mean is much closer to the population mean (10 points or 20% off), but the CIs are calculated in the same way and give exactly the same answer as in the former example, where the sample mean was much further from the population mean (25 points or 50% off). CIs say nothing about the proximity of any one achieved sample mean to the population mean. To imagine that they could is to believe in magic, not science.
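The point can be verified directly: the CI formula takes no input other than the sample itself, so the (unknown) population mean cannot affect the result. A minimal sketch, reusing the figures above:

```python
import math

def ci_95(sample_mean, sample_sd, n):
    # The interval is computed from the sample alone;
    # no population value appears anywhere in the formula.
    margin = 1.96 * sample_sd / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin

# The same achieved sample (mean 50, SD 18, n = 81) ...
low, high = ci_95(50.0, 18.0, 81)

# ... yields the same interval whether the true population mean is 75 or 40.
for true_population_mean in (75, 40):
    print(f"population mean {true_population_mean}: CI {low:.2f} to {high:.2f}")
# population mean 75: CI 46.08 to 53.92
# population mean 40: CI 46.08 to 53.92
```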
Conclusion

Confidence intervals are unusable in just about all real-life contexts (where at least one of non-response, some measurement error, or departures from normality in the data will occur). More importantly, they are useless even in ideal circumstances, since they are just the achieved data writ large. They rely on assuming that the achieved sample mean is the population mean in order to try to calculate the probability of it being the population mean! The ‘logic’ works no better than the ‘logic’ by which assuming that a null hypothesis is true is supposed to lead to a probability of it being true. Modus tollendo tollens arguments do not work with probabilities.