QM 6 Confidence intervals

What is the point of confidence intervals?
Stephen Gorard
Durham University
A confidence interval (CI) is one of the most widely abused and misunderstood ideas in statistical
analysis. It is an attempt to provide an illustration of the uncertainty inherent in any estimate of a
population value based on the value obtained from a random sample. This kind of illustration is
intended to be used to help users and analysts to judge how good the estimate is. Unfortunately, the
logic underlying the way in which CIs are used is flawed, in the same way as the logic of significance
testing is, and in any case there is rarely a real-life situation in which the assumptions necessary to
calculate CIs are met. For those interested, this brief outline explains why. For everyone else, it is
safe simply to ignore confidence intervals as irrelevant, overly complex, unrealistic and potentially
misleading.
An illustration of a confidence interval
A 95% confidence interval for a mean measurement based on a large true random sample, where
the measurements themselves are normally distributed, is calculated as the sample mean plus or
minus 1.96 times the standard error of the mean. The value of 1.96 comes from the fact that 95% of
the area under a normal curve lies within 1.96 standard deviations of its mean (and this value would
be different for other CIs, such as 90% or 50%). This value is then multiplied by the sampling error of
the sampling distribution, estimated as the standard error of the sample mean, which is the standard
deviation of the sample scores divided by the square root of the number of cases in the sample. The
standard deviation itself is the square root of the sum of the squared deviations of each score from
the mean, divided by the number of cases in the sample (or, more commonly for a sample, by one
fewer than the number of cases). Confidence intervals can also be calculated for other estimates of
population values, perhaps most commonly ‘effect’ sizes. The discussion here is about means, but the
same argument applies to any use of CIs in social science.
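The calculation just described can be sketched in Python. This is a minimal sketch assuming the scores are held in a plain list; the function names `z_critical` and `mean_ci` are hypothetical. It uses the standard library's `statistics.stdev`, which divides by n - 1 (`statistics.pstdev` divides by n), and it shows where 1.96, and the corresponding values for 90% and 50% intervals, come from.

```python
import math
from statistics import NormalDist, fmean, stdev


def z_critical(level: float) -> float:
    """Two-sided critical value of the standard normal curve for a confidence level."""
    # For a 95% interval, 2.5% of the area lies in each tail, so invert the CDF at 0.975.
    return NormalDist().inv_cdf(0.5 + level / 2)


def mean_ci(scores: list[float], level: float = 0.95) -> tuple[float, float]:
    """Normal-theory CI for a mean from raw scores (hypothetical helper)."""
    n = len(scores)
    mean = fmean(scores)
    sd = stdev(scores)          # sample standard deviation, divisor n - 1
    se = sd / math.sqrt(n)      # standard error of the mean
    half_width = z_critical(level) * se
    return mean - half_width, mean + half_width


for level in (0.50, 0.90, 0.95):
    print(f"{level:.0%} interval: z = {z_critical(level):.3f}")
```

Only the critical value changes with the confidence level; the mean and standard error are computed the same way whatever level is chosen.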
Imagine a population of all of the schoolchildren aged 10 in one region. A sample of 81 schoolchildren is selected at random from this known population, and tested for their attainment in maths.
Imagine also that all 81 children took part, that the tests were 100% accurate as an assessment of
attainment, and that the 81 scores were normally distributed. Perhaps the mean attainment score of
the 81 children was 50 marks, with a standard deviation of 18. The 95% confidence interval for this
imaginary result would be from 50 - 1.96 × (18/9) to 50 + 1.96 × (18/9), where 9 is the square root of
the sample size of 81, or from +46.08 to +53.92. This is, and looks like, a small interval, and it would
traditionally give an analyst reasonable confidence that
the sample mean is a robust estimate of the population mean. But should it? What does all of this
complicated calculation, with its squaring and square rooting, really tell us about the proximity of
the sample mean to the real population mean?
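The arithmetic of this imaginary example, set out step by step using only the figures given above (n = 81, a sample mean of 50 and a standard deviation of 18):

```latex
\[
\mathrm{SE} = \frac{s}{\sqrt{n}} = \frac{18}{\sqrt{81}} = \frac{18}{9} = 2
\]
\[
\bar{x} \pm 1.96 \times \mathrm{SE} = 50 \pm 1.96 \times 2 = 50 \pm 3.92
\quad\Rightarrow\quad \text{95\% CI} = [46.08,\ 53.92]
\]
```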
The assumptions needed for confidence intervals
It would obviously be unnecessary to use CIs in the situation where the population mean is already
known. Confidence intervals are not required and do not offer any sensible or comprehensible
message for anyone working with population data. And they cannot make up for missing data in
population datasets, because an incomplete population is not, of course, a random sample. Nor do
they address things like bias or errors in measurement. It is also hard to envisage a situation where
the standard deviation of the population was known but the population mean was not. In practice,
then, any confidence interval is calculated as above, based on mathematical information about a
perfect normal distribution, but using only the achieved empirical information about a specific
sample mean and standard deviation.
It would be incorrect to use CIs in any situation where their underlying assumptions were not met.
Any deviation from normality in the achieved sample data would mean that the mathematical basis
for calculating a CI no longer applied. Thus, in the vast majority of social science situations, where
data does not describe a perfect or even near-perfect normal distribution, CIs would be misleading
and should be avoided. Some commentators advise dealing with sampling problems by adjusting the
CI formula to use the t distribution (as in the t-test statistic) instead of the normal distribution.
However, as long as the sample is large, the choice here makes little difference, and it is as unlikely
that the data follows a perfect t distribution as that it is normally distributed.
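How much difference the choice between the t and the normal distribution makes can be checked numerically. The sketch below is stdlib-only, so instead of calling a statistics library (`scipy.stats.t.ppf` would be the usual choice) it approximates the t critical value by integrating the t density with Simpson's rule and then bisecting; the helper names are hypothetical. It compares two-sided 95% critical values for several sample sizes, using n - 1 degrees of freedom.

```python
import math
from statistics import NormalDist


def t_pdf(x: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm) * (1 + x * x / df) ** (-(df + 1) / 2)


def t_cdf(x: float, df: int, steps: int = 2000) -> float:
    """P(T <= x) for x >= 0: 0.5 plus a Simpson's-rule integral of the density on [0, x]."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3


def t_critical(p: float, df: int) -> float:
    """Bisection for the x with t_cdf(x, df) = p, assuming 0.5 < p < 0.99."""
    lo, hi = 0.0, 50.0
    for _ in range(40):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


z = NormalDist().inv_cdf(0.975)
for n in (5, 11, 31, 81):
    t = t_critical(0.975, n - 1)
    print(f"n = {n:>2}: t = {t:.3f} vs normal {z:.3f} "
          f"(interval {t / z - 1:.1%} wider)")
```

For a sample of 81, as in the illustration above, the t value is only slightly larger than 1.96; the difference becomes substantial only for very small samples. Nothing in this comparison checks whether the data actually follow either distribution.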
It would also be incorrect to use CIs when the sample is not randomly selected from the population.
If the cases are selected non-randomly in any way, or if there is any non-response, then the sample is
not random, by definition. This again means that in the vast majority of social science situations with
incomplete or purposive samples, or attempted population data like the UK birth cohort studies, CIs
cannot be used and would not make any sense if they were. It would be incorrect to use CIs to try
to deal with the fact that the measurements taken from a random sample were less than 100% accurate
(even if they were normally distributed). In summary, there is no real-life research situation in which
the assumptions necessary for the valid calculation of CIs will be met.
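A small simulation illustrates the point about non-response. It is a sketch under invented assumptions: a normally distributed population with mean 50 and standard deviation 18, 120 children invited per sample, and (in the second case) a hypothetical response rule under which children scoring above 50 take part with probability 0.9 and the rest with probability 0.5. The usual CI formula is then applied to whoever responds, as an analyst would apply it.

```python
import math
import random
from statistics import fmean

POP_MEAN, POP_SD = 50.0, 18.0   # hypothetical population of test scores
INVITED, REPS = 120, 4000       # children invited per sample; simulated samples


def coverage(respond_prob, seed: int = 1) -> float:
    """Share of 95% CIs, computed from respondents only, that contain POP_MEAN."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(REPS):
        invited = [rng.gauss(POP_MEAN, POP_SD) for _ in range(INVITED)]
        # Each invited child takes part with a probability that may depend on their score.
        achieved = [x for x in invited if rng.random() < respond_prob(x)]
        n = len(achieved)
        mean = fmean(achieved)
        sd = math.sqrt(sum((x - mean) ** 2 for x in achieved) / (n - 1))
        half = 1.96 * sd / math.sqrt(n)
        if mean - half <= POP_MEAN <= mean + half:
            hits += 1
    return hits / REPS


full = coverage(lambda x: 1.0)                        # complete response
biased = coverage(lambda x: 0.9 if x > 50 else 0.5)   # hypothetical response rule
print(f"complete response:          {full:.3f}")
print(f"score-related non-response: {biased:.3f}")
```

With complete response, close to 95% of the intervals contain the population mean; with score-related non-response the share falls well below that, yet each interval looks just as tight. The formula has no input through which the missing cases could be taken into account.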
The meaning of a confidence interval
However, for the sake of argument, imagine a dataset like the one in the illustration above that is perfectly
normal in shape, with 100% accuracy and 100% response. Even here, the CI does not state what
most analysts and even purported expert sources imagine that it states. One CI for one sample says
nothing at all about the mean of the population from which it was drawn. How could it? The
achieved sample mean is the only estimate of the population mean, and so it is also the best
estimate of it, by definition. This achieved sample mean could be close to the population mean or
much larger or smaller than it. There is nothing in the measurements from the sample that can tell
us what the true situation is, and it is a kind of magical thinking or superstition when analysts imagine
that a computer or calculator could derive a different or better estimate of the actual population
mean just by using the CI formula. Apart from its invalid use in the situations where key assumptions
are not met, this magical thinking is the most widespread and dangerous abuse of the idea of a
confidence interval. It is important to recall that any CI for a sample mean does not state that the
population mean has a 95% chance of being within that CI. The CI is based solely on the sample
mean and in itself says nothing about the population mean.
A confidence interval for a sample mean is really a recursive or even tautological construct. Given a
normal distribution for a random sample with complete response, complete measurement accuracy,
and of a known size and standard deviation, the CI is about what happens when repeated samples of
the same size are drawn. Of course, this is again hypothetical, and for several reasons. If repeated
random samples were drawn in practice, the best estimate of the population mean would be the
overall mean for the repeated samples (the process would, in effect, simply provide a larger sample
and so a better estimate). The use of CIs could not and would not improve this estimate. But in the
hypothetical situation, if many repeated true random samples of the same size had their means and
confidence intervals estimated, then it is assumed that the actual population mean would lie within
95% of those many different confidence intervals, with each sample producing a different CI. So, even
if the battery of unrealistic assumptions were met, a CI is defined only in terms of other repeated
hypothetical CIs. And it is only one of these many different CIs for the same population mean. This
makes it a very strange concept.
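The hypothetical repeated-sampling scenario can be simulated. This sketch assumes, purely for illustration, a normal population with mean 50 and standard deviation 18, and draws 2,000 complete random samples of 81; the seed and number of repetitions are arbitrary choices. It prints a few of the resulting intervals (each sample gives a different one), the share of all intervals that contain the population mean, and the mean of all the sample means, which is the overall estimate from the repeated samples.

```python
import math
import random
from statistics import fmean

POP_MEAN, POP_SD = 50.0, 18.0   # hypothetical population values
N, REPS = 81, 2000              # sample size and number of repeated samples
rng = random.Random(7)

means, intervals = [], []
for _ in range(REPS):
    sample = [rng.gauss(POP_MEAN, POP_SD) for _ in range(N)]
    m = fmean(sample)
    sd = math.sqrt(sum((x - m) ** 2 for x in sample) / (N - 1))
    half = 1.96 * sd / math.sqrt(N)
    means.append(m)
    intervals.append((m - half, m + half))

for lo, hi in intervals[:3]:
    print(f"one sample's 95% CI: {lo:.2f} to {hi:.2f}")   # differs from sample to sample

share = sum(lo <= POP_MEAN <= hi for lo, hi in intervals) / REPS
print(f"share of CIs containing the population mean: {share:.3f}")
print(f"mean of all {REPS} sample means: {fmean(means):.2f}")
```

The roughly 95% figure is a property of the whole collection of intervals; the simulation can only report it because the population mean was fixed in advance.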
Although it is true that 95% of the area of the normal curve will lie within 1.96 standard deviations of
its mean (by definition), it must be recalled that it is not the population mean (nor its standard
deviation) that is used to calculate CIs. Sample CIs are, by definition, always centred around the
achieved sample mean. Thus, if the achieved sample mean were the true population mean, and a
sample of the same size was drawn repeatedly, then 95% of the repeated sample means would lie within
1.96 standard errors of the original achieved sample mean (assuming, of course, that each sample is
random and complete and that the data from each sample is normally distributed). This is the
true meaning of a confidence interval. But if the sample mean is not the true population mean (and
why would it be?), then the CI calculations will be conducted with a mean that is not at the centre of
the normal distribution of repeated sample means, and so 95% of that area will not lie within 1.96
standard errors of it. In reality, very little of the normal curve might be near the achieved sample mean.
Therefore, a specific sample CI cannot show how close a specific sample mean would be to the
population mean. Nor does it follow that the CI for a specific sample mean would be one of the 95%
of samples that would contain the population mean. That is a shame because this is what analysts
want, and what most pretend that the CI provides for them.
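The dependence on where the achieved sample mean happens to fall can be made concrete. Assuming, as in the argument above, that repeated sample means are normally distributed around the population mean with a known standard error (here 2, as in the worked example, with a hypothetical population mean of 50), the sketch computes the share of repeated sample means that would fall inside the 95% CI centred on one achieved sample mean. It does this for achieved means at different distances from the population mean; `share_captured` is a hypothetical name.

```python
from statistics import NormalDist

POP_MEAN = 50.0   # hypothetical true population mean
SE = 2.0          # standard error from the worked example (18 / 9)
repeated_means = NormalDist(POP_MEAN, SE)   # where repeated sample means would fall


def share_captured(sample_mean: float) -> float:
    """Share of repeated sample means inside the 95% CI centred on one sample mean."""
    lo, hi = sample_mean - 1.96 * SE, sample_mean + 1.96 * SE
    return repeated_means.cdf(hi) - repeated_means.cdf(lo)


for offset in (0, 1, 2, 4):   # distance of the achieved mean from POP_MEAN, in SEs
    share = share_captured(POP_MEAN + offset * SE)
    print(f"achieved mean {offset} SE from the population mean: {share:.1%} captured")
```

Only when the achieved sample mean coincides with the population mean is the share 95%; it drops steadily as the two diverge, and in practice the analyst has no way of knowing which case applies.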
Perhaps an easier way to see how useless CIs are in reality is to return to the example of a random
sample of 81 children scoring an average of 50 with a standard deviation of 18 in a maths attainment
task. Imagine further that the average score in maths for the population is actually 75. This means
that the mean and CI calculated for the achieved sample, 50 with CI from +46.08 to +53.92, are
considerable underestimates. But the analyst would not know this in practice because they would
not know the population mean (otherwise they would not need CIs). They might conclude, wrongly, that
+46.08 to +53.92 is a tight range and that 50 is therefore a good estimate for the population mean.
Imagine now that the population mean was really 40 not 75. What difference does this make to the
CIs for the sample? It makes no difference because there is no relationship between the calculation
of the sample CIs and the actual population mean. In this second example, the sample mean is much
closer to the population mean (off by 10 points, or 20% of the sample mean), but the CIs are calculated
in the same way and give exactly the same answer as in the former example, where the sample mean was
much further from the population mean (off by 25 points, or 50% of the sample mean). CIs say nothing
about the proximity of any one achieved sample mean to the population mean. To imagine that they
could is to believe in magic, not science.
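The two imagined populations can be compared directly. In this sketch (the function name `ci_from_summary` is hypothetical), the interval is computed from the sample summary statistics alone; each assumed population mean is used only afterwards, to report how far off the sample mean is and whether the interval contains it.

```python
import math

SAMPLE_MEAN, SAMPLE_SD, N = 50.0, 18.0, 81   # the imaginary maths sample


def ci_from_summary(mean: float, sd: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% CI from sample summary statistics; there is no population-mean argument."""
    half = z * sd / math.sqrt(n)
    return mean - half, mean + half


for pop_mean in (75.0, 40.0):   # the two population means imagined above
    lo, hi = ci_from_summary(SAMPLE_MEAN, SAMPLE_SD, N)
    gap = abs(SAMPLE_MEAN - pop_mean)
    print(f"population mean {pop_mean:.0f}: CI {lo:.2f} to {hi:.2f}, "
          f"sample mean off by {gap:.0f} points, "
          f"population mean inside CI: {lo <= pop_mean <= hi}")
```

The printed interval is identical on both lines, whichever population mean is assumed, because nothing about the population enters the calculation.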
Confidence intervals are unusable in just about all real-life contexts (where at least one of non-response, some measurement error, or departures from normality in the data occurs). More
importantly, they are useless even in ideal circumstances, since they are just the achieved data writ
large. They rely on assuming that the achieved sample mean is the population mean in order to try
to calculate the probability of it being the population mean! This ‘logic’ works no better than the
‘logic’ by which assuming that a null hypothesis is true is supposed to yield the probability of it being
true. Modus tollendo tollens arguments do not work with probabilities.