Lecture: Prevalence and Incidence, and 95% Confidence Intervals
Dr Richard Crossman, Research Fellow, Health Sciences, Warwick Medical School

This lecture will: (1) introduce two important measures of health and disease – prevalence and incidence, and (2) introduce confidence intervals. Please refer to the webpage for the Social and Population Perspective theme for a fuller lecture synopsis.

Learning Outcomes

This lecture will contribute to the following two learning outcomes:
- To distinguish between, calculate and interpret measures of health and disease found in the medical literature (such as prevalence, incidence, incidence rate ratio, odds ratio, standardised mortality ratio, attributable risk) (TD12a)
- Interpret 95% confidence intervals and p values from statistical tests, and distinguish between statistical and clinical significance (TD12a)

Sub-Outcomes

You should be able to:
- Define and differentiate between the terms 'incidence' and 'prevalence', and describe their inter-relationship.
- Distinguish between 'observed' epidemiological quantities (incidence, prevalence etc.) and their 'true' or 'underlying' values.
- Discuss how 'observed' epidemiological quantities depart from their 'true' values because of random variation.
- Describe how 'observed' values help us towards a knowledge of the 'true' values by: (a) allowing us to test hypotheses about the 'true' value, and (b) allowing us to calculate a confidence interval that gives a range which includes the 'true' value with a specific probability.

Relevant Reading

Ben-Shlomo Y, Brookes ST, Hickman M. Epidemiology, Evidence-Based Medicine and Public Health. Lecture Notes. 6th Edition. Wiley-Blackwell, 2013, Chapters 2 & 4.

Lecture Synopsis

1. The 'Extent' of Disease: Incidence and Prevalence

To examine the extent of disease in a population we need to know about the number of new cases arising in a given time (incidence), and the number of people who have the disease at any given moment (prevalence).
We are interested in knowing how many new cases arise for several reasons:
(1) Seeing whether new cases of an infectious disease are becoming more frequent helps decide whether an epidemic is in progress.
(2) Prevention programmes can be monitored (if they are working, the first effect should be a fall in the frequency of new cases).
(3) People exposed to some potential hazard can be compared with unexposed people, to help us decide whether the exposure really is dangerous.

On the other hand, numbers of new cases do not always give us much of an idea about the 'burden of disease' (the extent to which a disease is a problem to a community), so it is important also to know about numbers of existing cases. This information helps us know the extent of need for particular health services, and is especially true for conditions which persist throughout life.

a) Measuring New Cases: the Incidence Rate

A simple count of new cases is of little use; it is necessary to have information on population size (i.e. the number at risk) and the time period. The best measure of the population at risk is the product of (multiplication of) the number of people observed and the number of years of observation. This is called the person-time at risk, or the number of person-years (p-y). Dividing the number of new cases observed (events) by the number of person-years gives the incidence rate. This useful measure answers the question 'how many new cases per year, per head of population?'

A mortality rate is a special case of an incidence rate in which the event is death rather than onset of disease. Mortality rates may be calculated for specific diseases (e.g. the malignant melanoma mortality rate in England and Wales is 20 per million person-years) or for all diseases combined (the all-cause mortality rate).
Incidence and mortality rates can be compared between different populations to see whether individuals in one population are at higher risk than those in another; the populations and the periods of observation do not then have to be the same size.

For example, if one were to observe 300 myocardial infarctions (MIs) in a population of 50,000 over an 18-month period, the incidence rate would be 300 ÷ (50,000 × 1.5), i.e. 0.004 MIs per person per year, or 0.004 MIs per person-year. This is not a very intuitive way of expressing the rate. '4 per 1,000 person-years' is better: you can see straight away that 4 MIs per year would be expected in a population of 1,000. For rare diseases, rates are often expressed per 100,000 person-years. Note that, for the purpose of calculating an incidence rate in this manner, observing 50,000 people for 18 months is equivalent to observing 25,000 people for 3 years, or any other combination resulting in 75,000 person-years.

b) Measuring Existing Sufferers: the Point Prevalence

For health service planning, especially for incurable or long-standing diseases, it is often more useful to know the number of people who currently have the disease than to know the incidence rate. Once again, the number of sufferers is not particularly useful unless we know how many people are at risk of the disease. In this case there is no time period, because we are simply interested in the proportion of the population who are affected by the disease. The number of sufferers divided by the number at risk is called the point prevalence of the disease (usually referred to simply as the prevalence). For example, if 80 members of a population of 1,500 have cancer at a particular time, prevalence = 0.053, or 53 per 1,000, or 5.3%.

c) Relationship between Incidence and Prevalence

Prevalence and incidence are related because all the prevalent cases must at some time have been incident cases; other things being equal, higher incidence will imply higher prevalence.
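The arithmetic above can be sketched in a few lines. The numbers are the worked examples from the text (300 MIs in 50,000 people over 18 months; 80 cancer cases in a population of 1,500), except for the 10-year mean disease duration at the end, which is an invented figure used only to illustrate the P ≈ I × L relationship:

```python
# Sketch of the incidence-rate and point-prevalence calculations.

def incidence_rate(new_cases, people, years):
    """New cases per person-year: cases / (people x years of observation)."""
    person_years = people * years
    return new_cases / person_years

def point_prevalence(cases, population):
    """Proportion of the population affected at a given moment."""
    return cases / population

# 300 MIs in 50,000 people over 18 months (1.5 years) = 75,000 person-years
mi_rate = incidence_rate(300, 50_000, 1.5)
print(mi_rate * 1_000)        # 4.0 MIs per 1,000 person-years

# Observing 25,000 people for 3 years gives the same 75,000 person-years
assert incidence_rate(300, 25_000, 3) == mi_rate

# 80 cancer cases in a population of 1,500
prev = point_prevalence(80, 1_500)
print(round(prev, 3))         # 0.053, i.e. 53 per 1,000 or 5.3%

# P ~ I x L at steady state: with the MI rate above and an assumed
# (illustrative) mean disease duration of 10 years:
print(round(mi_rate * 10, 3))  # 0.04, i.e. prevalence of about 4%
```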
However, the relationship is not quite this simple. The number of prevalent cases is constantly being added to by new (incident) cases, and constantly being depleted by patients dying or recovering:

Figure: relationship between incidence, prevalence and rates of death/cure

Prevalence is influenced by the death rate and the cure rate, as well as the incidence rate. If a new treatment is found which keeps people with the disease alive longer, prevalence will increase; if more patients are cured or die, prevalence will fall. When the incidence rate and the rates of recovery and death are constant, then

P ≈ I × L

where P = prevalence, I = incidence rate and L = mean duration of disease. The symbol '≈' means 'is approximately equal to'.

2. Confidence Intervals

a) Sampling Variation and Statistical Models

Often we would like to draw conclusions about a population, but it is not feasible to assess the whole population. We therefore draw a representative sample (ideally a random sample) from the population. However, different samples will most likely give us at least slightly different answers; this is called sampling variation. Statistical models are introduced and critically reflected upon, including the idea that models do not necessarily need to be true to be useful.

b) Normal Distribution and Normal Approximation

The Normal distribution (also called the Gaussian distribution) is introduced. The Normal distribution has two parameters, namely the mean and the standard deviation. The Normal distribution with mean 0 and standard deviation 1 is called the standard Normal distribution. The Normal distribution is important because it can be used to approximate other distributions, provided sample sizes are not too small.

c) Proportions: Estimates, Confidence Intervals, and Hypothesis Tests

Using the prevalence of hypertension as an example, estimates, confidence intervals, and hypothesis tests for proportions are presented. The observed value of a quantity of interest (e.g.
prevalence, incidence rate) is the best estimate of the quantity's true value. However, estimates are subject to sampling variation. Their precision can be described by their standard errors (SE). For instance, the prevalence of hypertension in a population can be estimated by the observed prevalence in a random sample from that population. Here the observed prevalence is calculated as the number of subjects with hypertension divided by the number of subjects in the sample. The standard error of an observed proportion (e.g. prevalence) is

SE = √( p(1 − p) / n )

where p denotes the proportion (prevalence) and n the number of subjects.

Estimated proportions from several samples from the same population follow (approximately) a Normal distribution, which has two parameters, namely mean and standard deviation. The mean of this Normal distribution can be estimated from a sample by the observed proportion, and the standard deviation by the standard error of the observed proportion. It is a feature of the Normal distribution that 95% of values lie in the range "mean ± 1.96 × standard deviation". Plugging in the observed proportion and its standard error for the mean and standard deviation, we obtain the interval "observed proportion ± 1.96 × SE of observed proportion". This range is known as the 95% confidence interval (CI). If we sample repeatedly from the same population and calculate the 95% confidence interval for each sample, then we expect 95% of these confidence intervals to include the true value of the proportion; in the other 5% of cases the confidence interval would not include the true value.

Many analyses involve a comparison, either between different groups or between a sample and a known quantity. The numerical value corresponding to the comparison is called the effect (e.g. a difference of means). The hypothesis of no effect is called the 'null hypothesis'; for instance, a common null hypothesis is "the difference of means is 0".
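The confidence-interval recipe above can be sketched directly. The sample used here (300 hypertensive subjects out of 1,000) is an invented illustration, not a figure from the text:

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Observed proportion and its 95% CI: p +/- 1.96 * sqrt(p(1-p)/n)."""
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)
    return p, (p - z * se, p + z * se)

# Invented sample: 300 of 1,000 subjects have hypertension
p, (lo, hi) = proportion_ci(300, 1_000)
print(round(p, 3))                   # 0.3
print(round(lo, 3), round(hi, 3))    # 0.272 0.328
```

So the observed prevalence is 30%, and the 95% CI of roughly 27% to 33% is the range expected to include the true prevalence in 95% of repeated samples.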
Whether data are consistent with a certain null hypothesis can be tested by means of a confidence interval. If the 95% CI includes the null value (e.g. a difference of 0), then the data are consistent with the null hypothesis of no effect at the 95% level; if not, the effect is called statistically significant.

An alternative approach to checking whether data are consistent with a null hypothesis is the hypothesis test. We calculate the probability that we could have obtained the observed data, or more extreme data, if the null hypothesis were true. This probability is known as the p-value. If a p-value is very small, then either something very unlikely has occurred or the null hypothesis is wrong. Usually p-values smaller than 5% are considered 'small', but this threshold is of course somewhat arbitrary.

d) Difference between Two Proportions

In a study of about 1,300 adolescents, boys and girls were asked whether they always use seat belts. A difference of 9 percentage points between boys and girls was observed, suggesting that boys are more likely to use seat belts. The methods described above for confidence intervals and hypothesis tests are extended to differences between two proportions, to investigate whether such findings could be attributable to chance.
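As a sketch of this extension, the same recipe applies using the standard error of a difference of two proportions. The raw seat-belt counts are not given in the text, so the numbers below are invented to reproduce a difference of roughly 9 percentage points in a sample of 1,300:

```python
import math

def diff_proportions(successes1, n1, successes2, n2, z=1.96):
    """Difference of two proportions, its 95% CI, and a two-sided p-value
    from the Normal approximation."""
    p1, p2 = successes1 / n1, successes2 / n2
    diff = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    ci = (diff - z * se, diff + z * se)
    # Two-sided p-value: 2 * P(Z > |diff/se|), using the error function
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(diff / se) / math.sqrt(2))))
    return diff, ci, p_value

# Invented counts: 390/650 boys vs 332/650 girls report always using seat belts
diff, (lo, hi), p = diff_proportions(390, 650, 332, 650)
print(round(diff, 3))    # 0.089, i.e. about 9 percentage points
```

With these invented counts the 95% CI excludes 0 and the p-value is well below 5%, so a difference of this size in a sample of 1,300 would not plausibly be attributable to chance alone.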