Ruth Harrell
Liann Brookes-smith
1
9.30am – 10.30am
10.30am break
10.45 – 11.30am
11.40 – 12.30pm
12.30 – 1.30pm lunch
1.30 – 2.30pm probability
2.30 – 2.45pm break
2.45 – 3.30pm sampling and curve
3.30 – 4.30pm confidence and risk
2
Statistics may be defined as "a body of methods for making wise decisions in the face of uncertainty." ~W.A. Wallis
“There are three kinds of lies: lies, damned lies, and statistics.” Disraeli (according to Mark Twain)
98% of all statistics are made up. ~Author unknown
Statistics are like bikinis. What they reveal is suggestive, but what they conceal is vital. ~Aaron Levenstein
If you cannot measure it, it does not exist. ~Author unknown
3
What are statistics?
Why are data important?
What do you feel about stats?
What do they tell us?
E.g. 40% of children in XX area have dental caries: what does that tell us?
List the types of data you are aware of or use in your day-to-day work
4
Obtain, verify, analyse and interpret data and/or information to improve the health and wellbeing outcomes of a population / community / group – demonstrating:
a. knowledge of the importance of accurate and reliable data / information and the anomalies that might occur
b. knowledge of the main terms and concepts used in epidemiology and the routinely used methods for analysing quantitative and qualitative data
c. ability to make valid interpretations of the data and/or information and communicate these clearly to a variety of audiences
5
The aim of the day is to improve people's understanding of the data they use, and of how to analyse and interpret it.
This session is concentrating on the data rather than things such as the study design but we are happy to discuss and answer questions on both; you can’t understand what the data is telling you without understanding how it has been collected and the potential for bias.
6
1. Types of data
2. Basic probability and stats
3. Understanding how data is collected
4. Measures of odds and ratios – comparing populations and study results
5. Population sampling – good samples and bad samples
6. Understanding confidence intervals and p values – is the result reliable?
7. How I apply data to what I am doing
7
8
We have a responsibility to present data in a way that can be easily understood, and which does not misrepresent the true meaning of the data.
Key decisions are made based on the data – or more accurately people’s impression of the data – so this has an impact on use of resources and eventually on patient care.
Accurate analysis and presentation of the data saves lives!
9
Quantitative data measures quantity, i.e. it is numerical.
Qualitative data is usually more descriptive and not measured in numbers.
However, data originally obtained as qualitative information about individual items may give rise to quantitative data if they are summarised by means of counts, for example counting how many respondents gave each answer.
10
Discrete data can only take certain particular values.
Continuous data can take any value on a scale.
For example height is continuous, but the number of siblings is discrete.
11
Nominal comes from the Latin nomen, meaning 'name', and is used to describe categorical data.
There is no quantitative relationship between the different categories (though sometimes a number may be assigned for ease of analysis). An example is ethnicity.
Ordinal data again describes categories, but there is some order to them, though the relationship between them may not be well defined. An example is the Agenda for Change pay scales: they are ordered and can be put in sequence, but there is no numerical relationship between them.
12
Sometimes the data as collected isn't in the most effective form for display.
E.g. you have data on weight in kilos.
A list of continuous weights is not intuitive, so you convert it (together with height) to BMI categories, i.e. those who are underweight, healthy weight, obese and morbidly obese.
Continuous to ordinal.
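The continuous-to-ordinal conversion can be sketched in a few lines of Python. This is only an illustration: the function name `bmi_category`, the example people and the cut-off values are assumptions (commonly quoted adult BMI thresholds), not figures taken from these slides. Note that BMI needs height as well as weight.

```python
def bmi_category(weight_kg, height_m):
    """Turn continuous weight and height into an ordinal BMI category."""
    bmi = weight_kg / height_m ** 2  # BMI = weight (kg) / height (m) squared
    # Assumed cut-offs (commonly quoted adult thresholds), not from the slides.
    if bmi < 18.5:
        return "underweight"
    if bmi < 25:
        return "healthy weight"
    if bmi < 30:
        return "overweight"
    if bmi < 40:
        return "obese"
    return "morbidly obese"


# Hypothetical (weight in kg, height in m) pairs.
people = [(50, 1.75), (70, 1.75), (100, 1.75)]
categories = [bmi_category(w, h) for w, h in people]
print(categories)
```

Once the values are in categories, the exact BMI of each person can no longer be recovered from the category alone.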
13
With this you can display more meaningful data
BUT
You lose the detail, such as the number of people on the edge of each category (borderline). You can't transform it back.
What you transform it to may not be the best use of the data.
You can also transform data using more complex calculations, for example taking the "log" of each number; this will sometimes convert skewed data to normally distributed data (discussed later)
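As a rough illustration of the log transformation, the sketch below uses a made-up, strongly right-skewed list of values (an assumption, chosen so the arithmetic is easy to follow). Before the transformation the mean sits far above the median; after taking log base 10 the two coincide.

```python
import math
from statistics import mean, median

raw = [1, 10, 100, 1000, 10000]  # hypothetical right-skewed values
logged = [math.log10(x) for x in raw]  # log base 10 of each value

# Skewed data: the mean is pulled well above the median.
print(mean(raw), median(raw))
# After the log transform the values are evenly spaced,
# so the mean and median agree.
print(mean(logged), median(logged))
```

Real data will rarely become perfectly symmetric, so it is still worth plotting the transformed values.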
14
Exercise 1 and 2
15
What are the options?
Tables – simple descriptive, cross tab… (mention pivot table)
Graphs – bar, line, x-y or scatter, pie chart….
16
Having looked at the raw data and carried out any transformations you felt necessary, you now want to describe the features of this data.
Distributions – plotting the data is the first step in this. You need to consider the shape of the graph before you know how to best analyse the data.
17
Normal
18
Skewed
19
Bimodal
20
Uniform
21
22
Definitions:
Range: the difference between the highest and the lowest values in a set
Mean: the sum of all the values divided by the number of values
Median: the middle value when the values are put in order
Mode: the value found most often
Interquartile range: a measure of statistical dispersion, equal to the difference between the upper and lower quartiles
Standard deviation: a measure of how spread out the values are
23
Mean = (sum of observations) / (number of observations)
Mode = the most common observation
Median = the number where 50% of observations are below and 50% are above
24
Std Dev = square root of [ sum of (squared difference between each observation and the mean) / (number of observations - 1) ]
IQR = the difference between the values at the 25th percentile and the 75th percentile
25
Sample mean: $\bar{x} = \left( \sum x_i \right) / n$
Sample standard deviation: $s = \sqrt{ \sum (x_i - \bar{x})^2 / (n - 1) }$
$x_i$ is each observation
$n$ is the number of observations
Σ means ‘sum’
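The summary measures defined on the last few slides can be worked through on a small example. The data values below are hypothetical, and the quartiles come from Python's `statistics.quantiles` with its default method; other software may use slightly different quartile rules and give a slightly different IQR.

```python
import math
import statistics

data = [2, 4, 4, 5, 7, 9, 11]  # hypothetical observations

n = len(data)
mean = sum(data) / n  # sum of observations / number of observations
# Sample standard deviation: square root of the summed squared
# differences from the mean, divided by (n - 1).
sd = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
q1, q2, q3 = statistics.quantiles(data, n=4)  # quartiles (default method)

summary = {
    "range": max(data) - min(data),
    "mean": mean,
    "median": statistics.median(data),
    "mode": statistics.mode(data),
    "iqr": q3 - q1,
    "sd": sd,
}
print(summary)
```

The hand-written standard deviation should agree with `statistics.stdev(data)`, which uses the same n - 1 formula.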
26
27
28
Any data missing?
How old is it?
What is the denominator?
Who collected it
How was it collected?
Ways to avoid making statements about inaccurate data?
29
30
This graph shows the trend in obesity in adults from 1993 to 2007
Percentage of what? (All adults presumed; all registered? All resident?) What age is defined as an adult?
Is the increase due to chance or an actual increase?
Data is quantitative/continuous
31
When looking at data, sometimes the relationship we see is caused by the way in which we are measuring, not by what is actually there.
32
Rate or Number
You have 50 cases of COPD in area 1 and 150 cases of COPD in area 2. Should you do something in area 2?
Area 1 has population of 2000
Area 2 has population of 5000
In area 1 the rate in 50-74 year olds is 20/1000
In area 2 the rate in 50-74 year olds is 42/1000
Area 1’s data was from 2004
Area 2’s data was from 2005-2009
Area 1 is 20/1000 confidence interval (12-48 per 1000)
Area 2 is 42/1000 confidence interval (18 – 56 per 1000)
Now what?
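The "rate or number" point can be checked with a short calculation. The sketch below turns the case counts and populations from this slide into crude rates per 1000, and then checks whether the two quoted confidence intervals overlap. The helper names are made up for illustration, and overlapping intervals are only a rough guide, not a formal test.

```python
def rate_per_1000(cases, population):
    """Crude rate: cases per 1000 people in the population."""
    return cases * 1000 / population


def intervals_overlap(a, b):
    """True if two (low, high) intervals share any values."""
    return a[0] <= b[1] and b[0] <= a[1]


area1_rate = rate_per_1000(50, 2000)   # 50 cases in 2000 people
area2_rate = rate_per_1000(150, 5000)  # 150 cases in 5000 people
overlap = intervals_overlap((12, 48), (18, 56))  # CIs quoted on the slide

print(area1_rate, area2_rate, overlap)
```

Three times as many cases becomes a much smaller gap once the population size is taken into account, and the overlapping intervals suggest the remaining difference could be due to chance.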
33
Exercise 5
What do these data tell you? Key message?
What would you ask of these data?
What further information would you want to know?
34
Probability is a way of quantifying the judgements that we make all the time – from ‘do I need an umbrella?’ to ‘shall I bet on that horse?’
Probability is measured on a linear scale of 0 to 1 where 0 is impossible and 1 is absolutely certain.
35
Why is probability relevant to public health?
Probability gives us a quantitative measurement of the chances of something happening, and there are 2 key ways in which it is used in Public Health:
It is another word for risk (or, if the impact is positive, benefit). For example, the probability that someone who smokes cigarettes will get lung cancer has been shown to be much higher than for someone who doesn't smoke.
It helps us to answer the question ‘how likely is it that the observed effect is due to our intervention not just to chance?’, and is used in all types of studies – testing medical treatments, evaluating the impact of public health interventions, assessing need of one population compared to another.
36
Odds – number of events divided by the number of non-events
Risk in exposed – number of events in the exposed divided by the number of exposed
Risk in unexposed – number of events in the unexposed divided by the number of unexposed
Relative risk or risk ratio – the ratio of the probability of the event occurring in the exposed group to that in the unexposed group
Absolute risk difference – the difference in risk between the exposed and unexposed
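These measures can be illustrated with a small two-group example. The counts are hypothetical, and odds are taken here in the usual sense of events divided by non-events.

```python
def risk(events, total):
    """Risk: events divided by everyone in the group."""
    return events / total


def odds(events, total):
    """Odds: events divided by non-events."""
    return events / (total - events)


# Hypothetical counts: 30 of 100 exposed and 10 of 100 unexposed have the event.
exposed_events, exposed_total = 30, 100
unexposed_events, unexposed_total = 10, 100

risk_exposed = risk(exposed_events, exposed_total)
risk_unexposed = risk(unexposed_events, unexposed_total)

relative_risk = risk_exposed / risk_unexposed      # ratio of the two risks
risk_difference = risk_exposed - risk_unexposed    # absolute difference
odds_ratio = odds(exposed_events, exposed_total) / odds(
    unexposed_events, unexposed_total
)

print(relative_risk, risk_difference, odds_ratio)
```

With these numbers the relative risk is about 3 while the odds ratio is nearer 3.9, which shows that the two relative measures are similar but not interchangeable when the event is common.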
37
What is the probability of a 6 if you throw an unbiased dice?
What is the probability of a total of 6 if you throw two unbiased dice?
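One way to check the answers is to list every equally likely outcome and count the ones of interest. The sketch below does this with exact fractions; the variable names are arbitrary.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)  # the six faces of an unbiased die

# One die: one favourable face out of six.
p_six = Fraction(sum(1 for f in faces if f == 6), len(faces))

# Two dice: 36 equally likely (first, second) pairs.
pairs = list(product(faces, repeat=2))
favourable = [pair for pair in pairs if sum(pair) == 6]
p_total_six = Fraction(len(favourable), len(pairs))

print(p_six, p_total_six, favourable)
```

Counting outcomes works here because every face, and so every pair, is equally likely.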
38
I'm not an outlier I just haven't found my distribution yet.
39
Exercise 6
Worse and early death = 0-3/10
No change = 4-5 /10
Cure = 2-6/10
40
In the real world we don’t usually get data from everybody that we are interested in. Why not?
Cost and resources may be too large
People may choose to opt in or out
May have incomplete data (data entry problems etc)
41
So what we need to do is measure a sample of people and infer from that sample what the population looks like. We can do this by tweaking the statistical formula used – but there are two things to consider;
If your sample size is too low you are unlikely to get a reasonable result – you can still use the formula but you need to bear this in mind when interpreting it
Think about who you have managed to sample – are they representative of the population? (imagine walking in to a large open plan office with a set of scales and asking people if they would mind being weighed – who is more likely to volunteer?)
42
If we have a REPRESENTATIVE sample, we can apply a statistical tweak to help us to estimate the figure for the population.
If we don’t (if the sample is biased), though we can carry out the maths, it will always be flawed.
43
Principle –
Measure your sample
Calculate the mean and standard deviation (of the sample)
Calculate the standard error = standard deviation of the sample / square root of n (where n is the sample size)
To estimate your mean, we say our best guess is that the population mean is equal to the sample mean
Then we can use the standard error to estimate how close we think our estimate is.
First we need to talk about confidence intervals
44
Darling, you are two standard deviations below the mean
Of course you're normal (mean 10, mode 7)
You are mean
Your looks are in the 80% percentile
The difference between you and her is a standard deviation
45
46
Thinking about our data that fitted the normal curve –
By using the mathematical model we can easily calculate probabilities.
The maths tells us that;
The total area under the normal curve is equal to 1.
The probability that any new observation will fall within one standard deviation of the mean is 68%
The probability that any new observation will fall within two standard deviations of the mean is 95%
The probability that any new observation will fall within three standard deviations of the mean is 99.7%
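These figures can be reproduced from the standard normal curve. The sketch below uses `NormalDist` from Python's standard library; the helper name `within` is made up for illustration. Strictly, two standard deviations cover about 95.4%, and it is 1.96 standard deviations that cover 95%.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1


def within(k):
    """Probability of falling within k standard deviations of the mean."""
    return z.cdf(k) - z.cdf(-k)


for k in (1, 2, 3):
    print(k, round(within(k), 3))

print(round(within(1.96), 3))  # the value behind the usual 95% interval
```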
47
48
CERN experiments observe particle consistent with long-sought Higgs boson
Geneva, 4 July 2012.
“We observe in our data clear signs of a new particle, at the level of 5 sigma, in the mass region around 126 GeV. The outstanding performance of the LHC and ATLAS and the huge efforts of many people have brought us to this exciting stage,” said ATLAS experiment spokesperson Fabiola Gianotti, “but a little more time is needed to prepare these results for publication.”
At five-sigma there is only one chance in nearly two million that the result is wrong, i.e. the measurement seen is a random fluctuation.
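The "one chance in nearly two million" figure can be roughly reproduced from the normal curve. The sketch below assumes the quoted figure counts fluctuations of five standard deviations in either direction (a two-sided tail); counting one direction only gives about one in 3.5 million.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal distribution

one_sided = 1 - z.cdf(5)   # chance of a fluctuation above +5 sigma
two_sided = 2 * one_sided  # chance of a fluctuation beyond 5 sigma either way

print(round(1 / two_sided))  # roughly 1.7 million
print(round(1 / one_sided))  # roughly 3.5 million
```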
49
If we measure one individual's IQ (mean 100, standard deviation 15) we can be 95% sure that it will fall between 70 and 130, i.e. within two standard deviations of the mean
This ‘interval’ is called the 95% confidence interval.
We use 95% by convention; sometimes other figures are used such as 98%.
If we measure the heights of a class of children and we have a mean of 1.2m and a standard deviation of 0.1m, what is your estimate for the height of a child randomly selected from the sample?
1.2 +/- 0.2m (mean +/- 2 standard deviations), i.e. 95% of this sample lies between 1.0 and 1.4m
50
Reminder; the heights of a class of children have a mean of 1.2m, standard deviation of 0.1
We measure a new child and their height is 1.5m.
What does this mean?
This is equal to mean + 3 standard deviations. This means we had less than a 0.5% chance that we would have this height in a child in this population.
That doesn’t mean they are not part of the distribution (0.5% is not that rare) but you might be sensible to check a few things to be sure they are part of the same population (age!).
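The class height example can be worked through in code. The sketch assumes the heights follow a normal curve with the mean and standard deviation given on the slide; the variable names are arbitrary.

```python
from statistics import NormalDist

heights = NormalDist(mu=1.2, sigma=0.1)  # class heights in metres

# Approximate 95% range: mean +/- 2 standard deviations.
low = heights.mean - 2 * heights.stdev
high = heights.mean + 2 * heights.stdev

new_child = 1.5
# How many standard deviations above the mean is the new child?
z_score = (new_child - heights.mean) / heights.stdev
# Chance of a child at least this tall, if the normal model holds.
upper_tail = 1 - heights.cdf(new_child)

print(low, high, z_score, upper_tail)
```

The upper tail comes out at a little over 0.1%, comfortably under the 0.5% quoted on the slide.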
51
This time we are using confidence intervals to estimate our 'true' population characteristics based on a sample.
Best estimate of the mean = measured mean of sample
Standard error of the mean = standard deviation of sample / square root of the number of measurements in the sample
Therefore we can say that we are 95% confident that the mean of the population lies between the sample mean +/- 2 x standard error
This implies that;
Our estimate of the mean gets better as n increases – because our error gets smaller.
This is the way we usually use confidence intervals in public health as we usually measure a sample and infer the population.
Examples – Health survey for England, Household surveys, etc
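A minimal sketch of this calculation is given below. The sample values are hypothetical, the function name `mean_ci` is made up, and the standard error is taken as the sample standard deviation divided by the square root of n, with a multiplier of 2 as on the slide.

```python
import math
import statistics


def mean_ci(sample, z=2):
    """Return the sample mean, its standard error and an approximate 95% CI."""
    n = len(sample)
    mean = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(n)  # standard error of the mean
    return mean, se, (mean - z * se, mean + z * se)


# Hypothetical heights (metres) of nine sampled children.
sample = [1.1, 1.2, 1.3, 1.2, 1.2, 1.1, 1.3, 1.2, 1.2]
mean, se, (low, high) = mean_ci(sample)
print(mean, se, low, high)

# Repeating the same values four times keeps the spread similar but
# quadruples n, so the standard error roughly halves.
_, se_larger, _ = mean_ci(sample * 4)
print(se_larger)
```

For small samples a slightly larger multiplier taken from the t distribution is normally used instead of 2.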
52
You are a significant part of my life
P value =9
53
Sister 1 CI 4-9
Sister 2 CI 5-11
Sister 3 CI 4-13
ME CI 2-3
54
The important question is – is there a difference between two populations?
This question might be asked in slightly different ways for different types of study, but is fundamentally the same;
For an RCT you compare control group with the intervention group
For a cohort you compare the outcomes in those exposed to a risk factor compared to those not exposed
For a case-control you look at the group with the disease and compare their risk factors to those without the disease
You might look at before and after an intervention was put in place
You might compare one city or country to another
55
The important question is – is there a difference between two populations?
56
We can calculate the difference between the two populations as;
Mean difference = mean of pop 1 – mean pop2
Confidence interval = mean difference +/- 1.96*SE
SE (standard error) is a combination of the standard errors for each sample, where s1 and s2 are the standard deviations of the two samples and n1 and n2 are the sample sizes
$SE = \sqrt{ s_1^2 / n_1 + s_2^2 / n_2 }$
(se can be slightly different for different situations – but this gives you an idea)
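The calculation on this slide can be sketched as follows. The two sets of summary figures are hypothetical, and the function name `difference_ci` is made up for illustration.

```python
import math


def difference_ci(mean1, sd1, n1, mean2, sd2, n2, z=1.96):
    """Mean difference between two samples with an approximate 95% CI."""
    diff = mean1 - mean2
    # Combined standard error of the difference between the two means.
    se = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return diff, (diff - z * se, diff + z * se)


# Hypothetical samples: means 10 and 9, both with SD 2 and 100 people.
diff, (low, high) = difference_ci(10, 2, 100, 9, 2, 100)
print(diff, low, high)
```

Here the interval runs from roughly 0.45 to 1.55, so it does not include 0.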
57
Testing using t test;
You need to know the mean and standard deviation of both of your samples.
You start with a hypothesis; this is that there is no difference between the two samples (or populations)
You then do some maths;
t = (mean of sample 1 – mean of sample 2) / SE, where $SE = \sqrt{ s_1^2 / n_1 + s_2^2 / n_2 }$, with $s_1$ and $s_2$ the standard deviations of the two samples and $n_1$ and $n_2$ the sample sizes
58
So what does t mean?
For large samples, t can be read as a position on the horizontal axis of a normal distribution with mean = 0 and standard deviation = 1
You can read the probability of the two samples coming from the same population from a table of t values
Most important value -
if t > 1.96 (or t < -1.96) then the probability of seeing a difference this large, if the two samples came from the same distribution, is < 0.05
By convention, we reject the null hypothesis if p < 0.05
It's good practice to quote the p value, e.g. p = 0.01
If p < 0.05 (5%), this suggests that the two samples are fundamentally different
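The test can be sketched in code using the large-sample shortcut of reading t against a normal curve. This is an approximation: for small samples the p value would come from a t table with the appropriate degrees of freedom. The summary figures and the function name are hypothetical.

```python
import math
from statistics import NormalDist


def two_sample_test(mean1, sd1, n1, mean2, sd2, n2):
    """Return t and an approximate two-sided p value (normal approximation)."""
    se = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    t = (mean1 - mean2) / se
    # Two-sided p: chance of a value at least this far from 0 in either direction.
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p


# Hypothetical samples: means 10 and 9, both with SD 2 and 100 people.
t, p = two_sample_test(10, 2, 100, 9, 2, 100)
print(t, p)

# The conventional cut-off: t = 1.96 corresponds to p of about 0.05.
p_at_cutoff = 2 * (1 - NormalDist().cdf(1.96))
print(p_at_cutoff)
```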
59
What do these results mean?
Mean difference = 0, with 95% confidence interval (-1.0, +1.0), p= 0.50
Mean difference = 0.5, with 95% confidence interval (0.1, 0.9), p= 0.049
Mean difference = 1, with 95% confidence interval (-0.1, +1.1), p= 0.055
Mean difference = 1, with 95% confidence interval (0.2, +1.8), p= 0.02
60
Same principle – null hypothesis is that there is no difference
For no difference, the 95% confidence interval would include 0
If it does not include 0, then you can be 95% confident that there is a risk difference.
You can also quote a p value
Example – the risk difference for having a heart attack in the placebo group compared to the intervention group was 2% with a 95% confidence interval of (1.5% to 2.4%), p=0.02
Would you take the intervention?
61
You can also calculate the number needed to treat from this
NNT is the number of people you need to treat to prevent one event from occurring
Example – the risk difference for having a heart attack in the placebo group compared to the intervention group was 2% with a 95% confidence interval of (1.5% to 2.4%), p=0.02
If you treat 100 people you avoid 2 heart attacks.
NNT = 50
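The arithmetic behind this is NNT = 1 / risk difference. The sketch below applies it to the figures on the slide, and also to the two ends of the confidence interval to give a rough range for the NNT; the function name is made up for illustration.

```python
def number_needed_to_treat(risk_difference):
    """NNT = 1 / absolute risk difference (given as a proportion)."""
    return 1 / risk_difference


nnt = number_needed_to_treat(0.02)  # 2% risk difference from the slide
# Applying the same formula to the ends of the 95% CI (1.5% to 2.4%).
nnt_low = number_needed_to_treat(0.024)
nnt_high = number_needed_to_treat(0.015)

print(round(nnt), round(nnt_low, 1), round(nnt_high, 1))
```

In practice the NNT is usually rounded up to a whole number of people.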
62
A relative measure of risk – very commonly used
Same principle – null hypothesis is that there is no difference IN THE RATIO OF RISKS
For no difference, the 95% confidence interval would include 1
Why 1 this time?
Because if both had the same risk, the ratio would be 1
If it does not include 1, then you can be 95% confident that there is a risk difference.
You can also quote a p value
63
A relative measure of risk – very commonly used
Very similar to risk ratio
Used for certain types of study, and the result of some calculations
For no difference, the 95% confidence interval would include 1
If it does not include 1, then you can be 95% confident that there is a difference.
You can also quote a p value
64
Meta-analysis of the 5 prospective cohort studies (86,092 patients) indicated that individuals with periodontal disease had a 1.14 times higher risk of developing CHD than the controls (relative risk 1.14, 95% CI 1.074-1.213, P < .001)
The risk of VTE was 2.33 for obesity (95% CI, 1.68 to 3.24), 1.51 for hypertension (95% CI, 1.23 to 1.85), 1.42 for diabetes mellitus (95% CI, 1.12 to 1.77), 1.18 for smoking (95% CI, 0.95 to 1.46), and 1.16 for hypercholesterolemia (95% CI, 0.67 to 2.02).
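A quick way to read results like these is to check whether each confidence interval includes 1, the "no difference" value for a ratio. The sketch below applies that check to the VTE intervals quoted above; the function and variable names are made up for illustration.

```python
def excludes_null(ci, null_value=1.0):
    """True if the (low, high) interval does not contain the null value."""
    low, high = ci
    return not (low <= null_value <= high)


# 95% confidence intervals for the risk of VTE, as quoted on the slide.
vte_risk_factors = {
    "obesity": (1.68, 3.24),
    "hypertension": (1.23, 1.85),
    "diabetes mellitus": (1.12, 1.77),
    "smoking": (0.95, 1.46),
    "hypercholesterolemia": (0.67, 2.02),
}

significant = [
    name for name, ci in vte_risk_factors.items() if excludes_null(ci)
]
print(significant)
```

The same check works for differences in means or risks by setting `null_value=0`.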
65
Your boss says:
“do we need a weight loss service for kids in XXX area”
1. You collect data: what is the definition of "kids", is the data accurate, how was it collected, and in what year?
2. Compare the areas: is your area much different, and is there an underlying reason?
3. Is this value statistically significant?
66
You look at a service elsewhere (from evidence)
You ask yourself: who was included in this sample, and are they different from my population?
Looking at the odds, what proportion of kids will this work on?
Look to see whether the test group was biased compared to the control group
Were the results normally distributed, skewed or other?
67
Were the results significantly different between the two groups?
Can you rely on these findings?
You have just found the need.
Evaluated its accuracy
Reviewed a solution
Looked at effectiveness
68
Basic maths and probability http://www.cimt.plymouth.ac.uk/projects/mepres/book7/bk7i21/bk7_21i1.htm
Tutorials on statistics http://www.stattrek.com/tutorials/statistics-tutorial.aspx
69