Masterclass Data, understanding it, interpreting it and using it.


Master class

Data, understanding it, interpreting it and using it.

Ruth Harrell

Liann Brookes-smith

1

Agenda

9.30am – 10.30am

10.30am break

10.45 – 11.30am

11.40 – 12.30pm

12.30 – 1.30pm lunch

1.30 – 2.30pm probability

2.30 – 2.45pm break

2.45 – 3.30pm sampling and curve

3.30 – 4.30pm confidence and risk

2

Introduction

Statistics may be defined as "a body of methods for making wise decisions in the face of uncertainty." ~W.A. Wallis

“There are three kinds of lies: lies, damned lies, and statistics.” Disraeli (according to Mark Twain)

98% of all statistics are made up. ~Author unknown

Statistics are like bikinis. What they reveal is suggestive, but what they conceal is vital. ~Aaron Levenstein

If you cannot measure it, it does not exist. ~Author unknown

3

Question to the Room

What are statistics?

Why are data important?

What do you feel about stats?

What do they tell us?

E.g. 40% of children in XX area have dental caries. What does that tell us?

List the types of data you are aware of or use day to day

4

Practitioner competencies

Obtain, verify, analyse and interpret data and/or information to improve the health and wellbeing outcomes of a population / community / group – demonstrating:

a. knowledge of the importance of accurate and reliable data / information and the anomalies that might occur

b. knowledge of the main terms and concepts used in epidemiology and the routinely used methods for analysing quantitative and qualitative data

c. ability to make valid interpretations of the data and/or information and communicate these clearly to a variety of audiences

5

Aim for the day

The aim of the day is to improve people's understanding of the data they use, and of how to analyse and interpret it.

This session is concentrating on the data rather than things such as the study design but we are happy to discuss and answer questions on both; you can’t understand what the data is telling you without understanding how it has been collected and the potential for bias.

6

Topics covered

1. Types of data

2. Basic probability and stats

3. Understanding how data is collected

4. Measures of odds and ratios - comparing populations and study results

5. Population sampling - good samples and bad samples

6. Understanding confidence intervals & p values - is the result reliable?

7. How I apply data to what I am doing

7

Types of data

8

Describing the data

We have a responsibility to present data in a way that can be easily understood, and which does not misrepresent the true meaning of the data.

Key decisions are made based on the data – or more accurately people’s impression of the data – so this has an impact on use of resources and eventually on patient care.

Accurate analysis and presentation of the data saves lives!

9

Quantitative vs. Qualitative

Quantitative data measures quantity, i.e. it is numerical.

Qualitative data is usually more descriptive and not measured in numbers.

However, data originally obtained as qualitative information about individual items may give rise to quantitative data if they are summarised by means of counts.

10

Discrete – Continuous

Discrete data can only take certain particular values

Continuous falls on a scale.

For example height is continuous, but the number of siblings is discrete.

11

Nominal - Ordinal

Nominal comes from the Latin nomen, meaning 'name', and is used to describe categorical data.

There is no quantitative relationship between the different categories (though sometimes a number may be assigned for ease of analysis). An example is ethnicity.

Ordinal data again describes categories, but there is some order to them, though the relationship between them may not be well defined. An example is Agenda for Change pay scales: they are ordered and can therefore be put in sequence, but there is no numerical relationship between them.

12

Transforming the data

Sometimes the form in which you hold the data isn't the most effective way of displaying it.

E.g. you have data on weight in kilos.

A list of continuous weights is not intuitive, so you convert it (together with height) to BMI categories, i.e. those who are underweight, healthy weight, overweight, obese and morbidly obese.

Continuous to ordinal.
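The continuous-to-ordinal step can be sketched in Python. This is only an illustration: BMI also needs height, the function names are hypothetical, the cut-offs are commonly used adult thresholds assumed here rather than taken from the slides, and the four people are made up.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in metres squared."""
    return weight_kg / height_m ** 2


def bmi_category(bmi_value: float) -> str:
    """Map a continuous BMI onto ordered categories.

    The cut-offs are an assumption for illustration (commonly used adult
    thresholds); check the definition used by your own data source.
    """
    if bmi_value < 18.5:
        return "underweight"
    if bmi_value < 25:
        return "healthy weight"
    if bmi_value < 30:
        return "overweight"
    if bmi_value < 40:
        return "obese"
    return "morbidly obese"


# Hypothetical (weight in kg, height in m) pairs, not real data.
people = [(50.0, 1.75), (70.0, 1.75), (95.0, 1.75), (130.0, 1.75)]
categories = [bmi_category(bmi(w, h)) for w, h in people]
print(categories)
```

Once the categories are stored instead of the weights, the original values cannot be recovered, which is the loss of detail discussed on the next slide.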

13

Transforming the data (2)

With this you can display more meaningful data

BUT

You lose the detail, such as how many people sit at the edge of each category (borderline). You can't transform it back.

What you transform it to may not be the best use of data.

You can also transform data with a calculation, for example taking the “log” of each number; this will sometimes convert skewed data to normally distributed data (discussed later).
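A small sketch of the log transform, using made-up values chosen so that the effect is easy to see. In skewed data the mean is pulled away from the median; after taking logs the two coincide for this example. Real data will rarely be this tidy.

```python
import math
import statistics

# Hypothetical right-skewed values (a few very large numbers), not real data.
values = [1, 10, 10, 100, 100, 100, 1000, 1000, 10000]

# Take the base-10 log of every value.
logged = [math.log10(v) for v in values]

raw_mean, raw_median = statistics.mean(values), statistics.median(values)
log_mean, log_median = statistics.mean(logged), statistics.median(logged)

print("raw:   mean", raw_mean, "median", raw_median)
print("logged: mean", log_mean, "median", log_median)
```

For the raw values the mean is far above the median, a sign of skew; for the logged values the mean and median agree, as they do in a symmetric distribution.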

14

Exercise

Exercise 1 and 2

15

Displaying the data

What are the options?

Tables – simple descriptive, cross tab… (mention pivot table)

Graphs – bar, line, x-y or scatter, pie chart….

16

Basic statistics and probability

Having looked at the raw data and carried out any transformations you felt necessary, you now want to describe the features of this data.

Distributions – plotting the data is the first step in this. You need to consider the shape of the graph before you know how to best analyse the data.

17

Types of graph

Normal

18

Types of graph

Skewed

19

Types of graph

Bimodal

20

Types of graph

Uniform

21

15 minute

Break!

22

Data measures

Definitions:

Range: the difference between the highest and the lowest values in a set

Mean: the sum of all the values divided by the number of values

Median: the middle value when the values are put in order

Mode: the value found most often

Interquartile range: a measure of statistical dispersion, equal to the difference between the upper and lower quartiles

Standard deviation: is a measure of how spread out numbers are.

23

Mean, median and mode

Mean = (sum of observations) / (number of observations)

Mode = the most common observation

Median = the number where 50% of observations are below and 50% are above

24

Standard Deviation and IQR

Std Dev = square root of [ sum of (squared difference between each observation and the mean) / (number of observations - 1) ]

IQR = the difference between the value at the 25th percentile and the value at the 75th percentile

25

Formulas

Sample mean: $\bar{x} = \left( \sum x_i \right) / n$

Sample standard deviation: $s = \sqrt{ \sum ( x_i - \bar{x} )^2 / ( n - 1 ) }$

$x_i$ is each observation

$n$ is the number of observations

Σ means ‘sum’
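The measures above can be checked with Python's standard `statistics` module. This is a minimal sketch on a made-up list of observations; the standard deviation is also computed by hand from the formula so the two can be compared.

```python
import math
import statistics

# Hypothetical observations, not real data.
observations = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(observations)

data_range = max(observations) - min(observations)
mean = statistics.mean(observations)
median = statistics.median(observations)
mode = statistics.mode(observations)

# Sample standard deviation from the library (divides by n - 1).
sample_sd = statistics.stdev(observations)

# The same thing written out from the formula.
manual_sd = math.sqrt(sum((x - mean) ** 2 for x in observations) / (n - 1))

# Quartiles: different software uses slightly different conventions,
# so the IQR may differ a little between packages.
q1, q2, q3 = statistics.quantiles(observations, n=4)
iqr = q3 - q1

print(data_range, mean, median, mode, round(sample_sd, 3), iqr)
```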

26

Exercise 3

27

Exercise 4

28

How reliable is my data?

Any data missing?

How old is it?

What is the denominator?

Who collected it

How was it collected?

Ways to avoid making statements about inaccurate data?

29

Describing data

30

Interpret the graph

This graph shows the trend in obesity among adults from 1993 to 2007

Percentage: of what (all adults presumably, but all registered or all resident)? What age is defined as an adult?

Is the increase due to chance or an actual increase?

Data is quantitative/continuous

31

Bias

When looking at data, the relationship we see is sometimes caused by the way in which we are measuring, and is not actually what is there.

32

Fudging

Rate or Number

You have 50 cases of COPD in area 1, and 150 cases of COPD in area 2. Should you do something in area 2?

Area 1 has population of 2000

Area 2 has population of 5000

In area 1 rate in 50-74 year olds is 20/1000

In area 2 rate in 50-74 year olds is 42/1000

Area 1’s data was from 2004

Area 2’s data was from 2005-2009

Area 1 is 20/1000 confidence interval (12-48 per 1000)

Area 2 is 42/1000 confidence interval (18 – 56 per 1000)

Now what?
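The first step in this example, turning counts into rates, can be sketched as below. The case counts, populations and confidence intervals are the ones quoted on the slide; the function names are made up for illustration.

```python
def rate_per_1000(cases: int, population: int) -> float:
    """Crude rate: cases per 1000 people in the population."""
    return 1000 * cases / population


def intervals_overlap(a: tuple, b: tuple) -> bool:
    """True if two (low, high) intervals share any values."""
    return a[0] <= b[1] and b[0] <= a[1]


area1_rate = rate_per_1000(50, 2000)
area2_rate = rate_per_1000(150, 5000)
print(area1_rate, area2_rate)

# Confidence intervals quoted for the 50-74 year old rates (per 1000).
overlap = intervals_overlap((12, 48), (18, 56))
print("intervals overlap:", overlap)
```

Area 2 has three times as many cases, but its crude rate is only slightly higher, and the quoted confidence intervals overlap, so the apparent difference could be due to chance (and the data cover different years).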

33

Exercise

Exercise 5

What do these data tell you? Key message?

What would you ask of these data?

What further information would you want to know?

34

Basics of probability

Probability is a way of quantifying the judgements that we make all the time – from ‘do I need an umbrella?’ to ‘shall I bet on that horse?’

Probability is measured on a linear scale of 0 to 1 where 0 is impossible and 1 is absolutely certain.

35

Probability

Why is probability relevant to public health?

Probability gives us a quantitative measurement of the chances of something happening, and there are 2 key ways in which it is used in Public Health

It is another word for risk (or, if the impact is positive, benefit). For example, the probability that someone who smokes cigarettes will get lung cancer has been shown to be much higher than for someone who doesn’t smoke.

It helps us to answer the question ‘how likely is it that the observed effect is due to our intervention not just to chance?’, and is used in all types of studies – testing medical treatments, evaluating the impact of public health interventions, assessing need of one population compared to another.

36

Probability and risk

Odds – number of events divided by the number of non-events

Risk in exposed – number of events divided by the number of exposed

Risk in unexposed – number of events divided by the number of unexposed

Relative risk or risk ratio is the ratio of the probability of the event occurring in the exposed group versus the unexposed group

Absolute risk difference is the difference in risk between the exposed and unexposed.
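These definitions can be worked through on a small two-by-two table. The counts below are invented for illustration, and the odds are taken as events divided by non-events (the usual definition). Fractions are used so the results are exact.

```python
from fractions import Fraction

# Hypothetical counts, not from any study.
exposed_events, exposed_total = 30, 100
unexposed_events, unexposed_total = 10, 100

# Risk = events / everyone in the group.
risk_exposed = Fraction(exposed_events, exposed_total)
risk_unexposed = Fraction(unexposed_events, unexposed_total)

# Odds = events / non-events in the group.
odds_exposed = Fraction(exposed_events, exposed_total - exposed_events)
odds_unexposed = Fraction(unexposed_events, unexposed_total - unexposed_events)

risk_ratio = risk_exposed / risk_unexposed
risk_difference = risk_exposed - risk_unexposed
odds_ratio = odds_exposed / odds_unexposed

print("risk ratio:", risk_ratio)
print("risk difference:", risk_difference)
print("odds ratio:", odds_ratio, "=", float(odds_ratio))
```

Note that the odds ratio is further from 1 than the risk ratio here; the two are only close when the event is rare.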

37

Probability cont…

What is the probability of a 6 if you throw an unbiased dice?

What is the probability of a total of 6 if you throw two unbiased dice?
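One way to check answers to questions like these is to list every equally likely outcome and count the ones of interest. A short sketch:

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)

# One die: one favourable face out of six.
p_six_one_die = Fraction(sum(1 for f in faces if f == 6), 6)

# Two dice: enumerate all 36 ordered pairs and count those totalling 6.
outcomes = list(product(faces, repeat=2))
favourable = [pair for pair in outcomes if sum(pair) == 6]
p_total_six = Fraction(len(favourable), len(outcomes))

print(p_six_one_die)  # 1/6
print(favourable)     # (1, 5), (2, 4), (3, 3), (4, 2), (5, 1)
print(p_total_six)    # 5/36
```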

38

Welcome back!!

I'm not an outlier I just haven't found my distribution yet.

39

Exercise

Exercise 6

Worse and early death = 0-3/10

No change = 4-5 /10

Cure = 2-6/10

40

Population sampling (1)

In the real world we don’t usually get data from everybody that we are interested in. Why not?

Cost and resources may be too large

People may choose to opt in or out

May have incomplete data (data entry problems etc)

41

Population sampling (2)

So what we need to do is measure a sample of people and infer from that sample what the population looks like. We can do this by tweaking the statistical formula used – but there are two things to consider;

If your sample size is too low you are unlikely to get a reasonable result – you can still use the formula but you need to bear this in mind when interpreting it

Think about who you have managed to sample – are they representative of the population? (imagine walking in to a large open plan office with a set of scales and asking people if they would mind being weighed – who is more likely to volunteer?)

42

Population sampling (3)

If we have a REPRESENTATIVE sample, we can apply a statistical tweak to help us to estimate the figure for the population.

If we don’t (if the sample is biased), though we can carry out the maths, it will always be flawed.

43

Population sampling (4)

Principle –

Measure your sample

Calculate the mean and standard deviation (of the sample)

Calculate the standard error = standard deviation of the sample / square root of n (where n is the number of people in the sample)

To estimate your mean, we say our best guess is that the population mean is equal to the sample mean

Then we can use the standard error to estimate how close we think our estimate is.

First we need to talk about confidence intervals

44

Which one is an Insult.

Darling, you are two standard deviations below the mean

Of course you're normal (mean 10, mode 7)

You are mean

Your looks are in the 80th percentile

The difference between you and her is a standard deviation

45

46

Probability, Population Sampling and the Normal Curve

Thinking about our data that fitted the normal curve –

By using the mathematical model we can easily calculate probabilities.

The maths tells us that;

The total area under the normal curve is equal to 1.

The probability that any new observation will fall within one standard deviation of the mean is about 68%

The probability that any new observation will fall within two standard deviations of the mean is about 95%

The probability that any new observation will fall within three standard deviations of the mean is about 99.7%
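These percentages can be reproduced from the normal curve itself. The sketch below uses `statistics.NormalDist` from the Python standard library; `prob_within` is a helper name made up for this example.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def prob_within(k: float) -> float:
    """Area under the normal curve within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, "SD:", round(100 * prob_within(k), 1), "%")

# Exactly 95% corresponds to about 1.96 standard deviations,
# which is why 1.96 appears in confidence interval formulas.
print("1.96 SD:", round(100 * prob_within(1.96), 1), "%")
```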

47

Examples

48

CERN experiments observe particle consistent with long-sought Higgs boson

Geneva, 4 July 2012.

“We observe in our data clear signs of a new particle, at the level of 5 sigma, in the mass region around 126 GeV. The outstanding performance of the LHC and ATLAS and the huge efforts of many people have brought us to this exciting stage,” said ATLAS experiment spokesperson Fabiola Gianotti, “but a little more time is needed to prepare these results for publication.”

At five sigma, a random fluctuation alone would produce a signal this strong only about once in nearly two million times, i.e. it is very unlikely that the measurement seen is just chance.
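The "one in nearly two million" figure can be checked against the normal curve. The sketch below assumes the figure refers to both tails of the curve (a fluctuation of five standard deviations in either direction); counting one tail only gives roughly one in 3.5 million.

```python
from statistics import NormalDist

standard_normal = NormalDist()

# Probability of being more than 5 standard deviations below the mean.
one_tail = standard_normal.cdf(-5)

# Both tails: 5 or more standard deviations away in either direction.
two_tails = 2 * one_tail

print("one tail:  1 in", round(1 / one_tail))
print("two tails: 1 in", round(1 / two_tails))
```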

49

Confidence intervals (1)

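If we measure one individual’s IQ (a scale with mean 100 and standard deviation 15), we can be 95% sure that it will fall between 70 and 130, i.e. within 2 standard deviations of the mean.

This ‘interval’ is called the 95% confidence interval here (for individual values, as opposed to an estimated mean, it is more strictly a 95% reference range).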

We use 95% by convention; sometimes other figures are used such as 98%.

If we measure the heights of a class of children and we have a mean of 1.2m, standard deviation of 0.1, what is your estimate for the height of a child randomly selected from the sample?

1.2 +/- 0.2, i.e. about 95% of this sample lies between 1.0 and 1.4m

50

Confidence intervals (2)

Reminder; the heights of a class of children have a mean of 1.2m, standard deviation of 0.1

We measure a new child and their height is 1.5m.

What does this mean?

This is equal to mean + 3 standard deviations. This means we had less than a 0.5% chance that we would have this height in a child in this population.

That doesn’t mean they are not part of the distribution (0.5% is not that rare) but you might be sensible to check a few things to be sure they are part of the same population (age!).
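The height example can be worked through in a few lines. This sketch assumes the heights follow a normal curve with the mean and standard deviation given on the slide.

```python
from statistics import NormalDist

mean_height, sd_height = 1.2, 0.1
heights = NormalDist(mu=mean_height, sigma=sd_height)

# Range covering roughly 95% of children: mean +/- 2 standard deviations.
low, high = mean_height - 2 * sd_height, mean_height + 2 * sd_height

# How many standard deviations above the mean is the new child?
new_child = 1.5
z = (new_child - mean_height) / sd_height

# Chance of a child at least this tall, under the assumed normal curve.
p_taller = 1 - heights.cdf(new_child)

print(round(low, 2), round(high, 2))  # 1.0 1.4
print(round(z, 2))                    # 3.0
print(round(100 * p_taller, 2), "%")
```

The chance comes out well under 0.5%, in line with the slide, which is why it is worth checking whether the child really belongs to the same population.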

51

Confidence intervals (3)

This time we are using confidence intervals to estimate our ‘true’ population characteristics based on a sample.

Best estimate of the mean = measured mean of sample

Standard error of the mean = std deviation of sample / square root of the number of measurements in the sample

Therefore we can say that we are 95% confident that the mean of the population lies between the sample mean +/- 2 x standard error

This implies that;

Our estimate of the mean gets better as n increases – because our error gets smaller.

This is the way we usually use confidence intervals in public health as we usually measure a sample and infer the population.

Examples – Health survey for England, Household surveys, etc
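The steps for estimating a population mean from a sample can be sketched as follows. The heights are invented, the function names are illustrative, and the standard error is taken as the sample standard deviation divided by the square root of n.

```python
import math
import statistics


def standard_error(sd: float, n: int) -> float:
    """Standard error of the mean: standard deviation / square root of n."""
    return sd / math.sqrt(n)


def approx_95_ci(sample):
    """Sample mean +/- 2 standard errors (a rough 95% interval)."""
    n = len(sample)
    mean = statistics.mean(sample)
    se = standard_error(statistics.stdev(sample), n)
    return mean - 2 * se, mean + 2 * se


# Hypothetical heights in metres, not real data.
heights_m = [1.08, 1.15, 1.18, 1.20, 1.21, 1.22, 1.25, 1.31]
low, high = approx_95_ci(heights_m)
print(round(low, 3), round(high, 3))

# The error shrinks as n grows: four times the sample halves the standard error.
print(standard_error(0.1, 25), standard_error(0.1, 100))
```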

52

You are a significant part of my life

P value =9

53

I would never treat you differently to your sisters

Sister 1 CI 4-9

Sister 2 CI 5-11

Sister 3 CI 4-13

ME CI 2-3

54

Comparing two samples

The important question is – is there a difference between two populations?

This question might be asked in slightly different ways for different types of study, but is fundamentally the same;

For an RCT you compare control group with the intervention group

For a cohort you compare the outcomes in those exposed to a risk factor compared to those not exposed

For a case-control you look at the group with the disease and compare their risk factors to those without the disease

You might look at before and after an intervention was put in place

You might compare one city or country to another

55

Comparing two samples (2)

The important question is – is there a difference between two populations?

56

Comparing two samples (3)

We can calculate the difference between the two populations as;

Mean difference = mean of pop 1 – mean of pop 2

Confidence interval = mean difference +/- 1.96*SE

SE (standard error) combines the standard errors of the two samples, where $s_1$ and $s_2$ are the standard deviations of samples 1 and 2, and $n_1$ and $n_2$ are the sample sizes:

$SE = \sqrt{ s_1^2 / n_1 + s_2^2 / n_2 }$

(se can be slightly different for different situations – but this gives you an idea)
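A sketch of this calculation in Python, using invented summary figures for the two samples. Here s1 and s2 are treated as the standard deviations of the two samples and n1 and n2 as their sizes; the function names are illustrative.

```python
import math


def se_difference(s1: float, n1: int, s2: float, n2: int) -> float:
    """Standard error of the difference between two sample means."""
    return math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)


def ci_difference(mean1, s1, n1, mean2, s2, n2, z=1.96):
    """Mean difference with a 95% confidence interval (mean difference +/- z * SE)."""
    diff = mean1 - mean2
    se = se_difference(s1, n1, s2, n2)
    return diff, diff - z * se, diff + z * se


# Hypothetical samples: means 10 and 8, SDs 3 and 4, sizes 36 and 64.
diff, low, high = ci_difference(10, 3, 36, 8, 4, 64)
print(diff, round(low, 2), round(high, 2))
```

With these made-up figures the interval does not include 0, so the data would suggest a real difference between the two populations.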

57

T tests

Testing using t test;

You need to know the mean and standard deviation of both of your samples.

You start with a hypothesis; this is that there is no difference between the two samples (or populations)

You then do some maths;

$t = ( \bar{x}_1 - \bar{x}_2 ) / SE$, where $SE = \sqrt{ s_1^2 / n_1 + s_2^2 / n_2 }$

($\bar{x}_1$, $\bar{x}_2$ are the means, $s_1$, $s_2$ the standard deviations and $n_1$, $n_2$ the sizes of samples 1 and 2)

58

T tests (2)

So what does t mean?

t is a position on the horizontal axis of a distribution centred on 0; for large samples this is close to a normal distribution with mean = 0 and standard deviation = 1

From a table of t values you can read the p value: the probability of seeing a difference at least this large if the two samples really came from the same population

Most important value -

if t > 1.96 or t < -1.96 (large samples), then p < 0.05 (5%). This suggests that the two populations are fundamentally different

By convention, we reject the null hypothesis if p < 0.05

It's good practice to quote the p value, e.g. p = 0.01
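The t calculation and its p value can be sketched as below. This uses the large-sample normal approximation in place of a table of t values (an assumption: for small samples a proper t table with degrees of freedom gives somewhat larger p values). The summary figures are invented and the function names are illustrative.

```python
import math
from statistics import NormalDist


def t_statistic(mean1, s1, n1, mean2, s2, n2):
    """Difference in sample means divided by its standard error."""
    se = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    return (mean1 - mean2) / se


def two_sided_p(t: float) -> float:
    """Two-sided p value using the normal approximation (large samples only)."""
    return 2 * (1 - NormalDist().cdf(abs(t)))


# Hypothetical samples: means 10 and 8, SDs 3 and 4, sizes 36 and 64.
t = t_statistic(10, 3, 36, 8, 4, 64)
p = two_sided_p(t)
print(round(t, 2), round(p, 4))

# The conventional cut-off: t = 1.96 corresponds to p of about 0.05.
print(round(two_sided_p(1.96), 3))
```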

59

T tests (3)

What do these results mean?

Mean difference = 0, with 95% confidence interval (-1.0, +1.0), p= 0.50

Mean difference = 0.5, with 95% confidence interval (0.1, 0.9), p= 0.049

Mean difference = 1, with 95% confidence interval (-0.1, +1.1), p= 0.055

Mean difference = 1, with 95% confidence interval (0.2, +1.8), p= 0.02

60

Risk differences

Same principle – null hypothesis is that there is no difference

For no difference, the 95% confidence interval would include 0

If it does not include 0, then you can be 95% confident that there is a risk difference.

You can also quote a p value

Example – the risk difference for having a heart attack in the placebo group compared to the intervention group was 2% with a 95% confidence interval of (1.5% to 2.4%), p=0.02

Would you take the intervention?

61

Risk differences (2)

You can also calculate the number needed to treat from this

NNT is the number of people you need to treat to prevent one event from occurring

Example – the risk difference for having a heart attack in the placebo group compared to the intervention group was 2% with a 95% confidence interval of (1.5% to 2.4%), p=0.02

If you treat 100 people you avoid 2 heart attacks.

NNT = 50
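The arithmetic behind the NNT is simply one divided by the risk difference. A short sketch using the figures from the example; applying the same calculation to the ends of the confidence interval is an extra step added here for illustration.

```python
def number_needed_to_treat(risk_difference: float) -> float:
    """NNT = 1 / absolute risk difference (risk difference as a proportion)."""
    return 1 / risk_difference


nnt = number_needed_to_treat(0.02)
print(round(nnt))  # 50

# The confidence interval for the risk difference gives a range for the NNT.
# A larger risk difference means fewer people need to be treated.
nnt_best = number_needed_to_treat(0.024)
nnt_worst = number_needed_to_treat(0.015)
print(round(nnt_best, 1), round(nnt_worst, 1))
```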

62

Risk ratio

A relative measure of risk – very commonly used

Same principle – null hypothesis is that there is no difference IN THE RATIO OF RISKS

For no difference, the 95% confidence interval would include 1

Why 1 this time?

Because if both had the same risk, the ratio would be 1

If it does not include 1, then you can be 95% confident that there is a risk difference.

You can also quote a p value

63

Odds ratio

A relative measure of risk – very commonly used

Very similar to risk ratio

Used for certain types of study, and the result of some calculations

For no difference, the 95% confidence interval would include 1

If it does not include 1, then you can be 95% confident that there is a difference.

You can also quote a p value

64

Examples

Meta-analysis of the 5 prospective cohort studies (86,092 patients) indicated that individuals with periodontal disease had a 1.14 times higher risk of developing CHD than the controls (relative risk 1.14, 95% CI 1.074-1.213, P < .001)

The risk of VTE was 2.33 for obesity (95% CI, 1.68 to 3.24), 1.51 for hypertension (95% CI, 1.23 to 1.85), 1.42 for diabetes mellitus (95% CI, 1.12 to 1.77), 1.18 for smoking (95% CI, 0.95 to 1.46), and 1.16 for hypercholesterolemia (95% CI, 0.67 to 2.02).
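The rule "does the 95% confidence interval include 1?" can be applied to the VTE figures quoted above. The numbers are the ones on the slide; the variable and function names are made up for this sketch.

```python
# Risk factor: (ratio, lower 95% limit, upper 95% limit), as quoted on the slide.
vte_risk = {
    "obesity": (2.33, 1.68, 3.24),
    "hypertension": (1.51, 1.23, 1.85),
    "diabetes mellitus": (1.42, 1.12, 1.77),
    "smoking": (1.18, 0.95, 1.46),
    "hypercholesterolemia": (1.16, 0.67, 2.02),
}


def excludes_one(low: float, high: float) -> bool:
    """True if the interval lies entirely above or entirely below 1."""
    return low > 1 or high < 1


significant = [
    name for name, (_, low, high) in vte_risk.items() if excludes_one(low, high)
]
print(significant)
```

Smoking and hypercholesterolemia have intervals that include 1, so on these figures alone a difference in risk cannot be claimed for them.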

65

In summary

Your boss says:

“Do we need a weight loss service for kids in XXX area?”

1. You collect data: what is the definition of “kids”, is the data accurate, how was it collected, and in what year?

2. Compare the areas: are you much different, and is there an underlying reason?

3. Is this value statistically significant?

66

In summary (2)

You look at a service elsewhere (from evidence)

You ask yourself: who was included in this sample, and are they different to my population?

Looking at the odds, what proportion of kids will this work on?

Look to see if the test group were biased compared to the control group

Were the results normally distributed, skewed or other?

67

In summary (3)

Were the results significantly different between the two groups?

Can you rely on these findings?

You have just found the need.

Evaluated its accuracy

Reviewed a solution

Looked at effectiveness

WELL DONE!!!

68

Useful websites

Basic maths and probability http://www.cimt.plymouth.ac.uk/projects/mepres/book7/bk7i21/bk7_21i1.htm

Tutorials on statistics http://www.stattrek.com/tutorials/statistics-tutorial.aspx

69
