Probability


Statistics for Health Research

Introduction to Distributions and Probability

Peter T. Donnan

Professor of Epidemiology and Biostatistics

Overview

• Distributions

• History of probability

• Definitions of probability

• Random variable

• Probability density function

• Normal, Binomial and Poisson distributions

Introduction to Probability Density Functions

• Normal Distribution / Gaussian / Bell curve

• Poisson named after French mathematician

• Binomial related to binary factors (Bernoulli trials)

Early use of Normal Distribution

• Gauss was a German mathematician who solved the mystery of where Ceres would reappear after it disappeared behind the Sun.

• He assumed the errors formed a Normal distribution and managed to accurately predict the orbit of Ceres

What is the relationship between the Normal or Gaussian distribution and probability?

Probability

“I cannot believe that God plays dice with the cosmos”

Albert Einstein

“The probable is what usually happens”

Aristotle

Origins of Probability

• Early interest in permutations in Vedic literature, 400 BC

• Distinguished origins in betting and gambling!

• Pascal and Fermat studied division of stakes in gambling (1654)

• Enlightenment – seen as helping public policy, social equity

• Astronomy – Gauss (1801)

• Social and genetic – Galton (1885)

• Experimental design – Fisher (1936)

Types of Probability

Two basic definitions:

1) Frequentist

Classical

Proportion of times an event occurs in a long series of ‘trials’

2) Subjectivist

Bayesian

Strength of belief in event happening

Frequentists vs. Bayesians

• Two entrenched camps

• Scientists tend to use the frequentist approach

• Bayesians gaining ground

• Most scientists use frequentist methods but incorrectly interpret results in a Bayesian way!

Frequentists

• Consider tossing a fair coin

• In any trial, the event may be a ‘head’ or ‘tail’, i.e. binary

• Repeated tossing gives a series of ‘events’

• In the long run, prob of heads = 0.5

T H TT HHHH T HHH T HHH TT H TTT HH TT H TT HHH TTT HH T HHH TTTTT HHH

Running proportion of heads: 0.6 0.56 0.52
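The long-run idea can be illustrated by simulation. The following is a minimal sketch, assuming a simulated fair coin and an arbitrary fixed seed; the function name `running_proportion_of_heads` is made up for this illustration and the tosses are not the sequence shown on the slide.

```python
import random

def running_proportion_of_heads(n_tosses, seed=1):
    """Simulate fair-coin tosses; return the proportion of heads after each toss."""
    rng = random.Random(seed)  # fixed seed (an arbitrary choice) so the run is reproducible
    heads = 0
    proportions = []
    for toss in range(1, n_tosses + 1):
        heads += rng.random() < 0.5  # True counts as 1, i.e. a head
        proportions.append(heads / toss)
    return proportions

props = running_proportion_of_heads(100_000)
for n in (10, 100, 1_000, 100_000):
    print(f"after {n:>6} tosses: proportion of heads = {props[n - 1]:.3f}")
```

Early proportions can wander well away from 0.5; only the long series settles close to it.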

Frequentist Probability

• Note the difference between ‘long run’ probability and an individual trial

• In an individual trial a head either occurs (X=1) or does not occur (X=0)

• Patient either survives or dies following an MI

• Prob of dying after MI ≈ 30% based on a previous long series from a population of individuals who experienced MI

Subjective Probability

• Based on strength of belief

• But more akin to thinking of clinician making a diagnosis

• Faced with patient with chest pain, based on past experience, believes prob of heart disease is 20%

• Person tossing coin believes prob of head is 1/2

Comparison of definitions of Probability

• Problems of subjective probability

• Probability for same patient can vary even with same clinician

• Person can believe prob of head is 0.1 even if it is a fair coin

• Subjectivists argue they are more realistic

• This course sticks to ‘frequentist’ and ‘model-based’ methods of probability

Random Variable

• Consider rolling 2 dice and we want to summarise the probabilities of all possible outcomes

• We call the outcome a random variable X which can have any value in this case from 2 to 12

• Enumerate all probabilities in sample space S

• P(2) = 1/6 × 1/6 = 1/36, P(3) = 2/36, P(4) = 3/36, etc.

Probability Density Function for rolling two dice

X:    2    3    4    5    6    7    8    9    10   11   12

P(X): 1/36 2/36 3/36 4/36 5/36 6/36 5/36 4/36 3/36 2/36 1/36

Probability Density Function for rolling two dice

What is probability of getting 12? Answer 1/36

What is probability of getting more than 8? Ans. 10/36
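The two answers above can be checked by enumerating the sample space directly. This is a small sketch using exact fractions; the variable names are chosen for illustration only.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Enumerate the 36 equally likely outcomes of rolling two dice
# and count how often each total (the random variable X) occurs.
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
pmf = {total: Fraction(n, 36) for total, n in sorted(counts.items())}

p_12 = pmf[12]
p_more_than_8 = sum(p for total, p in pmf.items() if total > 8)

print(p_12)           # 1/36
print(p_more_than_8)  # 5/18, which is 10/36 in lowest terms
```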


Probability Density Function for continuous variable


Consider distribution of weight in kg; all values possible not just discrete

[Figure: probability density curve. x-axis: Weight in kilograms, 20 to 120]

Probability Density Function in SPSS

Use Analyze / Descriptive Statistics / Frequencies and select no table and charts box as below

Probability Density Function in SPSS

Data from ‘LDL Data.sav’ of baseline LDL cholesterol

Normal Distribution

Note that a Normal or Gaussian curve is defined by two parameters:

Mean µ and Standard Deviation σ

And often written as N(µ, σ)

Hence any Normal distribution has mathematical form

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

This has no closed-form integral, so the area under the curve is obtained by numerical integration and tabulated!
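To see what numerical integration of the Normal curve looks like, here is a minimal sketch using the trapezoid rule, compared against the error function from Python's standard library. The function names and the step count are assumptions for illustration; published tables were not necessarily produced this way.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Height of the Normal curve at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def area_trapezoid(a, b, mu=0.0, sigma=1.0, steps=10_000):
    """Approximate the area under the Normal curve between a and b."""
    h = (b - a) / steps
    total = 0.5 * (normal_pdf(a, mu, sigma) + normal_pdf(b, mu, sigma))
    for i in range(1, steps):
        total += normal_pdf(a + i * h, mu, sigma)
    return total * h

def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X < x), written in terms of the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

area = area_trapezoid(-1.96, 1.96)
exact = normal_cdf(1.96) - normal_cdf(-1.96)
print(round(area, 4), round(exact, 4))
```

Both routes give approximately 0.95 for the area within 1.96 standard deviations of the mean.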

Normal Distribution

As noted earlier the curve is symmetrical about the mean and so p(x > mean) = 0.5 or 50%

And p(x < mean) = 0.5 or 50%

And p(a < x < b) = p(x < b) – p(x < a)

Normal Distribution and Probabilities

So we now have a way of working out the probability of any range of values of a variable IF a Normal distribution is a reasonable fit to the data:

p(a < x < b) = p(x < b) – p(x < a), which is the area under the curve between a and b

Normal Distribution

Most of the area lies between +1 and -1 SD (68%)

The large majority lies between +2 and -2 SDs (95%)
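The areas within a given number of SDs of the mean can be computed from the standard Normal distribution. A short sketch using `statistics.NormalDist` from the Python standard library; `prob_within` is a name invented here.

```python
from statistics import NormalDist

z = NormalDist()  # standard Normal: mean 0, SD 1

def prob_within(k):
    """P(mean - k*SD < X < mean + k*SD), the same for every Normal distribution."""
    return z.cdf(k) - z.cdf(-k)

for k in (1, 1.96, 2):
    print(f"within +/-{k} SD: {prob_within(k):.4f}")
```

This gives roughly 0.68 for 1 SD and 0.95 for 2 SDs; the figure of exactly 95% corresponds to 1.96 SDs.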

Normal Distribution

Probability Density Function (PDF):

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

How well does my data fit a Normal Distribution?

Statistics: Baseline LDL

N (Valid)                  1383
N (Missing)                0
Mean                       3.454363
Median                     3.506214
Std. Deviation             .9889157
Skewness                   .039
Std. Error of Skewness     .066
Minimum                    .3345
Maximum                    7.5650

Note median and mean virtually the same

Skewness = 0.039, close to zero

Skewness is a measure of asymmetry (0 = perfect symmetry)

Eyeball test - fitted normal curve looks good!
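For readers who want to see how skewness and its standard error can be calculated, here is a hedged sketch. It assumes the bias-adjusted sample skewness and the usual large-sample standard error formula; these appear to agree with the standard errors in the SPSS output (.066 for N = 1383), but treat the exact match to SPSS as an assumption.

```python
import math

def sample_skewness(values):
    """Bias-adjusted sample skewness (0 for a perfectly symmetric sample)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return n / ((n - 1) * (n - 2)) * sum(((v - mean) / sd) ** 3 for v in values)

def se_skewness(n):
    """Standard error of skewness; depends only on the sample size n."""
    return math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))

print(round(se_skewness(1383), 3))       # 0.066
print(sample_skewness([1, 1, 1, 2, 10]))  # positive: long tail to the right
```

A rough guide is to compare the skewness with about twice its standard error: 0.039 is well inside 2 × 0.066.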

Try Q-Q plot in Analyze / Descriptive Statistics / Q-Q plot

Plot compares the expected Normal distribution with the real data; if the data lie on the line y = x then the Normal distribution is a good fit

Note still an eyeball test!
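The points on a Normal Q-Q plot can be worked out by hand. The sketch below pairs each sorted observation with the value expected under a fitted Normal distribution. The plotting positions (i - 3/8)/(n + 1/4) are Blom's formula, one common choice and an assumption here; the sample values are made up and are not the LDL data.

```python
from statistics import NormalDist, mean, stdev

def normal_qq_points(values):
    """Return (observed, expected) pairs for a Normal Q-Q plot."""
    xs = sorted(values)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))  # Normal with the sample mean and SD
    expected = [fitted.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]
    return list(zip(xs, expected))

sample = [2.1, 2.9, 3.2, 3.4, 3.5, 3.7, 4.0, 4.8]  # made-up values for illustration
for observed, expected in normal_qq_points(sample):
    print(f"{observed:5.2f}  {expected:5.2f}")
```

If the Normal distribution fits well, the two columns are close and the plotted points fall near the line y = x.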

Is this a good fit?

I used to be Normal until I discovered Kolmogorov-Smirnov!

One-Sample Kolmogorov-Smirnov Test: Baseline LDL

N                                             1383
Normal Parameters (a,b)     Mean              3.454363
                            Std. Deviation    .9889157
Most Extreme Differences    Absolute          .043
                            Positive          .043
                            Negative          -.043
Kolmogorov-Smirnov Z                          1.617
Asymp. Sig. (2-tailed)                        .011

a. Test distribution is Normal.

b. Calculated from data.

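Eyeball test indicates the distribution is approximately Normal, but the K-S test is significant (p = 0.011), indicating a discrepancy compared to the Normal. With a sample as large as N = 1383, even small departures from Normality reach significance.

WARNING: DO NOT RELY ON THIS TEST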
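To show where the numbers in the K-S output come from, here is a sketch under stated assumptions: the statistic is the largest gap between the empirical and the Normal cumulative distributions, Z is that gap multiplied by √N, and the p-value uses the asymptotic Kolmogorov series. The function names are invented here, and the series ignores the fact that the mean and SD were estimated from the same data.

```python
import math
from statistics import NormalDist

def ks_statistic(values, mu, sigma):
    """Largest absolute gap between the empirical CDF and the Normal(mu, sigma) CDF."""
    xs = sorted(values)
    n = len(xs)
    dist = NormalDist(mu, sigma)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = dist.cdf(x)
        # the empirical CDF jumps from (i-1)/n to i/n at x
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

def ks_asymptotic_p(z, terms=100):
    """Two-tailed asymptotic p-value for Z = sqrt(N) * D (Kolmogorov series)."""
    return 2 * sum((-1) ** (k - 1) * math.exp(-2 * k * k * z * z)
                   for k in range(1, terms + 1))

print(round(ks_asymptotic_p(1.617), 3))  # 0.011, as in the output above
```

Because Z grows with √N, a gap as small as 0.043 becomes significant once N is in the thousands, which is one reason not to rely on the test alone.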

Consider the distribution of survival times following surgery for colorectal cancer

Statistics: Time from Surgery

N (Valid)                  476
N (Missing)                0
Mean                       848.3908
Median                     835.5000
Std. Deviation             582.39657
Skewness                   2.081
Std. Error of Skewness     .112
Minimum                    14.00
Maximum                    5763.00

Note median=835 days and mean=848

Skewness = 2.081, very skewed (> 1.0)

Strong tail to right! Approximately Normal?

Try a log transformation for right (positively) skewed data?

Statistics: logtime

N (Valid)                  476
N (Missing)                0
Mean                       6.4346
Median                     6.7286
Std. Deviation             .95059
Skewness                   -1.504
Std. Error of Skewness     .112
Minimum                    2.67
Maximum                    8.66

Better but now slightly skewed to left!
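The effect of a log transformation on right-skewed data can be tried on simulated values. This is only a sketch: the data are drawn from a log-Normal distribution with arbitrary parameters, not the colorectal survival times, so the transformation works more cleanly here than in the real example above.

```python
import math
import random

def skewness(values):
    """Simple moment-based skewness (0 for a symmetric distribution)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

rng = random.Random(0)  # fixed seed (arbitrary) for reproducibility
times = [rng.lognormvariate(6.5, 0.8) for _ in range(2000)]  # simulated, right-skewed
log_times = [math.log(t) for t in times]

print("skewness of raw values:", round(skewness(times), 2))
print("skewness of log values:", round(skewness(log_times), 2))
```

With real data the log can over-correct, as the logtime output shows, so always re-check the transformed distribution.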

Examples of skewed distributions in Health Research

Discrete random variables – hospital admissions, cigarettes smoked, alcohol consumption, costs

Continuous RV – BMI, cholesterol, BP


The Binomial Distribution

• ‘Binomial’ means ‘two numbers’.

• Outcomes of health research are often measured by whether they have occurred or not.

• For example, recovered from disease, admitted to hospital, died, etc

• May be modelled by assuming that the number of events among n individuals (trials) has a binomial distribution with a fixed probability of event p

The Binomial Distribution

• Based on work of Jakob Bernoulli, a Swiss mathematician

• Refused a church appointment and instead studied mathematics

• Early use was for games of chance but now used in every human endeavour

• When n = 1 this is called a Bernoulli trial

• Binomial distribution is distribution for a series of Bernoulli trials

The Binomial Distribution

• Binomial distribution written as B(n, p) where n is the total number of trials and p = prob of an event in each trial

• This is a Binomial Distribution with p=0.25 and n=20

[Figure: bar chart of probability (0.00 to 0.20) against number of Successes (0 to 20)]
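The probabilities behind a chart like this can be computed from the standard binomial formula, P(k) = C(n, k) p^k (1 - p)^(n - k). A minimal sketch for n = 20 and p = 0.25; the text bar chart is only a rough stand-in for the figure.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k events in n independent trials, each with probability p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 20, 0.25
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

for k, prob in enumerate(pmf):
    print(f"{k:2d}  {prob:.4f}  {'#' * round(prob * 100)}")
```

The most likely number of successes is 5 (n × p), with probability of about 0.20.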

The Binomial Distribution

The Poisson Distribution

Poisson distribution (1838), named after its inventor Siméon Poisson who was a French mathematician. He found that if we have a rare event (i.e. p is small) and we know the expected or mean (µ) number of occurrences, the probabilities of R = 0, 1, 2, ... events are given by:

$$P(R) = \frac{e^{-\mu}\,\mu^{R}}{R!}$$
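The Poisson probabilities are easy to evaluate directly from P(R) = e^(-µ) µ^R / R!. A short sketch; the mean of 2 events is an arbitrary value chosen for illustration.

```python
import math

def poisson_pmf(r, mu):
    """Probability of exactly r events when the mean number of events is mu."""
    return math.exp(-mu) * mu ** r / math.factorial(r)

mu = 2.0  # assumed mean number of events, for illustration only
for r in range(6):
    print(r, round(poisson_pmf(r, mu), 4))
```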

The Poisson Distribution

Note similarity to Binomial

In fact when p is small and n is large

B(n, p) ~ P(µ = np)

Also for large values of µ:

P(µ) ~ Normal with mean µ and variance µ (i.e. SD = √µ)

Hence if n and p are not known, the Poisson could be used instead
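Both approximations can be checked numerically. The sketch below uses illustrative values only (n = 1000, p = 0.002 for the first; µ = 100 for the second) and a continuity correction of 0.5 for the Normal comparison, which is a common convention assumed here.

```python
import math
from math import comb
from statistics import NormalDist

def binomial_pmf(k, n, p):
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(r, mu):
    return math.exp(-mu) * mu ** r / math.factorial(r)

# 1) Binomial with large n and small p versus Poisson with mean np
n, p = 1000, 0.002
mu = n * p
max_gap = max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, mu)) for k in range(11))
print("largest difference in P(k), k = 0..10:", round(max_gap, 5))

# 2) Poisson with a large mean versus Normal with mean mu and SD sqrt(mu)
mu_large = 100
poisson_cdf_110 = sum(poisson_pmf(r, mu_large) for r in range(111))  # P(R <= 110)
normal_cdf_110 = NormalDist(mu_large, math.sqrt(mu_large)).cdf(110.5)
print(round(poisson_cdf_110, 3), round(normal_cdf_110, 3))
```

In both cases the approximating distribution agrees closely with the original for these values.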

The Poisson Distribution

In health research often used to model the number of events assumed to be random:

Number of hip replacement failures,

Number of cases of C. diff infection,

Diagnoses of leukaemia around nuclear power stations,

Number of H1N1 cases in Scotland,

Etc.

Summary

Many of the variables measured in Health Research form distributions which approximate to common distributions with known mathematical properties

Normal, Poisson, Binomial, etc…

Note a relationship for all, centred around the exponential function, where e ≈ 2.718

[Figure: histogram of RANNORM; Mean = -.04, Std. Dev = .96, N = 501]

• All belong to the Exponential Family of distributions

• These probability distributions are critical to applying statistical methods

SPSS Practical

• Read in data file ‘LDL Data.sav’

• Consider adherence to statins, baseline LDL, min Chol achieved, BMI, duration of statin use

• Assess distributions for normality

• If non-normal consider a transformation

• Try to carry out Q-Q plots
