STAT 305: Chapter 4 – Basic Probability Concepts
Spring 2014
In this section we go through the basics of probability. Probability is fundamental to the study of statistics and is used extensively in the inferential process.
Specifically in this section we will learn some basic ideas about probabilities:
• What they are and where they come from
• Simple probability models
• Properties of probabilities
• Conditional probabilities and the concept of independence
• Bayes’ Rule
• How to calculate probabilities:
- Using tables of counts obtained through sampling (empirical or sample-based probabilities)
- Using properties of probabilities, such as independence
Example 4.1: I toss a fair coin (where ‘fair’ means ‘equally likely outcomes’)
What are the possible outcomes?
What is the probability it will turn up heads?
Example 4.2: I choose a duck nest full of freshly laid eggs at random and look at predation status
What are the possible outcomes?
What is the probability the duck nest is predated?
Definition: A probability is a number…..
WHERE DO PROBABILITIES COME FROM?
Probabilities from models (games, genetics, certain types of random experiments)
The probability of getting a four when a fair die is rolled is ________.
Probabilities from data (or _________________ probabilities)
What is the probability that a randomly selected starling is female?
– In a random sample of n = 67 starlings, 40 are found to be female.
– The estimated probability that a randomly chosen starling is female is ________.
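As a minimal sketch of a probability estimated from data, the snippet below divides the number of sampled starlings with the characteristic by the sample size. The variable names are illustrative and not part of the course notes.

```python
from fractions import Fraction

# Counts from the starling sample described above.
n_sampled = 67
n_female = 40

# Empirical probability = (number with the characteristic) / (sample size).
p_female = Fraction(n_female, n_sampled)

print(f"Estimated P(female) = {p_female} = {float(p_female):.3f}")
```

Keeping the estimate as a fraction makes it clear that it comes straight from the sample counts; the decimal form is only a rounded display.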
Subjective probabilities:
The probability that there will be another outbreak of Ebola in Africa within the next year is 0.1.
The probability of rain in the next 24 hours is very high, say 80%.
A doctor may state that a patient’s chances of a full recovery are 70%.
PROBABILITIES FROM DATA - SOME BASIC IDEAS
Example 4.3: Type of Hodgkin’s Disease and Response to Treatment
Below is a table containing the results of treatment for patients with different types of Hodgkin’s disease. The data were collected by taking a random sample of 538 patients diagnosed with some form of Hodgkin’s disease; thus both the type of Hodgkin’s disease and the response to treatment are random.
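[Table: response to treatment by histological type. Cell counts and row/column labels were not recovered.]
Row totals: 72, 104, 266, 96
Column totals: 126, 98, 314 (n = 538)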
Histological Types of Hodgkin’s Disease
LD = lymphocyte depletion
LP = lymphocyte predominant
MC = mixed cellularity
NS = nodular sclerosis
For a patient selected at random from these 538 Hodgkin’s patients, find the probability that the patient:
(a) had a positive response
(b) had at least some response to treatment.
(c) had LP and had a positive response to treatment.
(d) had LP or NS for their histological type.
CONDITIONAL PROBABILITY and INDEPENDENCE
• We are interested in the probability of something happening given information about the occurrence of another event.
• Key words that indicate conditional probability are: given, amongst, for those with, …
Conditional Probability
“The probability of event A occurring given that event B has already occurred”
is written in shorthand as
Formal Definition
P(A | B) =
Independence
Events A and B are said to be independent if
Example 4.4: Simple example when rolling a single fair die
We define the following two events based on the outcome of rolling the die
Conditional probability
A =
B =
Independence
C = we obtain two sixes in a row, D = we obtain three sixes in a row, etc.
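The sketch below illustrates both ideas for a single fair die, assuming the standard definitions P(A | B) = P(A and B) / P(B) and independence when P(A and B) = P(A) P(B). The events A and B here are hypothetical, chosen only for illustration; they are not necessarily the events used in lecture.

```python
from fractions import Fraction

outcomes = range(1, 7)  # faces of a fair six-sided die, each equally likely


def prob(event):
    """P(event) under the equally likely model: favourable outcomes / 6."""
    return Fraction(sum(1 for x in outcomes if event(x)), 6)


# Hypothetical events, for illustration only.
A = lambda x: x % 2 == 0   # the roll is even
B = lambda x: x >= 4       # the roll is 4, 5, or 6

p_A = prob(A)
p_B = prob(B)
p_A_and_B = prob(lambda x: A(x) and B(x))

# Conditional probability: restrict attention to outcomes where B occurred.
p_A_given_B = p_A_and_B / p_B

print("P(A)      =", p_A)
print("P(A | B)  =", p_A_given_B)
print("A and B independent?", p_A_and_B == p_A * p_B)

# Separate rolls of a fair die are independent, so probabilities multiply.
p_two_sixes = Fraction(1, 6) ** 2
p_three_sixes = Fraction(1, 6) ** 3
print("P(two sixes in a row)   =", p_two_sixes)
print("P(three sixes in a row) =", p_three_sixes)
```

Here knowing that B occurred changes the probability of A (from 1/2 to 2/3), so these two particular events are not independent, whereas successive rolls are.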
Example 4.3 (cont’d): Conditional Probabilities from Hodgkin’s Example
[Table repeated from Example 4.3; cell counts were not recovered. Row totals: 72, 104, 266, 96. Column totals: 126, 98, 314. n = 538.]
Let’s consider some potential conditional probabilities of interest in this study.
A 2-D mosaic plot is a graphical display of conditional probabilities of the form P(Y|X) where, in this example, Y is the response to treatment and X is the histological type of Hodgkin’s disease.
Example 4.5: Motorcycle Helmet Use and Brain Injury in Wisconsin Motorcyclists
A study was conducted in 1991 by the University of Wisconsin and the Wisconsin Department of Transportation in which linked police reports and hospital discharge records were used to assess, among other things, the risk for head injury for motorcyclists in motor-vehicle crashes. The data shown below can be used to examine the relationship between helmet use and whether brain injury was sustained in the accident.
                 Brain Injury   No Brain Injury   Row Totals
Helmet Worn            17             977             994
No Helmet              97            1918            2015
Column Totals         114            2895            3009
a) What is the probability that a motorcycle accident victim in Wisconsin suffered brain injury?
b) What is the probability that a motorcyclist involved in an accident was wearing a helmet? Can this be used to estimate the probability that a randomly sampled motorcyclist in WI wears a helmet?
c) What is the probability that a motorcyclist suffered brain injury given that they were wearing a helmet?
d) What is the probability that a motorcyclist not wearing a helmet suffered brain injury?
e) How many times more likely is a motorcyclist not wearing a helmet to sustain a brain injury?
This ratio is called the ________________ or __________________ .
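One way to check the arithmetic for parts (a) through (e) is sketched below, using only the counts from the helmet table. The variable names are illustrative, and part (b) is computed only for crash victims, so whether it generalizes to all motorcyclists remains a question for discussion.

```python
# Counts from the helmet use / brain injury table in Example 4.5.
helmet_injury, helmet_no_injury = 17, 977
no_helmet_injury, no_helmet_no_injury = 97, 1918

helmet_total = helmet_injury + helmet_no_injury              # 994
no_helmet_total = no_helmet_injury + no_helmet_no_injury     # 2015
n = helmet_total + no_helmet_total                           # 3009

p_injury = (helmet_injury + no_helmet_injury) / n                # (a)
p_helmet = helmet_total / n                                      # (b)
p_injury_given_helmet = helmet_injury / helmet_total             # (c)
p_injury_given_no_helmet = no_helmet_injury / no_helmet_total    # (d)

# (e) Ratio of the two conditional probabilities from (d) and (c).
ratio = p_injury_given_no_helmet / p_injury_given_helmet

print(f"(a) P(brain injury)             = {p_injury:.4f}")
print(f"(b) P(helmet)                   = {p_helmet:.4f}")
print(f"(c) P(brain injury | helmet)    = {p_injury_given_helmet:.4f}")
print(f"(d) P(brain injury | no helmet) = {p_injury_given_no_helmet:.4f}")
print(f"(e) ratio (d)/(c)               = {ratio:.2f}")
```

Note that (c) and (d) divide by the row totals rather than by n, because conditioning on helmet status restricts attention to one row of the table.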
RELATIVE RISK (RR)
The relative risk, or risk ratio, is defined generically as follows.
In the case of a study where examination of risk is appropriate, e.g. cancer and smoking:
RR =
In the case of a study where we are looking at a benefit, e.g. a drug that reduces the risk of an adverse outcome:
RR =
ODDS RATIO (OR)
Most studies that seek to examine the relationship between an adverse outcome, e.g. death or cancer, and a set of potential risk factors are case-control observational studies. Remember that in a case-control study we sample individuals with the adverse event (cases) and some similar individuals without the adverse event (controls), and compare the two groups in terms of the risk factors we are interested in. The odds ratio (OR) is the main tool used to quantify risk in case-control studies. The example below will demonstrate why.
First, we need to define what odds for an event are.
The Odds for an event A are defined as
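The Odds Ratio for an event A associated with a “risk factor” is defined as
\[
OR = \frac{\text{Odds for } A \text{ for those with risk factor}}{\text{Odds for } A \text{ for those without risk factor}}
   = \frac{\dfrac{P(A \mid \text{risk})}{1 - P(A \mid \text{risk})}}{\dfrac{P(A \mid \text{no risk})}{1 - P(A \mid \text{no risk})}}
\]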
The event A here is an adverse outcome like death or cancer, etc. If we are doing a study looking at benefit instead of risk, then A could be a good outcome such as survival or remission, etc.
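The OR formula above treats the odds for an event with probability p as p / (1 − p). A minimal sketch of that calculation follows; the function names and the probabilities 0.20 and 0.10 are hypothetical, chosen only to illustrate the arithmetic.

```python
def odds(p):
    """Odds for an event with probability p, i.e. p / (1 - p)."""
    if not 0 <= p < 1:
        raise ValueError("p must satisfy 0 <= p < 1")
    return p / (1 - p)


def odds_ratio(p_risk, p_no_risk):
    """Odds for A with the risk factor divided by odds for A without it."""
    return odds(p_risk) / odds(p_no_risk)


# Hypothetical conditional probabilities, for illustration only.
p_A_given_risk = 0.20
p_A_given_no_risk = 0.10

print(f"Odds with risk factor    = {odds(p_A_given_risk):.3f}")
print(f"Odds without risk factor = {odds(p_A_given_no_risk):.3f}")
print(f"OR                       = {odds_ratio(p_A_given_risk, p_A_given_no_risk):.2f}")
```

With these made-up values the probability doubles (0.10 to 0.20) while the odds ratio is 2.25, a small preview of how odds and probabilities can differ.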
Example 4.6: Age at First Pregnancy and Cervical Cancer
A case-control study was conducted to determine whether there was increased risk of cervical cancer amongst women who had their first child before age 25. A sample of 49 women with cervical cancer was taken, of which 42 had their first child before the age of 25. From a sample of 317 “similar” women without cervical cancer it was found that 203 of them had their first child before age 25. Do these data suggest that having a child before age 25 increases the risk of cervical cancer?
                            Cervical Cancer: Case or Control
Age at First Pregnancy      Case        Control        Row Totals
Age < 25
Age ≥ 25
Column Totals                                          n =

a) Why can’t we meaningfully calculate P( cervical cancer | risk factor status )? VERY IMPORTANT
b) Find P( risk factor | disease status ) for each group of women.
c) What are the odds for the risk factor amongst the cases? Amongst the controls?
d) What is the odds ratio for having the risk factor associated with being a case?
e) Even though it is not appropriate to do so, calculate P( disease | risk factor status ) and the odds for disease for both risk factor groups.
f) Finally, calculate the odds ratio for having cervical cancer associated with having first pregnancy before age 25. What do we find? Why do you suppose the OR is much more commonly used than the RR?
Properties of the OR:
1) OR = ________
                         Disease Present (case)    Disease Absent (control)
Risk factor present               a                          b
Risk factor absent                c                          d
Identifying the a cell is the key: once you have identified this cell, b is in the same row, c is in the same column, and d is the diagonally opposite cell.
2) When the disease is rare in the population being studied, e.g. P(disease) < 0.10 (less than 10% of the population have the disease), there is little difference between the RR and OR, and the difference gets smaller the rarer the disease is. Thus for many diseases RR ≈ OR, which makes it easier to discuss and interpret odds ratios, because we can state the results in terms of how many times more likely something is rather than using a multiplicative statement in terms of the odds.
3) When the disease is more common, say more than 10% have the disease or adverse outcome, the OR tends to overstate the associated risk: the OR lies farther from 1 than the RR, so for a harmful risk factor we will find OR > RR.
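Properties 2 and 3 can be checked numerically. The sketch below computes both RR and OR from a 2×2 table laid out with a, b in the risk-factor-present row and c, d in the risk-factor-absent row. The rare-outcome case reuses the helmet counts from Example 4.5; the common-outcome counts are hypothetical and invented purely for illustration.

```python
def rr_and_or(a, b, c, d):
    """Return (RR, OR) for a 2x2 table.

    a, b = with outcome / without outcome, risk factor present
    c, d = with outcome / without outcome, risk factor absent
    """
    rr = (a / (a + b)) / (c / (c + d))
    odds_ratio = (a * d) / (b * c)
    return rr, odds_ratio


# Rare outcome: brain injury occurs in about 4% of the crashes in Example 4.5.
# "Risk factor present" is taken to be riding without a helmet.
rr_rare, or_rare = rr_and_or(97, 1918, 17, 977)

# Common outcome: hypothetical counts (45% of subjects have the outcome).
rr_common, or_common = rr_and_or(60, 40, 30, 70)

print(f"Rare outcome:   RR = {rr_rare:.2f}, OR = {or_rare:.2f}")
print(f"Common outcome: RR = {rr_common:.2f}, OR = {or_common:.2f}")
```

For the rare outcome the two measures nearly agree, while for the common outcome the OR is noticeably larger than the RR.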
Example 4.6 (cont’d): Age at First Pregnancy and Cervical Cancer

                                         Disease Status
Risk Factor Status           Case                 Control                  Row Totals
                             (cervical cancer)    (no cervical cancer)
Age < 25 (risk present)      a = 42               b = 203                  245
Age ≥ 25 (risk absent)       c = 7                d = 114                  121
Column Totals                49                   317                      n = 366
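Using the cell counts in this table, the sketch below computes the odds ratio three ways: from the odds of the risk factor within each disease group, from the odds of disease within each risk group, and from the cross-product ad/(bc). Exact fractions are used so that the three results can be compared without rounding; the variable names are illustrative.

```python
from fractions import Fraction

# Cell counts from the completed table for Example 4.6.
a, b = 42, 203   # risk present (age < 25): cases, controls
c, d = 7, 114    # risk absent:             cases, controls

# Odds of having the risk factor, within cases and within controls.
odds_risk_cases = Fraction(a, c)
odds_risk_controls = Fraction(b, d)
or_risk = odds_risk_cases / odds_risk_controls

# Odds of disease within each risk group. These are not meaningful on their
# own in a case-control study, because the numbers of cases and controls
# were fixed by the design, but their ratio can still be formed.
odds_disease_risk = Fraction(a, b)
odds_disease_no_risk = Fraction(c, d)
or_disease = odds_disease_risk / odds_disease_no_risk

# Cross-product form.
or_cross = Fraction(a * d, b * c)

print("OR from odds of risk factor:", or_risk)
print("OR from odds of disease:    ", or_disease)
print("OR from cross-product:      ", or_cross)
print(f"Decimal value: {float(or_cross):.2f}")
```

All three calculations give the same value, which is why the odds ratio can be interpreted in terms of disease risk even though the sampling was done on disease status.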