Probability

STAT 305: Chapter 4 – Basic Probability Concepts

Spring 2014

Probability

In this section we go through the basics of probability. Probability is fundamental to the study of statistics and is used extensively in the inferential process.

Specifically, in this section we will learn some basic ideas about probabilities:

• What they are and where they come from
• Simple probability models
• Properties of probabilities
• Conditional probabilities and the concept of independence
• Bayes’ Rule
• How to calculate probabilities:
  - Using a table of counts obtained through sampling (empirical or sample-based probabilities)
  - Using properties of probabilities, such as independence

Example 4.1: I toss a fair coin (where ‘fair’ means ‘equally likely outcomes’)





What are the possible outcomes?

What is the probability it will turn up heads?

Example 4.2: I choose a duck nest full of freshly laid eggs at random and look at predation status

 What are the possible outcomes?

 What is the probability the duck nest is predated?

Definition: A probability is a number…..


WHERE DO PROBABILITIES COME FROM?

Probabilities from models (games, genetics, certain types of random experiments)

The probability of getting a four when a fair die is rolled is ________.

Probabilities from data (or _________________ probabilities)

What is the probability that a randomly selected starling is female?

– In a random sample of n = 67 starlings, 40 are found to be female.

– The estimated probability that a randomly chosen starling is female is
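The estimate in this example is a sample proportion: the count of interest divided by the sample size. A minimal Python sketch of that arithmetic, using only the two numbers given above (the variable names are illustrative, not from the notes):

```python
# Empirical (sample-based) probability: count of interest / sample size.
n_sampled = 67   # starlings in the random sample (from the example)
n_female = 40    # starlings found to be female (from the example)

p_female_hat = n_female / n_sampled
print(f"Estimated P(female) = {n_female}/{n_sampled} = {p_female_hat:.3f}")
# prints: Estimated P(female) = 40/67 = 0.597
```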

Subjective probabilities:

The probability that there will be another outbreak of Ebola in Africa within the next year is 0.1.

The probability of rain in the next 24 hours is very high or 80%.

A doctor may state that a patient’s chances of a full recovery are 70%.

PROBABILITIES FROM DATA - SOME BASIC IDEAS

Example 4.3: Type of Hodgkin’s Disease and Response to Treatment

Below is a table containing the results of treatment for patients with different types of Hodgkin’s disease. The data were collected by taking a random sample of 538 patients diagnosed with some form of Hodgkin’s disease; thus both the type of Hodgkin’s and the response to treatment are random.

                                  Response to Treatment
Type of Hodgkin’s Disease    None    Partial    Positive    ROW TOTALS
LD                             44         10          18            72
LP                             12         18          74           104
MC                             58         54         154           266
NS                             12         16          68            96
COLUMN TOTALS                 126         98         314       n = 538

Histological Types of Hodgkin’s Disease

LD = lymphocyte depletion

LP = lymphocyte predominant

MC = mixed cellularity

NS = nodular sclerosis

For a patient selected at random from these 538 Hodgkin’s patients, find the probability that the patient:

(a) had a positive response

(b) had at least some response to treatment.

(c) had LP and had a positive response to treatment.

(d) had LP or NS for their histological type.
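Each of these probabilities is a count from the table divided by n = 538. The sketch below shows one way to organize that arithmetic in Python. The dictionary layout and helper names are assumptions made for illustration; the counts are those in the table above.

```python
# Counts from the Hodgkin's table: rows = histological type,
# columns = response to treatment.
counts = {
    "LD": {"None": 44, "Partial": 10, "Positive": 18},
    "LP": {"None": 12, "Partial": 18, "Positive": 74},
    "MC": {"None": 58, "Partial": 54, "Positive": 154},
    "NS": {"None": 12, "Partial": 16, "Positive": 68},
}
n = sum(sum(row.values()) for row in counts.values())  # 538


def column_total(response):
    """Number of patients with the given response, over all types."""
    return sum(row[response] for row in counts.values())


def row_total(kind):
    """Number of patients with the given histological type."""
    return sum(counts[kind].values())


# (a) positive response
p_a = column_total("Positive") / n
# (b) at least some response = partial or positive
p_b = (column_total("Partial") + column_total("Positive")) / n
# (c) LP and positive response: a single cell of the table
p_c = counts["LP"]["Positive"] / n
# (d) LP or NS: the two types cannot occur together, so the counts add
p_d = (row_total("LP") + row_total("NS")) / n

for label, p in [("a", p_a), ("b", p_b), ("c", p_c), ("d", p_d)]:
    print(f"({label}) {p:.3f}")
```

Part (d) can simply add the two row totals because a patient has exactly one histological type; for events that can overlap, the overlap would have to be subtracted once.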


CONDITIONAL PROBABILITY and INDEPENDENCE

• We are interested in the probability of something happening given information about the occurrence of another event.

• Key words that indicate conditional probability are: given, amongst, for those with, …

Conditional Probability

“The probability of event A occurring given that event B has already occurred”

is written in shorthand as

Formal Definition

P(A | B) =

Independence

Events A and B are said to be independent if

Example 4.4: Simple example when rolling a single fair die

We define the following two events based on the outcome of rolling the die

Conditional probability

A =

B =

Independence

C = we obtain two sixes in a row, D = we obtain three sixes in a row, etc.
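The notes leave the events A and B blank, so the sketch below uses hypothetical events chosen only for illustration. It assumes the standard definitions P(A | B) = P(A and B) / P(B) and independence as P(A and B) = P(A) P(B), and it counts equally likely outcomes of one fair die.

```python
from fractions import Fraction

outcomes = range(1, 7)  # faces of a fair six-sided die, equally likely


def prob(event):
    """Probability of an event given as a true/false function of the face."""
    return Fraction(sum(1 for x in outcomes if event(x)), 6)


# Hypothetical events (not from the notes), for illustration only.
def A(x): return x % 2 == 0   # even face: {2, 4, 6}
def B(x): return x >= 4       # face of at least 4: {4, 5, 6}
def E(x): return x <= 4       # face of at most 4: {1, 2, 3, 4}


p_A_and_B = prob(lambda x: A(x) and B(x))   # {4, 6}
p_A_given_B = p_A_and_B / prob(B)           # conditional probability

# Independence check: does P(A and B) equal P(A) * P(B)?
A_B_independent = p_A_and_B == prob(A) * prob(B)
A_E_independent = prob(lambda x: A(x) and E(x)) == prob(A) * prob(E)

# Separate rolls of a fair die are independent, so probabilities multiply.
p_two_sixes = Fraction(1, 6) ** 2
p_three_sixes = Fraction(1, 6) ** 3

print("P(A | B) =", p_A_given_B, " P(A) =", prob(A))
print("A, B independent?", A_B_independent)
print("A, E independent?", A_E_independent)
print("P(two sixes) =", p_two_sixes, " P(three sixes) =", p_three_sixes)
```

With these choices, knowing B changes the probability of A (2/3 instead of 1/2), so A and B are not independent, while knowing E leaves it at 1/2, so A and E are.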


Example 4.3 (cont’d): Conditional Probabilities from Hodgkin’s Example

                                  Response to Treatment
Type of Hodgkin’s Disease    None    Partial    Positive    ROW TOTALS
LD                             44         10          18            72
LP                             12         18          74           104
MC                             58         54         154           266
NS                             12         16          68            96
COLUMN TOTALS                 126         98         314       n = 538

Let’s consider some potential conditional probabilities of interest in this study.

A 2-D mosaic plot is a graphical display of conditional probabilities of the form P(Y|X), where in this example Y is the response to treatment and X is the histological type of Hodgkin’s disease.
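The conditional probabilities P(Y|X) that such a display is built from are row proportions: each cell count divided by its row total. A minimal Python sketch, assuming the table is stored as a nested dictionary (the layout and names are illustrative):

```python
# Counts from the Hodgkin's table: rows = type (X), columns = response (Y).
counts = {
    "LD": {"None": 44, "Partial": 10, "Positive": 18},
    "LP": {"None": 12, "Partial": 18, "Positive": 74},
    "MC": {"None": 58, "Partial": 54, "Positive": 154},
    "NS": {"None": 12, "Partial": 16, "Positive": 68},
}
responses = ["None", "Partial", "Positive"]

# P(response | type) = cell count / row total
cond = {}
for kind, row in counts.items():
    total = sum(row.values())
    cond[kind] = {r: row[r] / total for r in responses}

print("Type   " + "".join(f"{r:>10}" for r in responses))
for kind in counts:
    print(f"{kind:<7}" + "".join(f"{cond[kind][r]:>10.3f}" for r in responses))
```

Each printed row sums to 1, because within a given type every patient falls into exactly one response category.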


Example 4.5: Motorcycle Helmet Use and Brain Injury in Wisconsin Motorcyclists

A study was conducted in 1991 by the University of Wisconsin and the Wisconsin Department of Transportation in which linked police reports and hospital discharge records were used to assess, among other things, the risk for head injury for motorcyclists in motor-vehicle crashes. The data shown below can be used to examine the relationship between helmet use and whether brain injury was sustained in the accident.

                 Brain Injury    No Brain Injury    Row Totals
Helmet Worn                17                977           994
No Helmet                  97               1918          2015
Column Totals             114               2895          3009

For information on how to do this type of analysis in JMP see the tutorial Bivariate Displays for Categorical Data on the course website.

a) What is the probability that a motorcycle accident victim in Wisconsin suffered brain injury?

b) What is the probability that a motorcyclist involved in an accident was wearing a helmet? Can this be used to estimate the probability that a randomly sampled motorcyclist in WI wears a helmet?

c) What is the probability that a motorcyclist suffered brain injury given that they were wearing a helmet?

d) What is the probability that a motorcyclist not wearing a helmet suffered brain injury?

e) How many times more likely is a motorcyclist not wearing a helmet to sustain a brain injury?

This ratio is called the ________________ or __________________ .

RELATIVE RISK (RR)

In a study where examining risk is appropriate, e.g. cancer and smoking, the relative risk or risk ratio is defined generically as

$$\mathrm{RR} = \frac{P(\text{"bad thing"} \mid \text{risk factor present})}{P(\text{"bad thing"} \mid \text{risk factor absent})}$$

In a study where we are looking at a benefit, e.g. a drug that reduces the risk of an adverse outcome, it is defined as

$$\mathrm{RR} = \frac{P(\text{"good thing"} \mid \text{beneficial factor present})}{P(\text{"good thing"} \mid \text{beneficial factor absent})}$$
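As a worked instance of the risk form of the definition, the sketch below applies it to the helmet data of Example 4.5, treating "no helmet" as the risk factor and brain injury as the "bad thing". The variable names are illustrative; the counts come from the table in that example.

```python
# Counts from the Example 4.5 table.
injured = {"helmet": 17, "no_helmet": 97}      # brain injury
riders = {"helmet": 994, "no_helmet": 2015}    # row totals

# Conditional probabilities (risks) of brain injury in each group.
risk_helmet = injured["helmet"] / riders["helmet"]
risk_no_helmet = injured["no_helmet"] / riders["no_helmet"]

# Relative risk: risk with the risk factor present (no helmet)
# divided by risk with it absent (helmet worn).
rr = risk_no_helmet / risk_helmet

print(f"P(brain injury | helmet)    = {risk_helmet:.4f}")
print(f"P(brain injury | no helmet) = {risk_no_helmet:.4f}")
print(f"RR = {rr:.2f}")
```

An RR near 2.8 reads as "riders without a helmet were about 2.8 times as likely to sustain a brain injury as riders with one" among the crashes in this data set.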

ODDS RATIO (OR)

Most studies that seek to examine the relationship between an adverse outcome, e.g. death or cancer, and a set of potential risk factors are case-control observational studies. Remember that in a case-control study we sample individuals with the adverse event (cases) and some similar individuals without the adverse event (controls), and compare the two groups in terms of the risk factors we are interested in. The odds ratio (OR) is the main tool used to quantify risk in case-control studies. The example below will demonstrate why.

First, we need to define what odds for an event are.

The Odds for an event A are defined as


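The Odds Ratio for an event A associated with a “risk factor” is defined as

$$\mathrm{OR} = \frac{\text{Odds for } A \text{ for those with risk factor}}{\text{Odds for } A \text{ for those without risk factor}} = \frac{P(A \mid \text{risk}) \,/\, \bigl(1 - P(A \mid \text{risk})\bigr)}{P(A \mid \text{no risk}) \,/\, \bigl(1 - P(A \mid \text{no risk})\bigr)}$$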

The event A here is an adverse outcome like death or cancer, etc. If we are doing a study looking at benefit instead of risk, then A could be a good outcome such as survival or remission, etc.
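The formula above can be sketched in a few lines of Python. The helper names are illustrative, the odds are taken to be P(A) / (1 − P(A)) as the terms of the OR formula imply, and the two probabilities used in the example call are hypothetical values, not data from these notes.

```python
def odds(p):
    """Convert a probability p (0 <= p < 1) into odds p / (1 - p)."""
    if not 0 <= p < 1:
        raise ValueError("p must satisfy 0 <= p < 1")
    return p / (1 - p)


def odds_ratio(p_risk, p_no_risk):
    """Odds of A with the risk factor divided by odds of A without it."""
    return odds(p_risk) / odds(p_no_risk)


# Hypothetical probabilities, for illustration only.
p_risk = 0.20      # assumed P(A | risk)
p_no_risk = 0.10   # assumed P(A | no risk)

print(f"odds with risk factor    = {odds(p_risk):.3f}")
print(f"odds without risk factor = {odds(p_no_risk):.3f}")
print(f"OR = {odds_ratio(p_risk, p_no_risk):.2f}")
```

Here the probability doubles (0.10 to 0.20) while the odds ratio is 2.25, a first hint that the OR and the RR are related but not identical.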

Example 4.6: Age at First Pregnancy and Cervical Cancer

A case-control study was conducted to determine whether there was increased risk of cervical cancer amongst women who had their first child before age 25. A sample of 49 women with cervical cancer was taken, of whom 42 had their first child before the age of 25. From a sample of 317 “similar” women without cervical cancer it was found that 203 of them had their first child before age 25. Do these data suggest that having a child before age 25 increases the risk of cervical cancer?

                           Cervical Cancer: Case or Control
Age at First Pregnancy     Case      Control      Row Totals
Age < 25                   ____      ____         ____
Age > 25                   ____      ____         ____
Column Totals              ____      ____         n = ____

a) Why can’t we meaningfully calculate P( cervical cancer | risk factor status )?



VERY IMPORTANT

b) Find P( risk factor | disease status ) for each group of women.

c) What are the odds for the risk factor amongst the cases? Amongst the controls?

d) What is the odds ratio for having the risk factor associated with being a case?

e) Even though it is not appropriate to do so, calculate P( disease | risk factor status ) and the odds for disease for both risk factor groups.

f) Finally, calculate the odds ratio for having cervical cancer associated with having first pregnancy before age 25. What do we find? Why do you suppose the OR is much more commonly used than the RR?

Properties of the OR:

1) OR = ________

                       Disease Present (case)    Disease Absent (control)
Risk factor present              a                          b
Risk factor absent               c                          d

Identification of the a cell is the key: once you have identified this cell, b is in the same row, c is in the same column, and d is the diagonally opposite cell.

2) When the disease is rare in the population being studied, e.g. P(disease) < .10 (less than 10% of the population have the disease), there is little difference between the RR and OR, with the difference getting smaller the rarer the disease is. Thus for many diseases RR ≈ OR, which makes it easier to discuss and interpret odds ratios because we can state the results in terms of how many times more likely something is rather than using a multiplicative statement in terms of the odds.

3) When the disease is more common, say more than 10% have the disease or adverse outcome, then the OR tends to overstate the associated risk and we will find the OR >> RR.
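Properties 2 and 3 can be checked numerically. The sketch below compares RR and OR for the helmet data of Example 4.5, where brain injury is rare (114 of 3009 riders, under 4%), and for a made-up pair of probabilities where the outcome is common. The function name and the "common outcome" values are assumptions for illustration only.

```python
def rr_and_or(p_risk, p_no_risk):
    """Return (relative risk, odds ratio) for two conditional probabilities."""
    rr = p_risk / p_no_risk
    odds_ratio = (p_risk / (1 - p_risk)) / (p_no_risk / (1 - p_no_risk))
    return rr, odds_ratio


# Rare outcome: brain injury in Example 4.5 (risk factor = no helmet).
rare_rr, rare_or = rr_and_or(97 / 2015, 17 / 994)

# Common outcome: hypothetical probabilities, not data from the notes.
common_rr, common_or = rr_and_or(0.60, 0.30)

print(f"rare outcome:   RR = {rare_rr:.2f}, OR = {rare_or:.2f}")
print(f"common outcome: RR = {common_rr:.2f}, OR = {common_or:.2f}")
```

For the rare outcome the two measures nearly agree (about 2.8 versus 2.9), while for the common outcome the OR (3.5) is well above the RR (2.0).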

Example 4.6 (cont’d): Age at First Pregnancy and Cervical Cancer

                                            Disease Status
Risk Factor Status           Case                 Control                   Row Totals
                             (cervical cancer)    (no cervical cancer)
Age < 25 (risk present)      a = 42               b = 203                   245
Age > 25 (risk absent)       c = 7                d = 114                   121
Column Totals                49                   317                       n = 366
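With the cells labeled a, b, c and d, the odds ratio can be computed two ways: from the odds of the risk factor among cases and controls, and from the (not strictly appropriate) odds of disease in each risk group. The sketch below, with illustrative variable names, shows that both routes and the cross-product ad/(bc) give the same number, which is one way to see why the OR remains usable in a case-control design.

```python
# Cells of the Example 4.6 table.
a, b = 42, 203   # first child before 25: cases, controls
c, d = 7, 114    # first child after 25:  cases, controls

# Odds of having the risk factor, within each disease group.
odds_risk_cases = a / c
odds_risk_controls = b / d
or_exposure = odds_risk_cases / odds_risk_controls

# Odds of disease within each risk group (depends on how many cases
# and controls were sampled, so these odds alone are not meaningful).
odds_disease_risk = a / b
odds_disease_no_risk = c / d
or_disease = odds_disease_risk / odds_disease_no_risk

# Cross-product form of the same quantity.
or_cross = (a * d) / (b * c)

print(f"odds of risk factor: cases = {odds_risk_cases:.2f}, "
      f"controls = {odds_risk_controls:.2f}")
print(f"OR (risk factor odds) = {or_exposure:.2f}")
print(f"OR (disease odds)     = {or_disease:.2f}")
print(f"OR (a*d)/(b*c)        = {or_cross:.2f}")
```

All three lines report an odds ratio of about 3.37, even though the individual disease odds are driven by the 49-to-317 sampling of cases and controls.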

Probability - Winona State University
