ObsDatAna2CatVars

Exploratory data analysis with two
qualitative variables
Not in FPP
Exploratory data analysis with two
qualitative/categorical variables
 Main tools
 Contingency tables
 Conditional, marginal, and joint frequencies
Motivating example
 Surviving the Titanic
 Was there class discrimination in who survived the wreck of the
Titanic?
 “It has been suggested before the Enquiry that the third-class
passengers had been unfairly treated, that their access to the
boat deck had been impeded; and that when they reached the
deck the first and second-class passengers were given
precedence in getting places in the boats.” Lord Mersey, 1912
Titanic: Class by survival
         1st Class   2nd Class   3rd Class   Crew   Total
Dead           122         167         528    696    1513
Alive          203         118         178    212     711
Total          325         285         706    908    2224
Titanic: Marginal frequencies
 % Dead = 1513/2224 = 0.68
 % Alive = 711/2224 = 0.32
 % in first class = 325/2224 = 0.15
 % in second class = 285/2224 = 0.13
 % in third class = 706/2224 = 0.32
 % crew = 908/2224 = 0.41
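The marginal frequencies above can be reproduced with a few lines of Python. This is only a minimal sketch: the dictionary layout and variable names are illustrative choices, and the counts are the ones from the class-by-survival table.

```python
# Counts from the class-by-survival table (rows: outcome, columns: class).
counts = {
    "Dead":  {"1st": 122, "2nd": 167, "3rd": 528, "Crew": 696},
    "Alive": {"1st": 203, "2nd": 118, "3rd": 178, "Crew": 212},
}

# Everyone on board: the sum of all eight cells.
grand_total = sum(sum(row.values()) for row in counts.values())

# Row margins: proportion of all people with each outcome.
outcome_margin = {o: sum(row.values()) / grand_total for o, row in counts.items()}

# Column margins: proportion of all people in each class.
classes = list(counts["Dead"])
class_margin = {c: sum(counts[o][c] for o in counts) / grand_total for c in classes}

for name, p in {**outcome_margin, **class_margin}.items():
    print(f"{name}: {p:.2f}")
```

A marginal frequency ignores the other variable entirely: each one is a row or column total divided by the grand total.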
Titanic: Conditional frequencies
 % (Alive | 1st) = 203/325 = 0.625
 % (Alive | 2nd) = 118/285 = 0.414
 % (Alive | 3rd) = 178/706 = 0.252
 % (Alive | Crew) = 212/908 = 0.233
 Based on these frequencies does there appear to be class
discrimination?
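As a rough sketch, the conditional survival frequencies can be computed by dividing each "Alive" count by its class total rather than by the grand total. The variable names here are illustrative only.

```python
# Survivors and totals per class, taken from the class-by-survival table.
alive = {"1st": 203, "2nd": 118, "3rd": 178, "Crew": 212}
total = {"1st": 325, "2nd": 285, "3rd": 706, "Crew": 908}

# %(Alive | class) = (alive in that class) / (everyone in that class)
survival_given_class = {c: alive[c] / total[c] for c in total}

for c, p in survival_given_class.items():
    print(f"%(Alive | {c}) = {p:.3f}")
```

The denominator is what separates these from the marginal frequencies: conditioning on class restricts attention to one column of the table.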
Titanic: Class by person type
           1st Class   2nd Class   3rd Class   Crew   Total
Children           6          24          79      0     109
Women            144          93         165     23     425
Men              175         168         462    885    1690
Total            325         285         706    908    2224
Titanic: percentage of men in each class
 % (Man | 1st) = 175/325 = 0.54
 % (Man | 2nd) = 168/285 = 0.59
 % (Man | 3rd) = 462/706 = 0.65
 % (Man | Crew) = 885/908 = 0.97
 The percentages of men are largest in third class and among the crew
Surviving the Titanic
 A possible reason for class differences in survival:
 A larger percentage of men died than of women and children
 3rd class consisted mostly of men
 Hence, a larger percentage of 3rd class passengers died
 Once again, keep in mind possible lurking variables that could
be driving the relationship seen between two measured
variables
Relative risk and odds ratios
 Motivating example
 Physicians’ health study (1989): randomized experiment with
22071 male physicians at least 40 years old
 Half the subjects assigned to take aspirin every other day
 Other half assigned to take a placebo, a dummy pill that looked
and tasted like aspirin
Physicians’ health study
 Here is the number of people in each cell:
           Heart attack   No heart attack   Total
Aspirin             104             10933   11037
Placebo             189             10845   11034
Total               293             21778   22071
Relative risk
         y1     y2     Total
x1       a      b      a+b      Risk of y1 for level x1 = a/(a+b)
x2       c      d      c+d      Risk of y1 for level x2 = c/(c+d)
Total    a+c    b+d

$$\text{Relative risk} = \frac{a/(a+b)}{c/(c+d)}$$
Relative risk for physicians’ health study
 Relative risk of a heart attack when taking aspirin versus
when taking a placebo equals
$$RR = \frac{104/(104 + 10933)}{189/(189 + 10845)} \approx 0.55$$
 People who took aspirin were 0.55 times as likely to have a
heart attack as people who took the placebo
 Or, people who took the placebo were 1/0.55 = 1.82 times as
likely to have a heart attack as people who took aspirin
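The relative risk calculation can be sketched in Python using the generic cell labels a, b, c, d from the relative risk table. The function name `relative_risk` is an illustrative choice, not something defined in the slides.

```python
def relative_risk(a, b, c, d):
    """Risk of y1 at level x1 divided by risk of y1 at level x2.

    a, b: counts of y1 and y2 at level x1
    c, d: counts of y1 and y2 at level x2
    """
    risk_x1 = a / (a + b)
    risk_x2 = c / (c + d)
    return risk_x1 / risk_x2


# Physicians' health study: x1 = aspirin, x2 = placebo, y1 = heart attack.
rr = relative_risk(104, 10933, 189, 10845)
print(f"RR (aspirin vs placebo) = {rr:.2f}")
print(f"RR (placebo vs aspirin) = {1 / rr:.2f}")
```

Swapping which group is in the numerator simply inverts the ratio, which is why both readings describe the same data.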

Odds ratios
         y1     y2
x1       a      b      Odds of y1 for level x1 = a/b
x2       c      d      Odds of y1 for level x2 = c/d

$$\text{Odds ratio} = \frac{a/b}{c/d}$$
Odds ratios for physicians’ health study
 Relative risk of a heart attack when taking aspirin versus
taking a placebo is
$$RR = \frac{104/(104 + 10933)}{189/(189 + 10845)} \approx 0.55$$
 Odds of having a heart attack when taking aspirin over odds
of a heart attack when taking a placebo (odds ratio)
$$OR = \frac{104/10933}{189/10845} \approx 0.546$$
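A matching sketch for the odds ratio, again using the generic cell labels a, b, c, d. The function name `odds_ratio` is an illustrative choice.

```python
def odds_ratio(a, b, c, d):
    """Odds of y1 at level x1 divided by odds of y1 at level x2.

    Odds compare y1 to y2 directly (a/b), whereas a risk compares
    y1 to the whole row (a/(a+b)).
    """
    odds_x1 = a / b
    odds_x2 = c / d
    return odds_x1 / odds_x2


# Physicians' health study: x1 = aspirin, x2 = placebo, y1 = heart attack.
or_aspirin = odds_ratio(104, 10933, 189, 10845)
print(f"OR (aspirin vs placebo) = {or_aspirin:.3f}")
```

The same ratio can be written as the cross-product (a*d)/(b*c), which avoids the intermediate divisions.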
Interpreting odds ratios and relative risks
 When the variables X and Y are independent
 odds ratio = 1
 relative risk = 1
 When subjects with level x1 are more likely to have y1 than
subjects with level x2, then
 odds ratio > 1
 relative risk > 1
 When subjects with level x1 are less likely to have y1 than
subjects with level x2, then
 odds ratio < 1
 relative risk < 1
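The three cases can be checked numerically. The 2x2 tables below are hypothetical counts invented purely for illustration (they do not come from any study), and the helper names are likewise illustrative.

```python
def relative_risk(a, b, c, d):
    return (a / (a + b)) / (c / (c + d))


def odds_ratio(a, b, c, d):
    return (a / b) / (c / d)


# HYPOTHETICAL tables, given as (a, b, c, d) in the slides' notation.
tables = {
    "independent":       (10, 90, 20, 180),  # risk of y1 is 10% at both levels
    "x1 more likely y1": (30, 70, 10, 90),   # 30% at x1 versus 10% at x2
    "x1 less likely y1": (5, 95, 10, 90),    # 5% at x1 versus 10% at x2
}

for label, cells in tables.items():
    rr = relative_risk(*cells)
    odds = odds_ratio(*cells)
    print(f"{label}: RR = {rr:.2f}, OR = {odds:.2f}")
```

Both measures always fall on the same side of 1, although away from 1 the odds ratio is the more extreme of the two.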
Which one should be used?
 If the relative risk is available, it should be used
 In a cohort study, the relative risk can be calculated directly
 In a case-control study the relative risk cannot be calculated
directly, so an odds ratio is used instead
 Case-control studies compare subjects who have a “condition” (cases)
to similar subjects who don’t (controls)
 In this type of study we know %(exposure|disease). But to compute the
RR we need %(disease|exposure).
 Recall that RR = %(disease|exposure)/%(disease|no exposure)
 The relative risk is also not directly available in more complex
modeling (logistic regression)
Odds ratio vs relative risk
 When is the odds ratio a good approximation of the relative risk?
 When cases are representative of the diseased population
 When controls are representative of the population without the disease
 When the disease being studied occurs at low frequency
 In itself, an odds ratio is a useful measure of association
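The low-frequency condition can be illustrated numerically. In this sketch the relative risk is held fixed at a hypothetical value of 2 while the baseline risk varies. The baseline risks are invented for illustration, and `odds_ratio_from_risks` is an illustrative helper name.

```python
def odds_ratio_from_risks(p1, p2):
    """Odds ratio implied by risk p1 in the exposed and p2 in the unexposed."""
    return (p1 / (1 - p1)) / (p2 / (1 - p2))


rr = 2.0  # HYPOTHETICAL relative risk, held fixed throughout

for baseline in (0.001, 0.01, 0.1, 0.3):  # hypothetical risks in the unexposed
    exposed = rr * baseline
    odds = odds_ratio_from_risks(exposed, baseline)
    print(f"baseline risk {baseline}: RR = {rr:.1f}, OR = {odds:.3f}")
```

When the outcome is rare, 1 - p is close to 1 in both groups, so odds and risk nearly coincide. This is consistent with the physicians' health study, where heart attacks were rare and the odds ratio (0.546) is close to the relative risk (0.55).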
Relative risk vs absolute risk
 % smokers who get lung cancer: 8% (conservative guess
here)
 Relative risk of lung cancer for smokers: 800%
 Getting lung cancer is not commonplace, even for smokers.
But, smokers’ chances of getting lung cancer are much, much
higher than non-smokers’ chances.
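A small worked calculation makes the contrast concrete. It assumes that a relative risk of 800% means smokers' risk is 8 times non-smokers' risk, and it uses the slide's guessed 8% figure. Both are assumptions of this sketch, not measured values.

```python
smoker_risk = 0.08    # the slide's guess: 8% of smokers get lung cancer
relative_risk = 8.0   # ASSUMPTION: "800%" read as 8 times the non-smoker risk

# Implied absolute risk for non-smokers under that reading.
nonsmoker_risk = smoker_risk / relative_risk

# Absolute difference in risk between the two groups.
risk_difference = smoker_risk - nonsmoker_risk

print(f"non-smoker risk: {nonsmoker_risk:.1%}")
print(f"risk difference: {risk_difference:.1%}")
```

Under these assumptions the relative risk is large (8 times) while the absolute risks stay small (8% versus 1%), which is why both figures are worth reporting.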
Simpson’s paradox
 When a third variable seemingly reverses the association
between two other variables
 Hot hand example
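A reversal of this kind can be sketched with made-up numbers. The counts below are hypothetical, invented purely for illustration, and are not the hot hand data. The labels "A", "B", "group 1" and "group 2" are placeholders.

```python
# HYPOTHETICAL counts as (successes, attempts); not real data.
data = {
    "group 1": {"A": (8, 10),   "B": (70, 100)},
    "group 2": {"A": (20, 100), "B": (1, 10)},
}


def rate(successes, attempts):
    return successes / attempts


# Within each group, A has the higher success rate.
for group, rows in data.items():
    print(group, {k: round(rate(*v), 2) for k, v in rows.items()})

# Pooling the two groups reverses the comparison.
pooled = {
    k: rate(sum(data[g][k][0] for g in data), sum(data[g][k][1] for g in data))
    for k in ("A", "B")
}
print("pooled", {k: round(v, 2) for k, v in pooled.items()})
```

The reversal happens because most of A's attempts fall in the group where success is rare for everyone, so the third variable (group) drives the pooled comparison.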