ObsDatAna2CatVars

Exploratory data analysis with two qualitative variables

Not in FPP

Exploratory data analysis with two qualitative/categorical variables
- Main tools:
  - Contingency tables
  - Conditional, marginal, and joint frequencies

Motivating example
- Surviving the Titanic
- Was there class discrimination in survival of the wreck of the Titanic?
- “It has been suggested before the Enquiry that the third-class passengers had been unfairly treated, that their access to the boat deck had been impeded; and that when they reached the deck the first and second-class passengers were given precedence in getting places in the boats.” Lord Mersey, 1912

Titanic: Class by survival

          1st Class   2nd Class   3rd Class   Crew   Total
Dead         122         167         528       696    1513
Alive        203         118         178       212     711
Total        325         285         706       908    2224

Titanic: Marginal frequencies
- % Dead = 1513/2224 = 0.68
- % Alive = 711/2224 = 0.32
- % in first class = 325/2224 = 0.15
- % in second class = 285/2224 = 0.13
- % in third class = 706/2224 = 0.32
- % crew = 908/2224 = 0.41

Titanic: Conditional frequencies
- % (Alive | 1st) = 203/325 = 0.625
- % (Alive | 2nd) = 118/285 = 0.414
- % (Alive | 3rd) = 178/706 = 0.252
- % (Alive | Crew) = 212/908 = 0.233
- Based on these frequencies, does there appear to be class discrimination?

Titanic: Class by person type

          1st Class   2nd Class   3rd Class   Crew   Total
Children       6          24          79         0     109
Women        144          93         165        23     425
Men          175         168         462       885    1690
Total        325         285         706       908    2224

Titanic: Percentage of men in each class
- % (Man | 1st) = 175/325 = 0.54
- % (Man | 2nd) = 168/285 = 0.59
- % (Man | 3rd) = 462/706 = 0.65
- % (Man | Crew) = 885/908 = 0.97
- There are larger percentages of men in third class and crew

Surviving the Titanic
- A reason for class differences in survival:
  - Larger percentages of men died
  - 3rd class consisted of mostly men
  - Hence, a larger percentage of 3rd class passengers died
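As a check on the Titanic arithmetic, here is a minimal Python sketch that recomputes the marginal, joint, and conditional frequencies from the class-by-survival counts. The variable names are illustrative assumptions and are not part of the original slides.

```python
# Counts copied from the Titanic class-by-survival table.
classes = ["1st", "2nd", "3rd", "Crew"]
dead = [122, 167, 528, 696]
alive = [203, 118, 178, 212]

total = sum(dead) + sum(alive)  # everyone on board

# Marginal frequencies: look at one variable and ignore the other.
marginal_alive = sum(alive) / total
marginal_class = {c: (d + a) / total for c, d, a in zip(classes, dead, alive)}

# Joint frequencies: share of all people in one cell of the table.
joint_alive = {c: a / total for c, a in zip(classes, alive)}

# Conditional frequencies: share of survivors within each class.
alive_given_class = {c: a / (d + a) for c, d, a in zip(classes, dead, alive)}

print(f"% Alive = {marginal_alive:.2f}")
for c in classes:
    print(
        f"% ({c}) = {marginal_class[c]:.2f}   "
        f"% (Alive and {c}) = {joint_alive[c]:.3f}   "
        f"% (Alive | {c}) = {alive_given_class[c]:.3f}"
    )
```

The conditional frequencies divide by the class total rather than by the grand total, which is what makes the classes comparable even though they differ greatly in size.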
- Once again, keep in mind possible lurking variables that could be driving the relationship seen between two measured variables

Relative risk and odds ratios
- Motivating example
- Physicians’ health study (1989): randomized experiment with 22071 male physicians at least 40 years old
- Half the subjects were assigned to take aspirin every other day
- The other half were assigned to take a placebo, a dummy pill that looked and tasted like aspirin

Physicians’ health study
- Here is the number of people in each cell:

           Heart attack   No heart attack   Total
Aspirin         104            10933        11037
Placebo         189            10845        11034
Total           293            21778        22071

Relative risk

         y1     y2     Total
x1       a      b      a+b      Risk of y1 for level x1 = a/(a+b)
x2       c      d      c+d      Risk of y1 for level x2 = c/(c+d)
Total    a+c    b+d

$\text{Relative risk} = \dfrac{a/(a+b)}{c/(c+d)}$

Relative risk for physicians’ health study
- Relative risk of a heart attack when taking aspirin versus when taking a placebo equals

$\text{RR} = \dfrac{104/(104+10933)}{189/(189+10845)} \approx 0.55$

- People who took aspirin are 0.55 times as likely to have a heart attack as people who took the placebo
- Or, people who took the placebo are 1/0.55 = 1.82 times as likely to have a heart attack as people who took aspirin

Odds ratios

         y1     y2
x1       a      b      Odds of y1 for level x1 = a/b
x2       c      d      Odds of y1 for level x2 = c/d

$\text{Odds ratio} = \dfrac{a/b}{c/d}$

Odds ratios for physicians’ health study
- Relative risk of a heart attack when taking aspirin versus taking a placebo is

$\text{RR} = \dfrac{104/(104+10933)}{189/(189+10845)} \approx 0.55$

- Odds of having a heart attack when taking aspirin over odds of a heart attack when taking a placebo (odds ratio):

$\text{OR} = \dfrac{104/10933}{189/10845} \approx 0.546$

Interpreting odds ratios and relative risks
- When the variables X and Y are independent, odds ratio = 1 and relative risk = 1
- When subjects with level x1 are more likely to have y1 than subjects with level x2, odds ratio > 1 and relative risk > 1
- When subjects with level x1 are less likely to have y1 than subjects with level x2, odds ratio < 1 and relative risk < 1

Which one should be used?
- If the relative risk is available, then it should be used
- In a cohort study, the relative risk can be calculated directly
- In a case-control study, the relative risk cannot be calculated directly, so an odds ratio is used instead
  - Case-control studies compare subjects who have a “condition” (cases) to otherwise similar subjects who do not (controls)
  - In this type of study we know %(exposure | disease). But to compute the RR we need %(disease | exposure).
  - Recall that RR = %(disease | exposure) / %(disease | no exposure)
- Relative risk is also not available in more complex modeling (logistic regression), which works with odds ratios

Odds ratio vs relative risk
- When is the odds ratio a good approximation of the relative risk?
  - When cases are representative of the diseased population
  - When controls are representative of the population without the disease
  - When the disease being studied occurs at low frequency
- In itself, an odds ratio is a useful measure of association

Relative risk vs absolute risk
- % smokers who get lung cancer: 8% (conservative guess here)
- Relative risk of lung cancer for smokers: 800%
- Getting lung cancer is not commonplace, even for smokers. But smokers’ chances of getting lung cancer are much, much higher than non-smokers’ chances.

Simpson’s paradox
- When a third variable seemingly reverses the association between two other variables
- Hot hand example
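Returning to the physicians’ health study, the relative risk and odds ratio formulas can be checked with a short Python sketch. The function names `relative_risk` and `odds_ratio` are hypothetical helpers written for this illustration; the cell labels a, b, c, d follow the generic 2x2 table in the slides.

```python
def relative_risk(a, b, c, d):
    """Risk of y1 at level x1 divided by risk of y1 at level x2."""
    return (a / (a + b)) / (c / (c + d))


def odds_ratio(a, b, c, d):
    """Odds of y1 at level x1 divided by odds of y1 at level x2."""
    return (a / b) / (c / d)


# Physicians' health study counts:
# x1 = aspirin (104 heart attacks, 10933 without),
# x2 = placebo (189 heart attacks, 10845 without).
a, b, c, d = 104, 10933, 189, 10845

rr = relative_risk(a, b, c, d)
or_ = odds_ratio(a, b, c, d)

print(f"RR = {rr:.2f}")
print(f"OR = {or_:.3f}")
print(f"1/RR = {1 / rr:.2f}")
```

Because heart attacks are rare in both groups, a/(a+b) is close to a/b and c/(c+d) is close to c/d, so the two measures nearly coincide here. This is the low-frequency condition under which the odds ratio approximates the relative risk.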
