Stat-285 – Assignment 9 – 2007 Fall Term
© 2007 Carl James Schwarz

1. Women and children first.

Have you ever watched a movie, or read a book, about a ship in trouble where, as soon as the words “women and children first!” are shouted out, you know that the ship is doomed to sink? You can find the source of this gallant tradition at http://ne.essortment.com/shiptraditionw_rrqb.htm.

This question deals with the sinking of the Titanic and examines the probability of survival as a function of the age, sex, and class of passage of the passengers. Visit http://www.statsci.org/data/general/titanic.html to get a list of the passengers aboard the Titanic. Download the datafile and import it into JMP. The file contains 5 variables: the passenger name; the class of passage; the age; the sex; and an indicator variable for survival status.

(a) Several of the ages are missing. These could likely be reconstructed from the original sources. We will assume that the age values are MCAR. What does this mean, and what implications will this have for the analysis?

Solution: MCAR = Missing Completely at Random implies that the missingness is unrelated to any of the variables, observed or missing. In particular, it is unrelated to the response value, i.e. missingness is unrelated to survival status. The only effect that MCAR has on the analysis is that the standard errors (se) are larger than they would be if the data were not missing.

(b) Use the Analyze->Fit Y-by-X platform to look at the breakdown of sex by class of passage. What does the mosaic plot show you? Confirm this by looking at a suitable contingency table with the appropriate percentages.

Solution: The mosaic plot and contingency tables are:

[Figure: mosaic plot and contingency table of sex by class of passage]

The proportion of males seems to increase as the class of passage decreases, rising from 56% in first class to 70% in third class. [The chi-square test for equal proportions shows that there is strong evidence that the sex ratio is not constant across class of passage.]
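For readers who want to see what the "appropriate percentages" and the chi-square test in part (b) compute, here is a minimal sketch in plain Python. The counts are HYPOTHETICAL, made up purely for illustration, and are not the Titanic data; the function names are likewise illustrative and have nothing to do with JMP.

```python
import math

def row_percentages(counts_by_class):
    """Convert each class's (female, male) counts to percentages of that class."""
    return {
        cls: tuple(100.0 * n / sum(counts) for n in counts)
        for cls, counts in counts_by_class.items()
    }

def chi_square_homogeneity(counts_by_class):
    """Pearson chi-square statistic for equal sex proportions across classes."""
    rows = list(counts_by_class.values())
    row_totals = [sum(r) for r in rows]
    col_totals = [sum(c) for c in zip(*rows)]
    grand_total = sum(row_totals)
    stat = 0.0
    for row, row_total in zip(rows, row_totals):
        for observed, col_total in zip(row, col_totals):
            # Expected count if the sex ratio were the same in every class
            expected = row_total * col_total / grand_total
            stat += (observed - expected) ** 2 / expected
    return stat

# HYPOTHETICAL (female, male) counts per class, NOT the real passenger list
toy_counts = {"1st": (40, 60), "2nd": (35, 65), "3rd": (30, 70)}

percentages = row_percentages(toy_counts)
statistic = chi_square_homogeneity(toy_counts)
# With 3 classes and 2 sexes there are (3-1)*(2-1) = 2 degrees of freedom,
# and for 2 df the chi-square upper-tail probability is exactly exp(-x/2).
p_value = math.exp(-statistic / 2)

print(percentages)
print(f"chi-square = {statistic:.3f}, p-value = {p_value:.3f}")
```

Row percentages (percent male within each class) are the ones that answer the question asked here. With the real data, the same calculation is what JMP reports in its contingency table and test output.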
(c) Use the Analyze->Fit Y-by-X platform to investigate the survival rates of the two sexes for each separate class of passage. [Hint: Use the By button.] Complete the following table (note that S is survival):[1]

                Males               Females         Odds-ratio of S   c.i. for
    Class    P(S)   ODDS(S)     P(S)   ODDS(S)     F vs M            odds-ratio
    1st
    2nd
    3rd

So what do you conclude about “women and children first”?

    [1] If you use a By variable, you cannot save predictions directly to the data table as in previous assignments. However, saved columns are still accessible by using the Red-Triangle → Script → Data Table Window. This will show a “hidden” data table that is created for each value of the By variable. You will have to do this for each value of the By variable. Here is the official FAQ from SAS: When by variables are used, JMP creates a new intermediate table for each level of the by variable. Statistics such as predicted values are saved to these intermediate tables rather than the original data table. To see the intermediate table you will need to click on the red triangle next to Generalized Linear Model Fit and choose Script->Data Table Window. You will have to do this for each level of the by variable. The new data table that appears will be for that specific level of the by variable and will contain the statistics such as predicted values that you have chosen.

Solution: The Analyze->Fit Y-by-X platform is completed as:

[Figure: JMP Fit Y-by-X dialog and contingency output for each class of passage]

The estimated proportion of survival can be read off the contingency tables, as can the odds ratio (but the odds ratio needs to be inverted as it is reported for males:females, not females:males). The odds of survival for each sex are computed by hand.

The completed table is:

                Males               Females         Odds-ratio of S   c.i. for
    Class    P(S)   ODDS(S)     P(S)   ODDS(S)     F vs M            odds-ratio
    1st      33%    1:2         94%    16:1        30:1              (15:1 → 64:1)
    2nd      15%    1:6         88%    7:1         43:1              (21:1 → 87:1)
    3rd      12%    1:8         38%    1:2         5:1               (3:1 → 7:1)

In all classes, females had a higher survival rate than males. The second class passengers appear to have heeded the call for “women and children first” most closely, as the odds ratio of survival for females versus males is the largest in that class. If you look at the raw percentages, you see that the chance of survival for females among the first and second class passengers is roughly the same (around 90%), but the survival rate of second class males (15%) is less than half of that of first class males (33%).

(d) The above analysis ignored the age of the passengers. For each combination of sex and passenger class, fit a logistic regression to predict survival as a function of age. Complete the following table for predicting the SURVIVAL rates of passengers as a function of age [Hint: think carefully about what JMP produces. Is it predicting survival or death?]:

                        Coefficient
    Class   Sex         of age        SE      p-value
    1st     Males
    1st     Females
    2nd     Males
    2nd     Females
    3rd     Males
    3rd     Females

So what do you conclude about the adage of “women and children first”?

Solution: Use the Analyze->Fit Y-by-X platform as follows:

[Figure: JMP Fit Y-by-X dialog for the logistic fits]

This gives the following summary output:

[Figure: JMP logistic regression output for each sex-class combination]

Notice that each of the above outputs is for the log-odds of DEATH (survival=0), and so the coefficient for SURVIVAL is simply the negative of the reported coefficient. This gives the table:

                        Coefficient
    Class   Sex         of age        SE      p-value
    1st     Males       −.054         .015    .0003
    1st     Females     .012          .031    .69
    2nd     Males       −.143         .032    < .0001
    2nd     Females     −.030         .028    .28
    3rd     Males       −.051         .020    .012
    3rd     Females     .0007         .016    .96

None of the female coefficients is statistically significantly different from zero. This implies that there is no evidence of a relationship between age and survival for females in any of the three classes.
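As a rough check on the table above, the p-values can be approximately reproduced from the coefficients and standard errors. The sketch below assumes a two-sided Wald test (z = coefficient / SE, compared with a standard normal distribution); JMP may compute its p-values differently, so small discrepancies are to be expected. The function and variable names are illustrative and not part of the assignment.

```python
import math

def wald_p_value(coef, se):
    """Two-sided p-value for H0: coefficient = 0, using z = coef / se."""
    z = coef / se
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for a standard normal
    return math.erfc(abs(z) / math.sqrt(2))

# (class, sex, coefficient of age for SURVIVAL, SE), copied from the table
rows = [
    ("1st", "Males",   -0.054,  0.015),
    ("1st", "Females",  0.012,  0.031),
    ("2nd", "Males",   -0.143,  0.032),
    ("2nd", "Females", -0.030,  0.028),
    ("3rd", "Males",   -0.051,  0.020),
    ("3rd", "Females",  0.0007, 0.016),
]

# JMP models the log-odds of DEATH, so its reported coefficient is the negative
# of the survival coefficient; flipping the sign leaves the SE and the
# p-value unchanged.
for cls, sex, coef, se in rows:
    print(f"{cls} {sex:8s} coef={coef:+.4f}  p={wald_p_value(coef, se):.4f}")
```

The computed values agree with the tabulated p-values to within the rounding of the coefficients and standard errors.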
There is strong evidence of an effect of age for males in all three classes. The coefficients are negative, which implies that as age increases, the log-odds of survival (and hence the probability of survival) decrease. The effect of age appears to be strongest for the second class males, as their coefficient has the largest magnitude, while the effect of age for the first and third class male passengers is about equal.

A plot of the survival curves on both the ordinary and logit scale appears below:

[Figure: fitted survival curves by age for each sex-class combination, on the probability scale and on the logit scale]

Notice that the lines for females are almost flat (on the logit scale) with little change in survival by age, while the lines for males are very steep. If you compute the Range Odds Ratio (the change in the odds ratio as you go from the smallest to the largest age for each sex-class combination), you find that the range of odds of survival is quite large for males but very small for females.

So yes, it appears that the adage reads “women and young males first”.

In more advanced classes (e.g. Stat-302 or Stat-402), you will learn how to fit one model for the combined data over all sexes and classes of passage, and how to look at the effect of age upon survival after adjusting for the sex and class of passage.

2. Never underestimate the p-o-w-e-r of the Orange side

Many people find it annoying when a cell phone goes off at the exact climax of a film.[2] When I was visiting England in September 2005, I happened to go to a movie and noticed a series of ads that played before the movie started, asking patrons to turn off their cell phones. The premise of these advertisements is a series of pitches by various celebrities to the Orange Film Funding Board, a fictitious agency, for films they would like to produce. The ads were sponsored by the Orange Cell Phone company, one of the largest mobile phone companies in the United Kingdom.[3]

    [2] See http://www.cnn.com/2005/TECH/10/17/wireless.manners/index.html or http://www.boundless.org/2005/articles/a0001207.cfm or http://www.mobiledia.com/news/41645.html.

You can view some of the advertisements at (don’t forget to press the Play button beneath each ad):

    (a) http://www.visit4info.com/details.cfm?adid=22035 - my favorite
    (b) http://www.visit4info.com/details.cfm?adid=20298
    (c) http://www.visit4info.com/details.cfm?adid=24647 - my second favorite
    (d) http://www.visit4info.com/details.cfm?adid=24648

These advertisements have made it into Wikipedia at http://en.wikipedia.org/wiki/Orange_UK.

But do these commercials actually work?

(a) Describe how you would perform an experiment as a completely randomized design. The four ads are to be compared (with a control of no ads). There are 10 screens, five showings per day (morning, early afternoon, late afternoon, early evening, and late evening, identified by the numbers 1 to 5), seven days per week (1=Sunday, 2=Monday, etc.), and a 4 week test period.

Solution: There are a total of 10 x 5 x 7 x 4 = 1400 possible showings. The five treatments (the 4 ads plus a control) should be randomly assigned to the showings, and the number of cell phones that ring during each showing should be recorded.

You can download some data from http://www.stat.sfu.ca/~cschwarz/Stat-285/Assignments/cellphone.txt. The variables in the dataset are the week, day, showing, screen, ad used, number of tickets sold, and the number of cell phones that went off. Convert the number of cell phones that went off to a simple yes/no variable.

(b) Test the hypothesis that the probability of a cell phone interruption is the same for all ads (including the control).
Solution: This can be done with the Analyze->Fit Y-by-X platform, the Analyze->Fit Model platform, or the Generalized Linear Model platform:

[Figure: JMP output from the Fit Y-by-X, Fit Model, and Generalized Linear Model platforms]

In all cases, the p-value is < .0001 and so there is very strong evidence that the probability of being interrupted by a cell phone is not equal across all the treatment levels. Of course, at this stage, we don’t know which treatment is best or worst.

    [3] More details at http://www.orange.com/

(c) Estimate the probability of a cell phone interrupting the movie for each ad and complete the following table:

    Ad      Estimate   se     95% ci
    None
    dh
    dv
    jc
    ss

Solution: These probabilities were estimated using the Analyze->Fit Model platform and the Generalized Linear Modeling option. Notice that the models estimate the probability of NO interruption, so the estimates must be subtracted from 1 to get the probability of an interruption. The se were estimated by taking the width of the 95% confidence interval and dividing by 4, because a 95% confidence interval extends roughly 2 se on either side of the estimate.

    Ad      Estimate   se     95% ci
    None    .20        .025   (.16 → .25)
    dh      .14        .025   (.10 → .20)
    dv      .03        .01    (.01 → .05)
    jc      .06        .02    (.03 → .10)
    ss      .05        .01    (.03 → .08)

(d) Draw a suitable graph (possibly by hand) showing the results from the previous table. What does this graph show? Which ad seems to be the most effective?

Solution: I used the graphing feature of JMP to create the following plot:

[Figure: estimated probability of an interruption, with 95% confidence intervals, for each ad]

The probability of interruption appears to be smallest for the dv, jc, and ss commercials, followed by the dh ad, followed by the control screenings. The ads seem to work, but there appear to be some minor differences among the ads.

(e) Estimate the difference in the log-odds between cases with no ads and the Darth Vader (dv) ad, along with a se and an approximate 95% confidence interval. Convert this to an odds ratio along with a 95% confidence interval. Interpret this odds-ratio. What do you conclude?
Solution: Use the Contrast option of the Generalized Linear Model platform:

[Figure: JMP contrast output comparing the control and dv ads]

Again, be careful because JMP is measuring the log-odds of NO interruptions, and we want the log-odds of interruptions.

The estimated difference in log-odds of interruptions between the control and dv ads is 2.17 (se .39). The approximate 95% confidence interval is (1.39 → 2.95). Because this difference in log-odds is positive, the odds of an interruption in the control setting are HIGHER than the odds of an interruption with the dv ad.

The estimated odds-ratio is found as e^2.17 = 8.8, with an approximate 95% confidence interval of (e^1.39 = 4 → e^2.95 = 19). This implies that the odds of an interruption are about 9 times higher in the control showings than when the dv ad is shown.

Truly, the Phone is Strong Here.

In more advanced classes (e.g. Stat-302 and Stat-402) you will learn how to use the actual number of cell phone calls as the response variable and how to adjust it for the number of tickets sold for that showing.

Common errors made on this assignment – check your work!

• Many students just attached all output and did not provide the table and conclusions.
  There are NO jobs for people who just bash numbers through a statistical package and provide "computer diarrhea" as a report! It is vitally important that you understand what output is produced and that you are able to write a coherent report. In many cases, output is badly labelled and the results are not obvious.
• In the experimental design, some students did not consider the control group (no ad).
• Some students just stated the null hypothesis.
• Many students did not notice that the models estimate the probability of NO interruption.
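The arithmetic in the part (e) solution of Question 2 can be checked with a few lines of code. This is a minimal sketch, assuming the interval is formed as estimate ± 2 se on the log-odds scale (which reproduces the quoted limits of 1.39 and 2.95) and is then exponentiated; the function name is illustrative and is not a JMP feature.

```python
import math

def odds_ratio_with_ci(log_odds_diff, se, multiplier=2.0):
    """Exponentiate a log-odds difference and its approximate 95% interval."""
    lower = log_odds_diff - multiplier * se
    upper = log_odds_diff + multiplier * se
    return math.exp(log_odds_diff), (math.exp(lower), math.exp(upper))

# Estimate and se quoted in the solution: control vs dv, log-odds of interruption
estimate, (low, high) = odds_ratio_with_ci(2.17, 0.39)
print(f"odds ratio {estimate:.1f}, approx 95% ci ({low:.1f} to {high:.1f})")
```

This gives an odds ratio of about 8.8 with an interval of roughly 4.0 to 19.1. Note that the interval is symmetric on the log-odds scale but not on the odds-ratio scale, which is why the limits are computed before exponentiating.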