Solutions - people.stat.sfu.ca

Stat-285 – Assignment 9 – 2007 Fall Term
1. Women and children first.
Have you ever watched a movie, or read a book, about a ship in trouble
and, when the words “women and children first!” are shouted out, you
know that those words inevitably mean that the ship is doomed to sink?
You can find the source of this gallant tradition at
http://ne.essortment.com/shiptraditionw_rrqb.htm.
This question deals with the sinking of the Titanic and an examination
of the probability of survival as a function of age, sex, and class of
passage in this tragedy.
Visit http://www.statsci.org/data/general/titanic.html to get a
list of the passengers aboard the Titanic. Download the datafile and
import it into JMP. The file contains 5 variables: the passenger name; the
class of passage; the age; the sex; and an indicator variable for survival
status.
(a) Several of the ages are missing. These could likely be reconstructed
from the original sources. We will assume that the age values are
MCAR. What does this mean, and what implications will this have
for the analysis?
Solution: MCAR = Missing Completely at Random. This means that
the missingness is unrelated to the response value (or to any other
variable), i.e. whether an age is missing is unrelated to survival status.
The only effect that MCAR has on the analysis is that the standard errors
(se) are larger than if the data were not missing.
(b) Use Analyze->Fit Y-by-X platform to look at the breakdown of sex
by class of passage. What does the mosaic plot show you? Confirm
this by looking at a suitable contingency table with the appropriate
percentages.
Solution: The mosaic plot and contingency tables are:
The proportion of males seems to increase as the class of passage
decreases, rising from 56% in first class to 70% in third class.
[The chi-square test for equal proportions shows that there is strong
evidence that the sex ratio is not constant across class of passage.]
© 2007 Carl James Schwarz
(c) Use the Analyze->Fit Y-by-X platform to investigate the survival
rates of the two sexes for each separate class of passage. [Hint: Use
the By button.] Complete the following table (note that S is survival;
see footnote 1):
          Males            Females          Odds-ratio of S   c.i. for
Class   P(S)  ODDS(S)    P(S)  ODDS(S)     F vs M            odds-ratio
1st
2nd
3rd
So what do you conclude about “women and children first”?
Solution: The Analyze->Fit Y-by-X platform is completed as:
The estimated proportion of survival can be read off the contingency
tables, as can the odds ratio (but the odds ratio needs to be inverted
because it is reported for males:females, not females:males). The odds
of survival for each sex are computed by hand.
1 If you use a By variable, you cannot save predictions directly to the data table as in
previous assignments. However, saved columns are still accessible by using the Red-Triangle
->Script->Data Table Window. This will show a “hidden” data table that is created for each
value of the By variable. You will have to do this for each value of the By variable.
Here is the official FAQ from SAS:
When by variables are used, JMP creates a new intermediate table for each level of
the by variable. Statistics such as predicted values are saved to these intermediate
tables rather than the original data table. To see the intermediate table you will
need to click on the red triangle next to Generalized Linear Model Fit and choose
Script->Data Table Window. You will have to do this for each level of the by
variable. The new data table that appears will be for that specific level of the
by variable and will contain the statistics such as predicted values that you have
chosen.
The completed table is:
          Males            Females          Odds-ratio      c.i. for
Class   P(S)  ODDS(S)    P(S)  ODDS(S)     F vs M (S)      odds-ratio
1st     33%   1:2        94%   16:1        30:1            (15:1 → 64:1)
2nd     15%   1:6        88%   7:1         43:1            (21:1 → 87:1)
3rd     12%   1:8        38%   1:2         5:1             ( 3:1 →  7:1)
In all classes, females had a higher survival rate than males. The
second class passengers appear to have heeded the call for “women and
children first” most strongly, as the odds ratio of survival for females
vs. males is the largest in that class. If you look at the raw
percentages, you see that the chances of survival for females among the
first and second class passengers are roughly the same (around 90%), but
the survival rate of second class males (15%) is less than half of that
of first class males (33%).
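The hand computation of the odds and the odds ratio can be sketched in a few lines of Python. This is only an illustration: it uses the rounded survival percentages from the completed table rather than the raw counts, so the odds ratios it prints will differ somewhat from the values JMP reports (30:1, 43:1 and 5:1). The function and variable names are made up for this sketch.

```python
# Survival proportions read (rounded) from the completed table above.
survival = {
    "1st": {"male": 0.33, "female": 0.94},
    "2nd": {"male": 0.15, "female": 0.88},
    "3rd": {"male": 0.12, "female": 0.38},
}


def odds(p):
    """Convert a probability p (0 < p < 1) into odds, p : (1 - p)."""
    return p / (1.0 - p)


def odds_ratio(p_female, p_male):
    """Odds of survival for females divided by the odds for males."""
    return odds(p_female) / odds(p_male)


for passenger_class, p in survival.items():
    print(
        passenger_class,
        "male odds:", round(odds(p["male"]), 2),
        "female odds:", round(odds(p["female"]), 2),
        "F vs M odds ratio:", round(odds_ratio(p["female"], p["male"]), 1),
    )
```

Note that the female odds are highest in first class, while the female-to-male odds ratio is highest in second class; the two summaries answer different questions.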
(d) The above analysis ignored the age of the passengers. For each combination of sex and passenger class, fit a logistic regression to predict
survival as a function of age. Complete the following table for predicting the SURVIVAL rates of passengers as a function of age
[Hint: think carefully what JMP produces – is it predicting survival
or death?]:
                  Coefficient
Class   Sex       of age        SE      p-value
1st     Males
1st     Females
2nd     Males
2nd     Females
3rd     Males
3rd     Females
So what do you conclude about the adage of “women and children
first”?
Solution: Use the Analyze->Fit Y-by-X platform as follows:
This gives the following summary output:
Notice that each of the above outputs is for the log-odds of DEATH
(survival=0), and so the coefficient for SURVIVAL is simply the negative
of the reported coefficient. This gives the table:
                  Coefficient
Class   Sex       of age        SE      p-value
1st     Males     −.054         .015    .0003
1st     Females    .012         .031    .69
2nd     Males     −.143         .032    < .0001
2nd     Females   −.030         .028    .28
3rd     Males     −.051         .020    .012
3rd     Females    .0007        .016    .96
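The sign flip used to build this table follows from the identity logit(1 − p) = −logit(p): the log-odds of death is the negative of the log-odds of survival, so every coefficient changes sign while its se and p-value stay the same. A minimal Python check of that identity is sketched below; the probabilities are illustrative values only and the helper names are invented for this sketch.

```python
import math


def logit(p):
    """Log-odds of an event with probability p (0 < p < 1)."""
    return math.log(p / (1.0 - p))


def survival_coefficient(death_coefficient):
    """Flip the sign of a coefficient that was fitted on the log-odds of death."""
    return -death_coefficient


# Check the identity logit(1 - p) = -logit(p) at a few illustrative probabilities.
for p_survival in (0.12, 0.33, 0.94):
    p_death = 1.0 - p_survival
    print(p_survival, round(logit(p_survival), 3), round(logit(p_death), 3))
```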
None of the female coefficients is statistically significantly different
from zero. This implies that there is no evidence of a relationship
between age and survival for females in any of the three classes. There
is strong evidence of an effect of age for males in all three classes.
The coefficients are negative, which implies that as age increases, the
log-odds of survival (and hence the probability of survival) decrease.
The effect of age appears to be strongest for the second class males, as
their coefficient has the largest magnitude, while the effect of age in
the first and third class male passengers is about equal.
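As a rough check on the p-values in the table, each coefficient can be divided by its se and compared to a standard normal distribution (a Wald test). The sketch below does this with the tabled values. It is an approximation under that normal assumption; JMP may compute its p-values differently (for example, from a likelihood-ratio chi-square), so small discrepancies from the table are to be expected.

```python
import math

# (class, sex): (coefficient of age for SURVIVAL, se), copied from the table above.
fits = {
    ("1st", "Males"): (-0.054, 0.015),
    ("1st", "Females"): (0.012, 0.031),
    ("2nd", "Males"): (-0.143, 0.032),
    ("2nd", "Females"): (-0.030, 0.028),
    ("3rd", "Males"): (-0.051, 0.020),
    ("3rd", "Females"): (0.0007, 0.016),
}


def wald_p_value(coef, se):
    """Two-sided p-value for H0: coefficient = 0, using z = coef / se."""
    z = coef / se
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


for (passenger_class, sex), (coef, se) in fits.items():
    print(passenger_class, sex, "z =", round(coef / se, 2),
          "p =", round(wald_p_value(coef, se), 4))
```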
A plot of the survival curves on both the ordinary and logit scale
appears below:
Notice that the lines for females are almost flat (on the logit scale)
with little change in survival by age, while the lines for males are
very steep.
If you compute the Range Odds Ratio – the change in the odds of survival
as you go from the smallest to the largest age for each sex-class
combination – you find that the range of odds of survival is quite large
for males but very small for females.
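A minimal sketch of this calculation follows, assuming the Range Odds Ratio is exp(coefficient × (largest age − smallest age)), which is the usual form for a logistic regression with one continuous predictor. The ages of 0 and 60 are hypothetical placeholders, not the actual smallest and largest ages in each sex-class group; substitute the real values from the data.

```python
import math


def range_odds_ratio(coef, min_age, max_age):
    """Odds of survival at max_age relative to the odds at min_age."""
    return math.exp(coef * (max_age - min_age))


# HYPOTHETICAL age range, for illustration only.
min_age, max_age = 0, 60

# Coefficients of age (for survival) for second class passengers, from the table.
print("2nd class males:  ", range_odds_ratio(-0.143, min_age, max_age))
print("2nd class females:", range_odds_ratio(-0.030, min_age, max_age))
```

A value far from 1 means the odds of survival change a great deal over the age range; a value near 1 means age makes little difference.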
So yes, it appears that the adage should read “women and young males
first”.
In more advanced classes (e.g. Stat-302 or Stat-402), you will learn
how to fit one model for the combined data over all sexes and
classes of passage, and look at the effect of age upon survival after
adjusting for the sex and class of passage.
2. Never underestimate the p-o-w-e-r of the Orange side
Many people find it annoying when a cell phone goes off at the exact
climax of a film (see footnote 2).
When I was visiting England in September 2005, I happened to go to a
movie and noticed a series of ads that played before the movie started
asking patrons to turn off their cell phones. The premise of these
advertisements is that various celebrities pitch films they would like to
produce to the Orange Film Funding Board, a fictitious agency. The ads
were sponsored by the Orange Cell Phone company, one of the largest
mobile phone companies in the United Kingdom (see footnote 3).
2 See http://www.cnn.com/2005/TECH/10/17/wireless.manners/index.html or
http://www.boundless.org/2005/articles/a0001207.cfm or
http://www.mobiledia.com/news/41645.html.
You can view some of the advertisements at (don’t forget to press the Play
button beneath each ad):
(a) http://www.visit4info.com/details.cfm?adid=22035 - my favorite
(b) http://www.visit4info.com/details.cfm?adid=20298
(c) http://www.visit4info.com/details.cfm?adid=24647 - my second favorite
(d) http://www.visit4info.com/details.cfm?adid=24648
These advertisements have made it into Wikipedia at
http://en.wikipedia.org/wiki/Orange_UK.
But do these commercials actually work?
(a) Describe how you would perform an experiment as a completely randomized design. The four ads are to be compared (with a control of
no ads). There are 10 screens, five showings per day (morning, early
afternoon, late afternoon, early evening, and late evening identified
by the numbers 1 to 5), seven days per week (1=Sunday, 2=Monday,
etc), and a 4 week test period.
Solution: There are a total of 10 x 5 x 7 x 4 = 1400 possible
showings. The five treatments (the 4 ads plus a control) should be
assigned at random to the showings, and the number of cell phones
that ring during each showing should be recorded.
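One way to carry out such a randomization is sketched below in Python. It assumes a balanced allocation (280 showings per treatment), which the solution above does not require; complete randomization only needs the assignment to be random. The treatment codes are those used in the data file, and the seed is arbitrary and fixed only so that the sketch is reproducible.

```python
import itertools
import random
from collections import Counter

treatments = ["None", "dh", "dv", "jc", "ss"]  # control plus the four ads

# Every experimental unit is one showing: (week, day, showing, screen).
weeks = range(1, 5)      # 4 week test period
days = range(1, 8)       # 1=Sunday, ..., 7=Saturday
showings = range(1, 6)   # 1=morning, ..., 5=late evening
screens = range(1, 11)   # 10 screens
units = list(itertools.product(weeks, days, showings, screens))

# ASSUMPTION: a balanced design, so each treatment label appears equally often.
labels = treatments * (len(units) // len(treatments))

rng = random.Random(285)  # arbitrary fixed seed
rng.shuffle(labels)       # a random permutation of the labels over the units

assignment = dict(zip(units, labels))
print(len(assignment), Counter(assignment.values()))
```

Because the labels are shuffled over all 1400 showings at once, with no restriction by screen, day, or time of day, this is a completely randomized design rather than a blocked one.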
You can download some data from
http://www.stat.sfu.ca/~cschwarz/Stat-285/Assignments/cellphone.txt.
The variables in the dataset are
the week, day, showing, screen, ad used, number of tickets sold, and the
number of cell phones that went off.
Convert the number of cell phones that went off to a simple yes/no variable.
(b) Test the hypothesis that the probability of a cell phone interruption
is the same for all ads (including the control).
Solution: This can be done with the Analyze->Fit Y-by-X platform or the Analyze->Fit Model platform or the Generalized Linear
Model Platform:
3 More details at http://www.orange.com/
In all cases, the p-value is < .0001 and so there is very strong evidence
that the probability of being interrupted by a cell phone is not equal
across all the treatment levels. Of course, at this stage, we don’t
know which treatment is best or worst.
(c) Estimate the probability of a cell phone interrupting the movie for
each ad and complete the following table:
Ad      Estimate   se      95% ci
None
dh
dv
jc
ss
Solution: These probabilities were estimated using the Analyze->Fit
Model platform and the Generalized Linear Modeling option.
Notice that the models estimate the probability of NO interruption,
so each estimate must be subtracted from 1 to get the probability of an
interruption. The se were approximated by taking the width of the 95%
confidence interval and dividing by 4 (a 95% interval extends roughly
2 se on either side of the estimate).
Ad      Estimate   se      95% ci
None    .20        .025    (.16 → .25)
dh      .14        .025    (.10 → .20)
dv      .03        .01     (.01 → .05)
jc      .06        .02     (.03 → .10)
ss      .05        .01     (.03 → .08)
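The two conversions used to fill in this table can be sketched as follows: turning an interval for the probability of NO interruption into one for the probability of an interruption, and approximating the se as one quarter of the interval width. The intervals are copied from the table; the function names are invented for this sketch, and the se values are rough because the intervals are rounded and not symmetric about the estimates.

```python
# 95% confidence intervals for P(interruption), copied from the table above.
intervals = {
    "None": (0.16, 0.25),
    "dh": (0.10, 0.20),
    "dv": (0.01, 0.05),
    "jc": (0.03, 0.10),
    "ss": (0.03, 0.08),
}


def complement_interval(lower, upper):
    """Turn an interval for P(no interruption) into one for P(interruption).

    Subtracting from 1 reverses the order, so the end points swap roles.
    """
    return (1.0 - upper, 1.0 - lower)


def se_from_ci(lower, upper):
    """Approximate se: a 95% interval is about 4 se wide (estimate +/- 2 se)."""
    return (upper - lower) / 4.0


for ad, (lower, upper) in intervals.items():
    print(ad, "approx se =", round(se_from_ci(lower, upper), 3))
```

For example, an interval of (.75 → .84) for NO interruption corresponds to (.16 → .25) for an interruption, which is the control row of the table.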
(d) Draw a suitable graph (possibly by hand) showing the results from
the previous table. What does this graph show? Which ad seems to
be the most effective?
Solution: I used the graphing feature of JMP to create the following
plot:
The probability of interruption appears to be smallest for the dv,
jc, and ss commercials, followed by the dh ad, followed by the control
screenings. The ads seem to work, but there appear to be some
minor differences among the ads.
(e) Estimate the difference in the log-odds between cases with no ads
and the Darth Vader ad, along with a se and an approximate
95% confidence interval. Convert this to an odds ratio along with
a 95% confidence interval. Interpret this odds-ratio. What do you
conclude?
Solution: Use the Contrast option of the Generalized Linear Model
platform:
Again, be careful because JMP is measuring the log-odds of NO
interruptions, and we want the log-odds of interruptions.
The estimated difference in log-odds of interruptions between the
control and dv ads is 2.17 (se .39). The approximate 95% confidence
interval is (1.39 → 2.95). Because this difference in log-odds is
positive, the odds of an interruption in the control setting are
HIGHER than the odds with the dv ad.
The estimated odds-ratio is found as exp(2.17) = 8.8, with an approximate
95% confidence interval from exp(1.39) = 4 to exp(2.95) = 19. This implies
that the odds of an interruption are about 9 times higher in the
control showings than when the dv ad is shown.
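The arithmetic in this conversion can be checked with a short Python sketch. It assumes the estimate and se quoted above (2.17 and .39, already on the scale of interruptions rather than NO interruptions) and uses the approximate interval of estimate ± 2 se; the variable names are illustrative.

```python
import math

# Difference in log-odds of an interruption, control minus dv ad (from above).
diff_log_odds = 2.17
se = 0.39

# Approximate 95% confidence interval on the log-odds scale: estimate +/- 2 se.
lower = diff_log_odds - 2 * se
upper = diff_log_odds + 2 * se

# Exponentiate the estimate and both end points to move to the odds-ratio scale.
odds_ratio = math.exp(diff_log_odds)
odds_ratio_ci = (math.exp(lower), math.exp(upper))

print("log-odds difference:", diff_log_odds, "ci:", (round(lower, 2), round(upper, 2)))
print("odds ratio:", round(odds_ratio, 1),
      "ci:", tuple(round(x, 1) for x in odds_ratio_ci))
```

The interval is built on the log-odds scale and only then exponentiated, which is why the interval for the odds ratio is not symmetric about 8.8.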
Truly, the Phone is Strong Here.
In more advanced classes (e.g. Stat-302 and Stat-402) you will learn how
to use the actual number of cell phone calls as the response variable and
how to adjust it for the number of tickets sold for that showing.
Common errors made on this assignment – check your work!
• Many students just attached all output and did not provide the table and
conclusions.
There are NO jobs for people who just bash numbers through a statistical
package and provide "computer diarrhea" as a report! It is vitally important that you understand what output is produced and that you are able
to write a coherent report. In many cases, output is badly labelled and
the results are not obvious.
• In the experimental design, some students did not consider the control
group (no ad).
• Some students just stated the null hypothesis.
• Many students did not notice that the models estimate the probability of
No interruption.