Count data - a homework of titanic proportions (due Thurs. Jan. 28)

Homework
We have seen that splits in decision trees are related to Chi-square tests, so naturally we want to be
comfortable with these. Under our demos link from our course home page, you will find a SAS program
Titanic.sas. It contains counts of alive and dead from the famous Titanic shipwreck along with the URL
from which the data came (there are several reports available on survival data, not all of which exactly
match in their counts).
For your writeup, suppose you were working with the White Star Line shortly after the Titanic sank.
Perhaps some initial analyses were run by a consultant and your boss, an intelligent person with some
statistical knowledge, wanted you to check out the results and explain a few concerns he had (the
questions below). He asked you to write up your thoughts. He is particularly concerned with the
potential for being sued for class prejudice in terms of who got saved and who didn’t. My main point on
this and other homeworks is to work on your exposition as well as your analysis. It is fine to organize
your report by question, but please use complete sentences and make it interesting to read.
(1) Look over the program and describe the two forms of data sets that it creates. Also describe what
the variable X represents in that second data set and what that first.status notation does.
(2) Record the correlation r between X and the 0-1 survival variable and give its p-value. Square the
correlation and multiply by the total number of passengers. Record your result (you will see why later).
(3) Compute the usual (Pearson) Chi-square test, using PROC FREQ on either TITANIC data set, for
testing the null hypothesis of independence between class status (crew, first class, etc.) and fate. Remind
your readers what a p-value is and what it is used for. What is the p-value for the usual (Pearson)
Chi-square test? What is the logworth of the 4-way split implied by this table? For a table like this, the
degrees of freedom are (#rows-1)(#cols-1), where #rows is the number of rows, etc.
(4) We see one p-value that is way out of line with the others and it is based on different degrees of
freedom. Why? (Hint: compare the test statistic to your correlation result in (2).) Does that unusual p-value
indicate that there was class bias in getting people into the lifeboats or does it not? Perhaps the
company can use this to defend against accusations of prejudicial saving of lives based on status (??)
(5) Does the order listed for the status categories in the Chi-Square table affect any of the p-values? If
so, is the order listed the most sensible one? If not, take the appropriate action (I should not have to tell
you this!). Which, if any, of the Chi-squares and p-values changed? Why? In describing what you’ve
learned here in the report, you might distinguish linear association from general association.
(6) Comparing parts 2 and 4, why was it important to run that PROC SORT before creating the second
data set and its X variable?


Note: Within SAS you can find help on any procedure (like PROC FREQ) by clicking on the
purple book with a yellow “?” on it (menu bar at top, rightmost icon). You can click on the
contents tab at the top of the help window, then SAS PRODUCTS, then SAS/STAT, then PROC
FREQ, then SYNTAX, for example.
If you are unfamiliar with SAS or chi-square calculations, you can fill in these blanks to do
question 3:
proc freq data=titanic;
  tables _________/chisq;
  weight ________;
run;
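If the degrees of freedom and logworth arithmetic in question 3 is unfamiliar, the following is a minimal sketch of it, not the answer. It assumes logworth is defined as -log10 of the p-value (check this against the definition we used in class). The Chi-square value of 20 is made up for illustration and is not taken from the Titanic output; substitute the statistic that PROC FREQ reports.

```sas
/* Illustration only: chisq below is a made-up value, not a Titanic result. */
data _null_;
  nrows = 4;                        /* number of status categories          */
  ncols = 2;                        /* number of fate categories            */
  df = (nrows - 1) * (ncols - 1);   /* (#rows-1)(#cols-1) = 3 for 4 by 2    */
  chisq = 20;                       /* hypothetical Pearson Chi-square      */
  p = 1 - probchi(chisq, df);       /* upper-tail probability = p-value     */
  logworth = -log10(p);             /* assumed definition: -log10(p-value)  */
  put df= p= logworth=;             /* writes the three values to the log   */
run;
```

Very small p-values can underflow when computed as 1 minus a probability, so for extreme statistics it may be better to take the p-value that PROC FREQ prints rather than recomputing it this way.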