11/12/09

More types of inference for nominal variables

Nominal data is categorical with more than two categories.

Chi-square test (FPP 28)
  Compare observed frequencies of a nominal variable to hypothesized probabilities: chi-squared goodness of fit test.
  Test if two nominal variables are independent: chi-squared test of independence.

Goodness of fit test

Do people admit themselves to hospitals more frequently close to their birthday?
Data from a random sample of 200 people admitted to hospitals:

  Days from birthday    Number of admissions
  within 7              11
  8-30                  24
  31-90                 69
  91+                   96

Assume there is no birthday effect, that is, people admit randomly. Then,

  Pr(within 7) = .0411
  Pr(8-30)     = .1260
  Pr(31-90)    = .3288
  Pr(91+)      = .5041

So, in a sample of 200 people, we'd expect

  200 * .0411 = 8.22 to be in "within 7"
  200 * .1260 = 25.20 to be in "8-30"
  200 * .3288 = 65.76 to be in "31-90"
  200 * .5041 = 100.82 to be in "91+"

Goodness of fit test

If admissions are random, we expect the sample frequencies and hypothesized probabilities to be similar.
But, as always, the sample frequencies are affected by chance error.
So, we need to see whether the sample frequencies could have been a plausible result from chance error if the hypothesized probabilities are true.

Goodness of fit test: Hypothesis

Claim (alternative hyp.) is that admission probabilities depend on the days since birthday.
Opposite of claim (null hyp.) is that probabilities are in accordance with random admissions.

  H0 : Pr(within 7) = .0411
       Pr(8-30)     = .1260
       Pr(31-90)    = .3288
       Pr(91+)      = .5041
  HA : probabilities different than those in H0.
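The expected counts under "no birthday effect" can be reproduced with a few lines of Python. This is only a sketch: the day counts per category (15, 46, 120, 184) are an assumption, not stated on the slides, chosen because they reproduce the hypothesized probabilities in a 365-day year.

```python
# Observed admissions from the slide, keyed by days from birthday.
observed = {"within 7": 11, "8-30": 24, "31-90": 69, "91+": 96}

# ASSUMPTION (not stated in the slides): calendar days per category in a
# 365-day year, counting days on both sides of the birthday.
# "within 7" = the birthday itself plus 7 days before and 7 days after.
days_in_category = {"within 7": 15, "8-30": 46, "31-90": 120, "91+": 184}

n = sum(observed.values())  # sample size

# Probability of each category if admissions ignore birthdays.
null_probs = {cell: days / 365 for cell, days in days_in_category.items()}

# Expected count in each category under the null hypothesis.
expected = {cell: n * p for cell, p in null_probs.items()}

for cell in observed:
    print(f"{cell:>8}: Pr = {null_probs[cell]:.4f}, "
          f"expected = {expected[cell]:.2f}, observed = {observed[cell]}")
```

The printed probabilities can be compared with the values listed in H0, and the expected counts with the observed admissions.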
Let's build a hypothesis test.

Goodness of fit test: Test statistic

Chi-squared test statistic:

$$X^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$$

  Cell     Obs    Exp       Dif      Dif^2    Dif^2/Exp
  In 7     11     8.22      2.78     7.73     .94
  8-30     24     25.20     -1.20    1.44     .057
  31-90    69     65.76     3.24     10.50    .16
  91+      96     100.82    -4.82    23.23    .23

$$X^2 = .94 + .057 + .16 + .23 \approx 1.39$$

Goodness of fit test: Calculate p-value

X^2 has a chi-squared distribution with degrees of freedom equal to the number of categories minus 1. In this case, df = 4 - 1 = 3.
To get a p-value, calculate the area under the chi-squared curve to the right of 1.39.
Using JMP, this area is about 0.70.
If the null hypothesis is true, there is a 70% chance of observing a value of X^2 as or more extreme than 1.39.
Using the table, the p-value is between 0.70 and 0.90.

[Figure: chi-squared table]
[Figure: JMP output for admissions]

Goodness of fit test: Judging p-value

The .70 is a large p-value, indicating the data could well occur by random chance when the null hypothesis is true.
Therefore, we cannot reject the null hypothesis. There is not enough evidence to conclude that admission rates depend on time from birthday.

Independence test

Is birth order related to delinquency?
Nye (1958) randomly sampled 1154 high school girls and asked if they had been "delinquent".

  Birth order    Delinquent    Not delinquent
  Eldest         24            450
  In Between     29            312
  Youngest       35            211
  Only           23            70

Sample of conditional frequencies

Proportion delinquent for each birth order status:

  Oldest      .05
  Middle      .085
  Youngest    .14
  Only        .25

Based on conditional frequencies, it appears that youngest and only children are more delinquent.
Could these sample frequencies have plausibly occurred by chance if there is no relationship between birth order and delinquency?

Test of independence: Hypotheses

Claim is that there is some relationship between birth order and delinquency.
Opposite is that there is no relationship.

  H0 : birth order and delinquency are independent.
  HA : birth order and delinquency are dependent.
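Returning to the hospital admissions example, the goodness of fit calculation can be checked by hand in Python. This is a minimal sketch using only the standard library; `chi2_sf_df3` is a hypothetical helper name, and it relies on a closed form for the upper tail area that holds only when df = 3, which happens to be the case here.

```python
import math

# Observed admissions and hypothesized probabilities from the slides.
observed = [11, 24, 69, 96]                    # within 7, 8-30, 31-90, 91+
null_probs = [0.0411, 0.1260, 0.3288, 0.5041]

n = sum(observed)
expected = [n * p for p in null_probs]

# Chi-squared statistic: sum of (observed - expected)^2 / expected.
terms = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
x2 = sum(terms)
df = len(observed) - 1  # number of categories minus 1


def chi2_sf_df3(x):
    """Area to the right of x under the chi-squared curve, for df = 3 ONLY."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


p_value = chi2_sf_df3(x2)

print("terms:", [round(t, 3) for t in terms])
print(f"X^2 = {x2:.3f}, df = {df}, p-value = {p_value:.3f}")
```

For other degrees of freedom a statistics library or a chi-squared table would be needed instead of this helper.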
Implications of independence: Expected counts

Under independence, Pr(oldest and delinquent) = Pr(oldest)*Pr(delinquent).
Estimate Pr(oldest) as the marginal frequency of oldest: 474/1154.
Estimate Pr(delinquent) as the marginal frequency of delinquent: 111/1154.
Hence, estimate Pr(oldest and delinquent) as (474/1154)*(111/1154).
The expected number of oldest and delinquent, under independence, equals 1154*(474/1154)*(111/1154) = 45.59.
This is repeated for all the other cells in the table.

Test of independence: Expected counts

  Birth order    Delinquent    Not delinquent
  Oldest         45.59         428.41
  In Between     32.80         308.20
  Youngest       23.66         222.34
  Only           8.95          84.05

Next we compare the observed counts with the expected counts to get a test statistic.

Test of independence: Test statistic

Use the X^2 statistic as the test statistic:

$$X^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}} = 42.245$$

Test of independence: Calculate the p-value

X^2 has a chi-squared distribution with degrees of freedom:
df = (number of rows - 1) * (number of columns - 1)
In the delinquency problem, df = (4 - 1) * (2 - 1) = 3.
The area under the chi-squared curve to the right of 42.245 is less than .0001.
There is only a very small chance of getting an X^2 as or more extreme than 42.245.

[Figure: JMP output for chi-squared test]

This is a small p-value. It is unlikely we'd observe data like this if the null hypothesis is true.
There does appear to be an association between delinquency and birth order.

Chi-squared test details

Requires simple random samples.
Works best when expected frequencies in each cell are at least 5.
Should not have zero counts.
How one specifies categories can affect results.

Chi-squared test items

What do I do when expected counts are less than 5?
Try to get more data. Barring that, you can collapse categories.
Example: Is baldness related to heart disease? (see JMP for data set)

  Baldness    Disease    Number of people
  None        Yes        251
  None        No         331
  Little      Yes        165
  Little      No         221
  Some        Yes        195
  Some        No         185
  Much        Yes        50
  Much        No         34
  Extreme     Yes        2
  Extreme     No         1

Combine "extreme" and "much" categories:

  Baldness           Disease    Number of people
  Much or extreme    Yes        52
  Much or extreme    No         35

This changes the question slightly, since we have a new category.
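Returning to the birth order example, the expected counts and the X^2 statistic for the test of independence can be reproduced as follows. This is a sketch with the standard library only; `chi2_sf_df3` is a hypothetical helper whose closed form is valid only for df = 3, which matches this 4 x 2 table.

```python
import math

# Observed counts from Nye (1958): (delinquent, not delinquent) per birth order.
table = {
    "Oldest": (24, 450),
    "In Between": (29, 312),
    "Youngest": (35, 211),
    "Only": (23, 70),
}

n = sum(sum(row) for row in table.values())
col_totals = [sum(row[j] for row in table.values()) for j in range(2)]

# Under independence, expected count = row total * column total / n.
expected = {
    name: tuple(sum(row) * col_totals[j] / n for j in range(2))
    for name, row in table.items()
}

x2 = sum(
    (obs - exp) ** 2 / exp
    for name, row in table.items()
    for obs, exp in zip(row, expected[name])
)
df = (len(table) - 1) * (2 - 1)  # (rows - 1) * (columns - 1)


def chi2_sf_df3(x):
    """Area to the right of x under the chi-squared curve, for df = 3 ONLY."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


p_value = chi2_sf_df3(x2)

for name, (e_yes, e_no) in expected.items():
    print(f"{name:>10}: expected delinquent = {e_yes:.2f}, not delinquent = {e_no:.2f}")
print(f"X^2 = {x2:.3f}, df = {df}, p-value = {p_value:.2e}")
```

The expected counts printed here can be compared cell by cell with the expected counts table for birth order.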
Chi-squared test for collapsed data for baldness example

Based on p-value, baldness and heart disease are not independent.
We see that increasing baldness is associated with increased incidence of heart disease.
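The JMP output for the collapsed table is not reproduced in the text, so the following sketch recomputes the test from the counts given in the baldness example. It first shows why collapsing is needed (the "Extreme" row has expected counts below 5), then merges "Much" and "Extreme" and runs the test of independence. The helper names `expected_counts` and `chi2_sf_df3` are hypothetical, and the tail-area formula applies only because the collapsed table has df = 3.

```python
import math

# Observed counts: (heart disease yes, heart disease no) per baldness level.
full = {
    "None": (251, 331),
    "Little": (165, 221),
    "Some": (195, 185),
    "Much": (50, 34),
    "Extreme": (2, 1),
}


def expected_counts(table):
    """Expected counts under independence: row total * column total / n."""
    n = sum(sum(row) for row in table.values())
    cols = [sum(row[j] for row in table.values()) for j in range(2)]
    return {k: tuple(sum(row) * cols[j] / n for j in range(2))
            for k, row in table.items()}


def chi2_sf_df3(x):
    """Area to the right of x under the chi-squared curve, for df = 3 ONLY."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


# Smallest expected count in the full table (the "Extreme" row is the problem).
min_expected_full = min(min(e) for e in expected_counts(full).values())

# Collapse "Much" and "Extreme" into a single category.
collapsed = {k: v for k, v in full.items() if k not in ("Much", "Extreme")}
collapsed["Much or extreme"] = tuple(
    m + x for m, x in zip(full["Much"], full["Extreme"])
)

exp = expected_counts(collapsed)
x2 = sum((o - e) ** 2 / e
         for k, row in collapsed.items()
         for o, e in zip(row, exp[k]))
df = (len(collapsed) - 1) * (2 - 1)
p_value = chi2_sf_df3(x2)

print(f"smallest expected count before collapsing: {min_expected_full:.2f}")
print(f"collapsed table: X^2 = {x2:.2f}, df = {df}, p-value = {p_value:.4f}")
```

A small p-value from this calculation is what supports rejecting independence between baldness and heart disease.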