11/12/09 Chi-square test More types of inference for nominal variables

Chi-square test
FPP 28

More types of inference for nominal variables
• Nominal data is categorical with more than two categories
• Compare observed frequencies of a nominal variable to hypothesized probabilities: chi-squared goodness of fit test
• Test if two nominal variables are independent: chi-squared test of independence

Goodness of fit test
• Do people admit themselves to hospitals more frequently close to their birthday?
• Data from a random sample of 200 people admitted to hospitals

Days from birthday    Number of admissions
within 7              11
8-30                  24
31-90                 69
91+                   96

• Assume there is no birthday effect, that is, people admit randomly. Then,
    Pr (within 7) = .0411
    Pr (8 - 30)   = .1260
    Pr (31 - 90)  = .3288
    Pr (91+)      = .5041
• So, in a sample of 200 people, we’d expect
    200 × .0411 = 8.22 to be in “within 7”
    200 × .1260 = 25.2 to be in “8 - 30”
    200 × .3288 = 65.76 to be in “31 - 90”
    200 × .5041 = 100.82 to be in “91+”
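
The hypothesized probabilities and expected counts above can be reproduced with a short calculation. The sketch below assumes the probabilities come from counting days in a 365-day year (15 days within a week of the birthday, 46 days at a distance of 8-30, 120 at 31-90, and the remaining 184). That day count is an inference that happens to reproduce the slide's figures, not something the source states, and the variable names are illustrative.

```python
# Days falling in each window around a birthday, out of 365.
# ASSUMPTION: these day counts are inferred; the slide prints only the probabilities.
days_in_window = {"within 7": 15, "8-30": 46, "31-90": 120, "91+": 184}

# Probabilities as printed on the slide (rounded to 4 decimals).
slide_probs = {"within 7": 0.0411, "8-30": 0.1260, "31-90": 0.3288, "91+": 0.5041}

n = 200  # sample size from the slide

# Expected count in each cell under random admission: n * probability.
expected = {cell: n * p for cell, p in slide_probs.items()}

for cell, days in days_in_window.items():
    print(f"{cell:>9}: {days}/365 = {days / 365:.4f}, expected = {expected[cell]:.2f}")
```

The expected counts need not be whole numbers; they are long-run averages, not counts anyone would actually observe.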

Goodness of fit test
• If admissions are random, we expect the sample frequencies and hypothesized probabilities to be similar
• But, as always, the sample frequencies are affected by chance error
• So, we need to see whether the sample frequencies could have been a plausible result from chance error if the hypothesized probabilities are true.
• Let’s build a hypothesis test

Goodness of fit test: Hypothesis
• Claim (alternative hyp.) is that admission probabilities depend on the days since birthday
• Opposite of claim (null hyp.) is that the probabilities are in accordance with random admissions.
• H0 : Pr (within 7) = .0411
       Pr (8 - 30)   = .1260
       Pr (31 - 90)  = .3288
       Pr (91+)      = .5041
• HA : at least one probability differs from those in H0.

Goodness of fit test: Test statistic
• Chi-squared test statistic

$X^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$

Cell     Obs    Exp       Dif      Dif^2    Dif^2/Exp
In 7     11     8.22      2.78     7.73     .94
8-30     24     25.20    -1.20     1.44     .057
31-90    69     65.76     3.24    10.50     .16
91+      96     100.82   -4.82    23.23     .23

$X^2 = .94 + .057 + .16 + .23 = 1.387$

Goodness of fit test: Calculate p-value
• $X^2$ has approximately a chi-squared distribution with degrees of freedom equal to the number of categories minus 1.
• In this case, df = 4 – 1 = 3.
• To get a p-value, calculate the area under the chi-squared curve to the right of 1.387
• Using JMP, this area is 0.703. If the null hypothesis is true, there is a 70% chance of observing a value of $X^2$ as or more extreme than 1.387
• Using the table, the p-value is between 0.70 and 0.90

[Figure: chi-squared table]
[Figure: JMP output for admissions]
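
As a check on the arithmetic above, the statistic and its p-value can be recomputed from the observed counts and the hypothesized probabilities. This is a minimal sketch using only the Python standard library. `chi2_sf` is a hypothetical helper written for this example (a closed-form upper tail for integer degrees of freedom); it is not part of JMP or of the course materials. Recomputed this way, the statistic comes out near 1.39 and the tail area near 0.71, in the same range as the JMP figure.

```python
import math


def chi2_sf(x, df):
    """Area to the right of x under a chi-squared curve with integer df.

    Hypothetical helper for illustration: closed-form series for the
    upper tail (regularized upper incomplete gamma at half-integers).
    """
    if x <= 0:
        return 1.0
    h = x / 2.0
    if df % 2 == 0:
        # Even df = 2m: exp(-h) * sum_{j=0}^{m-1} h^j / j!
        return math.exp(-h) * sum(h ** j / math.factorial(j) for j in range(df // 2))
    # Odd df = 2m + 1:
    # erfc(sqrt(h)) + exp(-h) * sum_{j=1}^{m} h^(j - 1/2) / Gamma(j + 1/2)
    m = (df - 1) // 2
    series = sum(h ** (j - 0.5) / math.gamma(j + 0.5) for j in range(1, m + 1))
    return math.erfc(math.sqrt(h)) + math.exp(-h) * series


observed = [11, 24, 69, 96]                   # admissions per cell
probs = [0.0411, 0.1260, 0.3288, 0.5041]      # hypothesized probabilities (H0)

n = sum(observed)
expected = [n * p for p in probs]

# Each cell contributes (observed - expected)^2 / expected.
contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
x2 = sum(contributions)

df = len(observed) - 1                        # categories minus 1
p_value = chi2_sf(x2, df)

print(f"X^2 = {x2:.3f}, df = {df}, p-value = {p_value:.3f}")
```

Where SciPy is installed, `scipy.stats.chi2.sf(x2, df)` returns the same tail area without a hand-written helper.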

Goodness of fit test: Judging p-value
• The .70 is a large p-value, indicating the data could well occur by random chance when the null hypothesis is true. Therefore, we cannot reject the null hypothesis. There is not enough evidence to conclude that admission rates depend on time from birthday.

Independence test
• Is birth order related to delinquency?
• Nye (1958) randomly sampled 1154 high school girls and asked if they had been “delinquent”.

Birth order    Delinquent    Not delinquent
Eldest         24            450
In Between     29            312
Youngest       35            211
Only           23            70

Sample of conditional frequencies
• Proportion delinquent for each birth order status

Birth order    Proportion delinquent
Oldest         .05
Middle         .085
Youngest       .14
Only           .25

• Based on conditional frequencies, it appears that youngest children are more often delinquent than oldest or middle children
• Could these sample frequencies have plausibly occurred by chance if there is no relationship between birth order and delinquency?

Test of independence
• Hypotheses
• Claim is that there is some relationship between birth order and delinquency.
• Opposite is that there is no relationship.
• H0 : birth order and delinquency are independent.
• HA : birth order and delinquency are dependent.
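
The conditional frequencies that motivate these hypotheses can be recomputed from the Nye (1958) counts. A minimal sketch follows; it assumes the first count in each pair is the delinquent group, a reading the slide's column layout does not state outright but which matches the proportions quoted on the slides.

```python
# Observed counts per birth order: (delinquent, not delinquent).
# ASSUMPTION: column order is inferred from the proportions quoted on the slides.
counts = {
    "Eldest": (24, 450),
    "In Between": (29, 312),
    "Youngest": (35, 211),
    "Only": (23, 70),
}

total = sum(yes + no for yes, no in counts.values())

# Proportion delinquent within each birth-order group (row-conditional frequency).
rates = {order: yes / (yes + no) for order, (yes, no) in counts.items()}

print(f"sample size: {total}")
for order, rate in rates.items():
    print(f"{order:<10} {rate:.3f}")
```

These are descriptive summaries only; whether the differences exceed what chance alone would produce is the question the test of independence answers.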

Implications of independence
• Expected counts
• Under independence,
    Pr(oldest and delinquent) = Pr(oldest) * Pr(delinquent)
• Estimate Pr(oldest) as marginal frequency of oldest: 474/1154 = .411
• Estimate Pr(delinquent) as marginal frequency of delinquent: 111/1154 = .0962
• Hence, estimate Pr(oldest and delinquent) as (474/1154) * (111/1154) = .0395
• The expected number of oldest and delinquent, under independence, equals 1154 * (474/1154) * (111/1154) = 474 * 111 / 1154 = 45.59
• This is repeated for all the other cells in the table

Test of independence
• Expected counts

Birth order    Delinquent    Not delinquent
Oldest         45.59         428.41
In Between     32.80         308.20
Youngest       23.66         222.34
Only           8.95          84.05

• Next we compare the observed counts with the expected to get a test statistic
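
The expected-count rule (row total times column total, divided by the grand total) can be sketched as below. The observed counts are the Nye (1958) figures from the slides; the column order (delinquent first) and the variable names are assumptions made for illustration.

```python
# Observed counts, one row per birth order: [delinquent, not delinquent].
# ASSUMPTION: column order is inferred from the slides, not stated in them.
labels = ["Oldest", "In Between", "Youngest", "Only"]
observed = [
    [24, 450],
    [29, 312],
    [35, 211],
    [23, 70],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Under independence: expected = row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

for label, row in zip(labels, expected):
    print(f"{label:<10} {row[0]:8.2f} {row[1]:8.2f}")
```

Each row of expected counts sums to the observed row total, and each column to the observed column total, so only the way counts are spread across cells changes under the null hypothesis.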

Test of independence
• Use the $X^2$ statistic as the test statistic:

$X^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}} = 42.245$

• Calculate the p-value
• $X^2$ has approximately a chi-squared distribution with degrees of freedom:
    df = (number rows – 1) * (number columns – 1)
• In the delinquency problem,
    df = (4 - 1) * (2 - 1) = 3.
• The area under the chi-squared curve to the right of 42.245 is less than .0001. If the null hypothesis is true, there is only a very small chance of getting an $X^2$ as or more extreme than 42.245.
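
The statistic and tail area quoted above can be checked from the observed table. This is a sketch under stated assumptions: the column order (delinquent first) is inferred, and `chi2_upper_tail_df3` is a hypothetical helper that is valid only for exactly 3 degrees of freedom, which is what this table has.

```python
import math


def chi2_upper_tail_df3(x):
    """Area to the right of x (x > 0) under a chi-squared curve with exactly 3 df.

    Hypothetical helper: closed form that holds only when df == 3.
    """
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


# Observed counts per birth order: [delinquent, not delinquent].
observed = [
    [24, 450],   # Oldest
    [29, 312],   # In Between
    [35, 211],   # Youngest
    [23, 70],    # Only
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Sum (observed - expected)^2 / expected over every cell of the table.
x2 = 0.0
for row, r in zip(observed, row_totals):
    for o, c in zip(row, col_totals):
        e = r * c / grand_total
        x2 += (o - e) ** 2 / e

df = (len(observed) - 1) * (len(col_totals) - 1)
p_value = chi2_upper_tail_df3(x2)

print(f"X^2 = {x2:.3f}, df = {df}, p-value = {p_value:.2e}")
```

Most of the statistic comes from the "Only" and "Oldest" rows, where the observed delinquent counts sit furthest from their expected values.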

JMP output for chi-squared test
• This is a small p-value. It is unlikely we’d observe data like this if the null hypothesis is true. There does appear to be an association between delinquency and birth order.

Chi-squared test details
• Requires simple random samples.
• Works best when expected frequencies in each cell are at least 5.
• Should not have zero counts
• How one specifies categories can affect results.

Chi-squared test items
• What do I do when expected counts are less than 5?
• Try to get more data. Barring that, you can collapse categories.

Example: Is baldness related to heart disease? (see JMP for data set)

Baldness    Disease    Number of people
None        Yes        251
None        No         331
Little      Yes        165
Little      No         221
Some        Yes        195
Some        No         185
Much        Yes        50
Much        No         34
Extreme     Yes        2
Extreme     No         1

Combine “extreme” and “much” categories:

Baldness           Disease    Number of people
Much or extreme    Yes        52
Much or extreme    No         35

This changes the question slightly, since we have a new category.
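
To see why the "extreme" row is a problem, and what collapsing does to it, the expected counts can be computed before and after merging. A minimal sketch follows; `expected_counts` and `collapse` are hypothetical helper names introduced for this example.

```python
# Observed counts per baldness level: (heart disease yes, heart disease no).
table = {
    "None": (251, 331),
    "Little": (165, 221),
    "Some": (195, 185),
    "Much": (50, 34),
    "Extreme": (2, 1),
}


def expected_counts(tbl):
    """Expected counts under independence: row total * column total / grand total."""
    col_totals = [sum(col) for col in zip(*tbl.values())]
    grand_total = sum(col_totals)
    return {
        name: tuple(sum(row) * c / grand_total for c in col_totals)
        for name, row in tbl.items()
    }


def collapse(tbl, keys, new_name):
    """Merge the rows named in `keys` into one row called `new_name`."""
    merged = {name: row for name, row in tbl.items() if name not in keys}
    merged[new_name] = tuple(sum(vals) for vals in zip(*(tbl[k] for k in keys)))
    return merged


before = expected_counts(table)
collapsed = collapse(table, ["Much", "Extreme"], "Much or extreme")
after = expected_counts(collapsed)

print("Expected counts, Extreme row:", [round(e, 2) for e in before["Extreme"]])
print("Collapsed row (observed):", collapsed["Much or extreme"])
print("Smallest expected count after collapsing:",
      round(min(min(row) for row in after.values()), 2))
```

With only three people in the "extreme" row, both of its expected counts fall well below 5; after merging, every cell clears that guideline.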

Chi-squared test for collapsed data for baldness example
• Based on p-value, baldness and heart disease are not independent.
• We see that increasing baldness is associated with increased incidence of heart disease.
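
The JMP output for this slide did not survive, so the sketch below recomputes the test from the collapsed counts and brackets the p-value with the chi-squared table instead of computing it directly. The critical values 11.345 and 16.266 are the standard table entries for 3 degrees of freedom at upper-tail areas .01 and .001; the variable names are illustrative. Recomputed this way, the statistic is roughly 14.5, which puts the p-value between .001 and .01.

```python
# Collapsed counts per baldness level: (heart disease yes, heart disease no).
observed = {
    "None": (251, 331),
    "Little": (165, 221),
    "Some": (195, 185),
    "Much or extreme": (52, 35),
}

rows = list(observed.values())
row_totals = [sum(row) for row in rows]
col_totals = [sum(col) for col in zip(*rows)]
grand_total = sum(row_totals)

# Chi-squared statistic: sum of (observed - expected)^2 / expected over all cells.
x2 = sum(
    (o - r * c / grand_total) ** 2 / (r * c / grand_total)
    for row, r in zip(rows, row_totals)
    for o, c in zip(row, col_totals)
)
df = (len(rows) - 1) * (len(col_totals) - 1)

# Standard chi-squared table entries for df = 3 (upper-tail area -> critical value).
critical_df3 = {0.01: 11.345, 0.001: 16.266}

# Proportion with heart disease at each baldness level.
disease_rate = {name: yes / (yes + no) for name, (yes, no) in observed.items()}

print(f"X^2 = {x2:.2f}, df = {df}")
print("p-value below .01:", x2 > critical_df3[0.01])
for name, rate in disease_rate.items():
    print(f"{name:<16} {rate:.3f}")
```

The per-level proportions make the direction of the association visible: the share with heart disease is highest in the "much or extreme" group and lowest in the "none" and "little" groups.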