Module 10: Chi-Square Tests

Author(s): Brenda Gunderson, Ph.D., 2011
License: Unless otherwise noted, this material is made available under the
terms of the Creative Commons Attribution–Non-commercial–Share
Alike 3.0 License: http://creativecommons.org/licenses/by-nc-sa/3.0/
We have reviewed this material in accordance with U.S. Copyright Law and have tried to maximize your
ability to use, share, and adapt it. The citation key on the following slide provides information about how you
may share and adapt this material.
Copyright holders of content included in this material should contact open.michigan@umich.edu with any
questions, corrections, or clarification regarding the use of content.
For more information about how to cite these materials visit http://open.umich.edu/education/about/terms-of-use.
Any medical information in this material is intended to inform and educate and is not a tool for self-diagnosis
or a replacement for medical evaluation, advice, diagnosis or treatment by a healthcare professional. Please
speak to your physician if you have questions about your medical condition.
Viewer discretion is advised: Some medical content is graphic and may not be suitable for all viewers.
Some material may be sourced from:
Mind on Statistics
Utts/Heckard, 3rd Edition, Duxbury, 2006
Text Only: ISBN 0495667161
Bundled version: ISBN 1111978301
Material from this publication used with permission.
Attribution Key
for more information see: http://open.umich.edu/wiki/AttributionPolicy
Use + Share + Adapt
{ Content the copyright holder, author, or law permits you to use, share and adapt. }
Public Domain – Government: Works that are produced by the U.S. Government. (17 USC §
105)
Public Domain – Expired: Works that are no longer protected due to an expired copyright term.
Public Domain – Self Dedicated: Works that a copyright holder has dedicated to the public domain.
Creative Commons – Zero Waiver
Creative Commons – Attribution License
Creative Commons – Attribution Share Alike License
Creative Commons – Attribution Noncommercial License
Creative Commons – Attribution Noncommercial Share Alike License
GNU – Free Documentation License
Make Your Own Assessment
{ Content Open.Michigan believes can be used, shared, and adapted because it is ineligible for copyright. }
Public Domain – Ineligible: Works that are ineligible for copyright protection in the U.S. (17 USC § 102(b)) *laws in
your jurisdiction may differ
{ Content Open.Michigan has used under a Fair Use determination. }
Fair Use: Use of works that is determined to be Fair consistent with the U.S. Copyright Act. (17 USC § 107) *laws in your
jurisdiction may differ
Our determination DOES NOT mean that all uses of this 3rd-party content are Fair Uses and we DO NOT guarantee that
your use of the content is Fair.
To use this content you should do your own independent analysis to determine whether or not your use will be Fair.
Module 10: Chi-Square Tests
Objectives: In this module you will learn how to perform three Chi-square tests (the test of goodness of
fit, the test of independence, and the test of homogeneity) that are used to analyze categorical
responses.
Overview: There are three Chi-Square tests presented in this module: the tests of goodness of fit,
independence, and homogeneity. For all three tests the data are generally presented in the form of a
contingency table (a rectangular array of numbers in cells). All three tests are based on the Chi-Square
statistic:
$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$
where $O_i$ is the observed count and $E_i$ is the expected count under the corresponding null hypothesis.
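As a minimal sketch of the statistic defined above, the sum can be computed directly from a list of observed counts and a list of expected counts. The function name `chi_square_statistic` and the three-category counts are hypothetical and chosen only for illustration; they are not data from this module.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O_i - E_i)^2 / E_i over all cells."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts for three categories (illustration only).
example_observed = [18, 22, 20]
example_expected = [20.0, 20.0, 20.0]
example_statistic = chi_square_statistic(example_observed, example_expected)
print(round(example_statistic, 3))  # (4 + 4 + 0) / 20 = 0.4
```

Each cell contributes its squared deviation scaled by the expected count, so a cell with a small expected count can dominate the total.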
The goodness of fit test answers the question, “Do the data fit well compared to a specified
distribution?” This test considers one categorical variable and assesses whether the proportions of
sampled observations falling into each category are close enough to the null distribution for the
given problem. For instance, the null distribution might be specified by a manufacturer, a product label,
or the results of a previous study. The null hypothesis for the goodness of fit test specifies this null
distribution, which describes the population proportion of observations in each category.
The test of homogeneity answers the question, “Do two or more populations have the same
distribution for one categorical variable?” This test considers one categorical variable and assesses
whether this variable is distributed the same in two (or more) different populations. The null
hypothesis for the test of homogeneity is that the distribution of the categorical variable is the same for
the two (or more) populations.
The test of independence answers the question, “Are two factors (or variables) independent for a
population under study?” This test considers two categorical variables and assesses whether there is a
relationship between these two variables for a single population. The null hypothesis for the test of
independence is that the two categorical variables are independent (that is, they are not related) for
the population of interest.
There are also a few properties of the Chi-square distribution that you might find useful. The expected
value of a Chi-square distribution is its degrees of freedom (mean = $\mu = df$), and its variance is 2 times
its degrees of freedom. Thus, its standard deviation is the square root of 2 times the degrees of
freedom ($\sigma^2 = 2 \cdot df$, so $\sigma = \sqrt{2 \cdot df}$). This frame of reference can help us assess whether our observed
statistic is unusual under the null hypothesis or somewhat consistent with the null hypothesis.
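The frame of reference described above can be sketched in a few lines. The helper names `chi_square_mean_sd` and `sds_from_mean` are hypothetical, and df = 5 is used only as an example value. Because the Chi-square distribution is right-skewed, the number of standard deviations from the mean is a rough guide, not a substitute for the p-value.

```python
import math


def chi_square_mean_sd(df):
    """Return (mean, standard deviation) of a chi-square distribution with df degrees of freedom."""
    return df, math.sqrt(2 * df)


def sds_from_mean(statistic, df):
    """How many standard deviations the observed statistic lies from its expected value under H0."""
    mean, sd = chi_square_mean_sd(df)
    return (statistic - mean) / sd


mean, sd = chi_square_mean_sd(5)   # example value of df
print(mean, round(sd, 3))          # 5 3.162
```

A statistic near or below the mean is consistent with the null hypothesis; a statistic several standard deviations above the mean is unusual under it.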
For details on how to enter cross-tabulated data and perform these tests in SPSS, please refer to the
How To Do It in SPSS section for Module 10 online in CTOOLS. Output will be provided for all activities.
Formula Card:
Activity 1: Is there a Different Pattern in the Distribution of Accidental Deaths in
a Certain Region Compared to the Pattern in the entire United States?
In this activity you will perform a chi-square goodness of fit test to test if the data fit well with a
specified model stated in the null hypothesis.
Background: According to the records of the National Safety Council, accidental deaths in the United
States during 2002 had the following distribution according to the principal types of accidents.
Motor Vehicle   Falls   Drowning   Fire   Poison   Other
45%             15%     4%         3%     16%      17%
Suppose that an accidental death data set from a particular geographical region yielded the following
frequency distribution for the principal types of accidents:
Motor Vehicle   Falls   Drowning   Fire   Poison   Other
442             161     42         33     162      150
Do these data show a significantly different pattern in the distribution of accidental deaths in the
particular region compared to the pattern in the entire United States? Use a 5% significance level.
(Source: National Safety Council Website, 2005)
Task: Perform a Chi-square goodness of fit test.
Recall: Write out the Five Steps for conducting a test of hypotheses (Reference page 51).
1.
2.
3.
4.
5.
1. State the hypotheses:
a. Explain in one sentence why the test for this scenario is a goodness of fit test.
b. State the null hypothesis H0 : ___________________________
where _____ represents
2. Assumption Checks and Computing the Test Statistic:
a. Find the expected counts for the different categories of accidental deaths, and fill them in the
table that has the null hypothesis proportions and the observed counts.
            Motor Vehicle   Falls   Drowning   Fire   Poison   Other   Total
Null %      45%             15%     4%         3%     16%      17%     100%
Observed    442             161     42         33     162      150     990
Expected
b. Do all cells have expected counts greater than 5?    Yes    No
c. Complete the calculation of the test statistic based on your table above.
$$X^2 = \frac{(442 - 445.5)^2}{445.5} + \frac{(161 - 148.5)^2}{148.5} + \frac{(42 - 39.6)^2}{39.6} + \frac{(33 - 29.7)^2}{29.7} + \frac{(162 - 158.4)^2}{158.4} + \frac{(\_\_\_\_ - \_\_\_\_)^2}{\_\_\_\_}$$
3. Calculate the p-value: Based on the SPSS output, report the p-value, and fill in the blanks.
Test Statistics
               Category
Chi-Square     3.663
df             5
Asymp. Sig.    .599
The p-value is __________.
The expected value of the test statistic assuming the null hypothesis is true is ____________.
The large p-value is consistent with the fact that our observed test statistic value ($X^2 = 3.663$) is
greater than    less than    the expected test statistic value (under the null hypothesis).
4. Decision:
What is your decision at a 5% significance level? Reject H0 Fail to reject H0
Remember:
Reject H0 ⇔ Results statistically significant
Fail to reject H0 ⇔ Results not statistically significant
5. Conclusion: What is your conclusion at a 5% significance level?
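The goodness of fit computation in this activity can be reproduced with a short sketch that uses only the Python standard library. The expected counts follow $E_i = n \cdot p_i$ with the null percentages from the Background. The helper `chi2_upper_tail` is a hypothetical name; it assumes an integer df and uses the standard recurrence for the regularized upper incomplete gamma function. It is offered as a cross-check of the SPSS output, not as the method SPSS uses.

```python
import math

categories = ["Motor Vehicle", "Falls", "Drowning", "Fire", "Poison", "Other"]
null_proportions = [0.45, 0.15, 0.04, 0.03, 0.16, 0.17]
observed = [442, 161, 42, 33, 162, 150]

n = sum(observed)                                   # total sample size
expected = [n * p for p in null_proportions]        # E_i = n * p_i
statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(categories) - 1                            # categories minus one


def chi2_upper_tail(x, df):
    """P(chi-square with integer df > x), built up with the recurrence
    Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1), where y = x / 2."""
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)                    # Q(1, y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))         # Q(1/2, y)
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q


p_value = chi2_upper_tail(statistic, df)
print(round(statistic, 3), df, round(p_value, 3))   # compare with the Test Statistics output
```

All six expected counts exceed 5, so the usual condition for the Chi-square approximation holds for these data.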
Activity 2: Are Angry People More Likely to have Heart Disease?
In this activity you will perform a Chi-square test of Independence to test if the two categorical variables
appear to be related for a population.
Background: People who get angry easily tend to be more likely to have heart disease. That is the
conclusion of a study that followed a random sample of 12,986 people from three locations over about
four years. All subjects were free of heart disease at the beginning of the study. The subjects took the
Spielberger Trait Anger Scale, which measures how prone a person is to sudden anger. The 8474 people
in the sample who had normal blood pressure were classified according to whether they had “coronary
heart disease” (CHD) or not and whether they had low anger, moderate anger, or high anger according
to the Anger Scale. The classification summary is given.
          Low Anger   Moderate Anger   High Anger
CHD       53          110              27
No CHD    3057        4621             606
(Source: Moore, 2001, page 476.)
Task: Perform a Chi-square test of independence at the 10% level to assess if Anger classification and
Heart Disease Status are related.
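Before running the test, the classification summary can be turned into sample proportions of CHD, overall and within each anger level. This is a minimal sketch; the variable names are illustrative and the counts are those given in the Background.

```python
# Counts of subjects by anger level, from the classification summary.
chd = {"Low": 53, "Moderate": 110, "High": 27}
no_chd = {"Low": 3057, "Moderate": 4621, "High": 606}

total_chd = sum(chd.values())
total_n = total_chd + sum(no_chd.values())

overall_proportion = total_chd / total_n
proportion_by_anger = {
    level: chd[level] / (chd[level] + no_chd[level]) for level in chd
}

print(f"Overall: {overall_proportion:.4f}")
for level, proportion in proportion_by_anger.items():
    print(f"{level} anger: {proportion:.4f}")
```

If anger level and heart disease status were unrelated, the three conditional proportions would be expected to sit close to the overall proportion, apart from sampling variation.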
1. State the hypotheses:
a. Explain in one sentence why the test for this scenario is a test of independence.
b. State your null hypothesis:
H0 : _______________________________________________________________
c. Intuition about what the conclusion may be can be derived from looking at some basic
probabilities.
What proportion of sampled subjects had CHD?
What proportion of High anger subjects had CHD?
What proportion of Moderate anger subjects had CHD?
What proportion of Low anger subjects had CHD?
If the null hypothesis is true, we would expect the above four numbers to be _________.
2. Assumption Checks and Computing the Test Statistic:
CHD * TEMPER Crosstabulation
                                        TEMPER
                           Low anger   Moderate anger   High anger   Total
CHD      Count             53          110              27           190
         Expected Count    69.7        106.1            14.2         190.0
No CHD   Count             3057        4621             606          8284
         Expected Count    3040.3      4624.9           618.8        8284.0
Total    Count             3110        4731             633          8474
         Expected Count    3110.0      4731.0           633.0        8474.0

Chi-Square Tests
                               Value      df   Asymp. Sig. (2-sided)
Pearson Chi-Square             16.077 a   2    .000
Likelihood Ratio               13.999     2    .001
Linear-by-Linear Association   13.184     1    .000
N of Valid Cases               8474
a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 14.19.
b. Briefly, explain what the “expected counts” are.
c. Based on the table, do the assumptions appear to be met to perform the test?
(Are all expected counts greater than 5?) Yes No
d. The expected value of the test statistic assuming the null hypothesis is true is ____________.
e. From the output, report the test statistic value: _____ = ______________.
3. Calculate the p-value: Based on the SPSS output, report the p-value, and fill in the blanks.
The p-value is __________.
The small p-value is consistent with the fact that our observed test statistic value is even
greater than less than the expected test statistic value (under the null hypothesis).
4. Decision:
What is your decision at a 10% significance level? Reject H0 Fail to reject H0
Remember:
Reject H0 ⇔ Results statistically significant
Fail to reject H0 ⇔ Results not statistically significant
5. Conclusion: What is your conclusion at a 10% significance level?
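The expected counts and the Pearson Chi-square value in the SPSS output can be cross-checked with a short sketch. Each expected count is (row total × column total) / overall total, and the degrees of freedom are (rows − 1)(columns − 1). For df = 2 the upper-tail probability has the closed form $e^{-x/2}$, which is used here; that shortcut applies only to df = 2.

```python
import math

# Rows: CHD, No CHD. Columns: low, moderate, high anger.
observed = [
    [53, 110, 27],
    [3057, 4621, 606],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence: row total * column total / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

statistic = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Closed-form upper-tail probability, valid for df = 2 only.
p_value = math.exp(-statistic / 2)

print([round(e, 1) for e in expected[0]])
print(round(statistic, 3), df, p_value)
```

SPSS prints the p-value as .000 because it rounds to three decimals; the computed value is small but not exactly zero.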
Activity 3: Comparison of the Distribution of Academic Degrees:
Males versus Females
In this activity you will perform a Chi-square test of Homogeneity to test if the distribution of a
categorical response is the same across two or more populations.
Background: How do women and men compare in the pursuit of academic degrees? The table below
presents counts (in thousands) from the Statistical Abstract of degrees earned in 1996 categorized by the
level of the degree and the sex of the recipient.
          Bachelor   Master   Professional   Doctorate
Female    642        227      32             18
Male      522        179      45             27
Task: Perform a Chi-square test of homogeneity. Use a 1% significance level.
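Since a test of homogeneity compares the distribution of one categorical variable across populations, it can help to look first at the degree distribution within each sex. The sketch below converts the counts (in thousands) from the Background into row proportions; the variable names are illustrative.

```python
degrees = ["Bachelor", "Master", "Professional", "Doctorate"]
counts = {
    "Female": [642, 227, 32, 18],   # thousands of degrees
    "Male": [522, 179, 45, 27],
}

# Proportion of each degree level within each sex (each row sums to 1).
distributions = {
    sex: [count / sum(row) for count in row] for sex, row in counts.items()
}

for sex, proportions in distributions.items():
    summary = ", ".join(f"{d}: {p:.3f}" for d, p in zip(degrees, proportions))
    print(f"{sex} -> {summary}")
```

Under the null hypothesis the two rows of proportions estimate the same underlying distribution, so large gaps between them are what the test statistic picks up.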
1. State the hypotheses:
State the null hypothesis:
H0 : _______________________________________________________________
2. Assumption Checks and Computing the Test Statistic:
a. Show how the expected count 531.8 (first cell for males) was computed.
b. Based on the table, do the assumptions appear to be met to perform the test?
(Are all expected counts greater than 5?) Yes No
c. The expected value of the test statistic assuming the null hypothesis is true is ____________.
d. From the output below, report the test statistic value: _____ = ______________.
Chi-Square Tests
                               Value     df   Asymp. Sig. (2-sided)
Pearson Chi-Square             9.514 a   3    .023
Likelihood Ratio               9.485     3    .023
Linear-by-Linear Association   5.099     1    .024
N of Valid Cases               1692
a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 20.56.
3. Calculate the p-value: Based on the SPSS output, report the p-value, and fill in the blanks.
The p-value is __________.
4. Decision:
What is your decision at a 1% significance level? Reject H0 Fail to reject H0
Remember:
Reject H0 ⇔ Results statistically significant
Fail to reject H0 ⇔ Results not statistically significant
5. Conclusion: What is your conclusion at a 1% significance level?
Would your decision and conclusion change if the significance level was 5% instead of 1%?
How about 3% instead of 1%?
How about 2.3% instead of 1%?
How about 2% instead of 1%?
Based on your answers above, you can see that the p-value represents the ___________
significance level at which the results would be statistically significant.
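The computations in this activity can be cross-checked with the sketch below. It rebuilds the expected counts from the row and column totals, recomputes the Pearson statistic, and then compares the p-value as reported (rounded) in the SPSS output with each significance level mentioned above, using the usual rule of rejecting H0 when the p-value is less than or equal to the significance level. Variable names are illustrative.

```python
observed = {
    "Female": [642, 227, 32, 18],   # Bachelor, Master, Professional, Doctorate
    "Male": [522, 179, 45, 27],
}

row_totals = {sex: sum(row) for sex, row in observed.items()}
col_totals = [sum(col) for col in zip(*observed.values())]
n = sum(row_totals.values())

# Expected count: row total * column total / overall total.
expected = {
    sex: [row_totals[sex] * c / n for c in col_totals] for sex in observed
}
statistic = sum(
    (o - e) ** 2 / e
    for sex in observed
    for o, e in zip(observed[sex], expected[sex])
)
df = (len(observed) - 1) * (len(col_totals) - 1)

# p-value as printed (rounded to three decimals) in the SPSS output.
reported_p_value = 0.023

decisions = {
    alpha: "reject H0" if reported_p_value <= alpha else "fail to reject H0"
    for alpha in (0.05, 0.03, 0.023, 0.02, 0.01)
}

print(round(expected["Male"][0], 1), round(statistic, 3), df)
for alpha, decision in decisions.items():
    print(f"alpha = {alpha}: {decision}")
```

Working from the rounded p-value keeps the comparison at the 2.3% level consistent with the printed output.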
Check Your Understanding:
Fill in the blank with the most appropriate Chi-square test to address the research question.
1. A researcher wants to determine if scoring high or low on an artistic ability test depends on being
right or left-handed.
___________________________________
2. A national organization wants to compare the distribution of level of highest education completed
(high school, college, masters, doctoral) for Republicans versus Democrats.
___________________________________
3. A preservation society has the percentages of five main types of fish in the river from 10 years ago.
After noticing an imbalance recently, they add some fish from hatcheries to the river. How can they
determine if they restored the ecosystem from a new sample of fish?
___________________________________
Example Exam Question on Chi-Square Tests
A study was performed to examine the attendance pattern and exam performance of students in an
intro statistics course. A sample of 96 students enrolled in an intro stat course was selected (these can
be considered a random sample of such students). These 96 students were classified by their
attendance status (regularly attend lecture or not) and their performance on the midterm exam,
classified as low (below 50%), middle (50% to 80%), and high (above 80%). The data and SPSS output
are provided.
Regularly Attend? * Exam Performance Crosstabulation
Count
                             Exam Performance
                             Low   Middle   High   Total
Regularly Attend?    Yes     12    20       28     60
                     No      16    12       8      36
Total                        28    32       36     96
Chi-Square Tests
                               Value     df   Asymp. Sig.
Pearson Chi-Square             8.195 a   2    .017
Likelihood Ratio               8.298     2    .016
Linear-by-Linear Association   8.067     1    .005
N of Valid Cases               96
a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 10.50.
a. Give the name of the chi-square test for assessing if there is a relationship between attendance
status and exam performance.
___________________________________________________________________
b. Based on the above data, what proportion of regular attendees performed high on the exam?
Final answer: ________________________
c. Assuming there is no relationship between attendance status and exam performance, how many
regular attendees would you expect to perform high on the exam? Show all work.
Final answer: ________________________
d. Assuming there is no relationship between attendance status and exam performance, what is the
distribution of the test statistic? (Include all relevant information.)
Final answer: ___________________________________
e. Use a level of 0.05 to assess if there is a significant relationship between attendance status and
exam performance.
Test Statistic Value: _______________________
p-value: ______________________
Thus, there (circle your answer):    does    does not    appear to be an association between attendance
status and exam performance.
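The arithmetic behind this exam question can be checked with a short sketch built from the crosstabulation. It computes a conditional proportion, an expected count under no relationship, the degrees of freedom, and the Pearson statistic. The p-value line uses the closed form $e^{-x/2}$, which holds only because df = 2 here. Variable names are illustrative.

```python
import math

# Observed counts; columns are Low, Middle, High exam performance.
observed = {"Yes": [12, 20, 28], "No": [16, 12, 8]}

row_totals = {group: sum(row) for group, row in observed.items()}
col_totals = [sum(col) for col in zip(*observed.values())]
n = sum(row_totals.values())

# Proportion of regular attendees who performed high.
proportion_high_regular = observed["Yes"][2] / row_totals["Yes"]

# Expected count for (regular attendee, high) if the two variables are unrelated.
expected_high_regular = row_totals["Yes"] * col_totals[2] / n

df = (len(observed) - 1) * (len(col_totals) - 1)
statistic = sum(
    (o - row_totals[group] * c / n) ** 2 / (row_totals[group] * c / n)
    for group, row in observed.items()
    for o, c in zip(row, col_totals)
)
p_value = math.exp(-statistic / 2)   # valid for df = 2 only

print(round(proportion_high_regular, 4), expected_high_regular, df)
print(round(statistic, 3), round(p_value, 3))
```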