Chapter 12
Inference on Categorical Data
© 2010 Pearson Prentice Hall. All rights reserved
Section 12.1
Goodness-of-Fit Test
Objective
• Perform a goodness-of-fit test
Characteristics of the Chi-Square Distribution
1. It is not symmetric.
2. The shape of the chi-square distribution depends on the degrees of freedom, just like Student’s t-distribution.
3. As the number of degrees of freedom increases, the chi-square distribution becomes more nearly symmetric.
4. The values of $\chi^2$ are nonnegative, i.e., the values of $\chi^2$ are greater than or equal to 0.
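Characteristic 3 can be checked numerically. The sketch below relies on a standard result that is not stated on the slides: the skewness of a chi-square distribution with df degrees of freedom is sqrt(8/df). The function name `chi_square_skewness` is only an illustrative choice.

```python
import math


def chi_square_skewness(df: int) -> float:
    """Skewness of a chi-square distribution with df degrees of freedom."""
    if df <= 0:
        raise ValueError("degrees of freedom must be positive")
    return math.sqrt(8 / df)


# The skewness stays positive (right-skewed) but shrinks toward 0,
# the value for a symmetric distribution, as df grows.
for df in (1, 2, 5, 10, 30, 100):
    print(f"df = {df:>3}: skewness = {chi_square_skewness(df):.3f}")
```

The printed values decrease steadily, which is the numerical counterpart of "becomes more nearly symmetric".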
A goodness-of-fit test is an inferential
procedure used to determine whether a
frequency distribution follows a specific
distribution.
Expected Counts
Suppose that there are n independent trials of an
experiment with k ≥ 3 mutually exclusive possible
outcomes. Let p1 represent the probability of observing
the first outcome and E1 represent the expected count of
the first outcome; p2 represent the probability of
observing the second outcome and E2 represent the
expected count of the second outcome; and so on. The
expected counts for each possible outcome are given by
$$E_i = \mu_i = n p_i \quad \text{for } i = 1, 2, \ldots, k$$
Parallel Example 1: Finding Expected Counts
A sociologist wishes to determine whether the distribution for the
number of years care-giving grandparents are responsible for
their grandchildren is different today than it was in 2000.
According to the United States Census Bureau, in 2000, 22.8%
of grandparents have been responsible for their grandchildren
less than 1 year; 23.9% of grandparents have been responsible
for their grandchildren for 1 or 2 years; 17.6% of grandparents
have been responsible for their grandchildren 3 or 4 years; and
35.7% of grandparents have been responsible for their
grandchildren for 5 or more years. If the sociologist randomly
selects 1,000 care-giving grandparents, compute the expected
number within each category assuming the distribution has not
changed from 2000.
Solution
Step 1: The probabilities are the relative frequencies from
the 2000 distribution:
p<1yr = 0.228
p1-2yr = 0.239
p3-4yr = 0.176
p ≥5yr = 0.357
Solution
Step 2: There are n=1,000 trials of the experiment so the
expected counts are:
E<1yr = np<1yr = 1000(0.228) = 228
E1-2yr = np1-2yr = 1000(0.239) = 239
E3-4yr = np3-4yr =1000(0.176) = 176
E≥5yr = np ≥5yr = 1000(0.357) = 357
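The two steps of this solution can be sketched in a few lines of Python. The helper name `expected_counts` and the dictionary keys are illustrative choices; the probabilities and n = 1,000 come from the example.

```python
def expected_counts(n, probabilities):
    """Return the expected count E_i = n * p_i for every category."""
    if abs(sum(probabilities.values()) - 1.0) > 1e-9:
        raise ValueError("category probabilities must sum to 1")
    return {category: n * p for category, p in probabilities.items()}


# Relative frequencies from the 2000 distribution (Step 1).
p_2000 = {"<1": 0.228, "1-2": 0.239, "3-4": 0.176, ">=5": 0.357}

# Expected counts for n = 1,000 grandparents (Step 2).
counts = expected_counts(1000, p_2000)
for category, e in counts.items():
    print(f"{category:>4} yr: {e:.0f}")
```

The check that the probabilities sum to 1 is an added safeguard, not part of the slide's procedure.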
Test Statistic for Goodness-of-Fit Tests
Let Oi represent the observed counts of category i, Ei
represent the expected counts of category i, k represent
the number of categories, and n represent the number of
independent trials of an experiment. Then the formula
$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}, \quad i = 1, 2, \ldots, k$$
approximately follows the chi-square distribution with
k-1 degrees of freedom, provided that
1. all expected frequencies are greater than or equal to 1
 (all Ei ≥ 1) and
2. no more than 20% of the expected frequencies are
less than 5.
CAUTION!
Goodness-of-fit tests are used to test hypotheses
regarding the distribution of a variable based on
a single population. If you wish to compare two
or more populations, you must use the tests for
homogeneity presented in Section 12.2.
The Goodness-of-Fit Test
To test the hypotheses regarding a distribution,
we use the steps that follow.
Step 1: Determine the null and alternative
hypotheses.
H0: The random variable follows a
certain distribution
H1: The random variable does not
follow a certain distribution
Step 2: Decide on a level of significance, α,
depending on the seriousness of making
a Type I error.
Step 3:
a) Calculate the expected counts for each
of the k categories. The expected counts
are Ei=npi for i = 1, 2, … , k where n is
the number of trials and pi is the
probability of the ith category, assuming
that the null hypothesis is true.
b) Verify that the requirements for the goodness-of-fit test are satisfied.
1. All expected counts are greater than or
equal to 1 (all Ei ≥ 1).
2. No more than 20% of the expected counts
are less than 5.
c) Compute the test statistic:
$$\chi^2_0 = \sum \frac{(O_i - E_i)^2}{E_i}$$
Note: Oi is the observed count for the ith category.
CAUTION!
If the requirements in Step 3(b) are not satisfied,
one option is to combine two or more of the low-frequency categories into a single category.
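The requirement check in Step 3(b) is easy to automate. The following is a minimal sketch; the function name `requirements_met` and the second and third sets of expected counts are hypothetical and serve only to show a failing case and the effect of combining categories.

```python
def requirements_met(expected):
    """Check the two conditions from Step 3(b) on a list of expected counts."""
    all_at_least_one = all(e >= 1 for e in expected)
    share_below_five = sum(e < 5 for e in expected) / len(expected)
    return all_at_least_one and share_below_five <= 0.20


# Expected counts from Example 1: every count is at least 5.
print(requirements_met([228, 239, 176, 357]))  # True

# Hypothetical counts: 2 of 5 categories (40%) fall below 5.
print(requirements_met([3.2, 4.1, 12.0, 20.5, 40.2]))  # False

# Combining the two low-frequency categories (3.2 + 4.1) fixes the problem.
print(requirements_met([3.2 + 4.1, 12.0, 20.5, 40.2]))  # True
```

When categories are combined, the corresponding observed counts must be added together as well, and k (and so the degrees of freedom) shrinks accordingly.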
Classical Approach
Step 4: Determine the critical value. All goodness-of-fit tests are right-tailed tests, so the
critical value is $\chi^2_\alpha$ with k-1 degrees of
freedom.
Step 5: Compare the critical value to the test
statistic. If $\chi^2_0 > \chi^2_\alpha$, reject the null
hypothesis.
P-Value Approach
Step 4: Use Table VII to obtain an approximate
P-value by determining the area under the
chi-square distribution with k-1 degrees of
freedom to the right of the test statistic.
Step 5: If the P-value < α, reject the null
hypothesis.
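Table VII gives only an approximate P-value. As an alternative, the right-tail area can be computed directly. The sketch below is not the slides' method: it uses standard closed-form expressions for integer degrees of freedom, built from `math.erfc` and `math.gamma`, and the function name `chi_square_right_tail` is an illustrative choice.

```python
import math


def chi_square_right_tail(x: float, df: int) -> float:
    """Area under the chi-square(df) density to the right of x (x >= 0)."""
    if df < 1 or x < 0:
        raise ValueError("need df >= 1 and x >= 0")
    y = x / 2
    if df % 2 == 0:
        # Even df: exp(-y) * sum over j < df/2 of y^j / j!
        return math.exp(-y) * sum(
            y**j / math.factorial(j) for j in range(df // 2)
        )
    # Odd df: erfc(sqrt(y)) + exp(-y) * sum over j < (df-1)/2
    # of y^(j + 1/2) / Gamma(j + 3/2)
    tail = math.erfc(math.sqrt(y))
    tail += math.exp(-y) * sum(
        y ** (j + 0.5) / math.gamma(j + 1.5) for j in range((df - 1) // 2)
    )
    return tail


# A tabled critical value should leave roughly its alpha to the right.
print(chi_square_right_tail(7.815, 3))  # close to 0.05
```

The decision rule in Step 5 then reduces to comparing the returned area with α.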
Step 6: State the conclusion.
Parallel Example 2: Conducting a Goodness-of-Fit Test
A sociologist wishes to determine whether the distribution for
the number of years care-giving grandparents are responsible
for their grandchildren is different today than it was in 2000.
According to the United States Census Bureau, in 2000, 22.8%
of grandparents have been responsible for their grandchildren
less than 1 year; 23.9% of grandparents have been responsible
for their grandchildren for 1 or 2 years; 17.6% of grandparents
have been responsible for their grandchildren 3 or 4 years; and
35.7% of grandparents have been responsible for their
grandchildren for 5 or more years. The sociologist randomly
selects 1,000 care-giving grandparents and obtains the
following data.
Test the claim that the distribution is different
today than it was in 2000 at the α = 0.05 level
of significance.
Solution
Step 1: We want to know if the distribution today is
different than it was in 2000. The hypotheses are
then:
H0: The distribution for the number of years
care-giving grandparents are responsible for
their grandchildren is the same today as it was
in 2000
H1: The distribution for the number of years
care-giving grandparents are responsible for
their grandchildren is different today than it
was in 2000
Solution
Step 2: The level of significance is α = 0.05.
Step 3:
(a) The expected counts were computed in
Example 1.

Number of Years   Observed Counts   Expected Counts
<1                252               228
1-2               255               239
3-4               162               176
≥5                331               357

Solution
Step 3:
(b) Since all expected counts are greater than or equal to
5, the requirements for the goodness-of-fit test are
satisfied.
(c) The test statistic is
$$\chi^2_0 = \frac{(252 - 228)^2}{228} + \frac{(255 - 239)^2}{239} + \frac{(162 - 176)^2}{176} + \frac{(331 - 357)^2}{357} = 6.605$$
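The arithmetic behind the test statistic can be reproduced with a short script. This is a sketch using the observed and expected counts from the table in Step 3(a); the function name `chi_square_statistic` is an illustrative choice.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O_i - E_i)^2 / E_i over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


observed = [252, 255, 162, 331]
expected = [228, 239, 176, 357]

# Per-category contributions show which categories drive the statistic.
contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
statistic = chi_square_statistic(observed, expected)

print([round(c, 3) for c in contributions])
print(round(statistic, 3))  # 6.605
```

The largest contribution comes from the "<1" category, where the observed count exceeds the expected count by 24.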
Solution: Classical Approach
Step 4: There are k = 4 categories, so we find the
critical value using 4-1=3 degrees of freedom.
The critical value is $\chi^2_{0.05} = 7.815$.
Step 5: Since the test statistic, $\chi^2_0 = 6.605$, is less than
the critical value $\chi^2_{0.05} = 7.815$, we fail to reject
the null hypothesis.
Solution: P-Value Approach
Step 4: There are k = 4 categories. The P-value is the
area under the chi-square distribution with 4-1=3
degrees of freedom to the right of $\chi^2_0 = 6.605$.
Thus, P-value ≈ 0.09.
Step 5: Since the P-value ≈ 0.09 is greater than the
level of significance α = 0.05, we fail to
reject the null hypothesis.
Solution
Step 6: There is insufficient evidence to conclude
that the distribution for the number of years
care-giving grandparents are responsible for
their grandchildren is different today than it
was in 2000 at the α = 0.05 level of
significance.
Section 12.2
Tests for Independence and the Homogeneity of Proportions
Objectives
1. Perform a test for independence
2. Perform a test for homogeneity of
proportions
Objective 1
• Perform a Test for Independence
The chi-square test for independence is used
to determine whether there is an association
between a row variable and column variable in
a contingency table constructed from sample
data. The null hypothesis is that the variables
are not associated; in other words, they are
independent. The alternative hypothesis is that
the variables are associated, or dependent.
“In Other Words”
In a chi-square independence test, the null
hypothesis is always
H0: The variables are independent
The alternative hypothesis is always
H1: The variables are not independent
The idea behind testing these types of claims is
to compare actual counts to the counts we
would expect if the null hypothesis were true
(if the variables are independent). If a
significant difference between the actual
counts and expected counts exists, we would
take this as evidence against the null
hypothesis.
If two events are independent, then
P(A and B) = P(A)P(B)
We can use the Multiplication Rule for
independent events to obtain the expected
proportion of observations within each cell
under the assumption of independence and
multiply this result by n, the sample size, in
order to obtain the expected count within each
cell.
Parallel Example 1: Determining the Expected Counts in a
Test for Independence
In a poll, 883 males and 893 females were asked “If you
could have only one of the following, which would
you pick: money, health, or love?” Their responses
are presented in the table below. Determine the
expected counts within each cell assuming that gender
and response are independent.
Source: Based on a Fox News Poll conducted in January, 1999
Solution
Step 1: We first compute the row and column totals:

                Money   Health   Love   Row Totals
Men             82      446      355    883
Women           46      574      273    893
Column totals   128     1020     628    1776

Solution
Step 2: Next compute the relative marginal frequencies
for the row variable and column variable:

                     Money                Health                Love                 Relative Frequency
Men                  82                   446                   355                  883/1776 ≈ 0.4972
Women                46                   574                   273                  893/1776 ≈ 0.5028
Relative Frequency   128/1776 ≈ 0.0721    1020/1776 ≈ 0.5743    628/1776 ≈ 0.3536    1

Solution
Step 3: Assuming gender and response are independent,
we use the Multiplication Rule for Independent
Events to compute the proportion of observations
we would expect in each cell.

        Money    Health   Love
Men     0.0358   0.2855   0.1758
Women   0.0362   0.2888   0.1778

Solution
Step 4: We multiply the expected proportions from step 3
by 1776, the sample size, to obtain the expected
counts under the assumption of independence.

        Money                     Health                     Love
Men     1776(0.0358) ≈ 63.5808    1776(0.2855) ≈ 507.048     1776(0.1758) ≈ 312.2208
Women   1776(0.0362) ≈ 64.2912    1776(0.2888) ≈ 512.9088    1776(0.1778) ≈ 315.7728

Expected Frequencies in a Chi-Square
Test for Independence
To find the expected frequencies in a cell when
performing a chi-square independence test, multiply
the row total of the row containing the cell by the
column total of the column containing the cell and
divide this result by the table total. That is,
$$\text{Expected frequency} = \frac{(\text{row total})(\text{column total})}{\text{table total}}$$
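The row-total-times-column-total shortcut can be applied to a whole table at once. The sketch below uses the poll counts from Example 1; the function name `expected_frequencies` is an illustrative choice. Because it does not round the marginal proportions to four decimals first, its results differ slightly from the values computed step by step in Example 1 (for instance about 63.64 rather than 63.5808 for men who chose money).

```python
def expected_frequencies(table):
    """Expected count for each cell: (row total)(column total) / table total."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    table_total = sum(row_totals)
    return [
        [r * c / table_total for c in column_totals]
        for r in row_totals
    ]


# Rows: men, women. Columns: money, health, love.
observed = [[82, 446, 355], [46, 574, 273]]
expected = expected_frequencies(observed)
for row in expected:
    print([round(e, 2) for e in row])
```

Each row of expected counts sums to the corresponding row total, and each column to its column total, which is a quick sanity check on the computation.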
Test Statistic for the Test of
Independence
Let Oi represent the observed number of counts in the
ith cell and Ei represent the expected number of counts
in the ith cell. Then
$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$
approximately follows the chi-square distribution with
(r-1)(c-1) degrees of freedom, where r is the number of rows
and c is the number of columns in the contingency table,
provided that (1) all expected frequencies are greater than or
equal to 1 and (2) no more than 20% of the expected
frequencies are less than 5.
Chi-Square Test for Independence
To test the association (or independence) of two
variables in a contingency table:
Step 1: Determine the null and alternative
hypotheses.
H0: The row variable and column
variable are independent.
H1: The row variable and column
variable are dependent.
Step 2: Choose a level of significance, α,
depending on the seriousness of making
a Type I error.
Step 3:
a) Calculate the expected frequencies
(counts) for each cell in the contingency
table.
b) Verify that the requirements for the chi-square test for independence are
satisfied:
1. All expected frequencies are greater
than or equal to 1 (all Ei ≥ 1).
2. No more than 20% of the expected
frequencies are less than 5.
c) Compute the test statistic:
$$\chi^2_0 = \sum \frac{(O_i - E_i)^2}{E_i}$$
Note: Oi is the observed count for the ith cell.
Classical Approach
Step 4: Determine the critical value. All
chi-square tests for independence are
right-tailed tests, so the critical value is
$\chi^2_\alpha$ with (r-1)(c-1) degrees of freedom,
where r is the number of rows and c is
the number of columns in the
contingency table.
Step 5: Compare the critical value to the test
statistic. If $\chi^2_0 > \chi^2_\alpha$, reject the null
hypothesis.
P-Value Approach
Step 4: Use Table VII to determine an approximate P-value by determining the area under the chi-square distribution with (r-1)(c-1) degrees of
freedom to the right of the test statistic.
Step 5: If the P-value < α, reject the null
hypothesis.
Step 6: State the conclusion.
Parallel Example 2: Performing a Chi-Square Test for
Independence
In a poll, 883 males and 893 females were asked “If you
could have only one of the following, which would
you pick: money, health, or love?” Their responses
are presented in the table below. Test the claim that
gender and response are independent at the α = 0.05
level of significance.
Source: Based on a Fox News Poll conducted in January, 1999
Solution
Step 1: We want to know whether gender and response
are dependent or independent so the hypotheses
are:
H0: gender and response are independent
H1: gender and response are dependent
Step 2: The level of significance is α = 0.05.
Solution
Step 3:
(a) The expected frequencies were computed in Example
1 and are given in parentheses in the table below,
along with the observed frequencies.

        Money            Health           Love
Men     82 (63.5808)     446 (507.048)    355 (312.2208)
Women   46 (64.2912)     574 (512.9088)   273 (315.7728)

Solution
Step 3:
(b) Since none of the expected frequencies are less than 5,
the requirements for the chi-square test for independence are
satisfied.
(c) The test statistic is
$$\chi^2_0 = \frac{(82 - 63.5808)^2}{63.5808} + \frac{(446 - 507.048)^2}{507.048} + \cdots + \frac{(273 - 315.7728)^2}{315.7728} = 36.82$$
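The full sum over all six cells can be sketched as follows. The function name `independence_statistic` is an illustrative choice. The expected counts here are computed from the exact row and column totals rather than from rounded proportions, so the result comes out near 36.84 instead of the 36.82 shown above; the conclusion of the test is unaffected.

```python
def independence_statistic(table):
    """Return (chi-square statistic, degrees of freedom) for a contingency table."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence for cell (i, j).
            expected = row_totals[i] * column_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


# Rows: men, women. Columns: money, health, love.
statistic, df = independence_statistic([[82, 446, 355], [46, 574, 273]])
print(round(statistic, 2), df)
```

Returning the degrees of freedom alongside the statistic keeps the (r-1)(c-1) rule next to the table it was derived from.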
Solution: Classical Approach
Step 4: There are r = 2 rows and c = 3 columns, so we
find the critical value using (2-1)(3-1) = 2
degrees of freedom. The critical value is
$\chi^2_{0.05} = 5.99$.
Step 5: Since the test statistic, $\chi^2_0 = 36.82$, is greater
than the critical value $\chi^2_{0.05} = 5.99$, we reject
the null hypothesis.
Solution: P-Value Approach
Step 4: There are r = 2 rows and c = 3 columns so we
find the P-value using (2-1)(3-1) = 2 degrees of
freedom. The P-value is the area under the chi-square distribution with 2 degrees of freedom
to the right of $\chi^2_0 = 36.82$, which is
approximately 0.
Step 5: Since the P-value is less than the level of
significance α = 0.05, we reject the null
hypothesis.
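With exactly 2 degrees of freedom, both the critical value and the P-value have simple closed forms, because the right-tail area beyond x is exp(-x/2). That identity is a standard property of the chi-square distribution with 2 degrees of freedom and is not stated on the slides; the function names below are illustrative.

```python
import math


def critical_value_df2(alpha: float) -> float:
    """Value with right-tail area alpha under the chi-square(2) curve."""
    return -2 * math.log(alpha)


def p_value_df2(statistic: float) -> float:
    """Right-tail area beyond the statistic under the chi-square(2) curve."""
    return math.exp(-statistic / 2)


print(round(critical_value_df2(0.05), 3))  # 5.991
print(p_value_df2(36.82))                  # about 1e-8
```

This shows why the P-value is reported as "approximately 0": it is on the order of one in a hundred million.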
Solution
Step 6: There is sufficient evidence to conclude that
gender and response are dependent at the
α = 0.05 level of significance.
To see the relation between response and gender,
we draw bar graphs of the conditional
distributions of response by gender. Recall
that a conditional distribution lists the relative
frequency of each category of a variable, given
a specific value of the other variable in a
contingency table.
Parallel Example 3: Constructing a Conditional Distribution
and Bar Graph
Find the conditional distribution of response by gender for
the data from the previous example, reproduced
below.
Source: Based on a Fox News Poll conducted in January, 1999
Solution
We first compute the conditional distribution of response
by gender.

        Money              Health              Love
Men     82/883 ≈ 0.0929    446/883 ≈ 0.5051    355/883 ≈ 0.4020
Women   46/893 ≈ 0.0515    574/893 ≈ 0.6428    273/893 ≈ 0.3057

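The conditional distribution above divides each count by its row total. A minimal sketch, with `conditional_distribution` and the dictionary layout as illustrative choices:

```python
def conditional_distribution(table):
    """Divide each cell by its row total, giving one distribution per row."""
    return {
        group: {
            category: count / sum(counts.values())
            for category, count in counts.items()
        }
        for group, counts in table.items()
    }


poll = {
    "Men": {"Money": 82, "Health": 446, "Love": 355},
    "Women": {"Money": 46, "Health": 574, "Love": 273},
}

distribution = conditional_distribution(poll)
for group, shares in distribution.items():
    print(group, {category: round(share, 4) for category, share in shares.items()})
```

Each row of the result sums to 1, which is what makes the two genders directly comparable in a bar graph even though the group sizes differ.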
Solution
[Figure: bar graph of the conditional distribution of response by gender]
Objective 2
• Perform a Test for Homogeneity of
Proportions
In a chi-square test for homogeneity of
proportions, we test whether different
populations have the same proportion of
individuals with some characteristic.
The procedures for performing a test of
homogeneity are identical to those for a
test of independence.
Parallel Example 5: A Test for Homogeneity of Proportions
The following question was asked of a random sample of individuals
in 1992, 2002, and 2008: “Would you tell me if you feel being
a teacher is an occupation of very great prestige?” The results
of the survey are presented below:

       1992   2002   2008
Yes    418    479    525
No     602    541    485

Test the claim that the proportion of individuals that feel being a
teacher is an occupation of very great prestige is the same for
each year at the α = 0.01 level of significance.
Source: The Harris Poll
Solution
Step 1: The null hypothesis is a statement of “no
difference” so the proportions for each year who
feel that being a teacher is an occupation of very
great prestige are equal. We state the hypotheses
as follows:
H0: p1= p2= p3
H1: At least one of the proportions is different
from the others.
Step 2: The level of significance is α = 0.01.
Solution
Step 3:
(a) The expected frequencies are found by multiplying the
appropriate row and column totals and then dividing by
the total sample size. They are given in parentheses in the
table below, along with the observed frequencies.

       1992            2002            2008
Yes    418 (475.554)   479 (475.554)   525 (470.892)
No     602 (544.446)   541 (544.446)   485 (539.108)

Solution
Step 3:
(b) Since none of the expected frequencies are less than 5,
the requirements are satisfied.
(c) The test statistic is
$$\chi^2_0 = \frac{(418 - 475.554)^2}{475.554} + \frac{(479 - 475.554)^2}{475.554} + \cdots + \frac{(485 - 539.108)^2}{539.108} = 24.74$$
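One way to read the expected counts in a homogeneity test is through the pooled proportion: under H0 every year shares the same "yes" proportion, estimated by pooling all three samples. The sketch below takes that view, which is algebraically the same as (row total)(column total)/(table total); the variable names are illustrative.

```python
yes = {"1992": 418, "2002": 479, "2008": 525}
no = {"1992": 602, "2002": 541, "2008": 485}

total_yes = sum(yes.values())
total = total_yes + sum(no.values())
pooled = total_yes / total  # common "yes" proportion assumed under H0

statistic = 0.0
for year in yes:
    n_year = yes[year] + no[year]
    expected_yes = n_year * pooled
    expected_no = n_year * (1 - pooled)
    statistic += (yes[year] - expected_yes) ** 2 / expected_yes
    statistic += (no[year] - expected_no) ** 2 / expected_no
    # Sample proportion for the year, then its two expected counts.
    print(year, round(yes[year] / n_year, 3),
          round(expected_yes, 3), round(expected_no, 3))

print(round(statistic, 2))  # 24.74
```

The printed sample proportions rise from 1992 to 2008 while the pooled proportion sits between them, which is the pattern the large test statistic reflects.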
Solution: Classical Approach
Step 4: There are r = 2 rows and c = 3 columns, so we
find the critical value using (2-1)(3-1) = 2
degrees of freedom.
The critical value is $\chi^2_{0.01} = 9.210$.
Step 5: Since the test statistic, $\chi^2_0 = 24.74$, is greater
than the critical value $\chi^2_{0.01} = 9.210$, we reject
the null hypothesis.
Solution: P-Value Approach
Step 4: There are r = 2 rows and c = 3 columns so we
find the P-value using (2-1)(3-1) = 2 degrees of
freedom. The P-value is the area under the chi-square distribution with 2 degrees of freedom
to the right of $\chi^2_0 = 24.74$, which is
approximately 0.
Step 5: Since the P-value is less than the level of
significance α = 0.01, we reject the null
hypothesis.
Solution
Step 6: There is sufficient evidence to reject the null
hypothesis at the α = 0.01 level of
significance. We conclude that the
proportion of individuals who believe that
teaching is a very prestigious career is
different for at least one of the three years.