Chapter 9

Chapter 2, Section 2.5 and Chapter 9, Sections 9.1 & 9.2
Analysis of Two-Way Tables: Section 2.5
The variables we have worked with recently have been quantitative variables
(numbers). Now we will work with categorical variables. Two-way tables
compare two categorical variables measured on a set of cases.
Examples:
• Gender versus major
• Political party versus voting status
Two-Way Table:
• Describes the relationship between two categorical variables.
• Represents a table of counts.
Example:
Years of education and income. Suppose a random sample of 1,000 people
was selected and the following data was obtained:
                                        Income
Years of Education    <10,000   10,000-30,000   30,001-50,000   >50,000   Total
None                      100              85              50        15     250
Some College               85             110              60        20     275
Bachelor                   55              95             175        50     375
Post-grad                  10              10              15        65     100
Total                     250             300             300       150   1,000
Note: Each person surveyed represents a case. Each case fits into exactly
one education class and one income category, so each case fits in one and
only one cell of the body of the table.
The Joint Distribution of the Categorical Variables:
If we want the proportion of cases associated with any cell in the table we
divide the count for that cell by the grand total (the total number of cases in
the entire table). If we do this for each cell, we will have the joint
distribution of our two categorical variables.
Lecture 15, Sections 9.1 & 9.2
1. Find the joint distribution for the example above.
                                        Income
Years of Education    <10,000   10,000-30,000   30,001-50,000   >50,000   Total
None                      10%            8.5%              5%      1.5%     25%
Some College             8.5%             11%              6%        2%   27.5%
Bachelor                 5.5%            9.5%           17.5%        5%   37.5%
Post-grad                  1%              1%            1.5%      6.5%     10%
Total                     25%             30%             30%       15%    100%
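The division described above can be sketched in a few lines of Python. This is only an illustration: the variable names (`counts`, `grand_total`, `joint`) are chosen here and are not part of the notes.

```python
# Observed counts from the education-by-income example.
# Rows: None, Some College, Bachelor, Post-grad.
# Columns: <10,000; 10,000-30,000; 30,001-50,000; >50,000.
counts = [
    [100, 85, 50, 15],
    [85, 110, 60, 20],
    [55, 95, 175, 50],
    [10, 10, 15, 65],
]

# Grand total: the number of cases in the entire table.
grand_total = sum(sum(row) for row in counts)

# Joint distribution: each cell count divided by the grand total.
joint = [[cell / grand_total for cell in row] for row in counts]

# Print each row of the joint distribution as percents.
for row in joint:
    print("  ".join(f"{100 * p:5.1f}%" for p in row))
```

The proportions over all 16 cells add up to 1, which is a quick check that no case was counted twice or left out.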
Marginal Distributions of Categorical variables:
The marginal distributions of each categorical variable are obtained from
row and column totals. Basically we are examining the distributions of a
single variable in the two-way table. Marginal distributions allow us to
compare the relative frequencies among the levels of a single categorical
variable.
2. Find the marginal distribution of education and income for the
example above.
Marginal distribution of education:

Years of Education   Percent
None                   25%
Some College           27.5%
Bachelor               37.5%
Post-grad              10%
Total                 100%

Marginal distribution of income:

Income               Percent
<10,000                25%
10,000-30,000          30%
30,001-50,000          30%
>50,000                15%
Total                 100%
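As a rough sketch of the same calculation in Python, the marginal distributions come from the row and column totals divided by the grand total. The names `education_marginal` and `income_marginal` are illustrative choices, not terms from the notes.

```python
# Observed counts: rows are education levels, columns are income categories.
counts = [
    [100, 85, 50, 15],   # None
    [85, 110, 60, 20],   # Some College
    [55, 95, 175, 50],   # Bachelor
    [10, 10, 15, 65],    # Post-grad
]

row_totals = [sum(row) for row in counts]        # totals for each education level
col_totals = [sum(col) for col in zip(*counts)]  # totals for each income category
grand_total = sum(row_totals)

# Marginal distributions: totals divided by the grand total.
education_marginal = [t / grand_total for t in row_totals]
income_marginal = [t / grand_total for t in col_totals]

print("education:", [f"{100 * p:.1f}%" for p in education_marginal])
print("income:   ", [f"{100 * p:.1f}%" for p in income_marginal])
```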
Conditional Distributions of Categorical variables:
In conditional distributions, we find the distribution of one categorical
variable given a common level of another categorical variable.
3. For the example above, find the conditional distribution of education
among people earning more than $50,000.
Education       Count (income > $50,000)   Percent
None                       15                10.0
Some College               20                13.3
B.S.                       50                33.3
Post-grad                  65                43.3
Total                     150               100
4. For the example above, find the conditional distribution of people’s
earnings given they have a Bachelor’s Degree.
                       Annual Income (thousands of dollars)
Bachelor's Degree      <10    10-30    >30-50     >50
Count                   55       95       175      50
Percent               14.7     25.3      46.7    13.3
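Both conditional distributions above follow the same pattern: pick one column (or one row) of the table and divide each count by that column's (or row's) total. A minimal Python sketch, with variable names invented for this illustration:

```python
# Observed counts: rows are education levels, columns are income categories.
counts = [
    [100, 85, 50, 15],   # None
    [85, 110, 60, 20],   # Some College
    [55, 95, 175, 50],   # Bachelor
    [10, 10, 15, 65],    # Post-grad
]

# Question 3: distribution of education among people earning more than $50,000.
# Take the last column and divide by its total (150).
high_income = [row[3] for row in counts]
education_given_high_income = [c / sum(high_income) for c in high_income]

# Question 4: distribution of income among people with a Bachelor's degree.
# Take the Bachelor row and divide by its total (375).
bachelor = counts[2]
income_given_bachelor = [c / sum(bachelor) for c in bachelor]

print([round(100 * p, 1) for p in education_given_high_income])
print([round(100 * p, 1) for p in income_given_bachelor])
```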
Inference for Two-Way Tables: Section 9.1
We will now define a significance test to examine the relationships between
two categorical variables. This new test starts by presenting the data as a
two-way table.
Example:
Continue with the years of education and income.
                                        Income
Years of Education    <10,000   10,000-30,000   30,001-50,000   >50,000   Total
None                      100              85              50        15     250
Some College               85             110              60        20     275
Bachelor                   55              95             175        50     375
Post-grad                  10              10              15        65     100
Total                     250             300             300       150   1,000
Note: The Years of Education is the natural explanatory variable for
differences in income. Below is a table of percents that describe how
income levels vary with the years of education. Changes in this conditional
distribution of income indicate that years of education is associated with
income.
                                        Income
Years of Education    <10,000   10,000-30,000   30,001-50,000   >50,000
None                   40.0%*          34.0%           20.0%       6.0%
Some College           30.9%           40.0%*          21.8%       7.3%
Bachelor               14.7%           25.3%           46.7%*     13.3%
Post-grad              10.0%           10.0%           15.0%      65.0%*
All                    25.0%           30.0%           30.0%      15.0%

The figures marked with an asterisk (*) are the maximum percent in each
education row. Note how this maximum moves across the table with increasing
college education.
The differences among the conditional distributions appear to be large. A
statistical test, the Chi-Square Test, will tell us whether or not these
differences can be attributed to chance.
Chi-Square Test for Two-Way Tables:
Step 1. Write the null and alternative hypotheses.
• The null hypothesis, H0, is that there is no association between the
row variable and the column variable.
• The alternative hypothesis, Ha, is that there is an association between
the two variables.
Step 2. Arrange the observed counts in a two-way table.
OBSERVED VALUES
                              INCOME CATEGORY
YEARS OF EDUCATION     <10K   10K-30K   >30K-50K    >50K   TOTAL
NO COLLEGE              100        85         50      15     250
SOME COLLEGE             85       110         60      20     275
BS DEGREE                55        95        175      50     375
POST GRAD                10        10         15      65     100
TOTAL                   250       300        300     150    1000
Step 3. Determine the counts that would be expected in each cell if H0 were
true. Expected cell count = (row total x column total) / grand total.
EXPECTED VALUES
                              INCOME CATEGORY
YEARS OF EDUCATION     <10K   10K-30K   >30K-50K    >50K   TOTAL
NO COLLEGE            62.50     75.00      75.00   37.50     250
SOME COLLEGE          68.75     82.50      82.50   41.25     275
BS DEGREE             93.75    112.50     112.50   56.25     375
POST GRAD             25.00     30.00      30.00   15.00     100
TOTAL                   250       300        300     150    1000
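The expected counts can be reproduced directly from the row and column totals of the observed table. The following Python sketch applies the Step 3 formula to every cell; the variable names are illustrative only.

```python
# Observed counts: rows are education levels, columns are income categories.
observed = [
    [100, 85, 50, 15],   # NO COLLEGE
    [85, 110, 60, 20],   # SOME COLLEGE
    [55, 95, 175, 50],   # BS DEGREE
    [10, 10, 15, 65],    # POST GRAD
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected cell count = row total x column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

for row in expected:
    print("  ".join(f"{e:7.2f}" for e in row))
```

Note that the expected counts keep the same row and column totals as the observed counts; only the way cases are spread across cells changes.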
Rule: For this test to be valid, 1) the average of the expected values
must be 5 or more, 2) no expected value may be less than 1, and 3) fewer
than 20% of the expected values may be less than 5. In a 2x2 table, all
four expected values must be 5 or more.
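One way to check the rule mechanically is sketched below. The function name `chi_square_conditions_met` is hypothetical, made up for this example; it simply encodes the three conditions and the 2x2 special case as stated in the rule.

```python
# Expected counts for the education-by-income example.
expected = [
    [62.50, 75.00, 75.00, 37.50],
    [68.75, 82.50, 82.50, 41.25],
    [93.75, 112.50, 112.50, 56.25],
    [25.00, 30.00, 30.00, 15.00],
]

def chi_square_conditions_met(expected):
    """Return True when the expected counts satisfy the rule stated above."""
    cells = [e for row in expected for e in row]
    # A 2x2 table has the stricter requirement: all four values at least 5.
    if len(expected) == 2 and all(len(row) == 2 for row in expected):
        return all(e >= 5 for e in cells)
    average_ok = sum(cells) / len(cells) >= 5
    none_below_one = all(e >= 1 for e in cells)
    share_below_five = sum(1 for e in cells if e < 5) / len(cells)
    return average_ok and none_below_one and share_below_five < 0.20

print(chi_square_conditions_met(expected))
```

For this example the smallest expected count is 15.00, so every condition is met with room to spare.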
Step 4. Determine the Chi-Square test statistic.
• The Chi-square statistic is a measure of how much difference there
is between the observed counts and the expected counts. Each cell
contributes one term to the statistic.
The formula for the statistic is:
$$\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$$
The sum is over all r x c cells in the table.
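As a worked instance of one term in the sum, take the first cell (no college, income under 10K), where the observed count is 100 and the expected count is 62.50:

```latex
\frac{(\text{observed} - \text{expected})^2}{\text{expected}}
  = \frac{(100 - 62.50)^2}{62.50}
  = \frac{1406.25}{62.50}
  = 22.50
```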
Below is a table showing the Chi-Square contributions for each cell.
CHI SQUARE CONTRIBUTIONS
                              INCOME CATEGORY
YEARS OF EDUCATION     <10K   10K-30K   >30K-50K      >50K    TOTAL
NO COLLEGE           22.50*      1.33       8.33     13.50    45.67
SOME COLLEGE           3.84      9.17       6.14     10.95    30.09
BS DEGREE             16.02      2.72     34.72*      0.69    54.16
POST GRAD              9.00     13.33       7.50   166.67*   196.50
TOTAL                 51.36     26.56      56.69    191.81   326.41
P VALUE FOR CHI SQUARE = 0.00000
The three largest contributions are marked with an asterisk (*), and they
make up 68.6% of the total.
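The contributions, their total, and the share taken by the three largest cells can be recomputed from the observed counts. This Python sketch is one possible way to do it, with names chosen for the illustration:

```python
# Observed counts: rows are education levels, columns are income categories.
observed = [
    [100, 85, 50, 15],
    [85, 110, 60, 20],
    [55, 95, 175, 50],
    [10, 10, 15, 65],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected counts under H0: row total x column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Contribution of each cell: (observed - expected)^2 / expected.
contributions = [
    [(o - e) ** 2 / e for o, e in zip(obs_row, exp_row)]
    for obs_row, exp_row in zip(observed, expected)
]

# The chi-square statistic is the sum of all cell contributions.
chi_square = sum(sum(row) for row in contributions)

# Share of the statistic that comes from the three largest contributions.
largest_three = sorted((c for row in contributions for c in row), reverse=True)[:3]
share = sum(largest_three) / chi_square

print(f"chi-square = {chi_square:.2f}")
print(f"three largest cells account for {100 * share:.1f}% of the total")
```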
The χ² statistic is always zero or positive, and it is zero only when the
observed counts are exactly equal to the expected counts. Large values of
χ² are evidence against H0 because they say the observed counts are far
from what we would expect if H0 were true. This is consistent with other
tests where large values of the test statistic are evidence against H0.
The Chi-square distributions are a family of distributions that take only
positive values and are skewed to the right. A specific chi-square
distribution is specified by one parameter, called the degrees of freedom.
The degrees of freedom is equal to (rows - 1)(columns - 1). For this
example, (4 - 1)(4 - 1) = 9.
The P-value is the area to the right of χ² under the chi-square density curve.
The P-value is determined by software.
P VALUE FOR CHI SQUARE = 326.41 WITH 9 DF = .0000 (that is, less than .00005)
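The notes leave the P-value to software. For readers who want to see where the number comes from, here is a sketch that uses only Python's standard library. The helper `chi2_sf` is a name invented for this example; it assumes whole-number degrees of freedom and builds the upper-tail area by the standard recurrence that steps the degrees of freedom up by 2, starting from the known results for 1 and 2 degrees of freedom. A statistics package would normally be used instead.

```python
import math

def chi2_sf(x, df):
    """Upper-tail area P(X >= x) for a chi-square variable with integer df >= 1."""
    half = x / 2.0
    if df % 2 == 1:
        # df = 1: chi-square is the square of a standard normal.
        p = math.erfc(math.sqrt(half))
        nu = 1
    else:
        # df = 2: chi-square is exponential with mean 2.
        p = math.exp(-half)
        nu = 2
    # Recurrence: Q(x; nu + 2) = Q(x; nu) + (x/2)^(nu/2) e^(-x/2) / Gamma(nu/2 + 1).
    while nu < df:
        p += half ** (nu / 2.0) * math.exp(-half) / math.gamma(nu / 2.0 + 1.0)
        nu += 2
    return p

rows, columns = 4, 4
df = (rows - 1) * (columns - 1)   # degrees of freedom for the 4 x 4 table
p_value = chi2_sf(326.41, df)

print(f"df = {df}")
print(f"P-value = {p_value:.3e}")
```

The result is far smaller than any printed table or four-decimal display can show, which is why the output above reports it as .0000.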
A small P-value is evidence against H0, in favor of Ha. If the P-value is
less than or equal to α, we reject H0 and conclude that there is an
association between the row variable and the column variable. It tells us
nothing about the nature of the association.
In order to explore the association between the row and column variables,
we should always accompany the chi-square test by a description of what the
data show, including the following:
• Calculate and compare appropriate percents.
• Look at the Chi Square contribution for each cell. This will show
where the big differences are between the observed and expected
counts. Note that the three largest Chi-Square contributions account
for 68.6% of the total Chi-Square value.
• Look at bar graphs of the data.
SPSS Chi-square output:
years * income Crosstabulation
                                                    income
years                           <10,000  10,000-30,000  30,001-50,000  >50,000   Total
None          Count                 100             85             50       15     250
              Expected Count       62.5           75.0           75.0     37.5   250.0
Some College  Count                  85            110             60       20     275
              Expected Count       68.8           82.5           82.5     41.3   275.0
Bachelor      Count                  55             95            175       50     375
              Expected Count       93.8          112.5          112.5     56.3   375.0
Post-grad     Count                  10             10             15       65     100
              Expected Count       25.0           30.0           30.0     15.0   100.0
Total         Count                 250            300            300      150    1000
              Expected Count      250.0          300.0          300.0    150.0  1000.0
Chi-Square Tests
                                  Value        df   Asymp. Sig. (2-sided)
Pearson Chi-Square                326.413(a)    9   .000
Likelihood Ratio                  261.039       9   .000
Linear-by-Linear Association      166.046       1   .000
N of Valid Cases                  1000
a. 0 cells (.0%) have expected count less than 5. The minimum
expected count is 15.00.
Note: The P Value is one-sided even though SPSS calls it two sided.
The Chi-square Test and the Z test:
We can use the chi-square test to compare any number of proportions. If
we are comparing just two proportions for a two-sided test, we can use the z
test or the χ² test. These two tests always agree: the square of the z
statistic equals the χ² statistic, so the absolute value of z equals the
square root of the Chi-Square value.
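The agreement between the two tests can be checked numerically. The sketch below uses made-up counts (30 successes out of 100 in one group, 45 out of 100 in the other) that do not come from the notes. It assumes the usual pooled two-proportion z statistic and the chi-square statistic without a continuity correction.

```python
import math

# Hypothetical data, not from the notes: successes and sample sizes for two groups.
x1, n1 = 30, 100
x2, n2 = 45, 100

# Two-proportion z statistic with the pooled sample proportion.
p1, p2 = x1 / n1, x2 / n2
pooled = (x1 + x2) / (n1 + n2)
z = (p1 - p2) / math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))

# Chi-square statistic for the same data arranged as a 2 x 2 table
# (rows: groups; columns: success, failure).
observed = [[x1, n1 - x1], [x2, n2 - x2]]
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)
chi_square = sum(
    (observed[i][j] - row_totals[i] * col_totals[j] / grand_total) ** 2
    / (row_totals[i] * col_totals[j] / grand_total)
    for i in range(2)
    for j in range(2)
)

print(f"z = {z:.4f}, z squared = {z * z:.4f}, chi-square = {chi_square:.4f}")
```

The sign of z records which group has the larger proportion; the chi-square statistic discards that direction, which is why it matches only the two-sided z test.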