Lecture 15, Sections 9.1 & 9.2

Chapter 2, Section 2.5 and Chapter 9, Sections 9.1 & 9.2

Analysis of Two-Way Tables: Section 2.5

The variables we have worked with recently have been quantitative variables (numbers). Now we will work with categorical variables. Two-way tables compare two categorical variables measured on a set of cases.

Examples
- Gender versus major
- Political party versus voting status

Two-Way Table: Describes the relationship between two categorical variables. Represents a table of counts.

Example: Years of education and income. Suppose a random sample of 1,000 people was selected and the following data were obtained:

                                        Income
Years of Education   <10,000   10,000-30,000   30,001-50,000   >50,000   Total
None                     100              85              50        15     250
Some College              85             110              60        20     275
Bachelor                  55              95             175        50     375
Post-grad                 10              10              15        65     100
Total                    250             300             300       150   1,000

Note: Each person surveyed represents a case. Each case fits into exactly one education class and one income category, so each case fits in one and only one cell of the body of the table.

The Joint Distribution of the Categorical Variables: If we want the proportion of cases associated with any cell in the table, we divide the count for that cell by the grand total (the total number of cases in the entire table). If we do this for each cell, we will have the joint distribution of our two categorical variables.

1. Find the joint distribution for the example above.

                                        Income
Years of Education   <10,000   10,000-30,000   30,001-50,000   >50,000   Total
None                   10.0%            8.5%            5.0%      1.5%   25.0%
Some College            8.5%           11.0%            6.0%      2.0%   27.5%
Bachelor                5.5%            9.5%           17.5%      5.0%   37.5%
Post-grad               1.0%            1.0%            1.5%      6.5%   10.0%
Total                  25.0%           30.0%           30.0%     15.0%    100%

Marginal Distributions of Categorical variables: The marginal distributions of each categorical variable are obtained from row and column totals. In effect, we are examining the distribution of a single variable in the two-way table. Marginal distributions allow us to compare the relative frequencies among the levels of a single categorical variable.

2.
Find the marginal distribution of education and income for the example above.

Marginal distribution of education:

Years of Education   None   Some College   Bachelor   Post-grad
Percent               25%          27.5%      37.5%         10%

Marginal distribution of income:

Income    <10,000   10,000-30,000   30,001-50,000   >50,000
Percent       25%             30%             30%       15%

Conditional Distributions of Categorical variables: In conditional distributions, we find the distribution of one categorical variable given a common level of another categorical variable.

3. For the example above, find the conditional distribution of education among people earning more than $50,000.

Years of Education   Count for Income >50,000   Percent
None                                       15      10.0
Some College                               20      13.3
Bachelor                                   50      33.3
Post-grad                                  65      43.3
Total                                     150     100

4. For the example above, find the conditional distribution of people's earnings given they have a Bachelor's degree.

Annual Income    Bachelor's Degree Count   Percent
<10,000                               55      14.7
10,000-30,000                         95      25.3
30,001-50,000                        175      46.7
>50,000                               50      13.3

Inference for Two-Way Tables: Section 9.1

We will now define a significance test to examine the relationship between two categorical variables. This new test starts by presenting the data as a two-way table.

Example: Continue with the years of education and income.

                                        Income
Years of Education   <10,000   10,000-30,000   30,001-50,000   >50,000   Total
None                     100              85              50        15     250
Some College              85             110              60        20     275
Bachelor                  55              95             175        50     375
Post-grad                 10              10              15        65     100
Total                    250             300             300       150   1,000

Note: Years of education is the natural explanatory variable for differences in income. Below is a table of percents that describes how income levels vary with the years of education. Changes in the conditional distribution of income from one education level to the next indicate that years of education is associated with income.
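Each row percent in this table is a cell count divided by its row total. As a rough check, the minimal Python sketch below recomputes those row percents, along with the joint, marginal, and conditional distributions from Section 2.5. The variable names are illustrative assumptions and are not part of the lecture; only the counts come from the example.

```python
# Counts from the education-by-income example.
education = ["None", "Some College", "Bachelor", "Post-grad"]
income = ["<10,000", "10,000-30,000", "30,001-50,000", ">50,000"]
counts = [
    [100, 85, 50, 15],
    [85, 110, 60, 20],
    [55, 95, 175, 50],
    [10, 10, 15, 65],
]

row_totals = [sum(row) for row in counts]
col_totals = [sum(col) for col in zip(*counts)]
grand_total = sum(row_totals)

# Joint distribution: each cell divided by the grand total.
joint = [[n / grand_total for n in row] for row in counts]

# Marginal distributions: row and column totals divided by the grand total.
marginal_education = [t / grand_total for t in row_totals]
marginal_income = [t / grand_total for t in col_totals]

# Conditional distribution of income within each education level (row percents).
income_given_education = [
    [n / total for n in row] for row, total in zip(counts, row_totals)
]

# Conditional distribution of education among people earning more than $50,000.
top = income.index(">50,000")
education_given_top_income = [row[top] / col_totals[top] for row in counts]

# Print the row percents, one education level per line.
for label, row in zip(education, income_given_education):
    print(f"{label:<13}" + "".join(f"{p:8.1%}" for p in row))
```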
                                   Income
Years of Education   <10,000   10,000-30,000   30,001-50,000   >50,000
None                  40.0%*           34.0%           20.0%      6.0%
Some College          30.9%            40.0%*          21.8%      7.3%
Bachelor              14.7%            25.3%           46.7%*    13.3%
Post-grad             10.0%            10.0%           15.0%     65.0%*
All                   25.0%            30.0%           30.0%     15.0%

The starred figures are the maximum percent in each row. Note how this maximum moves across the table with increasing college education. The differences among the conditional distributions appear to be large. A statistical test, the Chi-Square Test, will tell us whether or not these differences can be attributed to chance.

Chi-Square Test for Two-Way Tables:

Step 1. Write the null and alternative hypotheses.
The null hypothesis, $H_0$, is that there is no association between the row variable and the column variable. The alternative hypothesis, $H_a$, is that there is an association between the two variables.

Step 2. Arrange the observed counts in a two-way table.

OBSERVED VALUES
                               INCOME CATEGORY
YEARS OF EDUCATION   <10K   10K-30K   >30K-50K   >50K   TOTAL
NO COLLEGE            100        85         50     15     250
SOME COLLEGE           85       110         60     20     275
BS DEGREE              55        95        175     50     375
POST GRAD              10        10         15     65     100
TOTAL                 250       300        300    150    1000

Step 3. Determine the counts that would be expected in each cell if $H_0$ were true.
Expected cell count = row total x column total / grand total.

EXPECTED VALUES
                               INCOME CATEGORY
YEARS OF EDUCATION    <10K   10K-30K   >30K-50K    >50K   TOTAL
NO COLLEGE           62.50     75.00      75.00   37.50     250
SOME COLLEGE         68.75     82.50      82.50   41.25     275
BS DEGREE            93.75    112.50     112.50   56.25     375
POST GRAD            25.00     30.00      30.00   15.00     100
TOTAL                  250       300        300     150    1000

Rule: In order for this test to be valid, 1) the average of the expected values must be 5 or more, 2) no expected value may be less than 1, and 3) fewer than 20% of the expected values may be less than 5. In a 2x2 table all four expected values must be 5 or more.

Step 4. Determine the Chi-Square test statistic for each cell.
The Chi-square statistic is a measure of how much difference there is between the observed count and the expected count for each cell. The formula for the statistic is:

$$X^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$$

The sum is over all r x c cells in the table. Below is a table showing the Chi-Square contribution of each cell.

CHI-SQUARE CONTRIBUTIONS
                                 INCOME CATEGORY
YEARS OF EDUCATION    <10K    10K-30K   >30K-50K     >50K     TOTAL
NO COLLEGE           22.50*      1.33       8.33    13.50     45.67
SOME COLLEGE          3.84       9.17       6.14    10.95     30.09
BS DEGREE            16.02       2.72      34.72*    0.69     54.16
POST GRAD             9.00      13.33       7.50   166.67*   196.50
TOTAL                51.36      26.56      56.69   191.81    326.41

P VALUE FOR CHI SQUARE: 0.00000

The three largest contributions are starred; together (166.67 + 34.72 + 22.50 = 223.89) they make up 68.6% of the total.

The $X^2$ statistic is always zero or positive, and it is zero only when the observed counts are exactly equal to the expected counts. Large values of $X^2$ are evidence against $H_0$ because they say the observed counts are far from what we would expect if $H_0$ were true. This is consistent with other tests where large values of the test statistic are evidence against $H_0$.

The Chi-square distributions are a family of distributions that take only positive values and are skewed to the right. A specific chi-square distribution is specified by one parameter, called the degrees of freedom. The degrees of freedom equal (rows - 1)(columns - 1). For this example, (4 - 1)(4 - 1) = 9.

The P-value is the area to the right of $X^2$ under the chi-square density curve. The P-value is determined by software.

P VALUE FOR CHI SQUARE = 326.41 WITH 9 DF = .0000

A small P-value is evidence against $H_0$, in favor of $H_a$. If the P-value is less than or equal to $\alpha$, we reject $H_0$ and conclude that there is an association between the row variable and the column variable. The test tells us nothing about the nature of the association.
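The calculations in Steps 3 and 4 can be sketched in a few lines of Python. The sketch below is illustrative only: the function names are assumptions, and `chi2_sf` is a hand-rolled helper that uses the closed-form series for integer degrees of freedom in place of the statistical software the lecture refers to. Only the observed counts come from the example.

```python
import math

# Observed counts from the education-by-income example.
# Rows: None, Some College, Bachelor, Post-grad.
# Columns: <10,000; 10,000-30,000; 30,001-50,000; >50,000.
observed = [
    [100, 85, 50, 15],
    [85, 110, 60, 20],
    [55, 95, 175, 50],
    [10, 10, 15, 65],
]


def expected_counts(table):
    """Expected count = row total x column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def contributions(table):
    """Per-cell (observed - expected)^2 / expected."""
    expected = expected_counts(table)
    return [
        [(o - e) ** 2 / e for o, e in zip(obs_row, exp_row)]
        for obs_row, exp_row in zip(table, expected)
    ]


def chi2_sf(x, df):
    """Area to the right of x under the chi-square density with df degrees of freedom.

    Hypothetical helper for this sketch; valid for positive integer df.
    """
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum of (x/2)^r / r! for r = 0 .. df/2 - 1
        term = total = 1.0
        for r in range(1, df // 2):
            term *= (x / 2) / r
            total += term
        return math.exp(-x / 2) * total
    # Odd df: normal tail term plus a finite series in odd powers of sqrt(x)
    root = math.sqrt(x)
    density = math.exp(-x / 2) / math.sqrt(2 * math.pi)
    term, total = root, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return math.erfc(root / math.sqrt(2)) + 2 * density * total


cells = contributions(observed)
statistic = sum(sum(row) for row in cells)
df = (len(observed) - 1) * (len(observed[0]) - 1)
p_value = chi2_sf(statistic, df)

# Share of the statistic that comes from the three largest cell contributions.
largest_three = sorted((c for row in cells for c in row), reverse=True)[:3]
top_three_share = sum(largest_three) / statistic

print(f"X^2 = {statistic:.2f} with {df} df, P-value = {p_value:.3g}")
print(f"Three largest contributions: {top_three_share:.1%} of X^2")
```

Run against the example counts, the statistic and degrees of freedom should match the 326.41 and 9 reported above, and the P-value is far below any conventional $\alpha$.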
In order to explore the association between the row and column variables, we should always accompany the chi-square test with a description of what the data show, including the following:

- Calculate and compare appropriate percents.
- Look at the Chi-Square contribution for each cell. This will show where the big differences are between the observed and expected counts. Note that the three largest Chi-Square contributions account for 68.6% of the total Chi-Square value.
- Look at bar graphs of the data.

SPSS Chi-square output:

years * income Crosstabulation

                                               income
years                           <10,000   10,000-30,000   30,001-50,000   >50,000    Total
None           Count                100              85              50        15      250
               Expected Count      62.5            75.0            75.0      37.5    250.0
Some College   Count                 85             110              60        20      275
               Expected Count      68.8            82.5            82.5      41.3    275.0
Bachelor       Count                 55              95             175        50      375
               Expected Count      93.8           112.5           112.5      56.3    375.0
Post-grad      Count                 10              10              15        65      100
               Expected Count      25.0            30.0            30.0      15.0    100.0
Total          Count                250             300             300       150     1000
               Expected Count     250.0           300.0           300.0     150.0   1000.0

Chi-Square Tests

                                    Value   df   Asymp. Sig. (2-sided)
Pearson Chi-Square             326.413(a)    9                    .000
Likelihood Ratio                  261.039    9                    .000
Linear-by-Linear Association      166.046    1                    .000
N of Valid Cases                     1000

a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 15.00.

Note: The P-value is a one-sided (right-tail) area under the chi-square curve, even though SPSS labels it two-sided.

The Chi-square Test and the Z test: We can use the chi-square test to compare any number of proportions. If we are comparing just two proportions with a two-sided alternative, we can use either the z test or the $X^2$ test. These two tests always agree: the chi-square statistic is the square of the z statistic, so the absolute value of z equals the square root of the Chi-Square value.
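The agreement between the two tests can be checked numerically. The sketch below uses hypothetical 2x2 counts that are not from the lecture, and it assumes the z statistic is computed with the pooled standard error and that no continuity correction is applied to the chi-square statistic. The function names are illustrative.

```python
import math

# Hypothetical 2x2 counts (illustration only, not data from the lecture).
# Rows are two groups; columns are success / failure.
table = [
    [30, 70],
    [45, 55],
]


def pooled_z(t):
    """Two-sample z statistic for comparing proportions, pooled standard error."""
    (a, b), (c, d) = t
    n1, n2 = a + b, c + d
    p1, p2 = a / n1, c / n2
    pooled = (a + c) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se


def chi_square(t):
    """Chi-square statistic: sum of (observed - expected)^2 / expected."""
    row_totals = [sum(row) for row in t]
    col_totals = [sum(col) for col in zip(*t)]
    grand_total = sum(row_totals)
    total = 0.0
    for i, row in enumerate(t):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / grand_total
            total += (obs - exp) ** 2 / exp
    return total


z = pooled_z(table)
x2 = chi_square(table)

# Two-sided P-value for z, and right-tail P-value for chi-square with 1 df.
p_z = math.erfc(abs(z) / math.sqrt(2))
p_x2 = math.erfc(math.sqrt(x2 / 2))

print(f"z = {z:.4f}, z^2 = {z * z:.4f}, X^2 = {x2:.4f}")
print(f"P-value (z, two-sided) = {p_z:.4f}, P-value (X^2, 1 df) = {p_x2:.4f}")
```

With one degree of freedom, the right-tail chi-square area at $X^2 = z^2$ equals the two-sided normal area beyond |z|, which is why the two P-values coincide.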