Chapter 9: Analysis of Two-Way Tables

II. Data Analysis for Two-Way Tables (IPS section 9.1 pages 612-620)

A. The Two-Way Table – A two-way table of counts organizes data about two categorical variables. Values of the row variable label the rows that run across the table, and values of the column variable label the columns that run down the table. Two-way tables are often used to summarize large amounts of data by grouping outcomes into categories. Each combination of values for these two variables is called a cell. For each cell, a proportion is obtained by dividing the cell entry by the total sample size. The collection of these proportions is the joint distribution of the two categorical variables.

B. Marginal Distributions – The row totals and column totals in a two-way table give the marginal distributions of the two variables separately. It is clearer to present these distributions as percents of the table total. Marginal distributions do not give any information about the relationship between the variables.

C. Describing Relations in Two-Way Tables – Relationships among categorical variables are described by calculating appropriate percents from the counts given.

D. Conditional Distributions – When we condition on the value of one variable and calculate the distribution of the other variable, we obtain a conditional distribution. To find the conditional distribution of the row variable for one specific value of the column variable, look only at that one column in the table. Find each entry in the column as a percent of the column total. There is a conditional distribution of the row variable for each column in the table. Comparing these conditional distributions is one way to describe the association between the row and column variables. It is particularly useful when the column variable is the explanatory variable.
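The joint, marginal, and conditional distributions described above can be sketched in a few lines of Python. This is only an illustration: the 2 x 3 table of counts and all variable names are hypothetical, not taken from IPS.

```python
# Hypothetical 2 x 3 table of counts (illustrative numbers, not from IPS).
# Each inner list is one row of the table; positions within it are the columns.
counts = [
    [20, 30, 50],
    [30, 30, 40],
]

row_totals = [sum(row) for row in counts]
col_totals = [sum(col) for col in zip(*counts)]
n = sum(row_totals)  # table total (total sample size)

# Joint distribution: each cell entry divided by the table total.
joint = [[cell / n for cell in row] for row in counts]

# Marginal distributions: row and column totals as proportions of the table total.
row_marginal = [t / n for t in row_totals]
col_marginal = [t / n for t in col_totals]

# Conditional distribution of the row variable for each column:
# each entry in a column divided by that column's total.
row_given_col = [
    [counts[i][j] / col_totals[j] for i in range(len(counts))]
    for j in range(len(col_totals))
]

for j, dist in enumerate(row_given_col):
    print(f"column {j}: {[round(p, 3) for p in dist]}")
```

Each list in `row_given_col` sums to 1, and comparing those lists across columns is the comparison of conditional distributions that the text describes.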
When the row variable is explanatory, find the conditional distribution of the column variable for each row and compare these distributions. Bar graphs are a flexible means of presenting categorical data. There is no single best way to describe an association between two categorical variables.

E. Simpson’s Paradox – An association or comparison that holds for all of several groups can reverse direction when the data are combined to form a single group. This reversal is called Simpson’s paradox.

III. Inference for Two-Way Tables (IPS section 9.2 pages 620-629)

A. The Hypothesis: No Association – The null hypothesis $H_0$ of interest in a two-way table is that there is no association between the row variable and the column variable.

B. Expected Cell Counts – To test the null hypothesis in r x c tables, we compare the observed cell counts with expected cell counts calculated under the assumption that the null hypothesis is true. A numerical summary of the comparison will be our test statistic. With n the table total,

$$\text{expected cell count} = \frac{\text{row total} \times \text{column total}}{n}$$

C. Chi-Square Statistic – The chi-square statistic is a measure of how much the observed cell counts in a two-way table diverge from the expected cell counts. The recipe for the statistic is

$$X^2 = \sum \frac{(\text{observed count} - \text{expected count})^2}{\text{expected count}}$$

where “observed” represents an observed sample count, “expected” represents the expected count for the same cell, and the sum is over all r x c cells in the table.

D. Chi-Square Distribution (denoted $\chi^2$) – Like the t distribution, the $\chi^2$ distributions form a family described by a single parameter, the degrees of freedom. We use $\chi^2(\text{df})$ to indicate a particular member of this family. $\chi^2$ distributions take only positive values and are skewed to the right.

E. Chi-Square Test for Two-Way Tables – The null hypothesis $H_0$ is that there is no association between the row and column variables in a two-way table. The alternative is that these variables are related.
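As a minimal sketch of the expected-count formula and the chi-square statistic from parts B and C above, the following Python computes both for a small table. The table of counts and the function names are hypothetical, chosen only for illustration.

```python
def expected_counts(observed):
    """Expected count for each cell: row total x column total / n."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square_statistic(observed):
    """Sum of (observed - expected)^2 / expected over all r x c cells."""
    expected = expected_counts(observed)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )


# Hypothetical 2 x 3 table of observed counts (not from IPS).
observed = [[20, 30, 50], [30, 30, 40]]

expected = expected_counts(observed)   # [[25.0, 30.0, 45.0], [25.0, 30.0, 45.0]]
x2 = chi_square_statistic(observed)    # 28/9, about 3.11

print(expected)
print(round(x2, 4))
```

Cells whose observed count equals the expected count (the middle column here) contribute nothing to the statistic, so the value of the statistic comes entirely from the cells that diverge from what the null hypothesis predicts.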
If $H_0$ is true, the chi-square statistic $X^2$ has approximately a $\chi^2$ distribution with (r-1)(c-1) degrees of freedom. The P-value for the chi-square test is $P(\chi^2 \ge X^2)$, where $\chi^2$ is a random variable having the $\chi^2(\text{df})$ distribution with df = (r-1)(c-1). The chi-square test always uses the upper tail of the $\chi^2$ distribution because any deviation from the null hypothesis makes the statistic larger. The approximation of the distribution of $X^2$ by $\chi^2$ becomes more accurate as the cell counts increase.

F. The Chi-Square Test and the z Test – A comparison of the proportions of “successes” in two populations leads to a 2 x 2 table. We can compare two population proportions either by the chi-square test or by the two-sample z test from section 8.2. In fact, these tests always give exactly the same result, because the $X^2$ statistic is equal to the square of the z statistic, and $\chi^2(1)$ critical values are equal to the squares of the corresponding N(0,1) critical values. The advantage of the z test is that we can test either one-sided or two-sided alternatives. The chi-square test always tests the two-sided alternative. Of course, the chi-square test can compare more than two populations, whereas the z test compares only two.
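The relationship in part F can be checked numerically. The sketch below uses a hypothetical 2 x 2 table (not from IPS) and assumes the two-sample z test takes the pooled-proportion form; the variable names are illustrative. For df = 1 the upper-tail probability is computed with `math.erfc`, relying on the fact that a $\chi^2(1)$ variable is the square of a standard normal variable.

```python
import math

# Hypothetical 2 x 2 table: rows are the two populations,
# columns are (successes, failures). Illustrative numbers only.
observed = [[30, 70], [45, 55]]

# Chi-square statistic from observed and expected counts.
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)
x2 = sum(
    (observed[i][j] - row_totals[i] * col_totals[j] / n) ** 2
    / (row_totals[i] * col_totals[j] / n)
    for i in range(2)
    for j in range(2)
)
df = (2 - 1) * (2 - 1)  # (r-1)(c-1) = 1 for a 2 x 2 table

# Two-sample z statistic with a pooled proportion (assumed form of the z test).
n1, n2 = row_totals
p1, p2 = observed[0][0] / n1, observed[1][0] / n2
pooled = (observed[0][0] + observed[1][0]) / (n1 + n2)
z = (p1 - p2) / math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))

# P-values: upper tail of chi-square(1) versus two-sided tail of N(0,1).
p_chi2 = math.erfc(math.sqrt(x2 / 2))
p_z = math.erfc(abs(z) / math.sqrt(2))

print(round(x2, 4), round(z ** 2, 4))
print(round(p_chi2, 4), round(p_z, 4))
```

The two printed pairs agree, which is the equivalence the text states for the two-sided alternative. For tables with df greater than 1 the `erfc` shortcut no longer applies; a library routine such as `scipy.stats.chi2.sf` would be one option there.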