STP 420 SUMMER 2005

STP 420 INTRODUCTION TO APPLIED STATISTICS
NOTES

PART 3 – TOPICS IN INFERENCE

CHAPTER 9 INFERENCE FOR TWO-WAY TABLES

9.1 Inference for two-way tables

                 Freq. Binge drinker
Gender           Yes        No     Total
Male            1630      5550      7180
Female          1684      8232      9916
Total           3314     13782     17096

This is a 2 × 2 table: it has 2 rows and 2 columns. The table presents two categorical variables, each with two categories. Each value sits in a cell.
A 3 × 2 table has 3 rows and 2 columns.
An r × c table has r rows and c columns.

Describing relations in two-way tables

Joint distribution – the distribution formed by expressing the count in each cell as a percent of the grand total.

Conditional distributions – the distributions formed by expressing the count in each cell as a percent of its row total. The same applies when the count is expressed as a percent of its column total.

$$\text{expected cell count} = \frac{\text{row total} \times \text{column total}}{n}$$

where n is the grand total of the table.

Two-way table for Frequent Binge drinkers and Gender

Each cell lists the observed count, the expected count, and the count as a percent of the grand total, of its column (binge drinker) total, and of its row (gender) total.

                              Freq. Binge drinker
Gender                          Yes          No       Total
Male     Count                 1630        5550        7180
         Expected           1391.82     5788.18
         % of grand total     9.53%      32.46%      42.00%
         % of column total   49.19%      40.27%
         % of row total      22.70%      77.30%     100.00%
Female   Count                 1684        8232        9916
         Expected           1922.18     7993.82
         % of grand total     9.85%      48.15%      58.00%
         % of column total   50.81%      59.73%
         % of row total      16.98%      83.02%     100.00%
Total    Count                 3314       13782       17096
         % of grand total    19.38%      80.62%     100.00%
         % of column total  100.00%     100.00%

Marginal Distribution for Frequent Binge drinker

Freq. Binge drinker      Yes        No     Total
Frequency               3314     13782     17096
Percent               19.38%    80.62%   100.00%

Marginal Distribution for Gender

Gender                  Male    Female     Total
Frequency               7180      9916     17096
Percent               42.00%    58.00%   100.00%

Conditional distribution of Frequent Binge drinkers given Gender is Male.

Gender = male

Freq. Binge drinkers     Yes        No     Total
Frequency               1630      5550      7180
Percent               22.70%    77.30%   100.00%

Conditional distribution of Frequent Binge drinkers given Gender is Female.

Gender = female

Freq. Binge drinkers     Yes        No     Total
Frequency               1684      8232      9916
Percent               16.98%    83.02%   100.00%

Conditional distribution of Gender given frequent Binge drinker is Yes.
Freq. Binge drinker = yes

Gender                  Male    Female     Total
Frequency               1630      1684      3314
Percent               49.19%    50.81%   100.00%

Conditional distribution of Gender given frequent Binge drinker is No.

Freq. Binge drinker = no

Gender                  Male    Female     Total
Frequency               5550      8232     13782
Percent               40.27%    59.73%   100.00%

Joint distribution of frequent Binge drinkers and Gender

                      Freq. Binge drinker
Gender            Yes                No                 Total
Male       1630  ( 9.53%)     5550  (32.46%)      7180  ( 42.00%)
Female     1684  ( 9.85%)     8232  (48.15%)      9916  ( 58.00%)
Total      3314  (19.38%)    13782  (80.62%)     17096  (100.00%)

Simpson’s paradox – an association or comparison that holds for all of several groups can reverse direction when the data are combined to form a single group.
- reversal of direction by aggregation of data

Example of a three-way table – a table presenting information on three variables, with one two-way table for each level (value) of the third variable.

Good Condition
              Hosp. A    Hosp. B
Died                6          8
Survived          594        592
Total             600        600

Poor Condition
              Hosp. A    Hosp. B
Died               57          8
Survived         1443        192
Total            1500        200

Condition variable – good and poor
Hospital variable – A and B
Survival variable – died and survived

Aggregation of data – adding up across one variable (elimination of one variable)
E.g. eliminating condition (ignoring condition):

              Died   Survived    Total
Hosp. A         63       2037     2100
Hosp. B         16        784      800

Hospital A has the lower death rate among patients in good condition (1.0% vs 1.3%) and among patients in poor condition (3.8% vs 4.0%), yet the higher death rate once condition is ignored (3.0% vs 2.0%). This reversal is Simpson’s paradox.

9.2 Inferences for Two-Way Tables

The hypothesis: no association

H0 : There is no association between the row variable and the column variable

Expected cell counts – calculated under the assumption that H0 is true.

The chi-square test

The chi-square statistic is a measure of how much the observed cell counts in a two-way table diverge from the expected cell counts. The recipe for the statistic is

$$X^2 = \sum \frac{(\text{observed count} - \text{expected count})^2}{\text{expected count}}$$

where “observed” represents an observed sample count, “expected” represents the expected count for the same cell, and the sum is over all r × c cells in the table.
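As a rough illustration of the expected-count and chi-square recipes above, the following Python sketch recomputes both for the binge-drinking table from 9.1. The function names `expected_counts` and `chi_square_statistic` are made up for this example and do not come from the notes; only the standard library is used.

```python
def expected_counts(observed):
    """Expected count for each cell: row total * column total / n."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square_statistic(observed):
    """Sum of (observed - expected)^2 / expected over all r x c cells."""
    expected = expected_counts(observed)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )


# Rows: Male, Female; columns: frequent binge drinker Yes, No.
binge = [[1630, 5550], [1684, 8232]]

expected = expected_counts(binge)
x2 = chi_square_statistic(binge)
# Degrees of freedom for an r x c table: (r - 1)(c - 1).
df = (len(binge) - 1) * (len(binge[0]) - 1)

print([[round(e, 2) for e in row] for row in expected])
print(round(x2, 2), df)
```

The expected counts reproduce the values shown in the two-way table in 9.1 (1391.82, 5788.18, 1922.18, 7993.82), and the statistic comes out at roughly 87.2 on 1 degree of freedom.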
Chi-square distribution (χ²) – the approximate distribution of the X² statistic above when H0 is true
- described by a density curve (the area under the curve is 1)
- there are infinitely many such curves, identified by their degrees of freedom (similar to the t distributions)
- right skewed
- begins at 0 and extends infinitely to the right

Chi-square Test for Two-way Tables

The null hypothesis H0 is that there is no association between the row and column variables in a two-way table. The alternative is that these variables are related. If H0 is true, the chi-square statistic X² has approximately a χ² distribution with (r – 1)(c – 1) degrees of freedom.

The P-value for the chi-square test is

$$P(\chi^2 \ge X^2)$$

where χ² is a random variable having the χ²(df) distribution with df = (r – 1)(c – 1).

This test is always right-tailed. H0 says there is no association, and any departure from H0, whatever its direction, moves the observed counts away from the expected counts and so increases X².

The chi-square test and the z test

In a 2 × 2 table, the comparison of the proportions of successes in two populations can be done by either the chi-square test or the two-sided two-sample z test for proportions. The two tests give the same result: the chi-square statistic X² equals the square of the z statistic.
E.g. the χ²(1) critical values equal the squares of the two-sided N(0, 1) critical values; for instance, 1.96² ≈ 3.84.

Beyond the basics – meta-analysis

Meta-analysis – a collection of statistical techniques designed to combine information from different but similar studies.

Relative risk – the ratio of two proportions, where the second is usually a reference/control.

9.3 Formulas and Models for Two-Way Tables*

Computations for Two-Way Tables

1. Calculate descriptive statistics that convey the important information in the table. Usually these will be column or row percents.
2. Find the expected counts and use these to compute the X² statistic.
3. Use chi-square critical values from Table F to find the approximate P-value.
4. Draw a conclusion about the association between the row and column variables.

Computing expected cell counts

Computing the chi-square statistic

Models for two-way tables

Each unit/subject must be counted only once.
In the 2 × 2 case, the test compares two population proportions.

Model for Comparing Several Populations using Two-Way Tables

Select independent SRSs from each of c populations, of sizes n1, n2, … , nc. Classify each individual in a sample according to a categorical response variable with r possible values. There are c different probability distributions, one for each population.

The null hypothesis is that the distributions of the response variable are the same in all c populations. The alternative hypothesis says that these c distributions are not all the same.

Joint distribution – the probability distribution of all the r × c possible outcomes in an r × c two-way table.

Marginal distributions – the overall distributions of each of the two categorical variables (either summing over rows to give the marginal distribution for the column variable, or summing over columns to give the marginal distribution for the row variable).

Model for Examining Independence in Two-Way Tables

Select an SRS of size n from a population. Measure two categorical variables for each individual.

The null hypothesis is that the row and column variables are independent. The alternative hypothesis is that the row and column variables are dependent.
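Returning to the 2 × 2 binge-drinking table from 9.1, the Python sketch below illustrates the relation between the chi-square test and the two-sample z test noted in 9.2, and finishes the computation with a P-value. It is a sketch under stated assumptions, not the method of the notes. The function names are hypothetical, the z statistic uses the pooled proportion, and the P-value shortcut through `math.erfc` is valid only for 1 degree of freedom (larger tables need Table F or statistical software).

```python
import math


def two_proportion_z(x1, n1, x2, n2):
    """Two-sample z statistic for H0: p1 = p2, using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se


def chi_square_2x2(a, b, c, d):
    """X^2 for the 2 x 2 table [[a, b], [c, d]] from observed and expected counts."""
    rows, cols = (a + b, c + d), (a + c, b + d)
    n = a + b + c + d
    observed = ((a, b), (c, d))
    return sum(
        (observed[i][j] - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j] / n)
        for i in range(2)
        for j in range(2)
    )


def chi_square_p_value_df1(x2):
    """P(chi-square(1) >= x2), using chi-square(1) = Z^2 for Z ~ N(0, 1)."""
    return math.erfc(math.sqrt(x2 / 2))


# Male: 1630 frequent binge drinkers out of 7180; Female: 1684 out of 9916.
z = two_proportion_z(1630, 7180, 1684, 9916)
x2 = chi_square_2x2(1630, 5550, 1684, 8232)
p_value = chi_square_p_value_df1(x2)

print(round(z, 2), round(z ** 2, 2), round(x2, 2))
print(p_value)
```

The squared z statistic and X² agree up to floating-point rounding, and the P-value is far below any conventional significance level, so for these data H0 (no association between gender and frequent binge drinking) would be rejected.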