Chapter 10: Analyzing the Association Between Categorical Variables

Learning objective: how to detect and describe associations between categorical variables.

Section 10.1: What Is Independence and What Is Association?

Example: Is There an Association Between Happiness and Family Income?

The percentages in a particular row of a table are called conditional percentages. They form the conditional distribution for happiness, given a particular income level.

Guidelines when constructing tables with conditional distributions:
• Make the response variable the column variable
• Compute conditional proportions for the response variable within each row
• Include the total sample sizes

Independence vs. Dependence

For two variables to be independent, the population percentage in any category of one variable is the same for all categories of the other variable. For two variables to be dependent (or associated), the population percentages in the categories are not all the same.

Example: Happiness and Gender

Example: Belief in Life After Death

Are race and belief in life after death independent or dependent?
• The conditional distributions in the table are similar but not exactly identical
• It is tempting to conclude that the variables are dependent

However:
• The definition of independence between variables refers to a population
• The table is a sample, not a population

Even if variables are independent, we would not expect the sample conditional distributions to be identical. Because of sampling variability, each sample percentage typically differs somewhat from the true population percentage.

Section 10.2: How Can We Test Whether Categorical Variables Are Independent?

A Significance Test for Categorical Variables

The hypotheses for the test are:
H0: The two variables are independent
Ha: The two variables are dependent (associated)

The test assumes random sampling and a large sample size.

What Do We Expect for Cell Counts if the Variables Are Independent?

The count in any particular cell is a random variable: different samples have different values for the count. The mean of its distribution is called an expected cell count, found under the presumption that H0 is true.

How Do We Find the Expected Cell Counts?

For a particular cell:

Expected cell count = (Row total × Column total) / Total sample size
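As a concrete illustration, the sketch below computes expected cell counts with this formula in Python. The observed counts are a reconstruction of the happiness-by-income table discussed in the next example (they are consistent with the chi-squared value of 73.4 quoted there), not data copied verbatim from the text.

```python
# Sketch: expected cell counts under H0 (independence).
# Rows: family income (above average, average, below average).
# Columns: happiness (not too happy, pretty happy, very happy).
# Counts are a reconstruction consistent with the example, not verbatim data.
import numpy as np

observed = np.array([[21, 159, 110],
                     [53, 372, 221],
                     [94, 249,  83]])

row_totals = observed.sum(axis=1, keepdims=True)   # shape (3, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 3)
n = observed.sum()                                 # total sample size

# Expected count = (row total * column total) / total sample size
expected = row_totals * col_totals / n
print(np.round(expected, 1))
```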
Example: Happiness by Family Income

The Chi-Squared Test Statistic

The chi-squared statistic summarizes how far the observed cell counts in a contingency table fall from the expected cell counts for a null hypothesis:

χ² = Σ (observed count − expected count)² / expected count

Example: Happiness and Family Income

State the null and alternative hypotheses for this test:
H0: Happiness and family income are independent
Ha: Happiness and family income are dependent (associated)

Report the χ² statistic and explain how it was calculated. For each cell, calculate (observed count − expected count)² / expected count, then sum these values over all the cells. Here, the χ² value is 73.4.

The larger the χ² value, the greater the evidence against the null hypothesis of independence and in support of the alternative hypothesis that happiness and income are associated.

The Chi-Squared Distribution

To convert the χ² test statistic to a P-value, we use the sampling distribution of the χ² statistic. For large sample sizes, this sampling distribution is well approximated by the chi-squared probability distribution.

Main properties of the chi-squared distribution:
• It falls on the positive part of the real number line
• Its precise shape depends on the degrees of freedom: df = (r − 1)(c − 1)
• The mean of the distribution equals the df value
• It is skewed to the right
• The larger the χ² value, the greater the evidence against H0: independence

The Five Steps of the Chi-Squared Test of Independence

1. Assumptions:
• Two categorical variables
• Randomization
• Expected counts ≥ 5 in all cells

2. Hypotheses:
H0: The two variables are independent
Ha: The two variables are dependent (associated)

3. Test statistic:
χ² = Σ (observed count − expected count)² / expected count

4. P-value: the right-tail probability above the observed χ² value, for the chi-squared distribution with df = (r − 1)(c − 1)

5. Conclusion: report the P-value and interpret it in context. If a decision is needed, reject H0 when the P-value ≤ the significance level.

Steps 3 and 4 are carried out in software in the sketch below.
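A minimal sketch of the full test, assuming scipy is available; the counts are the same reconstruction used in the earlier expected-count sketch.

```python
# Sketch: chi-squared test of independence with scipy.
# chi2_contingency returns the statistic, P-value, df, and expected counts.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[21, 159, 110],
                     [53, 372, 221],
                     [94, 249,  83]])

chi2, p, df, expected = chi2_contingency(observed)
print(f"chi-squared = {chi2:.1f}, df = {df}, P-value = {p:.2g}")
# For this 3x3 table, df = (3 - 1)(3 - 1) = 4; chi-squared is about 73.4,
# so the P-value is tiny and H0 (independence) is rejected.
```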
Chi-Squared Is Also Used as a "Test of Homogeneity"

The chi-squared test does not depend on which variable is the response and which is the explanatory variable. When a response variable is identified and the population conditional distributions are identical, they are said to be homogeneous; the test is then referred to as a test of homogeneity.

Example: Aspirin and Heart Attacks Revisited

What are the hypotheses for the chi-squared test for these data? The null hypothesis is that whether a doctor has a heart attack is independent of whether he takes placebo or aspirin. The alternative hypothesis is that there is an association.

Report the test statistic and P-value for the chi-squared test:
• The test statistic is 25.01, with a P-value of 0.000
• This is very strong evidence that the population proportion of heart attacks differed for those taking aspirin and for those taking placebo
• The sample proportions indicate that the aspirin group had a lower rate of heart attacks than the placebo group

Limitations of the Chi-Squared Test

If the P-value is very small, strong evidence exists against the null hypothesis of independence. But the chi-squared statistic and the P-value tell us nothing about the nature or strength of the association. We know that there is statistical significance, but the test alone does not indicate whether there is practical significance as well.

Section 10.3: How Strong Is the Association?

Review questions. The following table classifies respondents by gender and happiness; the answers can be checked in software, as in the sketch that follows the questions.

           Not too happy   Pretty happy   Very happy
Females         163             898           502
Males           130             705           379

1. In a study of the two variables (gender and happiness), which one is the response variable?
a. Gender
b. Happiness

2. What is the expected cell count for females who are "pretty happy"?
a. 898
b. 801.5
c. 902
d. 521

3. Calculate the χ² value.
a. 1.75
b. 0.27
c. 0.98
d. 10.34

4. At a significance level of 0.05, what is the correct decision?
a. "Gender" and "Happiness" are independent
b. There is an association between "Gender" and "Happiness"
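A sketch checking the numerical answers, assuming scipy. (For question 1, happiness is the response variable.)

```python
# Sketch: verifying the review-question answers.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[163, 898, 502],   # females
                     [130, 705, 379]])  # males

chi2, p, df, expected = chi2_contingency(observed)
print(f"expected (females, pretty happy) = {expected[0, 1]:.1f}")  # about 902
print(f"chi-squared = {chi2:.2f}, df = {df}, P-value = {p:.2f}")   # about 0.27
# The P-value is far above 0.05, so we do not reject H0:
# the data are consistent with gender and happiness being independent.
```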
Analyzing Contingency Tables

Is there an association?
• The chi-squared test of independence addresses this question
• When the P-value is small, we infer that the variables are associated

How do the cell counts differ from what independence predicts?
• To answer this question, we compare each observed cell count to the corresponding expected cell count

How strong is the association?
• Analyzing the strength of the association reveals whether the association is important, or whether it is statistically significant but too weak to matter in practical terms

Measures of Association

A measure of association is a statistic or a parameter that summarizes the strength of the dependence between two variables.

Difference of Proportions

An easily interpretable measure of association is the difference between the proportions making a particular response.

Case (a) exhibits the weakest possible association: no association at all.

          Accept Credit Card
Income      No      Yes
High        60%     40%
Low         60%     40%

The difference of proportions is 0.

Case (b) exhibits the strongest possible association.

          Accept Credit Card
Income      No      Yes
High        0%      100%
Low         100%    0%

The difference of proportions is 100%.

In practice, we don't expect data to follow either extreme (a 0% difference or a 100% difference), but the stronger the association, the larger the absolute value of the difference of proportions.

Example: Do Student Stress and Depression Depend on Gender?

Which response variable, stress or depression, has the stronger sample association with gender?

Stress:
Gender     Yes     No
Female     35%     65%
Male       16%     84%

The difference of proportions between females and males is 0.35 − 0.16 = 0.19.

Depression:
Gender     Yes     No
Female     8%      92%
Male       6%      94%

The difference of proportions between females and males is 0.08 − 0.06 = 0.02.

In the sample, stress (with a difference of proportions of 0.19) has a stronger association with gender than depression does (with a difference of proportions of 0.02).
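A minimal sketch of this comparison in code:

```python
# Sketch: difference of proportions as a measure of association.
def diff_of_proportions(p1: float, p2: float) -> float:
    """Difference between two conditional proportions for the same response."""
    return p1 - p2

# Proportions reporting stress / depression, females vs. males (tables above)
print(round(diff_of_proportions(0.35, 0.16), 2))  # 0.19 -> stronger association
print(round(diff_of_proportions(0.08, 0.06), 2))  # 0.02 -> much weaker
```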
The Ratio of Proportions: Relative Risk

Another measure of association is the ratio of two proportions, p1/p2. In medical applications in which the proportion refers to an adverse outcome, this ratio is called the relative risk.

Example: Relative Risk for Seat Belt Use and Outcome of Auto Accidents

Treating the auto accident outcome as the response variable, find and interpret the relative risk.

The adverse outcome is death, so the relative risk is formed for that outcome:
• For those who wore a seat belt, the proportion who died equaled 510/412,878 = 0.00124
• For those who did not wear a seat belt, the proportion who died equaled 1,601/164,128 = 0.00975

The relative risk is the ratio 0.00124/0.00975 = 0.127: the proportion of subjects wearing a seat belt who died was 0.127 times the proportion of subjects not wearing a seat belt who died.

Many people find the relative risk easier to interpret by reordering the rows of data so that the relative risk has a value above 1.0. Reversing the order of the rows, we calculate the ratio 0.00975/0.00124 = 7.9: the proportion of subjects not wearing a seat belt who died was 7.9 times the proportion of subjects wearing a seat belt who died.

A relative risk of 7.9 represents a strong association. It is far from the value of 1.0 that would occur if the proportion of deaths were the same for each group. Wearing a seat belt has a practically significant effect in enhancing the chance of surviving an auto accident.

Properties of the Relative Risk
• The relative risk can equal any nonnegative number
• When p1 = p2, the variables are independent and the relative risk equals 1.0
• Values farther from 1.0 (in either direction) represent stronger associations
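A minimal sketch computing both versions of the relative risk from the counts quoted above:

```python
# Sketch: relative risk from the seat belt data.
died_belt, total_belt = 510, 412_878        # wore a seat belt
died_nobelt, total_nobelt = 1_601, 164_128  # did not wear a seat belt

p_belt = died_belt / total_belt        # ~0.00124
p_nobelt = died_nobelt / total_nobelt  # ~0.00975

print(f"relative risk (belt / no belt) = {p_belt / p_nobelt:.3f}")  # ~0.127
print(f"relative risk (no belt / belt) = {p_nobelt / p_belt:.1f}")  # ~7.9
```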
A Large χ² Does Not Mean There's a Strong Association

A large chi-squared value provides strong evidence that the variables are associated; it does not imply that the variables have a strong association. The statistic merely indicates (through its P-value) how certain we can be that the variables are associated, not how strong that association is.

Section 10.4: How Can Residuals Reveal the Pattern of Association?

Association Between Categorical Variables

The chi-squared test and measures of association such as (p1 − p2) and p1/p2 are fundamental methods for analyzing contingency tables. The P-value for χ² summarizes the strength of evidence against H0: independence.

If the P-value is small, then we conclude that somewhere in the contingency table the population cell proportions differ from independence. The chi-squared test does not indicate whether all cells deviate greatly from independence or whether perhaps only some of them do.

Residual Analysis

A cell-by-cell comparison of the observed counts with the counts expected when H0 is true reveals the nature of the evidence against H0. The difference between an observed and an expected count in a particular cell is called a residual.
• The residual is negative when fewer subjects are in the cell than expected under H0
• The residual is positive when more subjects are in the cell than expected under H0

To determine whether a residual is large enough to indicate strong evidence of a deviation from independence in that cell, we use an adjusted form of the residual: the standardized residual.

The standardized residual for a cell is (observed count − expected count)/se, where se denotes the standard error of the difference:
• A standardized residual reports the number of standard errors that an observed count falls from its expected count
• Its formula is complex; software can be used to find its value (a sketch follows this example)
• A large value provides evidence against independence in that cell

Example: Standardized Residuals for Religiosity and Gender

"To what extent do you consider yourself a religious person?"

Interpret the standardized residuals in the table.

The table exhibits large positive residuals for the cells for females who are very religious and for males who are not at all religious. In these cells, the observed count is much larger than the expected count, so there is strong evidence that the population has more subjects in these cells than if the variables were independent.

The table exhibits large negative residuals for the cells for females who are not at all religious and for males who are very religious. In these cells, the observed count is much smaller than the expected count, so there is strong evidence that the population has fewer subjects in these cells than if the variables were independent.
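The slides leave the standard error unspecified. One common form, used for adjusted residuals in contingency tables, is se = sqrt(expected × (1 − row proportion) × (1 − column proportion)); the sketch below assumes that form and uses illustrative counts, since the religiosity table itself is not reproduced in this extract.

```python
# Sketch: standardized (adjusted) residuals for a contingency table.
# Assumes se = sqrt(expected * (1 - row total/n) * (1 - column total/n));
# the counts are illustrative, not the religiosity data from the example.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[170, 340,  90],
                     [120, 310, 170]])

n = observed.sum()
_, _, _, expected = chi2_contingency(observed)

row_prop = observed.sum(axis=1, keepdims=True) / n
col_prop = observed.sum(axis=0, keepdims=True) / n
se = np.sqrt(expected * (1 - row_prop) * (1 - col_prop))

std_resid = (observed - expected) / se
print(np.round(std_resid, 2))
# Cells with standardized residuals far from 0 (say, beyond about ±3)
# show strong evidence of deviation from independence in that cell.
```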
Section 10.5: What If the Sample Size Is Small? Fisher's Exact Test

The chi-squared test of independence is a large-sample test. When the expected frequencies are small (any of them being less than about 5), small-sample tests are more appropriate. Fisher's exact test is a small-sample test of independence.

The calculations for Fisher's exact test are complex. Statistical software can be used to obtain the P-value for the test that the two variables are independent. The smaller the P-value, the stronger the evidence that the variables are associated.

Example: Tea Tastes Better with Milk Poured First?

This experiment was conducted by Sir Ronald Fisher. His colleague, Dr. Muriel Bristol, claimed that when drinking tea she could tell whether the milk or the tea had been added to the cup first.

Experiment:
• Fisher asked her to taste eight cups of tea: four had the milk added first, and four had the tea added first
• She was asked to indicate which four had the milk added first
• The order of presenting the cups was randomized

The one-sided version of the test pertains to the alternative that her predictions are better than random guessing. Does the P-value suggest that she had the ability to predict better than random guessing?

The P-value of 0.243 does not give much evidence against the null hypothesis. The data did not support Dr. Bristol's claim that she could tell whether the milk or the tea had been added to the cup first.
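The results table is not reproduced in this extract. The sketch below assumes the commonly reported outcome, in which Dr. Bristol correctly identified three of the four milk-first cups; that table reproduces the one-sided P-value of 0.243 quoted above. scipy's fisher_exact is assumed available.

```python
# Sketch: Fisher's exact test for the tea-tasting experiment.
# Assumed guesses: 3 of the 4 milk-first cups identified correctly.
from scipy.stats import fisher_exact

#        guessed "milk first"  guessed "tea first"
table = [[3, 1],   # milk actually poured first
         [1, 3]]   # tea actually poured first

odds_ratio, p_value = fisher_exact(table, alternative="greater")
print(f"one-sided P-value = {p_value:.3f}")  # 0.243
```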