Associations Between Categorical Variables
• Case where both the explanatory (independent) variable and the response (dependent) variable are qualitative (Chapter 7 includes the case where both are binary, i.e. 2 levels)
• Association: The distributions of responses differ among the levels of the explanatory variable (e.g. party affiliation by gender)

Contingency Tables
• Cross-tabulations of frequency counts where the rows (typically) represent the levels of the explanatory variable and the columns represent the levels of the response variable
• Numbers within the table represent the numbers of individuals falling in the corresponding combination of levels of the two variables
• Row and column totals are called the marginal distributions for the two variables

Example - Cyclones Near Antarctica
• Period of Study: September, 1973 - May, 1975
• Explanatory Variable: Region (40-49, 50-59, 60-79) (Degrees South Latitude)
• Response: Season (Aut(4), Wtr(5), Spr(4), Sum(8)) (Number of months in parentheses)
• Units: Cyclones in the study area
• Treating the observed cyclones as a "random sample" of all cyclones that could have occurred
Source: Howarth (1983), "An Analysis of the Variability of Cyclones around Antarctica and Their Relation to Sea-Ice Extent", Annals of the Association of American Geographers, Vol. 73, pp. 519-537

Example - Cyclones Near Antarctica

Season\Region   40-49S   50-59S   60-79S   Total
Autumn          370      526      980      1876
Winter          452      624      1200     2276
Spring          273      513      995      1781
Summer          422      1059     1751     3232
Total           1517     2722     4926     9165

For each region (a column in this table, where rows and columns are interchanged relative to the usual layout) we can compute the percentage of storms occurring during each season, the conditional distribution. Of the 1517 cyclones in the 40-49S band, 370 occurred in Autumn, a proportion of 370/1517=.244, or 24.4% as a percentage.
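The conditional distribution described above can be sketched in a few lines of Python. This is only an illustration using the cyclone counts from the table; the names `counts` and `conditional_pct` are hypothetical and not part of the original analysis.

```python
# Observed cyclone counts for each region, listed in season order.
seasons = ["Autumn", "Winter", "Spring", "Summer"]
counts = {
    "40-49S": [370, 452, 273, 422],
    "50-59S": [526, 624, 513, 1059],
    "60-79S": [980, 1200, 995, 1751],
}


def conditional_pct(cell_counts):
    """Return each count as a percentage of its group (region) total."""
    total = sum(cell_counts)
    return [100.0 * c / total for c in cell_counts]


for region, cell_counts in counts.items():
    pct = conditional_pct(cell_counts)
    row = ", ".join(f"{s} {p:.1f}%" for s, p in zip(seasons, pct))
    print(f"{region} (n={sum(cell_counts)}): {row}")
```

Each region's percentages add to 100 because the region total is used as the denominator.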

Season\Region   40-49S         50-59S         60-79S
Autumn          24.4           19.3           19.9
Winter          29.8           22.9           24.4
Spring          18.0           18.9           20.2
Summer          27.8           38.9           35.5
Total% (n)      100.0 (1517)   100.0 (2722)   100.0 (4926)

Example - Cyclones Near Antarctica
[Figure: Graphical Conditional Distributions for Regions. Bar chart of the percentage of cyclones (regpct) by season (Autumn, Winter, Spring, Summer), with one bar per region (40-49S, 50-59S, 60-79S).]

Guidelines for Contingency Tables
• Compute percentages for the response (column) variable within the categories of the explanatory (row) variable. Note that in journal articles, rows and columns may be interchanged.
• Divide the cell counts by the row (explanatory category) total and multiply by 100 to obtain a percent; the row percents will add to 100
• Give a title and clearly define variables and categories
• Include row (explanatory) total sample sizes

Independence & Dependence
• Statistically Independent: Population conditional distributions of one variable are the same across all levels of the other variable
• Statistically Dependent: Conditional distributions are not all equal
• When testing, researchers typically wish to demonstrate dependence (alternative hypothesis), and wish to refute independence (null hypothesis)

Pearson's Chi-Square Test
• Can be used for nominal or ordinal explanatory and response variables
• Variables can have any number of distinct levels
• Tests whether the distribution of the response variable is the same for each level of the explanatory variable (H0: No association between the variables)
• r = # of levels of explanatory variable
• c = # of levels of response variable

Pearson's Chi-Square Test
• Intuition behind test statistic
– Obtain the marginal distribution of outcomes for the response variable
– Apply this common distribution to all levels of the explanatory variable, by multiplying each proportion by the corresponding sample size
– Measure the difference between the actual cell counts and the expected cell counts from the previous step

Pearson's Chi-Square Test
• Notation to obtain test statistic
– Rows represent explanatory variable (r levels)
– Cols represent response variable (c levels)

        1     2     ...   c     Total
1       n11   n12   ...   n1c   n1.
2       n21   n22   ...   n2c   n2.
...     ...   ...   ...   ...   ...
r       nr1   nr2   ...   nrc   nr.
Total   n.1   n.2   ...   n.c   n..

Pearson's Chi-Square Test
• Observed frequency (fo): The number of individuals falling in a particular cell
• Expected frequency (fe): The number we would expect in that cell, given the sample sizes observed in the study and the assumption of independence
– Computed by multiplying the row total and the column total, and dividing by the overall sample size
– Applies the overall marginal probability of the response category to the sample size of the explanatory category

Pearson's Chi-Square Test
• Large-sample test (all fe > 5)
• H0: Variables are statistically independent (No association between variables)
• Ha: Variables are statistically dependent (Association exists between variables)
• Test Statistic:
$$\chi^2_{obs} = \sum \frac{(f_o - f_e)^2}{f_e}$$
• P-value: Area above $\chi^2_{obs}$ in the chi-squared distribution with (r-1)(c-1) degrees of freedom. (Critical values in Table 8.5)

Example - Cyclones Near Antarctica
Observed Cell Counts (fo):

Season\Region   40-49S   50-59S   60-79S   Total
Autumn          370      526      980      1876
Winter          452      624      1200     2276
Spring          273      513      995      1781
Summer          422      1059     1751     3232
Total           1517     2722     4926     9165

Note that overall: (1876/9165)100%=20.5% of all cyclones occurred in Autumn. If we apply that percentage to the 1517 that occurred in the 40-49S band, we would expect (0.205)(1517)=310.5 to have occurred in the first cell of the table.
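The expected-frequency rule and the test statistic can be sketched as follows for the cyclone counts. This is a minimal illustration, not the course's software: the function names are hypothetical, and the upper-tail helper relies on a closed form of the chi-squared tail area that holds only for even degrees of freedom (statistical software or a table would normally supply this value).

```python
import math

# Observed counts: rows are seasons, columns are regions (40-49S, 50-59S, 60-79S).
observed = [
    [370, 526, 980],    # Autumn
    [452, 624, 1200],   # Winter
    [273, 513, 995],    # Spring
    [422, 1059, 1751],  # Summer
]


def expected_counts(table):
    """Expected count per cell under independence: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square(table):
    """Return the Pearson statistic sum((fo - fe)^2 / fe) and its degrees of freedom."""
    expected = expected_counts(table)
    stat = sum(
        (fo - fe) ** 2 / fe
        for obs_row, exp_row in zip(table, expected)
        for fo, fe in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


def chi2_upper_tail_even_df(x, df):
    """Area above x in the chi-squared distribution (closed form, even df only)."""
    if df % 2 != 0:
        raise ValueError("this closed form requires an even number of degrees of freedom")
    half = x / 2.0
    return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(df // 2))


stat, df = chi_square(observed)
print(f"first expected count: {expected_counts(observed)[0][0]:.1f}")
print(f"chi-square = {stat:.1f} on {df} df")
print(f"P-value = {chi2_upper_tail_even_df(stat, df):.2e}")
```

Because the sketch keeps unrounded expected counts, its statistic can differ in the second decimal place from a hand computation based on expected counts rounded to one decimal.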
The full table of fe:

Season\Region   40-49S   50-59S   60-79S   Total
Autumn          310.5    557.2    1008.3   1876
Winter          376.7    676.0    1223.3   2276
Spring          294.8    529.0    957.3    1781
Summer          535.0    959.9    1737.1   3232
Total           1517     2722     4926     9165

Example - Cyclones Near Antarctica
Computation of $\chi^2_{obs}$:

Region   Season   fo     fe       (fo-fe)^2   ((fo-fe)^2)/fe
40-49S   Autumn   370    310.5    3540.25     11.4017713
40-49S   Winter   452    376.7    5670.09     15.0520042
40-49S   Spring   273    294.8    475.24      1.61207598
40-49S   Summer   422    535.0    12769       23.8672897
50-59S   Autumn   526    557.2    973.44      1.74702082
50-59S   Winter   624    676.0    2704        4
50-59S   Spring   513    529.0    256         0.48393195
50-59S   Summer   1059   959.9    9820.81     10.2310762
60-79S   Autumn   980    1008.3   800.89      0.79429733
60-79S   Winter   1200   1223.3   542.89      0.44379138
60-79S   Spring   995    957.3    1421.29     1.4846861
60-79S   Summer   1751   1737.1   193.21      0.11122561
Total                                         71.2291706

Example - Cyclones Near Antarctica
• H0: Seasonal distribution of cyclone occurrences is independent of latitude band
• Ha: Seasonal distributions of cyclone occurrences differ among latitude bands
• Test Statistic: $\chi^2_{obs} = 71.2$
• P-value: Area in the chi-squared distribution with (3-1)(4-1)=6 degrees of freedom above 71.2. From Table 8.5, $P(\chi^2 \ge 22.46) = .001$, so P < .001

SPSS Output - Cyclone Example
[SPSS output: region by season crosstabulation with counts, expected counts and percentages, followed by the chi-square tests table with the P-value highlighted. Values not recoverable from the extraction.]

Misuses of chi-squared Test
• Expected frequencies too small (all expected counts should be above 5; this is not necessary for the observed counts)
• Dependent samples (the same individuals are in each row, see McNemar's test)
• Can be used for nominal or ordinal variables, but more powerful methods exist for when both variables are ordinal and a directional association is hypothesized

Residual Analysis
• Once dependence has been determined from a chi-squared test, we are often interested in determining which cells contributed
• Residual: fo-fe measures the difference between the observed and expected counts
– Positive implies observed more than expected
– Residual's practical importance depends on the level of fe
• Adjusted Residual (computed for each cell):
$$\text{adjusted residual} = \frac{f_o - f_e}{\sqrt{f_e\,(1 - \text{row proportion})(1 - \text{column proportion})}}$$
Adjusted residuals above 3 in absolute value give strong evidence against independence in that cell

Example - Cyclones Near Antarctica
Adjusted residuals are computed in the following table.
Row proportion for Region 40-49S: 1517/9165=0.1655
Column proportion for Season Autumn: 1876/9165=0.2047

Region   Season   fo     fe       row prop   col prop   adj res
40-49S   Autumn   370    310.5    0.1655     0.2047     4.144837
40-49S   Winter   452    376.7    0.1655     0.2483     4.898484
40-49S   Spring   273    294.8    0.1655     0.1943     -1.54843
40-49S   Summer   422    535      0.1655     0.3526     -6.64664
50-59S   Autumn   526    557.2    0.297      0.2047     -1.76769
50-59S   Winter   624    676      0.297      0.2483     -2.75125
50-59S   Spring   513    529      0.297      0.1943     -0.92433
50-59S   Summer   1059   959.9    0.297      0.3526     4.741291
60-79S   Autumn   980    1008.3   0.5375     0.2047     -1.4695
60-79S   Winter   1200   1223.3   0.5375     0.2483     -1.12983
60-79S   Spring   995    957.3    0.5375     0.1943     1.996065
60-79S   Summer   1751   1737.1   0.5375     0.3526     0.609481

2x2 Tables
• Each variable has 2 levels
– Explanatory Variable – Groups (Typically based on demographics, exposure, or Trt)
– Response Variable – Outcome (Typically presence or absence of a characteristic)
• Measures of association
– Relative Risk (Prospective Studies)
– Odds Ratio (Prospective or Retrospective)
– Absolute Risk (Prospective Studies)

2x2 Tables - Notation

                Outcome Present   Outcome Absent   Group Total
Group 1         n11               n12              n1.
Group 2         n21               n22              n2.
Outcome Total   n.1               n.2              n..

Relative Risk
• Ratio of the probability that the outcome characteristic is present for one group, relative to the other
• Sample proportions with characteristic from groups 1 and 2:
$$\hat{\pi}_1 = \frac{n_{11}}{n_{1.}} \qquad \hat{\pi}_2 = \frac{n_{21}}{n_{2.}}$$

Relative Risk
• Estimated Relative Risk:
$$RR = \frac{\hat{\pi}_1}{\hat{\pi}_2}$$
• 95% Confidence Interval for Population Relative Risk:
$$\left( RR\,e^{-1.96\sqrt{v}} \;,\; RR\,e^{1.96\sqrt{v}} \right) \qquad e \approx 2.71828 \qquad v = \frac{1-\hat{\pi}_1}{n_{11}} + \frac{1-\hat{\pi}_2}{n_{21}}$$

Relative Risk
• Interpretation
– Conclude that the probability that the outcome is present is higher (in the population) for group 1 if the entire interval is above 1
– Conclude that the probability that the outcome is present is lower (in the population) for group 1 if the entire interval is below 1
– Do not conclude that the probability of the outcome differs for the two groups if the interval contains 1

Example - Coccidioidomycosis and TNFα-antagonists
• Research Question: Risk of developing Coccidioidomycosis associated with arthritis therapy?
• Groups: Patients receiving tumor necrosis factor α (TNFα) versus Patients not receiving TNFα (all patients arthritic)
Source: Bergstrom, et al (2004)

         TNFα   Other   Total
COC      7      4       11
No COC   240    734     974
Total    247    738     985

Example - Coccidioidomycosis and TNFα-antagonists
• Group 1: Patients on TNFα
• Group 2: Patients not on TNFα
$$\hat{\pi}_1 = \frac{7}{247} = .0283 \qquad \hat{\pi}_2 = \frac{4}{738} = .0054$$
$$RR = \frac{\hat{\pi}_1}{\hat{\pi}_2} = \frac{.0283}{.0054} = 5.24 \qquad v = \frac{1-.0283}{7} + \frac{1-.0054}{4} = .3874$$
$$95\%\ CI: \left( 5.24\,e^{-1.96\sqrt{.3874}} \;,\; 5.24\,e^{1.96\sqrt{.3874}} \right) \equiv (1.55 \;,\; 17.76)$$
Entire CI is above 1: conclude higher risk if on TNFα

Odds Ratio
• Odds of an event is the probability it occurs divided by the probability it does not occur
• Odds ratio is the odds of the event for group 1 divided by the odds of the event for group 2
• Sample odds of the outcome for each group:
$$odds_1 = \frac{n_{11}/n_{1.}}{n_{12}/n_{1.}} = \frac{n_{11}}{n_{12}} \qquad odds_2 = \frac{n_{21}}{n_{22}}$$

Odds Ratio
• Estimated Odds Ratio:
$$OR = \frac{odds_1}{odds_2} = \frac{n_{11}/n_{12}}{n_{21}/n_{22}} = \frac{n_{11}n_{22}}{n_{12}n_{21}}$$
• 95% Confidence Interval for Population Odds Ratio:
$$\left( OR\,e^{-1.96\sqrt{v}} \;,\; OR\,e^{1.96\sqrt{v}} \right) \qquad e \approx 2.71828 \qquad v = \frac{1}{n_{11}} + \frac{1}{n_{12}} + \frac{1}{n_{21}} + \frac{1}{n_{22}}$$

Odds Ratio
• Interpretation
– Conclude that the probability that the outcome is present is higher (in the population) for group 1 if the entire interval is above 1
– Conclude that the probability that the outcome is present is lower (in the population) for group 1 if the entire interval is below 1
– Do not conclude that the probability of the outcome differs for the two groups if the interval contains 1

Example - NSAIDs and GBM
• Case-Control Study (Retrospective)
– Cases: 137 Self-Reporting Patients with Glioblastoma Multiforme (GBM)
– Controls: 401 Population-Based Individuals matched to cases wrt demographic factors

                 GBM Present   GBM Absent   Total
NSAID User       32            138          170
NSAID Non-User   105           263          368
Total            137           401          538

Source: Sivak-Sears, et al

Example - NSAIDs and GBM
$$OR = \frac{32(263)}{138(105)} = \frac{8416}{14490} = 0.58 \qquad v = \frac{1}{32} + \frac{1}{138} + \frac{1}{105} + \frac{1}{263} = 0.0518$$
$$95\%\ CI: \left( 0.58\,e^{-1.96\sqrt{0.0518}} \;,\; 0.58\,e^{1.96\sqrt{0.0518}} \right) \equiv (0.37 \;,\; 0.91)$$
Interval is entirely below 1: NSAID use appears to be lower among cases than controls

Absolute Risk
• Difference between the proportions with the outcome characteristic for the 2 groups
• Sample proportions with characteristic from groups 1 and 2:
$$\hat{\pi}_1 = \frac{n_{11}}{n_{1.}} \qquad \hat{\pi}_2 = \frac{n_{21}}{n_{2.}}$$

Absolute Risk
• Estimated Absolute Risk:
$$AR = \hat{\pi}_1 - \hat{\pi}_2$$
• 95% Confidence Interval for Population Absolute Risk:
$$AR \pm 1.96\sqrt{\frac{\hat{\pi}_1(1-\hat{\pi}_1)}{n_{1.}} + \frac{\hat{\pi}_2(1-\hat{\pi}_2)}{n_{2.}}}$$
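The three 95% interval formulas for 2x2 tables can be sketched together as below. This is a hedged illustration: the function names are hypothetical, each function takes the cell counts in the order n11, n12, n21, n22 (group 1 present, group 1 absent, group 2 present, group 2 absent), and the code uses unrounded proportions, so its limits can differ slightly from hand calculations that round intermediate values.

```python
import math

Z = 1.96  # standard normal multiplier for a 95% interval


def relative_risk_ci(n11, n12, n21, n22):
    """Return (RR, lower, upper) using v = (1-p1)/n11 + (1-p2)/n21."""
    p1 = n11 / (n11 + n12)
    p2 = n21 / (n21 + n22)
    rr = p1 / p2
    half = Z * math.sqrt((1 - p1) / n11 + (1 - p2) / n21)
    return rr, rr * math.exp(-half), rr * math.exp(half)


def odds_ratio_ci(n11, n12, n21, n22):
    """Return (OR, lower, upper) using v = 1/n11 + 1/n12 + 1/n21 + 1/n22."""
    odds_ratio = (n11 * n22) / (n12 * n21)
    half = Z * math.sqrt(1 / n11 + 1 / n12 + 1 / n21 + 1 / n22)
    return odds_ratio, odds_ratio * math.exp(-half), odds_ratio * math.exp(half)


def absolute_risk_ci(n11, n12, n21, n22):
    """Return (AR, lower, upper) for the difference in sample proportions."""
    n1, n2 = n11 + n12, n21 + n22
    p1, p2 = n11 / n1, n21 / n2
    ar = p1 - p2
    half = Z * math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return ar, ar - half, ar + half


# TNF example: 7 of 247 treated patients and 4 of 738 others developed COC.
print("RR: %.2f (%.2f, %.2f)" % relative_risk_ci(7, 240, 4, 734))
print("AR: %.4f (%.4f, %.4f)" % absolute_risk_ci(7, 240, 4, 734))
# NSAID example: rows are NSAID users and non-users, columns GBM present/absent.
print("OR: %.2f (%.2f, %.2f)" % odds_ratio_ci(32, 138, 105, 263))
```

The relative risk and odds ratio intervals are built on the log scale and then exponentiated, which is why they are not symmetric around the point estimate, while the absolute risk interval is symmetric.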

Absolute Risk
• Interpretation
– Conclude that the probability that the outcome is present is higher (in the population) for group 1 if the entire interval is positive
– Conclude that the probability that the outcome is present is lower (in the population) for group 1 if the entire interval is negative
– Do not conclude that the probability of the outcome differs for the two groups if the interval contains 0

Example - Coccidioidomycosis and TNFα-antagonists
• Group 1: Patients on TNFα
• Group 2: Patients not on TNFα
$$\hat{\pi}_1 = \frac{7}{247} = .0283 \qquad \hat{\pi}_2 = \frac{4}{738} = .0054$$
$$AR = \hat{\pi}_1 - \hat{\pi}_2 = .0283 - .0054 = .0229$$
$$95\%\ CI: .0229 \pm 1.96\sqrt{\frac{.0283(.9717)}{247} + \frac{.0054(.9946)}{738}} \equiv .0229 \pm .0213 \equiv (0.0016 \;,\; 0.0442)$$
Interval is entirely positive: TNFα is associated with higher risk

Ordinal Explanatory and Response Variables
• Pearson's Chi-square test can be used to test associations among ordinal variables, but more powerful methods exist
• When theories exist that the association is directional (positive or negative), measures exist to describe and test for these specific alternatives to independence:
– Gamma
– Kendall's $\tau_b$

Concordant and Discordant Pairs
• Concordant Pairs - Pairs of individuals where one individual scores "higher" on both ordered variables than the other individual
• Discordant Pairs - Pairs of individuals where one individual scores "higher" on one ordered variable and the other individual scores "higher" on the other
• C = # Concordant Pairs, D = # Discordant Pairs
– Under Positive association, expect C > D
– Under Negative association, expect C < D
– Under No association, expect C ≈ D

Example - Alcohol Use and Sick Days
• Alcohol Risk (Without Risk, Hardly any Risk, Some to Considerable Risk)
• Sick Days (0, 1-6, 7 or more)
• Concordant Pairs - Pairs of respondents where one scores higher on both alcohol risk and sick days than the other
• Discordant Pairs - Pairs of respondents where one scores higher on alcohol risk and the other scores higher on sick days
Source: Hermansson, et al (2003)

Example - Alcohol Use and Sick Days
Alcohol Risk by Sick Days crosstabulation (cell counts as used in the C and D computations below):

Alcohol Risk \ Sick Days    0     1-6   7 or more   Total
Without Risk                347   113   145         605
Hardly any Risk             154   63    56          273
Some to Considerable Risk   52    25    34          111
Total                       553   201   235         989

• Concordant Pairs: Each individual in a given cell is concordant with each individual in cells "Southeast" of theirs
• Discordant Pairs: Each individual in a given cell is discordant with each individual in cells "Southwest" of theirs

Example - Alcohol Use and Sick Days
$$C = 347(63+56+25+34) + 113(56+34) + 154(25+34) + 63(34) = 83164$$
$$D = 145(154+63+52+25) + 113(154+52) + 56(52+25) + 63(52) = 73496$$

Measures of Association
• Goodman and Kruskal's Gamma:
$$\hat{\gamma} = \frac{C-D}{C+D} \qquad -1 \le \hat{\gamma} \le 1$$
• Kendall's $\tau_b$:
$$\hat{\tau}_b = \frac{C-D}{0.5\sqrt{\left(n^2 - \sum n_{i.}^2\right)\left(n^2 - \sum n_{.j}^2\right)}}$$
When there is no association between the ordinal variables, the population values of these measures are 0. Statistical software packages provide these tests.

Example - Alcohol Use and Sick Days
$$\hat{\gamma} = \frac{C-D}{C+D} = \frac{83164-73496}{83164+73496} = 0.0617$$
[SPSS output: table of ordinal measures of association (Kendall's tau-b and Gamma) for this example. Values not recoverable from the extraction.]
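The pair counting and the two ordinal measures can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, and the 3x3 cell counts are read off the terms of the C and D computations above, with rows ordered by increasing alcohol risk and columns by increasing sick days.

```python
import math

# Rows: alcohol risk from lowest to highest; columns: sick days from fewest to most.
table = [
    [347, 113, 145],
    [154, 63, 56],
    [52, 25, 34],
]


def concordant_discordant(t):
    """Count concordant (C) and discordant (D) pairs in an ordered two-way table."""
    c = d = 0
    for i, row in enumerate(t):
        for j, count in enumerate(row):
            for lower_row in t[i + 1:]:
                c += count * sum(lower_row[j + 1:])  # cells to the "Southeast"
                d += count * sum(lower_row[:j])      # cells to the "Southwest"
    return c, d


def gamma(t):
    """Goodman and Kruskal's gamma: (C - D) / (C + D)."""
    c, d = concordant_discordant(t)
    return (c - d) / (c + d)


def kendall_tau_b(t):
    """Kendall's tau-b: (C - D) / (0.5 * sqrt((n^2 - sum ni.^2)(n^2 - sum n.j^2)))."""
    c, d = concordant_discordant(t)
    row_totals = [sum(row) for row in t]
    col_totals = [sum(col) for col in zip(*t)]
    n = sum(row_totals)
    rows_term = n ** 2 - sum(r ** 2 for r in row_totals)
    cols_term = n ** 2 - sum(k ** 2 for k in col_totals)
    return (c - d) / (0.5 * math.sqrt(rows_term * cols_term))


c, d = concordant_discordant(table)
print(f"C = {c}, D = {d}")
print(f"gamma = {gamma(table):.4f}")
print(f"tau-b = {kendall_tau_b(table):.4f}")
```

Both measures share the numerator C - D; tau-b has the larger denominator, so it is never further from 0 than gamma for the same table.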