Categorical Data

Categorical Data Analysis

• To identify any association between two categorical variables.

Example: 1,073 subjects of both genders were recruited for a study in which the onset of severe chest pain was recorded for each subject.
Variables:
- Onset of severe chest pain (+ve / –ve)
- Gender (male / female)

Chi-square tests

• Commonly denoted as $\chi^2$
• Useful in testing for independence between categorical variables (e.g. genetic association between cases / controls)

$$\chi^2 = \sum_{i=1}^{K} \frac{(|O_i - E_i|)^2}{E_i}$$

This compares the observed counts $O_i$ against the counts $E_i$ expected under the null hypothesis, summed over the $K$ cells.

Assumptions
• Sufficiently large counts in each cell of the cross-tabulation table.

Small Cell Counts
• In general, require
  (a) Smallest expected count is 1 or more
  (b) At least 80% of the cells have an expected count of 5 or more
• Yates's Continuity Correction: provides a better approximation of the test statistic when the data is dichotomous (2 × 2)

$$\chi^2 = \sum_{i=1}^{K} \frac{(|O_i - E_i| - 0.5)^2}{E_i}$$

Goodness-of-fit test
• Null hypothesis: the data follow a hypothesized distribution.
• Expected frequencies are calculated under the hypothesized distribution.

For example: The number of outbreaks of flu epidemics is charted over the period 1500 to 1931, and the number of outbreaks each year is tabulated. The variable of interest counts the number of outbreaks occurring in each year of that 432-year period. E.g. there were 223 years with no flu outbreaks.

• Hypotheses:
  H0: Data follows a Poisson distribution with mean 0.692
  H1: Data does not follow a Poisson distribution with mean 0.692

Note: Mean 0.692 is obtained from the sample mean.
Sample mean = (0 × 223 + 1 × 142 + 2 × 48 + 3 × 15 + 4 × 4 + 5 × 0) / 432 = 0.692
Expected frequency for X = 0 is 432 × P(X = 0), where X ~ Poisson(0.692)

Test statistic:

$$\sum_{i=1}^{K} \frac{(O_i - E_i)^2}{E_i} \sim \chi^2_{6-1}$$

with df = (6 – 1). This yields a p-value of 0.99, so the data provide no evidence against the null hypothesis and we do not reject it.
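As a rough illustration of how the expected frequencies in the flu example are obtained, here is a minimal Python sketch. It assumes the observed counts are exactly those used in the sample-mean calculation (223, 142, 48, 15, 4, 0 years with 0 to 5 outbreaks); the variable and function names are illustrative and not part of the original slides.

```python
import math

# Observed number of years with 0, 1, ..., 5 outbreaks (taken from the
# sample-mean calculation in the example).
observed = [223, 142, 48, 15, 4, 0]
n_years = sum(observed)  # 432 years in total

# Sample mean number of outbreaks per year, used as the Poisson mean under H0.
mean = sum(k * o for k, o in enumerate(observed)) / n_years


def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)


# Expected frequency of each category under H0: n * P(X = k).
# Note: these do not sum exactly to n because the tail X >= 6 is left out.
expected = [n_years * poisson_pmf(k, mean) for k in range(len(observed))]

for k, (o, e) in enumerate(zip(observed, expected)):
    print(f"X = {k}: observed = {o:3d}, expected = {e:7.2f}")
```

The printed table makes it easy to see which categories have small expected counts, which is the first thing to check before relying on the chi-square approximation.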
Test of independence

Most common usage of Pearson's chi-square test.
H0: The two categorical variables are independent
H1: The two categorical variables are associated (i.e. not independent)

Under the independence assumption, if outcome A is independent of outcome B, then
P(A and B happen jointly) = P(A happens) × P(B happens)

Calculating expected frequencies
P(Chest pain +) = 83/1073        P(Males) = 520/1073
P(Chest pain –) = 990/1073       P(Females) = 553/1073
P(Males with chest pain +) = 83/1073 × 520/1073 = 0.0375
Expected(Males with chest pain +) = 1073 × P(.) = 1073 × 0.0375 = 40.224
Observed(Males with chest pain +) = 46

• Expected frequencies calculated by:

$$E_{ij} = \frac{R_i C_j}{n}$$

  where $R_i$ is the total of row $i$, $C_j$ is the total of column $j$, and $n$ is the total sample size.
• Degrees of freedom = (r – 1)(c – 1)

Chi-square test

Chi-Square Tests
                               Value     df   Asymp. Sig.   Exact Sig.   Exact Sig.
                                              (2-sided)     (2-sided)    (1-sided)
Pearson Chi-Square             1.744 b   1    .187
Continuity Correction a        1.456     1    .228
Likelihood Ratio               1.745     1    .186
Fisher's Exact Test                                         .209         .114
Linear-by-Linear Association   1.743     1    .187
N of Valid Cases               1073
a. Computed only for a 2x2 table
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 40.22.

Footnote b is where to check the validity of the assumption of sufficiently large sample sizes!

Quantification of the effect
• The $\chi^2$ test identifies whether there is a significant association between the two categorical variables.
• But it does not quantify the strength and direction of the association.
• Need the odds ratio to do this.
• The odds ratio defines "how many times more likely" it is to be in one category compared to the other.

Example: For the previous example on severe chest pain, males are about 1.4 times more likely to experience severe chest pains than females.

Always know what the outcome/event of interest is, and what the baseline reference is! Otherwise the OR can be interpreted both ways!

Odds ratio and relative risk

                Pos. outcome   Neg. outcome
Exposure (+)    a              b
Exposure (-)    c              d

$$OR = \frac{a/b}{c/d} = \frac{a/c}{b/d} = \frac{ad}{bc}$$

Calculation of the odds ratio is pretty straightforward: take the product of the leading diagonal divided by the product of the antidiagonal.

$$RR = \frac{a/(a+b)}{c/(c+d)} \neq \frac{a/(a+c)}{b/(b+d)}$$

Relative risk is more tricky though, since it is not symmetric! While it is commonly used interchangeably with OR, the interpretation and calculation are very different!

Exegesis on epidemiology

Case-Control Study
• Compare affected and unaffected individuals
• Usually retrospective in nature
• Temporal sequence cannot be established (timing for the onset of the disease)
• No information on population incidence of the disease
Odds ratio is the right metric here!

Cohort Study
• Usually random sampling of subjects within the population
• Prospective, retrospective or both
• Long follow-up; loss to follow-up
• Costly to conduct
• Temporal sequence can be established
• Provides information on population incidence of the disease
Relative risk is the appropriate metric here!

Confidence interval of odds ratio
• Not straightforward to obtain confidence intervals of the odds ratio directly (due to complexity in obtaining the variance)
• Straightforward to obtain the variance of the logarithm of the odds ratio:

$$\mathrm{Var}[\log(OR)] = \mathrm{Var}\left[\log\frac{\hat{p}_1}{1-\hat{p}_1}\right] + \mathrm{Var}\left[\log\frac{\hat{p}_2}{1-\hat{p}_2}\right] = \frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}$$

  where $\hat{p}_1$ and $\hat{p}_2$ are the estimated proportions in the two groups being compared.
• The odds ratio is always reported together with the p-value (obtained from Pearson's chi-square test) and the corresponding confidence interval.
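The 2 × 2 quantities discussed so far (Pearson and continuity-corrected chi-square, odds ratio, relative risk, and the confidence interval from Var[log(OR)]) can be sketched in plain Python for the chest pain by gender example. This is only an illustrative sketch: the slides give the cell count 46 and the margins (83, 990, 520, 553), so the remaining three cells are derived from those margins, and the function names are hypothetical.

```python
import math

# Chest pain by gender, laid out as in the a/b/c/d table:
# exposure = male, positive outcome = severe chest pain.
# Only 46 and the margins appear in the slides; the other cells are derived.
a = 46                 # males, chest pain +
b = 520 - a            # males, chest pain -
c = 83 - a             # females, chest pain +
d = 553 - c            # females, chest pain -


def chi_square_2x2(a, b, c, d, yates=False):
    """Pearson chi-square for a 2x2 table, optionally with Yates's correction."""
    n = a + b + c + d
    rows, cols = (a + b, c + d), (a + c, b + d)
    observed = ((a, b), (c, d))
    stat = 0.0
    for i in range(2):
        for j in range(2):
            e = rows[i] * cols[j] / n          # E_ij = R_i * C_j / n
            diff = abs(observed[i][j] - e)
            if yates:
                diff = max(diff - 0.5, 0.0)    # continuity correction
            stat += diff**2 / e
    return stat


def p_value_df1(stat):
    """Upper-tail probability of a chi-square with 1 df: erfc(sqrt(x / 2))."""
    return math.erfc(math.sqrt(stat / 2))


pearson = chi_square_2x2(a, b, c, d)
corrected = chi_square_2x2(a, b, c, d, yates=True)

odds_ratio = (a * d) / (b * c)
relative_risk = (a / (a + b)) / (c / (c + d))

# 95% CI for the odds ratio via Var[log(OR)] = 1/a + 1/b + 1/c + 1/d.
half_width = 1.96 * math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
ci_low = math.exp(math.log(odds_ratio) - half_width)
ci_high = math.exp(math.log(odds_ratio) + half_width)

print(f"Pearson chi-square  = {pearson:.3f}, p = {p_value_df1(pearson):.3f}")
print(f"Yates corrected     = {corrected:.3f}, p = {p_value_df1(corrected):.3f}")
print(f"OR = {odds_ratio:.2f} (95% CI {ci_low:.2f}, {ci_high:.2f}), RR = {relative_risk:.2f}")
```

The two chi-square values should agree with the Pearson and Continuity Correction rows of the SPSS table to three decimals, which is a useful sanity check on the derived cell counts.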
Case study on smoking and lung cancer

               Ca (+ve)   Ca (-ve)
Smoking (+)    1,301      1,205
Smoking (-)    56         152

Odds and Odds Ratio
Odds Ratio (OR) = (1301/56)/(1205/152) = 2.93
Pearson's chi-square = 47.985, on df = 1; p-value ≈ 0
Var[log(OR)] = 1/1301 + 1/56 + 1/1205 + 1/152 = 0.026
95% confidence interval = exp(log(2.93) ± 1.96 × √0.026) = (2.14, 4.02)

Beyond 2 x 2 tables

severe chest pain lasting 30 min or more * RACE Crosstabulation
                                  RACE
                          Chinese   Malay     Indian    Total
yes     Count             47        14        22        83
        % within RACE     6.2%      9.0%      13.5%     7.7%
no      Count             707       142       141       990
        % within RACE     93.8%     91.0%     86.5%     92.3%
Total   Count             754       156       163       1073
        % within RACE     100.0%    100.0%    100.0%    100.0%

Chi-Square Tests
                               Value      df   Asymp. Sig. (2-sided)
Pearson Chi-Square             10.300 a   2    .006
Likelihood Ratio               9.170      2    .010
Linear-by-Linear Association   10.156     1    .001
N of Valid Cases               1073
a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 12.07.

Nominal or ordinal

For categorical variables with two possible outcomes:
- It does not matter whether the variable is nominal or ordinal

For categorical variables with more than 2 outcomes:
- It is important to note whether the variable is nominal or ordinal
- The test to use is very different, and thus the conclusion reached can be very different.
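For tables larger than 2 × 2, the same expected-frequency formula applies cell by cell. The sketch below, with illustrative function names, recomputes the Pearson chi-square for the chest pain by race crosstabulation and checks the sample-size assumption. It relies on the fact that a chi-square with 2 degrees of freedom has upper-tail probability exp(-x/2), so it only covers the df = 2 case.

```python
import math

# Chest pain (rows: yes, no) by race (columns: Chinese, Malay, Indian).
table = [
    [47, 14, 22],
    [707, 142, 141],
]


def expected_counts(table):
    """Expected counts under independence: E_ij = R_i * C_j / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def pearson_chi_square(table):
    """Sum of (O - E)^2 / E over all cells."""
    expected = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )


stat = pearson_chi_square(table)
df = (len(table) - 1) * (len(table[0]) - 1)

# Valid for df = 2 only: P(chi-square_2 > x) = exp(-x / 2).
p_value = math.exp(-stat / 2)

# Sample-size assumption: smallest expected count and share of cells >= 5.
flat_expected = [e for row in expected_counts(table) for e in row]
min_expected = min(flat_expected)
share_at_least_5 = sum(e >= 5 for e in flat_expected) / len(flat_expected)

print(f"chi-square = {stat:.3f}, df = {df}, p = {p_value:.3f}")
print(f"minimum expected count = {min_expected:.2f}, "
      f"cells with expected count >= 5: {share_at_least_5:.0%}")
```

This treats race as nominal; the statistic ignores any ordering of the columns.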
Example: Consider the same dataset on severe chest pain, and suppose we have the smoking status of every individual, classified in order of increasing smoking intensity into:
- Non-smoker
- Daily smoker
- Excessive smoker

Chi-square test for trend

severe chest pain lasting 30 min or more * smoking status Crosstabulation
                                            smoking status
                                    non-smoker   daily smoker   Ex-smoker   Total
yes     Count                       53           19             9           81
        % within smoking status     6.7%         9.8%           13.2%       7.7%
no      Count                       736          174            59          969
        % within smoking status     93.3%        90.2%          86.8%       92.3%
Total   Count                       789          193            68          1050
        % within smoking status     100.0%       100.0%         100.0%      100.0%

Chi-Square Tests
                               Value     df   Asymp. Sig. (2-sided)
Pearson Chi-Square             5.243 a   2    .073
Likelihood Ratio               4.724     2    .094
Linear-by-Linear Association   5.236     1    .022
N of Valid Cases               1050
a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 5.25.

With non-smoker as the reference category:
OR(smoker) = 1.52 (0.88, 2.63), p = 0.180
OR(ex-smoker) = 2.11 (1.00, 4.51), p = 0.081

Linear-by-linear association

Adopts a correlational approach by calculating the Pearson correlation coefficient between the rows and the columns, allowing for ordinal outcomes in either.

Recode rows as: yes = 0, no = 1.
Recode columns as: non-smoker = 0, daily smoker = 1, ex-smoker = 2.
Each cell then contributes as many observations as its count, e.g. 53 observations for (yes, non-smoker) and 19 observations for (yes, daily smoker).

Consider the test statistic:

$$T = (N - 1)\, r^2 \sim \chi^2_1$$

T = (1050 – 1) × (–0.0706)² = 5.2356, where Pearson Correlation r = –0.0706. This is the Linear-by-Linear Association value in the Chi-Square Tests table above.

Nominal vs. Ordinal

Importance of recognising the kind of variables we have in order to identify the right test! In the Chi-Square Tests table above, the Pearson chi-square, which ignores the ordering of the categories, gives p = .073, while the linear-by-linear test, which uses it, gives p = .022.

Procedure for Categorical Data Analysis
• Summarise data using cross-tabulation tables, with percentages
• Recognise whether any of the variables are ordinal
• Perform a chi-square test of independence to test for association between the two categorical variables, or the linear-by-linear test if at least one of the two variables is ordinal
• Check the validity of the assumption on the sample size
• Quantify any significant association using odds ratios
• Always report odds ratios with the corresponding 95% confidence interval

Categorical Data Analysis in SPSS

Example: Let's consider the lung cancer and smoking example:

               Ca (+ve)   Ca (-ve)
Smoking (+)    1,301      1,205
Smoking (-)    56         152

1. Establish the relationship between the onset of lung cancer and smoking status. Quantify this relationship if it is statistically significant.

Data entry
Slightly counter-intuitive: the event of interest and outcome of interest should be coded as 0, and the baseline reference outcome/event coded as 1.
Define what 0 and 1 correspond to under "Values". The 0s and 1s are then displayed as the labels you specified under "Values".
Percentages are much easier and more meaningful to interpret than absolute numbers!
In the SPSS output:
• Highly significant, P < 0.001
• Odds ratio of getting lung cancer with corresponding 95% CI, with non-smoker as baseline
• Relative risk of getting lung cancer with corresponding 95% CI, with non-smoker as baseline

Students should be able to
• understand the use of a chi-square test for testing independence between two categorical outcomes
• understand the assumptions on sample sizes for the use of a chi-square test
• know how to quantify the association using odds ratio/relative risk, with corresponding 95% confidence intervals
• differentiate between the tests to be used for nominal categorical and ordinal categorical variables
• perform the appropriate analyses in SPSS and RExcel
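For the nominal-versus-ordinal objective, the linear-by-linear statistic T = (N – 1) r² from the smoking-status crosstabulation can also be reproduced outside SPSS. The following is a minimal Python sketch under the recoding given in the slides (rows: yes = 0, no = 1; columns: non-smoker = 0, daily smoker = 1, ex-smoker = 2); the helper name `weighted_pearson_r` is hypothetical, and the p-value uses the identity that a chi-square with 1 df has upper-tail probability erfc(sqrt(x/2)).

```python
import math

# Chest pain (rows: yes, no) by smoking status
# (columns: non-smoker, daily smoker, ex-smoker).
counts = [
    [53, 19, 9],
    [736, 174, 59],
]
row_scores = [0, 1]      # yes = 0, no = 1
col_scores = [0, 1, 2]   # non-smoker = 0, daily smoker = 1, ex-smoker = 2


def weighted_pearson_r(counts, row_scores, col_scores):
    """Pearson correlation between row and column scores, weighted by cell counts."""
    n = sx = sy = sxx = syy = sxy = 0
    for i, row in enumerate(counts):
        for j, cnt in enumerate(row):
            x, y = col_scores[j], row_scores[i]
            n += cnt
            sx += cnt * x
            sy += cnt * y
            sxx += cnt * x * x
            syy += cnt * y * y
            sxy += cnt * x * y
    cov = sxy - sx * sy / n
    var_x = sxx - sx * sx / n
    var_y = syy - sy * sy / n
    return cov / math.sqrt(var_x * var_y), n


r, n = weighted_pearson_r(counts, row_scores, col_scores)
t_stat = (n - 1) * r**2

# Upper-tail probability of a chi-square with 1 df.
p_value = math.erfc(math.sqrt(t_stat / 2))

print(f"r = {r:.4f}, T = {t_stat:.3f}, p = {p_value:.3f}")
```

Changing the scores (for example, unequal spacing between the smoking categories) changes r and therefore T, so the choice of scores is itself a modelling assumption.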