Categorical data

Previously, we've analyzed data using t-tests, ANOVA, or linear regression, for which the dependent (response, y) variable was quantitative. These methods are usually not appropriate when the dependent variable is categorical:

    live, die
    cured, not cured
    success, failure
    disease, no disease
    republican, democrat, other
    none, mild, moderate, severe

Sometimes we have only two levels of the categorical response (success, failure) and sometimes we have more than two levels (republican, democrat, other). Sometimes there is no order to the levels, and sometimes there is an order (none, mild, moderate, severe).

We use categorical data analysis methods to analyze categorical response variables. Among the commonly used categorical methods are the following.

    Frequency tables
    Chi-square test
    Fisher exact test
    Cochran-Mantel-Haenszel (CMH) test
    Logistic regression

We'll cover frequency tables, the chi-square test, and the Fisher exact test in some detail. We won't cover CMH and logistic regression, but you should be aware they exist if you do analysis of categorical data.

Frequency and contingency tables

Frequency tables or contingency tables provide counts of the number of individuals who experience some event (such as cancer) conditioned on some property of the individual (such as having a specific gene mutation or not).

    > mutation.data = matrix(c(5,35,20,580), nrow=2, byrow=TRUE)
    > colnames(mutation.data) = c("Cancer", "No cancer")
    > rownames(mutation.data) = c("Mutation", "No mutation")
    > mutation.data
                Cancer No cancer
    Mutation         5        35
    No mutation     20       580

Tables that show frequencies or counts have several names.

    Frequency table: shows the frequency (and sometimes percent) of each category.
    2*2 table: a table with 2 rows and 2 columns, as in the example above.
    r*c table: a table with r rows and c columns.

These tables are also sometimes called contingency tables.
The word contingency just indicates that the counts are contingent on (dependent on) some property of the individual, such as cancer/not versus mutation/not.

Chi-square and Fisher exact tests

Suppose we wish to know if a drug is effective at curing a disease. We perform an experiment, and observe the following result.

Observed outcome:

                 Drug  Placebo
    Cured          50        0
    Not cured       0       50

In this case, we observe that all the patients who received the drug were cured, while none of the patients who received the placebo were cured.

The null hypothesis is that the drug doesn't have any effect, that is, there is no association between treatment and cure.

    Ho: There is no association between treatment and cure.

Overall, 50 of the 100 patients were cured. If there is no association between drug and cure, we therefore expect half of each group to be cured, and half not to be cured.

Expected result if the null hypothesis is true:

                 Drug  Placebo
    Cured          25       25
    Not cured      25       25

We use the chi-square test (or the Fisher exact test) to test for an association between treatment and cure. The chi-square test examines the difference between the observed outcome and the outcome expected by chance, and tells us the probability that a difference at least as large as the one observed would have occurred in the absence of any true association (between drug and cure in this case).

    > cure.data = matrix(c(50,0,0,50), nrow=2, byrow=TRUE)
    > colnames(cure.data) = c("Cured", "Not cured")
    > rownames(cure.data) = c("Drug", "Placebo")
    > cure.data
            Cured Not cured
    Drug       50         0
    Placebo     0        50

We use the R function chisq.test() to test for an association between treatment (drug vs placebo) and outcome (cured vs not cured):

    > chisq.test(cure.data)

            Pearson's Chi-squared test with Yates' continuity correction

    data:  cure.data
    X-squared = 96.04, df = 1, p-value < 2.2e-16

The p-value is p < 2.2e-16, which is essentially zero. The p-value is significant at alpha = 0.05, so we reject the null hypothesis that there is no association between drug and cure.
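The object returned by chisq.test() also stores the pieces discussed above, so the decision can be checked in code rather than read off the printed output. The following is a minimal sketch; the variable names result, alpha, and reject are ours and only illustrative.

```r
# Rebuild the drug/placebo table from the example above.
cure.data <- matrix(c(50, 0, 0, 50), nrow = 2, byrow = TRUE,
                    dimnames = list(c("Drug", "Placebo"),
                                    c("Cured", "Not cured")))

result <- chisq.test(cure.data)

result$observed    # the counts we entered
result$expected    # counts expected under the null hypothesis: 25 in every cell
result$statistic   # X-squared = 96.04 (with the Yates correction)
result$p.value     # the p-value as a number

# Compare the p-value with the chosen significance level.
alpha <- 0.05
reject <- result$p.value < alpha
reject             # TRUE: reject the null hypothesis of no association
```

Storing the result in a variable is convenient when the same test is run on many tables and only the p-values are needed.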
Optional material: Calculation of the chi-square statistic and p-value

In this section, we'll examine how the chi-square statistic and its p-value are calculated. You don't have to know this to use the chi-square test, but it is helpful to understand the method, and also to understand when it can fail.

You may recall the T statistic used in the t-test.

$$T = \frac{\mathrm{Mean}(\text{placebo}) - \mathrm{Mean}(\text{drug})}{\sqrt{\mathrm{SEM}(\text{placebo})^2 + \mathrm{SEM}(\text{drug})^2}}$$

The T statistic is a function of the difference between the mean of the drug group and the mean of the placebo group. Notice that, if the drug has no effect, the expected value of the mean for the drug group is just the mean of the placebo group.

    If the drug has no effect: mean of the drug (observed result) = mean of the placebo (expected result)
    If the drug has no effect: observed - expected = 0

The chi-square test follows the same logic. It compares the observed result (what we actually see) to the expected result if the null hypothesis is true (what we would see if there is no association). The chi-square statistic is a function of the differences between the observed (O) and expected (E) number of events in each cell of the table. Suppose we have a table with g = 2*2 = 4 cells.

Observed outcome:

                 Drug  Placebo
    Cured          50        0
    Not cured       0       50

Expected result if the null hypothesis is true:

                 Drug  Placebo
    Cured          25       25
    Not cured      25       25

Here is the formula to calculate the chi-square statistic:

$$\chi^2 = \sum_{i=1}^{g} \frac{(O_i - E_i)^2}{E_i}$$

In our example, we have g = 2*2 = 4 cells.

$$\chi^2 = \frac{(50-25)^2}{25} + \frac{(0-25)^2}{25} + \frac{(0-25)^2}{25} + \frac{(50-25)^2}{25} = \frac{25^2}{25} + \frac{25^2}{25} + \frac{25^2}{25} + \frac{25^2}{25} = 100$$

The p-value for the chi-square test is the probability that a chi-square value of 100 or greater would occur when the null hypothesis is true.

The actual value of the chi-square statistic calculated by the R software (96.04) is slightly smaller than 100, because it applies a correction factor called the "Yates continuity correction". The Yates continuity correction is intended to make the estimated p-value more accurate.
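The arithmetic above can be reproduced in a few lines of R. This is a sketch for illustration, not a replacement for chisq.test(). The names observed, expected, chi.sq, and chi.sq.yates are ours. The corrected version subtracts 0.5 from each |O - E| before squaring, which is the usual form of the Yates correction for a 2*2 table. The p-value uses pchisq() with df = 1, the degrees of freedom reported in the chisq.test() output.

```r
# Observed and expected counts for the drug/placebo example.
observed <- matrix(c(50, 0, 0, 50), nrow = 2, byrow = TRUE)
expected <- matrix(25, nrow = 2, ncol = 2)

# Uncorrected chi-square statistic: sum over cells of (O - E)^2 / E.
chi.sq <- sum((observed - expected)^2 / expected)
chi.sq         # 100

# Yates continuity correction: shrink each |O - E| by 0.5 before squaring.
chi.sq.yates <- sum((abs(observed - expected) - 0.5)^2 / expected)
chi.sq.yates   # 96.04, the value printed by chisq.test()

# p-value: probability of a statistic this large or larger under the
# null hypothesis, from the chi-square distribution with 1 degree of freedom.
p.value <- pchisq(chi.sq.yates, df = 1, lower.tail = FALSE)
p.value
```

Calling chisq.test(observed, correct = FALSE) returns the uncorrected statistic of 100.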
Optional material: How to calculate the expected value in each cell and get the p-value using chi-square

The expected value for each cell is E_ij = (row i total * column j total) / grand total, as follows.

1. Calculate the row totals, column totals, and grand total for the observed data.

                    Drug  Placebo  Row total
    Cured             50        0         50
    Not cured          0       50         50
    Column total      50       50        100  <= Grand total

2. Calculate the expected value for each cell as (row total * column total / grand total).

                    Drug  Placebo  Row total
    Cured             25       25         50
    Not cured         25       25         50
    Column total      50       50        100  <= Grand total

The R function chisq.test() automatically calculates the expected count in each cell, so you don't need to.

Chi-square test example

Example: is there an association between cancer and smoking?

Suppose we have data on 60 individuals who smoke and 60 who don't. We determine the number of each that have cancer.

                  Cancer  No cancer
    Smoke             40         20
    No smoking        20         40

What is the probability that we see this distribution if there is no association between smoking and cancer?

    > smoking.data = matrix(c(40,20,20,40), nrow=2, byrow=TRUE)
    > colnames(smoking.data) = c("Cancer", "No cancer")
    > rownames(smoking.data) = c("Smoke", "No smoking")
    > smoking.data
               Cancer No cancer
    Smoke          40        20
    No smoking     20        40
    > chisq.test(smoking.data)

            Pearson's Chi-squared test with Yates' continuity correction

    data:  smoking.data
    X-squared = 12.0333, df = 1, p-value = 0.0005226

The p-value = 0.0005226 is significant at alpha = 0.05, so we reject the null hypothesis that there is no association between cancer and smoking.

It is not required that the number of individuals in each group be the same.

Example: is there an association between a mutation and cancer?

Collect data on 40 individuals who have the mutation and 600 individuals who don't. Determine the number of each that have cancer.
    > mutation.data = matrix(c(5,35,20,580), nrow=2, byrow=TRUE)
    > colnames(mutation.data) = c("Cancer", "No cancer")
    > rownames(mutation.data) = c("Mutation", "No mutation")
    > mutation.data
                Cancer No cancer
    Mutation         5        35
    No mutation     20       580
    > chisq.test(mutation.data)

            Pearson's Chi-squared test with Yates' continuity correction

    data:  mutation.data
    X-squared = 6.1301, df = 1, p-value = 0.01329

    Warning message:
    In chisq.test(mutation.data) : Chi-squared approximation may be incorrect

The p-value = 0.01329 is significant at alpha = 0.05, so we reject the null hypothesis that there is no association between the mutation and cancer.

Notice that we get a warning message:

    Warning message:
    In chisq.test(mutation.data) : Chi-squared approximation may be incorrect

We get the warning because the chi-square test gives an approximate p-value (the Fisher exact test gives an exact one), and the approximation requires that certain assumptions are met. If the assumptions are not met, the chi-square p-value may be quite wrong. Usually, the assumptions are met if the expected number of observations in each cell of the table is at least 5. We can check this for the mutation data.

1. Calculate the row totals, column totals, and grand total for the observed data.

                    Cancer  No cancer  Row total
    Mutation             5         35         40
    No mutation         20        580        600
    Column total        25        615        640  <= Grand total

2. Calculate the expected value for each cell as (row total * column total / grand total).

                    Cancer  No cancer  Row total
    Mutation        1.5625    38.4375         40
    No mutation    23.4375   576.5625        600
    Column total        25        615        640  <= Grand total

In this example, the expected number in the cell Mutation/Cancer is 1.56, so the chi-square approximation may not give us a reliable p-value. In these situations, we use the Fisher exact test.

Fisher exact test

The chi-square test is an approximate test that is fast and easy to calculate. It is based on the normal approximation to the binomial distribution. The approximation works reasonably well for large sample sizes, but the p-values are inaccurate for small sample sizes.
By default, for 2x2 tables, R uses a version of the chi-square test with the Yates continuity correction that gives relatively accurate p-values even for small sample sizes. When the expected number of observations in any cell is small (less than 5), the Fisher exact test is preferred.

    > fisher.test(mutation.data)

            Fisher's Exact Test for Count Data

    data:  mutation.data
    p-value = 0.01552
    alternative hypothesis: true odds ratio is not equal to 1
    95 percent confidence interval:
      1.142297 12.258694
    sample estimates:
    odds ratio
      4.126956

In this example, the chi-square approximation with Yates continuity correction gave p=0.01329, while the Fisher exact test gives p=0.01552, so the results are similar.

Here's another example where the results are different.

    > fisher.data = matrix(c(2,15,2,160), nrow=2, byrow=TRUE)
    > colnames(fisher.data) = c("Cancer", "No cancer")
    > rownames(fisher.data) = c("Mutation", "No mutation")
    > fisher.data
                Cancer No cancer
    Mutation         2        15
    No mutation      2       160
    > fisher.test(fisher.data)

            Fisher's Exact Test for Count Data

    data:  fisher.data
    p-value = 0.04561
    alternative hypothesis: true odds ratio is not equal to 1
    95 percent confidence interval:
       0.7058128 152.8072330
    sample estimates:
    odds ratio
      10.37043

    > chisq.test(fisher.data)

            Pearson's Chi-squared test with Yates' continuity correction

    data:  fisher.data
    X-squared = 3.7327, df = 1, p-value = 0.05336

    Warning message:
    In chisq.test(fisher.data) : Chi-squared approximation may be incorrect

In this example, the chi-square approximation gives p=0.053, which is not significant at alpha = 0.05, while the Fisher exact test gives p=0.0456, which is significant.
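The rule of thumb above (use the Fisher exact test when any expected count is below 5) can be applied in code. The sketch below computes the expected counts as row total * column total / grand total and then picks the test. The names expected and result are ours, and the threshold of 5 is the rule of thumb from these notes, not a setting built into R.

```r
# The second mutation example, rebuilt with row and column names.
fisher.data <- matrix(c(2, 15, 2, 160), nrow = 2, byrow = TRUE,
                      dimnames = list(c("Mutation", "No mutation"),
                                      c("Cancer", "No cancer")))

# Expected count in each cell: row total * column total / grand total.
expected <- outer(rowSums(fisher.data), colSums(fisher.data)) / sum(fisher.data)
expected

# Rule of thumb: if any expected count is below 5, prefer the exact test.
if (any(expected < 5)) {
  result <- fisher.test(fisher.data)
} else {
  result <- chisq.test(fisher.data)
}

result$method    # which test was actually run
result$p.value
```

For this table the smallest expected count is well below 5, so the sketch runs fisher.test(). Checking the expected counts first also avoids relying on the warning message from chisq.test().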
Optional material: Chi-square and Fisher tests for tables with more than two rows or columns

The chi-square and Fisher exact tests are not limited to tables with only two rows or columns.

    > drug2.data = matrix(c(12, 33, 14, 51, 3, 39), nrow=3, byrow=TRUE)
    > colnames(drug2.data) = c("Relapse", "No relapse")
    > rownames(drug2.data) = c("Placebo", "Dose A", "Dose B")
    > drug2.data
            Relapse No relapse
    Placebo      12         33
    Dose A       14         51
    Dose B        3         39
    > fisher.test(drug2.data)

            Fisher's Exact Test for Count Data

    data:  drug2.data
    p-value = 0.04273
    alternative hypothesis: two.sided

    > chisq.test(drug2.data)

            Pearson's Chi-squared test

    data:  drug2.data
    X-squared = 5.8086, df = 2, p-value = 0.05479

In this case, the Fisher test is significant (p=0.04273) while the chi-square test is not (p=0.05479).

The chi-square test was developed before we had fast computers. For most biomedical data sets, it is possible to run the Fisher exact test in a reasonable amount of time.

Examples using data frames

For these examples we'll use the stroke data set from the ISwR package.

    > install.packages("ISwR")
    > library(ISwR)
    > head(stroke)
         sex       died       dstr age dgn coma diab minf han  dead   obsmonths
    1   Male 1991-01-07 1991-01-02  76 INF   No   No  Yes  No  TRUE  0.16339869
    2   Male       <NA> 1991-01-03  58 INF   No   No   No  No FALSE 59.60784314
    3   Male 1991-06-02 1991-01-08  74 INF   No   No  Yes Yes  TRUE  4.73856209
    4 Female 1991-01-13 1991-01-11  77 ICH   No  Yes   No Yes  TRUE  0.06535948
    5 Female       <NA> 1991-01-13  76 INF   No  Yes   No Yes FALSE 59.28104575
    6   Male 1991-01-13 1991-01-13  48 ICH  Yes   No   No Yes  TRUE  0.10000000
    > ?stroke

A data frame with 829 observations on the following 10 variables.

    sex        a factor with levels Female and Male.
    died       a Date, date of death.
    dstr       a Date, date of stroke.
    age        a numeric vector, age at stroke.
    dgn        a factor, diagnosis, with levels ICH (intracranial haemorrhage), ID (unidentified), INF (infarction, ischaemic), SAH (subarachnoid haemorrhage).
    coma       a factor with levels No and Yes, indicating whether patient was in coma after the stroke.
    diab       a factor with levels No and Yes, history of diabetes.
    minf       a factor with levels No and Yes, history of myocardial infarction.
    han        a factor with levels No and Yes, history of hypertension.
    obsmonths  a numeric vector, observation times in months (set to 0.1 for patients dying on the same day as the stroke).
    dead       a logical vector, whether patient died during the study.

Example. Is there an association between sex and diabetes in this data set?

    > mytable = xtabs(~ sex + diab, data=stroke)
    > mytable
            diab
    sex       No Yes
      Female 434  69
      Male   288  28
    > chisq.test(mytable)

            Pearson's Chi-squared test with Yates' continuity correction

    data:  mytable
    X-squared = 3.932, df = 1, p-value = 0.04738

    > fisher.test(mytable)

            Fisher's Exact Test for Count Data

    data:  mytable
    p-value = 0.04509
    alternative hypothesis: true odds ratio is not equal to 1
    95 percent confidence interval:
     0.3699738 0.9894478
    sample estimates:
    odds ratio
     0.6118643

Both tests give p-values just below 0.05 (chi-square p=0.04738, Fisher p=0.04509), so we reject the null hypothesis that there is no association between sex and diabetes.

Example. Is there an association between sex and death in this data set?

    > mytable = xtabs(~ sex + dead, data=stroke)
    > mytable
            dead
    sex      FALSE TRUE
      Female   189  321
      Male     155  164
    > prop.table(mytable,1)
            dead
    sex          FALSE      TRUE
      Female 0.3705882 0.6294118
      Male   0.4858934 0.5141066

In these data, 62.9% of women and 51.4% of men die. Is this difference statistically significant?

    > chisq.test(mytable)

            Pearson's Chi-squared test with Yates' continuity correction

    data:  mytable
    X-squared = 10.2779, df = 1, p-value = 0.001346

    > fisher.test(mytable)

            Fisher's Exact Test for Count Data

    data:  mytable
    p-value = 0.001122
    alternative hypothesis: true odds ratio is not equal to 1
    95 percent confidence interval:
     0.4644489 0.8359763
    sample estimates:
    odds ratio
     0.6233559

Both p-values are well below 0.05 (chi-square p=0.001346, Fisher p=0.001122), so we reject the null hypothesis that there is no association between sex and death.
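When both tests are run on every table, it can help to wrap them in a small function. The sketch below re-enters the sex-by-death counts from the output above by hand, so it runs without the ISwR package. With the package loaded, xtabs(~ sex + dead, data=stroke) gives the same table. The helper compare.tests() and the names sex.dead, row.percent, and p.values are hypothetical and not part of R or ISwR.

```r
# Sex-by-death counts copied from the xtabs() output above.
sex.dead <- matrix(c(189, 321, 155, 164), nrow = 2, byrow = TRUE,
                   dimnames = list(sex = c("Female", "Male"),
                                   dead = c("FALSE", "TRUE")))

# Row percentages: within each sex, the percent who survived or died.
row.percent <- 100 * prop.table(sex.dead, margin = 1)
round(row.percent, 1)

# Hypothetical helper: run both tests and return the two p-values.
compare.tests <- function(tab) {
  c(chisq  = chisq.test(tab)$p.value,
    fisher = fisher.test(tab)$p.value)
}

p.values <- compare.tests(sex.dead)
p.values
p.values < 0.05   # TRUE for both tests on this table
```

The same helper can be applied to any of the tables in these notes. For tables with small expected counts, chisq.test() will still print its warning.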