From: http://statland.org/R/Rchisq.htm

1. Chi-square with R

R can do all the usual chi-square tests, with either raw data or tables of counts. We start with the heart attack data and the goodness-of-fit test.

> names(heartatk)
[1] "Patient"   "DIAGNOSIS" "SEX"       "DRG"       "DIED"      "CHARGES"
[7] "LOS"       "AGE"
> attach(heartatk)
> table(SEX)
SEX
   F    M
5065 7779
> chisq.test(table(SEX))

        Chi-squared test for given probabilities

data:  table(SEX)
X-squared = 573.4815, df = 1, p-value < 2.2e-16

The command chisq.test(table(SEX)) does a chi-square goodness-of-fit test on the table for the SEX variable. The default is to test for equal expected counts in every cell. That is hardly the case here. 2.2e-16 means 2.2 x 10^-16 = 0.00000000000000022, which is small.

If you do not want equal proportions, you need to give a proportion for each cell. For example, genetic theory predicts that certain fruit flies will fall into four categories in proportions 9:3:3:1. Data showed counts of 59, 20, 11 and 10.

> chisq.test(c(59,20,11,10),p=c(9/16,3/16,3/16,1/16))

        Chi-squared test for given probabilities

data:  c(59, 20, 11, 10)
X-squared = 5.6711, df = 3, p-value = 0.1288

We would not reject the theoretical hypothesis with these data.

You can make a cross-tabulation of two categorical variables with table and do a test of independence or homogeneity with chisq.test. (We return to the heart attack data.)

> table(SEX,DIED)
   DIED
SEX    0    1
  F 4298  767
  M 7136  643
> chisq.test(table(SEX,DIED))

        Pearson's Chi-squared test with Yates' continuity correction

data:  table(SEX, DIED)
X-squared = 147.7612, df = 1, p-value < 2.2e-16

We have seen this p-value before! That is no coincidence: 2.2e-16 is R's machine epsilon, and R prints any smaller p-value simply as "< 2.2e-16" rather than reporting its digits. However, p is definitely small! Hence we reject the hypothesis that the mortality rate is the same for men and women. Looking at the data, it is higher for women (767 of 5065, about 15.1%) than for men (643 of 7779, about 8.3%).
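To see which group has the higher mortality rate, it can help to turn the counts into row proportions. The following is only a sketch and not part of the original transcript: it re-enters the SEX by DIED counts shown above by hand (the name deaths is made up for illustration), uses prop.table to get the proportions within each sex, and reruns chisq.test on the hand-entered table.

```r
# Re-enter the SEX by DIED counts from the cross-tabulation above.
# 'deaths' is a hypothetical name chosen for this sketch.
deaths <- matrix(c(4298, 767,
                   7136, 643),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(SEX = c("F", "M"), DIED = c("0", "1")))

# Proportion in each DIED category within each sex (each row sums to 1).
rates <- prop.table(deaths, margin = 1)
print(round(rates, 3))

# Same test as chisq.test(table(SEX, DIED)); Yates' continuity
# correction is applied by default to a 2-by-2 table.
result <- chisq.test(deaths)
print(result$statistic)
```

The DIED = 1 column of rates holds the in-hospital death rate for each sex. With the data frame attached, prop.table(table(SEX,DIED), 1) should give the same proportions without retyping the counts.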
With 12,844 observations, getting the table is a lot more work than computing chi-square, and it is best to let the computer do it. If you have an existing table, R will analyze it, but not without putting up a fight. You need to enter the table one row at a time and use rbind to combine the rows into a table. Here is some data on hepatitis C incidence (yes, no) and tattoo "status": from a parlor, elsewhere, and no tattoo.

> hep=rbind(c(17,35),c(8,53),c(22,491))
> hep
     [,1] [,2]
[1,]   17   35
[2,]    8   53
[3,]   22  491
> chisq.test(hep)

        Pearson's Chi-squared test

data:  hep
X-squared = 57.9122, df = 2, p-value = 2.658e-13

Warning message:
Chi-squared approximation may be incorrect in: chisq.test(hep)

The warning message probably refers to the small expected counts in the first column: about 3.9 for the cell where the 17 appears and about 4.6 for the cell where the 8 appears, both below 5. You could overcome that by pooling the two tattoo sources together. Either way, getting tattoos seems to greatly increase the risk of hepatitis C.

http://statland.org/R/

2. Reading Tables into R

In school, we work mainly with tiny datasets we can type into our technology if all else fails. In practice, we generally start with raw data in a computer file. Often, getting the data into a form our technology can work with is a major undertaking. We will go through that once right now.

R can read data from a text file. The text file has to be in the form of a table with columns representing variables. All columns must be the same length. Missing data must be signified by "NA". Optionally, the first row of the file may contain names for the variables.

You can access such a file named heartatk4R.txt. Download and save this file in R's working directory, which you can display with getwd(). The file looks like this.
Patient DIAGNOSIS SEX DRG DIED CHARGES   LOS  AGE
1       41041     F   122 0    4752      0010 079
2       41041     F   122 0    3941      0006 034
3       41091     F   122 0    3657      0005 076
4       41081     F   122 0    1481      0002 080
5       41091     M   122 0    1681      0001 055
6       41091     M   121 0    6378.6400 0009 084
7       41091     F   121 0    10958.520 0015 084
8       41091     F   121 0    16583.930 0015 070
9       41041     M   121 0    4015.3300 0002 076
10      41041     F   123 1    1989.4400 0001 065
11      41041     F   121 0    7471.6300 0006 052
12      41091     M   121 0    3930.6300 0005 072
13      41091     F   122 0    NA        0009 083
14      41091     F   122 0    4433.9300 0004 061
15      41041     M   122 0    3318.2100 0002 053
16      41041     M   122 0    4863.8300 0005 077
17      41041     M   121 0    5000.6400 0003 053

Above are only the first 17 cases (of 12,844). To use this in R you must define a variable to be equal to the contents of this file.

> heartatk = read.table("heartatk4R.txt",header=TRUE)

The argument header=TRUE tells R that the first row of the file should be interpreted as variable names. (These should not include spaces.) You can now get a table of contents for what you have created in R with

> objects()

This should return heartatk along with any other variables you may have created. You will not see on this list any of the variables that are inside of heartatk because they are hiding. To see them, type

> names(heartatk)
[1] "Patient"   "DIAGNOSIS" "SEX"       "DRG"       "DIED"      "CHARGES"
[7] "LOS"       "AGE"

To bring them out of hiding, you must "attach" them to your R workspace. (This avoids conflicts if several tables include variables with the same name. Attach just one at a time.)

> attach(heartatk)

These data came from an ActivStats CD which provided this background information:

Heart Attack Patients

This set of data is all of the hospital discharges in New York State with an admitting diagnosis of an Acute Myocardial Infarction (AMI), also called a heart attack, who did not have surgery, in the year 1993. There are 12,844 cases.
AGE gives age in years.
SEX is coded M for males, F for females.
DIAGNOSIS is in the form of an International Classification of Diseases, 9th Revision, Clinical Modification code. These tell which part of the heart was affected.
DRG is the Diagnosis Related Group. It groups together patients with similar management. In this data set there are just three different DRGs:
    121 for AMIs with cardiovascular complications who did not die.
    122 for AMIs without cardiovascular complications who did not die.
    123 for AMIs where the patient died.
LOS gives the hospital length of stay in days.
DIED has a 1 for patients who died in hospital and a 0 otherwise.
CHARGES gives the total hospital charges in dollars.

Data provided by Health Process Management of Doylestown, PA.

After you attach the data table you can work with the internal variables, provided you remember that R is case-sensitive.

> table(sex)
Error in table(sex) : object "sex" not found
> table(SEX)
SEX
   F    M
5065 7779

© 2006 Robert W. Hayden. Data Desk is a registered trademark of Data Description.
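The reading steps in section 2 can be tried without downloading anything. The sketch below is a miniature built on assumptions, not the author's file: it passes read.table three rows in the same layout as heartatk4R.txt through its text argument instead of a file name, with one CHARGES value replaced by NA purely to show how a missing value is read. It also uses with() as an alternative to attach() for reaching the columns. The names file_text, mini and sex_counts are invented for the example.

```r
# Three rows in the layout of heartatk4R.txt: a header line, then
# whitespace-separated columns. The NA in the last row is inserted for
# illustration only; it is not taken from the real file.
file_text <- "Patient DIAGNOSIS SEX DRG DIED CHARGES LOS AGE
9 41041 M 121 0 4015.3300 0002 076
10 41041 F 123 1 1989.4400 0001 065
11 41041 F 121 0 NA 0006 052"

# text= supplies the table contents directly; header=TRUE takes the
# variable names from the first row, as in the transcript above.
mini <- read.table(text = file_text, header = TRUE)

print(names(mini))
print(sum(is.na(mini$CHARGES)))   # number of missing CHARGES values

# with() evaluates an expression inside the data frame, so columns can
# be used by name without attach(). Names remain case-sensitive.
sex_counts <- with(mini, table(SEX))
print(sex_counts)
```

Because with() only exposes the columns for the duration of one expression, it sidesteps the name conflicts that the "attach just one at a time" advice is guarding against.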