Topic 7 Independence of Categorical Variables

Activity 7-7: Pet Birds and Lung Cancer

Kohlmeier, Arminger, Bartolomeycik, Bellach, Rehm and Thomas (British Medical Journal, 305, pp. 986-989) conducted a study to determine whether keeping a pet bird could be considered a risk factor in the development of lung cancer. They took a sample of patients in Berlin who had contracted lung cancer and a sample of patients in Berlin who had not contracted lung cancer. The data are summarized in the following table.

                           Lung Cancer   No Lung Cancer
Kept a pet bird                 98            101
Did not keep a pet bird        141            328

a) Conduct a chi-square analysis to determine if keeping a pet bird is independent of developing lung cancer.

SPSS results are shown below.

Crosstabs

Case Processing Summary

                                                      Cases
                                        Valid         Missing          Total
                                      N   Percent    N   Percent     N   Percent
Kept Pet Bird? * Have Lung Cancer?   668   100.0%    0      .0%     668   100.0%

Kept Pet Bird? * Have Lung Cancer? Crosstabulation

                                               Have Lung Cancer?
Kept Pet Bird?                                   Yes       No     Total
Yes     Count                                     98      101       199
        Expected Count                          71.2    127.8     199.0
        % within Have Lung Cancer?             41.0%    23.5%     29.8%
No      Count                                    141      328       469
        Expected Count                         167.8    301.2     469.0
        % within Have Lung Cancer?             59.0%    76.5%     70.2%
Total   Count                                    239      429       668
        Expected Count                         239.0    429.0     668.0
        % within Have Lung Cancer?            100.0%   100.0%    100.0%

Chi-Square Tests

                                  Value      df   Asymp. Sig.   Exact Sig.   Exact Sig.
                                                   (2-sided)    (2-sided)    (1-sided)
Pearson Chi-Square               22.374 (b)   1      .000
Continuity Correction (a)        21.547       1      .000
Likelihood Ratio                 21.924       1      .000
Fisher's Exact Test                                                .000         .000
Linear-by-Linear Association     22.341       1      .000
N of Valid Cases                    668

a. Computed only for a 2x2 table.
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 71.20.

b) What do you conclude?

There is very strong evidence that keeping a pet bird and having lung cancer are related.

c) Is keeping a pet bird a risk factor?
The column percents indicate that a higher percentage of the lung cancer patients kept a pet bird (41.0% versus 23.5%), but that does not allow us to conclude that owning a pet bird is the reason. This is an observational study, not a controlled experiment.

Activity 7-8: High School Graduates

A random sample of 115 students leaving high school was taken to see what relationship, if any, existed between success in passing the General Certificate of Education Advanced Level examinations and where students were headed after graduation. These are much like the examinations taken by high school students in many states of the United States. The number of examinations each student passed was recorded, along with whether the student went on to college, to some other training program, or straight to employment. The results are summarized below.

                      Number of Examinations Passed
                     1      2      3    4 or more
College              1     10     29       16
Other training       7      7      4        1
Employment          14     12     11        3

a) Do the data indicate that there is a relationship between the number of advanced level examinations passed and the ultimate destination of high school graduates? Explain in detail how you justify your conclusion.

SPSS results for the original table are shown below.

Crosstabs

Case Processing Summary

                                                             Cases
                                               Valid         Missing          Total
                                             N   Percent    N   Percent     N   Percent
Dest. after Graduation * Number of Exams    115   100.0%    0      .0%     115   100.0%
Passed

Dest. after Graduation * Number of Exams Passed Crosstabulation

                                                          Number of Exams Passed
Dest. after Graduation                                  1        2        3   4 or more    Total
College          Count                                  1       10       29       16         56
                 Expected Count                      10.7     14.1     21.4      9.7       56.0
                 % within Number of Exams Passed     4.5%    34.5%    65.9%    80.0%      48.7%
Other training   Count                                  7        7        4        1         19
                 Expected Count                       3.6      4.8      7.3      3.3       19.0
                 % within Number of Exams Passed    31.8%    24.1%     9.1%     5.0%      16.5%
Employment       Count                                 14       12       11        3         40
                 Expected Count                       7.7     10.1     15.3      7.0       40.0
                 % within Number of Exams Passed    63.6%    41.4%    25.0%    15.0%      34.8%
Total            Count                                 22       29       44       20        115
                 Expected Count                      22.0     29.0     44.0     20.0      115.0
                 % within Number of Exams Passed   100.0%   100.0%   100.0%   100.0%     100.0%

Chi-Square Tests

                                  Value      df   Asymp. Sig. (2-sided)
Pearson Chi-Square               33.012 (a)   6          .000
Likelihood Ratio                 37.879       6          .000
Linear-by-Linear Association     25.270       1          .000
N of Valid Cases                    115

a. 3 cells (25.0%) have expected count less than 5. The minimum expected count is 3.30.

The expected counts for the Other training category are below 5 for three of the four categories. Combining College and Other training results in the following table.

Dest. after Graduation * Number of Exams Passed Crosstabulation

                                                                  Number of Exams Passed
Dest. after Graduation                                          1        2        3   4 or more    Total
College and other training   Count                              8       17       33       17         75
                             Expected Count                  14.3     18.9     28.7     13.0       75.0
                             % within Number of Exams Passed 36.4%   58.6%    75.0%    85.0%      65.2%
Employment                   Count                             14       12       11        3         40
                             Expected Count                   7.7     10.1     15.3      7.0       40.0
                             % within Number of Exams Passed 63.6%   41.4%    25.0%    15.0%      34.8%
Total                        Count                             22       29       44       20        115
                             Expected Count                  22.0     29.0     44.0     20.0      115.0
                             % within Number of Exams Passed 100.0%  100.0%   100.0%   100.0%     100.0%

Chi-Square Tests

                                  Value      df   Asymp. Sig. (2-sided)
Pearson Chi-Square               13.937 (a)   3          .003
Likelihood Ratio                 14.030       3          .003
Linear-by-Linear Association     13.380       1          .000
N of Valid Cases                    115

a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 6.96.

Combining Other training and Employment results in the following table.

Dest. after Graduation * Number of Exams Passed Crosstabulation

                                                                  Number of Exams Passed
Dest. after Graduation                                          1        2        3   4 or more    Total
College                      Count                              1       10       29       16         56
                             Expected Count                  10.7     14.1     21.4      9.7       56.0
                             % within Number of Exams Passed  4.5%   34.5%    65.9%    80.0%      48.7%
Employment and other         Count                             21       19       15        4         59
training                     Expected Count                  11.3     14.9     22.6     10.3       59.0
                             % within Number of Exams Passed 95.5%   65.5%    34.1%    20.0%      51.3%
Total                        Count                             22       29       44       20        115
                             Expected Count                  22.0     29.0     44.0     20.0      115.0
                             % within Number of Exams Passed 100.0%  100.0%   100.0%   100.0%     100.0%

Chi-Square Tests

                                  Value      df   Asymp. Sig. (2-sided)
Pearson Chi-Square               32.573 (a)   3          .000
Likelihood Ratio                 37.367       3          .000
Linear-by-Linear Association     31.319       1          .000
N of Valid Cases                    115

a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 9.74.

In both cases there is strong or very strong evidence that destination after graduation and number of examinations passed are related.

b) If there is a relation, how would you describe it?

In the first case, the more examinations a student passed, the more likely the student is to be going to college or other training. In the second case, the more examinations a student passed, the more likely the student is to be going to college.

Activity 7-9: Gender and Politics

a) Conduct a chi-square analysis on the data collected in the Preliminaries to ascertain whether there is a relationship between the gender of a student at your school and his/her political orientation.

Results will vary according to the data collected.

b) What do you conclude? Justify your conclusions.

Results will vary according to the data collected.

c) If a pattern exists, describe it.

Results will vary according to the data collected.

Activity 7-10: Religiosity and Marriage

The SPSS data set REL_MAR.SAV contains a portion of the NSFH data from NSFH.SAV. Marital Status gives the marital status of the respondent: 1 = married, 2 = separated, 3 = divorced, and 4 = widowed.
Religiosity gives the number of times per month the respondent attends church: 0 = never, 1 = once a week or less, 2 = twice a week, and 3 = more than twice a week.

a) Do the data indicate there is a relation between marital status and the number of times one attends church? Justify your conclusions.

With the original data there were three cells with expected counts less than 5 in the more than twice a week category. Combining this with the twice a week category gives the following results.

Marital Status * Religiosity Crosstabulation

                                                        Religiosity
                                                    Once per week   Twice a week
Marital Status                             Never       or less        or more       Total
Married     Count                            210         536            371          1117
            Expected Count                 216.8       529.7          370.5        1117.0
            % within Religiosity           65.6%       68.5%          67.8%         67.7%
Separated   Count                             19          44             24            87
            Expected Count                  16.9        41.3           28.9          87.0
            % within Religiosity            5.9%        5.6%           4.4%          5.3%
Divorced    Count                             68         104             86           258
            Expected Count                  50.1       122.4           85.6         258.0
            % within Religiosity           21.3%       13.3%          15.7%         15.6%
Widowed     Count                             23          98             66           187
            Expected Count                  36.3        88.7           62.0         187.0
            % within Religiosity            7.2%       12.5%          12.1%         11.3%
Total       Count                            320         782            547          1649
            Expected Count                 320.0       782.0          547.0        1649.0
            % within Religiosity          100.0%      100.0%         100.0%        100.0%

Chi-Square Tests

                                  Value      df   Asymp. Sig. (2-sided)
Pearson Chi-Square               16.830 (a)   6          .010
Likelihood Ratio                 17.060       6          .009
Linear-by-Linear Association       .095       1          .758
N of Valid Cases                   1649

a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 16.88.

So, there is some evidence of a relationship between the number of times one attends religious services and marital status.

b) If there is a relationship, how would you characterize it?

There is a tendency for divorced people not to attend religious services, for separated people to attend services once a week or less, and for widowed persons to attend services once a week or more.
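The expected counts and the Pearson statistic in the output above can be recomputed by hand from the observed counts. The following is a minimal Python sketch, not SPSS's own procedure; the function name chi_square_independence and the variable names are illustrative assumptions. It also prints each cell's contribution to the statistic, which is one way to see which cells drive the relationship described in part b).

```python
# Sketch: Pearson chi-square for a two-way table of counts (pure Python).
# Counts are copied from the Marital Status * Religiosity crosstabulation.
# All names here are illustrative; they are not SPSS identifiers.

def chi_square_independence(observed):
    """Return (statistic, df, expected counts, per-cell contributions)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # Expected count under independence: row total * column total / n.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    # Each cell contributes (observed - expected)^2 / expected.
    contributions = [
        [(o - e) ** 2 / e for o, e in zip(obs_row, exp_row)]
        for obs_row, exp_row in zip(observed, expected)
    ]
    statistic = sum(sum(row) for row in contributions)
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df, expected, contributions

row_labels = ["Married", "Separated", "Divorced", "Widowed"]
col_labels = ["Never", "Once per week or less", "Twice a week or more"]
observed = [
    [210, 536, 371],
    [19, 44, 24],
    [68, 104, 86],
    [23, 98, 66],
]

statistic, df, expected, contributions = chi_square_independence(observed)
print(f"Pearson chi-square = {statistic:.3f} on {df} df")
for label, exp_row, con_row in zip(row_labels, expected, contributions):
    for col, e, c in zip(col_labels, exp_row, con_row):
        print(f"{label:10s} {col:22s} expected {e:7.1f}  contribution {c:6.3f}")
```

Cells with large contributions are those where the observed count departs most from what independence would predict, so they are the natural place to look when characterizing the relationship.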
Activity 7-11: Deciding Authorship

The chi-square test can be used in answering questions that come under the general heading of goodness-of-fit problems. Here the problem is to determine whether randomly selected items from a population follow a particular pattern. A particularly intriguing example of this type of analysis is that of trying to determine whether the pattern of words in manuscripts of unknown authorship follows the pattern of a particular author. To do this, a sample of the author's works is compared with the manuscripts whose author is not known. It is common to look at things such as the frequency of use of particular words and the frequency of use of words of various lengths. We shall look at an example of the latter.

In 1861 the New Orleans Crescent published ten letters that were signed by Quintus Curtius Snodgrass. The letters describe events that seem to have occurred, but there is no record of the existence of anyone named Quintus Curtius Snodgrass. Some claimed that the author was actually Mark Twain. The table below gives the distribution of word lengths from the Snodgrass letters and the distribution of word lengths from a sample of the works of Mark Twain.

Word Length          1     2     3     4     5     6     7     8     9    10    11    12   13+
Mark Twain         312  1146  1394  1177   661   442   367   231   181   109    50    24    12
Q. C. Snodgrass    424  2685  2752  2302  1431   992   896   638   465   276   152   101    61

To perform the analysis, let the two authors be the rows of a two-way table, and let the word lengths be the columns of the table. Then do the chi-square analysis on the resulting table.

a) The chi-square analysis we are doing tests the null hypothesis that the row and column variables are independent against the alternative hypothesis that they are not independent. In this problem does the null hypothesis of independence coincide with the hypothesis that the two authors are the same, or does it coincide with the hypothesis that they differ?
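If the two authors are the same, the pattern of word lengths would tend to be the same in both rows of the table. This coincides with the null hypothesis that word length and author are independent.

b) Conduct the chi-square analysis and summarize your conclusions.

SPSS results are shown below.

Chi-Square Tests

                                   Value        df   Asymp. Sig. (2-sided)
Pearson Chi-Square               19281.000 (a)  25          .000
Likelihood Ratio                 26706.476      25          .000
Linear-by-Linear Association       202.161       1          .000
N of Valid Cases                     19281

a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 5.79.

The minimum expected count is greater than 5, so the expected-count condition for the chi-square test is met. Note, however, that a 2 x 13 table has (2 - 1)(13 - 1) = 12 degrees of freedom, so the df of 25 reported above, together with a Pearson statistic equal to N, suggests that this output was not produced from the table as laid out in this activity and should be rechecked. The conclusion does not depend on it: in the word-length table, the length-1 column alone contributes about 39 to the Pearson statistic, which already exceeds the 0.001 critical value of 32.91 for 12 df. There is therefore very strong evidence that word length and author are not independent.

c) Can we conclude that Mark Twain could have written the Q. C. Snodgrass letters?

No. The word-length patterns differ by far more than chance would explain, so there is very strong evidence that Mark Twain is not Q. C. Snodgrass.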
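As a check on software output, the Pearson statistic and its degrees of freedom can be recomputed directly from the word-length counts. The following is a minimal Python sketch, assuming the counts are entered as a 2 x 13 table with the authors as rows, as the activity describes; the function name pearson_chi_square and the variable names are illustrative assumptions.

```python
# Sketch: recompute the chi-square statistic for the word-length table.
# Rows are authors, columns are word lengths 1, 2, ..., 12, 13+.
# All names here are illustrative; they do not come from SPSS.

def pearson_chi_square(table):
    """Return (statistic, df, minimum expected count, grand total)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    min_expected = float("inf")
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of author and word length.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
            min_expected = min(min_expected, expected)
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df, min_expected, n

twain = [312, 1146, 1394, 1177, 661, 442, 367, 231, 181, 109, 50, 24, 12]
snodgrass = [424, 2685, 2752, 2302, 1431, 992, 896, 638, 465, 276, 152, 101, 61]

statistic, df, min_expected, n = pearson_chi_square([twain, snodgrass])
print(f"N = {n}, df = {df}")
print(f"Pearson chi-square = {statistic:.3f}")
print(f"Minimum expected count = {min_expected:.2f}")
```

Comparing the printed degrees of freedom and N with the software output is a quick way to confirm that the data were entered as the intended two-way table before interpreting the test.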