Topic 7

Independence of Categorical Variables
Activity 7-7: Pet Birds and Lung Cancer
Kohlmeier, Arminger, Bartolomeycik, Bellach, Rehm and Thomas, British Medical Journal, 305, pp. 986-989,
conducted a study to determine if keeping a pet bird could be considered a risk factor in the development of lung
cancer. They took a sample of patients in Berlin who had contracted lung cancer and a sample of patients in Berlin
who had not contracted lung cancer. The data are summarized in the following table.
                             Lung Cancer    No Lung Cancer
Kept a pet bird                   98              101
Did not keep a pet bird          141              328

a) Conduct a chi-square analysis to determine if keeping a pet bird is independent from developing lung
cancer.
SPSS results are shown below.

Crosstabs

Case Processing Summary
                                              Cases
                                Valid          Missing           Total
                              N   Percent     N   Percent      N   Percent
Kept Pet Bird? * Have
Lung Cancer?                668   100.0%      0      .0%     668   100.0%

Kept Pet Bird? * Have Lung Cancer? Crosstabulation
                                                   Have Lung Cancer?
Kept Pet Bird?                                     Yes        No       Total
Yes      Count                                      98       101        199
         Expected Count                           71.2     127.8      199.0
         % within Have Lung Cancer?              41.0%     23.5%      29.8%
No       Count                                     141       328        469
         Expected Count                          167.8     301.2      469.0
         % within Have Lung Cancer?              59.0%     76.5%      70.2%
Total    Count                                     239       429        668
         Expected Count                          239.0     429.0      668.0
         % within Have Lung Cancer?             100.0%    100.0%     100.0%

Chi-Square Tests
                                            Asymp. Sig.   Exact Sig.   Exact Sig.
                           Value      df    (2-sided)     (2-sided)    (1-sided)
Pearson Chi-Square         22.374 b    1      .000
Continuity Correction a    21.547      1      .000
Likelihood Ratio           21.924      1      .000
Fisher's Exact Test                                         .000         .000
Linear-by-Linear
Association                22.341      1      .000
N of Valid Cases           668
a. Computed only for a 2x2 table
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 71.20.
b) What do you conclude?
There is very strong evidence that keeping a pet bird and having lung cancer are related.
c) Is keeping a pet bird a risk factor?
The column percents indicate that a higher percentage of the lung cancer patients kept a pet bird (41.0%
versus 23.5% of the patients without lung cancer), but that does not allow us to conclude that owning a pet
bird is the reason. This is an observational study, not a controlled experiment.
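The Pearson chi-square value for this activity can be cross-checked by hand. The following is a minimal sketch in Python, not part of the original SPSS analysis; the function name chi_square is an illustrative choice, and the p-value shortcut used here applies only to tables with 1 degree of freedom.

```python
import math

# Observed counts from the table above: rows = kept a pet bird (yes, no),
# columns = lung cancer (yes, no).
observed = [[98, 101],
            [141, 328]]

def chi_square(table):
    """Return (statistic, expected counts, degrees of freedom) for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected count under independence: row total * column total / grand total.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    stat = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, expected, df

stat, expected, df = chi_square(observed)

# With 1 degree of freedom the upper-tail chi-square probability has the
# closed form erfc(sqrt(x / 2)); this shortcut is valid for df = 1 only.
p_value = math.erfc(math.sqrt(stat / 2))

print(f"chi-square = {stat:.3f}, df = {df}, p = {p_value:.6f}")
```

The statistic printed here corresponds to the Pearson Chi-Square row of the SPSS output, without the continuity correction.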
Activity 7-8: High School Graduates
A random sample of 115 students leaving high school was taken to see what relationship, if any, existed between
success in passing the General Certificate of Education Advanced Level examinations and where students were
headed after graduation. These examinations are much like those taken by high school students in many U.S.
states. The number of examinations each student passed, along with whether they went on to college, some other
training program or straight to employment, was observed. The results are summarized below.
                        Number of Examinations Passed
                      1       2       3     4 or more
College               1      10      29        16
Other training        7       7       4         1
Employment           14      12      11         3

a) Do the data indicate that there is a relationship between the number of advanced level examinations passed
and the ultimate destination of high school graduates? Explain in detail how you justify your conclusion.
SPSS results for the original table are shown below.

Crosstabs

Case Processing Summary
                                              Cases
                                Valid          Missing           Total
                              N   Percent     N   Percent      N   Percent
Dest. after Graduation
* Number of Exams Passed    115   100.0%      0      .0%     115   100.0%

Dest. after Graduation * Number of Exams Passed Crosstabulation
                                                              Number of Exams Passed
Dest. after Graduation                                      1        2        3    4 or more    Total
College          Count                                      1       10       29       16          56
                 Expected Count                          10.7     14.1     21.4      9.7        56.0
                 % within Number of Exams Passed         4.5%    34.5%    65.9%    80.0%       48.7%
Other training   Count                                      7        7        4        1          19
                 Expected Count                           3.6      4.8      7.3      3.3        19.0
                 % within Number of Exams Passed        31.8%    24.1%     9.1%     5.0%       16.5%
Employment       Count                                     14       12       11        3          40
                 Expected Count                           7.7     10.1     15.3      7.0        40.0
                 % within Number of Exams Passed        63.6%    41.4%    25.0%    15.0%       34.8%
Total            Count                                     22       29       44       20         115
                 Expected Count                          22.0     29.0     44.0     20.0       115.0
                 % within Number of Exams Passed       100.0%   100.0%   100.0%   100.0%      100.0%

Chi-Square Tests
                                            Asymp. Sig.
                           Value      df    (2-sided)
Pearson Chi-Square         33.012 a    6      .000
Likelihood Ratio           37.879      6      .000
Linear-by-Linear
Association                25.270      1      .000
N of Valid Cases           115
a. 3 cells (25.0%) have expected count less than 5. The
minimum expected count is 3.30.
The expected counts for the Other training category are below 5 for three of the four categories.
Combining College and Other training results in the following table.
Dest. after Graduation * Number of Exams Passed Crosstabulation
                                                                   Number of Exams Passed
Dest. after Graduation                                           1        2        3    4 or more    Total
College and      Count                                           8       17       33       17          75
other training   Expected Count                               14.3     18.9     28.7     13.0        75.0
                 % within Number of Exams Passed             36.4%    58.6%    75.0%    85.0%       65.2%
Employment       Count                                          14       12       11        3          40
                 Expected Count                                7.7     10.1     15.3      7.0        40.0
                 % within Number of Exams Passed             63.6%    41.4%    25.0%    15.0%       34.8%
Total            Count                                          22       29       44       20         115
                 Expected Count                               22.0     29.0     44.0     20.0       115.0
                 % within Number of Exams Passed            100.0%   100.0%   100.0%   100.0%      100.0%

Chi-Square Tests
                                            Asymp. Sig.
                           Value      df    (2-sided)
Pearson Chi-Square         13.937 a    3      .003
Likelihood Ratio           14.030      3      .003
Linear-by-Linear
Association                13.380      1      .000
N of Valid Cases           115
a. 0 cells (.0%) have expected count less than 5. The
minimum expected count is 6.96.
Combining Other training and Employment results in the following table.
Dest. after Graduation * Number of Exams Passed Crosstabulation
                                                                   Number of Exams Passed
Dest. after Graduation                                           1        2        3    4 or more    Total
College          Count                                           1       10       29       16          56
                 Expected Count                               10.7     14.1     21.4      9.7        56.0
                 % within Number of Exams Passed              4.5%    34.5%    65.9%    80.0%       48.7%
Employment and   Count                                          21       19       15        4          59
other training   Expected Count                               11.3     14.9     22.6     10.3        59.0
                 % within Number of Exams Passed             95.5%    65.5%    34.1%    20.0%       51.3%
Total            Count                                          22       29       44       20         115
                 Expected Count                               22.0     29.0     44.0     20.0       115.0
                 % within Number of Exams Passed            100.0%   100.0%   100.0%   100.0%      100.0%

Chi-Square Tests
                                            Asymp. Sig.
                           Value      df    (2-sided)
Pearson Chi-Square         32.573 a    3      .000
Likelihood Ratio           37.367      3      .000
Linear-by-Linear
Association                31.319      1      .000
N of Valid Cases           115
a. 0 cells (.0%) have expected count less than 5. The
minimum expected count is 9.74.
In both cases there is strong evidence that destination after graduation and number of examinations passed
are related: the p-value is .003 when College and Other training are combined, and less than .001 when Other
training and Employment are combined.
b) If there is a relation, how would you describe it?
In the first case, the more examinations a student passed, the more likely the student is to be going to
college or other training. In the second case, the more examinations a student passed, the more likely the
student is to be going to college.
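The effect of combining categories in this activity can be sketched in a few lines of Python. This is an illustrative cross-check rather than the SPSS procedure itself; the helper names chi_square and add_rows are assumptions made for the example. For each version of the table it reports the Pearson chi-square statistic, the degrees of freedom, and the number of cells whose expected count falls below 5.

```python
# Observed counts from the original table.
# Columns: 1, 2, 3, 4 or more examinations passed.
college = [1, 10, 29, 16]
other_training = [7, 7, 4, 1]
employment = [14, 12, 11, 3]

def chi_square(table):
    """Return (statistic, degrees of freedom, expected counts) for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    stat = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df, expected

def add_rows(a, b):
    """Combine two categories by adding their counts column by column."""
    return [x + y for x, y in zip(a, b)]

tables = {
    "original": [college, other_training, employment],
    "college + other training": [add_rows(college, other_training), employment],
    "other training + employment": [college, add_rows(other_training, employment)],
}

results = {}
for label, table in tables.items():
    stat, df, expected = chi_square(table)
    # Count the cells that violate the usual "expected count of at least 5" guideline.
    small = sum(1 for row in expected for e in row if e < 5)
    results[label] = (stat, df, small)
    print(f"{label}: chi-square = {stat:.3f}, df = {df}, "
          f"cells with expected count < 5: {small}")
```

Combining two rows removes one row from the table, which is why the degrees of freedom drop from 6 to 3 in both combined versions.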
Activity 7-9: Gender and Politics
a) Conduct a chi-square analysis on the data collected in the Preliminaries to ascertain whether there is a
relationship between the gender of a student at your school and his/her political orientation.
Results will vary according to the data collected.
b) What do you conclude? Justify your conclusions.
Results will vary according to the data collected.
c) If a pattern exists, describe it.
Results will vary according to the data collected.
Activity 7-10: Religiosity and Marriage
The SPSS data set REL_MAR.SAV contains a portion of the NSFH data from NSFH.SAV. Marital Status gives
the marital status of the respondent—1 = married, 2 = separated, 3 = divorced, and 4 = widowed. Religiosity gives
the number of times per month the respondent attends church—0 = never, 1 = once a week or less, 2 = twice a week,
and 3 = more than twice a week.
a) Do the data indicate there is a relation between marital status and the number of times one attends church?
Justify your conclusions.
With the original data there were three cells with expected counts less than 5 in the more than twice a week
category. Combining this with the twice a week category gives the following results.
Marital Status * Religiosity Crosstabulation
                                                           Religiosity
                                                      Once per        Twice a week
Marital Status                            Never       week or less    or more          Total
Married      Count                          210           536             371           1117
             Expected Count               216.8         529.7           370.5         1117.0
             % within Religiosity         65.6%         68.5%           67.8%          67.7%
Separated    Count                           19            44              24             87
             Expected Count                16.9          41.3            28.9           87.0
             % within Religiosity          5.9%          5.6%            4.4%           5.3%
Divorced     Count                           68           104              86            258
             Expected Count                50.1         122.4            85.6          258.0
             % within Religiosity         21.3%         13.3%           15.7%          15.6%
Widowed      Count                           23            98              66            187
             Expected Count                36.3          88.7            62.0          187.0
             % within Religiosity          7.2%         12.5%           12.1%          11.3%
Total        Count                          320           782             547           1649
             Expected Count               320.0         782.0           547.0         1649.0
             % within Religiosity        100.0%        100.0%          100.0%         100.0%

Chi-Square Tests
                                            Asymp. Sig.
                           Value      df    (2-sided)
Pearson Chi-Square         16.830 a    6      .010
Likelihood Ratio           17.060      6      .009
Linear-by-Linear
Association                  .095      1      .758
N of Valid Cases           1649
a. 0 cells (.0%) have expected count less than 5. The
minimum expected count is 16.88.
So, there is some evidence (p = .010) of a relationship between the number of times one attends religious
services and marital status.
b) If there is a relationship, how would you characterize it?
Comparing observed with expected counts, there is a tendency for divorced people to never attend religious
services, for separated people to attend services once a week or less, and for widowed people to attend
services, whether once a week or less or more often.
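One way to see which cells drive the relationship is to look at Pearson residuals, (observed - expected) / sqrt(expected), whose squares add up to the chi-square statistic. The sketch below is an illustrative addition, not part of the SPSS output; the function name chi2_upper_tail_even_df is an assumption for the example, and its closed-form p-value holds only when the degrees of freedom are even, as they are here.

```python
import math

# Observed counts from the crosstabulation above.
rows = ["Married", "Separated", "Divorced", "Widowed"]
cols = ["Never", "Once/week or less", "Twice/week or more"]
observed = [[210, 536, 371],
            [19, 44, 24],
            [68, 104, 86],
            [23, 98, 66]]

row_totals = [sum(r) for r in observed]
col_totals = [sum(c) for c in zip(*observed)]
n = sum(row_totals)
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Pearson residual for each cell: (observed - expected) / sqrt(expected).
residuals = [[(o - e) / math.sqrt(e) for o, e in zip(o_row, e_row)]
             for o_row, e_row in zip(observed, expected)]

# The chi-square statistic is the sum of the squared residuals.
stat = sum(r ** 2 for row in residuals for r in row)
df = (len(rows) - 1) * (len(cols) - 1)

def chi2_upper_tail_even_df(x, df):
    """Upper-tail chi-square probability; this closed form holds for even df only."""
    half = x / 2
    return math.exp(-half) * sum(half ** k / math.factorial(k)
                                 for k in range(df // 2))

p_value = chi2_upper_tail_even_df(stat, df)

print(f"chi-square = {stat:.3f}, df = {df}, p = {p_value:.3f}")
for name, res_row in zip(rows, residuals):
    print(f"{name:<10}" + "".join(f"{r:>8.2f}" for r in res_row))
```

Cells with residuals far from zero contribute most to the statistic; a positive residual means more respondents than independence would predict, a negative one fewer.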
Activity 7-11: Deciding Authorship
The chi-square test can be used in answering questions that come under the general heading of goodness-of-fit
problems. Here the problem is to determine if randomly selected items from a population follow a particular pattern
or not. A particularly intriguing example of this type of analysis is that of trying to determine if the pattern of words
in manuscripts of unknown authorship follows the patterns of a particular author. To do this, a sample of the author’s
works is compared with the manuscripts whose author is not known. It is common to look at things such as the
frequency of use of particular words and the frequency of use of words of various lengths. We shall look at an
example of the latter.
In 1861 the New Orleans Crescent published ten letters that were signed by Quintus Curtius Snodgrass. The
letters describe events that seem to have occurred, but there is no record of the existence of anyone named Quintus
Curtius Snodgrass. Some claimed that the author was actually Mark Twain. The table below gives the distribution
of word lengths from the Snodgrass letters and the distribution of word lengths from a sample of the works of Mark
Twain.
Word Length          1     2     3     4     5     6     7     8     9    10    11    12   13+
Mark Twain         312  1146  1394  1177   661   442   367   231   181   109    50    24    12
Q. C. Snodgrass    424  2685  2752  2302  1431   992   896   638   465   276   152   101    61
To perform the analysis, let the two authors be the rows of a two-way table, and let the word lengths be the columns
of the table. Then do the chi-square analysis on the resulting table.
a) The chi-square analysis we are doing tests the null hypothesis that the row and column variables are
independent against the alternative hypothesis that they are not independent. In this problem does the null
hypothesis of independence coincide with the hypothesis that the two authors are the same, or does it
coincide with the hypothesis that they differ?
If the two authors are the same, the pattern of word lengths would tend to be the same. This coincides with
the null hypothesis that author and word length are independent.
b) Conduct the chi-square analysis and summarize your conclusions.
SPSS results are shown below.

Chi-Square Tests
                                              Asymp. Sig.
                           Value        df    (2-sided)
Pearson Chi-Square         19281.000 a   25     .000
Likelihood Ratio           26706.476     25     .000
Linear-by-Linear
Association                  202.161      1     .000
N of Valid Cases           19281
a. 0 cells (.0%) have expected count less than 5. The
minimum expected count is 5.79.
The minimum expected count is greater than 5. So the results are valid. They indicate that there is very
strong evidence that word length and author are not independent.
c) Can we conclude that Mark Twain could have written the Q. C. Snodgrass letters?
No. There is very strong evidence that the two word-length distributions differ, which suggests that Mark
Twain is not Q. C. Snodgrass.
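The two-way layout described for this activity (authors as rows, word lengths as columns) can be checked directly from the word-length counts. The sketch below is an illustrative cross-check in Python, not the SPSS run itself. A 2 x 13 table has (2 - 1)(13 - 1) = 12 degrees of freedom, so the figures it prints may differ from the SPSS output above, which reports 25. The value 21.026 is the tabled 5% critical value for 12 degrees of freedom, and the p-value formula assumes an even number of degrees of freedom.

```python
import math

# Word-length counts from the table above (lengths 1, 2, ..., 12, 13+).
twain = [312, 1146, 1394, 1177, 661, 442, 367, 231, 181, 109, 50, 24, 12]
snodgrass = [424, 2685, 2752, 2302, 1431, 992, 896, 638, 465, 276, 152, 101, 61]
observed = [twain, snodgrass]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
expected = [[r * c / n for c in col_totals] for r in row_totals]
min_expected = min(e for row in expected for e in row)

stat = sum((o - e) ** 2 / e
           for obs_row, exp_row in zip(observed, expected)
           for o, e in zip(obs_row, exp_row))
df = (len(observed) - 1) * (len(col_totals) - 1)

# Upper-tail chi-square probability; this closed form holds for even df only.
half = stat / 2
p_value = math.exp(-half) * sum(half ** k / math.factorial(k)
                                for k in range(df // 2))

CRITICAL_5_PERCENT_DF12 = 21.026  # tabled chi-square critical value, df = 12

print(f"N = {n}, df = {df}, minimum expected count = {min_expected:.2f}")
print(f"chi-square = {stat:.3f}, p = {p_value:.3g}")
print("reject independence at the 5% level:", stat > CRITICAL_5_PERCENT_DF12)
```

Rejecting independence here means the word-length distribution depends on the author, which is the evidence used to argue that the two authors differ.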