# mat207 Case Study 20.1.2 - American Statistical Association

```MAT207 – Roback
Spring 2002
MAT207: Logistic Regression
Case Study 20.1.2 – Birdkeeping and Lung Cancer, A Retrospective Observational Study
Description:
A 1972-1981 health survey in The Hague, Netherlands, discovered an association between keeping pet birds and
increased risk of lung cancer. To investigate birdkeeping as a risk factor, researchers conducted a case-control study
of patients in 1985 at four hospitals in The Hague. They identified 49 cases of lung cancer among patients who
were registered with a general practice, who were age 65 or younger, and who had resided in the city since 1965.
They also selected 98 controls from a population of residents having the same general age structure. (Data based on
P.A. Holst, D. Kromhout, and R. Brand, “For Debate: Pet Birds as an Independent Risk Factor for Lung Cancer,”
British Medical Journal 297 (1988): 13-21.)
Data was gathered on the following variables:
 female = gender (1 = Female, 0 = Male)
 ag = age, in years
 status = socioeconomic status (1 = High, 0 = Low), determined by the occupation of the household’s
primary wage earner
 yr = years of smoking prior to diagnosis or examination
 cd = average rate of smoking, in cigarettes per day
 bird = indicator of birdkeeping (1 = Yes, 0 = No), determined by whether or not there were caged birds in
the home for more than 6 consecutive months from 5 to 14 years before diagnosis (cases) or examination
(controls)
 cancer = indicator of lung cancer diagnosis (1 = Cancer, 0 = No Cancer)
Age and smoking history are both known to be associated with lung cancer incidence. After age, socioeconomic
status, and smoking have been controlled for, is an additional risk associated with birdkeeping?
Initial Graphical and Numerical Descriptions of Data:

To get “double” coded scatterplot like Display 20.10 in the book, first create a new variable birdcanc: Under
Transform…Compute, let Target = birdcanc, Numeric Expression = 1, If…(cancer=0)&amp;(bird=0). Then repeat with Numeric
Expression = 2, If…(cancer=0)&amp;(bird=1), Numeric Expression = 3, If…(cancer=1)&amp;(bird=0), Numeric Expression = 4,
If…(cancer=1)&amp;(bird=1). Define values in your Data Editor. Then do Graphs…Scatter with Y=yr, X=ag, Set Markers By =
birdcanc, and use Format…Marker to choose 4 different symbols.
CANCER
60
AG
50
YR
40
CD
30
Years of Smoking
20
FEMALE
10
Bird / Cancer
No Bird / Cancer
STATUS
0
Bird / No Cancer
-10
No Bird / No Cancer
30
40
50
60
BIRD
70
A ge (years)
Page 1
MAT207 – Roback
Spring 2002
Analyze…Descriptive Statistics…Explore. Dependent list = quantitative predictors; Factor list = cancer
Descriptives
AG
CANCER
0
1
YR
0
1
CD
0
1
Statistic
57.0000
59.0000
54.392
7.3751
37.00
67.00
56.8980
58.0000
54.344
7.3718
37.00
66.00
24.9592
25.0000
220.163
14.8379
.00
47.00
33.6327
36.0000
97.987
9.8989
.00
50.00
14.2143
15.0000
104.995
10.2467
.00
45.00
18.8163
20.0000
59.820
7.7343
.00
40.00
Mean
Median
Variance
Std. Deviation
Minimum
Maximum
Mean
Median
Variance
Std. Deviation
Minimum
Maximum
Mean
Median
Variance
Std. Deviation
Minimum
Maximum
Mean
Median
Variance
Std. Deviation
Minimum
Maximum
Mean
Median
Variance
Std. Deviation
Minimum
Maximum
Mean
Median
Variance
Std. Deviation
Minimum
Maximum
Std. Error
.7450
1.0531
1.4989
1.4141
1.0351
1.1049
Analyze…Descriptive Statistics…Crosstabs. Rows = categorical predictors; Columns = LC. Cells = Row %
SS * LC Crosstabulation
SEX * LC Crosstabulation
LC
LC
SEX
FEMALE
MALE
Total
Count
% within SEX
Count
% within SEX
Count
% within SEX
LUNGCA
NCER
12
33.3%
37
33.3%
49
33.3%
NOCANCER
24
66.7%
74
66.7%
98
66.7%
Total
36
100.0%
111
100.0%
147
100.0%
SS
LOW
Total
BK * LC Crosstabulation
LC
BK
BIRD
NOBIRD
Total
Count
% within BK
Count
% within BK
Count
% within BK
LUNGCA
NCER
33
49.3%
16
20.0%
49
33.3%
NOCANCER
34
50.7%
64
80.0%
98
66.7%
HIGH
Total
67
100.0%
80
100.0%
147
100.0%
Page 2
Count
% within SS
Count
% within SS
Count
% within SS
LUNGCA
NCER
12
26.7%
37
36.3%
49
33.3%
NOCANCER
33
73.3%
65
63.7%
98
66.7%
Total
45
100.0%
102
100.0%
147
100.0%
MAT207 – Roback
Spring 2002
Model One:

Regression…Binary Logistic. Options…CI for exp(B) and Casewise Listing of Residuals. Save… Probabilities and Group
Membership
Classification Table a
Predicted
CANCER
Model Summary
Step
1
-2 Log
likelihood
155.240
Observed
CANCER
Step 1
Cox &amp; Snell
R Square
.195
Nagelkerke
R Square
.271
0
0
1
1
82
21
Overall Percentage
a. The cut value is .500
Variables in the Equation
Step
a
1
FEMALE
AG
STATUS
YR
BIRD
Constant
B
.521
-.046
.132
.083
1.335
-1.406
S.E.
.530
.035
.464
.025
.409
1.717
Wald
.969
1.759
.081
11.112
10.645
.671
df
Sig.
.325
.185
.776
.001
.001
.413
1
1
1
1
1
1
Exp(B)
1.684
.955
1.141
1.086
3.799
.245
95.0% C.I.for EXP(B)
Lower
Upper
.597
4.755
.892
1.022
.459
2.834
1.035
1.141
1.704
8.472
a. Variable(s) entered on s tep 1: FEMALE, AG, STATUS, YR, BIRD.
Casewise Listb
Case
21
28
47
Observed
CANCER
1**
1**
1**
Selected
a
Status
S
S
S
Predicted
.129
.139
.085
Predicted
Group
0
0
0
Temporary Variable
Resid
ZResid
.871
2.602
.861
2.494
.915
3.281
a. S = Selected, U = Unselected cas es, and ** = Misclassified cases .
b. Cases with studentized residuals greater than 2.000 are listed.
Model Two:
Model Summary
Step
1
-2 Log
likelihood
166.532
Cox &amp; Snell
R Square
.131
Nagelkerke
R Square
.182
Variables in the Equation
Step
a
1
FEMALE
AG
STATUS
YR
Constant
B
.721
-.063
-.056
.087
.103
S.E.
.503
.034
.437
.025
1.576
Wald
2.059
3.509
.017
12.371
.004
df
1
1
1
1
1
Sig.
.151
.061
.897
.000
.948
a. Variable(s) entered on s tep 1: FEMALE, AG, STATUS, YR.
Page 3
Exp(B)
2.057
.939
.945
1.091
1.108
95.0% C.I.for EXP(B)
Lower
Upper
.768
5.511
.879
1.003
.402
2.224
1.039
1.146
16
28
Percentage
Correct
83.7
57.1
74.8
MAT207 – Roback
Spring 2002
Model Three:

To do Forward Selection with “forced variables”, choose Block 1 = female, ag, status, bird (Enter), and Block 2 = yr, cd, yr2,
cd2, yr*cd (Forward LR). Note that you must Transform…Compute to get squared terms, but interactions can be entered via
point and click.
Variables in the Equation
Step
a
1
FEMALE
AG
STATUS
BIRD
YR
Constant
B
.521
-.046
.132
1.335
.083
-1.406
S.E.
.530
.035
.464
.409
.025
1.717
Wald
.969
1.759
.081
10.645
11.112
.671
df
Sig.
.325
.185
.776
.001
.001
.413
1
1
1
1
1
1
Exp(B)
1.684
.955
1.141
3.799
1.086
.245
a. Variable(s) entered on s tep 1: YR.
Model Summary
Step
1
-2 Log
likelihood
155.240
Cox &amp; Snell
R Square
.195
Nagelkerke
R Square
.271
Va riables not in the Equa tion
St ep
1
Model if Term Removed
Variable
Step 1 YR

Model Log
Likelihood
-85.934
Change in
-2 Log
Likelihood
16.627
df
1
Sig. of the
Change
.000
Variables
Overall Statistics
Sc ore
1.048
.152
.369
.367
2.685
df
1
1
1
1
4
Sig.
.306
.697
.544
.544
.612
Book’s approach to determining how to model smoking… Start with model below containing demographic covariates plus
rich SSOM model with two smoking variables. Then start looking to eliminate higher order terms involving Yr and Cd (e.g.
start by removing the interaction, then the squared terms), running smaller models by hand where necessary. Eventually
learn that smoking can be well accounted for by just Yr. Add Bird term at the end to get Model One.
Va riables in the Equa tion
Sta ep
1
CD
YR2
CD2
CD by YR
FEMALE
AG
STATUS
YR
CD
YR2
CD2
CD by YR
Constant
B
.795
-.057
-.015
.032
.132
.001
-.002
-.001
-.642
S. E.
.518
.038
.445
.093
.100
.002
.002
.002
2.298
W ald
2.353
2.301
.001
.115
1.755
.340
.643
.354
.078
df
1
1
1
1
1
1
1
1
1
Sig.
.125
.129
.972
.735
.185
.560
.423
.552
.780
Ex p(B)
2.214
.944
.985
1.032
1.141
1.001
.998
.999
.526
a. Variable(s) ent ered on step 1: FEMALE, AG, STATUS, YR, CD, YR2, CD2, CD * YR .
Page 4
```