MAT207 – Roback Spring 2002 MAT207: Logistic Regression Case Study 20.1.2 – Birdkeeping and Lung Cancer, A Retrospective Observational Study Description: A 1972-1981 health survey in The Hague, Netherlands, discovered an association between keeping pet birds and increased risk of lung cancer. To investigate birdkeeping as a risk factor, researchers conducted a case-control study of patients in 1985 at four hospitals in The Hague. They identified 49 cases of lung cancer among patients who were registered with a general practice, who were age 65 or younger, and who had resided in the city since 1965. They also selected 98 controls from a population of residents having the same general age structure. (Data based on P.A. Holst, D. Kromhout, and R. Brand, “For Debate: Pet Birds as an Independent Risk Factor for Lung Cancer,” British Medical Journal 297 (1988): 13-21.) Data was gathered on the following variables: female = gender (1 = Female, 0 = Male) ag = age, in years status = socioeconomic status (1 = High, 0 = Low), determined by the occupation of the household’s primary wage earner yr = years of smoking prior to diagnosis or examination cd = average rate of smoking, in cigarettes per day bird = indicator of birdkeeping (1 = Yes, 0 = No), determined by whether or not there were caged birds in the home for more than 6 consecutive months from 5 to 14 years before diagnosis (cases) or examination (controls) cancer = indicator of lung cancer diagnosis (1 = Cancer, 0 = No Cancer) Age and smoking history are both known to be associated with lung cancer incidence. After age, socioeconomic status, and smoking have been controlled for, is an additional risk associated with birdkeeping? Initial Graphical and Numerical Descriptions of Data: To get “double” coded scatterplot like Display 20.10 in the book, first create a new variable birdcanc: Under Transform…Compute, let Target = birdcanc, Numeric Expression = 1, If…(cancer=0)&(bird=0). Then repeat with Numeric Expression = 2, If…(cancer=0)&(bird=1), Numeric Expression = 3, If…(cancer=1)&(bird=0), Numeric Expression = 4, If…(cancer=1)&(bird=1). Define values in your Data Editor. Then do Graphs…Scatter with Y=yr, X=ag, Set Markers By = birdcanc, and use Format…Marker to choose 4 different symbols. CANCER 60 AG 50 YR 40 CD 30 Years of Smoking 20 FEMALE 10 Bird / Cancer No Bird / Cancer STATUS 0 Bird / No Cancer -10 No Bird / No Cancer 30 40 50 60 BIRD 70 A ge (years) Page 1 MAT207 – Roback Spring 2002 Analyze…Descriptive Statistics…Explore. Dependent list = quantitative predictors; Factor list = cancer Descriptives AG CANCER 0 1 YR 0 1 CD 0 1 Statistic 57.0000 59.0000 54.392 7.3751 37.00 67.00 56.8980 58.0000 54.344 7.3718 37.00 66.00 24.9592 25.0000 220.163 14.8379 .00 47.00 33.6327 36.0000 97.987 9.8989 .00 50.00 14.2143 15.0000 104.995 10.2467 .00 45.00 18.8163 20.0000 59.820 7.7343 .00 40.00 Mean Median Variance Std. Deviation Minimum Maximum Mean Median Variance Std. Deviation Minimum Maximum Mean Median Variance Std. Deviation Minimum Maximum Mean Median Variance Std. Deviation Minimum Maximum Mean Median Variance Std. Deviation Minimum Maximum Mean Median Variance Std. Deviation Minimum Maximum Std. Error .7450 1.0531 1.4989 1.4141 1.0351 1.1049 Analyze…Descriptive Statistics…Crosstabs. Rows = categorical predictors; Columns = LC. Cells = Row % SS * LC Crosstabulation SEX * LC Crosstabulation LC LC SEX FEMALE MALE Total Count % within SEX Count % within SEX Count % within SEX LUNGCA NCER 12 33.3% 37 33.3% 49 33.3% NOCANCER 24 66.7% 74 66.7% 98 66.7% Total 36 100.0% 111 100.0% 147 100.0% SS LOW Total BK * LC Crosstabulation LC BK BIRD NOBIRD Total Count % within BK Count % within BK Count % within BK LUNGCA NCER 33 49.3% 16 20.0% 49 33.3% NOCANCER 34 50.7% 64 80.0% 98 66.7% HIGH Total 67 100.0% 80 100.0% 147 100.0% Page 2 Count % within SS Count % within SS Count % within SS LUNGCA NCER 12 26.7% 37 36.3% 49 33.3% NOCANCER 33 73.3% 65 63.7% 98 66.7% Total 45 100.0% 102 100.0% 147 100.0% MAT207 – Roback Spring 2002 Model One: Regression…Binary Logistic. Options…CI for exp(B) and Casewise Listing of Residuals. Save… Probabilities and Group Membership Classification Table a Predicted CANCER Model Summary Step 1 -2 Log likelihood 155.240 Observed CANCER Step 1 Cox & Snell R Square .195 Nagelkerke R Square .271 0 0 1 1 82 21 Overall Percentage a. The cut value is .500 Variables in the Equation Step a 1 FEMALE AG STATUS YR BIRD Constant B .521 -.046 .132 .083 1.335 -1.406 S.E. .530 .035 .464 .025 .409 1.717 Wald .969 1.759 .081 11.112 10.645 .671 df Sig. .325 .185 .776 .001 .001 .413 1 1 1 1 1 1 Exp(B) 1.684 .955 1.141 1.086 3.799 .245 95.0% C.I.for EXP(B) Lower Upper .597 4.755 .892 1.022 .459 2.834 1.035 1.141 1.704 8.472 a. Variable(s) entered on s tep 1: FEMALE, AG, STATUS, YR, BIRD. Casewise Listb Case 21 28 47 Observed CANCER 1** 1** 1** Selected a Status S S S Predicted .129 .139 .085 Predicted Group 0 0 0 Temporary Variable Resid ZResid .871 2.602 .861 2.494 .915 3.281 a. S = Selected, U = Unselected cas es, and ** = Misclassified cases . b. Cases with studentized residuals greater than 2.000 are listed. Model Two: Model Summary Step 1 -2 Log likelihood 166.532 Cox & Snell R Square .131 Nagelkerke R Square .182 Variables in the Equation Step a 1 FEMALE AG STATUS YR Constant B .721 -.063 -.056 .087 .103 S.E. .503 .034 .437 .025 1.576 Wald 2.059 3.509 .017 12.371 .004 df 1 1 1 1 1 Sig. .151 .061 .897 .000 .948 a. Variable(s) entered on s tep 1: FEMALE, AG, STATUS, YR. Page 3 Exp(B) 2.057 .939 .945 1.091 1.108 95.0% C.I.for EXP(B) Lower Upper .768 5.511 .879 1.003 .402 2.224 1.039 1.146 16 28 Percentage Correct 83.7 57.1 74.8 MAT207 – Roback Spring 2002 Model Three: To do Forward Selection with “forced variables”, choose Block 1 = female, ag, status, bird (Enter), and Block 2 = yr, cd, yr2, cd2, yr*cd (Forward LR). Note that you must Transform…Compute to get squared terms, but interactions can be entered via point and click. Variables in the Equation Step a 1 FEMALE AG STATUS BIRD YR Constant B .521 -.046 .132 1.335 .083 -1.406 S.E. .530 .035 .464 .409 .025 1.717 Wald .969 1.759 .081 10.645 11.112 .671 df Sig. .325 .185 .776 .001 .001 .413 1 1 1 1 1 1 Exp(B) 1.684 .955 1.141 3.799 1.086 .245 a. Variable(s) entered on s tep 1: YR. Model Summary Step 1 -2 Log likelihood 155.240 Cox & Snell R Square .195 Nagelkerke R Square .271 Va riables not in the Equa tion St ep 1 Model if Term Removed Variable Step 1 YR Model Log Likelihood -85.934 Change in -2 Log Likelihood 16.627 df 1 Sig. of the Change .000 Variables Overall Statistics Sc ore 1.048 .152 .369 .367 2.685 df 1 1 1 1 4 Sig. .306 .697 .544 .544 .612 Book’s approach to determining how to model smoking… Start with model below containing demographic covariates plus rich SSOM model with two smoking variables. Then start looking to eliminate higher order terms involving Yr and Cd (e.g. start by removing the interaction, then the squared terms), running smaller models by hand where necessary. Eventually learn that smoking can be well accounted for by just Yr. Add Bird term at the end to get Model One. Va riables in the Equa tion Sta ep 1 CD YR2 CD2 CD by YR FEMALE AG STATUS YR CD YR2 CD2 CD by YR Constant B .795 -.057 -.015 .032 .132 .001 -.002 -.001 -.642 S. E. .518 .038 .445 .093 .100 .002 .002 .002 2.298 W ald 2.353 2.301 .001 .115 1.755 .340 .643 .354 .078 df 1 1 1 1 1 1 1 1 1 Sig. .125 .129 .972 .735 .185 .560 .423 .552 .780 Ex p(B) 2.214 .944 .985 1.032 1.141 1.001 .998 .999 .526 a. Variable(s) ent ered on step 1: FEMALE, AG, STATUS, YR, CD, YR2, CD2, CD * YR . Page 4