Chapter 23 Logistic Regression Analysis

In multiple regression, the dependent variable is continuous, such as weight, systolic blood pressure (sbp), or price. If the dependent variable Y is a binary (dichotomous) response, such as Male/Female, Yes/No, Success/Fail, Present/Absent, or Smoking/Nonsmoking, logistic regression can be used to describe its relationship with several predictor variables $X_1, X_2, \ldots, X_k$, and an (adjusted) odds ratio can be estimated.

Logistic function

\[ f(z) = \frac{1}{1 + e^{-z}} \]

Its graph has a sigmoid shape.

[Figure: graph of f(z) against z, an S-shaped curve rising from 0 to 1 and passing through f(0) = 0.5.]

This function is well suited to modeling a probability because f(z) ranges from 0 to 1 as z varies from $-\infty$ to $+\infty$.

The logistic model

Let Y be a dichotomous variable defined as

\[ Y = \begin{cases} 1 & \text{for those who have lung cancer} \\ 0 & \text{for those who do not have lung cancer} \end{cases} \]

and let $p = \Pr(Y = 1 \mid X_1, \ldots, X_k)$. Then

\[ p = \frac{1}{1 + \exp[-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k)]} \qquad (1) \]

and

\[ \hat{p} = \frac{1}{1 + \exp[-(\hat\beta_0 + \hat\beta_1 X_1 + \hat\beta_2 X_2 + \cdots + \hat\beta_k X_k)]} \]

Note: With no predictors, $\hat{p} = \sum_{i=1}^{n} Y_i / n = \bar{Y}$.

The logit form of the logistic model

The relationship of a dichotomous variable with its predictors is quantified with the odds ratio. Since $\text{odds}(D) = \Pr(D) / (1 - \Pr(D))$, we have $\text{odds}(Y=1) = p / (1 - p)$.

The "logit" is the natural log of the odds of the event Y = 1, that is,

\[ \text{logit}[p] = \ln[\text{odds}(Y = 1)] = \ln\left(\frac{p}{1 - p}\right) \]

\[ \text{logit}[p] = \beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k \qquad (2) \]

Note: The logit can take any value between $-\infty$ and $+\infty$, while Pr(Y=1) can only take values between 0 and 1.

\[ \text{Odds}(Y=1) = e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k} \qquad (3) \]

This formulation helps clarify the meaning of the maximum likelihood coefficients: $e^{\beta_i}$ is the factor by which the odds of Y = 1 are multiplied for a unit change in the predictor $X_i$, $i = 1, \ldots, k$, with the other predictors held fixed.

An adjusted odds ratio is an odds ratio comparing two categories of a variable after controlling for the other variables in the model.
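The relationships among the logistic function, the logit, and the odds can be checked numerically. The following is a minimal sketch in Python (not part of the SAS material in this chapter); the coefficient values `beta0` and `beta1` are hypothetical and chosen only for illustration.

```python
import math

def logistic(z):
    """Logistic function f(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    """Odds of an event that has probability p."""
    return p / (1.0 - p)

def logit(p):
    """Natural log of the odds; the inverse of the logistic function."""
    return math.log(odds(p))

# Hypothetical coefficients for a one-predictor model (illustration only).
beta0, beta1 = -1.0, 0.5

# Probabilities at X = 0 and X = 1 from the logistic model.
p_at_0 = logistic(beta0 + beta1 * 0)
p_at_1 = logistic(beta0 + beta1 * 1)

# A unit change in X multiplies the odds by e^(beta1).
odds_ratio = odds(p_at_1) / odds(p_at_0)

print(f"f(0) = {logistic(0.0)}")
print(f"odds ratio = {odds_ratio:.6f}, exp(beta1) = {math.exp(beta1):.6f}")
```

The last line prints the same value twice, which is the content of equation (3): the ratio of the odds at X = 1 to the odds at X = 0 equals $e^{\beta_1}$ regardless of the intercept.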
For example, an adjusted odds ratio comparing two categories of the variable smoking status ($X_1$) is

\[ \widehat{OR}_{X_1=1 \text{ vs } X_1=0} = \frac{\widehat{\text{Odds}}(Y=1 \mid X_1=1, X_2, \ldots, X_k)}{\widehat{\text{Odds}}(Y=1 \mid X_1=0, X_2, \ldots, X_k)} = \frac{e^{\hat\beta_0 + \hat\beta_1 + \hat\beta_2 X_2 + \cdots + \hat\beta_k X_k}}{e^{\hat\beta_0 + \hat\beta_2 X_2 + \cdots + \hat\beta_k X_k}} = e^{\hat\beta_1} \]

and its $(1-\alpha) \times 100\%$ confidence interval is

\[ e^{\hat\beta_1 \pm Z_{1-\alpha/2} \, S_{\hat\beta_1}} \]

More specifically, its 95% confidence interval is $e^{\hat\beta_1 \pm 1.96 \, S_{\hat\beta_1}}$.

Suppose we want to find an adjusted odds ratio for a continuous variable such as age ($X_2$). Most often an increase of 1 unit is not interesting. For example, an increase of 1 year in age may be too small to be considered important; a change of 10 years might be more useful. Then

\[ \widehat{OR}_{X_2=20 \text{ vs } X_2=10} = \frac{\widehat{\text{Odds}}(Y=1 \mid X_1, X_2=20, \ldots, X_k)}{\widehat{\text{Odds}}(Y=1 \mid X_1, X_2=10, \ldots, X_k)} = \frac{e^{\hat\beta_0 + \hat\beta_1 X_1 + \hat\beta_2 \cdot 20 + \cdots + \hat\beta_k X_k}}{e^{\hat\beta_0 + \hat\beta_1 X_1 + \hat\beta_2 \cdot 10 + \cdots + \hat\beta_k X_k}} = e^{(20-10)\hat\beta_2} = e^{10\hat\beta_2} \]

and its 95% confidence interval is

\[ e^{10\hat\beta_2 \pm 1.96 \cdot 10 \, S_{\hat\beta_2}} \]

Inference for logistic regression

Criteria for Assessing Model Fit

The LOGISTIC procedure fits linear logistic regression models for dichotomous variables by the method of maximum likelihood estimation.

Let $-2 \ln L_A$ be the log-likelihood statistic of model A with p predictors and $-2 \ln L_B$ be the log-likelihood statistic of model B with k predictors, where k > p and model A is nested in model B. Then the likelihood ratio chi-square, $G^2$, is

\[ G^2 = (-2 \ln L_A) - (-2 \ln L_B) \sim \chi^2_{k-p} \]

approximately, under the null hypothesis that the additional k - p coefficients are zero.

If model A is the model with intercept only, then $G^2$ plays the role of the overall F, testing $H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$ with k degrees of freedom.

If model A has p predictors and model B has k predictors, then $G^2$ plays the role of the (multiple) partial F with k - p degrees of freedom.

Analysis of Maximum Likelihood Estimates (MLEs)

The Wald Chi-Square Test is

\[ W = \frac{\hat\beta_i^{\,2}}{[\widehat{SE}(\hat\beta_i)]^2}, \quad i = 1, \ldots, k \]

For each i, it tests $H_0: \beta_i = 0$ vs $H_A: \beta_i \neq 0$, and W is compared with a chi-square distribution with 1 degree of freedom.
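The confidence interval and Wald formulas above can be illustrated with a short calculation. This is a minimal sketch in Python rather than SAS; the estimate `beta_age` and standard error `se_age` are hypothetical values, not results from any data set in this chapter, and the helper names are introduced here only for illustration.

```python
import math

def odds_ratio_ci(beta_hat, se, units=1.0, z=1.96):
    """Odds ratio and Wald confidence limits for a change of `units` in X."""
    point = math.exp(units * beta_hat)
    lower = math.exp(units * beta_hat - z * units * se)
    upper = math.exp(units * beta_hat + z * units * se)
    return point, lower, upper

def chi2_pvalue_1df(stat):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    # For 1 df, P(chi2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(stat / 2.0))

# Hypothetical coefficient and standard error for age (illustration only).
beta_age, se_age = 0.05, 0.02

# Odds ratio for a 10-year difference in age, with 95% Wald limits.
or_10, lo_10, hi_10 = odds_ratio_ci(beta_age, se_age, units=10)

# Wald chi-square for H0: beta = 0 and its p-value.
wald = (beta_age / se_age) ** 2
p_wald = chi2_pvalue_1df(wald)

print(f"OR per 10 years: {or_10:.3f} ({lo_10:.3f}, {hi_10:.3f})")
print(f"Wald chi-square: {wald:.2f}, p-value: {p_wald:.4f}")
```

Note that the interval is built on the coefficient scale and then exponentiated, so it is not symmetric around the odds ratio itself.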
Example 1 (When the independent variable is nominal)

Let Y be a dichotomous variable defined as

\[ Y = \begin{cases} 1 & \text{for those who have lung cancer} \\ 0 & \text{for those who do not have lung cancer} \end{cases} \]

and X be a dichotomous variable such as smoking status, with

\[ X = \begin{cases} 1 & \text{for smokers} \\ 0 & \text{for nonsmokers} \end{cases} \]

The logit form of the logistic model is

\[ \text{logit}[p] = \beta_0 + \beta_1 X \qquad (4) \]

logit(lung cancer | smokers (X=1)) = $\beta_0 + \beta_1 \cdot 1 = \beta_0 + \beta_1$
logit(lung cancer | nonsmokers (X=0)) = $\beta_0 + \beta_1 \cdot 0 = \beta_0$

Thus

Odds(lung cancer | smokers) = $e^{\beta_0 + \beta_1}$
Odds(lung cancer | nonsmokers) = $e^{\beta_0}$

and the odds ratio comparing the odds of smokers getting lung cancer to the odds of nonsmokers getting lung cancer is

\[ OR_{S \text{ vs } NS} = \frac{\text{odds(lung cancer} \mid \text{smokers)}}{\text{odds(lung cancer} \mid \text{nonsmokers)}} = \frac{e^{\beta_0 + \beta_1}}{e^{\beta_0}} = e^{\beta_1} \]

In other words, the estimate of OR is $\widehat{OR} = e^{\hat\beta_1}$, where $\hat\beta_1$ is the maximum likelihood estimate (MLE) of $\beta_1$ in equation (4).

Testing $H_0: OR_{S \text{ vs } NS} = 1$ is the same as testing $H_0: \beta_1 = 0$, since $OR = e^{\beta_1}$ and $e^0 = 1$.

Suppose we want to analyze the following data using logistic regression.

                                   Heart Attacks (Factor B)
    Oral Contraceptive (Factor A)      Yes        No
    Used                                23        34
    Never Used                          35       132

Model: $\text{logit}[p] = \beta_0 + \beta_1 X$

SAS program

    /* The single-trial syntax is used exclusively when independent
       variables are continuous or mixed */
    data heart;
      input contra $ attack $;
      if contra = 'used' then X = 1; else X = 0;
      if attack = 'yes' then Y = 1; else Y = 0;
      lines;
    used yes
      :           ("used yes" 23 times)
    used yes
    used no
      :           ("used no" 34 times)
    used no
    never yes
      :           ("never yes" 35 times)
    never yes
    never no
      :           ("never no" 132 times)
    never no
    ;
    run;

    proc logistic descending;
      model Y = X / link = logit;
    run;

    /* The single-trial syntax with weight */
    data heart;
      input contra $ attack $ wt;
      if contra = 'used' then X = 1; else X = 0;
      if attack = 'yes' then Y = 1; else Y = 0;
      lines;
    used yes 23
    used no 34
    never yes 35
    never no 132
    ;
    run;

    proc logistic descending;
      weight wt;
      model Y = X / link = logit;
    run;

    /* The events/trials syntax */
    data heart;
      input contra $ yes no;
      n = yes + no;
      if contra = 'used' then X = 1; else X = 0;
      lines;
    used 23 34
    never 35 132
    ;
    run;

    proc logistic;
      model yes/n = X / link = logit;
    run;

    /* Logistic regression using PROC GENMOD.
       GENMOD is like GLM in categorical data analysis. */
    data heart;
      input contra $ yes no;
      n = yes + no;
      /* 'used' is prefixed with 1 and 'never' with 2 so that 'used'
         sorts first and is the level compared against 'never' */
      lines;
    1used 23 34
    2never 35 132
    ;
    run;

    proc genmod;
      class contra;
      model yes/n = contra / dist = bin link = logit;
    run;

Output (from the single-trial program with weight)

    The LOGISTIC Procedure

    Model Information
    Data Set                     WORK.HEART
    Response Variable            Y
    Number of Response Levels    2
    Number of Observations       4
    Weight Variable              wt
    Sum of Weights               224
    Link Function                Logit
    Optimization Technique       Fisher's scoring

    Response Profile
    Ordered            Total        Total
      Value     Y    Frequency     Weight
        1       1        2        58.00000
        2       0        2       166.00000

    Model Convergence Status
    Convergence criterion (GCONV=1E-8) satisfied.

    Model Fit Statistics
                   Intercept    Intercept and
    Criterion        Only        Covariates
    AIC             258.226       252.358
    SC              257.612       251.131
    -2 Log L        256.226       248.358

    Testing Global Null Hypothesis: BETA=0
    Test                Chi-Square    DF    Pr > ChiSq
    Likelihood Ratio      7.8676       1      0.0050
    Score                 8.3288       1      0.0039
    Wald                  8.0449       1      0.0046

    (Likelihood ratio chi-square: 256.226 - 248.358 = 7.868)

    Analysis of Maximum Likelihood Estimates
                               Standard
    Parameter   DF   Estimate    Error    Chi-Square   Pr > ChiSq
    Intercept    1   -1.3275    0.1901     48.7488       <.0001
    X            1    0.9366    0.3302      8.0449       0.0046

    Odds Ratio Estimates
               Point        95% Wald
    Effect    Estimate   Confidence Limits
    X          2.551      1.336     4.873

Interpretation:

1. The logistic regression equation is $\text{logit}[\hat{p}] = \ln\left(\frac{\hat{p}}{1 - \hat{p}}\right) = -1.3275 + 0.9366 X$.
2. Equivalently, $\hat{p} = \frac{1}{1 + e^{-(-1.3275 + 0.9366 X)}}$.
3. $G^2$, the Likelihood Ratio Chi-Square statistic, tests $H_0: \beta_1 = 0$, which is equivalent to testing $H_0: OR_{\text{Used vs Never}} = 1$.
4. Overall, the model is significant because the Likelihood Ratio Chi-Square statistic is 7.8676 with 1 degree of freedom and p-value = .0050.
5. The estimate of the odds ratio $OR_{\text{Used vs Never}}$ is $e^{0.9366} = 2.551$ and its 95% Wald confidence limits (1.336, 4.873) do not contain 1
(just as the estimate of $\beta_1$ is .9366 and its p-value is .0046).

Suppose we want to find out the relationship between heart attacks and BMI, and the following data have been collected.

                                  Heart attacks
    BMI                            Yes      No
    Above 30 (Obese)                25       5
    25 - 30 (Overweight)            10      20
    Below 25 (Normal)                5      25

Create dummy variables:

    BMI            X1    X2
    Obese           1     0
    Overweight      0     1
    Normal          0     0

Dependent variable: Y = 1 if a subject had a heart attack; Y = 0 if a subject did not.

Model: $\text{logit}[p] = \beta_0 + \beta_1 X_1 + \beta_2 X_2$

SAS program

    data cancer;
      input bmi $ attack $ wt;
      if bmi = 'obese' then X1 = 1; else X1 = 0;
      if bmi = 'overwt' then X2 = 1; else X2 = 0;
      if attack = 'yes' then Y = 1; else Y = 0;
      lines;
    obese yes 25
    obese no 5
    overwt yes 10
    overwt no 20
    normal yes 5
    normal no 25
    ;
    run;

    proc logistic descending;
      weight wt;
      model Y = X1 X2 / link = logit;
    run;

Output

    The LOGISTIC Procedure

    Model Information
    Data Set                     WORK.CANCER
    Response Variable            Y
    Number of Response Levels    2
    Number of Observations       6
    Weight Variable              wt
    Sum of Weights               90
    Link Function                Logit
    Optimization Technique       Fisher's scoring

    Response Profile
    Ordered            Total        Total
      Value     Y    Frequency     Weight
        1       1        3        40.000000
        2       0        3        50.000000

    Model Convergence Status
    Convergence criterion (GCONV=1E-8) satisfied.
    Model Fit Statistics
                   Intercept    Intercept and
    Criterion        Only        Covariates
    AIC             125.653        98.258
    SC              125.445        97.633
    -2 Log L        123.653        92.258

    Testing Global Null Hypothesis: BETA=0
    Test                Chi-Square    DF    Pr > ChiSq
    Likelihood Ratio     31.3949       2      <.0001
    Score                29.2500       2      <.0001
    Wald                 23.3637       2      <.0001

    Analysis of Maximum Likelihood Estimates
                               Standard
    Parameter   DF   Estimate    Error    Chi-Square   Pr > ChiSq
    Intercept    1   -1.6093    0.4899     10.7921       0.0010
    X1           1    3.2186    0.6928     21.5842       <.0001
    X2           1    0.9162    0.6245      2.1523       0.1424

    Odds Ratio Estimates
               Point        95% Wald
    Effect    Estimate   Confidence Limits
    X1        24.994      6.429    97.170
    X2         2.500      0.735     8.500

    Association of Predicted Probabilities and Observed Responses
    Percent Concordant    33.3      Somers' D    0.000
    Percent Discordant    33.3      Gamma        0.000
    Percent Tied          33.3      Tau-a        0.000
    Pairs                    9      c            0.500

Interpretation:

1. The logistic regression equation is $\text{logit}(\hat{p}) = -1.6093 + 3.2186 X_1 + 0.9162 X_2$.
2. Equivalently, $\hat{p} = \frac{1}{1 + e^{-(-1.6093 + 3.2186 X_1 + 0.9162 X_2)}}$.
3. $G^2$, the Likelihood Ratio Chi-Square statistic, tests $H_0: \beta_1 = \beta_2 = 0$, which is equivalent to testing $H_0: OR_{\text{Obese vs Normal}} = OR_{\text{Overweight vs Normal}} = 1$.
4. Overall, the model is significant because the Likelihood Ratio Chi-Square statistic is 31.3949 with df = 2 and p-value < .0001.
5. $\widehat{OR}_{\text{Obese vs Normal}} \approx 25$, and its confidence interval (6.429, 97.170) does not include 1. This shows that the odds of a heart attack differ significantly between the obese group and the normal group. On the other hand, $\widehat{OR}_{\text{Overweight vs Normal}} = 2.5$ and its confidence interval (0.735, 8.500) includes 1. This means that the odds of a heart attack do not differ significantly between overweight people and normal-weight people.

Example 2 (When independent variables are mixed: nominal and continuous)

Let Y be a dichotomous variable defined as

\[ Y = \begin{cases} 1 & \text{for those who have lung cancer} \\ 0 & \text{for those who do not have lung cancer} \end{cases} \]

and let $X_1$ be smoking status, $X_2$ be sbp, and $X_3$ be age.
The logit form of the logistic model is

\[ \text{logit}[p] = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 \qquad (5) \]

logit(lung cancer | smokers, sbp = 160, age = 40) = $\beta_0 + \beta_1 \cdot 1 + \beta_2 \cdot 160 + \beta_3 \cdot 40$
logit(lung cancer | smokers, sbp = 120, age = 40) = $\beta_0 + \beta_1 \cdot 1 + \beta_2 \cdot 120 + \beta_3 \cdot 40$

Thus

Odds(lung cancer | smokers, sbp = 160, age = 40) = $e^{\beta_0 + \beta_1 + 160\beta_2 + 40\beta_3}$
Odds(lung cancer | smokers, sbp = 120, age = 40) = $e^{\beta_0 + \beta_1 + 120\beta_2 + 40\beta_3}$

and the odds ratio comparing the odds of lung cancer for 40-year-old smokers whose sbp is 160 to the odds of lung cancer for 40-year-old smokers whose sbp is 120 is

\[ OR_{\text{sbp160 vs sbp120}} = \frac{\text{odds(lung cancer} \mid \text{smokers, sbp} = 160, \text{age} = 40)}{\text{odds(lung cancer} \mid \text{smokers, sbp} = 120, \text{age} = 40)} = \frac{e^{\beta_0 + \beta_1 + 160\beta_2 + 40\beta_3}}{e^{\beta_0 + \beta_1 + 120\beta_2 + 40\beta_3}} = e^{(160 - 120)\beta_2} = e^{40\beta_2} \]

In other words, the estimate of OR is $\widehat{OR} = e^{40\hat\beta_2}$, where $\hat\beta_2$ is the MLE of $\beta_2$ in equation (5).

Testing $H_0: OR_{\text{sbp160 vs sbp120}} = 1$ is the same as testing $H_0: \beta_2 = 0$, since $OR = e^{40\beta_2}$ and $e^{40 \cdot 0} = e^0 = 1$.

Example: Logistic regression with a continuous independent variable

    data heart;
      input sbp chd wt @@;
      lines;
    110 0 153   110 1 3    121 0 235   121 1 17
    131 0 272   131 1 12   141 0 255   141 1 16
    151 0 127   151 1 12   161 0 77    161 1 8
    177 0 83    177 1 16   190 0 35    190 1 8
    ;
    run;

    proc logistic descending;
      weight wt;
      model chd = sbp / link = logit;
      output out = heartout p = pred;
    run;

    /* Caution: SET and INPUT each advance once per iteration, so data
       line i is paired with row i of heartout. Because heartout holds
       two rows per sbp value, pred matches the sbp read by INPUT only
       on the first line. */
    data heart2;
      set heartout;
      input sbp no yes;
      total = yes + no;
      prob = yes / total;
      lines;
    110 153 3
    121 235 17
    131 272 12
    141 255 16
    151 127 12
    161 77 8
    177 83 16
    190 35 8
    ;
    run;

    proc print;
      var sbp no yes prob pred;
    run;

    proc plot;
      plot prob*sbp pred*sbp = '*' / overlay;
    run;

Output

    The LOGISTIC Procedure

    Model Information
    Data Set                     WORK.HEART
    Response Variable            chd
    Number of Response Levels    2
    Number of Observations       16
    Weight Variable              wt
    Sum of Weights               1329
    Link Function                Logit
    Optimization Technique       Fisher's scoring

    Response Profile
    Ordered             Total        Total
      Value    chd    Frequency     Weight
        1       1         8         92.0000
        2       0         8       1237.0000

    Model Convergence Status
    Convergence criterion (GCONV=1E-8) satisfied.
    Model Fit Statistics
                   Intercept    Intercept and
    Criterion        Only        Covariates
    AIC             670.831       648.520
    SC              671.604       650.066
    -2 Log L        668.831       644.520

    Testing Global Null Hypothesis: BETA=0
    Test                Chi-Square    DF    Pr > ChiSq
    Likelihood Ratio     24.3110       1      <.0001
    Score                26.6394       1      <.0001
    Wald                 25.3529       1      <.0001

    Analysis of Maximum Likelihood Estimates
                               Standard
    Parameter   DF   Estimate    Error    Chi-Square   Pr > ChiSq
    Intercept    1   -6.0631    0.7195     71.0212       <.0001
    sbp          1    0.0243    0.00482    25.3529       <.0001

    Odds Ratio Estimates
               Point        95% Wald
    Effect    Estimate   Confidence Limits
    sbp        1.025      1.015     1.034

                                  Odds ratio
    1-unit difference in sbp       1.024598
    10-unit difference in sbp      1.275069
    20-unit difference in sbp      1.6258

    Obs    sbp     no    yes     prob        pred
     1     110    153     3     0.01923    0.032576
     2     121    235    17     0.06746    0.032576
     3     131    272    12     0.04225    0.042134
     4     141    255    16     0.05904    0.042134
     5     151    127    12     0.08633    0.053104
     6     161     77     8     0.09412    0.053104
     7     177     83    16     0.16162    0.066732
     8     190     35     8     0.18605    0.066732

Note: The pred values repeat in pairs because the heart2 data step pairs data line i with row i of heartout, which holds two rows per sbp value. Only the first pred (sbp = 110) is aligned with its sbp; for example, the fitted probability at sbp = 121 is 0.042134, not 0.032576.

[Plot: prob*sbp (observed proportions, plotted as letters) overlaid with pred*sbp (plotted as '*'); vertical axis prob from 0.000 to 0.200, horizontal axis sbp from 110 to 190. NOTE: 1 obs hidden.]

Model Selection in Logistic Regression

Example 3: Cancer Remission Data (When independent variables are all continuous)

The data, taken from Lee (1974), consist of patient characteristics and whether or not cancer remission occurred.
    data remiss;
      input remiss cell smear infil li blast temp;
      label remiss = 'complete remission';
      cards;
    1  .8   .83  .66  1.9  1.1    .996
    1  .9   .36  .32  1.4   .74   .992
    0  .8   .88  .7    .8   .176  .982
    0 1     .87  .87   .7  1.053  .986
    1  .9   .75  .68  1.3   .519  .98
    0 1     .65  .65   .6   .519  .982
    1  .95  .97  .92  1    1.23   .992
    0  .95  .87  .83  1.9  1.354 1.02
    0 1     .45  .45   .8   .322  .999
    0  .95  .36  .34   .5  0     1.038
    0  .85  .39  .33   .7   .279  .988
    0  .7   .76  .53  1.2   .146  .982
    0  .8   .46  .37   .4   .38  1.006
    0  .2   .39  .08   .8   .114  .99
    0 1     .9   .9   1.1  1.037  .99
    1 1     .84  .84  1.9  2.064 1.02
    0  .65  .42  .27   .5   .114 1.014
    0 1     .75  .75  1    1.322 1.004
    0  .5   .44  .22   .6   .114  .99
    1 1     .63  .63  1.1  1.072  .986
    0 1     .33  .33   .4   .176 1.01
    0  .9   .93  .84   .6  1.591 1.02
    1 1     .58  .58  1     .531 1.002
    0  .95  .32  .3   1.6   .886  .988
    1 1     .6   .6   1.7   .964  .99
    1 1     .69  .69   .9   .398  .986
    0 1     .73  .73   .7   .398  .986
    ;
    run;

    proc logistic;
      title 'Stepwise Regression on Cancer Remission Data';
      model remiss = cell smear infil li blast temp
            / selection = stepwise slentry = .3 slstay = .3 details;
    run;

    proc logistic;
      title 'Backward Elimination Using the Fast Option';
      model remiss = temp cell li smear blast
            / selection = backward fast slstay = .2;
    run;

    proc logistic;
      title 'Best Subsets Regression';
      model remiss = temp cell li smear blast / selection = score;
    run;

Output (Edited)

    Stepwise Regression on Cancer Remission Data

    Stepwise Selection Procedure

    Step 0. Intercept entered:

    Analysis of Maximum Likelihood Estimates
                     Parameter   Standard     Wald         Pr >      Standardized    Odds
    Variable    DF   Estimate     Error     Chi-Square   Chi-Square    Estimate     Ratio
    INTERCPT     1    0.6931     0.4082      2.8827       0.0895          .           .

    Step 1. Variable LI entered:

    Analysis of Maximum Likelihood Estimates
                     Parameter   Standard     Wald         Pr >      Standardized    Odds
    Variable    DF   Estimate     Error     Chi-Square   Chi-Square    Estimate     Ratio
    INTERCPT     1    3.7771     1.3786      7.5064       0.0061          .           .
    LI           1   -2.8973     1.1868      5.9594       0.0146      -0.747230     0.055

    Step 2.
    Variable TEMP entered:

    Analysis of Maximum Likelihood Estimates
                     Parameter   Standard     Wald         Pr >      Standardized    Odds
    Variable    DF   Estimate     Error     Chi-Square   Chi-Square    Estimate     Ratio
    INTERCPT     1   -47.8559    46.4416     1.0618       0.3028          .           .
    LI           1    -3.3020     1.3594     5.9005       0.0151      -0.851626     0.037
    TEMP         1    52.4331    47.4934     1.2188       0.2696       0.429597   999.000

    Step 3. Variable CELL entered:

    Analysis of Maximum Likelihood Estimates
                     Parameter   Standard     Wald         Pr >      Standardized    Odds
    Variable    DF   Estimate     Error     Chi-Square   Chi-Square    Estimate     Ratio
    INTERCPT     1   -67.6339    56.8875     1.4135       0.2345          .           .
    CELL         1    -9.6522     7.7511     1.5507       0.2130      -0.993231     0.000
    LI           1    -3.8671     1.7783     4.7290       0.0297      -0.997359     0.021
    TEMP         1    82.0738    61.7124     1.7687       0.1835       0.672450   999.000

    Summary of Stepwise Procedure
              Variable          Number     Score        Wald         Pr >
    Step   Entered   Removed      In     Chi-Square   Chi-Square   Chi-Square
      1    LI                      1       7.9311         .          0.0049
      2    TEMP                    2       1.2591         .          0.2618
      3    CELL                    3       1.4701         .          0.2253

    Backward Elimination Using the Fast Option

    Step 0. The following variables were entered:
    INTERCPT TEMP CELL LI SMEAR BLAST

    Model Fitting Information and Testing Global Null Hypothesis BETA=0
                  Intercept    Intercept and
    Criterion       Only        Covariates      Chi-Square for Covariates
    AIC            36.372         33.857          .
    SC             37.668         41.632          .
    -2 LOG L       34.372         21.857         12.515 with 5 DF (p=0.0284)
    Score            .              .             9.330 with 5 DF (p=0.0966)

    Step 1.
    Fast Backward Elimination:

    Analysis of Variables Removed by Fast Backward Elimination
    Variable                    Pr >        Residual            Pr > Residual
    Removed     Chi-Square   Chi-Square    Chi-Square    DF      Chi-Square
    BLAST         0.0008       0.9768        0.0008       1        0.9768
    SMEAR         0.0951       0.7578        0.0959       2        0.9532
    CELL          1.5135       0.2186        1.6094       3        0.6573
    TEMP          0.6535       0.4189        2.2629       4        0.6875

    Summary of Backward Elimination Procedure
            Variable    Number      Wald          Pr >
    Step    Removed       In      Chi-Square   Chi-Square
      1     BLAST          4       0.000844      0.9768
      1     SMEAR          3       0.0951        0.7578
      1     CELL           2       1.5135        0.2186
      1     TEMP           1       0.6535        0.4189

    Analysis of Maximum Likelihood Estimates
                     Parameter   Standard     Wald         Pr >      Standardized    Odds
    Variable    DF   Estimate     Error     Chi-Square   Chi-Square    Estimate     Ratio
    INTERCPT     1    3.7771     1.3786      7.5064       0.0061          .           .
    LI           1   -2.8973     1.1868      5.9594       0.0146      -0.747230     0.055

    Best Subsets Regression

    The LOGISTIC Procedure
    Data Set: WORK.REMISS
    Response Variable: REMISS     complete remission
    Response Levels: 2
    Number of Observations: 27
    Link Function: Logit

    Response Profile
    Ordered
      Value    REMISS    Count
        1        0         18
        2        1          9

    Regression Models Selected by Score Criterion
    Number of    Score
    Variables    Value     Variables Included in Model
        1        7.9311    LI
        1        3.5258    BLAST
        1        1.8893    CELL
        1        1.0745    SMEAR
        1        0.6591    TEMP
    ---------------------------------------------------
        2        8.6611    CELL LI
        2        8.3648    TEMP LI
        2        7.9807    LI BLAST
        2        7.9537    LI SMEAR
        2        5.0826    TEMP BLAST
        2        3.9013    CELL BLAST
        2        3.5456    SMEAR BLAST
        2        2.8228    TEMP CELL
        2        2.3308    CELL SMEAR
        2        1.5641    TEMP SMEAR
    ---------------------------------------------------
        3        9.2502    TEMP CELL LI
        3        8.6817    CELL LI BLAST
        3        8.6652    CELL LI SMEAR
        3        8.5691    TEMP LI BLAST
        3        8.3720    TEMP LI SMEAR
        3        7.9817    LI SMEAR BLAST
        3        5.4816    TEMP CELL BLAST
        3        5.4018    TEMP SMEAR BLAST
        3        3.9272    CELL SMEAR BLAST
        3        3.0976    TEMP CELL SMEAR
    ---------------------------------------------------
        4        9.2791    TEMP CELL LI SMEAR
        4        9.2572    TEMP CELL LI BLAST
        4        8.6819    CELL LI SMEAR BLAST
        4        8.6315    TEMP LI SMEAR BLAST
        4        5.8305    TEMP CELL SMEAR BLAST
    ---------------------------------------------------
        5        9.3295    TEMP CELL LI SMEAR BLAST
    ---------------------------------------------------

Based on the stepwise selection, the backward elimination method, and the best subset method, the candidates for the best model may be the model with CELL and LI or the model with TEMP, CELL, and LI. Once again, selecting the best model is part statistical method and part experience and common sense.

Example 4: Conditional Logistic Regression for 1-1 Matched Data

The data are a subset of the Los Angeles Study of Endometrial Cancer data described in Breslow and Day (1980). There are 63 matched pairs, each consisting of a case of endometrial cancer (OUTCOME=1) and a control (OUTCOME=0). The case and the corresponding control have the same ID. The explanatory variables include GALL (an indicator for gall bladder disease) and HYPER (an indicator for hypertension). The goal of the analysis is to determine the relative risk of endometrial cancer for those who have gall bladder disease, controlling for the effect of hypertension.
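The program that follows fits an ordinary logistic model with no intercept to within-pair differences. A brief sketch of why this matches the conditional likelihood for 1-1 matching is given here; the symbols $x_{i1}$ (covariates of the case in pair $i$), $x_{i0}$ (covariates of its control), and $d_i$ are notation introduced for this illustration and do not appear in the original program.

```latex
% Conditional on pair i containing exactly one case, the probability that
% the subject with covariates x_{i1} is the case (pair-specific intercepts cancel):
\[
L_i(\beta)
  = \frac{e^{\beta' x_{i1}}}{e^{\beta' x_{i1}} + e^{\beta' x_{i0}}}
  = \frac{1}{1 + e^{-\beta'(x_{i1} - x_{i0})}}
  = \frac{1}{1 + e^{-\beta' d_i}},
\qquad d_i = x_{i1} - x_{i0}.
\]
% The full conditional likelihood is the product over the 63 pairs:
\[
L(\beta) = \prod_{i=1}^{63} \frac{1}{1 + e^{-\beta' d_i}} .
\]
```

Each factor has the form of a logistic probability with no intercept and covariate vector $d_i$, so a no-intercept logistic regression on the differences, with a constant response, maximizes the same likelihood. This is presumably why the output reports only one response level.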
    data;
      drop id1 gall1 hyper1;
      retain id1 gall1 hyper1 0;
      input id outcome gall hyper @@;
      if (id = id1) then do;
        gall = gall1 - gall;
        hyper = hyper1 - hyper;
        output;
      end;
      else do;
        id1 = id;
        gall1 = gall;
        hyper1 = hyper;
      end;
      cards;
     1 1 0 0    1 0 0 0
     2 1 0 0    2 0 0 0
     3 1 0 1    3 0 0 1
     :          (Edited)
    55 1 1 0   55 0 0 0
    56 1 0 0   56 0 0 0
    57 1 1 1   57 0 1 0
    58 1 0 0   58 0 0 0
    59 1 0 0   59 0 0 0
    60 1 1 1   60 0 0 0
    61 1 1 0   61 0 1 0
    62 1 0 1   62 0 0 0
    63 1 1 0   63 0 0 0
    ;
    run;

    proc logistic;
      model outcome = gall / noint;
    run;

    proc logistic;
      model outcome = gall hyper / noint;
    run;

Output

    The LOGISTIC Procedure

    Model Information
    Data Set                     WORK.DATA1
    Response Variable            outcome
    Number of Response Levels    1
    Number of Observations       63
    Link Function                Logit
    Optimization Technique       Fisher's scoring

    Response Profile
    Ordered                 Total
      Value    outcome    Frequency
        1         0          63

    Model Convergence Status
    Convergence criterion (GCONV=1E-8) satisfied.

    Model Fit Statistics
                   Without        With
    Criterion     Covariates   Covariates
    AIC             87.337       85.654
    SC              87.337       87.797
    -2 Log L        87.337       83.654

    Testing Global Null Hypothesis: BETA=0
    Test                Chi-Square    DF    Pr > ChiSq
    Likelihood Ratio      3.6830       1      0.0550
    Score                 3.5556       1      0.0593
    Wald                  3.2970       1      0.0694

    Analysis of Maximum Likelihood Estimates
                               Standard
    Parameter   DF   Estimate    Error    Chi-Square   Pr > ChiSq
    gall         1    0.9555    0.5262      3.2970       0.0694

    Odds Ratio Estimates
               Point        95% Wald
    Effect    Estimate   Confidence Limits
    gall       2.600      0.927     7.293

    NOTE: Since there is only one response level, measures of association
          between the observed and predicted values were not calculated.

    The LOGISTIC Procedure

    Model Information
    Data Set                     WORK.DATA1
    Response Variable            outcome
    Number of Response Levels    1
    Number of Observations       63
    Link Function                Logit
    Optimization Technique       Fisher's scoring

    Response Profile
    Ordered                 Total
      Value    outcome    Frequency
        1         0          63

    Model Convergence Status
    Convergence criterion (GCONV=1E-8) satisfied.
    Model Fit Statistics
                   Without        With
    Criterion     Covariates   Covariates
    AIC             87.337       86.788
    SC              87.337       91.074
    -2 Log L        87.337       82.788

    Testing Global Null Hypothesis: BETA=0
    Test                Chi-Square    DF    Pr > ChiSq
    Likelihood Ratio      4.5487       2      0.1029
    Score                 4.3620       2      0.1129
    Wald                  4.0060       2      0.1349

    Analysis of Maximum Likelihood Estimates
                               Standard
    Parameter   DF   Estimate    Error    Chi-Square   Pr > ChiSq
    gall         1    0.9704    0.5307      3.3432       0.0675
    hyper        1    0.3481    0.3770      0.8526       0.3558

    Odds Ratio Estimates
               Point        95% Wald
    Effect    Estimate   Confidence Limits
    gall       2.639      0.933     7.468
    hyper      1.416      0.677     2.965

    NOTE: Since there is only one response level, measures of association
          between the observed and predicted values were not calculated.

Example 5. Conditional Logistic Regression for m:n Matching

Conditional logistic regression is used to investigate the relationship between an outcome and a set of prognostic factors in matched case-control studies. The outcome is whether the subject is a case or a control. If each matched set has only one case and one control, the matching is 1:1. m:n matching refers to the situation where the number of cases and controls varies across the matched sets. You can perform conditional logistic regression with the PHREG procedure by using the discrete logistic model and forming a stratum for each matched set. In addition, you need to create dummy survival times so that all the cases in a matched set have the same event time value and the corresponding controls are censored at later times.

Consider the following set of low infant birth-weight data extracted from Hosmer and Lemeshow (1989). These data represent 189 women, of whom 59 had low birth-weight babies and 130 had normal-weight babies. Under investigation are the following risk factors: weight in pounds at the last menstrual period (LWT), presence of hypertension (HT), smoking status during pregnancy (SMOKE), and presence of uterine irritability (UI).
For HT, SMOKE, and UI, a value of 1 indicates "yes" and a value of 0 indicates "no". The woman's age (AGE) is used as the matching variable. The SAS data set LBW contains a subset of the data corresponding to women between the ages of 16 and 32.

    data lbw;
      input id age low lwt smoke ht ui @@;
      time = 2 - low;
      cards;
     25 16 1 130 0 0 0   143 16 0 110 0 0 0
    166 16 0 112 0 0 0   167 16 0 135 1 0 0
    189 16 0 135 1 0 0   206 16 0 170 0 0 0
    216 16 0  95 0 0 0    37 17 1 130 1 0 1
      :                    :               (Edited)
    203 30 0 112 0 0 0    56 31 1 102 1 0 0
    107 31 0 100 0 0 1   126 31 0 215 1 0 0
    163 31 0 150 1 0 0   222 31 0 120 0 0 0
     22 32 1 105 1 0 0   106 32 0 121 0 0 0
    134 32 0 132 0 0 0   170 32 0 134 1 0 0
    175 32 0 170 0 0 0   207 32 0 186 0 0 0
    ;

    title 'Example 5. Conditional Logistic Regression for m:n Matching';
    proc phreg data = lbw;
      strata age;
      model time*low(0) = lwt smoke ht ui / ties = discrete;
    run;

Output

    The PHREG Procedure

    Data Set: WORK.LBW
    Dependent Variable: TIME
    Censoring Variable: LOW
    Censoring Value(s): 0
    Ties Handling: DISCRETE

    Summary of the Number of Event and Censored Values
                                                     Percent
    Stratum    AGE    Total    Event    Censored    Censored
       1        16       7       1         6          85.71
       2        17      12       5         7          58.33
       3        18      10       2         8          80.00
       4        19      16       3        13          81.25
       5        20      18       8        10          55.56
       6        21      12       5         7          58.33
       7        22      13       2        11          84.62
       8        23      13       5         8          61.54
       9        24      13       5         8          61.54
      10        25      15       6         9          60.00
      11        26       8       4         4          50.00
      12        27       3       2         1          33.33
      13        28       9       2         7          77.78
      14        29       7       1         6          85.71
      15        30       7       1         6          85.71
      16        31       5       1         4          80.00
      17        32       6       1         5          83.33
    ---------------------------------------------------------
    Total              174      54       120          68.97

    Testing Global Null Hypothesis: BETA=0
                  Without        With
    Criterion    Covariates   Covariates    Model Chi-Square
    -2 LOG L      159.069      141.108      17.961 with 4 DF (p=0.0013)
    Score            .            .         17.315 with 4 DF (p=0.0017)
    Wald             .            .         15.558 with 4 DF (p=0.0037)

    The PHREG Procedure

    Analysis of Maximum Likelihood Estimates
                     Parameter    Standard      Wald         Pr >       Risk
    Variable    DF    Estimate      Error     Chi-Square   Chi-Square   Ratio
    LWT          1   -0.014985     0.00706     4.50021      0.0339      0.985
    SMOKE        1    0.808047     0.36797     4.82216      0.0281      2.244
    HT           1    1.751430     0.73932     5.61199      0.0178      5.763
    UI           1    0.883410     0.48032     3.38266      0.0659      2.419

HW for ch23 is #1 (except (e))
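As a check on the Example 5 output, the Risk Ratio column of the PHREG table can be reproduced by exponentiating the parameter estimates. The following is a minimal sketch in Python rather than SAS; the numbers are copied from the table above, and the 10-pound comparison for LWT is an illustrative addition not computed in the original output.

```python
import math

# Parameter estimates and printed risk ratios copied from the PHREG output.
estimates = {"LWT": -0.014985, "SMOKE": 0.808047, "HT": 1.751430, "UI": 0.883410}
printed = {"LWT": 0.985, "SMOKE": 2.244, "HT": 5.763, "UI": 2.419}

# Each risk ratio is exp(parameter estimate).
recomputed = {name: math.exp(b) for name, b in estimates.items()}
for name, value in recomputed.items():
    print(f"{name}: exp({estimates[name]}) = {value:.3f} (printed {printed[name]})")

# A 1-pound difference in LWT is small; the same coefficient gives the
# ratio for a 10-pound difference as exp(10 * estimate).
lwt_10 = math.exp(10 * estimates["LWT"])
print(f"LWT, 10-pound difference: {lwt_10:.3f}")
```

The recomputed values agree with the printed column to three decimal places, and the 10-pound ratio follows the same reasoning used earlier in the chapter for a 10-year change in age.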