Multiple Logistic Regression Example, with Categorical Predictors

Chapter 4, p. 126 – Mating Behavior Among Horseshoe Crabs

In the earlier example, all of the explanatory variables were treated as continuous. However, two of the four explanatory variables, Color and Spinal Condition, are actually ordinal. We can make the model truer to the nature of the data by treating these two variables as categorical. We do this by replacing an ordinal explanatory variable having, say, k categories with k - 1 dichotomous (0/1) variables. The one category that gets no indicator of its own (here X1 = 4 for color and X2 = 3 for spine condition) serves as the reference level.

Instead of the predictor variable X1 = "Color of Female Crab's Shell", with four levels, we will have three dichotomous variables:

    X11 = "Light medium?"
    X12 = "Medium?"
    X13 = "Dark medium?"

Instead of the predictor variable X2 = "Spinal Condition of Female Crab", with three levels, we will have two dichotomous variables:

    X21 = "Both good?"
    X22 = "One worn or broken?"

We will then use X3 = "Carapace Width" as the remaining predictor variable and leave out X4 = "Weight", since a) X3 is more strongly correlated with Y than X4 is, and b) X3 is also strongly correlated with X4, so X4 would add little once X3 is in the model.

The logistic regression model will be estimated using SAS PROC LOGISTIC. The SAS program for estimating the model is given below, followed by the output. The data are listed in the Appendix.

    proc format;
      value difmt 0 = "No"
                  1 = "Yes";
    run;

    data crabs;
      input x1 x2 x3 x4 s;
      /* y = 1 if the female has at least one satellite male */
      y = 0;   if s > 0 then y = 1;
      /* indicators for color; x1 = 4 is the reference level */
      x11 = 0; if x1 = 1 then x11 = 1;
      x12 = 0; if x1 = 2 then x12 = 1;
      x13 = 0; if x1 = 3 then x13 = 1;
      /* indicators for spine condition; x2 = 3 is the reference level */
      x21 = 0; if x2 = 1 then x21 = 1;
      x22 = 0; if x2 = 2 then x22 = 1;
      label x1  = "Color"
            x2  = "Spine Condition"
            x3  = "Carapace Width"
            x4  = "Weight"
            x11 = "Light medium?"
            x12 = "Medium?"
            x13 = "Dark medium?"
            x21 = "Both good?"
            x22 = "One worn or broken?"
            y   = "Satellite Males?"
            s   = "No. of Satellite Males";
      format y x11 x12 x13 x21 x22 difmt.;
      cards;
    The data set is listed in the earlier handout.
    ;

    proc logistic;
      model y (order=formatted event='Yes') = x11 x12 x13 x21 x22 x3 / covb;
      title  "Logistic regression of Satellite Presence";
      title2 "vs. Several Explanatory Variables,";
      title3 "Some of which are Categorical";
    run;

    proc corr;
      var y x11 x12 x13 x21 x22 x3;
      title  "Correlations Among All Variables";
      title2 "Including Number of Satellite Males";
    run;

SAS Output for Full Model, Including All Explanatory Variables:

    Logistic regression of Satellite Presence
    vs. Several Explanatory Variables,
    Some of which are Categorical
    11:31 Wednesday, November 5, 2008

    The LOGISTIC Procedure

    Model Information
      Data Set                      WORK.CRABS
      Response Variable             y    (Satellite Males?)
      Number of Response Levels     2
      Model                         binary logit
      Optimization Technique        Fisher's scoring

      Number of Observations Read   173
      Number of Observations Used   173

    Response Profile
      Ordered Value    y      Total Frequency
      1                No      62
      2                Yes    111

    Probability modeled is y='Yes'.

    Model Convergence Status
      Convergence criterion (GCONV=1E-8) satisfied.

    Model Fit Statistics
      Criterion    Intercept Only    Intercept and Covariates
      AIC          227.759           200.612
      SC           230.912           222.685
      -2 Log L     225.759           186.612

    Testing Global Null Hypothesis: BETA=0
      Test                Chi-Square    DF    Pr > ChiSq
      Likelihood Ratio    39.1466       6     <.0001
      Score               35.3481       6     <.0001
      Wald                28.4898       6     <.0001

    Analysis of Maximum Likelihood Estimates
      Parameter    DF    Estimate    Standard Error    Wald Chi-Square    Pr > ChiSq
      Intercept    1     -12.3908    2.8194            19.3153            <.0001
      x11          1       1.6683    0.9329             3.1985            0.0737
      x12          1       1.5249    0.5672             7.2285            0.0072
      x13          1       1.1443    0.5933             3.7199            0.0538
      x21          1      -0.3770    0.5019             0.5643            0.4525
      x22          1      -0.4348    0.6254             0.4834            0.4869
      x3           1       0.4562    0.1078            17.9141            <.0001

    Odds Ratio Estimates
      Effect    Point Estimate    95% Wald Confidence Limits
      x11       5.303             0.852    33.006
      x12       4.595             1.512    13.966
      x13       3.140             0.982    10.045
      x21       0.686             0.256     1.834
      x22       0.647             0.190     2.205
      x3        1.578             1.278     1.949

    Association of Predicted Probabilities and Observed Responses
      Percent Concordant    77.3    Somers' D    0.549
      Percent Discordant    22.4    Gamma        0.551
      Percent Tied           0.3    Tau-a        0.254
      Pairs                 6882    c            0.775

    Estimated Covariance Matrix
      Parameter    Intercept    x11         x12         x13         x21         x22         x3
      Intercept     7.948753     0.019563   -0.16101    -0.2992      0.049164   -0.36215    -0.29929
      x11           0.019563     0.870212    0.293014    0.252517   -0.17261    -0.14623    -0.00969
      x12          -0.16101      0.293014    0.321701    0.244704   -0.06079    -0.06118    -0.0029
      x13          -0.2992       0.252517    0.244704    0.351987   -0.01315    -0.03011     0.00237
      x21           0.049164    -0.17261    -0.06079    -0.01315     0.251913    0.074371   -0.00236
      x22          -0.36215     -0.14623    -0.06118    -0.03011     0.074371    0.39115     0.013802
      x3           -0.29929     -0.00969    -0.0029      0.00237    -0.00236     0.013802    0.01162

    Correlations Among All Variables
    Including Number of Satellite Males
    11:31 Wednesday, November 5, 2008

    The CORR Procedure

    7 Variables:    y  x11  x12  x13  x21  x22  x3

    Simple Statistics
      Variable    N      Mean        Std Dev    Sum          Minimum     Maximum     Label
      y           173     0.64162    0.48092    111.00000     0           1.00000    Satellite Males?
      x11         173     0.06936    0.25481     12.00000     0           1.00000    Light medium?
      x12         173     0.54913    0.49902     95.00000     0           1.00000    Medium?
      x13         173     0.25434    0.43675     44.00000     0           1.00000    Dark medium?
      x21         173     0.21387    0.41123     37.00000     0           1.00000    Both good?
      x22         173     0.08671    0.28222     15.00000     0           1.00000    One worn or broken?
      x3          173    26.29884    2.10906    4550         21.00000    33.50000    Carapace Width

    Pearson Correlation Coefficients, N = 173
    Prob > |r| under H0: Rho=0
    (first line of each pair: r; second line: p-value)

                                  y          x11        x12        x13        x21        x22        x3
      y    Satellite Males?       1.00000    0.06171    0.19493   -0.06176    0.06644   -0.11242    0.40141
                                             0.4200     0.0102     0.4195     0.3851     0.1409     <.0001
      x11  Light medium?          0.06171    1.00000   -0.30130   -0.15944    0.35696    0.07758    0.08670
                                  0.4200                <.0001     0.0361     <.0001     0.3103     0.2567
      x12  Medium?                0.19493   -0.30130    1.00000   -0.64453    0.10432   -0.00978    0.21273
                                  0.0102     <.0001                <.0001     0.1720     0.8983     0.0050
      x13  Dark medium?          -0.06176   -0.15944   -0.64453    1.00000   -0.20751    0.00872   -0.15242
                                  0.4195     0.0361     <.0001                0.0062     0.9093     0.0453
      x21  Both good?             0.06644    0.35696    0.10432   -0.20751    1.00000   -0.16071    0.20139
                                  0.3851     <.0001     0.1720     0.0062                0.0347     0.0079
      x22  One worn or broken?   -0.11242    0.07758   -0.00978    0.00872   -0.16071    1.00000   -0.23035
                                  0.1409     0.3103     0.8983     0.9093     0.0347                0.0023
      x3   Carapace Width         0.40141    0.08670    0.21273   -0.15242    0.20139   -0.23035    1.00000
                                  <.0001     0.2567     0.0050     0.0453     0.0079     0.0023

Next, we want to find a subset of these variables that works in our model, since it is possible that some of the dichotomous explanatory variables do not contribute to the explanatory power of the model. We look at the correlations (phi coefficients) between Y and each of the dichotomous variables to choose a likely subset of variables to eliminate. The variables that are least strongly correlated with Y are X11, X13, and X21. We will fit a model without these variables, and then test whether they may be eliminated by comparing the two models. The output for the full model is given above. The output for the reduced model is listed below.

SAS Output for Reduced Model:

    Logistic regression of Satellite Presence
    vs. Several Explanatory Variables,
    Some of which are Categorical
    Reduced Model
    14:58 Wednesday, November 5, 2008

    The LOGISTIC Procedure

    Model Information
      Data Set                      WORK.CRABS
      Response Variable             y    (Satellite Males?)
      Number of Response Levels     2
      Model                         binary logit
      Optimization Technique        Fisher's scoring

      Number of Observations Read   173
      Number of Observations Used   173

    Response Profile
      Ordered Value    y      Total Frequency
      1                No      62
      2                Yes    111

    Probability modeled is y='Yes'.

    Model Convergence Status
      Convergence criterion (GCONV=1E-8) satisfied.

    Model Fit Statistics
      Criterion    Intercept Only    Intercept and Covariates
      AIC          227.759           199.697
      SC           230.912           212.310
      -2 Log L     225.759           191.697

    Testing Global Null Hypothesis: BETA=0
      Test                Chi-Square    DF    Pr > ChiSq
      Likelihood Ratio    34.0617       3     <.0001
      Score               30.1587       3     <.0001
      Wald                25.3813       3     <.0001

    Analysis of Maximum Likelihood Estimates
      Parameter    DF    Estimate    Standard Error    Wald Chi-Square    Pr > ChiSq
      Intercept    1     -12.0239    2.7321            19.3692            <.0001
      x12          1       0.5823    0.3525             2.7284            0.0986
      x22          1      -0.1463    0.5960             0.0603            0.8061
      x3           1       0.4736    0.1055            20.1555            <.0001

    Odds Ratio Estimates
      Effect    Point Estimate    95% Wald Confidence Limits
      x12       1.790             0.897    3.572
      x22       0.864             0.269    2.778
      x3        1.606             1.306    1.975

    Association of Predicted Probabilities and Observed Responses
      Percent Concordant    74.6    Somers' D    0.499
      Percent Discordant    24.7    Gamma        0.502
      Percent Tied           0.6    Tau-a        0.231
      Pairs                 6882    c            0.749

    Estimated Covariance Matrix
      Parameter    Intercept    x12         x22         x3
      Intercept     7.46416      0.010161   -0.37749    -0.28698
      x12           0.010161     0.124258   -0.01469    -0.00273
      x22          -0.37749     -0.01469     0.355191    0.013524
      x3           -0.28698     -0.00273     0.013524    0.011128

Now we want to compare the full model and the reduced model to see whether the reduced model is adequate.

Step 1: H0: $\beta_{11} = \beta_{13} = \beta_{21} = 0$ v. HA: Not all three are 0.

Step 2: We have $n = 173$, $\alpha = 0.05$.

Step 3: The test statistic is the Likelihood Ratio chi-square statistic $G^2 = -2\,(l_1 - l_0)$, where $l_1$ is the maximum of the log-likelihood function for the reduced model, and $l_0$ is the maximum of the log-likelihood function for the full model. Since $l_0 \ge l_1$, the statistic is nonnegative; it equals the reduced model's -2 Log L minus the full model's -2 Log L.
Under the null hypothesis, the statistic $G^2$ has a chi-square distribution with d.f. = 3, since we are hypothesizing that we may eliminate three of the parameters from the full model.

Step 4: We will reject the null hypothesis if $G^2 > \chi^2_{3,\,0.05} = 7.81$.

Step 5: From the two outputs, we find $G^2 = 191.697 - 186.612 = 5.085$.

Step 6: We fail to reject the null hypothesis at the 0.05 level of significance. We do not have sufficient evidence to conclude that the variables X11, X13, and X21 provide additional explanatory power to predict the value of Y.

Thus, we choose the reduced model to explain Y, using the predictors X12 = "Medium Color?", X22 = "One worn or broken?", and X3 = "Carapace Width."

However, we may then want to test whether one of the remaining dichotomous variables is extraneous. We choose X22, since it is not substantially correlated with Y. We use the above model as the full model, and estimate a reduced model with X22 eliminated as well.

    Logistic regression of Satellite Presence
    vs. Several Explanatory Variables,
    Some of which are Categorical
    15:18 Wednesday, November 5, 2008

    The LOGISTIC Procedure

    Model Information
      Data Set                      WORK.CRABS
      Response Variable             y    (Satellite Males?)
      Number of Response Levels     2
      Model                         binary logit
      Optimization Technique        Fisher's scoring

      Number of Observations Read   173
      Number of Observations Used   173

    Response Profile
      Ordered Value    y      Total Frequency
      1                No      62
      2                Yes    111

    Probability modeled is y='Yes'.

    Model Convergence Status
      Convergence criterion (GCONV=1E-8) satisfied.

    Model Fit Statistics
      Criterion    Intercept Only    Intercept and Covariates
      AIC          227.759           197.757
      SC           230.912           207.217
      -2 Log L     225.759           191.757

    Testing Global Null Hypothesis: BETA=0
      Test                Chi-Square    DF    Pr > ChiSq
      Likelihood Ratio    34.0014       2     <.0001
      Score               30.0492       2     <.0001
      Wald                25.2770       2     <.0001

    Analysis of Maximum Likelihood Estimates
      Parameter    DF    Estimate    Standard Error    Wald Chi-Square    Pr > ChiSq
      Intercept    1     -12.1827    2.6617            20.9495            <.0001
      x12          1       0.5763    0.3516             2.6873            0.1011
      x3           1       0.4793    0.1032            21.5756            <.0001

    Odds Ratio Estimates
      Effect    Point Estimate    95% Wald Confidence Limits
      x12       1.779             0.893    3.544
      x3        1.615             1.319    1.977

    Association of Predicted Probabilities and Observed Responses
      Percent Concordant    74.4    Somers' D    0.501
      Percent Discordant    24.3    Gamma        0.508
      Percent Tied           1.3    Tau-a        0.232
      Pairs                 6882    c            0.751

    Estimated Covariance Matrix
      Parameter    Intercept    x12         x3
      Intercept     7.084613    -0.00481    -0.27346
      x12          -0.00481      0.123591   -0.0022
      x3           -0.27346     -0.0022      0.010647

Step 1: H0: $\beta_{22} = 0$ v. HA: $\beta_{22} \ne 0$.

Step 2: We have $n = 173$, $\alpha = 0.05$.

Step 3: The test statistic is the Likelihood Ratio chi-square statistic $G^2 = -2\,(l_1 - l_0)$, where $l_1$ is the maximum of the log-likelihood function for the reduced model, and $l_0$ is the maximum of the log-likelihood function for the full model. Under the null hypothesis, the statistic $G^2$ has a chi-square distribution with d.f. = 1, since we are hypothesizing that we may eliminate one of the parameters from the full model.

Step 4: We will reject the null hypothesis if $G^2 > \chi^2_{1,\,0.05} = 3.84$.

Step 5: From the two outputs, we find $G^2 = 191.757 - 191.697 = 0.060$.

Step 6: We fail to reject the null hypothesis at the 0.05 level of significance. We do not have sufficient evidence to conclude that the variable X22 provides additional explanatory power to predict the value of Y.

Thus our final model for predicting the presence of satellite males includes whether the color of the female crab's shell is medium, and the width of the carapace of the female crab.
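To see what the final model implies for an individual crab, the fitted coefficients can be plugged into the logistic function. The following is a minimal sketch in Python rather than SAS, and is not part of the original analysis. The function name `prob_satellite` and the example width of 26.3 (roughly the sample mean of X3) are illustrative assumptions; the coefficients are copied from the last PROC LOGISTIC output.

```python
import math

# Estimates copied from the final model's "Analysis of Maximum
# Likelihood Estimates" table (Intercept, x12, x3).
B0, B_MEDIUM, B_WIDTH = -12.1827, 0.5763, 0.4793


def prob_satellite(width, medium):
    """Estimated P(Y = 'Yes') for a female crab (hypothetical helper).

    width  : carapace width, in the same units as X3
    medium : True if the shell color is "medium" (X12 = 1)
    """
    logit = B0 + B_MEDIUM * (1 if medium else 0) + B_WIDTH * width
    return 1.0 / (1.0 + math.exp(-logit))


# Illustrative crab at roughly the sample mean width (an assumed input).
p_medium = prob_satellite(26.3, True)
p_other = prob_satellite(26.3, False)

# The ratio of the two odds reproduces the x12 odds ratio, exp(0.5763).
odds_ratio = (p_medium / (1 - p_medium)) / (p_other / (1 - p_other))

print(f"P(satellite | medium color)    = {p_medium:.3f}")
print(f"P(satellite | not medium)      = {p_other:.3f}")
print(f"odds ratio, medium vs. not     = {odds_ratio:.3f}")
```

At a fixed width, the odds ratio computed this way matches the point estimate SAS reports for x12, which is a quick consistency check on the coefficients.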
(We could compare this model with the model containing only the carapace width, but we would find that the two variables together do provide better explanatory power.)
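Each model comparison in this example reduces to a difference of two -2 Log L values referred to a chi-square distribution. The sketch below, written in Python with only the standard library and not part of the original handout, recomputes both $G^2$ statistics and their p-values. The helpers `chi2_sf` and `lr_test` are hypothetical names introduced here; `chi2_sf` uses the closed-form upper-tail probability that holds for integer degrees of freedom.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable
    with integer degrees of freedom df (closed-form series)."""
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k < df/2} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) plus half-integer gamma terms
    total = math.erfc(math.sqrt(half))
    for k in range(1, (df - 1) // 2 + 1):
        total += math.exp(-half) * half ** (k - 0.5) / math.gamma(k + 0.5)
    return total


def lr_test(dev_reduced, dev_full, df):
    """Return (G^2, p-value), where each dev_* is a model's -2 Log L."""
    g2 = dev_reduced - dev_full
    return g2, chi2_sf(g2, df)


# -2 Log L values copied from the three "Model Fit Statistics" tables.
DEV_SIX_PREDICTORS = 186.612    # x11 x12 x13 x21 x22 x3
DEV_THREE_PREDICTORS = 191.697  # x12 x22 x3
DEV_TWO_PREDICTORS = 191.757    # x12 x3

g2_first, p_first = lr_test(DEV_THREE_PREDICTORS, DEV_SIX_PREDICTORS, df=3)
g2_second, p_second = lr_test(DEV_TWO_PREDICTORS, DEV_THREE_PREDICTORS, df=1)

print(f"Drop x11, x13, x21: G2 = {g2_first:.3f}, p = {p_first:.3f}")
print(f"Drop x22:           G2 = {g2_second:.3f}, p = {p_second:.3f}")
```

Both p-values are well above 0.05, in agreement with the fail-to-reject decisions reached by comparing $G^2$ with the critical values 7.81 and 3.84.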