Simple Logistic Regression Example
Chapter 4, p. 101 (and Chapter 3, p. 75)

Mating Behavior Among Horseshoe Crabs

A study was done of the nesting and mating behavior of horseshoe crabs. Each female crab has a male crab living in her nest. There may also be several “satellite” male crabs lingering nearby. One study found that these satellite males actually fertilize a large proportion of the eggs that the female lays. This study examined the relationship between some of the physical characteristics of each female and the number of satellite males, which ranged from 0 to a maximum of 15.

The study was conducted to find characteristics of the breeding females that might account for the differences in the number of satellite males. Some conjectures were that there were visual cues about the size or color of the females, or chemical cues that would attract unattached males. Earlier studies also found that the satellite males tended to be older than the attached males, and that unattached males tended to be less healthy than attached males.

For the logistic regression model, the response variable S = “No. of Satellite Males” was dichotomized as Y = 1 if there are any satellite males, or Y = 0 if there are no satellite males. We have several possible predictor variables:

    X1 = “Color of Female Crab’s Shell”
    X2 = “Spine Condition of Female Crab”
    X3 = “Width of Carapace of Female Crab”
    X4 = “Weight of Female Crab”

The explanatory variables are coded as follows.

    Color:            1 = “Light medium”, 2 = “Medium”, 3 = “Dark medium”, 4 = “Dark”
    Spine condition:  1 = “Both good”, 2 = “One worn or broken”, 3 = “Both worn or broken”

Carapace width is measured in centimeters; weight is measured in kilograms.

In the first example, we will use one of the predictors, X3, to fit a simple logistic regression model. We will test the fit of the model, and we will interpret the results.
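Before the model is estimated, it may help to write it out. The following is a sketch in the notation this handout uses for the estimated logit function, where $\pi(x)$ is taken to mean the probability that Y = 1 for a female with carapace width X3 = x, and $\beta_0$, $\beta_1$ are the intercept and slope. This is the usual form of a simple logistic regression model, stated here as an assumption about the model being fitted.

```latex
\[
\ln\!\left[\frac{\pi(x)}{1-\pi(x)}\right] = \beta_0 + \beta_1 x,
\qquad\text{equivalently}\qquad
\pi(x) = \frac{e^{\beta_0+\beta_1 x}}{1+e^{\beta_0+\beta_1 x}} .
\]
```

Under this form, $\beta_1 = 0$ means that the probability of having at least one satellite male does not depend on carapace width.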
The logistic regression model will be estimated using SAS PROC LOGISTIC (SAS PROC GENMOD could also be used). The SAS program for estimating the model is given below, followed by the output. The data are listed in the Appendix.

We want to test whether the model with the single explanatory variable X3 = “Carapace Width of Female Crab” allows us to predict whether there are satellite males around the female.

Step 1: $H_0: \beta_1 = 0$ vs. $H_A: \beta_1 \neq 0$.

Step 2: We have $n = 173$, $\alpha = 0.05$.

Step 3: The test statistic is the likelihood ratio chi-square statistic $G^2 = -2(l_0 - l_1)$, where $l_1$ is the maximum of the log-likelihood function for the model with the predictor variable, and $l_0$ is the maximum of the log-likelihood function for the response variable without the predictor (the intercept-only model). Under the null hypothesis, the statistic $G^2$ has a chi-square distribution with d.f. = 1.

Step 4: We will reject the null hypothesis if $G^2 > \chi^2_{1,\,0.05} = 3.84146$.

Step 5: From the output, we find $G^2 = 31.3059$, with p-value < 0.0001. (This is the difference between the two values of -2 Log L in the Model Fit Statistics table, 225.759 - 194.453.)

Step 6: We reject the null hypothesis at the 0.05 level of significance. We have sufficient evidence to conclude that the variable X3 provides explanatory power to predict the value of Y.

From the table of parameter estimates, we find that $\hat\beta_1 = 0.4972$, with $\mathrm{S.E.}(\hat\beta_1) = 0.1017$. Then a (Wald) 95% confidence interval estimate of $\beta_1$ is

$$\hat\beta_1 \pm 1.96\,\mathrm{S.E.}(\hat\beta_1) = (0.2979,\ 0.6965).$$

Also, we can say that the odds of the presence of satellite males increase by an estimated factor of $e^{\hat\beta_1} = 1.644$ for each increase of 1 cm in the female’s carapace width. A (Wald) 95% confidence interval for the odds ratio is (1.347, 2.007).

The estimated logit function is given by

$$\ln\!\left[\frac{\hat\pi(x)}{1-\hat\pi(x)}\right] = \hat\beta_0 + \hat\beta_1 x = -12.3508 + 0.4972\,x.$$

The variance of the logit function at a particular value of x is:

$$\mathrm{var}(\hat\beta_0 + \hat\beta_1 x) = \mathrm{var}(\hat\beta_0) + x^2\,\mathrm{var}(\hat\beta_1) + 2x\,\mathrm{cov}(\hat\beta_0, \hat\beta_1).$$

Note that, since both parameters are estimated from the same data, the estimators will be correlated, so we need the covariance term above.
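The arithmetic behind the test, the Wald intervals, and the variance formula above can be checked with a short script. This is a minimal sketch in Python rather than SAS and is not part of the original program. All numeric inputs are copied from the SAS output reproduced in this handout; the function and variable names are illustrative. The logit is evaluated at a carapace width of 30.0 cm, the width used in this handout's worked example.

```python
import math

# Values copied from the SAS output reproduced in this handout.
NEG2LOGL_NULL = 225.759    # -2 Log L, intercept only
NEG2LOGL_MODEL = 194.453   # -2 Log L, intercept and covariates
B0, B1 = -12.3508, 0.4972  # intercept and slope estimates
SE_B1 = 0.1017             # standard error of the slope
VAR_B0, VAR_B1, COV_B0_B1 = 6.910227, 0.01035, -0.26685  # COVB entries
CHI2_CRIT = 3.84146        # chi-square critical value, 1 d.f., alpha = 0.05
Z = 1.96                   # normal quantile for a 95% interval


def lr_statistic(neg2logl_null, neg2logl_model):
    """G^2 = -2(l0 - l1), written as a difference of -2 Log L values."""
    return neg2logl_null - neg2logl_model


def wald_interval(estimate, se, z=Z):
    """Wald interval: estimate minus/plus z times the standard error."""
    return estimate - z * se, estimate + z * se


def logit_se(x):
    """S.E. of b0 + b1*x, including the 2*x*cov(b0, b1) term."""
    variance = VAR_B0 + x**2 * VAR_B1 + 2 * x * COV_B0_B1
    return math.sqrt(variance)


g2 = lr_statistic(NEG2LOGL_NULL, NEG2LOGL_MODEL)
reject_h0 = g2 > CHI2_CRIT
beta_lo, beta_hi = wald_interval(B1, SE_B1)
odds_ratio = math.exp(B1)
or_lo, or_hi = math.exp(beta_lo), math.exp(beta_hi)

x = 30.0  # carapace width in cm
logit_hat = B0 + B1 * x
logit_lo, logit_hi = wald_interval(logit_hat, logit_se(x))

print(f"G^2 = {g2:.3f}, reject H0: {reject_h0}")
print(f"95% CI for beta1: ({beta_lo:.4f}, {beta_hi:.4f})")
print(f"odds ratio = {odds_ratio:.3f}, 95% CI: ({or_lo:.3f}, {or_hi:.3f})")
print(f"logit at x = {x}: {logit_hat:.4f}, 95% CI: ({logit_lo:.3f}, {logit_hi:.3f})")
```

The first three printed lines agree with the likelihood ratio statistic, the interval for the slope, and the odds ratio interval quoted above. Because the variance of the logit is a small difference between large terms, it is sensitive to rounding in the printed covariance matrix.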
The terms in the equation may be obtained by using the COVB option in the MODEL statement of SAS PROC LOGISTIC. We find from the output that $[\mathrm{S.E.}(\hat\beta_0)]^2 = 6.910227$, $[\mathrm{S.E.}(\hat\beta_1)]^2 = 0.01035$, and the estimated covariance is -0.26685. Hence, the standard error of $\ln[\hat\pi(x)/(1-\hat\pi(x))]$ at x is

$$\mathrm{S.E.}(\hat\beta_0 + \hat\beta_1 x) = \sqrt{6.910227 + 0.01035\,x^2 + 2(-0.26685)\,x}.$$

Using these values, we can then find a confidence interval estimate for the logit at any value of x. For example, one of the females had a carapace width of 30.0 cm. A 95% confidence interval estimate of the logit at x = 30.0 cm is

$$-12.3508 + 0.4972(30.0) \pm 1.96\sqrt{6.910227 + (0.01035)(900) + 2(-0.26685)(30.0)} = 2.5652 \pm 1.96(0.4628) \approx (1.658,\ 3.472).$$

This technique may be used to find a confidence interval for the odds at a particular value of x, by exponentiation.

A plot of the estimated logistic regression function is given below.

Notice that the SAS program (using PROC CORR) also provides the simple descriptive statistics and the matrix of Pearson correlation coefficients between all pairs of variables. We see that 64.162% of the female crabs (111 of 173) had satellite males. We also see that the minimum number of satellite males was 0, and the maximum number was 15. We see that the correlation between Y and X4 is 0.38719. However, the correlation between Y and X3 is 0.40141; the width of the female’s carapace is positively correlated with whether there are any satellite males. There is also a negative correlation between Y and X1 = “Color of Female Crab’s Shell” (r = -0.26778, p = 0.0004). There is not a significant correlation between Y and X2 = “Spine Condition of Female Crab” (r = -0.02777, p = 0.7169). Hence, we may also consider X1 and X3 as possible explanatory variables. These will be considered when we discuss multiple logistic regression models.

SAS Program for Simple Logistic Regression:

    proc format;
       value difmt 0 = "No"
                   1 = "Yes";
    run;

    data crabs;
       input x1 x2 x3 x4 s;
       y = 0;
       if s > 0 then y = 1;
       label x1 = "Color"
             x2 = "Spine Condition"
             x3 = "Carapace Width"
             x4 = "Weight"
             y  = "Satellite Males?"
             s  = "No. of Satellite Males";
       format y difmt.;
       cards;
    (The data are listed in the Appendix.)
    ;

    proc logistic;
       model y (order=formatted event='Yes') = x3 / covb;
       title "Logistic regression of Satellite Presence";
       title2 "vs. Female's Carapace Width";
    run;

    proc corr;
       var y x1 x2 x3 x4 s;
       title "Correlations Among All Variables";
       title2 "Including Number of Satellite Males";
    run;

SAS Output:

    Logistic regression of Satellite Presence
    vs. Female's Carapace Width
    16:55 Wednesday, November 5, 2008

    The LOGISTIC Procedure

    Model Information
      Data Set                      WORK.CRABS
      Response Variable             y (Satellite Males?)
      Number of Response Levels     2
      Model                         binary logit
      Optimization Technique        Fisher's scoring

      Number of Observations Read   173
      Number of Observations Used   173

    Response Profile
      Ordered Value   y     Total Frequency
      1               No     62
      2               Yes   111

    Probability modeled is y='Yes'.

    Model Convergence Status
      Convergence criterion (GCONV=1E-8) satisfied.

    Model Fit Statistics
      Criterion   Intercept Only   Intercept and Covariates
      AIC         227.759          198.453
      SC          230.912          204.759
      -2 Log L    225.759          194.453

    Testing Global Null Hypothesis: BETA=0
      Test               Chi-Square   DF   Pr > ChiSq
      Likelihood Ratio   31.3059      1    <.0001
      Score              27.8752      1    <.0001
      Wald               23.8872      1    <.0001

    Analysis of Maximum Likelihood Estimates
      Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
      Intercept   1    -12.3508   2.6287           22.0749           <.0001
      x3          1      0.4972   0.1017           23.8872           <.0001

    Odds Ratio Estimates
      Effect   Point Estimate   95% Wald Confidence Limits
      x3       1.644            1.347   2.007

    Association of Predicted Probabilities and Observed Responses
      Percent Concordant   73.5    Somers' D   0.485
      Percent Discordant   25.0    Gamma       0.492
      Percent Tied          1.5    Tau-a       0.224
      Pairs                6882    c           0.742

    Estimated Covariance Matrix
      Parameter   Intercept   x3
      Intercept    6.910227   -0.26685
      x3          -0.26685     0.01035

    Correlations Among All Variables
    Including Number of Satellite Males

    The CORR Procedure

    6 Variables: y x1 x2 x3 x4 s

    Simple Statistics
      Variable   N     Mean       Std Dev   Sum         Minimum    Maximum    Label
      y          173    0.64162   0.48092   111.00000    0          1.00000   Satellite Males?
      x1         173    2.43931   0.80193   422.00000    1.00000    4.00000   Color
      x2         173    2.48555   0.82552   430.00000    1.00000    3.00000   Spine Condition
      x3         173   26.29884   2.10906   4550        21.00000   33.50000   Carapace Width
      x4         173    2.43723   0.57726   421.64000    1.20000    5.20000   Weight
      s          173    2.91908   3.14834   505.00000    0         15.00000   No. of Satellite Males

    Pearson Correlation Coefficients, N = 173
    Prob > |r| under H0: Rho=0
             y          x1         x2         x3         x4         s
      y      1.00000   -0.26778   -0.02777    0.40141    0.38719    0.69496
                        0.0004     0.7169     <.0001     <.0001     <.0001
      x1    -0.26778    1.00000    0.37850   -0.26439   -0.25067   -0.19078
             0.0004                <.0001     0.0004     0.0009     0.0119
      x2    -0.02777    0.37850    1.00000   -0.12189   -0.16650   -0.08993
             0.7169     <.0001                0.1101     0.0286     0.2393
      x3     0.40141   -0.26439   -0.12189    1.00000    0.88689    0.33989
             <.0001     0.0004     0.1101                <.0001     <.0001
      x4     0.38719   -0.25067   -0.16650    0.88689    1.00000    0.36930
             <.0001     0.0009     0.0286     <.0001                <.0001
      s      0.69496   -0.19078   -0.08993    0.33989    0.36930    1.00000
             <.0001     0.0119     0.2393     <.0001     <.0001

Appendix: Horseshoe Crab Data

Each row is one female crab. The columns follow the order of the INPUT statement: x1 (color), x2 (spine condition), x3 (carapace width, cm), x4 (weight, kg), s (number of satellite males).

    x1 x2 x3 x4 s
    2 3 28.3 3.05 8
    3 3 26.0 2.60 4
    3 3 25.6 2.15 0
    4 2 21.0 1.85 0
    2 3 29.0 3.00 1
    1 2 25.0 2.30 3
    4 3 26.2 1.30 0
    2 3 24.9 2.10 0
    2 1 25.7 2.00 8
    2 3 27.5 3.15 6
    1 1 26.1 2.80 5
    3 3 28.9 2.80 4
    2 1 30.3 3.60 3
    2 3 22.9 1.60 4
    3 3 26.2 2.30 3
    3 3 24.5 2.05 5
    2 3 30.0 3.05 8
    2 3 26.2 2.40 3
    2 3 25.4 2.25 6
    2 3 25.4 2.25 4
    4 3 27.5 2.90 0
    4 3 27.0 2.25 3
    2 2 24.0 1.70 0
    2 1 28.7 3.20 0
    3 3 26.5 1.97 1
    2 3 24.5 1.60 1
    3 3 27.3 2.90 1
    2 3 26.5 2.30 4
    2 3 25.0 2.10 2
    3 3 22.0 1.40 0
    1 1 30.2 3.28 2
    2 2 25.4 2.30 0
    2 1 24.9 2.30 6
    4 3 25.8 2.25 10
    3 3 27.2 2.40 5
    2 3 30.5 3.32 3
    4 3 25.0 2.10 8
    2 3 30.0 3.00 9
    2 1 22.9 1.60 0
    2 3 23.9 1.85 2
    2 3 26.0 2.28 3
    2 3 25.8 2.20 0
    3 3 29.0 3.28 4
    1 1 26.5 2.35 0
    3 3 22.5 1.55 0
    2 3 23.8 2.10 0
    3 3 24.3 2.15 0
    2 1 26.0 2.30 14
    4 3 24.7 2.20 0
    2 1 22.5 1.60 1
    2 3 28.7 3.15 3
    1 1 29.3 3.20 4
    2 1 26.7 2.70 5
    4 3 23.4 1.90 0
    1 1 27.7 2.50 6
    2 3 28.2 2.60 6
    4 3 24.7 2.10 5
    2 1 25.7 2.00 5
    2 1 27.8 2.75 0
    3 1 27.0 2.45 3
    2 3 29.0 3.20 10
    3 3 25.6 2.80 7
    3 3 24.2 1.90 0
    3 3 25.7 1.20 0
    3 3 23.1 1.65 0
    2 3 28.5 3.05 0
    2 1 29.7 3.85 5
    3 3 23.1 1.55 0
    3 3 24.5 2.20 1
    2 3 27.5 2.55 1
    2 3 26.3 2.40 1
    2 3 27.8 3.25 3
    2 3 31.9 3.33 2
    2 3 25.0 2.40 5
    3 3 26.2 2.22 0
    3 3 28.4 3.20 3
    1 2 24.5 1.95 6
    2 3 27.9 3.05 7
    2 2 25.0 2.25 6
    3 3 29.0 2.92 3
    2 1 31.7 3.73 4
    2 3 27.6 2.85 4
    4 3 24.5 1.90 0
    3 3 23.8 1.80 0
    2 3 28.2 3.05 8
    3 3 24.1 1.80 0
    1 1 28.0 2.62 0
    1 1 26.0 2.30 9
    3 2 24.7 1.90 0
    2 3 25.8 2.65 0
    1 1 27.1 2.95 8
    2 3 27.4 2.70 5
    3 3 26.7 2.60 2
    2 1 26.8 2.70 5
    1 3 25.8 2.60 0
    4 3 23.7 1.85 0
    2 3 27.9 2.80 6
    2 1 30.0 3.30 5
    2 3 25.0 2.10 4
    2 3 27.7 2.90 5
    2 3 28.3 3.00 15
    4 3 25.5 2.25 0
    2 3 26.0 2.15 5
    2 3 26.2 2.40 0
    3 3 23.0 1.65 1
    2 2 22.9 1.60 0
    2 3 25.1 2.10 5
    3 1 25.9 2.55 4
    4 1 25.5 2.75 0
    2 1 26.8 2.55 0
    2 1 29.0 2.80 1
    3 3 28.5 3.00 1
    2 2 24.7 2.55 4
    2 3 29.0 3.10 1
    2 3 27.0 2.50 6
    4 3 23.7 1.80 0
    3 3 27.0 2.50 6
    2 3 24.2 1.65 2
    4 3 22.5 1.47 4
    2 3 25.1 1.80 0
    2 3 24.9 2.20 0
    2 3 27.5 2.63 6
    2 1 24.3 2.00 0
    2 3 29.5 3.02 4
    2 3 26.2 2.30 0
    2 3 24.7 1.95 4
    3 2 29.8 3.50 4
    4 3 25.7 2.15 0
    3 3 26.2 2.17 2
    4 3 27.0 2.63 0
    3 3 24.8 2.10 0
    2 1 23.7 1.95 0
    2 3 28.2 3.05 11
    2 3 25.2 2.00 1
    2 2 23.2 1.95 4
    4 3 25.8 2.00 3
    4 3 27.5 2.60 0
    2 2 25.7 2.00 0
    2 3 26.8 2.65 0
    3 3 27.5 3.10 3
    3 1 28.5 3.25 9
    2 3 28.5 3.00 3
    1 1 27.4 2.70 6
    2 3 27.2 2.70 3
    3 3 27.1 2.55 0
    2 3 28.0 2.80 1
    2 1 26.5 1.30 0
    3 3 23.0 1.80 0
    3 2 26.0 2.20 3
    3 2 24.5 2.25 0
    2 3 25.8 2.30 0
    4 3 23.5 1.90 0
    4 3 26.7 2.45 0
    3 3 25.5 2.25 0
    2 3 28.2 2.87 1
    2 1 25.2 2.00 1
    2 3 25.3 1.90 2
    3 3 25.7 2.10 0
    4 3 29.3 3.23 12
    3 3 23.8 1.80 6
    2 3 27.4 2.90 3
    2 3 26.2 2.02 2
    2 1 28.0 2.90 4
    2 1 28.4 3.10 5
    2 1 33.5 5.20 7
    2 3 25.8 2.40 0
    3 3 24.0 1.90 10
    2 1 23.1 2.00 0
    2 3 28.3 3.20 0
    2 3 26.5 2.35 4
    2 3 26.5 2.75 7
    3 3 26.1 2.75 3
    2 2 24.5 2.00 0