Logistic Regression, Mating Behavior of Horseshoe Crabs

Simple Logistic Regression Example
Chapter 4, p. 101 (and Chapter 3, p. 75)
Mating Behavior Among Horseshoe Crabs
A study was done of the nesting and mating behavior of horseshoe crabs. Each female crab has a
male crab living in her nest. There also may be several “satellite” male crabs lingering nearby.
One study found that these satellite males actually fertilize a large proportion of the eggs that the
female lays. This study examined the relationship between some of the physical characteristics
of each female and the number of satellite males, which ranged from 0 to a maximum of 15.
This study was conducted to find characteristics of the breeding females that might account for
the differences among the number of satellite males. Some conjectures were that there were
visual cues about the size or color of the females, or chemical cues that would attract unattached
males. It was also found in earlier studies that the satellite males tended to be older than the
attached males, and that unattached males tended to be less healthy than attached males.
For the logistic regression model, the response variable, S = “No. of Satellite Males”, was
dichotomized as Y = 1 if there are any satellite males, or Y = 0 if there are no satellite males.
We have several possible predictor variables: X1 = “Color of Female Crab’s Shell”, X2 =
“Spinal Condition of Female Crab”, X3 = “Width of Carapace of Female Crab”, and X4 =
“Weight of Female Crab”. The explanatory variables are coded as follows.
Color: 1 = “Light medium” 2 = “Medium” 3 = “Dark medium” 4 = “Dark”
Spinal condition: 1 = “Both good” 2 = “One worn or broken” 3 = “Both worn or broken”
Carapace width is measured in centimeters; Weight is measured in kilograms.
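As a minimal sketch of this coding (written in Python rather than SAS, purely for illustration), the dichotomization and the category codes can be expressed as follows. The names COLOR, SPINE, and dichotomize are assumptions introduced here, and the three satellite counts are made-up inputs, not records from the data set.

```python
# Category codes as described in the text.
COLOR = {1: "Light medium", 2: "Medium", 3: "Dark medium", 4: "Dark"}
SPINE = {1: "Both good", 2: "One worn or broken", 3: "Both worn or broken"}


def dichotomize(satellites: int) -> int:
    """Return Y = 1 if the female has any satellite males, else Y = 0."""
    return 1 if satellites > 0 else 0


# Hypothetical satellite counts, chosen only to show the mapping S -> Y.
for s in (0, 3, 15):
    print(s, "->", dichotomize(s))
```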
In the first example, we will use one of the predictors, X3, to fit a simple logistic regression
model. We will test for fit of the model, and we will interpret the results. The logistic regression
model will be estimated using SAS PROC LOGISTIC (SAS PROC GENMOD could also be used).
The SAS program for estimating the model is given below, followed by the output. The data are
listed in the Appendix.
We want to test whether the model with the single explanatory variable X3 = “Carapace Width of
Female Crab” allows us to predict whether there are satellite males around the female.
Step 1: $H_0: \beta_1 = 0$ v. $H_A: \beta_1 \neq 0$.
Step 2: We have $n = 173$, $\alpha = 0.05$.
Step 3: The test statistic is the Likelihood Ratio chi-square statistic $G^2 = -2(l_0 - l_1)$, where
$l_1$ is the maximum of the log-likelihood function for the model with the predictor variable, and
$l_0$ is the maximum of the log-likelihood function for the response variable without imposing the
prediction model (that is, the intercept-only model). Under the null hypothesis, the statistic $G^2$
has a chi-square distribution with d.f. = 1.
Step 4: We will reject the null hypothesis if $G^2 > \chi^2_{1,\,0.05} = 3.84146$.
Step 5: From the output, we find $G^2 = 31.3059$, with p-value < 0.0001.
Step 6: We reject the null hypothesis at the 0.05 level of significance. We have sufficient
evidence to conclude that the variable X3 provides explanatory power to predict the value of Y.
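The arithmetic behind Steps 3 to 6 can be checked with a short sketch. The two "-2 Log L" values are copied from the Model Fit Statistics table in the SAS output; the variable and function names are assumptions introduced here. For 1 d.f., the chi-square upper-tail probability equals erfc(sqrt(x/2)), so only the Python standard library is needed.

```python
import math

# -2 Log L values from the "Model Fit Statistics" table of the SAS output.
NEG2_LOGLIK_INTERCEPT_ONLY = 225.759
NEG2_LOGLIK_WITH_X3 = 194.453

# G^2 = -2(l0 - l1) = (-2 l0) - (-2 l1)
g2 = NEG2_LOGLIK_INTERCEPT_ONLY - NEG2_LOGLIK_WITH_X3


def chi2_upper_tail_df1(x: float) -> float:
    """P(chi-square with 1 d.f. > x), using P(Z^2 > x) = erfc(sqrt(x / 2))."""
    return math.erfc(math.sqrt(x / 2.0))


CRITICAL_VALUE = 3.84146  # chi-square critical value, 1 d.f., alpha = 0.05
p_value = chi2_upper_tail_df1(g2)
reject_h0 = g2 > CRITICAL_VALUE

print(f"G^2 = {g2:.3f}, p-value = {p_value:.2e}, reject H0: {reject_h0}")
```

The small difference between this G^2 and the 31.3059 printed by SAS comes from the rounding of the two -2 Log L values in the listing.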
From the table of parameter estimates, we find that $\hat\beta_1 = 0.4972$, with
$S.E.(\hat\beta_1) = 0.1017$. Then a (Wald) 95% confidence interval estimate of $\beta_1$ is
$\hat\beta_1 \pm 1.96\, S.E.(\hat\beta_1) = (0.2979, 0.6965)$. Also, we can say that the odds of
the presence of satellite males increase by an estimated factor of $e^{\hat\beta_1} = 1.644$ for
each increase of 1 cm in the female’s carapace width. A (Wald) 95% confidence interval for the
odds ratio is (1.347, 2.007).
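These Wald quantities can be reproduced directly from the estimate and its standard error. The sketch below uses only the two numbers quoted from the parameter-estimate table; the variable names are assumptions introduced for illustration.

```python
import math

BETA1_HAT = 0.4972   # slope estimate for x3 (carapace width)
SE_BETA1 = 0.1017    # its standard error
Z_975 = 1.96         # standard normal quantile for a 95% interval

# Wald interval for the slope on the log-odds scale.
beta1_lower = BETA1_HAT - Z_975 * SE_BETA1
beta1_upper = BETA1_HAT + Z_975 * SE_BETA1

# Exponentiating gives the odds ratio per 1 cm and its interval.
odds_ratio = math.exp(BETA1_HAT)
odds_ratio_ci = (math.exp(beta1_lower), math.exp(beta1_upper))

print(f"beta1 CI: ({beta1_lower:.4f}, {beta1_upper:.4f})")
print(f"odds ratio: {odds_ratio:.3f}, CI: ({odds_ratio_ci[0]:.3f}, {odds_ratio_ci[1]:.3f})")
```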
 ˆ x   ˆ
   0  ˆ1 x  12.3508  0.4972 x . The
The estimated logit function is given by ln 
ˆ
 1   x  
variance of the logit function at a particular value of x is:
var ˆ0  ˆ1 x  var ˆ0  x 2 var ˆ1  2 x cov ˆ0 , ˆ1 .
Note that, since both parameters are estimated from the same data, the estimators will be
correlated, so we need the covariance term above. The terms in the equation may be obtained by
using the COVB option in the MODEL statement of SAS PROC LOGISTIC. We find from the
output that
$[S.E.(\hat\beta_0)]^2 = 6.910227$, $[S.E.(\hat\beta_1)]^2 = 0.01035$, and the estimated covariance
is -0.26685. Hence, the standard error of $\ln\left[\frac{\hat\pi(x)}{1 - \hat\pi(x)}\right]$ at x is
$S.E.(\hat\beta_0 + \hat\beta_1 x) = \sqrt{6.910227 + 0.01035x^2 + 2(-0.26685)x}$.
Using these values, we can then find a confidence interval estimate for the logit at x for any
value of x. For example, one of the females had a carapace width of 30.0 cm. A 95%
confidence interval estimate of the logit at x = 30.0 cm is
$-12.3508 + 0.4972(30.0) \pm 1.96\sqrt{6.910227 + (0.01035)(900) + 2(-0.26685)(30.0)}$
$= 2.5652 \pm 1.96\sqrt{0.214227} = (1.6580, 3.4724)$.
This technique may be used to find a confidence interval for the odds function at a particular
value of x, by exponentiation.
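A minimal sketch of this calculation follows, assuming the coefficient estimates and the covariance matrix entries printed by the COVB option. It applies the variance formula given earlier, including the factor of 2 on the covariance term. The function name logit_confidence_interval is hypothetical, and the final conversion from odds to probability (odds / (1 + odds)) is an extra step added here for illustration.

```python
import math

# Estimates and covariance matrix entries from the SAS output.
B0, B1 = -12.3508, 0.4972
VAR_B0, VAR_B1, COV_B0_B1 = 6.910227, 0.01035, -0.26685


def logit_confidence_interval(x: float, z: float = 1.96):
    """Wald interval for the logit b0 + b1*x at carapace width x (cm)."""
    logit = B0 + B1 * x
    variance = VAR_B0 + x ** 2 * VAR_B1 + 2.0 * x * COV_B0_B1
    half_width = z * math.sqrt(variance)
    return logit - half_width, logit + half_width


logit_lower, logit_upper = logit_confidence_interval(30.0)

# Exponentiate the endpoints to get an interval for the odds at x = 30.0 cm.
odds_ci = (math.exp(logit_lower), math.exp(logit_upper))

# Optional extra step: map each odds endpoint to a probability.
prob_ci = tuple(odds / (1.0 + odds) for odds in odds_ci)

print("logit CI:", logit_lower, logit_upper)
print("odds CI:", odds_ci)
print("probability CI:", prob_ci)
```

Because the printed covariance entries are rounded and the three variance terms nearly cancel at widths this large, the computed standard error is sensitive to that rounding; results should be read to about two decimal places.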


A plot of the estimated logistic regression function is given below.
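As a rough stand-in for reading values off the fitted curve, the estimated probability can be tabulated at a few carapace widths. This is a sketch only: the function name fitted_probability is an assumption, and the widths are illustrative choices (the minimum, approximate mean, and maximum from the Simple Statistics table, plus 30.0 cm).

```python
import math

B0, B1 = -12.3508, 0.4972  # estimated intercept and slope from the SAS output


def fitted_probability(width_cm: float) -> float:
    """Estimated P(Y = 1) from the fitted logit b0 + b1 * width."""
    logit = B0 + B1 * width_cm
    return 1.0 / (1.0 + math.exp(-logit))


WIDTHS = (21.0, 26.3, 30.0, 33.5)
for width in WIDTHS:
    print(f"width {width:5.1f} cm -> estimated probability {fitted_probability(width):.3f}")

# Width at which the estimated probability equals 0.5 (logit = 0).
median_width = -B0 / B1
print(f"estimated probability is 0.5 at about {median_width:.2f} cm")
```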
Notice that the SAS program (using PROC CORR) also provides the simple descriptive statistics
and the matrix of Pearson correlation coefficients between all pairs of variables. We see that
64.162% of the female crabs had satellite males. We also see that the minimum number of
satellite males was 0, and the maximum number was 15. We see that the correlation between Y
and X4 is 0.38719. However, the correlation between Y and X3 is 0.40141; the width of the
female’s carapace is positively correlated with whether there are any satellite males. There is
also a negative correlation between Y and X1 = “Color of Female Crab’s Shell”. There is not a
significant correlation between Y and X2 = “Spinal Condition of Female Crab”. Hence, we may
also consider X1 and X3 as possible explanatory variables. These will be considered when we
discuss multiple logistic regression models.
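The significance statements above can be checked approximately from the correlations alone. The sketch below computes the usual t statistic for H0: rho = 0, with n - 2 degrees of freedom. Since the Python standard library has no t distribution, the p-value uses a standard normal approximation; that is an assumption of this sketch, so the values will differ slightly from the t-based p-values PROC CORR prints. The function names are hypothetical.

```python
import math

N = 173  # number of crabs

# Pearson correlations with Y, copied from the PROC CORR output.
R_WITH_Y = {"x1": -0.26778, "x2": -0.02777, "x3": 0.40141, "x4": 0.38719}


def corr_t_statistic(r: float, n: int) -> float:
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


def approx_two_sided_p(t: float) -> float:
    """Two-sided p-value from the standard normal (approximation to t)."""
    return math.erfc(abs(t) / math.sqrt(2.0))


for name, r in R_WITH_Y.items():
    t = corr_t_statistic(r, N)
    print(f"{name}: r = {r:+.5f}, t = {t:+.3f}, approx p = {approx_two_sided_p(t):.4f}")
```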
SAS Program for Simple Logistic Regression:
/* Display the dichotomized response as No/Yes */
proc format;
   value difmt 0 = "No"
               1 = "Yes";
run;

/* Read the crab data and create the binary response y */
data crabs;
   input x1 x2 x3 x4 s;
   y = 0;
   if s > 0 then y = 1;
   label
      x1 = "Color"
      x2 = "Spine Condition"
      x3 = "Carapace Width"
      x4 = "Weight"
      y = "Satellite Males?"
      s = "No. of Satellite Males";
   format y difmt.;
   cards;
The data are listed in the Appendix.
;
run;

/* Simple logistic regression of y on carapace width;
   COVB prints the estimated covariance matrix */
proc logistic;
   model y (order=formatted event='Yes') = x3 / covb;
   title "Logistic regression of Satellite Presence";
   title2 "vs. Female's Carapace Width";
run;

/* Descriptive statistics and Pearson correlations */
proc corr;
   var y x1 x2 x3 x4 s;
   title "Correlations Among All Variables";
   title2 "Including Number of Satellite Males";
run;
SAS Output:
Logistic regression of Satellite Presence
vs. Female's Carapace Width          16:55 Wednesday, November 5, 2008

The LOGISTIC Procedure

Model Information
Data Set                     WORK.CRABS
Response Variable            y              Satellite Males?
Number of Response Levels    2
Model                        binary logit
Optimization Technique       Fisher's scoring

Number of Observations Read    173
Number of Observations Used    173

Response Profile
Ordered Value    y      Total Frequency
      1          No           62
      2          Yes         111

Probability modeled is y='Yes'.

Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.

Model Fit Statistics
Criterion    Intercept Only    Intercept and Covariates
AIC                 227.759                     198.453
SC                  230.912                     204.759
-2 Log L            225.759                     194.453

Testing Global Null Hypothesis: BETA=0
Test                Chi-Square    DF    Pr > ChiSq
Likelihood Ratio       31.3059     1        <.0001
Score                  27.8752     1        <.0001
Wald                   23.8872     1        <.0001

Analysis of Maximum Likelihood Estimates
Parameter    DF    Estimate    Standard Error    Wald Chi-Square    Pr > ChiSq
Intercept     1    -12.3508            2.6287            22.0749        <.0001
x3            1      0.4972            0.1017            23.8872        <.0001

Odds Ratio Estimates
Effect    Point Estimate    95% Wald Confidence Limits
x3                 1.644         1.347         2.007

Association of Predicted Probabilities and Observed Responses
Percent Concordant    73.5    Somers' D    0.485
Percent Discordant    25.0    Gamma        0.492
Percent Tied           1.5    Tau-a        0.224
Pairs                 6882    c            0.742

Estimated Covariance Matrix
Parameter    Intercept          x3
Intercept     6.910227    -0.26685
x3            -0.26685     0.01035

Correlations Among All Variables
Including Number of Satellite Males          16:55 Wednesday, November 5, 2008

The CORR Procedure

6 Variables:    y    x1    x2    x3    x4    s

Simple Statistics
Variable      N        Mean     Std Dev          Sum     Minimum     Maximum    Label
y           173     0.64162     0.48092    111.00000           0     1.00000    Satellite Males?
x1          173     2.43931     0.80193    422.00000     1.00000     4.00000    Color
x2          173     2.48555     0.82552    430.00000     1.00000     3.00000    Spine Condition
x3          173    26.29884     2.10906         4550    21.00000    33.50000    Carapace Width
x4          173     2.43723     0.57726    421.64000     1.20000     5.20000    Weight
s           173     2.91908     3.14834    505.00000           0    15.00000    No. of Satellite Males

Pearson Correlation Coefficients, N = 173
Prob > |r| under H0: Rho=0

                                 y        x1        x2        x3        x4         s
y                          1.00000  -0.26778  -0.02777   0.40141   0.38719   0.69496
Satellite Males?                      0.0004    0.7169    <.0001    <.0001    <.0001

x1                        -0.26778   1.00000   0.37850  -0.26439  -0.25067  -0.19078
Color                       0.0004              <.0001    0.0004    0.0009    0.0119

x2                        -0.02777   0.37850   1.00000  -0.12189  -0.16650  -0.08993
Spine Condition             0.7169    <.0001              0.1101    0.0286    0.2393

x3                         0.40141  -0.26439  -0.12189   1.00000   0.88689   0.33989
Carapace Width              <.0001    0.0004    0.1101              <.0001    <.0001

x4                         0.38719  -0.25067  -0.16650   0.88689   1.00000   0.36930
Weight                      <.0001    0.0009    0.0286    <.0001              <.0001

s                          0.69496  -0.19078  -0.08993   0.33989   0.36930   1.00000
No. of Satellite Males      <.0001    0.0119    0.2393    <.0001    <.0001
Appendix: Horseshoe Crab Data
x1  x2    x3    x4   s
 2   3  28.3  3.05   8
 3   3  26.0  2.60   4
 3   3  25.6  2.15   0
 4   2  21.0  1.85   0
 2   3  29.0  3.00   1
 1   2  25.0  2.30   3
 4   3  26.2  1.30   0
 2   3  24.9  2.10   0
 2   1  25.7  2.00   8
 2   3  27.5  3.15   6
 1   1  26.1  2.80   5
 3   3  28.9  2.80   4
 2   1  30.3  3.60   3
 2   3  22.9  1.60   4
 3   3  26.2  2.30   3
 3   3  24.5  2.05   5
 2   3  30.0  3.05   8
 2   3  26.2  2.40   3
 2   3  25.4  2.25   6
 2   3  25.4  2.25   4
 4   3  27.5  2.90   0
 4   3  27.0  2.25   3
 2   2  24.0  1.70   0
 2   1  28.7  3.20   0
 3   3  26.5  1.97   1
 2   3  24.5  1.60   1
 3   3  27.3  2.90   1
 2   3  26.5  2.30   4
 2   3  25.0  2.10   2
 3   3  22.0  1.40   0
 1   1  30.2  3.28   2
 2   2  25.4  2.30   0
 2   1  24.9  2.30   6
 4   3  25.8  2.25  10
 3   3  27.2  2.40   5
 2   3  30.5  3.32   3
 4   3  25.0  2.10   8
 2   3  30.0  3.00   9
 2   1  22.9  1.60   0
 2   3  23.9  1.85   2
 2   3  26.0  2.28   3
 2   3  25.8  2.20   0
 3   3  29.0  3.28   4
 1   1  26.5  2.35   0
 3   3  22.5  1.55   0
 2   3  23.8  2.10   0
 3   3  24.3  2.15   0
 2   1  26.0  2.30  14
 4   3  24.7  2.20   0
 2   1  22.5  1.60   1
 2   3  28.7  3.15   3
 1   1  29.3  3.20   4
 2   1  26.7  2.70   5
 4   3  23.4  1.90   0
 1   1  27.7  2.50   6
 2   3  28.2  2.60   6
 4   3  24.7  2.10   5
 2   1  25.7  2.00   5
 2   1  27.8  2.75   0
 3   1  27.0  2.45   3
 2   3  29.0  3.20  10
 3   3  25.6  2.80   7
 3   3  24.2  1.90   0
 3   3  25.7  1.20   0
 3   3  23.1  1.65   0
 2   3  28.5  3.05   0
 2   1  29.7  3.85   5
 3   3  23.1  1.55   0
 3   3  24.5  2.20   1
 2   3  27.5  2.55   1
 2   3  26.3  2.40   1
 2   3  27.8  3.25   3
 2   3  31.9  3.33   2
 2   3  25.0  2.40   5
 3   3  26.2  2.22   0
 3   3  28.4  3.20   3
 1   2  24.5  1.95   6
 2   3  27.9  3.05   7
 2   2  25.0  2.25   6
 3   3  29.0  2.92   3
 2   1  31.7  3.73   4
 2   3  27.6  2.85   4
 4   3  24.5  1.90   0
 3   3  23.8  1.80   0
 2   3  28.2  3.05   8
 3   3  24.1  1.80   0
 1   1  28.0  2.62   0
 1   1  26.0  2.30   9
 3   2  24.7  1.90   0
 2   3  25.8  2.65   0
 1   1  27.1  2.95   8
 2   3  27.4  2.70   5
 3   3  26.7  2.60   2
 2   1  26.8  2.70   5
 1   3  25.8  2.60   0
 4   3  23.7  1.85   0
 2   3  27.9  2.80   6
 2   1  30.0  3.30   5
 2   3  25.0  2.10   4
 2   3  27.7  2.90   5
 2   3  28.3  3.00  15
 4   3  25.5  2.25   0
 2   3  26.0  2.15   5
 2   3  26.2  2.40   0
 3   3  23.0  1.65   1
 2   2  22.9  1.60   0
 2   3  25.1  2.10   5
 3   1  25.9  2.55   4
 4   1  25.5  2.75   0
 2   1  26.8  2.55   0
 2   1  29.0  2.80   1
 3   3  28.5  3.00   1
 2   2  24.7  2.55   4
 2   3  29.0  3.10   1
 2   3  27.0  2.50   6
 4   3  23.7  1.80   0
 3   3  27.0  2.50   6
 2   3  24.2  1.65   2
 4   3  22.5  1.47   4
 2   3  25.1  1.80   0
 2   3  24.9  2.20   0
 2   3  27.5  2.63   6
 2   1  24.3  2.00   0
 2   3  29.5  3.02   4
 2   3  26.2  2.30   0
 2   3  24.7  1.95   4
 3   2  29.8  3.50   4
 4   3  25.7  2.15   0
 3   3  26.2  2.17   2
 4   3  27.0  2.63   0
 3   3  24.8  2.10   0
 2   1  23.7  1.95   0
 2   3  28.2  3.05  11
 2   3  25.2  2.00   1
 2   2  23.2  1.95   4
 4   3  25.8  2.00   3
 4   3  27.5  2.60   0
 2   2  25.7  2.00   0
 2   3  26.8  2.65   0
 3   3  27.5  3.10   3
 3   1  28.5  3.25   9
 2   3  28.5  3.00   3
 1   1  27.4  2.70   6
 2   3  27.2  2.70   3
 3   3  27.1  2.55   0
 2   3  28.0  2.80   1
 2   1  26.5  1.30   0
 3   3  23.0  1.80   0
 3   2  26.0  2.20   3
 3   2  24.5  2.25   0
 2   3  25.8  2.30   0
 4   3  23.5  1.90   0
 4   3  26.7  2.45   0
 3   3  25.5  2.25   0
 2   3  28.2  2.87   1
 2   1  25.2  2.00   1
 2   3  25.3  1.90   2
 3   3  25.7  2.10   0
 4   3  29.3  3.23  12
 3   3  23.8  1.80   6
 2   3  27.4  2.90   3
 2   3  26.2  2.02   2
 2   1  28.0  2.90   4
 2   1  28.4  3.10   5
 2   1  33.5  5.20   7
 2   3  25.8  2.40   0
 3   3  24.0  1.90  10
 2   1  23.1  2.00   0
 2   3  28.3  3.20   0
 2   3  26.5  2.35   4
 2   3  26.5  2.75   7
 3   3  26.1  2.75   3
 2   2  24.5  2.00   0