Logistic Regression Example with Grouped Data

Logistic Regression Model Development,
Including Stepwise Logistic Regression, With Example
Chapter 4, p. 126 – Mating Behavior Among Horseshoe Crabs
In situations in which we have a large number of possible explanatory variables, choosing a
“best” model may become somewhat tedious. It would be helpful to have a systematic procedure
to look for the best model. The stepwise logistic regression procedure considers a number of
possible multiple regression models, and selects subsets of parameters to test for possible
addition to the model or elimination from the model.
The following step-by-step procedure works to develop a parsimonious logistic regression model
for explaining the variation in the dichotomous response variable Y in terms of a subset of a
large pool of possible explanatory variables, some of which may be categorical and some,
continuous. This procedure is explained in more detail in Hosmer and Lemeshow (1).
1) First, we look at the relationship between Y and each explanatory variable X by itself, either
using a 2 X I contingency table, if X is categorical, or using a univariate logistic regression
model, if X is continuous. We use the likelihood ratio test statistic to test the significance of this
relationship.
For a categorical variable X exhibiting at least a moderate level of association, we estimate
individual odds ratios (along with confidence limits), using one level of X as a reference level. If
the contingency table has a cell with n_ij = 0, then we should consider one of two options: a)
collapse some categories of X to remove 0 frequencies, or b) if X is ordinal, we may perhaps
model it as a continuous variable.
If X is continuous, we estimate a univariate logistic regression model, and test for significance.
(We could also use a t-test for this purpose.)
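For the categorical case in step 1, the odds ratio, its Wald confidence limits, and the likelihood ratio chi-square can all be computed directly from the cell counts. The following is a minimal Python sketch, not part of the original SAS analysis. It uses the counts from the y-by-x12 ("Medium?") table in the PROC FREQ output later in this handout; the function name two_by_two_summary is hypothetical, and the calculation assumes that no cell count is zero.

```python
import math

def two_by_two_summary(n11, n12, n21, n22, z=1.959964):
    """Odds ratio, Wald 95% limits, and likelihood ratio chi-square for a 2 x 2 table.

    Assumes every cell count is positive (see the remark about n_ij = 0 above).
    """
    odds_ratio = (n11 * n22) / (n12 * n21)
    # Standard error of log(OR) from the usual large-sample formula
    se_log_or = math.sqrt(1 / n11 + 1 / n12 + 1 / n21 + 1 / n22)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)

    # Likelihood ratio statistic: G^2 = 2 * sum(observed * log(observed / expected))
    n = n11 + n12 + n21 + n22
    col_totals = (n11 + n21, n12 + n22)
    g2 = 0.0
    for row in ((n11, n12), (n21, n22)):
        row_total = sum(row)
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / n
            g2 += 2 * observed * math.log(observed / expected)

    # Upper-tail probability of a chi-square variable with 1 degree of freedom
    p_value = math.erfc(math.sqrt(g2 / 2))
    return odds_ratio, lower, upper, g2, p_value

# Rows are y = No / Yes, columns are x12 = No / Yes.
odds_ratio, lower, upper, g2, p_value = two_by_two_summary(36, 26, 42, 69)
print(f"OR = {odds_ratio:.4f}, 95% Wald limits = ({lower:.4f}, {upper:.4f})")
print(f"LR chi-square = {g2:.4f}, p = {p_value:.4f}")
```

The results can be compared with the "Case-Control (Odds Ratio)" and "Likelihood Ratio Chi-Square" lines that PROC FREQ reports for the same table.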
2) Select variables for the multivariate analysis. Any variable whose univariate test p-value
< 0.25 should be considered as a candidate for inclusion (Mickey and Greenland, 1989). Use of
0.05 was shown by these two authors to fail often to include variables known to be important.
However, use of 0.25 has the disadvantage of tending to include some variables of questionable
importance. One school of thought advocates inclusion of all “scientifically relevant” variables,
regardless of the results of step (1). This may be a useful starting point for model selection,
using both the results of the univariate analysis and professional judgement about the relevance
of various variables in the pool of possible predictors.
An alternative method for model building, after completing the univariate analysis, is a
somewhat mechanical method using one of the following three procedures: i) Backward
elimination, ii) Forward selection, or iii) Stepwise selection (which combines backward
elimination with forward selection).
There is a major advantage to using one of these “mechanical” procedures at some point in
model development: it saves time and effort. In general, if there is a collection of k possible
explanatory variables, there are 2^k possible models to consider; so if we have 7 possible
explanatory variables, there would be 2^7 = 128 possible models (including the model with no
explanatory variables). It would be useful to have a procedure for model development that did
not require consideration of all possible models, but instead proceeded in a systematic fashion to
consider a relatively short sequence of likely models to achieve the goal of obtaining the single
best model. The three most commonly used methods are: i) Forward selection, ii) Backward
elimination, and iii) Stepwise selection (which combines forward selection and backward elimination,
since the explanatory variables are often correlated with each other). The following discussion
of the three methods is taken from the SAS/STAT User’s Guide.
i)
When SELECTION=FORWARD, PROC LOGISTIC first estimates parameters for
effects forced into the model. These effects are the intercepts and the first n explanatory
effects in the MODEL statement, where n is the number specified by the START= or
INCLUDE= option in the MODEL statement (n is zero by default). Next, the
procedure computes the score chi-square statistic for each effect not in the model and
examines the largest of these statistics. If it is significant at the SLENTRY= level, the
corresponding effect is added to the model. Once an effect is entered in the model, it
is never removed from the model. The process is repeated until none of the remaining
effects meet the specified level for entry or until the STOP= value is reached.
ii)
When SELECTION=BACKWARD, parameters for the complete model as specified
in the MODEL statement are estimated unless the START= option is specified. In
that case, only the parameters for the intercepts and the first n explanatory effects in
the MODEL statement are estimated, where n is the number specified by the
START= option. Results of the Wald test for individual parameters are examined.
The least significant effect that does not meet the SLSTAY= level for staying in the
model is removed. Once an effect is removed from the model, it remains excluded.
The process is repeated until no other effect in the model meets the specified level for
removal or until the STOP= value is reached. Backward selection is often less
successful than forward or stepwise selection because the full model fit in the first
step is the model most likely to result in a complete or quasi-complete separation of
response values as described in the section Existence of Maximum Likelihood
Estimates.
iii)
The SELECTION=STEPWISE option is similar to the SELECTION=FORWARD
option except that effects already in the model do not necessarily remain. Effects are
entered into and removed from the model in such a way that each forward selection
step can be followed by one or more backward elimination steps. The stepwise
selection process terminates if no further effect can be added to the model or if the
current model is identical to a previously visited model.
Often, in doing stepwise logistic regression, we choose SLENTRY = 0.25, to make it easy for a
variable to be entered into the model, and SLSTAY = 0.05, to make it difficult for a variable to
stay in the model. The reason for these different criteria is that the explanatory variables tend to
be correlated. If at one step, variable X1 is removed from the model, and at the next step,
variable X2 is removed, it may be that at a later step, variable X1 will be re-inserted, if the two
variables are actually correlated – the initial presence of the second variable in the model led to
the removal of the first variable; but, with the two variables being correlated, after the second
variable is removed, it may be that the first will then add explanatory power to the model.
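The interplay of the two thresholds can be seen in a sketch of the control flow. The Python code below is a simplified illustration of stepwise selection as described above, not PROC LOGISTIC's actual implementation. The callables entry_p and stay_p are hypothetical stand-ins for the score test (entry) and Wald test (removal). The p-values 0.0995 and 0.1011 are taken from the second SAS run reported later in this handout, 0.0001 stands in for the reported "<.0001", and every other p-value is an arbitrary placeholder of 1.0 because the handout does not report it.

```python
def stepwise_select(candidates, entry_p, stay_p, slentry=0.25, slstay=0.05):
    """Sketch of stepwise selection driven by user-supplied p-value functions.

    entry_p(model, var): p-value for adding var to the current model (hypothetical)
    stay_p(model, var):  p-value for var already in the current model (hypothetical)
    """
    model = []
    visited = {frozenset(model)}
    history = []
    while True:
        remaining = [v for v in candidates if v not in model]
        if not remaining:
            break
        # Forward step: the candidate with the smallest entry p-value
        best = min(remaining, key=lambda v: entry_p(model, v))
        if entry_p(model, best) > slentry:
            break  # no further effect can be added
        model.append(best)
        history.append(("entered", best))
        # Each forward step may be followed by one or more backward steps
        while model:
            worst = max(model, key=lambda v: stay_p(model, v))
            if stay_p(model, worst) <= slstay:
                break
            model.remove(worst)
            history.append(("removed", worst))
        # Stop if this model has been visited before
        key = frozenset(model)
        if key in visited:
            break
        visited.add(key)
    return model, history

# Illustrative p-values (see the assumptions stated above).
ENTRY_P = {(frozenset(), "x3"): 0.0001,
           (frozenset({"x3"}), "x12"): 0.0995}
STAY_P = {(frozenset({"x3"}), "x3"): 0.0001,
          (frozenset({"x3", "x12"}), "x3"): 0.0001,
          (frozenset({"x3", "x12"}), "x12"): 0.1011}

def entry_p(model, var):
    return ENTRY_P.get((frozenset(model), var), 1.0)

def stay_p(model, var):
    return STAY_P.get((frozenset(model), var), 1.0)

candidates = ["x11", "x12", "x13", "x21", "x22", "x3", "x4"]
model, history = stepwise_select(candidates, entry_p, stay_p)
print(model, history)
```

With SLENTRY = 0.25 the variable x12 is allowed in (0.0995 is below 0.25), but with SLSTAY = 0.05 it is removed again (0.1011 exceeds 0.05), which returns the procedure to a model it has already visited and ends the search.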
These “mechanical” selection procedures are sometimes criticized for failing to use scientifically
relevant criteria, sometimes leading to the inclusion of “noise” variables. Ultimately, the
researcher would be advised to use sound scientific judgment in conjunction with the mechanical
methods to select a model.
3) After a full but relatively parsimonious model has been developed by the above methods, we
should look more closely at the variables in the model and consider the need for including
interaction terms in the model. An interaction term, in this case, is a product of two of the
explanatory variables. In addition, for each continuous variable X in the model, we should check
the assumption that the logit is a linear function of X. There are various methods for doing this.
Example: Stepwise Logistic Regression of Y = “Satellite Males?” v. all other variables in the
horseshoe crab data set, using stepwise selection.
For the logistic regression model, the response variable, S = “No. of Satellite Males” was
dichotomized as Y = 1, if there are any satellite males, or Y = 0 if there are no satellite males.
We have several possible predictor variables: X1 = “Color of Female Crab’s Shell”, X2 =
“Spine Condition of Female Crab”, X3 = “Width of Carapace of Female Crab”, and X4 =
“Weight of Female Crab”.
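The coding of the response and of the categorical predictors can be sketched in a few lines. The Python function below mirrors the indicator coding used in the SAS data step of this handout; the function name code_crab and the example record are hypothetical. Any value of x1 other than 1, 2, or 3, and any value of x2 other than 1 or 2, leaves all of the corresponding indicators at zero and so acts as the reference level.

```python
def code_crab(x1, x2, s):
    """Dichotomize the satellite count and build 0/1 indicators for color and spine."""
    return {
        "y": 1 if s > 0 else 0,   # any satellite males?
        "x11": int(x1 == 1),      # light medium
        "x12": int(x1 == 2),      # medium
        "x13": int(x1 == 3),      # dark medium
        "x21": int(x2 == 1),      # both good
        "x22": int(x2 == 2),      # one worn or broken
    }

# Made-up record: color level 2, spine level 3, 8 satellites.
example = code_crab(2, 3, 8)
print(example)
```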
We will first look at the relationships between each explanatory variable and the response
variable, using 2 X 2 contingency tables for categorical explanatory variables, and univariate
logistic regression for continuous explanatory variables. We will then search for a “best”
multiple logistic regression model, using the stepwise approach. Finally, we will consider the fit
of the model to the data, by several different methods.
The logistic regression model will be estimated using SAS PROC LOGISTIC, using stepwise
selection. The SAS program for estimating the model is given below, followed by the output.
The data are listed in the Appendix.
Stepwise Logistic Regression SAS Program:
proc format;
  value difmt 0 = "No"
              1 = "Yes";
run;

data crabs;
  input x1 x2 x3 x4 s;
  /* Dichotomize the satellite count */
  y = 0;
  if s > 0 then y = 1;
  /* Indicator variables for color (x1) */
  x11 = 0;
  if x1 = 1 then x11 = 1;
  x12 = 0;
  if x1 = 2 then x12 = 1;
  x13 = 0;
  if x1 = 3 then x13 = 1;
  /* Indicator variables for spine condition (x2) */
  x21 = 0;
  if x2 = 1 then x21 = 1;
  x22 = 0;
  if x2 = 2 then x22 = 1;
  label
    x1  = "Color"
    x2  = "Spine Condition"
    x3  = "Carapace Width"
    x4  = "Weight"
    x11 = "Light medium?"
    x12 = "Medium?"
    x13 = "Dark medium?"
    x21 = "Both good?"
    x22 = "One worn or broken?"
    y   = "Satellite Males?"
    s   = "No. of Satellite Males";
  format y x11 x12 x13 x21 x22 difmt.;
  cards;
The data are included in the previous handout.
;

/* Step 1: 2 x 2 tables for the categorical predictors */
proc freq;
  tables y*(x11 x12 x13 x21 x22) / norow nocol nopercent chisq;
  exact fisher or;
  title "Relationships Between Satellite Males?";
  title2 "And Each of the (Dichotomous) Categorical";
  title3 "Explanatory Variables";
run;

/* Step 1: univariate logistic regressions for the continuous predictors */
proc logistic;
  model y (order=formatted event='Yes') = x3 / covb;
  title "Logistic regression of Satellite Presence";
  title2 "vs. Carapace Width";
run;

proc logistic;
  model y (order=formatted event='Yes') = x4 / covb;
  title "Logistic regression of Satellite Presence";
  title2 "vs. Weight";
run;

proc corr nosimple;
  var x11 x12 x13 x21 x22 x3 x4 y;
  title "Correlation Matrix for All Variables";
  title2;
  title3;
run;

/* Step 2: stepwise selection with the default entry and stay levels */
proc logistic;
  model y (order=formatted event='Yes') = x11 x12 x13 x21 x22 x3 x4 /
        selection=stepwise covb;
  title "Stepwise Logistic regression of Satellite Presence";
  title2 "vs. Several Explanatory Variables,";
  title3 "Some of which are Categorical";
  title4 "Stepwise Selection Used";
run;
SAS Output for Stepwise Logistic Regression, Stepwise Selection:
Relationships Between Satellite Males?
And Each of the (Dichotomous) Categorical
Explanatory Variables
The FREQ Procedure
Table of y by x11

y(Satellite Males?)     x11(Light medium?)

Frequency|No      |Yes     |  Total
---------+--------+--------+
No       |     59 |      3 |     62
---------+--------+--------+
Yes      |    102 |      9 |    111
---------+--------+--------+
Total         161       12      173

Statistics for Table of y by x11

Statistic                      DF       Value      Prob
-------------------------------------------------------
Chi-Square                      1      0.6587    0.4170
Likelihood Ratio Chi-Square     1      0.6941    0.4048
Continuity Adj. Chi-Square      1      0.2496    0.6174
Mantel-Haenszel Chi-Square      1      0.6549    0.4184
Phi Coefficient                        0.0617
Contingency Coefficient                0.0616
Cramer's V                             0.0617

WARNING: 25% of the cells have expected counts less
         than 5. Chi-Square may not be a valid test.

Fisher's Exact Test
----------------------------------
Cell (1,1) Frequency (F)        59
Left-sided Pr <= F          0.8713
Right-sided Pr >= F         0.3169

Table Probability (P)       0.1882
Two-sided Pr <= P           0.5411

Estimates of the Relative Risk (Row1/Row2)

Type of Study                   Value       95% Confidence Limits
-----------------------------------------------------------------
Case-Control (Odds Ratio)      1.7353        0.4519        6.6630
Cohort (Col1 Risk)             1.0356        0.9571        1.1204
Cohort (Col2 Risk)             0.5968        0.1677        2.1232

Odds Ratio (Case-Control Study)
-----------------------------------
Odds Ratio                   1.7353

Asymptotic Conf Limits
95% Lower Conf Limit         0.4519
95% Upper Conf Limit         6.6630

Exact Conf Limits
95% Lower Conf Limit         0.4106
95% Upper Conf Limit        10.3221

Sample Size = 173
Table of y by x12

y(Satellite Males?)     x12(Medium?)

Frequency|No      |Yes     |  Total
---------+--------+--------+
No       |     36 |     26 |     62
---------+--------+--------+
Yes      |     42 |     69 |    111
---------+--------+--------+
Total          78       95      173

Statistics for Table of y by x12

Statistic                      DF       Value      Prob
-------------------------------------------------------
Chi-Square                      1      6.5734    0.0104
Likelihood Ratio Chi-Square     1      6.5807    0.0103
Continuity Adj. Chi-Square      1      5.7819    0.0162
Mantel-Haenszel Chi-Square      1      6.5354    0.0106
Phi Coefficient                        0.1949
Contingency Coefficient                0.1913
Cramer's V                             0.1949

Fisher's Exact Test
----------------------------------
Cell (1,1) Frequency (F)        36
Left-sided Pr <= F          0.9968
Right-sided Pr >= F         0.0081

Table Probability (P)       0.0049
Two-sided Pr <= P           0.0114

Estimates of the Relative Risk (Row1/Row2)

Type of Study                   Value       95% Confidence Limits
-----------------------------------------------------------------
Case-Control (Odds Ratio)      2.2747        1.2070        4.2869
Cohort (Col1 Risk)             1.5346        1.1157        2.1107
Cohort (Col2 Risk)             0.6746        0.4865        0.9355

Odds Ratio (Case-Control Study)
-----------------------------------
Odds Ratio                   2.2747

Asymptotic Conf Limits
95% Lower Conf Limit         1.2070
95% Upper Conf Limit         4.2869

Exact Conf Limits
95% Lower Conf Limit         1.1512
95% Upper Conf Limit         4.5079

Sample Size = 173
Table of y by x13

y(Satellite Males?)     x13(Dark medium?)

Frequency|No      |Yes     |  Total
---------+--------+--------+
No       |     44 |     18 |     62
---------+--------+--------+
Yes      |     85 |     26 |    111
---------+--------+--------+
Total         129       44      173

Statistics for Table of y by x13

Statistic                      DF       Value      Prob
-------------------------------------------------------
Chi-Square                      1      0.6599    0.4166
Likelihood Ratio Chi-Square     1      0.6520    0.4194
Continuity Adj. Chi-Square      1      0.3973    0.5285
Mantel-Haenszel Chi-Square      1      0.6561    0.4180
Phi Coefficient                       -0.0618
Contingency Coefficient                0.0616
Cramer's V                            -0.0618

Fisher's Exact Test
----------------------------------
Cell (1,1) Frequency (F)        44
Left-sided Pr <= F          0.2627
Right-sided Pr >= F         0.8400

Table Probability (P)       0.1027
Two-sided Pr <= P           0.4680

Estimates of the Relative Risk (Row1/Row2)

Type of Study                   Value       95% Confidence Limits
-----------------------------------------------------------------
Case-Control (Odds Ratio)      0.7477        0.3703        1.5096
Cohort (Col1 Risk)             0.9268        0.7667        1.1202
Cohort (Col2 Risk)             1.2395        0.7410        2.0731

Odds Ratio (Case-Control Study)
-----------------------------------
Odds Ratio                   0.7477

Asymptotic Conf Limits
95% Lower Conf Limit         0.3703
95% Upper Conf Limit         1.5096

Exact Conf Limits
95% Lower Conf Limit         0.3512
95% Upper Conf Limit         1.6182

Sample Size = 173
Table of y by x21

y(Satellite Males?)     x21(Both good?)

Frequency|No      |Yes     |  Total
---------+--------+--------+
No       |     51 |     11 |     62
---------+--------+--------+
Yes      |     85 |     26 |    111
---------+--------+--------+
Total         136       37      173

Statistics for Table of y by x21

Statistic                      DF       Value      Prob
-------------------------------------------------------
Chi-Square                      1      0.7637    0.3822
Likelihood Ratio Chi-Square     1      0.7801    0.3771
Continuity Adj. Chi-Square      1      0.4632    0.4961
Mantel-Haenszel Chi-Square      1      0.7593    0.3835
Phi Coefficient                        0.0664
Contingency Coefficient                0.0663
Cramer's V                             0.0664

Fisher's Exact Test
----------------------------------
Cell (1,1) Frequency (F)        51
Left-sided Pr <= F          0.8575
Right-sided Pr >= F         0.2501

Table Probability (P)       0.1076
Two-sided Pr <= P           0.4427

Estimates of the Relative Risk (Row1/Row2)

Type of Study                   Value       95% Confidence Limits
-----------------------------------------------------------------
Case-Control (Odds Ratio)      1.4182        0.6463        3.1117
Cohort (Col1 Risk)             1.0742        0.9202        1.2540
Cohort (Col2 Risk)             0.7574        0.4023        1.4261

Odds Ratio (Case-Control Study)
-----------------------------------
Odds Ratio                   1.4182

Asymptotic Conf Limits
95% Lower Conf Limit         0.6463
95% Upper Conf Limit         3.1117

Exact Conf Limits
95% Lower Conf Limit         0.6128
95% Upper Conf Limit         3.4558

Sample Size = 173
Table of y by x22

y(Satellite Males?)     x22(One worn or broken?)

Frequency|No      |Yes     |  Total
---------+--------+--------+
No       |     54 |      8 |     62
---------+--------+--------+
Yes      |    104 |      7 |    111
---------+--------+--------+
Total         158       15      173

Statistics for Table of y by x22

Statistic                      DF       Value      Prob
-------------------------------------------------------
Chi-Square                      1      2.1862    0.1393
Likelihood Ratio Chi-Square     1      2.0944    0.1478
Continuity Adj. Chi-Square      1      1.4325    0.2314
Mantel-Haenszel Chi-Square      1      2.1736    0.1404
Phi Coefficient                       -0.1124
Contingency Coefficient                0.1117
Cramer's V                            -0.1124

Fisher's Exact Test
----------------------------------
Cell (1,1) Frequency (F)        54
Left-sided Pr <= F          0.1169
Right-sided Pr >= F         0.9585

Table Probability (P)       0.0754
Two-sided Pr <= P           0.1636

Estimates of the Relative Risk (Row1/Row2)

Type of Study                   Value       95% Confidence Limits
-----------------------------------------------------------------
Case-Control (Odds Ratio)      0.4543        0.1564        1.3197
Cohort (Col1 Risk)             0.9296        0.8350        1.0349
Cohort (Col2 Risk)             2.0461        0.7791        5.3738

Odds Ratio (Case-Control Study)
-----------------------------------
Odds Ratio                   0.4543

Asymptotic Conf Limits
95% Lower Conf Limit         0.1564
95% Upper Conf Limit         6.6630
Logistic regression of Satellite Presence
vs. Carapace Width
The LOGISTIC Procedure
Model Information

Data Set                       WORK.CRABS
Response Variable              y                Satellite Males?
Number of Response Levels      2
Model                          binary logit
Optimization Technique         Fisher's scoring

Number of Observations Read    173
Number of Observations Used    173

Response Profile

Ordered                 Total
  Value    y        Frequency
      1    No              62
      2    Yes            111

Probability modeled is y='Yes'.

Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.

Model Fit Statistics

                              Intercept
              Intercept          and
Criterion          Only      Covariates
AIC             227.759         198.453
SC              230.912         204.759
-2 Log L        225.759         194.453

Testing Global Null Hypothesis: BETA=0

Test                 Chi-Square    DF    Pr > ChiSq
Likelihood Ratio        31.3059     1        <.0001
Score                   27.8752     1        <.0001
Wald                    23.8872     1        <.0001

Analysis of Maximum Likelihood Estimates

                               Standard          Wald
Parameter    DF    Estimate       Error    Chi-Square    Pr > ChiSq
Intercept     1    -12.3508      2.6287       22.0749        <.0001
x3            1      0.4972      0.1017       23.8872        <.0001

Odds Ratio Estimates

             Point          95% Wald
Effect    Estimate      Confidence Limits
x3           1.644       1.347       2.007

Association of Predicted Probabilities and Observed Responses

Percent Concordant     73.5    Somers' D    0.485
Percent Discordant     25.0    Gamma        0.492
Percent Tied            1.5    Tau-a        0.224
Pairs                  6882    c            0.742

Estimated Covariance Matrix

Parameter     Intercept          x3
Intercept      6.910227    -0.26685
x3             -0.26685     0.01035
Logistic regression of Satellite Presence
vs. Weight
The LOGISTIC Procedure
Model Information

Data Set                       WORK.CRABS
Response Variable              y                Satellite Males?
Number of Response Levels      2
Model                          binary logit
Optimization Technique         Fisher's scoring

Number of Observations Read    173
Number of Observations Used    173

Response Profile

Ordered                 Total
  Value    y        Frequency
      1    No              62
      2    Yes            111

Probability modeled is y='Yes'.

Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.

Model Fit Statistics

                              Intercept
              Intercept          and
Criterion          Only      Covariates
AIC             227.759         199.737
SC              230.912         206.044
-2 Log L        225.759         195.737

Testing Global Null Hypothesis: BETA=0

Test                 Chi-Square    DF    Pr > ChiSq
Likelihood Ratio        30.0214     1        <.0001
Score                   25.9353     1        <.0001
Wald                    23.2158     1        <.0001

Analysis of Maximum Likelihood Estimates

                               Standard          Wald
Parameter    DF    Estimate       Error    Chi-Square    Pr > ChiSq
Intercept     1     -3.6933      0.8800       17.6159        <.0001
x4            1      1.8145      0.3766       23.2158        <.0001

Odds Ratio Estimates

             Point          95% Wald
Effect    Estimate      Confidence Limits
x4           6.138       2.934      12.841

Association of Predicted Probabilities and Observed Responses

Percent Concordant     72.7    Somers' D    0.476
Percent Discordant     25.1    Gamma        0.487
Percent Tied            2.2    Tau-a        0.220
Pairs                  6882    c            0.738

Estimated Covariance Matrix

Parameter     Intercept          x4
Intercept      0.774312     -0.3249
x4              -0.3249    0.141823
Correlation Matrix for All Variables

The CORR Procedure

8 Variables:    x11    x12    x13    x21    x22    x3    x4    y

Pearson Correlation Coefficients, N = 173
Prob > |r| under H0: Rho=0

                               x11       x12       x13       x21       x22        x3        x4         y

x11                        1.00000  -0.30130  -0.15944   0.35696   0.07758   0.08670   0.09104   0.06171
Light medium?                         <.0001    0.0361    <.0001    0.3103    0.2567    0.2336    0.4200

x12                       -0.30130   1.00000  -0.64453   0.10432  -0.00978   0.21273   0.19302   0.19493
Medium?                     <.0001              <.0001    0.1720    0.8983    0.0050    0.0109    0.0102

x13                       -0.15944  -0.64453   1.00000  -0.20751   0.00872  -0.15242  -0.14016  -0.06176
Dark medium?                0.0361    <.0001              0.0062    0.9093    0.0453    0.0659    0.4195

x21                        0.35696   0.10432  -0.20751   1.00000  -0.16071   0.20139   0.21927   0.06644
Both good?                  <.0001    0.1720    0.0062              0.0347    0.0079    0.0038    0.3851

x22                        0.07758  -0.00978   0.00872  -0.16071   1.00000  -0.23035  -0.15197  -0.11242
One worn or broken?         0.3103    0.8983    0.9093    0.0347              0.0023    0.0459    0.1409

x3                         0.08670   0.21273  -0.15242   0.20139  -0.23035   1.00000   0.88689   0.40141
Carapace Width              0.2567    0.0050    0.0453    0.0079    0.0023              <.0001    <.0001

x4                         0.09104   0.19302  -0.14016   0.21927  -0.15197   0.88689   1.00000   0.38719
Weight                      0.2336    0.0109    0.0659    0.0038    0.0459    <.0001              <.0001

y                          0.06171   0.19493  -0.06176   0.06644  -0.11242   0.40141   0.38719   1.00000
Satellite Males?            0.4200    0.0102    0.4195    0.3851    0.1409    <.0001    <.0001
Stepwise Logistic regression of Satellite Presence
vs. Several Explanatory Variables,
Some of which are Categorical
Stepwise Selection Used
The LOGISTIC Procedure
Model Information

Data Set                       WORK.CRABS
Response Variable              y                Satellite Males?
Number of Response Levels      2
Model                          binary logit
Optimization Technique         Fisher's scoring

Number of Observations Read    173
Number of Observations Used    173

Response Profile

Ordered                 Total
  Value    y        Frequency
      1    No              62
      2    Yes            111

Probability modeled is y='Yes'.

Stepwise Selection Procedure

Step 0. Intercept entered:

Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.

-2 Log L = 225.759

Residual Chi-Square Test

Chi-Square    DF    Pr > ChiSq
   36.3085     7        <.0001

Step 1. Effect x3 entered:

Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.

Model Fit Statistics

                              Intercept
              Intercept          and
Criterion          Only      Covariates
AIC             227.759         198.453
SC              230.912         204.759
-2 Log L        225.759         194.453

Testing Global Null Hypothesis: BETA=0

Test                 Chi-Square    DF    Pr > ChiSq
Likelihood Ratio        31.3059     1        <.0001
Score                   27.8752     1        <.0001
Wald                    23.8872     1        <.0001

Residual Chi-Square Test

Chi-Square    DF    Pr > ChiSq
    9.3673     6        0.1540

NOTE: No effects for the model in Step 1 are removed.
NOTE: No (additional) effects met the 0.05 significance level for entry into the model.

Summary of Stepwise Selection

             Effect                  Number         Score          Wald                  Variable
Step    Entered    Removed    DF         In    Chi-Square    Chi-Square    Pr > ChiSq    Label
   1    x3                     1          1       27.8752                      <.0001    Carapace Width

Analysis of Maximum Likelihood Estimates

                               Standard          Wald
Parameter    DF    Estimate       Error    Chi-Square    Pr > ChiSq
Intercept     1    -12.3508      2.6287       22.0749        <.0001
x3            1      0.4972      0.1017       23.8872        <.0001

Odds Ratio Estimates

             Point          95% Wald
Effect    Estimate      Confidence Limits
x3           1.644       1.347       2.007

Association of Predicted Probabilities and Observed Responses

Percent Concordant     73.5    Somers' D    0.485
Percent Discordant     25.0    Gamma        0.492
Percent Tied            1.5    Tau-a        0.224
Pairs                  6882    c            0.742

Estimated Covariance Matrix

Parameter     Intercept          x3
Intercept      6.910227    -0.26685
x3             -0.26685     0.01035
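Several of the quantities reported for the selected model can be cross-checked by hand from the other numbers in the output. The Python sketch below is an illustration rather than part of the original analysis. It uses only figures reported by PROC LOGISTIC above, so small discrepancies from rounding are expected; the width value of 25 passed to predicted_probability is an arbitrary illustrative input, not a value taken from the handout.

```python
import math

n = 173                       # observations used
neg2logl_null = 225.759       # -2 Log L, intercept only
neg2logl_fit = 194.453        # -2 Log L, intercept and x3
n_params = 2                  # intercept and x3
intercept = -12.3508          # reported intercept estimate
beta, se = 0.4972, 0.1017     # reported x3 estimate and standard error

# Likelihood ratio statistic: drop in -2 Log L from the null model
lr_chisq = neg2logl_null - neg2logl_fit

# Information criteria for the fitted model
aic = neg2logl_fit + 2 * n_params
sc = neg2logl_fit + n_params * math.log(n)

# Wald statistic, odds ratio, and 95% Wald limits for x3
wald = (beta / se) ** 2
odds_ratio = math.exp(beta)
z = 1.959964
ci_lower = math.exp(beta - z * se)
ci_upper = math.exp(beta + z * se)

def predicted_probability(width):
    """Estimated probability of at least one satellite at a given carapace width."""
    return 1 / (1 + math.exp(-(intercept + beta * width)))

print(f"LR = {lr_chisq:.3f}, AIC = {aic:.3f}, SC = {sc:.3f}, Wald = {wald:.2f}")
print(f"OR = {odds_ratio:.3f}, limits = ({ci_lower:.3f}, {ci_upper:.3f})")
print(f"P(Y = 1 | width = 25) = {predicted_probability(25):.3f}")
```

The odds ratio here is per one-unit increase in carapace width, which is how the "Odds Ratio Estimates" table should be read.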
Since there is only one variable in the final model, we need not look at interaction terms.
However, for the sake of completeness, we will look at a stepwise regression for the data, with
all interaction terms also included in the model, to see whether any interaction terms survive.
In this run, we will use SLENTRY=0.25 and SLSTAY=0.05. The SAS program is shown
below.
proc format;
  value difmt 0 = "No"
              1 = "Yes";
run;

data crabs;
  input x1 x2 x3 x4 s;
  /* Dichotomize the satellite count */
  y = 0;
  if s > 0 then y = 1;
  /* Indicator variables for color (x1) and spine condition (x2) */
  x11 = 0;
  if x1 = 1 then x11 = 1;
  x12 = 0;
  if x1 = 2 then x12 = 1;
  x13 = 0;
  if x1 = 3 then x13 = 1;
  x21 = 0;
  if x2 = 1 then x21 = 1;
  x22 = 0;
  if x2 = 2 then x22 = 1;
  /* Two-way interaction terms: color by spine condition */
  x11x21 = x11*x21;
  x12x21 = x12*x21;
  x13x21 = x13*x21;
  x11x22 = x11*x22;
  x12x22 = x12*x22;
  x13x22 = x13*x22;
  /* Categorical by continuous interactions */
  x11x3 = x11*x3;
  x11x4 = x11*x4;
  x12x3 = x12*x3;
  x12x4 = x12*x4;
  x13x3 = x13*x3;
  x13x4 = x13*x4;
  x21x3 = x21*x3;
  x21x4 = x21*x4;
  x22x3 = x22*x3;
  x22x4 = x22*x4;
  /* Width by weight interaction */
  x3x4 = x3*x4;
  label
    x1  = "Color"
    x2  = "Spine Condition"
    x3  = "Carapace Width"
    x4  = "Weight"
    x11 = "Light medium?"
    x12 = "Medium?"
    x13 = "Dark medium?"
    x21 = "Both good?"
    x22 = "One worn or broken?"
    y   = "Satellite Males?"
    s   = "No. of Satellite Males";
  format y x11 x12 x13 x21 x22 difmt.;
  cards;
The data set is included in the first handout.
;

proc logistic;
  model y (order=formatted event='Yes') =
        x11x21 x12x21 x13x21 x11x22 x12x22 x13x22
        x11x3 x11x4 x12x3 x12x4 x13x3 x13x4
        x21x3 x21x4 x22x3 x22x4 x3x4
        x11 x12 x13 x21 x22 x3 x4
        / selection=stepwise slentry=0.25 slstay=0.05 covb;
  title "Stepwise Logistic regression of Satellite Presence";
  title2 "vs. Several Explanatory Variables,";
  title3 "Some of which are Categorical";
  title4 "Stepwise Selection Used, and Interactions Included";
run;
SAS PROC LOGISTIC Output:
Stepwise Logistic regression of Satellite Presence
vs. Several Explanatory Variables,
Some of which are Categorical
Stepwise Selection Used, and Interactions Included
The LOGISTIC Procedure
Model Information

Data Set                       WORK.CRABS
Response Variable              y                Satellite Males?
Number of Response Levels      2
Model                          binary logit
Optimization Technique         Fisher's scoring

Number of Observations Read    173
Number of Observations Used    173

Response Profile

Ordered                 Total
  Value    y        Frequency
      1    No              62
      2    Yes            111

Probability modeled is y='Yes'.

Stepwise Selection Procedure

Step 0. Intercept entered:

Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.

-2 Log L = 225.759

Residual Chi-Square Test

Chi-Square    DF    Pr > ChiSq
   52.5969    24        0.0007

Step 1. Effect x3 entered:

Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.

Model Fit Statistics

                              Intercept
              Intercept          and
Criterion          Only      Covariates
AIC             227.759         198.453
SC              230.912         204.759
-2 Log L        225.759         194.453

Testing Global Null Hypothesis: BETA=0

Test                 Chi-Square    DF    Pr > ChiSq
Likelihood Ratio        31.3059     1        <.0001
Score                   27.8752     1        <.0001
Wald                    23.8872     1        <.0001

Residual Chi-Square Test

Chi-Square    DF    Pr > ChiSq
   29.0098    23        0.1800

NOTE: No effects for the model in Step 1 are removed.

Step 2. Effect x12 entered:

Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.

Model Fit Statistics

                              Intercept
              Intercept          and
Criterion          Only      Covariates
AIC             227.759         197.757
SC              230.912         207.217
-2 Log L        225.759         191.757

Testing Global Null Hypothesis: BETA=0

Test                 Chi-Square    DF    Pr > ChiSq
Likelihood Ratio        34.0014     2        <.0001
Score                   30.0492     2        <.0001
Wald                    25.2770     2        <.0001

Residual Chi-Square Test

Chi-Square    DF    Pr > ChiSq
   24.7797    22        0.3077

Step 3. Effect x12 is removed:

Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.

Model Fit Statistics

                              Intercept
              Intercept          and
Criterion          Only      Covariates
AIC             227.759         198.453
SC              230.912         204.759
-2 Log L        225.759         194.453

Testing Global Null Hypothesis: BETA=0

Test                 Chi-Square    DF    Pr > ChiSq
Likelihood Ratio        31.3059     1        <.0001
Score                   27.8752     1        <.0001
Wald                    23.8872     1        <.0001

Residual Chi-Square Test

Chi-Square    DF    Pr > ChiSq
   29.0098    23        0.1800

NOTE: No effects for the model in Step 3 are removed.
NOTE: Model building terminates because the last effect entered is removed by the Wald
      statistic criterion.

Summary of Stepwise Selection

             Effect                  Number         Score          Wald                  Variable
Step    Entered    Removed    DF         In    Chi-Square    Chi-Square    Pr > ChiSq    Label
   1    x3                     1          1       27.8752                      <.0001    Carapace Width
   2    x12                    1          2        2.7133                      0.0995    Medium?
   3               x12         1          1                      2.6873        0.1011    Medium?

Analysis of Maximum Likelihood Estimates

                               Standard          Wald
Parameter    DF    Estimate       Error    Chi-Square    Pr > ChiSq
Intercept     1    -12.3508      2.6287       22.0749        <.0001
x3            1      0.4972      0.1017       23.8872        <.0001

Odds Ratio Estimates

             Point          95% Wald
Effect    Estimate      Confidence Limits
x3           1.644       1.347       2.007

Association of Predicted Probabilities and Observed Responses

Percent Concordant     73.5    Somers' D    0.485
Percent Discordant     25.0    Gamma        0.492
Percent Tied            1.5    Tau-a        0.224
Pairs                  6882    c            0.742

Estimated Covariance Matrix

Parameter     Intercept          x3
Intercept      6.910227    -0.26685
x3             -0.26685     0.01035
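The entry and removal of x12 in this run can be traced back to the chi-square statistics in the selection summary. The Python sketch below is illustrative only: it converts the reported one-degree-of-freedom score and Wald statistics into p-values and compares them with the SLENTRY and SLSTAY levels used in the run. The helper name chisq1_pvalue is hypothetical, and the likelihood ratio comparison at the end is an additional check that does not appear in the SAS output.

```python
import math

def chisq1_pvalue(stat):
    """Upper-tail p-value for a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))

SLENTRY, SLSTAY = 0.25, 0.05

p_enter = chisq1_pvalue(2.7133)   # score statistic for x12 at Step 2
p_stay = chisq1_pvalue(2.6873)    # Wald statistic for x12 at Step 3

entered = p_enter < SLENTRY       # x12 is allowed into the model
retained = p_stay < SLSTAY        # x12 does not meet the level for staying

# Extra check: drop in -2 Log L when x12 is added to the model containing x3
lr_x12 = 194.453 - 191.757
p_lr = chisq1_pvalue(lr_x12)

print(f"entry p = {p_enter:.4f} (entered: {entered})")
print(f"stay p  = {p_stay:.4f} (retained: {retained})")
print(f"LR chi-square for x12 = {lr_x12:.3f}, p = {p_lr:.4f}")
```

All three tests tell the same story: x12 adds too little beyond carapace width to stay in the model at the 0.05 level.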
1) Hosmer, D. W. and Lemeshow, S. (1989). Applied Logistic Regression, John Wiley &
Sons, New York.
2) Mickey, J. and Greenland, S. (1989). “A study of the impact of confounder-selection
criteria on effect estimation,” American Journal of Epidemiology, 129, 125-137.