Logistic Regression Model Development, Including Stepwise Logistic Regression, With Example

Chapter 4, p. 126 – Mating Behavior Among Horseshoe Crabs

When there are many possible explanatory variables, choosing a “best” model can become tedious, so it helps to have a systematic procedure for searching for that model. The stepwise logistic regression procedure considers a number of possible multiple regression models, and selects subsets of parameters to test for possible addition to the model or elimination from the model.

The following step-by-step procedure develops a parsimonious logistic regression model for explaining the variation in the dichotomous response variable Y in terms of a subset of a large pool of possible explanatory variables, some categorical and some continuous. The procedure is explained in more detail in Hosmer and Lemeshow (1).

1) First, we look at the relationship between Y and each explanatory variable X by itself, using a 2 × I contingency table if X is categorical, or a univariate logistic regression model if X is continuous. We use the likelihood ratio test statistic to test the significance of this relationship. For a categorical variable X exhibiting at least a moderate level of association, we estimate individual odds ratios (along with confidence limits), using one level of X as a reference level. If the contingency table has a cell with n_ij = 0, we should consider one of two options: a) collapse some categories of X to remove the zero frequencies, or b) if X is ordinal, model it as a continuous variable. If X is continuous, we estimate a univariate logistic regression model and test for significance. (We could also use a t-test for this purpose.)

2) Select variables for the multivariate analysis. Any variable whose univariate test p-value < 0.25 should be considered as a candidate for inclusion (Mickey and Greenland, 1989).
Mickey and Greenland showed that a cutoff of 0.05 often fails to include variables known to be important. However, a cutoff of 0.25 has the disadvantage of tending to include some variables of questionable importance. One school of thought advocates inclusion of all “scientifically relevant” variables, regardless of the results of step (1). This may be a useful starting point for model selection, using both the results of the univariate analysis and professional judgement about the relevance of the variables in the pool of possible predictors.

An alternative method for model building, after completing the univariate analysis, is a somewhat mechanical one based on backward elimination, forward selection, or stepwise selection. The major advantage of using one of these “mechanical” procedures at some point in the model development is that it saves time and effort. In general, if there is a collection of k possible explanatory variables, there are 2^k possible models to consider; with 7 possible explanatory variables, there would be 2^7 = 128 possible models (including the model with no explanatory variables). It would be useful to have a procedure for model development that did not require consideration of all possible models, but instead proceeded in a systematic fashion through a relatively short sequence of likely models to arrive at a single best model.

The three most commonly used methods are: i) forward selection, ii) backward elimination, and iii) stepwise selection (which combines forward selection and backward elimination, since the explanatory variables are often correlated with each other). The following discussion of the three methods is taken from the SAS/STAT User’s Guide.

i) When SELECTION=FORWARD, PROC LOGISTIC first estimates parameters for effects forced into the model.
These effects are the intercepts and the first n explanatory effects in the MODEL statement, where n is the number specified by the START= or INCLUDE= option in the MODEL statement (n is zero by default). Next, the procedure computes the score chi-square statistic for each effect not in the model and examines the largest of these statistics. If it is significant at the SLENTRY= level, the corresponding effect is added to the model. Once an effect is entered in the model, it is never removed from the model. The process is repeated until none of the remaining effects meet the specified level for entry or until the STOP= value is reached.

ii) When SELECTION=BACKWARD, parameters for the complete model as specified in the MODEL statement are estimated unless the START= option is specified. In that case, only the parameters for the intercepts and the first n explanatory effects in the MODEL statement are estimated, where n is the number specified by the START= option. Results of the Wald test for individual parameters are examined. The least significant effect that does not meet the SLSTAY= level for staying in the model is removed. Once an effect is removed from the model, it remains excluded. The process is repeated until no other effect in the model meets the specified level for removal or until the STOP= value is reached. Backward selection is often less successful than forward or stepwise selection because the full model fit in the first step is the model most likely to result in a complete or quasi-complete separation of response values, as described in the section “Existence of Maximum Likelihood Estimates” of the User’s Guide.

iii) The SELECTION=STEPWISE option is similar to the SELECTION=FORWARD option except that effects already in the model do not necessarily remain. Effects are entered into and removed from the model in such a way that each forward selection step can be followed by one or more backward elimination steps.
The stepwise selection process terminates if no further effect can be added to the model or if the current model is identical to a previously visited model.

Often, in doing stepwise logistic regression, we choose SLENTRY = 0.25, to make it easy for a variable to enter the model, and SLSTAY = 0.05, to make it difficult for a variable to stay in the model. The reason for these different criteria is that the explanatory variables tend to be correlated. If at one step variable X1 is removed from the model, and at the next step variable X2 is removed, then X1 may be re-inserted at a later step if the two variables are correlated: the initial presence of X2 in the model led to the removal of X1, but once X2 has been removed, X1 may again add explanatory power to the model.

These “mechanical” selection procedures are sometimes criticized for failing to use scientifically relevant criteria, which can lead to the inclusion of “noise” variables. Ultimately, the researcher is advised to use sound scientific judgment in conjunction with the mechanical methods to select a model.

3) After a full but relatively parsimonious model has been developed by the above methods, we should look more closely at the variables in the model and consider the need for including interaction terms. An interaction term, in this case, is a product of two of the explanatory variables. In addition, for each continuous variable X in the model, we should check the assumption that the logit is a linear function of X. There are various methods for doing this.

Example: Stepwise Logistic Regression of Y = “Satellite Males?” v. all other variables in the horseshoe crab data set, using stepwise selection.

For the logistic regression model, the response variable, S = “No.
of Satellite Males”, was dichotomized as Y = 1 if there are any satellite males, or Y = 0 if there are none. We have several possible predictor variables: X1 = “Color of Female Crab’s Shell”, X2 = “Spine Condition of Female Crab”, X3 = “Width of Carapace of Female Crab”, and X4 = “Weight of Female Crab”. (This numbering follows the variable labels in the SAS program below.)

We will first look at the relationship between each explanatory variable and the response variable, using 2 × 2 contingency tables for the categorical explanatory variables (each coded as a set of dichotomous indicators) and univariate logistic regression for the continuous explanatory variables. We will then search for a “best” multiple logistic regression model, using the stepwise approach. Finally, we will consider the fit of the model to the data, by several different methods.

The logistic regression model will be estimated using SAS PROC LOGISTIC, using stepwise selection. The SAS program for estimating the model is given below, followed by the output. The data are listed in the Appendix.

Stepwise Logistic Regression SAS Program:

/* Display format for the 0/1 indicator variables */
proc format;
   value difmt 0 = "No"
               1 = "Yes";
run;

/* Dichotomize the response and build indicator variables for color and spine condition */
data crabs;
   input x1 x2 x3 x4 s;
   y = 0;   if s > 0 then y = 1;
   x11 = 0; if x1 = 1 then x11 = 1;
   x12 = 0; if x1 = 2 then x12 = 1;
   x13 = 0; if x1 = 3 then x13 = 1;
   x21 = 0; if x2 = 1 then x21 = 1;
   x22 = 0; if x2 = 2 then x22 = 1;
   label x1  = "Color"
         x2  = "Spine Condition"
         x3  = "Carapace Width"
         x4  = "Weight"
         x11 = "Light medium?"
         x12 = "Medium?"
         x13 = "Dark medium?"
         x21 = "Both good?"
         x22 = "One worn or broken?"
         y   = "Satellite Males?"
         s   = "No. of Satellite Males";
   format y x11 x12 x13 x21 x22 difmt.;
cards;
The data are included in the previous handout.
;

/* 2 x 2 tables of the response against each indicator variable */
proc freq;
   tables y*(x11 x12 x13 x21 x22) / norow nocol nopercent chisq;
   exact fisher or;
   title "Relationships Between Satellite Males?";
   title2 "And Each of the (Dichotomous) Categorical";
   title3 "Explanatory Variables";
run;

/* Univariate logistic regressions on the two continuous variables */
proc logistic;
   model y (order=formatted event='Yes') = x3 / covb;
   title "Logistic regression of Satellite Presence";
   title2 "vs. Carapace Width";
run;

proc logistic;
   model y (order=formatted event='Yes') = x4 / covb;
   title "Logistic regression of Satellite Presence";
   title2 "vs. Weight";
run;

proc corr nosimple;
   var x11 x12 x13 x21 x22 x3 x4 y;
   title "Correlation Matrix for All Variables";
   title2;
   title3;
run;

/* SELECTION=STEPWISE is what is requested here, with the default entry and stay
   levels, even though title4 reads "Backward Selection Used" */
proc logistic;
   model y (order=formatted event='Yes') = x11 x12 x13 x21 x22 x3 x4
         / selection=stepwise covb;
   title "Stepwise Logistic regression of Satellite Presence";
   title2 "vs. Several Explanatory Variables,";
   title3 "Somc of which are Categorical";
   title4 "Backward Selection Used";
run;

SAS Output for Stepwise Logistic Regression, Stepwise Selection:

Relationships Between Satellite Males?
And Each of the (Dichotomous) Categorical
Explanatory Variables

The FREQ Procedure

Table of y by x11

y(Satellite Males?)    x11(Light medium?)

Frequency |     No |    Yes |  Total
----------+--------+--------+-------
No        |     59 |      3 |     62
Yes       |    102 |      9 |    111
----------+--------+--------+-------
Total     |    161 |     12 |    173

Statistics for Table of y by x11

Statistic                       DF      Value      Prob
-------------------------------------------------------
Chi-Square                       1     0.6587    0.4170
Likelihood Ratio Chi-Square      1     0.6941    0.4048
Continuity Adj. Chi-Square       1     0.2496    0.6174
Mantel-Haenszel Chi-Square       1     0.6549    0.4184
Phi Coefficient                        0.0617
Contingency Coefficient                0.0616
Cramer's V                             0.0617

WARNING: 25% of the cells have expected counts less than 5.
         Chi-Square may not be a valid test.

Fisher's Exact Test
----------------------------------
Cell (1,1) Frequency (F)        59
Left-sided Pr <= F          0.8713
Right-sided Pr >= F         0.3169
Table Probability (P)       0.1882
Two-sided Pr <= P           0.5411

Relationships Between Satellite Males?
And Each of the (Dichotomous) Categorical
Explanatory Variables

The FREQ Procedure

Statistics for Table of y by x11

Estimates of the Relative Risk (Row1/Row2)

Type of Study                   Value     95% Confidence Limits
---------------------------------------------------------------
Case-Control (Odds Ratio)      1.7353      0.4519      6.6630
Cohort (Col1 Risk)             1.0356      0.9571      1.1204
Cohort (Col2 Risk)             0.5968      0.1677      2.1232

Odds Ratio (Case-Control Study)
-----------------------------------
Odds Ratio                   1.7353
Asymptotic Conf Limits
  95% Lower Conf Limit       0.4519
  95% Upper Conf Limit       6.6630
Exact Conf Limits
  95% Lower Conf Limit       0.4106
  95% Upper Conf Limit      10.3221

Sample Size = 173

Table of y by x12

y(Satellite Males?)    x12(Medium?)

Frequency |     No |    Yes |  Total
----------+--------+--------+-------
No        |     36 |     26 |     62
Yes       |     42 |     69 |    111
----------+--------+--------+-------
Total     |     78 |     95 |    173

Statistics for Table of y by x12

Statistic                       DF      Value      Prob
-------------------------------------------------------
Chi-Square                       1     6.5734    0.0104
Likelihood Ratio Chi-Square      1     6.5807    0.0103
Continuity Adj. Chi-Square       1     5.7819    0.0162
Mantel-Haenszel Chi-Square       1     6.5354    0.0106
Phi Coefficient                        0.1949
Contingency Coefficient                0.1913
Cramer's V                             0.1949

Fisher's Exact Test
----------------------------------
Cell (1,1) Frequency (F)        36
Left-sided Pr <= F          0.9968
Right-sided Pr >= F         0.0081
Table Probability (P)       0.0049
Two-sided Pr <= P           0.0114

Estimates of the Relative Risk (Row1/Row2)

Type of Study                   Value     95% Confidence Limits
---------------------------------------------------------------
Case-Control (Odds Ratio)      2.2747      1.2070      4.2869
Cohort (Col1 Risk)             1.5346      1.1157      2.1107
Cohort (Col2 Risk)             0.6746      0.4865      0.9355

Odds Ratio (Case-Control Study)
-----------------------------------
Odds Ratio                   2.2747
Asymptotic Conf Limits
  95% Lower Conf Limit       1.2070
  95% Upper Conf Limit       4.2869
Exact Conf Limits
  95% Lower Conf Limit       1.1512
  95% Upper Conf Limit       4.5079

Sample Size = 173

Table of y by x13

y(Satellite Males?)    x13(Dark medium?)

Frequency |     No |    Yes |  Total
----------+--------+--------+-------
No        |     44 |     18 |     62
Yes       |     85 |     26 |    111
----------+--------+--------+-------
Total     |    129 |     44 |    173

Statistics for Table of y by x13

Statistic                       DF      Value      Prob
-------------------------------------------------------
Chi-Square                       1     0.6599    0.4166
Likelihood Ratio Chi-Square      1     0.6520    0.4194
Continuity Adj. Chi-Square       1     0.3973    0.5285
Mantel-Haenszel Chi-Square       1     0.6561    0.4180
Phi Coefficient                       -0.0618
Contingency Coefficient                0.0616
Cramer's V                            -0.0618

Fisher's Exact Test
----------------------------------
Cell (1,1) Frequency (F)        44
Left-sided Pr <= F          0.2627
Right-sided Pr >= F         0.8400
Table Probability (P)       0.1027
Two-sided Pr <= P           0.4680

Relationships Between Satellite Males?
And Each of the (Dichotomous) Categorical
Explanatory Variables

The FREQ Procedure

Statistics for Table of y by x13

Estimates of the Relative Risk (Row1/Row2)

Type of Study                   Value     95% Confidence Limits
---------------------------------------------------------------
Case-Control (Odds Ratio)      0.7477      0.3703      1.5096
Cohort (Col1 Risk)             0.9268      0.7667      1.1202
Cohort (Col2 Risk)             1.2395      0.7410      2.0731

Odds Ratio (Case-Control Study)
-----------------------------------
Odds Ratio                   0.7477
Asymptotic Conf Limits
  95% Lower Conf Limit       0.3703
  95% Upper Conf Limit       1.5096
Exact Conf Limits
  95% Lower Conf Limit       0.3512
  95% Upper Conf Limit       1.6182

Sample Size = 173

Table of y by x21

y(Satellite Males?)    x21(Both good?)

Frequency |     No |    Yes |  Total
----------+--------+--------+-------
No        |     51 |     11 |     62
Yes       |     85 |     26 |    111
----------+--------+--------+-------
Total     |    136 |     37 |    173

Statistics for Table of y by x21

Statistic                       DF      Value      Prob
-------------------------------------------------------
Chi-Square                       1     0.7637    0.3822
Likelihood Ratio Chi-Square      1     0.7801    0.3771
Continuity Adj. Chi-Square       1     0.4632    0.4961
Mantel-Haenszel Chi-Square       1     0.7593    0.3835
Phi Coefficient                        0.0664
Contingency Coefficient                0.0663
Cramer's V                             0.0664

Fisher's Exact Test
----------------------------------
Cell (1,1) Frequency (F)        51
Left-sided Pr <= F          0.8575
Right-sided Pr >= F         0.2501
Table Probability (P)       0.1076
Two-sided Pr <= P           0.4427

Estimates of the Relative Risk (Row1/Row2)

Type of Study                   Value     95% Confidence Limits
---------------------------------------------------------------
Case-Control (Odds Ratio)      1.4182      0.6463      3.1117
Cohort (Col1 Risk)             1.0742      0.9202      1.2540
Cohort (Col2 Risk)             0.7574      0.4023      1.4261

Odds Ratio (Case-Control Study)
-----------------------------------
Odds Ratio                   1.4182
Asymptotic Conf Limits
  95% Lower Conf Limit       0.6463
  95% Upper Conf Limit       3.1117
Exact Conf Limits
  95% Lower Conf Limit       0.6128
  95% Upper Conf Limit       3.4558

Sample Size = 173

Table of y by x22

y(Satellite Males?)    x22(One worn or broken?)

Frequency |     No |    Yes |  Total
----------+--------+--------+-------
No        |     54 |      8 |     62
Yes       |    104 |      7 |    111
----------+--------+--------+-------
Total     |    158 |     15 |    173

Statistics for Table of y by x22

Statistic                       DF      Value      Prob
-------------------------------------------------------
Chi-Square                       1     2.1862    0.1393
Likelihood Ratio Chi-Square      1     2.0944    0.1478
Continuity Adj. Chi-Square       1     1.4325    0.2314
Mantel-Haenszel Chi-Square       1     2.1736    0.1404
Phi Coefficient                       -0.1124
Contingency Coefficient                0.1117
Cramer's V                            -0.1124

Fisher's Exact Test
----------------------------------
Cell (1,1) Frequency (F)        54
Left-sided Pr <= F          0.1169
Right-sided Pr >= F         0.9585
Table Probability (P)       0.0754
Two-sided Pr <= P           0.1636

Relationships Between Satellite Males?
And Each of the (Dichotomous) Categorical
Explanatory Variables

The FREQ Procedure

Statistics for Table of y by x22

Estimates of the Relative Risk (Row1/Row2)

Type of Study                   Value     95% Confidence Limits
---------------------------------------------------------------
Case-Control (Odds Ratio)      0.4543      0.1564      1.3197
Cohort (Col1 Risk)             0.9296      0.8350      1.0349
Cohort (Col2 Risk)             2.0461      0.7791      5.3738

Odds Ratio (Case-Control Study)
-----------------------------------
Odds Ratio                   0.4543
Asymptotic Conf Limits
  95% Lower Conf Limit       0.1564
  95% Upper Conf Limit       1.3197
Exact Conf Limits
  95% Lower Conf Limit       0.1331
  95% Upper Conf Limit       1.5264

Sample Size = 173

Logistic regression of Satellite Presence
vs. Carapace Width

The LOGISTIC Procedure

Model Information

Data Set                       WORK.CRABS
Response Variable              y             Satellite Males?
Number of Response Levels      2
Model                          binary logit
Optimization Technique         Fisher's scoring

Number of Observations Read    173
Number of Observations Used    173

Response Profile

Ordered              Total
  Value    y     Frequency
      1    No           62
      2    Yes         111

Probability modeled is y='Yes'.

Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.

Model Fit Statistics

                Intercept    Intercept and
Criterion            Only       Covariates
AIC               227.759          198.453
SC                230.912          204.759
-2 Log L          225.759          194.453

Testing Global Null Hypothesis: BETA=0

Test                 Chi-Square    DF    Pr > ChiSq
Likelihood Ratio        31.3059     1        <.0001
Score                   27.8752     1        <.0001
Wald                    23.8872     1        <.0001

Analysis of Maximum Likelihood Estimates

                            Standard          Wald
Parameter    DF  Estimate      Error    Chi-Square    Pr > ChiSq
Intercept     1  -12.3508     2.6287       22.0749        <.0001
x3            1    0.4972     0.1017       23.8872        <.0001

Odds Ratio Estimates

             Point        95% Wald
Effect    Estimate    Confidence Limits
x3           1.644      1.347     2.007

Association of Predicted Probabilities and Observed Responses

Percent Concordant    73.5    Somers' D    0.485
Percent Discordant    25.0    Gamma        0.492
Percent Tied           1.5    Tau-a        0.224
Pairs                 6882    c            0.742

Estimated Covariance Matrix

Parameter    Intercept          x3
Intercept     6.910227    -0.26685
x3            -0.26685     0.01035

Logistic regression of Satellite Presence
vs. Weight

The LOGISTIC Procedure

Model Information

Data Set                       WORK.CRABS
Response Variable              y             Satellite Males?
Number of Response Levels      2
Model                          binary logit
Optimization Technique         Fisher's scoring

Number of Observations Read    173
Number of Observations Used    173

Response Profile

Ordered              Total
  Value    y     Frequency
      1    No           62
      2    Yes         111

Probability modeled is y='Yes'.

Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.

Model Fit Statistics

                Intercept    Intercept and
Criterion            Only       Covariates
AIC               227.759          199.737
SC                230.912          206.044
-2 Log L          225.759          195.737

Testing Global Null Hypothesis: BETA=0

Test                 Chi-Square    DF    Pr > ChiSq
Likelihood Ratio        30.0214     1        <.0001
Score                   25.9353     1        <.0001
Wald                    23.2158     1        <.0001

Analysis of Maximum Likelihood Estimates

                            Standard          Wald
Parameter    DF  Estimate      Error    Chi-Square    Pr > ChiSq
Intercept     1   -3.6933     0.8800       17.6159        <.0001
x4            1    1.8145     0.3766       23.2158        <.0001

Odds Ratio Estimates

             Point        95% Wald
Effect    Estimate    Confidence Limits
x4           6.138      2.934    12.841

Association of Predicted Probabilities and Observed Responses

Percent Concordant    72.7    Somers' D    0.476
Percent Discordant    25.1    Gamma        0.487
Percent Tied           2.2    Tau-a        0.220
Pairs                 6882    c            0.738

Estimated Covariance Matrix

Parameter    Intercept          x4
Intercept     0.774312     -0.3249
x4             -0.3249    0.141823

Correlation Matrix for All Variables

The CORR Procedure

8 Variables:  x11  x12  x13  x21  x22  x3  x4  y

Variable labels: x11 = Light medium?, x12 = Medium?, x13 = Dark medium?, x21 = Both good?, x22 = One worn or broken?, x3 = Carapace Width, x4 = Weight, y = Satellite Males?

Pearson Correlation Coefficients, N = 173
Prob > |r| under H0: Rho=0
(first line of each pair: r; second line: p-value)

           x11       x12       x13       x21       x22        x3        x4         y
x11    1.00000  -0.30130  -0.15944   0.35696   0.07758   0.08670   0.09104   0.06171
                  <.0001    0.0361    <.0001    0.3103    0.2567    0.2336    0.4200
x12   -0.30130   1.00000  -0.64453   0.10432  -0.00978   0.21273   0.19302   0.19493
        <.0001              <.0001    0.1720    0.8983    0.0050    0.0109    0.0102
x13   -0.15944  -0.64453   1.00000  -0.20751   0.00872  -0.15242  -0.14016  -0.06176
        0.0361    <.0001              0.0062    0.9093    0.0453    0.0659    0.4195
x21    0.35696   0.10432  -0.20751   1.00000  -0.16071   0.20139   0.21927   0.06644
        <.0001    0.1720    0.0062              0.0347    0.0079    0.0038    0.3851
x22    0.07758  -0.00978   0.00872  -0.16071   1.00000  -0.23035  -0.15197  -0.11242
        0.3103    0.8983    0.9093    0.0347              0.0023    0.0459    0.1409
x3     0.08670   0.21273  -0.15242   0.20139  -0.23035   1.00000   0.88689   0.40141
        0.2567    0.0050    0.0453    0.0079    0.0023              <.0001    <.0001
x4     0.09104   0.19302  -0.14016   0.21927  -0.15197   0.88689   1.00000   0.38719
        0.2336    0.0109    0.0659    0.0038    0.0459    <.0001              <.0001
y      0.06171   0.19493  -0.06176   0.06644  -0.11242   0.40141   0.38719   1.00000
        0.4200    0.0102    0.4195    0.3851    0.1409    <.0001    <.0001

Stepwise Logistic regression of Satellite Presence
vs. Several Explanatory Variables,
Somc of which are Categorical
Backward Selection Used

The LOGISTIC Procedure

Model Information

Data Set                       WORK.CRABS
Response Variable              y             Satellite Males?
Number of Response Levels      2
Model                          binary logit
Optimization Technique         Fisher's scoring

Number of Observations Read    173
Number of Observations Used    173

Response Profile

Ordered              Total
  Value    y     Frequency
      1    No           62
      2    Yes         111

Probability modeled is y='Yes'.

Stepwise Selection Procedure

Step 0. Intercept entered:

Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.

-2 Log L = 225.759

Residual Chi-Square Test

Chi-Square    DF    Pr > ChiSq
   36.3085     7        <.0001

Step 1. Effect x3 entered:

Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.

Model Fit Statistics

                Intercept    Intercept and
Criterion            Only       Covariates
AIC               227.759          198.453
SC                230.912          204.759
-2 Log L          225.759          194.453

Testing Global Null Hypothesis: BETA=0

Test                 Chi-Square    DF    Pr > ChiSq
Likelihood Ratio        31.3059     1        <.0001
Score                   27.8752     1        <.0001
Wald                    23.8872     1        <.0001

Residual Chi-Square Test

Chi-Square    DF    Pr > ChiSq
    9.3673     6        0.1540

NOTE: No effects for the model in Step 1 are removed.
NOTE: No (additional) effects met the 0.05 significance level for entry into the model.
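The fit statistics printed for Step 1 can be reproduced by hand from the two -2 Log L values, which is a useful habit when reading PROC LOGISTIC output. The short Python sketch below does this. The function and constant names are illustrative and are not part of SAS; the sketch assumes the usual definitions AIC = -2 Log L + 2p and SC = -2 Log L + p ln(n), where p is the number of estimated parameters and n = 173 is the number of observations used.

```python
import math

# Values copied from the Step 1 output above.
N_OBS = 173               # observations used
NEG2LOGL_NULL = 225.759   # -2 Log L, intercept only
NEG2LOGL_X3 = 194.453     # -2 Log L, intercept and x3


def lr_chi_square(neg2logl_reduced, neg2logl_full):
    """Likelihood ratio statistic: drop in -2 Log L from reduced to full model."""
    return neg2logl_reduced - neg2logl_full


def aic(neg2logl, n_params):
    """Akaike information criterion (assumed definition: -2 Log L + 2p)."""
    return neg2logl + 2 * n_params


def sc(neg2logl, n_params, n_obs):
    """Schwarz criterion (assumed definition: -2 Log L + p * ln(n))."""
    return neg2logl + n_params * math.log(n_obs)


lr_x3 = lr_chi_square(NEG2LOGL_NULL, NEG2LOGL_X3)
aic_x3 = aic(NEG2LOGL_X3, n_params=2)          # intercept + x3
sc_x3 = sc(NEG2LOGL_X3, n_params=2, n_obs=N_OBS)

print(f"LR chi-square = {lr_x3:.3f}")
print(f"AIC           = {aic_x3:.3f}")
print(f"SC            = {sc_x3:.3f}")
```

Up to rounding of the printed -2 Log L values, the three results agree with the Likelihood Ratio, AIC, and SC entries reported for the model containing x3.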
Summary of Stepwise Selection

          Effect                  Number       Score        Wald                  Variable
Step  Entered  Removed    DF          In  Chi-Square  Chi-Square  Pr > ChiSq     Label
   1  x3                   1           1     27.8752                  <.0001     Carapace Width

Analysis of Maximum Likelihood Estimates

                            Standard          Wald
Parameter    DF  Estimate      Error    Chi-Square    Pr > ChiSq
Intercept     1  -12.3508     2.6287       22.0749        <.0001
x3            1    0.4972     0.1017       23.8872        <.0001

Odds Ratio Estimates

             Point        95% Wald
Effect    Estimate    Confidence Limits
x3           1.644      1.347     2.007

Association of Predicted Probabilities and Observed Responses

Percent Concordant    73.5    Somers' D    0.485
Percent Discordant    25.0    Gamma        0.492
Percent Tied           1.5    Tau-a        0.224
Pairs                 6882    c            0.742

Estimated Covariance Matrix

Parameter    Intercept          x3
Intercept     6.910227    -0.26685
x3            -0.26685     0.01035

Since there is only one variable in the final model, we need not look at interaction terms. However, for the sake of completeness, we will look at a stepwise regression for the data, with all interaction terms also included in the model, to see whether any interaction terms survive. In this run, we will use SLENTRY=0.25 and SLSTAY=0.05. The SAS program is shown below.

proc format;
   value difmt 0 = "No"
               1 = "Yes";
run;

data crabs;
   input x1 x2 x3 x4 s;
   y = 0;   if s > 0 then y = 1;
   x11 = 0; if x1 = 1 then x11 = 1;
   x12 = 0; if x1 = 2 then x12 = 1;
   x13 = 0; if x1 = 3 then x13 = 1;
   x21 = 0; if x2 = 1 then x21 = 1;
   x22 = 0; if x2 = 2 then x22 = 1;
   x11x21 = x11*x21; x12x21 = x12*x21; x13x21 = x13*x21;
   x11x22 = x11*x22; x12x22 = x12*x22; x13x22 = x13*x22;
   x11x3 = x11*x3; x11x4 = x11*x4;
   x12x3 = x12*x3; x12x4 = x12*x4;
   x13x3 = x13*x3; x13x4 = x13*x4;
   x21x3 = x21*x3; x21x4 = x21*x4;
   x22x3 = x22*x3; x22x4 = x22*x4;
   x3x4 = x3*x4;
   label x1  = "Color"
         x2  = "Spine Condition"
         x3  = "Carapace Width"
         x4  = "Weight"
         x11 = "Light medium?"
         x12 = "Medium?"
         x13 = "Dark medium?"
         x21 = "Both good?"
         x22 = "One worn or broken?"
         y   = "Satellite Males?"
         s   = "No. of Satellite Males";
   format y x11 x12 x13 x21 x22 difmt.;
cards;
The data set is included in the first handout.
;

proc logistic;
   model y (order=formatted event='Yes') = x11x21 x12x21 x13x21 x11x22
         x12x22 x13x22 x11x3 x11x4 x12x3 x12x4 x13x3 x13x4 x21x3 x21x4
         x22x3 x22x4 x3x4 x11 x12 x13 x21 x22 x3 x4
         / selection=stepwise slentry=0.25 slstay=0.05 covb;
   title "Stepwise Logistic regression of Satellite Presence";
   title2 "vs. Several Explanatory Variables,";
   title3 "Somc of which are Categorical";
   title4 "Stepwise Selection Used, and Interactions Included";
run;

SAS PROC LOGISTIC Output:

Stepwise Logistic regression of Satellite Presence
vs. Several Explanatory Variables,
Somc of which are Categorical
Stepwise Selection Used, and Interactions Included

The LOGISTIC Procedure

Model Information

Data Set                       WORK.CRABS
Response Variable              y             Satellite Males?
Number of Response Levels      2
Model                          binary logit
Optimization Technique         Fisher's scoring

Number of Observations Read    173
Number of Observations Used    173

Response Profile

Ordered              Total
  Value    y     Frequency
      1    No           62
      2    Yes         111

Probability modeled is y='Yes'.

Stepwise Selection Procedure

Step 0. Intercept entered:

Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.

-2 Log L = 225.759

Residual Chi-Square Test

Chi-Square    DF    Pr > ChiSq
   52.5969    24        0.0007

Step 1. Effect x3 entered:

Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.
Model Fit Statistics

                Intercept    Intercept and
Criterion            Only       Covariates
AIC               227.759          198.453
SC                230.912          204.759
-2 Log L          225.759          194.453

Testing Global Null Hypothesis: BETA=0

Test                 Chi-Square    DF    Pr > ChiSq
Likelihood Ratio        31.3059     1        <.0001
Score                   27.8752     1        <.0001
Wald                    23.8872     1        <.0001

Residual Chi-Square Test

Chi-Square    DF    Pr > ChiSq
   29.0098    23        0.1800

NOTE: No effects for the model in Step 1 are removed.

Step 2. Effect x12 entered:

Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.

Model Fit Statistics

                Intercept    Intercept and
Criterion            Only       Covariates
AIC               227.759          197.757
SC                230.912          207.217
-2 Log L          225.759          191.757

Testing Global Null Hypothesis: BETA=0

Test                 Chi-Square    DF    Pr > ChiSq
Likelihood Ratio        34.0014     2        <.0001
Score                   30.0492     2        <.0001
Wald                    25.2770     2        <.0001

Residual Chi-Square Test

Chi-Square    DF    Pr > ChiSq
   24.7797    22        0.3077

Step 3. Effect x12 is removed:

Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.

Model Fit Statistics

                Intercept    Intercept and
Criterion            Only       Covariates
AIC               227.759          198.453
SC                230.912          204.759
-2 Log L          225.759          194.453

Testing Global Null Hypothesis: BETA=0

Test                 Chi-Square    DF    Pr > ChiSq
Likelihood Ratio        31.3059     1        <.0001
Score                   27.8752     1        <.0001
Wald                    23.8872     1        <.0001

Residual Chi-Square Test

Chi-Square    DF    Pr > ChiSq
   29.0098    23        0.1800

NOTE: No effects for the model in Step 3 are removed.
NOTE: Model building terminates because the last effect entered is removed by the Wald statistic criterion.

Summary of Stepwise Selection

          Effect                  Number       Score        Wald                  Variable
Step  Entered  Removed    DF          In  Chi-Square  Chi-Square  Pr > ChiSq     Label
   1  x3                   1           1     27.8752                  <.0001     Carapace Width
   2  x12                  1           2      2.7133                  0.0995     Medium?
   3           x12         1           1                  2.6873      0.1011     Medium?

Analysis of Maximum Likelihood Estimates

                            Standard          Wald
Parameter    DF  Estimate      Error    Chi-Square    Pr > ChiSq
Intercept     1  -12.3508     2.6287       22.0749        <.0001
x3            1    0.4972     0.1017       23.8872        <.0001

Odds Ratio Estimates

             Point        95% Wald
Effect    Estimate    Confidence Limits
x3           1.644      1.347     2.007

Association of Predicted Probabilities and Observed Responses

Percent Concordant    73.5    Somers' D    0.485
Percent Discordant    25.0    Gamma        0.492
Percent Tied           1.5    Tau-a        0.224
Pairs                 6882    c            0.742

Estimated Covariance Matrix

Parameter    Intercept          x3
Intercept     6.910227    -0.26685
x3            -0.26685     0.01035

1) Hosmer, D. W. and Lemeshow, S. (1989). Applied Logistic Regression, John Wiley & Sons, New York.
2) Mickey, J. and Greenland, S. (1989). “A study of the impact of confounder-selection criteria on effect estimation,” American Journal of Epidemiology, 129, 125-137.
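The entry of x12 at Step 2 and its removal at Step 3 follow directly from the two thresholds used in this run. A minimal Python sketch of that decision rule is given below. The function names are hypothetical, and treating a p-value exactly equal to the threshold as passing is an assumption. The p-values 0.0995 (score test at entry) and 0.1011 (Wald test at removal) are the ones PROC LOGISTIC reports for x12 in its selection summary; the likelihood ratio p-value is computed here only as a cross-check, using the fact that the upper tail probability of a 1-df chi-square at x equals erfc(sqrt(x / 2)).

```python
import math

SLENTRY = 0.25   # entry level used in this run
SLSTAY = 0.05    # stay level used in this run


def chi2_sf_1df(x):
    """Upper tail probability of a chi-square variable with 1 df."""
    return math.erfc(math.sqrt(x / 2.0))


def enters(p_score, slentry=SLENTRY):
    """An effect is a candidate for entry if its score p-value meets SLENTRY."""
    return p_score <= slentry


def stays(p_wald, slstay=SLSTAY):
    """An effect stays only if its Wald p-value meets SLSTAY."""
    return p_wald <= slstay


# p-values reported for x12 ("Medium?") in the selection summary.
P_SCORE_X12 = 0.0995
P_WALD_X12 = 0.1011

# Cross-check: likelihood ratio test for adding x12 to the model with x3,
# from the -2 Log L values printed at Step 1 and Step 2.
lr_x12 = 194.453 - 191.757
p_lr_x12 = chi2_sf_1df(lr_x12)

print("x12 enters at SLENTRY=0.25:", enters(P_SCORE_X12))
print("x12 stays at SLSTAY=0.05: ", stays(P_WALD_X12))
print(f"LR chi-square = {lr_x12:.3f}, p = {p_lr_x12:.4f}")
```

All three tests put the p-value for x12 near 0.10, which lies between the two thresholds: small enough to enter at 0.25, but too large to stay at 0.05. That is why the procedure ends with x3 alone.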
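As a cross-check on the final model reported above, the odds ratio for carapace width and its 95% Wald limits can be recomputed from the estimate and standard error of x3. The Python sketch below is illustrative only: the function names are not from SAS, the normal quantile 1.959964 is the standard 97.5% point, and the width value of 26.0 used for the fitted probability is an arbitrary illustrative input rather than a value taken from the data.

```python
import math

# Estimates copied from the Analysis of Maximum Likelihood Estimates table.
INTERCEPT = -12.3508
BETA_X3 = 0.4972
SE_X3 = 0.1017
Z_975 = 1.959964   # 97.5% point of the standard normal distribution


def odds_ratio_ci(beta, se, z=Z_975):
    """Odds ratio per unit increase in X, with Wald confidence limits."""
    return (math.exp(beta),
            math.exp(beta - z * se),
            math.exp(beta + z * se))


def predicted_probability(width):
    """Fitted probability of at least one satellite male at a given width."""
    logit = INTERCEPT + BETA_X3 * width
    return 1.0 / (1.0 + math.exp(-logit))


or_x3, lower_x3, upper_x3 = odds_ratio_ci(BETA_X3, SE_X3)
print(f"Odds ratio = {or_x3:.3f}, 95% Wald limits = ({lower_x3:.3f}, {upper_x3:.3f})")

# Illustrative input only; 26.0 is not taken from the handout's data.
print(f"Fitted P(Y = 1) at width 26.0 = {predicted_probability(26.0):.3f}")
```

The recomputed odds ratio and limits agree with the Odds Ratio Estimates table to three decimals, and the standard error itself is the square root of the x3 entry (0.01035) in the estimated covariance matrix.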