Logistic Regression, Testing for Interaction

Example, Influenza Shots

A local health clinic sent fliers to its clients to encourage everyone, but especially older persons at high risk of complications, to get a flu shot for protection against an expected flu epidemic. In a pilot follow-up study, 159 clients were randomly selected and asked whether they actually received a flu shot. A client who received a flu shot was coded Y = 1, and a client who did not receive a flu shot was coded Y = 0. In addition, data were collected on their age (X1) and their health awareness. The latter data were combined into a health awareness index (X2), for which higher values indicate greater awareness. Also included in the data was client gender (X3), with males coded X3 = 1 and females coded X3 = 0.

It is suspected that there may be some interactions between predictor variables; e.g., perhaps the relationship between health awareness and the response variable is moderated by gender. Hence, we want to test for interaction effects. To do this, we will estimate two logistic regression models: one with interactions included and another without. Assuming that we have already looked at the univariate relationships between Y and each explanatory variable, we will proceed to look for possible interaction effects.

The reduced model (with only the three explanatory variables) is

$$\ln\left[\frac{\pi(\mathbf{x}_i)}{1-\pi(\mathbf{x}_i)}\right] = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \varepsilon_i,$$

where the i subscript denotes the ith observation in the data set, and $\varepsilon_i$ is a random error term associated with the ith observation.

The full model (with interaction terms included) is

$$\ln\left[\frac{\pi(\mathbf{x}_i)}{1-\pi(\mathbf{x}_i)}\right] = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \beta_{12} X_{1i}X_{2i} + \beta_{13} X_{1i}X_{3i} + \beta_{23} X_{2i}X_{3i} + \varepsilon_i.$$

We want to test whether there are interaction effects present.

Step 1: H0: $\beta_{12} = \beta_{13} = \beta_{23} = 0$ v. HA: Not all 0.

Step 2: We have $n = 159$, $\alpha = 0.05$.
Step 3: The test statistic is the Likelihood Ratio chi-square statistic $G^2 = -2(l_0 - l_1)$, where $l_1$ is the maximum of the log-likelihood function for the model with the interaction terms as well as the predictors, and $l_0$ is the maximum of the log-likelihood function for the model with just the predictor variables. Under the null hypothesis, the statistic $G^2$ has a chi-square distribution with d.f. = 3.

Step 4: We will reject the null hypothesis if $G^2 > \chi^2_{3,\,0.05} = 7.81$.

Step 5: From the output (the -2 Log L values for the reduced and full models), we find $G^2 = 105.093 - 104.994 = 0.099$.

Step 6: We fail to reject the null hypothesis at the 0.05 level of significance. We do not have sufficient evidence to conclude that the interaction terms need to be included in the model.

If we had rejected the null hypothesis, then we could have used a follow-up procedure, such as stepwise multiple regression, to find which explanatory variables and which interaction terms would need to be included in the model. If we find that a particular interaction term is significant, then we would also want to include the two explanatory variables involved in that interaction in our final model. (Note: In this case, if we were to perform stepwise regression, we would find that only two of the predictors, Age and Health Awareness, would need to be included in our final model.)

The estimated model is therefore (from the output of the third PROC LOGISTIC):

$$\ln\left[\frac{\hat{\pi}(\mathbf{x}_i)}{1-\hat{\pi}(\mathbf{x}_i)}\right] = \hat{\beta}_0 + \hat{\beta}_1 X_{1i} + \hat{\beta}_2 X_{2i},$$

or

$$\ln\left[\frac{\hat{\pi}(\mathbf{x}_i)}{1-\hat{\pi}(\mathbf{x}_i)}\right] = -1.4578 + 0.0779\,(\text{Age}) - 0.0955\,(\text{Health Awareness}). \qquad (1)$$

We also want to check the assumption that the logit is linear in each of the (continuous) predictor variables. There are several ways to do this. One way is a rather tedious graphical approach involving grouped data. A simpler approach is to test whether we need to include nonlinear terms in the model. To do this, we will use the final model above as our reduced model, and add quadratic terms in each of the two explanatory variables, so that the full model is

$$\ln\left[\frac{\pi(\mathbf{x}_i)}{1-\pi(\mathbf{x}_i)}\right] = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \gamma_1 X_{1i}^2 + \gamma_2 X_{2i}^2 + \varepsilon_i.$$

We want to test whether these quadratic terms are needed. The last PROC LOGISTIC in the SAS program below estimates the model with the quadratic terms included.

Step 1: H0: $\gamma_1 = \gamma_2 = 0$ v. HA: Not both 0.

Step 2: We have $n = 159$, $\alpha = 0.05$.

Step 3: The test statistic is the Likelihood Ratio chi-square statistic $G^2 = -2(l_0 - l_1)$, where $l_1$ is the maximum of the log-likelihood function for the model with the quadratic terms as well as the predictors, and $l_0$ is the maximum of the log-likelihood function for the model with just the predictor variables. Under the null hypothesis, the statistic $G^2$ has a chi-square distribution with d.f. = 2.

Step 4: We will reject the null hypothesis if $G^2 > \chi^2_{2,\,0.05} = 5.99$.

Step 5: From the output (the -2 Log L values for the reduced and full models), we find $G^2 = 105.795 - 104.706 = 1.089$.

Step 6: We fail to reject the null hypothesis at the 0.05 level of significance. We do not have sufficient evidence to conclude that the quadratic terms need to be included in the model.

Our final model therefore is given by Equation (1) above. The estimate of the regression slope for Age is $\hat{\beta}_1 = 0.0779$, with a standard error of $S.E.(\hat{\beta}_1) = 0.0297$. Thus, a 95% confidence interval estimate for the slope $\beta_1$ is

$$\hat{\beta}_1 \pm 1.96\,S.E.(\hat{\beta}_1) = (0.019688,\ 0.136112).$$

Now, since Age is a continuous variable, it is not very interesting to consider the odds ratio for a unit increase in Age. Instead, we will calculate a 95% confidence interval estimate of the odds ratio for an increase in Age of 5 years. The point estimate of the odds ratio is $e^{0.0779(5)} = 1.4762$, and a 95% confidence interval estimate is

$$\left(e^{0.019688(5)},\ e^{0.136112(5)}\right) = (1.1034,\ 1.9750).$$

We are 95% confident that, for this population, each 5-year increase in Age multiplies the odds of having had a flu shot by a factor of between 1.1034 and 1.9750.

The SAS program for conducting data analysis is given below, followed by the output.
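Before turning to the SAS program, the arithmetic in the two likelihood-ratio tests and in the 5-year odds ratio interval can be checked with a short Python sketch. This is not part of the SAS analysis; it only re-does the hand calculations from the reported -2 Log L values, slope estimate, and standard error. The helper names `chi2_sf` and `lr_test` are illustrative, and the chi-square tail probability is written out in closed form for 2 and 3 degrees of freedom only, so that nothing beyond the standard library is assumed.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(chi-square_df > x); this sketch handles df = 2 or 3 only."""
    if df == 2:
        return math.exp(-x / 2)
    if df == 3:
        return (math.erfc(math.sqrt(x / 2))
                + math.sqrt(2 * x / math.pi) * math.exp(-x / 2))
    raise ValueError("this sketch only handles df = 2 or df = 3")


def lr_test(neg2logl_reduced, neg2logl_full, df):
    """Likelihood ratio statistic G^2 and its p-value from two -2 Log L values."""
    g2 = neg2logl_reduced - neg2logl_full
    return g2, chi2_sf(g2, df)


# Interaction test: reduced model (x1 x2 x3) versus full model with three interactions.
g2_int, p_int = lr_test(105.093, 104.994, df=3)

# Linearity test: reduced model (x1 x2) versus full model with two quadratic terms.
g2_quad, p_quad = lr_test(105.795, 104.706, df=2)

# 95% Wald interval for the Age slope, then the odds ratio for a 5-year increase.
b1, se1, years = 0.0779, 0.0297, 5
slope_ci = (b1 - 1.96 * se1, b1 + 1.96 * se1)
or_point = math.exp(years * b1)
or_ci = (math.exp(years * slope_ci[0]), math.exp(years * slope_ci[1]))

print(f"Interaction test: G2 = {g2_int:.3f}, p = {p_int:.4f}")
print(f"Quadratic test:   G2 = {g2_quad:.3f}, p = {p_quad:.4f}")
print(f"Slope CI: ({slope_ci[0]:.6f}, {slope_ci[1]:.6f})")
print(f"5-year odds ratio: {or_point:.4f}, CI: ({or_ci[0]:.4f}, {or_ci[1]:.4f})")
```

Both p-values come out well above 0.05, which agrees with the fail-to-reject decisions reached by comparing $G^2$ with the critical values 7.81 and 5.99.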
SAS Program

proc format;
   value difmt  0 = "No "
                1 = "Yes";
   value sexfmt 0 = "Female"
                1 = "Male  ";
run;

data flushot;
   input y x1 x2 x3;
   x1x2 = x1*x2;
   x1x3 = x1*x3;
   x2x3 = x2*x3;
   x1sq = x1**2;
   x2sq = x2**2;
   label y    = "Flu Shot?"
         x1   = "Age in Years"
         x2   = "Health Awareness Index"
         x3   = "Gender"
         x1x2 = "Interaction of Age with Health Awareness"
         x1x3 = "Interaction of Age with Gender"
         x2x3 = "Interaction of Health Awareness with Gender"
         x1sq = "Square of Age in Years"
         x2sq = "Square of Health Awareness";
   format y difmt. x3 sexfmt.;
   cards;
The data set is listed in the appendix.
;

proc logistic;
   model y (order=formatted event='Yes') = x1 x2 x3;
   title "Multiple Logistic Regression of Flu Shot";
   title2 "Against Age, Health Awareness, and Gender";
run;

proc logistic;
   model y (order=formatted event='Yes') = x1 x2 x3 x1x2 x1x3 x2x3;
   title "Multiple Logistic Regression of Flu Shot";
   title2 "Against Age and Health Awareness";
   title3 "With Interaction Terms Included";
run;

proc logistic;
   model y (order=formatted event='Yes') = x1 x2;
   title "Multiple Logistic Regression of Flu Shot";
   title2 "Against Age and Health Awareness";
   title3;
run;

proc logistic;
   model y (order=formatted event='Yes') = x1 x2 x1sq x2sq;
   title "Multiple Logistic Regression of Flu Shot";
   title2 "Against Age and Health Awareness";
   title3 "With Quadratic Terms Included";
run;

Output of SAS Program

Multiple Logistic Regression of Flu Shot
Against Age, Health Awareness, and Gender

The LOGISTIC Procedure

Model Information
   Data Set                      WORK.FLUSHOT
   Response Variable             y   (Flu Shot?)
   Number of Response Levels     2
   Model                         binary logit
   Optimization Technique        Fisher's scoring

   Number of Observations Read   159
   Number of Observations Used   159

Response Profile
   Ordered Value    y      Total Frequency
   1                No     135
   2                Yes    24

Probability modeled is y='Yes'.

Model Convergence Status
   Convergence criterion (GCONV=1E-8) satisfied.
Model Fit Statistics
   Criterion    Intercept Only    Intercept and Covariates
   AIC          136.941           113.093
   SC           140.010           125.369
   -2 Log L     134.941           105.093

Testing Global Null Hypothesis: BETA=0
   Test                Chi-Square    DF    Pr > ChiSq
   Likelihood Ratio    29.8476       3     <.0001
   Score               27.0173       3     <.0001
   Wald                19.9803       3     0.0002

Analysis of Maximum Likelihood Estimates
   Parameter    DF    Estimate    Standard Error    Wald Chi-Square    Pr > ChiSq
   Intercept    1     -1.1772     2.9824            0.1558             0.6930
   x1           1      0.0728     0.0304            5.7401             0.0166
   x2           1     -0.0990     0.0335            8.7419             0.0031
   x3           1      0.4339     0.5218            0.6917             0.4056

Odds Ratio Estimates
   Effect    Point Estimate    95% Wald Confidence Limits
   x1        1.076             1.013    1.141
   x2        0.906             0.848    0.967
   x3        1.543             0.555    4.291

Association of Predicted Probabilities and Observed Responses
   Percent Concordant    82.1     Somers' D    0.644
   Percent Discordant    17.7     Gamma        0.645
   Percent Tied           0.2     Tau-a        0.166
   Pairs                 3240     c            0.822

Multiple Logistic Regression of Flu Shot
Against Age and Health Awareness
With Interaction Terms Included

The LOGISTIC Procedure

Model Information
   Data Set                      WORK.FLUSHOT
   Response Variable             y   (Flu Shot?)
   Number of Response Levels     2
   Model                         binary logit
   Optimization Technique        Fisher's scoring

   Number of Observations Read   159
   Number of Observations Used   159

Response Profile
   Ordered Value    y      Total Frequency
   1                No     135
   2                Yes    24

Probability modeled is y='Yes'.

Model Convergence Status
   Convergence criterion (GCONV=1E-8) satisfied.
Model Fit Statistics
   Criterion    Intercept Only    Intercept and Covariates
   AIC          136.941           118.994
   SC           140.010           140.476
   -2 Log L     134.941           104.994

Testing Global Null Hypothesis: BETA=0
   Test                Chi-Square    DF    Pr > ChiSq
   Likelihood Ratio    29.9472       6     <.0001
   Score               32.4819       6     <.0001
   Wald                19.4560       6     0.0035

Analysis of Maximum Likelihood Estimates
   Parameter    DF    Estimate    Standard Error    Wald Chi-Square    Pr > ChiSq
   Intercept    1      2.6164     13.5486           0.0373             0.8469
   x1           1      0.0138      0.2007           0.0047             0.9452
   x2           1     -0.1650      0.2490           0.4394             0.5074
   x3           1      0.3907      6.0739           0.0041             0.9487
   x1x2         1      0.00103     0.00371          0.0773             0.7809
   x1x3         1      0.00588     0.0615           0.0091             0.9238
   x2x3         1     -0.00634     0.0679           0.0087             0.9255

Odds Ratio Estimates
   Effect    Point Estimate    95% Wald Confidence Limits
   x1        1.014             0.684     1.503
   x2        0.848             0.520     1.381
   x3        1.478             <0.001    >999.999
   x1x2      1.001             0.994     1.008
   x1x3      1.006             0.892     1.135
   x2x3      0.994             0.870     1.135

Association of Predicted Probabilities and Observed Responses
   Percent Concordant    82.5     Somers' D    0.653
   Percent Discordant    17.2     Gamma        0.656
   Percent Tied           0.4     Tau-a        0.168
   Pairs                 3240     c            0.827

Multiple Logistic Regression of Flu Shot
Against Age and Health Awareness

The LOGISTIC Procedure

Model Information
   Data Set                      WORK.FLUSHOT
   Response Variable             y   (Flu Shot?)
   Number of Response Levels     2
   Model                         binary logit
   Optimization Technique        Fisher's scoring

   Number of Observations Read   159
   Number of Observations Used   159

Response Profile
   Ordered Value    y      Total Frequency
   1                No     135
   2                Yes    24

Probability modeled is y='Yes'.

Model Convergence Status
   Convergence criterion (GCONV=1E-8) satisfied.
Model Fit Statistics
   Criterion    Intercept Only    Intercept and Covariates
   AIC          136.941           111.795
   SC           140.010           121.002
   -2 Log L     134.941           105.795

Testing Global Null Hypothesis: BETA=0
   Test                Chi-Square    DF    Pr > ChiSq
   Likelihood Ratio    29.1454       2     <.0001
   Score               26.7071       2     <.0001
   Wald                19.8291       2     <.0001

Analysis of Maximum Likelihood Estimates
   Parameter    DF    Estimate    Standard Error    Wald Chi-Square    Pr > ChiSq
   Intercept    1     -1.4578     2.9153            0.2500             0.6170
   x1           1      0.0779     0.0297            6.8761             0.0087
   x2           1     -0.0955     0.0324            8.6786             0.0032

Odds Ratio Estimates
   Effect    Point Estimate    95% Wald Confidence Limits
   x1        1.081             1.020    1.146
   x2        0.909             0.853    0.969

Association of Predicted Probabilities and Observed Responses
   Percent Concordant    80.7     Somers' D    0.618
   Percent Discordant    18.9     Gamma        0.620
   Percent Tied           0.4     Tau-a        0.159
   Pairs                 3240     c            0.809

Multiple Logistic Regression of Flu Shot
Against Age and Health Awareness
With Quadratic Terms Included

The LOGISTIC Procedure

Model Information
   Data Set                      WORK.FLUSHOT
   Response Variable             y   (Flu Shot?)
   Number of Response Levels     2
   Model                         binary logit
   Optimization Technique        Fisher's scoring

   Number of Observations Read   159
   Number of Observations Used   159

Response Profile
   Ordered Value    y      Total Frequency
   1                No     135
   2                Yes    24

Probability modeled is y='Yes'.

Model Convergence Status
   Convergence criterion (GCONV=1E-8) satisfied.
8 Model Fit Statistics Intercept Intercept and Criterion Only Covariates AIC 136.941 114.706 SC 140.010 130.050 -2 Log L 134.941 104.706 Multiple Logistic Regression of Flu Shot Against Age and Health Awareness With Quadratic Terms Included The LOGISTIC Procedure Testing Global Null Hypothesis: BETA=0 Test Chi-Square DF Pr > ChiSq Likelihood Ratio 30.2348 4 <.0001 Score 34.2112 4 <.0001 Wald 19.4995 4 0.0006 Parameter Intercept x1 x2 x1sq x2sq Analysis of Maximum Likelihood Estimates Standard Wald DF Estimate Error Chi-Square 1 0.2193 14.2594 0.0002 1 0.2296 0.4052 0.3210 1 -0.3518 0.2638 1.7780 1 -0.00112 0.00303 0.1363 1 0.00238 0.00236 1.0171 Effect x1 x2 x1sq x2sq Pr > ChiSq 0.9877 0.5710 0.1824 0.7120 0.3132 Odds Ratio Estimates Point 95% Wald Estimate Confidence Limits 1.258 0.569 2.784 0.703 0.419 1.180 0.999 0.993 1.005 1.002 0.998 1.007 Association of Predicted Probabilities and Observed Responses Percent Concordant 81.2 Somers' D 0.629 Percent Discordant 18.3 Gamma 0.632 Percent Tied 0.4 Tau-a 0.162 Pairs 3240 c 0.815 Appendix: Flu Shot Data 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 59 61 82 51 53 62 51 70 71 55 58 53 72 56 56 81 62 49 56 50 53 52 55 51 70 70 49 69 54 65 58 48 58 65 68 83 68 44 70 69 74 57 0 1 0 0 0 1 1 1 1 1 0 1 0 0 0 0 0 0 1 0 0 9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 56 56 50 52 52 67 51 70 64 61 53 77 73 67 50 80 75 65 60 68 61 62 53 72 54 59 61 50 48 52 54 62 71 65 49 58 62 69 56 76 51 64 57 51 81 50 64 64 59 53 63 59 70 72 68 75 57 64 67 83 48 81 53 61 51 51 65 51 54 64 69 71 38 51 54 59 57 63 48 58 56 59 75 48 79 66 57 68 48 60 63 61 57 69 38 50 45 72 51 62 81 55 77 65 53 49 65 58 60 57 37 49 55 60 1 1 1 1 0 1 0 0 0 1 0 1 1 0 0 0 0 1 1 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 1 1 0 0 1 0 1 0 1 1 1 0 0 1 1 0 0 1 0 10 0 0 0 1 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 67 59 55 75 66 67 59 78 59 68 59 
68 78 55 71 51 65 54 79 64 82 64 70 59 59 63 48 61 51 48 71 51 57 49 67 73 73 56 48 50 50 66 53 50 51 68 72 51 62 60 67 70 55 66 65 84 58 57 56 58 64 51 59 61 49 49 55 61 50 47 73 45 45 59 61 52 50 46 67 56 50 56 61 74 78 68 71 58 57 51 74 56 57 65 47 69 71 76 60 75 65 42 66 49 58 61 55 60 54 63 56 59 52 63 1 1 0 1 0 0 0 1 0 0 1 1 1 1 1 0 0 1 0 0 1 0 1 0 1 1 0 0 0 0 1 0 1 0 1 0 0 0 1 0 1 1 1 1 0 1 1 1 1 0 1 1 1 0 1 1 0 11 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 68 51 67 52 68 76 54 50 63 77 60 51 51 66 52 66 56 49 67 57 56 76 68 73 57 59 53 67 62 63 62 52 58 49 65 55 60 51 67 64 55 58 66 64 66 22 32 56 1 1 1 0 0 1 1 1 0 1 1 0 1 1 0 1 1 0 0 1 0 1 0 1