Chapter 14 Binary

In many regression applications the response variable has only two outcomes: an event either did or did not occur. Such a variable is often referred to as a binary or binomial variable, as its behavior is related to the binomial distribution. A regression model with this type of response can be interpreted as a model that estimates the effect of the independent variable(s) on the probability of the event occurring. Binary response data typically appear in one of two ways:

- When observations represent individual subjects, the response is represented by a dummy or indicator variable having any two values. The most commonly used values are zero if the event does not occur and unity if it does.
- When observations summarize the occurrence of events for each set of unique combinations of the independent variables, the response variable is x/n, where x is the number of occurrences and n the number of observations in the set.

Regression with a binary response is illustrated with data from a study of carriers of muscular dystrophy. Two groups of women, one consisting of known carriers of the disease and the other a control group, were examined for four types of protein in their blood. It is known that these protein measurements may be used as a screening tool to identify carriers. The variables in the resulting data set {ANGEL > Data > Chapter 14 > Binary} are:

Carrier: 0 for control, 1 for carrier
P1: measurement of protein type 1
P2: measurement of protein type 2
P3: measurement of protein type 3
P4: measurement of protein type 4

Objective: Determine the effectiveness of these proteins for identifying carriers of the disease, with special reference to how screening is improved by using measurements of the other proteins.

Analysis

Because P1 has been the standard, it will be used to illustrate binomial regression with a single independent variable. Because this looks like a regression problem, you might first try a linear regression model by regressing Carrier on P1. The Minitab output is:

The regression equation is
Carrier = 0.275 + 0.00188 P1

Predictor   Coef       SE Coef    T     P
Constant    0.27462    0.06766    4.06  0.000
P1          0.0018778  0.0004283  4.38  0.000

S = 0.446282   R-Sq = 22.0%   R-Sq(adj) = 20.9%

The regression is certainly significant, and the positive slope suggests that the probability of detecting a carrier increases with measurements of P1. But a scatterplot of the data is interesting:

[Scatterplot of Carrier vs P1: Carrier (0 or 1) on the vertical axis, scaled from 0.0 to 1.6, and P1 from 0 to 700 on the horizontal axis.]

The plot immediately reveals a problem: the model is estimating probabilities, which by definition are constrained to lie between zero and one, yet many estimated values fall beyond this range. Another difficulty with this model is that the variance of the binomial response variable is known to be a function of π(1 − π), where π is the probability of the event. This obviously violates the equal-variance assumption required by the least squares estimation process.

Thus this particular approach to regression with a binary response appears to have limited usefulness. The use of weighted regression may alleviate the unequal-variance violation, and the use of some transformation (possibly arcsine) may provide somewhat better estimates. However, a more useful approach is afforded by the logistic regression model.
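The out-of-range fitted values can be demonstrated directly. The following is a minimal sketch, not part of the text's analysis: it uses synthetic stand-in data (the real measurements are in the course data set) and Python's statsmodels to fit least squares to a 0/1 response and inspect the range of the fitted values.

# Minimal sketch, assuming synthetic stand-in data rather than the real
# muscular dystrophy measurements: fit ordinary least squares to a 0/1
# response and check the range of the fitted values.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
p1 = rng.uniform(0, 700, size=70)                    # hypothetical P1 values
true_prob = 1 / (1 + np.exp(-(-2.19 + 0.03 * p1)))   # logistic-shaped truth
carrier = (rng.uniform(size=70) < true_prob).astype(int)

ols_fit = sm.OLS(carrier, sm.add_constant(p1)).fit()
print(ols_fit.params)                                # intercept and slope
print(ols_fit.fittedvalues.min(), ols_fit.fittedvalues.max())
# With predictor values spanning a wide range, the fitted "probabilities"
# typically fall below 0 or above 1 -- the problem visible in the plot.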
Logistic Regression – Binary Response

Recall that for a binary response, y, the expected value of y is E(y) = π, where π denotes P(y = 1). The logistic model is:

π = exp(β0 + β1x1 + ... + βkxk) / [1 + exp(β0 + β1x1 + ... + βkxk)]

and through algebraic manipulation,

ln[π / (1 − π)] = β0 + β1x1 + ... + βkxk

Notice that although the regression model is linear on the right side, the left side is a nonlinear function of the response variable π. This function is known as the logit link function, and because it is not linear, the usual least squares methods cannot be used to estimate the parameters. Instead, a method known as maximum likelihood is used to obtain these estimates.

Also, since π = P(y = 1), then 1 − π = P(y = 0). The ratio π/(1 − π) = P(y = 1)/P(y = 0) is known as the odds of the event y = 1 occurring. For example, if π = 0.8 then the odds of y = 1 occurring are 0.8/0.2 = 4, or 4 to 1. Therefore, this is often referred to as the log-odds model.

To perform the binary logistic regression in Minitab: Stat > Regression > Binary Logistic and enter Carrier for Response and P1 in Model. Note: the Factors window is for any variable(s) that are categorical.

Binary Logistic Regression: Carrier versus P1

Link Function: Logit

Response Information [1]

Variable  Value  Count
Carrier   1      32  (Event)
          0      38
          Total  70

Logistic Regression Table [2]

                                               Odds   95% CI
Predictor  Coef       SE Coef    Z      P      Ratio  Lower  Upper
Constant   -2.18913   0.604449   -3.62  0.000
P1         0.0303324  0.0097241  3.12   0.002  1.03   1.01   1.05

Log-Likelihood = -34.556 [3a]
Test that all slopes are zero: G = 27.414, DF = 1, P-Value = 0.000 [3b]

Goodness-of-Fit Tests [4]

Method           Chi-Square  DF  P
Pearson          43.1919     51  0.773
Deviance         48.6574     51  0.567
Hosmer-Lemeshow  2.6442      8   0.955

Table of Observed and Expected Frequencies: [5]
(See Hosmer-Lemeshow Test for the Pearson Chi-Square Statistic)

                           Group
Value   1    2    3    4    5    6    7    8    9    10   Total
1
  Obs   1    2    1    2    3    3    3    6    6    5    32
  Exp   1.2  1.5  1.6  2.1  2.4  2.7  3.5  5.3  6.7  5.0
0
  Obs   6    5    6    6    5    4    4    1    1    0    38
  Exp   5.8  5.5  5.4  5.9  5.6  4.3  3.5  1.7  0.3  0.0
Total   7    7    7    8    8    7    7    7    7    5    70

Measures of Association: [6]
(Between the Response Variable and Predicted Probabilities)

Pairs       Number  Percent
Concordant  971     79.9
Discordant  235     19.3
Ties        10      0.8
Total       1216    100.0

Summary Measures
Somers' D              0.61
Goodman-Kruskal Gamma  0.61
Kendall's Tau-a        0.30

Interpreting Output

1 - Response Information displays the number of missing observations and the number of observations that fall into each of the two response categories. The response value designated as the reference event is the first entry under Value and is labeled as the event. In this case, the reference event is Carrier = 1 (meaning the person is a carrier).

2 - Logistic Regression Table shows the estimated coefficients, standard errors of the coefficients, z-values, and p-values. When you use the logit link function, you also see the odds ratio and a 95% confidence interval for the odds ratio. If there are several independent variables, then the individual tests are useful for determining the importance of the individual variables. The odds ratio is computed as e^b1 and is the multiplicative change in the event odds, defined as P(event)/P(non-event), for a unit increase in the independent variable. Typically analysts compute e^b1 − 1, which is an estimate of the percentage increase (or decrease) in the odds P(y=1)/P(y=0) for every 1-unit increase in X (while holding the other X's fixed if they exist in the model). In this example the estimated odds ratio is e^0.0303324 = 1.03, and e^b1 − 1 = 0.03. For each additional unit increase in protein P1, the odds of being classified as a carrier increase by 3%.
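The arithmetic behind the odds-ratio column is easy to reproduce. Here is a short sketch in Python; only the P1 coefficient from the table above is needed.

# Reproducing the odds-ratio arithmetic from the logistic regression table.
import math

b1 = 0.0303324                     # estimated slope for P1
odds_ratio = math.exp(b1)          # change in odds per 1-unit increase in P1
print(round(odds_ratio, 2))        # 1.03, as in the Minitab table
print(round(100 * (odds_ratio - 1), 1))   # about a 3% increase per unit

# The change compounds: a 10-unit increase in P1 multiplies the odds
# by exp(10 * b1), not by 10 times exp(b1).
print(round(math.exp(10 * b1), 2))        # about 1.35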
Notice that this odds ratio is very close to 1 even though the test for P1 (p = 0.002) gives evidence that the estimated coefficient for P1 is not equal to zero. A more meaningful difference would be found if the odds ratio were further from 1. The interpretation of the slope itself is in terms of the change in log odds. For example, with a slope of 0.0303 we would state that for a unit increase in P1 the log odds of being a carrier increase by 0.0303.

NOTE: Some stat packages such as SAS present a chi-square statistic instead of the z-statistic. For large sample sizes either works; however, for smaller sample sizes the chi-square should be used. The relationship between the two is simply z² = chi-square with 1 degree of freedom. For this example, the chi-square statistic would be (3.12)² = 9.7344, which with 1 degree of freedom produces the same p-value.

From the output, our logistic regression model is:

π̂ = exp(−2.189 + 0.0303x1) / [1 + exp(−2.189 + 0.0303x1)]

3a - This is a measure of model fit and can be used to compare models by evaluating the difference between the −2 log-likelihood values of the various models. The test is to take the −2 log-likelihood of the smaller model minus the −2 log-likelihood of the larger model and find the p-value using a χ² test with degrees of freedom equal to the difference in the number of parameters estimated between the two models. NOTE: some software packages (e.g. SAS) report a −2 log-likelihood, which is simply −2 times the value provided by Minitab.

3b - Next is the statistic G, the log-likelihood ratio test, which is a chi-square test. This statistic tests the null hypothesis that all the coefficients associated with predictors (i.e., the slopes) equal zero versus the alternative that these coefficients are not all equal to zero. In this example, G = 27.414, with a p-value of 0.000, indicating that there is sufficient evidence the coefficient for P1 is different from zero.

4 - The goodness-of-fit tests, with p-values ranging from 0.567 to 0.955, indicate that there is insufficient evidence to claim that the model does not fit the data adequately. If a p-value were less than some stated α-level, the test would reject the null hypothesis of an adequate fit.

5 - Allows you to see how well the model fits the data by comparing the observed and expected frequencies. Here the observed and expected frequencies are similar, so there is insufficient evidence that the model does not fit the data well. This supports the conclusions made by the goodness-of-fit tests in 4.

6 - This portion provides measures of association to assess the quality of the model. These measures are based on an analysis of individual pairs of observations with different responses. In this example there are 38 zeroes and 32 ones; hence there are 38 × 32 = 1216 such pairs. A pair is deemed concordant if the observation with the higher response also has the higher estimated probability (i.e., the individual carrying the disease has the higher estimated probability of having the disease), discordant if the reverse is true, and tied if the estimated probabilities are identical. The numbers given are the counts and percentages of pairs in each of these classes; obviously, the higher the percentage of concordant pairs, the better the fit of the model. The right-hand portion of 6 gives three different measures of rank correlation computed from these quantities. These correlations can range from −1 to 1; a larger positive correlation implies a stronger relationship, i.e., stronger predictive validity for that particular model. These statistics can then be used to compare models. In this example the values range from 0.30 to 0.61, which implies reasonable predictive ability.
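To make the pairwise logic concrete, here is a small sketch in Python of how the concordant/discordant counts and Somers' D could be computed from the 0/1 responses and fitted probabilities. It follows the description above; it is not Minitab's own code.

# Sketch of the pairwise concordance logic described above: count
# concordant, discordant, and tied pairs among observations with
# different 0/1 responses, then form Somers' D.
import itertools

def pair_counts(y, p_hat):
    conc = disc = ties = 0
    for i, j in itertools.combinations(range(len(y)), 2):
        if y[i] == y[j]:
            continue                  # only pairs with different responses
        hi, lo = (i, j) if y[i] > y[j] else (j, i)   # hi is the event (y = 1)
        if p_hat[hi] > p_hat[lo]:
            conc += 1
        elif p_hat[hi] < p_hat[lo]:
            disc += 1
        else:
            ties += 1
    return conc, disc, ties

def somers_d(y, p_hat):
    c, d, t = pair_counts(y, p_hat)
    return (c - d) / (c + d + t)

# With the counts in the output above: (971 - 235) / 1216 = 0.61,
# matching the reported Somers' D.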
Logistic regression is also applicable to multi-level responses. The response may be ordinal (no pain, slight pain, substantial pain) or nominal (Democrat, Republican, Independent). For ordinal response outcomes, you can model functions called cumulative logits by performing ordered logistic regression using the proportional odds model.

Open Data > Logistic Regression > Ordinal

About the data: Male (0) and female (1) subjects received an active (1) or placebo (0) treatment for their arthritis pain, and the subsequent extent of improvement was recorded as marked (1), some (2), or none (3). One possible strategy would be to create dichotomous response variables by combining two of the response categories. However, since there is a natural ordering to these three response levels, it makes sense to consider a strategy that takes advantage of this ordering.

In Minitab select Stat > Regression > Ordinal Logistic. For Response enter Improve and for Model enter Gender, Treatment. Also, since both of these predictors are categorical variables, you need to enter them as Factors, too. Finally, since you are modeling factors, click Results and select "In addition, list of factor level values, and tests for terms with more than 1 degree of freedom."

Interpreting Output:

The response and factor information is interpreted similarly to that for binary logistic regression. From the Logistic Regression Table, the p-values are used to test for statistical evidence that the respective predictors have an effect on the response. Here, both p-values are small, indicating that Treatment and Gender have a statistically significant effect on Improvement. The value under each factor (i.e., 1 under Treatment and Gender) indicates which factor level is being compared to the other level. The values labeled Const(1) and Const(2) are estimated intercepts for the logits of the cumulative probabilities of marked improvement and of marked or some improvement, respectively. Because the cumulative probability for the last response value is 1, there is no need to estimate an intercept for no improvement.

The coefficients for the predictors represent the increments in log odds for Gender = female and Treatment = active, respectively. That is, e^b1 = e^1.319 = 3.74 means that females have 3.74 times the odds of showing improvement that males have, both for marked improvement versus some or no improvement and for marked or some improvement versus no improvement. Those subjects receiving the active drug have e^1.797 = 6.03 times the odds of showing improvement of those on placebo, again for both cumulative comparisons. The log-likelihood, the test that all slopes are zero, the goodness-of-fit tests, and the measures of association are interpreted similarly to those in binary logistic regression.

Comparing models using log-likelihood statistics

To test whether the addition of a covariate or covariates is statistically warranted, we can compare the log-likelihood from the smaller model to that from the larger model. Twice this difference follows a chi-square distribution with degrees of freedom equal to the difference in the number of parameters estimated.

Example: Create an interaction term of Gender x Treatment. Re-compute the logistic regression by entering this interaction term into the model and factor statements. The log-likelihood from the model containing only the main effects was −75.015, and from the model including the interaction we get −74.860. The difference is 75.015 − 74.860 = 0.155. Twice this value is 0.310, which follows a chi-square distribution with 1 degree of freedom (4 estimates in the smaller model versus 5 in the larger model). From Minitab we can calculate a p-value via Calc > Probability Distributions > Chi-Square: enter 1 as the degrees of freedom and 0.310 as the input constant. Taking 1 minus the cumulative probability gives a p-value of 0.578. This large p-value indicates that adding the interaction term to the main-effects model is not significant, i.e., we do not need the interaction term.
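The same likelihood-ratio computation can be scripted. A minimal sketch, assuming only the two log-likelihood values above and that scipy is available (its chi-square survival function gives the upper-tail probability directly):

# Likelihood-ratio comparison of the main-effects model with the model
# adding the Gender x Treatment interaction, using the log-likelihoods
# reported above.
from scipy.stats import chi2

ll_small = -75.015                     # main effects only (4 parameters)
ll_large = -74.860                     # with interaction (5 parameters)

lr_stat = 2 * (ll_large - ll_small)    # 0.310
df = 5 - 4                             # difference in parameter counts
p_value = chi2.sf(lr_stat, df)         # upper-tail chi-square probability

print(round(lr_stat, 3), round(p_value, 3))   # 0.31 0.578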
Ordinal Logistic Regression: Improve versus Gender, Treatment

Link Function: Logit

Response Information

Variable  Value  Count
Improve   1      28
          2      14
          3      42
          Total  84

Factor Information

Factor     Levels  Values
Gender     2       0, 1
Treatment  2       0, 1

Logistic Regression Table

                                              Odds   95% CI
Predictor  Coef      SE Coef   Z      P       Ratio  Lower  Upper
Const(1)   -2.66719  0.599697  -4.45  0.000
Const(2)   -1.81280  0.556609  -3.26  0.001
Gender
  1        1.31875   0.529188  2.49   0.013   3.74   1.33   10.55
Treatment
  1        1.79730   0.472822  3.80   0.000   6.03   2.39   15.24

Log-Likelihood = -75.015
Test that all slopes are zero: G = 19.887, DF = 2, P-Value = 0.000

Goodness-of-Fit Tests

Method    Chi-Square  DF  P
Pearson   1.91000     4   0.752
Deviance  2.71210     4   0.607

Measures of Association: [7]
(Between the Response Variable and Predicted Probabilities)

Pairs       Number  Percent
Concordant  1268    58.8
Discordant  324     15.0
Ties        564     26.2
Total       2156    100.0

Summary Measures
Somers' D              0.44
Goodman-Kruskal Gamma  0.59
Kendall's Tau-a        0.27

7 - The table of concordant, discordant, and tied pairs is calculated by forming all possible pairs of observations with different response values. Suppose the response values are 1, 2, and 3. Minitab pairs every observation with response value 1 with every observation with response value 2 or 3, and then pairs every observation with response value 2 with every observation with response value 3. The total number of pairs equals (count of 1s × count of 2s) + (count of 1s × count of 3s) + (count of 2s × count of 3s); in this example that is 28×14 + 28×42 + 14×42 = 2156.

To determine whether the pairs are concordant or discordant, Minitab calculates the cumulative predicted probabilities for each observation and compares these values within each pair. For pairs involving the lowest response value (1 in this example), the pair is concordant if the cumulative probability up to response 1 is greater for the observation whose response is 1 than for the observation with the higher response. For pairs of the two higher response values (2 and 3 here), the pair is concordant if the cumulative probability up to response 2 is greater for the observation with response 2 than for the observation with response 3.
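For readers who want to reproduce this fit outside Minitab, here is a minimal sketch using Python's statsmodels. The file name and column names are assumptions about how the data might be stored, and statsmodels parameterizes the proportional-odds model somewhat differently from Minitab.

# Sketch of the same proportional-odds fit in Python's statsmodels.
# The file name "arthritis.csv" and the column names are assumptions.
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

df = pd.read_csv("arthritis.csv")     # columns: Improve, Gender, Treatment

model = OrderedModel(
    df["Improve"],                    # ordered response: 1 < 2 < 3
    df[["Gender", "Treatment"]],      # 0/1 predictors
    distr="logit",                    # logit link, as in Minitab
)
fit = model.fit(method="bfgs", disp=False)
print(fit.summary())
# Note: statsmodels reports cutpoints rather than Minitab's Const(1)
# and Const(2), and sign conventions for the slopes can differ, so
# compare odds ratios (exp of the slopes) rather than raw coefficients.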
Recall our discussion of slope interpretation and the odds ratio for binary outcomes. Using Carrier (1 = Yes) and P1 as the predictor, the slope value was 0.0303, so the interpretation of a one-unit increase in P1 was a 3% increase in the odds of being a carrier. We arrived at this 3% by finding exp(b1) = exp(0.0303) = 1.03, which translates to the 3% increase in the odds. Furthermore, we agreed that this does not convert to an increase of 0.03 in the event probability, i.e., the probability of being a carrier does not increase by 0.03 for each unit increase in P1. In fact, we found the probability of being a carrier when P1 = 100 to be 0.699, and for P1 = 101 it was 0.706.

However, we can use these probabilities to calculate the odds ratio by comparing the odds of being a carrier at P1 = 101 to the odds at P1 = 100. Recall that the odds are found by π/(1 − π). This results in the following: for P1 = 101 the odds are 0.706/0.294 = 2.401, and for P1 = 100 the odds are 0.699/0.301 = 2.322. The odds ratio for this unit increase is therefore 2.401/2.322 = 1.03, the same odds ratio given by exp(b1).

This method extends to our ordinal logistic discussion. Using Gender with female (1) as the event and the response levels for Improvement of marked (1), some (2), and none (3), we had a slope for Gender of 1.319. This resulted in an odds ratio of exp(1.319) = 3.74.

Interpreting the slope based on a unit increase simply means comparing females to males, keeping in mind that we hold the Treatment factor constant. That is, we compare the genders among those receiving treatment, or compare the genders among those receiving placebo. We found the following response probabilities for females receiving treatment:

Marked: 0.610   Some: 0.176   None: 0.214

Calculating these probabilities for males receiving treatment we have:

Marked: 0.295   Some: 0.201   None: 0.504

Now for the odds ratio comparing females to males for, say, marked improvement versus some or none: for females the odds are 0.610/0.390 = 1.564, and for males the odds are 0.295/0.705 = 0.418. The odds ratio is then 1.564/0.418 = 3.74.

But we also said this 3.74 odds ratio would be the same when comparing marked or some improvement to none. For this we have the following odds: for females, 0.786/0.214 = 3.673, and for males, 0.496/0.504 = 0.984. The odds ratio is then 3.673/0.984 = 3.73, which matches 3.74 up to rounding of the probabilities. As you can see, the result is essentially the same. If you calculated the odds and odds ratios for the placebo group, you would again arrive at 3.74 as the odds ratio. The same logic applies to the odds ratio for Treatment.
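These checks are easy to script. The following sketch simply redoes the arithmetic above with the quoted response probabilities; the small discrepancy between the two ratios comes from rounding in the probabilities, not from the model.

# Re-doing the odds-ratio arithmetic above from the quoted response
# probabilities for the treatment group (females vs. males).
female = {"marked": 0.610, "some": 0.176, "none": 0.214}
male = {"marked": 0.295, "some": 0.201, "none": 0.504}

def odds(p):
    return p / (1 - p)

# Marked vs. some-or-none improvement
or_marked = odds(female["marked"]) / odds(male["marked"])

# Marked-or-some vs. none improvement (cumulative through "some")
or_cum = (odds(female["marked"] + female["some"])
          / odds(male["marked"] + male["some"]))

print(round(or_marked, 2), round(or_cum, 2))
# 3.74 and 3.73 -- equal up to rounding of the quoted probabilities,
# and both match exp(1.319) = 3.74 from the Gender coefficient.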