Logistic Regression
Stat 462 (Follow-up to Activity 2 of the April 5 Lab)

In some problems, the objective is to predict the probability that an event occurs. For instance, we might want to use information about high school grades and SAT scores to predict the probability that a student will be successful in college. Or, we might want to predict the probability that a person will get heart disease based on various lifestyle factors.

Difficulties With the Ordinary Regression Model

We can define the y-variable as Y = 1 if the event occurs, 0 otherwise. Suppose p = probability that Y = 1 (the event occurs); thus, 1 − p is the probability that Y = 0. The theoretical mean value of Y is

  E(Y) = (p)(1) + (1 − p)(0) = p.

Thus, in a model such as E(Y) = β0 + β1X, the predicted values will indeed be estimates of the probability that Y = 1. But there are at least two difficulties with using the ordinary regression model to estimate a probability:

1. There is an obvious violation of the assumption that the y-variable is normally distributed.
2. There are no constraints on the possible values for ŷ, so sometimes our estimated probabilities might not be between 0 and 1, the range for legitimate probabilities.

The Binary Logistic Model

The simple logistic regression model for one X variable is

  p = e^(β0 + β1X) / (1 + e^(β0 + β1X)),

where p = probability an observation is in a specific category of interest. The multiple logistic regression model extends this to

  p = e^(β0 + β1X1 + β2X2 + ...) / (1 + e^(β0 + β1X1 + β2X2 + ...)).

Maximum likelihood estimation is used to estimate the parameters; see Chapter 14 for details. And, the model is constrained to give values of p that are between 0 and 1.

Algebraically equivalent expressions of the model are

  p / (1 − p) = e^(β0 + β1X)   and   log( p / (1 − p) ) = β0 + β1X.

The ratio p / (1 − p) is called the odds. It compares the probability that an event happens to the probability that the event does not happen.
The logistic model can be described as: log odds = linear function of X.

Example (Activity 2 of the April 5 Lab)

Students in Stat 200 at Penn State were asked if they've ever driven while under the influence of alcohol or drugs. They were also asked, "How many days per month do you drink at least two beers?" Here we model the probability that somebody has driven under the influence as a function of the days per month of drinking beer. Here are results from using Minitab's "Binary Logistic Regression" (under Stat > Regression).

  Variable   Value   Count
  DrivDrnk   Yes       122   (Event)
             No        127
             Total     249

  Logistic Regression Table

  Predictor      Coef   SE Coef      Z      P   Odds Ratio   95% CI Lower   Upper
  Constant    -1.5514    0.2661  -5.83  0.000
  DaysBeer    0.19031   0.02946   6.46  0.000         1.21           1.14    1.28

Here, p = probability that a student has ever driven under the influence and X = days per month of drinking at least two beers. We see that 122/249 students said they have driven while under the influence. Parameter estimates are β̂0 = −1.5514 and β̂1 = 0.19031. The variable X = DaysBeer is a statistically significant predictor (Z = 6.46, P = 0.000).

The model for estimating p = probability of ever having driven while under the influence is

  p̂ = e^(−1.5514 + 0.19031X) / (1 + e^(−1.5514 + 0.19031X)).

For example, if X = 10 days per month of drinking beer,

  p̂ = e^(−1.5514 + 0.19031(10)) / (1 + e^(−1.5514 + 0.19031(10))) = e^0.3517 / (1 + e^0.3517) = 1.4215 / 2.4215 ≈ 0.587.

[Plot: estimated probability of ever having driven under the influence versus days per month of drinking at least two beers.]

A few estimated probabilities:

  DaysBeer      p̂
         0   0.175
         4   0.312
        10   0.587
        20   0.905
        28   0.978

Notice the value under "Odds Ratio" in the output (given as 1.21). This is calculated as e^b1 = e^0.19031. The value estimates the ratio of the odds for the event occurring (driving under the influence) when the x-variable has the value X + 1 compared to when it has the value X.
In other words, it is the multiplicative effect on the odds due to increasing the x-variable by one unit. Notice also that a confidence interval is given for the odds ratio.

Note: In the theoretical model, the odds of the event at X + 1 equal e^(β0 + β1(X + 1)), while the odds at X equal e^(β0 + β1X). The ratio is e^β1.

Multiple Logistic Regression

Suppose we add Gender (1 if male, 0 if female) as a predictor variable. Results are:

  Logistic Regression Table

  Predictor      Coef   SE Coef      Z      P   Odds Ratio   95% CI Lower   Upper
  Constant    -1.7736    0.2945  -6.02  0.000
  DaysBeer    0.18693   0.03004   6.22  0.000         1.21           1.14    1.28
  Gender
    male       0.6172    0.2954   2.09  0.037         1.85           1.04    3.31

  Log-Likelihood = -139.981
  Test that all slopes are zero: G = 65.125, DF = 2, P-Value = 0.000

[Plot: estimated probabilities versus DaysBeer, by gender.]

Now we add as predictors Smoke = whether or not the student smokes cigarettes and Marijuan = whether or not the student has smoked marijuana in the past 6 months.

  Logistic Regression Table

  Predictor      Coef   SE Coef      Z      P   Odds Ratio   95% CI Lower   Upper
  Constant    -2.1097    0.3201  -6.59  0.000
  DaysBeer    0.13866   0.03244   4.27  0.000         1.15           1.08    1.22
  Gender
    male       0.6963    0.3115   2.24  0.025         2.01           1.09    3.69
  Smoke
    Yes        1.2029    0.4146   2.90  0.004         3.33           1.48    7.50
  Marijuan
    Yes        0.9400    0.3299   2.85  0.004         2.56           1.34    4.89

  Log-Likelihood = -128.923
  Test that all slopes are zero: G = 87.241, DF = 4, P-Value = 0.000
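In the multiple model, each "Odds Ratio" entry is still e^coef for that predictor, and a fitted probability is computed by summing all the relevant terms in the linear predictor. A Python sketch using the coefficients from the four-predictor table above (the particular student profile at the end is hypothetical, chosen only to illustrate the calculation):

```python
import math

# Coefficients from the final Minitab output (four-predictor model)
coefs = {"Constant": -2.1097, "DaysBeer": 0.13866,
         "Gender_male": 0.6963, "Smoke_yes": 1.2029, "Marijuan_yes": 0.9400}

def p_hat_multi(days_beer, male, smokes, marijuana):
    """Estimated probability from the multiple logistic model.

    male, smokes, marijuana are 0/1 indicator variables."""
    z = (coefs["Constant"] + coefs["DaysBeer"] * days_beer
         + coefs["Gender_male"] * male + coefs["Smoke_yes"] * smokes
         + coefs["Marijuan_yes"] * marijuana)
    return math.exp(z) / (1 + math.exp(z))

# Each odds ratio in the table is e^coef, e.g. for Smoke = Yes:
print(round(math.exp(coefs["Smoke_yes"]), 2))   # 3.33

# Hypothetical student: 10 days/month of beer, male, smokes, no marijuana
print(round(p_hat_multi(10, 1, 1, 0), 3))
```

Each indicator's odds ratio (holding the other predictors fixed) multiplies the odds: for example, a smoker's odds of having driven under the influence are estimated to be about 3.33 times a nonsmoker's.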