Binary Logistic Regression: DrivDrnk versus DaysBeer

Logistic Regression, Stat 462 (Follow-up to Activity 2 of the April 5 Lab)
In some problems, the objective is to predict the probability that an event occurs. For instance, we might
want to use information about high school grades and SAT scores to predict the probability a student will
be successful in college. Or, we might want to predict the probability that a person will get heart disease
based on various lifestyle factors.
Difficulties With Ordinary Regression Model
We can define the y-variable as Y = 1 if the event occurs, 0 otherwise. Suppose p = probability that Y = 1
(the event occurs); thus, 1 - p is the probability that Y = 0. The theoretical mean value of Y is
E(Y) = (p)(1) + (1 - p)(0) = p. Thus, in a model such as E(Y) = β0 + β1X, the predicted values will indeed be
estimates of the probability that Y = 1.
But, there are at least two difficulties with using the ordinary regression model to estimate a probability:
1. There is an obvious violation of the assumption that the y-variable is normally distributed.
2. There are no constraints on the possible values for ŷ , so sometimes our estimated probabilities might
not be between 0 and 1, the range for legitimate probabilities.
The Binary Logistic Model
The simple logistic regression model for one X variable is

    p = e^(β0 + β1X) / (1 + e^(β0 + β1X))

where p = probability an observation is in a specific category of interest.

The multiple logistic regression model extends this to

    p = e^(β0 + β1X1 + β2X2 + ...) / (1 + e^(β0 + β1X1 + β2X2 + ...))

Maximum likelihood estimation is used to estimate the parameters. See Chapter 14 for details. And, the
model is constrained to give values of p that are between 0 and 1.
Algebraically equivalent expressions of the model are

    p / (1 - p) = e^(β0 + β1X)    and    log( p / (1 - p) ) = β0 + β1X.

The ratio p / (1 - p) is called the odds. It compares the probability that the event happens to the
probability that the event does not happen. The logistic model can therefore be described as “log of
odds = linear function of X.”
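The two equivalent forms of the model can be illustrated with a short Python sketch (an editorial illustration, not part of the original handout): `logistic` maps the linear predictor to a probability, and `logit` maps a probability back to the log-odds.

```python
import math

def logistic(x, b0, b1):
    # p = e^(b0 + b1*x) / (1 + e^(b0 + b1*x)), always strictly between 0 and 1
    z = b0 + b1 * x
    return math.exp(z) / (1 + math.exp(z))

def logit(p):
    # log of the odds p / (1 - p): the linear part of the model
    return math.log(p / (1 - p))

# Round trip: the log-odds of the modeled probability is the linear predictor
b0, b1 = -1.5, 0.2   # illustrative values, not fitted estimates
print(logit(logistic(10, b0, b1)))  # b0 + b1*10 = 0.5 (up to floating-point rounding)
```

Because the logistic function always returns a value in (0, 1), the estimated probabilities automatically satisfy the constraint that ordinary regression violates.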
Example (Activity 2 of the April 5 Lab)
Students in Stat 200 at Penn State were asked if they’ve ever driven while under the influence of alcohol
or drugs. They were also asked, “How many days per month do you drink at least two beers?” Here we model
the probability that somebody has driven under the influence as a function of the days per month of
drinking beer. Here are results from using Minitab’s “Binary Logistic Regression” (under
Stat > Regression).
Variable  Value  Count
DrivDrnk  Yes      122  (Event)
          No       127
          Total    249
Logistic Regression Table

                                                Odds      95% CI
Predictor      Coef   SE Coef      Z      P    Ratio   Lower  Upper
Constant    -1.5514    0.2661  -5.83  0.000
DaysBeer    0.19031   0.02946   6.46  0.000     1.21    1.14   1.28
Here, p = probability that a student has ever driven under the influence and X = days per month of
drinking at least two beers.
We see that 122/249 students said they have driven while under the influence. Parameter estimates are
β̂0 = -1.5514 and β̂1 = 0.19031. The variable X = DaysBeer is a statistically significant predictor (Z =
6.46, P = 0.000). The model for estimating p = probability of ever having driven while under the
influence is
e 1.55140.19031X
p̂ 
1  e 1.55140.19031X
For example, if X = 10 days per month of drinking beer,
e 1.55140.19031(10)
e .3517
1.42148
p̂ 


 0.587
1.5514 0.19031(10)
.3517
1  1.4218
1 e
1 e
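The hand calculation, and the short table of estimated probabilities below, can be checked with a few lines of Python using the estimated coefficients from the Minitab output:

```python
import math

b0, b1 = -1.5514, 0.19031  # estimates from the Minitab output

def p_hat(days):
    # fitted probability of ever having driven under the influence
    z = b0 + b1 * days
    return math.exp(z) / (1 + math.exp(z))

print(round(p_hat(10), 3))  # 0.587, matching the hand calculation
for days in (0, 4, 10, 20, 28):
    print(days, round(p_hat(days), 3))
```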
Here’s a plot of the estimated probability of ever having driven under the influence versus days per
month of drinking at least two beers.
A few estimated probabilities:

DaysBeer     p̂
       0   0.175
       4   0.312
      10   0.587
      20   0.905
      28   0.978
Notice the value under “Odds Ratio” in the output (given as 1.21). This is calculated as
e^b1 = e^0.19031. The value estimates the ratio of the odds of the event occurring (driving under the
influence) when the x-variable has the value X + 1 compared to when it has the value X. In other
words, it is the multiplicative effect on the odds due to increasing the x-variable by one unit.
Notice also that there is a confidence interval given for the odds ratio.
Note: In the theoretical model, the odds of the event at X + 1 equal e^(β0 + β1(X + 1)), while the odds
at X equal e^(β0 + β1X). The ratio is e^β1.
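Exponentiating the DaysBeer coefficient reproduces the reported odds ratio (again a quick Python check, not part of the Minitab output):

```python
import math

b1 = 0.19031                       # DaysBeer coefficient from the output
odds_ratio = math.exp(b1)          # multiplicative effect of one extra day
print(round(odds_ratio, 2))        # 1.21, the "Odds Ratio" in the output
```

So each additional day per month of drinking at least two beers multiplies the estimated odds of having driven under the influence by about 1.21.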
Multiple Logistic Regression
Suppose we add Gender (1 if male, 0 if female) as a predictor variable. Results are:
Logistic Regression Table

                                                Odds      95% CI
Predictor      Coef   SE Coef      Z      P    Ratio   Lower  Upper
Constant    -1.7736    0.2945  -6.02  0.000
DaysBeer    0.18693   0.03004   6.22  0.000     1.21    1.14   1.28
Gender
  male       0.6172    0.2954   2.09  0.037     1.85    1.04   3.31

Log-Likelihood = -139.981
Test that all slopes are zero: G = 65.125, DF = 2, P-Value = 0.000
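Using the fitted coefficients from this table, the model can be evaluated separately for males and females; the sketch below (an added illustration, not part of the handout) compares the two at DaysBeer = 10:

```python
import math

b0, b_days, b_male = -1.7736, 0.18693, 0.6172  # coefficients from the table

def p_hat(days, male):
    # male = 1 for males, 0 for females (the Gender indicator)
    z = b0 + b_days * days + b_male * male
    return math.exp(z) / (1 + math.exp(z))

print(round(p_hat(10, 1), 3))  # males:   0.671
print(round(p_hat(10, 0), 3))  # females: 0.524
```

At any fixed DaysBeer value, the male curve sits above the female curve because the odds for males are multiplied by e^0.6172 ≈ 1.85.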
Plot of estimated probabilities versus DaysBeer and gender.
-----------------------
Now we add as predictors Smoke = whether or not the student smokes cigarettes and Marijuan = whether or
not the student has smoked marijuana in the past 6 months.
Logistic Regression Table

                                                Odds      95% CI
Predictor      Coef   SE Coef      Z      P    Ratio   Lower  Upper
Constant    -2.1097    0.3201  -6.59  0.000
DaysBeer    0.13866   0.03244   4.27  0.000     1.15    1.08   1.22
Gender
  male       0.6963    0.3115   2.24  0.025     2.01    1.09   3.69
Smoke
  Yes        1.2029    0.4146   2.90  0.004     3.33    1.48   7.50
Marijuan
  Yes        0.9400    0.3299   2.85  0.004     2.56    1.34   4.89

Log-Likelihood = -128.923
Test that all slopes are zero: G = 87.241, DF = 4, P-Value = 0.000
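The two reported log-likelihoods also allow a likelihood-ratio (drop-in-deviance) comparison of this model against the earlier DaysBeer + Gender model; this check is an editorial addition, not part of the Minitab output. The difference in degrees of freedom is 2, and for 2 df the chi-square tail probability has the closed form e^(-G/2):

```python
import math

ll_reduced = -139.981  # DaysBeer + Gender model
ll_full    = -128.923  # adds Smoke and Marijuan

G = 2 * (ll_full - ll_reduced)   # likelihood-ratio statistic, here 22.116
p_value = math.exp(-G / 2)       # chi-square tail area for df = 2
print(round(G, 3))
print(p_value)                   # about 1.6e-05
```

The tiny p-value indicates that Smoke and Marijuan together add significant predictive value beyond DaysBeer and Gender, consistent with their individually significant Z tests.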