A Broad Overview of Key Statistical Concepts

advertisement
Logistic regression for
binary response variables
Space shuttle example
• n = 24 space shuttle launches prior to
Challenger disaster on January 27, 1986
• Response y is an indicator variable
– y = 1 if O-ring failures during launch
– y = 0 if no O-ring failures during launch
• Predictor x1 is launch temperature, in
degrees Fahrenheit
Space shuttle example
Plot of Failure versus Temperature
Failure
Yes
No
1.0
0.5
0.0
50
60
70
Temperature
80
A model
The mean of a binary response
If there are 20% smokers and 80% non-smokers,
and Yi = 1, if smoker and 0, if non-smoker, then:
E Yi   (1)(0.20)  (0)(0.80)  0.20  P Yi  1
If pi = P (Yi = 1) and 1 – pi = P (Yi = 0), then:
E Yi   (1)( pi )  (0)(1  pi )  pi  P Yi  1
A linear regression model for
a binary response
If the simple linear regression model is:
Yi   0  1 xi   i
for Yi = 0, 1
Then, the mean response …
E Yi   P Yi  1   0  1 xi
… is the probability that Yi = 1 when the level
of the predictor variable is xi.
Space shuttle example
Regression Plot
failure = 2.43729 - 0.0306883 temp
S = 0.414476
R-Sq = 23.8 %
R-Sq(adj) = 20.3 %
Probability of Failure
1.0
0.5
0.0
50
60
70
Temperature
80
(Simple) logistic regression function
exp   x i 
pi  P Yi  1 
1  exp   x i 
exp  10  0.1x i 
pi  P Yi  1 
1  exp  10  0.1x i 
E(Y) = p
1.0
0.5
0.0
50
100
X
150
exp 10  0.1x i 
pi  P Yi  1 
1  exp 10  0.1x i 
E(Y) = p
1.0
0.5
0.0
50
100
X
150
Space shuttle example
exp 10.8  0.17 x i 
ˆpi  Pˆ Yi  1 
1  exp 10.8  0.17 x i 
Probability of failure
1.0
0.5
0.0
50
60
70
Temperature
80
Alternative formulation of (simple)
logistic regression function
exp   x i 
pi  P Yi  1 
1  exp   x i 
(algebra)
 pi
ln 
 1  pi
“logit”

    xi

Space shuttle example
 pˆ i
ln 
 1  pˆ i
2

  10.8  0.17 xi

1
Logit
0
-1
-2
-3
50
60
70
Temperature
80
Interpretation of
slope coefficients
Odds
If there are 20% smokers and 80% non-smokers:
0.80
Odds 
4
0.20
and
ln Odds  ln 4  1.39
“Odds are 4 to 1” … 4 non-smokers to 1 smoker.
If pi = P (Yi = 1) and 1 – pi = P (Yi = 0), then:
pi
Odds 
1  pi
and
 pi
ln Odds   ln 
 1  pi



Odds ratio
MALE: 20% smokers
and 80% non-smokers:
0.80
OddsM 
4
0.20
4
OR 
 2.67
1.5
FEMALE: 40% smokers
and 60% non-smokers:
0.60
OddsF 
 1.5
0.40
The odds that a male is a nonsmoker
is 2.67 times the odds that a female is
a nonsmoker.
Odds ratio
Group 2
Group 1
Odds1 
pi|1
1  pi|1
Odds2 
pi | 2
1  pi | 2
The odds ratio
pi|1
pi | 2
Odds1
OR 


Odds2 1  pi|1 1  pi|2
Space shuttle example
Predicted odds:
pˆ i
 exp 10.8  0.17 x i 
1  pˆ i
Predicted odds at x1 = 55 degrees:
pˆ i|55
 exp 10.8  0.1755  4.263
1  pˆ i|55
Predicted odds at x1 = 80 degrees:
pˆ i|80
 exp 10.8  0.1780  0.06081
1  pˆ i|80
Space shuttle example
Predicted odds ratio for x1 = 55 relative to x1 = 80:
pˆ i|55
pˆ i|80
exp 10.8  0.1755
4.263



 76
1  pˆ i|55 1  pˆ i|80 exp 10.8  0.1780  0.06081
The odds of O-ring failure at 55 degrees Fahrenheit
is 76 times the odds of O-ring failure at 80 degrees
Fahrenheit!
Interpretation of slope coefficients
The ratio of the odds at X1 = A relative to the
odds at X1 = B (for fixed values of other X’s)
is:
Odds A exp  A1 

 exp 1  A  B 
OddsB exp B1 
Estimation of
logistic regression coefficients
Maximum likelihood estimation
• Choose as estimates of the parameters the
values that assign the highest probability to
(“maximize likelihood of”) the observed
outcome.
exp 10  0.15 x i 
Suppose pi  P Yi  1 
1  exp 10  0.15 x i 
For first observation, Y1 = 1 and x1 = 53:
exp 10  0.15(53) 
P Y1  1 
 0.886
1  exp 10  0.15(53) 
… for second observation, Y2 = 1 and x2 = 56:
exp 10  0.15(56) 
P Y2  1 
 0.832
1  exp 10  0.15(56) 
… and for 24th observation, Y24 = 0 and x24 = 81:
P Y24  0  1 
exp 10  0.15(81) 
 0.896
1  exp 10  0.15(81) 
If α = 10 and β = -0.15, what is the
probability of observed outcome?
The likelihood of the observed outcome is:
P Y1  1, Y2  1,..., Y24  0 
 P Y1  1 P Y2  1 ...  P Y24  0 
 0.886  0.832  ...  0.896  4.82 10
6
The log likelihood of the observed outcome is:
ln P Y1  1, Y2  1,..., Y24  0
 ln P Y1  1 P Y2  1 ...  P Y24  0
 ln 0.886  ln 0.832  ...  ln 0.896  12.24
Maximum likelihood estimation
• Choose as estimates of the parameters the
values that assign the highest probability to
(“maximize likelihood of”) the observed
outcome.
exp 10.8  0.17 x i 
Suppose pi  P Yi  1 
1  exp 10.8  0.17 x i 
For first observation, Y1 = 1 and x1 = 53:
exp 10.8  0.17(53) 
P Y1  1 
 0.857
1  exp 10.8  0.17(53) 
… for second observation, Y2 = 1 and x2 = 56:
exp 10.8  0.17(56) 
P Y2  1 
 0.782
1  exp 10.8  0.17(56) 
… and for 24th observation, Y24 = 0 and x24 = 81:
P Y24  0  1 
exp 10.8  0.17(81) 
 0.951
1  exp 10.8  0.17(81) 
If α = 10.8 and β = -0.17, what is the
probability of observed outcome?
The likelihood of the observed outcome is:
P Y1  1, Y2  1,..., Y24  0
 P Y1  1 P Y2  1 ...  P Y24  0
 0.857  0.782  ...  0.951  9.97 10
6
The log likelihood of the observed outcome is:
ln P Y1  1, Y2  1,..., Y24  0
 ln P Y1  1 P Y2  1 ...  P Y24  0
 ln 0.857   ln 0.782  ...  ln 0.951  11.52
Space shuttle example
Link Function:
Logit
Response Information
Variable
failure
Value
1
0
Total
Count
7
17
24
(Event)
Logistic Regression Table
Predictor
Coef
SE Coef
Z
P
Constant 10.875
5.703
1.91 0.057
temp
-0.17132 0.08344 -2.05 0.040
Odds
Ratio
0.84
95% CI
Lower Upper
0.72
0.99
Properties of MLEs
• If a model is correct and the sample size is
large enough:
– MLEs are essentially unbiased.
– Formulas exist for estimating the standard
errors of the estimators.
– The estimators are about as precise as any
nearly unbiased estimators.
– MLEs are approximately normally distributed.
Test and confidence intervals
for single coefficients
Inference for βj
Test statistic:
Z
Confidence interval:
ˆ j   j
 
se ˆ j
follows approximate standard
normal distribution.
 
ˆ j  z / 2  se ˆ j
Space shuttle example
Link Function:
Logit
Response Information
Variable
failure
Value
1
0
Total
Count
7
17
24
(Event)
Logistic Regression Table
Predictor
Coef
SE Coef
Z
P
Constant 10.875
5.703
1.91 0.057
temp
-0.17132 0.08344 -2.05 0.040
Odds
Ratio
0.84
95% CI
Lower Upper
0.72
0.99
Space shuttle example
• There is sufficient evidence, at the α = 0.05
level, to conclude that temperature is related
to the probability of O-ring failure.
• For every 1-degree increase in temperature,
the odds ratio of O-ring failure to O-ring
non-failure is estimated to be 0.84 (95% CI
is 0.72 to 0.99).
Survival in the Donner Party
• In 1846, Donner and Reed families traveled
from Illinois to California by covered wagon.
• Group became stranded in eastern Sierra
Nevada mountains when hit by heavy snow.
• 40 of 87 members died from famine and
exposure.
• Are females better able to withstand harsh
conditions than are males?
Survival in the Donner Party
Probability of survival
0.9
0.8
0.7
Female
0.6
0.5
0.4
0.3
Male
0.2
0.1
0.0
15
25
35
45
Age
55
65
Survival in the Donner Party
Link Function:
Logit
Response Information
Variable
STATUS
Value
SURVIVED
DIED
Total
Count
20
25
45
(Event)
Logistic Regression Table
Predictor Coef
SE Coef
Z
P
Constant
1.633
1.110
1.47 0.141
AGE
-0.07820 0.03729 -2.10 0.036
Gender
1.5973
0.7555
2.11 0.034
Odds
Ratio
0.92
4.94
95% CI
Lower
Upper
0.86
1.12
0.99
21.72
Download