
Harvard-MIT Division of Health Sciences and Technology
HST.951J: Medical Decision Support, Fall 2005
Instructors: Professor Lucila Ohno-Machado and Professor Staal Vinterbo
6.873/HST.951 Medical Decision Support
Fall 2005
Logistic Regression: Maximum Likelihood Estimation
Lucila Ohno-Machado

Risk Score of Death from Angioplasty
Unadjusted Overall Mortality Rate = 2.1%

[Figure: bar chart of number of cases (0 to 3000) and mortality risk by risk score category (0 to 2, 3 to 4, 5 to 6, 7 to 8, 9 to 10, >10); mortality risk rises steeply with the score, from under 1% in the lowest category to over 50% in the highest.]
Linear Regression: Ordinary Least Squares (OLS)

Minimize the Sum of Squared Errors (SSE) over the n data points (i is the subscript for each point), with fitted values

\hat{y}_i = \beta_0 + \beta_1 x_i

SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} [y_i - (\beta_0 + \beta_1 x_i)]^2

[Figure: scatter plot of points x_1, ..., x_4 around the fitted regression line.]
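Below is a minimal sketch (mine, not part of the original slides) of fitting this OLS line in closed form; the data values and variable names are purely illustrative.

```python
import numpy as np

# Hypothetical data: fit y ~ beta0 + beta1 * x by least squares.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 2.9, 4.2, 4.8])

X = np.column_stack([np.ones_like(x), x])      # design matrix with intercept column
beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares solution
beta0, beta1 = beta

y_hat = beta0 + beta1 * x
sse = np.sum((y - y_hat) ** 2)                 # the SSE defined above
print(beta0, beta1, sse)
```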
Logit

p_i = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_i)}}

p_i = \frac{e^{\beta_0 + \beta_1 x_i}}{e^{\beta_0 + \beta_1 x_i} + 1}

\log\left[\frac{p_i}{1 - p_i}\right] = \beta_0 + \beta_1 x_i

[Figure: p plotted against x is a sigmoid curve; the logit of p plotted against x is a straight line.]
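A small numeric sketch (mine) showing that the two expressions for p_i agree and that the logit is linear in x_i; beta0 and beta1 are arbitrary illustrative values.

```python
import numpy as np

beta0, beta1 = -2.0, 0.3               # illustrative coefficients
x = np.linspace(0, 30, 7)

p1 = 1 / (1 + np.exp(-(beta0 + beta1 * x)))                        # first form
p2 = np.exp(beta0 + beta1 * x) / (np.exp(beta0 + beta1 * x) + 1)   # second form
assert np.allclose(p1, p2)

logit = np.log(p1 / (1 - p1))          # recovers the linear predictor
assert np.allclose(logit, beta0 + beta1 * x)
```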
Increasing β

[Figure: three panels of p plotted against x (x from 0 to 30); the sigmoid curve becomes steeper as β1 increases.]
Finding β0

• Baseline case

p_i = \frac{1}{1 + e^{-\beta_0}}

        Blue(1)   Green(0)   Total
Death      28        22        50
Life       45        52        97
Total      73        74       147

For the baseline (Green) group, p_death = 22/74 ≈ 0.297, so

0.297 = \frac{1}{1 + e^{-\beta_0}}  \Rightarrow  \beta_0 = -0.8616
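A quick numeric check (my own sketch): β0 is simply the log-odds of the baseline probability.

```python
import math

# Baseline (Green) group from the 2x2 table: 22 deaths out of 74 cases.
p_baseline = 22 / 74                              # ~= 0.297
beta0 = math.log(p_baseline / (1 - p_baseline))   # log-odds (logit) of the baseline rate
print(round(beta0, 4))                            # ~= -0.8604 (the slide's -0.8616 rounds p to 0.297 first)

# Inverting the logit recovers the baseline probability.
assert abs(1 / (1 + math.exp(-beta0)) - p_baseline) < 1e-9
```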
Odds ratio

• Odds: p/(1-p)
• Odds ratio:

OR = \frac{p_{death|blue} / (1 - p_{death|blue})}{p_{death|green} / (1 - p_{death|green})}

        Blue   Green   Total
Death     28      22      50
Life      45      52      97
Total     73      74     147

OR = \frac{28/45}{22/52} = 1.47
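A one-line check of this odds ratio from the 2x2 table (my own sketch).

```python
# Odds of death in each color group, read off the 2x2 table.
odds_blue = 28 / 45        # deaths / survivors among Blue
odds_green = 22 / 52       # deaths / survivors among Green
print(round(odds_blue / odds_green, 2))   # 1.47
```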
What do coefficients mean?

e^{\beta_{color}} = OR_{color}

        Blue   Green   Total
Death     28      22      50
Life      45      52      97
Total     73      74     147

OR = \frac{28/45}{22/52} = 1.47

e^{\beta_{color}} = 1.47  \Rightarrow  \beta_{color} = 0.385

p_{blue} = \frac{1}{1 + e^{-(-0.8616 + 0.385)}} = 0.383

p_{green} = \frac{1}{1 + e^{0.8616}} = 0.297
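A short sketch (mine) verifying that the color coefficient is the log of the odds ratio and that the fitted probabilities match the table.

```python
import math

beta0 = -0.8616                       # intercept from the baseline (Green) group
odds_ratio = (28 / 45) / (22 / 52)    # 1.47 from the 2x2 table
beta_color = math.log(odds_ratio)     # ~= 0.385: the coefficient is the log odds ratio

p_blue = 1 / (1 + math.exp(-(beta0 + beta_color)))    # ~= 0.383, i.e. about 28/73
p_green = 1 / (1 + math.exp(-beta0))                  # ~= 0.297, i.e. about 22/74
print(round(beta_color, 3), round(p_blue, 3), round(p_green, 3))
```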
What do coefficients mean?

e^{\beta_{age}} = OR_{age}

OR = \frac{p_{death|age=50} / (1 - p_{death|age=50})}{p_{death|age=49} / (1 - p_{death|age=49})}

        Age 49   Age 50   Total
Death      28       22       50
Life       45       52       97
Total      73       74      147

For a continuous predictor such as age, e^{β_age} is the odds ratio for a one-unit increase (here, age 50 versus age 49).
Why not search using OLS?

\hat{y}_i = \beta_0 + \beta_1 x_i

SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

\log\left[\frac{p_i}{1 - p_i}\right] = \beta_0 + \beta_1 x_i

[Figure: a straight OLS line fit through scattered points, contrasted with the logit model, which is linear in x only on the logit scale.]
P(model | data)?

p_i = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_i)}}

If only the intercept is allowed, which value would it have?

[Figure: binary outcomes y (0 or 1) plotted against x, with a fitted logistic curve.]
P(data | model)?

P(data | model) = [P(model | data) P(data)] / P(model)

When comparing models:
• P(model): assume it is the same for all models (i.e., a model with high coefficients is as likely a priori as one with low coefficients)
• P(data): assume it is the same

Then,

P(data | model) ∝ P(model | data)
Maximum Likelihood Estimation

• Maximize P(data | model)
• Maximize the probability that we would observe what we observed (given the assumption of a particular model)
• Choose the best parameters from the particular model
Maximum Likelihood Estimation

• Steps:
  – Define an expression for the probability of the data as a function of the parameters
  – Find the values of the parameters that maximize this expression
Likelihood Function

L = \Pr(Y)

L = \Pr(y_1, y_2, \ldots, y_n)

L = \Pr(y_1)\Pr(y_2)\cdots\Pr(y_n) = \prod_{i=1}^{n} \Pr(y_i)   (assuming independent observations)
Likelihood Function: Binomial

L = \Pr(Y) = \Pr(y_1, y_2, \ldots, y_n)

L = \Pr(y_1)\Pr(y_2)\cdots\Pr(y_n) = \prod_{i=1}^{n} \Pr(y_i)

\Pr(y_i = 1) = p_i
\Pr(y_i = 0) = 1 - p_i
\Pr(y_i) = p_i^{y_i}(1 - p_i)^{1 - y_i}
Likelihood Function

L = \prod_{i=1}^{n} \Pr(y_i) = \prod_{i=1}^{n} p_i^{y_i}(1 - p_i)^{1 - y_i}

L = \prod_{i=1}^{n} \left(\frac{p_i}{1 - p_i}\right)^{y_i}(1 - p_i)

\log L = \sum_i y_i \log\left(\frac{p_i}{1 - p_i}\right) + \sum_i \log(1 - p_i)

\log L = \sum_i y_i (\beta x_i) - \sum_i \log(1 + e^{\beta x_i})   (since the model is the logit)
Log Likelihood Function

\log L = \sum_i y_i \log\left(\frac{p_i}{1 - p_i}\right) + \sum_i \log(1 - p_i)

\log L = \sum_i y_i (\beta x_i) - \sum_i \log(1 + e^{\beta x_i})   (since the model is the logit)
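A small numeric sketch (mine) checking that the two forms of the log likelihood agree; here βx_i stands for the full linear predictor β0 + β1·x_i, and the data are purely illustrative.

```python
import numpy as np

# Hypothetical data and coefficients.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0, 0, 1, 1, 1])
beta0, beta1 = -2.0, 0.8
eta = beta0 + beta1 * x                  # linear predictor ("beta * x_i" on the slide)
p = 1 / (1 + np.exp(-eta))

# Form 1: in terms of the probabilities p_i.
loglik_p = np.sum(y * np.log(p / (1 - p))) + np.sum(np.log(1 - p))
# Form 2: in terms of the linear predictor, since the model is the logit.
loglik_eta = np.sum(y * eta) - np.sum(np.log(1 + np.exp(eta)))

assert np.isclose(loglik_p, loglik_eta)
```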
Maximize

\log L = \sum_i y_i (\beta x_i) - \sum_i \log(1 + e^{\beta x_i})

\frac{\partial \log L}{\partial \beta} = \sum_i y_i x_i - \sum_i \hat{y}_i x_i = 0

where

\hat{y}_i = \frac{1}{1 + e^{-\beta x_i}}

This is not easy to solve because ŷ is non-linear in β, so we need iterative methods; the most popular is Newton-Raphson.
Newton-Raphson

• Start with random or zero βs
• "Walk" in the "direction" that maximizes the log likelihood:
  – how big a step (Gradient or Score)
  – direction
Maximizing the Log Likelihood

[Figure: log likelihood plotted against β. From the initial β_i (initial LL), the first iteration moves to β_{i+1} with a higher log likelihood; the second iteration repeats the step from the new starting point.]
This is an iterative method similar to minimizing the error by gradient descent (as in neural nets).

[Figure: error surface over a weight w, showing the initial error at w_initial, steps along the negative derivative, and the final error near a local minimum at w_trained.]
Newton-Raphson Algorithm

\log L = \sum_i y_i (\beta x_i) - \sum_i \log(1 + e^{\beta x_i})

U(\beta) = \frac{\partial \log L}{\partial \beta} = \sum_i y_i x_i - \sum_i \hat{y}_i x_i   (Gradient, or score)

I(\beta) = \frac{\partial^2 \log L}{\partial \beta \, \partial \beta'} = -\sum_i x_i x_i' \hat{y}_i (1 - \hat{y}_i)   (Hessian)

\beta_{j+1} = \beta_j - I^{-1}(\beta_j)\, U(\beta_j)   (a step)
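Below is a compact sketch (my own; the function and variable names are illustrative) of this Newton-Raphson update for logistic regression, using the score U(β) and Hessian I(β) above. It assumes the design matrix already includes a column of ones for the intercept.

```python
import numpy as np

def newton_raphson_logistic(X, y, tol=1e-4, max_iter=25):
    """Fit logistic regression coefficients by Newton-Raphson.

    X: (n, k) design matrix including an intercept column of ones.
    y: (n,) array of 0/1 outcomes.
    """
    beta = np.zeros(X.shape[1])                 # start with zero betas
    for _ in range(max_iter):
        eta = X @ beta
        y_hat = 1 / (1 + np.exp(-eta))          # fitted probabilities
        U = X.T @ (y - y_hat)                   # score (gradient) U(beta)
        W = y_hat * (1 - y_hat)
        I = -(X.T * W) @ X                      # Hessian: -sum x_i x_i' y_hat_i (1 - y_hat_i)
        step = np.linalg.solve(I, U)            # I^{-1}(beta_j) U(beta_j)
        beta_new = beta - step                  # beta_{j+1} = beta_j - I^{-1} U
        # Convergence criterion from the slides: relative change below 0.0001.
        if np.linalg.norm(beta_new - beta) / max(np.linalg.norm(beta), 1e-12) < tol:
            return beta_new
        beta = beta_new
    return beta

# Usage: the color example as a single 0/1 predictor (Green = 0, Blue = 1).
color = np.concatenate([np.ones(73), np.zeros(74)])
death = np.concatenate([np.ones(28), np.zeros(45),    # Blue: 28 deaths, 45 survive
                        np.ones(22), np.zeros(52)])   # Green: 22 deaths, 52 survive
X = np.column_stack([np.ones_like(color), color])
print(newton_raphson_logistic(X, death))              # approx. [-0.860, 0.386], matching the 2x2 table
```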
Convergence

• Criterion:

\frac{\left|\beta_{j+1} - \beta_j\right|}{\left|\beta_j\right|} < 0.0001

• Convergence problems: complete and quasi-complete separation
Complete separation

The MLE does not exist (i.e., it is infinite): a predictor perfectly separates the two outcomes, so each iteration from β_i to β_{i+1} keeps increasing the coefficient.

[Figure: binary outcomes y plotted against x with the two classes completely separated; the fitted logistic curve steepens without bound across iterations.]
Quasi-complete separation

Same values for predictors, different outcomes.

[Figure: binary outcomes y plotted against x; the classes are separated except at tied predictor values.]
With no (quasi-)complete separation, it is fine to find the MLE.

[Figure: binary outcomes y plotted against x with overlapping classes.]
How good is the model?

• Is it better than predicting the same prior probability for everyone (i.e., a model with just β0)?
• How well do the training data fit?
• How well does it generalize?
Generalized likelihood-ratio test

• Are β1, β2, ..., βn different from 0?

L = \prod_{i=1}^{n} \Pr(y_i) = \prod_{i=1}^{n} p_i^{y_i}(1 - p_i)^{1 - y_i}

\log L = \sum_i \left[ y_i \log p_i + (1 - y_i)\log(1 - p_i) \right]

G = -2 \log L_0 + 2 \log L_1

where L_0 is the likelihood of the restricted (e.g., intercept-only) model and L_1 that of the full model; G has a χ2 distribution.

cross-entropy error = -\sum_i \left[ y_i \log p_i + (1 - y_i)\log(1 - p_i) \right]
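A brief sketch (mine; uses scipy only for the chi-square tail probability) of computing G for the color example, comparing an intercept-only model against a model with the color predictor.

```python
import numpy as np
from scipy.stats import chi2

def log_likelihood(p, y):
    """Bernoulli log likelihood for fitted probabilities p and 0/1 outcomes y."""
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

color = np.concatenate([np.ones(73), np.zeros(74)])
death = np.concatenate([np.ones(28), np.zeros(45), np.ones(22), np.zeros(52)])

p0 = np.full_like(death, death.mean())          # intercept-only: same probability for everyone
p1 = np.where(color == 1, 28 / 73, 22 / 74)     # fitted probabilities once color is included
G = -2 * log_likelihood(p0, death) + 2 * log_likelihood(p1, death)
print(round(G, 3), round(chi2.sf(G, df=1), 3))  # df = 1 extra parameter in the larger model
```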
AIC, SC, BIC

• To compare models
• Akaike's Information Criterion, with k parameters:

AIC = -2 \log L + 2k

• Schwarz Criterion (Bayesian Information Criterion), with n cases:

BIC = -2 \log L + k \log n
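A minimal helper (my own sketch) computing AIC and BIC from a model's log likelihood, number of parameters k, and number of cases n; the example values are hypothetical.

```python
import math

def aic(log_l, k):
    # Akaike's Information Criterion: -2 log L + 2k
    return -2 * log_l + 2 * k

def bic(log_l, k, n):
    # Schwarz / Bayesian Information Criterion: -2 log L + k log n
    return -2 * log_l + k * math.log(n)

# Hypothetical model: log likelihood -70.0, 2 parameters, 147 cases.
print(aic(-70.0, 2), bic(-70.0, 2, 147))
```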
Summary

• Maximum Likelihood Estimation is used to find parameters for models
• MLE maximizes the probability that the data obtained would have been generated by the model
• Coming up: goodness-of-fit (how good are the predictions?)
  – How well do the training data fit?
  – How well does it generalize?