Harvard-MIT Division of Health Sciences and Technology
HST.951J: Medical Decision Support, Fall 2005
Instructors: Professor Lucila Ohno-Machado and Professor Staal Vinterbo

6.873/HST.951 Medical Decision Support, Fall 2005
Logistic Regression: Maximum Likelihood Estimation
Lucila Ohno-Machado

Risk Score of Death from Angioplasty
[Figure: bar chart of the number of cases and the mortality risk in each risk-score category (0 to 2, 3 to 4, 5 to 6, 7 to 8, 9 to 10, >10). Unadjusted overall mortality rate = 2.1%.]

Linear Regression: Ordinary Least Squares (OLS)
• With $n$ data points, where $i$ is the subscript for each point, fit $\hat{y}_i = \beta_0 + \beta_1 x_i$ by minimizing the Sum of Squared Errors (SSE):
  $SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} [y_i - (\beta_0 + \beta_1 x_i)]^2$
[Figure: scatter plot of data points $(x_i, y_i)$ with the fitted line.]

Logit
• Equivalent forms of the logistic model:
  $p_i = \dfrac{1}{1 + e^{-(\beta_0 + \beta_1 x_i)}}$
  $p_i = \dfrac{e^{\beta_0 + \beta_1 x_i}}{e^{\beta_0 + \beta_1 x_i} + 1}$
  $\log\left[\dfrac{p_i}{1 - p_i}\right] = \beta_0 + \beta_1 x_i$
[Figure: the S-shaped curve of $p$ versus $x$, and the straight-line logit versus $x$.]

Increasing β
[Figure: three plots of the logistic curve for increasing values of $\beta_1$.]

Finding β0
• Baseline case (the green group, coded 0): $p_i = \dfrac{1}{1 + e^{-\beta_0}}$

             Blue (1)   Green (0)   Total
  Death          28          22        50
  Life           45          52        97
  Total          73          74       147

• $0.297 = \dfrac{1}{1 + e^{-\beta_0}}$, so $\beta_0 = -0.8616$

Odds ratio
• Odds: $p / (1 - p)$
• Odds ratio:
  $OR = \dfrac{p_{\text{death}|\text{blue}} / (1 - p_{\text{death}|\text{blue}})}{p_{\text{death}|\text{green}} / (1 - p_{\text{death}|\text{green}})} = \dfrac{28/45}{22/52} = 1.47$

What do coefficients mean?
• $e^{\beta_{\text{color}}} = OR_{\text{color}}$
• From the same table, $OR = \dfrac{28/45}{22/52} = 1.47$, so $\beta_{\text{color}} = \log(1.47) = 0.385$
• Plugging the coefficients back in:
  $p_{\text{blue}} = \dfrac{1}{1 + e^{-(-0.8616 + 0.385)}} = 0.383$
  $p_{\text{green}} = \dfrac{1}{1 + e^{0.8616}} = 0.297$

What do coefficients mean? (continuous predictor)
• For a predictor such as age, $e^{\beta_{\text{age}}} = OR_{\text{age}}$ for a one-unit increase:
  $OR = \dfrac{p_{\text{death}|\text{age}=50} / (1 - p_{\text{death}|\text{age}=50})}{p_{\text{death}|\text{age}=49} / (1 - p_{\text{death}|\text{age}=49})}$
  (illustrated on the slide with the same counts, the columns relabeled Age 49 and Age 50)

Why not search using OLS?
• OLS minimizes $SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ for the linear model $\hat{y}_i = \beta_0 + \beta_1 x_i$, but the model we want is the logit:
  $\log\left[\dfrac{p_i}{1 - p_i}\right] = \beta_0 + \beta_1 x_i$, i.e. $p_i = \dfrac{1}{1 + e^{-(\beta_0 + \beta_1 x_i)}}$
[Figure: binary outcomes $y$ plotted against $x$, compared with the logit scale.]

P(model | data)?
• If only an intercept is allowed, which value would it have?
[Figure: binary outcomes $y$ plotted against $x$.]

P(data | model)?
• $P(\text{data} \mid \text{model}) = \dfrac{P(\text{model} \mid \text{data})\, P(\text{data})}{P(\text{model})}$
• When comparing models:
  – $P(\text{model})$: assume all the same (i.e., the chance of being a model with high coefficients is the same as with low coefficients, etc.)
  – $P(\text{data})$: assume it is the same
• Then $P(\text{data} \mid \text{model}) \propto P(\text{model} \mid \text{data})$

Maximum Likelihood Estimation
• Maximize $P(\text{data} \mid \text{model})$
• Maximize the probability that we would observe what we observed (given the assumption of a particular model)
• Choose the best parameters of that particular model

Maximum Likelihood Estimation: Steps
• Define an expression for the probability of the data as a function of the parameters
• Find the values of the parameters that maximize this expression
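Before writing the likelihood down formally, the Blue/Green example above can be checked numerically. The sketch below is a minimal illustration added to these notes (not part of the original slides), using only Python's standard library: it recovers $\beta_0$ from the baseline (green) group, $\beta_{\text{color}}$ from the odds ratio, and the two fitted probabilities.

```python
from math import exp, log

# 2x2 table from the slides: columns Blue (coded 1) and Green (coded 0).
deaths = {"blue": 28, "green": 22}
lives = {"blue": 45, "green": 52}

# Baseline case (green): p = 1 / (1 + e^(-beta0)), so beta0 is the log-odds of death among greens.
p_green = deaths["green"] / (deaths["green"] + lives["green"])   # 22/74, about 0.297
beta0 = log(p_green / (1 - p_green))                             # about -0.86 (slides: -0.8616, from rounding p to 0.297)

# e^(beta_color) equals the odds ratio, so beta_color = log(OR).
odds_ratio = (deaths["blue"] / lives["blue"]) / (deaths["green"] / lives["green"])  # (28/45)/(22/52), about 1.47
beta_color = log(odds_ratio)                                     # about 0.385

# Plugging the coefficients back into the logistic model reproduces the observed group rates.
p_blue_hat = 1 / (1 + exp(-(beta0 + beta_color)))                # about 0.383
p_green_hat = 1 / (1 + exp(-beta0))                              # about 0.297

print(f"beta0 = {beta0:.4f}, beta_color = {beta_color:.4f}, OR = {odds_ratio:.2f}")
print(f"p_blue = {p_blue_hat:.3f}, p_green = {p_green_hat:.3f}")
```

The Newton-Raphson fit sketched after the summary recovers the same two coefficients as the maximum-likelihood estimates for this table.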
Likelihood Function
• Treating the observations as independent:
  $L = \Pr(Y) = \Pr(y_1, y_2, \ldots, y_n) = \Pr(y_1)\Pr(y_2)\cdots\Pr(y_n) = \prod_{i=1}^{n} \Pr(y_i)$

Likelihood Function (Binomial)
• For a binary outcome, $\Pr(y_i = 1) = p_i$ and $\Pr(y_i = 0) = 1 - p_i$, which can be written as
  $\Pr(y_i) = p_i^{y_i}(1 - p_i)^{1 - y_i}$

Log Likelihood Function
• $L = \prod_{i=1}^{n} \Pr(y_i) = \prod_{i=1}^{n} p_i^{y_i}(1 - p_i)^{1 - y_i} = \prod_{i=1}^{n} \left(\dfrac{p_i}{1 - p_i}\right)^{y_i} (1 - p_i)$
• $\log L = \sum_i y_i \log\left(\dfrac{p_i}{1 - p_i}\right) + \sum_i \log(1 - p_i)$
• Since the model is the logit, $\log[p_i/(1 - p_i)] = \beta x_i$ and $1 - p_i = 1/(1 + e^{\beta x_i})$, so
  $\log L = \sum_i y_i (\beta x_i) - \sum_i \log(1 + e^{\beta x_i})$

Maximize
• Set the derivative of $\log L = \sum_i y_i (\beta x_i) - \sum_i \log(1 + e^{\beta x_i})$ to zero:
  $\dfrac{\partial \log L}{\partial \beta} = \sum_i y_i x_i - \sum_i \hat{y}_i x_i = 0$
• Not easy to solve, because $\hat{y}_i = \dfrac{1}{1 + e^{-\beta x_i}}$ is non-linear in $\beta$; we need iterative methods. The most popular is Newton-Raphson.

Newton-Raphson
• Start with random or zero βs
• "Walk" in the "direction" that maximizes the likelihood
  – how big a step (Gradient, or Score)
  – which direction

Maximizing the Log Likelihood
[Figure: the log likelihood as a function of β; the first and second iterations move from $\beta_j$ to $\beta_{j+1}$, each raising the log likelihood above its initial value.]
• This is similar to the iterative method used for minimizing the error in gradient descent (neural nets).
[Figure: error surface with the initial error at $w_{\text{initial}}$, the negative derivative giving the direction of the change, and the final error at a local minimum, $w_{\text{trained}}$.]

Newton-Raphson Algorithm
• $\log L = \sum_i y_i(\beta x_i) - \sum_i \log(1 + e^{\beta x_i})$
• Gradient (score): $U(\beta) = \dfrac{\partial \log L}{\partial \beta} = \sum_i y_i x_i - \sum_i \hat{y}_i x_i$
• Hessian: $I(\beta) = \dfrac{\partial^2 \log L}{\partial \beta\, \partial \beta'} = -\sum_i x_i x_i' \hat{y}_i (1 - \hat{y}_i)$
• One step: $\beta_{j+1} = \beta_j - I^{-1}(\beta_j)\, U(\beta_j)$
  (a code sketch of this update appears after the summary below)

Convergence
• Criterion: $\dfrac{|\beta_{j+1} - \beta_j|}{|\beta_j|} < 0.0001$
• Convergence problems: complete and quasi-complete separation

Complete separation
• The MLE does not exist (i.e., the coefficient estimate is infinite)
[Figure: a predictor value that perfectly separates the $y = 0$ cases from the $y = 1$ cases.]

Quasi-complete separation
• The outcomes are separated except at tied predictor values, where the same value of $x$ shows different outcomes
[Figure: binary outcomes against $x$ with overlap only at a single predictor value.]

No (quasi-)complete separation: it is fine to find the MLE
[Figure: binary outcomes overlapping across the range of $x$.]

How good is the model?
• Is it better than predicting the same prior probability for everyone (i.e., a model with just $\beta_0$)?
• How well do the training data fit?
• How well does it generalize?

Generalized likelihood-ratio test
• Are $\beta_1, \beta_2, \ldots, \beta_n$ different from 0?
  $L = \prod_{i=1}^{n} \Pr(y_i) = \prod_{i=1}^{n} p_i^{y_i}(1 - p_i)^{1 - y_i}$
  $\log L = \sum_i [y_i \log p_i + (1 - y_i)\log(1 - p_i)]$
  $G = -2 \log L_0 + 2 \log L_1$
• $G$ has a $\chi^2$ distribution ($L_0$ is the likelihood of the model without the tested coefficients, $L_1$ with them)
• Note the related quantity:
  $\text{cross-entropy error} = -\sum_i [y_i \log p_i + (1 - y_i)\log(1 - p_i)]$

AIC, SC, BIC
• Used to compare models
• Akaike's Information Criterion, with $k$ parameters: $AIC = -2\log L + 2k$
• Schwarz Criterion (Bayesian Information Criterion), with $n$ cases: $BIC = -2\log L + k \log n$
  (a worked numeric example follows the summary below)

Summary
• Maximum Likelihood Estimation is used to find the parameters of models
• MLE maximizes the probability that the observed data would have been generated by the model
• Coming up: goodness of fit (how good are the predictions?)
  – How well do the training data fit?
  – How well does it generalize?
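To make the Newton-Raphson slides concrete, here is a minimal sketch of the update described above: the score $U(\beta)$, the Hessian $I(\beta)$, the step $\beta_{j+1} = \beta_j - I^{-1}(\beta_j)U(\beta_j)$, and the relative-change convergence criterion. It assumes NumPy is available; the function name `fit_logistic`, the iteration cap, and the small guard in the convergence test are additions of mine rather than part of the lecture.

```python
import numpy as np

def fit_logistic(X, y, tol=1e-4, max_iter=25):
    """Newton-Raphson for logistic regression.

    X: (n, k) design matrix whose first column is all ones (the intercept).
    y: (n,) array of 0/1 outcomes.
    """
    beta = np.zeros(X.shape[1])                      # start with zero betas
    for _ in range(max_iter):
        p_hat = 1.0 / (1.0 + np.exp(-X @ beta))      # predicted probabilities
        U = X.T @ (y - p_hat)                        # score: sum_i (y_i - p_hat_i) x_i
        I = -(X * (p_hat * (1.0 - p_hat))[:, None]).T @ X   # Hessian: -sum_i x_i x_i' p_hat_i (1 - p_hat_i)
        beta_new = beta - np.linalg.solve(I, U)      # step: beta_{j+1} = beta_j - I^{-1} U
        # Relative-change criterion from the slides; the small constant avoids dividing by zero.
        if np.max(np.abs(beta_new - beta) / (np.abs(beta) + 1e-8)) < tol:
            return beta_new
        beta = beta_new
    return beta

# Blue/Green example: x = 1 for the 73 blue cases, 0 for the 74 green cases.
x = np.array([1] * 73 + [0] * 74, dtype=float)
y = np.array([1] * 28 + [0] * 45 + [1] * 22 + [0] * 52, dtype=float)
X = np.column_stack([np.ones_like(x), x])
print(fit_logistic(X, y))    # approximately [-0.86, 0.39], matching the hand calculation
```

The loop is essentially the update that standard logistic-regression routines iterate, and it converges quickly on the Blue/Green data.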
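The generalized likelihood-ratio test and the information criteria can also be checked on the same 2×2 table. In the sketch below (again my own illustration, standard library only), $L_0$ is the intercept-only model, which gives everyone the overall death rate, and $L_1$ is the model that also uses color, whose fitted probabilities are the observed group rates.

```python
from math import log

def log_likelihood(cells):
    """Bernoulli log likelihood: sum over groups of deaths*log(p) + survivors*log(1 - p),
    where p is the fitted death probability in that group."""
    return sum(d * log(p) + s * log(1 - p) for d, s, p in cells)

# Intercept-only model L0: everyone gets the overall death rate 50/147.
logL0 = log_likelihood([(50, 97, 50 / 147)])

# Model with color, L1: fitted probabilities are the group rates 28/73 (blue) and 22/74 (green).
logL1 = log_likelihood([(28, 45, 28 / 73), (22, 52, 22 / 74)])

# Generalized likelihood-ratio statistic; compare with a chi-square on 1 degree of freedom.
G = -2 * logL0 + 2 * logL1

# Information criteria with k parameters and n cases.
n = 147
aic0, aic1 = -2 * logL0 + 2 * 1, -2 * logL1 + 2 * 2
bic0, bic1 = -2 * logL0 + 1 * log(n), -2 * logL1 + 2 * log(n)

print(f"logL0 = {logL0:.2f}, logL1 = {logL1:.2f}, G = {G:.2f}")
print(f"AIC: {aic0:.1f} vs {aic1:.1f}   BIC: {bic0:.1f} vs {bic1:.1f}")
```

With these counts $G$ comes out to roughly 1.2, below the 3.84 cutoff of a $\chi^2$ with one degree of freedom at the 0.05 level, and both AIC and BIC favor the intercept-only model, so color adds little here.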