Logistic Regression for binary outcomes

In linear regression, Y is continuous. In logistic regression, Y is binary (0, 1) and the "average" of Y is a probability P. We cannot simply use linear regression because:
1. Y (and hence P) cannot be linearly related to the Xs, since P is bounded between 0 and 1.
2. Y does not have a Gaussian (normal) distribution around its "mean" P.
We need a "linearizing" transformation and a non-Gaussian error model.

The logit transformation
Since 0 <= P <= 1, we might use the odds = P/(1-P). The odds have no ceiling, but they still have a floor of zero. So we use the logit transformation
   ln(P/(1-P)) = ln(odds) = logit(P),
which has neither a floor nor a ceiling.

The model
   logit = ln(P/(1-P)) = β0 + β1 X1 + β2 X2 + ... + βk Xk
or
   odds = e^(β0 + β1 X1 + β2 X2 + ... + βk Xk) = e^logit
Since P = odds/(1 + odds) and odds = e^logit,
   P = e^logit/(1 + e^logit) = 1/(1 + e^-logit)
(A small code sketch of the logit and its inverse appears at the end of this section.)

[Figure: P (risk) plotted against logit = log odds over the range -4 to 4, showing the S-shaped logistic curve.]

The model is multiplicative on the odds scale
If ln(odds) = β0 + β1 X1 + β2 X2 + ... + βk Xk, then
   odds = (e^β0)(e^(β1 X1))(e^(β2 X2)) ... (e^(βk Xk))
or
   odds = (base odds) x OR1 x OR2 x ... x ORk
The base odds are the odds when all Xs = 0, and ORi is the odds ratio for the ith X.

Interpreting β coefficients

Example: dichotomous X
Let X = 0 for males and X = 1 for females, with logit(P) = β0 + β1 X.
   M: X = 0, logit(Pm) = β0
   F: X = 1, logit(Pf) = β0 + β1
   logit(Pf) - logit(Pm) = β1
So log(OR) = β1 and e^β1 = OR.

Example: P is the proportion with disease
   logit(P) = β0 + β1 age + β2 sex, where sex is coded 0 for M and 1 for F.
The OR for disease for F vs M is e^β2 when both are the same age. e^β1 is the OR for a one-year increase in age, and (e^β1)^k = e^(k β1) is the OR for a k-year difference in age between two people of the same gender.

Example: P is the proportion with an MI
Predictors: age in years, htn = hypertension (1 = yes, 0 = no), smoke = smoking (1 = yes, 0 = no).
   logit(P) = β0 + β1 age + β2 htn + β3 smoke
Q: What is the OR for a 40-year-old with hypertension vs an otherwise identical 30-year-old without hypertension?
A: (β0 + β1(40) + β2 + β3 smoke) - (β0 + β1(30) + β3 smoke) = 10 β1 + β2 = log OR, so OR = e^(10 β1 + β2). (See the sketch on combining coefficients at the end of this section.)

Interactions
P is the proportion with CHD. S: 1 = smoker, 0 = non-smoker. D: 1 = drinker, 0 = non-drinker.
   logit(P) = β0 + β1 S + β2 D + β3 SD
The referent category is S = 0, D = 0.

   S  D  odds                 OR
   0  0  e^β0                 OR00 = e^β0/e^β0 = 1
   1  0  e^(β0+β1)            OR10 = e^β1
   0  1  e^(β0+β2)            OR01 = e^β2
   1  1  e^(β0+β1+β2+β3)      OR11 = e^(β1+β2+β3)

When will OR11 = OR10 x OR01? If and only if β3 = 0.
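A minimal sketch of the logit and inverse-logit transformations described above, using only the Python standard library. The function names are mine, not from the text.

```python
import math

def logit(p):
    """ln(odds) = ln(p / (1 - p)) for 0 < p < 1."""
    return math.log(p / (1 - p))

def inv_logit(x):
    """P = e^logit / (1 + e^logit) = 1 / (1 + e^-logit)."""
    return 1.0 / (1.0 + math.exp(-x))

# Round-trip check: the two transformations are inverses.
for p in (0.1, 0.5, 0.9):
    assert abs(inv_logit(logit(p)) - p) < 1e-12

# A logit of 0 corresponds to P = 0.5 (odds = 1).
print(inv_logit(0.0))   # 0.5
```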
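The MI and interaction examples combine coefficients into odds ratios by adding the relevant βs and exponentiating. A small sketch of that arithmetic; the numeric coefficient values below are invented purely for illustration.

```python
import math

# MI example: logit(P) = b0 + b1*age + b2*htn + b3*smoke
b1, b2 = 0.05, 0.70                       # hypothetical coefficients for age and htn
print(math.exp(10 * b1 + b2))             # OR = e^(10*b1 + b2), 40 y/o with htn vs 30 y/o without

# Interaction example: logit(P) = b0 + b1*S + b2*D + b3*S*D
b1, b2, b3 = 0.4, 0.6, 0.0                # hypothetical; set b3 != 0 to break the equality
OR10, OR01 = math.exp(b1), math.exp(b2)
OR11 = math.exp(b1 + b2 + b3)
print(math.isclose(OR11, OR10 * OR01))    # True exactly when b3 == 0
```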
Interpretation example
Potential predictors (13) of in-hospital infection mortality (yes or no), from Crabtree et al., JAMA, 8 Dec 1999, No. 22, pp. 2143-2148:
- Gender (female or male)
- Age in years
- APACHE* score (0-129)
- Diabetes (y/n)
- Renal insufficiency / hemodialysis (y/n)
- Intubation / mechanical ventilation (y/n)
- Malignancy (y/n)
- Steroid therapy (y/n)
- Transfusions (y/n)
- Organ transplant (y/n)
- WBC count
- Maximum temperature (degrees)
- Days from admission to treatment (> 7 days)

Factors associated with mortality for all infections
   Characteristic        Odds ratio (95% CI)   p value
   Incr APACHE* score    1.15 (1.11-1.18)      <.001
   Transfusion (y/n)     4.15 (2.46-6.99)      <.001
   Increasing age        1.03 (1.02-1.05)      <.001
   Malignancy            2.60 (1.62-4.17)      <.001
   Max temperature       0.70 (0.58-0.85)      <.001
   Adm to treat > 7 d    1.66 (1.05-2.61)      0.03
   Female (y/n)          1.32 (0.90-1.94)      0.16
   *APACHE = Acute Physiology & Chronic Health Evaluation score

Diabetes complications - descriptive statistics
Table of obese by diabetes complication (frequencies):

                     diabetes complication
   obese         no (0)   yes (1)   Total   % yes
   no (0)          56       28        84    28/84 = 33%
   yes (1)         20       41        61    41/61 = 67%
   Total           76       69       145
   % obese         26%      59%

   RR = 2.0, OR = 4.1

Fasting glucose ("fast glu"), mg/dl
                      n    min    median   mean    max
   No complication    76   70.0    90.0     91.2   112.0
   Complication       69   75.0   114.0    155.9   353.0
   p < 0.001

Steady state glucose ("steady glu"), mg/dl
                      n    min    median   mean    max
   No complication    76   29.0   105.0    114.0   273.0
   Complication       69   60.0   257.0    261.5   480.0

Logistic model for diabetes complication
   Parameter     DF   beta      SE(b)    Chi-square   p
   Intercept      1   -14.70    3.231      20.706     <.0001
   obese          1     0.328   0.615       0.285     0.5938
   fast glu       1     0.108   0.031      12.456     0.0004
   steady glu     1     0.023   0.005      18.322     <.0001

   Log odds of diabetes complication = -14.7 + 0.328 obese + 0.108 fast glu + 0.023 steady glu

Statistical significance of the βs
In linear regression, t = b/SE gives the p value. In logistic regression, χ² = (b/SE)² gives the p value.
To get a confidence interval for an OR, first form the (95%) CI for β on the log scale,
   b - 1.96 SE,  b + 1.96 SE,
then take antilogs of each end:
   e^(b - 1.96 SE),  e^(b + 1.96 SE)

Diabetes complications - odds ratio estimates
   Effect       Point estimate      95% Wald confidence limits
   obese        e^0.328 = 1.388     0.416, 4.631
   fast glu     e^0.108 = 1.114     1.049, 1.182
   steady glu   e^0.023 = 1.023     1.012, 1.033
(A code sketch of this fit appears at the end of this section.)

Model fit - linear vs logistic regression
With k variables and n observations:

   Variation   df       Sum of squares or deviance
   Model       k        G
   Error       n-k-1    D
   Total       n-1      T   (fixed)

Let Yi be the ith observation and Ŷi the prediction for the ith observation. The analogous fit statistics are:

   Statistic         Linear regression      Logistic regression
   Residual fit      Residual SD (SDe)      Mean deviance = D/(n-k-1)
   R²                R² = Corr(Y, Ŷ)²       Pseudo R² = G/T; Cox-Snell R²
   Goodness of fit   --                     Hosmer-Lemeshow χ² ≈ Σ (Yi - Ŷi)²/Ŷi over groups

Good regression models have large G and small D. For logistic regression, the mean deviance D/(n-k-1) should be near 1.0. There are two versions of R² for logistic regression (pseudo and Cox-Snell).

Goodness of fit: deviance
Deviance in logistic regression plays the role of sums of squares in linear regression.

                df    -2 log L    p value
   Model (G)      3    117.21     < 0.001
   Error (D)    141     83.46
   Total (T)    144    200.67

   Mean deviance = 83.46/141 = 0.59 (we want the mean deviance to be <= 1)
   R²pseudo = G/Total = 117/201 = 0.58,  R²CS = 0.554

Goodness of fit: Hosmer-Lemeshow chi-square
Compare observed vs model-predicted (expected) frequencies by decile of predicted risk.

   decile   total   obs y   exp y   obs no   exp no
   1          16      0      0.23     16      15.8
   2          15      0      0.61     15      14.4
   3          15      0      1.31     15      13.7
   ...
   8          16     15     15.6       1       0.40
   9          23     23     23.0       0       0.00

   chi-square = 9.89, df = 7, p = 0.1946
(A minimal sketch of this calculation follows this section.)

Goodness of fit vs R²
How do we interpret a model whose goodness of fit is acceptable but whose R² is poor? Do we need to include interactions or transform the X variables in the model? Do we need to obtain more X variables?
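One way to reproduce the kind of output shown in the diabetes-complication tables (βs with SEs and Wald tests, then ORs with 95% Wald limits) is a logistic fit in Python with statsmodels. This is a sketch under the assumption that the data sit in a pandas DataFrame `df` with columns named complication, obese, fast_glu and steady_glu; those names are mine, not from the source.

```python
import numpy as np
import statsmodels.formula.api as smf

# df: one row per patient, with complication (0/1), obese (0/1),
# fast_glu and steady_glu (mg/dl) -- assumed to already be loaded.
model = smf.logit("complication ~ obese + fast_glu + steady_glu", data=df)
res = model.fit()

print(res.summary())           # betas, SE(b), Wald tests, p values

print(np.exp(res.params))      # ORs: point estimates e^b
print(np.exp(res.conf_int()))  # 95% Wald limits: e^(b - 1.96 SE), e^(b + 1.96 SE)
```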
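A minimal sketch of the Hosmer-Lemeshow idea described above: bin observations by deciles of predicted probability, compare observed and expected counts of events and non-events, and refer the statistic to a chi-square distribution. The arrays `y` (0/1 outcomes) and `p_hat` (model-predicted probabilities) are assumed to come from a fitted logistic model such as the one above; df = g - 2 is the conventional choice.

```python
import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(y, p_hat, g=10):
    y, p_hat = np.asarray(y, float), np.asarray(p_hat, float)
    order = np.argsort(p_hat)
    groups = np.array_split(order, g)          # roughly equal-size decile groups
    stat = 0.0
    for idx in groups:
        obs1, exp1 = y[idx].sum(), p_hat[idx].sum()      # observed / expected events
        obs0, exp0 = len(idx) - obs1, len(idx) - exp1    # observed / expected non-events
        stat += (obs1 - exp1) ** 2 / exp1 + (obs0 - exp0) ** 2 / exp0
    df = g - 2
    return stat, df, chi2.sf(stat, df)         # statistic, df, p value
```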
Sensitivity & specificity

                   True pos   True neg
   Classify pos       a          b
   Classify neg       c          d
   Total             a+c        b+d

   Sensitivity = a/(a+c),  false negative rate = c/(a+c)
   Specificity = d/(b+d),  false positive rate = b/(b+d)
   Accuracy = W x sensitivity + (1-W) x specificity

Any good classification rule, including a logistic model, should have high sensitivity and specificity. In logistic regression we choose a cutpoint Pc and predict positive if P > Pc, negative if P < Pc.

Diabetes complication
   logit(Pi) = -14.7 + 0.328 obese + 0.108 fast glu + 0.023 steady glu
   Pi = 1/(1 + exp(-logit))
Compute Pi for all observations and find the value of Pi (call it P0) that maximizes accuracy = 0.5 x sensitivity + 0.5 x specificity. This is an ROC analysis using the logit (or Pi).

[Figure: ROC curve for the logistic model.]

Diabetes model accuracy
At the optimal cutpoint, logit = 0.447 and P0 = e^0.447/(1 + e^0.447) = 0.61.

               True comp   True no comp
   Pred yes       55            11
   Pred no        14            65
   Total          69            76

   Sens = 55/69 = 79.7%,  Spec = 65/76 = 85.5%
   Accuracy = (79.7% + 85.5%)/2 = 82.6%

C statistic (report this)
Let n0 = number of negatives and n1 = number of positives, and form all n0 x n1 pairs of one positive (Y = 1) and one negative (Y = 0). A pair is concordant if the predicted P for the Y = 1 member is greater than the predicted P for the Y = 0 member, and discordant if it is smaller.
   C = (number concordant + 0.5 x number of ties) / (n0 x n1)
C = 0.949 for the diabetes complication model. (See the classification sketch at the end of this section.)

The logistic model is also a discriminant model (as in LDA).
[Figure: histograms of logit(P) scores for each group, frequency vs logit(P) over roughly -4 to 4.]

Poisson regression
Y is a count, a small non-negative integer: 0, 1, 2, ...
Model:
   ln(mean Y) = β0 + β1 X1 + β2 X2 + ... + βk Xk
so
   mean Y = exp(β0 + β1 X1 + β2 X2 + ... + βk Xk)
Since d(mean Y)/dXi = βi x mean Y, we have βi = [d(mean Y)/dXi]/mean Y, so 100 βi is (approximately) the percent change in mean Y per unit change in Xi.

Example: Depression (y/n)
Coding: female = 0 for M, 1 for F; chron ill = 0 for no, 1 for yes; income in $1000s.

Model for depression
   term        coeff = β   SE       p value
   Intercept   -1.8259     0.4495   0.0001
   female       0.8332     0.3882   0.0319
   chron ill    0.3578     0.3300   0.2782
   income      -0.0299     0.0135   0.0268

Equation for the logit (log odds), the depression "score":
   logit = -1.8259 + 0.8332 female + 0.3578 chron ill - 0.0299 income
   odds of depression = e^logit,  risk = odds/(1 + odds)

ORs
   term        coeff = β   OR = e^β
   Intercept   -1.8259     --
   female       0.8332     2.301
   chron ill    0.3578     1.430
   income      -0.0299     0.971
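A worked version of the depression "score" above: plug the fitted coefficients into the logit, exponentiate to get the odds, and convert to risk. The covariate values used in the call (a female with a chronic illness and income of $20,000) are chosen only for illustration.

```python
import math

def depression_risk(female, chron_ill, income_thousands):
    # Coefficients taken directly from the fitted depression model above.
    logit = -1.8259 + 0.8332 * female + 0.3578 * chron_ill - 0.0299 * income_thousands
    odds = math.exp(logit)            # odds of depression = e^logit
    return odds / (1 + odds)          # risk = odds / (1 + odds)

print(depression_risk(female=1, chron_ill=1, income_thousands=20))
```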
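A sketch of the classification summaries above: sensitivity and specificity at a chosen cutpoint, the weighted accuracy with W = 0.5, and the C statistic computed over all n0 x n1 positive/negative pairs (ties counted as 1/2). The arrays `y` and `p_hat` are assumed to hold the observed 0/1 outcomes and the model-predicted probabilities.

```python
import numpy as np

def sens_spec(y, p_hat, cut):
    y, pred = np.asarray(y), (np.asarray(p_hat) > cut).astype(int)
    sens = ((pred == 1) & (y == 1)).sum() / (y == 1).sum()   # a / (a + c)
    spec = ((pred == 0) & (y == 0)).sum() / (y == 0).sum()   # d / (b + d)
    return sens, spec

def c_statistic(y, p_hat):
    """Proportion of (Y=1, Y=0) pairs in which the Y=1 case has the larger
    predicted P, counting ties as 1/2 (equivalent to the ROC area)."""
    y, p = np.asarray(y), np.asarray(p_hat)
    p1, p0 = p[y == 1], p[y == 0]
    diff = p1[:, None] - p0[None, :]           # all n1 x n0 pairs
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

# y and p_hat are assumed to come from the fitted logistic model;
# 0.61 is the cutpoint P0 quoted above.
sens, spec = sens_spec(y, p_hat, cut=0.61)
print(sens, spec, 0.5 * sens + 0.5 * spec)     # weighted accuracy, W = 0.5
print(c_statistic(y, p_hat))
```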
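A sketch of a Poisson regression fit with statsmodels, matching the model ln(mean Y) = β0 + β1 X1 + ... described above. The DataFrame `df` and the column names `y` (count outcome) and `x` (predictor) are assumptions for illustration, not from the text.

```python
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

# df is assumed to hold a count outcome y and a predictor x.
res = smf.glm("y ~ x", data=df, family=sm.families.Poisson()).fit()
print(res.params)

# 100 * b is (approximately) the percent change in mean Y per unit change in x;
# exactly, a one-unit increase in x multiplies mean Y by e^b.
print(100 * res.params["x"], np.exp(res.params["x"]))
```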