Logistic Regression for binary outcomes

In linear regression, Y is continuous. In logistic regression, Y is binary (0, 1) and the "average" of Y is a probability P. We cannot simply use linear regression because:
1. Y (and hence P) cannot be linearly related to the Xs, since P is bounded between 0 and 1.
2. Y does not have a Gaussian (normal) distribution around its "mean" P.
We need a "linearizing" transformation and a non-Gaussian error model.

The logit transformation
Since 0 <= P <= 1, we might use the odds = P/(1-P). The odds have no ceiling, but they still have a floor of zero. So we use the logit transformation
   ln(P/(1-P)) = ln(odds) = logit(P),
which has neither a floor nor a ceiling.

The model
   logit = ln(P/(1-P)) = β0 + β1 X1 + β2 X2 + ... + βk Xk
or
   odds = e^(β0 + β1 X1 + β2 X2 + ... + βk Xk) = e^logit
Since P = odds/(1 + odds) and odds = e^logit,
   P = e^logit/(1 + e^logit) = 1/(1 + e^-logit)
(A small code sketch of the logit and its inverse appears at the end of this section.)

[Figure: P (risk) plotted against logit = log odds over the range -4 to 4, showing the S-shaped logistic curve.]

The model is multiplicative on the odds scale
If ln(odds) = β0 + β1 X1 + β2 X2 + ... + βk Xk, then
   odds = (e^β0)(e^(β1 X1))(e^(β2 X2)) ... (e^(βk Xk))
or
   odds = (base odds) x OR1 x OR2 x ... x ORk
The base odds are the odds when all Xs = 0, and ORi is the odds ratio for the ith X.

Interpreting β coefficients

Example: dichotomous X
Let X = 0 for males and X = 1 for females, with logit(P) = β0 + β1 X.
   M: X = 0, logit(Pm) = β0
   F: X = 1, logit(Pf) = β0 + β1
   logit(Pf) - logit(Pm) = β1
So log(OR) = β1 and e^β1 = OR.

Example: P is the proportion with disease
   logit(P) = β0 + β1 age + β2 sex, where sex is coded 0 for M and 1 for F.
The OR for disease for F vs M is e^β2 when both are the same age. e^β1 is the OR for a one-year increase in age, and (e^β1)^k = e^(k β1) is the OR for a k-year difference in age between two people of the same gender.

Example: P is the proportion with an MI
Predictors: age in years, htn = hypertension (1 = yes, 0 = no), smoke = smoking (1 = yes, 0 = no).
   logit(P) = β0 + β1 age + β2 htn + β3 smoke
Q: What is the OR for a 40-year-old with hypertension vs an otherwise identical 30-year-old without hypertension?
A: (β0 + β1(40) + β2 + β3 smoke) - (β0 + β1(30) + β3 smoke) = 10 β1 + β2 = log OR, so OR = e^(10 β1 + β2). (See the sketch on combining coefficients at the end of this section.)

Interactions
P is the proportion with CHD. S: 1 = smoker, 0 = non-smoker. D: 1 = drinker, 0 = non-drinker.
   logit(P) = β0 + β1 S + β2 D + β3 SD
The referent category is S = 0, D = 0.

   S  D  odds                 OR
   0  0  e^β0                 OR00 = e^β0/e^β0 = 1
   1  0  e^(β0+β1)            OR10 = e^β1
   0  1  e^(β0+β2)            OR01 = e^β2
   1  1  e^(β0+β1+β2+β3)      OR11 = e^(β1+β2+β3)

When will OR11 = OR10 x OR01? If and only if β3 = 0.
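A minimal sketch of the logit and inverse-logit transformations described above, using only the Python standard library. The function names are mine, not from the text.

```python
import math

def logit(p):
    """ln(odds) = ln(p / (1 - p)) for 0 < p < 1."""
    return math.log(p / (1 - p))

def inv_logit(x):
    """P = e^logit / (1 + e^logit) = 1 / (1 + e^-logit)."""
    return 1.0 / (1.0 + math.exp(-x))

# Round-trip check: the two transformations are inverses.
for p in (0.1, 0.5, 0.9):
    assert abs(inv_logit(logit(p)) - p) < 1e-12

# A logit of 0 corresponds to P = 0.5 (odds = 1).
print(inv_logit(0.0))   # 0.5
```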
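The MI and interaction examples combine coefficients into odds ratios by adding the relevant βs and exponentiating. A small sketch of that arithmetic; the numeric coefficient values below are invented purely for illustration.

```python
import math

# MI example: logit(P) = b0 + b1*age + b2*htn + b3*smoke
b1, b2 = 0.05, 0.70                       # hypothetical coefficients for age and htn
print(math.exp(10 * b1 + b2))             # OR = e^(10*b1 + b2), 40 y/o with htn vs 30 y/o without

# Interaction example: logit(P) = b0 + b1*S + b2*D + b3*S*D
b1, b2, b3 = 0.4, 0.6, 0.0                # hypothetical; set b3 != 0 to break the equality
OR10, OR01 = math.exp(b1), math.exp(b2)
OR11 = math.exp(b1 + b2 + b3)
print(math.isclose(OR11, OR10 * OR01))    # True exactly when b3 == 0
```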
Interpretation example
Potential predictors (13) of in-hospital infection mortality (yes or no), from Crabtree et al., JAMA, 8 Dec 1999, No. 22, pp. 2143-2148:
- Gender (female or male)
- Age in years
- APACHE* score (0-129)
- Diabetes (y/n)
- Renal insufficiency / hemodialysis (y/n)
- Intubation / mechanical ventilation (y/n)
- Malignancy (y/n)
- Steroid therapy (y/n)
- Transfusions (y/n)
- Organ transplant (y/n)
- WBC count
- Maximum temperature (degrees)
- Days from admission to treatment (> 7 days)

Factors associated with mortality for all infections
   Characteristic        Odds ratio (95% CI)   p value
   Incr APACHE* score    1.15 (1.11-1.18)      <.001
   Transfusion (y/n)     4.15 (2.46-6.99)      <.001
   Increasing age        1.03 (1.02-1.05)      <.001
   Malignancy            2.60 (1.62-4.17)      <.001
   Max temperature       0.70 (0.58-0.85)      <.001
   Adm to treat > 7 d    1.66 (1.05-2.61)      0.03
   Female (y/n)          1.32 (0.90-1.94)      0.16
   *APACHE = Acute Physiology & Chronic Health Evaluation score

Diabetes complications - descriptive statistics
Table of obese by diabetes complication (frequencies):

                     diabetes complication
   obese         no (0)   yes (1)   Total   % yes
   no (0)          56       28        84    28/84 = 33%
   yes (1)         20       41        61    41/61 = 67%
   Total           76       69       145
   % obese         26%      59%

   RR = 2.0, OR = 4.1

Fasting glucose ("fast glu"), mg/dl
                      n    min    median   mean    max
   No complication    76   70.0    90.0     91.2   112.0
   Complication       69   75.0   114.0    155.9   353.0
   p < 0.001

Steady state glucose ("steady glu"), mg/dl
                      n    min    median   mean    max
   No complication    76   29.0   105.0    114.0   273.0
   Complication       69   60.0   257.0    261.5   480.0

Logistic model for diabetes complication
   Parameter     DF   beta      SE(b)    Chi-square   p
   Intercept      1   -14.70    3.231      20.706     <.0001
   obese          1     0.328   0.615       0.285     0.5938
   fast glu       1     0.108   0.031      12.456     0.0004
   steady glu     1     0.023   0.005      18.322     <.0001

   Log odds of diabetes complication = -14.7 + 0.328 obese + 0.108 fast glu + 0.023 steady glu

Statistical significance of the βs
In linear regression, t = b/SE gives the p value. In logistic regression, χ² = (b/SE)² gives the p value.
To get a confidence interval for an OR, first form the (95%) CI for β on the log scale,
   b - 1.96 SE,  b + 1.96 SE,
then take antilogs of each end:
   e^(b - 1.96 SE),  e^(b + 1.96 SE)

Diabetes complications - odds ratio estimates
   Effect       Point estimate      95% Wald confidence limits
   obese        e^0.328 = 1.388     0.416, 4.631
   fast glu     e^0.108 = 1.114     1.049, 1.182
   steady glu   e^0.023 = 1.023     1.012, 1.033
(A code sketch of this fit appears at the end of this section.)

Model fit - linear vs logistic regression
With k variables and n observations:

   Variation   df       Sum of squares or deviance
   Model       k        G
   Error       n-k-1    D
   Total       n-1      T   (fixed)

Let Yi be the ith observation and Ŷi the prediction for the ith observation. The analogous fit statistics are:

   Statistic         Linear regression      Logistic regression
   Residual fit      Residual SD (SDe)      Mean deviance = D/(n-k-1)
   R²                R² = Corr(Y, Ŷ)²       Pseudo R² = G/T; Cox-Snell R²
   Goodness of fit   --                     Hosmer-Lemeshow χ² ≈ Σ (Yi - Ŷi)²/Ŷi over groups

Good regression models have large G and small D. For logistic regression, the mean deviance D/(n-k-1) should be near 1.0. There are two versions of R² for logistic regression (pseudo and Cox-Snell).

Goodness of fit: deviance
Deviance in logistic regression plays the role of sums of squares in linear regression.

                df    -2 log L    p value
   Model (G)      3    117.21     < 0.001
   Error (D)    141     83.46
   Total (T)    144    200.67

   Mean deviance = 83.46/141 = 0.59 (we want the mean deviance to be <= 1)
   R²pseudo = G/Total = 117/201 = 0.58,  R²CS = 0.554

Goodness of fit: Hosmer-Lemeshow chi-square
Compare observed vs model-predicted (expected) frequencies by decile of predicted risk.

   decile   total   obs y   exp y   obs no   exp no
   1          16      0      0.23     16      15.8
   2          15      0      0.61     15      14.4
   3          15      0      1.31     15      13.7
   ...
   8          16     15     15.6       1       0.40
   9          23     23     23.0       0       0.00

   chi-square = 9.89, df = 7, p = 0.1946
(A minimal sketch of this calculation follows this section.)

Goodness of fit vs R²
How do we interpret a model whose goodness of fit is acceptable but whose R² is poor? Do we need to include interactions or transform the X variables in the model? Do we need to obtain more X variables?
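One way to reproduce the kind of output shown in the diabetes-complication tables (βs with SEs and Wald tests, then ORs with 95% Wald limits) is a logistic fit in Python with statsmodels. This is a sketch under the assumption that the data sit in a pandas DataFrame `df` with columns named complication, obese, fast_glu and steady_glu; those names are mine, not from the source.

```python
import numpy as np
import statsmodels.formula.api as smf

# df: one row per patient, with complication (0/1), obese (0/1),
# fast_glu and steady_glu (mg/dl) -- assumed to already be loaded.
model = smf.logit("complication ~ obese + fast_glu + steady_glu", data=df)
res = model.fit()

print(res.summary())           # betas, SE(b), Wald tests, p values

print(np.exp(res.params))      # ORs: point estimates e^b
print(np.exp(res.conf_int()))  # 95% Wald limits: e^(b - 1.96 SE), e^(b + 1.96 SE)
```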
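A minimal sketch of the Hosmer-Lemeshow idea described above: bin observations by deciles of predicted probability, compare observed and expected counts of events and non-events, and refer the statistic to a chi-square distribution. The arrays `y` (0/1 outcomes) and `p_hat` (model-predicted probabilities) are assumed to come from a fitted logistic model such as the one above; df = g - 2 is the conventional choice.

```python
import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(y, p_hat, g=10):
    y, p_hat = np.asarray(y, float), np.asarray(p_hat, float)
    order = np.argsort(p_hat)
    groups = np.array_split(order, g)          # roughly equal-size decile groups
    stat = 0.0
    for idx in groups:
        obs1, exp1 = y[idx].sum(), p_hat[idx].sum()      # observed / expected events
        obs0, exp0 = len(idx) - obs1, len(idx) - exp1    # observed / expected non-events
        stat += (obs1 - exp1) ** 2 / exp1 + (obs0 - exp0) ** 2 / exp0
    df = g - 2
    return stat, df, chi2.sf(stat, df)         # statistic, df, p value
```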
Sensitivity & specificity

                   True pos   True neg
   Classify pos       a          b
   Classify neg       c          d
   Total             a+c        b+d

   Sensitivity = a/(a+c),  false negative rate = c/(a+c)
   Specificity = d/(b+d),  false positive rate = b/(b+d)
   Accuracy = W x sensitivity + (1-W) x specificity

Any good classification rule, including a logistic model, should have high sensitivity and specificity. In logistic regression we choose a cutpoint Pc and predict positive if P > Pc, negative if P < Pc.

Diabetes complication
   logit(Pi) = -14.7 + 0.328 obese + 0.108 fast glu + 0.023 steady glu
   Pi = 1/(1 + exp(-logit))
Compute Pi for all observations and find the value of Pi (call it P0) that maximizes accuracy = 0.5 x sensitivity + 0.5 x specificity. This is an ROC analysis using the logit (or Pi).

[Figure: ROC curve for the logistic model.]

Diabetes model accuracy
At the optimal cutpoint, logit = 0.447 and P0 = e^0.447/(1 + e^0.447) = 0.61.

               True comp   True no comp
   Pred yes       55            11
   Pred no        14            65
   Total          69            76

   Sens = 55/69 = 79.7%,  Spec = 65/76 = 85.5%
   Accuracy = (79.7% + 85.5%)/2 = 82.6%

C statistic (report this)
Let n0 = number of negatives and n1 = number of positives, and form all n0 x n1 pairs of one positive (Y = 1) and one negative (Y = 0). A pair is concordant if the predicted P for the Y = 1 member is greater than the predicted P for the Y = 0 member, and discordant if it is smaller.
   C = (number concordant + 0.5 x number of ties) / (n0 x n1)
C = 0.949 for the diabetes complication model. (See the classification sketch at the end of this section.)

The logistic model is also a discriminant model (as in LDA).
[Figure: histograms of logit(P) scores for each group, frequency vs logit(P) over roughly -4 to 4.]

Poisson regression
Y is a count, a small non-negative integer: 0, 1, 2, ...
Model:
   ln(mean Y) = β0 + β1 X1 + β2 X2 + ... + βk Xk
so
   mean Y = exp(β0 + β1 X1 + β2 X2 + ... + βk Xk)
Since d(mean Y)/dXi = βi x mean Y, we have βi = [d(mean Y)/dXi]/mean Y, so 100 βi is (approximately) the percent change in mean Y per unit change in Xi.

Example: Depression (y/n)
Coding: female = 0 for M, 1 for F; chron ill = 0 for no, 1 for yes; income in $1000s.

Model for depression
   term        coeff = β   SE       p value
   Intercept   -1.8259     0.4495   0.0001
   female       0.8332     0.3882   0.0319
   chron ill    0.3578     0.3300   0.2782
   income      -0.0299     0.0135   0.0268

Equation for the logit (log odds), the depression "score":
   logit = -1.8259 + 0.8332 female + 0.3578 chron ill - 0.0299 income
   odds of depression = e^logit,  risk = odds/(1 + odds)

ORs
   term        coeff = β   OR = e^β
   Intercept   -1.8259     --
   female       0.8332     2.301
   chron ill    0.3578     1.430
   income      -0.0299     0.971
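A worked version of the depression "score" above: plug the fitted coefficients into the logit, exponentiate to get the odds, and convert to risk. The covariate values used in the call (a female with a chronic illness and income of $20,000) are chosen only for illustration.

```python
import math

def depression_risk(female, chron_ill, income_thousands):
    # Coefficients taken directly from the fitted depression model above.
    logit = -1.8259 + 0.8332 * female + 0.3578 * chron_ill - 0.0299 * income_thousands
    odds = math.exp(logit)            # odds of depression = e^logit
    return odds / (1 + odds)          # risk = odds / (1 + odds)

print(depression_risk(female=1, chron_ill=1, income_thousands=20))
```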
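A sketch of the classification summaries above: sensitivity and specificity at a chosen cutpoint, the weighted accuracy with W = 0.5, and the C statistic computed over all n0 x n1 positive/negative pairs (ties counted as 1/2). The arrays `y` and `p_hat` are assumed to hold the observed 0/1 outcomes and the model-predicted probabilities.

```python
import numpy as np

def sens_spec(y, p_hat, cut):
    y, pred = np.asarray(y), (np.asarray(p_hat) > cut).astype(int)
    sens = ((pred == 1) & (y == 1)).sum() / (y == 1).sum()   # a / (a + c)
    spec = ((pred == 0) & (y == 0)).sum() / (y == 0).sum()   # d / (b + d)
    return sens, spec

def c_statistic(y, p_hat):
    """Proportion of (Y=1, Y=0) pairs in which the Y=1 case has the larger
    predicted P, counting ties as 1/2 (equivalent to the ROC area)."""
    y, p = np.asarray(y), np.asarray(p_hat)
    p1, p0 = p[y == 1], p[y == 0]
    diff = p1[:, None] - p0[None, :]           # all n1 x n0 pairs
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

# y and p_hat are assumed to come from the fitted logistic model;
# 0.61 is the cutpoint P0 quoted above.
sens, spec = sens_spec(y, p_hat, cut=0.61)
print(sens, spec, 0.5 * sens + 0.5 * spec)     # weighted accuracy, W = 0.5
print(c_statistic(y, p_hat))
```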
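A sketch of a Poisson regression fit with statsmodels, matching the model ln(mean Y) = β0 + β1 X1 + ... described above. The DataFrame `df` and the column names `y` (count outcome) and `x` (predictor) are assumptions for illustration, not from the text.

```python
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

# df is assumed to hold a count outcome y and a predictor x.
res = smf.glm("y ~ x", data=df, family=sm.families.Poisson()).fit()
print(res.params)

# 100 * b is (approximately) the percent change in mean Y per unit change in x;
# exactly, a one-unit increase in x multiplies mean Y by e^b.
print(100 * res.params["x"], np.exp(res.params["x"]))
```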