Logistic Regression
© 2009, KAISER PERMANENTE CENTER FOR HEALTH RESEARCH

Linear Regression Review
$\mathrm{FEV1} = \beta_0 + \beta_1\,\mathrm{Age} + \beta_2\,\mathrm{Height}$
$E(\mathrm{FEV1}) = \mu = \beta_0 + \beta_1\,\mathrm{Age} + \beta_2\,\mathrm{Height}$
where the data are assumed to be normally distributed with mean equal to $\mu$.
Models such as these are appropriate for continuous outcome measures such as FEV1, weight, and blood pressure.
What if our outcome is binary?

Common binary outcome measures
Healthy vs unhealthy
    E.g., heart disease (y/n), cancer (y/n), COPD (y/n)
Progressive disease vs stable disease
    Based on, e.g., cancer stage
Alive vs dead

Convenient coding of binary outcomes
COPD  = 0 if FEV1/FVC ≥ 0.70
      = 1 if FEV1/FVC < 0.70
Large = 0 if tumor size is “small”
      = 1 if tumor size is “large”
Dead  = 0 if alive
      = 1 if deceased
Note the use of 0/1 coding and descriptive names that define “1”.

The Logistic Regression Model
Consider the case of a binary indicator of vital status:
Dead  = 0 if alive
      = 1 if deceased
If Dead is coded 0/1, then its expected value is equal to the probability that Dead = 1, i.e., $E(\mathrm{Dead}) = P$ = probability of death.

The Logistic Regression Model
Suppose we want to model the association between vital status and age. If we fit the data using standard linear regression, our model would be of the form
$P = \beta_0 + \beta_1\,\mathrm{Age}$
That is, we assume the probability of death varies linearly with age.

The Logistic Regression Model
[Figure: linear regression fit of dead (vertical axis, 0 to 1) against age (horizontal axis, 20 to 100)]
When age = 60, the estimated value of dead = .6
Is this a sensible result?
What if the predicted value is >1 or <0?

The Logistic Regression Model
Logistic regression analysis is a tool for modeling binary data that overcomes some of the limitations of linear regression.
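The question raised above about predicted values above 1 or below 0 can be made concrete with a small sketch. The eight age/vital-status pairs below are made-up illustrative values, not data from these slides, and `fit_line` and `predict` are hypothetical helper names. The sketch fits an ordinary least-squares line to a 0/1 outcome and evaluates it at the ends of the age range.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


# Hypothetical data (assumption for illustration only): age and dead (0/1).
ages = [25, 35, 45, 55, 65, 75, 85, 95]
dead = [0, 0, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_line(ages, dead)


def predict(age):
    """Fitted 'probability' of death from the straight-line model."""
    return b0 + b1 * age


for age in (20, 60, 100):
    print(age, round(predict(age), 3))
```

With these made-up values the fitted line falls below 0 at age 20 and above 1 at age 100, which is the kind of result a probability model should not produce.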
Rather than assuming the data are normally distributed, which we know isn’t true, we first assume the data follow a binomial distribution. This implicitly assumes we have a series of 0/1 observations, each with probability P of being dead, i.e., of Dead = 1.

The Logistic Regression Model
Rather than assuming P is a linear combination of variables of interest, e.g.,
$P = \beta_0 + \beta_1\,\mathrm{Age} + \beta_2\,\mathrm{Male}$
we instead assume
$$P = \frac{e^{\beta_0 + \beta_1 \mathrm{Age} + \beta_2 \mathrm{Male}}}{1 + e^{\beta_0 + \beta_1 \mathrm{Age} + \beta_2 \mathrm{Male}}}$$
or equivalently,
$$\ln\!\left[\frac{P}{1-P}\right] = \beta_0 + \beta_1\,\mathrm{Age} + \beta_2\,\mathrm{Male}$$

The Logistic Regression Model
$$\ln\!\left[\frac{P}{1-P}\right] = \beta_0 + \beta_1\,\mathrm{Age} + \beta_2\,\mathrm{Male}$$
The function $\ln[P/(1-P)]$ is referred to as the “logit” of P, hence the term “logistic” regression!
Unlike the linear regression model, the logistic model has the desirable property that the probability P it implies is always between 0 and 1, while the logit itself can take any real value.
The logit also turns out to have some statistical properties that make it a particularly desirable function of P to estimate.

[Figure: sample logistic function, deadpr (vertical axis, 0 to .4) against age (horizontal axis, 20 to 100)]

The Logistic Regression Model
Interpretation of coefficients
$$\ln\!\left[\frac{P}{1-P}\right] = \beta_0 + \beta_1\,\mathrm{Age} + \beta_2\,\mathrm{Male}$$
Recall that $P/(1-P)$ is the odds of our outcome of interest, in this case death.
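The relationship between P, the odds, and the logit described above can be sketched in a few lines. The function names `logit`, `inv_logit`, and `odds` are illustrative choices, not part of the original slides. The sketch shows that the inverse of the logit maps any value of the linear predictor back to a probability between 0 and 1.

```python
import math


def logit(p):
    """ln[p / (1 - p)]: maps a probability in (0, 1) to the whole real line."""
    return math.log(p / (1.0 - p))


def inv_logit(eta):
    """e^eta / (1 + e^eta): maps any real linear predictor back to (0, 1).

    Written in two branches so that exp() is never called on a large
    positive number.
    """
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)


def odds(p):
    """Odds corresponding to probability p."""
    return p / (1.0 - p)


for eta in (-5, -1, 0, 1, 5):
    p = inv_logit(eta)
    # exp(eta) and odds(p) agree, because eta is the ln(odds).
    print(eta, round(p, 4), round(odds(p), 4), round(math.exp(eta), 4))
```

However large or small the linear predictor, the implied probability stays strictly inside 0 and 1, and exponentiating the linear predictor recovers the odds.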
The logit of P is therefore the ln(odds) of death, so the odds of death can be written
$$\mathrm{odds} = e^{\beta_0 + \beta_1 \mathrm{Age} + \beta_2 \mathrm{Male}}$$

The Logistic Regression Model
Interpretation of coefficients
$$\mathrm{odds} = e^{\beta_0 + \beta_1 \mathrm{Age} + \beta_2 \mathrm{Male}}$$
$$\mathrm{OR}(\text{male vs female}) = \frac{e^{\beta_0 + \beta_1 \mathrm{Age} + \beta_2}}{e^{\beta_0 + \beta_1 \mathrm{Age}}} = e^{\beta_2} \;\Rightarrow\; \beta_2 = \ln(\mathrm{OR}_{\text{males vs females}})$$

The Logistic Regression Model
Interpretation of coefficients
$$\mathrm{odds} = e^{\beta_0 + \beta_1 \mathrm{Age} + \beta_2 \mathrm{Male}}$$
Similarly, we can calculate the OR associated with an increase in age of 10 years as
$$\mathrm{OR}(\text{10 yr incr in age}) = \frac{e^{\beta_0 + \beta_1 (\mathrm{Age} + 10) + \beta_2 \mathrm{Male}}}{e^{\beta_0 + \beta_1 \mathrm{Age} + \beta_2 \mathrm{Male}}} = e^{10\beta_1} \;\Rightarrow\; 10\beta_1 = \ln(\mathrm{OR}_{\text{10 year increase in age}})$$

The Logistic Regression Model
Interpretation of coefficients

    Vital status    Ever smoker (X=1)    Never smoker (X=0)
    Dead (Y=1)      62 (14% = p1)        18 (9% = p0)
    Alive (Y=0)     384 (86%)            186 (91%)

Odds ratio for smokers to never-smokers:
$$\mathrm{OR} = \frac{p_1/(1-p_1)}{p_0/(1-p_0)} = 1.67$$

STATA logistic regression output for: logit dead smk

             Coef.        Std. Err.    z        P>|z|    [95% Conf. Interval]
    smk      .5118667     .28225       1.81     0.070    -.0413331    1.065067
    _cons    -2.335375    .2468438     -9.46    0.000    -2.81918     -1.85157

$\mathrm{OR} = e^{.512} = 1.67$

The Logistic Regression Model
Hypothesis testing and confidence intervals
Testing $H_0: \ln(\mathrm{OR}) = \beta_1 = 0$ vs. $H_a: \beta_1 \neq 0$
is equivalent to testing $H_0: \mathrm{OR} = e^{\beta_1} = 1$ vs. $H_a: e^{\beta_1} \neq 1$
Use the large-sample normality of the estimate $b_1$ of $\beta_1$ to compute p-values and to construct confidence limits:
$b_1 / SE(b_1)$ should look like a z-score under $H_0$, so use it to compute the p-value
$b_1 \pm 1.96 \times SE(b_1)$ is an approximate 95% confidence interval

STATA logistic regression output for: logit dead smk

             Coef.        Std. Err.    z        P>|z|    [95% Conf. Interval]
    smk      .5118667     .28225       1.81     0.070    -.0413331    1.065067
    _cons    -2.335375    .2468438     -9.46    0.000    -2.81918     -1.85157

The Logistic Regression Model
Computing CIs for the odds ratio
Because the sampling distribution of $b_1$ is closer to normal than that of $e^{b_1}$, we construct CIs for the ln(OR) and then exponentiate the limits to get the corresponding CI for the OR.
95% CI for ln(OR) = (-0.041, 1.065)
95% CI for OR = $(e^{-0.041}, e^{1.065})$ = (0.96, 2.90)

Variations in software output
STATA logistic regression output: logit dead smk

             Coef.        Std. Err.    z        P>|z|    [95% Conf. Interval]
    smk      .5118667     .28225       1.81     0.070    -.0413331    1.065067
    _cons    -2.335375    .2468438     -9.46    0.000    -2.81918     -1.85157

Default output in the log scale

STATA logistic regression output: logit dead smk, or

             Odds Ratio   Std. Err.    z        P>|z|    [95% Conf. Interval]
    smk      1.668403     .4709067     1.81     0.070    .9595094     2.901032

Output requested in the transformed scale

The Logistic Regression Model
Adjusting for potential confounder variables
Suppose we conduct a cross-sectional study to investigate the association between gender and COPD.
If P is the probability of having COPD and Male is a 0/1 indicator of male sex, then we might fit the logistic model
$$\ln\!\left[\frac{P}{1-P}\right] = \beta_0 + \beta_1\,\mathrm{Male}$$
to assess the OR for COPD associated with male sex.
Might this association be confounded by smoking status, and if so, how might we adjust for the potentially confounding effects of smoking?

The Logistic Regression Model
Adjusting for potential confounder variables
If ES is a 0/1 indicator of ever having smoked, we might fit the model
$$\ln\!\left[\frac{P}{1-P}\right] = \beta_0 + \beta_1\,\mathrm{Male} + \beta_2\,\mathrm{ES}$$
Under this model, we say the effect of male sex is now adjusted for the potentially confounding effect of having ever smoked.
The resulting odds ratio is analogous to the pooled OR that you would get from a stratified 2x2 table analysis that crosses Male by COPD for each level of ES.
We could adjust for additional potential confounders, including continuous variables, by adding them to the model as main effects.

The Logistic Regression Model
Adjusting for potential effect modification
Now suppose we want to know whether smoking modifies the effect of male sex on COPD prevalence.
In classical epidemiology this means we want to know if the OR associated with male sex varies by smoking status.
How would we test for the presence of effect modification in our logistic model?
As we learned previously, we use interaction terms!

The Logistic Regression Model
Adjusting for potential effect modification
$$\ln\!\left[\frac{P}{1-P}\right] = \beta_0 + \beta_1\,\mathrm{Male} + \beta_2\,\mathrm{ES} + \beta_3\,\mathrm{Male} \times \mathrm{ES}$$

    EvSmk    Male    model
    0        0       $\beta_0$
    0        1       $\beta_0 + \beta_1$
    1        0       $\beta_0 + \beta_2$
    1        1       $\beta_0 + \beta_1 + \beta_2 + \beta_3$

$\beta_1$ = ln(OR) for male sex in never smokers
$\beta_2$ = ln(OR) for ever smoking in women
$\beta_3$ = difference in ln(OR) for ever smoking between men & women
  = difference in ln(OR) for male sex between ever & never smokers
Testing $H_0: \beta_3 = 0$ is a test of whether there is effect modification.

Some Examples
STATA output from: logit dead smk, or

            Odds Ratio   Std. Err.    z       P>|z|    [95% Conf. Interval]
    Smk     1.668403     .4709067     1.81    0.070    .9595094    2.901032

STATA output from: logit dead age smk, or

            Odds Ratio   Std. Err.    z       P>|z|    [95% Conf. Interval]
    Smk     4.219142     1.397128     4.35    0.000    2.204738    8.074048
    Age1    1.108467     .0147424     7.74    0.000    1.079946    1.137742

• The second model gives the OR for death associated with smoking after adjusting for age.
• Note the change in the size of the smoking OR between the two models. What might explain this change?
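The unadjusted smoking odds ratio in the first model above can be reproduced by hand from the 2x2 counts given earlier (62 and 384 among ever smokers, 18 and 186 among never smokers). The sketch below is not the Stata implementation; it relies on the standard result that, for a single 0/1 predictor, the standard error of the ln(OR) is the square root of the sum of the reciprocal cell counts. The function name `odds_ratio_wald` and its argument names are illustrative.

```python
import math


def odds_ratio_wald(dead_exp, alive_exp, dead_unexp, alive_unexp, z_crit=1.959964):
    """Odds ratio from a 2x2 table, with a Wald z-test and CI built on the log scale."""
    odds_ratio = (dead_exp / alive_exp) / (dead_unexp / alive_unexp)
    log_or = math.log(odds_ratio)
    # Standard error of ln(OR): sqrt of the sum of reciprocal cell counts.
    se = math.sqrt(1 / dead_exp + 1 / alive_exp + 1 / dead_unexp + 1 / alive_unexp)
    z = log_or / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    # Build the CI for ln(OR), then exponentiate the limits.
    ci_log = (log_or - z_crit * se, log_or + z_crit * se)
    ci_or = (math.exp(ci_log[0]), math.exp(ci_log[1]))
    return {"or": odds_ratio, "log_or": log_or, "se": se,
            "z": z, "p": p_value, "ci_log": ci_log, "ci_or": ci_or}


result = odds_ratio_wald(62, 384, 18, 186)
for name, value in result.items():
    print(name, value)
```

The printed values can be compared line by line with the `logit dead smk` and `logit dead smk, or` output shown in the slides.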
Some Examples
In the PAD Trial, lay (non-medical) volunteers were trained to respond to cardiac arrests in public and to perform CPR.
Volunteers received retraining at various intervals to see how long it took before their CPR skills degraded to the point that they were unlikely to perform adequate CPR.
We want to know 1) whether the amount of time between trainings is related to CPR quality, and 2) whether the relationship between CPR quality and time between trainings differs across age groups.

Some Examples
We define the following variables:
Response variable:
cprok   = 0 if CPR performed during testing was inadequate
        = 1 if CPR performed during testing was adequate
Predictor variable:
agegt50 = 0 if age is ≤ 50
        = 1 if age is > 50
Predictor variable:
late    = 0 if volunteer was tested/retrained ≤ 7 months after initial training
        = 1 if volunteer was tested/retrained > 7 months after initial training

Some Examples
STATA output from: logit cprok late, or robust

                             Robust
               Odds Ratio    Std. Err.    z        P>|z|    [95% Conf. Interval]
    late       0.932         0.094        -0.69    0.490    0.7646    1.1372
    agegt50    0.4129        0.0486       -7.51    0.000    0.3278    0.5201

STATA output from: logit cprok late if agegt50==1, or robust

                             Robust
               Odds Ratio    Std. Err.    z        P>|z|    [95% Conf. Interval]
    late       1.3303        0.2252       1.69     0.092    0.9546    1.8538

STATA output from: logit cprok late if agegt50==0, or robust

                             Robust
               Odds Ratio    Std. Err.    z        P>|z|    [95% Conf. Interval]
    late       0.74          0.0976       -2.25    0.025    0.5764    0.9631

Some Examples
STATA output from: logit cprok late agegt50 agelate, or robust
(agelate = late*agegt50)

                             Robust
               Odds Ratio    Std. Err.    z        P>|z|    [95% Conf. Interval]
    Late       0.7709        1.090        -1.      0.066    0.5843    1.0170
    agegt50    0.4129        0.0486       -7.51    0.000    0.3278    0.5201
    agelate    1.6661        0.4134       2.06     0.040    1.0243    2.7099

We reject the null hypothesis of no interaction and conclude that the effect of time between trainings on CPR performance differs significantly between those over and under the age of 50.
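As a rough sketch of how the interaction model is read, the odds ratio for late retraining within each age group can be recovered from the reported odds ratios for Late and agelate: on the log scale the two coefficients add, so on the odds-ratio scale they multiply. The variable and function names below are illustrative, and only the two point estimates from the output above are used.

```python
import math

# Odds ratios reported in the interaction model output above.
OR_LATE = 0.7709      # late vs not late, when agegt50 = 0
OR_AGELATE = 1.6661   # multiplicative change in the late OR when agegt50 = 1

b_late = math.log(OR_LATE)
b_agelate = math.log(OR_AGELATE)


def or_late(agegt50):
    """OR for late vs not late at a given value of agegt50 (0 or 1)."""
    return math.exp(b_late + b_agelate * agegt50)


print("age <= 50:", round(or_late(0), 4))
print("age > 50: ", round(or_late(1), 4))
```

Under this model the late OR is below 1 in the younger group and about 1.28 in the older group; the interaction term is the ratio of those two odds ratios.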