Logistic regression The chi-square test looks at the effect of one variable (such as treatment: drug or placebo) on another variable (such as recovery or not). This is analogous to the t-test, which looks at the effect of one variable (treatment) on another variable (such as blood pressure). If another variable (such as age) can affect the response (recovery, blood pressure), it is often very helpful to control for effects of that variable, by including it in the analysis. There are extensions of the chi-square test to look at the effects of two or more variables on a response variable. We will look now at logistic regression, an alternative method to examine the effects of two or more variables on a binary response variable. A good text is David Kleinbaum's "Logistic Regression". Logistic regression assigns an object to a class, or gives the probability of class membership: chest pain: asthma vs heart attack whale vs. submarine apple vs. orange vs. banana nucleotide in a DNA sequence: {A, C, G, T} We use logistic regression when the dependent (y) variable is a binary categorical variable, for example live/die, yes/no, success/failure. Success is usually encoded as 1, failure as 0. Often, we are interested in determining if a particular independent variable (such as level of a protein in the blood) is associated with different outcomes. We’ve used the chi-square test and the Fisher Exact test work when the dependent variable is a binary categorical variable. Logistic regression extends these tests by allowing us to include continuous covariates in the model. Logistic regression requires the use of logs and odds ratios, so we’ll do a quick review. If you don’t want to see the review, skip ahead to the examples below. Review of log and exp functions base e = 2.718282 e^k is written as exp(k) exp(1) = e^1 = 2.718282. log(x) = k exp(k) = e^k = x. log(9) = 2.197 exp(2.197) = e^2.197 = 8.997979. log(1) = 0 exp(0) = e^0 = 1. exp(log(x)) = x exp(log(35)) Odds ratio Odds = (Probability) / (1 – Probability) = P / (1-P) If the probability of an event is P = 0.5, the odds are 0.5/(1-0.5) = 0.5/0.5 = 1 If the probability of an event is P = 0.25, the odds are 0.25/(1-0.25) = 0.25/0.75 = 0.33 Probability Odds = P/(1-P) 0.0001 0.0001 0.001 0.0010 0.01 0.0101 0.05 0.05 0.1 0.11 0.2 0.25 0.3 0.43 0.4 0.67 0.5 1.00 0.6 1.50 0.7 2.33 0.8 4.00 0.9 9.00 0.99 99.00 If two events have the same probability, then they have the same odds their odds ratio is 1 the log odds are 0 because log(1) = 0 If the log odds ratio is greater than zero, the event in the numerator has greater probability than the event in the denominator. If the log odds ratio is less than zero, the event in the numerator has less probability than the event in the denominator. Logistic regression We have two variables: 1. X is a continuous variable, such as gene expression or a mother’s weight 2. Y is a binary variable, such as cancer/not, premature/not We encode Y as a Bernoulli variable with success = 1 and failure = 0. The probability of success is , the Greek letter Pi. In logistic regression, rather than modeling Y which can only take values of 0 or 1 (but not values in between), we instead model the probability of success , which can take values between 0 and 1. It is most convenient to model the probabilities on a transformed scale: logit Pi = log( Pi / (1-Pi)) = beta0 + beta1 * x. logit Pi = log( Pi / (1-Pi)) is the log odds of Pi, the probability of success, which is the probability that Y = 1. # Plot of a logistic regression curve, where the probability of success P(Y =1) increases with X. beta0 = 0 beta1 = 1 x.range = -10:10 Pi.result.vector= c() for (index in 1:length(x.range)) { Pi.result.vector[index] = exp(beta0 + beta1 * x.range [index]) / ( 1 + exp(beta0 + beta1 * x.range[index])) } plot(x.range, Pi.result.vector) The mathematical form of the logistic regression line is the model Pi = exp(beta0 + beta1 * x) / ( 1 + exp(beta0 + beta1 * x) This mathematical form is equivalent to modeling the log odds of Pi as a linear function of beta0 + beta1 * x. logit Pi = log( Pi / (1-Pi)) = beta0 + beta1 * x. Generalized linear models (GLM) Logistic regression is an example of a generalized linear model, which is a family of models that extend simple linear regression. We will not look in detail at generalized linear models in this lecture, but will simply use the R function glm(). If you are interested, you can read more about GLM’s, but they will not be on the final exam. Fitting the logistic regression model with glm() To run a logistic regression using the glm() function, we need to give the argument family=binomial. Logistic regression example from Julian Faraway, "Extending the linear model with R." Julian Faraway is the author of two particularly good books on R: "Linear models in R" and "Extending the linear model with R." We'll use an example from the latter book. Diabetes prevalence is particularly high among Pima Indians in Arizona. Useful population to study causes. We’ll examine diabetes as a function of insulin. install.packages("faraway") library(faraway) data(pima) help(pima) The National Institute of Diabetes and Digestive and Kidney Diseases conducted a study on 768 adult female Pima Indians living near Phoenix. The dataset contains the following variables pregnant Number of times pregnant glucose Plasma glucose concentration at 2 hours in an oral glucose tolerance test diastolic Diastolic blood pressure (mm Hg) triceps Triceps skin fold thickness (mm) insulin 2-Hour serum insulin (mu U/ml) bmi Body mass index (weight in kg/(height in metres squared)) diabetes Diabetes pedigree function age Age (years) test test whether the patient shows signs of diabetes (coded 0 if negative, 1 if positive) Diabetes is indicated by the variable “test” pima$test = factor(pima$test) summary(pima$test) summary(pima) # Some variables that should be positive have values of zero. # These are likely missing values that need to be dealt with. Set them to NA for now. pima$insulin[pima$insulin == 0] = NA hist(pima$insulin) plot(pima$insulin, pima$test) boxplot(pima$insulin ~ pima$test) We model the variable "test" as a function of the insulin. res.diabetes = glm(test ~ insulin, family=binomial, data= pima) summary(res.diabetes) Call: glm(formula = test ~ insulin, family = binomial, data = pima) Deviance Residuals: Min 1Q Median -2.3178 -0.8198 -0.7099 3Q 1.2737 Max 1.8652 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -1.626107 0.203443 -7.993 1.32e-15 *** insulin 0.005701 0.001058 5.387 7.17e-08 *** --Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 499.70 on 393 degrees of freedom Residual deviance: 463.89 on 392 degrees of freedom (374 observations deleted due to missingness) AIC: 467.89 Number of Fisher Scoring iterations: 4 In these data, insulin is significantly associated with diabetes. Many issues related to model checking, variable selection, and so on should be considered for these data. Methods for these and implementations in R are discussed in the following texts. Julian Faraway. Extending the Linear Model with R. John Fox. R Companion to Applied Regression. Logistic regression example 2 library(ISwR) ?malaria This data frame contains the following columns: subject subject code. age age in years. ab antibody level. mal a numeric vector code, Malaria: 0: no, 1: yes. A random sample of 100 children aged 3–15 years from a village in Ghana. The children were followed for a period of 8 months. At the beginning of the study, values of a particular antibody were assessed. Based on observations during the study period, the children were categorized into two groups: individuals with and without symptoms of malaria. summary(malaria) subject Min. : 1.00 1st Qu.: 25.75 Median : 50.50 Mean : 50.50 3rd Qu.: 75.25 Max. :100.00 mal Min. :0.00 1st Qu.:0.00 Median :0.00 Mean :0.27 3rd Qu.:1.00 Max. :1.00 > age Min. : 3.00 1st Qu.: 5.75 Median : 9.00 Mean : 8.86 3rd Qu.:12.00 Max. :15.00 ab Min. : 2.0 1st Qu.: 29.0 Median : 111.0 Mean : 311.5 3rd Qu.: 373.8 Max. :2066.0 Is antibody level (ab) significantly associated with malaria? boxplot(ab ~ mal, data=malaria) res.malaria.ab = glm(mal ~ ab, family=binomial, data= malaria) summary(res.malaria.ab) Call: glm(formula = mal ~ ab, family = binomial, data = malaria) Deviance Residuals: Min 1Q Median -0.9960 -0.8893 -0.6472 3Q 1.3766 Max 2.8993 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -0.437616 0.292493 -1.496 0.1346 ab -0.002665 0.001214 -2.196 0.0281 * --Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 116.65 Residual deviance: 107.28 AIC: 111.28 on 99 on 98 degrees of freedom degrees of freedom Number of Fisher Scoring iterations: 6 The logistic regression indicates that antibody level (ab) is significantly associated with malaria (p=0.0281). Does malaria incidence change with age? boxplot(age ~ mal, data=malaria) res.malaria.age = glm(mal ~ age, family=binomial, data= malaria) summary(res.malaria.age) Call: glm(formula = mal ~ age, family = binomial, data = malaria) Deviance Residuals: Min 1Q Median -1.0243 -0.8526 -0.6665 3Q 1.3386 Max 1.9466 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -0.03126 0.56305 -0.056 0.9557 age -0.11335 0.06315 -1.795 0.0727 . --Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 116.65 Residual deviance: 113.28 AIC: 117.28 on 99 on 98 degrees of freedom degrees of freedom Number of Fisher Scoring iterations: 4 In the logistic regression age approaches significance for association with malaria (p=0.0727). It might be useful to include age in the model. res.malaria.abage = glm(mal ~ ab + age, family=binomial, data= malaria) summary(res.malaria.abage) Call: glm(formula = mal ~ ab + age, family = binomial, data = malaria) Deviance Residuals: Min 1Q Median -1.1185 -0.8353 -0.6416 3Q 1.2335 Max 2.9604 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 0.152899 0.587756 0.260 0.7948 ab -0.002472 0.001210 -2.042 0.0412 * age -0.074615 0.065438 -1.140 0.2542 --Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 116.65 Residual deviance: 105.95 AIC: 111.95 on 99 on 97 degrees of freedom degrees of freedom Number of Fisher Scoring iterations: 6 The logistic regression indicates that antibody level (ab) is still significantly associated with malaria (p=0.0412) after controlling for age.