CSSS 508: Intro to R 3/03/06

Logistic Regression

Last week, we looked at linear regression, using independent variables to predict a continuous dependent response variable. Very often we want to predict a binary outcome: Yes/No (Failure/Success). For example, we may want to predict whether or not someone will go to college, whether or not they will be obese, or whether or not they will develop a hereditary condition. We use logistic regression to model this scenario. Our response variable is usually 0 or 1. The formula is:

P(Y = 1) = 1 / (1 + exp[-(B0 + B1X1 + B2X2 + B3X3 + ... + BpXp)])

More simply: f(z) = 1 / (1 + exp(-z)), where z is the regular linear regression we've seen.

[Figure: the logistic curve f(z), i.e. P(response = 1), plotted for z from -4 to 4; f(z) rises in an S shape from 0 to 1.]

Note that when z = 0, P(response = 1) = 0.5. We have no information from the covariates, and so it's essentially a coin flip. High z, high chance of a response of 1; low z, low chance of a response of 1.

Rebecca Nugent, Department of Statistics, U. of Washington

In R, we model logistic regression using generalized linear models (glm). This function allows for several different types of models, each with its own "family". For us, the family just means that we specify the type of response variable we have and what kind of model we would like to use.

help(glm)

The arguments we'll look at today are: formula, family, and data. Formula and data are the same as used before in linear regression. If you are working with a data frame, you can type in the formula y ~ x1 + x2 + ... + xp and then data = the name of your data frame. If you have a variable defined for each term in your formula, you just need to include the formula argument. For logistic regression, family = binomial. Recall that a binomial distribution models the probability of trials being successes or failures (like our response variable). Let's try it on the low infant birth weight data set.
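As an aside, the S-shaped curve of f(z) described above can be drawn in a couple of lines of base R (a small sketch; the range of z values is arbitrary):

```r
# Logistic function: f(z) = 1 / (1 + exp(-z))
z <- seq(-4, 4, by = 0.1)
f <- 1 / (1 + exp(-z))
plot(z, f, type = "l", ylab = "f(z)")   # S-shaped curve between 0 and 1
abline(v = 0, h = 0.5, lty = 2)         # f(0) = 0.5: the coin-flip point
```

Note that f(z) never actually reaches 0 or 1, which is what makes it suitable for modeling a probability.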
library(MASS)
help(birthwt)

Our variables:

low: 0, 1 (indicator of birth weight less than 2.5 kg)
age: mother's age in years
lwt: mother's weight in lbs
race: mother's race (1 = white, 2 = black, 3 = other)
smoke: smoking status during pregnancy
ptl: no. of previous premature labors
ht: history of hypertension
ui: presence of uterine irritability
ftv: no. of physician visits during first trimester
bwt: birth weight in grams

First attach the data set to R so we can access the individual variables:

attach(birthwt)

All variables have a natural ordering except race. Race has been coded as integers. If we leave it as integers, the model will return a number that is associated with a change from white to black or a change from black to other. We want to remove this order from the race categories.

race2<-factor(race,labels=c("white","black","other"))

Running the model:

bwt<-glm(low~age+lwt+race2+smoke+ptl+ht+ui+ftv,family=binomial)

What do we get back? bwt is a glm object. It immediately returns the coefficients for each variable. Each coefficient is the log odds associated with only that variable. The intercept is the base log odds that everyone starts with (much like the intercept in a linear regression). Negative log odds decrease your risk; positive log odds increase your risk; 0 is neutral.

If we want to get the odds ratio back, we need to do exp(coefficient). For example, the odds ratio associated with each year increase in age is exp(-0.02955) = 0.97088. That is, each year increase in age decreases your odds of giving birth to an infant of low birth weight by about 3%. The odds ratio associated with smoking is exp(0.93885) = 2.557. If you smoke, you increase your odds of giving birth to an infant of low birth weight by about 156%.
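The two conversions above can be checked directly at the console (coefficient values taken from the fitted model):

```r
# Odds ratios are exp(log odds)
exp(-0.02955)   # age: about 0.971, i.e. roughly 3% lower odds per year
exp(0.93885)    # smoke: about 2.557, i.e. roughly 156% higher odds
```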
To get back all of these odds ratios, we can do:

exp(bwt$coef)

(Intercept)         age         lwt  race2black  race2other       smoke
  1.6170819   0.9708833   0.9846941   3.5689085   2.4120956   2.5570281
        ptl          ht          ui         ftv
  1.7217428   6.4449886   2.1546928   1.0674812

Brief note on the race factor variable: odds ratios are found via comparison to a reference group. Here the reference group is "white". So the odds ratio for "black" compared to "white" is 3.569; the odds ratio for "other" compared to "white" is 2.412. R just uses the first factor level as the reference group. You can change the reference group if you want by creating your own indicator variables.

white<-ifelse(race==1,1,0)
black<-ifelse(race==2,1,0)
other<-ifelse(race==3,1,0)

glm(low~age+lwt+white+black+smoke+ptl+ht+ui+ftv,family=binomial)

This model has "other" as the reference group.

glm(low~age+lwt+white+other+smoke+ptl+ht+ui+ftv,family=binomial)

This model has "black" as the reference group. You can't put all three in the model because then there's nothing to compare against. Generally, it is a good idea to use the largest group as your reference group.

Confidence Intervals:

We can get back more information by using summary.

summary(bwt)

We find confidence intervals for our odds ratios by exp(coef +/- 1.96*SE) (NOT exp(coef) +/- 1.96*exp(SE); the exp needs to go around the whole expression).

sum.coef<-summary(bwt)$coef
est<-exp(sum.coef[,1])
upper.ci<-exp(sum.coef[,1]+1.96*sum.coef[,2])
lower.ci<-exp(sum.coef[,1]-1.96*sum.coef[,2])
cbind(est,upper.ci,lower.ci)

Confidence intervals that do not contain 1 indicate significant variables (at the 0.05 level). We can also do stepwise logistic regression using step( ). This function will choose the best combination of all the variables to predict low birth weight. We pass in the glm object of the full model.

bwt.step<-step(bwt)

Again, we can find the odds ratios and confidence intervals using the code above.
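Once a model is fit, you can also recover each mother's predicted probability of a low birth weight infant. This isn't part of the notes above, but R's standard predict() function works on glm objects; with type = "response" it returns probabilities rather than log odds:

```r
# Fitted probabilities P(low = 1) for each observation in birthwt,
# assuming the glm object bwt from earlier
p <- predict(bwt, type = "response")
summary(p)   # all values lie between 0 and 1
p[1:5]       # probabilities for the first five mothers
```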
For those interested in doing multinomial logistic regression (when your response variable has more than two categories), try the multinom function in the nnet package.

library(nnet)
help(multinom)

multinom(race2~age+lwt)

This returns coefficients for each class (relative to the reference class).
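A minimal sketch of extracting results from such a fit, assuming birthwt is still attached and race2 is the factor created earlier:

```r
library(nnet)
fit <- multinom(race2 ~ age + lwt)
coef(fit)                            # one row of coefficients per non-reference class
predict(fit, type = "probs")[1:5, ]  # predicted class probabilities, first five mothers
```

Each row of coefficients compares one class ("black", "other") to the reference class ("white"), just as with the factor variable in glm.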