CSSS 508: Intro to R
Rebecca Nugent, Department of Statistics, U. of Washington
3/03/06

Logistic Regression
Last week, we looked at linear regression, using independent variables to predict a
continuous response variable.
Very often, we want to predict a binary outcome: Yes/No (Failure/Success).
For example, we may want to predict whether or not someone will go to college, become
obese, or develop a hereditary condition.
We use logistic regression to model this scenario. Our response variable is usually 0 or 1.
The formula is:
P(Y = 1) = 1 / (1 + exp[-(B0 + B1X1 + B2X2 + B3X3 + ... + BpXp)])
More simply: f(z) = 1 / (1 + exp(-z)), where z is the familiar linear predictor
B0 + B1X1 + ... + BpXp from linear regression.
The behavior of f(z) (or P(response = 1)) looks like:
[Figure: the S-shaped logistic curve, f(z) plotted against z for z from -4 to 4;
f(z) ranges from 0 to 1.]
Note that when z = 0, P(response = 1) = 0.5.
We have no information from the covariates, and so it’s essentially a coin flip.
High z, high chance of a response of 1. Low z, low chance of a response of 1.
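If you want to draw this curve yourself, here is a minimal sketch (the grid of z values is arbitrary):

z <- seq(-4, 4, by = 0.01)                         # grid of linear predictor values
f <- 1/(1 + exp(-z))                               # the logistic function
plot(z, f, type = "l", xlab = "z", ylab = "f(z)")  # S-shaped curve from 0 to 1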
In R, we model logistic regression using generalized linear models (glm).
This function allows for several different types of models, each with its own "family".
For us, the family just means that we specify the type of response variable we have and
what kind of model we would like to use.
help(glm)
The arguments we’ll look at today are: formula, family, and data.
Formula and data are the same as in linear regression. If you are working
with a data frame, you can type the formula y ~ x1 + x2 + ... + xp and then set data = the
name of your data frame. If you have a variable defined for each term in your formula,
you only need to include the formula argument.
For logistic regression, family = binomial.
Recall that a binomial distribution models the probability of trials being successes or
failures (like our response variable).
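Putting the three arguments together, a generic call looks like this (mydata, y, and the x's here are hypothetical placeholders, not from the data set below):

glm(y ~ x1 + x2, family = binomial, data = mydata)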
Let’s try it on the low infant birth weight data set.
library(MASS)
help(birthwt)
Our variables:
low:   0, 1 (indicator of birth weight less than 2.5 kg)
age:   mother's age in years
lwt:   mother's weight in lbs
race:  mother's race (1 = white, 2 = black, 3 = other)
smoke: smoking status during pregnancy
ptl:   no. of previous premature labors
ht:    history of hypertension
ui:    presence of uterine irritability
ftv:   no. of physician visits during first trimester
bwt:   birth weight in grams
First, attach the data set so we can access the individual variables by name:
attach(birthwt)
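Optionally, take a quick look at the data before modeling:

head(birthwt)   # first few rows
dim(birthwt)    # number of observations and variables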
All variables have a natural ordering except race. Race has been coded as integers. If we
leave it as integers, the model will return a number that is associated with a change from
white to black or a change from black to other. We want to remove this order from the
race categories.
race2<-factor(race,labels=c("white","black","other"))
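To check that the recoding worked as intended, cross-tabulate the old and new variables:

table(race, race2)   # each integer code should map to exactly one label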
Running the model:
bwt<-glm(low~age+lwt+race2+smoke+ptl+ht+ui+ftv,family=binomial)
What do we get back?
bwt is a glm object. It immediately returns the coefficients for each variable.
Each coefficient is the change in the log odds associated with a one-unit increase in
that variable, holding the others fixed. The intercept is the base log odds that everyone
starts with (much like the intercept in a linear regression).
Negative log odds coefficients decrease your risk; positive ones increase your risk; 0 is
neutral.
If we want to get the odds ratio back, we need to do exp(coefficient).
For example, the odds ratio associated with each one-year increase in age is
exp(-0.02955) = 0.97088. That is, each additional year of age decreases your odds of
giving birth to an infant of low birth weight by about 3%.
The odds ratio associated with smoking is exp(0.93885) = 2.557. If you smoke, you increase
your odds of giving birth to an infant of low birth weight by roughly 156%.
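You can also pull individual odds ratios straight from the fitted model (coef() is the standard extractor for glm objects):

exp(coef(bwt)["age"])     # odds ratio per one-year increase in age
exp(coef(bwt)["smoke"])   # odds ratio for smoking vs. not smoking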
To get back all of these odds ratios, we can do:
exp(bwt$coef)
(Intercept)         age         lwt  race2black  race2other       smoke
  1.6170819   0.9708833   0.9846941   3.5689085   2.4120956   2.5570281
        ptl          ht          ui         ftv
  1.7217428   6.4449886   2.1546928   1.0674812
Brief note on the race factor variable: odds ratios are found via comparison to a reference
group. Here the reference group is “white”. So the odds ratio for “black” compared to
“white” is 3.569; the odds ratio for “other” compared to “white” is 2.412.
R just uses the first factor level as the reference group.
You can change the reference group if you want by creating your own factor variables.
white<-ifelse(race==1,1,0)
black<-ifelse(race==2,1,0)
other<-ifelse(race==3,1,0)
glm(low~age+lwt+white+black+smoke+ptl+ht+ui+ftv,family=binomial)
This model has “other” as the reference group.
glm(low~age+lwt+white+other+smoke+ptl+ht+ui+ftv,family=binomial)
This model has “black” as the reference group.
You can’t put all three in the model because then there’s nothing to compare against.
Generally, it is a good idea to use the largest group as your reference group.
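As an aside, base R's relevel( ) changes a factor's reference level directly, a shortcut for the ifelse( ) approach above (race3 is just a name chosen here):

race3<-relevel(race2,ref="other")   # "other" becomes the reference group
glm(low~age+lwt+race3+smoke+ptl+ht+ui+ftv,family=binomial)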
Confidence Intervals:
We can get back more information by using summary.
summary(bwt)
We find confidence intervals for our odds ratios via exp(coef +/- 1.96*SE)
(NOT exp(coef) +/- 1.96*exp(SE); the exp must wrap the whole expression).
sum.coef<-summary(bwt)$coef
est<-exp(sum.coef[,1])
upper.ci<-exp(sum.coef[,1]+1.96*sum.coef[,2])
lower.ci<-exp(sum.coef[,1]-1.96*sum.coef[,2])
cbind(est,upper.ci,lower.ci)
Odds ratios whose confidence intervals do not contain 1 are statistically significant (at
the 0.05 level).
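As an alternative to these Wald intervals, MASS supplies a confint( ) method for glm objects that computes profile-likelihood intervals, which are often preferred:

exp(confint(bwt))   # profile-likelihood confidence intervals for the odds ratios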
We can also do stepwise logistic regression using step( ). This function searches over
combinations of the variables, keeping the model with the lowest AIC, to predict low
birth weight.
We pass in the glm object of the full model.
bwt.step<-step(bwt)
Again, we can find the odds ratios and confidence intervals using the code above.
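For instance, the odds ratios for the reduced model:

exp(coef(bwt.step))   # odds ratios for the variables step( ) kept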
For those interested in doing multinomial logistic regression (when your response
variable has more than two answers), try the multinom function in the nnet package.
library(nnet)
help(multinom)
multinom(race~age+lwt)
This returns coefficients for each class, relative to a baseline class.
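A small sketch using the factor version of race, so the output is labeled with the category names ("white", the first level, is the baseline):

mn<-multinom(race2~age+lwt)
summary(mn)   # one row of coefficients (and standard errors) per non-baseline class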