Logistic regression

The chi-square test looks at the effect of one variable (such as treatment: drug or
placebo) on another variable (such as recovery or not). This is analogous to the t-test,
which looks at the effect of one variable (treatment) on another variable (such as blood
pressure). If another variable (such as age) can affect the response (recovery, blood
pressure), it is often very helpful to control for effects of that variable, by including it in
the analysis.
There are extensions of the chi-square test to look at the effects of two or more
variables on a response variable. We will look now at logistic regression, an alternative
method to examine the effects of two or more variables on a binary response variable. A
good text is David Kleinbaum's "Logistic Regression".
Logistic regression assigns an object to a class, or gives the probability of class
membership, for example:
- chest pain: asthma vs. heart attack
- whale vs. submarine
- apple vs. orange vs. banana
- nucleotide in a DNA sequence: {A, C, G, T}
We use logistic regression when the dependent (y) variable is a binary categorical
variable, for example live/die, yes/no, success/failure. Success is usually encoded as 1,
failure as 0.
Often, we are interested in determining if a particular independent variable (such as
level of a protein in the blood) is associated with different outcomes.
We’ve used the chi-square test and the Fisher exact test, which apply when the dependent
variable is a binary categorical variable. Logistic regression extends these tests by
allowing us to include continuous covariates in the model.
Logistic regression requires the use of logs and odds ratios, so we’ll do a quick review. If
you don’t want to see the review, skip ahead to the examples below.
Review of log and exp functions
base e = 2.718282
e^k is written as exp(k), so exp(1) = e^1 = 2.718282.
If log(x) = k, then exp(k) = e^k = x.
For example, log(9) = 2.197, and exp(2.197) = e^2.197 = 8.997979 (about 9, up to rounding).
log(1) = 0, and exp(0) = e^0 = 1.
log and exp are inverse functions: exp(log(x)) = x, so exp(log(35)) = 35.
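These identities are easy to check directly in R, where log() is the natural logarithm by default:
# log() and exp() are inverse functions (log() is base e in R)
log(9)        # 2.197225
exp(2.197)    # 8.997979
exp(log(35))  # 35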
Odds ratio
Odds = (Probability) / (1 – Probability) = P / (1-P)
If the probability of an event is P = 0.5,
the odds are 0.5/(1-0.5) = 0.5/0.5 = 1
If the probability of an event is P = 0.25,
the odds are 0.25/(1-0.25) = 0.25/0.75 = 0.33
Probability   Odds = P/(1-P)
   0.0001        0.0001
   0.001         0.0010
   0.01          0.0101
   0.05          0.05
   0.1           0.11
   0.2           0.25
   0.3           0.43
   0.4           0.67
   0.5           1.00
   0.6           1.50
   0.7           2.33
   0.8           4.00
   0.9           9.00
   0.99         99.00
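The table above can be reproduced with a short R sketch (p is a name chosen here for illustration):
# Odds corresponding to a range of probabilities
p = c(0.0001, 0.001, 0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99)
data.frame(Probability = p, Odds = round(p / (1 - p), 4))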
If two events have the same probability, then:
- they have the same odds
- their odds ratio is 1
- their log odds ratio is 0, because log(1) = 0
If the log odds ratio is greater than zero, the event in the numerator has greater
probability than the event in the denominator.
If the log odds ratio is less than zero, the event in the numerator has less probability
than the event in the denominator.
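For example, take a hypothetical pair of events A (probability 0.8) and B (probability 0.5):
# Log odds ratio for two hypothetical events
oddsA = 0.8 / (1 - 0.8)  # 4
oddsB = 0.5 / (1 - 0.5)  # 1
log(oddsA / oddsB)       # 1.386, greater than zero: A is more probable than B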
Logistic regression
We have two variables:
1. X is a continuous variable, such as gene expression or a mother’s weight
2. Y is a binary variable, such as cancer/not, premature/not
We encode Y as a Bernoulli variable with success = 1 and failure = 0. The probability of
success is Pi, the Greek letter pi.
In logistic regression, rather than modeling Y, which can only take values of 0 or 1 (but
not values in between), we instead model the probability of success Pi, which can take
values between 0 and 1.
It is most convenient to model the probabilities on a transformed scale:
logit Pi = log( Pi / (1-Pi)) = beta0 + beta1 * x.
logit Pi = log( Pi / (1-Pi)) is the log odds of success, where Pi is the
probability that Y = 1.
# Plot of a logistic regression curve, where the probability of
# success P(Y = 1) increases with X.
beta0 = 0
beta1 = 1
x.range = -10:10
Pi.result.vector = c()
for (index in 1:length(x.range)) {
  # Pi = exp(beta0 + beta1*x) / (1 + exp(beta0 + beta1*x))
  Pi.result.vector[index] = exp(beta0 + beta1 * x.range[index]) /
    (1 + exp(beta0 + beta1 * x.range[index]))
}
plot(x.range, Pi.result.vector)
The mathematical form of the logistic regression line is the model
Pi = exp(beta0 + beta1 * x) / (1 + exp(beta0 + beta1 * x))
This mathematical form is equivalent to modeling the log odds of Pi as a linear function
of beta0 + beta1 * x.
logit Pi = log( Pi / (1-Pi)) = beta0 + beta1 * x.
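A quick numerical check of this equivalence, using the illustrative values beta0 = 0, beta1 = 1, x = 2:
# Verify that the logit of Pi recovers the linear predictor
beta0 = 0; beta1 = 1; x = 2
Pi = exp(beta0 + beta1 * x) / (1 + exp(beta0 + beta1 * x))
log(Pi / (1 - Pi))  # equals beta0 + beta1 * x = 2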
Generalized linear models (GLM)
Logistic regression is an example of a generalized linear model, which is a family of
models that extend simple linear regression.
We will not look in detail at generalized linear models in this lecture, but will simply use
the R function glm().
If you are interested, you can read more about GLM’s, but they will not be on the final
exam.
Fitting the logistic regression model with glm()
To run a logistic regression using the glm() function, we need to give the argument
family=binomial.
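In its general form the call looks like the sketch below, where y is a 0/1 response, x is a covariate, and mydata is a data frame (these names are placeholders):
# Generic logistic regression call; y, x, and mydata are placeholder names
fit = glm(y ~ x, family = binomial, data = mydata)
summary(fit)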
Logistic regression example from Julian Faraway,
"Extending the linear model with R."
Julian Faraway is the author of two particularly good books on R:
"Linear models in R" and "Extending the linear model with R."
We'll use an example from the latter book.
Diabetes prevalence is particularly high among Pima Indians in Arizona, which makes them
a useful population for studying its causes.
We’ll examine diabetes as a function of insulin.
install.packages("faraway")
library(faraway)
data(pima)
help(pima)
The National Institute of Diabetes and Digestive and Kidney Diseases conducted a study
on 768 adult female Pima Indians living near Phoenix.
The dataset contains the following variables:
pregnant: number of times pregnant
glucose: plasma glucose concentration at 2 hours in an oral glucose tolerance test
diastolic: diastolic blood pressure (mm Hg)
triceps: triceps skin fold thickness (mm)
insulin: 2-hour serum insulin (mu U/ml)
bmi: body mass index (weight in kg/(height in metres squared))
diabetes: diabetes pedigree function
age: age (years)
test: whether the patient shows signs of diabetes (coded 0 if negative, 1 if positive)
Diabetes status is indicated by the variable “test”.
pima$test = factor(pima$test)
summary(pima$test)
summary(pima)
# Some variables that should be positive have values of zero.
# These are likely missing values that need to be dealt with. Set them to NA for now.
pima$insulin[pima$insulin == 0] = NA
hist(pima$insulin)
plot(pima$insulin, pima$test)
boxplot(pima$insulin ~ pima$test)
We model the variable "test" as a function of insulin.
res.diabetes = glm(test ~ insulin, family=binomial, data= pima)
summary(res.diabetes)
Call:
glm(formula = test ~ insulin, family = binomial, data = pima)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.3178  -0.8198  -0.7099   1.2737   1.8652

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.626107   0.203443  -7.993 1.32e-15 ***
insulin      0.005701   0.001058   5.387 7.17e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 499.70  on 393  degrees of freedom
Residual deviance: 463.89  on 392  degrees of freedom
  (374 observations deleted due to missingness)
AIC: 467.89

Number of Fisher Scoring iterations: 4
In these data, insulin is significantly associated with diabetes.
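The insulin coefficient is on the log odds scale; exponentiating it gives an odds ratio. A minimal sketch using the fit above:
# Odds ratio for a one-unit increase in insulin
exp(coef(res.diabetes)["insulin"])  # about 1.0057 per mu U/ml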
Many issues related to model checking, variable selection, and so on should be
considered for these data. Methods for these and implementations in R are discussed in
the following texts.
Julian Faraway. Extending the Linear Model with R.
John Fox. An R Companion to Applied Regression.
Logistic regression example 2
library(ISwR)
?malaria
This data frame contains the following columns:
subject: subject code
age: age in years
ab: antibody level
mal: a numeric code for malaria (0: no, 1: yes)
A random sample of 100 children aged 3–15 years from a village in Ghana. The children
were followed for a period of 8 months. At the beginning of the study, values of a
particular antibody were assessed. Based on observations during the study period, the
children were categorized into two groups: individuals with and without symptoms of
malaria.
summary(malaria)
    subject            age              ab              mal
 Min.   :  1.00   Min.   : 3.00   Min.   :   2.0   Min.   :0.00
 1st Qu.: 25.75   1st Qu.: 5.75   1st Qu.:  29.0   1st Qu.:0.00
 Median : 50.50   Median : 9.00   Median : 111.0   Median :0.00
 Mean   : 50.50   Mean   : 8.86   Mean   : 311.5   Mean   :0.27
 3rd Qu.: 75.25   3rd Qu.:12.00   3rd Qu.: 373.8   3rd Qu.:1.00
 Max.   :100.00   Max.   :15.00   Max.   :2066.0   Max.   :1.00
Is antibody level (ab) significantly associated with malaria?
boxplot(ab ~ mal, data=malaria)
res.malaria.ab = glm(mal ~ ab, family=binomial, data= malaria)
summary(res.malaria.ab)
Call:
glm(formula = mal ~ ab, family = binomial, data = malaria)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-0.9960  -0.8893  -0.6472   1.3766   2.8993

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.437616   0.292493  -1.496   0.1346
ab          -0.002665   0.001214  -2.196   0.0281 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 116.65  on 99  degrees of freedom
Residual deviance: 107.28  on 98  degrees of freedom
AIC: 111.28

Number of Fisher Scoring iterations: 6
The logistic regression indicates that antibody level (ab) is significantly associated with
malaria (p=0.0281); the negative coefficient means that higher antibody levels correspond
to a lower probability of malaria.
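To visualize the fitted relationship, a minimal sketch (ab.grid and p.hat are names chosen here for illustration):
# Plot the fitted probability of malaria across the observed range of ab
ab.grid = seq(min(malaria$ab), max(malaria$ab), length.out = 200)
p.hat = predict(res.malaria.ab, newdata = data.frame(ab = ab.grid), type = "response")
plot(ab.grid, p.hat, type = "l", xlab = "antibody level", ylab = "P(malaria)")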
Does malaria incidence change with age?
boxplot(age ~ mal, data=malaria)
res.malaria.age = glm(mal ~ age, family=binomial, data= malaria)
summary(res.malaria.age)
Call:
glm(formula = mal ~ age, family = binomial, data = malaria)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.0243  -0.8526  -0.6665   1.3386   1.9466

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.03126    0.56305  -0.056   0.9557
age         -0.11335    0.06315  -1.795   0.0727 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 116.65  on 99  degrees of freedom
Residual deviance: 113.28  on 98  degrees of freedom
AIC: 117.28

Number of Fisher Scoring iterations: 4
In the logistic regression, age approaches significance for association with malaria
(p=0.0727). It might be useful to include age in the model.
res.malaria.abage = glm(mal ~ ab + age, family=binomial, data= malaria)
summary(res.malaria.abage)
Call:
glm(formula = mal ~ ab + age, family = binomial, data = malaria)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.1185  -0.8353  -0.6416   1.2335   2.9604

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.152899   0.587756   0.260   0.7948
ab          -0.002472   0.001210  -2.042   0.0412 *
age         -0.074615   0.065438  -1.140   0.2542
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 116.65  on 99  degrees of freedom
Residual deviance: 105.95  on 97  degrees of freedom
AIC: 111.95

Number of Fisher Scoring iterations: 6
The logistic regression indicates that antibody level (ab) is still significantly associated
with malaria (p=0.0412) after controlling for age.
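One common way to check formally whether adding age improves on the antibody-only model is a likelihood ratio test comparing the two nested fits:
# Likelihood ratio test: does adding age improve the ab-only model?
anova(res.malaria.ab, res.malaria.abage, test = "Chisq")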