Lecture Notes on Binary Regression

Workshop on Binary Regression
Logistic Regression + Classification Trees + Regression Trees +
Graphics + Multinomial Regression
Hyderabad, December 26-29, 2012
MB RAO
Module 1: An Introduction to Logistic Regression + Fitting the model
with R + Goodness-of-fit test
BINARY RESPONSE VARIABLE AND LOGISTIC REGRESSION
A binary variable is a variable with only two possible values. There
are many, many examples of binary variables in statistical work.
Example. A patient is admitted with abdominal sepsis (blood poisoning).
The case is severe enough to warrant surgery. The patient is wheeled to the
operation theatre. Let us speculate what will happen after the surgery. Let
Y
= 1 if death follows surgery,
= 0 if the patient survives.
The outcome Y is random. Since it is random, we want to know its
distribution.
Pr(Y = 1) = π, say, and
Pr(Y = 0) = 1 – π.
In simple terms, we want to know the chances (π) that a patient dies after
surgery. Equivalently, what are the chances (1 – π) of survival after surgery?
Are there any prognostic variables or factors which could influence the
outcome Y. Surgeons list the following variables which could have some
bearing on Y.
1. X1: Shock. Is the patient in a state of shock before the surgery?
   X1 = 1 if yes, = 0 if no.
2. X2: Malnutrition. Is the patient undernourished?
   X2 = 1 if yes, = 0 if no.
3. X3: Alcoholism. Is the patient alcoholic?
   X3 = 1 if yes, = 0 if no.
4. X4: Age
5. X5: Bowel infarction. Does the patient have a bowel infarction?
   X5 = 1 if yes, = 0 if no.
The variables X1, X2, X3, and X5 are categorical covariates and X4 is a
continuous covariate. The categorical variables are binary with only two
possible values. It is felt that the outcome Y depends on these covariates.
Response variable: Y
Covariates, predictors, or independent variables: X1, X2, X3, X4, X5
More precisely, the probability π depends on X1 through X5. In order to
indicate the dependence, we write
π = π(X1, X2, X3, X4, X5).
We want to model π as a function of the covariates.
Why does one want to build a model? If a model is in place, one could use the
model to assess the chances of survival after surgery for a patient before he
is wheeled into the operation theatre. How? The surgeon could get
information on X1, X2, X3, X4, X5 for the patient, calculate π = π(X1, X2, X3,
X4, X5) from the postulated model, and then compute the chances of survival
(1 – π) after surgery.
A possible model?
π(X1, X2, X3, X4, X5) = β0 + β1X1 + β2X2 + β3X3 + β4X4 + β5X5
This is like a multiple regression model. This model is not acceptable: the
left hand side of the model, π, is a probability, whereas the right hand side
could be any real number.
Why not model Y directly as a function of the covariates? For example,
Y = β0 + β1X1 + β2X2 + β3X3 + β4X4 + β5X5?
This is not acceptable. The left hand side Y takes only two values 0 and 1,
but the right hand side could be any real number.
Logistic regression model

π(X1, X2, X3, X4, X5) = exp(β0 + β1X1 + β2X2 + β3X3 + β4X4 + β5X5) / [1 + exp(β0 + β1X1 + β2X2 + β3X3 + β4X4 + β5X5)].

This model looks formidable. The left hand side of the model is a
probability, and hence its value should always be between zero and one. The
right hand side is always a number between zero and one. Why?
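A quick numerical check (a minimal sketch, assuming nothing beyond the formula above): whatever real value the linear combination takes, exp of it divided by one plus exp of it lands strictly inside (0, 1).

t <- c(-100, -5, 0, 5, 100)    # possible values of b0 + b1*X1 + ... + b5*X5
exp(t) / (1 + exp(t))          # roughly 0, 0.0067, 0.5, 0.9933, 1 -- never outside (0, 1)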
This model has 6 unknown parameters β0, β1, β2, β3, β4, and β5. We need to
know the values of these parameters before it can be used. We can estimate
the parameters of the model if we have data on a sample of patients. We do
have data.
Structure of the data.

Patient   Y   X1   X2   X3   X4   X5
1         1    1    1    1   47    1
2         1    1    0    0   53    1
3         0    0    1    0   32    0
etc.
We have data on 106 patients. Present the data. Discuss the data.
A digression: The approach I presented is purely statistical. Identify the
variables of interest, designate the response variable or dependent variable,
designate the covariates or independent variables, postulate a model, fit the
model, and check its goodness-of-fit.
Engineers, physicists, and computer scientists will look at the problem from
a different angle. They work directly with the dependent or response
variable. I will talk about their approach later.
Problems (Back to our problem)
1. How does one estimate the parameters of the model using the data?
   There are two standard methods. (a) Method of maximum likelihood:
   write the likelihood of the data and maximize it with respect to the
   parameters. (b) Method of weighted least squares: the least squares
   principle is used to minimize a certain sum of squares. This method is
   much simpler than the method of maximum likelihood. Asymptotically,
   the two methods are equivalent; if the sample is large, the estimates
   will be more or less the same. (A small sketch of the maximum
   likelihood computation follows this list.)
2. Once the model is estimated, we need to check whether or not the
model adequately summarizes the data. We need to assess how well
the model fits the data. We may have to use some goodness-of-fit
tests to make the assessment.
3. If the model fits the data well, we need to examine the impact of each
and every covariate on the response variable. It is tantamount to
identifying risk factors. We need to test the significance of each and
every covariate in the model. If a covariate is not significant, we
could remove the covariate from the model and then fit a leaner and
tighter model to the data.
4. If a particular model does not fit the data well, explore other models,
which can do a better job.
5. If an adequate model is fitted, explain how the model can be used in
practice. Spend time on interpreting the model.
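Here is a minimal sketch of what the method of maximum likelihood does for a logistic model with one covariate. It uses simulated data (not the sepsis data), and glm carries out the same job via iteratively reweighted least squares.

set.seed(1)
x <- rnorm(200)                                   # a single covariate
p <- exp(-1 + 2*x) / (1 + exp(-1 + 2*x))          # true response probabilities
y <- rbinom(200, size = 1, prob = p)              # simulated binary responses

# Negative log-likelihood of the logistic (Bernoulli) model
negloglik <- function(beta) {
  eta <- beta[1] + beta[2]*x
  -sum(y*eta - log(1 + exp(eta)))
}

optim(c(0, 0), negloglik)$par                     # maximum likelihood estimates
coef(glm(y ~ x, family = binomial))               # glm gives essentially the same answer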
Before we pursue all these objectives, let us look at the model from another
angle.
π(X1, X2, X3, X4, X5) = Probability of death after surgery for a patient with
covariate values X1, X2, X3, X4, X5
= exp(β0 + β1X1 + β2X2 + β3X3 + β4X4 + β5X5) / [1 + exp(β0 + β1X1 + β2X2 + β3X3 + β4X4 + β5X5)]
1 - π(X1, X2, X3, X4, X5) = Probability of survival after surgery for a patient
with covariate values X1, X2, X3, X4, X5
= 1 / [1 + exp(β0 + β1X1 + β2X2 + β3X3 + β4X4 + β5X5)]

π / (1 - π) = Odds of Death versus Survival after surgery
= exp{β0 + β1X1 + β2X2 + β3X3 + β4X4 + β5X5}

ln(π / (1 - π)) = log odds = logit = β0 + β1X1 + β2X2 + β3X3 + β4X4 + β5X5
This is like a multiple regression model. The log odds are a linear function
of the covariates! This form of the model is very useful for interpretation.
The parameter β0 is called the intercept of the model. The parameter β1 is
called the regression coefficient associated with the variable ‘Shock.’ The
parameter β2 is called the regression coefficient associated with the variable
‘Malnutrition,’ etc. These regression coefficients indicate how much impact
the corresponding covariates have on the response variable.
The logistic regression model can be spelled out either in the form

π(X1, X2, X3, X4, X5) = exp(β0 + β1X1 + β2X2 + β3X3 + β4X4 + β5X5) / [1 + exp(β0 + β1X1 + β2X2 + β3X3 + β4X4 + β5X5)],

or in the form

ln(π / (1 - π)) = β0 + β1X1 + β2X2 + β3X3 + β4X4 + β5X5.

Both are equivalent.
Using R and the data, I fitted the model. The following are the estimates of
the parameters along with their standard errors and p-values.
Variable       Regression Coefficient   Standard Error   z-value   p-value
Intercept      -9.754                   2.534
Shock           3.674                   1.162            3.16      0.0016
Malnutrition    1.217                   0.7274           1.67      0.095
Alcoholism      3.355                   0.9797           3.43      0.0006
Age             0.09215                 0.03025          3.04      0.0023
Infarction      2.798                   1.161            2.41      0.016
Let us proceed in a systematic way with an analysis of the model.
1. Adequacy of the model. The model fits the data very well. We can
say that the model is a good summarization of the data. I will talk
about this aspect when I present and discuss the relevant program.
2. Estimated model.
   ln(π / (1 - π)) = -9.754 + 3.674*X1 + 1.217*X2 + 3.355*X3 + 0.09215*X4 + 2.798*X5
3. Impact of the covariates on the response variable. Let us look at the
covariate X1 (Shock). We want to test the null hypothesis H0: β1 = 0
(The covariate has no impact on the response variable or the covariate
X1 is not significant) against the alternative H1: β1 ≠ 0 (The covariate
has some impact on Y or the covariate is significant). An estimate of
β1 is 3.674. Is this value significant? We look at the corresponding z-value (Estimate/(Standard Error)). If the z-value exceeds 1.96 in
absolute value, we reject the null hypothesis at 5% level of
significance. In our case, it does indeed exceed 1.96. The variable X1
is significant. There is another way to check significance of a variable.
Look at the corresponding p-value.
a. If p ≤ 0.001, the covariate is very, very significant.
b. If 0.001 < p ≤ 0.01, the covariate is very significant.
c. If 0.01 < p ≤ 0.05, the covariate is significant.
d. If p > 0.05, the covariate is not significant.
In our case, p = 0.0016. The variable X1 is very significant.
Further, the estimate 3.674 is positive. This means that X1 has a
positive effect on the response variable: if the
value of X1 goes up, the probability π goes up. In our example, the
variable X1 takes only two values 1 and 0. The probability π will be
higher for a person with X1 = 1 than for a person with X1 = 0, other
factors remaining the same. I will talk about ‘how much higher’ later.
4. Let us look at the other covariates.
   Malnutrition:   Not significant
   Alcoholism:     Very, very significant (positive impact)
   Age:            Very significant (the higher the age is, the higher the probability of death is)
   Infarction:     Significant (positive impact)
5. Let us make the model a little tighter. Chuck out ‘Malnutrition’ from
the model. Refitting gives the following estimates.
Variable      Regression Coefficient   Standard Error   z-value
Intercept     -8.895
Shock          3.701                   1.103            3.355
Alcoholism     3.186                   0.9163           3.477
Age            0.08983                 0.02918          3.078
Infarction     2.386                   1.071            2.228
6. The fit is good. Every covariate is significant. The estimated model is
   ln(π / (1 - π)) = -8.895 + 3.701*X1 + 3.186*X3 + 0.08983*X4 + 2.386*X5.
   (A small numerical sketch of using this model for a new patient follows below.)
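As a numerical sketch, the reduced model can be used to predict the probability of death for a new patient; the patient below (in shock, alcoholic, aged 60, no bowel infarction) is hypothetical.

eta <- -8.895 + 3.701*1 + 3.186*1 + 0.08983*60 + 2.386*0   # linear predictor for X1 = 1, X3 = 1, X4 = 60, X5 = 0
exp(eta) / (1 + exp(eta))                                  # estimated probability of death, roughly 0.97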
Data on ‘abdominal sepsis’
I have data in Excel format.
ID    Y   X1   X2   X3   X4   X5
1     0    0    0    0   56    0
2     0    0    0    0   80    0
3     0    0    0    0   61    0
4     0    0    0    0   26    0
5     0    0    0    0   53    0
6     1    0    1    0   87    0
7     0    0    0    0   21    0
8     1    0    0    1   69    0
9     0    0    0    0   57    0
10    0    0    1    0   76    0
11    1    0    0    1   66    1
12    0    0    0    0   48    0
13    0    0    0    0   18    0
14    0    0    0    0   46    0
15    0    0    1    0   22    0
16    0    0    1    0   33    0
17    0    0    0    0   38    0
19    0    0    0    0   27    0
20    1    1    1    0   60    1
22    0    0    0    0   31    0
102   0    0    0    0   59    1
103   0    0    0    0   29    0
104   0    1    0    0   60    0
105   1    1    0    0   63    1
106   0    0    0    0   80    0
107   0    0    0    0   23    0
108   0    0    0    0   71    0
110   0    0    0    0   87    0
111   1    1    1    0   70    0
112   0    0    0    0   22    0
113   0    0    0    0   17    0
114   1    0    0    1   49    0
115   0    1    0    0   50    0
116   0    0    0    0   51    0
117   0    0    1    1   37    0
118   0    0    0    0   76    0
119   0    0    0    1   60    0
120   1    1    0    0   78    1
122   0    0    1    1   60    0
123   1    1    1    0   57    0
202   0    0    0    0   28    1
203   0    0    0    0   94    0
204   0    0    0    0   43    0
205   0    0    0    0   70    0
206   0    0    0    0   70    0
207   0    0    0    0   26    0
208   0    0    0    0   19    0
209   0    0    0    0   80    0
210   0    0    1    0   66    0
211   0    0    1    0   55    0
214   0    0    0    0   36    0
215   0    0    0    0   28    0
217   0    0    0    0   59    1
301   1    0    1    0   50    1
302   0    0    0    0   20    0
303   0    0    0    0   74    1
304   0    0    0    0   54    0
305   1    0    1    0   68    0
306   0    0    0    0   25    0
307   0    0    0    0   27    0
308   0    0    0    0   77    0
309   0    0    1    0   54    0
401   0    0    0    0   43    0
402   0    0    1    0   27    0
501   1    0    1    1   66    1
502   0    0    1    1   47    0
503   0    0    0    1   37    0
504   0    0    1    0   36    1
505   1    1    1    0   76    0
506   0    0    0    0   33    0
507   0    0    0    0   40    0
508   0    0    1    0   90    0
510   0    0    0    1   45    0
511   0    0    0    0   75    0
512   1    0    0    1   70    1
513   0    0    0    0   36    0
514   0    0    0    1   57    0
515   0    0    1    0   22    0
516   0    0    0    0   33    0
518   0    0    1    0   75    0
519   0    0    0    0   22    0
520   0    0    1    0   80    0
521   1    0    1    0   85    0
523   0    0    1    0   90    0
524   1    0    0    1   71    0
525   0    0    0    1   51    0
526   1    0    1    1   67    0
527   0    0    1    0   77    0
529   0    0    0    0   20    0
531   0    0    0    0   52    1
532   1    1    0    1   60    0
534   0    0    0    0   29    0
535   0    0    0    0   30    1
536   0    0    0    0   20    0
537   0    0    0    0   36    0
538   0    0    1    1   54    0
539   0    0    0    0   65    0
540   1    0    0    0   47    0
541   0    0    0    0   22    0
542   1    0    0    1   69    0
543   1    0    1    1   68    0
544   0    0    1    1   49    0
545   0    0    0    0   25    0
546   0    1    1    0   44    0
549   0    0    0    1   56    0
550   0    0    1    1   42    0
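Assuming the listing above has been read into a data frame with columns ID, Y, X1, ..., X5 — I call it sepsis here; the name is an assumption, not part of the notes — the fits reported in this module can be reproduced along the following lines:

sepsis.fit <- glm(Y ~ X1 + X2 + X3 + X4 + X5, family = binomial, data = sepsis)
summary(sepsis.fit)               # should reproduce the coefficient table given earlier
update(sepsis.fit, . ~ . - X2)    # the leaner model without Malnutrition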
I am working on a number of projects. In about fifty percent of these
projects the response variable is binary.
A specific example
A child gets cancer. It could be any one of the following.
Bone Cancer; Kidney (Wilms); Hodgkin's; Leukemia; Neuroblastoma; Non-Hodgkin's; Soft tissue sarcoma; CNS
Treatment begins. The child recovers and survives for five years. The child
is on Pediatric Cancer Registry. The child is followed lifelong. Periodically,
the child is examined. A number of measurements are recorded.
Some children get BCC (Basal Cell Carcinoma). Others don’t. How can one
explain this? What are the risk factors?
Data are collected. 320 children got BCC at least once during the follow-up
years. 723 children never got BCC.
Response variable: Occurrence of BCC (Yes or No)
Covariates:
Type of Cancer (Categorical with 8 levels)
Age at diagnosis of cancer (numeric)
Follow up Years (How many years was the child followed up after he/she
entered the cancer registry?)
Gender
Race
Radiation (Yes or No)
SMN (Did the child get a different cancer during the follow up years?)
Model Fitting Using R
I want to illustrate how to use R to fit a logistic regression model for a given
data. The data I use here is different from the one I presented earlier. It is
easy to understand this data.
A particular treatment is being evaluated to cure a particular medical
condition. Introduce the response variable. Does the patient get relief when
he/she has the treatment?
Y = 1 if yes
= 0 if no
The response variable is binary. There are two prognostic variables: Age and
Gender. A sample of 20 male and 20 female patients are chosen to try the
treatment. Input the data.
> Age <- c(37, 39, 39, 42, 47, 48, 48, 52, 53, 55, 56, 57, 58, 58, 60, 64, 65,
68, 68, 70, 34, 38, 40, 40, 41, 43, 43, 43, 44, 46, 47, 48, 48, 50, 50, 52, 55,
60, 61, 61)
Gender is a categorical variable. Enter data on gender as a factor. Gender 0
means Female and 1 means Male.
> Gender <- factor(c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1))
An alternative way of inputting the data on Gender.
> Gender <- rep(0:1, c(20, 20))
Response is a categorical variable. Enter data on Response as a factor.
Response 1 means Yes and 0 means No.
> Response <- factor(c(0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1,
1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1))
Create a data frame containing the data on all these variables. Strictly, there is no
need to do this, but it is good to have everything in a single folder.
> MB <- data.frame(Age, Gender, Response)
Look at the data.
> MB
   Age Gender Response
1   37      0        0
2   39      0        0
3   39      0        0
4   42      0        0
5   47      0        0
6   48      0        0
7   48      0        1
8   52      0        0
9   53      0        0
10  55      0        0
11  56      0        0
12  57      0        0
13  58      0        0
14  58      0        1
15  60      0        0
16  64      0        0
17  65      0        1
18  68      0        1
19  68      0        1
20  70      0        1
21  34      1        1
22  38      1        1
23  40      1        0
24  40      1        0
25  41      1        0
26  43      1        1
27  43      1        1
28  43      1        1
29  44      1        0
30  46      1        0
31  47      1        1
32  48      1        1
33  48      1        1
34  50      1        0
35  50      1        1
36  52      1        1
37  55      1        1
38  60      1        1
39  61      1        1
40  61      1        1
Reflect on the data. The first 20 patients are female and the last 20 are male.
The ages are reported in increasing order of magnitude. At lower ages of
females the treatment does not seem to be working. For males the treatment
seems to be working at all ages with a good probability. We need to quantify
our first impressions. Model building will help us.
Let us fit the logistic regression model. Create a folder, which will store the
output. The basic command is glm (generalized linear model). The
command 'glm' comes with the base R installation (in the 'stats' package, which is loaded by default).
> MB1 <- glm(Response ~ Age + Gender, family = binomial, data = MB)
Let us look at the output.
> summary(MB1)
Call:
glm(formula = Response ~ Age + Gender, family = binomial, data = MB)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-1.86671  -0.80814   0.03983   0.78066   2.17061

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.84294    3.67576  -2.678  0.00741 **
Age          0.15806    0.06164   2.564  0.01034 *
Gender1      3.48983    1.19917   2.910  0.00361 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 55.452  on 39  degrees of freedom
Residual deviance: 38.917  on 37  degrees of freedom
AIC: 44.917

Number of Fisher Scoring iterations: 5
The estimated model is:
ln[Pr(Y = 1) / Pr(Y = 0)] = -9.84294 + 0.15806*Age + 3.48983*Gender
Age is a significant covariate. How can you tell? Look at its p-value,
0.01034.
Gender is a very significant factor. How can you tell? Look at its p-value
0.00361.
General conclusions.
1. Look at the regression coefficients. Both are positive. The higher the
age is the higher the chances are of getting relief on the treatment.
2. Males and Females react to the treatment significantly differently.
The treatment is more beneficial to males than to females.
Justification.
Two patients on the treatment
Patient 1. Age = 50
Gender = Male
Patient 2. Age = 55
Gender = Male
Patient 1.
Pr(Y = 1) = Probability of getting relief = 4.71 / (1 + 4.71) = 0.82
Patient 2.
Pr(Y = 1) = Probability of getting relief = 10.38 / (1 + 10.38) = 0.91
Another example
Two patients on the treatment
Patient 1. Age = 50
Gender = Female
Patient 2. Age = 55
Gender = Female
Patient 1.
Pr(Y = 1) = Probability of getting relief = 0.14 / (1 + 0.14) = 0.12
Patient 2.
Pr(Y = 1) = Probability of getting relief = 0.32 / (1 + 0.32) = 0.24
Compare the probabilities of getting relief for a male and a female with the
same age 50:
Male:   Probability of relief = 0.82
Female: Probability of relief = 0.12
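The same four probabilities can be obtained directly from the fitted object with predict(). This is a sketch assuming the glm fit MB1 from above, with Gender coded as the 0/1 factor used there.

new.patients <- data.frame(Age = c(50, 55, 50, 55),
                           Gender = factor(c(1, 1, 0, 0), levels = c(0, 1)))
predict(MB1, newdata = new.patients, type = "response")
# close to the hand calculations above: 0.82, 0.91, 0.12, 0.24 (up to rounding)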
It is now time to focus on a goodness-of-fit test of the Logistic Regression
Model. We will work with the data presented above, with response variable
‘Response,’ and covariates ‘Gender’ and ‘Age.’ We want to test the
hypothesis that the Logistic Regression Model is an adequate summary of
the data. In other words, we want to test that the response probability
Pr(Response = 1) follows the stipulated logistic regression model pattern.
The null hypothesis is
H0: Pr(Response = 1) = exp(β0 + β1*Age + β2*Gender) / [1 + exp(β0 + β1*Age + β2*Gender)]
for some parameters β0, β1, and β2.
The alternative is H1: H0 not true.
Hosmer and Lemeshow devised a test of the validity of the null
hypothesis. I will not go through the rationale behind the test. We will use
R to conduct the test on our data. The test is available in a
package called 'rms.' First, we need to download the package. Download the
package. How?
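One way to get it (assuming an internet connection is available):

> install.packages("rms")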
Put the package ‘rms’ into service.
> library (rms)
The basic R command is ‘lrm’ (logistic regression model). Create a new
folder for the execution of ‘lrm.’
The package ‘rms’ also fits the logistic regression model. The command is
different.
> MB2 <- lrm(Response ~ Age + Gender, data = MB, x = TRUE, y = TRUE)
You need to ask for the output. The output is given later.
The following command gives the results of the Hosmer-Lemeshow test.
> residuals.lrm(MB2, type = 'gof')
Sum of squared errors    Expected value|H0                   SD                    Z                    P
           6.4736338            6.4612280            0.2246174            0.0552307            0.9559547
You look at the p-value in the output. It is a large number. The chances of
getting the type of data we have gotten when the null hypothesis is true are
0.96. Recall what the null hypothesis is here. One cannot reject the null
hypothesis. The logistic regression model adequately summarizes the data.
Let us ask for the output of the ‘lrm’ command.
> MB2
Logistic Regression Model

lrm(formula = Response ~ Age + Gender, data = MB, x = TRUE, y = TRUE)

Frequencies of Responses
 0  1
20 20

Obs  Max Deriv  Model L.R.  d.f.  P      C      Dxy    Gamma  Tau-a  R2     Brier
40   7e-08      16.54       2     3e-04  0.849  0.698  0.703  0.358  0.451  0.162

           Coef     S.E.     Wald Z   P
Intercept  -9.8429  3.67577  -2.68    0.0074
Age         0.1581  0.06164   2.56    0.0103
Gender      3.4898  1.19917   2.91    0.0036
̂ 0 = - 9.8429; ˆ1 = 0.1581; ˆ2 = 3.4898
Age is significant. Gender is very significant.
Why does one build a model? If the model is an adequate summary of the
data, the data can be thrown away. The model can be used to answer any
questions that may be raised on the experiment and the outcomes.
A model lets us assess whether or not a particular covariate is a significant
risk factor. A model lets us evaluate the extent of its influence on the
outcome variable.
Module 2: Null Hypotheses + p-values + Standard Errors + Critical Values +
LOGISTIC REGRESSION + INTERACTIONS + GRAPHS
Null hypotheses and their ilk
Let us go back to the abdominal sepsis problem. The response variable
(Dead or Alive after surgery, Y) is binary. There are five covariates (X1 =
Shock; X2 = Undernourishment; X3 = Alcoholism; X4 = Age; X5 =
Infarction). We postulated the following regression model.
π(X1, X2, X3, X4, X5) = Probability of Death After Surgery = Pr(Y = 1)
= exp(β0 + β1X1 + β2X2 + β3X3 + β4X4 + β5X5) / [1 + exp(β0 + β1X1 + β2X2 + β3X3 + β4X4 + β5X5)].
This is a population model. The population consists of all those who have
had abdominal sepsis and for whom surgery is contemplated. We believe
that the response probability has the pattern spelled out above. This belief
can be tested.
We want to test the impact of Shock on the Response. The null hypothesis is
one of skepticism. The patient being in shock has no bearing on the outcome
of the surgery. The null hypothesis is H0: β1 = 0. We have to have an
alternative: H1: β1 ≠ 0. The null hypothesis can be interpreted as saying that Shock has
no impact on the Response. Another interpretation is that 'Shock' has no
significance. Yet another interpretation is that 'Shock' is not a risk factor.
Yet another interpretation is that there is no association between Shock and
the outcome of surgery. The alternative is interpreted as meaning that 'Shock' has an
impact on the Response. Equivalently, Shock and Outcome of Surgery are
associated.
In order to test the hypotheses we need data. It means that we want to check
whether the data is consistent with the null hypothesis. Using the data, we
estimate the unknown parameter β1. If the null hypothesis is true, we would
expect the estimate β̂1 to be close to the null value β1 = 0. Any large value of
the estimate β̂1 would make us doubt the validity of the null hypothesis. In
practice, we got β̂1 = 3.674. Is this large enough to cast doubt on the validity
of H0? If β1 = 0, how plausible is it to get an estimate of β1 as high as 3.674?
Mathematical statisticians are able to calculate the probability of getting an
estimate at least as large as 3.674 if the null hypothesis is true. Formally, the
p-value is defined by

p = Pr(|β̂1| ≥ 3.674 | H0 is true) = 0.0016.

If the null hypothesis is true, the chances of observing a value at least as
large as 3.674 for the estimate of β1 in a sample are very, very small. Getting
such an estimate would hardly be expected. However, we did indeed get such an estimate.
How did one calculate the probability? The probability is calculated under
the assumption that the null hypothesis is true. Maybe the assumption is not
valid. Reject the null hypothesis!
In short, if the p-value is small, reject the null hypothesis. Typically, the
p-value is compared with the industry standard 0.05. Any event with a
probability of occurrence of 0.05 or less is unlikely to occur.
When is a p-value deemed small?
1. One-in-twenty principle (5%): If an event has the probability of
occurrence ≤ 5%, the event is not expected to occur. (Level of
significance is 5%)
2. One-in-hundred principle (1%): If an event has the probability of
occurrence ≤ 1%, the event is not expected to occur. (Level of
significance is 1%)
3. One-in-ten principle (10%): If an event has the probability of
occurrence ≤ 10%, the event is not expected to occur. (Level of
significance is 10%)
4. Non-judgmental: Just report the p-value. Let the reader make up
his/her mind.
Some misconceptions!
1. Can I say that the chances that the null hypothesis is true are 0.0016?
No. Remember that the p-value is a conditional probability.
2. Is the null hypothesis true? I don’t know.
3. Is the null hypothesis false? I don’t know.
4. Is the data consistent with the null hypothesis? No, in this example.
Some theory behind the calculation of p-value
We have a model. The model is believed to be true. It has some parameters.
One of the parameters is β1. We take a random sample of subjects and
collect data on the variables. Using the data, we estimate β1. Let us denote
the estimate by β̂1. The value of the estimate would vary from sample to
sample. If the null hypothesis is true, it has been shown that, approximately,

β̂1 / SE ~ N(0, 1).

Here SE is the standard error of the estimate. Using the standard normal
distribution, we calculate

Pr(|β̂1 / SE| ≥ 3.674 / 1.162) = Pr(|β̂1 / SE| ≥ 3.1618) = 2*pnorm(3.1618, lower.tail = F) =
0.0016
I used R to calculate the p-value using the ‘pnorm’ command. Talk about
this more.
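For the Shock coefficient the whole calculation is one line, using the estimate 3.674 and standard error 1.162 reported earlier:

> 2*pnorm(3.674/1.162, lower.tail = FALSE)   # about 0.0016, the p-value quoted above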
What is standard error?
A medical doctor collected data on 106 patients. Using the data, we
estimated β1. The estimate is 3.674. If another researcher collects data on the
same theme, the estimate may not come out to be 3.674. There is bound to
be variation from one estimate to another. Mathematical statisticians are able
to estimate the variation as measured by the standard deviation of the
estimate. This is the standard error of the estimate. This can also be called
margin of error. In this example, standard error is 1.162. One can use the
standard error to provide a 95% confidence interval for the unknown
parameter β1 of the population model. It is
3.674 ± 1.96*1.162.
This interval misses β1 = 0. Check! From this, one can conclude that the null
hypothesis can be rejected.
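The interval can be checked in one line:

> 3.674 + c(-1, 1)*1.96*1.162   # about (1.40, 5.95); zero is not inside, so H0 is rejected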
Logistic Regression and interactions
Let us go back to the example presented in the last lecture. A treatment is
being tried on patients suffering from a medical condition, and we observe
whether or not they get relief.
Response variable: For any randomly chosen patient on treatment, let
Y = 1 if the patient gets relief
= 0 if not.
There are two prognostic variables: Age and Gender.
Main effects model:

Pr(Y = 1) = exp(β0 + β1*Age + β2*Gender) / [1 + exp(β0 + β1*Age + β2*Gender)]
and
Pr(Y = 0) = 1 / [1 + exp(β0 + β1*Age + β2*Gender)]
for some unknown parameters β0, β1, and β2.

Interaction model:

Pr(Y = 1) = exp(β0 + β1*Age + β2*Gender + β3*Age*Gender) / [1 + exp(β0 + β1*Age + β2*Gender + β3*Age*Gender)]
and
Pr(Y = 0) = 1 / [1 + exp(β0 + β1*Age + β2*Gender + β3*Age*Gender)]
for some unknown parameters β0, β1, β2, and β3.
Typically, one should entertain an interaction model before gravitating
towards the main effects model.
What does an interaction model mean? In a multiple regression set-up, an
interaction model is easy to explain. In the context of a logistic regression
model, one can provide a good explanation in terms of log-odds model.
Pr(Y  1)
= β0 + β1*Age + β2*Gender + β3*Age*Gender
Pr(Y  0)
The log-odds is a linear function of the covariates!
ln
Logistic Regression Model for Males

ln[Pr(Y = 1) / Pr(Y = 0)] = (β0 + β2) + (β1 + β3)*Age

The log-odds is a linear function of age with intercept β0 + β2 and slope β1 + β3.

Logistic Regression Model for Females

ln[Pr(Y = 1) / Pr(Y = 0)] = β0 + β1*Age

The log-odds is a linear function of age with intercept β0 and slope β1.
If interaction is present, i.e., β3 ≠ 0, the slopes are different. It is of
paramount importance to test the significance of interaction to begin with,
i.e., test the null hypothesis H0: β3 = 0. This is what we will do.
Load R with data.
> Age <- c(37, 39, 39, 42, 47, 48, 48, 52, 53, 55, 56, 57, 58, 58, 60, 64, 65,
68, 68, 70, 34, 38, 40, 40, 41, 43, 43, 43, 44, 46, 47, 48, 48, 50, 50, 52, 55,
60, 61, 61)
> length(Age)
[1] 40
> Gender <- rep(c("female", "male"), c(20, 20))
> Response <- factor(c(0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1,
1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1))
> length(Response)
[1] 40
> MB <- data.frame(Age, Gender, Response)
> MB
   Age Gender Response
1   37 female        0
2   39 female        0
3   39 female        0
4   42 female        0
5   47 female        0
6   48 female        0
7   48 female        1
8   52 female        0
9   53 female        0
10  55 female        0
11  56 female        0
12  57 female        0
13  58 female        0
14  58 female        1
15  60 female        0
16  64 female        0
17  65 female        1
18  68 female        1
19  68 female        1
20  70 female        1
21  34   male        1
22  38   male        1
23  40   male        0
24  40   male        0
25  41   male        0
26  43   male        1
27  43   male        1
28  43   male        1
29  44   male        0
30  46   male        0
31  47   male        1
32  48   male        1
33  48   male        1
34  50   male        0
35  50   male        1
36  52   male        1
37  55   male        1
38  60   male        1
39  61   male        1
40  61   male        1
> library(rms)

Attaching package: 'rms'
Fit the main effects model.
> MB1 <- lrm(Response ~ Age + Gender, data = MB, x = T, y = T)
> MB1
Logistic Regression Model

lrm(formula = Response ~ Age + Gender, data = MB, x = T, y = T)

Frequencies of Responses
 0  1
20 20

Obs  Max Deriv  Model L.R.  d.f.  P      C      Dxy    Gamma  Tau-a  R2     Brier
40   7e-08      16.54       2     3e-04  0.849  0.698  0.703  0.358  0.451  0.162

             Coef     S.E.     Wald Z   P
Intercept    -9.8429  3.67577  -2.68    0.0074
Age           0.1581  0.06164   2.56    0.0103
Gender=male   3.4898  1.19917   2.91    0.0036
Let us do a goodness-of-fit test.
> residuals.lrm(MB1, type = 'gof')
Sum of squared errors    Expected value|H0                   SD                    Z                    P
           6.4736338            6.4612280            0.2246174            0.0552307            0.9559547

The fit is excellent.
Let us fit the interaction model.
> MB2 <- lrm(Response ~ Age + Gender + Age*Gender, data = MB, x = T,
y = T)
> MB2
Logistic Regression Model

lrm(formula = Response ~ Age + Gender + Age * Gender, data = MB, x = T, y = T)

Frequencies of Responses
 0  1
20 20

Obs  Max Deriv  Model L.R.  d.f.  P      C      Dxy    Gamma  Tau-a  R2     Brier
40   2e-05      16.97       3     7e-04  0.864  0.728  0.733  0.373  0.461  0.158

                    Coef      S.E.    Wald Z   P
Intercept           -12.1462  5.5816  -2.18    0.0295
Age                   0.1970  0.0935   2.11    0.0351
Gender=male           7.7047  6.7598   1.14    0.2544
Age * Gender=male    -0.0819  0.1259  -0.65    0.5153
The interaction is not significant. Let us do a goodness-of-fit test.
> residuals.lrm(MB2, type = 'gof')
Sum of squared errors    Expected value|H0                   SD                    Z                    P
           6.3100002            6.3857317            0.2161884           -0.3503031            0.7261113

The fit is excellent.
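An equivalent check, using the Model L.R. values reported by lrm above: the likelihood-ratio statistic for the interaction term is the difference of the two model chi-squares, 16.97 − 16.54 = 0.43 on 1 degree of freedom.

> pchisq(16.97 - 16.54, df = 1, lower.tail = FALSE)   # about 0.51, agreeing with the Wald p-value 0.5153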
We better stick to the main effects model. What is its interpretation? It is
easy to explain in terms of log-odds.
Logistic Regression model for Males

ln[Pr(Y = 1) / Pr(Y = 0)] = (β0 + β2) + β1*Age

Logistic Regression model for Females

ln[Pr(Y = 1) / Pr(Y = 0)] = β0 + β1*Age
The lines are parallel. The only difference is in the intercepts.
Let us do some plotting. Plot the logistic regression model for males and
females separately.
> curve(exp(-9.8429 + 3.4898 + 0.1581*x)/(1 + exp(-9.8429 + 3.4898 +
0.1581*x)),
+ from = 30, to = 75, xlab = "Age", ylab = "Probability", col = "red", sub =
+ "Logistic Regression Model", main = "Probability of Relief From
Treatment")
> curve(exp(-9.8429 + 0.1581*x)/(1 + exp(-9.8429 + 0.1581*x)), col =
"blue",
+ add = T)
> text(40, 0.6, "Males", col = "red")
> text(60, 0.6, "Females", col = "blue")
The output is at the end.
What else can we do? Some prediction. Prediction is useful in pattern
recognition problems. What does the output folder MB1 contain?
> names(MB1)
[1] "freq"
"sumwty"
"stats"
[4] "fail"
"coefficients" "var"
[7] "u"
"deviance"
"est"
[10] "non.slopes"
"linear.predictors" "penalty.matrix"
[13] "info.matrix"
"weights"
"x"
[16] "y"
"call"
"Design"
25
[19] "scale.pred"
[22] "na.action"
[25] "fitFunction"
"terms"
"fail"
"assign"
"nstrata"
> MB1$linear.predictors
          1           2           3           4           5           6
-3.99478499 -3.67866837 -3.67866837 -3.20449343 -2.41420186 -2.25614355
          7           8           9          10          11          12
-2.25614355 -1.62391030 -1.46585199 -1.14973536 -0.99167705 -0.83361874
         13          14          15          16          17          18
-0.67556042 -0.67556042 -0.35944380  0.27278945  0.43084777  0.90502270
         19          20          21          22          23          24
 0.90502270  1.22113933 -0.97913195 -0.34689870 -0.03078207 -0.03078207
         25          26          27          28          29          30
 0.12727624  0.44339287  0.44339287  0.44339287  0.60145118  0.91756780
         31          32          33          34          35          36
 1.07562612  1.23368443  1.23368443  1.54980106  1.54980106  1.86591768
         37          38          39          40
 2.34009262  3.13038418  3.28844250  3.28844250
Linear predictors are the numbers calculated in the linear form of the model
for every individual in the sample. We can calculate predicted probability of
relief as per the model for every one in the sample using the subject’s
covariate values.
> Predict <- round(exp(MB1$linear.predictor)/(1 +
exp(MB1$linear.predictors)), 3)
> Predict
    1     2     3     4     5     6     7     8     9    10    11    12    13
0.018 0.025 0.025 0.039 0.082 0.095 0.095 0.165 0.188 0.241 0.271 0.303 0.337
   14    15    16    17    18    19    20    21    22    23    24    25    26
0.337 0.411 0.568 0.606 0.712 0.712 0.772 0.273 0.414 0.492 0.492 0.532 0.609
   27    28    29    30    31    32    33    34    35    36    37    38    39
0.609 0.609 0.646 0.715 0.746 0.774 0.774 0.825 0.825 0.866 0.912 0.958 0.964
   40
0.964
Put everything together.
> MB3 <- data.frame(MB, Predict)
> MB3
   Age Gender Response Predict
1   37 female        0   0.018
2   39 female        0   0.025
3   39 female        0   0.025
4   42 female        0   0.039
5   47 female        0   0.082
6   48 female        0   0.095
7   48 female        1   0.095
8   52 female        0   0.165
9   53 female        0   0.188
10  55 female        0   0.241
11  56 female        0   0.271
12  57 female        0   0.303
13  58 female        0   0.337
14  58 female        1   0.337
15  60 female        0   0.411
16  64 female        0   0.568
17  65 female        1   0.606
18  68 female        1   0.712
19  68 female        1   0.712
20  70 female        1   0.772
21  34   male        1   0.273
22  38   male        1   0.414
23  40   male        0   0.492
24  40   male        0   0.492
25  41   male        0   0.532
26  43   male        1   0.609
27  43   male        1   0.609
28  43   male        1   0.609
29  44   male        0   0.646
30  46   male        0   0.715
31  47   male        1   0.746
32  48   male        1   0.774
33  48   male        1   0.774
34  50   male        0   0.825
35  50   male        1   0.825
36  52   male        1   0.866
37  55   male        1   0.912
38  60   male        1   0.958
39  61   male        1   0.964
40  61   male        1   0.964
Are there any graphics available to check on interaction?
[Figure: "Probability of Relief From Treatment" — the fitted probability of relief versus Age (30 to 70) from the main effects model, with the Males curve in red and the Females curve in blue; subtitle "Logistic Regression Model".]
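One simple graphical check, sketched here using the interaction-model coefficients from the MB2 output above: draw the fitted curves implied by the interaction model for the two genders. If they look much like the parallel-logit curves plotted above, the interaction is adding little.

curve(plogis(-12.1462 + 7.7047 + (0.1970 - 0.0819)*x), from = 30, to = 75, col = "red",
      xlab = "Age", ylab = "Probability", main = "Interaction model: probability of relief")
curve(plogis(-12.1462 + 0.1970*x), col = "blue", add = TRUE)
text(40, 0.6, "Males", col = "red"); text(60, 0.3, "Females", col = "blue")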
Module 3: Odds, Odds ratio, and their ilk
Odds and odds ratio
A Bernoulli Trial is a random experiment, which when performed
results in one and only one of two possible outcomes. For example, tossing a
coin is a Bernoulli trial. There are only two possible outcomes: Heads or
Tails. A medical researcher devised a new medicine for curing a specific
malady. If the medicine is administered to a patient suffering from the
malady, only one of two possible outcomes results: the patient is cured or
the patient is not cured. It is customary to denote the outcomes as Success
and Failure.
Consider a Bernoulli trial with probability of Success being p.
Consequently, the probability of Failure is 1 − p. We define

Odds of Success versus Failure = Odds of Success = p / (1 − p).

The odds of Success provide a way to assess how likely Success is to occur
in comparison with Failure. In the following table, for a given probability of
Success, we calculate the odds of Success versus Failure.
p:     0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9
Odds:  1/9   1/4   3/7   2/3   1     1.5   7/3   4     9
Suppose the odds of Success are 1. It means that Success and Failure are
equally likely.
Suppose the odds of Success are 2. This means that Success is twice as
likely to occur as Failure. Solve the equation p / (1 − p) = 2 for p.
Solution: p = 2/3 and 1 − p = 1/3.
Suppose the odds of Success are 8. This means that Success is eight times as
likely to occur as Failure. Suppose the odds of Success are 1/4. This means
that Failure is four times as likely to occur as Success.
In Medical Research, the scientists usually talk about odds of a treatment
being successful.
If we know the probability of Success, we can work out the odds of Success.
Conversely, if we know the odds of Success (versus Failure), we can work
out the probability of Success.
Suppose the odds of Success are 4.1. Set p / (1 − p) = 4.1 and solve for p.
As a matter of fact, p = 4.1 / (1 + 4.1) = 0.8039.
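The conversion in both directions takes one line each in R:

p <- 0.8039
p / (1 - p)        # the odds, about 4.1
4.1 / (1 + 4.1)    # back to the probability, about 0.8039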
Now we come to the concept of Odds Ratio. As the name indicates, it
is indeed the ratio of two sets of odds. Another name for the Odds Ratio is
Cross Ratio.
Odds ratio can be defined for independent Bernoulli experiments. In
prospective studies, we generally compare the performance of two treatments.
Prospective Studies
Suppose we have two Bernoulli trials. In one Bernoulli trial the probability
of Success is p1 and in the other it is p2. In Bernoulli Trial 1, the odds of
Success versus Failure are p1 / (1 − p1). In Bernoulli Trial 2, the odds of
Success versus Failure are p2 / (1 − p2). The Odds Ratio is defined by

OR = [p1 / (1 − p1)] / [p2 / (1 − p2)].
Interpretation. Suppose OR = 2. This means
Odds of Success versus Failure in Trial 1 are two times the Odds of Success
versus Failure in Trial 2. It also implies that the probability of Success in
Trial 1 is greater than the probability of Success in Trial 2. How much
larger? It depends on what the probability of Success is in Trial 2.
1. Suppose the Odds of Success in Trial 2 are one. This means that in Trial
2, Success and Failure occur with equal probability ½. The odds of
Success in Trial 1 are two. This implies that the probability of success in
Trial 1 is 2/3. The probability of Failure is 1/3. Successes are two times
as likely as Failures.
2. Suppose the Odds of Success in Trial 2 are two. In Trial 2, the
probability of Success is 2/3. Thus in Trial 2, Successes are two times as
likely as Failures. In Trial 1, the Odds of Success are 4. The probability
of Success in Trial 1, therefore, is 4/5. The probability of Failure is 1/5.
Thus in Trial 1, Successes are four times as likely as Failures.
Estimation of Odds Ratio
The population odds ratio OR is unknown. We collect data in order to
estimate and build confidence intervals for OR. Typically, the data are
collected by adopting a prospective design.
In medical research, Trial 1 corresponds to an experimental drug and Trial 2
corresponds to a standard drug (control). The drugs are designed to cure a
specific malady.
Sampling.
Select m many patients randomly and put them all on the experimental drug.
Observe them for a certain length of time. Let m1 be the number of patients
for whom the drug is successful. Let m2 be the number of patients for whom
the drug is not successful.
Select n many patients randomly and put them all on the standard drug.
Observe them for the same length of time. Let n1 be the number of patients
for whom the drug is successful. Let n2 be the number of patients for whom
the drug is not successful.
The sampling protocol described above is called a prospective design. The
data can be put in the form of a 2x2 contingency table.
Drug           Successful            Sample size
               Yes        No
Experimental   m1         m2         m
Standard       n1         n2         n
The null hypothesis is that the population OR = 1 (hypothesis of
skepticism). The alternative hypothesis is that OR ≠ 1.

H0: OR = 1
H1: OR ≠ 1
Null hypothesis is equivalent to the statement that the experimental and
standard drugs are equally effective. Let p1 be the probability of success on
the experimental drug and p2 the probability of success on the standard drug.
Population odds ratio is defined by

OR = [p1 / (1 − p1)] / [p2 / (1 − p2)] = p1(1 − p2) / [(1 − p1)p2].
Claim: OR = 1 if and only if p1 = p2. Prove this.
It is easier to handle OR than to handle p1 and p2 separately.
Estimate of OR: ÔR = (m1·n2) / (m2·n1). (Derive this estimate.)

Standard Error of ln(ÔR): SE(ln(ÔR)) = sqrt(1/m1 + 1/m2 + 1/n1 + 1/n2)

Test statistic: Z = [ln(ÔR) − ln(1)] / SE(ln(ÔR))

Test: Reject the null hypothesis at the 5% level of significance if |Z| > 1.96.
Note: Testing OR = 1 is equivalent to testing p1 = p2. For testing p1 = p2, one
could use the two-sample proportion test if the alternative is directional or a
chi-squared test if the alternative is non-directional. The test based on OR
has better sampling properties than the one based on the proportions.
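A small sketch of the whole calculation with made-up counts (the numbers m1 = 30, m2 = 20, n1 = 18, n2 = 32 are hypothetical, not from any study in these notes):

m1 <- 30; m2 <- 20; n1 <- 18; n2 <- 32
OR.hat <- (m1*n2) / (m2*n1)                        # estimated odds ratio
se.log <- sqrt(1/m1 + 1/m2 + 1/n1 + 1/n2)          # standard error of ln(OR.hat)
Z <- (log(OR.hat) - log(1)) / se.log               # test statistic for H0: OR = 1
c(OR.hat, Z, 2*pnorm(abs(Z), lower.tail = FALSE))  # estimate, Z, two-sided p-value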
Retrospective Studies
Odds Ratio can also be defined in the context of a retrospective study. Let us
consider the problem of examining association between smoking status of
mothers and perinatal mortality. We select at random a maternity record of a
mother from a group of hospitals. We observe the following two categorical
variables:
Smoking status of the mother
and
perinatal mortality of the baby
It is natural to expect that these response variables are correlated.
Their joint distribution can be summarized in the following table.

                      Perinatal Mortality (Y)
Mother Smoked (X)     Yes (0)    No (1)    Marginals
Yes (0)               a          b         a + b
No (1)                c          d         c + d
Marginals             a + c      b + d     1
Odds of Death versus Life if the mother smoked during pregnancy
= Pr(Baby Died | Mother Smoked) / Pr(Baby Alive | Mother Smoked)
= [a / (a + b)] / [b / (a + b)] = a/b.

Odds of Death versus Life if the mother did not smoke during pregnancy
= Pr(Baby Died | Mother did not smoke) / Pr(Baby Alive | Mother did not smoke)
= [c / (c + d)] / [d / (c + d)] = c/d.

The odds ratio is the ratio of these two sets of odds:

OR = ad / (bc).
Equivalently,
Odds of Death versus Life if mother smoked = OR × Odds of Death versus
Life if mother did not smoke.
OR = 1 means it does not matter whether mother smokes or not. The
smoking status of the mother has no impact on mortality. This also means
that Smoking Status of Mother and Perinatal Mortality are statistically
independent.
OR > 1 implies that death probability if mother smokes is greater than the
death probability if mother does not smoke.
So far, what we have discussed is the population odds ratio. We do not know
the population odds ratio. We need to estimate the population odds ratio and
test hypotheses about the odds ratio.
Inference for Odds Ratio
The data come in the form of a 2x2 contingency table.

X \ Y    0      1
0        n00    n01
1        n10    n11

A point estimate of OR: ÔR = (n00·n11) / (n10·n01).
In order to build a confidence interval for OR, we need its large sample
standard error. What is the standard error? It is easy to obtain a large sample
standard error of

ln(ÔR) = ln(n00) + ln(n11) − ln(n01) − ln(n10)

using asymptotic theory. As a matter of fact, the estimated standard error is
given by

SE(ln(ÔR)) = sqrt(1/n00 + 1/n01 + 1/n10 + 1/n11).

A large sample 95% confidence interval for ln(OR) is given by

ln(ÔR) ± 1.96·SE(ln(ÔR)).

In order to get a large sample 95% confidence interval for OR, take anti-logarithms. It is given by

exp{ln(ÔR) − 1.96·SE(ln(ÔR))} ≤ OR ≤ exp{ln(ÔR) + 1.96·SE(ln(ÔR))}.
Whether the study is prospective or retrospective, the concept of odds ratio
is the same. The underlying distributions are characteristically different.
Example. Back to perinatal mortality problem and smoking mothers. A
retrospective study of 48,378 mothers yielded the following data.
Mother            Perinatal Mortality
Smoked            Yes       No        Marginals
Yes               619       20,443    21,062
No                634       26,682    27,316
Marginals         1,253     47,125    48,378

ÔR = (619 × 26,682) / (634 × 20,443) = 1.27

ln(ÔR) = 0.2390
SE[ln(ÔR)] = sqrt(1/619 + 1/20443 + 1/634 + 1/26682) = 0.057

A large sample 99% confidence interval for ln(OR) is given by

ln(ÔR) ± 2.576·SE[ln(ÔR)]
0.2390 ± 0.1475
0.0915 ≤ ln(OR) ≤ 0.3865

A large sample 99% confidence interval for OR is given by

exp{0.0915} ≤ OR ≤ exp{0.3865}
1.10 ≤ OR ≤ 1.47
Conclusion. The number 1 is not in the interval. Smoking status of a mother
does indeed influence the perinatal mortality of the baby. The odds ratio is at
least 1.10 and at most it is 1.47.
The odds of death versus life if mother smokes is at best 1.10 times the odds
of death versus life if mother does not smoke and at worst it is 1.47 times the
odds of death versus life if mother does not smoke.
R can do all these calculations.
Download the package ‘vcd.’ Activate the package.
Enter the data into a matrix.
> MB <- matrix(c(619, 20443, 634, 26682), nrow = 2, byrow = T)
> MB
[,1] [,2]
[1,] 619 20443
[2,] 634 26682
Name the rows and columns.
> rownames(MB) <- c("Smoked", "No")
> colnames(MB) <- c("Died", "Alive")
> MB
        Died Alive
Smoked   619 20443
No       634 26682
The command for oddsratio is ‘oddsratio.’
> oddsratio(MB)
[1] 0.2424050
This is ln(oddsratio), i.e., log of odds ratio.
We can get confidence interval of ln(OR). The default level is 95%.
> confint(MB)
           lwr       upr
[1,] 0.1302128 0.3545972
> confint(MB, level = 0.95)
           lwr       upr
[1,] 0.1302128 0.3545972
> confint(MB, level = 0.90)
           lwr       upr
[1,] 0.1482503 0.3365596
> confint(MB, level = 0.99)
            lwr       upr
[1,] 0.09495945 0.3898505
How to get confidence intervals for OR?
> oddsratio(MB, log = F)
[1] 1.274310
> MB3 <- oddsratio(MB, log = F)
> confint(MB3)
          lwr      upr
[1,] 1.139071 1.425606
Odds ratios in the context of Logistic Regression
In a multiple regression model, it is easy to examine the impact of a
covariate on the response variable. Suppose we have one response variable y
and three covariates X1, X2, and X3. Suppose we have the following
estimated multiple regression equation:
ŷ = 3 + 2X1 + 3X2 – 4X3.
What is the impact of X1 on the response variable? Suppose we increase the
value of X1 by one unit and keep the values of X2 and X3 the same. What
will happen to the value of y?
Scenario 1: X1 = 1, X2 = 3, X3 = 1  →  ŷ = 10
Scenario 2: X1 = 2, X2 = 3, X3 = 1  →  ŷ = 12
What is the difference between Scenarios 1 and 2? The value of X1 has gone
up by one unit and the values of X2 and X3 have remained the same. The
value of y has gone up by two units. The number 2 is precisely the
coefficient of X1 in the multiple regression estimated equation. Thus the
impact of X1 on the response variable is positive and is measured by the
coefficient of X1 in the equation.
What is the impact of X3 on the response variable y? If the value of X3 goes
up by one unit and the values of X1 and X2 remain the same, then the value
of y goes down by four units on average.
Scenario 1: X1 = 1, X2 = 3, X3 = 1  →  ŷ = 10
Scenario 2: X1 = 1, X2 = 3, X3 = 2  →  ŷ = 6
What is the difference between Scenarios 1 and 2?
We would like to initiate a similar study in the environment of logistic
regression models. Suppose we have two covariates X1 and X2 in a logistic
regression model. Suppose the model is given by

Pr(Y = 1 | X1, X2) = exp{β0 + β1X1 + β2X2} / [1 + exp{β0 + β1X1 + β2X2}]
and
Pr(Y = 0 | X1, X2) = 1 / [1 + exp{β0 + β1X1 + β2X2}]
What is the impact of X1 on the response variable? A multiple regression
type of interpretation is not possible here. We work with the odds ratios. Let
Y = 1 stand for success and Y = 0 stand for failure.
Odds of Success versus Failure = Pr(Y = 1 | X1, X2) / Pr(Y = 0 | X1, X2) = exp(β0 + β1X1 + β2X2).
Look at the following scenario:
X1 = 1 and X2 = 2  →  Odds of Success versus Failure = exp{β0 + β1 + 2β2}
(Check this.)
Let us increase the value of X1 by one unit and keep the value of X2 the same, i.e.,
X1 = 2 and X2 = 2  →  Odds of Success versus Failure = exp{β0 + 2β1 + 2β2}

Odds Ratio = [Odds of Success versus Failure when X1 = 2 and X2 = 2] ÷
[Odds of Success versus Failure when X1 = 1 and X2 = 2] = exp{β1}.
The numbers given to X1 and X2 are not special. Give any values to X1 and
X2 so that the value of X1 goes up by one unit and the value of X2 remains
the same. The odds ratio will remain the same.
Equivalently,
[Odds of Success versus Failure when X1 = 2 and X2 = 2] = (Odds Ratio) ×
[Odds of Success versus Failure when X1 = 1 and X2 = 2]
Value of β1    Odds Ratio    Impact of X1 on Y
0              1             No impact
> 0            > 1           If X1 goes up, so do the odds.
< 0            < 1           If X1 goes up, the odds go down.
Example. Framingham Study: Homework
Module 4: Odds ratio from the logistic regression model vis-à-vis Odds ratio
from the contingency table + Biplots + How to download EXCEL onto R
Odds ratio
Let us look at the data on Response to a particular treatment with prognostic
variable Age and Gender.
> Age <- c(37, 39, 39, 42, 47, 48, 48, 52, 53, 55, 56, 57, 58, 58, 60, 64, 65,
68, 68, 70, 34, 38, 40, 40, 41, 43, 43, 43, 44, 46, 47, 48, 48, 50, 50, 52, 55,
60, 61, 61)
> Gender <- factor(rep(c("F", "M"), c(20, 20)))
> Response <- factor(c(0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1,
1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1))
> MB <- data.frame(Age, Gender, Response)
> MB
   Age Gender Response
1   37      F        0
2   39      F        0
3   39      F        0
4   42      F        0
5   47      F        0
6   48      F        0
7   48      F        1
8   52      F        0
9   53      F        0
10  55      F        0
11  56      F        0
12  57      F        0
13  58      F        0
14  58      F        1
15  60      F        0
16  64      F        0
17  65      F        1
18  68      F        1
19  68      F        1
20  70      F        1
21  34      M        1
22  38      M        1
23  40      M        0
24  40      M        0
25  41      M        0
26  43      M        1
27  43      M        1
28  43      M        1
29  44      M        0
30  46      M        0
31  47      M        1
32  48      M        1
33  48      M        1
34  50      M        0
35  50      M        1
36  52      M        1
37  55      M        1
38  60      M        1
39  61      M        1
40  61      M        1
Response and Gender are binary variables. We can cross-tabulate these
variables to get a 2x2 contingency table. We can calculate the odds ratio for
the table to measure the degree of association between Response and
Gender. We can also build a 95% confidence interval for the population
odds ratio. Let us activate the package ‘vcd.’
> MB1 <- table(Response, Gender)
> MB1
Gender
Response F M
0 14 6
1 6 14
> MB2 <- oddsratio(MB1, log = F)
> MB2
[1] 5.444444
Interpretation: Odds of Cure versus No Cure if the patient is Male =
5.44*Odds of Cure versus No Cure if the patient is Female. Odds of Cure
are much better for males than for females.
Suppose the odds of Cure versus No Cure for females are 1/5, i.e., No Cure
is 5 times more likely than Cure. Then the odds of Cure versus No Cure for
males is 5.44*(1/5) = 1.088, better than evens. More precisely,
Pr(Cure | Male) = 1.088/(1 + 1.088) = 0.52
In this analysis, age is not factored in.
> confint(MB2)
          lwr      upr
[1,] 1.471410 20.14528
More precisely, a 95% confidence interval for the population odds ratio is
given by
1.47 ≤ OR ≤ 20.15.
Why is this confidence interval so wide? The sample is small. Recall that the
standard error of ln(ÔR) is sqrt(1/14 + 1/6 + 1/14 + 1/6). The length of the
confidence interval depends on the standard error. The smaller the standard
error is, the shorter the confidence interval is. The numbers in the four cells
of the contingency table are small. The bigger these numbers are, the smaller
the standard error is. Recall how the confidence interval is built.
We want to test the null hypothesis that the population odds ratio is equal to
one, i.e., there is no association between Response and Gender.
H0: OR = 1
H1: OR ≠ 1
The observed 95% confidence interval is: 1.47 ≤ OR ≤ 20.15. The interval
does not contain OR = 1. We reject the null hypothesis at 5% level of
significance. We can also calculate the p-value.
Under the null hypothesis, ln(OR) = 0. Under the null hypothesis,
theoretically,

Z = [ln(ÔR) − 0] / SE(ln(ÔR))
has a standard normal distribution. Observed value of the z-statistic can be
computed using R.
> SE <- sqrt(1/14 + 1/14 + 1/6 + 1/6)
> SE
[1] 0.6900656
> Z <- log(MB2)/SE
> Z
[1] 2.455703
> pvalue <- 2*pnorm(2.455703, lower.tail = F)
> pvalue
[1] 0.01406093
Based on this p-value, we can reject H0: ln(OR) = 0 or H0: OR = 1.
We can fit a logistic regression model to the data.
Pr(Response = 1 | Age, Gender) = exp(β0 + β1*Age + β2*Gender) / [1 + exp(β0 + β1*Age + β2*Gender)]
Let us fit this model.
> MB3 <- glm(Response ~ Age + Gender, data = MB, family = binomial)
> summary(MB3)
Call:
glm(formula = Response ~ Age + Gender, family = binomial, data = MB)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-1.86671  -0.80814   0.03983   0.78066   2.17061

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.84294    3.67576  -2.678  0.00741 **
Age          0.15806    0.06164   2.564  0.01034 *
GenderM      3.48983    1.19917   2.910  0.00361 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 55.452  on 39  degrees of freedom
Residual deviance: 38.917  on 37  degrees of freedom
AIC: 44.917

Number of Fisher Scoring iterations: 5
> OddsRatioGender <- exp(3.4898)
> OddsRatioGender
[1] 32.77939
If the Age is fixed, no matter what it is, the odds of Cure versus No Cure for
a male = 32.78*Odds of Cure versus No Cure for a female, both with the
same age.
Suppose the odds of Cure versus No Cure for females are 1/5, i.e., No Cure
is 5 times more likely than Cure for females. Then the odds of Cure versus
No Cure for males are 32.78*(1/5) = 6.556. This means
Pr(Cure | Male) = 6.556/(1 + 6.556) = 0.87.
This odds ratio takes into account age. One can say that this odds ratio is the
odds ratio of Response and Gender adjusted for Age. This is indeed a true
summary of the relationship between Response to the treatment and Gender.
Another great advantage of the logistic regression model, if it is a good fit, is
that we can measure association between the response variable (binary) and
a numeric covariate after adjusting for the presence of other covariates.
In our example, the continuous variable is Age. The odds ratio of Response
to Treatment and Age is exp(0.15806) = 1.17.
If the gender is fixed,
Odds of Cure versus No cure if (Age = x+1) =
1.17*Odds of Cure versus No Cure if (Age = x),
where x is any number.
This odds ratio is the odds ratio of Response to the Treatment and Age
adjusted for Gender. If the gender is fixed,
Odds of Cure versus No cure if (Age = x+2) =
(1.17)^2 * Odds of Cure versus No Cure if (Age = x)
Derive this result.
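The numbers can be checked directly:

> exp(0.15806)      # about 1.17: odds multiplied by 1.17 for one extra year of age
> exp(2*0.15806)    # about 1.37 = 1.17^2: the two-year version derived above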
There is no way we can measure association between Response to Treatment
(binary) and Age (continuous) using contingency table approach.
Build a 95% confidence interval for the odds ratio of Gender
The coefficient of Gender in the logistic regression model is β2. The population
odds ratio is exp(β2). A 95% confidence interval for β2 is

β̂2 ± 1.96*SE
3.49 ± 1.96*1.20
1.14 ≤ β2 ≤ 5.84

A 95% confidence interval for the odds ratio exp(β2) is obtained by
exponentiation of the above interval:

3.13 ≤ OR ≤ 343.78
Why is this interval so wide?
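The same interval in R, as a one-line check of the calculation above:

> exp(3.48983 + c(-1, 1)*1.96*1.19917)   # roughly 3.1 to 344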
Biplots
Goal: I have 4 quantitative variables: X1, X2, X3, and X4. Make a graphical
presentation of the data on these four variables in a single frame.
Solution: Get a scatter plot of X1 and X2. Get the scatter plot of X3 and X4
on the same graph.
How? Let us look at an example. The data 'iris' is available in R. Data were
collected on Petal Length, Petal Width, Sepal Length, and Sepal Width on
three different species of iris flowers (setosa, versicolor, virginica).
Download the data.
> data(iris)
> dim(iris)
[1] 150 5
> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
How many flowers in each species?
> table(iris$Species)
    setosa versicolor  virginica
        50         50         50
Focus on setosa flowers only.
The four measurements are: Petal.Length; Petal.Width; Sepal.Length; and
Sepal.Width.
Get the scatter plot of Petal Length and Sepal Length. Superimpose this
graph with the scatter plot of Petal Width and Sepal Width.
> setosa <- subset(iris, iris$Species == "setosa")
Using ‘par’ command, create four lines of space at the bottom, four lines on
the left, seven lines at the top, and seven lines on the right. I need space at
the top and on the right for legend. (mar = margin on the sides)
> par(mar = c(4, 4, 7, 7))
> plot(setosa$Petal.Length, setosa$Sepal.Length, pch = 16, col = "red",
+ xlab = "Petal Length", ylab = "Sepal Length")
I have been emphasizing that a plot command will not normally superimpose
its output on an existing plot. We can overcome that.
> par(new = T)
> plot(setosa$Petal.Width, setosa$Sepal.Width, pch = 17, col = "blue", ann
= F,
+ axes = F)
> range(setosa$Sepal.Width)
[1] 2.3 4.4
> range(setosa$Petal.Width)
[1] 0.1 0.6
> axis(side = 3, at = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6))
> axis(side = 4, at = c(2, 3, 4, 5))
> mtext("Petal Width", side = 3, line = 2)
> mtext("Sepal Width", side = 4, line = 3)
> mtext("Setosa Flowers", side = 3, line = 5)
mtext = text on the margins
Here is the biplot.
This method of plotting can be used to plot X versus Y and X versus U,
where X, Y, and U are three quantitative variables. In this case, side = 3 is
not needed.
[Figure: "Setosa Flowers" — scatter plot of Sepal Length versus Petal Length (red points, bottom/left axes) overlaid with Sepal Width versus Petal Width (blue points, top/right axes).]
How to download EXCEL data into R for MAC users?
Courtesy: Gail Pyne-Geithman, Associate Professor, Neurosurgery
1. Save the data as a comma-separated .csv file.
2. Find the precise address of this file. If you can find it, that is good. If
   you cannot, R can find it for you. For example, suppose the data file is
   "sepsis.csv". Then in the R console, type > rawdata <- file.choose() and
   pick the file. Then type > rawdata. This will give the address in double quotes.
3. Type > Name <- read.csv("Address/sepsis.csv", header = T)
4. The folder Name contains the data.
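Putting the steps together (the object names path and sepsis are illustrative, and the file location is whatever file.choose() returns):

path <- file.choose()                    # interactively locate sepsis.csv
sepsis <- read.csv(path, header = TRUE)  # import the data
head(sepsis)                             # first few rows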
Module 5: NON-PARAMETRIC REGRESSION; BINARY RESPONSE
VARIABLE; CLASSIFICATION TREES
We have been working on how to model a binary response variable in terms
of covariates or independent variables. Our approach was probabilistic in
nature. We proposed a logistic regression model. However, there are a
number of other approaches. One approach popular with engineers and
physicists is to treat the problem as a pattern recognition or classification
problem. Let us go back to the abdominal sepsis problem.
Response variable
Y = 1 if the patient dies after surgery
= 0 if the patient survives after surgery
Independent variables
X1: Is the patient in a state of shock?
X2: Is the patient suffering from malnutrition?
X3: Is the patient alcoholic?
X4: Age
X5: Does the patient have a bowel infarction?
In logistic regression, the probability distribution of Y is modeled in terms of
the covariates.
If we view this problem as a pattern recognition problem, we need to
identify what the patterns are. The situation Y = 1 is regarded as one pattern
and Y = 0 as the other. Once we have information on the independent
variables for a patient, we need to classify him/her into one of the two
patterns. We have to come up with a protocol, which will classify the patient
as falling into one of the patterns. In other words, we have to say whether he
will die or survive after surgery. We will not make a probability statement.
Any classification protocol one comes up can not be expected to be free of
errors. A classification protocol is judged based on its misclassification error
rate. We will make precise this concept later.
Core idea: Look at the space of predictors. We want to break up the
predictor space into boxes (5-dimensional parallelepipeds) so that each box
is identified with one pattern. For example, Shock = 1, Malnourishment = 0,
Alcoholism = 1, Age > 45, Infarction = 1 is one such box. Can we say that
most of the patients that fall into this box die? We want to divide the
predictor space into mutually exclusive and exhaustive boxes so that the
patients falling into each box have predominantly one pattern. The creation
of such boxes is the main objective of this lecture.
One popular method in classification or pattern recognition is the so called
the ‘classification tree methodology,’ which is a data mining method. The
methodology was first proposed by Breiman, Friedman, Olshen, and Stone
in their monograph published in 1984. This goes by the acronym CART
(Classification and Regression Trees). A commercial program called CART
can be purchased from Salford Systems. Other more standard statistical
software such as SPLUS, SPSS, and R also provide tree construction
procedures with user-friendly graphical interface. The packages ‘rpart’ and
‘tree’ do classification trees. Some of the material I am presenting in this
lecture is culled from the following two books.
L Breiman, J H Friedman, R A Olshen, and C J Stone – Classification and
Regression Trees, Wadsworth International Group, 1984.
Heping Zhang and Burton Singer – Recursive Partitioning in the Health
Sciences, Second Edition, Springer, 2008.
Various computer programs related to this methodology can be downloaded
freely from Heping Zhang’s web site: http://peace.med.yale.edu/pub
Let me illustrate the basic ideas of tree construction in the context of a
specific example of binary classification. In the construction of a tree, for
evaluation purpose, we need the concept of ENTROPY of a probability
distribution and/or Gini’s measure of uncertainty. Suppose we have a
random variable X taking finitely many values with some probability
distribution.
X:     1     2     …     m
Pr.:   p1    p2    …     pm
We want to measure the degree of uncertainty in the distribution (p1, p2, … ,
pm). For example, suppose m = 2. Look at the distributions (1/2, 1/2) and
(0.99, 0.01). There is more uncertainty in the first distribution than in the
second. Suppose some one is about to crank out X. I am more comfortable in
betting on the outcome of X if the underlying distribution is (0.99, 0.01) than
when the distribution is (1/2,1/2). We want to assign a numerical quantity to
measure the degree of uncertainty. Entropy of a distribution is introduced as
a measure of uncertainty.
Entropy (p1, p2, … , pm) = – (p1 ln p1 + p2 ln p2 + … + pm ln pm)
= Entropy impurity = Measure of chaos, with the convention that 0 ln 0 = 0.
Properties
1. 0 ≤ Entropy ≤ ln m.
2. The minimum 0 is attained for each of the distributions (1, 0, 0, … ,
0), (0, 1, 0, … , 0), … , (0, 0, … , 0, 1). For each of these
distributions, there is no uncertainty. The entropy is zero.
3. The maximum ln m is attained at the distribution (1/m, 1/m, … ,
1/m). The uniform distribution is the most chaotic. Under this
uniform distribution, uncertainty is maximum.
There are other measures of uncertainty available in the literature.
Gini’s measure of uncertainty for the distribution (p1, p2, … , pm)
= the sum of pi pj over all pairs i ≠ j.
Properties
1. 0 ≤ Gini’s measure ≤ (m-1)/m.
2. The minimum 0 is attained for each of the distributions (1, 0, 0, … ,
0), (0, 1, 0, … , 0), … , (0, 0, … , 0, 1). For each of these
distributions, there is no uncertainty. The Gini’s measure is zero.
3. The maximum (m-1)/m is attained at the most chaotic distribution
(1/m, 1/m, … , 1/m). Under this uniform distribution, the uncertainty
is maximum.
Another measure of uncertainty is defined by
min {p1, p2, … , pm}.
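A small R sketch of these three measures (the function names are mine, not from any package; the probabilities are supplied as a vector):

> entropy <- function(p) { p <- p[p > 0]; -sum(p * log(p)) }   # convention 0 ln 0 = 0
> gini <- function(p) sum(outer(p, p)) - sum(p^2)              # sum of pi*pj over all pairs i != j
> entropy(c(0.5, 0.5))      # ln 2 = 0.69, the maximum for m = 2
> entropy(c(0.99, 0.01))    # about 0.06, much less uncertainty
> gini(c(0.5, 0.5))         # 0.5 = (m - 1)/m for m = 2
> gini(c(1, 0))             # 0, no uncertainty at all
> min(c(0.5, 0.5))          # the third measure is simply the smallest probability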
Basic ideas in the development of a classification tree
Let me work with an artificial example.
ID    Y    X1    X2
 1    0     1     2
 2    1     6     5
 3    1     5     7
 4    0    10     9
 5    0     5     5
 6    1     4     8
 7    1    10     2
 8    0     4     3
 9    1     8     4
10    0     9     7
11    1     3     9
12    0     8     8
13    1     9     2
14    0     3     1
15    0     7     7
16    1     2    10
17    0     6    10
18    1     7     5
19    1     1     6
20    0     2     4
Goal: I know one with given X1 and X2 values. I need to classify him as
having the pattern Y = 0 or Y = 1. We have the training data given above to
develop a classification protocol. (I could have done a logistic regression
here.)
Another view point: What ranges of X1 and X2 values identify the pattern {Y
= 0} and what for the pattern {Y = 1}?
Step 1: Put all the subjects into the root node. There are 10 subjects with the
pattern Y = 0 and ten with Y = 1. How impure is the root (mother) node?
Calculate the entropy of the distribution:
{Y = 0}   {Y = 1}
  0.5       0.5

Impurity of the mother = – 0.5 ln(0.5) – 0.5 ln(0.5) = ln 2 = 0.69.
Step 2: Let us split the mother node into two daughter nodes. We need to
choose one of the covariates. Let us choose X1. We need to choose one of
the numbers taken by X1. The possible values of X1 are 1, 2, … , 10. Let us
choose 5. All those subjects with X1 ≤ 5 go into the left daughter node. All
those subjects with X1 > 5 go into the right daughter node.
Members of the left daughter node: ID 1, 3, 5, 6, 8, 11, 14, 16, 19, 20. Five
of these subjects have the pattern {Y = 0} and the rest {Y = 1}.
Impurity of this daughter = - 0.5 ln(0.5) – 0.5 ln(0.5) = 0.69.
Members of the right daughter node: ID 2, 4, 7, 9, 10, 12, 13, 15, 17, 18.
Five of these subjects have the pattern {Y = 0} and the rest {Y = 1}.
Impurity of this daughter = - 0.5 ln(0.5) – 0.5 ln(0.5) = 0.69.
This is a disappointment. We expected the daughters to be less chaotic. Maybe
the choice of the cut-point X1 = 5 is not helpful.
We need to compare the mother node with the daughter nodes. We need to
calculate impurity of the daughters combined.
Right and left daughters have the same number of subjects. The weights of
these nodes are: 50:50 or 0.5:0.5. The weights come from: what proportions
of subjects from the root node are in the daughter nodes.
Overall impurity of the daughters = weighted sum of individual impurities =
0.5*0.69 + 0.5*0.69 = 0.69
Our goal is to seek daughters purer than their mothers.
Improvement in purity achieved by the split = Goodness of the split =
Impurity of the mother – Overall impurity of the daughters = 0.69 – 0.69 =
0.
There is no improvement by splitting the mother node this way. We could
have chosen another number such as 4 instead of 5 for X1. Our goal is to
maximize the goodness of the split. Let us persist with this split.
Step 3. Let us split the left daughter node. Choose one of the covariates. Let
us choose now X2. Let us choose one of the numbers taken by X2. Let us
choose 5. Shepherd all those subjects with X2 ≤ 5 into the left grand
daughter node and those with X2 > 5 into the right grand daughter node.
Composition of the subjects in the left grand daughter node: ID 1, 5, 8, 14,
20. All these subjects have the pattern {Y = 0}. Its impurity is zero. This
granddaughter is the purest. Further split is useless. This is a terminal node.
Declare this node as {Y = 0} node.
Composition of the subjects in the right grand daughter node: ID 3, 6, 11,
16, 19. All these subjects have the pattern {Y = 1}. Its impurity is zero. This
granddaughter is the purest. No further split is possible. This is a terminal
node. Declare this node as {Y = 1} node.
Step 4. Let us split the right daughter node. Choose one of the covariates.
Let us choose X2. Let us choose one of the numbers taken by X2. Let us
choose 5. Shepherd all those subjects with X2 ≤ 5 into the left grand
daughter node and those with X2 > 5 into the right grand daughter node.
Composition of the subjects in the left grand daughter node: ID 2, 7, 9, 13,
18. All these subjects have the pattern {Y = 1}. Its impurity is zero. Why?
This granddaughter is the purest. Further split is pointless. This is a terminal
node. Declare this node as {Y = 1} node.
Composition of the subjects in the right grand daughter node: ID 4, 10, 12,
15, 17. All these subjects have the pattern {Y = 0}. Its impurity is zero. This
granddaughter is the purest. Further split is not worthwhile. This is a
terminal node. Declare this node as {Y = 0} node.
The task of building a tree is complete. Look at the tree that results.
Let us now calculate the misclassification error rate. Let us pour all the
subjects into the mother node. We know the pattern each subject has. Check
which terminal node they fall into. Check whether its true pattern matches
with the pattern of the terminal node. The percentage of mismatches is the
misclassification rate.
Misclassification rate = 0%.
How does one use this classification protocol in practice? Take a subject
whose pattern is unknown. We have its covariate values. Pour this subject
into the mother node. See where he lands. Note the identity of the terminal
node. That is the pattern he is classified into.
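For comparison, here is a hedged sketch of how the same protocol could be obtained with the 'rpart' package introduced later. The default minimum node size of 20 is lowered so that this 20-subject data set can be split at all, and rpart uses the Gini index by default, so its cut-points may differ slightly from the ones chosen above.

> library(rpart)
> toy <- data.frame(Y = factor(c(0,1,1,0,0,1,1,0,1,0,1,0,1,0,0,1,0,1,1,0)),
+                   X1 = c(1,6,5,10,5,4,10,4,8,9,3,8,9,3,7,2,6,7,1,2),
+                   X2 = c(2,5,7,9,5,8,2,3,4,7,9,8,2,1,7,10,10,5,6,4))
> MB <- rpart(Y ~ X1 + X2, data = toy, method = "class",
+             control = rpart.control(minsplit = 2, cp = 0))
> plot(MB, uniform = T, margin = 0.1)
> text(MB, use.n = T, all = T)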
Let me give a realistic example. The illustrative example comes from
the Yale Pregnancy Outcome Study, a project funded by the National
Institutes of Health. The basic question they want to address is what factors
influence pre-term deliveries. The study subjects were women who made a
first prenatal visit to a private obstetrics or midwife practice, health
maintenance organization, or hospital clinic in the Greater New Haven,
Connecticut, area between March 12, 1980 and March 12, 1982. From these
women, select those women whose pregnancies ended in a singleton live
birth.
The outcome or response variable: Woman whose pregnancy ended by a
singleton live birth: Is this a preterm delivery or a normal time delivery?
At the time of prenatal visit, measurements on 15 variables (predictors or
independent variables) were collected.
Variable name                 Label   Type         Range/levels
Maternal age                  X1      Continuous   13-46
Marital status                X2      Nominal      Currently married, divorced, separated,
                                                   widowed, never married
Race                          X3      Nominal      White, Black, Hispanic, Asian, Other
Marijuana use                 X4      Nominal      Yes, No
Times of using marijuana      X5      Ordinal      ≥ 5, 3-4, 2, 1 (daily); 4-6, 1-3 (weekly);
                                                   2-3, 1, < 1 (monthly)
Years of education            X6      Continuous   4-27
Employment                    X7      Nominal      Yes, No
Smoker                        X8      Nominal      Yes, No
Cigarettes smoked             X9      Continuous   0-66
Passive smoking               X10     Nominal      Yes, No
Gravidity                     X11     Ordinal      1-10
Hormones/DES used by mother   X12     Nominal      None, hormones, DES, both, Uncertain
Alcohol (oz/day)              X13     Ordinal      0-3
Caffeine (mg)                 X14     Continuous   12.6-1273
Parity                        X15     Ordinal      0-7
The training sample consisted of 3,861 pregnant women. The objective of
the study is to predict whether or not the delivery will be preterm based on
the measurements collected at the time of prenatal visit. This is viewed as a
binary classification problem.
We could solve this problem by following the logistic regression
methodology. How?
Classification tree methodology was pursued for this problem. What are the
key steps?
A tree will consist of a root node, internal (circle) nodes, and terminal
(box) nodes. Identify each woman in the sample who had a preterm delivery
with 0 and who had a normal term delivery with 1.
Step 1. Stuff the root node with all these women.
Step 2. We will create two daughter nodes (Left Daughter Node and Right
Daughter Node) out of the root node. Every woman in the root node has to
go either to the Left Daughter Node or Right Daughter Node. In other words,
we will split the women in the root node into two groups. The splitting is
done using one of the predictors. Suppose we start with X1. In the sample,
we have women representing every age from 13 to 43. We may decide to
split the root node according to the following criterion.
Put a woman in the Left Daughter Node if her age X1 ≤ 13 years. The
number 13 is the cut-point chosen. Otherwise, put the woman in the Right
Daughter Node. According to this criterion, some women in the root node go
into the Left Daughter Node and the rest go into the Right Daughter Node.
We could split the root node using a different criterion. For example, put a
woman in the Left Daughter Node if her age X1 ≤ 35 years. Otherwise, put
the woman in the Right Daughter Node. Here the chosen cut-point is 35.
Important idea: In order to split the root node, we need to select one of the
covariates and a cut-point.
There are 31 different ways of splitting the root node! We want to choose a
good split. The objective is to channel as many women with Label 1 (normal
delivery) as possible into one node and channel as many women with Label
0 (pre-term delivery) into the other node. Let us assess the situation when
the split was done based on the age 35 years. The composition of the
daughter nodes can be summarized by the following 2x2 contingency table.
            Left Node   Right Node   Total
Term           3521         135       3656
Preterm         198           7        205
Total          3719         142       3861
The left node has a proportion of 3521/3719 1’s and 198/3719 0’s. The
entropy impurity of the distribution (3521/3719, 198/3719) is
– (3521/3719) ln (3521/3719) – (198/3719) ln (198/3719) = 0.2079.
The impurity 0 is the ideal value we are seeking.
Similarly, the entropy impurity of the right node is 0.1964.
The goodness of the split is defined by
Impurity of the mother node – P(Left node)*impurity of the left node –
P(Right node)*impurity of the right node,
where P(left node) and P(right node) are the probabilities that a subject falls
into the left node and right node, respectively. (You can inject some
Bayesian philosophy into these probabilities.) For the time being, we can
take these probabilities to be 3719/3861 and 142/3861, respectively.
Therefore, the goodness of the split = 0.20753 – (3719/3861)*0.2079 –
(142/3861)*0.1964 = 0.00001.
More intuitively, we are computing
Impurity of the mother – Impurity of the daughters
to judge how good the chosen cut-point is.
The larger the difference is, the better the daughters are!
Talk about this more! If each daughter is pure, this is the best split.
We are shooting for a high value for the goodness of split. Thus for every
possible choice of age, we can measure goodness of split. Choose that age
for which the goodness of split is maximum.
The goodness of allowable Age splits

Split value   Impurity (Left node)   Impurity (Right node)   1000*Goodness of the split
13                 0.00000                0.20757                    0.01
14                 0.00000                0.20793                    0.14
15                 0.31969                0.20615                    0.17
…                  …                      …                          …
24                 0.25747                0.18195                    1.50
…                  …                      …                          …
43                 0.20757                0.00000                    0.01
At the age 24 years, we have an optimal split.
Here, we started with age to split the root node. Why age? It could have been
any other predictor.
Strategy. Find the best split and the corresponding impurity reduction for
every predictor. Choose that predictor for which impurity reduction is the
largest. This is the variable we start with to split the root node.
After splitting the root node, look at the left and right daughter nodes. We
now split the Left Daughter Node into two nodes: Left Grand Daughter
Node and Right Grand Daughter Node. We choose one of the predictors
including the one already used for the split.
How do we split a node based on a nominal or categorical variable? Suppose
we choose Race for the split. Note that the Race is a nominal variable with
five possible values. There are 2^4 – 1 = 15 ways we can split. The possibilities are
listed below.
Left Grand Daughter Node        Right Grand Daughter Node
White                           Black, Hispanic, Asian, Others
Black                           White, Hispanic, Asian, Others
Hispanic                        White, Black, Asian, Others
Asian                           White, Black, Hispanic, Others
Others                          White, Black, Hispanic, Asian
White, Black                    Hispanic, Asian, Others
White, Hispanic                 Black, Asian, Others
White, Asian                    Black, Hispanic, Others
White, Others                   Black, Hispanic, Asian
Black, Hispanic                 White, Asian, Others
Black, Asian                    White, Hispanic, Others
Black, Others                   White, Hispanic, Asian
Hispanic, Asian                 White, Black, Others
Hispanic, Others                White, Black, Asian
Asian, Others                   White, Black, Hispanic
Let us look at the split based on White on one hand and Black, Hispanic,
Asian, Others on the other hand. Channel all women in the Left Daughter
Node into Left Grand Daughter Node if she is white. Otherwise, she goes
into the Right Grand Daughter Node. We can assess how good the split is
just the same way as we did earlier.
Thus the splitting goes on using all the predictors one by one.
When do we create a terminal node? We stop splitting when a node is
smaller than the prescribed minimum size. This is called pruning. The choice
of the minimum size depends on the sample size. If the size of a node is less
than 1% of the total size, one could stop splitting. Or if a node contains less
than 5 subjects, stop splitting.
There are a number of packages available to build a classification tree. We
will look at two of them: tree; rpart. Let us download these packages and
look at some examples.
Module 6: Classification Trees + rpart package + Creation of Polygonal
Plots
Creation of Polygonal Plots
Recall the artificial data presented in Module 5. The data had one binary
outcome variable Y (0 or 1) and two predictors X1 and X2. Each of the
predictors takes integer values from 1 through 10. I built a tree with my bare
hands. The tree is equivalent to the following classification protocol.
If X1 ≤ 5 and X2 ≤ 5, classify the subject to have the pattern {Y = 0}.
If X1 ≤ 5 and X2 ≥ 6, classify the subject to have the pattern {Y = 1}.
If X1 ≥ 6 and X2 ≤ 5, classify the subject to have the pattern {Y = 1}.
If X1 ≥ 6 and X2 ≥ 6, classify the subject to have the pattern {Y = 0}.
There is another way to present the classification protocol graphically. The
statement X1 ≤ 5 and X2 ≤ 5 is equivalent to, graphically, the rectangle with
vertices (1, 1), (1, 5), (5, 5), (5, 1) in the X1 – X2 plane. The command
‘polygon’ draws the rectangle. First, we need to create a blank plot setting
the X1- and X2-axes. The input type = “n” tells the plot command that no points
should be imprinted on the graph.
> plot(c(1,10), c(1, 10), type = "n", xlab = "X1", ylab = "X2", main =
"Classification Protocol")
The ‘polygon’ command has essentially two major inputs. The x-input
should have all the x coordinates of the points. The y-input should have all
the corresponding y-coordinates of the points. The polygon thus created
latches onto the existing plot.
> polygon(c(1, 1, 5, 5), c(1, 5, 5, 1), col = "gray", border = "blue", lwd = 2)
The statement X1 ≤ 5 and X2 ≥ 6 is equivalent to, graphically, the rectangle
with vertices (1, 6), (1,10), (5, 10), (5, 6) in the X1 – X2 plane.
> polygon(c(1, 1, 5, 5), c(6, 10, 10,6), col = "yellow", border = "blue", lwd =
2)
The other polygons are created in the same way.
> polygon(c(6, 6, 10, 10), c(6, 10, 10,6), col = "mistyrose", border = "blue",
lwd = 2)
> polygon(c(6, 6, 10, 10), c(1, 5, 5, 1), col = "cyan", border = "blue", lwd =
2)
We need to identify each rectangle with a pattern.
> text(3, 3, "{Y = 0}", col = "red")
> text(3, 8, "{Y = 1}", col = "blue")
> text(8, 8, "{Y = 0}", col = "red")
> text(8, 3, "{Y = 1}", col = "blue")
[Figure: "Classification Protocol", the four rectangles in the X1–X2 plane with their {Y = 0} and {Y = 1} labels.]
rpart package
‘rpart’ is an acronym for recursive partitioning.
Terry Therneau and Elizabeth Atkinson (Mayo Foundation) have developed
‘rpart’ (recursive partitioning) package to implement classification trees and
regression trees in all their glory. The method depends on what kind of response
variable we have.

Categorical → method = “class”
Continuous  → method = “anova”
Count       → method = “poisson”
Survival    → method = “exp”
They have two monographs on their package available on the internet.
An introduction to Recursive Partitioning using the RPART routines,
February, 2000
Same title, September, 1997
Both are very informative.
Let me illustrate ‘rpart’ command in the context of a binary classification
problem. Four data sets are available in the package.
Download ‘rpart.’
> data(package = “rpart”)
Data sets in package ‘rpart’:
car.test.frame   Automobile Data from 'Consumer Reports' 1990
cu.summary       Automobile Data from 'Consumer Reports' 1990
kyphosis         Data on Children who have had Corrective Spinal Surgery
solder           Soldering of Components on Printed-Circuit Boards
Let us look at ‘kyphosis’ data.
> data(kyphosis)
> dim(kyphosis)
[1] 81 4
> head(kyphosis)
  Kyphosis Age Number Start
1   absent  71      3     5
2   absent 158      3    14
3  present 128      4     5
4   absent   2      5     1
5   absent   1      4    15
6   absent   1      2    16
Understand the data. Look at the documentation on the data.
Look at the documentation on ‘rpart.’
If we let the partition continue without any break, we will end up with a
saturated tree. Every terminal node is pure. It is quite possible some terminal
nodes contain only one data point. One has to declare each terminal node as
one of the two types: present or absent. Majority rule.
Discuss
We need to arrest the growth of the tree. One possibility is to demand that if
a node contains 20 observations or less no more splitting is done at this
node. This is the default setting in ‘rpart.’
Let us check.
> MB <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
To get a tree, follow the commands.
> plot(MB, uniform = T, margin = 0.1)
> text(MB, use.n = T, all = T)
> title(main = "Classification Tree for Kyphosis Data")
[Figure: "Classification Tree for Kyphosis Data". The root (absent, 64/17) splits on Start >= 8.5; the left branch (absent, 56/6) splits on Start >= 14.5 into absent 29/0 and absent 27/6; the latter splits on Age < 55 into absent 12/0 and absent 15/6, which splits on Age >= 111 into absent 12/2 and present 3/4; the right branch of the root is the terminal node present 8/11.]
Comments and interpretation
1. The root node has 81 subjects, for 64 of them kyphosis is absent and
17 present. The root node identifies the majority. As a matter of fact,
each node identifies the majority pattern.
2. All those subjects with Start ≥ 8.5 go into the left node. Total number
of subjects in the left node is 62, 56 of them have kyphosis absent.
The majority have kyphosis absent, which is duly recorded inside the
node.
3. All those subjects with Start < 8.5 go into the right node. Total
number of subjects in the right node is 19, 8 of them have kyphosis
absent. The majority have kyphosis present, which is duly recoded
inside the node.
4. This node is a terminal node. No further split is envisaged because
the total number of observations is 19 ≤ 20. The command stops
splitting a node if the size of the node is 20 or less (default). This is a
pruning strategy. This terminal node is declared ‘present’ as per the
‘majority rule’ paradigm.
5. The node on the left is split again. The best covariate as per the
entropy purity calculations is ‘Start’ again. All those subjects with
Start ≥ 14.5 go into the left node. This node is pure. No split is
possible. This node has 29 subjects for all of whom kyphosis is
absent. Obviously, we declare this terminal node as ‘absent.’ All
those subjects with Start < 14.5 go into the right node, which has 33
patients. And so on.
6. Other terminal nodes are self-explanatory.
The classification protocol as per this tree is given by:
1. If a child has Start < 8.5, predict that kyphosis will be present.
2. If a child has 14.5 ≤ Start, predict that kyphosis will be absent.
3. If a child has 8.5 ≤ Start < 14.5 and Age < 55 months, predict that
kyphosis will be absent.
4. If a child has 8.5 ≤ Start < 14.5 and Age ≥ 111 months, predict that
kyphosis will be absent.
5. If a child has 8.5 ≤ Start < 14.5 and 55 ≤ Age < 111 months, predict
that kyphosis will be present.
6. The covariate ‘Number’ has no role in the classification.
Since only two predictors are playing a role here, we can build a rectangle graph using the polygon command. Here are the commands.
> plot(c(1, 18), c(1,206), type = "n", xlab = "Start", ylab = "Age", ann = F,
+ main = "Classification Protocol - Kyphosis Data", axes = F)
ann = F means annotation false.
The title and labels are not imprinted anymore. We can add these later. Have
the landmarks of ‘Start’ and ‘Age’ be recorded on the graph.
> axis(side = 1, at = c(1, 8, 9, 14, 15, 18))
> axis(side = 2, at = c(1, 54, 55, 110, 111, 206))
Build 5 polygons as per the protocol.
> polygon(c(1, 1, 8, 8), c(1, 206, 206, 1), col = "gray", border = "red", lwd =
1.5)
> polygon(c(15, 15, 18, 18), c(1, 206, 206, 1), col = "mistyrose", border =
"green", lwd = 1.5)
> polygon(c(9, 9, 14, 14), c(1, 54, 54, 1), col = "mistyrose", border =
"green", lwd = 1.5)
> polygon(c(9, 9, 14, 14), c(55, 110, 110, 55), col = "gray", border = "red",
lwd = 1.5)
> polygon(c(9, 9, 14, 14), c(111, 206, 206, 111), col = "mistyrose", border =
"green", lwd = 1.5)
Print the texts accordingly.
> text(4, 100, "present", col = "red")
> text(12, 25, "absent", col = "green")
> text(12, 75, "present", col = "red")
> text(12, 150, "absent", col = "green")
> text(16.5, 100, "absent", col = "green")
> title(main = "Classification Protocol - Kyphosis Data", sub = "81
Children",
+ xlab = "Start", ylab = "Age")
[Figure: "Classification Protocol - Kyphosis Data" (sub-title "81 Children"), the five rectangles in the Start–Age plane labeled present or absent as built above.]
How reliable is the judgment of this tree?
We have 81 children in our study. We know for each child whether kyphosis
is present or absent. Pour the data on the covariates of a child into the root
node. See which terminal node the child settles in. Classify the child
accordingly. We know the true status of the child. Note down whether or not
a mismatch occurred. Find the total number of mismatches.
Misclassification rate = re-substitution error
= 100*(8 + 0 + 2 + 3)/81 = 16%.
Add up all the minority numbers of the terminal nodes.
We have other choices when graphing a tree. Let us try some of these.
> plot(MB, branch = 0, margin = 0.1, col = "red")
> text(MB, use.n = T, all = T, col = "red")
> title(main = "Classification Tree for Kyphosis data")
Another choice:
> plot(MB, branch = 0.4, margin = 0.1, col = "red")
> text(MB, use.n = T, all = T, col = "red")
> title(main = "Classification Tree for Kyphosis data")
[Figure: the same classification tree for the kyphosis data, drawn with branch = 0.4.]
[Figure: a larger classification tree for the kyphosis data, with additional splits on Age < 11.5, Age >= 98, Start < 5.5, Age >= 130.5, Age < 93, and Number < 4.5; this is the expanded tree obtained when the minimum node size is lowered, as discussed below.]
We can increase the size of the tree by reducing the threshold number 20.
Let us do it. If the size of a node is less than 5, don’t split it. The following is the R
command.
> MB1 <- rpart(Kyphosis ~ ., data = kyphosis, control =
rpart.control(minsplit = 5))
> plot(MB1, branch = 0.4, margin = 0.1, col = "red")
> text(MB1, use.n = T, all = T, col = "red")
> title(main = "Classification Tree for Kyphosis data")
[Figure: classification tree plot for the kyphosis data produced by the commands above.]
Module 7: Logistic Regression for Grouped Data
A medical researcher wants to explore the connection between hypertension
and the predictors smoking, obesity, and snoring. ‘Hypertension’ is taken to
be a response variable. He collected data on a sample of 433 subjects. For
each subject, he assessed whether or not the subject suffers from
hypertension, whether or not the subject smokes, whether or not the subject
is obese, and whether or not the subject snores. A typical record looks like:
Hypertension   Smoking   Obesity   Snoring
    Yes          No        Yes       Yes
He has 433 such records. Note that all variables are binary. We can entertain
a logistic regression model for the response variable.
Pr(Hypertension = Yes)/Pr(Hypertension = No)
= exp{β0 + β1*Smoking + β2*Obesity + β3*Snoring}
We need to score each binary variable as 1 or 0. The R program will do it for
you.
The data consists of 433 pieces of information. Since each predictor is
binary, we can summarize the entire data into 8 pieces of information as
follows.
Smoking   Obesity   Snoring   Total   # hypertension
  No        No        No        60          5
  Yes       No        No        17          2
  No        Yes       No         8          1
  Yes       Yes       No         2          0
  No        No        Yes      187         35
  Yes       No        Yes       85         13
  No        Yes       Yes       51         15
  Yes       Yes       Yes       23          8
Digest the data. Each covariate column is structured.
We need to enter the data in R format. It can be done a couple of different
ways. The way the data are arranged makes it possible to proceed the
following way. First, create a folder storing the words “No” and “Yes.”
> no.yes <- c("No", "Yes")
Create a folder for each of the predictors.
> smoking <- gl(2, 1, 8, no.yes)
gl = generate levels
Look at the documentation of ‘gl.’
?gl
no.yes: the levels come from the folder no.yes
8: The folder should consist of 8 entries.
2: Each entry should be one of the levels coming from no.yes.
1: The entries should alternate between “No” and “Yes.”
> obesity <- gl(2, 2, 8, no.yes)
2: The entries should consist of 2 “No” s followed by 2 “Yes” s.
> snoring <- gl(2, 4, 8, no.yes)
4: The entries should consist of 4 “No” s followed by 4 “Yes” s.
The data under smoking can be entered in another way.
Smoking1 <- c(“No”, “Yes”, “No”, “Yes”, “No”, “Yes”, “No”, “Yes”)
The folders n.tot and n.hyp. are self-explanatory.
> n.tot <- c(60, 17, 8, 2, 187, 85, 51, 23)
> n.hyp <- c(5, 2, 1, 0, 35, 13, 15, 8)
Let us all put all the data folders into a single frame.
> hyp <- data.frame(smoking, obesity, snoring, n.tot, n.hyp)
> hyp
  smoking obesity snoring n.tot n.hyp
1      No      No      No    60     5
2     Yes      No      No    17     2
3      No     Yes      No     8     1
4     Yes     Yes      No     2     0
5      No      No     Yes   187    35
6     Yes      No     Yes    85    13
7      No     Yes     Yes    51    15
8     Yes     Yes     Yes    23     8
We need to fit a logistic regression model using these grouped data. It can be
done in two different ways. In one of the ways, we need to create a matrix
consisting of two columns. The first column should consist of number of
people suffering from hypertension and the second column those who don’t.
The matrix operation is characterized by the command ‘cbind,’ where ‘c’
stands, as usual for ‘column.’
> hyp1 <- cbind(n.hyp, n.tot - n.hyp)
> hyp1
     n.hyp
[1,]     5  55
[2,]     2  15
[3,]     1   7
[4,]     0   2
[5,]    35 152
[6,]    13  72
[7,]    15  36
[8,]     8  15
We are ready for fitting a logistic regression model.
> hyp2 <- glm(hyp1 ~ smoking + obesity + snoring, family = binomial)
> hyp2
Call: glm(formula = hyp1 ~ smoking + obesity + snoring, family =
binomial)
Coefficients:
(Intercept)   smokingYes   obesityYes   snoringYes
   -2.37766     -0.06777      0.69531      0.87194

Degrees of Freedom: 7 Total (i.e. Null);  4 Residual
Null Deviance:      14.13
Residual Deviance:  1.618        AIC: 34.54

> summary(hyp2)
Call:
glm(formula = hyp1 ~ smoking + obesity + snoring, family = binomial)
Deviance Residuals:
       1         2         3         4         5         6         7         8
-0.04344   0.54145  -0.25476  -0.80051   0.19759  -0.46602  -0.21262   0.56231

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.37766    0.38018  -6.254    4e-10 ***
smokingYes  -0.06777    0.27812  -0.244   0.8075
obesityYes   0.69531    0.28509   2.439   0.0147 *
snoringYes   0.87194    0.39757   2.193   0.0283 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 14.1259 on 7 degrees of freedom
Residual deviance: 1.6184 on 4 degrees of freedom
AIC: 34.537

Number of Fisher Scoring iterations: 4

Residual deviance is the sum of squares of the deviance residuals.
There is another way to go about fitting the model. We need the proportions
of those who suffer from hypertension to the total for each configuration of
predictor variables.
> hyp3 <- n.hyp/n.tot
> hyp3
[1] 0.08333333 0.11764706 0.12500000 0.00000000 0.18716578 0.15294118 0.29411765
[8] 0.34782609
These proportions are the response variable in the model. Talk a little bit
about this.
> hyp4 <- glm(hyp3 ~ smoking + obesity + snoring, binomial, weights =
n.tot)
R would not know what the total number of subjects is from which the
proportion is calculated. It needs this information in writing the likelihood of
the data.
> hyp4
Call: glm(formula = hyp3 ~ smoking + obesity + snoring, family =
binomial, weights = n.tot)
Coefficients:
(Intercept)   smokingYes   obesityYes   snoringYes
   -2.37766     -0.06777      0.69531      0.87194

Degrees of Freedom: 7 Total (i.e. Null);  4 Residual
Null Deviance:      14.13
Residual Deviance:  1.618        AIC: 34.54
> summary (hyp4)
Call:
glm(formula = hyp3 ~ smoking + obesity + snoring, family = binomial,
weights = n.tot)
Deviance Residuals:
       1         2         3         4         5         6         7         8
-0.04344   0.54145  -0.25476  -0.80051   0.19759  -0.46602  -0.21262   0.56231

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.37766    0.38018  -6.254    4e-10 ***
smokingYes  -0.06777    0.27812  -0.244   0.8075
obesityYes   0.69531    0.28509   2.439   0.0147 *
snoringYes   0.87194    0.39757   2.193   0.0283 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 14.1259 on 7 degrees of freedom
Residual deviance: 1.6184 on 4 degrees of freedom
AIC: 34.537
Number of Fisher Scoring iterations: 4
It is interesting that ‘Smoking’ is not a significant factor. The other two
predictors are significant.
The fitted model is:
ln(Odds) = -2.37766 – 0.06777*smoking +
0.69531*obesity + 0.87194*snoring.
The R program coded each “Yes” as 1 and “No” as 0. Why? It is following
the alpha-numeric principle: the factor levels are sorted alphabetically, and the first level (“No”) becomes the reference.
Using the model, we can predict ln(Odds) for each scenario of predictors.
> predict(hyp4)
         1          2          3          4          5          6          7          8
-2.3776615 -2.4454364 -1.6823519 -1.7501268 -1.5057221 -1.5734970 -0.8104126 -0.8781874
We can predict the response probabilities using the model for each scenario
of predictors.
> predict(hyp4, type = "response")
         1          2          3          4          5          6          7          8
0.08489206 0.07977292 0.15678429 0.14803121 0.18157364 0.17171843 0.30780259 0.29355353
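As a check, these probabilities can be recovered by hand from the fitted coefficients with the inverse logit. For the first scenario all three predictors are “No”, so only the intercept enters; for the fifth scenario only snoring enters.

> plogis(-2.37766)              # about 0.0849, matching scenario 1
> plogis(-2.37766 + 0.87194)    # about 0.1816, matching scenario 5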
There is another R-command we can use to get the predicted probabilities.
> fitted(hyp4)
         1          2          3          4          5          6          7          8
0.08489206 0.07977292 0.15678429 0.14803121 0.18157364 0.17171843 0.30780259 0.29355353
We can compare the observed proportions with the predicted probabilities
for each scenario of predictors to see how close they are. Note that the
observed proportions are in the folder ‘hyp3.’
> hyp3
[1] 0.08333333 0.11764706 0.12500000 0.00000000 0.18716578 0.15294118 0.29411765
[8] 0.34782609
It is difficult to compare the proportions. We can compare observed
frequencies with expected frequencies. The expected frequency, as per the
model, is the product of the predicted probability and total. We can calculate
the expected frequency for each scenario of predictors.
> fitted(hyp4)*n.tot
         1          2          3          4          5          6          7          8
 5.0935236  1.3561397  1.2542744  0.2960624 33.9542700 14.5960668 15.6979321  6.7517311
We can place the observed and expected frequencies side by side.
> data.frame(fit = fitted(hyp4)*n.tot, n.hyp, n.tot)
         fit n.hyp n.tot
1  5.0935236     5    60
2  1.3561397     2    17
3  1.2542744     1     8
4  0.2960624     0     2
5 33.9542700    35   187
6 14.5960668    13    85
7 15.6979321    15    51
8  6.7517311     8    23
Can we do some testing about model adequacy? One could use the Hosmer-Lemeshow test. Another test can be built on the residual deviance or the null deviance.
R is reluctant to associate p-value with the deviance. Just as well, because no
exact p-value can be found, only an approximation that is valid for large
expected counts. In the present example, some expected counts are below 5.
If you insist on doing a goodness-of-fit test, the asymptotic result is that each
stated deviance has a chi-squared distribution with the stated degrees of
freedom under the null hypothesis that the response probability follows a
logistic model pattern.
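If one nevertheless wants the approximate p-values, they can be computed directly from the reported deviances and degrees of freedom:

> 1 - pchisq(1.6184, df = 4)              # goodness-of-fit p-value, about 0.81; no evidence of lack of fit
> 1 - pchisq(14.1259 - 1.6184, df = 3)    # joint test of the three predictors, about 0.006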
Module 8: Prediction in Classification Trees + Regression Trees
Make sure the response variable is ‘factor’ when wanting to build a
classification tree
We build classification trees when the response variable is binary. If you use
rpart package, make sure your response variable is a ‘factor.’ If the response
variable is descriptive such as absence and presence, the response variable is
indeed a ‘factor.’ If the response variable is coded as 0 and 1, make sure the
codes are factors. If they are not, one can always convert them into factors
using the command ‘as.factor.’ Suppose the folder is called MB. Then type
> MB <- as.factor(MB)
Prediction in classification trees
Let us work with the kyphosis data. Activate the ‘rpart’ package.
> data(kyphosis)
Build a classification tree.
> MB <- rpart(Kyphosis ~ ., data = kyphosis)
The ‘predict’ command predicts the status of each kid in the data as per the
classification tree.
> MB1 <- predict(MB, newdata = kyphosis)
> head(MB1)
     absent   present
1 0.4210526 0.5789474
2 0.8571429 0.1428571
3 0.4210526 0.5789474
4 0.4210526 0.5789474
5 1.0000000 0.0000000
6 1.0000000 0.0000000
What is going on? Look at the data.
> head(kyphosis)
  Kyphosis Age Number Start
1   absent  71      3     5
2   absent 158      3    14
3  present 128      4     5
4   absent   2      5     1
5   absent   1      4    15
6   absent   1      2    16
Look at the first kid. Feed his data into the tree. He falls into the last
terminal node. The prediction is ‘Kyphosis present.’ Look at the data in the
last terminal node. Nineteen of our kids will fall into this node. Eight of
them have Kyphosis absent and eleven of them have Kyphosis present. As
per the classification protocol (majority rule), every one of these kids will be
classified Kyphosis present. Using the data in the terminal node, R
calculates the probability of Kyphosis present and also of Kyphosis absent.
These are the probabilities that are reported in the output of ‘predict’
command.
> 11/19
[1] 0.5789474
Let us codify the probabilities into present and absent using the threshold
probability 0.50.
> MB2 <- ifelse(MB1$present >= 0.50, "present", "absent")
Error in MB1$present : $ operator is invalid for atomic vectors
> class(MB1)
[1] "matrix"
The ‘ifelse’ command does not work on matrices. Convert the folder into
data.frame.
> MB2 <- as.data.frame(MB1)
> MB3 <- ifelse(MB2$present >= 0.50, "present", "absent")
> head(MB3)
[1] "present" "absent" "present" "present" "absent" "absent"
Let us add MB3 to the mother folder ‘kyphosis.’
> kyphosis$Prediction <- MB3
> head(kyphosis)
  Kyphosis Age Number Start Prediction
1   absent  71      3     5    present
2   absent 158      3    14     absent
3  present 128      4     5    present
4   absent   2      5     1    present
5   absent   1      4    15     absent
6   absent   1      2    16     absent
We want to identify the kids for whom the actual status of Kyphosis and
Prediction disagree.
> kyphosis$Disagree <- ifelse(kyphosis$Kyphosis == "absent" &
kyphosis$Prediction == "present", 1, ifelse(kyphosis$Kyphosis == "present"
& kyphosis$Prediction == "absent", 1, 0))
> head(kyphosis)
  Kyphosis Age Number Start Prediction Disagree
1   absent  71      3     5    present        1
2   absent 158      3    14     absent        0
3  present 128      4     5    present        0
4   absent   2      5     1    present        1
5   absent   1      4    15     absent        0
6   absent   1      2    16     absent        0
How many kids are misclassified?
> sum(kyphosis$Disagree)
[1] 13
What is the misclassification rate?
> (13/81)*100
[1] 16.04938
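The same count can be read off a two-way table of the true status against the prediction; the off-diagonal cells are the mismatches and should add up to 13.

> table(kyphosis$Kyphosis, kyphosis$Prediction)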
We have two new kids with the following information.
Kid 1 Age = 12; Number = 4; Start = 7
Kid 2 Age = 121; Number = 5, Start = 9
How does the tree classify these kids?
> MB4 <- data.frame(Age = c(12, 121), Number = c(4, 5), Start = c(7, 9))
> MB4
  Age Number Start
1  12      4     7
2 121      5     9
> MB5 <- predict(MB, newdata = MB4)
> MB5
        absent   present
[1,] 0.4210526 0.5789474
[2,] 0.8571429 0.1428571
The first kid will be classified as Kyphosis present and the second Kyphosis
absent.
Regression trees
We now focus on developing a regression tree when the response variable is
quantitative. Let me work out the build-up of a tree using an example. The
data set ‘bodyfat’ is available in the package ‘mboost.’ Download the
package and the data.
The data has 71 observations on 10 variables. Body fat was measured on 71
healthy German women using Dual Energy X-ray Absorptiometry (DXA).
This reference method is very accurate in measuring body fat. However, the
setting-up of this instrument requires a lot of effort and is of high cost.
Researchers are looking for ways to estimate body fat using some
anthropometric measurements such as waist circumference, hip
circumference, elbow breadth, and knee breadth. The data gives these
anthropometric measurements on the women in addition to their age.
Here is the data.
> data(bodyfat)
> dim(bodyfat)
[1] 71 10
> head(bodyfat)
   age DEXfat waistcirc hipcirc elbowbreadth kneebreadth anthro3a anthro3b anthro3c anthro4
47  57  41.68     100.0   112.0          7.1         9.4     4.42     4.95     4.50    6.13
48  65  43.29      99.5   116.5          6.5         8.9     4.63     5.01     4.48    6.37
49  59  35.41      96.0   108.5          6.2         8.9     4.12     4.74     4.60    5.82
50  58  22.79      72.0    96.5          6.1         9.2     4.03     4.48     3.91    5.66
51  60  36.42      89.5   100.5          7.1        10.0     4.24     4.68     4.15    5.91
52  61  24.13      83.5    97.0          6.5         8.8     3.55     4.06     3.64    5.14
Ignore the last four measurements. Each one is a sum of logarithms of three
of the four anthropometric measurements.
We now want to create a regression tree. All data points go into the root
node to begin with. We need to select one of the covariates and a cut-point
to split the root node. Let us start with the covariate ‘waistcirc’ and the cut-point 88.4, say. All women with waistcirc < 88.4 are shepherded into the left
node and the rest into the right node. We need to judge how good the split is.
We use variance as the criterion. Calculate the variance of ‘DEXfat’ of all
women in the root node. It is 121.9426.
Calculate the variance of ‘DEXfat’ of all women in the left node. It is
33.72712.
Calculate the variance of ‘DEXfat’ of all women in the right node. It is
52.07025.
Goodness of the split = Mother’s variance – Weighted sum of daughters’
variances
= 121.9426 – [(40/71)*33.72712 + (31/71)*52.07025] = 80.2065.
The goal is to find that cut-point for which the goodness of the split is
maximum.
For each covariate, find the best cut-point. Select the best covariate to start
the tree. Follow the same principle at every stage.
> var(bodyfat$DEXfat)
[1] 121.9426
> MB1 <- subset(bodyfat, bodyfat$waistcirc < 88.4)
> var(MB1$DEXfat)
[1] 33.72712
> mean(bodyfat$DEXfat)
[1] 30.78282
> MB2 <- subset(bodyfat, bodyfat$waistcirc >= 88.4)
> dim(MB1)
[1] 40 10
> dim(MB2)
[1] 31 10
> var(MB2$DEXfat)
[1] 52.07205
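The same calculation can be wrapped into a small function and applied to every candidate cut-point of ‘waistcirc’. The function name is mine; this only mimics the variance criterion described above.

> split.gain <- function(y, x, c) {
+   n <- length(y); left <- y[x < c]; right <- y[x >= c]
+   var(y) - (length(left)/n) * var(left) - (length(right)/n) * var(right)
+ }
> split.gain(bodyfat$DEXfat, bodyfat$waistcirc, 88.4)   # about 80.2, as computed above
> cuts <- sort(unique(bodyfat$waistcirc))
> gains <- sapply(cuts[-1], function(c) split.gain(bodyfat$DEXfat, bodyfat$waistcirc, c))
> cuts[-1][which.max(gains)]                            # the best waistcirc cut-point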
Let us use rpart to build a regression tree. I need to prune the tree. If the size
of a node is 10 or less, don’t split the node.
> MB <- rpart(DEXfat ~ waistcirc + hipcirc + elbowbreadth + kneebreadth,
data =
+ bodyfat, control = rpart.control(minsplit = 10))
> plot(MB, uniform = T, margin = 0.1)
> text(MB, use.n = T, all = T)
[Figure: the regression tree. The root (mean DEXfat 30.78, n = 71) splits on waistcirc < 88.4; the left branch (22.92, n = 40) splits on hipcirc < 96.25 and then on waistcirc < 70.35 and waistcirc < 80.75, giving terminal nodes with means 15.11 (n = 7), 20.38 (n = 10), 24.13 (n = 13), and 29.37 (n = 10); the right branch (40.92, n = 31) splits on kneebreadth < 11.15 and hipcirc < 109.9, giving terminal nodes 35.28 (n = 13), 42.71 (n = 15), and 56.45 (n = 3).]
Interpretation of the tree
1. At each node, the mean of DEXfat is reported.
2. At each node the size of the node is reported.
3. The tree has 7 terminal nodes.
4. The variable elbowbreadth has no role in the tree.
5. How does one carry out prediction here? Take any woman with
anthropometric measurements measured. Pour the measurements into
the root node. The data will settle in one of the terminal nodes. The
mean of the DEXfat reported in the terminal node is the predicted
DEXfat for the woman.
Let us pour the data on the covariates of all individuals in our sample. The
body fat is predicted by the tree. Let us record the predicted body fat and
observed body fat side by side.
> MB3 <- predict(MB, newdata = bodyfat)
> MB4 <- data.frame(bodyfat$DEXfat,
PredictedValues = MB3)
> MB4
    bodyfat.DEXfat PredictedValues
47           41.68        42.71133
48           43.29        42.71133
49           35.41        35.27846
50           22.79        24.13077
51           36.42        35.27846
52           24.13        29.37200
53           29.83        29.37200
54           35.96        35.27846
55           23.69        24.13077
56           22.71        20.37700
57           23.42        24.13077
58           23.24        20.37700
59           26.25        20.37700
60           21.94        15.10857
61           30.13        24.13077
62           36.31        35.27846
63           27.72        24.13077
64           46.99        42.71133
65           42.01        42.71133
66           18.63        20.37700
67           38.65        35.27846
68           21.20        20.37700
69           35.40        35.27846
70           29.63        35.27846
71           25.16        24.13077
72           31.75        29.37200
73           40.58        42.71133
74           21.69        24.13077
75           46.60        56.44667
76           27.62        29.37200
77           41.30        42.71133
78           42.76        42.71133
79           28.84        29.37200
80           36.88        29.37200
81           25.09        24.13077
82           29.73        29.37200
83           28.92        29.37200
84           43.80        42.71133
85           26.74        24.13077
86           33.79        35.27846
87           62.02        56.44667
88           40.01        42.71133
89           42.72        35.27846
90           32.49        35.27846
91           45.92        42.71133
92           42.23        42.71133
93           47.48        42.71133
94           60.72        56.44667
95           32.74        35.27846
96           27.04        29.37200
97           21.07        24.13077
98           37.49        35.27846
99           38.08        42.71133
100          40.83        42.71133
101          18.51        20.37700
102          26.36        24.13077
103          20.08        20.37700
104          43.71        42.71133
105          31.61        35.27846
106          28.98        29.37200
107          18.62        20.37700
108          18.64        15.10857
109          13.70        15.10857
110          14.88        15.10857
111          16.46        20.37700
112          11.21        15.10857
113          11.21        15.10857
114          14.18        15.10857
115          20.84        24.13077
116          19.00        24.13077
117          18.07        20.37700
Here is the graph of the observed and predicted values.
[Figure: "Regression Tree Output on the bodyfat data", a scatter plot of Observed Fat versus Predicted Fat.]
Here is the R code.
> plot(bodyfat$DEXfat, MB3, pch = 16, col = "red", xlab = "Observed
Fat",
+ ylab = "Predicted Fat", main = "Regression Tree Output on the bodyfat
data")
> abline(a=0, b=1, col = "blue")
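A couple of numeric summaries of how close the predictions are (a simple check of my own, not part of rpart's output):

> sqrt(mean((bodyfat$DEXfat - MB3)^2))   # root mean squared error of the tree predictions
> cor(bodyfat$DEXfat, MB3)               # correlation between observed and predicted DEXfat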
Module 9: Polygonal Plots for Three Predictors in Classification Tree
Let us go back to the data ‘infert’ in the package ‘datasets.’ In the
classification tree, three variables (spontaneous, age, and parity) made an
impact. We want to present an illuminating polygonal plot. Spontaneous
abortions have range 0 to 2 and the other variables have wider ranges. We
will create three polygonal plots, one for each value of spontaneous, with age on the x-axis and parity on the y-axis. We will create a single frame accommodating all
three plots using the ‘par’ command.
Here is the plot.
[Figure: three stacked panels titled "Spontaneous Abortions = 0", "Spontaneous Abortions = 1", and "Spontaneous Abortions = 2", each with Age on the x-axis and Parity on the y-axis; the rectangles are labeled F (fertile) or I (infertile).]
Here are the R commands.
> par(mfrow = c(3, 1), oma = c(3, 2, 4, 1))
> plot(infert$age, infert$parity, xlab = "Age", ylab = "Parity", main =
+ "Spontaneous Abortions = 0", type = "n")
> polygon(c(21, 21, 44, 44), c(1, 6, 6, 1), col = "mistyrose")
> plot(infert$age, infert$parity, xlab = "Age", ylab = "Parity", main =
+ "Spontaneous Abortions = 1", type = "n", ylim = c(0, 6), axes = F)
> axis(side = 1, at = c(21, 28, 29, 30, 31, 44))
> axis(side = 2, at = c(0:6))
> polygon(c(21, 21, 44, 44), c(4, 6, 6, 4), col = "mistyrose")
> text(33, 5, "F", col = "green")
> polygon(c(21, 21, 28, 28), c(2, 3, 3, 2), col = "mistyrose")
> text(25, 2.5, "F", col = "green")
> polygon(c(21, 21, 28, 28), c(0.5, 1.5, 1.5, 0.5), col = "lightcyan")
> text(25, 1, "I", col = "red")
> polygon(c(29, 29, 30, 30), c(1, 3, 3, 1), col = "mistyrose")
> text(29.5, 2, "F", col = "green")
> polygon(c(31, 31, 44, 44), c(1, 3, 3, 1), col = "lightcyan")
> text(38, 2, "I", col = "red")
> plot(infert$age, infert$parity, xlab = "Age", ylab = "Parity", main =
+ "Spontaneous Abortions = 2", type = "n", axes = F)
> axis(side = 1, at = c(21, 28, 29, 30, 31, 44))
> axis(side = 2, at = c(1:6))
> polygon(c(21, 21, 28, 28), c(1, 3, 3, 1), col = "lightcyan")
> text(25, 2, "I", col = "red")
> polygon(c(21, 21, 44, 44), c(4, 6, 6, 4), col = "mistyrose")
> text(33, 5, "F", col = "green")
> polygon(c(29, 29, 30, 30), c(1, 3, 3, 1), col = "mistyrose")
> text(29.5, 2, "F", col = "green")
> polygon(c(31, 31, 44, 44), c(1, 3, 3, 1), col = "lightcyan")
> text(38, 2, "I", col = "red")
Writing the classification tree verbally makes it easy to build the polygonal
plots. The tree has seven terminal nodes. Trace each terminal node to the
root node. Make a tentative polygonal plot by hand using the verbal
description. Let us begin the whole exercise with the tree.
[Figure: "Classification Tree for the Infertility Data" (0 = Not infertile; 1 = Infertile). The root (0, 165/83) splits on spontaneous < 0.5 into a terminal node 0 (113/28) and a node 1 (52/55); the latter splits on parity >= 3.5 into 0 (16/7) and 1 (36/48), then on age < 30.5 into 0 (27/21) and 1 (9/27), then on age >= 28.5 into 0 (8/2) and 0 (19/19), then on spontaneous < 1.5 into 0 (16/12) and 1 (3/7), and finally on parity >= 1.5 into 0 (10/3) and 1 (6/9).]
Here is the code for the tree.
> data(infert)
> head(infert)
  education age parity induced case spontaneous stratum pooled.stratum
1    0-5yrs  26      6       1    1           2       1              3
2    0-5yrs  42      1       1    1           0       2              1
3    0-5yrs  39      6       2    1           0       3              4
4    0-5yrs  34      4       2    1           0       4              2
5   6-11yrs  35      3       1    1           1       5             32
6   6-11yrs  36      4       2    1           1       6             36
> infert$education1 <- ifelse(infert$education == "0-5yrs", 0,
+ ifelse(infert$education == "6-11yrs", 1, 2))
> head(infert)
  education age parity induced case spontaneous stratum pooled.stratum education1
1    0-5yrs  26      6       1    1           2       1              3          0
2    0-5yrs  42      1       1    1           0       2              1          0
3    0-5yrs  39      6       2    1           0       3              4          0
4    0-5yrs  34      4       2    1           0       4              2          0
5   6-11yrs  35      3       1    1           1       5             32          1
6   6-11yrs  36      4       2    1           1       6             36          1
> infert$case <- as.factor(infert$case)
> MB <- rpart(case ~ age + parity + induced + spontaneous + education1,
+ data = infert)
> plot(MB, uniform = T, margin = 0.1)
> text(MB, use.n = T, all = T)
> title(main = "Classification Tree for the Infertility Data", sub =
+ "0 = Not infertile; 1 = Infertile")
Verbal description
Read the terminal nodes from left to right.
Terminal node No. 1
If spontaneous = 0, then fertile.
Terminal node No. 2
If spontaneous = 1 or 2 and parity = 4, 5, or 6, then fertile.
Terminal node No. 3
If spontaneous = 1 or 2, parity = 1, 2, or 3, age ≤ 30, and age ≥ 29, then
fertile.
Clean this up.
If spontaneous = 1 or 2, parity = 1, 2, or 3, and age = 29 or 30, then fertile.
Terminal node No. 4
If spontaneous = 1 or 2, parity = 1, 2, or 3, age ≤ 30, age ≤ 28, spontaneous
= 0 or 1, and parity = 2, 3, 4, 5, or 6, then fertile.
Clean this up.
If spontaneous = 1, parity = 2 or 3, and age ≤ 28, then fertile.
Terminal node No. 5
If spontaneous = 1 or 2, parity = 1, 2, or 3, age ≤ 30, age ≤ 28, spontaneous
= 0 or 1, and parity = 1, then infertile.
Clean this up.
If spontaneous = 1, parity = 1, and age ≤ 28, then infertile.
Terminal node No. 6
If spontaneous = 1 or 2, parity = 1, 2, or 3, age ≤ 30, age ≤ 28, spontaneous
= 0 or 1, and spontaneous = 2, then infertile.
Clean this up.
If spontaneous = 2, parity = 1, 2, or 3, and age ≤ 28, then infertile.
Terminal node No. 7
If spontaneous = 1 or 2, parity = 1, 2, or 3, and age ≥ 31, then infertile.
Draw polygonal plots by hand. Every inch of the space in each plot should
be covered.
Module 10: MULTINOMIAL LOGISTIC REGRESSION
In traditional Logistic Regression, the response variable is binary. We
now move on to response variables, which are polytomous, i.e., have more
than two possible response categories. The models we develop here
encompass the traditional binary response variables.
Polytomous variables are of two kinds: ordinal or nominal. Suppose a
subject is given a certain medication for arthritis pain. The response variable
is how much improvement the subject perceives. The subject is asked to
check one of the following items:
Y: Marked improvement, Some improvement, or None at all.
The response variable takes three possible values. The responses have some
sense of ordering. We will then say that the response variable is ordinal.
Suppose an opinion pollster is interested in conducting a survey in a
borough in order to ascertain political leanings of its denizens. Each subject
in the borough is classified into one of the categories:
Y: Republican, Democrat, or No particular affiliation.
The response variable is again polytomous. No sense of ordering is
perceivable in the responses. We say that the response variable is nominal.
We now embark on building models for polytomous response
variables. We will develop special models if the polytomous response
variable is ordinal.
Example. A medical researcher is entrusted with the job of evaluating an
active treatment in alleviating a certain type of arthritis pain. As a control, he
takes a Placebo treatment. Each subject is a female or male. For each subject
selected at random, let
X1
= 1 if the subject is female
= 0 if the subject is male;
X2
= 1 if the treatment given is Active
= 0 if the treatment given is Placebo
Y
= 1 (Marked improvement)
= 2 (Some improvement)
= 3 (None at all)
Data will be collected on the variables Y, X1, and X2 for a random sample of
arthritis patients. In this problem we have two covariates X1 and X2. The
response variable is Y, which is polytomous. The objective is to explore how
the response variable Y depends on the covariates X1 and X2.
Research questions are:
1. Are there significant gender differences in the responses?
2. Are there significant differences between the active treatment and
placebo in the responses?
We will answer these questions via a model building endeavor. We want to
build a model connecting Y with X1 and X2. After building the model, we
want to ascertain the impact of each covariate on the response variable.
More appropriately, for given values of X1 and X2, we want to model
Pr(Y = 1 / X1, X2),
Pr(Y = 2 / X1, X2),
Pr(Y = 3 / X1, X2)
as a function of the covariates X1 and X2. What is the interpretation?
Suppose we have a subject whose covariate values X1 and X2 are known.
What are the chances that the subject exhibits marked improvement in
his/her condition? What are the chances that the subject exhibits some
improvement in his/her condition? What are the chances that the subject
exhibits no improvement at all in his/her condition. The responses are
mutually exclusive and exhaustive. Therefore, the sum of these three
probabilities is one.
The Multinomial Logistic regression model is given by

Pr(Y = 1 | X1, X2) = exp(α1 + β1X1 + β2X2)/D,
Pr(Y = 2 | X1, X2) = exp(α2 + β3X1 + β4X2)/D,
Pr(Y = 3 | X1, X2) = 1/D,

where D = 1 + exp(α1 + β1X1 + β2X2) + exp(α2 + β3X1 + β4X2).
On the left side, each is a probability. Each probability is a number between
0 and 1. Each of the right sides is also a number between 0 and 1. The
probabilities on the left side add up to unity. So do the expressions on the
right side. There is no incongruity in the model.
This model has 6 parameters.
Is there any special reason that the expression for the response {Y = 3} is the
one given? You could swap the expressions for {Y = 2} and {Y = 3}; the
ultimate conclusions would remain the same. One of the responses has to be
taken as the baseline. We can rewrite the model in a different way.
Compare the responses {Y = 1} and {Y = 3}:

ln[Pr(Y = 1 | X1, X2)/Pr(Y = 3 | X1, X2)] = α1 + β1X1 + β2X2.

Compare the responses {Y = 2} and {Y = 3}:

ln[Pr(Y = 2 | X1, X2)/Pr(Y = 3 | X1, X2)] = α2 + β3X1 + β4X2.
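To see concretely that the three expressions always form a genuine probability
distribution, here is a minimal R sketch; the function name and the parameter
values are hypothetical, chosen only for illustration.

> multinom_probs <- function(x1, x2,
+                            alpha1 = 0.5, beta1 = -1.0, beta2 = 0.8,
+                            alpha2 = -0.3, beta3 = 0.4, beta4 = -0.6) {
+   eta1 <- alpha1 + beta1*x1 + beta2*x2   # log odds of {Y = 1} versus {Y = 3}
+   eta2 <- alpha2 + beta3*x1 + beta4*x2   # log odds of {Y = 2} versus {Y = 3}
+   D <- 1 + exp(eta1) + exp(eta2)
+   c(p1 = exp(eta1)/D, p2 = exp(eta2)/D, p3 = 1/D)
+ }
> multinom_probs(x1 = 1, x2 = 0)   # three non-negative numbers adding up to one

Whatever values the covariates and parameters take, each term is positive and
all three share the denominator D, so they always add up to one.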
If the response variable is ordinal, there is another model commonly
entertained. This is called the Proportional Odds Model. In our example, the
response variable is indeed ordinal.
PROPORTIONAL ODDS MODEL

Pr(Y = 1 | X1, X2) = exp(α1 + β1X1 + β2X2)/[1 + exp(α1 + β1X1 + β2X2)]

and

Pr(Y = 1 | X1, X2) + Pr(Y = 2 | X1, X2) = exp(α2 + β1X1 + β2X2)/[1 + exp(α2 + β1X1 + β2X2)],

with α1 < α2.
Under this model, each individual probability can be ascertained:

Pr(Y = 2 | X1, X2) = exp(α2 + β1X1 + β2X2)/[1 + exp(α2 + β1X1 + β2X2)]
                     - exp(α1 + β1X1 + β2X2)/[1 + exp(α1 + β1X1 + β2X2)]

Pr(Y = 3 | X1, X2) = 1 - exp(α2 + β1X1 + β2X2)/[1 + exp(α2 + β1X1 + β2X2)]
In order to make sure that these are all non-negative numbers, we ought to
have α2 ≥ α1. Discuss why? This model has 4 parameters.
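Here is a minimal R sketch of these probabilities; the parameter values are
hypothetical, chosen only to illustrate why the ordering of the intercepts
matters. (plogis(u) is exp(u)/(1 + exp(u)).)

> prop_odds_probs <- function(x1, x2,
+                             alpha1 = -1.0, alpha2 = 0.5,
+                             beta1 = 0.7, beta2 = -0.4) {
+   F1 <- plogis(alpha1 + beta1*x1 + beta2*x2)   # Pr(Y = 1)
+   F2 <- plogis(alpha2 + beta1*x1 + beta2*x2)   # Pr(Y = 1) + Pr(Y = 2)
+   c(p1 = F1, p2 = F2 - F1, p3 = 1 - F2)        # p2 >= 0 only because alpha2 >= alpha1
+ }
> prop_odds_probs(x1 = 1, x2 = 0)

If α2 were smaller than α1, the middle probability p2 would turn out negative.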
What is the difference between the multinomial logistic regression model
and the proportional odds model?
1. The proportional odds model has fewer parameters.
2. The proportional odds model is applicable only when we have an
ordinal response variable.
3. If both models fit the data well, we choose the one with fewer
parameters. Simplicity is a desirable trait.
This model can be rewritten in the following way.

Compare the response {Y = 1} with {Y = 2 or 3}, i.e., the best response versus
the rest. (We can do this because the response variable is ordinal.)

ln[Pr(Y = 1 | X1, X2)/(Pr(Y = 2 | X1, X2) + Pr(Y = 3 | X1, X2))] = α1 + β1X1 + β2X2.

Compare the responses {Y = 1 or 2} with {Y = 3}, i.e., the best two responses
versus the worst:

ln[(Pr(Y = 1 | X1, X2) + Pr(Y = 2 | X1, X2))/Pr(Y = 3 | X1, X2)] = α2 + β1X1 + β2X2.
Some writers write the proportional odds model, in the above case, in the
following way:

Pr(Y = 3 | X1, X2) = exp(α1 + β1X1 + β2X2)/[1 + exp(α1 + β1X1 + β2X2)]

and

Pr(Y = 2 | X1, X2) + Pr(Y = 3 | X1, X2) = exp(α2 + β1X1 + β2X2)/[1 + exp(α2 + β1X1 + β2X2)],

with α1 < α2.
The package ‘Design’ (now ‘rms’) uses this form of the Proportional Odds
model. In whatever way we decide to write the model, the final conclusions
about significance of covariates and predicted probabilities remain the same.
Why is this called a proportional odds model? Let us look at the general
situation. Suppose we have a categorical response variable Y which is
ordinal. Let the possible values of Y be denoted by 1, 2, … , J. (1 means the
best and J means the worst.) Suppose we have only one covariate X. The
proportional odds model is given by

Pr(Y = 1 | X) = exp(α1 + βX)/[1 + exp(α1 + βX)]
Pr(Y ≤ 2 | X) = exp(α2 + βX)/[1 + exp(α2 + βX)]
Pr(Y ≤ 3 | X) = exp(α3 + βX)/[1 + exp(α3 + βX)]
…
Pr(Y ≤ J-1 | X) = exp(αJ-1 + βX)/[1 + exp(αJ-1 + βX)],

with the provision that α1 ≤ α2 ≤ … ≤ αJ-1.
Equivalently, we can formulate the model as

ln[Pr(Y ≤ j | X)/Pr(Y > j | X)] = αj + βX,   j = 1, 2, … , J-1.

We are comparing the odds of the best j responses with the worst J-j
responses. What is special about the model? The same β is present in every
equation. This is what gives the model its proportional odds.
Suppose we have an individual with X = x1 and another individual with X =
x2. Compare the odds of the best j responses versus the worst J-j responses
for the individual with X = x1 and the odds of the best j responses versus the
worst J-j responses for the individual with X = x2.
[Pr(Y ≤ j | x1)/Pr(Y > j | x1)] ÷ [Pr(Y ≤ j | x2)/Pr(Y > j | x2)] = exp(β(x1 - x2)).

The odds are proportional: the ratio depends only on the covariate values x1
and x2, and it is free of j.
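A minimal numerical check of this property in R, using hypothetical cut-points
and slope (the alphas and beta below are made up, with J = 4 categories):

> alphas <- c(-1.0, 0.2, 1.5)          # alpha_1 <= alpha_2 <= alpha_3
> beta <- 0.8
> cum_odds <- function(x) exp(alphas + beta*x)   # odds of {Y <= j}, j = 1, 2, 3
> x1 <- 2.0; x2 <- 0.5
> cum_odds(x1)/cum_odds(x2)            # the same value for every j ...
> exp(beta*(x1 - x2))                  # ... namely exp(beta*(x1 - x2))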
Actual data on the arthritis experiment
Sex     Treatment      Improvement            Total
                     Marked   Some   None
Female  Active          16      5      6        27
Female  Placebo          6      7     19        32
Male    Active           5      2      7        14
Male    Placebo          1      0     10        11
Analyze the data.
The analysis of the data is tantamount to comparing the distribution of the
response variable Y among four distinct populations:
1. Those who are females and are on active treatment;
2. Those who are females and are on placebo;
3. Those who are males and are on active treatment;
4. Those who are males and are on placebo.
The empirical distributions are:

             Female               Male
             Active   Placebo     Active   Placebo
Pr(Y = 1)     0.59      0.19       0.36      0.09
Pr(Y = 2)     0.19      0.22       0.14      0.00
Pr(Y = 3)     0.22      0.59       0.50      0.91
Empirical analysis
1. Compare the responses of males and females who are on the active
treatment. Is the observed difference in the responses significant?
2. Compare the responses between the active treatment and placebo for
females. Is the observed difference in the responses significant?
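One informal way to probe these two questions in R is Fisher's exact test on
the relevant 2 x 3 sub-tables of counts. This is only a hedged sketch using the
counts from the table above, not part of the model building that follows.

> male_female_active <- rbind(FemaleActive = c(16, 5, 6),
+                             MaleActive   = c(5, 2, 7))
> colnames(male_female_active) <- c("Marked", "Some", "None")
> fisher.test(male_female_active)      # gender difference, active arm only
> female_active_placebo <- rbind(FemaleActive  = c(16, 5, 6),
+                                FemalePlacebo = c(6, 7, 19))
> colnames(female_active_placebo) <- c("Marked", "Some", "None")
> fisher.test(female_active_placebo)   # treatment difference, females only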
OPERATION R
A number of packages are available to fit logistic models.
Base: Binary Logistic Regression Model
nnet: Multinomial Logistic Regression model for raw data
VGAM: Multinomial and Proportional Odds models
Design (rms): Proportional Odds Model
Raw data or ungrouped data
The following data came from the Florida Game and Fresh Water Fish
Commission. They wanted to investigate factors influencing the primary
food choice of alligators. For 59 alligators sampled in Lake George, Florida,
the numbers pertain to the alligator’s length (in meters) and primary food
type found in the alligator’s stomach. Primary food type has three
categories: Fish, Invertebrate, and Other. The invertebrates are primarily
apple snails, aquatic insects, and crayfish. The ‘Other’ category includes
reptiles (primarily turtles, though one stomach contained tags of 23 baby
alligators that had been released in the lake during the previous year.)
Size: 1.24 1.45 1.63 1.78 1.98 2.36 2.79 3.68 1.30
Food:  I    I    I    I    I    F    F    O    I
Size: 1.45 1.65 1.78 2.03 2.39 2.84 3.71 1.30 1.47
Food:  O    O    I    F    F    F    F    I    I
Size: 1.65 1.78 2.03 2.41 3.25 3.89 1.32 1.47 1.65
Food:  I    O    F    F    O    F    F    F    F
Size: 1.80 2.16 2.44 3.28 1.32 1.50 1.65 1.80 2.26
Food:  I    F    F    O    F    I    F    F    F
Size: 2.46 3.33 1.40 1.52 1.68 1.85 2.31 2.56 3.56
Food:  F    F    F    I    F    F    F    O    F
Size: 1.42 1.55 1.70 1.88 2.31 2.67 3.58 1.42 1.60
Food:  I    I    I    I    F    F    F    F    I
Size: 1.73 1.93 2.36 2.72 3.66
Food:  O    I    F    I    F
Here, the response variable is ‘Food Type,’ which is a nominal categorical
variable. The independent variable is ‘Size,’ which is quantitative. We want
to investigate how the food choice is dependent on size. Let us entertain a
multinomial logistic regression model.
Pr(Food = F) = exp(α1 + β1*Size)/D
Pr(Food = I) = exp(α2 + β2*Size)/D
Pr(Food = O) = 1/D

D = 1 + exp(α1 + β1*Size) + exp(α2 + β2*Size)
The multinomial logistic model fitting is not available in the ‘base’ system
of R. It is available in several packages. The package I will use is ‘nnet.’ It is
also available in ‘Design.’ Get into the R console. Download the package
‘nnet’ from Ohio.
We have two columns of data. Input the data.
> size <- c(1.24, 1.45, 1.63, 1.78, 1.98, 2.36, 2.79, 3.68, 1.30, 1.45, 1.65,
1.78, 2.03, 2.39, 2.84, 3.71, 1.30, 1.47, 1.65, 1.78, 2.03, 2.41, 3.25, 3.89,
1.32, 1.47, 1.65, 1.80, 2.16, 2.44, 3.28, 1.32, 1.50, 1.65, 1.80, 2.26, 2.46,
3.33, 1.40, 1.52, 1.68, 1.85, 2.31, 2.56, 3.56, 1.42, 1.55, 1.70, 1.88, 2.31,
2.67, 3.58, 1.42, 1.60, 1.73, 1.93, 2.36, 2.72, 3.66)
‘Food type’ is categorical. One can input the 59 entries in toto, or one can
exploit the repetitions as follows. (Note that the character "0", a zero, is used
below as the label for the Other category; it appears as "0" in the printouts.)
> food <- factor(c(rep("I", 5), rep("F", 2), "0", "I", rep("0", 2), "I",
rep("F", 4), rep("I", 3), "0", rep("F", 2), "0", rep("F", 4), "I", rep("F", 2),
"0", "F", "I", rep("F", 6), "I", rep("F", 3), "0", "F", rep("I", 4),
rep("F", 4), "I", "0", "I", "F", "I", "F"))
Put both data sets into a single frame.
> allig <- data.frame(food, size)
> allig
   food size
1     I 1.24
2     I 1.45
3     I 1.63
4     I 1.78
5     I 1.98
6     F 2.36
7     F 2.79
8     0 3.68
9     I 1.30
10    0 1.45
11    0 1.65
12    I 1.78
13    F 2.03
14    F 2.39
15    F 2.84
16    F 3.71
17    I 1.30
18    I 1.47
19    I 1.65
20    0 1.78
21    F 2.03
22    F 2.41
23    0 3.25
24    F 3.89
25    F 1.32
26    F 1.47
27    F 1.65
28    I 1.80
29    F 2.16
30    F 2.44
31    0 3.28
32    F 1.32
33    I 1.50
34    F 1.65
35    F 1.80
36    F 2.26
37    F 2.46
38    F 3.33
39    F 1.40
40    I 1.52
41    F 1.68
42    F 1.85
43    F 2.31
44    0 2.56
45    F 3.56
46    I 1.42
47    I 1.55
48    I 1.70
49    I 1.88
50    F 2.31
51    F 2.67
52    F 3.58
53    F 1.42
54    I 1.60
55    0 1.73
56    I 1.93
57    F 2.36
58    I 2.72
59    F 3.66
Put the package ‘nnet’ into service.
> library(nnet)
The R command is ‘multinom.’
> allig1 <- multinom(food ~ size, data = allig)
# weights: 9 (4 variable)
initial value 64.818125
iter 10 value 49.170785
final value 49.170622
converged
> summary(allig1)
Call:
multinom(formula = food ~ size, data = allig)
Coefficients:
  (Intercept)       size
F    1.617952 -0.1101836
I    5.697543 -2.4654695

Std. Errors:
  (Intercept)      size
F    1.307291 0.5170838
I    1.793820 0.8996485
Residual Deviance: 98.34124
AIC: 106.3412
Correlation of Coefficients:
              F:(Intercept)     F:size I:(Intercept)
F:size           -0.9528240
I:(Intercept)     0.5905923 -0.5608696
I:size           -0.4442392  0.4637660    -0.9591611
The summary does not provide p-values. Let us write down the estimated model.

Pr(Food = F) = exp(1.62 - 0.11*Size)/D
Pr(Food = I) = exp(5.70 - 2.47*Size)/D
Pr(Food = O) = 1/D

D = 1 + exp(1.62 - 0.11*Size) + exp(5.70 - 2.47*Size)
Let us calculate the z-value associated with every parameter.

α1:  z = Estimate/S.E. = 1.62/1.31 = 1.24
β1:  z = Estimate/S.E. = -0.11/0.52 = -0.21
α2:  z = Estimate/S.E. = 5.70/1.79 = 3.18
β2:  z = Estimate/S.E. = -2.47/0.90 = -2.74

Interpretation. Only two estimates are significant, namely α2 and β2, the
parameters in the Invertebrate equation.
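Since the summary gives no p-values, here is a hedged sketch of how z-values
and approximate two-sided p-values can be pulled out of the fit; it assumes the
object allig1 created above (summary.multinom keeps the pieces in
$coefficients and $standard.errors).

> s <- summary(allig1)
> z <- s$coefficients/s$standard.errors
> p <- 2*pnorm(abs(z), lower.tail = FALSE)
> round(z, 2)   # z-values, matching the hand calculations above
> round(p, 4)   # approximate two-sided p-values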
Let us plot all the three logistic curves as a function of size. We use the basic
command ‘curve.’
> curve(exp(1.62 - 0.11*x)/(1 + exp(1.62 - 0.11*x) + exp(5.70 - 2.47*x)),
+ xlim = c(1, 4), xlab = "Size", ylab = "Probability")
> curve(exp(5.70 - 2.47*x)/(1 + exp(1.62 - 0.11*x) + exp(5.70 - 2.47*x)),
add = TRUE, col = "blue")
> curve(1/(1 + exp(1.62 - 0.11*x) + exp(5.70 - 2.47*x)), add = TRUE, col =
"red")
> title(main = "Probability of Choice of Food as a function of Size")
> text(3, 0.7, "FISH")
> text(1.5, 0.7, "Invertebrate")
> text(3, 0.2, "Other")
[Figure: "Probability of Choice of Food as a function of Size". Three fitted
probability curves (FISH, Invertebrate, Other) plotted against Size from 1.0
to 4.0, with Probability on the vertical axis.]
Comment on the graph. As the alligator grows bigger and bigger, it prefers
fish over invertebrates.
Comment. The nnet package orders the categories of the response variable
alphanumerically and takes the first one as the baseline category (the 1/D
category). Here the label "0", entered for Other, sorts first, so Other is the
baseline, as in the model we wrote down.
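As a hedged usage sketch, the fitted object can also produce predicted
food-choice probabilities directly for new sizes (this assumes the object
allig1 from above; the sizes chosen are arbitrary).

> newsizes <- data.frame(size = c(1.5, 2.5, 3.5))
> round(predict(allig1, newdata = newsizes, type = "probs"), 3)
# one row per size; the columns are the categories "0" (Other), "F", and "I"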
Multinomial logistic regression model and grouped data
Recall the arthritis data. There were 84 subjects in the sample. These 84
pieces of information (raw data) have been summarized into 12 pieces of
information.
Actual data on the arthritis experiment

Sex     Treatment      Improvement            Total
                     Marked   Some   None
Female  Active          16      5      6        27
Female  Placebo          6      7     19        32
Male    Active           5      2      7        14
Male    Placebo          1      0     10        11
Write the theoretical model.

Pr(Y = Marked) = exp(α1 + β1*Gender + β2*Treatment)/D
Pr(Y = Some)   = exp(α2 + β3*Gender + β4*Treatment)/D
Pr(Y = None)   = 1/D

D = the sum of the numerators
  = 1 + exp(α1 + β1*Gender + β2*Treatment) + exp(α2 + β3*Gender + β4*Treatment)
Estimate the unknown parameters using the data.
We will use the package VGAM. We need to enter the data into R in the
form presented above.
Each covariate is structured. Exploit this. Use the command ‘gl’ (generate
levels).
> Gender <- gl(2, 2, 4, labels = c("Female", "Male"))
> Treatment <- gl(2, 1, 4, labels = c("Active", "Placebo"))
> Marked <- c(16, 6, 5, 1)
> Some <- c(5, 7, 2, 0)
> None <- c(6, 19, 7, 10)
> Arthritis <- data.frame(Gender, Treatment, Marked, Some, None)
> Arthritis
  Gender Treatment Marked Some None
1 Female    Active     16    5    6
2 Female   Placebo      6    7   19
3   Male    Active      5    2    7
4   Male   Placebo      1    0   10
Download the package VGAM and make it active.
> Multi <- vglm(cbind(Marked, Some, None) ~ Gender + Treatment, family
=
+ multinomial, data = Arthritis)
> summary(Multi)
Call:
vglm(formula = cbind(Marked, Some, None) ~ Gender + Treatment,
family = multinomial, data = Arthritis)
Pearson Residuals:
  log(mu[,1]/mu[,3]) log(mu[,2]/mu[,3])
1         -0.0019173           -0.30625
2         -0.0672899            0.26014
3         -0.0701808            0.53312
4          0.2363841           -0.79075
Coefficients:
                        Value  Std. Error     t value
(Intercept):1       1.0324454     0.45755   2.2564433
(Intercept):2      -0.0029107     0.55587  -0.0052363
GenderMale:1       -1.3784722     0.63848  -2.1589737
GenderMale:2       -1.6615071     0.86029  -1.9313430
TreatmentPlacebo:1 -2.1686535     0.59430  -3.6490942
TreatmentPlacebo:2 -1.1055147     0.67377  -1.6407881
Number of linear predictors: 2
Names of linear predictors: log(mu[,1]/mu[,3]), log(mu[,2]/mu[,3])
Dispersion Parameter for multinomial family: 1
Residual Deviance: 1.70347 on 2 degrees of freedom
Log-likelihood: -74.51039 on 2 degrees of freedom
Number of Iterations: 4
Check Model Adequacy: Look at the residual deviance and the
corresponding degrees of freedom.
H0: The response probabilities follow the pattern of a multinomial logistic
regression model.
Technically, if the sample is large, Residual Deviance has a chi-squared
distribution if the null hypothesis is true. Let us calculate the p-value.
p-value = probability of getting a residual deviance at least as large as the
one observed if the null hypothesis is true.
> pchisq(1.70347, 2, lower.tail = F)
[1] 0.426674
The null hypothesis cannot be rejected. The Multinomial model is a good fit.
Let us write down the estimated model.

Pr(Y = Marked) = exp(1.03 - 1.38*Male - 2.17*Placebo)/D
Pr(Y = Some)   = exp(0.00 - 1.66*Male - 1.11*Placebo)/D
Pr(Y = None)   = 1/D

where D is the sum of the three numerators.
How do we know that the category {Y = None} is the baseline? Look at the
vglm command: I put ‘None’ at the end of the ‘cbind’ command, and the
multinomial family takes the last column as the baseline.
You can make any category the baseline.
Look at the response probabilities as per the model.
> Multi1 <- fitted(Multi)
> Multi1
      Marked       Some      None
1 0.58437331 0.20751090 0.2081158
2 0.19443502 0.19991267 0.6056523
3 0.37299433 0.09980040 0.5272053
4 0.07073449 0.05479949 0.8744660
How do we get these?
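As a hedged illustration of where the first row comes from, plug Male = 0 and
Placebo = 0 (a female on the active treatment) into the estimated model, using
the coefficient estimates printed above.

> eta1 <- 1.0324454    # log(Pr(Marked)/Pr(None)) when Male = 0, Placebo = 0
> eta2 <- -0.0029107   # log(Pr(Some)/Pr(None))  when Male = 0, Placebo = 0
> D <- 1 + exp(eta1) + exp(eta2)
> c(Marked = exp(eta1)/D, Some = exp(eta2)/D, None = 1/D)
# reproduces the first row: 0.584, 0.208, 0.208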
Look at the expected frequencies as per the fitted model. How does one
calculate these?
> total <- c(27, 32, 14, 11)
> Multi2 <- fitted(Multi)*total
> Multi2
      Marked      Some      None
1 15.7780793 5.6027944  5.619126
2  6.2219207 6.3972056 19.380874
3  5.2219207 1.3972056  7.380874
4  0.7780793 0.6027944  9.619126
Too many decimals? Round the numbers to 3 decimal places.
> Multi3 <- round(fitted(Multi), 3)
> Multi3
Marked Some None
1 0.584 0.208 0.208
2 0.194 0.200 0.606
3 0.373 0.100 0.527
4 0.071 0.055 0.874
I want a barplot of these numbers, so let us first make sure the above output
is a matrix.
> Multi4 <- as.matrix(Multi3)
> Multi4
  Marked  Some  None
1  0.584 0.208 0.208
2  0.194 0.200 0.606
3  0.373 0.100 0.527
4  0.071 0.055 0.874
There is no problem. Identify the rows more descriptively.
> rownames(Multi4) <- c("FemaleActive", "FemalePlacebo", "MaleActive",
+ "MalePlacebo")
Look at what happens.
> Multi4
              Marked  Some  None
FemaleActive   0.584 0.208 0.208
FemalePlacebo  0.194 0.200 0.606
MaleActive     0.373 0.100 0.527
MalePlacebo    0.071 0.055 0.874
Discuss what type of barplot we want.
> barplot(Multi4, beside = T, legend.text = colnames(t(Multi4)), col =
+ c("red", "blue", "green", "magenta"))
Comments. The barplot command applied to a matrix always draws the bars
column by column. Look at the output.
[Barplot of Multi4: bars grouped column by column (Marked, Some, None), with one
bar per Gender-Treatment combination (FemaleActive, FemalePlacebo, MaleActive,
MalePlacebo).]
Let us look at another type of barplot.
> barplot(t(Multi4), beside = T, legend.text = colnames(Multi4), col =
+ c("red", "blue", "green"))
(t = transpose of a matrix)
Add more description to the barplot.
> title(main = "Multinomial Logistic Regression of Improvement Response
on
+ Gender and Treatment", xlab = "Gender Treatment Combination", ylab =
+ "Probability")
[Barplot: "Multinomial Logistic Regression of Improvement Response on Gender
and Treatment". Fitted probabilities of Marked, Some, and None (vertical axis,
0 to about 0.9) for each Gender-Treatment combination on the horizontal axis.]
Comments.
1. For females on active treatment, the predominant response is marked
improvement with chances hovering around 60%.
2. For females on placebo, the predominant response is no improvement
with chances around 60%.
3. For males on active treatment, the predominant response is no
improvement with chances around 50%.
4. For males on placebo, the predominant response is no improvement
with chances more than 90%.
The fitted model is:
exp(1.03  1.38 * gender  2.17 * treatment)
D
exp(1.03  0.28 * gender  1.06 * treatment)
Pr(Improvement = Some) =
D
1
Pr(Improvement = Marked) =
D
D = 1 + exp(-1.03 + 1.38*gender + 2.17*treatment)
+ exp(-1.03 – 0.28*gender + 1.06*treatment)
Pr(Improvement = None) =
with the understanding that
gender = 1 if male
= 0 if female
treatment = 1 if placebo
= 0 if active
What is the rationale behind this kind of scoring?
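As a hedged sketch, this 0/1 scoring is simply R's default treatment (dummy)
coding of the two factors, which can be inspected directly from the Arthritis
data frame built earlier.

> model.matrix(~ Gender + Treatment, data = Arthritis)
#   (Intercept) GenderMale TreatmentPlacebo
# 1           1          0                0    <- Female, Active
# 2           1          0                1    <- Female, Placebo
# 3           1          1                0    <- Male, Active
# 4           1          1                1    <- Male, Placebo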
Let us compare the empirical and model probabilities for each configuration
of the factors.
   improvement gender treatment Emp. Prob. Model Prob.
1       marked female    active       0.59        0.58
2         some female    active       0.19        0.21
3         none female    active       0.22        0.21
4       marked   male    active       0.36        0.37
5         some   male    active       0.14        0.10
6         none   male    active       0.50        0.53
7       marked female   placebo       0.19        0.19
8         some female   placebo       0.22        0.20
9         none female   placebo       0.59        0.61
10      marked   male   placebo       0.09        0.07
11        some   male   placebo       0.00        0.05
12        none   male   placebo       0.91        0.88
Let us compare the observed and expected frequencies.
   improvement gender treatment Observed Expected
1       marked female    active       16    15.66
2         some female    active        5     5.67
3         none female    active        6     5.67
4       marked   male    active        5     5.18
5         some   male    active        2     1.40
6         none   male    active        7     7.42
7       marked female   placebo        6     6.40
8         some female   placebo        7     6.40
9         none female   placebo       19    19.52
10      marked   male   placebo        1     0.77
11        some   male   placebo        0     0.55
12        none   male   placebo       10     9.68
A visual inspection of the frequencies tells us that the model is a good
summary of the data.
Table of z-values (Estimate/Standard Error):

Response  Intercept  gender  treatment
None         -2.256   2.159      3.649
Some         -2.168  -0.311      1.522
Let us have a different baseline. Let us have {Y = Marked} as the baseline.
> Multi5 <- vglm(cbind(Some, None, Marked) ~ Gender + Treatment,
family =
+ multinomial, data = Arthritis)
> summary(Multi5)
Call:
vglm(formula = cbind(Some, None, Marked) ~ Gender + Treatment,
family = multinomial, data = Arthritis)
Pearson Residuals:
  log(mu[,1]/mu[,3]) log(mu[,2]/mu[,3])
1           -0.26907           0.146256
2            0.26106          -0.063604
3            0.52039          -0.135416
4           -0.81536           0.127844
Coefficients:
                       Value  Std. Error   t value
(Intercept):1       -1.03536     0.47750  -2.16828
(Intercept):2       -1.03245     0.45755  -2.25644
GenderMale:1        -0.28303     0.90900  -0.31137
GenderMale:2         1.37847     0.63848   2.15897
TreatmentPlacebo:1   1.06314     0.69859   1.52184
TreatmentPlacebo:2   2.16865     0.59430   3.64909
Number of linear predictors: 2
Names of linear predictors: log(mu[,1]/mu[,3]), log(mu[,2]/mu[,3])
Dispersion Parameter for multinomial family: 1
Residual Deviance: 1.70347 on 2 degrees of freedom
Log-likelihood: -11.29448 on 2 degrees of freedom
Number of Iterations: 4
> Multi6 <- fitted(Multi5)
> Multi6
        Some      None     Marked
1 0.20751090 0.2081158 0.58437331
2 0.19991267 0.6056523 0.19443502
3 0.09980040 0.5272053 0.37299433
4 0.05479949 0.8744660 0.07073449
Residual deviance is exactly the same.
Predicted probabilities are exactly the same.
Expected frequencies are exactly the same.
The choice of the baseline is yours.
Module 11: PROPORTIONAL ODDS MODEL ON R + VGAM package
The following is a data set coming from Lancaster coalmines in England.
The publication of these data has revolutionized the way coalface workers
are compensated for years of exposure in the coal mines.
The data contain information on a sample of 371 coalface workers: how many
years each was exposed and the severity of pneumoconiosis, categorized as 1 =
normal, 2 = mild pneumoconiosis, 3 = severe pneumoconiosis. The response
variable is severity of pneumoconiosis, which is clearly ordinal. The
covariate is the number of years X of exposure. For simplicity of
presentation and analysis, the covariate values are divided into 8 intervals:
(0, 12]; (12, 18], (18, 24]; (24, 30]; (30, 36]; (36, 42]; (42, 48];
(48, 54] years of exposure. The median value of the years of exposure in
each interval is taken as representative of the interval. Take any interval of
exposure. Identify all workers whose years of exposure fall into the interval.
For the group of workers, find out how many of them are normal, have mild
pneumoconiosis, and severe pneumoconiosis.
Data

  exposure normal mild severe
1      5.8     98    0      0
2     15.0     51    2      1
3     21.5     34    6      3
4     27.5     35    5      8
5     33.5     32   10      9
6     39.5     23    7      8
7     46.0     12    6     10
8     51.5      4    2      5
Interpretation of the data: Out of the 98 coalface workers whose median
exposure time is 5.8 years, all are normal; out of the 54 coalface workers
whose median exposure time is 15 years, 51 are normal, two have mild
pneumoconiosis, and one has severe pneumoconiosis; etc.
This is grouped data with a ternary response variable and one quantitative
covariate.
Earlier, I presented two data sets with a multi-level response variable. Let us
discuss the structure of those data sets, along with the coal mine data just
introduced.
1. Alligator data
a. The response variable is ternary.
b. The response variable is nominal.
c. The proportional odds model cannot be entertained.
d. There is only one covariate, which is quantitative.
e. The data are presented in raw form.
f. We have fitted the multinomial logistic regression model using the
packages nnet and VGAM.
2. Arthritis data
a. The response variable is ternary.
b. The response variable is ordinal.
c. Both the multinomial logistic regression and proportional odds
models can be entertained.
d. There are two covariates both categorical.
e. The data are presented in grouped form.
f. We have fitted the multinomial logistic regression model using the
VGAM package.
g. What about the proportional odds model?
3. Coal mine data
a. The response variable is ternary.
b. The response variable is ordinal.
c. Both the multinomial logistic regression and proportional odds
models can be entertained.
d. There is only one covariate which is quantitative.
e. The data are presented in grouped form (the years of exposure having
been grouped into intervals).
f. We are going to fit both the multinomial logistic regression model
and the proportional odds model using the VGAM package.
Download the package VGAM.
The coal mine data set is available in VGAM under the name ‘pneumo.’
Make the package active.
> library(VGAM)
Download the data and see what it contains.
> data(pneumo)
> pneumo
  exposure.time normal mild severe
1           5.8     98    0      0
2          15.0     51    2      1
3          21.5     34    6      3
4          27.5     35    5      8
5          33.5     32   10      9
6          39.5     23    7      8
7          46.0     12    6     10
8          51.5      4    2      5
The covariate is continuous. The data on the number of years of exposure are
skewed to the right (most workers have short exposures). Take logarithms
(natural logarithms) to make the covariate values more symmetric, closer to a
normal distribution.
Keep the same name for the data set. Transform only the exposure time.
Keep all other entries intact.
> pneumo = transform(pneumo, logexpo = log(exposure.time))
Look what we have in the new data folder.
> pneumo
  exposure.time normal mild severe  logexpo
1           5.8     98    0      0 1.757858
2          15.0     51    2      1 2.708050
3          21.5     34    6      3 3.068053
4          27.5     35    5      8 3.314186
5          33.5     32   10      9 3.511545
6          39.5     23    7      8 3.676301
7          46.0     12    6     10 3.828641
8          51.5      4    2      5 3.941582
Two models can be entertained.
Multinomial Logistic Regression Model

Pr(Severity = normal) = exp(α1 + β1*logexpo)/D
Pr(Severity = mild)   = exp(α2 + β2*logexpo)/D
Pr(Severity = severe) = 1/D

D = 1 + exp(α1 + β1*logexpo) + exp(α2 + β2*logexpo)

The baseline category is taken to be ‘severe.’ The R command as it is laid
out below will respect our choice. This model has four parameters.

Proportional Odds Model (Cumulative Model)

Pr(Severity = normal) = exp(α1 + β*logexpo)/[1 + exp(α1 + β*logexpo)]

Pr(Severity = normal) + Pr(Severity = mild) = exp(α2 + β*logexpo)/[1 + exp(α2 + β*logexpo)]

This model has three parameters.
We will now fit the multinomial logistic regression model. The command is
‘vglm,’ short for vector generalized linear model.
> fit.multi <- vglm(cbind(normal, mild, severe) ~ logexpo, family =
multinomial, pneumo)
> summary(fit.multi)
Call:
vglm(formula = cbind(normal, mild, severe) ~ logexpo, family =
multinomial, data = pneumo)
Pearson Residuals:
                        Min       1Q    Median      3Q     Max
log(mu[,1]/mu[,3]) -0.75473 -0.20032 -0.043255 0.43644 0.92283
log(mu[,2]/mu[,3]) -0.81385 -0.43818 -0.121281 0.15716 1.03229
Coefficients:
                Value  Std. Error  t value
(Intercept):1 11.9751     2.00044   5.9862
(Intercept):2  3.0391     2.37607   1.2790
logexpo:1     -3.0675     0.56521  -5.4272
logexpo:2     -0.9021     0.66898  -1.3485
Number of linear predictors: 2
Names of linear predictors: log(mu[,1]/mu[,3]), log(mu[,2]/mu[,3])
Dispersion Parameter for multinomial family: 1
Residual Deviance: 5.34738 on 12 degrees of freedom
Let us now check on goodness-of-fit. The null hypothesis is:
H0: The response probabilities follow the multinomial logistic regression
model pattern.
Let us calculate the p-value of the observed residual deviance on 12 degrees
of freedom.
> pchisq(5.34738, 12, lower.tail = F)
[1] 0.945359
p-value = Probability of getting the residual deviance at least as large as
5.34738 when the null hypothesis is true.
How does one calculate the degrees of freedom?
The model is a good fit. We do not reject the null hypothesis.
The fitted model is:

P̂r(Severity = normal) = exp(11.9751 - 3.0675*logexpo)/D
P̂r(Severity = mild)   = exp(3.0391 - 0.9021*logexpo)/D
P̂r(Severity = severe) = 1/D

D = 1 + exp(11.9751 - 3.0675*logexpo) + exp(3.0391 - 0.9021*logexpo)
Let us calculate the model probabilities. The following command gives the
probabilities as per the model for each exposure group.
> fitted(fit.multi)
     normal        mild      severe
1 0.9927503 0.005875947 0.001373768
2 0.9329702 0.043219077 0.023810688
3 0.8488899 0.085745054 0.065365011
4 0.7485338 0.128835331 0.122630879
5 0.6393787 0.168725388 0.191895881
6 0.5334715 0.201127232 0.265401245
7 0.4313692 0.226188995 0.342441766
8 0.3581471 0.239824757 0.402028109
The empirical probabilities are:

Exposure group normal mild severe
1                1.00 0.00   0.00
2                0.94 0.04   0.02
3                0.79 0.14   0.07
4                0.73 0.10   0.17
5                0.63 0.20   0.17
6                0.61 0.18   0.21
7                0.43 0.21   0.36
8                0.36 0.18   0.45
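These empirical probabilities are just the row-wise proportions of the counts;
a hedged one-liner in R, assuming the pneumo data frame loaded above:

> counts <- with(pneumo, cbind(normal, mild, severe))
> round(prop.table(counts, margin = 1), 2)   # one row of proportions per exposure group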
It is more instructive to compare observed and expected frequencies. Let us
go to R. Calculate the total number of coalminers in each exposure group.
Open a data vector consisting of these numbers.
> total <- c(98, 54, 43, 48, 51, 38, 28, 11)
> fitted(fit.multi)*total
     normal      mild     severe
1 97.289528 0.5758428  0.1346293
2 50.380393 2.3338302  1.2857771
3 36.502267 3.6870373  2.8106955
4 35.929622 6.1840959  5.8862822
5 32.608315 8.6049948  9.7866899
6 20.271918 7.6428348 10.0852473
7 12.078339 6.3332919  9.5883694
8  3.939618 2.6380723  4.4223092
Compare these numbers with the observed frequencies. The comparison is
tabulated at the end of this module.
We now fit the proportional odds model.

ln[Pr(Severity = normal)/(Pr(Severity = mild) + Pr(Severity = severe))] = α1 + β*logexpo

ln[(Pr(Severity = normal) + Pr(Severity = mild))/Pr(Severity = severe)] = α2 + β*logexpo
The log odds of the best condition versus the worst two are a linear function
of the lone covariate.
The log odds of the best two conditions versus the worst one are a linear
function of the lone covariate.
Further, these two linear functions are parallel (same slope).
There are other ways to write the proportional odds model as we shall see
later. All are equivalent.
> fit.prop <- vglm(cbind(normal, mild, severe) ~ logexpo, propodds, data =
pneumo, trace = T)
> summary(fit.prop)
Call:
vglm(formula = cbind(normal, mild, severe) ~ logexpo, family = propodds,
data = pneumo, trace = T)
Pearson Residuals:
                    Min       1Q   Median       3Q    Max
logit(P[Y>=2]) -0.77144 -0.30860 -0.14410 0.071638 1.2479
logit(P[Y>=3]) -0.50476 -0.33528 -0.30926 0.184145 1.0444
Coefficients:
                 Value  Std. Error  t value
(Intercept):1  -9.6761      1.3241  -7.3078
(Intercept):2 -10.5817      1.3454  -7.8649
logexpo         2.5968      0.3811   6.8139
Number of linear predictors: 2
Names of linear predictors: logit(P[Y>=2]), logit(P[Y>=3])
Dispersion Parameter for cumulative family: 1
Residual Deviance: 5.02683 on 13 degrees of freedom
Log-likelihood: -25.09026 on 13 degrees of freedom
Number of Iterations: 4
We have to be very careful in writing the estimated model. Look at the
output carefully.
In the model statement, we see ‘cbind(normal, mild, severe).’ These
responses are labeled in the output as 1, 2, and 3, respectively.
The output says it is providing estimates of the parameters in logit Pr(Y ≥ 2)
and logit Pr(Y ≥ 3). What does this mean? Recall what logit means.
Estimate of logit Pr(Y ≥ 2)
= ln[(Pr(Y = mild) + Pr(Y = severe))/Pr(Y = normal)]
= -9.6761 + 2.5968*logexpo

From this, we get

Pr(Y = mild) + Pr(Y = severe)
= exp(-9.6761 + 2.5968*logexpo)/[1 + exp(-9.6761 + 2.5968*logexpo)]

In addition, estimate of logit Pr(Y ≥ 3)
= ln[Pr(Y = severe)/(Pr(Y = normal) + Pr(Y = mild))]
= -10.5817 + 2.5968*logexpo

From this, we get

Pr(Y = severe)
= exp(-10.5817 + 2.5968*logexpo)/[1 + exp(-10.5817 + 2.5968*logexpo)]
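A hedged sketch that turns these two cumulative expressions into the three
category probabilities and checks them against the fitted values; it assumes
the objects fit.prop and pneumo from above (plogis(u) = exp(u)/(1 + exp(u))).

> lx <- pneumo$logexpo
> p.ge2 <- plogis(-9.6761 + 2.5968*lx)     # Pr(mild) + Pr(severe)
> p.ge3 <- plogis(-10.5817 + 2.5968*lx)    # Pr(severe)
> cbind(normal = 1 - p.ge2, mild = p.ge2 - p.ge3, severe = p.ge3)
# agrees, up to rounding of the coefficients, with fitted(fit.prop) below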
These expressions did not come out as we planned originally. Never mind!
We can look at the predicted probabilities as per this model.
> fitted(fit.prop)
     normal        mild      severe
1 0.9940077 0.003560995 0.002431267
2 0.9336285 0.038433830 0.027937700
3 0.8467004 0.085093969 0.068205607
4 0.7445575 0.133635126 0.121807334
5 0.6358250 0.176154091 0.188020946
6 0.5323177 0.205582603 0.262099709
7 0.4338530 0.220783923 0.345363119
8 0.3636788 0.222017003 0.414304232
We can look at the expected frequencies as per this model.
> total <- c(98, 54, 43, 48, 51, 38, 28, 11)
> fitted(fit.prop)*total
     normal      mild    severe
1 97.412758 0.3489776 0.2382642
2 50.415937 2.0754268 1.5086358
3 36.408118 3.6590407 2.9328411
4 35.738762 6.4144861 5.8467520
5 32.427073 8.9838586 9.5890683
6 20.228072 7.8121389 9.9597889
7 12.147883 6.1819498 9.6701673
8  4.000466 2.4421870 4.5573466
See how close these expected frequencies are to the observed frequencies.
Common sense tells me that the fit must be excellent.
We can reverse the responses in the ‘cbind’ input. We will have the same
model in a different guise.
> fit.prop <- vglm(cbind(severe, mild, normal) ~ logexpo, propodds, data =
+ pneumo, trace = T)
> summary(fit.prop)
Call:
vglm(formula = cbind(severe, mild, normal) ~ logexpo, family = propodds,
data = pneumo, trace = T)
Pearson Residuals:
                    Min        1Q  Median      3Q     Max
logit(P[Y>=2]) -1.0444 -0.184145 0.30926 0.33528 0.50476
logit(P[Y>=3]) -1.2479 -0.071638 0.14410 0.30860 0.77144
Coefficients:
                Value  Std. Error  t value
(Intercept):1 10.5817      1.3454   7.8649
(Intercept):2  9.6761      1.3241   7.3078
logexpo       -2.5968      0.3811  -6.8139
Number of linear predictors: 2
Names of linear predictors: logit(P[Y>=2]), logit(P[Y>=3])
Dispersion Parameter for cumulative family: 1
Residual Deviance: 5.02683 on 13 degrees of freedom
Log-likelihood: -25.09026 on 13 degrees of freedom
Number of Iterations: 4
Interpretation of the output and estimated model:
In the model statement, we see ‘cbind(severe, mild, normal).’ These
responses are labeled in the output as 1, 2, and 3, respectively.
The output says it is providing estimates of the parameter in logit Pr(Y ≥ 2)
and logit Pr(Y ≥ 3). What does it mean?
Estimate of logit Pr(Y ≥ 2)
= ln[(Pr(Y = mild) + Pr(Y = normal))/Pr(Y = severe)]
= 10.5817 - 2.5968*logexpo

From this, we get

Pr(Y = mild) + Pr(Y = normal)
= exp(10.5817 - 2.5968*logexpo)/[1 + exp(10.5817 - 2.5968*logexpo)]

In addition, estimate of logit Pr(Y ≥ 3)
= ln[Pr(Y = normal)/(Pr(Y = mild) + Pr(Y = severe))]
= 9.6761 - 2.5968*logexpo

From this, we get

Pr(Y = normal)
= exp(9.6761 - 2.5968*logexpo)/[1 + exp(9.6761 - 2.5968*logexpo)]
The predicted probabilities and other entities are exactly the same.
Let us now check on goodness-of-fit. The null hypothesis is:
H0: The response probabilities follow the proportional odds model pattern.
Let us calculate the p-value.
> pchisq(5.02638, 13, lower.tail = F)
[1] 0.9746078
The model is a good fit. We do not reject the null hypothesis.
The fitted model using the second run of the command is:

P̂r(Severity = normal) = exp(9.6761 - 2.5968*logexpo)/[1 + exp(9.6761 - 2.5968*logexpo)]

P̂r(Severity = normal) + P̂r(Severity = mild)
= exp(10.5817 - 2.5968*logexpo)/[1 + exp(10.5817 - 2.5968*logexpo)]
A commentary on the models: We have two models which are good fits. We
prefer the tighter model, the one with fewer parameters. Our choice is the
Proportional Odds model. There is another reason for choosing the model.
Every parameter in the model is significant. Why? In the case of the
multinomial logistic regression model, some parameters are not significant.
Additional comments. The Multinomial Logistic Model has 4 parameters, whereas
the Proportional Odds Model (Cumulative Model with parallelism) has 3.
We now tabulate the observed and expected frequencies for each of the
fitted models.
> MB <- data.frame(pneumo, round(fitted(fit.multi)*total, 2),
+ round(fitted(fit.prop)*total, 2))
Exposure   Observed frequencies   Expected (Multinomial)   Expected (Prop. odds)
group       norm. mild sev.        norm.  mild   sev.       norm.  mild  sev.
1             98    0    0         97.29  0.58   0.13       97.41  0.35  0.24
2             51    2    1         50.38  2.33   1.29       50.42  2.08  1.51
3             34    6    3         36.50  3.69   2.81       36.41  3.66  2.93
4             35    5    8         35.93  6.18   5.89       35.73  6.41  5.85
5             32   10    9         32.61  8.60   9.79       32.43  8.98  9.59
6             23    7    8         20.27  7.64  10.09       20.23  7.81  9.96
7             12    6   10         12.08  6.33   9.59       12.15  6.18  9.67
8              4    2    5          3.94  2.64   4.42        4.00  2.44  4.56
Final word. The Multinomial model summarizes the entire data set with four
numbers, whereas the Proportional Odds model does it with three!
Final, final word! Does the covariate have a significant impact on the
response variable? Which model is in a better position to answer this
question? Let us mull over this.
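As a hedged starting point for that discussion: the proportional odds fit has a
single slope for logexpo, so its z-value speaks directly to the covariate's
impact. Using the estimate and standard error from summary(fit.prop) above
(pnorm gives an approximate two-sided p-value):

> z <- 2.5968/0.3811
> 2*pnorm(abs(z), lower.tail = FALSE)
# z is about 6.8, so the p-value is tiny; exposure clearly matters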