Logistic Regression Example with Grouped Data

Logistic Regression
Testing Fit of a Model Using the Hosmer-Lemeshow Test
If our logistic regression model has a single, continuous, explanatory variable, then a test of fit
can be constructed by grouping values of the explanatory variable into several intervals and
using the Pearson chi-square statistic, which under the null hypothesis will have an approximate
chi-square distribution with degrees of freedom equal to the number of groups minus two (the
number of parameters in the model).
However, if we have several explanatory variables, some of which are continuous, a test of fit
becomes somewhat more difficult. If there are, for instance, two continuous explanatory
variables, and we break up the range of values of each of them into several groups, we may find
that there are sparse cells in the table. As the number of explanatory variables increases, the
problem of sparse cells becomes more apparent.
Hosmer and Lemeshow [1] proposed a different approach to grouping, one that does not depend on
the number of explanatory variables. They proposed breaking up, not the ranges of values of
continuous explanatory variables, but the probabilities estimated from the original, ungrouped
data. The data set, of size n, is sorted according to the probabilities estimated from the final
logistic regression model. Then the data set is partitioned into several (Hosmer and Lemeshow
recommend 10) equal-sized groups. The first cell corresponds to the n/10 observations having
the highest estimated probabilities. The next cell corresponds to the n/10 observations having
the next highest estimated probabilities, etc. A Pearson-like statistic is constructed based on the
observed and expected cell frequencies.
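The grouping step described above can be sketched in a few lines of Python. This is only an illustration, separate from the SAS program given later: the function name `partition_by_probability` and the rule for spreading a remainder over the first few cells (when n is not a multiple of g) are assumptions, not part of Hosmer and Lemeshow's description.

```python
def partition_by_probability(probs, g=10):
    """Split observation indices into g nearly equal cells.

    Observations are sorted by estimated probability, highest first,
    so cell 0 holds the observations with the highest probabilities.
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    base, extra = divmod(len(order), g)  # assumed rule: first `extra` cells get one more
    groups, start = [], 0
    for k in range(g):
        size = base + (1 if k < extra else 0)
        groups.append(order[start:start + size])
        start += size
    return groups


# Hypothetical estimated probabilities for seven observations, split into 3 cells.
example_probs = [0.05, 0.9, 0.4, 0.7, 0.2, 0.6, 0.3]
example_groups = partition_by_probability(example_probs, g=3)
print(example_groups)
```

Every observation lands in exactly one cell, and the cell sizes differ by at most one.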
Let $Y$ denote the (binary) response variable. Assume that the final model used $q$ explanatory
variables, and thus has $q + 1$ parameters. Let $Y_{ij}$ be the observed value of the response variable
for the $j$th observation in the $i$th group of the partition, where $i = 1, 2, \ldots, g$ and $j = 1, 2, \ldots, n_i$.
Then $\sum_{j=1}^{n_i} Y_{ij}$ is the observed frequency in the $i$th cell of the partition. Let

$$\hat{\pi}_{ij} = \frac{\exp\left(\hat{\beta}_0 + \sum_{k=1}^{q} \hat{\beta}_k x_{kij}\right)}{1 + \exp\left(\hat{\beta}_0 + \sum_{k=1}^{q} \hat{\beta}_k x_{kij}\right)}$$

denote the estimated success probability for the $j$th observation in the $i$th cell of the partition.
Then $\sum_{j=1}^{n_i} \hat{\pi}_{ij}$ is the expected cell frequency for the $i$th cell of the
partition. The statistic proposed by Hosmer and Lemeshow is

$$HL = \sum_{i=1}^{g} \frac{\left(\sum_{j=1}^{n_i} Y_{ij} - \sum_{j=1}^{n_i} \hat{\pi}_{ij}\right)^2}{\left(\sum_{j=1}^{n_i} \hat{\pi}_{ij}\right)\left(1 - \sum_{j=1}^{n_i} \hat{\pi}_{ij} / n_i\right)}.$$
Hosmer and Lemeshow showed that, when the number of distinct patterns of covariate values
equals the sample size, the null distribution of HL is approximately chi-square with d.f. = g – 2.
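As a rough illustration of the HL statistic, the Python sketch below sums, over the cells of a partition, the squared difference between observed and expected frequency divided by the expected frequency times one minus the average estimated probability in the cell. The function name `hosmer_lemeshow` and the toy numbers are hypothetical; the cells are passed in as lists of observation indices.

```python
def hosmer_lemeshow(y, probs, groups):
    """Compute the HL statistic for 0/1 responses y and estimated probabilities probs.

    groups is a list of cells, each a list of observation indices.
    """
    hl = 0.0
    for cell in groups:
        n_i = len(cell)
        observed = sum(y[j] for j in cell)       # observed cell frequency
        expected = sum(probs[j] for j in cell)   # expected cell frequency
        hl += (observed - expected) ** 2 / (expected * (1.0 - expected / n_i))
    return hl


# Toy example with four observations in two cells (made-up numbers).
toy_y = [1, 0, 1, 0]
toy_probs = [0.8, 0.6, 0.3, 0.1]
toy_groups = [[0, 1], [2, 3]]
print(round(hosmer_lemeshow(toy_y, toy_probs, toy_groups), 4))
```

The arithmetic matches the `numer`, `denom`, and `ratio` variables computed in the last data step of the SAS program.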
The SAS program below (using the flu shots example) calculates the value of the HL statistic,
using g = 10. This program will work to test the fit of any final logistic regression model using
the HL statistic. All program output has been suppressed except for the output of the final PROC
MEANS, which gives the value of the HL statistic.
The test of fit proceeds as follows, for the final model estimated for the Flu Shot data:
Step 1: H0: $\pi_{ij}(\mathbf{x}) = \operatorname{logit}^{-1}(-1.4578 + 0.0779 x_1 - 0.0955 x_2)$
HA: $\pi_{ij}(\mathbf{x}) \ne \operatorname{logit}^{-1}(-1.4578 + 0.0779 x_1 - 0.0955 x_2)$
Step 2: We have n = 159, g = 10, and we choose $\alpha = 0.05$.
Step 3: The test statistic is HL, as given above, and under the null hypothesis, this statistic has
an approximate chi-square distribution with d.f. = g – 2 = 8.
Step 4: We will reject the null hypothesis if HL > $\chi^2_{8,\,0.05} = 15.51$.
Step 5: Using SAS output, we find that HL = 9.6376.
Step 6: We fail to reject the null hypothesis at the 0.05 level of significance. We do not have
sufficient evidence to conclude that the data do not fit the hypothesized fitted logistic regression
model.
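The decision in Steps 4 through 6 can be checked numerically. The Python sketch below uses the closed-form upper tail of the chi-square distribution, which is available when the degrees of freedom are even (here d.f. = 8); the helper name `chi2_sf_even_df` is hypothetical, and only the standard library is assumed.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with even df.

    For df = 2k the tail is exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


hl, df, alpha = 9.6376, 8, 0.05
p_value = chi2_sf_even_df(hl, df)      # about 0.29
reject = p_value < alpha               # False: fail to reject H0
tail_at_critical = chi2_sf_even_df(15.51, df)  # close to 0.05
print(p_value, reject, tail_at_critical)
```

Comparing the p-value with 0.05 gives the same conclusion as comparing HL with the critical value 15.51.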
SAS Program to Perform Hosmer-Lemeshow Test of Fit for an Estimated Logistic
Regression Model, Using Flu Shots Data:
proc format;
value difmt 0 = "No "
1 = "Yes";
value sexfmt 0 = "Female"
1 = "Male ";
;
data one;
input y x1 x2 x3;
label y = "Flu Shot?"
x1 = "Age in Years"
x2 = "Health Awareness Index"
x3 = "Gender";
format y difmt. x3 sexfmt.;
obs = _n_;
cards;
The data set is listed in the appendix.
;
proc logistic data=one noprint;
model y (order=formatted event=LAST) = x1 x2;
output out=parest predicted=probs;
title "Final Multiple Logistic Regression Model";
;
proc sort data=parest;by probs;
;
data two;set parest;
dummy = 1;
;
proc means noprint;
var obs;
output out=sampsize max=size;
;
data temp;set sampsize;
dummy = 1;
;
data three;merge two temp;by dummy;
drop dummy;
group = 10;
if _n_ < 0.9*size then group = 9;
if _n_ < 0.8*size then group = 8;
if _n_ < 0.7*size then group = 7;
if _n_ < 0.6*size then group = 6;
if _n_ < 0.5*size then group = 5;
if _n_ < 0.4*size then group = 4;
if _n_ < 0.3*size then group = 3;
if _n_ < 0.2*size then group = 2;
if _n_ < 0.1*size then group = 1;
;
data grp1;set three;if group = 1;
;
proc means noprint;
var y probs;
output out=grp1mns sum=celln cellprob n=cellsize;
title "Create data set with cell frequencies";
title2 "And expected frequencies, for group 1";
;
data grp2;set three;if group = 2;
;
proc means noprint;
var y probs;
output out=grp2mns sum=celln cellprob n=cellsize;
title "Create data set with cell frequencies";
title2 "And expected frequencies, for group 2";
;
data grp3;set three;if group = 3;
;
proc means noprint;
var y probs;
output out=grp3mns sum=celln cellprob n=cellsize;
title "Create data set with cell frequencies";
title2 "And expected frequencies, for group 3";
;
data grp4;set three;if group = 4;
;
proc means noprint;
var y probs;
output out=grp4mns sum=celln cellprob n=cellsize;
title "Create data set with cell frequencies";
title2 "And expected frequencies, for group 4";
;
data grp5;set three;if group = 5;
;
proc means noprint;
var y probs;
output out=grp5mns sum=celln cellprob n=cellsize;
title "Create data set with cell frequencies";
title2 "And expected frequencies, for group 5";
;
data grp6;set three;if group = 6;
;
proc means noprint;
var y probs;
output out=grp6mns sum=celln cellprob n=cellsize;
title "Create data set with cell frequencies";
title2 "And expected frequencies, for group 6";
;
data grp7;set three;if group = 7;
;
proc means noprint;
var y probs;
output out=grp7mns sum=celln cellprob n=cellsize;
title "Create data set with cell frequencies";
title2 "And expected frequencies, for group 7";
;
data grp8;set three;if group = 8;
;
proc means noprint;
var y probs;
output out=grp8mns sum=celln cellprob n=cellsize;
title "Create data set with cell frequencies";
title2 "And expected frequencies, for group 8";
;
data grp9;set three;if group = 9;
;
proc means noprint;
var y probs;
output out=grp9mns sum=celln cellprob n=cellsize;
title "Create data set with cell frequencies";
title2 "And expected frequencies, for group 9";
;
data grp10;set three;if group = 10;
;
proc means noprint;
var y probs;
output out=grp10mns sum=celln cellprob n=cellsize;
title "Create data set with cell frequencies";
title2 "And expected frequencies, for group 10";
;
data four;set grp1mns grp2mns grp3mns grp4mns grp5mns grp6mns grp7mns grp8mns
grp9mns grp10mns;
numer = (celln - cellprob)**2;
denom = cellprob*(1-(cellprob/cellsize));
ratio = numer/denom;
;
proc means sum;
var ratio;
title "Calculation of Goodness-of-Fit Test Statistic";
title2 "Hosmer and Lemeshow (1980)";
;
run;
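The chain of IF statements in data set three assigns groups by comparing the row number with multiples of 0.1*size. The Python sketch below mimics that rule to show the resulting cell sizes for n = 159; it is an illustration only, and the function name `decile_group` is hypothetical.

```python
def decile_group(row, size):
    """Mimic the SAS IF chain: row is the 1-based row number (_n_)."""
    group = 10
    for k in range(9, 0, -1):            # same order as the SAS statements: 9 down to 1
        if row < (k / 10) * size:
            group = k
    return group


n = 159
cell_sizes = [0] * 10
for row in range(1, n + 1):
    cell_sizes[decile_group(row, n) - 1] += 1
print(cell_sizes)  # [15, 16, 16, 16, 16, 16, 16, 16, 16, 16]
```

Because the data set is sorted by ascending estimated probability before this step, group 1 holds the smallest estimated probabilities. With n = 159 the cells are only approximately equal in size, which is why the program divides by the actual cell size (cellsize) in the denominator.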
Output of HL Program for Flu Shot Model:
Calculation of Goodness-of-Fit Test Statistic
Hosmer and Lemeshow (1980)
The MEANS Procedure
Analysis Variable : ratio
Sum
------------
9.6376501
------------
[1] Hosmer, D. W. and Lemeshow, S. (1980). “A goodness-of-fit test for multiple logistic
regression model,” Communications in Statistics, Series A, 9, 1043-1069.
Appendix: Flu Shot Data
Columns, in the order read by the INPUT statement: y (flu shot: 1 = yes, 0 = no), x1 (age in years),
x2 (health awareness index), x3 (gender: 1 = male, 0 = female).
y  x1  x2  x3
0  59  52  0
0  61  55  1
1  82  51  0
0  51  70  0
0  53  70  0
0  62  49  1
0  51  69  1
0  70  54  1
0  71  65  1
0  55  58  1
0  58  48  0
0  53  58  1
0  72  65  0
0  56  68  0
0  56  83  0
0  81  68  0
0  62  44  0
0  49  70  0
0  56  69  1
0  50  74  0
0  53  57  0
0  56  64  1
0  56  67  1
0  50  83  1
0  52  48  1
0  52  81  0
0  67  53  1
0  51  61  0
0  70  51  0
0  64  51  0
0  61  65  1
0  53  51  0
0  77  54  1
0  73  64  1
0  67  69  0
0  50  71  0
1  80  38  0
1  75  51  0
0  65  54  1
0  60  59  1
1  68  57  1
0  61  63  0
1  62  48  0
0  53  58  0
0  72  56  0
0  54  59  0
1  59  75  0
0  61  48  0
0  50  79  1
0  48  66  0
0  52  57  1
0  54  68  0
0  62  48  0
0  71  60  0
1  65  63  0
0  49  61  0
0  58  57  0
0  62  69  0
0  69  38  1
1  56  50  1
1  76  45  1
0  51  72  0
0  64  51  0
0  57  62  1
0  51  81  0
0  81  55  1
0  50  77  0
0  64  65  1
0  64  53  1
1  59  49  1
0  53  65  0
0  63  58  0
0  59  60  1
1  70  57  1
0  72  37  0
0  68  49  0
0  75  55  1
0  57  60  0
0  67  57  1
0  59  56  1
0  55  58  0
1  75  64  1
0  66  51  0
0  67  59  0
0  59  61  0
0  78  49  1
0  59  49  0
0  68  55  0
1  59  61  1
0  68  50  1
1  78  47  1
0  55  73  1
1  71  45  1
0  51  45  0
0  65  59  0
0  54  61  1
0  79  52  0
0  64  50  0
1  82  46  1
0  64  67  0
0  70  56  1
1  59  50  0
0  59  56  1
0  63  61  1
0  48  74  0
0  61  78  0
0  51  68  0
0  48  71  0
1  71  58  1
0  51  57  0
0  57  51  1
0  49  74  0
0  67  56  1
0  73  57  0
0  73  65  0
0  56  47  0
0  48  69  1
0  50  71  0
0  50  76  1
0  66  60  1
0  53  75  1
0  50  65  1
1  51  42  0
0  68  66  1
1  72  49  1
0  51  58  1
0  62  61  1
0  60  55  0
0  67  60  1
0  70  54  1
0  55  63  1
0  66  56  0
0  65  59  1
0  84  52  1
0  58  63  0
1  68  57  1
0  51  59  1
0  67  53  1
0  52  67  0
0  68  62  0
0  76  63  1
0  54  62  1
0  50  52  1
0  63  58  0
0  77  49  1
0  60  65  1
0  51  55  0
0  51  60  1
0  66  51  1
0  52  67  0
0  66  64  1
0  56  55  1
0  49  58  0
0  67  66  0
0  57  64  1
0  56  66  0
1  76  22  1
1  68  32  0
1  73  56  1