Generalized Linear Models

Generalized Linear Models. (For a much more detailed discussion, refer to
Agresti’s text, Categorical Data Analysis, Second Edition, Chapter 4, particularly
pages 115-118 and 125-132.)
Generalized linear models (GLMs) are “a broad class of models that include
ordinary regression and analysis of variance for continuous response variables, as
well as for categorical response variables”. There are three components that are
common to all GLMs:
• Random component
• Systematic component
• Link function
Random Component:
The random component refers to the probability distribution of the response Y.
We observe independent random variables Y1, Y2, . . ., YN. We now look at three
examples of random components.
Example 1. (Y1, Y2, . . ., YN) might be normal. In this case, we would say the
random component is the normal distribution. This component leads to ordinary
regression and analysis of variance models.
Example 2. If the observations are Bernoulli random variables (which take the
values 0 or 1), then we would say the random component is the binomial
distribution. When the random component is the binomial distribution, we are
commonly concerned with logistic regression models or probit models.
Example 3. Quite often the random variables Y1, Y2, . . ., YN have a Poisson
distribution. Then we will be involved with Poisson regression models or loglinear
models.
Systematic Component.
The random variables Yi, i = 1, 2, . . ., N, have expected values µi, i = 1, 2, . . ., N.
The systematic component involves the explanatory variables x1, x2, · · · , xk as a
linear predictor:
β0 + β1x1 + β2x2 + · · · + βkxk.
Link Function.
The third component of a GLM is the link between the random and systematic
components. It specifies how the mean µ = E(Y) relates to the explanatory
variables in the linear predictor through a function g(µ):
g(µ) = β0 + β1x1 + β2x2 + · · · + βkxk.
g(µ) is called the link function. Here are some examples:
Example 1. The logistic regression model says
ln [( x1, x2 , · · · , xk)/1-( x1, x2 , · · · , xk)] = 0 + 1 x1 + 2 x2 + · · · + k xk.
The observations Y1, Y2, . . ., YN have a binomial distribution (the random
component).
Thus, for logistic regression, the link function is ln[µ/(1-µ)] and is called the logit
link.
There are other link functions used when the random component is binomial. For
example, the normit/probit model has the binomial distribution as the random
component and link function
g(µ) = Φ⁻¹(µ), where Φ(x) is the cumulative normal distribution function.
There is also a ‘Gompit’/complementary log-log link (available in Minitab along
with the probit link).
Example 2. For ordinary linear regression, we assume the observations have a
normal distribution (the random component) and the mean is
µ(0 + 1 x1 + 2 x2 + · · · + k xk) = 0 + 1 x1 + 2 x2 + · · · + k xk.
In this case the link function is the identity: g(µ) = µ.
Example 3. If we assume the observations Y1, Y2, . . ., YN have a Poisson
distribution (the random component) and the link function is g(µ) = ln µ, then we
have the Poisson regression model:
ln µ(0 + 1 x1 + 2 x2 + · · · + k xk) = 0 + 1 x1 + 2 x2 + · · · + k xk.
Sometimes the identity link function is used in Poisson regression, so that
µ(0 + 1 x1 + 2 x2 + · · · + k xk) = 0 + 1 x1 + 2 x2 + · · · + k xk.
This model is the same as that used in ordinary regression except that the random
component is the Poisson distribution.
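To make the pairing of random components and links concrete, here is a minimal
sketch in Python's statsmodels (our choice of tool; the notes themselves use
Minitab and SAS) constructing the three standard combinations on made-up data:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
X = sm.add_constant(x)                       # design matrix: intercept + x

# Normal random component, identity link -> ordinary regression
y_norm = 2 + 0.5 * x + rng.normal(size=50)
sm.GLM(y_norm, X, family=sm.families.Gaussian()).fit()

# Binomial random component, logit link -> logistic regression
y_bin = rng.binomial(1, 1 / (1 + np.exp(-(x - 5))))
sm.GLM(y_bin, X, family=sm.families.Binomial()).fit()
# probit link instead (recent statsmodels versions):
# sm.families.Binomial(link=sm.families.links.Probit())

# Poisson random component, log link -> Poisson regression
y_pois = rng.poisson(np.exp(0.2 + 0.1 * x))
sm.GLM(y_pois, X, family=sm.families.Poisson()).fit()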
There are other random components and link functions used in generalized linear
models; for example, in some disciplines the negative binomial distribution has
been used as the random component.
Here is a comparison of the cumulative Normal and Logistic distributions:

   x   CumNormal  Logistic
-3.0   0.001350   0.047426
-2.9   0.001866   0.052154
-2.8   0.002555   0.057324
-2.7   0.003467   0.062973
-2.6   0.004661   0.069138
-2.5   0.006210   0.075858
-2.4   0.008198   0.083173
-2.3   0.010724   0.091123
-2.2   0.013903   0.099750
-2.1   0.017864   0.109097
-2.0   0.022750   0.119203
-1.9   0.028717   0.130108
-1.8   0.035930   0.141851
-1.7   0.044565   0.154465
-1.6   0.054799   0.167982
-1.5   0.066807   0.182426
-1.4   0.080757   0.197816
-1.3   0.096800   0.214165
-1.2   0.115070   0.231475
-1.1   0.135666   0.249740
-1.0   0.158655   0.268941
-0.9   0.184060   0.289050
-0.8   0.211855   0.310026
-0.7   0.241964   0.331812
-0.6   0.274253   0.354344
-0.5   0.308538   0.377541
-0.4   0.344578   0.401312
-0.3   0.382089   0.425557
-0.2   0.420740   0.450166
-0.1   0.460172   0.475021
 0.0   0.500000   0.500000

[Scatterplot of CumNormal vs x and Scatterplot of Logistic vs x: both cumulative
curves rise from near 0 at x = -3 to 0.5 at x = 0.]
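The table above can be reproduced (up to rounding) with a few lines of Python,
assuming numpy and scipy are available; the logistic values are e^x/(1 + e^x) and
the normal values are Φ(x):

import numpy as np
from scipy.stats import norm

x = np.linspace(-3.0, 0.0, 31)             # -3.0, -2.9, ..., 0.0
cum_normal = norm.cdf(x)                   # Φ(x), the cumulative normal
logistic = np.exp(x) / (1 + np.exp(x))     # e^x / (1 + e^x)

for xi, n, l in zip(x, cum_normal, logistic):
    print(f"{xi:5.1f}  {n:.6f}  {l:.6f}")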
Regression Models with Binary Response Variables: Logistic Regression
A common problem is that of estimating the probability of success using a
predictor variable x. Here is an example. Launch temperatures (in degrees
Fahrenheit) and an indicator of O-ring failure for 24 space shuttle launches prior to
the space shuttle Challenger disaster in 1986 are given below:
x (temperature)  Failure     x (temperature)  Failure
53               yes         70               yes
56               yes         70               yes
57               yes         72               no
63               no          73               no
66               no          75               no
67               no          75               yes
67               no          76               no
67               no          76               no
68               no          78               no
69               no          79               no
70               no          80               no
70               yes         81               no
Can we predict the probability of failure using temperature?
Let (x) = Prob(success|x) and 1- (x) = Prob(failure|x). We want a 'model' for
(x). We will set up a 'regression model' for (x). Why not a linear regression
model (x) = 0 + 1x1 ?
Answer:
a. For x large positively and x large negatively (x) = 0 + 1x1 will eventually be
negative and greater than 1, an undesirable feature of a model for probabilities.
b. We are working with Bernoulli trials. The variance of the outcome of a
Bernoulli trial is [(x)(1-(x)] = [0 + 1x1][1 - (0 + 1x1)]. The variance of
an observation depends on x, meaning the assumption of constant variance is
not satisfied.
c. The errors would be either 0-[0 + 1x1] = -0 - 1x1 or 1 - (0 + 1x1)--just two
possible values for a given x--violating assumption of normality.
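As a quick numeric illustration of points a and b (with hypothetical coefficients,
not fitted to any data):

beta0, beta1 = 3.0, -0.04        # hypothetical linear-probability coefficients
for x in (40, 60, 80, 100):
    p = beta0 + beta1 * x        # the 'probability' from the linear model
    var = p * (1 - p)            # Bernoulli variance, which depends on x
    print(x, p, var)             # p > 1 at x = 40; p < 0 at x = 80 and 100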
What should a regression model look like?
1. Since (x) is a probability, its values should be between 0 and 1.
2. For the O-ring problem, we would expect (x) to increase from values near 0 to
values near 1: as temperatures increase the chances of a failure should decrease
or the chances of a 'success' --no O-ring failure--should increase.
Here are some nice-looking 'curves':
[Two plots: π(x) = Normal distribution function on the left and the Logistic
distribution function on the right, each rising in an S-shape from near 0 at
x = -3 to near 1 at x = 3.]
What we are looking at on the left (above) is a normal curve for probabilities. On
the y-axis is π(x) and on the x-axis is x: given a value of x, the probability of a
success is π(x), where π(x) is the normal curve. There are other 'curves' we could
use: the curve on the right looks a lot like the first one (normal), but it is actually
called the 'logistic' curve. There are many other curves we could use, but these are
the two most commonly used ones (by a country mile!). The curves above are in
'standard units'. Φ(x) denotes the cumulative normal curve. For a regression
model we use π(x) = Φ(β0 + β1x). The expression for the logistic curve is much
nicer: F(x) = e^x / (1 + e^x). The corresponding regression model is
π(x) = F(β0 + β1x) = exp(β0 + β1x) / [1 + exp(β0 + β1x)].
If the 'slope' β1 is negative, the curves would bend downward as x increases.
Which curve should be used? Or better yet: which curve(s) are used in practice?
If the normal distribution is used, the model is called the 'probit' (or 'normit')
model, while if the logistic curve is used it is called the 'logistic regression model'.
The logistic model says that
π(x) = F(β0 + β1x) = exp(β0 + β1x) / [1 + exp(β0 + β1x)].
A bit of algebra shows that this model is equivalent to
ln[π(x) / (1 − π(x))] = β0 + β1x.
A correspondingly simple model cannot be obtained for the probit model.
The quantity ln[π(x) / (1 − π(x))] is called the logit of π(x), or the logit transform
of π(x).
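A quick numerical check of this algebra (a sketch with hypothetical coefficients):
applying the logit transform to the fitted probability recovers the linear predictor.

import math

beta0, beta1, x = -2.0, 0.5, 3.0           # hypothetical values
eta = beta0 + beta1 * x                    # linear predictor, -0.5
pi = math.exp(eta) / (1 + math.exp(eta))   # logistic model probability
logit = math.log(pi / (1 - pi))            # logit transform of pi
print(eta, logit)                          # both -0.5, up to rounding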
Logistic Regression Example. We illustrate logistic regression using the
Challenger shuttle data on O-ring failures. The event modeled is an O-ring
failure ('yes'), coded as y = 1 in the output.
Here is the Minitab output, using
Stat>Regression>Binary Logistic Regression.
Binary Logistic Regression
Link Function: Logit

Response Information
Variable  Value  Count
failure   yes        7  (Event)
          no        17
          Total     24

Logistic Regression Table
Predictor      Coef    StDev      Z      P  Odds Ratio  95% CI
Constant     10.875    5.703   1.91  0.057
temp       -0.17132  0.08344  -2.05  0.040        0.84  (0.72, 0.99)
Log-Likelihood = -11.515
Test that slope is zero: G = 5.944, DF = 1, P-Value = 0.015
Fitted Model:
Prob(failure|temp) = e^(10.875 − 0.17132·temp) / [1 + e^(10.875 − 0.17132·temp)].
Fitted probabilities, with y = 1 denoting 'failure', are given below.
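The Minitab fit can be checked with other software; here is a sketch using
Python's statsmodels (our addition, not part of the Minitab session) on the 24
launches, with y = 1 for an O-ring failure. It should reproduce the coefficients
10.875 and -0.17132 up to rounding:

import numpy as np
import statsmodels.api as sm

temp = np.array([53, 56, 57, 63, 66, 67, 67, 67, 68, 69, 70, 70,
                 70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 80, 81])
y = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1,      # 1 = O-ring failure
              1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0])

X = sm.add_constant(temp)
fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()  # logit link by default
print(fit.params)       # approximately (10.875, -0.17132)
print(fit.predict(X))   # fitted Prob(failure | temp), as tabled below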
Row  Temp  y  Prob
  1   53   1  0.857583
  2   56   1  0.782688
  3   57   1  0.752144
  4   63   0  0.520528
  5   66   0  0.393696
  6   67   0  0.353629
  7   67   0  0.353629
  8   67   0  0.353629
  9   68   0  0.315518
 10   69   0  0.279737
 11   70   0  0.246552
 12   70   1  0.246552
 13   70   1  0.246552
 14   70   1  0.246552
 15   72   0  0.188509
 16   73   0  0.163687
 17   75   0  0.121993
 18   75   1  0.121993
 19   76   0  0.104799
 20   76   0  0.104799
 21   78   0  0.076729
 22   79   0  0.065438
 23   80   0  0.055709
 24   81   0  0.047353
Poisson and Ordinary Regression of ‘Number of Arguments on Years Married’
Suppose we wanted to model the number Y of arguments married couples have as
a function of the number of years they have been married. Sixty couples, three
married x years for each x = 1, 2, …, 20, are randomly obtained and asked how
many arguments they had in the past year (they answer honestly). A summary by
year is given below with output on 1) a linear regression, 2) a quadratic
regression, 3) a quadratic regression using the square root of Y, and 4) a Poisson
regression.
Data Display

Summary by year:
yr  ysum    aver
 1    7    2.3333
 2   12    4.0000
 3   14    4.6667
 4   27    9.0000
 5   31   10.3333
 6   38   12.6667
 7   54   18.0000
 8   59   19.6667
 9   61   20.3333
10   73   24.3333
11   69   23.0000
12   81   27.0000
13   69   23.0000
14   81   27.0000
15   57   19.0000
16   47   15.6667
17   38   12.6667
18   31   10.3333
19   26    8.6667
20   14    4.6667

First 20 individual observations (ysq records sqrt(Y); from the fitted values,
SRES1/FITS1 are the standardized residuals and fits of the quadratic regression
of Y, and SRES2/FITS2 those of the quadratic regression of sqrt(Y)):
x  x2   y  ysq   SRES1  FITS1  SRES2  FITS2
1   1   5  2.24   1.79  -2.12   2.51   1.01
1   1   0  0.00   0.53  -2.12  -2.08   1.01
1   1   2  1.41   1.03  -2.12   0.82   1.01
2   4   2  1.41  -0.13   2.52  -0.61   1.72
2   4   6  2.45   0.85   2.52   1.46   1.72
2   4   4  2.00   0.36   2.52   0.56   1.72
3   9   2  1.41  -1.13   6.68  -1.86   2.36
3   9   6  2.45  -0.17   6.68   0.19   2.36
3   9   6  2.45  -0.17   6.68   0.19   2.36
4  16   4  2.00  -1.53  10.37  -1.80   2.92
4  16  10  3.16  -0.09  10.37   0.48   2.92
4  16  13  3.61   0.63  10.37   1.35   2.92
5  25  11  3.32  -0.62  13.59  -0.18   3.41
5  25  11  3.32  -0.62  13.59  -0.18   3.41
5  25   9  3.00  -1.09  13.59  -0.80   3.41
6  36  10  3.16  -1.51  16.32  -1.31   3.83
6  36  15  3.87  -0.32  16.32   0.08   3.83
6  36  13  3.61  -0.79  16.32  -0.44   3.83
7  49  20  4.47   0.34  18.58   0.57   4.18
7  49  18  4.24  -0.14  18.58   0.12   4.18
Simple Linear Regression:
The regression equation is
Y = 11.1 + 0.354 x
Predictor   Coef    SE Coef     T      P
Constant  11.104      2.234  4.97  0.000
x         0.3536     0.1865  1.90  0.063

S = 8.33097   R-Sq = 5.8%   R-Sq(adj) = 4.2%

Analysis of Variance
Source          DF       SS      MS     F      P
Regression       1   249.49  249.49  3.59  0.063
Residual Error  58  4025.49   69.41
Total           59  4274.98
Unusual Observations
Obs    x      Y    Fit  SE Fit  Residual  St Resid
 35  12.0  33.00  15.35    1.11     17.65     2.14R
 38  13.0  36.00  15.70    1.17     20.30     2.46R
 58  20.0   2.00  18.18    2.07    -16.18    -2.00R
R denotes an observation with a large standardized residual.
A 4-in-1 graphical display is given below.
[Residual Plots for Y: Normal Probability Plot of the Residuals, Residuals Versus
the Fitted Values, Histogram of the Residuals, Residuals Versus the Order of the
Data.]
The top right graph (residuals vs. fitted values) shows curvature, suggesting a
squared term be added to the model.
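The linear and quadratic fits can be mirrored in Python's statsmodels (our tool
choice). The raw 60 observations are only partially reproduced in the data display
above, so the sketch below simulates placeholder counts; on the real data the
printed coefficients would be approximately (11.1, 0.354) and (-7.24, 5.36, -0.238):

import numpy as np
import statsmodels.api as sm

x = np.repeat(np.arange(1, 21), 3)      # 3 couples at each of years 1..20
rng = np.random.default_rng(0)          # placeholder y, simulated -- replace
y = rng.poisson(np.exp(0.42 + 0.49 * x - 0.021 * x**2))  # with the 60 counts

lin = sm.OLS(y, sm.add_constant(x)).fit()
quad = sm.OLS(y, sm.add_constant(np.column_stack([x, x**2]))).fit()
print(lin.params)    # linear fit: intercept and slope
print(quad.params)   # quadratic fit: intercept, x, xsq coefficients
print(lin.resid)     # curvature in these residuals motivates the xsq term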
Quadratic Regression Analysis: Y versus x, xsq
The regression equation is
Y = - 7.24 + 5.36 x - 0.238 xsq

Predictor      Coef  SE Coef       T      P
Constant     -7.243    1.831   -3.96  0.000
x            5.3572   0.4015   13.34  0.000
xsq        -0.23827  0.01857  -12.83  0.000

S = 4.26222   R-Sq = 75.8%   R-Sq(adj) = 74.9%
Analysis of Variance
Source          DF      SS      MS      F      P
Regression       2  3239.5  1619.7  89.16  0.000
Residual Error  57  1035.5    18.2
Total           59  4275.0

Source  DF  Seq SS
x        1   249.5
xsq      1  2990.0
Unusual Observations
Obs    x       Y     Fit  SE Fit  Residual  St Resid
 35  12.0  33.000  22.733   0.809    10.267     2.45R
 38  13.0  36.000  22.134   0.782    13.866     3.31R
 42  14.0  30.000  21.058   0.753     8.942     2.13R
R denotes an observation with a large standardized residual.
[Residual Plots for Y (quadratic model): Normal Probability Plot of the Residuals,
Residuals Versus the Fitted Values, Histogram of the Residuals, Residuals Versus
the Order of the Data.]
The top right graph suggests non-constant variance: make a square root
transformation.
Regression Analysis: sqrtY versus x, xsq
The regression equation is
sqrtY = 0.236 + 0.814 x - 0.0358 xsq
Predictor       Coef   SE Coef       T      P
Constant      0.2358    0.2235    1.06  0.296
x            0.81396   0.04901   16.61  0.000
xsq        -0.035782  0.002267  -15.78  0.000

S = 0.520233   R-Sq = 83.0%   R-Sq(adj) = 82.4%
Analysis of Variance
Source          DF      SS      MS       F      P
Regression       2  75.236  37.618  139.00  0.000
Residual Error  57  15.427   0.271
Total           59  90.663

Source  DF  Seq SS
x        1   7.804
xsq      1  67.432
Unusual Observations
Obs    x   sqrtY     Fit  SE Fit  Residual  St Resid
  1   1.0  2.2361  1.0140  0.1829    1.2221     2.51R
  2   1.0  0.0000  1.0140  0.1829   -1.0140    -2.08R
 38  13.0  6.0000  4.7702  0.0954    1.2298     2.40R
R denotes an observation with a large standardized residual.
The plot of residuals vs. fitted values now looks more random, suggesting the
variances are constant.
[Residual Plots for sqrtY: Normal Probability Plot of the Residuals, Residuals
Versus the Fitted Values, Histogram of the Residuals, Residuals Versus the Order
of the Data.]
Poisson Regression Output
Why use Poisson regression? The probability distribution of Y should be given by
the Poisson, from the nature of the phenomenon.
The SAS System
The GENMOD Procedure

Model Information
Data Set             WORK.ARGUMENTS
Distribution         Poisson
Link Function        Log
Dependent Variable   y
Observations Used    60

Criteria For Assessing Goodness Of Fit
Criterion           DF      Value  Value/DF
Deviance            57    54.4054    0.9545
Scaled Deviance     57    54.4054    0.9545
Pearson Chi-Square  57    51.4574    0.9028
Scaled Pearson X2   57    51.4574    0.9028
Log Likelihood          1638.7309

Analysis Of Parameter Estimates
                         Standard   Wald 95%
Parameter  DF  Estimate     Error   Confidence Limits  ChiSquare  Pr > ChiSq
Intercept   1    0.4193    0.1845    0.0576   0.7810        5.16      0.0231
x           1    0.4904    0.0347    0.4223   0.5585      199.33      <.0001
xsq         1   -0.0213    0.0015   -0.0243  -0.0183      194.01      <.0001
Scale       0    1.0000    0.0000    1.0000   1.0000

Fitted values (one observation shown per year):
Obs   x  xsq   y     pred
  1   1    1   5   2.4311
  4   2    4   2   3.7240
  7   3    9   2   5.4662
 10   4   16   4   7.6885
 13   5   25  11  10.3628
 16   6   36  10  13.3840
 19   7   49  20  16.5644
 22   8   64  22  19.6445
 25   9   81  22  22.3247
 28  10  100  27  24.3112
 31  11  121  21  25.3691
 34  12  144  28  25.3678
 37  13  169  17  24.3073
 40  14  196  26  22.3187
 43  15  225  20  19.6372
 46  16  256  11  16.5564
 49  17  289  11  13.3762
 52  18  324  12  10.3556
 55  19  361   8   7.6824
 58  20  400   2   5.4612
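The GENMOD fit is a Poisson random component with log link and linear predictor
β0 + β1x + β2xsq. A sketch of the same model in Python's statsmodels (our
addition; y is again a simulated placeholder for the 60 observed counts):

import numpy as np
import statsmodels.api as sm

x = np.repeat(np.arange(1, 21), 3)      # 3 couples per year married, 1..20
rng = np.random.default_rng(0)          # placeholder y -- replace with the
y = rng.poisson(np.exp(0.42 + 0.49 * x - 0.021 * x**2))  # 60 observed counts

X = sm.add_constant(np.column_stack([x, x**2]))
pois = sm.GLM(y, X, family=sm.families.Poisson()).fit()  # log link by default
print(pois.params)     # approx (0.4193, 0.4904, -0.0213) on the real data
print(pois.deviance)   # compare with the SAS deviance, 54.4054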
[Plot of fitted values (pred) vs. x: the fitted curve rises from about 2.4 at x = 1
to a peak of about 25.4 near x = 11, then falls back to about 5.5 at x = 20.]