
Lecture Two: A Quick Review/Overview on Generalized Linear Models (GLMs)
Outline
The topics to be covered include:
• Model Definition (Examples, Definition, and Related Topics)
• Parameter Estimation (MLE and WILS Algorithm)
• Model Deviances and Hypothesis Testing Problems (Deviance, Z-test/Wald
Chi-square Test and Deviance Test)
• Residuals and Model Diagnostics (Residuals, Residual Plots, and Measures
of Influence)
1 Some Special Examples of GLMs
Logistic Regression Model
• Consider a data set where the response variable takes only 0 or 1 values
(e.g., Yes/No type, we code Yes = 1 and No = 0) and the single covariate
variable is of (continuous) numerical type.
[Insert Figure 2.1 for an example data]
• If we apply a simple linear regression model
yi = β0 + β1 xi + εi        (1)
to fit the data, there are at least three problems from modeling viewpoint:
◦ Problem 1: If model (1) is correct, we will have
Pr(Yi = 1) = E(Yi ) = β0 + β1 xi .
=⇒ This equation is a mismatch, because the left hand side can only take values in the interval (0, 1) while the right hand side can be any value in (−∞, ∞).
◦ Problem 2: If model (1) is correct, εi = yi − (β0 + β1 xi ). With yi only taking values 0 or 1, the random error εi can only take two values, either −(β0 + β1 xi ) or 1 − (β0 + β1 xi ).
=⇒ The assumption that εi is normally distributed (or close to normally distributed) with mean 0 is no longer valid.
◦ Problem 3: If model (1) is correct, the variance of the random error is
var(εi ) = var(Yi ) = πi (1 − πi ) = ηi (1 − ηi ).
Here, πi = Pr(Yi = 1) and ηi = β0 + β1 xi .
=⇒ This implies that var(εi ) is not a constant, which violates the constant error variance assumption in a simple linear regression model!
Conclusion: It is not appropriate to use the simple linear regression to
model regression data with binary type responses.
• One solution to the mismatch problem is to use the logistic function
g(t) = e^t /(1 + e^t )
to link the linear predictor ηi = β0 + β1 xi to πi , the expected value of the response. Note that g(t) is a monotonic mapping from (−∞, ∞) to (0, 1).
This is a much better way of specifying the mean model for the response
variable.
[Insert Figure 2.2 here]
• To complete the model specification, we need in addition to specify the
distribution of the random errors (or equivalently, the distribution of the
responses) – which should be different from the normal distributions!
• Indeed, the formal definition of the logistic (regression) model for binary
response data with p covariates is as follows:
A set of regression data {(yi , xi1 , . . . , xip ) : i = 1, 2, . . . , n} with binary type
response yi is said to follow a Logistic regression model, if
◦ The responses yi are independently observed for fixed values of covariates (xi1 , xi2 , . . . , xip )1 , and the covariate variables may only influence the
distribution of the response yi through a single linear function
ηi = β0 + β1 xi1 + . . . + βp xip .
(This linear function is called linear predictor)
◦ The mean of the response πi = E(yi ) is linked to the linear predictor ηi by the following equation:
log{πi /(1 − πi )} = ηi    or    πi = e^{ηi} /(1 + e^{ηi} ).
(The function h(s) = log{s/(1 − s)} is called a link function and its inverse function g(t) = e^t /(1 + e^t ) is called an inverse link function.)
◦ The response yi follows a Bernoulli distribution yi ∼ Bernoulli(πi ).
• As in the regular linear regression, β0 is known as the intercept and the βk ’s are the slope parameters. When the kth covariate xk is of numerical type, one can interpret βk as the expected increase in h(E(y)) with a unit increase in xk .
◦ Specifically, in the logistic model, πi /(1 − πi ) is the odds of “success” (i.e., yi = 1) and h(E(yi )) = log{πi /(1 − πi )} is the log-odds. So, in the logistic model, e^{ηi} is the odds of the ith response being a success and e^{βk} is the odds ratio between with and without a unit increase in the kth covariate. Or, in other words, with a unit increase in the kth covariate, we expect the odds of success to change by a factor of e^{βk}.

¹ The covariate variables can be either numerical or categorical and they are considered as fixed; see Lecture 1.
[Insert Figure 2.3 here]
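To make the odds-ratio interpretation concrete, here is a minimal sketch of fitting a single-covariate logistic model on simulated data, assuming numpy and statsmodels are available (the data and variable names below are illustrative, not from the lecture):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=200)                       # a single numerical covariate
eta = -0.5 + 1.2 * x                           # true linear predictor beta0 + beta1*x_i
pi = 1.0 / (1.0 + np.exp(-eta))                # inverse logit link: pi_i = e^eta/(1 + e^eta)
y = rng.binomial(1, pi)                        # binary responses y_i ~ Bernoulli(pi_i)

X = sm.add_constant(x)                         # design matrix with an intercept column
fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()
print(fit.params)                              # estimates of (beta0, beta1)
print(np.exp(fit.params[1]))                   # estimated odds ratio e^{beta1} per unit increase in x
```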
• The above logistic model for binary response data can be slightly modified/extended to cover binomial response data: Suppose the ith trial has
ỹi “successes” (1’s) in ni tries. Consider the “success” rate yi = ỹi /ni . The
first two items in the model definition are exactly the same and the third item is replaced by “The number of success responses in the ith trial, ỹi = ni yi , follows a Binomial distribution ỹi = ni yi ∼ Binomial(ni , πi )”.
[Bring back Figure 2.1 for another example data]
• Since we know the responses yi ’s are from Bernoulli or Binomial distributions, we can write out the likelihood function of observing yi ’s:
L(β|y) = ∏_{i=1}^{n} f (yi ) = ∏_{i=1}^{n} C(ni , ỹi ) πi^{ỹi} (1 − πi )^{ni − ỹi} = . . . = exp[ ∑_{i=1}^{n} ( ni {yi ηi − log(1 + e^{ηi} )} + log C(ni , ỹi ) ) ],
where C(ni , ỹi ) denotes the binomial coefficient “ni choose ỹi ”.
We can maximize this likelihood function to obtain the MLE estimates of
unknown regression parameters β = (β0 , β1 , . . . , βp )T , i.e.,
β̂_MLE = argmax_β ∑_{i=1}^{n} ni {yi ηi − log(1 + e^{ηi} )},    or it solves    ∑_{i=1}^{n} ni (yi − πi ) (1, xi1 , . . . , xip )^T = 0.
Further details of the estimation, along with hypothesis testing and diagnostics, will be discussed later.
Poisson Regression Model
• Sometimes one needs to perform regression analysis on Poisson type responses. The possible values of a Poisson type response are 0, 1, 2, 3, . . . .
◦ Some examples:
the number of insurance claims received in a given day by an office;
the number of customers served in one day by a sales person;
the number of patients treated in an emergency room each Monday;
etc.
[Insert Figure 2.4 here]
◦ As in the binary logistic regression case, the regular linear regression model is not appropriate for Poisson type responses (details omitted).
⇒ We need to use a Poisson regression model which relates the response
of Poisson counts to a certain set of covariate variables.
• We can specify the Poisson (regression) model in the same style as we used
to specify the logistic model:
◦ The responses yi are independently observed for fixed values of covariates (xi1 , xi2 , . . . , xip ), and the covariate variables may only influence the
distribution of the response yi through a single linear function
ηi = β0 + β1 xi1 + . . . + βp xip .
(This linear function is called linear predictor)
◦ The mean of the response µi = E(yi ) is linked to the linear predictor ηi
by the following equation
log(µi ) = ηi
or µi = eηi .
(The link function is h(s) = log(s) and its inverse function is g(t) = e^t .)
◦ The response yi follows a Poisson distribution yi ∼ Poisson(µi ).
• We can interpret the slope parameter βk as the expected increase (or change) in the expected value of the response on the (natural) log scale, with a unit increase (or change) in the kth covariate.
• We can write out the likelihood function of observing yi ’s:
L(β|y) = ∏_{i=1}^{n} f (yi ) = ∏_{i=1}^{n} µi^{yi} e^{−µi} / yi ! = . . . = exp[ ∑_{i=1}^{n} {yi ηi − e^{ηi} − log(yi !)} ].
We can maximize this likelihood function to obtain the MLE estimates of
unknown regression parameters β = (β0 , β1 , . . . , βp )T , i.e.,
β̂_MLE = argmax_β ∑_{i=1}^{n} {yi ηi − e^{ηi} },    or it solves    ∑_{i=1}^{n} (yi − µi ) (1, xi1 , . . . , xip )^T = 0.
Further details of the estimation, along with hypothesis testing and diagnostics, will be discussed later.
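As an illustration of the Poisson estimating equations above, the following sketch fits a Poisson GLM to simulated data and checks that ∑ (yi − µ̂i )(1, xi )^T is numerically zero at the MLE (numpy and statsmodels assumed; all data are simulated for illustration):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(0, 2, size=300)
mu = np.exp(0.3 + 0.8 * x)                     # log link: mu_i = exp(beta0 + beta1*x_i)
y = rng.poisson(mu)                            # Poisson counts

X = sm.add_constant(x)
fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
mu_hat = fit.fittedvalues                      # fitted means exp(eta_hat_i)

# the likelihood estimating equations: sum_i (y_i - mu_hat_i) * (1, x_i)^T = 0
print(X.T @ (y - mu_hat))                      # both entries are numerically close to zero
```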
Regular Linear Regression with Normal Errors
• Regular Gaussian linear regression model
yi = β0 + β1 xi1 + β2 xi2 + . . . + βp xip + εi ,    where εi ∼ N (0, σ² ).
This model assumption is the same as the following two requirements:
◦ E(Yi ) = ηi , where ηi = β0 + β1 xi1 + β2 xi2 + . . . + βp xip ;
◦ εi ∼ N (0, σ² ) (or, equivalently, yi ∼ N (ηi , σ² ), given covariates xi ).
⇒ We can specify the Gaussian linear regression model in the same style
as we used to specify the logistic and Poisson models:
◦ The responses yi are independently observed for fixed values of covariates (xi1 , xi2 , . . . , xip ), and the covariate variables may only influence the
distribution of the response yi through a single linear function
ηi = β0 + β1 xi1 + . . . + βp xip .
(This linear function is called linear predictor)
◦ The mean of the response µi = E(Yi ) is linked to the linear predictor ηi
by the following equation
µi = η i .
(Here, both the link and the inverse link are identity functions: h(s) = s and g(t) = t.)
◦ The response yi follows a normal distribution yi ∼ N (µi , σ 2 ).
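Viewed as a GLM with the identity link and normal responses, this model reproduces ordinary least squares. A small check, assuming statsmodels (simulated data, illustrative names):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.normal(size=100)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=100)

X = sm.add_constant(x)
glm_fit = sm.GLM(y, X, family=sm.families.Gaussian()).fit()   # identity link by default
ols_fit = sm.OLS(y, X).fit()
print(glm_fit.params)                          # the two sets of estimates coincide
print(ols_fit.params)
```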
2 Formal Definition of GLMs
Formal Definition of GLMs
• Generalized linear models (GLMs) extend linear regression models to accommodate both non-normal response distributions and transformations to linearity. A formal definition is provided as follows.
A regression data set containing responses yi and covariates xi is said to
follow a generalized linear model (GLM) if
◦ The responses yi are independently observed for fixed values of covariates (xi1 , xi2 , . . . , xip ), and the covariate variables may only influence the
distribution of the response yi through a single linear function
ηi = β0 + β1 xi1 + . . . + βp xip .
(This linear function is called linear predictor)
◦ The mean of the response µi = E(yi ) is linked to the linear predictor ηi
by a smooth invertible link function
h(µi ) = ηi .
(Here, h(s) is called a link function and its inverse function g(t) = h^{−1}(t) is called the inverse link function.)
◦ The distribution of the response yi has density of the form
f (yi |β, φ) = exp[ (Ai /φ){yi θi − γ(θi )} + τ (yi , φ/Ai ) ],        (2)
where φ is a scale parameter² that could be either known or unknown, Ai is a known constant, and θi = θ(ηi ) is a function of the linear predictor ηi .
• Remark 1: We can prove that the function θ(·) = (γ′ )^{−1}(g(·)). So it is completely determined by the function γ and the link function h.
• Remark 2: GLM is fully determined by the choices of the link function h
and the form of response distribution (i.e., the form of the function γ).
² Some books (for example, McCullagh and Nelder, 1989) also call φ a dispersion parameter.
• Remark 3: It is easy to show that
E(yi ) = µi = γ′ (θi )   and   var(yi ) = (φ/Ai ) γ′′ (θi ).
• Remark 4: We can interpret the slope parameter βk as the expected increase (or change) in h(E(y)) with a unit increase (or change) in the kth covariate.
Verifying Some Special Examples:
• Gaussian/Normal: For the normal distribution, we can write φ = σ² and
f (yi ) = exp[ (1/φ){yi µi − µi²/2 − yi²/2} − (1/2) log(2πφ) ].
So, θi = µi = ηi and γ(θi ) = θi²/2.
• Binomial (logistic): For a binomial distribution, we take φ = 1 and write
f (yi ) = exp[ ni {yi ηi − log(1 + e^{ηi} )} + log C(ni , ni yi ) ].
So, Ai = ni , θi = ηi and the function γ(θi ) = log(1 + e^{θi} ).
• Poisson: For a Poisson model, we take φ = 1 and write
f (yi ) = exp{(yi ηi − e^{ηi} ) − log(yi !)}.
So, θi = ηi , and γ(θi ) = e^{θi} .
• Gamma: The Gamma model is used to model Gamma-distributed responses where the response variance var(yi ) ∝ µi² with µi = E(yi ).
◦ In a Gamma model, the mean µi is often linked to the linear predictor ηi by an inverse link function h(t) = 1/t:
µi = ηi^{−1} .
◦ Also, in the Gamma model, the response yi follows a Gamma distribution with density
f (yi ) = (1/Γ(α)) (α/µi )^α yi^{α−1} e^{−α yi /µi} ,
where E(yi ) = µi and var(yi ) = µi²/α. This density function can be written as
f (yi ) = exp[ α{−yi ηi + log(ηi )} + (α − 1) log(yi ) + α log(α) − log{Γ(α)} ].
So, φ = 1/α, θi = −ηi , and the function γ(θi ) = − log(−θi ).
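Remark 3 can be checked numerically. The sketch below does this for the Poisson case just verified (γ(θ) = e^θ , φ = 1, Ai = 1), comparing Monte Carlo moments of simulated Poisson draws with numerically differentiated γ′(θ) and γ′′(θ); only numpy is assumed and the specific numbers are illustrative:

```python
import numpy as np

theta = 1.3                                    # a fixed canonical parameter value
gamma = np.exp                                 # Poisson case: gamma(theta) = exp(theta)
eps = 1e-5
gamma_p = (gamma(theta + eps) - gamma(theta - eps)) / (2 * eps)                   # ~ gamma'(theta)
gamma_pp = (gamma(theta + eps) - 2 * gamma(theta) + gamma(theta - eps)) / eps**2  # ~ gamma''(theta)

rng = np.random.default_rng(3)
y = rng.poisson(np.exp(theta), size=200_000)   # draws with mean mu = exp(theta)
print(y.mean(), gamma_p)                       # E(y) ~ gamma'(theta)       (Remark 3, phi = A_i = 1)
print(y.var(), gamma_pp)                       # var(y) ~ (phi/A_i) gamma''(theta)
```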
Link Functions and Model Family
• Some commonly used link functions:
◦ Logit: h(t) = log{t/(1 − t)}
◦ Probit: h(t) = Φ^{−1}(t), where Φ(t) = ∫_{−∞}^{t} (2π)^{−1/2} e^{−z²/2} dz
◦ Complementary log-log: h(t) = log{− log(1 − t)}
◦ Identity: h(t) = t
◦ Log-link: h(t) = log(t)
◦ Square root: h(t) = √t
◦ Inverse: h(t) = 1/t
• The combination of the response distribution and the link function is called
the family of GLM.
◦ See Figure 2.5 for the usual combination.
[Insert Figure 2.5]
• Canonical link function For each response distribution, there is a choice of
link function such that θi = ηi is always true in the form (2). Such a choice
of link function is called the canonical link function.
◦ The choice of the canonical link is mathematically convenient and the
minimal sufficient statistic for β can be easily obtained.
◦ The default choices of the link functions are usually the canonical links.
[Insert Figure 2.5 here]
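The link functions listed above are easy to write down directly. The sketch below collects several of them with their inverses and checks that g(h(p)) returns p; numpy and scipy are assumed, and the dictionary layout is purely illustrative:

```python
import numpy as np
from scipy.stats import norm

# each entry is (link h, inverse link g = h^{-1})
links = {
    "logit":   (lambda t: np.log(t / (1 - t)),    lambda t: 1 / (1 + np.exp(-t))),
    "probit":  (norm.ppf,                          norm.cdf),
    "cloglog": (lambda t: np.log(-np.log(1 - t)),  lambda t: 1 - np.exp(-np.exp(t))),
    "log":     (np.log,                            np.exp),
    "sqrt":    (np.sqrt,                           lambda t: t ** 2),
    "inverse": (lambda t: 1 / t,                   lambda t: 1 / t),
}

p = 0.73
for name, (h, g) in links.items():
    print(f"{name:8s} g(h(0.73)) = {g(h(p)):.2f}")   # each inverse undoes its link: prints 0.73
```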
3 Parameter Estimation in GLMs
MLE and Likelihood Estimating Equations
• In GLM, the unknown regression parameters β are estimated by the maximum likelihood estimators β̂ = β̂MLE . That is, β̂ maximizes the (log-)
likelihood function
ℓ(β|y) = ∑_{i=1}^{n} [ (Ai /φ){yi θi − γ(θi )} + τ (yi , φ/Ai ) ].
• Taking derivatives of ℓ(β|y) with respect to β and setting them equal to 0, the estimators β̂ are the solution to the following likelihood estimating equations:
∑_{i=1}^{n} [ (yi − µi ) g′ (ηi ) / var(yi ) ] (1, xi1 , . . . , xip )^T = 0.        (3)
◦ Recall 1: In the logistic regression, we solve the equations
∑_{i=1}^{n} ni (yi − πi ) (1, xi1 , . . . , xip )^T = 0.
(Note: var(yi ) = πi (1 − πi )/ni , πi = E(yi ) = µi , and g′ (ηi ) = πi (1 − πi ).)
◦ Recall 2: In the Poisson regression model, we solve the equations
∑_{i=1}^{n} (yi − µi ) (1, xi1 , . . . , xip )^T = 0.
(Note: var(yi ) = µi and g′ (ηi ) = e^{ηi} = µi .)
WILS algorithm
We use a so-called weighted iterative least squares (WILS) algorithm to solve the estimating equations (3).
• The working response for the ith observation in GLM setting is defined by
zi = ηi + (yi − µi )/g′ (ηi ) = ηi + (yi − µi ) dηi /dµi .
It behaves in many ways like the regular response yi in the regular linear
regression model!
◦ The weight for the ith observation is
wi = [g′ (ηi )]² / var(yi ).
• In terms of zi , the likelihood estimating equations (3) can be re-expressed in a form very similar to the regular LS estimating equations:
∑_{i=1}^{n} wi (zi − ηi ) (1, xi1 , . . . , xip )^T = 0    or, in matrix form,    X^T WXβ = X^T Wz.
Here, X is the regular design matrix (the same as in linear regression), z = (z1 , z2 , . . . , zn )^T and W is a diagonal matrix with diagonal entries equal to w1 , w2 , . . . , wn , respectively.
• The above estimating equations lead us to the WILS algorithm:
◦ Step 0: Find a set of starting values, say β^{(0)} .
◦ Step 1: For m = 0, 1, 2, . . . , compute the current working responses and the weights by replacing the unknown parameters with their current estimates, i.e., z^{(m)} = z|_{β=β^{(m)}} and W^{(m)} = W|_{β=β^{(m)}} . Then, update the unknown parameter estimates by the weighted least squares formula:
β^{(m+1)} = {X^T W^{(m)} X}^{−1} X^T W^{(m)} z^{(m)} .
◦ Step 2: Repeat Step 1 until the updated estimates β^{(m+1)} and the current estimates β^{(m)} are very close (convergence).
(Under very mild conditions, the WILS converges.)
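The following is a minimal numpy sketch of the WILS iteration above, written for the binomial-logistic family (the function name, its interface, and the Bernoulli default are illustrative assumptions, not production code):

```python
import numpy as np

def wils_logistic(X, y, n_trials=None, tol=1e-8, max_iter=50):
    """WILS/IRLS sketch for the binomial-logistic GLM; returns (beta_hat, cov_hat)."""
    n_trials = np.ones(len(y)) if n_trials is None else n_trials   # Bernoulli case: n_i = 1
    beta = np.zeros(X.shape[1])                      # Step 0: starting values
    for _ in range(max_iter):                        # Step 1, repeated until convergence
        eta = X @ beta                               # current linear predictor
        pi = 1 / (1 + np.exp(-eta))                  # inverse logit link
        z = eta + (y - pi) / (pi * (1 - pi))         # working responses z_i
        w = n_trials * pi * (1 - pi)                 # weights w_i = [g'(eta_i)]^2 / var(y_i)
        XtW = X.T * w                                # X^T W  (W diagonal)
        beta_new = np.linalg.solve(XtW @ X, XtW @ z) # (X^T W X)^{-1} X^T W z
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:                                # Step 2: stop when updates stabilize
            break
    cov = np.linalg.inv(XtW @ X)                     # (X^T W X)^{-1}, inverse Fisher information
    return beta, cov
```

For the Bernoulli case the entries of y are 0/1; for grouped binomial data, y would hold the observed success rates ỹi /ni and n_trials the group sizes ni . The returned covariance matrix anticipates the asymptotic result stated next.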
Some Theoretical Results
• Under very mild conditions, when n is large, the distribution of the estimators β̂ is asymptotically a normal distribution N (β, I_n^{−1} ).
◦ Here, I_n = −E[ ∂²ℓ(β|y)/∂β∂β^T ] = . . . = X^T WX is the Fisher information matrix.
• This theorem implies:
◦ The β̂ are consistent estimators (when n is large, they are close to the corresponding true parameter values).
◦ The β̂ are asymptotically efficient estimators, and a way to estimate the variance of the estimators β̂ is via the inverse of the Fisher information matrix.
4 Model Deviance and Hypothesis Testing Problems
Model Deviance
• (Overall statement) Model deviance in GLM setting is an analogous concept
to the residual sum of squares (SSE) in the regular linear regression model.
It is a key quantity that is involved in many problems of hypothesis testing
and model diagnostics.
• (Recall) the linear regression model with normal errors,
yi = µi + εi ,
where µi = β0 + β1 xi1 + . . . + βp xip and εi ∼ N (0, σ² ).
The log-likelihood function in terms of the responses yi ’s and their means µi ’s is
l(µ|y) = −(n/2) log(2πσ² ) − (1/(2σ² )) ∑_{i=1}^{n} (yi − µi )² .
• The saturated model in GLM setting refers to the perfectly fitting (i.e.,
over-fitting) model of using n free parameters to fit n observations. In a
saturated model, the log-likelihood function achieves its maximum achievable value.
◦ For example, in the normal linear regression case, we have µ̃i = ŷi = yi and this leads to l(y|y) = −(n/2) log(2πσ² ), which is an upper bound for the log-likelihood.
• The residual sum of squares SSE = ∑_{i=1}^{n} (yi − ŷi )² is a quantity used to measure the discrepancy (variation) between the fitted model values and the observed data, and its form comes from the squared error loss. We’d like to point out that SSE can also be expressed in terms of twice the discrepancy of the log-likelihoods between the currently fitted model and the saturated model:
SSE = 2σ² {l(y|y) − l(µ̂|y)} = . . . = ∑_{i=1}^{n} (yi − µ̂i )² .
• Model deviance in GLM is defined as twice the discrepancy of the log-likelihoods of the fitted model and the saturated model. In mathematical form, the model deviance is defined as
D_M = 2φ{l(y|y) − l(µ̂|y)} = 2 ∑_{i=1}^{n} Ai [ {yi θ̃i − γ(θ̃i )} − {yi θ̂i − γ(θ̂i )} ],
where θ̃i = (γ′ )^{−1}(µ̃i ) with µ̃i = yi , and θ̂i = (γ′ )^{−1}(µ̂i ) with µ̂i = g(η̂i ) and η̂i = β̂0 + β̂1 xi1 + . . . + β̂p xip .
• When the scale parameter φ in the GLM definition is not 1, we also define the scaled model deviance, which is the scaled version of the model deviance:
D*_M = D_M /φ = 2{l(y|y) − l(µ̂|y)} = 2 ∑_{i=1}^{n} (Ai /φ) [ {yi θ̃i − γ(θ̃i )} − {yi θ̂i − γ(θ̂i )} ].
• In the binomial logistic regression model, the log-likelihood function can be expressed as
l(µ|y) = ∑_{i=1}^{n} [ ni {yi log(µi /(1 − µi )) + log(1 − µi )} + log C(ni , ỹi ) ]
and the scale parameter is φ = 1. So, the model deviances are
D_M = D*_M = 2 ∑_{i=1}^{n} { ni yi log(yi /µ̂i ) + ni (1 − yi ) log((1 − yi )/(1 − µ̂i )) },
where µ̂i = π̂i = e^{η̂i} /(1 + e^{η̂i} ) and η̂i = β̂0 + β̂1 xi1 + . . . + β̂p xip .
• In the Poisson model, the log-likelihood function can be expressed as
l(µ|y) = ∑_{i=1}^{n} {yi log(µi ) − µi − log(yi !)}
and the scale parameter is φ = 1. So, the model deviances are
D_M = D*_M = 2 ∑_{i=1}^{n} {yi log(yi /µ̂i ) − (yi − µ̂i )}.
Here µ̂i = e^{η̂i} and η̂i = β̂0 + β̂1 xi1 + . . . + β̂p xip .
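As a numerical sanity check of the Poisson deviance formula above, the sketch below computes D_M directly from the formula and compares it with the deviance reported by a fitted model (numpy and statsmodels assumed; the data are simulated for illustration):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
x = rng.uniform(0, 2, size=200)
y = rng.poisson(np.exp(0.2 + 0.7 * x))
X = sm.add_constant(x)

fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
mu_hat = fit.fittedvalues

ratio = np.where(y > 0, y / mu_hat, 1.0)        # the y*log(y/mu) term is 0 when y = 0
D = 2 * np.sum(y * np.log(ratio) - (y - mu_hat))
print(D, fit.deviance)                          # the two values agree
```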
• Unfortunately, unlike SSE ∼ σ² χ²_{n−p−1} in the linear regression case, D_M in general does not follow a χ² distribution, even asymptotically! Neither does the generalized Pearson chi-square statistic, which is defined by
X² = ∑_{i=1}^{n} (yi − µ̂i )² / var(µ̂i ).
Note: for the normal distribution, both X² and D_M are the residual sum of squares (SSE), while for the Poisson and Binomial models, X² is the original Pearson chi-square statistic.
• The good news is that the difference of the deviance functions between
two nested models does follow a χ2 distribution when n is large. (This is
analogous to SSE(R) − SSE(F ) ∼ σ 2 χ2p−q .) This result will be used to
test between two nested models below.
Hypothesis Testing
• Z-test for the jth regression parameter βj . In GLM, we may also need to test the hypothesis of whether the jth covariate has a significant contribution to the response variable or not. In mathematical terms, the hypotheses are H0 : βj = 0 versus H1 : βj ≠ 0. Recall that the maximum likelihood estimator β̂ is asymptotically normally distributed. To formally conduct the test, we use the z-statistic Zj = β̂j /se(β̂j ) and compare it to the standard normal distribution.³ When the absolute value |Zj | is large (corresponding to a small p-value), we reject the null hypothesis H0 : βj = 0.
◦ Some software reports Zj² , which is compared to the χ²_1 -distribution. It is known as the Wald chi-square test and it is equivalent to the Z-test.

³ Note: when n → ∞, the t_{n−p} -test in the normal linear regression model is the same as the z-test.
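A minimal sketch of the Z-test/Wald computation, using an estimate vector and covariance matrix such as those returned by the wils_logistic sketch earlier (the numbers below are purely illustrative):

```python
import numpy as np
from scipy.stats import norm

beta_hat = np.array([-0.40, 1.10])              # illustrative estimates (beta0_hat, beta1_hat)
cov = np.array([[0.040, -0.010],
                [-0.010, 0.090]])               # illustrative estimated covariance matrix

se = np.sqrt(np.diag(cov))                      # standard errors se(beta_j_hat)
z = beta_hat / se                               # Z_j = beta_j_hat / se(beta_j_hat)
p_values = 2 * norm.sf(np.abs(z))               # two-sided p-values; Z_j^2 is the Wald chi-square
print(z, p_values)
```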
• Deviance chi-square test for nested models. To answer the question of whether a subset of the regression parameters is significant or not in a GLM, we use the deviance chi-square test for nested models. This test is analogous to the F -test for nested models in the regular linear regressions. Without loss of generality, assume that we want to test whether the last p − q (q < p) parameters are zero or not, that is, H0 : βq+1 = . . . = βp = 0 versus H1 : at least one of these p − q parameters is not equal to 0. Under the null hypothesis H0 , the regression model has q covariates and we call this model the reduced model (R); the model under H1 has p covariates and we call it the full model (F). To formally conduct the test, we use the test statistic given by the difference of the model deviances between these two nested models:
{D_M (R) − D_M (F )}/φ.
When φ is known, this test statistic is compared to the χ²_{p−q} -distribution with p − q degrees of freedom. If this statistic is large (corresponding to a small p-value), we reject the null hypothesis H0 : βq+1 = . . . = βp = 0. When φ is not known, we need to replace φ by its estimator φ̂ and still compare the statistic to the χ²_{p−q} -distribution.⁴,⁵
• Similar to ANOVA tables in linear regression, the results in the aforementioned deviance chi-square test for nested models are summarized in a
table. Such a table is known as the analysis of deviance table.
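A minimal sketch of the deviance chi-square test for nested models described above, using simulated binary data in which the second covariate has no real effect (numpy, scipy, and statsmodels assumed; all names illustrative):

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(5)
n = 300
x1, x2 = rng.normal(size=(2, n))
pi = 1 / (1 + np.exp(-(-0.3 + 0.9 * x1)))       # x2 does not enter the true model
y = rng.binomial(1, pi)

X_full = sm.add_constant(np.column_stack([x1, x2]))       # full model (F): x1 and x2
X_red = sm.add_constant(x1)                                # reduced model (R): x1 only
fit_full = sm.GLM(y, X_full, family=sm.families.Binomial()).fit()
fit_red = sm.GLM(y, X_red, family=sm.families.Binomial()).fit()

stat = fit_red.deviance - fit_full.deviance                # {D_M(R) - D_M(F)}/phi with phi = 1
df = X_full.shape[1] - X_red.shape[1]                      # p - q = 1 here
print(stat, chi2.sf(stat, df))                             # large p-value: no evidence x2 is needed
```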
• Goodness of fit tests. In general, a goodness of fit test for GLMs is not available, except in special cases.
• One commonly used goodness of fit test in the logistic regression model is the Hosmer-Lemeshow goodness of fit test.
◦ The test can be described as follows: we place the n observations into approximately 10 groups (sorted/grouped according to their estimated probabilities of success) and calculate the corresponding generalized Pearson chi-square statistic. To test the model, this Pearson chi-square statistic (after grouping) is compared to a chi-square distribution.
◦ The basis for this Hosmer-Lemeshow test: the asymptotic chi-square distribution holds for D_M (and also X² ) in a binomial model if, as n increases, the number of covariate cells (groups) is fixed and the number of observations in each cell increases.
• A similar grouping technique can be used to construct a goodness of fit test for Poisson models.
⁴ Depending on how the scale parameter φ is defined, sometimes we can use an F test instead of the chi-square test.
⁵ The scale parameter φ can be estimated either by its moment estimators X²/(n − p − 1) or D_M /(n − p − 1), or by its maximum likelihood estimator.
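A rough sketch of the Hosmer-Lemeshow grouping described above (the function name, the decile grouping via array_split, and the g − 2 degrees of freedom are conventional choices assumed here for illustration, not prescriptions from the lecture):

```python
import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(y, pi_hat, n_groups=10):
    """Group observations by fitted probability and compute a Pearson-type statistic."""
    y, pi_hat = np.asarray(y, dtype=float), np.asarray(pi_hat, dtype=float)
    order = np.argsort(pi_hat)                         # sort by estimated success probability
    stat = 0.0
    for idx in np.array_split(order, n_groups):        # roughly equal-sized groups
        obs = y[idx].sum()                             # observed successes in the group
        exp = pi_hat[idx].sum()                        # expected successes in the group
        n_g = len(idx)
        stat += (obs - exp) ** 2 / (exp * (1 - exp / n_g))
    return stat, chi2.sf(stat, df=n_groups - 2)        # conventional reference distribution

# usage sketch:  stat, p_value = hosmer_lemeshow(y, fit.fittedvalues)
```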
5 Model Diagnostics
Residuals
• The most intuitive way to define a residual is to define the response residual
riR = yi − µ̂i ,    where µ̂i = Ê(yi ) = g(η̂i ).
Although it is intuitive, it loses a lot of the nice features associated with residuals in the regular linear regression models. We usually do not use it in model diagnostics.
• Deviance residuals. The nice feature of the deviance residuals is that they are tied to the likelihood function and model deviance, thus the model fitting. The deviance residual for the ith observation is defined as
riD = ai √di ,
where di is the contribution of the ith observation to the model deviance, and ai = 1 if yi > µ̂i , ai = 0 if yi = µ̂i , and ai = −1 if yi < µ̂i . It is easy to see that
D_M = ∑_{i=1}^{n} di = ∑_{i=1}^{n} (riD )².
This equation is comparable to the equation SSE = ∑_{i=1}^{n} ei² in the regular linear regression model.
◦ In the binary logistic regression model, the deviance residual is
riD = √[ 2{log(1 + e^{η̂i} ) − η̂i } ]   if yi = 1;    riD = −√[ 2 log(1 + e^{η̂i} ) ]   if yi = 0;
where η̂i = β̂0 + β̂1 xi1 + . . . + β̂p xip .
◦ In Poisson models, the deviance residual formula is
riD = √[ 2{yi log(yi /µ̂i ) − yi + µ̂i } ]   if yi − µ̂i > 0;    riD = −√[ 2{yi log(yi /µ̂i ) − yi + µ̂i } ]   if yi − µ̂i < 0;
where µ̂i = e^{η̂i} and η̂i = β̂0 + β̂1 xi1 + . . . + β̂p xip .
• Working residuals. Recall the working responses zi defined in the WILS algorithm. We mentioned that they work like the responses in the regular linear regression model. The working residual for the ith observation is defined as
riW = (zi − ηi )|_{β=β̂} = (yi − µ̂i ) (∂ηi /∂µi )|_{β=β̂} = (yi − µ̂i )/g′ (η̂i ).
The working residuals are easily available from the WILS algorithm and they are tied to the generalized Pearson chi-square statistic.
◦ In the logistic regression model, the working residual is
riW = (yi − µ̂i ) / {µ̂i (1 − µ̂i )},
where µ̂i = e^{η̂i} /(1 + e^{η̂i} ) and η̂i = β̂0 + β̂1 xi1 + . . . + β̂p xip .
◦ In Poisson models, the working residual formula is
riW = (yi − µ̂i )/µ̂i ,
where µ̂i = e^{η̂i} and η̂i = β̂0 + β̂1 xi1 + . . . + β̂p xip .
• The Pearson residual is closely related to the Pearson statistic X² :
riP = (yi − µ̂i ) / √var(µ̂i ).
Clearly, X² = ∑_{i=1}^{n} (riP )². Pearson residuals are a rescaled version of the working residuals, when proper account is taken of the associated weights: riP = √wi riW . Here, wi is defined in the WILS algorithm.
◦ In the logistic regression model, the Pearson residual is
riP = (yi − µ̂i ) / √{µ̂i (1 − µ̂i )},
where µ̂i = e^{η̂i} /(1 + e^{η̂i} ) and η̂i = β̂0 + β̂1 xi1 + . . . + β̂p xip .
◦ In Poisson models, the Pearson residual formula is
riP = (yi − µ̂i )/√µ̂i ,
where µ̂i = e^{η̂i} and η̂i = β̂0 + β̂1 xi1 + . . . + β̂p xip .
• The most commonly used residuals in model diagnostics are the deviance residuals, followed by the working residuals and the Pearson residuals. Other versions of residual definitions are also available; for example, the Anscombe residual tries to make the residuals “as close to normal as possible”.
• A version of the deletion residual is also available in the GLM setting, which is related to the Pearson residual. Its form is rather complicated and we thus omit it here.
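The residual formulas above can be computed directly. The sketch below does so for a simulated Poisson fit and checks that the squared deviance residuals sum to the model deviance and that the Pearson residuals match those reported by statsmodels (all names and data are illustrative):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
x = rng.uniform(0, 2, size=150)
y = rng.poisson(np.exp(0.4 + 0.6 * x))
X = sm.add_constant(x)
fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
mu = fit.fittedvalues

ratio = np.where(y > 0, y / mu, 1.0)                  # y*log(y/mu) contributes 0 when y = 0
d = 2 * (y * np.log(ratio) - (y - mu))                # d_i: contribution to the model deviance
r_dev = np.sign(y - mu) * np.sqrt(np.maximum(d, 0))   # deviance residuals a_i * sqrt(d_i)
r_work = (y - mu) / mu                                # working residuals (Poisson: g'(eta) = mu)
r_pear = (y - mu) / np.sqrt(mu)                       # Pearson residuals

print(np.allclose(np.sum(r_dev ** 2), fit.deviance))  # sum of squared deviance residuals = D_M
print(np.allclose(r_pear, fit.resid_pearson))         # matches statsmodels' Pearson residuals
```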
Graphical Methods
In the GLM setting, we can still plot the regular residual plots, where
deviance residuals (or working residuals or Pearson residuals) are plotted
against the jth covariate values xij , or the fitted values of the linear predictor η̂i ’s, or other things. But the interpretation is less clear and patterns
may vary quite differently for different models. In practice, we often turn to
other specialized residual plots and try to detect any systematic departure
from the fitted model. Here we discuss three commonly used residual plots
in GLMs.
• Index plot of deviance residuals plots the deviance residual riD against the index of the observation i, and then uses a straight line to connect the neighboring points. This plot helps to identify outlying residuals (but it does not, by itself, indicate whether the outlying residuals should be treated as outliers).
[Insert Figure 2.6 here]
• Half normal plot in the GLM setting is an extension of Atkinson’s (1985) half normal plot in regular linear regression models. In the plot, the absolute values of the deviance residuals riD are ordered and the kth smallest absolute residual is plotted against z((k + n − 1/8)/(2n + 1/2)). Here, z(α) is the α-percentile of the standard normal distribution. As in the linear regression model, the half normal plot in the GLM setting can also be used to detect influential points.
◦ To identify outlying residuals, we can also use the simulated envelope as in the linear regression models. This envelope constitutes a band such that the plotted residuals are likely to fall within the band if the fitted model is correct. Some details of the simulation in the binary logistic model are as follows.
Step 1. For each of the n cases, generate a new Bernoulli outcome yi = 0 or 1
with the success rate πi equal to the estimated probability of response yi = 1.
Fit a logistic regression to the new set of data of size n, with the covariates
keeping their original values. Order the absolute deletion residuals in ascending
order.
Step 2. Repeat step 1 eighteen times (total 19 times).
Step 3. For each k, k = 1, 2, . . . , n, assemble the kth smallest absolute residuals from the 19 groups and determine the minimum value, the mean, and the
maximum value of these 19 kth smallest absolute residuals.
Step 4. Plot these minimum, mean, and maximum values against z((k + n − 1/8)/(2n + 1/2)) on the half-normal probability plot for the original data and connect the points by straight lines.
[Insert Figure 2.7 here]
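A rough sketch of the envelope simulation in Steps 1–4, for the binary logistic case; for simplicity it uses absolute deviance residuals rather than the deletion residuals mentioned in Step 1, and all function names are illustrative (numpy and statsmodels assumed):

```python
import numpy as np
import statsmodels.api as sm

def abs_dev_resid(y, pi_hat):
    """Absolute deviance residuals for a binary logistic fit (formula given earlier)."""
    d = -2 * np.where(y == 1, np.log(pi_hat), np.log(1 - pi_hat))
    return np.sqrt(d)

def simulated_envelope(X, pi_hat, n_sim=19, seed=None):
    rng = np.random.default_rng(seed)
    sims = []
    for _ in range(n_sim):                                   # Steps 1-2: simulate and refit
        y_new = rng.binomial(1, pi_hat)
        refit = sm.GLM(y_new, X, family=sm.families.Binomial()).fit()
        sims.append(np.sort(abs_dev_resid(y_new, refit.fittedvalues)))
    sims = np.vstack(sims)                                   # shape (n_sim, n)
    # Step 3: min, mean, max of the k-th smallest absolute residuals across simulations
    return sims.min(axis=0), sims.mean(axis=0), sims.max(axis=0)

# Step 4 (plotting): the half-normal abscissas are z((k + n - 1/8)/(2n + 1/2)), e.g.
# scipy.stats.norm.ppf((np.arange(1, n + 1) + n - 0.125) / (2 * n + 0.5)).
```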
• As in the linear regression model, the partial residual plot is useful for identifying the nature of the relationship for an independent variable under consideration, say the kth covariate x̃k = (x1k , x2k , . . . , xnk )^T , for addition to the regression model. The partial residuals for the kth covariate in a GLM are defined as
ri^{[k]} = riW + β̂k xik ,    for i = 1, 2, . . . , n,
where riW is the working residual for the ith observation. The partial residual plot for the kth covariate plots ri^{[k]} against xik . If the response h(E(yi )) is linearly related to the kth covariate xik , the points should scatter more or less around a straight line.
[Insert Figure 2.8 here]
Diagnostics for High Leverage Points
• We can still use the hat matrix to detect high leverage points. But in GLM models, the hat matrix is modified as H = W^{1/2} X(X^T WX)^{−1} X^T W^{1/2} , where W is as defined in the WILS algorithm.
• Let hii be the ith diagonal element of this hat matrix H. Again, we call those observations with very large hii values high leverage points. We use the same heuristic rule: any observation with its corresponding hii > 2(p + 1)/n is considered as a high leverage point.
• Note that for GLMs a point at the extreme of the x-range will not necessarily have high leverage if its weight is very small.
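A small numpy sketch of the modified hat-matrix diagonal and the heuristic flag above, where w holds the WILS weights at convergence (e.g., as computed inside the earlier IRLS sketch: w_i = n_i π̂i (1 − π̂i ) for logistic); the function name is illustrative:

```python
import numpy as np

def glm_leverage(X, w):
    """Diagonal of H = W^(1/2) X (X'WX)^(-1) X' W^(1/2) without forming H explicitly."""
    XtWX_inv = np.linalg.inv(X.T @ (X * w[:, None]))
    # h_ii = w_i * x_i' (X'WX)^{-1} x_i
    return w * np.einsum("ij,jk,ik->i", X, XtWX_inv, X)

# heuristic flag: h_ii > 2(p + 1)/n, with X.shape[1] = p + 1 (intercept column included)
# high_leverage = glm_leverage(X, w) > 2 * X.shape[1] / X.shape[0]
```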
Measures of Influence
Many measures of influence described for linear regression models have analogues that are appropriate for GLM models (especially for binary and Poisson models). These include DFBETAS, DFFITS, and Cook’s distances, among others.
• DFBETAS measures the effect of deleting a single observation (say, the ith observation) on the estimator of a particular regression parameter (say, the kth parameter βk ):
(DFBETAS)k(i) = (β̂k − β̂k(i) ) / (s(i) √ckk ),
where ckk is the kth diagonal entry of (X^T WX)^{−1} .
◦ The interpretation is that if this is large, then the ith observation has an undue influence on the kth parameter estimate.
◦ Heuristic rule: flag any value of |DFBETAS| that is bigger than 1 for a small data set or 2/√n for a large data set.
• Cook’s distance is intended again in GLMs as an overall measure of the influence of the ith observation on all parameter estimates:
Di = (β̂ − β̂(i) )^T X^T WX (β̂ − β̂(i) ) / {(p + 1)φ̂}.
◦ We may still use the same heuristic rule, which suggests identifying the ith observation as influential if Di is greater than the 10% quantile of the F_{p+1, n−(p+1)} distribution, and highly influential if it is greater than the 50% quantile of the F_{p+1, n−(p+1)} distribution.
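A minimal sketch of the GLM Cook's distance formula above; the deleted-observation estimate β̂(i) would be obtained by simply refitting without the ith case, and the F-quantile cut-offs use scipy (all names are illustrative):

```python
import numpy as np
from scipy.stats import f

def cooks_distance_glm(X, w, beta_hat, beta_del, phi_hat=1.0):
    """D_i from the formula above; beta_del is the fit with the i-th observation deleted."""
    XtWX = X.T @ (X * w[:, None])                  # X'WX with the full-data weights
    diff = beta_hat - beta_del
    return float(diff @ XtWX @ diff) / (X.shape[1] * phi_hat)   # X.shape[1] = p + 1

# heuristic cut-offs for flagging observation i:
# influential        if D_i > f.ppf(0.10, dfn=p + 1, dfd=n - p - 1)
# highly influential if D_i > f.ppf(0.50, dfn=p + 1, dfd=n - p - 1)
```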
Other Topics (not covered in detail)
• Transformations of variables on the covariates (e.g. Box-Tidwell transformation)
• Model selection criteria (e.g., Cp , AIC, BIC) and Stepwise Regressions
6 References
McCullagh, P. and Nelder, J.A. (1989). Generalized Linear Models. 2nd
edition. Chapman and Hall, New York.