COUNT DATA ANALYSIS USING POISSON REGRESSION AND HANDLING OF OVERDISPERSION

COUNT DATA ANALYSIS USING POISSON REGRESSION
AND HANDLING OF OVERDISPERSION
RAIHANA BINTI ZAINORDIN
UNIVERSITI TEKNOLOGI MALAYSIA
A dissertation submitted in partial fulfillment of the
requirements for the award of the degree of
Master of Science (Mathematics)
Faculty of Science
Universiti Teknologi Malaysia
NOVEMBER 2009
To Mak and Abah
because I could not have made it without you both.
And to sunflower
because life should be meaningful.
ACKNOWLEDGEMENT
First and foremost, all praise be to Allah, the Almighty, the Benevolent for His
blessings and guidance for giving me the inspiration to embark on this project and
instilling in me the strength to see that this thesis becomes a reality.
I would like to say a big thank you to my amazing supervisor, P.M. Dr. Robiah
binti Adnan for believing in me to finish up this dissertation. Her advice, encouragement
and help mean a lot to me. So I can never thank her enough.
I would also like to thank my friend Ehsan who is never tired of helping and
teaching me things that I do not know.
As always, I am forever grateful to my parents for all their love and sacrifice. To
my sisters (Nordiana, Noraiman Amalina, Nurul Aina), thanks for giving me the reason
to laugh and cry.
A heartfelt thanks goes to all my friends who are always willing to help and
support me. And finally to my best friend, Ikhwan, thank you so much for being there.
ABSTRACT
Count data are very common in various fields such as biomedical science, public health and marketing. Poisson regression is widely used to analyze count data and is also appropriate for analyzing rate data. It belongs to the class of generalized linear models (GLM): it uses the natural log as the link function and models the expected value of the response variable. The natural log in the model ensures that the predicted values of the response variable can never be negative. The response variable in Poisson regression is assumed to follow a Poisson distribution, and one requirement of the Poisson distribution is that the mean equals the variance. In real-life applications, however, count data often exhibit overdispersion, which occurs when the variance is significantly larger than the mean; such data are said to be overdispersed. Overdispersion can cause underestimation of standard errors, which consequently leads to wrong inference, and the results of tests of significance may also be overstated. Overdispersion can be handled using the quasi-likelihood method as well as negative binomial regression. A simulation study was conducted to compare the performance of Poisson regression and negative binomial regression in analyzing data that has no overdispersion as well as data that has overdispersion. The results show that Poisson regression is most appropriate for data with no overdispersion, while negative binomial regression is most appropriate for overdispersed data.
ABSTRAK
Data bilangan adalah sangat lazim dalam pelbagai bidang, contohnya bidang
sains bioperubatan, kesihatan awam dan bidang pemasaran. Regresi Poisson digunakan
secara meluas untuk menganalisis data bilangan. Regresi Poisson juga sesuai untuk
menganalisis data kadaran. Regresi Poisson merupakan sebahagian daripada model kelas
model linear teritlak. Regresi ini menggunakan logaritma asli sebagai fungsi hubungan.
Regresi ini memodelkan nilai jangkaan bagi pembolehubah maklum balas. Logaritma
asli digunakan untuk memastikan supaya nilai ramalan bagi pembolehubah maklumbalas
tidak akan berbentuk negatif. Pembolehubah maklumbalas dalam regresi Poisson
dianggap mengikut taburan Poisson. Salah satu ciri taburan Poisson ialah nilai min
pembolehubah adalah sama dengan nilai varians. Walaubagaimanapun, dalam aplikasi
sebenar, data bilangan sering mempamerkan masalah lebih serakan. Masalah lebih
serakan terjadi apabila nilai varians melebihi nilai min. Apabila ini terjadi, sesebuah data
itu dikatakan terlebih serak. Masalah lebih serakan boleh menyebabkan kurang anggaran
terhadap sisihan piawai yang kemudiannya memberi inferens yang salah. Selain
daripada itu, keputusan ujian signifikan pula akan terlebih anggar. Masalah lebih serakan
boleh diatasi dengan menggunakan kaedah kebolehjadian quasi dan juga regresi
binomial negatif. Kajian simulasi telah dibuat untuk melihat keputusan regresi Poisson
dan regresi binomial negatif dalam menganalisis data yang tidak mempunyai masalah
lebih serakan dan juga data yang mempunyai masalah lebih serakan. Keputusan
menunjukkan bahawa regresi Poisson adalah paling sesuai untuk data yang tidak
mempunyai masalah lebih serakan manakala regresi binomial negatif adalah paling
sesuai untuk data yang mempunyai masalah lebih serakan.
TABLE OF CONTENTS

CHAPTER   TITLE                                                      PAGE

          COVER                                                      i
          DECLARATION                                                ii
          DEDICATION                                                 iii
          ACKNOWLEDGEMENTS                                           iv
          ABSTRACT                                                   v
          ABSTRAK                                                    vi
          TABLE OF CONTENTS                                          vii
          LIST OF TABLES                                             xi
          LIST OF FIGURES                                            xii
          LIST OF SYMBOLS                                            xiv
          LIST OF APPENDICES                                         xv

1         INTRODUCTION                                               1
          1.1   Count Data                                           1
          1.2   Statement of the Problem                             3
          1.3   Objectives of the Study                              4
          1.4   Scope of the Study                                   4
          1.5   Significance of the Study                            4
          1.6   Outline of the Study                                 5
          1.7   Analysis Flow Chart                                  7

2         LITERATURE REVIEW                                          8
          2.1   Generalized Linear Models                            8
                2.1.1   Random Component                             8
                2.1.2   Systematic Component                         9
                2.1.3   Link                                         9
          2.2   Principles of Statistical Modelling                  10
                2.2.1   Exploratory Data Analysis                    10
                2.2.2   Model Fitting                                10
          2.3   Poisson Distribution                                 12
          2.4   Poisson Regression                                   14
          2.5   Problems in Poisson Regression                       16
                2.5.1   Truncation and Censoring                     16
                2.5.2   Excess Zero                                  17
                2.5.3   Overdispersion                               19
          2.6   Alternative Count Models                             20
          2.7   Negative Binomial Regression                         21

3         POISSON REGRESSION ANALYSIS                                23
          3.1   The Model                                            23
          3.2   Estimation of Parameters Using Maximum
                Likelihood Estimation (MLE)                          24
          3.3   Standard Errors for Regression Coefficients          31
          3.4   Interpretation of Coefficients                       31
          3.5   Elasticity                                           34
          3.6   Model Checking Using Pearson Chi-Squares
                and Deviance                                         35
                3.6.1   Pearson Chi-Squares                          36
                3.6.2   Deviance                                     36
          3.7   Model Residuals                                      37
          3.8   Inference                                            38
                3.8.1   Test of Significance                         38
                3.8.2   Confidence Intervals                         39
          3.9   Handling Overdispersion                              39
                3.9.1   Quasi-Likelihood Method                      39
                        3.9.1.1   Estimating the Overdispersion
                                  Parameter                          43
                        3.9.1.2   Testing for Overdispersion         44
                3.9.2   Negative Binomial Regression Analysis        45
          3.10  Example                                              46

4         ANALYSIS OF POISSON REGRESSION USING SAS                   58
          4.1   Introduction                                         58
          4.2   Nursing Home Data                                    58
          4.3   Choosing the Right Model                             65
          4.4   Results and Discussion                               67
          4.5   Negative Binomial Regression                         76
          4.6   Results and Discussion                               77

5         SIMULATION STUDY                                           85
          5.1   Data Simulation                                      85
          5.2   Analysis of Data with No Overdispersion              86
                5.2.1   Results and Discussion                       93
                        5.2.1.1   Goodness-of-fit                    93
                        5.2.1.2   Significance, Confidence
                                  Intervals, and Standard Errors     94
          5.3   Analysis of Data with Overdispersion                 94
                5.3.1   Results and Discussion                       101
                        5.3.1.1   Goodness-of-fit                    101
                        5.3.1.2   Significance, Confidence
                                  Intervals, and Standard Errors     102
          5.4   Conclusions                                          103
          5.5   The Simulation Codes                                 103
                5.5.1   Codes for the simulation of
                        non-overdispersed data                       104
                5.5.2   Codes for the simulation of
                        overdispersed data                           105

6         SUMMARY AND CONCLUSIONS                                    106
          6.1   Summary                                              106
          6.2   Conclusions                                          107
          6.3   Recommendations                                      107

REFERENCES                                                           108
APPENDICES                                                           114
LIST OF TABLES

TABLE NO.   TITLE                                                    PAGE

3.1   Elephant’s mating success regarding age                        47
3.2   Iterative reweighted least squares results                     49
3.3   Residuals for elephant’s mating success data                   53
3.4   Adjusted standard errors                                       57
4.1   Nursing home data                                              59
4.2   Log likelihood and deviance for models M1, M2, and M3          65
4.3   Elasticities of the explanatory variables in nursing home
      data for model M3                                              69
4.4   Pearson residuals and adjusted residuals for nursing home
      data                                                           70
4.5   Comparison among standard errors for ordinary Poisson
      regression and corrected Poisson regression                    75
4.6   Elasticities of the explanatory variables in nursing home
      data for the negative binomial regression model                80
4.7   Residuals for negative binomial regression                     81
5.1   Pearson chi-square and deviance for Poisson regression and
      negative binomial regression obtained from data that has no
      overdispersion                                                 93
5.2   Pearson chi-square and deviance for Poisson regression and
      negative binomial regression obtained from overdispersed
      data                                                           101
LIST OF FIGURES

FIGURE NO.   TITLE                                                   PAGE

2.1    Steps in model fitting                                        11
3.1    Steps in quasi-likelihood approach                            40
3.2    SAS result for analysis of elephant’s mating success data     50
4.1    SAS output for model (M1)                                     62
4.2    SAS output for model (M2)                                     63
4.3    SAS output for model (M3)                                     64
4.4    SAS output for corrected Poisson regression                   74
4.5    SAS output for negative binomial regression                   79
5.1    Poisson regression SAS output of non-overdispersed data
       for µ = 10                                                    87
5.2    Poisson regression SAS output of non-overdispersed data
       for µ = 20                                                    88
5.3    Poisson regression SAS output of non-overdispersed data
       for µ = 50                                                    89
5.4    Negative binomial regression SAS output of
       non-overdispersed data for µ = 10                             90
5.5    Negative binomial regression SAS output of
       non-overdispersed data for µ = 20                             91
5.6    Negative binomial regression SAS output of
       non-overdispersed data for µ = 50                             92
5.7    Poisson regression SAS output of overdispersed data
       for µ = 10                                                    95
5.8    Poisson regression SAS output of overdispersed data
       for µ = 20                                                    96
5.9    Poisson regression SAS output of overdispersed data
       for µ = 50                                                    97
5.10   Negative binomial regression SAS output of overdispersed
       data for µ = 10                                               98
5.11   Negative binomial regression SAS output of overdispersed
       data for µ = 20                                               99
5.12   Negative binomial regression SAS output of overdispersed
       data for µ = 50                                               100
LIST OF SYMBOLS

Y     Response variable
X     Predictor variable
β     Regression coefficient
η     Link function
I     Information matrix
U     Score function
W     Weight matrix
E     Elasticity
X2    Pearson chi-square statistic
D     Deviance statistic
R     Pearson residual
Z     Wald statistic
φ     Dispersion parameter
LIST OF APPENDICES

APPENDIX   TITLE                                                     PAGE

A     SAS Codes for Elephant’s Mating Success Data                   114
B     The Values of µ̂i for Elephant’s Mating Success Data            115
C     SAS Codes for Nursing Home Data                                117
D     SAS Output of Residual Analysis for Poisson Regression
      in Nursing Home Data                                           119
E     The Values of µ̂i for Nursing Home Data                         124
F     SAS Output of Residual Analysis for Negative Binomial
      Regression in Nursing Home Data                                126
G     Simulated Data                                                 131
CHAPTER 1
INTRODUCTION
1.1 Count Data
An event count refers to the number of times an event occurs, for example the number of individuals arriving at a serving station (e.g., a bank teller, gas station or cash register) within a fixed interval, the number of failures of electronic components per unit of time, the number of homicides per year, or the number of patents applied for and received.
received. In many fields such as in social, behavioral and biomedical sciences, as well as
in public health, marketing, education, biological and agricultural sciences and industrial
quality control, the response variable of interest is often measured as a nonnegative
integer or count.
Significant early developments in count models, however, took place in actuarial
science, biostatistics, and demography. In recent years these models have also been used
extensively in economics, political science, and sociology. The special features of data
in their respective fields of application have fueled developments that have enlarged the
scope of these models. An important milestone in the development of count data regression models was the emergence of generalized linear models, of which Poisson regression is a special case.
In another case, an event may be thought of as the realization of a point process
governed by some specified rate of occurrence of the event. The number of events may
be characterized as the total number of such realizations over some unit of time. The
dual of the event count is the inter-arrival time, defined as the length of the period
between events. Count data regression is useful in studying the occurrence rate per unit
of time.
The approach taken to the analysis of count data sometimes depends on how the counts are assumed to arise. Count data can arise in two common ways:
i) Counts arise from direct observation of a point process.
ii) Counts arise from discretization of continuous latent data.
In the first case, examples are the number of telephone calls arriving at a central telephone exchange, the number of monthly absences at a workplace, the number of airline accidents, the number of hospital admissions, and so forth. The data may also consist of inter-arrival times for events.
In the second case, consider the following example. Credit ratings issued by agencies may be stated as AAA, AAB, AA, A, BBB, B, and so forth, where AAA indicates the greatest creditworthiness. Suppose one codes these as y = 0, 1, ..., m. These are pseudocounts that can be analyzed using a count regression, but one may also regard them as ordinal rankings to be modeled with a suitable latent variable model such as ordered probit.
Typically, the characteristic of count data is that the counts occur over some fixed area or observation period and that the things people count are often rare. Count data, even though numeric, can create problems if analyzed using regular linear regression, because most values fall within a limited range and only nonnegative integer values can occur. Count data can therefore follow a highly skewed distribution, one that is cut off at zero. It is thus often unreasonable to assume that the response variable and the resulting errors have a normal distribution, making linear regression a less appropriate option for analysis. A suitable way to deal with count data is to use the Poisson distribution and a log link function in the analysis. The regression model that uses these options is called Poisson regression, or the Poisson log-linear regression model.
The two most popular methods for modeling count data are the Poisson and negative binomial regression models, with Poisson regression being the more widely applied of the two.
1.2 Statement of the Problem
Count data often have variance exceeding the mean. In other words, count data usually show greater variability in the response counts than one would expect if the response distribution truly were Poisson. This violates the Poisson regression assumption that the mean equals the variance (equidispersion). The phenomenon where the variance is greater than the mean is called overdispersion. A statistical test for overdispersion is highly desirable after running a Poisson regression. Ignoring overdispersion in the analysis would lead to underestimation of standard errors and, consequently, overstatement of significance in hypothesis testing. The overdispersion must be accounted for by analysis methods appropriate to the data; Poisson regression is not adequate for analyzing overdispersed data. Therefore, to overcome overdispersion, the quasi-likelihood method will be used as well as negative binomial regression. Negative binomial regression is more adequate for overdispersed data because it allows for overdispersion: its variance is naturally greater than its mean.
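As an informal illustration of such a check (a sketch only, not the formal test applied later in this study, and with made-up counts), one can compare the sample variance of the response counts with the sample mean; a ratio far above 1 points to overdispersion:

```python
from statistics import mean, variance

def dispersion_ratio(counts):
    """Sample variance divided by sample mean; roughly 1 under equidispersion."""
    return variance(counts) / mean(counts)

# Made-up counts: modest variability vs. a few very large counts
equi = [3, 4, 5, 4, 3, 5, 4, 4, 3, 5]
over = [0, 0, 1, 0, 15, 0, 2, 20, 0, 1]

print(dispersion_ratio(equi))  # small ratio: no sign of overdispersion
print(dispersion_ratio(over))  # ratio far above 1: overdispersion suspected
```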
1.3 Objectives of the Study
The objectives of this study are:
i) To study the analysis of Poisson regression.
ii) To illustrate Poisson regression by analyzing count data manually and by using
SAS 9.1.
iii) To demonstrate how to handle overdispersion in Poisson regression using the quasi-likelihood approach as well as the negative binomial regression approach.
iv) To see the performance of Poisson regression and the performance of negative
binomial regression in analyzing data that has no overdispersion as well as data
that has overdispersion from simulation study.
1.4 Scope of the Study
This study will focus on the analysis of Poisson regression. It will also focus on the overdispersion problem that arises when dealing with real-life count data. Overdispersion happens when the variance is greater than the mean, which violates the equidispersion property of the Poisson distribution and thus needs to be taken care of. In connection with the overdispersion problem, the performance of Poisson regression and of negative binomial regression in analyzing data with and without overdispersion will be examined through a simulation study. The analyses in this study include manual analysis and analysis using a statistical package, namely SAS 9.1.
1.5 Significance of the Study
This study will help scientists appreciate the use of Poisson regression in analyzing count data. Besides focusing on parameter estimation, this study also highlights the interpretation of coefficients. It further helps to overcome the overdispersion problem that occurs in Poisson regression, which, if ignored, may cause underestimation of standard errors and consequently give misleading inference about the regression parameters. The study is therefore of clear practical benefit.
1.6 Outline of the Study
This dissertation consists of six chapters.
Chapter 1 gives a general overview of the study. It begins with an explanation of count data, including the characteristics of count data that are important throughout the study. Chapter 1 also explains how the idea for the study came about, as well as the purpose, scope and importance of the study.
Chapter 2 discusses the basic ideas that are important in Poisson regression analysis. This chapter also covers common problems in Poisson regression as well as negative binomial regression, in addition to previous studies done by other researchers.
Poisson regression analysis can be found in Chapter 3. This chapter gives clear descriptions of the formulation of the Poisson regression model, manual computation of maximum likelihood estimates, and how to interpret coefficients in Poisson regression. It also includes other important analyses such as the goodness-of-fit test, residual analysis and inference. In addition, this chapter discusses the methods for handling overdispersion. To illustrate Poisson regression, an example is presented and analyzed manually.
Chapter 4 deals with the analysis of Poisson regression using SAS 9.1. A larger data set is used and more factors are considered. The data are counts in the form of rates and involve overdispersion. SAS codes are provided for convenience.
Chapter 5 presents the simulation study. Data are simulated using the R 2.9.2 software and analyzed using SAS 9.1. The performance of Poisson regression and of negative binomial regression in analyzing data with and without overdispersion is presented in this chapter.
Lastly, the conclusions of the study are discussed in Chapter 6. This chapter
summarizes the whole study. Some recommendations for further research are also made
here.
1.7 Analysis Flow Chart
The analysis proceeds as follows (flow chart):

1. Poisson regression — run the analysis: formulation of the model; estimation of parameters using MLE; interpretation of coefficients; elasticity; model checking; residual analysis; inference.
2. Estimate the overdispersion parameter and test for overdispersion.
3. If overdispersion does not exist, conclude that Poisson regression is adequate.
4. If overdispersion exists, either run the quasi-likelihood method (estimate the dispersion parameter and adjust the standard errors) or run negative binomial regression (the same analysis as in Poisson regression).
CHAPTER 2
LITERATURE REVIEW
2.1 Generalized Linear Models
All generalized linear models (GLM) have three components. The random
component identifies the response variable Y and assumes its probability distribution.
The systematic component specifies the explanatory variables used as predictors in the
model. The link explains the functional relationship between the systematic component
and the expected value of the random component. The GLM relates a function of that
expected value to the explanatory variables through a prediction equation having linear
form.
2.1.1 Random Component
Let Y1, Y2, ..., YN be the observations for a sample of size N and suppose they are independent. The random component of a GLM consists of identifying the response variable Y and selecting a probability distribution for Y1, Y2, ..., YN.
If the observations are binary, a binomial distribution is assumed for the random component. If the observations are nonnegative counts, a Poisson distribution is assumed, whereas if the observations are continuous, a normal distribution is assumed.
2.1.2 Systematic Component

The systematic component of a GLM specifies the explanatory variables which enter linearly as predictors on the right-hand side of the model equation. In other words, the systematic component specifies the variables that play the roles of xj in the formula

β0 + β1x1 + ... + βpxp

This linear combination of the explanatory variables is called the linear predictor or linear component.
2.1.3 Link

The third component of a GLM is the link between the random component and the systematic component. It specifies how µ = E(Y) relates to the explanatory variables in the linear predictor. Either the mean µ or a function of the mean, g(µ), can be modeled. The model formula is stated as

g(µ) = β0 + β1x1 + ... + βpxp = η

where g(µ) = η is called the link function.

The simplest possible link function is g(µ) = η = µ. This link function models the mean directly and is called the identity link. It specifies a linear model for the mean response, that is,

µ = β0 + β1x1 + ... + βpxp = η

This is the form of ordinary regression models for continuous responses.
As an early note, the link function for the Poisson regression model has the form g(µ) = η = ln(µ). The natural log function applies to positive numbers; thus, the natural log link is appropriate when µ cannot be negative. A GLM that uses the natural log link is called a loglinear model. Further details are discussed later.
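A tiny numerical sketch (with hypothetical coefficients b0 and b1) of why the log link suits counts: inverting it with µ = exp(η) keeps the fitted mean positive for any real value of the linear predictor η, while the identity link can yield negative fitted means:

```python
import math

# Hypothetical coefficients for illustration only
b0, b1 = 0.5, -1.2

for x in [0.0, 1.0, 2.0, 3.0]:
    eta = b0 + b1 * x        # linear predictor
    mu_identity = eta        # identity link: mu = eta (can go negative)
    mu_log = math.exp(eta)   # log link: mu = exp(eta) (always positive)
    print(x, mu_identity, mu_log)
```

At x = 3 the identity link gives a negative "mean", which is impossible for a count, while the log link still returns a small positive value.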
2.2 Principles of Statistical Modelling

2.2.1 Exploratory Data Analysis
Any data analysis should begin with a consideration of each variable separately.
This is important in checking the quality of the data and in formulating the model.
Several questions need to be answered when performing the analysis of data.
i) What is the scale of measurement? Is it discrete or continuous? Note that, for
Poisson regression analysis, the scale of measurement is discrete (since the
observations are in the form of counts).
ii) What is the shape of the distribution?
iii) How is it associated with other variables?
2.2.2
Model Fitting
The model fitting process begins with the formulation of the model. It is then
followed by estimation of parameters in the model, model checking, residual analysis
and inference. Figure 2.1 summarizes these steps.
Figure 2.1: Steps in model fitting (Model Formulation → Estimation of the Parameters → Model Checking → Residual Analysis → Inference)
i) Model Formulation
In formulating a model, knowledge of the context in which the data were
obtained, including the substantive questions of interest, theoretical relationships
among the variables, the study design and results of the data analysis can all be used.
Basically, the model has two components:
- the probability distribution of Y, and
- the equation linking the expected value of Y with a linear combination of the explanatory variables.
ii) Estimation of Parameters
Parameters used in the model must be estimated. The most commonly used
estimation methods are maximum likelihood and least squares.
iii) Model Checking
A GLM provides accurate description and inference for a data set only if it fits that data set well. Summary goodness-of-fit statistics can help investigate the adequacy of a GLM fit. For Poisson regression, goodness of fit of the model is tested using the Pearson chi-square and the deviance.
iv) Residual Analysis
Goodness-of-fit statistics only broadly summarize how well models fit data.
Further insight can be obtained by using residuals, that is, by comparing observed
and fitted counts individually.
v) Inference
Statistical inference involves calculating confidence intervals, testing hypotheses about the parameters in the model, and interpreting the results.
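The Pearson chi-square and deviance statistics mentioned under model checking have standard closed forms for a Poisson model, X2 = Σ (yi − µ̂i)²/µ̂i and D = 2 Σ [yi ln(yi/µ̂i) − (yi − µ̂i)]; the thesis derives them formally in Chapter 3, so the following is only a sketch with made-up observed counts and fitted means:

```python
import math

def pearson_chi2(y, mu):
    """Pearson chi-square statistic for a Poisson fit."""
    return sum((yi - mi) ** 2 / mi for yi, mi in zip(y, mu))

def deviance(y, mu):
    """Poisson deviance; uses the convention 0 * log(0) = 0."""
    total = 0.0
    for yi, mi in zip(y, mu):
        term = yi * math.log(yi / mi) if yi > 0 else 0.0
        total += 2.0 * (term - (yi - mi))
    return total

# Hypothetical observed counts and fitted means, for illustration only
y  = [2, 0, 5, 3, 1]
mu = [1.8, 0.6, 4.2, 3.1, 1.3]
print(pearson_chi2(y, mu), deviance(y, mu))
```

Both statistics are compared to a chi-square reference distribution when assessing fit; values far above the residual degrees of freedom suggest a poor fit.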
2.3 Poisson Distribution

A random variable Y is said to have a Poisson distribution with parameter µ if it takes integer values y = 0, 1, 2, ... with probability

P(Y = y) = e^(−µ) µ^y / y!,   for µ > 0
The Poisson distribution can be derived as a limiting form of the binomial distribution by considering the distribution of the number of successes in a very large number of Bernoulli trials, each with a small probability of success. Specifically, if Y ~ B(n, π), then as n → ∞ and π → 0 with µ = nπ held fixed, the distribution of Y approaches a Poisson distribution with mean µ. Thus, the Poisson distribution provides an approximation to the binomial for the analysis of rare events, where π is small and n is large.
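This limiting behaviour is easy to verify numerically; the sketch below (requires Python 3.8+ for math.comb) holds µ = nπ fixed and watches the binomial probabilities approach the Poisson ones as n grows:

```python
import math

def binom_pmf(y, n, p):
    """Binomial probability of y successes in n trials."""
    return math.comb(n, y) * p**y * (1 - p) ** (n - y)

def poisson_pmf(y, mu):
    """Poisson probability P(Y = y) = exp(-mu) * mu^y / y!."""
    return math.exp(-mu) * mu**y / math.factorial(y)

mu = 2.0
for n in [10, 100, 10000]:
    p = mu / n  # keep n * pi = mu fixed as n grows
    diff = max(abs(binom_pmf(y, n, p) - poisson_pmf(y, mu)) for y in range(10))
    print(n, diff)  # the largest pmf discrepancy shrinks as n grows
```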
The Poisson distribution is often used to model the occurrence of rare events,
such as the number of traffic accidents in a month, the number of nesting attempts or
offspring in a breeding season, and the number of automobile accidents occurring at a
certain location per year.
The Poisson distribution has several characteristic features:
i) The variance is equal to the mean, E (Y ) = Var (Y ) = µ . This property is called
equidispersion.
ii) The distribution tends to be skewed to the right.
iii) Poisson distribution with a large mean is often well-approximated by a normal
distribution.
A useful property of the Poisson distribution is that the sum of independent Poisson random variables is also Poisson. Specifically, if Y1 and Y2 are independent with Yi ~ P(µi) for i = 1, 2, then

Y1 + Y2 ~ P(µ1 + µ2)

This result also extends to the sum of more than two independent Poisson random variables.
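This closure property can be checked directly in a small numerical sketch: convolving two Poisson pmfs reproduces the Poisson pmf with the summed mean (the illustrative means 1.5 and 2.5 are arbitrary):

```python
import math

def poisson_pmf(y, mu):
    """Poisson probability P(Y = y) = exp(-mu) * mu^y / y!."""
    return math.exp(-mu) * mu**y / math.factorial(y)

mu1, mu2 = 1.5, 2.5
for k in range(8):
    # P(Y1 + Y2 = k) by convolution of the two independent pmfs
    conv = sum(poisson_pmf(j, mu1) * poisson_pmf(k - j, mu2) for j in range(k + 1))
    direct = poisson_pmf(k, mu1 + mu2)
    assert abs(conv - direct) < 1e-10  # agrees with Poisson(mu1 + mu2)
```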
2.4 Poisson Regression
The Poisson regression model is a member of the class of generalized linear models (GLM), an extension of traditional linear models that allows the mean of a population to depend on a linear predictor through a nonlinear link function and allows the response probability distribution to be any member of the exponential family of distributions (McCullagh and Nelder, 1989). The use of Poisson regression is widespread and the study of this type of regression is ongoing. It has been an aid in many research areas such as economics, epidemiology, sociology and medicine.
Poisson regression is useful when the outcome is a count, with large-count outcomes being rare events (Kutner, Nachtsheim, and Neter, 2004). It is the most widely used regression model for multivariate count data. Counts are integers and can never be negative; their distribution is discrete, not continuous, and tends to be positively skewed. Ordinary least squares regression uses the normal distribution as its probability model, and hence is fundamentally not a good fit for discrete data, because the normal distribution is symmetric and extends from negative to positive infinity (Atkins and Gallop, 2007). The Poisson distribution is a much better fit for count data: it characterizes the probability of observing any discrete number of events (Osgood, 2000). When the mean count is low, the Poisson distribution is skewed. As the mean count grows, however, the Poisson distribution increasingly approximates the normal.
Poisson regression uses the Poisson distribution as its probability model, and is therefore one of the alternatives that can be used for analyzing count data. Poisson regression shares many similarities with ordinary least squares regression, except that it assumes the response variable follows a Poisson distribution instead of a normal distribution, and it models the natural log of the expected response as a linear function of the coefficients.
One of the main assumptions of linear models is that the residual errors follow a normal distribution. To meet this assumption when a continuous response variable is skewed, a transformation of the response variable can produce errors that are approximately normal. For a discrete response variable, however, a simple transformation will not produce normally distributed errors. This is why Poisson regression uses the natural log as its link function rather than applying a log transformation to each response. In addition, the natural log in the Poisson regression model guarantees that the predictions from the model will never be negative, which is appropriate for count data (Atkins and Gallop, 2007).
Another general and popular application of Poisson regression involves modeling rates for different subgroups of interest (Kleinbaum et al., 1998). A rate is thought of as the number of counts observed within a specified time; often, time is constant for all observations. A rate can also be defined as the number of events divided by the total person-years of experience.
The Poisson regression model has been used as a tool for resolving common problems in analyzing aggregate crime rates (Osgood, 2000). It has also been used to analyze bird monitoring data (Strien et al., 2000). In addition, Poisson regression has been used by family researchers to model marital commitment (Atkins and Gallop, 2007).
There are many journals and papers that present Poisson regression analysis and its applications. Some researchers have modified Poisson regression or introduced better models for analyzing count data. For example, Tsou (2006) demonstrated that the Poisson regression model could be adjusted to become asymptotically valid for inference about regression parameters even if the Poisson assumption fails, and Zou (2004) introduced Poisson regression with a robust error variance to estimate relative risk. This study will focus only on general Poisson regression analysis.
2.5 Problems in Poisson Regression
Poisson regression is a powerful analysis tool but as with all statistical methods,
it can be used inappropriately if its limitations are not fully understood. There are three
problems that might exist in Poisson regression analysis.
i) The data might be truncated or censored.
ii) The data might contain excess zeroes.
iii) The mean and variance are not equal, as required by Poisson distribution. This
problem is called overdispersion.
2.5.1 Truncation and Censoring
Truncation of data can occur in the routine collection of data. In survey data, for instance, such as surveys of users of recreational facilities, respondents who report zero are sometimes discarded from the sample. This situation produces truncated data.
Consider another example. If the data are the number of times per week an in-vehicle navigation system is used on the morning commute to work during weekdays, the data are right-truncated at 5, which is the maximum number of uses in any given week. Estimating a Poisson regression model without accounting for this truncation will result in biased parameter estimates, and erroneous inferences will be drawn.
Fortunately, the Poisson model adapts easily to account for this problem. The right-truncated Poisson model is written as

P(Yi) = (µi^Yi / Yi!) / Σ[mi = 0 to r] (µi^mi / mi!)
where P(Yi ) is the probability of commuter i using the system Yi times per week, µ i is
the Poisson parameter for commuter i , mi is the number of uses per week, and r is the
right truncation (in this example, 5 times per week).
In other settings, respondents are sometimes given a limit category (say, "40 or more") for some large value. Data from such a survey are said to be censored. Censoring, like truncation, leads to inconsistent parameter estimates.
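The right-truncated pmf given above can be sketched directly in code (illustrative values only; r = 5 as in the commuting example, and the pmf is simply renormalized over the truncated support):

```python
import math

def poisson_pmf(y, mu):
    """Ordinary Poisson probability P(Y = y)."""
    return math.exp(-mu) * mu**y / math.factorial(y)

def truncated_poisson_pmf(y, mu, r):
    """Right-truncated Poisson: renormalize the Poisson pmf over y = 0..r."""
    if y < 0 or y > r:
        return 0.0
    norm = sum(poisson_pmf(m, mu) for m in range(r + 1))
    return poisson_pmf(y, mu) / norm

mu, r = 2.3, 5  # hypothetical mean; truncation at 5 uses per week
probs = [truncated_poisson_pmf(y, mu, r) for y in range(r + 1)]
print(sum(probs))  # the truncated pmf sums to 1 over y = 0..r
```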
2.5.2 Excess Zero
There are certain phenomena where an observation of zero events during the
observation period can arise from two qualitatively different conditions, that is, the
condition may result from:
i) Failing to observe an event during the observation period or,
ii) An inability to ever experience an event. For example, consider the number of crimes ever committed by each person in a community. In this case, most people are never involved in crime, so there will be many more zero counts in the data than a Poisson model would predict.
Consider an example where a transportation survey asks how many times a
person has taken mass transit to work during the past week. An observed zero could
arise in two distinct ways. First, last week the person may have opted to take the vanpool instead of mass transit. Alternatively, the person may never take transit because of other commitments on the way to and from the workplace. Thus two states are present, one being a normal-count state and the other a zero-count state. The normal-count state arises when event occurrence is inevitable and follows some known count process, whereas the zero-count state refers to situations in which the likelihood of an event occurring is extremely rare. Two aspects of this distinction between the two states are noteworthy. Data obtained from normal-count and zero-count states often suffer from
overdispersion if considered as part of single, normal-count state because the number of
zeroes is inflated by the zero-count state. It is common not to know if the observation is
in the zero state – so the statistical analysis process must uncover the separation of the
two states as part of the model estimation process. Models that account for this dual-state system are referred to as zero-inflated models. To address phenomena with zero-inflated counting processes, the zero-inflated Poisson (ZIP) and zero-inflated negative binomial (ZINB) regression models have been developed.
The ZIP regression model assumes that the events Y1 , Y2 ,..., YN are independent.
The model is given by

$$P(Y_i = 0) = p_i + (1 - p_i)e^{-\mu_i}$$

$$P(Y_i = Y) = \frac{(1 - p_i)e^{-\mu_i}\mu_i^{Y}}{Y!}$$
where Y is the number of events per period. Maximum likelihood estimation (MLE) is
used to estimate the coefficients of a ZIP regression model.
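The two-part probability above can be sketched directly. The function below is an illustrative helper (name hypothetical); `p` is the zero-state mixing probability and `mu` the Poisson mean of the normal-count state:

```python
import math

def zip_pmf(y, p, mu):
    """P(Y = y) under a zero-inflated Poisson, as in the text's formulas."""
    poisson = math.exp(-mu) * mu ** y / math.factorial(y)
    if y == 0:
        # Zero can come from the zero state or from the Poisson count state.
        return p + (1 - p) * poisson
    return (1 - p) * poisson

# Zero inflation: P(0) exceeds the plain Poisson probability e^{-mu}.
print(zip_pmf(0, p=0.3, mu=2.0) > math.exp(-2.0))  # True
```

The values `p=0.3` and `mu=2.0` are made up for illustration; the probabilities still sum to one over all counts.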
The ZINB regression model follows a similar formulation with events
$Y_1, Y_2, \ldots, Y_N$ being independent. The model is given by

$$P(Y_i = 0) = p_i + (1 - p_i)\left(\frac{1/\alpha}{(1/\alpha) + \mu_i}\right)^{1/\alpha}$$

$$P(Y_i = Y) = (1 - p_i)\,\frac{\Gamma\left((1/\alpha) + Y\right) u_i^{1/\alpha} (1 - u_i)^{Y}}{\Gamma\left(1/\alpha\right) Y!}, \qquad Y = 1, 2, 3, \ldots$$

where $u_i = (1/\alpha)\big/[(1/\alpha) + \mu_i]$. MLE is also used to estimate the coefficients of the ZINB regression model.
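A numerical sketch of the ZINB probabilities, assuming the parameterisation above (the function name and parameter values are illustrative, not from the text):

```python
import math

def zinb_pmf(y, p, mu, alpha):
    """P(Y = y) under a zero-inflated negative binomial (sketch).

    p: zero-state probability, mu: count-state mean, alpha: dispersion.
    Uses u = (1/alpha) / ((1/alpha) + mu) as defined in the text.
    """
    r = 1.0 / alpha
    u = r / (r + mu)
    # Negative binomial pmf computed via log-gamma for numerical stability.
    log_nb = (math.lgamma(r + y) - math.lgamma(r) - math.lgamma(y + 1)
              + r * math.log(u) + y * math.log(1 - u))
    nb = math.exp(log_nb)
    return p + (1 - p) * nb if y == 0 else (1 - p) * nb

# The probabilities sum to one (checked over a generous range of counts):
s = sum(zinb_pmf(y, p=0.25, mu=3.0, alpha=0.5) for y in range(200))
print(round(s, 6))  # 1.0
```

The gamma-function ratio is evaluated on the log scale, which is the usual design choice when `y` can be large.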
Zero-inflated models imply that the underlying data-generating process has a
splitting regime that provides for two types of zero states. The splitting process can be
assumed to follow a logit (logistic) or probit (normal) probability process, or some other probability process. Note that there must be underlying justification to believe that the splitting process exists (resulting in two distinct states) prior to fitting a zero-inflated model. There should be a basis for believing that part of the process is in the zero-count state.
2.5.3 Overdispersion
The basic Poisson regression model is appropriate only if the probability
distribution matches the data (Osgood, 2000). A major assumption underlying the use of
log-linear analysis for Poisson distributed data is that the variance of the error
distribution is completely determined by the mean. In practice, the assumption that the
variance is equal to the mean in the Poisson distribution is unlikely to be valid. This is because one important characteristic of counts is that the variance tends to increase with the
mean. The variance is regularly found to be greater than the mean in count data. This
problem is called overdispersion.
Overdispersion (also called extra-Poisson variation) occurs if Var (Y ) > µ . If
$Var(Y) < \mu$, the problem is called underdispersion. Underdispersion seldom occurs in practice. This study will only focus on the overdispersion problem.
Overdispersion may be due to:
i) Unobserved heterogeneity.
ii) The process generating the first event may differ from the one determining
the later events.
iii) Failure of the assumption of independence of events which is implicit in the
Poisson process.
The first indication that something is wrong is that the deviance measure of goodness-of-fit for the full model exceeds its degrees of freedom. This is an early sign of overdispersion. The following four aspects need to be considered in an overdispersion problem:
i) Formal test for overdispersion.
ii) Standard errors for regression coefficients that account for overdispersion.
iii) Test statistic for added variables that account for overdispersion.
iv) More general models with parameters in the variance function.
If the standard Poisson model is applied to overdispersed data, the efficiency of parameter estimates remains reasonably high, yet their standard errors are underestimated. Hence, coverage probabilities of confidence intervals and significance levels of tests are no longer valid and can result in highly misleading outcomes (Heinzl and Mittlbock, 2003). In other words, overdispersion will cause underestimation of standard errors, which consequently gives wrong inferences in the analysis (Ismail and Jemain, 2005). Therefore, overdispersion must be handled.
The simplest way to allow for the possibility of overdispersion is the quasi-likelihood approach, which retains coefficient estimates from the basic Poisson model
but adjusts standard error and significance tests based on the amount of overdispersion
(Osgood, 2000). Another approach is to use negative binomial regression. Negative
binomial model is a count model which allows for overdispersion. Quasi-likelihood
method and negative binomial regression will be discussed further later.
2.6 Alternative Count Models
A common more general model for analyzing count data is the negative binomial
model. This model can be used if data are overdispersed. It is then more efficient than
Poisson, but in practice the efficiency benefits over Poisson are small. The negative
binomial should be used, however, if one wishes to predict probabilities and not just
model the mean. The negative binomial cannot be estimated if data are underdispersed.
Another more common general model is the hurdle model. This treats the process
for zeroes differently from that for the nonzero counts. In this case the mean of $Y_i$ is no longer $e^{(X_i'\beta)}$, so the Poisson estimator is inconsistent and the hurdle model should be used. This model can handle both overdispersion and underdispersion.
2.7 Negative Binomial Regression
A well-known limitation of the Poisson distribution is that the variance and mean must be equal. For count data, however, overdispersion frequently occurs, that is, the variance exceeds the mean. To relax the equidispersion constraint imposed by the Poisson model, a negative binomial distribution is commonly used (Lee, Nam, and Park, 2005).
The negative binomial distribution is one of the most widely used distributions
when modeling count data that exhibit variation that Poisson distribution cannot explain
(Dossou-Gbete and Mizere, 2006). One important characteristic of the negative binomial distribution is that it naturally accounts for overdispersion, because its variance is always greater than its mean.
Negative binomial regression is typically used when there are signs of
overdispersion in Poisson regression. Negative binomial regression uses a different
probability model which allows for more variability in the data. Basically, it combines
the Poisson distribution of event counts with a gamma distribution.
The negative binomial model is simply a Poisson regression that estimates the dispersion parameter, allowing for independent specification of the mean and variance.
Because the only difference between the Poisson and negative binomial regression lies
in their variances, regression coefficients tend to be similar across the two models, but
standard errors can be very different. When the outcome variable is overdispersed
relative to the Poisson distribution, standard errors from the negative binomial model
will be larger but more appropriate. Thus, p-values in Poisson regression are artificially
low, and confidence intervals are too narrow in the presence of overdispersion.
CHAPTER 3
POISSON REGRESSION ANALYSIS
3.1 The Model
The general method of fitting a Poisson regression model is still to use the
Poisson model formulation to derive a likelihood function that can be maximized so that
parameter estimates, estimated standard errors, maximized likelihood statistics, and
other information can be produced. Poisson regression analysis goal is to fit the data to a
regression equation that will accurately model E (Y ) (or µ ) as a function of a set of
explanatory variables X 1 , X 2 ,..., X p .
For a Poisson regression analysis, let Y be the response variable for count data
and let X be the explanatory variables. Y must follow Poisson distribution with
parameter $\mu$. The Poisson regression model can be written as

$$\ln \mu = \beta_0 + \beta_1 X, \qquad Y \sim P(\mu) \qquad (3.1)$$

or equivalently,

$$\mu = e^{(\beta_0 + \beta_1 X)} \qquad (3.2)$$

This is the model for analyzing normal count data.
Sometimes, the response may be in the form of events of certain type that occur
over time, space, or some other index of size. In this situation, it is often relevant to
model the data as the rate at which events occur. When a response count $Y$ has index (such as population size) equal to $t$, the sample rate of outcomes is $Y/t$. The expected value of the rate is $\mu/t$. Thus, for analyzing rate data, the model can be written as

$$\ln\left(\frac{\mu}{t}\right) = \beta_0 + \beta_1 X \qquad (3.3)$$

This model has the equivalent representation

$$\ln \mu - \ln t = \beta_0 + \beta_1 X \qquad (3.4)$$

The adjustment term, $-\ln t$, on the left-hand side of the equation is called an offset.
Note that, a simple linear model of the form µ = β 0 + β1 X cannot be used to
model count data because this model has the disadvantage that the linear predictor on the
right-hand side can assume any real value, whereas the Poisson mean on the left-hand
side, which represents an expected count, has to be nonnegative. The natural log of the
mean in Poisson regression model ensures that the predictions from the model will never
be negative.
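The log link and offset can be sketched in a few lines. The coefficients and exposures below are made-up illustrative values, not estimates from the text:

```python
import math

# Sketch of a rate model ln(mu/t) = b0 + b1*x; b0 and b1 are hypothetical.
b0, b1 = -1.5, 0.4

def predicted_count(x, t):
    # The offset enters as +ln(t); exponentiation keeps the mean positive.
    return math.exp(math.log(t) + b0 + b1 * x)

# Even a very negative linear predictor yields a nonnegative predicted mean:
print(predicted_count(x=-20.0, t=100.0) > 0)  # True
print(round(predicted_count(x=2.0, t=100.0), 3))
```

This illustrates why the log link, rather than a linear model for $\mu$ itself, guarantees nonnegative predictions.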
One might ask whether the analysis of count data can be done by simply transforming the response. In some cases, a log-transformation of the response helps to linearize the relationship and results in more normally distributed errors. In other cases, however, a simple log-transformation will not solve the problems and the natural log link approach is needed. Furthermore, the presence of zeroes in the response variable is problematic for a log-transformation of the response. The Poisson regression model overcomes this situation.
3.2 Estimation of Parameters Using Maximum Likelihood Estimation (MLE)
Estimation of parameters in Poisson regression relies on maximum likelihood
estimation (MLE). Maximum likelihood estimation seeks to answer the question of what values of the regression coefficients are most likely to have given rise to the data.
MLE focuses on a likelihood function that describes the probability of observing the
data as a function of a set of parameters. As stated previously, Poisson regression uses the Poisson distribution as the probability model, and the regression coefficients define the
parameters that specify the mean structure of the data. The goal in MLE is to find the
estimates of the regression coefficients that maximize the likelihood function. This can
be accomplished by setting the first derivative of the likelihood equation equal to zero
and solving for the regression coefficients.
In most practical cases, finding the MLE requires an iterative process, which adds an extra layer of complexity to these models. In particular, complex models involving many parameters and small sample sizes may prevent the process from converging.
Ultimately, the results of MLE yield asymptotic standard errors for the regression
coefficients.
To discuss the maximum likelihood estimation for Poisson regression, let µ i be
the mean for the ith response, for i = 1,2,..., n . Since the mean response is assumed to be
a function of a set of explanatory variables, X 1 , X 2 ,..., X k , the notation µ ( X i , β) is used
to denote the function that relates the mean response µ i to X i (the values of
explanatory variables for case i) and β (the values of regression coefficients). Now
consider the Poisson regression model in the following form:

$$\mu_i = \mu(X_i, \beta) = e^{(X_i'\beta)} \qquad (3.5)$$
Then, from the Poisson distribution,

$$P(Y; \beta) = \frac{[\mu(X_i, \beta)]^{Y} e^{-\mu(X_i, \beta)}}{Y!} \qquad (3.6)$$
The likelihood function is given as

$$L(Y; \beta) = \prod_{i=1}^{N} P(Y_i; \beta) = \prod_{i=1}^{N} \frac{[\mu(X_i, \beta)]^{Y_i} e^{-\mu(X_i, \beta)}}{Y_i!} = \frac{\left\{\prod_{i=1}^{N} [\mu(X_i, \beta)]^{Y_i}\right\} e^{-\sum_{i=1}^{N} \mu(X_i, \beta)}}{\prod_{i=1}^{N} Y_i!} \qquad (3.7)$$
The next step is to take the natural log of the likelihood function above, differentiate with respect to $\beta$, and set the result equal to zero. The log-likelihood function is given as

$$\ln L(Y; \beta) = \sum_{i=1}^{N} \left[ Y_i \ln[\mu(X_i, \beta)] - \mu(X_i, \beta) - \ln(Y_i!) \right] \qquad (3.8)$$

$$\frac{\partial}{\partial \beta} [\ln L(Y; \beta)] = 0 \qquad (3.9)$$
The solution to the set of maximum likelihood equations given above must generally be obtained by an iterative procedure. One such procedure is known as iteratively reweighted least squares. This procedure estimates the values of $\beta$. Maximum likelihood estimation produces Poisson parameter estimates that are consistent, asymptotically normal and asymptotically efficient.
To demonstrate estimation of parameters using maximum likelihood estimation, consider the method of scoring in the generalized linear model, which simplifies the estimating equation to

$$b^{(m)} = b^{(m-1)} + [I^{(m-1)}]^{-1} U^{(m-1)} \qquad (3.10)$$

where $b^{(m)}$ is the vector of estimates of the parameters $\beta_0, \beta_1, \ldots, \beta_p$ at the $m$th iteration.
$I$ is the information matrix with elements $I_{jk}$ given by

$$I_{jk} = \sum_{i=1}^{N} \frac{x_{ij} x_{ik}}{Var(Y_i)} \left( \frac{\partial \mu_i}{\partial \eta_i} \right)^2 \qquad (3.11)$$

and $U$ is the vector with elements given by

$$U_j = \sum_{i=1}^{N} \frac{(Y_i - \mu_i)\, x_{ij}}{Var(Y_i)} \left( \frac{\partial \mu_i}{\partial \eta_i} \right) \qquad (3.12)$$

$U$ is called the score function.
If both sides of equation (3.10) are multiplied by $I^{(m-1)}$ it becomes

$$I^{(m-1)} b^{(m)} = I^{(m-1)} b^{(m-1)} + U^{(m-1)} \qquad (3.13)$$

From (3.11), $I$ can be written as

$$I = X'WX \qquad (3.14)$$
where $W$ is the $N \times N$ diagonal matrix with elements

$$w_{ii} = \frac{1}{Var(Y_i)} \left( \frac{\partial \mu_i}{\partial \eta_i} \right)^2 \qquad (3.15)$$
The expression on the right-hand side of equation (3.13) is the vector with elements

$$\sum_{k=0}^{p} \sum_{i=1}^{N} \frac{x_{ij} x_{ik}}{Var(Y_i)} \left( \frac{\partial \mu_i}{\partial \eta_i} \right)^2 b_k^{(m-1)} + \sum_{i=1}^{N} \frac{(Y_i - \mu_i)\, x_{ij}}{Var(Y_i)} \left( \frac{\partial \mu_i}{\partial \eta_i} \right) \qquad (3.16)$$

evaluated at $b^{(m-1)}$. Thus, the right-hand side of equation (3.13) can be written as $X'Wz$, where $z$ has elements

$$z_i = \sum_{k=0}^{p} x_{ik} b_k^{(m-1)} + (Y_i - \mu_i) \left( \frac{\partial \eta_i}{\partial \mu_i} \right) \qquad (3.17)$$

Note that $z$ is an $N \times 1$ matrix.
Hence, finally, the iterative equation for parameter estimation can be written as

$$(X'WX)^{(m-1)}\, b^{(m)} = (X'Wz)^{(m-1)} \qquad (3.18)$$

This equation has to be solved iteratively because, in general, $z$ and $W$ depend on $b$. This iterative method is known as the iteratively reweighted least squares (IRWLS) method.
Now, consider a set of Poisson regression data, $Y_1, Y_2, \ldots, Y_N$, satisfying the properties of the generalized linear model. Parameters $\beta_0$ and $\beta_1$ (consider just these two) are related to the $Y_i$'s through $E(Y_i) = \mu_i$ and $g(\mu_i) = \eta_i = \ln(\mu_i) = \beta_0 + \beta_1 x_i$. From equation (3.15), the following is obtained:

$$w_{ii} = \frac{1}{\mu_i} (\mu_i)^2 = \mu_i$$

Using the estimate $b = \begin{pmatrix} b_0 \\ b_1 \end{pmatrix}$ for $\beta$, equation (3.17) becomes

$$z_i = b_0 + b_1 x_i + \frac{(Y_i - \mu_i)}{\mu_i}$$
Essentially, to find the estimating equation, the following matrices must be obtained:

$$X = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_N \end{pmatrix}, \qquad b = \begin{pmatrix} b_0 \\ b_1 \end{pmatrix}, \qquad W = \begin{pmatrix} \mu_1 & & & 0 \\ & \mu_2 & & \\ & & \ddots & \\ 0 & & & \mu_N \end{pmatrix}, \qquad z = \begin{pmatrix} b_0 + b_1 x_1 + \dfrac{Y_1 - \mu_1}{\mu_1} \\ b_0 + b_1 x_2 + \dfrac{Y_2 - \mu_2}{\mu_2} \\ \vdots \\ b_0 + b_1 x_N + \dfrac{Y_N - \mu_N}{\mu_N} \end{pmatrix}$$
From the above matrices,

$$X'WX = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_N \end{pmatrix} \begin{pmatrix} \mu_1 & & 0 \\ & \ddots & \\ 0 & & \mu_N \end{pmatrix} \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_N \end{pmatrix}$$

$$X'Wz = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_N \end{pmatrix} \begin{pmatrix} \mu_1 & & 0 \\ & \ddots & \\ 0 & & \mu_N \end{pmatrix} \begin{pmatrix} b_0 + b_1 x_1 + \dfrac{Y_1 - \mu_1}{\mu_1} \\ \vdots \\ b_0 + b_1 x_N + \dfrac{Y_N - \mu_N}{\mu_N} \end{pmatrix}$$
which then give

$$X'WX = \begin{pmatrix} \sum_{i=1}^{N} \mu_i & \sum_{i=1}^{N} \mu_i x_i \\ \sum_{i=1}^{N} \mu_i x_i & \sum_{i=1}^{N} \mu_i x_i^2 \end{pmatrix} \qquad (3.19)$$

and

$$X'Wz = \begin{pmatrix} \sum_{i=1}^{N} \mu_i \left( b_0 + b_1 x_i + \dfrac{Y_i - \mu_i}{\mu_i} \right) \\ \sum_{i=1}^{N} \mu_i x_i \left( b_0 + b_1 x_i + \dfrac{Y_i - \mu_i}{\mu_i} \right) \end{pmatrix} \qquad (3.20)$$
Since $\ln(\mu_i) = b_0 + b_1 x_i$, it follows that $\mu_i = e^{(b_0 + b_1 x_i)}$.
Therefore, (3.19) and (3.20) become

$$X'WX = \begin{pmatrix} \sum_{i=1}^{N} e^{(b_0 + b_1 x_i)} & \sum_{i=1}^{N} x_i e^{(b_0 + b_1 x_i)} \\ \sum_{i=1}^{N} x_i e^{(b_0 + b_1 x_i)} & \sum_{i=1}^{N} x_i^2 e^{(b_0 + b_1 x_i)} \end{pmatrix} \qquad (3.21)$$

$$X'Wz = \begin{pmatrix} \sum_{i=1}^{N} e^{(b_0 + b_1 x_i)} \left( b_0 + b_1 x_i + \dfrac{Y_i}{e^{(b_0 + b_1 x_i)}} - 1 \right) \\ \sum_{i=1}^{N} x_i e^{(b_0 + b_1 x_i)} \left( b_0 + b_1 x_i + \dfrac{Y_i}{e^{(b_0 + b_1 x_i)}} - 1 \right) \end{pmatrix} \qquad (3.22)$$
The maximum likelihood estimates are obtained iteratively using equation (3.18). Initial values can be obtained by applying the link to the data, that is, taking the natural log of the response and regressing it on the predictors (or explanatory variables) using the ordinary least squares (OLS) method given by

$$\hat{\beta} = (X'X)^{-1} X'Y \qquad (3.23)$$
To avoid problems with counts of 0, one can add a small constant to all
responses. The procedure will converge in a few iterations.
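The procedure above can be sketched for the two-parameter model. This is a minimal pure-Python illustration of equation (3.18), solving the 2×2 system by hand at each iteration; the data at the bottom are made up, not from the thesis:

```python
import math

def poisson_irwls(x, y, iterations=25):
    """Fit ln(mu_i) = b0 + b1*x_i by iteratively reweighted least squares.

    At each step solve (X'WX) b = X'Wz with W = diag(mu_i) and working
    response z_i = b0 + b1*x_i + (y_i - mu_i)/mu_i, as in the text.
    """
    # Start from OLS on ln(y + 0.5); the small constant guards against zeros.
    n = len(x)
    ly = [math.log(yi + 0.5) for yi in y]
    xbar, lybar = sum(x) / n, sum(ly) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (li - lybar) for xi, li in zip(x, ly)) / sxx
    b0 = lybar - b1 * xbar
    for _ in range(iterations):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        z = [b0 + b1 * xi + (yi - mi) / mi for xi, yi, mi in zip(x, y, mu)]
        # Elements of X'WX and X'Wz for the two-parameter model (3.19)-(3.20).
        s0 = sum(mu)
        s1 = sum(mi * xi for mi, xi in zip(mu, x))
        s2 = sum(mi * xi * xi for mi, xi in zip(mu, x))
        t0 = sum(mi * zi for mi, zi in zip(mu, z))
        t1 = sum(mi * xi * zi for mi, xi, zi in zip(mu, x, z))
        det = s0 * s2 - s1 * s1
        b0, b1 = (s2 * t0 - s1 * t1) / det, (s0 * t1 - s1 * t0) / det
    return b0, b1

# Made-up increasing counts; the fitted slope should come out positive.
x = [0, 1, 2, 3, 4, 5, 6, 7]
y = [2, 2, 3, 4, 5, 8, 10, 13]
b0, b1 = poisson_irwls(x, y)
print(b1 > 0)  # True
```

At convergence the score equations hold: the fitted means reproduce the total count and its cross-product with $x$.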
3.3 Standard Errors for Regression Coefficients
Standard errors for $\beta_0$ and $\beta_1$ are given by the inverse of the information matrix obtained from the last iteration, that is,

$$I^{-1} = (X'WX)^{-1} = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$$

where the standard error for $\beta_0$ is $\sqrt{a}$ while the standard error for $\beta_1$ is $\sqrt{d}$. These standard errors are important in calculating the confidence interval.
3.4 Interpretation of Coefficients
The interpretation of coefficients in the Poisson regression model is fairly straightforward and intuitive. What needs to be accounted for in interpreting the coefficients is the fact that a natural log link has been incorporated in the model, unlike in ordinary regression models. To interpret the results of the analysis, the estimates of interest must be exponentiated ($e^\beta$), as well as the ends of the confidence interval, in order to get estimates on the original scale of the outcome. The response variable can then be said to have multiplicative changes for each one-unit change in the predictor variable.
To illustrate, consider the Poisson regression model

$$\ln(\mu) = \alpha + \beta x$$

This model can be written as

$$\mu = e^{(\alpha + \beta x)} = e^{\alpha} e^{\beta x}$$

Next, consider two values of $x$ ($x_1$ and $x_2$) such that the difference between them equals 1. For example, $x_1 = 10$ and $x_2 = 11$, that is, $x_2 - x_1 = 1$.

When $x = x_1 = 10$,

$$\mu_1 = e^{\alpha} e^{\beta x_1} = e^{\alpha} e^{\beta(10)} \qquad (3.24)$$

When $x = x_2 = 11$,

$$\mu_2 = e^{\alpha} e^{\beta x_2} = e^{\alpha} e^{\beta(x_1 + 1)} = e^{\alpha} e^{\beta x_1} e^{\beta} = e^{\alpha} e^{\beta(10)} e^{\beta} \qquad (3.25)$$

A change in $x$ has a multiplicative effect on the mean of $Y$. When one looks at a one-unit increase in the explanatory variable (i.e., $x_2 - x_1 = 1$),

$$\mu_1 = e^{\alpha} e^{\beta x_1} \quad \text{and} \quad \mu_2 = e^{\alpha} e^{\beta x_1} e^{\beta}$$

If $\beta = 0$, then $e^{0} = 1$ and $\mu_2 = \mu_1$. Therefore, it can be said that $\mu = E(Y)$ is not related to $x$.

If $\beta > 0$, then $e^{\beta} > 1$. Thus $\mu_2 = e^{\alpha} e^{\beta x_2} = e^{\alpha} e^{\beta x_1} e^{\beta} = \mu_1 e^{\beta}$, so $\mu_2$ is $e^{\beta}$ times larger than $\mu_1$.

If $\beta < 0$, then $0 < e^{\beta} < 1$. Thus $\mu_2 = e^{\alpha} e^{\beta x_2} = e^{\alpha} e^{\beta x_1} e^{\beta} = \mu_1 e^{\beta}$, so $\mu_2$ is smaller than $\mu_1$ by the factor $e^{\beta}$.
In short, the multiplicative effect means that increasing $x$ by one unit multiplies the mean by a factor of $e^{\beta}$; that is, a one-unit change in an explanatory variable $X$ produces a multiplicative change of $e^{\beta}$ in the response variable $Y$. Take note that positive coefficients in the parameter estimates indicate an increase in the prediction while negative coefficients indicate a decrease in the prediction.
The explanatory variables in Poisson regression may be in coded or continuous
form. When the explanatory variables are in coded form, the interpretation of coefficient
can be done straightforwardly from the result of parameter estimates. However, when
the explanatory variables are in continuous form, the interpretation of coefficient
requires a bit more work. There are three strategies that can be used to interpret
continuous explanatory variables.
The first strategy is by using the regression equation. A regression equation is a
type of prediction equation. Thus, the regression equation is used to generate predictions
over specific ranges of the predictors (explanatory variables) in order to interpret
Poisson regression models.
For the following Poisson regression equation,

$$\ln(\mu_i) = \beta_0 + \beta_1 X_i$$

predictions can be generated by replacing the explanatory variable with numeric values: multiplying each predictor value $X_i$ by the corresponding regression coefficient $\beta_1$ and then exponentiating yields the predicted value for each predictor.
The second strategy is to use the regression equation to provide predicted values
for discrete combinations of predictors. Instead of generating continuous predictions
along the range of predictors, specific values for individual predictors are specified. This
strategy is virtually identical to the first strategy, except that discrete as opposed to
continuous predictions are generated. For continuous explanatory variables, this strategy
is most appropriate when there are clear cutoffs or benchmarks.
The first and second strategy could be used with any regression model.
Nonetheless, the third strategy is only restricted to Poisson regression. The third strategy
allows the interpretation of regression coefficients in the Poisson model to be in the form
of percentage change in the expected counts, given as

$$100\left(e^{\beta \times \delta} - 1\right) \qquad (3.26)$$

where $\beta$ is the regression coefficient from the Poisson regression and $\delta$ is the number of units of change in the explanatory variable (e.g., for a one-unit change, $\delta = 1$). This strategy results from the fact that the Poisson model is a multiplicative model, where the predictors in the model are exponentiated (Atkins and Gallop, 2007).
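The third strategy amounts to one line of arithmetic. A quick sketch with a made-up coefficient:

```python
import math

# Equation (3.26): percentage change in the expected count for a
# delta-unit change in a predictor. beta = 0.3 is an illustrative value.
def percent_change(beta, delta=1.0):
    return 100.0 * (math.exp(beta * delta) - 1.0)

print(round(percent_change(0.3), 1))   # 35.0  (each extra unit raises the mean ~35%)
print(round(percent_change(-0.3), 1))  # -25.9 (a negative coefficient lowers it)
```

Note the asymmetry: $+0.3$ and $-0.3$ do not give equal-and-opposite percentage changes, precisely because the model is multiplicative.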
3.5 Elasticity
To provide some insight into the implications of parameter estimation results,
elasticities are computed to determine the marginal effects of the independent variables.
Elasticities provide an estimate of the impact of an explanatory variable on the mean and
are interpreted as the effect of a one percent change in the explanatory variable on the
mean. For example, an elasticity of -1.32 means that a one percent increase in the explanatory variable reduces the mean by 1.32 percent. Elasticities are the correct way
of evaluating the relative impact of each explanatory variable in the model. Elasticity of
the mean is defined as

$$E^{\mu_i}_{X_{ik}} = \beta_k X_{ik} \qquad (3.27)$$
where E represents the elasticity, X ik represents the value of the k th explanatory
variable for observation i , β k represents the estimated parameter for the k th
explanatory variable and µ i is the mean for observation i . Note that elasticities are
computed for each observation i . It is common to report a single elasticity as the
average elasticity over all i .
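Equation (3.27) and the averaging step can be sketched directly; the coefficient and predictor values below are made-up illustrative numbers:

```python
# Elasticity of the mean for a continuous explanatory variable, computed
# per observation and then averaged. beta_k and x_k are hypothetical.
beta_k = -0.02
x_k = [40.0, 55.0, 70.0, 85.0]

elasticities = [beta_k * x for x in x_k]
average_elasticity = sum(elasticities) / len(elasticities)
print(round(average_elasticity, 3))  # -1.25
```

Read as in the text: on average, a one percent increase in this explanatory variable reduces the mean by about 1.25 percent.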
The elasticity in equation (3.27) is only appropriate for continuous explanatory
variables such as highway lane width, distance from outside shoulder edge to roadside
features and vertical curve length. It is not valid for noncontinuous variables such as
indicator variables that take on values of zero or one. For indicator variables, pseudo-elasticity is computed to estimate an approximate elasticity of the variables. The pseudo-elasticity gives the incremental change in the mean caused by changes in the indicator variables. The pseudo-elasticity for indicator variables is computed as

$$E^{\mu_i}_{X_{ik}} = \frac{e^{\beta_k} - 1}{e^{\beta_k}} \qquad (3.28)$$

3.6 Model Checking Using Pearson Chi-Squares and Deviance
The popular measures of the adequacy of the model fit are Pearson chi-squares and deviance. If the values for both Pearson chi-squares and deviance are close to the degrees of freedom, $N - p$, the model may be considered adequate.
To check the goodness-of-fit of the model, the following hypotheses are required:
H 0 : the model has a good fit
versus
H 1 : the model has lack of fit
3.6.1 Pearson Chi-Squares
Let Yi be the observed count and µ̂ i be the fitted mean value. Then, for Poisson
regression analysis, the Pearson chi-squares statistic is given by
$$X^2 = \sum \frac{(Y_i - \hat{\mu}_i)^2}{\hat{\mu}_i} \qquad (3.29)$$
When the fitted mean values, $\hat{\mu}_i$, are relatively large (greater than 5), this test statistic has an approximate chi-squared distribution. Its degrees of freedom equal the number of response counts, $N$, minus the number of parameters in the model, $p$. $H_0$ will be rejected if $X^2 > \chi^2_{0.05}(N - p)$, indicating lack of fit of the model.
3.6.2 Deviance
The deviance is given by

$$D = 2 \sum Y_i \ln\left( \frac{Y_i}{\hat{\mu}_i} \right) \qquad (3.30)$$
For large samples, the deviance also has approximate chi-squared distribution
with ( N − p) degrees of freedom. Similar to Pearson chi-squares, H 0 will be rejected if
D > χ 02.05 ( N − p ) indicating lack of fit of the model.
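Both statistics are simple sums over observations. A sketch with made-up observed counts and fitted means (the deviance formula below is the one in equation (3.30); the $Y_i = 0$ terms are taken as zero):

```python
import math

def pearson_chi_square(y, mu_hat):
    """Equation (3.29): sum of (Y_i - mu_hat_i)^2 / mu_hat_i."""
    return sum((yi - mi) ** 2 / mi for yi, mi in zip(y, mu_hat))

def deviance(y, mu_hat):
    """Equation (3.30); the Y_i*ln(Y_i/mu_hat_i) term is zero when Y_i = 0."""
    return 2.0 * sum(yi * math.log(yi / mi)
                     for yi, mi in zip(y, mu_hat) if yi > 0)

# Hypothetical observed counts and fitted means from some Poisson fit:
y = [12, 7, 9, 15, 4]
mu_hat = [10.2, 8.1, 9.5, 13.8, 5.4]
x2, d = pearson_chi_square(y, mu_hat), deviance(y, mu_hat)
# Each would be compared with a chi-square critical value on N - p df.
print(x2 > 0 and d > 0)  # True
```

With $N = 5$ observations and, say, $p = 2$ parameters, both statistics would be referred to a $\chi^2$ distribution on 3 degrees of freedom.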
3.7 Model Residuals
For observation $i$, the residual difference $Y_i - \hat{\mu}_i$ between an observed and fitted count has limited usefulness. For Poisson sampling, for instance, the standard deviation of a count is $\sqrt{\hat{\mu}_i}$, so larger differences tend to occur when $\hat{\mu}_i$ is larger. The Pearson residual, $R$, is a standardization of this difference, defined by

$$R = \frac{(\text{observed} - \text{fitted})}{\sqrt{\widehat{Var}(\text{observed})}} \qquad (3.31)$$

For the Poisson GLM, this simplifies for count $i$ to

$$e_i = \frac{Y_i - \hat{\mu}_i}{\sqrt{\hat{\mu}_i}} \qquad (3.32)$$

which standardizes the difference by dividing it by the estimated Poisson standard deviation. These residuals relate to the Pearson chi-squares statistic by $\sum e_i^2 = X^2$. Counts having larger residuals make greater contributions to the overall $X^2$ value for testing goodness-of-fit of the model.
Pearson residual values fluctuate around zero, following approximately a normal
distribution when µ i is large. When the model holds, these residuals are less variable
than standard normal, however, because the numerator must use the fitted value µ̂ i
rather than the true mean µ i . Since the sample data determine the fitted value, Yi − µ̂ i
tends to be smaller than Yi − µ i .
The Pearson residual divided by its estimated standard error is called an adjusted
residual. It does have an approximate standard normal distribution when µ i is large
(greater than 5). Thus, with adjusted residuals, it is easier to tell when a deviation
Yi − µ̂ i is large.
Residuals larger than about 2 in absolute value are worthy of attention. Adjusted
residuals are preferable to Pearson residuals. PROC GENMOD in SAS 9.1 can provide
Pearson residual as well as adjusted residual.
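Equation (3.32) and its link to the chi-square statistic can be sketched with made-up counts and fitted means:

```python
import math

def pearson_residuals(y, mu_hat):
    """Equation (3.32): (Y_i - mu_hat_i) / sqrt(mu_hat_i)."""
    return [(yi - mi) / math.sqrt(mi) for yi, mi in zip(y, mu_hat)]

# Hypothetical counts and fitted means; |residuals| beyond about 2
# would deserve attention.
y = [12, 7, 9, 15, 4]
mu_hat = [10.2, 8.1, 9.5, 13.8, 5.4]
e = pearson_residuals(y, mu_hat)
# Squared Pearson residuals add up to the Pearson chi-square statistic.
x2 = sum(ei ** 2 for ei in e)
print(all(abs(ei) < 2 for ei in e))  # True
```

Dividing each residual by its estimated standard error would give the adjusted residuals discussed above; that step needs the leverage values, which are not reproduced here.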
3.8 Inference
Inference consists of hypothesis testing and calculating the confidence intervals.
A hypothesis test may answer the question of whether or not a parameter in the model equals a certain value. Hypothesis tests are also applied in comparing how well two (or more) related models fit the data. Confidence intervals, on the other hand, are increasingly
regarded as more useful than hypothesis tests because the width of a confidence interval
provides a measure of the precision with which inferences can be made. It does so in a
way which is conceptually simpler than the power of a statistical test.
3.8.1 Test of Significance
In GLM, test of significance must be performed in order to see the importance of
predictor variables in the model. For this purpose, the following hypotheses are required:
H 0 : β1 = 0
versus
H 1 : β1 ≠ 0
These hypotheses can be tested using a procedure called the Wald statistic procedure. The test statistic for this procedure is given by

$$Z = \frac{\hat{\beta}_1 - \beta_1}{SE(\hat{\beta}_1)} \qquad (3.33)$$

which has an approximate standard normal distribution under $H_0$. $Z^2$ has a chi-squared distribution with one degree of freedom $(df = 1)$. Therefore, $H_0$ is rejected if $Z^2 > \chi^2_{0.05}(1)$.
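The test is a one-line computation once the fit is available. The estimate and standard error below are made-up numbers standing in for MLE output:

```python
# Sketch of the Wald statistic (3.33) for H0: beta1 = 0.
beta1_hat, se_beta1 = 0.42, 0.15  # hypothetical fitted values
z = (beta1_hat - 0.0) / se_beta1
# Compare Z^2 with the chi-square critical value on 1 df (3.84 at the 5% level).
print(z ** 2 > 3.84)  # True, so H0 would be rejected here
```

Equivalently, $|Z| > 1.96$ leads to rejection at the 5% level, matching the confidence-interval construction in the next subsection.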
3.8.2 Confidence Intervals
From section 3.3, a 95% confidence interval for $\beta_1$ is given by

$$\hat{\beta}_1 \pm 1.96 \times SE(\hat{\beta}_1) \qquad (3.34)$$

3.9 Handling Overdispersion

3.9.1 Quasi-Likelihood Method
Basically, there are three steps in the quasi-likelihood approach. First, the regression coefficients and their standard errors are estimated using MLE. Second, the dispersion parameter, $\phi$, is estimated separately. Third, the standard errors are adjusted for the estimated dispersion parameter so that proper confidence intervals and test statistics can be obtained. These steps are summarized in Figure 3.1.
Figure 3.1: Steps in the quasi-likelihood approach (run the usual Poisson regression and obtain values for the coefficients and their standard errors; estimate the dispersion parameter; adjust the standard errors)
If the precise mechanism that produces the overdispersion is known, specific
method may be applied to model the data. In the absence of such knowledge, it is
convenient to introduce a dispersion parameter, φ , in the variance formula, that is,
Var (Y ) = φµ . If φ > 1 , then overdispersion exists. This is a rather robust approach to
tackle the problem, since even quite substantial deviations in the assumed simple linear
functional form, Var (Y ) = φµ , generally have a merely minor effect on the conclusions
related to standard errors, confidence intervals and p-values.
The introduction of the dispersion parameter does not introduce a new
probability distribution but just gives a correction term for testing the parameter
estimates under the Poisson model. The models are fit in the usual way, and the
parameter estimates are not affected by the value of φ . Only standard errors are inflated
by this factor. This method produces an appropriate inference if overdispersion is
modest and it has become the conventional approach in Poisson regression analysis.
It is worth pointing out that the quasi-likelihood argument starts from the quasi-score function. To describe the quasi-likelihood method, first consider the score function given in (3.12).
$$U = \sum_{i=1}^{N} \frac{(Y_i - \mu_i)}{Var(Y_i)} \left( \frac{\partial \mu_i}{\partial \eta_i} \right) x_i$$

For overdispersed $Y$, $Var(Y_i) = \phi_i \mu_i$ as introduced previously, and since
$g(\mu_i) = \eta_i = \ln(\mu_i)$, the derivative is given by $\partial \mu_i / \partial \eta_i = \mu_i$. Substituting these two expressions in (3.12) results in the function

$$u = \sum_{i=1}^{N} \frac{1}{\phi_i}\, x_i (Y_i - \mu_i) \qquad (3.35)$$
Note that $u$ is not the score function, because the $Y_i$'s are no longer Poisson. $u$ is called a quasi-score function or an estimating function. The solution of $u = 0$, the maximum quasi-likelihood estimate (MQLE) of $\beta$, is denoted by $\tilde{\beta}$. The difference $\tilde{\beta} - \beta$ can then be expressed approximately as a matrix multiple of the multivariate random vector $u$:

$$\tilde{\beta} - \beta \approx [\Gamma(\beta)]^{-1} u \qquad (3.36)$$

where $\Gamma(\beta) = E\left[ -\dfrac{\partial u}{\partial \beta'} \right]$.
Therefore $\tilde{\beta} - \beta$ is asymptotically multivariate normally distributed. As in the case of the score function and its negative expected derivative $I$ (the information matrix), the negative expected value of the derivative of the estimating function, $\Gamma(\beta)$, turns out to be the variance of $u$. For the natural log link function, the negative derivative does not involve the random variables $Y_i$, so it equals its expectation, as written below:

$$\Gamma(\beta) = E\left[ -\frac{\partial u}{\partial \beta'} \right] = \sum_{i=1}^{N} \frac{1}{\phi_i}\, \mu_i x_i x_i' \qquad (3.37)$$
while

$$Var(u) = \sum_{i=1}^{N} x_i \frac{1}{\phi_i} Var(Y_i - \mu_i) \frac{1}{\phi_i} x_i' = \sum_{i=1}^{N} \frac{1}{\phi_i}\, \mu_i x_i x_i' \qquad (3.38)$$

Consequently, the asymptotic variance of $\tilde{\beta}$ is $[\Gamma(\beta)]^{-1}$, that is,

$$Var(\tilde{\beta}) = [\Gamma(\beta)]^{-1} Var(u) [\Gamma(\beta)]^{-1} = [\Gamma(\beta)]^{-1} \qquad (3.39)$$
If the dispersion parameter φi is constant over i , that is, φi = φ for all i , then φ is
dropped out from u = 0 , and the estimating equation becomes
N
u 0 = ∑ [xi (Yi − µ i )]
(3.40)
i =1
with expected derivative

E[−∂u0/∂β′] = Γ0(β) = Σ_{i=1}^{N} µi xi xi′    (3.41)
where the subscript 0 indicates that the estimating function and the expectation of its derivative are free of the overdispersion parameter. Although u0 is identical to the score function for pure Poisson outcomes, u0 is not the score function for overdispersed counts. In this constant-overdispersion case,
Var(u0) = Σ_{i=1}^{N} xi Var(Yi − µi) xi′ = φ Σ_{i=1}^{N} µi xi xi′ = φ Γ0(β)    (3.42)

and

β̃ − β ≈ [Γ0(β)]⁻¹ u0    (3.43)

Thus,

Var(β̃) = [Γ0(β)]⁻¹ Var(u0) [Γ0(β)]⁻¹ = φ [Γ0(β)]⁻¹    (3.44)
This expression implies that one can fit an ordinary Poisson regression model to overdispersed data as if there were no overdispersion in order to obtain the maximum quasi-likelihood estimate β̃. The maximum quasi-likelihood estimate is consistent and asymptotically normal, but its variance is φ[Γ0(β)]⁻¹ rather than the usual [Γ0(β)]⁻¹ as in ordinary Poisson regression. Therefore, inflating the standard errors by the factor √φ is all that is required for valid analysis in this case.
PROC GENMOD in SAS 9.1 implements this approach. The dispersion parameter, φ, can be obtained using SAS 9.1.
3.9.1.1 Estimating the Overdispersion Parameter
The dispersion parameter φ can be estimated based only on the first two moments of Y. From the relationship E[(Yi − µi)²/µi] = φ,

φ̃ = [1/(N − p)] Σ_{i=1}^{N} (Yi − µ̂i)²/µ̂i    (3.45)
Note that φ̃ is the scaled Pearson chi-square (that is, the Pearson chi-square statistic divided by its degrees of freedom).
Other than the scaled Pearson chi-square, the scaled deviance (the deviance statistic divided by its degrees of freedom) is also commonly used to estimate the dispersion parameter. This is given by

φ̂ = [1/(N − p)] · 2 Σ_{i=1}^{N} Yi ln(Yi/µ̂i)    (3.46)
From the equation Var(Y) = φµ, it is clear that if there is no overdispersion, the estimated φ (φ̃ or φ̂) will be close to 1.

Once the dispersion parameter is obtained, the usual Poisson regression standard errors need to be multiplied by √φ and the t-statistics divided by √φ. Standard errors that are multiplied by √φ are called adjusted standard errors.

The regular maximum likelihood estimates remain the same. Inference can then be performed in the usual way with these adjusted standard errors. A proper confidence interval can be obtained if the adjusted standard errors are used.
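As a quick sketch of this adjustment (with made-up counts and fitted means, for illustration only, not data from this study), the dispersion can be estimated as the scaled Pearson chi-square of equation (3.45) and the standard errors inflated by √φ:

```python
import math

def pearson_dispersion(y, mu, p):
    """Scaled Pearson chi-square, equation (3.45): the Pearson
    statistic divided by its degrees of freedom N - p."""
    x2 = sum((yi - mi) ** 2 / mi for yi, mi in zip(y, mu))
    return x2 / (len(y) - p)

def adjusted_standard_errors(se, phi):
    """Multiply the usual Poisson standard errors by sqrt(phi)."""
    return [s * math.sqrt(phi) for s in se]

# Hypothetical observed counts and fitted means for illustration
y = [0, 2, 1, 4, 3, 7]
mu = [0.8, 1.5, 1.9, 2.6, 3.1, 4.2]
phi = pearson_dispersion(y, mu, p=2)
adj = adjusted_standard_errors([0.50, 0.02], phi)
```

If φ is close to 1, the adjustment leaves the standard errors essentially unchanged.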
3.9.1.2 Testing for Overdispersion
Even though the overdispersion parameter can be estimated as above, one still needs to test whether overdispersion actually exists. To do so, the following hypotheses are tested:

H0: Var(Y) = µ    versus    H1: Var(Y) = µ + αg(µ)

where g(µ) = µ or g(µ) = µ².
Or perhaps one can simply write the hypotheses as:
H 0 : No overdispersion exists
versus
H 1 : Overdispersion exists
For g(µ) = µ, the test statistic for the existence of overdispersion is:

Q1 = [1/√(2N)] Σ_{i=1}^{N} [(Yi − µ̂i)² − Yi]/µ̂i    (3.47)
For g(µ) = µ², the test statistic is given by:

Q2 = Σ_{i=1}^{N} [(Yi − µ̂i)² − Yi] / √(2 Σ_{i=1}^{N} µ̂i²)    (3.48)
Both Q1 and Q2 are distributed approximately standard normal when N is large; equivalently, Q1² and Q2² have approximately a chi-squared distribution with one degree of freedom (df = 1). H0 will be rejected if Q1 or Q2 is statistically significant, that is, if Q1 > χ²0.05(1) or Q2 > χ²0.05(1).
This study will only use Q1 to test the existence of overdispersion.
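The two test statistics can be sketched in Python as follows (a minimal illustration with toy values of Yi and µ̂i, not data from this study):

```python
import math

def q1(y, mu):
    """Overdispersion test statistic (3.47), for g(mu) = mu."""
    n = len(y)
    total = sum(((yi - mi) ** 2 - yi) / mi for yi, mi in zip(y, mu))
    return total / math.sqrt(2 * n)

def q2(y, mu):
    """Overdispersion test statistic (3.48), for g(mu) = mu^2."""
    num = sum((yi - mi) ** 2 - yi for yi, mi in zip(y, mu))
    return num / math.sqrt(2 * sum(mi ** 2 for mi in mu))

# Toy observed counts and fitted means for illustration
y = [1, 3, 0, 8]
mu = [1.2, 2.5, 0.9, 4.1]
```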
3.9.2 Negative Binomial Regression Analysis
The negative binomial regression model is derived by rewriting the Poisson regression model as

ln µ = β0 + β1X + ε    (3.49)

where e^ε is a Gamma-distributed error term with mean 1 and variance α. This addition allows the variance to differ from the mean:

Var(Y) = µ(1 + αµ) = µ + αµ²    (3.50)
α also acts as a dispersion parameter. The Poisson regression model can be regarded as a limiting case of the negative binomial regression model as α approaches zero, which means that the selection between these two models depends on the value of α.
The negative binomial distribution has the form

P(Y = y) = [Γ(1/α + y) / (Γ(1/α) y!)] · [(1/α)/(1/α + µ)]^(1/α) · [µ/(1/α + µ)]^y    (3.51)
where Γ(·) is the gamma function. This results in the likelihood function

L(Yi) = Π_i [Γ(1/α + yi) / (Γ(1/α) yi!)] · [(1/α)/(1/α + µi)]^(1/α) · [µi/(1/α + µi)]^yi    (3.52)
Maximum likelihood estimation is used to estimate the parameters of the negative binomial model. In addition, the interpretation of the regression coefficients for negative binomial regression is the same as for Poisson regression.
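The probability function (3.51) can be evaluated numerically; the sketch below (an illustration of this parameterization, not code from this study) works on the log scale with log-gamma for numerical stability:

```python
import math

def nb_pmf(y, mu, alpha):
    """Negative binomial probability (3.51) with mean mu and
    dispersion alpha, so that Var(Y) = mu + alpha * mu**2."""
    r = 1.0 / alpha  # the 1/alpha "size" parameter
    logp = (math.lgamma(r + y) - math.lgamma(r) - math.lgamma(y + 1)
            + r * math.log(r / (r + mu))
            + y * math.log(mu / (r + mu)))
    return math.exp(logp)

# Sanity checks: the probabilities sum to 1 and the mean equals mu
total = sum(nb_pmf(k, 3.0, 0.5) for k in range(200))
mean = sum(k * nb_pmf(k, 3.0, 0.5) for k in range(200))
```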
3.10 Example
To illustrate Poisson regression, consider the following example.

Although male elephants are capable of reproducing by 14 to 17 years of age, young adult males are usually unsuccessful in competing with their larger elders for the attention of receptive females. Since male elephants continue to grow throughout their lifetimes, and since larger males tend to be more successful at mating, the males most likely to pass their genes to future generations are those whose characteristics enable them to live long lives. Joyce Poole studied a population of African elephants in Amboseli National Park, Kenya for 8 years. The data in Table 3.1 show the number of successful matings and the ages of 41 male elephants. The question of interest is: what is the relationship between mating success and age?
Table 3.1: Elephant's mating success by age

Age  Mating    Age  Mating    Age  Mating
27   0         33   3         39   1
28   1         33   3         41   3
28   1         33   3         42   4
28   1         33   2         43   0
28   3         34   1         43   2
29   0         34   1         43   3
29   0         34   2         43   4
29   0         34   3         43   9
29   2         36   5         44   3
29   2         36   6         45   5
29   2         37   1         47   7
30   1         37   1         48   2
32   2         37   6         52   9
33   4         38   2
Model Fitting
From this example, the response outcome, Y , is mating success while the
explanatory variable, X , is age. The model for this data is written as
ln[ E ( Mating i )] = β 0 + β 1 ( Agei )
Or simply,
ln[ E (Yi )] = β 0 + β1 X i
Estimation of Parameters

Initial Estimates

Since there are zero responses, 0.1 is added to all responses and the natural log of the responses is taken. Using the OLS method, with

X = | 1  27 |        Y = | -2.3026 |
    | 1  28 |            |  0.0953 |
    | .   . |            |    .    |
    | .   . |            |    .    |
    | 1  52 |            |  2.2083 |

the initial estimates are obtained as

b = (X'X)^(-1) X'Y = |   41   1470 |^(-1) |  22.5582 | = | -2.5748 |
                     | 1470  54436 |      | 959.6767 |   |  0.0872 |

Therefore b0(0) = -2.5748 and b1(0) = 0.0872.
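The closed-form step above can be checked with a few lines of Python, using the cross-product matrices given in the text (the 2×2 inversion done explicitly):

```python
# X'X and X'Y for the elephant data, as given in the text
xtx = [[41.0, 1470.0], [1470.0, 54436.0]]
xty = [22.5582, 959.6767]

det = xtx[0][0] * xtx[1][1] - xtx[0][1] * xtx[1][0]
# b = (X'X)^{-1} X'Y via the explicit 2x2 inverse
b0 = (xtx[1][1] * xty[0] - xtx[0][1] * xty[1]) / det
b1 = (-xtx[1][0] * xty[0] + xtx[0][0] * xty[1]) / det
```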
Next Estimates

The next estimates are found using the IRWLS method by solving equation (3.18) iteratively. The computation is aided by Microsoft Excel. Using the initial estimates, at m = 1,

(X'WX)(0) = | 84.5954   3375.317 |        (X'Wz)(0) = | 101.9159 |
            | 3375.317  138732.8 |                    | 4322.413 |

Thus,

b(1) = | -1.3117 |
       |  0.0631 |

that is, b0(1) = -1.3117 and b1(1) = 0.0631.
This iterative process is continued until it converges. The results are shown in Table 3.2. The maximum likelihood estimates are β0 = -1.5820 and β1 = 0.0687. The model can then be written as

ln[E(Yi)] = -1.5820 + 0.0687 Xi

Table 3.2: Iteratively reweighted least squares results

m      0         1         2         3         4
b0   -2.5748   -1.3117   -1.5689   -1.5820   -1.5820
b1    0.0872    0.0631    0.0684    0.0687    0.0687
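The whole iteration is easy to reproduce. The sketch below runs IRWLS for ln µ = b0 + b1x on the Table 3.1 data in pure Python, solving the 2×2 weighted normal equations directly rather than with Excel, and starting from the OLS initial estimates:

```python
import math

# (age, matings) for the 41 male elephants in Table 3.1
data = [(27,0),(28,1),(28,1),(28,1),(28,3),(29,0),(29,0),(29,0),
        (29,2),(29,2),(29,2),(30,1),(32,2),(33,4),(33,3),(33,3),
        (33,3),(33,2),(34,1),(34,1),(34,2),(34,3),(36,5),(36,6),
        (37,1),(37,1),(37,6),(38,2),(39,1),(41,3),(42,4),(43,0),
        (43,2),(43,3),(43,4),(43,9),(44,3),(45,5),(47,7),(48,2),(52,9)]

def irwls_poisson(data, b0=-2.5748, b1=0.0872, iters=25):
    """Solve (X'WX) b = X'Wz iteratively with W = diag(mu) and
    working response z = eta + (y - mu)/mu."""
    for _ in range(iters):
        s11 = s1x = sxx = t1 = tx = 0.0
        for x, y in data:
            eta = b0 + b1 * x
            mu = math.exp(eta)
            z = eta + (y - mu) / mu
            s11 += mu; s1x += mu * x; sxx += mu * x * x
            t1 += mu * z; tx += mu * x * z
        det = s11 * sxx - s1x * s1x
        b0 = (sxx * t1 - s1x * tx) / det
        b1 = (-s1x * t1 + s11 * tx) / det
    return b0, b1

b0, b1 = irwls_poisson(data)
```

The iterates settle at the maximum likelihood estimates reported in Table 3.2.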
The results are checked using SAS 9.1. The estimates for β0 and β1 from SAS 9.1 are found to be the same as the estimates calculated manually. Figure 3.2 shows the result of analyzing the same data using SAS 9.1. The SAS 9.1 code is given in APPENDIX A. This agreement shows that the manual computation method is valid and useful.
The GENMOD Procedure

Model Information
  Data Set            WORK.ELEPHANT
  Distribution        Poisson
  Link Function       Log
  Dependent Variable  matings

  Number of Observations Read  41
  Number of Observations Used  41

Criteria For Assessing Goodness Of Fit
  Criterion           DF   Value     Value/DF
  Deviance            39   51.0116   1.3080
  Scaled Deviance     39   51.0116   1.3080
  Pearson Chi-Square  39   45.1360   1.1573
  Scaled Pearson X2   39   45.1360   1.1573
  Log Likelihood           10.7400

Algorithm converged.

Analysis Of Parameter Estimates
  Parameter  DF  Estimate  Standard Error  Wald 95% Confidence Limits  ChiSquare  Pr > ChiSq
  Intercept  1   -1.5820   0.5446          -2.6494  -0.5146            8.44       0.0037
  age        1    0.0687   0.0137           0.0418   0.0956            24.97      <.0001
  Scale      0    1.0000   0.0000           1.0000   1.0000

NOTE: The scale parameter was held fixed.

Figure 3.2: SAS result for analysis of elephant's mating success data
Interpretation of Coefficients

From the model, β1 = 0.0687 indicates that a one-year increase in age multiplies the expected mating success by e^0.0687 = 1.0711. It can also be said that each additional year of age increases the mean number of matings by 7.1 percent, since |100(e^0.0687 - 1)| = 7.1%.
Elasticity

Taking the average, the elasticity for age is found to be

E = β1 X̄ = 0.0687 × 35.854 = 2.4631

This value indicates that a one percent increase in age increases the elephant's expected mating success by about 2.46 percent.
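The numbers above follow directly from β1 and the mean age (1470/41, from the X'X matrix), as this small check shows:

```python
import math

beta1 = 0.0687
rate_ratio = math.exp(beta1)         # multiplicative change per year of age
pct_change = 100 * (rate_ratio - 1)  # percent change per year of age
mean_age = 1470 / 41                 # average age in Table 3.1
elasticity = beta1 * mean_age        # beta1 * x-bar
```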
Standard Error

The information matrix I at the final iteration is found to be

I = (X'WX)(4) = | 110.0317  4292.276 |
                | 4292.276  172733.3 |

The inverse of this information matrix is

I^(-1) = |  0.2965         -7.37 × 10^-3 |
         | -7.37 × 10^-3    1.89 × 10^-4 |

From the inverse of the information matrix, the standard errors for β0 and β1 are

SE(β̂0) = √0.2965 = 0.5445
SE(β̂1) = √(1.89 × 10^-4) = 0.0137
These values are the same as the values obtained from SAS.
Model Checking

From the model, µ̂i = e^(-1.5820 + 0.0687 Xi). All values of µ̂i are shown in APPENDIX B. These values are computed using Microsoft Excel for simplicity.

The Pearson chi-square statistic for testing the goodness-of-fit of the elephant mating success data is obtained as follows:

X² = Σ_{i=1}^{41} (Yi - µ̂i)²/µ̂i = 45.1236

From the statistical table, χ²0.05(39) = 55.758. Since 45.1236 < 55.758, H0 is not rejected at significance level α = 0.05. This indicates that the model has a good fit.
Next, the deviance statistic for the elephant mating success data is found to be

D = 2 Σ_{i=1}^{41} Yi ln(Yi/µ̂i) = 2(24.0352) = 48.0704
Note: 0.1 is added to zero responses in order to overcome the impossibility of calculating ln(Yi/µ̂i). This is why the value of the deviance obtained here is slightly different from the value obtained from SAS.

Again, from the statistical table, χ²0.05(39) = 55.758. Since 48.0704 < 55.758, H0 is not rejected at significance level α = 0.05, indicating that the model has a good fit, in agreement with the Pearson chi-square result.
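Both statistics can be reproduced from Table 3.1 and the fitted model; the Python sketch below applies the same 0.1 correction to the zero counts when computing the deviance:

```python
import math

# (age, matings) for the 41 male elephants in Table 3.1
data = [(27,0),(28,1),(28,1),(28,1),(28,3),(29,0),(29,0),(29,0),
        (29,2),(29,2),(29,2),(30,1),(32,2),(33,4),(33,3),(33,3),
        (33,3),(33,2),(34,1),(34,1),(34,2),(34,3),(36,5),(36,6),
        (37,1),(37,1),(37,6),(38,2),(39,1),(41,3),(42,4),(43,0),
        (43,2),(43,3),(43,4),(43,9),(44,3),(45,5),(47,7),(48,2),(52,9)]

mu = [math.exp(-1.5820 + 0.0687 * x) for x, _ in data]
y = [v for _, v in data]

# Pearson chi-square statistic
x2 = sum((yi - mi) ** 2 / mi for yi, mi in zip(y, mu))

# Deviance, replacing zero counts by 0.1 so that ln(y/mu) is defined
dev = 2 * sum(max(yi, 0.1) * math.log(max(yi, 0.1) / mi)
              for yi, mi in zip(y, mu))
```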
Residual Analysis

Table 3.3 shows the Pearson residuals for the elephant mating success data. (The computation is done using Microsoft Excel.) Adjusted residuals cannot be computed because there is only a single observation for each of ages 27, 30, 32, 38, 39, 41, 42, 44, 45, 47, 48 and 52. From Table 3.3, it can be seen that observations 24, 27, and 36 have residual values greater than 2, so these observations deserve attention. The other residuals, however, are not large enough to indicate potential problems with model fit, so overall the model fits the data reasonably well.
Table 3.3: Residuals for elephant's mating success data

Observation  X (Age)  Y (Mating)  Pearson Residual
 1           27       0           -1.1462
 2           28       1           -0.34326
 3           28       1           -0.34326
 4           28       1           -0.34326
 5           28       3            1.342717
 6           29       0           -1.22771
 7           29       0           -1.22771
 8           29       0           -1.22771
 9           29       2            0.401341
10           29       2            0.401341
11           29       2            0.401341
12           30       1           -0.48359
13           32       2            0.108564
14           33       4            1.431296
15           33       3            0.721338
16           33       3            0.721338
17           33       3            0.721338
18           33       2            0.01138
19           34       1           -0.77177
20           34       1           -0.77177
21           34       2           -0.08579
22           34       3            0.600195
23           36       5            1.640773
24           36       6            2.281213
25           37       1           -0.99718
26           37       1           -0.99718
27           37       6            2.096892
28           38       2           -0.47663
29           39       1           -1.15319
30           41       3           -0.23589
31           42       4            0.165836
32           43       0           -1.98586
33           43       2           -0.97873
34           43       3           -0.47517
35           43       4            0.028389
36           43       9            2.546195
37           44       3           -0.59558
38           45       5            0.223561
39           47       7            0.794057
40           48       2           -1.50978
41           52       9            0.62158
Inference

One might want to know whether β1 is necessary in the model, that is, whether age has an effect on the elephant's mating success. The hypotheses to be tested are

H0: β1 = 0    versus    H1: β1 ≠ 0

The test statistic is

Z = (0.0687 - 0)/0.0137 = 5.0146

Thus, Z² = 5.0146² = 25.1462. From the statistical table, χ²0.05(1) = 3.841. Since 25.1462 > 3.841, H0 is rejected at significance level α = 0.05. This shows that β1 is important to the model; in other words, age has a significant effect on the elephant's mating success.
Confidence Intervals

As mentioned in Section 3.8.2, the 95% confidence interval for β1 is given by β̂1 ± 1.96 × SE(β̂1). Therefore, the 95% confidence interval for β1 is

0.0687 ± 1.96(0.0137)    or    (0.0418, 0.0956)

This gives the corresponding 95% confidence interval for the multiplicative factor as (e^0.0418, e^0.0956) = (1.0427, 1.1003).
Overdispersion

Since the data in this example caused some difficulty in calculating the deviance, only the scaled Pearson chi-square is used to estimate the dispersion parameter φ. The scaled Pearson chi-square is obtained as follows:

φ̃ = [1/(N - p)] Σ_{i=1}^{N} (Yi - µ̂i)²/µ̂i = 45.1236/39 = 1.1570

This value is close to 1, so one can say that no overdispersion exists in these data and adjustment of the standard errors is not crucial.
However, to verify the absence of overdispersion, the overdispersion test must be done. Consider the following hypotheses:

H0: No overdispersion exists    versus    H1: Overdispersion exists

It is found that

Q1 = [1/√(2N)] Σ_{i=1}^{41} [(Yi - µ̂i)² - Yi]/µ̂i = [1/√(2(41))] (4.4244) = 0.4885

From the statistical table, χ²0.05(1) = 3.841. Since 0.4885 < 3.841, H0 is not rejected at the 0.05 level of significance. Therefore, there is no evidence of overdispersion in these data.
Suppose one still wants to find the maximum quasi-likelihood estimates with their adjusted standard errors. The adjusted standard errors are given in Table 3.4.

Table 3.4: Adjusted standard errors

Parameter   Estimate   Standard Error   Adjusted Standard Error
Intercept   -1.5820    0.5445           0.5857
Age          0.0687    0.0137           0.0147

Adjusted standard errors are obtained by multiplying the standard errors by √φ̃ (that is, √1.1570 = 1.0756). From this result, the adjustment increases the standard errors by only 7.56%, yielding adjusted standard errors of 0.5857 and 0.0147 for the intercept and age respectively. This increase is small, so the standard errors are not substantially underestimated, which confirms that there is no overdispersion in the data and that the Poisson regression is adequate.
CHAPTER 4

ANALYSIS OF POISSON REGRESSION USING SAS

4.1 Introduction
In the previous chapter, the analysis of Poisson regression was done manually and the example did not involve overdispersion; in addition, the data were ordinary count data. This chapter analyzes count data in the form of rates, and the data involve overdispersion. The analysis of Poisson regression in this chapter is done using SAS 9.1.
4.2 Nursing Home Data
The nursing home data given in Table 4.1 were adapted from the book by Fleiss, J.L., Levin, B. and Paik, M.C. entitled Statistical Methods for Rates and Proportions.

The data were collected by the Department of Health and Social Services of the State of New Mexico. The collected variables include the number of beds, annual total patient days, annual total patient care revenue, annual nursing salaries, annual facilities expenditures, and an indicator for rural location. The question of interest is whether nursing homes in rural areas tend to have fewer beds per patient population than those in urban areas, adjusting for the other factors that affect hospital facilities. The symbols used for the variables are as follows:
BED – number of beds
TDAYS – annual total patient days (in hundreds)
PCREV – annual total patient care revenue (in $ millions)
NSAL – annual nursing salaries (in $ millions)
FEXP – annual facilities expenditures (in $ millions)
RURAL – rural (1) or urban (0)
Table 4.1: Nursing home data

UNIT  BED  TDAYS  PCREV   NSAL    FEXP    RURAL
1     244  385    2.3521  0.523   0.5334  0
2     59   203    0.916   0.2459  0.0493  1
3     120  392    2.19    0.6304  0.6115  0
4     120  419    2.2354  0.659   0.6346  0
5     120  363    1.7421  0.5362  0.6225  0
6     65   234    1.0531  0.3622  0.0449  1
7     120  372    2.2147  0.4406  0.4998  1
8     90   305    1.4025  0.4173  0.0966  1
9     96   169    0.8812  0.1955  0.126   0
10    120  188    1.1729  0.3224  0.6442  1
11    62   192    0.8896  0.2409  0.1236  0
12    120  426    2.0987  0.2066  0.336   1
13    116  321    1.7655  0.5946  0.4231  0
14    59   164    0.7085  0.1925  0.128   1
15    80   284    1.3089  0.4166  0.1123  1
16    120  375    2.1453  0.5257  0.5206  1
17    80   133    0.779   0.1988  0.4443  1
18    100  318    1.8309  0.4156  0.4585  1
19    60   213    0.8872  0.1914  0.1675  1
20    110  280    1.7881  0.5173  0.5686  1
21    120  336    1.7004  0.463   0.0907  0
22    135  442    2.3829  0.7489  0.3351  0
23    59   191    0.9424  0.2051  0.1756  1
24    60   202    1.2474  0.3803  0.2123  0
25    25   83     0.4078  0.2008  0.4531  1
26    221  776    3.6029  0.1288  0.2543  1
27    64   214    0.8782  0.4729  0.4446  1
28    62   204    0.8951  0.2367  0.1064  0
29    108  366    1.7446  0.5933  0.2987  1
30    62   220    0.6164  0.2782  0.0411  1
31    90   286    0.2853  0.4651  0.4197  0
32    146  375    2.1334  0.6857  0.1198  0
33    62   189    0.8082  0.2143  0.1209  1
34    30   88     0.3948  0.3025  0.0137  1
35    79   278    1.1649  0.2905  0.1279  0
36    44   158    0.785   0.1498  0.1273  1
37    120  423    2.9035  0.6236  0.3524  0
38    100  300    1.7532  0.3547  0.2561  1
39    49   177    0.8197  0.281   0.3874  1
40    123  336    2.2555  0.6059  0.6402  1
41    82   136    0.8459  0.1995  0.1911  1
42    58   205    1.0412  0.2245  0.1122  1
43    110  323    1.6661  0.4029  0.3893  1
44    62   222    1.2406  0.2784  0.2212  1
45    86   200    1.1312  0.372   0.2959  1
46    102  355    1.4499  0.3866  0.3006  1
47    135  471    2.4274  0.7485  0.1344  0
48    78   203    0.9327  0.3672  0.1242  1
49    83   390    1.2362  0.3995  0.1484  1
50    60   213    1.0644  0.282   0.1154  0
51    54   144    0.7556  0.2088  0.0245  1
52    120  327    2.0182  0.4432  0.6274  0
From these data, the number of beds is the response variable and annual total patient days is regarded as the offset; the rest of the variables are explanatory. The following Poisson regression models are fitted for E(BED)/TDAYS:

(M1)  ln[E(BEDi)/TDAYSi] = β0 + β1(PCREV)i + β2(NSAL)i + β3(FEXP)i

(M2)  ln[E(BEDi)/TDAYSi] = β0 + β1(PCREV)i + β2(NSAL)i + β3(FEXP)i
        + β4(PCREV·NSAL)i + β5(PCREV·FEXP)i + β6(NSAL·FEXP)i

(M3)  ln[E(BEDi)/TDAYSi] = β0 + β1(PCREV)i + β2(NSAL)i + β3(FEXP)i
        + β4(PCREV·NSAL)i + β5(PCREV·FEXP)i + β6(NSAL·FEXP)i + β7(RURAL)i
The data are analyzed using SAS 9.1. The code for running the program is given in APPENDIX C. The SAS output for the three models is displayed in Figure 4.1, Figure 4.2, and Figure 4.3.
The GENMOD Procedure

Model Information
  Data Set            WORK.NURSING_HOME
  Distribution        Poisson
  Link Function       Log
  Dependent Variable  bed
  Offset Variable     log_t

  Number of Observations Read  52
  Number of Observations Used  52

Criteria For Assessing Goodness Of Fit
  Criterion           DF   Value       Value/DF
  Deviance            48   245.0465    5.1051
  Scaled Deviance     48   245.0465    5.1051
  Pearson Chi-Square  48   276.8284    5.7673
  Scaled Pearson X2   48   276.8284    5.7673
  Log Likelihood           17446.3823

Algorithm converged.

Analysis Of Parameter Estimates
  Parameter  DF  Estimate  Standard Error  Wald 95% Confidence Limits  ChiSquare  Pr > ChiSq
  Intercept  1   -1.1018   0.0434          -1.1867  -1.0168            645.81     <.0001
  pcrev      1   -0.0564   0.0209          -0.0974  -0.0153            7.25       0.0071
  nsal       1   -0.1428   0.0945          -0.3281   0.0425            2.28       0.1308
  fexp       1    0.4935   0.0847           0.3275   0.6595            33.96      <.0001
  Scale      0    1.0000   0.0000           1.0000   1.0000

NOTE: The scale parameter was held fixed.

Figure 4.1: SAS output for model (M1)
The GENMOD Procedure

Model Information
  Data Set            WORK.NURSING_HOME
  Distribution        Poisson
  Link Function       Log
  Dependent Variable  bed
  Offset Variable     log_t

  Number of Observations Read  52
  Number of Observations Used  52

Criteria For Assessing Goodness Of Fit
  Criterion           DF   Value       Value/DF
  Deviance            45   215.6775    4.7928
  Scaled Deviance     45   215.6775    4.7928
  Pearson Chi-Square  45   235.5925    5.2354
  Scaled Pearson X2   45   235.5925    5.2354
  Log Likelihood           17461.0667

Algorithm converged.

Analysis Of Parameter Estimates
  Parameter   DF  Estimate  Standard Error  Wald 95% Confidence Limits  ChiSquare  Pr > ChiSq
  Intercept   1   -1.0480   0.1074          -1.2584  -0.8376            95.26      <.0001
  pcrev       1   -0.3221   0.0712          -0.4618  -0.1825            20.45      <.0001
  nsal        1    0.0165   0.4790          -0.9222   0.9553            0.00       0.9724
  fexp        1    1.3582   0.2803           0.8087   1.9077            23.47      <.0001
  pcrev*nsal  1    0.3718   0.1467           0.0844   0.6593            6.43       0.0112
  pcrev*fexp  1    0.5264   0.2541           0.0284   1.0245            4.29       0.0383
  nsal*fexp   1   -3.6525   0.9363          -5.4876  -1.8175            15.22      <.0001
  Scale       0    1.0000   0.0000           1.0000   1.0000

NOTE: The scale parameter was held fixed.

Figure 4.2: SAS output for model (M2)
The GENMOD Procedure

Model Information
  Data Set            WORK.NURSING_HOME
  Distribution        Poisson
  Link Function       Log
  Dependent Variable  bed
  Offset Variable     log_t

  Number of Observations Read  52
  Number of Observations Used  52

Criteria For Assessing Goodness Of Fit
  Criterion           DF   Value       Value/DF
  Deviance            44   201.9478    4.5897
  Scaled Deviance     44   201.9478    4.5897
  Pearson Chi-Square  44   216.1808    4.9132
  Scaled Pearson X2   44   216.1808    4.9132
  Log Likelihood           17467.9316

Algorithm converged.

Analysis Of Parameter Estimates
  Parameter   DF  Estimate  Standard Error  Wald 95% Confidence Limits  ChiSquare  Pr > ChiSq
  Intercept   1   -0.9730   0.1091          -1.1869  -0.7591            79.51      <.0001
  pcrev       1   -0.3478   0.0709          -0.4868  -0.2088            24.06      <.0001
  nsal        1    0.2783   0.4842          -0.6706   1.2273            0.33       0.5654
  fexp        1    1.4676   0.2831           0.9128   2.0224            26.88      <.0001
  pcrev*nsal  1    0.2655   0.1495          -0.0276   0.5586            3.15       0.0758
  pcrev*fexp  1    0.7085   0.2567           0.2055   1.2116            7.62       0.0058
  nsal*fexp   1   -4.4965   0.9551          -6.3685  -2.6246            22.16      <.0001
  rural       1   -0.1331   0.0357          -0.2032  -0.0631            13.87      0.0002
  Scale       0    1.0000   0.0000           1.0000   1.0000

NOTE: The scale parameter was held fixed.

Figure 4.3: SAS output for model (M3)
4.3 Choosing the Right Model
Since there are three proposed models, one needs to determine the most appropriate model to use. Note that M1 and M2 are nested models because M2 can be reduced to M1; similarly, M3 can be reduced to M2. Therefore, to determine the right model, M1 must be compared with M2, and M2 with M3.

Two methods can be used to compare two nested models: the deviance statistic and the Wald statistic. Only the deviance statistic method is discussed here. Table 4.2 summarizes the log likelihood and deviance for M1, M2, and M3 obtained from the previous SAS output.
Table 4.2: Log likelihood and deviance for models M1, M2, and M3

Model   Degrees of Freedom   Log Likelihood   Deviance
M1      48                   17446.3823       245.0465
M2      45                   17461.0667       215.6775
M3      44                   17467.9316       201.9478
To compare M1 and M2, the following hypotheses need to be tested:

H0: βi = 0 for all i = 4, 5, 6    versus    H1: βi ≠ 0 for at least one i
Firstly, it can be seen that the deviance of M2 is smaller than that of M1 due to
the added two-way interaction terms. The smaller value of deviance suggests that M2
fits the data better than M1. The next question is whether M2 fits the data significantly
better than M1.
To answer this question, the following statistic is used:
d = D0 − D1
where D0 indicates the deviance of the less inclusive model while D1 indicates the
deviance of the more inclusive model. This statistic has an approximate chi-squared
distribution with degrees of freedom equal to the difference between the numbers of
unknown parameters in the two models. Thus H0 will be rejected if d > χ²0.05(df).
For comparing M1 and M2,

d = 245.0465 - 215.6775 = 29.369    with df = 48 - 45 = 3

From the statistical table, χ²0.05(3) = 7.815. Since 29.369 > 7.815, H0 is rejected at significance level α = 0.05. This implies that the interaction terms are jointly highly significant in the model.
Now, to compare M2 and M3, the following hypotheses are considered:

H0: β7 = 0    versus    H1: β7 ≠ 0

The deviance of M3 is smaller than that of M2, suggesting that M3 fits the data better than M2. The statistic for comparing M2 and M3 is

d = 215.6775 - 201.9478 = 13.7297    with df = 45 - 44 = 1

From the statistical table, χ²0.05(1) = 3.841. Since 13.7297 > 3.841, H0 is rejected at significance level α = 0.05, implying that RURAL is highly significant in the model. Therefore, M1 and M2 are rejected in favour of M3; M3 is the right Poisson regression model for the nursing home data.
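The two comparisons amount to simple arithmetic on the deviances in Table 4.2, as this sketch confirms (critical values taken from the chi-squared table used in the text):

```python
# Deviances from Table 4.2
deviance = {"M1": 245.0465, "M2": 215.6775, "M3": 201.9478}

d_12 = deviance["M1"] - deviance["M2"]  # df = 48 - 45 = 3
d_23 = deviance["M2"] - deviance["M3"]  # df = 45 - 44 = 1

crit = {3: 7.815, 1: 3.841}             # chi-squared 0.05 critical values
reject_m1 = d_12 > crit[3]              # interaction terms jointly significant
reject_m2 = d_23 > crit[1]              # RURAL significant
```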
4.4 Results and Discussion
The analysis is now focused on model M3 and its SAS output in Figure 4.3.
The Model

The model for the nursing home data is obtained as:

ln[E(BEDi)/TDAYSi] = -0.9730 - 0.3478(PCREV)i + 0.2783(NSAL)i + 1.4676(FEXP)i
    + 0.2655(PCREV·NSAL)i + 0.7085(PCREV·FEXP)i - 4.4965(NSAL·FEXP)i
    - 0.1331(RURAL)i
Interpretation of Coefficients

From the "Analysis of Parameter Estimates" in Figure 4.3, it can be seen that the PCREV, FEXP, and RURAL factors are highly significant (p-value < 0.05). Similarly, the interactions between PCREV and FEXP and between NSAL and FEXP are also significant.

For the RURAL factor, β7 = -0.1331. This indicates that nursing homes in rural areas have fewer beds than those in urban areas by a factor of e^-0.1331 = 0.8754. In other words, since |100(e^-0.1331 - 1)| = 12.46%, nursing homes in rural areas have 12.46% fewer beds than nursing homes in urban areas. The 95% confidence interval for β7 is -0.1331 ± 1.96 × 0.0357 or (-0.2031, -0.0631), which gives the corresponding confidence interval for the multiplicative factor as (e^-0.2031, e^-0.0631) = (0.8162, 0.9388).
The coefficient of the annual total patient care revenue (PCREV) factor indicates that a one-unit change in annual total patient care revenue causes a 29.38% decrease in the mean number of beds. The 95% confidence interval for β1 is -0.3478 ± 1.96 × 0.0709 or (-0.4868, -0.2088), and the corresponding confidence interval for the multiplicative factor is (e^-0.4868, e^-0.2088) = (0.6146, 0.8116).

For the NSAL factor, it is found that a one-unit change in annual nursing salaries leads to a 32.09% increase in the mean number of beds, and the 95% confidence interval for β2 is 0.2783 ± 1.96 × 0.4842 or (-0.6707, 1.2273). Its corresponding confidence interval for the multiplicative factor is (e^-0.6707, e^1.2273) = (0.5114, 3.4120). A one-unit change in annual facilities expenditures (FEXP), however, results in a very large increase in the mean number of beds.
The interaction factors between PCREV and NSAL and between PCREV and FEXP show that a one-unit change in each of these factors multiplies the number of beds by e^0.2655 = 1.3041 and e^0.7085 = 2.0309 respectively, whereas the interaction factor between NSAL and FEXP shows that a one-unit change in this factor multiplies the number of beds by e^-4.4965 = 0.0111. Their 95% confidence intervals (on the coefficient scale) are (-0.0276, 0.5586), (0.2055, 1.2116), and (-6.3685, -2.6246) respectively.
Elasticity

Table 4.3 shows the elasticities computed for the explanatory variables in the nursing home data for model M3. The elasticities for the PCREV, NSAL, and FEXP variables are computed by applying equation (3.27) to each observation and then taking the average. This computation is done using Microsoft Excel. RURAL is an indicator variable, so the pseudo-elasticity for RURAL is computed as follows:

E_RURAL = [exp(β7) - 1]/exp(β7) = [exp(-0.1331) - 1]/exp(-0.1331) = -0.1424
Table 4.3: Elasticities of the explanatory variables in the nursing home data for model M3

Explanatory Variable   Elasticity
PCREV                  -0.4942
NSAL                    0.1061
FEXP                    0.4180
RURAL                  -0.1424
The results show that a one percent increase in annual total patient care revenue reduces the mean number of beds in nursing homes by 0.4942 percent. In contrast, a one percent increase in annual nursing salaries or annual facilities expenditures increases the mean number of beds by 0.1061 and 0.4180 percent respectively. Furthermore, the mean number of beds in rural nursing homes is reduced, with a pseudo-elasticity of -0.1424. This supports the claim that nursing homes in rural areas tend to have fewer beds than nursing homes in urban areas.
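The pseudo-elasticity for the indicator and the corresponding percentage effect both follow directly from β7, as this small check shows:

```python
import math

beta7 = -0.1331
e_rural = (math.exp(beta7) - 1) / math.exp(beta7)  # pseudo-elasticity for RURAL
pct_fewer = 100 * (math.exp(beta7) - 1)            # percent change in beds
```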
Model Checking

As usual, the goodness-of-fit of the model needs to be checked. The hypotheses are:

H0: the model has a good fit    versus    H1: the model has a lack of fit

From Figure 4.3, the Pearson chi-square statistic is X² = 216.1808 and the deviance statistic is D = 201.9478. From the statistical table, χ²0.05(44) = 60.4568. Since 216.1808 > 60.4568 and 201.9478 > 60.4568, H0 is rejected at significance level α = 0.05. This implies that the model has a lack of fit.
Residual Analysis

To find the Pearson residuals and adjusted residuals using SAS, the "obstats" and "residuals" options are added to the code:

proc genmod;
model bed = pcrev nsal fexp pcrev*nsal pcrev*fexp nsal*fexp rural /
dist=poi link=log offset=log_t obstats residuals;
run;
The output of this additional code is given in APPENDIX D. The "obstats" option provides the Pearson residuals (labeled "Reschi") while the "residuals" option provides the adjusted residuals (labeled "StReschi"), which adjust the Pearson residuals to be approximately normally distributed. Table 4.4 summarizes the results.
Table 4.4: Pearson residuals and adjusted residuals for the nursing home data

Obs.  Bed   Pearson Residual   Adjusted Residual
 1    244    7.0169656          7.9541283
 2     59    0.0842729          0.0874151
 3    120   -1.139224          -1.268365
 4    120   -1.42913           -1.687196
 5    120   -1.1648            -1.279099
 6     65   -0.29734           -0.312951
 7    120   -1.801313          -1.99645
 8     90    0.4307467          0.4500368
 9     96    4.5567171          4.8399814
10    120    3.6104925          4.1424735
11     62   -0.766451          -0.808099
12    120   -3.122378          -3.452756
13    116    1.1426998          1.20197
14     59    0.7616619          0.7904897
15     80   -0.222551          -0.232783
16    120   -0.631143          -0.670849
17     80    2.4420727          2.7090799
18    100   -1.426019          -1.482452
19     60   -1.221678          -1.276587
20    110    1.8056045          1.8957571
21    120    1.5718986          1.6769361
22    135    0.0515266          0.0576814
23     59   -0.478213          -0.494186
24     60   -1.35736           -1.413152
25     25   -2.097099          -2.308997
26    221    1.0280765          3.0256132
27     64   -0.169027          -0.183494
28     62   -1.096703          -1.162205
29    108    0.4340545          0.4660353
30     62   -0.754576          -0.809907
31     90   -0.680123          -0.894277
32    146    2.4808479          2.8560404
33     62    0.2899505          0.2998119
34     30    0.2524611          0.2699829
35     79   -1.574343          -1.666879
36     44   -0.956302          -1.011012
37    120   -2.032293          -2.44568
38    100    0.679786           0.6950838
39     49   -2.366709          -2.478751
40    123    1.5070629          1.7096722
41     82    5.2365404          5.3725172
42     58   -0.403804          -0.417828
43    110   -0.078899          -0.081086
44     62   -1.176382          -1.197609
45     86    2.4542185          2.5004873
46    102   -1.218256          -1.250821
47    135   -1.151268          -1.519447
48     78    1.9612958          2.0434853
49     83   -3.063645          -3.229351
50     60   -1.48926           -1.561096
51     54    1.9310157          2.0084219
52    120   -2.811129          -3.636855
From Table 4.4, it can be seen that many of the Pearson residuals and adjusted residuals are greater than 2 in absolute value. The observations with residuals greater than 2 in absolute value should be checked because they might be outliers. Since there are so many large residuals, the model does not fit the data very well, in agreement with the goodness-of-fit test result obtained previously.
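Any entry in Table 4.4 can be reproduced from the M3 coefficients in Figure 4.3; the sketch below recomputes the Pearson residual for unit 1 (the result agrees with the table to within coefficient rounding):

```python
import math

# M3 coefficients (Figure 4.3) and unit 1 of Table 4.1
b0, b_pcrev, b_nsal, b_fexp = -0.9730, -0.3478, 0.2783, 1.4676
b_pn, b_pf, b_nf, b_rural = 0.2655, 0.7085, -4.4965, -0.1331
bed, tdays = 244, 385
pcrev, nsal, fexp, rural = 2.3521, 0.523, 0.5334, 0

eta = (b0 + b_pcrev * pcrev + b_nsal * nsal + b_fexp * fexp
       + b_pn * pcrev * nsal + b_pf * pcrev * fexp
       + b_nf * nsal * fexp + b_rural * rural)
mu = tdays * math.exp(eta)             # the offset enters multiplicatively
pearson = (bed - mu) / math.sqrt(mu)   # about 7.017, as in Table 4.4
```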
Overdispersion

From "Criteria for Assessing Goodness of Fit" in Figure 4.3, it can be seen that the values of both the deviance and the Pearson chi-square are very much larger than their degrees of freedom. Furthermore, dividing the deviance and the Pearson chi-square by their degrees of freedom gives values greater than 1, namely X²/df = 4.9132 and D/df = 4.5897. Therefore, overdispersion exists in these data and Poisson regression is clearly not adequate to describe them.
A test for overdispersion with

Q1 = [1/√(2(52))] Σ_{i=1}^{52} [(Yi - µ̂i)² - Yi]/µ̂i = 16.0203

confirms the presence of overdispersion since Q1 > χ²0.05(1), that is, 16.0203 > 3.841.
The values of µ̂i (the predicted values) are easily obtained from SAS 9.1 using the following code:
proc genmod;
model bed = pcrev nsal fexp pcrev*nsal pcrev*fexp nsal*fexp rural /
dist=poi link=log offset=log_t;
output out = temp p = muhati;
run;
proc print data =temp (obs=52);
var bed muhati;
run;
The values of µ̂_i obtained from the above code are shown in APPENDIX E.
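The statistic above can be reproduced in a few lines once the fitted means are in hand. This is an illustrative translation; the function name is an assumption, not part of the thesis:

```python
import math

def overdispersion_q1(y, mu):
    """Q1 = (1 / sqrt(2n)) * sum(((y_i - mu_i)^2 - y_i) / mu_i).
    Large positive values indicate variance exceeding the mean."""
    n = len(y)
    s = sum(((yi - mi) ** 2 - yi) / mi for yi, mi in zip(y, mu))
    return s / math.sqrt(2 * n)
```

With the 52 observed counts and the fitted means from APPENDIX E, this reproduces the Q1 = 16.0203 compared against 3.841 above.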
In conjunction with the overdispersion test, a dispersion parameter, φ, is introduced
into the relationship between the variance and the mean as Var(Y) = φµ. Take note that
the “Scale” parameter in the “Analysis of Parameter Estimates” in the SAS output is
actually √φ, the square root of the dispersion parameter.
The scale parameter can be estimated using SAS 9.1. This can be done by either
specifying SCALE=DEVIANCE (or just DSCALE) or SCALE=PEARSON (or just
PSCALE). For nursing home data, Pearson is used to obtain the scale estimate. The
following codes are added to the codes in APPENDIX C and the program is run. Figure
4.4 shows the output. This output is called corrected Poisson regression.
proc genmod;
model bed = pcrev nsal fexp pcrev*nsal pcrev*fexp nsal*fexp rural /
dist=poi link=log scale=pearson offset=log_t;
run;
The GENMOD Procedure

Model Information
Data Set            WORK.NURSING_HOME
Distribution        Poisson
Link Function       Log
Dependent Variable  bed
Offset Variable     log_t

Number of Observations Read  52
Number of Observations Used  52

Criteria For Assessing Goodness Of Fit
Criterion            DF       Value    Value/DF
Deviance             44    201.9478      4.5897
Scaled Deviance      44     41.1031      0.9342
Pearson Chi-Square   44    216.1808      4.9132
Scaled Pearson X2    44     44.0000      1.0000
Log Likelihood            3555.3064

Algorithm converged.

Analysis Of Parameter Estimates
Parameter    DF  Estimate  Std Error  Wald 95% Conf. Limits  ChiSquare  Pr > ChiSq
Intercept     1   -0.9730     0.2419   -1.4471    -0.4989        16.18      <.0001
pcrev         1   -0.3478     0.1572   -0.6559    -0.0397         4.90      0.0269
nsal          1    0.2783     1.0732   -1.8251     2.3818         0.07      0.7954
fexp          1    1.4676     0.6274    0.2378     2.6973         5.47      0.0193
pcrev*nsal    1    0.2655     0.3314   -0.3841     0.9151         0.64      0.4231
pcrev*fexp    1    0.7085     0.5689   -0.4066     1.8236         1.55      0.2130
nsal*fexp     1   -4.4965     2.1171   -8.6460    -0.3471         4.51      0.0337
rural         1   -0.1331     0.0792   -0.2884     0.0222         2.82      0.0929
Scale         0    2.2166     0.0000    2.2166     2.2166

NOTE: The scale parameter was estimated by the square root of Pearson's Chi-Square/DOF.
Figure 4.4: SAS output for corrected Poisson regression
A “Scale” parameter equal to 1 in the output indicates that ordinary Poisson
regression is used (see the “Scale” value in Figure 4.3). A “Scale” value greater
than 1 indicates that the data are overdispersed and that a correction to ordinary
Poisson regression has been applied (see the “Scale” value in Figure 4.4).
The output in Figure 4.4 can also be referred to as the maximum quasi-likelihood
estimates and standard errors of the Poisson regression, since this corrected Poisson
regression is the same as the quasi-likelihood method. From Figure 4.4, it can be seen
that the scaled Pearson chi-square is now held fixed at 1 and the scale parameter is
equal to 2.2166 (√(X²/df) = √4.9132 = 2.2166). The parameter estimates are still the
same as in Figure 4.3. However, the standard errors are inflated by the scale factor,
√φ. Table 4.5 summarizes the comparison between the standard errors for ordinary
Poisson regression and corrected Poisson regression.
Table 4.5: Comparison of standard errors for ordinary Poisson regression and
corrected Poisson regression
Parameter     SE (ordinary Poisson regression)   SE (corrected Poisson regression)
Intercept           0.1091                            0.2419
PCREV               0.0709                            0.1572
NSAL                0.4842                            1.0732
FEXP                0.2831                            0.6274
PCREV*NSAL          0.1495                            0.3314
PCREV*FEXP          0.2567                            0.5689
NSAL*FEXP           0.9551                            2.1171
RURAL               0.0357                            0.0792
Table 4.5 shows that the corrected Poisson regression increases the standard
errors by approximately 121%, which is rather high. In other words, inflating the
standard errors by 121% adjusts for the apparent overdispersion. Clearly, the
variability of the response variable is understated if it is assumed to be pure
Poisson, and consequently, the estimated standard errors are understated too.
Also take note that the 95% confidence intervals for all coefficients in the
corrected Poisson regression (Figure 4.4) are much wider than those of the ordinary
Poisson regression obtained previously. In addition, the p-values are much higher, so
the significance tests are more conservative than those based on Poisson regression
(Figure 4.3) before adjustment for overdispersion. It can be seen that the RURAL
effect and the interaction effect between PCREV and FEXP are no longer significant
(p-value > 0.05), although ordinary Poisson regression reported them as significant.
PCREV, FEXP, and the interaction between NSAL and FEXP, however, remain significant.
Thus, it is clear that ignoring overdispersion will underestimate the standard
errors and will give misleading inference.
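The correction itself is mechanical: every ordinary Poisson standard error is multiplied by the scale factor √(X²/df). A minimal sketch with the values from the nursing home fit (variable names are illustrative):

```python
import math

# Pearson chi-square and its degrees of freedom from the Poisson fit
pearson_x2, df = 216.1808, 44
scale = math.sqrt(pearson_x2 / df)        # the "Scale" value reported by SAS

# inflate the ordinary Poisson standard errors by the scale factor
se_ordinary = {"Intercept": 0.1091, "PCREV": 0.0709}
se_corrected = {k: v * scale for k, v in se_ordinary.items()}
```

Multiplying every entry in the first column of Table 4.5 by this factor recovers the second column (up to SAS's internal rounding).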
4.5 Negative Binomial Regression
To run negative binomial regression using SAS 9.1, the distribution term is
changed to “NB” instead of “poi”. The PROC GENMOD command for negative
binomial regression is given as follows:
proc genmod;
model bed = pcrev nsal fexp pcrev*nsal pcrev*fexp nsal*fexp rural /
dist=NB link=log offset=log_t;
run;
Figure 4.5 shows the output of negative binomial regression for nursing home data.
4.6 Results and Discussion
The Model
For negative binomial regression, the model for the nursing home data is obtained as:

ln[E(BED_i)/TDAYS_i] = −0.9103 − 0.3869(PCREV)_i + 0.1557(NSAL)_i + 1.4298(FEXP)_i
                       + 0.3323(PCREV·NSAL)_i + 0.7532(PCREV·FEXP)_i
                       − 4.5658(NSAL·FEXP)_i − 0.1193(RURAL)_i
Interpretation of Coefficients
From Figure 4.5, it is found that the RURAL effect and the interaction effect
between PCREV and FEXP are no longer significant (p-value > 0.05), although
Poisson regression previously reported them as significant. PCREV, FEXP, and the
interaction between NSAL and FEXP, however, remain significant. Note that this is
the same as the result obtained from the corrected Poisson regression before.
The value of the coefficient for the RURAL factor indicates that each nursing home in
a rural area has fewer beds than those in urban areas by a factor of e^−0.1193 = 0.8875.
In other words, since |100(e^−0.1193 − 1)| = 11.25%, it can be said that nursing homes
in rural areas have 11.25% fewer beds than nursing homes in urban areas. This value is
less than that of Poisson regression by 1.21%. Furthermore, the 95% confidence interval
for β7 is −0.1193 ± 1.96 × 0.0705 or (−0.2575, 0.0189), and the corresponding confidence
interval for the multiplicative factor is obtained as (e^−0.2575, e^0.0189) = (0.7730, 1.0191).
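This rate-ratio arithmetic can be reproduced directly; a sketch using the thesis estimates (variable names are illustrative):

```python
import math

beta7, se7 = -0.1193, 0.0705              # RURAL coefficient and its SE
factor = math.exp(beta7)                  # multiplicative effect of RURAL
pct_fewer = abs(100 * (math.exp(beta7) - 1))  # percent fewer beds

# 95% Wald interval on the coefficient scale, then exponentiated
lo, hi = beta7 - 1.96 * se7, beta7 + 1.96 * se7
ci_factor = (math.exp(lo), math.exp(hi))  # CI for the multiplicative factor
```

The same three steps (exponentiate, convert to a percent change, exponentiate the Wald limits) apply to every coefficient interpreted in this section.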
The coefficient of the annual total patient care revenue (PCREV) factor indicates
that a one-unit change in annual total patient care revenue causes a 32.08% decrease
in the mean of the number of beds. This value is greater than that of Poisson
regression by 2.7%. In addition, the 95% confidence interval for β1 is
−0.3869 ± 1.96 × 0.1543 or (−0.6893, −0.0845), and the corresponding confidence
interval for the multiplicative factor is obtained as (e^−0.6893, e^−0.0845) = (0.5019, 0.9190).
A one-unit change in annual nursing salaries (NSAL) leads to a 16.85% increase in
the mean of the number of beds. This value is less than that of Poisson regression
by 15.24%. The 95% confidence interval for β2 is 0.1557 ± 1.96 × 0.9194 or
(−1.6463, 1.9577), and the corresponding confidence interval for the multiplicative
factor is found to be (e^−1.6463, e^1.9577) = (0.1928, 7.0830).
A one-unit change in annual facilities expenditure (FEXP), as in Poisson
regression, results in a very high increase in the mean of the number of beds.
The interaction factors between PCREV and NSAL and between PCREV and FEXP show
that a one-unit change in these two factors increases the mean number of beds by
39.42% and 112.38% respectively, whereas the interaction factor between NSAL and
FEXP shows that a one-unit change in this factor decreases the mean number of beds by
98.96%. Their 95% confidence intervals on the coefficient scale are (−0.2428, 0.9074),
(−0.2589, 1.7653), and (−8.4956, −0.6360) respectively.
The GENMOD Procedure

Model Information
Data Set            WORK.NURSING_HOME
Distribution        Negative Binomial
Link Function       Log
Dependent Variable  bed
Offset Variable     log_t

Number of Observations Read  52
Number of Observations Used  52

Criteria For Assessing Goodness Of Fit
Criterion            DF       Value    Value/DF
Deviance             44     52.3723      1.1903
Scaled Deviance      44     52.3723      1.1903
Pearson Chi-Square   44     57.5931      1.3089
Scaled Pearson X2    44     57.5931      1.3089
Log Likelihood           17509.1130

Algorithm converged.

Analysis Of Parameter Estimates
Parameter    DF  Estimate  Std Error  Wald 95% Conf. Limits  ChiSquare  Pr > ChiSq
Intercept     1   -0.9103     0.1989   -1.3002    -0.5205        20.95      <.0001
pcrev         1   -0.3869     0.1543   -0.6894    -0.0844         6.28      0.0122
nsal          1    0.1557     0.9194   -1.6464     1.9577         0.03      0.8656
fexp          1    1.4298     0.5118    0.4267     2.4329         7.81      0.0052
pcrev*nsal    1    0.3323     0.2934   -0.2427     0.9074         1.28      0.2573
pcrev*fexp    1    0.7532     0.5164   -0.2590     1.7654         2.13      0.1447
nsal*fexp     1   -4.5658     2.0050   -8.4955    -0.6361         5.19      0.0228
rural         1   -0.1193     0.0705   -0.2574     0.0188         2.87      0.0905
Dispersion    1    0.0300     0.0082    0.0141     0.0460

NOTE: The negative binomial dispersion parameter was estimated by maximum likelihood.
Figure 4.5: SAS output for negative binomial regression
Elasticity
Table 4.6 shows the elasticities computed for the explanatory variables in the
nursing home data for the negative binomial regression model.
Table 4.6: Elasticities of the explanatory variables in nursing home data for
negative binomial regression model
Explanatory Variables   Elasticity
PCREV                    -0.5498
NSAL                      0.0594
FEXP                      0.4071
RURAL                    -0.1267
The pseudo-elasticity for the RURAL variable is computed as follows:

E_RURAL = (exp(β7) − 1) / exp(β7) = (exp(−0.1193) − 1) / exp(−0.1193) = −0.1267
From Table 4.6, it can be seen that for negative binomial regression, a one
percent increase in annual total patient care revenue reduces the mean number of beds
in nursing homes by 0.5498 percent. A one percent increase in annual nursing salaries
or annual facilities expenditures, however, increases the mean number of beds in
nursing homes by 0.0594 and 0.4071 percent respectively. Furthermore, it is found
that the mean number of beds in nursing homes in rural areas is lower by 0.1267
percent. As in Poisson regression, this clearly shows that nursing homes in rural
areas tend to have fewer beds than nursing homes in urban areas.
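The pseudo-elasticity formula above needs only the coefficient, so it can be checked in one line; this is an illustrative sketch (the function name is an assumption), using the RURAL estimate from Figure 4.5:

```python
import math

def pseudo_elasticity(beta):
    """Pseudo-elasticity for a dummy variable:
    (exp(beta) - 1) / exp(beta)."""
    return (math.exp(beta) - 1) / math.exp(beta)
```

The elasticities for the continuous variables in Table 4.6 additionally require the sample means of PCREV, NSAL, and FEXP, which are not reproduced here.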
Model Checking
From the negative binomial regression output in Figure 4.5, the value of the
Pearson chi-square statistic is X² = 57.5931 and the value of the deviance statistic
is D = 52.3723. Both values are quite close to the degrees of freedom, suggesting
that the model is adequate. To verify this, from the statistical table the critical
value is χ²_0.05(44) = 60.4568, that is, H0 is rejected when X² or D is greater
than 60.4568.
Since 57.5931 < 60.4568 and 52.3723 < 60.4568, H0 is not rejected at
significance level α = 0.05. This implies that the model has a good fit. Thus, it can
be said that the negative binomial regression model is more adequate in fitting the
nursing home data than the Poisson regression model. This is because the data is
overdispersed, and the negative binomial model accounts for overdispersion, unlike
Poisson regression.
Residuals Analysis
The SAS output of residuals for negative binomial regression is given in APPENDIX
F. Table 4.7 summarizes the result. Most of the Pearson residuals and adjusted
residuals in Table 4.7 are small; only a few exceed 2 in absolute value. This
suggests that the model fits the data well and agrees with the goodness-of-fit
test result.
Table 4.7: Residuals for negative binomial regression
Obs.  Bed   Pearson Residuals   Adjusted Residuals
1     244     2.8981131           3.2134736
2      59    -0.063957           -0.066634
3     120    -0.519342           -0.574014
4     120    -0.648106           -0.750932
5     120    -0.487942           -0.528478
6      65    -0.242413           -0.255811
7     120    -0.859559           -0.938295
8      90     0.1807227           0.1889604
9      96     2.6168761           2.8116326
10    120     1.8297598           2.0763753
11     62    -0.48599            -0.515769
12    120    -1.346897           -1.471129
13    116     0.6125658           0.6444858
14     59     0.3172917           0.3311324
15     80    -0.165974           -0.173677
16    120    -0.359472           -0.379287
17     80     1.3094942           1.4735788
18    100    -0.727722           -0.755202
19     60    -0.807629           -0.84203
20    110     0.8794827           0.9264853
21    120     0.8163993           0.8707447
22    135     0.0093543           0.010448
23     59    -0.394395           -0.407917
24     60    -0.75373            -0.792121
25     25    -1.515122           -1.748939
26    221     0.4986813           1.1168278
27     64    -0.088956           -0.099318
28     62    -0.666765           -0.709793
29    108     0.1909746           0.2040666
30     62    -0.555559           -0.598065
31     90    -0.246757           -0.325236
32    146     1.195937            1.3787255
33     62     0.0402785           0.0417337
34     30     0.0587035           0.0670965
35     79    -0.818759           -0.863585
36     44    -0.734942           -0.787187
37    120    -0.927702           -1.083767
38    100     0.2954252           0.3032409
39     49    -1.415236           -1.463142
40    123     0.6269081           0.7132831
41     82     3.1921869           3.2940407
42     58    -0.341776           -0.3554
43    110    -0.089158           -0.091424
44     62    -0.731093           -0.744943
45     86     1.3567572           1.3879271
46    102    -0.6224             -0.636486
47    135    -0.491999           -0.626695
48     78     1.0713815           1.1240668
49     83    -1.4873             -1.548535
50     60    -0.856827           -0.902649
51     54     1.1228299           1.1857509
52    120    -1.19399            -1.48398
Overdispersion
The SAS output for negative binomial regression reports a “Dispersion” parameter
in the “Analysis of Parameter Estimates” instead of the “Scale” parameter reported
for the ordinary and corrected Poisson regressions. This value is the α in the
variance equation for negative binomial regression, which is given as

Var(Y) = µ(1 + αµ) = µ + αµ²
If this value is greater than zero, then negative binomial regression should
evidently be used in favour of Poisson regression. From Figure 4.5, the dispersion
parameter is equal to 0.03. Therefore, negative binomial regression is preferred to
Poisson regression.
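The variance function makes the difference concrete: the extra-Poisson variation αµ² grows with the mean. A minimal sketch (function names are illustrative), evaluated at the fitted α = 0.03:

```python
def poisson_var(mu):
    return mu                      # Poisson: Var(Y) = mu

def negbin_var(mu, alpha):
    return mu * (1 + alpha * mu)   # negative binomial: Var(Y) = mu + alpha * mu^2

# with alpha = 0.03, a mean of 100 carries variance 400, not 100
excess = negbin_var(100, 0.03) - poisson_var(100)
```

Even a small α therefore matters for counts of the size seen in the nursing home data.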
CHAPTER 5
SIMULATION STUDY
5.1 Data Simulation
The main purpose of this simulation study is to examine the performance of Poisson
regression as well as negative binomial regression in analyzing count data with and
without overdispersion. The performance is examined based on the goodness-of-fit
test and other criteria such as the significance of the coefficients, confidence
intervals, and standard errors.
This study considers two conditions of the data to be analyzed, that is,
overdispersed and non-overdispersed data. The data are simulated using R 2.9.2
software according to three distinct values of µ, that is, µ = 10, µ = 20 and µ = 50.
The values of µ are selected to be large (greater than 5) because when the means are
large, the Pearson chi-square and deviance statistics approximately follow their
nominal chi-square distributions. If the means are less than 5, it is potentially
misleading to rely on the Pearson chi-square and deviance tests as tools for
assessing goodness-of-fit.
The sample size for each data set is 100. This large sample size is selected
because the sampling distribution of the estimated coefficients is approximately
normal when the sample size is large, which makes the tests of significance and
confidence interval results more valid. Two types of data are simulated, namely
overdispersed and non-overdispersed data, so there are six sets of data altogether.
All simulated data are provided in APPENDIX G.
This simulation study begins with the analysis of count data without
overdispersion followed by the analysis of count data with overdispersion. All data are
analyzed by using SAS 9.1.
5.2 Analysis of Data with No Overdispersion
The data is first analyzed using Poisson regression, followed by negative
binomial regression. Figures 5.1, 5.2, and 5.3 show the Poisson regression SAS output
for µ = 10, µ = 20 and µ = 50 respectively, whereas Figures 5.4, 5.5, and 5.6 show
the negative binomial regression SAS output for µ = 10, µ = 20 and µ = 50 respectively.
The GENMOD Procedure

Model Information
Data Set            WORK.MU10
Distribution        Poisson
Link Function       Log
Dependent Variable  y

Number of Observations Read  100
Number of Observations Used  100

Criteria For Assessing Goodness Of Fit
Criterion            DF       Value    Value/DF
Deviance             98     82.2942      0.8397
Scaled Deviance      98     82.2942      0.8397
Pearson Chi-Square   98     81.2841      0.8294
Scaled Pearson X2    98     81.2841      0.8294
Log Likelihood            1504.8773

Algorithm converged.

Analysis Of Parameter Estimates
Parameter  DF  Estimate  Std Error  Wald 95% Conf. Limits  ChiSquare  Pr > ChiSq
Intercept   1    2.4679     0.0770    2.3169     2.6189      1026.30      <.0001
x           1   -0.0357     0.0307   -0.0959     0.0245         1.35      0.2455
Scale       0    1.0000     0.0000    1.0000     1.0000

NOTE: The scale parameter was held fixed.
Figure 5.1: Poisson regression SAS output of non-overdispersed data for µ = 10
The GENMOD Procedure

Model Information
Data Set            WORK.MU20
Distribution        Poisson
Link Function       Log
Dependent Variable  y

Number of Observations Read  100
Number of Observations Used  100

Criteria For Assessing Goodness Of Fit
Criterion            DF       Value    Value/DF
Deviance             98     76.4468      0.7801
Scaled Deviance      98     76.4468      0.7801
Pearson Chi-Square   98     75.8861      0.7743
Scaled Pearson X2    98     75.8861      0.7743
Log Likelihood            4499.8443

Algorithm converged.

Analysis Of Parameter Estimates
Parameter  DF  Estimate  Std Error  Wald 95% Conf. Limits  ChiSquare  Pr > ChiSq
Intercept   1    3.1963     0.0772    3.0451     3.3476      1714.57      <.0001
x           1   -0.0399     0.0247   -0.0884     0.0085         2.61      0.1061
Scale       0    1.0000     0.0000    1.0000     1.0000

NOTE: The scale parameter was held fixed.
Figure 5.2: Poisson regression SAS output of non-overdispersed data for µ = 20
The GENMOD Procedure

Model Information
Data Set            WORK.MU50
Distribution        Poisson
Link Function       Log
Dependent Variable  y

Number of Observations Read  100
Number of Observations Used  100

Criteria For Assessing Goodness Of Fit
Criterion            DF       Value    Value/DF
Deviance             98     84.5274      0.8625
Scaled Deviance      98     84.5274      0.8625
Pearson Chi-Square   98     83.4692      0.8517
Scaled Pearson X2    98     83.4692      0.8517
Log Likelihood           14170.7596

Algorithm converged.

Analysis Of Parameter Estimates
Parameter  DF  Estimate  Std Error  Wald 95% Conf. Limits  ChiSquare  Pr > ChiSq
Intercept   1    3.9649     0.0580    3.8513     4.0785      4676.62      <.0001
x           1   -0.0188     0.0145   -0.0472     0.0096         1.68      0.1944
Scale       0    1.0000     0.0000    1.0000     1.0000

NOTE: The scale parameter was held fixed.
Figure 5.3: Poisson regression SAS output of non-overdispersed data for µ = 50
The GENMOD Procedure

Model Information
Data Set            WORK.MU10
Distribution        Negative Binomial
Link Function       Log
Dependent Variable  y

Number of Observations Read  100
Number of Observations Used  100

Criteria For Assessing Goodness Of Fit
Criterion            DF       Value    Value/DF
Deviance             98    101.4523      1.0352
Scaled Deviance      98    101.4523      1.0352
Pearson Chi-Square   98    100.2999      1.0235
Scaled Pearson X2    98    100.2999      1.0235
Log Likelihood            1505.9236

Algorithm converged.

Analysis Of Parameter Estimates
Parameter   DF  Estimate  Std Error  Wald 95% Conf. Limits  ChiSquare  Pr > ChiSq
Intercept    1    2.4681     0.0691    2.3326     2.6036      1274.65      <.0001
x            1   -0.0358     0.0277   -0.0900     0.0184         1.67      0.1959
Dispersion   1   -0.0175     0.0104   -0.0380     0.0029

NOTE: The negative binomial dispersion parameter was estimated by maximum likelihood.
Figure 5.4: Negative binomial regression SAS output of non-overdispersed data for µ = 10
The GENMOD Procedure

Model Information
Data Set            WORK.MU20
Distribution        Negative Binomial
Link Function       Log
Dependent Variable  y

Number of Observations Read  100
Number of Observations Used  100

Criteria For Assessing Goodness Of Fit
Criterion            DF       Value    Value/DF
Deviance             98    100.1980      1.0224
Scaled Deviance      98    100.1980      1.0224
Pearson Chi-Square   98     99.5042      1.0153
Scaled Pearson X2    98     99.5042      1.0153
Log Likelihood            4501.5529

Algorithm converged.

Analysis Of Parameter Estimates
Parameter   DF  Estimate  Std Error  Wald 95% Conf. Limits  ChiSquare  Pr > ChiSq
Intercept    1    3.1971     0.0674    3.0650     3.3293      2247.54      <.0001
x            1   -0.0402     0.0217   -0.0827     0.0023         3.44      0.0635
Dispersion   1   -0.0109     0.0049   -0.0205    -0.0014

NOTE: The negative binomial dispersion parameter was estimated by maximum likelihood.
Figure 5.5: Negative binomial regression SAS output of non-overdispersed data for µ = 20
The GENMOD Procedure

Model Information
Data Set            WORK.MU50
Distribution        Negative Binomial
Link Function       Log
Dependent Variable  y

Number of Observations Read  100
Number of Observations Used  100

Criteria For Assessing Goodness Of Fit
Criterion            DF       Value    Value/DF
Deviance             98    101.2727      1.0334
Scaled Deviance      98    101.2727      1.0334
Pearson Chi-Square   98    100.2162      1.0226
Scaled Pearson X2    98    100.2162      1.0226
Log Likelihood           14171.5370

Algorithm converged.

Analysis Of Parameter Estimates
Parameter   DF  Estimate  Std Error  Wald 95% Conf. Limits  ChiSquare  Pr > ChiSq
Intercept    1    3.9649     0.0528    3.8614     4.0684      5633.55      <.0001
x            1   -0.0188     0.0132   -0.0447     0.0071         2.02      0.1548
Dispersion   1   -0.0034     0.0024   -0.0081     0.0013

NOTE: The negative binomial dispersion parameter was estimated by maximum likelihood.
Figure 5.6: Negative binomial regression SAS output of non-overdispersed data for µ = 50
5.2.1 Results and Discussion
5.2.1.1 Goodness-of-fit
Table 5.1 summarizes the Pearson chi-square and deviance values for Poisson
regression and negative binomial regression obtained from the non-overdispersed data
output. It can be seen that, for data with no overdispersion, the values of the
Pearson chi-square and deviance for both Poisson regression and negative binomial
regression are small and close to the degrees of freedom. This indicates that both
models have a good fit.
Table 5.1: Pearson chi-square and deviance for Poisson regression and negative
binomial regression obtained from data that has no overdispersion
Model               µ        Pearson chi-square   Deviance
Poisson             µ = 10        81.2841          82.2942
                    µ = 20        75.8861          76.4468
                    µ = 50        83.4692          84.5274
Negative Binomial   µ = 10       100.2999         101.4523
                    µ = 20        99.5042         100.1980
                    µ = 50       100.2162         101.2727
To verify, all values of the Pearson chi-square and deviance for Poisson regression
and negative binomial regression are compared with the critical value obtained from
the statistical table, χ²_0.05(98) = 122.1026. None of the values is greater
than 122.1026. Thus, H0 is not rejected at significance level α = 0.05, which implies
that the Poisson regression and negative binomial regression models are both adequate
for data that has no overdispersion.
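The tabled critical value can be checked numerically with a stdlib-only sketch of the chi-square CDF, computed via the series for the regularized lower incomplete gamma function (the function name is an assumption):

```python
import math

def chi2_cdf(x, df):
    """P(X <= x) for a chi-square variable with df degrees of freedom,
    via the series expansion of the regularized lower incomplete
    gamma function P(df/2, x/2)."""
    a, t = df / 2.0, x / 2.0
    term = math.exp(a * math.log(t) - t - math.lgamma(a + 1.0))
    total, n = term, 1
    while term > total * 1e-15:
        term *= t / (a + n)   # next series term: multiply by t / (a + n)
        total += term
        n += 1
    return total
```

For example, chi2_cdf(122.1026, 98) is approximately 0.95, consistent with the critical value χ²_0.05(98) used above.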
However, Poisson regression is found to fit slightly better than negative binomial
regression, since its Pearson chi-square and deviance values for each case of µ are
less than those of negative binomial regression. Therefore, Poisson regression is
preferred to negative binomial regression when the data has no overdispersion. This
is due to the fact that when there is no overdispersion, the value of α in the
negative binomial regression model (see section 3.9.2) is equal to zero, which
reduces negative binomial regression to Poisson regression. Thus, it is better to
use Poisson regression instead of negative binomial regression when the data has no
overdispersion in order to obtain a more accurate result. In other words, Poisson
regression is more reliable for the analysis of non-overdispersed data.
5.2.1.2 Significance, Confidence Intervals, and Standard Errors
From Figures 5.1 to 5.6, it can be seen that for all cases of µ, β1 is found to be
insignificant (p-value > 0.05) in both Poisson regression and negative binomial
regression. Furthermore, the 95% confidence intervals for both models are
approximately the same, as are the standard errors for each case of µ. This result
complies with the above explanation.
5.3 Analysis of Data with Overdispersion
The data is first analyzed using Poisson regression, followed by negative
binomial regression. Figures 5.7, 5.8, and 5.9 show the Poisson regression SAS output
for µ = 10, µ = 20 and µ = 50 respectively, whereas Figures 5.10, 5.11, and 5.12 show
the negative binomial regression SAS output for µ = 10, µ = 20 and µ = 50 respectively.
The GENMOD Procedure

Model Information
Data Set            WORK.MU10
Distribution        Poisson
Link Function       Log
Dependent Variable  y

Number of Observations Read  100
Number of Observations Used  100

Criteria For Assessing Goodness Of Fit
Criterion            DF       Value    Value/DF
Deviance             98    705.9973      7.2041
Scaled Deviance      98    705.9973      7.2041
Pearson Chi-Square   98    672.2758      6.8600
Scaled Pearson X2    98    672.2758      6.8600
Log Likelihood             818.9899

Algorithm converged.

Analysis Of Parameter Estimates
Parameter  DF  Estimate  Std Error  Wald 95% Conf. Limits  ChiSquare  Pr > ChiSq
Intercept   1    1.9546     0.0822    1.7934     2.1158       564.79      <.0001
x           1    0.0426     0.0321   -0.0204     0.1056         1.76      0.1852
Scale       0    1.0000     0.0000    1.0000     1.0000

NOTE: The scale parameter was held fixed.
Figure 5.7: Poisson regression SAS output of overdispersed data for µ = 10
The GENMOD Procedure

Model Information
Data Set            WORK.MU20
Distribution        Poisson
Link Function       Log
Dependent Variable  y

Number of Observations Read  100
Number of Observations Used  100

Criteria For Assessing Goodness Of Fit
Criterion            DF       Value    Value/DF
Deviance             98   1180.0775     12.0416
Scaled Deviance      98   1180.0775     12.0416
Pearson Chi-Square   98   1233.3790     12.5855
Scaled Pearson X2    98   1233.3790     12.5855
Log Likelihood            2895.4924

Algorithm converged.

Analysis Of Parameter Estimates
Parameter  DF  Estimate  Std Error  Wald 95% Conf. Limits  ChiSquare  Pr > ChiSq
Intercept   1    2.4777     0.0894    2.3024     2.6529       767.68      <.0001
x           1    0.1046     0.0289    0.0479     0.1612        13.09      0.0003
Scale       0    1.0000     0.0000    1.0000     1.0000

NOTE: The scale parameter was held fixed.
Figure 5.8: Poisson regression SAS output of overdispersed data for µ = 20
The GENMOD Procedure

Model Information
Data Set            WORK.MU50
Distribution        Poisson
Link Function       Log
Dependent Variable  y

Number of Observations Read  100
Number of Observations Used  100

Criteria For Assessing Goodness Of Fit
Criterion            DF       Value    Value/DF
Deviance             98   3905.6126     39.8532
Scaled Deviance      98   3905.6126     39.8532
Pearson Chi-Square   98   4286.7770     43.7426
Scaled Pearson X2    98   4286.7770     43.7426
Log Likelihood           14355.4429

Algorithm converged.

Analysis Of Parameter Estimates
Parameter  DF  Estimate  Std Error  Wald 95% Conf. Limits  ChiSquare  Pr > ChiSq
Intercept   1    3.3669     0.0564    3.2563     3.4774      3562.18      <.0001
x           1    0.1328     0.0134    0.1066     0.1591        98.60      <.0001
Scale       0    1.0000     0.0000    1.0000     1.0000

NOTE: The scale parameter was held fixed.
Figure 5.9: Poisson regression SAS output of overdispersed data for µ = 50
The GENMOD Procedure

Model Information
Data Set            WORK.MU10
Distribution        Negative Binomial
Link Function       Log
Dependent Variable  y

Number of Observations Read  100
Number of Observations Used  100

Criteria For Assessing Goodness Of Fit
Criterion            DF       Value    Value/DF
Deviance             98    117.8710      1.2028
Scaled Deviance      98    117.8710      1.2028
Pearson Chi-Square   98     72.8406      0.7433
Scaled Pearson X2    98     72.8406      0.7433
Log Likelihood            1018.0347

Algorithm converged.

Analysis Of Parameter Estimates
Parameter   DF  Estimate  Std Error  Wald 95% Conf. Limits  ChiSquare  Pr > ChiSq
Intercept    1    1.9286     0.2699    1.3995     2.4576        51.05      <.0001
x            1    0.0540     0.1095   -0.1606     0.2686         0.24      0.6217
Dispersion   1    1.0577     0.1797    0.7055     1.4098

NOTE: The negative binomial dispersion parameter was estimated by maximum likelihood.
Figure 5.10: Negative binomial regression SAS output of overdispersed data for µ = 10
The GENMOD Procedure

Model Information
Data Set            WORK.MU20
Distribution        Negative Binomial
Link Function       Log
Dependent Variable  y

Number of Observations Read  100
Number of Observations Used  100

Criteria For Assessing Goodness Of Fit
Criterion            DF       Value    Value/DF
Deviance             98    114.8235      1.1717
Scaled Deviance      98    114.8235      1.1717
Pearson Chi-Square   98     86.1879      0.8795
Scaled Pearson X2    98     86.1879      0.8795
Log Likelihood            3308.7693

Algorithm converged.

Analysis Of Parameter Estimates
Parameter   DF  Estimate  Std Error  Wald 95% Conf. Limits  ChiSquare  Pr > ChiSq
Intercept    1    2.4560     0.3426    1.7845     3.1275        51.39      <.0001
x            1    0.1120     0.1136   -0.1107     0.3346         0.97      0.3242
Dispersion   1    0.8303     0.1248    0.5857     1.0749

NOTE: The negative binomial dispersion parameter was estimated by maximum likelihood.
Figure 5.11: Negative binomial regression SAS output of overdispersed data for µ = 20
The GENMOD Procedure

Model Information
Data Set            WORK.MU50
Distribution        Negative Binomial
Link Function       Log
Dependent Variable  y

Number of Observations Read  100
Number of Observations Used  100

Criteria For Assessing Goodness Of Fit
Criterion            DF       Value    Value/DF
Deviance             98    112.5855      1.1488
Scaled Deviance      98    112.5855      1.1488
Pearson Chi-Square   98     92.8749      0.9477
Scaled Pearson X2    98     92.8749      0.9477
Log Likelihood           16079.2532

Algorithm converged.

Analysis Of Parameter Estimates
Parameter   DF  Estimate  Std Error  Wald 95% Conf. Limits  ChiSquare  Pr > ChiSq
Intercept    1    3.4228     0.3484    2.7400     4.1056        96.54      <.0001
x            1    0.1189     0.0852   -0.0481     0.2858         1.95      0.1628
Dispersion   1    0.8877     0.1172    0.6579     1.1175

NOTE: The negative binomial dispersion parameter was estimated by maximum likelihood.
Figure 5.12: Negative binomial regression SAS output of overdispersed data for µ = 50
5.3.1 Results and Discussion
5.3.1.1 Goodness-of-fit
Table 5.2 summarizes the Pearson chi-square and deviance values for Poisson
regression and negative binomial regression obtained from the overdispersed data
output. It can be seen that for overdispersed data, the Pearson chi-square and
deviance values for Poisson regression are very large, while those for negative
binomial regression are small. In fact, the Poisson regression values are much
larger than the negative binomial ones. This indicates that Poisson regression
lacks fit while negative binomial regression has a good fit for all cases of µ. It
also indicates that Poisson regression is not reliable for analyzing overdispersed data.
Table 5.2: Pearson chi-square and deviance for Poisson regression and negative
binomial regression obtained from overdispersed data
Model               µ        Pearson chi-square    Deviance
Poisson             µ = 10        672.2758          705.9973
                    µ = 20       1233.3790         1180.0775
                    µ = 50       4286.7770         3905.6126
Negative Binomial   µ = 10         72.8406          117.8710
                    µ = 20         86.1879          114.8235
                    µ = 50         92.8749          112.5855
To formally test the goodness-of-fit of the models, again, all Pearson chi-square
and deviance values are compared with the critical value obtained from the
statistical table, χ²_0.05(98) = 122.1026. It is found that all Pearson chi-square
and deviance values for the Poisson regression model are greater than 122.1026.
Therefore, for all cases of µ, Poisson regression lacks fit, that is, H0 is rejected
at significance level α = 0.05.
However, for negative binomial regression, all Pearson chi-square and deviance
values are less than 122.1026. Thus, for all cases of µ , negative binomial regression has
a good fit, that is, H 0 is not rejected at significance level, α = 0.05 .
This result clearly shows that Poisson regression is not adequate when applied to
overdispersed data. In contrast, negative binomial regression can always handle
overdispersion. This is due to the fact that its variance naturally exceeds its mean.
5.3.1.2 Significance, Confidence Intervals, and Standard Errors
For µ = 10 , β1 is statistically insignificant (p-values > 0.05) in both Poisson
regression (see Figure 5.7) and negative binomial regression (see Figure 5.10) models.
Nevertheless, for µ = 20 and µ = 50 , β1 is significant in Poisson regression (see Figure
5.8 and Figure 5.9) but not significant in negative binomial regression (see Figure 5.11
and Figure 5.12).
The 95% confidence intervals for negative binomial regression are much wider
than the 95% confidence intervals for Poisson regression when the data is overdispersed.
Thus, it is clear that, when overdispersion exists, applying the wrong model can change
the statistical inference.
The standard errors for negative binomial regression are substantially larger than
the standard errors obtained when Poisson regression is used to analyze
overdispersed data. This shows that Poisson regression can underestimate the
standard errors when overdispersion exists.
5.4 Conclusions
From the analysis of the simulated data, it is found that for data with no
overdispersion, both Poisson regression and negative binomial regression are adequate,
that is, both models have a good fit. The significance results for the coefficients
are the same for both models. In addition, the 95% confidence intervals as well as the
standard errors are also approximately the same. Nevertheless, Poisson regression is
more reliable than negative binomial regression, since negative binomial regression
actually reduces to Poisson regression when there is no overdispersion. Even though
the analyses for both models give approximately the same results, Poisson regression
can provide a more accurate result.
On the other hand, when overdispersion occurs, Poisson regression is no longer
adequate. Applying Poisson regression can underestimate the standard errors and
lead to wrong inference. The significance results for the coefficients can also be doubtful;
that is, when Poisson regression is used to analyze overdispersed data, one tends to
reject H0 in the test of significance when it actually should not be rejected.
This simulation study shows that negative binomial regression can be relied on
when one encounters the overdispersion problem. Negative binomial regression remains
adequate and provides correct results when the data is overdispersed.
5.5 The Simulation Codes
As mentioned earlier, the data in this simulation study are simulated using R
2.9.2. This section shows how the data are simulated: Section 5.5.1 discusses
the codes for simulating non-overdispersed data, while Section 5.5.2 discusses the codes
for simulating overdispersed data.
5.5.1 Codes for the simulation of non-overdispersed data
Below are the R codes for the simulation of non-overdispersed data:
b0 <- 0                                  # intercept
b1 <- 1                                  # slope
n <- 100                                 # sample size
p <- matrix(0, n)                        # Poisson probabilities
y <- matrix(0, n)                        # simulated responses
x <- rnorm(n, 0, 1)                      # covariate
miu <- round(exp(b0 + (b1/n) * sum(x[1:n])))  # mean, using the average of x
u <- runif(n, 0, 1)                      # uniform draws for inverse-CDF sampling
for (i in 1:n) {
  p[i] <- dpois(i - 1, miu, log = FALSE) # P(Y = i - 1)
}
for (i in 1:n) {
  s <- p[1]
  k <- 1
  while (s < u[i]) {                     # accumulate the CDF until it passes u[i]
    k <- k + 1
    s <- sum(p[1:k])
  }
  y[i] <- k                              # inverse-CDF draw
}
First, the value for b0 is set to zero and b1 is set to one such that the following
model is obtained:

ln µ = b0 + b1X = X    (5.1)
The sample size, n, is set to 100 while Y is set up in vector form. Next, X is
defined to follow a normal distribution. X is important for the determination of µ. From
the codes, µ is defined as

µ = e^(b0 + b1x)    (5.2)
Since b0 = 0 and b1 = 1, µ becomes

µ = e^x    (5.3)
From the above codes, suppose we want data with µ = 10. From equation
(5.3), we obtain x = ln µ = ln(10) = 2.3026. We then only need to change the
mean value in the definition of X as follows:
x<-rnorm(n,2.3026,1)
Data is simulated in such a way that Y follows the Poisson distribution.
Therefore, the data is not overdispersed.
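The inverse-CDF loop above reproduces what a built-in Poisson sampler does in one call. As an illustrative cross-check (a Python/numpy sketch, not code from the thesis), a non-overdispersed sample with the same structure can be drawn directly:

```python
import numpy as np

rng = np.random.default_rng(42)

n, mu = 100, 10              # sample size and target mean (the mu = 10 case)
y = rng.poisson(mu, size=n)  # direct Poisson draws, no inverse-CDF loop

# For Poisson data the sample variance stays close to the sample mean,
# confirming the absence of overdispersion.
print(y.mean(), y.var(ddof=1))
```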
5.5.2 Codes for the simulation of overdispersed data
Below are the R codes for the simulation of overdispersed data:
b0 <- 0                                  # intercept
b1 <- 1                                  # slope
n <- 100                                 # sample size
alpha <- 1                               # negative binomial dispersion parameter
y <- matrix(0, n)                        # simulated responses
x <- rnorm(n, 2, 1)                      # covariate
m <- round(exp(b0 + (b1/n) * sum(x[1:n])))      # mean, using the average of x
y <- rnbinom(n, size = round(1/alpha), mu = m)  # negative binomial draws
mean(y); var(y)                          # the variance should exceed the mean
As in the simulation of non-overdispersed data, the value for b0 is set to
zero and b1 is set to one. The sample size, n, is set to 100 while Y is set up in vector
form. Note that "alpha" in the codes is α in the negative binomial probability
distribution; it is set to one for simplicity.
Data is simulated in such a way that Y follows the negative binomial
distribution. This is to ensure that the variance of the data is greater than the mean.
Therefore the data is overdispersed.
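With size = 1/α, the negative binomial draw has mean m and variance m + αm², so with α = 1 the variance is far above the mean. The R call rnbinom(n, size=round(1/alpha), mu=m) can be mirrored in Python/numpy (an illustrative sketch, not code from the thesis) by converting the (size, mu) parametrization to numpy's (n, p) form:

```python
import numpy as np

rng = np.random.default_rng(0)

n_obs, alpha, mu = 100, 1.0, 10.0
r = 1.0 / alpha                     # the "size" parameter of rnbinom
p = r / (r + mu)                    # numpy's negative_binomial takes (r, p)
y = rng.negative_binomial(r, p, size=n_obs)

# Theoretical variance mu + alpha*mu**2 = 110 dwarfs the mean of 10,
# so the sample variance should clearly exceed the sample mean.
print(y.mean(), y.var(ddof=1))
```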
CHAPTER 6
SUMMARY AND CONCLUSIONS
6.1 Summary
The analysis of Poisson regression in this study begins with the formulation of
the model. It is then followed by the estimation of parameters using the maximum
likelihood estimation (MLE) method. After the estimates are obtained, they must be
interpreted. The interpretation of coefficients in Poisson regression differs from that in
other regressions because of the natural log used in the model: the coefficient values
need to be exponentiated in order to get the correct interpretation. After interpreting the
coefficients, the goodness-of-fit of the model is checked using the Pearson chi-square
and deviance statistics. This study also focuses on residual analysis, which can likewise
give some idea about the goodness-of-fit of the model. The test of significance is done
after residual analysis, and confidence intervals also need to be computed. After that,
the test for overdispersion is done. If overdispersion is present, the quasi-likelihood
method and negative binomial regression are used to overcome the situation.
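The quasi-likelihood correction mentioned above is commonly operationalized by estimating a dispersion factor φ = Pearson χ²/(n − p) and inflating the naive Poisson standard errors by √φ. A minimal sketch of this convention (an illustration, not code from the thesis):

```python
import math

def quasi_poisson_se(pearson_chi2, n, p, poisson_se):
    # phi estimates the dispersion; phi = 1 means no overdispersion and
    # leaves the Poisson standard errors unchanged.
    phi = pearson_chi2 / (n - p)
    return [se * math.sqrt(phi) for se in poisson_se]

# A Pearson chi-square of 392 on n - p = 98 df gives phi = 4,
# doubling every standard error.
print(quasi_poisson_se(392.0, 100, 2, [0.05, 0.10]))  # [0.1, 0.2]
```

With φ = 1 the correction changes nothing, consistent with the finding that both models agree when there is no overdispersion.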
This study also presents a simulation study. Its purpose is to examine the
performance of Poisson regression and negative binomial regression in analyzing data
with and without overdispersion.
6.2 Conclusions
This study focuses on the analysis of Poisson regression as well as on the
overdispersion problem. It is found that overdispersion can cause underestimation of
standard errors, which then leads to erroneous inference; the test of significance results
can also be suspect. Overdispersion can be handled using the quasi-likelihood method,
which is the simplest method and is most appropriate when the cause
of overdispersion is unknown. Overdispersion can also be handled using negative
binomial regression.
The simulation study reveals that when the data has overdispersion, negative
binomial regression gives good and correct results, while Poisson
regression is clearly not adequate for overdispersed data since it gives wrong results.
However, when the data has no overdispersion, Poisson regression is more reliable.
6.3 Recommendations
The recommendations for future research or study on Poisson regression are:
i) Extend the study on how to overcome other problems in Poisson regression such
as truncation and censoring as well as excess zeros.
ii) Do analysis on time series data, multivariate data, and panel data.
iii) Modify the classical Poisson regression to more robust Poisson regression.
iv) Use other statistical software for the analysis instead of SAS 9.1. Other software
that can be used are Stata, S-Plus, SPSS and R.
REFERENCES
Agresti, A. 1996. An Introduction to Categorical Data Analysis. New York: John Wiley
& Sons, Inc.
Atkins, D. C. and Gallop, R. J. 2007. Rethinking How Family Researchers Model
Infrequent Outcomes: A Tutorial on Count Regression and Zero-Inflated Models.
Journal of Family Psychology. Vol. 21, No. 4: 726 – 735.
Bailer, A. J., Reed, L. D., and Stayner, L. T. 1997. Modeling Fatal Injury Rates Using
Poisson Regression: A Case Study of Workers in Agriculture, Forestry, and
Fishing. Journal of Safety Research. 28(3): 177 – 186.
Cameron, A. C. and Trivedi, P. K. 1998. Regression Analysis of Count Data.
Cambridge, UK: Cambridge University Press.
Carrivick, P. J. W., Lee, A. H., and Yau, K. K. W. 2003. Zero-Inflated Poisson
Modeling to Evaluate Occupational Safety Interventions. Safety Science. 41: 53 –
63.
Chan, Y.H. 2005. Log-linear Models: Poisson Regression. Singapore Med. J. 46(8): 377
– 386.
Choi, Y., Ahn, H., and Chen, J. J. 2005. Regression Trees for Analysis of Count Data
with Extra Poisson Variation. Computational Statistics & Data Analysis. 49: 893
– 915.
Dietz, E. and Bohning, D. 2000. On Estimation of the Poisson Parameter in Zero-Modified
Poisson Models. Computational Statistics & Data Analysis. 34: 441 –
459.
Dobson, A.J. 2002. An Introduction to Generalized Linear Models. 2nd Edition. New
York: Chapman & Hall.
Dossou-Gbete, S. and Mizere, D. 2006. An Overview of Probability Models Statistical
Modelling of Count Data. Monografias del Seminario Matematico Garcia de
Galdeano. 33: 237 – 244.
Famoye, F. and Wang, W. Censored Generalized Poisson Regression Model.
Computational Statistics & Data Analysis. 46: 547 – 560.
Fleiss, J.L., Levin, B. and Paik, M.C. 2003. Statistical Methods for Rates and
Proportions. 3rd Edition. New York: John Wiley & Sons, Inc.
Frome, E.L. 1986. Regression Method for Binomial and Poisson Distributed Data. The
American Institute of Physics, New York. 1 – 40.
Guangyong, Z. 2003. A modified Poisson Regression Approach to Prospective Studies
with Binary Data. American Journal of Epidemiology. Vol. 159, No.7: 702 –
706.
Guo, J. Q. and Li, T. 2002. Poisson Regression Models with Errors-in-Variables:
Implication and Treatment. Journal of Statistical Planning and Inference. 104:
391 – 401.
Heinzl, H. and Mittlbock, M. 2003. Pseudo R-squared Measures for Poisson Regression
Models with Over- or Underdispersion. Computational Statistics & Data
Analysis. 44: 253 – 271.
Ismail, N. and Jemain, A.A. 2005. Generalized Poisson Regression: An Alternative For
Risk Classification. Jurnal Teknologi. 43(C): 39 – 54.
Jahn-Eimermacher, A. 2008. Comparison of the Andersen – Gill Model with Poisson
and Negative Binomial Regression on Recurrent Event Data. Computational
Statistics & Data Analysis. 52: 4959 – 4997.
Jovanovic, B. D. and Hosmer, D. W. 1997. A Simulation of the Performance of Cp in
Model Selection for Logistic and Poisson Regression. Computational Statistics
& Data Analysis. 23: 373 – 379.
Kleinbaum, D.G., Kupper, L.L., Muller, K.E., and Nizam, A. 1998. Applied Regression
Analysis and Other Multivariable Methods. 3rd Edition. USA: Duxbury Press.
Kohler, M. and Krzyzak, A. 2007. Asymptotic Confidence Intervals for Poisson
Regression. Journal of Multivariate Analysis. 98: 1072 – 1094.
Kokonendji, C.C., Demetrio, C.G.B., and Dossou-Gbete, S. 2004. Overdispersion and
Poisson-Tweedie Exponential Dispersion Models. Monografias del Seminario
Matematico Garcia de Galdeano. 31: 365 – 374.
Kutner, M.H., Nachtsheim, C.J., and Neter, J. 2004. Applied Linear Regression Models.
Fourth Edition. New York: McGraw-Hill.
Lee, J., Nam, D., and Park, D. 2005. Analyzing the Relationship Between Grade
Crossing Elements and Accidents. Journal of Eastern Asia Society for
Transportation Study. Vol. 6: 3658 – 3668.
Lee, Y., Nelder, J.A., Pawitan, Y. 2006. Generalized Linear Models with Random
Effects, Unified Analysis via H-likelihood. USA: Chapman & Hall.
Liu, J. and Dey, D. K. 2007. Hierarchical overdispersed Poisson Model with Macrolevel
Autocorrelation. Statistical Methodology. 4: 354 – 370.
Lord, D., Washington, S. P., and Ivan, J. N. 2005. Poisson, Poisson-Gamma and Zero-Inflated Regression Models of Motor Vehicle Crashes: Balancing Statistical Fit
and Theory. Accident Analysis and Prevention. 37: 35 – 46.
Luceno, A. 1995. A Family of Partially Correlated Poisson Models for Overdispersion.
Computational Statistics & Data Analysis. 20: 511 – 520.
McCullagh, P. and J.A. Nelder. 1989. Generalized Linear Models. 2nd Edition. London:
Chapman & Hall.
Norliza binti Adnan (2006). Comparing Three Methods of Handling Multicollinearity
Using Simulation Approach. Master of Science (Mathematics). Universiti
Teknologi Malaysia, Skudai.
Osgood, D.W. 2000. Poisson-Based Regression Analysis of Aggregate Crime Rates.
Journal of Quantitative Criminology. Vol. 16, No. 1: 21 – 43.
Pedan, A. Analysis of Count Data Using the SAS System. Statistic, Data Analysis, and
Data Mining. Paper 247-26: 1 – 6.
Pradhan, N. C. and Leung, P.S. 2006. A Poisson and Negative Binomial Regression
Model of Sea Turtle Interactions in Hawaii’s Longline Fishery. Fisheries
Research. 78: 309 – 322.
Spinelli, J. J., Lockhart, R. A., and Stephens, M. A. 2002. Tests for the Response
Distribution in a Poisson Regression Model. Journal of Statistical Planning and
Inference. 108: 137 – 154.
Strien, A.V., Pannekoek, P., Hegemeijer, W. and Verstrael, T. 2000. A Loglinear
Poisson Regression Method to Analyse Bird Monitoring Data. Proceeding of the
International Conference and 13th Meeting of the European Bird Census Council.
33 – 39.
Tang, H., Hu, M., and Shi, Q. 2003. Accident Injury Analysis for Two-Lane Rural
Highways. Journal of Eastern Asia Society for Transportation Study. Vol. 5:
2340 – 2443.
Tsou, T. S. 2006. Robust Poisson Regression. Journal of Statistical Planning and
Inference. 136: 3173 – 3186.
Vacchino, M. N. 1999. Poisson Regression in Mapping Cancer Mortality.
Environmental Research Section A. 81: 1 – 17.
Wang, K., Lee, A. H., and Yau, K. K. W. and Carrivick, P. J. W. 2003. A Bivariate
Zero-Inflated Poisson Regression Model to Analyze Occupational Injuries.
Accident Analysis and Prevention. 35: 625 – 629.
Wang, Y., Smith, E. P., and Ye, K. 2006. Sequential Designs for a Poisson Regression
Model. Journal of Statistical Planning and Inference. 136: 3187 – 3202.
APPENDIX A
SAS Codes for Elephant’s Mating Success Data
data elephant;
input age matings;
cards;
27 0
28 1
28 1
28 1
28 3
29 0
29 0
29 0
29 2
29 2
29 2
30 1
32 2
33 4
33 3
33 3
33 3
33 2
34 1
34 1
34 2
34 3
36 5
36 6
37 1
37 1
37 6
38 2
39 1
41 3
42 4
43 0
43 2
43 3
43 4
43 9
44 3
45 5
47 7
48 2
52 9
;
proc genmod;
model matings = age / dist=poi link=log;
run;
APPENDIX B
The Values of µ̂ i for Elephant’s Mating Success Data
X   Y   µ̂i
27  0   1.313769
28  1   1.407197
28  1   1.407197
28  1   1.407197
28  3   1.407197
29  0   1.50727
29  0   1.50727
29  0   1.50727
29  2   1.50727
29  2   1.50727
29  2   1.50727
30  1   1.614459
32  2   1.852248
33  4   1.98397
33  3   1.98397
33  3   1.98397
33  3   1.98397
33  2   1.98397
34  1   2.12506
34  1   2.12506
34  2   2.12506
34  3   2.12506
36  5   2.438054
36  6   2.438054
37  1   2.611435
37  1   2.611435
37  6   2.611435
38  2   2.797147
39  1   2.996066
41  3   3.437347
42  4   3.681794
43  0   3.943624
43  2   3.943624
43  3   3.943624
43  4   3.943624
43  9   3.943624
44  3   4.224074
45  5   4.524468
47  7   5.190863
48  2   5.560011
52  9   7.318461
APPENDIX C
SAS Codes for Nursing Home Data
data nursing_home;
input bed tdays pcrev nsal fexp rural;
log_t= log(tdays);
cards;
244 385 2.3521 0.523 0.5334 0
59 203 0.916 0.2459 0.0493 1
120 392 2.19 0.6304 0.6115 0
120 419 2.2354 0.659 0.6346 0
120 363 1.7421 0.5362 0.6225 0
65 234 1.0531 0.3622 0.0449 1
120 372 2.2147 0.4406 0.4998 1
90 305 1.4025 0.4173 0.0966 1
96 169 0.8812 0.1955 0.126 0
120 188 1.1729 0.3224 0.6442 1
62 192 0.8896 0.2409 0.1236 0
120 426 2.0987 0.2066 0.336 1
116 321 1.7655 0.5946 0.4231 0
59 164 0.7085 0.1925 0.128 1
80 284 1.3089 0.4166 0.1123 1
120 375 2.1453 0.5257 0.5206 1
80 133 0.779 0.1988 0.4443 1
100 318 1.8309 0.4156 0.4585 1
60 213 0.8872 0.1914 0.1675 1
110 280 1.7881 0.5173 0.5686 1
120 336 1.7004 0.463 0.0907 0
135 442 2.3829 0.7489 0.3351 0
59 191 0.9424 0.2051 0.1756 1
60 202 1.2474 0.3803 0.2123 0
25 83 0.4078 0.2008 0.4531 1
221 776 3.6029 0.1288 0.2543 1
64 214 0.8782 0.4729 0.4446 1
62 204 0.8951 0.2367 0.1064 0
108 366 1.7446 0.5933 0.2987 1
62 220 0.6164 0.2782 0.0411 1
90 286 0.2853 0.4651 0.4197 0
146 375 2.1334 0.6857 0.1198 0
62 189 0.8082 0.2143 0.1209 1
30 88 0.3948 0.3025 0.0137 1
79 278 1.1649 0.2905 0.1279 0
44 158 0.785 0.1498 0.1273 1
120 423 2.9035 0.6236 0.3524 0
100 300 1.7532 0.3547 0.2561 1
49 177 0.8197 0.281 0.3874 1
123 336 2.2555 0.6059 0.6402 1
82 136 0.8459 0.1995 0.1911 1
58 205 1.0412 0.2245 0.1122 1
110 323 1.6661 0.4029 0.3893 1
62 222 1.2406 0.2784 0.2212 1
86 200 1.1312 0.372 0.2959 1
102 355 1.4499 0.3866 0.3006 1
135 471 2.4274 0.7485 0.1344 0
78 203 0.9327 0.3672 0.1242 1
83 390 1.2362 0.3995 0.1484 1
60 213 1.0644 0.282 0.1154 0
54 144 0.7556 0.2088 0.0245 1
120 327 2.0182 0.4432 0.6274 0
;
proc genmod;
model bed = pcrev nsal fexp / dist=poi link=log offset=log_t;
run;
/*(M1)*/
proc genmod;
model bed = pcrev nsal fexp pcrev*nsal pcrev*fexp nsal*fexp / dist=poi
link=log offset=log_t;
run;
/*(M2)*/
proc genmod;
model bed = pcrev nsal fexp pcrev*nsal pcrev*fexp nsal*fexp rural /
dist=poi link=log offset=log_t;
run;
/*(M3)*/
APPENDIX D
SAS Output of Residual Analysis for Poisson Regression in Nursing Home Data
The GENMOD Procedure
Observation Statistics
(For each of the 52 nursing homes, this output lists: Observation, bed, log_t, Xbeta, Reschi, pcrev, Std, Resdev, nsal, HessWgt, StResdev, fexp, Lower, StReschi, rural, Upper, Reslik, Pred, and Resraw.)
APPENDIX E
The Values of µ̂ i for Nursing Home Data
bed (Y[i])   miu[i]
244          156.28
59           58.356
120          133.145
120          136.71
120          133.456
65           67.442
120          141.421
90           86.005
96           60.544
120          86.433
62           68.336
120          159.424
116          104.328
59           53.432
80           82.015
120          127.116
80           60.937
100          115.313
60           70.239
110          92.623
120          103.972
135          134.403
59           62.789
60           71.476
25           37.912
221          206.236
64           65.367
62           71.258
108          103.582
62           68.233
90           96.688
146          118.944
62           59.759
30           28.649
79           94.287
44           50.817
120          144.423
100          93.429
49           68.603
123          107.383
82           46.349
58           61.158
110          110.831
62           71.981
86           66.054
102          115.068
135          149.056
78           62.495
83           115.996
60           72.698
54           41.552
120          154.998
APPENDIX F
SAS Output of Residual Analysis for Negative Binomial Regression in Nursing Home Data
The GENMOD Procedure
Observation Statistics
(For each nursing home, this output lists: Observation, bed, log_t, Xbeta, Reschi, pcrev, Std, Resdev, nsal, HessWgt, StResdev, fexp, Lower, StReschi, rural, Upper, Reslik, Pred, and Resraw.)
1.1857509
1
49.533897
1.1217836
42.879376
11.120624
52
120
5.7899602
5.0462531
-1.19399
2.0182
0.1109835
-1.288548
0.4432
22.276168
-1.579015
0.6274
125.05196
-1.463142
0
193.20983
-1.541274
155.43895
-35.43895
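The observation statistics above report the raw residual (Resraw = bed - Pred), the Pearson residual (Reschi) and the deviance residual (Resdev) for each observation. As a minimal sketch, the standard unscaled Poisson residual formulas are shown below; note this is an assumption about the computation, since software output (e.g. SAS PROC GENMOD) may additionally rescale the Pearson and deviance residuals by an estimated dispersion parameter, so these formulas need not reproduce the listed values exactly.

```python
import math

def poisson_residuals(y, mu):
    """Unscaled raw, Pearson and deviance residuals for one observation
    of a Poisson fit with fitted mean mu (illustrative formulas only;
    reported output may be rescaled by an estimated dispersion)."""
    raw = y - mu                                   # Resraw: observed minus fitted
    pearson = raw / math.sqrt(mu)                  # Reschi: raw / sqrt(Var) = raw / sqrt(mu)
    # Deviance contribution 2*(y*log(y/mu) - (y - mu)); y*log(y/mu) -> 0 at y = 0
    term = y * math.log(y / mu) if y > 0 else 0.0
    deviance = math.copysign(math.sqrt(2.0 * (term - raw)), raw)
    return raw, pearson, deviance
```

For example, an observation with y = 0 and fitted mean mu = 2 has raw residual -2 and deviance residual -sqrt(2*mu) = -2.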
APPENDIX G
Simulated Data
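The listings below give simulated covariate (X) and count (Y) pairs, non-overdispersed (Poisson, variance equal to the mean miu) and overdispersed (variance exceeding the mean). The dissertation does not reproduce its generating code here, so the following is only a minimal sketch of how such data could be produced; the negative-binomial mechanism, the dispersion parameter k, and the covariate distribution are illustrative assumptions, not the author's actual procedure.

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_counts(n=100, mu=10.0, overdispersed=False, k=1.0):
    """Draw a covariate X and counts Y with mean mu.

    Non-overdispersed: Y ~ Poisson(mu), so Var(Y) = mu.
    Overdispersed:     Y ~ NegBin(k, p) with the same mean mu but
                       Var(Y) = mu + mu**2 / k > mu.
    """
    x = rng.normal(loc=3.0, scale=1.0, size=n)   # covariate, roughly like the listed X
    if overdispersed:
        p = k / (k + mu)                         # NB parameterisation giving mean mu
        y = rng.negative_binomial(k, p, size=n)
    else:
        y = rng.poisson(mu, size=n)
    return x, y
```

With this setup the sample variance-to-mean ratio of Y sits near 1 for the Poisson draws and well above 1 for the negative-binomial draws, which is the overdispersion pattern the later chapters test for.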
Non-overdispersed data
miu=10
X
2.56027343
3.3326889
1.66801331
1.61006016
0.36285718
3.69366502
2.81218555
1.52363452
1.51572989
2.0388978
2.47043236
4.06667615
3.24242262
1.51449378
1.28149382
1.0423767
3.18255029
2.37663376
2.05582736
2.03180688
2.93538735
3.49670966
2.89388151
2.72812274
2.16852126
2.9350875
0.9796758
3.24283892
2.06432669
2.25888823
3.71504505
2.59536997
3.36972833
2.90387464
Y
10
6
12
12
10
8
8
12
8
18
11
12
12
14
11
13
14
15
6
7
11
8
7
12
13
11
14
11
17
6
19
10
7
4
miu=20
X
2.233535
2.380598
3.737507
1.893425
2.687594
3.906926
2.749047
2.948506
4.704911
2.789378
3.627864
3.015747
3.42842
3.019538
2.387605
4.049522
4.265544
3.249506
3.287559
2.242166
2.96238
2.093768
2.623234
1.389107
2.124163
3.774267
1.818471
2.598286
4.19652
4.005775
3.465652
3.341846
3.050679
2.052052
Y
18
20
19
19
20
24
16
21
18
23
22
27
18
22
17
18
15
22
14
30
18
34
17
20
20
17
18
25
22
17
24
22
18
19
miu=50
X
3.9699987
4.8434237
4.4943946
4.6310685
3.0659341
4.3252598
5.9114104
1.874538
4.9111629
2.3559743
1.8477671
3.7703251
3.2662908
3.2943631
3.7510428
3.6242798
3.8402843
4.5546947
4.2363452
3.6035993
3.10055
0.3834092
3.0675352
3.635612
3.5323313
3.2227586
3.7645734
4.410683
3.7649807
4.1922024
5.4702537
2.5660532
3.4760939
4.6382366
Y
47
53
57
51
56
45
56
57
59
41
48
65
48
60
43
45
49
54
44
48
49
53
41
39
32
57
47
53
43
43
47
48
57
48
Non-overdispersed data (continued)
miu=10
X
2.81451369
1.25842432
2.07940264
4.51482464
1.46244939
4.1868501
2.07409596
1.50803132
3.18251188
3.38992079
1.22532545
2.63168439
3.18694385
2.12424368
1.26411597
2.35176685
2.92244782
2.42275344
2.53927744
1.37460335
1.04661211
3.38630568
2.79825127
1.76558189
2.71937214
1.55265173
0.3104915
2.19924408
3.25973095
2.90217759
2.38023748
4.25647746
2.38099308
1.80118863
0.99632815
2.87664385
2.37776979
2.25172576
1.94891179
1.42317685
3.6831514
Y
11
16
10
13
14
8
10
10
13
8
14
9
8
8
10
13
11
9
12
9
9
9
8
10
8
7
14
13
12
11
10
13
15
10
14
9
17
11
12
14
10
miu=20
X
2.871468
3.62096
2.825491
4.223984
2.421426
1.991402
3.262232
3.20434
3.5112
2.655626
3.103678
2.268579
2.654457
3.240083
1.338323
2.955074
3.760561
2.494356
3.067955
3.293478
3.343564
4.092233
3.470404
3.580472
1.202995
4.805678
3.247511
3.097027
2.001007
3.263064
3.717297
1.455882
1.456778
4.511168
4.633684
3.710726
2.638712
3.431643
1.804892
3.057467
2.728066
Y
21
23
23
29
24
20
16
23
28
16
21
26
24
25
28
27
18
21
24
20
21
24
28
10
25
18
20
25
20
17
25
25
30
14
24
22
19
22
22
19
24
miu=50
X
3.8218477
3.8210704
3.7967673
4.04387
5.3592521
4.3233362
3.4298408
5.0527314
4.3845342
2.3926529
2.4613402
2.5679592
3.139902
4.7341778
3.1589628
4.320673
4.1690454
2.872078
4.7493709
5.5859022
3.2574183
4.5438332
4.4717676
4.4901139
3.7233904
4.7087055
3.3533185
3.2674426
3.8337626
2.0761218
3.7104727
4.5944552
5.2191323
3.540893
5.4053498
3.3165428
5.2921917
1.989137
4.2475918
4.5012906
3.7036413
Y
46
54
46
46
47
39
50
43
47
40
45
59
47
54
42
57
47
56
47
40
55
51
50
45
62
30
53
52
41
46
52
43
54
51
53
57
52
47
47
41
60
Non-overdispersed data (continued)
miu=10
X
3.52976312
3.46639053
3.43555119
2.38152391
3.14476528
2.41774325
1.52962269
0.49890047
2.87897229
0.92306063
2.6185509
2.29404894
3.48546267
4.02298131
1.07883456
3.94725484
1.11731098
0.64580806
1.43439431
-0.0823161
2.65617426
0.76878554
1.21692979
3.21691682
1.81126194
Y
11
9
5
13
8
8
12
14
16
10
14
14
10
9
10
14
10
6
16
11
12
11
8
9
5
miu=20
X
2.920535
1.678912
2.179912
2.787339
2.300393
2.961683
2.93638
4.147253
1.925526
2.5701
3.144706
5.09753
2.598161
3.331129
4.128854
3.63403
4.935413
1.614
4.357518
3.380103
3.390668
1.552754
3.893704
3.572254
1.854771
Y
22
16
25
24
19
26
25
23
23
22
25
24
19
26
28
21
18
32
24
18
20
24
23
16
19
miu=50
X
4.5837201
3.2605241
6.7931809
3.7509432
3.9999346
2.1554632
4.2595821
2.6537653
4.5316234
4.2517845
4.194305
4.0677576
4.6617705
3.532448
3.8688486
4.0325427
5.045115
4.1096757
2.7171915
3.7607899
3.7105418
4.2784821
4.2616238
4.7912618
5.7802977
Y
54
46
47
48
55
51
42
56
51
53
50
48
52
49
45
51
46
53
66
47
45
44
49
37
38
Overdispersed Data
miu=10
X
2.44163
2.104
1.52064
0.90916
0.79473
2.01532
3.19482
1.64743
2.40063
2.74466
0.62351
1.80325
1.97372
0.2907
3.42932
5.26497
3.44543
3.04891
1.17708
3.1177
1.97079
2.6756
2.87712
1.55979
0.68691
1.2245
2.93103
2.77611
2.12268
2.24546
2.43895
2.69914
0.38124
2.14843
4.61383
3.48309
Y
2
3
4
5
14
1
14
0
29
23
5
9
19
6
3
3
0
10
3
5
11
17
18
0
3
0
2
8
2
0
17
0
1
17
0
6
miu=20
X
2.66376
2.45444
1.9666
3.92778
0.76437
3.37262
2.27186
4.13457
3.03504
3.42327
3.89228
3.24653
1.7641
2.60344
2.85246
2.07496
4.16171
3.33367
2.43729
2.24164
1.89108
4.50213
2.22699
2.98849
3.09531
3.59585
3.11136
2.1967
3.07363
2.36964
1.23683
2.31667
4.64975
4.31556
2.21664
3.91311
Y
2
24
22
18
0
24
14
13
7
16
17
35
24
2
0
40
1
2
11
4
23
23
20
25
14
26
4
17
17
6
15
20
12
2
10
8
miu=50
X
2.88276
1.83427
4.55353
3.65407
4.08464
2.2976
3.74117
2.43642
6.19172
1.7602
3.97899
2.48501
2.81162
3.17003
3.97182
3.10242
3.62071
2.9827
5.2785
3.36342
4.44359
4.27552
1.68266
2.5891
5.09814
4.8191
4.49929
6.35005
4.86595
4.85178
4.06398
4.35744
5.82929
3.39721
5.43037
3.15511
Y
36
46
9
2
5
40
96
62
129
57
62
90
58
6
12
57
61
22
6
22
29
87
40
85
4
7
42
163
11
12
126
239
136
2
128
52
Overdispersed data (continued)
miu=10
X
1.13993
2.15086
1.87124
0.31924
1.58076
4.47172
2.794
1.71189
2.38219
1.54522
2.16051
0.49664
1.33656
3.84749
1.63085
3.01141
1.41408
2.8041
3.13662
1.85138
3.58869
3.79014
2.20832
2.97588
3.59825
2.47614
2.07837
3.73233
2.76328
0.99803
0.72687
2.02619
3.19523
2.61502
3.01252
-0.79722
1.66685
2.53451
3.07421
2.92809
1.72812
Y
4
7
2
19
15
1
11
6
8
3
5
5
12
5
6
7
7
14
0
19
16
0
5
7
25
0
0
7
14
20
19
5
0
20
24
5
26
21
5
13
1
miu=20
X
2.66633
3.41239
1.3929
1.87624
3.92357
3.63896
3.30859
3.23259
3.88666
2.59509
2.27531
3.35699
3.46903
3.02228
2.84548
3.0837
4.81189
3.62199
3.17865
2.95065
3.69129
3.23378
3.02842
2.06705
2.20435
3.55136
3.32951
4.49041
2.08909
2.06913
2.65827
2.60514
3.11345
3.38731
1.9049
1.58148
1.69146
2.10952
2.3264
2.99103
3.10909
Y
26
1
10
0
10
11
4
2
20
12
11
4
40
32
1
2
10
43
9
25
0
10
20
5
8
6
7
25
10
9
1
34
27
48
64
2
15
20
1
13
4
miu=50
X
4.62784
3.30831
5.25449
5.51319
4.34461
3.03456
4.89971
4.3881
5.46789
3.38408
4.34105
1.7767
2.45248
5.36947
3.47778
2.80763
3.78067
2.44013
2.89905
4.75979
3.21596
4.83291
4.56639
4.03758
4.40804
4.45115
4.33562
4.19355
4.2325
3.82704
4.03993
4.07587
3.46035
3.6433
2.52635
5.29063
5.42906
5.10655
4.71323
4.35133
2.91343
Y
35
76
61
13
40
43
15
41
17
30
5
41
12
8
7
36
34
2
15
76
42
254
89
6
3
11
3
31
27
150
12
56
93
41
34
115
85
83
37
15
159
Overdispersed data (continued)
miu=10
X
2.33986
4.05749
2.52357
3.78881
2.90051
2.51625
2.73479
3.94897
3.71768
1.55765
-0.19213
1.25845
2.30207
1.17082
0.65576
3.87628
0.01265
2.61168
2.11416
1.41527
2.24124
3.37832
0.69132
Y
14
0
10
15
11
3
0
11
4
2
0
8
15
4
2
9
2
0
8
2
0
7
2
miu=20
X
3.20283
3.2003
1.61702
2.74098
2.64998
3.33911
2.63632
4.66838
2.369
2.27309
3.59818
1.70775
4.03669
3.68486
2.01495
1.86527
2.61437
3.7602
0.61867
2.36131
3.8941
2.3606
4.07685
Y
36
28
3
38
12
2
0
15
50
48
20
17
69
32
6
8
11
19
11
8
16
12
38
miu=50
X
3.9709
3.18958
2.6479
4.49983
2.778
2.83642
3.8224
3.08565
4.84058
2.70799
3.67085
3.17162
4.08897
4.56694
4.86552
2.64723
4.37878
4.98765
3.06825
4.1668
6.79676
4.79784
3.56401
Y
20
1
35
53
29
71
81
72
28
40
13
40
91
12
105
21
63
4
4
40
13
29
76