COUNT DATA ANALYSIS USING POISSON REGRESSION AND HANDLING OF OVERDISPERSION

RAIHANA BINTI ZAINORDIN

A dissertation submitted in partial fulfillment of the requirements for the award of the degree of Master of Science (Mathematics)

Faculty of Science
Universiti Teknologi Malaysia

NOVEMBER 2009

DEDICATION

To Mak and Abah, because I could not have made it without you both. And to sunflower, because life should be meaningful.

ACKNOWLEDGEMENT

First and foremost, all praise be to Allah, the Almighty, the Benevolent, for His blessings and guidance, for giving me the inspiration to embark on this project, and for instilling in me the strength to see this thesis become a reality.

I would like to say a big thank you to my amazing supervisor, P.M. Dr. Robiah binti Adnan, for believing in me to finish this dissertation. Her advice, encouragement and help mean a lot to me, and I can never thank her enough. I would also like to thank my friend Ehsan, who never tires of helping me and teaching me things that I do not know.

As always, I am forever grateful to my parents for all their love and sacrifice. To my sisters (Nordiana, Noraiman Amalina, Nurul Aina), thanks for giving me a reason to laugh and cry. A heartfelt thanks goes to all my friends who are always willing to help and support me. And finally, to my best friend, Ikhwan, thank you so much for being there.

ABSTRACT

Count data is very common in various fields such as biomedical science, public health and marketing. Poisson regression is widely used to analyze count data, and it is also appropriate for analyzing rate data. Poisson regression belongs to the class of generalized linear models (GLM). It uses the natural log as the link function and models the expected value of the response variable.
The natural log in the model ensures that the predicted values of the response variable will never be negative. The response variable in Poisson regression is assumed to follow a Poisson distribution. One requirement of the Poisson distribution is that the mean equals the variance. In real-life applications, however, count data often exhibit overdispersion. Overdispersion occurs when the variance is significantly larger than the mean; when this happens, the data is said to be overdispersed. Overdispersion can cause underestimation of standard errors, which consequently leads to wrong inference; the results of tests of significance may also be overstated. Overdispersion can be handled by using the quasi-likelihood method as well as negative binomial regression. A simulation study was carried out to examine the performance of Poisson regression and negative binomial regression in analyzing data that has no overdispersion as well as data that has overdispersion. The results show that Poisson regression is most appropriate for data that has no overdispersion, while negative binomial regression is most appropriate for data that has overdispersion.

ABSTRAK

Data bilangan adalah sangat lazim dalam pelbagai bidang, contohnya bidang sains bioperubatan, kesihatan awam dan bidang pemasaran. Regresi Poisson digunakan secara meluas untuk menganalisis data bilangan. Regresi Poisson juga sesuai untuk menganalisis data kadaran. Regresi Poisson merupakan sebahagian daripada kelas model linear teritlak. Regresi ini menggunakan logaritma asli sebagai fungsi hubungan. Regresi ini memodelkan nilai jangkaan bagi pembolehubah maklum balas. Logaritma asli digunakan untuk memastikan supaya nilai ramalan bagi pembolehubah maklum balas tidak akan berbentuk negatif. Pembolehubah maklum balas dalam regresi Poisson dianggap mengikut taburan Poisson. Salah satu ciri taburan Poisson ialah nilai min pembolehubah adalah sama dengan nilai varians.
Walau bagaimanapun, dalam aplikasi sebenar, data bilangan sering mempamerkan masalah lebih serakan. Masalah lebih serakan terjadi apabila nilai varians melebihi nilai min. Apabila ini terjadi, sesebuah data itu dikatakan terlebih serak. Masalah lebih serakan boleh menyebabkan kurang anggaran terhadap sisihan piawai yang kemudiannya memberi inferens yang salah. Selain daripada itu, keputusan ujian signifikan pula akan terlebih anggar. Masalah lebih serakan boleh diatasi dengan menggunakan kaedah kebolehjadian quasi dan juga regresi binomial negatif. Kajian simulasi telah dibuat untuk melihat keputusan regresi Poisson dan regresi binomial negatif dalam menganalisis data yang tidak mempunyai masalah lebih serakan dan juga data yang mempunyai masalah lebih serakan. Keputusan menunjukkan bahawa regresi Poisson adalah paling sesuai untuk data yang tidak mempunyai masalah lebih serakan manakala regresi binomial negatif adalah paling sesuai untuk data yang mempunyai masalah lebih serakan.

TABLE OF CONTENTS

COVER
DECLARATION
DEDICATION
ACKNOWLEDGEMENTS
ABSTRACT
ABSTRAK
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF SYMBOLS
LIST OF APPENDICES

1 INTRODUCTION
  1.1 Count Data
  1.2 Statement of the Problem
  1.3 Objectives of the Study
  1.4 Scope of the Study
  1.5 Significance of the Study
  1.6 Outline of the Study
  1.7 Analysis Flow Chart

2 LITERATURE REVIEW
  2.1 Generalized Linear Models
    2.1.1 Random Component
    2.1.2 Systematic Component
    2.1.3 Link
  2.2 Principles of Statistical Modelling
    2.2.1 Exploratory Data Analysis
    2.2.2 Model Fitting
  2.3 Poisson Distribution
  2.4 Poisson Regression
  2.5 Problems in Poisson Regression
    2.5.1 Truncation and Censoring
    2.5.2 Excess Zero
    2.5.3 Overdispersion
  2.6 Alternative Count Models
  2.7 Negative Binomial Regression

3 POISSON REGRESSION ANALYSIS
  3.1 The Model
  3.2 Estimation of Parameters Using Maximum Likelihood Estimation (MLE)
  3.3 Standard Errors for Regression Coefficients
  3.4 Interpretation of Coefficients
  3.5 Elasticity
  3.6 Model Checking Using Pearson Chi-Squares and Deviance
    3.6.1 Pearson Chi-Squares
    3.6.2 Deviance
  3.7 Model Residuals
  3.8 Inference
    3.8.1 Test of Significance
    3.8.2 Confidence Intervals
  3.9 Handling Overdispersion
    3.9.1 Quasi-Likelihood Method
      3.9.1.1 Estimating the Overdispersion Parameter
      3.9.1.2 Testing for Overdispersion
    3.9.2 Negative Binomial Regression Analysis
  3.10 Example

4 ANALYSIS OF POISSON REGRESSION USING SAS
  4.1 Introduction
  4.2 Nursing Home Data
  4.3 Choosing the Right Model
  4.4 Results and Discussion
  4.5 Negative Binomial Regression
  4.6 Results and Discussion

5 SIMULATION STUDY
  5.1 Data Simulation
  5.2 Analysis of Data with No Overdispersion
    5.2.1 Results and Discussion
      5.2.1.1 Goodness-of-fit
      5.2.1.2 Significance, Confidence Intervals, and Standard Errors
  5.3 Analysis of Data with Overdispersion
    5.3.1 Results and Discussion
      5.3.1.1 Goodness-of-fit
      5.3.1.2 Significance, Confidence Intervals, and Standard Errors
  5.4 Conclusions
  5.5 The Simulation Codes
    5.5.1 Codes for the Simulation of Non-overdispersed Data
    5.5.2 Codes for the Simulation of Overdispersed Data

6 SUMMARY AND CONCLUSIONS
  6.1 Summary
  6.2 Conclusions
  6.3 Recommendations

REFERENCES
APPENDICES

LIST OF TABLES
3.1 Elephant's mating success regarding age
3.2 Iterative reweighted least squares results
3.3 Residuals for elephant's mating success data
3.4 Adjusted standard errors
4.1 Nursing home data
4.2 Log likelihood and deviance for models M1, M2, and M3
4.3 Elasticities of the explanatory variables in nursing home data for model M3
4.4 Pearson residuals and adjusted residuals for nursing home data
4.5 Comparison among standard errors for ordinary Poisson regression and corrected Poisson regression
4.6 Elasticities of the explanatory variables in nursing home data for the negative binomial regression model
4.7 Residuals for negative binomial regression
5.1 Pearson chi-square and deviance for Poisson regression and negative binomial regression obtained from data that has no overdispersion
5.2 Pearson chi-square and deviance for Poisson regression and negative binomial regression obtained from overdispersed data

LIST OF FIGURES

2.1 Steps in model fitting
3.1 Steps in quasi-likelihood approach
3.2 SAS result for analysis of elephant's mating success data
4.1 SAS output for model (M1)
4.2 SAS output for model (M2)
4.3 SAS output for model (M3)
4.4 SAS output for corrected Poisson regression
4.5 SAS output for negative binomial regression
5.1 Poisson regression SAS output of non-overdispersed data for µ = 10
5.2 Poisson regression SAS output of non-overdispersed data for µ = 20
5.3 Poisson regression SAS output of non-overdispersed data for µ = 50
5.4 Negative binomial regression SAS output of non-overdispersed data for µ = 10
5.5 Negative binomial regression SAS output of non-overdispersed data for µ = 20
5.6 Negative binomial regression SAS output of non-overdispersed data for µ = 50
5.7 Poisson regression SAS output of overdispersed data for µ = 10
5.8 Poisson regression SAS output of overdispersed data for µ = 20
5.9 Poisson regression SAS output of overdispersed data for µ = 50
5.10 Negative binomial regression SAS output of overdispersed data for µ = 10
5.11 Negative binomial regression SAS output of overdispersed data for µ = 20
5.12 Negative binomial regression SAS output of overdispersed data for µ = 50

LIST OF SYMBOLS

Y   Response variable
X   Predictor variable
β   Regression coefficient
η   Link function
I   Information matrix
U   Score function
W   Weight matrix
E   Elasticity
X²  Pearson chi-square statistic
D   Deviance statistic
R   Pearson residual
Z   Wald statistic
φ   Dispersion parameter

LIST OF APPENDICES

A  SAS Codes for Elephant's Mating Success Data
B  The Values of µ̂i for Elephant's Mating Success Data
C  SAS Codes for Nursing Home Data
D  SAS Output of Residual Analysis for Poisson Regression in Nursing Home Data
E  The Values of µ̂i for Nursing Home Data
F  SAS Output of Residual Analysis for Negative Binomial Regression in Nursing Home Data
G  Simulated Data

CHAPTER 1

INTRODUCTION

1.1 Count Data

An event count refers to the number of times an event occurs, for example the number of individuals arriving at a serving station (e.g. a bank teller, gas station or cash register) within a fixed interval, the number of failures of electronic components per unit of time, the number of homicides per year, or the number of patents applied for and received. In many fields, such as the social, behavioral and biomedical sciences, as well as public health, marketing, education, the biological and agricultural sciences and industrial quality control, the response variable of interest is often measured as a nonnegative integer, or count. Significant early developments in count models, however, took place in actuarial science, biostatistics, and demography. In recent years these models have also been used extensively in economics, political science, and sociology.
The special features of data in their respective fields of application have fueled developments that have enlarged the scope of these models. An important milestone in the development of count data regression models was the emergence of generalized linear models, of which Poisson regression is a special case.

Alternatively, an event may be thought of as the realization of a point process governed by some specified rate of occurrence of the event. The number of events may be characterized as the total number of such realizations over some unit of time. The dual of the event count is the inter-arrival time, defined as the length of the period between events. Count data regression is useful in studying the occurrence rate per unit of time. The approach taken to the analysis of count data sometimes depends on how the counts are assumed to arise. Count data commonly arise in two ways:

i) Counts arise from direct observation of a point process.
ii) Counts arise from discretization of continuous latent data.

In the first case, examples are the number of telephone calls arriving at a central telephone exchange, the number of monthly absences at a workplace, the number of airline accidents, the number of hospital admissions, and so forth. The data may also consist of inter-arrival times for events. In the second case, consider the following example. Credit ratings issued by agencies may be stated as AAA, AAB, AA, A, BBB, B, and so forth, where AAA indicates the greatest creditworthiness. Suppose one codes these as y = 0, 1, ..., m. These are pseudocounts that can be analyzed using a count regression. But one may also regard this as an ordinal ranking that can be modeled using a suitable latent variable model such as ordered probit. Typically, the characteristics of count data are that the counts occur over some fixed area or observation period and that the things people count are often rare.
Count data, even though numeric, can create problems if analyzed using regular linear regression, because of the limited range of most of the values and because only nonnegative integer values can occur. Count data can therefore have a highly skewed distribution, one that is cut off at zero. It is thus often unreasonable to assume that the response variable and the resulting errors have a normal distribution, making linear regression a less appropriate option for analysis. A suitable way to deal with count data is to use the Poisson distribution and the log link function in the analysis. The regression model that uses these options is called Poisson regression, or the Poisson log-linear regression model. Basically, the most popular methods for modeling count data are the Poisson and negative binomial regression models, but Poisson regression is the more popular of the two and is applied in various fields.

1.2 Statement of the Problem

Count data often have variance exceeding the mean. In other words, count data usually show greater variability in the response counts than one would expect if the response distribution truly were Poisson. This violates the Poisson regression assumption which strictly states that the mean is equal to the variance (equidispersion). The phenomenon where the variance is greater than the mean is called overdispersion. A statistical test of overdispersion is highly desirable after running a Poisson regression. Ignoring overdispersion in the analysis would lead to underestimation of standard errors and, consequently, overstatement of significance in hypothesis testing. The overdispersion must be accounted for by analysis methods appropriate to the data. Poisson regression is not adequate for analyzing overdispersed data. Therefore, to overcome overdispersion, the quasi-likelihood method will be used as well as negative binomial regression. Negative binomial regression is more adequate for overdispersed data.
This is because negative binomial regression allows for overdispersion, since its variance is naturally greater than its mean.

1.3 Objectives of the Study

The objectives of this study are:

i) To study the analysis of Poisson regression.
ii) To illustrate Poisson regression by analyzing count data manually and by using SAS 9.1.
iii) To demonstrate how to handle overdispersion in Poisson regression using the quasi-likelihood approach as well as the negative binomial regression approach.
iv) To examine, through a simulation study, the performance of Poisson regression and of negative binomial regression in analyzing data that has no overdispersion as well as data that has overdispersion.

1.4 Scope of the Study

This study will focus on the analysis of Poisson regression. It will also focus on the overdispersion problem that arises when dealing with real-life count data. Overdispersion happens when the variance is greater than the mean, which violates the equidispersion property of the Poisson distribution and thus needs to be addressed. In connection with the overdispersion problem, the performance of Poisson regression and negative binomial regression in analyzing data with and without overdispersion will be examined in a simulation study. The analyses in this study include manual analysis and analysis using a statistical package, namely SAS 9.1.

1.5 Significance of the Study

This study will help researchers appreciate the use of Poisson regression in analyzing count data. Besides focusing on parameter estimation, this study will also highlight the interpretation of coefficients. It will also help to overcome the overdispersion problem that occurs in Poisson regression, which, if ignored, may cause underestimation of standard errors and consequently give misleading inference about the regression parameters.
Clearly, this study is timely and will give much benefit.

1.6 Outline of the Study

This dissertation consists of 6 chapters. Chapter 1 gives an overview of the study. It begins with an explanation of count data, including the characteristics of count data, which are very important throughout the study. Chapter 1 also explains how the idea for the study came about, as well as the purpose, scope and importance of the study.

Chapter 2 discusses the basic ideas that are important in Poisson regression analysis. This chapter also discusses common problems in Poisson regression as well as negative binomial regression, in addition to reviewing previous studies by other researchers.

Poisson regression analysis can be found in Chapter 3. This chapter gives clear descriptions of the formulation of the Poisson regression model, the manual computation of maximum likelihood estimates, and how to interpret coefficients in Poisson regression. It also includes other important analyses such as the goodness-of-fit test, residual analysis and inference. In addition, this chapter discusses methods to handle overdispersion. To illustrate Poisson regression, an example is presented; the analysis of this example is done manually.

Chapter 4 deals with the analysis of Poisson regression using SAS 9.1. A larger data set is used and more factors are considered. The data are counts in the form of rates, and they involve overdispersion. SAS codes are provided for convenience.

Chapter 5 presents the simulation study. Data are simulated using the R 2.9.2 software and analyzed using SAS 9.1. The performance of Poisson regression and of negative binomial regression in analyzing data that has no overdispersion as well as data that has overdispersion is presented in this chapter.

Lastly, the conclusions of the study are discussed in Chapter 6. This chapter summarizes the whole study.
Some recommendations for further research are also made there.

1.7 Analysis Flow Chart

POISSON REGRESSION

Run the analysis:
1. Formulation of the model
2. Estimation of parameters using MLE
3. Interpretation of coefficients
4. Elasticity
5. Model checking
6. Residual analysis
7. Inference

Estimate the overdispersion parameter and test for overdispersion. Does overdispersion exist?

- No: conclude that Poisson regression is adequate.
- Yes: run the quasi-likelihood method (estimate the dispersion parameter and adjust the standard errors), and run negative binomial regression (carry out the same analysis as in the Poisson regression analysis).

CHAPTER 2

LITERATURE REVIEW

2.1 Generalized Linear Models

All generalized linear models (GLM) have three components. The random component identifies the response variable Y and assumes its probability distribution. The systematic component specifies the explanatory variables used as predictors in the model. The link explains the functional relationship between the systematic component and the expected value of the random component. The GLM relates a function of that expected value to the explanatory variables through a prediction equation having linear form.

2.1.1 Random Component

Let Y1, Y2, ..., YN be the observations for a sample of size N and suppose Y1, Y2, ..., YN are independent. The random component of a GLM consists of identifying the response variable Y and selecting a probability distribution for Y1, Y2, ..., YN. If the observations are in binary form, a binomial distribution is assumed for the random component. If the observations are nonnegative counts, a Poisson distribution is assumed, whereas if the observations are continuous, a normal distribution is assumed.

2.1.2 Systematic Component

The systematic component of a GLM specifies the explanatory variables, which enter linearly as predictors on the right-hand side of the model equation.
In other words, the systematic component specifies the variables that play the roles of x_j in the formula

β0 + β1x1 + ... + βpxp

This linear combination of the explanatory variables is called the linear predictor or linear component.

2.1.3 Link

The third component of a GLM is the link between the random component and the systematic component. It specifies how µ = E(Y) relates to the explanatory variables in the linear predictor. The mean µ, or a function of the mean g(µ), can be modeled. The model formula is stated as

g(µ) = β0 + β1x1 + ... + βpxp = η

where g(µ) = η is called the link function. The simplest possible link function is g(µ) = η = µ. This link function models the mean directly and is called the identity link. It specifies a linear model for the mean response, that is,

µ = β0 + β1x1 + ... + βpxp = η

This is the form of ordinary regression models for continuous responses.

As an early note, the link function for the Poisson regression model has the form g(µ) = η = ln(µ). The natural log function applies to positive numbers; thus, the natural log link is appropriate when µ cannot be negative. A GLM that uses the natural log link is called a loglinear model. Further information will be discussed later.

2.2 Principles of Statistical Modelling

2.2.1 Exploratory Data Analysis

Any data analysis should begin with a consideration of each variable separately. This is important in checking the quality of the data and in formulating the model. Several questions need to be answered when performing the analysis of data.

i) What is the scale of measurement? Is it discrete or continuous? Note that for Poisson regression analysis the scale of measurement is discrete, since the observations are in the form of counts.
ii) What is the shape of the distribution?
iii) How is it associated with other variables?

2.2.2 Model Fitting

The model fitting process begins with the formulation of the model.
It is then followed by estimation of the parameters in the model, model checking, residual analysis and inference. Figure 2.1 summarizes these steps.

Model Formulation → Estimation of the Parameters → Model Checking → Residual Analysis → Inference

Figure 2.1: Steps in model fitting

i) Model Formulation

In formulating a model, knowledge of the context in which the data were obtained, including the substantive questions of interest, theoretical relationships among the variables, the study design and the results of the data analysis, can all be used. Basically, the model has two components:
- the probability distribution of Y;
- an equation linking the expected value of Y with a linear combination of the explanatory variables.

ii) Estimation of Parameters

The parameters used in the model must be estimated. The most commonly used estimation methods are maximum likelihood and least squares.

iii) Model Checking

A GLM provides accurate description and inference for a data set only if it fits that data set well. Summary goodness-of-fit statistics can help to investigate the adequacy of a GLM fit. For Poisson regression, the goodness of fit of the model is tested using the Pearson chi-square and the deviance.

iv) Residual Analysis

Goodness-of-fit statistics only broadly summarize how well models fit data. Further insight can be obtained by using residuals, that is, by comparing observed and fitted counts individually.

v) Inference

Statistical inference involves calculating confidence intervals, testing hypotheses about the parameters in the model, and interpreting the results.

2.3 Poisson Distribution

A random variable Y is said to have a Poisson distribution with parameter µ if it takes integer values y = 0, 1, 2, ... with probability

P(Y = y) = e^(−µ) µ^y / y!,    µ > 0

The Poisson distribution can be derived as a limiting form of the binomial distribution by considering the distribution of the number of successes in a very large number of Bernoulli trials with a small probability of success in each trial. Specifically, if Y ~ B(n, π), then the distribution of Y as n → ∞ and π → 0 with µ = nπ remaining fixed approaches a Poisson distribution with mean µ. Thus, the Poisson distribution provides an approximation to the binomial for the analysis of rare events, where π is small and n is large.

The Poisson distribution is often used to model the occurrence of rare events, such as the number of traffic accidents in a month, the number of nesting attempts or offspring in a breeding season, and the number of automobile accidents occurring at a certain location per year. The Poisson distribution has several characteristic features:

i) The variance is equal to the mean, E(Y) = Var(Y) = µ. This property is called equidispersion.
ii) The distribution tends to be skewed to the right.
iii) A Poisson distribution with a large mean is often well approximated by a normal distribution.

A useful property of the Poisson distribution is that the sum of independent Poisson random variables is also Poisson. Specifically, if Y1 and Y2 are independent with Yi ~ P(µi) for i = 1, 2, then

Y1 + Y2 ~ P(µ1 + µ2)

This result also applies to the sum of more than two independent Poisson random variables.

2.4 Poisson Regression

Poisson regression is a member of the class of generalized linear models (GLM), which is an extension of traditional linear models that allows the mean of a population to depend on a linear predictor through a nonlinear link function and allows the response probability distribution to be any member of the exponential family of distributions (McCullagh and Nelder, 1989). The use of Poisson regression is vast, and the study of this type of regression is continuing.
It has been an aid in many research areas such as economics, epidemiology, sociology and medicine. Poisson regression is useful when the outcome is a count, with large-count outcomes being rare events (Kutner, Nachtsheim, and Neter, 2004). It is the most widely used regression model for multivariate count data. Counts are integers and can never be negative. The distribution of counts is discrete, not continuous, and it tends to be positively skewed. Ordinary least squares regression uses the normal distribution as its probability model. Hence, it is fundamentally not a good fit for discrete data, because the normal distribution is symmetric and extends from negative to positive infinity (Atkins and Gallop, 2007). The Poisson distribution is a much better fit for count data. It characterizes the probability of observing any discrete number of events (Osgood, 2000). When the mean count is low, the Poisson distribution is skewed; as the mean count grows, however, the Poisson distribution increasingly approximates the normal. Poisson regression uses the Poisson distribution as its probability model, and it is therefore one of the alternatives that can be used for analyzing count data.

Poisson regression shares many similarities with ordinary least squares regression, except that it assumes the response variable to follow a Poisson distribution instead of a normal distribution, and it models the natural log of the response variable's expected value as a linear function of the coefficients. One of the main assumptions of linear models is that the residual errors follow a normal distribution. To meet this assumption when a continuous response variable is skewed, a transformation of the response variable can produce errors that are approximately normal. However, for a discrete response variable, a simple transformation will not produce normally distributed errors.
This is why Poisson regression uses the natural log as its link function in the model rather than using a log transformation of each response. In addition, the natural log in the Poisson regression model guarantees that the predictions from the model will never be negative, which is appropriate for count data (Atkins and Gallop, 2007).

Another general and popular application of Poisson regression involves modeling rates for different subgroups of interest (Kleinbaum et al., 1998). A rate is thought of as the number of counts observed within a specified time; often, time is constant for all observations. A rate is also defined as the number of events divided by the total person-years of experience. The Poisson regression model has been used as a tool for resolving common problems in analyzing aggregate crime rates (Osgood, 2000). It has also been used to analyze bird monitoring data (Strien et al., 2000). In addition, Poisson regression was used by family researchers to model marital commitment (Atkins and Gallop, 2007). There are many journals and papers that present Poisson regression analysis and its applications. Some researchers have modified Poisson regression or introduced better models for analyzing count data. For example, Tsou (2006) demonstrated that the Poisson regression model could be adjusted to become asymptotically valid for inference about regression parameters even if the Poisson assumption fails, and Zou (2004) introduced Poisson regression with a robust error variance to estimate relative risk. This study will focus only on the general Poisson regression analysis.

2.5 Problems in Poisson Regression

Poisson regression is a powerful analysis tool but, as with all statistical methods, it can be used inappropriately if its limitations are not fully understood. Three problems might exist in a Poisson regression analysis:

i) The data might be truncated or censored.
ii) The data might contain excess zeroes.
iii) The mean and variance might not be equal, as required by the Poisson distribution. This problem is called overdispersion.

2.5.1 Truncation and Censoring

Truncation of data can occur in the routine collection of data. In survey data, for instance, such as surveys of users of recreational facilities, respondents who report zero are sometimes discarded from the sample. This situation produces truncated data. Consider another example. If the response is the number of times per week an in-vehicle navigation system is used on the morning commute to work during weekdays, the data are right-truncated at 5, which is the maximum number of uses in any given week. Estimating a Poisson regression model without accounting for this truncation will result in biased estimates of the parameters, and erroneous inferences will be drawn. Fortunately, the Poisson model is easily adapted to account for this problem. The right-truncated Poisson model is written as

P(Yi) = (µi^Yi / Yi!) / [ Σ_{mi=0}^{r} µi^mi / mi! ]

where P(Yi) is the probability of commuter i using the system Yi times per week, µi is the Poisson parameter for commuter i, mi is the number of uses per week, and r is the right truncation point (in this example, 5 times per week).

In other settings, respondents are sometimes given a limit category (say, 40 and more) for some large value. The data from such a survey are said to be censored. Censoring, like truncation, leads to inconsistent parameter estimates.

2.5.2 Excess Zero

There are certain phenomena where an observation of zero events during the observation period can arise from two qualitatively different conditions, that is, the condition may result from:

i) failing to observe an event during the observation period, or
ii) an inability to ever experience an event.

For example, consider the number of crimes ever committed by each person in a community. In this case, most people are not involved in crime at all. Therefore, there will be too many zero counts in the data.
Consider an example where a transportation survey asks how many times a person has taken mass transit to work during the past week. An observed zero could arise in two distinct ways. First, last week the person may have opted to take the vanpool instead of mass transit. Alternatively, the person may never take transit because of other commitments on the way to and from the workplace. Thus two states are present: a normal-count state and a zero-count state. The normal-count state applies when event occurrence is possible and follows some known count process, whereas the zero-count state refers to situations where the likelihood of an event occurring is extremely rare. Two aspects of this distinction between states are noteworthy. First, data obtained from normal-count and zero-count states often suffer from overdispersion if treated as a single, normal-count state, because the number of zeroes is inflated by the zero-count state. Second, it is usually not known which state an observation belongs to, so the statistical analysis must uncover the separation of the two states as part of the model estimation process. Models that account for this dual-state system are referred to as zero-inflated models. To address phenomena with zero-inflated counting processes, the zero-inflated Poisson (ZIP) and zero-inflated negative binomial (ZINB) regression models have been developed. The ZIP regression model assumes that the events Y_1, Y_2, \ldots, Y_N are independent. The model is given by

Y_i = 0 with P(Y_i) = p_i + (1 - p_i) e^{-\mu_i}

Y_i = Y with P(Y_i) = (1 - p_i) \frac{e^{-\mu_i} \mu_i^{Y}}{Y!}

where Y is the number of events per period. Maximum likelihood estimation (MLE) is used to estimate the coefficients of a ZIP regression model. The ZINB regression model follows a similar formulation, with the events Y_1, Y_2, \ldots, Y_N being independent.
The model is given by

Y_i = 0 with P(Y_i) = p_i + (1 - p_i) \left( \frac{1/\alpha}{(1/\alpha) + \mu_i} \right)^{1/\alpha}

Y_i = Y with P(Y_i) = (1 - p_i) \frac{\Gamma(1/\alpha + Y) \, u_i^{1/\alpha} (1 - u_i)^{Y}}{\Gamma(1/\alpha) \, Y!}, \quad Y = 1, 2, 3, \ldots

where u_i = (1/\alpha) / [(1/\alpha) + \mu_i]. MLE is also used to estimate the coefficients of the ZINB regression model. Zero-inflated models imply that the underlying data-generating process has a splitting regime that provides for two types of zero states. The splitting process can be assumed to follow a logit (logistic) or probit (normal) probability process, or some other probability process. Note that there must be an underlying justification to believe the splitting process exists (resulting in two distinct states) prior to fitting a zero-inflated model; there should be a basis for believing that part of the process is in the zero-count state.

2.5.3 Overdispersion

The basic Poisson regression model is appropriate only if the probability distribution matches the data (Osgood, 2000). A major assumption underlying the use of log-linear analysis for Poisson distributed data is that the variance of the error distribution is completely determined by the mean. In practice, the assumption that the variance equals the mean is unlikely to be valid, because one important characteristic of counts is that the variance tends to increase with the mean; the variance is regularly found to be greater than the mean in count data. This problem is called overdispersion. Overdispersion (also called extra-Poisson variation) occurs if Var(Y) > \mu. If Var(Y) < \mu, the problem is called underdispersion, which seldom occurs in practice. This study will only focus on the overdispersion problem. Overdispersion may be due to: i) Unobserved heterogeneity. ii) The process generating the first event differing from the one determining later events. iii) Failure of the assumption of independence of events, which is implicit in the Poisson process.
The first indication that something is wrong is that the deviance measure of goodness-of-fit for the full model exceeds its degrees of freedom; this is an early sign of overdispersion. The following four aspects need to be considered in an overdispersion problem: i) A formal test for overdispersion. ii) Standard errors for regression coefficients that account for overdispersion. iii) Test statistics for added variables that account for overdispersion. iv) More general models with parameters in the variance function. If the standard Poisson model is applied to overdispersed data, the efficiency of the parameter estimates remains reasonably high, yet their standard errors are underestimated. Hence, coverage probabilities of confidence intervals and significance levels of tests are no longer valid and can produce highly misleading outcomes (Heinzl and Mittlbock, 2003). In other words, overdispersion causes underestimation of standard errors, which consequently leads to wrong inferences in the analysis (Ismail and Jemain, 2005). Therefore, overdispersion must be handled. The simplest way to allow for the possibility of overdispersion is the quasi-likelihood approach, which retains the coefficient estimates from the basic Poisson model but adjusts standard errors and significance tests based on the amount of overdispersion (Osgood, 2000). Another approach is to use negative binomial regression, a count model which allows for overdispersion. The quasi-likelihood method and negative binomial regression will be discussed further later.

2.6 Alternative Count Models

A common, more general model for analyzing count data is the negative binomial model. This model can be used if data are overdispersed; it is then more efficient than Poisson, although in practice the efficiency benefits over Poisson are small. The negative binomial should be used, however, if one wishes to predict probabilities and not just model the mean.
The negative binomial cannot be estimated if data are underdispersed. Another common, more general model is the hurdle model, which treats the process for zeroes differently from that for the nonzero counts. In this case the mean of Y_i is no longer e^{X_i' \beta}, so the Poisson estimator is inconsistent and the hurdle model should be used. This model can handle both overdispersion and underdispersion.

2.7 Negative Binomial Regression

A well-known limitation of the Poisson distribution is that the variance and mean must be approximately equal. For count data, however, overdispersion frequently occurs, that is, the variance exceeds the mean. To relax the equidispersion constraint imposed by the Poisson model, a negative binomial distribution is commonly used (Lee, Nam, and Park, 2005). The negative binomial distribution is one of the most widely used distributions when modeling count data that exhibit variation the Poisson distribution cannot explain (Dossou-Gbete and Mizere, 2006). One important characteristic of the negative binomial distribution is that it naturally accounts for overdispersion, because its variance is always greater than its mean. Negative binomial regression is typically used when there are signs of overdispersion in Poisson regression. It uses a different probability model which allows for more variability in the data; basically, it combines the Poisson distribution of event counts with a gamma distribution. The negative binomial model is simply a Poisson regression that estimates the dispersion parameter, allowing for independent specification of the mean and variance. Because the only difference between Poisson and negative binomial regression lies in their variances, regression coefficients tend to be similar across the two models, but standard errors can be very different.
When the outcome variable is overdispersed relative to the Poisson distribution, standard errors from the negative binomial model will be larger but more appropriate. Thus, in the presence of overdispersion, p-values in Poisson regression are artificially low and confidence intervals are too narrow.

CHAPTER 3

POISSON REGRESSION ANALYSIS

3.1 The Model

The general method of fitting a Poisson regression model is to use the Poisson model formulation to derive a likelihood function that can be maximized so that parameter estimates, estimated standard errors, maximized likelihood statistics, and other information can be produced. The goal of a Poisson regression analysis is to fit the data to a regression equation that accurately models E(Y) (or \mu) as a function of a set of explanatory variables X_1, X_2, \ldots, X_p. For a Poisson regression analysis, let Y be the response variable for count data and let X be the explanatory variable. Y must follow a Poisson distribution with parameter \mu. The Poisson regression model can be written as

\ln \mu = \beta_0 + \beta_1 X, \quad \text{where } Y \sim P(\mu) (3.1)

or, equivalently,

\mu = e^{\beta_0 + \beta_1 X} (3.2)

This is the model for analyzing ordinary count data. Sometimes the response may be in the form of events of a certain type that occur over time, space, or some other index of size. In this situation it is often relevant to model the data as the rate at which events occur. When a response count Y has index (such as population size) equal to t, the sample rate of outcomes is Y/t and its expected value is \mu/t. Thus, for analyzing rate data, the model can be written as

\ln(\mu / t) = \beta_0 + \beta_1 X (3.3)

This model has the equivalent representation

\ln \mu - \ln t = \beta_0 + \beta_1 X (3.4)

The adjustment term, -\ln t, on the left-hand side of the equation is called an offset.
Note that a simple linear model of the form \mu = \beta_0 + \beta_1 X cannot be used to model count data, because the linear predictor on the right-hand side can assume any real value, whereas the Poisson mean on the left-hand side, which represents an expected count, has to be nonnegative. The natural log of the mean in the Poisson regression model ensures that the predictions from the model will never be negative. One might ask whether the analysis of count data can be done by simply transforming the response. In some cases, a log-transformation of the response will help to linearize the relationship and result in more normally distributed errors. In other cases, however, a simple log-transformation will not solve the problems and the natural log link approach is needed. Furthermore, the presence of zeroes in the response variable is problematic for a log-transformation of the response; the Poisson regression model readily handles this situation.

3.2 Estimation of Parameters Using Maximum Likelihood Estimation (MLE)

Estimation of parameters in Poisson regression relies on the maximum likelihood estimation (MLE) method. Maximum likelihood estimation seeks to answer the question of which values of the regression coefficients are most likely to have given rise to the data. MLE focuses on a likelihood function that describes the probability of observing the data as a function of a set of parameters. As stated previously, Poisson regression uses the Poisson distribution as the probability model, and the regression coefficients define the parameters that specify the mean structure of the data. The goal in MLE is to find the estimates of the regression coefficients that maximize the likelihood function. This can be accomplished by setting the first derivative of the likelihood equation equal to zero and solving for the regression coefficients.
In most practical cases, finding the MLE requires an iterative process, which adds an extra layer of complexity to these models. In particular, a complex model involving many parameters combined with a small sample size may prevent the process from converging. Ultimately, MLE yields asymptotic standard errors for the regression coefficients. To discuss maximum likelihood estimation for Poisson regression, let \mu_i be the mean for the ith response, for i = 1, 2, \ldots, n. Since the mean response is assumed to be a function of a set of explanatory variables X_1, X_2, \ldots, X_k, the notation \mu(X_i, \beta) is used to denote the function that relates the mean response \mu_i to X_i (the values of the explanatory variables for case i) and \beta (the values of the regression coefficients). Now consider the Poisson regression model in the following form:

\mu_i = \mu(X_i, \beta) = e^{X_i' \beta} (3.5)

Then, from the Poisson distribution,

P(Y_i; \beta) = \frac{[\mu(X_i, \beta)]^{Y_i} e^{-\mu(X_i, \beta)}}{Y_i!} (3.6)

The likelihood function is given as

L(Y; \beta) = \prod_{i=1}^{N} P(Y_i; \beta) = \prod_{i=1}^{N} \frac{[\mu(X_i, \beta)]^{Y_i} e^{-\mu(X_i, \beta)}}{Y_i!} = \frac{\left\{ \prod_{i=1}^{N} [\mu(X_i, \beta)]^{Y_i} \right\} e^{-\sum_{i=1}^{N} \mu(X_i, \beta)}}{\prod_{i=1}^{N} Y_i!} (3.7)

The next step is to take the natural log of the likelihood function, differentiate it with respect to \beta, and equate the derivative to zero. The log-likelihood function is given as

\ln L(Y; \beta) = \sum_{i=1}^{N} \left[ Y_i \ln \mu(X_i, \beta) - \mu(X_i, \beta) - \ln(Y_i!) \right] (3.8)

\frac{\partial [\ln L(Y; \beta)]}{\partial \beta} = 0 (3.9)

The solution to the set of maximum likelihood equations above must generally be obtained by an iterative procedure; one such procedure is known as iteratively reweighted least squares. This procedure estimates the values of \beta. Maximum likelihood estimation produces Poisson parameter estimates that are consistent, asymptotically normal, and asymptotically efficient. To demonstrate the estimation of parameters using maximum likelihood, consider the method of scoring in the generalized linear model.
The method of scoring in the generalized linear model simplifies the estimating equation to

b^{(m)} = b^{(m-1)} + [I^{(m-1)}]^{-1} U^{(m-1)} (3.10)

where b^{(m)} is the vector of estimates of the parameters \beta_0, \beta_1, \ldots, \beta_p at the mth iteration, I is the information matrix with elements I_{jk} given by

I_{jk} = \sum_{i=1}^{N} \frac{x_{ij} x_{ik}}{Var(Y_i)} \left( \frac{\partial \mu_i}{\partial \eta_i} \right)^2 (3.11)

and U is the vector with elements

U_j = \sum_{i=1}^{N} \frac{(Y_i - \mu_i) x_{ij}}{Var(Y_i)} \frac{\partial \mu_i}{\partial \eta_i} (3.12)

U is called the score function. If both sides of equation (3.10) are multiplied by I^{(m-1)}, it becomes

I^{(m-1)} b^{(m)} = I^{(m-1)} b^{(m-1)} + U^{(m-1)} (3.13)

From (3.11), I can be written as

I = X' W X (3.14)

where W is the N \times N diagonal matrix with elements

w_{ii} = \frac{1}{Var(Y_i)} \left( \frac{\partial \mu_i}{\partial \eta_i} \right)^2 (3.15)

The expression on the right-hand side of equation (3.13) is the vector with elements

\sum_{k=0}^{p} \sum_{i=1}^{N} \frac{x_{ij} x_{ik}}{Var(Y_i)} \left( \frac{\partial \mu_i}{\partial \eta_i} \right)^2 b_k^{(m-1)} + \sum_{i=1}^{N} \frac{(Y_i - \mu_i) x_{ij}}{Var(Y_i)} \frac{\partial \mu_i}{\partial \eta_i} (3.16)

evaluated at b^{(m-1)}. Thus, the right-hand side of equation (3.13) can be written as X' W z, where z has elements

z_i = \sum_{k=0}^{p} x_{ik} b_k^{(m-1)} + (Y_i - \mu_i) \frac{\partial \eta_i}{\partial \mu_i} (3.17)

Note that z is an N \times 1 vector. Hence, the iterative equation for parameter estimation can finally be written as

(X' W X)^{(m-1)} b^{(m)} = (X' W z)^{(m-1)} (3.18)

This equation has to be solved iteratively because, in general, z and W depend on b. This iterative method is known as the iteratively reweighted least squares (IRWLS) method. Now consider a set of Poisson regression data Y_1, Y_2, \ldots, Y_N satisfying the properties of the generalized linear model. The parameters \beta_0 and \beta_1 (consider just these two) are related to the Y_i's through E(Y_i) = \mu_i and g(\mu_i) = \eta_i = \ln(\mu_i) = \beta_0 + \beta_1 x_i.
From equation (3.15), with Var(Y_i) = \mu_i and \partial \mu_i / \partial \eta_i = \mu_i, the following is obtained:

w_{ii} = \frac{1}{\mu_i} (\mu_i)^2 = \mu_i

Using the estimate b = (b_0, b_1)' for \beta, equation (3.17) becomes

z_i = b_0 + b_1 x_i + \frac{Y_i - \mu_i}{\mu_i}

To obtain the estimating equation, the following matrices must be formed: the N \times 2 design matrix X whose ith row is (1, x_i); the coefficient vector b = (b_0, b_1)'; the diagonal weight matrix W = diag(\mu_1, \mu_2, \ldots, \mu_N); and the N \times 1 working-response vector z with elements z_i = b_0 + b_1 x_i + (Y_i - \mu_i)/\mu_i. Multiplying these matrices gives

X' W X = \begin{pmatrix} \sum_{i=1}^{N} \mu_i & \sum_{i=1}^{N} \mu_i x_i \\ \sum_{i=1}^{N} \mu_i x_i & \sum_{i=1}^{N} \mu_i x_i^2 \end{pmatrix} (3.19)

and

X' W z = \begin{pmatrix} \sum_{i=1}^{N} \mu_i \left[ b_0 + b_1 x_i + (Y_i - \mu_i)/\mu_i \right] \\ \sum_{i=1}^{N} \mu_i x_i \left[ b_0 + b_1 x_i + (Y_i - \mu_i)/\mu_i \right] \end{pmatrix} (3.20)

Since \ln(\mu_i) = b_0 + b_1 x_i, it follows that \mu_i = e^{b_0 + b_1 x_i}. Therefore, (3.19) and (3.20) become

X' W X = \begin{pmatrix} \sum_{i=1}^{N} e^{b_0 + b_1 x_i} & \sum_{i=1}^{N} x_i e^{b_0 + b_1 x_i} \\ \sum_{i=1}^{N} x_i e^{b_0 + b_1 x_i} & \sum_{i=1}^{N} x_i^2 e^{b_0 + b_1 x_i} \end{pmatrix} (3.21)

X' W z = \begin{pmatrix} \sum_{i=1}^{N} e^{b_0 + b_1 x_i} \left[ b_0 + b_1 x_i + Y_i e^{-(b_0 + b_1 x_i)} - 1 \right] \\ \sum_{i=1}^{N} x_i e^{b_0 + b_1 x_i} \left[ b_0 + b_1 x_i + Y_i e^{-(b_0 + b_1 x_i)} - 1 \right] \end{pmatrix} (3.22)

The maximum likelihood estimates are obtained iteratively using equation (3.18). Initial values can be obtained by applying the link to the data, that is, taking the natural log of the response and regressing it on the predictors (explanatory variables) using the ordinary least squares (OLS) method given by

\hat{\beta} = (X' X)^{-1} X' Y (3.23)

To avoid problems with counts of 0, a small constant can be added to all responses. The procedure converges in a few iterations.

3.3 Standard Errors for Regression Coefficients

Standard errors for \beta_0 and \beta_1 are obtained from the inverse of the information matrix at the last iteration, that is,

I^{-1} = (X' W X)^{-1} = \begin{pmatrix} a & b \\ c & d \end{pmatrix}

where the standard error of \hat{\beta}_0 is \sqrt{a} and the standard error of \hat{\beta}_1 is \sqrt{d}.
These standard errors are important in calculating confidence intervals.

3.4 Interpretation of Coefficients

The interpretation of coefficients in the Poisson regression model is fairly straightforward and intuitive. What must be accounted for in interpreting the coefficients is simply the fact that a natural log link has been incorporated in the model, unlike in ordinary regression models. To interpret the results of the analysis, the estimates of interest must be exponentiated (e^{\beta}), as must the ends of the confidence interval, in order to get estimates on the original scale of the outcome. The response variable can then be said to have multiplicative changes for each one-unit change in the predictor variable. To illustrate, consider the Poisson regression model

\ln(\mu) = \alpha + \beta x

This model can be written as

\mu = e^{\alpha + \beta x} = e^{\alpha} e^{\beta x}

Next, consider two values of x, x_1 and x_2, such that the difference between them equals 1; for example, x_1 = 10 and x_2 = 11, so that x_2 - x_1 = 1. When x = x_1 = 10,

\mu_1 = e^{\alpha} e^{\beta x_1} = e^{\alpha} e^{\beta(10)} (3.24)

When x = x_2 = 11,

\mu_2 = e^{\alpha} e^{\beta x_2} = e^{\alpha} e^{\beta(x_1 + 1)} = e^{\alpha} e^{\beta x_1} e^{\beta} = e^{\alpha} e^{\beta(10)} e^{\beta} (3.25)

A change in x therefore has a multiplicative effect on the mean of Y. For a one-unit increase in the explanatory variable (i.e., x_2 - x_1 = 1),

\mu_1 = e^{\alpha} e^{\beta x_1} \quad \text{and} \quad \mu_2 = e^{\alpha} e^{\beta x_1} e^{\beta}

If \beta = 0, then e^{0} = 1, so \mu_1 = \mu_2 = e^{\alpha} e^{\beta x_1}; in this case \mu = E(Y) is not related to x. If \beta > 0, then e^{\beta} > 1, so \mu_2 = \mu_1 e^{\beta} is larger than \mu_1 by the factor e^{\beta}. If \beta < 0, then 0 \le e^{\beta} < 1, so \mu_2 = \mu_1 e^{\beta} is smaller than \mu_1 by the factor e^{\beta}. In short, the multiplicative effect means that increasing x by one unit multiplies the mean by the factor e^{\beta}; a one-unit change in an explanatory variable X multiplies the expected response by e^{\beta}.
Take note that positive coefficients indicate an increase in the prediction while negative coefficients indicate a decrease. The explanatory variables in Poisson regression may be in coded or continuous form. When the explanatory variables are coded, the interpretation of a coefficient can be read straightforwardly from the parameter estimates. When the explanatory variables are continuous, however, interpretation requires a bit more work. There are three strategies that can be used to interpret continuous explanatory variables. The first strategy uses the regression equation, which is a type of prediction equation: the regression equation is used to generate predictions over specific ranges of the predictors (explanatory variables). For the Poisson regression equation

\ln(\mu_i) = \beta_0 + \beta_1 X_i

a prediction is generated by replacing the explanatory variable with a numeric value: multiplying each predictor value X_i by the corresponding regression coefficient \beta_1 and then exponentiating generates the predicted value at each predictor value. The second strategy uses the regression equation to provide predicted values for discrete combinations of predictors. Instead of generating continuous predictions along the range of the predictors, specific values for individual predictors are specified. This strategy is virtually identical to the first, except that discrete rather than continuous predictions are generated; for continuous explanatory variables, it is most appropriate when there are clear cutoffs or benchmarks. The first and second strategies could be used with any regression model. The third strategy, however, is restricted to Poisson regression.
The third strategy interprets the regression coefficients in the Poisson model as the percentage change in the expected counts, given by

100 \left( e^{\beta \times \delta} - 1 \right) (3.26)

where \beta is the regression coefficient from the Poisson regression and \delta is the number of units of change in the explanatory variable (e.g., for a one-unit change in the explanatory variable, \delta = 1). This strategy follows from the fact that the Poisson model is a multiplicative model in which the predictors are exponentiated (Atkins and Gallop, 2007).

3.5 Elasticity

To provide some insight into the implications of the parameter estimation results, elasticities are computed to determine the marginal effects of the independent variables. Elasticities provide an estimate of the impact of an explanatory variable on the mean and are interpreted as the effect of a one percent change in the explanatory variable on the mean. For example, an elasticity of -1.32 is interpreted as: a one percent increase in the explanatory variable reduces the mean by 1.32 percent. Elasticities are the correct way of evaluating the relative impact of each explanatory variable in the model. The elasticity of the mean is defined as

E_{X_{ik}}^{\mu_i} = \beta_k X_{ik} (3.27)

where E represents the elasticity, X_{ik} is the value of the kth explanatory variable for observation i, \beta_k is the estimated parameter for the kth explanatory variable, and \mu_i is the mean for observation i. Note that elasticities are computed for each observation i; it is common to report a single elasticity as the average elasticity over all i. The elasticity in equation (3.27) is only appropriate for continuous explanatory variables such as highway lane width, distance from the outside shoulder edge to roadside features, and vertical curve length. It is not valid for noncontinuous variables such as indicator variables that take on values of zero or one.
For indicator variables, a pseudo-elasticity is computed to estimate an approximate elasticity of the variable. The pseudo-elasticity gives the incremental change in the mean caused by changes in the indicator variable. The pseudo-elasticity for indicator variables is computed as

E_{X_{ik}}^{\mu_i} = \frac{e^{\beta_k} - 1}{e^{\beta_k}} (3.28)

3.6 Model Checking Using Pearson Chi-Square and Deviance

Popular measures of the adequacy of the model fit are the Pearson chi-square and the deviance. If the values of both the Pearson chi-square and the deviance are close to the degrees of freedom, N - p, the model may be considered adequate. To check the goodness-of-fit of the model, the following hypotheses are required:

H_0: the model has a good fit versus H_1: the model has lack of fit

3.6.1 Pearson Chi-Square

Let Y_i be the observed count and \hat{\mu}_i be the fitted mean value. Then, for a Poisson regression analysis, the Pearson chi-square statistic is given by

X^2 = \sum \frac{(Y_i - \hat{\mu}_i)^2}{\hat{\mu}_i} (3.29)

When the fitted mean values \hat{\mu}_i are relatively large (greater than 5), this test statistic has an approximate chi-squared distribution. Its degrees of freedom equal the number of response counts, N, minus the number of parameters in the model, p. H_0 will be rejected if X^2 > \chi^2_{0.05}(N - p), indicating lack of fit of the model.

3.6.2 Deviance

The deviance is given by

D = 2 \sum Y_i \ln \left( \frac{Y_i}{\hat{\mu}_i} \right) (3.30)

For large samples, the deviance also has an approximate chi-squared distribution with (N - p) degrees of freedom. As with the Pearson chi-square, H_0 will be rejected if D > \chi^2_{0.05}(N - p), indicating lack of fit of the model.

3.7 Model Residuals

For observation i, the raw residual Y_i - \hat{\mu}_i between an observed and fitted count has limited usefulness. For Poisson sampling, for instance, the standard deviation of a count is \sqrt{\hat{\mu}_i}, so larger differences tend to occur when \hat{\mu}_i is larger.
The Pearson residual, R, is a standardization of this difference, defined by

R = \frac{\text{observed} - \text{fitted}}{\sqrt{\hat{Var}(\text{observed})}} (3.31)

For a Poisson GLM, this simplifies for count i to

e_i = \frac{Y_i - \hat{\mu}_i}{\sqrt{\hat{\mu}_i}} (3.32)

which standardizes the difference by dividing by the estimated Poisson standard deviation. These residuals relate to the Pearson chi-square statistic by \sum e_i^2 = X^2; counts with larger residuals make greater contributions to the overall X^2 value for testing goodness-of-fit of the model. Pearson residuals fluctuate around zero, following approximately a normal distribution when \mu_i is large. When the model holds, however, these residuals are less variable than standard normal, because the numerator must use the fitted value \hat{\mu}_i rather than the true mean \mu_i; since the sample data determine the fitted value, Y_i - \hat{\mu}_i tends to be smaller than Y_i - \mu_i. The Pearson residual divided by its estimated standard error is called an adjusted residual. It does have an approximate standard normal distribution when \mu_i is large (greater than 5). Thus, with adjusted residuals, it is easier to tell when a deviation Y_i - \hat{\mu}_i is large. Residuals larger than about 2 in absolute value are worthy of attention, and adjusted residuals are preferable to Pearson residuals. PROC GENMOD in SAS 9.1 can provide Pearson residuals as well as adjusted residuals.

3.8 Inference

Inference consists of hypothesis testing and calculating confidence intervals. A hypothesis test may answer the question of whether or not a parameter in the model equals a certain value; hypothesis tests are also applied in comparing how well two (or more) related models fit the data. Confidence intervals, on the other hand, are increasingly regarded as more useful than hypothesis tests, because the width of a confidence interval provides a measure of the precision with which inferences can be made, and it does so in a way that is conceptually simpler than the power of a statistical test.
3.8.1 Test of Significance

In GLM, a test of significance must be performed in order to assess the importance of the predictor variables in the model. For this purpose, the following hypotheses are required:

H_0: \beta_1 = 0 versus H_1: \beta_1 \neq 0

These hypotheses can be tested using the Wald statistic. The test statistic is given by

Z = \frac{\hat{\beta}_1 - \beta_1}{SE(\hat{\beta}_1)} (3.33)

which has an approximate standard normal distribution under H_0. Z^2 has a chi-squared distribution with one degree of freedom (df = 1). Therefore, H_0 is rejected if Z^2 > \chi^2_{0.05}(1).

3.8.2 Confidence Intervals

From Section 3.3, a 95% confidence interval for \beta_1 is given by

\hat{\beta}_1 \pm 1.96 \times SE(\hat{\beta}_1) (3.34)

3.9 Handling Overdispersion

3.9.1 Quasi-Likelihood Method

Basically, there are three steps in the quasi-likelihood approach. First, the regression coefficients and their standard errors are estimated using MLE. Second, the dispersion parameter, \phi, is estimated separately. Third, the standard errors are adjusted for the estimated dispersion parameter so that proper confidence intervals and test statistics can be obtained. These steps are summarized in Figure 3.1.

Figure 3.1: Steps in the quasi-likelihood approach (run the usual Poisson regression and obtain the coefficients and their standard errors; estimate the dispersion parameter; adjust the standard errors)

If the precise mechanism that produces the overdispersion is known, a specific method may be applied to model the data. In the absence of such knowledge, it is convenient to introduce a dispersion parameter, \phi, in the variance formula, that is, Var(Y) = \phi \mu. If \phi > 1, then overdispersion exists. This is a rather robust approach to tackling the problem, since even quite substantial deviations from the assumed simple linear functional form Var(Y) = \phi \mu generally have a merely minor effect on the conclusions related to standard errors, confidence intervals, and p-values.
The introduction of the dispersion parameter does not introduce a new probability distribution; it merely gives a correction term for testing the parameter estimates under the Poisson model. The models are fit in the usual way, and the parameter estimates are not affected by the value of \phi; only the standard errors are inflated by this factor. This method produces appropriate inference if the overdispersion is modest, and it has become the conventional approach in Poisson regression analysis. It is worth pointing out that the quasi-likelihood argument starts from the quasi-score function. To describe the quasi-likelihood method, first consider the score function given in (3.12),

U = \sum_{i=1}^{N} \frac{(Y_i - \mu_i)}{Var(Y_i)} \frac{\partial \mu_i}{\partial \eta_i} x_i

For overdispersed Y, Var(Y_i) = \phi_i \mu_i as introduced previously, and since g(\mu_i) = \eta_i = \ln(\mu_i), the derivative is \partial \mu_i / \partial \eta_i = \mu_i. Substituting these two relations into (3.12) results in the function

u = \sum_{i=1}^{N} \frac{1}{\phi_i} x_i (Y_i - \mu_i) (3.35)

Note that u is not the score function, because the Y_i's are no longer Poisson; u is called a quasi-score function or an estimating function. The solution of u = 0, the maximum quasi-likelihood estimate (MQLE) of \beta, is denoted by \tilde{\beta}. \tilde{\beta} - \beta can then be expressed approximately as a matrix multiple of the multivariate random vector u:

\tilde{\beta} - \beta \approx [\Gamma(\beta)]^{-1} u (3.36)

where \Gamma(\beta) = E\left( -\partial u / \partial \beta' \right). Therefore \tilde{\beta} - \beta is asymptotically multivariate normally distributed. As in the case of the score function and its negative expected derivative I (the information matrix), the negative expected value of the derivative of the estimating function, \Gamma(\beta), turns out to be the variance of u.
For the natural log link function, the negative derivative does not involve the random variables Y_i, so it equals its expectation, as written below:

\Gamma(\beta) = E\left( -\frac{\partial u}{\partial \beta'} \right) = \sum_{i=1}^{N} \frac{1}{\phi_i} \mu_i x_i x_i' (3.37)

while

Var(u) = \sum_{i=1}^{N} x_i \frac{1}{\phi_i} Var(Y_i - \mu_i) \frac{1}{\phi_i} x_i' = \sum_{i=1}^{N} \frac{1}{\phi_i} \mu_i x_i x_i' (3.38)

Consequently, the asymptotic variance of \tilde{\beta} is [\Gamma(\beta)]^{-1}, that is,

Var(\tilde{\beta}) = [\Gamma(\beta)]^{-1} Var(u) [\Gamma(\beta)]^{-1} = [\Gamma(\beta)]^{-1} (3.39)

If the dispersion parameter \phi_i is constant over i, that is, \phi_i = \phi for all i, then \phi drops out of u = 0, and the estimating function becomes

u_0 = \sum_{i=1}^{N} x_i (Y_i - \mu_i) (3.40)

with expected derivative

E\left( -\frac{\partial u_0}{\partial \beta'} \right) = \Gamma_0(\beta) = \sum_{i=1}^{N} \mu_i x_i x_i' (3.41)

where the subscript 0 indicates that the estimating function and the expectation of its derivative are free of the overdispersion parameter. Although u_0 is identical to the score function for pure Poisson outcomes, u_0 is not the score function for overdispersed counts. In this constant-overdispersion case,

Var(u_0) = \sum_{i=1}^{N} x_i Var(Y_i - \mu_i) x_i' = \phi \sum_{i=1}^{N} \mu_i x_i x_i' = \phi \Gamma_0(\beta) (3.42)

and

\tilde{\beta} - \beta \approx [\Gamma_0(\beta)]^{-1} u_0 (3.43)

Thus,

Var(\tilde{\beta}) = [\Gamma_0(\beta)]^{-1} Var(u_0) [\Gamma_0(\beta)]^{-1} = \phi [\Gamma_0(\beta)]^{-1} (3.44)

This expression implies that one can fit an ordinary Poisson regression model to overdispersed data, as if there were no overdispersion, in order to obtain the maximum quasi-likelihood estimate \tilde{\beta}. The maximum quasi-likelihood estimate is consistent and asymptotically normal, but its variance is \phi [\Gamma_0(\beta)]^{-1} rather than the usual [\Gamma_0(\beta)]^{-1} of ordinary Poisson regression. Therefore, inflating the standard errors by the factor \sqrt{\phi} is all that is required for a valid analysis in this case. PROC GENMOD in SAS 9.1 implements this approach, and the dispersion parameter \phi can be obtained using SAS 9.1.

3.9.1.1 Estimating the Overdispersion Parameter

The dispersion parameter \phi can be estimated based only on the first two moments of Y.
From the relationship E\left[(Y_i - \mu_i)^2 / \mu_i\right] = \phi,

\tilde{\phi} = \frac{1}{N - p} \sum_{i=1}^{N} \frac{(Y_i - \hat{\mu}_i)^2}{\hat{\mu}_i}    (3.45)

Note that \tilde{\phi} is the scaled Pearson chi-square (the Pearson chi-square statistic divided by its degrees of freedom). Besides the scaled Pearson chi-square, the scaled deviance (the deviance statistic divided by its degrees of freedom) is also commonly used to estimate the dispersion parameter:

\hat{\phi} = \frac{1}{N - p} \cdot 2 \sum_{i=1}^{N} Y_i \ln\left(\frac{Y_i}{\hat{\mu}_i}\right)    (3.46)

From the relation \operatorname{Var}(Y) = \phi\mu, it is clear that if there is no overdispersion, the estimated \phi (\tilde{\phi} or \hat{\phi}) will be close to 1. Once the dispersion parameter is obtained, the usual Poisson regression standard errors are multiplied by \sqrt{\phi} and the t-statistics divided by \sqrt{\phi}. Standard errors multiplied by \sqrt{\phi} are called adjusted standard errors. The maximum likelihood estimates themselves are unchanged. Inference can then be performed in the usual way with the adjusted standard errors, and a proper confidence interval is obtained when they are used.

3.9.1.2 Testing for Overdispersion

Even though the overdispersion parameter can be estimated as above, one still needs to test whether overdispersion actually exists. To do so, the following hypotheses are tested:

H_0: \operatorname{Var}(Y) = \mu   versus   H_1: \operatorname{Var}(Y) = \mu + \alpha g(\mu)

where g(\mu) = \mu or g(\mu) = \mu^2. Equivalently, one can simply write the hypotheses as:

H_0: No overdispersion exists   versus   H_1: Overdispersion exists

For g(\mu) = \mu, the test statistic for the existence of overdispersion is

Q_1 = \frac{1}{\sqrt{2N}} \sum_{i=1}^{N} \frac{(Y_i - \hat{\mu}_i)^2 - Y_i}{\hat{\mu}_i}    (3.47)

For g(\mu) = \mu^2, the test statistic is

Q_2 = \frac{\sum_{i=1}^{N} \left[(Y_i - \hat{\mu}_i)^2 - Y_i\right]}{\sqrt{2 \sum_{i=1}^{N} \hat{\mu}_i^2}}    (3.48)

Both Q_1 and Q_2 are distributed approximately standard normal when N is large, so their squares follow a chi-squared distribution with one degree of freedom (df = 1).
H_0 is rejected if Q_1 or Q_2 is statistically significant, that is, if Q_1 > \chi^2_{0.05}(1) or Q_2 > \chi^2_{0.05}(1). This study uses only Q_1 to test for the existence of overdispersion.

3.9.2 Negative Binomial Regression Analysis

The negative binomial regression model is derived by rewriting the Poisson regression model as

\ln \mu = \beta_0 + \beta_1 X + \varepsilon    (3.49)

where e^{\varepsilon} is a gamma-distributed error term with mean 1 and variance \alpha. This addition allows the variance to differ from the mean:

\operatorname{Var}(Y) = \mu(1 + \alpha\mu) = \mu + \alpha\mu^2    (3.50)

Here \alpha also acts as a dispersion parameter. The Poisson regression model is a limiting case of the negative binomial regression model as \alpha approaches zero, so the selection between the two models depends on the value of \alpha. The negative binomial distribution has the form

P(Y = y) = \frac{\Gamma(1/\alpha + y)}{\Gamma(1/\alpha)\, y!} \left(\frac{1/\alpha}{1/\alpha + \mu}\right)^{1/\alpha} \left(\frac{\mu}{1/\alpha + \mu}\right)^{y}    (3.51)

where \Gamma(\cdot) is the gamma function. This yields the likelihood function

L(Y_i) = \prod_i \frac{\Gamma(1/\alpha + y_i)}{\Gamma(1/\alpha)\, y_i!} \left(\frac{1/\alpha}{1/\alpha + \mu_i}\right)^{1/\alpha} \left(\frac{\mu_i}{1/\alpha + \mu_i}\right)^{y_i}    (3.52)

Maximum likelihood estimation is used to estimate the parameters of the negative binomial model. The interpretation of the regression coefficients is the same as for Poisson regression.

3.10 Example

To illustrate Poisson regression, consider the following example. Although male elephants are capable of reproducing by 14 to 17 years of age, young adult males are usually unsuccessful in competing with their larger elders for the attention of receptive females. Since male elephants continue to grow throughout their lifetimes, and since larger males tend to be more successful at mating, the males most likely to pass their genes to future generations are those whose characteristics enable them to live long lives. Joyce Poole studied a population of African elephants in Amboseli National Park, Kenya, for 8 years.
Table 3.1 shows the number of successful matings and the ages of 41 male elephants. The question of interest is: what is the relationship between mating success and age?

Table 3.1: Elephants' mating success by age

Age Matings | Age Matings | Age Matings
27  0       | 33  3       | 39  1
28  1       | 33  3       | 41  3
28  1       | 33  3       | 42  4
28  1       | 33  2       | 43  0
28  3       | 34  1       | 43  2
29  0       | 34  1       | 43  3
29  0       | 34  2       | 43  4
29  0       | 34  3       | 43  9
29  2       | 36  5       | 44  3
29  2       | 36  6       | 45  5
29  2       | 37  1       | 47  7
30  1       | 37  1       | 48  2
32  2       | 37  6       | 52  9
33  4       | 38  2       |

Model Fitting

In this example the response, Y, is mating success and the explanatory variable, X, is age. The model for these data is

\ln[E(\mathrm{Mating}_i)] = \beta_0 + \beta_1(\mathrm{Age}_i)

or simply

\ln[E(Y_i)] = \beta_0 + \beta_1 X_i

Estimation of Parameters

Initial estimates. Since some responses are zero, 0.1 is added to all responses, and the natural log of the responses is taken. Using the OLS method with

X = \begin{bmatrix} 1 & 27 \\ 1 & 28 \\ \vdots & \vdots \\ 1 & 52 \end{bmatrix},
Y = \begin{bmatrix} -2.3026 \\ 0.0953 \\ \vdots \\ 2.2083 \end{bmatrix}

the initial estimates are obtained as

\hat{\beta} = (X'X)^{-1}X'Y = \begin{bmatrix} 41 & 1470 \\ 1470 & 54436 \end{bmatrix}^{-1} \begin{bmatrix} 22.5582 \\ 959.6767 \end{bmatrix} = \begin{bmatrix} -2.5748 \\ 0.0872 \end{bmatrix}

Therefore b_0^{(0)} = -2.5748 and b_1^{(0)} = 0.0872.

Next estimates. The next estimates are found with the IRWLS method by solving equation (3.18) iteratively; the computation is aided by Microsoft Excel. Using the initial estimates, at m = 1,

(X'WX)^{(0)} = \begin{bmatrix} 84.5954 & 3375.317 \\ 3375.317 & 138732.8 \end{bmatrix}   and   (X'Wz)^{(0)} = \begin{bmatrix} 101.9159 \\ 4322.413 \end{bmatrix}

Thus

b^{(1)} = \begin{bmatrix} -1.3117 \\ 0.0631 \end{bmatrix}

that is, b_0^{(1)} = -1.3117 and b_1^{(1)} = 0.0631. This iterative process is continued until it converges; the results are shown in Table 3.2. The maximum likelihood estimates are \hat{\beta}_0 = -1.5820 and \hat{\beta}_1 = 0.0687, so the fitted model is

\ln[E(Y_i)] = -1.5820 + 0.0687 X_i

Table 3.2: Iteratively reweighted least squares results

m      0        1        2        3        4
b_0   -2.5748  -1.3117  -1.5689  -1.5820  -1.5820
b_1    0.0872   0.0631   0.0684   0.0687   0.0687

The results are checked using SAS 9.1.
The estimates of \beta_0 and \beta_1 from SAS 9.1 are found to be the same as the estimates calculated manually. Figure 3.2 shows the result of analyzing the same data using SAS 9.1; the SAS 9.1 code is given in APPENDIX A. This agreement shows that the manual computation method is practically valid and useful.

The GENMOD Procedure

Model Information: Data Set WORK.ELEPHANT; Distribution Poisson; Link Function Log; Dependent Variable matings. Number of Observations Read 41; Number of Observations Used 41.

Criteria For Assessing Goodness Of Fit
Criterion            DF   Value      Value/DF
Deviance             39   51.0116    1.3080
Scaled Deviance      39   51.0116    1.3080
Pearson Chi-Square   39   45.1360    1.1573
Scaled Pearson X2    39   45.1360    1.1573
Log Likelihood            10.7400

Algorithm converged.

Analysis Of Parameter Estimates
Parameter  DF  Estimate  Std Error  Wald 95% Limits      ChiSq   Pr > ChiSq
Intercept   1  -1.5820   0.5446     (-2.6494, -0.5146)    8.44   0.0037
age         1   0.0687   0.0137     ( 0.0418,  0.0956)   24.97   <.0001
Scale       0   1.0000   0.0000     ( 1.0000,  1.0000)

NOTE: The scale parameter was held fixed.

Figure 3.2: SAS result for analysis of elephants' mating success data

Interpretation of Coefficients

From the model, \beta_1 = 0.0687 indicates that a one-year increase in age multiplies the expected mating success by e^{0.0687} = 1.0711. Equivalently, each additional year of age increases the mean number of successful matings by 7.1 percent, since |100(e^{0.0687} - 1)| = 7.1\%.

Elasticity

Evaluated at the average age, the elasticity for age is found to be

E = \beta_1 \bar{X} = 0.0687 \bar{X} = 2.4631

This value indicates that a one percent increase in age increases the elephant's mating success by about 2.46 percent.
Standard Error

The information matrix I at the final iteration is

I = (X'WX)^{(4)} = \begin{bmatrix} 110.0317 & 4292.276 \\ 4292.276 & 172733.3 \end{bmatrix}

and its inverse is

I^{-1} = \begin{bmatrix} 0.2965 & -7.37 \times 10^{-3} \\ -7.37 \times 10^{-3} & 1.89 \times 10^{-4} \end{bmatrix}

From the inverse of the information matrix, the standard errors of \beta_0 and \beta_1 are

SE(\hat{\beta}_0) = \sqrt{0.2965} = 0.5445
SE(\hat{\beta}_1) = \sqrt{1.89 \times 10^{-4}} = 0.0137

These values are the same as those obtained from SAS.

Model Checking

From the model, \hat{\mu}_i = e^{-1.5820 + 0.0687 X_i}. All values of \hat{\mu}_i are shown in APPENDIX B; they are computed using Microsoft Excel for simplicity. The Pearson chi-square statistic for testing the goodness of fit of the elephant mating model is

X^2 = \sum_{i=1}^{41} \frac{(Y_i - \hat{\mu}_i)^2}{\hat{\mu}_i} = 45.1236

From the statistical table, \chi^2_{0.05}(39) = 55.758. Since 45.1236 < 55.758, H_0 is not rejected at significance level \alpha = 0.05, indicating that the model has a good fit. Next, the deviance statistic for the elephant mating data is

D = 2 \sum_{i=1}^{41} Y_i \ln\left(\frac{Y_i}{\hat{\mu}_i}\right) = 2(24.0352) = 48.0704

Note: 0.1 is added to the zero responses so that Y_i \ln(Y_i/\hat{\mu}_i) can be calculated; this is why the deviance obtained here differs slightly from the SAS value. Again, \chi^2_{0.05}(39) = 55.758, and since 48.0704 < 55.758, H_0 is not rejected at \alpha = 0.05, indicating a good fit, in agreement with the Pearson chi-square result.

Residual Analysis

Table 3.3 shows the Pearson residuals for the elephant mating data (computed using Microsoft Excel). Adjusted residuals cannot be computed because of the single observations at ages 27, 30, 32, 38, 39, 41, 42, 44, 45, 47, 48 and 52. From Table 3.3, observations 24, 27, and 36 have residual values greater than 2, so these observations deserve attention.
The other residuals, however, are not large enough to indicate potential problems with model fit, so overall the model fits the data well.

Table 3.3: Residuals for elephants' mating success data

Obs  Age  Mating  Pearson Residual
1    27   0       -1.1462
2    28   1       -0.34326
3    28   1       -0.34326
4    28   1       -0.34326
5    28   3        1.342717
6    29   0       -1.22771
7    29   0       -1.22771
8    29   0       -1.22771
9    29   2        0.401341
10   29   2        0.401341
11   29   2        0.401341
12   30   1       -0.48359
13   32   2        0.108564
14   33   4        1.431296
15   33   3        0.721338
16   33   3        0.721338
17   33   3        0.721338
18   33   2        0.01138
19   34   1       -0.77177
20   34   1       -0.77177
21   34   2       -0.08579
22   34   3        0.600195
23   36   5        1.640773
24   36   6        2.281213
25   37   1       -0.99718
26   37   1       -0.99718
27   37   6        2.096892
28   38   2       -0.47663
29   39   1       -1.15319
30   41   3       -0.23589
31   42   4        0.165836
32   43   0       -1.98586
33   43   2       -0.97873
34   43   3       -0.47517
35   43   4        0.028389
36   43   9        2.546195
37   44   3       -0.59558
38   45   5        0.223561
39   47   7        0.794057
40   48   2       -1.50978
41   52   9        0.62158

Inference

One may want to know whether \beta_1 is necessary in the model, that is, whether age has an effect on mating success. The hypotheses to be tested are

H_0: \beta_1 = 0   versus   H_1: \beta_1 \neq 0

The test statistic is

Z = \frac{0.0687 - 0}{0.0137} = 5.0146

so Z^2 = 5.0146^2 = 25.1462. From the statistical table, \chi^2_{0.05}(1) = 3.841. Since 25.1462 > 3.841, H_0 is rejected at significance level \alpha = 0.05. This shows that \beta_1 is important to the model; in other words, age really does affect the elephant's mating success.

Confidence Intervals

As mentioned in Section 3.8.2, the 95% confidence interval for \beta_1 is

\hat{\beta}_1 \pm 1.96 \times SE(\hat{\beta}_1) = 0.0687 \pm 1.96(0.0137)   or   (0.0418, 0.0956)

The corresponding 95% confidence interval for the multiplicative factor is (e^{0.0418}, e^{0.0956}) = (1.0427, 1.1003).

Overdispersion

Since the data in this example caused some difficulty in calculating the deviance, only the scaled Pearson chi-square is used to estimate the dispersion parameter \phi.
The scaled Pearson chi-square is obtained as follows:

\tilde{\phi} = \frac{1}{N - p} \sum_{i=1}^{N} \frac{(Y_i - \hat{\mu}_i)^2}{\hat{\mu}_i} = \frac{45.1236}{39} = 1.1570

This value is close to 1 and rather small, so one can say that no overdispersion exists in these data and adjustment of the standard errors is not crucially required. To verify the absence of overdispersion, however, the overdispersion test should be carried out. Consider the following hypotheses:

H_0: No overdispersion exists   versus   H_1: Overdispersion exists

It is found that

Q_1 = \frac{1}{\sqrt{2N}} \sum_{i=1}^{N} \frac{(Y_i - \hat{\mu}_i)^2 - Y_i}{\hat{\mu}_i} = \frac{1}{\sqrt{2(41)}}(4.4244) = 0.4885

From the statistical table, \chi^2_{0.05}(1) = 3.841. Since 0.4885 < 3.841, H_0 is not rejected at the 0.05 level of significance; there is no evidence of overdispersion in these data. Suppose one still wants the maximum quasi-likelihood estimates with their adjusted standard errors; the results are given in Table 3.4.

Table 3.4: Adjusted standard errors

Parameter   Estimate   Standard Error   Adjusted Standard Error
Intercept   -1.5820    0.5445           0.5857
Age          0.0687    0.0137           0.0147

The adjusted standard errors are obtained by multiplying the standard errors by \sqrt{\tilde{\phi}} = \sqrt{1.1570} = 1.0756. The adjustment increases the standard errors by only 7.56%, yielding adjusted standard errors of 0.5857 and 0.0147 for the intercept and age respectively. This increase is small, so the standard errors are not seriously underestimated, which confirms that there is no overdispersion in the data and that Poisson regression is adequate.

CHAPTER 4

ANALYSIS OF POISSON REGRESSION USING SAS

4.1 Introduction

In the previous chapter, the analysis of Poisson regression was carried out manually on an example without overdispersion, and the data were ordinary count data. This chapter analyzes count data in the form of rates, and the data involve overdispersion.
The analysis of Poisson regression in this chapter is done using SAS 9.1.

4.2 Nursing Home Data

The nursing home data in Table 4.1 were adapted from the book Statistical Methods for Rates and Proportions by Fleiss, J.L., Levin, B. and Paik, M.C. The data were collected by the Department of Health and Social Services of the State of New Mexico. The variables collected include the number of beds, annual total patient days, annual total patient care revenue, annual nursing salaries, annual facilities expenditures, and an indicator for rural location. The question of interest is whether nursing homes in rural areas tend to have fewer beds per patient population than those in urban areas, after adjusting for the other factors that affect hospital facilities. The variables are denoted as follows:

BED – number of beds
TDAYS – annual total patient days (in hundreds)
PCREV – annual total patient care revenue (in $ millions)
NSAL – annual nursing salaries (in $ millions)
FEXP – annual facilities expenditures (in $ millions)
RURAL – rural (1) or urban (0)

Table 4.1: Nursing home data

UNIT  BED  TDAYS  PCREV   NSAL    FEXP    RURAL
1     244  385    2.3521  0.523   0.5334  0
2     59   203    0.916   0.2459  0.0493  1
3     120  392    2.19    0.6304  0.6115  0
4     120  419    2.2354  0.659   0.6346  0
5     120  363    1.7421  0.5362  0.6225  0
6     65   234    1.0531  0.3622  0.0449  1
7     120  372    2.2147  0.4406  0.4998  1
8     90   305    1.4025  0.4173  0.0966  1
9     96   169    0.8812  0.1955  0.126   0
10    120  188    1.1729  0.3224  0.6442  1
11    62   192    0.8896  0.2409  0.1236  0
12    120  426    2.0987  0.2066  0.336   1
13    116  321    1.7655  0.5946  0.4231  0
14    59   164    0.7085  0.1925  0.128   1
15    80   284    1.3089  0.4166  0.1123  1
16    120  375    2.1453  0.5257  0.5206  1
17    80   133    0.779   0.1988  0.4443  1
18    100  318    1.8309  0.4156  0.4585  1
19    60   213    0.8872  0.1914  0.1675  1
20    110  280    1.7881  0.5173  0.5686  1
21    120  336    1.7004  0.463   0.0907  0
22    135  442    2.3829  0.7489  0.3351  0
23    59   191    0.9424  0.2051  0.1756  1
24    60   202    1.2474  0.3803  0.2123  0
25    25   83     0.4078  0.2008  0.4531  1
26    221  776    3.6029  0.1288  0.2543  1
27    64   214    0.8782  0.4729  0.4446  1
28    62   204    0.8951  0.2367  0.1064  0
29    108  366    1.7446  0.5933  0.2987  1
30    62   220    0.6164  0.2782  0.0411  1
31    90   286    0.2853  0.4651  0.4197  0
32    146  375    2.1334  0.6857  0.1198  0
33    62   189    0.8082  0.2143  0.1209  1
34    30   88     0.3948  0.3025  0.0137  1
35    79   278    1.1649  0.2905  0.1279  0
36    44   158    0.785   0.1498  0.1273  1
37    120  423    2.9035  0.6236  0.3524  0
38    100  300    1.7532  0.3547  0.2561  1
39    49   177    0.8197  0.281   0.3874  1
40    123  336    2.2555  0.6059  0.6402  1
41    82   136    0.8459  0.1995  0.1911  1
42    58   205    1.0412  0.2245  0.1122  1
43    110  323    1.6661  0.4029  0.3893  1
44    62   222    1.2406  0.2784  0.2212  1
45    86   200    1.1312  0.372   0.2959  1
46    102  355    1.4499  0.3866  0.3006  1
47    135  471    2.4274  0.7485  0.1344  0
48    78   203    0.9327  0.3672  0.1242  1
49    83   390    1.2362  0.3995  0.1484  1
50    60   213    1.0644  0.282   0.1154  0
51    54   144    0.7556  0.2088  0.0245  1
52    120  327    2.0182  0.4432  0.6274  0

In these data, the number of beds is the response variable and annual total patient days is the offset; the remaining variables are the explanatory variables. The following Poisson regression models are fitted for E(BED)/TDAYS:

(M1) \ln\left(\frac{E(BED_i)}{TDAYS_i}\right) = \beta_0 + \beta_1 PCREV_i + \beta_2 NSAL_i + \beta_3 FEXP_i

(M2) \ln\left(\frac{E(BED_i)}{TDAYS_i}\right) = \beta_0 + \beta_1 PCREV_i + \beta_2 NSAL_i + \beta_3 FEXP_i + \beta_4 (PCREV \cdot NSAL)_i + \beta_5 (PCREV \cdot FEXP)_i + \beta_6 (NSAL \cdot FEXP)_i

(M3) \ln\left(\frac{E(BED_i)}{TDAYS_i}\right) = \beta_0 + \beta_1 PCREV_i + \beta_2 NSAL_i + \beta_3 FEXP_i + \beta_4 (PCREV \cdot NSAL)_i + \beta_5 (PCREV \cdot FEXP)_i + \beta_6 (NSAL \cdot FEXP)_i + \beta_7 RURAL_i

The data are analyzed using SAS 9.1; the code for running the program is given in APPENDIX C. The SAS output for the three models is displayed in Figures 4.1, 4.2, and 4.3.
The GENMOD Procedure

Model Information: Data Set WORK.NURSING_HOME; Distribution Poisson; Link Function Log; Dependent Variable bed; Offset Variable log_t. Number of Observations Read 52; Number of Observations Used 52.

Criteria For Assessing Goodness Of Fit
Criterion            DF   Value        Value/DF
Deviance             48   245.0465     5.1051
Scaled Deviance      48   245.0465     5.1051
Pearson Chi-Square   48   276.8284     5.7673
Scaled Pearson X2    48   276.8284     5.7673
Log Likelihood            17446.3823

Algorithm converged.

Analysis Of Parameter Estimates
Parameter  DF  Estimate  Std Error  Wald 95% Limits      ChiSq    Pr > ChiSq
Intercept   1  -1.1018   0.0434     (-1.1867, -1.0168)   645.81   <.0001
pcrev       1  -0.0564   0.0209     (-0.0974, -0.0153)     7.25   0.0071
nsal        1  -0.1428   0.0945     (-0.3281,  0.0425)     2.28   0.1308
fexp        1   0.4935   0.0847     ( 0.3275,  0.6595)    33.96   <.0001
Scale       0   1.0000   0.0000

NOTE: The scale parameter was held fixed.

Figure 4.1: SAS output for model (M1)

The GENMOD Procedure

Model Information: Data Set WORK.NURSING_HOME; Distribution Poisson; Link Function Log; Dependent Variable bed; Offset Variable log_t. Number of Observations Read 52; Number of Observations Used 52.

Criteria For Assessing Goodness Of Fit
Criterion            DF   Value        Value/DF
Deviance             45   215.6775     4.7928
Scaled Deviance      45   215.6775     4.7928
Pearson Chi-Square   45   235.5925     5.2354
Scaled Pearson X2    45   235.5925     5.2354
Log Likelihood            17461.0667

Algorithm converged.

Analysis Of Parameter Estimates
Parameter    DF  Estimate  Std Error  Wald 95% Limits      ChiSq   Pr > ChiSq
Intercept     1  -1.0480   0.1074     (-1.2584, -0.8376)   95.26   <.0001
pcrev         1  -0.3221   0.0712     (-0.4618, -0.1825)   20.45   <.0001
nsal          1   0.0165   0.4790     (-0.9222,  0.9553)    0.00   0.9724
fexp          1   1.3582   0.2803     ( 0.8087,  1.9077)   23.47   <.0001
pcrev*nsal    1   0.3718   0.1467     ( 0.0844,  0.6593)    6.43   0.0112
pcrev*fexp    1   0.5264   0.2541     ( 0.0284,  1.0245)    4.29   0.0383
nsal*fexp     1  -3.6525   0.9363     (-5.4876, -1.8175)   15.22   <.0001
Scale         0   1.0000   0.0000

NOTE: The scale parameter was held fixed.

Figure 4.2: SAS output for model (M2)

The GENMOD Procedure

Model Information: Data Set WORK.NURSING_HOME; Distribution Poisson; Link Function Log; Dependent Variable bed; Offset Variable log_t. Number of Observations Read 52; Number of Observations Used 52.

Criteria For Assessing Goodness Of Fit
Criterion            DF   Value        Value/DF
Deviance             44   201.9478     4.5897
Scaled Deviance      44   201.9478     4.5897
Pearson Chi-Square   44   216.1808     4.9132
Scaled Pearson X2    44   216.1808     4.9132
Log Likelihood            17467.9316

Algorithm converged.

Analysis Of Parameter Estimates
Parameter    DF  Estimate  Std Error  Wald 95% Limits      ChiSq   Pr > ChiSq
Intercept     1  -0.9730   0.1091     (-1.1869, -0.7591)   79.51   <.0001
pcrev         1  -0.3478   0.0709     (-0.4868, -0.2088)   24.06   <.0001
nsal          1   0.2783   0.4842     (-0.6706,  1.2273)    0.33   0.5654
fexp          1   1.4676   0.2831     ( 0.9128,  2.0224)   26.88   <.0001
pcrev*nsal    1   0.2655   0.1495     (-0.0276,  0.5586)    3.15   0.0758
pcrev*fexp    1   0.7085   0.2567     ( 0.2055,  1.2116)    7.62   0.0058
nsal*fexp     1  -4.4965   0.9551     (-6.3685, -2.6246)   22.16   <.0001
rural         1  -0.1331   0.0357     (-0.2032, -0.0631)   13.87   0.0002
Scale         0   1.0000   0.0000

NOTE: The scale parameter was held fixed.

Figure 4.3: SAS output for model (M3)

4.3 Choosing the Right Model

Since there are three proposed models, one needs to determine the most appropriate model to use. Note that M1 and M2 are nested models because M2 can be reduced to M1. Similarly, M3 can be reduced to M2.
Therefore, to determine the right model, M1 must be compared with M2 and M2 with M3. Two methods can be used to compare two nested models: the deviance statistic and the Wald statistic; only the deviance statistic method is discussed here. Table 4.2 summarizes the log likelihood and deviance for M1, M2, and M3 obtained from the SAS output above.

Table 4.2: Log likelihood and deviance for models M1, M2, and M3

Model   Degrees of Freedom   Log Likelihood   Deviance
M1      48                   17446.3823       245.0465
M2      45                   17461.0667       215.6775
M3      44                   17467.9316       201.9478

To compare M1 and M2, the following hypotheses are tested:

H_0: \beta_i = 0 for all i = 4, 5, 6   versus   H_1: \beta_i \neq 0 for at least one i

First, the deviance of M2 is smaller than that of M1 because of the added two-way interaction terms; the smaller deviance suggests that M2 fits the data better than M1. The next question is whether M2 fits the data significantly better than M1. To answer it, the statistic

d = D_0 - D_1

is used, where D_0 is the deviance of the less inclusive model and D_1 the deviance of the more inclusive model. This statistic has an approximate chi-squared distribution with degrees of freedom equal to the difference between the numbers of unknown parameters in the two models, and H_0 is rejected if d > \chi^2_{0.05}(df). For comparing M1 and M2,

d = 245.0465 - 215.6775 = 29.369   with   df = 48 - 45 = 3

From the statistical table, \chi^2_{0.05}(3) = 7.815. Since 29.369 > 7.815, H_0 is rejected at significance level \alpha = 0.05, implying that the interaction terms are jointly highly significant. Now, to compare M2 and M3, the following hypotheses are considered:

H_0: \beta_7 = 0   versus   H_1: \beta_7 \neq 0

The deviance of M3 is smaller than that of M2, suggesting that M3 fits the data better than M2.
The statistic for comparing M2 and M3 is

d = 215.6775 - 201.9478 = 13.7297   with   df = 45 - 44 = 1

From the statistical table, \chi^2_{0.05}(1) = 3.841. Since 13.7297 > 3.841, H_0 is rejected at significance level \alpha = 0.05, implying that RURAL is highly significant to the model. Therefore, M1 and M2 are rejected in favour of M3; M3 is the right Poisson regression model for the nursing home data.

4.4 Results and Discussion

The analysis now focuses on model M3 and its SAS output in Figure 4.3.

The Model

The fitted model for the nursing home data is:

\ln\left(\frac{E(BED_i)}{TDAYS_i}\right) = -0.9730 - 0.3478\,PCREV_i + 0.2783\,NSAL_i + 1.4676\,FEXP_i + 0.2655\,(PCREV \cdot NSAL)_i + 0.7085\,(PCREV \cdot FEXP)_i - 4.4965\,(NSAL \cdot FEXP)_i - 0.1331\,RURAL_i

Interpretation of Coefficients

From the "Analysis of Parameter Estimates" in Figure 4.3, the PCREV, FEXP, and RURAL factors are highly significant (p-value < 0.05), as are the interactions between PCREV and FEXP and between NSAL and FEXP. For the RURAL factor, \beta_7 = -0.1331, indicating that a nursing home in a rural area has fewer beds than one in an urban area by a factor of e^{-0.1331} = 0.8754. In other words, since |100(e^{-0.1331} - 1)| = 12.46\%, nursing homes in rural areas have 12.46% fewer beds than nursing homes in urban areas. The 95% confidence interval for \beta_7 is -0.1331 \pm 1.96 \times 0.0357, or (-0.2031, -0.0631), with corresponding confidence interval for the multiplicative factor (e^{-0.2031}, e^{-0.0631}) = (0.8162, 0.9388). The coefficient of the annual total patient care revenue (PCREV) factor indicates that a one-unit change in annual total patient care revenue causes a 29.38% decrease in the mean number of beds. The 95% confidence interval for \beta_1 is -0.3478 \pm 1.96 \times 0.0709, or (-0.4868, -0.2088), and the corresponding confidence interval for the multiplicative factor is (e^{-0.4868}, e^{-0.2088}) = (0.6146, 0.8116).
For the NSAL factor, a one-unit change in annual nursing salaries leads to a 32.09% increase in the mean number of beds; the 95% confidence interval for \beta_2 is 0.2783 \pm 1.96 \times 0.4842, or (-0.6707, 1.2273), with corresponding confidence interval for the multiplicative factor (e^{-0.6707}, e^{1.2273}) = (0.5114, 3.4120). A one-unit change in annual facilities expenditures (FEXP), however, results in a very large increase in the mean number of beds. The interaction factors between PCREV and NSAL and between PCREV and FEXP show that a one-unit change in each multiplies the number of beds by e^{0.2655} = 1.3041 and e^{0.7085} = 2.0309 respectively, whereas a one-unit change in the interaction factor between NSAL and FEXP multiplies the number of beds by e^{-4.4965} = 0.0111. The 95% confidence intervals for these three coefficients are (-0.0276, 0.5586), (0.2055, 1.2116), and (-6.3685, -2.6246) respectively.

Elasticity

Table 4.3 shows the elasticities of the explanatory variables in the nursing home data for model M3. The elasticities for the PCREV, NSAL, and FEXP variables are computed by applying equation (3.27) to each observation and then averaging; the computation is done using Microsoft Excel. RURAL is an indicator variable, so the pseudo-elasticity for RURAL is computed as:

E_{RURAL} = \frac{\exp(\beta_7) - 1}{\exp(\beta_7)} = \frac{\exp(-0.1331) - 1}{\exp(-0.1331)} = -0.1424

Table 4.3: Elasticities of the explanatory variables in the nursing home data for model M3

Explanatory Variable   Elasticity
PCREV                  -0.4942
NSAL                    0.1061
FEXP                    0.4180
RURAL                  -0.1424

The results show that a one percent increase in annual total patient care revenue reduces the mean number of beds in nursing homes by 0.4942 percent. In contrast, a one percent increase in annual nursing salaries or annual facilities expenditures increases the mean number of beds by 0.1061 and 0.4180 percent respectively.
Furthermore, the pseudo-elasticity of -0.1424 indicates that the mean number of beds is about 14.24 percent lower for nursing homes in rural areas, which supports the conclusion that rural nursing homes tend to have fewer beds than urban ones.

Model Checking

As usual, the goodness of fit of the model needs to be checked. The hypotheses are:

H_0: the model has a good fit   versus   H_1: the model has lack of fit

From Figure 4.3, the Pearson chi-square statistic is X^2 = 216.1808 and the deviance statistic is D = 201.9478. From the statistical table, \chi^2_{0.05}(44) = 60.4568. Since 216.1808 > 60.4568 and 201.9478 > 60.4568, H_0 is rejected at significance level \alpha = 0.05, implying that the model has lack of fit.

Residual Analysis

To obtain Pearson residuals and adjusted residuals from SAS, the "obstats" and "residuals" options are added to the code:

proc genmod;
model bed = pcrev nsal fexp pcrev*nsal pcrev*fexp nsal*fexp rural
/ dist=poi link=log offset=log_t obstats residuals;
run;

The output for this additional code is given in APPENDIX D. The "obstats" option provides the Pearson residuals (labeled "Reschi"), while the "residuals" option provides the adjusted residuals (labeled "StReschi"), which standardize the Pearson residuals to be distributed approximately normal. Table 4.4 summarizes the results.

Table 4.4: Pearson residuals and adjusted residuals for nursing home data

Obs  Bed  Pearson Residual   Adjusted Residual
1    244   7.0169656          7.9541283
2    59    0.0842729          0.0874151
3    120  -1.139224          -1.268365
4    120  -1.42913           -1.687196
5    120  -1.1648            -1.279099
6    65   -0.29734           -0.312951
7    120  -1.801313          -1.99645
8    90    0.4307467          0.4500368
9    96    4.5567171          4.8399814
10   120   3.6104925          4.1424735
11   62   -0.766451          -0.808099
12   120  -3.122378          -3.452756
13   116   1.1426998          1.20197
14   59    0.7616619          0.7904897
15   80   -0.222551          -0.232783
16   120  -0.631143          -0.670849
17   80    2.4420727          2.7090799
18   100  -1.426019          -1.482452
19   60   -1.221678          -1.276587
20   110   1.8056045          1.8957571
21   120   1.5718986          1.6769361
22   135   0.0515266          0.0576814
23   59   -0.478213          -0.494186
24   60   -1.35736           -1.413152
25   25   -2.097099          -2.308997
26   221   1.0280765          3.0256132
27   64   -0.169027          -0.183494
28   62   -1.096703          -1.162205
29   108   0.4340545          0.4660353
30   62   -0.754576          -0.809907
31   90   -0.680123          -0.894277
32   146   2.4808479          2.8560404
33   62    0.2899505          0.2998119
34   30    0.2524611          0.2699829
35   79   -1.574343          -1.666879
36   44   -0.956302          -1.011012
37   120  -2.032293          -2.44568
38   100   0.679786           0.6950838
39   49   -2.366709          -2.478751
40   123   1.5070629          1.7096722
41   82    5.2365404          5.3725172
42   58   -0.403804          -0.417828
43   110  -0.078899          -0.081086
44   62   -1.176382          -1.197609
45   86    2.4542185          2.5004873
46   102  -1.218256          -1.250821
47   135  -1.151268          -1.519447
48   78    1.9612958          2.0434853
49   83   -3.063645          -3.229351
50   60   -1.48926           -1.561096
51   54    1.9310157          2.0084219
52   120  -2.811129          -3.636855

From Table 4.4, many Pearson residuals and adjusted residuals exceed 2 in absolute value. The corresponding observations should be checked, because they might be outliers. Since there are so many large residuals, the model does not fit the data very well, in agreement with the goodness-of-fit test result obtained previously.
Overdispersion

From the "Criteria for Assessing Goodness of Fit" in Figure 4.3, the values of both the deviance and the Pearson chi-square are very much larger than their degrees of freedom; dividing each by its degrees of freedom gives values well above 1:

X^2/df = 4.9132   and   D/df = 4.5897

Therefore, overdispersion exists in these data, and Poisson regression is clearly not adequate to describe them. A test for overdispersion with

Q_1 = \frac{1}{\sqrt{2(52)}} \sum_{i=1}^{52} \frac{(Y_i - \hat{\mu}_i)^2 - Y_i}{\hat{\mu}_i} = 16.0203

confirms the presence of overdispersion, since Q_1 > \chi^2_{0.05}(1), that is, 16.0203 > 3.841. The values of \hat{\mu}_i (the predicted values) are easily obtained from SAS 9.1 with the following code:

proc genmod;
model bed = pcrev nsal fexp pcrev*nsal pcrev*fexp nsal*fexp rural
/ dist=poi link=log offset=log_t;
output out=temp p=muhati;
run;
proc print data=temp (obs=52);
var bed muhati;
run;

The values of \hat{\mu}_i obtained from this code are shown in APPENDIX E. In conjunction with the overdispersion test, a dispersion parameter \phi is introduced into the relationship between the variance and the mean as \operatorname{Var}(Y) = \phi\mu. Note that the "Scale" parameter reported in the "Analysis of Parameter Estimates" of the SAS output corresponds to \sqrt{\phi}. The scale parameter can be estimated in SAS 9.1 by specifying either SCALE=DEVIANCE (or just DSCALE) or SCALE=PEARSON (or just PSCALE). For the nursing home data, the Pearson option is used to obtain the scale estimate. The following code is added to the code in APPENDIX C and the program is run; Figure 4.4 shows the output, referred to here as corrected Poisson regression.
proc genmod;
model bed = pcrev nsal fexp pcrev*nsal pcrev*fexp nsal*fexp rural
/ dist=poi link=log scale=pearson offset=log_t;
run;

The GENMOD Procedure

Model Information: Data Set WORK.NURSING_HOME; Distribution Poisson; Link Function Log; Dependent Variable bed; Offset Variable log_t. Number of Observations Read 52; Number of Observations Used 52.

Criteria For Assessing Goodness Of Fit
Criterion            DF   Value       Value/DF
Deviance             44   201.9478    4.5897
Scaled Deviance      44    41.1031    0.9342
Pearson Chi-Square   44   216.1808    4.9132
Scaled Pearson X2    44    44.0000    1.0000
Log Likelihood            3555.3064

Algorithm converged.

Analysis Of Parameter Estimates
Parameter    DF  Estimate  Std Error  Wald 95% Limits      ChiSq   Pr > ChiSq
Intercept     1  -0.9730   0.2419     (-1.4471, -0.4989)   16.18   <.0001
pcrev         1  -0.3478   0.1572     (-0.6559, -0.0397)    4.90   0.0269
nsal          1   0.2783   1.0732     (-1.8251,  2.3818)    0.07   0.7954
fexp          1   1.4676   0.6274     ( 0.2378,  2.6973)    5.47   0.0193
pcrev*nsal    1   0.2655   0.3314     (-0.3841,  0.9151)    0.64   0.4231
pcrev*fexp    1   0.7085   0.5689     (-0.4066,  1.8236)    1.55   0.2130
nsal*fexp     1  -4.4965   2.1171     (-8.6460, -0.3471)    4.51   0.0337
rural         1  -0.1331   0.0792     (-0.2884,  0.0222)    2.82   0.0929
Scale         0   2.2166   0.0000

NOTE: The scale parameter was estimated by the square root of Pearson's Chi-Square/DOF.

Figure 4.4: SAS output for corrected Poisson regression

A "Scale" value equal to 1 in the output indicates that ordinary Poisson regression was used (see the "Scale" value in Figure 4.3). A "Scale" value greater than 1 (see Figure 4.4) indicates that the data are overdispersed and that the standard errors have been corrected accordingly, so the model is no longer ordinary Poisson regression. The output in Figure 4.4 can also be regarded as the maximum quasi-likelihood estimates and standard errors of the Poisson regression, since this corrected Poisson regression is the same as the quasi-likelihood method.
From Figure 4.4, it can be seen that the scaled Pearson chi-square is now held fixed at 1 and the scale parameter is equal to 2.2166 (√(X²/df) = √4.9132 = 2.2166). The parameter estimates are still the same as in Figure 4.3. However, the standard errors are inflated by the scale factor √φ̂ = 2.2166. Table 4.5 summarizes the comparison between the standard errors for ordinary Poisson regression and corrected Poisson regression.

Table 4.5: Comparison of standard errors for ordinary Poisson regression and corrected Poisson regression
  Parameter     SE (ordinary Poisson regression)   SE (corrected Poisson regression)
  Intercept     0.1091                             0.2419
  PCREV         0.0709                             0.1572
  NSAL          0.4842                             1.0732
  FEXP          0.2831                             0.6274
  PCREV*NSAL    0.1495                             0.3314
  PCREV*FEXP    0.2567                             0.5689
  NSAL*FEXP     0.9551                             2.1171
  RURAL         0.0357                             0.0792

Table 4.5 shows that the corrected Poisson regression inflates the standard errors by a factor of 2.2166, an increase of approximately 122%. This is rather high. In other words, inflating the standard errors by this amount adjusts for the apparent overdispersion. Clearly, the variability of the response variable is understated if it is assumed to be pure Poisson, and consequently the estimated standard errors are understated too. Also note that the 95% confidence intervals for all coefficients in the corrected Poisson regression (Figure 4.4) are much wider than those of the ordinary Poisson regression obtained previously. In addition, the p-values are much higher, and the significance tests are more conservative than those based on the Poisson regression before adjustment for overdispersion (Figure 4.3). It can be seen that the RURAL effect and the interaction effect between PCREV and FEXP, reported as significant by ordinary Poisson regression, are no longer significant (p-value > 0.05). PCREV, FEXP, and the interaction between NSAL and FEXP, however, remain significant. Thus, it is clear that ignoring overdispersion underestimates the standard errors and gives false inference.
4.5 Negative Binomial Regression

To run negative binomial regression using SAS 9.1, the distribution term is changed from "poi" to "NB". The PROC GENMOD command for negative binomial regression is given as follows:

  proc genmod;
    model bed = pcrev nsal fexp pcrev*nsal pcrev*fexp nsal*fexp rural
          / dist=NB link=log offset=log_t;
  run;

Figure 4.5 shows the output of negative binomial regression for the nursing home data.

4.6 Results and Discussion

The Model

For negative binomial regression, the model for the nursing home data is obtained as:

  ln(E(BEDᵢ)/TDAYSᵢ) = −0.9103 − 0.3869(PCREV)ᵢ + 0.1557(NSAL)ᵢ + 1.4298(FEXP)ᵢ
                       + 0.3323(PCREV·NSAL)ᵢ + 0.7532(PCREV·FEXP)ᵢ
                       − 4.5658(NSAL·FEXP)ᵢ − 0.1193(RURAL)ᵢ

Interpretation of Coefficients

From Figure 4.5, it is found that the RURAL effect and the interaction effect between PCREV and FEXP are no longer significant (p-value > 0.05), in contrast to what was reported by ordinary Poisson regression. PCREV, FEXP, and the interaction between NSAL and FEXP, however, remain significant. Note that this matches the result obtained from the corrected Poisson regression. The value of the coefficient for the RURAL factor indicates that a nursing home in a rural area has fewer beds than one in an urban area by a factor of e^(−0.1193) = 0.8875. In other words, since |100(e^(−0.1193) − 1)| = 11.25%, it can be said that nursing homes in rural areas have 11.25% fewer beds than nursing homes in urban areas. This value is less than that of Poisson regression by 1.21%. Furthermore, the 95% confidence interval for β₇ is −0.1193 ± 1.96 × 0.0705 or (−0.2575, 0.0189), and the corresponding confidence interval for the multiplicative factor is (e^(−0.2575), e^(0.0189)) = (0.7730, 1.0191).

The coefficient of the annual total patient care revenue (PCREV) factor indicates that a one-unit change in annual total patient care revenue causes a 32.08% decrease in the mean of the number of beds.
This value is greater than that of Poisson regression by 2.7%. In addition, the 95% confidence interval for β₁ is −0.3869 ± 1.96 × 0.1543 or (−0.6893, −0.0845), and the corresponding confidence interval for the multiplicative factor is (e^(−0.6893), e^(−0.0845)) = (0.5019, 0.9190). A one-unit change in annual nursing salaries (NSAL) leads to a 16.85% increase in the mean of the number of beds. This value is less than that of Poisson regression by 15.24%. The 95% confidence interval for β₂ is 0.1557 ± 1.96 × 0.9194 or (−1.6463, 1.9577), and the corresponding confidence interval for the multiplicative factor is (e^(−1.6463), e^(1.9577)) = (0.1928, 7.0830). A one-unit change in annual facilities expenditure (FEXP), as in Poisson regression, results in a very large increase in the mean of the number of beds. The interaction factors between PCREV and NSAL and between PCREV and FEXP show that a one-unit change in these factors increases the mean number of beds by 39.42% and 112.38% respectively, whereas the interaction factor between NSAL and FEXP shows that a one-unit change in this factor decreases the mean number of beds by 98.96%. The 95% confidence intervals for the corresponding coefficients are (−0.2428, 0.9074), (−0.2589, 1.7653), and (−8.4956, −0.6360) respectively.

                         The GENMOD Procedure

  Model Information
    Data Set             WORK.NURSING_HOME
    Distribution         Negative Binomial
    Link Function        Log
    Dependent Variable   bed
    Offset Variable      log_t

  Number of Observations Read    52
  Number of Observations Used    52

  Criteria For Assessing Goodness Of Fit
    Criterion             DF    Value         Value/DF
    Deviance              44    52.3723       1.1903
    Scaled Deviance       44    52.3723       1.1903
    Pearson Chi-Square    44    57.5931       1.3089
    Scaled Pearson X2     44    57.5931       1.3089
    Log Likelihood              17509.1130

  Algorithm converged.
  Analysis Of Parameter Estimates
    Parameter    DF   Estimate   Std Error   Wald 95% Confidence Limits   Chi-Square   Pr > ChiSq
    Intercept    1    -0.9103    0.1989      -1.3002    -0.5205           20.95        <.0001
    pcrev        1    -0.3869    0.1543      -0.6894    -0.0844            6.28        0.0122
    nsal         1     0.1557    0.9194      -1.6464     1.9577            0.03        0.8656
    fexp         1     1.4298    0.5118       0.4267     2.4329            7.81        0.0052
    pcrev*nsal   1     0.3323    0.2934      -0.2427     0.9074            1.28        0.2573
    pcrev*fexp   1     0.7532    0.5164      -0.2590     1.7654            2.13        0.1447
    nsal*fexp    1    -4.5658    2.0050      -8.4955    -0.6361            5.19        0.0228
    rural        1    -0.1193    0.0705      -0.2574     0.0188            2.87        0.0905
    Dispersion   1     0.0300    0.0082       0.0141     0.0460

  NOTE: The negative binomial dispersion parameter was estimated by maximum likelihood.

Figure 4.5: SAS output for negative binomial regression

Elasticity

Table 4.6 shows the elasticities of the explanatory variables in the nursing home data for the negative binomial regression model.

Table 4.6: Elasticities of the explanatory variables in the nursing home data for the negative binomial regression model
  Explanatory Variable   Elasticity
  PCREV                  -0.5498
  NSAL                    0.0594
  FEXP                    0.4071
  RURAL                  -0.1267

The pseudo-elasticity for the RURAL variable is computed as follows:

  E_RURAL = [exp(β₇) − 1] / exp(β₇) = [exp(−0.1193) − 1] / exp(−0.1193) = −0.1267

From Table 4.6, it can be seen that for negative binomial regression, a one percent increase in annual total patient care revenue reduces the mean number of beds in nursing homes by 0.5498 percent. A one percent increase in annual nursing salaries or in annual facilities expenditure, however, increases the mean number of beds by 0.0594 and 0.4071 percent respectively. Furthermore, it is found that the mean number of beds in nursing homes in rural areas is reduced by 0.1267 percent. As in Poisson regression, this clearly shows that nursing homes in rural areas tend to have fewer beds than nursing homes in urban areas.
Model Checking

From the negative binomial regression output in Figure 4.5, the value of the Pearson chi-square statistic is X² = 57.5931 and the value of the deviance statistic is D = 52.3723. Both values are quite close to the degrees of freedom, so the model appears adequate. To verify this, the critical value obtained from the statistical table is χ²₀.₀₅(44) = 60.4568; that is, H₀ is rejected when X² or D is greater than 60.4568. Since 57.5931 < 60.4568 and 52.3723 < 60.4568, H₀ is not rejected at significance level α = 0.05. This implies that the model has a good fit. Thus, the negative binomial regression model is more adequate for fitting the nursing home data than the Poisson regression model. This is because the data is overdispersed, and the negative binomial model, unlike Poisson regression, accounts for overdispersion.

Residuals Analysis

The SAS output of residuals for negative binomial regression is given in APPENDIX F. Table 4.7 summarizes the result. Most of the Pearson residuals and adjusted residuals in Table 4.7 are small; only a few exceed 2 in absolute value. This suggests that the model fits the data well, which agrees with the goodness-of-fit test result.

Table 4.7: Residuals for negative binomial regression
  Obs.   Bed   Pearson Residuals   Adjusted Residuals
  1      244    2.8981131           3.2134736
  2       59   -0.063957           -0.066634
  3      120   -0.519342           -0.574014
  4      120   -0.648106           -0.750932
  5      120   -0.487942           -0.528478
  6       65   -0.242413           -0.255811
  7      120   -0.859559           -0.938295
  8       90    0.1807227           0.1889604
  9       96    2.6168761           2.8116326
  10     120    1.8297598           2.0763753
  11      62   -0.48599            -0.515769
  12     120   -1.346897           -1.471129
  13     116    0.6125658           0.6444858
  14      59    0.3172917           0.3311324
  15      80   -0.165974           -0.173677
  16     120   -0.359472           -0.379287
  17      80    1.3094942           1.4735788
  18     100   -0.727722           -0.755202
  19      60   -0.807629           -0.84203
  20     110    0.8794827           0.9264853
  21     120    0.8163993           0.8707447
  22     135    0.0093543           0.010448
  23      59   -0.394395           -0.407917
  24      60   -0.75373            -0.792121
  25      25   -1.515122           -1.748939
  26     221    0.4986813           1.1168278
  27      64   -0.088956           -0.099318
  28      62   -0.666765           -0.709793
  29     108    0.1909746           0.2040666
  30      62   -0.555559           -0.598065
  31      90   -0.246757           -0.325236
  32     146    1.195937            1.3787255
  33      62    0.0402785           0.0417337
  34      30    0.0587035           0.0670965
  35      79   -0.818759           -0.863585
  36      44   -0.734942           -0.787187
  37     120   -0.927702           -1.083767
  38     100    0.2954252           0.3032409
  39      49   -1.415236           -1.463142
  40     123    0.6269081           0.7132831
  41      82    3.1921869           3.2940407
  42      58   -0.341776           -0.3554
  43     110   -0.089158           -0.091424
  44      62   -0.731093           -0.744943
  45      86    1.3567572           1.3879271
  46     102   -0.6224             -0.636486
  47     135   -0.491999           -0.626695
  48      78    1.0713815           1.1240668
  49      83   -1.4873             -1.548535
  50      60   -0.856827           -0.902649
  51      54    1.1228299           1.1857509
  52     120   -1.19399            -1.48398

Overdispersion

The SAS output for negative binomial regression reports a "Dispersion" parameter in the "Analysis of Parameter Estimates" instead of the "Scale" parameter reported for the ordinary and corrected Poisson regressions. This value is the α in the variance equation for negative binomial regression, which is given as

  Var(Y) = µ(1 + αµ) = µ + αµ²

If this value is greater than zero, then negative binomial regression should be used in favour of Poisson regression. From Figure 4.5, the dispersion parameter is equal to 0.03.
Therefore, negative binomial regression is preferred to Poisson regression.

CHAPTER 5

SIMULATION STUDY

5.1 Data Simulation

The main purpose of this simulation study is to examine the performance of Poisson regression and of negative binomial regression in analyzing count data with and without overdispersion. The performance is examined based on the goodness-of-fit test and on other criteria such as the significance of the coefficients, the confidence intervals, and the standard errors. This study considers two conditions of the data to be analyzed: overdispersed and non-overdispersed. The data is simulated using the R 2.9.2 software according to three distinct values of µ, namely µ = 10, µ = 20, and µ = 50. The values of µ are selected to be large (greater than 5) because when the means are large, both the Pearson chi-square and the deviance are approximately chi-square distributed. If the means are less than 5, it is potentially misleading to rely on the Pearson chi-square and deviance tests as tools for assessing goodness-of-fit. The sample size for each data set is 100. This large sample size is selected because the sampling distributions of the estimated coefficients are approximately normal when the sample size is large, which makes the tests of significance and the confidence intervals more valid. Two types of data are simulated, namely overdispersed and non-overdispersed data, so there are six data sets altogether. All simulated data are provided in APPENDIX G. The simulation study begins with the analysis of count data without overdispersion, followed by the analysis of count data with overdispersion. All data are analyzed using SAS 9.1.

5.2 Analysis of Data with No Overdispersion

The data is first analyzed with Poisson regression, followed by negative binomial regression.
Figures 5.1, 5.2, and 5.3 show the Poisson regression SAS output for µ = 10, µ = 20, and µ = 50 respectively, whereas Figures 5.4, 5.5, and 5.6 show the negative binomial regression SAS output for µ = 10, µ = 20, and µ = 50 respectively.

                         The GENMOD Procedure

  Model Information
    Data Set             WORK.MU10
    Distribution         Poisson
    Link Function        Log
    Dependent Variable   y

  Number of Observations Read    100
  Number of Observations Used    100

  Criteria For Assessing Goodness Of Fit
    Criterion             DF    Value        Value/DF
    Deviance              98    82.2942      0.8397
    Scaled Deviance       98    82.2942      0.8397
    Pearson Chi-Square    98    81.2841      0.8294
    Scaled Pearson X2     98    81.2841      0.8294
    Log Likelihood              1504.8773

  Algorithm converged.

  Analysis Of Parameter Estimates
    Parameter   DF   Estimate   Std Error   Wald 95% Confidence Limits   Chi-Square   Pr > ChiSq
    Intercept   1     2.4679    0.0770       2.3169     2.6189           1026.30      <.0001
    x           1    -0.0357    0.0307      -0.0959     0.0245              1.35      0.2455
    Scale       0     1.0000    0.0000       1.0000     1.0000

  NOTE: The scale parameter was held fixed.

Figure 5.1: Poisson regression SAS output of non-overdispersed data for µ = 10

                         The GENMOD Procedure

  Model Information
    Data Set             WORK.MU20
    Distribution         Poisson
    Link Function        Log
    Dependent Variable   y

  Number of Observations Read    100
  Number of Observations Used    100

  Criteria For Assessing Goodness Of Fit
    Criterion             DF    Value        Value/DF
    Deviance              98    76.4468      0.7801
    Scaled Deviance       98    76.4468      0.7801
    Pearson Chi-Square    98    75.8861      0.7743
    Scaled Pearson X2     98    75.8861      0.7743
    Log Likelihood              4499.8443

  Algorithm converged.

  Analysis Of Parameter Estimates
    Parameter   DF   Estimate   Std Error   Wald 95% Confidence Limits   Chi-Square   Pr > ChiSq
    Intercept   1     3.1963    0.0772       3.0451     3.3476           1714.57      <.0001
    x           1    -0.0399    0.0247      -0.0884     0.0085              2.61      0.1061
    Scale       0     1.0000    0.0000       1.0000     1.0000

  NOTE: The scale parameter was held fixed.
Figure 5.2: Poisson regression SAS output of non-overdispersed data for µ = 20

                         The GENMOD Procedure

  Model Information
    Data Set             WORK.MU50
    Distribution         Poisson
    Link Function        Log
    Dependent Variable   y

  Number of Observations Read    100
  Number of Observations Used    100

  Criteria For Assessing Goodness Of Fit
    Criterion             DF    Value        Value/DF
    Deviance              98    84.5274      0.8625
    Scaled Deviance       98    84.5274      0.8625
    Pearson Chi-Square    98    83.4692      0.8517
    Scaled Pearson X2     98    83.4692      0.8517
    Log Likelihood              14170.7596

  Algorithm converged.

  Analysis Of Parameter Estimates
    Parameter   DF   Estimate   Std Error   Wald 95% Confidence Limits   Chi-Square   Pr > ChiSq
    Intercept   1     3.9649    0.0580       3.8513     4.0785           4676.62      <.0001
    x           1    -0.0188    0.0145      -0.0472     0.0096              1.68      0.1944
    Scale       0     1.0000    0.0000       1.0000     1.0000

  NOTE: The scale parameter was held fixed.

Figure 5.3: Poisson regression SAS output of non-overdispersed data for µ = 50

                         The GENMOD Procedure

  Model Information
    Data Set             WORK.MU10
    Distribution         Negative Binomial
    Link Function        Log
    Dependent Variable   y

  Number of Observations Read    100
  Number of Observations Used    100

  Criteria For Assessing Goodness Of Fit
    Criterion             DF    Value        Value/DF
    Deviance              98    101.4523     1.0352
    Scaled Deviance       98    101.4523     1.0352
    Pearson Chi-Square    98    100.2999     1.0235
    Scaled Pearson X2     98    100.2999     1.0235
    Log Likelihood              1505.9236

  Algorithm converged.

  Analysis Of Parameter Estimates
    Parameter    DF   Estimate   Std Error   Wald 95% Confidence Limits   Chi-Square   Pr > ChiSq
    Intercept    1     2.4681    0.0691       2.3326     2.6036           1274.65      <.0001
    x            1    -0.0358    0.0277      -0.0900     0.0184              1.67      0.1959
    Dispersion   1    -0.0175    0.0104      -0.0380     0.0029

  NOTE: The negative binomial dispersion parameter was estimated by maximum likelihood.
Figure 5.4: Negative binomial regression SAS output of non-overdispersed data for µ = 10

                         The GENMOD Procedure

  Model Information
    Data Set             WORK.MU20
    Distribution         Negative Binomial
    Link Function        Log
    Dependent Variable   y

  Number of Observations Read    100
  Number of Observations Used    100

  Criteria For Assessing Goodness Of Fit
    Criterion             DF    Value        Value/DF
    Deviance              98    100.1980     1.0224
    Scaled Deviance       98    100.1980     1.0224
    Pearson Chi-Square    98    99.5042      1.0153
    Scaled Pearson X2     98    99.5042      1.0153
    Log Likelihood              4501.5529

  Algorithm converged.

  Analysis Of Parameter Estimates
    Parameter    DF   Estimate   Std Error   Wald 95% Confidence Limits   Chi-Square   Pr > ChiSq
    Intercept    1     3.1971    0.0674       3.0650     3.3293           2247.54      <.0001
    x            1    -0.0402    0.0217      -0.0827     0.0023              3.44      0.0635
    Dispersion   1    -0.0109    0.0049      -0.0205    -0.0014

  NOTE: The negative binomial dispersion parameter was estimated by maximum likelihood.

Figure 5.5: Negative binomial regression SAS output of non-overdispersed data for µ = 20

                         The GENMOD Procedure

  Model Information
    Data Set             WORK.MU50
    Distribution         Negative Binomial
    Link Function        Log
    Dependent Variable   y

  Number of Observations Read    100
  Number of Observations Used    100

  Criteria For Assessing Goodness Of Fit
    Criterion             DF    Value        Value/DF
    Deviance              98    101.2727     1.0334
    Scaled Deviance       98    101.2727     1.0334
    Pearson Chi-Square    98    100.2162     1.0226
    Scaled Pearson X2     98    100.2162     1.0226
    Log Likelihood              14171.5370

  Algorithm converged.

  Analysis Of Parameter Estimates
    Parameter    DF   Estimate   Std Error   Wald 95% Confidence Limits   Chi-Square   Pr > ChiSq
    Intercept    1     3.9649    0.0528       3.8614     4.0684           5633.55      <.0001
    x            1    -0.0188    0.0132      -0.0447     0.0071              2.02      0.1548
    Dispersion   1    -0.0034    0.0024      -0.0081     0.0013

  NOTE: The negative binomial dispersion parameter was estimated by maximum likelihood.
Figure 5.6: Negative binomial regression SAS output of non-overdispersed data for µ = 50

5.2.1 Results and Discussion

5.2.1.1 Goodness-of-fit

Table 5.1 summarizes the Pearson chi-square and deviance values for Poisson regression and negative binomial regression obtained from the non-overdispersed data output. It can be seen that for data without overdispersion, the Pearson chi-square and deviance values for both models are small and close to the degrees of freedom. This indicates that both models have a good fit.

Table 5.1: Pearson chi-square and deviance for Poisson regression and negative binomial regression obtained from data that has no overdispersion
  Model               µ        Pearson chi-square   Deviance
  Poisson             µ = 10   81.2841              82.2942
                      µ = 20   75.8861              76.4468
                      µ = 50   83.4692              84.5274
  Negative Binomial   µ = 10   100.2999             101.4523
                      µ = 20   99.5042              100.1980
                      µ = 50   100.2162             101.2727

To verify this, all Pearson chi-square and deviance values for Poisson regression and negative binomial regression are compared with the critical value obtained from the statistical table, χ²₀.₀₅(98) = 122.1026. None of the values is greater than 122.1026. Thus, H₀ is not rejected at significance level α = 0.05, which implies that both the Poisson regression and negative binomial regression models are adequate for data that has no overdispersion.

However, Poisson regression is found to be better than negative binomial regression, since its Pearson chi-square and deviance values for each case of µ are smaller than those of negative binomial regression. Therefore, Poisson regression is preferred to negative binomial regression when the data has no overdispersion. This is due to the fact that when there is no overdispersion, the value of α in the negative binomial regression model (see section 3.9.2) is equal to zero, which reduces negative binomial regression to Poisson regression.
Thus, it is better to use Poisson regression instead of negative binomial regression when the data has no overdispersion, in order to obtain a more accurate result. In other words, Poisson regression is more reliable for the analysis of non-overdispersed data.

5.2.1.2 Significance, Confidence Intervals, and Standard Errors

From Figures 5.1 to 5.6, it can be seen that for all cases of µ, β₁ is found to be insignificant (p-value > 0.05) in both Poisson regression and negative binomial regression. Furthermore, the 95% confidence intervals for both models are approximately the same. The same holds for the standard errors: for each case of µ, the standard errors of Poisson regression and negative binomial regression are very much alike. This result agrees with the explanation above.

5.3 Analysis of Data with Overdispersion

The data is first analyzed with Poisson regression, followed by negative binomial regression. Figures 5.7, 5.8, and 5.9 show the Poisson regression SAS output for µ = 10, µ = 20, and µ = 50 respectively, whereas Figures 5.10, 5.11, and 5.12 show the negative binomial regression SAS output for µ = 10, µ = 20, and µ = 50 respectively.

                         The GENMOD Procedure

  Model Information
    Data Set             WORK.MU10
    Distribution         Poisson
    Link Function        Log
    Dependent Variable   y

  Number of Observations Read    100
  Number of Observations Used    100

  Criteria For Assessing Goodness Of Fit
    Criterion             DF    Value        Value/DF
    Deviance              98    705.9973     7.2041
    Scaled Deviance       98    705.9973     7.2041
    Pearson Chi-Square    98    672.2758     6.8600
    Scaled Pearson X2     98    672.2758     6.8600
    Log Likelihood              818.9899

  Algorithm converged.

  Analysis Of Parameter Estimates
    Parameter   DF   Estimate   Std Error   Wald 95% Confidence Limits   Chi-Square   Pr > ChiSq
    Intercept   1     1.9546    0.0822       1.7934     2.1158           564.79       <.0001
    x           1     0.0426    0.0321      -0.0204     0.1056             1.76       0.1852
    Scale       0     1.0000    0.0000       1.0000     1.0000

  NOTE: The scale parameter was held fixed.
Figure 5.7: Poisson regression SAS output of overdispersed data for µ = 10

                         The GENMOD Procedure

  Model Information
    Data Set             WORK.MU20
    Distribution         Poisson
    Link Function        Log
    Dependent Variable   y

  Number of Observations Read    100
  Number of Observations Used    100

  Criteria For Assessing Goodness Of Fit
    Criterion             DF    Value         Value/DF
    Deviance              98    1180.0775     12.0416
    Scaled Deviance       98    1180.0775     12.0416
    Pearson Chi-Square    98    1233.3790     12.5855
    Scaled Pearson X2     98    1233.3790     12.5855
    Log Likelihood              2895.4924

  Algorithm converged.

  Analysis Of Parameter Estimates
    Parameter   DF   Estimate   Std Error   Wald 95% Confidence Limits   Chi-Square   Pr > ChiSq
    Intercept   1     2.4777    0.0894       2.3024     2.6529           767.68       <.0001
    x           1     0.1046    0.0289       0.0479     0.1612            13.09       0.0003
    Scale       0     1.0000    0.0000       1.0000     1.0000

  NOTE: The scale parameter was held fixed.

Figure 5.8: Poisson regression SAS output of overdispersed data for µ = 20

                         The GENMOD Procedure

  Model Information
    Data Set             WORK.MU50
    Distribution         Poisson
    Link Function        Log
    Dependent Variable   y

  Number of Observations Read    100
  Number of Observations Used    100

  Criteria For Assessing Goodness Of Fit
    Criterion             DF    Value         Value/DF
    Deviance              98    3905.6126     39.8532
    Scaled Deviance       98    3905.6126     39.8532
    Pearson Chi-Square    98    4286.7770     43.7426
    Scaled Pearson X2     98    4286.7770     43.7426
    Log Likelihood              14355.4429

  Algorithm converged.

  Analysis Of Parameter Estimates
    Parameter   DF   Estimate   Std Error   Wald 95% Confidence Limits   Chi-Square   Pr > ChiSq
    Intercept   1     3.3669    0.0564       3.2563     3.4774           3562.18      <.0001
    x           1     0.1328    0.0134       0.1066     0.1591             98.60      <.0001
    Scale       0     1.0000    0.0000       1.0000     1.0000

  NOTE: The scale parameter was held fixed.
Figure 5.9: Poisson regression SAS output of overdispersed data for µ = 50

                         The GENMOD Procedure

  Model Information
    Data Set             WORK.MU10
    Distribution         Negative Binomial
    Link Function        Log
    Dependent Variable   y

  Number of Observations Read    100
  Number of Observations Used    100

  Criteria For Assessing Goodness Of Fit
    Criterion             DF    Value        Value/DF
    Deviance              98    117.8710     1.2028
    Scaled Deviance       98    117.8710     1.2028
    Pearson Chi-Square    98    72.8406      0.7433
    Scaled Pearson X2     98    72.8406      0.7433
    Log Likelihood              1018.0347

  Algorithm converged.

  Analysis Of Parameter Estimates
    Parameter    DF   Estimate   Std Error   Wald 95% Confidence Limits   Chi-Square   Pr > ChiSq
    Intercept    1     1.9286    0.2699       1.3995     2.4576           51.05        <.0001
    x            1     0.0540    0.1095      -0.1606     0.2686            0.24        0.6217
    Dispersion   1     1.0577    0.1797       0.7055     1.4098

  NOTE: The negative binomial dispersion parameter was estimated by maximum likelihood.

Figure 5.10: Negative binomial regression SAS output of overdispersed data for µ = 10

                         The GENMOD Procedure

  Model Information
    Data Set             WORK.MU20
    Distribution         Negative Binomial
    Link Function        Log
    Dependent Variable   y

  Number of Observations Read    100
  Number of Observations Used    100

  Criteria For Assessing Goodness Of Fit
    Criterion             DF    Value        Value/DF
    Deviance              98    114.8235     1.1717
    Scaled Deviance       98    114.8235     1.1717
    Pearson Chi-Square    98    86.1879      0.8795
    Scaled Pearson X2     98    86.1879      0.8795
    Log Likelihood              3308.7693

  Algorithm converged.

  Analysis Of Parameter Estimates
    Parameter    DF   Estimate   Std Error   Wald 95% Confidence Limits   Chi-Square   Pr > ChiSq
    Intercept    1     2.4560    0.3426       1.7845     3.1275           51.39        <.0001
    x            1     0.1120    0.1136      -0.1107     0.3346            0.97        0.3242
    Dispersion   1     0.8303    0.1248       0.5857     1.0749

  NOTE: The negative binomial dispersion parameter was estimated by maximum likelihood.
Figure 5.11: Negative binomial regression SAS output of overdispersed data for µ = 20

                         The GENMOD Procedure

  Model Information
    Data Set             WORK.MU50
    Distribution         Negative Binomial
    Link Function        Log
    Dependent Variable   y

  Number of Observations Read    100
  Number of Observations Used    100

  Criteria For Assessing Goodness Of Fit
    Criterion             DF    Value        Value/DF
    Deviance              98    112.5855     1.1488
    Scaled Deviance       98    112.5855     1.1488
    Pearson Chi-Square    98    92.8749      0.9477
    Scaled Pearson X2     98    92.8749      0.9477
    Log Likelihood              16079.2532

  Algorithm converged.

  Analysis Of Parameter Estimates
    Parameter    DF   Estimate   Std Error   Wald 95% Confidence Limits   Chi-Square   Pr > ChiSq
    Intercept    1     3.4228    0.3484       2.7400     4.1056           96.54        <.0001
    x            1     0.1189    0.0852      -0.0481     0.2858            1.95        0.1628
    Dispersion   1     0.8877    0.1172       0.6579     1.1175

  NOTE: The negative binomial dispersion parameter was estimated by maximum likelihood.

Figure 5.12: Negative binomial regression SAS output of overdispersed data for µ = 50

5.3.1 Results and Discussion

5.3.1.1 Goodness-of-fit

Table 5.2 summarizes the Pearson chi-square and deviance values for Poisson regression and negative binomial regression obtained from the overdispersed data output. It can be seen that for data with overdispersion, the Pearson chi-square and deviance values for Poisson regression are very large, while those for negative binomial regression are small. This indicates that Poisson regression shows lack of fit while negative binomial regression has a good fit for all cases of µ, and hence that Poisson regression is not reliable for analyzing data that has overdispersion.
Table 5.2: Pearson chi-square and deviance for Poisson regression and negative binomial regression obtained from overdispersed data
  Model               µ        Pearson chi-square   Deviance
  Poisson             µ = 10   672.2758             705.9973
                      µ = 20   1233.3790            1180.0775
                      µ = 50   4286.7770            3905.6126
  Negative Binomial   µ = 10   72.8406              117.8710
                      µ = 20   86.1879              114.8235
                      µ = 50   92.8749              112.5855

To formally test the goodness-of-fit of the models, all Pearson chi-square and deviance values are again compared with the critical value obtained from the statistical table, χ²₀.₀₅(98) = 122.1026. It is found that all Pearson chi-square and deviance values for the Poisson regression model are greater than 122.1026. Therefore, for all cases of µ, Poisson regression shows lack of fit; that is, H₀ is rejected at significance level α = 0.05. For negative binomial regression, however, all Pearson chi-square and deviance values are less than 122.1026. Thus, for all cases of µ, negative binomial regression has a good fit; that is, H₀ is not rejected at significance level α = 0.05. This result clearly shows that Poisson regression is not adequate when applied to overdispersed data. In contrast, negative binomial regression can handle overdispersion, because its variance naturally exceeds its mean.

5.3.1.2 Significance, Confidence Intervals, and Standard Errors

For µ = 10, β₁ is statistically insignificant (p-value > 0.05) in both the Poisson regression (see Figure 5.7) and negative binomial regression (see Figure 5.10) models. Nevertheless, for µ = 20 and µ = 50, β₁ is significant in Poisson regression (see Figures 5.8 and 5.9) but not in negative binomial regression (see Figures 5.11 and 5.12). The 95% confidence intervals for negative binomial regression are much wider than those for Poisson regression when the data is overdispersed. Thus, it is clear that when overdispersion exists, applying the wrong model can change the statistical inference.
Standard errors for negative binomial regression are substantially larger than the standard errors obtained when Poisson regression is used to analyze overdispersed data. This shows that Poisson regression can cause underestimation of standard errors when overdispersion exists.

5.4 Conclusions

From the analysis of the simulated data, it is found that for data without overdispersion, both Poisson regression and negative binomial regression are adequate; that is, both models have a good fit. The significance results for the coefficients are the same for both models, and the 95% confidence intervals as well as the standard errors are approximately the same. Nevertheless, Poisson regression is more reliable than negative binomial regression, since negative binomial regression reduces to Poisson regression when there is no overdispersion. Even though the analyses for both models give approximately the same results, Poisson regression can provide the more accurate result.

On the other hand, when overdispersion occurs, Poisson regression is no longer adequate. Applying Poisson regression can underestimate the standard errors and can lead to wrong inference. The significance results for the coefficients can also be doubtful; that is, when Poisson regression is used to analyze overdispersed data, one tends to reject H₀ in the test of significance when it actually should not be rejected. This simulation study shows that negative binomial regression can be relied on when one encounters the problem of overdispersion. Negative binomial regression will always be adequate and can provide correct results when the data is overdispersed.

5.5 The Simulation Codes

As mentioned earlier, the data in this simulation study is simulated using the R 2.9.2 software. This section shows how the data is simulated. Section 5.5.1 discusses the codes for simulating non-overdispersed data, while section 5.5.2 discusses the codes for simulating overdispersed data.
5.5.1 Codes for the simulation of non-overdispersed data

Below are the R codes for the simulation of non-overdispersed data:

  b0 <- 0
  b1 <- 1
  n  <- 100
  p  <- matrix(0, n)
  y  <- matrix(0, n)
  x  <- rnorm(n, 0, 1)
  miu <- round(exp(b0 + (b1 / n) * sum(x[1:n])))
  u  <- runif(n, 0, 1)
  for (i in 1:n) {
    p[i] <- dpois(i - 1, miu, log = FALSE)  # p[k] holds P(Y = k - 1)
  }
  for (i in 1:n) {
    s <- p[1]
    k <- 1
    while (s < u[i]) {
      k <- k + 1
      s <- sum(p[1:k])
    }
    y[i] <- k - 1  # inverse-CDF draw; subtract 1 because p[k] = P(Y = k - 1)
  }

First, the value of b0 is set to zero and b1 is set to one, so that the following model is obtained:

  ln µ = b0 + b1X = X   (5.1)

The sample size n is set to 100, while Y is stored as a vector. Next, X is defined to follow a normal distribution. X is important for the determination of µ: from the codes, µ is defined as

  µ = e^(b0 + b1x)   (5.2)

Since b0 = 0 and b1 = 1, µ becomes

  µ = e^x   (5.3)

Suppose we want data with µ = 10. From equation (5.3), we obtain x = ln µ = ln(10) = 2.3026. Then all that is needed is to change the mean value in the definition of X as follows:

  x <- rnorm(n, 2.3026, 1)

The data is simulated in such a way that Y follows the Poisson distribution. Therefore, the data is not overdispersed.

5.5.2 Codes for the simulation of overdispersed data

Below are the R codes for the simulation of overdispersed data:

  b0 <- 0
  b1 <- 1
  n  <- 100
  alpha <- 1
  x  <- rnorm(n, 2, 1)
  m  <- round(exp(b0 + (b1 / n) * sum(x[1:n])))
  y  <- rnbinom(n, size = round(1 / alpha), mu = m)
  mean(y); var(y)

As in the simulation of non-overdispersed data, b0 is set to zero and b1 to one, and the sample size n is set to 100. Note that "alpha" in the codes is the α in the negative binomial probability distribution; it is set to one for simplicity. The data is simulated in such a way that Y follows the negative binomial distribution, which ensures that the variance of the data is greater than the mean. Therefore, the data is overdispersed.
CHAPTER 6

SUMMARY AND CONCLUSIONS

6.1 Summary

The analysis of Poisson regression in this study begins with the formulation of the model, followed by the estimation of the parameters using the method of maximum likelihood estimation (MLE). After the estimates are obtained, they must be interpreted. The interpretation of the coefficients in Poisson regression differs from that in other regression models because of the natural log used in the model: the coefficients need to be exponentiated in order to obtain the correct interpretation. After interpreting the coefficients, the goodness-of-fit of the model is checked using the Pearson chi-square and deviance statistics. This study also covers residual analysis, which gives a further indication of the goodness-of-fit of the model. Tests of significance are done after the residual analysis, and confidence intervals are also computed. After that, a test for overdispersion is carried out. If overdispersion is present, the quasi-likelihood method and negative binomial regression are used to overcome it. This study also presents a simulation study, whose purpose is to compare the performance of Poisson regression and negative binomial regression in analyzing data with and without overdispersion.

6.2 Conclusions

This study focuses on the analysis of Poisson regression as well as on the overdispersion problem. It is found that overdispersion causes underestimation of the standard errors, which then leads to erroneous inference; tests of significance can also give misleading results. Overdispersion can be handled by the quasi-likelihood method, which is the simplest approach and is most appropriate when the cause of the overdispersion is unknown. Overdispersion can also be handled by using negative binomial regression.
The simulation study reveals that when the data are overdispersed, negative binomial regression gives a good and correct analysis, while Poisson regression is clearly not adequate since it gives wrong results. However, when the data have no overdispersion, Poisson regression is the more reliable choice.

6.3 Recommendations

The recommendations for future research on Poisson regression are:

i) Extend the study to other problems in Poisson regression, such as truncation, censoring and excess zeros.
ii) Carry out analyses on time series data, multivariate data and panel data.
iii) Modify the classical Poisson regression into a more robust Poisson regression.
iv) Use other statistical software for the analysis instead of SAS 9.1; other software that can be used includes Stata, S-Plus, SPSS and R.
APPENDIX A

SAS Codes for Elephant's Mating Success Data

data elephant;
input age matings;
cards;
27 0
28 1
28 1
28 1
28 3
29 0
29 0
29 0
29 2
29 2
29 2
30 1
32 2
33 4
33 3
33 3
33 3
33 2
34 1
34 1
34 2
34 3
36 5
36 6
37 1
37 1
37 6
38 2
39 1
41 3
42 4
43 0
43 2
43 3
43 4
43 9
44 3
45 5
47 7
48 2
52 9
;
proc genmod;
model matings = age / dist=poi link=log;
run;

APPENDIX B

The Values of µ̂i for Elephant's Mating Success Data

X   Y   µ̂i
27  0   1.313769
28  1   1.407197
28  1   1.407197
28  1   1.407197
28  3   1.407197
29  0   1.50727
29  0   1.50727
29  0   1.50727
29  2   1.50727
29  2   1.50727
29  2   1.50727
30  1   1.614459
32  2   1.852248
33  4   1.98397
33  3   1.98397
33  3   1.98397
33  3   1.98397
33  2   1.98397
34  1   2.12506
34  1   2.12506
34  2   2.12506
34  3   2.12506
36  5   2.438054
36  6   2.438054
37  1   2.611435
37  1   2.611435
37  6   2.611435
38  2   2.797147
39  1   2.996066
41  3   3.437347
42  4   3.681794
43  0   3.943624
43  2   3.943624
43  3   3.943624
43  4   3.943624
43  9   3.943624
44  3   4.224074
45  5   4.524468
47  7   5.190863
48  2   5.560011
52  9   7.318461

APPENDIX C

SAS Codes for Nursing Home Data

data nursing_home;
input bed tdays pcrev nsal fexp rural;
log_t= log(tdays);
cards;
[52 observations of bed, tdays, pcrev, nsal, fexp and rural; the column layout of the data listing was damaged in extraction and is omitted here]
;
proc genmod;
model bed = pcrev nsal fexp / dist=poi link=log offset=log_t;
run; /*(M1)*/
proc genmod;
model bed = pcrev nsal fexp pcrev*nsal pcrev*fexp nsal*fexp / dist=poi link=log offset=log_t;
run; /*(M2)*/
proc genmod;
model bed = pcrev nsal fexp pcrev*nsal pcrev*fexp nsal*fexp rural / dist=poi link=log offset=log_t;
run; /*(M3)*/

APPENDIX D

SAS Output of Residual Analysis for Poisson Regression in Nursing Home Data

[The GENMOD Procedure, observation statistics for all 52 nursing homes: bed, log_t, the covariates, the linear predictor (Xbeta) and its standard error, the predicted mean with 95% limits (Pred, Lower, Upper), and the raw, Pearson, deviance, standardized and likelihood residuals (Resraw, Reschi, Resdev, StResdev, StReschi, Reslik). The tabulated output was damaged in extraction and is omitted here.]

APPENDIX E

The Values of µ̂i for Nursing Home Data

bed (Yi)   µ̂i
244   156.28
59    58.356
120   133.145
120   136.71
120   133.456
65    67.442
120   141.421
90    86.005
96    60.544
120   86.433
62    68.336
120   159.424
116   104.328
59    53.432
80    82.015
120   127.116
80    60.937
100   115.313
60    70.239
110   92.623
120   103.972
135   134.403
59    62.789
60    71.476
25    37.912
221   206.236
64    65.367
62    71.258
108   103.582
62    68.233
90    96.688
146   118.944
62    59.759
30    28.649
79    94.287
44    50.817
120   144.423
100   93.429
49    68.603
123   107.383
82    46.349
58    61.158
110   110.831
62    71.981
86    66.054
102   115.068
135   149.056
78    62.495
83    115.996
60    72.698
54    41.552
120   154.998

APPENDIX F

SAS Output of Residual Analysis for Negative Binomial Regression in Nursing Home Data

[The GENMOD Procedure, observation statistics for all 52 nursing homes, in the same layout as Appendix D. The tabulated output was damaged in extraction and is omitted here.]

APPENDIX G

Simulated Data

[Listings of the simulated X and Y values: non-overdispersed data with µ = 10, µ = 20 and µ = 50, and overdispersed data with µ = 10 (α = 1). The multi-column layout of the listings was damaged in extraction and the final listing is truncated, so the data are omitted here.]
[Overdispersed (X, Y) values, continued; flattened during extraction and not realigned here.]