COUNT DATA MODELING: CHOICE BETWEEN GENERALIZED POISSON MODEL AND NEGATIVE BINOMIAL MODEL

FELIX FAMOYE
Department of Mathematics, Central Michigan University, Mount Pleasant, Michigan 48859
E-mail: felix.famoye@cmich.edu

Journal of Applied Statistical Science, January 2005

SUMMARY

The generalized Poisson regression and the negative binomial regression models have been used to describe count data. An advantage of the generalized Poisson regression model over the negative binomial regression model is that it can be used to model count data with either over-dispersion or under-dispersion. The negative binomial regression model is suitable only for cases with over-dispersion. In this paper, we carry out a simulation study to compare the two regression models when the true data generating process exhibits over-dispersion. In the simulation experiment, we observe that the generalized Poisson regression model has an advantage over the negative binomial regression model when the data has a high proportion of zeros.

Key Words: Non-nested models; Monte Carlo simulation; likelihood ratio test; score test.

AMS 2000 Subject Classifications: 62J02, 65C60, 62F03.

1. INTRODUCTION

The probability function for the generalized Poisson regression (GPR) model [Consul and Famoye (1992) and Famoye (1993)], $f(y_i; \mu_i, \phi)$, is given by

$$f(y_i; \mu_i, \phi) = \left(\frac{\mu_i}{1+\phi\mu_i}\right)^{y_i} \frac{(1+\phi y_i)^{y_i-1}}{y_i!} \exp\left[-\frac{\mu_i(1+\phi y_i)}{1+\phi\mu_i}\right], \quad y_i = 0, 1, 2, 3, \ldots \tag{1.1}$$

where

$$\mu_i = \mu_i(x_i) = \exp\left(\sum_{j=1}^{k} x_{ij}\beta_j\right), \tag{1.2}$$

$x_i = (x_{i1} = 1, x_{i2}, \ldots, x_{ik})$ is the i-th row of the covariate matrix $X$, and $\beta = (\beta_1, \beta_2, \ldots, \beta_k)$ is an unknown k-dimensional column vector of parameters. The model in (1.1) is based upon the generalized Poisson distribution defined and studied by Consul (1989). The mean of $y_i$ is given by $\mu_i$ and the variance of $y_i$ is given by $\mu_i(1+\phi\mu_i)^2$. The model in (1.1) is an extension of the Poisson regression model given by Frome et al. (1973). When $\phi = 0$, the GPR model reduces to the Poisson regression model. When $\phi > 0$, the GPR model is used to model count data that exhibits over-dispersion, and when $\phi < 0$, the model is used to describe count data that shows under-dispersion.

The probability function for the negative binomial regression (NBR) model [see for example Lawless (1987), Cameron and Trivedi (1998)], $g(y_i; \mu_i, \tau)$, is given by

$$g(y_i; \mu_i, \tau) = \binom{y_i + \tau^{-1} - 1}{y_i} \left(\frac{1}{1+\tau\mu_i}\right)^{1/\tau} \left(\frac{\tau\mu_i}{1+\tau\mu_i}\right)^{y_i}, \quad y_i = 0, 1, 2, 3, \ldots \tag{1.3}$$

where $\mu_i$ is defined in (1.2). The mean of $y_i$ is given by $\mu_i$ and the variance of $y_i$ is given by $\mu_i(1+\tau\mu_i)$. The model in (1.3) reduces to the Poisson regression model when $\tau \to 0$. It can be used to model count data with over-dispersion when $\tau > 0$.

When $\phi > 0$ in the GPR model and $\tau > 0$ in the NBR model, the two models are good competitors. The question to be answered in this paper is whether one model is better than the other and, if so, under what conditions.

From the expressions in (1.1) and (1.3), the two regression models have exactly the same number of parameters. The NBR model is presently available in the statistical software STATA.
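As an illustration of (1.1) and (1.3), the following minimal Python sketch evaluates both log probability functions and checks numerically that each has mean $\mu_i$ and the variance stated above. The function names (`gpr_logpmf`, `nbr_logpmf`, `moments`) and the truncation point `upper` are assumptions made for this sketch and are not part of the paper, and the sketch is restricted to the over-dispersed case $\phi > 0$, $\tau > 0$.

```python
import math


def gpr_logpmf(y, mu, phi):
    """Log of the GPR probability function (1.1); assumes phi >= 0 and y >= 0."""
    return (y * math.log(mu / (1.0 + phi * mu))
            + (y - 1) * math.log(1.0 + phi * y)
            - mu * (1.0 + phi * y) / (1.0 + phi * mu)
            - math.lgamma(y + 1))


def nbr_logpmf(y, mu, tau):
    """Log of the NBR probability function (1.3); assumes tau > 0 and y >= 0."""
    inv_tau = 1.0 / tau
    # lgamma terms give the log of the generalized binomial coefficient.
    return (math.lgamma(y + inv_tau) - math.lgamma(inv_tau) - math.lgamma(y + 1)
            - inv_tau * math.log(1.0 + tau * mu)
            + y * math.log(tau * mu / (1.0 + tau * mu)))


def moments(logpmf, mu, dispersion, upper=500):
    """Total probability, mean and variance from summing the pmf over 0..upper."""
    probs = [math.exp(logpmf(y, mu, dispersion)) for y in range(upper + 1)]
    total = sum(probs)
    mean = sum(y * p for y, p in enumerate(probs))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
    return total, mean, var


# Illustrative values only: compare with mu*(1 + phi*mu)**2 and mu*(1 + tau*mu).
print(moments(gpr_logpmf, 2.0, 0.2))
print(moments(nbr_logpmf, 2.0, 0.2))
```

For the same mean and the same dispersion value, the GPR variance $\mu_i(1+\phi\mu_i)^2$ grows faster in $\mu_i$ than the NBR variance $\mu_i(1+\tau\mu_i)$, which is one way to see that the two models are competitors rather than reparameterizations of each other.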
This author has come across data sets for which the NBR model failed to converge. A similar observation was made by Lambert (1992) when she tried to fit a zero-inflated NBR model to an observed data set. Even though the GPR model is not available in STATA, it is very easy to write a computer program to fit the model. In all the simulation studies in this paper, S-PLUS is used to fit both the GPR and the NBR models.

In this paper we compare the GPR model with the NBR model by using a test statistic proposed by Vuong (1989) for non-nested models. In section 2, we review the likelihood ratio test statistic proposed by Vuong. In section 3, we carry out a simulation study to compare the GPR model with the NBR model. In section 4, we define a score statistic to compare the GPR model with the Poisson regression model. Through a simulation experiment, the power of the score statistic is compared with the powers of the likelihood ratio statistic suggested by Famoye (1993) and the asymptotic Wald t-statistic. Finally, in section 5, we provide a recommendation on the choice between the GPR and the NBR models.

2. LIKELIHOOD RATIO TEST

We wish to choose between the GPR model $f(y_i; \mu_i, \phi)$ and the NBR model $g(y_i; \mu_i, \tau)$, which are two non-nested regression models. Given the two regression models, we consider the hypothesis

$$H_0: \text{GPR and NBR are equivalent} \tag{2.1}$$

against

$$H_f: \text{GPR is better than NBR, or } H_g: \text{NBR is better than GPR.} \tag{2.2}$$

The likelihood ratio statistic for testing the GPR model against the NBR model is

$$L^* = \sum_{i=1}^{n} \log\left[\frac{f(y_i; \mu_i, \phi)}{g(y_i; \mu_i, \tau)}\right]. \tag{2.3}$$

Since the GPR model and the NBR model are non-nested, the statistic in (2.3) is not chi-square distributed. If these two models were nested, $L^*$ would follow a chi-square distribution. Vuong (1989) used the Kullback-Leibler Information Criterion to discriminate between two non-nested models.
To test the null hypothesis in (2.1), Vuong proposed the test statistic

$$T^* = \frac{L^*}{\hat{\omega}\sqrt{n}}, \tag{2.4}$$

where

$$\hat{\omega}^2 = \frac{1}{n}\sum_{i=1}^{n}\left[\log\frac{f(y_i; \hat{\mu}_i, \hat{\phi})}{g(y_i; \hat{\mu}_i, \hat{\tau})}\right]^2 - \left[\frac{1}{n}\sum_{i=1}^{n}\log\frac{f(y_i; \hat{\mu}_i, \hat{\phi})}{g(y_i; \hat{\mu}_i, \hat{\tau})}\right]^2$$

is an estimate of the variance of $L^*/\sqrt{n}$. For non-nested models, $T^*$ is approximately standard normal under the null hypothesis that the two models are equivalent. At significance level $\alpha$, one compares $T^*$ with $z_{\alpha/2}$. If $T^* < -z_{\alpha/2}$, the null hypothesis is rejected in favor of $H_g$: the NBR model is better than the GPR model. If $T^* > z_{\alpha/2}$, the null hypothesis is rejected in favor of $H_f$: the GPR model is better than the NBR model. However, if $|T^*| \le z_{\alpha/2}$, the null hypothesis is not rejected; that is, there is not sufficient evidence to say that the two models differ.

3. SIMULATION STUDY

In this section, we compare the GPR and the NBR models when the data is generated from the GPR model, and also when the data is generated from the NBR model. The statistic $T^*$ in (2.4) is computed and the null hypothesis in (2.1) is tested at significance levels $\alpha = .10$ and $\alpha = .05$.

The simulation study has been carried out for different sample sizes and different values of the parameters $\phi$ or $\tau$, $\beta_1$, and $\beta_2$. The results are similar and so we report simulations for n = 25, 50, and 75 for some parameter combinations ($\phi$ or $\tau$, $\beta_1$, $\beta_2$). Each simulation experiment is based on 2000 replications. Each regression model is generated according to $\mu_i = \exp(\beta_1 + \beta_2 x_{i2})$, where $x_{i2}$ is taken as a random sample from the Uniform(0, 1) distribution. The values of $\beta_1$, $\beta_2$, and $\phi$ or $\tau$ used in the experiments are shown in Tables 1 and 2. The covariate $x_{i2}$ is fixed throughout the simulation study. Responses $y_i$ are generated from the GPR $f(y_i; \mu_i, \phi)$ and the NBR $g(y_i; \mu_i, \tau)$ for the $\mu_i$'s.
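The statistic $T^*$ in (2.4) and the three-way decision that is applied to each simulated data set can be sketched as follows. This is a minimal illustration, not the S-PLUS code used in the study: it assumes the per-observation log probabilities under the fitted GPR and NBR models are already available as two lists, and the names `vuong_statistic` and `vuong_decision` are hypothetical.

```python
import math
from statistics import NormalDist


def vuong_statistic(log_f, log_g):
    """T* in (2.4) from per-observation log probabilities of the two fitted models."""
    n = len(log_f)
    ratios = [lf - lg for lf, lg in zip(log_f, log_g)]  # log(f/g) for each observation
    l_star = sum(ratios)                                # L* in (2.3)
    omega2 = sum(r * r for r in ratios) / n - (l_star / n) ** 2
    return l_star / (math.sqrt(omega2) * math.sqrt(n))


def vuong_decision(t_star, alpha=0.05):
    """Two-sided comparison of T* with the standard normal quantile z_{alpha/2}."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    if t_star > z:
        return "GPR"         # reject H0 in favor of H_f
    if t_star < -z:
        return "NBR"         # reject H0 in favor of H_g
    return "equivalent"      # H0 not rejected


# Toy input (not data from the paper): log-ratios 1, 2, 3.
t = vuong_statistic([1.0, 2.0, 3.0], [0.0, 0.0, 0.0])
print(t, vuong_decision(t))
```

In a Monte Carlo run, tallying the three return values of `vuong_decision` over the replications gives the kind of proportions that are reported for each parameter combination.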
For the parameter set $(\beta_1, \beta_2)$ = (–4.0, 6.5), the $\mu_i$'s range from 0.02 to 10.03 and this gives a large proportion of $y_i = 0$ in the simulated data. For the parameter set (0.5, 1.0), the values of $\mu_i$ range from 1.66 to 4.35, and for the parameter set (2.0, 1.0), the $\mu_i$'s range from 7.43 to 19.49.

For each generated data set, the test statistic $T^*$ in (2.4) is computed for testing the null hypothesis in (2.1). From the 2000 repetitions, we record the proportion of times ($p_e$) both models are equivalent, the proportion of times ($p_f$) the GPR model is better than the NBR model, and the proportion of times ($p_g$) the NBR model is better than the GPR model.

When the data is generated from the GPR model, one would expect $p_f > p_g$ rather than $p_g > p_f$. Also, when the data is generated from the NBR model, one would expect $p_g > p_f$ rather than $p_f > p_g$. However, the results of the simulation study in Tables 1 and 2 tell a different story. When the data is simulated from the NBR model, the two models are declared equivalent more often than when the data is simulated from the GPR model. For the parameter values and the sample sizes reported in Tables 1 and 2, both models are equivalent at least 86% of the time when the data is generated from the NBR model, and at least 80% of the time when the data is generated from the GPR model. As n increases, the proportion of times that both models are equivalent decreases.

When the data has a high proportion of zeros, the GPR model is better ($p_f > p_g$) irrespective of the data generating process. When the $\mu_i$'s range from 1.66 to 4.35, both models seem to perform equivalently. From Tables 1 and 2, both models are equivalent at least 91% of the time. When the $\mu_i$'s range from 7.43 to 19.49, the performance of each model depends on the data generating process. From Table 1, the GPR model is better ($p_f > p_g$) when the data is generated from the GPR model.
In Table 2, the NBR model is better ($p_g > p_f$) when the data is generated from the NBR model. For small $\mu_i$'s [$(\beta_1, \beta_2)$ = (–4.0, 6.5) or (0.5, 1.0)], either the two models are more often equivalent or the GPR model is better than the NBR model. For large $\mu_i$'s [$(\beta_1, \beta_2)$ = (2.0, 1.0)], the better model is the one from which the data is generated.

Table 1. Simulated data from GPR model

| β1 | β2 | ϕ | n | p_f (α = 0.10) | p_e (α = 0.10) | p_g (α = 0.10) | p_f (α = 0.05) | p_e (α = 0.05) | p_g (α = 0.05) |
|---|---|---|---|---|---|---|---|---|---|
| –4.0 | 6.5 | 0.2 | 25 | .0260 | .9725 | .0015 | .0120 | .9875 | .0005 |
| –4.0 | 6.5 | 0.2 | 50 | .0245 | .9730 | .0025 | .0100 | .9900 | .0000 |
| –4.0 | 6.5 | 0.2 | 75 | .0420 | .9475 | .0105 | .0185 | .9800 | .0015 |
| –4.0 | 6.5 | 0.3 | 25 | .0400 | .9585 | .0015 | .0165 | .9835 | .0000 |
| –4.0 | 6.5 | 0.3 | 50 | .0415 | .9490 | .0095 | .0210 | .9775 | .0015 |
| –4.0 | 6.5 | 0.3 | 75 | .0555 | .9310 | .0135 | .0255 | .9685 | .0060 |
| 0.5 | 1.0 | 0.2 | 25 | .0110 | .9805 | .0085 | .0040 | .9935 | .0025 |
| 0.5 | 1.0 | 0.2 | 50 | .0160 | .9700 | .0140 | .0050 | .9900 | .0050 |
| 0.5 | 1.0 | 0.2 | 75 | .0270 | .9435 | .0295 | .0070 | .9880 | .0050 |
| 0.5 | 1.0 | 0.3 | 25 | .0300 | .9545 | .0155 | .0120 | .9830 | .0050 |
| 0.5 | 1.0 | 0.3 | 50 | .0330 | .9370 | .0300 | .0135 | .9750 | .0115 |
| 0.5 | 1.0 | 0.3 | 75 | .0395 | .9170 | .0435 | .0170 | .9670 | .0160 |
| 2.0 | 1.0 | 0.2 | 25 | .0940 | .8755 | .0305 | .0550 | .9335 | .0115 |
| 2.0 | 1.0 | 0.2 | 50 | .1140 | .8625 | .0235 | .0665 | .9280 | .0055 |
| 2.0 | 1.0 | 0.2 | 75 | .1315 | .8490 | .0195 | .0715 | .9195 | .0090 |
| 2.0 | 1.0 | 0.3 | 25 | .1020 | .8680 | .0300 | .0665 | .9220 | .0115 |
| 2.0 | 1.0 | 0.3 | 50 | .1360 | .8480 | .0160 | .0815 | .9150 | .0035 |
| 2.0 | 1.0 | 0.3 | 75 | .1785 | .8070 | .0145 | .1005 | .8940 | .0055 |

Table 2. Simulated data from NBR model

| β1 | β2 | τ | n | p_f (α = 0.10) | p_e (α = 0.10) | p_g (α = 0.10) | p_f (α = 0.05) | p_e (α = 0.05) | p_g (α = 0.05) |
|---|---|---|---|---|---|---|---|---|---|
| –4.0 | 6.5 | 0.2 | 25 | .0070 | .9925 | .0005 | .0025 | .9975 | .0000 |
| –4.0 | 6.5 | 0.2 | 50 | .0040 | .9945 | .0015 | .0010 | .9990 | .0000 |
| –4.0 | 6.5 | 0.2 | 75 | .0090 | .9895 | .0015 | .0030 | .9965 | .0005 |
| –4.0 | 6.5 | 0.3 | 25 | .0125 | .9870 | .0005 | .0035 | .9965 | .0000 |
| –4.0 | 6.5 | 0.3 | 50 | .0065 | .9920 | .0015 | .0025 | .9975 | .0000 |
| –4.0 | 6.5 | 0.3 | 75 | .0165 | .9785 | .0050 | .0060 | .9930 | .0010 |
| 0.5 | 1.0 | 0.2 | 25 | .0025 | .9945 | .0030 | .0000 | 1.0000 | .0000 |
| 0.5 | 1.0 | 0.2 | 50 | .0015 | .9975 | .0010 | .0005 | .9995 | .0000 |
| 0.5 | 1.0 | 0.2 | 75 | .0025 | .9935 | .0040 | .0000 | 1.0000 | .0000 |
| 0.5 | 1.0 | 0.3 | 25 | .0030 | .9920 | .0050 | .0005 | .9980 | .0015 |
| 0.5 | 1.0 | 0.3 | 50 | .0055 | .9855 | .0090 | .0005 | .9985 | .0010 |
| 0.5 | 1.0 | 0.3 | 75 | .0080 | .9800 | .0120 | .0020 | .9960 | .0020 |
| 2.0 | 1.0 | 0.2 | 25 | .0220 | .9510 | .0270 | .0080 | .9830 | .0090 |
| 2.0 | 1.0 | 0.2 | 50 | .0275 | .9260 | .0465 | .0100 | .9775 | .0125 |
| 2.0 | 1.0 | 0.2 | 75 | .0205 | .8925 | .0870 | .0090 | .9575 | .0335 |
| 2.0 | 1.0 | 0.3 | 25 | .0285 | .9245 | .0470 | .0130 | .9705 | .0165 |
| 2.0 | 1.0 | 0.3 | 50 | .0295 | .8995 | .0710 | .0090 | .9630 | .0280 |
| 2.0 | 1.0 | 0.3 | 75 | .0190 | .8695 | .1115 | .0090 | .9365 | .0545 |

4. COMPARING GPR MODEL WITH POISSON REGRESSION MODEL

To assess the adequacy of the GPR model over the Poisson regression model, one may test the hypothesis

$$H_0: \phi = 0 \quad \text{against} \quad H_a: \phi \ne 0. \tag{4.1}$$

To carry out the test in (4.1), Famoye (1993) suggested a likelihood ratio test statistic

$$L_1 = -2\sum_{i=1}^{n} \log\left[\frac{f_*(y_i; \mu_i, \phi = 0)}{f(y_i; \hat{\mu}_i, \hat{\phi})}\right], \tag{4.2}$$

where $f_*(y_i; \mu_i, \phi = 0)$ is the estimated probability function for the Poisson regression model. This is a case of nested models. The test statistic in (4.2) is approximately chi-square distributed with one degree of freedom.

An alternative test is to fit the GPR model and use the asymptotic Wald t-statistic to carry out the test in (4.1). Thus, one computes the test statistic

$$W = \frac{\hat{\phi}}{se(\hat{\phi})}, \tag{4.3}$$

where $\hat{\phi}$ is the maximum likelihood estimate of $\phi$ and $se(\hat{\phi})$ is the corresponding standard error. The statistic in (4.3) is compared with the t-distribution with $n - k - 1$ degrees of freedom.

The log-likelihood function for the GPR model in (1.1) is given by

$$\ell = \sum_{i=1}^{n}\left\{ y_i \log\left(\frac{\mu_i}{1+\phi\mu_i}\right) + (y_i - 1)\log(1+\phi y_i) - \frac{\mu_i(1+\phi y_i)}{1+\phi\mu_i} - \log(y_i!) \right\}. \tag{4.4}$$

On differentiating (4.4) with respect to the parameter $\phi$, we get

$$\frac{\partial \ell}{\partial \phi} = \sum_{i=1}^{n}\left\{ -\frac{\mu_i y_i}{1+\phi\mu_i} + \frac{y_i(y_i - 1)}{1+\phi y_i} - \frac{\mu_i(y_i - \mu_i)}{(1+\phi\mu_i)^2} \right\}. \tag{4.5}$$

Under $H_0$, the result in (4.5) becomes

$$\frac{\partial \ell}{\partial \phi} = \sum_{i=1}^{n}\left\{(y_i - \mu_i)^2 - y_i\right\} = \sum_{i=1}^{n}\left\{(y_i - \mu_i)^2 - \mu_i - (y_i - \mu_i)\right\}. \tag{4.6}$$

On evaluating (4.6) at the Poisson regression model maximum likelihood estimate of $\beta$, for which $\sum_{i=1}^{n}(y_i - \hat{\mu}_i) = 0$ because the model contains an intercept, we get

$$\left.\frac{\partial \ell}{\partial \phi}\right|_{H_0} = \sum_{i=1}^{n}\left\{(y_i - \hat{\mu}_i)^2 - \hat{\mu}_i\right\}. \tag{4.7}$$

Under $H_0$, the mean and variance of $[(y_i - \mu_i)^2 - y_i]$ are, respectively, 0 and $2\mu_i^2$. Hence, the test statistic

$$S = \sum_{i=1}^{n}\left\{(y_i - \hat{\mu}_i)^2 - \hat{\mu}_i\right\}\left[2\sum_{i=1}^{n}\hat{\mu}_i^2\right]^{-1/2} \tag{4.8}$$

is asymptotically standard normal distributed under $H_0$. The result in (4.8) is the same test statistic obtained by Lee (1986) for the NBR model. The statistic in (4.8) is a score statistic, and the null hypothesis in (4.1) is rejected at significance level $\alpha$ if $|S| \ge z_{\alpha/2}$. One advantage of the test statistic in (4.8) is that one does not need to fit the GPR model. The only requirement is the fit from the Poisson regression model. For both the likelihood ratio statistic and the asymptotic Wald t-statistic, one needs to fit the GPR model.

In order to compare the likelihood ratio statistic in (4.2), the asymptotic Wald t-statistic in (4.3), and the score statistic in (4.8), we carry out a Monte Carlo simulation experiment based on 2000 repetitions. Samples are generated from the GPR model with $\phi$ = –0.2, –0.1, 0.0, 0.1, 0.2, 0.3. The simulation experiment has been carried out for n = 25, 50, and 75 with the parameter combination $\beta_1 = 0.5$ and $\beta_2 = 1.0$. For $\phi = 0.0$, we also include the results for n = 100, 150, and 200. Each regression model is generated according to the $\mu_i$ used in section 3. The covariate $x_{i2}$ is fixed throughout the experiment. The three test statistics are computed.
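Because the score statistic in (4.8) needs only the fitted means from the Poisson regression model, it takes a few lines to compute. The sketch below assumes the fitted values $\hat{\mu}_i$ have already been obtained from a Poisson regression fit done elsewhere; the function names and the toy numbers are illustrative assumptions, and the rejection rule is written as a two-sided comparison to match the alternative $\phi \ne 0$ in (4.1).

```python
import math
from statistics import NormalDist


def score_statistic(y, mu_hat):
    """Score statistic S in (4.8) from counts y and Poisson-fitted means mu_hat."""
    numerator = sum((yi - mi) ** 2 - mi for yi, mi in zip(y, mu_hat))
    denominator = math.sqrt(2.0 * sum(mi * mi for mi in mu_hat))
    return numerator / denominator


def reject_poisson(s, alpha=0.05):
    """True if |S| reaches the standard normal quantile z_{alpha/2}."""
    return abs(s) >= NormalDist().inv_cdf(1.0 - alpha / 2.0)


# Toy values (not data from the paper); the fitted means sum to the observed total,
# as they would for a Poisson regression with an intercept.
y = [0, 2, 4]
mu_hat = [1.0, 2.0, 3.0]
s = score_statistic(y, mu_hat)
print(s, reject_poisson(s))
```

A positive value of `s` points toward over-dispersion and a negative value toward under-dispersion relative to the Poisson fit.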
Each entry in Table 3 represents the proportion of the 2000 Monte Carlo samples declared significant by each test at the significance levels $\alpha = 0.10$ and $\alpha = 0.05$.

Table 3. Powers of score, likelihood ratio and Wald t-statistics

| n | ϕ | S (α = 0.10) | L1 (α = 0.10) | W (α = 0.10) | S (α = 0.05) | L1 (α = 0.05) | W (α = 0.05) |
|---|---|---|---|---|---|---|---|
| 25 | –0.2 | 1.0000 | 1.0000 | 1.0000 | .9995 | 1.0000 | 1.0000 |
| 25 | –0.1 | .6820 | .8695 | .9325 | .4025 | .7645 | .8910 |
| 25 | 0.0 | .0755 | .1395 | .1910 | .0320 | .0775 | .1375 |
| 25 | 0.1 | .4620 | .3980 | .1900 | .37855 | .2980 | .0720 |
| 25 | 0.2 | .7960 | .7615 | .5580 | .7480 | .6950 | .3380 |
| 25 | 0.3 | .9235 | .9085 | .7880 | .8985 | .8710 | .6085 |
| 50 | –0.2 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| 50 | –0.1 | .9465 | .9840 | .9940 | .8570 | .9555 | .9840 |
| 50 | 0.0 | .0910 | .1235 | .1495 | .0385 | .0640 | .1000 |
| 50 | 0.1 | .6970 | .6510 | .5080 | .6210 | .5470 | .3180 |
| 50 | 0.2 | .9695 | .9600 | .9265 | .9535 | .9375 | .8385 |
| 50 | 0.3 | .9960 | .9945 | .9860 | .9915 | .9910 | .9725 |
| 75 | –0.2 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| 75 | –0.1 | .9935 | .9975 | .9985 | .9700 | .9950 | .9975 |
| 75 | 0.0 | .1050 | .1235 | .1305 | .0465 | .0640 | .0870 |
| 75 | 0.1 | .8380 | .8125 | .7215 | .7735 | .7730 | .5670 |
| 75 | 0.2 | .9935 | .9910 | .9860 | .9890 | .9870 | .9715 |
| 75 | 0.3 | .9995 | .9995 | .9995 | .9995 | .9995 | .9985 |
| 100 | 0.0 | .1010 | .1105 | .1270 | .0425 | .0610 | .0755 |
| 150 | 0.0 | .1020 | .1175 | .1320 | .0495 | .0620 | .0700 |
| 200 | 0.0 | .1060 | .1110 | .1155 | .0480 | .0560 | .0700 |

There is good agreement between the actual and nominal significance levels for the score statistic in Table 3 when $\phi = 0.0$ and n is large. Note that $\phi = 0.0$ corresponds to the case of the Poisson regression model. When $\phi > 0$, the score statistic is the most powerful of the three test statistics considered for testing the null hypothesis $H_0$ in (4.1). The Wald t-statistic is the least powerful for $\phi > 0$. When $\phi < 0$, the score statistic appears to be the least powerful, but it still has very good power. The Wald t-statistic appears to be the most powerful when $\phi < 0$. For all values of $\phi$, the score statistic is recommended for testing the null hypothesis $H_0$ in (4.1).

5. CONCLUSION

In general, the GPR model can be used in place of the NBR model, as the two models are equivalent in a high percentage of cases.
The GPR model has an advantage over the NBR model when the data has a high proportion of zeros. Since the GPR model can also describe under-dispersion, it appears to be a model that one should always apply. In terms of estimation, neither model is easier to estimate than the other. In fact, in our simulation study, there were a few cases in which the NBR model failed to converge on data generated from the GPR model. These cases were excluded from the 2000 repetitions in the simulation experiment. We did not observe a similar result for the GPR model when the data was generated from the NBR model.

ACKNOWLEDGEMENT

The support received from the Central Michigan University FRCE Committee under grant #48136 (Sabbatical Leave Request) is gratefully acknowledged. This work was done while Felix Famoye, Central Michigan University, was on his sabbatical leave at the Department of Biostatistics, University of North Texas Health Science Center at Fort Worth.

REFERENCES

Cameron, A.C. and Trivedi, P.K. (1998). Regression analysis of count data. Cambridge University Press, New York, New York.

Consul, P.C. (1989). Generalized Poisson distributions: Properties and Applications. Marcel Dekker Inc., New York.

Consul, P.C. and Famoye, F. (1992). “Generalized Poisson regression model.” Communications in Statistics - Theory and Methods 21(1), 89-109.

Famoye, F. (1993). “Restricted generalized Poisson regression model.” Communications in Statistics - Theory and Methods 22(5), 1335-1354.

Frome, E.L., Kutner, M.H., and Beauchamp, J.J. (1973). “Regression analysis of Poisson-distributed data.” Journal of the American Statistical Association 68, 935-940.

Lambert, D. (1992). “Zero-inflated Poisson regression, with an application to defects in manufacturing.” Technometrics 34, 1-14.

Lawless, J.F. (1987). “Negative binomial and mixed Poisson regression.” The Canadian Journal of Statistics 15(3), 209-225.

Lee, L. (1986). “Specification test for Poisson regression models.” International Economic Review 27(3), 689-706.

Vuong, Q.H. (1989). “Likelihood ratio tests for model selection and non-nested hypotheses.” Econometrica 57(2), 307-333.