Count data modeling: Choice between generalized Poisson model and negative binomial model

Article in Journal of Applied Statistical Science · January 2005
All content following this page was uploaded by Felix Famoye on 28 August 2014.
COUNT DATA MODELING: CHOICE BETWEEN GENERALIZED POISSON MODEL AND
NEGATIVE BINOMIAL MODEL
FELIX FAMOYE
Department of Mathematics
Central Michigan University
Mount Pleasant
Michigan 48859
E-mail: felix.famoye@cmich.edu
SUMMARY
The generalized Poisson regression and the negative binomial regression models have been used to describe count
data. An advantage of the generalized Poisson regression model over the negative binomial regression model is that
it can be used to model count data with either over-dispersion or under-dispersion. The negative binomial regression
model is suitable for cases with over-dispersion. In this paper, we carry out a simulation study to compare both
regression models when the true data generating process exhibits over-dispersion. In the simulation experiment, we
observe that the generalized Poisson regression model has an advantage over the negative binomial regression model
when the data has a high proportion of zeros.
Key Words: Non-nested models; Monte Carlo simulation; likelihood ratio test; score test.
AMS 2000 Subject Classifications: 62J02, 65C60, 62F03.
1. INTRODUCTION
The probability function for the generalized Poisson regression (GPR) model [Consul and Famoye (1992) and
Famoye (1993)], $f(y_i; \mu_i, \phi)$, is given by

$$f(y_i; \mu_i, \phi) = \left(\frac{\mu_i}{1+\phi\mu_i}\right)^{y_i} \frac{(1+\phi y_i)^{y_i-1}}{y_i!} \exp\left[-\frac{\mu_i(1+\phi y_i)}{1+\phi\mu_i}\right], \quad y_i = 0, 1, 2, 3, \ldots \tag{1.1}$$
where

$$\mu_i = \mu_i(x_i) = \exp\left(\sum_{j=1}^{k} x_{ij}\beta_j\right), \tag{1.2}$$

$x_i = (x_{i1} = 1, x_{i2}, \ldots, x_{ik})$ is the i-th row of the covariate matrix X, and
$\beta = (\beta_1, \beta_2, \ldots, \beta_k)$ is an unknown k-dimensional column vector of
parameters. The model in (1.1) is based upon the generalized Poisson distribution defined and studied by Consul
(1989). The mean of $y_i$ is given by $\mu_i$ and the variance of $y_i$ is given by $\mu_i(1+\phi\mu_i)^2$.
The model in (1.1) is an extension of the Poisson regression model given by Frome et al. (1973). When ϕ = 0,
the GPR model reduces to the Poisson regression model. When ϕ > 0, the GPR model is used to model count data
that exhibits over-dispersion and when ϕ < 0, the model is used to describe count data that shows under-dispersion.
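As a numerical check on (1.1), the following Python sketch evaluates the GPR probability function on the log scale and sums it directly. For one illustrative pair (µ = 3.0, ϕ = 0.2) the probabilities should total one, with mean µ and variance µ(1 + ϕµ)². The function names `gp_logpmf` and `gp_moments`, the parameter values, and the truncation point `y_max` are assumptions made for illustration; they are not taken from the paper, which used S-PLUS.

```python
import math

def gp_logpmf(y, mu, phi):
    """Log of the generalized Poisson probability f(y; mu, phi) in (1.1)."""
    if 1.0 + phi * y <= 0.0 or 1.0 + phi * mu <= 0.0:
        return float("-inf")  # zero probability; can only happen when phi < 0
    return (y * math.log(mu / (1.0 + phi * mu))
            + (y - 1) * math.log(1.0 + phi * y)
            - math.lgamma(y + 1)                      # log(y!)
            - mu * (1.0 + phi * y) / (1.0 + phi * mu))

def gp_moments(mu, phi, y_max=500):
    """Total probability, mean and variance by direct summation over 0..y_max."""
    probs = [math.exp(gp_logpmf(y, mu, phi)) for y in range(y_max + 1)]
    total = sum(probs)
    mean = sum(y * p for y, p in enumerate(probs))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
    return total, mean, var

# Illustrative values only (not from the paper's tables).
total, mean, var = gp_moments(mu=3.0, phi=0.2)
print(total, mean, var)  # compare with 1, mu = 3 and mu * (1 + phi * mu) ** 2 = 7.68
```

Working on the log scale avoids overflow in the factorial and power terms when the counts are large.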
The probability function for the negative binomial regression (NBR) model [see for example Lawless (1987),
Cameron and Trivedi (1998)], $g(y_i; \mu_i, \tau)$, is given by

$$g(y_i; \mu_i, \tau) = \binom{y_i + \tau^{-1} - 1}{y_i} \left(\frac{1}{1+\tau\mu_i}\right)^{1/\tau} \left(\frac{\tau\mu_i}{1+\tau\mu_i}\right)^{y_i}, \quad y_i = 0, 1, 2, 3, \ldots \tag{1.3}$$

where $\mu_i$ is defined in (1.2). The mean of $y_i$ is given by $\mu_i$ and the variance of $y_i$ is given by
$\mu_i(1+\tau\mu_i)$. The model in (1.3) reduces to the Poisson regression model when τ → 0. It can be used to
model count data with over-dispersion when τ > 0.
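The same kind of check can be sketched for (1.3). The Python code below evaluates the NBR probability function through log-gamma functions, confirms the stated mean and variance for one illustrative pair (µ = 3.0, τ = 0.2), and compares a very small τ with the Poisson probability to illustrate the limit τ → 0. The function names, the parameter values and `y_max` are assumptions for illustration only.

```python
import math

def nb_logpmf(y, mu, tau):
    """Log of the negative binomial probability g(y; mu, tau) in (1.3)."""
    r = 1.0 / tau
    # log of the generalized binomial coefficient C(y + 1/tau - 1, y)
    log_binom = math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
    return (log_binom
            - r * math.log(1.0 + tau * mu)
            + y * math.log(tau * mu / (1.0 + tau * mu)))

def poisson_logpmf(y, mu):
    """Log of the Poisson probability with mean mu."""
    return y * math.log(mu) - mu - math.lgamma(y + 1)

def nb_moments(mu, tau, y_max=500):
    """Total probability, mean and variance by direct summation over 0..y_max."""
    probs = [math.exp(nb_logpmf(y, mu, tau)) for y in range(y_max + 1)]
    total = sum(probs)
    mean = sum(y * p for y, p in enumerate(probs))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
    return total, mean, var

# Illustrative values only (not from the paper's tables).
total, mean, var = nb_moments(mu=3.0, tau=0.2)
print(total, mean, var)  # compare with 1, mu = 3 and mu * (1 + tau * mu) = 4.8

# Poisson limit: with a tiny tau the two log-probabilities nearly coincide.
gap = abs(nb_logpmf(2, 3.0, 1e-6) - poisson_logpmf(2, 3.0))
print(gap)
```

With the dispersion parameter set to the same value 0.2 and µ = 3, the formulas give a GPR variance of 7.68 against an NBR variance of 4.8, so equal values of ϕ and τ do not imply equal amounts of over-dispersion.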
When ϕ > 0 in the GPR model and τ > 0 in the NBR model, the two models are good competitors. The question
to be answered in this paper is whether one model is better than the other and, if so, under what conditions.
From the expressions in (1.1) and (1.3), the two regression models have exactly the same number of parameters. The
NBR model is presently available in the statistical software STATA. This author has come across data sets for which
the NBR model failed to converge. A similar observation was made by Lambert (1992) when she tried to fit a
zero-inflated NBR model to an observed data set. Even though the GPR model is not available in STATA, it is very easy
to write a computer program to fit the model. Throughout the simulation study in this paper, S-PLUS is used to fit
both the GPR and the NBR models.
In this paper we compare the GPR model with the NBR model by using a test statistic proposed by Vuong
(1989) for non-nested models. In section 2, we review the likelihood ratio test statistic proposed by Vuong. In
section 3, we carry out a simulation study to compare the GPR model with the NBR model. In section 4, we define a
score statistic to compare the GPR model with the Poisson regression model. Through a simulation experiment, the
power of the score statistic is compared with the powers of the likelihood ratio statistic suggested by Famoye (1993)
and the asymptotic Wald t-statistic. Finally in section 5, we provide a recommendation on the choice between the
GPR and the NBR models.
2. LIKELIHOOD RATIO TEST
We wish to choose between the GPR model $f(y_i; \mu_i, \phi)$ and the NBR model $g(y_i; \mu_i, \tau)$, which are two
non-nested regression models. Given the two regression models, we consider the hypothesis

$$H_0: \text{GPR and NBR are equivalent} \tag{2.1}$$

against

$$H_f: \text{GPR is better than NBR, or } H_g: \text{NBR is better than GPR.} \tag{2.2}$$

The likelihood ratio statistic for testing the GPR model against the NBR model is

$$L^* = \sum_{i=1}^{n} \log\left[\frac{f(y_i; \mu_i, \phi)}{g(y_i; \mu_i, \tau)}\right]. \tag{2.3}$$
Since the GPR model and the NBR model are non-nested, the statistic in (2.3) is not chi-square distributed. If the
two models were nested, $L^*$ would follow a chi-square distribution. Vuong (1989) used the Kullback-Leibler
Information Criterion to discriminate between two non-nested models. To test the null hypothesis in (2.1), Vuong
proposed the test statistic

$$T^* = \frac{L^*}{\hat{\omega}\sqrt{n}}, \tag{2.4}$$

where

$$\hat{\omega}^2 = \frac{1}{n}\sum_{i=1}^{n}\left[\log\frac{f(y_i; \hat{\mu}_i, \hat{\phi})}{g(y_i; \hat{\mu}_i, \hat{\tau})}\right]^2 - \left[\frac{1}{n}\sum_{i=1}^{n}\log\frac{f(y_i; \hat{\mu}_i, \hat{\phi})}{g(y_i; \hat{\mu}_i, \hat{\tau})}\right]^2$$

is an estimate of the variance of $L^*/\sqrt{n}$. For non-nested models, $T^*$ is approximately standard normal
distributed under the null hypothesis that the two models are equivalent. At significance level α, one compares
$T^*$ with $z_{\alpha/2}$. If $T^* < -z_{\alpha/2}$, the null hypothesis is rejected in favor of $H_g$: the NBR
model is better than the GPR model. If $T^* > z_{\alpha/2}$, the null hypothesis is rejected in favor of $H_f$: the
GPR model is better than the NBR model. However, if $|T^*| \le z_{\alpha/2}$, the null hypothesis is not rejected;
there is insufficient evidence to conclude that the two models differ.
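The computation in (2.3) and (2.4) can be sketched in a few lines of Python. The sketch assumes that the two models have already been fitted and that the per-observation log-likelihood values are available as two lists; the function names and the toy input values are hypothetical and serve only to show the arithmetic.

```python
import math
from statistics import NormalDist

def vuong_statistic(logf, logg):
    """T* in (2.4) from per-observation log-likelihoods of the two fitted models."""
    n = len(logf)
    m = [a - b for a, b in zip(logf, logg)]       # log(f_i / g_i)
    l_star = sum(m)                               # L* in (2.3)
    omega2 = sum(d * d for d in m) / n - (l_star / n) ** 2
    return l_star / math.sqrt(omega2 * n)

def vuong_decision(t_star, alpha=0.05):
    """Three-way decision rule described in section 2."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)   # z_{alpha/2}
    if t_star > z:
        return "GPR"
    if t_star < -z:
        return "NBR"
    return "equivalent"

# Toy log-likelihood values, made up purely to exercise the formulas.
logf = [1.0, 0.0, 1.0, 1.0]
logg = [0.0, 1.0, 0.0, 0.0]
t_star = vuong_statistic(logf, logg)
print(t_star, vuong_decision(t_star))
```

For these toy values L* = 2 and the estimated variance is 0.75, so T* = 2/√3, which lies inside ±1.96 and leads to the "equivalent" outcome.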
3. SIMULATION STUDY
In this section, we compare the GPR and the NBR models when the data is generated from the GPR model, and
also when the data is generated from the NBR model. The statistic $T^*$ in (2.4) is computed and the null
hypothesis in (2.1) is tested at significance levels α = .10 and α = .05.
The simulation study has been carried out for different sample sizes and different values of parameters ϕ or τ,
$\beta_1$, and $\beta_2$. The results are similar and so we report simulations for n = 25, 50, and 75 for some
parameter combinations (ϕ or τ, $\beta_1$, $\beta_2$). Each simulation experiment is based on 2000 replications.
Each regression model is generated according to $\mu_i = \exp(\beta_1 + \beta_2 x_{i2})$, where $x_{i2}$ is taken
as a random sample from the Uniform(0, 1) distribution. The values of $\beta_1$, $\beta_2$, and ϕ or τ used in the
experiments are shown in Tables 1 and 2. The covariate $x_{i2}$ is fixed throughout the simulation study. Responses
$y_i$ are generated from the GPR $f(y_i; \mu_i, \phi)$ and the NBR $g(y_i; \mu_i, \tau)$ for the $\mu_i$'s. For
the parameter set $(\beta_1, \beta_2)$ = (–4.0, 6.5), the $\mu_i$'s range from 0.02 to 10.03 and this gives a large
proportion of $y_i = 0$ in the simulated data. For the parameter set (0.5, 1.0), the values of $\mu_i$ range from
1.66 to 4.35 and for the parameter set (2.0, 1.0), the $\mu_i$'s range from 7.43 to 19.49.

For each generated data set, the test statistic $T^*$ in (2.4) is computed for testing the null hypothesis in (2.1).
From the 2000 repetitions, we record the proportion of times ($p_e$) both models are equivalent, the proportion of
times ($p_f$) the GPR model is better than the NBR model, and the proportion of times ($p_g$) the NBR model is
better than the GPR model.
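One replicate of the data-generating step can be sketched as follows. The paper does not say how the GPR responses were drawn, so the inversion sampler below is an assumption, as are the function names, the seed, and the cap `y_cap`; the sketch covers ϕ ≥ 0 only and uses the high-zero parameter set (–4.0, 6.5).

```python
import math
import random

def gp_pmf(y, mu, phi):
    """Generalized Poisson probability (1.1); used here for phi >= 0 only."""
    return math.exp(y * math.log(mu / (1.0 + phi * mu))
                    + (y - 1) * math.log(1.0 + phi * y)
                    - math.lgamma(y + 1)
                    - mu * (1.0 + phi * y) / (1.0 + phi * mu))

def sample_gp(mu, phi, rng, y_cap=10_000):
    """Draw one response by inverting the cumulative distribution function."""
    u = rng.random()
    y = 0
    cdf = gp_pmf(0, mu, phi)
    while cdf < u and y < y_cap:      # smallest y whose cumulative probability reaches u
        y += 1
        cdf += gp_pmf(y, mu, phi)
    return y

rng = random.Random(2005)             # arbitrary seed, for reproducibility only
n, beta1, beta2, phi = 50, -4.0, 6.5, 0.2
x2 = [rng.random() for _ in range(n)]             # covariate: drawn once, then held fixed
mu = [math.exp(beta1 + beta2 * x) for x in x2]    # mu_i = exp(beta1 + beta2 * x_i2)
y = [sample_gp(m, phi, rng) for m in mu]
zero_share = sum(1 for v in y if v == 0) / n
print(min(mu), max(mu), zero_share)
```

Because the covariate lies in (0, 1), the means for this parameter set are confined to the interval (exp(–4.0), exp(2.5)), which is roughly 0.018 to 12.2 and is consistent with the reported range of 0.02 to 10.03. A full study would repeat the response draw 2000 times with `x2` unchanged, fit both models, and tally the three outcomes of the test.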
When the data is generated from the GPR model, one would expect $p_f > p_g$ rather than $p_g > p_f$. Also,
when the data is generated from the NBR model, one would expect $p_g > p_f$ rather than $p_f > p_g$. However, the
results of the simulation study in Tables 1 and 2 tell a different story. The two models are judged equivalent more
often when the data is simulated from the NBR model than when it is simulated from the GPR model. For the parameter
values and the sample sizes reported in Tables 1 and 2, both models are equivalent at least 86% of the time when
the data is generated from the NBR model, and at least 80% of the time when the data is generated from the GPR
model. As n increases, the proportion of times that both models are equivalent tends to decrease.
When the data has a high proportion of zeros, the GPR model is better ($p_f > p_g$) irrespective of the data
generating process. When the $\mu_i$'s range from 1.66 to 4.35, both models seem to perform equivalently: from
Tables 1 and 2, both models are equivalent at least 91% of the time in this case. When the $\mu_i$'s range from
7.43 to 19.49, the performance of each model depends on the data generating process. From Table 1, the GPR model is
better ($p_f > p_g$) when the data is generated from the GPR model. In Table 2, the NBR model is better
($p_g > p_f$) when the data is generated from the NBR model. For small $\mu_i$'s [$(\beta_1, \beta_2)$ = (–4.0, 6.5)
or (0.5, 1.0)], either both models are equivalent or the GPR model is better than the NBR model. For large
$\mu_i$'s [$(\beta_1, \beta_2)$ = (2.0, 1.0)], the better model is the one from which the data is generated.
Table 1. Simulated data from GPR model

     Parameters        Size           α = 0.10                    α = 0.05
   β1     β2     ϕ       n       pf      pe      pg         pf      pe      pg
  –4.0    6.5    0.2     25    .0260   .9725   .0015      .0120   .9875   .0005
  –4.0    6.5    0.2     50    .0245   .9730   .0025      .0100   .9900   .0000
  –4.0    6.5    0.2     75    .0420   .9475   .0105      .0185   .9800   .0015
  –4.0    6.5    0.3     25    .0400   .9585   .0015      .0165   .9835   .0000
  –4.0    6.5    0.3     50    .0415   .9490   .0095      .0210   .9775   .0015
  –4.0    6.5    0.3     75    .0555   .9310   .0135      .0255   .9685   .0060
   0.5    1.0    0.2     25    .0110   .9805   .0085      .0040   .9935   .0025
   0.5    1.0    0.2     50    .0160   .9700   .0140      .0050   .9900   .0050
   0.5    1.0    0.2     75    .0270   .9435   .0295      .0070   .9880   .0050
   0.5    1.0    0.3     25    .0300   .9545   .0155      .0120   .9830   .0050
   0.5    1.0    0.3     50    .0330   .9370   .0300      .0135   .9750   .0115
   0.5    1.0    0.3     75    .0395   .9170   .0435      .0170   .9670   .0160
   2.0    1.0    0.2     25    .0940   .8755   .0305      .0550   .9335   .0115
   2.0    1.0    0.2     50    .1140   .8625   .0235      .0665   .9280   .0055
   2.0    1.0    0.2     75    .1315   .8490   .0195      .0715   .9195   .0090
   2.0    1.0    0.3     25    .1020   .8680   .0300      .0665   .9220   .0115
   2.0    1.0    0.3     50    .1360   .8480   .0160      .0815   .9150   .0035
   2.0    1.0    0.3     75    .1785   .8070   .0145      .1005   .8940   .0055
Table 2. Simulated data from NBR model

     Parameters        Size           α = 0.10                    α = 0.05
   β1     β2     τ       n       pf      pe      pg         pf      pe      pg
  –4.0    6.5    0.2     25    .0070   .9925   .0005      .0025   .9975   .0000
  –4.0    6.5    0.2     50    .0040   .9945   .0015      .0010   .9990   .0000
  –4.0    6.5    0.2     75    .0090   .9895   .0015      .0030   .9965   .0005
  –4.0    6.5    0.3     25    .0125   .9870   .0005      .0035   .9965   .0000
  –4.0    6.5    0.3     50    .0065   .9920   .0015      .0025   .9975   .0000
  –4.0    6.5    0.3     75    .0165   .9785   .0050      .0060   .9930   .0010
   0.5    1.0    0.2     25    .0025   .9945   .0030      .0000  1.0000   .0000
   0.5    1.0    0.2     50    .0015   .9975   .0010      .0005   .9995   .0000
   0.5    1.0    0.2     75    .0025   .9935   .0040      .0000  1.0000   .0000
   0.5    1.0    0.3     25    .0030   .9920   .0050      .0005   .9980   .0015
   0.5    1.0    0.3     50    .0055   .9855   .0090      .0005   .9985   .0010
   0.5    1.0    0.3     75    .0080   .9800   .0120      .0020   .9960   .0020
   2.0    1.0    0.2     25    .0220   .9510   .0270      .0080   .9830   .0090
   2.0    1.0    0.2     50    .0275   .9260   .0465      .0100   .9775   .0125
   2.0    1.0    0.2     75    .0205   .8925   .0870      .0090   .9575   .0335
   2.0    1.0    0.3     25    .0285   .9245   .0470      .0130   .9705   .0165
   2.0    1.0    0.3     50    .0295   .8995   .0710      .0090   .9630   .0280
   2.0    1.0    0.3     75    .0190   .8695   .1115      .0090   .9365   .0545
4. COMPARING GPR MODEL WITH POISSON REGRESSION MODEL
To assess the adequacy of the GPR model over the Poisson regression model, one may test the hypothesis

$$H_0: \phi = 0 \quad \text{against} \quad H_a: \phi \ne 0. \tag{4.1}$$

To carry out the test in (4.1), Famoye (1993) suggested a likelihood ratio test statistic

$$L_1 = -2\sum_{i=1}^{n} \log\left[\frac{f_*(y_i; \mu_i, \phi = 0)}{f(y_i; \hat{\mu}_i, \hat{\phi})}\right], \tag{4.2}$$

where $f_*(y_i; \mu_i, \phi = 0)$ is the estimated probability function for the Poisson regression model. This is a
case of nested models. The test statistic in (4.2) is approximately chi-square distributed with one degree of
freedom. An alternative test is to fit the GPR model and use the asymptotic Wald t-statistic to carry out the test
in (4.1). Thus, one computes the test statistic

$$W = \frac{\hat{\phi}}{se(\hat{\phi})}, \tag{4.3}$$

where $\hat{\phi}$ is the maximum likelihood estimate of ϕ and $se(\hat{\phi})$ is the corresponding standard error.
The statistic in (4.3) is compared with the t-distribution with n – k – 1 degrees of freedom.
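The likelihood ratio test in (4.2) needs only the two maximized log-likelihoods. The Python sketch below assumes these have already been obtained from separate Poisson and GPR fits; the numerical values are hypothetical and the function names are illustrative. The chi-square critical value with one degree of freedom is computed as the square of a standard normal quantile, which is exact because a chi-square variable with one degree of freedom is the square of a standard normal variable.

```python
from statistics import NormalDist

def lr_statistic(loglik_poisson, loglik_gpr):
    """L1 in (4.2): minus twice the log of the likelihood ratio."""
    return -2.0 * (loglik_poisson - loglik_gpr)

def chi2_1_critical(alpha):
    """Upper-alpha critical value of the chi-square distribution with 1 df."""
    return NormalDist().inv_cdf(1.0 - alpha / 2.0) ** 2

# Hypothetical maximized log-likelihoods (not taken from the paper).
loglik_poisson = -61.8
loglik_gpr = -58.9

l1 = lr_statistic(loglik_poisson, loglik_gpr)
critical = chi2_1_critical(0.05)
reject = l1 > critical
print(l1, critical, reject)
```

The Wald statistic in (4.3) additionally requires the standard error of the estimated dispersion parameter and a t quantile, so it is not sketched here.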
The log-likelihood function for the GPR model in (1.1) is given by

$$\ell = \sum_{i=1}^{n}\left\{ y_i \log\left(\frac{\mu_i}{1+\phi\mu_i}\right) + (y_i - 1)\log(1+\phi y_i) - \log(y_i!) - \frac{\mu_i(1+\phi y_i)}{1+\phi\mu_i} \right\}. \tag{4.4}$$

On differentiating (4.4) with respect to the parameter ϕ, we get

$$\frac{\partial \ell}{\partial \phi} = \sum_{i=1}^{n}\left\{ \frac{-\mu_i y_i}{1+\phi\mu_i} + \frac{y_i(y_i-1)}{1+\phi y_i} - \frac{\mu_i(y_i-\mu_i)}{(1+\phi\mu_i)^2} \right\}. \tag{4.5}$$

Under $H_0$, the result in (4.5) becomes

$$\frac{\partial \ell}{\partial \phi} = \sum_{i=1}^{n}\left\{(y_i-\mu_i)^2 - y_i\right\} = \sum_{i=1}^{n}\left\{(y_i-\mu_i)^2 - \mu_i - (y_i-\mu_i)\right\}. \tag{4.6}$$

On evaluating (4.6) under the Poisson regression model maximum likelihood estimate for β, we get

$$\left.\frac{\partial \ell}{\partial \phi}\right|_{H_0} = \sum_{i=1}^{n}\left\{(y_i-\hat{\mu}_i)^2 - \hat{\mu}_i\right\}. \tag{4.7}$$
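The reduction from (4.5) to (4.6) is a short expansion. As a reading aid, setting ϕ = 0 in each term of the i-th summand of (4.5) gives

```latex
\begin{align*}
-\mu_i y_i + y_i(y_i - 1) - \mu_i(y_i - \mu_i)
  &= y_i^2 - 2\mu_i y_i + \mu_i^2 - y_i \\
  &= (y_i - \mu_i)^2 - y_i \\
  &= (y_i - \mu_i)^2 - \mu_i - (y_i - \mu_i).
\end{align*}
```

The step to (4.7) then uses the Poisson likelihood equation for the intercept: because the model contains the constant covariate $x_{i1} = 1$, the fitted means satisfy $\sum_i (y_i - \hat{\mu}_i) = 0$, so the last term vanishes when summed at the Poisson fit.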
Under $H_0$, the mean and variance of $[(y_i - \mu_i)^2 - y_i]$, the summand in (4.6), are, respectively, 0 and
$2\mu_i^2$. Hence, the test statistic

$$S = \sum_{i=1}^{n}\left\{(y_i-\hat{\mu}_i)^2 - \hat{\mu}_i\right\} \left[2\sum_{i=1}^{n}\hat{\mu}_i^2\right]^{-1/2} \tag{4.8}$$

is asymptotically standard normal distributed under $H_0$. The result in (4.8) is the same test statistic obtained
by Lee (1986) for the NBR model. The statistic in (4.8) is a score statistic and the null hypothesis in (4.1) is
rejected at significance level α if $|S| \ge z_{\alpha/2}$. One advantage of the test statistic in (4.8) is that one
does not need to fit the GPR model. The only requirement is the fit from the Poisson regression model. For both the
likelihood ratio statistic and the asymptotic Wald t-statistic, one needs to fit the GPR model.
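Since the score statistic needs only the Poisson fit, it can be computed with very little code. The Python sketch below fits the two-parameter Poisson model µ = exp(b1 + b2·x) by Newton-Raphson and then evaluates S as in (4.8). The fitting routine, the function names and the small data set are assumptions chosen for illustration (the data are made up and deliberately over-dispersed); the paper's own computations were done in S-PLUS.

```python
import math

def fit_poisson(x, y, max_iter=100, tol=1e-12):
    """Newton-Raphson MLE for the Poisson model mu_i = exp(b1 + b2 * x_i).

    Assumes sum(y) > 0 and that x takes at least two distinct values.
    """
    b1, b2 = math.log(sum(y) / len(y)), 0.0       # start at the intercept-only fit
    for _ in range(max_iter):
        mu = [math.exp(b1 + b2 * xi) for xi in x]
        # score vector for (b1, b2)
        g1 = sum(yi - mi for yi, mi in zip(y, mu))
        g2 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
        # information matrix [[a, b], [b, c]]
        a = sum(mu)
        b = sum(xi * mi for xi, mi in zip(x, mu))
        c = sum(xi * xi * mi for xi, mi in zip(x, mu))
        det = a * c - b * b
        d1 = (c * g1 - b * g2) / det              # Newton step, solved in closed form
        d2 = (a * g2 - b * g1) / det
        b1, b2 = b1 + d1, b2 + d2
        if abs(d1) + abs(d2) < tol:
            break
    return b1, b2

def score_statistic(x, y):
    """S in (4.8), computed from the Poisson fit alone."""
    b1, b2 = fit_poisson(x, y)
    mu = [math.exp(b1 + b2 * xi) for xi in x]
    numerator = sum((yi - mi) ** 2 - mi for yi, mi in zip(y, mu))
    return numerator / math.sqrt(2.0 * sum(mi * mi for mi in mu))

# Made-up, visibly over-dispersed counts (many zeros and a few large values).
x = [0.1 * j for j in range(1, 11)]
y = [0, 0, 0, 10, 0, 0, 12, 0, 0, 15]
s = score_statistic(x, y)
print(s)  # a large positive value points towards phi > 0
```

A large positive S suggests over-dispersion and a large negative S suggests under-dispersion, so the two-sided comparison with the standard normal critical value covers both alternatives in (4.1).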
In order to compare the likelihood ratio statistic in (4.2), the asymptotic Wald t-statistic in (4.3), and the score
statistic in (4.8), we carry out a Monte Carlo simulation experiment based on 2000 repetitions. Samples are
generated from the GPR model with ϕ = –0.2, –0.1, 0.0, 0.1, 0.2, 0.3. The simulation experiment has been carried out
for n = 25, 50, and 75 with parameter combination $\beta_1$ = 0.5 and $\beta_2$ = 1.0. For ϕ = 0.0, we also include
the results for n = 100, 150, and 200. Each regression model is generated according to the $\mu_i$ used in section 3.
The covariate $x_{i2}$ is fixed throughout the experiment. The three test statistics are computed. Each entry in
Table 3 represents the proportion of 2000 Monte Carlo samples declared significant by each test using the
significance levels α = 0.10 and α = 0.05.
Table 3. Powers of score, likelihood ratio and Wald t-statistics

  Size   Parameter           α = 0.10                     α = 0.05
    n        ϕ          S       L1       W           S       L1       W
   25      –0.2      1.0000  1.0000  1.0000        .9995  1.0000  1.0000
   25      –0.1       .6820   .8695   .9325        .4025   .7645   .8910
   25       0.0       .0755   .1395   .1910        .0320   .0775   .1375
   25       0.1       .4620   .3980   .1900        .37855  .2980   .0720
   25       0.2       .7960   .7615   .5580        .7480   .6950   .3380
   25       0.3       .9235   .9085   .7880        .8985   .8710   .6085
   50      –0.2      1.0000  1.0000  1.0000       1.0000  1.0000  1.0000
   50      –0.1       .9465   .9840   .9940        .8570   .9555   .9840
   50       0.0       .0910   .1235   .1495        .0385   .0640   .1000
   50       0.1       .6970   .6510   .5080        .6210   .5470   .3180
   50       0.2       .9695   .9600   .9265        .9535   .9375   .8385
   50       0.3       .9960   .9945   .9860        .9915   .9910   .9725
   75      –0.2      1.0000  1.0000  1.0000       1.0000  1.0000  1.0000
   75      –0.1       .9935   .9975   .9985        .9700   .9950   .9975
   75       0.0       .1050   .1235   .1305        .0465   .0640   .0870
   75       0.1       .8380   .8125   .7215        .7735   .7730   .5670
   75       0.2       .9935   .9910   .9860        .9890   .9870   .9715
   75       0.3       .9995   .9995   .9995        .9995   .9995   .9985
  100       0.0       .1010   .1105   .1270        .0425   .0610   .0755
  150       0.0       .1020   .1175   .1320        .0495   .0620   .0700
  200       0.0       .1060   .1110   .1155        .0480   .0560   .0700
There is good agreement between the actual and nominal significance levels for the score statistic in Table 3
when ϕ = 0.0 and n is large. Note that ϕ = 0.0 corresponds to the case of the Poisson regression model. When ϕ > 0,
the score statistic is the most powerful of the three test statistics considered for testing the null hypothesis
$H_0$ in (4.1). The Wald t-statistic is the least powerful for ϕ > 0. When ϕ < 0, the score statistic appears to be
the least powerful, but still has very good power. The Wald t-statistic appears to be the most powerful when ϕ < 0.
For all values of ϕ, the score statistic is recommended for testing the null hypothesis $H_0$ in (4.1).
5. CONCLUSION
In general, the GPR model can be used in place of the NBR model, as the two models are judged equivalent in a high
percentage of cases. The GPR model has an advantage over the NBR model when the data has a high proportion of zeros.
Since the GPR model can also handle under-dispersion, it appears to be a model that one should always apply.
In terms of estimation, neither model is easier to estimate than the other. In fact, in our simulation study, there
are a few cases in which the NBR model failed to converge on data generated from the GPR model. These cases were
excluded from the 2000 repetitions in the simulation experiment. We did not observe a similar result for the GPR
model when the data is generated from the NBR model.
ACKNOWLEDGEMENT
The support received from the Central Michigan University FRCE Committee under grant #48136 (Sabbatical
Leave Request) is gratefully acknowledged. This work was done while Felix Famoye, Central Michigan University,
was on his sabbatical leave at the Department of Biostatistics, University of North Texas Health Science Center at
Fort Worth.
REFERENCES
Cameron, A.C. and Trivedi, P.K. (1998). Regression analysis of count data. Cambridge University Press, New
York, New York.
Consul, P.C. (1989). Generalized Poisson distributions: Properties and Applications. Marcel Dekker Inc., New
York.
Consul, P.C. and Famoye, F. (1992). “Generalized Poisson regression model.” Communications in Statistics - Theory
and Methods 21(1), 89-109.
Famoye, F. (1993). “Restricted generalized Poisson regression model.” Communications in Statistics - Theory and
Methods 22(5), 1335-1354.
Frome, E.L., Kutner, M.H., and Beauchamp, J.J. (1973). “Regression analysis of Poisson-distributed data.” Journal
of the American Statistical Association 68, 935-940.
Lambert, D. (1992). “Zero-inflated Poisson regression, with an application to defects in manufacturing.”
Technometrics 34, 1-14.
Lawless, J.F. (1987). “Negative binomial and mixed Poisson regression.” The Canadian Journal of Statistics 15(3),
209-225.
Lee, L. (1986). “Specification test for Poisson regression models.” International Economic Review 27(3), 689-706.
Vuong, Q.H. (1989). “Likelihood ratio tests for model selection and non-nested hypotheses.” Econometrica 57(2),
307-333.