International Journal of Mechanical Engineering and Technology (IJMET)
Volume 10, Issue 01, January 2019, pp. 1964-1972, Article ID: IJMET_10_01_192
Available online at http://www.iaeme.com/ijmet/issues.asp?JType=IJMET&VType=10&IType=01
ISSN Print: 0976-6340 and ISSN Online: 0976-6359
© IAEME Publication
Scopus Indexed
ADAPTIVE REGRESSION MODEL FOR HIGHLY SKEWED COUNT DATA
Remi J. Dare
Department of Mathematical Sciences, Kings University, Ode-Omu, Osun State, Nigeria
Olumide S. Adesina
Department of Mathematical Sciences, Olabisi Onabanjo University, Ago-Iwoye, Ogun State,
Nigeria
Pelumi E. Oguntunde and Olasunmbo O. Agboola
Department of Mathematics, Covenant University, Ota, Ogun State, Nigeria
ABSTRACT
A major task often faced by practitioners is deciding which model to adopt
in fitting count datasets. This paper investigates a suitable model for fitting
highly skewed count datasets. Among other models, the COM-Poisson regression model is
proposed for fitting count data because of its varying normalizing constant. Several
statistical models were investigated along with the proposed model; these include the
Poisson, Negative Binomial, Zero-Inflated Poisson, Zero-Inflated Negative Binomial and
Quasi-Poisson models. A real-life dataset on visits to a doctor within a given period was
used to test the behavior of the models. From the findings, it is recommended
that the COM-Poisson regression model be adopted in fitting highly skewed count
datasets irrespective of the type of dispersion.
Key words: Count Data, dispersion, COM-Poisson, Zero-Inflated Models.
Cite this Article: Remi J. Dare, Olumide S. Adesina, Pelumi E. Oguntunde and Olasunmbo
O. Agboola, Adaptive Regression Model for Highly Skewed Count Data, International
Journal of Mechanical Engineering and Technology, 10(01), 2019, pp.1964–1972
http://www.iaeme.com/IJMET/issues.asp?JType=IJMET&VType=10&Type=01
1. INTRODUCTION
With recent technological advances in the collection and storage of statistical
data, count datasets are now readily available across disciplines. Count data take only
non-negative integer values, and these integers arise from counting rather than ranking.
Adesina et al. (2017, 2018) modeled count datasets using various estimation methods,
including Dirichlet Mixture Models, MCMCglmm, Bayesian Discrete Weibull and a few
frequentist techniques. The authors made a comprehensive comparison between Bayesian
and frequentist estimation techniques to establish the most
preferred in fitting count datasets irrespective of the form of dispersion. Famoye (1993) proposed
a restricted Generalized Poisson regression model for handling dispersed count datasets. The
model is considered an extension of the family of Generalized Poisson Distribution (GPD),
since the latter was found inadequate to fit count datasets effectively. Gupta et al. (2013)
established cases of biased and misleading results associated with highly skewed distributions
with excess zeros. They gave the example of clinical data on children with electrophysiological
disorders, many of whom were treated without surgery.
A Poisson experiment concerns the number of occurrences of an event in a given time
interval or at a specific location. The Poisson distribution is one of the simplest and perhaps most
frequently used probability distributions for modeling the number of times an event occurs.
Authors who applied Poisson regression to model count datasets include Yip (1988),
Romundstad et al. (2001), Winkelmann (2004), Heller et al. (2007) and Gagnon et al. (2008),
to mention but a few. The Poisson distribution assumes that the mean equals the variance,
which makes the Poisson regression model inadequate for datasets that exhibit any form of
dispersion other than equi-dispersion. Poisson regression is also known to be restrictive and
intrinsically heteroskedastic.
The Negative Binomial regression model is preferred over the Poisson regression model for
over-dispersed data because it has an extra parameter that accommodates over-dispersion. The
derivation of Negative Binomial regression follows a Bayesian principle (Cameron & Trivedi, 2005).
Readers can refer to Hilbe (2007) for a detailed discussion of the negative binomial regression model.
COM-Poisson regression operates with a link function for any dependent variable. Among
those who took advantage of the COM-Poisson model are Ridout & Besbeas (2004) and Kalyanam
et al. (2007). Cameron et al. (1988), Pohlmeier & Ulrich (1995), Grootendorst (1995) and Geil
et al. (1996) considered doctor visits, special visits, drug prescriptions, number of hospital stays
and hospitalizations, respectively. Sellers & Shmueli (2010) and Shmueli et al. (2005) gave
robust applications of the COM-Poisson regression model.
Cameron & Trivedi (2005) pointed out that the conventional parametric count distributions, the
Poisson and negative binomial models, often fail to describe empirical distributions adequately
at some levels of dispersion. A major challenge is that they cannot model a variance-to-mean
ratio below 1, which is common in count datasets. COM-Poisson regression is therefore
identified as suitable for count datasets irrespective of the form of dispersion the data
exhibit. To fit a COM-Poisson regression to a dataset, the normalizing constant must be
known, which requires its estimation. This study aims to discuss the COM-Poisson
regression model extensively and to carry out model comparison among various models. The
materials and methods used are discussed in Section 2, while results are presented
in Section 3.
2. MATERIALS AND METHOD
Some basic regression models for fitting count datasets are itemized and discussed as follows:
2.1. Poisson Regression
For a Poisson model, the mean $\mu_i$ is expressed in terms of the explanatory variables $x_i$ using a suitable
link function. The model can be described as:
$$g(\mu_i) = x_i'\beta \tag{1}$$
The link $g(\mu_i)$ may be the identity link, $\mu_i = x_i'\beta$, or the log link, $\log(\mu_i) = x_i'\beta$. Under the log link
the fitted mean $\hat{\mu}_i = \exp(x_i'\hat{\beta})$ is guaranteed to be positive, which is not the case for the identity link. The
mean and variance of a Poisson regression are:
$$E(Y_i) = \exp(x_i'\beta) \tag{2}$$
and
$$V(Y_i) = \exp(x_i'\beta) \tag{3}$$
respectively. Its log-likelihood function is given by:
$$\ln L(\beta) = \sum_{i=1}^{n} \left( y_i x_i'\beta - \exp(x_i'\beta) - \ln y_i! \right) \tag{4}$$
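As an illustration of Equation (4), the following Python sketch evaluates the Poisson log-likelihood under the log link. The function name `poisson_loglik` and the toy data are hypothetical and are not taken from the paper; the term $\ln y_i!$ is computed with `math.lgamma`.

```python
import math

def poisson_loglik(beta, X, y):
    """Poisson log-likelihood under the log link: sum of y*eta - exp(eta) - ln(y!)."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        eta = sum(b * x for b, x in zip(beta, x_i))  # linear predictor x_i' beta
        total += y_i * eta - math.exp(eta) - math.lgamma(y_i + 1)
    return total

# Hypothetical toy data: an intercept column plus one binary covariate.
X = [(1.0, 0.0), (1.0, 0.0), (1.0, 1.0), (1.0, 1.0)]
y = [1, 3, 4, 8]

# With a single binary covariate the maximum likelihood fit reproduces
# the two group means (2 and 6), so beta_hat can be written down directly.
beta_hat = (math.log(2.0), math.log(6.0) - math.log(2.0))
ll_hat = poisson_loglik(beta_hat, X, y)
ll_other = poisson_loglik((0.5, 0.5), X, y)
print(ll_hat, ll_other)
```

Because the Poisson log-likelihood is concave in the coefficients, `ll_hat` should exceed the value at any other coefficient vector, such as `ll_other`.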
2.2. Negative Binomial Regression
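For parameters $\mu$ and $\delta$, Negative Binomial regression can be expressed through a Bayesian procedure
as:
$$h(y \mid \mu, \delta) = \int_0^\infty \frac{e^{-\mu v}(\mu v)^{y}}{y!} \cdot \frac{v^{\delta-1} e^{-v\delta}\,\delta^{\delta}}{\Gamma(\delta)}\, dv \tag{5}$$
Further simplification gives:
$$h(y \mid \mu, \delta) = \int_0^\infty \frac{e^{-(\mu+\delta)v}\,\mu^{y}\, v^{y+\delta-1}\,\delta^{\delta}}{y!\,\Gamma(\delta)}\, dv \tag{6}$$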
The model's mean and variance can be expressed as:
$$E(Y_i) = \exp(x_i'\beta) \tag{7}$$
and
$$V(Y_i) = \exp(x_i'\beta) + \frac{\exp(x_i'\beta)^2}{\delta} \tag{8}$$
respectively. The likelihood function of the Negative Binomial model is given as:
$$l_i = \frac{\Gamma(y_i+m)}{\Gamma(m)\,\Gamma(y_i+1)} \left(\frac{m}{m+\mu_i}\right)^{m} \left(\frac{\mu_i}{m+\mu_i}\right)^{y_i} = \frac{(y_i+m-1)\cdots(m)\,\Gamma(m)}{\Gamma(m)\,\Gamma(y_i+1)} \left(\frac{m}{m+\mu_i}\right)^{m} \left(\frac{\mu_i}{m+\mu_i}\right)^{y_i}$$
so that, writing $m = e^{m^*}$,
$$l_i = \frac{(y_i+m-1)\cdots(m)}{y_i!} \left(\frac{m}{m+\mu_i}\right)^{m} \left(\frac{\mu_i}{m+\mu_i}\right)^{y_i} = \frac{(y_i+e^{m^*}-1)\cdots e^{m^*}}{y_i!} \left(\frac{e^{m^*}}{e^{m^*}+\mu_i}\right)^{e^{m^*}} \left(\frac{\mu_i}{e^{m^*}+\mu_i}\right)^{y_i} \tag{9}$$
where $m$ is the extra parameter responsible for taking care of over-dispersion in count data.
The log-likelihood function, represented by $L$, is:
$$L = \sum_{i=1}^{n}\left[\sum_{j=0}^{y_i-1} \ln(e^{m^*}+j) - \ln(y_i!) + e^{m^*}\ln(e^{m^*}) + y_i\ln(\mu_i) - (e^{m^*}+y_i)\ln(\mu_i+e^{m^*})\right] \tag{10}$$
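The log-likelihood above can be checked numerically. The sketch below is a minimal Python illustration, assuming the parameterization $m = e^{m^*}$ and the standard negative binomial normalizing term $\ln(y_i!)$; the function names and parameter values are hypothetical.

```python
import math

def nb_loglik_obs(y, mu, m_star):
    """Log-likelihood contribution of one count y with mean mu and m = exp(m_star)."""
    m = math.exp(m_star)
    rising = sum(math.log(m + j) for j in range(y))  # ln[Gamma(y + m) / Gamma(m)]
    return (rising - math.lgamma(y + 1) + m * math.log(m)
            + y * math.log(mu) - (m + y) * math.log(mu + m))

def nb_logpmf(y, mu, m):
    """The same quantity written directly with gamma functions, as a cross-check."""
    return (math.lgamma(y + m) - math.lgamma(m) - math.lgamma(y + 1)
            + m * math.log(m / (m + mu)) + y * math.log(mu / (m + mu)))

mu, m_star = 2.5, math.log(0.8)   # hypothetical mean and log-dispersion
# Summing the probabilities over a long range of counts should give about 1.
total_prob = sum(math.exp(nb_loglik_obs(y, mu, m_star)) for y in range(500))
print(total_prob)
```

Working with $m^*$ rather than $m$ keeps the dispersion parameter positive without constraining the optimizer, which is presumably why the log-likelihood is written in terms of $e^{m^*}$.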
2.3. Zero Inflated Poisson (ZIP) Regression Models
The count variable of interest can be represented by $Y$, which is described by:
$$P(Y_{ij} = y_{ij}) = \begin{cases} \omega_{ij} + (1-\omega_{ij})\exp(-\lambda_{ij}), & y_{ij} = 0 \\ (1-\omega_{ij})\exp(-\lambda_{ij})\,\lambda_{ij}^{y_{ij}} / y_{ij}!, & y_{ij} > 0 \end{cases} \tag{11}$$
The ZIP mean and variance are:
$$E(Y_{ij}) = (1-\omega_{ij})\lambda_{ij} \tag{12}$$
and
$$\mathrm{var}(Y_{ij}) = (1-\omega_{ij})\lambda_{ij}(1+\omega_{ij}\lambda_{ij}) \tag{13}$$
respectively. Yau et al. (2003) noted that, in regression analysis, both the mean
$\lambda_{ij}$ and the zero proportion $\omega_{ij}$ are linked to covariate vectors $x_{ij}$ and $z_{ij}$, respectively.
The ZIP mixed regression model can be expressed as:
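$$\eta_{ij} = \log(\lambda_{ij}) = x_{ij}'\beta + u_i \tag{14}$$
$$\xi_{ij} = \log\left(\frac{\omega_{ij}}{1-\omega_{ij}}\right) = z_{ij}'\gamma + v_i \tag{15}$$
From Equation (15), we have:
$$\omega_{ij} = \frac{\exp(z_{ij}'\gamma + v_i)}{1+\exp(z_{ij}'\gamma + v_i)} \tag{16}$$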
Therefore, the likelihood function for the ZIP model is:
$$L(y,\lambda) = \prod_{i=1}^{n}\left[\frac{\exp(z_{ij}'\gamma+v_i)}{1+\exp(z_{ij}'\gamma+v_i)} + \left(1-\frac{\exp(z_{ij}'\gamma+v_i)}{1+\exp(z_{ij}'\gamma+v_i)}\right)\exp(-\lambda_{ij})\right] \tag{17}$$
for $y = 0$, or
$$L(y,\lambda) = \prod_{i=1}^{n}\left[\left(1-\frac{\exp(z_{ij}'\gamma+v_i)}{1+\exp(z_{ij}'\gamma+v_i)}\right)\frac{\exp(-\lambda_{ij})\,\lambda_{ij}^{y_{ij}}}{y_{ij}!}\right] \tag{18}$$
for $y > 0$.
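Equations (11) to (18) can be sketched in a few lines of Python. The example below is only an illustration: it sets the random effects $u_i$ and $v_i$ to zero, uses hypothetical function names and parameter values, and checks the closed-form mean and variance in Equations (12) and (13) against a direct summation.

```python
import math

def zip_logpmf(y, lam, omega):
    """Log-probability of count y under the zero-inflated Poisson model."""
    log_pois = -lam + y * math.log(lam) - math.lgamma(y + 1)
    if y == 0:
        return math.log(omega + (1.0 - omega) * math.exp(log_pois))
    return math.log(1.0 - omega) + log_pois

def zip_loglik(beta, gamma, X, Z, y):
    """ZIP log-likelihood: log link for lambda, logit link for omega, no random effects."""
    total = 0.0
    for x_i, z_i, y_i in zip(X, Z, y):
        lam = math.exp(sum(b * x for b, x in zip(beta, x_i)))
        xi = sum(g * z for g, z in zip(gamma, z_i))
        omega = math.exp(xi) / (1.0 + math.exp(xi))  # inverse logit
        total += zip_logpmf(y_i, lam, omega)
    return total

# Moment check with hypothetical parameters; the tail beyond 200 is negligible here.
lam, omega = 3.0, 0.25
probs = [math.exp(zip_logpmf(k, lam, omega)) for k in range(200)]
mean = sum(k * p for k, p in enumerate(probs))
var = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
print(mean, (1 - omega) * lam)                      # numerical vs. closed-form mean
print(var, (1 - omega) * lam * (1 + omega * lam))   # numerical vs. closed-form variance

# Intercept-only toy data: exp(beta) = 3 and inverse logit of gamma = 0.25.
X = [(1.0,)] * 3
Z = [(1.0,)] * 3
y = [0, 2, 5]
toy_ll = zip_loglik((math.log(3.0),), (math.log(1.0 / 3.0),), X, Z, y)
```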
Following matrix notation for ZINB,
$$X = [x_{11},\ldots,x_{1n_1},\ldots,x_{m1},\ldots,x_{mn_m}]',$$
$$W = \mathrm{diag}[1_{n_1}, 1_{n_2}, \ldots, 1_{n_m}] = [w_{11},\ldots,w_{1n_1},\ldots,w_{m1},\ldots,w_{mn_m}]',$$
$$Z = [z_{11},\ldots,z_{1n_1},\ldots,z_{m1},\ldots,z_{mn_m}]',$$
the mixed regression model can be expressed as:
$$\log\left(\frac{\omega}{1-\omega}\right) = \xi = Z\gamma + Wv, \qquad \log(\lambda) = \eta = X\beta + Wu \tag{19}$$
2.4. COM-Poisson distribution
Conway & Maxwell (1962) first proposed the COM-Poisson distribution as a way of handling
queuing systems. The probability mass function can be described as:
$$P(Y = y \mid \lambda, \nu) = \frac{\lambda^{y}}{(y!)^{\nu}} \cdot \frac{1}{Z(\lambda,\nu)}; \qquad y = 0, 1, 2, \ldots \tag{20}$$
where
$$Z(\lambda,\nu) = \sum_{j=0}^{\infty} \frac{\lambda^{j}}{(j!)^{\nu}}$$
for $\lambda > 0$ and $\nu \geq 0$. $Z(\lambda,\nu)$ is the normalizing constant, which does not have
a closed analytical form. According to Shmueli et al. (2005), the approximations for the COM-Poisson
distribution are accurate when $\nu \leq 1$ or $\lambda > 10^{\nu}$. The moment generating function
(mgf) of the COM-Poisson distribution is given as:
$$M_Y(t) = E(e^{Yt}) = \frac{Z(\lambda e^{t},\nu)}{Z(\lambda,\nu)} \tag{21}$$
while the probability generating function (pgf) is expressed as:
$$E(t^{Y}) = \frac{Z(\lambda t,\nu)}{Z(\lambda,\nu)} \tag{22}$$
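Since $Z(\lambda,\nu)$ has no closed form, it is usually evaluated by truncating the series. The Python sketch below is one simple way to do this, assuming $\nu > 0$; the stopping rule, function names and parameter values are illustrative assumptions rather than the procedure used in this paper. It also checks the mgf identity in Equation (21) numerically.

```python
import math

def cmp_log_z(lam, nu, tol=1e-12, max_terms=10_000):
    """log Z(lam, nu): sum lam^j / (j!)^nu until terms are negligible (assumes nu > 0)."""
    peak = -math.inf
    log_terms = []
    for j in range(max_terms):
        log_term = j * math.log(lam) - nu * math.lgamma(j + 1)
        log_terms.append(log_term)
        peak = max(peak, log_term)
        # Stop once past the largest term and the current term is tiny relative to it.
        if j > lam ** (1.0 / nu) and log_term < peak + math.log(tol):
            break
    # Log-sum-exp keeps the summation numerically stable.
    return peak + math.log(sum(math.exp(v - peak) for v in log_terms))

def cmp_pmf(y, lam, nu):
    """COM-Poisson probability of count y."""
    return math.exp(y * math.log(lam) - nu * math.lgamma(y + 1) - cmp_log_z(lam, nu))

lam, nu, t = 2.0, 0.7, 0.3   # hypothetical parameter values
mgf_direct = sum(math.exp(t * y) * cmp_pmf(y, lam, nu) for y in range(150))
mgf_ratio = math.exp(cmp_log_z(lam * math.exp(t), nu) - cmp_log_z(lam, nu))
print(mgf_direct, mgf_ratio)   # the two values should agree closely
```

With $\nu = 1$ the distribution reduces to the Poisson, so `cmp_log_z(lam, 1.0)` should return approximately `lam`, which gives a quick sanity check on the truncation.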
The COM-Poisson distribution is a member of the exponential family in both parameters, with
sufficient statistics $S_1 = \sum_{i=1}^{n} Y_i$ and $S_2 = \sum_{i=1}^{n} \log(Y_i!)$, where $Y_1, \ldots, Y_n$ represents a random
sample of $n$ COM-Poisson random variables.
Parameter Estimation of COM-Poisson Distribution via Maximum Likelihood Approach
This method of estimation takes the maximum likelihood approach, which is made
possible by the exponential family structure of the distribution. The likelihood function can be written as:
$$L(y_1,\ldots,y_n \mid \lambda,\nu) = \frac{\prod_{i=1}^{n}\lambda^{y_i}}{\left(\prod_{i=1}^{n} y_i!\right)^{\nu}} \cdot Z(\lambda,\nu)^{-n} \tag{23}$$
Taking logarithms shows that the log-likelihood equals $S_1\log\lambda - \nu S_2 - n\log Z(\lambda,\nu)$, where
$S_1 = \sum_{i=1}^{n} Y_i$ and $S_2 = \sum_{i=1}^{n} \log(Y_i!)$. Therefore,
$$L(y_1,\ldots,y_n \mid \lambda,\nu) = \lambda^{S_1} \cdot \exp(-\nu S_2) \cdot Z(\lambda,\nu)^{-n} \tag{24}$$
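The role of the two sufficient statistics can be illustrated with a short Python sketch. It evaluates $S_1\log\lambda - \nu S_2 - n\log Z(\lambda,\nu)$ on a hypothetical sample and locates an approximate maximum by a crude grid search; the fixed 200-term truncation of $Z$, the grid bounds and the data are assumptions made for illustration, and a real fit would use a numerical optimizer.

```python
import math

def cmp_log_z(lam, nu, n_terms=200):
    """log Z(lam, nu) from the first n_terms of the series (fixed truncation)."""
    logs = [j * math.log(lam) - nu * math.lgamma(j + 1) for j in range(n_terms)]
    peak = max(logs)
    return peak + math.log(sum(math.exp(v - peak) for v in logs))

def cmp_loglik(lam, nu, data):
    """S1*log(lam) - nu*S2 - n*log Z(lam, nu), using the two sufficient statistics."""
    s1 = sum(data)
    s2 = sum(math.lgamma(y + 1) for y in data)   # sum of log(y_i!)
    return s1 * math.log(lam) - nu * s2 - len(data) * cmp_log_z(lam, nu)

data = [0, 1, 1, 2, 2, 2, 3, 3, 4, 7]   # hypothetical sample, not the mdvis data

# Grid over lam in (0, 4] and nu in [0.5, 3]; within these bounds the
# 200-term truncation of Z is more than adequate.
grid = [(0.1 * a, 0.5 + 0.05 * b) for a in range(1, 41) for b in range(0, 51)]
lam_hat, nu_hat = max(grid, key=lambda p: cmp_loglik(p[0], p[1], data))
print(lam_hat, nu_hat, cmp_loglik(lam_hat, nu_hat, data))
```

Only `s1`, `s2` and the sample size enter the computation, which is what membership of the exponential family buys: the raw data are needed once, to form the two sums.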
The COM-Poisson distribution can be expressed as:
$$L(y \mid \theta) = \gamma(\theta) \cdot \phi(y) \exp\left\{\sum_{j=1}^{k} \pi_j(\theta)\, t_j(y)\right\} \tag{25}$$
which qualifies the distribution as a member of the exponential family.
In order to obtain the maximum likelihood estimates, sets of normal equations would be solved
iteratively. In evaluating the normalizing constant, the series $\sum_{j=0}^{\infty} \lambda^{j}/(j!)^{\nu}$
converges for any $\lambda > 0$, $\nu > 0$, since the ratio of two subsequent terms of the series,
$\lambda / j^{\nu}$, tends to 0 as $j \to \infty$. To make any
computation on COM-Poisson probabilities, the normalizing constant $Z(\lambda,\nu)$ has to be evaluated.
The link functions for the COM-Poisson model are given as:
$$\log(\mu_i) = x_i'\beta \tag{26}$$
$$\log(\nu_i) = -x_i'c$$
where $\beta$ and $c$ are the regression coefficients. The mean and variance are approximately:
$$E(Y_i) \approx \exp(x_i'\beta) \tag{27}$$
and
$$V(Y_i) \approx \exp(x_i'\beta + x_i'c) \tag{28}$$
respectively.
3. APPLICATION
To assess the strength of the outlined models, the dataset "mdvis" was obtained from the R package
COUNT (Hilbe, 2016). The data are drawn from the German Socio-Economic Panel and contain two thousand
two hundred and twenty seven (2,227) observations. The response variable examined was the
number of visits (numvisit), that is, the number of a patient's visits to a doctor over a three-month period.
Predictor variables include reform (interview year post-reform: 1998=1; pre-reform: 1996=0);
badh, indicating the state of health (not bad health=0, bad health=1); agegrp (20-39=1; 40-49=2;
50-60=3); and educ, the education level (1=7-10; 2=10.5-12; 3=HSgrad+). The mean is 2.589 and the
variance is 16.129, indicating that the data are over-dispersed. The result of the model selection is
presented in Table 1.
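The over-dispersion check described above amounts to comparing the variance-to-mean ratio with 1. A minimal Python sketch follows; the helper name `dispersion_index` and the toy counts are hypothetical, while the last line uses the mean and variance quoted in the text.

```python
import statistics

def dispersion_index(counts):
    """Variance-to-mean ratio: about 1 for equi-dispersion, above 1 for over-dispersion."""
    return statistics.variance(counts) / statistics.mean(counts)

toy_counts = [0, 0, 0, 1, 1, 2, 2, 3, 5, 12]   # hypothetical counts, not the mdvis data
toy_ratio = dispersion_index(toy_counts)

reported_ratio = 16.129 / 2.589   # variance and mean quoted in the text
print(toy_ratio, reported_ratio)  # both ratios are well above 1
```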
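Table 1: Model selection for over-dispersed count data

|        | Poisson  | NB      | ZIP      | ZINB    | QuasiPois | CMP      |
|--------|----------|---------|----------|---------|-----------|----------|
| Reform | -0.13789 | -0.1354 | 0.16595  | 0.9037  | -0.1378   | -0.1378  |
| Badh   | 1.13059  | 1.13117 | -0.9994  | -2.8589 | 1.13059   | 1.13059  |
| educ   | -0.0289  | -0.0088 | -0.2302  | -4.5675 | -0.0289   | -0.0289  |
| agegrp | 0.07453  | 0.09016 | 0.03926  | -0.1442 | 0.07453   | 0.07453  |
| AIC    | 11913.00 | 9141.70 | 10816.30 | 9143.40 | 74371.80  | 9101.30* |

BIC (five values, in the order listed): 11941.87, 9175.98, 18813.22, NA, 9941.87
Deviance (in the order listed): 7437.8, 0.0901**, -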
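For reference, the two information criteria used in Table 1 are computed from the maximized log-likelihood as sketched below in Python. The log-likelihood value and parameter count are hypothetical placeholders; only the sample size of 2,227 comes from the text.

```python
import math

def aic(loglik, k):
    """Akaike Information Criterion for a model with k estimated parameters."""
    return 2 * k - 2 * loglik

def bic(loglik, k, n):
    """Bayesian Information Criterion for k parameters and n observations."""
    return k * math.log(n) - 2 * loglik

n_obs = 2227       # sample size reported for the mdvis data
loglik = -4500.0   # hypothetical maximized log-likelihood, for illustration only
k = 6              # hypothetical parameter count

model_aic = aic(loglik, k)
model_bic = bic(loglik, k, n_obs)
penalty_gap = model_bic - model_aic   # equals k * (ln(n) - 2)
print(model_aic, model_bic, penalty_gap)
```

Both criteria reward a higher log-likelihood and penalize extra parameters; BIC penalizes them more heavily whenever $\ln n > 2$, which holds for a sample of this size.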
The plot showing the number of visits to the doctor is presented in Figure 1, while the Normal
Q-Q plot is presented in Figure 2.
Figure 1: Histogram showing frequency of visits
Figure 2: Q-Q Plots for number of visits to doctor
The model coefficients/parameters and their corresponding confidence intervals are presented
in Table 2.
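Table 2: Coefficient and Confidence Interval from the model

|                   | Coefficient (exponentiated) | 2.5 pct   | 97.5 pct  |
|-------------------|-----------------------------|-----------|-----------|
| count_(Intercept) | 2.0958295                   | 1.8928566 | 0.9177257 |
| count_reform      | 0.8711983                   | 0.8269712 | 3.2834258 |
| count_badh        | 3.0974853                   | 2.9208505 | 0.9177257 |
| count_educ        | 0.9715042                   | 0.9387839 | 1.0053716 |
| count_agegrp      | 1.0773779                   | 1.0413875 | 1.1144003 |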
4. DISCUSSIONS AND CONCLUSION
From Table 1, after fitting the selected models to the over-dispersed count data and comparing
them, the results show that the COM-Poisson and Zero-inflated models (in that order) are superior to
the other models based on the Akaike and Bayesian Information Criteria (AIC and BIC). These
results suggest that fitting with the Poisson, Negative Binomial and Quasi-Poisson models will likely
lead to misleading inferences about the parameters.
Having established that a significant relationship exists between each predictor and the
response variable, the coefficients were exponentiated as shown in Table 2;
the table gives the extent to which each predictor impacts the response variable. The exponentiated
coefficient for "reform" is less than 1; therefore, the expected number of visits to the doctor in the
post-reform period is 0.8711 times that in the pre-reform period. Also, the expected number of visits
for people with bad health status is 3.0974 times that for people without; bad health status must
have informed the need to see their doctors.
Education has an exponentiated coefficient less than 1; therefore, each one-level increase in education
multiplies the expected number of visits to the doctor by a factor of 0.9715. This might be because education
brings about exposure and health consciousness, which can in turn reduce visits to the doctor.
Lastly, each one-level increase in age group multiplies the expected number of visits to the doctor by
a factor of 1.0773; this relates to the fact that, as an individual ages, more health issues become
likely. This is in line with the study of Christensen et al. (2009), who
identified more health-related issues among older people than among younger individuals.
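The factors quoted above follow from exponentiating the coefficients under the log link. As a minimal Python sketch, the CMP coefficients listed in Table 1 can be converted as follows; the variable names are illustrative.

```python
import math

# Coefficients on the log scale, taken from the CMP column of Table 1.
coefs = {"reform": -0.13789, "badh": 1.13059, "educ": -0.0289, "agegrp": 0.07453}

# Under a log link, exp(coefficient) is the multiplicative change in the
# expected count for a one-unit increase in the predictor.
rate_ratios = {name: math.exp(b) for name, b in coefs.items()}

for name, rr in rate_ratios.items():
    print(f"{name}: {rr:.4f} ({(rr - 1) * 100:+.1f}% change in expected visits)")
```

A ratio below 1 corresponds to fewer expected visits and a ratio above 1 to more, which matches the direction of each effect discussed in the text.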
In this study, the suitability of the COM-Poisson model over some other
existing parametric models has been established for analyzing count datasets with high dispersion.
A real-life dataset was used to check the performance of the outlined models. The COM-Poisson model
has proven suitable for modeling highly skewed over-dispersed count data, and is therefore recommended.
Future studies can consider fitting the models to under-dispersed datasets using both frequentist
and Bayesian techniques. Estimation of parameters can also be done following the work of Rastogi
and Oguntunde (2018).
ACKNOWLEDGEMENT
The authors would like to thank Covenant University for its support.
REFERENCES
[1] Adesina O. S., Olatayo T. O., Agboola O. O., Oguntunde P. E. (2018). Bayesian Dirichlet Process Mixture Prior for Count Data, International Journal of Mechanical Engineering and Technology, 9(12), 630-646
[2] Adesina O. S., Agunbiade D. A., Osundina S. A. (2017). Bayesian Regression Model for Counts in Scholarship, Journal of Mathematical Theory and Modelling, 7(9), 46-57
[3] Cameron A. C., Trivedi P. K. (2005). Microeconometrics: Methods and Applications, Cambridge University Press
[4] Cameron A. C., Trivedi P. K., Milne F., Piggott J. (1988). A Microeconometric Model of the Demand for Health Care and Health Insurance in Australia, Review of Economic Studies, 55, 85-106
[5] Christensen K., Doblhammer G., Rau R., Vaupel J. W. (2009). Ageing populations: The challenges ahead, Lancet, 374(9696), 1196-1208
[6] Famoye F. (1993). Restricted generalized Poisson regression model, Communications in Statistics-Theory and Methods, 22(5), 1335-1354
[7] Gagnon D. R., Doron-LaMarca S., Bell M., O'Farrell T. J., Taft C. T. (2008). Poisson regression for modeling count and frequency outcomes in trauma research, Journal of Traumatic Stress, 21(5), 448-454
[8] Grootendorst P. (1995). Effects of Drug Plan Eligibility on Prescription Drug Utilization, Ph.D. Dissertation, McMaster University.
[9] Heller G. Z., Mikis S. D., Rigby R. A., De Jong P. (2007). Mean and dispersion modelling for policy claims costs, Scandinavian Actuarial Journal, 4, 281-292
[10] Hilbe J. M. (2007). Negative Binomial Regression. Cambridge: Cambridge University Press.
[11] Hilbe J. M. (2016). COUNT: Functions, Data and Code for Count Data. R package version 1.3.4. https://CRAN.R-project.org/package=COUNT
[12] Kalyanam K., Borle S., Boatwright P. (2007). Deconstructing each item's category contribution, Marketing Science, 26(3), 327-341. doi: 10.1287/mksc.1070.0270. URL http://pubsonline.informs.org/doi/abs/10.1287/mksc.1070.0270.
[13] Nelder J. A., Wedderburn R. W. M. (1972). Generalized Linear Models, Journal of the Royal Statistical Society. Series A (General), 135(3), 370-384
[14] Minka T. P., Shmueli G., Kadane J. B., Borle S., Boatwright P. (2003). Computing with the COM-Poisson distribution. Technical report, CMU Statistics Department.
[15] Gupta R., Marino B. S., Cnota J. F., Ittenbach R. F. (2013). Finding the right distribution for highly skewed zero-inflated clinical data, Epidemiology Biostatistics and Public Health, 10(1). https://doi.org/10.2427/8732
[16] R Core Team (2018). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/
[17] Rastogi M. K., Oguntunde P. E. (2018). Classical and Bayes estimation of reliability characteristics of the Kumaraswamy-Inverse Exponential distribution, International Journal of System Assurance Engineering and Management (Online First), https://doi.org/10.1007/s13198-018-0744-7
[18] Ridout M. S., Besbeas P. (2004). An empirical model for underdispersed count data, Statistical Modelling, 4(1), 77-89, ISSN 1471-082X
[19] Romundstad P., Andersen A., Haldorsen T. (2001). Cancer incidence among workers in the Norwegian silicon carbide industry, American Journal of Epidemiology, 153(10), 978-986
[20] Sellers K. F., Shmueli G. (2010). A Flexible Regression Model for Count Data, The Annals of Applied Statistics, 4(2), 943-961
[21] Shmueli G., Minka T. P., Kadane J. B., Borle S., Boatwright P. (2005). A useful distribution for fitting discrete data: revival of the Conway-Maxwell-Poisson distribution, Journal of the Royal Statistical Society: Series C, 54(1), 127-142
[22] Winkelmann R. (2004). Health care reform and the number of doctor visits: An econometric analysis, Journal of Applied Econometrics, 19(4), 455-472
[23] Yau K. K. W., Wang K., Lee A. H. (2003). Zero-inflated negative binomial mixed regression modeling of over-dispersed count data with extra zeros, Biometrical Journal, 45(4), 437-452
[24] Yip P. (1988). Inference about the mean of a Poisson distribution in the presence of a nuisance parameter, Australian Journal of Statistics, 30, 299-306