International Journal of Mechanical Engineering and Technology (IJMET)
Volume 10, Issue 01, January 2019, pp. 1964-1972, Article ID: IJMET_10_01_192
Available online at http://www.iaeme.com/ijmet/issues.asp?JType=IJMET&VType=10&IType=01
ISSN Print: 0976-6340 and ISSN Online: 0976-6359
© IAEME Publication, Scopus Indexed

ADAPTIVE REGRESSION MODEL FOR HIGHLY SKEWED COUNT DATA

Remi J. Dare
Department of Mathematical Sciences, Kings University, Ode-Omu, Osun State, Nigeria

Olumide S. Adesina
Department of Mathematical Sciences, Olabisi Onabanjo University, Ago-Iwoye, Ogun State, Nigeria

Pelumi E. Oguntunde and Olasunmbo O. Agboola
Department of Mathematics, Covenant University, Ota, Ogun State, Nigeria

ABSTRACT
Practitioners often face the task of deciding which model to adopt when fitting count datasets. This paper investigates a suitable model for fitting highly skewed count datasets. The COM-Poisson regression model is proposed for this purpose because of its varying normalizing constant. Several other statistical models were investigated alongside the proposed model: the Poisson, Negative Binomial, Zero-Inflated Poisson, Zero-Inflated Negative Binomial and Quasi-Poisson models. A real-life dataset on visits to a doctor within a given period was used to test the behavior of these models. Based on the findings, the COM-Poisson regression model is recommended for fitting highly skewed count datasets irrespective of the type of dispersion.

Key words: Count Data, dispersion, COM-Poisson, Zero-Inflated Models.

Cite this Article: Remi J. Dare, Olumide S. Adesina, Pelumi E. Oguntunde and Olasunmbo O. Agboola, Adaptive Regression Model for Highly Skewed Count Data, International Journal of Mechanical Engineering and Technology, 10(01), 2019, pp.1964–1972
http://www.iaeme.com/IJMET/issues.asp?JType=IJMET&VType=10&Type=01

1.
INTRODUCTION

With recent technological advances in the collection and storage of statistical data, count datasets are now readily available across disciplines. Count data take only non-negative integer values, and these integers are obtained by counting rather than ranking. Adesina et al. (2017; 2018) modeled count datasets using various estimation methods, including Dirichlet mixture models, MCMCglmm, the Bayesian discrete Weibull model and a few frequentist techniques. The authors made a comprehensive comparison between Bayesian and frequentist estimation techniques to establish which is preferred for fitting count datasets irrespective of the form of dispersion. Famoye (1993) proposed a restricted Generalized Poisson regression model for handling dispersed count datasets. The model is considered to be an extension of the family of Generalized Poisson Distributions (GPD), since the latter was found inadequate to fit count datasets effectively. Gupta et al. (2013) established that biased and misleading results can arise with highly skewed distributions with excess zeros. They gave the example of clinical data on children with electrophysiological disorders, many of whom were treated without surgery.

A Poisson experiment concerns the number of occurrences of an event in a given time interval or at a specific location. The Poisson distribution is one of the simplest and perhaps most frequently used probability distributions for modeling such counts. Authors who have applied Poisson regression to count datasets include Yip (1988), Romundstad et al. (2001), Winkelmann (2004), Heller et al. (2007) and Gagnon et al. (2008), to mention but a few.
The Poisson distribution assumes that the mean is equal to the variance, which makes the Poisson regression model inadequate for datasets that exhibit any form of dispersion other than equi-dispersion. Poisson regression is also known to be restrictive and intrinsically heteroskedastic. The Negative Binomial regression model improves on the Poisson regression model because it has an extra parameter that makes it suitable for modeling over-dispersion. The Negative Binomial regression model can be derived following a Bayesian principle (Cameron & Trivedi, 2005). Readers can refer to Hilbe (2007) for a detailed discussion of the Negative Binomial regression model.

COM-Poisson regression, like other generalized linear models, relates the response variable to the covariates through a link function. Those who have taken advantage of the COM-Poisson model include Ridout & Besbeas (2004) and Kalyanam et al. (2007). Cameron et al. (1988), Pohlmeier & Ulrich (1995), Grootendorst (1995) and Geil et al. (1996) considered doctor visits, special visits, drug prescriptions, number of hospital stays and hospitalizations, respectively. Sellers & Shmueli (2010) and Shmueli et al. (2005) gave robust applications of the COM-Poisson regression model. Cameron & Trivedi (2005) pointed out that the conventional parametric count distributions, the Poisson and Negative Binomial models, often fail to describe empirical distributions adequately at some levels of dispersion. A major limitation is that they cannot model a variance-to-mean ratio below 1 (under-dispersion), which does arise in count datasets. COM-Poisson regression is therefore identified as suitable for count datasets irrespective of the form of dispersion the data exhibit. To fit a COM-Poisson regression to a dataset, the normalizing constant must first be evaluated, after which the parameters can be estimated. This study aims to discuss the COM-Poisson regression model extensively and to compare it with various other models.
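The notion of dispersion used in this introduction can be made concrete through the variance-to-mean ratio: values near 1 indicate equi-dispersion, values above 1 over-dispersion and values below 1 under-dispersion. The short Python sketch below computes this ratio for a sample of counts. The function names, the tolerance band of 0.05 around 1 and the toy samples are assumptions made for illustration and are not taken from the paper.

```python
from statistics import mean, variance


def dispersion_index(counts):
    """Return the sample variance-to-mean ratio of a list of counts."""
    m = mean(counts)
    if m == 0:
        raise ValueError("the ratio is undefined when the sample mean is zero")
    return variance(counts) / m


def classify_dispersion(counts, band=0.05):
    """Label a sample by comparing its variance-to-mean ratio with 1.

    The width of the band around 1 is an arbitrary illustrative choice.
    """
    d = dispersion_index(counts)
    if d > 1 + band:
        return "over-dispersed"
    if d < 1 - band:
        return "under-dispersed"
    return "equi-dispersed"


if __name__ == "__main__":
    skewed = [0, 0, 0, 0, 1, 1, 2, 12]   # hypothetical right-skewed counts
    regular = [2, 3, 2, 3, 2, 3]         # hypothetical very regular counts
    print(classify_dispersion(skewed))
    print(classify_dispersion(regular))
```

A ratio computed this way is only a descriptive check; a formal comparison of models still relies on criteria such as AIC and BIC.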
In Section 2 of this paper, the materials and methods used are discussed, while results are presented in Section 3.

2. MATERIALS AND METHOD

Some basic regression models for fitting count datasets are itemized and discussed as follows:

2.1. Poisson Regression

For a Poisson model, the mean $\mu_i$ is expressed in terms of the explanatory variables $x_i$ using a suitable link function. The model can be described as:

$$g(\mu_i) = x_i'\beta \tag{1}$$

The link $g(\mu_i)$ may be the identity link, $\mu_i = x_i'\beta$, or the log link, $\log(\mu_i) = x_i'\beta$. Under the log link the fitted mean $\hat{\mu}_i = \exp(x_i'\hat{\beta})$ is guaranteed to be positive, which is not the case for the identity link. The mean and variance of a Poisson regression are:

$$E(Y_i) = \exp(x_i'\beta) \tag{2}$$

and

$$V(Y_i) = \exp(x_i'\beta) \tag{3}$$

respectively. Its log-likelihood function is given by:

$$\ln L(\beta) = \sum_{i=1}^{n}\left( y_i x_i'\beta - \exp(x_i'\beta) - \ln y_i! \right) \tag{4}$$

2.2. Negative Binomial Regression

For parameters $\mu$ and $\delta$, the Negative Binomial model can be expressed through a Bayesian procedure as:

$$h(y \mid \mu, \delta) = \int_0^{\infty} \frac{e^{-\mu v}(\mu v)^{y}}{y!} \cdot \frac{\delta^{\delta} v^{\delta-1} e^{-v\delta}}{\Gamma(\delta)}\, dv \tag{5}$$

Further simplification gives:

$$h(y \mid \mu, \delta) = \int_0^{\infty} \frac{e^{-(\mu+\delta)v}\, \mu^{y}\, v^{y+\delta-1}\, \delta^{\delta}}{y!\,\Gamma(\delta)}\, dv \tag{6}$$

Since the mixing variable $v$ in Equation (5) follows a gamma distribution with mean 1 and variance $1/\delta$, the model's mean and variance can be expressed as:

$$E(Y_i) = \exp(x_i'\beta) \tag{7}$$

and

$$V(Y_i) = \exp(x_i'\beta) + \frac{1}{\delta}\exp(x_i'\beta)^{2} \tag{8}$$

respectively. The likelihood contribution of observation $i$ under the Negative Binomial model is given as:

$$
\begin{aligned}
l_i &= \frac{\Gamma(y_i+m)}{\Gamma(m)\,\Gamma(y_i+1)}\left(\frac{m}{m+\mu_i}\right)^{m}\left(\frac{\mu_i}{m+\mu_i}\right)^{y_i}
 = \frac{(y_i+m-1)\cdots(m)}{y_i!}\left(\frac{m}{m+\mu_i}\right)^{m}\left(\frac{\mu_i}{m+\mu_i}\right)^{y_i} \\
 &= \frac{(y_i+e^{m^*}-1)\cdots e^{m^*}}{y_i!}\left(\frac{e^{m^*}}{e^{m^*}+\mu_i}\right)^{e^{m^*}}\left(\frac{\mu_i}{e^{m^*}+\mu_i}\right)^{y_i}
\end{aligned}
\tag{9}
$$

where $m$ (which corresponds to $\delta$ in Equation (5)) is the extra parameter that accounts for over-dispersion in count data, reparameterized as $m = e^{m^*}$ so that it remains positive. The log-likelihood function, represented by $L$, is:

$$L = \sum_{i=1}^{n}\left\{ \sum_{j=0}^{y_i-1} \ln(e^{m^*}+j) - \ln(y_i!) + e^{m^*}\ln(e^{m^*}) + y_i\ln(\mu_i) - (e^{m^*}+y_i)\ln(\mu_i+e^{m^*}) \right\} \tag{10}$$

2.3.
Zero Inflated Poisson (ZIP) Regression Models

The count variable of interest, $Y_{ij}$, is described by:

$$
P(Y_{ij} = y_{ij}) =
\begin{cases}
\omega_{ij} + (1-\omega_{ij})\exp(-\lambda_{ij}), & y_{ij} = 0 \\
(1-\omega_{ij})\exp(-\lambda_{ij})\,\lambda_{ij}^{y_{ij}} / y_{ij}!, & y_{ij} > 0
\end{cases}
\tag{11}
$$

The ZIP mean and variance are:

$$E(Y_{ij}) = (1-\omega_{ij})\lambda_{ij} \tag{12}$$

and

$$\mathrm{var}(Y_{ij}) = (1-\omega_{ij})\lambda_{ij}(1+\omega_{ij}\lambda_{ij}) \tag{13}$$

respectively. Yau et al. (2003) noted that, in regression analysis, both the mean $\lambda_{ij}$ and the zero proportion $\omega_{ij}$ are linked to covariate vectors $x_{ij}$ and $z_{ij}$, respectively. The ZIP mixed regression model can be expressed as:

$$\eta_{ij} = \log(\lambda_{ij}) = x_{ij}'\beta + u_i \tag{14}$$

$$\xi_{ij} = \log\left(\frac{\omega_{ij}}{1-\omega_{ij}}\right) = z_{ij}'\gamma + v_i \tag{15}$$

From Equation (15), we have:

$$\omega_{ij} = \frac{\exp(z_{ij}'\gamma + v_i)}{1+\exp(z_{ij}'\gamma + v_i)} \tag{16}$$

Therefore, the likelihood function for the ZIP model is:

$$L(y,\lambda) = \prod_{i=1}^{n}\left[\frac{\exp(z_{ij}'\gamma + v_i)}{1+\exp(z_{ij}'\gamma + v_i)} + \left(1 - \frac{\exp(z_{ij}'\gamma + v_i)}{1+\exp(z_{ij}'\gamma + v_i)}\right)\exp(-\lambda_{ij})\right] \tag{17}$$

for $y = 0$, or

$$L(y,\lambda) = \prod_{i=1}^{n}\left[1 - \frac{\exp(z_{ij}'\gamma + v_i)}{1+\exp(z_{ij}'\gamma + v_i)}\right]\frac{\exp(-\lambda_{ij})\,\lambda_{ij}^{y_{ij}}}{y_{ij}!} \tag{18}$$

for $y > 0$.

Following the matrix notation for ZINB,

$$X = [x_{11}, \ldots, x_{1n_1}, \ldots, x_{m1}, \ldots, x_{mn_m}]',$$
$$W = \mathrm{diag}[1_{n_1}, 1_{n_2}, \ldots, 1_{n_m}] = [w_{11}, \ldots, w_{1n_1}, \ldots, w_{m1}, \ldots, w_{mn_m}]',$$
$$Z = [z_{11}, \ldots, z_{1n_1}, \ldots, z_{m1}, \ldots, z_{mn_m}]',$$

the mixed regression model can be expressed as:

$$\log\left(\frac{\omega}{1-\omega}\right) = \xi = Z\gamma + Wv, \qquad \log(\lambda) = \eta = X\beta + Wu \tag{19}$$

2.4. COM-Poisson distribution

Conway & Maxwell (1962) first proposed the COM-Poisson distribution as a way of handling queuing systems. Its probability mass function can be described as:

$$P(Y = y \mid \lambda, \nu) = \frac{\lambda^{y}}{(y!)^{\nu}} \cdot \frac{1}{Z(\lambda,\nu)}, \qquad y = 0, 1, 2, \ldots \tag{20}$$

$$Z(\lambda,\nu) = \sum_{j=0}^{\infty} \frac{\lambda^{j}}{(j!)^{\nu}}$$

for $\lambda > 0$ and $\nu \geq 0$.
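As a rough numerical illustration of Equation (20), the sketch below evaluates $Z(\lambda,\nu)$ by truncating the infinite series and then computes COM-Poisson probabilities. It is a minimal Python sketch under stated assumptions, not the estimation code used in this paper: the function names `cmp_log_z` and `cmp_pmf`, the stopping rule and the tolerance are all hypothetical choices made for the example. The series is summed on the log scale to avoid overflow for large $\lambda$.

```python
import math


def cmp_log_z(lam, nu, tol=1e-12, max_terms=100_000):
    """Log of the truncated series Z(lam, nu) = sum_j lam^j / (j!)^nu.

    Terms are added until the current term is smaller than `tol` times
    the largest term seen so far (an illustrative stopping rule).
    """
    if lam <= 0 or nu < 0:
        raise ValueError("requires lam > 0 and nu >= 0")
    if nu == 0 and lam >= 1:
        raise ValueError("the series diverges for nu = 0 and lam >= 1")
    log_terms = []
    running_max = -math.inf
    for j in range(max_terms):
        # log of lam^j / (j!)^nu, using lgamma(j + 1) = log(j!)
        log_t = j * math.log(lam) - nu * math.lgamma(j + 1)
        log_terms.append(log_t)
        running_max = max(running_max, log_t)
        if log_t < running_max + math.log(tol):
            break
    else:
        raise RuntimeError("series did not converge within max_terms")
    # log-sum-exp of the collected terms
    total = sum(math.exp(t - running_max) for t in log_terms)
    return running_max + math.log(total)


def cmp_pmf(y, lam, nu):
    """COM-Poisson probability P(Y = y | lam, nu) as in Equation (20)."""
    log_p = y * math.log(lam) - nu * math.lgamma(y + 1) - cmp_log_z(lam, nu)
    return math.exp(log_p)


if __name__ == "__main__":
    # With nu = 1 the distribution reduces to the Poisson distribution.
    print(cmp_pmf(2, 3.0, 1.0), math.exp(-3.0) * 3.0 ** 2 / 2)
```

Two special cases give a quick sanity check on such a routine: $\nu = 1$ recovers the Poisson distribution with $Z = e^{\lambda}$, and $\nu = 0$ with $\lambda < 1$ gives a geometric series with $Z = 1/(1-\lambda)$.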
$Z(\lambda,\nu)$ is the normalization constant and has no closed analytical form. According to Shmueli et al. (2005), the approximation to the normalizing constant is accurate when $\nu \leq 1$ or $\lambda > 10^{\nu}$. The moment generating function (mgf) of the COM-Poisson distribution is given as:

$$M_Y(t) = E(e^{Yt}) = \frac{Z(\lambda e^{t},\nu)}{Z(\lambda,\nu)} \tag{21}$$

while the probability generating function (pgf) is expressed as:

$$E(t^{Y}) = \frac{Z(\lambda t,\nu)}{Z(\lambda,\nu)} \tag{22}$$

The COM-Poisson distribution is a member of the exponential family in both parameters, with sufficient statistics $S_1 = \sum_{i=1}^{n} Y_i$ and $S_2 = \sum_{i=1}^{n} \log(Y_i!)$, where $Y_1, \ldots, Y_n$ represents a random sample of $n$ COM-Poisson random variables.

Parameter Estimation of COM-Poisson distribution via maximum likelihood Approach

Estimation here follows the maximum likelihood approach, which is made possible by the exponential family structure of the distribution. The likelihood function can be written as:

$$L(y_1, \ldots, y_n \mid \lambda, \nu) = \frac{\prod_{i=1}^{n}\lambda^{y_i}}{\prod_{i=1}^{n}(y_i!)^{\nu}} \cdot Z(\lambda,\nu)^{-n} \tag{23}$$

Taking logarithms shows that the log-likelihood equals $S_1\log\lambda - \nu S_2 - n\log Z(\lambda,\nu)$, where $S_1 = \sum_{i=1}^{n} Y_i$ and $S_2 = \sum_{i=1}^{n}\log(Y_i!)$. Therefore,

$$L(y_1, \ldots, y_n \mid \lambda, \nu) = \lambda^{S_1} \cdot \exp(-\nu S_2) \cdot Z(\lambda,\nu)^{-n} \tag{24}$$

The COM-Poisson distribution can be expressed as:

$$L(y \mid \theta) = \gamma(\theta)\cdot\phi(y)\,\exp\left(\sum_{j=1}^{k}\pi_j(\theta)\,t_j(y)\right) \tag{25}$$

which qualifies the distribution as a member of the exponential family. In order to obtain the maximum likelihood estimates, sets of normal equations would be solved iteratively. Regarding the normalizing constant, the series $\sum_{j}\lambda^{j}/(j!)^{\nu}$ converges for any $\lambda > 0$, $\nu > 0$, since the ratio of two subsequent terms of the series, $\lambda/j^{\nu}$, tends to 0 as $j \to \infty$. To carry out any computation on COM-Poisson probabilities, the normalizing constant $Z(\lambda,\nu)$ has to be evaluated. The link functions for the COM-Poisson model are given as:

$$\log(\mu_i) = x_i'\beta \tag{26}$$
$$\log(\nu_i) = -x_i'c$$

where $\beta$ and $c$ are the regression coefficients. The mean and variance are as follows:

$$E(Y_i) \approx \exp(x_i'\beta) \tag{27}$$

and

$$V(Y_i) \approx \exp(x_i'\beta + x_i'c) \tag{28}$$

respectively.

3. APPLICATION

To assess the strength of the outlined models, the dataset "mdvis" was obtained from the package COUNT in R. The data consist of German Socio-Economic Panel data with two thousand two hundred and twenty-seven (2,227) observations. The response variable examined was the number of visits (numvisit), that is, the number of patients' visits to a doctor over a three-month period. The predictor variables are reform (interview year post-reform: 1998=1; pre-reform: 1996=0), badh (state of health: not bad health=0, bad health=1), agegrp (20-39=1; 40-49=2; 50-60=3) and educ (education level: 1=7-10; 2=10.5-12; 3=HSgrad+). The mean is 2.589 and the variance is 16.129, a variance-to-mean ratio of about 6.23, indicating that the data are over-dispersed. The result of the model selection is presented in Table 1.

Table 1: Model selection for over-dispersed count data

            Poisson     NB          ZIP         ZINB        QuasiPois   CMP
reform      -0.13789    -0.1354     0.16595     0.9037      -0.1378     -0.1378
badh        1.13059     1.13117     -0.9994     -2.8589     1.13059     1.13059
educ        -0.0289     -0.0088     -0.2302     -4.5675     -0.0289     -0.0289
agegrp      0.07453     0.09016     0.03926     -0.1442     0.07453     0.07453
AIC         11913.00    9141.70     10816.30    9143.40     74371.80    9101.30*

BIC (values in printed order): 11941.87, 9175.98, 18813.22, NA, 9941.87
Deviance (values in printed order): 7437.8, 0.0901**, -

The plot showing the number of visits to the doctor is presented in Figure 1, while the Normal Q-Q plot is presented in Figure 2.

Figure 1: Histogram showing frequency of visits

Figure 2: Q-Q Plots for number of visits to doctor

The model coefficients/parameters and their corresponding confidence intervals are presented in Table 2.
Table 2: Coefficients (CO) and Confidence Intervals from the model

                     CO          2.5 pct     97.5 pct
count_(Intercept)    2.0958295   1.8928566   -
count_reform         0.8711983   0.8269712   0.9177257
count_badh           3.0974853   2.9208505   3.2834258
count_educ           0.9715042   0.9387839   1.0053716
count_agegrp         1.0773779   1.0413875   1.1144003

4. DISCUSSIONS AND CONCLUSION

From Table 1, after fitting the selected models to the over-dispersed count data and comparing them, the COM-Poisson model attains the lowest Akaike Information Criterion (AIC = 9101.30) and is therefore preferred. The NB (9141.70) and ZINB (9143.40) models follow, ahead of the ZIP (10816.30) and Poisson (11913.00) models. These results suggest that fitting with the Poisson, Negative Binomial or Quasi-Poisson model instead may lead to misleading inferences about the parameters.

Having established the relationship between each predictor and the response variable, the coefficients were exponentiated, as shown in Table 2, to give the multiplicative effect of each predictor on the expected number of visits. The exponentiated coefficient for "reform" is less than 1; the expected number of visits to the doctor in the post-reform period is therefore 0.8711 times that in the pre-reform period, a decrease of about 13%. Patients in bad health are expected to make 3.0974 times as many visits as those not in bad health; bad health status must have informed the need to see their doctors. Education also has an exponentiated coefficient less than 1; each step up in education level multiplies the expected number of visits by 0.9715. This might be because education brings exposure and greater awareness of one's health, which may in turn reduce visits to the doctor. Lastly, each step up in age group increases the expected number of visits by a factor of 1.0773; this reflects the fact that as an individual ages, there is a greater possibility of having health issues.
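The factors quoted in this discussion are the exponentials of the log-scale coefficients in the CMP column of Table 1. As a quick check, the following Python sketch converts each coefficient into a multiplicative effect and a percentage change in the expected number of visits. The dictionary and function names are illustrative assumptions; the numbers are copied from Table 1, so small differences from Table 2 reflect rounding of the printed coefficients.

```python
import math

# Log-scale coefficients copied from the CMP column of Table 1.
cmp_coefficients = {
    "reform": -0.1378,
    "badh": 1.13059,
    "educ": -0.0289,
    "agegrp": 0.07453,
}


def rate_ratio(coefficient):
    """Multiplicative effect on the expected count under a log link."""
    return math.exp(coefficient)


def percent_change(coefficient):
    """Percentage change in the expected count for a one-unit increase."""
    return 100.0 * (math.exp(coefficient) - 1.0)


if __name__ == "__main__":
    for name, coef in cmp_coefficients.items():
        print(f"{name:7s} factor = {rate_ratio(coef):.4f}  "
              f"change = {percent_change(coef):+.1f}%")
```

Under a log link a coefficient $b$ multiplies the expected count by $e^{b}$ for each one-unit increase in the predictor, which is why values below 1 in Table 2 correspond to negative coefficients in Table 1.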
The age effect is in line with the study of Christensen et al. (2009), who identified more health-related issues among older people than among younger individuals.

In this study, the suitability of the COM-Poisson model over some existing parametric models for analyzing count datasets with high dispersion has been established. A real-life dataset was used to compare the performance of the outlined models. The COM-Poisson model proved suitable for modeling highly skewed, over-dispersed count data and is therefore recommended. Future studies can consider fitting the models to under-dispersed datasets using both frequentist and Bayesian techniques. Estimation of parameters can also be done following the work of Rastogi and Oguntunde (2018).

ACKNOWLEDGEMENT

The authors would like to thank Covenant University for its support.

REFERENCES

[1] Adesina O. S., Olatayo T. O., Agboola O. O., Oguntunde P. E. (2018). Bayesian Dirichlet Process Mixture Prior for Count Data, International Journal of Mechanical Engineering and Technology, 9(12), 630-646
[2] Adesina O. S., Agunbiade D. A., Osundina S. A. (2017). Bayesian Regression Model for Counts in Scholarship, Journal of Mathematical Theory and Modelling, 7(9), 46-57
[3] Cameron A. C., Trivedi P. K. (2005). Microeconometrics: Methods and Applications, Cambridge University Press
[4] Cameron A. C., Trivedi P. K., Milne F., Piggott J. (1988). A Microeconometric Model of the Demand for Health Care and Health Insurance in Australia, Review of Economic Studies, 55, 85–106
[5] Christensen K., Doblhammer G., Rau R., Vaupel J. W. (2009). Ageing populations: The challenges ahead, Lancet, 374(9696), 1196-1208
[6] Famoye F. (1993). Restricted generalized Poisson regression model, Communications in Statistics-Theory and Methods, 22(5), 1335–1354
[7] Gagnon D.
R., Doron-LaMarca S., Bell M., O'Farrell T. J., Taft C. T. (2008). Poisson regression for modeling count and frequency outcomes in trauma research, Journal of Traumatic Stress, 21(5), 448-454
[8] Grootendorst P. (1995). Effects of Drug Plan Eligibility on Prescription Drug Utilization, Ph.D. Dissertation, McMaster University.
[9] Heller G. Z., Stasinopoulos D. M., Rigby R. A., De Jong P. (2007). Mean and dispersion modelling for policy claims costs, Scandinavian Actuarial Journal, 4, 281-292
[10] Hilbe J. M. (2007). Negative Binomial Regression. Cambridge: Cambridge University Press.
[11] Hilbe J. M. (2016). COUNT: Functions, Data and Code for Count Data. R package version 1.3.4. https://CRAN.R-project.org/package=COUNT
[12] Kalyanam K., Borle S., Boatwright P. (2007). Deconstructing each item's category contribution, Marketing Science, 26(3), 327-341. doi: 10.1287/mksc.1070.0270. URL http://pubsonline.informs.org/doi/abs/10.1287/mksc.1070.0270
[13] Nelder J. A., Wedderburn R. W. M. (1972). Generalized Linear Models, Journal of the Royal Statistical Society. Series A (General), 135(3), 370–384
[14] Minka T. P., Shmueli G., Kadane J. B., Borle S., Boatwright P. (2003). Computing with the COM-Poisson distribution. Technical report, CMU Statistics Department.
[15] Gupta R., Marino B. S., Cnota J. F., Ittenbach R. F. (2013). Finding the right distribution for highly skewed zero-inflated clinical data, Epidemiology Biostatistics and Public Health, 10(1). https://doi.org/10.2427/8732
[16] R Core Team (2018). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/
[17] Rastogi M. K., Oguntunde P. E. (2018).
Classical and Bayes estimation of reliability characteristics of the Kumaraswamy-Inverse Exponential distribution, International Journal of System Assurance Engineering and Management (Online First), https://doi.org/10.1007/s13198-018-0744-7
[18] Ridout M. S., Besbeas P. (2004). An empirical model for underdispersed count data, Statistical Modelling, 4(1), 77-89, ISSN 1471-082X
[19] Romundstad P., Andersen A., Haldorsen T. (2001). Cancer incidence among workers in the Norwegian silicon carbide industry, American Journal of Epidemiology, 153(10), 978-986
[20] Sellers K. F., Shmueli G. (2010). A Flexible Regression Model for Count Data, The Annals of Applied Statistics, 4(2), 943-961
[21] Shmueli G., Minka T. P., Kadane J. B., Borle S., Boatwright P. (2005). A useful distribution for fitting discrete data: revival of the Conway-Maxwell-Poisson distribution, Journal of the Royal Statistical Society: Series C, 54(1), 127-142
[22] Winkelmann R. (2004). Health care reform and the number of doctor visits: An econometric analysis, Journal of Applied Econometrics, 19(4), 455-472
[23] Yau K. K. W., Wang K., Lee A. H. (2003). Zero-inflated negative binomial mixed regression modeling of over-dispersed count data with extra zeros, Biometrical Journal, 45(4), 437-452
[24] Yip P. (1988). Inference about the mean of a Poisson distribution in the presence of a nuisance parameter, Australian Journal of Statistics, 30, 299-306