Modelling multivariate count data

Aristidis K. Nikoloulopoulos and Dimitris Karlis

Department of Statistics, Athens University of Economics, 76 Patission Str., 10434, Athens, Greece
{akn,karlis}@aueb.gr

Summary. Multivariate count data occur in several different disciplines. Models for such data are few, mainly because of the computational complexity of applying them. We propose models based on the currently fashionable idea of copulas, choosing copulas that are appropriate for count data while taking computational limitations into account. A real data application is provided.

1 Introduction

Multivariate count data occur in several disciplines, such as epidemiology, marketing, criminology, sports statistics and industrial statistics. However, flexible models for such data are not widely available and are usually hard to fit to real data. Copulas are currently fashionable models for dependent data, as they separate the estimation of the marginal properties from that of the dependence structure. While there are plenty of publications that treat continuous data, only a few treat count data. The purpose of the present paper is to propose a model based on copulas for multivariate count data.

Definition 1. ([Nel99]) A multivariate copula is a function $C$ from $I^n$ to $I$, where $I = [0, 1]$, with the following properties:
(1) For every $u$ in $I^n$, $C(u) = 0$ if at least one coordinate of $u$ is 0, and $C(u) = u_k$ if all coordinates of $u$ are 1 except $u_k$.
(2) For every $a$ and $b$ in $I^n$ such that $a \le b$, $V_C([a, b]) \ge 0$, where $V_C([a, b])$ is the $C$-volume of the box $[a, b]$.

There are few papers on the use of copulas with discrete data. [MM94, TDB99, Lee94, CLTZ04] use the Frank copula, see [Fra79], to model discrete bivariate data (i.e., count and binary data). The multivariate extension of the Frank copula has two disadvantages: it is restricted to positive dependence, and it has a single dependence parameter for all the bivariate margins. The multivariate normal copula overcomes these drawbacks.
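The two properties in Definition 1 can be checked numerically for a concrete family. The following is a minimal sketch for the bivariate case (n = 2), assuming the standard closed form of the Frank copula; the function names, the value theta = 2.5 and the grid are illustrative choices and are not taken from the paper.

```python
import math


def frank_copula(u, v, theta):
    """Bivariate Frank copula C(u, v; theta) for theta != 0 (standard form)."""
    num = math.expm1(-theta * u) * math.expm1(-theta * v)
    return -math.log1p(num / math.expm1(-theta)) / theta


def c_volume(a, b, theta):
    """C-volume of the rectangle [a1, b1] x [a2, b2]."""
    (a1, a2), (b1, b2) = a, b
    return (frank_copula(b1, b2, theta) - frank_copula(a1, b2, theta)
            - frank_copula(b1, a2, theta) + frank_copula(a1, a2, theta))


theta = 2.5  # illustrative value, not an estimate from the paper
grid = [i / 10 for i in range(11)]

# Property (1): C vanishes when a coordinate is 0 and has uniform margins.
grounded = all(abs(frank_copula(0.0, v, theta)) < 1e-12 for v in grid)
uniform_margins = all(abs(frank_copula(1.0, v, theta) - v) < 1e-12 for v in grid)

# Property (2): every grid rectangle has non-negative C-volume.
min_volume = min(
    c_volume((grid[i], grid[j]), (grid[i + 1], grid[j + 1]), theta)
    for i in range(10) for j in range(10)
)

print(grounded, uniform_margins, min_volume >= 0.0)
```

A grid check of this kind only illustrates the definition; it is not a proof that a function is a copula.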
[Lee01, vO99] use the bivariate normal copula to model count data, while [LV02, Son00] use the normal copula to model multivariate non-normal longitudinal continuous data. For discrete data, generalization to the multivariate case is not easy, since the joint probability function requires evaluating the copula at several different points. For the normal copula each evaluation needs a multivariate numerical integration, which leads to tremendous computational problems.

We propose the use of mixtures of max-infinitely divisible bivariate copulas, see [Joe96], to obtain flexible positive dependence between the random variables. We also fit a multivariate normal copula and discuss the computational problems that arise.

2 Multivariate parametric families of copulas

2.1 Copulas via mixtures of max-infinitely divisible bivariate copulas

Let $\Lambda$ be the univariate cumulative distribution function (cdf) of a positive random variable ($\Lambda(0) = 0$), and let $\phi$ be the Laplace transform (LT) of $\Lambda$,

\[ \phi(t) = \int_0^\infty e^{-ts} \, d\Lambda(s), \qquad t \ge 0. \]

Mixtures of max-infinitely divisible (maxid) copulas have the form

\[ C(u) = \phi\Bigl( -\sum_{i<j} \log C'_{ij}\bigl( e^{-p_i \phi^{-1}(u_i)}, e^{-p_j \phi^{-1}(u_j)} \bigr) + \sum_{i=1}^{m} \nu_i p_i \phi^{-1}(u_i) \Bigr), \tag{1} \]

where $p_i = (\nu_i + m - 1)^{-1}$, so that every univariate margin of (1) is uniform, $\phi$ is an LT that introduces the smallest dependence between the random variables, and the $C'_{ij}$ are maxid copulas that add some pairwise dependence. The parameters $\nu_i$ are included so that the parametric family of multivariate copulas (1) is closed under margins. Some maxid bivariate copulas and Laplace transforms are presented in Table 1.

Table 1.
Max-infinitely divisible distributions

Family            $C'(u, v; \theta)$                                                            LT   $\phi(t; \theta)$                                  $\theta \in$
Gumbel [Gum60]    $\exp\{-[(-\log u)^\theta + (-\log v)^\theta]^{1/\theta}\}$                   LTA  $\exp(-t^{1/\theta})$                              $[1, \infty)$
Clayton [Cla78]   $(u^{-\theta} + v^{-\theta} - 1)^{-1/\theta}$                                 LTB  $(1 + t)^{-1/\theta}$                              $(0, \infty)$
Joe [Joe93]       $1 - [(1-u)^\theta + (1-v)^\theta - (1-u)^\theta (1-v)^\theta]^{1/\theta}$    LTC  $1 - (1 - e^{-t})^{1/\theta}$                      $[1, \infty)$
Frank [Fra79]     $-\theta^{-1} \ln\{1 + (e^{-\theta u}-1)(e^{-\theta v}-1)/(e^{-\theta}-1)\}$  LTD  $-\theta^{-1} \log[1 - (1 - e^{-\theta}) e^{-t}]$  $(0, \infty)$
Galambos [Gal75]  $uv \exp\{[(-\log u)^{-\theta} + (-\log v)^{-\theta}]^{-1/\theta}\}$          -    -                                                  $[0, \infty)$

In conclusion, combining the preceding Laplace transforms with the bivariate copulas yields 20 parametric families of the form (1) with flexible dependence structure.

2.2 Multivariate normal copula

The multivariate extensions of bivariate elliptical copulas continue to allow both positive and negative dependence between random variables, in contrast to the Frank copula described in the previous section.

Definition 2. The n-variate normal copula with linear correlation matrix $R$ is

\[ C_R(u) = \Phi_R^n\bigl( \Phi^{-1}(u_1), \ldots, \Phi^{-1}(u_n) \bigr), \]

where $\Phi$ is the N(0,1) c.d.f., $\Phi^{-1}$ is the functional inverse of $\Phi$, and $\Phi_R^n$ is the n-variate standard normal c.d.f. with linear correlation matrix $R$.

This copula system allows flexible positive and negative dependence, in contrast to maxid copulas.

3 Estimation of a multivariate copula based parametric model

Consider a multivariate copula based parametric model for the random vector $Y$ with distribution function $H$ given by the copula representation of the theorem of [Skl59],

\[ H(y; \alpha_1, \ldots, \alpha_n, \theta) = C(F_1(y_1; \alpha_1), \ldots, F_n(y_n; \alpha_n); \theta), \tag{2} \]

where $F_i$ are the marginal distributions with parameters $\alpha_i$, $i = 1, \ldots, n$, and $\theta$ is the vector of copula parameters. The density (probability mass function) $h$ of $H$ is obtained by taking finite differences of the cumulative distribution function through its copula representation, that is, as the Radon-Nikodym derivative ([Son00]) of $H$ in (2) with respect to the counting measure.
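The finite-difference construction can be written out explicitly: for integer-valued margins, h(y) is the alternating sum of H over the 2^n corners obtained by lowering each coordinate of y by 0 or 1. The sketch below illustrates this under stated assumptions: it uses an n-variate Clayton copula and Poisson margins purely for illustration (they are not the models fitted in this paper), and all function names and parameter values are hypothetical.

```python
import itertools
import math


def poisson_cdf(y, mu):
    """P(Y <= y) for Y ~ Poisson(mu); zero for y < 0."""
    if y < 0:
        return 0.0
    term = math.exp(-mu)
    total = term
    for k in range(1, y + 1):
        term *= mu / k
        total += term
    return min(total, 1.0)


def clayton_copula(u, theta):
    """n-variate Clayton copula, theta > 0 (illustrative choice of C)."""
    if min(u) <= 0.0:
        return 0.0
    s = sum(ui ** (-theta) for ui in u) - len(u) + 1
    return s ** (-1.0 / theta)


def copula_pmf(y, mus, theta):
    """h(y): 2^n-term finite difference of H = C(F_1, ..., F_n)."""
    total = 0.0
    for shift in itertools.product((0, 1), repeat=len(y)):
        u = [poisson_cdf(yi - si, mu) for yi, si, mu in zip(y, shift, mus)]
        total += (-1) ** sum(shift) * clayton_copula(u, theta)
    return total


mus = (1.0, 2.0, 0.5)  # illustrative Poisson means
theta = 0.8            # illustrative dependence parameter
support = list(itertools.product(range(15), repeat=3))
pmf_values = [copula_pmf(y, mus, theta) for y in support]
total_mass = sum(pmf_values)
print(round(total_mass, 6), min(pmf_values) >= -1e-12)
```

Each evaluation of h needs 2^n evaluations of the copula. This is cheap for a closed-form copula such as the one assumed here, but for the normal copula every one of those terms is a multivariate normal probability, which is the source of the computational burden discussed in this paper.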
For estimation purposes we concentrate on likelihood methods. Consider the $n$ log-likelihood functions for the univariate marginal distributions,

\[ L_{y_j}(\alpha_j) = \sum_{i=1}^{N} \log f_j(y_{ij}; \alpha_j), \qquad j = 1, \ldots, n, \tag{3} \]

and the joint log-likelihood

\[ L(\theta, \alpha_1, \ldots, \alpha_n) = \sum_{i=1}^{N} \log h(y_{i1}, \ldots, y_{in}; \alpha_1, \ldots, \alpha_n, \theta), \tag{4} \]

where $N$ is the sample size and $f_j$ is the probability mass function of the $j$th margin.

Quite efficient estimation of the model parameters is achieved by the inference functions for margins (IFM) method ([Joe97]), which is a two-step approach. In the first step the univariate log-likelihoods (3) are maximized independently of the copula parameters, and in the second step the joint log-likelihood (4) is maximized over $\theta$ with the univariate parameters fixed at their first-step estimates. The traditional full maximum likelihood (FML) method maximizes the joint log-likelihood (4) over the copula and marginal parameters simultaneously ([Son00, TDB99, MM94]). The IFM estimates provide initial values for the FML estimates, which reduces the computational effort.

Estimation by the IFM method becomes more attractive as the dimension increases and computational problems arise. The problem of fitting multivariate data is decomposed into two smaller problems: fitting the marginal distributions, and separately fitting the dependence structure. The asymptotic efficiency of the two-step method was studied in [Joe05] for a number of multivariate models. All of these comparisons suggest that the IFM method is highly efficient compared with FML, except for extreme cases near the Fréchet bounds.

Covariates can be included by modelling the univariate responses with a generalized linear model ([MN83]). More on this topic is covered in the following application example for count data.

4 Application

We jointly modelled the number of different crimes in regions of Greece for the year 1997.
Three different types of crimes were considered: manslaughters, rapes and smuggling. The exogenous predictors were the population of the area (in millions), the gross domestic product, the unemployment rate, an indicator of whether the region has a city with more than 100,000 inhabitants, and a dummy showing whether the region lies on the borders of Greece (to account for economic refugees who enter the country illegally).

The peculiar feature of this data set is its small sample size, since Greece has only 50 regions. This makes the normal copula applicable to discrete, and in particular count, data. The probability mass function of a normal copula based model is obtained using finite differences, so a normal probability integral must be computed at several points, which causes problems when maximizing the likelihood to estimate the model parameters. This was not the case for our data set.

The probability mass function $h(y)$ of the normal copula based model was computed using the routine of [Sch84] for multivariate normal rectangle probabilities. The log-likelihood of the normal copula based model was maximized using a quasi-Newton iterative algorithm ([Nas90]). The log-likelihood of the mixtures of maxid copulas was maximized numerically using a quasi-Newton iterative algorithm ([BLNZ95]) implemented in the optim function in R. This algorithm allows box constraints, meaning that each parameter can be given a lower and/or upper bound, which is needed because of the positive dependence restriction on this type of multivariate copula system.

Finally, we computed standard errors of the estimates using the jackknife method as proposed by [Joe97]. The jackknife estimator of the asymptotic covariance matrix of $\hat{\varrho}$ was

\[ \sum_{i=1}^{N} (\hat{\varrho}^{(i)} - \hat{\varrho})^T (\hat{\varrho}^{(i)} - \hat{\varrho}), \]

where $\hat{\varrho}^{(i)}$ is the estimator of $\varrho = (\alpha_1, \ldots, \alpha_n, \theta)$ with the $i$th observation deleted, $i = 1, \ldots, 50$.

We used the following models:
• Normal copula based model, allowing the inclusion of covariates via a negative binomial model for the marginal distributions.
• Mixtures of max-infinitely divisible copulas with the same negative binomial marginal distributions.

After preliminary analysis, we simplified the model and the numerical computations by setting $C'_{23} = \Pi$ (the independence copula) and $\nu_1 = 0$, $\nu_2 = \nu_3 = -1$. In this manner we assumed a lower level of dependence for the (2,3) bivariate margin, represented by the parameter $\theta$ of the Laplace transform $\phi$, and a higher level of dependence for the (1,2) and (1,3) bivariate margins, with the parameters $\theta_{12}$ and $\theta_{13}$ representing bivariate dependence exceeding the minimum dependence of the Laplace transform $\phi$. As shown in Table 2, the log-likelihood values were affected greatly by the family of bivariate copulas $C'_{ij}$ but not by the family of the Laplace transform $\phi$.

The estimated dependence and marginal parameters can be interpreted through Kendall's tau, derived in [DL05]. Table 2 presents the dependence parameters for all fitted copula systems, together with the corresponding Kendall's tau values.

Table 2.
Estimates of the dependence parameters, Kendall's tau and log-likelihoods

Copula     LT      $\theta_{12}$   $\theta_{13}$   $\hat{\theta}$  $\tau_{12}$     $\tau_{13}$     $\tau_{23}$     log-likelihood
Clayton    A       3.522           0.098           1.033           0.308           0.043           0.028           -278.832
Clayton    B       502.747         0.851           0.008           0.302           0.004           0.003           -279.076
Clayton    C       3.641           0.105           1.062           0.309           0.048           0.032           -278.708
Clayton    D       6.168           0.000           0.106           0.344           0.010           0.010           -279.748
Gumbel     A       1.377           1.092           1.064           0.216           0.100           0.053           -280.889
Gumbel     B       1.439           1.107           0.107           0.222           0.090           0.036           -281.077
Gumbel     C       1.396           1.102           1.076           0.207           0.092           0.039           -280.990
Gumbel     D       1.395           1.086           0.679           0.231           0.106           0.063           -280.754
Frank      A       6.538           0.925           1.031           0.303           0.074           0.026           -278.503
Frank      C       6.613           0.934           1.057           0.304           0.077           0.030           -278.419
Frank      D       6.998           1.047           0.023           0.296           0.058           0.002           -278.625
Joe        A       1.343           1.105           1.088           0.182           0.107           0.071           -281.521
Joe        B       1.460           1.120           0.082           0.212           0.085           0.028           -281.175
Joe        C       1.367           1.117           1.107           0.171           0.095           0.054           -281.703
Joe        D       1.380           1.104           0.881           0.210           0.118           0.081           -281.166
Galambos   A       0.614           0.326           1.066           0.209           0.105           0.055           -280.765
Galambos   C       0.632           0.341           1.078           0.200           0.097           0.040           -280.868
Galambos   D       0.640           0.318           0.687           0.226           0.110           0.064           -280.647
3-variate  Normal  0.407           0.174           0.091           0.248           0.089           0.048           -280.500

In conclusion, the best fit was provided by the Frank copula with Laplace transform LTC. Table 3 presents the estimated parameters and standard errors for this model.

4.1 Sensitivity analysis

As a sensitivity analysis, we modelled the trivariate data with different marginal models, i.e., negative binomial and three-component finite mixture of Poisson distributions, and the same normal copula system. Moreover, we included different covariate sets to identify the effect of the covariates on the copula parameters. As noted in Table 4, the estimates of the copula parameters are insensitive to the inclusion of covariates and to the choice of marginal models.

Table 3.
Estimates and standard errors (s.e.), computed by the jackknife method proposed by [Joe97], for the best-fitting (largest likelihood) copula, with Laplace transform family LTC and maxid bivariate copula family Frank.

              Rapes                        Manslaughter                 Smuggling
Covariate     coefficient  Jackknife s.e.  coefficient  Jackknife s.e.  coefficient  Jackknife s.e.
(Intercept)   -2.062       0.884           -0.824       0.991           -1.476       1.732
pop           3.801        5.010           2.723        4.767           4.589        8.666
unemp         0.054        0.024           0.010        0.024           -0.024       0.047
bord          0.197        0.337           -0.651       0.357           0.467        0.618
aep           0.176        0.102           0.192        0.089           0.156        0.201
ast           -0.220       0.714           0.405        0.524           0.179        1.105
theta         2.991        2.099           2.848        3.817           0.696        0.312

Dependence    estimate     Jackknife s.e.
$\theta_{12}$ 6.613        3.019
$\theta_{13}$ 0.934        1.917
$\theta_{23}$ 1.057        0.108

Table 4. Sensitivity analysis for the effect of marginal distributions on the normal copula parameters (NB: negative binomial margins; MP: mixture of Poisson margins).

                        $\hat{r}_{12}$  $\hat{r}_{12}$  $\hat{r}_{13}$  $\hat{r}_{13}$  $\hat{r}_{23}$  $\hat{r}_{23}$
Covariates              NB              MP              NB              MP              NB              MP
pop+unemp+bord+aep+ast  0.407           0.477           0.174           0.248           0.091           0.296
unemp+bord+aep+ast      0.524           0.561           0.258           0.433           0.188           0.355
pop+bord+aep+ast        0.412           0.452           0.141           0.064           0.087           0.202
pop+unemp+aep+ast       0.364           0.432           0.185           0.236           0.081           0.019
pop+unemp+bord+ast      0.438           0.532           0.242           0.190           0.133           0.094
pop+unemp+bord+aep      0.387           0.410           0.166           0.224           0.105           0.318

5 Concluding remarks

Modelling of multivariate count data based on copula systems was described in the previous sections. By using copula functions, the estimation procedure is decomposed into two smaller problems: estimation of the marginal parameters and estimation of the copula parameters. Caution is needed in the specification of the marginal distributions, since possible errors in the first step will be exaggerated at the second step. Moreover, the covariates included in the univariate models have an effect on the dependence (copula parameters). A solution might be to specify the marginal distributions non-parametrically.

Be aware that the applicability of the normal copula, with its flexible positive and negative dependence, becomes limited as the sample size increases and computational problems appear.
On the other hand, maxid copulas, which are applicable even for large sample sizes, allow only flexible positive dependence. There is a need for a copula without computational complications and with a wide range of flexible dependence, which is the objective of our future research.

In conclusion, few copulas appeared to be suitable, as indicated by the log-likelihood principle. Among the candidate models we chose the one with the largest log-likelihood as the most appropriate for inference. The precision of its estimates is conditional on the selected model and does not reflect model selection uncertainty, so confidence intervals may have less than nominal coverage. By model selection uncertainty we mean that the likelihoods of some models are quite close to the largest likelihood. If we resampled our data, one of these models, rather than the best model for the raw data, would probably be chosen.

References

[BLNZ95] R. H. Byrd, P. Lu, J. Nocedal, and C. Zhu. A limited memory algorithm for bound constrained optimization. SIAM J. Scientific Computing, 16:1190–1208, 1995.
[Cla78] D. G. Clayton. A model for association in bivariate life tables and its application in epidemiological studies of familial tendency in chronic disease incidence. Biometrika, 65(1):141–151, 1978.
[CLTZ04] A. Colin Cameron, Tong Li, Pravin K. Trivedi, and David M. Zimmer. Modelling the differences in counted outcomes using bivariate copula models with application to mismeasured counts. Econom. J., 7(2):566–584, 2004.
[DL05] Michel Denuit and Philippe Lambert. Constraints on concordance measures in bivariate discrete data. J. Multivariate Anal., 93(1):40–57, 2005.
[Fra79] M. J. Frank. On the simultaneous associativity of F(x, y) and x + y − F(x, y). Aequationes Math., 19(2-3):194–226, 1979.
[Gal75] János Galambos. Order statistics of samples from multivariate distributions. J. Amer. Statist. Assoc., 70(351, part 1):674–680, 1975.
[Gum60] E. J. Gumbel.
Bivariate exponential distributions. J. Amer. Statist. Assoc., 55:698–707, 1960.
[Joe93] Harry Joe. Parametric families of multivariate distributions with given margins. J. Multivariate Anal., 46(2):262–282, 1993.
[Joe96] H. Joe. Multivariate distributions from mixtures of max-infinitely divisible distributions. Journal of Multivariate Analysis, 57:240–265, 1996.
[Joe97] Harry Joe. Multivariate models and dependence concepts. Chapman & Hall, London, 1997.
[Joe05] H. Joe. Asymptotic efficiency of the two-stage estimation method for copula-based models. Journal of Multivariate Analysis, 94(2):401–419, 2005.
[Lee94] A. Lee. Modelling rugby league data via bivariate negative binomial regression. Australian and New Zealand Journal of Statistics, 41(2):141–152, 1994.
[Lee01] Lung-Fei Lee. On the range of correlation coefficients of bivariate ordered discrete random variables. Econometric Theory, 17(1):247–256, 2001.
[LV02] P. Lambert and F. Vandenhende. A copula-based model for multivariate non-normal longitudinal data: analysis of a dose titration safety study on a new antidepressant. Statist. Med., 21:3197–3217, 2002.
[MM94] S. G. Meester and J. MacKay. A parametric model for cluster correlated categorical data. Biometrics, 50:954–963, 1994.
[MN83] P. McCullagh and J. A. Nelder. Generalized linear models. Monographs on Statistics and Applied Probability. Chapman & Hall, London, 1983.
[Nas90] J. C. Nash. Compact numerical methods for computers: linear algebra and function minimisation. Hilger, New York, 1990. 2nd edition.
[Nel99] Roger B. Nelsen. An introduction to copulas, volume 139 of Lecture Notes in Statistics. Springer-Verlag, New York, 1999.
[Sch84] Mark J. Schervish. Algorithm AS 195: Multivariate normal probabilities with error bound. Applied Statistics, pages 81–94, 1984.
[Skl59] M. Sklar. Fonctions de répartition à n dimensions et leurs marges. Publ. Inst. Statist. Univ. Paris, 8:229–231, 1959.
[Son00] Peter Xue-Kun Song. Multivariate dispersion models generated from Gaussian copula. Scand. J. Statist., 27(2):305–320, 2000.
[TDB99] D. A. Trégouët, P. Ducimetière, V. Bocquet, S. Visvikis, F. Soubrier, and L. Tiret. A parametric copula model for analysis of familial binary data. Am J Hum Genet, 64(3):886–93, 1999.
[vO99] Hans van Ophem. A general method to estimate correlated discrete random variables. Econometric Theory, 15(2):228–237, 1999.