Modelling multivariate count data

Aristidis K. Nikoloulopoulos and Dimitris Karlis
Department of Statistics, Athens University of Economics, 76 Patission Str., 10434,
Athens, GREECE
{akn,karlis}@aueb.gr
Summary. Multivariate count data occur in several different disciplines. Models
for such data are few, mainly because of the computational complexity of applying
them. We propose models based on the currently fashionable idea of copulas,
choosing copulas that are appropriate for count data while taking computational
limitations into account. A real data application is provided.
1 Introduction
Multivariate count data occur in several disciplines, such as epidemiology, marketing,
criminology, sports statistics and industrial statistics, among others. However, flexible
models for such data are not widely available and are usually hard to fit to real data.
Copulas are currently fashionable models for dependent data, as they
separate the estimation of the marginal properties from that of the dependence structure.
While there are plenty of publications that treat continuous data, only a few
treat count data. The purpose of the present paper is to propose a model based
on copulas for multivariate count data.
Definition 1. ([Nel99]) A multivariate copula is a function $C$ from $I^n$ to $I = [0, 1]$ with
the following properties: (1) for every $u$ in $I^n$, $C(u) = 0$ if at least one coordinate
of $u$ is 0, and $C(u) = u_k$ if all coordinates of $u$ are 1 except $u_k$; (2) for every
$a$ and $b$ in $I^n$ such that $a \le b$, the $C$-volume satisfies $V_C([a, b]) \ge 0$.
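The two properties of Definition 1 can be checked numerically in the bivariate case. The sketch below is only an illustration written in Python; the choice of the Clayton copula with $\theta = 2$, the grid and the helper names are assumptions made here, not part of the paper.

```python
import math

def clayton(u, v, theta=2.0):
    """Bivariate Clayton copula (u^-theta + v^-theta - 1)^(-1/theta)."""
    if u == 0.0 or v == 0.0:
        return 0.0  # property (1): C = 0 when any coordinate is 0
    return (u ** -theta + v ** -theta - 1.0) ** (-1.0 / theta)

def c_volume(copula, a, b):
    """C-volume of the rectangle [a1, b1] x [a2, b2] by inclusion-exclusion."""
    return (copula(b[0], b[1]) - copula(a[0], b[1])
            - copula(b[0], a[1]) + copula(a[0], a[1]))

grid = [k / 10 for k in range(11)]

# Property (1): grounded, with uniform margins C(t, 1) = C(1, t) = t.
grounded = all(clayton(0.0, t) == 0.0 and clayton(t, 0.0) == 0.0 for t in grid)
margins = all(math.isclose(clayton(t, 1.0), t, abs_tol=1e-12)
              and math.isclose(clayton(1.0, t), t, abs_tol=1e-12) for t in grid)

# Property (2): every grid rectangle has non-negative C-volume.
min_volume = min(c_volume(clayton, (a1, a2), (b1, b2))
                 for a1, b1 in zip(grid, grid[1:])
                 for a2, b2 in zip(grid, grid[1:]))
```

A grid check of this kind cannot prove that a function is a copula, but it is a cheap way to catch a sign or exponent error in a formula.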
There are few papers on the use of copulas with discrete data. [MM94, TDB99,
Lee94, CLTZ04] use the Frank copula, see [Fra79], to model discrete bivariate data (i.e., count and binary data). The multivariate extension of the Frank copula
has two disadvantages: it is restricted to positive dependence, and it uses a single dependence
parameter for all the bivariate margins. The multivariate normal copula overcomes
these drawbacks. [Lee01, vO99] use the bivariate normal copula to model
count data, while [LV02, Son00] used it to model multivariate non-normal
longitudinal continuous data. For discrete data, generalization to the multivariate
case is not easy, since the joint probability function requires evaluating the copula at several different points, and hence multivariate numerical integration is needed,
leading to severe computational problems.
We propose the use of mixtures of max-infinitely divisible bivariate copulas,
see [Joe96], to obtain flexible positive dependence between the random variables.
We also fit a multivariate normal copula and discuss the computational problems
that arise.
2 Multivariate parametric families of copulas
2.1 Copulas via mixtures of max-infinitely divisible bivariate
copulas
Let $\Lambda$ be a univariate cumulative distribution function (cdf) of a positive random
variable ($\Lambda(0) = 0$), and let $\phi$ be the Laplace transform (LT) of $\Lambda$,
$$\phi(t) = \int_0^{\infty} \exp(-ts)\, d\Lambda(s), \qquad t \ge 0.$$
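As a worked example of this definition (the gamma mixing distribution is a choice made here for illustration), let $\Lambda$ be the gamma cdf with shape $1/\theta$ and unit scale. Its Laplace transform is

$$\phi(t) = \int_0^{\infty} e^{-ts}\, \frac{s^{1/\theta - 1} e^{-s}}{\Gamma(1/\theta)}\, ds = \frac{1}{\Gamma(1/\theta)} \int_0^{\infty} s^{1/\theta - 1} e^{-(1+t)s}\, ds = (1 + t)^{-1/\theta},$$

which is the family labelled LTB in Table 1.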
Mixtures of max-infinitely divisible (maxid) copulas have the form
$$C(u) = \phi\left( -\sum_{i<j} \log C'_{ij}\left( e^{-p_i \phi^{-1}(u_i)},\, e^{-p_j \phi^{-1}(u_j)} \right) + \sum_{i=1}^{m} \nu_i p_i \phi^{-1}(u_i) \right), \qquad (1)$$
where $\phi$ is a LT that introduces the smallest dependence between the random variables,
and the $C'_{ij}$ are maxid copulas that add some pairwise dependence. The parameters $\nu_i$
are included so that the parametric family of multivariate copulas (1) is closed
under margins.
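A minimal Python sketch of (1) for $m = 3$ may help to see how the pieces fit together. The text does not define $p_i$; the sketch assumes $p_i = (\nu_i + m - 1)^{-1}$, which is the value for which setting all but one argument to 1 returns that argument (uniform margins). The function names and parameter values are illustrative choices, using the LTC transform and Frank copula formulas of Table 1.

```python
import math
from itertools import combinations

def lt_c(t, theta):
    """Laplace transform family LTC: 1 - (1 - e^{-t})^{1/theta}."""
    return 1.0 - (1.0 - math.exp(-t)) ** (1.0 / theta)

def lt_c_inv(u, theta):
    """Inverse of lt_c for u in (0, 1]."""
    return -math.log(1.0 - (1.0 - u) ** theta)

def frank(u, v, theta):
    """Bivariate Frank copula."""
    num = (math.exp(-theta * u) - 1.0) * (math.exp(-theta * v) - 1.0)
    return -math.log(1.0 + num / (math.exp(-theta) - 1.0)) / theta

def maxid_mixture(u, phi, phi_inv, pair_copulas, nu):
    """Evaluate form (1); pair_copulas maps (i, j), i < j, to C'_ij."""
    m = len(u)
    p = [1.0 / (nu_i + m - 1) for nu_i in nu]  # assumed definition of p_i
    x = [phi_inv(u_i) for u_i in u]
    arg = sum(nu[i] * p[i] * x[i] for i in range(m))
    for i, j in combinations(range(m), 2):
        c_ij = pair_copulas[(i, j)](math.exp(-p[i] * x[i]),
                                    math.exp(-p[j] * x[j]))
        arg -= math.log(c_ij)
    return phi(max(arg, 0.0))  # guard against rounding just below zero

theta = 1.5  # illustrative LT parameter
pairs = {
    (0, 1): lambda a, b: frank(a, b, 6.0),
    (0, 2): lambda a, b: frank(a, b, 1.0),
    (1, 2): lambda a, b: a * b,  # independence copula
}
nu = [0.0, -1.0, -1.0]

def C(u):
    return maxid_mixture(u, lambda t: lt_c(t, theta),
                         lambda w: lt_c_inv(w, theta), pairs, nu)
```

With these settings `C((0.3, 1.0, 1.0))` returns 0.3 up to rounding, and `C((1.0, u2, u3))` reduces to the Archimedean copula generated by the Laplace transform alone.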
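Some members of the maxid bivariate copula and Laplace transform families are presented
in Table 1.

Table 1. Max-infinitely divisible distributions

| Family | $C'(u, v; \theta)$ | LT | $\phi(t; \theta)$ | $\theta \in$ |
|---|---|---|---|---|
| Gumbel [Gum60] | $\exp\{-[(-\log u)^{\theta} + (-\log v)^{\theta}]^{1/\theta}\}$ | LTA | $\exp(-t^{1/\theta})$ | $[1, \infty)$ |
| Clayton [Cla78] | $(u^{-\theta} + v^{-\theta} - 1)^{-1/\theta}$ | LTB | $(1 + t)^{-1/\theta}$ | $(0, \infty)$ |
| Joe [Joe93] | $1 - [(1-u)^{\theta} + (1-v)^{\theta} - (1-u)^{\theta}(1-v)^{\theta}]^{1/\theta}$ | LTC | $1 - (1 - e^{-t})^{1/\theta}$ | $[1, \infty)$ |
| Frank [Fra79] | $-\frac{1}{\theta} \ln\left(1 + \frac{(e^{-\theta u} - 1)(e^{-\theta v} - 1)}{e^{-\theta} - 1}\right)$ | LTD | $-\theta^{-1} \log\{1 - (1 - e^{-\theta})e^{-t}\}$ | $(0, \infty)$ |
| Galambos [Gal75] | $uv \exp\{[(-\log u)^{-\theta} + (-\log v)^{-\theta}]^{-1/\theta}\}$ | - | - | $[0, \infty)$ |

In conclusion, combining the preceding Laplace transforms and bivariate copulas
yields 20 parametric families of the form (1) with flexible dependence structure.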
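Each Laplace transform in Table 1 sits beside the bivariate copula that it generates through the Archimedean construction $\phi(\phi^{-1}(u) + \phi^{-1}(v))$. The following Python sketch checks this pairing numerically; the inverse transforms were derived here from the table entries, and the dictionary layout and the value $\theta = 2$ are illustrative assumptions.

```python
import math

# (phi, phi inverse) for the Laplace transform families LTA-LTD of Table 1.
LT = {
    "LTA": (lambda t, th: math.exp(-t ** (1 / th)),
            lambda u, th: (-math.log(u)) ** th),
    "LTB": (lambda t, th: (1 + t) ** (-1 / th),
            lambda u, th: u ** -th - 1),
    "LTC": (lambda t, th: 1 - (1 - math.exp(-t)) ** (1 / th),
            lambda u, th: -math.log(1 - (1 - u) ** th)),
    "LTD": (lambda t, th: -math.log(1 - (1 - math.exp(-th)) * math.exp(-t)) / th,
            lambda u, th: -math.log((1 - math.exp(-th * u)) / (1 - math.exp(-th)))),
}

# Closed-form bivariate copulas from the same rows of Table 1.
COPULA = {
    "LTA": lambda u, v, th: math.exp(  # Gumbel
        -((-math.log(u)) ** th + (-math.log(v)) ** th) ** (1 / th)),
    "LTB": lambda u, v, th: (u ** -th + v ** -th - 1) ** (-1 / th),  # Clayton
    "LTC": lambda u, v, th: 1 - (  # Joe
        (1 - u) ** th + (1 - v) ** th - (1 - u) ** th * (1 - v) ** th) ** (1 / th),
    "LTD": lambda u, v, th: -math.log(  # Frank
        1 + (math.exp(-th * u) - 1) * (math.exp(-th * v) - 1)
        / (math.exp(-th) - 1)) / th,
}

def archimedean(name, u, v, th):
    """Bivariate Archimedean copula phi(phi^{-1}(u) + phi^{-1}(v))."""
    phi, phi_inv = LT[name]
    return phi(phi_inv(u, th) + phi_inv(v, th), th)

points = [0.1, 0.3, 0.5, 0.7, 0.9]
max_gap = max(abs(archimedean(n, u, v, 2.0) - COPULA[n](u, v, 2.0))
              for n in LT for u in points for v in points)

# Four Laplace transforms times five bivariate copula families.
n_families = len(LT) * 5
```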
2.2 Multivariate normal copula
The multivariate extensions of bivariate elliptical copulas continue to allow both positive and negative dependence between random variables, in contrast to the multivariate Frank
copula described in the Introduction.
Definition 2. The $n$-variate normal copula with linear correlation matrix $R$ is
$$C_R^n(u) = \Phi_R^n\left( \Phi^{-1}(u_1), \ldots, \Phi^{-1}(u_n) \right),$$
where $\Phi$ is the N(0,1) c.d.f., $\Phi^{-1}$ is the functional inverse of $\Phi$ and $\Phi_R^n$ is the $n$-variate standard normal c.d.f. with linear correlation matrix $R$.
This copula system allows flexible dependence, both positive and negative, in
contrast to the maxid copulas.
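For $n = 2$ the normal copula can be evaluated with one-dimensional numerical integration, which gives a feel for why the cost grows quickly with the dimension. The sketch below is a Python illustration under assumptions made here: it uses Simpson's rule on the identity $\Phi_2(a, b; \rho) = \int_{-\infty}^{a} \varphi(x)\, \Phi\big((b - \rho x)/\sqrt{1 - \rho^2}\big)\, dx$ (with $\varphi$ the N(0,1) density) and truncates the lower limit at $-8$. It is not the [Sch84] routine used in the application.

```python
import math
from statistics import NormalDist

_STD = NormalDist()  # standard normal distribution

def bvn_cdf(a, b, rho, steps=2000):
    """Bivariate standard normal cdf by Simpson's rule (steps must be even)."""
    lo = -8.0  # assumed truncation point; Phi(-8) is negligible
    if a <= lo:
        return 0.0
    s = math.sqrt(1.0 - rho * rho)

    def f(x):
        return _STD.pdf(x) * _STD.cdf((b - rho * x) / s)

    h = (a - lo) / steps
    total = f(lo) + f(a)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * f(lo + k * h)
    return total * h / 3.0

def normal_copula(u, v, rho):
    """Bivariate normal copula for u, v in (0, 1) and |rho| < 1."""
    return bvn_cdf(_STD.inv_cdf(u), _STD.inv_cdf(v), rho)

positive = normal_copula(0.5, 0.5, 0.5)   # above 0.25: positive dependence
negative = normal_copula(0.5, 0.5, -0.5)  # below 0.25: negative dependence
```

At $u = v = 0.5$ the exact value is $1/4 + \arcsin(\rho)/(2\pi)$, which provides a convenient check of the integration.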
3 Estimation of a multivariate copula based parametric
model
Consider a multivariate copula-based parametric model for the random vector $Y$
with distribution function $H$ given by the copula representation of the
theorem of [Skl59],
$$H(y; \alpha_1, \ldots, \alpha_n, \theta) = C(F_1(y_1; \alpha_1), \ldots, F_n(y_n; \alpha_n); \theta), \qquad (2)$$
where the $F_i$ are marginal distributions with parameters $\alpha_i$, $i = 1, \ldots, n$, and $\theta$ is the
vector of copula parameters. The probability mass function of $H$ in (2), that is, its
Radon-Nikodym derivative with respect to the counting measure ([Son00]), is obtained
by taking finite differences of the cumulative distribution function
through its copula representation.
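The finite-difference step can be sketched as follows: the probability of the point $y$ is the $H$-volume of the box $(y_1 - 1, y_1] \times \cdots \times (y_n - 1, y_n]$, an alternating sum over its $2^n$ vertices. The Python code below is an illustration only; the Poisson margins, the Frank copula and all names are assumptions made here (the application uses negative binomial margins).

```python
import math
from itertools import product

def poisson_cdf(y, mu):
    """Poisson cdf; zero for y < 0."""
    if y < 0:
        return 0.0
    return sum(math.exp(-mu) * mu ** k / math.factorial(k) for k in range(y + 1))

def copula_pmf(y, cdfs, copula):
    """P(Y = y) as an alternating sum of the copula over 2^n box vertices."""
    total = 0.0
    for shifts in product((0, 1), repeat=len(y)):
        u = [F(y_i - s) for F, y_i, s in zip(cdfs, y, shifts)]
        total += (-1) ** sum(shifts) * copula(u)
    return total

def frank(u, theta=3.0):
    """Bivariate Frank copula taking a list [u1, u2]."""
    num = (math.exp(-theta * u[0]) - 1.0) * (math.exp(-theta * u[1]) - 1.0)
    return -math.log(1.0 + num / (math.exp(-theta) - 1.0)) / theta

cdfs = [lambda y: poisson_cdf(y, 2.0), lambda y: poisson_cdf(y, 3.0)]

# Under the independence copula the result is the product of the margins.
independent = copula_pmf((2, 3), cdfs, math.prod)

# Under the Frank copula the probabilities over a large grid sum to about 1.
grid_total = sum(copula_pmf((a, b), cdfs, frank)
                 for a in range(26) for b in range(26))
```

Each evaluation of the probability mass function needs $2^n$ copula evaluations, which is cheap for the closed-form copulas of Table 1 but expensive when every evaluation is itself a multivariate normal integral.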
For estimation purposes we concentrate on likelihood methods. Consider the
$n$ log-likelihood functions for the univariate marginal distributions,
$$L_{y_j}(\alpha_j) = \sum_{i=1}^{N} \log f_j(y_{ij}; \alpha_j), \qquad j = 1, \ldots, n, \qquad (3)$$
and the joint log-likelihood
$$L(\theta, \alpha_1, \ldots, \alpha_n) = \sum_{i=1}^{N} \log h(y_{i1}, \ldots, y_{in}; \alpha_1, \ldots, \alpha_n, \theta), \qquad (4)$$
where $N$ is the sample size.
Efficient estimation of the model parameters is achieved by the inference function for margins (IFM) method ([Joe97]), which is a two-step approach. In
the first step the univariate log-likelihoods (3) are maximized independently of the copula parameters, and in the second step the joint log-likelihood
(4) is maximized over $\theta$ with the univariate parameters held fixed at their first-step estimates.
The traditional full maximum likelihood (FML) method
maximizes the joint log-likelihood (4) simultaneously over the copula and marginal
parameters ([Son00, TDB99, MM94]). The IFM estimates provide initial values for the FML estimates,
which reduces the computational effort. Estimation by the IFM
method becomes more attractive as the dimension increases and computational problems arise. The problem of fitting multivariate data is decomposed into two smaller
problems: fitting the marginal distributions, and separately fitting the dependence structure. The asymptotic efficiency of the two-step method was studied in [Joe05]
for a number of multivariate models. All of these comparisons suggest that the IFM
method is highly efficient compared with FML, except for extreme cases near the
Fréchet bounds.
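The two IFM steps can be sketched in Python on a toy bivariate example. Everything specific in the sketch is an assumption made for illustration: the ten data pairs are invented, the margins are Poisson (so that step 1 has the closed form of the sample mean), the copula is Frank, and step 2 uses a simple grid search instead of a quasi-Newton method.

```python
import math

def poisson_cdf(y, mu):
    if y < 0:
        return 0.0
    return sum(math.exp(-mu) * mu ** k / math.factorial(k) for k in range(y + 1))

def frank(u, v, theta):
    num = (math.exp(-theta * u) - 1.0) * (math.exp(-theta * v) - 1.0)
    return -math.log(1.0 + num / (math.exp(-theta) - 1.0)) / theta

def pair_pmf(y1, y2, mu1, mu2, theta):
    """Joint pmf h(y1, y2) by finite differences of the copula cdf."""
    a0, a1 = poisson_cdf(y1 - 1, mu1), poisson_cdf(y1, mu1)
    b0, b1 = poisson_cdf(y2 - 1, mu2), poisson_cdf(y2, mu2)
    return (frank(a1, b1, theta) - frank(a0, b1, theta)
            - frank(a1, b0, theta) + frank(a0, b0, theta))

def joint_loglik(data, mu1, mu2, theta):
    """Joint log-likelihood (4); the floor avoids log(0) from rounding."""
    return sum(math.log(max(pair_pmf(y1, y2, mu1, mu2, theta), 1e-300))
               for y1, y2 in data)

def ifm(data, theta_grid):
    n = len(data)
    # Step 1: maximize each marginal log-likelihood (3) separately.
    # For Poisson margins the maximizer is the sample mean.
    mu1 = sum(y1 for y1, _ in data) / n
    mu2 = sum(y2 for _, y2 in data) / n
    # Step 2: maximize (4) over theta with the margins held fixed.
    theta = max(theta_grid, key=lambda th: joint_loglik(data, mu1, mu2, th))
    return mu1, mu2, theta

# Invented toy data, for illustration only.
toy = [(0, 1), (1, 1), (2, 3), (3, 2), (1, 0),
       (4, 5), (2, 2), (5, 4), (0, 0), (3, 4)]
grid = [0.5 * k for k in range(1, 41)]  # theta in 0.5, 1.0, ..., 20.0
mu1_hat, mu2_hat, theta_hat = ifm(toy, grid)
```

A full maximum likelihood fit would start from `(mu1_hat, mu2_hat, theta_hat)` and maximize `joint_loglik` over all three parameters jointly.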
Covariates can be included by modelling the univariate responses
with a generalized linear model ([MN83]). More on this topic is covered in
the following application example for count data.
4 Application
We jointly modelled the numbers of different crimes in the regions of Greece for the
year 1997. Three types of crime were considered (manslaughters, rapes and smuggling),
together with some exogenous predictors: the population of the area (in millions), the gross domestic product, the unemployment rate, whether the region has
a city with more than 100000 inhabitants, and a dummy indicating whether
the region lies on the borders of Greece (to account for economic refugees who
enter the country illegally). The peculiar feature of this data set is its small sample size, since Greece has only 50 regions. This makes the
normal copula applicable to discrete, and particularly count, data. The probability mass
function of a normal copula based model is obtained using finite differences, so a
normal probability integral must be computed at several points, which causes problems
in the maximization when estimating the model parameters. This was not the case for
our data set. The probability mass function $h(y)$ of the normal copula based model
was computed using the routine of [Sch84] for multivariate normal rectangle probabilities. The
log-likelihood of the normal copula based model was maximized using a quasi-Newton
iterative algorithm ([Nas90]), while that of the mixtures of maxid copulas was
maximized numerically using a quasi-Newton iterative algorithm ([BLNZ95]) implemented in
the optim function in R. The latter allows box constraints, meaning that each parameter can be given a lower and/or upper bound, which is needed because of the positive dependence restriction on this type of multivariate copula system. Finally, we computed standard
errors of the estimates using the jackknife method as proposed by [Joe97]. The jackknife
estimator of the asymptotic covariance matrix of $\hat\varrho$ was
$$\sum_{i=1}^{N} (\hat\varrho^{(i)} - \hat\varrho)^T (\hat\varrho^{(i)} - \hat\varrho),$$
where $\hat\varrho^{(i)}$ is the estimator of $\varrho = (\alpha_1, \ldots, \alpha_n, \theta)$ with the $i$th observation deleted,
$i = 1, \ldots, 50$.
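The jackknife formula can be sketched generically in Python. The function below follows the expression as printed (a plain sum of outer products of the leave-one-out deviations, without any further scaling factor); the estimator used in the example, a vector of sample means on four invented rows, is only a stand-in for the model fit.

```python
def jackknife_cov(data, estimator):
    """Sum over i of the outer products of (estimate without i - full estimate)."""
    full = estimator(data)
    k = len(full)
    cov = [[0.0] * k for _ in range(k)]
    for i in range(len(data)):
        loo = estimator(data[:i] + data[i + 1:])  # i-th observation deleted
        d = [a - b for a, b in zip(loo, full)]
        for r in range(k):
            for c in range(k):
                cov[r][c] += d[r] * d[c]
    return cov

def mean_vector(rows):
    """Stand-in estimator: column means of the data."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

# Invented rows, for illustration only.
rows = [(1.0, 2.0), (2.0, 1.0), (4.0, 3.0), (3.0, 6.0)]
cov = jackknife_cov(rows, mean_vector)
se = [cov[j][j] ** 0.5 for j in range(2)]  # jackknife standard errors
```

For the sample mean the leave-one-out deviation is $(\bar{x} - x_i)/(N - 1)$, so the diagonal entries equal $\sum_i (x_i - \bar{x})^2/(N - 1)^2$, which is easy to verify by hand. In the application each leave-one-out estimate requires refitting the model, that is, 50 refits.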
We used the following models:
• Normal copula based model, allowing the inclusion of covariates via a negative
binomial model for marginal distributions.
• Mixtures of max-infinitely divisible copulas with the same negative binomial
marginal distributions.
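A negative binomial margin with covariates can be sketched as below. The parameterization is an assumption made here, since the paper does not state it: mean $\mu = \exp(x^T \beta)$ through a log link and variance $\mu + \mu^2/\text{size}$. The covariate and coefficient values are hypothetical.

```python
import math

def nb_pmf(y, mu, size):
    """Negative binomial pmf with mean mu and variance mu + mu^2 / size."""
    return math.exp(math.lgamma(y + size) - math.lgamma(size)
                    - math.lgamma(y + 1)
                    + size * math.log(size / (size + mu))
                    + y * math.log(mu / (size + mu)))

def nb_cdf(y, mu, size):
    """Marginal cdf F_i(y), the argument passed to the copula."""
    if y < 0:
        return 0.0
    return sum(nb_pmf(k, mu, size) for k in range(y + 1))

def nb_mean(x, beta):
    """Log link: mu = exp(x' beta)."""
    return math.exp(sum(x_i * b_i for x_i, b_i in zip(x, beta)))

# Hypothetical covariate vector (with intercept) and coefficients.
mu = nb_mean([1.0, 0.5], [0.2, 1.0])
size = 2.0
total = sum(nb_pmf(k, mu, size) for k in range(500))
mean = sum(k * nb_pmf(k, mu, size) for k in range(500))
var = sum((k - mu) ** 2 * nb_pmf(k, mu, size) for k in range(500))
```

Both copula models then share the same margins: `nb_cdf` supplies the values $F_i(y_i; \alpha_i)$ in (2), with $\alpha_i$ collecting the regression coefficients and the dispersion parameter.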
After preliminary analysis, we simplified the model and the numerical computations
by setting $C'_{23} = \Pi$ (the independence copula) and $\nu_1 = 0$, $\nu_2 = \nu_3 = -1$. In this manner
we assumed a lower level of dependence for the (2,3) bivariate margin, represented
by the parameter $\theta$ of the Laplace transform $\phi$, and a higher dependence for the
(1,2) and (1,3) bivariate margins, with the parameters $\theta_{12}$ and $\theta_{13}$ representing bivariate dependence exceeding the minimum dependence of the Laplace transform $\phi$.
As shown in Table 2, the log-likelihood values were affected greatly by the family of
bivariate copulas $C'_{ij}$ but not by the family of the Laplace transform $\phi$. The estimated
dependence and marginal parameters can be interpreted through Kendall's tau, derived in [DL05]. Table 2 presents the dependence parameters for all fitted copula systems,
together with the corresponding Kendall's tau values.
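The effect of this simplification on (1) can be written out explicitly. The derivation assumes $p_i = (\nu_i + m - 1)^{-1}$, the value for which (1) has uniform margins; the text itself does not define $p_i$. With $m = 3$ this gives $p_1 = 1/2$ and $p_2 = p_3 = 1$. Writing $x_i = \phi^{-1}(u_i)$, the independence copula contributes $-\log \Pi(e^{-x_2}, e^{-x_3}) = x_2 + x_3$, which cancels $\sum_i \nu_i p_i x_i = -x_2 - x_3$, so that

$$C(u) = \phi\left( -\log C'_{12}\left(e^{-x_1/2}, e^{-x_2}\right) - \log C'_{13}\left(e^{-x_1/2}, e^{-x_3}\right) \right).$$

Setting $u_1 = 1$ (so $x_1 = 0$) gives the (2,3) margin

$$C(1, u_2, u_3) = \phi\left( \phi^{-1}(u_2) + \phi^{-1}(u_3) \right),$$

an Archimedean copula whose dependence is governed by the parameter $\theta$ of $\phi$ alone.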
Table 2. Estimates of the dependence parameters, Kendall's tau and log-likelihoods

| Copula | LT | $\theta_{12}$ | $\theta_{13}$ | $\hat\theta$ | $\tau_{12}$ | $\tau_{13}$ | $\tau_{23}$ | log-likelihood |
|---|---|---|---|---|---|---|---|---|
| Clayton | A | 3.522 | 0.098 | 1.033 | 0.308 | 0.043 | 0.028 | -278.832 |
| Clayton | B | 502.747 | 0.851 | 0.008 | 0.302 | 0.004 | 0.003 | -279.076 |
| Clayton | C | 3.641 | 0.105 | 1.062 | 0.309 | 0.048 | 0.032 | -278.708 |
| Clayton | D | 6.168 | 0.000 | 0.106 | 0.344 | 0.010 | 0.010 | -279.748 |
| Gumbel | A | 1.377 | 1.092 | 1.064 | 0.216 | 0.100 | 0.053 | -280.889 |
| Gumbel | B | 1.439 | 1.107 | 0.107 | 0.222 | 0.090 | 0.036 | -281.077 |
| Gumbel | C | 1.396 | 1.102 | 1.076 | 0.207 | 0.092 | 0.039 | -280.990 |
| Gumbel | D | 1.395 | 1.086 | 0.679 | 0.231 | 0.106 | 0.063 | -280.754 |
| Frank | A | 6.538 | 0.925 | 1.031 | 0.303 | 0.074 | 0.026 | -278.503 |
| Frank | C | 6.613 | 0.934 | 1.057 | 0.304 | 0.077 | 0.030 | -278.419 |
| Frank | D | 6.998 | 1.047 | 0.023 | 0.296 | 0.058 | 0.002 | -278.625 |
| Joe | A | 1.343 | 1.105 | 1.088 | 0.182 | 0.107 | 0.071 | -281.521 |
| Joe | B | 1.460 | 1.120 | 0.082 | 0.212 | 0.085 | 0.028 | -281.175 |
| Joe | C | 1.367 | 1.117 | 1.107 | 0.171 | 0.095 | 0.054 | -281.703 |
| Joe | D | 1.380 | 1.104 | 0.881 | 0.210 | 0.118 | 0.081 | -281.166 |
| Galambos | A | 0.614 | 0.326 | 1.066 | 0.209 | 0.105 | 0.055 | -280.765 |
| Galambos | C | 0.632 | 0.341 | 1.078 | 0.200 | 0.097 | 0.040 | -280.868 |
| Galambos | D | 0.640 | 0.318 | 0.687 | 0.226 | 0.110 | 0.064 | -280.647 |
| 3-variate | Normal | 0.407 | 0.174 | 0.091 | 0.248 | 0.089 | 0.048 | -280.500 |
In conclusion, the best fit was provided by the Frank copula with Laplace transform LTC.
Table 3 presents the estimated parameters and standard errors for this model.
4.1 Sensitivity analysis
As a sensitivity analysis we modelled the trivariate data using different
marginal models, i.e., negative binomial and three finite mixture Poisson distributions, with the same normal copula system. Moreover, we included different covariate
sets to identify the effect of the covariates on the copula parameters.
As can be seen in Table 4, the estimates of the copula parameters are insensitive to the
inclusion of covariates and the choice of marginal models.
Table 3. Estimates and standard errors (s.e.), computed by the jackknife method proposed by
Joe (1997), for the best-fitting (largest likelihood) copula, with Laplace transform family
LTC and maxid bivariate copula family Frank.

| Covariate | Rapes coefficient | Rapes jackknife s.e. | Manslaughter coefficient | Manslaughter jackknife s.e. | Smuggling coefficient | Smuggling jackknife s.e. |
|---|---|---|---|---|---|---|
| (Intercept) | -2.062 | 0.884 | -0.824 | 0.991 | -1.476 | 1.732 |
| pop | 3.801 | 5.010 | 2.723 | 4.767 | 4.589 | 8.666 |
| unemp | 0.054 | 0.024 | 0.010 | 0.024 | -0.024 | 0.047 |
| bord | 0.197 | 0.337 | -0.651 | 0.357 | 0.467 | 0.618 |
| aep | 0.176 | 0.102 | 0.192 | 0.089 | 0.156 | 0.201 |
| ast | -0.220 | 0.714 | 0.405 | 0.524 | 0.179 | 1.105 |
| theta | 2.991 | 2.099 | 2.848 | 3.817 | 0.696 | 0.312 |

| Dependence | Estimate | Jackknife s.e. |
|---|---|---|
| $\theta_{12}$ | 6.613 | 3.019 |
| $\theta_{13}$ | 0.934 | 1.917 |
| $\theta_{23}$ | 1.057 | 0.108 |
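Table 4. Sensitivity analysis for the effect of marginal distributions on the normal
copula parameters (NB: negative binomial margins; MP: mixture Poisson margins).

| Covariates | $\hat r_{12}$ NB | $\hat r_{12}$ MP | $\hat r_{13}$ NB | $\hat r_{13}$ MP | $\hat r_{23}$ NB | $\hat r_{23}$ MP |
|---|---|---|---|---|---|---|
| pop+unemp+bord+aep+ast | 0.407 | 0.477 | 0.174 | 0.248 | 0.091 | 0.296 |
| unemp+bord+aep+ast | 0.524 | 0.561 | 0.258 | 0.433 | 0.188 | 0.355 |
| pop+bord+aep+ast | 0.412 | 0.452 | 0.141 | 0.064 | 0.087 | 0.202 |
| pop+unemp+aep+ast | 0.364 | 0.432 | 0.185 | 0.236 | 0.081 | 0.019 |
| pop+unemp+bord+ast | 0.438 | 0.532 | 0.242 | 0.190 | 0.133 | 0.094 |
| pop+unemp+bord+aep | 0.387 | 0.410 | 0.166 | 0.224 | 0.105 | 0.318 |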
5 Concluding remarks
We have described the modelling of multivariate count data based on copula systems.
By using copula functions, the estimation procedure is decomposed into two
smaller problems: estimation of the marginal parameters and estimation of the copula parameters.
Caution is needed in the specification of the marginal distributions, since possible errors in the first
step will be amplified in the second step. Moreover, the covariates
included in the univariate models have an effect on the dependence (copula) parameters. A solution might
be to specify the marginal distributions non-parametrically.
Note that the applicability of the normal copula, with its flexible positive and negative dependence, becomes limited as the sample size increases and computational problems
appear. On the other hand, maxid copulas, which remain applicable even for large sample sizes,
allow only positive flexible dependence. There is a need for a copula that is free of computational complications and offers a wide range of flexible dependence; this is the objective
of our future research.
In conclusion, few copulas appeared to be suitable as indicated by the log-likelihood principle. Among the candidate models we chose the one with the largest
log-likelihood as the most appropriate for inference. The precision of its estimates is conditional on the selected model and does not reflect model selection uncertainty, so confidence intervals may have less than nominal coverage. By
model selection uncertainty we mean that the likelihoods of some models
are quite close to the largest likelihood: if we re-sampled our data,
one of these models would probably be chosen instead of the model that is best for the raw data.
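How close the competing models are can be read directly from the log-likelihoods of Table 2. The short Python sketch below selects the model with the largest log-likelihood and lists those within 0.5 log-likelihood units of it; the 0.5 threshold is an arbitrary illustrative choice, not a criterion used in the paper.

```python
# Log-likelihoods from Table 2, keyed by (bivariate copula, Laplace transform).
loglik = {
    ("Clayton", "A"): -278.832, ("Clayton", "B"): -279.076,
    ("Clayton", "C"): -278.708, ("Clayton", "D"): -279.748,
    ("Gumbel", "A"): -280.889, ("Gumbel", "B"): -281.077,
    ("Gumbel", "C"): -280.990, ("Gumbel", "D"): -280.754,
    ("Frank", "A"): -278.503, ("Frank", "C"): -278.419,
    ("Frank", "D"): -278.625,
    ("Joe", "A"): -281.521, ("Joe", "B"): -281.175,
    ("Joe", "C"): -281.703, ("Joe", "D"): -281.166,
    ("Galambos", "A"): -280.765, ("Galambos", "C"): -280.868,
    ("Galambos", "D"): -280.647,
    ("3-variate", "Normal"): -280.500,
}

best = max(loglik, key=loglik.get)

# Models whose log-likelihood is within the (arbitrary) threshold of the best.
threshold = 0.5
close = sorted(model for model, value in loglik.items()
               if loglik[best] - value <= threshold)
```

With this threshold, `close` contains the Frank models with LTA, LTC and LTD and the Clayton models with LTA and LTC, which illustrates why a re-sample could easily lead to a different selected model.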
References
[BLNZ95] R. H. Byrd, P. Lu, J. Nocedal, and C. Zhu. A limited memory algorithm for bound constrained optimization. SIAM J. Scientific Computing, 16:1190–1208, 1995.
[Cla78] D. G. Clayton. A model for association in bivariate life tables and its application in epidemiological studies of familial tendency in chronic disease incidence. Biometrika, 65(1):141–151, 1978.
[CLTZ04] A. Colin Cameron, Tong Li, Pravin K. Trivedi, and David M. Zimmer. Modelling the differences in counted outcomes using bivariate copula models with application to mismeasured counts. Econom. J., 7(2):566–584, 2004.
[DL05] Michel Denuit and Philippe Lambert. Constraints on concordance measures in bivariate discrete data. J. Multivariate Anal., 93(1):40–57, 2005.
[Fra79] M. J. Frank. On the simultaneous associativity of F(x, y) and x + y − F(x, y). Aequationes Math., 19(2-3):194–226, 1979.
[Gal75] János Galambos. Order statistics of samples from multivariate distributions. J. Amer. Statist. Assoc., 70(351, part 1):674–680, 1975.
[Gum60] E. J. Gumbel. Bivariate exponential distributions. J. Amer. Statist. Assoc., 55:698–707, 1960.
[Joe93] Harry Joe. Parametric families of multivariate distributions with given margins. J. Multivariate Anal., 46(2):262–282, 1993.
[Joe96] H. Joe. Multivariate distributions from mixtures of max-infinitely divisible distributions. Journal of Multivariate Analysis, 57:240–265, 1996.
[Joe97] Harry Joe. Multivariate models and dependence concepts. Chapman & Hall, London, 1997.
[Joe05] H. Joe. Asymptotic efficiency of the two-stage estimation method for copula-based models. Journal of Multivariate Analysis, 94(2):401–419, 2005.
[Lee94] A. Lee. Modelling rugby league data via bivariate negative binomial regression. Australian and New Zealand Journal of Statistics, 41(2):141–152, 1994.
[Lee01] Lung-Fei Lee. On the range of correlation coefficients of bivariate ordered discrete random variables. Econometric Theory, 17(1):247–256, 2001.
[LV02] P. Lambert and F. Vandenhende. A copula-based model for multivariate non-normal longitudinal data: analysis of a dose titration safety study on a new antidepressant. Statist. Med., 21:3197–3217, 2002.
[MM94] S. G. Meester and J. MacKay. A parametric model for cluster correlated categorical data. Biometrics, 50:954–963, 1994.
[MN83] P. McCullagh and J. A. Nelder. Generalized linear models. Monographs on Statistics and Applied Probability. Chapman & Hall, London, 1983.
[Nas90] J. C. Nash. Compact numerical methods for computers: linear algebra and function minimisation. Hilger, New York, 1990. 2nd edition.
[Nel99] Roger B. Nelsen. An introduction to copulas, volume 139 of Lecture Notes in Statistics. Springer-Verlag, New York, 1999.
[Sch84] Mark J. Schervish. Algorithm AS 195: Multivariate normal probabilities with error bound. Applied Statistics, pages 81–94, 1984.
[Skl59] M. Sklar. Fonctions de répartition à n dimensions et leurs marges. Publ. Inst. Statist. Univ. Paris, 8:229–231, 1959.
[Son00] Peter Xue-Kun Song. Multivariate dispersion models generated from Gaussian copula. Scand. J. Statist., 27(2):305–320, 2000.
[TDB99] D. A. Trégouët, P. Ducimetière, V. Bocquet, S. Visvikis, F. Soubrier, and L. Tiret. A parametric copula model for analysis of familial binary data. Am J Hum Genet, 64(3):886–93, 1999.
[vO99] Hans van Ophem. A general method to estimate correlated discrete random variables. Econometric Theory, 15(2):228–237, 1999.