Bayesian Analysis of Finite Poisson Mixtures¹
Petros Dellaportas, Dimitris Karlis and Evdokia Xekalaki
Department of Statistics
Athens University of Economics and Business,
76 Patission Str. Athens, Greece
ABSTRACT
Finite Poisson mixtures are widely used to model overdispersed data sets for
which the simple Poisson distribution is inadequate. Such data sets are very
common in real applications. In this paper we investigate Bayesian estimation via
MCMC for finite Poisson mixtures and we discuss some computational issues.
The related problem of determining the number of components in a mixture is also
treated. This problem is of particular interest since it can reveal the number of
subpopulations comprising the entire population under study. We model the
number of components k as a random quantity having itself a prior probability
function and then we apply the reversible jump Markov chain Monte Carlo
algorithm (Green, 1995) to estimate the number of components. A real data example dealing
with a financial credit scoring system illustrates our methodology.
Keywords: Bayesian Analysis; Credit Scoring; Markov Chain Monte Carlo; Reversible Jump; Mixture Models
¹ An earlier version of this paper was presented at the 4th Practical Bayesian Statistics Conference held in Nottingham, July 1997.
1. Introduction
In modern economies, the need for financial assistance arises in a great variety of
economic activities. A large number of financial institutions provide such facilities by
giving loans to their clients. The total amount that the financial institution lends must be
returned after a certain fixed time period, usually by paying an instalment at fixed subperiods.
The amount of each instalment is calculated using an interest rate chosen so as to be profitable
for the financial institution and to cover the erosion of the nominal amount due to inflation.
Through this transaction the client improves his cash position and the financial institution
stands to gain, since interest rates are set in its favour.
At the same time the financial institution undertakes a great risk, that of non-payment
of instalments, termed defaulted instalments. In this case the institution may lose the
whole or part of the total amount lent. Usually the legislation of each country provides
remedies for such situations, but even then the risk and the cost for the institution are great. To
avoid such situations the institutions use a mechanism that examines whether a new client is
likely to adhere to the terms of the loan, so as to classify each client as a 'good' or a 'bad'
client. This mechanism is the so-called credit scoring system, and it assesses the reliability of
each client. Clearly, clients who are allocated to the latter category are suspected of defaulting
on their instalments, and thus the institution can either refuse the loan or grant it under
stricter conditions on the amount and the timing of the instalments.
The allocation of new clients to the above mentioned categories can be made using
several different statistical methods (see Hand and Henley, 1997, and the references therein).
However, the allocation of customers to only two groups has some disadvantages because it
ignores more realistic, finer separations. For example, the 'bad' client category might contain
clients who deliberately neglect paying their instalments ('continuously bad clients') as well as
clients who simply neglect paying, for example, due to the nature of their work ('temporary
bad clients'). A mechanism that could separate these two categories would therefore be useful.
Moreover, it would be of particular interest for the institution to be able to determine the
number of different categories, as it may develop different programs for each category,
minimising its effort and risk.
Table 1 about here
The data set shown in Table 1 refers to the number of defaulted instalments in a
financial institution in Spain in 1990 and has been analysed by Guillen and Artis (1992) and
Dionne et al (1996). The data contain a large amount of overdispersion: the mean equals 1.58
while the variance equals 9.93, more than six times greater, implying the inadequacy of
the simple Poisson distribution for the observed frequency distribution. In this paper we
shall use a mixture model to describe the inhomogeneity within the population.
Mixture models are alternative candidates when simple models fail. In particular, finite
mixture models can provide interesting information about the number of subpopulations
comprising the entire population. In the case of the dataset considered, the institution itself
would be interested in determining the number of different subpopulations comprising the
entire client population.
The remainder of this paper is organised as follows. Section 2 contains a brief
description of mixture models as well as a short review of estimation methods. Section 3 is
devoted to developing Bayesian estimation for finite Poisson mixtures via the Markov chain
Monte Carlo (MCMC) methodology. We also discuss some amendments of the standard
MCMC method for improving the performance of our algorithm, as well as the problem of
determining the number of components comprising the mixture model. Section 4 contains an
application of the new method to the financial institution data set. A comparison of the newly
proposed method with existing methods can be found in Section 5. Finally, Section 6 contains
a short discussion.
2. Finite Mixture Models
Suppose that the probability density of a random variable X, f(x), can be expressed in
the form

f(x) = \sum_{j=1}^{k} p_j f(x \mid \theta_j)    (1)

for some probabilities p_j > 0, j = 1, ..., k, with \sum_{j=1}^{k} p_j = 1. Then X is said to
have a k-finite mixture density. In (1), θ_j is either a vector of parameters or a scalar referring
to the j-th component of the mixture. Usually p_j is called the mixing proportion while θ_j is
called the mixing parameter. Finite mixture models are of special interest mainly for two
reasons. Firstly, they can describe populations for which the assumption of a finite number of
subpopulations is valid. Secondly, even if the true underlying population consists of an infinite
number of subpopulations, one is restricted to estimating it via a finite distribution; see Laird
(1978). Furthermore, by fitting a k-finite mixture we can elicit information about the number
of components and hence about the number of subpopulations comprising the entire
population.
The simple Poisson distribution holds a prominent position among the discrete
probability distributions. It is usually used to model situations where only randomness is
present. In practice this probabilistic mechanism often cannot explain the observed data
variation, and the need for alternative, more complex models becomes obvious. Poisson
mixture models are then interesting candidates. The probability function of the k-finite
Poisson mixture is given by
f(x) = \sum_{j=1}^{k} p_j \frac{e^{-\lambda_j} \lambda_j^{x}}{x!}    (2)

We assume that 0 ≤ λ_1 < λ_2 < ... < λ_k in order to ensure the identifiability of the above
finite mixture (see Teicher, 1963). Note that we allow the first component mean to be 0, thus
allowing the corresponding distribution to be degenerate.
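To make (2) concrete, the following Python sketch (our own illustration; function names are ours, not part of the original analysis) evaluates the probability function of a k-finite Poisson mixture and computes the moments implied by a two-component fit such as the one reported in Section 4.

```python
import numpy as np
from scipy.stats import poisson

def mixture_pmf(x, p, lam):
    """Probability function of a k-finite Poisson mixture, eq. (2):
    f(x) = sum_j p_j * exp(-lam_j) * lam_j**x / x!"""
    x = np.asarray(x)
    return sum(pj * poisson.pmf(x, lj) for pj, lj in zip(p, lam))

# Two-component fit reported in Section 4: p = (0.78, 0.22),
# component means (0.24, 6.41).  The implied moments show the
# overdispersion that a single Poisson cannot capture:
p, lam = [0.78, 0.22], [0.24, 6.41]
mean = sum(pj * lj for pj, lj in zip(p, lam))                      # ~1.60
var = sum(pj * (lj + lj**2) for pj, lj in zip(p, lam)) - mean**2   # ~8.1
```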
A wide range of estimation methods for finite mixtures can be found in Titterington
et al (1985), McLachlan and Basford (1988) and Lindsay (1995). These include maximum
likelihood, moment and minimum distance methods. If the number of components is
fixed, simple EM algorithms can be used for maximum likelihood estimation; see, for
example, Hasselblad (1969). If k is not fixed, more sophisticated algorithms must be used; for
a detailed review see Bohning (1995).
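For a fixed number of components, the EM iteration mentioned above is straightforward; a minimal sketch for the k-finite Poisson mixture is given below (our own illustration of a standard Hasselblad-type iteration, with an arbitrary quantile-based starting rule that is our assumption).

```python
import numpy as np
from scipy.stats import poisson

def em_poisson_mixture(x, k, n_iter=500):
    """EM for a k-finite Poisson mixture with the number of
    components k fixed (a sketch, not the authors' code)."""
    x = np.asarray(x, dtype=float)
    p = np.full(k, 1.0 / k)
    lam = np.quantile(x, (np.arange(k) + 0.5) / k) + 0.1   # crude start
    for _ in range(n_iter):
        # E-step: posterior membership probabilities w_ij
        w = np.array([pj * poisson.pmf(x, lj) for pj, lj in zip(p, lam)]).T
        w /= w.sum(axis=1, keepdims=True)
        # M-step: weighted proportions and weighted means
        p = w.mean(axis=0)
        lam = (w * x[:, None]).sum(axis=0) / w.sum(axis=0)
    return p, lam
```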
3. Bayesian Estimation for Finite Poisson Mixtures
3.1. The case of known number of components
The standard Bayesian formulation of the finite mixture problem with a known number
of components and its implementation via Markov chain Monte Carlo (MCMC) is given by
Diebolt and Robert (1994). For the finite Poisson mixture the explicit details are as follows.
Let the data x = (x_1, x_2, ..., x_n) be modelled as exchangeably distributed according to (2).
This implies that, conditional on the parameter vector θ = (p, λ) with λ = (λ_1, ..., λ_k) and
p = (p_1, ..., p_k), the distribution of X_i, f(x_i | λ_j), is Poisson(λ_j). By expressing the
mixture model in terms of missing data, we introduce indicator parameters
z = {z_{ij}; i = 1, ..., n; j = 1, ..., k} such that z_{ij} = 1 if the i-th observation X_i belongs
to the j-th component of the mixture and z_{ij} = 0 otherwise. Thus the density
f(x_i | z_{ij} = 1) is Poisson(λ_j) and f(z_{ij} = 1 | p) = p_j. Note that the vector
z_i = (z_{i1}, ..., z_{ik}) follows a Multinomial(1, p) distribution, so p can be thought of as a
vector of hyperparameters determining the distribution of z. The joint distribution of the
observed data x and the unobserved indicator parameters z conditional on the model
parameters can be written as

f(x, z \mid \theta) = f(x \mid \lambda, z)\, f(z \mid p) = \prod_{i=1}^{n} \prod_{j=1}^{k} \left[ p_j f(x_i \mid \lambda_j) \right]^{z_{ij}}    (4)

where f(x_i | λ_j) is in our case the Poisson distribution with parameter λ_j. Bayesian
formulation requires prior distributions for the model parameters θ. Immediate choices are
the conjugate prior distributions Gamma(a, b) for each λ_j and the Dirichlet(d_1, ..., d_k) for p.
Denoting the priors π(θ), π(λ) and π(p), Bayes theorem leads to the joint posterior
distribution of θ,

p(\theta \mid x, z) \propto f(x, z \mid \theta)\, \pi(\theta) = f(x \mid \lambda, z)\, f(z \mid p)\, \pi(p)\, \pi(\lambda).
Inference problems associated with the above posterior density present implementation
difficulties due to the nature of the sampling density: the likelihood of the observed data is a
product of n terms, each being a sum of k distinct elements. Possible approaches to
overcome this problem include a quasi-Bayesian approach (Smith and Makov, 1978;
Rajagopalan and Loganathan, 1991), modal estimation (Redner et al, 1987), Laplace
approximations (Crawford, 1994) and a recent suggestion of Aitkin et al (1996) based on the
use of contiguous partitioning. In this paper we follow the usual MCMC approach suggested
by Diebolt and Robert (1994) for finite normal mixtures, which has attracted the attention of
many researchers; see for example Lavine and West (1992), Dellaportas et al (1995) and
Robert (1996).
To implement the MCMC for the Poisson mixture case we need to sample from the full
conditional posterior distributions of the parameters λ, p and z. These sampling steps are
readily found to be

(z_{i1}, ..., z_{ik}) \sim \text{Multinomial}(1, w_{i1}, ..., w_{ik}), \quad i = 1, ..., n, where

w_{ij} = \frac{p_j f(x_i \mid \lambda_j)}{f(x_i)}, \quad j = 1, ..., k,    (5)

p \sim \text{Dirichlet}\left(d_1 + \sum_{i=1}^{n} w_{i1}, \; ..., \; d_k + \sum_{i=1}^{n} w_{ik}\right),

\lambda_j \sim \text{Gamma}\left(a + \sum_{i=1}^{n} z_{ij} x_i, \; b + \sum_{i=1}^{n} z_{ij}\right) I(\lambda_{j-1}, \lambda_{j+1}), \quad j = 1, ..., k,

where I(l, u) denotes the indicator function which takes the value 1 if the argument belongs to
the interval (l, u) and 0 otherwise, with the convention that λ_0 = 0 and λ_{k+1} = ∞.
Iterative sampling from the above densities, from suitably long runs of the chain,
provides the tool to obtain estimates for marginal posterior quantities of interest; see for
example Smith and Roberts (1993).
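A minimal sketch of one sweep of such a sampler is given below (our illustration, assuming the conjugate priors above; not the authors' code). For the update of p we use the standard full conditional Dirichlet(d + n_1, ..., d + n_k) given the current allocations, where n_j = Σ_i z_ij; the weight-based version stated in the text can be substituted directly. The truncated Gamma draws for the ordered means are obtained here by the inverse-CDF method rather than the rejection techniques of Gelfand et al (1992).

```python
import numpy as np
from scipy.stats import gamma, poisson

rng = np.random.default_rng(0)

def gibbs_sweep(x, p, lam, a=0.01, b=0.01, d=1.0):
    """One sweep of a Gibbs sampler for a k-finite Poisson mixture
    with k fixed (a sketch of a Diebolt-Robert type scheme)."""
    k = len(p)
    # allocation probabilities w_ij of eq. (5)
    w = np.array([pj * poisson.pmf(x, lj) for pj, lj in zip(p, lam)]).T
    w /= w.sum(axis=1, keepdims=True)
    # z_i ~ Multinomial(1, w_i1, ..., w_ik)
    z = np.array([rng.multinomial(1, wi) for wi in w])
    nj = z.sum(axis=0)
    # p ~ Dirichlet(d + n_1, ..., d + n_k)
    p = rng.dirichlet(d + nj)
    # lam_j ~ Gamma(a + sum_i z_ij x_i, b + n_j), truncated to
    # (lam_{j-1}, lam_{j+1}) to preserve the ordering; inverse-CDF draw
    sx = z.T @ np.asarray(x)
    new_lam = np.array(lam, dtype=float)
    for j in range(k):
        lo = new_lam[j - 1] if j > 0 else 0.0
        hi = lam[j + 1] if j < k - 1 else np.inf
        shape, rate = a + sx[j], b + nj[j]
        u = rng.uniform(*gamma.cdf([lo, hi], shape, scale=1.0 / rate))
        new_lam[j] = gamma.ppf(u, shape, scale=1.0 / rate)
    return p, new_lam, z
```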
The parameters a, b and d_j, j = 1, ..., k, can be chosen so that the resulting prior
densities for λ and p are rather flat. The priors should be proper to ensure parameter
identifiability. The truncated form of f(λ_j | ·) avoids the multimodality of the resulting
posterior distribution. Moreover, sampling from a Gamma density truncated at the left or at
the right is achieved by using techniques such as those described by Gelfand et al (1992). For
a detailed discussion on the multimodality of the posterior distributions see, for example,
Stephens (1996).
3.2 The case of unknown number of components
If the number of components k is not known, the Bayesian estimation problem is
enlarged in the sense that inferences are based on the joint posterior distribution of the
parameter vector θ = (k, θ_k), where θ_k = (λ_1, ..., λ_k, p_1, ..., p_k) denotes the vector of
parameters for the k-finite Poisson mixture. The posterior density can be written as

p(\theta \mid x, z) = p(k, \theta_k \mid x, z) \propto f(x, z \mid \theta_k)\, \pi(\theta_k)\, \pi(k)
= f(x \mid \lambda_1, ..., \lambda_k, p_1, ..., p_k, z)\, f(z \mid p_1, ..., p_k)\, \pi(p_1, ..., p_k)\, \pi(\lambda_1, ..., \lambda_k)\, \pi(k)

where π(k) denotes the prior density for the parameter k.
We adopt the recently suggested methodology of Green (1995), which provides a fully
Bayesian solution to the above problem. This methodology, termed reversible jump
MCMC (RJMCMC), has already been applied to the finite normal mixture case by Richardson
and Green (1997). The key idea in RJMCMC is to allow moves between parameter subspaces
of different dimensionality by permitting a series of different 'move types'. The details of the
general RJMCMC implementation strategy can be found in the papers cited above. In the
sequel, we provide a detailed account of the sampling steps required for the finite Poisson
mixture case.
Suppose that there are at most k_max components, so k = 1, ..., k_max. We denote by S_k
and C_k the probabilities of the 'split' and 'combine' moves respectively for a model with k
components. Clearly S_k = 1 − C_k, with S_{k_max} = 0 and C_1 = 0. We set S_k = 0.5 for
k = 2, ..., k_max − 1. With these probabilities we choose the type of move which will be attempted.
Let us first consider the simplest case, where we have to move from the k-component
model to the (k−1)-component model. At first we have to choose at random the pair of
components which will be combined. These components are, by construction, successively
ordered (0 ≤ λ_1 < λ_2 < ... < λ_k), and hence we have to choose one out of k−1 different
successive pairs. Denoting the chosen pair by (j_1, j_2), we have to transform the current vector
of parameters (p_{j_1}, p_{j_2}, λ_{j_1}, λ_{j_2}) to the new vector of parameters, say (p*, λ*),
as well as the allocation latent variables z_{j_1}, z_{j_2} to the new allocation latent variable z*.
We suggest using the 'combine' transformation

p^* = p_{j_1} + p_{j_2}, \qquad \lambda^* = \frac{\lambda_{j_1} p_{j_1} + \lambda_{j_2} p_{j_2}}{p^*}, \qquad z^* = z_{j_1} + z_{j_2},

which preserves the values of the means between jumps. Other transformations could have
been chosen and, although they would not affect the results, they might deteriorate the
convergence properties of the chain. Note that once the choice of the pair has been made, the
'combine' proposal move is deterministic.
The 'split' move is more tedious than the combine move. At first we have to choose at
random the component which will be attempted to be split. Suppose that we chose at
random the j*-th component to be split into two new components, say j_1 and j_2. In order to
make the transformation reversible we need to generate two continuous random variates u_1
and u_2 and then calculate the new parameters (p_{j_1}, p_{j_2}, λ_{j_1}, λ_{j_2}), as well as the
corresponding allocation latent variables z_{j_1}, z_{j_2}. These variates can be generated from
any suitable distribution. In the sequel we generate u_1 and u_2 from a Beta(2,2) distribution.
We found that in practice this choice works well, providing good chain convergence. Other
Beta distributions were also tried (for example, we chose a Beta with mean 0.9 so that the split
move encourages splits into a 'large' and a 'small' component), but they gave similar
convergence behaviour.
The new parameters are calculated as

p_{j_1} = p_{j^*} u_1, \qquad p_{j_2} = p_{j^*} (1 - u_1),

\lambda_{j_1} = \lambda_{j^*} u_2, \qquad \lambda_{j_2} = \lambda_{j^*} \frac{1 - u_1 u_2}{1 - u_1},

and the Jacobian matrix of the transformation from the vector (p_{j^*}, λ_{j^*}, u_1, u_2) to the
vector (p_{j_1}, p_{j_2}, λ_{j_1}, λ_{j_2}) can be shown to be given by

J = \begin{pmatrix}
u_1 & 0 & p_{j^*} & 0 \\
1 - u_1 & 0 & -p_{j^*} & 0 \\
0 & u_2 & 0 & \lambda_{j^*} \\
0 & \dfrac{1 - u_1 u_2}{1 - u_1} & \dfrac{\lambda_{j^*}(1 - u_2)}{(1 - u_1)^2} & -\dfrac{\lambda_{j^*} u_1}{1 - u_1}
\end{pmatrix}

with determinant

|J| = \frac{p_{j^*} \lambda_{j^*}}{1 - u_1}.
Again, once the pair (u_1, u_2) has been generated, the 'split' proposal is deterministic. Note
also that if the means of the new components are not in order with respect to the remaining
component means, the proposed move is rejected.
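The split transformation and its Jacobian are easy to check numerically. The sketch below (our own code) implements the map, verifies that it preserves the total proportion and the mixture mean, and confirms |J| against a finite-difference determinant.

```python
import numpy as np

def split_component(p_star, lam_star, u1, u2):
    """Split map of Section 3.2: (p*, lam*, u1, u2) -> (p_j1, p_j2,
    lam_j1, lam_j2).  Its inverse is the 'combine' transformation."""
    p1, p2 = p_star * u1, p_star * (1 - u1)
    lam1 = lam_star * u2
    lam2 = lam_star * (1 - u1 * u2) / (1 - u1)
    return p1, p2, lam1, lam2

def jacobian_det(p_star, lam_star, u1, u2):
    return p_star * lam_star / (1 - u1)   # |J| as derived above

# sanity checks: total proportion and mixture mean are preserved
p_s, l_s, u1, u2 = 0.4, 3.0, 0.3, 0.6
p1, p2, l1, l2 = split_component(p_s, l_s, u1, u2)
assert np.isclose(p1 + p2, p_s) and np.isclose(p1 * l1 + p2 * l2, p_s * l_s)

# |J| matches a finite-difference determinant of the 4x4 Jacobian
eps, base = 1e-6, np.array([p_s, l_s, u1, u2])
J = np.empty((4, 4))
for c in range(4):
    hi, lo = base.copy(), base.copy()
    hi[c] += eps
    lo[c] -= eps
    J[:, c] = (np.array(split_component(*hi)) -
               np.array(split_component(*lo))) / (2 * eps)
assert np.isclose(abs(np.linalg.det(J)), jacobian_det(p_s, l_s, u1, u2),
                  rtol=1e-4)
```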
Finally, if z_{ij*} = 1 for some i, then we set z_{ij_1} = 1 with probability
w_{ij_1} / (w_{ij_1} + w_{ij_2}). In practice, since we are dealing with count data, we can simplify
the calculations by noting that for each count we have the observed frequency. In this case the
allocation step is done via generation from the binomial distribution, as sketched below.
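A sketch of this frequency-based allocation step (our own, hypothetical helper names):

```python
import numpy as np

rng = np.random.default_rng(1)

def allocate_split(freqs, w1, w2):
    """For each distinct count x with observed frequency freqs[x], the
    number of observations sent to the first new component is
    Binomial(freqs[x], w1[x] / (w1[x] + w2[x])); the rest go to the
    second.  w1, w2 are the allocation weights of the two components."""
    prob = w1 / (w1 + w2)
    n1 = rng.binomial(freqs, prob)
    return n1, freqs - n1
```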
Thus, when the move has been proposed and the new parameters have been calculated,
we have either to accept or to reject the move with some probability which is calculated as
follows. Suppose that the current state of the Markov chain at time t is m and the
corresponding vector of parameters is θ_m. Suppose that a particular move type M involves a
proposed move to model m′ with corresponding parameter vector θ_{m′} which is of higher
dimensionality than θ_m. Furthermore, suppose that θ_{m′} is created by generating a vector u,
of dimensionality equal to the difference in dimensionalities between θ_{m′} and θ_m, from some
proposal distribution q(u), by setting θ_{m′} = g(θ_m, u) where g is a one-to-one function.
Similarly, if the move type is used in the other direction (in which case it is denoted by
M′) and θ_{m′} is of higher dimension than θ_m, then θ_m is created by applying the inverse
transformation g^{-1}(θ_{m′}, u) and ignoring u. Thus, if w(m, M) denotes the
probability of a move of type M when we are at state m, the acceptance probability α is
calculated as

\alpha = \min\left\{1, \; \frac{f(x, \theta_{m'})\, w(m', M')}{f(x, \theta_m)\, w(m, M)\, q(u)}\, |J| \right\}

where J is the Jacobian matrix of the transformation.
Therefore, the acceptance probabilities for our move types can be calculated. Specifically,
for the split move the acceptance probability is min(1, A), while for the
combine move the acceptance probability is min(1, B), where

A = \frac{L(k+1)\, \pi(k+1)\, \prod_{i=1}^{k+1} \pi(\lambda_i)\, \pi(p_1, p_2, ..., p_{k+1})}{L(k)\, \pi(k)\, \prod_{i=1}^{k} \pi(\lambda_i)\, \pi(p_1, p_2, ..., p_k)} \times (k+1) \times \frac{C_{k+1}}{S_k P_{alloc}} \times \frac{|J|}{g(u_1, u_2)},

B = \frac{L(k-1)\, \pi(k-1)\, \prod_{i=1}^{k-1} \pi(\lambda_i)\, \pi(p_1, p_2, ..., p_{k-1})}{L(k)\, \pi(k)\, \prod_{i=1}^{k} \pi(\lambda_i)\, \pi(p_1, p_2, ..., p_k)} \times \frac{1}{k} \times \frac{S_{k-1} P_{alloc}}{C_k} \times \frac{g(u_1, u_2)}{|J|}.

Here, L(k) is the likelihood of the model with k components, π(k), π(λ) and
π(p_1, p_2, ..., p_k) are the priors for k, λ and p respectively, g(u_1, u_2) is the proposal density
from which u_1 and u_2 are generated, and P_alloc is the probability of the particular allocation
of the observations to the two components involved in the move. The factors (k+1) for the
'split' move and 1/k for the 'combine' move are derived from the ratios of factorials in the
order statistics densities for the mixture means.
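As an illustration, the logarithm of the split acceptance ratio A can be assembled term by term. The sketch below (our code, argument names ours) assumes the concrete priors of Section 4: a zero-truncated Poisson(t) prior on k, Gamma(a, b) priors on each λ_j, a Dirichlet(1, ..., 1) prior on p, and Beta(2,2) proposals for u_1 and u_2.

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import beta, gamma, poisson

def log_split_accept(x, p_old, lam_old, p_new, lam_new, j, u1, u2,
                     S_k, C_k1, P_alloc, a=0.01, b=0.01, t=5.0):
    """log A for a split from k to k+1 components; j indexes the split
    component in the old state.  A sketch, not the authors' code."""
    k = len(p_old)
    loglik = lambda p, lam: np.log(
        sum(pj * poisson.pmf(x, lj) for pj, lj in zip(p, lam))).sum()
    out = loglik(p_new, lam_new) - loglik(p_old, lam_old)   # L(k+1)/L(k)
    out += np.log(t) - np.log(k + 1)                        # pi(k+1)/pi(k), trunc. Poisson
    out += (gamma.logpdf(lam_new, a, scale=1 / b).sum()
            - gamma.logpdf(lam_old, a, scale=1 / b).sum())  # priors on lam
    out += gammaln(k + 1) - gammaln(k)                      # Dirichlet(1,...,1) ratio
    out += np.log(k + 1)                                    # order-statistics factor
    out += np.log(C_k1) - np.log(S_k * P_alloc)             # move probabilities
    out -= beta.logpdf(u1, 2, 2) + beta.logpdf(u2, 2, 2)    # g(u1, u2)
    out += np.log(p_old[j] * lam_old[j] / (1 - u1))         # |J|
    return out              # accept if log(Uniform(0,1)) < min(0, out)
```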
4. Application to financial data
The following prior specifications were used for analysing the data in Table 1. The
maximum number of components k_max was set equal to 25, and two different priors for k
were considered. The first was a uniform across the values 1, ..., k_max, with P(k) = 1/k_max,
while the second was a zero-truncated Poisson distribution with

P(k) = \frac{e^{-t} t^{k}}{k! \, (1 - e^{-t})}, \quad k > 0,

where t is a parameter reflecting our belief (or knowledge) about the mean number of
components; we chose, for illustrative purposes, to set t = 5. We found that, due to the large
sample size (n = 4690), the results were not sensitive to the choice of the prior.
For λ_j, j = 1, ..., k, we chose a Gamma(a = 0.01, b = 0.01) prior density, which is rather
flat, reflecting our prior ignorance about the parameter values. Again we found that even a
multimodal informative prior density with k modes does not alter our main inferences.
Finally, the prior for the mixing proportions (p_1, ..., p_k) was a
Dirichlet(1, ..., 1) distribution, and deviations from this density produced no interesting
changes to the posterior distribution.
Starting values for the RJMCMC algorithm were varied for all the parameters. The
algorithm turned out to be very insensitive to the initial values, reaching the region of high
posterior density in, at most, 1000 iterations. The results which follow were derived by using a
burn-in period of 5000 iterations and then taking one point every 100 iterations, to avoid
autocorrelation biases.
Figure 1 shows the values of P(k = j) for j = 2, 3, 4 across the sweeps. The stability of
the values indicates convergence of the RJMCMC algorithm. Figure 2 illustrates the moves
for each k in 50000 iterations. RJMCMC succeeded in moving efficiently between models.
The proportion of accepted moves was near 0.40. Figure 3 depicts the posterior probabilities
for k for both the Poisson and the uniform prior. It is evident that the posterior densities
are rather robust to the choice of prior, indicating that the posterior for k favours, in both
cases, a model with 2 components. The probability decreases as k becomes larger, obtaining
values less than 0.1 for k ≥ 5.
Table 2 about here
All posterior summaries of interest are tabulated in Table 2. The most probable model
(k = 2) indicates that for these data there are two groups of customers, representing 78%
and 22% of the population respectively, with corresponding mean numbers of defaulted
instalments 0.24 and 6.41. Note that the values of p_1 and λ_1 are rather stable across k,
indicating that we can be rather firm about the characteristics of the 'good customers' group.
two mixture components. An interesting quantity to report in our example would be a cut-off
value,  say. This value determines the subpopulation in which an individual will be
allocated. Hence if the number of defaulted instalments is smaller or equal to  then the
individual is allocated at the first group while if it is larger at the second group. Conditional on
the population parameters    k ,  1 , ... ,  k , p1 , ... , p k  we define as the cut-off point the
integer part of the root   of the equation
p1 f 1 ( |  1 )  p2 f 2 ( |  2 )
(6)
denoting by f 1 and f 2 the two Poisson densities estimated from the 2-finite Poisson model.
The unique solution of the above equation is

ln( p2 / p1 )   2   1
.
ln(  1 /  2 )
13
A posterior sample of δ is readily available by taking the sampled population
parameters (λ_1, λ_2, p_1), solving (6) and then calculating δ = [δ*], where [a] stands
for the integer part of a. The posterior distribution of δ for the data of Table 1 is tabulated in
Table 3. We have used a lag of 100 iterations so as to remove autocorrelation, and the results
are based on 3000 values of the cut-off point. This posterior distribution is concentrated on
δ = 2 (the mean is 2.001 and the standard deviation is 0.53), reflecting that this value can be
considered as the cut-off point for allocating a client as 'good'. In practice, the financial
institution can use this value to decide when to start possible sanctions for discouraging similar
actions, in the sense that the institution does not want to bother 'good' clients who, by accident,
have defaulted instalments.
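The mapping from sampled parameters to the integer cut-off is immediate; a sketch (our code):

```python
import numpy as np

def cutoff(p1, p2, lam1, lam2):
    """Integer part of the root delta* of eq. (6) for a 2-component
    Poisson mixture."""
    delta = (np.log(p2 / p1) + lam1 - lam2) / np.log(lam1 / lam2)
    return int(np.floor(delta))

# With the posterior means of Table 2 (k=2) the cut-off is 2:
print(cutoff(0.78, 0.22, 0.24, 6.41))   # -> 2
```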
Table 3 about here
The entries of Table 3 were based on a 2-component model. We might instead have used a
model-averaging technique for calculating the posterior distribution of the cut-off point. This
would be a mixture of the posterior distributions obtained for each value of k, weighted by the
posterior probability of the model with k components.
5. Comparison with other methods
Determining the number of components in a mixture is a difficult and as yet unresolved
problem. The same problem occurs in cluster analysis, where the number of clusters is of
interest (see, for example, McLachlan and Basford, 1988, for a thorough presentation of the
use of finite mixture models in cluster analysis). A large number of methods have been
proposed for this problem. We briefly describe some of them below, and then compare them
to our method on the real data set considered.
The majority of the methods have been proposed for normal mixtures, but they can also be
applied, with minor modifications, to finite Poisson mixtures. The common strategy of
sequentially using the likelihood ratio test has been proposed by Wolfe (1970) and Izenman
and Sommer (1988). However, it has been strongly criticised by several authors, mainly
because the asymptotic distribution of the test statistic is not of a chi-square form, as
demonstrated, for example, by McLachlan (1987) and Milligan and Cooper (1985). Recently,
Karlis and Xekalaki (1997a) have proposed the use of a sequential bootstrap likelihood ratio
test where the null distribution of the test statistic is constructed at each step via parametric
bootstrap. This approach verified that the chi-square approximation is inappropriate.
Penalised likelihood methods have been proposed by Leroux (1992) and Leroux and
Puterman (1992). They proposed the use of the Akaike Information Criterion (AIC) and the
Bayesian Information Criterion (BIC) for discouraging the use of a large number of
components. The AIC is given by A(m) = L_m − d_m, while the BIC is given by
B(m) = L_m − ln(n) d_m / 2, where L_m is the maximised log-likelihood for a model with m
components and d_m is the number of free parameters in the model with m components, which
is equal to 2m − 1 for an m-finite Poisson mixture distribution. Both criteria share
disadvantages, as they both ignore the number of already fitted components, while the AIC
also ignores the sample size.
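For reference, both criteria are immediate to compute from a fitted m-component Poisson mixture (a sketch in our notation; under these definitions one selects the m maximising A(m) or B(m)):

```python
import numpy as np
from scipy.stats import poisson

def aic_bic(x, p, lam):
    """A(m) = L_m - d_m and B(m) = L_m - ln(n) d_m / 2, with
    d_m = 2m - 1 free parameters for an m-finite Poisson mixture."""
    x = np.asarray(x)
    L = np.log(sum(pj * poisson.pmf(x, lj) for pj, lj in zip(p, lam))).sum()
    d, n = 2 * len(p) - 1, len(x)
    return L - d, L - np.log(n) * d / 2
```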
Penalised minimum distance methods have been proposed by Henna (1985) and Chen
and Kalbfleisch (1996). The innovation in the latter is that the penalty is based on the mixing
proportions, so as to discourage components with small mixing proportions. The penalised
distance is given by

K_m = D_m - c \sum_{i=1}^{m} \ln p_i,

where D_m is an appropriately chosen distance for a model with m components (e.g. the
Hellinger distance considered by Karlis and Xekalaki, 1997b) and c is a scalar which scales
the contribution of the penalty term.
Furman and Lindsay (1994) proposed a method based on moments. They used the
property of several distributions to generate mixtures for which the moments of the mixing
distribution can be consistently estimated from the data. Their method adds components until
the related moment problem becomes unsolvable. This method has been criticised for leading
to too few components (see Heckman et al, 1990).
Other methods related to the problem can be found in Lindsay and Roeder (1992),
Celeux and Diebolt (1985), Windham and Cutler (1992) and Aitkin et al (1996).
We applied some of the above mentioned methods to our data set. The numbers of
components estimated by these methods, as well as by the method proposed in this paper, can
be seen in Table 4. It is interesting that the estimated number of components varies from
method to method. Moreover, our method proposes the smallest number of components.
Table 4 about here
We re-emphasise here that the RJMCMC-based Bayesian methodology provides a
posterior distribution on the number of components k, rather than a single value of k
considered as a 'best estimator'. This, of course, leads to a number of competing explanations
of the data. Moreover, we would like to point out that the methods in Table 4 are based, in
general, on the usual classical paradigm, which typically uses a penalty to offset additional
variables in an enlarged model and then chooses the model with the smallest possible number
of parameters, thus obeying a principle of parsimony. However, it is striking that our Bayesian
approach produced the highest posterior probability for k = 2, which is smaller than any value
the classical methods produced. We tested our code and the behaviour of RJMCMC on
simulated data and we report that there is no bias towards models with fewer components.
These results agree with the general experience with mixture models; see Richardson and
Green (1997).
6. Discussion
A Bayesian estimation procedure for finite Poisson mixtures was developed and
applied to financial data. The RJMCMC provides both estimation of the number of
components and estimation of the parameters of the mixture. The method's results
favoured models with fewer components. In an experimental study we verified that this is not
due to bad behaviour of RJMCMC but is caused by the overlapping of the mixture
components: when the sample size is small and the counts are overdispersed, RJMCMC may
produce high posterior probabilities for models with a small number of components.
We restricted ourselves to using simple data sets containing only counts. Clearly, the
financial institution can use supplementary information, e.g. covariates reflecting
personal characteristics of the client. This is the common strategy for credit scoring methods;
see Dionne et al (1996). Both the MCMC and the RJMCMC methodology can be applied. The
finite mixture Poisson regression method described in Wang et al (1996) is an example where
we can proceed by assuming that the number of components is unknown, and thus develop
full Bayesian estimation for both the number of components and the parameters associated
with the covariates.
Finally, we would like to point out that the usual by-products of the MCMC output are
also available here. For example, standard credible regions, predictive densities and
any posterior summaries of interest can be readily computed by utilising the RJMCMC
output.
Acknowledgement
Dimitris Karlis is supported by a scholarship from the State Scholarships Foundation of
Greece. We would like to thank Sylvia Richardson and Valerie Viallefont for their help on
the split-combine move.
References
Aitkin M., Finch S., Mendell N. and Thode H. (1996) A new test for the presence of a normal
mixture distribution based on the posterior Bayes factor. Statistics and Computing 6,
121-125
Bohning D. (1995) A review of reliable maximum likelihood algorithms for semiparametric
mixture models. Journal of Statistical Planning and Inference 47, 5-28
Celeux G. and Diebolt J. (1985) The SEM algorithm: a probabilistic teacher algorithm derived
from the EM algorithm for the mixture problem. Computational Statistics Quarterly 2, 73-92
Chen J. and Kalbfleisch J.D. (1996) Penalised minimum-distance estimates in finite mixture
models. Canadian Journal of Statistics 24, 167-175
Crawford S. (1994) An Application of Laplace Method to Finite Mixture Distributions.
Journal of the American Statistical Association 89, 259-267
Dellaportas P., Stephens D. A., Smith A. F.M. and Guttman I. (1995) A comparative study of
prenatal mortality using a two component mixture model. In: Bayesian Biostatistics,
pp 601-616, Berry D. A. and Stangl D. K. (editors). New York, Marcel Dekker.
Diebolt J. and Robert C. (1994) Estimation of Finite Mixture Distributions Through Bayesian
Sampling. Journal of the Royal Statistical Society B 56, 363-375
Dionne G, Artis M. and Guillen M. (1996) Count data models for a credit scoring system.
Journal of Empirical Finance, 3, 303-325
Furman W.D. and Lindsay B. (1994) Testing for the number of components in a mixture of
normal distributions using moment estimators. Computational Statistics and Data
Analysis, 17, 473-492
Gelfand A.E., Smith A.F.M. and Lee T.M. (1992) Bayesian Analysis of constrained parameter
and truncated data problems using Gibbs sampling. Journal
of the American
Statistical Association 87, 523-532
Green P. (1995) Reversible jump Markov chain Monte Carlo computation and Bayesian
model determination. Biometrika, 82, 711-732
Guillen M. and Artis M. (1992) Count Data Models for a Credit Scoring System. Presented at
the 3rd Meeting of the European Conference Series in Quantitative Economics and
Econometrics on Econometrics of Duration, Count and Transition Models, Paris,
December, 10-11, 1992
Hand D.J. and Henley W.E. (1997) Statistical Classification Methods in Consumer Credit
Scoring: a Review. Journal of the Royal Statistical Society, A, 160, 523-541
Hasselblad V. (1969) Estimation of finite mixtures from the exponential family. Journal of
the American Statistical Association 64, 1459-1471
Heckman J., Robb R. and Walker J. (1990) Testing the Mixture of Exponentials Hypothesis
and Estimating the Mixing Distribution By the Method of Moments . Journal of the
American Statistical Association 85, 582-589
Henna J. (1985) On estimating the number of constituents of a finite mixture of continuous
distributions. Annals of the Institute of Statistical Mathematics, 37, 235-240
Izenman A.J. and Sommer C. (1988) Philatelic mixtures and multimodal densities. Journal
of the American Statistical Association 83, 941-953
Karlis D. and Xekalaki E. (1997a) Testing for finite mixtures via the Likelihood Ratio Test.
Technical Report #28, Dept. of Statistics, Athens University of Economics and
Business, January 1997
Karlis D. and Xekalaki E. (1997b) Minimum Hellinger Estimation for finite Poisson mixtures.
To appear in Computational Statistics and Data Analysis
Laird N. (1978) Nonparametric Maximum Likelihood Estimation of a Mixing Distribution.
Journal of the American Statistical Association 73, 805-811
Lavine M. and West M. (1992) A Bayesian Method for Classification and Discrimination.
Canadian Journal of Statistics 20, 451-461
Leroux B. (1992) Consistent estimation of a mixing distribution. Annals of Statistics 20,
1350-1360
Leroux B. and Puterman M. (1992) Maximum-Penalised-Likelihood for independent and
Markov-Dependent Mixture Models. Biometrics 48, 545-558
Lindsay B. (1995) Mixture Models: Theory, Geometry and Applications. Regional Conference
Series in Probability and Statistics, Vol 5, Institute of Mathematical Statistics and
American Statistical Association
Lindsay B. and Roeder K. (1992) Residuals diagnostics for mixture models. Journal of the
American Statistical Association 87, 785-794
McLachlan G. (1987) On bootstrapping the likelihood ratio test statistic for the number of
components in a normal mixture. Applied Statistics 36, 318-324
McLachlan G. and Basford K. (1988) Mixture Models: Inference and Applications to
Clustering. Marcel Dekker, New York
Milligan G.W. and Cooper M.C. (1985) An examination of procedures for determining the
number of clusters in a data set. Psychometrika 50, 159-179
Rajagopalan N. and Loganathan A. (1991) Bayes Estimates of Mixing Proportions in Finite
Mixture Distributions. Communications in Statistics - Theory and Methods 20, 2337-2349
Redner R., Hathaway R. and Bezdek J. (1987) Estimating the Parameters of Mixture Models
with Modal Estimators. Communications in Statistics - Theory and Methods 16, 2639-2660
Richardson S. and Green P. (1997) On Bayesian analysis of mixtures with an unknown
number of components (with discussion). Journal of the Royal Statistical Society,
Series B, 59, 731-792.
Robert C.P. (1996) Mixtures of Distributions: Inference and Estimation. in Markov Chain
Monte Carlo in Practice, Eds Gilks W.R., Richardson S., Spiegelhalter D.J., Chapman
and Hall, 441-464.
Smith A.F.M. and Makov U. (1978) A Quasi-Bayes Procedure for Mixtures. Journal of the
Royal Statistical Society B 40, 106-112
Smith A.F.M. and Roberts G. (1993) Bayesian computation via the Gibbs sampler and related
Markov Chain Monte Carlo methods (with discussion). Journal of the Royal Statistical
Society, B, 55, 3-23.
Stephens, M. (1996) Dealing with the Multimodal Distributions of Mixture Model
Parameters. Preprint, Department of Statistics, University of Oxford.
Teicher H. (1963) Identifiability of Finite Mixtures. Annals of Mathematical Statistics 28, 75-88
Titterington D.M., Smith A.F.M. and Makov U.E. (1985) Statistical Analysis of Finite
Mixture Distributions. Wiley, UK
Wang P., Puterman M., Cockburn I. and Le N. (1996) Mixed Poisson Regression Models with
Covariate Dependent Rates. Biometrics 52, 381-400
Windham M. and Cutler A. (1992) Information ratios for validating mixture analyses. Journal
of the American Statistical Association 87, 1188-1192
Wolfe J.H. (1970) Pattern clustering by multivariate mixture analysis. Multivariate Behavioral
Research 5, 329-350
Table 1.
Number of defaulted instalments in a financial institution in Spain in 1990.

x          0     1     2     3     4     5     6
observed   3002  502   187   138   233   160   107

x          7     8     9     10    11    12    13
observed   80    59    53    41    28    34    10

x          14    15    16    17    18    19    20
observed   13    11    4     5     8     6     3

x          21    22    23    24    >24*
observed   0     1     0     1     4

* The actual values were 28, 29, 30, 34.
Table 3.
Posterior distribution of δ based on a 2-component model.

m        1       2       3       4       5
P(δ=m)   0.0943  0.8242  0.0634  0.0179  0.0001
Table 4.
The number of components estimated via several methods.

Method                               estimated k
Likelihood
  Sequential Bootstrap LRT           5
  Graphical method                   6
Penalised ML
  AIC                                5
  BIC                                4
Penalised Minimum Distance
  Minimum Hellinger (c=1/4n)         4
  Maximum Likelihood (c=1)           5
Moment method                        3
RJMCMC (this paper, posterior mode)  2
Table 2.
The means of the estimated components for various values of k.

k  P(k)    p1    p2    p3    p4    p5    p6    p7     λ1    λ2    λ3     λ4     λ5     λ6     λ7
2  0.537   0.78  0.22  -     -     -     -     -      0.24  6.41  -      -      -      -      -
3  0.282   0.75  0.16  0.09  -     -     -     -      0.19  4.07  10.28  -      -      -      -
4  0.128   0.74  0.13  0.08  0.05  -     -     -      0.16  3.50  6.45   12.87  -      -      -
5  0.039   0.73  0.11  0.08  0.05  0.03  -     -      0.15  2.98  5.19   8.09   14.95  -      -
6  0.010   0.72  0.08  0.07  0.07  0.04  0.02  -      0.14  2.47  4.18   6.21   9.29   16.55  -
7  0.004   0.71  0.07  0.07  0.07  0.05  0.03  0.01   0.14  2.08  3.53   5.21   7.27   10.51  18.08
Figure 1: The values of P(k=j) for j=2,3,4, across the sweeps.
Figure 2: The moves for each k for the financial data (proportions of 'move to k+1', 'stay at k' and 'move to k-1').
Figure 3: Posterior probabilities for k for both Poisson and uniform priors.
Figure 4: Posterior distribution for the parameters (p_1, λ_1, λ_2) of a 2-finite Poisson mixture.