Bayesian Analysis of Finite Poisson Mixtures1 Petros Dellaportas, Dimitris Karlis and Evdokia Xekalaki Department of Statistics Athens University of Economics and Business, 76 Patission Str. Athens, Greece ABSTRACT Finite Poisson mixtures are widely used to model overdispersed data sets for which the simple Poisson distribution is inadequate. Such data sets are very common in real applications. In this paper we investigate Bayesian estimation via MCMC for finite Poisson mixtures and we discuss some computational issues. The related problem of determining the number of components in a mixture is also treated. This problem is of particular interest since it can reveal the number of subpopulations comprising the entire population under study. We model the number of components k as a random quantity having itself a prior probability function and then we apply the Reversible Jump Markov Chain Monte Carlo (Green 1995) to estimate the number of components. A real data example dealing with a financial credit scoring system illustrates our methodology. Keywords: Bayesian Analysis; Credit Scoring; Markov Chain Monte Carlo; Reversible Jump; Mixture Models; 1 An earlier version of this paper was presented at the 4th Practical Bayesian Statistics Conference held in Nottingham, July 1997. 1 1.Introduction In recent economies, the need for financial assistance is present for a great variety of economic activities. A large number of financial institutes provide such kind of facilities by giving loans to their clients. The total amount which the financial institution gives must be returned after a certain fixed time period, usually by paying an instalment at fixed subperiods. The amount of each instalment is calculated using an interest rate so as to be profitable for the financial institution and to cover the difference due to inflation at the nominal amount. By this transaction, the client improves his cash condition and the financial institution can gain a large amount since interest rates are always profitable. At the same time the financial institution undertakes a great risk, that of non-payment of instalments, termed as the defaulted instalments. In this case the institution may loose the whole or part of the total amount borrowed. Usually, the legislation of each country can cope with such situations but even in these cases the risk and the cost for the institution is great. To avoid such situations the institutions have a mechanism, which examines if the new client will adhere to the terms of the loan, so as to classify each client as a ‘good’ or a ‘bad’ client. This mechanism is the so-called credit scoring system and it examines the reliability of each client. It is clear that clients who are allocated to the latter category are suspected to default their instalments, and thus the institution can either deny providing the loan, or give the loan under more severe regulations for the amount and the time of the instalments. The allocation of the new clients to the above mentioned categories can be made using several different statistical methods, (see Hand and Henley, 1997 and the references therein). However, the allocation of customers to only two groups has some disadvantages because it ignores more realistic separations. For example, the ‘bad’ client category might contain clients who deliberately neglect paying their instalments (‘continuously bad clients’) as well as clients 2 who simply neglect paying, for example, due to the nature of their work (‘temporary bad clients’). A mechanism, which could separate these two categories, would therefore be useful. Moreover, it would be of particular interest for the institution to be able to determine the number of different categories, as it may develop different programs for each category, minimising the effort and the risk of the institution. Table 1 about here The data set shown in Table 1 refers to the number of defaulted instalments in a financial institution in Spain in 1990 and has been analysed by Guillen and Artis (1992) and Dionne et al (1996). The data contain a large amount of overdispersion; the mean equals 1.58 while the variance is more than 6 times greater and equals 9.93, implying the inadequacy of the simple Poisson distribution to model the observed frequency distribution. In this paper we shall describe a mixture model to describe the inhomogeneity within the population. Mixture models are alternative candidates when simple models fail. In particular, finite mixture models can provide interesting information about the number of subpopulations comprising the entire population. In the case of the dataset considered, the institution itself would be interested in determining the number of different subpopulations comprising the entire client population. The remaining of this paper is organised as follows. Section 2 contains a brief description of mixture models as well as a short review of estimating methods. Section 3 is devoted to developing Bayesian estimation for finite Poisson mixtures via the Markov Chain Monte Carlo (MCMC) methodology. We also discuss some amendments of the standard MCMC method for improving the performance of our algorithm as well as the problem of determining the number of components comprising the mixture models. Section 4 contains an 3 application of the new method to the financial institution data set. A comparison of the newly proposed method can be found in Section 5. Finally, Section 6 contains a short discussion. 2. Finite Mixture Models Suppose that the probability density of a random variable X, f(x) can be expressed in the form k f ( x ) p j f ( x| j ) (1) j 1 k for some probabilities pj>0, j=1,. . .,k with p =1. Then x is said to have a k-finite mixture i i 1 density. In (1) j is either a vector of parameters or a scalar referring to the j-th component of the mixture. Usually pj is called the mixing proportion while j is called the mixing parameter. Finite mixture models are of special interest mainly for two reasons. Firstly, they can describe populations for which the assumption of a finite number of subpopulations is valid. Secondly, even if the true underlying population consists of an infinite number of subpopulations, one is restricted in estimating it via a finite distribution; see Laird (1978). Furthermore, by fitting a k-finite mixture we can elicit information about the number of components and hence about the number of subpopulations comprising the entire population. It is known that the role of the simple Poisson distribution is prominent among the discrete probability distributions. The Poisson distribution is usually used to model situations where only randomness is present. In practice this probabilistic mechanism cannot explain data variation and the need for alternative more complex models is obvious. Poisson mixture models are then interesting candidates. The probability function of the k-finite Poisson mixture is given by 4 k f ( x) p j e j jx (2) x! j 1 We assume that 0 1 2 .... k in order to ensure the identifiability of the above finite mixture (see Teicher, 1963). Note that we allow the first component mean to be 0, thus allowing the corresponding distribution to be degenerate. A large amount of estimation methods for finite mixtures can be found in Titterington et al (1985), McLachlan and Basford (1988) and Lindsay (1995). These include maximum likelihood methods, moment and minimum distance methods. If the number of components is fixed, simple EM algorithms can be used for maximum likelihood estimation; see for example, Hasselblad (1969). If k is not fixed more sophisticated algorithms must be used; for a detailed review see Bohning (1995). 3. Bayesian Estimation for Finite Poisson Mixtures 3.1. The case of known number of components The standard Bayesian formulation of the finite mixture problem with a known number of components and its implementation via Markov Chain Monte Carlo (MCMC) is given by Diebolt and Robert (1994). For the finite Poisson mixture the explicit details are as follows. Let the data x x1 , x 2 ,..., x n be modelled as exchangeably distributed according to (2). This implies that conditional on the parameter vector p, with 1 ,..., k and p p1 ,..., pk , the distribution of Xi, f ( xi | j ) is Poisson( j ) . By expressing the mixture model in terms of missing data, we introduce indicator parameters z zij ; i 1,..., n; j 1,..., k such that z ij 1 if the i-th observation Xi belongs to the j-th component of the mixture and z ij 0 otherwise. Thus the density 5 f ( xi | zij 1) is Poisson( j ) and f ( zij 1| p) p j . Note that f ( zij | p) is a Multinomial (1, p) distribution so p can be thought as a vector of hyperparameters determining the distribution of z . The joint distribution of the observed data x and the unobserved indicator parameters z conditional on the model parameters can be written as n k f ( x, z | ) f ( x | , z ) f ( z | p) f ( xi | j ) z ij (4) i 1 j 1 where f ( xi | j ) is in our case the Poisson distribution with parameter j . Bayesian formulation requires prior distributions for the model parameters . Immediate choices are the conjugate prior distributions Gamma(a, ) for each j and the Dirichlet (d1,..., dk ) for p. Denoting the priors ( ) , ( ) and ( p) , Bayes theorem leads to the joint posterior distribution of p( | x, z ) f ( x, z | ) ( ) f ( x | , p) f ( z | p) ( p) ( ) . Inference problems associated with the above posterior density present implementation difficulties due to the nature of the sampling density: the complete likelihood in (4) is a product of n terms each being the product of k distinct elements. Possible approaches to overcome this problem include a quasi-Bayesian approach (Smith and Makov, 1978 and Rajagobalan and Loganathan, 1991), modal estimation (Redner et al, 1987), Laplace approximations (Crawford, 1994) and a recent suggestion of Aitkin et al (1996) based on the use of contiguous partitioning. In this paper we follow the usual MCMC approach suggested by Diebolt and Robert (1994) for finite normal mixtures which attracted the attention of many researchers; see for example Lavine and West (1992), Dellaportas et al (1995), Robert (1996). To implement the MCMC for the Poisson mixtures case we need to sample from the full conditional posterior distribution for the parameters , p and z. These sampling steps are readily found to be 6 zij Multinomial (1, wi1 ,..., wik ) i=1, . . .,n, j=1, . . .,k where wij p j f ( xi | j ) f ( xi ) , j=1, . . . , k (5) n n p Dirichlet d 1 wi1 ,..., d k wik i 1 i 1 n n i 1 i 1 j Gamma (a zij x i , b zij ) I ( j 1 , j 1 ) , j=1, . . . , k where I(a,b) denotes the indicator function which takes the value 1 if x belongs to the interval (a,b) and 0 otherwise, with the convention that 0 0 and k 1 . Iterative sampling from the above densities, from suitably long runs of the chain, provides the tool to obtain estimates for marginal posterior quantities of interest; see for example Smith and Roberts (1993). The parameters a, and d j , j=1, . . . ,k can be chosen so that the resulting prior densities for and p are rather flat. The priors should be proper to ensure parameter identifiability. The truncated form of f ( j |.) can avoid the multimodality of the resulting posterior distribution. Moreover, sampling from a Gamma density truncated at the left or at the right is achieved by using techniques such as those described by Gelfand et al (1992). For a detailed discussion on the multimodality of the posterior distributions, see, for example, Stephens (1996). 7 3.2 The case of unknown number of components If the number of components k is not known, the Bayesian estimation problem is enlarged in the sense that inferences are based on the joint posterior distribution of the parameter vector ( k , k ) where k ( 1 ,... k , p1 ,..., pk ) denotes the vector of parameters for the k-finite Poisson mixture. The posterior density can be written as p( | x , z) p( k , k | x , z) f ( x , z| k ) ( k ) ( k ) f ( x| 1 ,..., k , p1 ,..., pk , z) f ( z| p1 ,..., pk ) ( p1 ,..., pk ) ( 1 ,..., k ) ( k ) where ( k ) denotes the prior density for the parameter k. We adopt the recently suggested methodology of Green (1995) which provides a fully Bayesian solution of the above problem. This methodology, termed as the reversible jump MCMC (RJMCMC), has been already applied in the finite normal mixture case in Richardson and Green 1997. The key idea in RJMCMC is to allow moves between parameter subspaces of different dimensionallity by permitting a series of different ‘move types’. The details of the general RJMCMC implementation strategy can be found in the papers cited above. In the sequel, we provide a detailed account of the sampling steps required for the finite Poisson mixture case. Suppose that there are at most k max components, so k=1,. . . , k max . We denote by Sk and Ck the probability of ‘‘split’’ and ‘‘combine’’ moves respectively for a model with k parameters. Clearly S k 1 Ck , with S k max 0 and C1 0 . We set S k 0.5 for k=2, . . ., k max . With these probabilities we choose the type of move which will be attempted. Let us first consider the simplest case where we have to move from the k-component models to (k-1)-component model. At first we have to randomly choose the pair of components which will be combined. These components are, by construction, successively ordered ( 0 1 2 .... k ), and hence we have to choose one out of k-1 different 8 successive pairs. Denoting the chosen pair as j1 , j2 , we have to transform the current vector of parameters p j1 , p j2 , j1 , j2 , to the new vector of parameters, say p* , * , as well as the allocation latent variables z j1 , z j2 to the new allocation latent variable z* . We suggest using the ‘combine’ transformation * p* p j1 p j2 , j pj j pj 1 1 2 p* 2 z* z j1 z j2 , which preserves the values of the means between jumps. Other transformations could have been chosen and although they would not affect the results they might deteriorate the convergence properties of the chain. Note that once the choice of the pair has been made, the ‘‘combine’’ proposal move is deterministic. The ‘‘split’’ move is more tedious than the combine move. At first we have to choose randomly the components which will be attempted to be split. Suppose that we chose at random the j * -th component to be split into 2 new components, say j 1 and j2 . In order to make the transformation reversible we need to generate two continuous random variates u1 and u2 and then calculate the new parameters p j1 , p j2 , j1 , j2 , as well as the corresponding allocation latent variables z j1 , z j2 . The generation of these variables can be made by any arbitrary distribution. So, in the sequel we generate u1 and u2 from a Beta(2,2) distribution. We found that in practice this choice works well providing good chain convergence. Other Beta distributions were also tried (for example we chose a Beta with mean 0.9 so that the split move does encourage splits to a ‘large’ and a ‘small’ component) but they gave similar convergence behaviour. The new parameters are calculated as p j1 p j* u1 p j2 p j* 1 u1 , 9 j j u2 , 1 * 1 u1u2 , * 1 u1 j j 2 and the Jacobean matrix of the transformation of the vector p j* , j* , u1 , u2 to the vector p j1 , p j2 , j1 , j2 can be shown to be given by 0 u1 1 u 0 1 u2 J 0 1 u1u 2 0 1 u1 p j* p j* 0 j 1 u2 * 1 u1 2 j* j* u1 1 u1 0 0 with determinant J p j* j* 1 u1 . Again, once the choice of the pair has been made, the ‘‘split’’ proposal is deterministic. Note also that if the means of the new components are not in order or if one component has mean smaller than 0, then the proposed move is rejected. Finally if zij* 1 for some i, then zij1 1 with probability proportionally equal to wij1 w ij1 wij2 . In practice, since we are dealing with count data, we can simplify the calculations noting that for each count we have the observed frequency. In this case the allocation step is done via generation from the binomial distribution. Thus, when the move has been proposed and the new parameters have been calculated we have either to accept or to reject the move with some probability which is calculated as follows. Suppose that the current state of the Markov chain at time t is m and the corresponding vector of parameters is m . Suppose that a particular move type M involves a proposed move to model m with corresponding parameter vector m which is of higher dimensionality than m . Furthermore, suppose that m is created by generating a vector u of 10 dimensionality equal to the difference in dimensionalities between m and m from some proposal distribution q(u) , by setting m g ( m , u) where g is a one-to-one function. Similarly, if the move type M is used in the other direction (in which case it is denoted by M ) and m is of higher dimension than m then m is created by applying the inverse transformation using the function g 1 ( m , u) and ignoring u. Thus, if w(m,M) denotes the probability of a move of type M when we are at state m, the acceptance probability is calculated as: f ( x , m ) w(m, M ) a min1, J f ( x , m ) w(m, M )q (u) where J is the Jacobean matrix of the transformation. Therefore, the acceptance probabilities for our cases can be calculated. It follows specifically that for the split move the acceptance probability is min(1,A) while for the combine move the acceptance probability is min(1,B) where k 1 L( k 1) ( k 1) i 1 A k L( k ) (k ) ( i ) ( p1 , p2 ,..., pk 1 ) ( ) ( p , p ,..., p ) i 1 k 1 L(k 1) (k 1) i 1 B k L( k ) ( k ) i 1 2 ( k 1) gu1 , u2 1 J Ck 1 Sk Palloc , k ( i ) ( p1 , p2 ,..., pk 1 ) ( i ) ( p1 , p2 ,..., pk ) 1 1 Sk 1 Palloc gu1 , u2 . k J Ck i 1 Here, L( k ) is the likelihood of the model with k components and ( k ), ( ) and ( p1 , p2 , ..., pk ) are the priors for k, and p respectively. gu1 , u2 is the proposal density from which u1 and u2 are generated. The factors (k+1) for the ‘split’ move and ( k 1 ) for the ‘combine’ move are derived from the ratios of factorials in the ordered statistics densities for the mixture means. 11 4. Application to financial data The following prior specifications were used for analysing the data in Table 1. The maximum number of components k max was set equal to 25, and two different priors for k were considered. The first was a uniform across the values 1, . . , k max , with P (k ) 1 / k max while the second was a zero truncated Poisson distribution with P (k ) et t k , k>0, t !(1 e t ) where t was a parameter reflecting our belief (or knowledge) about the mean number of components; we choose, for illustrative purposes, to set t=5. We found that due to the large sample size (n=4690), the results were not sensitive to the choice of the prior. For j , j=1,...,k we chose a Gamma (a 0.01, 0.01) prior density, which is rather flat, reflecting our prior ignorance about the parameter values. Again we found that even a multimodal informative prior density with k modes, does not alter our main inferences. Finally, the choice of the prior parameters for the mixing proportions ( p1 ,..., p k ) was a Dirichlet(1,...,1) distribution and deviations from this density produced no interesting changes to the posterior distribution. Starting values for the RJMCMC algorithm varied for all the parameters. The algorithm turned out to be very insensitive to the initial values reaching the joint posterior density in, at most, 1000 iterations. The results which follow were derived by using a burn-in period of 5000 iterations and then taking one point every 100 iterations, to avoid autocorrelation biases. Figure 1 shows the values of P (k j ) for j=2,3,4 across the sweeps. The stability of the values indicates convergence of the RJMCMC algorithm. Figure 2 illustrates the moves for each k in 50000 iterations. RJMCMC succeeded in moving efficiently between models. 12 The proportion of accepted moves was near 0.40. Figure 3 depicts the posterior probabilities for k for both Poisson and Uniform prior probabilities. It is evident that the posterior densities are rather robust to the choice of priors, indicating that the posterior for k favours, in both cases, a model with 2 components. The probability decreases, as k becomes larger, obtaining values less than 0.1 for k 5 . Table 2 about here All posterior summaries of interest are tabulated in Table 2. The most probable model (k=2) indicates that for these data, there are two groups of customers, representing the 78% and 22% of the population respectively, with corresponding mean number of defaulted instalments 0.24 and 6.41. Note that the values of p1 and 1 are rather stable across k, indicating that we can be rather firm about the characteristics of the ‘good customers’ group. Assume that we choose our inferences to be based on the most probable model with two mixture components. An interesting quantity to report in our example would be a cut-off value, say. This value determines the subpopulation in which an individual will be allocated. Hence if the number of defaulted instalments is smaller or equal to then the individual is allocated at the first group while if it is larger at the second group. Conditional on the population parameters k , 1 , ... , k , p1 , ... , p k we define as the cut-off point the integer part of the root of the equation p1 f 1 ( | 1 ) p2 f 2 ( | 2 ) (6) denoting by f 1 and f 2 the two Poisson densities estimated from the 2-finite Poisson model. The unique solution of the above equation is ln( p2 / p1 ) 2 1 . ln( 1 / 2 ) 13 A posterior sample of is readily available by producing a function of the sampled population parameters 1 , 2 , p1 , solving (6) and then calculating where [a] stands for the integer part of a. The posterior distribution of in the data of Table 1 is tabulated in Table 3. We have used a lag of 100 iterations so as to remove autocorrelation and the results are based on 3000 values of the cut-off point. This posterior distribution is concentrated on =2, (the mean is 2.001 and the standard deviation is 0.53) reflecting that this value can be considered as the cut-off point for allocating a client as ‘good’. In practice, the financial institution can use this value to start possible sanctions for discouraging similar actions in the sense that the institution wants not to bother ‘good’ clients’ which, by accident, have defaulted instalments. Table 3 about here The entries of Table 3 were based on a 2- components model. We might have used model-averaging technique for calculating the posterior distribution of the cut-off point. This would be a mixture of the posterior distributions obtained of each value of k, weighted by the probability of a model with k-components. 5. Comparison with other methods Determining the number of components in a mixture is a difficult and yet unresolved problem. The same problem occurs in cluster analysis where the number of clusters is of interest (see, for example, McLachlan and Basford, 1988 for a thorough presentation of the use of finite mixture models in cluster analysis). A large number of methods have been proposed for this problem. We will briefly describe some of them, and then a brief comparison of them to our method will be made, based on the considered real data set. 14 The majority of the methods have been proposed for normal mixtures, but they are also applied, with minor modifications, to finite Poisson mixtures. The common strategy of sequentially using the likelihood ratio test has been proposed by Wolfe (1970), and Izenmann and Sommer (1988). However it has been strongly critisized by several authors, mainly because the asymptotic distribution is not of a chi-square form as demonstrated, for example by McLachlan (1987), and Milligan and Cooper (1985). Recently Karlis and Xekalaki (1997a) have proposed the use of a sequential bootstrap likelihood ratio test where the null distribution of the test statistic is constructed at each step via parameteric bootstrap. This approach verified that the chi-square approximation is inappropriate. Penalised likelihood methods have been proposed by Leroux (1992) and Leroux and Putterman (1992). They proposed the use of the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) for discouraging the use of a large number of components. The AIC is given by A(m) Lm d m while the BIC is given by B(m) Lm ln(n)d m / 2 , where Lm is the maximized log-likelihood for a model with m components and d m is the number of free parameters in the model with m components, which is equal to 2m-1 for a m-finite Poisson mixture distribution. Both criteria share disadvantages as they both ignore the number of already fitted components, while the AIC also ignores the sample size. Penalised Minimum distance methods have been proposed by Henna (1985) and Chen and Kaldbfleisch (1996). The innovation in the latter is that the penalty is based on the mixing proportions so as to discourage components with small mixing proportions. The penalized k distance is given by Km Dm c ln pi where Dm is an appropriately chosen distance for a i 1 15 model with m components (e.g. the Hellinger Distance considered by Karlis and Xekalaki, 1997b), and c is a scalar which scales the contribution of the penalty term. Fruman and Lindsay (1994) proposed a method based on moments. They used the property of several distributions to generate mixtures for which the moments of the mixing distribution can be consistently estimated by the data. Their method adds components until the related moment problem becomes insolvable. This method is criticised on that it leads to few components (see Heckman et al, 1990). Other methods related to the problem can be found in Lindsay and Roeder (1992), Celeux and Diebolt (1985), Windham and Culter (1992) and Aitkin et al (1996). We applied some of the above mentioned method to our data set. The number of components resulted in by these methods as well as by the method proposed in this paper can be seen in Table 4. It is interesting that the estimated number of components varies from method to method. Moreover, our method proposes the smallest number of components. Table 4 about here We re-emphasize here that the RJMCMC-based Bayesian methodology provides a posterior distribution on the number of components k, rather than a value of k considered as a ‘best estimator’. This, of course, leads to a number of competing explanations of the data. Moreover, we would like to point out that the methods in table 4 are based, in general, on the usual classical paradigm, which typically uses a penalty to offset additional variables in an enlarged model and then chooses the model with the smallest possible number of parameters, thus obeying to a principal of parsimony. However, it is striking that our Bayesian approach produced the highest probability for k=2 which is less than any value the classical methods produced. We tested our code and the behaviour of RJMCMC for simulated data and we 16 report that there is not a bias towards models with fewer components. These results agree with the general experience in model mixtures; see Richardson and Green (1997). 6. Discussion A Bayesian estimation procedure for finite Poisson mixtures was developed and applied to financial data. The RJMCMC provides both estimation for the number of components as well as estimation of the parameters of the mixture. The method’s results favoured models with fewer components. In an experimental study we certified that this is not due to the bad behaviour of RJMCMC but it happened due to the mixtures overlapping: when the sample size is small and the counts are overdispersed, RJMCMC may produce results with high posterior probabilities for models with k n . We restricted ourselves to using simple data sets containing only counts. Clearly, the financial institution can use supplementary information, e.g. some covariates reflecting personal characteristics of the client. This is the common strategy for credit scoring methods, see Dionne et al (1994). Both the MCMC as well as the RJMCMC can be applied. The finite mixture Poisson regression method described in Wang et al (1996) is an example where we can proceed by assuming that the number of components is unknown, and thus develop full Bayesian estimation for both the number of components and the parameters associated to the covariates. Finally, we would like to point out that the usual by-products of the MCMC output are, as usual, available here. For example, standard confidence regions, predictive densities and any posterior summaries of interest can be readily performed by utilising the RJMCMC output. 17 Acknowledgement Dimitris Karlis is being supported by a scholarship from the State Scholarships Foundation of Greece. We would like to thank Sylvia Richardson and Valerie Viallefont for their help on the split-combine move. References Aitkin M., Finch S., Mendell N. and Thode H. (1996) A new test for the presence of a normal mixture distribution based on the posterior Bayes factor. Statistics and Computing 6, 121-125 Bohning D. (1995) A review of reliable maximum likelihood algorithms for semiparametric mixture models. Journal of Statistical Planning and Inference 47, 5-28 Celeux G. and Diebolt J. (1985) The SEM algorithm: a probabilistic teacher algorithm derived from the EM algorithm for the mixture problem. Comput. Statistics Quarterly 2, 73-92 Chen J. and Kalbfleisch J.D. (1996) Penalised minimum-distance estimates in finite mixture models. Canadian Journal of Statistics 24, 167-175 Crawford S. (1994) An Application of Laplace Method to Finite Mixture Distributions. Journal of the American Statistical Association 89, 259-267 Dellaportas P., Stephens D. A., Smith A. F.M. and Guttman I. (1995) A comparative study of prenatal mortality using a two component mixture model. In: Bayesian Biostatistics, pp 601-616, Berry D. A. and Stangl D. K. (editors). New York, Marcel Dekker. Diebolt J. and Robert C. (1994) Estimation of Finite Mixture Distributions Through Bayesian Sampling. Journal of the Royal Statistical Society B 56, 363-375 18 Dionne G, Artis M. and Guillen M. (1996) Count data models for a credit scoring system. Journal of Empirical Finance, 3, 303-325 Furman W.D. and Lindsay B. (1994) Testing for the number of components in a mixture of normal distributions using moment estimators. Computational Statistics and Data Analysis, 17, 473-492 Gelfand A.E., Smith A.F.M. and Lee T.M. (1992) Bayesian Analysis of constrained parameter and truncated data problems using Gibbs sampling. Journal of the American Statistical Association 87, 523-532 Green P. (1995) Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82, 711-732 Guillen M. and Artis M. (1992) Count Data Models for a Credit Scoring System. Presented at the 3rd Meeting of the European Conference Series in Quantitative Economics and Econometrics on Econometrics of Duration, Count and Transition Models, Paris, December, 10-11, 1992 Hand D.J. and Henley W.E. (1997) Statistical Classification Methods in Consumer Credit Scoring: a Review. Journal of the Royal Statistical Society, A, 160, 523-541 Hasselblad V. (1969) Estimation of finite mixtures from the exponential family. Journal of the American Statistical Association 64, 1459-1471 Heckman J., Robb R. and Walker J. (1990) Testing the Mixture of Exponentials Hypothesis and Estimating the Mixing Distribution By the Method of Moments . Journal of the American Statistical Association 85, 582-589 Henna J. (1985) On estimating the number of constituents of a finite mixture of continuous distributions. Annals of the Institute of Statistical Mathematics, 37, 235-240 19 Izenmann A. J. and Sommer C. (1988) Philatelic mixtures and multimodal densities. Journal of the American Statistical Association 83, 941-953 Karlis D. and Xekalaki E. (1997) Testing for finite mixtures via the Likelihood Ratio Test. Technical Report #28, Dept. of Statistics, Athens University of Economics and Business, January 1997 Karlis D. and Xekalaki E. (1997) Minimum Hellinger Estimation for finite Poisson mixtures. To appear in Computational Statistics and Data Analysis Laird N. (1978) Nonparametric Maximum Likelihood Estimation of a Mixing Distribution. Journal of the American Statistical Association 73, 805-811 Lavine M. and West M. (1992) A Bayesian Method for Classification and Discrimination. the Canadian Journal of Statistics 20, 451-461 Leroux B. (1992) Consistent estimation of a mixing distribution. Annals of Statistics 20, 1350-1360 Leroux B. and Puterman M. (1992) Maximum-Penalised-Likelihood for independent and Markov-Dependent Mixture Models. Biometrics 48, 545-558 Lindsay B. (1995) Mixture Models: Theory, Geometry and Applications. Regional Conference Series in Probability and Statistics, Vol 5, Institute of Mathematical Statistics and American Statistical Association Lindsay B. and Roeder K. (1992) Residuals diagnostics for mixture models. Journal of the American Statistical Association 87, 785-794 McLachlan G. (1987) On bootstrapping the likelihood ratio test statistic for the number of components in a normal mixture. Applied Statistics 36, 318-324 McLachlan G. and Basford K. (1988) Mixture models: inference and application to clustering. Marcel and Decker Inc. 20 Milligan G. W. and Cooper M. C. (1985) An examination of procedures for determining the number of clusters in a data set. Pshychometrika 50, 159-179 Rajagopalan N. and Loganathan A. (1991) Bayes Estimates of Mixing Proportions in Finite Mixture Distributions. Communications in Statistics-Theory and Methods 20, 23372349 Redner R., Hathaway R. and Bezdek J. (1987) Estimating the Parameters of Mixture Models with Modal Estimators. Communications in Statistics- Theory and Methods 16, 26392660 Richardson S. and Green P. (1997) On Bayesian analysis of mixtures with an unknown number of components (with discussion). Journal of the Royal Statistical Society, Series B, 59, 731-792. Robert C.P. (1996) Mixtures of Distributions: Inference and Estimation. in Markov Chain Monte Carlo in Practice, Eds Gilks W.R., Richardson S., Spiegelhalter D.J., Chapman and Hall, 441-464. Smith A.F.M. and Makov U. (1978) A Quasi-Bayes Procedure for Mixtures. Journal of the Royal Statistical Society B 40, 106-112 Smith A.F.M. and Roberts G. (1993) Bayesian computation via the Gibbs sampler and related Markov Chain Monte Carlo methods (with discussion). Journal of the Royal Statistical Society, B, 55, 3-23. Stephens, M. (1996) Dealing with the Multimodal Distributions of Mixture Model Parameters. Preprint, Department of Statistics, University of Oxford. Teicher H. (1963) Identifiability of Finite Mixtures. Annals of Mathematical Statistics 28, 7588 21 Titterington M., Makov G., and Smith A.F.M. (1985) Statistical analysis of finite mixtures. Willey, UK Wang P., Puterman M., Cokburn I. and Le N. (1996) Mixed Poisson Regression Models with Covariate Dependent Rates. Biometrics, 52, 381-400 Windham M. and Cutler A. (1992) Information ratios for validating mixture analyses. Journal of the American Statistical Association 87, 1188-1192 Wolfe J.H. (1970) Pattern clustering by multivariate mixture analysis. Multivariate Behavioral Research 5, 329-350 22 Table 1. Number of defaulted instalments in a financial institution in Spain in 1990. x 0 1 2 3 4 5 6 observed 3002 502 187 138 233 160 107 x 7 8 9 10 11 12 13 observed 80 59 53 41 28 34 10 X 14 15 16 17 18 19 20 observed 13 11 4 5 8 6 3 x 21 22 23 24 >24* * The actual values were 28,29,30,34 Table 3. Posterior distribution of based on a 2-components model m 1 2 3 4 5 P(=m) 0.0943 0.8242 0.06344 0.0179 0.0001 Table 4. The number of components estimated via several methods. Method estimated k Likelihood Sequential Bootstrap LRT Graphical method Penalised ML AIC BIC Penalised Minimum Distance Minimum Hellinger (c=1/4n) Maximum Likelihood (c=1) Moment method 23 5 6 5 4 4 5 3 observed 0 1 0 1 4 Table 2. The mean of the estimated components for various values of k p1 p2 p3 p4 p5 p6 p7 λ1 λ2 λ3 λ4 λ5 λ6 λ7 k P(k) 2 0.537 0.78 0.22 - - - - - 0.24 6.41 - - - - - 3 0.282 0.75 0.16 0.09 - - - - 0.19 4.07 10.28 - - - - 4 0.128 0.74 0.13 0.08 0.05 - - - 0.16 3.50 6.45 12.87 - - - 5 0.039 0.73 0.11 0.08 0.05 0.03 - - 0.15 2.98 5.19 8.09 14.95 - - 6 0.010 0.72 0.08 0.07 0.07 0.04 0.02 - 0.14 2.47 4.18 6.21 9.29 16.55 - 7 0.004 0.71 0.07 0.07 0.07 0.05 0.03 0.01 0.14 2.08 3.53 5.21 7.27 10.51 18.08 24 .7 .6 k=2 probability .5 .4 k=3 .3 .2 k=4 .1 0.0 0 10000 20000 30000 sweep Figure 1: The values of P(k=j) for j=2,3,4, across the sweeps. 25 40000 probability 1.2 1.0 .8 .6 .4 move to k+1 .2 stay at k 0.0 move to k-1 2 3 4 5 6 7 8 k Figure 2: The moves for each k for the financial data. 26 .6 probability .5 .4 .3 .2 .1 PO ISSO N U N IFO RM 0.0 1 2 3 4 6 5 7 8 9 k Figure 3: Posterior probabilities for k for both Poisson and uniform priors. 27 700 1600 600 1400 1200 500 1000 400 800 300 600 200 400 100 200 0 0 .719 .750 .781 .813 .844 p1 .875 .906 .125 .225 .325 .425 λ1 ` 1000 800 600 400 200 0 5.44 .525 6.98 8.51 10. 05 λ2 Figure 4. Posterior distribution for the parameters of a 2-finite Poisson mixture. 28 .625 .725 .825