Generalized discriminant rule for binary data when training and test populations differ on their descriptive parameters

Julien Jacques¹ and Christophe Biernacki²

¹ Laboratoire de Statistiques et Analyse des Données, Université Pierre Mendès France, 38040 Grenoble Cedex 9, France. julien.jacques@iut2.upmf-grenoble.fr
² Laboratoire Paul Painlevé UMR CNRS 8524, Université Lille I, 59655 Villeneuve d'Ascq Cedex, France. christophe.biernacki@math.univ-lille1.fr

Summary. Standard discriminant analysis assumes that the training sample and the test sample are drawn from the same population. When these samples come from populations with different descriptive parameters, generalized discriminant analysis consists in adapting the classification rule built on the training population to the test population, by estimating a link between the two populations. This paper extends existing work, available in a multinormal context, to the case of binary data. To meet the main challenge of this work, which is to define a link between the two binary populations, we assume that the binary data arise from the discretization of latent Gaussian data. Estimation methods are presented, and an application in a biological context illustrates the approach.

Key words: Model-based discriminant analysis, relationship between populations, latent class assumption.

1 Introduction

Standard discriminant analysis assumes that the labeled training sample and the test sample to be classified are drawn from the same population. Since the work of Fisher in 1936 [Fis36], who introduced a linear discriminant rule between two groups, numerous developments have been proposed (see [McL92] for a survey). All of them concern the nature of the discriminant rule: parametric quadratic rules, semiparametric rules or nonparametric rules. An alternative development, introduced by Van Franeker and Ter Braak [Van93] and pursued by Biernacki et al. [Bie02], considers the case in which the training sample is not necessarily drawn from the same population as the test sample. Biernacki et al. define several models of generalized discriminant analysis in a multinormal context and test them in a biological situation in which the two populations consist of birds of the same species but of different geographical origins.

In many domains, however, such as insurance or medicine, a large number of applications deal with binary data. The goal of this paper is to extend generalized discriminant analysis, established in a multinormal context, to the case of binary data. The difference between the training and test populations can be geographical (as in the biological application quoted above), but also temporal, for instance.

The next section presents the data and the latent class model for both the training and test populations. Section 3 makes the assumption that these binary data are the discretization of latent continuous variables; this hypothesis helps to establish a link between the two populations and leads to eight models of generalized discriminant analysis for binary data. Section 4 describes the estimation of the model parameters. In Section 5, an application in a biological context illustrates that generalized discriminant analysis is more powerful than standard discriminant analysis or clustering. Finally, the last section discusses possible extensions of the present work.
2 The data and the latent class model

Data consist of two samples: a first sample $S$, labeled and drawn from the training population $P$, and a second sample $S^*$, unlabeled and drawn from the test population $P^*$. The two populations may differ. The training sample is composed of $n$ pairs $(x_1, z_1), \ldots, (x_n, z_n)$, where $x_i$ is the explanatory vector of the $i$th object and $z_i = (z_{i1}, \ldots, z_{iK})$, with $z_{ik}$ equal to 1 if the $i$th object belongs to the $k$th group and 0 otherwise ($i = 1, \ldots, n$, $k = 1, \ldots, K$), $K$ denoting the number of groups. Each pair $(x_i, z_i)$ is an independent realization of $(X, Z)$ with distribution

$$X^j_{|Z^k=1} \sim \mathcal{B}(\alpha_{kj}) \quad \forall j = 1, \ldots, d \quad \text{and} \quad Z \sim \mathcal{M}(1, p_1, \ldots, p_K), \qquad (1)$$

where $\mathcal{B}(\alpha_{kj})$ is the Bernoulli distribution with parameter $\alpha_{kj}$ ($0 < \alpha_{kj} < 1$) and $\mathcal{M}(1, p_1, \ldots, p_K)$ is the multinomial distribution of order one with parameters $p_1, \ldots, p_K$ ($0 < p_k < 1$, $\sum_{k=1}^K p_k = 1$). Moreover, under the latent class model assumption that the explanatory variables are conditionally independent [Cel91], the probability function of $X$ is

$$f(x^1, \ldots, x^d) = \sum_{k=1}^K p_k \prod_{j=1}^d \alpha_{kj}^{x^j} (1 - \alpha_{kj})^{1 - x^j}. \qquad (2)$$

The test sample is composed of $n^*$ pairs $(x^*_1, z^*_1), \ldots, (x^*_{n^*}, z^*_{n^*})$, where the $z^*_i$ are unknown (the variables are the same as in the training sample). These pairs are independent realizations of $(X^*, Z^*)$ with the same distribution as $(X, Z)$ but with different parameters, denoted $\alpha^*_{kj}$ and $p^*_k$. Our aim is to estimate the unknown labels $z^*_1, \ldots, z^*_{n^*}$ using information from both the training and test samples. The challenge is to find a link between the populations $P$ and $P^*$.

3 Relationship between test and training populations

In a multinormal context, a linear stochastic relationship between $P$ and $P^*$ is not only justified (under very few assumptions) but also intuitive [Bie02]. In the binary context, since no such intuitive relationship seems to exist, an additional assumption is made: the binary data are supposed to arise from the discretization of latent Gaussian variables. This assumption is not new in statistics: see for example the work of Thurstone [Thu27], who used it in his comparative judgment model for choosing between two stimuli, or that of Everitt [Eve87], who proposed a classification algorithm for binary, categorical and continuous data.

We thus suppose that the explanatory variables $X^j_{|Z^k=1}$, with Bernoulli distribution $\mathcal{B}(\alpha_{kj})$, arise from the following discretization of latent continuous variables $Y^j_{|Z^k=1}$ with normal distribution $\mathcal{N}(\mu_{kj}, \sigma^2_{kj})$; as for the discrete variables, the $Y^j_{|Z^k=1}$ are assumed conditionally independent:

$$X^j_{|Z^k=1} = \begin{cases} 0 & \text{if } \lambda_j Y^j_{|Z^k=1} < 0 \\ 1 & \text{if } \lambda_j Y^j_{|Z^k=1} \geq 0 \end{cases} \quad \text{for } j = 1, \ldots, d, \qquad (3)$$

where $\lambda_j \in \{-1, 1\}$ is introduced to avoid choosing which value of $X^j$, 0 or 1, corresponds to a positive value of $Y^j$. The role of this new parameter is thus to prevent the binary variables from inheriting the natural order induced by the continuous variables. We can then derive the following relationship between $\alpha_{kj}$ and $\lambda_j$, $\mu_{kj}$ and $\sigma_{kj}$:

$$\alpha_{kj} = p(X^j_{|Z^k=1} = 1) = \begin{cases} \Phi\left(\frac{\mu_{kj}}{\sigma_{kj}}\right) & \text{if } \lambda_j = 1 \\ 1 - \Phi\left(\frac{\mu_{kj}}{\sigma_{kj}}\right) & \text{if } \lambda_j = -1 \end{cases} \qquad (4)$$

where $\Phi$ is the $\mathcal{N}(0, 1)$ cumulative distribution function. It is worth noting that the assumption of conditional independence makes the computation of $\alpha_{kj}$ very simple.
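As an illustration, here is a minimal sketch of the latent class density (2) and the discretization link (4); all numerical values and names below are illustrative assumptions, not quantities from the paper:

```python
# Sketch of the latent class model (Section 2) and the discretization
# link (3)-(4). Parameter values are illustrative assumptions.
import numpy as np
from scipy.stats import norm

def alpha_from_latent(mu, sigma, lam):
    """Bernoulli parameter alpha_kj induced by the latent Gaussian
    N(mu_kj, sigma_kj^2) through the discretization rule (3)-(4)."""
    a = norm.cdf(mu / sigma)           # Phi(mu_kj / sigma_kj)
    return a if lam == 1 else 1.0 - a  # sign flip when lambda_j = -1

def mixture_density(x, p, alpha):
    """Latent class density (2): sum_k p_k prod_j alpha_kj^x_j (1-alpha_kj)^(1-x_j)."""
    x = np.asarray(x)
    comp = np.prod(alpha**x * (1 - alpha)**(1 - x), axis=1)  # conditional independence
    return np.dot(p, comp)

# Illustrative values: K = 2 groups, d = 3 binary variables.
p = np.array([0.4, 0.6])
alpha = np.array([[0.7, 0.2, 0.5],
                  [0.3, 0.8, 0.6]])
print(alpha_from_latent(0.5, 1.0, -1))   # 1 - Phi(0.5), about 0.309
print(mixture_density([1, 0, 1], p, alpha))
```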
In particular, the conditional independence assumption avoids the computation of multidimensional integrals, which are very complex to estimate, especially when the problem dimension $d$ is large, as is often the case with binary data.

As for the variable $X$, we suppose also that the discrete variable $X^*$ arises from the discretization of a latent continuous variable $Y^*$. The equations are the same as (3) and (4), with $\alpha_{kj}$ changed into $\alpha^*_{kj}$, $\mu_{kj}$ into $\mu^*_{kj}$ and $\sigma_{kj}$ into $\sigma^*_{kj}$. The parameter $\lambda^*_j$ is naturally supposed equal to $\lambda_j$.

In a Gaussian context, Biernacki et al. [Bie02] defined the stochastic linear relationship (5) between the latent continuous variable $Y$ of $P$ and $Y^*$ of $P^*$ under two plausible hypotheses: the transformation between $P$ and $P^*$ is $C^1$, and the $j$th component $Y^{*j}_{|Z^k=1}$ of $Y^*_{|Z^k=1}$ depends only on the $j$th component $Y^j_{|Z^k=1}$ of $Y_{|Z^k=1}$:

$$Y^*_{|Z^{*k}=1} \sim A_k Y_{|Z^k=1} + b_k, \qquad (5)$$

where $A_k$ is a diagonal matrix of $\mathbb{R}^{d \times d}$ with diagonal elements $a_{kj}$ ($1 \leq j \leq d$) and $b_k$ is a vector of $\mathbb{R}^d$. Using this relationship, we obtain after some calculation the following relationship between the parameters $\alpha^*_{kj}$ and $\alpha_{kj}$:

$$\alpha^*_{kj} = \Phi\left(\delta_{kj} \Phi^{-1}(\alpha_{kj}) + \lambda_j \gamma_{kj}\right), \qquad (6)$$

where $\delta_{kj} = \mathrm{sgn}(a_{kj})$, $\mathrm{sgn}(t)$ denoting the sign of $t$, and $\gamma_{kj} = b_{kj} / (|a_{kj}|\, \sigma_{kj})$.

Thus, conditionally on the $\alpha_{kj}$ being known (in practice they will be estimated), the estimation of the $Kd$ continuous parameters $\alpha^*_{kj}$ is obtained from estimates of the parameters of the link between $P$ and $P^*$: $\delta_{kj}$, $\gamma_{kj}$ and $\lambda_j$. The number of continuous parameters is thus $Kd$, so estimating the $\alpha^*_{kj}$ directly, without using population $P$, and estimating the link between $P$ and $P^*$ are equivalent. Consequently, the number of free continuous parameters must be reduced, which is done by imposing constraints on the transformation between the two populations $P$ and $P^*$ through the following sub-models.

Model $M_1$: $\sigma_{kj}$ and $A_k$ free and $b_k = 0$ ($k = 1, \ldots, K$). The model is
$$\alpha^*_{kj} = \Phi\left(\delta_{kj} \Phi^{-1}(\alpha_{kj})\right) \quad \text{with } \delta_{kj} \in \{-1, 1\}.$$
This transformation corresponds either to the identity or to a permutation of the modalities of $X$.

Model $M_2$: $\sigma_{kj} = \sigma$, $A_k = a I_d$ with $a > 0$, $I_d$ the identity matrix of $\mathbb{R}^{d \times d}$, and $b_k = \beta e$ with $\beta \in \mathbb{R}$ and $e$ the vector of dimension $d$ composed only of 1's (the transformation is dimension and group independent). The model is
$$\alpha^*_{kj} = \Phi\left(\Phi^{-1}(\alpha_{kj}) + \lambda'_j |\gamma|\right) \quad \text{with } \lambda'_j = \lambda_j\, \mathrm{sgn}(\gamma) \in \{-1, 1\} \text{ and } |\gamma| \in \mathbb{R}^+.$$
The assumption $a > 0$ is made for identifiability of the model and does not induce any restriction. The same hypothesis is made in the two following models.

Model $M_3$: $\sigma_{kj} = \sigma_k$, $A_k = a_k I_d$ with $a_k > 0$ for $1 \leq k \leq K$, and $b_k = \beta_k e$ with $\beta_k \in \mathbb{R}$ (the transformation is only group dependent). The model is
$$\alpha^*_{kj} = \Phi\left(\Phi^{-1}(\alpha_{kj}) + \lambda'_{kj} |\gamma_k|\right) \quad \text{with } \lambda'_{kj} = \lambda_j\, \mathrm{sgn}(\gamma_k) \in \{-1, 1\} \text{ and } |\gamma_k| \in \mathbb{R}^+.$$

Model $M_4$: $\sigma_{kj} = \sigma_j$, $A_k = A$ with $a_{kj} > 0$ for $1 \leq k \leq K$ and $1 \leq j \leq d$, and $b_k = b \in \mathbb{R}^d$ (the transformation is only dimension dependent). The model is
$$\alpha^*_{kj} = \Phi\left(\Phi^{-1}(\alpha_{kj}) + \gamma'_j\right) \quad \text{with } \gamma'_j = \lambda_j \gamma_j \in \mathbb{R}.$$
Note that in model $M_4$, since the parameter $\gamma'_j$ is free, it absorbs the parameter $\lambda_j$.

For these four models $M_i$ ($i = 1, \ldots, 4$), an additional assumption on the group proportions is considered: they are either conserved from $P$ to $P^*$ or not. Let $M_i$ denote the model with unchanged proportions and $pM_i$ the model with possibly changed proportions. Eight models are thus defined.
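The following sketch shows how the link (6) specializes under each of the four models; all parameter values are illustrative assumptions:

```python
# Hedged sketch of the link (6) between alpha_kj and alpha*_kj under the
# four transformation models M1-M4. Values are chosen for illustration.
import numpy as np
from scipy.stats import norm

def alpha_star(alpha, delta=1.0, lam=1, gamma=0.0):
    """Generic link (6): alpha*_kj = Phi(delta_kj Phi^{-1}(alpha_kj) + lambda_j gamma_kj)."""
    return norm.cdf(delta * norm.ppf(alpha) + lam * gamma)

alpha = np.array([[0.7, 0.2, 0.5],
                  [0.3, 0.8, 0.6]])    # K = 2, d = 3 (illustrative)
lam = np.array([1, -1, 1])             # lambda_j in {-1, 1}

# M1: delta_kj in {-1, 1} free, no shift (identity or modality permutation).
delta = np.array([[1, -1, 1], [1, 1, -1]])
a_M1 = alpha_star(alpha, delta=delta)

# M2: a single shift |gamma| shared by all groups and dimensions.
a_M2 = alpha_star(alpha, lam=lam, gamma=0.3)

# M3: one shift |gamma_k| per group (column vector broadcasts over j).
gamma_k = np.array([[0.3], [0.5]])
a_M3 = alpha_star(alpha, lam=lam, gamma=gamma_k)

# M4: one free shift gamma'_j per dimension, which absorbs lambda_j.
gamma_j = np.array([0.2, -0.4, 0.1])
a_M4 = alpha_star(alpha, gamma=gamma_j)
```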
Note that model $M_2$ is always nested in $M_3$ and $M_4$, and $M_1$ can sometimes be nested in the three other models.

Finally, to choose automatically among the eight generalized discriminant models, the BIC criterion (Bayesian Information Criterion, [Sch78]) can be employed. It is defined by
$$BIC = -2\, l(\hat{\theta}) + \nu \log(n),$$
where $\theta = \{p^*_k, \delta_{kj}, \lambda_j, \gamma_{kj}\}$ for $1 \leq k \leq K$ and $1 \leq j \leq d$, $l(\hat{\theta})$ is the maximum log-likelihood corresponding to the estimate $\hat{\theta}$ of $\theta$, and $\nu$ is the number of free continuous parameters of the given model. The model leading to the smallest BIC value is retained. The parameter $\theta$ now has to be estimated, and the maximum likelihood method is used.

4 Parameter estimation

Generalized discriminant analysis requires three estimation steps. We present the situation where the proportions are unknown, the contrary case being immediate. The first step consists in estimating the parameters $p_k$ and $\alpha_{kj}$ ($1 \leq k \leq K$, $1 \leq j \leq d$) of population $P$ from the training sample $S$. Since $S$ is a labeled sample, the maximum likelihood estimates are standard [Eve84, Cel91]. The second step consists in estimating the parameters $p^*_k$ and $\alpha^*_{kj}$ ($1 \leq k \leq K$, $1 \leq j \leq d$) of the Bernoulli mixture from both $\theta$ and $S^*$. To estimate the $\alpha^*_{kj}$, estimates of the parameters of the link between $P$ and $P^*$ are needed: once the parameters $\delta_{kj}$, $\gamma_{kj}$ and $\lambda_j$ of this link are estimated, an estimate of $\alpha^*_{kj}$ is deduced from equation (6). This step is described below. Finally, the third step consists in estimating the group membership of the individuals of the test sample $S^*$ by the maximum a posteriori rule.

For the second step, maximum likelihood estimation can be carried out efficiently with the EM algorithm [Dem77]. The likelihood is

$$L(\theta) = \prod_{i=1}^{n^*} \sum_{k=1}^{K} p^*_k \prod_{j=1}^{d} (\alpha^*_{kj})^{x^{*j}_i} (1 - \alpha^*_{kj})^{1 - x^{*j}_i}.$$

The completed log-likelihood is

$$l_c(\theta; z^*_1, \ldots, z^*_{n^*}) = \sum_{i=1}^{n^*} \sum_{k=1}^{K} z^*_{ik} \log \Big( p^*_k \prod_{j=1}^{d} (\alpha^*_{kj})^{x^{*j}_i} (1 - \alpha^*_{kj})^{1 - x^{*j}_i} \Big).$$

The E step. Given a current value $\theta^{(q)}$ of the parameter $\theta$, the E step of the EM algorithm computes the conditional expectation of the completed log-likelihood:

$$Q(\theta; \theta^{(q)}) = E_{\theta^{(q)}}\big[l_c(\theta; Z^*_1, \ldots, Z^*_{n^*}) \mid x^*_1, \ldots, x^*_{n^*}\big] = \sum_{i=1}^{n^*} \sum_{k=1}^{K} t^{(q)}_{ik} \Big\{ \log(p^*_k) + \log \prod_{j=1}^{d} (\alpha^*_{kj})^{x^{*j}_i} (1 - \alpha^*_{kj})^{1 - x^{*j}_i} \Big\}, \qquad (7)$$

where

$$t^{(q)}_{ik} = p(Z^*_{ik} = 1 \mid x^*_i; \theta^{(q)}) = \frac{p^{*(q)}_k \prod_{j=1}^{d} (\alpha^{*(q)}_{kj})^{x^{*j}_i} (1 - \alpha^{*(q)}_{kj})^{1 - x^{*j}_i}}{\sum_{\kappa=1}^{K} p^{*(q)}_\kappa \prod_{j=1}^{d} (\alpha^{*(q)}_{\kappa j})^{x^{*j}_i} (1 - \alpha^{*(q)}_{\kappa j})^{1 - x^{*j}_i}}$$

is the conditional probability that individual $i$ belongs to group $k$.

The M step. The M step of the EM algorithm chooses the value $\theta^{(q+1)}$ that maximizes the conditional expectation $Q$ computed at the E step:

$$\theta^{(q+1)} = \arg\max_{\theta \in \Theta} Q(\theta; \theta^{(q)}). \qquad (8)$$

We describe this step for each component of $\theta = \{p^*_k, \delta_{kj}, \lambda_j, \gamma_{kj}\}$. For the proportions, this maximization gives the estimate $p^{*(q+1)}_k = \sum_{i=1}^{n^*} t^{(q)}_{ik} / n^*$. For the continuous parameters $\gamma_{kj}$, it can be proved for each model that $Q$ is a strictly concave function of the $\gamma_{kj}$ which tends to $-\infty$ when the norm of the parameter vector $(\gamma_{11}, \ldots, \gamma_{Kd})$ tends to $\infty$ (cf. [Jac05]); a Newton algorithm can therefore be used to find the unique maximum of $Q(\theta; \theta^{(q)})$. For the discrete parameters $\delta_{kj}$ and $\lambda_j$, if the dimension $d$ and the number of groups $K$ are relatively small (for example $K = 2$ and $d = 5$), the maximization is carried out by computing $Q(\theta; \theta^{(q)})$ for all possible values of these discrete parameters; a sketch of one EM iteration follows.
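Here is a minimal sketch of one such EM iteration, specialized to model $M_4$ for concreteness, so that $\theta = (p^*_k, \gamma'_j)$ and no discrete parameter has to be enumerated (nor relaxed, as discussed next). The names `X_star` and `alpha`, the synthetic data, and the use of BFGS in place of the Newton algorithm are all assumptions made for the sketch:

```python
# Sketch of the EM algorithm for the second estimation step under model M4:
# alpha*_kj = Phi(Phi^{-1}(alpha_kj) + gamma'_j) by the link (6).
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def bernoulli_loglik(X_star, a_star):
    """Matrix of log p(x*_i | Z*_ik = 1) = sum_j x log(a) + (1-x) log(1-a)."""
    return X_star @ np.log(a_star).T + (1 - X_star) @ np.log(1 - a_star).T

def e_step(X_star, p_star, a_star):
    """Responsibilities t_ik = p(Z*_ik = 1 | x*_i; theta^(q))."""
    log_t = np.log(p_star) + bernoulli_loglik(X_star, a_star)
    t = np.exp(log_t - log_t.max(axis=1, keepdims=True))  # stabilized
    return t / t.sum(axis=1, keepdims=True)

def m_step(X_star, t, alpha):
    """Maximize Q: closed form for p*_k, numerical search for gamma'_j."""
    p_star = t.mean(axis=0)                  # p*_k^(q+1) = sum_i t_ik / n*

    def neg_Q(gamma):                        # -Q as a function of gamma'_j only
        a_star = norm.cdf(norm.ppf(alpha) + gamma)   # link (6), model M4
        return -np.sum(t * bernoulli_loglik(X_star, a_star))

    gamma = minimize(neg_Q, np.zeros(alpha.shape[1]), method="BFGS").x
    return p_star, norm.cdf(norm.ppf(alpha) + gamma)

# Synthetic demo: placeholder test sample and training estimates.
rng = np.random.default_rng(0)
X_star = rng.integers(0, 2, size=(38, 5))
alpha = 0.5 + rng.uniform(-0.2, 0.2, (2, 5))
p_star, a_star = np.array([0.5, 0.5]), alpha.copy()  # initialization
for _ in range(20):                                  # EM iterations
    t = e_step(X_star, p_star, a_star)
    p_star, a_star = m_step(X_star, t, alpha)
labels = t.argmax(axis=1)   # maximum a posteriori classification (third step)
```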
When $K$ or $d$ is larger, the number of possible values of the $\delta_{kj}$ becomes too large (for example $2^{20}$ for $K = 2$ and $d = 10$), so that all of them cannot be enumerated in a reasonable time. In this case we use a relaxation method, which consists in assuming that $\delta_{kj}$ (respectively $\lambda_j$) is not a binary parameter in $\{-1, 1\}$ but a continuous one in $[-1, 1]$, denoted $\tilde{\delta}_{kj}$ (see [Wol98] for example). The optimization is then carried out with respect to this continuous parameter (with a Newton algorithm, as for $\gamma_{kj}$, since $Q$ is a strictly concave function of $\tilde{\delta}_{kj}$), and the solution $\tilde{\delta}^{(q+1)}_{kj}$ is discretized to obtain a binary solution $\delta^{(q+1)}_{kj} = \mathrm{sgn}(\tilde{\delta}^{(q+1)}_{kj})$.

5 Application on a real data set

The models and estimation methods have been validated on simulated data, available in [Jac05]. We present here an application to a real data set. The original motivations for generalized discriminant analysis were biological applications [Bie02, Van93], in which the aim was to predict the sex of birds from biometrical variables; very good results had been obtained under multinormal assumptions.

The bird species considered in this application is Cory's Shearwater Calonectris diomedea [Thi97]. This species is divided into three subspecies, among which borealis, which lives in the Atlantic islands (the Azores, Canaries, etc.), and diomedea, which lives in the Mediterranean islands (Balearics, Corsica, etc.). Birds of the borealis subspecies are generally bigger than those of the diomedea subspecies, so standard discriminant analysis is not adapted to predicting the sex of diomedea with a training sample drawn from the borealis population.

A sample of borealis ($n = 206$, 45% females) was measured using skins in several national museums. Five morphological variables were measured: culmen (bill length), tarsus, wing and tail lengths, and culmen depth. Similarly, a sample of diomedea ($n^* = 38$, 58% females) was measured on the same set of variables. In this example two groups are present, males and females, and all the birds are of known sex (from dissection). To provide an application of the present work, we discretize the continuous biometrical variables into binary data (small or big wings, and so on), as sketched below. We select the subspecies borealis as the training population and the subspecies diomedea as the test population.
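The paper does not state the cut points used for this discretization; the sketch below assumes a median split per variable, one plausible choice (the names `borealis_measures` and `diomedea_measures` are hypothetical):

```python
# Hedged sketch of the binarization step, assuming each continuous
# biometrical variable is coded 0 ("small") or 1 ("big") by thresholding
# at its training-sample median.
import numpy as np

def binarize(Y_train, Y_test):
    """Binarize continuous variables with a common cut point per variable."""
    cuts = np.median(Y_train, axis=0)    # one threshold per variable
    return (Y_train >= cuts).astype(int), (Y_test >= cuts).astype(int)

# X, X_star = binarize(borealis_measures, diomedea_measures)  # hypothetical names
```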
We apply to these data the eight generalized discriminant analysis models, standard discriminant analysis (DA) and clustering. The misclassification rates and the values of the BIC criterion are given in Table 1.

Table 1. Misclassification rate (in %) and value of the BIC criterion for the test population diomedea with training population borealis.

        pM1    pM2    pM3    pM4    M1     M2     M3     M4     DA     Clustering
Rate    57.9   26.3   23.7   21.0   57.9   23.7   15.8   18.4   42.1   23.7
BIC     269.7  222.5  220.5  237.0  267.3  221.6  221.5  233.6  648.4  -

If the results are compared in terms of error rate, generalized discriminant analysis with model $M_3$ is the best method, with 15.8% error. This error is lower than the ones obtained by standard discriminant analysis (42.1%) and by clustering (23.7%). If the values of the BIC criterion are used to choose a model, three generalized discriminant analysis models stand out ($pM_3$, $M_2$ and $M_3$), including the model with the lowest error rate, $M_3$.

This application illustrates the interest of generalized discriminant analysis with respect to standard discriminant analysis and clustering. Indeed, by adapting the classification rule built on the training population to the test population, generalized discriminant analysis gives lower classification error rates than directly applying the rule built on the training population (standard discriminant analysis) or ignoring the training population and clustering the test population directly. We may remark that the assumption that the binary data arise from the discretization of Gaussian variables is relatively realistic in this application (see the data in [Bie02]). Nevertheless, there exists a strong correlation between the five biometrical variables, which violates the assumption of conditional independence.

6 Conclusion

Generalized discriminant analysis extends standard discriminant analysis by allowing the training and test samples to arise from slightly different populations. Our contribution consists in adapting earlier work in a multinormal context to the case of binary data. An application in a biological context illustrates this work: using the generalized discriminant analysis models defined in this paper, we obtain a classification of birds according to their sex which is better than that of standard discriminant analysis or clustering. Applying the method to data from the insurance field is our next challenge.

Methodological perspectives for this work are also numerous. Firstly, we defined the link between the two populations using the Gaussian cumulative distribution function. Although it seemed initially difficult to find such a link, a simple one was obtained; it was not easy to imagine beforehand, but is quite understandable afterwards. It would be interesting to try other types of cumulative distribution functions (theoretical justifications will have to be developed and practical tests carried out). Secondly, with this contribution generalized discriminant analysis is now available for continuous data and for binary data. To cover a larger number of practical cases, it is important to study the case of categorical variables (i.e. with more than two modalities), and thereafter the case of mixed variables (binary, categorical and continuous). Finally, it would also be interesting to extend other classical discriminant methods, such as nonparametric discrimination.

References

[Bie02] Biernacki, C., Beninel, F. and Bretagnolle, V.: A generalized discriminant rule when training population and test population differ on their descriptive parameters. Biometrics, 58(2), 387-397 (2002)
[Cel91] Celeux, G. and Govaert, G.: Clustering criteria for discrete data and latent class models. Journal of Classification, 8, 157-176 (1991)
[Dem77] Dempster, A.P., Laird, N.M. and Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society, Series B, 39, 1-38 (1977)
[Eve87] Everitt, B.S.: A finite mixture model for the clustering of mixed-mode data. Statistics and Probability Letters, 6, 305-309 (1987)
[Fis36] Fisher, R.A.: The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, 179-188 (1936)
[Jac05] Jacques, J.: Contributions à l'analyse de sensibilité et à l'analyse discriminante généralisée. PhD thesis, Université Joseph Fourier (2005)
[McL92] McLachlan, G.J.: Discriminant Analysis and Statistical Pattern Recognition. Wiley, New York (1992)
[Sch78] Schwarz, G.: Estimating the dimension of a model. Annals of Statistics, 6, 461-464 (1978)
[Thi97] Thibault, J.-C., Bretagnolle, V. and Rabouam, C.: Cory's Shearwater Calonectris diomedea. Birds of the Western Palearctic Update, 1, 75-98 (1997)
[Thu27] Thurstone, L.L.: A law of comparative judgment. American Journal of Psychology, 38, 368-389 (1927)
[Van93] Van Franeker, J.A. and Ter Braak, C.J.F.: A generalized discriminant for sexing fulmarine petrels from external measurements. The Auk, 110(3), 492-502 (1993)
[Wol98] Wolsey, L.A.: Integer Programming. Wiley, New York (1998)