A Comparison Between the Naïve Bayes Classifier and the EM Algorithm

Abstract
In this work we study and compare two algorithms from the Bayesian reasoning family. Bayesian reasoning is based on the assumption that the quantities of interest are governed by probability distributions and that optimal decisions can be made by reasoning about these probabilities together with observed data. In this study we compare the EM algorithm and the Naïve Bayes classifier by identifying their weaknesses and presenting solutions for them. Some of these weaknesses, such as initializing the parameters of the Naïve Bayes classifier, were described in previous work; for these we try to present a more efficient solution than earlier ones. Others, such as the problem of means equality of two or more clusters, are described for the first time in this paper.

1. Introduction
Clustering is the automatic classification of data into one of a known set of possible clusters. It is thus a categorization problem in which the task is to assign the correct category to a data item known to belong to one of a fixed number of possible clusters. Its operationalization entails several challenges, of which efficiency and nescience are two. Efficiency means that category formation must be performed with minimum delay, while nescience means that category formation often happens unsupervised, since no predefined categorization scheme is given. In this paper we study and compare two of the best-known clustering algorithms: the EM algorithm and the Naïve Bayes classifier. Each of them has weaknesses, and for some of these weaknesses there is no known solution.

This paper is organized as follows. Sections 2 and 3 briefly describe the Naïve Bayes classifier and the EM algorithm. Section 4 explains our implementation. Section 5 studies one weakness of the EM algorithm: means equality. Section 6 concentrates on one weakness of the Naïve Bayes classifier: initializing the parameters. Section 7 concentrates on another weakness of the Naïve Bayes classifier: its insensitivity to the amount of difference between data values. Section 8 draws conclusions from the previous sections.

2. Naive Bayes Classifier
The Naive Bayes classifier applies to learning tasks where each instance x is described by a conjunction of attribute values and where the target function f(x) can take on any value from some finite set V. A set of training examples of the target function is provided, and a new instance is presented, described by the tuple of attribute values $a_1, a_2, \ldots, a_n$. The learner is asked to predict the target value, or classification, for this new instance.

The Bayesian approach to classifying the new instance is to assign the most probable target value, $v_{MAP}$, given the attribute values $a_1, a_2, \ldots, a_n$ that describe the instance:

$$v_{MAP} = \arg\max_{v_j \in V} P(v_j \mid a_1, a_2, \ldots, a_n)
         = \arg\max_{v_j \in V} \frac{P(a_1, a_2, \ldots, a_n \mid v_j)\, P(v_j)}{P(a_1, a_2, \ldots, a_n)}
         = \arg\max_{v_j \in V} P(a_1, a_2, \ldots, a_n \mid v_j)\, P(v_j)$$

The naive Bayes classifier is based on the simplifying assumption that the attribute values are conditionally independent given the target value. In other words, the assumption is that, given the target value of the instance, the probability of observing the conjunction $a_1, a_2, \ldots, a_n$ is just the product of the probabilities of the individual attributes:

$$v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_i P(a_i \mid v_j) \qquad (1)$$
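The decision rule of equation (1) can be illustrated with a short sketch. The following Python fragment is a minimal illustration, not the implementation described in section 4; the attribute encoding, the example data, and the function names are assumptions made for the example.

```python
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """Estimate P(v_j) and P(a_i | v_j) by counting over labeled examples.

    `examples` is a list of (attributes, label) pairs, where `attributes`
    is a tuple of discrete attribute values a_1, ..., a_n.
    """
    label_counts = Counter(label for _, label in examples)
    # cond_counts[label][i][value] = number of examples of class `label`
    # whose i-th attribute equals `value`
    cond_counts = defaultdict(lambda: defaultdict(Counter))
    for attrs, label in examples:
        for i, value in enumerate(attrs):
            cond_counts[label][i][value] += 1

    priors = {v: c / len(examples) for v, c in label_counts.items()}

    def cond_prob(value, i, label):
        return cond_counts[label][i][value] / label_counts[label]

    return priors, cond_prob

def classify(attrs, priors, cond_prob):
    """Equation (1): v_NB = argmax_{v_j} P(v_j) * prod_i P(a_i | v_j)."""
    best_label, best_score = None, -1.0
    for label, prior in priors.items():
        score = prior
        for i, value in enumerate(attrs):
            score *= cond_prob(value, i, label)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Tiny illustration with two Boolean attributes (hypothetical data).
examples = [((1, 0), "A"), ((1, 1), "A"), ((0, 0), "B"), ((0, 1), "B")]
priors, cond_prob = train_naive_bayes(examples)
print(classify((1, 1), priors, cond_prob))  # expected: "A"
```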
3. The EM Algorithm
Let $X = \{x_1, x_2, \ldots, x_m\}$ denote the observed data in a set of m independently drawn instances, let $Z = \{z_1, z_2, \ldots, z_m\}$ denote the unobserved data in these same instances, and let $Y = X \cup Z$ denote the full data. The EM algorithm searches for the maximum likelihood hypothesis $h'$ by seeking the $h'$ that maximizes $E[\ln P(Y \mid h')]$. This expected value is taken over the probability distribution governing Y, which is determined by the unknown parameters $\theta$. The EM algorithm uses its current hypothesis h in place of the actual parameters $\theta$ to estimate the distribution governing Y. Let us define a function $Q(h' \mid h)$ that gives $E[\ln P(Y \mid h')]$ as a function of $h'$, under the assumption that $\theta = h$ and given the observed portion X of the full data Y:

$$Q(h' \mid h) = E[\ln P(Y \mid h') \mid h, X]$$

In its general form, the EM algorithm repeats the following two steps until convergence:

Step 1: Estimation (E) step: Calculate $Q(h' \mid h)$ using the current hypothesis h and the observed data X to estimate the probability distribution over Y:

$$Q(h' \mid h) \leftarrow E[\ln P(Y \mid h') \mid h, X]$$

Step 2: Maximization (M) step: Replace hypothesis h by the hypothesis $h'$ that maximizes the Q function:

$$h \leftarrow \arg\max_{h'} Q(h' \mid h)$$

Fig 1) An example of the k-means problem (k = 2). Points in the horizontal part of the diagrams are documents.

Derivation of the K-Means Algorithm: The k-means problem is to estimate the parameters $\mu_1, \mu_2, \ldots, \mu_k$ that define the means of the k Normal distributions. The estimation formula can be written as:

$$E[z_{ij}] = \frac{e^{-\frac{1}{2\sigma^2}(x_i - \mu_j)^2}}{\sum_{n=1}^{k} e^{-\frac{1}{2\sigma^2}(x_i - \mu_n)^2}} \qquad (2)$$

The maximization step then finds the values $\mu_1, \mu_2, \ldots, \mu_k$ that maximize the Q function. It does so by setting each $\mu_j$ as follows:

$$\mu_j \leftarrow \frac{\sum_{i=1}^{m} E[z_{ij}]\, x_i}{\sum_{i=1}^{m} E[z_{ij}]} \qquad (3)$$

4. Implementation
Our implementation has four main classes: Initial, NaiveBayesClassifier, K-Mean, and Normal. In the Initial class, preliminary operations such as connecting to the database and converting the data into a two-dimensional array are carried out. The NaiveBayesClassifier, K-Mean, and Normal classes each implement the corresponding algorithm; the main function of these classes is their constructor. In all of our tests the running time of the Naïve Bayes classifier was less than that of the Normal algorithm, and the running time of the Normal algorithm was less than that of the K-Mean algorithm. These algorithms are faster than algorithms from other families (such as neural networks and decision trees), but in some conditions they can generate weaker results. Thus in our comparison we concentrate on the quality of the results and do not pay attention to the time complexity of the algorithms.

5. Means Equality in the EM Algorithm
In some situations the EM algorithm does not work properly. Xu and Jordan [7] studied one of these weaknesses: outliers occurring in the database. In this case the EM algorithm does not work properly. A second weakness, studied by Archambeau, Lee, and Verleysen [8], occurs when data repetitions exist among the data samples. Another important issue in EM is the problem of means equality of two or more distributions. If, at some point in the mean-estimation process, two or more means take equal values, then these means keep identical values until the end of the process. This can be easily understood by considering the formula of the maximization step. Note that if two means take equal values, the number of clusters has effectively decreased by one. It is therefore important to understand under which condition(s) the values of two or more means become equal, and we describe this in this section.
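The update of equations (2) and (3), and the persistence of equal means under it, can be seen in a short sketch. This is a minimal illustration under the assumptions of the derivation (one-dimensional data with a known common variance); it is not the K-Mean class of section 4, and the sample values are chosen only for the example.

```python
import math

def e_step(xs, means, sigma):
    """Equation (2): E[z_ij] proportional to exp(-(x_i - mu_j)^2 / (2 sigma^2))."""
    resp = []
    for x in xs:
        weights = [math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) for mu in means]
        total = sum(weights)
        resp.append([w / total for w in weights])
    return resp  # resp[i][j] = E[z_ij]

def m_step(xs, resp, k):
    """Equation (3): mu_j = sum_i E[z_ij] x_i / sum_i E[z_ij]."""
    new_means = []
    for j in range(k):
        num = sum(resp[i][j] * x for i, x in enumerate(xs))
        den = sum(resp[i][j] for i in range(len(xs)))
        new_means.append(num / den)
    return new_means

# One EM iteration on a small sample (hypothetical values).
xs = [1.0, 1.2, 5.0, 5.3]
means = [0.0, 6.0]
means = m_step(xs, e_step(xs, means, sigma=1.0), k=2)
print(means)  # the two means move toward the two groups of points

# If two means ever become equal, equation (2) gives them identical
# responsibilities, so equation (3) keeps them equal.
equal_start = m_step(xs, e_step(xs, [3.0, 3.0], sigma=1.0), k=2)
print(equal_start[0] == equal_start[1])  # True
```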
For simplicity, and without loss of generality, we study the condition(s) under which two clusters j and h obtain equal means. In the maximization-step formula the mean of a cluster depends on the observed data and on the expectations of the unobserved data. The denominator of the estimation formula (2) is the same for all of the distribution means. Hence, setting the two updated means equal,

$$\frac{\sum_{i=1}^{m} E[z_{ij}]\, x_i}{\sum_{i=1}^{m} E[z_{ij}]} = \frac{\sum_{i=1}^{m} E[z_{ih}]\, x_i}{\sum_{i=1}^{m} E[z_{ih}]}$$

$$\Big(\sum_{i=1}^{m} e^{-\frac{1}{2\sigma^2}(x_i - \mu_j)^2} x_i\Big)\Big(\sum_{i=1}^{m} e^{-\frac{1}{2\sigma^2}(x_i - \mu_h)^2}\Big) = \Big(\sum_{i=1}^{m} e^{-\frac{1}{2\sigma^2}(x_i - \mu_h)^2} x_i\Big)\Big(\sum_{i=1}^{m} e^{-\frac{1}{2\sigma^2}(x_i - \mu_j)^2}\Big)$$

and, taking logarithms,

$$\ln\Big[\sum_{i=1}^{m} e^{-\frac{1}{2\sigma^2}(x_i - \mu_j)^2} x_i\Big] + \ln\Big[\sum_{i=1}^{m} e^{-\frac{1}{2\sigma^2}(x_i - \mu_h)^2}\Big] = \ln\Big[\sum_{i=1}^{m} e^{-\frac{1}{2\sigma^2}(x_i - \mu_h)^2} x_i\Big] + \ln\Big[\sum_{i=1}^{m} e^{-\frac{1}{2\sigma^2}(x_i - \mu_j)^2}\Big] \qquad (4)$$

We must now determine for which values of $x_i$ relation (4) is satisfied and for which values it is not. In what follows we simplify relation (4) for some particular values of $x_i$.

State 1: If all of the observed data have the value 1, then relation (4) is satisfied and the two clusters take equal means. Because in this state all of the data have the same value, this equality causes no problem, unless we want to use the current means for clustering a new block of data.

State 2: If exactly one $x_i$ differs from 1 and all other observed data are equal to 1, relation (4) is never satisfied. Proof: since only one $x_i$ differs from 1, the summation disappears, and the terms corresponding to data equal to 1 cancel from the two sides of the relation. Call the single data value that is not 1 $x_1$. Relation (4) then reduces to

$$\ln\big[e^{-\frac{1}{2\sigma^2}(x_1 - \mu_j)^2} x_1\big] = \ln\big[e^{-\frac{1}{2\sigma^2}(x_1 - \mu_h)^2} x_1\big]$$

$$-\tfrac{1}{2\sigma^2}(x_1 - \mu_j)^2 + \ln(x_1) = -\tfrac{1}{2\sigma^2}(x_1 - \mu_h)^2 + \ln(x_1) \;\Longrightarrow\; \mu_j = \mu_h$$

Since the previous values of the means are not equal, relation (4) is never satisfied.

State 3: If all of the data that differ from 1 have equal values, then the means of the two clusters never become equal. Proof: assume that the number of data items differing from 1 is m' and that the value of each of them is $x_1$. As in state 2, the terms corresponding to data equal to 1 cancel, and relation (4) reduces to

$$m'\Big[-\tfrac{1}{2\sigma^2}(x_1 - \mu_j)^2 + \ln(x_1)\Big] = m'\Big[-\tfrac{1}{2\sigma^2}(x_1 - \mu_h)^2 + \ln(x_1)\Big] \;\Longrightarrow\; (x_1 - \mu_j)^2 = (x_1 - \mu_h)^2 \;\Longrightarrow\; \mu_j = \mu_h$$

As in state 2, since the previous values of the means are not equal, relation (4) is never satisfied. This state is important when the fields used for clustering are Boolean: in that case we can be sure that the values of the means never become equal.

State 4: In this state the value of one data item, say $x_1$, differs from 1 and is equal to the current value of the mean of one of the clusters, say cluster j, i.e. $x_1 = \mu_j$. One other data item, say $x_2$, also differs from 1, and the rest of the data are assumed to be 1. Let $\Delta = x_1 - x_2$ denote the difference between the two non-one data items. In this state the means of the two clusters become equal under a specific condition: working through relation (4) for this configuration shows that the previous value of the other cluster's mean, $\mu_h$, must take one particular value determined by $x_1$ and $\Delta$. Under that condition the EM algorithm does not work properly.
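States 1 and 2 can also be checked numerically by applying one update of equations (2) and (3). The sketch below is only illustrative; the starting means, the variance, and the data values are assumptions made for the example.

```python
import math

def updated_means(xs, means, sigma=1.0):
    """One EM update (equations (2) and (3)) for one-dimensional data."""
    resp = []
    for x in xs:
        w = [math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) for mu in means]
        s = sum(w)
        resp.append([wi / s for wi in w])
    return [
        sum(resp[i][j] * x for i, x in enumerate(xs))
        / sum(resp[i][j] for i in range(len(xs)))
        for j in range(len(means))
    ]

# State 1: every observation equals 1, so both updated means become 1
# and are therefore equal.
mu_j, mu_h = updated_means([1.0, 1.0, 1.0, 1.0], [2.0, 4.0])
print(mu_j, mu_h)    # 1.0 1.0
print(mu_j == mu_h)  # True

# State 2: exactly one observation differs from 1; for these starting
# values the updated means stay distinct.
mu_j, mu_h = updated_means([1.0, 1.0, 1.0, 5.0], [2.0, 4.0])
print(mu_j, mu_h)    # two distinct values
print(mu_j == mu_h)  # False
```

The all-ones case drives every updated mean to exactly 1, which is the collapse described in state 1.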
6. A Method for Determining Initial Values of the Probabilities in the Naïve Bayes Classifier
One of the most important points in the Naïve Bayes classifier is the initialization of its parameters (probabilities). Because this algorithm is based on training, the initial values of the parameters play a very important role. The Naïve Bayes classifier is less self-corrective than the EM algorithm: if a mistake occurs in the clustering of a document, the probability that this mistake has a dramatic effect on the clustering of other documents is very high. In contrast, if the EM algorithm makes a mistake in clustering a document, it corrects the mistake exponentially. Thus in the EM algorithm the initial values of the means of the k clusters are not important; the only requirement is that no two means take equal values during the clustering process. Because the initialization of the parameters has such an important effect in the Naïve Bayes classifier, we tried to find a good method for it. Below we explain the method used in our implementation.

When all of the clusters are empty, the initial values have no importance, because there is no difference between the clusters and any one of them can be selected randomly for the first data item. Without loss of generality, and for simplicity, we consider the situation in which we have two clusters and one data item that must be assigned to one of them. There are two possible states:

1) If clustering is done correctly, the new data item must go to the cluster that has no members (the empty cluster). In this state the probability score of each cluster is:

Full cluster (j): $1 \cdot \prod_i P(a_i \mid v_j)$
Empty cluster (j'): $P(v_{j'}) \cdot \prod_i P(a_i \mid v_{j'})$

For the data item to go to the empty cluster we must have

$$1 \cdot \prod_i P(a_i \mid v_j) < P(v_{j'}) \cdot \prod_i P(a_i \mid v_{j'})$$

In this state $\prod_i P(a_i \mid v_j)$ is zero or near zero (much less than 1). To satisfy the above condition, and thereby be sure of the correctness of the clustering, the initial values of both $P(v_{j'})$ and $P(a_i \mid v_{j'})$ can be set to 1.

2) If clustering is done correctly, the new data item must go to the cluster that has at least one member (the full cluster). The probability scores are the same as above, and for the data item to go to the full cluster we must have

$$1 \cdot \prod_i P(a_i \mid v_j) > P(v_{j'}) \cdot \prod_i P(a_i \mid v_{j'})$$

In this state $\prod_i P(a_i \mid v_j)$ is larger than zero (near 1 and far from zero). To satisfy the above condition, and thereby be sure of the correctness of the clustering, the initial value of $P(v_{j'})$ or $P(a_i \mid v_{j'})$ can be set to 0.

The problem is to distinguish between these two states so that in each state the probabilities are set to the relevant values. The proposed solution uses a threshold value. Because the value of $\prod_i P(a_i \mid v_j)$ is zero or near zero in state (1) and one or near one in state (2) (very large compared with the former state), we can use a threshold value (e.g. 0) to distinguish between the two states and initialize the parameters properly. This method has been used in our implementation and has reduced the number of incorrectly clustered documents (Fig 2).

Fig 2) Comparison of the proposed method for initializing the parameters of the Naïve Bayes classifier (bar 2) with the training method (bar 1) and constant initialization (bar 3). This diagram shows two things: first, the importance of the initial values in the Naïve Bayes classifier, and second, the good efficiency of the proposed method.
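A minimal sketch of the threshold-based initialization described above is given below. The function name, the returned dictionary, and the example values are illustrative assumptions; the sketch only shows the decision between the two states, not the full NaiveBayesClassifier class of section 4.

```python
def initialize_empty_cluster(prod_full, threshold=0.0):
    """Decide how to initialize an empty cluster's parameters, following
    the two states described above.

    `prod_full` is prod_i P(a_i | v_j) computed for the existing (full)
    cluster j and the new data item.  The threshold value 0 is the one
    suggested in the text; any value separating "near zero" from
    "near one" would play the same role.
    """
    if prod_full <= threshold:
        # State 1: the full cluster barely explains the new item, so it
        # should go to the empty cluster j'.  Setting both P(v_j') and
        # every P(a_i | v_j') to 1 makes the empty cluster's score 1,
        # which beats the near-zero score of the full cluster.
        return {"prior": 1.0, "conditionals": 1.0}
    # State 2: the full cluster explains the item well, so it should stay
    # there.  Setting P(v_j') (or the conditionals) to 0 makes the empty
    # cluster's score 0, so it cannot win.
    return {"prior": 0.0, "conditionals": 1.0}

# An item the full cluster cannot explain: empty cluster initialized to win.
print(initialize_empty_cluster(prod_full=0.0))
# An item the full cluster explains well: empty cluster initialized to lose.
print(initialize_empty_cluster(prod_full=0.9))
```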
7. Insensitivity of the Naïve Bayes Classifier to the Difference Between Values
In our implementation of the Naive Bayes classifier, the values $P(v_j)$ are defined as the number of members of cluster j divided by the total number of items that have already been clustered. Naturally, the values $P(a_i \mid v_j)$ are defined as the number of members of cluster j in which field i has a specific value (the value of field i of the current record), divided by the total number of members of cluster j. This definition works well for fields whose range of values is small and limited, for example Boolean fields. To handle other fields we used the following definition of $P(a_i \mid v_j)$, which is independent of the field type: $P(a_i \mid v_j)$ is the number of members of cluster j in which field i falls into a specific range, divided by the total number of members of cluster j. The width of this range is defined as follows: for each field, we subtract the minimum value of that field over all data from its maximum value over all data, and divide the result by the number of clusters. When the value of field i of the current record is compared with the value of that field in a record of cluster j, the two values need not be exactly equal; if their difference is less than the width computed above, they are considered equivalent.

Even with this change, the Naïve Bayes classifier is not sensitive to the amount of difference between values. Sensitivity to difference means that the magnitude of the difference between the possible values of a field affects the clustering. For example, suppose our data were as follows:

data   field1   field2   field3
0      22       22       22
1      22       1        6
2      22       6        6

The Naive Bayes classifier can place row 1 in one cluster and rows 0 and 2 in another cluster, while the EM algorithm, which is sensitive to difference, places rows 1 and 2 in one cluster and row 0 in another. Every step toward converting the Naïve Bayes classifier into a difference-sensitive algorithm moves it toward the EM algorithm. For this purpose we propose the Normal algorithm. In this algorithm the probability of selecting a cluster among all clusters is computed as in the Naive Bayes algorithm, but $P(a_i \mid v_j)$ is replaced. For each record of cluster $v_j$, the value of field $a_i$ in that record is subtracted from the value of field $a_i$ in the current record, and these differences are added together. We denote the result of this sum for cluster $v_j$ and field $a_i$ by $A(a_i \mid v_j)$. The score of a record for a cluster is then computed as:

$$p = \frac{P(v_j)}{\prod_i A(a_i \mid v_j)} \qquad (5)$$

In the above formula, if $A(a_i \mid v_j)$ is equal to zero, it is taken to be 1. The efficiency of this algorithm lies between that of the EM algorithm and the Naïve Bayes classifier: for example, it is slower than the Naïve Bayes classifier but faster than the EM algorithm.
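The scoring rule of equation (5) can be sketched as follows. Two details are assumptions made for the example: absolute differences are used when computing $A(a_i \mid v_j)$, and the priors and initial cluster assignments are hypothetical. The data are the three rows of the table above.

```python
def normal_score(record, cluster_members, prior):
    """Score of equation (5) for one record against one cluster.

    A(a_i | v_j) is computed here as the sum of absolute differences
    between the record's value of field i and the values of field i in
    the cluster's members (taking the absolute value is an assumption;
    the text only says the differences are summed).
    """
    denom = 1.0
    for i, value in enumerate(record):
        a = sum(abs(value - member[i]) for member in cluster_members)
        denom *= a if a != 0 else 1.0  # zero is treated as 1, as stated above
    return prior / denom

# The table from this section: rows 0, 1 and 2 with three fields each.
rows = {0: (22, 22, 22), 1: (22, 1, 6), 2: (22, 6, 6)}

# Suppose row 0 is in cluster A and row 1 is in cluster B (priors 1/2 each).
cluster_a, cluster_b = [rows[0]], [rows[1]]
score_a = normal_score(rows[2], cluster_a, prior=0.5)
score_b = normal_score(rows[2], cluster_b, prior=0.5)
print(score_a, score_b)
print("row 2 joins", "B" if score_b > score_a else "A")  # smaller total difference wins
```

With this scoring, row 2 is grouped with row 1, mirroring the difference-sensitive behaviour attributed to the EM algorithm in the example above.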
8. Conclusion
The Naïve Bayes classifier is more sensitive than the EM algorithm to the initial values of its parameters and must have a training phase. This training phase is either carried out by an expert, which is very time consuming, or by another algorithm such as the EM algorithm. We have presented a method for initializing the parameters of the Naïve Bayes classifier that works better than previous methods.

The EM algorithm may not work properly in some situations, such as when outliers occur in the database, when data repetitions occur among the data samples, or when the means of two or more clusters become equal. If the EM algorithm makes a mistake in one phase, it can often (but not always) correct itself. This algorithm is also insensitive to the initial values of its parameters.

References
[1] W. L. Buntine. Operations for learning with graphical models. Journal of Artificial Intelligence Research, 2:159-225, 1994.
[2] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In Proceedings of the ACM SIGMOD Conference, 1998.
[3] R. Neal and G. Hinton. A view of the EM algorithm that justifies incremental, sparse and other variants. Technical report, Dept. of Statistics, University of Toronto, 1993.
[4] D. Pelleg and A. Moore. X-means: extending k-means with efficient estimation of the number of clusters. In Proceedings of the 17th International Conference on Machine Learning (ICML), Stanford University, 2000.
[5] C.-Y. Lee and E. K. Antonsson. Dynamic partitional clustering using evolution strategies. In Proceedings of the 3rd Asia-Pacific Conference on Simulated Evolution and Learning, Nagoya, Japan, 2000.
[6] G. Casella and R. L. Berger. Statistical Inference. Wadsworth & Brooks/Cole, Pacific Grove, CA, 1990.
[7] L. Xu and M. I. Jordan. On convergence properties of the EM algorithm for Gaussian mixtures. Neural Computation, 7, 1995.
[8] C. Archambeau, J. A. Lee, and M. Verleysen. On convergence problems of the EM algorithm for finite Gaussian mixtures. In ESANN 2003 Proceedings - European Symposium on Artificial Neural Networks, Bruges, Belgium, 23-25 April 2003, d-side publications, ISBN 2-930307-03-X, pp. 99-106.
[9] P. Langley, W. Iba, and K. Thompson. An analysis of Bayesian classifiers. In Proceedings of the 10th National Conference on Artificial Intelligence, pp. 223-228, AAAI Press and MIT Press, 1992.
[10] P. Langley and S. Sage. Induction of selective Bayesian classifiers. In Proceedings of the 10th Conference on Uncertainty in Artificial Intelligence, 1994.
[11] J. Bilmes. A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models. Technical report, International Computer Science Institute, Berkeley, CA, 1998.
[12] A. Ratnaparkhi. Maximum Entropy Models for Natural Language Ambiguity Resolution. Ph.D. thesis, University of Pennsylvania, 1998.
[13] R. E. Schapire and Y. Singer. BoosTexter: a boosting-based system for text categorization. Machine Learning, 39(2/3):135-168, 2000.