
A Comparison Between the Naïve Bayes Classifier and the EM Algorithm
Abstract
In this work we study and compare two algorithms from the Bayesian
reasoning family. Bayesian reasoning is based on the assumption that
the quantities of interest are governed by probability distributions and
that optimal decisions can be made by reasoning about these
probabilities together with observed data. In this study we compare the
EM algorithm and the Naïve Bayes classifier by identifying their
weaknesses and presenting solutions for them. Some of these
weaknesses, such as initializing the parameters of the Naïve Bayes
classifier, were raised in previous work; for these we try to present a
more efficient solution than earlier ones. Other weaknesses, such as the
problem of means equality of two or more clusters, are discussed here
for the first time.
1. Introduction
Clustering is the automatic classification of data into one of a number of possible
clusters. It is thus a categorization problem in which the task is to assign the correct
category to a data item belonging to a fixed number of possible clusters.
Putting this into practice raises several challenges, among them efficiency and nescience.
Efficiency means that category formation must be performed with minimal delay, while
nescience means that category formation often happens unsupervised, since no predefined
categorization scheme is given.
In this paper we study and compare two of the best-known clustering algorithms: the EM
algorithm and the Naïve Bayes classifier. Each of them has weaknesses, and for some of
these weaknesses no solution exists.
This paper is organized as follows. In sections 2 and 3 we briefly describe the Naïve
Bayes classifier and the EM algorithm. In section 4 we explain our implementation. In
section 5 we study one weakness of the EM algorithm: means equality. In section 6 we
concentrate on one weakness of the Naïve Bayes classifier: initializing the parameters. In
section 7 we concentrate on another weakness of the Naïve Bayes classifier: its
non-sensitivity to the amount of difference between data values. Section 8 draws
conclusions from the previous sections.
2. Naive Bayes Classifier
The Naive Bayes classifier applies to learning tasks where each instance x is described by
a conjunction of attribute values and where the target function f (x) can take on any value
from some finite set V. A set of training examples of the target function is provided, and
a new instance is presented, described by the tuple of attribute values $\langle a_1, a_2, \ldots, a_n \rangle$. The
learner is asked to predict the target value, or classification, for this new instance.
The Bayesian approach to classifying the new instance is to assign the most probable
target value, $v_{MAP}$, given the attribute values $\langle a_1, a_2, \ldots, a_n \rangle$ that describe the instance.
vMAP  arg max P(v j | a1 , a2 ,..., an )
v j V
vMAP  arg max
v j V
P(a1 , a2 ,..., an , | v j ) P(v j )
P(a1 , a2 ,..., an )
 arg max P(a1 , a2 ,..., an , | v j ) P(v j )
v j V
The naive Bayes classifier is based on the simplifying assumption that the attribute values
are conditionally independent given the target value. In other words, the assumption is
that given the target value of the instance, the probability of observing the conjunction
$a_1, a_2, \ldots, a_n$ is just the product of the probabilities for the individual attributes:
v NB  arg max P(v j ) P(ai | v j )
v j V
(1)
i
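To make the decision rule of equation (1) concrete, the following is a minimal Python sketch (not the implementation described in section 4): it estimates the prior and conditional probabilities from raw counts over a small labelled training set and then returns $v_{NB}$. All function names and the toy data are illustrative.

```python
from collections import Counter, defaultdict

# Minimal sketch of the naive Bayes decision rule of equation (1), assuming
# discrete attributes. Probabilities come from raw counts of a small labelled
# training set; this is illustrative, not the paper's implementation.
def train_naive_bayes(examples):
    """examples: list of (attribute_tuple, label)."""
    class_counts = Counter(label for _, label in examples)
    cond_counts = defaultdict(Counter)      # (label, position) -> value counts
    for attrs, label in examples:
        for pos, value in enumerate(attrs):
            cond_counts[(label, pos)][value] += 1
    priors = {v: c / len(examples) for v, c in class_counts.items()}
    return priors, cond_counts, class_counts

def classify(instance, priors, cond_counts, class_counts):
    """Return v_NB = argmax_v P(v) * prod_i P(a_i | v)."""
    best_label, best_score = None, -1.0
    for label, prior in priors.items():
        score = prior
        for pos, value in enumerate(instance):
            score *= cond_counts[(label, pos)][value] / class_counts[label]
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# toy usage
data = [(("sunny", "hot"), "no"), (("rain", "mild"), "yes"), (("rain", "hot"), "yes")]
model = train_naive_bayes(data)
print(classify(("rain", "hot"), *model))    # -> "yes"
```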
3. The EM Algorithm
Let X  {x1 , x2 ,..., xm } denote the observed data in a set of m independently drawn
instances, let Z  {z1 , z 2 ,..., z m } denote the unobserved data in these same instances, and
let Y  X  Z denote the full data. The EM algorithm searches for the maximum
likelihood hypothesis h by seeking the h that maximizes P (Y | h ) . This expected value
is taken over the probability distribution governing Y, which is determined by the
unknown parameters  . The EM algorithm uses its current hypothesis h in place of the
actual parameters  to estimate the distribution governing Y. Let us define a function
Q(h | h) that gives E (ln P(Y | h)) as a function of h , under the assumption that   h
and given the observed portion X of the full data Y.
Q(h | h)  E[ln p(Y | h) | h, X ]
In its general form, the EM algorithm repeats the following two steps until convergence:
Step 1: Estimation (E) step: Calculate Q(h | h) using the current hypothesis h and the
observed data X to estimate the probability distribution over Y.
Q(h | h)  E[ln p (Y | h) | h, X ]
Step 2: Maximization (M) step: Replace hypothesis h by the hypothesis h that
maximizes Q function.
h  arg max Q(h | h)
h
Fig. 1) An example of the k-means problem (k = 2). The points along the horizontal axis of the diagrams are documents.
Derivation of the k-means algorithm: The k-means problem is to estimate the
parameters $\theta = \langle \mu_1, \mu_2, \ldots, \mu_k \rangle$ that define the means of the k Normal distributions.
The estimation (E-step) formula can be written as:

$$E[z_{ij}] = \frac{e^{-\frac{1}{2\sigma^2}(x_i - \mu_j)^2}}{\sum_{n=1}^{k} e^{-\frac{1}{2\sigma^2}(x_i - \mu_n)^2}} \qquad (2)$$

The maximization step then finds the values $\langle \mu_1, \mu_2, \ldots, \mu_k \rangle$ that maximize the Q function.
This is done by setting each $\mu_j$ as follows:

$$\mu_j \leftarrow \frac{\sum_{i=1}^{m} E[z_{ij}]\, x_i}{\sum_{i=1}^{m} E[z_{ij}]} \qquad (3)$$
4. Implementation
This implementation has four main classes: Initial, NaiveBayesClassifier, K-Mean, and
Normal. The Initial class performs preliminary operations such as connecting to the
database and converting the data to a two-dimensional array. The NaiveBayesClassifier,
K-Mean, and Normal classes each implement the corresponding algorithm; the main
function of these classes is their constructor.
In all of our tests the running time of the Naïve Bayes classifier is lower than that of the
Normal algorithm, and the running time of the Normal algorithm is lower than that of the
K-Mean algorithm. These algorithms are faster than algorithms of other families (such as
neural networks and decision trees), but under some conditions they can generate weaker
results. In our comparison we therefore concentrate on the quality of the results and do
not pay attention to the time complexity of the algorithms.
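The paper does not include source code, so the skeleton below only suggests, under the assumption of a Python-style implementation, how the four named classes could be organized so that each constructor drives its algorithm. The method bodies are placeholders; everything beyond the four class names is hypothetical.

```python
# Hypothetical skeleton of the four classes named in the text. The paper's
# actual data access is a database connection; this sketch accepts rows directly.
class Initial:
    """Loads the data and converts it to a two-dimensional array."""
    def __init__(self, rows):
        self.data = [list(r) for r in rows]

class NaiveBayesClassifier:
    """Clusters the data in its constructor (the class's main function)."""
    def __init__(self, data, k):
        self.assignments = self._cluster(data, k)
    def _cluster(self, data, k):
        raise NotImplementedError   # placeholder for the algorithm of section 2

class KMean:
    """EM/k-means clustering, also driven from the constructor."""
    def __init__(self, data, k):
        self.assignments = self._cluster(data, k)
    def _cluster(self, data, k):
        raise NotImplementedError   # placeholder for the algorithm of section 3

class Normal:
    """The difference-sensitive variant proposed in section 7."""
    def __init__(self, data, k):
        self.assignments = self._cluster(data, k)
    def _cluster(self, data, k):
        raise NotImplementedError   # placeholder for the algorithm of section 7
```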
5. Means equality in the EM algorithm
There are situations in which the EM algorithm does not work properly. One such
weakness, studied in [7], arises when outliers occur in the database; in that case the EM
algorithm does not work properly. A second weakness, studied by Archambeau et al. [8],
arises when data repetitions exist among the data samples.
Another important point about EM is the problem of means equality of two or more
distributions. If, at some point during the mean-estimation process, two or more means
take equal values, then those means keep equal values until the end of the process. This
can be seen directly from the formula of the maximization step. Note that if two means
become equal, the number of clusters has effectively decreased by one. It is therefore
important to understand under which condition(s) the values of two or more means
become equal, and we describe this in this section. For simplicity, and without loss of
generality, we study the condition(s) under which two clusters j and h obtain equal means.
In the maximization-step formula, the mean of a cluster depends on the observed data and
on the expectations of the unobserved data. The denominator of the estimation formula
(2) is the same for all of the distributions' means. Hence:

$$\frac{\sum_{i=1}^{m} E[z_{ij}]\, x_i}{\sum_{i=1}^{m} E[z_{ij}]} = \frac{\sum_{i=1}^{m} E[z_{ih}]\, x_i}{\sum_{i=1}^{m} E[z_{ih}]}$$

$$\Rightarrow\; \frac{\sum_{i=1}^{m} e^{-\frac{1}{2\sigma^2}(x_i-\mu_j)^2}\, x_i}{\sum_{i=1}^{m} e^{-\frac{1}{2\sigma^2}(x_i-\mu_j)^2}} = \frac{\sum_{i=1}^{m} e^{-\frac{1}{2\sigma^2}(x_i-\mu_h)^2}\, x_i}{\sum_{i=1}^{m} e^{-\frac{1}{2\sigma^2}(x_i-\mu_h)^2}}$$

$$\Rightarrow\; \ln\!\left[\sum_{i=1}^{m} e^{-\frac{1}{2\sigma^2}(x_i-\mu_j)^2} x_i\right] - \ln\!\left[\sum_{i=1}^{m} e^{-\frac{1}{2\sigma^2}(x_i-\mu_j)^2}\right] = \ln\!\left[\sum_{i=1}^{m} e^{-\frac{1}{2\sigma^2}(x_i-\mu_h)^2} x_i\right] - \ln\!\left[\sum_{i=1}^{m} e^{-\frac{1}{2\sigma^2}(x_i-\mu_h)^2}\right] \qquad (4)$$
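The claim that coinciding means never separate again can also be checked numerically. The sketch below (an illustration, not the paper's code) applies the updates of equations (2) and (3) to a toy data set after forcing two of three means to the same value; the two means remain identical at every iteration, so the data is effectively clustered with k = 2.

```python
import math

# Numerical check of the means-equality problem: if two means ever take the
# same value, the updates of equations (2)-(3) keep them equal forever.
def em_update(xs, mus, sigma=1.0):
    new_mus = []
    for mu_j in mus:
        num = den = 0.0
        for x in xs:
            total = sum(math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) for mu in mus)
            r = math.exp(-(x - mu_j) ** 2 / (2 * sigma ** 2)) / total
            num += r * x
            den += r
        new_mus.append(num / den)
    return new_mus

xs = [0.5, 1.0, 2.0, 7.5, 8.0, 9.0]
mus = [2.0, 2.0, 8.0]              # two of the three means start (or become) equal
for step in range(5):
    mus = em_update(xs, mus)
    print(step, [round(m, 4) for m in mus])
# Output shows the first two means identical at every step.
```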
We must now determine for which values of $x_i$ the above relation is satisfied and for
which values it is not. In the following we simplify the relation for several choices of the $x_i$.
State 1: If all of the observed data have the value 1, then relation (4) is satisfied and the
two clusters take equal means. Because in this state all of the data have the same value 1,
this equality causes no problem, unless we want to use the current means to cluster a new
block of data.
State 2: If exactly one $x_i$ is different from 1 and all other observed data are equal to 1,
relation (4) is never satisfied.
Proof: Because only one $x_i$ differs from 1, the summation sign is eliminated, and the
values that are 1 are eliminated from the two sides of the relation. We denote the data
item whose value is not 1 by $x_1$:
Ln[i 1 e

Ln[i 1 e

m
m
 Ln[e


1
2 2
1
2
1
2 2
1
2
( xi   j ) 2
xi ]  Ln[i 1 e

( xi   h ) 2
xi ]  Ln[i 1 e

( x1   j ) 2
m
m
x1 ]  Ln[e

1
2 2
1
2 2
( x1   h ) 2
( x1   j ) 2  Ln( x1 )  
2
  j  h
2
1
2 2
1
2
2
( xi   h ) 2
( xi   j ) 2
]
]
x1 ]
( x1   h ) 2  Ln( x1 )
Since the previous values of the means are not equal, relation (4) is never satisfied.
State 3: If all of the data items that are different from 1 have equal values, the means of
the two clusters never become equal.
Proof: Because all of the data items that differ from 1 have the same value, the summation
sign is eliminated. We assume that the number of data items that differ from 1 is $m'$ and
that the value of each of them is $x_1$:
Ln[i 1 e

Ln[i 1 e

m
m
1
2
2
1
2 2
 m *{Ln[e

( xi   j ) 2
xi ]  Ln[i 1 e

( xi   h ) 2
xi ]  Ln[i 1 e

1
2 2
m
( x1   j ) 2
m
1
2 2
1
2 2
( xi   h ) 2
( xi   j ) 2
]
]
]  Ln( x1 )}  m *{Ln[e

1
2 2
( x1   h ) 2
 Ln( x1 )}
 ( x1   j ) 2  ( x1   h ) 2
  j  h
As in state 2, since the previous values of the means are not equal, relation (4) is never
satisfied. This state is important when the fields used for clustering are Boolean: in that
condition we can be sure that the values of the means never become equal.
State 4: In this state the value of one data item, say $x_1$, is different from 1 and is equal to
the current value of the mean of one of the clusters, say $\mu_j$, i.e. $x_1 = \mu_j$. One other data
item, say $x_2$, is also different from 1, and the rest of the data are assumed to be 1. We
denote the difference of the two non-one data items by $\Delta$, i.e. $\Delta = x_1 - x_2$. In this state
the means of the two clusters become equal only under specific conditions.
Proof:
Ln[i 1 e

Ln[i 1 e

m
m
1
2 2
1
2 2
( xi   j ) 2
xi ]  Ln[i 1 e

( xi   h ) 2
xi ]  Ln[i 1 e

m
m
1
2 2
1
2 2
( xi   h ) 2
( xi   j ) 2
]
]
  2 ( x1  1  )  ( x1   h ) 2 ( x1  1)  ( x1     h ) 2 ( x1  1  )
 ( x1  1  )( x1     h )( x1     h  )  ( x1  1)( x1   h ) 2
 ( x1  1  )( x1  2   h )  ( x1  1)( x1   h )
 2( x1  1)  ( x1  2   h )
 2 x1  2   x1  2   h
  h  3 x1  2  2
Thus, for the means to become equal, the previous value of the mean of the other cluster
must be equal to $3x_1 - 2\Delta - 2$. Under these conditions the EM algorithm does not work properly.
6. A method for determining the initial values of the probabilities in the
Naïve Bayes classifier
One of the most important points in the Naïve Bayes classifier is the initialization of the
parameters (probabilities). Because this algorithm is based on training, the initial values of
the parameters play a very important role. The Naïve Bayes classifier is less self-corrective
than the EM algorithm, and if a mistake occurs in the clustering of a document, the
probability that this mistake has a dramatic effect on the clustering of other documents is
very high. In contrast, if the EM algorithm makes a mistake in clustering a document, it
tends to correct the mistake exponentially fast. Thus in the EM algorithm the initial values
of the means of the k clusters are not important; the only requirement is that no two means
take equal values during the clustering process.
Because the initialization of the parameters in the Naïve Bayes classifier has such an
important effect, we tried to find a good method for this task. Below we explain the
method used in our implementation.
When all of the clusters are empty, the initial values do not matter, because there is no
difference between the clusters and any one of them can be selected for the first data
item. Without loss of generality, and for simplicity, we consider the situation in which
there are two clusters and one data item that must be assigned to one of them. There are
two possible states:
1) If clustering is done correctly, the new data must go to the cluster which has no
member (the empty cluster). In this state the probability formula for each cluster is:
Full cluster (j): $1 \cdot \prod_i P(a_i \mid v_j)$
Empty cluster (j'): $P(v_{j'}) \cdot \prod_i P(a_i \mid v_{j'})$
For the data to go to the empty cluster, we must have:
$$1 \cdot \prod_i P(a_i \mid v_j) < P(v_{j'}) \cdot \prod_i P(a_i \mid v_{j'})$$
In this state, $\prod_i P(a_i \mid v_j)$ is zero or near zero (much less than 1). To satisfy the
above condition, and hence be sure that the clustering is correct, the initial values of
both $P(v_{j'})$ and $P(a_i \mid v_{j'})$ can be set to 1.
2) If clustering is done correctly, the new data must go to the cluster that has at least one
member (the full cluster). In this state the probability formula for each cluster is:
Full cluster (j): $1 \cdot \prod_i P(a_i \mid v_j)$
Empty cluster (j'): $P(v_{j'}) \cdot \prod_i P(a_i \mid v_{j'})$
For the data to go to the full cluster, we must have:
$$1 \cdot \prod_i P(a_i \mid v_j) > P(v_{j'}) \cdot \prod_i P(a_i \mid v_{j'})$$
In this state, $\prod_i P(a_i \mid v_j)$ is greater than zero (near 1, far from zero). To satisfy the
above condition, and hence be sure that the clustering is correct, the initial value of
$P(v_{j'})$ or $P(a_i \mid v_{j'})$ can be set to 0.
The problem is to distinguish between these two states so that in each state the
probabilities are set to the relevant values. The proposed solution uses a threshold value.
Because the value of $\prod_i P(a_i \mid v_j)$ in state (1) is zero or near zero, and in state (2) is
one or near one (very large compared with the former state), we can use a threshold value
(e.g. a small value near 0) to distinguish between the two states and initialize the
parameters properly. This proposed method has been used in our implementation and has
reduced the incorrect clustering of documents (Fig. 2).
Fig. 2) Comparison of the proposed method for initializing the parameters of the Naïve Bayes classifier (bar 2)
with the training method (bar 1) and constant initialization (bar 3). The diagram shows two things: first, the
importance of the initial values in the Naïve Bayes classifier, and second, the good efficiency of the proposed method.
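One possible reading of this initialization rule, expressed as code: the product $\prod_i P(a_i \mid v_j)$ over the full cluster is compared against a small threshold, and the empty cluster's parameters are pushed to 1 or 0 accordingly. The concrete threshold value below is an assumption for illustration; the paper only states that a threshold is used.

```python
# Sketch of the proposed initialization rule for a new, still-empty cluster j'.
# prod_full is the product prod_i P(a_i | v_j) computed against the full
# cluster j; the threshold is an illustrative assumption, not the paper's value.
THRESHOLD = 1e-6

def init_empty_cluster_params(prod_full):
    """Return initial (P(v_j'), P(a_i | v_j')) for the empty cluster."""
    if prod_full <= THRESHOLD:
        # State 1: the data does not match the full cluster, so it should go
        # to the empty cluster -> push both parameters up to 1.
        return 1.0, 1.0
    else:
        # State 2: the data matches the full cluster -> make sure the empty
        # cluster cannot win by pushing its parameters down to 0.
        return 0.0, 0.0

# Usage: the empty cluster's score P(v_j') * prod_i P(a_i | v_j') is then
# compared with 1 * prod_full when the new data item is assigned.
print(init_empty_cluster_params(1e-9))   # -> (1.0, 1.0)
print(init_empty_cluster_params(0.8))    # -> (0.0, 0.0)
```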
7. Non-sensitivity of the Naïve Bayes classifier with respect to the
difference between values
In our implementation of the Naïve Bayes classifier, the $P(v_j)$ values are defined as the
number of members of cluster j divided by the total number of data items that have
already been clustered. Naturally, the $P(a_i \mid v_j)$ values are defined as the number of
members of cluster j in which field i has a specific value (the value of field i of the
current record), divided by the total number of members of cluster j.
This definition works well on fields whose range of values is small and limited, for
example Boolean fields. To solve this problem for other field types, we used the
following definition of $P(a_i \mid v_j)$, which is independent of the field type: $P(a_i \mid v_j)$ is
the number of members of cluster j in which field i lies within a specific range, divided
by the total number of members of cluster j. The range is defined as follows: for each
field, we subtract the minimum value of that field over all of the data from its maximum
value over all of the data, and then divide this number by the number of clusters. When
the value of field i of the current data item is compared with the value of this field in a
data item of cluster j, the two values need not be exactly equal; if their difference is less
than the number computed above, we consider them equivalent.
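The range-based definition above can be sketched as follows; the helper names and the toy data are illustrative, not taken from the paper's implementation.

```python
# Sketch of the range-based estimate of P(a_i | v_j): two values of field i
# count as equivalent when they differ by less than (max_i - min_i) / k,
# where max and min are taken over the whole data set.
def field_tolerance(all_rows, i, k):
    values = [row[i] for row in all_rows]
    return (max(values) - min(values)) / k

def p_attr_given_cluster(value, cluster_rows, i, tol):
    """Fraction of cluster members whose field i is within tol of value."""
    if not cluster_rows:
        return 0.0
    matches = sum(1 for row in cluster_rows if abs(row[i] - value) < tol)
    return matches / len(cluster_rows)

# toy usage with the three rows of the example below, split into two clusters
data = [[22, 22, 22], [22, 1, 6], [22, 6, 6]]
tol = field_tolerance(data, 1, k=2)                 # (22 - 1) / 2 = 10.5
print(p_attr_given_cluster(6, data[1:], 1, tol))    # -> 1.0
```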
Even with this change, the Naïve Bayes classifier is not sensitive to the amount of
difference between values. Sensitivity to difference means that the size of the difference
between the possible values of a field affects the clustering. For example, suppose our
data were as follows:
data   field1   field2   field3
0      22       22       22
1      22       1        6
2      22       6        6
The Naïve Bayes classifier may place row 1 in one cluster and rows 0 and 2 in another
cluster, while the EM algorithm, which is sensitive to differences, places rows 1 and 2 in
one cluster and row 0 in another.
Each step toward making the Naïve Bayes classifier sensitive to differences moves it
closer to the EM algorithm. Toward this goal we propose the Normal algorithm. In this
algorithm the probability of selecting a cluster among all clusters is computed as in the
Naïve Bayes algorithm, but $P(a_i \mid v_j)$ is replaced. For each record of cluster $v_j$, the
value of field $a_i$ in that record is subtracted from the value of field $a_i$ in the current
record, and these differences are added together. We denote the result of this sum for
cluster $v_j$ and field $a_i$ by $A(a_i \mid v_j)$. The probability contribution of a field in a record
is then changed as follows:
$$p = \frac{P(v_j)}{\sum_i A(a_i \mid v_j)} \qquad (5)$$

In the above formula, if $\sum_i A(a_i \mid v_j)$ is equal to zero, it is treated as equal to 1.
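A sketch of how equation (5) might be computed is given below. The text does not state whether the summed differences are taken in absolute value; absolute differences are assumed here so that the denominator stays non-negative, and all names and data are illustrative.

```python
# Sketch of the Normal algorithm's score from equation (5). A(a_i | v_j) is
# the accumulated difference between the current record's field a_i and that
# field in every record of cluster v_j (absolute differences assumed).
def a_value(record, cluster_rows, i):
    return sum(abs(record[i] - row[i]) for row in cluster_rows)

def normal_score(record, cluster_rows, n_total_clustered):
    """p = P(v_j) / sum_i A(a_i | v_j); a zero denominator is treated as 1."""
    p_vj = len(cluster_rows) / n_total_clustered if n_total_clustered else 1.0
    denom = sum(a_value(record, cluster_rows, i) for i in range(len(record)))
    if denom == 0:                    # rule stated after equation (5)
        denom = 1
    return p_vj / denom

# usage: pick the cluster with the highest score for the incoming record
clusters = {"c1": [[22, 22, 22]], "c2": [[22, 1, 6], [22, 6, 6]]}
record = [22, 5, 7]
total = sum(len(rows) for rows in clusters.values())
print(max(clusters, key=lambda c: normal_score(record, clusters[c], total)))
```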
The efficiency of this algorithm lies between that of the EM algorithm and that of the
Naïve Bayes classifier; for example, it is slower than the Naïve Bayes classifier but faster
than the EM algorithm.
8. Conclusion
The Naïve Bayes classifier is more sensitive than the EM algorithm to the initial values of
its parameters and must have a training phase. This training phase is either carried out by
an expert, which is very time consuming, or by another algorithm such as the EM
algorithm. We have presented a method for initializing the parameters of the Naïve Bayes
classifier that works better than previous methods.
The EM algorithm may not work properly in some situations, such as when outliers occur
in the database, when data repetitions occur among the data samples, or when the means
of two or more clusters become equal.
If the EM algorithm makes a mistake in one phase, it can usually (but not always) correct
itself. This algorithm is also insensitive to the initial values of its parameters.
References:
[1] W. L. Buntine. Operations for learning with graphical models. Journal of Artificial Intelligence
Research, 2:159-225, 1994.
[2] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high
dimensional data for data mining applications. In Proc. ACM SIGMOD Conference, 1998.
[3] R. Neal and G. Hinton. A view of the EM algorithm that justifies incremental, sparse and other
variants. Technical report, Dept. of Statistics, University of Toronto, 1993.
[4] D. Pelleg and A. Moore. X-means: Extending k-means with efficient estimation of the number
of clusters. In Proc. 17th ICML, Stanford University, 2000.
[5] C.-Y. Lee and E. K. Antonsson. Dynamic partitional clustering using evolution strategies. In
Proc. 3rd Asia-Pacific Conference on Simulated Evolution and Learning, Nagoya, Japan, 2000.
[6] G. Casella and R. L. Berger. Statistical Inference. Wadsworth & Brooks/Cole, Pacific Grove,
CA, 1990.
[7] L. Xu and M. Jordan. On convergence properties of the EM algorithm for Gaussian mixtures.
Neural Computation, 7, 1995.
[8] C. Archambeau, J. A. Lee, and M. Verleysen. On convergence problems of the EM algorithm
for finite Gaussian mixtures. In Proc. ESANN 2003 (European Symposium on Artificial Neural
Networks), Bruges, Belgium, 23-25 April 2003, d-side publi., ISBN 2-930307-03-X, pp. 99-106.
[9] P. Langley, W. Iba, and K. Thompson. An analysis of Bayesian classifiers. In Proc. 10th
National Conference on Artificial Intelligence, pp. 223-228, AAAI Press / MIT Press, 1992.
[10] P. Langley and S. Sage. Induction of selective Bayesian classifiers. In Proc. 10th Conference
on Artificial Intelligence, 1994.
[11] J. Bilmes. A gentle tutorial of the EM algorithm and its application to parameter estimation for
Gaussian mixture and hidden Markov models. Technical report, International Computer Science
Institute, Berkeley, CA, 1998.
[12] A. Ratnaparkhi. Maximum Entropy Models for Natural Language Ambiguity Resolution.
Ph.D. thesis, University of Pennsylvania, 1998.
[13] R. E. Schapire and Y. Singer. BoosTexter: A boosting-based system for text categorization.
Machine Learning, 39(2/3):135-168, 2000.