An Entropy-Based Algorithm for Categorical Clustering
Authors: Daniel Barbara, Julia Couto, Yi Li
Graduate: Chien-Ming Hsiao

Outline
- Motivation
- Objective
- Introduction
- Background and problem formulation
- Algorithm
- Experimental results
- Conclusions
- Personal opinion

Motivation
- Many of the published algorithms for clustering categorical data rely on the use of a distance metric.
- Clustering of categorical attributes is difficult.

Objective
- Use a novel method that relies on the notion of entropy to group records.
- Clusters of similar points have lower entropy than clusters of dissimilar points.

Introduction
COOLCAT
- A method that uses the notion of entropy to group records.
- An incremental algorithm that aims to minimize the expected entropy of the clusters.

Background and problem formulation
Entropy and clustering
- Entropy is a measure of the information and uncertainty of a random variable.
- Let X be a random variable, S(X) the set of values that X can take, and p(x) the probability function of X. Then

    E(X) = − Σ_{x ∈ S(X)} p(x) log p(x)

- The entropy of a multivariate vector X̂ = (X1, …, Xn) is

    E(X̂) = − Σ_{x1 ∈ S(X1)} … Σ_{xn ∈ S(Xn)} p(x̂) log p(x̂)

  where p(x̂) = p(x1, …, xn) is the multivariate probability distribution.

Problem formulation
- Given a data set D of N points p̂1, …, p̂N, where each point is a multidimensional vector of d categorical attributes, we would like to separate the points into k groups.
- This problem is NP-Complete.
- We first have to resolve what we mean by the "whole entropy of the system". We use the expected entropy of the clustering:

    Ē(C̄) = Σ_k (|Ck| / |D|) E(Ck)

  where E(C1), …, E(Ck) are the entropies of the individual clusters and Ci denotes the set of points assigned to cluster i.

Problem formulation (cont.)
- We assume independence of the attributes of the record.
- The joint probability of the combined attribute values then becomes the product of the probabilities of each attribute, and the entropy can be calculated as the sum of the entropies of the attributes.
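The per-attribute decomposition above makes cluster entropy cheap to compute. Here is a minimal Python sketch of that computation (function names are mine, not the authors'; log base 2 is an assumed choice of unit):

```python
import math
from collections import Counter

def attribute_entropy(values):
    """E(X) = -sum over x of p(x) * log p(x), for one categorical attribute."""
    n = len(values)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(values).values())

def record_set_entropy(records):
    """Under the attribute-independence assumption, the entropy of a set of
    records is the sum of its per-attribute entropies:
    E(X1, ..., Xn) = E(X1) + ... + E(Xn)."""
    d = len(records[0])
    return sum(attribute_entropy([r[j] for r in records]) for j in range(d))

# A homogeneous cluster has lower entropy than a mixed one.
same = [("a", "x"), ("a", "x"), ("a", "x")]
mixed = [("a", "x"), ("b", "y"), ("c", "z")]
print(record_set_entropy(same))   # 0.0
print(record_set_entropy(mixed))  # 2 * log2(3), about 3.17
```

This decomposition is what lets COOLCAT evaluate clusters from simple per-attribute value counts instead of full joint distributions.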
  Under the independence assumption,

    E(X̂) = − Σ_{x1 ∈ S(X1)} … Σ_{xn ∈ S(Xn)} p(x̂) log p(x̂) = E(X1) + E(X2) + … + E(Xn)

Problem formulation
Expected entropy and the Minimum Description Length principle
- The minimum description length principle (MDL) recommends choosing the model that minimizes the sum of the model's algorithmic complexity and the description length of the data with respect to that model:

    K(h, D) = K(h) + K(D using h)

  The term K(h) denotes the complexity of the model (the model encoding); K(D using h) is the complexity of the data encoding with respect to the chosen model.
- For a clustering h into k clusters, the data encoding is driven by the cluster entropies:

    K(D using h) = − Σ_{i=0}^{k−1} |Ci| Σ_{j=0}^{d−1} Σ_{l=0}^{v−1} P_ijl log P_ijl = |D| Ē(C̄)

  where P_ijl is the probability of the l-th value of attribute j within cluster i.
- Together with the model encoding, this gives

    K(h, D) ≈ log|D| + |D| log k + |D| Ē(C̄)

  so, for a fixed k, minimizing the expected entropy Ē(C̄) minimizes the description length.

Evaluating clustering results
Significance test on external variables
- Compute the entropy of each cluster with respect to an external variable V (one not used by the clustering):

    E(Ck) = − Σ_j P(Ck | Vj) log P(Ck | Vj)

The category utility function
- The category utility (CU) measures whether the clustering improves the likelihood of similar values falling in the same cluster:

    CU = Σ_k (|Ck| / |D|) Σ_i Σ_j [ P(Ai = Vij | Ck)² − P(Ai = Vij)² ]

Related work
- ENCLUS: an entropy-based algorithm.
- ROCK: an agglomerative algorithm that computes distances between records using the Jaccard coefficient.
- CACTUS
- AUTOCLASS
- Snob

Our algorithm
Consists of two steps:
- Initialization: find the k most "dissimilar" records from a sample set S by maximizing the minimum pairwise entropy of the chosen points; this step is O(|S|²). The sample size is chosen so that, with high probability, every cluster is represented in the sample.
- Incremental step: for each remaining point, compute the expected entropy that results from placing the point in each of the clusters, and select the cluster for which that expected entropy is minimum.

Experimental results
- Archaeological data set: a hypothetical collection of human tombs and artifacts from an archaeological site.
- Congressional votes: obtained from the UCI KDD Archive.
- KDD Cup: obtained from the UCI KDD Archive.

Conclusion & personal opinion
- The paper introduces a categorical clustering algorithm, COOLCAT, based on the notion of entropy.
- The experimental evaluation supports the authors' claim that COOLCAT is an efficient algorithm.
- COOLCAT is easier to tune and more efficient than ROCK.
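As a closing illustration, the incremental step described in the algorithm slides can be sketched in Python. This is only a sketch under stated assumptions (function names, seed handling, and first-minimum tie-breaking are mine, not the authors' implementation):

```python
import math
from collections import Counter

def expected_entropy(clusters, total):
    """Ē(C̄) = sum over k of (|Ck|/|D|) * E(Ck), where E(Ck) is the sum of
    per-attribute entropies (attribute-independence assumption)."""
    def cluster_entropy(records):
        if not records:
            return 0.0
        e = 0.0
        for j in range(len(records[0])):
            counts = Counter(r[j] for r in records)
            n = len(records)
            e += sum(-(c / n) * math.log2(c / n) for c in counts.values())
        return e
    return sum(len(c) / total * cluster_entropy(c) for c in clusters)

def assign_incrementally(points, seeds):
    """COOLCAT-style incremental step (sketch): tentatively place each point
    in every cluster and keep the placement that minimizes the expected
    entropy of the clustering."""
    clusters = [[s] for s in seeds]  # seeds come from the initialization step
    total = len(seeds)
    for p in points:
        total += 1
        best, best_e = 0, float("inf")
        for i in range(len(clusters)):
            clusters[i].append(p)          # tentative placement
            e = expected_entropy(clusters, total)
            clusters[i].pop()              # undo
            if e < best_e:
                best, best_e = i, e
        clusters[best].append(p)
    return clusters

seeds = [("a", "x"), ("b", "y")]
points = [("a", "x"), ("b", "y"), ("a", "x")]
print(assign_incrementally(points, seeds))  # each point joins the seed it matches
```

A real implementation would maintain per-cluster value counts and update them incrementally rather than recomputing every cluster's entropy from scratch for each candidate placement.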