An Entropy-Based Algorithm for Categorical Clustering
Authors: Daniel Barbara, Julia Couto, Yi Li
Graduate: Chien-Ming Hsiao

Outline
- Motivation
- Objective
- Introduction
- Background and problem formulation
- Algorithm
- Experimental results
- Conclusions
- Personal opinion

Motivation
- Many of the published algorithms for clustering categorical data rely on the use of a distance metric.
- Clustering of categorical attributes is difficult.

Objective
- Use a novel method that relies on the notion of entropy to group records.
- Clusters of similar points have lower entropy than clusters of dissimilar points.

Introduction
COOLCAT
- A method that uses the notion of entropy to group records.
- An incremental algorithm that aims to minimize the expected entropy of the clusters.

Background and problem formulation
Entropy and clustering
- Entropy is a measure of the information and uncertainty of a random variable.
- Let X be a random variable, S(X) the set of values that X can take, and p(x) the probability function of X. Then

    E(X) = − Σ_{x ∈ S(X)} p(x) log p(x)

- The entropy of a multivariate vector X̂ = (X1, …, Xn) is

    E(X̂) = − Σ_{x1 ∈ S(X1)} … Σ_{xn ∈ S(Xn)} p(x̂) log p(x̂)

  where p(x̂) = p(x1, …, xn) is the multivariate probability distribution.

Problem formulation
- Given a data set D of N points p̂1, …, p̂N, where each point is a multidimensional vector of d categorical attributes, we would like to separate the points into k groups.
- This problem is NP-Complete.
- We first have to resolve what we mean by the "whole entropy of the system". We use the expected entropy of the clustering:

    Ē(C̄) = Σ_k (|Ck| / |D|) E(Ck)

  where E(C1), …, E(Ck) are the entropies of the individual clusters and Ci denotes the set of points assigned to cluster i.

Problem formulation (cont.)
- We assume independence of the attributes of the record.
- The joint probability of the combined attribute values then becomes the product of the probabilities of each attribute, and the entropy can be calculated as the sum of the entropies of the attributes.
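The per-attribute decomposition above makes cluster entropy cheap to compute. Here is a minimal Python sketch of that computation (function names are mine, not the authors'; log base 2 is an assumed choice of unit):

```python
import math
from collections import Counter

def attribute_entropy(values):
    """E(X) = -sum over x of p(x) * log p(x), for one categorical attribute."""
    n = len(values)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(values).values())

def record_set_entropy(records):
    """Under the attribute-independence assumption, the entropy of a set of
    records is the sum of its per-attribute entropies:
    E(X1, ..., Xn) = E(X1) + ... + E(Xn)."""
    d = len(records[0])
    return sum(attribute_entropy([r[j] for r in records]) for j in range(d))

# A homogeneous cluster has lower entropy than a mixed one.
same = [("a", "x"), ("a", "x"), ("a", "x")]
mixed = [("a", "x"), ("b", "y"), ("c", "z")]
print(record_set_entropy(same))   # 0.0
print(record_set_entropy(mixed))  # 2 * log2(3), about 3.17
```

This decomposition is what lets COOLCAT evaluate clusters from simple per-attribute value counts instead of full joint distributions.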
  Under the independence assumption,

    E(X̂) = − Σ_{x1 ∈ S(X1)} … Σ_{xn ∈ S(Xn)} p(x̂) log p(x̂) = E(X1) + E(X2) + … + E(Xn)

Problem formulation
Expected entropy and the Minimum Description Length principle
- The minimum description length principle (MDL) recommends choosing the model that minimizes the sum of the model's algorithmic complexity and the description length of the data with respect to that model:

    K(h, D) = K(h) + K(D using h)

  The term K(h) denotes the complexity of the model (the model encoding); K(D using h) is the complexity of the data encoding with respect to the chosen model.
- For a clustering h into k clusters, the data encoding is driven by the cluster entropies:

    K(D using h) = − Σ_{i=0}^{k−1} |Ci| Σ_{j=0}^{d−1} Σ_{l=0}^{v−1} P_ijl log P_ijl = |D| Ē(C̄)

  where P_ijl is the probability of the l-th value of attribute j within cluster i.
- Together with the model encoding, this gives

    K(h, D) ≈ log|D| + |D| log k + |D| Ē(C̄)

  so, for a fixed k, minimizing the expected entropy Ē(C̄) minimizes the description length.

Evaluating clustering results
Significance test on external variables
- Compute the entropy of each cluster with respect to an external variable V (one not used by the clustering):

    E(Ck) = − Σ_j P(Ck | Vj) log P(Ck | Vj)

The category utility function
- The category utility (CU) measures whether the clustering improves the likelihood of similar values falling in the same cluster:

    CU = Σ_k (|Ck| / |D|) Σ_i Σ_j [ P(Ai = Vij | Ck)² − P(Ai = Vij)² ]

Related work
- ENCLUS: an entropy-based algorithm.
- ROCK: an agglomerative algorithm that computes distances between records using the Jaccard coefficient.
- CACTUS
- AUTOCLASS
- Snob

Our algorithm
Consists of two steps:
- Initialization: find the k most "dissimilar" records from a sample set S by maximizing the minimum pairwise entropy of the chosen points; this step is O(|S|²). The sample size is chosen so that, with high probability, every cluster is represented in the sample.
- Incremental step: for each remaining point, compute the expected entropy that results from placing the point in each of the clusters, and select the cluster for which that expected entropy is minimum.

Experimental results
- Archaeological data set: a hypothetical collection of human tombs and artifacts from an archaeological site.
- Congressional votes: obtained from the UCI KDD Archive.
- KDD Cup: obtained from the UCI KDD Archive.

Conclusion & personal opinion
- The paper introduces a categorical clustering algorithm, COOLCAT, based on the notion of entropy.
- The experimental evaluation supports the authors' claim that COOLCAT is an efficient algorithm.
- COOLCAT is easier to tune and more efficient than ROCK.
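As a closing illustration, the incremental step described in the algorithm slides can be sketched in Python. This is only a sketch under stated assumptions (function names, seed handling, and first-minimum tie-breaking are mine, not the authors' implementation):

```python
import math
from collections import Counter

def expected_entropy(clusters, total):
    """Ē(C̄) = sum over k of (|Ck|/|D|) * E(Ck), where E(Ck) is the sum of
    per-attribute entropies (attribute-independence assumption)."""
    def cluster_entropy(records):
        if not records:
            return 0.0
        e = 0.0
        for j in range(len(records[0])):
            counts = Counter(r[j] for r in records)
            n = len(records)
            e += sum(-(c / n) * math.log2(c / n) for c in counts.values())
        return e
    return sum(len(c) / total * cluster_entropy(c) for c in clusters)

def assign_incrementally(points, seeds):
    """COOLCAT-style incremental step (sketch): tentatively place each point
    in every cluster and keep the placement that minimizes the expected
    entropy of the clustering."""
    clusters = [[s] for s in seeds]  # seeds come from the initialization step
    total = len(seeds)
    for p in points:
        total += 1
        best, best_e = 0, float("inf")
        for i in range(len(clusters)):
            clusters[i].append(p)          # tentative placement
            e = expected_entropy(clusters, total)
            clusters[i].pop()              # undo
            if e < best_e:
                best, best_e = i, e
        clusters[best].append(p)
    return clusters

seeds = [("a", "x"), ("b", "y")]
points = [("a", "x"), ("b", "y"), ("a", "x")]
print(assign_incrementally(points, seeds))  # each point joins the seed it matches
```

A real implementation would maintain per-cluster value counts and update them incrementally rather than recomputing every cluster's entropy from scratch for each candidate placement.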