
An entropy-based algorithm
for categorical clustering
Authors: Daniel Barbara, Julia Couto, Yi Li
Graduate student: Chien-Ming Hsiao
Outline

Motivation
Objective
Introduction
Background and problem formulation
Algorithm
Experimental results
Conclusions
Personal opinion
Motivation

Most published algorithms for clustering categorical data rely on a distance metric.

Clustering categorical attributes is difficult: there is no natural notion of distance between categorical values.
Objective

Use a novel method that uses the notion of entropy to group records.

Clusters of similar points have lower entropy than clusters of dissimilar ones.
Introduction

COOLCAT

A method that uses the notion of entropy to group records.

COOLCAT is an incremental algorithm that aims to minimize the expected entropy of the clusters.
Background and problem formulation

Entropy and clustering

Entropy is the measure of the information and uncertainty of a random variable. Let X be a random variable, S(X) the set of values that X can take, and p(x) the probability function of X. Then

E(X) = -\sum_{x \in S(X)} p(x) \log p(x)
Background and problem formulation

The entropy of a multivariate vector \hat{x} = (x_1, \ldots, x_n) can be computed as

E(\hat{X}) = -\sum_{x_1 \in S(X_1)} \cdots \sum_{x_n \in S(X_n)} p(\hat{x}) \log p(\hat{x})

where p(\hat{x}) = p(x_1, \ldots, x_n) is the multivariate probability distribution.
Problem formulation

Given a data set D of N points \hat{p}_1, \ldots, \hat{p}_N, where each point is a multidimensional vector of d categorical attributes, we would like to separate the points into k groups.

This problem is NP-complete.

We first have to resolve the issue of what we mean by the "whole entropy of the system". We take it to be the expected entropy of the clustering:

E(\bar{C}) = \sum_{k} \frac{|C_k|}{|D|} E(C_k)

where E(C_1), \ldots, E(C_k) represent the entropies of each cluster and C_i denotes the points assigned to cluster i.
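A small Python sketch of this weighted sum; the list-of-clusters representation is an assumption, and cluster_entropy stands for any function that computes E(C_k) for a single cluster (one such function is sketched below):

```python
def expected_entropy(clusters, cluster_entropy):
    """Expected entropy E(C) = sum_k (|C_k| / |D|) * E(C_k).
    `clusters` is a list of clusters (each a list of points) and
    `cluster_entropy` computes E(C_k) for one cluster."""
    total = sum(len(c) for c in clusters)
    return sum(len(c) / total * cluster_entropy(c) for c in clusters)
```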
Problem formulation

We assume independence of the attributes of the record.

The joint probability of the combined attribute values then becomes the product of the probabilities of each attribute, and the entropy can be calculated as the sum of the entropies of the attributes:

E(\hat{X}) = -\sum_{x_1 \in S(X_1)} \cdots \sum_{x_n \in S(X_n)} \left( \prod_i p(x_i) \right) \log \prod_i p(x_i) = E(X_1) + E(X_2) + \cdots + E(X_n)
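A short sketch of this reduction, assuming each record is a tuple of categorical values and reusing the entropy() helper from the earlier sketch:

```python
def cluster_entropy(points):
    """E(C) = E(X_1) + ... + E(X_d): under the independence assumption,
    the entropy of a cluster is the sum of the entropies of its
    attribute columns. Reuses entropy() defined earlier."""
    columns = zip(*points)  # transpose records into attribute columns
    return sum(entropy(col) for col in columns)

print(cluster_entropy([("a", "x"), ("a", "x")]))  # 0.0: identical records
print(cluster_entropy([("a", "x"), ("b", "y")]))  # 2.0: 1 bit per attribute
```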
Problem formulation
Expected entropy and the Minimum Description Length principle

The minimum description length principle (MDL) recommends choosing the model that minimizes the sum of the model's algorithmic complexity and the description of the data with respect to that model:

K(h, D) = K(h) + K(D \text{ using } h)

K(h) = k \log |D|

K(D \text{ using } h) = -\sum_{i=0}^{k-1} \frac{|C_i|}{|D|} \sum_{j=0}^{d-1} \sum_{l=0}^{v-1} P_{ijl} \log P_{ijl} + |D| \log k

K(h, D) = k \log |D| + |D| \log k + E(\bar{C})

The term K(h) denotes the complexity of the model, or model encoding; K(D using h) is the complexity of the data encoding with respect to the chosen model (P_ijl is the probability that attribute j takes its l-th value in cluster i). Since k log|D| and |D| log k are fixed once k is chosen, minimizing K(h, D) amounts to minimizing the expected entropy E(\bar{C}).
Evaluating clustering results

Significance test on external variables

E(C_k) = -\sum_j P(C_k = V_j) \log P(C_k = V_j)

where V_j ranges over the values of an external variable (one not used in the clustering) and P(C_k = V_j) is the frequency of value V_j within cluster C_k.
The category utility (CU) function

Measures whether the clustering improves the likelihood of similar values falling in the same cluster:

CU = \sum_{k} \frac{|C_k|}{|D|} \sum_{i} \sum_{j} \left[ P(A_i = V_{ij} \mid C_k)^2 - P(A_i = V_{ij})^2 \right]
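A direct transcription of the CU formula into Python; the list-of-clusters-of-tuples layout is an assumption:

```python
from collections import Counter

def category_utility(clusters):
    """CU = sum_k |C_k|/|D| * sum_ij [P(A_i=V_ij | C_k)^2 - P(A_i=V_ij)^2],
    where `clusters` is a list of clusters and each point is a tuple of
    d categorical attribute values."""
    data = [p for c in clusters for p in c]
    d = len(data[0])

    def sum_sq(points, i):
        # sum over j of P(A_i = V_ij)^2 for attribute i within `points`
        counts = Counter(p[i] for p in points)
        return sum((c / len(points)) ** 2 for c in counts.values())

    return sum(
        len(c) / len(data) * sum(sum_sq(c, i) - sum_sq(data, i)
                                 for i in range(d))
        for c in clusters
    )
```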
Related work

ENCLUS: an entropy-based algorithm.

ROCK: an agglomerative algorithm that computes distances between records using the Jaccard coefficient.

CACTUS

AUTOCLASS

Snob
Our algorithm

Consists of two steps:

Initialization: find the k most "dissimilar" records from the sample set S by maximizing the minimum pairwise entropy of the chosen points. This step is O(|S|^2). The sample size s needed so that, with high probability, the sample contains members of every cluster is

s \ge k + k \log \frac{1}{\delta} + k \sqrt{ \left( \log \frac{1}{\delta} \right)^2 + 2 \log \frac{1}{\delta} }

where \delta bounds the probability that some cluster has no representative in the sample.

Incremental step: compute the expected entropy that results from placing the point in each of the clusters, and select the cluster for which that expected entropy is the minimum (see the sketch below).
Our algorithm
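A rough end-to-end Python sketch of the two steps; the greedy seed search, the sampling details, and all names are simplifications of mine, not the authors' pseudocode. It reuses entropy(), cluster_entropy(), and expected_entropy() from the earlier sketches:

```python
import random
from itertools import combinations

def coolcat(points, k, sample_size=30, rng_seed=0):
    """Sketch of COOLCAT's initialization and incremental steps.
    Assumes k >= 2 and that each point is a tuple of categorical values."""
    rng = random.Random(rng_seed)
    sample = rng.sample(points, min(sample_size, len(points)))

    # Initialization: start from the most "dissimilar" pair (largest
    # pairwise entropy), then greedily add the record whose minimum
    # pairwise entropy against the chosen seeds is largest -- O(|S|^2).
    seeds = list(max(combinations(sample, 2),
                     key=lambda pair: cluster_entropy(pair)))
    while len(seeds) < k:
        seeds.append(max(
            (p for p in sample if p not in seeds),
            key=lambda p: min(cluster_entropy([p, s]) for s in seeds)))
    clusters = [[s] for s in seeds]

    # Incremental step: place each point in the cluster that yields the
    # minimum expected entropy after a tentative insertion. (For brevity
    # this sketch also reassigns the seed records.)
    for p in points:
        best = min(range(k), key=lambda i: expected_entropy(
            [c + [p] if j == i else c for j, c in enumerate(clusters)],
            cluster_entropy))
        clusters[best].append(p)
    return clusters
```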
Experimental results

Archaeological data set: a hypothetical collection of human tombs and artifacts from an archaeological site.

Congressional votes: this data set was obtained from the UCI KDD Archive.

KDD Cup: this data set was obtained from the UCI KDD Archive.
Conclusion & Personal opinion

The paper introduces a categorical clustering algorithm, COOLCAT, based on the notion of entropy.

The experimental evaluation supports the authors' claim that COOLCAT is an efficient algorithm.

COOLCAT is easier to tune and more efficient than ROCK.