CACTUS – Clustering Categorical Data Using Summaries
By Venkatesh Ganti, Johannes Gehrke and Raghu Ramakrishnan
RongEn Li
School of Informatics, Edinburgh

Overview
• Introduction and motivation
• Existing tools for clustering categorical data: STIRR and ROCK
• Definition of a cluster over categorical data
• The algorithm – CACTUS
• Experiments and results
• Summary

Introduction and motivation
• Numeric data, e.g. {1, 2, 3, 4, 5, …}
• Categorical data, e.g. {LFD, PMR, DME}
– Domains usually contain a small number of attribute values; large domains make it hard to infer useful information.
– Use relations! Relations contain different attributes, but the cross product of the attribute domains can be large.
• CACTUS – a fast summarisation-based algorithm which uses summary information to find well-defined clusters.

Existing tools for clustering categorical data
• STIRR
– Each attribute value is represented as a weighted vertex in a graph.
– Multiple copies b1, …, bm (basins) of the weighted vertices are maintained; the same vertex can carry different weights in different basins.
– Starting step: assign a set of weights to all vertices in all basins.
– Iterative step: for each tuple t = <t1, …, tn>, increment the weight of vertex tj in basin bi using a function that combines the weights of the other vertices of t in bi.
– At the fixed point: the large positive weights and small negative weights across the basins isolate two groups of attribute values on each attribute.
• ROCK
– Starts with each tuple in its own cluster.
– Merges close clusters until a required (user-specified) number of clusters remains. Closeness is defined by a similarity function.
• STIRR is used as the comparison baseline for CACTUS.

Definitions: interval region, support and belonging
• A1, …, An is a set of categorical attributes with domains D1, …, Dn respectively. D is a set of tuples where each tuple t є D1 X … X Dn.
– Interval region: S = S1 X … X Sn, where Si is a subset of Di for all i є {1, …, n}.
Equivalent to intervals in numeric data.
– The support of a value pair: σD(ai, aj) = |{t є D : t.Ai = ai and t.Aj = aj}|. The support σD(S) of a region S is the number of tuples in D contained in S.
– Belonging: a tuple t = <t.A1, …, t.An> є D belongs to a region S if t.Ai є Si for all i є {1, …, n}.

Definitions: expected support, strongly connected
• Expected support under the attribute-independence assumption:
– Of a region: E[σD(S)] = |D| * (|S1| X … X |Sn|) / (|D1| X … X |Dn|)
– Of a pair ai, aj: E[σD(ai, aj)] = |D| / (|Di| X |Dj|)
– α (normally set to 2 or 3) scales this expectation into a threshold.
• Strongly connected
– ai and aj: if σD(ai, aj) > α * E[σD(ai, aj)], then σ*D(ai, aj) = σD(ai, aj); otherwise σ*D(ai, aj) = 0.
– ai є Si w.r.t. Sj: ai and x are strongly connected for all x є Sj.
– Si and Sj: each ai є Si is strongly connected with each aj є Sj, and each aj є Sj is strongly connected with each ai є Si.

Definitions: cluster, cluster-projection, sub-cluster and subspace cluster
• C = <C1, …, Cn> is a cluster over {A1, …, An} if
– 1. Ci and Cj are strongly connected for all i ≠ j.
– 2. There exists no C′i that is a proper superset of Ci and is still strongly connected with the Cj (j ≠ i), i.e. the cluster is maximal.
– 3. The support σD(C) is >= α * the expected support of C under the attribute-independence assumption.
• Ci is the cluster-projection of C on Ai.
• C is a sub-cluster if it satisfies only conditions 1 and 3.
• A cluster C over a proper subset S of {A1, …, An} is a subspace cluster on S.
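To make the support and strong-connectivity definitions above concrete, here is a minimal Python sketch (the function names are illustrative, not from the paper's implementation):

```python
from collections import Counter

def pair_supports(tuples, i, j):
    """sigma_D(ai, aj): number of tuples t with t.Ai == ai and t.Aj == aj."""
    return Counter((t[i], t[j]) for t in tuples)

def sigma_star(support, n_tuples, di_size, dj_size, alpha=3):
    """sigma*_D(ai, aj): the support itself if it exceeds alpha times the
    expected support |D| / (|Di| * |Dj|) under attribute independence,
    otherwise 0."""
    expected = n_tuples / (di_size * dj_size)
    return support if support > alpha * expected else 0
```

For example, with 6 tuples over two domains of size 2 and α = 2, the expected pair support is 6/4 = 1.5, so a pair needs support above 3 to be strongly connected.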
Definitions: similarity, inter-attribute summaries, intra-attribute summaries
• Similarity: γj(a1, a2) = |{x є Dj : σ*D(a1, x) > 0 and σ*D(a2, x) > 0}|
• Inter-attribute summary:
– ∑ij = {(ai, aj, σ*D(ai, aj)) : ai є Di, aj є Dj, and σ*D(ai, aj) > 0}
– Strongly connected attribute value pairs, where each pair has attribute values from different attributes.
• Intra-attribute summary:
– ∑ii = {(a1, a2, γj(a1, a2)) : a1, a2 є Di and γj(a1, a2) > 0}, for each j ≠ i
– Similarities between attribute values of the same attribute.

CACTUS vs STIRR: clusters found by CACTUS

CACTUS vs STIRR: clusters found by STIRR

CACTUS: CAtegorical ClusTering Using Summaries
• Central idea: the data summary (inter- and intra-attribute summaries) is sufficient to find candidate clusters, which can then be validated.
• A three-phase clustering algorithm:
– Summarisation
– Clustering
– Validation

Summarisation Phase
• Assumption: the inter- and intra-attribute summaries of any pair of attributes fit easily into main memory.
• Inter-attribute summaries:
– Use a counter, initialised to 0, for each pair (ai, aj) є Di X Dj.
– Scan the dataset, incrementing the counter for each pair encountered.
– After the scan, compute σ*D(ai, aj) and reset the counters of those pairs whose support does not exceed α * E[σD(ai, aj)]. Store the remaining value pairs.
• Intra-attribute summaries:
– Computed by joining the inter-attribute summary with itself: pairs (T1, T2) of entries with a common value of Aj yield values a1, a2 of one domain Di that are both strongly connected to the same x є Dj.
– A very fast operation (no dataset scan), hence computed only when needed.

Clustering Phase
• A two-step operation:
– Step 1: analyse each attribute to compute all cluster-projections on it.
– Step 2: synthesise candidate clusters on sets of attributes from the cluster-projections on individual attributes.

Clustering Phase continued
• Step 1: compute cluster-projections on attributes
– Step A: find all cluster-projections on Ai of clusters over (Ai, Aj).
– Step B: compute all cluster-projections on Ai of clusters over {A1, …, An} by intersecting the sets of cluster-projections from Step A.
– Step A is NP-hard! Solution: use distinguishing sets.
• Distinguishing sets identify different cluster-projections.
• Construct candidate distinguishing sets on Ai and extend some of them w.r.t. Aj.
• The detailed steps are too long for this presentation, sorry!
– Step B: intersection of cluster-projections
• Intersection join: S1 ⋈ S2 = {s : there exist s1 є S1 and s2 є S2 such that s = s1 ∩ s2 and |s| > 1}
• Apply the intersection join to all sets of attribute values on Ai.
• Step 2: try to augment a candidate cluster <c1, …, ck> with a cluster-projection ck+1 on attribute Ak+1. If <ci, ck+1> is a sub-cluster on (Ai, Ak+1) for all i є {1, …, k}, extend the candidate to <c1, …, ck+1>.

Validation Phase
• Use a support threshold to recognise false candidates: a candidate may lack sufficient actual support because the 2-clusters (clusters over attribute pairs) combined to form it may be due to different sets of tuples.

Experiments and Results
• Compared against STIRR.
• 1 million tuples, 10 attributes and 100 attribute values per attribute.
• CACTUS discovers a broader class of clusters than STIRR.

Experiments and Results

Conclusion
• The authors formalised the definition of a cluster over categorical data.
• CACTUS is a fast and efficient algorithm for clustering categorical data.
• I am sorry that I could not show some parts of the algorithm due to time constraints.

Question Time
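As a closing sketch, the intersection join used in Step B of the clustering phase can be written in a few lines of Python (an illustrative sketch, not the authors' implementation):

```python
def intersection_join(s1_sets, s2_sets):
    """S1 join S2 = { s1 & s2 : s1 in S1, s2 in S2, |s1 & s2| > 1 }.
    Empty and singleton intersections are dropped, as in the definition;
    each surviving intersection is kept once."""
    result = []
    for s1 in s1_sets:
        for s2 in s2_sets:
            s = s1 & s2
            if len(s) > 1 and s not in result:
                result.append(s)
    return result
```

For example, joining {{1, 2, 3}, {4, 5}} with {{2, 3, 4}, {5, 6}} keeps only {2, 3}, since every other pairwise intersection has size at most 1.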