CACTUS – Clustering Categorical Data Using Summaries
By Venkatesh Ganti, Johannes Gehrke and Raghu
Ramakrishnan
RongEn Li
School of Informatics, Edinburgh
Overview
• Introduction and motivation
• Existing tools for clustering categorical
data: STIRR and ROCK
• Definition of a cluster over categorical data
• The algorithm – CACTUS
• Experiments and results
• Summary
Introduction and motivation
• Numeric data: {1, 2, 3, 4, 5, …}
• Categorical data: {LFD, PMR, DME}
– Domains usually contain a small number of attribute values; it is typically hard to infer useful information from large domains.
– Use relations! A relation contains several attributes, and the cross product of the attribute domains can be large.
• CACTUS – a fast summarisation-based algorithm which uses summary information to find well-defined clusters.
Existing tools for clustering
categorical data
• STIRR
– Each attribute value is represented as a weighted vertex in a graph.
– Multiple copies b1,…,bm (basins) of the weighted vertices are maintained; the same vertex can have different weights in different basins.
– Starting step: assign a set of initial weights to all vertices in all basins.
– Iterative step: for each tuple t = <t1,…,tn> and each basin bi, increment the weight of vertex tj in bi using a function that combines the weights of the other vertices of t in bi (see the sketch after this list).
– At the fixed point: the large positive weights and the small negative weights across the basins isolate two groups of attribute values on each attribute.
• ROCK
– Starts with each tuple in its own cluster.
– Merges close clusters until a required (user-specified) number of clusters remains; closeness is defined by a similarity function.
• STIRR is used as the baseline for comparison with CACTUS.
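A minimal sketch of STIRR's iterative step in Python, assuming a single basin and the product combiner (the paper allows other combiners); the function and variable names are illustrative, not from the paper:

import math

def stirr_iterate(tuples, weights, steps=20):
    # One basin of STIRR: repeatedly propagate weights along tuples.
    # tuples  : list of tuples of attribute values, e.g. [("a1", "b2"), ...]
    # weights : dict mapping (attr_index, value) -> float (initial configuration)
    for _ in range(steps):
        new = {k: 0.0 for k in weights}
        for t in tuples:
            for j, tj in enumerate(t):
                # combine the weights of the other vertices of t (product combiner)
                others = [weights[(i, v)] for i, v in enumerate(t) if i != j]
                new[(j, tj)] += math.prod(others)
        # normalise per attribute so the weights stay bounded
        for j in {k[0] for k in new}:
            norm = math.sqrt(sum(w * w for k, w in new.items() if k[0] == j)) or 1.0
            for k in new:
                if k[0] == j:
                    new[k] /= norm
        weights = new
    return weights

At the fixed point, the sign and magnitude of the weights partition each attribute's values into two groups, as described above.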
Definitions: Interval region, support
and belonging
• A1,…,An is a set of categorical attributes with domains D1,…,Dn respectively. D is a set of tuples, where each tuple t ∈ D1 × … × Dn.
– Interval region: S = S1 × … × Sn, where Si ⊆ Di for all i ∈ {1,…,n}. Equivalent to intervals in numeric data.
– Support: the support of a value pair is σD(ai,aj) = |{t ∈ D : t.Ai = ai and t.Aj = aj}|; the support σD(S) of a region is the number of tuples in D contained in S.
– Belonging: a tuple t = <t.A1,…,t.An> ∈ D belongs to a region S if t.Ai ∈ Si for all i ∈ {1,…,n}.
Definitions: expected support,
strongly connected
• The expected support under the attribute-independence assumption:
– Of a region: E[σD(S)] = |D| · (|S1| × … × |Sn|) / (|D1| × … × |Dn|)
– Of a value pair ai, aj: E[σD(ai,aj)] = |D| / (|Di| × |Dj|); a pair must exceed α times this value, where α is normally set to 2 or 3. For example, with |D| = 10^6, |Di| = |Dj| = 100 and α = 3, a pair must co-occur in more than 300 tuples.
• Strongly connected
– ai and aj: if σD(ai,aj) > α · E[σD(ai,aj)], then σ*D(ai,aj) = σD(ai,aj); otherwise σ*D(ai,aj) = 0.
– ai w.r.t. Sj: ai is strongly connected with every x ∈ Sj.
– Si and Sj: each ai ∈ Si is strongly connected with each aj ∈ Sj, and vice versa.
Definitions: Cluster, Cluster-projection,
sub-cluster and subspace cluster
• C = <C1,…,Cn> is a cluster over {A1,…,An} if
– 1. Ci and Cj are strongly connected, for all i ≠ j
– 2. There exists no proper superset C′i of Ci such that condition 1 still holds with C′i in place of Ci (maximality)
– 3. The support σD(C) is at least α times the expected support of C under the attribute-independence assumption
• Ci is the cluster-projection of C on Ai.
• C is a sub-cluster if it satisfies only conditions 1 and 3 (a small checker is sketched below).
• A cluster C over a proper subset S ⊂ {A1,…,An} of the attributes is a subspace cluster on S.
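A minimal sketch of the sub-cluster test (conditions 1 and 3), assuming σ* is available as a dict keyed by ((i, ai), (j, aj)) with i < j; all names here are illustrative:

def is_subcluster(C, D_sizes, n_tuples, support_of, sigma_star, alpha=3.0):
    # C : candidate <C1, ..., Cn>, one set of attribute values per attribute
    # Condition 1: every pair Ci, Cj (i < j) is strongly connected
    n = len(C)
    for i in range(n):
        for j in range(i + 1, n):
            for ai in C[i]:
                for aj in C[j]:
                    if sigma_star.get(((i, ai), (j, aj)), 0) == 0:
                        return False
    # Condition 3: support at least alpha times the expected support
    expected = n_tuples
    for i in range(n):
        expected *= len(C[i]) / D_sizes[i]
    return support_of(C) >= alpha * expected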
Definitions: similarity, inter-attribute
summaries, intra-attribute summaries
• Similarity: γj(a1,a2) = |{x ∈ Dj : σ*D(a1,x) > 0 and σ*D(a2,x) > 0}|
• Inter-attribute summary:
– Σij = {(ai, aj, σ*D(ai,aj)) : ai ∈ Di, aj ∈ Dj, and σ*D(ai,aj) > 0}
– Strongly connected attribute value pairs, where each pair takes its values from two different attributes
• Intra-attribute summary:
– Σii = {(a1, a2, γj(a1,a2)) : a1, a2 ∈ Di, j ≠ i, and γj(a1,a2) > 0}
– Similarities between attribute values of the same attribute (computable from Σij alone; see the sketch below)
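A sketch of deriving the intra-attribute similarity from the inter-attribute summary by a self-join, assuming Σij is stored as a dict from (ai, aj) to σ* values; the names are illustrative:

from collections import defaultdict
from itertools import combinations

def intra_summary(sigma_ij):
    # gamma_j(a1, a2) = number of x in Dj strongly connected with both a1 and a2
    connected_to = defaultdict(set)     # x in Dj -> values of Ai connected to x
    for (ai, aj) in sigma_ij:
        connected_to[aj].add(ai)
    gamma = defaultdict(int)
    for x, values in connected_to.items():
        for a1, a2 in combinations(sorted(values), 2):
            gamma[(a1, a2)] += 1        # x is a witness for the pair (a1, a2)
    return dict(gamma)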
CACTUS vs STIRR: clusters found by CACTUS
CACTUS vs STIRR: clusters found by STIRR
CACTUS: CAtegorical ClusTering
Using Summaries
• Central idea: the data summary (inter- and intra-attribute summaries) is sufficient to find candidate clusters, which can then be validated.
• A three-phase clustering algorithm:
– Summarisation
– Clustering
– Validation
Summarisation Phase
• Assumption: the inter- and intra-attribute summaries of any pair of attributes fit easily into main memory.
• Inter-attribute summaries:
– Use a counter, initially set to 0, for each pair (ai,aj) ∈ Di × Dj.
– Scan the dataset, incrementing the counter for each pair that occurs in a tuple (see the counting sketch after this list).
– After the scan, compute σ*D(ai,aj) and reset the counters of pairs that do not exceed α · E[σD(ai,aj)]. Store the remaining value pairs.
• Intra-attribute summaries:
– Computed from the inter-attribute summaries, not from the dataset: a1, a2 ∈ Di are similar w.r.t. Aj if some x ∈ Dj is strongly connected with both (a self-join on Σij).
– A very fast operation, hence computed only when needed.
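A sketch of the counting scan for one attribute pair, under the slide's in-memory assumption; alpha and the names are illustrative:

from collections import Counter

def inter_summary(tuples, i, j, Di_size, Dj_size, alpha=3.0):
    # Single scan over the data to build Sigma_ij for the pair (Ai, Aj).
    counts = Counter()
    n = 0
    for t in tuples:
        counts[(t[i], t[j])] += 1       # one counter per co-occurring value pair
        n += 1
    threshold = alpha * n / (Di_size * Dj_size)   # alpha * expected support
    # keep only strongly connected pairs; all others are reset to zero
    return {pair: c for pair, c in counts.items() if c > threshold}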
Clustering Phase
• A two-step operation:
– Step 1: analyse each attribute to compute all cluster-projections on it.
– Step 2: synthesise candidate clusters on sets of attributes from the cluster-projections on individual attributes.
Clustering Phase continued
• Step 1: compute cluster-projections on attributes
– Step A: find all cluster-projections on Ai of clusters over each pair (Ai,Aj).
– Step B: compute all cluster-projections on Ai of clusters over {A1,…,An} by intersecting the sets of cluster-projections from Step A.
– Step A is NP-hard! Solution: use distinguishing sets.
• Distinguishing sets identify different cluster-projections.
• Construct candidate distinguishing sets on Ai and extend some of them w.r.t. Aj.
• The detailed steps are too long for this presentation, sorry!
– Step B: intersection join of cluster-projections (see the sketch below)
• Intersection join: the join of S1 and S2 is {s : there exist s1 ∈ S1 and s2 ∈ S2 such that s = s1 ∩ s2 and |s| > 1}
• Apply the intersection join to all sets of attribute values on Ai.
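A minimal sketch of the intersection join from Step B (names illustrative):

def intersection_join(S1, S2):
    # Pairwise intersections that keep more than one attribute value.
    out = set()
    for s1 in S1:
        for s2 in S2:
            s = frozenset(s1) & frozenset(s2)
            if len(s) > 1:
                out.add(s)
    return out

Folding this join over the candidate sets from all attribute pairs yields the cluster-projections on Ai over the full attribute set.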
• Step 2: try to augment each candidate cluster c^k = <c1,…,ck> with a cluster-projection ck+1 on attribute Ak+1. If <ci,ck+1> is a sub-cluster on (Ai,Ak+1) for every i ∈ {1,…,k}, add c^(k+1) = <c1,…,ck+1> to the candidate set (sketched below).
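A sketch of this level-wise augmentation, assuming a predicate is_subcluster2(i, ci, c_next) that tests the 2-attribute sub-cluster condition on (Ai, Ak+1); the names are illustrative:

def extend_candidates(candidates_k, projections_k1, is_subcluster2):
    # Grow k-attribute candidates by one attribute, Apriori-style.
    out = []
    for cand in candidates_k:                 # cand = (c1, ..., ck)
        for c_next in projections_k1:         # cluster-projections on A_{k+1}
            if all(is_subcluster2(i, ci, c_next) for i, ci in enumerate(cand)):
                out.append(cand + (c_next,))
    return out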
Validation Phase
• Use a required support threshold to reject false candidates: a candidate may lack sufficient support because the 2-clusters combined to form it may come from different sets of tuples (a sketch follows).
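A sketch of the validation pass, assuming the required threshold is α times the expected support (the concrete threshold is a parameter of the method); names illustrative:

def validate(tuples, candidates, D_sizes, alpha=3.0):
    # Keep only candidate clusters whose actual support clears the threshold.
    n = len(tuples)
    supports = [0] * len(candidates)
    for t in tuples:                          # one scan over the data
        for k, C in enumerate(candidates):
            if all(t[i] in Ci for i, Ci in enumerate(C)):
                supports[k] += 1
    kept = []
    for k, C in enumerate(candidates):
        expected = n
        for i, Ci in enumerate(C):
            expected *= len(Ci) / D_sizes[i]
        if supports[k] >= alpha * expected:
            kept.append(C)
    return kept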
Experiments and Results
• Comparison with STIRR.
• Dataset: 1 million tuples, 10 attributes, and 100 attribute values per attribute.
• CACTUS discovers a broader class of clusters than STIRR.
Conclusion
• The authors formalised the definition of a cluster over categorical data.
• CACTUS is a fast and efficient algorithm for clustering categorical data.
• Apologies that some parts of the algorithm were omitted due to time constraints.
Question Time