DIVA: A Variance-based Clustering Approach
for Multi-type Relational Data
Tao Li, Sarabjot S. Anand
Department of Computer Science, University of Warwick
Coventry, United Kingdom
{li.tao, s.s.anand}@warwick.ac.uk
ABSTRACT
Clustering is a common technique used to extract knowledge from a dataset in unsupervised learning. In contrast
to classical propositional approaches that only focus on simple and flat datasets, relational clustering can handle multi-type interrelated data objects directly and adopt semantic
information hidden in the linkage structure to improve the
clustering result. However, exploring linkage information
will greatly reduce the scalability of relational clustering.
Moreover, some characteristics of vector data space utilized
to accelerate the propositional clustering procedure are no
longer valid in relational data space. These two disadvantages restrain the relational clustering techniques from being
applied to very large datasets or in time-critical tasks, such
as online recommender systems. In this paper we propose
a new variance-based clustering algorithm to address the
above difficulties. Our algorithm combines the advantages of
divisive and agglomerative clustering paradigms to improve
the quality of cluster results. By adopting the idea of Representative Object, it can be executed with linear time complexity. Experimental results show our algorithm achieves
high accuracy, efficiency and robustness in comparison with
some well-known relational clustering approaches.
Categories and Subject Descriptors
I.5.3 [Pattern Recognition]: Clustering—Algorithms
General Terms
Algorithms, Performance
Keywords
Clustering, Multi-type, Relational
1. INTRODUCTION
Data mining aims to learn knowledge from a dataset. In
unsupervised learning, clustering is a common technique to
partition the dataset into a certain number of groups (clusters) with maximum intra-cluster similarity and minimum
inter-cluster similarity. Classical clustering approaches only
focus on simple and flat data, i.e. all data objects are of
the same type and conventionally described by a list of numeric attribute values. The former condition assumes all
data are stored in a single table, and the latter one makes it
possible to represent data as points in a multi-dimensional
vector space. Hence, many mathematical methods, e.g. accumulation or transformation, can be utilized to simplify
the clustering process greatly. Unfortunately, the above assumptions are not held in many practical cases: Firstly, data
might have various types of attributes: binary, categorical,
string or taxonomy-based; secondly, data objects are usually stored in several tables and their pairwise relationships
are specified by the semantic links between tables. Those
semantic links together with multi-type attributes compose
a far more complex feature space for relational dataset than
the Euclidean space, so both of them should be considered
during the clustering process. Although the classical clustering algorithms are still applicable in the relational cases
by means of combining the multiple tables into a single one
with join or aggregation operations, it is not a good choice
for the following reasons [11]:
• The transformation of relational linkage information
into a unified feature space will cause information loss,
or generate very high dimensional and sparse data,
which will inevitably degrade the performance of clustering algorithms.
• Besides the clusters within data objects of each type,
the global hidden patterns involving multi-type objects
might also be important, which cannot be recognized
by classical clustering approaches.
For example, the ontology of a relational movie dataset is
shown in Figure 1. Different concepts in the ontology have
different data types, e.g. “Title”, “YearOfRelease”, “Certificate” and “Genre” are string, numeric, categorical and
taxonomy-based, respectively. An arrow in the figure indicates that an object of the source concept includes one or
more objects of the target concept(s) as its member property. A bi-directional arrow means there exist recursive references between two concepts. Hence, a data object representing an actor will contain references of the movies he
acted in, which in turn refer to other actors and directors
he cooperated with in those movies. Figure 2 shows part of
the object for Tom Hanks. Classical clustering approaches
based only on data attributes (e.g. actors’ name) will gener-
Figure 1: Ontology of a movie dataset
Figure 2: Example object: Tom Hanks (Depth = 2)
ate completely meaningless clusters. Instead, relational clustering approaches will consider movies, directors and other
actors related to the current actor along the semantic links
in the ontology to generate reasonable clusters. Even if objects have no attributes, the linkage structure of the whole
dataset itself can still provide useful information for clustering.
Multi-type relational clustering has raised substantial research interest recently [3, 11, 15]. In contrast to classical
propositional approaches, relational clustering can handle
multi-type interrelated data directly and adopt the semantic information hidden in the linkage structure to improve
the clustering result. However, the trade-off for the above
advantages is: relational clustering needs to explore far more
information, so its scalability and efficiency are reduced. For
example, calculating the similarity between two data objects
of type Actor as in Figure 2 becomes much more expensive
than that in the propositional case. The second problem
is that many conveniences in Euclidean space are not available in relational space. For example, k-Means and BIRCH
[16] require all data to be represented as vectors to support
the operations of addition and division, because the gravity
center of each cluster is used in the clustering procedure.
However, such operations are not valid for data objects in
the relational dataset, so relational clustering approaches
(e.g. RDBC [8] and FORC [9]) have quadratic computational complexity. The above two disadvantages restrain relational clustering techniques from being applied upon very
large datasets, for example in an online movie recommender
system or a bibliographic database system. Another nontrivial issue in the classical clustering approaches is: how to
select the optimal number of clusters k as the input parameter? An inappropriate value of k will lead to skewed or "unnatural" clusters [2]. This problem becomes even more severe in
the reinforcement clustering algorithms, such as ReCoM [13]
and LinkClus [15], because the skewed cluster result of one
data type will be propagated along the relations to influence
the partitioning of other data types.
In order to address the above difficulties, in this paper we
propose a new variance-based clustering approach, named
DIVA (DIVision and Agglomeration). Our approach combines the advantages of divisive and agglomerative clustering paradigms to improve the quality of cluster results. Unlike ReCoM or LinkClus that only consider the direct relationships when clustering data objects of the current concept, we exploit multi-layered relational information in the
phase of constructing data objects. By adopting the idea of
Representative Object, we can perform the DIVA algorithm
with linear computational complexity O(N ), where N is the
number of objects to be clustered. Since the multi-type relational information is considered when data instances are
constructed and DIVA generates the clusters of different
data types separately, the problem of skewness propagation
within reinforcement clustering approaches can be avoided.
The rest of our paper is organized as follows: Section 2
introduces our method for constructing multi-type relational
objects and the corresponding similarity metric. On that
basis, we explain in detail the DIVA algorithm and analyze
its computational complexity in Section 3. Comprehensive
experimental results are provided in Section 4. Section 5
presents some prominent algorithms for propositional and
relational clustering. Finally, the conclusions are drawn in
Section 6.
2. RELATIONAL OBJECT CONSTRUCTION AND SIMILARITY MEASURE
In this section, we will formally define our method of constructing the multi-type relational objects as well as a recursive relational similarity metric according to the ontology.
Given an ontology represented as a directed graph G =
(C, E), in which vertices C = {ci } stand for the set of
concepts in the ontology and edges E = {eij | edge eij :
ci → cj } for the relationships between pairwise concepts,
i.e. concept ci includes concept cj as its member property. In Figure 1, concept “Actor” has member concept
list MC(Actor) = {Name, Movie} and concept "Movie" has MC(Movie) = {Title, Genre, YearOfRelease, Certificate, Plot, Duration, Actor, Director}. When constructing an object x of concept ci, we will first build its member concept list MC(ci) and then link all objects related to x into the member property attributes of x. We say an object y of concept cj is related to x when cj ∈ MC(ci). In such a case, y will be added into the member property attribute x.cj. Then for each y ∈ x.cj, we launch the above procedure iteratively until MC(cj) = ∅ or a depth bound Depth (≥ 0) is reached.
As an example, Figure 2 shows the object for actor “Tom
Hanks” with Depth = 2.
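To make the construction procedure concrete, here is a minimal Java sketch of depth-bounded object construction. It is our own illustration rather than the paper's implementation: the member concept lists are hard-coded for the simplified movie ontology, and fetchRelated() is a hypothetical stand-in for whatever data access the underlying knowledge base provides.

```java
import java.util.*;

// A minimal sketch (not the authors' implementation) of depth-bounded object
// construction. MC(c) is hard-coded for the simplified movie example; fetchRelated()
// is a hypothetical stand-in for the underlying data access.
public class ObjectBuilder {

    // A relational object: its concept, its propositional attributes and its
    // member property attributes x.c_j (one list of related objects per member concept).
    public static class RelObject {
        final String concept;
        final Map<String, Object> attributes = new HashMap<>();
        final Map<String, List<RelObject>> members = new HashMap<>();
        RelObject(String concept) { this.concept = concept; }
    }

    // MC(c_i) taken from the ontology; simplified as in Figure 3.
    static final Map<String, List<String>> MC = Map.of(
            "Actor", List.of("Name", "Movie"),
            "Movie", List.of("Title", "Genre", "YearOfRelease"));

    // Hypothetical data access: ids of the objects of concept cj related to entity `id`.
    static List<String> fetchRelated(String id, String cj) { return List.of(); }

    // Build the relational object for `id` of concept ci, recursing into every related
    // object until MC(cj) is empty or the depth bound is reached.
    static RelObject build(String id, String ci, int depth) {
        RelObject x = new RelObject(ci);
        x.attributes.put("id", id);
        if (depth <= 0) return x;
        for (String cj : MC.getOrDefault(ci, List.of())) {
            List<RelObject> related = new ArrayList<>();
            for (String yId : fetchRelated(id, cj))
                related.add(build(yId, cj, depth - 1));
            x.members.put(cj, related);
        }
        return x;
    }

    public static void main(String[] args) {
        RelObject actor = build("Tom Hanks", "Actor", 2);   // analogous to Figure 2
        System.out.println("Built object of concept " + actor.concept + " with Depth = 2");
    }
}
```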
Figure 3: Example objects for Two Action Movie Stars (Depth = 1)
For two relational objects x1 and x2 of concept ci , we
define the similarity metric as follows:
$$ fs_{obj}(x_1, x_2) = \frac{1}{|MC(c_i)|} \sum_{c_j \in MC(c_i)} w_{ij} \cdot fs_{set}(x_1.c_j, x_2.c_j) \qquad (1) $$
where the weight $w_{ij}$ ($w_{ij} \le 1$ and $\sum_j w_{ij} = 1$) represents the importance of member concept $c_j$ in describing the concept $c_i$. In Equation 1, $fs_{set}(\cdot,\cdot)$ is defined as [1]:
$$ fs_{set}(x_1.c_j, x_2.c_j) = \begin{cases} \dfrac{1}{|x_1.c_j|} \displaystyle\sum_{y_k \in x_1.c_j} \max_{y_l \in x_2.c_j} fs(y_k, y_l), & \text{if } |x_1.c_j| \ge |x_2.c_j| > 0, \\[3mm] \dfrac{1}{|x_2.c_j|} \displaystyle\sum_{y_l \in x_2.c_j} \max_{y_k \in x_1.c_j} fs(y_k, y_l), & \text{if } |x_2.c_j| \ge |x_1.c_j| > 0, \\[3mm] 0, & \text{if } |x_1.c_j| = 0 \text{ or } |x_2.c_j| = 0. \end{cases} \qquad (2) $$
When MC(cj) ≠ ∅, the value of fs(yk, yl) in Equation 2 is recursively calculated by Equation 1. Hence, this similarity metric can explore the linkage structure of the relational objects. The recursion continues until MC(cj) = ∅ or the depth bound is reached, at which point the traditional propositional similarity metrics are used.
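A compact sketch of this recursive metric is given below, assuming uniform weights w_ij (as in the worked example that follows) and a toy propositional metric at the leaves; the real leaf metrics (Levenshtein distance, scaled numeric difference, the taxonomy metric of [5]) would take the place of fsLeaf. The structure mirrors Equations 1 and 2 but is illustrative only.

```java
import java.util.*;

// A minimal sketch of the recursive similarity metric of Equations 1 and 2,
// assuming the RelObject structure sketched above, uniform weights and a toy
// propositional metric at the leaves. Illustrative, not the authors' implementation.
public class RelationalSimilarity {

    public static class RelObject {
        final Map<String, Double> leafValues = new HashMap<>();         // propositional features
        final Map<String, List<RelObject>> members = new HashMap<>();   // member property attributes x.c_j
    }

    // Equation 1: average the member-concept set similarities (weights ignored).
    static double fsObj(RelObject x1, RelObject x2) {
        if (x1.members.isEmpty()) return fsLeaf(x1, x2);   // MC(c_j) = ∅ or depth bound: propositional metric
        double sum = 0.0;
        for (String cj : x1.members.keySet())
            sum += fsSet(x1.members.get(cj), x2.members.getOrDefault(cj, List.of()));
        return sum / x1.members.size();
    }

    // Equation 2: each object of the larger member set is matched to its most similar
    // counterpart in the other set; the matches are averaged over the larger set's size.
    static double fsSet(List<RelObject> a, List<RelObject> b) {
        if (a.isEmpty() || b.isEmpty()) return 0.0;
        List<RelObject> larger = a.size() >= b.size() ? a : b;
        List<RelObject> other = (larger == a) ? b : a;
        double sum = 0.0;
        for (RelObject yk : larger) {
            double best = 0.0;
            for (RelObject yl : other) best = Math.max(best, fsObj(yk, yl));  // recursive call
            sum += best;
        }
        return sum / larger.size();
    }

    // Hypothetical stand-in for the propositional leaf metrics (string, numeric, taxonomy).
    static double fsLeaf(RelObject x1, RelObject x2) {
        double v1 = x1.leafValues.getOrDefault("value", 0.0);
        double v2 = x2.leafValues.getOrDefault("value", 0.0);
        return 1.0 - Math.abs(v1 - v2);                     // assumes values scaled into [0, 1]
    }
}
```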
To demonstrate our recursive similarity metric, we give a
simplified example in the movie dataset, assuming concept
“Movie” only has member concepts {Title, Genre, YearOfRelease}. Figure 3 shows the relational objects for action
movie stars “Arnold Schwarzenegger” and “Sylvester Stallone”. We will compare them with the actor “Tom Hanks”
represented in Figure 2. In the scenario of propositional
clustering, some naive similarity metrics might be adopted:
if the actor’s name is regarded as an enumerated value or a
string, the true-or-false function or the Levenshtein distance
can be used. The former metric results in zero similarity
value between each pair of actors because they always have
different names. By using the latter, we have f s(oA2 , oA3 )
= f sstr (“Arnold Schwarzenegger”, “Sylvester Stallone”) =
0.095, f s(oA1 , oA3 ) = f sstr (“Tom Hanks”,“Sylvester Stallone”) = 0.111, which is still unreasonable. Another common
technique is to transform the relational information into a
high dimensional vector space, e.g. constructing a binary
vector to represent the movies that an actor has acted in
and then calculating the pairwise similarity between actors
as the cosine value of their movie vectors. As discussed in
Section 1, such transformation often produces sparse data
when the number of movies is large. Another disadvantage
is that some deeper semantic information, e.g. the movie
genre, is lost when calculating pairwise similarity between
actors.
In our framework, by setting Depth = 1 and utilizing the
relational similarity metric, we calculate the similarity value
between movies “Terminator 2: Judgment Day” and “Rocky”
as follows (the weights for all member concepts are ignored
here for simplicity):
$$ fs_{obj}(o_{M3}, o_{M5}) = \frac{1}{3}\Big[ fs_{str}(o_{M3}.Title, o_{M5}.Title) + fs_{taxonomy}(o_{M3}.Genre, o_{M5}.Genre) + fs_{num}(o_{M3}.Year, o_{M5}.Year) \Big] $$
$$ = \frac{1}{3}\Big[ 0.077 + \frac{1}{2} + \Big(1 - \frac{|1991 - 1976|}{max_{year} - min_{year}}\Big) \Big] = 0.359 $$
where the hierarchical taxonomy for concept “Genre” is shown
at the bottom of Figure 3 and the corresponding similarity
metric $fs_{taxonomy}(\cdot,\cdot)$ is defined in [5].
Similarly,
$$ fs_{obj}(o_{M4}, o_{M5}) = \frac{1}{3}\Big[ 0 + 1 + \Big(1 - \frac{|1994 - 1976|}{30}\Big) \Big] = 0.467 $$
$$ fs_{obj}(o_{M3}, o_{M6}) = \frac{1}{3}\Big[ 0.154 + \frac{1}{2} + \Big(1 - \frac{|1991 - 1982|}{30}\Big) \Big] = 0.451 $$
$$ fs_{obj}(o_{M4}, o_{M6}) = \frac{1}{3}\Big[ 0.182 + 1 + \Big(1 - \frac{|1994 - 1982|}{30}\Big) \Big] = 0.594 $$
and hence
$$ fs_{obj}(o_{A2}, o_{A3}) = \frac{1}{2}\Big[ fs_{str}(o_{A2}.Name, o_{A3}.Name) + fs_{set}(o_{A2}.Movie, o_{A3}.Movie) \Big] $$
$$ = \frac{1}{2}\Big[ 0.095 + \frac{1}{2} \sum_{y_l \in \{M5, M6\}} \max_{y_k \in \{M3, M4\}} fs(y_k, y_l) \Big] = \frac{1}{2}\Big[ 0.095 + \frac{1}{2}(0.467 + 0.594) \Big] = 0.313 $$
In the same way, we can get $fs_{obj}(o_{A1}, o_{A3}) = 0.171$,
which is less than f sobj (oA2 , oA3 ). Therefore, by incorporating the ontology of the dataset and applying the relational
similarity metric, the new results reflect more credible similarity values among these actors.
Theoretically, calculating the similarity value between two
objects should exhaustively explore their relational structures and consider their member objects at all levels, i.e.
setting Depth = ∞. This is infeasible and unnecessary in
practice. From Equations 1 and 2, we see that the similarity value between two member objects of concept cj will be
propagated into the similarity calculation of the upper-level concept $c_i$ with a decay factor $\delta_d(c_j) = \frac{w_{ij}}{|MC(c_i)|}$, where d means that concept $c_i$ is located at the d-th level of the root object's relational structure. The total decay factor for concept $c_j$ to impact the similarity calculation of two root objects is $\Delta(c_j) = \prod_d \delta_d$. In many applications, this factor will be
reduced very quickly as d increases, which means the impact
from member objects at the deeper levels of the relational
structure keeps decreasing. For instance, in the above simplified example, the total decay factors for concepts “Name”
and “YearOfRelease” to impact root concept “Actor” are as
follows (the weights for member concepts are again ignored
for simplicity):
$$ \Delta(\text{Name}) = \frac{1}{|MC(\text{Actor})|} = \frac{1}{2} $$
$$ \Delta(\text{Year}) = \frac{1}{|MC(\text{Actor})| \cdot |MC(\text{Movie})|} = \frac{1}{2 \times 3} = \frac{1}{6} $$
When applying the real ontology shown in Figure 1, the total decay factor $\Delta(\text{Year})$ is no greater than $\frac{1}{14}$ because $|MC(\text{Movie})| = 7$. Just like RDBC and FORC, we can set
the depth bound Depth to a moderate value instead of ∞,
so that enough information is retrieved from the relational
structure to guarantee the similarity calculation credible as
well as feasible.
3. DIVA FRAMEWORK
Clustering is a technique of data compression: data objects in the same cluster can be treated collectively because
they are more similar to each other than those from different clusters. From another point of view, the clustering
procedure preserves the significant similarity entries within
the dataset, by distributing pairs of highly similar objects
into the same cluster. Like DBSCAN [4] that explicitly controls the intra-cluster similarity by specifying the expected
data density, we use variance, a criterion of evaluating the
diameter of clusters, to meet the requirement. The formal
definition of variance will be given in Section 3.1. Roughly
speaking, greater variance means the derived clusters are
more compact and higher similarity values between objects
are preserved by the clustering procedure.
Based on the above discussion, our DIVA algorithm is
designed as follows: First, divide the whole dataset into a
number of clusters so that the variance of each cluster is
greater than a particular threshold value υ. Based on these
clusters, a hierarchical dendrogram is built using an agglomerative approach. Finally the appropriate level of the dendrogram that satisfies the variance requirement is selected to
construct the clustering result. Table 1 summarizes the main
framework of DIVA, consisting of two parts: a recursive divisive step to partition the dataset and an agglomerative
step to build the dendrogram based on the clusters. After
defining some fundamental concepts in Section 3.1, we provide more details for these two steps in Section 3.2 and 3.3.
The computational complexity is analyzed in Section 3.4.
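As a structural illustration only, the following sketch mirrors the three phases of this framework; the two step methods are placeholders whose bodies are sketched in Sections 3.2 and 3.3, and all names are our own choices rather than the authors' API.

```java
import java.util.List;

// A structural sketch of the three-phase framework summarized in Table 1.
// The two step methods are placeholders; candidate implementations are sketched
// in Sections 3.2 and 3.3.
public class DivaDriver {

    // Divisive step (Table 2): split D0 until every cluster's variance reaches υ.
    static List<List<String>> divisiveStep(List<String> d0, int r, double upsilon) {
        return List.of(d0);                                   // placeholder
    }

    // Agglomerative step (Table 3): merge the clusters bottom-up; here the dendrogram
    // is represented simply as its list of levels, from the leaves upwards.
    static List<List<List<String>>> agglomerativeStep(List<List<String>> clusters) {
        return List.of(clusters);                             // placeholder
    }

    public static void main(String[] args) {
        List<String> d0 = List.of("x1", "x2", "x3");          // toy dataset
        int r = 3;                                            // number of ROs per cluster
        double upsilon = 0.5;                                 // variance threshold υ
        List<List<String>> leaves = divisiveStep(d0, r, upsilon);
        List<List<List<String>>> levels = agglomerativeStep(leaves);
        List<List<String>> result = levels.get(0);            // select the appropriate level
        System.out.println(result.size() + " cluster(s) selected");
    }
}
```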
DIVA(dataset D0, number of ROs r, variance υ)
1. cluster set {Dk} ← call the Divisive-Step, given D0, r and υ as the parameters.
2. dendrogram T ← call the Agglomerative-Step, given {Dk} as the parameter.
3. select the appropriate level in T to construct the clustering result.

Table 1: Main Framework of DIVA

3.1 Fundamental Concepts
As presented in the related work (Section 5), traditional clustering approaches are usually categorized as partitional and hierarchical. The classical k-Medoids algorithm, as an example of partitional clustering, defines the medoid of a cluster as the data object that has the maximum average
similarity (or minimum average of distance) with the other
objects in the same cluster. This requires the calculation
of the similarity values between every pair of data objects
in the given cluster. On the other hand, the hierarchical
clustering is composed of two sub-categories: divisive and
agglomerative. In the former one, a common criterion to
decide whether or not a cluster should be divided is its diameter, which is determined by the distance value of two
data objects in the cluster that are the farthest away from
each other. Again we are faced with quadratic computational complexity. Similarly, the traditional agglomerative clustering paradigm also has quadratic complexity when searching for the two sub-clusters that are closest to each other in order to derive the super-cluster. Due to the complicated structure of relational data objects, relational clustering with quadratic computational complexity is impractical in many applications.
Is there any efficient way to delimit the shape of clusters
and hence accelerate the division and agglomeration procedures? We develop the concept Representative Object (RO)
to achieve this goal. The ROs are defined as a set of r
maximum-spread objects in the data space, given r ≥ 2.
More strictly, after a start object xs is randomly chosen, the
i-th RO is determined by the following formula:
$$ ro_i = \begin{cases} \arg\min_{x \in D} fs_{obj}(x, x_s), & \text{if } i = 1 \\[2mm] \arg\min_{x \in D} \max_{1 \le j < i} fs_{obj}(x, ro_j), & \text{if } 2 \le i \le r \end{cases} \qquad (3) $$
The reason we do not use xs as an RO is: xs will reside at
the center part of the data space with high probability, and
thus not satisfy the maximum-spread requirement for ROs.
Additionally, we can also reduce the impact of the randomly selected xs on the final clustering result. In Section 3.4 we analyze how the application of ROs reduces the total computational complexity of DIVA to linear in the size of the dataset.
We use $\{ro_i^{(D)}\}$ ($1 \le i \le r$) to denote the set of ROs for the dataset D. Because they are maximally spread from each other, the distance between the farthest pair of ROs approximates the diameter of D. More formally, the variance of the dataset D is defined as:
$$ \Upsilon(D) = \min_{1 \le i, j \le r} fs_{obj}(ro_i^{(D)}, ro_j^{(D)}) \qquad (4) $$
Hence, greater variance means data objects reside in a smaller
data space and thus are more similar to each other, which
means the data are of higher homogeneity.
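Equations 3 and 4 translate into code fairly directly. The sketch below (an illustration under our own naming, not the paper's implementation) keeps the object type and the similarity function abstract; the toy similarity in main is for demonstration only.

```java
import java.util.*;
import java.util.function.BiFunction;

// A minimal sketch of Representative Object selection (Equation 3) and cluster
// variance (Equation 4). The similarity function fs is injected, e.g. the relational
// metric of Section 2; main uses a toy similarity purely for illustration.
public class RepresentativeObjects<T> {
    final BiFunction<T, T, Double> fs;
    RepresentativeObjects(BiFunction<T, T, Double> fs) { this.fs = fs; }

    // Equation 3: the first RO is the object least similar to a random start object xs;
    // each further RO minimises its maximum similarity to the ROs already chosen.
    List<T> selectROs(List<T> cluster, int r, Random rnd) {
        T xs = cluster.get(rnd.nextInt(cluster.size()));
        List<T> ros = new ArrayList<>();
        ros.add(Collections.min(cluster, Comparator.comparingDouble((T x) -> fs.apply(x, xs))));
        while (ros.size() < Math.min(r, cluster.size())) {
            T best = null;
            double bestScore = Double.POSITIVE_INFINITY;
            for (T x : cluster) {
                if (ros.contains(x)) continue;
                double maxSim = 0.0;
                for (T ro : ros) maxSim = Math.max(maxSim, fs.apply(x, ro));
                if (maxSim < bestScore) { bestScore = maxSim; best = x; }
            }
            ros.add(best);
        }
        return ros;
    }

    // Equation 4: the variance of a cluster is the minimum pairwise similarity among its
    // ROs, i.e. the similarity of the two ROs that are farthest apart.
    double variance(List<T> ros) {
        double min = 1.0;
        for (int i = 0; i < ros.size(); i++)
            for (int j = i + 1; j < ros.size(); j++)
                min = Math.min(min, fs.apply(ros.get(i), ros.get(j)));
        return min;
    }

    public static void main(String[] args) {
        List<Double> data = List.of(0.1, 0.15, 0.5, 0.9);
        RepresentativeObjects<Double> sel =
                new RepresentativeObjects<>((a, b) -> 1.0 - Math.abs(a - b));  // toy similarity in [0,1]
        List<Double> ros = sel.selectROs(data, 3, new Random(7));
        System.out.println("ROs = " + ros + ", variance = " + sel.variance(ros));
    }
}
```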
Divisive-Step(dataset D0, number of ROs r, variance υ)

1. INITIALIZE the cluster list L by adding D0.

2. FOR EACH newly added cluster Dk in L:
   (a) generate the set of ROs {ro_i^(Dk)}:
       i. the start object x_s^(Dk) ← randomly select an object from Dk.
       ii. to determine ro_i^(Dk) (1 ≤ i ≤ r): ro_i^(Dk) ← select the object x from Dk that is farthest away from the start point x_s^(Dk) when i = 1, or that minimizes the accumulated similarity from itself to the already obtained ROs ro_j^(Dk) (1 ≤ j < i) when 2 ≤ i ≤ r, as described in Equation 3.
   (b) evaluate the variance of Dk by Equation 4. Without loss of generality, assume the pair of ROs in {ro_i^(Dk)} that are farthest away from each other are ro_1^(Dk) and ro_2^(Dk). Then Υ(Dk) = fs(ro_1^(Dk), ro_2^(Dk)).
   (c) if Υ(Dk) < υ, then:
       i. create two new clusters Dk' and Dk'', using ro_1^(Dk) and ro_2^(Dk) as the absorbent objects of Dk' and Dk'' respectively, where k' and k'' are unused index numbers in L.
       ii. allocate the remaining objects x ∈ Dk into either Dk' or Dk'' based on the comparison of fs(x, ro_1^(Dk)) and fs(x, ro_2^(Dk)).
       iii. add Dk' and Dk'' into L to replace Dk.

3. RETURN all the remaining clusters Dk in L.

Table 2: The Divisive Step
3.2 Divisive Step
The divisive step starts by assuming all the data objects
belong to the same cluster. Here we use D0 to denote the
whole dataset as well as the initial cluster it forms. Equation 3 is applied to find a set of ROs for D0 and thus its
variance Υ(D0) is determined by Equation 4. If Υ(D0) is less than the pre-specified variance threshold υ, the division
procedure is launched. Two ROs that are farthest away
from each other, i.e. the pair of ROs determining the diameter of D0 , are used as the absorbent objects of two
sub-clusters respectively. The other data objects are allocated to the appropriate sub-cluster based on their similarities to the absorbent objects. Finally, the original cluster
D0 is replaced by its derived sub-clusters in the cluster list
L. Since the similarity values of all the non-RO objects
to every RO object have been obtained when determining
$\{ro_i^{(D_0)}\}$ by Equation 3, the division operation for D0 can be
performed without extra effort of similarity calculation. If
either of the newly formed sub-clusters still fails to satisfy the required threshold υ, the above division process
is recursively performed. Finally we get a set of clusters $\{D_k\}$ ($\bigcup_k D_k = D_0$, $D_{k_1} \cap D_{k_2} = \emptyset$) whose variances are equal to or greater than υ, and which are used as the input of the agglomerative step. The divisive step is summarized in Table 2.
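Combining Equation 3, Equation 4 and Table 2, the divisive step could be sketched as follows. The sketch reuses the RepresentativeObjects helper shown in Section 3.1, so it is an illustration of the recursion rather than standalone or official code; the allocation of the remaining objects to the two absorbent ROs follows step 2(c) of Table 2.

```java
import java.util.*;
import java.util.function.BiFunction;

// A minimal sketch of the divisive step of Table 2, reusing the RepresentativeObjects
// helper sketched in Section 3.1. Any cluster whose variance falls below υ is split
// around its two least similar (farthest) ROs, which act as absorbent objects; the
// split recurses until every remaining cluster satisfies the variance threshold.
public class DivisiveStep<T> {
    final BiFunction<T, T, Double> fs;
    final RepresentativeObjects<T> roHelper;
    final int r;
    final double upsilon;
    final Random rnd = new Random(7);

    DivisiveStep(BiFunction<T, T, Double> fs, int r, double upsilon) {
        this.fs = fs;
        this.roHelper = new RepresentativeObjects<>(fs);
        this.r = r;
        this.upsilon = upsilon;
    }

    List<List<T>> run(List<T> d0) {
        List<List<T>> finished = new ArrayList<>();
        Deque<List<T>> pending = new ArrayDeque<>();
        pending.push(new ArrayList<>(d0));
        while (!pending.isEmpty()) {
            List<T> dk = pending.pop();
            List<T> ros = roHelper.selectROs(dk, r, rnd);
            if (dk.size() <= 1 || roHelper.variance(ros) >= upsilon) {
                finished.add(dk);                              // variance requirement met: keep Dk
                continue;
            }
            // find the pair of ROs with minimum similarity; they become the absorbent objects
            T ro1 = ros.get(0), ro2 = ros.get(1);
            double worst = Double.POSITIVE_INFINITY;
            for (int i = 0; i < ros.size(); i++)
                for (int j = i + 1; j < ros.size(); j++) {
                    double s = fs.apply(ros.get(i), ros.get(j));
                    if (s < worst) { worst = s; ro1 = ros.get(i); ro2 = ros.get(j); }
                }
            List<T> left = new ArrayList<>(), right = new ArrayList<>();
            for (T x : dk)                                     // allocate every object to the closer absorbent
                (fs.apply(x, ro1) >= fs.apply(x, ro2) ? left : right).add(x);
            if (left.isEmpty() || right.isEmpty()) { finished.add(dk); continue; }  // cannot split further
            pending.push(left);
            pending.push(right);
        }
        return finished;
    }
}
```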
3.3 Agglomerative Step
Like classical agglomerative clustering approaches, in this
step we will build a hierarchical dendrogram T in a bottom-up fashion. The cluster set {Dk} obtained from the divisive step constitutes the leaf nodes of the dendrogram. In each
iteration, the most similar pairwise clusters (sub nodes) are
merged to form a new super-cluster (parent node). Because
each cluster in the agglomerative step is related to a unique
node in the dendrogram, the words “cluster” and “node” are
used interchangeably in this section.
Various similarity metrics for agglomerating the nodes
within a dendrogram have been discussed in [2], among which
we adopt the complete-linkage similarity. Since each cluster
is represented by a set of ROs, the similarity between two
nodes tl, tl' ∈ T is defined as follows:
$$ fs_{node}(t_l, t_{l'}) = \min_{i,j} fs_{obj}(ro_i^{(t_l)}, ro_j^{(t_{l'})}) \qquad (5) $$
where $\{ro_i^{(t_l)}\}$ is the set of ROs contained in node tl and $\{ro_j^{(t_{l'})}\}$ is the set of ROs contained in node tl'. Without loss of generality, we assume that the super-node tp is formed from two sub-nodes tl and tl'; then the top-r maximum-spread ROs in $\{ro_i^{(t_l)}\} \cup \{ro_j^{(t_{l'})}\}$ are chosen as the ROs for tp.
The agglomerative step is summarized in Table 3. It is worth noting here that constructing the hierarchy in this step is not a reverse reproduction of the divisive step in Section 3.2. As shown in [16], the agglomeration can remedy the inaccurate partitioning generated by the divisive step.

Agglomerative-Step(cluster set {Dk})

1. INITIALIZE the dendrogram T. For each Dk, construct a leaf node tk in T.

2. REPEAT for K − 1 times, where K is the size of {Dk}:
   (a) for the nodes in T that have no parent node, their pairwise similarity values are evaluated by Equation 5. From all these nodes, we choose the pair with the highest similarity value, assuming they are tl and tl'.
   (b) generate a new node tp as the parent node for both tl and tl', which is equivalent to creating a new super-cluster Dp by merging Dl and Dl'. The top-r maximum-spread ROs in {ro_i^(tl)} ∪ {ro_j^(tl')} are chosen as the ROs for tp.
   (c) store tp into T.

3. RETURN T.

Table 3: The Agglomerative Step
After the dendrogram T is built, we need to determine the
appropriate level in T and use the corresponding nodes to
construct the clustering result. A common strategy is to select the level at which the variance of each node equals or exceeds υ. Alternatively, we can record the variance of the newly generated node at each level, find the largest gap between the variances of two neighbouring levels and use the lower level as the basis to construct clusters [2]. When the number of clusters is fixed, as in the experiments in Section 4, we select the level which contains exactly the required number of nodes to construct clusters.
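A sketch of the agglomerative step is given below. The dendrogram node type, the greedy thinning of the merged RO set and the recorded merge similarity (which can serve for the level selection described above) are our own illustrative choices; node similarity follows the complete-linkage form of Equation 5.

```java
import java.util.*;
import java.util.function.BiFunction;

// A minimal sketch of the agglomerative step of Table 3. Each node keeps its member
// objects and its ROs; node similarity is the complete-linkage value of Equation 5,
// and a merged node keeps the top-r maximum-spread ROs of its children.
public class AgglomerativeStep<T> {
    public static class Node<T> {
        List<T> members = new ArrayList<>();
        List<T> ros = new ArrayList<>();
        Node<T> leftChild, rightChild;          // null for leaf nodes
        double mergeSimilarity = 1.0;           // Equation 5 value at the merge, usable for level selection
    }

    final BiFunction<T, T, Double> fs;
    final int r;
    AgglomerativeStep(BiFunction<T, T, Double> fs, int r) { this.fs = fs; this.r = r; }

    // Equation 5: complete-linkage similarity between two nodes via their RO sets.
    double fsNode(Node<T> a, Node<T> b) {
        double min = Double.POSITIVE_INFINITY;
        for (T x : a.ros) for (T y : b.ros) min = Math.min(min, fs.apply(x, y));
        return min;
    }

    // Build the dendrogram bottom-up from the leaf nodes; returns the root.
    Node<T> run(List<Node<T>> leaves) {
        List<Node<T>> open = new ArrayList<>(leaves);          // nodes without a parent yet
        while (open.size() > 1) {
            int bi = 0, bj = 1;
            double best = Double.NEGATIVE_INFINITY;
            for (int i = 0; i < open.size(); i++)
                for (int j = i + 1; j < open.size(); j++) {
                    double s = fsNode(open.get(i), open.get(j));
                    if (s > best) { best = s; bi = i; bj = j; }
                }
            Node<T> l = open.get(bi), rgt = open.get(bj);
            Node<T> p = new Node<>();
            p.leftChild = l;
            p.rightChild = rgt;
            p.members.addAll(l.members);
            p.members.addAll(rgt.members);
            p.ros = topSpreadROs(l.ros, rgt.ros);
            p.mergeSimilarity = best;
            open.remove(bj);                                   // remove the higher index first
            open.remove(bi);
            open.add(p);
        }
        return open.get(0);
    }

    // Greedily keep r maximum-spread ROs from the union of the children's RO sets.
    List<T> topSpreadROs(List<T> a, List<T> b) {
        List<T> pool = new ArrayList<>(a);
        pool.addAll(b);
        List<T> kept = new ArrayList<>();
        kept.add(pool.remove(0));
        while (kept.size() < r && !pool.isEmpty()) {
            T best = null;
            double bestScore = Double.POSITIVE_INFINITY;
            for (T x : pool) {
                double m = 0.0;
                for (T k : kept) m = Math.max(m, fs.apply(x, k));
                if (m < bestScore) { bestScore = m; best = x; }
            }
            kept.add(best);
            pool.remove(best);
        }
        return kept;
    }
}
```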
3.4 Complexity Analysis
In this section, we will briefly analyze the computational
complexity for each step in our DIVA algorithm, given that the whole dataset D0 has size N, the number of iterations in the divisive step is R, and the size of {Dk} is K:
• Divisive Step:
– The random selection of the start object xs has
complexity O(1). Then we will scan the whole
dataset D once (with N −1 similarity comparison)
to pick out the farthest data object from xs as
the first RO ro1 . Similarly, in order to determine
roi (2 ≤ i ≤ r), we only need to scan the whole
dataset once and compare all the non-RO objects
with roi−1 , because the other similarity values required by Equation 3 have been obtained when
determining $ro_j$ ($1 \le j \le i-2$). Overall, there are $(N-1) + \sum_{i=2}^{r}(N-i+1) = rN - \frac{r(r-1)}{2} - 1$ similarity comparisons, so the computational complexity is $O(r \cdot N)$.
– If the dataset D has to be divided, we construct
two sub-clusters and appoint the farthest pair of
ROs, assuming ro1 and ro2 as before, as the absorbent objects respectively. Then, all the non-RO objects can be allocated to the nearest cluster
without extra calculation because their similarity
values to ro1 or ro2 have been obtained in the
procedure of determining {roi }. Hence, the operation for dividing and redistributing dataset D
has complexity O(1).
– If any cluster’s variance is lower than the specified variance threshold υ, a recursive division procedure will be launched on this cluster until the
variances of all derived sub-clusters satisfy the requirement. Without loss of generality, we assume
cluster Di of size Ni in the final clustering result
is generated from the original dataset D0 by at
most R iterations, then all the data objects in
Di need to be compared with R · r ROs during
the recursive division. Hence, the total computational complexity for the recursive division is $O(\sum_i R \cdot r \cdot N_i)$, i.e. $O(R \cdot r \cdot N)$.
• Agglomerative Step: Like the classical agglomerative
algorithm, the computational complexity of building
the taxonomy is $O(r^2 \cdot K^2)$.
Therefore, the total computational complexity of the DIVA algorithm is $O(rN + R \cdot rN + r^2 K^2)$.
We must point out that both R and K above are controlled
by the variance threshold υ: higher υ leads to more recursive
divisions and thus generates more clusters. When υ → 1, the
recursive division will generate many tiny clusters, each of which will only contain the RO itself. In this extreme case, our DIVA algorithm will behave like the complete agglomerative approach RDBC with quadratic complexity. Nevertheless, by choosing moderate values for r and υ to keep rK ≪ N, the computational complexity of our DIVA algorithm will be linear in the size of the dataset. Usually r will be set to a fairly small value, such as 3 or 4. Like BIRCH
[16], we can gradually increase the value of υ to improve the
homogeneity of the generated clusters, until their quality
meets our requirement.
4. EXPERIMENTAL RESULT
In order to evaluate the effectiveness and efficiency of our
DIVA algorithm for clustering multi-relational datasets, we
compare it with the following approaches: (1) ReCoM [13],
which uses relationships among data objects to improve the
cluster quality of interrelated data objects through an iterative reinforcement clustering process. Because there is no
prior knowledge about the authoritativeness in the datasets,
we treat all data objects as equally important. Additionally,
k-Medoids is incorporated as the meta clustering approach
in ReCoM. (2) FORC [9], which is the natural extension of
k-Medoids in the field of relational clustering. (3) LinkClus
[15], which uses a new hierarchical structure, SimTree, to
represent the similarity values between pairwise data objects
and facilitate the iterative clustering process. All the experiments were carried out on a workstation with a 2.8GHz
P4 CPU, 1GB memory and the RedHat operating system. All approaches are implemented in Java.
The experiments are conducted on two relational datasets.
The first one is a synthetic dataset that simulates the users’
browsing products on the website www.amazon.com. The
second is the real movie dataset mentioned in Section 1.
To evaluate the accuracy of the clustering result, we use
the Related Minimum Variance Criterion [2] to measure the
similarity between pairs of objects in the same cluster:
$$ S_{intra} = \frac{1}{K} \sum_k s_k \qquad (6) $$
where $n_k$ is the size of cluster $D_k$ and
$$ s_k = \frac{1}{n_k^2} \sum_{x \in D_k} \sum_{x' \in D_k} fs_{obj}(x, x') $$
When the class labels of data objects are available, we can
also use an entropy-based measure [13] to evaluate the clustering result. The measure reflects the uniformity or purity
of a cluster. Formally, given a cluster Dk and category labels
of data objects in it, the entropy of cluster Dk is:
$$ H(D_k) = -\sum_h P_h \log_2 P_h $$
where $P_h$ is the proportion of data objects of class h in the cluster. The total entropy is defined as:
$$ H = \sum_{D_k} H(D_k) \qquad (7) $$
Generally, larger intra-cluster similarity and smaller entropy
values indicate higher accuracy of clustering result.
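For reference, the two criteria can be computed as in the following sketch, which takes the clusters, a similarity function and (for the entropy) a class-label map as inputs; it is a straightforward transcription of Equations 6 and 7, not code from the evaluated systems.

```java
import java.util.*;
import java.util.function.BiFunction;

// A minimal sketch of the two evaluation criteria: intra-cluster similarity (Equation 6)
// and total entropy (Equation 7). Clusters are lists of objects; class labels are
// supplied by the caller.
public class ClusterEvaluation {

    // Equation 6: S_intra = (1/K) * sum_k s_k, with s_k = (1/n_k^2) * sum_{x,x' in D_k} fs(x, x').
    static <T> double intraClusterSimilarity(List<List<T>> clusters, BiFunction<T, T, Double> fs) {
        double total = 0.0;
        for (List<T> dk : clusters) {
            double sk = 0.0;
            for (T x : dk) for (T y : dk) sk += fs.apply(x, y);
            total += sk / ((double) dk.size() * dk.size());
        }
        return total / clusters.size();
    }

    // Equation 7: H = sum_k H(D_k), with H(D_k) = -sum_h P_h log2 P_h.
    static <T> double totalEntropy(List<List<T>> clusters, Map<T, String> classLabel) {
        double total = 0.0;
        for (List<T> dk : clusters) {
            Map<String, Integer> counts = new HashMap<>();
            for (T x : dk) counts.merge(classLabel.get(x), 1, Integer::sum);
            double h = 0.0;
            for (int c : counts.values()) {
                double p = (double) c / dk.size();
                h -= p * Math.log(p) / Math.log(2);
            }
            total += h;
        }
        return total;
    }
}
```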
In the following experiments, we calculate the above criteria on a fixed number of clusters. For ReCoM and FORC, the number of clusters is a pre-specified input parameter. For LinkClus, we use the method mentioned in [15] to obtain the fixed number of clusters, i.e. first find the level at which the number of nodes is closest to the pre-specified number, then merge nodes to satisfy the requirement. For DIVA, the algorithm terminates as soon as the number of merged clusters reaches the pre-specified requirement during the agglomerative step.

4.1 Synthetic Dataset
In this section, we test each clustering approach on a synthetic dataset. The dataset is generated by the following steps, as in [13]:

1. The product taxonomy of the online shop Amazon is retrieved from its website www.amazon.com. It contains 11 first-level categories, 40 second-level categories and 409 third-level categories in the taxonomy. We generate 10,000 virtual products and randomly assign them to the third-level categories. The category information is the only content feature defined for these products.

2. We randomly generate 2,000 users and uniformly distribute them into 100 groups. For each group, we construct a probability distribution to simulate the users' preferences over the third-level categories (obtained in Step 1), which defines the likelihood that a user in that group browses a product in a certain category. Each group of users has 4 interest categories: one category representing the major interest of that group is assigned the probability 0.5, two categories of intermediate interest are assigned the probability 0.2, and one category of minor interest is assigned the probability 0.1.

3. Each user's browsing action is generated according to the information of users, groups, categories and products: (i) randomly select a user and get his group; (ii) based on the probabilities of group interests, select an interest and get the related product category; (iii) randomly select a product that belongs to this category; (iv) create a browsing action between the user and the product obtained in steps (i) and (iii). In total 100,000 browsing actions were created.

4. In order to test the robustness of the different clustering approaches, we also generate some noise data: for each user, (i) uniformly choose four noise interests; (ii) randomly select a product that belongs to one of the four noise interests of the current user; (iii) create a noise browsing action between the user and the product. We will examine how these clustering approaches perform in the case of different noise ratios.
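A simplified sketch of this generator is shown below. The counts follow the text, but the taxonomy retrieval and the exact assignment of group interests are simplified (e.g. interest categories are drawn at random), so it illustrates the sampling scheme rather than reproducing [13] faithfully.

```java
import java.util.*;

// A simplified sketch of the synthetic data generator described in steps 1-4 above.
// Product, user and group counts follow the text; category retrieval and group-interest
// assignment are simplified, so this is illustrative only.
public class SyntheticDataGenerator {
    public static void main(String[] args) {
        Random rnd = new Random(42);
        int numCategories = 409, numProducts = 10000, numUsers = 2000, numGroups = 100;

        // Step 1: assign each virtual product to a random third-level category.
        int[] productCategory = new int[numProducts];
        for (int p = 0; p < numProducts; p++) productCategory[p] = rnd.nextInt(numCategories);
        Map<Integer, List<Integer>> productsOfCategory = new HashMap<>();
        for (int p = 0; p < numProducts; p++)
            productsOfCategory.computeIfAbsent(productCategory[p], k -> new ArrayList<>()).add(p);

        // Step 2: users in 100 groups; each group gets 4 interest categories with
        // probabilities 0.5, 0.2, 0.2 and 0.1.
        int[] userGroup = new int[numUsers];
        for (int u = 0; u < numUsers; u++) userGroup[u] = u % numGroups;   // uniform distribution
        int[][] groupInterests = new int[numGroups][4];
        double[] interestProb = {0.5, 0.2, 0.2, 0.1};
        for (int g = 0; g < numGroups; g++)
            for (int i = 0; i < 4; i++) groupInterests[g][i] = rnd.nextInt(numCategories);

        // Step 3: 100,000 browsing actions sampled via user -> group -> interest -> product.
        List<int[]> actions = new ArrayList<>();
        while (actions.size() < 100000) {
            int u = rnd.nextInt(numUsers);
            int c = groupInterests[userGroup[u]][sampleIndex(interestProb, rnd)];
            List<Integer> candidates = productsOfCategory.get(c);
            if (candidates == null) continue;                  // category with no product: resample
            actions.add(new int[]{u, candidates.get(rnd.nextInt(candidates.size()))});
        }
        System.out.println(actions.size() + " browsing actions generated");
        // Step 4 (noise actions from four extra interests per user) would follow the same pattern.
    }

    static int sampleIndex(double[] probs, Random rnd) {
        double v = rnd.nextDouble(), cum = 0.0;
        for (int i = 0; i < probs.length; i++) { cum += probs[i]; if (v < cum) return i; }
        return probs.length - 1;
    }
}
```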
Figure 4: Ontology of the Synthetic Dataset

Figure 4 shows the ontology of the synthetic dataset. Concepts "User" and "Product" are interrelated by the link browse action, and concept "Product" has property "Category" as its content feature. When creating objects for users or products, we set the depth bound Depth = 1 for FORC/DIVA and Depth = 0 for LinkClus/ReCoM, because the latter two apply the reinforcement clustering manner. For all approaches, the numbers of clusters to be generated within datasets "User" and "Product" were specified as 100 by default. LinkClus was executed with a series of c values and the best clustering result was used to compare the performance of LinkClus with that of the other approaches. Since FORC, LinkClus and ReCoM launch an iterative procedure of clustering data objects until convergence, we set the maximum number of iterations to 10 because these algorithms converge very quickly in most cases.

Figure 5: Synthetic Dataset - Clustering users with different υ. (a) Intra-cluster similarity; (b) Entropy

Figure 6: Synthetic Dataset - Clustering products with different υ. (a) Intra-cluster similarity; (b) Entropy
Variance υ is the most important parameter for the DIVA
algorithm, so we first test its impact. Figures 5 and 6 show the evaluation results of clustering users and products respectively, ranging υ from 0.3 to 0.6 and fixing r = 3. In general, the quality of the clustering result generated by DIVA improves as υ increases. When υ ≥ 0.4, DIVA outperforms all the other algorithms, especially when evaluated by the entropy criterion, which means DIVA is more capable of discovering the inherent categories of products as well as the hidden groups of users. Furthermore, we found that the accuracy of LinkClus is far worse than those of the other approaches. The reason is that LinkClus builds the initial SimTrees by applying Frequent Pattern Mining, which only exploits the link structure of the relational dataset. The content features of the data objects, for example the category information for products, are completely ignored in the clustering procedure. Due to such information loss, LinkClus cannot generate clusters of high quality. As a result, we do not test LinkClus in the remaining experiments.
Figure 7 shows that FORC is always the most time-consuming algorithm. This result is not surprising since its computational complexity is O(N²). On the other hand, the time spent by ReCoM, LinkClus and DIVA is comparable. When clustering products, DIVA even outperforms ReCoM.
The reason is: the average number of users related to each
product is less than that of products related to each user,
so the most expensive operation in similarity calculation,
Equation 2, is used less for clustering products than that for
users.

Figure 7: Synthetic Dataset - Time spent vs υ. (a) Users; (b) Products

Generally, as υ increases, DIVA will spend more time
to generate smaller clusters in the divisive step and combine them again in the agglomerative step. When υ > 0.6
for this dataset, time spent by DIVA sharply increases because many single-object clusters are generated. As we have
discussed in Section 3.4, such over-estimation downgrades
DIVA into RDBC with quadratic complexity.
Next we examine the parameter r, the number of ROs for each cluster. r is varied from 2 to 17 in steps of √2. We only
compare two fixed values of υ here for the reason of clarity,
but the conclusion is also valid for other values. Curves
“User-04” and “User-05” in Figure 8 are for clustering users
with variance 0.4 and 0.5 respectively, and curves "Product-04" and "Product-06" for clustering products with variance
0.4 and 0.6 respectively. We can see that the running time
grows very quickly while the accuracy does not change a
lot. Therefore, a fairly small value of r, such as 3 or 4, is
enough for providing high accuracy as well as keeping short
processing time.
Figure 10 illustrates the robustness of all approaches under different noise ratios of browsing actions, ranging from
20% to 100%. The parameters for DIVA are set as: υ = 0.5
and r = 3. Generally, the accuracy of all approaches is reduced as the noise ratio increases. When evaluated by the criterion of intra-cluster similarity, ReCoM is slightly better than DIVA, and FORC is the worst of the three. Yet the entropy-based criterion might be preferable here, because the intra-cluster similarity is calculated based not only on the informative browsing actions but also on the noise ones, while the entropy is calculated only based on the class labels of the data objects. Evaluated by the latter, DIVA exceeds ReCoM and FORC when the noise ratio is below 80%, and their performances are very close when the noise ratio is above 80%.
Figure 8: Synthetic Dataset - Clustering users and products by DIVA with different r. (a) Intra-cluster similarity; (b) Entropy; (c) Time spent (sec)

Figure 9: Synthetic Dataset - Standard deviation of the cluster sizes

Figure 10: Synthetic Dataset - Accuracy of clustering users with different noise ratios. (a) Intra-cluster similarity; (b) Entropy
In many applications very small clusters (in the extreme
case, the singleton cluster that only contains one data object)
are meaningless, so it is necessary to investigate the structure of the derived clusters. Since the cluster number in our
experiments has been fixed to 100, resulting in the same average size of the derived clusters for all approaches, we consider the standard deviation of the clusters’ sizes here. The
result is shown in Figure 9. Roughly speaking, the standard
deviation of the clusters’ sizes generated by DIVA decreases
as the variance increases, meaning that our approach does
not tend to generate singleton clusters.
4.2 Real Dataset
The clustering approaches were also evaluated on a real-world dataset, a movie knowledge base defined by the ontology in Figure 1. After data pre-processing, there are 62,955
movies, 40,826 actors and 9,189 directors. The dataset also
includes a genre taxonomy of 186 genres. Additionally, we
have 542,738 browsing records included in 15,741 sessions
from 10,151 users. The number of sessions made by different users ranges from 1 to 814.
The evaluation result based on the intra-cluster similarity
is shown in Figure 11(a), in which DIVA performs better
than ReCoM and FORC. The entropy-based criterion defined by Equation 7 can not be applied, because there is
no pre-specified or manually-labelled class information for
movies in the dataset. We have to utilize the visit information from users to evaluate the clustering results indirectly.
The traditional collaborative filtering algorithm constructs
a user’s profile based on all items he/she has browsed across
sessions, then the profile of active user ua is compared with
those of other users to form ua ’s neighbourhood and the
items visited by the neighbourhoods but not by ua are returned as the recommendation [10].

Figure 11: Real Dataset - Clustering movies with different υ. (a) Intra-cluster similarity; (b) User-based coefficient; (c) Session-based coefficient

Hence, two items are
“labelled” into the same category if they are co-browsed by
the same user, which reflects the partitioning of the dataset
from the viewpoint of users. Accordingly, we can construct
the evaluation criterion as in [15]: two objects are said to
be correctly clustered if they are co-browsed by at least one
common user. The accuracy of clustering is defined as a
variant of the Jaccard Coefficient: the number of object pairs that are correctly clustered over all possible object pairs in the same clusters. Another criterion is similar but of finer granularity, based on the assumption that a user seldom shifts his/her interest within a session, so two objects are said to be correctly clustered if they are included in at least one common session. Therefore we have two new criteria for evaluating the accuracy of clustering: the user-based and the session-based coefficient. As discussed in [15], higher accuracy tends to be achieved when the number of clusters increases, so we generate 100 clusters for all approaches.
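The user-based coefficient can be computed as in the sketch below, where browsersOf maps each movie to the set of users who browsed it; substituting session identifiers for user identifiers yields the session-based variant. The data in main are invented purely to show the call.

```java
import java.util.*;

// A minimal sketch of the user-based coefficient described above: a pair of movies in
// the same cluster counts as correctly clustered if at least one user browsed both,
// and the coefficient is the fraction of same-cluster pairs that are correct.
public class CoBrowseCoefficient {
    static double coefficient(List<List<String>> clusters, Map<String, Set<String>> browsersOf) {
        long correct = 0, total = 0;
        for (List<String> cluster : clusters)
            for (int i = 0; i < cluster.size(); i++)
                for (int j = i + 1; j < cluster.size(); j++) {
                    total++;
                    Set<String> a = browsersOf.getOrDefault(cluster.get(i), Set.of());
                    Set<String> b = browsersOf.getOrDefault(cluster.get(j), Set.of());
                    if (!Collections.disjoint(a, b)) correct++;   // at least one common user
                }
        return total == 0 ? 0.0 : (double) correct / total;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> browsersOf = Map.of(
                "movieA", Set.of("u1", "u2"),
                "movieB", Set.of("u2"),
                "movieC", Set.of("u3"));
        List<List<String>> clusters = List.of(List.of("movieA", "movieB", "movieC"));
        System.out.println(coefficient(clusters, browsersOf));   // 1 of 3 pairs is co-browsed
    }
}
```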
Figures 11(b) and 11(c) show the evaluation results based on the above two criteria respectively. Since the user browsing data are very sparse (a common phenomenon in recommender systems), no clustering approach can achieve a very high coefficient value. Despite that, DIVA still outperforms both ReCoM and FORC when υ > 0.3, indicating that the clusters generated by DIVA are more consistent with the user browsing patterns, i.e. the partitioning of the dataset derived by DIVA is more acceptable to users. This suggests that DIVA will be more useful than ReCoM or FORC when applied in a recommender system.
5. RELATED WORK
As a widely-applied technique for data analysis and knowledge discovery, clustering tries to separate a dataset into a
number of finite and discrete subsets so that the data objects
in each subset share some common trait [7]. Roughly speaking, traditional clustering methods can be divided into two
categories: hierarchical and partitional. Hierarchical clustering algorithms recursively agglomerate or divide existing
clusters in a bottom-up or top-down manner respectively, so
the data objects are organized within a hierarchical structure. On the contrary, partitional algorithms construct a
fixed number of clusters from the beginning, distribute each
data object into its nearest cluster and update the cluster’s
mean or medoid iteratively.
In recent years, many innovative ideas have been proposed
to address various issues of traditional clustering algorithms
[6, 14]. For example, BIRCH [16] utilizes the concept clustering feature (CF) to efficiently summarize the statistical
characteristics of a cluster and distribute the data objects
in the Euclidean space. Because the CF vectors are additive, they can be updated easily when a cluster absorbs a
new data object or two sub-clusters are merged. By scanning the dataset, BIRCH incrementally builds a CF-tree to
preserve the inherent clustering structure of the data objects that have been scanned. Finally, in order to remedy
the problems of skewed input order or undesirable splitting,
BIRCH applies a traditional agglomerative clustering algorithm to improve the CF-tree. BIRCH is very efficient because of its linear computational complexity, but it is not
applicable for non-Euclidean datasets, such as the relational
data object shown in Figure 2. DBSCAN [4], a density-based clustering algorithm, is proposed to find clusters of sophisticated shapes; it requires two parameters to define the minimum density of clusters: Eps (the radius of the neighbourhood of a point) and MinPts (the minimum number of points in the neighbourhood). Clusters are dynamically created from an arbitrary point and then all points in its neighbourhood are absorbed. Like the dilemma of pre-specifying the parameter k in k-Means/k-Medoids, the parameters Eps and MinPts in DBSCAN are difficult to determine. Another problem is that DBSCAN needs an R*-tree to improve the efficiency of region queries. Since such a structure is not available in a relational dataset, the computational complexity of DBSCAN will degrade to O(N²).
In contrast to traditional clustering algorithms that only
exploit flat datasets and search for an optimal partitioning in the Euclidean space, relational clustering algorithms
try to incorporate the relationships between data objects as
well. Neville et al. [12] provided some preliminary work
of combining traditional clustering and graph partitioning
techniques to solve the problem. They used a very simple similarity metric, called matching coefficient, to weight
the relations between data. Kirsten and Wrobel developed
RDBC [8] and FORC [9] as the first-order extensions of classical hierarchical agglomerative and k-partitional clustering
algorithms respectively. Both of them adopt the distance
measure RIBL2 to calculate the dissimilarity between data
objects, which recursively compares the sub-components of
the first-order objects until they can finally fall back on
propositional comparisons on elementary features. Therefore, RDBC and FORC inherit the same disadvantage as their propositional antecedents: when the relational datasets are
very large, such algorithms will be infeasible due to the
quadratic computational complexity. Inspired by the idea of mutual reinforcement, Wang et al. [13] proposed a
general framework for clustering multi-type relational data
objects: Initially, clustering is performed separately for each
dataset based on the content information. The derived cluster structure formulates a reduced feature space of the current dataset and is then propagated along the linkage structure to impact the re-clustering procedure of the related
datasets. More specifically, the relationships between data
objects of different types are transformed into the linkage
feature vectors of these objects, which can be easily handled
by the traditional clustering algorithms just like the con-
tent feature vectors. The re-clustering procedure continues
in an iterative manner for all types of datasets until the result converges. Additional improvement would be achieved
by assigning different importance among data objects in the
clustering procedure according to their hub and authoritative values. Long et al. presented another general framework for multi-type relational clustering in [11]. Based on
the assumption that the hidden structure of a data matrix
can be explored by its factorization, the multi-type relational clustering is converted into an optimization problem:
approximate the multiple relation matrices and the feature
matrices by their corresponding collective factorization. Under this model a spectral clustering algorithm for multi-type
relational data is derived, which updates one intermediate
cluster indicator matrix as a number of leading eigenvectors
at each iterative step until the result converges. Finally the
intermediate matrices have to be post-processed to extract
the meaningful cluster structure. This spectral clustering
framework is theoretically sound, but it is not applicable
for the semantic information, such as “Genre” for concept
“Movie” in Figure 1.
According to the definition, the clustering approaches try
to partition a dataset so that data objects in the same clusters are more similar to each other while objects in different
clusters are less similar. Since data objects in the same cluster can be treated collectively, the clustering result can be
considered as a technique of data compression [6]. Therefore, we can use the clustering model to approximate the
similarity values between pairs of objects. Yin et al. propose a hierarchical data structure SimTree to represent the
similarities between objects, and use LinkClus to improve
the SimTree in an iterative way [15]. However, since they
only utilized frequent pattern mining techniques to build
the initial SimTrees and use path-based similarity in one
SimTree to adjust the similarity values and structures of
other SimTrees, the information contained in the property attributes of data objects is not used in their clustering framework, which degrades the accuracy of the
final clustering result.
6. CONCLUSIONS
In this paper we propose a new variance-based clustering
approach for multi-type relational datasets. The variance is
a criterion to control the compactness of the derived clusters
and thus preserve the most significant similarity entries of
the object pairs. Our approach combines the advantages of
divisive and agglomerative paradigms to improve the quality of clustering results. By incorporating the idea of Representative Object, our approach has linear time complexity.
Since the multi-type relational information is considered in
the procedure of constructing data instances, the problem
of skewness propagation within reinforcement clustering approaches can be avoided. Experimental results show our algorithm outperforms some well-known relational clustering
approaches in accuracy, efficiency and robustness.
7. REFERENCES
[1] S. S. Anand, P. Kearney, and M. Shapcott. Generating
semantically enriched user profiles for web
personalization. ACM Transactions on Internet
Technologies, 7(3), August 2007.
[2] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern
Classification (2nd Edition). Wiley-Interscience
Publication, 2001.
[3] S. Dzeroski and N. Lavrac. Relational Data Mining.
Springer, 2001.
[4] M. Ester, H. P. Kriegel, J. Sander, and X. Xu. A
density-based algorithm for discovering clusters in
large spatial databases with noise. In Proceedings of
2nd International Conference on Knowledge Discovery
and Data Mining, 1996.
[5] P. Ganesan, H. Garcia-Molina, and J. Widom.
Exploiting hierarchical domain structure to compute
similarity. ACM Transactions on Information Systems
(TOIS), 21(1):64–93, 2003.
[6] J. Han and M. Kamber. Data Mining: Concepts and
Techniques (2nd Edition). Morgan Kaufmann, 2006.
[7] A. K. Jain, M. N. Murty, and P. J. Flynn. Data
clustering: a review. ACM Computing Surveys,
31(3):264–323, 1999.
[8] M. Kirsten and S. Wrobel. Relational distance-based
clustering. In Proceedings of Fachgruppentreffen
Maschinelles Lernen (FGML-98), pages 119 – 124,
10587 Berlin, 1998. Techn. Univ. Berlin, Technischer
Bericht 98/11.
[9] M. Kirsten and S. Wrobel. Extending k-means
clustering to first-order representations. In ILP ’00:
Proceedings of the 10th International Conference on
Inductive Logic Programming, pages 112–129, London,
UK, 2000. Springer-Verlag.
[10] J. A. Konstan, B. N. Miller, D. Maltz, J. L. Herlocker,
L. R. Gordon, and J. Riedl. GroupLens: Applying
collaborative filtering to Usenet news. Communications
of the ACM, 40(3):77–87, 1997.
[11] B. Long, Z. M. Zhang, X. Wu, and P. S. Yu. Spectral
clustering for multi-type relational data. In
Proceedings of the 23rd international conference on
Machine learning (ICML’06), pages 585–592, New
York, NY, USA, 2006. ACM Press.
[12] J. Neville, M. Adler, and D. Jensen. Clustering
relational data using attribute and link information.
In Proceedings of the Text Mining and Link Analysis
Workshop, 18th International Joint Conference on
Artificial Intelligence, 2003.
[13] J. Wang, H. Zeng, Z. Chen, H. Lu, T. Li, and W.-Y.
Ma. ReCoM: Reinforcement clustering of multi-type
interrelated data objects. In Proceedings of the 26th
ACM SIGIR conference on Research and development
in information retrieval (SIGIR'03), pages 274–281,
New York, NY, USA, 2003. ACM Press.
[14] R. Xu and D. Wunsch. Survey of clustering
algorithms. IEEE Transactions on Neural Networks,
16(3):645–678, May 2005.
[15] X. Yin, J. Han, and P. S. Yu. Linkclus: efficient
clustering via heterogeneous semantic links. In
Proceedings of the 32nd international conference on
Very large data bases (VLDB’06), pages 427–438.
VLDB Endowment, 2006.
[16] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH:
an efficient data clustering method for very large
databases. In Proceedings of 1996 ACM SIGMOD
International Conference on Management of Data,
pages 103–114, Montreal, Canada, 1996.