DIVA: A Variance-based Clustering Approach for Multi-type Relational Data

Tao Li, Sarabjot S. Anand
Department of Computer Science, University of Warwick, Coventry, United Kingdom
{li.tao, s.s.anand}@warwick.ac.uk

ABSTRACT
Clustering is a common technique used to extract knowledge from a dataset in unsupervised learning. In contrast to classical propositional approaches that only focus on simple and flat datasets, relational clustering can handle multi-type interrelated data objects directly and exploit the semantic information hidden in the linkage structure to improve the clustering result. However, exploring linkage information greatly reduces the scalability of relational clustering. Moreover, some characteristics of the vector data space that are used to accelerate propositional clustering are no longer valid in relational data space. These two disadvantages prevent relational clustering techniques from being applied to very large datasets or in time-critical tasks, such as online recommender systems. In this paper we propose a new variance-based clustering algorithm to address the above difficulties. Our algorithm combines the advantages of the divisive and agglomerative clustering paradigms to improve the quality of the cluster results. By adopting the idea of the Representative Object, it can be executed with linear time complexity. Experimental results show our algorithm achieves high accuracy, efficiency and robustness in comparison with some well-known relational clustering approaches.

Categories and Subject Descriptors
I.5.3 [Pattern Recognition]: Clustering—Algorithms

General Terms
Algorithms, Performance

Keywords
Clustering, Multi-type, Relational

1. INTRODUCTION
Data mining aims to learn knowledge from a dataset. In unsupervised learning, clustering is a common technique to partition the dataset into a certain number of groups (clusters) with maximum intra-cluster similarity and minimum inter-cluster similarity. Classical clustering approaches only focus on simple and flat data, i.e. all data objects are of the same type and conventionally described by a list of numeric attribute values. The former condition assumes all data are stored in a single table, and the latter makes it possible to represent data as points in a multi-dimensional vector space. Hence, many mathematical methods, e.g. accumulation or transformation, can be utilized to simplify the clustering process greatly. Unfortunately, the above assumptions do not hold in many practical cases: firstly, data might have various types of attributes: binary, categorical, string or taxonomy-based; secondly, data objects are usually stored in several tables and their pairwise relationships are specified by the semantic links between tables. Those semantic links together with multi-type attributes compose a far more complex feature space for relational datasets than the Euclidean space, so both of them should be considered during the clustering process.
Although classical clustering algorithms are still applicable in the relational case by combining the multiple tables into a single one with join or aggregation operations, this is not a good choice for the following reasons [11]:

• The transformation of relational linkage information into a unified feature space causes information loss, or generates very high dimensional and sparse data, which inevitably degrades the performance of clustering algorithms.

• Besides the clusters within data objects of each type, the global hidden patterns involving multi-type objects might also be important, and these cannot be recognized by classical clustering approaches.

For example, the ontology of a relational movie dataset is shown in Figure 1. Different concepts in the ontology have different data types, e.g. "Title", "YearOfRelease", "Certificate" and "Genre" are string, numeric, categorical and taxonomy-based, respectively. An arrow in the figure indicates that an object of the source concept includes one or more objects of the target concept(s) as its member property. A bi-directional arrow means there exist recursive references between two concepts. Hence, a data object representing an actor will contain references to the movies he acted in, which in turn refer to other actors and directors he cooperated with in those movies. Figure 2 shows part of the object for Tom Hanks.

Figure 1: Ontology of a movie dataset
Figure 2: Example object: Tom Hanks (Depth = 2)

Classical clustering approaches based only on data attributes (e.g. actors' names) will generate completely meaningless clusters. Instead, relational clustering approaches will consider the movies, directors and other actors related to the current actor along the semantic links in the ontology to generate reasonable clusters. Even if objects have no attributes, the linkage structure of the whole dataset itself can still provide useful information for clustering.

Multi-type relational clustering has raised substantial research interest recently [3, 11, 15]. In contrast to classical propositional approaches, relational clustering can handle multi-type interrelated data directly and exploit the semantic information hidden in the linkage structure to improve the clustering result. However, the trade-off for the above advantages is that relational clustering needs to explore far more information, so its scalability and efficiency are reduced. For example, calculating the similarity between two data objects of type Actor as in Figure 2 becomes much more expensive than in the propositional case. The second problem is that many conveniences of Euclidean space are not available in relational space. For example, k-Means and BIRCH [16] require all data to be represented as vectors to support the operations of addition and division, because the gravity center of each cluster is used in the clustering procedure. Such operations are not valid for data objects in a relational dataset, so relational clustering approaches (e.g. RDBC [8] and FORC [9]) have quadratic computational complexity. These two disadvantages restrain relational clustering techniques from being applied to very large datasets, for example in an online movie recommender system or a bibliographic database system. Another nontrivial issue in classical clustering approaches is how to select the optimal number of clusters k as the input parameter: an inappropriate value of k will lead to skewed or "unnatural" clusters [2].
This problem becomes even more severe in reinforcement clustering algorithms, such as ReCoM [13] and LinkClus [15], because the skewed cluster result of one data type is propagated along the relations and influences the partitioning of the other data types.

In order to address the above difficulties, in this paper we propose a new variance-based clustering approach, named DIVA (DIVision and Agglomeration). Our approach combines the advantages of the divisive and agglomerative clustering paradigms to improve the quality of cluster results. Unlike ReCoM or LinkClus, which only consider the direct relationships when clustering data objects of the current concept, we exploit multi-layered relational information in the phase of constructing data objects. By adopting the idea of the Representative Object, we can perform the DIVA algorithm with linear computational complexity O(N), where N is the number of objects to be clustered. Since the multi-type relational information is considered when data instances are constructed and DIVA generates the clusters of different data types separately, the problem of skewness propagation within reinforcement clustering approaches can be avoided.

The rest of our paper is organized as follows: Section 2 introduces our method for constructing multi-type relational objects and the corresponding similarity metric. On that basis, we explain in detail the DIVA algorithm and analyze its computational complexity in Section 3. Comprehensive experimental results are provided in Section 4. Section 5 presents some prominent algorithms for propositional and relational clustering. Finally, conclusions are drawn in Section 6.

2. RELATIONAL OBJECT CONSTRUCTION AND SIMILARITY MEASURE
In this section, we formally define our method of constructing multi-type relational objects as well as a recursive relational similarity metric according to the ontology. An ontology is represented as a directed graph G = (C, E), in which the vertices C = {c_i} stand for the set of concepts in the ontology and the edges E = {e_{ij} | e_{ij} : c_i \to c_j} for the relationships between pairwise concepts, i.e. concept c_i includes concept c_j as its member property. In Figure 1, concept "Actor" has member concept list MC(Actor) = {Name, Movie} and concept "Movie" has MC(Movie) = {Title, Genre, YearOfRelease, Certificate, Plot, Duration, Actor, Director}. When constructing an object x of concept c_i, we first build its member concept list MC(c_i) and then link all objects related to x into the member property attributes of x. We say an object y of concept c_j is related to x when c_j \in MC(c_i); in such a case, y is added into the member property attribute x.c_j. Then for each y \in x.c_j, we launch the above procedure iteratively until MC(c_j) = \emptyset or a depth bound Depth (\ge 0) is reached. As an example, Figure 2 shows the object for actor "Tom Hanks" with Depth = 2.

Figure 3: Example objects for two action movie stars (Depth = 1)

For two relational objects x_1 and x_2 of concept c_i, we define the similarity metric as follows:

    fs_{obj}(x_1, x_2) = \frac{1}{|MC(c_i)|} \sum_{c_j \in MC(c_i)} w_{ij} \cdot fs_{set}(x_1.c_j, x_2.c_j)    (1)

where the weight w_{ij} (w_{ij} \le 1 and \sum_j w_{ij} = 1) represents the importance of member concept c_j in describing the concept c_i. In Equation 1, fs_{set}(\cdot, \cdot) is defined as [1]:

    fs_{set}(x_1.c_j, x_2.c_j) =
      \frac{1}{|x_1.c_j|} \sum_{y_k \in x_1.c_j} \max_{y_l \in x_2.c_j} fs(y_k, y_l),    if |x_1.c_j| \ge |x_2.c_j| > 0
      \frac{1}{|x_2.c_j|} \sum_{y_l \in x_2.c_j} \max_{y_k \in x_1.c_j} fs(y_k, y_l),    if |x_2.c_j| \ge |x_1.c_j| > 0
      0,                                                                                 if |x_1.c_j| = 0 or |x_2.c_j| = 0    (2)

When MC(c_j) \ne \emptyset, the value of fs(y_k, y_l) in Equation 2 is recursively calculated by Equation 1. Hence, this similarity metric can explore the linkage structure of the relational objects. The recursion continues until MC(c_j) = \emptyset or the depth bound is reached, at which point traditional propositional similarity metrics can be used.

To demonstrate our recursive similarity metric, we give a simplified example in the movie dataset, assuming concept "Movie" only has member concepts {Title, Genre, YearOfRelease}. Figure 3 shows the relational objects for the action movie stars "Arnold Schwarzenegger" and "Sylvester Stallone". We will compare them with the actor "Tom Hanks" represented in Figure 2. In the scenario of propositional clustering, some naive similarity metrics might be adopted: if the actor's name is regarded as an enumerated value or a string, the true-or-false function or the Levenshtein distance can be used. The former metric results in a zero similarity value between each pair of actors because they always have different names. Using the latter, we have fs(o_{A2}, o_{A3}) = fs_{str}("Arnold Schwarzenegger", "Sylvester Stallone") = 0.095 and fs(o_{A1}, o_{A3}) = fs_{str}("Tom Hanks", "Sylvester Stallone") = 0.111, which is still unreasonable. Another common technique is to transform the relational information into a high dimensional vector space, e.g. constructing a binary vector to represent the movies that an actor has acted in and then calculating the pairwise similarity between actors as the cosine value of their movie vectors. As discussed in Section 1, such a transformation often produces sparse data when the number of movies is large. Another disadvantage is that some deeper semantic information, e.g. the movie genre, is lost when calculating the pairwise similarity between actors.

In our framework, by setting Depth = 1 and utilizing the relational similarity metric, we calculate the similarity value between the movies "Terminator 2: Judgment Day" and "Rocky" as follows (the weights for all member concepts are ignored here for simplicity):

    fs_{obj}(o_{M3}, o_{M5}) = \frac{1}{3} [ fs_{str}(o_{M3}.Title, o_{M5}.Title) + fs_{taxonomy}(o_{M3}.Genre, o_{M5}.Genre) + fs_{num}(o_{M3}.Year, o_{M5}.Year) ]
                             = \frac{1}{3} [ 0.077 + \frac{1}{2} + (1 - \frac{|1991 - 1976|}{max_{year} - min_{year}}) ] = 0.359

where the hierarchical taxonomy for concept "Genre" is shown at the bottom of Figure 3 and the corresponding similarity metric fs_{taxonomy}(\cdot, \cdot) is defined in [5]. Similarly,

    fs_{obj}(o_{M4}, o_{M5}) = \frac{1}{3} [ 0 + 1 + (1 - \frac{|1994 - 1976|}{30}) ] = 0.467
    fs_{obj}(o_{M3}, o_{M6}) = \frac{1}{3} [ 0.154 + \frac{1}{2} + (1 - \frac{|1991 - 1982|}{30}) ] = 0.451
    fs_{obj}(o_{M4}, o_{M6}) = \frac{1}{3} [ 0.182 + 1 + (1 - \frac{|1994 - 1982|}{30}) ] = 0.594

and hence

    fs_{obj}(o_{A2}, o_{A3}) = \frac{1}{2} [ fs_{str}(o_{A2}.Name, o_{A3}.Name) + fs_{set}(o_{A2}.Movie, o_{A3}.Movie) ]
                             = \frac{1}{2} [ 0.095 + \frac{1}{2} \sum_{y_l \in \{M5, M6\}} \max_{y_k \in \{M3, M4\}} fs(y_k, y_l) ]
                             = \frac{1}{2} [ 0.095 + \frac{1}{2} (0.467 + 0.594) ] = 0.313

In the same way, we get fs_{obj}(o_{A1}, o_{A3}) = 0.171, which is less than fs_{obj}(o_{A2}, o_{A3}). Therefore, by incorporating the ontology of the dataset and applying the relational similarity metric, the new results reflect more credible similarity values among these actors.

In theory, calculating the similarity value between two objects should exhaustively explore their relational structures and consider their member objects at all levels, i.e. setting Depth = \infty. This is infeasible and unnecessary in practice.
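To make the recursive metric concrete, the following Python sketch (our illustration only; the experiments in Section 4 use a Java implementation, and this is not that code) evaluates Equations 1 and 2 over objects represented as nested dictionaries. The equal weights w_ij = 1, the hard-coded year range and the exact-match fallback for non-numeric attributes are simplifying assumptions; plugging in a Levenshtein-based fs_str and the taxonomy similarity of [5] would be needed to reproduce the worked example above.

# Illustrative sketch of the recursive relational similarity (Equations 1 and 2).
# An object is a dict mapping member concepts to values; relational member concepts
# hold lists of nested objects, e.g. {"Name": "...", "Movie": [{"Title": ..., "Year": ...}]}.

def fs_num(a, b, lo=1970, hi=2000):
    # numeric attribute similarity, e.g. 1 - |1991 - 1976| / (max_year - min_year);
    # the range [lo, hi] is an assumed attribute domain
    return 1.0 - abs(a - b) / float(hi - lo)

def fs_set(s1, s2, depth):
    # Equation 2: average best-match similarity between two sets of member objects
    if not s1 or not s2:
        return 0.0
    if len(s1) < len(s2):          # always average over the larger set
        s1, s2 = s2, s1
    return sum(max(fs_obj(y1, y2, depth) for y2 in s2) for y1 in s1) / len(s1)

def fs_obj(x1, x2, depth):
    # Equation 1: mean of member-concept similarities (all weights w_ij taken as 1)
    concepts = [c for c in x1 if c in x2]
    if depth <= 0:                 # at the depth bound only propositional attributes remain
        concepts = [c for c in concepts if not isinstance(x1[c], list)]
    if not concepts:
        return 0.0
    total = 0.0
    for c in concepts:
        v1, v2 = x1[c], x2[c]
        if isinstance(v1, list):                 # relational member concept: recurse
            total += fs_set(v1, v2, depth - 1)
        elif isinstance(v1, (int, float)):       # numeric attribute
            total += fs_num(v1, v2)
        else:                                    # categorical/string: exact match in this sketch
            total += 1.0 if v1 == v2 else 0.0
    return total / len(concepts)

# usage with hypothetical objects built as in Figure 3
t2 = {"Title": "Terminator 2: Judgment Day", "Genre": "Action", "Year": 1991}
rocky = {"Title": "Rocky", "Genre": "Action", "Year": 1976}
arnold = {"Name": "Arnold Schwarzenegger", "Movie": [t2]}
sly = {"Name": "Sylvester Stallone", "Movie": [rocky]}
print(fs_obj(arnold, sly, depth=1))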
From Equations 1 and 2, we see that the similarity value between two member objects of concept c_j is propagated into the similarity calculation of the upper-level concept c_i with a decay factor \delta_d(c_j) = \frac{w_{ij}}{|MC(c_i)|}, where d means concept c_i is located at the d-th level of the root object's relational structure. The total decay factor for concept c_j to impact the similarity calculation of two root objects is \Delta(c_j) = \prod_d \delta_d(c_j). In many applications this factor shrinks very quickly as d increases, which means the impact of member objects at deeper levels of the relational structure keeps decreasing. For instance, in the above simplified example, the total decay factors for concepts "Name" and "YearOfRelease" to impact the root concept "Actor" are as follows (the weights for member concepts are again ignored for simplicity):

    \Delta(Name) = \frac{1}{|MC(Actor)|} = \frac{1}{2}
    \Delta(Year) = \frac{1}{|MC(Actor)| \cdot |MC(Movie)|} = \frac{1}{2 \times 3} = \frac{1}{6}

When applying the real ontology shown in Figure 1, the total decay factor \Delta(Year) is no greater than \frac{1}{14} because |MC(Movie)| = 7. Just like RDBC and FORC, we can set the depth bound Depth to a moderate value instead of \infty, so that enough information is retrieved from the relational structure to make the similarity calculation credible as well as feasible.

3. DIVA FRAMEWORK
Clustering is a technique of data compression: data objects in the same cluster can be treated collectively because they are more similar to each other than to those from different clusters. From another point of view, the clustering procedure preserves the significant similarity entries within the dataset by distributing pairs of highly similar objects into the same cluster. Like DBSCAN [4], which explicitly controls the intra-cluster similarity by specifying the expected data density, we use variance, a criterion for evaluating the diameter of clusters, to meet this requirement. The formal definition of variance is given in Section 3.1. Roughly speaking, greater variance means the derived clusters are more compact and higher similarity values between objects are preserved by the clustering procedure.

Based on the above discussion, our DIVA algorithm is designed as follows: first, divide the whole dataset into a number of clusters so that the variance of each cluster is greater than a particular threshold value υ. Based on these clusters, a hierarchical dendrogram is built using an agglomerative approach. Finally, the appropriate level of the dendrogram that satisfies the variance requirement is selected to construct the clustering result. Table 1 summarizes the main framework of DIVA, consisting of two parts: a recursive divisive step to partition the dataset and an agglomerative step to build the dendrogram based on the clusters. After defining some fundamental concepts in Section 3.1, we provide more details of these two steps in Sections 3.2 and 3.3. The computational complexity is analyzed in Section 3.4.

DIVA(dataset D_0, number of ROs r, variance υ)
1. cluster set {D_k} ← call the Divisive-Step, given D_0, r and υ as the parameters.
2. dendrogram T ← call the Agglomerative-Step, given {D_k} as the parameter.
3. select the appropriate level in T to construct the clustering result.
Table 1: Main Framework of DIVA

3.1 Fundamental Concepts
As presented in the related work (Section 5), traditional clustering approaches are usually categorized as partitional or hierarchical. The classical k-Medoids algorithm, as an example of partitional clustering, defines the medoid of a
cluster as the data object that has the maximum average similarity (or minimum average distance) to the other objects in the same cluster. This requires the calculation of the similarity values between every pair of data objects in the given cluster. On the other hand, hierarchical clustering comprises two sub-categories: divisive and agglomerative. In the former, a common criterion for deciding whether or not a cluster should be divided is its diameter, which is determined by the distance between the two data objects in the cluster that are farthest away from each other. Again we are faced with quadratic computational complexity. Similarly, the traditional agglomerative clustering paradigm also has quadratic complexity when searching for the two sub-clusters that are closest to each other in order to derive the super-cluster. Due to the complicated structure of relational data objects, relational clustering with quadratic computational complexity is impractical in many applications.

Is there an efficient way to delimit the shape of clusters and hence accelerate the division and agglomeration procedures? We develop the concept of the Representative Object (RO) to achieve this goal. The ROs are defined as a set of r maximum-spread objects in the data space, given r \ge 2. More strictly, after a start object x_s is randomly chosen, the i-th RO is determined by the following formula:

    ro_i = \arg\min_{x \in D} fs_{obj}(x, x_s),                        if i = 1
    ro_i = \arg\min_{x \in D} \max_{1 \le j < i} fs_{obj}(x, ro_j),    if 2 \le i \le r    (3)

The reason we do not use x_s itself as an RO is that x_s will, with high probability, reside in the central part of the data space and thus not satisfy the maximum-spread requirement for ROs. Additionally, we can also reduce the impact of the randomly selected x_s on the final clustering result. In Section 3.4 we analyze how the application of ROs reduces the total computational complexity of DIVA to linear in the size of the dataset.

We use {ro_i^{(D)}} (1 \le i \le r) to denote the set of ROs for the dataset D. Because they are maximum-spread from each other, the distance between the farthest pair of ROs approximates the diameter of D. More formally, the variance of the dataset D is defined as:

    \Upsilon(D) = \min_{1 \le i, j \le r} fs_{obj}(ro_i^{(D)}, ro_j^{(D)})    (4)

Hence, greater variance means the data objects reside in a smaller data space and thus are more similar to each other, i.e. the data are of higher homogeneity.

Divisive-Step(dataset D_0, number of ROs r, variance υ)
1. INITIALIZE the cluster list L by adding D_0.
2. FOR EACH newly added cluster D_k in L:
   (a) generate the set of ROs {ro_i^{(D_k)}}:
       i. the start object x_s^{(D_k)} ← randomly select an object from D_k.
       ii. to determine ro_i^{(D_k)} (1 ≤ i ≤ r): ro_i^{(D_k)} ← select the object x from D_k that is farthest away from the start object x_s^{(D_k)} when i = 1, or that minimizes the accumulated similarity from itself to the already obtained ROs ro_j^{(D_k)} (1 ≤ j < i) when 2 ≤ i ≤ r, as described in Equation 3.
   (b) evaluate the variance of D_k by Equation 4. Without loss of generality, assume the pair of ROs in {ro_i^{(D_k)}} that are farthest away from each other are ro_1^{(D_k)} and ro_2^{(D_k)}. Then Υ(D_k) = fs(ro_1^{(D_k)}, ro_2^{(D_k)}).
   (c) if Υ(D_k) < υ, then:
       i. create two new clusters D_k′ and D_k″, using ro_1^{(D_k)} and ro_2^{(D_k)} as the absorbent objects of D_k′ and D_k″ respectively, where k′ and k″ are unused index numbers in L.
       ii. allocate the remaining objects x ∈ D_k into either D_k′ or D_k″ based on the comparison of fs(x, ro_1^{(D_k)}) and fs(x, ro_2^{(D_k)}).
       iii. add D_k′ and D_k″ into L to replace D_k.
3. RETURN all the remaining clusters D_k in L.
Table 2: The Divisive Step

3.2 Divisive Step
The divisive step starts by assuming all the data objects belong to the same cluster. Here we use D_0 to denote the whole dataset as well as the initial cluster it forms. Equation 3 is applied to find a set of ROs for D_0, and thus its variance Υ(D_0) is determined by Equation 4. If Υ(D_0) is less than the pre-specified variance threshold υ, the division procedure is launched. The two ROs that are farthest away from each other, i.e. the pair of ROs determining the diameter of D_0, are used as the absorbent objects of the two sub-clusters respectively. The other data objects are allocated to the appropriate sub-cluster based on their similarities to the absorbent objects. Finally, the original cluster D_0 is replaced by its derived sub-clusters in the cluster list L. Since the similarity values of all the non-RO objects to every RO have already been obtained when determining {ro_i^{(D_0)}} by Equation 3, the division of D_0 can be performed without any extra similarity calculation. If either of the newly formed sub-clusters still does not satisfy the required threshold υ, the above division process is performed recursively. Finally we get a set of clusters {D_k} (\bigcup_k D_k = D_0 and D_{k_1} \cap D_{k_2} = \emptyset for k_1 \ne k_2) whose variance equals or is greater than υ, which are used as the input of the agglomerative step. The divisive step is summarized in Table 2.

3.3 Agglomerative Step
Like classical agglomerative clustering approaches, in this step we build a hierarchical dendrogram T in a bottom-up fashion. The cluster set {D_k} obtained from the divisive step constitutes the leaf nodes of the dendrogram. In each iteration, the most similar pair of clusters (sub-nodes) is merged to form a new super-cluster (parent node). Because each cluster in the agglomerative step is related to a unique node in the dendrogram, the words "cluster" and "node" are used interchangeably in this section. Various similarity metrics for agglomerating the nodes within a dendrogram have been discussed in [2], among which we adopt the complete-linkage similarity. Since each cluster is represented by a set of ROs, the similarity between two nodes t_l, t_{l′} ∈ T is defined as follows:

    fs_{node}(t_l, t_{l′}) = \min_{i,j} fs_{obj}(ro_i^{(t_l)}, ro_j^{(t_{l′})})    (5)

where {ro_i^{(t_l)}} is the set of ROs contained in node t_l and {ro_j^{(t_{l′})}} is the set of ROs contained in node t_{l′}. Without loss of generality, we assume that the super-node t_p is formed from two sub-nodes t_l and t_{l′}; then the top-r maximum-spread ROs in {ro_i^{(t_l)}} ∪ {ro_j^{(t_{l′})}} are chosen as the ROs for t_p. The agglomerative step is summarized in Table 3.

Agglomerative-Step(cluster set {D_k})
1. INITIALIZE the dendrogram T. For each D_k, construct a leaf node t_k in T.
2. REPEAT K − 1 times, where K is the size of {D_k}:
   (a) for the nodes in T that have no parent node, evaluate their pairwise similarity values by Equation 5. From all these nodes, choose the pair with the highest similarity value, say t_l and t_{l′}.
   (b) generate a new node t_p as the parent node of both t_l and t_{l′}, which is equivalent to creating a new super-cluster D_p by merging D_l and D_{l′}. The top-r maximum-spread ROs in {ro_i^{(t_l)}} ∪ {ro_j^{(t_{l′})}} are chosen as the ROs for t_p.
   (c) store t_p into T.
3. RETURN T.
Table 3: The Agglomerative Step
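The selection of ROs (Equation 3), the variance of a cluster (Equation 4) and the split performed in the divisive step (Table 2) can be outlined as follows. This is an illustrative Python sketch under simplified assumptions (for instance, similarity values are recomputed rather than cached as assumed in the complexity analysis of Section 3.4); it is not the Java implementation used in Section 4, and fs_obj stands for the relational similarity of Section 2.

import random

def pick_ros(cluster, fs_obj, r):
    # Equation 3: choose r maximum-spread Representative Objects
    xs = random.choice(cluster)                           # random start object (not an RO itself)
    ros = [min(cluster, key=lambda x: fs_obj(x, xs))]     # ro_1: farthest from x_s
    while len(ros) < r:
        # ro_i: the object whose most similar already-chosen RO is least similar
        ros.append(min(cluster, key=lambda x: max(fs_obj(x, ro) for ro in ros)))
    return ros

def variance(ros, fs_obj):
    # Equation 4: similarity of the two least similar (i.e. farthest) ROs
    return min(fs_obj(a, b) for a in ros for b in ros if a is not b)

def divisive_step(d0, fs_obj, r, upsilon):
    # Table 2: recursively split clusters whose variance is below upsilon
    todo, done = [list(d0)], []
    while todo:
        dk = todo.pop()
        if len(dk) < 2:
            done.append(dk)
            continue
        ros = pick_ros(dk, fs_obj, r)
        if variance(ros, fs_obj) >= upsilon:
            done.append(dk)
            continue
        # the two farthest ROs become the absorbent objects of two sub-clusters
        ro1, ro2 = min(((a, b) for a in ros for b in ros if a is not b),
                       key=lambda p: fs_obj(p[0], p[1]))
        d1 = [x for x in dk if fs_obj(x, ro1) >= fs_obj(x, ro2)]
        d2 = [x for x in dk if fs_obj(x, ro1) < fs_obj(x, ro2)]
        if not d1 or not d2:                              # degenerate split: stop here
            done.append(dk)
            continue
        todo.extend([d1, d2])
    return done

The agglomerative step of Table 3 would then repeatedly merge the two returned clusters with the highest complete-linkage similarity of Equation 5, computed over each cluster's ROs.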
It is worth noting that constructing the hierarchy in this step is not a reverse reproduction of the divisive step in Section 3.2. As shown in [16], the agglomeration can remedy inaccurate partitioning generated by the divisive step. After the dendrogram T is built, we need to determine the appropriate level in T and use the corresponding nodes to construct the clustering result. A common strategy is to select the level at which the variance of each node equals or is greater than υ. Alternatively, we can record the variance of the newly generated node at each level, find the largest gap between the variances of two neighbouring levels and use the lower level as the basis for constructing clusters [2]. When the number of clusters is fixed, as in the experiments in Section 4, we select the level which contains exactly the required number of nodes.

3.4 Complexity Analysis
In this section, we briefly analyze the computational complexity of each step of the DIVA algorithm, given the whole dataset D_0 of size N, the number of iterations R in the divisive step, and the size K of {D_k}:

• Divisive Step:
– The random selection of the start object x_s has complexity O(1). Then we scan the whole dataset D once (with N − 1 similarity comparisons) to pick out the data object farthest from x_s as the first RO ro_1. Similarly, in order to determine ro_i (2 ≤ i ≤ r), we only need to scan the whole dataset once and compare all the non-RO objects with ro_{i−1}, because the other similarity values required by Equation 3 have already been obtained when determining ro_j (1 ≤ j ≤ i − 2). Overall, there are (N − 1) + \sum_{i=2}^{r} (N − i + 1) = rN − \frac{r(r−1)}{2} − 1 similarity comparisons, so the computational complexity is O(r · N).
– If the dataset D has to be divided, we construct two sub-clusters and appoint the farthest pair of ROs, assumed to be ro_1 and ro_2 as before, as the absorbent objects. Then all the non-RO objects can be allocated to the nearest cluster without extra calculation, because their similarity values to ro_1 and ro_2 have been obtained in the procedure of determining {ro_i}. Hence, the operation of dividing and redistributing dataset D has complexity O(1).
– If any cluster's variance is lower than the specified variance threshold υ, a recursive division procedure is launched on this cluster until the variances of all derived sub-clusters satisfy the requirement. Without loss of generality, we assume cluster D_i of size N_i in the final clustering result is generated from the original dataset D_0 by at most R iterations; then all the data objects in D_i need to be compared with R · r ROs during the recursive division. Hence, the total computational complexity of the recursive division is O(\sum_i R r N_i), i.e. O(R r N).

• Agglomerative Step: like the classical agglomerative algorithm, the computational complexity of building the taxonomy is O(r^2 · K^2).

Therefore, the total computational complexity of the DIVA algorithm is O(rN + R r N + r^2 K^2). We must point out that both R and K above are controlled by the variance threshold υ: higher υ leads to more recursive divisions and thus generates more clusters. When υ → 1, the recursive division will generate many tiny clusters, each of which contains only the RO itself. In this extreme case, our DIVA algorithm behaves like the purely agglomerative approach RDBC, with quadratic complexity.
Nevertheless, by choosing moderate values for r and υ to keep rK ≪ N, the computational complexity of our DIVA algorithm is linear in the size of the dataset. Usually r is set to a fairly small value, such as 3 or 4. Like BIRCH [16], we can gradually increase the value of υ to improve the homogeneity of the generated clusters until their quality meets our requirement.

4. EXPERIMENTAL RESULTS
In order to evaluate the effectiveness and efficiency of our DIVA algorithm for clustering multi-relational datasets, we compare it with the following approaches: (1) ReCoM [13], which uses relationships among data objects to improve the cluster quality of interrelated data objects through an iterative reinforcement clustering process. Because there is no prior knowledge about authoritativeness in the datasets, we treat all data objects as equally important. Additionally, k-Medoids is incorporated as the meta clustering approach in ReCoM. (2) FORC [9], which is the natural extension of k-Medoids to the field of relational clustering. (3) LinkClus [15], which uses a new hierarchical structure, SimTree, to represent the similarity values between pairwise data objects and facilitate the iterative clustering process.

All the experiments were carried out on a workstation with a 2.8GHz P4 CPU, 1GB memory and a RedHat operating system. All approaches are implemented in Java. The experiments are conducted on two relational datasets. The first one is a synthetic dataset that simulates users browsing products on the website www.amazon.com. The second is the real movie dataset mentioned in Section 1.

To evaluate the accuracy of the clustering result, we use the Related Minimum Variance Criterion [2] to measure the similarity between pairs of objects in the same cluster:

    S_{intra} = \frac{1}{K} \sum_k s_k    (6)

where n_k is the size of cluster D_k and

    s_k = \frac{1}{n_k^2} \sum_{x \in D_k} \sum_{x' \in D_k} fs_{obj}(x, x')

When the class labels of the data objects are available, we can also use an entropy-based measure [13] to evaluate the clustering result. The measure reflects the uniformity or purity of a cluster. Formally, given a cluster D_k and the category labels of the data objects in it, the entropy of cluster D_k is:

    H(D_k) = - \sum_h P_h \log_2 P_h

where P_h is the proportion of data objects of class h in the cluster. The total entropy is defined as:

    H = \sum_{D_k} H(D_k)    (7)

Generally, larger intra-cluster similarity and smaller entropy values indicate higher accuracy of the clustering result. In the following experiments, we calculate the above criteria on a fixed number of clusters. For ReCoM and FORC, the number of clusters is a pre-specified input parameter. For LinkClus, we use the method mentioned in [15] to obtain the fixed number of clusters, i.e. first find the level at which the number of nodes is closest to the pre-specified number, then merge nodes to satisfy the requirement. For DIVA, when the number of merged clusters reaches the pre-specified requirement during the agglomerative step, the algorithm is terminated.

4.1 Synthetic Dataset
In this section, we test each clustering approach on a synthetic dataset. The dataset is generated by the following steps, as in [13]:

1. The product taxonomy of the online shop Amazon is retrieved from its website www.amazon.com. It contains 11 first-level categories, 40 second-level categories and 409 third-level categories. We generate 10,000 virtual products and randomly assign them to the third-level categories. The category information is the only content feature defined for these products.

2. We randomly generate 2,000 users and uniformly distribute them into 100 groups. For each group, we construct a probability distribution to simulate the users' preferences over the third-level categories (obtained in Step 1), which defines the likelihood that a user in that group browses a product in a certain category. Each group of users has 4 interest categories: one category representing the major interest of that group is assigned the probability 0.5, two categories of intermediate interest are assigned the probability 0.2, and one category of minor interest is assigned the probability 0.1.

3. Each user's browsing action is generated according to the information of users, groups, categories and products: (i) randomly select a user and get his group; (ii) based on the probabilities of the group interests, select an interest and get the related product category; (iii) randomly select a product that belongs to this category; (iv) create a browsing action between the user and the product obtained in steps (i) and (iii). In total 100,000 browsing actions were created.

4. In order to test the robustness of the different clustering approaches, we also generate some noise data: for each user, (i) uniformly choose four noise interests; (ii) randomly select a product that belongs to one of the four noise interests of the current user; (iii) create a noise browsing action between the user and the product. We will examine how these clustering approaches perform under different noise ratios.

Figure 4: Ontology of the Synthetic Dataset

Figure 4 shows the ontology of the synthetic dataset. Concepts "User" and "Product" are interrelated by the link browse action, and concept "Product" has property "Category" as its content feature. When creating objects for users or products, we set the depth bound Depth = 1 for FORC/DIVA and Depth = 0 for LinkClus/ReCoM, because the latter two follow the reinforcement clustering paradigm. For all approaches, the number of clusters to be generated within the datasets "User" and "Product" was specified as 100 by default. LinkClus was executed with a series of c values and the best clustering result was used to compare the performance of LinkClus with that of the other approaches. Since FORC, LinkClus and ReCoM launch an iterative procedure of clustering data objects until convergence, we set the maximum number of iterations to 10 because these algorithms converge very quickly in most cases.

Figure 5: Synthetic Dataset - Clustering users with different υ ((a) Intra-cluster similarity, (b) Entropy)
Figure 6: Synthetic Dataset - Clustering products with different υ ((a) Intra-cluster similarity, (b) Entropy)

Variance υ is the most important parameter for the DIVA algorithm, so we first test its impact. Figures 5 and 6 show the evaluation results of clustering users and products respectively, varying υ from 0.3 to 0.6 and fixing r = 3. In general, the quality of the clustering result generated by DIVA improves as υ increases. When υ ≥ 0.4, DIVA outperforms all the other algorithms, especially when evaluated by the entropy criterion, which means DIVA is more capable of discovering the inherent categories of products as well as the hidden groups of users.
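As a concrete reading of the two criteria used above, the following Python sketch (ours; the cluster representation, the label accessor label_of and the relational similarity fs_obj are assumed inputs) computes the intra-cluster similarity of Equation 6 and the total entropy of Equation 7 for a given partition.

import math

def intra_cluster_similarity(clusters, fs_obj):
    # Equation 6: S_intra = (1/K) * sum_k s_k, where s_k is the mean pairwise
    # similarity (including self-pairs) within cluster D_k
    total = 0.0
    for dk in clusters:
        n = len(dk)
        total += sum(fs_obj(x, y) for x in dk for y in dk) / float(n * n)
    return total / len(clusters)

def total_entropy(clusters, label_of):
    # Equation 7: H = sum_k H(D_k), with H(D_k) = -sum_h P_h * log2(P_h)
    h = 0.0
    for dk in clusters:
        counts = {}
        for x in dk:
            counts[label_of(x)] = counts.get(label_of(x), 0) + 1
        h -= sum((c / len(dk)) * math.log2(c / len(dk)) for c in counts.values())
    return h

Larger values of S_intra and smaller values of H indicate a better partition, as noted above.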
Furthermore, we found that the accuracy of LinkClus is far worse than that of the other approaches. The reason is that LinkClus builds the initial SimTrees by applying frequent pattern mining, which only exploits the link structure of the relational dataset. The content features of the data objects, for example the category information of products, are completely ignored in the clustering procedure. Due to such information loss, LinkClus cannot generate clusters of high quality. As a result, we do not test LinkClus in the following experiments.

Figure 7: Synthetic Dataset - Time spent vs. υ ((a) Users, (b) Products)

Figure 7 shows that FORC is always the most time-consuming algorithm. This result is not surprising since its computational complexity is O(N^2). On the other hand, the times spent by ReCoM, LinkClus and DIVA are comparable. When clustering products DIVA even outperforms ReCoM. The reason is that the average number of users related to each product is smaller than the average number of products related to each user, so the most expensive operation in the similarity calculation, Equation 2, is used less when clustering products than when clustering users. Generally, as υ increases, DIVA spends more time generating smaller clusters in the divisive step and combining them again in the agglomerative step. When υ > 0.6 for this dataset, the time spent by DIVA increases sharply because many single-object clusters are generated. As discussed in Section 3.4, such an over-estimate downgrades DIVA into RDBC with quadratic complexity.

Figure 8: Synthetic Dataset - Clustering users and products by DIVA with different r ((a) Intra-cluster similarity, (b) Entropy, (c) Time spent (sec))

Next we examine the parameter r, the number of ROs for each cluster; r is varied from 2 to 17 in steps of 2. We only compare two fixed values of υ here for reasons of clarity, but the conclusion is also valid for other values. Curves "User-04" and "User-05" in Figure 8 are for clustering users with variance 0.4 and 0.5 respectively, and curves "Product-04" and "Product-06" are for clustering products with variance 0.4 and 0.6 respectively. We can see that the running time grows very quickly while the accuracy does not change much. Therefore, a fairly small value of r, such as 3 or 4, is enough to provide high accuracy as well as keep the processing time short.

Figure 9: Synthetic Dataset - Standard deviation of the cluster sizes

In many applications very small clusters (in the extreme case, singleton clusters that contain only one data object) are meaningless, so it is necessary to investigate the structure of the derived clusters. Since the cluster number in our experiments has been fixed to 100, resulting in the same average size of the derived clusters for all approaches, we consider the standard deviation of the cluster sizes here. The result is shown in Figure 9. Roughly speaking, the standard deviation of the cluster sizes generated by DIVA decreases as the variance increases, meaning that our approach does not tend to generate singleton clusters.

Figure 10: Synthetic Dataset - Accuracy of clustering users with different noise ratios ((a) Intra-cluster similarity, (b) Entropy)

Figure 10 illustrates the robustness of all approaches under different noise ratios of browsing actions, ranging from 20% to 100%. The parameters for DIVA are set as υ = 0.5 and r = 3. Generally, the accuracy of all approaches is reduced as the noise ratio increases. When evaluated by the criterion of intra-cluster similarity, ReCoM is slightly better than DIVA, and FORC is the worst of the three. Yet the entropy-based criterion might be preferable here, because the intra-cluster similarity is calculated based on not only the informative browsing actions but also the noise ones, while the entropy is calculated based only on the class labels of the data objects. Evaluated by the latter, DIVA exceeds ReCoM and FORC when the noise ratio is below 80%, and their performances are very close when the noise ratio is above 80%.

4.2 Real Dataset
The clustering approaches were also evaluated on a real-world dataset, a movie knowledge base defined by the ontology in Figure 1. After data pre-processing, there are 62,955 movies, 40,826 actors and 9,189 directors. The dataset also includes a genre taxonomy of 186 genres. Additionally, we have 542,738 browsing records included in 15,741 sessions from 10,151 users. The number of sessions made by different users ranges from 1 to 814.

Figure 11: Real Dataset - Clustering movies with different υ ((a) Intra-cluster similarity, (b) User-based coefficient, (c) Session-based coefficient)

The evaluation result based on the intra-cluster similarity is shown in Figure 11(a), in which DIVA performs better than ReCoM and FORC. The entropy-based criterion defined by Equation 7 cannot be applied, because there is no pre-specified or manually-labelled class information for movies in the dataset. We therefore use the visit information from users to evaluate the clustering results indirectly. The traditional collaborative filtering algorithm constructs a user's profile based on all items he/she has browsed across sessions; the profile of the active user u_a is then compared with those of other users to form u_a's neighbourhood, and the items visited by the neighbourhood but not by u_a are returned as the recommendation [10]. Hence, two items are "labelled" into the same category if they are co-browsed by the same user, which reflects the partitioning of the dataset from the viewpoint of users. Accordingly, we can construct the evaluation criterion as in [15]: two objects are said to be correctly clustered if they are co-browsed by at least one common user. The accuracy of clustering is defined as a variant of the Jaccard Coefficient: the number of object pairs that are correctly clustered over all possible object pairs in the same clusters.
Another criterion is similar but of finer granularity, based on the assumption that a user seldom shifts his/her interest within a session: two objects are said to be correctly clustered if they are included in at least one common session. Therefore we have two new criteria for evaluating the accuracy of clustering: the user-based and the session-based coefficient. As discussed in [15], higher accuracy tends to be achieved when the number of clusters increases, so we generate 100 clusters for all approaches. Figures 11(b) and 11(c) show the evaluation results based on the above two criteria respectively. Since the user browsing data are very sparse (a common phenomenon in recommender systems), no clustering approach can achieve a very high coefficient value. Despite that, DIVA still outperforms both ReCoM and FORC when υ > 0.3, indicating that the clusters generated by DIVA are more consistent with the user browsing patterns, i.e. the partitioning of the dataset derived by DIVA is more acceptable to users. This indicates that DIVA would be more useful than ReCoM or FORC when applied in a recommender system.

5. RELATED WORK
As a widely-applied technique for data analysis and knowledge discovery, clustering tries to separate a dataset into a number of finite and discrete subsets so that the data objects in each subset share some common trait [7]. Roughly speaking, traditional clustering methods can be divided into two categories: hierarchical and partitional. Hierarchical clustering algorithms recursively agglomerate or divide existing clusters in a bottom-up or top-down manner respectively, so the data objects are organized within a hierarchical structure. In contrast, partitional algorithms construct a fixed number of clusters from the beginning, distribute each data object into its nearest cluster and update the cluster's mean or medoid iteratively. In recent years, many innovative ideas have been proposed to address various issues of traditional clustering algorithms [6, 14]. For example, BIRCH [16] utilizes the concept of the clustering feature (CF) to efficiently summarize the statistical characteristics of a cluster and distribute the data objects in the Euclidean space. Because the CF vectors are additive, they can be updated easily when a cluster absorbs a new data object or two sub-clusters are merged. By scanning the dataset, BIRCH incrementally builds a CF-tree to preserve the inherent clustering structure of the data objects that have been scanned. Finally, in order to remedy the problems of skewed input order or undesirable splitting, BIRCH applies a traditional agglomerative clustering algorithm to improve the CF-tree. BIRCH is very efficient because of its linear computational complexity, but it is not applicable to non-Euclidean datasets, such as the relational data object shown in Figure 2. DBSCAN [4], a density-based clustering algorithm, was proposed to find clusters of sophisticated shapes; it requires two parameters to define the minimum density of clusters: Eps (the radius of the neighbourhood of a point) and MinPts (the minimum number of points in the neighbourhood). Clusters are dynamically created from an arbitrary point, and then all points in its neighbourhood are absorbed. Like the dilemma of pre-specifying the parameter k in k-Means/k-Medoids, the parameters Eps and MinPts in DBSCAN are difficult to determine. Another problem is that DBSCAN needs an R*-tree to improve the efficiency of region queries.
Since such a structure is not available in a relational dataset, the computational complexity of DBSCAN degrades to O(N^2). In contrast to traditional clustering algorithms that only exploit flat datasets and search for an optimal partitioning in the Euclidean space, relational clustering algorithms try to incorporate the relationships between data objects as well. Neville et al. [12] provided some preliminary work on combining traditional clustering and graph partitioning techniques to solve the problem. They used a very simple similarity metric, called the matching coefficient, to weight the relations between data. Kirsten and Wrobel developed RDBC [8] and FORC [9] as first-order extensions of the classical hierarchical agglomerative and k-partitional clustering algorithms respectively. Both of them adopt the distance measure RIBL2 to calculate the dissimilarity between data objects, which recursively compares the sub-components of the first-order objects until it can finally fall back on propositional comparisons of elementary features. Therefore, RDBC and FORC inherit the disadvantage of their propositional antecedents: when the relational datasets are very large, such algorithms are infeasible due to their quadratic computational complexity. Enlightened by the idea of mutual reinforcement, Wang et al. [13] proposed a general framework for clustering multi-type relational data objects: initially, clustering is performed separately for each dataset based on the content information. The derived cluster structure formulates a reduced feature space of the current dataset and is then propagated along the linkage structure to impact the re-clustering procedure of the related datasets. More specifically, the relationships between data objects of different types are transformed into the linkage feature vectors of these objects, which can be easily handled by traditional clustering algorithms just like the content feature vectors. The re-clustering procedure continues in an iterative manner for all types of datasets until the result converges. Additional improvement can be achieved by assigning different importance to data objects in the clustering procedure according to their hub and authority values. Long et al. presented another general framework for multi-type relational clustering in [11]. Based on the assumption that the hidden structure of a data matrix can be explored by its factorization, the multi-type relational clustering is converted into an optimization problem: approximate the multiple relation matrices and the feature matrices by their corresponding collective factorization. Under this model a spectral clustering algorithm for multi-type relational data is derived, which updates one intermediate cluster indicator matrix as a number of leading eigenvectors at each iterative step until the result converges. Finally, the intermediate matrices have to be post-processed to extract the meaningful cluster structure. This spectral clustering framework is theoretically sound, but it cannot exploit semantic information, such as "Genre" for concept "Movie" in Figure 1. According to the definition, clustering approaches try to partition a dataset so that data objects in the same cluster are more similar to each other while objects in different clusters are less similar. Since data objects in the same cluster can be treated collectively, the clustering result can be considered a form of data compression [6].
Therefore, we can use the clustering model to approximate the similarity values between pairs of objects. Yin et al. [15] propose a hierarchical data structure, SimTree, to represent the similarities between objects, and use LinkClus to improve the SimTrees in an iterative way. However, since they only utilize frequent pattern mining techniques to build the initial SimTrees and use the path-based similarity in one SimTree to adjust the similarity values and structures of the other SimTrees, the information contained in the property attributes of the data objects is not used in their clustering framework, which degrades the accuracy of the final clustering result.

6. CONCLUSIONS
In this paper we propose a new variance-based clustering approach for multi-type relational datasets. The variance is a criterion that controls the compactness of the derived clusters and thus preserves the most significant similarity entries of the object pairs. Our approach combines the advantages of the divisive and agglomerative paradigms to improve the quality of clustering results. By incorporating the idea of the Representative Object, our approach has linear time complexity. Since the multi-type relational information is considered in the procedure of constructing data instances, the problem of skewness propagation within reinforcement clustering approaches can be avoided. Experimental results show our algorithm outperforms some well-known relational clustering approaches in accuracy, efficiency and robustness.

7. REFERENCES
[1] S. S. Anand, P. Kearney, and M. Shapcott. Generating semantically enriched user profiles for web personalization. ACM Transactions on Internet Technology, 7(3), August 2007.
[2] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification (2nd Edition). Wiley-Interscience, 2001.
[3] S. Dzeroski and N. Lavrac. Relational Data Mining. Springer, 2001.
[4] M. Ester, H. P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, 1996.
[5] P. Ganesan, H. Garcia-Molina, and J. Widom. Exploiting hierarchical domain structure to compute similarity. ACM Transactions on Information Systems (TOIS), 21(1):64–93, 2003.
[6] J. Han and M. Kamber. Data Mining: Concepts and Techniques (2nd Edition). Morgan Kaufmann, 2006.
[7] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Computing Surveys, 31(3):264–323, 1999.
[8] M. Kirsten and S. Wrobel. Relational distance-based clustering. In Proceedings of Fachgruppentreffen Maschinelles Lernen (FGML-98), pages 119–124, Berlin, 1998. Techn. Univ. Berlin, Technischer Bericht 98/11.
[9] M. Kirsten and S. Wrobel. Extending k-means clustering to first-order representations. In ILP '00: Proceedings of the 10th International Conference on Inductive Logic Programming, pages 112–129, London, UK, 2000. Springer-Verlag.
[10] J. A. Konstan, B. N. Miller, D. Maltz, J. L. Herlocker, L. R. Gordon, and J. Riedl. GroupLens: Applying collaborative filtering to Usenet news. Communications of the ACM, 40(3):77–87, 1997.
[11] B. Long, Z. M. Zhang, X. Wu, and P. S. Yu. Spectral clustering for multi-type relational data. In Proceedings of the 23rd International Conference on Machine Learning (ICML'06), pages 585–592, New York, NY, USA, 2006. ACM Press.
[12] J. Neville, M. Adler, and D. Jensen. Clustering relational data using attribute and link information.
In Proceedings of the Text Mining and Link Analysis Workshop, 18th International Joint Conference on Artificial Intelligence, 2003.
[13] J. Wang, H. Zeng, Z. Chen, H. Lu, T. Li, and W.-Y. Ma. ReCoM: Reinforcement clustering of multi-type interrelated data objects. In Proceedings of the 26th ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'03), pages 274–281, New York, NY, USA, 2003. ACM Press.
[14] R. Xu and D. Wunsch. Survey of clustering algorithms. IEEE Transactions on Neural Networks, 16:645–678, May 2005.
[15] X. Yin, J. Han, and P. S. Yu. LinkClus: efficient clustering via heterogeneous semantic links. In Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB'06), pages 427–438. VLDB Endowment, 2006.
[16] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: an efficient data clustering method for very large databases. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, pages 103–114, Montreal, Canada, 1996.