Labeling Nodes of Automatically Generated Taxonomy for Multi-type Relational Datasets

Tao Li and Sarabjot S. Anand
Department of Computer Science, University of Warwick, Coventry, United Kingdom
{li.tao,s.s.anand}@warwick.ac.uk

Abstract. Automatic Taxonomy Generation organizes a large dataset into a hierarchical structure so as to facilitate people's navigation and browsing actions. To better summarize the content of each node as well as to reflect the distinctiveness between sibling nodes, meaningful labels need to be assigned to all the nodes within a derived taxonomy. Current research focuses only on labeling taxonomies that are built from a corpus of textual documents. In this paper we address the problem of labeling taxonomies built for multi-type relational datasets. A novel measure is proposed to quantitatively evaluate the homogeneity of each node and the heterogeneity of its sibling nodes using information-theoretic techniques, based on which the labels of taxonomic nodes are determined. Experiments on a real dataset demonstrate the effectiveness of our method.

Key words: Taxonomy, Multi-type, Relational, Hierarchical Clustering

1 Introduction

Automatic Taxonomy Generation (ATG) is a promising approach to organizing a large dataset into a hierarchical structure so as to facilitate people's navigation and browsing actions in a more efficient way [1]. After the taxonomy has been constructed, meaningful node labels need to be assigned to best summarize the content of each node as well as to reflect its distinctiveness from sibling nodes [2]. Most current research focuses on labeling taxonomies generated from a large corpus of textual documents. Techniques from Natural Language Processing, Information Retrieval or Computational Linguistics are often applied to pre-process the text and extract keywords/concepts, among which the node labels are selected according to the frequency of keywords or the correlation between concepts [3][4][5]. Nevertheless, to the best of our knowledge, there is no systematic way of labeling taxonomies built for multi-type relational datasets.

Multi-type relational datasets pertain to domains with a number of data types and a multitude of relationships between them [6]. Relational objects, defined by a set of attributes and links to other objects, are usually stored in multiple tables of a relational database. Fig. 1 shows the simplified schema of a commercial movie database. Each movie object has attributes Title, YearOfRelease, Certificate and Genre, which are of types text, numeric, categorical, and taxonomy-based respectively. Moreover, these movies are linked to objects of types Director and Actor via the relationships "Movie-Director" and "Movie-Actor". All directors/actors have the attribute Name of type text and are in turn linked to multiple movies.

Fig. 1. Example Movie Database Schema

Given the task of generating a taxonomy for movie objects, automated approaches can be employed to (heuristically) search the space of possible taxonomies and find the optimal one. The search may be conducted in a single step, as in the case of CobWeb [7], or in two steps, where a hierarchical clustering is performed on the dataset followed by a post-processing of the dendrogram, merging its nodes, to derive a multi-split tree structure [4][8]. Once the taxonomy has been learned, labels of the derived taxonomic nodes need to be selected carefully.
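To make the data model concrete, the following minimal Python sketch (our own illustration; the class and field names mirror Fig. 1 but are otherwise hypothetical) shows one way such multi-type relational objects could be represented in memory:

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical in-memory mirror of the schema in Fig. 1: movies carry
# attributes of mixed types plus links to Director and Actor objects.
@dataclass
class Director:
    name: str
    movies: List["Movie"] = field(default_factory=list)  # inverse link

@dataclass
class Actor:
    name: str
    movies: List["Movie"] = field(default_factory=list)  # inverse link

@dataclass
class Movie:
    title: str            # text attribute
    year_of_release: int  # numeric attribute
    certificate: str      # categorical attribute
    genre: str            # taxonomy-based attribute (node in a genre tree)
    directors: List[Director] = field(default_factory=list)  # "Movie-Director"
    actors: List[Actor] = field(default_factory=list)        # "Movie-Actor"
```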
In our example, the attributes and related objects of movies usually differ in importance when discriminating the nodes within the derived taxonomy. When a taxonomic node mainly consists of action movies, users often care about the leading Actors (e.g. Arnold Schwarzenegger, Sylvester Stallone, Jackie Chan); for taxonomic nodes of ethical films, the Director (e.g. Ingmar Bergman, Federico Fellini, Krzysztof Kieslowski) may better summarize the movies allocated to these nodes; and the attribute YearOfRelease and keywords appearing in the Title are good discriminating indicators for taxonomic nodes consisting of documentaries.

In this paper, we try to answer the following question: Given a taxonomy that is automatically generated for a relational dataset using unsupervised learning techniques, how can the taxonomic nodes be labeled appropriately so as to summarize the content of each node and reflect the distinctiveness between sibling nodes? We develop a novel approach to selecting the labels of taxonomic nodes by quantitatively evaluating the homogeneity of each node and the heterogeneity of its sibling nodes. Experiments conducted on a real dataset demonstrate the effectiveness of our method.

The paper is organized as follows: In Section 2 we review related work. Our new algorithm is explained in Section 3. Experimental results are provided in Section 4. Finally, we present our conclusions and future work.

2 Related Work

In order to organize a large document collection more effectively and discover knowledge within it, many algorithms for automatically generating taxonomies over document collections have been developed [3][9][5]. Usually documents are first pre-processed using techniques from Natural Language Processing, Information Retrieval or Statistics to extract linguistic features such as keywords or concepts [10]. These keywords or concepts are then used as the labels of the taxonomic nodes to make the taxonomy more understandable. Lawrie and Croft proposed to use a set of topical summary terms as the labels of taxonomic nodes. These topical terms are selected by maximizing the joint probability of their topicality and predictiveness, which is estimated from statistical language models of the document collection [11]. Kummamuru et al. developed an incremental learning algorithm, DisCover, to maximize the coverage as well as the distinctiveness of the taxonomy. They used meaningful nouns, adjectives and noun phrases (with necessary pre-processing such as stemming, stop-word elimination or morphological generalization) extracted from the document set as the labels of the derived taxonomic nodes [2]. Krishnapuram and Kummamuru concluded that the set of words that occur with high frequency within a node can be used as its labels [12]. However, no research has been reported on how taxonomic nodes built from relational datasets can be labeled. Unlike supervised learning, where the classes of data instances are known prior to the learning procedure, automatic taxonomy generation is more akin to unsupervised learning, where the data classes are not available.
Labeling taxonomic nodes from structured (propositional or relational) datasets can be viewed as a procedure of learning the features (attributes or related objects) that best discriminate the content of a node from its siblings, and hence bears similarity to the choice of decision attributes in Decision Tree Induction [13][14][15]. The pure Information Gain measure, based on node entropy, prefers attributes with many values as the decision attribute. To reduce this bias, the Gain Ratio was introduced, which incorporates the split information to penalize such over-fitting [13][16].

3 Algorithm for Labeling Taxonomic Nodes

In this section, we first investigate the applicability of the Kullback-Leibler Divergence within a taxonomy built for relational datasets. The bias of the Kullback-Leibler Divergence is then analyzed, leading to the development of a new synthesized criterion. Finally, we propose two strategies to determine the node labels using the KL divergence.

The aim of labeling taxonomic nodes is to choose predictive labels that best summarize the content of each node as well as highlight its distinctiveness from siblings. As shown in Fig. 2, given a set of relational objects $O = \{o_i\}$ ($1 \le i \le N$) organized by a taxonomy $T$, and a non-leaf node $t_s$ in $T$ with $K$ child nodes $\{t_{sk}\}$ ($1 \le k \le K$), we try to determine labels for each child node $t_{sk}$ based on the attributes and linkages of all relational objects contained in $t_{sk}$ and its sibling nodes. The Kullback-Leibler Divergence [17], which is widely used in probability theory and information theory to measure the divergence of one probability distribution from another, is adopted in our approach.

Fig. 2. Example Taxonomy Sub-Tree

We first consider the propositional case, where all the objects are defined only by a set of attributes $\{A_r\}$. The attribute-value pairs that best distinguish the objects contained in the current node from those in its sibling nodes are chosen as the node labels. Given an attribute $A_r$, assuming $p_{kr}$ and $q_r$ are the probability distributions defined over the domain of $A_r$ for objects allocated to nodes $t_{sk}$ and $t_s$ respectively, the KL divergence of $p_{kr}$ from $q_r$ is defined as:

$$KL_r(p_{kr}, q_r) = \sum_{v_j \in \mathrm{Dom}(A_r)} p_{kr}(v_j) \log \frac{p_{kr}(v_j)}{q_r(v_j)} \qquad (1)$$

where $\mathrm{Dom}(A_r)$ is the domain of $A_r$. The distribution $q_r$ is the weighted sum of its child-node distributions, $q_r = \sum_{k=1}^{K} \frac{N_k}{N} p_{kr}$, where $N_k$ is the number of objects contained in child node $t_{sk}$ and $N$ is the number of objects contained in parent node $t_s$. Because $KL_r(p_{kr}, q_r)$ measures how important the attribute $A_r$ is in distinguishing the members of node $t_{sk}$ from the members of its sibling nodes, the attribute with the maximum KL divergence provides the basis for naming the node $t_{sk}$. Although the KL divergence is generally unbounded according to Eq. 1, we can estimate its upper bound in the context of the taxonomy labeling task as $\log \frac{N}{N_k}$. Interestingly, this upper bound is independent of the attribute domain over which the KL divergence is computed. The proof is provided in Appendix A.1.

More generally, in the relational case where objects are defined by both attributes and relational links to other objects, the above procedure can easily be extended to calculate the KL divergences for all attributes as well as relationships.
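As an illustration of Eq. 1, the following Python sketch (function and variable names are ours, not from the original system) estimates the two distributions from the raw attribute values of a child node and its parent and computes the divergence. Since a child's objects form a subset of the parent's, every value with non-zero child probability also has non-zero parent probability, so the ratio is always well defined:

```python
import math
from collections import Counter

def kl_divergence(child_values, parent_values):
    """KL divergence of the child node's attribute-value distribution
    p_kr from the parent node's distribution q_r (Eq. 1).
    Arguments are the attribute values of the objects in each node."""
    p = Counter(child_values)
    q = Counter(parent_values)
    n_child, n_parent = len(child_values), len(parent_values)
    kl = 0.0
    for v, cnt in p.items():
        p_v = cnt / n_child
        q_v = q[v] / n_parent  # q_v > 0 since the child is a subset of the parent
        kl += p_v * math.log(p_v / q_v)
    return kl

# The upper bound log(N / N_k) is attained when the child's values
# do not occur in any sibling node:
child = ["action"] * 10
parent = child + ["drama"] * 30        # N = 40, N_k = 10
print(kl_divergence(child, parent))    # log(40/10) = log 4 ~ 1.386
```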
For example, each movie object is related to a set of actors via the relationship actedBy (through the table "Movie-Actor"). Given the set of movie objects contained in $t_{sk}$, the number of related actor objects is $N_{k.actor}$, which is usually greater than $N_k$. The range of the KL divergence for the relationship actedBy is then $\left[0, \log \frac{N_{.actor}}{N_{k.actor}}\right]$ (we omit the proof for brevity). Similarly, for the relationship directedBy, the corresponding range is $\left[0, \log \frac{N_{.director}}{N_{k.director}}\right]$. Since the KL divergences for different relationships have different ranges, they should be normalized to the interval $[0, 1]$ before being compared.

As shown in Section 4, the KL divergence, when used as the criterion for determining node labels, is biased towards attributes with more unique values over those with fewer values. This phenomenon is similar to the bias of the Information Gain metric used in Decision Tree Induction [16]. To address this problem, we introduce the KL-Ratio (KLR) metric:

$$KLR_r(p_{kr}, q_r) = \frac{KL_r(p_{kr}, q_r)}{Entropy_r(t_s)} \qquad (2)$$

where $Entropy_r(t_s) = -\sum_j \frac{n_j}{N} \log \frac{n_j}{N}$ is the entropy of the attribute $A_r$ within the parent node $t_s$, and $n_j$ is the number of objects in $t_s$ taking the $j$-th value of $A_r$. Experimental results presented in Section 4 show that the normalization and the entropy-based adjustment together effectively reduce the bias of the original KL divergence metric.

With the KLR criterion defined in Eq. 2, we develop two strategies for labeling taxonomic nodes (a code sketch of both follows after this discussion):

– Strategy 1: All sibling nodes use the same attribute to construct their labels, differing only in the attribute values. The selected attribute has the maximum weighted sum of KLR values across all child nodes:

$$\arg\max_{A_r} \sum_k \frac{N_k}{N} KLR_r(p_{kr}, q_r)$$

Interestingly, this strategy is mathematically equivalent to using the Gain Ratio as the criterion for determining the split attribute in Decision Tree Induction. The detailed proof is provided in Appendix A.2. However, we must point out the fundamental difference between Decision Tree Induction and our ATG-based approach: the former belongs to supervised learning, i.e. the class information for all training data is known before the learning procedure; in contrast, our approach is based on ATG algorithms, which are essentially unsupervised and hence have no prior knowledge of class information. In summary, Decision Tree Induction and our ATG-based labeling solve similar problems under different motivations and prerequisites.

– Strategy 2: Each child node can be assigned an independent attribute-value pair as its label, which might differ from those of its siblings. In this case the attribute used to label a child node is the one with the maximum KLR value for that node:

$$\arg\max_{A_r} KLR_r(p_{kr}, q_r)$$

In contrast to Decision Tree Induction, each branch within the derived taxonomy can thus be labeled using a different attribute.

In practice, the first strategy is more suitable for users navigating the taxonomy in a top-down fashion: because all child nodes of the same parent use the same attribute with different split values as their labels, users can easily understand the distinctiveness between sibling nodes. The second strategy may more accurately reflect the content of each node in the taxonomy, because the most representative attribute (the one with the maximum KLR value) is selected for each node's label.
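The sketch below (continuing the hypothetical helpers of the previous listing, whose kl_divergence it reuses) computes the KLR of Eq. 2 and applies both strategies to the children of one parent node. Objects are assumed to be flat dicts of categorical attribute values; handling relationships would additionally require the per-relationship normalization discussed above:

```python
import math
from collections import Counter

# kl_divergence is assumed to be in scope from the previous sketch.

def entropy(values):
    """Entropy of the attribute-value distribution within a node."""
    n = len(values)
    return -sum(c / n * math.log(c / n) for c in Counter(values).values())

def klr(child_values, parent_values):
    """KL-Ratio (Eq. 2): KL divergence penalized by the parent's entropy."""
    h = entropy(parent_values)
    return kl_divergence(child_values, parent_values) / h if h > 0 else 0.0

def choose_label_attributes(children, attributes):
    """children: {node_id: list of object dicts}. Returns the attribute
    selected by Strategy 1 (shared by all siblings) and, per child node,
    the attribute selected by Strategy 2."""
    parent = [o for objs in children.values() for o in objs]
    n = len(parent)

    def node_klr(objs, a):
        return klr([o[a] for o in objs], [o[a] for o in parent])

    # Strategy 1: maximize the N_k/N-weighted sum of KLR values.
    shared = max(attributes, key=lambda a: sum(
        len(objs) / n * node_klr(objs, a) for objs in children.values()))
    # Strategy 2: each child independently takes its maximum-KLR attribute.
    per_node = {k: max(attributes, key=lambda a: node_klr(objs, a))
                for k, objs in children.items()}
    return shared, per_node
```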
Table 1. Results of Simulation Experiment (PopMovieSet = 100)

Attribute     Number of Unique Values   KL-divergence   Normalized KL-divergence   Entropy         KLR
Title         542.480 ± 16.770          1.132 ± 0.076   0.713 ± 0.028              8.774 ± 0.060   0.081 ± 0.003
Year          16.180 ± 0.997            0.082 ± 0.027   0.052 ± 0.017              3.243 ± 0.078   0.016 ± 0.005
Certificate   9.580 ± 0.698             0.046 ± 0.019   0.029 ± 0.012              2.647 ± 0.064   0.011 ± 0.005
Genre         35.720 ± 2.810            0.195 ± 0.037   0.123 ± 0.023              3.969 ± 0.118   0.031 ± 0.006
Director      225.460 ± 13.237          1.031 ± 0.065   0.650 ± 0.030              6.404 ± 0.218   0.101 ± 0.003
Actor         1064.960 ± 55.599         1.338 ± 0.100   0.842 ± 0.018              9.736 ± 0.107   0.086 ± 0.002

4 Experimental Results

Experiments were conducted to evaluate the algorithms presented in Section 3. We first measure the bias within the original KL divergence metric and show to what extent the KLR metric can reduce it. Then the two strategies for labeling nodes are compared through the simulation of a user locating a given set of movies in the taxonomy. A real movie dataset used by an online DVD retailer, whose database schema is shown in Fig. 1, is used in our experiments. After data pre-processing, there are 62,955 movies, 40,826 actors and 9,189 directors. The dataset also includes a genre taxonomy of 186 genres. Additionally, we have 542,738 browsing records from 10,151 users. Based on the user visits, we select the 10,000 most popular movies for our analysis.

4.1 Bias within the KL Metric

To determine whether the use of the original KL divergence metric for selecting the most informative attribute is biased, we conducted the following experiment: three sets of 100 movies were randomly chosen to form sibling nodes $t_{s1}$, $t_{s2}$ and $t_{s3}$, sharing the common parent node $t_s$. For each sub-node composed of sampled movies, we calculated the KL divergences for all attributes. This experiment was repeated 50 times with different random seeds. Table 1 shows the average number of unique values for each attribute contained in the parent node $t_s$, as well as the mean and standard deviation of the different labeling criteria with respect to each attribute. The movie titles were processed to extract meaningful nouns, verbs and adjectives using WordNet (http://wordnet.princeton.edu/).

In Table 1, the original KL divergence values for the attributes Title, Director and Actor are greater than 1 while the others are far less than 1, which demonstrates the necessity of normalizing the KL divergence as suggested in Section 3. Furthermore, the numbers of unique values for different attributes vary greatly, and the original KL divergence grows with this number. The entropy, which acts as the penalty factor in our approach, is affected in the same way, because attributes with more unique values also have higher entropy. As can be seen from the last column of Table 1, KLR, which combines the KL divergence and the entropy, effectively reduces the bias.
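The bias is easy to reproduce on synthetic data. The sketch below (our own toy simulation in the spirit of Section 4.1, reusing kl_divergence and klr from the earlier listings, not the actual movie data) draws three sibling nodes of 100 objects from uniform attribute domains of increasing size; the raw KL divergence inflates with cardinality even though the attribute carries no real discriminative signal, while KLR dampens (though does not fully remove) the effect:

```python
import random

# kl_divergence and klr are assumed in scope from the sketches above.
random.seed(0)
for cardinality in (5, 50, 500):
    # Three sibling nodes of 100 objects each, values drawn uniformly
    # at random from a domain of the given size (cf. Section 4.1).
    children = [[random.randrange(cardinality) for _ in range(100)]
                for _ in range(3)]
    parent = [v for c in children for v in c]
    mean_kl = sum(kl_divergence(c, parent) for c in children) / 3
    mean_klr = sum(klr(c, parent) for c in children) / 3
    print(f"|Dom(A)|={cardinality:4d}  mean KL={mean_kl:.3f}  "
          f"mean KLR={mean_klr:.3f}")
```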
Fig. 3. Labeling Taxonomy for Movie Dataset (Strategy 1)

4.2 Evaluating Labeling Strategies

Figures 3 and 4 show parts of the movie taxonomy labeled using the two strategies introduced in Section 3. In Fig. 3 all sibling nodes use the same decision attribute with different values as their labels, while in Fig. 4 each sibling node may use a different attribute as its own label.

It is worth noting that, depending on the taxonomy generation technique used, some data objects belonging to two different taxonomic nodes might share the same attribute values, which makes the derived node labels overlap; e.g. both Node 767 and Node 768 in Fig. 3 use the director "Martin Scorsese" in their labels. In such cases, other attribute values in the node labels can provide useful information to distinguish the content of the nodes.

To quantitatively evaluate the goodness of the derived labels, a robot was developed to simulate a user navigating the taxonomy (a code sketch follows below). For a set of randomly selected movies, the robot navigates the taxonomic structure in a top-down fashion to locate the leaf nodes that contain the given movies. When examining a non-leaf node, the robot uses the node's label to determine which sub-node should be explored in the next iteration. If the target movie object matches the labels of more than one sub-node, all matched sub-nodes are explored in a best-match-first order. We use a criterion, Cost, to measure the time spent in this exploration procedure, defined as the average number of attribute values within the labels of the taxonomic nodes examined by the robot before finding the correct leaf nodes. Generally speaking, a smaller value of Cost means that the corresponding labeling strategy is preferable. In our experiment, 100 randomly selected movies were used in each run and the experiment was repeated 10 times. The means and standard deviations of the Cost values for the two strategies are 378.31 ± 26.97 and 399.88 ± 26.14 respectively. The difference is statistically significant, meaning the first labeling strategy is more effective for locating target objects in the derived taxonomy.

Fig. 4. Labeling Taxonomy for Movie Dataset (Strategy 2)
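A minimal sketch of how such a robot might be implemented (our own simplification; the node layout and the matching rule are hypothetical rather than the exact procedure used in the experiment):

```python
def navigation_cost(root, target):
    """Number of label attribute-values examined before the leaf node
    containing `target` is reached (a simplified Cost of Section 4.2).
    Hypothetical node layout:
      {'labels': {attr: value}, 'children': [...], 'objects': set()}."""
    cost = 0
    stack = [root]
    while stack:
        node = stack.pop()
        if not node["children"]:                  # leaf: test membership
            if target["id"] in node["objects"]:
                return cost
            continue
        matched = []
        for child in node["children"]:
            cost += len(child["labels"])          # label values examined
            # A child matches if the target agrees with any label value.
            score = sum(target.get(a) == v
                        for a, v in child["labels"].items())
            if score > 0:
                matched.append((score, child))
        # Best-match-first: push weaker matches first so that the
        # strongest match is popped (explored) next.
        for _, child in sorted(matched, key=lambda m: m[0]):
            stack.append(child)
    return cost  # target not located; all examined label values counted
```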
5 Conclusions and Future Work

ATG techniques can efficiently organize large datasets into hierarchical structures. Usually a set of labels is assigned to the taxonomic nodes in order to summarize their respective content and to reflect their distinctiveness among siblings. In this paper we propose a novel approach, based on evaluating the homogeneity of each node and the heterogeneity of its siblings, to label taxonomies built for multi-type relational datasets. Moreover, we effectively remove the induction bias within the original KL divergence metric. In the future, we will investigate other approaches to labeling automatically generated taxonomies for relational datasets and compare their effectiveness and efficiency. Furthermore, we will study the possibility of using multiple attributes to label the taxonomic nodes.

References

1. Zhao, Y., Karypis, G.: Evaluation of hierarchical clustering algorithms for document datasets. In: Proceedings of ACM CIKM'02, USA (2002)
2. Kummamuru, K., Lotlikar, R., Roy, S., Singal, K., Krishnapuram, R.: A hierarchical monothetic document clustering algorithm for summarization and browsing search results. In: Proceedings of WWW'04 (2004)
3. Muller, A., Dorre, J., Gerstl, P., Seiffert, R.: The TaxGen framework: Automating the generation of a taxonomy for a large document collection. In: Proceedings of the 32nd Hawaii International Conference on System Sciences (1999)
4. Chuang, S.L., Chien, L.F.: Towards automatic generation of query taxonomy: A hierarchical query clustering approach. In: Proceedings of ICDM'02, Washington, USA, IEEE Computer Society (2002) 75–82
5. Cimiano, P., Hotho, A., Staab, S.: Comparing conceptual, partitional and agglomerative clustering for learning taxonomies from text. In: Proceedings of ECAI'04, IOS Press (2004) 435–439
6. Džeroski, S.: Multi-relational data mining: An introduction. SIGKDD Explorations Newsletter 5(1) (2003) 1–16
7. Clerkin, P., Cunningham, P., Hayes, C.: Ontology discovery for the semantic web using hierarchical clustering. In: Proceedings of the Semantic Web Mining Workshop co-located with ECML/PKDD, Freiburg, Germany (September 2001)
8. Cheng, P.J., Chien, L.F.: Auto-generation of topic hierarchies for web images from users' perspectives. In: Proceedings of ACM CIKM'03 (2003)
9. Lawrie, D., Croft, W.B., Rosenberg, A.: Finding topic words for hierarchical summarization. In: Proceedings of ACM SIGIR'01, New York, USA, ACM (2001) 349–357
10. Baeza-Yates, R.A., Ribeiro-Neto, B.A.: Modern Information Retrieval. ACM Press / Addison-Wesley (1999)
11. Lawrie, D.J., Croft, W.B.: Generating hierarchical summaries for web searches. In: Proceedings of ACM SIGIR'03, New York, USA, ACM (2003) 457–458
12. Krishnapuram, R., Kummamuru, K.: Automatic taxonomy generation: Issues and possibilities. In: Proceedings of Fuzzy Sets and Systems (IFSA). LNCS 2715, Springer-Verlag, Heidelberg (2003) 52–63
13. Quinlan, J.R.: Induction of decision trees. Machine Learning 1(1) (1986) 81–106
14. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco, USA (1993)
15. Kramer, S., Widmer, G.: Inducing classification and regression trees in first order logic. In: Relational Data Mining. Springer (September 2001) 140–160
16. Mitchell, T.M.: Machine Learning. McGraw-Hill, New York (1997)
17. Kullback, S.: The Kullback-Leibler distance. The American Statistician 41 (1987) 340–341

A Appendix

A.1 Proof of the KL-divergence Bounds

Given the symbols $N$, $N_k$, $A_r$, $p_{kr}$ and $q_r$ defined as in Section 3, it is easy to see that the KL divergence for attribute $A_r$ is maximized when the distributions $p_{kr}$ and $q_{\backslash k,r}$ have non-zero probabilities on disjoint subsets of the values of $A_r$, where $q_{\backslash k,r} = \sum_{l \neq k} \frac{N_l}{N - N_k} p_{lr}$. In that case $q_r(v_j) = \frac{N_k}{N} p_{kr}(v_j) = \frac{n_j}{N}$ for every value $v_j$ with $p_{kr}(v_j) > 0$, so the upper bound is:

$$\max KL_r(p_{kr}, q_r) = \sum_{v_j \in \mathrm{Dom}(A_r)} p_{kr}(v_j) \log \frac{p_{kr}(v_j)}{q_r(v_j)} = \sum_j \frac{n_j}{N_k} \log \left( \frac{n_j}{N_k} \Big/ \frac{n_j}{N} \right) = \sum_j \frac{n_j}{N_k} \log \frac{N}{N_k} = \log \frac{N}{N_k}$$

where $\frac{n_j}{N_k}$ (with $\sum_j \frac{n_j}{N_k} = 1$) is the frequency with which the $j$-th value of $A_r$ occurs in the objects contained in child node $t_{sk}$. Therefore, the range of the KL divergence for node $t_{sk}$ is $\left[0, \log \frac{N}{N_k}\right]$, which is not affected by the pre-specified attribute $A_r$.

A.2 Proof of the Equivalence between Information Gain and KL-based Strategy

Proof. The Information Gain used in Decision Tree Induction is defined as:

$$\begin{aligned}
InfoGain_r(t_s) &= Entropy(t_s) - \sum_k \frac{N_k}{N} \cdot Entropy(t_{sk}) \\
&= -\sum_j \frac{n_j}{N} \log \frac{n_j}{N} + \sum_k \frac{N_k}{N} \sum_j \frac{n_{kj}}{N_k} \log \frac{n_{kj}}{N_k} \\
&= -\sum_j \sum_k \frac{n_{kj}}{N} \log \frac{n_j}{N} + \sum_k \sum_j \frac{n_{kj}}{N} \log \frac{n_{kj}}{N_k} \\
&= \sum_k \sum_j \frac{n_{kj}}{N} \left( -\log \frac{n_j}{N} + \log \frac{n_{kj}}{N_k} \right) \\
&= \sum_k \sum_j \frac{n_{kj}}{N} \log \frac{n_{kj}/N_k}{n_j/N} \\
&= \sum_k \frac{N_k}{N} \sum_j p_{kr}(v_j) \log \frac{p_{kr}(v_j)}{q_r(v_j)} \\
&= \sum_k \frac{N_k}{N} \cdot KL_r(p_{kr}, q_r)
\end{aligned}$$

where $n_{kj}$ is the number of objects in child node $t_{sk}$ taking the $j$-th value of $A_r$, so that $p_{kr}(v_j) = \frac{n_{kj}}{N_k}$, $q_r(v_j) = \frac{n_j}{N}$ and $\sum_k n_{kj} = n_j$.
The Gain Ratio is defined as the ratio of the Information Gain and the entropy of the parent node with respect to the attribute $A_r$ [16], which gives:

$$GainRatio_r(t_s) = \frac{InfoGain_r(t_s)}{Entropy_r(t_s)} = \sum_k \frac{N_k}{N} \cdot \frac{KL_r(p_{kr}, q_r)}{Entropy_r(t_s)} = \sum_k \frac{N_k}{N} KLR_r(p_{kr}, q_r)$$

This completes the proof. □
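As an independent sanity check on the identity $InfoGain_r(t_s) = \sum_k \frac{N_k}{N} KL_r(p_{kr}, q_r)$ (the check itself is ours, not part of the original paper), a few lines of Python confirm it numerically on a toy split:

```python
import math
from collections import Counter

def H(values):
    """Entropy of a value list."""
    n = len(values)
    return -sum(c / n * math.log(c / n) for c in Counter(values).values())

def KL(child, parent):
    """KL divergence of the child's distribution from the parent's."""
    q = Counter(parent)
    return sum(c / len(child) * math.log((c / len(child)) /
               (q[v] / len(parent))) for v, c in Counter(child).items())

# Toy split of a categorical attribute into two child nodes.
children = [["a", "a", "b"], ["b", "c", "c", "c"]]
parent = [v for c in children for v in c]

info_gain = H(parent) - sum(len(c) / len(parent) * H(c) for c in children)
weighted_kl = sum(len(c) / len(parent) * KL(c, parent) for c in children)
assert math.isclose(info_gain, weighted_kl)  # InfoGain == sum (N_k/N)*KL
print(info_gain, weighted_kl)
```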