Labeling Nodes of Automatically Generated Taxonomy for Multi-type Relational Datasets

Tao Li and Sarabjot S. Anand
Department of Computer Science, University of Warwick
Coventry, United Kingdom
{li.tao,s.s.anand}@warwick.ac.uk
Abstract. Automatic Taxonomy Generation organizes a large dataset into a hierarchical structure that facilitates navigation and browsing. To summarize the content of each node and to reflect its distinctiveness from its siblings, meaningful labels need to be assigned to all nodes within a derived taxonomy. Current research focuses only on labeling taxonomies built from corpora of textual documents. In this paper we address the problem of labeling taxonomies built for multi-type relational datasets. A novel measure is proposed to quantitatively evaluate the homogeneity of each node and the heterogeneity of its sibling nodes using information-theoretic techniques, based on which the labels of taxonomic nodes are determined. Experiments on a real dataset demonstrate the effectiveness of our method.
Key words: Taxonomy, Multi-type, Relational, Hierarchical Clustering
1 Introduction
Automatic Taxonomy Generation (ATG) is a promising approach to organizing a large dataset into a hierarchical structure so as to make navigation and browsing more efficient [1]. After the taxonomy has been constructed, meaningful node labels need to be assigned to summarize the content of each node as well as to reflect its distinctiveness from its siblings [2]. Most current research focuses on labeling taxonomies generated from large corpora of textual documents. Techniques from Natural Language Processing, Information Retrieval or Computational Linguistics are applied to pre-process the text and extract keywords/concepts, from which node labels are selected according to keyword frequency or the correlation between concepts [3][4][5]. Nevertheless, to the best of our knowledge, there is no systematic way of labeling taxonomies built for multi-type relational datasets.
Multi-type relational datasets pertain to domains with a number of data types and a multitude of relationships between them [6]. Relational objects, defined by a set of attributes and links to other objects, are usually stored in multiple tables of a relational database. Fig. 1 shows the simplified schema of a commercial movie database.

Fig. 1. Example Movie Database Schema

Each movie object has attributes Title, YearOfRelease, Certificate and Genre, which are of types text, numeric, categorical and taxonomy-based respectively. Moreover, movies are linked to objects of types Director and Actor via the relationships “Movie-Director” and “Movie-Actor”. All directors/actors have the attribute Name of type text and are in turn
linked to multiple movies. Given the task of generating a taxonomy for movie objects, automated approaches can be employed to (heuristically) search the space of possible taxonomies and find the optimal one. The search may be conducted in a single step, as in CobWeb [7], or in two steps, where hierarchical clustering is performed on the dataset followed by a post-processing of the dendrogram that merges its nodes to derive a multi-split tree structure [4][8]. Once the taxonomy has been learned, labels for the derived taxonomic nodes need to be selected carefully. In our example, the attributes and related objects of movies usually differ in how well they discriminate the nodes within the derived taxonomy: when taxonomic nodes mainly consist of action movies, users often care about the leading Actors (e.g. Arnold Schwarzenegger, Sylvester Stallone, Jackie Chan); for taxonomic nodes of ethical films, the Director (e.g. Ingmar Bergman, Federico Fellini, Krzysztof Kieslowski) may better summarize the movies allocated to those nodes; while the attribute YearOfRelease and keywords appearing in the Title are good discriminating indicators for taxonomic nodes consisting of documentaries.
In this paper, we try to answer the following question: given a taxonomy that is automatically generated for a relational dataset using unsupervised learning techniques, how can the taxonomic nodes be labeled so as to summarize the content of each node and reflect the distinctiveness between sibling nodes? We develop a novel approach to selecting the labels of taxonomic nodes by quantitatively evaluating the homogeneity of each node and the heterogeneity of its sibling nodes. Experiments conducted on a real dataset demonstrate the effectiveness of our method.
The paper is organized as follows: Section 2 reviews related work. Our algorithm is explained in Section 3. Experimental results are provided in Section 4. Finally, we summarize our conclusions and outline future work.
2 Related Work
In order to more effectively organize a large document collection and discover
knowledge within it, many algorithms for automatically generating taxonomies
over document collections have been developed [3][9][5]. Usually documents are first pre-processed using techniques from Natural Language Processing, Information Retrieval or Statistics to extract linguistic features such as keywords or concepts [10]. These keywords or concepts are then used as the labels of the taxonomic nodes to make the taxonomy more understandable. Lawrie and Croft proposed using a set of topical summary terms as the labels of taxonomic nodes. These topical terms are selected by maximizing the joint probability of their topicality and predictiveness, which is estimated by statistical language models of the document collection [11]. Kummamuru et al. developed an incremental learning algorithm, DisCover, to maximize the coverage as well as the distinctiveness of the taxonomy. They used meaningful nouns, adjectives and noun phrases (with the necessary pre-processing such as stemming, stop-word elimination or morphological generalization) extracted from the document set as the labels of the derived taxonomic nodes [2]. Krishnapuram and Kummamuru concluded that the set of words with a high frequency of occurrence within a node can be used as its labels [12]. However, no research has been reported on how taxonomic nodes built from relational datasets can be labeled.
Unlike supervised learning, where the classes of data instances are known prior to the learning procedure, automatic taxonomy generation is more akin to unsupervised learning, where the data classes are not available. Labeling taxonomic nodes built from structured (propositional or relational) datasets can be viewed as learning the features (attributes or related objects) that best discriminate the content of a node from its siblings, and hence bears similarity to choosing decision attributes in Decision Tree Induction [13][14][15]. The pure Information Gain measure, based on node entropy, prefers attributes with many values as the decision attribute. To reduce this bias, the Gain Ratio was introduced, which incorporates the split information to penalize such over-fitting [13][16].
3 Algorithm for Labeling Taxonomic Nodes
In this section, we first investigate the applicability of the Kullback-Leibler Divergence within a taxonomy built for relational datasets. The bias of the Kullback-Leibler Divergence is then analyzed, leading to the development of a new synthesized criterion. Finally, we propose two strategies that use this criterion to determine node labels.
The aim of labeling taxonomic nodes is to choose predictive labels that best summarize the content of each node and highlight its distinctiveness from its siblings. As shown in Fig. 2, given a set of relational objects $O = \{o_i\}$ $(1 \le i \le N)$ organized by a taxonomy $T$, and a non-leaf node $t_s$ in $T$ with $K$ child nodes $\{t_{sk}\}$ $(1 \le k \le K)$, we try to determine labels for each child node $t_{sk}$ based on the attributes and linkages of all relational objects contained in $t_{sk}$ and its sibling nodes. The Kullback-Leibler Divergence [17], widely used in probability theory and information theory to measure the divergence of one probability distribution from another, is adopted in our approach.
Fig. 2. Example Taxonomy Sub-Tree
We first consider the propositional case where all the objects are defined only by a set of attributes $\{A_r\}$. The attribute-value pairs that best distinguish the objects contained in the current node from those in its sibling nodes are chosen as the node labels. Given an attribute $A_r$, and assuming $p_{kr}$ and $q_r$ are the probability distributions defined over the domain of $A_r$ for objects allocated to nodes $t_{sk}$ and $t_s$ respectively, the KL-divergence of $p_{kr}$ from $q_r$ is defined as:

$$KL_r(p_{kr}, q_r) = \sum_{v_j \in Dom(A_r)} p_{kr}(v_j) \log \frac{p_{kr}(v_j)}{q_r(v_j)} \qquad (1)$$
where $Dom(A_r)$ is the domain of $A_r$. The distribution $q_r$ is defined as the weighted sum of its child-node distributions: $q_r = \sum_{k=1}^{K} \frac{N_k}{N} p_{kr}$, where $N_k$ is the number of objects contained in child node $t_{sk}$ and $N$ is the number of objects contained in parent node $t_s$. Because $KL_r(p_{kr}, q_r)$ measures how important the attribute $A_r$ is in distinguishing the members of node $t_{sk}$ from the members of its sibling nodes, the attribute with the maximum KL-divergence provides the basis for naming node $t_{sk}$. Although the KL-divergence is generally unbounded according to Eq. 1, we can estimate its upper bound in the context of the taxonomy labeling task as $\log \frac{N}{N_k}$. Interestingly, this upper bound is independent of the attribute domain over which the KL-divergence is computed. The proof is provided in Appendix A.1.
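To make the computation concrete, the following is a minimal Python sketch (with toy distributions and hypothetical node names, not the implementation used in our experiments) that evaluates Eq. 1 for two sibling nodes against the parent mixture $q_r$ and compares each value to its $\log \frac{N}{N_k}$ upper bound.

```python
import numpy as np

def kl_divergence(p, q):
    """KL-divergence of child distribution p from parent distribution q
    (Eq. 1); values with p(v_j) = 0 contribute nothing to the sum."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Two hypothetical sibling nodes: (N_k, p_kr over a 3-value Dom(A_r)).
children = {"t_s1": (60, [0.8, 0.2, 0.0]),
            "t_s2": (40, [0.0, 0.1, 0.9])}
N = sum(n_k for n_k, _ in children.values())

# Parent distribution q_r as the weighted sum of child distributions.
q = sum(n_k / N * np.asarray(p) for n_k, p in children.values())

for name, (n_k, p) in children.items():
    print(name, kl_divergence(p, q), "<= bound", np.log(N / n_k))
```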
More generally, in the relational case where objects are defined by both attributes and relational links to other objects, the above procedure can easily be extended to calculate the KL-divergences for all attributes as well as relationships. For example, each movie object is related to a set of actors via the relationship actedBy (through the table “Movie-Actor”). Given a set of movie objects contained in $t_{sk}$, the number of related actor objects is $N_{k.actor}$, which is usually greater than $N_k$. The range of the KL-divergence for the relationship actedBy is then $\left[0, \log \frac{N_{.actor}}{N_{k.actor}}\right]$ (we omit the proof for brevity). Similarly, for the relationship directedBy the corresponding range is $\left[0, \log \frac{N_{.director}}{N_{k.director}}\right]$. Since the KL-divergences for different relationships have different ranges, they should be normalized to the interval $[0, 1]$ before being compared.
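This normalization step can be sketched as follows, with invented counts for the actedBy relationship ($N_{.actor}$ under the parent node, $N_{k.actor}$ under child $t_{sk}$) and an assumed raw divergence value:

```python
import numpy as np

# Hypothetical counts: N_.actor actor objects related to the parent node's
# movies, N_k.actor related to child node t_sk (invented numbers).
N_actor, Nk_actor = 5000, 800
kl_actedby = 1.25            # assumed raw KL-divergence for actedBy

# Scale by the relationship-specific upper bound so that divergences
# computed for different relationships become comparable on [0, 1].
normalized_kl = kl_actedby / np.log(N_actor / Nk_actor)
print(normalized_kl)
```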
As shown in Section 4, the KL-divergence, when used as the criterion for determining node labels, is biased towards attributes with more unique values over those with fewer. This phenomenon is similar
to the bias of the Information Gain metric used in Decision Tree induction [16]. To address this problem, we introduce the KL-Ratio (KLR) metric:

$$KLR_r(p_{kr}, q_r) = \frac{KL_r(p_{kr}, q_r)}{Entropy_r(t_s)} \qquad (2)$$

where $Entropy_r(t_s) = -\sum_j \frac{n_j}{N} \log \frac{n_j}{N}$ is the entropy of the attribute $A_r$ within the parent node $t_s$. Experimental results presented in Section 4 show that the normalization and the entropy-based adjustment together effectively reduce the bias of the original KL-divergence metric.
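The sketch below, continuing the toy setting above, computes the parent-node entropy from raw value counts and derives the KLR score of Eq. 2; the counts are invented for illustration.

```python
import numpy as np

def entropy(counts):
    """Entropy of attribute A_r within parent node t_s, computed from
    value counts n_j (natural logarithm, matching Eq. 2)."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def kl_ratio(kl_value, parent_counts):
    """KLR of Eq. 2: the KL-divergence penalized by the parent-node
    entropy, damping the advantage of high-cardinality attributes."""
    return kl_value / entropy(parent_counts)

# Toy counts n_j for a 3-valued attribute in the parent node.
print(kl_ratio(0.45, [48, 16, 36]))
```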
With the KLR criterion defined in Eq. 2, we develop two strategies for labeling taxonomic nodes:

– Strategy 1: All the sibling nodes use the same attribute to construct their labels, differing only in the attribute values. The selected attribute has the maximum weighted sum of KLR values across all child nodes:

$$\arg\max_{A_r} \sum_k \frac{N_k}{N} KLR_r(p_{kr}, q_r)$$

Interestingly, this strategy is mathematically equivalent to using the Gain Ratio as the criterion for determining the split attribute in Decision Tree Induction; the detailed proof is provided in Appendix A.2. However, we must point out the fundamental difference between Decision Tree Induction and our ATG-based approach: the former is supervised learning, i.e. the class information for all training data is known before the learning procedure, whereas our approach is based on ATG algorithms, which are essentially unsupervised and hence have no prior knowledge of class information. In summary, Decision Tree Induction and our ATG-based labeling solve similar problems under different motivations and prerequisites.
– Strategy 2: Each child node can be assigned an independent attribute-value pair as its label, which may differ from those of its siblings. In this case the attribute used to label a child node is the one with the maximum KLR value for that node (both selection rules are sketched in code after this list):

$$\arg\max_{A_r} KLR_r(p_{kr}, q_r)$$

In contrast to Decision Tree Induction, each branch within the derived taxonomy can be labeled using a different attribute.
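The following sketch contrasts the two selection rules on invented KLR scores; `klr` maps each candidate attribute to its per-child KLR values and `weights` holds the fractions $N_k/N$ (all numbers are illustrative assumptions).

```python
import numpy as np

def strategy1(klr, weights):
    """One shared attribute for all siblings: maximize the weighted sum
    of KLR values across child nodes, sum_k (N_k/N) * KLR_r."""
    return max(klr, key=lambda a: float(np.dot(weights, klr[a])))

def strategy2(klr, k):
    """Independent label per child: maximize KLR for child node k alone."""
    return max(klr, key=lambda a: klr[a][k])

# Invented KLR scores per attribute for K = 3 child nodes,
# and the weights N_k / N of those children.
klr = {"Director": [0.12, 0.03, 0.05], "Actor": [0.06, 0.09, 0.08]}
weights = [0.5, 0.3, 0.2]

print(strategy1(klr, weights))                  # -> 'Director'
print([strategy2(klr, k) for k in range(3)])    # -> ['Director', 'Actor', 'Actor']
```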
In practice, the first strategy is more suitable for users navigating the taxonomy in a top-down fashion: because all child nodes belonging to the same parent use the same attribute with different values as their labels, users can easily understand the distinctiveness between sibling nodes. The second strategy may reflect the content of each node more accurately, because the most representative attribute (the one with the maximum KLR value) is selected for each node's label.
Table 1. Results of Simulation Experiment (PopMovieSet = 100)

Attribute   | Number of Unique Values | KL-divergence | Normalized KL-divergence | Entropy       | KLR
------------+-------------------------+---------------+--------------------------+---------------+--------------
Title       |   542.480 ± 16.770      | 1.132 ± 0.076 | 0.713 ± 0.028            | 8.774 ± 0.060 | 0.081 ± 0.003
Year        |    16.180 ±  0.997      | 0.082 ± 0.027 | 0.052 ± 0.017            | 3.243 ± 0.078 | 0.016 ± 0.005
Certificate |     9.580 ±  0.698      | 0.046 ± 0.019 | 0.029 ± 0.012            | 2.647 ± 0.064 | 0.011 ± 0.005
Genre       |    35.720 ±  2.810      | 0.195 ± 0.037 | 0.123 ± 0.023            | 3.969 ± 0.118 | 0.031 ± 0.006
Director    |   225.460 ± 13.237      | 1.031 ± 0.065 | 0.650 ± 0.030            | 6.404 ± 0.218 | 0.101 ± 0.003
Actor       |  1064.960 ± 55.599      | 1.338 ± 0.100 | 0.842 ± 0.018            | 9.736 ± 0.107 | 0.086 ± 0.002
4 Experimental Results
Experiments were conducted to evaluate the algorithms presented in Section 3. We first measure the bias within the original KL-divergence metric and show to what extent the KLR metric reduces it. The two labeling strategies are then compared through a simulation of a user locating a given set of movies in the taxonomy. A real movie dataset used by an online DVD retailer, whose database schema is shown in Fig. 1, is used in our experiments. After data pre-processing, there are 62,955 movies, 40,826 actors and 9,189 directors. The dataset also includes a genre taxonomy of 186 genres. Additionally, we have 542,738 browsing records from 10,151 users. Based on the user visits, we select the 10,000 most popular movies for our analysis.
4.1 Bias within the KL Metric
To determine whether using the original KL-divergence metric for selecting the most informative attribute is biased, we conducted the following experiment: three sets of 100 movies were randomly chosen to form sibling nodes $t_{s1}$, $t_{s2}$ and $t_{s3}$, which share a common parent node $t_s$. For each sub-node composed of sampled movies, we calculated the KL-divergences for all the attributes. The experiment was repeated 50 times with different random seeds.
Table 1 shows the average number of unique values for each attribute contained in the parent node $t_s$, as well as the mean and standard deviation of the different labeling criteria with respect to each attribute. The movie titles were processed to extract meaningful nouns, verbs and adjectives using WordNet (http://wordnet.princeton.edu/). In Table 1, the original KL-divergence values for the attributes Title, Director and Actor are greater than 1 while the others are far less than 1, which confirms the need to normalize the KL-divergence as suggested in Section 3. Furthermore, the number of unique values varies greatly across attributes, and the original KL-divergence grows with this number. The entropy, which acts as the penalty factor in our approach, is also affected, because attributes with more unique values have higher entropy. As can be seen from the last column of Table 1, KLR, which combines the KL-divergence and the entropy, effectively reduces the bias.
Fig. 3. Labeling Taxonomy for Movie Dataset (Strategy 1)
4.2 Evaluating Labeling Strategies
Figures 3 and 4 show parts of the movie taxonomy labeled using the two strategies introduced in Section 3. In Fig. 3 all sibling nodes use the same decision attribute with different values as their labels, while in Fig. 4 each sibling node may use a different attribute as its own label. It is worth noting that, depending on the taxonomy generation technique used, some data objects belonging to two different taxonomic nodes might share the same attribute values, which makes the derived node labels overlap; e.g. both Node 767 and Node 768 in Fig. 3 use the director “Martin Scorsese” in their labels. In such cases, other attribute values in the node labels can help distinguish the content of the nodes.
To quantitatively evaluate the goodness of the derived labels, a robot was developed to simulate a user navigating the taxonomy. For a set of randomly selected movies, the robot navigated the taxonomic structure in a top-down fashion to locate the leaf nodes containing the given movies. When examining a non-leaf node, the robot uses the node's label to determine which sub-node should be explored in the next iteration. If the target movie object matches the labels of more than one sub-node, all matched sub-nodes are explored in best-match-first order. We use a criterion Cost to measure the time spent in this exploration procedure, defined as the average number of attribute values within the labels of the taxonomic nodes examined by the robot before finding the correct leaf nodes. Generally speaking, a smaller value of Cost means the corresponding labeling strategy is preferable. In our experiment, 100 randomly selected movies were used in each run and the experiment was repeated 10 times. The mean and standard deviation of the Cost values for the two strategies are 378.31 ± 26.97 and 399.88 ± 26.14 respectively. The difference is statistically significant, meaning the first labeling strategy is more effective at locating target objects in the derived taxonomy.
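The robot's search can be sketched as follows; the node representation, matching rule and tie-breaking are simplifying assumptions rather than the exact implementation used in our experiments.

```python
def locate(node, title, values, cost=0):
    """Search the taxonomy top-down for a movie, accumulating the number
    of attribute values in the labels of every node examined (the Cost
    criterion). Nodes are plain dicts; matching is set intersection."""
    cost += len(node["label"])
    if not node["children"]:                     # leaf node reached
        return cost, title in node["movies"]
    # Best-match-first: children whose labels share more attribute
    # values with the target movie are explored earlier.
    ranked = sorted(node["children"],
                    key=lambda c: -len(set(c["label"]) & values))
    for child in ranked:
        cost, found = locate(child, title, values, cost)
        if found:
            return cost, True
    return cost, False

# Tiny two-leaf taxonomy for illustration.
leaf1 = {"label": ["Martin Scorsese"], "children": [], "movies": {"Taxi Driver"}}
leaf2 = {"label": ["Steven Spielberg"], "children": [], "movies": {"Jaws"}}
root = {"label": [], "children": [leaf1, leaf2], "movies": set()}
print(locate(root, "Jaws", {"Steven Spielberg"}))   # (cost, found)
```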
Fig. 4. Labeling Taxonomy for Movie Dataset (Strategy 2)
5 Conclusions and Future Work
ATG techniques can efficiently organize large datasets into hierarchical structures. Usually a set of labels is assigned to the taxonomic nodes in order to summarize their respective content and to reflect their distinctiveness among siblings. In this paper we propose a novel approach, based on evaluating the homogeneity of each node and the heterogeneity of its siblings, to labeling taxonomies built for multi-type relational datasets. Moreover, we effectively remove the induction bias within the original KL-Divergence metric.
In the future, we will continue investigating other approaches to labeling automatically generated taxonomies for relational datasets and compare their effectiveness and efficiency. Furthermore, we will study the possibility of using multiple attributes to label the taxonomic nodes.
References
1. Zhao, Y., Karypis, G.: Evaluation of hierarchical clustering algorithms for document datasets. In: Proceedings of ACM CIKM’02, USA, (2002)
2. Kummamuru, K., Lotlikar, R., Roy, S., Singal, K., Krishnapuram, R.: A hierarchical monothetic document clustering algorithm for summarization and browsing
search results. In: Proceedings of WWW’04. (2004)
3. Muller, A., Dorre, J., Gerstl, P., Seiffert, R.: The TaxGen framework: Automating
the generation of a taxonomy for a large document collection. In: Proceedings of
the 32nd Hawaii International Conference on System Sciences. (1999)
4. Chuang, S.L., Chien, L.F.: Towards automatic generation of query taxonomy: A
hierarchical query clustering approach. In: Proceedings of ICDM’02, Washington,
USA, IEEE Computer Society (2002) 75–82
Labeling Taxonomic Nodes for Relational Datasets
9
5. Cimiano, P., Hotho, A., Staab, S.: Comparing conceptual, partitional and agglomerative clustering for learning taxonomies from text. In: Proceedings of ECAI’04,
IOS Press (2004) 435–439
6. Džeroski, S.: Multi-relational data mining: An introduction. SIGKDD Explorations
Newsletter 5(1) (2003) 1–16
7. Clerkin, P., Cunningham, P., Hayes, C.: Ontology discovery for the semantic web
using hierarchical clustering. In: Proceedings of Semantic Web Mining Workshop
co-located with ECML/PKDD, Freiburg, Germany (September 2001)
8. Cheng, P.J., Chien, L.F.: Auto-generation of topic hierarchies for web images from
users’ perspectives. In: Proceedings of ACM CIKM’03. (2003)
9. Lawrie, D., Croft, W.B., Rosenberg, A.: Finding topic words for hierarchical summarization. In: Proceedings of ACM SIGIR’01, New York, USA, ACM (2001)
349–357
10. Baeza-Yates, R.A., Ribeiro-Neto, B.A.: Modern Information Retrieval. ACM Press
/ Addison-Wesley (1999)
11. Lawrie, D.J., Croft, W.B.: Generating hierarchical summaries for web searches.
In: Proceedings of ACM SIGIR’03, New York, USA, ACM (2003) 457–458
12. Krishnapuram, R., Kummamuru, K.: Automatic taxonomy generation: Issues and
possibilities. In: Proceedings of Fuzzy Sets and Systems (IFSA). LNCS 2715,
Springer-Verlag Heidelberg (2003) 52–63
13. Quinlan, J.R.: Induction of decision trees. Machine Learning 1(1). Kluwer Academic Publishers, Hingham, USA (1986) 81–106
14. Quinlan, J.R.: C4.5: programs for machine learning. Morgan Kaufmann Publishers
Inc., San Francisco, USA (1993)
15. Kramer, S., Widmer, G.: Inducing classification and regression trees in first order
logic. In: Relational Data Mining. Springer (September 2001) 140–160
16. Mitchell, T.M.: Machine Learning. McGraw-Hill, New York (1997)
17. Kullback, S.: The Kullback-Leibler distance. The American Statistician 41 (1987) 340–341
A Appendix

A.1 Proof of the KL-divergence Bounds
Given the symbols $N$, $N_k$, $A_r$, $p_{kr}$ and $q_r$ defined as in Section 3, it is easy to see that the KL-divergence for attribute $A_r$ is maximized when the distributions $p_{kr}$ and $q_{\setminus k,r}$ have non-zero probabilities for disjoint subsets of values in $A_r$, where $q_{\setminus k,r} = \sum_{l \neq k} \frac{N_l}{N - N_k} p_{lr}$. The upper bound is then:

\begin{align*}
\max KL_r(p_{kr}, q_r) &= \sum_{v_j \in Dom(A_r)} p_{kr}(v_j) \log \frac{p_{kr}(v_j)}{q_r(v_j)} \\
&= \sum_j \frac{n_j}{N_k} \log\left(\frac{n_j}{N_k} \Big/ \frac{n_j}{N}\right) \\
&= \sum_j \frac{n_j}{N_k} \log \frac{N}{N_k} \\
&= \log \frac{N}{N_k}
\end{align*}

where $\frac{n_j}{N_k}$ $\left(\sum_j \frac{n_j}{N_k} = 1\right)$ is the frequency with which the $j$-th value of $A_r$ occurs in the objects contained in child node $t_{sk}$. Therefore, the range of the KL-divergence for node $t_{sk}$ is $\left[0, \log \frac{N}{N_k}\right]$, which is not impacted by the pre-specified attribute $A_r$.
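As a quick numeric sanity check (toy counts, not part of the proof), the bound is attained when the child's support is disjoint from that of its siblings:

```python
import numpy as np

# Toy check: with disjoint supports the upper bound log(N / N_k) is attained.
N_k, N = 30, 100
p_k    = np.array([0.5, 0.5, 0.0, 0.0])    # child t_sk
p_rest = np.array([0.0, 0.0, 0.4, 0.6])    # remaining siblings combined
q = (N_k / N) * p_k + ((N - N_k) / N) * p_rest

mask = p_k > 0
kl = float(np.sum(p_k[mask] * np.log(p_k[mask] / q[mask])))
print(kl, np.log(N / N_k))                 # both equal log(10/3) ~ 1.204
```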
A.2 Proof of the Equivalence between Information Gain and KL-based Strategy
Proof. The Information Gain used in Decision Tree Induction is defined as:

\begin{align*}
InfoGain_r(t_s) &= Entropy(t_s) - \sum_k \frac{N_k}{N} \cdot Entropy(t_{sk}) \\
&= -\sum_j \frac{n_j}{N} \log \frac{n_j}{N} + \sum_k \frac{N_k}{N} \sum_j \frac{n_{kj}}{N_k} \log \frac{n_{kj}}{N_k} \\
&= -\sum_k \sum_j \frac{n_{kj}}{N} \log \frac{n_j}{N} + \sum_k \sum_j \frac{n_{kj}}{N} \log \frac{n_{kj}}{N_k} \\
&= \sum_k \sum_j \frac{n_{kj}}{N} \left( -\log \frac{n_j}{N} + \log \frac{n_{kj}}{N_k} \right) \\
&= \sum_k \sum_j \frac{n_{kj}}{N} \log \frac{n_{kj}/N_k}{n_j/N} \\
&= \sum_k \sum_j \frac{N_k}{N} \cdot p_{kr}(v_j) \log \frac{p_{kr}(v_j)}{q_r(v_j)} \\
&= \sum_k \frac{N_k}{N} \cdot KL_r(p_{kr}, q_r)
\end{align*}

The Gain Ratio is defined as the ratio of the Information Gain and the entropy of the parent node with respect to the attribute $A_r$ [16], which gives:

\begin{align*}
GainRatio_r(t_s) &= \frac{InfoGain_r(t_s)}{Entropy_r(t_s)} \\
&= \sum_k \frac{N_k}{N} \cdot \frac{KL_r(p_{kr}, q_r)}{Entropy_r(t_s)} \\
&= \sum_k \frac{N_k}{N} KLR_r(p_{kr}, q_r)
\end{align*}

This completes the proof. □
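The equivalence can also be checked numerically on a small contingency table of invented counts $n_{kj}$ (rows: child nodes, columns: attribute values):

```python
import numpy as np

# Contingency table of invented counts n_kj
# (rows: child nodes t_sk, columns: values v_j of attribute A_r).
n = np.array([[30.0, 10.0],
              [ 5.0, 55.0]])
N_k, n_j, N = n.sum(axis=1), n.sum(axis=0), n.sum()

def H(counts):
    p = counts / counts.sum()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# Left-hand side: Gain Ratio as defined above.
info_gain = H(n_j) - np.sum(N_k / N * np.array([H(row) for row in n]))
gain_ratio = info_gain / H(n_j)

# Right-hand side: weighted sum of KLR values over the child nodes.
kl = np.array([np.sum(row / row.sum() * np.log((row / row.sum()) / (n_j / N)))
               for row in n])
weighted_klr = np.sum(N_k / N * kl / H(n_j))

print(gain_ratio, weighted_klr)            # identical values
```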