2009 IEEE International Conference on Semantic Computing
WikiOnto: A System For Semi-automatic Extraction And Modeling Of Ontologies Using Wikipedia XML Corpus
Lalindra De Silva
University of Colombo School of Computing
Colombo, Sri Lanka
lalindra84@gmail.com

Lakshman Jayaratne
University of Colombo School of Computing
Colombo, Sri Lanka
klj@ucsc.cmb.ac.lk
Abstract—This paper introduces WikiOnto: a system that assists in the extraction and modeling of topic ontologies in a semi-automatic manner, using a preprocessed document corpus of one of the largest knowledge bases in the world - the Wikipedia. Based on the Wikipedia XML Corpus, we present a three-tiered framework for quickly extracting topic ontologies and a modeling environment for refining them. Using Natural Language Processing (NLP) and other Machine Learning (ML) techniques together with a very rich document corpus, the system offers a solution to a task that is generally considered extremely cumbersome. The initial results of the prototype suggest that the system has strong potential for ontology extraction and modeling, and they motivate further research on extracting ontologies from other semi-structured document corpora.

Keywords-Ontology, Wikipedia XML Corpus, Ontology Modeling, Ontology Extraction

I. INTRODUCTION

With the growing problem of information overload and the emergence of the semantic web, the importance of ontologies has come under the spotlight in recent years. Ontologies, commonly defined as “an explicit specification of a conceptualization”, provide an agreed-upon representation of the concepts of a domain for easier knowledge management [1]. However, the modeling of ontologies is generally considered a painstaking task, often requiring the knowledge and expertise of an ontology engineer who is well versed in the domain concerned.

Previous efforts at building ontology development environments were successful in the comprehensiveness of the tools they provided to the user, but building ontologies with these environments still involved a great deal of manual work [2]. Additionally, most efforts to extract ontologies have focused on textual data as their source. In this research, we have therefore looked into extracting and modeling ontologies using the Wikipedia XML Corpus [3] as the source, with a longer-term view of extending this work to other semi-structured data domains as sources for building ontologies¹.

¹ This work is based on the continuing undergraduate thesis of the first author, titled ‘A Machine Learning Approach To Ontology Extraction and Evaluation Using Semi-structured Data’.

The research efforts, and the applications that have resulted from them, in the areas of ontology development, extraction and ontology learning are many and diverse. In light of these attempts, Section 2 reviews related prior research on constructing topic ontologies from various input sources. Section 3 describes our source - the Wikipedia XML Corpus - and the structure of the documents in the corpus, along with the reasons why it is an ideal source for such research. The framework of our system consists of three major layers (Figure 1), and Section 4 looks in detail at how this three-tiered framework is implemented. Finally, we present the ontology development environment with both the current facilities and those proposed for the users of the system.

II. RELATED WORKS

Research efforts to extract ontologies have spanned multifarious domains in terms of the input sources they use. A large portion of this research has focused on textual data as input, while less has focused on semi-structured data and relational databases. [4] presents a method in which the words that appear in the source texts are ontologically annotated using WordNet [5] as the lexical database; the authors primarily work on assigning the verbs that appear in source articles to ontological classes. Several other researchers have used XML schemas as a starting point for the creation of ontologies. In [6], the authors use the ORA-SS (Object-Relationship-Attribute model for Semi-structured Data) model first to determine the nature of the relationships between the elements of the XML schema and subsequently to create a generic model for organizing the ontology. [7] gives a detailed survey of some of the work that has been carried out on learning ontologies for the semantic web. [8] presents a framework for ontology extraction for document classification; in that work, the author experiments with the Reuters collection and the Wikipedia database dump as sources and presents a methodology by which concepts can be extracted from such large document bases. Other tools for extracting ontologies from text, such as ASIUM [9], OntoLearn [10] and TextToOnto [11], also exist. However, the biggest motivation for our research stems from OntoGen [12]. OntoGen is a tool that enables the semi-automatic generation of topic ontologies using textual data as the initial source. We have utilized the best features of the OntoGen methodology while extending our research into the semi-structured data domain. Consequently, we have proposed to enhance our system with additional concept extraction mechanisms, such as lexico-syntactic pattern matching, which the OntoGen project has not incorporated.
III. WIKIPEDIA XML CORPUS

The Wikipedia, being one of the largest knowledge bases in the world and the most popular reference work on the current World Wide Web, has inspired many people and projects in diverse fields, including knowledge management, information retrieval and ontology development. The reliability of Wikipedia articles is often criticized, since the knowledge base can be accessed and modified by anyone in the world. For the most part, however, the Wikipedia is considered reliable enough to be used in non-critical applications.

The Wikipedia XML Corpus is a collection of documents derived from the Wikipedia that is used in a large number of current Information Retrieval and Machine Learning research tasks. The corpus contains articles in eight languages, and the latest collection contains approximately 660,000 articles in the English language that correspond to articles in the Wikipedia. Each XML document in the corpus is uniquely named (e.g. 12345.xml) and has the following uniform element structure (the example is 3928.xml, representing the article on “Ball”):

<article>
<name id="3928">Ball</name>
<body>
<p>A <emph3>ball</emph3> is a round object
that is used most often in
<collectionlink xlink:type="simple"
xlink:href="26853.xml">sport
</collectionlink>s and
<collectionlink xlink:type="simple"
xlink:href="11970.xml">game
</collectionlink>s.
</p>
...
<section>
<title>Popular ball games</title>
There are many popular games...
...
<normallist>
<item>
<collectionlink xlink:type="simple"
xlink:href="3850.xml">Baseball
</collectionlink>
</item>
<item>
<collectionlink xlink:type="simple"
xlink:href="3812.xml">Basketball
</collectionlink>
</item>
</normallist>
</section>
...
<languagelink lang="cs">Míč</languagelink>
<languagelink lang="de">Ball</languagelink>
...
</section>
</body>
</article>

Each document is contained within the ‘article’ tag, within which lies a limited number of predefined element tags that correspond to specific relations between text segments in the document. The ‘name’ element carries the title of the article, while the ‘title’ elements within each ‘section’ element correspond to the titles of the different subsections. The ‘normallist’ elements contain lists of information, while the ‘collectionlink’ and ‘unknownlink’ elements correspond to other articles referred to from within the document.

In addition to the documents, the corpus provides category information which hierarchically organizes the articles into the relevant categories. Using this category information, we developed the initial segment of our system, in which the user can choose a certain number of articles from the domain of interest as the initial input to the system. Selecting a domain-specific document set enables better targeted and faster ontology creation than using the entire corpus as input.
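As a rough illustration of this selection step, the sketch below (in Python, not the system's C# implementation) assumes the category information has been loaded into a simple mapping from category names to article identifiers; the tab-separated file format and the load_category_index and select_domain_documents helpers are hypothetical stand-ins for whatever the corpus distribution actually provides.

# Hypothetical sketch: select a domain-specific subset of corpus articles
# using the category information. The index format here is an assumption;
# the real corpus may organize this data differently.
import os

def load_category_index(path):
    """Assumed format: one 'Category<TAB>article_id' pair per line."""
    index = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            category, article_id = line.rstrip("\n").split("\t")
            index.setdefault(category, []).append(article_id)
    return index

def select_domain_documents(corpus_dir, index, category, limit=500):
    """Return the paths of up to `limit` articles filed under `category`."""
    ids = index.get(category, [])[:limit]
    return [os.path.join(corpus_dir, f"{article_id}.xml") for article_id in ids]

# Example usage (paths and category name are placeholders):
# index = load_category_index("categories.txt")
# docs = select_domain_documents("corpus/", index, "Sports", limit=200)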
IV. WIKIONTO FRAMEWORK

As previously mentioned, the WikiOnto system is implemented in a three-tiered framework. The following sections explain each layer in detail.

Figure 1. Design of the WikiOnto System
A. Concept Extraction Using Document Structure
In this layer, we make use of the structure of the documents to deduce substantial information about the taxonomic hierarchy of possible concepts in each document. After a careful review of many documents in the corpus, we established the following assumptions for extracting concepts from the documents:
• Word phrases contained within the ‘name’, ‘title’,
‘item’, ‘collectionlink’ and ‘unknownlink’ elements are
proposed as concepts to the user (e.g. in the previous
example of the XML file, the words ‘Ball’, ‘Sport’,
‘Game’, ‘Popular Ball Games’, ‘Baseball’ and ‘Basketball’ are all suggested to the user as concepts)
• The word phrases contained within the ‘title’ elements of first-level sections are suggested as sub-concepts of the concept within the ‘name’ element of the document (e.g. ‘Popular Ball Games’ is a sub-concept of the concept ‘Ball’)
• The word phrases contained within a ‘title’ element nested inside several ‘section’ elements are suggested as sub-concepts of the ‘title’ concept of the immediately enclosing section
• Any word phrase wrapped inside a ‘collectionlink’ or ‘unknownlink’ element is suggested as a sub-concept of the concept immediately above it in the structure of the XML document (e.g. the concepts ‘Baseball’ and ‘Basketball’ are sub-concepts of the concept ‘Popular Ball Games’, while the concepts ‘Sport’ and ‘Game’ are sub-concepts of the concept ‘Ball’)
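To make these assumptions concrete, the following sketch walks a single, simplified article and emits candidate parent/child concept pairs. It is a Python illustration rather than the system's C# implementation; the sample XML is abbreviated, and the xmlns:xlink declaration is added here only so that the snippet parses on its own.

# Sketch: derive candidate concepts and sub-concept pairs from an article's
# structure, following the assumptions listed above.
import xml.etree.ElementTree as ET

SAMPLE = """<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <name id="3928">Ball</name>
  <body>
    <p>A ball is a round object that is used most often in
       <collectionlink xlink:href="26853.xml">sport</collectionlink>s and
       <collectionlink xlink:href="11970.xml">game</collectionlink>s.</p>
    <section>
      <title>Popular ball games</title>
      <normallist>
        <item><collectionlink xlink:href="3850.xml">Baseball</collectionlink></item>
        <item><collectionlink xlink:href="3812.xml">Basketball</collectionlink></item>
      </normallist>
    </section>
  </body>
</article>"""

def extract_pairs(xml_text):
    root = ET.fromstring(xml_text)
    article_concept = root.findtext("name").strip()
    body = root.find("body")
    pairs = []  # (parent concept, child concept)
    # Links in top-level paragraphs fall directly under the article concept.
    for p in body.findall("p"):
        for link in p.iter("collectionlink"):
            pairs.append((article_concept, link.text.strip()))
    # First-level section titles become sub-concepts of the article concept;
    # links inside a section fall under that section's title concept.
    # (Nested sections would be handled by recursing in the same way.)
    for section in body.findall("section"):
        section_concept = (section.findtext("title") or article_concept).strip()
        pairs.append((article_concept, section_concept))
        for link in section.iter("collectionlink"):
            pairs.append((section_concept, link.text.strip()))
    return pairs

print(extract_pairs(SAMPLE))
# [('Ball', 'sport'), ('Ball', 'game'), ('Ball', 'Popular ball games'),
#  ('Popular ball games', 'Baseball'), ('Popular ball games', 'Basketball')]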
In the initialization of the system, the user is given the opportunity to define the maximum number of words that a potential concept may contain. Once the user inputs the documents that are to be used as the sources for concept extraction (either using the domain-specific document selector explained earlier, choosing documents manually, or inputting the entire corpus²), the system iterates through all the documents and extracts the potential concepts and their relationships according to the assumptions listed above. The ontology modeling environment (Section 5) explains how the user can refine these relationships after examining the concepts (i.e. how the user can label the relationships as hyponymic, hypernymic or meronymic relations, etc.).

² Due to computing resource constraints, the system has not yet been tested with the full corpus (approx. 660,000 articles) as input.

In order to validate the concepts that are extracted and suggested, we have incorporated WordNet [5] into our system. When all the concepts have been extracted from the documents as per the above assumptions, each concept is matched morphologically, word by word, against WordNet. If even a single word in a word phrase cannot be morphologically matched with WordNet (for reasons such as being a foreign word, a person's name, a place name, etc.), the whole word phrase is withheld from the concept collection and is presented to the user at the end of the processing stage along with the rest of such concepts.
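This morphological check could be realized roughly as below. The sketch uses NLTK's WordNet interface in Python, whereas the actual system is written in C#, so treat it only as an illustration of the rule that a phrase is withheld when any of its words has no WordNet match; the candidate phrases are examples.

# Sketch: withhold a candidate concept phrase if any of its words cannot be
# matched against WordNet (requires `pip install nltk` and
# nltk.download("wordnet") beforehand).
from nltk.corpus import wordnet as wn

def phrase_in_wordnet(phrase):
    """True only if every word of the phrase has some WordNet entry."""
    for word in phrase.lower().split():
        base = wn.morphy(word) or word   # morphy() applies WordNet's morphological rules
        if not wn.synsets(base):
            return False
    return True

candidates = ["Popular Ball Games", "Basketball", "Sri Lanka"]
accepted = [c for c in candidates if phrase_in_wordnet(c)]
withheld = [c for c in candidates if not phrase_in_wordnet(c)]
# 'Sri Lanka' would typically end up in `withheld` and be shown to the user separately.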
The user has full discretion to decide whether these concepts should be added to the concept collection or not. At the end of this processing stage, the user is able to query for a concept in the collection and start building an ontology with the queried concept as the root concept. The system automatically populates the first level of concepts and, according to the user's needs, will populate additional levels as well (i.e. initially the ontology includes only the immediate concepts that have a sub-concept relationship with the root concept; at the user's discretion, the system will then treat each first-level concept as a parent concept and add its child concepts to it).
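The level-by-level population described above can be pictured with the following small sketch, which expands a taxonomy from extracted (parent, child) pairs starting at a queried root concept; it is an illustration only, with made-up pairs, not the system's own data structures.

# Sketch: expand an ontology level by level from extracted (parent, child)
# pairs, starting from a queried root concept.
from collections import defaultdict

def build_index(pairs):
    children = defaultdict(list)
    for parent, child in pairs:
        children[parent].append(child)
    return children

def expand(children, root, levels=1):
    """Return {concept: [sub-concepts]} down to the requested depth."""
    tree, frontier = {}, [root]
    for _ in range(levels):
        next_frontier = []
        for concept in frontier:
            tree[concept] = children.get(concept, [])
            next_frontier.extend(tree[concept])
        frontier = next_frontier
    return tree

index = build_index([("Ball", "Popular Ball Games"),
                     ("Popular Ball Games", "Baseball"),
                     ("Popular Ball Games", "Basketball")])
print(expand(index, "Ball", levels=1))  # only the immediate sub-concepts
print(expand(index, "Ball", levels=2))  # one more level, at the user's request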
B. Concepts From Keyword Clustering

As the second layer of our framework, we have implemented a keyword clustering method to identify the concepts related to a given concept. There have been several well-documented approaches and metrics for extracting keywords and measuring their relevance to documents. We have used the well-known TFIDF measure [13], which is generally accepted to be a good measure of the relevance of a term in a document, and have represented each document in the vector-space model using these keywords. The TFIDF measure is the product of the Term Frequency (TF) and the Inverse Document Frequency (IDF), defined as follows:

TF_{t,d} = n_{t,d} / Σ_k n_{k,d}

where n_{t,d} is the number of times the term t appears in document d, and the denominator sums the frequencies of all distinct terms k in document d.

IDF_i = log( |D| / |{d : t_i ∈ d}| )

where |D| is the number of documents in the corpus and |{d : t_i ∈ d}| is the number of documents in which the term t_i appears. The two are combined as

TFIDF_{i,d} = TF_{i,d} × IDF_i
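As a small, self-contained illustration of these formulas (a sketch only, independent of the system's implementation), the toy corpus and values below are invented for the example:

# Sketch: TF, IDF and TFIDF exactly as defined above, over a toy corpus.
import math
from collections import Counter

docs = [
    "ball sport game ball",
    "baseball is a ball game",
    "basketball is a popular sport",
]
tokenized = [d.split() for d in docs]

def tf(term, doc_tokens):
    counts = Counter(doc_tokens)
    return counts[term] / sum(counts.values())        # TF_{t,d}

def idf(term, corpus):
    df = sum(1 for doc in corpus if term in doc)      # |{d : t in d}|
    return math.log(len(corpus) / df) if df else 0.0  # log(|D| / df)

def tfidf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)   # TF x IDF

print(tfidf("ball", tokenized[0], tokenized))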
In selecting keywords from the document collection, we first defined and removed all ‘stopwords’ (words that appear frequently in sentences but carry little meaning with regard to the ontologies, such as prepositions and definite and indefinite articles) and then extracted all the distinct words in the collection to build the vector-space model for each document. Afterwards, by calculating the TFIDF measure for every word in this word list within the individual documents and imposing a threshold value, we were able to identify the keywords that correspond to a given document.
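A rough sketch of this step is given below: stopword removal, a shared vocabulary, one TFIDF vector per document, and a threshold for picking each document's keywords. The stopword list, the threshold value and the sample sentences are illustrative only; the actual system's choices may differ.

# Sketch: vector-space model with stopword removal and a TFIDF threshold
# for keyword selection.
import math
from collections import Counter

STOPWORDS = {"a", "an", "the", "is", "in", "and", "of"}  # illustrative list only

def build_vectors(raw_docs, threshold=0.05):
    docs = [[w for w in d.lower().split() if w not in STOPWORDS] for d in raw_docs]
    vocabulary = sorted({w for doc in docs for w in doc})
    df = {w: sum(1 for doc in docs if w in doc) for w in vocabulary}
    n_docs = len(docs)
    vectors, keywords = [], []
    for doc in docs:
        counts, total = Counter(doc), (len(doc) or 1)
        vec = [(counts[w] / total) * math.log(n_docs / df[w]) for w in vocabulary]
        vectors.append(vec)
        keywords.append([w for w, score in zip(vocabulary, vec) if score > threshold])
    return vocabulary, vectors, keywords

vocab, vectors, doc_keywords = build_vectors([
    "A ball is a round object used in sport and game",
    "Baseball is a popular ball game",
    "Basketball is a popular sport",
])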
With this vector-space representation of each document, the user is then given the opportunity to group the documents into a desired number of clusters. This is achieved through the k-means clustering algorithm [14], described below.

k-means Algorithm:
1) Choose k cluster centers randomly from the data points
2) Assign each data point to the closest cluster center based on the similarity measure
3) Re-compute the cluster centers
4) If the assignment of data points has changed (i.e. the process has not converged), repeat from step 2
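A compact sketch of these four steps follows, written in plain Python with the cosine similarity measure introduced below; the random initialization and the convergence handling are simplified relative to any production implementation.

# Sketch: k-means over document vectors, using cosine similarity to assign
# points and the element-wise mean to recompute the cluster centers.
import math
import random

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def kmeans(vectors, k, max_iter=100, seed=0):
    random.seed(seed)
    centers = random.sample(vectors, k)                  # step 1: random centers
    assignment = [None] * len(vectors)
    for _ in range(max_iter):
        new_assignment = [max(range(k), key=lambda c: cosine(v, centers[c]))
                          for v in vectors]              # step 2: closest center
        if new_assignment == assignment:                 # step 4: converged
            break
        assignment = new_assignment
        for c in range(k):                               # step 3: recompute centers
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers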
The similarity measure used in our approach is the cosine similarity [15] between two vectors, defined as follows. If A and B are two vectors in our vector space:

Cosine Similarity = cos(θ) = (A · B) / (||A|| ||B||)

Once these clusters have been formed, the user selects a concept in the raw ontology that is taking shape, and the system provides suggestions for that concept (excluding the concepts already added as its sub-concepts). To achieve this, the system looks for the cluster that contains the highest TFIDF value for that word (where a concept consists of more than one word, the system looks for all the clusters in which each word is located) and suggests keywords using two criteria:
1) Suggest the individual keywords of the document vectors of that cluster
2) Suggest the highest-valued keywords in the centroid vector of that cluster. The centroid vector is the single vector formed by summing the respective elements of all the vectors in the cluster
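The second criterion could look roughly like the sketch below; the vocabulary, vectors and cluster assignments are assumed to come from the earlier steps, and the top_n cut-off is an arbitrary choice for the example.

# Sketch: suggest the highest-valued keywords from a cluster's centroid,
# where the centroid is the element-wise sum of the cluster's document vectors.
def centroid_keywords(vocabulary, vectors, assignment, cluster_id, top_n=10):
    members = [v for v, a in zip(vectors, assignment) if a == cluster_id]
    if not members:
        return []
    centroid = [sum(col) for col in zip(*members)]       # element-wise sum
    ranked = sorted(zip(vocabulary, centroid), key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in ranked[:top_n]]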
C. Concepts From Sentence Pattern Matching

To enhance the accuracy and the comprehensiveness of the ontologies constructed through our system, we have proposed a sentence-analyzing module that works alongside the previous two layers. Several research attempts at identifying lexical patterns within text have been made, and we intend to utilize the best of these methods to enable the users of our system to enhance the ontologies they are constructing. A popular method for acquiring hyponymic relations was presented by [16], and we have begun to implement a similar approach in our system. The motivation for such syntactic processing comes from the fact that a significant number of common sentence patterns appear in the Wikipedia XML Corpus, and this will allow the system to provide better suggestions to the user in constructing the ontology.
In implementing this layer, we will incorporate a Part-of-Speech (POS) tagger and, with the help of the POS-tagged text, we intend to extract relations as follows. Sentences like “A ball is a round object that is used most often ...”, which appeared in the example XML file earlier, are evidence of common sentence patterns in the corpus. In this instance, “A ball is a round object” matches the pattern “{NP} is a {NP}” in the POS-tagged text, where ‘NP’ denotes a noun phrase. Several other patterns, such as “{NP} including {NP,}* and {NP}” (e.g. “All third world countries including Sri Lanka, India and Pakistan”), are candidates for very obvious relations, and this should enable us to make more comprehensive suggestions to the user. Again, these candidate word phrases will be validated against WordNet, and the decision to add them to the ontology will lie entirely at the user's discretion.
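As an illustration of the kind of matching intended, the sketch below applies plain regular expressions to raw sentences; the planned module would operate on POS-tagged text with proper noun-phrase chunks, so the patterns here (with capitalized word sequences standing in for noun phrases) are only a rough approximation.

# Sketch: naive "X is a Y" and "X including A, B and C" patterns. A real
# implementation would match noun-phrase chunks from a POS tagger instead of
# raw word sequences.
import re

NP = r"[A-Z]\w*(?: [A-Z]\w*)*"   # naive stand-in for a proper-noun phrase
IS_A = re.compile(r"\b([A-Z][\w ]*?) is an? ([a-z][\w ]*?)[\.,]")
INCLUDING = re.compile(fr"([A-Z][\w ]*?) including ((?:{NP}, )*{NP} and {NP})")

def relation_candidates(sentence):
    pairs = []
    m = IS_A.search(sentence)
    if m:
        pairs.append((m.group(2).strip(), m.group(1).strip()))  # (hypernym, hyponym)
    m = INCLUDING.search(sentence)
    if m:
        head = m.group(1).strip()
        for item in re.split(r", | and ", m.group(2)):
            pairs.append((head, item.strip()))
    return pairs

print(relation_candidates("A ball is a round object, used most often in sports."))
# [('round object', 'A ball')]
print(relation_candidates("All third world countries including Sri Lanka, India and Pakistan face this."))
# [('All third world countries', 'Sri Lanka'), ('All third world countries', 'India'),
#  ('All third world countries', 'Pakistan')]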
V. THE GRAPHICAL MODELING ENVIRONMENT

We have used C# as our language of choice for this system and Piccolo2D for the visualization of the ontology editor. The user is given the flexibility to add and delete concepts in the taxonomic hierarchy, as well as the capability to rename relations and concepts at their discretion (Figure 2).

Figure 2. WikiOnto Ontology Construction Environment

At the beginning of the ontology construction process, the user has the ability to query for a concept, irrespective of whether that concept exists in the initially extracted concept collection. The system generates the OWL definition for the ontology being built, and the user is able to export the ontology in this standard format.
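As an illustration of what such an export might contain, the fragment below writes a minimal OWL (RDF/XML) class hierarchy from parent/child concept pairs. It is a Python sketch of the idea rather than the system's actual export code, and the identifier cleanup and example pairs are simplifications.

# Sketch: serialize a concept taxonomy as minimal OWL (RDF/XML) classes with
# rdfs:subClassOf links between child and parent concepts.
HEADER = (
    '<?xml version="1.0"?>\n'
    '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"\n'
    '         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"\n'
    '         xmlns:owl="http://www.w3.org/2002/07/owl#">\n'
)

def to_owl(pairs):
    """pairs: iterable of (parent_concept, child_concept) strings."""
    def ident(name):
        return name.strip().replace(" ", "_")            # crude identifier cleanup
    lines = [HEADER]
    roots = {p for p, _ in pairs} - {c for _, c in pairs}
    for root_concept in sorted(roots):
        lines.append(f'  <owl:Class rdf:ID="{ident(root_concept)}"/>\n')
    for parent, child in pairs:
        lines.append(
            f'  <owl:Class rdf:ID="{ident(child)}">\n'
            f'    <rdfs:subClassOf rdf:resource="#{ident(parent)}"/>\n'
            f'  </owl:Class>\n'
        )
    lines.append("</rdf:RDF>\n")
    return "".join(lines)

print(to_owl([("Ball", "Popular Ball Games"),
              ("Popular Ball Games", "Baseball"),
              ("Popular Ball Games", "Basketball")]))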
VI. EVALUATION AND FUTURE WORK

Because the project is still in progress, a thorough evaluation of the system is not yet possible. However, the initial prototype was tested among undergraduates and several faculty members of the authors' university, and we received encouraging feedback along with several requests for additional features. Additionally, since the ontology being generated depends for the most part on the user's choices, a standard evaluation mechanism seems impractical. The most common evaluation methods, such as comparison with a gold standard (an ontology accepted to define the targeted domain accurately) or testing the results of the ontology in an application, seem inappropriate given the dependence of the system upon the user. Because our system is a facilitator rather than a fully-fledged ontology generator, we plan to evaluate the system through user trials towards the end of the project.

With the progress of this project, we have broadened our expectations of the potential applications that can make use of a system that takes semi-structured data and extracts topic ontologies from it. In particular, we are focusing on how the proposed project can be extended to other document sources, such as the Reuters Collection [17].
VII. CONCLUSION

In this paper, we have introduced WikiOnto: a system for extracting and modeling topic ontologies using the Wikipedia XML Corpus. We have presented the methodology of the system in detail, proposing a three-tiered approach to concept and relation extraction from the corpus as well as a development environment for modeling the ontology. The project is still continuing and is expected to produce successful outcomes in ontology extraction, as well as to prompt further research in ontology extraction and modeling.

We expect to make the system and its source code freely available. The extensibility to other document sources is still being tested, which is why it has been left out of this paper. We hope to announce those results in time for the final system to be available to this paper's intended audience.
REFERENCES

[1] T. R. Gruber, "A translation approach to portable ontology specifications," Knowl. Acquis., vol. 5, no. 2, pp. 199–220, 1993.
[2] Ontoprise, "OntoStudio," http://www.ontoprise.de/, 2007 (accessed 2009-05-08).
[3] L. Denoyer and P. Gallinari, "The Wikipedia XML Corpus," SIGIR Forum, 2006.
[4] S. Tratz, M. Gregory, P. Whitney, C. Posse, P. Paulson, B. Baddeley, R. Hohimer, and A. White, "Ontological annotation with WordNet."
[5] C. Fellbaum, Ed., WordNet: An Electronic Lexical Database. Cambridge, MA; London: The MIT Press, 1998. [Online]. Available: http://mitpress.mit.edu/catalog/item/default.asp?ttype=2&tid=8106
[6] C. Li and T. W. Ling, "From XML to Semantic Web," in Proceedings of the 10th International Conference on Database Systems for Advanced Applications, 2005, pp. 582–587.
[7] B. Omelayenko, "Learning of ontologies for the web: the analysis of existent approaches," in Proceedings of the International Workshop on Web Dynamics, 2001.
[8] N. Kozlova, "Automatic ontology extraction for document classification," Master's thesis, Saarland University, Germany, February 2005.
[9] D. Faure, C. Nédellec, and C. Rouveirol, "Acquisition of semantic knowledge using machine learning methods: The system ASIUM," Université Paris Sud, Tech. Rep., 1998.
[10] P. Velardi, R. Navigli, A. Cucchiarelli, and F. Neri, "Evaluation of OntoLearn, a methodology for automatic population of domain ontologies," in Ontology Learning from Text: Methods, Applications and Evaluation, P. Buitelaar, P. Cimiano, and B. Magnini, Eds. IOS Press, 2006.
[11] A. Maedche and S. Staab, "Semi-automatic engineering of ontologies from text," in Proceedings of the 12th International Conference on Software Engineering and Knowledge Engineering, 2000.
[12] "OntoGen: Semi-automatic ontology editor," in Human Interface, Part II, HCII 2007, M. Smith and G. Salvendy, Eds., 2007, pp. 309–318.
[13] G. Salton and C. Buckley, "Term-weighting approaches in automatic text retrieval," Information Processing and Management, 1988, pp. 513–523.
[14] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: A review," ACM Computing Surveys, 1999.
[15] G. Salton and C. Buckley, "Term-weighting approaches in automatic text retrieval," Information Processing and Management, 1988, pp. 513–523.
[16] M. A. Hearst, "Automatic acquisition of hyponyms from large text corpora," in Proceedings of the 14th International Conference on Computational Linguistics, 1992, pp. 539–545.
[17] M. Sanderson, "Reuters test collection," in BCS IRSG, 1994. [Online]. Available: citeseer.ist.psu.edu/sanderson94reuters.html