Ontology Extraction for the Semantic Web: A New Framework

Proceedings of the 11th Annual Conference
of Asia Pacific Decision Sciences Institute
Hong Kong, June 14-18, 2006, pp. 585-588.
Feng Li
School of Business Administration,
South China University of Technology, Guangzhou, P.R. China
EMAIL: fenglee@scut.edu.cn
Ying Wei
Department of Systems Engineering & Engineering Management,
The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong
EMAIL: ywei@se.cuhk.edu.hk
ABSTRACT
Wrapping the traditional Web with a semantic representation is a challenging task. This paper proposes a new framework
for ontology extraction that introduces a “self-learning” cycle and works even when pre-defined knowledge is lacking.
The ontology is constructed in five steps: representing the organization of the website, recalling existing ontologies
from an ontology base to recognize concepts and relations, extracting unrecognized concepts and relations with a
clustering algorithm, amending the proposed ontology, and storing the confirmed ontology for future reuse. Further, the
framework can be integrated with other intelligent technologies. At the end of the paper, we present a prototype system
that extracts the ontology of the website of the Department of Automation and Computer-Aided Engineering, the Chinese
University of Hong Kong.
Keywords: Ontology Extraction, Semantic Web
INTRODUCTION
The Semantic Web (SW), proposed by Berners-Lee as the next-generation Web, promises great benefits for Web Services,
Internet Commerce, and other application areas [1]. However, the SW is still at an early stage and has many unsolved
problems. One of them is transforming domain-specific ontologies into structured data that machines can understand,
which we call ontology extraction.
A naïve approach to ontology extraction is manual construction, which is time-consuming and error-prone and
complicates later ontology maintenance. Most current research therefore explores methods for generating ontologies
automatically or semi-automatically. PROMPT (formerly SMART) provides a semi-automatic approach to merging or aligning
ontologies [2]. Similar work includes OBSERVER, ONIONS, OntoMorph, OntoMap, and GLUE. However, these approaches require
additional knowledge, e.g. instances, sharable instances, or linguistic ontologies, to merge or align ontologies.
Another stream focuses on extracting ontologies from well-structured sources. For example, ERONTO is a tool that builds
ontologies from extended E/R diagrams [3]. A third stream extracts ontologies from semi-structured or unstructured
sources with the help of certain auxiliary knowledge. A typical example is OntoMiner, which extracts ontologies from
overlapping domain-specific websites that have been confirmed before the extraction [4]. Other work recognizes
linguistic ontological information in plain-text sources using knowledge-poor algorithms. A typical algorithm is
Latent Semantic Indexing, a vector space approach that captures term-term statistical relationships [5].
To create ontologies from diverse sources automatically or semi-automatically, most existing research relies on various
auxiliary sources. This paper proposes a new framework that introduces a “self-learning” cycle and works even when
pre-defined knowledge is lacking. The ontology is constructed in five steps: representing the organization of the
website, recalling existing ontologies from an ontology base to recognize concepts and relations, extracting
unrecognized concepts and relations with a clustering algorithm, amending the proposed ontology, and storing the
confirmed ontology for future reuse. Furthermore, the process is implemented semi-automatically and can be integrated
with other open intelligent technologies.
The rest of this paper is organized as follows. Section 2 describes the process of extracting an ontology from a
website. Section 3 presents a prototype system, using the website of the Department of Automation and Computer-Aided
Engineering, the Chinese University of Hong Kong, as an example. Section 4 concludes the paper.
ONTOLOGY EXTRACTION FROM WEBSITE
Before formalizing the ontology extraction process, we state the following assumptions.
Assumption 1 Each website is assumed to be an ontology instance. This assumption is similar to the specification in
[4]: a website is said to be ‘taxonomy-directed’ if it contains at least one taxonomy for organizing its key concepts.
Assumption 2 Each Web page is assumed to be a concept instance. If no further ontology extraction is performed within
a Web page, every Web page belonging to a website is a concept instance inside the ontology instance represented by
that website.
Assumption 3 Each hyperlink, along with its contiguous hypertext, represents a relation instance.
Assumption 4 Web pages are assumed to be placed together if they are instances of the same concept. In other words,
they are always sibling nodes in the graph that represents the organization of the website.
Assumption 5 Instances of the same concept are assumed to be similar, while instances of different concepts are
dissimilar because the concepts they belong to differ.
Given these assumptions, ontology extraction from a website consists of extracting the organization of the website
into an ontology instance, Web pages into concept instances, and hyperlinks into relation instances.
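One way to read Assumptions 1 to 3 is as a small data model. The sketch below is only an illustration; the class and field names are hypothetical and do not come from the prototype.

```python
from dataclasses import dataclass, field


@dataclass
class PageInstance:
    """A Web page, treated as a concept instance (Assumption 2)."""
    url: str
    label: str = ""


@dataclass
class LinkInstance:
    """A hyperlink plus its anchor text, treated as a relation instance (Assumption 3)."""
    source: str       # URL of the page that contains the link
    target: str       # URL the link points to
    anchor_text: str  # hypertext attached to the link


@dataclass
class SiteInstance:
    """A website, treated as an ontology instance (Assumption 1)."""
    home_url: str
    pages: dict = field(default_factory=dict)  # url -> PageInstance
    links: list = field(default_factory=list)  # LinkInstance objects

    def add_link(self, source, target, anchor_text):
        # Register both endpoints as concept instances, then record the relation.
        for url in (source, target):
            self.pages.setdefault(url, PageInstance(url))
        self.links.append(LinkInstance(source, target, anchor_text))


site = SiteInstance("http://example.org/")
site.add_link("http://example.org/", "http://example.org/staff", "Staff")
```

Here `http://example.org/` is a placeholder URL, not the website used in the experiment.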
The Framework
By its nature, extracting an ontology from a website is a process of acquiring semantic knowledge from Web documents.
Common problems in knowledge management, such as storing, adapting, and standardizing knowledge, also arise in ontology
extraction. We therefore organize the process into five iterative phases: (1) represent the organization of the
website; (2) recall previous ontologies to recognize concepts and relations; (3) extract concepts and relations;
(4) amend the generated ontology; and (5) maintain the revised ontology.
[Figure 1 diagram: an Ontology Base at the center of a cycle of five phases (Represent, Recall, Extract, Amend,
Maintain). Recoverable node labels: WebSite from Real World, Organization of WebSite, Learned Ontology, Preliminary
Ontology, Proposed Ontology, Confirmed Ontology, Common Knowledge, Output.]
Figure 1. The framework of ontology extraction
As illustrated in Figure 1, the process is centered on an ontology-base, a repository for ontologies and their patterns.
Once the URL of the target website is given, Web pages are fetched to represent the organization of the website.
Existing ontologies and their patterns are recalled from the ontology-base to recognize the concepts and relations in
the fetched Web pages. Concepts and relations for which no ontology matches are called unrecognized, and heuristic
algorithms group “similar” Web pages and hyperlinks into clusters. Each cluster represents a new concept or relation.
The concepts and relations are then refined or revised by human experts or by other intelligent techniques. Finally,
the accepted ontology is stored back into the ontology-base for future reuse, so the ontology-base grows throughout
its lifecycle.
As indicated by the dashed lines in Figure 1, common knowledge also plays an important role in each phase. Common
knowledge here means general knowledge that is loosely coupled or domain-independent, as opposed to the specific and
formal knowledge represented by the ontology.
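The self-learning character of the cycle can be illustrated with a deliberately simplified sketch. It assumes that a website is reduced to a list of page labels and that the ontology-base is a plain set of concept names; the function name `run_cycle` and the `amend` callback are hypothetical, and the extraction step is only a placeholder for clustering.

```python
def run_cycle(page_labels, ontology_base, amend=lambda concepts: concepts):
    """One pass of the recall / extract / amend / maintain cycle.

    `page_labels` stands in for the represented website and
    `ontology_base` is a set of known concept names (both simplified).
    """
    # Recall: labels already known to the ontology-base are recognized.
    recognized = {label for label in page_labels if label in ontology_base}
    # Extract: every unrecognized label becomes a candidate concept
    # (a placeholder for the clustering step).
    proposed = set(page_labels) - recognized
    # Amend: a human expert or another technique may revise the proposal.
    confirmed = set(amend(proposed))
    # Maintain: confirmed concepts are stored for future reuse.
    ontology_base |= confirmed
    return recognized, confirmed


base = set()
first = run_cycle(["Staff", "Program"], base)
second = run_cycle(["Staff", "Seminar"], base)
```

On the first pass nothing is recognized and both labels are stored; on the second pass "Staff" is recalled from the ontology-base and only "Seminar" needs to be extracted.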
Website Representation
Successful ontology extraction depends on how the organization of the target website is represented. Representing the
organization takes several steps: fetch Web pages and hyperlinks, label them, prune useless pages and hyperlinks, and
finally construct the organization of the target website.
Fetching the website involves four steps: (1) fetch the Web pages inside the target website from the remote server;
(2) transform the fetched pages into well-formed Web documents; (3) parse the pages and filter out trivial HTML tags
such as “CENTER” and “HR”; (4) store the fetched website in a local repository for later use. After all Web pages are
fetched, each page needs to be labeled to identify its main content. However, labeling is not easy because HTML
grammar is loosely enforced. For example, the “TITLE” tag is designed to provide information about an entire page, but
our experimental results show that 30% of pages carry the same “TITLE” text as the home page. Therefore we use both
the information from incoming hyperlinks and the text in the “TITLE” tag to identify pages.
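The text does not say how the two sources of label information are combined. A minimal sketch, assuming one plausible rule (prefer the TITLE text unless it is empty or merely repeats the home page title, otherwise use the most frequent incoming anchor text), could look as follows; `label_page` and its parameters are hypothetical names.

```python
from collections import Counter


def label_page(title, home_title, incoming_anchor_texts):
    """Choose a label for a page.

    Assumed rule: trust the TITLE text unless it is empty or only
    repeats the home page title; otherwise fall back to the most
    frequent anchor text among incoming hyperlinks.
    """
    title = (title or "").strip()
    if title and title != home_title.strip():
        return title
    anchors = [a.strip() for a in incoming_anchor_texts if a.strip()]
    if anchors:
        return Counter(anchors).most_common(1)[0][0]
    return title


# A page whose TITLE just repeats the home page title is labeled by its anchors.
label = label_page("ACAE", "ACAE", ["Staff", "Staff", "People"])
```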
Once the useless pages and hyperlinks are pruned, we use a breadth-first algorithm to construct the organization of the
website. We traverse the downloaded Web pages on the assumption that website developers organize Web pages
hierarchically. Starting from the URL of the home page, we establish the organization as follows: the home page is the
root node; the pages linked by outgoing hyperlinks of the home page are the first generation (child nodes); the pages
linked by outgoing hyperlinks of the child nodes are the second generation (grandchild nodes); and so on. In this way,
the organization of the website is represented as a hierarchical semantic structure.
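The breadth-first construction described above can be sketched as follows. The sketch assumes that the fetched site is available as an in-memory mapping from each page to its outgoing links, and that a page reachable by several paths is attached to the first page that reaches it; both are simplifying assumptions, and the function names are hypothetical.

```python
from collections import deque


def build_hierarchy(home, outlinks):
    """Breadth-first construction of a site tree.

    `outlinks` maps each page URL to the list of URLs it links to.
    Returns a dict mapping each page to its parent (None for the root).
    """
    parent = {home: None}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for child in outlinks.get(page, []):
            # A page is attached to the first page that reaches it,
            # i.e. at its shallowest depth; later links are ignored.
            if child not in parent:
                parent[child] = page
                queue.append(child)
    return parent


def depth(page, parent):
    """Generation of a page: 0 for the home page, 1 for its children, and so on."""
    d = 0
    while parent[page] is not None:
        page = parent[page]
        d += 1
    return d


links = {
    "/": ["/staff", "/research"],
    "/staff": ["/staff/academic", "/"],
    "/research": ["/staff/academic", "/research/projects"],
}
tree = build_hierarchy("/", links)
```

In this example the back link from "/staff" to "/" and the second link to "/staff/academic" are ignored, so the result is a tree rather than a general graph.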
Ontology Recall
Recognizing concepts and ontologies from historical patterns is easier than extracting them on the fly. The ontology
recall process recalls the ontologies and patterns stored in the ontology-base. It consists of three tasks: (1) search
the ontology-base for concepts and related patterns that match the labels and content of pages; (2) use the recognized
concepts and the ontology-base to explore the pages that remain unidentified; (3) replace the recognized pages in the
organization of the website with their concepts.
Intuitively, each concept of an ontology has properties that can be used to identify its instances. Ideally these
properties are specific enough to distinguish its instances from those of other concepts. In practice, however, such
properties are often not apparent enough to recognize the concept, and a concept may be represented by many related
patterns. Therefore both ontologies and their related patterns are stored in the ontology-base. We use common regular
expressions to represent patterns; for example, the concept “time” is represented by
“[0-2]?[0-9]:[0-5]?[0-9][pPaA]?[.][mM][.]?”.
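As an illustration of pattern-based recall, the sketch below applies the “time” expression quoted above with Python's `re` module. The pattern table and the `recognize` function are hypothetical stand-ins for the patterns stored in the ontology-base.

```python
import re

# Pattern table: concept name -> regular expression.
# The "time" pattern is copied from the text; the table itself is a
# hypothetical stand-in for patterns stored in the ontology-base.
PATTERNS = {
    "time": re.compile(r"[0-2]?[0-9]:[0-5]?[0-9][pPaA]?[.][mM][.]?"),
}


def recognize(text):
    """Return the concepts whose pattern occurs somewhere in `text`."""
    return [name for name, rx in PATTERNS.items() if rx.search(text)]


hits = recognize("Seminar starts at 10:30a.m. in Room 5")
misses = recognize("Office: Room 514")
```

Note that the expression as printed requires the ".m" suffix, so a 24-hour time such as "14:00" is not matched; whether that is intended is not stated in the text.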
Ontology Extraction
At the early stage of system operation, few concepts and patterns are available in the ontology-base, so pattern
matching alone is not enough to recognize Web pages. Following Assumption 5, we need algorithms that cluster concept
instances by their similarity. Extracting a concept from its instances is easier once the instances have been
identified and grouped.
Similarity of concept instances can be calculated by two approaches: syntactic similarity and semantic similarity.
Syntactic similarity evaluation, also called the knowledge-poor approach, has the advantage of being simple and easy
to implement in practice. For example, in the vector space model, similarity is calculated by comparing the
frequencies of the terms that occur in two pages. Semantic similarity evaluation, also called the knowledge-rich
approach, needs more domain-specific knowledge to explain why two pages are similar and how similar they are. This
approach has advantages in precision and recall, if these measures are used to evaluate performance as in information
retrieval.
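A minimal sketch of the syntactic (vector space) comparison is given below. The text does not name a specific measure; cosine similarity over raw term frequencies is assumed here because it is a common choice, and the tokenization is deliberately crude.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Bag-of-words term frequencies (lowercased alphabetic tokens)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


p1 = term_vector("Professor of robotics, office and telephone")
p2 = term_vector("Professor of control, office and telephone")
p3 = term_vector("Seminar on vision: date, venue, speaker")

same_kind = cosine_similarity(p1, p2)   # five of six terms shared
different = cosine_similarity(p1, p3)   # no terms shared
```

The two staff-like pages share five of their six terms and score 5/6, while the seminar page shares none and scores 0.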
We group pages based on their locations in the organization and on their similarities under the vector space model.
Initially ambiguous patterns are then extracted from the grouped pages, which are assumed to be instances of the same
concept. These patterns are used for further matching and grouping. Finally, more clear-cut patterns are generated
from the clustered concept instances to represent their concepts, and are stored in the ontology-base after possible
refinement. For pages that have no similar pages, concepts are extracted from their labels only.
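The grouping step can be sketched as a greedy, single-pass clustering of sibling pages. This is not necessarily the heuristic used in the prototype: the cosine measure, the threshold of 0.5, and the function names are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity of two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def cluster_siblings(pages, threshold=0.5):
    """Greedy single-pass grouping of sibling pages.

    `pages` maps a page label to its text. A page joins the first cluster
    whose first member is at least `threshold` similar; otherwise it
    starts a new cluster. The threshold value is an arbitrary assumption.
    """
    vectors = {label: Counter(text.lower().split()) for label, text in pages.items()}
    clusters = []
    for label in pages:
        for cluster in clusters:
            if cosine(vectors[label], vectors[cluster[0]]) >= threshold:
                cluster.append(label)
                break
        else:
            # No sufficiently similar cluster exists: start a new one.
            clusters.append([label])
    return clusters


siblings = {
    "Dr. A": "name title office telephone fax email research interest",
    "Dr. B": "name title office telephone email research interest laboratory",
    "Seminar": "title abstract date venue speaker biography",
}
groups = cluster_siblings(siblings)
```

The two staff pages end up in one cluster, while the seminar page forms a singleton; singletons correspond to the pages with no similar pages, whose concepts are taken from their labels.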
Patterns generated early on are very specific and therefore recognize few pages. However, as more patterns are
generated and stored in the ontology-base, coverage increases.
Ontology Amendment
After an ontology is extracted and generated, it may need further refinement and revision before final output. This
phase, called ontology amendment, requires both general knowledge and domain knowledge. Ontology amendment consists of
three tasks: content amendment, structure amendment, and pattern amendment. The first revises the definitions of
concepts and relations, while the second reconstructs the hierarchical structure of the ontology. Pattern amendment,
which refines the patterns extracted in the ontology extraction phase, is optional. The amended ontology should
satisfy accepted criteria for sharing, reusing, and disseminating ontologies. The revision may be repeated several
times to assure quality. Moreover, amendment is complicated because it requires higher intelligence, such as guidance
from domain experts or artificial intelligence techniques. We provide a friendly graphical user interface for manual
ontology revision.
Ontology Maintenance
Ontology maintenance stores the extracted ontology back into the ontology-base. In general, it is a process of
selecting which parts of the ontology to store and in what form. If the ontology describes a new concept, the entire
ontology is stored in the ontology-base as a new record. If the ontology is related to an existing ontology, the
existing one is revised. Finally, if the ontology already exists in the ontology-base, it is discarded.
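The three maintenance cases can be sketched as a small decision function. The sketch assumes a flat ontology-base that maps an ontology name to a set of concept names, and it treats two ontologies as related when they share a name; both are simplifying assumptions, and `maintain` is a hypothetical name.

```python
def maintain(ontology_base, name, concepts):
    """Decide how an extracted ontology enters the ontology-base.

    `ontology_base` maps an ontology name to its set of concept names.
    Returns "stored", "revised", or "discarded".
    """
    new = set(concepts)
    existing = ontology_base.get(name)
    if existing is None:
        ontology_base[name] = new   # entirely new ontology: new record
        return "stored"
    if new <= existing:
        return "discarded"          # already present: nothing to add
    existing |= new                 # related ontology: revise in place
    return "revised"


base = {}
r1 = maintain(base, "Department", {"Staff", "Program"})
r2 = maintain(base, "Department", {"Staff"})
r3 = maintain(base, "Department", {"Staff", "Seminar"})
```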
Ontology maintenance also performs another important function: it standardizes the ontology using ontology
representation languages. This standardization supports ontology sharing and reuse. Applications should be able to
import other ontology resources, integrate them into the ontology-base, and publish the local ontology. However, the
lack of a well-accepted ontology representation language hampers broader distribution of ontologies. Our system
supports existing standards such as RDF and DAML+OIL.
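As a rough illustration of publishing a concept hierarchy in RDF, the sketch below writes subclass relations in the N-Triples serialization using the RDF Schema `subClassOf` property. The namespace `http://example.org/dept#` and the subclass pair are illustrative assumptions, and the sketch does not reproduce the output format of the prototype.

```python
RDFS_SUBCLASS = "http://www.w3.org/2000/01/rdf-schema#subClassOf"


def to_ntriples(subclass_pairs, namespace):
    """Serialize (child, parent) concept pairs as N-Triples lines."""
    lines = []
    for child, parent in subclass_pairs:
        # Build IRIs by dropping spaces from the concept names.
        s = namespace + child.replace(" ", "")
        o = namespace + parent.replace(" ", "")
        lines.append(f"<{s}> <{RDFS_SUBCLASS}> <{o}> .")
    return "\n".join(lines)


doc = to_ntriples([("Academic Staff", "Staff")], "http://example.org/dept#")
```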
IMPLEMENTATION
We implemented a prototype system in Java (J2SE Development Kit 5.0), using the Xerces2 Java Parser 2.5.0 to produce
and parse well-formed Web documents. MySQL database server 4.1 is used to build the ontology database together with
the local repository. We tested the prototype on the website of the Department of Automation and Computer-Aided
Engineering, the Chinese University of Hong Kong (http://www.acae.cuhk.edu.hk/en/). The stored ontology of
“Department” is shown in Figure 2.
[Figure 2 diagram: ontology tree rooted at Department. Recoverable concept labels: Program, Staff, Student, Research,
Course, Admission Requirement, Non-Academic Staff, Academic Staff, Project, Grant, Facility, Laboratory, Equipment,
News, Seminar. Recoverable attribute labels: Name; Telephone; Office; email; Title; Fax; website; research interest;
laboratory; Date; Content; abstract; Venue; Speaker; Biography.]
Figure 2. Example: Ontology extracted from ACAE department
CONCLUSION
In this paper we propose a new framework for semi-automatic ontology extraction. Because it is a self-learning
process, the model can progressively improve its extraction ability with little or no initial pre-defined knowledge.
Our preliminary experimental results are acceptable and should become more convincing as the ontology-base grows. In
addition, the model can be integrated with other open intelligent technologies, which is a direction for our future
research.
REFERENCES
[1] Berners-Lee, T., Hendler, J., & Lassila, O. “The Semantic Web”, Scientific American, 2001, 284(5): 34-43.
[2] Noy, N., & Musen, M. “PROMPT: Algorithm and Tool for Automated Ontology Merging and Alignment”, In
Proceedings of the 17th National Conference on Artificial Intelligence and 12th Conference on Innovative
Applications of Artificial Intelligence, 2000, Austin, USA.
[3] Upadhyaya, S.R., & Kumar, P.S. “ERONTO: A Tool for Extracting Ontologies from Extended E/R Diagrams”, In
Proceedings of the 2005 ACM Symposium on Applied Computing, 2005, Santa Fe, USA.
[4] Davulcu, H., Vadrevu, S., & Nagarajan, S. “OntoMiner: Bootstrapping Ontologies from Overlapping Domain
Specific Web Sites”, In Proceedings of the 13th International World Wide Web Conference, 2004, New York, USA.
[5] Maddi, G.R., Velvadapu, C.S., Srivastava, S., & Lamadrid, J.G. “Ontology Extraction from Text Documents by
Singular Value Decomposition”, In Proceedings of ADMI 2001, 2001, Hampton, USA.