Proceedings of the 11th Annual Conference of Asia Pacific Decision Sciences Institute, Hong Kong, June 14-18, 2006, pp. 585-588.

ONTOLOGY EXTRACTION FOR THE SEMANTIC WEB: A NEW FRAMEWORK

Feng Li
School of Business Administration, South China University of Technology, Guangzhou, P.R. China
EMAIL: fenglee@scut.edu.cn

Ying Wei
Department of Systems Engineering & Engineering Management, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong
EMAIL: ywei@se.cuhk.edu.hk

ABSTRACT

Wrapping the traditional web with a semantic presentation is a challenging task. This paper proposes a new framework for ontology extraction that introduces a “self-learning” cycle and works even when pre-defined knowledge is lacking. The ontology is constructed in five steps: representing the organization of the website, recalling existing ontologies from an ontology base to recognize concepts and relations, extracting unrecognized concepts and relations with a clustering algorithm, amending the proposed ontology, and restoring the confirmed ontology for future reuse. Further, this framework can be integrated with other intelligent technologies. At the end of this paper, we develop a prototype system that extracts the ontology of the website of the Department of Automation and Computer-Aided Engineering, the Chinese University of Hong Kong.

Keywords: Ontology Extraction, Semantic Web

INTRODUCTION

The Semantic Web (SW), proposed by Berners-Lee as the next-generation Web, provides great benefits in Web Services, Internet Commerce, and other promising application areas [1]. However, the SW is still in its early stage and has many unsolved problems. One of them is transferring domain-specific ontologies into structured data for machine understanding, which we call ontology extraction. A naïve approach to ontology extraction is manual construction, which is time-consuming and error-prone and causes problems for future ontology maintenance.
Most current research focuses on methods that generate ontologies automatically or semi-automatically. PROMPT (formerly SMART) provides a semi-automatic approach to merging or aligning ontologies [2]. Similar work includes OBSERVER, ONIONS, OntoMorph, OntoMap, and GLUE. However, merging or aligning ontologies requires additional knowledge, e.g. instances, sharable instances, or linguistic ontologies. Another stream focuses on extracting ontologies from well-structured sources. For example, ERONTO is a tool that builds ontologies from extended E/R diagrams [3]. A third stream extracts ontologies from semi-structured or unstructured sources with certain auxiliary knowledge. A typical example is OntoMiner, which extracts ontologies from overlapping domain-specific websites that have been confirmed before the extraction [4]. Others recognize linguistic ontological information from plain text sources using knowledge-poor algorithms. A typical algorithm is Latent Semantic Indexing, a vector space approach that captures term-term statistical associations [5].

To create ontologies from diverse sources automatically or semi-automatically, most of this research needs various auxiliary sources. This paper proposes a new framework that introduces a “self-learning” cycle and works even when pre-defined knowledge is lacking. The ontology is constructed in five steps: representing the organization of the website, recalling existing ontologies from an ontology base to recognize concepts and relations, extracting unrecognized concepts and relations with a clustering algorithm, amending the proposed ontology, and restoring the confirmed ontology for future reuse. Furthermore, this process is implemented semi-automatically and can be integrated with other open intelligent technologies.

The rest of this paper is organized as follows. We describe the process of extracting ontologies from websites in Section 2.
In Section 3 we develop a prototype system using the example of the Department of Automation and Computer-Aided Engineering, the Chinese University of Hong Kong. Finally, we conclude the paper in Section 4.

ONTOLOGY EXTRACTION FROM WEBSITE

Before formalizing the ontology extraction process, we specify the notations and assumptions as follows.

Assumption 1 Each website is assumed to be an ontology instance. This assumption is similar to the specification in [4]: a website is said to be ‘taxonomy-directed’ if it contains at least one taxonomy for organizing its key concepts.

Assumption 2 Each Web page is assumed to be a concept instance. If no further ontology extraction is performed within a Web page, every Web page belonging to a website is a concept instance inside the ontology instance represented by the website.

Assumption 3 Each hyperlink, along with its contiguous hypertext, represents a relation instance.

Assumption 4 Web pages are assumed to be placed together if they are instances of the same concept. In other words, they are always sibling nodes in the graph that represents the organization of the website.

Assumption 5 Instances of the same concept are assumed to be similar, while instances of different concepts are dissimilar because the concepts they belong to are dissimilar.

Given the notations and assumptions, ontology extraction from a website consists of extracting the website organization into an ontology instance, web pages into concept instances, and hyperlinks into relation instances.

The Framework

By its nature, extracting an ontology from a website is a process of acquiring semantic knowledge from web documents.
Common problems in knowledge management, such as storing, adapting, and standardizing knowledge, are therefore also considered in ontology extraction, which is organized into five iterative phases: (1) represent the organization of the website; (2) recall previous ontologies to recognize concepts and relations; (3) extract concepts and relations; (4) amend the generated ontology; and (5) maintain the revised ontology.

[Figure 1. The framework of ontology extraction. Diagram labels: Website from Real World, Organization of Website, Ontology Base, Learned Ontology, Preliminary Ontology, Proposed Ontology, Confirmed Ontology, Common Knowledge, Output. Arrow labels: Represent, Recall, Extract, Amend, Maintain.]

As illustrated in Figure 1, the process is centered on an ontology-base, a repository for ontologies and their patterns. Once the URL of the target website is located, web pages are fetched to represent the organization of the website. Existing ontologies and their patterns are recalled from the ontology-base to recognize the concepts and relations in the fetched web pages. If no ontologies match, we call the concepts and relations unrecognized and use heuristic algorithms to group “similar” web pages and hyperlinks into clusters. Each cluster is an instance of a new concept or relation. The concepts or relations are then refined or revised by human experts or by other intelligent techniques. Finally, the accepted ontology is restored into the ontology-base for future reuse. The ontology-base thus grows steadily over its lifecycle.

As indicated by the dashed lines in Figure 1, common knowledge also plays an important role in each phase. Common knowledge here means general, loosely coupled, or domain-independent knowledge, as opposed to the specific and formal knowledge represented by the ontology.

Website Representation

Successful ontology extraction relies on how the organization of the target website is represented.
To represent the organization of the website, several steps are needed: fetch web pages and hyperlinks, label web pages and hyperlinks, prune useless pages and hyperlinks, and finally construct the organization of the target website.

Fetching the website involves four steps: (1) fetch the Web pages inside the target website from the remote server; (2) transform the fetched pages into well-formed web documents; (3) parse the pages and filter out trivial HTML tags such as “CENTER” and “HR”; (4) store the fetched website in a local repository for later use.

After all web pages are fetched, each page needs to be labeled to identify its main content. However, labeling is not easy because of the loose grammar of HTML files. For example, the “TITLE” tag of HTML is designed to provide information about an entire page, but our experimental results show that 30% of pages have the same text in the “TITLE” tag as the home page. Therefore we use both the information from incoming hyperlinks and the text embedded in the “TITLE” tag to identify pages.

Once the useless pages and hyperlinks are pruned, we use a breadth-first algorithm to construct the organization of the website. We traverse the downloaded web pages on the assumption that website developers organize web pages in a hierarchical architecture. We start from the URL of the website home page and establish the organization as follows: the home page of the website is the root node; the pages linked by outgoing hyperlinks of the home page are the first generation (child nodes); the pages linked by outgoing hyperlinks of the child nodes are the second generation (grandchild nodes); and so on. In this way, the organization of the website is represented as a hierarchical semantic structure.

Ontology Recall

Recognizing concepts and ontologies from historical patterns is easier than extracting them on the fly. The ontology recall process recalls ontologies and their patterns that are stored in the ontology-base.
It consists of the following three tasks: (1) search the ontology-base for concepts and their related patterns that match the label and content of pages; (2) use the recognized concepts and the ontology-base to explore the pages that remain unidentified; (3) replace recognized pages in the organization of the website with their concepts.

Intuitively, each concept of an ontology has some properties that can be used to identify its instances. Ideally these properties are specific enough to distinguish its instances from instances belonging to other concepts. In practice, however, these properties are often not apparent enough to recognize the concept. Moreover, a concept may be represented by many related patterns. Therefore both ontologies and their related patterns are stored in the ontology-base. We use common regular expressions to represent patterns; for example, the concept “time” is represented by “[0-2]?[0-9]:[0-5]?[0-9][pPaA]?[.][mM][.]?”.

Ontology Extraction

In the early stage of system operation, few concepts and patterns are available in the ontology-base, so pattern matching alone is not enough to recognize Web pages. Following Assumption 5, algorithms that cluster concept instances by their similarity are needed. Clearly, it is easier to extract a concept from concept instances if we can successfully identify and group those instances.

Similarity of concept instances is calculated by two approaches: syntactical similarity and semantical similarity. Syntactical similarity evaluation, also called the knowledge-poor approach, has the advantage that it is simple and easy to implement in practice. For example, in the vector space model, similarity is calculated by comparing the frequencies of terms occurring in two pages. On the other hand, semantical similarity evaluation, also called the knowledge-rich approach, needs more domain-specific knowledge to explain why two pages are similar and how similar they are.
This approach has advantages in precision and recall, if measures similar to those of information retrieval are used to evaluate its performance.

We calculate similarities and group pages based on their locations in the organization and on their similarities in the vector space model. Ambiguous patterns are then extracted from these pages, which are assumed to be instances of the same concept. These patterns are used for further matching and grouping. Finally, more clear-cut patterns are generated from the clustered concept instances to represent their concepts, and are restored into the ontology-base after possible refinement. For pages that have no similar pages, concepts are extracted only from their labels. At the beginning, the generated patterns are very specific and therefore match few pages. However, as more patterns are generated and stored in the ontology-base, the coverage increases.

Ontology Amendment

After an ontology is extracted and generated, it may need further refinement and revision before final output. This phase is called ontology amendment, and it requires both general knowledge and domain knowledge. Ontology amendment consists of three tasks: content amendment, structure amendment, and pattern amendment. The first redefines concepts and relations, while the second reconstructs the hierarchical structure of the ontology. Pattern amendment is optional, since it only refines the patterns extracted in the ontology extraction stage. The amended ontology should satisfy accepted criteria for sharing, reusing, and disseminating ontologies. The revision may be repeated several times to assure quality. Moreover, amendment is complicated because it requires higher intelligence, such as guidance from domain experts or artificial intelligence techniques. We provide a user-friendly graphical interface for manual ontology revision.
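
As an illustration of the grouping step in the extraction phase (pages grouped by their location in the site organization and by vector space similarity), the following minimal Python sketch clusters sibling pages whose term-frequency vectors have a high cosine similarity. It is not the prototype's implementation: the function names, the (url, parent, text) input shape, the greedy grouping rule, and the 0.5 threshold are all assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def term_vector(text):
    """Bag-of-words term frequencies for one page (lowercased tokens)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def group_siblings(pages, threshold=0.5):
    """Group pages that share a parent and are similar enough.

    pages: list of (url, parent_url, text) tuples (hypothetical input shape).
    threshold: assumed similarity cut-off, not a value from the paper.
    Returns a list of clusters, each a list of URLs.
    """
    # Assumption 4: instances of one concept are siblings, so only
    # pages with the same parent are compared.
    by_parent = defaultdict(list)
    for url, parent, text in pages:
        by_parent[parent].append((url, term_vector(text)))

    clusters = []
    for siblings in by_parent.values():
        groups = []  # each group is a list of (url, vector)
        for url, vec in siblings:
            for group in groups:
                # Greedy rule: join the first group that already
                # contains a sufficiently similar page.
                if any(cosine(vec, other) >= threshold for _, other in group):
                    group.append((url, vec))
                    break
            else:
                groups.append([(url, vec)])
        clusters.extend([u for u, _ in g] for g in groups)
    return clusters


if __name__ == "__main__":
    demo = [
        ("/staff/a", "/staff", "Name Alice Title Professor Office 101 Telephone Fax email"),
        ("/staff/b", "/staff", "Name Bob Title Lecturer Office 202 Telephone Fax email"),
        ("/staff/map", "/staff", "Campus map and driving directions"),
    ]
    print(group_siblings(demo))
```

In this toy input the two staff pages share most of their field names and fall into one cluster, which would then be a candidate instance set for a single concept, while the unrelated page stays alone and, as described above, would have its concept extracted from its label only.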
Ontology Maintenance

Ontology maintenance stores the extracted ontology back into the ontology-base. In general, it is a process of selecting which parts of the ontology to store and in what form. If the ontology describes a new concept, the entirely new ontology needs to be stored in the ontology-base as a new record. If the ontology is related to an existing ontology, the existing one should be revised. Finally, if the ontology already exists in the ontology-base, it should be discarded.

Ontology maintenance also performs another important function: it standardizes the ontology using ontology representation languages. This standardization supports ontology sharing and reuse. Applications should be able to import other ontology resources, integrate them into the ontology-base, and publish the local ontology. However, the lack of a well-accepted ontology representation language hampers broader distribution of ontologies. Our system supports available standards such as RDF and DAML+OIL.

IMPLEMENTATION

We implemented a prototype system in Java with the J2SE Development Kit 5.0, using the Xerces2 Java Parser 2.5.0 for forming and parsing well-formed Web documents. MySQL database server 4.1 is used to build the ontology database together with the local repository. We tested the prototype system on the website of the Department of Automation and Computer-Aided Engineering, the Chinese University of Hong Kong (http://www.acae.cuhk.edu.hk/en/). The stored ontology of “Department” is shown in Figure 2.

[Tree diagram. Node labels: Department; Program; Staff; Student; Research; Course; Admission Requirement; Non-Academic Staff; Academic Staff; Project; Grant; Facility; Laboratory; Equipment; News; News; Seminar. Attribute labels: Name; Telephone; Office; email. Name; Title; Office; Telephone; Fax; email; website; research interest; laboratory. Date; Content. Title; abstract; Date; Venue; Speaker; Biography.]
Figure 2.
Example: Ontology extracted from ACAE department

CONCLUSION

In this paper we propose a new framework for semi-automatic ontology extraction. Being a self-learning process, this model can progressively enhance its extraction ability with little or no initial pre-defined knowledge. Our preliminary experimental results are acceptable and should become more convincing as the ontology-base grows. In addition, this new model can be integrated with other open intelligent technologies, which could be a topic of our future research.

REFERENCES

[1] Berners-Lee, T., Hendler, J., & Lassila, O. “The Semantic Web”, Scientific American, 2001, 284(5): 34-43.
[2] Noy, N., & Musen, M. “PROMPT: Algorithm and Tool for Automated Ontology Merging and Alignment”, In Proceedings of the 17th National Conference on Artificial Intelligence and 12th Conference on Innovative Applications of Artificial Intelligence, 2000, Austin, USA.
[3] Upadhyaya, S.R., & Kumar, P.S. “ERONTO: A Tool for Extracting Ontologies from Extended E/R Diagrams”, In Proceedings of the 2005 ACM Symposium on Applied Computing, 2005, Santa Fe, USA.
[4] Davulcu, H., Vadrevu, S., & Nagarajan, S. “OntoMiner: Bootstrapping Ontologies from Overlapping Domain Specific Web Sites”, In Proceedings of the 13th International World Wide Web Conference, 2004, New York, USA.
[5] Maddi, G.R., Velvadapu, C.S., Srivastava, S., & Lamadrid, J.G. “Ontology Extraction from Text Documents by Singular Value Decomposition”, In Proceedings of ADMI 2001, 2001, Hampton, USA.