LinKFactory® : an Advanced Formal Ontology Management System

Werner Ceusters, Peter Martens
Language and Computing (L&C)
Het Moorhof, Hazenakkerstraat 20 A
9520 Zonnegem, Belgium

Abstract
As the web becomes more and more a worldwide platform for e-commerce, the creation of formal ontologies in all business sectors becomes crucial. It will become increasingly important to have computers understand the real meaning of the content of web pages and of the data in the databases lying behind them. The real challenge will be to create formal ontologies that are processable and exchangeable by machines. This paper describes LinKFactory®, a formal terminology and ontology management system that makes the creation and management of large-scale, complex, multilingual and formal ontologies possible. We will explain the possibilities of LinKFactory® based on our experiences in creating a formal representation of the medical world, named LinKBase®, and linking it to several third-party ontologies.

Keywords
Formal ontologies, semantic/linguistic knowledge base, ontology management system, terminology management system

INTRODUCTION
For many years, numerous teams, mainly of academic origin, have been working on systems that can handle terminology and the complex relations between individual terms. All those systems suffer from at least one of the following major drawbacks:
- Insufficiently formalised (designed for human use, not machine use)
- Not capable of handling the large numbers of knowledge objects that an adequate ontology requires
- Not designed to handle linguistic aspects, sometimes not even multiple language entries

To make the semantic web a success, these three drawbacks will have to be overcome. Knowledge will have to be formalized so that machines worldwide have a shared and common understanding of the information provided. The systems developed will have to be able to handle enormous amounts of information very fast. As the web is a universal system, different languages will have to be supported, i.e. the systems and the ontologies developed will have to be language-independent, yet linkable to all languages.

The problem that researchers and industry face is to build and maintain an environment that makes it possible to create the large formal ontologies that are needed while keeping processing time at a minimum. Formal ontology has recently been defined as the systematic, formal, axiomatic development of the logic of all forms and modes of being. Management systems for smaller ontologies have been developed (ODE, WebOnto, Ontolingua, HoZo, JOE, Protégé, OntoSaurus, …), but none of these is capable of dealing with the enormous and complex ontologies that will be needed to support the semantic web.

To resolve these problems (initially for the medical environment), L&C worked on creating a formal ontology management system called LinKFactory®. The intent of the project was to implement a knowledge representation and a compatible reasoning mechanism in a database structure. Among the objectives set for developing the data model were:
- The ability to fully model a classification (ontology) of concepts with all their relevant relationships and definitions.
- The ability to connect languages with this conceptual model and use it for natural language understanding.
- The ability to connect the resulting association of terms and concepts with third-party terminology systems such as SNOMED or ICD-9.
- All entities in the database should be versioned so that references can be made to older versions of objects without losing that information.

During the course of the project several extra capabilities were added to these base requirements; they enrich the structure and allow for even more sophisticated features.
All this had to be modelled as efficiently as possible, and in such a way that it would allow easy manipulation from an application layer.

THE TOOL : LinKFactory®

General Description
LinKFactory® is the formal ontology management system developed by L&C and used to build and manage the medical linguistic knowledge base LinKBase®. LinKFactory® is a tool that can be used to build large and complex language-independent formal ontologies. “Language-independent” has to be understood as independence from any specific language (such as English, French, Dutch, …), but not from language as a medium of communication. Unlike most existing ontology editors, it is not limited to small ontologies.

The fact that the ontologies are language-independent has major consequences for the type of applications that can run on top of them. It will, for example, be much easier to search for relevant information on the web (or in a thesaurus): the search can be done in one language in free text. This free-text search is linked to language-independent concepts (based on the semantics) that form the basis for the information retrieval. Since terms in several languages are attached to the concepts using a linguistic ontology, relevant information in other languages can also be retrieved, while semantically irrelevant information will not appear in the list of results.

System Architecture of LinKFactory®
LinKFactory® stores the data in a relational database (we currently use Oracle). Access to the database is abstracted away by a set of functions that are “natural” when dealing with ontologies: get-children, find-path, join concepts, get terms for concept X, … One of the main requirements of the project was that a server-side component should be developed that would allow developers to use a standardized API to program applications on top of the semantic database without requiring intimate knowledge of the internal structure of the database.
This component would also have to be database-independent (Oracle, Sybase and SQL Server have been tested), capable of dealing with multiple concurrent users, and stable. LinKFactory® is also platform-independent (tested on Windows, Solaris, Unix and Linux). Combining all these requirements made it clear that Java was to be the platform of choice, since it supports all of the above and has become a stable and mature technology in recent years. We finally settled on RMI (Remote Method Invocation) as our technology of choice because of its simplicity and proven robustness. This means that our server-side component is a Java application that exposes a remote interface extending java.rmi.Remote. The application requires an RMI registry (a sort of Domain Name Server for RMI servers) to be running so that it can register itself and so that clients can connect to the RMI server.

The LinKFactory® system consists of two major components (see figure 1): the LinKFactory® Server and the LinKFactory® Workbench (the client-side component). The LinKFactory® Workbench allows the user to browse and model the LinKBase® data.

Figure 1 : LinKFactory® components

The workbench is a dynamic framework for the LinKFactory® beans. Each bean has its own specific functionality and a limited view of the underlying formal ontology, but combining a set of beans in the workbench gives the user a powerful, customizable tool to view and manage the data stored in the semantic database. Different views on the semantic network are implemented as Java beans. Examples are: Concept tree, Concept criteria and full definitions, Linktype tree, Criteria list, Term list, Search pane, Properties panel, Reverse relations, … The LinKFactory® framework is implemented in 100% pure Java code.
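The RMI-based server component described in the system architecture can be illustrated with a minimal sketch. It is not the actual LinKFactory® API: the interface name `OntologyService`, the method signature of `getChildren`, the in-memory implementation and the helper `publish` are all assumptions made for illustration. Only the `java.rmi` classes are standard Java.

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.rmi.server.UnicastRemoteObject;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical remote API. The method mirrors the "get-children" function
// named in the text; its signature is an assumption.
interface OntologyService extends Remote {
    List<String> getChildren(String conceptName) throws RemoteException;
}

// In-memory stand-in for the server side. The real system reads from a
// relational database; a map keeps this sketch self-contained.
final class InMemoryOntologyService implements OntologyService {
    private final Map<String, List<String>> children = new HashMap<>();

    void addChild(String parent, String child) {
        children.computeIfAbsent(parent, key -> new ArrayList<>()).add(child);
    }

    @Override
    public List<String> getChildren(String conceptName) {
        List<String> found = children.get(conceptName);
        // Return a serializable copy so the result can travel over RMI.
        return found == null ? new ArrayList<String>() : new ArrayList<String>(found);
    }
}

final class OntologyServer {
    // Starts a registry in this JVM, exports the service and binds it under
    // the given name so that clients can look it up.
    static Registry publish(OntologyService service, String name, int port)
            throws RemoteException {
        Registry registry = LocateRegistry.createRegistry(port);
        Remote stub = UnicastRemoteObject.exportObject(service, 0);
        registry.rebind(name, stub);
        return registry;
    }
}
```

A client would obtain the stub with `LocateRegistry.getRegistry(host, port).lookup(name)` and cast it to `OntologyService`. The point of the design is that the client only sees the interface, never the database behind it.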
The modular design is achieved with Java beans that are organized and linked in a freely configurable workspace. Each user can create multiple views on the semantic network using the beans available. The beans are organized in several workspaces designed by the user. Each workspace can contain multiple frames upon which the beans are laid out. Once the layout work is finished, links can be established between the beans used. Each of the layouts defined can be saved as Java code and stored in the database. Layouts can be defined on different levels: Organization, Group, User. Each bean can have multiple incoming and outgoing links where appropriate. Beans can also be linked across frames. Each bean has specific properties, which can be set at runtime. This approach allows each type of task to be performed using the optimal layout for the task at hand. Several quality assurance mechanisms are built in: versioning, user tracking, user hierarchies, formal sanctioning with the possibility to overrule, sibling detection, linktype hierarchy, etc.

Specifications of the Available Beans : General
The different beans provide information on, and a view of, different parts of the ontologies built. Any two beans can be linked to each other when an outgoing link of the first bean matches an incoming link of the second. A bar on top of each bean shows the other beans it has been linked to and the direction of each link. Other items in the bean bar are the bean label, the button to display or edit the bean properties and the possibility to refresh the bean contents. Optional items (dependent on the kind of bean) are a shortcut to the linktype filter property and a draggable item.

Most important beans
The ConceptTree bean (see figure 2) provides the user with a view of the hierarchical relations in the semantic network of concepts.
As concepts can have multiple parents (network structure) and the representation is a tree view, the network structure is split up into the matching tree representation. Modifications to the structure can be made by means of drag and drop. The functionalities of the ConceptTree bean include search by knowledge name, search by terms, modification of the hierarchy and a history of searches. The bean properties provide a way to specify the number of siblings to display, the font, the child depth, the number of children to display, the preferred language, the parent depth and the leaf-node child depth.

Figure 2 : the ConceptTree bean

A second important bean is the Full Definition bean (see figure 3). This bean shows the user the hierarchical and non-hierarchical relations a concept has with other concepts. These relations are divided into the relations explicitly specified for this concept (beneath the node labeled CRITERIA) and the implicit relations (beneath the node labeled INHERITED CRITERIA) (figure 3). Explicit relations and full definitions can be added, removed or modified by drag and drop.

Figure 3 : the Full Definition bean

We introduced the notions of concept-definition and concept-criteria, which allow us to group a number of concept-criteria (essentially relationships) to form a full definition. In this way a concept can not only have multiple full definitions and loose concept-criteria, but the definitions can also overlap. Concept-criteria are the equivalent of what used to be complex-concepts; they represent a relationship between two concepts by use of a linktype. The following diagram illustrates how full definitions can be constructed:

[Diagram: concept C linked through the concept-criteria L1 to L5 to the concepts C1 to C5; the full definitions FD1 and FD2 each group a subset of these criteria.]

The hypothetical concept C has 5 relationships (CONCEPT_CRITERIA) and 2 full definitions (FD1 and FD2). FD1 consists of 3 concept-criteria (L1, L2 and L3) and FD2 consists of L3 and L4. L5 is simply a loose concept-criterion that does not belong to any full definition.
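The grouping of concept-criteria into possibly overlapping full definitions can be sketched in a few lines of Java. This is a minimal illustration built on the hypothetical concept C from the text (criteria L1 to L5, full definitions FD1 and FD2). The class and method names are assumptions and do not reflect the actual LinKFactory® data model.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Sketch of one concept with its concept-criteria and full definitions.
final class ConceptDefinitions {
    private final Set<String> criteria = new LinkedHashSet<>();
    private final Map<String, Set<String>> fullDefinitions = new LinkedHashMap<>();

    void addCriterion(String criterion) {
        criteria.add(criterion);
    }

    // A full definition groups criteria that already belong to the concept.
    void addFullDefinition(String name, String... members) {
        Set<String> group = new LinkedHashSet<>(Arrays.asList(members));
        if (!criteria.containsAll(group)) {
            throw new IllegalArgumentException("unknown criterion in " + name);
        }
        fullDefinitions.put(name, group);
    }

    // Criteria that are not part of any full definition ("loose" criteria).
    Set<String> looseCriteria() {
        Set<String> loose = new LinkedHashSet<>(criteria);
        for (Set<String> group : fullDefinitions.values()) {
            loose.removeAll(group);
        }
        return loose;
    }

    // Criteria that two full definitions have in common (their overlap).
    Set<String> sharedCriteria(String first, String second) {
        Set<String> shared = new LinkedHashSet<>(fullDefinitions.get(first));
        shared.retainAll(fullDefinitions.get(second));
        return shared;
    }

    // Builds the hypothetical concept C described in the text.
    static ConceptDefinitions exampleConceptC() {
        ConceptDefinitions c = new ConceptDefinitions();
        for (String criterion : Arrays.asList("L1", "L2", "L3", "L4", "L5")) {
            c.addCriterion(criterion);
        }
        c.addFullDefinition("FD1", "L1", "L2", "L3");
        c.addFullDefinition("FD2", "L3", "L4");
        return c;
    }
}
```

For the example concept, `looseCriteria()` yields only L5 and `sharedCriteria("FD1", "FD2")` yields only L3, which matches the overlap described above.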
The Full Definition bean also shows the full definitions of the selected concept, i.e. the sufficient criteria to uniquely identify it.

The ReverseConcept bean (see figure 4) shows the relations other concepts have with the selected concept. The node labeled Reverse ConceptCriteria shows the explicit relations other concepts have with the selected concept. The node labeled Inherited Reverse ConceptCriteria shows the implicit relations other concepts have with this concept, i.e. the explicit relations other concepts have with a concept that is an explicit child of the selected concept; hence those concepts have an implicit relation with the selected concept. The inherited reverse relations are not shown by default.

Figure 4 : the Reverse Concept bean

The LinkType bean (see figure 5) provides the user with a view of the hierarchical relations in the semantic network of linktypes. As linktypes can have multiple parents (network structure) and the representation is a tree view, the network structure is split up into the matching tree representation. Modifications to the structure can be made by means of drag and drop.

Figure 5 : LinkType Tree bean

Linktypes were deemed to have a hierarchy just like concepts, so we added the LINKTYPE_TREE to represent this; this simple construct suffices because there is only a hierarchical parent-child relationship between linktypes. This hierarchy has an effect on the constraints (see below) because a linktype used in a concept-criterion automatically implies all of its parent linktypes. For example, when a link HAS-BONAFIDE-BOUNDARY has HAS-BOUNDARY as its parent, that parent is also implied whenever the child is used in a relationship.

The Translate bean (see figure 6) shows the list of terms related to the selected concept, the selected linktype or the selected criterion in a certain language. Terms can be added, modified or removed.
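The implication rule for the linktype hierarchy (a linktype used in a concept-criterion implies all of its parent linktypes) can be sketched as a traversal of the parent links. The class below is a hypothetical illustration, not the LINKTYPE_TREE implementation. It assumes the hierarchy is free of loops and supports multiple parents, as the text describes.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of a linktype hierarchy in which a linktype may have several parents.
final class LinkTypeHierarchy {
    private final Map<String, List<String>> parents = new HashMap<>();

    void addParent(String child, String parent) {
        parents.computeIfAbsent(child, key -> new ArrayList<>()).add(parent);
    }

    // Returns the linktype itself plus every linktype it implies,
    // i.e. all of its direct and indirect parents.
    Set<String> impliedBy(String linkType) {
        Set<String> implied = new LinkedHashSet<>();
        Deque<String> pending = new ArrayDeque<>();
        pending.push(linkType);
        while (!pending.isEmpty()) {
            String current = pending.pop();
            // add() returns false for linktypes already visited via another path.
            if (implied.add(current)) {
                for (String parent : parents.getOrDefault(current, Collections.emptyList())) {
                    pending.push(parent);
                }
            }
        }
        return implied;
    }
}
```

After `addParent("HAS-BONAFIDE-BOUNDARY", "HAS-BOUNDARY")`, calling `impliedBy("HAS-BONAFIDE-BOUNDARY")` returns both linktypes, while `impliedBy("HAS-BOUNDARY")` returns only itself.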
Several Translate beans can be viewed simultaneously, giving terms in different languages that are all linked to the language-independent concept. This construction makes it possible, for example, to place an application on top of the ontology.

Figure 6 : Translate bean

Other available beans include the Concept Properties bean, the LinkType Properties bean, the Criteria bean, the Bookmark bean and others. All of these beans can be selected by the user and linked to each other, thus creating a powerful environment for browsing and editing large ontologies. An ontology that has been built with LinKFactory® is LinKBase®.

LinKBase® : A LARGE FORMAL ONTOLOGY BUILT WITH LinKFactory®
Since the initial focus of L&C was the medical world, we started by constructing a formal representation of that world, using LinKFactory® to do so. LinKBase® is a large multilingual medical formal terminology system covering most parts of healthcare. The fact that LinKBase® currently contains over 1,000,000 medical concepts and over 350 linktypes (with over 3,000,000 linktype instantiations) gives a good indication of the size of the ontologies LinKFactory® can handle. The medical concepts, themselves language-independent, are linked to about 3,000,000 terms in various languages.

Terms can be stored in different languages and can be linked to concepts, criteria and linktypes through an intersection table, allowing us to define both homonyms (1 term that has several different meanings or linked concepts/criteria/linktypes) and synonyms (multiple terms associated with 1 concept/criterion/linktype).

Closely related to this intersection table (TERM_CCL) is the SOURCE and SOURCE_OBJECT construction. SOURCE is a table that stores a number of medical classification systems, such as SNOMED or ICD-9-CM, that classify medical concepts according to their own hierarchy and are used throughout the medical world.
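The way an intersection table such as TERM_CCL supports both homonyms and synonyms can be illustrated with a small in-memory sketch. The class, its fields and the identifiers used in the usage note are assumptions made for illustration. The actual table layout is not given in the text.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Sketch of an intersection table linking terms to concepts, criteria or linktypes.
final class TermLinks {
    // One row: a term in a given language linked to one target identifier.
    private static final class Row {
        final String term;
        final String language;
        final String targetId;

        Row(String term, String language, String targetId) {
            this.term = term;
            this.language = language;
            this.targetId = targetId;
        }
    }

    private final List<Row> rows = new ArrayList<>();

    void link(String term, String language, String targetId) {
        rows.add(new Row(term, language, targetId));
    }

    // Synonyms: all terms in one language that are linked to the same target.
    Set<String> termsFor(String targetId, String language) {
        Set<String> terms = new LinkedHashSet<>();
        for (Row row : rows) {
            if (row.targetId.equals(targetId) && row.language.equals(language)) {
                terms.add(row.term);
            }
        }
        return terms;
    }

    // All targets that one term in one language is linked to.
    Set<String> targetsFor(String term, String language) {
        Set<String> targets = new LinkedHashSet<>();
        for (Row row : rows) {
            if (row.term.equals(term) && row.language.equals(language)) {
                targets.add(row.targetId);
            }
        }
        return targets;
    }

    // A term is a homonym when it is linked to more than one target.
    boolean isHomonym(String term, String language) {
        return targetsFor(term, language).size() > 1;
    }
}
```

As a made-up usage example, linking the English term "cold" to two invented concept identifiers makes `isHomonym("cold", "en")` true, while linking "cold" and "common cold" to the same identifier makes `termsFor` return both as synonyms.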
By using SOURCE_OBJECT we can link TERM_CCL records with certain sources and assign a code to that combination. This gives us the possibility to translate from existing formal medical hierarchies to our own conceptual structure and back, which is a very powerful feature.

LinKBase® is an IS_A hierarchy without loops, a so-called ‘directed acyclic graph’. This means that no concept can be a child (of a child of a child…) of itself. In this hierarchy, it is not presumed that the children of one parent are mutually exclusive. In some cases they are, which can be made explicit by using the DISJOINT link. In most cases they are not.

LinKBase® makes a distinction between domain knowledge and linguistic knowledge. The linguistic ontology is a subset of the global ontology. It contains those elements of the ontology (medical knowledge and other) that influence the grammar of a language. Part of it is present in a specific sub-ontology. Another part is scattered throughout the global ontology. In this way, every piece of linguistic knowledge gets a (referred-to) place in the domain ontology. The linguistic ontology is on the level of language, not on the level of formal domain knowledge. Whether something truly happens or not is not part of the linguistic configuration. The distinctions in linguistic configuration are made according to the kinds of roles that are needed in a sentence. E.g. the ‘HAS-THEME’ link is used for things that are displaced in a movement predicate.

E.g. “The nurse removes a tumor.” What is the relation between ‘nurse’, ‘removing’ and ‘tumor’?

Domain ontologically: How can I think, abstractly, about removing? What are relevant questions? What are fixed elements, such as:
- There is movement (What kind of movement?)
- Something/someone moves (Who or what?)
- Something/someone provokes movement (Who or what?)
- The movement has a goal (Which goal?)

Linguistically: That something/someone removes is necessary within this sentence.
It does not matter who removes. It does not matter whether the nurse really can remove things at all.

This medical ontology has been the basis for a number of related products that each deal with a specific problem in medical environments.

DERIVED APPLICATIONS
Fastcode® is a state-of-the-art coding tool that maps narrative expressions in natural language (diagnoses, clinical findings, etc.) onto the codes of a classification system. Medical code tables and international classification systems can contain several thousand codes. Searching for a code is unpleasant and time-consuming. Fastcode® offers a very fast and accurate solution using semantic technology. The semantic database (LinKBase®) contains the medical terms (cf. supra) linked to the different classification systems (SNOMED-RT, ICD-9-CM, ICD-10, MedDRA, ICPC, UMLS, MeSH, …) on the basis of their conceptual meaning. Fastcode® analyses the meaning of the input words and performs a search based on the related concepts as stored in the LinKBase® medical ontology. This approach increases speed and accuracy while solving problems related to synonyms, homonyms, compound words, orthography and multiple spelling possibilities.

TeSSI® is a fast indexer that indexes documents on the basis of their content (deep meaning) rather than on the actual words contained in the text. This makes it possible to search a textual database using semantics rather than string matching. Because such a search matches on concepts, it can retrieve documents that express the same meaning in different words, which string matching would miss.

FUTURE DIRECTIONS OF LinKFactory®
LinKFactory® is currently used in house by ten medical knowledge engineers simultaneously. There are over 7,000,000 knowledge elements, and around 2000 modifications are made on a daily basis. A number of issues still have to be dealt with. First, developments have started to make the system DAML+OIL compatible.
This is not an easy task, mainly because the DAML+OIL conventions themselves are not mature enough to be unambiguously understood. The fact that the reasoning mechanism behind LinKFactory® is based on description logic makes it feasible, however, and today we can claim to be 90% DAML+OIL compatible.

Another future goal is to integrate unsupervised learning capacities into LinKFactory®. Using maximum-entropy models on large amounts of free text, we are currently able to infer head-modifier relationships automatically from huge text corpora. The goal is now to find out how the head-modifier relationships can be “named” by using information from LinKBase®. This may lead to an optimal collaboration between statistical and symbolic methods.

REFERENCES
Blázquez, M., Fernández, M., García-Pinar, J.M., Gómez, A.-. Building Ontologies at the Knowledge Level using the Ontology Design Environment. KAW98, http://ksi.cpsc.ucalgary.ca/KAW/KAW98/blazquez/
Ceusters, W. et al. The distinction between linguistic and conceptual semantics in medical terminology and its implication for NLP-based knowledge acquisition. In C. Chute (ed.): Proceedings of IMIA WG6 Conference on Natural Language and Medical Concept Representation (IMIA WG6, Jacksonville, 1997) 71-80.
Cocchiarella, N. B. 1991. Formal Ontology. In H. Burkhardt and B. Smith (eds.), Handbook of Metaphysics and Ontology. Philosophia Verlag, Munich: 640-647.
Domingue, J., Tadzebao and WebOnto: Discussing, browsing, and editing ontologies on the Web, Proceedings of the 11th Banff Knowledge Acquisition Workshop, (1998).
Farquhar, A., and Rice, J., The Ontolingua Server: a tool for collaborative ontology construction, Proceedings of the 10th Banff Knowledge Acquisition Workshop, (1996).
Kozaki, K. et al. "Development of an Environment for Building Ontologies which is based on a Fundamental Consideration of Relationship and Role", Proceedings of The Sixth Pacific Knowledge Acquisition Workshop (PKAW2000), pp. 205-221, Sydney, Australia, December 11-13, 2000.
Mahalingam, K., Huhns, M., An ontology tool for query formulation in an agent-based context, Proceedings of the 2nd IFCIS International Conference on Cooperative Information Systems (CoopIS '97).
Musen, M.A., The Knowledge Model of Protégé-2000: Combining Interoperability and Flexibility, Proceedings of EKAW 2000 International Conference on Knowledge Engineering and Knowledge Management. Methods, Models and Tools, Juan-les-Pins, France, (October 2000).
Preece, A. et al. Better Knowledge Management through Knowledge Engineering. In IEEE Intelligent Systems 16:1, Jan-Feb, 2001.