OIL: A Slick Way to Represent Knowledge for Bioinformatics and the Web

Carole Goble (1), Robert Stevens (1), Ian Horrocks (1), Dieter Fensel (2), Frank van Harmelen (2,3), Stefan Decker (4), Michel Klein (2)

(1) The Department of Computer Science, The University of Manchester, Manchester, M13 9PL, UK
Email: <carole/stevensr/horrocks>@cs.man.ac.uk
http://img.cs.man.ac.uk/
(2) Vrije Universiteit Amsterdam, Holland
dieter@cs.vu.nl / frankh@cs.vu.nl / mcaklein@cs.vu.nl
(3) Aidministrator Nederland, B.V., Amersfoort, Holland
(4) Stanford University, USA
stefan@db.stanford.edu

1 Introduction

The web has proved to be an excellent mechanism for the ready publication and availability of information. It has been an effective technology for supporting the biologists’ culture of co-ordinated research, and the sharing and rapid dissemination of information. What is missing is the next level of interoperability: not just making information available, but understanding what the information means so that it can be linked in appropriate and insightful ways, and providing automated support for this process.

The integration and sharing of information requires a consistent shared understanding of the meaning of that information. The biologist’s knowledge of molecular biology and bioinformatics, and their interpretation of the resources with respect to this knowledge, is essential to the task of combining resources to answer queries. A shared understanding requires three things: metadata, terminologies and ontologies. Sharing data means publishing, agreeing on and sharing metadata, which in turn means shared schemas and shared controlled vocabularies, and hence the use of, and adherence to, standard vocabularies such as GO [Ashburner00]. The questions now become (a) how do we describe and exchange such vocabularies? and (b) how do we build better vocabularies? A prerequisite for the widespread use of ontologies is a joint standard for their description and exchange.
Ontologies provide a way of capturing a shared understanding of terms that can be used by humans and programs to aid information exchange. An ontology needs to be specified in some language. In 1999, the BioOntology Core Group developed a frame-based language with an XML syntax for the exchange of ontologies for molecular biology. The proposed language was XOL [Karp99] (later extended to iXOL), based on the modelling primitives of OKBC-Lite [Chaudhri97]. However, this language is only suitable for ontologies that are frame-based: it allows only necessary but not sufficient class definitions (i.e. a new class is always a sub-class of, and never exactly equal to, its specification), and only class names, not class expressions (except for the limited form of expression provided by slots and frames), can be used in defining classes. Ontologies such as the TAMBIS ontology [Baker99] come from the Description Logic [Borgida95] school of knowledge representation, relying on and encouraging reasoning over the class definitions to infer subsumption relationships between classes and to check the consistency and coherence of the classification and its concepts. The ideal would be a language that unifies frames and description logics with a web standard syntax.

The Ontology Inference Layer (OIL) is a new language proposed as a knowledge representation language for the web [Fensel00]. OIL is an effort to produce a layered architecture for specifying and exchanging ontologies. The language has been designed so that:
- it provides the modelling primitives commonly used in frame-based ontologies;
- it has simple, clean and well-defined semantics based on description logics;
- automated reasoning support (i.e. consistency and subsumption checking) may be specified and provided in a computationally efficient manner.

OIL is a layered language so that the reasoning support possible at each level of expressivity is clear:
- OIL-Core captures the necessary mainstream modelling primitives that both provide adequate expressive power and are well understood, thereby allowing the semantics to be precisely defined and complete inference to be viable;
- Standard-OIL includes the integration of instances as well as concepts;
- Heavy-OIL includes additional representational (and reasoning) capabilities.

Although OIL is a powerful language, it is not necessary to use all of its expressive power: it is light-weight enough to represent simple taxonomies or frame-like ontologies such as GO without obstruction, and expressive enough to represent logic-based models such as the TAMBIS Ontology [Baker99], and all points in between. Some parts of an ontology can be simple, others complex; moreover, the language offers the ontologist an evolutionary development path along which they can progressively introduce more expressive constructs. Thus OIL is the confluence of the modelling primitives of frames, the classification reasoning capabilities of description logics, and encodings in the metadata languages of the web. For further discussion and motivation, and for a formal description and grammar specification, see the papers at http://www.ontoknowledge.org/oil/.

OIL’s machine-readable syntax is defined as an XML DTD, an XML Schema definition and an RDF Schema definition. Because of OIL’s antecedents in the logic community, it has a formal denotational semantics, available from the same URI. OIL also has a more compact, human-readable syntax, which is used in this paper.

1.1 An Informal Description of OIL-Core

An OIL ontology is a structure made up of several components, some of which may themselves be structures, some of which are optional (component?) and some of which may be repeated one or more times (component+) or zero or more times (component*).
An OIL ontology is delineated by the keywords begin-ontology and end-ontology, and consists of two major parts:

- The ontology container, which describes the metadata of the ontology, based on the Dublin Core Metadata Element Set (Version 1.1) standard [Dublin Core]. Some of the elements can be specialised with a qualifier, which refines the meaning of the element (written element.qualifier). Elements include title, creator, subject, description, publisher, contributor, date, type, format, identifier, source (URI), language, relation (to other OIL ontologies) and rights. Figure 1 gives an example of an ontology container.

    ontology-container
      title “macromolecule fragment”
      creator “robert stevens”
      subject “macromolecule generic ontology”
      description “example for a tutorial”
      description.release “1.0”
      publisher “R Stevens”
      type “ontology”
      format “pseudo-xml”
      identifier “http://www.ontoknowledge.org/oil/oil.pdf”
      source “http://img.cs.man.ac.uk/ismb00/mmexample.pdf”
      language “OIL”
      language “en-uk”

  Figure 1: An example of an ontology container

- The ontology definition, which defines a particular ontological vocabulary and contains descriptions of classes, slots and individuals (also known as instances). Classes are collections of individuals that share properties; slots describe binary relations between individuals. The definition is made up of two parts: (a) an import directive giving a list of one or more references to other OIL modules that are to be included in this ontology and (b) a definition of zero or more of the following:
i. class definitions (class-def). Classes are unary predicates, such as person, and can be arranged into classification lattices; for example, person is a subclass of mammal, so anything that is an instance of person must also be an instance of mammal. Classes also contain information about how their members relate to other individuals;
ii. slot definitions (slot-def).
Slots are binary relations that define how members of classes relate to other objects. Slots are also known as roles or attributes. Slots can also be arranged into slot lattices (subrelations). They may be constrained through the use of slot constraints, which limit the type of instance that can fill the slot (the slot filler), limit the number of fillers a slot may have, enumerate a set of possible fillers, and so on. Particular fillers may also be specified; for example, a white wine may be explicitly defined as something that is a wine whose colour slot is filled with the colour white;
iii. axioms (disjoint, covered, disjoint-covered, equivalent). An axiom asserts some additional facts about the classes in the ontology.

The rest of the paper concentrates on an informal discussion of the definition part of OIL, with illustrative examples that do not claim to be great biology but serve to illustrate the expressivity of OIL.

1.2 Class Definitions

A class definition associates a class name with a class description. A class-def includes the name of a class and some optional documentation describing the class. Class expressions are descriptions of classes, which are one of the following:

- A class name, including built-in class names such as top and bottom. Top is the top of the lattice, and so is the most general class; bottom is the bottom of the lattice, and is a sub-class of every class. Every individual is an instance of top; no individual is an instance of bottom;
- An enumerated class, defined by enumerating its instances; for example, (one-of Iron Sulphur) defines the class whose instances are Iron and Sulphur;
- A slot-constraint, a list of one or more constraints or restrictions applied to a slot. A slot is a binary relation (i.e. its instances are pairs of individuals). A slot-constraint itself defines a class: its instances are those individuals that satisfy the constraints. We will go into this in more detail below;
- A combination of class expressions using the Boolean connectives (and, or, not). And takes two or more expressions and conjoins them, or takes two or more expressions and treats them as a disjunction, and not takes a single expression and negates it. For example, (dna or messenger-rna) defines the class whose instances are all those individuals that are instances of either the class dna or the class messenger-rna. Expressions may be nested recursively and can thus be arbitrarily complex.

Concrete type expressions define a range over some data type, currently integer and string. Sub-ranges can be defined using the expressions (min x), (max x), (greater-than x), (less-than x), (equal x) and (range x y). Expressions of the same type can be combined using the Boolean operators.

A class definition consists of the following components:

- type? The type of definition, which is either primitive or defined, defaulting to primitive. A primitive class’s definition (the combination of its subclass-of and slot-constraint components) is taken to be a necessary but not sufficient condition for membership in the class. For example, in (1) the primitive class dna is asserted to be a sub-class of nucleic-acid. In (2) deoxyribophosphate is asserted to be a sub-class of the class whose instances are all those individuals that are not instances of the class ribophosphate. This is part of the assertions required to state that an individual cannot be an instance of both deoxyribophosphate and ribophosphate, i.e. that the classes are disjoint. As this is such a common thing to say, OIL provides a special axiom for it, as we will see later.

    (1) class-def primitive dna
          subclass-of nucleic-acid
    (2) class-def primitive deoxyribophosphate
          subclass-of not ribophosphate

Definition (3) declares dna as a sub-class of nucleic-acid with a slot-constraint (or role) stating that the backbone must be deoxyribophosphate.
This says that all instances of dna must necessarily be nucleic acids with a deoxyribophosphate backbone, but there may be nucleic acids with a deoxyribophosphate backbone that are not dna.

    (3) class-def primitive dna
          subclass-of nucleic-acid
          slot-constraint has-backbone
            has-value deoxyribophosphate
    (4) class-def defined dna
          subclass-of nucleic-acid
          slot-constraint has-backbone
            has-value deoxyribophosphate

When a class is defined, its definition is taken to be a necessary and sufficient condition for membership of the class. Thus in (4) all instances of dna are necessarily deoxyribophosphate-backboned nucleic acids, and every deoxyribophosphate-backboned nucleic acid is also an instance of dna.

- subclass-of? The class being defined in this class-def must be a sub-class of each of the class-expressions listed. In (5) and (6) enzyme has two super-classes: protein and catalyst.

    (5) class-def defined catalyst
          subclass-of macromolecule
          slot-constraint promotes
            has-value reaction
    (6) class-def enzyme
          subclass-of protein, catalyst

- slot-constraint* Zero or more slot-constraints, which are special kinds of class-expression (as described above). A slot-constraint defines a class as shown in (3) and (4) above.

A slot-constraint consists of the following components:

- name A slot can be defined in the ontology through a slot-def; if it is not, then it is assumed to have no globally applicable constraints, so any pair of individuals could be an instance of the slot.
- has-value? A list of one or more class or concrete-type expressions. This is equivalent to the existential quantifier of predicate logic. (3) and (4) define the class each instance of which is related via has-backbone to some instance of the class deoxyribophosphate. This does not mean that the has-backbone slot of such an instance can only be filled by deoxyribophosphate.
- value-type? A list of one or more class or concrete-type expressions. This is equivalent to the universal (for-all) quantifier of predicate logic.
(7) defines the class such that, if an instance of dna is related via has-backbone to some individual x, then x must be an instance of the class deoxyribophosphate. This does not mean that instances of dna have to have a backbone at all.

    (7) class-def defined dna
          subclass-of nucleic-acid
          slot-constraint has-backbone
            value-type deoxyribophosphate
    (8) class-def defined dna
          subclass-of nucleic-acid
          slot-constraint has-backbone
            has-value deoxyribophosphate
            value-type deoxyribophosphate

To define dna as a class where there must be a backbone and it must only be filled by deoxyribophosphate, we must make a definition as in (8).

- has-filler? A list of one or more individual names or data values (integers or strings). Every instance of the defined class must be related via the slot to each individual and data value listed (e.g. (9)).
- max-cardinality?, min-cardinality? and cardinality? A non-negative number n followed by a class or concrete-type expression. An instance of the defined class can be related via the slot to at most n, at least n or exactly n distinct instances or data values of the expression. If the expression is missing, then the type of the fillers is irrelevant; all that matters is their number. (10) defines the class each instance of which has exactly 7 transmembrane helices for its scaffold.

    (9)  slot-constraint atomic-number
           has-filler 19
    (10) slot-constraint has-scaffold
           cardinality 7 transmembrane-helix

1.2.1 Slot definitions

A slot definition associates a slot name and some optional documentation with a slot description. A slot description specifies global constraints that apply to the slot relation. Examples are given in (11).
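To make "global constraints that apply to the slot relation" concrete, a slot can be pictured as a set of (x, y) pairs of individuals, and each global constraint as a condition on that set. The Python sketch below is only an illustration under that assumption; it is not part of OIL or of any OIL tool, and the individuals (helix1, domain1, protein1, membrane1) and helper names are hypothetical.

```python
# A slot is modelled here as a plain set of (x, y) pairs of individuals.
# All individuals below are hypothetical toy data, not taken from any ontology.
part_of = {("helix1", "domain1"), ("domain1", "protein1"), ("helix1", "protein1")}
cellular_location = {("protein1", "membrane1")}


def is_transitive(slot):
    """If (x, y) and (y, z) are in the slot, (x, z) must be too."""
    return all((x, z) in slot
               for (x, y) in slot
               for (y2, z) in slot
               if y == y2)


def is_symmetric(slot):
    """If (x, y) is in the slot, (y, x) must be too."""
    return all((y, x) in slot for (x, y) in slot)


def is_functional(slot):
    """No x may be related to two different fillers."""
    seen = {}
    for x, y in slot:
        if seen.setdefault(x, y) != y:
            return False
    return True


def inverse(slot):
    """The inverse slot swaps every pair."""
    return {(y, x) for (x, y) in slot}


def respects_domain_and_range(slot, domain, range_):
    """Every x must be in the domain and every y in the range."""
    return all(x in domain and y in range_ for (x, y) in slot)


print(is_transitive(part_of))            # True
print(is_functional(part_of))            # False: helix1 has two fillers
print(is_functional(cellular_location))  # True
```

A reasoner treats these properties as assertions that constrain every possible model rather than as checks on one data set; the sketch only shows what each property requires of a given set of pairs.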
    (11) slot-def part-of
           inverse is-part-of
           properties transitive
         slot-def has-polarity
           range polarity
         slot-def cellular_location
           properties functional
         slot-def has_backbone
           inverse is_backbone_of
         slot-def loosely-binds
           domain electrical-charge
           subslot-of binds
           inverse loosely-bound-by

A slot-def consists of the following components:

- subslot-of? A list of one or more slots. Slots may also be arranged into slot hierarchies.
- domain? If the pair (x,y) is an instance of the slot relation, x must be an instance of each class-expression listed.
- range? If the pair (x,y) is an instance of the slot relation, y must be an instance or data value of each class or concrete-type expression listed.
- inverse? The name of the slot that is the inverse of the slot being defined.
- properties? A list of one or more properties of the slot, which include:
  - transitive: if both (x,y) and (y,z) are instances then (x,z) must be an instance;
  - symmetric: if (x,y) is an instance then (y,x) must be an instance;
  - functional: if (x,y) is an instance then there is no z such that (x,z) is an instance and y ≠ z.

1.2.2 Axioms

The currently valid axioms, defined on lists of class expressions, are:

- disjoint All the class expressions listed are pairwise disjoint (have no instances in common), for example (12);
- covered Every instance of the class is also an instance of at least one of a list of class expressions;
- disjoint-covered A combination of the two above;
- equivalent All the class expressions listed are equivalent (have the same instances).

    (12) class-def ribophosphate
         class-def deoxyribophosphate
         disjoint ribophosphate deoxyribophosphate
         disjoint rna dna

These can alternatively be defined using class definitions with negation and disjunction; however, they are common enough to warrant their own primitives in the language.

1.3 Reasoning and OIL

Linking OIL with Description Logics gives the language the well-defined semantics often lacking in frame-based and web-based language specifications such as OKBC and RDFS.
In addition, it considerably extends what the ontologist is able to express. OIL extends frame languages such as OKBC in the following ways:

- Arbitrary Boolean class expressions can be used wherever a class name can appear;
- A slot definition can be treated as a class and can be used in class expressions;
- Class definitions can be specified as primitive or defined;
- Slot fillers can be restricted to instances of arbitrarily complex class expressions;
- Global slots allow the specification of parent slots and relation properties;
- There is no restriction on the ordering of class and slot definitions.

Moreover, the reasoning services offered by a Description Logic support the development and incremental maintenance of an ontology [Bechhofer, Stevens]. Highly optimised implementations of sound and complete tableaux subsumption algorithms for very expressive DLs such as SHIQ [Horrocks99] can be used in spite of the high worst-case complexity. The key services are:

- Subsumption checking between two concept descriptions C and D: C subsumes D when the set of individuals that are instances of D is always a subset of the set of individuals that are instances of C.
- Classification, which organises a collection of concept expressions into a partial order based on the subsumption check. This provides a lattice of definitions, ranging from the general to the specific. Composed definitions have their position determined automatically. Thus classification is a dynamic process in which new compositional expressions can be added to an existing hierarchy.
- Concept satisfiability, which checks whether a concept description can never have instances because of inconsistencies or contradictions in the model.

What this means to the ontologist is that (a) there is automated support for building and evolving the classification lattice and (b) the classification scheme can be checked to be coherent and consistent. For example, given (13) and the definitions above:
    (13) class-def defined mitochondrial
           slot-constraint cellular_location
             has-value (mitochondrion or (slot-constraint part-of has-value mitochondrion))
         class-def defined succinate_dehydrogenase
           subclass-of enzyme
           slot-constraint promotes
             value-type oxidation
           slot-constraint cellular_location
             has-value (slot-constraint part-of has-value mitochondrion)

- The succinate_dehydrogenase class is recognised as a sub-class of mitochondrial because of its definition;
- If we introduced the class expression (catalyst and mitochondrial), succinate_dehydrogenase would be recognised as a sub-class because of its definition.

1.4 Standard OIL and Heavy OIL

Our starting point was to define a core language, with the intention that additional features be defined as a set of extensions, still with clearly defined semantics. OIL-Core is an expressive language that still enjoys reasoning support: every construct can be serviced by a state-of-the-art description logic reasoning engine such as FaCT [Horrocks99].

Standard OIL is a strict superset of OIL-Core, extending it to include the definition of instances of classes and roles using instance-of and related statements respectively. These statements allow the modeller to express instances and distribute them along with their class definitions using XML syntax. This was an important requirement for the exchange of ontologies. However, the instances cannot be used in reasoning at the present time. Special instances known as nominals are supported by using primitive disjoint concepts. Nominals include notions such as Italy: a modeller might want to refer to Italy in a model, but there is only one Italy, so although it is really an instance of country its behaviour is similar to that of a concept.
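The instance-of and related statements of Standard OIL can be pictured as plain facts about individuals and pairs of individuals. The Python sketch below is a hypothetical illustration only: the paper gives no concrete syntax for these statements, the dictionaries and individual names (m1, matrix1, sdh1, hex1, cytosol1) are invented, and, as noted above, OIL itself does not currently reason over instances. It shows how the has-value (existential) and value-type (universal) readings from Section 1.2 would pick out individuals, using the mitochondrial definition from (13).

```python
# Hypothetical facts in the spirit of instance-of and related statements.
instance_of = {
    "m1": {"mitochondrion"},
    "matrix1": {"matrix"},
    "cytosol1": {"cytosol"},
    "sdh1": {"enzyme"},
    "hex1": {"enzyme"},
}
related = {
    "part-of": {("matrix1", "m1")},
    "cellular_location": {("sdh1", "matrix1"), ("hex1", "cytosol1")},
}


def instances(class_name):
    """Individuals asserted to be instances of a named class."""
    return {i for i, classes in instance_of.items() if class_name in classes}


def has_value(slot, fillers):
    """Existential reading: related via slot to at least one member of fillers."""
    return {x for (x, y) in related.get(slot, set()) if y in fillers}


def value_type(slot, fillers):
    """Universal reading: every filler of slot is in fillers.

    Individuals with no filler at all satisfy this vacuously.
    """
    pairs = related.get(slot, set())
    return {x for x in instance_of
            if all(y in fillers for (x2, y) in pairs if x2 == x)}


# mitochondrial from (13):
#   cellular_location has-value (mitochondrion or (part-of has-value mitochondrion))
mito = instances("mitochondrion")
mitochondrial = has_value("cellular_location", mito | has_value("part-of", mito))

print(sorted(mitochondrial))                                     # ['sdh1']
print(sorted(value_type("cellular_location", {"matrix1"})))
# ['cytosol1', 'm1', 'matrix1', 'sdh1']
```

The second result includes individuals with no cellular_location at all, which mirrors the earlier remark that a value-type constraint alone does not require a filler to exist. Evaluating one set of facts in this way is not subsumption reasoning, which must hold for every possible model.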
Heavy OIL is used to refer to features not yet in OIL that can be found in other ontology modelling languages, including: further relation properties such as composite relations and relation (ir)reflexivity; rules/axioms; limited second-order expressivity; default reasoning; and modules. These are explored further in the papers in [Horrocks00], and form a line of future work.

1.5 OIL Status

The issues to be addressed by the biology e-Science community are a version of the issues to be addressed by the whole web community: specifically, how to move the web from one where information is merely machine readable, for interpretation by humans, to one where information is machine processable by intelligent services such as information brokers, search agents and information filters. OIL has been developed as an international collaboration as part of an effort to realize the W3C vision of a semantic web, in which the meaning of web resources is represented explicitly [Berners-Lee99]. Current OIL activity is focused on:

- Example applications: Manchester has already produced a version of the Gene Ontology [Ashburner00] in OIL and has developed the second generation TAMBIS ontology in OIL.
- Tools: The FaCT reasoner is available with a logic sufficient to reason over OIL, an implementation efficient enough to make reasoning empirically tractable and a CORBA IDL with a clean API [Bechhofer99]. Other tools under development include OIL editors (OILEd at Manchester, OntoEdit at Karlsruhe and an extension of Protégé 2000 at Stanford), and the adaptation of ontology integration tools, such as Chimaera at Stanford.
- EU project proposals: A Network of Excellence and an IST FET-O proposal have been submitted, with over 20 commercial partners as supporters, including major pharmaceutical companies and biotech web portal information providers.
- WWW metadata language standardisation: The OIL effort is a major contributor to the DARPA Agent Markup Language initiative (DAML) in their attempt to define an ontology language, DAML-ONT [DAML]. In addition, the W3C have been developing an extension to RDF to include logical inferencing; OIL’s RDFS mapping gives it a semantics that is otherwise absent. At the time of writing, the three communities (OIL, W3C and DAML) have come together in a joint DAML Language Committee to produce a universal, scalable yet technically sound language for describing knowledge on the web.

Acknowledgements

The authors would like to acknowledge the entire OIL Consortium (http://www.ontoknowledge.org/oil/misc.shtml#ackn) and Sean Bechhofer for his invaluable help with the examples.

References

[Ashburner00] M. Ashburner et al. Gene Ontology: Tool for the Unification of Biology. Nature Genetics, Vol 25, pages 25-29, (2000).
[Baker99] P.G. Baker, C.A. Goble, S. Bechhofer, N.W. Paton, R. Stevens, and A. Brass. An Ontology for Bioinformatics Applications. Bioinformatics, 15(6):510-520, (1999).
[Bechhofer99] S. Bechhofer, I. Horrocks, P.F. Patel-Schneider, and S. Tessaris. A Proposal for a Description Logic Interface. DL99, International Workshop on Description Logics, Linköping, Sweden, (1999).
[Bechhofer] S. Bechhofer and C.A. Goble. Thesaurus Construction through Knowledge Representation. In: Data and Knowledge Engineering, accepted for publication.
[Berners-Lee99] T. Berners-Lee. Weaving the Web. Orion Business, ISBN: 0752820907, (1999).
[Borgida95] A. Borgida. Description logics in data management. IEEE Trans on Knowledge and Data Engineering, 7(5), pp: 671-682, (1995).
[Chaudhri97] V.K. Chaudhri, A. Farquhar, R. Fikes, P.D. Karp, and J.P. Rice. Open knowledge base connectivity 2.0. Technical Report KSL-98-06, Knowledge Systems Laboratory, Stanford, (1997).
[DAML] http://www.daml.org
[Dublin Core] http://purl.ocl.org/dc/
[Fensel00] D. Fensel, I. Horrocks, F. van Harmelen, S. Decker, M.
Erdmann, and M. Klein. OIL in a nutshell. In: Knowledge Acquisition, Modeling, and Management, Proceedings of the European Knowledge Acquisition Conference (EKAW-2000), Lecture Notes in Artificial Intelligence, Springer-Verlag, (2000).
[Horrocks99] I. Horrocks and U. Sattler. A Description Logic with Transitive and Inverse Roles and Role Hierarchies. Journal of Logic and Computation, 9(3): 385-410, (1999).
[Horrocks00] I. Horrocks, D. Fensel, S. Bechhofer, J. Broekstra, S. Decker, M. Erdmann, C. Goble, F. van Harmelen, M. Klein, S. Staab, and R. Studer. The Ontology Inference Layer OIL. Available from http://www.ontoknowledge.org/oil/oilhome.shtml
[Karp99] P.D. Karp, V.K. Chaudhri, and J. Thomere. XOL: An XML-based ontology exchange language: Version 0.3, (1999).
[Stevens] R. Stevens, C.A. Goble, and S. Bechhofer. Ontology-based Knowledge Representation for Bioinformatics. To appear in Briefings in Bioinformatics.