Building Ontologies from the Ground Up
When users set out to model their professional activity

Mark A. Musen
Professor of Medicine and Computer Science
Stanford University
v 1.00

“An ontology is a specification of a conceptualization” (T. Gruber)
• A conceptualization is the way we think about a domain
• A specification provides a formal way of writing it down

Porphyry’s depiction of Aristotle’s Categories
Supreme genus: SUBSTANCE
  Differentiae: material / immaterial
Subordinate genera: BODY / SPIRIT
  Differentiae: animate / inanimate
Subordinate genera: LIVING / MINERAL
  Differentiae: sensitive / insensitive
Proximate genera: ANIMAL / PLANT
  Differentiae: rational / irrational
Species: HUMAN / BEAST
Individuals: Socrates, Plato, Aristotle, …

Creating Ontologies in Machine-Processable Form
• Provides a mechanism for developers to codify salient distinctions about the world or some application area
• Provides a structure for knowledge bases that can enable
  – Information retrieval
  – Information integration
  – Automated translation
  – Decision support

The New Philosophers
• Categorizing “what exists” in machine-understandable form
• Providing a structure that enables
  – Developers to locate and update relevant descriptions
  – Computers to infer relationships and properties
• Creating new abstractions to facilitate the creation of this structure

Part of the CYC Upper Ontology

There is a misconception …
• That people building ontologies are all well versed in metaphysics, computer science, knowledge representation, and the content domain
• That ontologies in the real world are as “clean” as SUMO, DOLCE, and other upper-level ontologies
• That most people who are creating ontologies understand all the ramifications of what they are doing!

Lots of ontology builders are not very good philosophers
• Nearly always, ontologies are created to address pressing professional needs
• The people who have the most insight into professional knowledge may have little appreciation for metaphysics, principles of knowledge representation, or computational logic
• There simply aren’t enough good philosophers to go around

Practical Problems BioInformatics

The pressing need to standardize the names of human genes

But the human genome is only part of the problem …
• Scientists maintain huge databases of gene sequences and gene expression for a wide range of “model organisms” (e.g., mouse, rat, yeast, fruit fly, roundworm, slime mold)
• Database entries are annotated with entries such as the name of a gene, the function of the gene, and so on
• How do you ensure uniformity in the nature of these annotations?

Gene Ontology Consortium
• Founded in 1998 as a collaboration among scientists responsible for developing different databases of genomic data for model organisms (fruit fly, yeast, mouse)
• Now, essentially all developers of all model-organism databases participate
• Goal: To produce a dynamic, controlled vocabulary that can be applied to all organism databases even as knowledge of gene and protein roles in cells is accumulating and changing

Gene Ontology (GO)
• Comprises three independent “ontologies”
  – molecular function of gene products
  – cellular component of gene products
  – biological process representing the gene product’s higher-order role.
• Uses these terms as attributes of gene products in the collaborating databases (gene product associations)
• Allows queries across databases using GO terms, providing linkage of biological information across species

GO = Three Ontologies
• Molecular Function
  – elemental activity or task
  – example: DNA binding
• Cellular Component
  – location or complex
  – example: cell nucleus
• Biological Process
  – goal or objective within cell
  – example: secretion

GO has been wildly successful!!
• Dozens of biologists around the world contribute to GO on a regular basis
• The ontology is updated every 30 minutes!
• It’s now impossible to work in most areas of computational biology without making use of GO terms

But GO has real problems …
• Ontologies are represented in an idiosyncratic format that is not compatible with standard knowledge-representation systems
• The format is based on directed acyclic graphs of concepts, without the general ability to specify machine-interpretable properties of concepts or definitions of concepts
• Because of the informal knowledge-representation system, lots of errors have crept into GO
  – Terms that are duplicated in different places
  – Terms with no superclasses
  – Uncertain relationships between terms

Tension in the GO Community
• Biologists around the world with pressing needs to integrate research databases work together to add terms to GO nearly continuously
  – Using an impoverished, nonstandard knowledge-representation system
  – Using no standards to assure uniform modeling conventions from one part of GO to another
• Computer scientists bemoan all this ad-hoc-ery and condemn GO as a hack that will become increasingly unusable and unmaintainable

A wonderful keynote talk from the recent meeting on Standards and Ontologies for Functional Genomics

The Capulets and Montagues
A plague on both your houses?
Professor Carole Goble
University of Manchester, UK
Warning: This talk contains sweeping generalisations

Prologue
Two households, both alike in dignity,
In fair genomics, where we lay our scene,
(One, comforted by its logic’s rigour,
Claims ontology for the realm of pure,
The other, with blessed scientist’s vigour,
Acts hastily on models that endure),
From ancient grudge break to new mutiny,
When “being” drives a fly-man to blaspheme.
From forth the fatal loins of these two foes
Researchers to unlock the book of life;
Whole misadventured piteous overthrows
Can with their work bury their clans’ strife.
The fruitful passage of their GO-mark'd love,
And the continuance of their studies sage,
Which, united, yield ontologies undreamed-of,
Is now the hours' traffic of our stage;
The which if you with patient ears attend,
What here shall miss, our toil shall strive to mend.
Based on an idea by Shakespeare

The Montagues
One, comforted by its logic’s rigour,
Claims ontology for the realm of pure
Computer Science, Knowledge engineering, AI
Logic and Languages
Theory
Top down, well-behaved neatness
Generic and lots of toys
Methodologies & patterns
Tools and standards
Technology push
Academic pursuit

The Capulets
The other, with blessed scientist’s vigour,
Acts hastily on models that endure
Life Scientists
Practice
Bottom up, real-world
Specific and many of them
Methodologies, community practice
Tools and standards
Application pull
Practical pursuit – build ‘n’ use it

The Philosophers
One, comforted by its logic’s rigour,
Claims ontology for the realm of pure
Philosophers
Theory
Truth
Generic – the one true ontology?
Methodologies, patterns & foundational ontologies
Not really into tools
No push or pull
Academic pursuit

Figure: Philosophers (Endurants, Perdurants, Being, Substance, Event); KR Montagues (The end; Mechanism providers); Life Scientists Capulets (A means to an end; Content providers)

The Princes of Genomics
Rebellious subjects, enemies to peace,
Profaners of this neighbour-stained steel,-
Will they not hear? What, ho! you men, you beasts,
That quench the fire of your pernicious rage
With purple fountains issuing from your veins,
On pain of torture, from those bloody hands
Throw your mistemper'd weapons to the ground,
And hear the sentence of your moved prince.
Three civil brawls, bred of an airy word,
By thee, old Capulet, and Montague,
Have thrice disturb'd the quiet of our streets,
And made genomics's ancient citizens
Cast by their grave beseeming ornaments,
To wield old partisans, in hands as old,
Canker'd with peace, to part your canker'd hate:

A tragedy?
As in Romeo and Juliet, the threats are political and sociological

Creating ontologies has become a widespread cottage industry
• Professional Societies
  – MGED: Microarray Gene Expression Data Society
  – HUPO: Human Proteome Organization
• Government
  – NCI Thesaurus
  – NIST: Process Specification Language
• Open Biological Ontologies
  – GO
  – Three dozen (and growing) other ontologies
  – Mostly in DAG-Edit, some in Protégé format

Government Continues to be a Major Driving Force
• Highly visible intramural initiatives to create public ontologies at many agencies, including NIST, NIH, VA, CDC
• Notable variation in these ontologies’
  – Scope
  – Representational sophistication
  – “Openness” of content
  – Opportunities for peer review

NCI Enterprise Vocabulary Services
1997: R. Klausner, Director NCI, wanted a “science management system”
• Know about everything funded by NCI
• Goals and results – “bench to bedside”
  – Thereby improve and speed translation of research
Approach:
1.
Create integrative terminology
2. Evolve terminology scope from supporting grants management to supporting science
3. Build Web-accessible infrastructure – caCORE

More than 37,000 concepts are represented with extremely detailed granularity in many areas

Definitions may include considerable detail with respect to properties that establish relationships with other concepts

NCI Thesaurus is in Active Use
• Website: nciterms.nci.nih.gov (more info: ncicb.nci.nih.gov/core/EVS); 1500-4000 page hits daily, 14K unique visitors (2004)
• API: NCICB & external applications
• Fulfills NCI and collaborators’ needs for controlled vocabulary
• Public domain, open content license

NCI Thesaurus Guidelines
• Develop content model (based on Ontylog description logic from Apelon, Inc.)
• Leverage existing sources as appropriate
  – MeSH, VA NDF-RT, MedDRA …
• Develop unique content where needed
  – Cancer genes, gene products, cancer diagnoses, drugs, chemotherapies, molecular abnormalities etc., and relationships among them
• Link to other standards using URLs where possible
  – OMIM, Swissprot, GO

NCI uses an Elaborate Process for Editing and Maintenance

The NCI Thesaurus is not without its problems
• Upper-level concepts are sometimes used inconsistently or not at all
• Textual definitions of concepts may not always reflect the meaning implied by the concepts’ position in the ontology
• Reliance on a proprietary knowledge-representation system
  – Prevents the ability to disseminate the ontology freely
  – Adds an unfortunate degree of uncertainty to the semantics

Throughout this cottage industry
• Lots of ontology development, principally by content experts with little training in conceptual modeling
• Use of development tools and ontology-definition languages that may be
  – Extremely limited in their expressiveness
  – Useless for detecting potential errors and guiding correction
  – Nonadherent to recognized standards
  – Proprietary and expensive

But the world is 
beginning to change!
• The Montagues do want to get the modeling right!
• The Capulets do want to see their work used by others!
• Useful, open tools and standards are now available that make it hard to justify closed, proprietary approaches

Some signs the world is changing …
• Developers of several overlapping and incompatible ontologies of anatomy suddenly are trying to understand why their models do not agree
• Philosopher Barry Smith suddenly is camping out at biomedical informatics meetings to get the attention of ontology developers
• NCI is piloting the use of OWL and Protégé to encode and manage the NCI Thesaurus
• MGED and several other biomedical ontologies are being authored in OWL and Protégé from the beginning
• Downloads of the Protégé system continue to escalate

Figure: Total Protégé Registrations by Month/Year, Jan '01 to Oct '04 (registrations through 10/13/04; vertical axis 0 to 22,000)

Protégé’s main features
• Simplified editing of ontologies and knowledge bases
• Open-source distribution to encourage development by a world-wide community of users
• A plug-in architecture that enables developers to add new features easily
• Support for a wide range of representation formats
  – CLIPS/COOL
  – XML Schema
  – UML
  – RDF
  – OWL

Protégé is ecumenical in its support for formal languages
• Open Knowledge Base Connectivity Protocol
  – CLIPS/COOL
  – UML
  – XML Schema
  – RDF and RDFS
  – Topic Maps
• Web Ontology Language (OWL)

Protégé remains successful because of its user community
• There are now 89 plug-ins available for use with Protégé
• Collaboration with our users enables rapid debugging and code fixes
• Some development, such as the creation of extensions to our basic OWL capabilities, has been a major collaborative experience
• Annual user group 
meetings provide great opportunities for developers to share strategies, principles, and war stories
• Members of the international Protégé community are a huge support base for new users and for fledgling projects

The NCI Thesaurus

Moving from cottage industry to the industrial age
• There must be widely available tools that are open-source, that are easy to use, and that adhere to knowledge representation standards: Protégé certainly is a candidate
• There must be a large user community of developers who use the tools and who can provide feedback to one another and to the core team of tool builders

Moving from cottage industry to the industrial age II
• Government and professional societies must set expectations regarding the need for appropriate standards
• Government and professional societies must invest in educational programs to teach Montagues to identify with Capulets, and vice versa
• Demonstration projects must communicate to the potential developers of future ontologies the strengths and weaknesses of the guidelines, tools, and languages that facilitated the development work

A thousand flowers are blooming from every corner of the landscape
• Ontologies are being developed by interested groups from every sector of academia, industry, and government
• Many of these ontologies have proven to be extraordinarily useful to wide communities
• Many of these same ontologies have been shown to be structurally flawed and of uncertain semantics
• We finally are at the stage where we have tools and representation languages that can lift us out of the grass roots to create durable and maintainable ontologies with rich semantic content

An infrastructure is now in place
• The need to build new ontologies in environmental health, phenotypic expression in model organisms, developmental biology, and many, many other domains is getting wide attention
• We finally have the tools and the languages to do things right
• Now all we need is the 
will, the educational opportunities, and the community feedback to help developers at the grass roots to reemerge as philosophers and princes.

Let’s have a happy ending.

Editing OWL Ontologies with Protégé
Holger Knublauch
Stanford University
July 06, 2004

This Tutorial
• Introduction to OWL, the Semantic Web, and the Protégé OWL Plugin
• Theory + Walkthrough
• Also available: Tutorial by Matthew Horridge (http://www.co-ode.org)
  – Similar content but more details on logic
  – Other example scenario (Pizzas)
• ... Workshop (this afternoon)
• ... Talks (tomorrow morning)

Overview
• The Semantic Web and OWL
• Basic OWL
• Interactive: Classes, Properties
• Advanced OWL
• Interactive: Class Descriptions
• Creating Semantic Web Contents

The Semantic Web
Shared ontologies help to exchange data and meaning between web-based services

Wine Example Scenario
(Image by Jim Hendler)
Figure: the user asks “Tell me what wines I should buy to serve with each course of the following menu.”; the Wine Agent answers “I recommend Chardonnay or DryRiesling”; other agents shown: Books Agent, Grocery Agent

Ontologies in the Semantic Web
• Provide shared data structures to exchange information between agents
• Can be explicitly used as annotations in web sites
• Can be used for knowledge-based services using other web resources
• Can help to structure knowledge to build domain models (for other purposes)

OWL
• Web Ontology Language
• Official W3C Standard since Feb 2004
• Based on predecessors (DAML+OIL)
• A Web Language: Based on RDF(S)
• An Ontology Language: Based on logic

OWL Ontologies
• What’s inside an OWL ontology
  – Classes + class-hierarchy
  – Properties (Slots) / values
  – Relations between classes (inheritance, disjoints, equivalents)
  – Restrictions on properties (type, cardinality)
  – Characteristics of properties (transitive, …)
  – Annotations
  – Individuals
• Reasoning tasks: classification, consistency checking

OWL Use Cases
• At least two different user groups
  – OWL used as data exchange language (define interfaces of 
services and agents)
  – OWL used for terminologies or knowledge models
• OWL DL is the subset of OWL (Full) that is optimized for reasoning and knowledge modeling

Protégé OWL Plugin
• Extension of Protégé for handling OWL ontologies
• Project started in April 2003
• Features
  – Loading and saving OWL files & databases
  – Graphical editors for class expressions
  – Access to description logics reasoners
  – Powerful platform for hooking in custom-tailored components

Tutorial Scenario
• Semantic Web for Tourism/Traveling
• Goal: Find matching holiday destinations for a customer
Figure: a customer request (“I am looking for a comfortable destination with beach access”) sent to the Tourism Web

Scenario Architecture
• A search problem: Match customer’s expectations with potential destinations
• Required: Web Service that exploits formal information about the available destinations
  – Accommodation (Hotels, B&B, Camping, ...)
  – Activities (Sightseeing, Sports, ...)

Tourism Semantic Web
• Open World:
  – New hotels are being added
  – New activities are offered
• Providers publish their services dynamically
• Standard format / grounding is needed → Tourism Ontology

Tourism Semantic Web
Figure: Tourism Ontology (Destination, Activity, Accomodation); OWL Metadata (Individuals); Web Services

OWL (in Protégé)
• Individuals (e.g., “FourSeasons”)
• Properties
  – ObjectProperties (references)
  – DatatypeProperties (simple values)
• Classes (e.g., “Hotel”)

Individuals
• Represent objects in the domain
• Specific things
• Two names could represent the same “real-world” individual
Figure: Sydney, SydneysOlympicBeach, BondiBeach

ObjectProperties
• Link two individuals together
• Relationships (0..n, n..m)
Figure: BondiBeach, Sydney, FourSeasons

Inverse Properties
• Represent bidirectional relationships
• Adding a value to one property also adds a value to the inverse property
Figure: BondiBeach, Sydney

Transitive Properties
• If A is related to B and B is related to C then 
A is also related to C
• Often used for part-of relationships
Figure: NewSouthWales, Sydney, BondiBeach; hasPart (derived)

DatatypeProperties
• Link individuals to primitive values (integers, floats, strings, booleans etc.)
• Often: AnnotationProperties without formal “meaning”
Figure: Sydney; hasSize = 4,500,000; isCapital = true; rdfs:comment = “Don’t miss the opera house”

Classes
• Sets of individuals with common characteristics
• Individuals are instances of at least one class
Figure: Beach, City, Sydney, Cairns, BondiBeach, CurrawongBeach

Range and Domain
• Property characteristics
  – Domain: “left side of relation” (Destination)
  – Range: “right side” (Accomodation)
Figure: Accomodation, Destination, BestWestern, Sydney, FourSeasons

Domains
• Individuals can only take values of properties that have matching domain
  – “Only Destinations can have Accomodations”
• Domain can contain multiple classes
• Domain can be undefined: Property can be used everywhere

Superclass Relationships
• Classes can be organized in a hierarchy
• Direct instances of subclass are also (indirect) instances of superclasses
Figure: Cairns, Sydney, Canberra, Coonabarabran

Class Relationships
• Classes can overlap arbitrarily
Figure: RetireeDestination, City, Cairns, BondiBeach, Sydney

Class Disjointness
• All classes could potentially overlap
• In many cases we want to make sure they don’t share instances
Figure: disjointWith, UrbanArea, Sydney, Sydney, City, RuralArea, Woomera, CapeYork, Destination

(Create a new OWL project)
(Create simple classes)
(Create class hierarchy and set disjoints)
(Create Contact class with datatype properties)
(Edit details of datatype properties)
(Create an object property hasContact)
(Create an object property with inverse)
(Create the remaining classes and properties)

Class Descriptions
• Classes can be described by their logical characteristics
• Descriptions are “anonymous classes”
Figure: Things with three star accommodation; RetireeDestination; SanJose; Sydney; BlueMountains; Things with sightseeing 
opportunities

Class Descriptions
• Define the “meaning” of classes
• Anonymous class expressions are used
  – “All national parks have campgrounds.”
  – “A backpackers destination is a destination that has budget accommodation and offers sports or adventure activities.”
• Expressions mostly restrict property values (OWL Restrictions)

Class Descriptions: Why?
• Based on OWL’s Description Logic support
• Formalize intentions and modeling decisions (comparable to test cases)
• Make sure that individuals fulfill conditions
• Tool-supported reasoning

Reasoning with Classes
• Tool support for three types of reasoning exists:
  – Consistency checking: Can a class have any instances?
  – Classification: Is A a subclass of B?
  – Instance classification: Which classes does an individual belong to?
• For Protégé we recommend RACER (but other tools with DIG support work too)

Restrictions (Overview)
• Define a condition for property values
  – allValuesFrom
  – someValuesFrom
  – hasValue
  – minCardinality
  – maxCardinality
  – cardinality
• An anonymous class consisting of all individuals that fulfill the condition

Cardinality Restrictions
• Meaning: The property must have at least/at most/exactly x values
• cardinality is the shortcut for minCardinality and maxCardinality with the same value
• Example: A FamilyDestination is a Destination that has at least one Accomodation and at least 2 Activities

allValuesFrom Restrictions
• Meaning: All values of the property must be of a certain type
• Warning: Individuals with no values also fulfill this condition (trivial satisfaction)
• Example: Hiking is a Sport that is only possible in NationalParks

someValuesFrom Restrictions
• Meaning: At least one value of the property must be of a certain type
• Others may exist as well
• Example: A NationalPark is a RuralArea that has at least one Campground and offers at least one Hiking opportunity

hasValue Restrictions
• Meaning: At least one of the values of the property is a certain value
• Similar to someValuesFrom but with Individuals and primitive 
values
• Example: A PartOfSydney is a Destination where one of the values of the isPartOf property is Sydney

Enumerated Classes
• Consist of exactly the listed individuals
Figure: OneStarRating, ThreeStarRating, TwoStarRating, BudgetAccomodation

Logical Class Definitions
• Define classes out of other classes
  – unionOf (or)
  – intersectionOf (and)
  – complementOf (not)
• Allow arbitrary nesting of class descriptions (A and (B or C) and not D)

unionOf
• The class of individuals that belong to class A or class B (or both)
• Example: Adventure or Sports activities
Figure: Adventure, Sports

intersectionOf
• The class of individuals that belong to both class A and class B
• Example: A BudgetHotelDestination is a destination with accommodation that is a budget accommodation and a hotel
Figure: BudgetAccomodation, Hotel

Implicit intersectionOf
• When a class is defined by more than one class description, then it consists of the intersection of the descriptions
• Example: A luxury hotel is a hotel that is also an accommodation with 3 stars
Figure: Hotel, LuxuryHotel, AccomodationWith3Stars

complementOf
• The class of all individuals that do not belong to a certain class
• Example: A quiet destination is a destination that is not a family destination
Figure: Destination, QuietDestination (grayed), FamilyDestination

Class Conditions
• Necessary Conditions: (Primitive / partial classes) “If we know that something is an X, then it must fulfill the conditions...”
• Necessary & Sufficient Conditions: (Defined / complete classes) “If something fulfills the conditions..., then it is an X.”

Class Conditions (2)
• NationalPark (not everything that fulfills these conditions is a NationalPark)
• QuietDestination (everything that fulfills these conditions is a QuietDestination)

Classification
• A RuralArea is a Destination
• A Campground is BudgetAccomodation
• Hiking is a Sport
• Therefore: Every NationalPark is a BackpackersDestination
Figure: NationalPark, BackpackersDestination, (Other BackpackerDestinations)

Classification (2)
• Input: Asserted class definitions
• Output: Inferred subclass relationships

(Create an enumerated class out of individuals)
(Create a hasValue restriction)
(Create a hasValue restriction)
(Create a defined class)
(Classify Campground)
(Add restrictions to City and Capital)
(Create defined class BackpackersDestination)
(Create defined class FamilyDestination)
(Create defined class QuietDestination)
(Create defined class RetireeDestination)
(Classification)
(Consistency Checking)

Visualization with OWLViz

OWL Wizards

Putting it All Together
• Ontology has been developed
• Published on a dedicated web address
• Ontology provides standard terminology
• Other ontologies can extend it
• Users can instantiate the ontology to provide instances
  – specific hotels
  – specific activities

Ontology Import
• Adds all classes, properties and individuals from an external OWL ontology into your project
• Allows you to create individuals and subclasses, or to further restrict imported classes
• Can be used to instantiate an ontology for the Semantic Web

Tourism Semantic Web (2)
Figure: OWL Metadata (Individuals); Tourism Ontology (Destination, Activity, Accomodation); Web Services

Ontology Import with Protégé
• On the Metadata tab:
  – Add namespace, define prefix
  – Check “Imported” and reload your project

Individuals

Individuals

OWL File
<?xml version="1.0"?>
<rdf:RDF
    xmlns="http://protege.stanford.edu/plugins/owl/owl-library/heli-bunjee.owl#"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:travel="http://protege.stanford.edu/plugins/owl/owl-library/travel.owl#"
    xml:base="http://protege.stanford.edu/plugins/owl/owl-library/heli-bunjee.owl">
  <owl:Ontology rdf:about="">
    <owl:imports
        rdf:resource="http://protege.stanford.edu/plugins/owl/owl-library/travel.owl"/>
  </owl:Ontology>
  <owl:Class rdf:ID="HeliBunjeeJumping">
    <rdfs:subClassOf rdf:resource="http://protege.stanford.edu/plugins/owl/owl-library/travel.owl#BunjeeJumping"/>
  </owl:Class>
  <HeliBunjeeJumping rdf:ID="ManicSuperBunjee">
    <travel:isPossibleIn>
      <rdf:Description rdf:about="http://protege.stanford.edu/plugins/owl/owl-library/travel.owl#Sydney">
        <travel:hasActivity rdf:resource="#ManicSuperBunjee"/>
      </rdf:Description>
    </travel:isPossibleIn>
    <travel:hasContact>
      <travel:Contact rdf:ID="MSBInc">
        <travel:hasEmail rdf:datatype="http://www.w3.org/2001/XMLSchema#string">msb@manicsuperbunjee.com</travel:hasEmail>
        <travel:hasCity rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Sydney</travel:hasCity>
        <travel:hasStreet rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Queen Victoria St</travel:hasStreet>
        <travel:hasZipCode rdf:datatype="http://www.w3.org/2001/XMLSchema#int">1240</travel:hasZipCode>
      </travel:Contact>
    </travel:hasContact>
    <rdfs:comment rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Manic super bunjee now offers nerve wrecking jumps from 300 feet right out of a helicopter. Satisfaction guaranteed.</rdfs:comment>
  </HeliBunjeeJumping>
</rdf:RDF>
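
Reading the OWL file (sketch)
To make the structure of the file above concrete, here is a minimal Python sketch that pulls out the three things the import slides talk about: the imported ontology, the new subclass, and the new individual. It is an illustration only. It treats the file as plain XML using the standard library, which works for this particular serialization but is not a general RDF parser; a real application would more likely use an RDF/OWL toolkit or Protégé itself. The names OWL_TEXT, NS, and summarize are hypothetical, and the embedded text is an abridged copy of the file (the contact details and comment are left out).

```python
import xml.etree.ElementTree as ET

# Namespace URIs copied from the OWL file above.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "owl": "http://www.w3.org/2002/07/owl#",
    "travel": "http://protege.stanford.edu/plugins/owl/owl-library/travel.owl#",
}
RDF = "{%s}" % NS["rdf"]  # ElementTree writes qualified names as {uri}local

# Abridged copy of the file (hypothetical variable name; contact and comment omitted).
OWL_TEXT = """<?xml version="1.0"?>
<rdf:RDF
    xmlns="http://protege.stanford.edu/plugins/owl/owl-library/heli-bunjee.owl#"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:travel="http://protege.stanford.edu/plugins/owl/owl-library/travel.owl#"
    xml:base="http://protege.stanford.edu/plugins/owl/owl-library/heli-bunjee.owl">
  <owl:Ontology rdf:about="">
    <owl:imports rdf:resource="http://protege.stanford.edu/plugins/owl/owl-library/travel.owl"/>
  </owl:Ontology>
  <owl:Class rdf:ID="HeliBunjeeJumping">
    <rdfs:subClassOf rdf:resource="http://protege.stanford.edu/plugins/owl/owl-library/travel.owl#BunjeeJumping"/>
  </owl:Class>
  <HeliBunjeeJumping rdf:ID="ManicSuperBunjee">
    <travel:isPossibleIn>
      <rdf:Description rdf:about="http://protege.stanford.edu/plugins/owl/owl-library/travel.owl#Sydney">
        <travel:hasActivity rdf:resource="#ManicSuperBunjee"/>
      </rdf:Description>
    </travel:isPossibleIn>
  </HeliBunjeeJumping>
</rdf:RDF>"""


def summarize(owl_text):
    """Return (imports, subclass_of, individuals) for a small RDF/XML OWL file."""
    root = ET.fromstring(owl_text)

    # owl:imports statements inside the owl:Ontology header
    imports = [e.get(RDF + "resource")
               for e in root.findall("owl:Ontology/owl:imports", NS)]

    # Named classes declared in this file, with their asserted superclasses
    subclass_of = {}
    for cls in root.findall("owl:Class", NS):
        parents = [p.get(RDF + "resource")
                   for p in cls.findall("rdfs:subClassOf", NS)]
        subclass_of[cls.get(RDF + "ID")] = parents

    # Typed individuals: top-level elements outside the owl/rdf/rdfs vocabularies
    individuals = {}
    for el in root:
        ns_uri, _, local_name = el.tag[1:].partition("}")
        if ns_uri in (NS["owl"], NS["rdf"], NS["rdfs"]):
            continue
        ident = el.get(RDF + "ID")
        if ident:
            individuals[ident] = local_name  # individual -> its class

    return imports, subclass_of, individuals


if __name__ == "__main__":
    imports, subclass_of, individuals = summarize(OWL_TEXT)
    print("imports:", imports)
    print("subclass_of:", subclass_of)
    print("individuals:", individuals)
```

The sketch only reports what is asserted in the file. Nothing is inferred, and the imported travel ontology is not fetched, so conclusions such as HeliBunjeeJumping being a kind of Activity would still need a reasoner working over both ontologies.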