Bill Gates is not a Parking Meter: Philosophical Quality Control in Automated Ontology-building
Catherine Legg & Sam Sarjant, University of Waikato
2 July 2012

Agenda
1. Philosophical Categories: What are they? What are they good for?
2. Automated Ontology-Building
   • Concept Mapping
   • Adding New Assertions
3. Philosophical Quality Control: Semantic vs. Ontological Relatedness

What is wrong with this statement?
S1: The number 8 is a very red number.
This seems somehow worse than false, for a false statement can be negated to produce a truth, but (Fregean quibbles notwithstanding):
S2: The number 8 is not a very red number.
doesn't seem right either (the odd synaesthesia episode notwithstanding).
The problem seems to be that numbers are not the kind of thing that can have colours. If someone thinks they are, then they don't understand what kind of thing numbers are. The traditional philosophical terminology for what is wrong is that S1 commits a category mistake.

It is well known that the philosophical discipline of ontology was invented by Aristotle to deal with these kinds of problems (among others): όντος (being) + λόγος (theory of). He called it first philosophy (i.e. the fundamental science). A key part of its role was to define categories.

Categories are intended to describe the most different kinds of things that exist in reality. Suggested examples include:
• Physical objects
• Times
• Events
• Numbers
• Relationships

Traditionally, ontologies were built into a taxonomic structure, or 'tree of knowledge'. [Slide shows the 'Tree of Porphyry' (3rd century A.D.).]

Category vs. Property Distinctions
• There is a subtle but important distinction between philosophical categories and mere properties.
• Although both divide entities into groups, and may be represented by classes, categories provide a deeper, more sortal division which enforces constraints.
• E.g. while we know the same thing cannot be both a colour and a number, the same cannot be said for green and square.
• However, at what 'level' of an ontology constraining categorical divisions give way to non-constraining property divisions is frequently unclear and contested.
• This has led to skepticism about philosophical categories.
• The disdain for "speculative metaphysics" in twentieth-century logical positivism (and various successors) didn't help either.

Agenda (next: 2. Automated Ontology-Building)

Why Automated Ontology-Building?
• A wealth of new, free, user-supplied Web content ('Web 2.0') provides raw data: blogs, tags, Wikipedia…
• New automated methods of determining semantic relatedness are available, e.g. Milne & Witten, "An Effective, Low-Cost Measure of Semantic Relatedness Obtained from Wikipedia Links" (2008); a sketch of this measure follows below.
• Manual methods are widely agreed to be too slow and labour-intensive (e.g. the Cyc project: 25 years, still unfinished).
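The Milne & Witten measure scores two Wikipedia articles by the overlap of the sets of articles that link to them, using a Normalized Google Distance-style formula. The slides contain no code, so what follows is only a minimal Python sketch of that idea; the function and variable names are ours, not the paper's.

import math

def link_relatedness(links_a, links_b, total_articles):
    # links_a, links_b: sets of articles that link TO each of the two articles
    # total_articles:   number of articles in the whole Wikipedia dump
    common = links_a & links_b
    if not common:
        return 0.0
    # Normalized Google Distance-style formula over in-link counts.
    distance = ((math.log(max(len(links_a), len(links_b))) - math.log(len(common)))
                / (math.log(total_articles) - math.log(min(len(links_a), len(links_b)))))
    return max(0.0, 1.0 - distance)

Articles that share most of their in-links score close to 1; articles with no in-links in common score 0.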
What Has Been Done So Far?
• YAGO (WordNet backbone, enriched with Wikipedia's leaf categories. Taxonomic.)
• DBpedia (Wikipedia infoboxes and other semi-structured data. Not taxonomic.)
• EMLR (Wikipedia category network, enriched with further relations obtained by, e.g., parsing category names such as BornIn1954. Taxonomic.)

Our Choice
• Use the Cyc taxonomy as backbone, because:
   • it is highly principled, being the result of so many years' labour
   • it has a purpose-built inference engine to reason over the knowledge
   • it is now publicly available 'open source' (ResearchCyc)
   • one researcher (Legg) had inside knowledge of the system, which is rare worldwide.
• Build onto it knowledge mined from Wikipedia, because:
   • we have access to the Wikipedia-based automated measure of semantic relatedness developed at the University of Waikato
   • Wikipedia is astounding:
      • 2.4M (English) articles, referred to by 3M different terms
      • ~25 hyperlinks per article
      • 175K templates for semi-structured data entry (including 9K 'infoboxes')
      • full editing history for every article, etc.

Cyc Ontology & Knowledge Base
OpenCyc contains:
• ~13,500 predicates
• ~200,000 concepts
• ~3,000,000 assertions
Represented in first-order logic, higher-order logic and context logic (micro-theories).
[Slide shows the Cyc upper ontology, from Thing down through Intangible, Temporal, Spatial and Partially Tangible things to domain-specific knowledge and data, covering e.g. logic and maths, space and time, physical objects, living things, agents, events, artifacts, human activities, organizations, politics, law, commerce and everyday living.]

Cyc Common-Sense Knowledge 1: Semantic Argument Constraints on Relations
Cyc contains many assertions of the form:
   (arg1Isa birthDate Animal)
   (arg2Isa capital City)
These represent that only animals have birthdays, and that the capital of a country must be a city. These features of Cyc are a form of categorical knowledge. Although some of the categories invoked might seem relatively specific and trivial compared to Aristotle's, logically the constraining process is the same. Cyc enforces these constraints at knowledge entry, a notable difference from every other formal ontology of its size.

Cyc Common-Sense Knowledge 2: Disjointness Assertions Between Collections
ResearchCyc currently contains 6,000 explicitly asserted disjointWith claims, for example:
   (disjointWith Doorway WindowPortal)
   (disjointWith HomogeneousStructure Ecosystem)
   (disjointWith YardWork ShootingAProjectileWeapon) (!)
From these, countless further claims can be deduced. Again, Cyc enforces these constraints at knowledge entry. WordNet knows about sibling relationships, so in some sense it knows that a cat cannot be a dog, but it cannot ramify this knowledge through its hierarchy in this way.

Cyc reasoning over its common-sense knowledge
[Slide shows an example inference: a key assertion, combined with the disjointness network, yields a conclusion that was never explicitly asserted into Cyc (!)]
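Argument constraints and ramified disjointness together give a simple recipe for vetting a candidate assertion before it is added. The following Python sketch is only an illustration of that recipe under our own simplifying assumptions, not Cyc's actual implementation; the class, function and dictionary names are invented.

class Taxonomy:
    # Toy stand-in for the Cyc collection hierarchy plus its explicitly
    # asserted disjointWith pairs.
    def __init__(self, parents, disjoint_pairs):
        self.parents = parents                        # collection -> supercollections
        self.disjoint = {frozenset(pair) for pair in disjoint_pairs}

    def ancestors(self, coll):
        seen, stack = {coll}, [coll]
        while stack:
            for parent in self.parents.get(stack.pop(), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    def disjoint_with(self, a, b):
        # Disjointness ramifies: if any ancestor of a is explicitly disjoint
        # with any ancestor of b, then a and b cannot share instances.
        ancestors_b = self.ancestors(b)
        return any(frozenset((x, y)) in self.disjoint
                   for x in self.ancestors(a) for y in ancestors_b)

def violates_argument_constraints(taxonomy, arg_constraints, predicate, arg_types):
    # Reject (predicate arg1 arg2 ...) if any argument's known type is provably
    # disjoint from the type demanded by the predicate's argNIsa constraints,
    # e.g. arg_constraints['birthDate'] == ('Animal', 'Date').
    required = arg_constraints[predicate]
    return any(taxonomy.disjoint_with(actual, wanted)
               for actual, wanted in zip(arg_types, required))

An isa claim can be vetted the same way: placing an individual under a collection fails as soon as a collection the individual already belongs to is, explicitly or by inheritance, disjoint with the proposed one, which is roughly the kind of rejection described in the quality-control section below.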
Wikipedia as an ontology
• articles ≈ basic concepts
• infoboxes ≈ facts about those concepts
• first sentences ≈ concept definitions, often in a standard format
• hyperlinks between articles ≈ 'semantic relatedness' between concepts
• categories organise articles into conceptual groupings, though these groupings are far from a principled taxonomy enabling knowledge inheritance. [Slide shows an example category.]

Agenda (next: 2. Automated Ontology-Building: Concept Mapping)

Stages A & B: easy 1-1 matches using title strings (or synonyms)
• Exact mappings via Cyc synonyms: CityOfSaarbrucken ↔ Saarbrücken
• Get synonyms asserted in Cyc, e.g. using #$nameString: "Saarbrucken"
• Use Wikipedia redirect matches: Saarbrucken → Saarbrücken

Stage C: many Wikipedia candidates for one Cyc term; semantic disambiguation required on the Wikipedia side (sketched below)
• Collect all candidates (e.g. Kiwi: the bird? the fruit? the nationality?)
• Compute the commonness of each candidate (how often the string is used as anchor text linking to that candidate's Wikipedia page)
• Collect context from Cyc (concepts nearby in the taxonomy)
• Compute each candidate's similarity to that context (using Wikipedia hyperlinks: Milne & Witten, 2008)
• Determine the best candidate

Stage D: many Cyc terms for one Wikipedia article; reverse disambiguation required
In this stage many candidate mappings were eliminated by mapping back from the Wikipedia article to the Cyc term, discarding mappings which don't 'map back'. For example, the Cyc term #$DirectorOfOrganisation incorrectly maps to Film director, but when we attempt to find a Cyc term for Film director we get #$Director-Film, so the mapping is discarded. This reduced the number of mappings by 43%, but increased precision considerably.
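The Stage C disambiguation can be pictured as weighing each candidate's commonness against its link-based relatedness to the articles already mapped from the surrounding Cyc context. This Python sketch reuses link_relatedness from the earlier sketch; the names and the equal weighting of the two scores are our assumptions, and the real system's combination of evidence may differ.

def best_candidate(candidates, context_articles, in_links, total_articles,
                   commonness_weight=0.5):
    # candidates:       {candidate article: commonness}, where commonness is the
    #                   fraction of links with this anchor text that point to it
    # context_articles: Wikipedia articles already mapped to nearby Cyc concepts
    # in_links:         {article: set of articles that link to it}
    def context_score(article):
        if not context_articles:
            return 0.0
        return sum(link_relatedness(in_links[article], in_links[c], total_articles)
                   for c in context_articles) / len(context_articles)

    return max(candidates, key=lambda a: commonness_weight * candidates[a]
                                         + (1 - commonness_weight) * context_score(a))

On this picture, "Kiwi" appearing near bird-related Cyc concepts would resolve to the bird article even if another sense were more common as anchor text.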
Agenda (next: 2. Automated Ontology-Building: Adding New Assertions)

Finding new concepts in Wikipedia and adding them to Cyc
First we found mapped concepts where the Wikipedia article had an equivalent category (about 20% of mapped concepts). E.g. the article Israeli settlement has an equivalent category Israeli settlements. We then mined this category for new concepts belonging under the mapped Cyc concept, according to the Cyc taxonomy. We called these 'true children'.

We identified true children by two heuristics (sketched after the examples):
1. Parsing the first sentences of Wikipedia articles (with a list of regular expressions):
   • Havat Gilad (Hebrew: חַוַת גִּלְעָד, lit. Gilad Farm) is an Israeli settlement outpost in the West Bank.
   • Netiv HaGdud (Hebrew: נְתִיב הַגְדוד, lit. Path of the Battalion) is a moshav and Israeli settlement in the West Bank.
   • Kfar Eldad (Hebrew: כפר אלדד) is an Israeli settlement and a communal settlement in the Gush Etzion Regional Council, south of Jerusalem.
   • The Yad La'achim operation (Hebrew: מבצע יד לאחים, "Giving hand to brothers") was an operation that the IDF performed during the disengagement plan. (This one fails the first-sentence test.)
2. 'Infobox pairing': if an article in the category fails the first-sentence test but shares an infobox template with 90% of the true children, we include it anyway.
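The two heuristics can be sketched roughly as follows in Python. The single regular expression stands in for the list of patterns the slides mention, and the helper names are ours; only the 90% threshold comes from the slides.

import re
from collections import Counter

def passes_first_sentence_test(first_sentence, parent_phrase):
    # Does the article's first sentence say it *is a/an/the* <parent_phrase>?
    # e.g. parent_phrase = "Israeli settlement".  One illustrative pattern,
    # standing in for the list of regular expressions used in the real system.
    pattern = r"\bis (?:an?|the)\b[^.]*\b" + re.escape(parent_phrase.lower()) + r"\b"
    return re.search(pattern, first_sentence.lower()) is not None

def passes_infobox_pairing(article_infobox, true_children_infoboxes, threshold=0.9):
    # Include an article that failed the first-sentence test if it shares an
    # infobox template with at least `threshold` of the confirmed true children.
    if article_infobox is None or not true_children_infoboxes:
        return False
    share = Counter(true_children_infoboxes)[article_infobox] / len(true_children_infoboxes)
    return share >= threshold

On the examples above, the Havat Gilad, Netiv HaGdud and Kfar Eldad sentences match the pattern for "Israeli settlement", while the Yad La'achim sentence does not.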
Final Results

Mappings:
   Method                                            Cyc terms    Percent mapped
   Total terms available                             163,000
   Common-sense terms                                 83,900
   Exact (1-1) mappings                               33,500      40%
   Further mappings after disambiguation (2 ways)      8,800      10%
   Total mapped                                       42,300      50%

New knowledge:
   Method                    New Cyc assertions    Percent growth
   New Cyc concepts                35,000              30%
   New 'doc strings'               17,000
   Other new assertions           228,000              10%

Evaluation (22 human volunteers, online form, compared with the DBpedia Ontology)

   Case:               1      2      3      4      5      6
   DBpedia children    0.58   0.81   0.99   0.98   0.99   0.99
   Our new children    0.57   0.88   0.99   0.90   0.90   1.00
   2008 mappings       0.65   0.83   0.99   0.99   0.99   1.00
   2009 mappings       0.68   0.91   1.00   1.00   1.00   1.00

   Cases:
   1: 100% of evaluators thought the assignment correct
   2: >50% thought the assignment correct
   3: at least 1 thought the assignment correct
   4: 100% thought the assignment correct or close
   5: >50% thought the assignment correct or close
   6: at least 1 thought the assignment correct or close

Agenda (next: 3. Philosophical Quality Control: Semantic vs. Ontological Relatedness)

Quality control is provided via Cyc's common-sense knowledge: due to that knowledge, Cyc 'regurgitated' (rejected) many assertions fed to it which were ontologically incorrect.

Examples of regurgitated assertions:

(#$isa #$CallumRoberts #$Research)
Why it happened: "Professor Callum Roberts is a marine conservation biologist, oceanographer, author and research scholar in the Environment Department of the University of York in England."
Why Cyc rejected it: Cyc knows that the collection of biological living objects is disjoint with the collection of information objects.

(#$isa #$Insight-EMailClient #$EMailMessage)
Why it happened: "Insight WebClient is a groupware e-mail client from Bynari embedded in the Arachne web browser for DOS."
Why Cyc rejected it: Cyc knows that the collection of software programs is disjoint with the collection of letters.

Ontological vs Semantic Relatedness
These examples usefully highlight a clear difference between quantitative measures of semantic relatedness and an ontological relatedness derivable from a principled category structure. Callum Roberts is a researcher, which is highly semantically related to research; Insight is an e-mail client, which is highly semantically related to e-mail messages. Thematically these pairs are extremely close, but ontologically they are very different kinds of thing. In this way, Cyc rejected 4,300 assertions, roughly 3% of the total presented to it. Manual analysis showed that, of these, 96% were true negatives.

Future building work
Now that we have a distinction between semantic and ontological relatedness, combining the two has powerful possibilities; in general in automated information science, overlapping independent heuristics are a boon to accuracy. We plan to:
• Automatically augment Cyc's disjointness network and its semantic argument constraints on relations: systematically organized infobox relations are a natural ground for generalizing argument constraints, and the Wikipedia category network will be mined, with caution, for further disjointness knowledge.
• Evaluate much more fully both this automated ontology-building effort and the other current leaders in the field.

Philosophical lessons
• Our project suggests that the notion of philosophical categories leads to measurable improvements in real-world ontology-building.
• Just how extensive a system of categories should be will require real-world testing. But now we have the tools and the free user-supplied data to do this.
• Where exactly the line should be drawn between categories proper and mere properties remains open.
• However, modern statistical tools raise the possibility of a quantitative treatment of ontological relatedness that is more nuanced than Aristotle's ten neat piles of predicates, yet can still recognize that it is highly problematic to say that the number 8 is red, and why.

clegg@waikato.ac.nz
sjs31@cs.waikato.ac.nz