YAGO: A LARGE ONTOLOGY FROM WIKIPEDIA AND WORDNET
FABIAN M. SUCHANEK, GJERGJI KASNECI, GERHARD WEIKUM
Subbalakshmi Iyer

Motivation for an Ontology
- Natural language communication
- Automated text translation
- Finding information on the Internet
- Computer-processable collection of knowledge

What is an Ontology?
- An ontology is the description of a domain, its classes and properties, and the relationships between those classes, by means of a formal language.
- A collection of knowledge about the world, a knowledge base
- Example ontologies:
  - large taxonomies categorizing Web sites (such as on Yahoo!)
  - categorizations of products for sale and their features (such as on Amazon.com)

Uses of Ontologies
- Machine Translation
- Word Sense Disambiguation
- Document Classification
- Question Answering
- Entity and fact-oriented Web Search

What is YAGO
- Yet Another Great Ontology
- Part of the YAGO-NAGA project
- Goal: build a knowledge base that is
  - large-scale
  - domain-independent
  - automatically constructed
  - highly accurate
- Uses Wikipedia and WordNet

More about YAGO
- 2 million entities
- 20 million facts
- Facts represented as RDF triples
- Accuracy of 95%
- Examples:
  Elvis Presley isA singer
  singer subClassOf person
  Elvis Presley bornOnDate 1935-01-08
  Elvis Presley bornIn Tupelo
  Tupelo locatedIn Mississippi(state)
  Mississippi(state) locatedIn USA

The YAGO model
- Slight extension of RDFS
- Represents knowledge as
  - Entities
  - Classes
  - Relations
  - Facts
- Properties of relations, like transitivity
- Simple and decidable model

Knowledge Representation in YAGO
- All objects are entities, e.g. Elvis Presley, Grammy Award
- Two entities can stand in a relationship, e.g. hasWonAward:
  Elvis Presley hasWonAward Grammy Award
- The triple of entity, relationship, entity is a fact,
  e.g. Elvis Presley hasWonAward Grammy Award is a fact

Knowledge Representation in YAGO - 2
- Numbers, dates and strings are also entities.
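As a minimal sketch of the fact model described so far, the Python snippet below stores facts as (entity, relation, entity) triples, keeps a date as an ordinary entity, and applies transitivity to a relation. The names Fact, TRANSITIVE and transitive_closure are illustrative assumptions, as is the choice of which relations count as transitive; none of this is taken from the YAGO implementation.

```python
from typing import NamedTuple, Set


class Fact(NamedTuple):
    """One triple: entity, relation, entity (hypothetical helper type)."""
    subject: str
    relation: str
    obj: str


# Example facts copied from the "More about YAGO" slide.
FACTS: Set[Fact] = {
    Fact("Elvis Presley", "isA", "singer"),
    Fact("singer", "subClassOf", "person"),
    Fact("Elvis Presley", "bornOnDate", "1935-01-08"),  # a date is an entity too
    Fact("Elvis Presley", "bornIn", "Tupelo"),
    Fact("Tupelo", "locatedIn", "Mississippi(state)"),
    Fact("Mississippi(state)", "locatedIn", "USA"),
}

# Assumption for this sketch: these two relations are treated as transitive.
TRANSITIVE = {"locatedIn", "subClassOf"}


def transitive_closure(facts: Set[Fact]) -> Set[Fact]:
    """Add every fact implied by chaining a transitive relation with itself."""
    closed = set(facts)
    changed = True
    while changed:
        changed = False
        for a in list(closed):
            if a.relation not in TRANSITIVE:
                continue
            for b in list(closed):
                if b.relation == a.relation and a.obj == b.subject:
                    derived = Fact(a.subject, a.relation, b.obj)
                    if derived not in closed:
                        closed.add(derived)
                        changed = True
    return closed


if __name__ == "__main__":
    for fact in sorted(transitive_closure(FACTS) - FACTS):
        print(fact.subject, fact.relation, fact.obj)
```

With the facts above, the only derived fact is Tupelo locatedIn USA, obtained by chaining the two locatedIn facts.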
- Example of a number as an entity: Elvis Presley bornInYear 1935
- Words are entities: “Elvis” means Elvis Presley
- An entity is an instance of a class: Elvis Presley type singer
- Classes are also entities: singer type class

Knowledge Representation in YAGO - 3
- Classes have hierarchies: singer subClassOf person
- Relations are also entities: subClassOf type atr
- Each fact has a fact identifier: #1 foundIn Wikipedia

Key Contributions of YAGO
- Information Extraction from Wikipedia
  - Infoboxes
  - Category Pages
- Combination with WordNet
  - Taxonomy
- Quality Control
  - Canonicalization
  - Type Checking

Information Extraction - 1
- Entities from Wikipedia
- Each page title is a candidate entity
- Wiki Markup Language
- Wikipedia dump as of September, 2008

Information Extraction - WML

Information Extraction Techniques
  Technique               Source
  Infobox Harvesting      Wikipedia Infoboxes
  Word-Level Techniques   Wikipedia Redirects
  Category Harvesting     Wikipedia Categories
  Type Extraction         Wikipedia Categories, WordNet Classes

1. Information Extraction from Wikipedia – Infobox Harvesting
- Attribute Map (columns: Infobox Attribute, Relation, Inverse, Manifold, Indirect)
- Relation Map (columns: Relation, Domain, Range)

Wikipedia Infobox examples (article: Elvis Presley)

Born: January 8, 1935
- Attribute Map: Born -> bornOnDate
- Relation Map: bornOnDate, domain person, range yagoDate
- Fact: Elvis Presley bornOnDate January 8, 1935

Died: August 16, 1977
- Attribute Map: Died -> diedOnDate
- Relation Map: diedOnDate, domain person, range yagoDate
- Fact: Elvis Presley diedOnDate August 16, 1977

Genre: Rock and Roll
- Attribute Map: Genre -> isOfGenre
- Relation Map: isOfGenre, domain entity, range yagoClass
- Fact: Elvis Presley isOfGenre Rock and Roll

Birth Name: Elvis Aaron Presley
- Attribute Map: birth name -> means
- Relation Map: means, domain yagoWord, range entity
- Fact: “Elvis Aaron Presley” means Elvis Presley

Manifold Attributes
- Some attributes may have multiple values, e.g.
  a person may have multiple children
- Multiple facts are generated, e.g. one hasChild fact for each child

Indirect Attributes - 1
- Attribute Map:
  Attribute   Relation   Indirect
  gdp ppp     hasGDP
  gdp year    during     yes
- Some attributes do not concern the article entity, but another fact
- e.g. the GDP year attribute does not concern the article entity, i.e. Republic of Singapore, but the hasGDP fact (year 2008)
- Therefore, the facts generated are:
  #14: Singapore hasGDP 238.755 billion
  #14 during 2008

Indirect Attributes - 2
- Singapore Infobox

Type of Infobox
Song Infobox: American Pie
  Released   October, 1971
  Format     vinyl record
  Genre      Folk Rock
  Length     8:33 mins
  Label      United Artists
  Writer     Don McLean
Car Infobox: Tesla Roadster
  Manufacturer   Tesla Motors
  Production     2008-present
  Class          Roadster
  Length         3,946 mm
  Width          1,873 mm
  Height         1,127 mm

Type of Infobox: Attribute Map
- Attribute Map:
  Attribute     Relation
  car#length    hasLength
  song#length   hasDuration
- Song Infobox: American Pie hasDuration 8:33
- Car Infobox: Tesla Roadster hasLength 3946

Information Extraction - Word Level Techniques
- Wikipedia Redirects
  - The virtual redirect page for “Presley, Elvis” links to “Elvis Presley”
  - Each redirect gives a ‘means’ fact, e.g. “Presley, Elvis” means Elvis Presley
- Parsing Person Names
  - Extract the name components
  - Establish the relations givenNameOf and familyNameOf, e.g.
    Presley familyNameOf Elvis Presley
    Elvis givenNameOf Elvis Presley

Wikipedia Categories
- Categories: Presidents of the United States | Lists of office-holders | Lists of Presidents
- Categories: Rift Valleys | North Sea | Rivers of Germany | Articles needing translation from German Wikipedia | Rivers of Netherlands
- Categories: Canadian Singers | Canadian male singers | 1959 births | English-language singers | Living people | Grammy Award Winners | Portrait photographers

Facts created from Wikipedia Categories
  Rhine locatedIn Germany
  Bryan Adams bornOnDate 1959
  Bryan Adams hasWonAward Grammy Award
  Abraham Lincoln politicianOf United States

Information Extraction - Category Harvesting
Relational Categories
  Regular Expression               Relation
  ([0-9]{3,4}) births              bornOnDate
  ([0-9]{3,4}) deaths              diedOnDate
  ([0-9]{3,4}) establishments      establishedOnDate
  ([0-9]{3,4}) books|novels        writtenOnDate
  Mountains|Rivers in (.*)         locatedIn
  Presidents|Governors of (.*)     politicianOf
  (.*) winners                     hasWonPrize
  [A-Za-z]+ (.*) winners           hasWonPrize
Table: Some Category Heuristics

2. Connecting Wikipedia and WordNet – What is WordNet
- Lexical database for the English language
- Created at the Cognitive Science Laboratory of Princeton University
- Groups English words into sets of synonyms called synsets
- Provides short, general definitions
- Provides hypernym/hyponym relations, e.g. canine is the hypernym, dog is the hyponym

Connecting Wikipedia and WordNet – Type Extraction
- Goal: create a class hierarchy, e.g.
  singer subClassOf performer
  performer subClassOf artist
- The hyponymy relation comes from WordNet
- The Wikipedia class ‘American people in Japan’ is a subclass of the WordNet class ‘person’

Classifications of Categories
- Conceptual Categories, e.g. Albert Einstein is in ‘Naturalized citizens of the United States’
- Administrative Categories, e.g.
  Albert Einstein is in ‘Articles with unsourced statements’
- Relational Information, e.g. 1879 births
- Thematic Vicinity, e.g. Physics

Identification of Conceptual Categories
- Only conceptual categories are used
- Shallow linguistic parsing of category names, e.g. category ‘American people in Japan’
- Break the category into
  - pre-modifier: ‘American’
  - head: ‘people’
  - post-modifier: ‘in Japan’
- If the head is plural, then the category is a conceptual category
- Extract the class from the Wikipedia category
- Connect it to a class from WordNet, e.g. the Wikipedia class ‘American people in Japan’ has to be made a subclass of the WordNet class ‘person’

Algorithm
Function wiki2wordnet(c)
Input: Wikipedia category name c
Output: WordNet synset
   1  head = headCompound(c)
   2  pre = preModifier(c)
   3  post = postModifier(c)
   4  head = stem(head)
   5  If there is a WordNet synset s for pre + head
   6      return s
   7  If there are WordNet synsets s1, …, sn for head
   8      (ordered by their frequency for head)
   9      return s1
  10  fail

Explanation of Algorithm
Input: American people in Japan
   1. Head: people
   2. Pre-modifier: American
   3. Post-modifier: in Japan
   4. Stem(head): person
   5. If there is a WordNet synset for ‘American person’
   6.     return that synset
   7. If there are synsets s1, …, sn for ‘person’
   8.     (ordered by frequency for ‘person’)
   9.     return s1
  10. Fail
Output: person
Result: American people in Japan subClassOf person

Fig.: WordNet search for ‘person’
Fig.: WordNet search for ‘American person’

Exceptions
- Complete hierarchy of classes
  - Upper classes from WordNet
  - Leaves from Wikipedia
- About 2 dozen cases failed
- Categories with the head compound “capital”
  - In Wikipedia, it means “capital city”
  - In WordNet, it means “financial asset”
- These cases were corrected manually

3.
Quality Control
- Canonicalization
  - Each fact and each entity reference is unique
  - An entity is always referred to by the same identifier in all facts in YAGO
- Type Checking
  - Eliminates individuals that do not have a class
  - Eliminates facts that do not respect domain and range constraints
  - An argument of a fact in YAGO is always an instance of the class required by the relation

Canonicalization - 1
Redirect Resolution
- Infobox heuristics deliver facts that have Wikipedia entities (i.e. Wikipedia links) as arguments
- These links may not be correct Wikipedia page identifiers
- Check whether each argument is a correct Wikipedia identifier
- Replace it by the correct, redirected identifier
- E.g. Hermitage Museum locatedIn St. Petersburg
  becomes Hermitage Museum locatedIn Saint Petersburg

Canonicalization - 2
Removal of Duplicate Facts
- Sometimes, two heuristics deliver the same fact
- Canonicalization eliminates one of them
- E.g. the category ‘1935 births’ yields the fact:
  Elvis Presley bornOnDate 1935
- The infobox attribute ‘Born: January 8, 1935’ yields the fact:
  Elvis Presley bornOnDate January 8, 1935

Type Checking - 1
Reductive Type Checking
- Sometimes the class of an entity cannot be determined
- Such facts are discarded
- E.g. Wikipedia entities that have been proposed for an article, but that do not have a page yet
Inductive Type Checking
- Type constraints can be used to generate facts
- E.g. Elvis Presley bornOnDate January 8, 1935
  So, Elvis Presley is a person
- A regular expression check ensures that the entity name follows the pattern of given name and family name

Type Checking - 2
Type Coherence Checking
- Sometimes, classification yields wrong results
- E.g. Abraham Lincoln is an instance of 13 classes
  - 12 are subclasses of the class ‘person’, e.g. lawyer, president
  - The 13th class is the class ‘cabinet’
- The class hierarchy of YAGO is partitioned into branches, e.g.
  locations, artifacts, people, other physical entities, and abstract entities
- The branch that most types lead to is determined
- Other types are purged

References
- YAGO: A Large Ontology from Wikipedia and WordNet. Fabian M. Suchanek, Gjergji Kasneci, Gerhard Weikum. Max-Planck-Institute for Computer Science, Saarbruecken, Germany
- Automated Construction and Growth of a Large Ontology. Fabian M. Suchanek. Thesis for obtaining the title of Doctor of Engineering of the Faculties of Natural Sciences and Technology of Saarland University
- Wikipedia: http://en.wikipedia.org/wiki/Main_Page
- WordNet: http://wordnet.princeton.edu/

Thank You, Any Questions?