YAGO: A Large Ontology from Wikipedia and WordNet

Fabian M. Suchanek, Gjergji Kasneci, Gerhard Weikum
Subbalakshmi Iyer
Motivation for an Ontology
- Natural language communication
- Automated text translation
- Finding information on the internet
- Computer-processable collection of knowledge

What is an Ontology?
An ontology is the description of a domain, its classes and properties, and the relationships between those classes, by means of a formal language.
- A collection of knowledge about the world, a knowledge base
- Example ontologies:
  - large taxonomies categorizing Web sites (such as on Yahoo!)
  - categorizations of products for sale and their features (such as on Amazon.com)
Uses of Ontologies
- Machine Translation
- Word Sense Disambiguation
- Document Classification
- Question Answering
- Entity and fact-oriented Web Search
What is YAGO?
- Yet Another Great Ontology
- Part of the YAGO-NAGA project
- Goal: to build a knowledge base that is
  - large-scale
  - domain-independent
  - automatically constructed
  - highly accurate
- Uses Wikipedia and WordNet
More about YAGO
- 2 million entities
- 20 million facts
- Facts represented as RDF triples
- Accuracy of 95%
- Examples:
  - Elvis Presley isA singer
  - singer subClassOf person
  - Elvis Presley bornOnDate 1935-01-08
  - Elvis Presley bornIn Tupelo
  - Tupelo locatedIn Mississippi(state)
  - Mississippi(state) locatedIn USA
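The example facts can be held as plain (subject, relation, object) triples. The following is a minimal sketch in Python, not YAGO's actual storage format; the helper name `objects` is hypothetical and introduced only for illustration.

```python
# Facts from the slide as (subject, relation, object) triples.
FACTS = [
    ("Elvis Presley", "isA", "singer"),
    ("singer", "subClassOf", "person"),
    ("Elvis Presley", "bornOnDate", "1935-01-08"),
    ("Elvis Presley", "bornIn", "Tupelo"),
    ("Tupelo", "locatedIn", "Mississippi(state)"),
    ("Mississippi(state)", "locatedIn", "USA"),
]


def objects(facts, subject, relation):
    """Return every object o such that (subject, relation, o) is a fact."""
    return [o for s, r, o in facts if s == subject and r == relation]


born_in = objects(FACTS, "Elvis Presley", "bornIn")
print(born_in)  # ['Tupelo']
```

A list of tuples keeps the sketch short; a real knowledge base of this size would need indexing by subject and relation.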

The YAGO model
- Slight extension of RDFS
- Represents knowledge as
  - Entities
  - Classes
  - Relations
  - Facts
- Expresses properties of relations, such as transitivity
- Simple and decidable model
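To illustrate what declaring a relation transitive buys, here is a small sketch that derives the implied pairs of a transitive relation. It assumes subClassOf is the transitive relation and uses the singer/performer/artist chain from the type-extraction slide; the function name `transitive_closure` is hypothetical.

```python
def transitive_closure(pairs):
    """Return all (a, c) pairs reachable by chaining the given (a, b) pairs."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


sub_class_of = {("singer", "performer"), ("performer", "artist")}
closure = transitive_closure(sub_class_of)
print(("singer", "artist") in closure)  # True
```

The nested loop is quadratic per pass and is only meant to show the idea, not to scale.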
Knowledge Representation in YAGO
- All objects are entities
  - e.g. Elvis Presley, Grammy Award
- Two entities can stand in a relationship
  - e.g. hasWonAward
  - Elvis Presley hasWonAward Grammy Award
- The triple of entity, relationship, entity is a fact
  - e.g. Elvis Presley hasWonAward Grammy Award is a fact
Knowledge Representation in YAGO - 2
- Numbers, dates and strings are also entities
  - Elvis Presley bornInYear 1935
- Words are entities
  - “Elvis” means Elvis Presley
- An entity is an instance of a class
  - Elvis Presley type singer
- Classes are also entities
  - singer type class
Knowledge Representation in YAGO - 3
- Classes have hierarchies
  - singer subClassOf person
- Relations are also entities
  - subClassOf type atr (acyclic transitive relation)
- Each fact has a fact identifier
  - #1 foundIn Wikipedia
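Because each fact has an identifier, a fact can itself be the subject of another fact. A minimal sketch, assuming identifiers are simply numbered in insertion order; `add_fact` and the `facts` dictionary are hypothetical names, not part of YAGO.

```python
from itertools import count

_ids = count(1)
facts = {}  # fact identifier -> (subject, relation, object)


def add_fact(subject, relation, obj):
    """Store a triple and return its identifier, e.g. '#1'."""
    fact_id = f"#{next(_ids)}"
    facts[fact_id] = (subject, relation, obj)
    return fact_id


f1 = add_fact("Elvis Presley", "hasWonAward", "Grammy Award")
# A fact identifier can itself be the subject of another fact.
f2 = add_fact(f1, "foundIn", "Wikipedia")
print(f1, facts[f2])  # #1 ('#1', 'foundIn', 'Wikipedia')
```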
Key Contributions of YAGO
- Information Extraction from Wikipedia
  - Infoboxes
  - Category Pages
- Combination with WordNet
  - Taxonomy
- Quality Control
  - Canonicalization
  - Type Checking
Information Extraction - 1
- Entities from Wikipedia
- Each page title is a candidate entity
- Wiki Markup Language
- Wikipedia dump as of September 2008
Information Extraction - WML
Information Extraction Techniques
- Infobox Harvesting
  - Wikipedia Infoboxes
- Word-Level Techniques
  - Wikipedia Redirects
- Category Harvesting
  - Wikipedia Categories
- Type Extraction
  - Wikipedia Categories, WordNet Classes
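1. Information Extraction from Wikipedia – Infobox Harvesting
[Figure: Wikipedia infobox]
[Diagram: infobox attribute -> Attribute Map -> Relation Map -> fact]
Attribute Map columns: Infobox Attribute, Relation, Inverse, Manifold, Indirect
Relation Map columns: Relation, Domain, Range
- Infobox attribute: Born: January 8, 1935
- Attribute Map: Born -> bornOnDate
- Relation Map: bornOnDate, domain person, range yagoDate
- Fact: Elvis Presley bornOnDate January 8, 1935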
[Diagram: infobox attribute -> Attribute Map -> Relation Map -> fact]
- Infobox attribute: Died: August 16, 1977
- Attribute Map: Died -> diedOnDate
- Relation Map: diedOnDate, domain person, range yagoDate
- Fact: Elvis Presley diedOnDate August 16, 1977
[Diagram: infobox attribute -> Attribute Map -> Relation Map -> fact]
- Infobox attribute: Genre: Rock and Roll
- Attribute Map: Genre -> isOfGenre
- Relation Map: isOfGenre, domain entity, range yagoClass
- Fact: Elvis Presley isOfGenre Rock and Roll
[Diagram: infobox attribute -> Attribute Map -> Relation Map -> fact]
- Infobox attribute: Birth Name: Elvis Aaron Presley
- Attribute Map: birth name -> means
- Relation Map: means, domain yagoWord, range entity
- Fact: Elvis Aaron Presley means Elvis Presley (the attribute value is the subject of the fact)
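The four diagrams follow the same pipeline: look up the infobox attribute in the attribute map, then look up the resulting relation in the relation map. Below is a minimal sketch of that lookup. The map contents are taken from the slides; the function name `harvest`, the dictionary layout, and treating "birth name" as an inverse attribute are assumptions made for illustration.

```python
# Attribute map: infobox attribute -> (relation, inverse flag).
ATTRIBUTE_MAP = {
    "born": ("bornOnDate", False),
    "died": ("diedOnDate", False),
    "genre": ("isOfGenre", False),
    "birth name": ("means", True),  # assumed inverse: value becomes the subject
}

# Relation map: relation -> (domain, range), as shown on the slides.
RELATION_MAP = {
    "bornOnDate": ("person", "yagoDate"),
    "diedOnDate": ("person", "yagoDate"),
    "isOfGenre": ("entity", "yagoClass"),
    "means": ("yagoWord", "entity"),
}


def harvest(entity, infobox):
    """Turn infobox attribute/value pairs into candidate facts."""
    facts = []
    for attribute, value in infobox.items():
        entry = ATTRIBUTE_MAP.get(attribute.lower())
        if entry is None:
            continue  # attribute not covered by the attribute map
        relation, inverse = entry
        domain, range_ = RELATION_MAP[relation]
        subject, obj = (value, entity) if inverse else (entity, value)
        facts.append({"fact": (subject, relation, obj), "domain": domain, "range": range_})
    return facts


infobox = {
    "Born": "January 8, 1935",
    "Died": "August 16, 1977",
    "Genre": "Rock and Roll",
    "Birth Name": "Elvis Aaron Presley",
}
facts = harvest("Elvis Presley", infobox)
for item in facts:
    print(item["fact"])
```

The domain and range are carried along with each candidate fact so that a later type-checking step can use them.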
Manifold Attributes
- Some attributes may have multiple values
  - e.g. a person may have multiple children
- Multiple facts are generated
  - e.g. one hasChild fact for each child
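A sketch of how a manifold attribute could be expanded into one fact per value. It assumes the values arrive as a single comma-separated string; the separator, the function name `manifold_facts`, and the placeholder names in the example are all hypothetical.

```python
def manifold_facts(entity, relation, raw_value, separator=","):
    """Emit one fact per value of a multi-valued infobox attribute."""
    values = [v.strip() for v in raw_value.split(separator)]
    return [(entity, relation, v) for v in values if v]


# Placeholder names, not real data.
child_facts = manifold_facts("Person A", "hasChild", "Child B, Child C")
print(child_facts)
```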
Indirect Attributes - 1
Attribute Map (columns: Attribute, Relation, Inverse, Manifold, Indirect):
  gdp ppp  -> hasGDP
  gdp year -> during (indirect)
- Some attributes do not concern the article entity, but another fact
  - e.g. the attribute ‘gdp year’ (2008) does not concern the article entity, i.e. Republic of Singapore, but the fact about its GDP
- Therefore, the facts generated are:
  - #14: Singapore hasGDP 238.755 billion
  - #14 during 2008
  - read together: Singapore hasGDP 238.755 billion during 2008
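The Singapore example can be sketched as follows: the indirect attribute produces a fact whose subject is the identifier of the GDP fact rather than the article entity. The function name `harvest_gdp` and the identifier given to the second fact (#15) are assumptions; only #14 appears on the slide.

```python
def harvest_gdp(entity, infobox, first_id=14):
    """Attach the 'gdp year' value to the hasGDP fact, not to the entity."""
    facts = {}
    gdp_id = f"#{first_id}"
    facts[gdp_id] = (entity, "hasGDP", infobox["gdp ppp"])
    # Indirect attribute: the subject is the identifier of the fact above.
    facts[f"#{first_id + 1}"] = (gdp_id, "during", infobox["gdp year"])
    return facts


gdp_facts = harvest_gdp("Singapore", {"gdp ppp": "238.755 billion", "gdp year": "2008"})
for fact_id, fact in gdp_facts.items():
    print(fact_id, fact)
```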

Indirect Attributes - 2
Singapore Infobox
Type of Infobox
Song Infobox: American Pie
  Released       October, 1971
  Format         vinyl record
  Genre          Folk Rock
  Length         8:33 mins
  Label          United Artists
  Writer         Don McLean
Car Infobox: Tesla Roadster
  Manufacturer   Tesla Motors
  Production     2008-present
  Class          Roadster
  Length         3,946 mm
  Width          1,873 mm
  Height         1,127 mm
Type of Infobox: Attribute Map
Attribute Map (columns: Attribute, Relation, Inverse, Manifold, Indirect):
  car#length  -> hasLength
  song#length -> hasDuration
- Song Infobox: American Pie hasDuration 8:33
- Car Infobox: Tesla Roadster hasLength 3946
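The same attribute name can map to different relations depending on the infobox type. A minimal sketch of such a type-qualified lookup, keyed here by an (infobox type, attribute) pair; the names `TYPED_ATTRIBUTE_MAP` and `relation_for` are hypothetical.

```python
# (infobox type, attribute) -> relation, mirroring car#length and song#length.
TYPED_ATTRIBUTE_MAP = {
    ("car", "length"): "hasLength",
    ("song", "length"): "hasDuration",
}


def relation_for(infobox_type, attribute):
    """Look up the relation for an attribute, qualified by infobox type."""
    return TYPED_ATTRIBUTE_MAP.get((infobox_type.lower(), attribute.lower()))


print(relation_for("Song", "Length"))  # hasDuration
print(relation_for("Car", "Length"))   # hasLength
```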
Information Extraction - Word Level Techniques
- Wikipedia Redirects
  - A virtual redirect page for “Presley, Elvis” links to “Elvis Presley”
  - Each redirect gives a ‘means’ fact
  - e.g. “Presley, Elvis” means Elvis Presley
- Parsing Person Names
  - Extract the name components
  - Establish the relations givenNameOf and familyNameOf
  - e.g. Presley familyNameOf Elvis Presley
    Elvis givenNameOf Elvis Presley
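Both word-level techniques can be sketched in a few lines. The name pattern below (exactly two capitalized tokens) is a deliberate simplification and an assumption; the slides do not give the actual pattern, and the function names are hypothetical.

```python
import re

# Assumed pattern: exactly two capitalized tokens, "Given Family".
NAME_PATTERN = re.compile(r"^([A-Z][a-z]+) ([A-Z][a-z]+)$")


def redirect_fact(redirect_title, target):
    """Each redirect yields a 'means' fact from the redirect title to the target."""
    return (f'"{redirect_title}"', "means", target)


def name_facts(entity):
    """Split a simple two-part person name into given and family name facts."""
    match = NAME_PATTERN.match(entity)
    if match is None:
        return []
    given, family = match.groups()
    return [(given, "givenNameOf", entity), (family, "familyNameOf", entity)]


print(redirect_fact("Presley, Elvis", "Elvis Presley"))
print(name_facts("Elvis Presley"))
```

Names with middle names, initials, or particles would need a richer pattern than this one.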
Wikipedia Categories
Categories: Presidents of the United States | Lists of office-holders | Lists of Presidents
Categories: Rift Valleys | North Sea | Rivers of Germany | Articles needing translation from German Wikipedia | Rivers of Netherlands
Categories: Canadian Singers | Canadian male singers | 1959 births | English-language singers | Living people | Grammy Award Winners | Portrait photographers
Facts created from Wikipedia Categories
- Rhine locatedIn Germany
- Bryan Adams bornOnDate 1959
- Bryan Adams hasWonAward Grammy Award
- Abraham Lincoln politicianOf United States
Information Extraction - Category Harvesting
Relational Categories

Regular Expression              Relation
([0-9]{3,4}) births             bornOnDate
([0-9]{3,4}) deaths             diedOnDate
([0-9]{3,4}) establishments     establishedOnDate
([0-9]{3,4}) books|novels       writtenOnDate
Mountains|Rivers in (.*)        locatedIn
Presidents|Governors of (.*)    politicianOf
(.*) winners                    hasWonPrize
[A-Za-z]+ (.*) winners          hasWonPrize
Table: Some Category Heuristics
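A sketch of how such heuristics could be applied with Python's `re` module. It covers a subset of the table and makes two assumptions that the slides do not state: the pattern must match the whole category name, and the alternations are grouped as `(?:Mountains|Rivers)` and `(?:Presidents|Governors)`. The function name `fact_from_category` is hypothetical.

```python
import re

# (pattern, relation) pairs from the table; grouping of alternations is assumed.
CATEGORY_HEURISTICS = [
    (re.compile(r"([0-9]{3,4}) births"), "bornOnDate"),
    (re.compile(r"([0-9]{3,4}) deaths"), "diedOnDate"),
    (re.compile(r"([0-9]{3,4}) establishments"), "establishedOnDate"),
    (re.compile(r"(?:Mountains|Rivers) in (.*)"), "locatedIn"),
    (re.compile(r"(?:Presidents|Governors) of (.*)"), "politicianOf"),
    (re.compile(r"(.*) winners"), "hasWonPrize"),
]


def fact_from_category(entity, category):
    """Return the fact produced by the first heuristic matching the category name."""
    for pattern, relation in CATEGORY_HEURISTICS:
        match = pattern.fullmatch(category)
        if match:
            # The captured text is returned raw; mapping it to an entity
            # (e.g. dropping a leading "the") is left out of this sketch.
            return (entity, relation, match.group(1))
    return None


print(fact_from_category("Bryan Adams", "1959 births"))
print(fact_from_category("Abraham Lincoln", "Presidents of the United States"))
print(fact_from_category("Bryan Adams", "Living people"))  # None
```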
2. Connecting Wikipedia and WordNet – What is WordNet
- Lexical database for the English language
- Created at the Cognitive Science Laboratory of Princeton University
- Groups English words into sets of synonyms called synsets
- Provides short, general definitions
- Provides hypernym/hyponym relations
  - e.g. canine is a hypernym, dog is a hyponym
Connecting Wikipedia and WordNet – Type Extraction
- Goal: create a class hierarchy
  - e.g. singer subClassOf performer
  - performer subClassOf artist
- The hyponymy relation is taken from WordNet
- The Wikipedia class ‘American people in Japan’ is a subclass of the WordNet class ‘person’
Classifications of Categories
- Conceptual Categories
  - e.g. Albert Einstein is in ‘Naturalized citizens of the United States’
- Administrative Categories
  - e.g. Albert Einstein is in ‘Articles with unsourced statements’
- Relational Information
  - e.g. ‘1879 births’
- Thematic Vicinity
  - e.g. ‘Physics’
Identification of Conceptual Categories
- Only conceptual categories are used
- Shallow linguistic parsing of category names
  - e.g. category ‘American people in Japan’
  - Break the category name into
    - pre-modifier: ‘American’
    - head: ‘people’
    - post-modifier: ‘in Japan’
  - If the head is plural, then the category is a conceptual category
- Extract the class from the Wikipedia category
- Connect it to a class from WordNet
  - e.g. the Wikipedia class ‘American people in Japan’ has to be made a subclass of the WordNet class ‘person’
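The parsing step can be approximated very roughly as below. This is a crude stand-in for real shallow linguistic parsing: it assumes the post-modifier starts at the first preposition and that the head is the single word before it. The small preposition list and the `KNOWN_PLURALS` lexicon are assumptions that replace a proper parser and plural detector.

```python
PREPOSITIONS = {"in", "of", "from", "by", "at", "on", "with"}
# Stand-in for a real plural detector: a tiny lexicon of plural head nouns.
KNOWN_PLURALS = {"people", "citizens", "singers", "rivers"}


def split_category(name):
    """Split a category name into (pre-modifier, head, post-modifier)."""
    words = name.split()
    cut = len(words)
    for i, word in enumerate(words):
        if word.lower() in PREPOSITIONS:
            cut = i  # the post-modifier starts at the first preposition
            break
    if cut == 0:
        return "", "", " ".join(words)  # no head before the preposition
    return " ".join(words[: cut - 1]), words[cut - 1], " ".join(words[cut:])


def is_conceptual(name):
    """A category counts as conceptual if its head is a (known) plural noun."""
    head = split_category(name)[1].lower()
    return head in KNOWN_PLURALS


print(split_category("American people in Japan"))  # ('American', 'people', 'in Japan')
print(is_conceptual("American people in Japan"))   # True
print(is_conceptual("Physics"))                    # False
```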
Algorithm
Function wiki2wordnet(c)
Input: Wikipedia category name c
Output: WordNet synset
1  head = headCompound(c)
2  pre = preModifier(c)
3  post = postModifier(c)
4  head = stem(head)
5  If there is a WordNet synset s for pre + head
6      return s
7  If there are WordNet synsets s1, … , sn for head
8      (ordered by their frequency for head)
9      return s1
10 fail
Explanation of Algorithm
Input: American people in Japan
1. Pre-modifier: American
2. Head: people
3. Post-modifier: in Japan
4. Stem(head): person
5. If there is a WordNet synset for ‘American person’
6.   return that synset
7. If there are synsets s1, …, sn for ‘person’
8.   (ordered by frequency for ‘person’)
9.   return s1
10. Fail
Output: person
Result: American people in Japan subClassOf person
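The walkthrough can be mirrored in a short Python sketch. It does not query WordNet: `TOY_WORDNET` and `TOY_STEMS` are tiny hypothetical stand-ins for the synset lookup and the stemmer, and the synset labels are illustrative rather than real WordNet identifiers. The function takes the already-split pre-modifier and head instead of the raw category name.

```python
# Toy stand-in for WordNet: word -> synsets ordered by frequency (illustrative).
TOY_WORDNET = {
    "person": ["person (human being)", "person (grammatical category)"],
    "capital": ["capital (financial asset)", "capital (seat of government)"],
}
# Toy stemmer table: plural head -> singular form.
TOY_STEMS = {"people": "person", "capitals": "capital"}


def wiki2wordnet(pre, head, synsets=TOY_WORDNET, stems=TOY_STEMS):
    """Map a parsed category (pre-modifier, head) to a synset, or None on failure."""
    head = stems.get(head.lower(), head.lower())  # step 4: stem the head
    compound = f"{pre} {head}".strip().lower()
    if pre and compound in synsets:               # steps 5-6: pre + head
        return synsets[compound][0]
    if head in synsets:                           # steps 7-9: most frequent sense
        return synsets[head][0]
    return None                                   # step 10: fail


print(wiki2wordnet("American", "people"))  # person (human being)
print(wiki2wordnet("", "capitals"))        # capital (financial asset)
```

The second call shows how always taking the most frequent sense can pick an unintended meaning when a word has several senses.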
Fig.: WordNet search for ‘person’
Fig.: WordNet search for ‘American person’
Exceptions
- Complete hierarchy of classes
  - Upper classes from WordNet
  - Leaves from Wikipedia
- 2 dozen cases failed
  - Categories with head compound “capital”
  - In Wikipedia, it means “capital city”
  - In WordNet, it means “financial asset”
- These cases were corrected manually
3. Quality Control
- Canonicalization
  - Each fact and each entity reference is unique
  - An entity is always referred to by the same identifier in all facts in YAGO
- Type Checking
  - Eliminates individuals that do not have a class
  - Eliminates facts that do not respect domain and range constraints
  - An argument of a fact in YAGO is always an instance of the class required by the relation
Canonicalization - 1
- Redirect Resolution
  - Infobox heuristics deliver facts that have Wikipedia entities (i.e. Wikipedia links) as arguments
  - These links may not be correct Wikipedia page identifiers
  - Check whether each argument is a correct Wikipedia identifier
  - If not, replace it by the correct, redirected identifier
  - e.g. Hermitage Museum locatedIn St. Petersburg
    becomes Hermitage Museum locatedIn Saint Petersburg
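Redirect resolution amounts to replacing each argument by the page title its redirect points to. A minimal sketch, assuming the redirects are available as a dictionary from redirect title to page title; the names `REDIRECTS`, `resolve` and `canonicalize` are hypothetical.

```python
REDIRECTS = {"St. Petersburg": "Saint Petersburg"}  # redirect title -> page title


def resolve(entity, redirects=REDIRECTS):
    """Follow redirects until a canonical page title is reached."""
    seen = set()
    while entity in redirects and entity not in seen:
        seen.add(entity)  # guards against redirect cycles
        entity = redirects[entity]
    return entity


def canonicalize(fact, redirects=REDIRECTS):
    """Replace both arguments of a fact by their canonical identifiers."""
    subject, relation, obj = fact
    return (resolve(subject, redirects), relation, resolve(obj, redirects))


print(canonicalize(("Hermitage Museum", "locatedIn", "St. Petersburg")))
```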
Canonicalization - 2
- Removal of Duplicate Facts
  - Sometimes, two heuristics deliver the same fact
  - Canonicalization eliminates one of them
  - e.g. the category ‘1935 births’ yields the fact:
    Elvis Presley bornOnDate 1935
  - The infobox attribute ‘Born: January 8, 1935’ yields the fact:
    Elvis Presley bornOnDate January 8, 1935
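A sketch of duplicate removal for the example above. The slide does not say which of the two facts is kept; this sketch assumes the more precise one survives, and it approximates "less precise" by "the value is a substring of the other value", which is a simplification chosen only for illustration.

```python
def deduplicate(facts):
    """Drop exact duplicates and facts subsumed by a more precise one."""
    unique = list(dict.fromkeys(facts))  # removes exact repeats, keeps order
    kept = []
    for s, r, o in unique:
        # Assumed precision test: o is subsumed if it is contained in a
        # longer value of another fact with the same subject and relation.
        subsumed = any(
            s == s2 and r == r2 and o != o2 and o in o2
            for s2, r2, o2 in unique
        )
        if not subsumed:
            kept.append((s, r, o))
    return kept


candidates = [
    ("Elvis Presley", "bornOnDate", "1935"),
    ("Elvis Presley", "bornOnDate", "January 8, 1935"),
    ("Elvis Presley", "bornOnDate", "January 8, 1935"),
]
print(deduplicate(candidates))
```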
Type Checking - 1
- Reductive Type Checking
  - Sometimes the class of an entity cannot be determined
  - Facts about such entities are discarded
  - e.g. Wikipedia entities that have been proposed for an article, but that do not have a page yet
- Inductive Type Checking
  - Type constraints can be used to generate facts
  - e.g. Elvis Presley bornOnDate January 8, 1935, so Elvis Presley is a person
  - A regular expression check ensures that the entity name follows the pattern of given name and family name
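The two kinds of type checking can be combined in one small function. This is a sketch under several assumptions: only the domain (subject side) is checked, the `classes` dictionary already contains superclasses, and the name pattern is a simplified two-token regular expression. `RELATION_MAP`, `PERSON_NAME` and `type_check` are hypothetical names.

```python
import re

RELATION_MAP = {"bornOnDate": ("person", "yagoDate")}  # relation -> (domain, range)
# Assumed name pattern: given name followed by family name.
PERSON_NAME = re.compile(r"^[A-Z][a-z]+ [A-Z][a-z]+$")


def type_check(fact, classes):
    """Return True if the fact passes the domain check.

    `classes` maps entity -> set of class names (superclasses included) and
    is updated in place when a class is induced.
    """
    subject, relation, _ = fact
    domain, _range = RELATION_MAP[relation]  # the range check is omitted here
    known = classes.get(subject)
    if known:
        return domain in known
    # Inductive step: no class known, so infer it from the relation's domain,
    # guarded by the name-pattern check for persons.
    if domain == "person" and PERSON_NAME.match(subject):
        classes[subject] = {domain}
        return True
    return False  # reductive step: class cannot be determined, discard the fact


classes = {}
print(type_check(("Elvis Presley", "bornOnDate", "January 8, 1935"), classes))  # True
print(classes)  # {'Elvis Presley': {'person'}}
```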
Type Checking - 2
- Type Coherence Checking
  - Sometimes, classification yields wrong results
  - e.g. Abraham Lincoln is an instance of 13 classes
    - 12 are subclasses of the class ‘person’, e.g. lawyer, president
    - the 13th class is the class ‘cabinet’
- The class hierarchy of YAGO is partitioned into branches
  - e.g. locations, artifacts, people, other physical entities, and abstract entities
- The branch that most types lead to is determined
- Other types are purged
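The coherence check is essentially a majority vote over branches. A minimal sketch, assuming a lookup table from class to branch; which branch ‘cabinet’ falls into is not stated on the slide, so its assignment to "artifacts" below is an assumption, as are the names `purge_incoherent` and `branch_of`.

```python
from collections import Counter


def purge_incoherent(types, branch_of):
    """Keep only the types that fall into the entity's most common branch."""
    if not types:
        return []
    counts = Counter(branch_of[t] for t in types)
    majority_branch, _ = counts.most_common(1)[0]
    return [t for t in types if branch_of[t] == majority_branch]


# Branch assignment for 'cabinet' is assumed for illustration.
branch_of = {"lawyer": "people", "president": "people", "cabinet": "artifacts"}
lincoln_types = ["lawyer", "president", "cabinet"]
print(purge_incoherent(lincoln_types, branch_of))  # ['lawyer', 'president']
```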
References
- YAGO: A Large Ontology from Wikipedia and WordNet
  Fabian M. Suchanek, Gjergji Kasneci, Gerhard Weikum
  Max-Planck-Institute for Computer Science, Saarbruecken, Germany
- Automated Construction and Growth of a Large Ontology
  Fabian M. Suchanek
  Thesis for obtaining the title of Doctor of Engineering of the Faculties of Natural Sciences and Technology of Saarland University
- Wikipedia
  http://en.wikipedia.org/wiki/Main_Page
- WordNet
  http://wordnet.princeton.edu/
Thank You, Any Questions?