Slides - Fabian M. Suchanek

Harvesting Knowledge from
Web Data and Text
CIKM 2010 Tutorial
(1/2 Day)
Hady W. Lauw1, Ralf Schenkel2,
Fabian Suchanek3, Martin Theobald4,
and Gerhard Weikum4
1Institute for Infocomm Research, Singapore
2Saarland University, Saarbruecken
3INRIA Saclay, Paris
4Max Planck Institute Informatics, Saarbruecken
All slides for download…
http://www.mpi-inf.mpg.de/yago-naga/CIKM10-tutorial/
Harvesting Knowledge from Web Data
Outline
• Part I
– What and Why
– Available Knowledge Bases
• Part II
– Extracting Knowledge
• Part III
– Ranking and Searching
• Part IV
– Conclusion and Outlook
Motivation
Elvis Presley
1935 - 1977
Will there ever be someone like him again?
Motivation
Another Elvis
Elvis Presley: The Early Years
Elvis spent more weeks at the top of the charts
than any other artist.
www.fiftiesweb.com/elvis.htm
Motivation
Another singer called Elvis, young
Personal relationships of Elvis Presley – Wikipedia
...when Elvis was a young teen.... another girl whom the
singer's mother hoped Presley would .... The writer called
Elvis "a hillbilly cat”
en.wikipedia.org/.../Personal_relationships_of_Elvis_Presley
Motivation
Dear Mr. Page, you don’t understand me. I just...
Elvis Presley - Official page for Elvis Presley
Welcome to the Official Elvis Presley Web Site, home of the
undisputed King of Rock 'n' Roll and his beloved Graceland ...
www.elvis.com/
Motivation
Other (more serious?) queries:
• when is Madonna’s next concert in Europe?
• which protein inhibits atherosclerosis?
• who was king of England when Napoleon I was emperor of France?
King George III
• has any scientist ever won the Nobel Prize in Literature?
Bertrand Russell
• which countries have a HDI comparable to Sweden’s?
• which scientific papers have led to patents?
• is there another famous singer named “Elvis”?
This Tutorial
Mr. Page, let’s try this again.
Is there another singer
named Elvis?
In this tutorial, we will explain
• how the knowledge is organized
• what knowledge bases exist already
• how we can construct knowledge bases
• how we can query knowledge bases
singer
type
type
?
“Elvis”
“Elvis”
Ontologies
entity
subclassOf
subclassOf
person
location
subclassOf
scientists
subclassOf
singer
type
city
type
type
bornIn
Tupelo
?
The same label
for two entities:
homonymy
Classes
label
“Elvis”
label
“The King”
Relations
Instances
The same entity
has two labels:
synonymy
Labels/words
Classes
entity
subclassOf
person
subclassOf
scientists
singer
type
type
?
Transitivity:
type(x,y) /\ subclassOf(y,z) => type(x,z)
Relations
entity
subclassOf
subclassOf
person
location
domain
range
bornIn
singer
subclassOf
city
type
type
bornIn
Tupelo
Domain and range constraints:
domain(r,c) /\ r(x,y) => type(x,c)
range(r,c) /\ r(x,y) => type(y,c)
Looks like higher order, but is
not. Consider introducing a
predicate fact(r,x,y)
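The type-inheritance rule and the domain and range constraints can be tried out on a toy ontology. The following is a minimal sketch, assuming facts are stored as plain (subject, relation, object) tuples; the function name `infer` and the sample facts are illustrative and not part of any particular knowledge base.

```python
# Toy ontology as a set of (subject, relation, object) triples (illustrative data).
facts = {
    ("singer", "subclassOf", "person"),
    ("person", "subclassOf", "entity"),
    ("city", "subclassOf", "location"),
    ("location", "subclassOf", "entity"),
    ("bornIn", "domain", "person"),
    ("bornIn", "range", "city"),
    ("Elvis", "type", "singer"),
    ("Elvis", "bornIn", "Tupelo"),
}

def infer(facts):
    """Apply the three rules repeatedly until no new fact appears."""
    facts = set(facts)
    while True:
        new = set()
        for (x, r, y) in facts:
            for (r2, p, c) in facts:
                # domain(r,c) /\ r(x,y) => type(x,c)
                if p == "domain" and r2 == r:
                    new.add((x, "type", c))
                # range(r,c) /\ r(x,y) => type(y,c)
                if p == "range" and r2 == r:
                    new.add((y, "type", c))
                # type(x,y) /\ subclassOf(y,z) => type(x,z)
                if r == "type" and p == "subclassOf" and r2 == y:
                    new.add((x, "type", c))
        if new <= facts:
            return facts
        facts |= new

closed = infer(facts)
```

After the fixpoint, `closed` contains, for example, that Tupelo is a city (from the range constraint) and therefore also a location and an entity.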
Event Entities
1967
An event entity
is an artificial entity
introduced to represent
an n-ary relationship
year
ElvisGrammy
winner
prize
Grammy Award
won
Event entities allow representing
arbitrary relational data
as binary graphs
Row | Winner | Prize | Year
Row42 | Elvis Presley | Grammy Award | 1967
Row43
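As a small illustration of the idea, one table row can be turned into binary facts about an event entity. This is a sketch under the assumption that a row is a column-to-value mapping; the helper name `row_to_triples` is made up for this example.

```python
def row_to_triples(event_id, row):
    """Turn one n-ary table row into binary facts about an event entity."""
    return [(event_id, column, value) for column, value in row.items()]

# Row42 from the table, with the column names used as relation names.
row42 = {"winner": "Elvis Presley", "prize": "Grammy Award", "year": 1967}
triples = row_to_triples("ElvisGrammy", row42)
```

Each column becomes one edge that starts at the artificial node `ElvisGrammy`, so a row with n columns yields n binary facts.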
Reification
Reification is the method of
creating an entity that represents
a fact.
1967
year
#42
Grammy Award
won
#42
source
Wikipedia
bornIn
Tupelo
#43
There are different ways to reify
a fact, this is the one used in this talk.
RDF
resource
The Resource Description Framework (RDF) is a
W3C standard that provides a standard
vocabulary to model ontologies.
An RDF ontology can be seen as a
directed labeled multi-graph where
• the nodes are entities
• the edges are labeled with relations
Edges (facts) are commonly written
• as triples
<Elvis, bornIn, Tupelo>
• as literals
bornIn(Elvis, Tupelo)
subclassOf
location
subclassOf
city
type
bornIn
Tupelo
[W3C recommendation: RDF, 2004]
Outline
• Part I
– What and Why ✔
– Available Knowledge Bases
• Part II
– Extracting Knowledge
• Part III
– Ranking and Searching
• Part IV
– Conclusion and Outlook
Cyc
What if we could make
all common sense
knowledge
computer-processable?
Cyc project
Douglas Lenat
• started in 1984
• driven by
• staff of 20
• goal: formalize knowledge manually
[Lenat, Comm. ACM, 1995]
Cyc: Language
CycL is the formal language that Cyc uses to represent knowledge.
(Semantics based on First Order Logic, syntax based on LISP)
(#$forall ?A
(#$implies
(#$isa ?A #$Animal)
(#$thereExists ?M
(#$mother ?A ?M))))
(#$arity #$GovernmentFn 1)
(#$arg1Isa #$GovernmentFn #$GeopoliticalEntity)
(#$resultIsa #$GovernmentFn #$RegionalGovernment)
(#$governs (#$GovernmentFn #$Canada) #$Canada)
+ a logical reasoner
http://cyc.com/cycdoc/ref/cycl-syntax.html
Cyc: Knowledge
#$Love
Strong affection for another agent arising out of kinship or personal ties. Love
may be felt towards things, too: warm attachment, enthusiasm, or devotion.
#$Love is a collection, as further explained
under #$Happiness. Specialized forms of
#$Love are #$Love-Romantic, platonic
love, maternal love, infatuation, agape, etc.
guid: bd589433-9c29-11b1-9dad-c379636f7270
direct instance of: #$FeelingType
direct specialization of: #$Affection
direct generalization of: #$Love-Romantic
http://cyc.com/cycdoc/vocab/emotion-vocab.html#Love
Facts and axioms about: Transportation, Ecology, everyday living, chemistry,
healthcare, animals, law, computer science...
“If a computer network implements IEEE 802.11 Wireless LAN Protocol
and some computer is a node in that computer network, then that
computer is vulnerable to decryption.”
http://cyc.com/cyc/technology/whatiscyc_dir/maptest
Cyc: Summary
Property | Cyc | SUMO
License | proprietary, free for research | GNU GPL
Entities | 500k | 20k
Assertions | 5m | 70k
Relations | 15k |
Tools | Reasoner, NL understanding tool | Reasoner
URL | http://cyc.com | http://ontologyportal.org
References | [Lenat, Comm. ACM 1995] | [Niles, FOIS 2001]
Sources: http://cyc.com/cyc/technology/whatiscyc_dir/whatsincyc, http://ontologyportal.org
SUMO (the Suggested Upper Merged Ontology) is a research project in a similar spirit,
driven by Adam Pease of Articulate Software
WordNet
What if we could make
the English language
computer-processable?
George Miller
• started in 1985
• Cognitive Science Laboratory,
Princeton University
• written by lexicographers
• goal: support automatic text
analysis and AI applications
[Miller, CACM 1995]
WordNet: Lexical Database
synonymous
words
polysemous
words
Word
photographic
camera
Sense
sense1
camera
television
camera
sense2
WordNet
WordNet: Semantic Relations
Hypernymy
Kitchen Appliances
Meronymy
Is-value-of
Camera
Speed
Slow
Toaster
Optical Lens
Fast
WordNet: Semantic Relations
Relation | Meaning | Examples
Synonymy (N, V, Adj, Adv) | Same sense | (camera, photographic camera), (mountain climbing, mountaineering), (fast, speedy)
Antonymy (Adj, Adv) | Opposite | (fast, slow), (buy, sell)
Hypernymy (N) | Is-A | (camera, photographic equipment), (mountain climbing, climb)
Meronymy (N) | Part | (camera, optical lens), (camera, view finder)
Troponymy (V) | Manner | (buy, subscribe), (sell, retail)
Entailment (V) | X must mean doing Y | (buy, pay), (sell, give)
WordNet: Hierarchy
Hypernymy Is-A relations
instrumentation
equipment
device
photographic
equipment
lamp
flash
WordNet: Size
Type | Number
#words | 155k
#senses | 117k
#word-sense pairs | 207k
%words that are polysemous | 17%
License | Proprietary, free for research
Source: http://wordnet.princeton.edu/wordnet/man2.1/wnstats.7WN.html
Downloadable at http://wordnet.princeton.edu
Wikipedia
If a small number of people
can create a knowledge
base, how about a LARGE
number of people?
Jimmy Wales
• started in 2001
• driven by Wikimedia Foundation,
and a large number of volunteers
• goal: build world’s largest
encyclopedia
Wikipedia: Entities and Attributes
Entities
Attributes
Wikipedia: Synonymy and Polysemy
Redirection
(synonyms)
Disambiguation
(polysemy)
Wikipedia: Classes/Categories
Class
hierarchy
different
from
WordNet
Wikipedia: Others
Inter-lingual
Links
Navigation/
Topic box
Wikipedia: Numbers
English:
• 1B words,
• 2.8M articles,
• 152K contributors
All (250 languages):
• 1.74B words,
• 9.25M articles,
• 283K contributors
vs. Britannica:
• 25X as many words
• ½ avg article length
License: Creative Commons
Attribution-ShareAlike (CC-BY-SA)
Growth 2001 - 2008
Downloadable at http://download.wikimedia.org/
Automatically Constructed Knowledge Bases
• Manual approaches (Cyc, WordNet, Wikipedia)
– produce high quality knowledge bases
– labor-intensive and limited in scope
Can we construct the
knowledge bases
automatically?
YAGO
… , etc.
YAGO
Can we exploit
Wikipedia and
WordNet to build an
ontology?
YAGO
• started as PhD thesis in 2007
• now major project at
the Max Planck Institute
for Informatics in Germany
• goal: extract ontology from
Wikipedia with high accuracy
and consistency
[Suchanek et al., WWW 2007]
YAGO: Construction
WordNet
Person
Person
subclassOf
subclassOf
Singer
Singer
subclassOf
Elvis Presley
Rock Singer
type
Blah blah blub fasel
(do not read this,
better listen to the
talk) blah blah Elvis
blub (you are still
reading this) blah
Elvis blah blub later
became astronaut
blah
~Infobox~
Born: 1935
...
Categories: Rock singer
born
Exploit Infoboxes
Exploit conceptual categories
Add WordNet
1935
YAGO: Consistency Checks
Person
subclassOf
Singer
subclassOf
Guitar
Guitarist
Rock Singer
type
Physics
born
born
Check uniqueness of entities and functional arguments
Check domains and ranges of relations
Check type coherence
1935
YAGO: Relations
About People | About Locations | About Other Things
actedIn | establishedOnDate | happenedIn
bornIn / on date | established | from / until
diedIn / on date | hasCapital | isCalled
created / on date | hasPopulation | foundIn
discovered | locatedIn | produced
hasChild, hasSpouse | hasCurrency | hasProductionLanguage
family name | hasInflation | hasISBN
graduatedFrom | hasPolitician | hasPredecessor
... | ... | ...
ca. 100 relations with range and domain
YAGO: Numbers
Type | YAGO | YAGO+Geonames
Entities | 2.6m | 10m
organizations | 0.5m | 0.5m
people | 0.8m | 0.8m
classes | 0.5m | 0.5m
Facts | 30m | 240m
Relations | 86 | 92
Precision | 95% | 95%
License: Creative Commons Attribution-NonCommercial (CC-BY-NC)
Downloadable at http://mpii.de/yago
incl. converters for RDF, XML, databases
DBpedia
Can we harvest facts
more exhaustively with
community effort?
• community effort started in 2007
• driven by Free U. Berlin, U. Leipzig,
OpenLink
• goal: "extract structured information
from Wikipedia and to make this
information available on the Web"
[Bizer et al., Journal of Web Semantics 2009]
DBpedia: Ontology
In YAGO, the taxonomy is based on WordNet classes.
DBpedia:
• places entities extracted from Wikipedia into its own ontology
• hand-crafted: 259 classes, 6 levels, 1200 properties
• emphasizes recall
• only half of extracted entities are currently placed in its own ontology
• alternative classifications: Wikipedia, YAGO, UMBEL (OpenCyc)
DBpedia: Mapping Rules
DBpedia mapping rules:
• maps Wikipedia infoboxes and tables to its ontology
• target datatypes (normalize units, ignore deviant values)
Community effort:
• hand-craft mapping rules
• expand ontology
< http://en.wikipedia.org/wiki/Elvis_Presley >
{{Infobox musical artist
|Name = Elvis Presley
|Background = solo_singer
|Birth_name = Elvis Aaron Presley
}}
< http://dbpedia.org/page/Elvis_Presley >
foaf:name
“Elvis Presley”;
background “solo_singer”;
foaf:givenName “Elvis Aaron Presley”;
Note that the values do not change.
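The infobox example above can be mimicked with a few lines of code. This is only a sketch of the idea of a mapping rule: the `MAPPING` table and the function `map_infobox` are hypothetical and mirror the three attributes shown on the slide; they are not the actual DBpedia mapping language or extraction framework.

```python
import re

# Hypothetical mapping rules: infobox attribute -> ontology property.
MAPPING = {
    "Name": "foaf:name",
    "Background": "background",
    "Birth_name": "foaf:givenName",
}

def map_infobox(wikitext, mapping):
    """Return (property, value) pairs for all mapped infobox attributes."""
    pairs = []
    for line in wikitext.splitlines():
        m = re.match(r"\|\s*(\w+)\s*=\s*(.+)", line.strip())
        if m and m.group(1) in mapping:
            pairs.append((mapping[m.group(1)], m.group(2).strip()))
    return pairs

infobox = """{{Infobox musical artist
|Name = Elvis Presley
|Background = solo_singer
|Birth_name = Elvis Aaron Presley
}}"""
props = map_infobox(infobox, MAPPING)
```

Only the attribute names are rewritten; the values are copied unchanged, as noted above.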
DBpedia: Numbers
Type | Number
Facts | English: 257 m (YAGO: 240 m); all languages: 1 b
Entities | 3.4 m overall (YAGO: 10 m); 1.5 m in DBpedia ontology
People | 312 k
Locations | 413 k
Organizations | 140 k
License | Creative Commons Attribution-ShareAlike 3.0 (CC-BY-SA 3.0)
plus
• 5.5 m links to external Web pages
• 1.5 m links to images
• 5 m links to other RDF data sets
Downloadable at
http://dbpedia.org
Freebase
What if we could harvest
both automatic
extraction and user
contribution?
• started in 2000
• driven by Metaweb,
part of Google since Jul 2010
• goals:
• “an open shared database of the
world's knowledge”
• “a massive, collaboratively-edited
database of cross-linked data”
Freebase
Like DBpedia and YAGO, Freebase imports data from Wikipedia.
Differently:
• also imports from other sources (e.g., ChefMoz, NNDB, and MusicBrainz)
• including individually contributed data
• users can collaboratively edit its data (without having to edit Wikipedia).
Freebase: User Contribution
Edit Entities
• create new entities
• assign a new type/class to an entity
• add/change attributes
• connect to other entities
• upload/edit images
Review
• flag vandalism
• flag entities to be merged/deleted
• vote on flagged content
(3 unanimous votes, or an expert
has to be the tie-breaker)
Edit Schema
• define new class, specifying the
attributes of the class
• class definition can only be
changed by creator/admin
• class not part of commons until
peer-reviewed & promoted by
staff/admin
Data Game
• finding aliases in Wikipedia
redirects
• extracts dates of events from
Wikipedia articles
• uses the Yahoo image search API to
find candidates
Freebase: Community
Experts
• tie breaker in reviews
• split entities
• “rewind” changes
New experts inducted
by current experts.
Admins
• create new classes
and attributes
• respond to
community suggestions
Promoted by staff or
other admins.
Members
• contribute
(edit, review, vote)
Anyone can be a
member.
Freebase: Numbers
Type | Number
Facts | 41 m
Entities | 13 m (YAGO: 10 m)
People | 2 m
Locations | 946 k
Businesses | 567 k
Film | 397 k
License | Creative Commons Attribution (CC-BY)
Downloadable at
http://download.freebase.com
Question Answering Systems
Objective is to answer user queries from an underlying knowledge base.
• data from Wikipedia and user edits
• natural language translation of queries
• 9 m entities, 300 m facts
• computes answers from an internal
knowledge base of curated, structured data.
• stores not just facts, but also algorithms
and models
Application: Semantic Similarity
• Task: determine similarity between two words
– topological distance of two words in the graph
– taxonomic distance: hierarchical is-a relations
• Example application: correct real-word spelling errors
[Figure: two taxonomy paths below "physical entity": … legume, bean, soy and … garment, trouser, jean]
Example sentence: "Tofu is made from soy jeans."
[Hirst et al., Natural Language Engineering 2001]
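The taxonomic distance in this example can be sketched as a path length through the lowest common ancestor. The `PARENT` table below is a toy fragment that links "legume" and "garment" directly to "physical entity", which is an assumption made for brevity (the real hierarchy has intermediate classes); the function names are illustrative.

```python
# child -> hypernym; toy fragment, intermediate classes omitted (assumption).
PARENT = {
    "soy": "bean", "bean": "legume", "legume": "physical entity",
    "jean": "trouser", "trouser": "garment", "garment": "physical entity",
}

def ancestors(word):
    """Return the path from a word up to the root, starting with the word itself."""
    path = [word]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def taxonomic_distance(a, b):
    """Count is-a edges from a up to the lowest common ancestor and down to b."""
    pa, pb = ancestors(a), ancestors(b)
    common = next(node for node in pa if node in pb)
    return pa.index(common) + pb.index(common)

d_bean = taxonomic_distance("soy", "bean")
d_jean = taxonomic_distance("soy", "jean")
```

In this fragment "bean" is much closer to "soy" than "jean" is, which is the signal a real-word spelling corrector can use to prefer "soy beans".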
Application: Sentiment Orientation
• Task: determine an adjective’s polarity (positive or negative)
– same polarity connected by synonymic relations
– opposite polarity by antonymic relations
• Example application: overall sentiment of customer reviews
[Figure: adjectives linked to GOOD (suitable, appropriate, proper, right) and to BAD (spoiled, defective, forged, risky)]
[Hu et al., KDD 2004]
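The propagation of polarity along synonym and antonym links can be sketched as a breadth-first traversal from a seed word. The edge lists below are hypothetical and only reuse words from the figure; they are not taken from WordNet, and this is not the exact procedure of the cited paper.

```python
from collections import deque

# Hypothetical synonym and antonym links between adjectives.
SYNONYMS = [("good", "right"), ("right", "proper"), ("proper", "appropriate"),
            ("appropriate", "suitable"), ("bad", "defective"), ("bad", "spoiled")]
ANTONYMS = [("good", "bad")]

def propagate(seeds):
    """Spread polarity: synonyms keep the sign, antonyms flip it."""
    graph = {}
    for a, b in SYNONYMS:
        graph.setdefault(a, []).append((b, +1))
        graph.setdefault(b, []).append((a, +1))
    for a, b in ANTONYMS:
        graph.setdefault(a, []).append((b, -1))
        graph.setdefault(b, []).append((a, -1))
    polarity = dict(seeds)
    queue = deque(seeds)
    while queue:
        word = queue.popleft()
        for neighbor, sign in graph.get(word, []):
            if neighbor not in polarity:
                polarity[neighbor] = polarity[word] * sign
                queue.append(neighbor)
    return polarity

polarity = propagate({"good": +1})
```

Starting from the single seed "good", every reachable adjective receives +1 or -1 depending on how many antonym links lie on the path.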
Application: Annotation of Web Data
• Task: given a data source in the form of a Web table
– Annotate column with entity type
– Annotate pair of columns with relationship type
– Annotate table cell with entity ID
[Limaye et al., VLDB 2010]
Application: Map Annotation
Idea:
• Determine geographical entities in the vicinity (by GPS coordinates)
• Show information about these entities (from DBpedia)
Possible Applications:
• Map search on the Internet
• Enhanced Reality applications
[Becker et al., Linking Open Data Workshop 2008]
Application: Faceted Search
Attributes and
values based on
frequency (?)
search is “full text
search within results”
Constraints are listed
for possible deletion
Suggestions based
on current
consideration set
DBpedia Browser
Summary
• Part I covers what knowledge bases are
– Knowledge representation model (RDF)
– Manual knowledge bases:
• WordNet: expert-driven, English words
• Wikipedia: community-driven, entities/attributes
– Automatically extracted knowledge bases:
• YAGO: Wikipedia + WordNet, automated, high precision
• DBpedia: Wikipedia + community-crafted mapping rules, high recall
• Freebase: Wikipedia + other databases + user edits
• Part II will cover how to extract information included
in the knowledge bases
References for Part I
• C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, S. Hellmann: DBpedia – A Crystallization Point for the Web of Data. Journal of Web Semantics: Science, Services and Agents on the World Wide Web, Issue 7, Pages 154–165, 2009.
• C. Becker, C. Bizer: DBpedia Mobile: A Location-Enabled Linked Data Browser. Linking Open Data Workshop, 2008.
• G. Hirst and A. Budanitsky: Correcting real-word spelling errors by restoring lexical cohesion. Natural Language Engineering 11 (1): 87–111, 2001.
• M. Hu and B. Liu: Mining and Summarizing Customer Reviews. KDD, 2004.
• J. Kamps, M. Marx, R. J. Mokken, and M. de Rijke: Using WordNet to Measure Semantic Orientations of Adjectives. LREC, 2004.
• D. Lenat: CYC: A large-scale investment in knowledge infrastructure. Communications of the ACM, 1995.
• G. Limaye, S. Sarawagi, and S. Chakrabarti: Annotating and Searching Web Tables Using Entities, Types and Relationships. VLDB, 2010.
• G. A. Miller: WordNet: A Lexical Database for English. Communications of the ACM Vol. 38, No. 11: 39-41, 1995.
• F. M. Suchanek, G. Kasneci and G. Weikum: Yago - A Core of Semantic Knowledge. WWW, 2007.
• I. Niles and A. Pease: Towards a Standard Upper Ontology. In Proceedings of the 2nd International Conference on Formal Ontology in Information Systems (FOIS-2001), Chris Welty and Barry Smith, eds, Ogunquit, Maine, October 17-19, 2001.
• World Wide Web Consortium: RDF Primer. W3C Recommendation, 2004. http://www.w3.org/TR/rdf-primer/
Outline
• Part I
– What and Why ✔
– Available Knowledge Bases ✔
• Part II
– Extracting Knowledge
• Part III
– Ranking and Searching
• Part IV
– Other topics
Entities & Classes
Which entity types (classes, unary predicates) are there?
scientists, doctoral students, computer scientists, …
female humans, male humans, married humans, …
Which subsumptions should hold
(subclass/superclass, hyponym/hypernym, inclusion dependencies)?
subclassOf (computer scientists, scientists),
subclassOf (scientists, humans), …
Which individual entities belong to which classes?
instanceOf (Surajit Chaudhuri, computer scientists),
instanceOf (Barbara Liskov, computer scientists),
instanceOf (Barbara Liskov, female humans), …
Which names denote which entities?
means (“Lady Di“, Diana Spencer),
means (“Diana Frances Mountbatten-Windsor”, Diana Spencer), …
means (“Madonna“, Madonna Louise Ciccone),
means (“Madonna“, Madonna (painting by Edvard Munch)), …
...
Binary Relations
Which instances (pairs of individual entities) are there
for given binary relations with specific type signatures?
hasAdvisor (JimGray, MikeHarrison)
hasAdvisor (HectorGarcia-Molina, Gio Wiederhold)
hasAdvisor (Susan Davidson, Hector Garcia-Molina)
graduatedAt (JimGray, Berkeley)
graduatedAt (HectorGarcia-Molina, Stanford)
hasWonPrize (JimGray, TuringAward)
bornOn (JohnLennon, 9-Oct-1940)
diedOn (JohnLennon, 8-Dec-1980)
marriedTo (JohnLennon, YokoOno)
Which additional & interesting relation types are there
between given classes of entities?
competedWith(x,y), nominatedForPrize(x,y), …
divorcedFrom(x,y), affairWith(x,y), …
assassinated(x,y), rescued(x,y), admired(x,y), …
Higher-arity Relations & Reasoning
• Time, location & provenance annotations
• Knowledge representation – how do we model & store these?
• Consistency reasoning – how do we filter out inconsistent facts that
the extractor produced?
Facts (RDF triples):
1: (JimGray, hasAdvisor, MikeHarrison)
2: (SurajitChaudhuri, hasAdvisor, JeffUllman)
3: (Madonna, marriedTo, GuyRitchie)
4: (NicolasSarkozy, marriedTo, CarlaBruni)
5: (ManchesterU, wonCup, ChampionsLeague)
Facts about facts:
6: (1, inYear, 1968)
7: (2, inYear, 2006)
8: (3, validFrom, 22-Dec-2000)
9: (3, validUntil, Nov-2008)
10: (4, validFrom, 2-Feb-2008)
11: (2, source, SigmodRecord)
12: (5, inYear, 1999)
13: (5, location, CampNou)
14: (5, source, Wikipedia)
Outline
• Part I
– What and Why ✔
– Available Knowledge Bases ✔
• Part II
– Extracting Knowledge
• Part III
– Ranking and Searching
• Part IV
– Conclusion and Outlook
Outline
• Part II
–Extracting Knowledge
• Pattern-based Extraction
• Consistency Reasoning
• Higher-arity Relations: Space & Time
Framework: Information Extraction (IE)
Surajit
obtained his
PhD in CS from
Stanford University
under the supervision
of Prof. Jeff Ullman.
He later joined HP and
worked closely with
Umesh Dayal …
source-centric IE
1) recall !
2) precision
instanceOf (Surajit, scientist)
inField (Surajit, computer science)
hasAdvisor (Surajit, Jeff Ullman)
almaMater (Surajit, Stanford U)
workedFor (Surajit, HP)
friendOf (Surajit, Umesh Dayal)
…
one source
yield-centric
harvesting
many sources
1) precision !
2) recall
near-human
quality !
hasAdvisor
Student | Advisor
Surajit Chaudhuri | Jeffrey Ullman
Alon Halevy | Jeffrey Ullman
Jim Gray | Mike Harrison
… | …
almaMater
Student | University
Surajit Chaudhuri | Stanford U
Alon Halevy | Stanford U
Jim Gray | UC Berkeley
… | …
Framework: Knowledge Representation
• RDF (Resource Description Framework, W3C):
- subject-property-object (SPO) triples / binary relations
- highly structured, but no (prescriptive) schema
- first-order logical reasoning over binary predicates
This tutorial!
• Frames, F-Logic, description logics: OWL/DL/lite
• Also: higher-order logics, epistemic logics
Facts (RDF triples):
1: (JimGray, hasAdvisor, MikeHarrison)
2: (SurajitChaudhuri, hasAdvisor, JeffUllman)
3: (Madonna, marriedTo, GuyRitchie)
4: (NicolasSarkozy, marriedTo, CarlaBruni)
Reification: facts about facts:
5: (1, inYear, 1968)
6: (2, inYear, 2006)
7: (3, validFrom, 22-Dec-2000)
8: (3, validUntil, Nov-2008)
9: (4, validFrom, 2-Feb-2008)
10: (2, source, SigmodRecord)
Temporal, spatial, & provenance annotations
can refer to reified facts via fact identifiers
(approx. equiv. to higher-arity RDF: Sub × Prop × Obj × Time × Location × Source)
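Reification via fact identifiers can be sketched with a plain dictionary from identifier to triple. This is a minimal illustration using the numbering of this slide; the dictionary layout and the helper `annotations` are assumptions for the example, not the storage format of any particular system.

```python
# Fact identifier -> (subject, property, object). The subject of a
# meta-fact is the identifier of the fact it annotates.
facts = {
    1: ("JimGray", "hasAdvisor", "MikeHarrison"),
    2: ("SurajitChaudhuri", "hasAdvisor", "JeffUllman"),
    3: ("Madonna", "marriedTo", "GuyRitchie"),
    4: ("NicolasSarkozy", "marriedTo", "CarlaBruni"),
    5: (1, "inYear", 1968),
    6: (2, "inYear", 2006),
    7: (3, "validFrom", "22-Dec-2000"),
    8: (3, "validUntil", "Nov-2008"),
    9: (4, "validFrom", "2-Feb-2008"),
    10: (2, "source", "SigmodRecord"),
}

def annotations(fact_id):
    """Collect all temporal, spatial, or provenance annotations of one fact."""
    return {p: o for (s, p, o) in facts.values() if s == fact_id}

marriage_info = annotations(3)
advisor_info = annotations(2)
```

Looking up the annotations of fact 3 returns its validity interval, while fact 2 carries both a year and a source.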
...
Picking Low-Hanging Fruit (First)
Deterministic Pattern Matching
[Kushmerick 97; Califf & Mooney 99; Gottlob 01, …]
...
Wrapper Induction
[Gottlob et al: VLDB’01, PODS’04,…]
• Wrapper induction:
• Hierarchical document structure, XHTML, XML
• Pattern learning for restricted regular languages
(ELog, combining concepts of XPath & FOL)
...
• Visual interfaces
• See e.g. http://www.lixto.com/,
http://w4f.sourceforge.net/
Tapping on Web Tables
[Cafarella et al: PVLDB‘08; Sarawagi et al: PVLDB‘09]
Problem:
discover interesting relations
wonAward: Person × Award
nominatedForAward: Person × Award
…
from many table headers
and co-occurring cells
...
Relational Fact Extraction From Plain Text
• Hearst patterns [Hearst: COLING‘92]
– POS-enhanced regular expression matching in natural-language text
NP0 {,} such as {NP1, NP2, … (and|or) }{,} NPn
NP0 {,}{NP1, NP2, … NPn-1}{,} or other NPn
…
“The bow lute, such as the Bambara ndang, is plucked and has an
individual curved neck for each string.”
⇒ isA(“Bambara ndang”, “bow lute”)
• Noun classification from predicate-argument structures
[Hindle: ACL’90]
– Clustering of nouns by similar
verbal phrases
– Similarity based on co-occurrence
frequencies (mutual information)
verb | beer | wine
drink | 9.34 | 10.20
sell | 4.21 | 3.75
have | 0.84 | 1.38
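The first Hearst pattern can be approximated with a plain regular expression. The sketch below makes a strong simplifying assumption: instead of POS-tagged noun phrases it treats any short run of words as a noun phrase, so it only works on simple sentences such as the slide's example. The names `SUCH_AS` and `hearst_such_as` are illustrative.

```python
import re

# "NP0 {,} such as NP1": word runs stand in for noun phrases (no POS tagging).
SUCH_AS = re.compile(
    r"(?:[Tt]he |[Aa]n? )?(?P<hyper>\w+(?: \w+)*?),? such as "
    r"(?:the |an? )?(?P<hypo>\w+(?: \w+)*?)(?=,| is | are |\.)"
)

def hearst_such_as(sentence):
    """Return (hyponym, hypernym) pairs, to be read as isA(hyponym, hypernym)."""
    return [(m.group("hypo"), m.group("hyper")) for m in SUCH_AS.finditer(sentence)]

sentence = ("The bow lute, such as the Bambara ndang, is plucked and has an "
            "individual curved neck for each string.")
pairs = hearst_such_as(sentence)
```

A real implementation would use a POS tagger or chunker to find noun-phrase boundaries; the regular expression alone over-generates on longer sentences.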
DIPRE
[Brin: WebDB‘98]
• DIPRE: “Dual Iterative Pattern Relation Extraction”
– (Almost) unsupervised, iterative gathering of facts and patterns
– Positive & negative examples as seeds for target relation
e.g.
+(Hillary, Bill)
+(Carla, Nicolas) –(Larry, Google)
– Specificity threshold for new patterns based on occurrence frequency
(Hillary, Bill)
(Carla, Nicolas)
X and her husband Y
X and Y on their honeymoon
(Angelina, Brad)
(Victoria, David)
(Hillary, Bill)
(Carla, Nicolas)
X and Y and their children
X has been dating with Y
X loves Y
(Larry, Google)
…
DIPRE/Snowball/QXtract
[Brin: WebDB’98; Agichtein,Gravano: SIGMOD’01+‘03]
• Snowball/QXtract [Agichtein,Gravano: DL’00, SIGMOD’01+‘03]
– Refined patterns and statistical measures
– >80% recall at >85% precision over a large news corpus
– QXtract demo additionally allowed user feedback in the
iteration loop
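The iterative loop between facts and patterns can be sketched on a toy corpus. This is a simplified illustration in the spirit of DIPRE, not Brin's original algorithm: patterns are whole sentences with the two arguments replaced by wildcards, and a pattern is discarded if it reproduces a negative seed (a stand-in for the frequency-based specificity threshold). The corpus and all function names are made up.

```python
import re

CORPUS = [
    "Hillary and her husband Bill",
    "Carla and her husband Nicolas",
    "Angelina and her husband Brad",
    "Angelina and Brad on their honeymoon",
    "Victoria and David on their honeymoon",
    "Hillary loves Bill",
    "Larry loves Google",
]

def induce_patterns(pairs):
    """Turn every sentence that mentions a known pair into a regex pattern."""
    patterns = set()
    for sentence in CORPUS:
        for x, y in pairs:
            if x in sentence and y in sentence:
                regex = re.escape(sentence)
                regex = regex.replace(re.escape(x), r"(?P<x>\w+)", 1)
                regex = regex.replace(re.escape(y), r"(?P<y>\w+)", 1)
                patterns.add(regex)
    return patterns

def match_pairs(patterns):
    """Apply patterns to the corpus and collect the matched argument pairs."""
    found = set()
    for sentence in CORPUS:
        for regex in patterns:
            m = re.fullmatch(regex, sentence)
            if m:
                found.add((m.group("x"), m.group("y")))
    return found

def dipre(positive, negative, rounds=2):
    """Alternate between pattern induction and fact harvesting."""
    pairs = set(positive)
    for _ in range(rounds):
        patterns = {p for p in induce_patterns(pairs)
                    if not (match_pairs({p}) & set(negative))}
        pairs |= match_pairs(patterns)
    return pairs

harvested = dipre({("Hillary", "Bill"), ("Carla", "Nicolas")}, {("Larry", "Google")})
```

The first round learns "X and her husband Y" and finds (Angelina, Brad); the second round learns "X and Y on their honeymoon" from that new pair and finds (Victoria, David). The pattern "X loves Y" is rejected because it matches the negative seed.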
Help from NLP: Dependency Parsing!
• Analyze lexico-syntactic structure of sentences
– Part-Of-Speech (POS) tagging & dependency parsing
– Prefer shorter dependency paths for fact candidates
Carla has been seen dating with Ben.
NNP VBZ VBN VBN VBG IN NNP
dating(Carla, Ben)
software tools:
CMU Link Parser: http://www.link.cs.cmu.edu/link/
Stanford Lex Parser: http://nlp.stanford.edu/software/lex-parser.shtml
Open NLP Tools: http://opennlp.sourceforge.net/
ANNIE Open-Source Information Extraction: http://www.aktors.org/technologies/annie/
LingPipe: http://alias-i.com/lingpipe/ (commercial license)
Open-Domain Gathering of Facts (Open IE)
[Etzioni,Cafarella et al:WWW’04, IJCAI‘07; Weld,Hoffman,Wu: SIGMOD-Rec‘08]
Analyze verbal phrases between entities for new relation types
• unsupervised bootstrapping with short dependency paths
Carla has been seen dating with Ben.
Rumors about Carla indicate there is something between her and Ben.
• self-supervised classifier for (noun, verb-phrase, noun) triples
… seen dating with …
(Carla, Ben), (Carla, Sofie), …
… partying with …
(Carla, Ben), (Paris, Heidi), …
• build statistics & prune sparse candidates
• group/cluster candidates for new relation types and their facts
{datesWith, partiesWith}, {affairWith, flirtsWith}, {romanticRelation}, …
But:
result is often noisy
clusters are not canonicalized relations
far from near-human quality
Learning More Mappings
[Wu & Weld: CIKM’07, WWW‘08 ]
Kylin Ontology Generator (KOG):
learn classifier for subclassOf across Wikipedia & WordNet using
• YAGO as training data
• advanced ML methods (MLN‘s, SVM‘s)
• rich features from various sources
• Category/class name similarity measures
• Category instances and their infobox templates:
template names, attribute names (e.g. knownFor)
#articles
• Wikipedia edit history:
refinement of categories
• Hearst patterns:
C such as X, X and Y and other C‘s, …
• Other search-engine statistics:
co-occurrence frequencies
instances/classes
> 3 Mio. entities
> 1 Mio. w/ infoboxes
> 500 000 categories
Entity Disambiguation
Names
“Penn“
“U Penn“
Entities
Sean Penn
?
University of
Pennsylvania
“Penn State“
Pennsylvania
State University
„PSU“
Pennsylvania
(US State)
Passenger
Service Unit
• ill-defined with zero context
• known as record linkage for names in record fields
• Wikipedia offers rich candidate mappings:
disambiguation pages, re-directs, inter-wiki links,
anchor texts of href links
Individual Entity Disambiguation
Sean Penn
…
Penn
Into the Wild
…
Penn
XML Treebank
…
Penn
Univ. Park
University of
Pennsylvania
Penn State
University
Typical Approaches:
name similarity:
edit distances, n-gram overlap, …
context similarity: record level
context similarity: words/phrases level
context similarity:
text around names, classes & facts around entities
Challenge: efficiency & scalability
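Two of the name-similarity measures mentioned above, n-gram overlap and edit distance, can be sketched in a few lines. The choice of character trigrams and Jaccard overlap is an assumption for this example; systems differ in tokenization and weighting.

```python
def ngrams(s, n=3):
    """Set of lower-cased character n-grams of a string."""
    s = s.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_overlap(a, b, n=3):
    """Jaccard overlap of the character n-gram sets of two names."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]

overlap = ngram_overlap("U Penn", "Penn State")
```

Such string measures only rank candidates; they cannot decide between "Sean Penn" and "University of Pennsylvania" without context.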
Collective Entity Disambiguation
[Doan et al: AAAI‘05; Singla,Domingos: ICDM’07; Chakrabarti et al: KDD‘09, …]
• Consider a set of names {n1, n2, …} in same context
and sets of candidate entities
E1 = {e11, e12, …}, E2 = {e21, e22, …}, …
• Define joint objective function (e.g. likelihood for prob. model)
that rewards coherence of mappings
μ(n1) = x1 ∈ E1, μ(n2) = x2 ∈ E2, …
• Solve optimization problem
Stuart Russell (DJ)
Stuart Russell
Michael Jordan
Stuart Russell
(computer scientist)
Michael Jordan
(computer scientist)
Michael Jordan (NBA)
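For the two names in this example, the joint optimization can be sketched by exhaustive search over all candidate combinations. The coherence score below is a hypothetical hand-set value (in practice it would be derived from shared links, classes, or facts), and brute force is only feasible for tiny candidate sets.

```python
from itertools import product

CANDIDATES = {
    "Stuart Russell": ["Stuart Russell (DJ)", "Stuart Russell (computer scientist)"],
    "Michael Jordan": ["Michael Jordan (computer scientist)", "Michael Jordan (NBA)"],
}

# Hypothetical pairwise coherence scores; unlisted pairs score 0.
COHERENCE = {
    frozenset(["Stuart Russell (computer scientist)",
               "Michael Jordan (computer scientist)"]): 0.9,
}

def disambiguate(names):
    """Pick one entity per name so that total pairwise coherence is maximal."""
    best, best_score = None, float("-inf")
    for assignment in product(*(CANDIDATES[n] for n in names)):
        score = sum(COHERENCE.get(frozenset([a, b]), 0.0)
                    for i, a in enumerate(assignment)
                    for b in assignment[i + 1:])
        if score > best_score:
            best, best_score = dict(zip(names, assignment)), score
    return best

mapping = disambiguate(["Stuart Russell", "Michael Jordan"])
```

Because the two computer scientists are coherent with each other, both names are mapped jointly, even though each name alone is ambiguous.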
Declarative Extraction Frameworks
• IBM’s SystemT [Krishnamurthy et al: SIGMOD Rec.’08, ICDE’08]
– Fully declarative extraction framework
– SQL-style operators, cost models, full optimizer support
• DBLife/Cimple [DeRose, Doan et al: CIDR’07, VLDB’07]
– Online community portal centered around the DB domain
(regular crawls of DBLP, conferences, homepages, etc.)
• More commercial endeavors:
– FreeBase.com, WolframAlpha.com, Sig.ma,
TrueKnowledge.com, Google.com/squared
Google Images
DBLP
Homepages/
DBLP/
DBWorld/
Google Scholar
DBWorld/DBLP
/Google Scholar
Probabilistic Extraction Models
• Hidden Markov Models (HMMs)
[Rabiner: Proc. IEEE’89; Sutton,McCallum: MIT Press’06]
– Markov chain (directed graphical model) with
“hidden” states Y, observations X, and transition probabilities
– Factorizes the joint distribution P(Y,X)
– Assuming independence among observations
• Conditional Random Fields (CRFs)
[Lafferty,McCallum,Pereira: ML’01; Sarawagi,Cohen: NIPS’04]
– Markov random field (undirected graphical model)
– Models the conditional distribution P(Y|X)
(less strict independence assumptions)
“I went skiing with Fernando Pereira in British Columbia.”
• Joint segmentation and disambiguation of input strings onto entities
and classes: NER, POS tagging, etc.
• Trained, e.g., on bibliographic entries, no manual labeling required
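Decoding with an HMM can be illustrated with the Viterbi algorithm on a two-state tagger. All probabilities below are made-up toy values, and the only observation feature is whether a token is capitalized; this is a sketch of the model class, not a trained extractor.

```python
def viterbi(observations, states, start, trans, emit):
    """Return the most probable hidden state sequence for the observations."""
    # best[t][s] = (probability of the best path ending in s at time t, previous state)
    best = [{s: (start[s] * emit[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        layer = {}
        for s in states:
            prob, prev = max((best[-1][p][0] * trans[p][s], p) for p in states)
            layer[s] = (prob * emit[s][obs], prev)
        best.append(layer)
    # Backtrack from the best final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for layer in reversed(best[1:]):
        state = layer[state][1]
        path.append(state)
    return path[::-1]

STATES = ["O", "N"]  # O = other token, N = part of a name
START = {"O": 0.8, "N": 0.2}
TRANS = {"O": {"O": 0.8, "N": 0.2}, "N": {"O": 0.4, "N": 0.6}}
EMIT = {"O": {"cap": 0.1, "low": 0.9}, "N": {"cap": 0.9, "low": 0.1}}

tokens = "went skiing with Fernando Pereira".split()
features = ["cap" if t[0].isupper() else "low" for t in tokens]
tags = viterbi(features, STATES, START, TRANS, EMIT)
```

With these toy parameters the two capitalized tokens are tagged as a name segment. A CRF would replace the generative emission and transition tables with feature weights conditioned on the whole input.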
Pattern-Based Harvesting
[Hearst 92; Brin 98; Agichtein 00; Etzioni 04; …]
Facts & Fact Candidates
(Hillary, Bill)
(Carla, Nicolas)
Patterns
X and her husband Y
X and Y on their honeymoon
(Angelina, Brad)
(Victoria, David)
(Hillary, Bill)
X and Y and their children
(Carla, Nicolas)
X has been dating with Y
X loves Y
(Yoko, John)
(Kate, Pete)
(Carla, Benjamin)
(Larry, Google)
(Angelina, Brad)
(Victoria, David)
…
• good for recall
• noisy, drifting
• not robust enough
for high precision
Outline
• Part II
–Extracting Knowledge
• Pattern-based Extraction ✔
• Consistency Reasoning
• Higher-arity Relations: Space & Time
French Marriage Problem
isMarriedTo:
person × person
isMarriedTo:
frenchPolitician × person
...
French Marriage Problem
Facts in KB:
married
(Hillary, Bill)
married
(Carla, Nicolas)
married
(Angelina, Brad)
New facts or fact candidates:
married (Cecilia, Nicolas)
married (Carla, Benjamin)
married (Carla, Mick)
married (Michelle, Barack)
married (Yoko, John)
married (Kate, Leonardo)
married (Carla, Sofie)
married (Larry, Google)
1) for recall: pattern-based harvesting
2) for precision: consistency reasoning
Reasoning about Fact Candidates
Use consistency constraints to prune false candidates!
First-order-logic rules (restricted):
spouse(x,y) ∧ diff(y,z) ⇒ ¬spouse(x,z)
spouse(x,y) ∧ diff(w,x) ⇒ ¬spouse(w,y)
spouse(x,y) ⇒ f(x)
spouse(x,y) ⇒ m(y)
spouse(x,y) ⇒ (f(x) ∧ m(y)) ∨ (m(x) ∧ f(y))
Rules reveal inconsistencies
Find consistent subset(s) of atoms
(“possible world(s)“, “the truth“)
Ground atoms:
spouse(Hillary,Bill)
spouse(Carla,Nicolas)
spouse(Cecilia,Nicolas)
spouse(Carla,Ben)
spouse(Carla,Mick)
spouse(Carla, Sofie)
f(Hillary)
f(Carla)
f(Cecilia)
f(Sofie)
m(Bill)
m(Nicolas)
m(Ben)
m(Mick)
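The search for consistent subsets of the ground atoms above can be sketched by brute force. The sketch treats the first four rules as hard constraints (spouse is functional in both arguments, first argument female, second argument male) and enumerates subsets from largest to smallest; the function names are illustrative, and real systems use MaxSat solvers or probabilistic inference instead of enumeration.

```python
from itertools import combinations

CANDIDATES = [("Hillary", "Bill"), ("Carla", "Nicolas"), ("Cecilia", "Nicolas"),
              ("Carla", "Ben"), ("Carla", "Mick"), ("Carla", "Sofie")]
FEMALE = {"Hillary", "Carla", "Cecilia", "Sofie"}
MALE = {"Bill", "Nicolas", "Ben", "Mick"}

def consistent(world):
    """Check the hard constraints on a set of spouse(x, y) atoms."""
    xs = [x for x, _ in world]
    ys = [y for _, y in world]
    functional = len(set(xs)) == len(xs) and len(set(ys)) == len(ys)
    typed = all(x in FEMALE and y in MALE for x, y in world)
    return functional and typed

def max_consistent_worlds(candidates):
    """Return all consistent subsets of maximal size."""
    for size in range(len(candidates), -1, -1):
        worlds = [set(w) for w in combinations(candidates, size) if consistent(w)]
        if worlds:
            return worlds

worlds = max_consistent_worlds(CANDIDATES)
```

Both maximal worlds contain spouse(Cecilia, Nicolas) and drop spouse(Carla, Nicolas), which shows why unweighted constraints alone are not enough and why candidates and rules need weights.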
Rules can be weighted
(e.g. by fraction of ground atoms that satisfy a rule)
⇒ uncertain / probabilistic data
⇒ compute prob. distr. over (a subset of) ground atoms being “true“
Markov Logic Networks (MLN‘s)
[Richardson/Domingos: ML 2006]
Map logical constraints & fact candidates
into probabilistic graphical model: Markov Random Field (MRF)
FOL rules:
s(x,y) ∧ diff(y,z) ⇒ ¬s(x,z)
s(x,y) ∧ diff(w,x) ⇒ ¬s(w,y)
s(x,y) ⇒ f(x)
s(x,y) ⇒ m(y)
f(x) ⇒ ¬m(x)
m(x) ⇒ ¬f(x)
Grounding: Literal → Boolean Var
Reasoning: Literal → Binary RV
Grounding:
¬s(Ca,Nic) ∨ ¬s(Ce,Nic)
¬s(Ca,Nic) ∨ ¬s(Ca,Ben)
¬s(Ca,Nic) ∨ m(Nic)
¬s(Ca,Nic) ∨ ¬s(Ca,So)
¬s(Ce,Nic) ∨ m(Nic)
¬s(Ca,Ben) ∨ ¬s(Ca,So)
¬s(Ca,Ben) ∨ m(Ben)
¬s(Ca,So) ∨ m(So)
Base facts
w/entities:
s(Carla,Nicolas)
s(Cecilia,Nicolas)
s(Carla,Ben)
s(Carla,Sofie)
…
[Figure: MRF with one node per ground literal: s(Ce,Nic), m(Nic), s(Ca,Nic), s(Ca,Ben), s(Ca,So), m(Ben), m(So)]
Variety of algorithms for joint inference:
Gibbs sampling, other MCMC, belief propagation,
randomized MaxSat, …
RVs are coupled by an MRF edge if they appear in the same clause
MRF assumption: P[Xi | X1..Xn] = P[Xi | MB(Xi)], where MB(Xi) is the Markov blanket of Xi
joint distribution has product form over all cliques
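The product form can be made concrete with a minimal sketch of the log-linear distribution that a Markov Logic Network defines: each possible world is scored by exp(sum of the weights of the ground clauses it satisfies), then normalized. The clause weights below are illustrative assumptions, not values from the slides, and exhaustive enumeration only works for a handful of atoms.

```python
import math
from itertools import product

ATOMS = ["s(Ca,Nic)", "s(Ce,Nic)", "m(Nic)"]

# Ground clauses as (weight, literals); a literal is (atom, truth value
# that satisfies it). Weights are assumed for illustration only.
CLAUSES = [
    (2.0, [("s(Ca,Nic)", False), ("s(Ce,Nic)", False)]),  # ¬s(Ca,Nic) ∨ ¬s(Ce,Nic)
    (1.0, [("s(Ca,Nic)", False), ("m(Nic)", True)]),      # ¬s(Ca,Nic) ∨ m(Nic)
    (1.0, [("s(Ce,Nic)", False), ("m(Nic)", True)]),      # ¬s(Ce,Nic) ∨ m(Nic)
]


def log_potential(world):
    """Sum of the weights of all ground clauses satisfied in this world."""
    return sum(w for w, lits in CLAUSES
               if any(world[atom] == value for atom, value in lits))


def joint_distribution():
    """Normalized probability of every truth assignment to ATOMS."""
    worlds = [dict(zip(ATOMS, values))
              for values in product([False, True], repeat=len(ATOMS))]
    scores = [math.exp(log_potential(w)) for w in worlds]
    z = sum(scores)  # partition function
    return [(w, s / z) for w, s in zip(worlds, scores)]


def marginal(atom):
    """Probability that a single ground atom is true."""
    return sum(p for w, p in joint_distribution() if w[atom])


print({atom: round(marginal(atom), 3) for atom in ATOMS})
```

Real MLN engines avoid this enumeration and use the approximate inference methods listed above (Gibbs sampling, belief propagation, randomized MaxSat).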
[Figure: the same MRF with a marginal probability attached to each node; values shown on the slide: 0.8, 0.1, 0.5, 0.2, 0.7, 0.6, 0.7]
Consistency reasoning: prune low-confidence facts!
StatSnowball [Zhu et al: WWW‘09], BioSnowball [Liu et al: KDD‘10]
EntityCube, MSR Asia: http://entitycube.research.microsoft.com/
Related Alternative Probabilistic Models
Constrained Conditional Models [Roth et al. 2007]
log-linear classifiers with constraint-violation penalty
mapped into Integer Linear Programs
Factor Graphs with Imperative Variable Coordination
[McCallum et al. 2008]
RV‘s share “factors“ (joint feature functions)
generalizes MRF, BN, CRF, …
inference via advanced MCMC
flexible coupling & constraining of RV‘s
software tools:
alchemy.cs.washington.edu
code.google.com/p/factorie/
research.microsoft.com/en-us/um/cambridge/projects/infernet/
Reasoning for KB Growth: Direct Route
[Suchanek,Sozio,Weikum: WWW’09]
Facts in KB:
married (Hillary, Bill)
married (Carla, Nicolas)
married (Angelina, Brad)
New fact candidates:
married (Cecilia, Nicolas)
married (Carla, Benjamin)
married (Carla, Mick)
married (Carla, Sofie)
married (Larry, Google)
Patterns:
X and her husband Y
X and Y and their children
X has been dating with Y
X loves Y
Direct approach:
• KB facts are true; fact candidates & patterns → hypotheses
• grounded constraints → clauses with hypotheses as vars
• cast into Weighted Max-Sat with weights from pattern stats
• customized approximation algorithm
• unifies: fact/candidate consistency, pattern goodness, entity disambig.
www.mpi-inf.mpg.de/yago-naga/sofie/
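The casting into Weighted Max-Sat can be sketched as follows. This is not SOFIE's customized approximation algorithm; it is an exhaustive solver over a tiny instance, with hypothesis names taken from the tutorial's running example and clause weights that are assumptions made up for illustration.

```python
from itertools import product

VARIABLES = [
    "expresses(husband)", "expresses(dating)",
    "spouse(Victoria,David)", "spouse(Rebecca,David)", "spouse(Victoria,Tom)",
]

# Weighted clauses: (weight, [(variable, truth value satisfying the literal)]).
# All weights are illustrative assumptions.
CLAUSES = [
    (60, [("expresses(husband)", True)]),  # pattern co-occurs with a KB fact
    (60, [("expresses(husband)", False), ("spouse(Victoria,David)", True)]),
    (20, [("expresses(dating)", False), ("spouse(Rebecca,David)", True)]),
    (20, [("expresses(dating)", False), ("spouse(Victoria,Tom)", True)]),
    (100, [("spouse(Victoria,David)", False), ("spouse(Rebecca,David)", False)]),
    (100, [("spouse(Victoria,David)", False), ("spouse(Victoria,Tom)", False)]),
    (1, [("spouse(Victoria,David)", False)]),  # weak prior against new facts
    (1, [("spouse(Rebecca,David)", False)]),
    (1, [("spouse(Victoria,Tom)", False)]),
]


def weighted_max_sat(variables, clauses):
    """Try every truth assignment and return the one with maximum total
    weight of satisfied clauses (exponential; for illustration only)."""
    best_assignment, best_weight = None, -1
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        weight = sum(w for w, lits in clauses
                     if any(assignment[v] == sign for v, sign in lits))
        if weight > best_weight:
            best_assignment, best_weight = assignment, weight
    return best_assignment, best_weight


assignment, weight = weighted_max_sat(VARIABLES, CLAUSES)
print(weight, [v for v, value in assignment.items() if value])
```

With these assumed weights the solver accepts the "husband" pattern and spouse(Victoria,David) and rejects the "dating" pattern together with the two conflicting candidates, which shows how fact consistency and pattern goodness are decided jointly.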
SOFIE: Facts & Patterns Consistency
[Suchanek,Sozio,Weikum: WWW’09]
Constraints to connect facts, fact candidates & patterns
pattern-fact duality:
occurs(p,x,y) ∧ expresses(p,R) ⇒ R(x,y)
occurs(p,x,y) ∧ R(x,y) ⇒ expresses(p,R)
name(-in-context)-to-entity mapping:
¬means(n,e1) ∨ ¬means(n,e2) ∨ …
functional dependencies:
spouse(x,y): x → y, y → x
relation properties:
asymmetry, transitivity, acyclicity, …
type constraints, inclusion dependencies:
spouse ⊆ Person × Person
capitalOfCountry ⊆ cityOfCountry
domain-specific constraints:
bornInYear(x) + 10years ≤ graduatedInYear(x)
hasAdvisor(x,y) ∧ graduatedInYear(x,t) ∧ graduatedInYear(y,s) ⇒ s < t
www.mpi-inf.mpg.de/yago-naga/sofie/
• Grounded into large propositional Boolean formula in CNF
• Max-Sat solver for joint inference
(complete truth assignment to all candidate patterns & facts)
SOFIE Example
Facts in KB:
Spouse (HillaryClinton, BillClinton)
Spouse (CarlaBruni, NicolasSarkozy)
Pattern occurrences:
occurs (X and her husband Y, Hillary, Bill)
occurs (X Y and their children, Hillary, Bill)
occurs (X and her husband Y, Victoria, David)
occurs (X dating with Y, Rebecca, David)
occurs (X dating with Y, Victoria, Tom)
Hypotheses:
Spouse (Victoria, David) [1]
Spouse (Rebecca, David) [1]
Spouse (Victoria, Tom) [1]
expresses (X and her husband Y, Spouse)
expresses (X Y and their children, Spouse)
expresses (X dating with Y, Spouse)
Rules:
∀ x,y,z,w: R(x,y) ∧ R(x,z) ⇒ y=z
∀ x,y,z,w: R(x,y) ∧ R(w,y) ⇒ x=w
∀ x,y: R(x,y) ⇒ R(y,x)
∀ p,x,y: occurs (p, x, y) ∧ expresses (p, R) ⇒ R (x, y)
∀ p,x,y: occurs (p, x, y) ∧ R (x, y) ⇒ expresses (p, R)
Grounded clauses:
Spouse (Victoria, David) ⇒ ¬Spouse (Rebecca, David)
Spouse (Victoria, David) ⇒ ¬Spouse (Victoria, Tom)
…
occurs (husband, Victoria, David) ∧ expresses (husband, Spouse) ⇒ Spouse (Victoria, David)
occurs (dating, Rebecca, David) ∧ expresses (dating, Spouse) ⇒ Spouse (Rebecca, David)
…
occurs (husband, Victoria, David) ∧ Spouse (Victoria, David) ⇒ expresses (husband, Spouse)
…
Clause weights shown on the slide: [100], [40], [60], [20], [10], [1], [1], [1], [60], [20], [60]
Soft Rules vs. Hard Constraints
Enforce FD‘s (mutual exclusion) as hard constraints:
hasAdvisor(x,y) ∧ diff(y,z) ⇒ ¬hasAdvisor(x,z)
combine with weighted constraints
no longer regular MaxSat
constrained (weighted) MaxSat instead
Generalize to other forms of constraints:
Hard constraint:
hasAdvisor(x,y) ∧ graduatedInYear(x,t) ∧ graduatedInYear(y,s) ⇒ s < t
Soft constraint:
firstPaper(x,p) ∧ firstPaper(y,q) ∧ author(p,x) ∧ author(p,y) ∧ inYear(p) > inYear(q) + 5years ⇒ hasAdvisor(x,y) [0.6]
open issue for arbitrary constraints
Datalog-style grounding
(deductive & potentially recursive)
→ rethink reasoning!
Pattern Harvesting, Revisited
[Suchanek et al: KDD’06; Nakashole et al: WebDB’10, WSDM’11]
narrow / nasty / noisy patterns:
X and his famous advisor Y
X carried out his doctoral research in math under the supervision of Y
X jointly developed the method with Y
using narrow & dropping nasty patterns loses recall!
using noisy patterns loses precision & slows down MaxSat
POS-lifted n-gram itemsets as patterns:
X { PRP ADJ advisor } Y
X { his doctoral research, under the supervision of } Y
X { PRP doctoral research, IN DET supervision of } Y
confidence weights, using seeds and counter-seeds:
seeds: (MosheVardi, CatrielBeeri), (JimGray, MikeHarrison)
counter-seeds: (MosheVardi, RonFagin), (AlonHalevy, LarryPage)
→ confidence of pattern p ~ #p with seeds − #p with counter-seeds
Outline
• Part II
–Extracting Knowledge
• Pattern-based Extraction ✔
• Consistency Reasoning ✔
• Higher-arity Relations: Space & Time
Higher-arity Relations: Space & Time
• YAGO-2 Preview
                           Just Wikipedia    Incl. Gazetteer Data
#Relations                             86                      92
#Classes                          563,374                 563,997
#Entities                       2,639,853               9,819,683
#Facts                        495,770,281             996,329,323
- basic relations              20,937,244              61,188,706
- types & classes               8,664,129             181,977,830
- space, time & proven.       466,168,908             753,162,787
Size (CSV format)                 23.4 GB                   37 GB
estimated precision > 95%
(for basic relations excl. space, time & provenance)
www.mpi-inf.mpg.de/yago-naga/
French Marriage Problem (Revisited)
Facts in KB:
1: married (Hillary, Bill)
2: married (Carla, Nicolas)
3: married (Angelina, Brad)
validFrom (2, 2008)
New fact candidates:
4: married (Cecilia, Nicolas)
5: married (Carla, Benjamin)
6: married (Carla, Mick)
7: divorced (Madonna, Guy)
8: domPartner (Angelina, Brad)
validFrom (4, 1996)
validFrom (5, 2010)
validFrom (6, 2006)
validFrom (7, 2008)
validUntil (4, 2007)
Challenge: Temporal Knowledge Harvesting
For all people in Wikipedia (100,000‘s) gather all spouses,
incl. divorced & widowed, and corresponding time periods!
>95% accuracy, >95% coverage, in one night
Consistency constraints are potentially helpful:
• functional dependencies: {husband, time} → {wife, time}
• inclusion dependencies: marriedPerson ⊆ adultPerson
• age/time/gender restrictions: birthdate + Δ < marriage < divorce
Difficult Dating
(Even More Difficult)
Implicit Dating
explicit dates vs. implicit dates relative to other dates
(Even More Difficult)
Implicit Dating
vague dates
relative dates
narrative text
relative order
TARSQI: Extracting Time Annotations
http://www.timeml.org/site/tarsqi/
[Verhagen et al: ACL‘05]
Hong Kong is poised to hold the first election in more than half <TIMEX3 tid="t3" TYPE="DURATION" VAL="P100Y">a century</TIMEX3> that includes a democracy advocate seeking high office in territory controlled by the Chinese government in Beijing. A pro-democracy politician, Alan Leong, announced <TIMEX3 tid="t4" TYPE="DATE" VAL="20070131">Wednesday</TIMEX3> that he had obtained enough nominations to appear on the ballot to become the territory’s next chief executive. But he acknowledged that he had no chance of beating the Beijing-backed incumbent, Donald Tsang, who is seeking re-election. Under electoral rules imposed by Chinese officials, only 796 people on the election committee – the bulk of them with close ties to mainland China – will be allowed to vote in the <TIMEX3 tid="t5" TYPE="DATE" VAL="20070325">March 25</TIMEX3> election. It will be the first contested election for chief executive since Britain returned Hong Kong to China in <TIMEX3 tid="t6" TYPE="DATE" VAL="1997">1997</TIMEX3>.
Mr. Tsang, an able administrator who took office during the early stages of a sharp economic upturn in <TIMEX3 tid="t7" TYPE="DATE" VAL="2005">2005</TIMEX3>, is popular with the general public. Polls consistently indicate that three-fifths of Hong Kong’s people approve of the job he has been doing. It is of course a foregone conclusion – Donald Tsang will be elected and will hold office for <TIMEX3 tid="t9" beginPoint="t0" endPoint="t8" TYPE="DURATION" VAL="P5Y">another five years</TIMEX3>, said Mr. Leong, the former chairman of the Hong Kong Bar Association.
(slide callout: extraction errors!)
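Annotations of this kind can be pulled out of tagged text with a few lines of code. The sketch below uses plain regular expressions and is not part of TARSQI; the function name and the sample string are assumptions for illustration, and a real pipeline would use a proper XML parser.

```python
import re

# Matches one <TIMEX3 ...>covered text</TIMEX3> element.
TIMEX3 = re.compile(r"<TIMEX3\s+([^>]*)>(.*?)</TIMEX3>", re.DOTALL)
# Matches attribute="value" pairs inside the opening tag.
ATTRIBUTE = re.compile(r'(\w+)="([^"]*)"')


def extract_timex3(text):
    """Return one dict per TIMEX3 element: its attributes plus the
    covered text (whitespace-normalized) under the key 'text'."""
    annotations = []
    for attributes, covered in TIMEX3.findall(text):
        record = dict(ATTRIBUTE.findall(attributes))
        record["text"] = " ".join(covered.split())
        annotations.append(record)
    return annotations


sample = (
    'announced <TIMEX3 tid="t4" TYPE="DATE" VAL="20070131">Wednesday</TIMEX3> '
    'that he had obtained enough nominations; Britain returned Hong Kong to China '
    'in <TIMEX3 tid="t6" TYPE="DATE" VAL="1997">1997</TIMEX3>.'
)
for record in extract_timex3(sample):
    print(record["tid"], record["TYPE"], record["VAL"], record["text"])
```

The VAL attribute carries the normalized value, which is what a temporal knowledge base would store as validFrom / validUntil evidence.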
13 Relations between Time Intervals
[Allen, 1984; Allen & Hayes 1989]
A Before B / B After A
A Meets B / B MetBy A
A Overlaps B / B OverlappedBy A
A Starts B / B StartedBy A
A During B / B Contains A
A Finishes B / B FinishedBy A
A Equal B
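The thirteen relations can be decided from the four endpoints alone. The following sketch assumes proper intervals given as (start, end) pairs with start < end; the function name is hypothetical.

```python
def allen_relation(a, b):
    """Return the name of the Allen relation that holds between
    interval a and interval b, each given as (start, end) with start < end."""
    (a1, a2), (b1, b2) = a, b
    if a2 < b1:
        return "Before"
    if b2 < a1:
        return "After"
    if a2 == b1:
        return "Meets"
    if b2 == a1:
        return "MetBy"
    if a1 == b1 and a2 == b2:
        return "Equal"
    if a1 == b1:
        return "Starts" if a2 < b2 else "StartedBy"
    if a2 == b2:
        return "Finishes" if a1 > b1 else "FinishedBy"
    if b1 < a1 and a2 < b2:
        return "During"
    if a1 < b1 and b2 < a2:
        return "Contains"
    # Remaining case: the intervals properly overlap.
    return "Overlaps" if a1 < b1 else "OverlappedBy"


print(allen_relation((1996, 2007), (2008, 2010)))  # Before
print(allen_relation((2003, 2007), (2005, 2009)))  # Overlaps
```

A constraint such as "two marriages of the same person must not overlap" then reduces to requiring Before, After, Meets or MetBy for the two validity intervals.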
Possible Worlds in Time
[Wang,Yahya,Theobald: MUD Workshop ‘10]
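Derived facts (state relation, non-independent):
teamMates(Beckham, Ronaldo) ← playsFor(Beckham, Real, T1) ∧ playsFor(Ronaldo, Real, T2) ∧ overlaps(T1,T2)
Base facts (state relations, independent):
playsFor(Beckham, Real)
playsFor(Ronaldo, Real)
[Figure: confidence histograms over time. Derived relation teamMates(Beckham, Ronaldo): bin confidences 0.36, 0.16, 0.08, 0.12 between ‘03 and ‘07. Base relations playsFor(Beckham, Real) and playsFor(Ronaldo, Real): bin confidences 0.4, 0.6, 1.0, 0.1, 0.2, 0.4, 0.9, 0.2 over bins between ‘00 and ‘07]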
• Closed and complete representation model (incl. lineage)
→ Stanford Trio project [Widom: CIDR’05, Benjelloun et al: VLDB’06]
• Interval computation remains linear in the number of bins
• Confidence computation per bin is #P-complete
• In general requires possible-worlds-based sampling techniques
(Gibbs-style sampling, Luby-Karp, etc.)
Open Problems and Challenges in IE (I)
High precision & high recall at affordable cost
robust pattern analysis & reasoning
parallel processing, lazy / lifted inference, …
Types and constraints
soft rules & hard constraints, rich DL, beyond CWA
explore & understand different families of constraints
Declarative, self-optimizing workflows
incorporate pattern & reasoning steps into IE queries/programs
Scale, dynamics, life-cycle
grow & maintain KB with near-human-quality over long periods
Open-domain knowledge harvesting
turn names, phrase & table cells into entities & relations
Open Problems and Challenges in IE (II)
Temporal Querying (Revived)
query language (T-SPARQL?), no schema
confidence weights & ranking
Gathering Implicit and Relative Time Annotations
biographies & news, relative orderings
aggregate & reconcile observations
Incomplete and Uncertain Temporal Scopes
incorrect, incomplete, unknown begin/end
vague dating
Consistency Reasoning
extended MaxSat, extended Datalog, prob. graph. models, etc.
for resolving inconsistencies on uncertain facts & uncertain time
Outline
• Part II
–Extracting Knowledge
• Pattern-based Extraction ✔
• Consistency Reasoning ✔
• Higher-arity Relations: Space & Time ✔
References for Part II
E. Agichtein, L. Gravano, J. Pavel, V. Sokolova, A. Voskoboynik. Snowball: a prototype system for extracting relations from large text collections. SIGMOD, 2001.
James Allen. Towards a general theory of action and time. Artif.Intell., 23(2), 1984.
M. Banko, M. J. Cafarella, S. Soderland, M. Broadhead, O. Etzioni. Open information extraction from the web. IJCAI, 2007.
R. Baumgartner, S. Flesca, G. Gottlob. Visual web information extraction with Lixto. VLDB, 2001.
S. Brin. Extracting patterns and relations from the World Wide Web. WebDB, 1998.
M. J. Cafarella, A. Halevy, D. Z. Wang, E. Wu, Y. Zhang. WebTables: exploring the power of tables on the web. PVLDB, 1(1), 2008.
M. E. Califf, R. J. Mooney. Relational learning of pattern-match rules for information extraction. AAAI, 1999.
P. DeRose, W. Shen, F. Chen, Y. Lee, D. Burdick, A. Doan, R. Ramakrishnan. DBLife: A community information management platform for the database research community. CIDR, 2007.
A. Doan, L. Gravano, R. Ramakrishnan, S. Vaithyanathan. (Eds.). Special issue on information extraction. SIGMOD Record, 37(4), 2008.
O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld, A. Yates. Web-scale information extraction in KnowItAll. WWW, 2004.
G. Gottlob, C. Koch, R. Baumgartner, M. Herzog, S. Flesca. The Lixto data extraction project - back and forth between theory and practice. PODS, 2004.
R. Gupta, S. Sarawagi: Answering Table Augmentation Queries from Unstructured Lists on the Web. PVLDB, 2(1), 2009.
M. A. Hearst. Automatic acquisition of hyponyms from large text corpora. COLING, 1992.
D. Hindle. Noun classification from predicate-argument structures. ACL, 1990.
R. Krishnamurthy, Y. Li, S. Raghavan, F. Reiss, S. Vaithyanathan, H. Zhu. SystemT: a system for declarative information extraction. SIGMOD Record, 37(4), 2008.
S. Kulkarni, A. Singh, G. Ramakrishnan, S. Chakrabarti. Collective Annotation of Wikipedia Entities in Web Text. KDD, 2009.
N. Kushmerick. Wrapper induction: efficiency and expressiveness. Artif. Intell., 118(1-2), 2000.
J. Lafferty, A. McCallum, F. Pereira. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. ML, 2001.
X. Liu, Z. Nie, N. Yu, J.-R. Wen. BioSnowball: automated population of Wikis. KDD, 2010.
A. McCallum, K. Schultz, S. Singh. FACTORIE: Probabilistic Programming via Imperatively Defined Factor Graphs. NIPS, 2009.
N. Nakashole, M. Theobald, G. Weikum. Find your Advisor: Robust Knowledge Gathering from the Web. WebDB, 2010.
L.R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE, 1989.
M. Richardson and P. Domingos. Markov Logic Networks. ML, 2006.
D. Roth, W. Yih. Global Inference for Entity and Relation Identification via a Linear Programming Formulation. MIT Press, 2007.
S. Sarawagi. Information extraction. Foundations and Trends in Databases, 1(3), 2008.
S. Sarawagi, W. W. Cohen. Semi-Markov conditional random fields for information extraction. NIPS, 2004.
W. Shen, X. Li, A. Doan. Constraint-Based Entity Matching. AAAI, 2005.
P. Singla, P. Domingos. Entity resolution with Markov Logic. ICDM, 2006.
F. M. Suchanek, M. Sozio, G. Weikum. SOFIE: a self-organizing framework for information extraction. WWW, 2009.
F. M. Suchanek, G. Ifrim, G. Weikum. Combining linguistic and statistical analysis to extract relations from web documents. KDD, 2006.
C. Sutton, A. McCallum. An Introduction to Conditional Random Fields for Relational Learning. MIT Press, 2006.
R. C. Wang, W. W. Cohen. Language-independent set expansion of named entities using the web. ICDM, 2007.
Y. Wang, M. Yahya, M. Theobald. Time-aware Reasoning in Uncertain Knowledge Bases. VLDB/MUD, 2010.
D. S. Weld, R. Hoffmann, F. Wu. Using Wikipedia to bootstrap open information extraction. SIGMOD Record, 37(4), 2008.
F. Wu, D. S. Weld. Autonomously semantifying Wikipedia. CIKM, 2007.
F. Wu, D. S. Weld. Automatically refining the Wikipedia infobox ontology. WWW, 2008.
A. Yates, M. Banko, M. Broadhead, M. J. Cafarella, O. Etzioni, S. Soderland. TextRunner: Open information extraction on the web. HLT-NAACL, 2007.
J. Zhu, Z. Nie, X. Liu, B. Zhang, J.-R. Wen. StatSnowball: a statistical approach to extracting entity relationships. WWW, 2009.
Outline
• Part I
– What and Why ✔
– Available Knowledge Bases ✔
• Part II
– Extracting Knowledge ✔
• Part III
– Ranking and Searching
• Part IV
– Conclusion and Outlook
Outline for Part III
• Part III.1: Querying Knowledge Bases
– A short overview of SPARQL
– Extensions to SPARQL
• Part III.2: Searching and Ranking Entities
• Part III.3: Searching and Ranking Facts
SPARQL
• Query language for RDF from the W3C
• Main component:
– select-project-join combination of triple patterns
graph pattern queries on the knowledge base
SPARQL – Example
Example query:
Find all actors from Ontario (that are in the knowledge base)
[Figure: example knowledge graph. Entities: Mike_Myers, Jim_Carrey, Albert_Einstein, Otto_Hahn. Classes: scientist, actor, vegetarian, physicist, chemist. Places: Scarborough, Newmarket, Ulm, Frankfurt, Ontario, Germany, Canada, Europe. Edge labels: isA, bornIn, locatedIn]
SELECT ?person WHERE {?person isA actor. ?person bornIn ?loc.
?loc locatedIn Ontario.}
Find subgraphs of this form:
?person isA actor
?person bornIn ?loc
?loc locatedIn Ontario
(constants: actor, Ontario; variables: ?person, ?loc)
[Figure: matching subgraphs in the example graph: Mike_Myers and Jim_Carrey, each isA actor, bornIn Scarborough and Newmarket, which are locatedIn Ontario]
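Matching a conjunction of triple patterns against a graph can be sketched with a small backtracking join. The triples below are read off the example figure and are an assumption about its exact edges; the function name is hypothetical, and real engines use indexes and join ordering rather than nested scans.

```python
# Toy triple store (edges assumed from the example graph).
TRIPLES = {
    ("Mike_Myers", "isA", "actor"),
    ("Jim_Carrey", "isA", "actor"),
    ("Mike_Myers", "bornIn", "Scarborough"),
    ("Jim_Carrey", "bornIn", "Newmarket"),
    ("Scarborough", "locatedIn", "Ontario"),
    ("Newmarket", "locatedIn", "Ontario"),
    ("Ontario", "locatedIn", "Canada"),
    ("Albert_Einstein", "isA", "physicist"),
    ("Albert_Einstein", "bornIn", "Ulm"),
    ("Ulm", "locatedIn", "Germany"),
}


def match(patterns, triples, binding=None):
    """Yield every variable binding under which all triple patterns
    occur in the graph. Terms starting with '?' are variables."""
    binding = binding or {}
    if not patterns:
        yield dict(binding)
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        extended = dict(binding)
        compatible = True
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # Bind the variable, or check it against an earlier binding.
                if extended.setdefault(term, value) != value:
                    compatible = False
                    break
            elif term != value:
                compatible = False
                break
        if compatible:
            yield from match(rest, triples, extended)


query = [
    ("?person", "isA", "actor"),
    ("?person", "bornIn", "?loc"),
    ("?loc", "locatedIn", "Ontario"),
]
actors = sorted({b["?person"] for b in match(query, TRIPLES)})
print(actors)
```

The shared variables ?person and ?loc act as join conditions, which is the select-project-join reading of SPARQL mentioned on the earlier slide.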
SPARQL – More Features
• Eliminate duplicates in results
SELECT DISTINCT ?c WHERE {?person isA actor. ?person bornIn ?loc.
?loc locatedIn ?c}
• Return results in some order
SELECT ?person WHERE {?person isA actor. ?person bornIn ?loc.
?loc locatedIn Ontario} ORDER BY DESC(?person)
with optional LIMIT n clause
• Optional matches and filters on bound variables
SELECT ?person WHERE {?person isA actor.
OPTIONAL{?person bornIn ?loc}.
FILTER (!BOUND(?loc))}
• More operators: ASK, DESCRIBE, CONSTRUCT
SPARQL: Extensions from W3C
W3C SPARQL 1.1 draft:
• Aggregations (COUNT, AVG, …)
• Subqueries
• Negation: syntactic sugar for
OPTIONAL {?x … }
FILTER(!BOUND(?x))
SPARQL: Extensions from Research (1)
More complex graph patterns:
• Transitive paths [Anyanwu et al., WWW07]
SELECT ?p, ?c WHERE {
?p isA scientist .
?p ??r ?c. ?c isA Country. ?c locatedIn Europe .
PathFilter(cost(??r) < 5).
PathFilter(containsAny(??r, ?t)). ?t isA City. }
• Regular expressions [Kasneci et al., ICDE08]
SELECT ?p, ?c WHERE {
?p isA ?s. ?s isA scientist.
?p (bornIn | livesIn | citizenOf) locatedIn* Europe.}
SPARQL: Extensions from Research (2)
Queries over federated RDF sources:
• Determine distribution of triple patterns as part
of query (for example in ARQ from Jena)
• Automatically route triple predicates to useful
sources
Potentially requires mapping of
identifiers from different sources
RDF+SPARQL: Systems
• BigOWLIM
• OpenLink Virtuoso
• Jena with different backends
• Sesame
• OntoBroker
• SW-Store, Hexastore, RDF-3X (no reasoning)
System deployments with >10^11 triples
( see http://esw.w3.org/LargeTripleStores)
Outline for Part III
• Part III.1: Querying Knowledge Bases
• Part III.2: Searching and Ranking Entities
– Entity Importance: Graph Analysis
– Entity Search: Language Models
• Part III.3: Searching and Ranking Facts
Why ranking is essential
• Queries often have a huge number of results:
– scientists from Canada
– conferences in Toronto
– publications in databases
– actors from the U.S.
• Ranking as integral part of search
• Huge number of app-specific ranking methods:
paper/citation count, impact, salary, …
• Need for generic ranking
Extending Entities with Keywords
Remember: entities occur in facts in documents
→ Associate entities with terms in those documents
chancellor Germany
scientist election
Stuttgart21 Guido
Westerwelle France
Nicolas Sarkozy
Digression 1: Graph Authority Measures
Idea: incoming links are endorsements & increase page authority,
authority is higher if links come from high-authority pages
$PR(q) = \epsilon \cdot \frac{1}{|V|} + (1 - \epsilon) \cdot \sum_{(p,q) \in E} \frac{PR(p)}{outdeg(p)}$
Authority (page q) = stationary prob. of visiting q
Random walk: uniformly random choice of links + random jumps
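The stationary distribution of this random walk can be approximated by power iteration. The sketch below is a minimal version under stated assumptions: the jump probability `eps` and the iteration count are arbitrary choices, and nodes without outgoing links are treated as linking to every node, which is one common convention.

```python
def pagerank(edges, eps=0.15, iterations=100):
    """Power iteration for PR(q) = eps/|V| + (1-eps) * sum over (p,q) in E
    of PR(p)/outdeg(p). `edges` is a list of (p, q) pairs."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {n: [q for p, q in edges if p == n] for n in nodes}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        updated = {n: eps / len(nodes) for n in nodes}  # random jump
        for p in nodes:
            targets = out_links[p] or nodes  # dangling node: jump anywhere
            share = (1.0 - eps) * rank[p] / len(targets)
            for q in targets:
                updated[q] += share
        rank = updated
    return rank


ranks = pagerank([("a", "c"), ("b", "c"), ("c", "a")])
print({node: round(value, 3) for node, value in ranks.items()})
```

In the toy graph, c is endorsed by both a and b and therefore ends up with the highest authority, while b, which has no incoming links, keeps only the random-jump mass.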
Graph-Based Entity Importance
Combine several paradigms:
• Keyword search on associated terms to
determine candidate entities
• PageRank or similar measure to determine
important entities
• Ranking can combine entity rank with keyword-based score
Digression 2: Language Models (LMs)
State-of-the-art model in text retrieval
[Figure: documents d1 and d2 with language models LM(θ1) and LM(θ2); which of them generated query q?]
• each document di has LM: generative probability distribution of
terms with parameter θi
• query q viewed as sample from LM(θ1), LM(θ2), …
• estimate likelihood P[ q | LM(θi) ] that q is sample of LM of
document di (q is "generated by" di)
• rank by descending likelihoods (best "explanation" of q)
Language Models for Text: Example
document d: sample of M, used for parameter estimation
A A A A B B C C C D E E E E E
model M
query: A A B C E E
estimate likelihood of observing query: P [ A A B C E E | M ]
Language Models for Text: Smoothing
document d + background corpus and/or smoothing: used for parameter estimation
document d: A A A A B B C C C D E E E E E
model M
query: A B C E F
estimate likelihood of observing query: P [ A B C E F | M ]
smoothing methods: Laplace smoothing, Jelinek-Mercer, Dirichlet smoothing, …
Some LM Basics
$s(d,q) = P[q \mid d] = \prod_{i \in q} P[q_i \mid d] \sim \sum_{i \in q} \log \frac{tf(i,d)}{\sum_k tf(k,d)}$
(independ. assumpt.; simple MLE: overfitting)
$s(d,q) = \lambda P[q \mid d] + (1 - \lambda) P[q] \sim \sum_{i \in q} \log \left( \lambda \frac{tf(i,d)}{\sum_k tf(k,d)} + (1 - \lambda) \frac{df(i)}{\sum_k df(k)} \right)$
(mixture model for smoothing; P[q] est. from log or corpus)
$\sim \sum_{i \in q} \log \left( 1 + \frac{\lambda}{1 - \lambda} \cdot \frac{tf(i,d)}{\sum_k tf(k,d)} \cdot \frac{\sum_k df(k)}{df(i)} \right)$
$\sim -KL(q \mid d) = -\sum_i P[i \mid q] \log \frac{P[i \mid q]}{P[i \mid d]}$
(rank by ascending "improbability": KL divergence (Kullback-Leibler div.), aka. relative entropy)
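A mixture-model score of this kind (Jelinek-Mercer smoothing) can be sketched in a few lines. The background model here is estimated from term frequencies in a token list rather than from document frequencies, and the corpus, the parameter value and the function name are assumptions for illustration.

```python
import math
from collections import Counter


def jm_score(query, doc, corpus, lam=0.5):
    """log P[q | d] where each query term is generated by the mixture
    lam * P_MLE[t | d] + (1 - lam) * P[t | corpus]."""
    doc_tf, corpus_tf = Counter(doc), Counter(corpus)
    score = 0.0
    for term in query:
        p = (lam * doc_tf[term] / len(doc)
             + (1.0 - lam) * corpus_tf[term] / len(corpus))
        if p == 0.0:
            return float("-inf")  # term unseen everywhere
        score += math.log(p)
    return score


doc = list("AAAABBCCCDEEEEE")          # document d from the example slide
corpus = doc + ["F", "F", "A", "B"]    # assumed background corpus
query = list("ABCEF")

print(jm_score(query, doc, corpus, lam=1.0))  # unsmoothed MLE
print(jm_score(query, doc, corpus, lam=0.5))  # smoothed
```

With `lam=1.0` the unseen term F drives the likelihood to zero (log-likelihood minus infinity), which is the overfitting problem of the simple MLE; the mixture keeps the score finite.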
Entity Search with LM Ranking
[Z. Nie et al.: WWW’07]
query: keywords → answer: entities
LM (entity e) = prob. distr. of words seen in context of e
$s(e,q) = \lambda P[q \mid e] + (1 - \lambda) P[q] \sim \sum_i \frac{P[q_i \mid e]}{P[q_i]} \sim KL(LM(q) \mid LM(e))$
query q: "French player who won world championship"
candidate entities:
e1: David Beckham
e2: Ruud van Nistelrooy
e3: Ronaldinho
e4: Zinedine Zidane
e5: FC Barcelona
context of e1 (weighted by conf.):
played for ManU, Real, LA Galaxy
David Beckham champions league
England lost match against France
married to spice girl …
context of e4 (weighted by conf.):
Zizou champions league 2002
Real Madrid won final ...
Zinedine Zidane best player
France world cup 1998 ...
Outline for Part III
• Part III.1: Querying Knowledge Bases
• Part III.2: Searching and Ranking Entities
• Part III.3: Searching and Ranking Facts
– General ranking issues
– NAGA-style ranking
– Language Models for facts
What makes a fact „good“?
Confidence:
Prefer results that are likely correct
• accuracy of info extraction
• trust in sources (authenticity, authority)
Example: bornIn (Jim Gray, San Francisco) from "Jim Gray was born in San Francisco" (en.wikipedia.org)
vs. livesIn (Michael Jackson, Tibet) from "Fans believe Jacko hides in Tibet" (www.michaeljacksonsightings.com)
Informativeness:
Prefer results with salient facts
Statistical estimation from:
• frequency in answer
• frequency on Web
• frequency in query log
Example: q: Einstein isa ? → Einstein isa scientist, Einstein isa vegetarian
q: ?x isa vegetarian → Einstein isa vegetarian, Whocares isa vegetarian
Diversity:
Prefer variety of facts
Example: E won … E discovered … E played … vs. E won … E won … E won … E won …
Conciseness:
Prefer results that are tightly connected
• size of answer graph
• cost of Steiner tree
Example: Einstein won NobelPrize, Bohr won NobelPrize
vs. Einstein isa vegetarian, Cruise isa vegetarian, Cruise born 1962, Bohr died 1962
How can we implement this?
Confidence:
Prefer results that are likely correct
• accuracy of info extraction → empirical accuracy of IE
• trust in sources (authenticity, authority) → PR/HITS-style estimate of trust
combine into: max { accuracy (f,s) * trust(s) | s ∈ witnesses(f) }
Informativeness:
Prefer results with salient facts
Statistical estimation from:
• frequency in answer
• frequency on Web
• frequency in query log
→ PR/HITS-style entity/fact ranking [V. Hristidis et al., S.Chakrabarti, …]
or IR models: tf*idf … [K.Chang et al., …], Statistical Language Models
Diversity:
Prefer variety of facts
→ Statistical Language Models
Conciseness:
Prefer results that are tightly connected
• size of answer graph
• cost of Steiner tree
→ graph algorithms (BANKS, STAR, …)
[J.X. Yu et al., S.Chakrabarti et al., B. Kimelfeld et al., A. Markovetz et al., B.C. Ooi et al., G.Kasneci et al., …]
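The combination rule for confidence, max { accuracy(f,s) * trust(s) | s ∈ witnesses(f) }, translates directly into code. In the sketch below the accuracy and trust numbers are invented for illustration; only the two example facts and their sources come from the earlier slide.

```python
def fact_confidence(fact, witnesses, accuracy, trust):
    """Best accuracy(f, s) * trust(s) over all sources s that witness
    the fact; 0.0 if the fact has no witness."""
    return max(
        (accuracy[(fact, source)] * trust[source]
         for source in witnesses.get(fact, [])),
        default=0.0,
    )


born = "bornIn(Jim Gray, San Francisco)"
lives = "livesIn(Michael Jackson, Tibet)"
witnesses = {
    born: ["en.wikipedia.org"],
    lives: ["www.michaeljacksonsightings.com"],
}
# Assumed values: extraction accuracy per (fact, source), trust per source.
accuracy = {
    (born, "en.wikipedia.org"): 0.9,
    (lives, "www.michaeljacksonsightings.com"): 0.8,
}
trust = {"en.wikipedia.org": 0.9, "www.michaeljacksonsightings.com": 0.1}

print(fact_confidence(born, witnesses, accuracy, trust))
print(fact_confidence(lives, witnesses, accuracy, trust))
```

Taking the maximum means a single trustworthy, accurately extracted witness is enough to make a fact confident, regardless of how many weak witnesses it has.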
LMs: From Entities to Facts
Document / Entity LM‘s
LM for doc/entity: prob. distr. of words
LM for query: (prob. distr. of) words
LM‘s: rich for docs/entities, super-sparse for queries
richer query LM with query expansion, etc.
Triple LM‘s
LM for facts: (degen. prob. distr. of) triple
LM for queries: (degen. prob. distr. of) triple pattern
LM‘s: apples and oranges
• expand query variables by S,P,O values from DB/KB
• enhance with witness statistics
• query LM then is prob. distr. of triples !
LMs for Triples and Triple Patterns
triples (facts f) with witness statistics:
f1: Beckham p ManchesterU 200
f2: Beckham p RealMadrid 300
f3: Beckham p LAGalaxy 20
f4: Beckham p ACMilan 30
f5: Kaka p ACMilan 300
f6: Kaka p RealMadrid 150
f7: Zidane p ASCannes 20
f8: Zidane p Juventus 200
f9: Zidane p RealMadrid 350
f10: Tidjani p ASCannes 10
f11: Messi p FCBarcelona 400
f12: Henry p Arsenal 200
f13: Henry p FCBarcelona 150
f14: Ribery p BayernMunich 100
f15: Drogba p Chelsea 150
f16: Casillas p RealMadrid 20
Σ: 2600
triple patterns (queries q), with LM(q) + smoothing:
q: Beckham p ?y
Beckham p ManU 200/550
Beckham p Real 300/550
Beckham p Galaxy 20/550
Beckham p Milan 30/550
q: ?x p ASCannes
Zidane p ASCannes 20/30
Tidjani p ASCannes 10/30
q: ?x p ?y
Messi p FCBarcelona 400/2600
Zidane p RealMadrid 350/2600
Kaka p ACMilan 300/2600
…
q: Cruyff ?r FCBarcelona
Cruyff playedFor FCBarca 200/500
Cruyff playedAgainst FCBarca 50/500
Cruyff coached FCBarca 250/500
LM(q): {t → P [t | t matches q] ~ #witnesses(t)}
LM(answer f): {t → P [t | t matches f] ~ 1 for f}
smooth all LM‘s
rank results by ascending KL(LM(q)|LM(f))
LMs for Composite Queries
q: Select ?x,?c Where {?x bornIn France . ?x playsFor ?c . ?c in UK . }
queries q with subqueries q1 … qn
results are n-tuples of triples t1 … tn
LM(q): P[q1…qn] = ∏i P[qi]
LM(answer): P[t1…tn] = ∏i P[ti]
KL(LM(q)|LM(answer)) = ∑i KL(LM(qi)|LM(ti))
P [ Henry bI F, Henry p Arsenal, Arsenal in UK ] ~ 200/650 · 200/2600 · 160/500
P [ Drogba bI F, Drogba p Chelsea, Chelsea in UK ] ~ 30/650 · 150/2600 · 140/500
witness statistics:
f21: Zidane bI F 200
f22: Tidjani bI F 20
f23: Henry bI F 200
f24: Ribery bI F 200
f25: Drogba bI F 30
f26: Drogba bI IC 100
f27: Zidane bI ALG 50
f1: Beckham p ManU 200
f7: Zidane p ASCannes 20
f8: Zidane p Juventus 200
f9: Zidane p RealMadrid 300
f10: Tidjani p ASCannes 10
f12: Henry p Arsenal 200
f13: Henry p FCBarca 150
f14: Ribery p Bayern 100
f15: Drogba p Chelsea 150
f31: ManU in UK 200
f32: Arsenal in UK 160
f33: Chelsea in UK 140
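The two example products can be checked with exact arithmetic. The sketch below hard-codes the witness counts needed for the two answers; the normalizers (650 bornIn-France witnesses, 2600 playsFor witnesses, 500 in-UK witnesses) follow the slide, and the function name is hypothetical.

```python
from fractions import Fraction

BORN_IN_FRANCE = {"Zidane": 200, "Tidjani": 20, "Henry": 200,
                  "Ribery": 200, "Drogba": 30}             # sums to 650
PLAYS_FOR = {("Henry", "Arsenal"): 200, ("Drogba", "Chelsea"): 150}
PLAYS_FOR_TOTAL = 2600                                      # all playsFor witnesses
IN_UK = {"ManU": 200, "Arsenal": 160, "Chelsea": 140}       # sums to 500


def answer_probability(player, club):
    """P[t1, t2, t3] as the product of the per-subquery probabilities
    (independence between the subqueries is assumed)."""
    born = Fraction(BORN_IN_FRANCE[player], sum(BORN_IN_FRANCE.values()))
    plays = Fraction(PLAYS_FOR[(player, club)], PLAYS_FOR_TOTAL)
    located = Fraction(IN_UK[club], sum(IN_UK.values()))
    return born * plays * located


henry = answer_probability("Henry", "Arsenal")
drogba = answer_probability("Drogba", "Chelsea")
print(float(henry), float(drogba))
```

Henry's tuple scores roughly ten times higher than Drogba's, mainly because the bornIn France fact has far more witnesses for Henry (200) than for Drogba (30).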
Extensions: Keywords
Problem: not everything is triplified
• Consider witnesses/sources
(provenance meta-facts)
• Allow text predicates with
each triple pattern (à la XQ-FT)
Semantics:
triples match struct. pred.
witnesses match text pred.
European composers who have won the Oscar,
whose music appeared in dramatic western scenes,
and who also wrote classical pieces ?
Select ?p Where {
?p instanceOf Composer .
?p bornIn ?t . ?t inCountry ?c . ?c locatedIn Europe .
?p hasWon ?a .?a Name AcademyAward .
?p contributedTo ?movie [western, gunfight, duel, sunset] .
?p composed ?music [classical, orchestra, cantata, opera] . }
Grouping of
keywords or phrases
boosts expressiveness
French politicians married to Italian singers?
Select ?p1, ?p2 Where {
?p1 instanceOf ?c1 [France, politics] .
?p2 instanceOf ?c2 [Italy, singer] .
?p1 marriedTo ?p2 . }
CS researchers whose advisors worked on the Manhattan project?
Select ?r, ?a Where {
?r instOf researcher [“computer science“] .
?a workedOn ?x [“Manhattan project“] .
?r hasAdvisor ?a . }
Variant with predicates and objects replaced by variables:
Select ?r, ?a Where {
?r ?p1 ?o1 [“computer science“] .
?a ?p2 ?o2 [“Manhattan project“] .
?r ?p3 ?a . }
LMs for Keyword-Augmented Queries
q: Select ?x, ?c Where {
France ml ?x [goalgetter, “top scorer“] .
?x p ?c .
?c in UK
[champion, “cup winner“, double] . }
subqueries qi with keywords w1 … wm
results are still n-tuples of triples ti
LM(qi): P[triple ti | w1 … wm] = ∑k β P[ti | wk] + (1−β) P[ti]
LM(answer fi) analogous
KL(LM(q)|LM(answer fi)) = ∑i KL (LM(qi) | LM(fi))
result ranking prefers (n-tuples of) triples
whose witnesses score high on the subquery keywords
Extensions: Query Relaxation
q(2): … Where { ?x bornIn IC . ?x p ?c . ?c in UK . }
q(4): … ?x p ?c . ?c in UK . }
Example result tuples:
[ Zidane bI F, Zidane p Real, Real in ESP ]
[ Drogba bI IC, Drogba p Chelsea, Chelsea in UK ]
[ Drogba resOf F, Drogba p Chelsea, Chelsea in UK ]
Facts with counts:
f21: Zidane bI F 200
f22: Tidjani bI F 20
f23: Henry bI F 200
f24: Ribery bI F 200
f26: Drogba bI IC 100
f27: Zidane bI ALG 50
f1: Beckham p ManU 200
f7: Zidane p ASCannes 20
f9: Zidane p Real 300
f10: Tidjani p ASCannes 10
f12: Henry p Arsenal 200
f15: Drogba p Chelsea 150
f31: ManU in UK 200
f32: Arsenal in UK 160
f33: Chelsea in UK 140
$LM(q^*) = \alpha \, LM(q) + \alpha_1 \, LM(q^{(1)}) + \alpha_2 \, LM(q^{(2)}) + \dots$
replace e in q by e(i) in q(i):
precompute P := LM(e ?p ?o)
and Q := LM(e(i) ?p ?o)
set $\alpha_i \sim \frac{1}{2} \, (KL(P \mid Q) + KL(Q \mid P))$
replace r in q by r(i) in q(i) → LM(?s r(i) ?o)
replace e in q by ?x in q(i) → LM(?x r ?o)
…
LMs of e, r, ... are probability distributions of triples
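A minimal sketch of the symmetric KL quantity used when an entity e is replaced by a related entity e(i), assuming that LM(e ?p ?o) is estimated from counts of (predicate, object) pairs with additive smoothing. The smoothing constant, the function names, and all counts are assumptions. How the relaxation weight is derived from this quantity is left open, as on the slide.

```python
import math


def lm(counts, vocabulary, eps=1e-3):
    """Turn counts into a smoothed probability distribution over the vocabulary."""
    total = sum(counts.get(v, 0) + eps for v in vocabulary)
    return {v: (counts.get(v, 0) + eps) / total for v in vocabulary}


def kl(p, q):
    """Kullback-Leibler divergence KL(P | Q) over a shared vocabulary."""
    return sum(p[v] * math.log(p[v] / q[v]) for v in p)


def symmetric_kl(counts_p, counts_q):
    """Compute 1/2 * (KL(P | Q) + KL(Q | P)) for two count tables."""
    vocabulary = set(counts_p) | set(counts_q)
    p, q = lm(counts_p, vocabulary), lm(counts_q, vocabulary)
    return 0.5 * (kl(p, q) + kl(q, p))


# Hypothetical counts of (predicate, object) pairs per entity.
france = {("capital", "Paris"): 50, ("locatedIn", "Europe"): 100}
belgium = {("capital", "Brussels"): 30, ("locatedIn", "Europe"): 80}
ivory_coast = {("capital", "Yamoussoukro"): 20, ("locatedIn", "Africa"): 60}

near = symmetric_kl(france, belgium)
far = symmetric_kl(france, ivory_coast)
print(near < far)
```

Entities that share facts have closer language models, so the divergence tells candidate replacements apart before any query is run.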
Extensions: Diversification
q: Select ?p, ?c Where { ?p isa SoccerPlayer . ?p playedFor ?c . }

Top 10 without diversification:
1 Beckham, ManchesterU
2 Beckham, RealMadrid
3 Beckham, LAGalaxy
4 Beckham, ACMilan
5 Zidane, RealMadrid
6 Kaka, RealMadrid
7 Cristiano Ronaldo, RealMadrid
8 Raul, RealMadrid
9 van Nistelrooy, RealMadrid
10 Casillas, RealMadrid
Top 10 with diversification:
1 Beckham, ManchesterU
2 Beckham, RealMadrid
3 Zidane, RealMadrid
4 Kaka, ACMilan
5 Cristiano Ronaldo, ManchesterU
6 Messi, FCBarcelona
7 Henry, Arsenal
8 Ribery, BayernMunich
9 Drogba, Chelsea
10 Luis Figo, Sporting Lisbon
rank results f1 ... fk by ascending
$\lambda \, KL(LM(q) \mid LM(f_i)) - (1-\lambda) \, KL(LM(f_i) \mid LM(\{f_1..f_k\} \setminus \{f_i\}))$
implemented by greedy re-ranking of the fi's in the candidate pool
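A minimal sketch of greedy re-ranking for diversification. It assumes that each candidate comes with a precomputed divergence from the query (lower is better) and a term distribution, and it measures novelty against the results already selected rather than against all other results. The parameter `trade_off`, the floor `eps`, and all numbers are assumptions for illustration.

```python
import math


def kl(p, q, eps=1e-6):
    """KL(P | Q) with a small floor so that unseen terms do not divide by zero."""
    return sum(pv * math.log(pv / max(q.get(t, 0.0), eps)) for t, pv in p.items() if pv > 0)


def mixture(lms):
    """Average several term distributions into one."""
    terms = {t for m in lms for t in m}
    return {t: sum(m.get(t, 0.0) for m in lms) / len(lms) for t in terms}


def diversify(candidates, k, trade_off=0.5):
    """Greedily pick k results.

    candidates: name -> (divergence from the query, term distribution)
    Each step picks the candidate with the lowest value of
    trade_off * divergence - (1 - trade_off) * novelty.
    """
    selected = []
    pool = dict(candidates)
    while pool and len(selected) < k:
        def score(name):
            divergence, lm_f = pool[name]
            if not selected:
                return trade_off * divergence
            chosen = mixture([candidates[s][1] for s in selected])
            return trade_off * divergence - (1 - trade_off) * kl(lm_f, chosen)
        best = min(pool, key=score)
        selected.append(best)
        del pool[best]
    return selected


# Hypothetical candidate pool.
candidates = {
    "Beckham, ManchesterU": (0.10, {"Beckham": 0.5, "ManchesterU": 0.5}),
    "Beckham, RealMadrid": (0.11, {"Beckham": 0.5, "RealMadrid": 0.5}),
    "Beckham, LAGalaxy": (0.12, {"Beckham": 0.5, "LAGalaxy": 0.5}),
    "Zidane, RealMadrid": (0.20, {"Zidane": 0.5, "RealMadrid": 0.5}),
    "Messi, FCBarcelona": (0.30, {"Messi": 0.5, "FCBarcelona": 0.5}),
}
print(diversify(candidates, 3))
print(diversify(candidates, 3, trade_off=1.0))
```

With `trade_off=1.0` the ranking falls back to pure relevance and the same player fills the top positions. Lower values reward results that differ from those already chosen.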
Searching and Ranking – Summary
• Don't re-invent the wheel:
LMs are elegant and expressive means for ranking
consider both data & workload statistics
• Extensions should be conceptually simple:
can capture informativeness, personalization,
relaxation, diversity – all in same framework
• Unified ranking model for complete query language:
still work to do
References for Part III
• SPARQL Query Language for RDF, W3C Recommendation, 15 January 2008, http://www.w3.org/TR/2008/REC-rdf-sparql-query-20080115/
• SPARQL New Features and Rationale, W3C Working Draft, 2 July 2009, http://www.w3.org/TR/2009/WD-sparql-features-20090702/
• Kemafor Anyanwu, Angela Maduko, Amit P. Sheth: SPARQ2L: towards support for subgraph extraction queries in RDF databases. WWW Conference, 2007
• Gaurav Bhalotia, Arvind Hulgeri, Charuta Nakhe, Soumen Chakrabarti, S. Sudarshan: Keyword Searching and Browsing in Databases using BANKS. ICDE, 2002
• Soumen Chakrabarti: Dynamic personalized pagerank in entity-relation graphs. WWW Conference, 2007
• Tao Cheng, Xifeng Yan, Kevin Chen-Chuan Chang: EntityRank: searching entities directly and holistically. VLDB, 2007
• Shady Elbassuoni, Maya Ramanath, Ralf Schenkel, Marcin Sydow, Gerhard Weikum: Language-model-based ranking for queries on RDF-graphs. CIKM, 2009
• Djoerd Hiemstra: Language Models. Encyclopedia of Database Systems, 2009
• Vagelis Hristidis, Heasoo Hwang, Yannis Papakonstantinou: Authority-based keyword search in databases. ACM Transactions on Database Systems 33(1), 2008
• Gjergji Kasneci, Maya Ramanath, Mauro Sozio, Fabian M. Suchanek, Gerhard Weikum: STAR: Steiner-Tree Approximation in Relationship Graphs. ICDE, 2009
• Gjergji Kasneci, Fabian M. Suchanek, Georgiana Ifrim, Maya Ramanath, Gerhard Weikum: NAGA: Searching and Ranking Knowledge. ICDE, 2008
• Mounia Lalmas: XML Retrieval. Morgan & Claypool Publishers, 2009
• Thomas Neumann, Gerhard Weikum: The RDF-3X engine for scalable management of RDF data. VLDB Journal 19(1), 2010
• Zaiqing Nie, Yunxiao Ma, Shuming Shi, Ji-Rong Wen, Wei-Ying Ma: Web object retrieval. WWW Conference, 2007
• Desislava Petkova, W. Bruce Croft: Hierarchical Language Models for Expert Finding in Enterprise Corpora. ICTAI, 2006
• Nicoleta Preda, Gjergji Kasneci, Fabian M. Suchanek, Thomas Neumann, Wenjun Yuan, Gerhard Weikum: Active knowledge: dynamically enriching RDF knowledge bases by web services. SIGMOD Conference, 2010
• Pavel Serdyukov, Djoerd Hiemstra: Modeling Documents as Mixtures of Persons for Expert Finding. ECIR, 2008
• ChengXiang Zhai: Statistical Language Models for Information Retrieval. Morgan & Claypool Publishers, 2008
Outline
• Part I
– What and Why ✔
– Available Knowledge Bases ✔
• Part II
– Extracting Knowledge ✔
• Part III
– Ranking and Searching ✔
• Part IV
– Conclusion and Outlook
But back to the original question...
Will there ever be a famous singer called Elvis again?
?x hasGivenName “Elvis” .
?x type singer .
But back to the original question...
http://mpii.de/yago
We found him!
?x = Elvis_Costello
?singer = wordnet_singer_110599806
?d = 1954-08-25
Can we find out more about this guy?
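A minimal sketch of how such a query can be answered by matching triple patterns against a set of facts. The tiny fact list, the relation name `bornOnDate`, the entity `Elvis_Example` with its type, and the helper `solve` are assumptions for illustration. This is not the YAGO query interface.

```python
# Hypothetical excerpt of a knowledge base as (subject, predicate, object) facts.
FACTS = [
    ("Elvis_Costello", "hasGivenName", "Elvis"),
    ("Elvis_Costello", "type", "wordnet_singer_110599806"),
    ("Elvis_Costello", "bornOnDate", "1954-08-25"),
    ("Elvis_Example", "hasGivenName", "Elvis"),        # hypothetical entity
    ("Elvis_Example", "type", "hypothetical_actor"),   # hypothetical type
]


def solve(patterns, facts, binding=None):
    """Yield all variable bindings that satisfy every triple pattern.

    Terms starting with '?' are variables; all other terms must match exactly.
    """
    binding = binding or {}
    if not patterns:
        yield dict(binding)
        return
    first, rest = patterns[0], patterns[1:]
    for fact in facts:
        new = dict(binding)
        for term, value in zip(first, fact):
            if term.startswith("?"):
                if new.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            # All three positions matched, continue with the remaining patterns.
            yield from solve(rest, facts, new)


query = [
    ("?x", "hasGivenName", "Elvis"),
    ("?x", "type", "wordnet_singer_110599806"),
    ("?x", "bornOnDate", "?d"),
]
answers = list(solve(query, FACTS))
print(answers)
```

Each pattern narrows the bindings of the previous ones, so only entities that satisfy all patterns survive.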
But back to the original question...
http://mpii.de/yago
Alright, and even more?
Linking Open Data: Goal
[Figure: two sources, Costellopedia and YAGO, each hold a fact about Elvis Costello ("plays guitar", "born 1954")]
Can we combine knowledge from different sources?
Linking Open Data: URIs
[Figure: http://dbpedia.org/resource states "plays guitar"; http://costello.org states "born 1954"]
1. Define a name space
http://dbpedia.org/resource, http://costello.org
2. Define entity names in that name space
http://dbpedia.org/resource/ElvisCostello, http://costello.org/Elvis
Every entity has a worldwide unique identifier
(a Uniform Resource Identifier, URI).
There is a W3C standard for that. [W3C URI]
Linking Open Data: Cool URIs
[Figure: http://dbpedia.org/resource states "plays guitar"; http://costello.org states "born 1954"]
1. Define a name space
http://dbpedia.org/resource, http://costello.org
2. Define entity names in that name space
http://dbpedia.org/resource/ElvisCostello, http://costello.org/Elvis
3. Make them accessible online
[Figure: a client requests http://costello.org/Elvis and the server answers with "born 1954"]
There is a W3C description for that [W3C CoolURI]
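A minimal sketch of the client side of step 3, assuming the server supports HTTP content negotiation, which is one of the mechanisms described in the Cool URIs note. The request is only built, not sent, because http://costello.org/Elvis is the illustrative URI from the slide. The helper name `rdf_request` and the chosen media type are assumptions.

```python
from urllib.request import Request


def rdf_request(uri, media_type="application/rdf+xml"):
    """Build (but do not send) an HTTP GET that asks for RDF instead of HTML."""
    return Request(uri, headers={"Accept": media_type}, method="GET")


request = rdf_request("http://costello.org/Elvis")
print(request.get_method(), request.full_url)
print("Accept:", request.get_header("Accept"))
```

The same URI can thus name the entity and lead a client to machine-readable facts about it; a browser asking for HTML would receive a human-readable page instead.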
Linking Open Data: Links
[Figure: http://dbpedia.org/resource states "plays guitar"; http://costello.org states "born 1954"]
1. Define a name space
http://dbpedia.org/resource, http://costello.org
2. Define entity names in that name space
http://dbpedia.org/resource/ElvisCostello, http://costello.org/Elvis
3. Make them accessible online
4. Define equivalence links
This is an entity resolution problem.
Use
• similar identifiers
• similar labels (names)
• keys (e.g., the ISBN)
• common properties
Goal of the W3C group [Bizer JSWIS 2009]
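A minimal sketch of step 4, assuming each source exposes a label and optionally a key per entity. It proposes an equivalence link when two entities share a key or have very similar labels. The labels, the URI http://dbpedia.org/resource/ElvisPresley, the threshold, and the helper names are assumptions; `owl:sameAs` is the property commonly used for such links.

```python
from difflib import SequenceMatcher


def label_similarity(a, b):
    """Similarity of two labels in [0, 1], ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def same_as_links(source_a, source_b, threshold=0.8):
    """Propose equivalence links between two sources.

    Each source maps a URI to {"label": str, "key": optional str}.
    """
    links = []
    for uri_a, a in source_a.items():
        for uri_b, b in source_b.items():
            shared_key = a.get("key") is not None and a.get("key") == b.get("key")
            if shared_key or label_similarity(a["label"], b["label"]) >= threshold:
                links.append((uri_a, "owl:sameAs", uri_b))
    return links


# Hypothetical labels for the entities of the two name spaces.
dbpedia = {
    "http://dbpedia.org/resource/ElvisCostello": {"label": "Elvis Costello"},
    "http://dbpedia.org/resource/ElvisPresley": {"label": "Elvis Presley"},
}
costello = {
    "http://costello.org/Elvis": {"label": "Elvis Costello"},
}
links = same_as_links(dbpedia, costello)
print(links)
```

Comparing every pair is quadratic in the number of entities, so a real linker would first group candidates, for example by key or by a shared label token.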
Linking Open Data: Status so far
Currently (2010)
• 200 ontologies
• 25 billion triples
• 400m links
http://richard.cyganiak.de/2007/10/lod/imagemap.html
Querying Semantic Data
Sindice is an index for the Semantic Web developed at DERI in Galway, Ireland.
http://sindice.com
Sindice exploits
• RDF dumps available on the Web
• RDF information embedded into HTML pages
• RDF data available by cool URIs
• inter-ontology links
[Tummarello ISWC 2007]
Querying Semantic Data
... far from perfect... but far from useless...
Conclusion
• We have seen the knowledge representation model of ontologies, RDF
In a nutshell, RDF is a kind of distributed entity-relationship model
• We have seen numerous existing knowledge bases
...manually constructed (Cyc and WordNet) and automatically constructed
(YAGO, DBpedia, Freebase, TrueKnowledge etc.)
• We have seen techniques for creating such knowledge bases
(Pattern-based extraction and reasoning-based extraction, with uncertainty)
• We have seen techniques for querying and ranking the knowledge
(with SPARQL and language models)
• We have seen that many knowledge bases already exist
and that there is ongoing work to interlink them
• We have seen that there is indeed a promising singer called Elvis
The End
The slides are available at
http://www.mpi-inf.mpg.de/yago-naga/CIKM10-tutorial/
Feel free to contact us with further questions
Hady Lauw
Institute for Infocomm
Research, Singapore
http://hadylauw.com
Fabian M. Suchanek
INRIA Saclay, Paris
http://suchanek.name
Martin Theobald
Max-Planck Institute
for Informatics, Saarbrücken
http://mpii.de/~mtb
Ralf Schenkel
Saarland University
http://people.mmci.uni-saarland.de/~schenkel/
References for Part IV
• [W3C URI] W3C: “Architecture of the World Wide Web, Volume One”
Recommendation 15 December 2004, http://www.w3.org/TR/webarch/
• [W3C CoolURI] W3C: “Cool URIs for the Semantic Web”
Interest Group Note 03 December 2008, http://www.w3.org/TR/cooluris/
• [Bizer JSWIS 2009] C.Bizer, T.Heath, T.Berners-Lee: “Linked data – the story so far”
International Journal on Semantic Web and Information Systems, 5(3):1–22, 2009.
• [Tummarello ISWC 2007] G. Tummarello, R. Delbru, E. Oren:
“Sindice.com: Weaving the Open Linked Data”
ISWC/ASWC 2007: