Slide - Max Planck Institute for Informatics

For a Few Triples More
Gerhard Weikum
Max Planck Institute for Informatics
http://www.mpi-inf.mpg.de/~weikum/
Acknowledgements
LOD: RDF Triples on the Web
http://richard.cyganiak.de/2007/10/lod/lod-datasets_2011-09-19_colored.png
LOD: Linked RDF Triples on the Web
yago/wordnet: Artist109812338
yago/wordnet:Actor109765278
yago/wikicategory:ItalianComposer
imdb.com/name/nm0910607/
dbpedia.org/resource/Ennio_Morricone
imdb.com/title/tt0361748/
dbpedia.org/resource/Rome
rdf.freebase.com/ns/en.rome
data.nytimes.com/51688803696189142301
geonames.org/3169070/roma
N 41° 54' 10'' E 12° 29' 2''
LOD: Linked RDF Triples on the Web
• Size:
30 Billion triples
• Linkage:
500 Million links
• Dynamics:
encyclopedic
reference data
The Good, the Bad, and the Ugly
For a Few Triples More
30 billion triples – still not enough ?
No! Consider:
1. Dynamics
2. Linkage
3. Ubiquity
Outline

Explain Title
Why More Triples:
Dynamics, Linkage, Ubiquity
Linkage & Ubiquity:
Named-Entity Disambiguation
Web-Scale Linkage
Wrap-up
1. Dynamics: in a Fast Paced World
Anecdotal examples:
• <… rdf:about="http://dbpedia.org/resource/Steve_Jobs">
<dbpprop:occupation …> Chairman and CEO, Apple Inc.
(still there)
• <… "http://... Ellen_Johnson_Sirleaf"> <dcterms:subject rdf:resource=
"http://... Category:Nobel_Peace_Prize_laureates"/>
(not there)
• <… "http://... Scarlett_Johansson">
<dbpprop:spouse rdf:resource="http://... Ryan_Reynolds"/>
(never there)
• <… "http://... Clint_Eastwood">
<dbpprop:spouse …>Dina Ruiz 1 child
<… rdf:about="http://... Clint_Eastwood">
<dbpprop:spouse …>Maggie Johnson 2 children
(both there)
• <… rdf:about="http://... Paul_McCartney">
<dbpprop:spouse rdf:resource="http://… Linda_McCartney"/>
<… rdf:about="http://... Paul_McCartney">
<dbpprop:spouse rdf:resource="http://… Heather_Mills"/>
<… rdf:about="http://... Paul_McCartney">
<dbpprop:spouse …>Nancy Shevell
(none there)
1. Dynamics: As Fresh As Possible
http://data.gov.uk/openspending
1. Dynamics: Updates in the Web of Data
http://sindice.com
1. Dynamics: Closer to the Sources
RDF Data on the Web produced by:
• Maintained, but mostly "static" reference collections (e.g. geo): mostly fresh
• Periodic exports from curated databases (e.g. gov, bio, music): often stale
• Periodic extraction from Web sources (e.g. encyclopedia, news): often stale
• Tags in social streams and advertisements: very noisy
→ Get closer to the data origin:
• RDF engines (SPARQL APIs) for production DBs
• view maintenance by pub-sub push (feeds)
• Deep-Web crawl/query for surfacing of RDF data
1. Dynamics: Nothing Lasts Forever
Even old and „static“ data often needs temporal scope
(timepoint, timespan) for proper interpretation
1: PaulMcCartney hasSpouse HeatherMills [11-Jun-2002, 2008]
2: PaulMcCartney hasSpouse NancyShevell [Oct-2011, now]
3: PaulMcCartney gotHonor SirPaul [1999]
1 validFrom 11-Jun-2002
1 validUntil 2008
2 validFrom Oct-2011
3 happenedOn 1999
with reification, or use quads (quints, pints, etc.)
Need to add temporal properties to RDF and SPARQL
Select ?w Where {
?id1: PM gotHonor SirPaul . ?id1 happenedOn ?t .
?id2: PM hasSpouse ?w . ?id2 validFrom ?b . ?id2 validUntil ?e .
?t containedIn [?b,?e] . }
but: principled, expressive, easy-to-use
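The temporal scoping above can be sketched in plain Python. This is only an illustration, not a proposal for RDF or SPARQL syntax: the class name TemporalFact, the helper objects_at, and the reduction of dates to whole years are assumptions made for brevity. The three facts are the ones listed on the slide.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TemporalFact:
    """A triple plus a validity interval (hypothetical helper type)."""
    subject: str
    predicate: str
    obj: str
    valid_from: int              # year, inclusive
    valid_until: Optional[int]   # year, inclusive; None stands for "now"

    def holds_at(self, year: int) -> bool:
        if year < self.valid_from:
            return False
        return self.valid_until is None or year <= self.valid_until


# Facts 1-3 from the slide, with dates coarsened to years (an assumption).
FACTS = [
    TemporalFact("PaulMcCartney", "hasSpouse", "HeatherMills", 2002, 2008),
    TemporalFact("PaulMcCartney", "hasSpouse", "NancyShevell", 2011, None),
    TemporalFact("PaulMcCartney", "gotHonor", "SirPaul", 1999, 1999),
]


def objects_at(subject: str, predicate: str, year: int) -> List[str]:
    """Return all objects of (subject, predicate) whose interval contains year."""
    return [f.obj for f in FACTS
            if f.subject == subject and f.predicate == predicate
            and f.holds_at(year)]


print(objects_at("PaulMcCartney", "hasSpouse", 2005))
print(objects_at("PaulMcCartney", "hasSpouse", 1999))
```

With only these three facts, asking for the spouse in the year of the honor (1999) returns an empty list, because neither marriage interval contains that year.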
1. Dynamics: Nothing Lasts Forever
http://www.mpi-inf.mpg.de/yago-naga/yago/
2. Linkage: sameAs Links
dbpedia.org/resource/Linda_Louise_Eastman owl:sameAs yago-knowledge.org/resource/Linda_McCartney
www.freebase.com/view/en/man_with_no_name owl:sameAs dbpedia.org/page/Clint_Eastwood
data.linkedmdb.org/page/film/38166 owl:sameAs de.dbpedia.org/page/Zwei_glorreiche_Halunken
LOD statistics: 30 billion triples, 500 million links
330 million links are trivial (ID-based): within pub, within bio
Tens of millions of links are near-trivial: DBpedia, Freebase, Yago, GeoNames
sameas.org: 17 million bundles for 50 million URIs
data.nytimes.com: 5000 people, 2000 locations
Way too few for a world with:
1 million people, 10 million locations, tens of millions of species,
6 million books, 2 million movies, 10 million songs, etc.
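Since owl:sameAs is symmetric and transitive, a set of links partitions URIs into equivalence classes ("bundles"). A minimal union-find sketch of that grouping follows. It is not how sameas.org is implemented; the function name same_as_bundles and the example.org URI are hypothetical, added only to show transitive merging.

```python
from collections import defaultdict


def same_as_bundles(links):
    """Group URIs into equivalence classes induced by sameAs links."""
    parent = {}

    def find(uri):
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]   # path halving
            uri = parent[uri]
        return uri

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    bundles = defaultdict(set)
    for uri in list(parent):
        bundles[find(uri)].add(uri)
    return list(bundles.values())


LINKS = [
    ("dbpedia.org/resource/Linda_Louise_Eastman",
     "yago-knowledge.org/resource/Linda_McCartney"),
    ("data.linkedmdb.org/page/film/38166",
     "de.dbpedia.org/page/Zwei_glorreiche_Halunken"),
    # Hypothetical third source, to show that bundles merge transitively.
    ("yago-knowledge.org/resource/Linda_McCartney",
     "example.org/people/linda-mccartney"),
]

for bundle in same_as_bundles(LINKS):
    print(sorted(bundle))
```

One wrong link merges two bundles entirely, which is why link accuracy matters as much as coverage.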
2. Linkage: sameAs Coverage
2. Linkage: sameAs Accuracy
http://sameas.org
3. Ubiquity: Web-of-Data & Web-of-Contents
3. Ubiquity: Web of Data & Other Contents
RDF data and Web contents need to be interconnected
RDFa & microformats provide the mechanism
How do we get the Web RDF-annotated (at large scale)?
Largely automated, but allow humans in the loop
3. Ubiquity: Web of Data & Other Contents
<html … May 2, 2011
May 2, 2011
Maestro Morricone will perform
on the stage of the Smetana Hall
to conduct the Czech National
Symphony Orchestra and Choir.
The concert will feature both
Classical compositions and
soundtracks such as
the Ecstasy of Gold.
In programme two concerts for
July 14th and 15th.
<div typeof=event:music>
<span id="Maestro_Morricone">
Maestro Morricone
<a rel="sameAs"
resource="dbpedia…/Ennio_Morricone "/>
</span>
…
<span property = "event:location" >
Smetana Hall </span>
…
<span property="rdf:type"
resource="yago:performance">
The concert </span> will feature
…
<span property="event:date"
content="14-07-2011"></span>
July 1
</div>
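To make the annotation side concrete, here is a small sketch that pulls the RDFa-style attributes out of a snippet like the one above, using only Python's standard html.parser. It is not an RDFa processor (no prefix resolution, no subject chaining); the class name AnnotationCollector is hypothetical, and the elided dbpedia URL is kept elided as on the slide.

```python
from html.parser import HTMLParser

SNIPPET = """
<div typeof="event:music">
  <span id="Maestro_Morricone">Maestro Morricone
    <a rel="sameAs" resource="dbpedia…/Ennio_Morricone"/>
  </span>
  <span property="event:location">Smetana Hall</span>
  <span property="rdf:type" resource="yago:performance">The concert</span>
  <span property="event:date" content="14-07-2011"></span>
</div>
"""


class AnnotationCollector(HTMLParser):
    """Collects (attribute, value, target) for a few RDFa-style attributes."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        for name in ("typeof", "rel", "property"):
            if name in attributes:
                # The target is the linked resource or literal content, if any.
                target = attributes.get("resource") or attributes.get("content")
                self.found.append((name, attributes[name], target))


collector = AnnotationCollector()
collector.feed(SNIPPET)
for annotation in collector.found:
    print(annotation)
```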
Why a Few Triples More?
Linked Data is great!
But still in its infancy
Need to add triples to capture further issues:
• Dynamics:
Where is the live data?
• Linkage:
Where are the links in Linked Data?
• Ubiquity:
Where are the paths between the Web-of-Data and the Web?
Outline

Explain Title

Why More Triples:
Dynamics, Linkage, Ubiquity
Linkage & Ubiquity:
Named-Entity Disambiguation
Web-Scale Linkage
Wrap-up
Entities on the Web
http://sig.ma
Named-Entity Disambiguation (NED)
Harry fought with you know who. He defeats the dark lord.
Candidate entities: Dirty Harry, Harry Potter, Prince Harry of England, The Who (band), Lord Voldemort
Three NLP tasks:
1) named-entity detection: segment & label by HMM or CRF
(e.g. Stanford NER tagger)
2) co-reference resolution: link to preceding NP
(trained classifier over linguistic features)
3) named-entity disambiguation:
map each mention (name) to canonical entity (entry in KB)
Mentions, Meanings, Mappings
Mentions (surface names) occur in the text:
Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films.
KB:
Sergio means Sergio_Leone
Sergio means Serge_Gainsbourg
Ennio means Ennio_Antonelli
Ennio means Ennio_Morricone
Eli means Eli_(bible)
Eli means ExtremeLightInfrastructure
Eli means Eli_Wallach
Ecstasy means Ecstasy_(drug)
Ecstasy means Ecstasy_of_Gold
trilogy means Star_Wars_Trilogy
trilogy means Lord_of_the_Rings
trilogy means Dollars_Trilogy
Entities (meanings): Eli (bible), Eli Wallach, Ecstasy (drug), Ecstasy of Gold, Benny Goodman, Benny Andersson, Star Wars Trilogy, Lord of the Rings, Dollars Trilogy, ?
Mention-Entity Graph
weighted undirected graph with two types of nodes
Text with mentions: Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films.
Candidate entities: Eli (bible), Eli Wallach, Ecstasy (drug), Ecstasy of Gold, Star Wars, Lord of the Rings, Dollars Trilogy
Context representation: bag-of-words or language model (words, bigrams, phrases)
Edge weights from KB+Stats:
Popularity (m,e):
• freq(e|m)
• length(e)
• #links(e)
Similarity (m,e):
• cos/Dice/KL (context(m), context(e))
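The similarity weight can be illustrated with a bag-of-words cosine between the context of a mention and the context of each candidate entity. This is a minimal sketch only: the whitespace tokenization and the two entity context strings are made-up assumptions, not data from the KB or the exact measure used in any system.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercased whitespace tokens with counts (a deliberately crude tokenizer)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


mention_context = bag_of_words("eli role ecstasy scene graveyard western films")

# Hypothetical entity contexts, for illustration only.
entity_contexts = {
    "Eli_Wallach": bag_of_words(
        "american film actor western films the good the bad and the ugly"),
    "Eli_(bible)": bag_of_words(
        "high priest of shiloh in the books of samuel"),
}

for entity, context in entity_contexts.items():
    print(entity, round(cosine(mention_context, context), 3))
```

Dice or KL divergence could be swapped in for the cosine function without changing the rest of the sketch.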
Mention-Entity Graph
weighted undirected graph with two types of nodes
Text with mentions: Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films.
Candidate entities: Eli (bible), Eli Wallach, Ecstasy (drug), Ecstasy of Gold, Star Wars, Lord of the Rings, Dollars Trilogy
Goal: joint mapping of all mentions
Edge weights from KB+Stats:
Popularity (m,e):
• freq(e|m)
• length(e)
• #links(e)
Similarity (m,e):
• cos/Dice/KL (context(m), context(e))
Mention-Entity Graph
weighted undirected graph with two types of nodes
Text with mentions: Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films.
Candidate entities: Eli (bible), Eli Wallach, Ecstasy (drug), Ecstasy of Gold, Star Wars, Lord of the Rings, Dollars Trilogy
Edge weights from KB+Stats:
Popularity (m,e):
• freq(e|m)
• length(e)
• #links(e)
Similarity (m,e):
• cos/Dice/KL (context(m), context(e))
Coherence (e,e'):
• dist(types)
• overlap(links)
• overlap(anchor words)
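The overlap(links) coherence feature can be sketched as a set overlap between the pages linking to two entities. The slide does not say which overlap measure is meant; Jaccard is used here purely as an assumption, and the link sets are invented for illustration.

```python
def link_overlap(links_a, links_b):
    """Jaccard overlap of two link sets (one possible reading of overlap(links))."""
    union = links_a | links_b
    return len(links_a & links_b) / len(union) if union else 0.0


# Hypothetical link sets, not taken from Wikipedia.
links = {
    "Ecstasy_of_Gold": {"Ennio_Morricone", "Metallica",
                        "The_Good,_the_Bad_and_the_Ugly"},
    "Ecstasy_(drug)": {"MDMA", "Rave"},
    "Dollars_Trilogy": {"Sergio_Leone", "Ennio_Morricone",
                        "The_Good,_the_Bad_and_the_Ugly",
                        "For_a_Few_Dollars_More"},
}

print(link_overlap(links["Ecstasy_of_Gold"], links["Dollars_Trilogy"]))
print(link_overlap(links["Ecstasy_(drug)"], links["Dollars_Trilogy"]))
```

The same function applies unchanged to sets of anchor words, which gives the third coherence feature.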
Mention-Entity Graph
weighted undirected graph with two types of nodes
Text with mentions: Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films.
Candidate entities: Eli (bible), Eli Wallach, Ecstasy (drug), Ecstasy of Gold, Star Wars, Lord of the Rings, Dollars Trilogy
Types attached to entities in the figure: American Jews, film actors, artists, Academy Award winners, Metallica songs, Ennio Morricone songs, artifacts, soundtrack music, spaghetti westerns, film trilogies, movies
Edge weights from KB+Stats:
Popularity (m,e):
• freq(e|m)
• length(e)
• #links(e)
Similarity (m,e):
• cos/Dice/KL (context(m), context(e))
Coherence (e,e'):
• dist(types)
• overlap(links)
• overlap(anchor words)
Mention-Entity Graph
weighted undirected graph with two types of nodes
Text with mentions: Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films.
Candidate entities: Eli (bible), Eli Wallach, Ecstasy (drug), Ecstasy of Gold, Star Wars, Lord of the Rings, Dollars Trilogy
Links attached to entities in the figure (truncated as in the source):
http://.../wiki/Dollars_Trilogy
http://.../wiki/The_Good,_the_Bad, _
http://.../wiki/Clint_Eastwood
http://.../wiki/Honorary_Academy_A
http://.../wiki/The_Good,_the_Bad,_t
http://.../wiki/Metallica
http://.../wiki/Bellagio_(casino)
http://.../wiki/Ennio_Morricone
http://.../wiki/Sergio_Leone
http://.../wiki/The_Good,_the_Bad,_
http://.../wiki/For_a_Few_Dollars_M
http://.../wiki/Ennio_Morricone
Edge weights from KB+Stats:
Popularity (m,e):
• freq(e|m)
• length(e)
• #links(e)
Similarity (m,e):
• cos/Dice/KL (context(m), context(e))
Coherence (e,e'):
• dist(types)
• overlap(links)
• overlap(anchor words)
Mention-Entity Graph
weighted undirected graph with two types of nodes
Text with mentions: Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films.
Candidate entities: Eli (bible), Eli Wallach, Ecstasy (drug), Ecstasy of Gold, Star Wars, Lord of the Rings, Dollars Trilogy
Anchor words attached to entities in the figure:
The Magnificent Seven
The Good, the Bad, and the Ugly
Clint Eastwood
University of Texas at Austin
Metallica on Morricone tribute
Bellagio water fountain show
Yo-Yo Ma
Ennio Morricone composition
For a Few Dollars More
The Good, the Bad, and the Ugly
Man with No Name trilogy
soundtrack by Ennio Morricone
Edge weights from KB+Stats:
Popularity (m,e):
• freq(e|m)
• length(e)
• #links(e)
Similarity (m,e):
• cos/Dice/KL (context(m), context(e))
Coherence (e,e'):
• dist(types)
• overlap(links)
• overlap(anchor words)
Joint Mapping
[Figure: mention-entity graph with weighted edges]
• Build mention-entity graph or joint-inference factor graph
from knowledge and statistics in KB
• Compute high-likelihood mapping (ML or MAP) or
dense subgraph such that:
each m is connected to exactly one e (or at most one e)
Coherence Graph Algorithm
[J. Hoffart et al.: EMNLP‘11]
[Figure: weighted mention-entity graph]
• Compute dense subgraph to
maximize min weighted degree among entity nodes
such that:
each m is connected to exactly one e (or at most one e)
• Greedy approximation:
iteratively remove weakest entity and its edges
• Keep alternative solutions, then use local/randomized search
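The greedy step can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm of the cited paper: it runs the removal to the end instead of keeping the intermediate subgraph with the best minimum weighted degree, it omits the local/randomized search, and all weights are invented. The function name greedy_disambiguate is hypothetical.

```python
def greedy_disambiguate(me_weights, ee_weights):
    """Repeatedly drop the entity node with the smallest weighted degree.

    me_weights: {(mention, entity): weight}
    ee_weights: {frozenset({entity1, entity2}): weight}
    """
    candidates = {}
    for mention, entity in me_weights:
        candidates.setdefault(mention, set()).add(entity)
    alive = set().union(*candidates.values())

    def weighted_degree(entity):
        degree = sum(w for (_, e), w in me_weights.items() if e == entity)
        degree += sum(w for pair, w in ee_weights.items()
                      if entity in pair and pair <= alive)
        return degree

    while True:
        # An entity may go only if no mention would lose its last candidate.
        removable = [e for e in alive
                     if all(len(c) > 1 for c in candidates.values() if e in c)]
        if not removable:
            break
        weakest = min(removable, key=lambda e: (weighted_degree(e), e))
        alive.discard(weakest)
        for remaining in candidates.values():
            remaining.discard(weakest)

    # If a mention still has several candidates, take the best-connected one.
    return {m: max(c, key=lambda e: (me_weights[(m, e)], e))
            for m, c in candidates.items()}


# Invented weights: popularity favors the wrong entities, coherence the right ones.
ME = {("Eli", "Eli_(bible)"): 50, ("Eli", "Eli_Wallach"): 30,
      ("Ecstasy", "Ecstasy_(drug)"): 50, ("Ecstasy", "Ecstasy_of_Gold"): 30,
      ("trilogy", "Star_Wars_Trilogy"): 40, ("trilogy", "Dollars_Trilogy"): 30}
EE = {frozenset({"Eli_Wallach", "Ecstasy_of_Gold"}): 90,
      frozenset({"Eli_Wallach", "Dollars_Trilogy"}): 100,
      frozenset({"Ecstasy_of_Gold", "Dollars_Trilogy"}): 80}

print(greedy_disambiguate(ME, EE))
```

In this toy input each mention's most popular candidate has no coherence edges, so it is removed first and the mutually coherent entities survive.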
Coherence Graph Algorithm
[J. Hoffart et al.: EMNLP‘11]
[Figure: weighted mention-entity graph, continued]
• Compute dense subgraph to
maximize min weighted degree among entity nodes
such that:
each m is connected to exactly one e (or at most one e)
• Greedy approximation:
iteratively remove weakest entity and its edges
• Keep alternative solutions, then use local/randomized search
Coherence Graph Algorithm
[J. Hoffart et al.: EMNLP‘11]
[Figure: weighted mention-entity graph, continued]
• Compute dense subgraph to
maximize min weighted degree among entity nodes
such that:
each m is connected to exactly one e (or at most one e)
• Greedy approximation:
iteratively remove weakest entity and its edges
• Keep alternative solutions, then use local/randomized search
Coherence Graph Algorithm
[J. Hoffart et al.: EMNLP‘11]
[Figure: weighted mention-entity graph, continued]
• Compute dense subgraph to
maximize min weighted degree among entity nodes
such that:
each m is connected to exactly one e (or at most one e)
• Greedy approximation:
iteratively remove weakest entity and its edges
• Keep alternative solutions, then use local/randomized search
Named-Entity Disambiguation:
State-of-the-Art
Literature:
• Razvan Bunescu, Marius Pasca: EACL 2006
• Silviu Cucerzan: EMNLP 2007
• David Milne, Ian Witten: CIKM 2008
• S. Kulkarni, A. Singh, G. Ramakrishnan, S. Chakrabarti: KDD 2009
• G. Limaye, S. Sarawagi, S. Chakrabarti: VLDB 2010
• Paolo Ferragina, Ugo Scaiella: CIKM 2010
• Mark Dredze et al.: COLING 2010
• Johannes Hoffart et al.: EMNLP 2011
etc. etc.
Online tools:
https://d5gate.ag5.mpi-sb.mpg.de/webaida/
http://tagme.di.unipi.it/
http://spotlight.dbpedia.org/demo/index.html
http://viewer.opencalais.com/
http://wikipedia-miner.cms.waikato.ac.nz/demos/annotate/
etc.
NED: Experimental Evaluation
Benchmark:
• Extended CoNLL 2003 dataset: 1400 newswire articles
• originally annotated with mention markup (NER),
now with NED mappings to Yago and Freebase
• difficult texts:
… Australia beats India … → Australian_Cricket_Team
… White House talks to Kremlin … → President_of_the_USA
… EDS made a contract with … → HP_Enterprise_Services
Results:
Best: AIDA method with prior+sim+coh + robustness test
82% precision @100% recall, 87% mean average precision
Comparison to other methods, see paper
J. Hoffart et al.: Robust Disambiguation of Named Entities in Text, EMNLP 2011
http://www.mpi-inf.mpg.de/yago-naga/aida/
AIDA: Accurate Online Disambiguation
http://www.mpi-inf.mpg.de/yago-naga/aida/
Interesting Research Issues
• More efficient graph algorithms (multicore, etc.)
• Allow mentions of unknown entities, mapped to null
• Leverage deep-parsing structures,
leverage semantic types
• Short and difficult texts:
• tweets, headlines, etc.
• fictional texts: novels, song lyrics, etc.
• incoherent texts
• Disambiguation beyond entity names:
• coreferences: pronouns, paraphrases, etc.
• common nouns, verbal phrases (general WSD)
Why Named Entity Disambiguation is Key
• Linked data is best if it has many good links
• New & rich contents mostly in traditional Web
• Create sameAs links in (X)HTML contents, via RDFa
• Links for named entities give best mileage/effort
• Methods & tools greatly advanced & gradually maturing
• Keep human in the loop, embed NED in authoring tools
Outline

Explain Title

Why More Triples:
Dynamics, Linkage, Ubiquity

Linkage & Ubiquity:
Named-Entity Disambiguation
Web-Scale Linkage
Wrap-up
Variants of NED at Web Scale
Tools can map short text onto entities in a few seconds
• How to run this on a big batch of 1 million input texts?
→ partition inputs across distributed machines,
organize dictionary appropriately, …
→ exploit cross-document contexts
• How to deal with inputs from different time epochs?
→ consider time-dependent contexts,
map to entities of proper epoch
(e.g. harvested from Wikipedia history)
• How to handle Web-scale inputs (100 million pages)
restricted to a set of interesting entities?
(e.g. tracking politicians and companies)
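The first point, partitioning a batch of inputs across machines, could look like the sketch below. It is an assumption-laden illustration: the function name partition and the document identifiers are hypothetical, and a stable CRC32 hash is chosen only so that the same document always lands on the same worker across runs.

```python
import zlib


def partition(doc_ids, num_workers):
    """Assign each document id to one of num_workers buckets by a stable hash."""
    buckets = [[] for _ in range(num_workers)]
    for doc_id in doc_ids:
        index = zlib.crc32(doc_id.encode("utf-8")) % num_workers
        buckets[index].append(doc_id)
    return buckets


# Hypothetical document identifiers.
docs = [f"doc-{i:04d}" for i in range(1000)]
buckets = partition(docs, 8)
print([len(bucket) for bucket in buckets])
```

Hashing by document keeps each text on one machine; exploiting cross-document contexts would instead require grouping related documents, which this sketch does not attempt.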
Linked RDF Triples on the Web
yago/wordnet: Artist109812338
yago/wordnet:Actor109765278
yago/wikicategory:ItalianComposer
imdb.com/name/nm0910607/
dbpedia.org/resource/Ennio_Morricone
imdb.com/title/tt0361748/
dbpedia.org/resource/Rome
rdf.freebase.com/ns/en.rome_ny
?
?
data.nytimes.com/51688803696189142301
referential data quality:
automatic, dynamic,
high coverage !
?
geonames.org/5134301/city_of_rome
N 43° 12' 46'' W 75° 27' 20''
Outline

Explain Title

Why More Triples:
Dynamics, Linkage, Ubiquity

Linkage & Ubiquity:
Named-Entity Disambiguation

Web-Scale Linkage
Wrap-up
Summary
Linked Data is great!
But it needs more triples to capture:
• Dynamics:
(Deep-Web) sources → feeds, pub-sub, … ?
→ fresh & versioned triples
• Linkage:
LOD → entity mapping → user → community
• Ubiquity:
RDFa → entity disambiguation → authoring
Outlook
For a Few Triples More
Challenge 1:
generate high-quality sameAs links
in RDFa & across all LOD sources
For a Few Triples Less
Challenge 2:
add efficient top-k ranking
to queries over RDF-in-context
Thank You !