Knowledge Consolidation

Outline
1. Introduction
2. Harvesting Classes
3. Harvesting Facts
4. Common Sense Knowledge
5. Knowledge Consolidation
   • in YAGO
   • in NELL
   • in Google
   • KB Alignment
   • Linked Data
6. Web Content Analytics
7. Wrap-Up
Goal: Combine several extractors

[Figure: several extractors run over text, tables, and other documents; one extractor
yields is(Elvis, alive), another is(Elvis, dead), a third is(Elvis, alive) again.
The consolidation step has to answer: is(Elvis, alive)?]
YAGO combines 170 extractors

[Figure: extractor pipeline. The Infobox Extractor produces candidate facts such as
JimGray bornIn "January 12, 1944" and JimGray bornIn SanFrancisco; the TypeChecker
keeps only JimGray bornIn SanFrancisco; a MultilingualMerger then combines results
across language editions.]
YAGO combines 170 extractors
• Type checking (sketch below)
• Type coherence checking
• Translation
• Learning of foreign language attributes
• Deduplication
• Horn rule inference
• Functional constraint checking (simple preference over sources)
=> 10 languages, precision of 95%
http://yago-knowledge.org
[Mahdisoltani CIDR 2015]
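A minimal sketch of the type-checking step from the list above, assuming each relation has a known range type and each entity has known types; the relation signatures, entities, and facts below are made up for illustration and are not YAGO's actual code:

    # Sketch: keep only candidate facts whose object matches the relation's range type.
    # Toy data only; YAGO's type checker works over its full type hierarchy.
    RANGE = {"bornIn": "city", "marriedTo": "person"}   # expected object type per relation
    TYPES = {
        "SanFrancisco": {"city"},
        "January 12, 1944": {"date"},
        "Priscilla": {"person"},
    }

    candidates = [
        ("JimGray", "bornIn", "SanFrancisco"),
        ("JimGray", "bornIn", "January 12, 1944"),   # wrong type: a date, not a city
        ("Elvis", "marriedTo", "Priscilla"),
    ]

    def type_check(fact):
        subj, rel, obj = fact
        return RANGE[rel] in TYPES.get(obj, set())

    accepted = [f for f in candidates if type_check(f)]
    print(accepted)   # the bornIn-date candidate is filtered out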
Outline
1. Introduction
2. Harvesting Classes
3. Harvesting Facts
4. Common Sense Knowledge
5. Knowledge Consolidation
   • in YAGO √
   • in NELL
   • in Google
   • KB Alignment
   • Linked Data
6. Web Content Analytics
7. Wrap-Up
NELL couples different learners

[Figure: starting from an initial ontology, NELL couples several components:
a Table Extractor (e.g., Krzewski: Blue Angels, Miller: Red Angels),
a Natural Language Pattern Extractor ("Krzewski coaches the Blue Devils."),
mutual-exclusion constraints (sports coach != scientist),
and a Type Check ("If I coach, am I a coach?").]

[Carlson et al. 2010 and follow-ups]
http://rtw.ml.cmu.edu/rtw/
NELL couples different learners

[Figure: the same coupled architecture as on the previous slide, starting from the
initial ontology, with mutual exclusion (sports coach != scientist) and the type
check ("If I coach, am I a coach?").]

Different learners benefit from each other:
• table extraction
• text extraction (e.g., the pattern "Krzewski coaches the Blue Devils.")
• path ranking (rule learning)
• morphological features ("…ism" is something abstract)
• active learning (ask for answers in online forums)
• learning from images
• learning from several languages (?)
(a minimal coupling sketch follows after this slide)
[Carlson et al. 2010 and follow-ups]
http://rtw.ml.cmu.edu/rtw/
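A minimal sketch of how such coupling constraints can combine candidates from two learners; the categories, facts, and mutual-exclusion pairs are illustrative only and not NELL's actual implementation:

    # Sketch: two learners propose (entity, category) candidates; coupling
    # (agreement between learners + mutual exclusion) decides what gets promoted.
    MUTEX = {("sports_coach", "scientist")}          # mutually exclusive categories

    table_learner = {("Krzewski", "sports_coach"), ("Miller", "sports_coach")}
    text_learner  = {("Krzewski", "sports_coach"), ("Krzewski", "scientist")}

    def violates_mutex(candidate, accepted):
        entity, cat = candidate
        return any((cat, other) in MUTEX or (other, cat) in MUTEX
                   for e, other in accepted if e == entity)

    accepted = set()
    # promote only candidates proposed by both learners that respect mutual exclusion
    for cand in table_learner & text_learner:
        if not violates_mutex(cand, accepted):
            accepted.add(cand)

    print(accepted)   # {('Krzewski', 'sports_coach')}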
Estimating Accuracy from Unlabeled Data
[Platanios, Blum, Mitchell, UAI‘14]
Given: Extractors f1,…,fn
Find: error probability ei of each extractor
aij = Px( fi(x) = fj(x) )                  (agreement rate: known, measurable on unlabeled data)
    = P(both make an error) + P(neither makes an error)
    = 1 – ei – ej + 2*eij                  (eij = probability of a simultaneous error)

Case 1: independent errors & accuracy > 0.5,
then eij = ei*ej and aij = 1 – ei – ej + 2*ei*ej
Problem reduced to solving a system of
N*(N-1)/2 equations with N unknown values
Solvable if N ≥ 3
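For example, with N = 3 extractors and Case 1 (independent errors), the measured pairwise agreement rates give one equation per pair; a worked sketch in the notation above:

    % three known agreement rates, three unknown error rates e1, e2, e3
    \begin{align*}
    a_{12} &= 1 - e_1 - e_2 + 2\,e_1 e_2 \\
    a_{13} &= 1 - e_1 - e_3 + 2\,e_1 e_3 \\
    a_{23} &= 1 - e_2 - e_3 + 2\,e_2 e_3
    \end{align*}

Solving this system and keeping the roots with ei < 0.5 (the accuracy > 0.5 assumption) gives an error estimate for every extractor without any labeled data.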
Estimating Accuracy from Unlabeled Data (cont.)
[Platanios, Blum, Mitchell, UAI'14]

Same setup: extractors f1,…,fn with unknown error probabilities ei, and known
agreement rates aij = 1 – ei – ej + 2*eij.

Case 2: errors are not independent
Idea: minimize eij – ei*ej, i.e., prefer (approximately) independent classifiers
Outline
1. Introduction
2. Harvesting Classes
3. Harvesting Facts
4. Common Sense Knowledge
5. Knowledge Consolidation
   • in YAGO √
   • in NELL √
   • in Google
   • KB Alignment
   • Linked Data
6. Web Content Analytics
7. Wrap-Up
Google Knowledge Vault
[Dong et al.: KDD2014]
Given: Freebase, a relation r, extractors e1,…,en
Train a classifier that, given the confidences of the extractors,
tells us whether the extracted statement is true.
[Figure: Knowledge Vault architecture. Extractors over text, DOM trees, HTML tables,
and annotations (RDFa / schema.org) each produce candidate facts; a fusion step
combines their outputs with a prior from the existing KB, obtained via Path Ranking,
to score each candidate fact.]
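A minimal sketch of such a fusion classifier, assuming we already have per-extractor confidence scores and some labeled training facts (e.g., checked against Freebase); the data, feature layout, and model choice below are illustrative, not Knowledge Vault's actual implementation:

    # Sketch: fuse per-extractor confidences into a single probability of truth.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Each row: confidences of (text, DOM, table, RDFa) extractors for one candidate fact;
    # 0.0 means the extractor did not propose the fact at all.
    X_train = np.array([
        [0.9, 0.8, 0.0, 0.7],   # supported by three extractors
        [0.2, 0.0, 0.1, 0.0],   # weakly supported
        [0.0, 0.9, 0.0, 0.0],   # one extractor only, but confident
        [0.1, 0.2, 0.0, 0.1],
    ])
    y_train = np.array([1, 0, 1, 0])   # gold labels from the existing KB

    model = LogisticRegression().fit(X_train, y_train)

    candidate = np.array([[0.8, 0.0, 0.6, 0.0]])
    print("P(statement is true) =", model.predict_proba(candidate)[0, 1])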
RDFa Annotations
<div typeof="Person" resource="http://elvis.me">
My name is <span property="name">Elvis</span>.
</div>
Browser: displays the page as usual
RDFa analyzer: extracts <elvis.me, name, "Elvis">
• 30% of Web pages are annotated this way
• schema.org is a common vocabulary designed by Google,
Microsoft, Yandex, and others for this purpose
[Guha: "Schema.org", keynote at AKBC 2014]
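A toy sketch of what the RDFa analyzer does for the snippet above; it handles only the attributes used in this example (resource, property) and is not a full RDFa processor:

    # Sketch: extract the (subject, property, value) triple from the snippet above.
    from html.parser import HTMLParser

    HTML = '''<div typeof="Person" resource="http://elvis.me">
    My name is <span property="name">Elvis</span>.
    </div>'''

    class TinyRDFaParser(HTMLParser):
        def __init__(self):
            super().__init__()
            self.subject = None     # current subject, from resource=
            self.prop = None        # property of the span we are inside
            self.triples = []       # collected (subject, property, value) triples

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if "resource" in attrs:
                self.subject = attrs["resource"]
            if "property" in attrs:
                self.prop = attrs["property"]

        def handle_data(self, data):
            if self.subject and self.prop:
                self.triples.append((self.subject, self.prop, data.strip()))
                self.prop = None

    parser = TinyRDFaParser()
    parser.feed(HTML)
    print(parser.triples)   # [('http://elvis.me', 'name', 'Elvis')]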
Trustworthiness of Web Sources
[Dong et al.: VLDB2015]
Page Rank and Trustworthiness are not always correlated!
[Scatter plot: PageRank vs. Knowledge-Based Trust. Many gossip websites score high
on PageRank but low on trustworthiness, while some tail sources score low on
PageRank but high on trustworthiness.]
Outline
1. Introduction
2. Harvesting Classes
3. Harvesting Facts
4. Common Sense Knowledge
5. Knowledge Consolidation
   • in YAGO √
   • in NELL √
   • in Google √
   • KB Alignment
   • Linked Data
6. Web Content Analytics
7. Wrap-Up
Knowledge bases are complementary
No Links ⇒ No Use
Who is the spouse of the guitar player?
Linking Records vs. Linking Knowledge

[Figure: database records (Susan B. Davidson, Peter Buneman, Yi Chen, University of
Pennsylvania) on one side, a KB / ontology with a "university" class on the other.]

Differences between DB records and KB entities:
• Links have rich semantics (e.g. subclassOf)
• KBs have only binary predicates
• KBs have no schema
• Match not just entities, but also classes & predicates (relations)
Similarity Flooding matches entities at scale
Build a graph:
nodes: pairs of entities, weighted with similarity
edges: weighted with degree of relatedness
[Figure: entity-pair nodes with similarities 0.9, 0.7, and 0.8, connected by an
edge with relatedness 0.8.]
Iterate until convergence:
similarity := weighted sum of neighbor similarities
many variants (belief propagation, label propagation, etc.)
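A minimal sketch of this fixpoint iteration, assuming the pair graph has already been built; node names, weights, and the 50/50 blend with the initial similarity are illustrative choices, not the exact Similarity Flooding formulation:

    # Sketch: propagate similarities over a graph of candidate entity pairs.
    init = {"A~A'": 0.9, "B~B'": 0.7, "C~C'": 0.8}     # initial (string-based) similarities
    edges = {                                           # neighbor pairs with relatedness weights
        "A~A'": [("B~B'", 0.8), ("C~C'", 0.5)],
        "B~B'": [("A~A'", 0.8)],
        "C~C'": [("A~A'", 0.5)],
    }

    sim = dict(init)
    for _ in range(100):                                # iterate until convergence
        new_sim = {}
        for node, neighbors in edges.items():
            total_w = sum(w for _, w in neighbors)
            neighbor_part = sum(w * sim[n] for n, w in neighbors) / total_w
            # blend the node's own initial evidence with its neighbors' current similarities
            new_sim[node] = 0.5 * init[node] + 0.5 * neighbor_part
        if max(abs(new_sim[n] - sim[n]) for n in sim) < 1e-6:
            sim = new_sim
            break
        sim = new_sim

    print(sim)   # pairs with high final similarity are proposed as matches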
Some neighborhoods are more indicative

[Figure: a candidate sameAs pair; both entities were born in 1935, and both are
married to Priscilla.]

Many people born in 1935 ⇒ not indicative
Few people married to Priscilla ⇒ highly indicative
Inverse functionality as indicativeness
[Figure: the same candidate sameAs pair, with the born-in-1935 and
married-to-Priscilla neighborhoods.]

ifun(r, y) = 1 / |{ x : r(x, y) }|

ifun(born, 1935) = 1/5
ifun(married, Prisc.) = 1/2

ifun(r) = harmonic mean over all y of ifun(r, y)

ifun(born) = 0.01
ifun(married) = 0.9
[Suchanek et al.: VLDB’12]
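A minimal sketch of computing ifun as defined above over a toy fact set (the facts are made up; harmonic_mean is from the Python standard library):

    # Sketch: inverse functionality of a relation, aggregated by harmonic mean.
    from statistics import harmonic_mean

    facts = [
        ("Elvis",  "married", "Priscilla"),
        ("Elvis2", "married", "Priscilla"),
        ("Elvis",  "born", "1935"), ("p1", "born", "1935"), ("p2", "born", "1935"),
        ("p3", "born", "1935"), ("p4", "born", "1935"),
    ]

    def ifun_ry(r, y):
        """ifun(r, y) = 1 / number of subjects x with r(x, y)."""
        return 1.0 / sum(1 for s, rel, o in facts if rel == r and o == y)

    def ifun(r):
        """ifun(r) = harmonic mean of ifun(r, y) over all objects y of r."""
        ys = {o for s, rel, o in facts if rel == r}
        return harmonic_mean([ifun_ry(r, y) for y in ys])

    print(ifun("born"))     # 0.2: five people share the birth year 1935
    print(ifun("married"))  # 0.5: two people are married to Priscilla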
Match entities, classes and relations

[Figure, built up over several slides: PARIS aligns two KBs by computing sameAs links
between entities, subClassOf links between classes, and subPropertyOf links between
relations.]

PARIS matches YAGO and DBpedia:
• time: 1:30 hours
• precision for instances: 90%
• precision for classes: 74%
• precision for relations: 96%
[Suchanek et al.: VLDB'12]
http://webdam.inria.fr/paris
Many challenges remain
Entity linkage is at the heart of semantic data integration.
More than 50 years of research, still some way to go!
• Highly related entities with ambiguous names
George W. Bush (jun.) vs. George H.W. Bush (sen.)
• Long-tail entities with sparse context
• Records with complex DB / XML / OWL schemas
• Ontologies with non-isomorphic structures
Benchmarks:
• OAEI Ontology Alignment & Instance Matching: oaei.ontologymatching.org
• TAC KBP Entity Linking: www.nist.gov/tac/2012/KBP/
• TREC Knowledge Base Acceleration: trec-kba.org
Outline
1. Introduction
2. Harvesting Classes
3. Harvesting Facts
4. Common Sense Knowledge
5. Knowledge Consolidation
   • in YAGO √
   • in NELL √
   • in Google √
   • KB Alignment √
   • Linked Data
6. Web Content Analytics
7. Wrap-Up
Warning: Numbers mentioned here are not authoritative,
because (1) they are based on incomplete crawls or (2) they
may be outdated. See the respective sources for details.
Linked Open Data Cloud
• April 2011: 30 billion triples, 500 million links
• From 2011 to 2014, the number of KBs more than tripled: from 297 to 1091
  [Schmachtenberg et al.: ICSW2014]

KBs per domain:
• social networking: 520
• government: 183
• publications: 138
• life sciences: 85
• user-generated: 51
• cross-domain: 47
• geographic: 27
• media: 24
• linguistics
http://lod-cloud.net
Links between KBs

Top linking predicates (from KB 1 to KB 2):
owl:sameAs, rdfs:seeAlso, dct:source, dct:language,
skos:exactMatch, skos:closeMatch, dct:creator

#links as of April 2011: 500 mio
#links as of April 2014: unknown, crawled only a sample
#links "sameAs" at sameAs.org: 150 mio

Watch out: "sameAs" has developed 5 meanings:
• Identical to
• Same in a different context
• Same but referentially opaque
• Represents
• Very similar to
[Halpin & Hayes: "When owl:sameAs isn't the Same", LDOW, 2010]

[Schmachtenberg et al.: ICSW2014]
44% of KBs are not linked at all
Dereferencing URIs
<http://yago-knowledge.org/resource/Elvis_Presley>
@prefix y: http://yago-knowledge.org/resource/
y:Elvis rdf:type y:livingPerson
y:Elvis y:wasBornIn y:USA
…
Dereferencability of schemas:
• full: 19%
• partial: 9%
• none: 72%
[Schmachtenberg et al.: ICSW2014]
In a crawl of 1.6m dereferenceable URIs:
[Hogan et al: “Weaving the Pedantic Web”, LDOW, 2010]
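A minimal sketch of dereferencing such a URI with HTTP content negotiation, using the requests library; whether the server actually answers with Turtle depends on the endpoint:

    # Sketch: dereference a Linked Data URI and ask for an RDF serialization.
    import requests

    uri = "http://yago-knowledge.org/resource/Elvis_Presley"
    resp = requests.get(uri, headers={"Accept": "text/turtle"}, timeout=10)

    print(resp.status_code)                    # 200 if the URI is dereferenceable
    print(resp.headers.get("Content-Type"))    # ideally an RDF type such as text/turtle
    print(resp.text[:300])                     # first triples about the entity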
Publish the Rubbish
[Suchanek@WOLE2012 keynote]
Vocabularies (% of KBs)

Larger adoption of standard vocabularies:

    Vocabulary     2011   2014
    FOAF            27%    69%
    Dublin Core     31%    56%

Usage of standard vocabulary terms (% of KBs):

    rdfs:range            10%     rdfs:seeAlso              2%
    rdfs:subClassOf        9%     owl:equivalentClass       2%
    rdfs:subPropertyOf     7%     owl:inverseOf             1%
    rdfs:domain            6%     swivt:type                1%
    rdfs:isDefinedBy       4%     owl:equivalentProperty    1%

[Schmachtenberg et al.: ICSW2014]
Open Problems and Grand Challenges
• Web-scale, robust entity linking with high quality
• Handling huge amounts of linked-data sources, Web tables, …
• Distilling out the high-quality pieces of information
• Automatic and continuously maintained sameAs links for the Web of Linked Data,
  with high accuracy & coverage