Outline
1. Introduction
2. Harvesting Classes
3. Harvesting Facts
4. Common Sense Knowledge
5. Knowledge Consolidation
   • in YAGO
   • in NELL
   • in Google
   • KB Alignment
   • Linked Data
6. Web Content Analytics
7. Wrap-Up

Goal: Combine several extractors
[Figure: several extractors run over text, tables, and documents and produce conflicting candidate facts such as is(Elvis, alive) and is(Elvis, dead); these have to be consolidated into a single answer to the query is(Elvis, alive)?]

YAGO combines 170 extractors
[Figure: an Infobox Extractor proposes JimGray bornIn "January 12, 1944" and JimGray bornIn SanFrancisco; a TypeChecker keeps only JimGray bornIn SanFrancisco; a MultilingualMerger combines the results across languages.]

YAGO combines 170 extractors
• Type checking
• Type coherence checking
• Translation
• Learning of foreign-language attributes
• Deduplication
• Horn rule inference
• Functional constraint checking (simple preference over sources)
=> 10 languages, precision of 95%
http://yago-knowledge.org   [Mahdisoltani et al.: CIDR 2015]

Outline check-point: Knowledge Consolidation in YAGO √; next: in NELL.

NELL couples different learners
[Figure: an initial ontology couples several learners and constraints: a table extractor (Krzewski / Blue Angels, Miller / Red Angels), a natural-language pattern extractor ("Krzewski coaches the Blue Devils."), mutual-exclusion constraints (sports coach != scientist), and type checks ("If I coach, am I a coach?").]
[Carlson et al. 2010 and follow-ups]   http://rtw.ml.cmu.edu/rtw/

NELL couples different learners
The different learners benefit from each other:
• table extraction
• text extraction
• path ranking (rule learning)
• morphological features ("…ism" is something abstract)
• active learning (ask for answers in online forums)
• learning from images
• learning from several languages (?)
[Carlson et al. 2010 and follow-ups]   http://rtw.ml.cmu.edu/rtw/

Estimating Accuracy from Unlabeled Data   [Platanios, Blum, Mitchell: UAI 2014]
Given: extractors f_1, …, f_N.
Find: the error probability e_i of each extractor.
The agreement rate is observable:
a_ij = P_x(f_i(x) = f_j(x)) = P(both make an error) + P(neither makes an error) = 1 - e_i - e_j + 2*e_ij,
where e_ij is the probability of a simultaneous error.

Case 1: independent errors and accuracy > 0.5.
Then a_ij = 1 - e_i - e_j + 2*e_i*e_j.
The problem reduces to solving a system of N*(N-1)/2 equations with N unknowns; it is solvable for N ≥ 3.
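To make Case 1 concrete, here is a minimal Python sketch. The agreement rates below are invented for illustration; on real data they would be measured by counting how often two extractors give the same output on the same unlabeled inputs.

```python
# Minimal sketch of Case 1: recover the error rates e1, e2, e3 of three
# extractors from their pairwise agreement rates, assuming independent errors
# and accuracy > 0.5, i.e.  a_ij = 1 - e_i - e_j + 2*e_i*e_j.
from scipy.optimize import fsolve

# Hypothetical observed agreement rates (made up for this example).
a12, a13, a23 = 0.74, 0.66, 0.62

def equations(e):
    e1, e2, e3 = e
    return [1 - e1 - e2 + 2 * e1 * e2 - a12,
            1 - e1 - e3 + 2 * e1 * e3 - a13,
            1 - e2 - e3 + 2 * e2 * e3 - a23]

# The system also has a mirror solution (1-e1, 1-e2, 1-e3); starting below 0.5
# uses the "accuracy > 0.5" assumption to land on the low-error root.
e1, e2, e3 = fsolve(equations, x0=[0.2, 0.2, 0.2])
print(e1, e2, e3)  # approximately 0.1, 0.2, 0.3 for the agreement rates above
```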
Case 2: errors are not independent.
Idea: minimize e_ij - e_i*e_j, i.e., find (nearly) independent classifiers.

Outline check-point: Knowledge Consolidation in YAGO √, in NELL √; next: in Google.

Google Knowledge Vault   [Dong et al.: KDD 2014]
Given: Freebase, a relation r, and extractors e1, …, en.
Train a classifier that, given the confidences of the extractors, tells us whether an extracted statement is true.
[Figure: extractors over text (TXT), DOM trees, HTML tables, and annotations (ANO: RDFa / schema.org) each produce candidate facts; a prior derived from existing KB facts via Path Ranking is fused with them into the final scored facts.]

RDFa Annotations
  <div typeof="Person" resource="http://elvis.me">
    My name is <span property="name">Elvis</span>.
  </div>
A browser renders the page as usual; an RDFa analyzer extracts the triple <elvis.me, name, "Elvis">.
• About 30% of Web pages are annotated this way.
• schema.org is a common vocabulary designed by Google, Microsoft, Yandex, and others for this purpose.
[Guha: "Schema.org", keynote at AKBC 2014]

Trustworthiness of Web Sources   [Dong et al.: VLDB 2015]
PageRank and trustworthiness are not always correlated!
[Figure: PageRank vs. Knowledge-Based Trust. Many gossip Web sites score high on PageRank but low on trust; some tail sources have low PageRank but high trustworthiness.]

Outline check-point: Knowledge Consolidation in YAGO √, in NELL √, in Google √; next: KB Alignment.

Knowledge bases are complementary.

No Links No Use
Who is the spouse of the guitar player? Answering this requires links between the KB that knows the guitar player and the KB that knows the spouse.

Linking Records vs. Linking Knowledge
[Figure: database records (Susan B. Davidson, Peter Buneman, Yi Chen, University of Pennsylvania) vs. a KB/ontology with classes such as "university".]
Differences between DB records and KB entities:
• links have rich semantics (e.g., subclassOf)
• KBs have only binary predicates
• KBs have no schema
• we must match not just entities, but also classes & predicates (relations)

Similarity Flooding matches entities at scale
Build a graph:
• nodes: pairs of entities, weighted with a similarity (e.g., 0.9, 0.7, 0.8)
• edges: weighted with a degree of relatedness (e.g., 0.8)
Iterate until convergence: similarity := weighted sum of the neighbors' similarities.
Many variants exist (belief propagation, label propagation, etc.).

Some neighborhoods are more indicative
Many people were born in 1935, so a shared birth year is not indicative of a sameAs link.
Few people are married to Priscilla, so a shared spouse is highly indicative.

Inverse functionality as indicativeness
ifun(r, y) = 1 / |{x : r(x, y)}|
ifun(born, 1935) = 1/5        ifun(married, Priscilla) = 1/2
ifun(r) = HM_y ifun(r, y)   (harmonic mean over all objects y)
ifun(born) = 0.01             ifun(married) = 0.9
[Suchanek et al.: VLDB 2012]
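A small illustration of these formulas: the sketch below computes ifun over a handful of invented toy facts (this is just the formula spelled out in code, not the PARIS implementation). Run over a full KB, the same computation yields global weights of the kind shown above for born and married.

```python
# Toy sketch of inverse functionality (ifun) as an indicativeness weight for
# sameAs candidates. The facts below are invented for illustration only.
from collections import defaultdict
from statistics import harmonic_mean

facts = [                      # (subject, relation, object)
    ("Elvis",   "born",    1935),
    ("Anna",    "born",    1935),
    ("Bob",     "born",    1935),
    ("Carla",   "born",    1935),
    ("Dave",    "born",    1935),
    ("Elvis",   "married", "Priscilla"),
    ("Presley", "married", "Priscilla"),   # duplicate entity from another KB
]

def ifun_ry(facts):
    """ifun(r, y) = 1 / |{x : r(x, y)}| for every (r, y) pair."""
    subjects = defaultdict(set)
    for x, r, y in facts:
        subjects[(r, y)].add(x)
    return {(r, y): 1 / len(xs) for (r, y), xs in subjects.items()}

def ifun_r(facts, r):
    """ifun(r) = harmonic mean of ifun(r, y) over all objects y of r."""
    per_object = ifun_ry(facts)
    return harmonic_mean([v for (rel, _), v in per_object.items() if rel == r])

print(ifun_r(facts, "born"))     # 1/5 = 0.2  -> a shared birth year says little
print(ifun_r(facts, "married"))  # 1/2 = 0.5  -> a shared spouse says much more
```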
Match entities, classes, and relations
PARIS aligns two KBs by computing sameAs links between entities, subClassOf links between classes, and subPropertyOf links between relations.
PARIS matches YAGO and DBpedia:
• time: 1:30 hours
• precision for instances: 90%
• precision for classes: 74%
• precision for relations: 96%
[Suchanek et al.: VLDB 2012]   http://webdam.inria.fr/paris

Many challenges remain
Entity linkage is at the heart of semantic data integration. More than 50 years of research, and still some way to go!
• highly related entities with ambiguous names: George W. Bush (jun.) vs. George H. W. Bush (sen.)
• long-tail entities with sparse context
• records with complex DB / XML / OWL schemas
• ontologies with non-isomorphic structures
Benchmarks:
• OAEI Ontology Alignment & Instance Matching: oaei.ontologymatching.org
• TAC KBP Entity Linking: www.nist.gov/tac/2012/KBP/
• TREC Knowledge Base Acceleration: trec-kba.org

Outline check-point: Knowledge Consolidation in YAGO √, in NELL √, in Google √, KB Alignment √; next: Linked Data.
Warning: the numbers mentioned in this part are not authoritative, because (1) they are based on incomplete crawls or (2) they may be outdated. See the respective sources for details.

Linked Open Data Cloud   http://lod-cloud.net
• April 2011: about 30 billion triples and 500 million links between KBs.
• From 2011 to 2014, the number of KBs tripled, from 297 to 1091.
• KBs per domain (2014): social networking 520, government 183, publications 138, life sciences 85, user-generated content 51, cross-domain 47, geographic 27, media 24, plus linguistics.
[Schmachtenberg et al.: ISWC 2014]

Links between KBs
Top linking predicates: owl:sameAs, rdfs:seeAlso, dct:source, dct:language, skos:exactMatch, skos:closeMatch, dct:creator.
• Number of links in April 2011: about 500 million.
• Number of links in April 2014: unknown (only a sample was crawled).
• "sameAs" links registered at sameAs.org: about 150 million.
• 44% of KBs are not linked at all.
Watch out: "sameAs" has developed 5 meanings:
• identical to
• same, but in a different context
• same, but referentially opaque
• represents
• very similar to
[Halpin & Hayes: "When owl:sameAs isn't the Same", LDOW 2010]   [Schmachtenberg et al.: ISWC 2014]

Dereferencing URIs
Dereferencing <http://yago-knowledge.org/resource/Elvis_Presley> returns RDF such as:
  @prefix y: <http://yago-knowledge.org/resource/> .
  y:Elvis rdf:type y:livingPerson .
  y:Elvis y:wasBornIn y:USA .
  …
(A minimal code sketch of dereferencing is given at the end of this section.)
Dereferencability of schemas: full 19%, partial 9%, none 72%.   [Schmachtenberg et al.: ISWC 2014]
[Figure: statistics from a crawl of 1.6 million dereferenceable URIs. Hogan et al.: "Weaving the Pedantic Web", LDOW 2010]

Publish the Rubbish
[Suchanek: keynote at WOLE 2012]

Vocabularies
Adoption of standard vocabularies (% of KBs):
  FOAF:         27% (2011) → 69% (2014)
  Dublin Core:  31% (2011) → 56% (2014)
Usage of standard vocabulary terms (% of KBs):
  rdfs:range 10%, rdfs:subClassOf 9%, rdfs:subPropertyOf 7%, rdfs:domain 6%, rdfs:isDefinedBy 4%,
  rdfs:seeAlso 2%, owl:equivalentClass 2%, owl:inverseOf 1%, swivt:type 1%, owl:equivalentProperty 1%.
[Schmachtenberg et al.: ISWC 2014]

Open Problems and Grand Challenges
• Web-scale, robust entity linking with high quality
• handling huge amounts of linked-data sources, Web tables, …
• distilling out the high-quality pieces of information
• automatic and continuously maintained sameAs links for the Web of Linked Data, with high accuracy & coverage
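As a closing illustration of the dereferencing mechanism mentioned above, here is a minimal sketch, assuming the requests and rdflib libraries; what a server actually returns for the example URI may differ from the snippet shown on the slide.

```python
# Minimal sketch: dereference a Linked Data URI via HTTP content negotiation
# and parse the RDF triples that come back. Assumes the `requests` and
# `rdflib` libraries; the URI is the YAGO example from above, and the
# server's actual response format and content may differ.
import requests
from rdflib import Graph

uri = "http://yago-knowledge.org/resource/Elvis_Presley"

# Ask for Turtle instead of an HTML page.
response = requests.get(uri, headers={"Accept": "text/turtle"}, timeout=10)
response.raise_for_status()

graph = Graph()
graph.parse(data=response.text, format="turtle")  # fails if the answer is not Turtle
print(f"{len(graph)} triples about {uri}")
for subj, pred, obj in list(graph)[:5]:           # show a few of them
    print(subj, pred, obj)
```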