A Semantic Web View on Concepts and their Alignments
Antoine Isaac
Vrije Universiteit Amsterdam
Europeana
Concepts in Context, Köln, July 19th 2010

Linked Data Principles
1. Use URIs as names for things
2. Use HTTP URIs so that people can look up those names
3. When someone looks up a URI, provide useful information using standards (RDF, SPARQL)
4. Include links to other URIs, so that they can discover more things.
Tim Berners-Lee, http://linkeddata.org/
A way to publish Semantic Web data

A web of data
• Publish and re-use data via the web, building innovative applications over former data silos
• Principle #4 is crucial to this vision: Include links to other URIs, so that they can discover more things.
http://linkeddata.org/

SKOS, Knowledge Organization Systems and Linked Data
SKOS allows representing (simple) KOS data as RDF

animals
  NT cats
cats
  UF domestic cats
  RT wildcats
  BT animals
  SN used only for domestic cats
domestic cats
  USE cats
wildcats

SKOS, KOSs and LD
SKOS allows bridging across KOSs from different contexts
http://www.w3.org/2004/02/skos/

Some landmark KOS LD implementations
• Many Libraries – not a surprise!
  • Swedish National Library’s Libris catalogue and thesaurus http://libris.kb.se/
  • Library of Congress’ vocabularies, including LCSH http://id.loc.gov/
  • DNB’s Gemeinsame Normdatei (incl.
SWD subject headings) http://d-nb.info/gnd/
    Documentation at https://wiki.d-nb.de/display/LDS
  • BnF’s RAMEAU subject headings http://stitch.cs.vu.nl/
  • OCLC’s DDC classification http://dewey.info/ and VIAF http://viaf.org/
  • STW economy thesaurus http://zbw.eu/stw
  • National Library of Hungary’s catalogue and thesauri http://oszkdk.oszk.hu/resource/DRJ/404 (example)
• Other fields
  • Wikipedia categories through DBpedia http://dbpedia.org/
  • New York Times subject headings http://data.nytimes.com/
  • IVOA astronomy vocabularies http://www.ivoa.net/Documents/latest/Vocabularies.html
  • GEMET environmental thesaurus http://eionet.europa.eu/gemet
  • UMTHES
  • Agrovoc http://aims.fao.org/
  • Linked Life Data http://linkedlifedata.com/
  • TaxonConcept http://www.taxonconcept.org/
  • UK Public sector vocabularies http://standards.esd.org.uk/ (e.g., http://id.esd.org.uk/lifeEvent/7 )

KOS Alignments?
Quite a few of them are linked to some other resource
• LCSH, SWD and RAMEAU interlinked through MACS mappings
• GND linked to DBpedia and VIAF
• Libris linked to LCSH
• Agrovoc to CAT, NAL, SWD, GEMET
• NYT to Freebase, DBpedia, GeoNames
• DBpedia links are overwhelming
  Hungary, STW, TaxonConcept, GND…
Is that enough? Are these links any good?

Sparse linkage: the LD cloud
[Cyganiak, Jentzsch] http://linkeddata.org/

Sparse linkage: another view
[Guéret, 2010] http://blog.larkc.eu/?p=1941

Linked Data Issues
Mike Uschold’s “semantic elephants”
• Proliferation of URIs, Managing Coreference
• Versioning and URIs
• Overloading owl:sameAs
http://lists.w3.org/Archives/Public/public-lod/2010May/0012.html

What kind of links?
Coreference links are the most used (and needed)
• owl:sameAs
• skos:exactMatch
• skos:closeMatch
• rdfs:seeAlso
• umbel:isLike

Overloading owl:sameAs
• Formally, two URIs linked by owl:sameAs are inferred to have the same properties
  ex:a name "Antoine Isaac" .
  ex:b owl:sameAs ex:a .
  implies
  ex:b name "Antoine Isaac" .
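The entailment in the example above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not an OWL reasoner: triples are plain string tuples, and the function name `expand_same_as` is hypothetical. It groups the URIs connected by owl:sameAs and copies every statement to all members of each group.

```python
from collections import defaultdict

SAME_AS = "owl:sameAs"


def expand_same_as(triples):
    """Return the triples plus those entailed by treating owl:sameAs as identity.

    `triples` is a set of (subject, predicate, object) string tuples.
    This is an illustrative helper, not part of any RDF library.
    """
    parent = {}  # union-find forest over nodes that appear in owl:sameAs links

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    # Merge the two ends of every owl:sameAs statement into one group.
    for s, p, o in triples:
        if p == SAME_AS:
            parent[find(s)] = find(o)

    # Collect the members of each group under its representative.
    members = defaultdict(set)
    for node in list(parent):
        members[find(node)].add(node)

    def aliases(node):
        # Nodes never mentioned in an owl:sameAs link only stand for themselves.
        return members[find(node)] if node in parent else {node}

    # Every statement holds for every alias of its subject and of its object.
    return {
        (s2, p, o2)
        for s, p, o in triples
        for s2 in aliases(s)
        for o2 in aliases(o)
    }


triples = {
    ("ex:a", "name", "Antoine Isaac"),
    ("ex:b", "owl:sameAs", "ex:a"),
}
inferred = expand_same_as(triples)
print(("ex:b", "name", "Antoine Isaac") in inferred)  # True
```

In this sketch any other predicate, for instance skos:closeMatch, is treated as an ordinary triple and triggers no copying. That is why asserting owl:sameAs between resources that are merely similar has consequences: every property of one resource ends up on the other.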
• Many owl:sameAs statements are asserted between resources that are only very similar [Halpin 2009]
  The same resource but in different contexts, a reference…

Case study: New York Times
• 10K concepts (places, descriptors, persons, organizations) http://data.nytimes.com
• Manually or automatically mapped by NYT staff to DBpedia, Freebase, GeoNames
  Linking the LD cloud to NYT articles! Makes it easy to mix NYT content with other content
• Started with quite messy modeling
  http://data.nytimes.com/60694995023816375851 dcterms:rightsHolder The New York Times Company .
  http://data.nytimes.com/60694995023816375851 owl:sameAs http://dbpedia.org/resource/Park_Slope%2C_Brooklyn .

Clearer KOS alignments (1)
What is being aligned?
Concepts, documents, real-world entities “out there” (persons, places…)
• In principle owl:sameAs should not be applied across disjoint categories
• But even for one category there can be issues
  • Two KOS concepts representing the same notion but with different management metadata attached (skos:changeNote)

Clearer KOS alignments (2)
How is it aligned?
Distinguish:
• exact co-reference
• conceptual similarity, including equivalence
• classification
• Making clearer distinctions between conceptual links
  • skos:narrowMatch, skos:broadMatch, skos:relatedMatch
• Minimize ontological commitment for KOS data consumers
  • skos:exactMatch: concepts can be used interchangeably across a wide range of information retrieval applications.
    skos:exactMatch is a transitive property
  • skos:closeMatch: In order to avoid the possibility of "compound errors" when combining mappings across more than two concept schemes, skos:closeMatch is not declared to be a transitive property

Case study: New York Times (2)
Data quality has considerably improved
• Factual data is at the concept itself, management data is at the resource representing the data source (context)
  http://data.nytimes.com/60694995023816375851
    rdf:type skos:Concept ;
    skos:prefLabel "Park Slope (NYC)" ;
    geo:lat "40.6701033" ;
    owl:sameAs http://dbpedia.org/resource/Park_Slope%2C_Brooklyn .
  http://data.nytimes.com/60694995023816375851.rdf
    dcterms:rightsHolder "The New York Times Company" ;
    foaf:primaryTopic http://data.nytimes.com/60694995023816375851 .
• Still, for resources linked with owl:sameAs, statements representing different modeling choices can be merged
  The DBpedia resource might not be a skos:Concept, or might use a different latitude format

Clearer KOS alignments (3)
What is the alignment for?
• SKOS mapping properties use the notion of validity within one application context
• Application context for mapping has been investigated in thesaurus interoperability studies
• Application of alignments matters:
  • STITCH application scenarios for Cultural Heritage: book re-indexing, thesaurus merging, query reformulation…
  • The same alignment performs differently for different scenarios [Isaac 2008, Wang 2009]

Application-specific alignment evaluation
Example: OAEI 2007 campaign, 3 matching tools evaluated for thesaurus merging & book re-indexing
[Figure: two bar charts on a 0%–100% scale comparing Falcon, Silas and DSSim; bar groups labelled Precision, Coverage, Pa and Ra]

Application-specific alignments
Why?
Take 2 thesauri at the Nat.
Library of the Netherlands: GTT and Brinkman
• For thesaurus merging, gtt:excavation should be aligned to brinkman:excavation
• For book re-indexing, gtt:excavation should be aligned to brinkman:archeology_netherlands
• Requires a finer representation grain for the context in which the alignment is produced
  • Who created it?
  • Manual vs. automatic?
  • Which alignment strategy or tool?
  • Is there a degree of confidence?

Case study: New York Times (3)
• Using nyt:mapping_strategy property with nyt:manual or nyt:automatic:
  http://data.nytimes.com/60694995023816375851.rdf nyt:mapping_strategy http://data.nytimes.com/elements/manual .
• Problem: it applies to the context file for the concept, not to the statement itself:
  http://data.nytimes.com/60694995023816375851 owl:sameAs http://dbpedia.org/resource/Park_Slope%2C_Brooklyn .
• Using simple binary properties (skos:exactMatch…) between aligned resources does not allow for much flexibility

Ontology Matching community practices
• Community investigating the ontology and vocabulary matching issues
  Ontology Alignment Evaluation Initiative http://oaei.ontologymatching.org
• Matching tools produce some metadata
• Metadata repositories store and manage them
  – Bioportal http://bioportal.bioontology.org/
  – CATCH vocabulary and alignment repository http://stitch.cs.vu.nl/repository/
  …
• Consensus: richer alignment metadata is needed

From a simple representation to a more complete one
http://alignapi.gforge.inria.fr/edoal.html

Can LD accommodate complex representations?
• The strength of the LD vision lies in the relative simplicity of a standard representation
• LD provides a simple way to publish data and follow one’s nose to connected data
  Serendipity!
• Reification and metadata on links are not really compatible with it
  Higher barrier for data publication and consumption

Peaceful co-existence
• Applications with a narrow scope that require precise data can afford:
  • Selecting alignments they consume
  • Exploiting finer-grained representations
  • Creating finer-grained representations
• Simple data for applications that are simple and/or exploit a wide range of datasets
  • Simple mash-up applications robust to (limited) approximation
  • Web-scale applications
    Large-scale document retrieval, Concept discovery

Does it need to be perfect anyway?
• Do we really want to throw away crucial URI co-reference data?
  http://sameAs.org has 35,187,488 URIs in 11,285,263 bundles
• Extensive linking to DBpedia is useful, even with a type of link that is not used in the theoretically correct way
  Cf. BBC content and data mash-ups http://www.bbc.co.uk/wildlifefinder/ http://www.bbc.co.uk/music/
• Issues with mixed quality are being tackled
  – http://sameAs.org as a “service to provide you with help finding URIs”, keeping track of data sources
  – Representation and exchange of provenance info is under active investigation

Peaceful co-existence (2)
• If you have a complex representation, don’t be pedantic: publish simpler data, too!
• Articulation between LD (to discover links) and alignment repositories is needed
• Technically feasible, best practices have to be identified

Conclusions
• (Almost) any alignment is better than none
  This is a web of data, without links there’s almost no value
• There is already great linking happening!
• More involvement from this community would certainly help!
  Alignments themselves & Theoretical foundations

Thanks!
Possible participation channels:
Linked Open Data community (http://linkeddata.org) and mailing list (public-lod@w3.org)
Library Linked Data W3C incubator group (http://www.w3.org/2005/Incubator/lld/wiki/) and community list (public-lld@w3.org)

References
• [Halpin 2009] Harry Halpin, Pat Hayes. When owl:sameAs isn't the Same: An Analysis of Identity Links on the Semantic Web. LDOW 2009
• [Isaac 2008] Antoine Isaac, Henk Matthezing, Lourens van der Meij, Stefan Schlobach, Shenghui Wang, Claus Zinn. Putting ontology alignment in context: usage scenarios, deployment and evaluation in a library case. ESWC 2008
• [Wang 2009] Shenghui Wang, Antoine Isaac, Balthasar Schopman, Stefan Schlobach, Lourens van der Meij. Matching multi-lingual subject vocabularies. ECDL 2009