A Semantic Web View on Concepts
and their Alignments
Antoine Isaac
Vrije Universiteit Amsterdam
Europeana
Concepts in Context, Köln, July 19th 2010
Linked Data Principles
1. Use URIs as names for things
2. Use HTTP URIs so that people can look up those names
3. When someone looks up a URI, provide useful information
using standards (RDF, SPARQL)
4. Include links to other URIs, so that they can discover more
things.
Tim Berners-Lee, http://linkeddata.org/
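A minimal sketch of principles 2 and 3 from the consumer's side, assuming only Python's standard library: it prepares (without sending) an HTTP request that asks for an RDF serialization of a URI through the Accept header. The helper name `build_rdf_request` and the media-type preference are illustrative assumptions, not part of the principles themselves.

```python
from urllib.request import Request

# Assumed preference order: Turtle first, RDF/XML as a fallback.
RDF_ACCEPT = "text/turtle, application/rdf+xml;q=0.9"


def build_rdf_request(uri: str) -> Request:
    """Prepare (but do not send) a GET request asking for RDF about `uri`."""
    if not uri.startswith(("http://", "https://")):
        # Principle 2: names should be HTTP URIs so that they can be looked up.
        raise ValueError("not an HTTP URI: " + uri)
    return Request(uri, headers={"Accept": RDF_ACCEPT})


req = build_rdf_request("http://dbpedia.org/resource/Park_Slope%2C_Brooklyn")
print(req.get_method(), req.full_url)
print("Accept:", req.get_header("Accept"))
```

Passing `req` to `urllib.request.urlopen` would perform the actual lookup; whether RDF comes back depends on the server supporting content negotiation.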
A way to publish Semantic Web data
A web of data
• Publish and re-use data via the web, building innovative
applications over former data silos
• Principle #4 is crucial to this vision:
Include links to other URIs, so that they can discover more
things.
http://linkeddata.org/
SKOS, Knowledge Organization Systems
and Linked Data
SKOS allows representing (simple) KOS data as RDF
animals
  NT cats
cats
  UF domestic cats
  RT wildcats
  BT animals
  SN used only for domestic cats
domestic cats
  USE cats
wildcats
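The thesaurus fragment above could be turned into SKOS statements roughly as follows. This is a sketch under assumptions: the `ex:` prefix, the helper names and the choice to fold the USE entry into a skos:altLabel on the preferred concept are illustrative, and triples are kept as plain Python tuples rather than built with an RDF library.

```python
# Thesaurus relation codes and the SKOS properties they usually correspond to.
RELATION_TO_SKOS = {
    "NT": "skos:narrower",
    "BT": "skos:broader",
    "RT": "skos:related",
}

RECORDS = {
    "animals": [("NT", "cats")],
    "cats": [
        ("UF", "domestic cats"),
        ("RT", "wildcats"),
        ("BT", "animals"),
        ("SN", "used only for domestic cats"),
    ],
    "domestic cats": [("USE", "cats")],
    "wildcats": [],
}


def uri(term):
    """Mint a hypothetical concept URI (ex: is a made-up namespace)."""
    return "ex:" + term.replace(" ", "_")


def literal(text):
    return '"' + text + '"'


def thesaurus_to_skos(records):
    """Return a sorted list of (subject, predicate, object) triples."""
    triples = set()
    for term, entries in records.items():
        use_targets = [value for code, value in entries if code == "USE"]
        if use_targets:
            # Non-preferred term: no concept of its own, only an alternative label.
            for target in use_targets:
                triples.add((uri(target), "skos:altLabel", literal(term)))
            continue
        concept = uri(term)
        triples.add((concept, "rdf:type", "skos:Concept"))
        triples.add((concept, "skos:prefLabel", literal(term)))
        for code, value in entries:
            if code in RELATION_TO_SKOS:
                triples.add((concept, RELATION_TO_SKOS[code], uri(value)))
            elif code == "UF":
                triples.add((concept, "skos:altLabel", literal(value)))
            elif code == "SN":
                triples.add((concept, "skos:scopeNote", literal(value)))
    return sorted(triples)


triples = thesaurus_to_skos(RECORDS)
for s, p, o in triples:
    print(s, p, o, ".")
```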
SKOS, KOSs and LD
SKOS allows bridging across KOSs from different contexts
http://www.w3.org/2004/02/skos/
Some landmark KOS LD implementations
• Many libraries – not a surprise!
– Swedish National Library’s Libris catalogue and thesaurus http://libris.kb.se/
– Library of Congress’ vocabularies, including LCSH http://id.loc.gov/
– DNB’s Gemeinsame Normdatei (incl. SWD subject headings) http://d-nb.info/gnd/
  Documentation at https://wiki.d-nb.de/display/LDS
– BnF’s RAMEAU subject headings http://stitch.cs.vu.nl/
– OCLC’s DDC classification http://dewey.info/ and VIAF http://viaf.org/
– STW economy thesaurus http://zbw.eu/stw
– National Library of Hungary’s catalogue and thesauri http://oszkdk.oszk.hu/resource/DRJ/404 (example)
• Other fields
– Wikipedia categories through DBpedia http://dbpedia.org/
– New York Times subject headings http://data.nytimes.com/
– IVOA astronomy vocabularies http://www.ivoa.net/Documents/latest/Vocabularies.html
– GEMET environmental thesaurus http://eionet.europa.eu/gemet
– UMTHES
– Agrovoc http://aims.fao.org/
– Linked Life Data http://linkedlifedata.com/
– Taxonconcept http://www.taxonconcept.org/
– UK Public sector vocabularies http://standards.esd.org.uk/ (e.g., http://id.esd.org.uk/lifeEvent/7 )
KOS Alignments?
Quite a few of them are linked to some other resource
• LCSH, SWD and RAMEAU interlinked through MACS mappings
• GND linked to DBpedia and VIAF
• Libris linked to LCSH
• Agrovoc to CAT, NAL, SWD, GEMET
• NYT to Freebase, DBpedia, GeoNames
• Links to DBpedia are overwhelmingly common
Hungary, STW, TaxonConcept, GND…
Is that enough? Are these links any good?
Sparse linkage: the LD cloud
[Cyganiak, Jentzsch] http://linkeddata.org/
Sparse linkage: another view
[Guéret, 2010] http://blog.larkc.eu/?p=1941
Linked Data Issues
Mike Uschold’s “semantic elephants”
• Proliferation of URIs, Managing Coreference
• Versioning and URIs
• Overloading owl:sameAs
http://lists.w3.org/Archives/Public/public-lod/2010May/0012.html
What kind of links?
Coreference links are the most used (and needed)
• owl:sameAs
• skos:exactMatch
• skos:closeMatch
• rdfs:seeAlso
• umbel:isLike
Overloading owl:sameAs
• Formally, two URIs linked by owl:sameAs are inferred
to have the same properties
ex:a name “Antoine Isaac” .
ex:b owl:sameAs ex:a .
Implies ex:b name “Antoine Isaac” .
• Many owl:sameAs statements are asserted between
resources that are only very similar [Halpin 2009]
The same resource but in different contexts, a reference…
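The inference on this slide can be reproduced with a small sketch. It is a deliberate simplification, not full OWL semantics: it only treats owl:sameAs as symmetric and copies properties in subject position, and the function name is an assumption made for illustration.

```python
SAME_AS = "owl:sameAs"


def propagate_same_as(triples):
    """Copy subject-position properties across owl:sameAs until a fixpoint."""
    facts = set(triples)
    while True:
        new = set()
        for a, p, b in facts:
            if p != SAME_AS:
                continue
            new.add((b, SAME_AS, a))  # owl:sameAs is symmetric
            for s, q, o in facts:
                if q != SAME_AS and s == a:
                    new.add((b, q, o))  # b inherits every property of a
        if new <= facts:
            return facts
        facts |= new


asserted = [
    ("ex:a", "name", '"Antoine Isaac"'),
    ("ex:b", SAME_AS, "ex:a"),
]
inferred = propagate_same_as(asserted)
for triple in sorted(inferred - set(asserted)):
    print(*triple, ".")
```

Because every property is copied, any management metadata attached to one URI ends up on the other as well, which is why asserting owl:sameAs between merely similar resources is risky.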
Case study: New York Times
• 10K concepts (places, descriptors, persons, organizations)
http://data.nytimes.com
• Manually or automatically mapped by NYT staff to DBpedia,
Freebase, GeoNames
Linking LD cloud to NYT articles!
Makes it easy to mix NYT content with other content
• Started with quite messy modeling
http://data.nytimes.com/60694995023816375851
dcterms:rightsHolder The New York Times Company .
http://data.nytimes.com/60694995023816375851
owl:sameAs http://dbpedia.org/resource/Park_Slope%2C_Brooklyn .
Clearer KOS alignments (1)
What is being aligned?
Concepts, documents, real-world entities “out there”
(persons, places…)
• In principle owl:sameAs should not be applied across disjoint
categories
• But even for one category there can be issues
• Two KOS concepts representing the same notion but with different
management metadata attached (skos:changeNote)
Clearer KOS alignments (2)
How is it aligned? Distinguish:
• exact co-reference
• conceptual similarity, including equivalence
• classification
• Making clearer distinctions between conceptual links
• skos:narrowMatch, skos:broadMatch, skos:relatedMatch
• Minimize ontological commitment for KOS data consumers
• skos:exactMatch: concepts can be used interchangeably across a wide range
of information retrieval applications. skos:exactMatch is a transitive property
• skos:closeMatch: In order to avoid the possibility of "compound errors" when
combining mappings across more than two concept schemes, skos:closeMatch
is not declared to be a transitive property
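The practical difference between the two properties can be sketched as follows: chains of skos:exactMatch are followed, chains of skos:closeMatch are not. The concept identifiers and the three schemes `a:`, `b:` and `c:` are hypothetical, and only transitivity is modelled here (symmetry is left out for brevity).

```python
TRANSITIVE = {"skos:exactMatch"}  # skos:closeMatch is deliberately absent


def transitive_closure(pairs):
    """Naive transitive closure over a set of (source, target) pairs."""
    closure = set(pairs)
    while True:
        new = {
            (a, d)
            for (a, b) in closure
            for (c, d) in closure
            if b == c and a != d
        }
        if new <= closure:
            return closure
        closure |= new


def entail(mappings):
    """mappings: {property: set of pairs}; close only the transitive properties."""
    return {
        prop: transitive_closure(pairs) if prop in TRANSITIVE else set(pairs)
        for prop, pairs in mappings.items()
    }


mappings = {
    "skos:exactMatch": {("a:cats", "b:cats"), ("b:cats", "c:chats")},
    "skos:closeMatch": {("a:cats", "b:felines"), ("b:felines", "c:animaux")},
}
entailed = entail(mappings)
for prop in sorted(entailed):
    for source, target in sorted(entailed[prop]):
        print(source, prop, target, ".")
```

The close-match chain stops after one hop, so an approximation made between schemes a and b is not compounded with another one made between b and c.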
Case study: New York Times (2)
Data quality has considerably improved
• Factual data is attached to the concept itself; management data is attached
to the resource representing the data source (context)
http://data.nytimes.com/60694995023816375851 rdf:type skos:Concept ;
skos:prefLabel “Park Slope (NYC)” ;
geo:lat “40.6701033” ;
owl:sameAs http://dbpedia.org/resource/Park_Slope%2C_Brooklyn .
http://data.nytimes.com/60694995023816375851.rdf
dcterms:rightsHolder “The New York Times Company” ;
foaf:primaryTopic http://data.nytimes.com/60694995023816375851
• Still, for resources linked with owl:sameAs, statements representing
different modeling choices can be merged
E.g., the DBpedia resource might not be a skos:Concept, or might use a different latitude format
Clearer KOS alignments (3)
What is the alignment for?
• SKOS mapping properties use the notion of validity within one
application context
• Application context for mapping has been investigated in
thesaurus interoperability studies
• Application of alignments matters:
• STITCH application scenarios for Cultural Heritage: book re-indexing,
thesaurus merging, query reformulation…
• The same alignment performs differently for different scenarios
[Isaac 2008, Wang 2009]
Application-specific alignment evaluation
Example: OAEI 2007 campaign, 3 matching tools
evaluated for thesaurus merging & book re-indexing
[Figure: two bar charts, each with a 0% to 100% vertical axis and a legend listing Falcon, Silas and DSSim; horizontal-axis labels: Precision, Coverage, Pa, Ra]
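As a rough illustration of how precision and coverage figures of this kind can be computed, here is a sketch that scores one tool's mappings against a set of correct mappings. The definitions used (coverage as the share of known correct mappings that the tool found) and all mappings other than gtt:excavation to brinkman:excavation are assumptions for the example; the campaign's actual evaluation procedure may differ.

```python
def precision_and_coverage(found, correct):
    """Score a set of (source, target) mappings against known correct ones."""
    found, correct = set(found), set(correct)
    hits = found & correct
    precision = len(hits) / len(found) if found else 0.0
    coverage = len(hits) / len(correct) if correct else 0.0
    return precision, coverage


# Hypothetical reference mappings (only the first pair appears in the slides).
correct = {
    ("gtt:excavation", "brinkman:excavation"),
    ("gtt:t1", "brinkman:t1"),
    ("gtt:t2", "brinkman:t2"),
    ("gtt:t3", "brinkman:t3"),
    ("gtt:t4", "brinkman:t4"),
    ("gtt:t5", "brinkman:t5"),
}

# Hypothetical output of one matching tool: three right, one wrong.
tool_output = {
    ("gtt:excavation", "brinkman:excavation"),
    ("gtt:t1", "brinkman:t1"),
    ("gtt:t2", "brinkman:t2"),
    ("gtt:t3", "brinkman:t9"),
}

precision, coverage = precision_and_coverage(tool_output, correct)
print(f"precision={precision:.2f} coverage={coverage:.2f}")
```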
Application-specific alignments
Why?
Take 2 thesauri at the Nat. Library of the Netherlands: GTT and
Brinkman
• For thesaurus merging, gtt:excavation should be aligned to
brinkman:excavation
• For book re-indexing, gtt:excavation should be aligned to
brinkman:archeology_netherlands
• Requires a finer representation grain for the context
in which the alignment is produced
– Who created it?
– Manual vs. automatic?
– Which alignment strategy or tool?
– Is there a degree of confidence?
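One way to picture this finer grain is a mapping record that carries its own context, so that an application can pick the mappings made for its scenario. The field names, creators, tool name and confidence values below are made up for illustration; only the two gtt/brinkman pairs and their scenarios come from the example above, and the relation chosen for the re-indexing case is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mapping:
    source: str
    target: str
    relation: str      # e.g. "skos:exactMatch"
    scenario: str      # application context the mapping was produced for
    creator: str       # who created it
    method: str        # "manual" or "automatic"
    tool: str          # alignment strategy or tool, "" if none
    confidence: float  # degree of confidence between 0 and 1


MAPPINGS = [
    Mapping(
        source="gtt:excavation", target="brinkman:excavation",
        relation="skos:exactMatch", scenario="thesaurus merging",
        creator="hypothetical-librarian", method="manual", tool="",
        confidence=1.0,
    ),
    Mapping(
        source="gtt:excavation", target="brinkman:archeology_netherlands",
        relation="skos:relatedMatch", scenario="book re-indexing",
        creator="hypothetical-matcher-run", method="automatic",
        tool="hypothetical-tool", confidence=0.8,
    ),
]


def select(mappings, scenario, min_confidence=0.0, method=None):
    """Keep the mappings produced for one scenario, optionally filtered further."""
    return [
        m for m in mappings
        if m.scenario == scenario
        and m.confidence >= min_confidence
        and (method is None or m.method == method)
    ]


for m in select(MAPPINGS, "book re-indexing"):
    print(m.source, m.relation, m.target, f"({m.method}, {m.confidence})")
```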
Case study: New York Times (3)
• Using nyt:mapping_strategy property with nyt:manual or
nyt:automatic:
http://data.nytimes.com/60694995023816375851.rdf
nyt:mapping_strategy http://data.nytimes.com/elements/manual .
• Problem: it applies to the context file for the concept, not to
the statement itself:
http://data.nytimes.com/60694995023816375851
owl:sameAs http://dbpedia.org/resource/Park_Slope%2C_Brooklyn .
• Using simple binary properties (skos:exactMatch…) between
aligned resources does not allow for much flexibility
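A sketch of what statement-level metadata could look like with standard RDF reification: the owl:sameAs statement becomes a resource of type rdf:Statement, and the mapping strategy is attached to that resource instead of the context file. The blank node label `_:m1` and the helper name are assumptions, and this is not how the NYT data is actually published.

```python
def reify(statement, node, annotations):
    """Describe `statement` as an rdf:Statement and attach metadata to it."""
    s, p, o = statement
    triples = [
        (node, "rdf:type", "rdf:Statement"),
        (node, "rdf:subject", s),
        (node, "rdf:predicate", p),
        (node, "rdf:object", o),
    ]
    # Metadata now describes the link itself, not the whole concept file.
    triples += [(node, prop, value) for prop, value in annotations.items()]
    return triples


link = (
    "<http://data.nytimes.com/60694995023816375851>",
    "owl:sameAs",
    "<http://dbpedia.org/resource/Park_Slope%2C_Brooklyn>",
)
reified = reify(
    link,
    "_:m1",
    {"nyt:mapping_strategy": "<http://data.nytimes.com/elements/manual>"},
)
for s, p, o in reified:
    print(s, p, o, ".")
```

One link turns into five triples, which gives a feel for the extra cost that richer link metadata puts on publishers and consumers.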
Ontology Matching community
practices
• Community investigating the ontology and vocabulary
matching issues
Ontology Alignment Evaluation Initiative
http://oaei.ontologymatching.org
• Matching tools produce some metadata
• Metadata repositories store and manage them
– Bioportal http://bioportal.bioontology.org/
– CATCH vocabulary and alignment repository
http://stitch.cs.vu.nl/repository/
…
• Consensus: richer alignment metadata is needed
From a simple representation
to a more complete one
http://alignapi.gforge.inria.fr/edoal.html
Can LD accommodate complex
representations?
• The strength of the LD vision lies in the relative simplicity of a
standard representation
• LD provides a simple way to publish data and follow one’s
nose to connected data
Serendipity!
• Reification and metadata on links are not really compatible
with it
Higher barrier for data publication and consumption
Peaceful co-existence
• Applications with a narrow scope that require precise
data can afford:
• Selecting alignments they consume
• Exploiting finer-grained representations
• Creating finer-grained representations
• Simple data for applications that are simple and/or
exploiting a wide range of datasets
• Simple mash-up applications robust to (limited) approximation
• Web-scale applications
Large-scale document retrieval, Concept discovery
Does it need to be perfect anyway?
• Do we really want to throw away crucial URI co-reference data?
http://sameAs.org has 35,187,488 URIs in 11,285,263 bundles
• Extensive linking to DBpedia is useful, even when the type of link
is not used in the theoretically correct way
Cf. BBC content and data mash-ups
http://www.bbc.co.uk/wildlifefinder/
http://www.bbc.co.uk/music/
• Issues with mixed quality are being tackled
– http://sameAs.org as a “service to provide you with help finding URIs”,
keeping track of data sources
– Representation and exchange of provenance info is under active
investigation
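The "bundles" mentioned above are groups of URIs connected by co-reference links. A minimal sketch of how such groups can be computed with a union-find structure follows; it is not the implementation behind sameAs.org, and every URI except the NYT and DBpedia ones from the case study is hypothetical.

```python
def bundles(same_as_pairs):
    """Group URIs into co-reference bundles using union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in same_as_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = {}
    for uri in list(parent):
        groups.setdefault(find(uri), set()).add(uri)
    return sorted(sorted(group) for group in groups.values())


pairs = [
    ("http://data.nytimes.com/60694995023816375851",
     "http://dbpedia.org/resource/Park_Slope%2C_Brooklyn"),
    ("http://dbpedia.org/resource/Park_Slope%2C_Brooklyn",
     "http://example.org/places/park-slope"),
    ("http://example.org/a", "http://example.org/b"),
]
result = bundles(pairs)
print(sum(len(b) for b in result), "URIs in", len(result), "bundles")
```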
Peaceful co-existence (2)
• If you have a complex representation, don’t be
pedantic: publish simpler data, too!
• Articulation between LD (to discover links) and
alignment repositories is needed
• Technically feasible, best practices have to be
identified
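Publishing simpler data alongside a complex representation could be as plain as flattening rich alignment records into binary links. The record layout, the confidence threshold and the values below are assumptions chosen for illustration; real alignment repositories use their own formats.

```python
# Hypothetical rich alignment records, e.g. as held in an alignment repository.
RICH_CELLS = [
    {"source": "gtt:excavation", "target": "brinkman:excavation",
     "relation": "skos:exactMatch", "confidence": 1.0, "method": "manual"},
    {"source": "gtt:excavation", "target": "brinkman:archeology_netherlands",
     "relation": "skos:relatedMatch", "confidence": 0.4, "method": "automatic"},
]


def to_simple_triples(cells, min_confidence=0.5):
    """Drop the metadata and keep only sufficiently confident binary links."""
    return [
        f"{c['source']} {c['relation']} {c['target']} ."
        for c in cells
        if c["confidence"] >= min_confidence
    ]


simple = to_simple_triples(RICH_CELLS)
for line in simple:
    print(line)
```

The rich records stay available for applications that need them, while the flattened links can be followed by any Linked Data client.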
Conclusions
• (Almost) any alignment is better than none
This is a web of data: without links there’s almost no value
• There is already great linking happening!
• More involvement from this community would
certainly help!
Alignments themselves & theoretical foundations
Thanks!
Possible participation channels:
Linked Open Data community (http://linkeddata.org) and
mailing list (public-lod@w3.org)
Library Linked Data W3C incubator group
(http://www.w3.org/2005/Incubator/lld/wiki/ ) and
community list (public-lld@w3.org)
References
• [Halpin 2009] Harry Halpin, Pat Hayes. When owl:sameAs isn't the Same:
An Analysis of Identity Links on the Semantic Web. LDOW 2009
• [Isaac 2008] Antoine Isaac, Henk Matthezing, Lourens van der Meij,
Stefan Schlobach, Shenghui Wang, Claus Zinn. Putting ontology alignment
in context: usage scenarios, deployment and evaluation in a library case.
ESWC 2008
• [Wang 2009] Shenghui Wang, Antoine Isaac, Balthasar Schopman, Stefan
Schlobach, Lourens van der Meij. Matching multi-lingual subject
vocabularies. ECDL 2009