URI Resolution in Linked Data
August 15, 2012
Seungwoo Lee, KISTI
IASLOD2012
Copyright © 2004-2012, KISTI

Linked Data
  "Linked data describes a method of publishing structured data so that it can be interlinked and become more useful." (Wikipedia)
  [Figure: LOD cloud diagram. Domains: User-generated, Media, Government, Publications, Cross-domain, Geography, Life sciences. Totals: 31,634,213,770 triples; 504m RDF links. Source: http://www4.wiwiss.fu-berlin.de/lodcloud/state/]

Linked Data Principles
  Four principles (by Tim Berners-Lee)
  - Use URIs as names for things
  - Use HTTP URIs so that people can look up those names
  - When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL)
  - Include links to other URIs so that they can discover more things

RDF Linking: three types
  - Relationship Links
    - point to related things in other data sources
  - Identity Links
    - point to URI aliases used by other data sources
    - expressed by owl:sameAs
  - Vocabulary Links
    - point to related vocabulary terms (concepts)
    - expressed by owl:equivalentClass, owl:equivalentProperty, rdfs:subClassOf, rdfs:subPropertyOf

RDF Linking: three types (cont')
  [Figure: example RDF graph labeled "My Ontology". Local nodes: my:Person, my:JohnSmith, my:SemanticWeb. External URIs: http://xmlns.com/foaf/0.1/Person, http://sws.geonames.org/1835235/, http://dbpedia.org/resource/Semantic_Web. Properties: my:name, my:age, my:liveIn, my:topic, rdfs:label, owl:equivalentClass, owl:sameAs. Literals: "John Smith", "30", "Daejeon", "Semantic Web"@en, "시맨틱 웹"@kr]

Decentralization of Identity
  Using HTTP URIs means that
  - Globally unique names can be created in a decentralized fashion
  - Every owner of a domain name may create their own URIs for every entity
  As a result, one real-world object can be referred to by two or more different URIs
  We need
  - to find existing URIs for an entity of interest in the Linked Data cloud
  - to detect duplicate URIs across different data sources

How to find existing URIs
  - Use a data set-specific search interface (such as a SPARQL endpoint)
  - Use tools that index URIs and search them by keyword
    - Sindice (http://www.sindice.com)
    - Falcons (http://iws.seu.edu.cn/services/falcons/objectsearch), temporarily unavailable
  - Used to set RDF links manually
    - suitable for small, static data sets

Sindice

How to detect duplicate URIs
  - Used to set RDF links (usually identity links) by automated or semi-automated methods
    - scalable to larger data sets
  - General approaches
    - Key-based: commonly accepted identifiers
      - included in URIs
      - given as the value of a property of type owl:InverseFunctionalProperty
    - Similarity-based heuristics
      - compare multiple property values
  - Several services detect duplicate URI pairs by key or by similarity metrics
    - Silk
    - LIMES
    - <sameAs> (http://sameas.org)
    - OntoURIResolver

Silk
  Link discovery and maintenance framework
  Three components:
  - A link discovery engine that computes links between data sources based on a declarative specification of the linking conditions
  - A tool for evaluating the generated data links
  - A protocol for maintaining data links when data sets change
  Downloadable from http://www4.wiwiss.fu-berlin.de/bizer/silk/

LIMES
  Large-scale link discovery on linked data
  - Time-efficient approach based on the characteristics of metric spaces
  - Reduces the number of comparisons by approximating distances using the triangle inequality
  Downloadable from http://aksw.org/Projects/limes

<sameAs>
  - Manages and helps to find co-references between different data sets
  - Search by a URI
    - When a keyword is given, it is first mapped to URIs using Sindice
  - Provides bundles of URIs as a result, linked by
    - owl:sameAs
    - rkb:coreferenceData
    - umbel:isLike
    - skos:exactMatch
    - openvocab:similarTo

OntoURIResolver - Motivation
  - Why should we know in advance which data sets could be targeted?
    - It may be quite a burden to Linked Data creators
    - Instead, we can use Linked Data search engines such as Sindice
  - In some applications, bulk-to-bulk resolution approaches, such as Silk and LIMES, are less acceptable
    - E.g., an application that resolves entities retrieved by Linked Data search engines
  - False equivalences between URIs often exist, as well as unknown equivalences
    - Even a slightly wrong equivalence may cause serious problems due to the network effect of Linked Data
  - When more than one URI alias is available, which one is more representative or preferable?
  - Some URIs may be deprecated or unreachable
  - We want to discriminate RDF URIs from non-RDF URIs

OntoURIResolver - Features
  - Uses an existing Linked Data search engine, Sindice, and an existing identity link repository, <sameAs>
    - so that target data sets need not be fixed in advance
  - Takes an entity-to-entity resolution approach
    - Contrary to the bulk-to-bulk method, we take a real-time entity-to-entity approach, which is more appropriate for our environment
  - Defines and uses representative properties for each type of target entity
    - Some entities have a very large number of properties, most of which are not discriminative
  - Classifies or filters out erroneous data such as
    - false equivalences (from <sameAs>)
    - deprecated URIs
    - unreachable URIs
    - non-RDF URIs
  - Recommends a canonical URI among aliases
    - the one having the most plentiful attributes and relationships
  - Web-based user interface

OntoURIResolver - Processes
  Input: URI, entity name
  1. Collecting URI List
  2. Collecting Contextual Info.
  3. Type-based Grouping
  4. Loading Resolution Rules
  5. Direct Clustering
  6. Indirect Clustering
  7. Verifying Results
  [Figure: process flow diagram. Other labels: URI List, Request RDF, Triple Criteria, Triple Store, Store, Resolved Result, Service Caching, External Linked Data]

Experimental Result
  - 29 unique author names, each with more than 10 URIs
    - 488 URIs with 712 distinct properties
    - drawn from more than 20 data sources
  - Metrics
    - Average Cluster Purity (ACP): measures cluster quality
      - The higher the value, the better the clustering quality
    - Average Author Purity (AAP): accounts for excessive fragmentation of truly equivalent URIs
      - The higher the value, the less fragmented each author's URI set is
    - K-measure: the geometric mean of ACP and AAP

Experimental Result (cont')

OntoURIResolver Demo
  Demo site: http://our.kisti.re.kr

References
  [Linked Data-Design Issues] Tim Berners-Lee. Linked Data - Design Issues. http://www.w3.org/DesignIssues/LinkedData.html, 2006.
  [Linked Data] Tom Heath and Christian Bizer. Linked Data: Evolving the Web into a Global Data Space (1st edition). Synthesis Lectures on the Semantic Web: Theory and Technology, 1-136. Morgan & Claypool, 2011.
  [Silk] Julius Volz, Christian Bizer, Martin Gaedke, and Georgi Kobilarov. Discovering and maintaining links on the web of data. In Proceedings of the International Semantic Web Conference, pages 650–665, 2009.
  [LIMES] Axel-Cyrille Ngonga Ngomo and Sören Auer. LIMES - a time-efficient approach for large-scale link discovery on the web of data. http://svn.aksw.org/papers/2011/WWW_LIMES/public.pdf, 2010.
  [K-measure] A. Solomonoff, A. Mielke, M. Schmidt, and H. Gish. Clustering speakers by their voices. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Seattle, 1998.

Thank you
  Seungwoo Lee (swlee@kisti.re.kr)
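
Supplementary note: computing the K-measure
  The Experimental Result slide defines the K-measure as the geometric mean of ACP and AAP but does not give formulas for the two purities. The sketch below assumes the purity definitions commonly used in the speaker-clustering literature (see [K-measure]): the purity of a cluster is the sum over authors of (n_ij / n_i)^2, and ACP is the size-weighted average of these cluster purities; AAP is the same computation with the roles of clusters and authors swapped. The function name k_measure and the input layout (one (cluster_id, author_id) pair per URI) are hypothetical and chosen only for illustration; this is not the evaluation code used for the reported results.

```python
from collections import Counter
from math import sqrt


def k_measure(pairs):
    """Return (ACP, AAP, K) for a clustering of URIs.

    pairs: iterable of (cluster_id, author_id), one pair per URI, where
    cluster_id is the cluster assigned by the resolver and author_id is
    the true author. Both purity formulas are assumptions (see lead-in).
    """
    pairs = list(pairs)
    n = len(pairs)
    if n == 0:
        raise ValueError("at least one (cluster_id, author_id) pair is required")

    joint = Counter(pairs)                          # n_ij: URIs of author j in cluster i
    cluster_sizes = Counter(c for c, _ in pairs)    # n_i.: size of cluster i
    author_sizes = Counter(a for _, a in pairs)     # n_.j: URIs of author j

    # ACP = (1/N) * sum_i sum_j n_ij^2 / n_i.
    acp = sum(cnt * cnt / cluster_sizes[c] for (c, _), cnt in joint.items()) / n
    # AAP = (1/N) * sum_j sum_i n_ij^2 / n_.j
    aap = sum(cnt * cnt / author_sizes[a] for (_, a), cnt in joint.items()) / n

    # K-measure: geometric mean of ACP and AAP, as stated on the slide.
    return acp, aap, sqrt(acp * aap)


if __name__ == "__main__":
    # Toy data: cluster c1 mixes two authors, and author B is split over two clusters.
    sample = [("c1", "A"), ("c1", "A"), ("c1", "B"), ("c2", "B")]
    acp, aap, k = k_measure(sample)
    print(f"ACP={acp:.3f} AAP={aap:.3f} K={k:.3f}")
```

  Under these assumed definitions, merging URIs of different authors into one cluster lowers ACP, splitting one author's URIs over several clusters lowers AAP, and K reaches 1 only when the clusters match the authors exactly.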