Linked Data

advertisement
URI Resolution
in Linked Data
August 15, 2012
Seungwoo Lee
KISTI
IASLOD2012
1
Copyright © 2004-2012, KISTI
Linked Data

Linked data describes a method of publishing structured data so
that it can be interlinked and become more useful.
-Wikipedia
User-generated
Media
31,634,213,770 Triples
Government
Publications
504m RDF Links
Cross-domain
Geography
IASLOD2012
Life sciences
2
http://www4.wiwiss.fu-berlin.de/lodcloud/state/
Copyright © 2004-2012, KISTI
Linked Data Principles

Four principles (by Tim Berners-Lee)

Use URIs as names for things

Use HTTP URIs so that people can look up those names

When someone looks up a URI, provide useful information,
using the standards (RDF, SPARQL)

Include links to other URIs so that they can discover more
things
IASLOD2012
3
Copyright © 2004-2012, KISTI
RDF Linking: three types

Relationship Links



point to related things in other data sources
Identity Links

point to URI aliases used by other data sources

expressed by owl:sameAs
Vocabulary Links

point to related vocabulary terms (concepts)

expressed by owl:equivalentClass, owl:equivalentProperty,
rdfs:subClassOf, rdfs:subPropertyOf
IASLOD2012
4
Copyright © 2004-2012, KISTI
RDF Linking: three types (cont’)
owl:equivalentClass
my:Person
http://xmlns.com/foaf/0.1/Person
my:JohnSmith
my:name
my:liveIn
my:topic
“Daejeon”
my:age
http://sws.geonames.org/1835235/
my:SemanticWeb
“John
Smith”
“30”
rdfs:label
“Semantic
Web”@en
rdfs:label
owl:sameAs
“시맨틱
웹”@kr
http://dbpedia.org/resource/Semantic_Web
My Ontology
IASLOD2012
5
Copyright © 2004-2012, KISTI
Decentralization of Identity


Using HTTP URIs means that

Globally unique names can be created in a decentralized
fashion

Every owner of a domain name may create their own URIs for
every entity
As a result,

one real-object can be referred by more than two different
URIs

we need


IASLOD2012
to find existing URIs of an interesting entity from Linked Data
cloud
to detect duplicate URIs between different data sources
6
Copyright © 2004-2012, KISTI
How to find existing URIs

Use a data set-specific search interface (such as a
SPARQL endpoint)

There are some tools that index and search URIs by
keywords

Sindice (http://www.sindice.com)

Falcons (http://iws.seu.edu.cn/services/falcons/objectsearch)


temporarily unavailable
To set RDF links manually

IASLOD2012
for small, static data sets
7
Copyright © 2004-2012, KISTI
Sindice
IASLOD2012
8
Copyright © 2004-2012, KISTI
How to detect duplicate URIs

To set RDF links (usually, identity links) by automated or semiautomated methods


Scalable to larger data sets
General approaches

Key-based: commonly accepted identifiers



Similarity-based heuristics


included in URIs
a value of a property of type owl:InverseFunctionalProperty
compare multiple property-values
There are several services that can detect duplication of URI pairs
by key or similarity metrics

Silk

LIMES

<sameAs> (http://sameas.org)

OntoURIResolver
IASLOD2012
9
Copyright © 2004-2012, KISTI
Silk
Link discovery and maintenance framework
 Three components:


A link discovery engine that computes links between data
sources


based on declarative specification of the conditions

A tool for evaluating the generated data links

A protocol for maintaining data links when data set can be
changed
Downloadable from

IASLOD2012
http://www4.wiwiss.fu-berlin.de/bizer/silk/
10
Copyright © 2004-2012, KISTI
LIMES


Large-scale link discovery on linked data
Time-efficient approach based on the characteristics of metric
spaces


Reduce the number of comparisons by distance approximation based
on the triangle inequality
Downloadable from

IASLOD2012
http://aksw.org/Projects/limes
11
Copyright © 2004-2012, KISTI
<sameAs>
Manage and help to find co-references between different data sets
 Search by a URI



When a keyword is given, it is first mapped into URIs using Sindice
Provide bundles of URIs as a result
owl:sameAs
rkb:coreferenceData
umbel:isLike
skos:exactMatch
openvocab:similarTo
IASLOD2012
12
Copyright © 2004-2012, KISTI
OntoURIResolover - Motivation


Why should we know in advance which data sets could be targeted?

It may be quite a burden to Linked Data creators

Instead, we can use Linked Data search engines such as Sindice
In some applications, bulk-to-bulk resolution approaches, such as Silk
and LIMES, are less acceptable


There often exist false-equivalences of URIs as well as unknownequivalences


E.g., an application that resolves entities retrieved by Linked Data search engines
Even slightly wrong equivalence may cause serious problems due to network effect of
Linked Data
When more than one URI aliases are available, which one is more
representative or preferable?

Some URIs may be deprecated or unreachable

Want to discriminate RDF URIs from non-RDF URIs
IASLOD2012
13
Copyright © 2004-2012, KISTI
OntoURIResolover - Features

Use an existing Linked Data search engine, Sindice, and an existing
identity link repository, <sameAs>,


Take an entity-to-entity resolution approach



Some entities have too large number of properties, most of which are not discriminative
Classify or filter out erroneous data such as

false equivalences (from <sameAs>)

deprecated URIs

unreachable URIs

non-RDF URIs
Recommend a canonical URI among aliases


Contrary to the bulk-to-bulk method, we take real-time entity-to-entity approach which is more
appropriate for our environment
Define and use representative properties for types of target entities


not to bound target data sets in advance
having the most plentiful attributes and relationships
Web-based user interface
IASLOD2012
14
Copyright © 2004-2012, KISTI
OntoURIResolover - Processes
URI, Entity name
1. Collecting URI List
URI List
2. Collecting
Contextual Info.
3. Type-based
Grouping
4. Loading
Resolution Rules
URI, Entity name
Request RDF
7. Verifying Results
Triple
Criteria
Triple
Store
Request RDF
Store
Resolved Result
IASLOD2012
Service
Caching
5. Direct Clustering
6. Indirect Clustering
External
Linked Data
15
Copyright © 2004-2012, KISTI
Experimental Result


29 unique author names with more than 10 URIs

488 URIs which have 712 kinds of unique properties

From more than 20 data sources
Metrics

Average Cluster Purity: ACP


measuring the cluster quality
The higher the value is, the better the clustering quality is

Average Author Purity: AAP
 to consider excessive fragmentation of truly-equivalent URIs
 The higher the value is, the less fragmented the same author set
is

K-measure: the geometric mean of ACP and AAP
IASLOD2012
16
Copyright © 2004-2012, KISTI
Experimental Result (cont’)
IASLOD2012
17
Copyright © 2004-2012, KISTI
OntoURIResolver Demo
Demo Site: http://our.kisti.re.kr
IASLOD2012
18
Copyright © 2004-2012, KISTI
References





[Linked Data-Design Issues] Tim Berners-Lee,
http://www.w3.org/DesignIssues/LinkedData.html, 2006.
[Linked Data] Tom Heath and Christian Bizer. Linked Data: Evolving
the Web into a Global Data Space (1st edition). Synthesis Lectures on the
Semantic Web: Theory and Technology, 1-136. Morgan & Claypool, 2011
[Silk] Julius Volz, Christian Bizer, Martin Gaedke, and Georgi Kobilarov.
Discovering and maintaining links on the web of data. In Proceedings of
the International Semantic Web Conference, pages 650–665, 2009.
[LIMES] Axel Cyrille, Ngonga Ngomo, and Soeren Auer. LIMES - a timeefficient approach for large-scale link discovery on the web of data.
http://svn.aksw.org/papers/2011/WWW_LIMES/public.pdf, 2010.
[K-measure] A. Solomonoff, A. Mielke, M. Schmidt, and H. Gish,
Clustering Speakers by Their Voices. In Proceedings of the IEEE
International Conference on Acoustics, Speech, and Signal Processing,
Seattle, 1998.
IASLOD2012
19
Copyright © 2004-2012, KISTI
Thank you
Seungwoo Lee (swlee@kisti.re.kr)
IASLOD2012
20
Copyright © 2004-2012, KISTI
Download