KD2R: a Key Discovery method for
semantic Reference Reconciliation
Danai Symeonidou, Nathalie Pernelle and Fatiha Saïs
LRI (University Paris-Sud)
WOD’2013
June 3rd
Danai Symeonidou, WOD’2013
Data Linking
• More and more heterogeneous RDF sources
• Links can be asserted between them
▫ sameAs is one of the most important types of links: it allows
combining information given in different data sources
▫ LOD: the number of already existing links is very small
• How to create links automatically?
Data Linking Problem
Dataset1
P1  FirstName: George   LastName: Thomson   SSN: 011223456   Job: Artist
P2  FirstName: George   LastName: Thomson   SSN: 444223456   Job: Professor
Dataset2
P3  FirstName: George   LastName: Thomson   SSN: 011223456   Age: 45
Data Linking Problem
Dataset1
P1  FirstName: George   LastName: Thomson   SSN: 011223456   Job: Artist
P2  FirstName: George   LastName: Thomson   SSN: 444223456   Job: Professor
Dataset2
P3  FirstName: George   LastName: Thomson   SSN: 011223456   Age: 45
[Figure: two SameAs links drawn between the records of Dataset1 and Dataset2]
Data Linking with or without key constraints
• No knowledge given about the properties:
▪ all the properties have the same importance.
• Knowledge given by an expert:
▪ Specific expert rules [Arasu et al.’09, Low et al.’01, Volz et al.’09 (Silk)]
Example: max(jaro(phone-number; phone-number); jaro-winkler(SSN; SSN)) > 0.88
▪ Key constraints [Saïs, Pernelle and Rousset’09]
Example: hasKey(Museum (museumName) (museumAddress))
• OWL2 Key for a class expression: a combination of (inverse)
properties which uniquely identifies an entity
▫ hasKey( CE ( OPE1 ... OPEm ) ( DPE1 ... DPEn ) )
Example: hasKey(Museum (museumName) (museumAddress)) expresses:
Museum(x1) ∧ Museum(x2) ∧ museumName(x1, y) ∧ museumName(x2, y)
∧ museumAddress(x1, w) ∧ museumAddress(x2, w) ⇒ sameAs(x1, x2)
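The rule above can be sketched in a few lines. This is only an illustration, assuming instances are stored as plain Python dicts mapping each property to a set of values; the function name `link_by_key` and the sample museums are hypothetical and not taken from KD2R or LN2R.

```python
from itertools import combinations

def link_by_key(instances, key):
    """Return sameAs pairs: two instances sharing a value for every key property.

    instances: dict id -> dict property -> set of values (assumed layout).
    key: tuple of property names, e.g. ("museumName", "museumAddress").
    """
    links = []
    for a, b in combinations(sorted(instances), 2):
        # The key fires only if each key property has at least one common value.
        if all(instances[a].get(p, set()) & instances[b].get(p, set()) for p in key):
            links.append((a, b))
    return links

# Hypothetical sample data, invented for the example.
museums = {
    "m1": {"museumName": {"Louvre"}, "museumAddress": {"Rue de Rivoli, Paris"}},
    "m2": {"museumName": {"Louvre"}, "museumAddress": {"Rue de Rivoli, Paris"}},
    "m3": {"museumName": {"Louvre"}, "museumAddress": {"Lens"}},
}
same_as = link_by_key(museums, ("museumName", "museumAddress"))
print(same_as)
```

Only m1 and m2 agree on both key properties, so m3 is not linked even though it shares the name.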
Key Discovery Problem
• Problem: when data sources contain large amounts of data and/or complex
ontologies
▪ Some keys are not obvious to find.
▪ Erroneous keys can be given by the expert.
• Aim: automatic discovery of a complete set of keys from data
• Naïve automatic way to discover keys: examine all the possible
combinations of properties
▫ Example: given an instance described by 15 properties, the number of
candidate keys is 2^15 - 1 = 32767
▫ For each candidate key we have to scan all the instances of the data
• Objective: find keys efficiently by:
▫ Reducing the combinations
▫ Partially scanning the data
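To make the cost of the naïve approach concrete, here is a small sketch of it. It assumes complete, single-valued data stored as a list of dicts; the names `count_candidate_keys`, `is_key` and `naive_keys` and the sample rows are illustrative only. It is the brute-force baseline, not the KD2R algorithm.

```python
from itertools import combinations

def count_candidate_keys(n_properties):
    # Every non-empty subset of the properties is a candidate key.
    return 2 ** n_properties - 1

def is_key(rows, props):
    """Full scan: props is a key if no two rows agree on all of them."""
    seen = set()
    for row in rows:
        signature = tuple(row[p] for p in props)
        if signature in seen:
            return False
        seen.add(signature)
    return True

def naive_keys(rows, properties):
    """Test every combination of properties, each with a full scan."""
    return [props
            for size in range(1, len(properties) + 1)
            for props in combinations(properties, size)
            if is_key(rows, props)]

# Illustrative rows (single-valued, no missing values).
rows = [
    {"lastName": "Tompson", "firstName": "Manuel"},
    {"lastName": "Tompson", "firstName": "Maria"},
    {"lastName": "David", "firstName": "George"},
    {"lastName": "Solgar", "firstName": "Michel"},
]
print(count_candidate_keys(15))
print(naive_keys(rows, ("lastName", "firstName")))
```

The number of combinations doubles with every added property, and each one costs a pass over all instances, which is what the two objectives above aim to avoid.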
Key Discovery Problem
• RDF data sources (conforming to an OWL 2 ontology)
• Mappings between classes and properties of the different ontologies
• Open world assumption (incomplete data) and multivalued properties
may exist
id   lastName   firstName   hasFriend
i1   Tompson    Manuel      i2, i3
i2   Tompson    Maria
i3   David      George
i4   Solgar     Michel      i2, i4
How to discover keys when we do not know whether:
i1 =?= i2 =?= i3 =?= i4
hasFriend(i1, i4), hasFriend(i2, i3) ... ? firstName(i1, Elodie) ... ?
Key Discovery Problem: Assumptions
• Unique Name Assumption (UNA): two different URIs refer to
distinct entities (e.g. data sources generated from relational
databases, YAGO)
i1 <> i2 <> i3 <> i4
• Two literals that are syntactically different are semantically
different
▫ (e.g. “Napoleon Bonaparte” <> “Napoleon”)
Key Discovery: Heuristics
• Heuristic 1 - Pessimistic:
▫ Not instantiated property ⇒ all the values are possible
▪ Example: hasFriend(i2, i3), hasFriend(i4, i2) are possible.
▫ Instantiated property ⇒ only given values are considered
▪ Example: not hasFriend(i1, i4)
id   lastName   firstName   hasFriend
i1   Tompson    Manuel      i2, i3
i2   Tompson    Maria
i3   David      George
i4   Solgar     Michel      i2, i4
Non keys: {lastName}, {hasFriend}
Keys: {firstName}, {lastName, firstName}, {firstName, hasFriend}
Undetermined keys: {hasFriend, lastName}
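The pessimistic classification can be reproduced on the example table with a short sketch. It assumes the UNA and the literal assumption stated earlier, and encodes a non-instantiated property as an empty set; the function name `classify_pessimistic` and the data layout are illustrative, not the KD2R implementation.

```python
from itertools import combinations

# Example table from the slides; an empty set means "not instantiated".
DATA = {
    "i1": {"lastName": {"Tompson"}, "firstName": {"Manuel"}, "hasFriend": {"i2", "i3"}},
    "i2": {"lastName": {"Tompson"}, "firstName": {"Maria"}, "hasFriend": set()},
    "i3": {"lastName": {"David"}, "firstName": {"George"}, "hasFriend": set()},
    "i4": {"lastName": {"Solgar"}, "firstName": {"Michel"}, "hasFriend": {"i2", "i4"}},
}

def classify_pessimistic(data, props):
    """Return 'non key', 'undetermined' or 'key' for the property set props."""
    undetermined = False
    for a, b in combinations(sorted(data), 2):
        certain, possible = True, True
        for p in props:
            va, vb = data[a][p], data[b][p]
            if not va or not vb:
                certain = False             # a missing value may or may not coincide
            elif not (va & vb):
                certain = possible = False  # both given and disjoint: never equal
                break
        if certain:
            return "non key"                # two instances surely share all props
        if possible:
            undetermined = True             # they might share them once completed
    return "undetermined" if undetermined else "key"

for props in (("lastName",), ("hasFriend",), ("firstName",), ("hasFriend", "lastName")):
    print(props, "->", classify_pessimistic(DATA, props))
```

On this table the sketch agrees with the lists above; it also reports the full set {lastName, firstName, hasFriend} as a key, which follows from {firstName} being one.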
Key Discovery: Heuristics
• Heuristic 2 - Optimistic:
▫ Not instantiated property ⇒ value not one of the already existing ones
▪ Example: not hasFriend(i2, i3), not hasFriend(i2, i1), not hasFriend(i2, i4).
▫ Instantiated property ⇒ only given values are considered
▪ Example: not hasFriend(i1, i4)
id   lastName   firstName   hasFriend
i1   Tompson    Manuel      i2, i3
i2   Tompson    Maria
i3   David      George
i4   Solgar     Michel      i2, i4
Non keys: {lastName}, {hasFriend}
Keys: {firstName}, {lastName, firstName}, {firstName, hasFriend},
{hasFriend, lastName}
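Under the optimistic heuristic a missing value never matches an existing one, so the check reduces to a plain intersection test. The sketch below assumes the same illustrative encoding as before (empty set for a non-instantiated property); `is_key_optimistic` is a hypothetical helper name.

```python
from itertools import combinations

# Example table from the slides; an empty set means "not instantiated".
DATA = {
    "i1": {"lastName": {"Tompson"}, "firstName": {"Manuel"}, "hasFriend": {"i2", "i3"}},
    "i2": {"lastName": {"Tompson"}, "firstName": {"Maria"}, "hasFriend": set()},
    "i3": {"lastName": {"David"}, "firstName": {"George"}, "hasFriend": set()},
    "i4": {"lastName": {"Solgar"}, "firstName": {"Michel"}, "hasFriend": {"i2", "i4"}},
}
PROPERTIES = ("lastName", "firstName", "hasFriend")

def is_key_optimistic(data, props):
    """props is a key unless two instances share a given value on every property."""
    for a, b in combinations(sorted(data), 2):
        # An empty (missing) value set intersects nothing, so it never matches.
        if all(data[a][p] & data[b][p] for p in props):
            return False
    return True

keys = [set(combo)
        for size in range(1, len(PROPERTIES) + 1)
        for combo in combinations(PROPERTIES, size)
        if is_key_optimistic(DATA, combo)]
print(keys)
```

Compared with the pessimistic case, {hasFriend, lastName} moves from undetermined to key, because the missing hasFriend value of i2 is assumed to differ from those of i1.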
KD2R approach
Topological sort of the classes (subsumption)
• Key Finder
▫ Discover non keys
▪ Ex: {lastName}, {hasFriend}
▫ Derive keys using non keys
▪ Ex: {firstName}, {lastName, firstName},
{firstName, hasFriend}, {hasFriend, lastName}
• Key Merge
▫ Cartesian product of minimal key sets in S1, S2
▪ Ex: Ks1 = {firstName}
Ks2 = {hasFriend}
Ks1-s2 = {firstName, hasFriend}
Technical report available:
https://www.lri.fr/~bibli/Rapports-internes/2013/RR1559.pdf
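The two steps can be illustrated with a brute-force sketch. It is not the algorithm of the technical report: it simply assumes that a property set is a key when it is not contained in any non key, and that merging takes the union of every pair of keys from the two sources and keeps the minimal results. The names `minimal_keys` and `merge_keys` are hypothetical.

```python
from itertools import combinations, product

def minimal_keys(properties, non_keys):
    """Minimal property sets that are not contained in any non key."""
    non_keys = [frozenset(nk) for nk in non_keys]
    keys = []
    for size in range(1, len(properties) + 1):
        for combo in combinations(sorted(properties), size):
            candidate = frozenset(combo)
            if any(candidate <= nk for nk in non_keys):
                continue  # contained in a non key, so it is a non key too
            if any(k <= candidate for k in keys):
                continue  # a smaller key already covers it
            keys.append(candidate)
    return keys

def merge_keys(keys_s1, keys_s2):
    """Union of each pair in the Cartesian product, reduced to minimal sets."""
    unions = {frozenset(k1) | frozenset(k2) for k1, k2 in product(keys_s1, keys_s2)}
    return [u for u in unions if not any(v < u for v in unions)]

derived = minimal_keys(["lastName", "firstName", "hasFriend"],
                       [{"lastName"}, {"hasFriend"}])
merged = merge_keys([{"firstName"}], [{"hasFriend"}])
print(derived)
print(merged)
```

With the non keys of the example, the minimal keys are {firstName} and {hasFriend, lastName}; the other keys listed on the slide are supersets of {firstName}.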
KD2R approach: Key Finder
• Computation of maximal non keys and undetermined keys
▫ Represent data in a prefix-tree (a compact representation of the data
of one class)
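A simplified picture of such a prefix tree, as a sketch only: it assumes single-valued properties and a fixed property order, and uses nested dicts with one level per property. The actual KD2R structure is described in the technical report; `build_prefix_tree` and the sample rows are illustrative.

```python
def build_prefix_tree(rows, properties):
    """Nested dicts: one level per property, keyed by value; leaves hold instance ids."""
    root = {}
    for ident, row in rows.items():
        node = root
        for p in properties[:-1]:
            # Instances with the same value for p share this branch.
            node = node.setdefault(row.get(p), {})
        node.setdefault(row.get(properties[-1]), []).append(ident)
    return root

# Illustrative single-valued rows.
rows = {
    "i1": {"lastName": "Tompson", "firstName": "Manuel"},
    "i2": {"lastName": "Tompson", "firstName": "Maria"},
    "i3": {"lastName": "David", "firstName": "George"},
    "i4": {"lastName": "Solgar", "firstName": "Michel"},
}
tree = build_prefix_tree(rows, ["lastName", "firstName"])
print(tree)
```

i1 and i2 share the "Tompson" branch, so the four instances need only three first-level nodes; a leaf holding more than one id would reveal a non key.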
Validation of approach
• Datasets where KD2R has been tested:
Datasets                       RDF files      #instances   Optimistic   Pessimistic
OAEI Restaurants Dataset       Restaurant1    339          Yes          Yes
                               Restaurant2    1390         Yes          Yes
OAEI Persons Dataset           Person11       1000         Yes          Yes
                               Person12       1000         Yes          Yes
                               Person21       1200         Yes          Yes
DBpedia Dataset (properties    Person         763644       Yes          No
instantiated in at least       NaturalPlace   78400        Yes          No
80% of the data)               BodyOfWater    34008        Yes          No
                               Lake           33348        Yes          No
googleFusion Dataset           G_Restaurant   372813       Yes          Yes
ChefMoz Dataset                C_Restaurant   1047         Yes          Yes
Demo
• Ontologies
▫ Data conforming to one ontology
• RDF data
▫ DBpedia NaturalPlace dataset (78400 instances)
▫ OAEI Person dataset (2000 instances)
• Data linking
▫ Link data using LN2R
▫ Measure quality of linking using:
▪ recall
▪ precision
▪ f-measure
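For reference, the three measures can be computed from a set of produced links and a gold standard as follows. This is a generic sketch with invented sample links; `linking_quality` is a hypothetical helper and not part of LN2R.

```python
def linking_quality(found, expected):
    """Precision, recall and F-measure of sameAs links (pairs are unordered)."""
    found = {frozenset(pair) for pair in found}
    expected = {frozenset(pair) for pair in expected}
    correct = len(found & expected)
    precision = correct / len(found) if found else 0.0
    recall = correct / len(expected) if expected else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

# Invented example: 3 links produced, 4 links in the gold standard, 2 in common.
found = [("a1", "b1"), ("a2", "b2"), ("a3", "b9")]
expected = [("b1", "a1"), ("a2", "b2"), ("a4", "b4"), ("a5", "b5")]
precision, recall, f_measure = linking_quality(found, expected)
print(precision, recall, f_measure)
```

Here precision is 2/3, recall is 2/4 and the F-measure, their harmonic mean, is 4/7.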
QUESTIONS???
THANK YOU!!!