KD2R: a Key Discovery method for semantic Reference Reconciliation
Danai Symeonidou, Nathalie Pernelle and Fatiha Saïs
LRI (University Paris-Sud)
WOD'2013, June 3rd

Danai Symeonidou, WOD'2013

Data Linking
• More and more heterogeneous RDF sources
• Links can be asserted between them
  ▫ sameAs is one of the most important types of links: it combines information given in different data sources
  ▫ LOD: the number of already existing links is very small
• How to create links automatically?

Data Linking Problem

Dataset1
      FirstName   LastName   SSN         Job
P1    George      Thomson    011223456   Artist
P2    George      Thomson    444223456   Professor

Dataset2
      FirstName   LastName   SSN         Age
P3    George      Thomson    011223456   45

Successive builds of this slide add SameAs links between the two datasets (P1 and P3 share the same SSN).

Data Linking with or without key constraints
• No knowledge given about the properties: all the properties have the same importance.
• Knowledge given by an expert:
  ▫ Specific expert rules [Arasu et al.'09, Low et al.'01, Volz et al.'09 (Silk)]
    Example: max(jaro(phone-number; phone-number); jaro-winkler(SSN; SSN)) > 0.88
  ▫ Key constraints [Saïs, Pernelle and Rousset'09]
    Example: hasKey(Museum (museumName) (museumAddress))
• OWL2 Key for a class expression: a combination of (inverse) properties which uniquely identifies an entity
  ▫ hasKey( CE ( OPE1 ... OPEm ) ( DPE1 ... DPEn ) )
    Example: hasKey(Museum (museumName) (museumAddress)) expresses:
    Museum(x1) ∧ Museum(x2) ∧ museumName(x1, y) ∧ museumName(x2, y) ∧ museumAddress(x1, w) ∧ museumAddress(x2, w) ⇒ sameAs(x1, x2)

Key Discovery Problem
• Problem: when data sources contain numerous data and/or complex ontologies
  ▫ Some keys are not obvious to find.
  ▫ Erroneous keys can be given by the expert.
• Aim: automatic discovery of a complete set of keys from data
• Naïve automatic way to discover keys: examine all the possible combinations of properties
  ▫ Example: given an instance described by 15 properties, the number of candidate keys is 2^15 - 1 = 32767
  ▫ For each candidate key we have to scan all the instances of the data
• Objective: find keys efficiently by:
  ▫ Reducing the combinations
  ▫ Partially scanning the data

Key Discovery Problem
• RDF data sources (conforming to an OWL 2 ontology)
• Mappings between classes and properties of the different ontologies
• Open world assumption (incomplete data) and multivalued properties may exist

id    lastName   firstName   hasFriend
i1    Tompson    Manuel      i2, i3
i2    Tompson    Maria
i3    David      George
i4    Solgar     Michel      i2, i4

How to discover keys when we do not know whether:
  ▫ i1 =?= i2 =?= i3 =?= i4
  ▫ hasFriend(i1, i4), hasFriend(i2, i3) … ?
  ▫ firstName(i1, Elodie) … ?

Key Discovery Problem: Assumptions
• Unique Name Assumption (UNA): two different URIs refer to distinct entities (data sources generated from relational databases, Yago)
  i1 <> i2 <> i3 <> i4
• Two literals that are syntactically different are semantically different
  ▫ (e.g. "Napoleon Bonaparte" <> "Napoleon")

Key Discovery: Heuristics
• Heuristic 1 - Pessimistic:
  ▫ Not instantiated property: all the values are possible
    Example: hasFriend(i2, i3), hasFriend(i4, i2) are possible.
  ▫ Instantiated property: only given values are considered
    Example: not hasFriend(i1, i4)

id    lastName   firstName   hasFriend
i1    Tompson    Manuel      i2, i3
i2    Tompson    Maria
i3    David      George
i4    Solgar     Michel      i2, i4

Non keys: {lastName}, {hasFriend}
Keys: {firstName}, {lastName, firstName}, {firstName, hasFriend}
Undetermined keys: {hasFriend, lastName}

Key Discovery: Heuristics
• Heuristic 2 - Optimistic:
  ▫ Not instantiated property: the value is not one of the already existing ones
    Example: not hasFriend(i2, i3), not hasFriend(i2, i1), not hasFriend(i2, i4).
  ▫ Instantiated property: only given values are considered
    Example: not hasFriend(i1, i4)

(Same data as in the pessimistic case.)

Non keys: {lastName}, {hasFriend}
Keys: {firstName}, {lastName, firstName}, {firstName, hasFriend}, {hasFriend, lastName}

KD2R approach
Topological sort of the classes (subsumption)
• Key Finder
  ▫ Discover non keys
    Ex: {lastName}, {hasFriend}
  ▫ Derive keys using non keys
    Ex: {firstName}, {lastName, firstName}, {firstName, hasFriend}, {hasFriend, lastName}
• Key Merge
  ▫ Cartesian product of minimal key sets in S1, S2
    Ex: Ks1 = {firstName}, Ks2 = {hasFriend}, Ks1-s2 = {firstName, hasFriend}
Technical report available: https://www.lri.fr/~bibli/Rapports-internes/2013/RR1559.pdf

KD2R approach: Key Finder
• Computation of maximal non keys and undetermined keys
  ▫ Represent data in a prefix-tree (a compact representation of the data of one class)

Validation of approach
• Datasets where KD2R has been tested:

Dataset                     RDF file       #instances   Optimistic   Pessimistic
OAEI Restaurants Dataset    Restaurant1    339          Yes          Yes
                            Restaurant2    1390         Yes          Yes
OAEI Persons Dataset        Person11       1000         Yes          Yes
                            Person12       1000         Yes          Yes
                            Person21       1200         Yes          Yes
DBpedia Dataset (*)         Person         763644       Yes          No
                            NaturalPlace   78400        Yes          No
                            BodyOfWater    34008        Yes          No
                            Lake           33348        Yes          No
googleFusion Dataset        G_Restaurant   372813       Yes          Yes
ChefMoz Dataset             C_Restaurant   1047         Yes          Yes

(*) properties instantiated in at least 80% of the data

Demo
• Ontologies
  ▫ Data conforming to one ontology
• RDF data
  ▫ DBpedia NaturalPlace dataset (78400 instances)
  ▫ OAEIPerson dataset (2000 instances)
• Data linking
  ▫ Link data using LN2R
  ▫ Measure quality of linking using: recall, precision, F-measure

Questions?

Thank you!
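
Worked illustration of the two heuristics. The sketch below classifies every property combination of the four-instance example table (i1 to i4) as a non key, an undetermined key, or a key. It is a naive enumeration written only to make the pessimistic and optimistic readings concrete; it is not the prefix-tree Key Finder of KD2R. The names DATA, classify and discover are hypothetical, and an empty set stands for a property that is not instantiated.

```python
from itertools import combinations

# Example table from the slides; an empty set means "not instantiated".
DATA = {
    "i1": {"lastName": {"Tompson"}, "firstName": {"Manuel"}, "hasFriend": {"i2", "i3"}},
    "i2": {"lastName": {"Tompson"}, "firstName": {"Maria"}, "hasFriend": set()},
    "i3": {"lastName": {"David"}, "firstName": {"George"}, "hasFriend": set()},
    "i4": {"lastName": {"Solgar"}, "firstName": {"Michel"}, "hasFriend": {"i2", "i4"}},
}


def classify(props, data, optimistic):
    """Return 'non key', 'undetermined' or 'key' for the property set `props`."""
    possible_clash = False
    for a, b in combinations(data.values(), 2):
        # Do the two instances share a known value for each property?
        shared = [bool(a[p] & b[p]) for p in props]
        if all(shared):
            return "non key"  # two distinct instances agree on every property
        # Pessimistic reading: a missing value could be anything, so it may match.
        maybe = [s or not a[p] or not b[p] for s, p in zip(shared, props)]
        if all(maybe):
            possible_clash = True
    # Optimistic reading: a missing value differs from all existing ones.
    if possible_clash and not optimistic:
        return "undetermined"
    return "key"


def discover(data, optimistic):
    """Classify every non-empty combination of properties (naive enumeration)."""
    props = sorted(next(iter(data.values())))
    result = {}
    for size in range(1, len(props) + 1):
        for combo in combinations(props, size):
            result[frozenset(combo)] = classify(combo, data, optimistic)
    return result


pessimistic = discover(DATA, optimistic=False)
optimistic = discover(DATA, optimistic=True)
for combo in sorted(pessimistic, key=lambda c: (len(c), sorted(c))):
    print(sorted(combo), pessimistic[combo], "/", optimistic[combo])
```

With three properties the loop visits 2^3 - 1 = 7 combinations and compares every pair of instances for each one, which is exactly the cost that the "reduce the combinations, partially scan the data" objective aims to avoid.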
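
Worked illustration of Key Merge. The slides describe the merge step as a Cartesian product of the minimal key sets found in two sources S1 and S2. A minimal sketch of that idea follows; the function name merge_keys is hypothetical, and the final filter that keeps only minimal merged keys is an assumption added here, since the slide states only the Cartesian product.

```python
from itertools import product


def merge_keys(keys_s1, keys_s2):
    """Combine each key of S1 with each key of S2 (union of their properties)."""
    candidates = {
        frozenset(k1) | frozenset(k2) for k1, k2 in product(keys_s1, keys_s2)
    }
    # Assumption: drop any merged key that strictly contains another merged key.
    return {k for k in candidates if not any(other < k for other in candidates)}


# Example from the slide: Ks1 = {firstName}, Ks2 = {hasFriend}
merged = merge_keys([{"firstName"}], [{"hasFriend"}])
print([sorted(k) for k in merged])
```

A property set obtained this way contains a key of S1 and a key of S2, so it is a key in both sources, which is what makes it usable for linking across them.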