A Machine Learning Approach to Linking FOAF Instances

Computing FOAF Co-reference Relations with Rules and Machine Learning
Jennifer Sleeman and Tim Finin
University of Maryland, Baltimore County
The Third International Workshop on Social Data on the
Web, November 2010
http://ebiquity.umbc.edu/paper/html/id/506/
FOAF

Friend of a Friend (FOAF) vocabulary
describes people and their relationships
– One of the oldest and most widely used ontologies

Does not include a globally unique identifier
– Inverse functional properties (IFPs) help

Multiple foaf instances referring to the same
person are common
– Increasingly so with more linked data
introduction  foaf co-reference  approach  methodology  evaluation  conclusions
Linking data



Data integration requires linking instances
from different data sets
Linking foaf instances is a common and
typical use case
Sindice reports 23 foaf instances all referring
to Sir Tim Berners-Lee
– Probably more than my query revealed
– Only a handful are linked via owl:sameAs
– Automatically linking foaf instances is not
always easy
Example 1
Common properties but can we
say this is the same person…
<swivt:Subject rdf:about="http://tw.rpi.edu/wiki/Special:URIResolver/Bijan_Parsia">
<rdfs:label>Bijan Parsia</rdfs:label>
<swivt:page rdf:resource="http://tw.rpi.edu/wiki/Bijan_Parsia"/>
<rdfs:isDefinedBy rdf:resource="http://tw.rpi.edu/wiki/Special:ExportRDF/Bijan_Parsia"/>
<rdf:type rdf:resource="http://tw.rpi.edu/wiki/Special:URIResolver/Category-3APerson"/>
<property:Foaf-3Adepiction rdf:resource="http://tw.rpi.edu/wiki/Special:URIResolver/Anonymous.png"/>
<foaf:firstName rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Bijan</foaf:firstName>
<foaf:interest rdf:resource="http://tw.rpi.edu/wiki/Special:URIResolver/Category-3ASemantic_Web_Topic"/>
<foaf:name rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Bijan Parsia</foaf:name>
<foaf:surname rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Parsia</foaf:surname>
<property:Has_affiliation rdf:resource="http://tw.rpi.edu/wiki/Special:URIResolver/Manchester_University"/>
<property:Has_identifier rdf:resource="http://tw.rpi.edu/wiki/Special:URIResolver/Bijan_Parsia"/>
</swivt:Subject>
http://tw.rpi.edu/wiki/Special:ExportRDF/Bijan_Parsia
<foaf:Person rdf:ID="bparsia">
<foaf:mbox_sha1sum>f49a6854842c5fa76dc0edb8e82f8fe04fd56bc9</foaf:mbox_sha1sum>
<foaf:firstName>Bijan</foaf:firstName>
<foaf:surname>Parsia</foaf:surname>
<foaf:name>Bijan Parsia</foaf:name>
<foaf:homepage rdf:resource="http://trust.mindswap.org/cgibin/FilmTrust/foaf.cgi?user=bparsia"/>
<foaf:img rdf:resource="http://www.mindswap.org/~bparsia/talks/uriuse/bijan.jpg"/>
<foaf:depiction rdf:resource="http://www.mindswap.org/~bparsia/talks/uri-use/bijan.jpg"/>
<foaf:nick>bparsia</foaf:nick>
<foaf:holdsAccount>
<foaf:OnlineAccount>
<foaf:accountName>bparsia</foaf:accountName>
<foaf:accountServiceHomepage rdf:resource="http://trust.mindswap.org/FilmTrust/"/>
</foaf:OnlineAccount>
</foaf:holdsAccount>
http://trust.mindswap.org/cgi-bin/FilmTrust/foaf.cgi?user=bparsia#tt0084827-bparpia
Example 2
Aliases and slight name
variations…
<foaf:Person>
<foaf:name>James A. Hendler</foaf:name>
<foaf:firstName>James</foaf:firstName>
<foaf:surname>Hendler</foaf:surname>
<foaf:publications>http://ebiquity.umbc.edu/papers/select/person/James/Hendler/</foaf:publications>
<foaf:homepage rdf:resource="http://www.cs.umd.edu/~hendler/"/>
<foaf:workInfoHomepage rdf:resource="http://www.cs.umd.edu/~hendler/"/>
http://ebiquity.umbc.edu/person/foaf/James/A./Hendler/foaf.rdf
<foaf:Person rdf:ID="jhendler">
<foaf:mbox_sha1sum>0b62d4242736e64be6138547c79a811b3e82fd52</foaf:mbox_sha1sum>
<foaf:firstName>Jim</foaf:firstName>
<foaf:surname>Hendler</foaf:surname>
<foaf:name>Jim Hendler</foaf:name>
<foaf:title>Tetherless World Constellation Chair</foaf:title>
<foaf:homepage rdf:resource="http://trust.mindswap.org/cgi-bin/FilmTrust/foaf.cgi?user=jhendler"/>
<foaf:homepage rdf:resource="http://www.cs.umd.edu/~hendler"/>
<foaf:depiction rdf:resource="http://www.semanticgrid.org/qiantbljim.jpg"/>
<foaf:workplaceHomepage rdf:resource="http://owl.mindswap.org"/>
<foaf:img rdf:resource="http://www.cs.umd.edu/~hendler/hendler.gif"/>
<foaf:depiction rdf:resource="http://www.cs.umd.edu/~hendler/hendler.gif"/>
<foaf:nick>jhendler</foaf:nick>
<foaf:openID rdf:resource="http://jhendler.pip.verisignlabs.com/" />
<foaf:holdsAccount>
<foaf:OnlineAccount>
<foaf:accountName>jhendler</foaf:accountName>
<foaf:accountServiceHomepage rdf:resource="http://trust.mindswap.org/FilmTrust/"/>
</foaf:OnlineAccount>
</foaf:holdsAccount>
http://www.cs.rpi.edu/~hendler/foaf.rdf
Example 3
What if mbox_sha1sums are
different?
<Agent rdf:about="http://identi.ca/user/53505">
<mbox_sha1sum>08445a31a78661b5c746feff39a9db6e4e2cc5cf</mbox_sha1sum>
<name>David Wood</name>
<homepage rdf:resource="http://dw2-0.com"/>
<weblog rdf:resource="http://identi.ca/dw2"/>
<holdsAccount><OnlineAccount rdf:about="http://identi.ca/user/53505#acct">
<accountServiceHomepage rdf:resource="http://identi.ca/"/>
<accountName>dw2</accountName>
<accountProfilePage rdf:resource="http://identi.ca/dw2"/>
<sioc:account_of rdf:resource="http://identi.ca/user/53505"/>
<sioc:follows rdf:resource="http://identi.ca/user/136#acct"/>
</OnlineAccount></holdsAccount>
http://identi.ca/dw2/foaf
<foaf:Person rdf:about="http://zepheira.com/team/dave/#me">
<foaf:name>David Wood</foaf:name>
<foaf:title>Dr.</foaf:title>
<foaf:givenname>David</foaf:givenname>
<foaf:family_name>Wood</foaf:family_name>
<foaf:nick>prototypo</foaf:nick>
<foaf:mbox_sha1sum>37c8d030d4e615d05f31625b3460532a3f4e214e</foaf:mbox_sha1sum>
<foaf:homepage rdf:resource="http://prototypo.blogspot.com/"/>
<foaf:depiction rdf:resource="http://www.itee.uq.edu.au/~dwood/images/dave_w_0.jpg"/>
<foaf:phone rdf:resource="tel:+1-(571)-3313723"/>
<foaf:workplaceHomepage rdf:resource="http://www.zepheira.com/"/>
<foaf:workInfoHomepage rdf:resource="http://www.zepheira.com/team/dave"/>
<foaf:schoolHomepage rdf:resource="http://www.vmi.edu/"/>
<foaf:schoolHomepage rdf:resource="http://www.nps.navy.mil/"/>
<foaf:schoolHomepage rdf:resource="http://www.itee.uq.edu.au/"/>
<foaf:aimChatID>piprototypo</foaf:aimChatID>
http://www.itee.uq.edu.au/~dwood/dave.rdf#me
Example 3 cont.
Which David Wood was a
mindswapper?
<ms:Researcher rdf:ID="David_Wood" rdfs:label="David Wood">
<foaf:name>David Wood</foaf:name>
<foaf:mbox>
<owl:Thing rdf:about="mailto:dwood@mindswap.org"/>
</foaf:mbox>
<foaf:homepage>
<foaf:Document rdf:about="http://www.mindswap.org/~dwood/"/>
</foaf:homepage>
<foaf:workInfoHomepage>
<foaf:Document rdf:about="http://www.mindswap.org/~dwood/"/>
</foaf:workInfoHomepage>
</ms:Researcher>
http://www.mindswap.org/2004/owl/mindswappers#David.Wood
Example 5
Could jgolbeck and Jennifer Golbeck
be the same person …
<foaf:Person rdf:ID="jgolbeck">
<foaf:mbox_sha1sum>08445a31a78661b5c746feff39a9db6e4e2cc5cf</foaf:mbox_sha1sum>
<foaf:firstName></foaf:firstName>
<foaf:surname></foaf:surname>
<foaf:name> </foaf:name>
<foaf:homepage rdf:resource="http://trust.mindswap.org/cgi-bin/FilmTrust/foaf.cgi?user=jgolbeck"/>
<foaf:img rdf:resource=""/>
<foaf:depiction rdf:resource=""/>
<foaf:nick>jgolbeck</foaf:nick>
<foaf:holdsAccount>
<foaf:OnlineAccount>
<foaf:accountName>jgolbeck</foaf:accountName>
<foaf:accountServiceHomepage rdf:resource="http://trust.mindswap.org/FilmTrust/"/>
</foaf:OnlineAccount>
</foaf:holdsAccount>
http://trust.mindswap.org/cgi-bin/FilmTrust/foaf.cgi?user=jgolbeck
<swivt:Subject rdf:about="http://tw.rpi.edu/wiki/Special:URIResolver/Jennifer_Golbeck">
<rdfs:label>Jennifer Golbeck</rdfs:label>
<swivt:page rdf:resource="http://tw.rpi.edu/wiki/Jennifer_Golbeck"/>
<rdfs:isDefinedBy rdf:resource="http://tw.rpi.edu/wiki/Special:ExportRDF/Jennifer_Golbeck"/>
<rdf:type rdf:resource="http://tw.rpi.edu/wiki/Special:URIResolver/Category-3AAssistant_Professor"/>
<rdf:type rdf:resource="http://tw.rpi.edu/wiki/Special:URIResolver/Category-3APerson"/>
<property:Foaf-3Adepiction rdf:resource="http://tw.rpi.edu/wiki/Special:URIResolver/Anonymous.png"/>
<foaf:firstName rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Jennifer</foaf:firstName>
<foaf:interest rdf:resource="http://tw.rpi.edu/wiki/Special:URIResolver/Category-3ASemantic_Web_Topic"/>
<foaf:name rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Jennifer Golbeck</foaf:name>
<foaf:surname rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Golbeck</foaf:surname>
http://tw.rpi.edu/wiki/Special:ExportRDF/Jennifer_Golbeck
Example 5 cont.
Which profile is most recent/relevant?
<rdf:RDF>
<foaf:Person>
<foaf:name>Jennifer Golbeck</foaf:name>
<foaf:mbox rdf:resource="mailto:golbeck@cs.umd.edu"/>
<foaf:mbox rdf:resource="mailto:golbeck@mindswap.org"/>
<owl:sameAs rdf:resource="http://www.mindswap.org/2004/owl/mindswappers#Jennifer.Golbeck"/>
<foaf:workplaceHomepage rdf:resource="http://www.cs.umd.edu/~golbeck"/>
<foaf:currentProject rdf:resoruce="http://trust.mindswap.org"/>
<foaf:publications rdf:resource="http://www.mindswap.org/papers"/>
<foaf:knows rdf:resource="#danbri"/>
<rdfs:seeAlso rdf:resource="http://trust.mindswap.org/cgi-bin/getList.cgi"/>
http://www.cs.umd.edu/~golbeck/daml/golbeckFOAF.rdf
<ms:Researcher rdf:ID="Jennifer.Golbeck" rdfs:label="Jennifer Golbeck">
<rdfs:seeAlso rdf:resource="http://www.cs.umd.edu/~golbeck/daml/golbeckFOAF.rdf"/>
<foaf:name>Jennifer Golbeck</foaf:name>
<foaf:mbox><owl:Thing rdf:about="mailto:golbeck@cs.umd.edu"/></foaf:mbox>
<foaf:homepage><foaf:Document rdf:about="http://www.cs.umd.edu/~golbeck/"/></foaf:homepage>
<foaf:workInfoHomepage><foaf:Document rdf:about="http://www.mindswap.org/~golbeck/"/>
</foaf:workInfoHomepage>
</ms:Researcher>
http://www.mindswap.org/2004/owl/mindswappers#Jennifer.Golbeck
Our Contributions





Treating foaf smushing as entity co-reference
Use machine learning to train a classifier for
recognizing co-referent foaf instances
Combine this with rule-based evidence
Use of narrower RDF properties to express co-reference, avoiding overuse of owl:sameAs
Use of a greedy algorithm for iteratively clustering
co-referent entities and re-evaluating their
potential co-reference relations
Co-Reference in FOAF





Approach the problem like cross-document co-reference resolution in text
Match pairs of FOAF agents
Use rules and properties
Assign new properties to represent coref
and notCoref relationships
Cluster co-referent pairs

Cross-Document Co-reference Resolution
Determine when two documents mention
the same entity
Are two documents that talk about “George Bush”
talking about the same George Bush?
Is a document mentioning “Mahmoud Abbas”
referring to the same person as one mentioning
“Muhammed Abbas”? What about “Abu Abbas”?
“Abu Mazen”?
Drawing appropriate inferences from
multiple documents demands cross-document co-reference resolution
2008 NIST Text Analysis Conference
TAC KBP: Entity Linking
Given an entity mention in an
article, find the link to the
right Wikipedia entity if one
exists.
John Williams, author, 1922-1994
J. Lloyd Williams, botanist, 1854-1945
John Williams, politician, 1955-
John J. Williams, US Senator, 1904-1988
John Williams, Archbishop, 1582-1650
John Williams, composer, 1932-
Jonathan Williams, poet, 1929-
Michael Phelps, swimmer, 1985-
Michael Phelps, biophysicist, 1939-
John Williams
Richard Kaufman goes a long way back with John
Williams. Trained as a classical violinist,
Californian Kaufman started doing session work
in the Hollywood studios in the 1970s. One of his
movies was Jaws, with Williams conducting his
score in recording sessions in 1975...
Michael Phelps
Debbie Phelps, the mother of swimming star
Michael Phelps, who won a record eight gold
medals in Beijing, is the author of a new memoir,
...
Michael Phelps is the scientist most often
identified as the inventor of PET, a technique that
permits the imaging of biological processes in the
organ systems of living individuals. Phelps has ...
2009 NIST TAC Knowledge Base Population Track
Smushing




Smushing is the traditional term used for
recognizing that two “blank nodes” refer to the
same thing and merging them
Past work on smushing has exploited IFPs
(e.g., foaf:mbox), heuristic similarity metrics
and custom SPARQL queries
owl:sameAs is often used to relate smushed
nodes, enabling a reasoner to effect the merging
rdfs:seeAlso is used to find related foaf data
Smushing
[Figure: RDF graph; labels: foaf:Person, rdf:type, "bar", foaf:nick, owl:sameAs, foaf:mbox (twice), foaf:knows, "foo@gmail.com"]
Smushing
[Figure: RDF graph without the owl:sameAs edge; labels: foaf:Person, rdf:type, "bar", foaf:nick, foaf:knows, foaf:mbox, "foo@gmail.com"]
owl:sameAs considered harmful



Known problems
– Temporally qualified data (Ding vs. Ding)
– Noisy data (Clinton vs. Clinton)
– Referentially opaque contexts (John likes the
Morning Star beautiful)
Halpin et al. (2010) suggest a vocabulary for
similarity relations (similarity.owl)
We use two weaker predicates: coref & notCoref
– Defer the sameAs problem to applications
Co-Reference in FOAF


coref: transitive, symmetric and reflexive; has
sameAs as subproperty
notCoref: symmetric and irreflexive but not
transitive; has differentFrom as subproperty
:coref a owl:TransitiveProperty, owl:SymmetricProperty,
owl:ReflexiveProperty.
owl:sameAs rdfs:subPropertyOf :coref.
:notCoref a owl:SymmetricProperty, owl:IrreflexiveProperty.
owl:differentFrom rdfs:subPropertyOf :notCoref.
{?a :notCoref ?b. ?b :coref ?c.} => {?a :notCoref ?c}.
{?a foaf:knows ?b.} => {?a :notCoref ?b}.
The :coref and :notCoref properties that we use instead of owl:sameAs
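The semantics above can be sketched outside a reasoner. The following is a minimal illustration, assuming entities are plain strings and pairs are tuples; the function names `close_coref` and `propagate_not_coref` are hypothetical and not part of the original system. It groups entities into coref equivalence classes (transitive, symmetric, reflexive) and then applies the rule that notCoref carries over to everything coref with either side.

```python
def close_coref(entities, coref_pairs):
    """Group entities into coref equivalence classes using union-find."""
    parent = {e: e for e in entities}

    def find(x):
        # Follow parent links to the class representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in coref_pairs:
        parent[find(a)] = find(b)

    classes = {}
    for e in entities:
        classes.setdefault(find(e), set()).add(e)
    return list(classes.values())


def propagate_not_coref(classes, not_coref_pairs):
    """Apply {?a :notCoref ?b. ?b :coref ?c} => {?a :notCoref ?c}, with symmetry."""
    class_of = {e: c for c in classes for e in c}
    result = set()
    for a, b in not_coref_pairs:
        for x in class_of[a]:
            for y in class_of[b]:
                if x != y:  # notCoref is irreflexive
                    result.add(frozenset((x, y)))
    return result


classes = close_coref(["a", "b", "c", "d"], [("b", "c")])
derived = propagate_not_coref(classes, [("a", "b")])
```

Here `derived` contains the unordered pairs {a, b} and {a, c}: because b and c are coref, the notCoref statement about a and b extends to c.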
Batch Approach
Given a potentially large set of foaf instances
– Generate candidate pairs
– Evaluate each pair for co-reference
– Using rules and classifier independently
– Each results in a {coref, notCoref, unknown}
decision
– Trust rules over classifier


Designate pairs as co-referent
Create Clusters
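The "trust rules over classifier" step can be sketched as a small decision function. This is only an illustration under the assumption that both components return one of the three labels as strings; the name `combine` is hypothetical.

```python
def combine(rule_decision, classifier_decision):
    """Prefer the rule-based decision; fall back to the classifier otherwise."""
    allowed = {"coref", "notCoref", "unknown"}
    if rule_decision not in allowed or classifier_decision not in allowed:
        raise ValueError("decisions must be coref, notCoref or unknown")
    # A rule that reaches a conclusion always wins over the classifier.
    if rule_decision != "unknown":
        return rule_decision
    return classifier_decision
```

For example, `combine("notCoref", "coref")` yields `"notCoref"`, while `combine("unknown", "coref")` yields `"coref"`.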
Ingest



Extract triples from FOAF profiles
Add each foaf agent as a new entity in the
database
Entity URLs in the foaf:knows graph are followed to
get additional information
Approach: System Architecture
[Figure: system architecture. Stages: ingestion; abstract entity generation; candidate pair generation (potential pairs: reduces classifier workload); model generation with rule-based reasoning (deductive decisions) and machine learning (predictions); co-referent designation and clustering; clusters form new abstract entities]
Candidate Pairs





Filtering pairs reduces the matching set
Use simple string matching predicates
– Dice score for 3-grams
Apply both to values of common properties
and also cross-property values
Experiment 2: ~30% reduction
Reductions vary based on data set
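A 3-gram Dice filter of the kind described above might look like the following sketch. The instance representation (a dict from property name to a list of string values), the 0.5 threshold, and the function names are assumptions made for illustration, not details from the original system.

```python
def trigrams(s):
    """Return the set of character 3-grams of a lowercased string."""
    s = s.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}


def dice(a, b):
    """Dice coefficient over 3-gram sets: 2|A n B| / (|A| + |B|)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return 2 * len(ta & tb) / (len(ta) + len(tb))


def is_candidate(inst_a, inst_b, threshold=0.5):
    """Keep a pair if any two values, same property or not, are similar enough."""
    values_a = [v for vs in inst_a.values() for v in vs]
    values_b = [v for vs in inst_b.values() for v in vs]
    return any(dice(x, y) >= threshold for x in values_a for y in values_b)


a = {"foaf:name": ["James A. Hendler"]}
b = {"foaf:name": ["Jim Hendler"], "foaf:nick": ["jhendler"]}
c = {"foaf:name": ["David Wood"]}
```

With these inputs, `is_candidate(a, b)` is true (the two names share six 3-grams, giving a Dice score of 12/23), while `is_candidate(a, c)` is false, so the second pair would never reach the classifier.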
Input data sources


FOAF profiles extracted from Swoogle
Also used URLs extracted from tests
conducted in previous work
Distribution of URLs from Experiment 2
Methodology: Rule-based Model


Rules conclude that two instances are co-referent, not co-referent or draw no
conclusion (the most common outcome)
Basic co-reference rule:
{?p a owl:IFP. ?a ?p ?x. ?b ?p ?x} =>
{?a :coref ?b}
{?p a owl:FP. ?a ?p ?x. ?a ?p ?y} =>
{?x :coref ?y}
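The first rule (shared value of an inverse functional property implies coref) can be sketched over plain triples as follows. The triple representation, the `ifp_coref` name, and the example data are assumptions for illustration; the three properties listed are ones the FOAF vocabulary declares as inverse functional.

```python
from collections import defaultdict
from itertools import combinations

# A few FOAF inverse functional properties, used here as an illustrative subset.
IFPS = {"foaf:mbox", "foaf:mbox_sha1sum", "foaf:homepage"}


def ifp_coref(triples, ifps=IFPS):
    """Return pairs of subjects that share a value for some IFP."""
    subjects_by_value = defaultdict(set)
    for s, p, o in triples:
        if p in ifps:
            subjects_by_value[(p, o)].add(s)
    pairs = set()
    for subjects in subjects_by_value.values():
        # Every two subjects with the same (property, value) are co-referent.
        for a, b in combinations(sorted(subjects), 2):
            pairs.add((a, b))
    return pairs


triples = [
    ("ex:p1", "foaf:mbox", "mailto:someone@example.org"),
    ("ex:p2", "foaf:mbox", "mailto:someone@example.org"),
    ("ex:p3", "foaf:mbox", "mailto:other@example.org"),
    ("ex:p1", "foaf:name", "Some One"),
    ("ex:p3", "foaf:name", "Some One"),
]
```

Here `ifp_coref(triples)` returns only the pair (ex:p1, ex:p2); the shared foaf:name between ex:p1 and ex:p3 does not count because foaf:name is not inverse functional.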
Methodology: Rule-based Model



In text processing, very similar name mentions
in a document are more likely to be co-referent
This is also used in disambiguating name mentions in citations in a single paper or Web page
A related heuristic is useful for a “knows graph”
extracted from a single foaf profile
{?a foaf:knows ?b.
?a foaf:knows ?c.
?b neq ?c} => {?b :notCoref ?c}
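The knows-graph rule above can be sketched in the same style: two distinct agents known by the same agent are marked notCoref. The triple representation and the function name `knows_not_coref` are illustrative assumptions.

```python
from collections import defaultdict
from itertools import combinations


def knows_not_coref(triples):
    """Mark distinct agents known by the same agent as not co-referent."""
    known_by = defaultdict(set)
    for s, p, o in triples:
        if p == "foaf:knows":
            known_by[s].add(o)
    pairs = set()
    for friends in known_by.values():
        # combinations() never pairs an item with itself, which covers ?b neq ?c.
        for b, c in combinations(sorted(friends), 2):
            pairs.add((b, c))
    return pairs


profile = [
    ("ex:me", "foaf:knows", "ex:alice"),
    ("ex:me", "foaf:knows", "ex:bob"),
    ("ex:me", "foaf:knows", "ex:carol"),
]
```

For this profile the function yields the three pairs among alice, bob and carol.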
Methodology – Vector Model
Support Vector Machine with a linear kernel
Features:
– Match/nomatch of any IFPs
– Distance measures over common property
values (Levenshtein & 3-gram Dice score)
– Alias and entity mention resolution
– Property specific feature comparison
– Knows graph comparisons: Jaccard coefficient of
similarity of foaf names of one-hop neighbors
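To make the feature list concrete, here is a small sketch of how a pair of instances might be turned into a numeric vector. It covers only three of the listed features (IFP match, a normalized Levenshtein similarity on foaf:name, and a Jaccard coefficient over neighbor names); the dict layout, the `knows_names` key, and the `features` function are hypothetical and not the authors' actual feature set.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def jaccard(xs, ys):
    """Jaccard coefficient |X n Y| / |X u Y| of two collections."""
    xs, ys = set(xs), set(ys)
    union = xs | ys
    return len(xs & ys) / len(union) if union else 0.0


def features(a, b, ifps=("foaf:mbox_sha1sum",)):
    """Build an illustrative feature vector for a pair of instances."""
    name_a, name_b = a.get("foaf:name", ""), b.get("foaf:name", "")
    longest = max(len(name_a), len(name_b)) or 1
    ifp_match = any(p in a and a.get(p) == b.get(p) for p in ifps)
    return [
        1.0 if ifp_match else 0.0,
        1.0 - levenshtein(name_a, name_b) / longest,
        jaccard(a.get("knows_names", []), b.get("knows_names", [])),
    ]


x = {"foaf:name": "Jim Hendler", "foaf:mbox_sha1sum": "abc", "knows_names": ["Tim", "Dan"]}
y = {"foaf:name": "Jim Hendler", "foaf:mbox_sha1sum": "abc", "knows_names": ["Dan"]}
```

Vectors like `features(x, y)` would then be fed to a linear-kernel SVM for training and prediction.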

Methodology: Clustering



Pairs form clusters
Clusters used as part of system evaluation
Can result in:
– Entity to Entity pairing
– Cluster to Entity pairing
– Cluster to Cluster pairing


Greedy process with a confidence threshold
Use rule-based model to eliminate known
non-coreferent pairs
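One way to read the greedy process is sketched below: pairs are visited in order of decreasing confidence, clusters are merged while the confidence stays above a threshold, and a merge is skipped if the rule-based model has marked any cross-cluster pair as notCoref. The 0.8 threshold, the data layout, and the name `greedy_cluster` are assumptions; the sketch also omits the re-evaluation of merged clusters that the full system performs.

```python
def greedy_cluster(entities, scored_pairs, not_coref, threshold=0.8):
    """Greedily merge clusters from high-confidence coref pairs."""
    cluster_of = {e: {e} for e in entities}
    for a, b, conf in sorted(scored_pairs, key=lambda t: -t[2]):
        if conf < threshold:
            break  # remaining pairs are all below the threshold
        ca, cb = cluster_of[a], cluster_of[b]
        if ca is cb:
            continue  # already in the same cluster
        # Respect rule-based notCoref decisions between any two members.
        if any(frozenset((x, y)) in not_coref for x in ca for y in cb):
            continue
        merged = ca | cb
        for e in merged:
            cluster_of[e] = merged
    unique = {id(c): c for c in cluster_of.values()}
    return sorted(sorted(c) for c in unique.values())


clusters = greedy_cluster(
    ["a", "b", "c", "d"],
    [("a", "b", 0.95), ("b", "c", 0.90), ("c", "d", 0.50)],
    not_coref={frozenset(("a", "c"))},
)
```

In this run a and b merge, the b-c merge is blocked because a and c are known to be notCoref, and the c-d pair falls below the threshold, leaving the clusters [a, b], [c] and [d].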
Methodology – Clustering
Instance matching can result in new cluster formation and
cluster matching can result in merged clusters.
Evaluation


Two experiments
– E1: 50,000 triples, over 500 entity
mentions, 600 classes used for training
– E2: 250,000 triples, over 3500 entity
mentions, over 1800 classes for training
10-fold cross-validation tests
Evaluation


For E1: 900 pairs non-match, majority
undetermined
E2: Results shown below
Pairs
Rule
Conclusion
differentFrom
Undetermined
47184
inverse functional
Undetermined
2402
inverse functional
Co-referent
8687410
knows graph
Undetermined
9138326
sameAs
Undetermined
1047874
knows
Not Co-referent
9138326
Evaluation



Results promising
During our E2 clustering, the first
phase reached 90% accuracy
The second phase found no new relationships among
pairs; cluster-to-cluster pairing occurred
     TP Rate   FP Rate   Precision   Recall   F-Measure
E1   0.933     0.267     0.93        0.933    0.93
E2   0.959     0.128     0.958       0.959    0.958
Classification Results using 10-fold Validation
Evaluation




Retrieving additional FOAF profiles based on
knows graph
Quickly retrieve large number of entities
Tightly linked
– reduced diversity of analyzed data
– more entities that are co-referent
Future experiments: a diversity filter
spanning domains
Future Work






Evaluating the contribution of each rule and
SVM feature to performance
Other ML approaches, e.g., Markov logic, EM
Exploiting better clustering algorithms
Adding more features, e.g. non-foaf vocabulary, non-RDF data (e.g., hosting site)
Applying approach to other RDF instances
Scalability:
– Providing a non-batch, streaming service
– Offering a coref Web service
Conclusions




We can treat instance linking as co-reference
resolution & exploit the in-document and cross-document distinction
Good results with an ensemble approach
combining rules and an SVM classifier
Apply clustering to form groups of co-referent
relations and reprocess
Promising initial results
http://ebiquity.umbc.edu/