Presentation on techniques for determining similarity of concepts

Entity Disambiguation
By
Angela Maduko
Directed by
Amit Sheth
Entity Disambiguation Problem


Emerges mainly while merging information from
different sources
Two major levels


1. Schema/Ontology level : Determining the similarity of
attributes/concepts/classes from the different
schema/ontology to be merged
2. Instance level: Which instances of concepts/classes
(/tuples in relational databases ) refer to the same entity
Current approaches for both levels

Feature-based Similarity Approach (FSA)





Set-Theory Similarity Approach (STA)
Information-Theory Similarity Approach (ITA)
Hybrid Approach (HA)
Relationship-based Similarity Approach (RSA)
Hybrid Similarity Approach (HSA)
ITA

In [1], Lin presents a measure for the similarity
between two concepts based on both their
commonalities and differences



Intuition 1: The similarity between A and B is related to their
commonality. The more commonality they share, the more
similar they are.
Intuition 2: The similarity between A and B is related to the
differences between them. The more differences they have,
the less similar they are.
Intuition 3: The maximum similarity between A and B is
reached when A and B are identical, no matter how much
commonality they share.
ITA

Consider the concept Fruit



A is an Apple
B is an Orange
Commonality of A and B?



common(A, B) = Fruit(A) and Fruit(B)
The commonality between A and B is measured by
I(common(A, B)), the amount of information
contained in common(A, B),
where the information content of a proposition S is I(S) = -log P(S)
ITA




Difference is measured by I(description(A, B)) -
I(common(A, B))
description(A, B) is a proposition that describes what
A and B are
Can be applied at both levels 1 & 2
Intuitively, sim(A, B) =



1 when A and B are exactly alike;
0 when they share no commonalities
Proposes $\mathrm{sim}(A, B) = \frac{\log P(\mathrm{common}(A, B))}{\log P(\mathrm{description}(A, B))}$
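The ratio above can be checked against the two boundary cases with a minimal Python sketch. The function names and the probabilities are hypothetical, chosen only for illustration; they are not taken from [1].

```python
import math

def information_content(p: float) -> float:
    """I(S) = -log P(S) for a proposition S that holds with probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    # Return a clean 0.0 (not -0.0) for a proposition that is always true.
    return -math.log(p) if p < 1.0 else 0.0

def lin_similarity(p_common: float, p_description: float) -> float:
    """sim(A, B) = log P(common(A, B)) / log P(description(A, B))."""
    ic_description = information_content(p_description)
    if ic_description == 0.0:
        raise ValueError("description(A, B) carries no information")
    return information_content(p_common) / ic_description

# Hypothetical probabilities (assumptions, not data from the source).
identical = lin_similarity(p_common=0.01, p_description=0.01)      # A and B exactly alike
nothing_shared = lin_similarity(p_common=1.0, p_description=0.01)  # common(A, B) is trivially true
partial = lin_similarity(p_common=0.1, p_description=0.01)         # some commonality
```

When common(A, B) says as much as description(A, B), the ratio is 1; when common(A, B) is always true, its information content is 0 and so is the similarity.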
ITA




In [2], Resnik measures the similarity between two
concepts in an is-a taxonomy based on the information
content of their most specific common super-concept
Define P(c) as the probability of encountering an
instance of a concept c in the taxonomy
For any two concepts c1 and c2, define S(c1, c2) as the set
of concepts that subsume both c1 and c2
Proposes $\mathrm{sim}(c_1, c_2) = \max_{c \in S(c_1, c_2)} [-\log P(c)]$
ITA


[Figure: is-a taxonomy over the concepts X, Y, Z, A, B, C, D, E and F]





100 instances of concept X
4 instances of concept Y
200 instances of concept Z
2000 instances of all concepts
sim(A, B)
sim(C, D)
sim(A, D)
sim(A, E)
sim(C, D) > sim(A, B).
Should this be so?
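A minimal Python sketch of the Resnik measure using the instance counts from this slide. The taxonomy shape is an assumption, since the figure is not recoverable here: A and B are placed under X, C and D under Y, E and F under Z, and a hypothetical ROOT holds all 2000 instances. Each count is also assumed to include the instances of descendant concepts.

```python
import math

# Hypothetical is-a taxonomy (child -> parent); the shape is an assumption.
PARENT = {
    "X": "ROOT", "Y": "ROOT", "Z": "ROOT",
    "A": "X", "B": "X",
    "C": "Y", "D": "Y",
    "E": "Z", "F": "Z",
}

# Instance counts from the slide, assumed to include descendants.
COUNT = {"ROOT": 2000, "X": 100, "Y": 4, "Z": 200}

def subsumers(concept):
    """Return the concept itself and every concept above it."""
    chain = [concept]
    while concept in PARENT:
        concept = PARENT[concept]
        chain.append(concept)
    return chain

def resnik_similarity(c1, c2):
    """Maximum of -log P(c) over the concepts c that subsume both c1 and c2."""
    # Only concepts with a known count are considered (leaf counts are not given).
    common = [c for c in subsumers(c1) if c in subsumers(c2) and c in COUNT]
    # log(N / n) equals -log(n / N), i.e. -log P(c).
    return max(math.log(COUNT["ROOT"] / COUNT[c]) for c in common)

sim_ab = resnik_similarity("A", "B")  # most specific common subsumer: X
sim_cd = resnik_similarity("C", "D")  # most specific common subsumer: Y
sim_ad = resnik_similarity("A", "D")  # only ROOT subsumes both
```

Under these assumptions sim(C, D) exceeds sim(A, B) only because Y has far fewer instances than X, which is the behaviour the slide questions.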
ITA






Define s(w) as the set of concepts that are word senses
of word w. Proposes a measure for word similarity as
follows
$\mathrm{Sim}(w_1, w_2) = \max_{c_1 \in s(w_1),\, c_2 \in s(w_2)} [\mathrm{sim}(c_1, c_2)]$
Can be applied at level 1 only
Doctor (medical and PhD)
Nurse (medical and nanny)
Sim(Doctor, Nurse)
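The Doctor/Nurse example can be sketched in Python as a maximum over all sense pairs. The sense names and the concept-level scores below are invented for illustration; in practice the scores would come from a concept similarity measure such as the one in [2].

```python
from itertools import product

# Hypothetical word senses, following the Doctor/Nurse example.
SENSES = {
    "doctor": ["medical_doctor", "phd_holder"],
    "nurse": ["medical_nurse", "nanny"],
}

# Hypothetical concept-level similarity scores (assumed values).
CONCEPT_SIM = {
    ("medical_doctor", "medical_nurse"): 8.0,
    ("medical_doctor", "nanny"): 2.5,
    ("phd_holder", "medical_nurse"): 2.0,
    ("phd_holder", "nanny"): 1.5,
}

def word_similarity(w1, w2):
    """Best concept similarity over every pair of senses of w1 and w2."""
    return max(CONCEPT_SIM[(c1, c2)] for c1, c2 in product(SENSES[w1], SENSES[w2]))

score = word_similarity("doctor", "nurse")
```

With these numbers the medical senses of both words dominate, so the word-level score is the score of that pair.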
STA

[3] introduces a set theoretical notion of a matching
function F based on the following assumptions for
classes a, b, c with description sets A, B, C respectively


Matching: $s(a, b) = F(A \cap B, A - B, B - A)$
Monotonicity: $s(a, b) \ge s(a, c)$ whenever $A \cap B \supseteq A \cap C$, $A - B \subseteq A - C$, $B - A \subseteq C - A$
STA


Proposes two models:
Contrast model: Similarity is defined as







An increasing function of common features
A decreasing function of distinctive features (features that
apply to one object but not the other)
S(a, b) = f(A  B) - f(A -B) - f(B - A) (,, ≥ 0)
Function f measures the salience of a set of features
f depends on intensity and context factors
Intensity – physical salience (e.g. physical features)
Context – salience of features varies with context
STA

Ratio Model



$S(a, b) = \frac{f(A \cap B)}{f(A \cap B) + \alpha f(A - B) + \beta f(B - A)}$, with $\alpha, \beta \ge 0$
Can be applied at both levels 1 & 2
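Both models can be sketched in a few lines of Python. This is a sketch under stated assumptions: the salience function f defaults to set cardinality, the weights are named theta, alpha and beta, and the feature sets are made up for illustration.

```python
def contrast_similarity(a, b, theta=1.0, alpha=0.5, beta=0.5, f=len):
    """Contrast model: weighted common features minus weighted distinctive features."""
    return theta * f(a & b) - alpha * f(a - b) - beta * f(b - a)

def ratio_similarity(a, b, alpha=1.0, beta=1.0, f=len):
    """Ratio model: common features relative to common plus distinctive features."""
    common = f(a & b)
    denominator = common + alpha * f(a - b) + beta * f(b - a)
    # The ratio is undefined when neither object has any feature;
    # returning 0.0 in that case is a convention assumed here.
    return common / denominator if denominator else 0.0

# Hypothetical feature sets.
apple = {"fruit", "round", "sweet", "red"}
orange = {"fruit", "round", "sweet", "orange_colored", "citrus"}

contrast_score = contrast_similarity(apple, orange)
ratio_score = ratio_similarity(apple, orange)
```

The ratio model stays in [0, 1] and reaches 1 for identical feature sets, while the contrast model is unbounded and can go negative when distinctive features outweigh the common ones.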
HA





[7] combines clustering and information content approaches for
entity disambiguation (Scalable Information Bottleneck
(LIMBO) method)
Attempts to cluster entities in such a way that the clusters are
informative about the entities within them
Model: A set T of n entities (relational tuples), defined on m
attributes (A1, A2, …, Am). The domain of attribute Ai is the set Vi =
{Vi,1, Vi,2, …, Vi,di}
Let T and V be two discrete random variables that can take
values from T and V respectively
Initially, assigns each entity to its own cluster, i.e. #clusters = #entities.
Let Cq denote this initial clustering; then the mutual information
of Cq and T, I(Cq, T), equals the mutual information of V and T,
I(V, T)
HA


Assumes number of distinct entities k is known
Seeks a clustering Ck of V such that I(Ck, T) remains as
large as possible or the information loss I(V, T) - I(Ck,
T) is minimal
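The information-loss criterion can be illustrated with a minimal Python sketch. It only computes I(V, T) - I(Ck, T) for a given clustering of a toy joint distribution; it is not the LIMBO algorithm of [7]. The distribution, the value names and the helper names are hypothetical.

```python
import math
from collections import defaultdict

def mutual_information(joint):
    """I(X; Y) in bits for a joint distribution given as {(x, y): probability}."""
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

def merge(joint, assignment):
    """Replace each x by its cluster label and add up the probabilities."""
    clustered = defaultdict(float)
    for (x, y), p in joint.items():
        clustered[(assignment[x], y)] += p
    return dict(clustered)

def information_loss(joint, assignment):
    """I(V, T) - I(C, T) for the clustering C described by assignment."""
    return mutual_information(joint) - mutual_information(merge(joint, assignment))

# Toy joint distribution p(v, t): v1 and v2 occur only with t1, v3 only with t2.
joint = {("v1", "t1"): 0.25, ("v2", "t1"): 0.25, ("v3", "t2"): 0.5}

# Two candidate clusterings with k = 2 clusters.
good = {"v1": "c1", "v2": "c1", "v3": "c2"}  # merges values that behave alike
bad = {"v1": "c1", "v2": "c2", "v3": "c2"}   # merges values that behave differently

loss_good = information_loss(joint, good)
loss_bad = information_loss(joint, bad)
```

Merging v1 with v2 loses no information about T, whereas merging v2 with v3 does, so a method that minimises the loss would prefer the first clustering.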
HSA



In [8], Kashyap and Sheth introduce the concept of semantic
proximity (semPro) between entities to capture their similarity
In addition to context, employs relationships and features of
entities in determining their similarity
semPro(O1,O2) = <Context, Abstraction, (D1, D2), (S1, S2)>




Context: the context in which objects O1 and O2 are being compared
Abstraction: the abstraction/mappings relating the domains of the objects
(D1, D2): domain definitions of the objects
(S1, S2): states of the objects
HSA








Abstractions
Total 1-1 value mapping
Partial many-one mapping.
Generalization/specialization.
Aggregation.
Functional dependencies.
ANY
NONE
HSA


Semantic Taxonomy
Defines 5 degrees of similarity between objects





Semantic Equivalence
Semantic Relationship
Semantic Relevance
Semantic Resemblance
Semantic Incompatibility
HSA

Semantic Equivalence: strongest measure of semantic
proximity



Two objects are said to be semantically equivalent when they
represent the same real-world entity, i.e.
semPro(O1,O2) = <ALL, total 1-1 value mapping, (D1, D2), _> (domain Semantic Equivalence)
semPro(O1,O2) = <ALL, M, (D1, D2), (S1, S2)> where M = a
total 1-1 value mapping between (D1, S1) and (D2, S2) (state
Semantic Equivalence)
HSA



Semantic Relationship: weaker than semantic
equivalence.
semPro(O1,O2) = <ALL, M, (D1, D2), _> where M = a
partial many-one value mapping, generalization or
aggregation
Requirement of a 1-1 mapping is relaxed such that,
given an instance O1, we can identify an instance of
O2, but not vice versa.
HSA

Semantic Relevance:
Two objects are semantically relevant if there exists any
mapping between their domains in some context

semPro(O1,O2) = <SOME, ANY, (D1, D2), _>

HSA

Semantic Resemblance: weakest measure of semantic
proximity.

There does not exist any mapping between their domains in any
context
The objects have the same roles in some contexts, and their definition
contexts are coherent

HSA




Semantic Incompatibility
Asserts semantic dissimilarity.
Asserts that there is no context and no abstraction in
which the domains of the two objects are related.
semPro(O1,O2) = <NONE, NONE, (D1,D2), _>
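The five degrees can be read as a lookup on the Context and Abstraction components of semPro. The Python sketch below is one hypothetical encoding, not the formalism of [8]: the class name, the string labels and the rule used for Semantic Resemblance (no abstraction, but not all contexts excluded) are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class SemPro:
    context: str                    # "ALL", "SOME" or "NONE"
    abstraction: str                # one of the abstraction kinds, "ANY" or "NONE"
    domains: tuple                  # (D1, D2)
    states: Optional[tuple] = None  # (S1, S2), or None for the "_" placeholder

# Abstractions named on the Semantic Relationship slide.
RELATIONSHIP_ABSTRACTIONS = {"partial many-one mapping", "generalization", "aggregation"}

def classify(sp: SemPro) -> str:
    """Map a semPro tuple to one of the five degrees of similarity."""
    if sp.context == "ALL" and sp.abstraction == "total 1-1 value mapping":
        return "Semantic Equivalence"
    if sp.context == "ALL" and sp.abstraction in RELATIONSHIP_ABSTRACTIONS:
        return "Semantic Relationship"
    if sp.context == "NONE" and sp.abstraction == "NONE":
        return "Semantic Incompatibility"
    if sp.abstraction == "NONE":
        # Assumption: no mapping exists, but the objects are still comparable
        # in some context (same roles), which is treated as resemblance here.
        return "Semantic Resemblance"
    if sp.context == "SOME":
        return "Semantic Relevance"
    return "unclassified"

# Hypothetical domain names used only to exercise the rules.
equivalent = classify(SemPro("ALL", "total 1-1 value mapping", ("D1", "D2")))
related = classify(SemPro("ALL", "generalization", ("D1", "D2")))
relevant = classify(SemPro("SOME", "ANY", ("D1", "D2")))
incompatible = classify(SemPro("NONE", "NONE", ("D1", "D2")))
```

The order of the checks matters: the incompatibility test runs before the resemblance test, because both involve a NONE abstraction.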
HSA



In [9], Cho et al. propose a model derived from the
edge-based approach, employing the information content
of the node-based approach, based on these facts:
There exists a correlation between similarity and the number of
shared parent concepts in a hierarchy
Link type (hyponymy, meronymy, etc.) → semantic
relationship
HSA



Conceptual similarity between a node and its adjacent
child node may not be equal
As depth increases in the hierarchy, conceptual
similarity between a node and its adjacent child node
decreases
Population of nodes is not uniform over the entire
ontological structure (links in a dense part of the hierarchy
→ less distance than those in a less dense part)
HSA

Proposes $S(c_i, c_j) = D(L_{j \to i}) \cdot \sum_{0 \le k \le n} [ W(t_k) \cdot d(c_{k+1 \to k}) \cdot f(d) ] \cdot \max[H(c)]$, where
f(d) is a function that returns a depth factor (topological
location in the hierarchy)
$d(c_{k+1 \to k})$ is a density function
$D(L_{j \to i})$ is a function that returns a distance factor between $c_i$
and $c_j$ (shortest path from one node to the other)
$W(t_k)$ is a weight function that assigns a weight to each link
type ($W(t_k) = 1$ for an is-a link)
H(c) is the information content of the super-concepts of $c_i$ and $c_j$
For level 1 only
References
1. Dekang Lin, An Information-Theoretic Definition of Similarity, Proceedings of
the Fifteenth International Conference on Machine Learning, pp. 296-304, 1998.
2. Philip Resnik, Using Information Content to Evaluate Semantic Similarity in a Taxonomy,
IJCAI, 1995.
3. Amos Tversky, Features of Similarity, Psychological Review 84(4), pp. 327-352, 1977.
4. Debabrata Dey, A Distance-Based Approach to Entity Reconciliation in Heterogeneous
Databases, IEEE Transactions on Knowledge and Data Engineering, 14(3), May/June 2002.
5. Hui Han, Hongyuan Zha and C. Lee Giles, A Model-based K-means Algorithm for Name
Disambiguation, in Proceedings of the Second International Semantic Web Conference (ISWC03) Workshop on Semantic Web Technologies for Searching and Retrieving Scientific Data,
2003.
6. M. Andrea Rodriguez and Max J. Egenhofer, Determining Semantic Similarity Among Entity
Classes from Different Ontologies, IEEE Transactions on Knowledge and Data Engineering,
15(2): 442-456, 2003.
7. Periklis Andritsos, Renee J. Miller and Panayiotis Tsaparas, Information-Theoretic Tools for
Mining Database Structure from Large Data Sets, SIGMOD Conference 2004: 731-742.
8. Vipul Kashyap and Amit Sheth, Semantic and Schematic Similarities between Database Objects: a
Context-based Approach, VLDB Journal 5(4): 276-304, 1996.
9. Miyoung Cho, Junho Choi and Pankoo Kim, An Efficient Computational Method for
Measuring Similarity between Two Conceptual Entities, WAIM 2003: 381-388.