SemRank: Ranking Complex
Relationship Search Results on the
Semantic Web
Kemafor Anyanwu, Angela Maduko, Amit Sheth
LSDIS lab, University of Georgia
Paper presentation at WWW2005, Chiba Japan
Kemafor Anyanwu, Angela Maduko, and Amit Sheth. SemRank: Ranking Complex Relationship
Search Results on the Semantic Web, Proceedings of the 14th International World Wide Web
Conference (WWW2005), Chiba, Japan, May 10-14, 2005, pp. 117-127.
This work is funded by NSF-ITR-IDM Award#0325464 titled ‘SemDIS: Discovering Complex
Relationships in the Semantic Web’ and NSF-ITR-IDM Award#0219649 titled ‘Semantic Association
Identification and Knowledge Discovery for National Security Applications.’
Outline
• The Problem
• The SemRank relevance model
• SemRank computational issues in the SSARK
system
• Evaluating SemRank: strategy and issues
• Related Work
• Conclusion and Future work
The Problem
• [Anyanwu et al WWW2003] proposed a
query operator for finding complex
relationships between entities
• [Angles et al ESWC05] a survey of graph-based query operations that should be enabled on the Semantic Web
• Question: How can results of relationship
query operations be ranked?
The Relationship Ranking Problem
query q = (1, 3) (a pair of nodes)
[Figure: an example RDF graph over nodes 1–8 with edges labeled a–h]
1. Find the subgraph that covers q
2. List the results (1, 2, …, 2n) in order of relevance
   – ranking could be done together with step 1 or as a separate step
Things to think about
• Relevance as best match vs. ????
• Homogeneous (hyperlinks) vs. heterogeneous relationships
• Should relevance be fixed for all situations?
• Size of result set potentially large
The SemRank Model
SemRank’s Design Philosophy
• Tenet 1: Thou shall support variable
rankings
• Tenet 2: Thou must not burden the user
with complex query specification
• Tenet 3: Thou shall support mainstream search paradigms
SemRank’s Key Concepts
• Modulative Ranking
• Relevance: Search Mode + Predictability
• Refraction Count
– How varied is the result from what is expected
from schema?
• Information Gain
– How much information does a user gain by being
informed about a result?
• S-Match
– Best semantic match with user need (if provided)
[Figure: an adjustable search mode slides between two extremes –
 low information gain, low refraction count, high S-Match at one end;
 high information gain, high refraction count, high S-Match at the other]
Modulative Rank Function
• Typical preference or rank function
  – Ranki = Σj wij * attrij
• What we want is, given
  – µ – a weight function parameter
  – and attributes attr1, attr2, …, attrk (e.g. length)
  – for each attribute, select an appropriate weight function from g1, g2, …, gm (e.g. gi(µ) = µ)
    • each gi is some function of µ
• Then
  – Ranki(µ) = Σk gj(µ) * attrik
    • where gj is the weight function selected for attrk
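A minimal sketch of such a modulative rank function in Python (the names and the example weight functions are illustrative, not taken from the paper):

    def modulative_rank(mu, attrs, weight_fns):
        """Rank of one result: sum of g_j(mu) * attr_j over its attributes."""
        return sum(g(mu) * a for g, a in zip(weight_fns, attrs))

    # Example: two attributes (path length, S-Match); the first gets more weight
    # as mu -> 1, the second as mu -> 0.
    weight_fns = [lambda mu: mu, lambda mu: 1 - mu]
    print(modulative_rank(0.3, attrs=[4, 0.5], weight_fns=weight_fns))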
Refraction as a measure of predictability
[Figure: an instance path 1 –enrolled_in→ 2 –taught_by→ 3 –married_to→ 4, drawn against the schema classes Student, Course, Professor, Spouse]
• The path "enrolled_in → taught_by → married_to" doesn't exist anywhere at the schema layer
• We say that the path refracts at node 3
• A high refraction count in a path means low predictability
Semantic Summary
[Figure: an ontology with classes C1–C5 and properties p1–p5 is collapsed into a semantic summary of representative ontology classes – e.g. C1 and C3 map to one representative class and C2 and C4 to another, while C5 stays on its own – with the properties p1, p2, p3, p4, p5 carried over as arcs between the representatives]
Semantic Summary & Refraction.
• A Semantic Summary is a graph of
representative ontology classes with
appropriate relations as arcs
• For a path p = r1, p1, r2, p2, r3, there is a refraction at r2 if
  – p1 ∈ (ROCi, ROCj) and p2 ∉ (ROCj, ROCk) (or vice versa), where
  – ROCi, ROCj, ROCk are the representative ontology classes of r1, r2, r3 respectively
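A rough sketch of counting refractions along a path, following the definition above as reconstructed (the data structures and names are mine, not SSARK's):

    def refraction_count(path, roc_of, summary):
        """path = [r1, p1, r2, p2, r3, ...]; roc_of maps a node to its representative
        ontology class; summary maps (ROC, ROC) pairs to the allowed properties."""
        count = 0
        for i in range(2, len(path) - 1, 2):        # every interior node r_i
            p_in, p_out = path[i - 1], path[i + 1]
            prev_roc, roc, next_roc = roc_of[path[i - 2]], roc_of[path[i]], roc_of[path[i + 2]]
            in_ok = p_in in summary.get((prev_roc, roc), set())
            out_ok = p_out in summary.get((roc, next_roc), set())
            if in_ok != out_ok:                     # one side fits the summary, the other doesn't
                count += 1
        return count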
Information Content and Information Gain
Measuring Information Content of a Property
• Content is related to the uncertainty removed
• Typically measured as some function of its probability
  – high probability → low information content
• For p ∈ P, where P is the set of property types, its information content ISP can be measured as:
  – ISP(pk) = log2(1 / Pr(p = pk)) = −log2([[pk]] / [[P]])
• ISP(pk) is maximum when
  – Pr(p = pk) = 1 / [[P]], in which case ISP(pk) = log2 [[P]]
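A small Python sketch of computing ISP from a frequency distribution of property instances (such as the one FDIX stores; the data below is made up):

    import math

    def isp(freq, prop):
        """ISP(p) = -log2([[p]] / [[P]]), using property instance counts."""
        total = sum(freq.values())                  # [[P]]: all property instances
        return -math.log2(freq[prop] / total)

    freq = {"enrolled_in": 900, "taught_by": 80, "married_to": 20}
    print(isp(freq, "married_to"))                  # rare property -> high information content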
Information Content of a Property Sequence – Global Perspective
• The information content of a sequence of properties p1 → p2 → ⋯ → pk is
  – max(ISP(pi)), 1 ≤ i ≤ k
[Figure: a path p1 → p2 → p3 where p1 and p3 have high probability and p2 has low probability; p2 is the weak point, so the information content of the sequence is determined by p2]
Information Content – Local Perspective
• A sequence may have high information content globally but low information content locally
• Given a triple (a, p1, b), what is the information content with respect to only the valid possibilities between a and b?
  – valid(p1) is the set of properties P′ defined between (ROCi, ROCj), where a ∈ ROCi and b ∈ ROCj, together with their superproperties
• Recompute the probabilities based on this local set P′
  – I = min(NI(pi)) + the average of the other NI values
Total Information Content
• Total information content = information content from the global perspective + information content from the local perspective
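As a rough illustration, the global part of a path's information content (reusing the isp sketch above; the local re-weighting step is omitted here for brevity):

    def path_info_content_global(freq, path_props):
        """Global information content of a property sequence: max ISP over its properties."""
        return max(isp(freq, p) for p in path_props)

    print(path_info_content_global(freq, ["enrolled_in", "taught_by", "married_to"]))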
S-Match
Relevance Specification as keywords
• Keywords, e.g. published_in, located_in
S-Match
• Uses the "best semantic match" paradigm
• For a keyword ki and a property pj on a path:
  – SemMatch(ki, pj) = 2^(−d), with 0 < SemMatch(ki, pj) ≤ 1, where
    • d is the minimum distance between the properties in a property hierarchy
• For a path ps, its S-Match value is:
  – the sum, over the keywords ki, of max_j SemMatch(ki, pj)
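A compact sketch of the path-level S-Match (the distance function dist stands in for the property-hierarchy lookup that PHIX would provide; the names are illustrative):

    def s_match(keywords, path_props, dist):
        """Sum, over the keywords, of the best SemMatch = 2**(-d) against the
        properties on the path, where d is the property-hierarchy distance."""
        return sum(max(2 ** (-dist(k, p)) for p in path_props) for k in keywords)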
Putting it all together …
SemRank
• For a search mode µ and a path ps:
• Modulated information gain for ps, Iµ(ps)
  – Iµ(ps) = (1 − µ) * (1 / I(ps)) + µ * I(ps)
• Modulated refraction count RCµ(ps)
  – RCµ(ps) = µ * RC(ps)
• SEMRANK(ps) = Iµ(ps) × (1 + RCµ(ps)) × (1 + S-Match(ps))
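A sketch that puts the pieces together for one path, reusing the earlier sketches (this is my own simplified combination, e.g. it uses only the global information content; it is not the SSARK implementation):

    def sem_rank(mu, path, freq, keywords, roc_of, summary, dist):
        """Modulated SemRank score for one path [r1, p1, r2, p2, ...]."""
        props = path[1::2]                               # properties along the path
        info = path_info_content_global(freq, props)     # assumed > 0
        info_mod = (1 - mu) * (1 / info) + mu * info
        rc_mod = mu * refraction_count(path, roc_of, summary)
        return info_mod * (1 + rc_mod) * (1 + s_match(keywords, props, dist))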
Computing SemRank in SSARK
The SSARK system
[Figure: SSARK architecture in three phases.
 Preprocessing phase: RDF documents → Loader → Preprocessor → Storage Manager (LtStore, UtStore) and Index Manager (FDIX, PHIX, ROIX).
 Query processing phase: User Subsystem with a Query & Result Interface (query form "x ?? ?? ?? y") feeding the Query Processor, backed by a Look Ahead Cache (LAC) and a Result Cache (RC).
 Ranking phase: Ranking Engine producing pipelined top-k results.]
Approach
[Figure: the Query Processor expands a query over an example graph into a tree whose leaves are the edges on candidate paths; the Ranking Engine assigns SemRank* values (* = without refraction count) to those leaves]
The Index Subsystem
• FDIX – Frequency Distribution IndeX
– Stores the frequency distribution of properties
• ROIX – Representative Ontology IndeX
– Maps classes to Representative Ontology Classes
– Stores the semantic summary graph
• PHIX – Property Hierarchy IndeX
– Uses the Dewey Decimal labeling scheme to
encode the hierarchical relationships in a property
hierarchy
– Used for computing S-Match (match between
keywords and properties in a path)
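A tiny sketch of how Dewey-style labels can yield the property-hierarchy distance that S-Match needs (the label format and the function are illustrative, not the actual PHIX encoding):

    def dewey_distance(label_a, label_b):
        """Edges from a up to the lowest common ancestor, plus edges down to b,
        with properties labeled by Dewey paths such as '1.2.3'."""
        a, b = label_a.split("."), label_b.split(".")
        common = 0
        for x, y in zip(a, b):
            if x != y:
                break
            common += 1
        return (len(a) - common) + (len(b) - common)

    print(dewey_distance("1.2", "1.2.4"))   # a property and its direct sub-property -> 1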
Top-K Evaluation
[Figure: worked example of the pipelined top-k evaluation over the tree of edges (a–i) and their SemRank* values]
Final Top-k:
1. g.i, 18
2. c.f, 9
Evaluation Issues
• Data set needs
– Entities described with a variety of relationships
– Richly connected hierarchies
– Realistic frequency distributions
• Synthetically generated realistic small data
set using human defined rules
– e.g. |(p = "audits")| ≤ 0.1 × |(p = "enrolls")|
[Figure: example results shown for search modes µ = 0 and µ = 1]
Related Work
• Semantic searching and ranking of entities on the Semantic Web
  – Rocha et al WWW2004, Guha et al WWW2003, Stojanovic et al ISWC2003, Zhuge et al WWW2003
• Semantic ranking of relationships
  – Halaschek VLDB demo 2004, Aleman-Meza et al SWDB03
Future Work
• Comprehensive evaluation
• Including some measures for importance of
nodes in the paths
• Revise the Modulation function
• Optimizing Top-K evaluation
– Decreasing height of tree
– estimation techniques for a closer approximation
to SemRank ordering
Data, demos, more publications at
SemDis project web site (Google: semdis)
Thank You