SPARQLeR: Extended Sparql for
Semantic Association Discovery
Krzysztof Kochut and Maciej Janik
ESWC 2007, Innsbruck, Austria
June 4, 2007
Work supported by the National Science Foundation Grant
No. IIS-0325464, entitled “SemDIS: Discovering Complex
Relationships in the Semantic Web”.
Computer Science Department
University of Georgia
Paths in RDF
Directed path
Undirected path
Undirected path, but with specific properties and directionality
Why are paths interesting?
• A path describes how entities are related.
– Relationships on the path define meaning of this
connection.
– Entities on the path specify the content.
• Do you have migraine? Try taking magnesium!
– Path discovered by Dr. D. R. Swanson from partial
information available in PubMed publications
• stress can lead to loss of magnesium in the human body
• migraine patients seem to be experiencing stress
… that’s why …
• migraine could lead to a loss of magnesium, so …
take magnesium to fight migraine!
Swanson, D.R. Migraine and Magnesium: Eleven Neglected Connections.
Perspectives in Biology and Medicine, 31 (4). 526-557.
Formally, what is a simple path?
• Simple directed path between resources $r_0$ and $r_n$ in a description base $R$:
– sequence $r_0\, p_1\, r_1\, p_2\, r_2, \ldots, p_{n-1}\, r_{n-1}\, p_n\, r_n$ $(n > 0)$
– $r_0\, p_1\, r_1$, $r_1\, p_2\, r_2$, …, $r_{n-2}\, p_{n-1}\, r_{n-1}$, $r_{n-1}\, p_n\, r_n$ are triples in $R$
– all of the resources $r_i$ $(0 \le i \le n)$ in the path are distinct
• Simple undirected path between resources $r_0$ and $r_n$ in $R$:
– sequence $r_0\, p_1\, r_1\, p_2\, r_2, \ldots, p_{n-1}\, r_{n-1}\, p_n\, r_n$ $(n > 0)$
– for each $r_{i-1}\, p_i\, r_i$ $(0 < i \le n)$ in the path, either $r_{i-1}\, p_i\, r_i$ or $r_i\, p_i\, r_{i-1}$ is a triple in $R$
– all of the resources $r_i$ $(0 \le i \le n)$ in the path are distinct
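The two definitions can be checked mechanically. The sketch below is only an illustration under assumptions: the function name `is_simple_path`, the flat list encoding `[r0, p1, r1, ..., pn, rn]` and the toy description base `R` are all hypothetical and not part of SPARQLeR.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, property, object)


def is_simple_path(seq: List[str], triples: Set[Triple], directed: bool = True) -> bool:
    """Check a candidate path given as [r0, p1, r1, ..., pn, rn] (hypothetical encoding)."""
    # n > 0 means at least one property, and the sequence must alternate,
    # so it has an odd number of elements and at least three of them.
    if len(seq) < 3 or len(seq) % 2 == 0:
        return False
    # All resources r0..rn must be distinct.
    resources = seq[0::2]
    if len(set(resources)) != len(resources):
        return False
    # Every step must be a triple in R; an undirected path also accepts
    # the triple with subject and object swapped.
    for i in range(1, len(seq), 2):
        s, p, o = seq[i - 1], seq[i], seq[i + 1]
        if (s, p, o) in triples:
            continue
        if not directed and (o, p, s) in triples:
            continue
        return False
    return True


# Toy description base (made up for illustration).
R = {("a", "knows", "b"), ("c", "knows", "b")}

print(is_simple_path(["a", "knows", "b"], R))
print(is_simple_path(["a", "knows", "b", "knows", "c"], R))
print(is_simple_path(["a", "knows", "b", "knows", "c"], R, directed=False))
```

In this toy base, the two-step sequence from a to c is rejected as a directed path and accepted as an undirected one, because its second triple points from c to b.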
Paths and SPARQL
• SPARQL query can express only static
graph patterns.
– Some flexibility is introduced by an OPTIONAL
part, but it does not solve path problems.
• No support for flexible length path
expressions.
– Glycan biosynthesis pathway in biology has a
specific pattern (properties), but its length
may be unknown.
– Path discovery may be of unknown length and
pattern, like in Dr. Swanson’s example.
What do we need to discover paths?
• Knowledge discovery needs more flexible
patterns.
– Patterns may be partially known or even
unknown (unrestricted path).
– Properties on the path, their order and
directionality create a specific meaning.
– Entities on the path provide content.
– Relationships to entities outside of the path
give an additional context.
Proposed extensions
• A path may have a flexible length
– For computational reasons, length is limited.
• Constraints on properties
– Specific properties must appear in the path.
– Their order and directionality are meaningful.
– They can form a repeating pattern.
• Constraints on resources
– Specific resources must be on the path.
– They can be anywhere on the path or at
specific positions.
SPARQLeR
• Extension of SPARQL for semantic
association discovery.
• Seamlessly integrated into the SPARQL
syntax.
• Graph patterns incorporating simple paths
with constraints.
• Constraints are based on regular
expressions over properties.
What is a path in SPARQLeR?
• A path is a meta-property that connects two
resources.
– Defined as a sequence of interleaving properties and
resources.
– Starts and ends with properties (endpoint resources are
not included).
– A path of length 1 is a sequence with just one property.
<rdfs:Class rdf:about="http://meta.org/rdf-meta-schema#Path">
  <rdfs:isDefinedBy rdf:resource="http://meta.org/rdf-meta-schema#"/>
  <rdfs:subClassOf rdf:resource="http://www.w3.org/1999/02/22-rdf-syntax-ns#Property"/>
  <rdfs:subClassOf rdf:resource="http://www.w3.org/1999/02/22-rdf-syntax-ns#Seq"/>
  <rdfs:label>Path</rdfs:label>
  <rdfs:comment>The class of RDFMS paths.</rdfs:comment>
</rdfs:Class>
Path patterns in SPARQLeR
• Meta-property – similar concept to a property
– Resource –[property] Resource
– Resource –[path] Resource
• Path as a Sequence
– Test if a resource is in the path:
• rdfs:member
– Test if a resource is at a specific position in the path:
• rdf:_2, rdf:_4, ...
• SPARQLeR-specific path properties
– Test all resources or all properties in the path:
• rdfms:entityResource and rdfms:propertyResource
Example: all resources on a path must be of type foo:Person
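To make the sequence view concrete, here is a minimal sketch of a path as a list that starts and ends with a property. The class `Path`, its method names and the sample data are hypothetical. Treating `rdfs:member` as a test over every element of the sequence is an assumption made for this illustration.

```python
from typing import Dict, List


class Path:
    """Hypothetical in-memory model: property, resource, property, ..., property."""

    def __init__(self, elements: List[str]):
        # A path starts and ends with a property, so the element count is odd.
        if len(elements) % 2 == 0:
            raise ValueError("a path must have an odd number of elements")
        self.elements = elements

    def length(self) -> int:
        # Path length counts properties only.
        return (len(self.elements) + 1) // 2

    def has_member(self, x: str) -> bool:
        # Mirrors: %path rdfs:member <x> (assumed to test any element).
        return x in self.elements

    def at(self, n: int) -> str:
        # Mirrors: %path rdf:_n <x>, with 1-based positions.
        return self.elements[n - 1]

    def entity_resources(self) -> List[str]:
        # Mirrors rdfms:entityResource: the resources on the path.
        return self.elements[1::2]

    def property_resources(self) -> List[str]:
        # Mirrors rdfms:propertyResource: the properties on the path.
        return self.elements[0::2]


# Made-up path of length 4 (7 elements) and made-up type assertions.
path = Path(["p1", "e1", "p2", "e2", "p3", "e3", "p1"])
types: Dict[str, str] = {"e1": "foo:Person", "e2": "foo:Person", "e3": "foo:Person"}

# "All resources on a path must be of type foo:Person"
all_persons = all(types.get(e) == "foo:Person" for e in path.entity_resources())
print(path.length(), path.at(3), all_persons)
```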
Path pattern anatomy
[Figure: path patterns (match of a path variable). Example paths are built from properties p1, p2 and p3. The annotated path has length 4 and 7 elements, numbered 1 to 7; rdfs:member, rdf:_3, rdf:_6, rdfms:entityResource and rdfms:propertyResource address its elements.]
Path types in SPARQLeR
• Directionality of relationships in the path defines
its specific semantics.
• SPARQLeR allows definition of the following path
types
– As defined in graph theory
• Directed
• Undirected
– SPARQLeR specific extension
• Defined directionality path
(includes directed path)
Directionality of properties in path
• Defined directionality paths:
– Neither directed nor undirected
– Each property in a path has a specified directionality
• Example: simple graph with p relationship
(a) X p* Y, directed path
(b) X p* Y, undirected path
(c) X ( p p^-1 )* Y, defined directionality path
[Figure: graphs of p edges between X and Y illustrating cases (a), (b) and (c).]
Inverse property operator
• In standard SPARQL there is no need for
inverse property operator
– Pattern syntax is based on individual
statements, so it is easy to reverse direction.
• Defining path constraints requires the
inverse operator
– A path expression defines constraints on
properties, not on individual statements.
– Without the inverse property operator some
path constraints would be impossible to
express (as shown in the previous example).
RegExp in path constraints
• Path constraints on properties are based
on regular expressions
– Uses syntax similar to lex
– Easy for grep users
• Examples:
a
c* d
[abc] c? d
a+
(b|c) a
( b a^-1 )+
c
Path constraints in SPARQLeR
• Defined as regular path expressions
– Can specify patterns of properties in the path
– Directionality requirement needs the inverse operator (‘-’ minus): -p
• Supported regular expressions
p                  (single property)
-p                 (the inverse of p)
.                  (wildcard)
x | y              (alternative)
xy                 (concatenation)
x*                 (Kleene star)
x+                 (one or more repetitions)
(x)                (match a path matched by x)
[p1 p2 ... pn]     (class of properties)
-[p1 p2 ... pn]    (class of inverse properties)
[^p1 p2 ... pn]    (complement of properties)
-[^p1 p2 ... pn]   (inverse of complement of properties)
Path constraints (cont’d)
• Class of properties and inverse operator
– Complement operator can be applied only to
defined properties, not their inverses
– Inverse operator
• Not allowed inside class of properties
• Inverses set created from defined properties
– Example, with defined properties q r s t:
[^rt]              matches  q s
-[^qr]             matches  t^-1 s^-1 (inverses)
([^st] | -[^t])    matches  q r q^-1 r^-1 s^-1
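The class, complement and inverse rules can be replayed with plain set operations. This is a sketch under assumptions: the helper `expand_class` and the `^-1` suffix used to print inverses are hypothetical notation for this illustration, not SPARQLeR syntax.

```python
from typing import Iterable, List, Set


def expand_class(defined: List[str], listed: Iterable[str],
                 complement: bool = False, inverse: bool = False) -> Set[str]:
    """Return the labels matched by [..], [^..], -[..] or -[^..] (hypothetical helper)."""
    listed_set = set(listed)
    # The complement is taken over the defined properties only, never over inverses.
    chosen = [p for p in defined if (p not in listed_set) == complement]
    # The inverse operator is applied to the whole class after selection.
    return {p + "^-1" if inverse else p for p in chosen}


defined = ["q", "r", "s", "t"]

a = expand_class(defined, ["r", "t"], complement=True)                 # [^rt]
b = expand_class(defined, ["q", "r"], complement=True, inverse=True)   # -[^qr]
c = (expand_class(defined, ["s", "t"], complement=True)                # [^st]
     | expand_class(defined, ["t"], complement=True, inverse=True))    # -[^t]

print(sorted(a), sorted(b), sorted(c))
```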
Integrating paths into SPARQL
• Path variable binds a path
– Name begins with ‘%’ instead of ‘?’
• Simple patterns – path between two
resources
SELECT ?prop WHERE {<r> ?prop <s>}
SELECT %path WHERE {<r> %path <s>}
• Single source path
SELECT %path, ?res WHERE {<r> %path ?res}
Integrating paths into SPARQL
• Resources on the path
SELECT %path WHERE
{<r> %path <s> . %path rdfs:member <e>}
SELECT %path WHERE
{<r> %path <s> . %path rdf:_1 <p>}
• Listing path elements – list operator
SELECT list(%path) WHERE {<r> %path <s>}
Expressing path constraints
• Bounded path length
– only constants allowed
FILTER(length(%path)<5)
FILTER(length(%path)>3 && length(%path)<7)
Expressing path constraints
• Constraints added as a regular expression
filter (existing syntax in SPARQL)
regex( pathvariable, pathexpr, pathflags )
FILTER(regex(%path,".*foo:prop.*","uis"))
– Flags: i (instances) s (schema) l (literals)
h (match using hierarchy)
d (set directionality) u (undirected)
– Default flags: d i
Some examples
SELECT list(%path), ?res WHERE
{<r> %path ?res .
 %path rdfs:member ?x .
 ?x foo:locatedIn wiki:Europe
 FILTER(regex(%path,"foo:prop+"))}
SELECT list(%path) WHERE
{<r> %path <s> .
 %path rdfms:entityResource ?x .
 ?x rdf:type foo:Person
 FILTER(regex(%path,"(foo:prop|foo:rel)+","u"))}
SELECT list(%path) WHERE
{<r> %path <s>
 FILTER(length(%path)<=6 && length(%path)>=4 &&
        regex(%path,"(foo:prop -foo:rel)+"))}
SPARQLeR Prototype Implementation
• Prototype implementation is based on
BRAHMS – RDF/S main memory storage.
• Path search based on a bi-directional BFS
for simple paths.
• Checking of path constraints in regex is
implemented as a simulation of DFAs.
Janik, M. and Kochut, K., BRAHMS: A WorkBench RDF Store And High Performance
Memory System for Semantic Association Discovery. ISWC 2005
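The idea of searching from both endpoints can be sketched as a meet-in-the-middle enumeration of simple paths. This is a simplified illustration, not the BRAHMS implementation. All function names are hypothetical, edges are traversed in both directions, and a leading `-` on a label is an assumed marker for a step taken against the edge direction.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def build_adjacency(triples: List[Triple]) -> Dict[str, List[Tuple[str, str]]]:
    adj: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for s, p, o in triples:
        adj[s].append((p, o))        # step along the edge
        adj[o].append(("-" + p, s))  # step against the edge
    return adj


def partial_paths(adj, start: str, max_depth: int) -> List[List[List[str]]]:
    """Breadth-first enumeration of simple partial paths, grouped by depth."""
    levels = [[[start]]]
    for _ in range(max_depth):
        nxt = []
        for path in levels[-1]:
            visited = set(path[0::2])
            for label, node in adj[path[-1]]:
                if node not in visited:  # keep partial paths simple
                    nxt.append(path + [label, node])
        levels.append(nxt)
    return levels


def flip(label: str) -> str:
    return label[1:] if label.startswith("-") else "-" + label


def reverse_path(path: List[str]) -> List[str]:
    # Reverse the element order and flip every property label.
    return [flip(x) if i % 2 == 1 else x for i, x in enumerate(path[::-1])]


def simple_paths(triples: List[Triple], source: str, target: str,
                 max_len: int) -> List[List[str]]:
    adj = build_adjacency(triples)
    fwd = partial_paths(adj, source, (max_len + 1) // 2)
    bwd = partial_paths(adj, target, max_len // 2)
    results = []
    for f, f_level in enumerate(fwd):
        # Join halves of equal depth, or a forward half one step longer,
        # so that every full path is produced exactly once.
        for b in (f, f - 1):
            if b < 0 or b >= len(bwd) or f + b == 0:
                continue
            by_end = defaultdict(list)
            for bp in bwd[b]:
                by_end[bp[-1]].append(bp)
            for fp in f_level:
                for bp in by_end.get(fp[-1], []):
                    # The halves may share only the meeting resource.
                    if set(fp[0::2]) & set(bp[0::2]) == {fp[-1]}:
                        results.append(fp + reverse_path(bp)[1:])
    return results


# Made-up graph for illustration.
triples = [("a", "p", "b"), ("c", "q", "b"), ("a", "r", "c")]
for found in sorted(simple_paths(triples, "a", "c", 2)):
    print(found)
```

Each half is expanded to roughly half of the length bound, which is what keeps the frontier small compared with a search from one endpoint only.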
Implementation details
• Each path expression (FILTER regex) is
translated into a DFA.
– For a path between two resources, partial
constraints are checked while building the
search trie from both endpoints
(forward and reverse DFAs).
– When a path is connected, the forward DFA
is used to check the full (path) constraint.
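A minimal sketch of simulating a DFA over the property labels of a path follows. The `DFA` class is hypothetical, and the transition table for `(foo:prop -foo:rel)+` is written by hand rather than compiled from the expression. Only the forward automaton is shown, and the pruning test assumes that every state in the table can still reach an accepting state.

```python
from typing import Dict, Iterable, Optional, Set, Tuple


class DFA:
    """Hypothetical table-driven DFA over property labels."""

    def __init__(self, start: int, accepting: Set[int],
                 transitions: Dict[Tuple[int, str], int]):
        self.start = start
        self.accepting = accepting
        self.transitions = transitions

    def run(self, labels: Iterable[str]) -> Optional[int]:
        # Follow one transition per label; None means no transition existed.
        state: Optional[int] = self.start
        for label in labels:
            state = self.transitions.get((state, label))
            if state is None:
                return None
        return state

    def accepts(self, labels: Iterable[str]) -> bool:
        # Full constraint check, used once a complete path is known.
        state = self.run(labels)
        return state is not None and state in self.accepting

    def is_viable_prefix(self, labels: Iterable[str]) -> bool:
        # Partial check, used to prune a partial path during the search.
        return self.run(labels) is not None


# Hand-built automaton for "(foo:prop -foo:rel)+".
dfa = DFA(
    start=0,
    accepting={2},
    transitions={
        (0, "foo:prop"): 1,
        (1, "-foo:rel"): 2,
        (2, "foo:prop"): 1,
    },
)

print(dfa.accepts(["foo:prop", "-foo:rel"]))
print(dfa.accepts(["foo:prop"]))
print(dfa.is_viable_prefix(["foo:prop"]))
```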
Experiments: biology pathway
• Biosynthesis paths in biology (glycomics)
• How is a specific glyco peptide created from a basic
structure?
– Find pathway between dolichol phosphate and glyco
peptide G00009
• Path has 15 reactions (30 hops, as each reaction is
represented by its substrates and products)
• Only an undirected path connects the endpoint resources, but
a specific directionality pattern is present
RDF representation: sample reactions in the path
Experiments: biology pathway
• Functionality test - proof of concept
N-glycan biosynthesis pathway
SELECT list(%path) WHERE {
glyco:dolichol_phosphate %path glyco:glyco_peptide_G00009 .
%path rdfs:member enzyo:R05969
FILTER ( length(%path) <= 30 &&
regex(%path, "((-glyco:has_acceptor_substrate|
-glyco:has_reactant) glyco:has_product)*" ) ) }
Ontology:      GlycO
Length:        30 hops
Consists of:   15 reactions
Search time:   milliseconds (less than 1 tick)...
courtesy of Dr. Alison Vandersall-Nairn, University of Georgia
Experiments
• Scalability
– Modified DBLP datasets in RDF (added random citations)
– Test on increasing dataset (adding older years of
publications)
– Search for cited publications (transitive)
PREFIX opus:
<http://lsdis.cs.uga.edu/projects/semdis/opus#>
SELECT ?end_publication WHERE {
<http://dblp.uni-trier.de/rec/bibtex/journals/ai/Huber06>
%path ?end_publication
FILTER ( length(%path)<=26 &&
regex(%path, "(opus:cites_publication)*" ) ) }
B. Aleman-Meza et al. Semantic Analytics on Social Networks:
Experiences in Addressing the Problem of Conflict of Interest Detection. (WWW2006)
Experiments – dataset characteristics
Experiments – results: single source paths
Search paths up to length 26
Experiments – results: two endpoint paths
More complex uses of path expressions
• Discover connecting paths with a shared node
– Path between A and B, length up to 4
– Path between C and D, length up to 4
– Both paths have a shared resource
A %path_1 B
length(%path_1) <= 4
C %path_2 D
length(%path_2) <= 4
%path_1 rdfs:member ?x
%path_2 rdfs:member ?x
[Figure: a path from A to B and a path from C to D that meet at a shared resource ?x.]
Potential subgraph discovery
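The shared-resource pattern amounts to intersecting the members of two bound paths. In the sketch below the helper names and the sample paths are hypothetical. Paths use the sequence form property, resource, ..., property, and only resources (not properties) are compared, which is an assumption based on the slide's wording.

```python
from typing import List, Set, Tuple

PathSeq = List[str]  # property, resource, property, ..., property


def path_length(path: PathSeq) -> int:
    # Length counts the properties in the sequence.
    return (len(path) + 1) // 2


def resources_on(path: PathSeq) -> Set[str]:
    # Resources sit between properties, at every second position.
    return set(path[1::2])


def intersecting_pairs(paths_1: List[PathSeq], paths_2: List[PathSeq],
                       max_len: int = 4) -> List[Tuple[PathSeq, PathSeq, Set[str]]]:
    """Pairs of length-bounded paths that share at least one resource."""
    pairs = []
    for p1 in paths_1:
        if path_length(p1) > max_len:
            continue
        for p2 in paths_2:
            if path_length(p2) > max_len:
                continue
            shared = resources_on(p1) & resources_on(p2)
            if shared:
                pairs.append((p1, p2, shared))
    return pairs


# Made-up candidate bindings for %path_1 (A to B) and %path_2 (C to D).
a_to_b = [["p", "x", "q"], ["p", "y", "q", "z", "r"]]
c_to_d = [["s", "x", "t"], ["s", "w", "t"]]

pairs = intersecting_pairs(a_to_b, c_to_d)
print(pairs)
```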
SPARQLeR summary
• Path expressions
– use of regular expressions over properties
• Flexible path specification
– Undirected
– Defined directionality paths
• Directed
– Length restricted
• Complex path patterns
– Test of resources and properties on the path
– Intersecting paths
Conclusion and future work
• SPARQLeR extension fits seamlessly into
the current SPARQL syntax.
• Performance of path queries is acceptable
(if defined expression is highly selective).
• Optimization of path queries, complex
expressions and multiple paths in query.
• Inclusion of context.
SPARQLeR
Krys Kochut, Maciej Janik
Thank you