RDF storages and indexes Enterprise Integration – Semantic Web Maciej Janik

Enterprise Integration – Semantic Web
RDF storages and indexes
Maciej Janik
September 1, 2005
Outline
• RDF storages
– Jena
– Sesame
– Redland
– Brahms
• Indexing RDF
– difference from DB indexing
– what to index
– examples of index types
Storages
• Jena
– Implemented in Java
– Supports RDF, RDFS and OWL
– In memory and persistent storage (Oracle,
MySQL, PostgreSQL)
– RDQL
– Reasoning/inference engine
– Optimization for common statement patterns (grouping of properties)
– Powerful, but slow and memory-intensive
Storages
• Sesame
– Implemented in Java
– Modules (HTTP/SOAP handler, admin, query,
export, Repository Abstraction Layer)
– Persistent RDF store
• traditional DBMS or dedicated RDF triple storage
– Database independent
– Scalable architecture
– Node-centric approach
– Fast and efficient for a Java implementation
Storages
• Redland – together with Rasqal and Raptor
– Modular approach
– Redland – only storage for RDF triples + low
level API
– Implemented in pure C for portability
– Rich API and bindings to other languages
– Rasqal - RDF query module (RDQL, SPARQL)
– Raptor - a very fast RDF parser
– Average performance
Storages
• Brahms /from LSDIS lab/
– Read-only main-memory storage for RDF
• reads RDF and saves an optimized snapshot
– Written in C++, optimized for speed
• additional bindings to Java
– Full indexing of Subject-Predicate-Object
– Uses Raptor as RDF parser
– Rich low level API for graph manipulation
– Very fast and memory efficient
– Waiting for SPARQL implementation
Brahms
• Separation of different resource types:
– InstanceNode, Literal, SchemaClass,
SchemaProperty
– Statements
• InstanceStatement (instance – property – instance)
• LiteralStatement (instance – property – literal)
• TypeOfStatement (instance – type – class)
– Taxonomy for classes and properties
• Iterators deal with only one type of resource
– an instance search wastes no time checking for literal or type relations
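The separation of statement types can be pictured with a small sketch. This is not the Brahms API (Brahms is written in C++); the class name `TypedStore`, its methods, and the use of `rdf:type` as the type predicate are assumptions made purely for illustration. Each statement kind lives in its own container, so an instance traversal never has to test whether an object is a literal or a class.

```python
from dataclasses import dataclass, field

TYPE_PREDICATE = "rdf:type"  # assumed marker for type-of statements


@dataclass
class TypedStore:
    """Hypothetical store that keeps one list per statement kind."""
    instance_statements: list = field(default_factory=list)
    literal_statements: list = field(default_factory=list)
    type_of_statements: list = field(default_factory=list)

    def add(self, subject, predicate, obj, is_literal=False):
        # Classify once, at load time, instead of on every search.
        triple = (subject, predicate, obj)
        if predicate == TYPE_PREDICATE:
            self.type_of_statements.append(triple)
        elif is_literal:
            self.literal_statements.append(triple)
        else:
            self.instance_statements.append(triple)

    def instance_neighbours(self, subject):
        # Scans instance statements only; literals and types are never visited.
        return [o for s, _, o in self.instance_statements if s == subject]

    def types_of(self, subject):
        return [o for s, _, o in self.type_of_statements if s == subject]
```

The linear scans here are only for brevity; the point is that the classification cost is paid once when the data is read.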
Indexing of RDF
• RDF = Graph
– traditional DB indexes may not be sufficient
• XML cannot be indexed directly like a relational DB
• Indexing may take advantage of the tree structure
– depth of node
– common path from the root
– convert each path to string expression
– precalculate the path tree
• Simple indexes on statements may also be
powerful
What to index?
• Most straightforward approach
– Statements: subject –[predicate]→ object
• Possibilities:
– Single (Brahms): SPO, SOP, OSP, OPS, PSO, POS
– Double (Redland): SOP, SPO, POS
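One way to read the "double" combinations (SOP, SPO, POS) is as indexes keyed by two parts of a statement that return the third. The sketch below assumes that reading; the class `DoubleKeyIndex` and its attribute names are illustrative, not taken from any library.

```python
from collections import defaultdict


class DoubleKeyIndex:
    """Hypothetical two-key index: any two statement parts look up the third."""

    def __init__(self):
        self.sp_to_o = defaultdict(set)  # (subject, predicate) -> objects
        self.po_to_s = defaultdict(set)  # (predicate, object)  -> subjects
        self.so_to_p = defaultdict(set)  # (subject, object)    -> predicates

    def add(self, s, p, o):
        self.sp_to_o[(s, p)].add(o)
        self.po_to_s[(p, o)].add(s)
        self.so_to_p[(s, o)].add(p)

    def objects(self, s, p):
        return self.sp_to_o.get((s, p), set())

    def subjects(self, p, o):
        return self.po_to_s.get((p, o), set())

    def predicates(self, s, o):
        return self.so_to_p.get((s, o), set())
```

With hash-based lookups, a query with two bound parts is answered in one step, while a query with only one bound part is not directly supported by these three tables.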
Single indexes in Brahms [design]
Power of single indexes
• Full indexing of statements
– SPO, SOP, PSO, POS, OSP, OPS
– indexes for each type of statements
(InstanceStatements, LiteralStatements ...)
– fast check whether a given resource is connected to another, or uses a given property, via binary search
– merge of 2-hop path elements in linear time
• All RDF storages are based on simple
indexes and their extensions
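The two operations named above can be sketched with sorted lists of permuted triples. This is a minimal illustration, not the Brahms implementation; the function names and the choice of Python's `bisect` module are assumptions. A membership check is a binary search, and finding the middle nodes of a 2-hop path is a linear merge of two sorted runs.

```python
from bisect import bisect_left

POSITION = {"s": 0, "p": 1, "o": 2}


def build_index(triples, order):
    """Permute each (s, p, o) triple into `order` (e.g. "sop") and sort."""
    return sorted(tuple(t[POSITION[c]] for c in order) for t in triples)


def contains(index, key):
    """Binary search for one fully specified entry."""
    i = bisect_left(index, key)
    return i < len(index) and index[i] == key


def with_prefix(index, prefix):
    """Binary search to the first entry with `prefix`, then read the run."""
    i = bisect_left(index, prefix)
    n = len(prefix)
    while i < len(index) and index[i][:n] == prefix:
        yield index[i]
        i += 1


def two_hop_midpoints(sop, osp, start, end):
    """Nodes m with start -> m -> end, found by merging two sorted runs."""
    outs = [e[1] for e in with_prefix(sop, (start,))]  # objects of start, sorted
    ins = [e[1] for e in with_prefix(osp, (end,))]     # subjects of end, sorted
    i = j = 0
    result = []
    while i < len(outs) and j < len(ins):
        if outs[i] == ins[j]:
            if not result or result[-1] != outs[i]:  # skip duplicates
                result.append(outs[i])
            i += 1
            j += 1
        elif outs[i] < ins[j]:
            i += 1
        else:
            j += 1
    return result
```

The merge touches each element of the two runs at most once, which is where the linear bound comes from.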
Schema Vs. Instances [Brahms]
• Schema is small compared to instances
• Instance to taxonomy
– know or check for type of the instance
• Taxonomy index (classes and properties)
– direct subtypes/supertypes
– all ancestors/descendants
– dynamically built index of instances for a given type and all its subtypes
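A taxonomy index along these lines might look like the following sketch. The class `Taxonomy`, its method names, and the cache-on-first-use strategy are assumptions for illustration; the slides do not say how Brahms builds the index. Direct subtype links are stored explicitly, descendants are collected by traversal, and the instance set for a type (including all its subtypes) is built on demand and cached.

```python
from collections import defaultdict


class Taxonomy:
    """Hypothetical class taxonomy with an on-demand instance index."""

    def __init__(self):
        self.direct_subtypes = defaultdict(set)
        self.direct_supertypes = defaultdict(set)
        self.direct_instances = defaultdict(set)
        self._instance_cache = {}

    def add_subclass(self, sub, sup):
        self.direct_subtypes[sup].add(sub)
        self.direct_supertypes[sub].add(sup)
        self._instance_cache.clear()  # cached answers may be stale

    def add_instance(self, instance, cls):
        self.direct_instances[cls].add(instance)
        self._instance_cache.clear()

    def descendants(self, cls):
        """All transitive subtypes of cls (cls itself excluded unless cyclic)."""
        seen, stack = set(), [cls]
        while stack:
            current = stack.pop()
            for sub in self.direct_subtypes.get(current, ()):
                if sub not in seen:
                    seen.add(sub)
                    stack.append(sub)
        return seen

    def instances(self, cls):
        """Instances of cls and of all its subtypes, built once and cached."""
        if cls not in self._instance_cache:
            result = set(self.direct_instances.get(cls, ()))
            for sub in self.descendants(cls):
                result |= self.direct_instances.get(sub, set())
            self._instance_cache[cls] = result
        return self._instance_cache[cls]
```

Because the schema is small compared to the instances, traversing the taxonomy is cheap; the cached instance sets are the part worth keeping around.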
Tree-based index
• Idea is based on the Patricia trie
• Index should scale with the growth of data
• Path together with leaf is encoded into a string -> the Index Fabric
"A Fast Index for Semistructured Data" - Brian F. Cooper et al.
Index fabrics
• Index is used to accelerate path expressions, mainly for queries that ask for a root-to-leaf path
• Idea of prefix encoding
– xml: <A>alpha<B>beta<C>gamma</C></B></A>
– paths: <A>alpha ; <A><B>beta ; <A><B><C>gamma
– encoded: A alpha ; A B beta ; A B C gamma
– infix (not common): A alpha B beta C gamma
• Convert path to string for fast searches
• Replace tags with 'non-terminal' characters (like in automata)
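The prefix encoding from the example above can be reproduced in a few lines. This is only a sketch of the encoding step, not the Index Fabric itself (no Patricia trie, no layers); the function names are illustrative, tag names are used directly as designators, and text that follows a closing tag is ignored for simplicity.

```python
import xml.etree.ElementTree as ET
from bisect import bisect_left


def encode_paths(xml_text):
    """Encode every root-to-text path as one string: tags, then the value."""
    keys = []

    def walk(elem, prefix):
        path = prefix + [elem.tag]
        text = (elem.text or "").strip()
        if text:
            keys.append(" ".join(path + [text]))
        for child in elem:
            walk(child, path)

    walk(ET.fromstring(xml_text), [])
    return keys


def lookup(sorted_keys, path_prefix):
    """Find all encoded keys starting with path_prefix via binary search."""
    i = bisect_left(sorted_keys, path_prefix)
    matches = []
    while i < len(sorted_keys) and sorted_keys[i].startswith(path_prefix):
        matches.append(sorted_keys[i])
        i += 1
    return matches
```

Once paths are plain strings, any ordered string index can answer path queries; a sorted list stands in here for the trie.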
Indexing of graphs
Backbone
http://www.aisee.com/
Indexing of graphs
Tree-type
- prefixes
- tries
http://www.aisee.com/
Indexing of graphs
T-index
Path templates
2-index
1-index
"Index Structures for Path Expressions" - Tova Milo, Dan Suciu
Indexing of graphs
Landmarks
http://www.aisee.com/
Indexing of graphs
• Indexing semistructured data
– index fabric - encoding, multilayered
– common prefixes - trie structure
– backbone - highways between points
– landmarks - county division
– path templates - precalculated expressions
– clustering - grouping by theme access
• Indexing such data is NOT easy; the solution depends on how you want to search the graph
References
• Beckett D., "The Design and Implementation of the Redland RDF Application Framework"
• Cooper et al., "A Fast Index for Semistructured Data"
• Janik M. and Kochut K., "BRAHMS: A WorkBench RDF Store And High Performance Memory System for Semantic Association Discovery"
• Milo T. and Suciu D., "Index Structures for Path Expressions"
• Wilkinson et al., "Efficient RDF Storage and Retrieval in Jena2"
• Jena - http://jena.sourceforge.net/
• Raptor - http://librdf.org/raptor/
• Redland - http://librdf.org/
• Sesame - http://www.openrdf.org/