Enterprise Integration – Semantic Web
RDF storages and indexes
Maciej Janik
September 1, 2005

Outline
• RDF storages
  – Jena
  – Sesame
  – Redland
  – Brahms
• Indexing RDF
  – difference from DB indexing
  – what to index
  – examples of index types

Storages
• Jena
  – Implemented in Java
  – Supports RDF, RDFS and OWL
  – In-memory and persistent storage (Oracle, MySQL, PostgreSQL)
  – RDQL
  – Reasoning/inference engine
  – Optimization for common statement patterns (grouping of properties)
  – Powerful, but slow and memory-intensive

Storages
• Sesame
  – Implemented in Java
  – Modules (HTTP/SOAP handler, admin, query, export, Repository Abstraction Layer)
  – Persistent RDF store
    • traditional DBMS or dedicated RDF triple storage
  – Database independent
  – Scalable architecture
  – Node-centric approach
  – Fast and efficient for a Java implementation

Storages
• Redland (together with Rasqal and Raptor)
  – Modular approach
  – Redland: only storage for RDF triples + low-level API
  – Implemented in pure C for portability
  – Rich API and bindings to other languages
  – Rasqal: RDF query module (RDQL, SPARQL)
  – Raptor: a very fast RDF parser
  – Average performance

Storages
• Brahms (from LSDIS lab)
  – Read-only main-memory storage for RDF
    • reads RDF and saves an optimized snapshot
  – Written in C++, optimized for speed
    • additional bindings to Java
  – Full indexing of Subject-Predicate-Object
  – Uses Raptor as RDF parser
  – Rich low-level API for graph manipulation
  – Very fast and memory efficient
  – Waiting for SPARQL implementation

Brahms
• Separation of different resource types:
  – InstanceNode, Literal, SchemaClass, SchemaProperty
  – Statements
    • InstanceStatement (instance – property – instance)
    • LiteralStatement (instance – property – literal)
    • TypeOfStatement (instance – type – class)
  – Taxonomy for classes and properties
• Iterators deal with only one type of resource
  – an instance search algorithm wastes no time checking for literal or type relations

Indexing of RDF
• RDF = Graph
  – traditional DB indexes may not be sufficient
• XML cannot be indexed directly like a relational DB
• Indexing may take advantage of tree structure
  – depth of node
  – common path from the root
  – convert each path to a string expression
  – precalculate the path tree
• Simple indexes on statements may also be powerful

What to index?
• Most straightforward approach: statements
  subject –[predicate]→ object
• Possibilities:
  – Single: SPO, SOP, OSP, OPS, PSO, POS (Brahms)
  – Double: SOP, SPO, POS (Redland)

Single indexes in Brahms [design]

Power of single indexes
• Full indexing of statements
  – SPO, SOP, PSO, POS, OSP, OPS
  – indexes for each type of statement (InstanceStatements, LiteralStatements, ...)
  – fast check whether a given resource is connected to another, or uses a given property
  – use of binary search
  – merge of 2-hop path elements in linear time
• All RDF storages are based on simple indexes and their extensions

Schema vs. Instances [Brahms]
• Schema is small compared to instances
• Instance to taxonomy
  – know or check the type of an instance
• Taxonomy index (classes and properties)
  – direct subtypes/supertypes
  – all ancestors/descendants
  – dynamically built index of instances for a given type and all its subtypes

Tree-based index
• Idea is based on the Patricia trie
• Index should scale with the growth of data
• Path together with leaf is encoded into a string -> the Index Fabric
"A Fast Index for Semistructured Data" - Brian F. Cooper et al.
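
The idea of encoding a path together with its leaf into a string can be sketched roughly as follows. This is only an illustration, not the Index Fabric implementation: it uses full tag names instead of single designator characters, and a sorted list with binary search stands in for the Patricia trie. The names `encode_paths` and `prefix_search` are hypothetical, and the sample document is the one used on the Index fabrics slide.

```python
from bisect import bisect_left
import xml.etree.ElementTree as ET


def encode_paths(xml_text):
    """Return one sorted 'tag tag ... text' key per element that has direct text."""
    root = ET.fromstring(xml_text)
    keys = []

    def walk(node, prefix):
        path = prefix + [node.tag]
        text = (node.text or "").strip()
        if text:
            # Key = root-to-node path followed by the leaf value.
            keys.append(" ".join(path + [text]))
        for child in node:
            walk(child, path)

    walk(root, [])
    return sorted(keys)


def prefix_search(keys, prefix):
    """Return all keys that start with `prefix` (binary search, then a short scan)."""
    start = bisect_left(keys, prefix)
    matches = []
    for key in keys[start:]:
        if not key.startswith(prefix):
            break
        matches.append(key)
    return matches


keys = encode_paths("<A>alpha<B>beta<C>gamma</C></B></A>")
print(keys)
print(prefix_search(keys, "A B "))
```

Because every key begins with its root-to-leaf path, a path query becomes a prefix search over strings, which is the property a trie exploits.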
Index fabrics
• Index is used to accelerate path expressions, mainly for queries that ask for a root-to-leaf path
• Idea of prefix encoding
  – xml: <A>alpha<B>beta<C>gamma</C></B></A>
  – paths: <A>alpha ; <A><B>beta ; <A><B><C>gamma
  – encoded: A alpha ; A B beta ; A B C gamma
  – infix (not common): A alpha B beta C gamma
• Convert path to string for fast searches
• Replace tags with 'non-terminal' characters (like in automata)

Indexing of graphs
Backbone
http://www.aisee.com/

Indexing of graphs
Tree-type - prefixes - tries
http://www.aisee.com/

Indexing of graphs
T-index (path templates), 2-index, 1-index
"Index Structures for Path Expressions" - Tova Milo, Dan Suciu

Indexing of graphs
Landmarks
http://www.aisee.com/

Indexing of graphs
• Indexing semistructured data
  – index fabric - encoding, multilayered
  – common prefixes - trie structure
  – backbone - highways between points
  – landmarks - county division
  – path templates - precalculated expressions
  – clustering - grouping by theme access
• Indexing such data is NOT easy; the solution depends on how you want to search the graph

References
• Beckett, D., "The Design and Implementation of the Redland RDF Application Framework"
• Cooper et al., "A Fast Index for Semistructured Data"
• Janik, M. and Kochut, K., "BRAHMS: A WorkBench RDF Store And High Performance Memory System for Semantic Association Discovery"
• Milo, T. and Suciu, D., "Index Structures for Path Expressions"
• Wilkinson et al., "Efficient RDF Storage and Retrieval in Jena2"

• Jena - http://jena.sourceforge.net/
• Raptor - http://librdf.org/raptor/
• Redland - http://librdf.org/
• Sesame - http://www.openrdf.org/
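
Appendix: sketch of single statement indexes

As an illustration of the "Power of single indexes" slide, the six orderings of subject, predicate and object can be kept as sorted lists and searched with binary search. This is a minimal sketch under assumptions, not the Brahms implementation (which is written in C++): the class `TripleIndex`, its `lookup` method and the sample triples are hypothetical.

```python
from bisect import bisect_left
from itertools import permutations


class TripleIndex:
    """Keeps one sorted list of reordered triples per ordering of S, P and O."""

    def __init__(self, triples):
        self.indexes = {}
        for order in permutations("SPO"):
            name = "".join(order)  # e.g. "SPO", "OPS"
            positions = ["SPO".index(c) for c in order]
            self.indexes[name] = sorted(
                tuple(t[i] for i in positions) for t in set(triples)
            )

    def lookup(self, order, *known):
        """Return rows of index `order` whose leading fields equal `known`."""
        rows = self.indexes[order]
        start = bisect_left(rows, known)  # binary search for the first candidate
        matches = []
        for row in rows[start:]:
            if row[:len(known)] != known:
                break
            matches.append(row)
        return matches


# Hypothetical sample data: (subject, predicate, object)
triples = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("alice", "worksFor", "acme"),
]
idx = TripleIndex(triples)

print(idx.lookup("SPO", "alice"))         # all statements about alice
print(idx.lookup("OPS", "bob"))           # all statements pointing at bob
print(idx.lookup("SOP", "alice", "bob"))  # is alice connected to bob?
```

Choosing the ordering whose leading fields are already known turns each question (statements about a subject, incoming statements, connection between two resources) into one binary search followed by a contiguous scan.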