RDF languages and storages part 2 - indexing semi-structure data Maciej Janik Conrad Ibanez CSCI 8350, Fall 2004 Outline Jena storage Indexing techniques Jena Implemented in Java One of the most popularly used RDF storages and query engines Supports RDF, RDFS and OWL In memory and persistent storage (Oracle, MySQL, PostgreSQL) RDQL Reasoning/inference engine Jena - storage schema Previous version used normalized relational DB tables statements literals resources Taken approach to store triples as (Subject, Predicate, Object) in denormalized tables Optimization for common statement patterns - grouping of properties Jena - storage Normalized tables Denormalized „Efficient RDF Storage and Retrieval in Jena2” - Wilkinson et al. Jena - storage Do certain trade-off for space and search time Cluster properties that are likely to be accessed together optimize for common patterns Special treatment of reified statements Jena - graph abstraction Graph interface is separated from (persistent) triple storage layer Special support for different types of graphs - optimized for performance Support operations like add, delete, find. Jena - query processing Converting multiple patterns in query into one query to DB Use DB query optimizer instead of executing multiple queries from Jena level (as it was in Jena1) Associate a table with pattern (best) or span pattern between tables (requires join operation) Query may span between different graphs, but it can be optimized only if they are in the same database What to index? How to index? Indexing semistructured data XML cannot be indexed directly as relational DB Indexing may take advantage of tree structure depth of node common path from the root convert each path to string expression precalculate the path tree Indexing semistructured data Idea is based on Particia’s trie Index should scale with the growth of data Path together with leaf is encoded into string -> the Index Fabric „A Fast Index for Semistructured Data” - Brian F. Cooper et al. A Layered Index „A Fast Index for Semistructured Data” - Brian F. Cooper et al. Index Fabric Index is used to accelerate path expressions - mainly for queries that ask for root-to-leaf path Idea of prefix encoding xml: <A>alpha<B>beta<C>gamma</C></B></A> paths: <A>alpha ; <A><B>beta ; <A><B><C>gamma encoded: A alpha ; A B beta ; A B C gamma infix (not common): A alpha B beta C gamma Convert path to string for fast searches Replace tags with ‘non-terminal’ characters (like in automata) Index Fabric - raw paths „A Fast Index for Semistructured Data” - Brian F. Cooper et al. Graphs - how to index? Backbone http://www.aisee.com/ Graphs - how to index? Tree-type - prefixes - tries http://www.aisee.com/ Graphs - how to index? T-index Path templates 2-index 1-index „Index Structure for Path Expressions” - Tova Milo, Dan Suciu Graphs - how to index? Landmarks http://www.aisee.com/ Indexing - summary Indexing semistructure data index fabric - encoding, multilayered common prefixes - trie structure backbone - highways between points landmarks - county division path templates - precalculated expressions clustering - grouping by theme access Indexing such data is NOT easy, solution depends how you want to search the graph References „Efficient RDF Storage and Retrieval in Jena2” - Kevin Wilkinson, Craig Sayers, Harumi Kuno, Dave Reynolds „A Fast Index for Semistructured Data” Brian F. Cooper, Neal Sample, Michael J. Franklin, Gisli Hjaltason, Moshe Shadmon „Index Structures for Path Expressions” Tova Milo, Dan Suciu