Building an efficient RDF store over a Relational Database
  Bishwaranjan Bhattacharjee
  IBM T.J. Watson Research Center
  bhatta@us.ibm.com
  March 2013
  © 2013 IBM Corporation

Related SIGMOD 2013 Research Track Paper
  Building an efficient RDF store over a relational database
  Mihaela A. Bornea, Julian Dolby, Anastasios Kementsietsidis, Kavitha Srinivas, Patrick Dantressangle, Octavian Udrea, Bishwaranjan Bhattacharjee

Executive Summary
  New mechanism to store RDF data in relational systems
  Developed for DB2 LUW with support for SPARQL*
  Other possibilities beyond RDF
  * (SPARQL Protocol and RDF Query Language)

Brief introduction to RDF
  Biological data
  Financial applications
  Government
  Watson (Jeopardy Champ)
  Social media
  RDF data has a variable schema and is sparse; a dataset could have thousands of entities and predicates

Sample SPARQL query
  What are all the country capitals in Africa?

    PREFIX abc: <http://example.com/exampleOntology#>
    SELECT ?capital ?country
    WHERE {
      ?x abc:cityname ?capital ;
         abc:isCapitalOf ?y .
      ?y abc:countryname ?country ;
         abc:isInContinent abc:Africa .
    }

RDF data management
  Relational RDF storage
  Native RDF storage

  Subject           Predicate      Object
  Wolfgang Mozart   sonOf          Leopold Mozart
  Wolfgang Mozart   placeOfBirth   Salzburg
  Wolfgang Mozart   DoB            1756
  Wolfgang Mozart   category       musician

  Pros: Transaction support, compression, scalability, security, …
  Cons: Inefficient query processing

Our System Architecture

Primary Hash Table
  Subject Index
  Column Names in Relational Table

  Subject    Predicate 1   Value/Ref 1   …   Predicate N    Value/Ref N   Bitmap
  SubjectA   hasName       name Z        …   hasComposed    32162567      000…001
  SubjectB   hasName       name Y        …   lifeStory      ABCDEFG       000…000
  SubjectA   category      musicians     …   placeOfBirth   place X       000…000

  Hashtable for the predicates connected to a subject.
  Insertion: When a triple is inserted, the predicate is hashed to a position in the hashtable. If the position is occupied, multiple hashing is used to find an empty location.
  Retrieval: If the predicate is unknown, scan all predicates; otherwise, hash to retrieve the column.
  Value/Ref: If there is a single value for a (subject, predicate) pair (DBpedia: more than 81%), store the value in the hashtable (e.g., hasName). Otherwise, if there are multiple values, store a reference to a secondary hash table (e.g., hasComposed and 32162567).
  Bitmap: Specifies whether the value/ref column for each predicate contains a value or a reference.

Reverse Primary Hash Table
  Object Index
  Column Names in Relational Table

  Object      Predicate 1    Subject 1   …   Predicate M    Subject M   Bitmap
  musicians   category       3256874     …   Total number   100000      100…000
  place x     placeOfBirth   5238765     …   population     500000      100…000
  name z      hasName        SubjectA    …                              000…000

  Hashtable for the predicates connected to an object.
  Insertion: When a triple is inserted, the predicate is hashed to a position in the hashtable.
  If the position is occupied, multiple hashing is used to find an empty location.
  Retrieval: If the predicate is unknown, scan all columns; otherwise, hash to retrieve the column.
  Value/Ref: If there is a single value for an object, store the value in the hashtable (e.g., population). Otherwise, if there are multiple values, store a reference to a secondary hash table (e.g., 3256874).
  Bitmap: Specifies whether the value/ref column for each predicate contains a value or a reference.

Secondary Hash Table
  Reference Index

  Reference   Values (Value 1, Value 2, …, Value (K-1), Value K)
  3256874     Subject A, Subject B, …
  5238765     Subject A, Subject B, Subject Z, …
  32162567    composition1, composition2, compositionk, compositionz, …

  During query processing, the values are attached to a subject or object based on the reference id.

Graph Coloring To Assign Columns To Predicates
  Subject Index
  Column Names in Relational Table

  Subject    Predicate 1   Value/Ref 1   …   Predicate N   Value/Ref N   Bitmap
  SubjectA   SSN           123456        …   hasComposed   32162567      000…001
  SubjectB   Revenue       236090        …   Headquarter   Armonk        000…000
  SubjectC   Population    50000         …   Mayor         John Smith    000…000

  SubjectA: Details about a person
  SubjectB: Details about a company
  SubjectC: Details about a city

Intelligent data layout
  Given triples:
    a P b
    a Q c
    a R d
    b Q e
    b L f
    c S g
    c T h
  Find 'predicate sets' (the predicates that occur together on one subject): PQR, QL, ST
  Build a graph whose nodes are the predicates (P, Q, R, L, S, T), with an edge between any two predicates that appear in the same predicate set.
  Color the graph using graph coloring; each color is now an assignment of a predicate to a column.
  Notice that for 6 predicates, we use only 3 colors.
  Uses Floyd-Warshall greedy algorithm.
  Number of colors <= number of columns

Query Optimization
  [Figure: SPARQL -> SPARQL to SQL Optimization (Extra Statistics) -> SQL -> SQL Optimization by QRW (RDBMS optimizer statistics) -> Optimized SQL]

Overall Stack Architecture
  [Figure: stack diagram, one column per store]

  Store                   API        Query engine                     Storage
  Jena TDB                Jena API   Jena Query Engine                Jena Native Store
  Jena SDB                Jena API   Jena Query Engine                DB2/Oracle/MySQL
  DB2 noSQL Graph Store   Jena API   DB2 SPARQL to SQL Query Engine   DB2

  Java APIs and HTTP based SPARQL querying

Scalability
  [Figure: ESE Server and pureScale; members with CPUs, DRAM and BP; CA/CF with DRAM and GBP]

Key Research Innovations
  Flexible schema uses hashing techniques to 'bind' predicates to column values. Overloads a single column to hold multiple predicates.
  Intelligent compilation of SPARQL to SQL based on an estimate of costs of accessing each triple.
  Schema customization for a dataset (like a re-org capability) to exploit correlations between predicate co-occurrences, minimizing storage and maximizing indexing capabilities.
  Workload analysis that can advise predicates to be indexed.
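
The column overloading and schema customization ideas can be sketched in a few lines of Python, using the triples from the 'Intelligent data layout' slide. This is only an illustration under stated assumptions, not the DB2 implementation: the function names (predicate_sets, interference_graph, greedy_coloring, layout_rows) are hypothetical, the coloring is a plain highest-degree-first greedy heuristic chosen for simplicity, and every (subject, predicate) pair is assumed to hold a single value, so no secondary hash table is needed.

```python
from itertools import combinations

# Triples from the 'Intelligent data layout' slide: (subject, predicate, object).
triples = [
    ("a", "P", "b"), ("a", "Q", "c"), ("a", "R", "d"),
    ("b", "Q", "e"), ("b", "L", "f"),
    ("c", "S", "g"), ("c", "T", "h"),
]

def predicate_sets(triples):
    """Group the predicates that occur together on each subject."""
    sets = {}
    for subject, predicate, _ in triples:
        sets.setdefault(subject, set()).add(predicate)
    return sets

def interference_graph(pred_sets):
    """Connect two predicates when they co-occur on some subject."""
    graph = {}
    for preds in pred_sets.values():
        for p in preds:
            graph.setdefault(p, set())
        for p, q in combinations(sorted(preds), 2):
            graph[p].add(q)
            graph[q].add(p)
    return graph

def greedy_coloring(graph):
    """Assumed heuristic: visit highest-degree nodes first and give each
    the smallest color (column number) not used by a colored neighbor."""
    colors = {}
    for node in sorted(graph, key=lambda n: (-len(graph[n]), n)):
        used = {colors[nb] for nb in graph[node] if nb in colors}
        color = 0
        while color in used:
            color += 1
        colors[node] = color
    return colors

def layout_rows(triples, colors):
    """Place each triple in the column assigned to its predicate.
    Assumes one value per (subject, predicate) pair."""
    width = max(colors.values()) + 1
    rows = {}
    for subject, predicate, obj in triples:
        row = rows.setdefault(subject, [None] * width)
        row[colors[predicate]] = (predicate, obj)
    return rows

colors = greedy_coloring(interference_graph(predicate_sets(triples)))
rows = layout_rows(triples, colors)

for subject, row in rows.items():
    print(subject, row)
```

Predicates that never co-occur on a subject (for example Q and S here) can share a column, which is how a single column ends up holding multiple predicates while each row stays collision-free.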

Internal use case of RDF
  An IBM product has an RDF repository of objects
  Previously experimented with various stores
    Gave up due to performance problems
  Currently using Jena TDB
    Open source Java-based RDF repository
    Performs better than previous stores tried
    Scalability and handling of updates are a concern

Some comparisons
  Dataset: 60M triples.
  SPARQL query workload: 29 queries, some very complex (including more than 100 unions of graph patterns).
  Schema layout using the reorganization facility + predicate indexes as determined by re-org.
  Query workload issued 5 times to Jena TDB and DB2 noSQL Graph Store in a randomized order, in a single-user environment.
  Average performance for the 29-query workload: more than 4X better than Jena TDB

Other comparisons
  On SP2Bench, LUBM, DBpedia
  With Virtuoso, Jena, Sesame, RDF-3X

Conclusion
  New mechanism to store RDF data in relational systems
  Provides significant performance improvement
    Compared to conventional triple store approaches on an RDBMS
    Compared to Jena TDB
  For more details please attend the SIGMOD paper presentation