Virtuoso - Column Store, Adaptive Techniques for RDF
Orri Erling
Program Manager, Virtuoso
OpenLink Software
© 2012 OpenLink Software, All rights reserved.

Flexible Big Data
- Data grows in volume and heterogeneity.
- Schema-last is great - if the price is right.
- RDF and graphs promise powerful querying with the flexibility and scale of NoSQL key-value stores.
- Inference may be good for integration, if it can express the right things, beyond OWL.
- RDF technology must learn the lessons of databases; everything applies.

Virtuoso Column Store Edition
- SQL and SPARQL
- Compressed column store, vectored execution
- Shared-nothing scale-out
- Powerful procedure language with parallel, distributed control structures
- Full-text and geospatial indexes

Storage
- Freely mix column-wise and row-wise indices.
- All SQL and RDF data types natively supported; a single execution engine for SQL/SPARQL.
- Column compression is 3x more space-efficient than row-wise compression for RDF.
- Column stores are not only for big scans; random access surpasses rows as soon as there is some locality.
- 9 bytes/quad with DBpedia, 7 bytes/quad with BSBM or RDF-H, 14 bytes/quad with web crawls (PSOG, POSG, SP, OP, GS indices, excluding literals).

Execution Engine
- Vectoring is not only for column stores.
- Vectoring turns random access into a linear merge join if there is any locality: always a win, though mileage depends on run-time factors.
- Vectoring eliminates interpretation overhead and makes CPU-friendly code possible.
- Even with run-time data typing, vectoring allows type-specific operators on homogeneous data, e.g. arithmetic.
- Vector size is adjusted dynamically: a larger vector may not fit in cache but gets better locality for random access.

Graph Operations
- Run-time computation plus caching instead of materialization.
- SPARQL/SQL extension for arbitrary transitive subqueries: flexible options for returning shortest paths, all paths, all/distinct reachable nodes, attributes of steps on paths, etc. (a hedged example query appears after the Deep Sampling slide below).
- Efficient execution, searching the graph from both ends when looking for a path with both ends given.
- Query operators for RDF hierarchy traversal.
- Special query operator for OWL sameAs and IFP-based identity.
- OWL sameAs / IFP identity is taken into account for DISTINCT / GROUP BY.

Query Optimization Challenges
- Typical SQL statistics do not help.
- Need to measure data cardinalities starting from the constants in the query.
- Need to sample fanout predicate by predicate, as needed.
- Predicate and class hierarchies are easy to handle in sampling.
- sameAs or IFP inference voids all guesses.
- Is hash join worthwhile? Its high setup cost means one must be sure of cardinalities first.

Deep Sampling
- Everything is a join, so sampling must also do joins.
- As the candidate plan grows, the cost model executes all of its operators on a sample of the data.
- Actual cardinality and locality are known, also when search conditions are correlated.
- With high confidence in the cost model, hash join plans become safe and attractive.
- Even though there is an indexed access path for everything, a scan can be better because it produces results in order; one must be sure of selectivity before taking that risk.
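
A minimal SPARQL sketch of the transitive subquery extension mentioned under Graph Operations: it asks for everyone reachable from a given start node over foaf:knows within three hops. This is a sketch under assumptions: the OPTION keywords (TRANSITIVE, t_distinct, t_in, t_out, t_max) are assumed to match Virtuoso's SPARQL extension of this period, and the foaf:knows data and the starting IRI are hypothetical placeholders.

```sparql
# Hedged sketch of the transitive subquery extension (Graph Operations slide).
# The OPTION keywords are assumed from Virtuoso's SPARQL extension; the
# foaf:knows data and the starting IRI are hypothetical.
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT DISTINCT ?o
WHERE
  {
    { SELECT ?s ?o WHERE { ?s foaf:knows ?o } }
    OPTION (TRANSITIVE, t_distinct, t_in(?s), t_out(?o), t_max(3)) .
    FILTER (?s = <http://example.org/person/alice>)
  }
```

Here only the start of the path is fixed, so the engine expands forward from ?s; the bidirectional search described above applies when both ends of a path are given.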
Elastic Cluster
- Data is partitioned by key; different indices may have different partition keys.
- Partitions may split and migrate between servers.
- Partitions may be kept in duplicate for fault tolerance and load balancing.
- Actual access statistics drive partition splitting and placement.

Optimizing for Cluster
- Vectored execution is natural in a cluster, since single-tuple messages are not an option.
- Keep the maximum number of operations in flight at all times; always send long messages.
- Fully distributed query coordination: any node can service a client request. Correlated subqueries and stored procedures may execute anywhere, with arbitrary parallelism and recursion between partitions.
- On a single shared-memory box, the cluster is approximately even with single-process multithreading; the overhead is low.
- Distributed stored procedures send the procedure to the data, as in MapReduce, except that there are no limits on cross-partition calling and recursion.
- Choice of transactional and auto-commit update semantics; atomic operations are possible without a global transaction.

LOD Cache
- 55 billion triples in the LOD Cache, with only 384 GB of RAM and 2 TB of disk.
- Most of Linked Open Data plus web crawls.
- http://lod.openlinksw.com
- http://lod.openlinksw.com/sparql (a sample query against this endpoint appears at the end of this deck).

Future Work
- Complete deep sampling: no more bad query plans.
- Caching and recycling of intermediate results, especially inference and partial plans.
- Automatic cluster sizing and load redistribution.
- Automatic balancing of storage between disk and SSD.
- Run TPC-H and TPC-DS in SQL and their 1:1 translation in SPARQL, demonstrating SPARQL performance as near to SQL as possible.

Making Technology Work For You
openlinksw.com/virtuoso
© 2012 OpenLink Software, All rights reserved.
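
A sample query for the public LOD Cache endpoint referenced on the LOD Cache slide (http://lod.openlinksw.com/sparql). The DBpedia resource is only an illustration, and what is loaded on the endpoint may differ from the 2012 figures above.

```sparql
# Illustrative query for the LOD Cache endpoint (http://lod.openlinksw.com/sparql).
# The DBpedia resource is a hypothetical example; substitute any IRI of interest.
PREFIX dbr: <http://dbpedia.org/resource/>

SELECT ?p ?o
WHERE { dbr:Amsterdam ?p ?o }
LIMIT 50
```

Pasting this into the endpoint's query form returns the properties stored for the resource across the loaded graphs; wrapping the pattern in GRAPH ?g { ... } would additionally show which source graph each triple comes from.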