Virtuoso Column Store:
Adaptive Techniques for RDF
Orri Erling
Program Manager, Virtuoso
OpenLink Software
© 2012 OpenLink Software, All rights reserved.
Flexible Big Data
Data grows in volume and heterogeneity
Schema-last is great, if the price is right
RDF and graphs promise powerful querying with the flexibility and scale of NoSQL key-value stores
Inference may be good for integration, if it can express the right things, beyond OWL
RDF tech must learn the lessons of databases; everything applies
Virtuoso Column Store Edition
SQL and SPARQL (see the example below)
Compressed column store, vectored execution
Shared-nothing scale-out
Powerful procedure language with parallel, distributed control structures
Full-text and geospatial indexes
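One concrete way the SQL/SPARQL combination shows up in practice: a SPARQL query can be submitted over the ordinary SQL channel (isql, ODBC, JDBC) by prefixing it with the SPARQL keyword. The query below is only an illustrative sketch of that usage.

  -- Illustrative only: any SQL client can submit SPARQL by prefixing
  -- the text with the SPARQL keyword; the same engine runs both languages.
  SPARQL
  SELECT ?class (COUNT(*) AS ?instances)
  WHERE { ?s a ?class }
  GROUP BY ?class
  ORDER BY DESC(?instances)
  LIMIT 10;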
Storage
• Freely mix column-wise and row-wise indices
• All SQL and RDF data types natively supported; single execution engine for SQL/SPARQL
• Column compression is 3x more space-efficient than row-wise compression for RDF
• Column stores are not only for big scans; random access surpasses rows as soon as there is some locality
• 9 bytes/quad with DBpedia, 7 bytes/quad with BSBM or RDF-H, 14 bytes/quad with web crawls (over the PSOG, POSG, SP, OP, GS indices, excluding literals; see the sketch below)
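For concreteness, the index layout named in the last bullet could be sketched as DDL roughly like the following. This is a schematic sketch only: the actual Virtuoso quad table (DB.DBA.RDF_QUAD) uses its own column types, index names and column-store options, so treat the types and options below as placeholders.

  -- Schematic sketch of the quad indexing named above; not the shipped schema.
  CREATE TABLE RDF_QUAD (
    G ANY,   -- graph
    S ANY,   -- subject
    P ANY,   -- predicate
    O ANY,   -- object
    PRIMARY KEY (P, S, O, G)                             -- PSOG: full, column-wise primary key
  );
  CREATE INDEX RDF_QUAD_POSG ON RDF_QUAD (P, O, S, G);   -- POSG: second full index
  CREATE INDEX RDF_QUAD_SP   ON RDF_QUAD (S, P);         -- SP: partial index
  CREATE INDEX RDF_QUAD_OP   ON RDF_QUAD (O, P);         -- OP: partial index
  CREATE INDEX RDF_QUAD_GS   ON RDF_QUAD (G, S);         -- GS: partial index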
Execution Engine
• Vectoring is not only for column stores
• Vectoring turns random access into a linear merge join if there is any locality: always a win, though the mileage depends on run-time factors
• Vectoring eliminates interpretation overhead and makes CPU-friendly code possible
• Even with run-time data typing, vectoring allows type-specific operators on homogeneous data, e.g. arithmetic (see the example below)
• Dynamically adjust vector size: a larger vector may not fit in cache, but it gets better locality for random access
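To make the run-time typing point concrete, consider a query like the sketch below (the predicate IRI is a hypothetical placeholder). ?price is typed only at run time, yet when a batch of bound values turns out to be homogeneous, say all xsd:decimal, a vectored engine can apply decimal-specific arithmetic to the whole vector instead of dispatching on type per row.

  -- Illustrative sketch; the predicate IRI is hypothetical.
  SPARQL
  SELECT (SUM(?price * 1.2) AS ?total)
  WHERE { ?offer <http://example.com/price> ?price };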
Graph operations
• Run-time computation plus caching instead of materialization
• SPARQL/SQL extension for arbitrary transitive subqueries (see the sketch after this list):
  • Flexible options for returning shortest paths, all paths, all/distinct reachable nodes, attributes of steps on paths, etc.
  • Efficient execution, searching the graph from both ends when looking for a path with both ends given
• Query operators for RDF hierarchy traversal
• Special query operator for OWL sameAs and IFP-based identity
• Taking OWL sameAs / IFP identity into account for DISTINCT / GROUP BY
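A sketch of the transitive subquery extension (option names as in Virtuoso's SPARQL extension; the person IRI is a hypothetical placeholder): the step pattern is wrapped in a subselect and marked TRANSITIVE, with options controlling distinctness, direction and the number of steps.

  -- Sketch: everyone reachable over 1 to 3 foaf:knows steps from one person.
  SPARQL
  PREFIX foaf: <http://xmlns.com/foaf/0.1/>
  SELECT DISTINCT ?reached
  WHERE {
    { SELECT ?s ?reached WHERE { ?s foaf:knows ?reached } }
    OPTION (TRANSITIVE, t_distinct, t_in (?s), t_out (?reached), t_min (1), t_max (3)) .
    FILTER (?s = <http://example.com/alice>)
  };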
Query Optimization Challenges
• Typical SQL statistics do not help
• Need to measure data cardinalities starting from the constants in the query (see the example below)
• Need to sample fanout predicate by predicate, as needed
• Predicate and class hierarchies are easy to handle in sampling
• sameAs or IFP inference voids all guesses
• Is hash join worthwhile? The high setup cost means that one must be sure of cardinalities first
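For example (DBpedia-flavored IRIs, for illustration only), table-level statistics on a single quad table say nothing about how differently these three patterns behave; the cardinalities and fanouts only become known by probing from the constants in the query:

  -- Illustrative only; the estimates depend entirely on the query's constants.
  SPARQL
  SELECT ?person ?city
  WHERE {
    ?person a <http://dbpedia.org/ontology/Person> .          # millions of bindings
    ?person <http://dbpedia.org/ontology/birthPlace> ?city .  # fanout of about 1 per person
    ?city <http://dbpedia.org/ontology/country>
          <http://dbpedia.org/resource/Iceland> .             # a handful of bindings
  };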
Deep Sampling
• Everything is a join -> sampling must also do joins
• As the candidate plan grows, the cost model executes all of its operators on a sample of the data
• Actual cardinality and locality become known, also when search conditions are correlated
• With high confidence in the cost model, hash join plans become safe and attractive
• Even though there is an indexed access path for everything, a scan can be better because it produces results in order; one must be sure of selectivity before taking the risk
Elastic Cluster
Data is partitioned by key; different indices may have different partition keys (see the sketch below)
Partitions may split and migrate between servers
Partitions may be kept in duplicate for fault tolerance / load balancing
Actual access statistics drive partition splits and placement
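A purely hypothetical sketch of per-index partition keys, continuing the index names from the storage sketch earlier; the generic PARTITION BY HASH syntax is illustrative and is not Virtuoso's actual cluster DDL. Partitioning each index on the column its lookups start from keeps a single-key lookup on a single partition, rather than partitioning on P, which has too few distinct values.

  -- Hypothetical generic DDL, not Virtuoso's actual cluster syntax.
  -- The PSOG primary key would be partitioned on S in the same way.
  CREATE INDEX RDF_QUAD_POSG ON RDF_QUAD (P, O, S, G) PARTITION BY HASH (O);
  CREATE INDEX RDF_QUAD_SP   ON RDF_QUAD (S, P)       PARTITION BY HASH (S);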
Optimizing for Cluster
• Vectored execution is natural in a cluster, since single-tuple messages are not an option
• Keep the maximum number of operations in flight at all times; always send long messages
• Fully distributed query coordination:
  • Any node can service a client request; correlated subqueries and stored procedures may execute anywhere, with arbitrary parallelism and recursion between partitions
• On a single shared-memory box, a cluster is approximately even with single-process multithreading; the overhead is low
• Distributed stored procedures: send the procedure to the data, as in map-reduce, except that there are no limits on cross-partition calling/recursion
• Choice of transactional and autocommit update semantics; atomic operations are possible without a global transaction
LOD Cache
55 billion triples in the LOD cache, in only 384 GB of RAM and 2 TB of disk
Most of Linked Open Data and web crawls
http://lod.openlinksw.com
http://lod.openlinksw.com/sparql (example query below)
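For instance, a query along these lines can be pasted into the endpoint above (assuming the DBpedia portion of Linked Open Data mentioned on this slide is loaded; the resource IRI is just an example):

  # List the distinct predicates used on one well-known DBpedia resource.
  SELECT DISTINCT ?p
  WHERE { <http://dbpedia.org/resource/Tim_Berners-Lee> ?p ?o }
  LIMIT 50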
Future Work
• Complete deep sampling: no more bad query plans
• Caching and recycling of intermediate results, especially inference and partial plans
• Automatic cluster sizing and load redistribution
• Automatic balancing of storage between disk and SSD
• Run TPC-H and TPC-DS in SQL and their 1:1 translation in SPARQL, demonstrating SPARQL performance as near to SQL as possible
Making Technology Work For You
openlinksw.com/virtuoso