Virtuoso Column Store:
Adaptive Techniques for RDF
Orri Erling
Program Manager, Virtuoso
OpenLink Software
© 2012 OpenLink Software, All rights reserved.
Flexible Big Data
Data grows in volume and heterogeneity
Schema-last is great, if the price is right
RDF and graphs promise powerful querying with the flexibility and scale of NoSQL key-value stores
Inference may be good for integration, if it can express the right things, beyond OWL
RDF tech must learn the lessons of databases; everything applies
Virtuoso Column Store Edition
SQL and SPARQL (see the example below)
Compressed column store, vectored execution
Shared-nothing scale-out
Powerful procedure language with parallel, distributed control structures
Full-text and geospatial indexes
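One concrete way the SQL/SPARQL combination shows up in practice: a SPARQL query can be submitted over the ordinary SQL channel (isql, ODBC, JDBC) by prefixing it with the SPARQL keyword. The query below is only an illustrative sketch of that usage.

  -- Illustrative only: any SQL client can submit SPARQL by prefixing
  -- the text with the SPARQL keyword; the same engine runs both languages.
  SPARQL
  SELECT ?class (COUNT(*) AS ?instances)
  WHERE { ?s a ?class }
  GROUP BY ?class
  ORDER BY DESC(?instances)
  LIMIT 10;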
Storage
• Freely mix column-wise and row-wise indices
• All SQL and RDF data types natively supported; single execution engine for SQL/SPARQL
• Column compression is 3x more space-efficient than row-wise compression for RDF
• Column stores are not only for big scans; random access surpasses rows as soon as there is some locality
• 9 bytes/quad with DBpedia, 7 bytes/quad with BSBM or RDF-H, 14 bytes/quad with web crawls (over the PSOG, POSG, SP, OP, GS indices, excluding literals; see the sketch below)
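For concreteness, the index layout named in the last bullet could be sketched as DDL roughly like the following. This is a schematic sketch only: the actual Virtuoso quad table (DB.DBA.RDF_QUAD) uses its own column types, index names and column-store options, so treat the types and options below as placeholders.

  -- Schematic sketch of the quad indexing named above; not the shipped schema.
  CREATE TABLE RDF_QUAD (
    G ANY,   -- graph
    S ANY,   -- subject
    P ANY,   -- predicate
    O ANY,   -- object
    PRIMARY KEY (P, S, O, G)                             -- PSOG: full, column-wise primary key
  );
  CREATE INDEX RDF_QUAD_POSG ON RDF_QUAD (P, O, S, G);   -- POSG: second full index
  CREATE INDEX RDF_QUAD_SP   ON RDF_QUAD (S, P);         -- SP: partial index
  CREATE INDEX RDF_QUAD_OP   ON RDF_QUAD (O, P);         -- OP: partial index
  CREATE INDEX RDF_QUAD_GS   ON RDF_QUAD (G, S);         -- GS: partial index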
Execution Engine
• Vectoring is not only for column stores
• Vectoring turns random access into a linear merge join if there is any locality: always a win, though the mileage depends on run-time factors
• Vectoring eliminates interpretation overhead and makes CPU-friendly code possible
• Even with run-time data typing, vectoring allows type-specific operators on homogeneous data, e.g. arithmetic (see the example below)
• Dynamically adjust vector size: a larger vector may not fit in cache, but it gets better locality for random access
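To make the run-time typing point concrete, consider a query like the sketch below (the predicate IRI is a hypothetical placeholder). ?price is typed only at run time, yet when a batch of bound values turns out to be homogeneous, say all xsd:decimal, a vectored engine can apply decimal-specific arithmetic to the whole vector instead of dispatching on type per row.

  -- Illustrative sketch; the predicate IRI is hypothetical.
  SPARQL
  SELECT (SUM(?price * 1.2) AS ?total)
  WHERE { ?offer <http://example.com/price> ?price };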
Graph operations
• Run-time computation plus caching instead of materialization
• SPARQL/SQL extension for arbitrary transitive subqueries (see the sketch after this list):
  • Flexible options for returning shortest paths, all paths, all/distinct reachable nodes, attributes of steps on paths, etc.
  • Efficient execution, searching the graph from both ends when looking for a path with both ends given
• Query operators for RDF hierarchy traversal
• Special query operator for OWL sameAs and IFP-based identity
• Taking OWL sameAs / IFP identity into account for DISTINCT / GROUP BY
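A sketch of the transitive subquery extension (option names as in Virtuoso's SPARQL extension; the person IRI is a hypothetical placeholder): the step pattern is wrapped in a subselect and marked TRANSITIVE, with options controlling distinctness, direction and the number of steps.

  -- Sketch: everyone reachable over 1 to 3 foaf:knows steps from one person.
  SPARQL
  PREFIX foaf: <http://xmlns.com/foaf/0.1/>
  SELECT DISTINCT ?reached
  WHERE {
    { SELECT ?s ?reached WHERE { ?s foaf:knows ?reached } }
    OPTION (TRANSITIVE, t_distinct, t_in (?s), t_out (?reached), t_min (1), t_max (3)) .
    FILTER (?s = <http://example.com/alice>)
  };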
Query Optimization Challenges
• Typical SQL statistics do not help
• Need to measure data cardinalities starting from the constants in the query (see the example below)
• Need to sample fanout predicate by predicate, as needed
• Predicate and class hierarchies are easy to handle in sampling
• sameAs or IFP inference voids all guesses
• Is hash join worthwhile? The high setup cost means that one must be sure of cardinalities first
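For example (DBpedia-flavored IRIs, for illustration only), table-level statistics on a single quad table say nothing about how differently these three patterns behave; the cardinalities and fanouts only become known by probing from the constants in the query:

  -- Illustrative only; the estimates depend entirely on the query's constants.
  SPARQL
  SELECT ?person ?city
  WHERE {
    ?person a <http://dbpedia.org/ontology/Person> .          # millions of bindings
    ?person <http://dbpedia.org/ontology/birthPlace> ?city .  # fanout of about 1 per person
    ?city <http://dbpedia.org/ontology/country>
          <http://dbpedia.org/resource/Iceland> .             # a handful of bindings
  };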
Deep Sampling
• Everything is a join -> sampling must also do joins
• As the candidate plan grows, the cost model executes all of its operators on a sample of the data
• Actual cardinality and locality become known, also when search conditions are correlated
• With high confidence in the cost model, hash join plans become safe and attractive
• Even though there is an indexed access path for everything, a scan can be better because it produces results in order; one must be sure of selectivity before taking the risk
Elastic Cluster
Data is partitioned by key; different indices may have different partition keys (see the sketch below)
Partitions may split and migrate between servers
Partitions may be kept in duplicate for fault tolerance / load balancing
Actual access statistics drive partition splits and placement
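A purely hypothetical sketch of per-index partition keys, continuing the index names from the storage sketch earlier; the generic PARTITION BY HASH syntax is illustrative and is not Virtuoso's actual cluster DDL. Partitioning each index on the column its lookups start from keeps a single-key lookup on a single partition, rather than partitioning on P, which has too few distinct values.

  -- Hypothetical generic DDL, not Virtuoso's actual cluster syntax.
  -- The PSOG primary key would be partitioned on S in the same way.
  CREATE INDEX RDF_QUAD_POSG ON RDF_QUAD (P, O, S, G) PARTITION BY HASH (O);
  CREATE INDEX RDF_QUAD_SP   ON RDF_QUAD (S, P)       PARTITION BY HASH (S);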
Optimizing for Cluster
• Vectored execution is natural in a cluster, since single-tuple messages are not an option
• Keep the maximum number of operations in flight at all times; always send long messages
• Fully distributed query coordination:
  • Any node can service a client request; correlated subqueries and stored procedures may execute anywhere, with arbitrary parallelism and recursion between partitions
• On a single shared-memory box, a cluster is approximately even with single-process multithreading; the overhead is low
• Distributed stored procedures: send the procedure to the data, as in map-reduce, except that there are no limits on cross-partition calling/recursion
• Choice of transactional and autocommit update semantics; atomic operations are possible without a global transaction
LOD Cache
55 billion triples in the LOD cache, in only 384 GB of RAM and 2 TB of disk
Most of Linked Open Data and web crawls
http://lod.openlinksw.com
http://lod.openlinksw.com/sparql (example query below)
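For instance, a query along these lines can be pasted into the endpoint above (assuming the DBpedia portion of Linked Open Data mentioned on this slide is loaded; the resource IRI is just an example):

  # List the distinct predicates used on one well-known DBpedia resource.
  SELECT DISTINCT ?p
  WHERE { <http://dbpedia.org/resource/Tim_Berners-Lee> ?p ?o }
  LIMIT 50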
Future Work
• Complete deep sampling: no more bad query plans
• Caching and recycling of intermediate results, especially inference and partial plans
• Automatic cluster sizing and load redistribution
• Automatic balancing of storage between disk and SSD
• Run TPC-H and TPC-DS in SQL and their 1:1 translation in SPARQL, demonstrating SPARQL performance as near to SQL as possible
Making Technology Work For You
openlinksw.com/virtuoso