Sharing statistics for SPARQL Federation optimization, with emphasis on benchmark quality Kjetil Kjernsmo <kjekje@ifi.uio.no> Department of Informatics, University of Oslo, Norway Key Problems: Expected contributions: Current benchmarks cannot meaningfully summarize different micro- or macrobenchmarks, thereby making falsification difficult. Advising users of SPARQL Endpoints on which implementation to use is difficult. Benchmarks that are able to cover all realistic performance-influencing parameters for SPARQL engines and make it possible to weigh different parameters, as well as quantify unexplained differences. To assist optimization, enable endpoint service descriptions to expose: performance characteristics, Better statistics is needed for efficient join-order optimization in the federated case. statistical digests of data that are optimized with respect to size and query performance. Selectivity estimation for SPARQL needs to be improved. SPARQL engine developers can use the benchmark to reliably quantify unexpected adverse effects from a given modification. New SPARQL engine users can identify key differences between different engines, and therefore be able to more wisely choose engine to use. Design of Experiments Nonmonotonic responses Use current state of the art in statistics Use selection of language features, data size, heterogeneity, two platforms, etc Statistical digest Reimplement DBPedia SPARQL Benchmark (Morsey et al) There exists wellstudied designs that can be adapted Distributed Data Big Data that may or may not change frequently Astronomical data Censored data SPARQL 1.0 Assessment of unexpected side-effects The benchmark will be heavy Can be used on data that lends itself well to histograms Full statistical study Bayesian networks / Probabilistic Relational Models, etc Create 2-4 levels along axis of the SPARQL Grammar + Comprehensive data and platform factors Requires significant progress in statistics Parametrized statistics Balance digest size and optimization potential Identify key differences In many cases, just distinguishing cases with zero or one hits would be valuable Not all data lends itself well to histograms Exposure of statistical digest in Service Description Expose Benchmark Results In Service Description Zero-One-Many cardinality Extend VoID and the direction set by SPLENDID Optimization Selectivity estimation in the non-federated case Source selection No progress Not a lot of literature in the conditional case Driven by exposed data Selectivity estimator for federation Group Graph Patterns Join Order Optimization No progress Query rewriter Rewritten SPARQL 1.1 Query with best join order Federated Benchmark Problems worthy of attack Prove their worth by hitting back -- Piet Hein