Approximate Queries on Very Large Data Sameer Agarwal Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael Jordan, Samuel Madden, Ion Stoica UC Berkeley Our Goal Support interactive SQL-like aggregate queries over massive sets of data Our Goal Support interactive SQL-like aggregate queries over massive sets of data blinkdb> SELECT AVG(jobtime) FROM very_big_log AVG, COUNT, SUM, STDEV, PERCENTILE etc. Our Goal Support interactive SQL-like aggregate queries over massive sets of data blinkdb> SELECT AVG(jobtime) FROM very_big_log WHERE src = ‘hadoop’ FILTERS, GROUP BY clauses Our Goal Support interactive SQL-like aggregate queries over massive sets of data blinkdb> SELECT AVG(jobtime) FROM very_big_log WHERE src = ‘hadoop’ LEFT OUTER JOIN logs2 ON very_big_log.id = logs.id JOINS, Nested Queries etc. Our Goal Support interactive SQL-like aggregate queries over massive sets of data blinkdb> SELECT my_function(jobtime) FROM very_big_log WHERE src = ‘hadoop’ LEFT OUTER JOIN logs2 ML Primitives, User Defined Functions ON very_big_log.id = logs.id 100 TB on 1000 machines ½ - 1 Hour 1 - 5 Minutes 1 second ? Hard Disks Memory Query Execution on Samples Query Execution on Samples ID City Buff Ratio 1 NYC 0.78 2 NYC 0.13 3 Berkeley 0.25 4 NYC 0.19 5 NYC 0.11 6 Berkeley 0.09 7 NYC 0.18 8 NYC 0.15 9 Berkeley 0.13 10 Berkeley 0.49 11 NYC 0.19 12 Berkeley 0.10 What is the average buffering ratio in the table? 0.2325 Query Execution on Samples ID City Buff Ratio 1 NYC 0.78 2 NYC 0.13 3 Berkeley 0.25 4 NYC 0.19 5 NYC 0.11 6 Berkeley 0.09 7 NYC 0.18 8 NYC 0.15 9 Berkeley 0.13 10 Berkeley 0.49 11 NYC 0.19 12 Berkeley 0.10 What is the average buffering ratio in the table? Uniform Sample ID City Buff Ratio Sampling Rate 2 NYC 0.13 1/4 6 Berkeley 0.25 1/4 8 NYC 0.19 1/4 0.2325 0.19 Query Execution on Samples ID City Buff Ratio 1 NYC 0.78 2 NYC 0.13 3 Berkeley 0.25 4 NYC 0.19 5 NYC 0.11 6 Berkeley 0.09 7 NYC 0.18 8 NYC 0.15 9 Berkeley 0.13 10 Berkeley 0.49 11 NYC 0.19 12 Berkeley 0.10 What is the average buffering ratio in the table? Uniform Sample ID City Buff Ratio Sampling Rate 2 NYC 0.13 1/4 6 Berkeley 0.25 1/4 8 NYC 0.19 1/4 0.2325 0.19 +/- 0.05 Query Execution on Samples ID City Buff Ratio 1 NYC 0.78 2 NYC 0.13 3 Berkeley 0.25 4 NYC 0.19 5 NYC 0.11 6 Berkeley 0.09 7 NYC 0.18 8 NYC 0.15 9 Berkeley 0.13 10 Berkeley 0.49 11 NYC 0.19 12 Berkeley 0.10 What is the average buffering ratio in the table? Uniform Sample ID City Buff Ratio 2 NYC 0.13 1/2 3 Berkeley 0.25 1/2 5 NYC 0.19 1/2 6 Berkeley 0.09 1/2 8 NYC 0.18 1/2 12 Berkeley 0.49 1/2 0.2325 0.19 +/- 0.05 $0.22 +/- 0.02 Sampling Rate Speed/Accuracy Trade-off Error Interactive Queries Time to Execute on Entire Dataset 2 sec 30 mins Execution Time (Sample Size) Speed/Accuracy Trade-off Error Interactive Queries Pre-Existing Noise Time to Execute on Entire Dataset 2 sec 30 mins Execution Time (Sample Size) Query Response Time (Seconds) Sampling Vs. No Sampling 1000 900 800 700 600 500 400 300 200 100 0 1020 10x as response time is dominated by I/O 103 18 1 10-1 13 10 10-2 10-3 10-4 Fraction of full data 8 10-5 Query Response Time (Seconds) Sampling Vs. No Sampling 1000 900 800 700 600 500 400 300 200 100 0 1020 Error Bars (0.02%) (0.07%) (1.1%) (3.4%) 103 18 13 10 1 10-1 10-2 10-3 10-4 Fraction of full data (11%) 8 10-5 What is BlinkDB? A framework built on Shark and Spark that … - creates and maintains a variety of uniform and stratified samples from underlying data - returns fast, approximate answers with error bars by executing queries on samples of data - verifies the correctness of the error bars that it returns at runtime What is BlinkDB? A framework built on Shark and Spark that … - creates and maintains a variety of uniform and stratified samples from underlying data - returns fast, approximate answers with error bars by executing queries on samples of data - verifies the correctness of the error bars that it returns at runtime Uniform Samples 4 3 1 2 Uniform Samples 4 U 3 1 2 Uniform Samples ID City Data 1 NYC 0.78 2 NYC 0.13 3 Berkeley 0.25 4 NYC 0.19 5 NYC 0.11 6 Berkeley 0.09 7 NYC 0.18 8 NYC 0.15 9 Berkeley 0.13 10 Berkeley 0.49 11 NYC 0.19 12 Berkeley 0.10 4 U 3 1 2 Uniform Samples ID City Data 1 NYC 0.78 2 NYC 0.13 3 Berkeley 0.25 4 NYC 0.19 5 NYC 0.11 6 Berkeley 0.09 7 NYC 0.18 8 NYC 0.15 9 Berkeley 0.13 10 Berkeley 0.49 11 NYC 0.19 12 Berkeley 0.10 4 1. 2. 3. U 3 1 2 FILTER rand() < 1/3 Adds per-row Weights (Optional) ORDER BY rand() Uniform Samples ID City Data 1 NYC 0.78 2 NYC 0.13 3 Berkeley 0.25 4 NYC 0.19 5 NYC 0.11 6 Berkeley 0.09 7 NYC 0.18 8 NYC 0.15 9 Berkeley 0.13 10 Berkeley 0.49 11 NYC 0.19 12 Berkeley 0.10 4 ID City Data Weight 2 NYC 0.13 1/3 8 NYC 0.25 1/3 6 Berkeley 0.09 1/3 11 NYC 1/3 0.19 U Doesn’t change Shark RDD Semantics 3 1 2 Stratified Samples 4 3 1 2 Stratified Samples 4 S 3 1 2 Stratified Samples ID City Data 1 NYC 0.78 2 NYC 0.13 3 Berkeley 0.25 4 NYC 0.19 5 NYC 0.11 6 Berkeley 0.09 7 NYC 0.18 8 NYC 0.15 9 Berkeley 0.13 10 Berkeley 0.49 11 NYC 0.19 12 Berkeley 0.10 4 S 3 1 2 Stratified Samples ID City Data ID City Data 1 NYC 0.78 1 NYC 0.78 2 NYC 0.13 2 NYC 0.13 3 Berkeley 0.25 3 Berkeley 0.25 4 NYC 0.19 4 NYC 0.19 5 NYC 0.11 5 NYC 0.11 6 Berkeley 0.09 6 Berkeley 0.09 7 NYC 0.18 7 NYC 0.18 8 NYC 0.15 8 NYC 0.15 9 Berkeley 0.13 9 Berkeley 0.13 10 Berkeley 0.49 10 Berkeley 0.49 11 NYC 0.19 11 NYC 0.19 12 Berkeley 0.10 12 Berkeley 0.10 4 S1 SPLIT 3 1 2 Stratified Samples 4 City Count NYC 7 Berkele y 5 S2 GROUP S1 3 1 2 ID City Data 1 NYC 0.78 2 NYC 0.13 3 Berkeley 0.25 4 NYC 0.19 5 NYC 0.11 6 Berkeley 0.09 7 NYC 0.18 8 NYC 0.15 9 Berkeley 0.13 10 Berkeley 0.49 11 NYC 0.19 12 Berkeley 0.10 Stratified Samples 4 City Count Ratio NYC 7 2/7 Berkele y 5 2/5 S2 GROUP S1 3 1 2 ID City Data 1 NYC 0.78 2 NYC 0.13 3 Berkeley 0.25 4 NYC 0.19 5 NYC 0.11 6 Berkeley 0.09 7 NYC 0.18 8 NYC 0.15 9 Berkeley 0.13 10 Berkeley 0.49 11 NYC 0.19 12 Berkeley 0.10 Stratified Samples 4 City Count Ratio NYC 7 2/7 Berkele y 5 2/5 S2 JOIN S2 S1 3 1 2 ID City Data 1 NYC 0.78 2 NYC 0.13 3 Berkeley 0.25 4 NYC 0.19 5 NYC 0.11 6 Berkeley 0.09 7 NYC 0.18 8 NYC 0.15 9 Berkeley 0.13 10 Berkeley 0.49 11 NYC 0.19 12 Berkeley 0.10 Stratified Samples 4 U S2 ID City Data Weight 2 NYC 0.13 2/7 8 NYC 0.25 2/7 6 Berkeley 0.09 2/5 12 Berkeley 0.49 2/5 S2 S1 Doesn’t change Shark RDD Semantics 3 1 2 What is BlinkDB? A framework built on Shark and Spark that … - creates and maintains a variety of uniform and stratified samples from underlying data - returns fast, approximate answers with error bars by executing queries on samples of data - verifies the correctness of the error bars that it returns at runtime Error Estimation Closed Form Aggregate Functions - Central Limit Theorem - Applicable to AVG, COUNT, SUM, VARIANCE and STDEV Error Estimation Closed Form Aggregate Functions - Central Limit Theorem - Applicable to AVG, COUNT, SUM, VARIANCE and STDEV Error Estimation Closed Form Aggregate Functions A Central Limit Theorem A AVG - Applicable to AVG, COUNT, SUM, SUM COUNT A A VARIANCE and STDEV STDEV VARIANCE 2 2 1 1 Sample Sample ±ε Error Estimation Generalized Aggregate Functions - Statistical Bootstrap - Applicable to complex and nested queries, UDFs, joins etc. Error Estimation Generalized Aggregate Functions - Statistical Bootstrap B - AApplicable to complex and nested A A A queries, UDFs, joins etc. … 100 2 1 ±ε A … Sample Sample What is BlinkDB? A framework built on Shark and Spark that … - creates and maintains a variety of random and stratified samples from underlying data - returns fast, approximate answers with error bars by executing queries on samples of data - verifies the correctness of the error bars that it returns at runtime Kleiner’s Diagnostics More Data Higher Accuracy Error 300 Data Points 97% Accuracy [KDD 13] Sample Size What is BlinkDB? A framework built on Shark and Spark that … - creates and maintains a variety of random and stratified samples from underlying data - returns fast, approximate answers with error bars by executing queries on samples of data - verifies the correctness of the error bars that it returns at runtime Single Pass Execution A Approximate Query on a Sample Sample Single Pass Execution A Sample State Age Metric Weight CA 20 1971 1/4 CA 22 2819 1/4 MA 22 3819 1/4 MA 30 3091 1/4 Single Pass Execution A Resampling Operator R Sample State Age Metric Weight CA 20 1971 1/4 CA 22 2819 1/4 MA 22 3819 1/4 MA 30 3091 1/4 Single Pass Execution A Sample “Pushdown” R Sample State Age Metric Weight CA 20 1971 1/4 CA 22 2819 1/4 MA 22 3819 1/4 MA 30 3091 1/4 Single Pass Execution A Sample “Pushdown” R Sample Metric Weight 1971 1/4 3819 1/4 Single Pass Execution A Resampling Operator R Sample Metric Weight 1971 1/4 3819 1/4 Single Pass Execution A Resampling Operator R Sample Metric Weight Metric Weight Metric Weight 1971 1/4 1971 1/4 3819 1/4 3819 1/4 1971 1/4 3819 1/4 Single Pass Execution A A1 An A … R Sample Metric Weight Metric Weight Metric Weight 1971 1/4 1971 1/4 3819 1/4 3819 1/4 1971 1/4 3819 1/4 Single Pass Execution A A1 An A … R Sample Metric Weight Metric Weight Metric Weight 1971 1/4 1971 1/4 3819 1/4 3819 1/4 1971 1/4 3819 1/4 Single Pass Execution A Sample Metric Weight 1971 1/4 3819 1/4 Leverage Poissonized Resampling to generate samples with replacement Single Pass Execution A A1 Sample Metric Weight S1 1971 1/4 2 3819 1/4 1 Sample from a Poisson (1) Distribution Single Pass Execution A A1 Ak Sample Metric Weight S1 Sk 1971 1/4 2 1 3819 1/4 1 0 Construct all Resamples in Single Pass Single Pass Execution A S SAMPLE Sample S1 … S100 BOOTSTRAP WEIGHTS Da1 … Da100 Db1 … Db100 Dc1 DIAGNOSTICS WEIGHTS … Dc100 Single Pass Execution Additional Overhead: 200 bytes/row A S SAMPLE Sample S1 … S100 BOOTSTRAP WEIGHTS Da1 … Da100 Db1 … Db100 Dc1 DIAGNOSTICS WEIGHTS … Dc100 Single Pass Execution A Sample Embarrassingly Parallel S SAMPLE S1 … S100 BOOTSTRAP WEIGHTS Da1 … Da100 Db1 … Db100 Dc1 DIAGNOSTICS WEIGHTS … Dc100 Response Time (s) Single Pass Execution Query Execution Response Time (s) Single Pass Execution Error Estimation Overhead Response Time (s) Single Pass Execution Diagnostics Overhead BlinkDB Architecture Command-line Shell Thrift/JDBC Driver Physical Plan Meta store SQL Parser Query Optimizer SerDes, UDFs Execution Hadoop/Spark/Presto Hadoop Storage (e.g., HDFS, Hbase, Presto) Getting Started 1. Alpha 0.1.1 released and available at http://blinkdb.org 2. Allows you to create random and stratified samples on native tables and materialized views 3. Adds approximate aggregate functions with statistical closed forms to HiveQL 4. Compatible with Apache Hive, AMP Lab’s Shark and Facebook’s Presto (storage, serdes, UDFs, types, metadata) Feature Roadmap 1. Support a majority of Hive Aggregates and User-Defined Functions 2. Runtime Correctness Tests 3. Integrating BlinkDB with Shark and Facebook’s Presto as an experimental feature 4. Automatic Sample Management 5. Streaming Support Summary 1. Approximate queries is an important means to achieve interactivity in processing large datasets 2. BlinkDB.. - approximate answers with error bars by executing queries on small samples of data supports existing Hive/Shark/Presto queries 3. For more information, please check out our EuroSys 2013 (http://bit.ly/blinkdb-1) and KDD 2014 (http://bit.ly/blinkdb-2) papers Thanks!