Efficient Query Processing for Modern Data Management
Utkarsh Srivastava, Stanford University
Stanford InfoSeminar, July 26, 2016

Data Management System: User's View
[Diagram: a user/client poses a query to the data management system; its query processor evaluates the query over the data and returns results.]

Behind the Scenes
[Diagram: inside the query processor, a planning component chooses an execution plan using a cost model and statistics; a profiling component gathers those statistics from the data; an execution component runs the plan and produces results.]

A Conventional Relational DBMS
Statistics: histograms, available indices, cardinalities.
Cost model: disk I/O.

Sensor Data Management
[Diagram: a query goes to a coordinator node, which gathers results from a network of sensor nodes.]
Statistics: sensor and network power consumption, sensor capabilities.
Cost model: total power consumed, latency.

Data Stream Management
Continuous queries (e.g., find all flows that are using > 0.1% of the bandwidth) run over data streams (e.g., a stream of network packet headers) and produce streamed results.
Statistics: one-pass histograms, stream arrival rates, burst characteristics.
Cost model: throughput, accuracy.

Query Processing over Web Services
[Diagram: a query is answered by composing web services, e.g., WS1 at NASDAQ maps company info to a symbol, and WS2 at NYSE maps a symbol to stock price info.]
Statistics: web service response times, web service costs, histograms, cardinalities.
Cost model: latency, monetary cost.

Summary of Contributions
Query Processing over Web Services
- Finding the optimal query execution plan [TR05]
Data Stream and Sensor Data Management
- Optimal operator placement in the network to minimize power consumption [PODS05]
- Near-optimal sharing strategies for queries with common expensive filters [TR05]
- Maximizing accuracy of stream joins in the presence of memory or CPU constraints [VLDB04, ICDE06]
Profiling Component
- Collecting statistics from execution feedback [ICDE06]
- Collecting statistics from block-level sampling [SIGMOD04]
A Web Service Management System (WSMS)
[Diagram: a client submits a query and input data through a declarative interface and receives results. The WSMS comprises a metadata component (WS registration, schema mapper), a query processing component (plan selection, plan execution), and a profiling and statistics component (response time profiler, statistics tracker); plan execution issues WS invocations to WS1, ..., WSn.]

Query over Web Services - An Example
A credit card company wishes to send offers to those
(a) who have a high credit rating, and
(b) who have a good payment history on a prior card.
The company has at its disposal:
L: list of potential recipient SSNs
WS1: SSN → credit rating
WS2: SSN → card no(s).
WS3: card no. → payment history

Class of Queries Considered
"Select-Project-Join" queries over:
- a set of web services, each with interface Xi → Yi
- input data
Each input attribute for WSi is provided by either the attributes in the input data, or an output attribute of another web service WSj, giving a precedence constraint WSj → WSi. In general, the precedence constraints form a DAG.

Plan 1
[Diagram: the WSMS pipes L (SSNs) through WS1 (SSN → cr), filters on cr keeping the SSN, then through WS2 (SSN → ccn) and WS3 (ccn → ph), filters on ph keeping the SSN, and returns the surviving SSNs to the client. Note the pipelined processing.]

Simple Representation of Plan 1
L → WS1 → WS2 → WS3 → Results

Plan 2
[Diagram: the WSMS sends all SSNs in L both to WS1 (SSN → cr, filter on cr) and to WS2 (SSN → ccn) followed by WS3 (ccn → ph, filter on ph); the two branches are joined on SSN at the WSMS before results are returned.]

Simple Representation of Plan 2
L → { WS1 ; WS2 → WS3 } → Join → Results
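To make the two plan shapes concrete, here is a toy, single-threaded Java sketch (not the actual WSMS prototype); the service bodies, the credit-rating threshold, and names like creditRating and goodHistory are made-up stand-ins for WS1-WS3.

    import java.util.*;

    public class PlanShapes {
        // Stand-ins: WS1: SSN -> credit rating, WS2: SSN -> card no., WS3: card no. -> payment history
        static int creditRating(String ssn)    { return Math.floorMod(ssn.hashCode(), 10); }
        static String cardNumber(String ssn)   { return "cc-" + ssn; }
        static boolean goodHistory(String ccn) { return Math.floorMod(ccn.hashCode(), 2) == 0; }

        public static void main(String[] args) {
            List<String> L = Arrays.asList("111", "222", "333", "444");

            // Plan 1: L -> WS1 -> filter -> WS2 -> WS3 -> filter (WS2 sees only filtered SSNs)
            List<String> plan1 = new ArrayList<>();
            for (String ssn : L)
                if (creditRating(ssn) >= 5)             // filter on cr, keep SSN
                    if (goodHistory(cardNumber(ssn)))   // WS2, then WS3, filter on ph
                        plan1.add(ssn);

            // Plan 2: WS1 and WS2 -> WS3 each process all of L; branches joined on SSN at the WSMS
            Set<String> highRating = new HashSet<>(), goodPayers = new HashSet<>();
            for (String ssn : L) if (creditRating(ssn) >= 5) highRating.add(ssn);
            for (String ssn : L) if (goodHistory(cardNumber(ssn))) goodPayers.add(ssn);
            highRating.retainAll(goodPayers);           // local join on SSN

            System.out.println(plan1 + " == " + highRating); // both plans return the same SSNs
        }
    }

Both plans compute the same answer; they differ only in how many tuples each service must process, which is exactly the question the next slide raises.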
Which is More Efficient?
Plan 1: L → WS1 → WS2 → WS3 → Results
Plan 2: L → { WS1 ; WS2 → WS3 } → Join → Results
In Plan 1, WS2 has to process only the filtered SSNs; in Plan 2, WS2 has to process all SSNs.

Query Planning Recap
Possible plans P1, ..., Pn; statistics S; cost model cost(Pi, S). We want to find the least-cost plan.

Statistics
ci: per-tuple response time of WSi (response times assumed independent).
si: selectivity of WSi, the average number of output tuples per input tuple (selectivities assumed independent; may be ≤ 1 or > 1).

Bottleneck Cost Metric
Think of a lunch buffet with dishes 1-4 in a line: the overall per-item processing time is the response time of the slowest, or bottleneck, stage in the pipeline.

Cost Expression for Plan P
Ri(P): predecessors of WSi in plan P.
Fraction of input tuples seen by WSi = ∏_{j ∈ Ri(P)} sj
Response time per original input tuple at WSi = ci · ∏_{j ∈ Ri(P)} sj
cost(P) = max_i ( ci · ∏_{j ∈ Ri(P)} sj )
Assumption: the WSMS itself is not the bottleneck. Contrast with the sum cost metric.

Problem Statement
Input: web services WS1, ..., WSn; response times c1, ..., cn; selectivities s1, ..., sn; precedence constraints among the web services.
Output: a plan P arranging the web services into a DAG, such that P respects all precedence constraints and cost(P) under the bottleneck metric is minimized.

No Precedence Constraints
All selectivities ≤ 1 — Theorem: it is optimal to linearly order the web services by increasing ci (see the sketch after the implementation notes below).
General case: [Diagram: the selective web services (si ≤ 1) are chained in increasing cost order; the proliferative web services (si > 1) receive input in parallel, with a local join at the WSMS producing the results.]

With Precedence Constraints
Greedy algorithm:

    while (not all web services added to P) {
        from all web services that can be added to P,
        choose WSx that can be added to P at least cost;
        add WSx to P;
    }

Choosing WSx reduces to a linear program solved using network flow techniques. Running time: O(n^5).

Implementation
Building a prototype general-purpose WSMS:
- written in Java
- uses Apache Axis, an open-source implementation of SOAP
- Axis generates a Java class for each web service, dynamically loaded using Java reflection
Experiments so far primarily show the validity of the theoretical results [TR05].
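To close out this part, a minimal Java sketch of the bottleneck cost metric and the increasing-ci ordering theorem; the response times and selectivities are made up for illustration.

    import java.util.*;

    public class BottleneckCost {
        // cost(P) = max_i c_i * prod_{j before i} s_j for a linear plan P
        static double bottleneck(double[] c, double[] s, Integer[] order) {
            double cost = 0, frac = 1;              // frac = fraction of tuples reaching WSi
            for (int i : order) {
                cost = Math.max(cost, c[i] * frac); // response time per original input tuple at WSi
                frac *= s[i];                       // downstream services see fewer tuples
            }
            return cost;
        }

        public static void main(String[] args) {
            double[] c = {5.0, 1.0, 3.0};           // per-tuple response times
            double[] s = {0.2, 0.5, 0.8};           // selectivities, all <= 1
            Integer[] order = {0, 1, 2};
            Arrays.sort(order, Comparator.comparingDouble(i -> c[i])); // increasing c_i
            System.out.println("order " + Arrays.toString(order)
                               + ", bottleneck cost = " + bottleneck(c, s, order));
        }
    }

Enumerating all six orders of this three-service example confirms that the increasing-ci order attains the minimum bottleneck cost (2.0 here), as the theorem predicts.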
Gathering Statistics from Query Feedback
[Diagram: the execution component feeds observed cardinalities back to the profiling component, which refines the statistics that the planning component's cost model consults.]

Basic Idea
A user query with predicate P yields cardinality(P, database). Over time, predicates P1, P2, ... are queried, and this feedback can be used to build and refine a histogram over time.

Example
Fans(Name, Country, Sport), bucketed by Country × Sport. From table stats: total number of tuples = 100. With no other information, assume uniformity:

            Soccer   Cricket
    UK        25       25
    India     25       25

Feedback: SELECT Name FROM Fans WHERE Country = 'UK' returns 20 tuples. Scale the UK row down and the India row up to stay consistent with the total:

            Soccer   Cricket
    UK        10       10
    India     40       40

Accuracy improves even for India.

Next feedback: SELECT Name FROM Fans WHERE Sport = 'Soccer' returns 70 tuples. Splitting the Soccer total uniformly across countries is not possible — it would violate the UK count of 20:

            Soccer   Cricket
    UK        35       10      <- not possible!
    India     35       40

One consistent alternative keeps the UK row uniform but forces a non-uniform India row:

            Soccer   Cricket
    UK        10       10      (uniform)
    India     60       40      (non-uniform)

Which consistent solution should we prefer?

Solution Goals
- Maintain consistency with the observed feedback.
- Maintain uniformity where the feedback gives no information.
Use the maximum-entropy principle: entropy is a measure of the uniformity of a distribution. This leads to an optimization problem:
- Variables: the number of tuples in the histogram buckets
- Constraints: the feedback observed so far
- Objective: maximize entropy
A numeric sketch of this step follows.
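Here is a minimal Java sketch of the maximum-entropy step, using iterative proportional fitting as an illustrative stand-in for the paper's solver: starting from the uniform table, alternately rescale rows and columns to match the observed feedback; from a uniform start this converges to the maximum-entropy table.

    public class MaxEntropyHistogram {
        public static void main(String[] args) {
            // buckets: rows = {UK, India}, cols = {Soccer, Cricket}; start uniform at 25
            double[][] n = {{25, 25}, {25, 25}};
            double[] rowTarget = {20, 80};   // feedback: 20 UK tuples (hence 80 India)
            double[] colTarget = {70, 30};   // feedback: 70 Soccer tuples (hence 30 Cricket)

            for (int iter = 0; iter < 50; iter++) {
                for (int r = 0; r < 2; r++) {           // rescale each row to its marginal
                    double sum = n[r][0] + n[r][1];
                    n[r][0] *= rowTarget[r] / sum;
                    n[r][1] *= rowTarget[r] / sum;
                }
                for (int c = 0; c < 2; c++) {           // rescale each column to its marginal
                    double sum = n[0][c] + n[1][c];
                    n[0][c] *= colTarget[c] / sum;
                    n[1][c] *= colTarget[c] / sum;
                }
            }
            // prints 14 6 / 56 24 -- the solution shown on the next slide
            System.out.printf("%.0f %.0f / %.0f %.0f%n", n[0][0], n[0][1], n[1][0], n[1][1]);
        }
    }

With only these two marginal constraints, a single round of row and column scaling already lands on the maximum-entropy table.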
Solution to Example

            Soccer   Cricket
    UK        14        6
    India     56       24

Constraints: total tuples = 100; tuples with country 'UK' = 20; tuples with sport 'Soccer' = 70.
This maintains both consistency and uniformity: a 70-30 Soccer-Cricket ratio in both countries, and a 20-80 UK-India ratio for both sports.

Making it Work in a Real System (IBM DB2)
Maintaining a controlled-size histogram: solving the maximum-entropy problem yields an importance measure for feedback, and "unimportant" feedback by this measure is discarded.
Updates may lead to inconsistent feedback: a heuristic eliminates inconsistent feedback, preferentially discarding older feedback.

Experimental Evaluation
[Plot: static data — average relative error vs. number of queries (0 to 500) for ISOMER, STGrid, and the optimizer's estimates.]
[Plot: dynamic data shifting from uniform to correlated — average relative error vs. number of queries (0 to 2000) for ISOMER, STGrid, and the optimizer's estimates.]

Related Work
Web service composition: no formal cost model or provably optimal strategies; targeted towards workflow-oriented applications.
Parallel and distributed query optimization: freedom to move query operators around; a much larger space of execution plans.
Data integration, mediators: integrate general sources of data; optimize the cost at the integration system itself.
Feedback-based statistics: self-tuning histograms, STHoles.

Operator Placement in Sensor Networks [PODS05]
[Diagram: data is acquired at the sensor nodes; moving up the network hierarchy, computational power increases and links become cheaper.]
Problem: there is a tradeoff between processing and communication; what is the best operator placement?
Solution: an operator placement algorithm for filter-join queries that guarantees minimum total cost.

Sharing Among Continuous Queries [TR05]

    SELECT *      FROM S WHERE F1 ∧ F2 ∧ F3
    SELECT AVG(a) FROM S WHERE F1 ∧ F2 ∧ F4
    SELECT MAX(b) FROM S WHERE F2 ∧ F3

Problem: what is the best way to save the redundant work due to shared expensive filters?
Solution: the optimal plan must be adaptive; it achieves within O(log² m · log n) of the best sharing. An implementation shows large gains at little overhead. The baseline sharing idea is sketched below.
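As a toy Java illustration of that baseline idea (not the paper's adaptive, near-optimal strategy): evaluate each distinct expensive filter at most once per stream tuple and reuse the result across every query whose conjunction references it. The filters F1-F4 and the predicates below are made up.

    import java.util.*;
    import java.util.function.Predicate;

    public class SharedFilters {
        public static void main(String[] args) {
            Map<String, Predicate<Integer>> filters = new LinkedHashMap<>();
            filters.put("F1", t -> t % 2 == 0);   // pretend these are expensive to evaluate
            filters.put("F2", t -> t > 10);
            filters.put("F3", t -> t < 100);
            filters.put("F4", t -> t % 3 == 0);

            Map<String, List<String>> queries = new LinkedHashMap<>();
            queries.put("Q1", List.of("F1", "F2", "F3"));
            queries.put("Q2", List.of("F1", "F2", "F4"));
            queries.put("Q3", List.of("F2", "F3"));

            int tuple = 42;
            Map<String, Boolean> cache = new HashMap<>();   // filter -> result, per tuple
            for (Map.Entry<String, List<String>> q : queries.entrySet()) {
                boolean match = true;
                for (String f : q.getValue())               // each filter runs at most once
                    match &= cache.computeIfAbsent(f, name -> filters.get(name).test(tuple));
                if (match) System.out.println(tuple + " matches " + q.getKey());
            }
        }
    }

For tuple 42, all four filters pass and all three queries match, yet each filter ran once rather than once per query; the hard part the paper addresses is choosing the evaluation order adaptively.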
Maximizing Accuracy of Stream Joins [VLDB04]
Example aggregation:

    SELECT AVG(p2.ts - p1.ts)
    FROM Packets1 p1, Packets2 p2
    WHERE p1.id = p2.id

Input rates may be very high; join state may be very large.
Problem: maximize join accuracy under resource constraints.
Solution: given memory or CPU constraints, find the maximum subset of the join result, or find the largest possible random sample of the join result.

Statistics using Block-Level Sampling [SIGMOD04]
The cost of reading a random byte ≈ the cost of reading an entire block. At 100 tuples per block:
- uniform random sampling: 1000 samples ≈ 1000 I/Os
- block-level sampling: 1000 samples = 10 I/Os
Problem: naively applying techniques meant for uniform samples to block-level samples can cause huge errors.
Solution: new techniques for computing histograms and distinct-value counts from a block-level sample, robust and optimal for all layouts.

Future Directions
Development of a comprehensive WSMS:
- web service reliability/quality
- extension to handle workflows
- applying feedback-based profiling
- web services with monetary cost
Query processing in other new data management scenarios.
Infrastructures that enforce data privacy [CIDR05].