Primitives for Workload Summarization and Implications for SQL Prasanna Ganesan*

Prasanna Ganesan* (Stanford University), Surajit Chaudhuri and Vivek Narasayya (Microsoft Research)
*Work done at Microsoft Research

Motivation
• Workload: a set of SQL statements
• Many tasks exploit workload information
  – DB administration, index tuning, statistics building, approximate query processing
• DBMS profilers produce large workloads (plus additional info)
• Most tasks need small workloads
• Goal: Summarization. Find a “representative” subset of a given, large workload
  – Sometimes a weighted subset

Why Not Random Sampling?
• One size does not fit all
  – Different definitions of “representative subset”
  – Random sampling may lose valuable info
    • Ignores additional info associated with statements
• Shown to work poorly, e.g., for index selection [chaudhuri02]
  – May oversample queries on some tables, while ignoring less frequent queries on other tables

Our Solution
1. Treat input as a relation
   • Each SQL statement (+associated info) is a tuple
2. Extend SQL with new language primitives
   • Allow declarative specification of desired subset
   • Usable on arbitrary relations, not just workloads
3. Implement extensions inside query engine
   • Why? Primitives appear widely applicable
   • Other implementation options available

The Architecture
[Figure: a workload table (columns Query ID, SQL String, FROM Tables, …, Estimated Cost, Execution Cost; first row Q1, SELECT * FROM R1, R2, {R1, R2}) is fed to the Execution Engine, which runs the query below and hands the resulting Summary to the Application.]

    SELECT *, DOMSUM(Count)
    FROM WkldTbl
    DOMINATE WITH PARTITIONING BY FromTables, JoinConds, WhereCols
      (SLAVE.GroupByCols ⊆ MASTER.GroupByCols) AND
      (SLAVE.OrderByCols PREFIX MASTER.OrderByCols)
    REPRESENT WITH PARTITIONING BY FromTables, JoinConds, WhereCols
      MAXIMIZING SUM(DOM_Count)
      GLOBAL CONSTRAINT Count(*) ≤ 200
      LOCAL CONSTRAINT Count(*) ≥ int(200*LOCAL.Count(*)/GLOBAL.Count(*))

Outline
• New Primitives for Summarization (Subsetting)
  – Dominance
  – Representation
• Implementing Summarization Primitives in SQL
• Experiments

Dominance
• Idea: Filter and aggregate using a partial order on tuples
• Specify the condition for one tuple to dominate another
  – Transitive condition
  – Encapsulates application knowledge
• Output: Keep throwing away tuples that are dominated
  – Retain aggregate info about dominated tuples

A Graphical Representation
[Figure: dominance illustrated on a relation with columns Vendor, Quality, Price; the vendors shown include Buono and Cattivo.]

Applying Dominance to Workloads
• Example: Index Selection
    Q1: SELECT … FROM R GROUP BY A, B, C
      dominates
    Q2: SELECT ... FROM R GROUP BY A, B
  – An index useful for Q1 is likely to be useful for Q2

    MASTER.FromTables = SLAVE.FromTables AND
    MASTER.GroupByCols ⊇ SLAVE.GroupByCols AND
    SLAVE.OrderByCols PREFIX MASTER.OrderByCols

Outline
• New Primitives for Summarization (Subsetting)
  – Dominance
  – Representation
• Implementing Summarization Primitives in SQL
• Experiments

Representation
• Dominance only gets us so far
  – Need a “lossier” way to select a subset
• Idea: Pick a subset that solves a Linear Program
  – Optimize some criterion
  – Satisfy lots of constraints
  – Support concept of partitioning

Details
• Partition tuples by a set of attributes
• Criterion: Maximize/Minimize Aggregate
  – E.g., Minimize Count(*)
• Global Constraints
  – E.g., Sum(B) in chosen subset > 60% Sum(B) in input
• Local Constraints: apply to each partition
  – E.g., Sum(B) in chosen subset > 40% Sum(B) in that partition
[Figure: a relation with columns A, B, C split by A into three partitions (A = 1, 2, 3).]

An Index Selection Example
• Partition by Tables, Join Conditions and attributes in WHERE clause
• Criterion: Maximize Sum(ExecutionCost)
  – Need best “coverage”
• Global Constraint: Count(*) ≤ 200
• Local Constraint: Proportionate representation
  – A partition with 20% of input should have 20% of output
  – Count(*) ≥ int(200*LOCAL.Count(*)/GLOBAL.Count(*))

Putting it all together
1. Apply dominance criterion (as earlier).
2. Apply representation (as earlier, but maximize SUM(DOM_Count)).
3. Weight each tuple by the number of tuples it dominates.

    SELECT SqlString, DOMSUM(Count)
    FROM WkldTbl
    DOMINATE WITH PARTITIONING BY FromTables, JoinConds, WhereCols
      (SLAVE.GroupByCols ⊆ MASTER.GroupByCols) AND
      (SLAVE.OrderByCols PREFIX MASTER.OrderByCols)
    REPRESENT WITH PARTITIONING BY FromTables, JoinConds, WhereCols
      MAXIMIZING SUM(DOM_Count)
      GLOBAL CONSTRAINT Count(*) ≤ 200
      LOCAL CONSTRAINT Count(*) ≥ int(200*LOCAL.Count(*)/GLOBAL.Count(*))

Outline
• New Primitives for Summarization (Subsetting)
  – Dominance
  – Representation
• Implementing Summarization Primitives in SQL
• Experiments

Implementing Summarization Primitives in SQL
• Assume set and sequence support in SQL
  – The mills of the standards bodies…
• Partitioning useful for both primitives
  – Hashing, sort-based, index-based…
• Implementing Dominance
  – Naïve O(n²) algorithm
  – Techniques from group-wise processing
  – Leverage Skyline optimizations

Representation
• Implementing directly is LP-hard
• Many queries are much simpler
  – Fall into one of two special cases
• Other queries are handled by a simple heuristic
  – User-guided search
• Implement as multiple operators

User-Guided Search
• Scan tuples in a specific order
  – User-specified, or heuristically chosen
• Will always minimize/maximize Count(*)
  – Use ordering to transform other objectives
  – Slightly different algorithms for the two cases
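
One way to read the minimization case of user-guided search is as a single ordered scan that outputs a tuple only while some constraint it can help with is still violated. The Python below is a minimal sketch under that reading, not the algorithm from the talk: the function name `guided_min_count`, the `(partition, b)` row layout, and the choice of Sum(B) constraints (borrowed from the Details slide) are all assumptions made for illustration.

```python
from collections import defaultdict


def guided_min_count(rows, global_frac, local_frac, order_key):
    """Heuristically minimize Count(*) of the chosen subset (hypothetical helper).

    rows        : list of (partition, b) tuples
    global_frac : chosen Sum(b) must exceed this fraction of the input Sum(b)
    local_frac  : per partition, chosen Sum(b) must exceed this fraction
    order_key   : sort key giving the user-specified scan order
    """
    total = sum(b for _, b in rows)
    part_total = defaultdict(int)
    for p, b in rows:
        part_total[p] += b

    got_total = 0
    got_part = defaultdict(int)
    chosen = []

    def global_ok():
        return got_total > global_frac * total

    def local_ok(p):
        return got_part[p] > local_frac * part_total[p]

    for p, b in sorted(rows, key=order_key):
        # Stop as soon as every constraint is satisfied.
        if global_ok() and all(local_ok(q) for q in part_total):
            break
        # Output the tuple only if it helps a constraint that is still violated.
        if not global_ok() or not local_ok(p):
            chosen.append((p, b))
            got_total += b
            got_part[p] += b
    return chosen


# Made-up input: three partitions, scanned in decreasing order of b.
rows = [(1, 10), (1, 4), (1, 2), (2, 5), (2, 3), (3, 7), (3, 1)]
picked = guided_min_count(rows, 0.6, 0.4, order_key=lambda r: -r[1])
print(picked)  # [(1, 10), (3, 7), (2, 5)]
```

The scan order does most of the work here: visiting large values of `b` first means each output tuple closes as much of the remaining gap as possible, which is why the slide stresses that the order is user-specified or heuristically chosen. The sketch makes no optimality claim and simply returns what it has if the constraints cannot all be met.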

A Minimization Example
[Figure: tuples A to F in scan order, annotated with the labels Violated, Output and Satisfied.]

Two Special Cases
• Maximize SUM(Attr)
  – All constraints are on Count(*)
  – Use partitioning and sort-order access
• Minimize Count(*)
  – Single constraint: Again easily solved
  – More special cases also solvable
  – Multiple constraints: Approximation algorithm

Experiments
• Evaluate utility for index selection
• Compare to sophisticated Workload Compression [chaudhuri02]
  – Clusters using a complex distance function
• Simple query as described earlier
  – Constrained to output same number of statements as Workload Compression
  – Orders of magnitude faster
• TPC-H 1GB database
  – Multiple synthetic workloads introduced in [chaudhuri02]

Experiments (Contd.)
[Figure: pipeline Workload → Compress → Tuning Wizard → Evaluate Total Estimated Cost.]

Comparing Estimated Costs
[Figure: bar chart of Estimated Cost (y-axis, 0 to 90000) for the workloads SPJ, SPJ-GB, SPJ-GBOB and SingleTable, comparing Wkld Compression with Proportionate (Syntactic).]

Conclusion
• Our contributions
  – Summarization can be expressed declaratively
  – Introduction of new operators for summarization
  – Discussion of SQL implementation
• The Future
  – An automatic monitoring and tuning infrastructure?
  – More workload-sensitive tasks?
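
The three steps of the “Putting it all together” query can also be traced outside a query engine. The Python below is a minimal sketch, not the authors' implementation: the `Stmt` class and its fields are hypothetical stand-ins for the columns of WkldTbl, dominance uses the naive O(n²) comparison, statements that dominate each other are tie-broken by input position, a dominated statement's count is credited to the first surviving master, and representation uses the "Maximize SUM(Attr) with Count(*) constraints" special case (fill each partition's quota in sorted order, then spend leftover slots on the heaviest remaining tuples).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Stmt:  # hypothetical row of WkldTbl
    sql: str
    part: tuple  # stands in for (FromTables, JoinConds, WhereCols)
    group_by: frozenset = frozenset()
    order_by: tuple = ()
    count: int = 1


def dominates(master, slave):
    """Same partition, GroupByCols subset, OrderByCols prefix."""
    return (master.part == slave.part
            and slave.group_by <= master.group_by
            and slave.order_by == master.order_by[:len(slave.order_by)])


def apply_dominance(stmts):
    """Naive O(n^2) filter; returns [(survivor, DOM_Count), ...]."""
    n = len(stmts)

    def beats(j, i):
        if j == i or not dominates(stmts[j], stmts[i]):
            return False
        # Mutually dominating statements: the earlier one survives.
        return not dominates(stmts[i], stmts[j]) or j < i

    survivors = [i for i in range(n) if not any(beats(j, i) for j in range(n))]
    dom_count = {i: stmts[i].count for i in survivors}
    for i in range(n):
        if i not in dom_count:
            # Credit the dominated count to the first surviving master.
            master = next(s for s in survivors if dominates(stmts[s], stmts[i]))
            dom_count[master] += stmts[i].count
    return [(stmts[i], w) for i, w in dom_count.items()]


def represent(weighted, k):
    """Maximize the summed weight with Count(*) <= k and proportional quotas."""
    parts = defaultdict(list)
    for stmt, w in weighted:
        parts[stmt.part].append((stmt, w))
    total = len(weighted)
    chosen, rest = [], []
    for rows in parts.values():
        rows.sort(key=lambda r: r[1], reverse=True)
        quota = k * len(rows) // total  # int(k * LOCAL.Count / GLOBAL.Count)
        chosen += rows[:quota]
        rest += rows[quota:]
    rest.sort(key=lambda r: r[1], reverse=True)
    return chosen + rest[:k - len(chosen)]


# Made-up workload over two partitions, R and S.
workload = [
    Stmt("q1", ("R",), group_by=frozenset("ABC")),
    Stmt("q2", ("R",), group_by=frozenset("AB")),
    Stmt("q3", ("R",), group_by=frozenset("A"), count=3),
    Stmt("q4", ("R",), group_by=frozenset("D")),
    Stmt("q5", ("S",), order_by=("X", "Y")),
    Stmt("q6", ("S",), order_by=("X",)),
]
summary = represent(apply_dominance(workload), k=2)
print([(s.sql, w) for s, w in summary])  # [('q1', 5), ('q5', 2)]
```

In this toy run q1 absorbs q2 and q3, q5 absorbs q6, and the budget of two statements goes to the two heaviest survivors once partition R's quota of one is filled. A real implementation would partition first (by hashing, sorting or an index, as the implementation slide suggests) rather than compare every pair of statements.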
