Adaptive Query Processing for Wide-Area Distributed Data

Michael Franklin
UC Berkeley
Joint work with Tolga Urhan, Laurent Amsaleg, and Anthony Tomasic

M. Franklin, August 2000

Motivation

• The Internet enables access to globally distributed data sources...
• But current search and data access technology is primitive:
  • Discovering relevant sources and data is difficult.
  • Simple text-based searches.
  • Navigation through link clicking.
  • Collecting, aggregating, and manipulating data from multiple sources is not supported.

QP on the Internet? - Issues

• Semantic Interoperability
  • Wrapper/Mediator Architecture.
  • XML, XMI, CWMI, OLE-DB, ...
• Source Discovery
  • Metadata Repositories and Directories.
• Performance
  • Distributed database technology, caching, etc.
• Responsiveness and Availability
  • Unpredictability: how to build responsive systems?
  • This is the focus of this talk.

Databases to the Rescue?

• DB query languages used to be navigational.
• Relational languages are more useful for many tasks.
  • Powerful, and (more or less) declarative.
  • Queries are written without regard to the physical structure/location/etc. of data. (Data Independence)
  • Easily extended to distributed systems.
• DB query languages and optimization techniques have been developed over decades.
• This technology is unavailable to the Internet user.

Distributed Query Processing (QP)

    SELECT eid, ename, title, salary
    FROM Emp, Proj, Assign
    WHERE Emp.eid = Assign.eid
      AND Proj.pid = Assign.pid
      AND Emp.loc <> Proj.loc

• System handles query plan generation & optimization; ensures correct execution.
• Originally conceived for corporate networks.

©1998 Ozsu and Valduriez

Wide-area + Wrapped sources: Unpredictability

• Sources may be unreachable or slow to respond.
• Data delivery may be:
  • slower than expected
  • bursty
  • interrupted
• Data statistics/cost estimates may be unavailable or unreliable.
• Traditional, static query processing approaches cannot cope with such problems at run-time.

Some Solutions

• Adaptive Query Processing
  • Query Scrambling - “Reactive Query Execution”
  • XJoin - non-blocking, reactive query operator.
  • and beyond!
• Risk-Aware Query Planning
  • Producing robust plans.
• Exploiting Alternative Sources
  • Mirrors or “not exactly”.
• Relaxing Query Semantics
  • Partial, Fuzzy, or Alternative answers

Query Scrambling - Introduction

• Goal: Overcome limitations of static QP for unexpected delays.
• A Reactive Approach:
  • Start with an optimized plan.
  • Modify the plan on-the-fly if problems are detected.
  • Hide delays by performing other useful work.
• Assumptions:
  • Focus on Initial Delay
  • Query processing at client; Iterator model
  • No replication.

Query Scrambling - Overview

• An iterative algorithm.
• Monitor input and scramble when problems are detected.

[Figure: state diagram with nodes Normal Execution, Scrambling Phase 1, and Phase 2; edge labels: Source(s) delayed, Source(s) responded, Still delayed]

• Phase 1: Reschedule “runnable” operators.
• Phase 2: Operator synthesis: create new operators.

Query Scrambling Example

[Figure: join trees over relations A, B, C, D, and E, showing the Initial Plan, two Reschedule steps, and New Operators]

Building a Scrambling Engine

• A thread per operator.
• Monitoring and scheduling.

[Figure: operator state diagram with states Not Started, Active, Stalled, Suspended, and Closed; transition labels: open, timeout, done, data_arrival, de-schedule, resume]

• A “smart” materialization operator.
• Multi-threaded query operators?

Directing Scrambling [SIGMOD 98]

• Original formulation [PDIS 96] was based on heuristics.
  • Demonstrated the ability for QS to hide delays, but was susceptible to making bad choices.
• Query optimizers are able to choose good plans, but how to use an optimizer to do scrambling?
• Phase I Issue: where to place the materialization operator?
  • Answer: Choose subtree with best overhead/useful work ratio.
• Phase II is trickier.
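
The Phase I answer above (choose the subtree with the best overhead/useful work ratio) can be sketched in a few lines. This is only an illustration, not the implementation from the talk: the `Subtree` record, its cost fields, and the reading of "best" as "lowest overhead per unit of useful work" are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Subtree:
    """A runnable subtree of the query plan (hypothetical representation)."""
    name: str
    useful_work: float  # assumed estimate: cost of executing the subtree now
    overhead: float     # assumed estimate: cost of materializing and re-reading its result


def pick_subtree(candidates: Sequence[Subtree]) -> Optional[Subtree]:
    """Return the candidate with the lowest overhead per unit of useful work."""
    runnable = [s for s in candidates if s.useful_work > 0]
    if not runnable:
        # Nothing worth rescheduling; this is the case Phase II has to handle.
        return None
    return min(runnable, key=lambda s: s.overhead / s.useful_work)
```

Returning `None` when nothing is runnable mirrors the hand-off to Phase II, where new operators have to be created instead.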
Phase II - Operator Synthesis

• If no runnable subtrees, create new ones.
• Needed: an optimizer that:
  1) is lightweight & incremental, and
  2) understands delays.
• Most QP systems optimize for total work.
• But, delay is inherently a response-time issue.
• Response-time optimization can “magically” move delayed operators to the “best” point in the plan, but only if it knows the duration of the delay!

Include Delayed (ID) Algorithm

• Invokes the optimizer with a very large delay value.
• Optimizer pushes the delayed relation as far back as is useful.

[Figure labels: Large delay estimation; Aggressive]

Estimated Delay (ED) Algorithm

• Initially calls the RT optimizer with a small delay value = 25% of the RT of the original query.
• Successively increases the delay estimation: 50% and then 100% of the original RT.

[Figure labels: Small; Increasing estimates; Adaptive]

Experimental Environment

• Workload: Queries derived from TPC-D benchmark
  • TPC-D(5), TPC-D(8), TPC-D(9) (1 GB base data)
• Optimizer (built from scratch):
  • Two Phase Randomized Optimizer a la [Ioannidis 90].
  • Optimizes for Total Work or Response Time (GHK 92).
  • Search space = bushy plans
• Studied algorithms on a simulated environment
  • Network, remote sites, query engine etc.
• Subsequently validated with Predator-based implementation.

National Market Share Query (TPC-D 8)

• Experiments with several memory sizes
• Delayed relation (Part) is an important relation.
• Used hash joins only.

[Figure: join graph over Part, LineItem, Order, Customer, Supplier, Nation, Nation, and Region; edge labels: 1/150, 2/7, 1/5]

• Lineitem is the largest relation, Part is a “reducer”
• Optimizer initially chooses to go left-to-right.

National Market Share Query (large memory)

> 4 MB

[Figure: Response Time (sec) vs. Delay (sec), both axes 0 to 1000; curves: PAIR, IN, ED]

National Market Share Query (small memory)

[Figure: Response Time (sec), 0 to 3000, vs. Delay (sec), 0 to 2500; curves: PAIR, IN, ED]

• Scrambling becomes more expensive
• Pair: Local decisions, lack of global view
• IN: Poor performance for short delays.
• ED: Good for a wide range of delay values.

Cost-Based Query Scrambling

• Summary:
  • Traditional static query processing does not scale to the wide-area environment.
  • A reactive approach is needed.
  • This requires a multi-threaded engine and a scrambling-enabled optimizer.
• Experimental Results:
  • Avoids many of the problems of heuristic algorithms.
  • Response time-based optimization is needed.
  • Fundamental tradeoffs arise in the absence of good delay predictions.

XJoin - Improving Responsiveness

• QS can speed up the delivery of the entire answer.
• But, its ability to hide delays is limited by the amount of useful work that can be done in the query.
• XJoin is a new query operator that:
  • Produces results incrementally as they become available.
  • Allows progress to be made in highly erratic situations.
  • Has a small memory footprint.
  • Tolerates bursty and slow behavior.

Hash Join vs. Symmetric Hash Join

[Figure: Source A and Source B feeding Hash Table A and Hash Table B; labels: Build, Probe]

• Traditional Hash Joins block when one input stalls.
• Symmetric Hash Join (SHJ) blocks only if both stall.
  • Processes tuples as they arrive from sources.
  • Produces all tuples in the join and no duplicates.

Memory Utilization

• As originally specified, SHJ requires both inputs to be memory resident.
• For a complex query, this means all intermediate results must be in memory.
• This is wasteful and can result in thrashing.
• XJoin extends SHJ to allow it to work with limited memory (like “Hybrid Hash”).
• Spilled tuples are processed by a reactively scheduled background thread.

Partitioning

• XJoin is a partitioned hash join method.
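
As a reference point for the partitioning discussion, the in-memory symmetric hash join that XJoin extends can be sketched as follows. This is a minimal illustration under stated assumptions, not the PREDATOR implementation: the `(join key, payload)` tuple layout, the `"A"`/`"B"` source tags, and the single interleaved arrival stream are hypothetical, and both hash tables stay fully memory resident, so none of XJoin's spilling or background stages appear here.

```python
from collections import defaultdict
from typing import Hashable, Iterable, Iterator, Tuple

Row = Tuple[Hashable, str]  # (join key, payload); a hypothetical tuple layout


def symmetric_hash_join(
    arrivals: Iterable[Tuple[str, Row]]
) -> Iterator[Tuple[Row, Row]]:
    """Join tuples from sources "A" and "B" in whatever order they arrive."""
    tables = {"A": defaultdict(list), "B": defaultdict(list)}
    for source, row in arrivals:
        other = "B" if source == "A" else "A"
        key = row[0]
        # Build: remember the tuple so later arrivals from the other source find it.
        tables[source][key].append(row)
        # Probe: match against everything the other source has delivered so far.
        for match in tables[other].get(key, []):
            # Always emit pairs in (A row, B row) order.
            yield (row, match) if source == "A" else (match, row)
```

Each pair is produced exactly once, at the moment the later of its two tuples arrives, which is why the in-memory version yields all join results with no duplicates and only stalls when neither source delivers.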
• When allocated memory is exhausted, a partition is flushed to disk.
• Join processing continues on memory-resident data.
• Disk-resident tuples are handled in background.

The 3 Stages of XJoin

• Stage 1 - Symmetric hash join (memory-to-memory)
• Stage 2 - Disk-to-memory
  • Separate thread - runs when stage 1 blocks.
  • Stage 1 and 2 trade off until all input has been received.
• Stage 3 - Clean up stage
  • Stage 1 misses pairs that were not in memory concurrently.
  • Stage 2 misses pairs when both are on disk, and may not get to run to completion.

XJoin - Details

• The asynchronous/multi-threaded nature of XJoin combined with its small footprint allows it to be fully pipelined, but…
• Duplicate result tuples can be introduced during stages 2 and 3. These are avoided using timestamps.
  • Each tuple is given an Arrival Timestamp (ATS) and a Departure Timestamp (DTS).
  • Two tuples with overlapping ATS-DTS ranges have already been matched in stage 1.
  • Timestamp of when disk-resident partition was used allows detection of tuples matched during stage 2.
• Second stage can be further optimized, at the expense of a bit of memory and some additional duplicate detection.

XJoin - Performance

• We implemented XJoin in our multi-threaded version of the PREDATOR ORDBMS (from Cornell).
• We modeled network delays using traces obtained from accessing sites across the Internet.
  • Replaying these traces provides repeatable results.
  • Focus on a “slow” (24.1 KB/sec) and “fast” (132.8 KB/sec) trace; both exhibit bursty behavior.
• Workload is simple join queries on Wisconsin Benchmark relations.

Results - 2-Way Joins (Time in seconds to nth tuple)

[Figure: four panels (Slow Build, Slow Probe; Fast Build, Slow Probe; Slow Build, Fast Probe; Fast Build, Fast Probe); curve labels: H H, XJ-2, XJoin]

Taming the Second Stage

• Impact of the second stage decreases during the execution of an XJoin.
• Scheduling can be adjusted to account for this.

[Figure: Fast Build, Fast Probe; curve labels: XJoin, H, XJoin-A]

Results - Multiway Joins

[Table: Delivery Times (in Seconds). Columns: # Rels, then XJ and HHJ times for the 1st, 5K, 50K, and Last result tuples. Cell values as extracted: 2 5 823 195 826 668 836 860 878 4 46 916 482 938 6 79 992 400 2 1 150 4 17 6 75 907 1018 1075 860 1144 952 1174 36 153 127 181 178 201 285 175 307 378 362 470 387 476 381 559 803 629 892 660 786 992]

XJoin - Summary

• A non-blocking, small footprint join operator.
• It is multi-threaded, consisting of three stages.
• These stages allow XJoin to make progress when input blocks, but they can introduce duplicates.
• XJoin is optimized for streaming results to users as fast as they are created.
• Like QS, XJoin hides delays with useful work, but at the operator level rather than at the plan level.
• Experiments showed order-of-magnitude improvements in time to get initial results.

Eddy - Continuous Optimization

[Figure: an Eddy routing tuples from R, S, and T among Join RS and Join ST]

• Flow-based (“Rivers”)
• Tuples are routed via a ticket-based scheme and back-pressure.

Hellerstein and Avnur 99

Adaptive Approaches

[Figure: spectrum of adaptivity, from static plans (current DBMS) through late binding (Dynamic, Parametric, Competitive, …), reopt. (Query Scrambling, Kabra/DeWitt), and continuous opt. (Eddy, XJoin) to anarchy (???)]

• Increased uncertainty argues for increased adaptivity.
  • Wide-area nets and admin domains introduce uncertainty.
  • Pesky users introduce uncertainty.
  • Non-traditional data sources introduce uncertainty.
• Implications for data-intensive Internet services.

The Telegraph Project

• Adaptive data management for Internet-scale composition of services.
  • Dataflow-based scheduling.
  • Cross-domain negotiation.
  • “User-in-the-loop”
• Adaptation and learning over varying granularities
  • individual long-running jobs
  • many similar short jobs
  • continuous data flows and filters.

Conclusions

• Current static query processing technology cannot cope with the wide-area environment.
  • A key concern is unpredictability.
• Query Scrambling is a reactive execution approach.
• XJoin is a pipelined operator that streams answers.
• Even more adaptive approaches are possible.
• Complementary approaches (and future work):
  • Alternative sources, optimizing for robustness, relaxing semantics.
• These ideas extend to the composition of Internet services.

The End

Future Work

• Investigating the properties of query plans that make them robust in the presence of network problems.
  • Will use these properties in the objective function for query optimization.
• Next step is to use alternative, but not necessarily equivalent sources.
• Further progress will involve relaxing the guarantees on semantics that the query system provides.
  • The WWW has shown us that users will accept this!

Conclusions

• Current Internet querying and data manipulation capabilities are too limited.
  • Unexpressive, too coarse grained, etc.
  • Do not support manipulating data from multiple sites.
• Distributed querying technology addresses these concerns but is not applicable on the Internet.
  • A key concern is unpredictability.
• Query Scrambling is a reactive execution approach.
• XJoin is a pipelined operator that streams answers.
• Lots more interesting work to be done in this area.

Motivation

• Pervasive network connectivity enables global-scale federated DBMSs.
• Improvements in heterogeneous DBMS and emerging standards enable Internet query processing.
• Telegraph: Flow-based composition of data-intensive Internet services.