Adaptive Processing in Data Stream Systems Shivnath Babu Stanford University stanfordstreamdatamanager Data Streams • New applications -- data as continuous, rapid, time-varying data streams – – – – – – Network monitoring and traffic engineering Sensor networks, RFID tags Financial applications Telecom call records Web logs and click-streams Manufacturing processes • Traditional databases -- data stored in finite, persistent data sets stanfordstreamdatamanager 2 Using a Traditional Database User/Application Query Query … Loader Result Result … Table R Table S stanfordstreamdatamanager 3 New Approach for Data Streams User/Application Register Continuous Query (Standing Query) Input streams stanfordstreamdatamanager Result Stream Query Processor 4 Example Continuous (Standing) Queries • Web – Amazon’s best sellers over last hour • Network Intrusion Detection – Track HTTP packets with destination address matching a prefix in given table and content matching “*\.ida” • Finance – Monitor NASDAQ stocks between $20 and $200 that have moved down more than 2% in the last 20 minutes stanfordstreamdatamanager 5 Data Stream Management System Streamed Result Register Continuous Query 13 34 71 90 Data Stream Management System (DSMS) 12 22 50 45 Stored Result Input Streams Archive Stored Tables stanfordstreamdatamanager 6 Data Stream Management System Streamed ResultDSMS System Components • User/application interface Register • Execution plan operators Data Stream Continuous and synopsis data stores Query Management • Query Query processor System (DSMS) Processor • Operator scheduler • Memory manager Input Streams • Metadata and statistics catalog • and others stanfordstreamdatamanager 7 Primer on Database Query Processing Declarative Query Preprocessing Database System Canonical form Query Optimization Best query execution plan Results Query Execution stanfordstreamdatamanager Data 8 Traditional Query Optimization Which statistics are required Estimated Periodically collects statistics, statistics Query Statistics Manager: Optimizer: e.g., data sizes, histograms Finds “best” query plan to process this query Data, auxiliary structures, statistics Executor: Chosen query plan Runs chosen plan to completion stanfordstreamdatamanager 9 Optimizing Continuous Queries Poses New Challenges • Continuous queries are long-running • Stream properties can change while query runs – Data properties: value distributions – Arrival properties: bursts, delays • System conditions can change • Performance of a fixed plan can change significantly over time Adaptive processing: use plan that is best for current conditions stanfordstreamdatamanager 10 Rest of this Talk Streamed Result Register Continuous Query Input Streams DSMS System Components Query Processor StreaMon Adaptive Query Processing Engine stanfordstreamdatamanager Adaptiveordering Adaptive orderingofof commutative filters commutative filters Adaptive caching for multiway joins Adaptive use of input stream properties for resource mgmt. 11 Traditional Optimization StreaMon Which statistics are required Query Estimated Optimizer: Re-optimizer: Periodically collects statistics, Finds “best” queryisplan to Monitors current stream and statistics Ensures that plan efficient Statistics Manager: Profiler: e.g., table characteristics sizes, histograms system Combined in part for efficiency processcharacteristics this query for current Executor: Executor: Executes Runs chosen current planplan on to completion incoming stream tuples stanfordstreamdatamanager Chosen Decisions query plan to adapt 12 Pipelined Filters • Commutative filters over a stream • Example: Track HTTP packets with destination address matching a prefix in given table and content matching “*\.ida” • Simple to complex filters – Boolean predicates – Table lookups – Pattern matching – User-defined functions stanfordstreamdatamanager Bad packets Filter3 Filter2 Filter1 Packets 13 Pipelined Filters: Problem Definition • Continuous Query: F1 Æ F2 … Æ … Fn • Plan: Tuples F(1) F(2) … … F(n) • Goal: Minimize expected cost to process a tuple stanfordstreamdatamanager 14 Pipelined Filters: Example 2 1 Input tuples 1 2 3 4 1 2 3 7 1 5 6 7 1 2 4 8 F1 F2 F3 F4 Output tuples Informal Goal: If tuple will be dropped, then drop it as cheaply as possible stanfordstreamdatamanager 15 Why is Our Problem Hard? • Filter drop-rates and costs can change over time • Filters can be correlated • E.g., Protocol = HTTP and DestPort = 80 stanfordstreamdatamanager 16 Metrics for an Adaptive Algorithm • Speed of adaptivity – Detecting changes and finding new plan • Run-time overhead Profiler – Re-optimization, collecting statistics, plan switching • Convergence properties Re-optimizer StreaMon Executor – Plan properties under stationary statistics stanfordstreamdatamanager 17 Pipelined Filters: Stationary Statistics • Assume statistics are not changing – Order filters by decreasing drop-rate/cost [MS79, IK84, KBZ86, H94] – Correlations NP-Hard • Greedy algorithm: Use conditional statistics – F(1) has maximum drop-rate/cost – F(2) has maximum drop-rate/cost ratio for tuples not dropped by F(1) – And so on stanfordstreamdatamanager 18 Adaptive Version of Greedy • Greedy gives strong guarantees – 4-approximation, best poly-time approx. possible assuming P NP [CFK03, FLT04, MBM+05] – For arbitrary (correlated) characteristics – Usually optimal in experiments • Challenge: – Online algorithm – Fast adaptivity to Greedy ordering – Low run-time overhead A-Greedy: Adaptive Greedy stanfordstreamdatamanager 19 A-Greedy Which statistics are required Profiler: Maintains conditional Estimated statistics filter drop-rates and costs over recent tuples Combined in part for efficiency Re-optimizer: Ensures that filter ordering is Greedy for current statistics Executor: Processes tuples with current Greedy ordering stanfordstreamdatamanager Changes in filter ordering 20 A-Greedy’s Profiler • Responsible for maintaining current statistics – Filter costs – Conditional filter drop-rates -- exponential • Profile Window: Sampled statistics from which required conditional drop-rates can be estimated stanfordstreamdatamanager 21 Profile Window 4 4 1 2 3 4 1 2 3 7 1 5 6 7 1 2 4 8 F1 F2 F3 F4 1 0 1 0 stanfordstreamdatamanager 0 0 0 1 0 1 0 1 1 1 1 0 Profile Window 22 Greedy Ordering Using Profile Window F1 F2 F3 F4 F1 F2 F3 F4 1 0 1 0 2 2 3 2 0 0 0 1 1 0 1 0 F3 F1 F2 F4 3 2 2 2 0 2 1 F3 F2 F4 F1 3 2 2 2 2 0 1 1 0 0 1 0 0 0 1 0 0 0 0 1 1 Matrix View Greedy Ordering stanfordstreamdatamanager 23 A-Greedy’s Re-optimizer • Maintains Matrix View over Profile Window – Easy to incorporate filter costs – Efficient incremental update – Fast detection & correction of changes in Greedy order Details in [BMM+04]: “Adaptive Processing of Pipelined Stream Filters”, ACM SIGMOD 2004 stanfordstreamdatamanager 24 Next • Tradeoffs and variations of A-Greedy • Experimental results for A-Greedy stanfordstreamdatamanager 25 Tradeoffs • Suppose: – Changes are infrequent – Slower adaptivity is okay – Want best plans at very low run-time overhead • Three-way tradeoff among speed of adaptivity, run-time overhead, and convergence properties • Spectrum of A-Greedy variants stanfordstreamdatamanager 26 Variants of A-Greedy Algorithm Convergence Properties Run-time Overhead Adap. A-Greedy 4-approx. High (relative to others) Fast 1 0 1 0 0 0 0 1 1 0 1 0 0 1 0 0 0 1 0 0 0 0 1 1 Profile Window stanfordstreamdatamanager 3 2 2 2 2 0 1 1 0 Matrix View 0 Matrix View 27 Variants of A-Greedy Algorithm Convergence Properties Run-time Overhead Adap. A-Greedy 4-approx. High (relative to others) Fast Sweep 4-approx. Less work per sampling step Slow Local-Swaps May get caught in local optima Misses correlations Less work per sampling step Slow Lower sampling rate Fast Independent stanfordstreamdatamanager Matrix View 28 Experimental Setup • Implemented A-Greedy, Sweep, Local-Swaps, and Independent in StreaMon • Studied convergence properties, run-time overhead, and adaptivity • Synthetic stream-generation testbed – Can control & vary stream data and arrival properties • DSMS server running on 700 MHz Linux machine, 1 MB L2 cache, 2 GB memory stanfordstreamdatamanager 29 Avg. processing rate (tuples/sec) Converged Processing Rate 55000 50000 Optimal-Fixed Optimal Sweep Sweep A-Greedy A-Greedy 45000 40000 35000 Local-Swaps Local-Swaps 30000 Independent Independent 25000 20000 3 4 6 8 Number of filters stanfordstreamdatamanager 10 30 Avg. processing rate (tuples/sec) Effect of Correlation 55000 50000 Optimal-Fixed Sweep Optimal A-Greedy A-Greedy 45000 40000 35000 Independent Local-Swaps 30000 Independent 25000 20000 1 2 3 Correlation factor stanfordstreamdatamanager 4 31 Run-time Overhead Average time/tuple (microsecs) Tuple processing Profiling + Reopt. Overhead 250 200 150 100 50 0 Optimal A-Greedy stanfordstreamdatamanager Sweep Local-Swaps Independent 32 5000 4800 4600 4400 4200 4000 3800 3600 3400 3200 3000 Sweep Local-Swaps A-Greedy Independent 80 0 77 0 74 0 71 0 68 0 65 0 62 0 59 0 56 0 Stream data properties changed here 53 0 50 0 #Filters evaluated per 2000 tuples Adaptivity of time Progress ofProgress time (x1000 tuples processed) stanfordstreamdatamanager 33 Remainder of Talk • Adaptive caching for multiway joins • Current and future research directions • Related work stanfordstreamdatamanager 34 Stream Joins join results DSMS observations in the last minute Sensor R stanfordstreamdatamanager Sensor S Sensor T 35 MJoins (VNB04) ⋈T ⋈T ⋈S Window on R ⋈R Window on S stanfordstreamdatamanager ⋈S ⋈R Window on T 36 Excessive Recomputation in MJoins ⋈T ⋈R Window on R Window on S stanfordstreamdatamanager Window on T 37 Materializing Join Subexpressions Fullymaterialized join subexpression Window on R ⋈ ⋈ Window on S stanfordstreamdatamanager Window on T 38 Tree Joins: Trees of Binary Joins Fully-materialized join subexpression Window on S ⋈ S Window on R ⋈ R stanfordstreamdatamanager Window on T T 39 Hard State Hinders Adaptivity W R ⋈ WT WS ⋈ WT ⋈ ⋈ Plan switch ⋈ R S T stanfordstreamdatamanager ⋈ S R T 40 Can We Get Best of Both Worlds? MJoin ⋈S ⋈T ⋈R ⋈T ⋈R ⋈S Tree Join WR ⋈ W T ⋈ ⋈ R S T Θ Recomputation stanfordstreamdatamanager R S T Θ Less adaptive Θ Higher memory use 41 MJoins + Caches ⋈T Bypass pipeline segment ⋈R WR ⋈ WT S tuple Cache Probe Window on R Window on S stanfordstreamdatamanager Window on T 42 MJoins + Caches (contd.) • Caches are soft state – Enables plan switching with almost no overhead – Flexible with respect to memory availability • Captures whole spectrum from MJoins to Tree Joins, and plans in between • Challenge: adaptive algorithm to choose join operator orders and caches in pipelines stanfordstreamdatamanager 43 Adaptive Caching (A-Caching) • Adaptive join ordering with A-Greedy or variant – Join operator orders candidate caches • Adaptive selection from candidate caches – Based on profiled costs and benefits of caches • Adaptive memory allocation to chosen caches • Problems are individually NP-Hard – Efficient approximation algorithms scalable • Details in [BMW+05]: “Adaptive Caching for Continuous Queries”, ICDE 2005 stanfordstreamdatamanager 44 A-Caching (caching part only) List of candidate caches Estimated Profiler: statistics Re-optimizer: Ensures that Estimates costs and benefits maximum-benefit subset of candidate caches of candidate caches is used Combined in part for efficiency Add/remove caches Executor: MJoins with caches stanfordstreamdatamanager 45 Avg. processing rate (tuples/sec) Performance of Stream-Join Plans (1) 450000 ⋈ 400000 350000 300000 ⋈ 250000 U 200000 150000 ⋈ 100000 T 50000 0 MJoin TreeJoin A-Caching R S Arrival rates of streams are in the ratio 1:1:1:10, other details of input are given in [BMW+05] stanfordstreamdatamanager 46 Avg. processing rate (tuples/sec) Performance of Stream-Join Plans (2) 250000 200000 150000 100000 50000 0 MJoin TreeJoin A-Caching Arrival rates of streams are in the ratio 15:10:5:1, other details of input are given in [BMW+05] stanfordstreamdatamanager 47 Remainder of Talk • Current and future research directions • Related work stanfordstreamdatamanager 48 Research Directions Applications P2P Federation Streams PubSub ApplicationsSensors IR Privacy Data Mining Data Management System • System complexity is growing beyond control • Up to 80% of IT budgets spent on maintenance [McKinsey] – “People cost” dominates • Data Management Systems Data Formats & Access must be self-managing Records Web pages Partitions DataXML Formats Access Views Stored&procedures Web services Indexes Text OS/Hardware Windows SAN LinuxOS/Hardware NAS RAID PDA Local disks Remote DB stanfordstreamdatamanager – Autonomic Systems – Self-configuration, self-optimization, self-healing, and self-protection • Adaptive processing is key 49 New Techniques for Autonomic Systems & Adaptive Processing • Bringing more components under adaptive management – Ex: Parallelism, overload management, memory allocation, sharing data & computation across queries • Being proactive as well as reactive – Rio prototype system for adaptive processing in conventional database systems [BBD04] – Considering uncertainty in statistics to choose robust query execution plans [BBD04] – Plan logging: A new overall approach to adaptive processing of continuous queries [BB05] stanfordstreamdatamanager 50 Future Work in DSMSs • Expanding the declarative interface – Event detection (e.g., regular expressions), data cubes, decision trees and other data mining models • Scaling in data arrival rates – Graphical Processing Units (GPUs) as stream coprocessors, Stanford’s streaming supercomputer • Many others – Disk usage based on query and stream properties – Handling missing or imprecise data in streams – Handling hybrid workloads of continuous & regular queries stanfordstreamdatamanager 51 Other Work • More on StreaMon – Using stream properties for resource optimization [TODS] – Theory of pipelined processing [ICDT 2005] – System demonstration [SIGMOD 2004] • Stanford’s STREAM DSMS – Overview papers, source code release, web demo • Other aspects of a DSMS – Operator scheduling [SIGMOD 2003, VLDB Journal] – Continuous query language & semantics [VLDB Journal] – Memory requirements for continuous queries [PODS 2002, TODS] stanfordstreamdatamanager 52 Other Work (contd.) • Adaptive query processing architectures – Taxonomy of past work, next steps [CIDR 2005] – Adaptive processing for conventional database systems [Rio system, Technical report] – Concurrent use of multiple plans for same query [Technical report] • Summer internship – Compressing relations using data mining [SIGMOD 2001] stanfordstreamdatamanager 53 Related Work (Brief) • Adaptive processing of continuous queries – E.g., Eddies [AH00], NiagaraCQ [CDT+00] • Adaptive processing in conventional databases – Inter-query adaptivity, e.g., Leo [SLM+01], [BC03] – Intra-query adaptivity, e.g., Re-Opt [KD98], POP [MRS+04], Tukwila [IFF+99] • New approaches to query optimization – E.g., parametric [GW89, INS+92, HS03], expectedcost based [CHS99, CHG02], error-aware [VN03] • Other DSMSs – E.g., Aurora, Gigascope, Nile, TelegraphCQ stanfordstreamdatamanager 54 Summary • New trends demand adaptive query processing – New applications, e.g., continuous queries, data streams – Increasing data sizes, query complexity, overall system complexity • CS-wide push towards Autonomic Systems • Our goal: Adaptive Data Management Systems – StreaMon: Adaptive Data Stream Engine – Rio: Adaptive Processing in Conventional Databases • Google keywords: “shivnath”, “stanford stream” stanfordstreamdatamanager 55