Telegraph: Ideas & Status

Overview
• Folks
  – Amol Deshpande, Mohan Lakhamraju, VijayShankar Raman
  – Rob von Behren, Steve Gribble, Matt Welsh
  – Kris Hildrum
  – Hellerstein, Franklin, Brewer, Papadimitriou (ITR team)
• Roots
  – “Regres” think-tank
    • Carey, Hellerstein, Stonebraker, 1998-99
  – CONTROL project (Online Aggregation, etc.)
    • UC Berkeley ‘96-present
  – Query Scrambling
    • Franklin & Urhan, UMD
  – Inktomi experiences
  – Jaguar (Welsh & Culler)

Telegraph Goals
• Query all the data in the world
  – ITR: internet sources and services
  – Endeavour: sensors
  – Also shared-nothing DBMS done better
• Unify and redesign storage engines
  – DBMS, HTTP server, cluster-based FS
  – Reject multi-threading in favor of event-flow state machines
    • storage manager a “query plan” over events
  – Cluster-centric recovery scheme

Today
• Status on storage manager
  – Event flow and state machines
  – Simplified transactional API
  – Experiences with Jaguar
  – Status
• Continuously adaptive dataflow
  – Eddies & Rivers
  – Applications to event flow: storage mgr is a dataflow plan too
• Open Questions

State Machines
• Web servers/proxies, cache consistency HW use FSMs
  – Order 100x-1000x more concurrent clients than threads allow
  – One thread per concurrent HW activity
  – FSMs for multiplexing threads on connections
• Thesis: apply query plan technology to state machines
  – We understand data flow
  – Optimization = composition of FSMs
  – MS Research “Pipeline Server”
• State machine gives better cache locality (old-fashioned DB batching of I/O on chip!)
• A theme in the TinyOS research too

[Diagram labels: ‘stub’; SN_ Hashtable (fsm); ‘stub’; Gribble; I/O Core (v6!!)]
[Diagram labels: SN_ Hashtable (fsm); thread boundary; HT “work” queue; thread pool; buffer cache (fsm); r/w; lock mngr (fsm); thread boundary; Mohan]

API for Xact Recovery
[Diagram. Components: Application, Trans Manager, Lock Manager, Recovery Manager, Segment Manager, Buffer Manager. Call labels: Lock, Unlock, Deadlock Detect, Recoveryaction, Commit/Abort-action, Begin, Commit, Abort, Readaction, Updateaction, Read, Update, Scan, Pin, Unpin, Flush]

Jaguar
• Two Basic Features
  – Rather than JNI, map certain bytecodes to inlined assembly code
    • Do this judiciously, maintaining type safety!
  – Pre-Serialized Objects (PSOs)
    • Can lay down a Java object “container” over an arbitrary VM range outside Java’s heap.
• With these, you have everything you need
  – Inlining and PSOs allow direct user-level access to network buffers, disk device drivers, etc.
  – PSOs allow buffer pool to be pre-allocated, and tuples in the pool to be “pointed at”
• Matt Welsh

Storage Manager Status
• Working!
  – Transactions and recovery too
  – Gribble’s hashtable indexes currently don’t talk to Mohan’s stuff
  – Complete version and numbers for VLDB 2000, mid-February
• Lessons
  – Debugger support for state machine development needed
  – Thinking about where to multiplex and queue in a state machine is NOT EASY (but we’re learning)
  – Jaguar isn’t quite there yet
    • e.g. GC control
    • But we’re getting there
    • Need to keep Welsh and Culler aboard

Query Processing Challenges
• The world is a messy place
  – performance varies widely over time
    • River lessons on NOW (NowSort experience)
    • Internet
    • MEMS for sure!
  – performance metadata usually unavailable or wrong
    • no “runstats” on the web
• Users are unpredictable
  – want to get early answers, “control” queries as they run
• Plus Mariposa/Millennium-esque issues
  – local autonomy, costs for access, etc.

ITR Example Scenario
• “What do the French think about farm subsidies?”
  – How would you do this on the web today?
    • Translate query into French via BabelFish
    • Find a French search engine, restrict domains to .fr
    • Fetch matches and translate back to English via BabelFish
    • Feed to a text summarizer like NetSumm

Behavior Along the Way
• Speed changes
  – Site that was fast suddenly slows down
• Behavior changes
  – Site that was returning few answers starts returning lots (“selectivity”)
• Failures
  – Site won’t respond. Choose an alternate server.
• Ordering affects answers
  – summarize then translate? Or vice versa?

Standard Query Engine Won’t Cut It
• Can’t adapt while running
  – need a “continuous” query optimizer
  – need to handle midstream failover
    • Reload, alternate sites
• Uses the wrong QP algorithms
  – Can’t produce incremental results
    • need CONTROL-based dataflow algorithms
• Can’t understand cost/quality tradeoffs
  – maybe I’d settle for something cheesier if it went faster -- e.g. use an English search engine in the US

QP Framework: Eddies
• Need an adaptive query processor
  – respond to changes mid-stream
• Eddy
  – a pipelining object router
    • works well with ops that have frequent moments of symmetry
  – adjusts flow adaptively
    • objects flow in different orders
    • visit each op once before output
  – simple policy for routing
    • never give out a new object if there’s a used one
Avnur & Hellerstein SIGMOD 2000

Simple Eddies Learn Input Rates
• Two single-table, unchanging filters
  – one fast, one slow
  – both have same probability of output (selectivity)
• most tuples visit the fast op first
  – policy + finite queues result in back pressure
  – slow op almost always finds a used tuple from fast op
  – fast op rarely finds a used tuple

Simple Eddies & Output Rate
• Again, two single-table static filters
  – one low probability of output, one high
  – equal costs
• Back-pressure slightly worse than random
  – low-probability should be favored
  – but it is more likely to find used tuples

An Aside: n-Arm Bandits
• A little machine learning problem:
  – Each arm pays off differently
  – Explore?
    Or Exploit?
    • Sometimes want to randomly choose an arm
    • Usually want to go with the best
    • If probabilities are static, dampen exploration over time

Learning Eddies
• Tuple routing is basically a bandit problem
  – which operator should I choose next?
  – Complicated by back pressure
    • Bandit problems + queueing theory
• Lottery Scheduling implementation
  – Each operator starts with k tickets
  – When multiple operators request a tuple, hold a “lottery”; holder of winning ticket gets it
  – When an operator takes a tuple, it earns a ticket
  – When an operator produces a tuple, it is charged a ticket
• Works well in practice for some things
  – Problems with delayed sources & joins
  – Kris Hildrum studying formal proofs of convergence
• Ticket policy needs work. Mechanism looks robust.

Open Eddy Questions
• Eddy addresses the operator ordering problem
• Remaining problems:
  – operator choice (hash join or index join?)
  – source choice, overlap, failover: Ninja?
  – delayed sources
  – short jobs
  – resource mgmt (memory allocation)
  – distributed work and parallelism
• Sensor (i.e. sequence) operations
  – What changes when data-ordering matters?
  – What are the ops for sensors?
• Streaming media?
  – Objects not discretely differentiated??

Putting it together
• Current eddy/river in C
  – Prototypes in Java, but not state machines
  – Probably do a rewrite in state machine format
• Thesis: every piece of the system is a “query plan”
  – Apply eddies to event routing in the storage manager?
  – To network protocol?

Cross-pollination
• Telegraph QP and Ninja “Paths”
• DB, IStore, and OceanStore students looking at adaptive storage location
  – OceanStore orthogonal to Telegraph storage manager? But let’s combine!
  – DB and IStore efforts apply to clusters
• MEMS and sensors
  – As soon as eddy/river rewrite done, we need to look at sensor apps and ops
• TinyOS
  – Good state machine lessons at the boundary
  – Data flow between the devices??
• Negotiation
  – Eddies and pricing fit into this! I.e.
we have the infrastructure for dynamic pricing and re-routing on the way.
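
Sketch: lottery routing from the “Learning Eddies” slide

The ticket rules on the “Learning Eddies” slide can be tried out in a few lines of Python. This is only a minimal sketch under stated assumptions, not the Telegraph implementation: it handles single-table filters only, processes one tuple at a time, and models no queues, back pressure, or operator costs. The names `Operator` and `run_eddy` and the two example filters are hypothetical, chosen for illustration.

```python
import random


class Operator:
    """A filter operator holding lottery tickets (names are illustrative)."""

    def __init__(self, name, predicate, tickets=1):
        self.name = name
        self.predicate = predicate  # returns True if the tuple passes the filter
        self.tickets = tickets      # the slide's "k" starting tickets


def run_eddy(tuples, operators, rng):
    """Route each tuple through every operator once, in lottery-chosen order."""
    output = []
    for t in tuples:
        pending = list(operators)  # operators this tuple has not visited yet
        passed = True
        while pending and passed:
            # Hold a lottery among eligible operators, weighted by tickets.
            weights = [op.tickets for op in pending]
            winner = rng.choices(pending, weights=weights, k=1)[0]
            pending.remove(winner)
            winner.tickets += 1      # taking a tuple earns a ticket
            if winner.predicate(t):
                winner.tickets -= 1  # producing (returning) it costs a ticket
            else:
                passed = False       # tuple dropped; the ticket is kept
        if passed:
            output.append(t)
    return output


# Assumed example workload: two static filters with different selectivities.
rng = random.Random(0)
selective = Operator("selective", lambda t: t % 10 == 0)    # passes 10% of tuples
permissive = Operator("permissive", lambda t: t % 10 != 5)  # passes 90% of tuples
result = run_eddy(range(1000), [permissive, selective], rng)
print(len(result), selective.tickets, permissive.tickets)
```

Each dropped tuple leaves its operator one ticket richer, so in this sketch the more selective filter accumulates tickets and wins more lotteries, which sends later tuples to it first. The set of output tuples does not depend on the routing order; only the amount of work done does.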