Telegraph - Berkeley Database Research

Telegraph: Ideas & Status
Overview
• Folks
– Amol Deshpande, Mohan Lakhamraju, VijayShankar Raman
– Rob von Behren, Steve Gribble, Matt Welsh
– Kris Hildrum
– Hellerstein, Franklin, Brewer, Papadimitriou (ITR team)
• Roots
– “Regres” think-tank
• Carey, Hellerstein, Stonebraker, 1998-99
– CONTROL project (Online Aggregation, etc.)
• UC Berkeley ‘96-present
– Query Scrambling
• Franklin & Urhan, UMD
– Inktomi experiences
– Jaguar (Welsh & Culler)
Telegraph Goals
• Query all the data in the world
– ITR: internet sources and services
– Endeavour: sensors
– Also shared-nothing DBMS done better
• Unify and redesign storage engines
– DBMS, HTTP server, cluster-based FS
– Reject multi-threading in favor of event-flow state
machines
• storage manager a “query plan” over events
– Cluster-centric recovery scheme
Today
• Status on storage manager
– Event flow and state machines
– Simplified transactional API
– Experiences with Jaguar
– Status
• Continuously adaptive dataflow
– Eddies & Rivers
– Applications to event flow: storage mgr is a
dataflow plan too
• Open Questions
State Machines
• Web servers/proxies, cache consistency HW use FSMs
– Order 100x-1000x more concurrent clients than threads
allow
– One thread per concurrent HW activity
– FSMs for multiplexing threads on connections
• Thesis: apply query plan technology to state machines
– We understand data flow
– Optimization = composition of FSMs
– MS Research “Pipeline Server”
• State machine gives better cache locality (old-fashioned DB
batching of I/O on chip!)
• A theme in the TinyOS research too
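A minimal sketch of the FSM-multiplexing idea above, assuming a hypothetical three-stage connection lifecycle (the state names and the transition table are illustrative, not taken from the Telegraph code). One loop advances many connections through a shared event queue instead of dedicating a thread to each client:

```python
from collections import deque
from enum import Enum, auto


class State(Enum):
    READ_REQUEST = auto()
    FETCH = auto()
    WRITE_REPLY = auto()
    DONE = auto()


# Hypothetical transition table: each event advances a connection one stage.
NEXT = {
    State.READ_REQUEST: State.FETCH,
    State.FETCH: State.WRITE_REPLY,
    State.WRITE_REPLY: State.DONE,
}


def run(num_connections):
    """Drive every connection to DONE from a single loop (no thread per client)."""
    states = {conn: State.READ_REQUEST for conn in range(num_connections)}
    events = deque(range(num_connections))  # one pending event per connection
    steps = 0
    while events:
        conn = events.popleft()
        states[conn] = NEXT[states[conn]]
        steps += 1
        if states[conn] is not State.DONE:
            events.append(conn)  # re-queue: this connection has more work
    return states, steps


states, steps = run(1000)
```

Each connection costs one dictionary entry rather than a thread stack, which is where the claimed gain in concurrent clients would come from. A real server would take its events from I/O readiness notifications rather than from a pre-filled queue.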
Gribble I/O Core (v6!!)
[Figure: architecture diagram. Components shown: ‘stub’, SN_Hashtable (fsm), HT “work” queue, thread pool, buffer cache (fsm) with r/w paths, lock mngr (fsm); components are separated by thread boundaries.]
Mohan API for Xact Recovery
[Figure: API diagram. Components shown: Application, Trans Manager, Lock Manager, Recovery Manager, Segment Manager, Buffer Manager. Labeled calls: Begin, Commit, Abort; Lock, Unlock, Deadlock Detect; Read, Update, Scan; Pin, Unpin, Flush; Readaction, Updateaction, Recoveryaction, Commit/Abort-action.]
Jaguar
• Two Basic Features
– Rather than JNI, map certain bytecodes to inlined assembly
code
• Do this judiciously, maintaining type safety!
– Pre-Serialized Objects (PSOs)
• Can lay down a Java object “container” over an arbitrary VM
range outside Java’s heap.
• With these, you have everything you need
– Inlining and PSOs allow direct user-level access to network
buffers, disk device drivers, etc.
– PSOs allow buffer pool to be pre-allocated, and tuples in the
pool to be “pointed at”
• Matt Welsh
Storage Manager Status
• Working!
– Transactions and recovery too
– Gribble’s hashtable indexes currently don’t talk to Mohan’s
stuff
– Complete version and numbers for VLDB 2000, mid-February
• Lessons
– Debugger support for state machine development needed
– Thinking about where to multiplex and queue in a state
machine is NOT EASY (but we’re learning)
– Jaguar isn’t quite there yet
• e.g. GC control
• But we’re getting there
• Need to keep Welsh and Culler aboard
Query Processing Challenges
• The world is a messy place
– performance varies widely over time
• River lessons on NOW (NowSort experience)
• Internet
• MEMS for sure!
– performance metadata usually unavailable or wrong
• no “runstats” on the web
• Users are unpredictable
– want to get early answers, “control” queries as they run
• Plus Mariposa/Millennium-esque issues
– local autonomy, costs for access, etc.
ITR Example Scenario
• “What do the French think about farm
subsidies?”
– How would you do this on the web today?
• Translate query into French via BabelFish
• Find a French search engine, restrict domains to .fr
• Fetch matches and translate back to English via
BabelFish
• Feed to a text summarizer like NetSumm
Behavior Along the Way
• Speed changes
– Site that was fast suddenly slows down
• Behavior changes
– Site that was returning few answers starts
returning lots (“selectivity”)
• Failures
– Site won’t respond. Choose an alternate server.
• Ordering affects answers
– summarize then translate? Or vice versa?
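The failure case above ("Site won’t respond. Choose an alternate server.") can be sketched as follows. This is an illustration only: `fetch_with_failover`, the site names, and `fake_fetch` are hypothetical, and the sketch fails over between requests rather than in the middle of a running query.

```python
def fetch_with_failover(sites, fetch):
    """Try each site in turn; return (site, result) from the first that responds."""
    errors = {}
    for site in sites:
        try:
            return site, fetch(site)
        except OSError as exc:  # TimeoutError and ConnectionError subclass OSError
            errors[site] = exc
    raise RuntimeError(f"all sites failed: {sorted(errors)}")


def fake_fetch(site):
    """Stand-in for a network call: the primary site never answers."""
    if site == "primary.example":
        raise TimeoutError("no response")
    return f"results from {site}"


site, result = fetch_with_failover(["primary.example", "mirror.example"], fake_fetch)
```

A fixed try-in-order list is the simplest possible policy. Ranking the alternates by speed observed while the query runs would be closer to the adaptive behavior the slides call for.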
Standard Query Engine Won’t Cut It
• Can’t adapt while running
– need a “continuous” query optimizer
– need to handle midstream failover
• Reload, alternate sites
• Uses the wrong QP algorithms
– Can’t produce incremental results
• need CONTROL-based dataflow algorithms
• Can’t understand cost/quality tradeoffs
– maybe I’d settle for something cheesier if it went
faster -- e.g. use an English search engine in the
US
QP Framework: Eddies
• Need an adaptive query processor
– respond to changes mid-stream
• Eddy
– a pipelining object router
• works well with ops that have
frequent moments of symmetry
– adjusts flow adaptively
• objects flow in different orders
• visit each op once before output
– simple policy for routing
• never give out a new object if there’s a used one
Avnur & Hellerstein
SIGMOD 2000
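A simplified, single-threaded sketch of the routing rules on this slide, assuming every operator is a filter. The function `eddy`, the `choose` callback, and the done-set bookkeeping are illustrative names, not the API from the paper. Each tuple carries the set of operators it has visited, is emitted only after visiting all of them, and used tuples are always served before new ones:

```python
from collections import deque


def eddy(source, ops, choose):
    """Route each tuple through every op exactly once, in a per-tuple order.

    ops: dict mapping operator name -> predicate (a filter).
    choose: picks one name from the list of operators not yet visited.
    """
    used = deque()  # tuples that have already visited at least one op
    fresh = iter(source)
    out = []
    while True:
        if used:  # policy: never hand out a new tuple if a used one is waiting
            tup, done = used.popleft()
        else:
            try:
                tup, done = next(fresh), frozenset()
            except StopIteration:
                break
        pending = [name for name in ops if name not in done]
        name = choose(pending)
        if not ops[name](tup):
            continue  # the filter dropped the tuple
        done = done | {name}
        if len(done) == len(ops):
            out.append(tup)  # visited every op once: send to output
        else:
            used.append((tup, done))
    return out


ops = {"even": lambda x: x % 2 == 0, "small": lambda x: x < 10}
result = eddy(range(20), ops, choose=lambda pending: pending[0])
```

Because the filters commute, any `choose` yields the same answer set. Only the amount of work changes, and that is the quantity an adaptive routing policy tries to minimize.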
Simple Eddies Learn Input Rates
• Two single-table, unchanging filters
– one fast, one slow
– both have same probability of output (selectivity)
• most tuples visit the fast op first
– policy + finite queues result in back pressure
– slow op almost always finds a used tuple from fast
op
– fast op rarely finds a used tuple
Simple Eddies & Output Rate
• Again, two single-table static filters
– one low probability of output, one high
– equal costs
• Back-pressure slightly worse than random
– low-probability should be favored
– but it is more likely to find used tuples
An Aside: n-Arm Bandits
• A little machine learning problem:
– Each arm pays off differently
– Explore? Or Exploit?
• Sometimes want to randomly choose an
arm
• Usually want to go with the best
• If probabilities are static, dampen
exploration over time
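One common way to "dampen exploration over time" is an epsilon-greedy rule with a decaying epsilon. The sketch below is an assumption about how this could look, not something specified in the slides: the `1/sqrt(t)` schedule, the initial round-robin pass, and the fixed payouts in the usage lines are all illustrative choices.

```python
import math
import random


def epsilon_greedy(pull, n_arms, n_pulls, rng):
    """Play n_pulls rounds; return per-arm pull counts and mean rewards."""
    counts = [0] * n_arms
    means = [0.0] * n_arms
    for t in range(1, n_pulls + 1):
        if t <= n_arms:
            arm = t - 1  # try every arm once before trusting the estimates
        elif rng.random() < 1.0 / math.sqrt(t):
            arm = rng.randrange(n_arms)  # explore: pick a random arm
        else:
            arm = max(range(n_arms), key=means.__getitem__)  # exploit the best
        reward = pull(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # running average
    return counts, means


# Fixed payouts keep the example deterministic; real arms would be noisy.
payouts = [0.1, 0.9]
counts, means = epsilon_greedy(lambda arm: payouts[arm], 2, 500, random.Random(0))
```

The exploration probability shrinks as `t` grows, so later pulls go mostly to the arm with the best observed mean. If payoffs can drift, as they do for operators in a running query, the schedule would need a floor so that exploration never stops entirely.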
Learning Eddies
• Tuple routing is basically a bandit problem
– which operator should I choose next?
– Complicated by back pressure
• Bandit problems + queueing theory
• Lottery Scheduling implementation
– Each operator starts with k tickets
– When multiple operators request a tuple, hold a “lottery”;
holder of winning ticket gets it
– When an operator takes a tuple, it earns a ticket
– When an operator produces a tuple, it is charged a ticket
• Works well in practice for some things
– Problems with delayed sources & joins
– Kris Hildrum studying formal proofs of convergence
• Ticket policy needs work. Mechanism looks robust.
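The ticket rules listed above can be sketched as follows. This is a hedged illustration: `LotteryRouter`, `route`, the one-ticket floor on lottery weights, and the two example filters are assumptions, and the sketch processes one tuple at a time, so it leaves out queues and back pressure entirely.

```python
import random


class LotteryRouter:
    def __init__(self, op_names, k, rng):
        self.tickets = {name: k for name in op_names}  # each op starts with k
        self.rng = rng

    def pick(self, requesters):
        """Hold a lottery among the requesting operators, weighted by tickets."""
        # Assumption: floor each weight at 1 so every operator can still win.
        weights = [max(self.tickets[name], 1) for name in requesters]
        return self.rng.choices(requesters, weights=weights, k=1)[0]

    def took(self, name):
        self.tickets[name] += 1  # earns a ticket for taking a tuple

    def produced(self, name):
        self.tickets[name] -= 1  # charged a ticket for producing a tuple


def route(tuples, filters, router):
    """Send each tuple through all filters, ordering them by lottery."""
    out = []
    for tup in tuples:
        pending = list(filters)
        alive = True
        while pending and alive:
            name = router.pick(pending)
            pending.remove(name)
            router.took(name)
            alive = filters[name](tup)
            if alive:
                router.produced(name)
        if alive:
            out.append(tup)
    return out


filters = {"rare": lambda x: x % 10 == 0, "common": lambda x: x % 10 != 9}
router = LotteryRouter(filters, k=10, rng=random.Random(0))
out = route(range(1000), filters, router)
```

The filter that passes few tuples keeps most of the tickets it earns, so it wins more lotteries and tends to see tuples first, which is the ordering a static optimizer would also prefer when costs are equal.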
Open Eddy Questions
• Eddy addresses the operator ordering problem
• Remaining problems:
– operator choice (hash join or index join?)
– source choice, overlap, failover: Ninja?
– delayed sources
– short jobs
– resource mgmt (memory allocation)
– distributed work and parallelism
• Sensor (i.e. sequence) operations
– What changes when data-ordering matters?
– What are the ops for sensors?
• Streaming media?
– Objects not discretely differentiated??
Putting it together
• Current eddy/river in C
– Prototypes in Java, but not state machines
– Probably do a rewrite in state machine format
• Thesis: every piece of the system is a “query
plan”
– Apply eddies to event routing in the storage
manager?
– To network protocol?
Cross-pollination
• Telegraph QP and Ninja “Paths”
• DB, IStore, and OceanStore students looking at adaptive
storage location
– OceanStore orthogonal to Telegraph storage manager? But
let’s combine!
– DB and IStore efforts apply to clusters
• MEMS and sensors
– As soon as eddy/river rewrite done, we need to look at
sensor apps and ops
• TinyOS
– Good state machine lessons at the boundary
– Data flow between the devices??
• Negotiation
– Eddies and pricing fit into this! I.e. we have the
infrastructure for dynamic pricing and re-routing on the way.