Cosmic Convergence
Joe Hellerstein
UC Berkeley
How I got started on this
CONTROL project
Tie-ins to Networking Research
Telegraph & ongoing adaptive dataflow research
New arenas:
Sensor networks
P2P networks
Online/Interactive query processing
Online aggregation
Scalable spreadsheets & refining visualizations
Online data cleaning (Potter’s Wheel)
Pipelining operators (ripple joins, online reordering) over streaming samples
Performance metric:
Statistical (e.g. conf. intervals)
User-driven (e.g. weighted by widgets)
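A minimal sketch of the statistical metric above: a running average over a randomly ordered stream, reported with a confidence-interval half-width that shrinks as more samples arrive. The function name `online_average`, the fixed z-value of 1.96 (roughly 95% under a normal approximation), and the synthetic data are assumptions made for illustration, not details from the CONTROL project.

```python
import math
import random

def online_average(stream, z=1.96):
    """Yield (n, running_mean, half_width) after each sampled value."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations (Welford's method)
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n > 1:
            variance = m2 / (n - 1)               # sample variance so far
            half_width = z * math.sqrt(variance / n)
        else:
            half_width = float("inf")             # no interval from one sample
        yield n, mean, half_width

rng = random.Random(0)
table = [rng.uniform(0, 100) for _ in range(10_000)]
rng.shuffle(table)  # random order stands in for a streaming sample
estimates = list(online_average(table))
```

A user watching `estimates` can stop the query as soon as the half-width is small enough, which is the sense in which time to completion matters less than the quality of the running answer.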
New “greedy” performance regime
Maximize 1st derivative of the “mirth index”
Mirth defined on-the-fly
Therefore need FEEDBACK and CONTROL
Goals and data may change over time
User feedback, sample variance
Goals and data may be different in different
Group-by, scrollbar position
[An aside: dependencies in selectivity estimation]
Q: Query optimization in this world?
Or in any pipelining, volatile environment??
Where else do we see volatility?
A little more state per tuple
Ready/done bits (extensible a la
Query processing = dataflow routing!!
We'll come back to this!
Break the set-oriented boundary
Usual DB model: algebra expressions: (R ⋈ S) ⋈ T
Usual DB implementation: pipelining operators!
Subexpressions never materialized
Typical implementation is more flexible than algebra
We can reorder in-flight operators
Other gains possible by breaking the set-oriented boundary…
Don’t rewrite graph. Impose a router
Graph edge = absence of routing constraint
Observe operator consumption/production rates
Consumption: cost
Production: cost*selectivity
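The routing idea above can be sketched with a simplified lottery scheme: an operator gains a ticket when it consumes a tuple and gives one back when it produces (returns) a tuple, so operators that drop many tuples accumulate tickets and tend to be chosen first. The class and function names, the ticket floor of 1, and the two toy predicates are assumptions for illustration; this is not the Telegraph implementation.

```python
import random

class Operator:
    def __init__(self, name, predicate):
        self.name = name
        self.predicate = predicate
        self.tickets = 1  # floor of one ticket so every operator can still win

def eddy(tuples, operators, rng):
    """Route each tuple through all operators in an adaptively chosen order."""
    output = []
    for t in tuples:
        done = set()   # names of operators this tuple has already passed
        alive = True
        while alive and len(done) < len(operators):
            ready = [op for op in operators if op.name not in done]
            weights = [op.tickets for op in ready]
            op = rng.choices(ready, weights=weights, k=1)[0]
            op.tickets += 1                          # consumed a tuple: credit
            if op.predicate(t):
                op.tickets = max(1, op.tickets - 1)  # produced a tuple: debit
                done.add(op.name)
            else:
                alive = False                        # tuple dropped here
        if alive:
            output.append(t)
    return output

rng = random.Random(42)
selective = Operator("selective", lambda x: x % 10 == 0)
permissive = Operator("permissive", lambda x: x >= 0)
result = eddy(range(1000), [permissive, selective], rng)
```

The result is the same whatever order the router picks; only the work done changes. Here `selective` ends with far more tickets than `permissive`, so later tuples are most likely to meet the selective filter first.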
How I got started on this
CONTROL project
Tie-ins to Networking Research
Telegraph & ongoing adaptive dataflow research
New arenas:
Sensor networks
P2P networks
CLICK: a NW router is a query plan!
“The Click Modular Router”, Robert Morris, Eddie Kohler,
John Jannotti, and M. Frans Kaashoek, SOSP ‘99
Paths the key to comm-centric OS
“Making Paths Explicit in the Scout Operating System”,
David Mosberger and Larry L. Peterson. OSDI ‘96.
Figure 3: Example Router Graph
Merge OS & DBMS grad class, over a year
Eric/Joe, point/counterpoint
Some tie-ins were obvious:
memory mgmt, storage, scheduling, concurrency
Surprising: QP and networks go well side by side
E.g. eddies and TCP Congestion Control
Both use back-pressure and simple Control Theory to
“learn” in an unpredictable dataflow environment
Eddies close to the n-armed bandit problem
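As a rough illustration of the "simple Control Theory" on the TCP side, the sketch below shows additive-increase/multiplicative-decrease (AIMD), the window rule used in TCP congestion avoidance. The slide does not say which part of congestion control is meant, so picking AIMD is an assumption, as are the function name and parameter values.

```python
def aimd(loss_events, cwnd=1.0, increase=1.0, decrease=0.5):
    """Return the congestion window after each round trip.

    loss_events: iterable of booleans, True when a loss was detected
    in that round trip (the back-pressure signal).
    """
    trace = []
    for lost in loss_events:
        if lost:
            cwnd = max(1.0, cwnd * decrease)  # back off multiplicatively
        else:
            cwnd += increase                  # probe for bandwidth additively
        trace.append(cwnd)
    return trace

window = aimd([False, False, False, True, False])
# window == [2.0, 3.0, 4.0, 2.0, 3.0]
```

The parallel with eddies is that both controllers act only on locally observed feedback (losses here, consumption and production rates there) and never need a model of the rest of the system.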
Core function of protocols: data xfer
Data Manipulation (buffer, checksum, encryption, xfer to/fr app space, presentation)
Transfer Control (flow/congestion ctl, detecting xmission probs, acks, muxing, timestamps, framing)
-Clark & Tennenhouse, “Architectural Considerations for a
New Generation of Protocols”, SIGCOMM ‘90
Basic Internet assumption:
“a network of unknown topology and with an unknown, unknowable and constantly changing population of competing conversations” (Van Jacobson)
Thesis: nets are good at xfer control, not so good at data manipulation
Some C&T wacky ideas for better data manipulation
Data Modeling!
Xfer semantic units, not packets (ALF)
Auto-rewrite layers to flatten them (ILP)
Minimize cross-layer ordering constraints
Control delivery in parallel via packet content
What if…
We had unbounded data producers and consumers
(“streams” … “continuous queries”)
We couldn’t know our producers’ behavior or contents??
(“federation” … “mediators”)
We couldn’t predict user behavior? (“control”)
We couldn’t predict behavior of components in the dataflow? (“networked services”)
We had partial failure as a given? (oops, have we ignored this?)
Yes … networking people have been here!
Remember Van Jacobson’s quote?
Data Models, Query Opt, DataScalability
Adaptive Query
Interactive QP
Content Addressable
Adaptivity, Federated Control, GeoScalability
How I got started on this
CONTROL project
Tie-ins to Networking Research
Telegraph & ongoing adaptive dataflow research
New arenas:
Sensor networks
P2P networks
Scenarios with:
Structured Content
Rich Queries
Long-running data analysis a la CONTROL
Continuous queries
Queries over Internet sources and services
Two emerging scenarios:
Sensor networks
P2P query processing
An adaptive dataflow system
Dataflow programming model
A la Volcano, CLICK: push and pull. “Fjords”, ICDE02
Extensible set of pipelining operators, including relational ops, grouped filters (e.g. XFilter)
SQL parser for convenience (looking at XQuery)
Adaptivity operators
+ Extensible rules for routing constraints, Competition
SteMs (state modules)
FLuX (Fault-tolerant Load-balancing eXchange)
Bounded and continuous:
Data sources
Goal: Further adaptivity through competition
Multiple mirrored sources
Handle rate changes, failures, parallelism
Multiple alternate operators
Join = Routing + State
SteM operator manages tradeoffs
State Module, unifies caches, rendezvous buffers, join state
Competitive sources/operators share building/probing SteMs
Join algorithm hybridization!
Vijayshankar Raman
[Figure labels: static dataflow, eddy, eddy + stems]
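"Join = Routing + State" can be sketched as follows: each input gets its own state module, and a join is just the routing rule "build into your own SteM, then probe the other one". This reproduces a symmetric hash join. The `SteM` class below, its `build`/`probe` method names, and the dictionary-based tuples are assumptions for illustration, not the Telegraph API.

```python
from collections import defaultdict

class SteM:
    """State module: a keyed store supporting build (insert) and probe (lookup)."""
    def __init__(self, key):
        self.key = key
        self.index = defaultdict(list)

    def build(self, t):
        self.index[t[self.key]].append(t)

    def probe(self, value):
        return list(self.index.get(value, []))

def route_join(stream, key):
    """stream yields (source, tuple) pairs with source in {"R", "S"}."""
    stems = {"R": SteM(key), "S": SteM(key)}
    other = {"R": "S", "S": "R"}
    for source, t in stream:
        stems[source].build(t)                       # state
        for match in stems[other[source]].probe(t[key]):  # routing
            r, s = (t, match) if source == "R" else (match, t)
            yield r, s

stream = [
    ("R", {"k": 1, "id": "r1"}),
    ("S", {"k": 1, "id": "s1"}),
    ("S", {"k": 2, "id": "s2"}),
    ("R", {"k": 2, "id": "r2"}),
    ("R", {"k": 1, "id": "r3"}),
]
pairs = sorted((r["id"], s["id"]) for r, s in route_join(stream, "k"))
```

Because the state lives in the SteMs rather than inside a join operator, a router is free to change the build and probe order per tuple, which is what makes hybrid join algorithms possible.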
Fault Tolerance, Load Balancing
Continuous/long-running flows need high availability
Big flows need parallelism
Adaptive Load-Balancing req’d
FLuX operator: Exchange plus…
Adaptive flow partitioning (River)
Transient state replication & migration
RAID for SteMs
Needs to be extensible to different ops:
Dataflow semantics
Optimize based on edge semantics
Networking tie-in again:
• At-least-once delivery?
• Exactly-once delivery?
• In/Out of order?
Migration policy: the ski rental analogy
Mehul Shah
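The ski rental analogy can be sketched with the classic break-even rule: keep "renting" (tolerating the cost of an imbalanced partition) until the rent already paid equals the one-time price of "buying" (migrating state), then migrate. The slide does not spell out how the analogy maps onto FLuX, so the cost model, the function name, and the parameters here are assumptions.

```python
def ski_rental_decision(imbalance_costs, migration_cost):
    """Return the step index at which to migrate, or None if never.

    imbalance_costs: per-step cost of staying put ("rent").
    migration_cost: one-time cost of moving the state ("buy").
    """
    paid = 0.0
    for step, cost in enumerate(imbalance_costs):
        if paid >= migration_cost:
            return step      # rent paid so far has reached the purchase price
        paid += cost
    return None              # imbalance ended before migration paid off

long_skew = ski_rental_decision([1.0] * 10, migration_cost=4.0)
short_skew = ski_rental_decision([1.0] * 3, migration_cost=4.0)
```

With unit costs, this rule never pays more than about twice what a policy with perfect foresight would pay, which is why it is a reasonable default when the duration of a load imbalance is unknown.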
Continuous Queries clearly need all this stuff! Address adaptivity 1st.
4 Ideas in CACQ:
Use eddies to allow reordering of ops.
But one eddy will serve for all queries
Explicit tuple lineage
Mark each tuple with per-op ready/done bits
Mark each tuple with per-query completed bits
Queries are data: join with Grouped Filter
Much like XFilter, but for relational queries
Joins via SteMs, shared across all queries
Note: mixed-lineage tuples in a SteM. I.e. shared state is not shared algebraic expressions!
Delete a tuple from flow only if it matches no query
Next: F.T. CACQ via FLuXen
Sam Madden, Mehul Shah, Vijayshankar Raman
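To illustrate "queries are data", the sketch below indexes the predicates of many queries in one grouped filter, so a single lookup per tuple returns every query the tuple satisfies, and a tuple is dropped only when that set is empty. Restricting predicates to the form `attr > constant`, using a Python set in place of per-query completed bits, and the class and function names are all simplifying assumptions; the real CACQ design is richer.

```python
import bisect

class GroupedFilter:
    """Index of 'attr > constant' predicates drawn from many queries."""
    def __init__(self):
        self.constants = []   # kept sorted
        self.query_ids = []   # query_ids[i] owns constants[i]

    def add_query(self, query_id, constant):
        i = bisect.bisect_left(self.constants, constant)
        self.constants.insert(i, constant)
        self.query_ids.insert(i, query_id)

    def matching(self, value):
        # value > constant holds exactly for the constants strictly below value
        i = bisect.bisect_left(self.constants, value)
        return set(self.query_ids[:i])

def route(tuples, grouped_filter, attr):
    """Yield (tuple, matching_query_ids); drop tuples that match no query."""
    for t in tuples:
        queries = grouped_filter.matching(t[attr])
        if queries:
            yield t, queries

gf = GroupedFilter()
gf.add_query("q1", 10)
gf.add_query("q2", 20)
gf.add_query("q3", 20)
readings = [{"temp": 15}, {"temp": 25}, {"temp": 5}, {"temp": 20}]
routed = list(route(readings, gf, "temp"))
```

Adding or removing a continuous query is then an insert or delete on the filter's index, with no change to the dataflow itself.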
How I got started on this
CONTROL project
Tie-ins to Networking Research
Telegraph & ongoing adaptive dataflow research
New arenas:
Sensor networks
P2P networks
“Smart Dust” + TinyOS
Thousands of “motes”
Expensive communication
Power constraints
Query workload:
Aggregation & approximation
Queries and Continuous Queries
Push the processing into the network
Deal with volatility & failure
CONTROL issues: data variance, user desires
Simple example:
Aggregation query
Joint work with Ramesh Govindan, Sam Madden,
Wei Hong and David Culler (Intel Berkeley Lab)
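A minimal sketch of pushing an aggregation query into the network: each mote combines its own reading with the partial results of its children and forwards a single (sum, count) pair to its parent, so the root can compute an average without any raw reading crossing more than one hop. The routing tree, the readings, and the function name are made up for illustration.

```python
def aggregate(node, children, readings):
    """Return the partial state (sum, count) for the subtree rooted at node."""
    total, count = readings[node], 1
    for child in children.get(node, []):
        child_sum, child_count = aggregate(child, children, readings)
        total += child_sum      # merge the child's partial state
        count += child_count
    return total, count

# Hypothetical routing tree: mote 0 is the root (base station).
children = {0: [1, 2], 1: [3, 4]}
readings = {0: 10, 1: 20, 2: 30, 3: 40, 4: 50}
total, count = aggregate(0, children, readings)
average = total / count
```

Each mote sends one small message per epoch regardless of subtree size, which matters when communication is the expensive resource. The sketch ignores lost messages and dead motes, which the volatility and failure points above would have to address.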
Starting point: P2P as grassroots phenomenon
Outrageous filesharing volume (1.8G files in October 2001)
No business case to date
Challenge: scale DDBMS QP ideas to P2P
Motivate why
Pick the right parts of DBMS research to focus on
Storage: no! QP: yes.
Make it work:
Scalability well beyond our usual target
Admin constraints
Unknown data distributions, load
Heterogeneous comm/processing
Partial failure
Joint work with Scott Shenker, Ion Stoica, Matt
Harren, Ryan Huebsch, Nick Lanham, Boon Thau Loo
Requires clever system design
The Exchange model: encapsulate in ops?
Interesting adaptive policy problems
E.g. eddy routing, flux migration
Control Theory, Machine Learning
Encompasses another CS goal?
“No-knobs”, “Autonomic”, etc.
New performance regimes
Decent performance in the common case
Mean/Variance more important than MAX
Interactive Metrics
Time to completion often unimportant/irrelevant
Set-valued thinking as albatross?
E.g. eddies vs. Kabra/DeWitt or Tukwila
E.g. SteMs vs. Materialized Views
E.g. CACQ vs. NiagaraCQ
Some clean theory here would be nice
Current routing correctness proofs are inelegant
Model/language of choice is not clear
SEQ? Relational? XQuery?
Extensible operators, edge semantics
[A whine about VLDB’s absurd “Specificity
Too early for technical conclusions
Of this I’m sure:
The CS262 experiment is a success
Our students are getting a bigger picture than before
I’m learning, finding new connections
May morph to OS/Nets, Nets/DB
Eventually rethink the systems software curriculum at the undergraduate level too
Nets folks are coming our way
Doing relevant work, eager to collaborate
DB community needs to branch out
Outbound: Better proselytizing in CS
Inbound: Need new ideas
Sabbatical is a good invention
Hasn’t even started, I’m already grateful!