Cosmic Convergence
Joe Hellerstein
UC Berkeley
How I got started on this
CONTROL project
Eddies
Tie-ins to Networking Research
Telegraph & ongoing adaptive dataflow research
New arenas:
Sensor networks
P2P networks
Online/Interactive query processing
Online aggregation
Scalable spreadsheets & refining visualizations
Online data cleaning (Potter’s Wheel)
Pipelining operators (ripple joins, online reordering) over streaming samples
CLOUDS
Performance metric:
Statistical (e.g. conf. intervals)
User-driven (e.g. weighted by widgets)
New “greedy” performance regime
Maximize 1st derivative of the “mirth index”
Mirth defined on-the-fly
Therefore need FEEDBACK and CONTROL
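A minimal sketch of the statistical metric above, assuming rows arrive in random order and that a CLT-based interval is acceptable. The function name `online_mean` and the `z=1.96` (roughly 95%) default are illustrative choices, not taken from the CONTROL code.

```python
import math
import random

def online_mean(samples, z=1.96):
    """Yield (n, running_mean, half_width) after every sample.

    half_width is a CLT-based confidence half-interval (assumption:
    samples are drawn in random order from the underlying table).
    """
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations (Welford's method)
    for x in samples:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n > 1:
            variance = m2 / (n - 1)
            half_width = z * math.sqrt(variance / n)
        else:
            half_width = float("inf")  # no interval from a single sample
        yield n, mean, half_width

random.seed(0)
data = [random.gauss(50.0, 10.0) for _ in range(10_000)]
estimates = list(online_mean(data))
```

The user can stop as soon as the interval is tight enough, which is the "greedy" regime: each new sample should improve the estimate on screen.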
[Figure: Online vs. Traditional processing over Time]
Goals and data may change over time
User feedback, sample variance
Goals and data may be different in different “regions”
Group-by, scrollbar position
[An aside: dependencies in selectivity estimation]
Q: Query optimization in this world?
Or in any pipelining, volatile environment??
Where else do we see volatility?
Eddy
A little more state per tuple
Ready/done bits (extensible a la Volcano/Starburst)
Query processing = dataflow routing!!
We'll come back to this!
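A minimal sketch of per-tuple done bits, assuming every operator is a simple filter and that any operator not yet done is ready (real eddies use ready bits to encode ordering constraints). The names `Tuple`, `run_eddy` and `choose` are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Tuple:
    values: dict
    done: int = 0  # bit i set => operator i has already handled this tuple

def run_eddy(tuples, operators, choose):
    """Route each tuple through all operators in an order picked by choose().

    operators: list of predicates over the tuple's values.
    choose: function from the list of ready operator indices to one index.
    """
    all_done = (1 << len(operators)) - 1
    output = []
    for t in tuples:
        alive = True
        while alive and t.done != all_done:
            # Simplification: an operator is ready iff its done bit is clear.
            ready = [i for i in range(len(operators)) if not t.done & (1 << i)]
            i = choose(ready)
            alive = operators[i](t.values)
            t.done |= 1 << i
        if alive:
            output.append(t.values)
    return output

ops = [lambda v: v["a"] > 1, lambda v: v["b"] < 5]
rows = [{"a": 0, "b": 1}, {"a": 2, "b": 3}, {"a": 5, "b": 9}]
in_order = run_eddy([Tuple(dict(r)) for r in rows], ops, lambda ready: ready[0])
reversed_order = run_eddy([Tuple(dict(r)) for r in rows], ops, lambda ready: ready[-1])
```

Both routing orders return the same result, which is the point: the order becomes a per-tuple routing decision rather than a fixed plan.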
Break the set-oriented boundary
Usual DB model: algebra expressions: (R ⋈ S) ⋈ T
Usual DB implementation: pipelining operators!
Subexpressions never materialized
Typical implementation is more flexible than algebra
We can reorder in-flight operators
Other gains possible by breaking the set-oriented boundary…
Don’t rewrite graph. Impose a router
Graph edge = absence of routing constraint
Observe operator consumption/production rates
Consumption: cost
Production: cost*selectivity
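One way to act on observed rates is a lottery: an operator gains a ticket for each tuple it consumes and loses one for each tuple it produces, so selective operators are favored. The sketch below assumes two filter operators with fixed selectivities; `LotteryRouter` and `simulate` are hypothetical names, and the one-ticket floor is an added guard.

```python
import random

class LotteryRouter:
    def __init__(self, n_ops, seed=0):
        self.tickets = [1] * n_ops
        self.rng = random.Random(seed)

    def choose(self, ready):
        """Pick a ready operator with probability proportional to its tickets."""
        weights = [self.tickets[i] for i in ready]
        return self.rng.choices(ready, weights=weights, k=1)[0]

    def observe(self, op, consumed, produced):
        """Credit consumption, debit production; never drop below one ticket."""
        self.tickets[op] = max(1, self.tickets[op] + consumed - produced)

def simulate(n_tuples=2000):
    # Operator 0 passes 10% of tuples, operator 1 passes 90%.
    ops = [lambda v: v % 10 == 0, lambda v: v % 10 != 9]
    router = LotteryRouter(len(ops))
    first_counts = [0, 0]  # how often each operator saw a tuple first
    for v in range(n_tuples):
        ready = [0, 1]
        first = True
        while ready:
            op = router.choose(ready)
            if first:
                first_counts[op] += 1
                first = False
            ready.remove(op)
            passed = ops[op](v)
            router.observe(op, consumed=1, produced=1 if passed else 0)
            if not passed:
                break  # tuple dropped; no need to visit the other operator
    return router.tickets, first_counts

tickets, first_counts = simulate()
```

The selective operator accumulates tickets quickly and ends up seeing most tuples first, without any selectivity estimate being computed up front.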
CLICK: a NW router is a query plan!
“The Click Modular Router”, Robert Morris, Eddie Kohler, John Jannotti, and M. Frans Kaashoek, SOSP ‘99
Paths the key to comm-centric OS
“Making Paths Explicit in the Scout Operating System”, David Mosberger and Larry L. Peterson, OSDI ‘96
[Figure 3: Example Router Graph]
Merge OS & DBMS grad class, over a year
Eric/Joe, point/counterpoint
Some tie-ins were obvious:
memory mgmt, storage, scheduling, concurrency
Surprising: QP and networks go well side by side
E.g. eddies and TCP Congestion Control
Both use back-pressure and simple Control Theory to “learn” in an unpredictable dataflow environment
Eddies close to the n-armed bandit problem
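For the TCP side of the comparison, a minimal sketch of additive-increase/multiplicative-decrease, the feedback rule behind TCP congestion avoidance. The parameter values and the `aimd` name are illustrative; eddies do not use this exact rule, the shared idea is adjusting behavior from observed feedback.

```python
def aimd(loss_events, increase=1.0, decrease=0.5, start=1.0):
    """Return the window size after each round.

    loss_events: iterable of booleans, True when the round saw a loss
    (the back-pressure signal).
    """
    window = start
    trace = []
    for lost in loss_events:
        if lost:
            window = max(1.0, window * decrease)  # back off sharply
        else:
            window += increase                    # probe for more capacity
        trace.append(window)
    return trace

trace = aimd([False] * 8 + [True] + [False] * 3)
```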
Core function of protocols: data xfer
Data Manipulation (buffer, checksum, encryption, xfer to/fr app space, presentation)
Transfer Control (flow/congestion ctl, detecting xmission probs, acks, muxing, timestamps, framing)
Clark & Tennenhouse, “Architectural Considerations for a New Generation of Protocols”, SIGCOMM ‘90
Basic Internet assumption:
“a network of unknown topology and with an unknown, unknowable and constantly changing population of competing conversations” (Van Jacobson)
Query Opt!
Thesis: nets are good at xfer control, not so good at data manipulation
Some C&T wacky ideas for better data manipulation
Data Modeling!
Xfer semantic units, not packets (ALF)
Auto-rewrite layers to flatten them (ILP)
Minimize cross-layer ordering constraints
Control delivery in parallel via packet content
Exchange!
What if…
We had unbounded data producers and consumers
(“streams” … “continuous queries”)
We couldn’t know our producers’ behavior or contents??
(“federation” … “mediators”)
We couldn’t predict user behavior? (“control”)
We couldn’t predict behavior of components in the dataflow? (“networked services”)
We had partial failure as a given? (oops, have we ignored this?)
Yes … networking people have been here!
Remember Van Jacobson’s quote?
[Figure: topics spanning two research communities]
DATABASE RESEARCH: Data Models, Query Opt, Data Scalability
Adaptive Query Processing
Continuous Queries
Approximate/Interactive QP
Sensor Databases
Content-Based Routing
Router Toolkits
Content Addressable Networks
Directed Diffusion
NETWORKING RESEARCH: Adaptivity, Federated Control, Geo Scalability
Scenarios with:
Structured Content
Volatility
Rich Queries
Clearly:
Long-running data analysis a la CONTROL
Continuous queries
Queries over Internet sources and services
Two emerging scenarios:
Sensor networks
P2P query processing
An adaptive dataflow system
Dataflow programming model
A la Volcano, CLICK: push and pull. “Fjords”, ICDE02
Extensible set of pipelining operators, including relational ops, grouped filters (e.g. XFilter)
SQL parser for convenience (looking at XQuery)
Adaptivity operators
Eddies
+ Extensible rules for routing constraints, Competition
SteMs (state modules)
FLuX (Fault-tolerant Load-balancing eXchange)
Bounded and continuous:
Data sources
Queries
Goal: Further adaptivity through competition
Multiple mirrored sources
Handle rate changes, failures, parallelism
Multiple alternate operators
Join = Routing + State
SteM operator manages tradeoffs
State Module: unifies caches, rendezvous buffers, join state
Competitive sources/operators share building/probing SteMs
Join algorithm hybridization!
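A minimal sketch of "Join = Routing + State", assuming an equi-join on one key: each input gets a state module that supports build and probe, and the routing step is "build into your own SteM, then probe the other". The `SteM` interface and `symmetric_join` shown here are hypothetical simplifications, not the Telegraph API.

```python
from collections import defaultdict

class SteM:
    """State module: a keyed store that rows are built into and probed against."""
    def __init__(self, key):
        self.key = key
        self.index = defaultdict(list)

    def build(self, row):
        self.index[row[self.key]].append(row)

    def probe(self, value):
        return list(self.index.get(value, []))

def symmetric_join(stream, key):
    """stream yields (source, row) pairs, source in {"R", "S"}, in any interleaving."""
    stems = {"R": SteM(key), "S": SteM(key)}
    other = {"R": "S", "S": "R"}
    for source, row in stream:
        stems[source].build(row)                      # state
        for match in stems[other[source]].probe(row[key]):  # routing
            yield (row, match) if source == "R" else (match, row)

stream = [
    ("R", {"k": 1, "a": "x"}),
    ("S", {"k": 1, "b": "y"}),
    ("S", {"k": 2, "b": "z"}),
    ("R", {"k": 2, "a": "w"}),
    ("R", {"k": 3, "a": "q"}),
]
pairs = list(symmetric_join(stream, "k"))
```

Because the state lives in the SteMs rather than inside a join operator, other access paths or sources could build into or probe the same modules.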
[Figure: static dataflow vs. eddy vs. eddy + stems]
Vijayshankar Raman
Fault Tolerance, Load Balancing
Continuous/long-running flows need high availability
Big flows need parallelism
Adaptive Load-Balancing req’d
FLuX operator: Exchange plus…
Adaptive flow partitioning (River)
Transient state replication & migration
RAID for SteMs
Needs to be extensible to different ops:
Content-sensitivity
History-sensitivity
Dataflow semantics
Optimize based on edge semantics
Networking tie-in again:
• At-least-once delivery?
• Exactly-once delivery?
• In/Out of order?
Migration policy: the ski rental analogy
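The ski rental analogy can be sketched with the classic break-even rule: keep paying the per-step penalty of an imbalanced partition ("renting") until the total paid equals the one-time migration cost, then migrate ("buy"). This is the textbook rule, shown under the assumption of known costs; it is not necessarily the policy FLuX uses, and the function names are hypothetical.

```python
def should_migrate(accumulated_penalty, migration_cost):
    """Break-even rule: migrate once the penalty paid so far matches the cost."""
    return accumulated_penalty >= migration_cost

def run_policy(penalties, migration_cost):
    """Return (step at which migration happens or None, total cost paid)."""
    paid = 0.0
    for step, penalty in enumerate(penalties):
        if should_migrate(paid, migration_cost):
            return step, paid + migration_cost
        paid += penalty
    return None, paid

long_imbalance = run_policy([1.0] * 20, 5.0)
short_imbalance = run_policy([1.0] * 3, 5.0)
```

With a long imbalance the policy pays 10.0 against an offline optimum of 5.0, the familiar factor-of-two bound; with a short one it never migrates and pays only the 3.0 of penalty.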
Mehul Shah
Continuous Queries clearly need all this stuff! Address adaptivity 1st.
4 Ideas in CACQ:
Use eddies to allow reordering of ops.
But one eddy will serve for all queries
Explicit tuple lineage
Mark each tuple with per-op ready/done bits
Mark each tuple with per-query completed bits
Queries are data: join with Grouped Filter
Much like XFilter, but for relational queries
Joins via SteMs, shared across all queries
Note: mixed-lineage tuples in a SteM. I.e. shared state is not shared algebraic expressions!
Delete a tuple from flow only if it matches no query
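A minimal sketch of "queries are data", assuming every query contributes one predicate of the form `attr > constant` on the same attribute. The grouped filter indexes the constants so one lookup yields a bitmask of the queries a tuple still matches; a zero mask means the tuple can be dropped. The `GroupedFilter` class is hypothetical.

```python
import bisect

class GroupedFilter:
    """Index of 'attr > constant' predicates contributed by many queries."""
    def __init__(self):
        self.constants = []  # sorted thresholds
        self.query_ids = []  # query_ids[i] owns constants[i]

    def add_query(self, query_id, constant):
        pos = bisect.bisect_left(self.constants, constant)
        self.constants.insert(pos, constant)
        self.query_ids.insert(pos, query_id)

    def matching(self, value):
        """Return a bitmask with bit q set when query q's predicate holds."""
        # Every constant left of pos is strictly less than value.
        pos = bisect.bisect_left(self.constants, value)
        mask = 0
        for qid in self.query_ids[:pos]:
            mask |= 1 << qid
        return mask

gf = GroupedFilter()
gf.add_query(0, 10)
gf.add_query(1, 50)
gf.add_query(2, 30)
mask_40 = gf.matching(40)  # queries 0 and 2 match
mask_5 = gf.matching(5)    # no query matches: tuple leaves the flow
```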
Next: F.T. CACQ via FLuXen
Sam Madden, Mehul Shah, Vijayshankar Raman
“Smart Dust” + TinyOS
Thousands of “motes”
Expensive communication
Power constraints
Query workload:
Aggregation & approximation
Queries and Continuous Queries
Challenges:
Push the processing into the network
Deal with volatility & failure
CONTROL issues: data variance, user desires
Simple example:
Aggregation query
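A minimal sketch of pushing an AVG query into the network, assuming motes form a routing tree and each mote forwards one partial state record (sum, count) to its parent instead of relaying every raw reading. The tree, readings and `aggregate` function are made up for illustration.

```python
def aggregate(node, children, readings):
    """Return the partial state (sum, count) for the subtree rooted at node."""
    total, count = readings[node], 1
    for child in children.get(node, []):
        child_total, child_count = aggregate(child, children, readings)
        total += child_total  # merge the child's partial state
        count += child_count
    return total, count

# Hypothetical five-mote tree: root has children a and b; a has children c and d.
children = {"root": ["a", "b"], "a": ["c", "d"]}
readings = {"root": 20.0, "a": 22.0, "b": 18.0, "c": 25.0, "d": 15.0}
total, count = aggregate("root", children, readings)
average = total / count
```

Each non-root mote sends a single small message, which matters when communication is the expensive, power-hungry operation.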
Joint work with Ramesh Govindan, Sam Madden, Wei Hong and David Culler (Intel Berkeley Lab)
Starting point: P2P as grassroots phenomenon
Outrageous filesharing volume (1.8G files in October 2001)
No business case to date
Challenge: scale DDBMS QP ideas to P2P
Motivate why
Pick the right parts of DBMS research to focus on
Storage: no! QP: yes.
Make it work:
Scalability well beyond our usual target
Admin constraints
Unknown data distributions, load
Heterogeneous comm/processing
Partial failure
Joint work with Scott Shenker, Ion Stoica, Matt Harren, Ryan Huebsch, Nick Lanham, Boon Thau Loo
Adaptivity
Requires clever system design
The Exchange model: encapsulate in ops?
Interesting adaptive policy problems
E.g. eddy routing, flux migration
Control Theory, Machine Learning
Encompasses another CS goal?
“No-knobs”, “Autonomic”, etc.
New performance regimes
Decent performance in the common case
Mean/Variance more important than MAX
Interactive Metrics
Time to completion often unimportant/irrelevant
Set-valued thinking as albatross?
E.g. eddies vs. Kabra/DeWitt or Tukwila
E.g. SteMs vs. Materialized Views
E.g. CACQ vs. NiagaraCQ
Some clean theory here would be nice
Current routing correctness proofs are inelegant
Extensibility
Model/language of choice is not clear
SEQ? Relational? XQuery?
Extensible operators, edge semantics
[A whine about VLDB’s absurd “Specificity Factor”]
Too early for technical conclusions
Of this I’m sure:
The CS262 experiment is a success
Our students are getting a bigger picture than before
I’m learning, finding new connections
May morph to OS/Nets, Nets/DB
Eventually rethink the systems software curriculum at the undergraduate level too
Nets folks are coming our way
Doing relevant work, eager to collaborate
DB community needs to branch out
Outbound: Better proselytizing in CS
Inbound: Need new ideas
Sabbatical is a good invention
Hasn’t even started, I’m already grateful!