Adaptive Dataflow: A Database/Networking Cosmic Convergence

Joe Hellerstein

UC Berkeley

Road Map

How I got started on this



CONTROL project



Eddies

Tie-ins to Networking Research

Telegraph & ongoing adaptive dataflow research

New arenas:



Sensor networks



P2P networks

Background: CONTROL project

Online/Interactive query processing



Online aggregation



Scalable spreadsheets & refining visualizations



Online data cleaning (Potter’s Wheel)

Pipelining operators (ripple joins, online reordering) over streaming samples

Example: Online Aggregation
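A minimal sketch of the idea, assuming tuples arrive as a random sample and that the aggregate is AVG: keep a running mean and report a confidence interval that tightens as more samples stream in. The class name `OnlineAverage` and the normal-approximation interval are illustrative assumptions, not the CONTROL implementation.

```python
import math
from statistics import NormalDist


class OnlineAverage:
    """Running AVG with a normal-approximation confidence interval."""

    def __init__(self, confidence=0.95):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean (Welford)
        self.z = NormalDist().inv_cdf(0.5 + confidence / 2)

    def add(self, x):
        # Welford's update: numerically stable, one pass, constant state.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def estimate(self):
        """Return (mean, half_width); half_width is infinite before two samples."""
        if self.n < 2:
            return self.mean, math.inf
        variance = self.m2 / (self.n - 1)
        return self.mean, self.z * math.sqrt(variance / self.n)
```

The user can stop the query as soon as the half-width is small enough, rather than waiting for the full scan.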

Online Data Visualization

CLOUDS

Potter’s Wheel

Goals for Online Processing

Performance metric:

Statistical (e.g. conf. intervals)

User-driven (e.g. weighted by widgets)

New “greedy” performance regime

Maximize 1st derivative of the “mirth index”

Mirth defined on-the-fly

Therefore need FEEDBACK and CONTROL

[Chart: curves labeled Online and Traditional plotted against Time, with the vertical axis reaching 100%]

CONTROL Volatility

Goals and data may change over time



User feedback, sample variance

Goals and data may be different in different “regions”





Group-by, scrollbar position

[An aside: dependencies in selectivity estimation]

Q: Query optimization in this world?





Or in any pipelining, volatile environment??

Where else do we see volatility?

Continuous Adaptivity: Eddies

Eddy

A little more state per tuple



Ready/done bits (extensible a la Volcano/Starburst)

Query processing = dataflow routing!!



We'll come back to this!

Eddies: Two Key Observations

Break the set-oriented boundary



Usual DB model: algebra expressions: (R ⋈ S) ⋈ T



Usual DB implementation: pipelining operators!



Subexpressions never materialized



Typical implementation is more flexible than algebra





We can reorder in-flight operators

Other gains possible by breaking the set-oriented boundary…

Don’t rewrite graph. Impose a router



Graph edge = absence of routing constraint



Observe operator consumption/production rates





Consumption: cost

Production: cost*selectivity
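A minimal sketch of an eddy over selection operators, assuming a lottery-style routing policy: an operator gains a ticket when it consumes a tuple and gives one back when it returns the tuple, so operators that drop more tuples tend to be visited earlier. The class name `Eddy`, the ticket counters, and the per-tuple done bitmask are illustrative choices for this sketch.

```python
import random


class Eddy:
    """Routes each tuple through every operator, in an order chosen per tuple."""

    def __init__(self, predicates, seed=0):
        self.predicates = predicates          # one selection predicate per operator
        self.tickets = [1] * len(predicates)  # routing weights learned from feedback
        self.rng = random.Random(seed)

    def _pick(self, done):
        # Choose among operators this tuple has not visited, weighted by tickets.
        ready = [i for i in range(len(self.predicates)) if not done & (1 << i)]
        weights = [self.tickets[i] for i in ready]
        return self.rng.choices(ready, weights=weights)[0]

    def run(self, tuples):
        all_done = (1 << len(self.predicates)) - 1
        out = []
        for t in tuples:
            done = 0                          # per-tuple "done" bits
            alive = True
            while alive and done != all_done:
                i = self._pick(done)
                self.tickets[i] += 1          # operator consumed a tuple
                if self.predicates[i](t):
                    self.tickets[i] -= 1      # operator produced the tuple back
                    done |= 1 << i
                else:
                    alive = False             # tuple dropped; operator keeps the ticket
            if alive:
                out.append(t)
        return out
```

Whatever order the router picks, the output is the same set of tuples; only the amount of work changes.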

Road Map




CONTROL project



Eddies



New arenas:



Sensor networks



P2P networks

Coincidence: Eddie Comes to Berkeley

CLICK: a NW router is a query plan!



“The Click Modular Router”, Robert Morris, Eddie Kohler, John Jannotti, and M. Frans Kaashoek, SOSP ‘99

Also Scout

Paths the key to comm-centric OS



“Making Paths Explicit in the Scout Operating System”, David Mosberger and Larry L. Peterson. OSDI ‘96.

Figure 3: Example Router Graph

More Interaction: CS262 Experiment w/ Eric Brewer

Merge OS & DBMS grad class, over a year

Eric/Joe, point/counterpoint

Some tie-ins were obvious:

Memory mgmt, storage, scheduling, concurrency

Surprising: QP and networks go well side by side



E.g. eddies and TCP Congestion Control





Both use back-pressure and simple Control Theory to “learn” in an unpredictable dataflow environment

Eddies close to the n-armed bandit problem
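For the TCP side of that comparison, a rough sketch of additive-increase/multiplicative-decrease (AIMD), the feedback rule at the heart of TCP congestion control. The function name `aimd`, the fixed `capacity`, and the loss signal (window exceeds capacity) are simplifying assumptions for illustration only.

```python
def aimd(capacity, rounds, increase=1.0, decrease=0.5, start=1.0):
    """Return the window size observed in each round of a toy AIMD loop."""
    window = start
    history = []
    for _ in range(rounds):
        history.append(window)
        if window > capacity:
            # Loss signal (back-pressure): back off multiplicatively.
            window = max(1.0, window * decrease)
        else:
            # No loss: probe for more bandwidth additively.
            window += increase
    return history
```

The sender never learns the capacity directly; it only reacts to feedback, much as an eddy only observes operator consumption and production rates.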

Networking Overview for DB People Like Me

Core function of protocols: data xfer



Data Manipulation (buffer, checksum, encryption, xfer to/fr app space, presentation)



Transfer Control (flow/congestion ctl, detecting xmission probs, acks, muxing, timestamps, framing)

-Clark & Tennenhouse, “Architectural Considerations for a New Generation of Protocols”, SIGCOMM ‘90

Basic Internet assumption:



“a network of unknown topology and with an unknown, unknowable and constantly changing population of competing conversations” (Van Jacobson)

C & T’s Wacky Ideas

Thesis: nets are good at xfer control, not so good at data manipulation

Some C&T wacky ideas for better data manipulation

Xfer semantic units, not packets (ALF)

Auto-rewrite layers to flatten them (ILP)

Minimize cross-layer ordering constraints

Control delivery in parallel via packet content

[Slide callouts pointing at these ideas: Data Modeling! (ALF), Query Opt! (ILP, ordering constraints), Exchange! (parallel delivery)]

Wacky New Ideas in QP

What if…











We had unbounded data producers and consumers (“streams” … “continuous queries”)

We couldn’t know our producers’ behavior or contents?? (“federation” … “mediators”)

We couldn’t predict user behavior? (“control”)

We couldn’t predict behavior of components in the dataflow? (“networked services”)

We had partial failure as a given? (oops, have we ignored this?)

Yes … networking people have been here!



Remember Van Jacobson’s quote?

The Cosmic Convergence

[Diagram: two poles, DATABASE RESEARCH (Data Models, Query Opt, DataScalability) and NETWORKING RESEARCH (Adaptivity, Federated Control, GeoScalability). Topics shown between them: Adaptive Query Processing, Continuous Queries, Approximate/Interactive QP, Sensor Databases, Content-Based Routing, Router Toolkits, Content Addressable Networks, Directed Diffusion]

The Cosmic Convergence

[Diagram: two poles, DATABASE RESEARCH (Data Models, Query Opt, DataScalability) and NETWORKING RESEARCH (Adaptivity, Federated Control, GeoScalability). Topics shown between them: Adaptive Query Processing, Continuous Queries, Approximate/Interactive QP, Telegraph, Sensor Databases, Content-Based Routing, Router Toolkits, Content Addressable Networks, Directed Diffusion]

Road Map




CONTROL project



Eddies



New arenas:



Sensor networks



P2P networks

What’s in the Sweet Spot?

Scenarios with:



Structured Content





Volatility

Rich Queries

Clearly:



Long-running data analysis a la CONTROL





Continuous queries

Queries over Internet sources and services

Two emerging scenarios:



Sensor networks



P2P query processing

Telegraph: Engineering the Sweet Spot

An adaptive dataflow system







Dataflow programming model







A la Volcano, CLICK: push and pull. “Fjords”, ICDE02

Extensible set of pipelining operators, including relational ops, grouped filters (e.g. XFilter)

SQL parser for convenience (looking at XQuery)

Adaptivity operators







Eddies



+ Extensible rules for routing constraints, Competition

SteMs (state modules)

FLuX (Fault-tolerant Load-balancing eXchange)

Bounded and continuous:





Data sources

Queries

State Modules (SteMs)

Goal: Further adaptivity through competition



Multiple mirrored sources



Handle rate changes, failures, parallelism







Multiple alternate operators

Join = Routing + State

SteM operator manages tradeoffs



State Module, unifies caches, rendezvous buffers, join state





Competitive sources/operators share building/probing SteMs

Join algorithm hybridization!

[Diagram labels: static dataflow, eddy, eddy + stems]

Vijayshankar Raman
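One way to read “Join = Routing + State” is as a symmetric hash join split into two state modules plus a routing rule: each arriving tuple is built into its own SteM and then probed against the other. The sketch below assumes an equijoin and in-memory dictionaries; the names `SteM`, `build`, `probe`, and `stem_join` are illustrative, not the Telegraph API.

```python
from collections import defaultdict


class SteM:
    """State module: a keyed store that tuples are built into and probed against."""

    def __init__(self, key):
        self.key = key                  # function extracting the join key
        self.index = defaultdict(list)

    def build(self, t):
        self.index[self.key(t)].append(t)

    def probe(self, value):
        return list(self.index.get(value, []))


def stem_join(arrivals, key_r, key_s):
    """Join interleaved arrivals given as ('R', tuple) or ('S', tuple) pairs."""
    stem_r, stem_s = SteM(key_r), SteM(key_s)
    results = []
    for source, t in arrivals:
        if source == 'R':
            stem_r.build(t)             # build first, then probe the other side
            results.extend((t, s) for s in stem_s.probe(key_r(t)))
        else:
            stem_s.build(t)
            results.extend((r, t) for r in stem_r.probe(key_s(t)))
    return results
```

Because the state lives outside any one join operator, the router stays free to change which side is built or probed first as arrival rates change.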

FLuX: Routing Across Cluster

Fault Tolerance, Load Balancing







Continuous/long-running flows need high availability

Big flows need parallelism



Adaptive Load-Balancing req’d

FLuX operator: Exchange plus…



Adaptive flow partitioning (River)









Transient state replication & migration



RAID for SteMs

Needs to be extensible to different ops:





Content-sensitivity

History-sensitivity

Dataflow semantics





Optimize based on edge semantics

Networking tie-in again:

• At-least-once delivery?

• Exactly-once delivery?

• In/Out of order?

Migration policy: the ski rental analogy

Mehul Shah
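The ski rental analogy can be sketched with the classic break-even rule: keep paying the recurring cost until the total paid reaches the one-time cost, then switch. Mapping “rent” to the per-period cost of tolerating an imbalanced partition and “buy” to the one-time cost of migrating state is an assumption made here for illustration; the function names are hypothetical.

```python
def break_even_cost(rent, buy, periods):
    """Cost of renting until the rent paid reaches the purchase price, then buying."""
    paid = 0
    for _ in range(periods):
        if paid >= buy:
            return paid + buy   # switch: pay the one-time cost and stop renting
        paid += rent
    return paid


def hindsight_cost(rent, buy, periods):
    """Cost of the better choice if the number of periods were known in advance."""
    return min(rent * periods, buy)
```

With unit rent and an integer purchase price, the break-even rule never pays more than twice the hindsight cost, without needing to predict how long the imbalance will last.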

Continuously Adaptive Continuous Queries (CACQ)

Continuous Queries clearly need all this stuff! Address adaptivity 1st.

4 Ideas in CACQ:









Use eddies to allow reordering of ops.



But one eddy will serve for all queries

Explicit tuple lineage



Mark each tuple with per-op ready/done bits



Mark each tuple with per-query completed bits

Queries are data: join with Grouped Filter



Much like XFilter, but for relational queries

Joins via SteMs, shared across all queries





Note: mixed-lineage tuples in a SteM. I.e. shared state is not shared algebraic expressions!

Delete a tuple from flow only if it matches no query

Next: F.T. CACQ via FLuXen

Sam Madden, Mehul Shah, Vijayshankar Raman
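A minimal sketch of “queries are data”, assuming every registered query has a single predicate of the form `value > constant` on the same attribute. The filter indexes the constants so one lookup per tuple yields a bitmask of the queries it satisfies; a zero mask means the tuple matches no query and can be dropped. The class and method names are hypothetical.

```python
import bisect


class GroupedFilter:
    """Indexes many queries' `value > constant` predicates on one attribute."""

    def __init__(self):
        self.constants = []   # thresholds, kept sorted
        self.query_ids = []   # query id for each threshold, in the same order

    def add_query(self, query_id, constant):
        pos = bisect.bisect_left(self.constants, constant)
        self.constants.insert(pos, constant)
        self.query_ids.insert(pos, query_id)

    def matching_queries(self, value):
        """Return a bitmask with bit q set for each query q the value satisfies."""
        # Every threshold strictly below `value` is satisfied.
        pos = bisect.bisect_left(self.constants, value)
        mask = 0
        for qid in self.query_ids[:pos]:
            mask |= 1 << qid
        return mask
```

The returned mask plays the role of per-query bits carried with the tuple, so later shared operators know which queries still care about it.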

Road Map




CONTROL project



Eddies



New arenas:



Sensor networks



P2P networks

Sensor Nets

“Smart Dust” + TinyOS

Thousands of “motes”

Expensive communication



Power constraints

Query workload:





Aggregation & approximation

Queries and Continuous Queries

Challenges:







Push the processing into the network

Deal with volatility & failure

CONTROL issues: data variance, user desires

Simple example:

Aggregation query

Joint work with Ramesh Govindan, Sam Madden, Wei Hong and David Culler (Intel Berkeley Lab)
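A rough sketch of pushing an aggregation query into the network, assuming the motes form a routing tree and the query is AVG: each node merges its own reading with its children's partial results and forwards a single (sum, count) pair to its parent, instead of relaying every raw reading. The tree layout, readings, and function names are made up for illustration.

```python
def partial_aggregate(tree, readings, node):
    """Return (sum, count) for the subtree rooted at `node`."""
    total, count = readings[node], 1
    for child in tree.get(node, []):
        child_sum, child_count = partial_aggregate(tree, readings, child)
        total += child_sum      # merge the child's partial state into ours
        count += child_count
    return total, count


def network_average(tree, readings, root):
    total, count = partial_aggregate(tree, readings, root)
    return total / count
```

Each mote sends one small message per query epoch regardless of subtree size, which matters when communication is the expensive resource.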

P2P QP

Starting point: P2P as grassroots phenomenon



Outrageous filesharing volume (1.8G files in October 2001)



No business case to date

Challenge: scale DDBMS QP ideas to P2P



Motivate why





Pick the right parts of DBMS research to focus on



Storage: no! QP: yes.

Make it work:











Scalability well beyond our usual target

Admin constraints

Unknown data distributions, load

Heterogeneous comm/processing

Partial failure

Joint work with Scott Shenker, Ion Stoica, Matt Harren, Ryan Huebsch, Nick Lanham, Boon Thau Loo

A Grassroots Example: TeleNap

Themes Throughout

Adaptivity







Requires clever system design



The Exchange model: encapsulate in ops?

Interesting adaptive policy problems





E.g. eddy routing, flux migration

Control Theory, Machine Learning

Encompasses another CS goal?



“No-knobs”, “Autonomic”, etc.

New performance regimes



Decent performance in the common case



Mean/Variance more important than MAX



Interactive Metrics



Time to completion often unimportant/irrelevant

More Themes

Set-valued thinking as albatross?



E.g. eddies vs. Kabra/DeWitt or Tukwila







E.g. SteMs vs. Materialized Views

E.g. CACQ vs. NiagaraCQ

Some clean theory here would be nice



Current routing correctness proofs are inelegant

Extensibility



Model/language of choice is not clear



SEQ? Relational? XQuery?





Extensible operators, edge semantics

[A whine about VLDB’s absurd “Specificity Factor”]

Conclusions?

Too early for technical conclusions

Of this I’m sure:







The CS262 experiment is a success



Our students are getting a bigger picture than before







I’m learning, finding new connections

May morph to OS/Nets, Nets/DB

Eventually rethink the systems software curriculum at the undergraduate level too

Nets folks are coming our way



Doing relevant work, eager to collaborate

DB community needs to branch out





Outbound: Better proselytizing in CS

Inbound: Need new ideas

Conclusions, cont.

Sabbatical is a good invention



Hasn’t even started, I’m already grateful!
