Adaptive Dataflow: A Database/Networking Cosmic Convergence Joe Hellerstein

UC Berkeley

Road Map

How I got started on this

CONTROL project

Eddies

Tie-ins to Networking Research

Telegraph & ongoing adaptive dataflow research

New arenas:

Sensor networks

P2P networks

Background: CONTROL project

Online/Interactive query processing

Online aggregation

Scalable spreadsheets & refining visualizations

Online data cleaning (Potter’s Wheel)

Pipelining operators (ripple joins, online reordering) over streaming samples

Example: Online Aggregation
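
A minimal sketch of the online aggregation idea, not the CONTROL implementation: a running AVG over tuples delivered in random order, reported with a confidence interval that tightens as more samples arrive. The names `online_avg` and `z` are illustrative assumptions, and the interval relies on a normal (CLT) approximation over a large random sample.

```python
import math
import random

def online_avg(stream, z=1.96):
    """Yield (n, running_mean, half_width) after each sampled value.

    half_width is the z * standard-error half-width of the interval
    around the running mean (about 95% for z = 1.96, under the CLT).
    """
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations (Welford's method)
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n > 1:
            variance = m2 / (n - 1)
            half_width = z * math.sqrt(variance / n)
        else:
            half_width = float("inf")  # no variance estimate yet
        yield n, mean, half_width

# Hypothetical data: the "table" is shuffled so any prefix is a random sample.
random.seed(0)
data = [random.gauss(50.0, 10.0) for _ in range(10000)]
random.shuffle(data)
estimates = list(online_avg(data))
```

A user watching `estimates` can stop the query as soon as the half-width is small enough, rather than waiting for the scan to finish.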

Online Data Visualization

CLOUDS

Potter’s Wheel

Goals for Online Processing

Performance metric:

Statistical (e.g. conf. intervals)

User-driven (e.g. weighted by widgets)

New “greedy” performance regime

Maximize 1st derivative of the “mirth index”

Mirth defined on-the-fly

Therefore need FEEDBACK and CONTROL

[Figure: plot rising to 100% over Time, comparing the Online and Traditional curves]

CONTROL Volatility

Goals and data may change over time

User feedback, sample variance

Goals and data may be different in different “regions”

Group-by, scrollbar position

[An aside: dependencies in selectivity estimation]

Q: Query optimization in this world?

Or in any pipelining, volatile environment??

Where else do we see volatility?

Continuous Adaptivity: Eddies

Eddy

A little more state per tuple

Ready/done bits (extensible a la Volcano/Starburst)

Query processing = dataflow routing!!

We'll come back to this!

Eddies: Two Key Observations

Break the set-oriented boundary

Usual DB model: algebra expressions: (R ⋈ S) ⋈ T

Usual DB implementation: pipelining operators!

Subexpressions never materialized

Typical implementation is more flexible than algebra

We can reorder in-flight operators

Other gains possible by breaking the set-oriented boundary…

Don’t rewrite graph. Impose a router

Graph edge = absence of routing constraint

Observe operator consumption/production rates

Consumption: cost

Production: cost*selectivity
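
A toy sketch of the routing idea above, under stated assumptions: operators are plain selection predicates, each tuple carries a done bitmask, and the router picks among pending operators with a ticket lottery (an operator gains a ticket when it consumes a tuple and gives one back when it produces it). The class `Eddy` and its fields are hypothetical names for illustration, not the Telegraph code.

```python
import random

class Eddy:
    """Routes each tuple through all predicates in an adaptively chosen order."""

    def __init__(self, predicates, seed=0):
        self.ops = predicates
        self.tickets = [1] * len(predicates)  # lottery weights, never below 1
        self.calls = [0] * len(predicates)    # how often each operator ran
        self.rng = random.Random(seed)

    def route(self, tup):
        done = 0  # per-tuple "done" bits, one per operator
        all_done = (1 << len(self.ops)) - 1
        while done != all_done:
            pending = [i for i in range(len(self.ops)) if not ((done >> i) & 1)]
            weights = [self.tickets[i] for i in pending]
            i = self.rng.choices(pending, weights=weights)[0]
            self.calls[i] += 1
            self.tickets[i] += 1              # credit: operator consumed a tuple
            if not self.ops[i](tup):
                return None                   # tuple dropped; operator keeps the ticket
            self.tickets[i] = max(1, self.tickets[i] - 1)  # debit: tuple produced
            done |= 1 << i
        return tup

# Operator 0 passes everything; operator 1 is selective (passes 10%).
eddy = Eddy([lambda x: x >= 0, lambda x: x % 10 == 0])
survivors = [t for t in (eddy.route(x) for x in range(5000)) if t is not None]
```

Operators that drop many tuples accumulate tickets, so the router learns to send tuples to them first without ever rewriting the plan graph.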

Road Map

How I got started on this

CONTROL project

Eddies

Tie-ins to Networking Research

Telegraph & ongoing adaptive dataflow research

New arenas:

Sensor networks

P2P networks

Coincidence: Eddie Comes to Berkeley

CLICK: a NW router is a query plan!

“The Click Modular Router”, Robert Morris, Eddie Kohler, John Jannotti, and M. Frans Kaashoek, SOSP ‘99

Also Scout

Paths the key to comm-centric OS

“Making Paths Explicit in the Scout Operating System”, David Mosberger and Larry L. Peterson. OSDI ‘96.

Figure 3: Example Router Graph

More Interaction: CS262

Experiment w/ Eric Brewer

Merge OS & DBMS grad class, over a year

Eric/Joe, point/counterpoint

Some tie-ins were obvious:

 memory mgmt, storage, scheduling, concurrency

Surprising: QP and networks go well side by side

E.g. eddies and TCP Congestion Control

Both use back-pressure and simple Control Theory to “learn” in an unpredictable dataflow environment

Eddies close to the n-armed bandit problem
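
To make the congestion-control side of the comparison concrete, here is a minimal additive-increase/multiplicative-decrease (AIMD) sketch. It is a simplification in the spirit of TCP congestion control, not TCP itself: the function name `aimd`, the per-round capacity list, and treating "window exceeds capacity" as the loss signal are all assumptions for illustration.

```python
def aimd(capacities, incr=1.0, decr=0.5, start=1.0):
    """Return the window size after each round.

    capacities[t] is the (unknown to the sender) capacity in round t.
    The sender only observes back-pressure: a loss when it overshoots.
    """
    window = start
    trace = []
    for capacity in capacities:
        if window > capacity:
            window = max(1.0, window * decr)  # multiplicative decrease on loss
        else:
            window += incr                    # additive increase while probing
        trace.append(window)
    return trace

# Capacity silently drops from 20 to 8 halfway through the run.
trace = aimd([20.0] * 100 + [8.0] * 100)
```

The sender never learns the capacity directly; the sawtooth settles around whatever the environment currently allows, which is the same feedback style an eddy uses to track changing operator rates.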

Networking Overview for DB People Like Me

Core function of protocols: data xfer

Data Manipulation (buffer, checksum, encryption, xfer to/fr app space, presentation)

Transfer Control (flow/congestion ctl, detecting xmission probs, acks, muxing, timestamps, framing)

-Clark & Tennenhouse, “Architectural Considerations for a New Generation of Protocols”, SIGCOMM ‘90

Basic Internet assumption:

“a network of unknown topology and with an unknown, unknowable and constantly changing population of competing conversations” (Van Jacobson)

C & T’s Wacky Ideas

Thesis: nets are good at xfer control, not so good at data manipulation

Some C&T wacky ideas for better data manipulation

Xfer semantic units, not packets (ALF)

Auto-rewrite layers to flatten them (ILP)

Minimize cross-layer ordering constraints

Control delivery in parallel via packet content

[Slide callouts marking the DB analogues: Data Modeling! Query Opt! Exchange!]

Wacky New Ideas in QP

What if…

We had unbounded data producers and consumers (“streams” … “continuous queries”)

We couldn’t know our producers’ behavior or contents?? (“federation” … “mediators”)

We couldn’t predict user behavior? (“control”)

We couldn’t predict behavior of components in the dataflow? (“networked services”)

We had partial failure as a given? (oops, have we ignored this?)

Yes … networking people have been here!

Remember Van Jacobson’s quote?

The Cosmic Convergence

Data Models, Query Opt, Data Scalability

DATABASE RESEARCH

Adaptive Query Processing

Continuous Queries

Approximate/Interactive QP

Sensor Databases

Content-Based Routing

Router Toolkits

Content Addressable Networks

Directed Diffusion

NETWORKING RESEARCH

Adaptivity, Federated Control, GeoScalability

The Cosmic Convergence

Data Models, Query Opt, Data Scalability

DATABASE RESEARCH

Adaptive Query Processing

Continuous Queries

Approximate/Interactive QP

Telegraph

Sensor Databases

Content-Based Routing

Router Toolkits

Content Addressable Networks

Directed Diffusion

NETWORKING RESEARCH

Adaptivity, Federated Control, GeoScalability

Road Map

How I got started on this

CONTROL project

Eddies

Tie-ins to Networking Research

Telegraph & ongoing adaptive dataflow research

New arenas:

Sensor networks

P2P networks

What’s in the Sweet Spot?

Scenarios with:

Structured Content

Volatility

Rich Queries

Clearly:

Long-running data analysis a la CONTROL

Continuous queries

Queries over Internet sources and services

Two emerging scenarios:

Sensor networks

P2P query processing

Telegraph: Engineering the Sweet Spot

An adaptive dataflow system

Dataflow programming model

A la Volcano, CLICK: push and pull. “Fjords”, ICDE02

Extensible set of pipelining operators, including relational ops, grouped filters (e.g. XFilter)

SQL parser for convenience (looking at XQuery)

Adaptivity operators

Eddies

+ Extensible rules for routing constraints, Competition

SteMs (state modules)

FLuX (Fault-tolerant Load-balancing eXchange)

Bounded and continuous:

Data sources

Queries

State Modules (SteMs)

Goal: Further adaptivity through competition

Multiple mirrored sources

Handle rate changes, failures, parallelism

Multiple alternate operators

Join = Routing + State

SteM operator manages tradeoffs

State Module: unifies caches, rendezvous buffers, join state

Competitive sources/operators share building/probing SteMs

Join algorithm hybridization!

Vijayshankar Raman

[Figure: static dataflow, eddy, eddy + stems]
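
One way to read "Join = Routing + State" is sketched below, as an illustration only: each input gets a state module that supports build and probe, and the join is just the routing rule "build into your own SteM, then probe the other one". The `SteM` class and `symmetric_join` function are hypothetical names; the real SteM operator also handles caching, eviction, and competition, which this sketch omits.

```python
from collections import defaultdict

class SteM:
    """A state module: an index over one input's tuples on the join key."""

    def __init__(self, key):
        self.key = key
        self.index = defaultdict(list)

    def build(self, tup):
        self.index[tup[self.key]].append(tup)

    def probe(self, value):
        return list(self.index.get(value, []))

def symmetric_join(stream, key):
    """Join tuples tagged "R" or "S" as they arrive, in any interleaving."""
    stems = {"R": SteM(key), "S": SteM(key)}
    for source, tup in stream:
        other = "S" if source == "R" else "R"
        stems[source].build(tup)                      # state
        for match in stems[other].probe(tup[key]):    # routing
            yield (tup, match) if source == "R" else (match, tup)

stream = [
    ("R", {"id": 1, "r": "a"}),
    ("S", {"id": 1, "s": "x"}),
    ("S", {"id": 2, "s": "y"}),
    ("R", {"id": 2, "r": "b"}),
    ("R", {"id": 1, "r": "c"}),
]
pairs = [(r["r"], s["s"]) for r, s in symmetric_join(stream, "id")]
```

Because the state lives in the SteMs rather than inside a join operator, a router is free to change which side builds or probes, or to let a mirrored source share the same state.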

FLuX: Routing Across Cluster

Fault Tolerance, Load Balancing

Continuous/long-running flows need high availability

Big flows need parallelism

Adaptive Load-Balancing req’d

FLuX operator: Exchange plus…

Adaptive flow partitioning (River)

Transient state replication & migration

RAID for SteMs

Needs to be extensible to different ops:

Content-sensitivity

History-sensitivity

Dataflow semantics

Optimize based on edge semantics

Networking tie-in again:

• At-least-once delivery?

• Exactly-once delivery?

• In/Out of order?

Migration policy: the ski rental analogy

Mehul Shah
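
The ski rental analogy can be sketched with the classic break-even rule. The mapping used here is an assumption for illustration: "rent" stands for the per-period cost of tolerating an imbalanced partition, and "buy" for the one-time cost of migrating state. The function names are hypothetical and this is not FLuX's actual policy.

```python
import math

def break_even_day(rent, buy):
    """First day on which cumulative rent would reach the purchase price."""
    return math.ceil(buy / rent)

def online_cost(days, rent, buy):
    """Cost of renting until the break-even day, then buying."""
    k = break_even_day(rent, buy)
    if days < k:
        return days * rent
    return (k - 1) * rent + buy

def offline_cost(days, rent, buy):
    """Cost with perfect foresight of how long the imbalance lasts."""
    return min(days * rent, buy)

# Worst-case ratio over many possible durations, with rent=3 and buy=20.
ratios = [online_cost(d, 3, 20) / offline_cost(d, 3, 20) for d in range(1, 101)]
```

Without knowing how long an imbalance will last, the break-even rule never pays more than twice what a clairvoyant policy would.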

Continuously Adaptive Continuous Queries (CACQ)

Continuous Queries clearly need all this stuff! Address adaptivity 1st.

4 Ideas in CACQ:

Use eddies to allow reordering of ops.

But one eddy will serve for all queries

Explicit tuple lineage

Mark each tuple with per-op ready/done bits

Mark each tuple with per-query completed bits

Queries are data: join with Grouped Filter

Much like XFilter, but for relational queries

Joins via SteMs, shared across all queries

Note: mixed-lineage tuples in a SteM. I.e. shared state is not shared algebraic expressions!

Delete a tuple from flow only if it matches no query

Next: F.T. CACQ via FLuXen

Sam Madden, Mehul Shah, Vijayshankar Raman
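
The "queries are data" idea can be illustrated with a small grouped filter. This sketch assumes every query contributes one predicate of the form `attr > constant`; the class `GroupedFilter` and the function `route` are hypothetical names, and CACQ's real grouped filters cover more predicate types and track lineage bits per tuple.

```python
import bisect

class GroupedFilter:
    """Indexes many 'attr > constant' predicates, one per query."""

    def __init__(self):
        self.consts = []  # sorted constants
        self.qids = []    # query ids, parallel to consts

    def add_query(self, qid, const):
        pos = bisect.bisect_left(self.consts, const)
        self.consts.insert(pos, const)
        self.qids.insert(pos, qid)

    def matches(self, value):
        """Return the ids of all queries whose predicate the value satisfies."""
        pos = bisect.bisect_left(self.consts, value)  # constants strictly below value
        return set(self.qids[:pos])

def route(tuples, grouped_filter, attr):
    """Keep a tuple only if at least one query still wants it."""
    for tup in tuples:
        qids = grouped_filter.matches(tup[attr])
        if qids:
            yield tup, qids  # qids plays the role of per-query bits

gf = GroupedFilter()
gf.add_query("q1", 10)
gf.add_query("q2", 20)
gf.add_query("q3", 30)
kept = list(route([{"temp": 5}, {"temp": 25}, {"temp": 31}], gf, "temp"))
```

One probe per tuple answers all registered queries at once, so adding a query inserts a row into the filter instead of adding an operator to the plan.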

Road Map

How I got started on this

CONTROL project

Eddies

Tie-ins to Networking Research

Telegraph & ongoing adaptive dataflow research

New arenas:

Sensor networks

P2P networks

Sensor Nets

“Smart Dust” + TinyOS

Thousands of “motes”

Expensive communication

Power constraints

Query workload:

Aggregation & approximation

Queries and Continuous Queries

Challenges:

Push the processing into the network

Deal with volatility & failure

CONTROL issues: data variance, user desires

Simple example:

Aggregation query

Joint work with Ramesh Govindan, Sam Madden, Wei Hong and David Culler (Intel Berkeley Lab)
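
A rough sketch of pushing an aggregation query into the network, under assumed conditions: motes form a routing tree, and each mote forwards a single partial (sum, count) record to its parent instead of relaying raw readings. The tree, the readings, and the function names are made up for illustration; real motes must also cope with loss and changing topology.

```python
def partial_aggregate(children, readings, node):
    """Combine this mote's reading with the partial records of its subtree."""
    total, count = readings[node], 1
    for child in children.get(node, ()):
        sub_total, sub_count = partial_aggregate(children, readings, child)
        total += sub_total
        count += sub_count
    return total, count

def message_counts(children, root):
    """Messages sent with in-network aggregation vs. relaying every raw reading."""
    in_network = 0
    centralized = 0
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        if node != root:
            in_network += 1       # one partial record sent to the parent
            centralized += depth  # a raw reading is relayed once per hop
        for child in children.get(node, ()):
            stack.append((child, depth + 1))
    return in_network, centralized

# Hypothetical six-mote tree and temperature readings.
children = {"root": ["a", "b"], "a": ["c", "d"], "d": ["e"]}
readings = {"root": 20.0, "a": 22.0, "b": 19.0, "c": 25.0, "d": 21.0, "e": 19.0}
total, count = partial_aggregate(children, readings, "root")
avg = total / count
in_network, centralized = message_counts(children, "root")
```

Since communication is the expensive resource, the saving grows with tree depth: every mote sends one message regardless of how far it sits from the root.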

P2P QP

Starting point: P2P as grassroots phenomenon

Outrageous filesharing volume (1.8G files in October 2001)

No business case to date

Challenge: scale DDBMS QP ideas to P2P

Motivate why

Pick the right parts of DBMS research to focus on

Storage: no! QP: yes.

Make it work:

Scalability well beyond our usual target

Admin constraints

Unknown data distributions, load

Heterogeneous comm/processing

Partial failure

Joint work with Scott Shenker, Ion Stoica, Matt Harren, Ryan Huebsch, Nick Lanham, Boon Thau Loo

A Grassroots Example: TeleNap

Themes Throughout

Adaptivity

Requires clever system design

The Exchange model: encapsulate in ops?

Interesting adaptive policy problems

E.g. eddy routing, flux migration

Control Theory, Machine Learning

Encompasses another CS goal?

“No-knobs”, “Autonomic”, etc.

New performance regimes

Decent performance in the common case

Mean/Variance more important than MAX

Interactive Metrics

Time to completion often unimportant/irrelevant

More Themes

Set-valued thinking as albatross?

E.g. eddies vs. Kabra/DeWitt or Tukwila

E.g. SteMs vs. Materialized Views

E.g. CACQ vs. NiagaraCQ

Some clean theory here would be nice

Current routing correctness proofs are inelegant

Extensibility

Model/language of choice is not clear

SEQ? Relational? XQuery?

Extensible operators, edge semantics

[A whine about VLDB’s absurd “Specificity Factor”]

Conclusions?

Too early for technical conclusions

Of this I’m sure:

The CS262 experiment is a success

Our students are getting a bigger picture than before

I’m learning, finding new connections

May morph to OS/Nets, Nets/DB

Eventually rethink the systems software curriculum at the undergraduate level too

Nets folks are coming our way

Doing relevant work, eager to collaborate

DB community needs to branch out

Outbound: Better proselytizing in CS

Inbound: Need new ideas

Conclusions, cont.

Sabbatical is a good invention

Hasn’t even started, I’m already grateful!
