A Software-Defined Networking based Approach for Performance Management of Analytical Queries on Distributed Data Stores
Pengcheng Xiong (NEC Labs America)
Hakan Hacigumus (NEC Labs America)
Jeffrey F. Naughton (Univ. of Wisconsin)
Agenda
Why?
    Motivation and background
How?
    System architecture and implementation
So what?
    Real system and benchmark query evaluation
    Conclusion
Motivation
Data analytics applications or data scientists query the data from distributed stores.
    A huge amount of data traffic on the network.
Many applications want to share a cluster
    Data backup, video streaming, etc.
Response time is critical
    Join
    Deadline-driven reports
Query service differentiation
    Batch queries, interactive queries
An example query (TPC-H Q14)
lineitem: Data Store, Site Sl
part: Data Store, Site Sp
We assume that tables are distributed across relational data stores, which are connected by a network.
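To make the example concrete, the sketch below runs the TPC-H Q14 text (promotion effect) against two tiny in-memory SQLite tables. The sample rows, the single-database setup, and the rewrite of the date interval into SQLite's date() syntax are assumptions for illustration only. In the setting of the talk, lineitem and part live at different sites, so one side of the join has to cross the network.

```python
import sqlite3

# Toy stand-ins for the two TPC-H tables; only the columns Q14 touches.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE part (p_partkey INTEGER, p_type TEXT);
    CREATE TABLE lineitem (
        l_partkey INTEGER, l_extendedprice REAL,
        l_discount REAL, l_shipdate TEXT
    );
""")
conn.executemany("INSERT INTO part VALUES (?, ?)", [
    (1, "PROMO BRUSHED COPPER"),
    (2, "STANDARD POLISHED TIN"),
])
conn.executemany("INSERT INTO lineitem VALUES (?, ?, ?, ?)", [
    (1, 100.0, 0.0, "1995-09-10"),  # promo part, inside the month
    (2, 300.0, 0.0, "1995-09-15"),  # non-promo part, inside the month
    (1, 50.0, 0.0, "1995-10-05"),   # outside the month, filtered out
])

# Q14, with the one-month interval expressed through SQLite's date().
Q14 = """
SELECT 100.00 * SUM(CASE WHEN p_type LIKE 'PROMO%'
                         THEN l_extendedprice * (1 - l_discount)
                         ELSE 0 END)
       / SUM(l_extendedprice * (1 - l_discount)) AS promo_revenue
FROM lineitem, part
WHERE l_partkey = p_partkey
  AND l_shipdate >= '1995-09-01'
  AND l_shipdate < date('1995-09-01', '+1 month')
"""
promo_revenue = conn.execute(Q14).fetchone()[0]
print(promo_revenue)
```

The join predicate l_partkey = p_partkey is what forces data movement once the two tables sit at Site Sl and Site Sp.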
Network change implies plan performance change
(1) Huge gap
(2) The best plan can become the worst one
[Figure: plan performance as network status changes across Phase 1, Phase 2, and Phase 3]
What if?
What if the query optimizer could dynamically monitor the network bandwidth and adaptively choose a plan?
The adaptive plan is chosen and query execution time is kept short.
[Figure: Phase 1, Phase 2, Phase 3]
Network busy implies no good plan
User: Run the query right now. I need the result ASAP to meet my deadline!
Distributed DBMS: Well… I am sorry. None of the candidate plans can meet your deadline because the network is currently busy.
What if?
User: Run the query right now. I need the result ASAP to meet my deadline!
Distributed DBMS: OK. Although the network is currently busy, I can control it to prioritize bandwidth for the query.
What if the query optimizer could control the network?
Distributed query optimizer monitors
and controls the network?
Sounds like mission impossible
Databases have always treated the underlying network as a black box
    unable to monitor
    let alone control
With software-defined networking (SDN), a database can
    inquire about the current status of the network, or
    control the network with directives
[Figure labels: With SDN; Networking]
Sounds interesting, but how?
Ethernet Switch/Router
Control Path (Software)
Data Path (Hardware)
Dist. Query Optimizer
Our contribution
API
OpenFlow Controller
OpenFlow Protocol (SSL/TCP)
Control Path
OpenFlow
Data Path (Hardware)
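The API layer between the distributed query optimizer and the OpenFlow controller can be pictured with a small stand-in. The sketch below is not the talk's actual API: the class FakeSdnController and its methods available_bandwidth, set_priority, and reserve_bandwidth are hypothetical names, and the controller is simulated in memory instead of speaking the OpenFlow protocol to a switch. It only illustrates the two roles named on the slides, monitoring and control.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class FakeSdnController:
    """In-memory stand-in for an OpenFlow controller (hypothetical)."""
    link_capacity_mbps: float
    background_mbps: float = 0.0  # neighbor traffic on the link
    reservations: Dict[str, float] = field(default_factory=dict)
    priorities: Dict[str, int] = field(default_factory=dict)

    def available_bandwidth(self) -> float:
        """Monitor: bandwidth left over after background traffic."""
        return max(self.link_capacity_mbps - self.background_mbps, 0.0)

    def set_priority(self, flow_id: str, level: int) -> None:
        """Control: record a priority for a flow (0 is highest here)."""
        self.priorities[flow_id] = level

    def reserve_bandwidth(self, flow_id: str, mbps: float) -> bool:
        """Control: accept a reservation only if all reservations fit the link."""
        already = sum(v for k, v in self.reservations.items() if k != flow_id)
        if already + mbps > self.link_capacity_mbps:
            return False
        self.reservations[flow_id] = mbps
        return True


ctrl = FakeSdnController(link_capacity_mbps=1000.0, background_mbps=800.0)
print(ctrl.available_bandwidth())             # what a best-effort query could use
print(ctrl.reserve_bandwidth("q14", 600.0))   # directive issued for one query flow
```

An optimizer built on such an interface would call the monitoring method while costing plans and the control methods once it has committed to a plan.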
System architecture
System implementation
NEC PFS5240
Plan generation
Stores lineitem table
Stores part table
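One simple way to picture plan generation for a two-table join is to enumerate the possible join sites and list which tables must be shipped to each. The sketch below assumes exactly that plan space (one plan per join site); the plan space used in the talk may be richer. The names Transfer and candidate_plans are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Transfer:
    table: str
    src: str
    dst: str


def candidate_plans(locations: Dict[str, str],
                    sites: List[str]) -> Dict[str, List[Transfer]]:
    """One plan per join site: ship every table that is not already there."""
    plans = {}
    for join_site in sites:
        plans["join_at_" + join_site] = [
            Transfer(table, src, join_site)
            for table, src in locations.items()
            if src != join_site
        ]
    return plans


# lineitem is stored at Site Sl and part at Site Sp, as in the example query.
plans = candidate_plans({"lineitem": "Sl", "part": "Sp"}, ["Sl", "Sp"])
for name, transfers in plans.items():
    print(name, transfers)
```

Each plan differs only in which table crosses the network, which is why the network state can change the ranking of plans.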
Cost estimation
Cost model for network operator
    Amount of data transferred
    Real-time transfer speed
        (Monitor)
            Take any bandwidth left
        (Control)
            Assign the highest priority
            Make a bandwidth reservation
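The cost model above can be read as transfer time = amount of data transferred / transfer speed, where the speed comes either from monitoring (whatever bandwidth is left) or from control (a priority or a reservation). The sketch below is a minimal illustration under that reading. The table sizes, link speeds, and the names transfer_seconds and cheapest_plan are made-up assumptions, and only the network operator is costed.

```python
from typing import Dict, Tuple


def transfer_seconds(data_mb: float, speed_mbps: float) -> float:
    """Network operator cost: data volume divided by transfer speed."""
    if speed_mbps <= 0:
        return float("inf")
    return data_mb * 8.0 / speed_mbps  # megabytes to megabits


def cheapest_plan(plans: Dict[str, Tuple[str, float]],
                  speed_mbps: Dict[str, float]) -> Tuple[str, float]:
    """Cost every plan at the given link speeds and return the cheapest."""
    costs = {name: transfer_seconds(mb, speed_mbps[link])
             for name, (link, mb) in plans.items()}
    best = min(costs, key=costs.get)
    return best, costs[best]


# plan name -> (link used, megabytes shipped); sizes are illustrative only.
plans = {
    "ship_part_to_Sl": ("Sp->Sl", 2000.0),
    "ship_lineitem_to_Sp": ("Sl->Sp", 70000.0),
}
idle = {"Sp->Sl": 1000.0, "Sl->Sp": 1000.0}      # monitored: both links free
busy = {"Sp->Sl": 20.0, "Sl->Sp": 1000.0}        # monitored: Sp->Sl congested
reserved = {"Sp->Sl": 500.0, "Sl->Sp": 1000.0}   # controlled: 500 Mbps reserved

print(cheapest_plan(plans, idle))
print(cheapest_plan(plans, busy))
print(cheapest_plan(plans, reserved))
```

With these made-up numbers the preferred plan flips when the Sp->Sl link becomes congested and flips back once bandwidth is reserved on it, which mirrors the monitor and control cases on the slide.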
Evaluation
Setup
    TPC-H, scale factor 100, Q14
    Small tables (supplier, nation, region) are replicated.
    Each other table is placed at a single data store site.
    Neighbor traffic generator: iperf
Summary of case studies
Case 1: single user, single-thread, iperf
Based on SDN, the query optimizer can dynamically monitor the network bandwidth and adaptively choose the best plan.
[Figure: Phase 1, Phase 2, Phase 3, with bottleneck labels]
Case 3: multiple users, multi-thread, no contention traffic, priority queue
Based on SDN, premium queries run faster than regular ones.
Based on SDN, all queries run faster.
Case 5: single user, multi-thread, iperf, weighted-fair queue
Based on SDN, a larger reservation makes queries run faster.
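The effect of a larger reservation under a weighted-fair queue can be illustrated with an idealized split, where every flow always has data to send and receives bandwidth in proportion to its weight. This is a sketch under that assumption, not the switch's actual scheduler. The link capacity, weights, data size, and function names are made up for illustration.

```python
from typing import Dict


def wfq_shares(capacity_mbps: float,
               weights: Dict[str, float]) -> Dict[str, float]:
    """Idealized weighted-fair split among flows that are all backlogged."""
    total = sum(weights.values())
    return {flow: capacity_mbps * w / total for flow, w in weights.items()}


def transfer_seconds(data_mb: float, speed_mbps: float) -> float:
    """Time to ship data_mb megabytes at speed_mbps megabits per second."""
    return data_mb * 8.0 / speed_mbps


# The query flow competes with iperf traffic on a 1000 Mbps link.
times = []
for query_weight in (1, 3, 9):
    share = wfq_shares(1000.0, {"query": query_weight, "iperf": 1})["query"]
    times.append(transfer_seconds(2000.0, share))
    print(query_weight, share, times[-1])
```

As the query's weight grows relative to the iperf flow, its share of the link grows and the estimated transfer time shrinks.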
Conclusion
SDN can be effectively exploited for performance management of analytical queries on distributed data stores.
    Directly monitor the network and adaptively pick the best plan.
    Control the priority of network traffic or make network bandwidth reservations to differentiate the query service.
Lots of opportunities
Thanks!