Camdoop
Exploiting In-network Aggregation
for Big Data Applications
Paolo Costa
joint work with
Austin Donnelly, Antony Rowstron, and Greg O’Shea (MSR Cambridge)
Presented by:
Navid Hashemi
MapReduce Overview
[Figure: an input file is split into chunks; each chunk is processed by a map task, the intermediate results are shuffled to the reduce tasks, and the reduce tasks produce the final results]
• Map
− Processes input data and generates (key, value) pairs
• Shuffle
− Distributes the intermediate pairs to the reduce tasks
• Reduce
− Aggregates all values associated to each key
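The three phases above can be sketched in a few lines of Python. This is a hypothetical word-count illustration (the function names `map_task`, `shuffle`, and `reduce_task` are mine, not Camdoop or Hadoop API):

```python
from collections import defaultdict

def map_task(chunk):
    """Map: process input data and emit (key, value) pairs."""
    return [(word, 1) for word in chunk.split()]

def shuffle(map_outputs):
    """Shuffle: distribute intermediate pairs, grouping them by key."""
    groups = defaultdict(list)
    for pairs in map_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_task(key, values):
    """Reduce: aggregate all values associated with a key."""
    return key, sum(values)

chunks = ["a rose is", "a rose"]
grouped = shuffle(map_task(c) for c in chunks)
result = dict(reduce_task(k, v) for k, v in grouped.items())
# result == {"a": 2, "rose": 2, "is": 1}
```

Note that every intermediate pair crosses the network in the shuffle; this is the traffic the rest of the talk attacks.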
Paolo Costa
Camdoop: Exploiting In-network Aggregation for Big Data Applications
2/52
Problem
• The shuffle phase is challenging for data center networks
− Led to proposals for full-bisection bandwidth
Data Reduction
• The final results are typically much smaller than the intermediate results
• In most Facebook jobs the final size is 5.4% of the intermediate size
• In most Yahoo jobs the ratio is 8.2%
Data Reduction
• The final results are typically much smaller than
the intermediate results
How can we exploit this to reduce the traffic and
improve the performance of the shuffle phase?
Background: Combiners
[Figure: a combiner sits between each map task and the shuffle, aggregating the local intermediate pairs before they leave the server]
• To reduce the data transferred in the shuffle,
users can specify a combiner function
− Aggregates the local intermediate pairs
• Server-side only => limited aggregation
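A combiner is just a local, per-server pre-aggregation. A minimal sketch (the `combiner` function is a hypothetical stand-in for a user-specified combiner such as summing word counts):

```python
from collections import defaultdict

def combiner(pairs):
    """Aggregate a map task's local intermediate pairs before the shuffle."""
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return list(local.items())

map_output = [("rose", 1), ("is", 1), ("rose", 1), ("rose", 1)]
combined = combiner(map_output)
# 4 pairs shrink to 2, e.g. [("rose", 3), ("is", 1)]
```

Because it runs only on the server that produced the pairs, it can never merge pairs emitted by different map tasks: that is the "server-side only" limitation the slide points at.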
Distributed Combiners
• It has been proposed to use aggregation trees in
MapReduce to perform multiple steps of combiners
− e.g., rack-level aggregation [Yu et al., SOSP’09]
Logical and Physical Topology
What happens when we map the tree
to a typical data center topology?
[Figure: the logical aggregation tree mapped onto a physical topology with a ToR switch]
• Each child gets only 500 Mbps: the server link is the bottleneck
• Full-bisection bandwidth does not help here
• Mismatch between physical and logical topology: two logical links are mapped onto the same physical link
Camdoop Goal
Perform the combiner functions within the network as
opposed to application-level solutions
Reduce shuffle time by aggregating packets on path
How Can We Perform In-network Processing?
• We exploit CamCube
− Direct-connect topology: a 3D torus
− Uses no switches / routers for internal traffic
• Nodes have (x, y, z) coordinates
− This defines a key-space (=> key-based routing)
− Coordinates are locally re-mapped in case of failures
• Servers intercept, forward, and process packets
Key Property
No distinction between network and computation devices
Servers can perform arbitrary packet processing on-path
In CamCube:
• Keys are encoded as 160-bit identifiers
• Each server is responsible for a subset of them
• The k most significant bits are used to generate (x, y, z)
• If a server is unreachable, its identifiers are mapped to its neighbors
• The remaining (160 − k) bits are used to determine the neighbor
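The key-to-coordinate mapping can be sketched as follows. The exact bit layout is an assumption for illustration; `key_to_coord` and `SIDE` are hypothetical names, not CamCube API:

```python
SIDE = 3  # a 3 x 3 x 3 CamCube, as in the talk's testbed

def key_to_coord(key: int, side: int = SIDE) -> tuple:
    """Map a 160-bit key to an (x, y, z) coordinate using its most
    significant bits; the remaining bits could pick among neighbors."""
    msb = key >> (160 - 24)          # take the top 24 bits of the key
    x = msb % side
    y = (msb // side) % side
    z = (msb // (side * side)) % side
    return (x, y, z)

coord = key_to_coord(0x9F << 152)    # an arbitrary 160-bit key
assert all(0 <= c < SIDE for c in coord)
```

The point is only that the mapping is deterministic and coordinate-local, so any server can route a packet toward the key's owner without a directory lookup.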
Mapping a tree…
… on a switched topology
• The 1 Gbps server link becomes the bottleneck: each child gets only 1/in-degree Gbps
… on CamCube
• Packets are aggregated on path (=> less traffic)
• 1:1 mapping between logical and physical topology: each logical link gets the full 1 Gbps
Camdoop Design
Goals
1. No change in the programming model
2. Exploit network locality
3. Good server and link load distribution
4. Fault-tolerance
Design Goal #1
Programming Model
• Camdoop adopts the same MapReduce model
• GFS-like distributed file system
− Each server runs map tasks on local chunks
• We use a spanning tree
− Combiners aggregate the local map task output and the children's results (if any) and stream the result to their parent
− The root runs the reduce task and generates the final output
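The tree-based aggregation just described can be sketched as follows (a hypothetical illustration with a commutative/associative combiner; `combine` is my name, not Camdoop's):

```python
from collections import Counter

def combine(local_pairs, child_results):
    """Each vertex merges its local map output with its children's
    aggregates and passes the result up toward its parent."""
    total = Counter(dict(local_pairs))
    for child in child_results:
        total.update(dict(child))
    return list(total.items())

# A 3-server chain: leaf -> middle -> root
leaf = combine([("k", 1)], [])
middle = combine([("k", 2), ("j", 1)], [leaf])
root = combine([("j", 4)], [middle])   # the root then runs the reduce
final = sorted(root)
# final == [("j", 5), ("k", 3)]
```

Every edge of the tree carries already-aggregated data, so the traffic shrinks at each hop instead of converging unreduced on the root.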
Design Goal #2
Network locality
How to map the tree nodes to servers?
Design Goal #2
Network locality
Simple rule: Map task outputs are always read from the local disk
Design Goal #2
Network locality
Parent and children are mapped onto physical neighbors
Design Goal #2
Network locality
This ensures maximum locality and optimizes network transfer
Network Locality
[Figure: logical view vs. physical view (3D torus)]
One physical link is used by one and only one logical link
Design Goal #3
Load Distribution
Design Goal #3
Load Distribution
• With a single tree, the root receives only 1 Gbps (instead of 6)
• Vertices have different in-degrees
• Poor server load distribution
Design Goal #3
Load Distribution
Solution: stripe the data across 6 disjoint trees
− All links are used => (up to) 6 Gbps per server
− Good load distribution
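Striping by key across the six trees can be sketched like this (a hypothetical illustration; `tree_for_key` and the hash-based assignment are my assumptions, the talk does not specify the striping function):

```python
NUM_TREES = 6  # one disjoint tree per 1 Gbps CamCube link

def tree_for_key(key: str) -> int:
    """Assign each intermediate key to one of the six disjoint trees."""
    return hash(key) % NUM_TREES

def stripe(pairs):
    """Partition (key, value) pairs so each stripe flows up its own tree."""
    stripes = [[] for _ in range(NUM_TREES)]
    for key, value in pairs:
        stripes[tree_for_key(key)].append((key, value))
    return stripes

pairs = [(f"key{i}", 1) for i in range(600)]
stripes = stripe(pairs)
assert sum(len(s) for s in stripes) == len(pairs)
```

Because the trees are link-disjoint, each stripe saturates a different 1 Gbps link, which is where the "(up to) 6 Gbps per server" figure comes from.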
Design Goal #4
Fault-tolerance
The tree is built in the coordinate space
Link failure:
• Using CamCube's key-based routing service and shortest paths, packets are rerouted along a new path
Server failure:
• CamCube remaps the failed server's coordinates
• The API notifies all the servers
• The parent of vertex v receives the notification
• It sends a control packet containing the last key received to the newly mapped server
• Each child of v resends all (key, value) pairs from the last key onwards
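The resend step relies on the streams being ordered by key. A minimal sketch of the replay logic (hypothetical; `resend_from` is my name for the child-side behavior described above):

```python
def resend_from(stream, last_acked_key):
    """Replay an ordered (key, value) stream from the last key the
    parent had received onward; earlier pairs were already aggregated."""
    keys = [k for k, _ in stream]
    start = keys.index(last_acked_key) if last_acked_key in keys else 0
    return stream[start:]

child_stream = [("a", 1), ("b", 2), ("c", 3), ("d", 4)]
replayed = resend_from(child_stream, "c")
# replayed == [("c", 3), ("d", 4)]
```

Ordering is what makes a single "last key" a sufficient checkpoint: nothing before it can still be in flight.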
Evaluation
Testbed
− 27-server CamCube (3 x 3 x 3)
− Quad-core Intel Xeon 5520 @ 2.27 GHz
− 12 GB RAM
− 6 Intel PRO/1000 PT 1 Gbps ports
Simulator
− Packet-level simulator (CPU overhead not modelled)
− 512-server (8 x 8 x 8) CamCube
Evaluation
Design and implementation recap
Camdoop: shuffle & reduce parallelized
• The reduce phase is parallelized with the shuffle phase
− Since all streams are ordered, as soon as the root receives at least one packet from all children, it can start the reduce function
− No need to store the intermediate results to disk on the reduce servers
[Figure: timeline showing map, then shuffle and reduce overlapping]
Evaluation
Design and implementation recap
Camdoop: shuffle & reduce parallelized; runs on CamCube; six disjoint trees; in-network aggregation
Evaluation
Design and implementation recap
TCP Camdoop (switch): shuffle & reduce parallelized only (no CamCube, no trees, no in-network aggregation)
• TCP Camdoop (switch)
− 27 CamCube servers attached to a ToR switch
− TCP is used to transfer data in the shuffle phase
Evaluation
Design and implementation recap
Camdoop (no agg): shuffle & reduce parallelized; runs on CamCube; six disjoint trees; no in-network aggregation
• Camdoop (no agg)
− Like Camdoop but without in-network aggregation (for non-commutative/associative functions)
− Shows the impact of just running on CamCube
Validation against Hadoop & Dryad
[Chart: running time (log scale) of Hadoop, Dryad/DryadLINQ, TCP Camdoop (switch), and Camdoop (no agg) on Sort and WordCount]
• Sort and WordCount
− Sort: intermediate and output data have the same size
− WordCount: significant aggregation
• The Camdoop baselines are competitive against Hadoop and Dryad
• Several reasons:
− Shuffle and reduce parallelized
− Fine-tuned implementation
• We consider these, TCP Camdoop (switch) and Camdoop (no agg), as our baselines
Parameter Sweep
• Output size / intermediate size (S)
− S = 1 (no aggregation)
o Every key is unique
− S = 1/N ≈ 0 (full aggregation)
o Every key appears in all map task outputs
− Distribute the workload
o Intermediate data size is 22.2 GB (843 MB/server)
• Reduce tasks (R)
− R = 1 (all-to-one)
o E.g., interactive queries, top-K jobs
− R = N (all-to-all)
o Common setup in MapReduce jobs
o N output files are generated
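The sweep's endpoints follow from simple counting; a sketch of the arithmetic (the per-server pair count is an arbitrary illustrative number, not from the talk):

```python
N = 27                    # servers in the testbed
pairs_per_server = 100    # hypothetical map output per server

intermediate = N * pairs_per_server

# Full aggregation: every key appears in all N map outputs, so N
# values per key collapse to one in the final result.
full_agg_output = pairs_per_server
S_full = full_agg_output / intermediate   # = 1/N, approx. 0.037

# No aggregation: every key is unique, nothing can be merged.
no_agg_output = intermediate
S_none = no_agg_output / intermediate     # = 1.0
```

So S ranges from 1/N (everything merges) up to 1 (nothing does), which is exactly the x-axis of the charts that follow.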
All-to-one (R=1)
[Chart: time (log scale) vs. output size / intermediate size (S), from full aggregation (S ≈ 0) to no aggregation (S = 1), for TCP Camdoop (switch), Camdoop (no agg), and Camdoop]
• The gap between Camdoop and Camdoop (no agg) shows the impact of in-network aggregation
• The gap between Camdoop (no agg) and TCP Camdoop (switch) shows the impact of running on CamCube
• The baselines' performance is independent of S
• An x-axis marker shows the Facebook-reported aggregation ratio, well inside the region where Camdoop wins
All-to-all (R=27)
[Chart: time (log scale) vs. output size / intermediate size (S) for TCP Camdoop (switch), Camdoop, Camdoop (no agg), and a switch 1 Gbps bound]
• The 1 Gbps bound is the maximum theoretical performance over the switch; the gaps again show the impact of running on CamCube and of in-network aggregation
Number of reduce tasks (S=0)
[Chart: time (log scale) vs. # reduce tasks (R), from all-to-one (R=1) to all-to-all (R=27), for TCP Camdoop (switch), Camdoop (no agg), and Camdoop; speedup markers at 10.19x, 4.31x, 1.91x, and 1.13x]
• For the baselines, performance depends on R (an implementation bottleneck at R = 1)
• For Camdoop, R does not (significantly) impact performance
• All resources are used in the aggregation phase even when R = 1
• The number of output files just becomes a configuration parameter
Behavior at scale (simulated)
N=512, S=0
[Chart: time (log scale) vs. # reduce tasks (R, log scale), from all-to-one (R=1) to all-to-all (R=512), for Camdoop and a switch 1 Gbps bound; up to a 512x gap]
• The switch bound assumes full-bisection bandwidth
Failure experiment
• Without failure: a single edge in the tree maps to a single one-hop link in CamCube
• With failure: a single edge becomes a multi-hop path in CamCube
• The graph shows the shuffle and reduce time normalized against T, the time taken to run without failure
Beyond MapReduce
[Figure: partition-aggregate tree: requests arrive at a web server, then fan out through cache and parent servers to leaf servers]
• The partition-aggregate model is also common in interactive services
− e.g., Bing Search, Google Dremel
• Small-scale data (a good use case for CamCube)
− 10s to 100s of KB returned per server
• Typically, these services use one reduce task (R=1)
− A single result must be returned to the user
• Full aggregation is common (S ≈ 0)
− There is a large opportunity for aggregation
Small-scale data (R=1, S=0)
[Chart: time (ms) for TCP Camdoop (switch), Camdoop (no agg), and Camdoop with 20 KB and 200 KB of input data per server]
• In-network aggregation can be beneficial also for (small-scale data) interactive services
Conclusions
• Camdoop
− Explores the benefits of in-network processing by
running combiners within the network
− No change in the programming model
− Achieves lower shuffle and reduce time
− Decouples performance from the # of output files
• A final thought: how would Camdoop run on this?
− AMD SeaMicro – a 512-core cluster
for data centers using a 3D torus
− Fast interconnect: 5 Gbps / link
Questions?