Camdoop: Exploiting In-network Aggregation for Big Data Applications
Paolo Costa, joint work with Austin Donnelly, Antony Rowstron, and Greg O'Shea (MSR Cambridge)
Presented by: Navid Hashemi

MapReduce Overview
[Figure: an input file split into chunks feeding map tasks, whose intermediate results are shuffled to reduce tasks that produce the final results]
• Map
− Processes the input data and generates (key, value) pairs
• Shuffle
− Distributes the intermediate pairs to the reduce tasks
• Reduce
− Aggregates all values associated with each key

Problem
• The shuffle phase is challenging for data center networks
− This has led to proposals for full-bisection-bandwidth topologies

Data Reduction
• The final results are typically much smaller than the intermediate results
− In most Facebook jobs the final size is 5.4 % of the intermediate size
− In most Yahoo jobs the ratio is 8.2 %
• How can we exploit this to reduce the traffic and improve the performance of the shuffle phase?

Background: Combiners
• To reduce the data transferred in the shuffle, users can specify a combiner function
− It aggregates the local intermediate pairs before they leave the server
• Server-side only => limited aggregation

Distributed Combiners
• It has been proposed to use aggregation trees in MapReduce to perform multiple steps of combiners
− e.g., rack-level aggregation [Yu et al., SOSP'09]

Logical and Physical Topology
What happens when we map the aggregation tree to a typical data center topology?
[Figure: a logical tree with one parent and two children mapped onto servers behind a ToR switch]
• Two logical links are mapped onto the same physical link, leaving only 500 Mbps per child on a 1 Gbps server link
• The server link is the bottleneck; full-bisection bandwidth does not help here
• The root cause is the mismatch between the physical and the logical topology

Camdoop Goal
• Perform the combiner functions within the network, as opposed to application-level solutions
• Reduce the shuffle time by aggregating packets on path
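To make the combiner idea concrete before the design details, here is a minimal word-count-style sketch (my own illustration, not code from the talk). Because the aggregation function is commutative and associative, the same function can be applied at the map side, at intermediate tree vertices, or at the final reducer:

```python
from collections import defaultdict

def map_task(chunk):
    """Map: emit a (word, 1) pair for every word in the input chunk."""
    for word in chunk.split():
        yield (word, 1)

def combiner(pairs):
    """Combiner: pre-aggregate (key, value) pairs locally.
    Addition is commutative and associative, so partial sums can be
    summed again later, at the reducer or, in Camdoop, on path."""
    sums = defaultdict(int)
    for key, value in pairs:
        sums[key] += value
    return sums.items()

def reduce_task(pairs):
    """Reduce: final aggregation of all partial sums per key."""
    return combiner(pairs)  # same function: sum the partial counts

# The combiner turns one pair per word occurrence into one pair per
# distinct word, which is exactly the data reduction the shuffle exploits.
local = combiner(map_task("a rose is a rose is a rose"))
print(sorted(local))  # [('a', 3), ('is', 2), ('rose', 3)]
```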
How Can We Perform In-network Processing?
• We exploit CamCube
− Direct-connect topology: a 3D torus
− Uses no switches / routers for internal traffic
− Servers intercept, forward and process packets
• Nodes have (x, y, z) coordinates
− This defines a key-space (=> key-based routing)
− Coordinates are locally re-mapped in case of failures
• Key property: no distinction between network and computation devices
− Servers can perform arbitrary packet processing on path

Key-based Routing in CamCube
• Keys are encoded as 160-bit identifiers; each server is responsible for a subset of them
• The k most significant bits are used to generate the (x, y, z) coordinates
• If a server is unreachable, its identifiers are mapped to its neighbors
− The remaining (160 - k) bits are used to determine the neighbor

Mapping a Tree...
• ...on a switched topology: the 1 Gbps server link becomes a (1 / in-degree) Gbps bottleneck
• ...on CamCube: packets are aggregated on path (=> less traffic), with a 1:1 mapping between the logical and the physical topology
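A minimal sketch of the key-to-coordinate mapping just described, assuming k = 24 and SHA-1 identifiers on a 3 x 3 x 3 CamCube; the concrete bit layout (eight top bits per axis) is my illustration, not necessarily CamCube's actual one:

```python
import hashlib

SIDE = 3  # a 3 x 3 x 3 CamCube, as in the paper's testbed

def key_id(key: bytes) -> int:
    """160-bit identifier, e.g. SHA-1 of the key (an assumption)."""
    return int.from_bytes(hashlib.sha1(key).digest(), "big")

def coordinates(identifier: int) -> tuple:
    """Derive (x, y, z) from the k = 24 most significant bits.
    Each axis consumes 8 of the top bits, reduced modulo the cube
    side; an illustrative layout only."""
    top = identifier >> (160 - 24)
    x = (top >> 16) & 0xFF
    y = (top >> 8) & 0xFF
    z = top & 0xFF
    return (x % SIDE, y % SIDE, z % SIDE)

def fallback_neighbor(identifier: int, live_neighbors: list) -> tuple:
    """If the responsible server is unreachable, the remaining
    (160 - k) low-order bits deterministically pick one of its
    neighbors (sketch)."""
    low = identifier & ((1 << (160 - 24)) - 1)
    return live_neighbors[low % len(live_neighbors)]

ident = key_id(b"some intermediate key")
print(coordinates(ident))  # the server responsible for this key
```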
Camdoop Design Goals
1. No change in the programming model
2. Exploit network locality
3. Good server and link load distribution
4. Fault-tolerance

Design Goal #1: Programming Model
• Camdoop adopts the same MapReduce model
• GFS-like distributed file system
− Each server runs map tasks on its local chunks
• We use a spanning tree
− Combiners aggregate the map task's results and the children's results (if any) and stream the output to the parents
− The root runs the reduce task and generates the final output

Design Goal #2: Network Locality
• How do we map the tree nodes to servers?
• Simple rule: map task outputs are always read from the local disk
• Parent and children vertices are mapped onto physical neighbors, e.g., (1,1,1) as the parent of (1,2,1) and (1,2,2)
− This ensures maximum locality and optimizes the network transfer
− One physical link is used by one and only one logical link

Design Goal #3: Load Distribution
• With a single tree, vertices have different in-degrees: only 1 Gbps (instead of 6) is used per server, and the server load distribution is poor
• Solution: stripe the data across six disjoint trees
− Different links are used by different trees; all links are used => (up to) 6 Gbps per server
− Good server and link load distribution

Design Goal #4: Fault-Tolerance
• The tree is built in the coordinate space
• Link failure
− CamCube's key-based routing service re-routes the packets along the new shortest path
• Server failure
− CamCube remaps the failed server's coordinates and the API notifies all servers
− The parent of a vertex v receives the notification and sends a control packet containing the last key received to the newly mapped server
− Each child of v re-sends all (key, value) pairs from that last key onwards

Evaluation
Testbed
− 27-server CamCube (3 x 3 x 3)
− Quad-core Intel Xeon 5520 2.27 GHz, 12 GB RAM
− 6 Intel PRO/1000 PT 1 Gbps ports per server
Simulator
− Packet-level simulator (CPU overhead not modelled)
− 512-server (8 x 8 x 8) CamCube

Evaluation: Design and Implementation Recap
• Camdoop: in-network aggregation over six disjoint trees on CamCube, with the shuffle and reduce phases parallelized
− Since all streams are ordered, as soon as the root receives at least one packet from every child it can start the reduce function
− No need to store the intermediate results to disk on the reduce servers
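A hedged sketch of this pipelined shuffle-and-reduce, with helper names of my own: each tree vertex merges the key-ordered streams coming from its children with its local map output, aggregates all values for a key, and streams the result upward. Because every input stream is sorted, a key is complete as soon as all streams have moved past it, so aggregation starts before the shuffle finishes:

```python
import heapq
from itertools import groupby
from operator import itemgetter

def vertex_stream(local_pairs, child_streams, combine):
    """Run at each tree vertex: merge the sorted child streams with
    the locally produced pairs, combine all values per key, and yield
    (key, combined) upward to the parent as soon as it is ready."""
    merged = heapq.merge(sorted(local_pairs), *child_streams)
    for key, group in groupby(merged, key=itemgetter(0)):
        yield key, combine(value for _, value in group)

# Toy three-vertex tree: two leaves streaming into a root (combine = sum).
leaf1 = vertex_stream([("a", 1), ("b", 2)], [], sum)
leaf2 = vertex_stream([("a", 3), ("c", 1)], [], sum)
root = vertex_stream([("b", 1)], [leaf1, leaf2], sum)
print(list(root))  # [('a', 4), ('b', 3), ('c', 1)]
```

Since everything is a generator, no vertex materializes the full intermediate data, which matches the recap's point that reduce servers never store intermediate results to disk.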
Evaluation: Baselines
• TCP Camdoop (switch)
− The 27 CamCube servers are attached to a ToR switch
− TCP is used to transfer the data in the shuffle phase
• Camdoop (no agg)
− Like Camdoop, but without in-network aggregation (as required for non-commutative / non-associative functions)
− Shows the impact of just running on CamCube

Validation against Hadoop & Dryad
[Figure: time (s, log scale) for Sort and WordCount on Hadoop, Dryad/DryadLINQ, TCP Camdoop (switch), and Camdoop (no agg)]
• Sort and WordCount
− Sort: intermediate and output data have the same size; WordCount: significant aggregation
• The Camdoop baselines are competitive against Hadoop and Dryad, for several reasons:
− Shuffle and reduce are parallelized
− Fine-tuned implementation
• We consider these as our baseline

Parameter Sweep
• Output size / intermediate size (S)
− S = 1 (no aggregation): every key is unique
− S = 1/N ≈ 0 (full aggregation): every key appears in all map task outputs
− Intermediate data size is 22.2 GB (843 MB/server)
• Number of reduce tasks (R)
− R = 1 (all-to-one): e.g., interactive queries, top-K jobs
− R = N (all-to-all): the common setup in MapReduce jobs; N output files are generated

All-to-one (R = 1)
[Figure: time (s, log scale) vs. S for TCP Camdoop (switch), Camdoop (no agg), and Camdoop; the Facebook-reported aggregation ratio (~0.05) is marked on the S axis]
• The gap between TCP Camdoop (switch) and Camdoop (no agg) shows the impact of running on CamCube; the further gap to Camdoop shows the impact of in-network aggregation
• The baselines' performance is independent of S; Camdoop improves as S decreases

All-to-all (R = 27)
[Figure: time (s, log scale) vs. S, adding "Switch 1 Gbps (bound)", the maximum theoretical performance over the switch]
• Even against the 1 Gbps switch bound, in-network aggregation pays off as S decreases

Number of Reduce Tasks (S = 0)
[Figure: time (s, log scale) vs. R from 1 (all-to-one) to 27 (all-to-all); annotated speedups of Camdoop over the switch baseline: 10.19x at R = 1, then 4.31x and 1.91x, down to 1.13x at R = 27]
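The next slide notes why Camdoop barely depends on R: the striping across six disjoint trees (Design Goal #3). A hedged sketch of one plausible assignment policy, hashing each key to a tree; the slides do not spell out the actual policy, so this is an assumption for illustration:

```python
NUM_TREES = 6  # one disjoint spanning tree per 1 Gbps link

def tree_for_key(key) -> int:
    """Assign each (key, value) pair to one of the six disjoint trees.
    Hashing the key keeps all occurrences of a key in the same tree,
    so it can still be fully aggregated on path. A real system would
    need a hash that is deterministic across servers; Python's hash()
    is used here only for illustration."""
    return hash(key) % NUM_TREES

def stripe(pairs):
    """Split a map task's output into six per-tree streams."""
    streams = [[] for _ in range(NUM_TREES)]
    for key, value in pairs:
        streams[tree_for_key(key)].append((key, value))
    return streams

# Even with a single reduce task (R = 1), each of the six roots
# aggregates roughly one sixth of the key space, so all links and
# all servers contribute to the aggregation phase.
```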
Number of Reduce Tasks (S = 0): Discussion
• The baselines' performance depends on R; R does not (significantly) impact Camdoop's performance
• At large R the remaining gap is an implementation bottleneck rather than a network one
• All resources are used in the aggregation phase even when R = 1
− The number of output files just becomes a configuration parameter

Behavior at Scale (simulated)
• N = 512, S = 0
[Figure: time (s, log scale) vs. R from 1 (all-to-one) to 512 (all-to-all): Camdoop vs. Switch 1 Gbps (bound); the gap at R = 1 is annotated as 512x]
• The switch bound assumes full-bisection bandwidth

Failure Experiment
• Without a failure, a single tree edge maps to a single one-hop link in CamCube
• With a failure, a single tree edge becomes a multi-hop path in CamCube
• The graph shows the shuffle and reduce time normalized against T, the time taken to run without failures

Beyond MapReduce
• The partition-aggregate model is also common in interactive services
− e.g., Bing Search, Google Dremel: a web server fans each request out to parent servers and, from there, to leaf servers, and the responses are aggregated on the way back
• Small-scale data (a good use case for CamCube)
− 10s to 100s of KB returned per server
• Typically these services use one reduce task (R = 1)
− A single result must be returned to the user
• Full aggregation is common (S ≈ 0)
− There is a large opportunity for aggregation

Small-scale Data (R = 1, S = 0)
[Figure: time (ms) for 20 KB and 200 KB of input data per server: TCP Camdoop (switch), Camdoop (no agg), and Camdoop]
• In-network aggregation can be beneficial also for (small-scale data) interactive services

Conclusions
• Camdoop
− Explores the benefits of in-network processing by running combiners within the network
− No change in the programming model
− Achieves lower shuffle and reduce time
− Decouples performance from the number of output files
• A final thought: how would Camdoop run on this?
− AMD SeaMicro: a 512-core cluster for data centers using a 3D torus
− Fast interconnect: 5 Gbps / link

Questions?