MapReduce performance discussion: Performance Refinement

Several refinements have been made to both the Map and Reduce sides of MapReduce; here we focus on locality, the optimization most closely tied to network bandwidth in the datacenter. Network bandwidth is a relatively scarce resource in this computing environment. MapReduce conserves network bandwidth by taking advantage of the fact that the input data is stored on the local disks of the machines that make up the cluster. GFS divides each file into 64 MB blocks and stores several copies of each block on different machines. The MapReduce master takes the location information of the input files into account and attempts to schedule a map task on a machine that holds a replica of the corresponding input data. Failing that, it attempts to schedule the map task near a replica of that task's input data; that is, the master does its best to pick a worker machine that is on the same network switch as a machine containing the data. Consequently, in the map phase most input data is read locally and consumes no network bandwidth.
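To make the preference order concrete, the master's choice can be sketched roughly as follows. This is only an illustrative sketch, not Google's implementation; the structures replica_hosts, switch_of, and idle_workers are hypothetical stand-ins for the metadata the master keeps.

    def pick_map_worker(replica_hosts, switch_of, idle_workers):
        """Choose a worker for a map task, preferring data locality.

        replica_hosts: machines holding a GFS replica of the task's input block
        switch_of:     dict mapping each machine to its network switch
        idle_workers:  machines currently able to accept a task
        """
        # 1) Best case: a machine that already stores the input block locally.
        for w in idle_workers:
            if w in replica_hosts:
                return w  # local disk read, no network traffic at all

        # 2) Next best: a machine on the same switch as some replica, so the
        #    read crosses only that switch instead of the core network.
        replica_switches = {switch_of[h] for h in replica_hosts}
        for w in idle_workers:
            if switch_of[w] in replica_switches:
                return w

        # 3) Otherwise any idle worker; the read must traverse the core.
        return next(iter(idle_workers), None)

The three steps mirror the text above: local disk first, same switch second, anywhere in the cluster last.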
Google evaluated performance using the TeraSort benchmark [1], a typical sorting task over large data sets. The programs were executed on a cluster of approximately 1800 machines. Each machine had two 2 GHz Intel Xeon processors, 4 GB of memory, two 160 GB IDE disks, and a gigabit Ethernet link. The machines were arranged in a two-level tree-shaped switched network, the structure most commonly built for datacenters, with approximately 100-200 Gbps of aggregate bandwidth available at the root.
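These numbers already hint at why locality matters so much: roughly 1800 gigabit-linked machines sit behind a 100-200 Gbps root, so the core of the tree is oversubscribed by about 9-18x, leaving each machine only around 55-110 Mbps if all of them send across the root at once. A back-of-the-envelope check, using only the figures from the cluster description above:

    machines = 1800
    link_gbps = 1.0                    # gigabit Ethernet per machine
    root_capacities_gbps = (100, 200)  # aggregate bandwidth at the root

    offered_gbps = machines * link_gbps   # up to 1800 Gbps if every machine sends
    for root in root_capacities_gbps:
        oversub = offered_gbps / root                 # oversubscription ratio
        per_machine_mbps = root / machines * 1000     # fair share across the root
        print(f"root={root} Gbps -> {oversub:.0f}:1 oversubscribed, "
              f"~{per_machine_mbps:.0f} Mbps per machine across the root")
    # root=100 Gbps -> 18:1 oversubscribed, ~56 Mbps per machine across the root
    # root=200 Gbps -> 9:1 oversubscribed, ~111 Mbps per machine across the root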
The figure reporting the results of this task plots the seconds elapsed since the start of the MapReduce job on the x-axis against throughput in MB/s on the y-axis. It contains three sub-plots, one for each phase: Map, Shuffle, and Reduce. Notice that some shuffle work starts even though the input (map) tasks have not finished, and the output likewise starts before the shuffle completes. This overlap is caused by the pipelined implementation of MapReduce.
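The pipelining can be pictured as follows. This is a minimal sketch of the idea under the assumption that finished map tasks are announced on a queue; it is not the actual MapReduce implementation.

    import queue
    import threading

    completed_maps = queue.Queue()   # the master pushes finished map task IDs here
    NUM_MAPS = 8

    def map_worker(task_id):
        # ... run the user's Map function, write intermediate files locally ...
        completed_maps.put(task_id)  # announce completion so shuffle can fetch

    def shuffle_fetcher():
        # The shuffle does not wait for ALL maps: it fetches each map's
        # intermediate output as soon as that single map finishes, which is
        # why shuffle throughput rises before map throughput has ended.
        for _ in range(NUM_MAPS):
            task_id = completed_maps.get()
            print(f"fetching intermediate data of finished map task {task_id}")

    threading.Thread(target=shuffle_fetcher).start()
    for tid in range(NUM_MAPS):
        map_worker(tid)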
The input rate is clearly higher than the shuffle rate and the output rate because, thanks to the locality optimization, most data is read from a local disk and bypasses the relatively bandwidth-constrained network.
The shuffle rate is higher than the output rate because the output phase writes two copies of the sorted data (MapReduce makes two replicas of the output for reliability and availability reasons), so each logical output byte costs two written bytes.
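As a rough sanity check of that 2x effect (the write budget below is an arbitrary illustrative number, not a measured value):

    raw_write_mb_s = 100     # hypothetical raw write capacity of one worker
    output_replicas = 2      # MapReduce writes two copies of the sorted output

    effective_output_mb_s = raw_write_mb_s / output_replicas
    print(effective_output_mb_s)   # 50.0: each logical byte costs two written bytes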
Given that the input tasks can exploit the locality strategy to consume less network bandwidth, while the shuffle and reduce tasks cannot, the throughput gap between input and shuffle, and between input and output, results mainly from intensive network usage and remote data access. In fact, the network cost of accessing data across machines is the main problem. Moreover, as the previous section (which introduced the operations in MapReduce) explained, general operations such as propagating the program to all worker machines and the delayed interaction with GFS to open the large set of input files also pose a great overhead on network usage among many nodes.
Many services at Google need to process large data sets every day. MapReduce is a programming model, and an associated implementation, for processing and generating large data sets. Programs written in MapReduce are automatically parallelized and executed on a large cluster. The model is easy to use, even for programmers without experience with parallel and distributed systems, since it hides the details of parallelization, fault tolerance, locality optimization, and load balancing. The computation takes a set of input key/value pairs and produces a set of output key/value pairs. Two functions, Map and Reduce, express this computation. Map takes an input pair and produces a set of intermediate key/value pairs. Reduce accepts an intermediate key and a set of values for that key, and merges these values together to form a possibly smaller set of values.
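The canonical example of this model is word counting. Below is a minimal in-process sketch in Python; the driver loop at the bottom stands in for the framework's grouping of intermediate pairs by key.

    from collections import defaultdict

    def map_fn(doc_name, contents):
        # Map: for every word in the document, emit an intermediate pair (word, 1).
        for word in contents.split():
            yield word, 1

    def reduce_fn(word, counts):
        # Reduce: merge all counts emitted for one word into a single sum.
        return word, sum(counts)

    # Tiny in-process driver standing in for the framework: group the
    # intermediate pairs by key, then apply Reduce to each group.
    intermediate = defaultdict(list)
    for key, value in map_fn("doc1", "the quick fox jumps over the lazy dog"):
        intermediate[key].append(value)

    results = [reduce_fn(word, counts) for word, counts in intermediate.items()]
    print(sorted(results))   # [('dog', 1), ('fox', 1), ..., ('the', 2)]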
However, behind this powerful service, MapReduce still faces the same intensive network usage problem as other datacenter workloads. Thus, in the next section, we discuss DCTCP, which was published to solve exactly this problem in large datacenters.