
Hadoop system simulation with Mumak
Fei Dong, Tianyu Feng, Hong Zhang
CPS 216 Project
1 Content
This report is organized as follows: the "Clarifications and Important Points" section addresses only the questions discussed in the final presentation, while the other sections cover the technical details of the project.
2 Clarifications and Important Points
1. Mumak only counts user space time.
In SimulatorTaskTracker.java the simulation time increases only at:
long delta = tip.getUserSpaceRunTime();
long finishTime = now + delta;
in the function createTaskAttemptCompletionEvent.
The function getUserSpaceRunTime is defined in an inner class of TaskTracker,
with the explanation:
“Runtime of the user-space code of the task attempt. This is the full
runtime for map tasks, and only that of the REDUCE phase for reduce
tasks.”
So our previous statement was not accurate: Mumak counts the time of the whole map task attempt, but only the reduce phase (the phase running user code) of a reduce task attempt.
The part of the code corresponding to the reduce phase can be found in the function handleAllMapsCompletedTaskAction:
status.setPhase(Phase.REDUCE);
status.setShuffleFinishTime(now);
status.setSortFinishTime(now);
This shows that the shuffle finish time and the sort finish time are set to the time when all map tasks finish, so these two phases have zero running time. In other words, the shuffle and sort phases are not included in the simulated time.
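To make this timing behavior concrete, here is a minimal, self-contained sketch (our own illustration, not the actual Mumak source) combining the two snippets above: the shuffle and sort phases end at the same instant they start, so only the user-space runtime taken from the trace advances the simulated clock.

// Sketch: how a simulated reduce attempt's timeline collapses shuffle and sort.
// All values are illustrative; the real logic lives in SimulatorTaskTracker.
class ReduceAttemptTimelineSketch {
    public static void main(String[] args) {
        long now = 5_000L;               // simulation time when all maps have finished
        long userSpaceRunTime = 30_000L; // REDUCE-phase runtime recorded in the trace

        long shuffleFinishTime = now;    // status.setShuffleFinishTime(now)
        long sortFinishTime = now;       // status.setSortFinishTime(now)
        long finishTime = now + userSpaceRunTime; // as in createTaskAttemptCompletionEvent

        // Shuffle and sort contribute zero simulated time:
        System.out.println("shuffle+sort duration = " + (sortFinishTime - now));  // 0
        System.out.println("total attempt duration = " + (finishTime - now));     // 30000
    }
}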
2. The two graphs about the number of Map/Reduce slots.
One of these graphs is produced by MRPerf and the other by the unmodified Mumak. We found that in the class SimulatorTaskTracker, the variables "DEFAULT_MAP_SLOTS" and "DEFAULT_REDUCE_SLOTS" are both defined with the initial value 2. We changed these two values, ran a group of experiments, and produced the first graph; we did the same thing with MRPerf and produced the other graph. They look basically the same. In this experiment we did not try to verify whether these graphs are correct; we only wanted to show that, although we generally regard Mumak as inflexible and only able to reproduce previously finished experiments, some parameters can be set easily (a hedged sketch of making these slot counts configurable is given at the end of this point).
However, in reality, if the number of map slots keeps increasing, at some point the running time should also increase because of resource contention and communication overhead, so the two graphs above do not actually make sense. This shows that effects like resource contention and communication overhead are ignored by both MRPerf and the original Mumak.
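As a hedged illustration of "some parameters can be set easily", the sketch below shows the usual Hadoop way of reading such limits from the configuration instead of hard-coded constants. We have not verified which properties, if any, Mumak itself consults; the mapred.tasktracker.*.tasks.maximum keys are the standard Hadoop slot properties, and everything else here is only an example.

import org.apache.hadoop.conf.Configuration;

// Sketch: fall back to the hard-coded defaults found in SimulatorTaskTracker
// (both 2) unless the standard Hadoop slot properties are set. Whether Mumak
// actually reads these properties is an assumption, not something we verified.
class SlotConfigSketch {
    static final int DEFAULT_MAP_SLOTS = 2;
    static final int DEFAULT_REDUCE_SLOTS = 2;

    static int mapSlots(Configuration conf) {
        return conf.getInt("mapred.tasktracker.map.tasks.maximum", DEFAULT_MAP_SLOTS);
    }

    static int reduceSlots(Configuration conf) {
        return conf.getInt("mapred.tasktracker.reduce.tasks.maximum", DEFAULT_REDUCE_SLOTS);
    }
}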
3. The interface with Rumen.
We think Mumak does not have a clear interface with Rumen; instead, Rumen is deeply involved in the Mumak source code. For example, Mumak uses many Rumen classes in SimulatorEngine. In order to modify Mumak to make predictions on running time, we should probably either modify Rumen or define another class that provides Mumak with the topology and the simulation events (a rough sketch of such an interface follows the import list below).
A few examples of the Rumen classes imported by SimulatorEngine are:
rumen.ClusterStory;
rumen.ClusterTopologyReader;
rumen.JobStoryProducer;
rumen.LoggedNetworkTopology;
rumen.MachineNode;
rumen.RackNode;
rumen.ZombieCluster;
rumen.RandomSeedGenerator;
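One option for decoupling the two is a thin provider interface that SimulatorEngine could depend on, with Rumen hidden behind one implementation. The following is purely a design sketch of our own, not an existing Mumak or Rumen API.

import java.util.List;

// Hypothetical abstraction layer: SimulatorEngine would depend only on these
// interfaces; a Rumen-backed implementation (ZombieCluster, JobStoryProducer,
// etc.) or a synthetic one used for prediction could be plugged in behind them.
interface ClusterProvider {
    /** Host names of the simulated task trackers. */
    List<String> trackerHosts();

    /** Rack name of a given host. */
    String rackOf(String host);
}

interface SimulationEventProvider {
    /** Simulated timestamp (ms) of the next job submission, or -1 when done. */
    long nextJobSubmitTime();

    /** Trace-derived runtimes (ms) of the map tasks of that job. */
    long[] mapRuntimes();

    /** Trace-derived runtimes (ms) of the reduce tasks of that job. */
    long[] reduceRuntimes();
}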
3.1 Study on the MRPerf source code
We have read the MRPerf source code. MRPerf simulates the Hadoop system by implementing all components of Hadoop in Tcl. For example, there are map.tcl, reduce.tcl, and schedular.tcl, and each file simulates the procedure of how Hadoop works. In reduce.tcl, for instance, there are functions like “fetch_map_output”,
“shuffle_finish”, “reduce_merge_sort”, “start_reduce_task”, and “hdfs_output”. In this way, all the implementation details have to match the real Hadoop system exactly in order for the simulation to be accurate: for example, the actions taken by the system when a map task fails, and the steps a reducer takes in each phase such as shuffle and merge. We think this approach has three drawbacks:
1. It’s quite hard to get the implementation right, because every detail is involved.
2. When the Hadoop system updates, the source code of MRPerf needs to be
updated.
3. It cannot use the schedulers provided by the Hadoop system; every scheduler has to be implemented in Tcl if needed.
So we think the way MRPerf simulates Hadoop is neither flexible nor extensible, and the code is also quite hard to understand. In contrast, the approach taken by Mumak is very attractive. For the most complex and important part of a Hadoop system, the job tracker, Mumak extends the original class and adds a few functions that define the simulation interface, so it can easily interact with the other parts of Mumak. Mumak can also easily use any of the schedulers provided by Hadoop. This simulates exactly the real Hadoop system, and when Hadoop updates, no modification of Mumak is needed.
3.2 MRPerf Experiment
In this experiment, we investigated MRPerf, a MapReduce simulation tool built on top of the network simulator ns-2.
MRPerf simulates MapReduce tasks based on three classes of parameters: cluster parameters, configuration parameters, and framework parameters.
Each simulation consisted of a MapReduce job sorting one gigabyte of data. Some parameters, such as CPU speed and memory, are taken from the EC2 official site.
CPU speed (GHz)              2.5
Number of CPUs               2
Number of cores per CPU      2
Disk capacity (GB)           350
Disk read bandwidth (MB/s)   280
Disk write bandwidth (MB/s)  75
NIC capacity                 1 Gbps
Memory capacity (GB)         1.7
Chunk size (MB)              64
Experiment result (figure): 4-node, double-rack data center (chunk size = 64 MB).
Limitations:
MRPerf does not support disk storage and realistic node reliability.
4.1 Mumak Introduction
Mumak works based on the log information of a real job finished previously. The running time of each individual map and reduce task is added to the simulation time; this is basically how the simulation progresses. The time taken to do the shuffle and sort is not considered.
All events are put into the priority queue in SimulatorEngine and processed in time order: the events scheduled to happen earlier are popped off the queue and processed first.
To illustrate how this works, consider two map tasks. The simulator task tracker processes the event to start the first map task at time a, then puts another event, “map task finished”, with timestamp a + delta_a into the queue in SimulatorEngine, where delta_a is the running time of that map task, available in the log of the real job. It then processes the event to start the second map task at time b. Suppose b > a but a + delta_a > b + delta_b; then the event “map task b finished” will be popped off the queue and processed earlier. The same applies to the job submission events handled by SimulatorJobClient.
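The following is a small self-contained sketch (our own, not Mumak code) of this discrete-event loop: events are ordered by timestamp in a priority queue, so a map task that starts later but runs shorter can finish first, exactly as in the example above.

import java.util.PriorityQueue;

// Sketch of Mumak-style discrete-event processing with two map tasks.
public class EventLoopSketch {
    static class Event {
        final long time; final String what;
        Event(long time, String what) { this.time = time; this.what = what; }
    }

    public static void main(String[] args) {
        PriorityQueue<Event> queue =
            new PriorityQueue<>((x, y) -> Long.compare(x.time, y.time));

        long a = 100, deltaA = 500;  // first map: start time and trace runtime
        long b = 200, deltaB = 50;   // second map: starts later, runs shorter

        queue.add(new Event(a + deltaA, "map task 1 finished"));
        queue.add(new Event(b + deltaB, "map task 2 finished"));

        // Events come off in timestamp order, so map task 2 (t = 250) is
        // processed before map task 1 (t = 600) even though it started later.
        while (!queue.isEmpty()) {
            Event e = queue.poll();
            System.out.println("t=" + e.time + ": " + e.what);
        }
    }
}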
SimulatorJobTracker extends the real Hadoop class JobTracker and performs control work such as job/task scheduling; it interacts with SimulatorTaskTracker via heartbeats. Job submission is performed by the class SimulatorJobClient.
From http://www.slideshare.net/hadoopusergroup/mumak
4.2 Compare Mumak and MRPerf
Factor: feature
  Mumak: works with the Hadoop schedulers (default, capacity, fair, priority).
  MRPerf: fine-grained simulation at the sub-phase level; models inter- and intra-rack network communication and the activity inside a single job.

Factor: evaluation
  Mumak:
  ● It would simulate a cluster of the same size and network topology as the one the source trace comes from. The reason for this restriction is that data locality is an important factor in scheduler decisions, and scaling job traces obtained from cluster A to simulate them on cluster B with a different diameter requires a much more thorough understanding.
  ● It would simulate failures faithfully as recorded in the job trace; namely, if a particular task attempt fails in the trace, it would fail in the simulation.
  ● It would replay the same number of map tasks and reduce tasks as specified in the job digest.
  ● It would use the input split locations as recorded in the job trace.
  MRPerf:
  ● sub-phase performance (map, sort, spill, merge, overhead)
  ● network topology (star, double rack, tree, DCell)
  ● data locality
  ● failures (map, reduce, node, rack)

Factor: configuration
  Mumak: topology, job trace.
  MRPerf: network topology (inter- and intra-rack), node configuration (number, capacity), data layout, job description.

Factor: limitations
  Mumak:
  ● no job dependencies
  ● job-trace driven: the trace has to be generated first and has a fixed format
  ● does not support different nodes or network topologies
  ● does not simulate network traffic or disk traffic
  ● does not simulate the shuffle phase
  ● no modeling of failure correlations (node)
  ● same map and reducer numbers as in the trace
  MRPerf:
  ● simulates a single job, no job workflow
  ● does not simulate the distribution of key-value pairs between the map and reduce phases
  ● models a single storage device per node

Factor: code scale
  Mumak: 3494 lines of Java.
  MRPerf: 1004 lines of Python, 2025 lines of Tcl, 1706 lines of C++.

Factor: libraries
  Mumak: Rumen, Hadoop.
  MRPerf: ns-2, libxml.

Factor: running time
  Mumak: < 10 s.
  MRPerf: 6-15 s.

Factor: possible approach to job dependencies
  Mumak: modify the job submission and define a rule that can describe the workflow, or integrate with other workflow tools like Oozie.
  MRPerf: (not addressed)
4.3 Hack Mumak
We hacked Mumak by first writing a script to modify the input log files, to simulate scenarios with more or fewer nodes (a sketch of this transformation is given after the second topology listing below). For example, the original topology is:
{
"name" : "<root>",
"children" : [ {
"name" : "default-rack",
"children" : [ {
"name" : "domU-12-31-39-02-65-11\\.compute-1\\.internal",
"children" : null
}, {
"name" : "domU-12-31-39-02-65-12\\.compute-1\\.internal",
"children" : null
} ]
} ]
}
By adding 2 nodes, the topology becomes:
{
"name" : "<root>",
"children" : [ {
"name" : "default-rack",
"children" : [ {
"name" : "domU-12-31-39-02-65-11\\.compute-1\\.internal",
"children" : null
}, {
"name" : "domU-12-31-39-02-65-12\\.compute-1\\.internal",
"children" : null
}, {
"name" : "domU-12-31-39-02-65-13\\.compute-1\\.internal",
"children" : null
}, {
"name" : "domU-12-31-39-02-65-14\\.compute-1\\.internal",
"children" : null
} ]
} ]
}
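Our actual script is not shown here, but the transformation it performs can be sketched in Java as follows. This assumes the Jackson JSON library is on the classpath; the file names and the added host names are placeholders.

import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ArrayNode;
import com.fasterxml.jackson.databind.node.ObjectNode;
import java.io.File;

// Sketch: append extra hosts to the "default-rack" children of a Rumen
// topology file. File names and host names are illustrative only.
public class AddNodesSketch {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        ObjectNode root = (ObjectNode) mapper.readTree(new File("topology.json"));

        // The first child of <root> is the "default-rack" node in our traces.
        ObjectNode rack = (ObjectNode) root.withArray("children").get(0);
        ArrayNode hosts = rack.withArray("children");

        for (String name : new String[] {
                "domU-12-31-39-02-65-13\\.compute-1\\.internal",
                "domU-12-31-39-02-65-14\\.compute-1\\.internal" }) {
            ObjectNode host = hosts.addObject();  // new host entry
            host.put("name", name);
            host.putNull("children");             // leaf node, no children
        }
        mapper.writerWithDefaultPrettyPrinter()
              .writeValue(new File("topology-4nodes.json"), root);
    }
}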
Then the corresponding trace file also needs to be modified. The number of “preferred locations” of a task corresponds to its data replication factor; after adding new nodes we also have to change the “preferred locations” to make sure the data is spread evenly over all nodes and satisfies the replication factor. However, in the experiments we found that the results are actually not stable: we ran the same workflow twice, and for some jobs the estimated running time doubled. Solving this problem requires further investigation into Mumak and Rumen.
The second part is to modify the Mumak source code to add a delay corresponding to the shuffle phase. We defined an interface named NetworkSimulator in the mapred package. getDelay() calculates the network delay and is invoked in SimulatorJobTracker.java from getMapCompletionTasks(); setMapInfo() records the finish time of each map TaskAttempt and is called immediately after all maps have completed, for the reduce tasks. We implemented the NetworkSimulator interface in the class NS. The structure is shown in the appendix.
In this class, we implement getDelay() following these steps:
0. Assumption: the network speed is 1 Gbps; this is defined as capacity.
1. Sum up the OutputBytes of each map task to get total_size.
2. delay = total_size / capacity
3. delta = Time(first finishing map task) + delay - current_time
4. return max(0, delta)
This assumes a single-rack topology and that the transmission begins when the first map task finishes.
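A minimal sketch of this computation is given below. It is not the full NS class (whose structure is in the appendix); the map output sizes and finish times are assumed to have been recorded via setMapInfo(), and the 1 Gbps capacity is expressed in bytes per millisecond. Units and field names are our own illustrative assumptions.

import java.util.ArrayList;
import java.util.List;

// Sketch of the NS-style delay computation described in steps 0-4 above.
public class SingleRackNetworkSimulatorSketch {
    // Step 0: 1 Gbps = 125,000,000 bytes/s = 125,000 bytes/ms.
    private static final double CAPACITY_BYTES_PER_MS = 125_000.0;

    private final List<Long> mapOutputBytes = new ArrayList<>();
    private long firstMapFinishTime = Long.MAX_VALUE;

    /** Called once per completed map attempt (the role of setMapInfo()). */
    public void setMapInfo(long outputBytes, long finishTime) {
        mapOutputBytes.add(outputBytes);
        firstMapFinishTime = Math.min(firstMapFinishTime, finishTime);
    }

    /** Steps 1-4: extra simulated delay still outstanding at 'currentTime' (ms). */
    public long getDelay(long currentTime) {
        if (mapOutputBytes.isEmpty()) return 0;
        long totalSize = 0;                                        // step 1
        for (long bytes : mapOutputBytes) totalSize += bytes;
        long delay = (long) (totalSize / CAPACITY_BYTES_PER_MS);   // step 2
        long delta = firstMapFinishTime + delay - currentTime;     // step 3
        return Math.max(0, delta);                                 // step 4
    }
}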
4.4 Experiments
We got the log files from Harold Lim. The log files were produced by first running the MapReduce programs on EC2 using 2 m1.small nodes; the jobs sort and compute statistics over TPC-H data. The job descriptions from job3 to job11 are “Default Join”, “Sort By Supplier”, “Count Orders By Customer”, “Count Orders By Date”, “Count Customers By Nation”, “Count Line Item By Order”, “Count Supplier By Nation”, “Count Part By Brand”, and “Sort By Customer”. The input sizes are: 2435842268, 28271512, 3488799106, 3488799106, 487806602, 15572615447, 28271512, 485241375, 487806602.
Based on the log data of the running times with 2 nodes, we make predictions about the running times with 4 nodes and 6 nodes. In “estimate” we set the data replication factor to 3. In “estimate local” we set the data replication factor as large as possible.
Assumptions:
The scaling factor for task attempt runtime of rack-local over node-local is 1.
There is no remote rack, so all nodes are in the local rack.
Job dependencies are not considered; every job is independent.
The predicted running time:
We noticed that “preferredLocations” is used in getMapTaskAttemptInfo() to calculate the locality, i.e., the closest location. To see the effect of the locality factor, we added the “estimate local” group, which only considers local nodes; in this case, we can guarantee that the node ID exists in preferredLocations.
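Expressed as code, the locality check we rely on is roughly the following sketch (our own simplification; the real lookup happens inside Rumen/Mumak when getMapTaskAttemptInfo() is called). The scaling factor reflects the assumption listed above that rack-local over node-local is 1.

import java.util.Set;

// Sketch: the locality check underlying the "estimate local" group. When the
// replication factor is as large as possible, every node holds a replica, so
// isNodeLocal() is always true and the trace runtime is used unscaled.
public class LocalitySketch {
    /** Scaling factor for a non-node-local attempt; assumed to be 1 in our experiments. */
    private static final double RACK_LOCAL_SCALE = 1.0;

    /** True if the tracker's host holds a replica of the task's input split. */
    static boolean isNodeLocal(String trackerHost, Set<String> preferredLocations) {
        return preferredLocations.contains(trackerHost);
    }

    static long predictedRuntime(long traceRuntimeMs, boolean nodeLocal) {
        return nodeLocal ? traceRuntimeMs
                         : (long) (traceRuntimeMs * RACK_LOCAL_SCALE);
    }
}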
The predicted running time versus the number of nodes.
The predicted network delay:
jobID    delay_time (ms)
job3     16642
job4     141
job5     2185
job6     16
job7     0
job8     2828
job9     0
job10    0
job11    2258
Most of these delays are negligible compared to the map and reduce task running times. This shows that the network is actually not the bottleneck.
5 Conclusion
The objective of our project is to predict the running time of a job on a Hadoop system. In order to make such predictions we need to simulate the Hadoop system. We studied two simulation platforms, MRPerf and Mumak, and then decided to implement the approaches taken by MRPerf on top of Mumak. In the implementation, we wrote scripts to change the Rumen logs and implemented a network delay interface. With these modifications, we are able to predict the running time of a job with different numbers of nodes. To make the predictions more accurate, further modifications to Mumak and Rumen are needed.
References:
http://aws.amazon.com/ec2/instance-types/
Guanying Wang, Ali R. Butt, Prashant Pandey, and Karan Gupta. Using Realistic
Simulation for Performance Analysis of MapReduce Setups. In Proceedings of the 1st
ACM workshop on Large-Scale System and Application Performance (LSAP '09),
Garching, Germany. Jun. 2009.
Appendix
Network Simulator structure