Hadoop System Simulation with Mumak

Fei Dong, Tianyu Feng, Hong Zhang
CPS 216 Project

1 Content

This report is organized as follows. In "Clarifications and Important Points" we address only the questions discussed in the final presentation; the technical details of the project are covered in the remaining sections.

2 Clarifications and Important Points

1. Mumak only counts user-space time. In SimulatorTaskTracker.java, the simulation time increases only at

long delta = tip.getUserSpaceRunTime();
long finishTime = now + delta;

in the function createTaskAttemptCompletionEvent. The function getUserSpaceRunTime is defined in an inner class of TaskTracker, with the explanation: "Runtime of the user-space code of the task attempt. This is the full runtime for map tasks, and only that of the REDUCE phase for reduce tasks." So our previous statement was not accurate: Mumak counts the time of the whole map task attempt, but only the reduce phase (the phase running user code) of a reduce task attempt. The code corresponding to the reduce phase can be found in the function handleAllMapsCompletedTaskAction:

status.setPhase(Phase.REDUCE);
status.setShuffleFinishTime(now);
status.setSortFinishTime(now);

Both the shuffle finish time and the sort finish time are set to the time when all map tasks finish, so these two phases have zero running time. In other words, the shuffle and sort phases are not included in the simulated time.
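The following minimal sketch summarizes this accounting. The class and method names here are ours for illustration only, not the actual Mumak classes: a reduce attempt's shuffle and sort finish "instantly" at the moment the last map completes, and only the user-space runtime recorded in the trace advances the simulated clock.

// Illustrative sketch only, NOT the actual Mumak source: simplified types
// showing how simulated time is accounted for a task attempt.
final class TaskAttemptTimingSketch {
    long userSpaceRunTime;    // from the Rumen trace: full runtime for a map
                              // attempt, REDUCE-phase runtime only for a reduce
    long shuffleFinishTime;
    long sortFinishTime;
    long finishTime;

    // Cf. handleAllMapsCompletedTaskAction: when the last map finishes,
    // shuffle and sort are marked done at the same instant, so they
    // contribute zero simulated time.
    void onAllMapsCompleted(long now) {
        shuffleFinishTime = now;
        sortFinishTime = now;
    }

    // Cf. createTaskAttemptCompletionEvent: only the user-space runtime
    // recorded in the trace advances the simulated clock.
    void onStartUserCode(long now) {
        finishTime = now + userSpaceRunTime;
    }

    public static void main(String[] args) {
        TaskAttemptTimingSketch reduce = new TaskAttemptTimingSketch();
        reduce.userSpaceRunTime = 40_000;      // 40 s of REDUCE-phase user code
        reduce.onAllMapsCompleted(100_000);    // all maps done at t = 100 s
        reduce.onStartUserCode(100_000);
        System.out.println(reduce.finishTime); // 140000: shuffle/sort added nothing
    }
}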
2. The two graphs about the number of map/reduce slots. One of these graphs was produced by MRPerf and the other by the unmodified Mumak. We found that in the class SimulatorTaskTracker the variables DEFAULT_MAP_SLOTS and DEFAULT_REDUCE_SLOTS are both initialized to 2. We changed these two values, ran a group of experiments, and produced the first graph; we did the same with MRPerf to produce the second. The two graphs look essentially the same. In this experiment we did not try to verify whether the graphs are correct; we only wanted to show that, although Mumak is generally inflexible and can only reproduce previously finished experiments, some parameters can be changed easily. In reality, however, if the number of map slots keeps increasing, at some point the running time should start to increase again because of resource contention and communication overhead, so the two graphs do not really make sense. This shows that effects such as resource contention and communication overhead are ignored by both MRPerf and the original Mumak.

3. The interface with Rumen. We think Mumak does not have a clean interface with Rumen; instead, Rumen is deeply woven into the Mumak source code. For example, Mumak uses many Rumen classes in SimulatorEngine. To modify Mumak to make predictions about running time, we should probably either modify Rumen or define another class that provides Mumak with the topology and the simulation events. Some of the Rumen classes imported by SimulatorEngine are:

rumen.ClusterStory
rumen.ClusterTopologyReader
rumen.JobStoryProducer
rumen.LoggedNetworkTopology
rumen.MachineNode
rumen.RackNode
rumen.ZombieCluster
rumen.RandomSeedGenerator

3.1 Study on the MRPerf source code

We have read the MRPerf source code. MRPerf simulates the Hadoop system by implementing all components of Hadoop in TCL: for example, there are map.tcl, reduce.tcl, and scheduler.tcl, and each file simulates the corresponding procedure of how Hadoop works. In reduce.tcl, for instance, there are functions such as fetch_map_output, shuffle_finish, reduce_merge_sort, start_reduce_task, and hdfs_output. With this approach, every implementation detail has to match the real Hadoop system exactly for the simulation to be accurate, for example the actions the system takes when a map task fails, and the steps a reducer performs in each phase such as shuffle and merge. We see three drawbacks in this approach:

1. It is hard to get the implementation right, because every detail is involved.
2. Whenever the Hadoop system is updated, the MRPerf source code must be updated as well.
3. The schedulers shipped with Hadoop cannot be reused; every scheduler has to be re-implemented in TCL if needed.

So the way MRPerf simulates Hadoop is neither flexible nor extensible, and the code is quite hard to understand. In contrast, the approach taken by Mumak is very attractive. For the most complex and important part of a Hadoop system, the JobTracker, Mumak extends the original class and adds a few functions that define the simulation interface, so it can easily interact with the other parts of Mumak, and it can directly use any of the schedulers provided by Hadoop. This matches the real Hadoop system exactly, and when Hadoop is updated, no modification to Mumak is needed.

3.2 MRPerf Experiment

In this experiment we investigated MRPerf, a MapReduce simulation tool built on top of the network simulator ns-2. MRPerf simulates MapReduce jobs based on three classes of parameters: cluster parameters, configuration parameters, and framework parameters. Each simulation consisted of a MapReduce job sorting one gigabyte of data. Parameters such as CPU speed and memory are taken from the official EC2 site:

CPU speed: 2.5 GHz
Number of CPUs: 2
Number of cores per CPU: 2
Disk capacity: 350 GB
Disk read bandwidth: 280 MB/s
Disk write bandwidth: 75 MB/s
NIC capacity: 1 Gbps
Memory capacity: 1.7 GB
Chunk size: 64 MB

Figure: experiment result, 4 nodes, double-rack data center (chunk size = 64 MB).

Limitations: MRPerf does not support disk storage or realistic node reliability.

4.1 Mumak Introduction

Mumak works from the log information of a real workload that finished earlier. The running time of each individual map and reduce task is added to the simulation time; this is essentially how the simulation progresses. The time taken for shuffle and sort is not considered. All events are put into the priority queue in SimulatorEngine and processed in time order: the events that are supposed to happen earlier are popped off the queue and processed first.

To illustrate how this works, consider two map tasks. The simulated task tracker processes the event to start the first map task at time a, then puts another event, "map task finished", with timestamp a + delta_a into the queue in SimulatorEngine, where delta_a is the running time of that map task, available in the log of the real job. It then processes the event to start the second map task at time b. Suppose b > a but a + delta_a > b + delta_b; then the event "map b finished" is popped off the queue and processed earlier. The job submission events handled by SimulatorJobClient are processed the same way.

SimulatorJobTracker extends the real Hadoop class JobTracker and performs control work such as job/task scheduling; it interacts with SimulatorTaskTracker via heartbeats. Job submission is performed by the class SimulatorJobClient. A diagram of this architecture is available at http://www.slideshare.net/hadoopusergroup/mumak.
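The following self-contained sketch shows this priority-queue behavior for the two-map example. The event type and class name are simplified stand-ins, not the real Mumak classes: map b, started later, still finishes first because its recorded runtime is shorter.

import java.util.Comparator;
import java.util.PriorityQueue;

// Minimal sketch of the event loop in SimulatorEngine, NOT the real Mumak
// code: events are processed strictly in timestamp order.
public class SimulatorEngineSketch {
    record Event(long time, String description) {}

    public static void main(String[] args) {
        PriorityQueue<Event> queue =
            new PriorityQueue<>(Comparator.comparingLong(Event::time));

        long a = 10, deltaA = 50;  // map a starts at t=10, ran 50 in the trace
        long b = 20, deltaB = 15;  // map b starts at t=20, ran 15 in the trace

        // Starting a map task enqueues its completion event at start + runtime.
        queue.add(new Event(a + deltaA, "map a finished"));
        queue.add(new Event(b + deltaB, "map b finished"));

        while (!queue.isEmpty()) {
            Event e = queue.poll();  // earliest timestamp first
            System.out.println("t=" + e.time() + ": " + e.description());
        }
        // Output: "t=35: map b finished" before "t=60: map a finished",
        // even though map a was started first.
    }
}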
4.2 Comparing Mumak and MRPerf

Feature
  Mumak: works with the real Hadoop schedulers (default, capacity, fair, priority).
  MRPerf: fine-grained simulation at the sub-phase level; models inter- and intra-rack network communication and the activity inside a single job.

Evaluation
  Mumak:
  - simulates a cluster of the same size and network topology as the one the source trace comes from (data locality is an important factor in scheduler decisions, and scaling job traces obtained from cluster A to simulate them on a cluster B of different diameter would require a much more thorough understanding);
  - simulates failures faithfully as recorded in the job trace: if a particular task attempt fails in the trace, it fails in the simulation;
  - replays the same numbers of map and reduce tasks as specified in the job digest;
  - uses the input split locations recorded in the job trace.
  MRPerf: evaluates sub-phase performance (map, sort, spill, merge, overhead), network topology (star, double rack, tree, DCell), data locality, and failures (map, reduce, node, rack).

Configuration
  Mumak: topology, job trace.
  MRPerf: network topology (inter- and intra-rack), node configuration (number, capacity), data layout, job description.

Limitations
  Mumak:
  - no job dependency modeling;
  - job-trace driven: a trace must be generated first, in a fixed format;
  - does not support changing the nodes or the network topology;
  - does not simulate the network traffic between the map and reduce phases, i.e. the shuffle phase;
  - replays the same numbers of map and reduce tasks as the trace.
  MRPerf:
  - simulates a single job, no job workflow;
  - does not simulate the distribution of key-value pairs;
  - does not simulate disk traffic;
  - models a single storage device per node;
  - no modeling of failure correlations (node).

Code scale
  Mumak: 3494 lines of Java.
  MRPerf: 1004 lines of Python, 2025 lines of TCL, 1706 lines of C++.

Libraries
  Mumak: Rumen, Hadoop.
  MRPerf: ns-2, libxml.

Running time
  Mumak: under 10 seconds.
  MRPerf: 6-15 seconds.

Possible approach
  To add job dependency support to Mumak: modify the job submission, define a rule that can describe the workflow, and integrate with other workflow tools such as Oozie.

4.3 Hacking Mumak

We hacked Mumak by first writing a script that modifies the input log files, to simulate runs with more or fewer nodes. For example, the original topology is:

{ "name" : "<root>",
  "children" : [ {
    "name" : "default-rack",
    "children" : [ {
      "name" : "domU-12-31-39-02-65-11\\.compute-1\\.internal",
      "children" : null
    }, {
      "name" : "domU-12-31-39-02-65-12\\.compute-1\\.internal",
      "children" : null
    } ]
  } ]
}

After adding two nodes, the topology becomes:

{ "name" : "<root>",
  "children" : [ {
    "name" : "default-rack",
    "children" : [ {
      "name" : "domU-12-31-39-02-65-11\\.compute-1\\.internal",
      "children" : null
    }, {
      "name" : "domU-12-31-39-02-65-12\\.compute-1\\.internal",
      "children" : null
    }, {
      "name" : "domU-12-31-39-02-65-13\\.compute-1\\.internal",
      "children" : null
    }, {
      "name" : "domU-12-31-39-02-65-14\\.compute-1\\.internal",
      "children" : null
    } ]
  } ]
}

The corresponding trace file also needs to be modified. The number of "preferred locations" of a task corresponds to its data replication factor; after adding new nodes we also have to change the preferred locations so that the data is distributed evenly over all nodes and still satisfies the replication factor. In the experiments, however, we found that the results are not stable: we ran the same workflow twice, and for some jobs the estimated running time doubled. Solving this will require further investigation into Mumak and Rumen.

The second part is to modify the Mumak source code to add a delay corresponding to the shuffle phase. We defined an interface named NetworkSimulator in the mapred package. Its method getDelay() calculates the network delay and is invoked in SimulatorJobTracker.java from getMapCompletionTasks(); its method setMapInfo() records the finish time of each TaskAttempt and is called, for reduce tasks, immediately after all maps have completed. We implemented the NetworkSimulator interface in the class NS; its structure is shown in the appendix. In this class, getDelay() follows these steps:

0. Assume the network speed is 1 Gbps; this is defined as capacity.
1. Sum up the output bytes of every map task to get total_size.
2. delay = total_size / capacity.
3. delta = time(first finishing map task) + delay - current_time.
4. Return max(0, delta).

This assumes a single-rack topology and that transmission begins when the first map task finishes.
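A minimal sketch of this computation under the stated assumptions is given below. The units (bytes and simulated milliseconds, matching the delay table in Section 4.4) and the static-helper signature are ours for illustration; this is not the actual interface.

// Sketch of the getDelay() rule above, under the single-rack assumption.
// Units and signature are illustrative, not the actual NetworkSimulator code.
public final class DelaySketch {
    // 1 Gbps expressed in bytes per millisecond: 1e9 bits/s = 125,000 B/ms.
    static final double CAPACITY_BYTES_PER_MS = 1e9 / 8.0 / 1000.0;

    static long getDelay(long totalMapOutputBytes,   // step 1: total_size
                         long firstMapFinishTime,    // simulated ms
                         long currentTime) {         // simulated ms
        long delay = (long) (totalMapOutputBytes / CAPACITY_BYTES_PER_MS); // step 2
        long delta = firstMapFinishTime + delay - currentTime;             // step 3
        return Math.max(0, delta);                                         // step 4
    }

    public static void main(String[] args) {
        // E.g. 2 GB of map output takes 16,000 ms to transfer at 1 Gbps.
        // If the first map finished at t = 5,000 and it is now t = 9,000,
        // the remaining delay charged to the reducers is 12,000 ms.
        System.out.println(getDelay(2_000_000_000L, 5_000, 9_000)); // 12000
    }
}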
4.4 Experiments

We obtained the log files from Harold Lim. They were produced by running MapReduce programs on EC2 with 2 m1.small nodes, sorting and computing statistics over TPC-H data. The job descriptions from job3 to job11 are "Default Join", "Sort By Supplier", "Count Orders By Customer", "Count Orders By Date", "Count Customers By Nation", "Count Line Item By Order", "Count Supplier By Nation", "Count Part By Brand", and "Sort By Customer". The corresponding input sizes in bytes are: 2435842268, 28271512, 3488799106, 3488799106, 487806602, 15572615447, 28271512, 485241375, 487806602.

Based on the logged running times with 2 nodes, we predicted the running times with 4 nodes and 6 nodes. In the "estimate" group we set the data replication factor to 3; in the "estimate local" group we set the replication factor as large as possible. Assumptions:

- The scaling factor for task attempt runtime of rack-local over node-local is 1.
- There is no remote rack; all nodes are in the local rack.
- Job dependencies are not considered; every job is independent.

Figure: the predicted running times.

We noticed that "preferredLocations" is used in getMapTaskAttemptInfo() to calculate the closest location. To see the effect of the locality factor, we added the "estimate local" group, which considers only local nodes; in this case we can guarantee that the node id exists in preferredLocations.

Figure: predicted running time versus number of nodes.

The predicted network delay:

jobID   delay time (ms)
job3    16642
job4    141
job5    2185
job6    16
job7    0
job8    2828
job9    0
job10   0
job11   2258

Most of these delays are negligible compared with the map and reduce task running times, which shows that the network is not the bottleneck here.

5 Conclusion

The objective of our project is to predict the running time of a job on a Hadoop system, and to make such predictions we need to simulate the Hadoop system. We studied two simulation platforms, MRPerf and Mumak, and decided to implement the approach taken by MRPerf on top of Mumak. In the implementation, we wrote scripts to change the Rumen logs and implemented a network delay interface. With these modifications, we can predict the running time of a job for different numbers of nodes. Making the predictions more accurate will require further modification of Mumak and Rumen.

References

Amazon EC2 instance types. http://aws.amazon.com/ec2/instance-types/
Guanying Wang, Ali R. Butt, Prashant Pandey, and Karan Gupta. Using Realistic Simulation for Performance Analysis of MapReduce Setups. In Proceedings of the 1st ACM Workshop on Large-Scale System and Application Performance (LSAP '09), Garching, Germany, June 2009.

Appendix: Network Simulator structure
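The original appendix contained a diagram of the NetworkSimulator structure. The following is a minimal Java sketch of that structure as described in Section 4.3; the method signatures and internal bookkeeping are our assumptions for illustration, not the exact source.

import java.util.HashMap;
import java.util.Map;

// Sketch of the NetworkSimulator structure described in Section 4.3.
// All signatures here are assumptions for illustration.
interface NetworkSimulator {
    // Record a map attempt's finish time and output size; called for reduce
    // tasks immediately after all maps have completed.
    void setMapInfo(String mapAttemptId, long finishTime, long outputBytes);

    // Network delay (ms) to charge before map output is available to reducers;
    // invoked from SimulatorJobTracker when serving getMapCompletionTasks().
    long getDelay(long currentTime);
}

// NS: the single-rack, 1 Gbps implementation (cf. the steps in Section 4.3).
class NS implements NetworkSimulator {
    private static final double CAPACITY_BYTES_PER_MS = 1e9 / 8.0 / 1000.0;

    private final Map<String, Long> mapFinishTimes = new HashMap<>();
    private long totalOutputBytes = 0;
    private long firstMapFinishTime = Long.MAX_VALUE;

    @Override
    public void setMapInfo(String mapAttemptId, long finishTime, long outputBytes) {
        mapFinishTimes.put(mapAttemptId, finishTime);
        totalOutputBytes += outputBytes;
        firstMapFinishTime = Math.min(firstMapFinishTime, finishTime);
    }

    @Override
    public long getDelay(long currentTime) {
        long delay = (long) (totalOutputBytes / CAPACITY_BYTES_PER_MS);
        return Math.max(0, firstMapFinishTime + delay - currentTime);
    }
}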