Map-Reduce Performance Analysis and Speedup Measurement
Prajakta Purohit, pppurohi@indiana.edu

Problem Statement: To analyze the effect of parallelization in the Hadoop map-reduce environment for a fixed input, on a single node as well as in a multi-node environment, and to study how the number of mappers and reducers affects parallelization on single and multiple Hadoop nodes, both on bare-metal CS machines and in the FutureGrid environment. We use the embarrassingly parallel WordCount example to study these effects.

Methodology: We first measure performance on a single node with different numbers of mappers and reducers, then measure performance on multiple nodes with different numbers of mappers and reducers. Finally, we calculate the speedup obtained from parallelization, again on both the CS machines and the FutureGrid resources.

Runtime Resources Settings: The maximum achievable speedup depends on the number of available cores, the communication time between nodes, and the size of the input file. The run-time environment settings include identifying the names of the master and the slave, which matters especially for the multi-node configuration, and setting the number of mappers and reducers in conf/mapred-site_template.xml. The values from conf/mapred-site_template.xml are populated into mapred-site.xml by the MultiNodeSetUp.sh script. Four free ports are needed for map-reduce to work correctly; if the same nodes are allocated to different users, Hadoop will not start up correctly. The places where these ports are required are marked [1] through [4] below. Hadoop also requires that the master and the slave have identical directory structures, so some files are copied over from the master to the slave. For the small-scale testing done here we do not need Hadoop to persist its state, so we do not keep the files in the /tmp directory that Hadoop creates.

Implementation: The input to the WordCount program is 16 text files of approximately 1.5 MB each, giving a total input of roughly 26.5 MB.

a. conf/masters: The conf/masters file defines the machines on which the Hadoop master daemons are started. In the single-node case this is just the master machine. The primary NameNode and the JobTracker run on the machines on which the bin/start-dfs.sh and bin/start-mapred.sh scripts, respectively, are executed. In our case runMultipleNodes.sh starts both on the machine from which we execute it. If the Hadoop daemons are started manually instead of through this script, they do not take the conf/masters and conf/slaves files into account.

b. conf/slaves: The conf/slaves file lists the hosts on which the Hadoop slave daemons are to be run. We want both the master box and the slave box to act as Hadoop slaves because we want both of them to store and process data, so on the master, conf/slaves contains: master, slave. Any additional slaves are added to this file.

c. conf/core-site.xml: This file contains the cluster-specific configuration. Its fs.default.name variable specifies the HDFS master host and port, which requires a unique port [1].

d. conf/hdfs-site.xml: dfs.replication specifies the default block replication, that is, to how many machines a single file is replicated before it becomes available. This should not be more than the number of slave nodes, otherwise errors occur. The default value of dfs.replication is 3; since we have only two nodes available, we set dfs.replication to 2. dfs.http.address requires one unique port [2]. dfs.name.dir is a path on the local file system where the NameNode persistently stores the namespace and transaction logs. dfs.data.dir is the path on the local file system where a DataNode stores its data.

e. conf/mapred-site.xml: mapred.job.tracker specifies the MapReduce master host and port and hence requires a unique port [3]. mapred.job.tracker.http.address also requires a unique port [4]. mapred.system.dir is the path where Hadoop stores the MapReduce system files on the master and the slave. mapred.tasktracker.{map|reduce}.tasks.maximum is the maximum number of map (or reduce) tasks that run simultaneously on a given task tracker. dfs.hosts/dfs.hosts.exclude list permitted/excluded DataNodes, mapred.hosts/mapred.hosts.exclude list permitted/excluded task trackers, and mapred.queue.names is a comma-separated list of queues to which jobs can be submitted.
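As a rough illustration, the key settings above take approximately the following form in the configuration files. This is only a sketch: the host name "master", the ports 54310 and 54311, and the value 4 for the per-node task maxima are placeholder values, not necessarily the ones used in our runs; in our setup the actual values are filled in from the template by MultiNodeSetUp.sh.

  # conf/masters (on the master)
  master

  # conf/slaves (on the master)
  master
  slave

  <!-- conf/core-site.xml -->
  <configuration>
    <property>
      <name>fs.default.name</name>
      <value>hdfs://master:54310</value>  <!-- HDFS master host and unique port [1] -->
    </property>
  </configuration>

  <!-- conf/hdfs-site.xml -->
  <configuration>
    <property>
      <name>dfs.replication</name>
      <value>2</value>  <!-- only two nodes available, so replicate each block twice -->
    </property>
  </configuration>

  <!-- conf/mapred-site.xml -->
  <configuration>
    <property>
      <name>mapred.job.tracker</name>
      <value>master:54311</value>  <!-- MapReduce master host and unique port [3] -->
    </property>
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>4</value>  <!-- mappers per node; varied from 1 to 16 in the experiments -->
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>4</value>  <!-- reducers per node; varied alongside the mappers -->
    </property>
  </configuration>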
Statistic table: The tables below contrast the single-node and the two-node configurations. Speedup is computed as (execution time on 1 node) / (execution time on 2 nodes) for the same number of mappers and reducers per node.

CS Machines (16 files, 26.5 MB, 2 cores per node)
Nodes | Mappers per node | Reducers per node | Execution time (s) | Speedup (serial time / parallel time)
2     | 1                | 1                 | 81.0               | 1.79
2     | 2                | 2                 | 57.1               | 1.75
2     | 4                | 4                 | 46.4               | 1.73
2     | 8                | 8                 | 43.3               | 1.56
2     | 16               | 16                | 43.1               | 4.07
1     | 1                | 1                 | 145.3              | -
1     | 2                | 2                 | 100.1              | -
1     | 4                | 4                 | 80.1               | -
1     | 8                | 8                 | 67.6               | -
1     | 16               | 16                | 175.5              | -

Eucalyptus (16 files, 26.5 MB, 4 cores per node)
Nodes | Mappers per node | Reducers per node | Execution time (s) | Speedup (serial time / parallel time)
2     | 1                | 1                 | 79.4               | 1.78
2     | 2                | 2                 | 49.3               | 1.68
2     | 4                | 4                 | 38.1               | 1.59
2     | 8                | 8                 | 28.6               | 1.60
2     | 16               | 16                | 28.1               | 1.22
1     | 1                | 1                 | 141.3              | -
1     | 2                | 2                 | 82.8               | -
1     | 4                | 4                 | 60.5               | -
1     | 8                | 8                 | 45.9               | -
1     | 16               | 16                | 34.4               | -

Although the number of reducers in the configuration file equals the number of mappers, the code itself uses only one reducer. The single-node CS machine performance is poor compared to the multi-node CS machines or to Eucalyptus; differences between the two platforms can also be attributed to the creation of multiple virtual instances on Eucalyptus and to latency due to communication. Ideally, running as many mappers on a machine as it has CPU cores gives the highest efficiency. However, the results may occasionally look erratic because the CS machines are shared and scheduled backup tasks may run periodically; similarly, the Eucalyptus VMs sit in a shared environment, so performance can vary with the load on the hypervisor.

For all the tests, the input and hardware are as follows:
Files: 16 files
File size: ~1.5 MB each
Total size: ~26.5 MB
CS machines: 2 cores each
Eucalyptus machines: 4 cores each
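As a worked example of how the speedup column is obtained: with 1 mapper and 1 reducer per node on the CS machines, the single-node run takes 145.3 s and the two-node run takes 81 s, so speedup = 145.3 / 81 ≈ 1.79, the value in the first row of the table. The apparent outlier of 4.07 at 16 mappers per node comes from dividing the degraded single-node time (175.5 s) by the two-node time (43.1 s), not from an unusually fast parallel run.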
On single-node CS machines: Performance improves as the number of mappers increases from 1 to 8, but degrades suddenly when we allocate 16 mappers on the 2-core machines. This is because the context-switching overhead becomes very high: even though the data set is small, the hardware cannot satisfy the resource requirement that the user specifies. Performance should ideally also degrade with 4 or 8 mappers, but an improvement is still seen because the data set is small and the context-switching time is roughly comparable to the time needed for data processing.

On 2-node CS machines: Performance improves relative to the corresponding single-node runs as the number of mappers increases, but this is because of the small data set; had we increased the number of mappers to 32, performance would degrade for the same reason as above. The improvement can be explained as follows: with 2 nodes and 2 mappers per node, 4 mappers work at a time, whereas on a single node with 2 mappers per node only 2 mappers work at a time, which gives the multi-node system an advantage over the single-node system.

On single-node Eucalyptus VMs: Performance improves as we increase the number of mappers. Ideally, on a 4-core machine performance should start degrading once the number of mappers exceeds 4, but this is not observed in practice, because of the load on the CPU and the small data set.

On 2-node Eucalyptus VMs: Performance is better than the corresponding single-node runs, so the speedup curve increases almost linearly. However, as we increase the number of mappers to 16 or more, the speedup drops sharply, because the requested resources exceed what the system can handle and the context switching and data management across multiple nodes take as long as, or longer than, the corresponding single-node configuration.

Execution Time (figure): execution time in seconds versus the number of mappers and reducers per node, for CS Machines (1 node, 2 nodes) and Eucalyptus (1 node, 2 nodes).

Speedup (figure): speedup (serial time / parallel time) versus the number of mappers and reducers per node, for the CS Machines and Eucalyptus.

Feedback for FutureGrid: As a first-time user I am grateful for the opportunity to work on a robust cloud environment. I ran into a few issues with user account creation on Eucalyptus: there is no way to retrieve or change the password if the user has lost it, and on the FutureGrid portal there is no way to change the username once it has been created. Other than that it is a great environment to work on; being able to run our jobs on an academic cloud gave an experience close to using real commercial clouds.

References:
1) http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/
2) http://hadoop.apache.org/mapreduce/