B649 PROJECT 1 PART 2 REPORT
BINGJING ZHANG
ABSTRACT
This report is for Project 1 Part 2 in B649 Cloud Computing. It presents tests and analysis of the performance of
Hadoop MapReduce [1], a software framework for distributed processing of large data sets on compute clusters,
using the traditional Linux clusters in the CS Department [2] and Eucalyptus in FutureGrid [3]. The tests provide a
basic methodology for Hadoop performance testing, and the results show that different environments can affect
Hadoop performance differently.
INTRODUCTION
Hadoop MapReduce is an open source implementation of Google's MapReduce. It works on HDFS [4], an
open source implementation of the Google File System [5]. When learning MapReduce, it is therefore important to
know how to run Hadoop MapReduce and what its performance is like.
As required by the assignment, the well-known MapReduce example WordCount [6] is used for the performance
test. In the tests, WordCount is first executed on one compute node and then on two nodes. By recording the
execution times, a speed-up chart can be drawn to show how much performance is gained by using more compute
resources. The tests are done in two different environments. One is the traditional Linux cluster in the CS
department, where the compute nodes are physical machines. The other is Eucalyptus, a cloud environment
similar to Amazon Web Services [7], where the compute nodes are virtual machines.
The report includes three parts: the first part describes the Hadoop cluster configuration used in the test, the
second part presents the test results, and the third part gives feedback on the FutureGrid environment.
HADOOP CLUSTER CONFIGURATION
The Hadoop version used here is 0.20.203.0, the current stable version according to the Hadoop official website [8].
Before configuring the Hadoop cluster, several steps need to be done first.
1. Install Java, and set JAVA_HOME in ~/.bashrc and conf/hadoop-env.sh.
2. Set up passphraseless SSH.
3. Set the hostname. This requires adding "ip hostname" lines to /etc/hosts and calling the "hostname" command.
These steps can be skipped if the compute nodes already provide these features. Without Steps 1 and 2, one
cannot start a Hadoop environment properly. For Step 3, if machines are referred to by their IP addresses in the
configuration files, the Hadoop environment will start, but errors will occur in the shuffle stage of MapReduce job
execution. In this performance test, installing Hadoop on the CS Linux cluster only requires Steps 1 and 2,
but for the virtual machines in FutureGrid all three steps are required.
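A rough sketch of these three steps in shell form; the Java path, IP addresses, and hostnames below are placeholders, not the actual values used in the test:

    # Step 1: install Java and point JAVA_HOME at it (path is a placeholder)
    echo 'export JAVA_HOME=/usr/lib/jvm/java-6-sun' >> ~/.bashrc
    echo 'export JAVA_HOME=/usr/lib/jvm/java-6-sun' >> conf/hadoop-env.sh

    # Step 2: passphraseless SSH so the master can start daemons on the slaves
    ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

    # Step 3: map IPs to hostnames and set the hostname (placeholder values)
    echo '10.0.0.1 node1' | sudo tee -a /etc/hosts
    echo '10.0.0.2 node2' | sudo tee -a /etc/hosts
    sudo hostname node1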
The key part of the Hadoop configuration is editing the following five files: conf/masters, conf/slaves,
conf/core-site.xml, conf/hdfs-site.xml, and conf/mapred-site.xml.
The file conf/masters lists the hostnames of master nodes, one per line. Master nodes are used as the NameNode
and the JobTracker. The NameNode is the central node that manages and controls HDFS, while the JobTracker is
the program that manages MapReduce jobs. Typically, one machine in the cluster is designated exclusively as the
NameNode and another machine exclusively as the JobTracker. In this test, however, only one node is used as the
master; it acts as NameNode and JobTracker at the same time.
The file conf/slaves lists the hostnames of slave nodes, one per line. A slave node acts as both DataNode and
TaskTracker. The DataNode is the slave of the NameNode and is responsible for managing local data storage. The
TaskTracker is the slave of the JobTracker and manages local Map/Shuffle/Reduce tasks. In this performance test,
the master node is also used as a slave node.
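For example, with two hypothetical hosts node1 and node2, where node1 serves as both master and slave (as in this test), the two files could be written as:

    # conf/masters: node1 runs the NameNode and the JobTracker
    echo 'node1' > conf/masters
    # conf/slaves: both nodes run a DataNode and a TaskTracker
    printf 'node1\nnode2\n' > conf/slaves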
To further configure the NameNode and JobTracker on the master node, you need to edit conf/core-site.xml,
conf/hdfs-site.xml, and conf/mapred-site.xml. In conf/core-site.xml, you are required to set the NameNode URI,
with a value in the format hdfs://hostname:port. In this performance test, the port is set to 9000.
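A minimal sketch of conf/core-site.xml, assuming a hypothetical master hostname node1 (fs.default.name is the NameNode URI property in Hadoop 0.20):

    cat > conf/core-site.xml <<'EOF'
    <?xml version="1.0"?>
    <configuration>
      <!-- NameNode URI: master hostname and the HDFS port (9000 in this test) -->
      <property>
        <name>fs.default.name</name>
        <value>hdfs://node1:9000</value>
      </property>
    </configuration>
    EOF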
In conf/hdfs-site.xml, you can set dfs.name.dir to the path on the NameNode where it stores the namespace and
transaction logs, and dfs.data.dir to the path on a DataNode where it stores its data blocks.
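A corresponding sketch of conf/hdfs-site.xml; the storage paths below are placeholders, not the paths actually used in the test:

    cat > conf/hdfs-site.xml <<'EOF'
    <?xml version="1.0"?>
    <configuration>
      <!-- Path on the NameNode for the namespace and transaction logs (placeholder) -->
      <property>
        <name>dfs.name.dir</name>
        <value>/tmp/hadoop/name</value>
      </property>
      <!-- Path on each DataNode for its data blocks (placeholder) -->
      <property>
        <name>dfs.data.dir</name>
        <value>/tmp/hadoop/data</value>
      </property>
    </configuration>
    EOF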
In conf/mapred-site.xml, you can set the URI of the JobTracker in the format hostname:port. The maximum number
of Map and Reduce tasks per node is also set here, through mapred.tasktracker.{map|reduce}.tasks.maximum. In
this performance test, the maximum number of Map tasks per node is set to 2, and so is the maximum number of
Reduce tasks.
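A sketch of conf/mapred-site.xml with the task limits used in this test; the hostname node1 and port 9001 are placeholders:

    cat > conf/mapred-site.xml <<'EOF'
    <?xml version="1.0"?>
    <configuration>
      <!-- JobTracker URI in hostname:port form (placeholder values) -->
      <property>
        <name>mapred.job.tracker</name>
        <value>node1:9001</value>
      </property>
      <!-- At most 2 Map and 2 Reduce tasks per node, as in this test -->
      <property>
        <name>mapred.tasktracker.map.tasks.maximum</name>
        <value>2</value>
      </property>
      <property>
        <name>mapred.tasktracker.reduce.tasks.maximum</name>
        <value>2</value>
      </property>
    </configuration>
    EOF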
The ports used in conf/core-site.xml and conf/mapred-site.xml should be unique available ports on a shared
environment such as the CS Linux Cluster. Otherwise, other users cannot start their Hadoop environments
properly once one environment already exists.
PERFORMANCE TEST
In the performance test, a 133.8 MB input is used for WordCount. In two different environments, the CS Linux
Cluster and Eucalyptus in FutureGrid, the test is first done on one compute node and then expanded to two nodes.
From the difference in execution times, the speed up can be obtained.
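As a rough sketch, a single timed run of WordCount could look like the following; the input file name, HDFS paths, and the examples jar name are assumptions and may differ from what was actually used:

    # Format HDFS and start the daemons (done once after configuration)
    bin/hadoop namenode -format
    bin/start-all.sh

    # Copy the 133.8 MB input into HDFS (local file name is a placeholder)
    bin/hadoop fs -put input.txt wordcount-input

    # Run WordCount from the Hadoop examples jar and time the execution
    time bin/hadoop jar hadoop-examples-0.20.203.0.jar wordcount wordcount-input wordcount-output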
The machine settings are as follows:
Table 1. Machine CPU and memory specs

                             cores per node   core specs                                    memory
    CS Linux Cluster         2                Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz    4 GB
    Eucalyptus in FutureGrid 1                Intel(R) Xeon(R) CPU X5570 @ 2.93GHz          1 GB
In the CS Linux Cluster, two sets of machines are used in the test. The first set, iceman and wolverine, are both in
the Shark domain. The second set, rogue and bandicoot, are from different domains: rogue is in the Shark domain,
while bandicoot is in the Burrow domain. The performance results are as follows:
Table 2. Execution time on iceman and wolverine (execution time is the average of 10 runs, in milliseconds; speed
up is relative to the single-node run)

    mappers per node   reducers per node   total nodes   execution time (ms)   speed up
    2                  2                   1             78383.5               1
    2                  2                   2             68525.6               1.14
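The speed-up values are the single-node execution time divided by the multi-node execution time; for example, from Table 2:

    \[
      \text{speed up} = \frac{T_{1\ \text{node}}}{T_{2\ \text{nodes}}} = \frac{78383.5}{68525.6} \approx 1.14
    \]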
Table 3. Execution time on rogue and bandicoot (execution time is the average of 10 runs, in milliseconds; speed
up is relative to the single-node run)

    mappers per node   reducers per node   total nodes   execution time (ms)   speed up
    2                  2                   1             79034.6               1
    2                  2                   2             66863.5               1.18
From the two test cases above, we see that the speed up is very small, probably because the CS Linux Cluster is a
shared environment: it carries a lot of daily workload and also a lot of network traffic inside the cluster.
For Eucalyptus in FutureGrid, although each virtual machine has only one core, the test still configures two Map
and two Reduce tasks per node, as required by the assignment. In the performance results, although the virtual
machines have longer execution times, they show much better speed-up values.
Table 4. Execution time on Eucalyptus (execution time is the average of 10 runs, in milliseconds; speed up is
relative to the single-node run)

    mappers per node   reducers per node   total nodes   execution time (ms)   speed up
    2                  2                   1             141295.6              1
    2                  2                   2             85882.8               1.65
For the execution time chart, see Figure 1; for the speed-up chart, see Figure 2.
FUTUREGRID FEEDBACK
Eucalyptus in FutureGrid provides functionality comparable to Amazon EC2. It is excellent to have virtual resources
owned only by yourself and free of charge. In the context of building a Hadoop environment, configuration would
be easier if the virtual machines had their own unique default hostnames once they are running.
CONCLUSION
This report shows a basic methodology for performance testing Hadoop. The results show that the environment
matters in a performance test. If the test were only done on the CS Linux Cluster, one might conclude that the
speed up of Hadoop is poor; after another test in FutureGrid, one finds that the speed up of Hadoop can be much
better. Since the CS Linux Cluster is a shared environment, the results there are misleading to some extent. The
results on Eucalyptus are much better, but we still need to understand its underlying infrastructure before drawing
a final conclusion.
Figure 1. Execution time chart
Figure 2. Speed up chart
REFERENCES
[1] Hadoop MapReduce, http://hadoop.apache.org/mapreduce/
[2] Burrow and Shark, https://uisapp2.iu.edu/confluence-prd/pages/viewpage.action?pageId=114491559
[3] FutureGrid, https://portal.futuregrid.org/
[4] HDFS, http://hadoop.apache.org/hdfs/
[5] GFS, http://dl.acm.org/citation.cfm?id=945450
[6] WordCount, http://hadoop.apache.org/common/docs/current/mapred_tutorial.html
[7] AWS, http://aws.amazon.com/
[8] Hadoop Common, http://hadoop.apache.org/common/releases.html