
Map-Reduce Performance Analysis and Speedup Measurement
Prajakta Purohit
pppurohi@indiana.edu
Problem Statement:
To analyze the results of parallelization in the Hadoop map-reduce environment for a fixed input, on a single node as well as in a multi-node environment, and to study how the number of mappers and reducers affects parallel performance on single and multiple Hadoop nodes, both on bare-metal CS machines and in the FutureGrid environment. We use the embarrassingly parallel WordCount example to understand these concepts.
Methodology:
We first measure performance on a single node with different numbers of mappers and reducers, then measure performance on multiple nodes with different numbers of mappers and reducers. Finally, we calculate the speedup obtained from parallelization, again on both the CS machines and the FutureGrid resources.
Runtime Resource Settings:
We realize that the maximum speedup depends on the available cores of the machines, the communication time between nodes, and the size of the input file. The run-time environment settings include identifying the names of the master and the slave, especially for the multi-node configuration, and changing the number of mappers and reducers in conf/mapred-site_template.xml; the values from this template are populated into mapred-site.xml by the MultiNodeSetUp.sh script. We need four ports to be available for map-reduce to work correctly; if the same nodes are allocated to different users, Hadoop will not start up correctly. The details of where these ports are required are given later. Hadoop also requires that the master and the slave have identical directory structures, so some files are copied over from the master to the slave. For the small-scale testing that we are doing, we do not need Hadoop to persist its state, so we do not keep the files in the /tmp directory that Hadoop creates.
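As a concrete illustration, a minimal sketch of the per-node map/reduce slot settings in conf/mapred-site_template.xml is shown below. The property names (mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum) are the standard Hadoop ones also listed in the mapred-site.xml section of this report; the exact contents and layout of the course-provided template file are an assumption, and the value 4 is only an example.

  <configuration>
    <!-- Maximum number of map tasks one TaskTracker runs simultaneously (example value) -->
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>4</value>
    </property>
    <!-- Maximum number of reduce tasks one TaskTracker runs simultaneously (example value) -->
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>4</value>
    </property>
  </configuration>

MultiNodeSetUp.sh copies these values into mapred-site.xml, so varying them between runs (1, 2, 4, 8, 16) is one plausible way the different configurations in the tables below were obtained.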
Implementation:
The input to the WordCount program is 16 text files of approximately 1.5 MB each, so the total input is about 26.5 MB.
a. conf/masters :
The conf/masters file defines the machines on which Hadoop starts its master daemons; in the single-node case, this is just the master machine. The primary NameNode and the JobTracker run on whichever machines the bin/start-dfs.sh and bin/start-mapred.sh scripts, respectively, are run on. In our case the runMultipleNodes.sh script starts these on the same machine from which we execute it. If the Hadoop daemons are started manually instead of via this script, they will not take the conf/masters and conf/slaves files into account.
b. conf/slaves :
The conf/slaves file lists the hosts on which the Hadoop slave daemons are to be run. We want both the master box and the slave box to act as Hadoop slaves because we want both of them to store and process data. On the master, conf/slaves therefore contains: master, slave. Any additional slaves are added to this file.
c. conf/core-site.xml :
This file contains the cluster-specific configuration. It has the fs.default.name variable, which specifies the HDFS master host and port; this requires a unique port [1].
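For illustration, a minimal core-site.xml entry might look like the sketch below. The host name master matches the conf/masters setup above; the port 54310 is only an example value taken from the multi-node tutorial in the references, not necessarily the port used in these runs.

  <configuration>
    <!-- HDFS master (NameNode) host and port; requires a free, unique port -->
    <property>
      <name>fs.default.name</name>
      <value>hdfs://master:54310</value>
    </property>
  </configuration>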
d. conf/hdfs-site.xml :
dfs.replication – Variable that specifies the default block replication, i.e. how many machines a single file is replicated to before it becomes available. It should not be set higher than the number of available nodes, to avoid errors. The default value of dfs.replication is 3; however, as we have only two nodes available, we set dfs.replication to 2. The dfs.http.address also requires one unique port [2].
dfs.name.dir – Path on the local file system where the NameNode persistently stores the namespace and transaction logs.
dfs.data.dir – Path on the local file system where a DataNode stores its data blocks.
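The corresponding hdfs-site.xml entries might look like the sketch below. The dfs.replication value of 2 is the one stated above; the directory paths are purely illustrative placeholders, since the report does not give the actual paths used.

  <configuration>
    <!-- Replicate each block to two machines (we have two DataNodes) -->
    <property>
      <name>dfs.replication</name>
      <value>2</value>
    </property>
    <!-- Illustrative local paths for NameNode metadata and DataNode blocks (not the actual paths used) -->
    <property>
      <name>dfs.name.dir</name>
      <value>/home/hadoop/dfs/name</value>
    </property>
    <property>
      <name>dfs.data.dir</name>
      <value>/home/hadoop/dfs/data</value>
    </property>
  </configuration>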
e. conf/mapred-site.xml :
mapred.job.tracker – Variable that specifies the MapReduce master (JobTracker) host and port; this requires a unique port [3].
mapred.job.tracker.http.address – This variable also requires a unique port [4].
mapred.system.dir – Path where Hadoop stores the MapReduce system files on the master and the slave.
mapred.tasktracker.{map|reduce}.tasks.maximum – The maximum number of map (or reduce) tasks that run simultaneously on a given TaskTracker.
dfs.hosts / dfs.hosts.exclude – Lists of permitted/excluded DataNodes.
mapred.hosts / mapred.hosts.exclude – Lists of permitted/excluded TaskTrackers.
mapred.queue.names – Comma-separated list of queues to which jobs can be submitted.
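For completeness, the JobTracker-related entries in mapred-site.xml might look like the sketch below. As with the HDFS example, the host name and the ports (54311, and 50030 for the web UI) are illustrative values taken from the referenced tutorial and the Hadoop defaults, not values confirmed in this report.

  <configuration>
    <!-- MapReduce master (JobTracker) host and port; requires a free, unique port -->
    <property>
      <name>mapred.job.tracker</name>
      <value>master:54311</value>
    </property>
    <!-- JobTracker web UI address; also requires a unique port -->
    <property>
      <name>mapred.job.tracker.http.address</name>
      <value>0.0.0.0:50030</value>
    </property>
  </configuration>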
Statistics tables:
The tables below contrast single-node and two-node runs on the CS machines and on Eucalyptus.
CS Machines (16 files, 26.5 MB total, 2 cores per node)

Nodes   Mappers per node   Reducers per node   Execution time (s)   Speedup (single-node time / two-node time)
2       1                  1                   81.0                 1.794
2       2                  2                   57.1                 1.753
2       4                  4                   46.4                 1.726
2       8                  8                   43.3                 1.561
2       16                 16                  43.1                 4.071
1       1                  1                   145.3                -
1       2                  2                   100.1                -
1       4                  4                   80.1                 -
1       8                  8                   67.6                 -
1       16                 16                  175.5                -

Eucalyptus (16 files, 26.5 MB total, 4 cores per node)

Nodes   Mappers per node   Reducers per node   Execution time (s)   Speedup (single-node time / two-node time)
2       1                  1                   79.4                 1.780
2       2                  2                   49.3                 1.680
2       4                  4                   38.1                 1.588
2       8                  8                   28.6                 1.605
2       16                 16                  28.1                 1.224
1       1                  1                   141.3                -
1       2                  2                   82.8                 -
1       4                  4                   60.5                 -
1       8                  8                   45.9                 -
1       16                 16                  34.4                 -
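As a worked example of how the speedup column is computed: for one mapper and one reducer per node on the CS machines, speedup = single-node time / two-node time = 145.3 s / 81 s ≈ 1.79, which matches the first row of the CS Machines table.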
Although the number of reducers in the configuration file is set equal to the number of mappers, the WordCount code itself uses only one reducer.
The single-node CS machine performance is poor compared to that of the multi-node CS machines or of Eucalyptus; differences between the platforms can be attributed to the creation of multiple virtual instances on Eucalyptus and to any latency due to communication.
Ideally, having the same number of mappers on a machine as the number of CPUs on it gives the highest efficiency. However, the results may sometimes look slightly erratic, given that the CS machines are shared and there may be scheduled backup tasks running periodically. Similarly, the Eucalyptus VMs run in a shared environment, so the performance may vary depending on the load on the hypervisor.
For all the tests, the input is as follows:
Files: 16 files
File size: ~ 1.5MB each
Total size: ~ 26.5MB.
Cores on CS machines: 2 cores per machine
Cores on Eucalyptus machines: 4 cores per machine
On single-node CS machines: Performance improves as the number of mappers increases from 1 to 8, but it suddenly degrades when we try to allocate 16 mappers on the 2-core machines. This is because the context-switching overhead becomes very high even though the data is small, and the hardware cannot satisfy the resource requirement that the user specifies. Ideally the performance should also degrade when we specify 4 or 8 mappers, but an improvement is still seen because the data set is small and the context-switching time is approximately the same as the time required for data processing.
On 2-node CS machines: Performance improves relative to the corresponding single-node runs as the number of mappers increases, but this is because of the small data set; if we increased the number of mappers to 32, performance would degrade for the same reason as above. The improvement can be explained as follows: with 2 nodes and 2 mappers per node, we have 4 mappers working at a time, whereas on a single node with 2 mappers per node only 2 mappers work at a time; this is what improves the performance of the multi-node system compared to the single-node system.
On single-node Eucalyptus VMs: Performance improves as we increase the number of mappers. Ideally, on a 4-core machine the performance should start degrading once the number of mappers exceeds 4, but this is not observed in practice for a couple of reasons: the load on the CPU and the small data set.
On 2-node Eucalyptus VMs: Performance is better than the corresponding single-node performance, so the speedup increases almost linearly. However, as we increase the number of mappers to 16 or more, the speedup drops, because the requirements specified exceed what the system can handle, and the context switching and data management across multiple nodes take as long as, or longer than, the corresponding single-node configuration.
[Figure: Execution Time – execution time in seconds versus the number of mappers and reducers per node (1 to 16), with one curve each for CS Machines - 1 Node, CS Machines - 2 Nodes, Eucalyptus - 1 Node, and Eucalyptus - 2 Nodes; the plotted values are the execution times listed in the tables above.]

[Figure: Speedup – speedup (single-node time / two-node time) versus the number of mappers and reducers per node, with one curve for the CS machines and one for Eucalyptus.]
Feedback for FutureGrid:
As a first-timer, I am grateful to have had the opportunity to work on a robust cloud environment. One issue I faced was user login creation with Eucalyptus: there is no way to retrieve or change the password if the user has forgotten it, and on the FutureGrid portal there is no way to change the username once the user has created it. Other than that, it is a great environment to work on, and being able to run our jobs on an academic cloud gave an experience close to using real commercial clouds.
References:
1) Michael G. Noll, Running Hadoop on Ubuntu Linux (Multi-Node Cluster), http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/
2) Apache Hadoop MapReduce, http://hadoop.apache.org/mapreduce/