xc7_proj0_p2

Project #0 (Part2) Hadoop WordCount on Multiple Nodes
Name: Chen, Xiaoyang
Email: xc7@indiana.edu
1. Include title and authors information. Add brief descriptions of the problem,
methodology, and runtime environment settings.
The task is to run Hadoop WordCount on multiple nodes. Using our FutureGrid account, we start two images on Eucalyptus. We then designate a master machine and a slave machine, start Hadoop in multi-node mode on the Eucalyptus images and on the CS Linux machines separately, and measure performance with the WordCount program.
We follow the method given in the appendix of the assignment instructions and in the emails about image setup for the master and slave machines. I use PuTTY on Windows 7 to access the machines over the SSH protocol.
The following are key points of my successful configuration:

- Make sure the ssh-agent on each machine has all the public and private key information added and that the key files have suitable permissions (using commands such as chmod, ssh-agent, and ssh-add, and saving the appropriate files into the ~/.ssh directory). This ensures the operations run with suitable privileges and are not interrupted by password prompts.
- The master node must be the current machine, or the shell script will fail.
- Sometimes we must manually remove files under /tmp/*.
- In this test, with only a few input files, the number of splits depends on the file count.
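The key-setup step above can be sketched as follows; this is a minimal illustration assuming a standard OpenSSH client and the default key path:

```shell
# Prepare the .ssh directory with permissions sshd will accept
mkdir -p ~/.ssh && chmod 700 ~/.ssh

# Generate a key pair if none exists (empty passphrase so Hadoop's
# start scripts are not interrupted by password prompts)
[ -f ~/.ssh/id_rsa ] || ssh-keygen -q -t rsa -N "" -f ~/.ssh/id_rsa

# Authorize the key for logins on this node; copy the same public-key
# line into ~/.ssh/authorized_keys on every slave as well
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/id_rsa ~/.ssh/authorized_keys

# Load the key into an agent so repeated ssh calls need no prompt
eval "$(ssh-agent -s)" >/dev/null
ssh-add ~/.ssh/id_rsa 2>/dev/null
```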
Eucalyptus image information:
RESERVATION r-462F076B xc7 default
INSTANCE i-2AD305AA emi-D778156D 149.165.159.144 10.0.5.3 running xc7 0 c1.medium 2011-09-27T04:03:16.506Z india eki-78EF12D2 eri-5BB61255
RESERVATION r-3FB00813 xc7 default
INSTANCE i-4433086E emi-D778156D 149.165.159.128 10.0.5.2 running xc7 0 c1.medium 2011-09-27T04:03:11.2Z india eki-78EF12D2 eri-5BB61255
2. Write a paragraph about your implementation/coding with reference to the following files.
a. conf/master
b. conf/slaves
c. conf/core-site.xml
d. conf/hdfs-site.xml
e. conf/mapred-site.xml
The overall relationship among these files is as follows. Each file above has a template file, e.g. master_template. The shell script extracts node information (hostname or IP) from the nodes file. Together with the port numbers set inside the script itself, it fills in the corresponding fields of each template and generates the configuration files listed above. The generated files are then copied to the corresponding folders on the master and slave machines. The script also formats the file system and runs start-all.sh.
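The generation step can be sketched as below; the placeholder names (__MASTER__, __PORT__) and the template contents are illustrative assumptions, not the script's actual markers:

```shell
# Illustrative sketch of the multi-node config generation.
mkdir -p conf

# "nodes" file: first line is the master, every line is a slave
printf 'node0\nnode1\n' > nodes

MASTER=$(head -n 1 nodes)
PORT=9001                        # per-user master port set inside the script

head -n 1 nodes > conf/masters   # master hostname goes to conf/masters
cp nodes conf/slaves             # all node hostnames go to conf/slaves

# Fill the host/port fields of a template to produce the real config file
printf '<value>hdfs://__MASTER__:__PORT__</value>\n' > conf/core-site_template.xml
sed -e "s/__MASTER__/$MASTER/" -e "s/__PORT__/$PORT/" \
    conf/core-site_template.xml > conf/core-site.xml

cat conf/core-site.xml   # -> <value>hdfs://node0:9001</value>
```

The real script additionally copies the generated files to each machine, formats HDFS, and runs start-all.sh.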
1) What is a Master node? What is a Slaves node?
The master node manages the whole job: it partitions the work, assigns tasks to the workers, and so on. The slaves are the workers, the execution branches that finish their assigned parts of the job. Here the machine on the first line of the "nodes" file is the master, and it, together with the machine on the second line, also acts as a slave.
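For example, a two-machine "nodes" file looks like this (the hostnames are illustrative, not the actual machine names used in the test):

```shell
# First line is the master (which also runs a slave daemon);
# every listed machine acts as a slave.
printf 'node0\nnode1\n' > nodes
head -n 1 nodes   # the master -> node0
cat nodes         # all slaves
```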
2) Why do we need to set unique available ports to those configuration files on a shared
environment? What errors or side-effects will show if we use same port number for
each user?
In the shell script, ports are assigned to different functions. Port[0] (9001) is set for the master node, Port[1] (9002) for HDFS, and Port[2] (9003) and Port[3] (9004) for the two job trackers respectively. On a shared environment, each service must bind to its own port; if several users chose the same port numbers, their daemons would collide: the later daemon fails to start with an "Address already in use" error, or a client may accidentally connect to another user's daemon.
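As an illustration of where these ports end up, the standard Hadoop 0.20 properties below carry the master and HDFS addresses; the hostname "master" and the exact Port[n]-to-property mapping are assumptions here:

```xml
<!-- conf/core-site.xml: HDFS namenode address (port must be unique per user) -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://master:9002</value>
</property>

<!-- conf/mapred-site.xml: JobTracker address -->
<property>
  <name>mapred.job.tracker</name>
  <value>master:9001</value>
</property>
```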
3) How can we change the number of mappers and reducers from the configuration file?
You can change the number of mappers and reducers per node by editing the configuration template "conf/mapred-site_template.xml" and rerunning the multi-node shell script. Change the values (2 here) in the corresponding properties, shown below:
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <!-- maximum map tasks per node; set it equal to the number of CPU cores -->
  <value>2</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <!-- maximum reduce tasks per node; set it equal to the number of CPU cores -->
  <value>2</value>
</property>
3. Create a statistic table contrasting a single node mode and two nodes mode. Then draw a
graph including two execution line charts and another graph including two speed up
charts based on your performance results. Add comments when appropriate.
1) number of mappers on each node
2) number of reducers on each node
3) total number of nodes
4) execution time (an average of 10 runs)
5) speed-up (between single-node and two-node mode)
Then, based on these statistics, draw an execution-time line chart and a speed-up line chart.
The test data is only 8 files, which is too few to show a good multi-node speed-up.
Machine               ID  Nodes#  MaxCore#  Mapper#  Reducer#  ExTime(s)  Total Speedup  Node Speedup
CS: Gneiss            1   1       8         1        1         64.3       100.00%        100.00%
CS: Gneiss            2   1       8         2        1         29.5       217.97%        100.00%
CS: Gneiss            3   1       8         4        1         28.3       227.21%        100.00%
CS: Schist & Granite  4   2       4         1        1         48         133.96%        133.96%
CS: Schist & Granite  5   2       4         2        1         31.6       203.48%        93.35%
Eucalyptus            6   2       4         4        1         22.7       283.26%        124.67%
Eucalyptus            7   2       4         2        1         42.6       150.94%        150.94%
Eucalyptus            8   2       4         4        1         37.7       170.56%        78.25%
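Speed-up here is the baseline execution time (ID 1: one node, one mapper, 64.3 s) divided by the measured time; for example, for ID 2:

```shell
# Total speed-up of configuration ID 2 relative to the 1-node baseline
awk 'BEGIN { printf "%.2f%%\n", 64.3 / 29.5 * 100 }'   # -> 217.97%
```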
Execution time & speed-up on 1 node with different maximum map & reduce settings:
[Figure: "Exec 1 Node" - execution time (s) vs. number of mappers]
[Figure: "Sp up 1 Node" - speed-up (%) vs. number of mappers]
Speed-up of 2 nodes on the CS and Eucalyptus machines:
[Figure: speed-up (%) vs. number of mappers, series "CS 2 nodes" and "Eucalypt"]
Total performance:
[Figure: "Execution Time" - execution time (s) for configurations 1-8]
[Figure: "Total Speedup" - total speed-up (%) for configurations 1-8]
4. Write feedback for FutureGrid
The login process is not fully functional: when I mistyped my email address, I could not correct the error from the client side. Still, it is easy to use as a command-line system.