Project #0 (Part 2): Hadoop WordCount on Multiple Nodes
Name: Chen, Xiaoyang
Email: xc7@indiana.edu

1. Include title and authors information. Add brief descriptions of the problem, methodology, and runtime environment settings.

The goal is to test Hadoop WordCount on multiple nodes. Using our FutureGrid accounts, we start two images on Eucalyptus, designate a master and a slave machine, start Hadoop in multi-node mode on the Eucalyptus images and on the CS Linux machines separately, and measure performance with the WordCount program. We set up the environment following the method in the Appendix of the assignment instructions and the emails about master/slave image setup. I use PuTTY on Windows 7 to access the machines over SSH. Key points of my successful configuration:

- Make sure the ssh-agent on each machine has all public and private keys added and that the key files have suitable permissions (using commands such as chmod, ssh-agent, and ssh-add, and placing the files in the ~/.ssh directory). This ensures the operations have the necessary privileges and avoids interruptions from password prompts.
- The master node must be the current machine, or the shell script execution will fail.
- Sometimes we must manually remove files under /tmp/*.
- In this test, with such a limited number of files, the number of splits depends on the file count.

Eucalyptus image information:

RESERVATION  r-462F076B  xc7  default
INSTANCE     i-2AD305AA  emi-D778156D  149.165.159.144  10.0.5.3  running  xc7  0  c1.medium  2011-09-27T04:03:16.506Z  india  eki-78EF12D2  eri-5BB61255
RESERVATION  r-3FB00813  xc7  default
INSTANCE     i-4433086E  emi-D778156D  149.165.159.128  10.0.5.2  running  xc7  0  c1.medium  2011-09-27T04:03:11.2Z  india  eki-78EF12D2  eri-5BB61255

2. Write a paragraph about your implementation/coding with reference to the following files.
a. conf/master
b. conf/slaves
c. conf/core-site.xml
d. conf/hdfs-site.xml
e. conf/mapred-site.xml

The overall relationship among these files is as follows.
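Conceptually, the generation flow can be sketched like this. This is a minimal illustrative sketch, not the actual script: the demo/ paths and the @MASTER@/@PORT@ placeholder tokens are assumptions.

```shell
# Sketch of template-driven config generation (assumed names/placeholders).
mkdir -p demo

# "nodes" file: the first line is the master; every line is a slave.
printf 'node-a\nnode-b\n' > demo/nodes

# A template whose placeholders the script fills in.
cat > demo/core-site_template.xml <<'EOF'
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://@MASTER@:@PORT@</value>
  </property>
</configuration>
EOF

MASTER=$(head -n 1 demo/nodes)   # master = first node in the nodes file
HDFS_PORT=9002                   # must be unique per user on a shared machine

head -n 1 demo/nodes > demo/masters   # conf/masters: the master node
cp demo/nodes demo/slaves             # conf/slaves: all worker nodes

# Substitute the placeholders to produce the real configuration file.
sed -e "s/@MASTER@/$MASTER/" -e "s/@PORT@/$HDFS_PORT/" \
    demo/core-site_template.xml > demo/core-site.xml
```

After generating the files, the real script would copy them into each machine's conf/ directory, format the file system, and run start-all.sh.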
Each of the files above has a template file, e.g. master_template. The shell script simply extracts node information (names or IPs) from the nodes file and, together with the port numbers set inside the script itself, fills in the corresponding fields of each template to generate the configuration files listed above. The generated files are then copied to the corresponding directories on the master and slave machines. The script also formats the file system and runs start-all.sh.

1) What is a Master node? What is a Slaves node?
The master node manages the whole job: it partitions the work, assigns pieces to workers, and so on. Slaves are the workers that execute the assigned parts of the job. Here the machine on the first line of the "nodes" file is the master, and it, together with the machine on the second line, also serves as a slave.

2) Why do we need to set unique available ports in those configuration files on a shared environment? What errors or side-effects will show if we use the same port number for each user?
In the shell script, ports are assigned to different functions: Port[0] (9001) is set for the master node, Port[1] (9002) for HDFS, and Port[2] (9003) and Port[3] (9004) for the two JobTracker ports, respectively. If every user used the same port numbers on a shared machine, the different function modules would collide on those ports.

3) How can we change the number of mappers and reducers from the configuration file?
You can change the maximum number of mappers and reducers per node by editing the configuration file "conf/mapred-site_template.xml" and rerunning the multi-node shell script. Change the values (2 here) of the corresponding properties, shown below:

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <!-- maximum map tasks per node; set it to the number of CPU cores -->
  <value>2</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <!-- maximum reduce tasks per node; set it to the number of CPU cores -->
  <value>2</value>
</property>

3.
Create a statistic table contrasting single-node mode and two-node mode. Then draw a graph with two execution-time line charts and another graph with two speed-up charts based on your performance results. Add comments where appropriate.

The table records:
1) number of mappers on each node
2) number of reducers on each node
3) total number of nodes
4) execution time (an average of 10 runs)
5) speed up (between single-node and two-node mode)

Based on these statistics, we draw an execution-time line chart and a speed-up line chart. The test data is only 8 files, which is too little to show a good speed-up for multiple nodes.

ID  Machine ID            Nodes#  MaxCore#  Mapper#  Reducer#  ExTime(s)  Total Speedup  Node Speedup
1   CS: Gneiss            1       8         1        1         64.3       100.00%        100.00%
2   CS: Gneiss            1       8         2        1         29.5       217.97%        100.00%
3   CS: Gneiss            1       8         4        1         28.3       227.21%        100.00%
4   CS: Schist & Granite  2       4         1        1         48         133.96%        133.96%
5   CS: Schist & Granite  2       4         2        1         31.6       203.48%        93.35%
6   Eucalyptus            2       4         4        1         22.7       283.26%        124.67%
7   Eucalyptus            2       4         2        1         42.6       150.94%        150.94%
8   Eucalyptus            2       4         4        1         37.7       170.56%        78.25%

[Figure: execution time on 1 node with different maximum map/reduce settings]
[Figure: speed-up on 1 node with different maximum map/reduce settings]
[Figure: speed-up of 2 nodes on the CS and Eucalyptus machines]
[Figure: total execution time for runs 1-8]
[Figure: total speed-up for runs 1-8]

4. Write feedback for FutureGrid

The login/registration flow is incomplete: when I mistyped my email address, I could not correct the error from the client side. Apart from that, it is easy to use as a command-line system.
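As a note on the speed-up figures above: the Total Speedup column is the run-1 baseline time divided by each run's average time, expressed as a percentage. A quick check of that arithmetic for run 2 (this awk one-liner is only an illustration of the calculation):

```shell
# Total speed-up = baseline time / run time, as a percentage.
# Run 1 (1 node, 1 mapper) is the 64.3 s baseline; run 2 averaged 29.5 s.
awk 'BEGIN { printf "%.2f%%\n", 100 * 64.3 / 29.5 }'   # prints 217.97%
```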