Week-10 Timing Results

Task-1:
Step-1: Combined all sotu.txt files into "AllYears.txt" and placed it into HDFS. Ran the WordCount-1 example under time; the job took 43.142 sec.
Step-2: Modified the WordCount-1 code so that it outputs only the words that occur more than 4 times.
Step-3: Ran the WordCount-2 example with a pattern.txt file and used the -skip option to remove the stop words (and, or, but, the, to) from the output. Also turned on the lowercase option.
Step-4: Imported the output of WordCount-1 and WordCount-2 into an Excel sheet and sorted it to find the word with the highest count.
WordCount-1 highest occurrence: "the" – 1867 times
WordCount-2 highest occurrence: "of" – 1142 times
Step-5: Ran WordCount-2 after adding job.setNumReduceTasks(1) to the WordCount2.java code.
Step-6: Ran WordCount-2 after adding job.setNumReduceTasks(2) to the WordCount2.java code.
Step-7: Ran WordCount-2 after adding job.setNumReduceTasks(4) to the WordCount2.java code.
Step-8: Graphical analysis.

Analysis: For the same AllYears.txt input, the run time increases as reduce tasks are added to the job. In my view, the reasons could be:
1) With setNumReduceTasks(1), one output file is created: part-r-00000.
2) With setNumReduceTasks(2), two output files are created: part-r-00000 and part-r-00001.
3) With setNumReduceTasks(4), four output files are created: part-r-00000 through part-r-00003.
As the number of reduce tasks increases, the number of output files grows in the same proportion, so more time is spent creating and writing them. Hence, increasing the number of reducers increases the framework overhead, but it also improves load balancing and lowers the cost of failures.
Another reason for the time increase can be explained at the block level. With one reduce task, the output is written to only a single 128 MB HDFS block.
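The Step-2 change boils down to a threshold check before a word is emitted. The real change lives in the WordCount reducer (guarding context.write), which needs a Hadoop cluster to run; the sketch below shows the same filtering logic in plain Java, with a hypothetical class name:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the Step-2 filter: keep only words whose total count
// exceeds the threshold (4). In the real Hadoop job this check sits
// in the reducer, in front of context.write(key, result).
public class FrequentWords {
    static final int THRESHOLD = 4;

    static Map<String, Integer> filter(Map<String, Integer> counts) {
        Map<String, Integer> kept = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > THRESHOLD) { // emit only words seen more than 4 times
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("the", 1867);   // survives the filter
        counts.put("liberty", 3);  // dropped: count <= 4
        System.out.println(filter(counts));
    }
}
```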
However, with 2 or 4 reduce tasks, the output may be written to different HDFS blocks, so writing the output takes longer because data is spread across all of the output files (part-r-00000/00001/00002/00003, ...).

Task-2:
Part-1: Using the data set 5085copy.txt on the cluster, I got the timing results below.
Step-1: Ran MaxTemperatureWithCombiner after adding job.setNumReduceTasks(1) to the MaxTemperatureWithCombiner.java code, on 5085copy.txt.
Step-2: Ran MaxTemperatureWithCompression after adding job.setNumReduceTasks(2) to the MaxTemperatureWithCompression.java code, on 5085copy.txt.
Step-3: Ran MaxTemperatureWithCompression after adding job.setNumReduceTasks(4) to the MaxTemperatureWithCompression.java code, on 5085copy.txt.
Part-2: Using the data set 5085.txt.gz on the cluster, I got the timing results below.
Step-1: Ran MaxTemperatureWithCombiner after adding job.setNumReduceTasks(1) to the MaxTemperatureWithCombiner.java code, on 5085.txt.gz.
Step-2: Ran MaxTemperatureWithCompression after adding job.setNumReduceTasks(2) to the MaxTemperatureWithCompression.java code, on 5085.txt.gz.
Step-3: Ran MaxTemperatureWithCompression after adding job.setNumReduceTasks(4) to the MaxTemperatureWithCompression.java code, on 5085.txt.gz.
Part-3: Using the data set 5085.txt.bz2 on the cluster, I got the timing results below.
Step-1: Ran MaxTemperatureWithCombiner after adding job.setNumReduceTasks(1) to the MaxTemperatureWithCombiner.java code, on 5085.txt.bz2.
Step-2: Ran MaxTemperatureWithCompression after adding job.setNumReduceTasks(2) to the MaxTemperatureWithCompression.java code, on 5085.txt.bz2.
Step-3: Ran MaxTemperatureWithCompression after adding job.setNumReduceTasks(4) to the MaxTemperatureWithCompression.java code, on 5085.txt.bz2.
Part-4: Graphical analysis for each file type (.txt, .gz, .bz2) separately and for all of them combined.
Step-1: Graph for 5085copy.txt with different reducer counts (all times in min).
Step-2: Graph for 5085.txt.gz with different reducer counts.
Step-3: Graph for 5085.txt.bz2 with different reducer counts.
Step-4: Graph for all files together with different reducer counts.

Analysis: The time taken decreases in the order 5085.txt.bz2 > 5085copy.txt > 5085.txt.gz, and within each file it decreases as reduce tasks are added to the job.
1) With setNumReduceTasks(1), one output file is created (part-r-00000, with a .gz or .bz2 extension when compression is on) for every file type.
2) With setNumReduceTasks(2), two output files are created (part-r-00000 and part-r-00001) for every file type.
3) With setNumReduceTasks(4), four output files are created (part-r-00000 through part-r-00003) for every file type.
For the MaxTemperature job on the 1950 and 1985 data in each file type (.txt, .gz, .bz2), the findings are:
1) Overall time taken to process the job on the cluster: 5085.txt.bz2 > 5085copy.txt > 5085.txt.gz.
2) Overall time taken within each file type for different numbers of reducers: setNumReduceTasks(1) > setNumReduceTasks(2) > setNumReduceTasks(4).
From these findings, in my view the job with one reducer is slow because that single reduce task has to wait for all mappers to finish, and the shuffle phase has to redirect all of the intermediate data to that one reducer. So it is natural that the map and shuffle times are larger, and so is the overall time.
A common rule of thumb is that the right number of reduces is about 0.95 or 1.75 * (nodes * mapred.tasktracker.reduce.tasks.maximum).
In short, for the temperature data the number of reduce tasks is inversely proportional to the time taken. Reasons: as the number of reduce tasks increases, the number of output files increases in the same proportion.
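The rule of thumb quoted above is simple arithmetic; the sketch below works it out for a hypothetical cluster of 4 nodes with 2 reduce slots each (both numbers are assumptions for illustration, not the assignment cluster's real values):

```java
// Rule-of-thumb reducer counts. With factor 0.95 all reduces launch in a
// single wave; with 1.75 a second, smaller wave improves load balancing.
public class ReducerHeuristic {
    static int reducers(double factor, int nodes, int slotsPerNode) {
        // floor(factor * total reduce slots on the cluster)
        return (int) Math.floor(factor * nodes * slotsPerNode);
    }

    public static void main(String[] args) {
        // Hypothetical cluster: 4 nodes * 2 reduce slots = 8 slots.
        System.out.println(reducers(0.95, 4, 2)); // 7
        System.out.println(reducers(1.75, 4, 2)); // 14
    }
}
```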
But the output for the temperature data is stored in only one of those files; the remaining output files are empty. Also, increasing the number of reduce tasks adds parallelism, since the reduce tasks run concurrently on the cluster; because the extra output files are empty, the parallel run finishes sooner and overall efficiency increases.
Block-level explanation: The output files keep the same compression format as the input; for 5085.txt.gz the output is written as part-r-00000.gz. The compressed 5085 data is smaller, so the mapper and reducer take the data in compressed form, process it through the map and reduce phases, and compress the output again. Because the input is smaller, it is written to fewer 128 MB blocks, and the real output lands in only one file irrespective of the number of reducers. BZip2 can also produce better compression than GZip for some types of files, at the cost of some speed when compressing and decompressing. From a CPU point of view the results are not optimal, as compression is very CPU-intensive; hence the .bz2 file takes the most time to produce its output. Finally, all reducer output files are empty except the one that contains the output. So the reduce tasks create the full set of output files, but when writing them to HDFS the empty files cost almost nothing, even though part-r-00001 and part-r-00002 could land on different blocks. So increasing the number of reducers increases efficiency for this job. Apart from these technical details about reduce tasks, the time taken to process the MR jobs can also be affected by network traffic and network speed.
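The claim that the compressed input is smaller, and therefore spans fewer HDFS blocks, can be illustrated with Java's built-in gzip support. The repeated record line below is a made-up stand-in for the fixed-width weather records, not real data from the 5085 file:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Illustrates why 5085.txt.gz is smaller than 5085copy.txt: repetitive
// text (like fixed-width weather records) compresses well under gzip,
// so there are far fewer bytes (and HDFS blocks) to read.
public class GzipDemo {
    static int gzipSize(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data); // compress the whole buffer
        }
        return bos.size();
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical repetitive record line, repeated 1000 times.
        byte[] raw = "1950 0043011990 +0022\n".repeat(1000)
                         .getBytes(StandardCharsets.UTF_8);
        System.out.println(raw.length + " bytes raw, "
                + gzipSize(raw) + " bytes gzipped");
    }
}
```

Note that this only shows the size effect; it says nothing about splittability, which is where .gz (not splittable) and .bz2 (splittable, but slower to decode) differ on a real cluster.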
In this assignment, for example, with fewer people on the cluster a job completed in about 10 min, compared to roughly 70 min for one job when about 50 people shared the same cluster.
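The earlier observation that only one part-r-* file holds data for the temperature job follows from how Hadoop's default HashPartitioner assigns keys to reducers: with only two distinct year keys (1950 and 1985) and four reducers, at most two of the four partitions can ever receive records, so the rest of the output files are created empty. A plain-Java sketch of that partition function:

```java
import java.util.HashSet;
import java.util.Set;

// Mirrors Hadoop's default HashPartitioner:
//   partition = (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
// With two distinct keys and four reducers, at most two of the four
// part-r-* files receive data; the others stay empty.
public class PartitionDemo {
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        Set<Integer> used = new HashSet<>();
        for (String year : new String[] {"1950", "1985"}) {
            used.add(partitionFor(year, 4));
        }
        System.out.println(used.size() + " of 4 partitions receive data");
    }
}
```

This matches the report's finding: the empty part files are cheap to write, so the extra reducers cost little for this particular job.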