Asati-Sumit-week-10-times

Week-10 Times
Task-1
Step-1: I combined all the sotu.txt files into “AllYears.txt” and placed it into HDFS.
Ran the WordCount-1 example with time and got a run time of 43.142 sec.
Step-2: Modified the WordCount-1 code so that it outputs only the words that occur more than 4 times.
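A minimal sketch of the reducer change, assuming the standard WordCount reducer signature (the class name FilteredSumReducer is illustrative, not from the original code):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sum the counts as in the original WordCount, but only emit words whose
// total count is greater than 4.
public class FilteredSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private final IntWritable result = new IntWritable();

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    if (sum > 4) {               // the only change: drop low-frequency words
      result.set(sum);
      context.write(key, result);
    }
  }
}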
Step-3: Ran the WordCount2 example with a pattern.txt file and used the -skip option to remove punctuation and the stop words (and, or, but, the, to) from the output. Also turned on the lowercase option.
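A minimal sketch of how the lowercase option and the skip patterns act in the mapper (class and field names are illustrative, not the exact WordCount2 code; in the real job the patterns come from pattern.txt via -skip rather than being hard-coded):

import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Lowercase each line, strip the skip patterns, then count the remaining tokens.
public class SkipPatternMapper extends Mapper<Object, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();
  // Hard-coded only to keep the sketch self-contained; word-boundary patterns
  // so that whole stop words are removed, plus simple punctuation.
  private final List<String> patternsToSkip = Arrays.asList(
      "\\.", "\\,", "\\band\\b", "\\bor\\b", "\\bbut\\b", "\\bthe\\b", "\\bto\\b");

  @Override
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString().toLowerCase();   // lowercase option turned on
    for (String pattern : patternsToSkip) {
      line = line.replaceAll(pattern, "");
    }
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, ONE);
    }
  }
}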
Step-4: Loaded the output of WordCount-1 and WordCount-2 into an Excel sheet and sorted it to find the word with the highest count.
WordCount-1: highest-occurring word is “the” – 1867 times
WordCount-2: highest-occurring word is “of” – 1142 times
Step-5: Ran WordCount-2 with job.setNumReduceTasks(1) added to the WordCount2.java code.
Step-6: Ran WordCount-2 with job.setNumReduceTasks(2) added to the WordCount2.java code.
Step-7: Ran WordCount-2 with job.setNumReduceTasks(4) added to the WordCount2.java code.
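A minimal driver sketch showing where the number of reduce tasks is set for Steps 5-7 (the mapper/reducer class names reuse the sketches above and are assumptions, not the actual WordCount2 source):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// The driver is identical for Steps 5-7 except for the argument to setNumReduceTasks.
public class WordCount2Driver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count 2");
    job.setJarByClass(WordCount2Driver.class);
    job.setMapperClass(SkipPatternMapper.class);      // assumed mapper class
    job.setReducerClass(FilteredSumReducer.class);    // assumed reducer class
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setNumReduceTasks(4);                         // 1 in Step-5, 2 in Step-6, 4 in Step-7
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}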
Step-8: Graphical Analysis
Analysis:
As we can see, the run time for the same AllYears.txt file increases when we add reduce tasks to the job. In my view, the reasons for the time increase could be the following.
1) With setNumReduceTasks(1), only one output file is created: part-r-00000.
2) With setNumReduceTasks(2), two output files are created: part-r-00000 and part-r-00001.
3) With setNumReduceTasks(4), four output files are created: part-r-00000 through part-r-00003.
As the number of reduce tasks grows, the number of output files grows in the same proportion, so more time is spent producing them. Hence, increasing the number of reduces increases the framework overhead, but it also improves load balancing and lowers the cost of failures.
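Which word goes to which part file is decided by the job's partitioner; the default HashPartitioner behaves essentially like the sketch below, so with 4 reducers the counts are spread over part-r-00000 through part-r-00003:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Equivalent of the default HashPartitioner for this job: a given word is
// always sent to the same reduce task, chosen by hashing the key.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}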
Another reason for the time increase with more reducers can be explained at the block level. With one reduce task, the output is written to only one 128 MB HDFS block. With 2 or 4 reduce tasks, however, the outputs may be written to different HDFS blocks, so writing the output to HDFS takes longer because there is data in every output file (part-r-00000/1/2/3...).
Task-2
Part-1: Used the data set 5085.txt on the cluster and got the timing results below.
Step-1: Ran the MaxTemperatureWithCombiner example on 5085copy.txt with job.setNumReduceTasks(1) added to the MaxTemperatureWithCombiner.java code.
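The relevant driver lines for this step look roughly like the following (class names follow the common textbook MaxTemperature example and are an assumption about the local copy of the code):

// Inside the driver's main(), after Job.getInstance(...):
job.setMapperClass(MaxTemperatureMapper.class);
job.setCombinerClass(MaxTemperatureReducer.class);  // combiner takes a local max per year
job.setReducerClass(MaxTemperatureReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setNumReduceTasks(1);                           // as in this step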
Step-2: Ran the MaxTemperatureWithCompression example on 5085copy.txt with job.setNumReduceTasks(2) added to the MaxTemperatureWithCompression.java code.
Step-3: Ran the MaxTemperatureWithCompression example on 5085copy.txt with job.setNumReduceTasks(4) added to the MaxTemperatureWithCompression.java code.
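For the MaxTemperatureWithCompression runs in Steps 2 and 3, the part of the driver that produces compressed output looks roughly like this (a sketch assuming gzip output, as in the common example, with imports from org.apache.hadoop.mapreduce.lib.output and org.apache.hadoop.io.compress):

// Inside the driver's main(): compress the reducer output, which is why the
// part files come out as part-r-0000N.gz.
FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
job.setNumReduceTasks(2);   // 2 in Step-2, 4 in Step-3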
Part-2: Used the data set 5085.txt.gz on the cluster and got the timing results below.
Step-1: Ran the MaxTemperatureWithCombiner example on 5085.txt.gz with job.setNumReduceTasks(1) added to the MaxTemperatureWithCombiner.java code.
Step-2: Ran the MaxTemperatureWithCompression example on 5085.txt.gz with job.setNumReduceTasks(2) added to the MaxTemperatureWithCompression.java code.
Step-3: Ran the MaxTemperatureWithCompression example on 5085.txt.gz with job.setNumReduceTasks(4) added to the MaxTemperatureWithCompression.java code.
Part-3: Used the data set 5085.txt.bz2 on the cluster and got the timing results below.
Step-1: Ran the MaxTemperatureWithCombiner example on 5085.txt.bz2 with job.setNumReduceTasks(1) added to the MaxTemperatureWithCombiner.java code.
Step-2: Ran the MaxTemperatureWithCompression example on 5085.txt.bz2 with job.setNumReduceTasks(2) added to the MaxTemperatureWithCompression.java code.
Step-3: Ran the MaxTemperatureWithCompression example on 5085.txt.bz2 with job.setNumReduceTasks(4) added to the MaxTemperatureWithCompression.java code.
Part-4: Graphical analysis for each file (.txt, .gz, .bz2) separately and for all files combined.
Step-1: Graph for 5085-copy.txt with different numbers of reducers (all times in minutes).
Step-2: Graph for 5085.txt.gz with different numbers of reducers.
Step-3: Graph for 5085.txt.bz2 with different numbers of reducers.
Step-4: Graph for all files together with different numbers of reducers.
Analysis:
As we can see, the time taken decreases gradually as we add reduce tasks to the job, and across file types the times follow the order 5085.txt.bz2 > 5085copy.txt > 5085.txt.gz.
1) With setNumReduceTasks(1), only one output file is created (part-r-00000, with a .gz or .bz2 suffix for the compressed runs), for all file types.
2) With setNumReduceTasks(2), two output files are created (part-r-00000 and part-r-00001, again with .gz/.bz2 suffixes for the compressed runs), for all file types.
3) With setNumReduceTasks(4), four output files are created (part-r-00000 through part-r-00003, with .gz/.bz2 suffixes for the compressed runs).
For the MaxTemperature job on the 5085 data (years 1950 and 1985) in .txt, .gz and .bz2 form, the findings are:
1) Overall time taken to process the job on the cluster:
5085.txt.bz2 > 5085-copy.txt > 5085.txt.gz
2) Overall time taken to process the job within each file type, for different numbers of reducers:
setNumReduceTasks(1) > setNumReduceTasks(2) > setNumReduceTasks(4)
From the above findings, in my view the job with one reducer is slow because, with just 1 reduce task, the reducer has to wait for all mappers to finish, and the shuffle phase has to collect all intermediate data and redirect it to that single reducer. So it is natural that the map and shuffle times are larger, and so is the overall time. From the findings, the right number of reduces seems to be 0.95 or 1.75 * (nodes * mapred.tasktracker.reduce.tasks.maximum).
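For example, on a hypothetical cluster of 4 nodes with mapred.tasktracker.reduce.tasks.maximum = 2, that rule of thumb would suggest roughly 0.95 * (4 * 2) ≈ 7 reducers (so all reduces finish in a single wave) or 1.75 * (4 * 2) = 14 (for finer load balancing and cheaper recovery from failed reduces).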
From the above output for the temperature data, the key observation can be summed up as:
the number of reduce tasks is inversely proportional to the time taken on the years data.
Reasons: as the number of reduce tasks increases, the number of output files increases in the same proportion, but the output for the temperature data is stored in only one of those files, and the remaining output files are simply empty.
Also, increasing the number of reduce tasks gives more parallelism, because the reduce tasks requested through setNumReduceTasks can run at the same time; since most of those output files are empty, the parallel run finishes in less time and the overall efficiency increases.
Block-level explanation: the output files are written in the same compression format as the imported data; for example, for 5085.txt.gz the output comes out as part-r-00000.gz. Since the compressed 5085 data is smaller, the mapper and reducer code take the data in compressed form, process it through the map and reduce phases, and then compress the output as well. Because the input is smaller, it is written to fewer 128 MB blocks, and the output data ends up in only one part file irrespective of the number of reducers.
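As a small illustration of how Hadoop decides this, the codec is chosen from the file extension (the file name here is just the one used above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

// Hadoop picks the codec from the extension, so 5085.txt.gz is decompressed
// transparently on input and compressed output keeps a matching suffix.
public class CodecCheck {
  public static void main(String[] args) {
    CompressionCodecFactory factory = new CompressionCodecFactory(new Configuration());
    CompressionCodec codec = factory.getCodec(new Path("5085.txt.gz"));
    System.out.println(codec == null ? "no codec" : codec.getClass().getSimpleName()); // GzipCodec
  }
}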
BZip2 can also produce better compression than GZip for some types of files, at the cost of some speed when compressing and decompressing. From the CPU point of view, the CPU time and performance are not optimal, because this compression is very CPU intensive; hence the bz2 run takes more time to produce its output.
Lastly, all of the reducer output files are empty except one, which contains the output. So we can say that the reduce tasks create the expected number of part files, but when writing them to HDFS the reducer does not care how they are laid out: part-r-00001 and part-r-00002 could sit on different blocks, yet writing the empty files costs very little. So we can say that increasing the number of reducers increases efficiency here.
Apart from the technical details about the reduce tasks above, the time taken to process the MR jobs can also be affected by network traffic and network/internet speed. In this assignment, with fewer people on the cluster the job completed much faster (in ~10 min) compared to 50 people on the same cluster (~70 min for one job).