Personal_3.MapReduce An Introduction - hadoop

• Understanding MapReduce
• Map Reduce - An Introduction
• Word count – default
• Word count – custom

Understanding MapReduce
• A programming model for processing large datasets
• Supported languages for MR: Java, Ruby, Python, C++
• MapReduce programs are inherently parallel
• More data means more machines to analyze it
• No need to change anything in the code

Start with the WORDCOUNT example
• Input: “Do as I say, not as I do”

    Word   Count
    As     2
    Do     2
    I      2
    Not    1
    Say    1

    define wordCount as Map<String,long>;
    for each document in documentSet {
        T = tokenize(document);
        for each token in T {
            wordCount[token]++;
        }
    }
    display(wordCount);

• This works as long as the number of documents to process is not very large
• Spam filter
  ▪ Millions of emails
  ▪ Word count for analysis
• Working from a single computer is time-consuming
• Rewrite the program to count from multiple machines

How do we attain parallel computing?
1. Each machine computes the counts for a fraction of the documents
2. Combine the results from all the machines

STAGE 1

    define wordCount as Map<String,long>;
    for each document in documentSubset {
        T = tokenize(document);
        for each token in T {
            wordCount[token]++;
        }
    }

STAGE 2

    define totalWordCount as Multiset;
    for each wordCount received from firstPhase {
        multisetAdd(totalWordCount, wordCount);
    }
    display(totalWordCount);

[Diagram: the master distributes the documents to Comp-1, Comp-2, Comp-3 and Comp-4]

Problems STAGE 1
• Document segregation has to be well defined
• Bottleneck in network transfer
  ▪ The processing is data-intensive, not computationally intensive
  ▪ So it is better to store the files on the processing machines
• BIGGEST FLAW: storing the words and counts in memory
  ▪ A disk-based hash-table implementation is needed

Problems STAGE 2
• Phase 2 has only one machine, which is a bottleneck
• Phase 1 is highly distributed, though
• Make phase 2 distributed as well
• This needs changes in phase 1
  ▪ Partition the phase-1 output (say, based on the first character of the word)
  ▪ We then have 26 machines in phase 2
  ▪ The single disk-based hash table now becomes 26 disk-based hash tables
  ▪ wordcount-a, wordcount-b, wordcount-c

[Diagram: Comp-1 to Comp-4 each produce per-letter count tables (A, B, C, D, E, ...) that are sent to the phase-2 machines Comp-10, Comp-20, Comp-30, Comp-40]

After phase 1
• From Comp-1:
  ▪ WordCount-A → Comp-10
  ▪ WordCount-B → Comp-20
  ▪ ...
• Each machine in phase 1 shuffles its output to different machines in phase 2

This is getting complicated
• Store the files where they are being processed
• Write a disk-based hash table to get around RAM limitations
• Partition the phase-1 output
• Shuffle the phase-1 output and send it to the appropriate reducer
• This is a lot of work for a word count
• We haven’t even touched fault tolerance: what if Comp-1 or Comp-10 fails?
• So we need a framework that takes care of all these things
• We concentrate only on the business logic

[Diagram: documents in HDFS → mappers (Comp-1 to Comp-4) → partitioning → interim output → shuffling → reducers (Comp-10, Comp-20, Comp-30, Comp-40), coordinated by the master]

Mapper and Reducer
• The mapper filters and transforms the input
• The reducer collects the mapper output and aggregates it
• Extensive research went into arriving at the two-phase strategy
• Mapper, Reducer, Partitioner and Shuffling work together as a common structure for data processing

               Input            Output
    Mapper     <K1,V1>          List<K2,V2>
    Reducer    <K2,list(V2)>    List<K3,V3>

For word count:
• Mapper input: <key, words_per_line>; output: <word, 1>
• Reducer input: <word, list(1)>; output: <word, count(list(1))>

• As said, don’t store the data in memory
• So keys and values regularly have to be written to disk
• They must be serialized
• Hadoop provides its own way of serializing and deserializing
• Any class used as a key or value has to implement the Writable interface (keys must implement WritableComparable)
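To make the serialization point concrete, here is a minimal Python sketch of what turning a key/value pair into bytes and back involves. It is an analogy only: the function names `serialize_pair` and `deserialize_pair` and the byte layout (a 4-byte length prefix, the UTF-8 text, then an 8-byte count) are assumptions made for illustration, not Hadoop's actual Writable wire format.

```python
import struct
from typing import Tuple

def serialize_pair(word: str, count: int) -> bytes:
    """Encode (word, count) as: 4-byte length, UTF-8 bytes, 8-byte signed count.

    Hypothetical layout chosen for this sketch, not Hadoop's format.
    """
    encoded = word.encode("utf-8")
    return struct.pack(">I", len(encoded)) + encoded + struct.pack(">q", count)

def deserialize_pair(data: bytes) -> Tuple[str, int]:
    """Decode bytes produced by serialize_pair back into (word, count)."""
    (length,) = struct.unpack_from(">I", data, 0)
    word = data[4:4 + length].decode("utf-8")
    (count,) = struct.unpack_from(">q", data, 4 + length)
    return word, count

# Round trip: the pair survives being written out as bytes and read back.
record = serialize_pair("do", 2)
print(deserialize_pair(record))
```

The point of the round trip is that a key or value must be able to write itself to a byte stream and rebuild itself from one, which is the contract the slides attribute to Writable.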

    Java Type    Hadoop Serialized Type
    String       Text
    Integer      IntWritable
    Long         LongWritable

Let’s try to execute the following commands
  ▪ hadoop jar hadoop-examples-0.20.2-cdh3u4.jar wordcount
  ▪ hadoop jar hadoop-examples-0.20.2-cdh3u4.jar wordcount <input> <output>

• What does this code do?
• Switch to Eclipse
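The mapper, partitioner, shuffle and reducer roles described in these slides can be tried out in plain Python before opening the Hadoop code. This is a single-process sketch under stated assumptions, not the Hadoop API: the function names (`map_phase`, `partition`, `shuffle`, `reduce_phase`, `word_count`), the tokenization rule and the choice of three reducers are all hypothetical.

```python
from collections import defaultdict

def map_phase(line):
    """Mapper: emit a (word, 1) pair for every token in one input line."""
    for token in line.lower().replace(",", " ").split():
        yield token, 1

def partition(word, num_reducers):
    """Partitioner: pick a reducer from the word's first character."""
    return ord(word[0]) % num_reducers

def shuffle(pairs, num_reducers):
    """Shuffle: route each pair to its reducer and group values by key."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for word, one in pairs:
        partitions[partition(word, num_reducers)][word].append(one)
    return partitions

def reduce_phase(word, ones):
    """Reducer: collapse <word, list(1)> into <word, count>."""
    return word, sum(ones)

def word_count(lines, num_reducers=3):
    """Run map, shuffle and reduce in sequence inside one process."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    result = {}
    for part in shuffle(pairs, num_reducers):
        for word, ones in part.items():
            key, total = reduce_phase(word, ones)
            result[key] = total
    return result

counts = word_count(["Do as I say, not as I do"])
print(sorted(counts.items()))
```

Because every occurrence of a word lands in the same partition, each reducer can total its words independently. That independence is what lets phase 2 run on many machines.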
