Simplifying MapReduce Data Processing

Jin-Ming Shih, Chih-Shan Liao
Dept. of Computer Science and Information Engineering
National Dong Hwa University, Hualien, Taiwan

Ruay-Shiung Chang
Dept. of Computer Science and Information Engineering
National Dong Hwa University, Hualien, Taiwan

2011 Fourth IEEE International Conference on Utility and Cloud Computing

Speaker: Lin You-Wu

Outline
I. INTRODUCTION
II. RELATED WORK
III. THE WEB-BASED GUI FOR MAPREDUCE DATA PROCESSING
IV. CASE STUDY AND IMPLEMENTATION
V. CONCLUSIONS AND FUTURE WORK

I. Introduction
• MapReduce is a programming model developed by Google for processing and generating large data sets in distributed environments.
• MapReduce substantially reduces the complexity of distributed programming by hiding the details of the distributed computing system.
• The MapReduce process shown in Fig. 1 splits the input data across many Map operation nodes and produces output as a set of key/value pairs.
• Although the MapReduce model reduces the complexity of programming for distributed computing, MapReduce is still not easy to put into practice.
• To process a large-scale data set with MapReduce, users must first set up a MapReduce environment on their machines.
• In this paper, we present a Web-based GUI for MapReduce Data Processing.
• The GUI lets users design their MapReduce workflow intuitively and conveniently, without setting up a MapReduce system or writing any programs.

II. Related Work
A. Hadoop MapReduce
B. Cascading
C. Pig

A. Hadoop MapReduce
• Hadoop is an open-source Apache project that implements Google's MapReduce architecture.
• Hadoop MapReduce is a software framework for processing large-scale data in parallel on large clusters of commodity hardware.
• The Hadoop MapReduce framework consists of a single master, called the JobTracker, and many slaves, called TaskTrackers.
• JobTracker – responsible for scheduling and monitoring all tasks of a job on the slave machines, and for re-executing failed tasks.
• TaskTracker – executes the tasks assigned by the JobTracker.

B. Cascading
• Cascading is an open-source Java library and API that offers an abstraction layer over MapReduce.
• The Cascading processing model allows developers to rapidly develop complex data processing applications on Hadoop.
• On the surface, Cascading is more complex than MapReduce, since it has many pipe types and operations.
• However, Cascading makes real-world problems easier to solve, because developers assemble building blocks instead of writing MapReduce programs.

C. Pig
• Pig is a high-level platform for creating MapReduce programs that analyze large data sets using Hadoop.
• It not only provides a higher level of abstraction for data processing but also maintains the simplicity and reliability of Hadoop.

III. The Web-Based GUI for MapReduce Data Processing
• In MapReduce, the Map phase splits the input data according to the user’s definition.
• The key/value pairs output by Map are grouped in the Reduce phase or the Combine phase.
• Our proposed method reduces the complexity of MapReduce data processing by eliminating the actual programming.
• The Map phase and the Reduce phase are hidden behind a “Target-Value-Action” in each “Layer”.
• A “Container” integrates the different formats of different outputs.

A. Target-Value-Action
• Target-Value-Action is a direct, intuitive way to express the data processing done in Map and Reduce.
• A user simply chooses some objects as Targets, gives each object a Value, and specifies the desired Action.
• When choosing a Target, users select the part of the data they want to process.
• The Value depends on the type of outcome the user wants. It can be an integer, text, or another data type.
• For example, in the word count problem, each word is a Target, the Value is “1”, and the Action is “Sum”.

B. Layer
• In our Web-based GUI, chained MapReduce jobs are controlled by “Layers”.
• Each Layer consists of one or more Target-Value-Actions and a Container.
• A Layer can accept either other Layers’ output or the original data as its input.

C. Container
• A “Container” integrates different outputs into a single multi-column set.
• It makes the output dataflow visible and convenient to work with when the inputs come from different Layers with different outputs.

D. System Architecture
• The Web-based GUI for MapReduce Data Processing is built on two parts: a Web-based GUI and a Hadoop cluster.

IV. Case Study and Implementation
• In our Web-based GUI, users drag Target-Value-Actions, input path forms, and Containers into the drop area on the Web page.
• Drag and drop offers a convenient control method.

A. Pairwise Document Similarity
• In “Pairwise Document Similarity in Large Collections with MapReduce”, the authors calculate the similarity of document pairs using MapReduce.
• d is a document
• sim(d_i, d_j) is the similarity between documents d_i and d_j
• V is the vocabulary set

Algorithm
• The authors proposed an efficient solution to the pairwise document similarity problem.
• The presented solution can be expressed as two MapReduce jobs: “Indexing” and “Pairwise Similarity”.

B. Pairwise Document Similarity in Our Proposed Method

V. Conclusions and Future Work
• We present three abstract data processing building blocks (Target-Value-Action, Layer, and Container) that hide the programming details of Map and Reduce.
• The presented methods are suitable for a Web-based GUI because they reduce the complexity of MapReduce data processing.
• Users can also drag and drop the operation components directly to process large-scale data using MapReduce.
• In the future, we will enrich the operation components and management functions.
• If users could store, reuse, and share the structures of the operation components they have used, our method would be more powerful.

Thank you for listening
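
Supplementary sketch: the word count example from Section III.A (each word is a Target, the Value is “1”, the Action is “Sum”) can be illustrated with a small in-memory Python program. This is only a sketch under stated assumptions, not the authors' implementation: the function name `target_value_action`, the `ACTIONS` table, and the extra "Max" and "Count" actions are hypothetical, and everything runs in a single process rather than on a Hadoop cluster.

```python
from collections import defaultdict

# Hypothetical action table; only "Sum" appears in the slides,
# "Max" and "Count" are illustrative additions.
ACTIONS = {
    "Sum": sum,
    "Max": max,
    "Count": len,
}


def target_value_action(records, target, value, action):
    """Run one Target-Value-Action step over an iterable of records.

    target: function that picks the Targets out of one record
    value:  function that assigns a Value to one Target
    action: name of the aggregation applied to each Target's Values
    """
    # Map phase: emit one (target, value) pair per selected Target.
    pairs = []
    for record in records:
        for t in target(record):
            pairs.append((t, value(t)))

    # Shuffle step: group the Values by their Target key.
    groups = defaultdict(list)
    for key, val in pairs:
        groups[key].append(val)

    # Reduce phase: apply the chosen Action to each group of Values.
    reducer = ACTIONS[action]
    return {key: reducer(vals) for key, vals in groups.items()}


# Word count: each word is a Target, the Value is 1, the Action is "Sum".
lines = ["map reduce map", "reduce map"]
counts = target_value_action(
    lines,
    target=str.split,
    value=lambda word: 1,
    action="Sum",
)
print(counts)  # {'map': 3, 'reduce': 2}
```

The user supplies only the three choices (what to select, what value to attach, how to aggregate), while the map, grouping, and reduce steps stay inside the helper. That mirrors how the GUI is described as hiding the Map and Reduce phases.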