Simplifying MapReduce Data Processing

Jin-Ming Shih, Chih-Shan Liao
Dept. of Computer Science and
Information Engineering
National Dong Hwa University
Hualien, Taiwan
Ruay-Shiung Chang
Dept. of Computer Science and
Information Engineering
National Dong Hwa University
Hualien, Taiwan
2011 Fourth IEEE International Conference on Utility and
Cloud Computing
Speaker: Lin You-Wu
Outline
I. INTRODUCTION
II. RELATED WORK
III. THE WEB-BASED GUI FOR MAPREDUCE
DATA PROCESSING
IV. CASE STUDY AND IMPLEMENTATION
V. CONCLUSIONS AND FUTURE WORK
I. Introduction
• MapReduce is a programming model
developed by Google for processing and
generating large data sets in distributed
environments.
• In MapReduce, the complexity of distributed
programming is decreased substantially by
hiding details of the distributed computing
system.
• The MapReduce process shown in Fig. 1 splits the input
data across many Map operation nodes and produces
a set of outputs containing key/value pairs.
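The flow described above can be sketched as a minimal in-memory simulation (this is illustrative Python, not Hadoop code; the function names are our own):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records, mapper):
    # Apply the user-defined mapper to every input record,
    # emitting intermediate (key, value) pairs.
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs

def reduce_phase(pairs, reducer):
    # Shuffle: group the intermediate pairs by key, then reduce each group.
    pairs.sort(key=itemgetter(0))
    return {key: reducer(key, [v for _, v in group])
            for key, group in groupby(pairs, key=itemgetter(0))}

# Word count over two input "splits": the mapper emits (word, 1)
# for every word; the reducer sums the values for each word.
lines = ["to be or not to be", "to do or not to do"]
counted = reduce_phase(
    map_phase(lines, lambda line: [(w, 1) for w in line.split()]),
    lambda key, values: sum(values))
```

In a real cluster the map and reduce calls run on different machines and the sort/group step is the distributed shuffle; the data flow is the same.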
• Although the MapReduce model simplifies the
complexity of programming for distributed
computing, MapReduce is still not easy to
implement.
• If users want to use MapReduce to compute a
large-scale data set, they must first set up a
MapReduce environment on their machines.
• In this paper, we present a Web-based GUI for
MapReduce data processing.
• The GUI lets users design their MapReduce
workflow intuitively and conveniently, without
implementing a MapReduce system or actually
writing the programs.
II. Related Work
A. Hadoop MapReduce
B. Cascading
C. Pig
A. Hadoop MapReduce
• Hadoop is an open-source Apache project
implementing Google's MapReduce
architecture.
• Hadoop MapReduce is a software framework
for processing large-scale data in parallel on
large clusters of commodity hardware.
• The Hadoop MapReduce framework consists of a
single master called the JobTracker and many
slaves called TaskTrackers.
• JobTracker
– responsible for scheduling, monitoring, and
re-executing all tasks of a job on the slave machines.
• TaskTracker
– executes the tasks assigned by the JobTracker.
B. Cascading
• Cascading is an open source Java library and
API offering an abstract layer for MapReduce.
• The Cascading processing model allows
developers to rapidly develop complex data
processing applications on Hadoop.
• On the surface, Cascading is more complex
than MapReduce, since there are many pipe
types and operations.
• But Cascading makes it easier to solve
real-world problems by assembling building
blocks instead of writing MapReduce programs.
C. Pig
• Pig is a high-level platform for creating
MapReduce programs to analyze large data
sets using Hadoop.
• It not only provides a higher level of
abstraction for data processing
but also maintains the simplicity and reliability
of Hadoop.
III. The Web-Based GUI for Mapreduce
Data Processing
• In MapReduce, the Map phase splits the input
data according to the user's definition.
• The Map output key/value pairs are then
grouped in the Reduce phase or the Combine phase.
• Our proposed method decreases the
complexity of MapReduce data processing
by eliminating the actual programming.
• The Map phase and Reduce phase are hidden
behind a “Target-Value-Action” in each “Layer”.
• There is a “Container” which integrates the
different formats from different outputs.
A. Target-Value-Action
• Target-Value-Action is a direct and intuitive
way to express data processing with Map and
Reduce.
• A user just chooses some objects as targets,
gives each object a value, and specifies the
desired action.
• When choosing targets, users select the
part of the data they want to process.
• The value depends on what type of outcome the
user wants. It can be an integer, text, or another
data type.
• For example, in the word count problem, each
word is a Target, the Value is “1” and the Action is
“Sum”.
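One way to picture Target-Value-Action is as a (target selector, value, action) rule folded over the input. The encoding below is our own illustrative sketch, not the paper's implementation; the names are assumptions:

```python
# Hypothetical encoding of a Target-Value-Action rule (names assumed,
# not taken from the paper): every target found in the input is paired
# with a fixed value, and the action folds all values per target.
ACTIONS = {"Sum": sum, "Count": len, "Max": max}

def apply_tva(records, extract_targets, value, action):
    grouped = {}
    for record in records:
        for target in extract_targets(record):        # choose targets
            grouped.setdefault(target, []).append(value)  # assign value
    return {t: ACTIONS[action](vs) for t, vs in grouped.items()}  # act

# Word count: each word is a Target, its Value is 1, the Action is "Sum".
result = apply_tva(["a b a", "b a"], str.split, 1, "Sum")
```

Swapping the action for "Count" or "Max" changes the aggregation without touching the map-side logic, which is the point of separating the three parts.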
B. Layer
• In our Web-based GUI,
chained MapReduce jobs
are controlled by “Layers”.
• Each Layer consists of one
or more Target-Value-Actions and a Container.
• A Layer can accept other
Layers’ output or the
original data as its input.
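The chaining rule in the bullets above can be sketched as follows (a minimal Python model we made up for illustration; the paper does not give this code):

```python
# Minimal sketch of Layers chaining MapReduce jobs: each Layer takes
# either the original data or another Layer's output as its input.
class Layer:
    def __init__(self, job, source):
        self.job = job        # function: input dataset -> output dataset
        self.source = source  # raw data, or another Layer

    def run(self):
        # Recursively run the upstream Layer first, if there is one.
        data = self.source.run() if isinstance(self.source, Layer) else self.source
        return self.job(data)

# Layer 1 tokenizes lines; Layer 2 counts the tokens Layer 1 produced.
layer1 = Layer(lambda lines: [w for l in lines for w in l.split()], ["x y", "x"])
layer2 = Layer(lambda words: {w: words.count(w) for w in set(words)}, layer1)
counts = layer2.run()
```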
C. Container
• A “Container” integrates different outputs into
one multi-column set.
• It makes the output dataflow visible and
convenient when operating on inputs from
different Layers with different outputs.
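Such a merge could look like the sketch below, which joins several Layer outputs on a shared key into one row per key (our own illustrative code, with assumed names, not the paper's):

```python
# Sketch of a Container: merge the outputs of several Layers, keyed by
# the same target, into one multi-column record per key.
def container(*layer_outputs):
    keys = set().union(*(out.keys() for out in layer_outputs))
    # One row per key; a missing value becomes None so columns stay aligned.
    return {k: tuple(out.get(k) for out in layer_outputs) for k in keys}

# Hypothetical outputs of two Layers over the same words:
word_counts = {"hadoop": 3, "pig": 1}
doc_counts = {"hadoop": 2}
merged = container(word_counts, doc_counts)
```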
D. System Architecture
• The system consists of a Web-based GUI front end
and a Hadoop cluster back end.
IV. Case Study and Implementation
• In our Web-based GUI, users can drag the
Target-Value-Action, input path form, and
Container components into the drop area on the Web page.
• Drag and drop offers a convenient control
method.
A. Pairwise Document Similarity
• In “Pairwise Document Similarity in Large
Collections with MapReduce”, the authors
calculate the pairwise similarity of documents
using MapReduce.
• d is a document
• sim(di, dj) is the similarity between documents di and dj
• V is the vocabulary set
Algorithm
• The authors proposed an efficient solution to the pairwise
document similarity problem.
• The presented solution can be expressed as two
MapReduce jobs: “Indexing” and “Pairwise
Similarity”.
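A simplified sketch of those two jobs is shown below. We substitute raw term counts for the paper's term weights to keep the example short, so the numbers are illustrative, not the paper's actual weighting:

```python
from collections import defaultdict
from itertools import combinations

def indexing(docs):
    # Job 1 ("Indexing"): map each term to the documents containing it,
    # building postings of (doc_id, weight); here weight = term count.
    postings = defaultdict(list)
    for doc_id, text in docs.items():
        counts = defaultdict(int)
        for term in text.split():
            counts[term] += 1
        for term, w in counts.items():
            postings[term].append((doc_id, w))
    return postings

def pairwise_similarity(postings):
    # Job 2 ("Pairwise Similarity"): for each term, emit a partial
    # product for every document pair in its postings, then sum the
    # partial products per pair.
    sims = defaultdict(int)
    for term, plist in postings.items():
        for (d1, w1), (d2, w2) in combinations(sorted(plist), 2):
            sims[(d1, d2)] += w1 * w2
    return sims

docs = {"d1": "a b a", "d2": "a c"}
sims = pairwise_similarity(indexing(docs))
```

Only document pairs that share at least one term ever meet in Job 2, which is what makes the inverted-index formulation efficient on large collections.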
B. Pairwise Document Similarity in Our
Proposed Method
V. Conclusions and Future Work
• We also present three abstract data
processing building blocks to hide the
programming details of Map and Reduce.
• The presented methods are suitable for a
Web-based GUI, since the complexity of
MapReduce data processing has been
decreased.
• Users can also directly drag and drop the
operation components to process large-scale
data using MapReduce.
• In the future, we will enrich the operation
components and management functions.
• If users can store, reuse, and share their
operation component structures, our method
can become even more powerful.
Thank you for listening