The MapReduce Framework for Big Data Application

Mr. Gajanan Rathod,
Mr. Gurudutt Sortur,
Mr. Nikhil Nimbalkar,
Miss. Shahida Nadaf,
Prof. R. V. Patil
Department of Computer Engineering, PDEA’S COEM, Pune
ABSTRACT: Big data is being generated at a large scale because of day-to-day activities and the widespread use of computing resources. To access and handle such huge amounts of data, distributed system mechanisms are used. One such widely used distributed file system is the Hadoop Distributed File System (HDFS). Google's MapReduce framework and Apache Hadoop, its open-source implementation, are the de facto software systems for big data applications. Hadoop uses the MapReduce framework to perform analysis and carry out computations in parallel on these massive data sets. Hadoop follows a master/slave design that decouples file-system metadata from application data, where the metadata is kept on a dedicated server, the NameNode, and the application data on the DataNodes. MapReduce processing is slow, whereas it is well known that accessing data from a cache is far quicker than a memory access. The Hadoop Distributed File System suffers from a huge I/O bottleneck when storing the tri-replicated data blocks, and the I/O overhead intrinsic to the HDFS architecture degrades application performance.
KEYWORDS: Big Data, MapReduce, Hadoop Distributed File System (HDFS), Cache Management.
I. INTRODUCTION:
The MapReduce [5] framework for big data applications emerged as millions of computing-related resources were introduced, and the day-by-day use of the applications built on them has led to the generation of large amounts of data. Applications specify the computation in terms of a map and a reduce function working on partitioned data items, and the MapReduce framework schedules the computation across a cluster of machines. A challenge is the quick retrieval of these resources together with high performance, and a good mechanism to realize this goal is the use of distributed systems.
MapReduce provides a standardized framework for implementing large-scale distributed computation on unprecedentedly large data sets.
Hadoop is a software framework [6] for writing applications that rapidly process large amounts of data in parallel on large clusters of compute nodes. It provides a distributed file system and a framework for the analysis and transformation of very large data sets using the MapReduce paradigm.
The volume of data, collectively called data sets, generated by these applications is very large, so there is a need to process large data sets efficiently. Over the years, Hadoop has gained importance owing to its scalability, reliability, high throughput, and its capacity for analysis and enormous computations on these huge amounts of data. It is being employed by leading industries such as Amazon, Google, Facebook, and Yahoo [7].
MapReduce is a generic execution
engine that parallelizes computation over a
large cluster of machines [2]. An important
characteristic of Hadoop is the partitioning of data and computation across many hosts, and the execution of application computations in parallel, close to their data. A Hadoop cluster scales computation capacity, storage capacity, and I/O bandwidth by simply adding commodity servers. In this paper, for faster processing of the Hadoop system, three algorithms are used, namely Normal cache MR, Distributed cache MR, and Adaptive Replacement Cache (ARC) [1] MR.
II. Related Work
MapReduce processes large amounts of big data. It is designed in such a way that a MapReduce program can be automatically parallelized and executed on a large cluster of machines. As this model is easy to use, programmers with limited experience with parallel and distributed systems can work with it. The jobs are divided and well balanced using a load-balancing technique. In this model, extra overhead is required to keep information about every data node and its job, so the computation overhead increases. Improved MapReduce performance has been achieved through data placement in heterogeneous Hadoop clusters [4].
Although the format of the cache description is the same across applications, its content varies according to their specific semantic contexts. It can be designed and implemented by the application developers who are responsible for implementing their MapReduce tasks.
III. Research Methodology:
3.1 MapReduce: MapReduce [5] is a
programming model for processing large
data sets, and the name of an
implementation of the model by Google.
MapReduce is typically used to perform
distributed computing on clusters of
computers. The model is inspired by map
and reduce functions commonly used in
functional programming, although their
purpose in the MapReduce framework is
not the same as their original forms. MapReduce libraries have been written in
many programming languages. A popular
free implementation is Apache Hadoop.
MapReduce is a framework for processing
parallelizable problems across huge
datasets using a large number of computers
(master and slave nodes), collectively
referred to as a cluster. Computational
processing can occur on data stored in HDFS in the form of multi-structured data (structured, semi-structured, and unstructured).
Map step: The master node takes the
input, divides it into smaller sub-problems,
and distributes them to worker nodes. A
worker node may do this again in turn,
leading to a multi-level tree structure. The
worker node processes the smaller
problem, and passes the answer back to its
master node. In other words, mapping is the function that is used to arrange the input data in the form of key-value pairs.
Consider the example: "I am Sam, I am studying in BE."
The key-value pairs are defined as below:
1) (I, 1)  2) (Am, 1)  3) (Sam, 1)  4) (I, 1)
5) (Am, 1)  6) (Studying, 1)
7) (In, 1)  8) (BE, 1)
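As a concrete illustration (not code from the paper), a mapper that produces such (word, 1) pairs can be sketched with the Hadoop MapReduce Java API; the class name WordCountMapper is our own, and punctuation handling is ignored for brevity.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits a (word, 1) pair for every whitespace-separated token of every
// input line, mirroring the key-value pairs listed above.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());   // one word per token
            context.write(word, ONE);       // emit (word, 1)
        }
    }
}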
The map function involves the following attributes:
a) Partition: In this system the input data is divided across a number of nodes and processed with the help of the map and reduce functions.
b) Combiner: The combiner, also called a mini-reducer, is used along with each mapper. It is mainly used for load balancing.
c) Cluster: A cluster is a pool of machines and memory; when data is processed, it is stored inside the cluster.
Reduce step: The reducer is the function that takes the input (key) from the map function and performs the reduce operation. Consider the example:
(I, 2)
(Am, 2)
(Sam, 1)
(Studying, 1)
(In, 1)
(BE, 1)
As shown in the example above, the given sentence contains some repeated words, so the reducer can be used to reduce them by showing their counts.
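A matching reducer sketch (again ours, not the paper's code) sums the ones for each word and produces counts such as (I, 2) and (Am, 2) above; registering this same class as the combiner (the "mini-reducer" of point b above) is a common way to cut shuffle traffic.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Called once per distinct word with all of its 1s; emits (word, count).
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text word, Iterable<IntWritable> ones, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable one : ones) {
            sum += one.get();
        }
        total.set(sum);
        context.write(word, total);   // e.g. ("I", 2), ("Am", 2), ("Sam", 1)
    }
}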
Fig: Execution Overview
Dataflow: The frozen part of the MapReduce framework is a large distributed sort. The important points that the applications do define are the input reader, the Map function, the partition function, the comparison function, the Reduce function, and the output writer.
Input reader: It divides the input into
appropriate size 'splits' (in practice
typically 16 MB to 128 MB) and the
framework assigns one split to each Map
function. The input reader reads data from
stable storage (typically a distributed file
system) and generates key/value pairs. A
common example will read a directory full
of text files and return each line as a
record.
Map function: Each Map function takes a
series of key/value pairs, processes each,
and generates zero or more output
key/value pairs. The input and output types
of the map can be (and often are) different
from each other.
Partition function: Each Map function output is allocated to a particular reducer by the application's partition function for sharding purposes. The partition function is given the key and the number of reducers and returns the index of the desired reducer.
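A sketch of such a partition function, modeled on the hash-based default that Hadoop ships (HashPartitioner); the class name WordPartitioner is our own.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Maps each key to a reducer index in [0, numReduceTasks) by hashing the key,
// so all pairs with the same word end up at the same reducer.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}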
Comparison function: The input for each
Reduce is pulled from the machine where
the Map ran and sorted using the
application's comparison function.
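As a sketch of a custom comparison function (an assumption for illustration, not something the paper specifies), Hadoop lets the application override the sort order with a WritableComparator, for example to sort words case-insensitively:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Sorts the intermediate keys case-insensitively before they reach the reducers.
public class CaseInsensitiveComparator extends WritableComparator {
    public CaseInsensitiveComparator() {
        super(Text.class, true);   // true = instantiate keys for object comparison
    }

    @Override
    @SuppressWarnings("rawtypes")
    public int compare(WritableComparable a, WritableComparable b) {
        return a.toString().compareToIgnoreCase(b.toString());
    }
}
// Registered in the job driver with: job.setSortComparatorClass(CaseInsensitiveComparator.class);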
Reduce function: The framework calls the
application's Reduce function once for
each unique key in the sorted order. The
Reduce can iterate through the values that
are associated with that key and produce
zero or more outputs.
Output writer: It writes the output of the
Reduce to stable storage, usually a
distributed file system.
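The remaining pieces (input reader, combiner, partitioner, output writer) are wired together in a job driver. The sketch below assumes the illustrative classes from the earlier sketches and takes hypothetical input and output paths from the command line; the split-size setting mirrors the 16-128 MB range mentioned above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // Input reader: each line of the text files becomes one (offset, line) record.
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);  // 128 MB splits

        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);   // mini-reducer on the map side
        job.setPartitionerClass(WordPartitioner.class);
        job.setReducerClass(WordCountReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Output writer: results are written back to the distributed file system.
        job.setOutputFormatClass(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}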
As an example, the illustrative problem of computing the average word length of each word's occurrences in a large collection of documents is represented in MapReduce as follows: the input key/value pair to the Map function is a document name and its contents. The function scans through the document and emits each word together with the word length of that occurrence. Shuffling groups together the occurrences of the same word in all documents and passes them to the Reduce function. The Reduce function sums up the word lengths of all occurrences, then divides the sum by the count of that word and emits the word together with its overall average word length.
Example:
Consider the problem of computing the average word length in a large collection of documents. The user would write code similar to the following pseudo-code [5]:
function map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, length(w));

function reduce(String key, Iterator values):
  // key: a word
  // values: a list of word lengths
  double sum = 0, count = 0, result = 0;
  for each v in values:
    sum += ParseInt(v);
    count++;
  result = sum / count;
  Emit(key, AsDouble(result));
Here, each document is split into words, and the map function emits each word's length, using the word as the result key. The framework puts together all the pairs with the same key and feeds them to the same call to reduce, so this function just needs to sum all of its input values to find the total length over all appearances of that word. Then, to find the average word length, it divides that sum by the count of that word.
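A Hadoop rendering of this pseudo-code might look like the sketch below (the class names are ours, not the paper's): the mapper emits each word with its length and the reducer averages the lengths per word.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: emit (word, length(word)) for every word occurrence.
class WordLengthMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final Text word = new Text();
    private final IntWritable length = new IntWritable();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            String w = tokens.nextToken();
            word.set(w);
            length.set(w.length());
            context.write(word, length);
        }
    }
}

// Reduce: sum the lengths of all occurrences of a word and divide by their count.
class WordLengthReducer extends Reducer<Text, IntWritable, Text, DoubleWritable> {
    private final DoubleWritable average = new DoubleWritable();

    @Override
    protected void reduce(Text word, Iterable<IntWritable> lengths, Context context)
            throws IOException, InterruptedException {
        double sum = 0;
        long count = 0;
        for (IntWritable len : lengths) {
            sum += len.get();
            count++;
        }
        average.set(sum / count);
        context.write(word, average);
    }
}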
The following three algorithms are proposed for the processing of the Hadoop system.
1. Normal cache MR:
In this algorithm, the input data is processed from main memory, and the mapping and reducing operations are performed on the input data normally. The processed data is then stored in HDFS.
2. Distributed cache MR:
In this algorithm, main memory is divided into two parts that act as a filter for all junk and temporary files. After filtering, the mapping and reducing operations are performed on the data, and the result is stored in HDFS.
3. Adaptive Replacement Cache (ARC) MR:
In this algorithm, main memory is divided into four parts that act as a filter for all junk and temporary files. After filtering, the mapping and reducing operations are performed on the data, and the result is stored in HDFS.
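The paper gives no code for these cache-based variants; purely to illustrate the general idea of serving repeated requests for a data block from memory instead of re-reading it from HDFS, a minimal LRU-style cache is sketched below. It is a simple stand-in, not the ARC policy and not the paper's algorithms, and readBlockFromHdfs in the usage note is a hypothetical helper.

import java.util.LinkedHashMap;
import java.util.Map;

// A tiny fixed-capacity cache: the least recently used block is evicted
// when a new block is inserted into a full cache.
public class BlockCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public BlockCache(int capacity) {
        super(16, 0.75f, true);   // access-order = true gives LRU behaviour
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;
    }
}

// Usage sketch: consult the cache before going to HDFS.
// BlockCache<String, byte[]> cache = new BlockCache<>(1024);
// byte[] block = cache.get(blockId);
// if (block == null) { block = readBlockFromHdfs(blockId); cache.put(blockId, block); }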
IV. IMPLEMENTATION DETAILS:
In this section we describe the input, the expected result, and the environment used for the implementation.
Input:
For this implementation, we use a text file as input, which is further split into a number of blocks.
CONCLUSION:
The motivation is that Hadoop is so widely used, and this work allowed us to study in depth such a widely used system and the MapReduce framework, which is used for big data analysis and transformations. Many new technologies are emerging at a rapid rate, each with technological advancements and with the potential of making technology easier to use. We will work on bringing together ideas from MapReduce and the Normal, Distributed, and ARC cache algorithms; however, this work focuses mainly on increasing the processing speed of the Hadoop system. We will combine the advantages of MapReduce-like [6] software with the efficiency and shared-work advantages that come from loading data and creating performance-enhancing data structures.
REFERENCES:
[1] Yaxiong Zhao, Jie Wu, and Cong Liu, "Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework," in Proceedings of IEEE INFOCOM, 2013.
[2] Nusrat Sharmin Islam, Xiaoyi Lu, Md. Wasi-ur-Rahman, Raghunath Rajachandrasekar, and Dhabaleswar K. (DK) Panda, "In-Memory I/O and Replication for HDFS with Memcached: Early Experiences," IEEE, 2014.
[3] L. A. Belady, "A Study of Replacement Algorithms for Virtual Storage Computers," IBM Systems J., vol. 5, no. 2, pp. 78-101, 1966.
[4] M. Zaharia, A. Konwinski, A. D. Joseph, R. Katz, and I. Stoica, "Improving MapReduce Performance in Heterogeneous Environments," in Proc. of OSDI 2008, Berkeley.
[5] Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," in OSDI, pp. 137-150, 2004.
[6] Cloudera, https://ccp.cloudera.com/display/CDH4DOC/CDH4+Installation
[7] Nimrod Megiddo, "Outperforming LRU with an Adaptive Replacement Cache Algorithm," IBM Almaden Research Center, San Jose, CA, USA, pp. 58-65.
[8] D. Peng and F. Dabek, "Large-Scale Incremental Processing Using Distributed Transactions and Notifications," in Proc. of OSDI 2010, Berkeley, CA, USA, 2010.
[9] Dongfang Zhao and Ioan Raicu, "HyCache: A User-Level Caching Middleware for Distributed File Systems," 2014.
[10] Xindong Wu, Xingquan Zhu, and Gong-Qing Wu, "Data Mining with Big Data," 2014.