PERFORMANCE OF MAP REDUCE

International Journal of Engineering Trends and Technology- May to June Issue 2011
1. V.Anitha Moses, Professor, Department of Computer Application, Panimalar Engineering College, Chennai.
2. B.Palanivel, PG Scholar, Department of Computer Application, Panimalar Engineering College, Chennai.
3. S. Srinidhi, Asst. Prof., Department of MCA, Panimalar Engineering College, Chennai.
ABSTRACT
MapReduce is a programming model and an
associated implementation for processing and
generating large data sets. Users specify a map
function that processes a key/value pair to
generate a set of intermediate key/value pairs,
and a reduce function that merges all
intermediate values associated with the same
intermediate key. Many real world tasks are
expressible in this model, as shown in the
paper. Programs written in this functional style
are automatically parallelized and executed on a
large cluster of commodity machines. The runtime system takes care of the details of
partitioning the input data, scheduling the
program's execution across a set of machines,
handling machine failures, and managing the
required inter-machine communication. This
allows programmers without any experience with
parallel and distributed systems to easily utilize
the resources of a large distributed system. Our
implementation of MapReduce runs on a large
cluster of commodity machines and is highly
scalable: a typical MapReduce computation
processes many terabytes of data on thousands
of machines. Programmers find the system easy
to use: hundreds of MapReduce programs have
been implemented and upwards of one
thousand MapReduce jobs are executed on
Google's clusters every day.
INTRODUCTION

MapReduce-based systems are increasingly being used for large-scale data analysis. There are several reasons for this. First, the interface of MapReduce is simple yet expressive. Although MapReduce involves only two functions, map() and reduce(), a number of data analysis tasks, including traditional SQL queries, data mining, machine learning, and graph processing, can be expressed as a set of MapReduce jobs. Second, MapReduce is flexible: it is designed to be independent of storage systems and is able to analyze various kinds of data, both structured and unstructured. Finally, MapReduce is scalable; an installation of MapReduce on a 4,000-node shared-nothing cluster has been reported [2]. MapReduce also provides fine-grained fault tolerance, whereby only tasks on failed nodes need to be restarted.

Traditionally, the large-scale data analysis market has been dominated by Parallel Database systems. The popularity of MapReduce raises the question of whether there are fundamental differences between MapReduce-based and Parallel Database systems. Along this direction, a comparative evaluation of the two systems has been reported across many dimensions, including schema support, data access methods, fault tolerance, and so on. The authors also introduced a benchmark to evaluate the performance of both systems. The results showed that the observed performance of a Parallel Database system is much better than that of a MapReduce-based system. The authors speculated about possible architectural causes for the performance gap between the two systems. For instance, MapReduce-based systems need to repetitively parse records since they are designed to be
independent of the storage system. Thus,
parsing introduces performance overhead.
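As an illustration of how a simple SQL-style aggregation maps onto the two functions, the following self-contained Java sketch (not taken from the paper; the data and names are invented) expresses a GROUP BY average as a map phase that emits (department, salary) pairs and a reduce phase that averages the values sharing a key. A real MapReduce runtime would additionally partition the input and shuffle the intermediate pairs across machines.

import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-process illustration of expressing "SELECT dept, AVG(salary) GROUP BY dept"
// as a map function and a reduce function. No framework is involved.
public class GroupByAverageSketch {

    // Map: one input record "dept,salary" -> one intermediate (dept, salary) pair.
    static Map.Entry<String, Double> map(String record) {
        String[] parts = record.split(",");
        return new AbstractMap.SimpleEntry<String, Double>(parts[0], Double.parseDouble(parts[1]));
    }

    // Reduce: all salaries observed for one department -> their average.
    static double reduce(String dept, List<Double> salaries) {
        double sum = 0;
        for (double s : salaries) sum += s;
        return sum / salaries.size();
    }

    public static void main(String[] args) {
        String[] input = {"sales,1000", "sales,1400", "hr,900", "hr,1100"};
        // Group (shuffle) step: collect intermediate values by key.
        Map<String, List<Double>> groups = new TreeMap<String, List<Double>>();
        for (String record : input) {
            Map.Entry<String, Double> kv = map(record);
            List<Double> values = groups.get(kv.getKey());
            if (values == null) {
                values = new ArrayList<Double>();
                groups.put(kv.getKey(), values);
            }
            values.add(kv.getValue());
        }
        for (Map.Entry<String, List<Double>> g : groups.entrySet()) {
            System.out.println(g.getKey() + "\t" + reduce(g.getKey(), g.getValue()));
        }
    }
}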
HADOOP
Hadoop consists of the Hadoop Common, which
provides access to the filesystems supported by
Hadoop. The Hadoop Common package
contains the necessary JAR files and scripts
needed to start Hadoop. The package also
provides source code, documentation, and a
contribution section which includes projects from
the Hadoop Community. A key feature is that, for
effective scheduling of work, every filesystem
should provide location awareness: the name of
the rack (more precisely, of the network switch)
where a worker node is. Hadoop applications
can use this information to run work on the node
where the data is, and, failing that, on the same
rack/switch, so reducing backbone traffic. The
HDFS filesystem uses this when replicating
data, to try to keep different copies of the data
on different racks. The goal is to reduce the
impact of a rack power outage or switch failure
so that even if these events occur, the data may
still be readable. A typical Hadoop cluster will
include a single master and multiple slave
nodes. The master node consists of a
jobtracker, tasktracker, namenode, and
datanode. A slave or compute node consists of
a datanode and tasktracker. Hadoop requires
JRE 1.6 or higher. The standard startup and
shutdown scripts require ssh to be set up
between nodes in the cluster. While Microsoft
Windows and OS/X are supported for
development, as of April 2011 there are no
public claims that these are in use on large servers. The HDFS is a distributed, scalable, and
portable filesystem written in Java for the
Hadoop framework. Each node in a Hadoop
instance typically has a single datanode; a
cluster of datanodes form the HDFS cluster. The
situation is typical because each node does not
require a datanode to be present. Each
datanode serves up blocks of data over the
network using a block protocol specific to HDFS.
The filesystem uses the TCP/IP layer for communication; clients use RPC to communicate with each other. The HDFS stores large files (an ideal file size is a multiple of 64 MB) across multiple machines. It achieves
reliability by replicating the data across multiple
hosts, and hence does not require RAID storage
on hosts. With the default replication value of 3, data is stored on three nodes: two on the same rack and one on a different rack. Data nodes can talk to each other to rebalance data, to move copies around, and to keep the replication of data high. The HDFS is not fully POSIX compliant because the requirements for a POSIX filesystem differ from the target goals of a Hadoop application; the trade-off of not being fully POSIX compliant is increased data throughput. The HDFS was designed to handle very large files, but it does not provide High Availability: the filesystem requires one unique server, the name node, which is a single point of failure for an HDFS installation. If the name node goes down, the filesystem is offline. When it comes back up, the name node must replay all outstanding operations; this replay process can take over half an hour for a big cluster. The filesystem includes what is called a Secondary Namenode, which misleads some people into thinking that when the Primary Namenode goes offline, the Secondary Namenode takes over.
MAPREDUCE
MapReduce is a patented software framework
introduced by Google to support distributed
computing on large data sets on clusters of
computers. The framework is inspired by the
map and reduce functions commonly used in
functional programming, although their purpose
in the MapReduce framework is not the same as
their original forms. MapReduce is a framework for processing huge datasets on certain kinds of distributable problems using a large number of computer nodes, collectively referred to as a cluster if all nodes use the same hardware, or as a grid if the nodes use different hardware. Computational processing can occur on data stored either in a filesystem (unstructured) or within a database (structured).
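The canonical example of the model is word count. The minimal sketch below shows a mapper and reducer against Hadoop's org.apache.hadoop.mapreduce Java API; the job driver (input/output paths, job submission) is omitted, and the class names are our own.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Word count: the mapper emits (word, 1) for every token, and the reducer
// sums the counts for each word.
public class WordCount {

    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);   // intermediate (word, 1) pair
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();              // merge all values for this word
            }
            context.write(key, new IntWritable(sum));
        }
    }
}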
CLUSTERING
Clustering allows us to run an application on
several parallel servers (cluster nodes). The
load is distributed across different servers, and
even if any of the servers fails, the application is
still accessible via other cluster nodes.
Clustering is crucial for scalable enterprise
applications, as you can improve performance
by simply adding more nodes to the cluster.
Clustering can be considered the most important
unsupervised learning problem; so, as every
other problem of this kind, it deals with finding a
structure in a collection of unlabeled data.
A loose definition of clustering could be the
process of organizing objects into groups whose
members are similar in some way. A cluster is a
set of nodes; here, a node is a JBoss server instance. Thus, to build a cluster, several JBoss instances have to be grouped together into what is known as a partition. On the same network, we may have different clusters; in order to differentiate them, each cluster must have an individual name.
Clustering is a nonlinear activity that generates
ideas, images and feelings around a stimulus
word. As students cluster, their thoughts tumble
out, enlarging their word bank for writing and
often enabling them to see patterns in their
ideas. Clustering may be a class or an individual
activity.
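To make the data-clustering sense of the term concrete (as opposed to JBoss server clustering), here is a minimal, self-contained k-means pass over one-dimensional values. The data, the number of clusters, and the initial centers are invented for illustration only.

import java.util.Arrays;

// Illustrative only: a tiny k-means loop that groups similar 1-D values,
// matching the "organize objects into groups whose members are similar"
// definition above. Not related to the paper's experiments.
public class KMeansSketch {
    public static void main(String[] args) {
        double[] data = {1.0, 1.2, 0.8, 8.9, 9.1, 9.3, 5.0, 5.2};
        double[] centers = {0.0, 5.0, 10.0};          // initial guesses, k = 3
        int[] assign = new int[data.length];

        for (int iter = 0; iter < 10; iter++) {
            // Assignment step: each point joins the nearest center's cluster.
            for (int i = 0; i < data.length; i++) {
                int best = 0;
                for (int c = 1; c < centers.length; c++) {
                    if (Math.abs(data[i] - centers[c]) < Math.abs(data[i] - centers[best])) {
                        best = c;
                    }
                }
                assign[i] = best;
            }
            // Update step: each center moves to the mean of its members.
            for (int c = 0; c < centers.length; c++) {
                double sum = 0;
                int n = 0;
                for (int i = 0; i < data.length; i++) {
                    if (assign[i] == c) { sum += data[i]; n++; }
                }
                if (n > 0) centers[c] = sum / n;
            }
        }
        System.out.println("centers: " + Arrays.toString(centers));
        System.out.println("assignments: " + Arrays.toString(assign));
    }
}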
EXISTING SYSTEM
The existing system has several limitations: job completion times cannot be predicted; large jobs or heavy users can monopolize the cluster; error latencies occur in long-running tasks; there is no load balancing; and processing suffers from time delays.
MapReduce systems face enormous challenges
due to increasing growth, diversity, and
consolidation of the data and computation
involved.
Provisioning, configuring, and managing large-scale MapReduce clusters require realistic, workload-specific performance
insights that existing MapReduce benchmarks
are ill-equipped to supply. In this paper, we build
the case for going beyond benchmarks for
MapReduce performance evaluations. We
analyze and compare two production
MapReduce traces to develop a vocabulary for
describing MapReduce workloads. We show
that existing benchmarks fail to capture rich
workload characteristics observed in traces, and
propose a framework to synthesize and execute
representative workloads. We demonstrate that
performance evaluations using realistic workloads give cluster operators new ways to
identify workload-specific resource bottlenecks,
and to make workload-specific choices of MapReduce task schedulers. We expect that, once available,
workload suites would allow cluster operators to
accomplish previously challenging tasks beyond
what we can now imagine, thus serving as a
useful tool to help design and manage
MapReduce systems.
DATASET
A data set is a collection of data, usually
presented in tabular form. Each column
represents a particular variable. Each row
corresponds to a given member of the data set in question and lists its values for each of the variables, such as the height and weight of an object or the values of random numbers. Each value is known as a
datum. The data set may comprise data for one
or more members, corresponding to the number
of rows. A data set has several characteristics
which define its structure and properties. These
include the number and types of the attributes or
variables and the various statistical measures
which may be applied to them such as standard
deviation and kurtosis.
In the simplest case, there is only one variable,
and then the data set consists of a single
column of values, often represented as a list. In
spite of the name, such a univariate data set is
not a set in the usual mathematical sense, since
a given value may occur multiple times.
Normally the order does not matter, and then the
collection of values may be considered to be a
multiset rather than an (ordered) list. The values may be numbers, such as real numbers or integers, for example representing a person's height in centimeters, but may also be nominal data (i.e., not consisting of numerical values), for example representing a person's ethnicity. More generally, values may be of any of the kinds described as a level of measurement. For each variable, the values will normally all be of the same kind. However, there may also be "missing values", which need to be indicated in some way.
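As a small illustration of these ideas, the sketch below represents a few rows of a two-variable data set in plain Java, with null marking a missing nominal value; the variable names and values are invented.

// Illustrative only: one way to represent rows of a small data set, with null
// standing in for a "missing value". The fields and data are hypothetical.
public class DatasetRowSketch {
    static class Row {
        final double heightCm;   // numeric variable
        final String ethnicity;  // nominal variable; null marks a missing value
        Row(double heightCm, String ethnicity) {
            this.heightCm = heightCm;
            this.ethnicity = ethnicity;
        }
    }

    public static void main(String[] args) {
        Row[] table = {
            new Row(172.0, "A"),
            new Row(165.5, null),   // missing nominal value
            new Row(180.2, "B"),
        };
        int missing = 0;
        for (Row r : table) {
            if (r.ethnicity == null) missing++;
        }
        System.out.println("rows: " + table.length + ", missing ethnicity values: " + missing);
    }
}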
PROPOSED SYSTEM
The main insights from our analysis are that:
(i) job completion times and cluster allocation
patterns follow a long-tailed distribution and
require fair job
schedulers to prevent large
jobs or heavy users from monopolizing the
cluster;
(ii) better diagnosis and recovery approaches
are needed to reduce error latencies in long-running tasks;
(iii) evenly-balanced load across most jobs
implies that peer comparison is a suitable
strategy for anomaly detection, as described in our previous work; and
(iv) low variability in user behavior over short
periods of time allows us to exploit temporal
locality to predict job completion times, as sketched below.
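The following sketch illustrates point (iv) only: it predicts the next completion time of a repeatedly submitted job as the mean of a short window of its most recent runs. The window size and the sample times are invented, and this is a simplification of the approach described above.

import java.util.ArrayDeque;
import java.util.Deque;

// Minimal sketch of exploiting temporal locality: the last few completion
// times of a job are used to predict its next completion time.
public class TemporalLocalityPredictor {
    private final Deque<Double> recent = new ArrayDeque<Double>();
    private final int window;

    TemporalLocalityPredictor(int window) { this.window = window; }

    // Record an observed completion time (in seconds) for this job.
    void observe(double seconds) {
        recent.addLast(seconds);
        if (recent.size() > window) recent.removeFirst();
    }

    // Predict the next completion time as the mean of the recent window.
    double predict() {
        if (recent.isEmpty()) return Double.NaN;
        double sum = 0;
        for (double s : recent) sum += s;
        return sum / recent.size();
    }

    public static void main(String[] args) {
        TemporalLocalityPredictor p = new TemporalLocalityPredictor(3);
        double[] observed = {120, 118, 130, 125};   // invented sample runs
        for (double t : observed) p.observe(t);
        System.out.printf("predicted next completion time: %.1f s%n", p.predict());
    }
}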
Under different levels of database tables, a valid user can access the authorized attributes through multi-level authentication. These authentication processes are mutual, so the scheme is secure against spoofing or masquerading attacks.
The reduce function merges the intermediate values associated with the same key on distributed compute nodes.
takes care of data partitioning, scheduling, load
balancing, fault tolerance, and network
communications. The simple interface of
MapReduce allows programmers to easily
design parallel and distributed applications.
The classified data are stored in different
databases in different styles.
REPORTS GENERATED
We observed large error latencies in some long-running tasks, indicating that better diagnosis and recovery approaches are needed. Users tended to run the same job repeatedly over short intervals of time, thereby allowing us to exploit temporal locality to predict job completion times. We compared the effectiveness of a distance-weighted algorithm against a locally-weighted linear algorithm at predicting job completion times when we scaled the map input sizes of incoming jobs; locally-weighted linear regression performs better, with a mean relative prediction error of 26%.
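To illustrate the distance-weighted idea in a self-contained way, the sketch below predicts a job's completion time from its map input size by weighting past observations with a Gaussian kernel on input-size distance. The bandwidth and the sample history are invented, and this is not the paper's exact algorithm.

// Sketch: distance-weighted prediction of completion time from map input size.
// Past jobs whose input size is close to the new job's size get larger weights.
public class DistanceWeightedPredictor {
    static double predict(double[] sizesGb, double[] timesSec,
                          double querySizeGb, double bandwidthGb) {
        double weightedSum = 0, weightTotal = 0;
        for (int i = 0; i < sizesGb.length; i++) {
            double d = (sizesGb[i] - querySizeGb) / bandwidthGb;
            double w = Math.exp(-0.5 * d * d);   // Gaussian kernel weight
            weightedSum += w * timesSec[i];
            weightTotal += w;
        }
        return weightedSum / weightTotal;
    }

    public static void main(String[] args) {
        double[] sizesGb = {1, 2, 4, 8, 16};          // invented history
        double[] timesSec = {60, 95, 170, 330, 660};
        System.out.printf("predicted time for 6 GB: %.0f s%n",
                predict(sizesGb, timesSec, 6, 2));
    }
}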
CONCLUSION
We analyzed Hadoop logs from the 400-node
M45 supercomputing cluster which Yahoo!
made freely available to select universities for
systems research. Our study tracks the evolution of cluster utilization patterns from the cluster's launch at Carnegie Mellon University in April 2008 to April 2009. Job completion times and cluster allocation patterns followed a long-tailed distribution, motivating the need for fair job schedulers [5] to prevent large jobs or heavy users from monopolizing the cluster.
The MapReduce programming model has been
successfully used at Google for many different
purposes. We attribute this success to several
reasons. First, the model is easy to use, even for
programmers without experience with parallel
and distributed systems, since it hides the
details of parallelization, fault-tolerance, locality
optimization, and load balancing. Second, a
large variety of problems are easily expressible
as MapReduce computations. For example,
MapReduce is used for the generation of data
for Google's production web search service, for sorting, for data mining, for machine learning, and for many other systems. Third, we
have developed an implementation of
MapReduce that scales to large clusters comprising thousands of machines. The implementation makes efficient use of these
machine resources and therefore is suitable for
use on many of the large computational
problems encountered at Google.
REFERENCES
1. J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," Communications of the ACM, vol. 51, pp. 107–113, 2008.
2. http://en.wikipedia.org/wiki/Clustering.
3. Hadoop, "Powered by Hadoop," http://wiki.apache.org/hadoop/PoweredBy.
4. http://econ.worldbank.org/WBSITE/EXTERNAL/EXTDEC/EXTRESEARCH/0,,contentMDK:20699301~pagePK:64214825~piPK:64214943~theSitePK:469382,00.html.
5. R. Sahoo, M. Squillante, A. Sivasubramaniam, and Y. Zhang, "Failure data analysis of a large-scale heterogeneous server environment," in Dependable Systems and Networks, Florence, Italy, Jun. 2004.
6. Yahoo!, "Hadoop capacity scheduler," 2008, https://issues.apache.org/jira/browse/HADOOP-3445.
7. M. Isard, V. Prabhakaran, J. Currey, U. Wieder, K. Talwar, and A. Goldberg, "Quincy: Fair scheduling for distributed computing clusters," in ACM Symposium on Operating Systems Principles, Big Sky, Montana, Oct. 2009, pp. 261–276.
8. The Apache Software Foundation, "The Map/Reduce Tutorial," 2008, http://hadoop.apache.org/common/docs/current/mapred_tutorial.html.