International Journal of Engineering Trends and Technology (IJETT) – Volume 14 Number 3 – Aug 2014
A Closer Look Over the Hadoop Framework
Raj Kumari Bhatia, Aakriti Bansal
UIET, Panjab University, Chandigarh, India
Abstract— Apache Hadoop is good, but there is huge scope for improvement and extension of the existing technology. Various improvements have been proposed to Hadoop, which is an open-source implementation of Google's MapReduce framework. It enables distributed, data-intensive job scheduling and parallel applications by decomposing a massive job into a number of small segments and a massive data set into smaller partitions, such that each task processes a different partition in parallel. The large-scale data is handled by HDFS, which mimics the Google File System. MapReduce applications mainly use HDFS for storing data. HDFS is a very large distributed file system that uses commodity hardware and provides high throughput as well as fault tolerance. Many big companies believe that within a few years more than half of the world's data will be stored in Hadoop. HDFS stores files as a series of blocks, which are replicated for fault tolerance. This paper is an introduction to Hadoop; it gives a brief account of the existing deployment approaches and their limitations, and the last section describes the dynamic scheduling of processing (I/O) slots and its limitations.
Keywords— Hadoop, deployment, scheduling, I/O slots, MapReduce.
I. INTRODUCTION
Big data and Hadoop are rapidly emerging as the preferred solution to the technology and business issues that are disrupting traditional management and processing systems. Because the information arriving from numerous sources, in several forms, is simply too massive, it is characterized by the three Vs shown in Fig. 1.
Fig. 1. Parameters of Big Data: Variety, Velocity, Volume

Variety makes the data too big to handle in conventional ways. Data comes from varied sources and may be structured, unstructured or semi-structured; the different types of data include text, audio, video, log files, sensor data, etc. Volume represents the size of the data, i.e., how huge the data is; the size is described in terabytes and petabytes. Velocity refers to the motion of the data and the analysis of streaming data.

A. HADOOP
Hadoop is a programming framework that supports the processing of huge data sets in a distributed computing environment. Hadoop grew out of Google's MapReduce, a software framework in which an application is broken down into numerous parts. The current Apache Hadoop ecosystem consists of the Hadoop kernel, MapReduce, HDFS and a number of related components such as Apache Hive, HBase and ZooKeeper. The Apache Hadoop projects provide a series of tools designed to solve big data problems. A Hadoop cluster implements a parallel computing cluster using inexpensive commodity hardware. The data is partitioned across many servers to provide near-linear scalability. The philosophy of the cluster design is to bring the computation to the data, so each data node holds part of the overall data and is able to process the data that it holds. The overall framework for the processing software is called MapReduce.
B. MapReduce
MapReduce is a programming framework for distributed computing, created by Google, in which a divide-and-conquer technique is employed to break large, complicated data into small units and process them. MapReduce provides automatic parallelization and distribution, fault tolerance, I/O scheduling, and status monitoring of jobs.
MapReduce is the combination of a map phase and a reduce phase [1]; an illustrative sketch follows the two definitions below.
• Map(): The master node takes the input, divides it into smaller sub-problems and distributes them to worker nodes. A worker node may do this again in turn, which results in a multi-level tree structure. Each worker node processes its smaller problem and passes the result back to the master node. The map function processes the input as key-value pairs and produces a set of intermediate pairs, which are passed to the master node.
• Reduce(): It combines all the intermediate values and produces a set of merged output values.
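As a minimal sketch (not taken from the surveyed papers), a word-count job written against the Hadoop Java MapReduce API could look as follows; the class names and the choice of word count as the example are ours.

// Minimal word-count sketch using the org.apache.hadoop.mapreduce API.
// Illustrative only; class names and the "wordcount" job name are ours.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map(): emits an intermediate (word, 1) pair for every token in the input split.
  public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context ctx)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          ctx.write(word, ONE);
        }
      }
    }
  }

  // Reduce(): merges all intermediate values that share the same key.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      ctx.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class); // pre-aggregates counts on the map side
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The combiner reuses the reducer to pre-aggregate counts on the map side, which reduces the amount of intermediate data shuffled to the reduce tasks.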
C. HDFS
HDFS is a block-structured distributed file system that holds large amounts of Big Data. In HDFS the data is stored in blocks, which are known as chunks. HDFS has a client-server architecture comprising a NameNode and many DataNodes. The NameNode stores the metadata of the file system, keeps track of the state of the DataNodes, and is also responsible for file-system operations [2].
When the NameNode fails, Hadoop does not support automatic recovery, but the configuration of a secondary NameNode is possible. HDFS is based on the principle that "moving computation is cheaper than moving data". HDFS is a self-healing, high-bandwidth, clustered storage system which provides an optimized, redundant, reliable, distributed file system.
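As a hedged illustration (not from the original paper) of how a client writes to and reads from HDFS, the sketch below uses the Hadoop FileSystem Java API; the NameNode URI, file path and replication factor of 3 are assumptions for the example.

// Illustrative HDFS client sketch using the org.apache.hadoop.fs.FileSystem API.
// The NameNode URI, file path and replication factor are assumptions.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Points the client at the NameNode; "namenode:9000" is an assumed address.
    conf.set("fs.defaultFS", "hdfs://namenode:9000");

    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/user/demo/hello.txt");

    // Write a file; HDFS splits it into blocks and replicates each block
    // (replication factor 3 here) across DataNodes for fault tolerance.
    try (FSDataOutputStream out = fs.create(file, (short) 3)) {
      out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
    }

    // Read the file back; the NameNode's metadata locates the blocks.
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
      System.out.println(in.readLine());
    }
    fs.close();
  }
}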
Other Components of Hadoop [3]:
HBase: HBase is an open-source, non-relational, distributed database that allows low-latency, quick lookups in Hadoop. It adds transactional capabilities to Hadoop, allowing users to perform updates, inserts and deletes. It is written in Java, runs on top of HDFS, and can serve as both the input and the output for MapReduce jobs.
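A hedged sketch of the kind of low-latency insert and lookup described above, using the HBase Java client API of that era; the table name, column family and values are invented for illustration.

// Illustrative HBase client sketch (HBase 0.9x-era API).
// The "users" table, "info" column family and all values are assumptions.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "users"); // table assumed to exist already

    // Insert (or update) a cell: row key, column family, qualifier, value.
    Put put = new Put(Bytes.toBytes("row1"));
    put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
    table.put(put);

    // Low-latency point lookup by row key.
    Result result = table.get(new Get(Bytes.toBytes("row1")));
    byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
    System.out.println(Bytes.toString(value));

    table.close();
  }
}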
Pig: Pig Latin is a Hadoop-based language developed by Yahoo. It is relatively easy to learn and is adept at expressing very deep, very long data pipelines. Pig is a high-level platform on which MapReduce programs used with Hadoop are created; it is a high-level data processing system in which data sets are analyzed through a high-level language.
Hive: Hive is a Hadoop-based data-warehousing-like framework originally developed by Facebook. It allows users to write queries in a SQL-like language called HiveQL, which are then converted to MapReduce. The Hive infrastructure is built on top of Hadoop and helps in providing summarization, query and analysis. This allows SQL programmers with no MapReduce experience to use the warehouse and makes it easier to integrate with business intelligence and visualization tools.
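As a hedged sketch (not from the paper), a HiveQL query can be submitted from Java through the HiveServer2 JDBC driver; the host, port, table and columns below are assumptions, and Hive turns the query into MapReduce jobs behind the scenes.

// Illustrative sketch: running a HiveQL query via the HiveServer2 JDBC driver.
// Host, port, database, table name and columns are assumptions for the example.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection conn =
             DriverManager.getConnection("jdbc:hive2://hiveserver:10000/default", "", "");
         Statement stmt = conn.createStatement();
         // Hive compiles this SQL-like query into one or more MapReduce jobs.
         ResultSet rs = stmt.executeQuery(
             "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page")) {
      while (rs.next()) {
        System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
      }
    }
  }
}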
Sqoop: Sqoop is a command-line interface platform that is
used for transferring data between relational databases and
Hadoop.
Avro: Avro is a data exchange service which is basically used in Apache Hadoop. These services can be used together as well as independently. It also acts as a data serialization system that allows encoding the schema of Hadoop files. It is adept at parsing data and performing remote procedure calls.
Oozie: Oozie is a workflow processing system that lets users define a series of jobs written in multiple languages (such as MapReduce, Pig and Hive) and then intelligently link them to one another. Oozie allows users to specify, for example, that a particular query is only to be initiated after the specified previous jobs on which it relies for data have completed. Moreover, Oozie is a Java-based web application that runs in a Java servlet container. Oozie uses a database to store the definition of a workflow, which is a collection of actions, and it manages the Hadoop jobs.
Chukwa: Chukwa is a data collection and analysis framework which is used to process and analyze large amounts of logs. It is built on top of the HDFS and MapReduce frameworks.
Flume: Flume is a high-level framework for populating Hadoop with data from multiple sources. Agents are deployed throughout one's IT infrastructure (inside web servers, application servers and mobile devices, for example) to collect data and integrate it into Hadoop.
Zookeeper: ZooKeeper is a centralized service that provides distributed synchronization, group services and maintenance of configuration information.
Advantages and Disadvantages of Hadoop
1) Advantages
MapReduce is designed to run on clusters of commodity PCs, and the management of thousands of commodity PCs is a big job. The reliability of commodity PCs is also questionable, and perhaps the biggest problem is power consumption; anyone who wants to build their own compute center will pay quite a lot. This is where EC2-like services are used. Deploying Hadoop applications on virtual machines brings all the advantages of virtualization: it makes management of the cluster easier and improves reliability, because virtual machines can be recovered from a crash more easily than physical ones. Thus it can improve the reliability of the master node of MapReduce. Besides that, virtualization can help to fully utilize the system resources. By using EC2-like services, customers can easily and cost-effectively process vast amounts of data.
2) Disadvantages
The main disadvantage is the potential for poor performance under heavy load, which is what the work discussed below sets out to solve.
II. DEPLOYING HADOOP OVER VIRTUAL MACHINES, BY GUANGHUI XU ET AL. [10]
By following this paper, one can set up one's own Hadoop experimental environment and capture the current status of, and trends in, optimizing Hadoop in a virtual environment.
Hadoop deployment
They discuss deploying Hadoop over virtual machines, describing how to install Hadoop with the JDK and deploy it on Ubuntu 12.04.
To set up Hadoop with the JDK, follow these steps:
1) Download the latest version of the JDK for Ubuntu from http://www.oracle.com/technetwork/java/javase/downloads/jdk-7u4-downloads-1591156.html; they chose jdk-7u4-linux-i586.tar.gz.
2) Set environment variables
Untar the file, set the environment variable JAVA_HOME to the path of the JDK, add JAVA_HOME/bin to PATH and JAVA_HOME/lib to CLASSPATH.
3) Make the Sun JDK the default JDK
$ sudo update-alternatives --install /usr/bin/java java /usr/lib/java/jdk1.6.0_20/bin/java 300
$ sudo update-alternatives --install /usr/bin/javac javac /usr/lib/java/jdk1.6.0_20/bin/javac 300
$ sudo update-alternatives --config java
A. Task Scheduling
Hadoop's performance is closely tied to its task scheduler, which implicitly assumes that cluster nodes are homogeneous and that tasks make progress linearly, and which uses these assumptions to decide when to speculatively re-execute tasks that appear to be stragglers [4]. These are the implicit assumptions of Hadoop's scheduler [8]; an illustrative sketch follows the list:
• Nodes can perform work at roughly the same rate.
• Tasks progress at a constant rate throughout time.
• There is no cost to launching a speculative task on a node that would otherwise have an idle slot.
• A task's progress score is roughly equal to the fraction of its total work that it has done. Specifically, in a reduce task, the copy, reduce and merge phases each take 1/3 of the total time.
• Tasks tend to finish in waves, so a task with a low progress score is likely a slow task.
• Different tasks of the same category (map or reduce) require roughly the same amount of work.
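As a purely illustrative sketch, not Hadoop's actual scheduler code, the speculative-execution decision these assumptions support can be approximated as follows; the 20% threshold and all names are our assumptions.

// Hedged sketch of a speculative-execution decision based on progress scores.
// This approximates the idea only; the threshold and class names are assumed.
import java.util.List;

public class SpeculationSketch {

  public static class TaskStatus {
    final String taskId;
    final double progressScore; // assumed fraction of work done, in [0, 1]
    TaskStatus(String taskId, double progressScore) {
      this.taskId = taskId;
      this.progressScore = progressScore;
    }
  }

  // A task is treated as a straggler if its progress score falls more than a
  // fixed threshold below the average progress of its category (map or reduce)
  // and a free slot is available, under the assumption that tasks of the same
  // category need roughly the same amount of work.
  public static boolean shouldSpeculate(TaskStatus task, List<TaskStatus> sameCategory,
                                        boolean freeSlotAvailable) {
    if (!freeSlotAvailable) {
      return false; // assumption: speculation only uses otherwise-idle slots
    }
    double avg = sameCategory.stream()
        .mapToDouble(t -> t.progressScore)
        .average()
        .orElse(0.0);
    return task.progressScore < avg - 0.20; // 20% behind the average => likely slow
  }
}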
B. I/O Scheduling
The efficiency in a virtual machine may be much lower than in a physical machine; the reasons include task scheduling and I/O scheduling. MapReduce is designed to run on physical machines, and when a MapReduce task is running a lot of data is transferred between machines, so the efficiency of I/O scheduling is very important for shortening the response time.
Research aspects: if Hadoop is running on virtual machines and knows whether any two virtual machines are on the same physical host, this will help Hadoop decide which virtual machine runs which map or reduce job.
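A hedged sketch of how such host awareness might be used, assuming the scheduler is handed a mapping from virtual machines to physical hosts; the mapping source and all names below are our assumptions, not the surveyed paper's implementation.

// Hedged sketch: preferring a worker VM that shares a physical host with the
// VM holding the input data. The VM-to-host mapping and all names are assumed.
import java.util.List;
import java.util.Map;

public class CoLocationAwarePlacement {

  // Maps each virtual machine name to the physical host it runs on.
  private final Map<String, String> vmToPhysicalHost;

  public CoLocationAwarePlacement(Map<String, String> vmToPhysicalHost) {
    this.vmToPhysicalHost = vmToPhysicalHost;
  }

  // Picks a worker VM for a map/reduce task whose input lives on dataVm,
  // preferring a candidate on the same physical host so the data transfer
  // stays inside one machine; otherwise falls back to the first free VM.
  public String chooseWorker(String dataVm, List<String> freeWorkerVms) {
    String dataHost = vmToPhysicalHost.get(dataVm);
    for (String vm : freeWorkerVms) {
      if (dataHost != null && dataHost.equals(vmToPhysicalHost.get(vm))) {
        return vm; // co-located with the data
      }
    }
    return freeWorkerVms.isEmpty() ? null : freeWorkerVms.get(0);
  }
}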
III. DYNAMIC PROCESSING SLOTS SCHEDULING, BY KURAZUMI ET AL. [9]
In this paper, they propose dynamic processing slots scheduling for I/O-intensive jobs of Hadoop MapReduce, focusing on I/O wait during the execution of jobs. Assigning more tasks to newly added free slots when CPU resources with a high rate of I/O wait are detected on an active TaskTracker node leads to improved CPU utilization.
They implemented the approach on Hadoop 1.0.3. They evaluated the parameters up_ioline and down_ioline and found only a small difference between the results for values that are close together. They also concluded that setting up_ioline to extremely high values can cause a performance decrement, because the high rate of I/O wait then appears only as a short mound.
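The following is a hedged sketch of the thresholding idea as described, not the authors' implementation; the names up_ioline and down_ioline come from the paper, while the polling interface, slot bounds and concrete values are our assumptions.

// Hedged sketch of dynamic Map-slot adjustment driven by I/O-wait thresholds.
// upIoline/downIoline mirror the paper's up_ioline/down_ioline parameters;
// the polling loop, slot bounds and concrete values are our assumptions.
public class DynamicSlotController {

  private final double upIoline;    // add a slot when I/O wait rises above this
  private final double downIoline;  // remove a slot when I/O wait falls below this
  private final int minSlots;
  private final int maxSlots;
  private int currentSlots;

  public DynamicSlotController(double upIoline, double downIoline,
                               int minSlots, int maxSlots, int initialSlots) {
    this.upIoline = upIoline;
    this.downIoline = downIoline;
    this.minSlots = minSlots;
    this.maxSlots = maxSlots;
    this.currentSlots = initialSlots;
  }

  // Called periodically on a TaskTracker node with the measured I/O-wait rate
  // (fraction of CPU time spent waiting for I/O). While the CPU is mostly
  // waiting for I/O, an extra task slot is opened so the idle CPU can make
  // progress on another task; when I/O wait drops, slots shrink back.
  public int adjust(double ioWaitRate) {
    if (ioWaitRate > upIoline && currentSlots < maxSlots) {
      currentSlots++;
    } else if (ioWaitRate < downIoline && currentSlots > minSlots) {
      currentSlots--;
    }
    return currentSlots;
  }
}

For example, a controller created as new DynamicSlotController(0.5, 0.2, 2, 6, 2) would open an extra Map slot whenever more than half of the CPU time is spent waiting on I/O, and shrink back once I/O wait drops below 20%; these figures are illustrative, not the paper's settings.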
They use the Sort program [5] as the benchmark, because Sort is a basic operation among programs running on Hadoop MapReduce. Its Map function (IdentityMapper) and Reduce function (IdentityReducer) only take key-value pairs from the RecordReader and output them. The amount of CPU processing in both the Map and Reduce phases is less than the amount of I/O processing, so Sort is an I/O-intensive program.
They evaluated, several times, the fluctuation of the total number of Map slots in the whole cluster during the execution of a single job, using a cluster of 8 slave nodes and 12 GB of input data, because this execution pattern showed the greatest effectiveness. The initial total number of Map slots in the whole cluster is set to 16, because the cluster consists of 8 slave nodes. According to their analysis, the execution time of the Map phase is about 50% of the whole execution time (including the Map+Reduce state). The total number of Map slots increases up to 42 in the second half; this results from the number of running I/O-intensive Reduce tasks in that part, by which time all Map tasks have completed. In the first half of all experimental results, the total number of Map slots increases up to 27, and to about 20 on average.
Pros:They implemented the proposed method on Hadoop 1.0.3 and
evaluated it by executing jobs with Sort as the benchmark
program. Modified Hadoop was enabled to control the number
of Map slots dynamically in comparison with default static
assignment. Using our proposed method, the execution time
was improved up to about 23% compared with default
Hadoop.
Cons:They had not taken the consideration of controlling number of
maps slots according to the change of map reduce phases.
They were not able to remove the overhead of managing
threads.
IV. CONCLUSION
In this paper we have discussed Hadoop, given a brief account of its deployment over virtual machines and the limitations of that approach, and, in the final section, described the dynamic scheduling of processing (I/O) slots and its limitations.
V. REFERENCES
[1] K. Bakshi, "Considerations for Big Data: Architecture and Approach," Aerospace Conference, IEEE, Big Sky, Montana, March 2012.
[2] D. Garlasu, V. Sandulescu, I. Halcu and G. Neculoiu, "A Big Data Implementation Based on Grid Computing," Grid Computing, 17-19 Jan. 2013.
[3] S. Sagiroglu and D. Sinanc, "Big Data: A Review," 20-24 May 2013.
[4] M. Zaharia et al., "Improving MapReduce Performance in Heterogeneous Environments," Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation, USENIX Association, 2008, pp. 29-42.
[5] Hadoop 1.0.3, The Apache Software Foundation, May 2012.
[6] B. Nicolae et al., "BlobSeer: Bringing High Throughput under Heavy Concurrency to Hadoop Map-Reduce Applications," Parallel & Distributed Processing (IPDPS), 2010 IEEE International Symposium on, IEEE, 2010.
[7] A. Thusoo et al., "Hive: A Warehousing Solution over a Map-Reduce Framework," Proceedings of the VLDB Endowment 2.2 (2009), pp. 1626-1629.
[8] K. Shvachko et al., "The Hadoop Distributed File System," Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on, IEEE, 2010.
[9] S. Kurazumi et al., "Dynamic Processing Slots Scheduling for I/O Intensive Jobs of Hadoop MapReduce," ICNC, 2012.
[10] G. Xu, F. Xu and H. Ma, "Deploying and Researching Hadoop in Virtual Machines," Automation and Logistics (ICAL), 2012 IEEE International Conference on, IEEE, 2012.