
Hadoop Introduction & Installation

Hadoop Introduction & Installation
Presented By:
Pulkit Narwal
What is Hadoop
• Hadoop is an open source framework from Apache used to store, process, and analyze data that is very large in volume.
• It is used to develop data processing applications that execute in a distributed computing environment, so that data can be processed in parallel.
• Hadoop is written in Java and is not an OLAP (online analytical processing) system.
• It is used for batch/offline processing. It is used by Facebook, Yahoo, Google, Twitter, LinkedIn, and many more.
• Moreover, it can be scaled up simply by adding nodes to the cluster.
Hadoop Through an Analogy
• Introducing Jack, a grape farmer. He harvests the grapes in the fall, stores them in a storage room, and finally sells them in the nearby town. He kept this routine going for years, until people began to demand other fruits. This rise in demand led him to grow apples and oranges in addition to grapes.
• Unfortunately, the whole process turned out to be time-consuming and difficult for Jack to do single-handedly.
Hadoop Through an Analogy
• So, Jack hires two more people to work alongside him. The extra help
speeds up the harvesting process as three of them can work
simultaneously on different products.
• However, this takes a nasty toll on the storage room, as the storage
area becomes a bottleneck for storing and accessing all the fruits.
Hadoop Through an Analogy
• Jack thought through this problem and came up with a solution: give each person a separate storage space. So, when Jack receives an order for a fruit basket, he can complete the order on time, because all three can work from their own storage areas.
• Thanks to Jack's solution, everyone can finish their orders on time and with no trouble. Even with sky-high demand, Jack can complete his orders.
Modules of Hadoop
• HDFS: Hadoop Distributed File System. Google published its GFS paper, and HDFS was developed on the basis of it. Files are broken into blocks and stored on nodes across the distributed architecture.
• YARN: Yet Another Resource Negotiator is used for job scheduling and cluster management.
• MapReduce: A framework that helps Java programs perform parallel computation on data using key-value pairs. The Map task takes input data and converts it into a data set that can be computed on as key-value pairs. The output of the Map task is consumed by the Reduce task, and the output of the reducer gives the desired result.
• Hadoop Common: Java libraries that are used to start Hadoop and are used by the other Hadoop modules; a minimal usage sketch follows this list.
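To make the Hadoop Common bullet concrete, here is a minimal sketch, assuming the Hadoop client libraries are on the classpath: the shared Configuration class (from Hadoop Common) loads core-site.xml and tells the other modules, such as HDFS, where the NameNode lives. The fs.defaultFS value in the comment is only an example for a single-node setup.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class ShowConfig {
  public static void main(String[] args) throws Exception {
    // Configuration (Hadoop Common) reads core-site.xml / hdfs-site.xml from the classpath
    Configuration conf = new Configuration();
    // e.g. hdfs://localhost:9000 on a single-node cluster; falls back to the local file system
    System.out.println("fs.defaultFS = " + conf.get("fs.defaultFS", "file:///"));
    // The same configuration object is what the HDFS, MapReduce, and YARN clients build on
    System.out.println("file system  = " + FileSystem.get(conf).getClass().getSimpleName());
  }
}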
Hadoop has a Master-Slave Architecture
• HDFS is developed in Java, so any machine that supports Java can run the NameNode and DataNode software.
• NameNode
1. It is the single master server in the HDFS cluster.
2. Because it is a single node, it can become a single point of failure.
3. It manages the file system namespace by executing operations such as opening, renaming, and closing files.
4. It simplifies the architecture of the system.
• Task Tracker
1. It works as a slave node for the Job Tracker.
2. It receives tasks and code from the Job Tracker and applies that code to the file. This process can also be called a Mapper.
Hadoop has a Master-Slave Architecture
• DataNode
1. The HDFS cluster contains multiple DataNodes.
2. Each DataNode contains multiple data blocks.
3. These data blocks are used to store data.
4. The DataNode is responsible for serving read and write requests from the file system's clients.
5. It performs block creation, deletion, and replication upon instruction from the NameNode.
• Job Tracker
1. The role of the Job Tracker is to accept MapReduce jobs from clients and process the data by using the NameNode.
2. In response, the NameNode provides metadata to the Job Tracker.
HDFS
• HDFS creates an abstraction over distributed storage. Much like virtualization, you can view HDFS logically as a single unit for storing Big Data, but in reality the data is stored across multiple nodes in a distributed fashion. HDFS follows a master-slave architecture.
• In HDFS, the NameNode is the master node and the DataNodes are the slaves. The NameNode holds metadata about the data stored in the DataNodes, such as which data block is stored on which DataNode and where the replicas of each block are kept. The actual data is stored on the DataNodes; a minimal client-side sketch follows.
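As a small illustration of that abstraction, the sketch below writes and reads a file through Hadoop's Java FileSystem API: the client only ever sees one logical file system, while the NameNode decides which DataNodes hold the blocks. The path /user/demo/hello.txt is a hypothetical example, and the sketch assumes a running HDFS reachable through the core-site.xml on the classpath.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();          // picks up fs.defaultFS from core-site.xml
    FileSystem fs = FileSystem.get(conf);              // the single logical file system the client sees

    Path file = new Path("/user/demo/hello.txt");      // hypothetical example path
    try (FSDataOutputStream out = fs.create(file, true)) {
      // The write goes to whichever DataNodes the NameNode assigns for this block
      out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
    }

    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
      System.out.println(in.readLine());               // reads back "hello hdfs"
    }
  }
}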
MapReduce
• Hadoop MapReduce is the processing unit of Hadoop. In the MapReduce approach, the processing is done on the slave nodes, and the final result is sent to the master node.
• The input dataset is first split into chunks of data. In this example, the input has three lines of text, each with three words: "bus car train," "ship ship train," "bus ship car." The dataset is split into three chunks, one per line, which are processed in parallel.
• In the map phase, each word is emitted as a key with a value of 1; for the first chunk, this gives (bus, 1), (car, 1), and (train, 1).
• These key-value pairs are then shuffled and sorted by key. In the reduce phase, the values for each key are aggregated, and the final output, the count of each word, is obtained. A minimal word-count sketch follows.
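The word-count flow just described maps directly onto Hadoop's Java MapReduce API (org.apache.hadoop.mapreduce). The sketch below is a minimal version of the classic WordCount job; the input and output paths are command-line placeholders, and the class names TokenMapper and SumReducer are only illustrative.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input split,
  // e.g. "bus car train" -> (bus, 1), (car, 1), (train, 1)
  public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          ctx.write(word, ONE);
        }
      }
    }
  }

  // Reduce phase: after shuffle and sort, all values for one key arrive together,
  // so summing them gives the final count for each word
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      ctx.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class);   // optional local aggregation before the shuffle
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory must not exist yet
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}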
YARN
• YARN performs all of your processing activities by allocating resources and scheduling tasks.
• It has two major components: the ResourceManager and the NodeManager.
• The ResourceManager is again a master node. It receives processing requests and passes parts of each request to the corresponding NodeManagers, where the actual processing takes place. A NodeManager is installed on every DataNode and is responsible for executing tasks on that DataNode.
YARN architecture (diagram; source: edureka)
YARN
• Suppose a client machine wants to run a query or submit some code for data analysis. This job request goes to the ResourceManager (Hadoop YARN), which is responsible for resource allocation and management.
• On the node side, each node has its own NodeManager, which manages the node and monitors its resource usage. Containers hold a collection of physical resources, which could be RAM, CPU, or hard drives. Whenever a job request comes in, the ApplicationMaster requests containers from the NodeManagers. Once a NodeManager grants the resources, it reports back to the ResourceManager.
YARN
• Hadoop YARN acts like an operating system for Hadoop. It is a resource-management layer built on top of HDFS.
• It is responsible for managing cluster resources to make sure you don't overload one machine.
• It performs job scheduling to make sure that jobs are scheduled in the right place; a small sketch of querying YARN for node reports follows.
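As a small illustration of YARN's resource-management role, the sketch below asks the ResourceManager for the cluster's node reports through the YARN client API (org.apache.hadoop.yarn.client.api.YarnClient). It assumes a running cluster whose yarn-site.xml is on the classpath.

import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class ClusterNodes {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();        // reads yarn-site.xml from the classpath
    YarnClient yarn = YarnClient.createYarnClient();
    yarn.init(conf);
    yarn.start();
    try {
      // Ask the ResourceManager which NodeManagers are currently running
      List<NodeReport> nodes = yarn.getNodeReports(NodeState.RUNNING);
      for (NodeReport n : nodes) {
        // Each report shows the resources that node offers and how much is in use
        System.out.println(n.getNodeId()
            + " capability=" + n.getCapability()
            + " used=" + n.getUsed());
      }
    } finally {
      yarn.stop();
    }
  }
}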
Installing Hadoop
• Hadoop is supported on the GNU/Linux platform and its flavors. Therefore, we have to install a Linux operating system to set up the Hadoop environment. If you have an OS other than Linux, you can install VirtualBox and run Linux inside it.
Links for Installation
• Hadoop: Setting up a Single Node Cluster:
https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html
• Installing Hadoop 3.2.1 Single node cluster on Windows 10
https://towardsdatascience.com/installing-hadoop-3-2-1-single-node-cluster-on-windows-10-ac258dd48aef
• Virtualizing Hadoop® on VMware vSphere®: A Deployment Guide
https://www.vmware.com/content/dam/digitalmarketing/vmware/en/pdf/products/vsphere/vmware-hadoop-deployment-guide.pdf
Links for Installation
• Hadoop - Environment Setup
https://www.tutorialspoint.com/hadoop/hadoop_enviornment_setup.htm
• Hadoop Installation
https://www.javatpoint.com/hadoop-installation