Hadoop Introduction & Installation
Presented By: Pulkit Narwal

What is Hadoop
• Hadoop is an open-source framework from Apache used to store, process, and analyze data that is very large in volume.
• It is used to develop data processing applications that run in a distributed computing environment, so that data can be processed in parallel.
• Hadoop is written in Java and is not an OLAP (online analytical processing) system.
• It is used for batch/offline processing. It is used by Facebook, Yahoo, Google, Twitter, LinkedIn, and many more.
• Moreover, it can be scaled up simply by adding nodes to the cluster.

Hadoop Through an Analogy
• Introducing Jack, a grape farmer. He harvests the grapes in the fall, stores them in a storage room, and finally sells them in the nearby town. He kept this routine going for years, until people began to demand other fruits. This rise in demand led him to grow apples and oranges in addition to grapes.
• Unfortunately, the whole process turned out to be time-consuming and difficult for Jack to do single-handedly.

Hadoop Through an Analogy
• So, Jack hires two more people to work alongside him. The extra help speeds up the harvesting process, as the three of them can work simultaneously on different products.
• However, this takes a nasty toll on the storage room, as the single storage area becomes a bottleneck for storing and accessing all the fruits.

Hadoop Through an Analogy
• Jack thought through this problem and came up with a solution: give each person a separate storage space. Now, when Jack receives an order for a fruit basket, all three can work from their own storage areas and complete the order on time.
• Thanks to Jack's solution, everyone can finish their orders on time and with no trouble. Even with sky-high demand, Jack can complete his orders.

Modules of Hadoop
• HDFS: Hadoop Distributed File System. Google published its GFS (Google File System) paper, and HDFS was developed on the basis of it. Files are broken into blocks and stored on nodes across the distributed architecture.
• YARN: Yet Another Resource Negotiator, used for job scheduling and cluster management.
• MapReduce: A framework that lets Java programs perform parallel computation on data using key-value pairs. The Map task takes input data and converts it into a data set that can be computed over as key-value pairs. The output of the Map task is consumed by the Reduce task, and the output of the reducer gives the desired result.
• Hadoop Common: Java libraries used to start Hadoop; they are also used by the other Hadoop modules.

Hadoop has a Master-Slave Architecture
(architecture diagram)

Hadoop has a Master-Slave Architecture
• HDFS is developed in the Java language, so any machine that supports Java can run the NameNode and DataNode software.
• NameNode
1. It is the single master server in the HDFS cluster.
2. As it is a single node, it can become a single point of failure.
3. It manages the file system namespace by executing operations such as opening, renaming, and closing files.
4. Having a single master simplifies the architecture of the system.
• Task Tracker
1. It works as a slave node for the Job Tracker.
2. It receives tasks and code from the Job Tracker and applies that code to the file. This process can also be called a Mapper.

Hadoop has a Master-Slave Architecture
• DataNode
1. The HDFS cluster contains multiple DataNodes.
2. Each DataNode contains multiple data blocks.
3. These data blocks are used to store data.
4. It is the responsibility of a DataNode to serve read and write requests from the file system's clients.
5. It performs block creation, deletion, and replication upon instruction from the NameNode.
• Job Tracker
1. The role of the Job Tracker is to accept MapReduce jobs from clients and process the data by using the NameNode.
2. In response, the NameNode provides metadata to the Job Tracker.

HDFS
• HDFS creates an abstraction; let me simplify it for you. Similar to virtualization, you can see HDFS logically as a single unit for storing Big Data, but your data is actually stored across multiple nodes in a distributed fashion. HDFS follows a master-slave architecture.
• In HDFS, the NameNode is the master node and the DataNodes are the slaves. The NameNode holds metadata about the data stored in the DataNodes, such as which data block is stored on which DataNode and where the replicas of each data block are kept. The actual data is stored in the DataNodes (see the code sketch below).

HDFS
(diagram)
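HDFS Through the Java API
• To make the abstraction concrete, here is a minimal sketch of a client writing and reading a file through Hadoop's Java FileSystem API. The NameNode address, class name, and file path are illustrative placeholders; the point is that the client sees one logical file system while HDFS decides which DataNodes hold the blocks.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; a real cluster sets this in core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000");

        // One logical file system, even though blocks live on many DataNodes.
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/fruits.txt");

        // Write: the NameNode chooses DataNodes; the client streams bytes to them.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("grapes apples oranges");
        }

        // Read: the NameNode supplies block locations; bytes come from DataNodes.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}

• Note how the code never names a DataNode: block placement and replication stay behind the FileSystem interface, which is exactly the single-unit view described above.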
MapReduce
• Hadoop MapReduce is the processing unit of Hadoop. In the MapReduce approach, the processing is done at the slave nodes, and the final result is sent to the master node.
• The input dataset is first split into chunks of data. In this example, the input has three lines of text - "bus car train," "ship ship train," "bus ship car." The dataset is split into three chunks, one per line, and the chunks are processed in parallel.
• In the map phase, each word is taken as a key and assigned a value of 1 - one bus, one car, one ship, one train, and so on.
• These key-value pairs are then shuffled and sorted together based on their keys. In the reduce phase, the aggregation takes place, and the final output is obtained. A condensed code sketch of this word-count job follows.

MapReduce
(diagram)
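Word Count in Java (sketch)
• Below is a condensed sketch of the word-count job described above, written against Hadoop's MapReduce Java API. The class names and the input/output paths taken from the command line are illustrative; the mapper and reducer mirror the map and reduce phases just described.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (token.isEmpty()) continue;
                word.set(token);
                ctx.write(word, ONE);   // e.g. (bus, 1), (car, 1), (train, 1)
            }
        }
    }

    // Reduce phase: after shuffle-and-sort groups pairs by key, sum the 1s.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));  // e.g. (ship, 3)
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

• Run on the three-line input above, the reducer would emit bus 2, car 2, ship 3, train 2.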
YARN
• YARN performs all of your processing activities by allocating resources and scheduling tasks.
• It has two major components, i.e., the ResourceManager and the NodeManager.
• The ResourceManager is again a master node. It receives the processing requests and then passes the parts of the requests to the corresponding NodeManagers, where the actual processing takes place. A NodeManager is installed on every DataNode and is responsible for the execution of tasks on that DataNode.

YARN
(diagram; source: edureka)

YARN
• Suppose a client machine wants to run a query or fetch some code for data analysis. The job request goes to the ResourceManager (Hadoop YARN), which is responsible for resource allocation and management.
• In the node section, each node has its own NodeManager, which manages the node and monitors its resource usage. Containers hold a collection of physical resources, such as RAM, CPU, or hard drives. Whenever a job request comes in, the ApplicationMaster requests containers from the NodeManager; once the NodeManager gets the resources, it reports back to the ResourceManager.

YARN
• Hadoop YARN acts like an OS for Hadoop. It is a resource-management layer that sits on top of HDFS.
• It is responsible for managing cluster resources to make sure you don't overload one machine.
• It performs job scheduling to make sure that jobs are scheduled in the right place.

Installing Hadoop
• Hadoop is supported by the GNU/Linux platform and its flavors. Therefore, we have to install a Linux operating system to set up the Hadoop environment. If you have an OS other than Linux, you can install VirtualBox and run Linux inside it.

Links for Installation
• Hadoop: Setting up a Single Node Cluster:
https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html
• Installing Hadoop 3.2.1 Single node cluster on Windows 10:
https://towardsdatascience.com/installing-hadoop-3-2-1-single-node-cluster-on-windows-10-ac258dd48aef
• Virtualizing Hadoop® on VMware vSphere®: A Deployment Guide:
https://www.vmware.com/content/dam/digitalmarketing/vmware/en/pdf/products/vsphere/vmware-hadoop-deployment-guide.pdf

Links for Installation
• Hadoop - Environment Setup:
https://www.tutorialspoint.com/hadoop/hadoop_enviornment_setup.htm
• Hadoop Installation:
https://www.javatpoint.com/hadoop-installation
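Verifying an Installation
• After following one of the guides above, a quick sanity check is to compile and run a small Java program against the installed client libraries. This is only a sketch: it assumes the Hadoop jars are on the classpath (e.g., via the output of the hadoop classpath command) and simply reports which file system the client configuration points at.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class InstallCheck {
    public static void main(String[] args) throws Exception {
        // Loads core-default.xml plus any core-site.xml found on the classpath.
        Configuration conf = new Configuration();

        // An unconfigured install prints the local default (file:///);
        // a configured single-node cluster should print the hdfs:// URI
        // set in core-site.xml.
        System.out.println("fs.defaultFS = " + conf.get("fs.defaultFS"));

        // Obtaining a FileSystem handle confirms the client libraries load;
        // actual NameNode connectivity is exercised on the first read or write.
        FileSystem fs = FileSystem.get(conf);
        System.out.println("Working file system: " + fs.getUri());
    }
}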