Introduction to Hadoop

Several tools are available for working with big data. Many of these tools are open source and Linux-based. Explore the fundamentals of Apache Hadoop, including distributed computing, design principles, HDFS, YARN, MapReduce, and Spark.

Table of Contents
1. Distributed Computing
2. Design Principles of Hadoop
3. Functional View of Hadoop
4. HDFS in Action
5. YARN in Action
6. MapReduce in Action
7. Spark in Action

Distributed Computing

Hadoop software is the foundational technology used in building a supercomputing platform. Hadoop has two parts: first, distributed data storage, and second, a parallel processing framework. In this video we will take a look at both of these in some detail. Let's begin.

There is always confusion about how much storage is involved in Big Data. The root of this is not using terminology correctly. Let me share some common terms you should learn.

First, disk space. This is the available read/write space in a single disk. Normally it is measured in terabytes, for example, a 2 TB hard disk. Then there's physical space; this is the total amount of space from all the hard disks. For example, if you have 300 of those 2 TB hard disks, you will have 600 TB of physical space. Then there's usable space. This is the amount of space available after accounting for the replication factor. A standard is to have a replication factor of 3, which means 600 TB of physical space yields 200 TB of usable space. And then there's temp space; this is the overhead amount of space on each disk required for swapping and I/O performance. This is frequently 10% to 15%, which means 200 TB of usable space needs 20 to 30 TB of overhead.

Heading: Terms of Storage. A table with three columns and five rows is displayed. The columns are: Term, Definition, and Example. Each row provides a term, then the definition, followed by the example.
Disk Space: space in a single disk: 2 TB.
Physical Space: total amount of disk space: 600 TB.
Usable Space: space available after the replication factor (defaults to 3 copies): 200 TB.
Temp Space: overhead for performance (10% to 15%): 30 TB.
Data Space: space available for the data store: 170 TB.

And then there's data space; this is the most important term. This is the amount of space available to Hadoop for storing data. It takes into account both the replication factor and the required temp space. In our example, we will have 170 to 180 TB of data space.
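To make the arithmetic concrete, here is a minimal Python sketch of the sizing example above. The figures (300 disks of 2 TB each, a replication factor of 3, and 10% to 15% temp overhead) come straight from this section; the function name and structure are just for illustration.

```python
# A minimal sketch of the storage sizing example: physical -> usable -> data space.
def data_space(disks, tb_per_disk, replication=3, temp_overhead=0.15):
    physical = disks * tb_per_disk          # raw capacity of all disks
    usable = physical / replication         # after keeping 3 copies of every block
    temp = usable * temp_overhead           # scratch space for swapping and I/O
    return physical, usable, usable - temp  # data space left for Hadoop

physical, usable, data = data_space(300, 2, replication=3, temp_overhead=0.15)
print(physical, usable, data)               # 600 TB, 200 TB, 170 TB
```

Running the same sketch with a 10% overhead instead of 15% gives the upper end of the range, 180 TB of data space.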
Now the primary components for a Hadoop cluster include the master server, the switches, the racks, and the data servers, which are also commonly called data workers, data nodes, or just nodes.

The master server has a number of key responsibilities. The master server is responsible for managing and coordinating the entire Hadoop cluster. It monitors the health of all the data nodes and takes corrective action when required. It is responsible for mapping the location of all the data and for directing all data movement. It is responsible for scheduling and rescheduling all compute jobs. And finally, the master server is responsible for handling failures and errors, including the loss of a data node and the rescheduling of failed compute jobs.

Heading: Functional View of Hadoop Cluster. A diagram depicts a Hadoop cluster. In the diagram, the master server is connected to DC Switch A and DC Switch B. Each of the switches is connected to two racks, each made up of six nodes.

Heading: Responsibilities for Master Server. The responsibilities are: Coordinate: manages the cluster of worker nodes. Health checks: monitors the health of all worker nodes. Data location and movement: mapping and direction for all data movement. Scheduling: scheduling and rescheduling all jobs. Error handling: handles all errors and failures.

The responsibilities of the data server are very simple. First, it is responsible for receiving and executing instructions from the master server. Second, the data server is responsible for performing the work assigned by the master server. All work is performed on the physical host; this is an important part of our share-nothing design principle. And all the data servers must continuously report status, including health checks and ongoing job progress.

We have some recommendations regarding what you need to purchase in terms of hardware for your master server. The master server is critical to availability, and the money should be spent to ensure the highest level of availability. We recommend you always max out the memory; start at 64 GB. You should have minimal growth of data, so do not oversize your data storage. And the master server should always be configured with the highest level of RAID disks, dual network connections, and a dual power supply. One very firm rule: never colocate a data node on the master server; it should always be dedicated to its primary purpose.

Heading: Responsibilities for Data Server. The data server is responsible for processing instructions and performing processing tasks on data located on the physical node.

Heading: Recommendation for Master Server. Some of the recommendations are: max memory, minimize physical storage, configure the highest level of RAID disks, and use a dual power supply and dual network.

Now regarding recommendations for the data server. When selecting hardware for your data nodes, we strongly suggest you follow these rules: do not put in redundancy, do not put in multiple network ports, do not overload the CPU or memory, and do not overload storage per node. Remember, one of the ways to keep supercomputing affordable is to scale out by adding low-cost data nodes. There are a number of great reference models provided by vendors such as Dell, IBM, and HP. These are comprehensive reference models; they include everything you should need to know to build a Hadoop cluster from bare iron, and they all come with strong support. Obviously they are not vendor neutral, but they are a solid starting point for building out your supercomputing platform. Now you have a good overview of what it takes to build a Hadoop cluster.

Design Principles of Hadoop

Let us discuss designing supercomputing platforms by first listing the problems that need to be solved. These problems have been set out in requirements for a few decades, and there have been many different solutions over the years. Common problems in designing supercomputers include: how to move large volumes of transactional data across the backplane; how to move large volumes of static data across the network; how to process large volumes of data faster and faster; how to ensure even job scheduling and fair usage of resources; how to handle errors without interrupting other jobs; and how to coordinate and optimize resources. These are all tough challenges for computer scientists. Many of the early solutions were handled at a hardware level, which significantly drove up cost. But then along came Hadoop. Hadoop is a radical redesign to meet the desire for supercomputing.
The early Hadoop developers stated their design principles as: dumb hardware and smart software; share nothing; move processing, not data; embrace failure; build applications, not infrastructure.

The design principle of dumb hardware and smart software means using commodity hardware rather than purpose-built high-availability servers. The lower the cost and the simpler the hardware for the data node, the better for a Hadoop cluster. Share nothing is the design principle that no CPU, I/O, or data should be shared. A supercomputing platform is optimized when each data node is independent and able to manage its own work. Move processing, not data. This design principle is about moving the processing to the data using a parallel processing framework, rather than moving the data from disk to a central application to be processed. Embrace failure; this is a critical construct in the development of Hadoop technology. We do not design to prevent failure, we accept it as a part of normal operations. This design principle is to ensure failure does not impact performance or result in data loss. Build applications, not infrastructure. This design principle is about creating a software framework for our supercomputer rather than designing specialized, expensive hardware. It ensures we can take full advantage of the flexibility of Hadoop, as we are working at a software level, not a hardware level.

Functional View of Hadoop

Now it is time to gain an understanding of Hadoop itself. Personally, I always find myself marveling at the elegance and simplicity of Hadoop. As you gain an understanding of the solution, you too will pass through one of those moments that make you say, now why didn't I think of that? Hadoop is made universal and accessible because it is built on technology that is readily available and inexpensive. The software can be downloaded from Apache for free. Hadoop comes with high availability and fault tolerance purposely designed into the architecture. These qualities are designed in from an operational perspective, not added on to prevent failure. It really does function in environments with a large number of hardware failures, and it really does continue to run jobs until they are complete. Hadoop scales. Hadoop scales horizontally. Hadoop scales massively. We frequently see 10, 20, 30, 40, or more racks filled with data nodes for a single supercomputing platform. And I think one of the most important attributes is that Hadoop is simple for end users. The intelligence to run the distributed file system and the parallel processing framework is in the software, and the software abstracts away the complexities of distributed computing. Now, while writing a MapReduce job is nontrivial, it is actually easy for the programmer to write massively parallel jobs across a massively distributed file system. The programmer does not have to determine data location or size the number of parallel compute jobs. Hadoop does it for you.

Let us take a look at a functional view of Hadoop. There are two parts of Hadoop. First there is the Hadoop Distributed File System, which we will now call HDFS. Second there is YARN. YARN stands for Yet Another Resource Negotiator; fun name, right? Hadoop covers two layers in our Big Data stack. HDFS is the storage layer and is responsible for creating a distributed repository, while YARN is the data refinery layer and is the processing level for scheduling parallel compute jobs.

Heading: Functional View of Hadoop.
A diagram depicts the Hadoop Distributed File System (HDFS) and Yet Another Resource Negotiator (YARN). In the diagram there is a master server and two data servers. The YARN master server is called the ResourceManager, and it has a mortgage ratings compute job which is connected to a NodeManager in one data server and an ApplicationMaster in another data server. The HDFS master server is connected to a DataNode in one data server and a second DataNode in the second data server.

HDFS consists of two daemons. The NameNode is the bookkeeper for HDFS. It runs on the master server. There is only one NameNode per cluster, and it is responsible for managing all the data on the cluster. The DataNode runs on all the data nodes. It is responsible for managing the local data; it does all the reads and writes of data. Now YARN consists of three daemons. The ResourceManager is responsible for managing and allocating all of the resources on the cluster. It resides on the master server. The NodeManager resides on every single compute node in the cluster. It manages and communicates the status of the onboard resources, such as memory, back to the ResourceManager. The ApplicationMaster is responsible for running the compute jobs. It requests resources and then ensures the successful completion of its assigned compute jobs.

HDFS in Action

Let me give a high-level overview of the Hadoop Distributed File System, HDFS. Let us start with the key learning elements. The first important thing to learn about HDFS is its purpose. HDFS abstracts away the complexity of distributing large data files across multiple data nodes. The second important thing to learn is that every data file has multiple copies. HDFS keeps track of all the file segments and all the copies of each file segment. This is the power of HDFS. HDFS can be imagined as an inode table of inode tables living on a master server. At a local system level, an inode table is in effect a mapping of the file name against the location of the data blocks for that file. In HDFS, the master server keeps a superior table that maps the file names against the data blocks, but here the listing contains the names and network addresses of all of the data nodes. It is rather a brilliant solution.

HDFS works best with a single large file, and we mean a large file. A data file of 60 terabytes is easily handled. Such a large file must be split up and the segments sent to different data nodes. We call these segments data blocks. This is a key term.

Heading: Key Attributes for HDFS. The default data block size for HDFS is between 64 MB and 128 MB. HDFS also replicates data blocks as they are moved to different locations. It has a default replication factor of 3.

In HDFS, data blocks have standard sizes of 64 or 128 megabytes. They can be made even larger by changing a configuration parameter. HDFS is basically a write-once file system. The normal sequence is for small data files to be created elsewhere and then the data merged together into an HDFS file. These HDFS files are then ready for analytics. HDFS is not a good system for storing data that needs to be constantly modified. Our architectural principle of embracing failure leads to one of the most important attributes of HDFS: data replication. The number of copies of data files is a configuration parameter, but it is normally left at the default value of three. HDFS does an excellent job of distributing the data across many nodes at load time. HDFS then does an analysis and makes multiple copies of each data block.
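To put rough numbers on that, here is a small sketch of how the 60 TB file mentioned above would break down into data blocks at a 128 MB block size with the default replication factor of 3. The constants are illustrative only, and the arithmetic uses binary units.

```python
import math

BLOCK_SIZE_MB = 128                      # assumed block size from the text
REPLICATION = 3                          # default replication factor

file_size_mb = 60 * 1024 * 1024          # a 60 TB file expressed in MB (binary units)
blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
physical_copies = blocks * REPLICATION

print(blocks)            # 491,520 data blocks for the NameNode to track
print(physical_copies)   # 1,474,560 block replicas spread across the cluster
```

Every one of those roughly half a million blocks is an entry the NameNode must track, which is one reason HDFS favors a small number of very large files over a large number of small ones.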
HDFS is very astute at alerting to the loss of access to a data block, and when this happens HDFS begins replicating another copy of that data block so the number of copies always stays at the replication level. This is high availability and fault tolerance by operational design.

Let us walk through a simple example of how HDFS works. Let me again introduce the two key players, the NameNode and the DataNode. The NameNode is the master of HDFS. It is a daemon that resides on the master server. It keeps all of the file system tables in memory to ensure high performance. The DataNode is the data worker. It is a daemon that resides on each data node. It is responsible for managing the data blocks on that data node, and it keeps the NameNode apprised of status through a heartbeat message.

Here we can see the NameNode has a table which lists two file names. There is a file called creditcards and a file called mortgages. These very large files are broken up into data blocks. Each data block is assigned a number. The NameNode assigns each data block to a DataNode within the cluster.

Heading: HDFS Architecture. A diagram depicts a typical HDFS architecture with a NameNode and five DataNodes. Each of the DataNodes has eight blocks in it. Three of the blocks in each DataNode have a number in them, and the rest are empty. The NameNode contains the text "/file/creditcards (data blocks: 1, 2) /file/mortgages (data blocks: 3, 4, 5)". The first DataNode contains the numbers five, three, and two; the second DataNode contains the numbers three, one, and four; the third DataNode contains the numbers three, five, and two; the fourth DataNode contains the numbers one, four, and two; the fifth DataNode contains the numbers one, five, and four.

The NameNode table contains the network address and local block number for that data. Then, using the replication factor, the NameNode assigns copies of each data block to other DataNodes. As you can see in this example, each data block has three copies. It is the responsibility of the NameNode to ensure there are always three copies of each data block. If a DataNode stops transmitting a heartbeat, the NameNode will wait a configurable amount of time and then make new copies of the data blocks that node held. There is a lot of intelligence and a lot of configuration in how HDFS works, but this overview gives you the primary principles. This is a fascinating subject and is really worthy of further study.
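The mapping in the diagram can be pictured as two small lookup tables. The sketch below is a deliberately simplified, in-memory stand-in for those NameNode tables, using the exact block placements from the diagram; real HDFS metadata is far richer, and the names here are purely illustrative.

```python
# Toy stand-in for the NameNode tables from the diagram above.
files = {
    "/file/creditcards": [1, 2],
    "/file/mortgages": [3, 4, 5],
}

block_locations = {        # block id -> DataNodes holding a copy
    1: {"dn2", "dn4", "dn5"},
    2: {"dn1", "dn3", "dn4"},
    3: {"dn1", "dn2", "dn3"},
    4: {"dn2", "dn4", "dn5"},
    5: {"dn1", "dn3", "dn5"},
}

REPLICATION = 3

def under_replicated(lost_node):
    """Blocks that would drop below the replication factor if lost_node died."""
    return [block for block, nodes in block_locations.items()
            if len(nodes - {lost_node}) < REPLICATION]

print(under_replicated("dn1"))   # [2, 3, 5]: these blocks need a new copy somewhere
```

Losing DataNode 1, for example, would leave blocks 2, 3, and 5 with only two copies each, and the real NameNode would react by scheduling new replicas on the surviving nodes.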
YARN in Action

The following are my key learning points for YARN. YARN is the parallel processing framework. It can process large amounts of data over multiple compute nodes. Writing compute jobs to run on YARN is not trivial, but once that is done, YARN makes it very easy to scale. Move code to data; need I say more? Colocating the computation with the data node improves performance. We advance this idea by adding the concept of containers with Hadoop version 2.0. Containers are an abstract notion to support multi-tenancy. A container is a way to define a requirement for memory, CPU, disk, and network. By dividing the data node up into containers, the data node can now simultaneously host multiple compute jobs.

What really makes YARN work is the design principle of share nothing. It is very intentionally designed so that no compute job has any dependency on any other compute job. Each compute job runs in its own container and does not share the assigned resources. Each compute job is only responsible for its assigned work.

Heading: Key Attributes of YARN.

Another key attribute of YARN is embrace failure. YARN is highly fault tolerant. In many ways, it epitomizes the design principle of embracing failure. The fact that most compute jobs run in a share-nothing node means a compute job can fail and be restarted without any impact on the final output. YARN allows for the graceful rescheduling of failed compute jobs.

Let us begin our walkthrough with a bit more detailed introduction of the players. The client is responsible for submitting the job, beginning by requesting an application ID from the ResourceManager. The client then must provide the location of the jar files, the command-line arguments, and any environment settings. The client can monitor job status by querying either the ResourceManager or the ApplicationMaster.

The ResourceManager is first responsible for maintaining knowledge of the resource status of the entire cluster. It needs to keep a picture of the available memory, CPU, disk, and network. Currently only memory is supported. It does this through ongoing communication with the NodeManagers.

Heading: YARN Architecture. A diagram depicts a typical YARN architecture. The diagram uses the flow of data in a bank as an example. The diagram contains a client which requests scores for mortgages. The ResourceManager, which is connected to the client, manages cluster resources and assigns containers. The ResourceManager is connected to the ApplicationMaster, which manages compute jobs, and to the container, which completes assigned work. The ApplicationMaster moves code to data between itself and the container. The ApplicationMaster and the container are each connected to a different NodeManager.

Second, the ResourceManager is responsible for scheduling these resources by allocating containers. It does this according to input requests from the client, cluster capacity, work queues, and the overall prioritization of resources on the cluster. The ResourceManager uses an algorithm for allocating the containers. The general principle is to start a container on the same node as the data required by the compute job. Third, the ResourceManager is responsible for managing applications by accepting job submissions from the client and by maintaining the status of the ApplicationMasters. It will launch, and if required relaunch, the ApplicationMaster.

The ApplicationMaster is the actual owner of the job. It is launched by the ResourceManager as a result of a request for a compute job from a client. The ApplicationMaster communicates directly with the client. An ApplicationMaster is launched for every compute job and keeps running until the job is complete. The ApplicationMaster determines resource requirements and then negotiates for those resources with the ResourceManager. The ResourceManager interacts with the NameNode to determine data location and assigns containers as required. The ApplicationMaster is responsible for moving code to the appropriate container. It is also responsible for job completion; it must restart failed compute jobs to ensure the compute job completes. The ApplicationMaster emits a heartbeat to the ResourceManager to keep it informed of job status.

The NodeManager is responsible for keeping an up-to-date status of its resources and reporting these resources to the ResourceManager. Every node within the cluster must have a running NodeManager.
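The scheduling principle just described, start the container where the data already lives, can be sketched in a few lines. This is a deliberate simplification with made-up names; real YARN schedulers, such as the capacity and fair schedulers, also weigh queues, rack awareness, and priorities.

```python
# Simplified sketch of data-local container placement.
def place_container(block_replicas, free_memory_mb, needed_mb):
    """Return the node that should host the container, or None if none fits."""
    # First choice: a node that already holds a replica and has enough memory.
    for node in block_replicas:
        if free_memory_mb.get(node, 0) >= needed_mb:
            return node
    # Fallback: any node with capacity (the data will then travel over the network).
    for node, free in free_memory_mb.items():
        if free >= needed_mb:
            return node
    return None

replicas = ["dn1", "dn3", "dn5"]                      # nodes holding the needed block
memory = {"dn1": 512, "dn2": 4096, "dn3": 2048, "dn5": 1024}
print(place_container(replicas, memory, 2048))        # dn3: local to the data
```

In this toy example, dn1 holds the needed block but lacks memory, so the next data-local node, dn3, wins; only if no replica holder had capacity would the data have to move across the network.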
In this example, the client submits a job to request scoring on mortgages. The job is submitted to the ResourceManager. The ResourceManager accepts the job and opens a container for the ApplicationMaster on one of the nodes of the cluster. The ApplicationMaster receives the job and then determines its need for more containers. It makes this request to the ResourceManager, which assigns containers based on data location and resource availability. The ResourceManager uses intelligence to determine which of the replicated data blocks is best suited for launching a container. The ApplicationMaster then moves the code to the various containers and begins the compute job. The ApplicationMaster will monitor the containers assigned to the compute job and react to ensure job completion. This is a complex framework, but it ensures compute jobs can be massively parallelized across the entire Hadoop cluster.

MapReduce in Action

Before I begin the discussion about the key attributes of MapReduce, I want to provide a historical comment. There was no YARN prior to Hadoop version 2.0. In the early releases of Hadoop, the functions of YARN and MapReduce were combined into MapReduce. The functions of resource scheduling and job management were all part of MapReduce. Please do not be confused if you are dealing with a Hadoop release prior to version 2.0.

MapReduce is a heavyweight for batch processing compute jobs. As we discussed in another video, MapReduce was designed to use key-value pairs. There are two primary parts of MapReduce. The mapper is meant to filter and transform data. It completes the initial data crunch to produce intermediate output files. These output files are then shuffled across the network to the reducer for the reduce phase. The reducer aggregates data. It receives intermediate data and consolidates it into the final output.

Heading: Key Attributes for MapReduce. The two primary parts of MapReduce are the Mapper and the Reducer.

Let us walk through a MapReduce process. The inputs to MapReduce are the data blocks of a large HDFS file. Frequently, part of the compute job is to preprocess this data by cleaning out dirty data and/or duplicates. This ensures the following phases in MapReduce run cleanly and at a higher performance level. During the Map phase, a mapper is started within a container on each data node where data must be processed. Each mapper will read its assigned data blocks and then process them. For example, the mapper could read in each line and count each word of each line. It will output its data into a file consisting of each word and a count of one for each occurrence. A second step may be run to further sort each file down to a list of unique words and the total count for each word. This will be repeated for each mapper until all the map jobs have produced their local files. Then comes the Shuffle phase. Using the network, the files are shuffled around the cluster. If done correctly, this should be the first data transferred over the network.
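Here is a minimal sketch of that word-count job in Python, written in the style of Hadoop Streaming, where the mapper and reducer read standard input, write tab-separated key-value pairs, and rely on the framework to sort the mapper output by key before the reduce phase described next. The file name and single-script layout are just for illustration.

```python
import sys

def mapper(lines):
    # Map phase: emit each word with a count of one.
    for line in lines:
        for word in line.split():
            print(f"{word}\t1")

def reducer(lines):
    # Reduce phase: input arrives sorted by word, so counts can be summed
    # one key at a time.
    current, total = None, 0
    for line in lines:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    # Run as either stage: python wordcount.py map | reduce
    (mapper if sys.argv[1] == "map" else reducer)(sys.stdin)
```

Locally, the whole pipeline can be mimicked with: cat input.txt | python wordcount.py map | sort | python wordcount.py reduce.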
Heading: MapReduce Architecture. A flow diagram illustrates how data is passed and processed through the MapReduce system. The diagram is divided into Map input, Map phase, Shuffle phase, Reduce phase, and Final output sections. There are two rectangular blocks, named Data node 1 and Data node 2, which have their own separate content and lead on to a single rectangular block called Data node 3, to which they are both connected. In the Map input section of the diagram, Data node 1 contains Block 0 and Block 1, and Data node 2 contains Block 2 and Block 3. In the Map phase section of the diagram, Block 0 and Block 1 in Data node 1 are connected to a Mapper block, and Block 2 and Block 3 in Data node 2 are connected to another Mapper block. In the Shuffle phase section of the diagram, the Mapper block in Data node 1 is connected to a block named Data A and the Mapper block in Data node 2 is connected to a block named Data B. In this section of the diagram, Data node 1 and Data node 2 are connected to Data node 3, where Data A and Data B appear. In the Reduce phase section of the diagram, Data A and Data B are connected to a block called Reducer. In the Final output section, the Reducer block is connected to a block named Partnnnn-xx.

Then comes the Reduce phase. If the files are not already sorted by key, each file is first sorted. Then all the files are combined into a larger file and another sort occurs on the keys. The job of a reducer is to reduce the file down to a single answer. And the last phase is the creation of the output file. This file is assembled by the reducer into as many files as required, and then they are all stored on HDFS. I should mention, we have a unique output file naming strategy of Partnnnn-xx. This is a very high-level introduction to MapReduce. It becomes extremely complex, and it takes a combination of talent and experience to write good MapReduce jobs. If you are interested in learning more about MapReduce, I recommend you start by reading the Apache web site for Hadoop.

Spark in Action

Spark is an open source computing framework that can run against data on disk, or it can load data into the cluster's memory. Spark became a top-level Apache project in February 2014. It appears to be quickly rising to the top of a lot of people's lists as their favorite software to run in Big Data. Spark has an extremely strong community, many of whom advocate it as a replacement for MapReduce. This is because Spark is fast, some say blindingly fast. A company spun out of UC Berkeley, called Databricks, is the originator of Spark, and it provides commercial support.

Spark is robust and versatile. It has successfully combined a number of different functions into a single software solution. Key attributes include that Spark applications can be written in Java, Scala, and Python. This makes it easy for programmers to write in a language they already know. There is also an interactive Scala and Python shell. Spark is built to run on top of HDFS and can use YARN to run alongside MapReduce jobs. It can read any existing Hadoop data file. It also reads from HBase, Cassandra, and many other data sources. Spark is scalable to 2,000 nodes, and it will continue to expand its ability to scale compute jobs.

Heading: Introducing Spark. Spark is built on HDFS and is able to use YARN.

Heading: Key Attributes for Spark. Spark has high-level libraries for streaming, machine learning, graph processing, and R statistical programming.

One of the design principles of Spark was to combine SQL, streaming, and complex analytics. The software provides high-level libraries, so programmers can quickly write jobs for streaming, machine learning, graph processing, and the R statistical programming language.
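As a taste of how compact those APIs are, here is the same word count from the MapReduce walkthrough written against the Spark Python API, PySpark. The HDFS paths and the application name are placeholders; this sketch assumes a cluster where Spark can read its input from HDFS.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()
sc = spark.sparkContext

counts = (sc.textFile("hdfs:///data/sample_input.txt")  # read blocks from HDFS
            .flatMap(lambda line: line.split())          # map: one record per word
            .map(lambda word: (word, 1))                 # key-value pairs
            .reduceByKey(lambda a, b: a + b))            # reduce: sum the counts

counts.saveAsTextFile("hdfs:///output/wordcount")        # write results back to HDFS
spark.stop()
```

The same chain of steps that took a mapper, a shuffle, and a reducer in MapReduce is expressed here as a handful of transformations, and Spark decides how to parallelize them across the cluster.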
Spark is fast. Let me make two quick points about its speed and why it is leading to Spark's increasing popularity. Spark runs against both disk and memory, and both modes have been demonstrated to be faster than other existing software. For example, in machine learning, Spark has been clocked running a disk-based compute job ten times faster than Apache Mahout, and on a large-scale statistical analysis it has been benchmarked running a hundred times faster in memory than the same job running in MapReduce. At this point the market has not determined the outcome between Spark and MapReduce. Both are strong tools, and it is possible both will be in use for years to come.