BIG DATA Syllabus
Unit-I: Introduction to Big Data
Unit-II: Hadoop Frameworks and HDFS
Unit-III: MapReduce
Unit-IV: Hive and Pig
Unit-V: ZOOKEEPER, Sqoop and CASE STUDY

1. MapReduce: MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte datasets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. MapReduce is a programming model for expressing distributed computations on massive amounts of data and an execution framework for large-scale data processing on clusters of commodity servers. In short, MapReduce is a programming model for data processing. Its characteristics are batch processing, no limits on passes over the data or time, and no memory constraints.

Fig: MapReduce Logical Data Flow

History of MapReduce: MapReduce was developed by researchers at Google around 2003, built on principles of parallel and distributed processing. MapReduce provides a clear separation between what to compute and how to compute it on a cluster. Hadoop was created by Doug Cutting as a solution to Nutch's scaling problems, inspired by Google's GFS and MapReduce papers. In 2004, the Nutch Distributed Filesystem (NDFS) was written, based on GFS. In 2005, all the important parts of Nutch were ported to MapReduce and NDFS. In 2006, the code was moved into an independent subproject of Lucene called Hadoop, and in early 2006 Doug Cutting joined Yahoo!, which contributed resources and manpower. In 2008, Hadoop became a top-level project at Apache.

Fig: Example of the Overall MapReduce Word Count Process

This topic consists of:
1. Analyzing the Data with UNIX Tools
2. Analyzing the Data with Hadoop
3. Scaling Out
4. Hadoop Streaming
5. Hadoop Pipes

1) Analyzing the Data with UNIX Tools: Tools used alongside UNIX utilities for analyzing data include Hadoop, Cloudera, Datameer, Splunk, Mahout, Hive, HBase, LucidWorks, R, MapR, Ubuntu and other Linux flavors.
Ex: A program for finding the maximum recorded temperature by year from NCDC weather records.
Program:
#!/usr/bin/env bash
for year in all/*
do
  echo -ne `basename $year .gz`"\t"
  gunzip -c $year | \
    awk '{ temp = substr($0, 88, 5) + 0;
           q = substr($0, 93, 1);
           if (temp != 9999 && q ~ /[01459]/ && temp > max) max = temp }
         END { print max }'
done
The script loops through the compressed year files, first printing the year and then processing each file using awk. The awk script extracts two fields from the data: the air temperature and the quality code. The END block is executed after all the lines in the file have been processed, and it prints the maximum value.

2) Analyzing the Data with Hadoop: Analyzing the data with Hadoop mainly involves MapReduce and HDFS. To take advantage of the parallel processing that Hadoop provides, we need to express our query as a MapReduce job. MapReduce works by breaking the processing into two phases, the map phase and the reduce phase; each phase has key-value pairs as input and output. The map() method is passed a key and a value, and also provides an instance of Context to write the output.

Fig: MapReduce Logical Data Flow

3) Scaling Out: Scaling out means the system keeps working properly as its size is increased or decreased; a scale-out architecture adds servers to increase processing power. MapReduce is a programming model for data processing, and it is simple to express useful programs in it. Hadoop can run MapReduce programs written in various languages, such as Java, Ruby, Python, and C++.
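For example, the maximum-temperature query above can be expressed as a Java map() function. The following is only a minimal sketch, not the complete program; the class name and the record offsets (which mirror the awk script) are illustrative:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final int MISSING = 9999;

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    String year = line.substring(15, 19);                 // year field of the record (illustrative offset)
    int airTemperature = Integer.parseInt(line.substring(87, 92).trim());
    String quality = line.substring(92, 93);
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      // Emit (year, temperature); the reduce phase then picks the maximum per year.
      context.write(new Text(year), new IntWritable(airTemperature));
    }
  }
}

A corresponding reduce() function would simply take each (year, list of temperatures) pair and emit the maximum temperature for that year.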
A MapReduce job is a unit of work that the client wants to be performed; it consists of the input data, the MapReduce program, and configuration information.

4) Hadoop Streaming: In general, streaming means a flow of data such as video, images, signals, and audio. Hadoop Streaming uses Unix standard streams as the interface between Hadoop and your program, so you can use any language that can read standard input and write to standard output to write your MapReduce program. Hadoop Streaming is supported by many languages, such as Pig, Python, and Ruby, each of which can read from standard input and write to standard output.

5) Hadoop Pipes: Hadoop Pipes is the name of the C++ interface to Hadoop MapReduce. Unlike Streaming, which uses standard input and output to communicate with the map and reduce code, Hadoop Pipes uses sockets as the channel over which the tasktracker communicates with the process running the C++ map or reduce function. The main() method is the application entry point; it calls HadoopPipes::runTask, which connects to the Java parent process and marshals data to and from the Mapper or Reducer. The runTask() method is passed a Factory so that it can create instances of the Mapper or Reducer.

2) MapReduce Features: MapReduce features describe execution and lower-level details; simply knowing the APIs and their usage is sufficient to write basic applications. The features of MapReduce include counters, sorting, and joining of datasets. By default, MapReduce will sort input records by their keys. MapReduce is the heart of Hadoop. It is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. A MapReduce program is composed of a Map() procedure that performs filtering and sorting (such as sorting students by first name into queues, one queue for each name) and a Reduce() procedure that performs a summary operation (such as counting the number of students in each queue, yielding name frequencies); a minimal sketch of such a reducer is given at the end of this overview. The map job takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs), and the reduce job then combines those tuples into a smaller set of summary results. MapReduce is a massively scalable, parallel processing framework that works in tandem with HDFS. With MapReduce and Hadoop, compute is executed at the location of the data rather than moving data to the compute location; data storage and computation coexist on the same physical nodes in the cluster. By taking advantage of this data proximity, MapReduce processes exceedingly large amounts of data without being affected by traditional bottlenecks like network bandwidth. MapReduce divides workloads up into multiple tasks that can be executed in parallel. This topic consists of:
1. Features of MapReduce
2. Counters
3. Sorting
4. Joins
5. Side Data Distribution
6. MapReduce Library Classes

1. Features of MapReduce: MapReduce is a software framework for easily writing applications which process vast amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner. The features of MapReduce include counters, sorting, and joining of datasets.
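Before looking at each feature in detail, here is the reducer sketch referenced above: a minimal, hypothetical Reduce() procedure in Java that sums the counts emitted for each key (for example, word or name frequencies). The class name is illustrative:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable value : values) {
      sum += value.get();                         // add up all counts emitted for this key
    }
    context.write(key, new IntWritable(sum));     // e.g. ("hadoop", 42)
  }
}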
These features are:
Scale-out Architecture: add servers to increase processing power.
Security & Authentication: works with HDFS and HBase security to make sure that only approved users can operate against the data in the system.
Resource Manager: employs data locality and server resources to determine optimal computing operations.
Optimized Scheduling: completes jobs according to prioritization.
Flexibility: procedures can be written in virtually any programming language.
Resiliency & High Availability: multiple job and task trackers ensure that jobs fail independently and restart automatically.

Fig: MapReduce Logical Data Flow

2. Counters: The MapReduce framework provides counters as an efficient mechanism for tracking the occurrences of global events within the map and reduce phases of jobs. Counters are a useful channel for gathering statistics about the job, whether for quality control or for application-level statistics. They are also useful for problem diagnosis. Hadoop maintains built-in counters for every job, which report various metrics for the job; for example, there are counters for the number of input files and records processed.
Ex: A typical MapReduce job will kick off several mapper instances, one for each block of the input data, all running the same code. These instances are part of the same job, but run independently of one another.
Hadoop MapReduce counters are divided into two groups: 1) Task Counters 2) Job Counters.

Fig: There are several groups of built-in counters:
Group: MapReduce Task Counters; Name/Enum: org.apache.hadoop.mapred.Task$Counter (0.20), org.apache.hadoop.mapreduce.TaskCounter (post 0.20)
Group: File System Counters; Name/Enum: FileSystemCounters (0.20), org.apache.hadoop.mapreduce.FileSystemCounter (post 0.20)
Group: File Input-Format Counters; Name/Enum: org.apache.hadoop.mapred.FileInputFormat$Counter (0.20), org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter (post 0.20)
Group: File Output-Format Counters; Name/Enum: org.apache.hadoop.mapred.FileOutputFormat$Counter (0.20), org.apache.hadoop.mapreduce.lib.output.FileOutputFormatCounter (post 0.20)
Group: Job Counters; Name/Enum: org.apache.hadoop.mapred.JobInProgress$Counter (0.20), org.apache.hadoop.mapreduce.JobCounter (post 0.20)

i) Task Counters: Task counters gather information about tasks over the course of their execution, and the results are aggregated over all the tasks in a job. Task counters are maintained by each task attempt and periodically sent to the tasktracker and then to the jobtracker. Counter values are definitive only once a job has successfully completed.
Ex: The MAP_INPUT_RECORDS counter counts the input records read by each map task and aggregates over all map tasks in a job, so that the final figure is the total number of input records for the whole job.

ii) Job Counters: Job counters are maintained by the jobtracker and measure job-level statistics.
Ex: TOTAL_LAUNCHED_MAPS counts the number of map tasks that were launched over the course of a job.

User-Defined Java Counters: MapReduce allows user-defined Java counters, declared with the Java "enum" keyword. A job may define an arbitrary number of enums, each with an arbitrary number of fields. The name of the enum is the group name, and the enum's fields are the counter names.
Ex:
public class MaxTemperatureWithCounters extends Configured implements Tool {
  enum Temperature { MISSING, MALFORMED }
}
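A hedged sketch of how such an enum counter might be incremented from inside a mapper; the class name, field offsets, and output types are illustrative, and only the counter calls matter here:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TemperatureQualityMapper
    extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

  // The enum name ("Temperature") becomes the counter group; its fields are the counter names.
  enum Temperature { MISSING, MALFORMED }

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String temp = value.toString().substring(87, 92).trim();   // illustrative temperature field
    if (temp.equals("+9999") || temp.equals("9999")) {
      // Each increment is aggregated by the framework across all map tasks in the job.
      context.getCounter(Temperature.MISSING).increment(1);
    } else if (!temp.matches("[-+]?\\d+")) {
      context.getCounter(Temperature.MALFORMED).increment(1);
    }
  }
}

After the job completes, the totals for these user-defined counters appear in the job's counter report alongside the built-in counters.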
3. SORTING: Sorting means arranging elements in sequential (or some other specified) order.
- By default, MapReduce will sort input records by their keys.
- A sort job with 30 reducers produces 30 output files, each of which is sorted; however, there is no easy way to combine the files (a partial sort).
- To produce a set of sorted files that, if concatenated, would form a globally sorted file, use a partitioner that respects the total order of the output. Ex: a range partitioner.
- Although this approach works, you have to choose your partition sizes carefully to ensure that they are fairly even, so that job times are not dominated by a single reducer. Ex: bad partitioning.
- To construct more even partitions, we need a better understanding of the distribution of the whole dataset.

4. JOINS: Joins are one of the interesting features available in MapReduce. A join is an operation that combines records from two or more data sets based on a field or a set of fields, known as the foreign key. The foreign key is the field in a relational table that matches a column of another table. Frameworks like Pig, Hive, or Cascading have support for performing joins. Joins performed by the mapper are called map-side joins; joins performed by the reducer are called reduce-side joins. This consists of:
i. Map-Side Joins
ii. Reduce-Side Joins

i) Map-Side Joins: A map-side join between large inputs works by performing the join before the data reaches the map function. The inputs to each map must be partitioned and sorted in a specific way: each input dataset must be divided into the same number of partitions, and it must be sorted by the same key (the join key) in each source. All the records for a particular key must reside in the same partition; this is mandatory. We can achieve the following kinds of joins using map-side techniques:
1) Inner Join
2) Outer Join
3) Override: a MultiFilter for a given key, preferring values from the right-most source.
Use a CompositeInputFormat from the org.apache.hadoop.mapred.join package to run a map-side join.

Fig: Map-side join: Dataset 1 and Dataset 2 each go through a MapReduce job for sorting, and their outputs are then joined in the map.

ii) Reduce-Side Joins: Reduce-side joins are simpler than map-side joins since the input datasets do not have to be structured in any particular way, but they are less efficient because both datasets have to go through the MapReduce shuffle. The idea: the mapper tags each record with its source and uses the join key as the map output key, so that records with the same key are brought together in the reducer.
Multiple inputs: the input sources for the datasets have different formats, so use the MultipleInputs class to separate the logic for parsing and tagging each source.
Secondary sort: to perform the join, it is important to have the data from one source arrive before the other.
Example: the code assumes that every station ID in the weather records has exactly one matching record in the station dataset.
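A hedged sketch of a reduce-side join wired up with MultipleInputs. This is not the full weather/stations program; the class names, the tab-separated record layout, and the single-character source tags are assumptions, and instead of a secondary sort the reducer simply buffers the weather records for each key:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ReduceSideJoin {

  // Tags station records with "S" and keys them by station ID (assumed first tab-separated field).
  public static class StationMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split("\t", 2);
      if (fields.length < 2) return;
      ctx.write(new Text(fields[0]), new Text("S" + fields[1]));
    }
  }

  // Tags weather records with "W" and keys them by station ID (assumed first tab-separated field).
  public static class WeatherMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split("\t", 2);
      if (fields.length < 2) return;
      ctx.write(new Text(fields[0]), new Text("W" + fields[1]));
    }
  }

  // Records with the same station ID meet here; the tag says which source each record came from.
  public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      String stationName = null;
      List<String> weather = new ArrayList<String>();
      for (Text v : values) {
        String s = v.toString();
        if (s.startsWith("S")) stationName = s.substring(1);
        else weather.add(s.substring(1));
      }
      for (String w : weather) {
        ctx.write(key, new Text(stationName + "\t" + w));
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();
    job.setJarByClass(ReduceSideJoin.class);
    // Each input source gets its own mapper via MultipleInputs.
    MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, StationMapper.class);
    MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, WeatherMapper.class);
    FileOutputFormat.setOutputPath(job, new Path(args[2]));
    job.setReducerClass(JoinReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Because both datasets pass through the shuffle, this is less efficient than a map-side join, but it places no requirements on how the inputs are partitioned or sorted.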
5. SIDE DATA DISTRIBUTION: Side data can be defined as extra read-only data needed by a job to process the main dataset; it is the extra static, small data required by MapReduce to perform its job. The challenge is making the side data available on the node where the map will be executed, in a convenient and efficient fashion. Hadoop provides two side data distribution techniques: (a) using the job configuration and (b) the distributed cache.

(a) Using the Job Configuration:
1. Arbitrary key-value pairs can be set in the job configuration using the various setter methods on Configuration.
2. This is a useful technique for small amounts of data; the suggested size of data to keep in the configuration object is on the order of kilobytes, because the configuration object is read by the jobtracker, the tasktracker, and each new child JVM.
3. In the task, you can retrieve the data from the configuration returned by Context's getConfiguration() method.
4. Apart from this, the side data requires serialization if it has a non-primitive encoding.
5. DefaultStringifier uses Hadoop's serialization framework to serialize objects.

(b) Distributed Cache:
1. Rather than serializing side data in the job configuration, it is preferable to distribute datasets using Hadoop's distributed cache mechanism.
2. This provides a service for copying files and archives to the task nodes in time for the tasks to use them when they run.
3. To save network bandwidth, files are normally copied to any particular node only once per job.
4. Side data can be shared using Hadoop's distributed cache mechanism.
5. We can copy files and archives to the task nodes when the tasks need to run; this is usually preferable to using the job configuration.
6. If both datasets are too large, we cannot copy either of them to each node in the cluster as we do in side data distribution.
7. In that case we can still join the records using MapReduce with a map-side or reduce-side join.

6. MapReduce Library Classes: Hadoop comes with a library of mappers and reducers for commonly used functions. They are listed with brief descriptions in the table below; for further information on how to use them, consult their Java documentation. The major classes and topics in the MapReduce library are: Javadocs; the Input class (writing your own Input class); the mapping classes; the Reducer class; the Output class (writing your own Output class); the Marshaller class; the Counter class; and size limits.

CLASSES and DESCRIPTIONS:
ChainMapper, ChainReducer: Run a chain of mappers in a single mapper, and a reducer followed by a chain of mappers in a single reducer.
FieldSelectionReducer (new API): A mapper and a reducer that can select fields (like the Unix cut command) from the input keys and values and emit them as output keys and values.
IntSumReducer, LongSumReducer: Reducers that sum integer values to produce a total for every key.
InverseMapper: A mapper that swaps keys and values.
MultithreadedMapper (new API): A mapper (or map runner in the old API) that runs mappers concurrently in separate threads. Useful for mappers that are not CPU-bound.
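As a hedged illustration of using one of these library classes, the sketch below wires InverseMapper into a job so that (word, count) pairs are swapped into (count, word) and the framework's sort-by-key then orders them by count. The class name, the input paths, and the assumption that the input is a SequenceFile of (Text, IntWritable) pairs are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.InverseMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SwapCountsJob {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "swap word counts");
    job.setJarByClass(SwapCountsJob.class);
    // Input is assumed to be (Text word, IntWritable count) pairs in a SequenceFile.
    job.setInputFormatClass(SequenceFileInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // InverseMapper emits (count, word); the default identity reducer writes them out sorted by count.
    job.setMapperClass(InverseMapper.class);
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputValueClass(Text.class);
    job.setNumReduceTasks(1);   // a single reducer gives one globally sorted output file
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}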