
BDT - HDFS - MapReduce

BIG DATA
Syllabus
Unit-I : Introduction to Big Data
Unit-II : Hadoop Frameworks and HDFS
Unit-III : MapReduce
Unit-IV : Hive and Pig
Unit-V : ZooKeeper, Sqoop and Case Study
1. MapReduce: MapReduce is a software framework for easily
writing applications that process vast amounts of data
(multi-terabyte datasets) in parallel on large clusters
(thousands of nodes) of commodity hardware in a reliable,
fault-tolerant manner.
 MapReduce is a programming model for expressing
distributed computations on massive amounts of data and
an execution framework for large-scale data processing on
clusters of commodity servers.
 MapReduce is Programming Model for Data Processing.
 MapReduce characteristics: batch processing, no limits on the
number of passes over the data or on processing time, and no
memory constraints.
Fig: MapReduce Logical Data flow
 History of MapReduce: Developed by researchers at Google around
2003, built on principles of parallel and distributed processing.
 MapReduce Provides a clear separation between what to compute
and how to compute it on a cluster.
 Hadoop was created by Doug Cutting as a solution to Nutch's scaling
problems, inspired by Google's GFS and MapReduce papers.
 In 2004, the Nutch Distributed Filesystem (NDFS) was written, based on GFS.
 In 2005, all the important parts of Nutch were ported to MapReduce and
NDFS.
 In 2006, the code was moved into an independent subproject of Lucene
called Hadoop.
 In early 2006, Doug Cutting joined Yahoo!, which contributed
resources and manpower.
 In 2008, Hadoop became a top-level project at Apache.
Fig: Example of Overall MapReduce word count Process
MapReduce: It consists of
1. Analyzing the Data with UNIX Tools
2. Analyzing the Data with Hadoop
3. Scaling Out
4. Hadoop Streaming
5. Hadoop Pipes.
1) Analyzing the Data with UNIX Tools: Before moving to Hadoop, the data
can be analyzed with standard UNIX command-line tools such as bash, awk and
gunzip. Related tools and platforms in the wider big-data ecosystem include
Hadoop, Cloudera, Datameer, Splunk, Mahout, Hive, HBase, LucidWorks, R and
MapR, typically run on Ubuntu and other Linux flavors.
 Ex: A program for finding the maximum recorded temperature by year from
weather records.
 Program:
#!/usr/bin/env bash
for year in all/*
do
  echo -ne `basename $year .gz`"\t"
  gunzip -c $year | \
    awk '{ temp = substr($0, 88, 5) + 0;
           q = substr($0, 93, 1);
           if (temp != 9999 && q ~ /[01459]/ && temp > max) max = temp }
         END { print max }'
done
 The script loops through the compressed year files, first printing the year, and then processing
each file using awk. The awk script extracts two fields from the data: the air temperature and
the quality code. The END block is executed after all the lines in the file have been processed,
and it prints the maximum value.
2) Analyzing the Data with Hadoop: Analyzing the data with
Hadoop mainly involves MapReduce and HDFS.
 To take advantage of the parallel processing that Hadoop provides,
we need to express our query as a MapReduce job.
 MapReduce works by breaking the processing into two phases:
the map phase and the reduce phase. Each phase has key-value pairs
as input and output. The map() method is passed a key and a value,
and is also given an instance of Context to which it writes its
output (see the sketch below the figure).
Fig: Map Reduce Logical Data flow
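A minimal sketch of a mapper and reducer for this job, in Java (the MaxTemperatureMapper/MaxTemperatureReducer class names are hypothetical, and the fixed-width field offsets are assumed to match the record layout used by the awk script above):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Emits (year, airTemperature) for each valid reading.
// (In practice, each public class would live in its own .java file.)
public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final int MISSING = 9999;

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    String year = line.substring(15, 19);                                  // assumed year offsets
    int airTemperature = Integer.parseInt(line.substring(87, 92).trim());  // columns 88-92
    String quality = line.substring(92, 93);                               // column 93
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      context.write(new Text(year), new IntWritable(airTemperature));
    }
  }
}

// Picks the maximum temperature seen for each year.
public class MaxTemperatureReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int maxValue = Integer.MIN_VALUE;
    for (IntWritable value : values) {
      maxValue = Math.max(maxValue, value.get());
    }
    context.write(key, new IntWritable(maxValue));
  }
}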
3) Scaling Out: Scaling out means that capacity can be increased
(or decreased) by adding or removing machines, with the system
continuing to work properly.
 Scale-out architecture means adding servers to increase
processing power.
 MapReduce is a programming model for data processing,
and it is simple to express useful programs in it.
 Hadoop can run MapReduce programs written in various
languages, such as Java, Ruby, Python, and C++.
 A MapReduce job is a unit of work that the client wants to
be performed and it consists of the input data, MapReduce
program, and configuration information.
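A minimal driver sketch showing how a job ties together the input data, the MapReduce program and configuration information (the MaxTemperatureDriver class name is hypothetical; it assumes the mapper and reducer classes sketched in the previous section):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperatureDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();
    job.setJarByClass(MaxTemperatureDriver.class);
    job.setJobName("Max temperature");

    FileInputFormat.addInputPath(job, new Path(args[0]));    // input data
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory

    job.setMapperClass(MaxTemperatureMapper.class);          // sketched earlier
    job.setReducerClass(MaxTemperatureReducer.class);        // sketched earlier
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);        // run and report status
  }
}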
4) Hadoop Streaming: Here, streaming refers to data flowing through
standard input and output streams, rather than to media streams such
as videos, images, signals and audio.
 Hadoop Streaming uses Unix standard streams as the interface
between Hadoop and your program, so you can use any language
that can read standard input and write to standard output to write
your MapReduce program.
 Hadoop Streaming is commonly used with languages such as Pig,
Python and Ruby.
 Pig can pass records through external programs or scripts that
read from standard input and write to standard output.
 Python scripts can read records from standard input and write
results to standard output, so they can act as Streaming mappers
and reducers.
 Ruby scripts can likewise read from standard input and write to
standard output, and so can also serve as Streaming mappers and
reducers.
5) Hadoop Pipes: Hadoop Pipes is the name of the C++
interface to Hadoop MapReduce, unlike Streaming, which
uses standard input and output to communicate with the
map and reduce code.
 Hadoop Pipes uses sockets as the channel over which the
tasktracker communicates with the process running the
C++ map or reduce function.
 The main() method is the application entry point; it calls
HadoopPipes::runTask, which connects to the Java
parent process and marshals data to and from the Mapper
or Reducer.
 The runTask() method is passed a Factory so that it can
create instances of the Mapper or Reducer.
2) MapReduce Features: MapReduce's features cover its execution
and lower-level details; for writing applications, simply knowing the
APIs and their usage is often sufficient. Features of MapReduce include
counters, sorting and joining datasets.
 By default MapReduce will sort input records by their keys.
 MapReduce is the heart of Hadoop. It is a programming model for
processing large data sets with a parallel, distributed algorithm on a
cluster.
 A MapReduce program is composed of a Map() procedure that
performs filtering and sorting (such as sorting students by first
name into queues, one queue for each name) and
a Reduce() procedure that performs a summary operation (such as
counting the number of students in each queue, yielding name
frequencies).
 The first is the map job, which takes a set of data and converts it
into another set of data, where individual elements are broken down
into tuples (key/value pairs).
 MapReduce is a massively scalable, parallel processing
framework that works in tandem with HDFS. With
MapReduce and Hadoop, compute is executed at the location
of the data, rather than moving data to the compute location;
data storage and computation coexist on the same physical
nodes in the cluster.
 MapReduce processes exceedingly large amounts of data
without being affected by traditional bottlenecks like network
bandwidth by taking advantage of this data proximity.
 MapReduce divides workloads up into multiple tasks that can
be executed in parallel.
 It consists of
1. Features of MapReduce
2. Counters
3. Sorting
4. Joins
5. Side Data Distribution
6. MapReduce Library Classes
1. Features of MapReduce: MapReduce is a software framework
for easily writing applications that process vast amounts of data
in parallel on large clusters of commodity hardware in a reliable,
fault-tolerant manner.
 Features of MapReduce includes counters, sorting and joining
datasets.
 It consists of
 Scale-out Architecture: Add servers to increase processing power
 Security & Authentication: Works with HDFS and HBase security to
make sure that only approved users can operate against the data in the
system
 Resource Manager: Employs data locality and server resources to
determine optimal computing operations
 Optimized Scheduling: Completes jobs according to prioritization
 Flexibility: Procedures can be written in virtually any programming
language
 Resiliency & High Availability: Multiple job and task trackers ensure
that jobs fail independently and restart automatically.
Fig: MapReduce Logical Data flow
2. Counters: The MapReduce framework provides Counters as an
efficient mechanism for tracking the occurrences of global events
within the map and reduce phases of jobs.
 Counters are a useful channel for gathering statistics about the
job, whether for quality control or for application-level statistics.
They are also useful for problem diagnosis.
 Hadoop maintains built-in counters for every job, which
report various metrics for your job; for example, there are
counters for the number of input files and records processed.
 Ex: A typical MapReduce job will kick off several mapper
instances, one for each block of the input data, all running the
same code. These instances are part of the same job, but run
independently of one another.
 Hadoop MapReduce Counters are divided into two groups:
1)Task Counters
2)Job Counters
Group                          Name/Enum
MapReduce Task Counters        org.apache.hadoop.mapred.Task$Counter (0.20);
                               org.apache.hadoop.mapreduce.TaskCounter (post 0.20)
File System Counters           FileSystemCounters (0.20);
                               org.apache.hadoop.mapreduce.FileSystemCounter (post 0.20)
File Input-Format Counters     org.apache.hadoop.mapred.FileInputFormat$Counter (0.20);
                               org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter (post 0.20)
File Output-Format Counters    org.apache.hadoop.mapred.FileOutputFormat$Counter (0.20);
                               org.apache.hadoop.mapreduce.lib.output.FileOutputFormatCounter (post 0.20)
Job Counters                   org.apache.hadoop.mapred.JobInProgress$Counter (0.20);
                               org.apache.hadoop.mapreduce.JobCounter (post 0.20)
Fig: There are several groups for the built-in Counters
i) Task Counters: Task counters gather information about tasks
over the course of their execution, and the results are aggregated
over all the tasks in a job.
 Task counters are maintained by each task attempt, and
periodically sent to the tasktracker and then to the jobtracker.
 Counter values are definitive only once a job has successfully
completed.
 Ex: The MAP_INPUT_RECORDS counter counts the input
records read by each map task and aggregates over all map tasks
in a job, so that the final figure is the total number of input
records for the whole job.
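A minimal sketch (the helper class and method are hypothetical) of how a driver can read this built-in counter once the job has completed, using Hadoop's TaskCounter enum:

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.TaskCounter;

public class CounterReport {
  // "job" is assumed to be a Job that has already completed successfully.
  public static long mapInputRecords(Job job) throws Exception {
    return job.getCounters()
              .findCounter(TaskCounter.MAP_INPUT_RECORDS)  // built-in task counter
              .getValue();
  }
}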
ii) Job Counters: Job counters are maintained by the jobtracker and
measure job-level statistics.
 User-Defined Java Counters: MapReduce allows user-defined
Java counters, declared with Java's enum keyword.
 A job may define an arbitrary number of enums, each with an
arbitrary number of fields.
 The name of the enum is the group name, and the enum's fields are
the counter names.
 Ex: TOTAL_LAUNCHED_MAPS counts the number of map tasks
that were launched over the course of a job.
 Ex:
public class MaxTemperatureWithCounters extends Configured implements Tool {
  enum Temperature {
    MISSING,
    MALFORMED
  }
}
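A brief sketch (assuming the Temperature enum above; parseTemperature(), MISSING, year and the surrounding map() method are hypothetical) of how a mapper increments such a counter through its Context:

// Inside a hypothetical map() method; "context" is the Mapper.Context.
int airTemperature = parseTemperature(value);             // assumed parsing helper
if (airTemperature == MISSING) {
  context.getCounter(Temperature.MISSING).increment(1);   // count bad records
} else {
  context.write(new Text(year), new IntWritable(airTemperature));
}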
3. SORTING
 Sorting means arranging elements in a defined order, for example
ascending order by key.
 By default, MapReduce will sort input records by their keys.
 A job with, say, 30 reducers produces 30 output files, each of which
is sorted.
 However, there is no easy way to combine the files (this is only a
partial sort).
 To produce a set of sorted files that, if concatenated, would form a
globally sorted file, use a partitioner that respects the total order of
the output.
 Ex: a range partitioner.
 Although this approach works, you have to choose your partition
sizes carefully to ensure that they are fairly even, so that job times
aren't dominated by a single reducer.
 Ex: bad partitioning.
 To construct more even partitions, we need a better understanding
of the distribution of the whole dataset.
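A hedged sketch of configuring a total sort with Hadoop's TotalOrderPartitioner and InputSampler library classes (the sampling parameters and the partition-file path are illustrative; Text keys are assumed):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

public class TotalSortSetup {
  // Assumes "job" already has its input/output paths and key/value types set.
  public static void configure(Job job) throws Exception {
    Configuration conf = job.getConfiguration();
    job.setPartitionerClass(TotalOrderPartitioner.class);
    // Where the sampled split points will be stored (illustrative path).
    TotalOrderPartitioner.setPartitionFile(conf, new Path("/tmp/_partitions"));
    // Sample the input to pick fairly even partition boundaries; the
    // (frequency, number of samples, max splits sampled) values are illustrative.
    InputSampler.Sampler<Text, Text> sampler =
        new InputSampler.RandomSampler<>(0.1, 10000, 10);
    InputSampler.writePartitionFile(job, sampler);
  }
}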
4. JOINS
 Joins are one of the interesting features available in
MapReduce.
 A join is an operation that combines records from two or
more data sets based on a field or a set of fields, known as
the foreign key.
 The foreign key is the field in a relational table that
matches a column of another table.
 Frameworks like Pig, Hive, or Cascading have support for
performing joins.
 Joins performed by the mapper are called map-side joins.
 Joins performed by the reducer are called reduce-side joins.
 It consists of
i. Map-Side Joins
ii. Reduce-Side Joins
i) Map-Side Joins: A map-side join between large inputs works
by performing the join before the data reaches the map
function (i.e., the join happens on the map side, before the data
is passed to map()).
 The inputs to each map must be partitioned and sorted in a
specific way.
 Each input dataset must be divided into the same number of
partitions, and it must be sorted by the same key (the join key)
in each source.
 All the records for a particular key must reside in the same
partition; this is mandatory.
 We can achieve the following kinds of joins using map-side
techniques:
1) Inner Join
2) Outer Join
3) Override - a multi-filter for a given key, preferring values from the
rightmost source.
 Use a CompositeInputFormat from the org.apache.hadoop.mapred.join
package to run a map-side join (see the sketch after the figure below).
Fig: MapReduce job for sorting: Dataset 1 and Dataset 2 each pass through map tasks into reduce tasks
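A hedged sketch (the helper class is hypothetical; the join type and input format are illustrative) of configuring a map-side join with CompositeInputFormat from the old mapred API that the bullet above refers to:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.join.CompositeInputFormat;

public class MapSideJoinSetup {
  // Both inputs must already be partitioned and sorted identically on the join key.
  public static void configure(JobConf conf, Path left, Path right) {
    conf.setInputFormat(CompositeInputFormat.class);
    // "inner" could instead be "outer" or "override" (the join types listed above).
    String joinExpr = CompositeInputFormat.compose(
        "inner", KeyValueTextInputFormat.class, left, right);
    conf.set("mapred.join.expr", joinExpr);
  }
}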
ii) Reduce-Side Joins
 Reduce-side joins are simpler than map-side joins:
– the input datasets don't have to be structured in any particular way
– but they are less efficient, as both datasets have to go through the
MapReduce shuffle
 Idea: the mapper tags each record with its source
– and uses the join key as the map output key, so that the records with
the same key are brought together in the reducer.
 Multiple inputs: the input sources for the datasets may have different
formats.
 Use the MultipleInputs class to separate the logic for parsing
and tagging each source (see the sketch below).
 Secondary sort: To perform the join, it is important to have the data
from one source before another.
 Example: The code assumes that every station ID in the
weather records has exactly one matching record in the
station dataset.
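A hedged sketch of wiring up a reduce-side join with MultipleInputs (StationMapper, WeatherRecordMapper and JoinReducer are hypothetical classes; each mapper is assumed to tag its records and emit the station ID as the map output key):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class ReduceSideJoinSetup {
  public static void configure(Job job, Path stations, Path records) {
    // Each input source gets its own mapper, which tags records with their source.
    MultipleInputs.addInputPath(job, stations,
        TextInputFormat.class, StationMapper.class);        // hypothetical mapper
    MultipleInputs.addInputPath(job, records,
        TextInputFormat.class, WeatherRecordMapper.class);  // hypothetical mapper
    job.setMapOutputKeyClass(Text.class);                   // join key: station ID
    job.setReducerClass(JoinReducer.class);                 // hypothetical reducer
  }
}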
5. SIDE DATA DISTRIBUTION
 Side data can be defined as extra read-only data needed
by a job to process the main dataset.
 It is typically small, static data that the MapReduce job
requires in addition to its main input.
 The challenge is making the side data available, in a
convenient and efficient fashion, on the node where the
map task will be executed.
 Hadoop provides two side-data distribution techniques.
They are:
 (a) Using the Job Configuration
 (b) Distributed Cache
(a) Using the Job Configuration:
1. Arbitrary key-value pairs can be set in the job
configuration using the various setter methods on
Configuration.
2. This is a useful technique only for small amounts of data:
the suggested size is a few kilobytes, because the
configuration object is read by the jobtracker, the
tasktracker and each new child JVM.
3. In the task, you can retrieve the data from the configuration
returned by Context's getConfiguration() method.
4. Apart from this, the side data requires serialization if it has
a non-primitive encoding.
5. DefaultStringifier uses Hadoop’s serialization framework to
serialize objects.
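A short sketch of this technique (the property name "myapp.max.valid.temp" and the default value are illustrative). The driver would set the value with job.getConfiguration().set("myapp.max.valid.temp", "9999"), and a task reads it back in setup():

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ConfiguredMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private int maxValidTemp;

  @Override
  protected void setup(Context context) {
    // Retrieve the side data from the job configuration (illustrative property name).
    maxValidTemp = context.getConfiguration().getInt("myapp.max.valid.temp", 9999);
  }
}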
(b) Distributed Cache
1. Rather than serializing side data in the job configuration, it is
preferable to distribute datasets using Hadoop’s distributed cache
mechanism.
2. This provides a service for copying files and archives to the task
nodes in time for the tasks to use them when they run.
3. To save network bandwidth, files are normally copied to any
particular node once per job.
4. Side-Data can be shared using the Hadoop’s Distributed cache
mechanism.
5. We can copy files and archives to the task nodes when the tasks
need to run. Usually this is preferable to using the job configuration.
6. If both datasets are too large, then we cannot copy either of them to
every node in the cluster as we did for side data distribution.
7. We can still join such records using MapReduce with a map-side or
reduce-side join.
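A hedged sketch of the distributed cache (the cache-file path and the "#stations" link name are illustrative; the LookupMapper's parsing logic is left as a comment):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class CachedLookupExample {

  // Driver side: ship a small lookup file to every task node.
  public static void addSideData(Job job) throws Exception {
    // "#stations" creates a link named "stations" in each task's working directory.
    job.addCacheFile(new URI("/metadata/stations.txt#stations"));  // illustrative path
  }

  // Task side: read the cached file locally before processing any records.
  public static class LookupMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void setup(Context context) throws IOException {
      try (BufferedReader in = new BufferedReader(new FileReader("stations"))) {
        String line;
        while ((line = in.readLine()) != null) {
          // parse each station record into an in-memory lookup table (omitted)
        }
      }
    }
  }
}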
6. MapReduce Library Classes
 Hadoop comes with a library of mappers and reducers for
commonly used functions.
 They are listed with brief descriptions in the table below. For
further information on how to use them, please consult their Java
documentation (Javadocs).
 The major topics around the MapReduce library classes are:
the Input class (writing your own input class), the mapping classes,
the Reducer class, the Output class (writing your own output class),
the Marshaller class, the Counter class, and size limits.
Classes                           Description
ChainMapper, ChainReducer         Run a chain of mappers in a single mapper, and a reducer
                                  followed by a chain of mappers in a single reducer.
FieldSelectionReducer (new API)   A mapper and a reducer that can select fields (like the
                                  Unix cut command) from the input keys and values and emit
                                  them as output keys and values.
IntSumReducer, LongSumReducer     Reducers that sum integer values to produce a total for
                                  every key.
InverseMapper                     A mapper that swaps keys and values.
MultithreadedMapper (new API)     A mapper (or map runner in the old API) that runs mappers
                                  concurrently in separate threads. Useful for mappers that
                                  are not CPU-bound.
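A brief sketch of reusing one of these library classes (WordTokenMapper is a hypothetical mapper that emits (word, 1L) pairs; LongSumReducer is the Hadoop library class from the table above):

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.reduce.LongSumReducer;

public class LibraryClassesExample {
  // Word-count style job that reuses LongSumReducer instead of a custom reducer.
  public static void configure(Job job) {
    job.setMapperClass(WordTokenMapper.class);   // hypothetical: emits (Text word, LongWritable 1)
    job.setCombinerClass(LongSumReducer.class);  // the library reducer also works as a combiner
    job.setReducerClass(LongSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
  }
}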