The Definitive Cloudera Hadoop WordCount Tutorial

MapReduce & Hadoop
IT332
Distributed Systems
Outline
 MapReduce
 Hadoop
 Cloudera Hadoop
 Tutorial
MapReduce
 MapReduce is a programming model for data processing
 The power of MapReduce lies in its ability to scale to 100s or 1000s of computers, each with several processor cores
 How large is the workload? Web-scale data on the order of 100s of GBs to TBs or PBs
 It is likely that the input data set will not fit on a single computer’s hard drive; hence, a distributed file system (e.g., the Google File System, GFS) is typically required
MapReduce Characteristics
 MapReduce ties smaller and more reasonably priced machines together into a single cost-effective commodity cluster
 MapReduce divides the workload into multiple independent tasks and schedules them across cluster nodes
 The work performed by each task is done in isolation from the work of the other tasks
Data Distribution
 In a MapReduce cluster, data is distributed to all the nodes of the cluster as it is being loaded in
 An underlying distributed file system (e.g., GFS) splits large data files into chunks which are managed by different nodes in the cluster
[Figure: a large input file is split into chunks of input data, with each chunk stored on a different node (Node 1, Node 2, Node 3)]
 Even though the file chunks are distributed across several machines, they form a single namespace
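As a concrete illustration of that single namespace, the sketch below uses Hadoop's Java FileSystem API to ask which cluster nodes hold the chunks (blocks) of one file; the path /data/large-file.txt is a hypothetical example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
    public static void main(String[] args) throws Exception {
        // Connect to the file system named in the cluster configuration
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical file: one logical path, many physical chunks
        Path file = new Path("/data/large-file.txt");
        FileStatus status = fs.getFileStatus(file);

        // Ask the file system which nodes store each block of the file
        BlockLocation[] blocks =
            fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                + " on " + String.join(", ", block.getHosts()));
        }
    }
}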
MapReduce: A Bird’s-Eye View
 In MapReduce, chunks are processed in isolation by tasks called Mappers
 The outputs from the Mappers are denoted as intermediate outputs (IOs) and are brought into a second set of tasks called Reducers
 The process of bringing together IOs into a set of Reducers is known as the shuffling process
 The Reducers produce the final outputs (FOs)
 Overall, MapReduce breaks the data flow into two phases, the map phase and the reduce phase
[Figure: in the map phase, chunks C0–C3 are processed by Mappers M0–M3, producing intermediate outputs IO0–IO3; shuffling data brings the IOs to Reducers R0–R1, which produce the final outputs FO0–FO1 in the reduce phase]
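To make the two phases concrete, here is a toy, single-machine simulation of the map, shuffle, and reduce steps in plain Java; it sketches the data flow only (it is not Hadoop code), and the sample chunks are made up.

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class MiniMapReduce {
    public static void main(String[] args) {
        // Two "chunks" that the map phase processes in isolation
        List<String> chunks = List.of("the cat sat", "on the mat");

        // Map phase: each chunk is turned into intermediate (word, 1) pairs
        List<Map.Entry<String, Integer>> intermediate = chunks.stream()
            .flatMap(chunk -> Arrays.stream(chunk.split(" ")))
            .map(word -> Map.entry(word, 1))
            .collect(Collectors.toList());

        // Shuffling: intermediate outputs are grouped by key
        Map<String, List<Integer>> grouped = intermediate.stream()
            .collect(Collectors.groupingBy(Map.Entry::getKey,
                Collectors.mapping(Map.Entry::getValue, Collectors.toList())));

        // Reduce phase: each (word, [1, 1, ...]) group becomes a final count
        grouped.forEach((word, ones) -> System.out.println(
            word + " -> " + ones.stream().mapToInt(Integer::intValue).sum()));
    }
}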
Keys and Values
 The programmer in MapReduce has to specify two functions, the map function and the reduce function, that implement the Mapper and the Reducer in a MapReduce program
 In MapReduce, data elements are always structured as key-value (i.e., (K, V)) pairs
 The map and reduce functions receive and emit (K, V) pairs
[Figure: input splits arrive as (K, V) pairs; the map function emits intermediate (K’, V’) pairs; the reduce function emits final (K’’, V’’) pairs]
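A minimal sketch of the two functions in the Hadoop Java API (the class names WordMapper and WordReducer are illustrative): the map function turns input (K, V) pairs into intermediate (K’, V’) pairs, and the reduce function turns each (K’, list of V’) group into final (K’’, V’’) pairs.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// (K, V) = (byte offset, line of text) -> (K', V') = (word, 1)
class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit one (word, 1) pair per token in the input line
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// (K', list of V') = (word, [1, 1, ...]) -> (K'', V'') = (word, total)
class WordReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();   // add up the 1s shuffled to this key
        }
        context.write(key, new IntWritable(sum));
    }
}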
Hadoop
 Hadoop is an open source implementation of MapReduce and is currently enjoying wide popularity
 Hadoop presents MapReduce as an analytics engine and under the hood uses a distributed storage layer referred to as the Hadoop Distributed File System (HDFS)
 HDFS mimics the Google File System (GFS)
Cloudera Hadoop
Cloudera Virtual Machine
• The Cloudera VM contains a single-node Apache Hadoop cluster along with everything you need to get started with Hadoop.
• Requirements:
– A 64-bit host OS
– Virtualization software: VMware Player, KVM, or VirtualBox.
• The virtualization software requires a machine that supports virtualization. If you are unsure, one way to check is to look in your BIOS and see whether virtualization is enabled.
– At least 4 GB of total RAM.
• The total system memory required varies depending on the size of your data set and on the other processes that are running.
Installation
• Step #1: Download & run VMware
• Step #2: Download the Cloudera VM
• Step #3: Extract it to the Cloudera folder.
• Step #4: Open "cloudera-quickstart-vm4.4.0-1-vmware"
Once you have the software installed, fire up the VirtualBox image of the Cloudera QuickStart VM and you should see the initial screen.
WordCount Tutorial
• This example computes the occurrence frequency of each word in a text file.
• Steps:
1. Set up the Hadoop environment
2. Upload input files into HDFS (see the upload sketch below)
3. Execute the Java MapReduce job in Hadoop (see the driver sketch below)
• Tutorial: http://edataanalyst.com/2013/08/the-definitive-cloudera-hadoopwordcount-tutorial/
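For step 2, here is a minimal sketch of uploading a local file into HDFS through Hadoop's Java FileSystem API; the local and HDFS paths are hypothetical examples, and the same effect is usually achieved from the VM's shell with Hadoop's file system commands.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UploadToHdfs {
    public static void main(String[] args) throws Exception {
        // Picks up the cluster address from the VM's Hadoop configuration
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical paths: a local text file and an HDFS input directory
        Path local = new Path("/home/cloudera/input.txt");
        Path remote = new Path("/user/cloudera/wordcount/input/input.txt");

        // Copy the local file into the distributed file system
        fs.copyFromLocalFile(local, remote);
        System.out.println("Uploaded: " + fs.exists(remote));
    }
}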
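For step 3, here is a minimal driver sketch that wires a mapper and reducer (such as the WordMapper and WordReducer sketched earlier) into a Hadoop Job; the class names are illustrative, and the input and output paths are taken from the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        // The Job object carries the MapReduce configuration: which
        // classes to run and what the output (K, V) types are
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordMapper.class);     // sketched earlier
        job.setCombinerClass(WordReducer.class);  // local pre-aggregation
        job.setReducerClass(WordReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output locations in HDFS, given on the command line
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit the job and block until it finishes
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, this driver would typically be launched from the VM's terminal with Hadoop's jar runner, pointing it at the HDFS input directory from step 2 and a new, not-yet-existing output directory.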