
[03] MapReduce Model

BIG DATA: MODELS AND ALGORITHMS
Massively Parallel Computational Models
Mattia D’Emidio
Professor, Researcher @ UNIVERSITY OF L’AQUILA
email: mattia.demidio@univaq.it
web: www.mattiademidio.com
March 13, 2024
OVERVIEW
Massively Parallel Computational Models
Map and Reduce
First MapReduce Algorithm
Massively Parallel Computational Models
MPC MODELS: three (common) main ingredients
DISTRIBUTED STORAGE INFRASTRUCTURE
Distributed File Systems
PROGRAMMING/COMPUTATIONAL MODEL
⇐ today
MapReduce
Natively designed to run on DFS/CLUSTER-BASED SYSTEMS
EXECUTION FRAMEWORK TBD
WHAT MASSIVELY PARALLEL COMPUTATIONAL MODELS ARE
ONE AND MANY THINGS:
A model of computation
With specific assumptions on how to perform computations (which we shall see), including the
definition of both abstract and lower-level details
A paradigm/style of computing
A programming model for expressing distributed computations at a massive scale
An architectural framework to run software on clusters
A design pattern for distributed algorithms to run on clusters
FIRST TO BE INTRODUCED AND TO BE SUCCESSFUL
MAP-REDUCE model
MAP-REDUCE PROGRAM/ALGORITHM: a program or algorithm that follows the MR model of computation
MAP-REDUCE MODEL and its EXTENSIONS/GENERALIZATION are used/deployed in most MODERN
SYSTEMS FOR LARGE-SCALE PROCESSING AND STORAGE that rely on cloud/cluster technologies
both to directly process large-scale distributed data
or as a general purpose tool to design algorithms for clusters or to deploy other, more complex
technologies
SEVERAL EXAMPLES OF CLOUD/CLUSTER COMPUTING BASED TECHNOLOGIES
based on/inspired by/adapted from Map-Reduce
Google’s internal implementation
Apache Hadoop
Apache Hive, various NoSQL DBMSs
Google Cloud Bigtable, BigQuery, DataProc, Amazon AWS, Microsoft Azure
Apache Spark
DESIGN PRINCIPLES
SIMPLICITY OF MODEL: simple enough that MapReduce programs can be created via HIGHER-LEVEL PROGRAMMING (e.g. Python) even by non-experts of CS/CE
REASON FOR SUCCESS
BASED ON STORING DATA IN A DISTRIBUTED FASHION
Store data when and where they are produced
Distribute/replicate for fault tolerance
If you need to process data: start suited (D&C) processes on the compute nodes where
data are (already) stored
Avoid unnecessary data movement
MOTTO: the DATACENTER (i.e. any cluster where you store in distributed fashion) is the MACHINE that gives you SUPERIOR COMPUTATIONAL PERFORMANCE
In contrast to separate facilities for storage and computing
WHY: since MOVING PROCESSING TO THE DATA is much cheaper/faster (at scale) than MOVING
DATA TO PROCESSORS (if you have a suited framework supporting this)
Map and Reduce
THINKING IN MAPREDUCE
GIVEN A COMPUTATIONAL PROBLEM you want TO SOLVE all you NEED to do is:
1. ENCODE PROBLEM/INPUTS/OUTPUTS via a suited modeling of the data
The KEY-VALUE PAIR data model
ASSOCIATE a key to every piece of data you process to move/store data in cluster effectively
2. WRITE TWO FUNCTIONS (procedures) that define a solving strategy on data modeled as
key-value pairs
MAP and REDUCE functions executed in distributed fashion by each node of cluster
Functions are a way to break problem into sub-problems
KEY-VALUE PAIR MODEL for representing inputs and outputs of the problem is
NATIVE STRATEGY for supporting DISTRIBUTION OF DATA
MAP AND REDUCE PARADIGM for defining a solving strategy for the problem is
NATIVE DESIGN for supporting/defining DISTRIBUTION OF COMPUTATION
ONCE the PROBLEM (AND SOLUTION ALGORITHM) has been ”CONVERTED“ into the above form, it can be executed
by any system based on the Map-Reduce model
data movement, distribution of computations, fault tolerance and other issues are ”automatically“ handled by the system
similar to classic algorithm design, where we think of a solution and assume then that
instructions are executed by a machine according to some flow
WE SHALL ASSUME, in basic scenarios, that once
(1) ENCODING and
(2) DESIGN OF MAPREDUCE FUNCTIONS
have been done, there will be a system ”taking care of the rest” when it comes to implementation and execution
HOW IS THIS POSSIBLE?
Real-world software systems implementing MapReduce paradigm are built to offer DECOUPLING
OF ALGORITHM DESIGN and ALGORITHM DEPLOYMENT
according to a rigid model
THE MODEL IMPOSES THAT:
COMPUTATION is broken into PARALLEL EXECUTIONS of many (MAP AND REDUCE) functions
(performed by tasks running at nodes)
TASKS operate on portions of inputs and outputs encoded according to the KEY-VALUE
PAIR MODEL
EXECUTION/SCHEDULING OF TASKS and FAULT-TOLERANCE are entirely handled ”by the system”
We design how data are distributed and processed, then system performs computations
We want to ignore technical/hw/sw details when designing algorithms/programming clusters/cloud computing systems
Type of cluster/nodes, interconnection, operating systems, etc
Let us see how
MAPREDUCE DATA MODEL
The ”key notion“ is the KEY
Everything in this world is built around the ”abstract” data type called KEY-VALUE PAIR
Every piece of data we want to manipulate/process/store is converted/treated/represented in this form
We will use this FORMALISM
⟨k, v⟩
to denote a generic element of type key-value
k is the KEY of this element
v is the VALUE
Similar principle to associative arrays, we have a key to index/access data
KEYS AND VALUES k, v can be of any type (int, strings, tuples, lists)
WHY DO WE ASSOCIATE KEYS TO VALUES/DATA?
It is essential to EFFICIENTLY DISTRIBUTE computations and data movement operations
Keys used for both TASK SCHEDULING and DATA MOVEMENT OPERATIONS
DATA MOVEMENT RATIONALE: key is used to identify the portion of data to be sent/assigned to
a specific compute node (for storing/processing)
TASK SCHEDULING RATIONALE: key is used to determine which specific compute node is in
charge of some computational sub-task and which data it should manipulate
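As an illustration (not from the original slides), here is a minimal Python sketch of the key-value view of data: every piece of data carries a key, and the key alone decides which node stores or processes it. The names assign_node and num_nodes are hypothetical.
```python
# Minimal sketch: data as key-value pairs, with the key driving placement.
# Names (assign_node, num_nodes) are illustrative, not from any real framework.

num_nodes = 4  # hypothetical cluster size

def assign_node(key, num_nodes):
    """Decide which compute node is responsible for a given key."""
    return hash(key) % num_nodes

# Every piece of data is represented as a <key, value> pair.
pairs = [
    ("doc1", "the quick brown fox"),    # key: document id, value: its content
    ("doc2", "jumps over the lazy dog"),
    (42, ("sensor-7", 19.5)),           # keys/values can be of any type
]

for k, v in pairs:
    print(f"<{k!r}, {v!r}> -> node {assign_node(k, num_nodes)}")
```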
MAPREDUCE: STRUCTURE OF A PROTOTYPICAL COMPUTATION
1. DISTRIBUTE input to nodes (if not already distributed)
2. MAP phase (process, possibly redistribute)
3. REDUCE phase (process)
4. COLLECT output from nodes (if necessary to aggregate on a single machine)
REMARK ON #1 AND #4: depending on the application, data might be already distributed (these
two steps might be optional)
This is the point of ”bringing the computation” to where data is
Applications that already rely on DFS, on data being stored in a distributed fashion to
achieve more ”storage” and easier fault-tolerance, redundancy
Data is already distributed, we process it where it is
Otherwise, we distribute it
MAPREDUCE: MORE DETAILS ON STRUCTURE OF A PROTOTYPICAL COMPUTATION
BASIC PROCESSING FLOW
1. READ your (potentially large) INPUT
Put it in a DFS if it is not already distributed (and if necessary)
2. EXECUTE a so–called MAP PHASE that converts each input element into (one or more)
⟨key, value⟩ pairs
OUTPUT: collection of ⟨key, value⟩ pairs
3. PERFORM a so–called GROUP PHASE by key on outputs of previous phase
OUTPUT: sets of ⟨key, [listOfValues]⟩ pairs
for each pair ⟨k, L⟩ the list-of-values L contains all values associated to key k
4. EXECUTE a so–called REDUCE PHASE on outputs of previous phase
combines values associated to a same key according to some policy
output: sets of ⟨key, value⟩ pairs
5. WRITE your (potentially large) OUTPUT
Write it back sequentially if necessary, or leave it distributed
[Figure: structure of a MapReduce computation, global viewpoint — (1) distribute input to nodes, (2) MAP phase, (3) REDUCE phase, (4) collect output; steps 1 and 4 might be optional when data is already distributed, which is the point of ”bringing the computation to where data is“]
Suited for, but not limited to, applications that already rely on data produced in a distributed fashion
It should be evident at this point that THIS FLOW REQUIRES SYNCHRONIZATION between nodes
MOVE DATA
PROCESS
MOVE DATA BACK
We make some ASSUMPTIONS ON SYNCHRONIZATION and forget about it
These assumptions are then GUARANTEED AT IMPLEMENTATION LEVEL by systems implementing
the mapreduce model
All synchronization steps are COORDINATED BY SOME ”MASTER ENTITY” according to user
settings and available hardware
As we will see, an orchestrator process
More details later
A (WORKING) MAP REDUCE SYSTEM is a very complex system (several interacting components)
MAP PHASE: DETAILS
MAP PHASE performed by a number of MAP TASKS
Each task runs on one compute node of the cluster
Each task operates on one piece (CHUNK) of global input (a set of elements)
Each task executes repeatedly MAP FUNCTIONS on assigned chunk
A MAP TASK processes an input chunk in key-value form
PERFORMS repeated executions of MAP FUNCTIONS on chunk
no fixed policy for assigning chunks to nodes
data distributed (in general) arbitrarily, cannot be decided
TURNS chunk into other key-value pairs (writes INTERMEDIATE data)
A MAP FUNCTION specifies the way output key-value pairs are produced from input elements
(CODE OF THE MAP FUNCTION)
This is established by the code of the MAP FUNCTION: PROBLEM-DEPENDENT and hence
user defined, part of the algorithm design
MAP FUNCTION: FORMALLY
MAP FUNCTION can be formalized as follows:
∀⟨k, v⟩ ∈ I : MAP(⟨k, v⟩) → ⟨k′, v′⟩∗
INPUT: key-value pairs ⟨k, v⟩ ∈ I
Portions of the input I can be processed by different map tasks
E.g. k is filename, v is a line of the file or the whole content
ONE MAP FUNCTION CALL per pair ⟨k, v⟩
LOCAL OUTPUT: a set of key-value pairs ⟨k′, v′⟩∗ (zero or more), for each input key-value
pair
E.g. Word-Occurrence pairs for each document
GLOBAL MAP OUTPUT: a set I′ of key-list-of-values pairs (after grouping)
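A minimal Python sketch (my illustration, not the course's reference implementation) of the MAP contract above: one call per input pair ⟨k, v⟩, emitting zero or more ⟨k′, v′⟩ pairs, here written as a generator; the chunk contents are assumed.
```python
# Sketch of the MAP contract: MAP(<k, v>) -> zero or more <k', v'> pairs.
# Here k is a (hypothetical) filename and v its text; we emit <word, 1> pairs.

def map_function(k, v):
    """One call per input <k, v>; yields intermediate <k', v'> pairs."""
    for word in v.split():
        yield (word, 1)

# A map task would apply map_function to every element of its chunk:
chunk = [("file_a.txt", "to be or not to be")]
intermediate = [pair for k, v in chunk for pair in map_function(k, v)]
print(intermediate)  # [('to', 1), ('be', 1), ('or', 1), ('not', 1), ('to', 1), ('be', 1)]
```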
ON THE INPUT CHUNKS TO MAP TASKS
KEY-VALUE PAIRS that are given as inputs to a map task are called ELEMENTS
An ELEMENT can be any type (tuple,document,data structure,string)
A CHUNK is a collection of elements (local inputs)
For simplicity we will always assume the input is STORED IN KEY-VALUE FORM
If not we can treat it as key-values by considering random/empty keys
E.g. ⟨randomkey1, documentA⟩, ⟨randomkey2, documentB⟩, . . .
NOTATION: when we do not care about input/output elements’ keys we use the ⟨∗, documentY⟩
notation
Usually not really relevant to design in general
EXCEPTIONS: unless there are some specific optimizations to consider or more ”computations
to concatenate” (we will see)
Convenient to treat inputs this way FROM THE SYSTEM/DESIGN VIEWPOINT (for analysis and implementation purposes)
DESIGN CONSTRAINT: no input element ever shared across two chunks
Each element is processed by one and only one map task
Subdivision of input is called a PARTITION
ON THE OUTPUTS OF MAP TASKS
MAP FUNCTION takes elements and produces, per element, (ZERO OR MORE) KEY-VALUE PAIRS
Types of keys/values: can be arbitrary, they are problem-dependent
Keys here are not keys in the traditional sense
They do not have to be unique, there can be many k-v pairs with same key
A MAP TASK performs several applications of map function to its inputs
Output of a map task per input chunk is a COLLECTION OF KEY VALUE PAIRS
Since no uniqueness we can have several key-value pairs sharing a same key, even from
a same element
Output is stored locally at node
OUTPUT OF MAP PHASE is collection of ALL KEY-VALUE PAIRS GENERATED BY ALL MAP TASKS
several key-value pairs, many can have same key
this output is distributed across nodes
GROUP PHASE
Across different map tasks located on different nodes we can observe collections with SEVERAL KEY-VALUE PAIRS HAVING THE SAME KEY
For most problems, it is necessary to aggregate such values to solve global problem
GROUP PHASE: these data are GROUPED BY KEY (virtually, through the application of a proper
function on the key)
GROUPING induces outputs of maps to be CONVERTED into a key-list of values representation denoted by ⟨k, L⟩
LIST OF VALUES L associated to a UNIQUE KEY k during the group phase contains all values
associated to k after the map phase
ALL VALUES IN THE OUTPUTS of the map phase having the same key (which, recall, is distributed across nodes)
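A minimal sketch of the group phase in Python (illustrative only): all intermediate ⟨k, v⟩ pairs, regardless of which map task produced them, are gathered into ⟨k, L⟩ pairs where L lists every value seen for key k.
```python
from collections import defaultdict

def group_by_key(intermediate_pairs):
    """Turn a stream of <k, v> pairs into <k, [v1, v2, ...]> pairs."""
    groups = defaultdict(list)
    for k, v in intermediate_pairs:
        groups[k].append(v)
    return dict(groups)

# Outputs of (possibly many) map tasks:
map_output = [("the", 1), ("crew", 1), ("the", 1), ("space", 1), ("the", 1)]
print(group_by_key(map_output))
# {'the': [1, 1, 1], 'crew': [1], 'space': [1]}
```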
AFTER GROUPING (or it would be more precise to say DURING GROUPING) data are TRANSMITTED
OVER THE NETWORK TO BE FED TO REDUCE TASKS
This is a system wide activity that involves all outputs of all map tasks
Data is ”redistributed” across nodes of the cluster to achieve global solutions
Key-lists are reassigned to nodes, transmitted over the network
HOW DATA ARE REASSIGNED
ONE-TO-ONE CORRESPONDENCE between NODES and KEYS established
EACH NODE processes all values HAVING A SAME KEY (possibly for many keys)
To COMBINE/AGGREGATE THEM IN SOME WAY to obtain the (global) output
This is established by the code of the REDUCE FUNCTION
HOW IS THIS ONE-TO-ONE ASSIGNMENT determined?
There exists, in any system implementing the map-reduce model, a PARTITIONER COMPONENT
PARTITIONER
A PROCESS, running on each node, that decides which node will be responsible for processing a specific key and all its values
According to some PARTITIONING POLICY (more details will follow)
Typically, via the use of HASH FUNCTIONS on the keys, more details later
Partitioning policy applied during grouping often referred to as SHUFFLE PHASE or shuffling (since data are shuffled, redistributed, across the cluster as a consequence of partitioning/grouping)
SORTING on keys typically also included
If a node receives multiple key-lists, these are processed in sorted order by key
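A hedged Python sketch of a partitioner and of the shuffle it induces (real systems such as Hadoop ship their own partitioners; this is only an illustration, and the function names are assumptions): each key is hashed to a bucket number in {0, …, r−1}, all pairs with that key go to the reduce task owning that bucket, and each bucket is kept sorted by key.
```python
def partition(key, r):
    """Partitioning policy: map a key to a bucket/reduce-task id in [0, r-1]."""
    return hash(key) % r

def shuffle(map_output, r):
    """Route every <k, v> pair to the reduce task chosen by the partitioner."""
    buckets = [[] for _ in range(r)]
    for k, v in map_output:
        buckets[partition(k, r)].append((k, v))
    # Sort by key within each bucket (as done during the shuffle/sort step).
    for b in buckets:
        b.sort(key=lambda kv: str(kv[0]))
    return buckets

map_output = [("the", 1), ("crew", 1), ("the", 1), ("space", 1)]
for task_id, data in enumerate(shuffle(map_output, r=2)):
    print(f"reduce task {task_id} receives {data}")
```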
REDUCE PHASE
Performed by a number of REDUCE TASKS
Each running on one compute node of the cluster
A REDUCE TASK
PROCESSES (RECEIVES) all data associated to a key outputted by the map tasks
PERFORMS repeated executions of REDUCE FUNCTIONS on such key-list-of-values pairs
TURNS each key-list of value pair into other key-value pairs
The way key-value pairs are produced is DETERMINED by the CODE OF THE REDUCE FUNCTION
Which is again PROBLEM-DEPENDENT and hence USER DEFINED
Another part of the algorithm design
OUTPUT of each reduce task is collection of key value pairs
If the algorithm is properly designed, they together form the GLOBAL OUTPUT to the original problem
REDUCE FUNCTION: FORMALLY
REDUCE FUNCTION CAN BE FORMALIZED AS FOLLOWS:
∀⟨k′, ⟨v′⟩∗⟩ ∈ I′ : REDUCE(⟨k′, ⟨v′⟩∗⟩) → ⟨k′, v″⟩∗
INPUT: key-list-of-values pairs ⟨k′, ⟨v′⟩∗⟩ ∈ I′
Values of the global output of map sharing the same key
partitioning can be customized
ONE REDUCE FUNCTION CALL per key k′
Values v′ having key k′ are reduced and processed together
LOCAL OUTPUT: a key-value pair whose key is k′ and whose value is the result of the reduce
function on the associated list
GLOBAL OUTPUT: a set of key-value pairs ⟨k′, v″⟩∗ (zero or more) that are COLLECTED and RETURNED
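A minimal Python sketch of the REDUCE contract above (illustrative, not the course's reference code): one call per key-list pair ⟨k′, [v′…]⟩, producing zero or more ⟨k′, v″⟩ pairs; the aggregation policy (here, keeping the maximum) is whatever the problem requires.
```python
def reduce_function(key, values):
    """One call per <k', [v'...]>; combine the values according to some policy."""
    # Example policy (assumed for illustration): keep the largest value for this key.
    yield (key, max(values))

grouped = {"temperature": [19.5, 22.1, 20.3], "humidity": [0.41, 0.38]}
output = [pair for k, vs in grouped.items() for pair in reduce_function(k, vs)]
print(output)  # [('temperature', 22.1), ('humidity', 0.41)]
```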
TYPICAL USE CASE 1/4
We have DATA DISTRIBUTED ACROSS NODES (produced by some large-scale application)
We want to SOLVE some COMPUTATIONAL TASK on such data
We want to PROCESS DATA DISTRIBUTEDLY to achieve superior performance
We thus design a DISTRIBUTED ALGORITHM
divide and conquer philosophy, several small tasks to solve big problem
DESIGNING A DISTRIBUTED ALGORITHM is a challenging step
Several aspects to consider (synchronization, data movement, task assignments, etc)
FOLLOWING MAP REDUCE ASSUMPTIONS helps (rigid model, implementations must adhere)
TYPICAL USE CASE 2/4
To this aim we first accordingly DESIGN A MAP PHASE
CONVERTS chunks of inputs residing at nodes into key-value pairs
PERFORMED BY MAP TASKS at each node, each executes map functions on elements in
the chunks
Code of function is PART OF ALGORITHMIC DESIGN (PSEUDOCODE defined by user)
OUTPUT key-value pairs stored locally at each node (INTERMEDIATE DATA)
HOW MANY MAP/REDUCE TASKS can be customized, depends on available resources, data
and computational goals (system takes care)
TYPICAL USE CASE 3/4
We have RESULTS OF MAP PHASE (INTERMEDIATE DATA) that are distributed across nodes
We want to process them to solve global problem
Intermediate Data are grouped and aggregated/partitioned by key
This is a system wide activity that involves all outputs of all map tasks
Values having SAME KEY are sent to/processed by a same reduce task
Map outputs (residing on nodes) are sent to other nodes by using the policy
”ALL DATA WITH A SAME KEY MUST BE SENT TO SAME NODE”
Use of HASH FUNCTIONS ON THE KEYS to send to proper, same node (we will see)
TYPICAL USE CASE #4
We have RESULTS OF MAP PHASE THAT HAVE BEEN GROUPED AND SENT to compute nodes according to the policy ”ALL DATA WITH A SAME KEY MUST BE SENT TO SAME NODE”
We want to PROCESS SUCH DATA DISTRIBUTEDLY to obtain final output
TYPICAL USE CASE 4/4
Hence we accordingly DESIGN A REDUCE PHASE that processes these lists of outputs of map,
aggregated by key, and converts them into other key-value pairs
PERFORMED BY REDUCE TASKS at each node, each of which executes reduce functions on
lists of values associated to keys
CODE OF FUNCTION is part of algorithmic design (pseudocode by user)
Reduced outputs (key-value pairs) can either remain locally at each node or be stored
back to DFS or aggregated
together they form FINAL OUTPUT DATA if design is correctly done
How many reduce tasks can be customized, depends on available resources, data and
computational goals (system takes care)
First MapReduce Algorithm
LEARNING BY EXAMPLE
We design a MR algorithm to solve WORDCOUNT
We assume input is a HUGE REPOSITORY OF TEXT DOCUMENTS, we want to COUNT OCCURRENCES
OF WORDS
KILLER APP: web analytics
We need to formally define the PROBLEM
INPUT: collection of documents (containing sequences of words – strings)
DESIRED OUTPUT: some data structure storing, for each word, the number of occurrences
of the word in the collection
Then we CONVERT the problem into ”Map Reduce Paradigm“
In order to design a proper distributed algorithm based on divide and conquer
INPUT: a collection d of documents
We can think of each document as an ELEMENT
And of each CHUNK as a set of elements
KEYS of inputs here are irrelevant (could be a document id), do not care here
VALUES of inputs are documents
OUTPUT: for each word we need occurrences
KEYS can be of type String (WORDS)
VALUES can be of type integer (OCCURRENCES OF WORDS)
DEFINE MAP FUNCTION
MAP FUNCTION ALGORITHM
manipulates input elements from some chunk
produces some PARTIAL (INTERMEDIATE) OUTPUT in key-value form
MAP FUNCTION ALGORITHM (in this case)
1. READ an element (a document)
2. BREAK it into a sequence of words w1, w2, . . . , wn
3. EMIT (i.e. return as output, write intermediate data) a key-value pair ⟨wi, 1⟩ for each
encountered word wi
key is the word itself wi
value is 1, meaning ”an occurrence has been found”
NOMENCLATURE REMARK: keyword EMIT is used to denote ”return”, ”produce as output”
Used since data eventually is transmitted over network
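A short Python rendering of the map function just described (a sketch under the slide's assumptions: the element is a whole document and the input key is ignored):
```python
def wordcount_map(doc_key, document):
    """Map function for WordCount: break the document into words and
    EMIT one <word, 1> pair per occurrence (the input key is ignored)."""
    for word in document.split():
        yield (word, 1)

# Example: applying the map function to a single element of a chunk.
print(list(wordcount_map("*", "the crew of the space shuttle")))
# [('the', 1), ('crew', 1), ('of', 1), ('the', 1), ('space', 1), ('shuttle', 1)]
```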
MAP TASK
RESULT OF PROCESSING A CHUNK is hence EMISSION of collection of pairs (stored locally, then
transmitted over network)
⟨w1, 1⟩, ⟨w2, 1⟩, . . . , ⟨wn, 1⟩
one pair per word in document
Since EACH MAP TASK here processes many ELEMENTS (which are documents)
OUTPUT OF MAP TASKS will be collections of pairs
One collection per processed document
If word m appears k times in some document Di we will have k key-value pairs produced
by the map task that processes element Di
k pairs ⟨m, 1⟩ in the emitted sequences
NOMENCLATURE REMARK: Result of application of map function to a single element is called a
MAPPER
MAP PHASE
[Figure: map phase — input key-value pairs ⟨k, v⟩ are fed to map tasks, which produce intermediate key-value pairs]
REDUCE FUNCTION
AFTER MAP PHASE: all emitted key-value pairs (remark: stored intermediately at compute
nodes) undergo the GROUP&SORT phase
All elements sharing a same key are GROUPED TOGETHER in lists of values
This collection of key-list-of-value pairs represents INPUT FOR THE NEXT (REDUCE) PHASE
If multiple key-lists are given to a reduce task, these are sorted by key (can be exploited
for algorithmic design, we will see)
INPUT
A KEY k
A LIST OF VALUES associated to key k: the set of all values emitted by ALL map tasks for
key k
OUTPUT
SEQUENCE of (zero or more) key-value pairs
DATA TYPE can be different w.r.t. those of map function and input
Often same type, depends on the PROBLEM
REDUCE FUNCTION ALGORITHM here
manipulates key, list-of-values pairs ⟨k, [l1, l2, l3, . . . , ln]⟩
produces the FINAL OUTPUT of the problem in key-value form
REDUCE FUNCTION ALGORITHM (in this case)
1. COMPUTE, for each key k, the value OCCURRENCEk = l1 + l2 + · · · + ln (the sum of the values in the associated list)
2. EMIT ⟨k, OCCURRENCEk⟩
NOMENCLATURE REMARK: result of application of reduce function to a single key and its associated values is called a REDUCER
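The corresponding reduce function, again as a hedged Python sketch: sum the list of 1s associated to each word and emit ⟨word, OCCURRENCE⟩.
```python
def wordcount_reduce(word, counts):
    """Reduce function for WordCount: OCCURRENCE_k is the sum of the list."""
    yield (word, sum(counts))

print(list(wordcount_reduce("the", [1, 1, 1])))  # [('the', 3)]
```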
REDUCE PHASE
[Figure: reduce phase — intermediate key-value pairs are grouped by key into key-value groups, then reduce tasks produce the output key-value pairs]
[Figure: grouping by key phase]
MAPREDUCE FOR WORD COUNTING
[Figure: MapReduce for word counting on a big document, read sequentially — MAP (provided by the programmer) reads the input and produces a set of key-value pairs, e.g. (The, 1), (crew, 1), (of, 1), (the, 1), (space, 1), (shuttle, 1), (Endeavor, 1), (recently, 1), . . .; GROUP BY KEY collects all pairs with the same key, e.g. (crew, 1), (crew, 1), (space, 1), (the, 1), (the, 1), (the, 1), (shuttle, 1), (recently, 1), . . .; REDUCE (provided by the programmer) collects all values belonging to a key and outputs, e.g. (crew, 2), (space, 1), (the, 3), (shuttle, 1), (recently, 1), . . . — from J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org]
PSEUDO-CODE OF WORD COUNT ALGO
HOW A MAPREDUCE ALGORITHM is defined: pseudo-code (reasonably pseudo)
INPUT FOR MAP: elements of assigned chunk
INPUT OF REDUCE: key-lists generated by maps
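The pseudo-code shown on this slide was a figure; below is a possible reconstruction as a self-contained Python simulation that chains the map, group and reduce steps sketched above (purely illustrative: the sequential grouping step stands in for the distributed shuffle, and all names are assumptions).
```python
from collections import defaultdict

def wc_map(doc_key, document):
    for word in document.split():
        yield (word, 1)

def wc_reduce(word, counts):
    yield (word, sum(counts))

def run_wordcount(documents):
    """Simulate the MapReduce flow sequentially: map -> group by key -> reduce."""
    # MAP phase: one map call per element (document).
    intermediate = [pair for key, doc in documents for pair in wc_map(key, doc)]
    # GROUP phase: collect all values sharing a key (stand-in for shuffle/sort).
    groups = defaultdict(list)
    for word, count in intermediate:
        groups[word].append(count)
    # REDUCE phase: one reduce call per key-list.
    return [pair for word, counts in sorted(groups.items())
                 for pair in wc_reduce(word, counts)]

docs = [("d1", "the crew of the space shuttle"), ("d2", "the space race")]
print(run_wordcount(docs))
# [('crew', 1), ('of', 1), ('race', 1), ('shuttle', 1), ('space', 2), ('the', 3)]
```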
OUTPUT OF REDUCERS
AT THE END OF REDUCE PHASE
Outputs from all the reduce tasks are COLLECTED AND MERGED INTO A SINGLE FILE
This will contain the OUTPUT OF THE GLOBAL PROBLEM
Returned to the controller for storage
[Figure: MapReduce functions viewpoint (WordCount)]
[Figure: MapReduce functions viewpoint (general)]
TASKS VS KEYS
DESIGN CONSTRAINT OF THE MODEL: each reduce function processes ONE AND ONLY ONE key-list
⟨k, L⟩
ALL VALUES having the same key after map phase are formed into a list
One reduce task repeatedly processes several of these lists
THERE IS A MASTER CONTROLLER PROCESS running in the cluster that coordinates all operations
by default
Such default can be customized for optimization purposes
For instance master process will know HOW MANY REDUCE TASKS WE CAN RUN, say r
Typically r is USER DEFINED and TUNED DEPENDING ON THE HARDWARE
HOW DO WE ASSIGN LISTS TO REDUCE TASKS?
If r is LARGER THAN the MAX NUMBER OF KEYS we are already fine
In the sense that we can assign each key to a single node
This is an IDEAL SITUATION THAT NEVER HAPPENS
Hence, otherwise, keys-lists MUST BE PARTITIONED
several per reducer
This is done by HASHING (hash functions are precisely many-to-one functions)
Hash functions can be used also to decide which values go where
Partitioning policies use HASH FUNCTIONS ON KEYS
Many-to-one function h : U → {0, . . . , r − 1} to partition key-lists
U : universe of keys
r an integer, typically r ≪ |U|
Each key HASHED into an integer from 0 to r − 1
Called BUCKET NUMBER or HASH VALUE
MAPREDUCE: PARTITIONING VIEWPOINT
HOW DATA ARE SENT TO NODES performing reduce tasks:
Each map task writes outputs locally (collection of key value pairs)
Pairs of the collection are transmitted to one of r local files on the nodes that will be executing
reduce tasks, according to the RESULT OF THE HASHING (application of the partitioning function) on their keys
Each such file is DESTINED for one of the reduce tasks
Stored in sorted order of reception
[Figure: MapReduce partitioning viewpoint]
WHAT HASHING TO USE? Not easy to decide, depends on data
Several methods, known implementations include PREDEFINED, READY TO USE HASH FUNCTIONS
Optionally the user can SPECIFY A CUSTOM HASH FUNCTION
Or other methods for assigning MANY KEYS TO SINGLE REDUCE TASK
REDUCE TASKS: EFFECTS OF HASHING
FOR EACH KEY k OF SOME OUTPUT FROM MAP TASKS
Hash value of k defines id of reduce task in charge of value of that key
Input to REDUCE TASK h(k) will be the KEY-LIST ⟨k, [v1, v2, . . . , vn]⟩ WHERE
(k, v1), (k, v2), . . . , (k, vn) is the SET OF ALL THE KEY-VALUE PAIRS sharing the same key k
in the output of the map phase
while h(k) is the so called BUCKET NUMBER of k
Output of map with key k is SENT to the reducer associated to BUCKET NUMBER h(k) (having id
h(k))
Easy to see that PROPER HASHING PLAYS A FUNDAMENTAL ROLE to achieve GOOD PERFORMANCE
(low completion time)
Defaults are ok for most data types (for most types of keys)
EXAMPLE: EFFECTS OF HASHING
Assume we have 4 nodes running 4 REDUCE TASKS in our system
Assume we know AT MOST 2¹⁰ KEYS out of the map tasks (which can emit many more pairs) will have to be
handled (say any key can be an integer in [0, 2¹⁰ − 1])
If we USE the hash function h(k) = k mod 4 we know any key in the range will be UNIQUELY
ASSIGNED to a task with bucket number h(k)
Specifically EACH REDUCE TASK will have to handle (at most) 2¹⁰/2² = 2⁸ = 256 key-lists
Task 0 handles all keys k whose h(k) is 0
Could be less, depends on the input actually occurring; this is a worst case prediction
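A quick numeric check of this example in Python (the key range is the one assumed on the slide): hashing keys 0 … 2¹⁰−1 with h(k) = k mod 4 puts exactly 2⁸ = 256 of them in each bucket.
```python
from collections import Counter

r = 4                                    # number of reduce tasks
keys = range(2**10)                      # worst case: every key in [0, 2^10 - 1] occurs
buckets = Counter(k % r for k in keys)   # h(k) = k mod 4
print(buckets)                           # Counter({0: 256, 1: 256, 2: 256, 3: 256})
```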
REMARK: even if we distribute keys evenly, we do not know the size of the lists
Skewness might happen nonetheless
Balance sometimes cannot be guaranteed before execution
Probability can help in estimating, when generally we do not know precise distributions beforehand
TAKE HOME MESSAGE: different hashing strategies induce different performance!
Induce DIFFERENT NUMBER OF LISTS PER REDUCE TASK
For instance: suppose that all the keys in our input are multiples of 4
By using h(k) = k mod 4 we would have a SINGLE REDUCER DOING ALL THE JOB!
This is undesired (we would be wasting parallelism)
NON-BALANCED HASHING can induce different processing time for some reduce task (called
SKEWNESS)
SKEWNESS can also be induced by different lengths of lists (less impacting though)
Real MR systems have predefined methods to (try to) achieve balance
SKEWNESS in lengths of lists must be handled by algorithm designer (we will see)
We will see sometimes might be helpful to do some CUSTOM HASHING
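A small illustration of the point above, with an assumed pathological input where every key is a multiple of 4: the default-style h(k) = k mod 4 sends everything to one reduce task, while an (illustrative) custom hash on the key's string spreads the load again.
```python
from collections import Counter

r = 4
keys = [4 * i for i in range(1, 1001)]   # pathological input: every key is a multiple of 4

naive  = Counter(k % r for k in keys)             # h(k) = k mod 4
custom = Counter(hash(str(k)) % r for k in keys)  # illustrative custom hashing on str(key)

print("h(k) = k mod 4 ->", dict(naive))   # all 1000 key-lists land on reduce task 0
print("custom hash    ->", dict(custom))  # load spread (roughly) evenly across the 4 tasks
```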
REDUCE TASKS, COMPUTE NODES, AND SKEW
SKEW: in general is the phenomenon such that we observe significant variation in the lengths
of the value lists for different keys (or in the number of key-lists per reducer)
different reducers take different amounts of time
SIGNIFICANT DIFFERENCE in the amount of time each reduce task takes
Proper hashing is super important
Defaults are fine for most data types (for most types of keys)
Sometimes we will have to enforce balance ourselves, by suited design
OVERHEAD AND PHYSICAL LIMITATIONS: there is overhead associated with each task we create
One way to REDUCE THE IMPACT OF SKEW can be using fewer Reduce tasks
If keys are sent RANDOMLY TO REDUCE TASKS, we can expect that there will be some averaging of the total time required by the different Reduce tasks
We can FURTHER REDUCE THE SKEW by using more Reduce tasks than compute nodes
In that way, long Reduce tasks might occupy a compute node fully, while several shorter
Reduce tasks might run sequentially at a single compute node
We will see how to evaluate performance formally
MORE RIGOROUS DEFINITION OF THE MAPREDUCE MODEL
INPUT TO A MAP REDUCE PROGRAM (also called Algorithm, Task or Job)
A set of key-value pairs I
DESIGNER CHOICES
1. DESIGN two algorithms:
MAP and REDUCE
That is how to break problem into sub–problems
2. SPECIFY inputs and data types
3. DEFINE (custom) hashing policies (optional)
[Figure: MapReduce — formal model perspective]
MAPREDUCE: SUMMARY OF BASICS
1. MAP PHASE
1.1 SOME NUMBER OF MAP TASKS each is given one or more chunks of input elements
1.2 EACH TASK turns chunks into collections of key-value pairs via Map Function – USER DEFINED
2. GROUP BY KEY PHASE: all outputs of Map formed into key-list-of-value pairs
PARTITIONED among reduce tasks by KEY
ALL VALUES HAVING SAME KEY wind up at the same reduce task
3. REDUCE PHASE
3.1 REDUCE TASKS work on one key at a time
3.2 COMBINE values into key-value pairs to form global output via Reduce Function – USER DEFINED
HOMEWORK
DESIGN AN ALGORITHM TO SOLVE THE FOLLOWING PROBLEM IN MR
INPUT: a repository of documents, each document is an element
PROBLEM: counting how many words of certain lengths exist in a collection of documents
That is: OUTPUT is a count for any possible length
Exercises on task granularity