BIG DATA: MODELS AND ALGORITHMS
Massively Parallel Computational Models
Mattia D'Emidio
Professor, Researcher @ UNIVERSITY OF L'AQUILA
email: mattia.demidio@univaq.it
web: www.mattiademidio.com
March 13, 2024

OVERVIEW
- Massively Parallel Computational Models
- Map and Reduce
- First MapReduce Algorithm

MASSIVELY PARALLEL COMPUTATIONAL MODELS

MPC MODELS: three (common) main ingredients
- DISTRIBUTED STORAGE INFRASTRUCTURE: Distributed File Systems
- PROGRAMMING/COMPUTATIONAL MODEL <= today: MapReduce, natively designed to run on DFS/CLUSTER-BASED SYSTEMS
- EXECUTION FRAMEWORK: TBD

WHAT MASSIVELY PARALLEL COMPUTATIONAL MODELS ARE
ONE AND MANY THINGS:
- A model of computation, with specific assumptions (that we shall see) on how to perform computations, including both abstract and lower-level details
- A paradigm/style of computing
- A programming model for expressing distributed computations at a massive scale
- An architectural framework to run software on clusters
- A design pattern for distributed algorithms to run on clusters
FIRST TO BE INTRODUCED AND TO BE SUCCESSFUL: the MAP-REDUCE model
MAP-REDUCE PROGRAM/ALGORITHM: a program or algorithm that follows the MR model of computation

The MAP-REDUCE MODEL and its EXTENSIONS/GENERALIZATIONS are used/deployed in most MODERN SYSTEMS FOR LARGE-SCALE PROCESSING AND STORAGE that rely on cloud/cluster technologies:
- to directly process large-scale distributed data
- as a general-purpose tool to design algorithms for clusters
- to deploy other, more complex technologies
SEVERAL EXAMPLES OF CLOUD/CLUSTER COMPUTING TECHNOLOGIES based on, inspired by, or adapted from Map-Reduce:
- Google's internal implementation
- Apache Hadoop
- Apache Hive, various NoSQL DBMSs
- Google Cloud Bigtable, BigQuery, DataProc, Amazon AWS, Microsoft Azure
- Apache Spark

DESIGN PRINCIPLES
SIMPLICITY OF MODEL: simple enough that MapReduce programs can be created via HIGHER-LEVEL PROGRAMMING (e.g. Python) even by non-experts of CS/CE -- a reason of its success
BASED ON STORING DATA IN A DISTRIBUTED FASHION
- Store data when and where they are produced
- Distribute/replicate for fault tolerance
- If you need to process data: start suited (divide-and-conquer) processes on the compute nodes where data are (already) stored
- Avoid unnecessary data movement
MOTTO: the DATACENTER (i.e. any cluster where you store data in a distributed fashion) is the MACHINE that gives you SUPERIOR COMPUTATIONAL PERFORMANCE
- In contrast to separate facilities for storage and computing
WHY: MOVING PROCESSING TO THE DATA is much cheaper/faster (at scale) than MOVING DATA TO PROCESSORS (if you have a suited framework supporting this)

MAP AND REDUCE

THINKING IN MAPREDUCE
GIVEN A COMPUTATIONAL PROBLEM you want TO SOLVE, all you NEED to do is:
1. ENCODE PROBLEM/INPUTS/OUTPUTS via a suited modeling of the data
   - the KEY-VALUE PAIR data model
   - ASSOCIATE a key to every piece of data you process, to move/store data in the cluster effectively
2. WRITE TWO FUNCTIONS (procedures) that define a solving strategy on data modeled as key-value pairs
   - MAP and REDUCE functions, executed in distributed fashion by each node of the cluster
   - functions are a way to break the problem into sub-problems
The KEY-VALUE PAIR MODEL for representing inputs and outputs of the problem is the NATIVE STRATEGY for supporting DISTRIBUTION OF DATA
The MAP AND REDUCE PARADIGM for defining a solving strategy for the problem is the NATIVE DESIGN for supporting/defining DISTRIBUTION OF COMPUTATION

ONCE THE PROBLEM (AND SOLUTION ALGORITHM) IS "CONVERTED" into the above form, it can be executed by any system based on the Map-Reduce model
- data movement, distribution of computations, fault tolerance and other issues are "automatically" handled by the system
- similar to classic algorithm design, where we think of a solution and then assume that instructions are executed by a machine according to some flow
WE SHALL ASSUME, in basic scenarios, that once (1) ENCODING and (2) DESIGN OF MAP/REDUCE FUNCTIONS have been done, there will be a system "taking care of the rest" when it comes to implementation and execution

HOW IS THIS POSSIBLE?
Real-world software systems implementing the MapReduce paradigm are built to offer DECOUPLING OF ALGORITHM DESIGN and ALGORITHM DEPLOYMENT, according to a rigid model
THE MODEL IMPOSES THAT:
- COMPUTATION is broken into PARALLEL EXECUTIONS of many (MAP AND REDUCE) functions (performed by tasks running at nodes)
- TASKS operate on portions of inputs and outputs encoded according to the KEY-VALUE PAIR MODEL
- EXECUTION/SCHEDULING OF TASKS and FAULT TOLERANCE are entirely handled "by the system"
We design how data are distributed and processed, then the system performs the computations
We want to ignore technical/hw/sw details when designing algorithms and programming clusters/cloud computing systems
- type of cluster/nodes, interconnection, operating systems, etc.
Let us see how

MAPREDUCE DATA MODEL
The "key notion" is the KEY
- everything in this domain is built around the "abstract" data type called the KEY-VALUE PAIR
- every piece of data we want to manipulate/process/store is converted/treated/represented in this form
We will use the FORMALISM ⟨k, v⟩ to denote a generic element of type key-value
- k is the KEY of this element
- v is the VALUE
Similar principle to associative arrays: we have a key to index/access data
KEYS AND VALUES k, v can be of any type (int, strings, tuples, lists)

WHY DO WE ASSOCIATE KEYS TO VALUES/DATA?
It is essential to EFFICIENTLY DISTRIBUTE computations and data movement operations
Keys are used for both TASK SCHEDULING and DATA MOVEMENT OPERATIONS
- DATA MOVEMENT RATIONALE: the key is used to identify the portion of data to be sent/assigned to a specific compute node (for storing/processing)
- TASK SCHEDULING RATIONALE: the key is used to determine which specific compute node is in charge of some computational sub-task and which data it should manipulate
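As a concrete illustration of the key-value data model, here is a minimal Python sketch (file names and contents are made up for illustration); it only shows how inputs can be represented as ⟨k, v⟩ pairs before any processing takes place.

# Hypothetical input: documents represented as key-value pairs.
# The key is a (made-up) document identifier, the value is the raw text.
documents = [
    ("doc1.txt", "big data models and algorithms"),
    ("doc2.txt", "map and reduce on big clusters"),
]

# Keys and values can be of any type: here a pair whose key is a tuple
# and whose value is a list, just to show the flexibility of the model.
other_pair = (("user", 42), ["click", "view", "click"])

for key, value in documents:
    print(key, "->", value)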
MAPREDUCE: STRUCTURE OF A PROTOTYPICAL COMPUTATION
1. DISTRIBUTE input to nodes (if not already distributed)
2. MAP phase (process, possibly redistribute)
3. REDUCE phase (process)
4. COLLECT output from nodes (if necessary, to aggregate on a single machine)
REMARK ON #1 AND #4: depending on the application, data might already be distributed (these two steps might be optional)
- this is the point of "bringing the computation to where data is"
- applications often already rely on a DFS, i.e. on data being stored in a distributed fashion to achieve more "storage" and easier fault tolerance/redundancy
- if data is already distributed, we process it where it is; otherwise, we distribute it first

MAPREDUCE: MORE DETAILS ON THE STRUCTURE OF A PROTOTYPICAL COMPUTATION
BASIC PROCESSING FLOW
1. READ your (potentially large) INPUT
   - put it in a DFS if it is not already distributed (and if necessary)
2. EXECUTE a so-called MAP PHASE that converts each input element into (one or more) ⟨key, value⟩ pairs
   - OUTPUT: collection of ⟨key, value⟩ pairs
3. PERFORM a so-called GROUP PHASE by key on the outputs of the previous phase
   - OUTPUT: sets of ⟨key, [listOfValues]⟩ pairs
   - for each pair ⟨k, L⟩ the list of values L contains all values associated to key k
4. EXECUTE a so-called REDUCE PHASE on the outputs of the previous phase
   - combines values associated to a same key according to some policy
   - OUTPUT: sets of ⟨key, value⟩ pairs
5. WRITE your (potentially large) OUTPUT
   - write it back sequentially if necessary, or leave it distributed

STRUCTURE OF A MAPREDUCE COMPUTATION: GLOBAL VIEWPOINT
(figure: global view of the distribute / map / group / reduce / collect flow summarized in the four steps above)
Suited for, but not limited to, applications that already rely on data produced in a distributed fashion
It should be evident at this point that THIS FLOW REQUIRES SYNCHRONIZATION between nodes
- move data, process, move data back
We make some ASSUMPTIONS ON SYNCHRONIZATION and forget about it
- these assumptions are then GUARANTEED AT THE IMPLEMENTATION LEVEL by systems implementing the MapReduce model
- all synchronization steps are COORDINATED BY SOME "MASTER ENTITY" according to user settings and available hardware
- we will see: an orchestrator process; more details later
A (WORKING) MAP-REDUCE SYSTEM is a very complex system (several interacting components)

MAP PHASE: DETAILS
The MAP PHASE is performed by a number of MAP TASKS
- each task runs on one compute node of the cluster
- each task operates on one piece (CHUNK) of the global input (a set of elements)
- each task executes repeated MAP FUNCTIONS on the assigned chunk
A MAP TASK
- PROCESSES an input chunk in key-value form
- PERFORMS repeated executions of the MAP FUNCTION on the chunk
  - no fixed policy for assigning chunks to nodes: data is distributed (in general) arbitrarily and cannot be decided
- TURNS the chunk into other key-value pairs (writes INTERMEDIATE data)
A MAP FUNCTION specifies the way output key-value pairs are produced from input elements
- this is established by the CODE OF THE MAP FUNCTION: PROBLEM-DEPENDENT and hence user defined, part of the algorithm design

MAP FUNCTION: FORMALLY
The MAP FUNCTION can be formalized as follows:
∀ ⟨k, v⟩ ∈ I:  MAP(⟨k, v⟩) → ⟨k′, v′⟩*
INPUT: key-value pairs ⟨k, v⟩ ∈ I
- portions of the input I can be processed by different map tasks
- e.g. k is a filename, v is a line of the file or its whole content
- ONE MAP FUNCTION CALL per pair ⟨k, v⟩
LOCAL OUTPUT: a set of key-value pairs ⟨k′, v′⟩* (zero or more), for each input key-value pair
- e.g. word-occurrence pairs for each document
GLOBAL MAP OUTPUT: a set I′ of key-list-of-values pairs (after grouping)
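A minimal Python sketch of a map function with this signature; the generator-based style (one yield per emitted pair) is an assumption of this sketch, not a requirement of the model, and the word-splitting logic anticipates the WordCount example treated later.

from typing import Iterable, Tuple

def map_fn(key: str, value: str) -> Iterable[Tuple[str, int]]:
    """One call per input pair <k, v>: here k is a (hypothetical) filename
    and v a line of text; emits zero or more <k', v'> pairs."""
    for word in value.split():
        yield (word, 1)  # EMIT <word, 1>: "an occurrence has been found"

# Local output for a single input element:
print(list(map_fn("doc1.txt", "big data big models")))
# [('big', 1), ('data', 1), ('big', 1), ('models', 1)]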
ON THE INPUT CHUNKS TO MAP TASKS
- the KEY-VALUE PAIRS that are given as input to a map task are called ELEMENTS
- an ELEMENT can be of any type (tuple, document, data structure, string)
- a CHUNK is a collection of elements (the local input)
For simplicity we will always assume the input is STORED IN KEY-VALUE FORM
- if not, we can treat it as key-value pairs by considering random/empty keys
- e.g. ⟨randomkey1, document_a⟩, ⟨randomkey2, document_b⟩, ...
NOTATION: when we do not care about input/output elements' keys we use the notation ⟨∗, document_y⟩
- keys are usually not really relevant to the design in general
- EXCEPTIONS: specific optimizations to consider, or more "computations to concatenate" (we will see)
- convenient to treat inputs this way FROM THE SYSTEM/DESIGN VIEWPOINT (for analysis and implementation purposes)
DESIGN CONSTRAINT: no input element is ever shared across two chunks
- each element is processed by one and only one map task
- this subdivision of the input is called a PARTITION

ON THE OUTPUTS OF MAP TASKS
The MAP FUNCTION takes elements and produces, per element, (ZERO OR MORE) KEY-VALUE PAIRS
- types of keys/values can be arbitrary; they are problem-dependent
- keys here are not keys in the traditional sense: they do not have to be unique, there can be many key-value pairs with the same key
A MAP TASK performs several applications of the map function to its inputs
- the output of a map task per input chunk is a COLLECTION OF KEY-VALUE PAIRS
- since there is no uniqueness, we can have several key-value pairs sharing the same key, even from the same element
- the output is stored locally at the node
The OUTPUT OF THE MAP PHASE is the collection of ALL KEY-VALUE PAIRS GENERATED BY ALL MAP TASKS
- several key-value pairs, many of which can have the same key
- this output is distributed across nodes

GROUP PHASE
Across different map tasks located on different nodes we can observe collections with SEVERAL KEY-VALUE PAIRS HAVING THE SAME KEY
- for most problems, it is necessary to aggregate such values to solve the global problem
GROUP PHASE: these data are GROUPED BY KEY (virtually, through the application of a proper function on the key)
- GROUPING converts the outputs of the maps into a key-list-of-values representation, denoted by ⟨k, L⟩
- the LIST OF VALUES L associated to a UNIQUE KEY k during the group phase contains all values associated to k after the map phase
- i.e. ALL VALUES IN THE OUTPUTS of the map phase (which, recall, is distributed) having the same key
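A tiny Python sketch of what grouping by key does to the map outputs; it is simulated here in a single process, whereas in a real system this happens across the cluster during the shuffle.

from collections import defaultdict

# Hypothetical (distributed) output of the map phase, flattened here.
map_output = [("big", 1), ("data", 1), ("big", 1), ("models", 1)]

# Group by key: build <k, L> pairs, where L collects all values with key k.
groups = defaultdict(list)
for key, value in map_output:
    groups[key].append(value)

print(dict(groups))
# {'big': [1, 1], 'data': [1], 'models': [1]}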
AFTER GROUPING (or, more precisely, DURING GROUPING) data are TRANSMITTED OVER THE NETWORK TO BE FED TO REDUCE TASKS
- this is a system-wide activity that involves all outputs of all map tasks
- data is "redistributed" across the nodes of the cluster to achieve global solutions
- key-lists are reassigned to nodes and transmitted over the network
HOW DATA ARE REASSIGNED
- a ONE-TO-ONE CORRESPONDENCE between NODES and KEYS is established
- EACH NODE processes all values HAVING THE SAME KEY (possibly for many keys), to COMBINE/AGGREGATE THEM IN SOME WAY and obtain the (global) output
- this is established by the code of the REDUCE FUNCTION

HOW IS THIS ONE-TO-ONE ASSIGNMENT DETERMINED?
In any system implementing the MapReduce model there exists a PARTITIONER COMPONENT
PARTITIONER: a process, running on each node, that decides which node will be responsible for processing a specific key and all its values
- according to some PARTITIONING POLICY (more details will follow)
- typically, via the use of HASH FUNCTIONS on the keys; more details later
The partitioning policy applied during grouping is often referred to as the SHUFFLE PHASE or shuffling (since data are shuffled, i.e. redistributed, across the cluster as a consequence of partitioning/grouping)
SORTING on keys is typically also included
- if a node receives multiple key-lists, these are processed in sorted order by key

REDUCE PHASE
Performed by a number of REDUCE TASKS, each running on one compute node of the cluster
A REDUCE TASK
- PROCESSES (RECEIVES) all data associated to a key outputted by the map tasks
- PERFORMS repeated executions of the REDUCE FUNCTION on such key-list-of-values pairs
- TURNS each key-list-of-values pair into other key-value pairs
The way key-value pairs are produced is DETERMINED by the CODE OF THE REDUCE FUNCTION
- which is again PROBLEM-DEPENDENT and hence USER DEFINED; another part of the algorithm design
The OUTPUT of each reduce task is a collection of key-value pairs
- if the algorithm is properly designed, together they form the GLOBAL OUTPUT to the original problem

REDUCE FUNCTION: FORMALLY
The REDUCE FUNCTION can be formalized as follows:
∀ ⟨k′, ⟨v′⟩*⟩ ∈ I′:  REDUCE(⟨k′, ⟨v′⟩*⟩) → ⟨k′, v′′⟩*
INPUT: key-list-of-values pairs ⟨k′, ⟨v′⟩*⟩ ∈ I′
- values of the global output of the map phase sharing the same key; partitioning can be customized
- ONE REDUCE FUNCTION CALL per key k′: the values v′ having key k′ are reduced and processed together
LOCAL OUTPUT: a key-value pair whose key is k′ and whose value is the result of the reduce function on the associated list
GLOBAL OUTPUT: a set of key-value pairs ⟨k′, v′′⟩* (zero or more) that are COLLECTED and RETURNED
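Mirroring the sketch of the map function above, a minimal Python sketch of a reduce function with this signature; the summation is just one possible aggregation policy (the one WordCount will use), not something the model mandates.

from typing import Iterable, List, Tuple

def reduce_fn(key: str, values: List[int]) -> Iterable[Tuple[str, int]]:
    """One call per key-list pair <k', [v'...]>: combines all values
    associated to key k' and emits zero or more <k', v''> pairs."""
    yield (key, sum(values))

# One reducer call on a grouped pair <'big', [1, 1]>:
print(list(reduce_fn("big", [1, 1])))
# [('big', 2)]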
TYPICAL USE CASE 1/4
- we have DATA DISTRIBUTED ACROSS NODES (produced by some large-scale application)
- we want to SOLVE some COMPUTATIONAL TASK on such data
- we want to PROCESS DATA DISTRIBUTEDLY to achieve superior performance
- we thus design a DISTRIBUTED ALGORITHM: divide-and-conquer philosophy, several small tasks to solve a big problem
DESIGNING A DISTRIBUTED ALGORITHM is a challenging step
- several aspects to consider (synchronization, data movement, task assignments, etc.)
- FOLLOWING THE MAP-REDUCE ASSUMPTIONS helps (rigid model, implementations must adhere to it)

TYPICAL USE CASE 2/4
To this aim we first DESIGN A MAP PHASE accordingly
- it CONVERTS chunks of input residing at nodes into key-value pairs
- it is PERFORMED BY MAP TASKS at each node, each of which executes map functions on the elements of its chunks
- the code of the function is PART OF THE ALGORITHMIC DESIGN (PSEUDOCODE defined by the user)
- the OUTPUT key-value pairs are stored locally at each node (INTERMEDIATE DATA)
- HOW MANY MAP/REDUCE TASKS to run can be customized; it depends on available resources, data and computational goals (the system takes care of it)

TYPICAL USE CASE 3/4
We have the RESULTS OF THE MAP PHASE (INTERMEDIATE DATA), distributed across nodes
- we want to process them to solve the global problem
- intermediate data are grouped and aggregated/partitioned by key
- this is a system-wide activity that involves all outputs of all map tasks
- values having the SAME KEY are sent to/processed by the same reduce task
- map outputs (residing on nodes) are sent to other nodes using the policy "ALL DATA WITH THE SAME KEY MUST BE SENT TO THE SAME NODE"
- use of HASH FUNCTIONS ON THE KEYS to send data to the proper, same node (we will see)

TYPICAL USE CASE 4/4
We have the RESULTS OF THE MAP PHASE, GROUPED AND SENT to compute nodes according to the policy "ALL DATA WITH THE SAME KEY MUST BE SENT TO THE SAME NODE"
- we want to PROCESS SUCH DATA DISTRIBUTEDLY to obtain the final output
Hence we accordingly DESIGN A REDUCE PHASE that processes these lists of map outputs, aggregated by key, and converts them into other key-value pairs
- PERFORMED BY REDUCE TASKS at each node, each of which executes reduce functions on the lists of values associated to keys
- the CODE OF THE FUNCTION is part of the algorithmic design (pseudocode by the user)
- reduced outputs (key-value pairs) can either remain local at each node or be stored back to the DFS; aggregated together they form the FINAL OUTPUT DATA, if the design is done correctly
- how many reduce tasks to run can be customized; it depends on available resources, data and computational goals (the system takes care of it)

FIRST MAPREDUCE ALGORITHM

LEARNING BY EXAMPLE
We design an MR algorithm to solve WORDCOUNT
- we assume the input is a HUGE REPOSITORY OF TEXT DOCUMENTS; we want to COUNT THE OCCURRENCES OF WORDS
- KILLER APP: web analytics
We need to define the PROBLEM formally
- INPUT: a collection of documents (containing sequences of words -- strings)
- DESIRED OUTPUT: some data structure storing, for each word, the number of occurrences of the word in the collection
Then we CONVERT the problem into the "Map Reduce Paradigm", in order to design a proper distributed algorithm based on divide and conquer

INPUT: a collection d of documents
- we can think of each document as an ELEMENT, and of each CHUNK as a set of elements
- the KEYS of the inputs here are irrelevant (could be a document id), we do not care here
- the VALUE of an input is a document
OUTPUT: for each word we need its number of occurrences
- KEYS can be of type String (WORDS)
- VALUES can be of type integer (OCCURRENCES OF WORDS)

DEFINE THE MAP FUNCTION
The MAP FUNCTION ALGORITHM
- manipulates input elements from some chunk
- produces some PARTIAL (INTERMEDIATE) OUTPUT in key-value form
MAP FUNCTION ALGORITHM (in this case)
1. READ an element (a document)
2. BREAK it into a sequence of words w1, w2, ..., wn
3. EMIT (i.e. return as output, write intermediate data) a key-value pair ⟨wi, 1⟩ for each encountered word wi
   - the key is the word itself, wi
   - the value is 1, meaning "an occurrence has been found"
NOMENCLATURE REMARK: the keyword EMIT is used to denote "return", "produce as output"
- used since the data is eventually transmitted over the network
(A Python rendering of the full algorithm is sketched after the pseudo-code slide below.)
MAP TASK
The RESULT OF PROCESSING A CHUNK is hence the EMISSION of a collection of pairs (stored locally, then transmitted over the network): ⟨w1, 1⟩, ⟨w2, 1⟩, ..., ⟨wn, 1⟩, one pair per word in the document
Since EACH MAP TASK here processes many ELEMENTS (which are documents), the OUTPUT OF A MAP TASK will be a collection of such pairs, one collection per processed document
- if word m appears k times in some document Di, the map task that processes element Di will produce k key-value pairs ⟨m, 1⟩ in the emitted sequences
NOMENCLATURE REMARK: the result of the application of the map function to a single element is called a MAPPER

MAP PHASE
(figure: input key-value pairs are turned into intermediate key-value pairs by the map functions)

REDUCE FUNCTION
AFTER THE MAP PHASE: all emitted key-value pairs (remark: stored as intermediate data at compute nodes) undergo the GROUP & SORT phase
- all elements sharing the same key are GROUPED TOGETHER into lists of values
- this collection of key-list-of-values pairs represents the INPUT FOR THE NEXT (REDUCE) PHASE
- if multiple key-lists are given to a reduce task, these are sorted by key (can be exploited for algorithmic design, we will see)
INPUT
- a KEY k
- a LIST OF VALUES associated to key k: the set of all values emitted by ALL map tasks for key k
OUTPUT
- a SEQUENCE of (zero or more) key-value pairs
- the DATA TYPES can differ from those of the map function and of the input; often they are the same, it depends on the PROBLEM
The REDUCE FUNCTION ALGORITHM here
- manipulates key-list-of-values pairs ⟨k, [l1, l2, l3, ..., ln]⟩
- produces the FINAL OUTPUT of the problem in key-value form
REDUCE FUNCTION ALGORITHM (in this case)
1. COMPUTE, for each key k, the value OCCURRENCE_k = Σ_{i=1}^{n} l_i (the sum of all values in the list associated to k)
2. EMIT ⟨k, OCCURRENCE_k⟩
NOMENCLATURE REMARK: the result of the application of the reduce function to a single key and its associated values is called a REDUCER

REDUCE PHASE
(figure: intermediate key-value pairs are grouped by key into key-value groups, then turned into output key-value pairs by the reduce functions)

GROUPING BY KEY PHASE
(figure)

MAPREDUCE FOR WORD COUNTING (worked example)
- MAP: read the input and produce a set of key-value pairs (provided by the programmer)
- GROUP BY KEY: collect all pairs with the same key (handled by the system; only sequential reads, data is read sequentially)
- REDUCE: collect all values belonging to the key and output (provided by the programmer)
Big document (excerpt): "The crew of the space shuttle Endeavor recently returned to Earth as ambassadors, harbingers of a new era of space exploration. Scientists at NASA are saying that the recent assembly of the Dextre bot is the first step in a long-term space-based man/machine partnership. ..."
MAP output: (The, 1) (crew, 1) (of, 1) (the, 1) (space, 1) (shuttle, 1) (Endeavor, 1) (recently, 1) ...
GROUPED: (crew, 1) (crew, 1) (space, 1) (the, 1) (the, 1) (the, 1) (shuttle, 1) (recently, 1) ...
REDUCE output: (crew, 2) (space, 1) (the, 3) (shuttle, 1) (recently, 1) ...
[Source: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org]

PSEUDO-CODE OF THE WORD COUNT ALGORITHM
HOW A MAPREDUCE ALGORITHM is defined: pseudo-code (reasonably pseudo)
- INPUT FOR MAP: elements of the assigned chunk
- INPUT FOR REDUCE: key-lists generated by the maps
(figure: pseudo-code of the word count map and reduce functions)
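Since the original pseudo-code is given as a figure, here is one possible rendering of the whole WordCount algorithm as a small, self-contained Python simulation; the in-memory shuffle stands in for what a real MapReduce system would do across the cluster, and the input documents are made up for illustration.

from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

def map_wordcount(key: str, document: str) -> Iterable[Tuple[str, int]]:
    # READ the element, BREAK it into words, EMIT <word, 1> per word.
    for word in document.split():
        yield (word, 1)

def reduce_wordcount(key: str, values: List[int]) -> Tuple[str, int]:
    # COMPUTE OCCURRENCE_k as the sum of the list, EMIT <k, OCCURRENCE_k>.
    return (key, sum(values))

# Hypothetical input chunk: keys are irrelevant document ids.
chunk = [("d1", "the crew of the space shuttle"),
         ("d2", "the space shuttle returned to earth")]

# MAP phase (one mapper per element).
intermediate = [pair for k, v in chunk for pair in map_wordcount(k, v)]

# GROUP (and sort) by key -- done by the system in a real deployment.
groups: Dict[str, List[int]] = defaultdict(list)
for word, one in intermediate:
    groups[word].append(one)

# REDUCE phase (one reducer per key), processed in sorted key order.
output = [reduce_wordcount(word, counts) for word, counts in sorted(groups.items())]
print(output)  # e.g. [('crew', 1), ('earth', 1), ..., ('the', 3), ('to', 1)]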
OUTPUT OF THE REDUCERS
AT THE END OF THE REDUCE PHASE
- the outputs from all the reduce tasks are COLLECTED AND MERGED INTO A SINGLE FILE
- this will contain the OUTPUT OF THE GLOBAL PROBLEM
- it is returned to the controller for storage

MAPREDUCE: FUNCTIONS VIEWPOINT (WORDCOUNT)
(figure)

MAPREDUCE: FUNCTIONS VIEWPOINT (GENERAL)
(figure)

TASKS VS KEYS
DESIGN CONSTRAINT OF THE MODEL: each reduce function processes ONE AND ONLY ONE key-list ⟨k, L⟩
- ALL VALUES having the same key after the map phase are formed into a list
- one reduce task repeatedly processes several of these lists
THERE IS A MASTER CONTROLLER PROCESS running in the cluster that coordinates all operations by default
- such defaults can be customized for optimization purposes
- for instance, the master process will know HOW MANY REDUCE TASKS WE CAN RUN, say r
- typically r is USER DEFINED and TUNED DEPENDING ON THE HARDWARE
HOW DO WE ASSIGN LISTS TO REDUCE TASKS?
- if r is LARGER THAN THE MAXIMUM NUMBER OF KEYS we are already fine, in the sense that we can assign each key to a single node
- this is an IDEAL SITUATION THAT NEVER HAPPENS
- hence, otherwise, key-lists MUST BE PARTITIONED, several per reducer
- this is done by HASHING (hash functions are precisely many-to-one functions)
- hash functions can also be used to decide which values go where

Partitioning policies use HASH FUNCTIONS ON KEYS
- a many-to-one function h : U → {0, ..., r − 1} to partition key-lists
- U: the universe of keys; r an integer, typically r ≪ |U|
- each key is HASHED into an integer from 0 to r − 1, called the BUCKET NUMBER or HASH VALUE

MAPREDUCE: PARTITIONING VIEWPOINT
HOW DATA ARE SENT TO THE NODES performing reduce tasks:
- each map task writes its outputs locally (a collection of key-value pairs)
- keys of the collection are transmitted to one of the r local files of the nodes that will be executing reduce tasks, according to the RESULT OF THE HASHING (application of the partitioning function)
- each result is DESTINED for one of the reduce tasks
- stored in sorted order of reception
(figure: partitioning viewpoint)

WHAT HASHING TO USE?
- not easy to decide, it depends on the data
- several methods; known implementations include PREDEFINED, READY-TO-USE HASH FUNCTIONS
- optionally the user can SPECIFY THEIR OWN HASH FUNCTION, or other methods for assigning MANY KEYS TO A SINGLE REDUCE TASK
REDUCE TASKS: EFFECTS OF HASHING
FOR EACH KEY k OF SOME OUTPUT FROM THE MAP TASKS
- the hash value of k defines the id of the reduce task in charge of the values of that key
- the input to REDUCE TASK h(k) will be the KEY-LIST ⟨k, [v1, v2, ..., vn]⟩, where (k, v1), (k, v2), ..., (k, vn) is the SET OF ALL THE KEY-VALUE PAIRS sharing the same key k in the output of the map phase and h(k) is the so-called BUCKET NUMBER of k
- the output of the map phase with key k is SENT to the reducer associated to BUCKET NUMBER h(k) (the reduce task with id h(k))
It is easy to see that PROPER HASHING PLAYS A FUNDAMENTAL ROLE in achieving GOOD PERFORMANCE (low completion time)
- defaults are OK for most data types (for most types of keys)

EXAMPLE: EFFECTS OF HASHING
- assume we have 4 nodes running 4 REDUCE TASKS in our system
- assume we know that AT MOST 2^10 KEYS out of the map tasks (which can be many more) will have to be handled (say any key can be an integer in [0, 2^10 − 1])
- if we USE the hash function h(k) = k mod 4, we know any key in the range will be UNIQUELY ASSIGNED to the task with bucket number h(k)
- specifically, EACH REDUCE TASK will have to handle (at most) 2^10 / 2^2 = 2^8 key-lists
- task 0 handles all keys k whose h(k) is 0
- it could be fewer, depending on the input actually occurring: this is a worst-case prediction
(a code sketch of this partitioning policy follows after the discussion of skew below)
REMARK: even if we distribute keys evenly, we do not know the sizes of the lists
- skewness might happen nonetheless
- balance sometimes cannot be guaranteed before execution
- probability can help in estimating, since we generally do not know precise distributions beforehand

REDUCE TASKS: EFFECTS OF HASHING
TAKE-HOME MESSAGE: different hashing strategies induce different performance!
- they induce a DIFFERENT NUMBER OF LISTS PER REDUCE TASK
- for instance: suppose that all keys in our input are multiples of 4; by using h(k) = k mod 4 we would have a SINGLE REDUCER DOING ALL THE JOB!
- this is undesired (we would be wasting parallelism)
- NON-BALANCED HASHING can induce different processing times for some reduce tasks (called SKEWNESS)
- SKEWNESS can also be induced by different lengths of the lists (less impacting, though)
- real MR systems have predefined methods to (try to) achieve balance
- SKEWNESS in the lengths of lists must be handled by the algorithm designer (we will see)
- we will see that sometimes it might be helpful to do some CUSTOM HASHING

REDUCE TASKS, COMPUTE NODES, AND SKEW
SKEW: in general, the phenomenon in which we observe significant variation in the lengths of the value lists for different keys (or in the number of key-lists per reducer)
- different reducers take different amounts of time: a SIGNIFICANT DIFFERENCE in the time each reduce task takes
- proper hashing is very important
- defaults are fine for most data types (for most types of keys)
- sometimes we will have to enforce balance by a suited design
OVERHEAD AND PHYSICAL LIMITATIONS: there is overhead associated with each task we create
- one way to REDUCE THE IMPACT OF SKEW can be to use fewer reduce tasks
- if keys are sent RANDOMLY TO REDUCE TASKS, we can expect some averaging of the total time required by the different reduce tasks
- we can FURTHER REDUCE THE SKEW by using more reduce tasks than compute nodes: in that way, long reduce tasks might occupy a compute node fully, while several shorter reduce tasks might run sequentially at a single compute node
- we will see how to evaluate performance formally
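A minimal sketch of the hash-based partitioning policy just discussed, assuming r = 4 reduce tasks and integer keys as in the example above (a real partitioner would of course run distributedly and typically hash arbitrary key types).

from collections import defaultdict

R = 4  # number of reduce tasks (bucket numbers 0 .. R-1)

def partition(key: int) -> int:
    # h(k) = k mod R: maps every key to a bucket / reduce-task id.
    return key % R

# Hypothetical keys coming out of the map phase.
keys = [0, 3, 7, 8, 12, 12, 513, 1023]

buckets = defaultdict(list)
for k in keys:
    buckets[partition(k)].append(k)

for task_id in range(R):
    print(f"reduce task {task_id} handles keys {buckets[task_id]}")
# If all keys were multiples of 4, every key would land in bucket 0
# and a single reducer would do all the work (skew).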
MORE RIGOROUS DEFINITION OF THE MAPREDUCE MODEL
INPUT TO A MAP-REDUCE PROGRAM (also called Algorithm, Task or Job): a set of key-value pairs I
DESIGNER CHOICES
1. DESIGN two algorithms, MAP and REDUCE: that is, how to break the problem into sub-problems
2. SPECIFY inputs and data types
3. DEFINE (custom) hashing policies (optional)

MAPREDUCE: FORMAL MODEL PERSPECTIVE
(figure)

MAPREDUCE: SUMMARY OF BASICS
1. MAP PHASE
   1.1 SOME NUMBER OF MAP TASKS: each is given one or more chunks of input elements
   1.2 EACH TASK turns chunks into key-value pairs via the Map Function -- USER DEFINED
2. GROUP BY KEY PHASE: all outputs of Map are formed into key-list-of-values pairs
   - PARTITIONED among reduce tasks by KEY
   - ALL VALUES HAVING THE SAME KEY wind up at the same reduce task
3. REDUCE PHASE
   3.1 REDUCE TASKS work on one key at a time
   3.2 COMBINE values into key-value pairs to form the global output via the Reduce Function -- USER DEFINED

HOMEWORK
DESIGN AN ALGORITHM TO SOLVE THE FOLLOWING PROBLEM IN MR
- INPUT: a repository of documents; each document is an element
- PROBLEM: count how many words of each length exist in the collection of documents
- that is: the OUTPUT is a count for every possible length
- exercises on task granularity