Map Reduce

Last Time …
- MPI I/O
- Randomized Algorithms
- Parallel k-Select
- Graph coloring
- Assignment 2
- Questions?

Today …
- Map Reduce
  - Overview
  - Matrix multiplication
  - Complexity theory

MapReduce programming interface
- Two-stage data processing
  - Data can be divided into many chunks
  - A map task processes input data & generates local results for one or a few chunks
  - A reduce task aggregates & merges local results from multiple map tasks
- Data is always represented as a set of key-value pairs
  - The key supports grouping for the reduce tasks
  - Though a key is not always needed (for some applications, or for the input data), a consistent data representation simplifies the programming interface

Motivation & design principles
- Fault tolerance
  - Loss of a single node or an entire rack
  - Redundant file storage
- Files can be enormous
- Files are rarely updated
  - Read data, perform calculations
  - Append rather than modify
- Dominated by communication costs and I/O
  - Computation is cheap compared with data access
  - Running time is dominated by input size

Example
- Count the occurrences of individual words in a collection of web pages (see the sketch at the end of this overview)
- Map task: find words in one or a few files
  - Input: <key = page url, value = page content>
  - Output: <key = word, value = word count>
- Reduce task: compute total word counts across multiple files
  - Input/output: <key = word, value = word count>

Dependency in MapReduce
- Map tasks are independent from each other, so they can all run in parallel
- A map task must finish before the reduce task that processes its result
- In many cases, the reduce operation is commutative
- Acyclic graph model

Implementations
- Original MapReduce implementation at Google
  - Not publicly available
  - C/C++
- Hadoop
  - Open-source implementation – Yahoo!, Apache
  - Java
- Phoenix++
  - Open source – Stanford
  - C++, shared memory

Applications that don't fit
- MapReduce supports limited semantics
- The key to MapReduce's success is the assumption that the dominant part of data processing can be divided into a large number of independent tasks
- What applications don't fit this?
- Those with complex dependencies – Gaussian elimination, k-means clustering, iterative methods, n-body problems, graph problems, …
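To make the two-stage key-value interface and the word-count example above concrete before going into execution details, here is a minimal single-process sketch in plain Python. It is only a simulation of the map → group-by-key → reduce pipeline, not the Google or Hadoop API; the function names, the whitespace tokenizer, and the tiny sample pages are placeholder assumptions.

```python
from collections import defaultdict

def map_task(url, content):
    """Map: <key = page url, value = page content> -> list of (word, 1) pairs."""
    return [(word, 1) for word in content.split()]

def reduce_task(word, counts):
    """Reduce: sum the partial counts for one word."""
    return (word, sum(counts))

def word_count(pages):
    """Simulate the map -> group-by-key -> reduce pipeline on one machine."""
    grouped = defaultdict(list)                     # grouping normally done by the framework
    for url, content in pages.items():
        for word, count in map_task(url, content):  # map phase
            grouped[word].append(count)
    return dict(reduce_task(w, c) for w, c in grouped.items())  # reduce phase

# Two tiny placeholder "pages"
print(word_count({"a.html": "map reduce map", "b.html": "reduce"}))
# {'map': 2, 'reduce': 2}
```

In a real framework the grouping step (the `defaultdict` here) is performed by the system, which also hashes keys to reduce tasks and handles data movement between nodes.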
MapReduce
- Map: chunks from DFS → (key, value)
  - User code determines the $(k, v)$ pairs from the chunks (files/data)
- Sort: the $(k, v)$ pairs from each map task are collected by a master controller, sorted by key, and divided among the reduce tasks
- Reduce: work on one key at a time and combine all the values associated with that key
  - The manner of combination is determined by user code

MapReduce – word counting
- Input → set of documents
- Map:
  - Reads a document and breaks it into a sequence of words $w_1, w_2, \dots, w_n$
  - Generates $(k, v)$ pairs: $(w_1, 1), (w_2, 1), \dots, (w_n, 1)$
- System:
  - Groups all $(k, v)$ pairs by key
  - Given $r$ reduce tasks, assigns keys to reduce tasks using a hash function
- Reduce:
  - Combines the values associated with a given key
  - Adds up all the values associated with the word → total count for that word

Parallelism
- Reducers
- Reduce tasks
- Compute nodes
- Skew (load imbalance)
- Easy to implement
- Hard to get performance

Node failures
- Master node fails
  - Restart the MapReduce job
- Node with a Map worker fails
  - Redo all map tasks assigned to this worker
  - Set this worker as idle
  - Inform reduce tasks about the change of input location
- Node with a Reduce worker fails
  - Set the worker as idle

Matrix-vector multiplication
- $n \times n$ matrix $M$ with entries $m_{ij}$
- Vector $v$ of length $n$ with values $v_j$
- We wish to compute $x_i = \sum_{j=1}^{n} m_{ij} v_j$
- If $v$ can fit in memory
  - Map: generate $(i, m_{ij} v_j)$
  - Reduce: sum all values for key $i$ to produce $(i, x_i)$
- What if $v$ is too large to fit in memory? Stripes? Blocks?
- What if we need to do this iteratively?

Matrix-matrix multiplication
- $P = MN$, i.e. $p_{ik} = \sum_{j} m_{ij} n_{jk}$
- Two MapReduce operations
  - Map 1: produce $(k, v)$ pairs $(j, (M, i, m_{ij}))$ and $(j, (N, k, n_{jk}))$
  - Reduce 1: for each $j$ → $((i, k), m_{ij} n_{jk})$
  - Map 2: identity
  - Reduce 2: sum all values associated with key $(i, k)$

Matrix-matrix multiplication
- In one MapReduce step (a sketch follows the complexity example below)
- Map:
  - Generate $(k, v)$ pairs → $((i, k), (M, j, m_{ij}))$ for $k = 1, \dots, n$ and $((i, k), (N, j, n_{jk}))$ for $i = 1, \dots, n$
- Reduce:
  - Each key $(i, k)$ will have values $(M, j, m_{ij})$ and $(N, j, n_{jk})$ for all $j$
  - Sort all values by $j$
  - Extract $m_{ij}$ and $n_{jk}$, multiply, and accumulate the sum

Complexity theory for MapReduce

Communication cost
- The communication cost of a task is the size of the input to the task
- We do not consider the amount of time it takes each task to execute when estimating the running time of an algorithm
- The algorithm output is rarely large compared with the input or the intermediate data produced by the algorithm

Reducer size & replication rate
- Reducer size $q$
  - Upper bound on the number of values allowed to appear in the list associated with a single key
  - By making the reducer size small, we can force there to be many reducers
    - High parallelism → low wall-clock time
  - By choosing a small $q$ we can perform the computation associated with a single reducer entirely in the main memory of the compute node
    - Low synchronization (communication/I/O) → low wall-clock time
- Replication rate $r$
  - Number of $(k, v)$ pairs produced by all the Map tasks on all the inputs, divided by the number of inputs
  - $r$ is the average communication from Map tasks to Reduce tasks

Example: one-pass matrix multiplication
- Assume the matrices are $n \times n$
- $r$ – replication rate
  - Each element $m_{ij}$ produces $n$ keys
  - Similarly, each $n_{jk}$ produces $n$ keys
  - Each input produces exactly $n$ keys → load balance
- $q$ – reducer size
  - Each key has $n$ values from $M$ and $n$ values from $N$
  - $q = 2n$
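As a sanity check on the one-pass scheme and its $r = n$, $q = 2n$ analysis, here is a minimal single-process sketch in plain Python. It simulates one map/reduce round on one machine rather than running on a cluster; the function name and the dictionary-of-entries matrix representation are assumptions made for brevity, and matching on the shared index $j$ with dictionaries stands in for the "sort by $j$" step.

```python
from collections import defaultdict

def one_pass_matrix_multiply(M, N, n):
    """Simulate the one-pass scheme for P = MN on a single machine.
    M and N are dicts mapping (row, col) -> value for n x n matrices."""
    # Map: each m_ij is replicated to the n keys (i, k); each n_jk to the n keys (i, k).
    grouped = defaultdict(list)                  # key (i, k) -> list of tagged values
    for (i, j), m_ij in M.items():
        for k in range(n):
            grouped[(i, k)].append(("M", j, m_ij))
    for (j, k), n_jk in N.items():
        for i in range(n):
            grouped[(i, k)].append(("N", j, n_jk))

    # Reduce: for key (i, k), match m_ij with n_jk on the shared index j and accumulate.
    P = {}
    for (i, k), values in grouped.items():
        m_vals = {j: x for tag, j, x in values if tag == "M"}
        n_vals = {j: x for tag, j, x in values if tag == "N"}
        P[(i, k)] = sum(m_vals[j] * n_vals[j] for j in m_vals if j in n_vals)
    return P

# 2 x 2 example: M = [[1, 2], [3, 4]], N = [[5, 6], [7, 8]]
M = {(0, 0): 1, (0, 1): 2, (1, 0): 3, (1, 1): 4}
N = {(0, 0): 5, (0, 1): 6, (1, 0): 7, (1, 1): 8}
print(one_pass_matrix_multiply(M, N, 2))
# {(0, 0): 19, (0, 1): 22, (1, 0): 43, (1, 1): 50}
```

Note how each matrix entry is appended to exactly $n$ key lists (replication rate $n$) and each key list ends up with $2n$ entries (reducer size $2n$), matching the analysis above.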
Example: two-pass matrix multiplication
- Assume the matrices are $n \times n$
- $r$ – replication rate
  - Each element $m_{ij}$ produces 1 key
  - Similarly, each $n_{jk}$ produces 1 key
  - Each input produces exactly 1 key (2nd pass)
- $q$ – reducer size
  - Each key has $n$ values from $M$ and $n$ values from $N$
  - $q = 2n$ (1st pass), $n$ (2nd pass)

Real-world example: similarity joins
- Given a large set of elements $X$ and a similarity measure $s(x, y)$
- Output: pairs whose similarity exceeds a given threshold $t$
- Example: given a database of $10^6$ images of size 1 MB each, find pairs of images that are similar
  - Input: $(i, P_i)$, where $i$ is an ID for the picture and $P_i$ is the image
  - Output: $(P_i, P_j)$, or simply $(i, j)$, for those pairs where $s(P_i, P_j) > t$

Approach 1
- Map: generate $(k, v)$ pairs $((i, j), (P_i, P_j))$
- Reduce:
  - Apply the similarity function to each value (image pair)
  - Output the pair if its similarity is above the threshold $t$
- Reducer size: $q = 2$ (2 MB)
- Replication rate: $r = 10^6 - 1$
- Total communication from map to reduce tasks?
  - $10^6 \times 10^6 \times 10^6$ bytes $= 10^{18}$ bytes → 1 exabyte (kB, MB, GB, TB, PB, EB)
  - Communicated over GigE → $10^{10}$ sec → roughly 300 years

Approach 2: group images
- Group images into $g$ groups with $10^6 / g$ images each
- Map: take input element $(i, P_i)$ and generate $g - 1$ keys $(u, v)$, where $P_i \in$ group $u$ and $v \in \{1, \dots, g\} \setminus \{u\}$
  - The associated value is $(i, P_i)$
- Reduce: consider key $(u, v)$
  - The associated list will have $2 \times 10^6 / g$ elements $(i, P_i)$
  - Take each $(i, P_i)$ and $(j, P_j)$ where $i, j$ belong to different groups and compute $s(P_i, P_j)$
  - Comparing pictures belonging to the same group needs a heuristic for who does it, say the reducer for key $(u, u+1)$

Approach 2: group images
- Replication rate: $r = g - 1$
- Reducer size: $q = 2 \times 10^6 / g$
- Input size per reducer: $2 \times 10^{12} / g$ bytes
- Say $g = 1000$
  - Input per reducer is 2 GB
  - Total communication: $10^6 \times 999 \times 10^6 \approx 10^{15}$ bytes → 1 petabyte
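A minimal sketch, in the same plain-Python style as above, of the Approach 2 map function together with the $g = 1000$ communication arithmetic. The function name and the group assignment by image ID modulo $g$ are assumed placeholders (the slides do not fix how images are assigned to groups); the second half simply re-derives the numbers quoted above.

```python
def map_group_keys(i, P_i, g):
    """Approach 2 map: emit g-1 keys (u, v), pairing image i's group u with
    every other group v; the value is always the whole record (i, P_i).
    Group assignment by ID modulo g is an assumed placeholder."""
    u = i % g
    pairs = []
    for v in range(g):
        if v != u:
            key = (min(u, v), max(u, v))   # one reducer per unordered pair of groups
            pairs.append((key, (i, P_i)))
    return pairs

# Communication estimate for g = 1000 groups, 10**6 images of 10**6 bytes each
g, num_images, image_bytes = 1000, 10**6, 10**6
replication_rate = g - 1                             # r = g - 1 = 999
reducer_input = 2 * num_images // g * image_bytes    # q values of 1 MB each -> 2 GB
total_bytes = num_images * replication_rate * image_bytes
print(replication_rate, reducer_input, total_bytes)  # 999, 2*10**9, ~10**15 (about 1 PB)
```

Using the unordered pair (min(u, v), max(u, v)) as the key ensures that the images of both groups u and v meet at the same reducer, which is what gives the reducer size of $2 \times 10^6 / g$.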