NoSQL Topic: Google’s MapReduce
PRESENTERS: SANJANA H (20MCAR0046), KAVITHA N (20MCAR0031)

Google File System

Introduction to Google File System
Google File System (GFS) is a scalable distributed file system (DFS) created by Google Inc. to accommodate Google’s expanding data processing requirements. GFS provides fault tolerance, reliability, scalability, availability and performance to large networks of connected nodes. It capitalizes on the strength of off-the-shelf servers while minimizing the impact of hardware weaknesses.

Distributed File System
In the Google File System, each cluster is made up of hundreds or thousands of commodity servers, and clients use the cluster to read and write files in a fault-tolerant and reliable manner. GFS is built from several storage systems of low-cost commodity hardware components. It is optimized for Google’s varied data use and storage needs, such as its search engine, which generates huge amounts of data that must be stored.

Use Case: Google music team
Using GFS, process the analytics or logs of every time a particular song was played, and find the top and trending songs from these analytics.

GFS Master
All the metadata about the cluster (which files exist, how many parts each file is divided into, and where each of those parts is located) is stored by a component called the GFS master.

Chunk Servers
File (var/foo): the files stored in this cluster are generally very large, so each file is split into multiple parts. Each part is called a chunk, and these chunks are spread over the multiple commodity servers within the cluster.

Drawbacks:
- Files are very large (hundreds of MB to multiple GB each).
- Pulling all these files across the network to one machine can stress network bandwidth very severely.
- A single client application trying to process multiple terabytes of files consumes a lot of time.

Solution: Bring the processing function to the data instead of bringing the data to the client
The client defines a function that counts the plays of each song and passes it along to the servers that store the file chunks. Each of those servers applies the function to the chunks it holds, so all the processing happens in parallel while the client application waits. Once the processing is complete, each server sends its result back to the client application, which aggregates the individual results from all the servers.

MapReduce

What is MapReduce?
- A programming model (and its associated implementation)
- For processing large data sets
- Exploits a large set of commodity computers
- Executes processes in a distributed manner
- Offers a high degree of transparency
In other words: simple, and maybe suitable for your tasks!
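To make the map/reduce split concrete before the formal definition, here is a minimal single-process sketch in Python of the music-team use case. It is an illustration of the model, not Google’s implementation: map_fn, reduce_fn, run_map_reduce, and the "song_id,timestamp" log format are all hypothetical names chosen for this example, and the in-memory loop stands in for work that a real cluster distributes across the servers holding the chunks.

```python
from collections import defaultdict

def map_fn(log_line):
    """Map: emit a (song_id, 1) pair for every play record in the log."""
    song_id, _timestamp = log_line.split(",", 1)
    yield (song_id, 1)

def reduce_fn(song_id, counts):
    """Reduce: sum the partial counts for one song."""
    return (song_id, sum(counts))

def run_map_reduce(chunks):
    """Stand-in for the framework: run map over every chunk, group the
    intermediate pairs by key, then run reduce once per key. On a real
    cluster each chunk would be mapped on the server that stores it."""
    intermediate = defaultdict(list)
    for chunk in chunks:                       # one chunk per server
        for line in chunk:
            for key, value in map_fn(line):
                intermediate[key].append(value)
    return dict(reduce_fn(k, v) for k, v in intermediate.items())

if __name__ == "__main__":
    chunks = [
        ["songA,2021-01-01", "songB,2021-01-01"],  # chunk held by server 1
        ["songA,2021-01-02", "songA,2021-01-03"],  # chunk held by server 2
    ]
    counts = run_map_reduce(chunks)
    # Most-played songs first: [('songA', 3), ('songB', 1)]
    print(sorted(counts.items(), key=lambda kv: -kv[1]))
```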
MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a Map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a Reduce function that merges all intermediate values associated with the same intermediate key. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines; the run-time system takes care of the details of partitioning the input data and scheduling the program’s execution.

The programming model where, instead of pulling all the data out of multiple servers, you push your code onto those servers and ask them to process or run that code, is called MapReduce. MapReduce has two parts: the first is the map function, the processing function that the client passes in; the second is the reduce function, which aggregates the results of each of the individual servers.

Master Server
The Master Server takes care of all the coordination. Using GFS, the master finds out on which servers the files, or parts of those files, are stored. On each of those servers it starts a separate worker process, and each worker process takes the client’s function and starts mapping it over the local chunks.

Functioning of the Master Server
The master keeps track of how much progress has happened by using heartbeats: each worker keeps sending heartbeat data along with the percentage of progress made so far. The workers do not send their results directly to the master; instead, they save them on their local hard disks. The chunk servers send the result file name and path to the master, and the master collects the file paths of the results from all of these servers. Once the processing is complete, the partial results of each server sit on that server’s local hard disk.

Examples

Inverted Web Index with MapReduce
Map: for all crawled documents, emit a (word, document-id) pair for each word.
Reduce: for all mappings, aggregate the results with the same key, turning (word, document-id) pairs into (word, list<document-id>).
The work is split across workers: map worker-1 and map worker-2 each process a subset of the documents, and reduce worker-1 and reduce worker-2 each aggregate a subset of the words. (A sketch of this example follows at the end of this section.)

Fault Tolerance & Distributed Processing
If a worker fails during a MapReduce job, the master learns of it through the missing heartbeats and assigns some other worker on some other server to redo the same processing. And because the work runs in parallel across many servers, a MapReduce job is much faster than sequential processing.
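As a companion to the inverted web index example above, here is the same kind of single-process sketch in Python. build_inverted_index and the two toy documents are hypothetical names for illustration; the sequential loops stand in for the map workers (each handling a subset of documents) and reduce workers (each handling a subset of words) that run in parallel on the real cluster.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    """Map: emit a (word, doc_id) pair for every word in a crawled document."""
    for word in text.lower().split():
        yield (word, doc_id)

def reduce_fn(word, doc_ids):
    """Reduce: collapse all (word, doc_id) pairs sharing the same word
    into a single (word, list<document-id>) entry."""
    return (word, sorted(set(doc_ids)))

def build_inverted_index(documents):
    """Sequential stand-in for the cluster: map every document, group
    the intermediate pairs by word, then reduce once per word."""
    intermediate = defaultdict(list)
    for doc_id, text in documents.items():
        for word, d in map_fn(doc_id, text):
            intermediate[word].append(d)
    return dict(reduce_fn(w, ids) for w, ids in intermediate.items())

if __name__ == "__main__":
    docs = {"doc1": "google file system", "doc2": "google map reduce"}
    index = build_inverted_index(docs)
    print(index["google"])   # ['doc1', 'doc2']
```

Grouping the intermediate (word, document-id) pairs by word is the shuffle step that the run-time system performs between the map and reduce phases; the sketch does it with an in-memory dictionary.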