The MapReduce Framework for Big Data Application

Mr. Gajanan Rathod, Mr. Gurudutt Sortur, Mr. Nikhil Nimbalkar, Miss. Shahida Nadaf, Prof. R. V. Patil
Department of Computer Engineering, PDEA'S COEM, Pune

ABSTRACT: Big data is being generated at a large scale because of day-to-day activities and the exceptional growth in the use of computing resources. To access and handle such huge amounts of data, distributed system mechanisms are used. One such widely used distributed file system is the Hadoop Distributed File System (HDFS). Google's MapReduce framework and Apache Hadoop, an open-source implementation, are the de facto software systems for big data applications. Hadoop uses the MapReduce framework to perform analysis and carry out computations in parallel on these massive data sets. Hadoop follows a master/slave design that decouples system metadata from application data: metadata is kept on a dedicated NameNode server and application data on the DataNodes. MapReduce processing is slow, whereas it is well known that accessing data from a cache is far quicker than a memory access. In addition, HDFS suffers from a large I/O bottleneck when storing the tri-replicated data blocks, and the I/O overhead intrinsic to the HDFS architecture degrades application performance.

KEYWORDS: Big Data, MapReduce, Hadoop Distributed File System (HDFS), Cache Management.

I. INTRODUCTION: The MapReduce [5] framework for big data applications has led to the introduction of millions of computing resources, and the growing day-by-day use of these applications has led to the generation of large amounts of data. Applications specify the computation in terms of a map and a reduce function working on partitioned data items, and the MapReduce framework schedules the computation across a cluster of machines. A key challenge is quick retrieval of these resources together with high performance, and a good mechanism for achieving this goal is the use of distributed systems. MapReduce provides a standardized framework for implementing large-scale distributed computation on unprecedentedly large data sets.

Hadoop is a software framework [6] for writing applications that rapidly process large amounts of data in parallel on large clusters of compute nodes. It provides a distributed file system and a framework for the analysis and transformation of very large data sets using the MapReduce paradigm. The volume of data generated by these applications, collectively called data sets, is very large, so large data sets need to be processed efficiently. Over the years, Hadoop has gained importance owing to its scalability, reliability, high throughput, and ability to perform analysis and enormous computations on these huge amounts of data. It is employed by leading companies such as Amazon, Google, Facebook, and Yahoo [7]. MapReduce is a generic execution engine that parallelizes computation over a large cluster of machines [2]. An important characteristic of Hadoop is the partitioning of data and computation across many hosts and the execution of application computations in parallel, close to their data. A Hadoop cluster scales computation capacity, storage capacity, and I/O bandwidth simply by adding commodity servers. In this paper, three algorithms are used for faster processing in the Hadoop system, namely Normal cache MR, Distributed cache MR, and Adaptive Replacement Cache (ARC) [1] MR.
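As a minimal illustration of the programming model described above (an application supplies a map and a reduce function over partitioned data items, and the framework handles grouping and scheduling), the following toy Java sketch simulates the model in a single process. It is not Hadoop and not the system proposed in this paper; the class MiniMapReduce and every name inside it are invented here purely for illustration.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.function.BinaryOperator;
    import java.util.function.Function;

    // Toy, single-process illustration of the map/reduce programming model.
    public class MiniMapReduce {

        // Applies the map function to every record of every partition, groups the
        // emitted (key, value) pairs by key, then folds each group with reduce.
        static <K, V> Map<K, V> run(List<List<String>> partitions,
                                    Function<String, List<Map.Entry<K, V>>> map,
                                    BinaryOperator<V> reduce) {
            Map<K, List<V>> grouped = new HashMap<>();
            for (List<String> partition : partitions) {      // "map" phase, one partition at a time
                for (String line : partition) {
                    for (Map.Entry<K, V> pair : map.apply(line)) {
                        grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                               .add(pair.getValue());
                    }
                }
            }
            Map<K, V> result = new HashMap<>();              // "reduce" phase, one key at a time
            for (Map.Entry<K, List<V>> group : grouped.entrySet()) {
                result.put(group.getKey(), group.getValue().stream().reduce(reduce).orElseThrow());
            }
            return result;
        }

        public static void main(String[] args) {
            // Two input partitions; the map function emits (word, 1) for every word.
            List<List<String>> partitions = List.of(
                    List.of("I am Sam"),
                    List.of("I am studying in BE"));
            Function<String, List<Map.Entry<String, Integer>>> wordCountMap = line -> {
                List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
                for (String w : line.split("\\s+")) {
                    pairs.add(Map.entry(w, 1));
                }
                return pairs;
            };
            // Reduce simply sums the values emitted for each key.
            Map<String, Integer> counts = run(partitions, wordCountMap, Integer::sum);
            System.out.println(counts);  // e.g. {I=2, am=2, Sam=1, studying=1, in=1, BE=1}
        }
    }

A real deployment differs in that the partitions live on different DataNodes, the map tasks run close to their data, and the grouping (shuffle) happens over the network; the sketch only mirrors the logical structure of the computation.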
II. Related Work
MapReduce processes large amounts of big data. It is designed in such a way that MapReduce programs can be automatically parallelized and executed on large clusters of machines. As the model is easy to use, programmers with limited experience with parallel and distributed systems can work with it []. Jobs are divided and balanced using a load-balancing technique; however, in this model extra overhead is required to keep information about every data node and its job, and hence the computation overhead increases. Improved MapReduce performance has been achieved through data placement in heterogeneous Hadoop clusters [4]. The format of the cache description is the same across applications, but its content varies according to their specific semantic contexts; it can be designed and implemented by the application developers who are responsible for implementing their MapReduce tasks.

III. Research Methodology:
3.1 MapReduce: MapReduce [5] is a programming model for processing large data sets, and also the name of Google's implementation of the model. MapReduce is typically used to perform distributed computing on clusters of computers. The model is inspired by the map and reduce functions commonly used in functional programming, although their purpose in the MapReduce framework is not the same as in their original forms. MapReduce libraries have been written in many programming languages; a popular free implementation is Apache Hadoop. MapReduce is a framework for processing parallelizable problems across huge data sets using a large number of computers (master and slave nodes), collectively referred to as a cluster. Computational processing can occur on data stored in HDFS in the form of multi-structured data (structured, semi-structured, and unstructured).

Map step: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem and passes the answer back to its master node. In other words, mapping is the function used to arrange the input data as key/value pairs. Consider the example sentence "I am Sam, I am studying in BE." The key/value pairs are defined as:
1) (I, 1)  2) (am, 1)  3) (Sam, 1)  4) (I, 1)  5) (am, 1)  6) (studying, 1)  7) (in, 1)  8) (BE, 1)

The map function involves the following components:
a) Partition: the input data is divided across a number of nodes and processed with the map and reduce functions.
b) Combiner: also called a mini-reducer, it is used along with each mapper and is mainly used for load balancing.
c) Cluster: a part of memory; when data is processed, it is stored inside the cluster.

Reduce step: The reducer is the function that takes the keys produced by the map function and performs the reduce operation. For the example above, the output is:
(I, 2)  (am, 2)  (Sam, 1)  (studying, 1)  (in, 1)  (BE, 1)
As this example shows, the given sentence contains repeated words, so the reducer combines them and reports their counts.

Fig.: Execution Overview
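To make the map and reduce steps above concrete, the following sketch shows how the word-count walkthrough (the "I am Sam" sentence) could be written against the Hadoop MapReduce Java API. This is an illustrative sketch rather than code from the paper; the class names WordCount, TokenizerMapper, and SumReducer are chosen here for illustration, and a complete job would also need driver code that configures the input and output paths.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

        // Map step: emit (word, 1) for every token of every input line,
        // e.g. "I am Sam" -> (I, 1), (am, 1), (Sam, 1).
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer tokens = new StringTokenizer(value.toString());
                while (tokens.hasMoreTokens()) {
                    word.set(tokens.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce step: sum the counts of each word, e.g. (I, [1, 1]) -> (I, 2).
        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable total = new IntWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                total.set(sum);
                context.write(key, total);
            }
        }
    }

Because integer addition is associative, SumReducer can also be registered as the job's combiner, i.e. the "mini-reducer" mentioned above, which shrinks the amount of intermediate data shuffled from the mappers to the reducers.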
Dataflow: The frozen part of the MapReduce framework is a large distributed sort. The points that applications do define are the input reader, the Map function, the partition function, the comparison function, the Reduce function, and the output writer.

Input reader: It divides the input into appropriately sized 'splits' (in practice typically 16 MB to 128 MB), and the framework assigns one split to each Map function. The input reader reads data from stable storage (typically a distributed file system) and generates key/value pairs. A common example reads a directory full of text files and returns each line as a record.

Map function: Each Map function takes a series of key/value pairs, processes each, and generates zero or more output key/value pairs. The input and output types of the map can be (and often are) different from each other.

Partition function: Each Map function output is allocated to a particular reducer by the application's partition function for sharding purposes. The partition function is given the key and the number of reducers and returns the index of the desired reducer.

Comparison function: The input for each Reduce is pulled from the machine where the Map ran and sorted using the application's comparison function.

Reduce function: The framework calls the application's Reduce function once for each unique key, in sorted order. The Reduce can iterate through the values associated with that key and produce zero or more outputs.

Output writer: It writes the output of the Reduce to stable storage, usually a distributed file system.

As an illustrative example, consider the problem of computing the average word length of every word's occurrences in a large collection of documents. In MapReduce it is expressed as follows: the input key/value pair to the Map function is a document name and its contents. The function scans through the document and, for each word, emits the word together with the length of that occurrence. Shuffling groups together the occurrences of the same word across all documents and passes them to the Reduce function. The Reduce function sums the word lengths over all occurrences, divides the sum by the number of occurrences, and emits the word together with its overall average word length.

Example: For the average word length problem above, the user would write code similar to the following pseudo-code [5]:

    function map(String key, String value):
        // key: document name
        // value: document contents
        for each word w in value:
            EmitIntermediate(w, wordlength(w));

    function reduce(String key, Iterator values):
        // key: a word
        // values: a list of word lengths
        double sum = 0, count = 0, result = 0;
        for each v in values:
            sum += ParseInt(v);
            count++;
        result = sum / count;
        Emit(key, AsDouble(result));

Here, each document is split into words, and the map function emits each word's length, using the word itself as the key. The framework puts together all the pairs with the same key and feeds them to the same call to reduce, so this function only needs to sum its input values to obtain the total length over all appearances of that word. To obtain the average word length, the sum is then divided by the count of that word's occurrences.
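The pseudo-code above maps directly onto the Hadoop Java API. The sketch below is an illustrative adaptation, not code from the paper: the mapper emits a (word, length) pair for every word, and the reducer sums the lengths of a word's occurrences and divides by their count; the class names AverageWordLength, LengthMapper, and AverageReducer are invented here.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class AverageWordLength {

        // Map step: for every word w in the document, emit (w, length of w).
        public static class LengthMapper extends Mapper<Object, Text, Text, IntWritable> {
            private final Text word = new Text();
            private final IntWritable length = new IntWritable();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer tokens = new StringTokenizer(value.toString());
                while (tokens.hasMoreTokens()) {
                    String w = tokens.nextToken();
                    word.set(w);
                    length.set(w.length());
                    context.write(word, length);
                }
            }
        }

        // Reduce step: sum the lengths of all occurrences of a word and divide
        // by the number of occurrences, as in the pseudo-code above.
        public static class AverageReducer extends Reducer<Text, IntWritable, Text, DoubleWritable> {
            private final DoubleWritable average = new DoubleWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                long sum = 0;
                long count = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                    count++;
                }
                average.set((double) sum / count);
                context.write(key, average);
            }
        }
    }

Unlike the word-count case, this reducer should not be reused as a combiner, because an average of partial averages is not the overall average; a combiner would have to carry (sum, count) pairs instead.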
The following three algorithms are proposed for faster processing in the Hadoop system.

1. Normal cache MR: In this algorithm, the input data is processed from main memory, and the map and reduce operations are performed on the input data in the normal way. The processed data is then stored in HDFS.

2. Distributed cache MR: In this algorithm, main memory is divided into two parts that act as a filter for junk and temporary files. After filtering, the map and reduce operations are performed on the data, and the result is stored in HDFS.

3. Adaptive Replacement Cache (ARC) MR: In this algorithm, main memory is divided into four parts that act as a filter for junk and temporary files. After filtering, the map and reduce operations are performed on the data, and the result is stored in HDFS.

IV. IMPLEMENTATION DETAILS: In this section we present the input, the expected results, and the environment used for the implementation. Input: For this implementation we use a text file as the input, which is further split into a number of blocks.

CONCLUSION: The motivation for this work is that Hadoop is so widely used that it is worth studying in depth, together with the MapReduce framework that it uses for big data analysis and transformation. Many new technologies are emerging at a rapid rate, each bringing technological advances and the potential to make technology easier to use. We bring together ideas from MapReduce and the Normal, Distributed, and ARC cache algorithms; however, this work focuses mainly on increasing the processing speed of the Hadoop system. We will combine the advantages of MapReduce-like [6] software with the efficiency and shared-work advantages that come from loading data and creating performance-enhancing data structures.

REFERENCES:
[1] Y. Zhao, J. Wu, and C. Liu, "Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework," in Proc. IEEE INFOCOM, 2013.
[2] N. S. Islam, X. Lu, M. Wasi-ur-Rahman, R. Rajachandrasekar, and D. K. (DK) Panda, "In-Memory I/O and Replication for HDFS with Memcached: Early Experiences," IEEE, 2014.
[3] L. A. Belady, "A Study of Replacement Algorithms for Virtual-Storage Computers," IBM Systems Journal, vol. 5, no. 2, pp. 78-101, 1966.
[4] M. Zaharia, A. Konwinski, A. D. Joseph, R. Katz, and I. Stoica, "Improving MapReduce Performance in Heterogeneous Environments," in Proc. OSDI, Berkeley, CA, USA, 2008.
[5] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," in Proc. OSDI, pp. 137-150, 2004.
[6] Cloudera, "CDH4 Installation," https://ccp.cloudera.com/display/CDH4DOC/CDH4+Installation
[7] N. Megiddo and D. S. Modha, "Outperforming LRU with an Adaptive Replacement Cache Algorithm," IBM Almaden Research Center, San Jose, CA, USA, pp. 58-65.
[8] D. Peng and F. Dabek, "Large-Scale Incremental Processing Using Distributed Transactions and Notifications," in Proc. OSDI, Berkeley, CA, USA, 2010.
[9] D. Zhao and I. Raicu, "HyCache: A User-Level Caching Middleware for Distributed File Systems," 2014.
[10] X. Wu, X. Zhu, and G.-Q. Wu, "Data Mining with Big Data," 2014.