International Journal of Engineering Trends and Technology (IJETT) – Volume 14 Number 3 – Aug 2014

A Closer Look Over the Hadoop Framework
Raj Kumari Bhatia, Aakriti Bansal
UIET, Panjab University, Chandigarh, India

Abstract— Apache Hadoop is effective, but there is considerable scope for improving and extending the existing technology. Various improvements have been proposed to Hadoop, an open-source implementation of Google's MapReduce framework. It supports distributed, data-intensive job scheduling and parallel applications by decomposing a massive job into a number of small tasks and a massive data set into smaller partitions, so that each task processes a different partition in parallel. The large-scale data is handled by HDFS, which mimics the Google File System; MapReduce applications mainly use HDFS for storing data. HDFS is a very large distributed file system that runs on commodity hardware and provides high throughput as well as fault tolerance. Many large companies believe that within a few years more than half of the world's data will be stored in Hadoop. HDFS stores files as a series of blocks, which are replicated for fault tolerance. This paper is an introduction to Hadoop: it briefly reviews an existing deployment approach and its limitations, and the final section describes dynamic processing-slot scheduling for I/O-intensive jobs and its limitations.

Keywords— Hadoop, deployment, scheduling, I/O slots, MapReduce.

I. INTRODUCTION
Big data and Hadoop are rapidly emerging as the preferred solution to the business and technology trends that are disrupting traditional data management and processing systems. The information is simply too massive, arrives from numerous sources, and comes in several forms; it is commonly characterized by the three Vs shown in Fig. 1.

Fig. 1. Parameters of Big Data: Variety, Velocity, Volume.

Variety makes the data hard to handle uniformly. Data comes from varied sources and may be structured, unstructured, or semi-structured; the different types include text, audio, video, log files, sensor data, and so on. Volume represents the size of the data, typically described in terabytes and petabytes. Velocity refers to the motion of the data and the analysis of streaming data.

A. HADOOP
Hadoop is a programming framework that supports the processing of huge data sets in a distributed computing environment. Hadoop was derived from Google's MapReduce, a software framework in which an application is broken down into numerous parts. The current Apache Hadoop ecosystem consists of the Hadoop kernel, MapReduce, HDFS, and a number of related components such as Apache Hive, HBase, and ZooKeeper. The Apache Hadoop projects provide a series of tools designed to solve big-data problems. A Hadoop cluster implements a parallel computing cluster using inexpensive commodity hardware; the data is partitioned across many servers to provide near-linear scalability. The design philosophy of the cluster is to bring the computation to the data, so each data node holds part of the overall data and is able to process the data it holds. The overall framework for the processing software is called MapReduce.

B. MapReduce
MapReduce is a programming framework for distributed computing, introduced by Google, in which a divide-and-conquer technique is employed to break large, complicated data into small units and process them. MapReduce provides automatic parallelization and distribution, fault tolerance, I/O scheduling, and status and monitoring of jobs. MapReduce is the combination of a map and a reduce phase [1]:
• Map(): The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may repeat this, which results in a multi-level tree structure. Each worker node processes its smaller problem as key-value pairs, produces a set of intermediate pairs, and passes the result back toward the master node.
• Reduce(): Combines all the intermediate values for each key and produces a set of merged output values.
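To make the Map() and Reduce() phases concrete, the following is a minimal word-count sketch written against the classic org.apache.hadoop.mapred API that ships with Hadoop 1.x. It is an illustrative example rather than code taken from the paper; the class names and the command-line input/output paths are assumptions.

// Minimal word-count sketch using the classic org.apache.hadoop.mapred API
// (the API generation shipped with Hadoop 1.x). Names and paths are illustrative.
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {

  // Map(): split each input line into words and emit <word, 1> pairs.
  public static class TokenMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> out, Reporter reporter)
        throws IOException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        out.collect(word, ONE);
      }
    }
  }

  // Reduce(): sum the intermediate counts collected for each word.
  public static class SumReducer extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> out, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      out.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(TokenMapper.class);
    conf.setReducerClass(SumReducer.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));   // HDFS input directory (assumed)
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // HDFS output directory (assumed)
    JobClient.runJob(conf);
  }
}

The mapper emits a <word, 1> pair for every token and the reducer sums the counts for each word; the framework itself handles partitioning, shuffling, and sorting of the intermediate pairs between the two phases.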
C. HDFS
HDFS is a block-structured distributed file system that holds large amounts of big data. In HDFS the data is stored in blocks, also known as chunks. HDFS has a client-server architecture comprising a NameNode and many DataNodes. The NameNode stores the file system metadata and keeps track of the state of the DataNodes; it is also responsible for file system operations [2]. When the NameNode fails, Hadoop does not support automatic recovery, but the configuration of a secondary NameNode is possible. HDFS is based on the principle that "moving computation is cheaper than moving data". It is a self-healing, high-bandwidth clustered storage system that provides an optimized, redundant, reliable, distributed file system.
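As an illustration of how a client interacts with HDFS, the following is a minimal sketch using the org.apache.hadoop.fs.FileSystem API; the NameNode URI and the file path are assumptions made for the example and do not come from the paper.

// Minimal sketch of writing and reading a file through the HDFS client API.
// The fs.default.name URI and the path below are assumed for illustration.
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://namenode:9000");  // assumed NameNode address

    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/user/demo/sample.txt");

    // Write a small file: the NameNode records the metadata, while the
    // bytes themselves are streamed to DataNodes in fixed-size blocks.
    FSDataOutputStream out = fs.create(file, true);
    out.writeBytes("hello hdfs\n");
    out.close();

    // Read the file back through the same FileSystem handle.
    BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)));
    System.out.println(in.readLine());
    in.close();
    fs.close();
  }
}

The client contacts the NameNode only for metadata such as block locations and file creation; the file contents are streamed block by block to and from the DataNodes.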
Other Components of Hadoop [3]:

HBase: HBase is an open-source, non-relational distributed database that allows low-latency, quick lookups in Hadoop. It adds transactional capabilities to Hadoop, allowing users to perform updates, inserts, and deletes. It is written in Java, runs on top of HDFS, and can serve as both input and output for MapReduce.

Pig: Pig Latin is a Hadoop-based language developed by Yahoo. It is relatively easy to learn and is adept at very deep, very long data pipelines. Pig is a high-level platform on which MapReduce programs are created for use with Hadoop; it is a high-level data processing system in which data sets are analyzed using a high-level language.

Hive: Hive is a Hadoop-based, data-warehouse-like framework originally developed by Facebook. It allows users to write queries in a SQL-like language called HiveQL, which are then converted to MapReduce. The Hive infrastructure is built on top of Hadoop and helps provide summarization, querying, and analysis. This allows SQL programmers with no MapReduce experience to use the warehouse and makes it easier to integrate with business intelligence and visualization tools.

Sqoop: Sqoop is a command-line interface platform used for transferring data between relational databases and Hadoop.

Avro: Avro is a data exchange service used within Apache Hadoop; its services can be used together as well as independently. It also acts as a data serialization system that allows encoding the schema of Hadoop files, and it is adept at parsing data and performing remote procedure calls.

Oozie: Oozie is a workflow processing system that lets users define a series of jobs written in multiple languages, such as MapReduce, Pig, and Hive, and then intelligently link them to one another. Oozie allows users to specify, for example, that a particular query is only to be initiated after specified previous jobs, on which it relies for data, have completed. Oozie is a Java-based web application that runs in a Java servlet container. It uses a database to store the definition of a workflow, which is a collection of actions, and it manages the Hadoop jobs.

Chukwa: Chukwa is a data collection and analysis framework used to process and analyze large amounts of logs. It is built on top of the HDFS and MapReduce frameworks.

Flume: Flume is a high-level framework for populating Hadoop with data from multiple sources. Agents are deployed throughout one's IT infrastructure (inside web servers, application servers, and mobile devices, for example) to collect data and integrate it into Hadoop.

ZooKeeper: ZooKeeper is a centralized service that provides distributed synchronization and group services and maintains configuration information.

Advantages and Disadvantages of Hadoop
1) Advantages: MapReduce is designed to run on commodity PC clusters, and managing thousands of commodity PCs is a big job; the reliability of commodity PCs is also questionable, and perhaps the biggest problem is power consumption. Building one's own compute center is therefore expensive, and this is where EC2-like services come in. Deploying Hadoop applications on virtual machines takes advantage of virtualization, which makes cluster management easier and improves reliability, because virtual machines can be recovered from a crash more easily than physical ones; this in turn improves the reliability of the MapReduce master node. Virtualization also helps to fully utilize system resources. By using EC2-like services, customers can easily and cost-effectively process vast amounts of data.
2) Disadvantages: The main disadvantage is the potential for poor performance under heavy load, which is what the work discussed below sets out to address.

II. DEPLOYING HADOOP OVER VIRTUAL MACHINES BY GUANGHUI
Following this paper [10], one can set up one's own Hadoop experimental environment and capture the current status of, and trends in, optimizing Hadoop in virtual environments.
Hadoop deployment: The authors discuss Hadoop deployment over virtual machines and describe installing Hadoop with the JDK on Ubuntu 12.04. To set up Hadoop with the JDK, follow these steps:
1) Download the latest version of the JDK for Ubuntu from http://www.oracle.com/technetwork/java/javase/downloads/jdk-7u4-downloads-1591156.html; they chose jdk-7u4-linux-i586.tar.gz.
2) Set environment variables: untar the file, set the environment variable JAVA_HOME to the path of the JDK, add JAVA_HOME/bin to PATH, and add JAVA_HOME/lib to CLASSPATH.
3) Make the Sun JDK the default JDK:
$ sudo update-alternatives --install /usr/bin/java java /usr/lib/java/jdk1.6.0_20/bin/java 300
$ sudo update-alternatives --install /usr/bin/javac javac /usr/lib/java/jdk1.6.0_20/bin/javac 300
$ sudo update-alternatives --config java
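Once the environment variables from step 2 are in place, a trivial Java program can confirm which JDK is active and whether JAVA_HOME and CLASSPATH are visible. This sanity check is our addition and is not part of the deployment steps described in the referenced paper.

// Quick sanity check of the JDK setup; purely illustrative.
public class CheckJavaEnv {
  public static void main(String[] args) {
    System.out.println("java.version = " + System.getProperty("java.version"));
    System.out.println("java.home    = " + System.getProperty("java.home"));
    System.out.println("JAVA_HOME    = " + System.getenv("JAVA_HOME"));
    System.out.println("CLASSPATH    = " + System.getenv("CLASSPATH"));
  }
}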
A. Task Scheduling
Hadoop's performance is closely tied to its task scheduler, which implicitly assumes that cluster nodes are homogeneous and that tasks make progress linearly, and uses these assumptions to decide when to speculatively re-execute tasks that appear to be stragglers [4]. The implicit assumptions of Hadoop's scheduler are [8]:
• Nodes can perform work at roughly the same rate.
• Tasks progress at a constant rate throughout time.
• There is no cost to launching a speculative task on a node that would otherwise have an idle slot.
• A task's progress score is roughly equal to the fraction of its total work that it has done. Specifically, in a reduce task, the copy, reduce, and merge phases each take 1/3 of the total time.
• Tasks tend to finish in waves, so a task with a low progress score is likely a slow task.
• Different tasks of the same category (map or reduce) require roughly the same amount of work.

B. I/O Scheduling
The efficiency of Hadoop in virtual machines may be much lower than in physical machines, for reasons that include both task scheduling and I/O scheduling. MapReduce is designed to run on physical machines; when a MapReduce job is running, a lot of data is transferred between machines, so the efficiency of I/O scheduling is very important for shortening the response time.
Research aspects: If Hadoop is running on virtual machines and knows whether any two virtual machines reside on the same physical host, this knowledge can help Hadoop decide which virtual machine should run which map or reduce task.

III. DYNAMIC PROCESSING SLOTS SCHEDULING BY KURAZUMI
In this paper [9], the authors propose dynamic processing-slot scheduling for I/O-intensive Hadoop MapReduce jobs, focusing on I/O wait during job execution. Assigning additional tasks to dynamically added slots whenever CPU resources with a high rate of I/O wait are detected on an active TaskTracker node leads to improved CPU utilization. They implemented the method on Hadoop 1.0.3. They evaluated the thresholds up_ioline and down_ioline and found only a small difference between the results for values that were close together; they also concluded that setting up_ioline to extremely high values can degrade performance, because the rate of I/O wait then builds up to a high level. They use the Sort [5] program as the benchmark because Sort is a basic operation for programs running on Hadoop MapReduce: the map function (IdentityMapper) and the reduce function (IdentityReducer) simply take key-value pairs from the RecordReader and output them. The amount of CPU processing in both the map and reduce phases is less than the amount of I/O processing, so Sort is an I/O-intensive program.
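For reference, the following is a minimal sketch of an identity map/reduce job in the spirit of the Sort benchmark, using IdentityMapper and IdentityReducer from the classic org.apache.hadoop.mapred API available in Hadoop 1.0.3. The input/output formats, key/value types, and paths are assumptions for illustration and are not the exact benchmark configuration used by the authors.

// Sketch of an identity map/reduce job in the spirit of the Sort benchmark:
// the mapper and reducer pass key-value pairs through unchanged, so the job
// is dominated by I/O rather than CPU. Formats, types, and paths are assumed.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class IdentitySortJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(IdentitySortJob.class);
    conf.setJobName("identity-sort");

    // Keys and values flow through unchanged; sorting happens in the shuffle.
    conf.setMapperClass(IdentityMapper.class);
    conf.setReducerClass(IdentityReducer.class);

    // Assumes sequence files holding Text keys and Text values.
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    conf.setInputFormat(SequenceFileInputFormat.class);
    conf.setOutputFormat(SequenceFileOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}

Because the map and reduce functions do no computation, the job's running time is dominated by reading, shuffling, and writing data, which is why Sort is a suitable I/O-intensive workload for this evaluation.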
They evaluated, over several runs, the fluctuation of the total number of Map slots in the whole cluster during the execution of a single job, on a cluster consisting of 8 slave nodes with 12 GB of input data, because this execution pattern showed the greatest effectiveness. The total initial number of Map slots in the whole cluster is set to 16, because the cluster consists of 8 slave nodes. According to their analysis, the execution time of the Map phase is about 50% of the total execution time (including the overlapping Map+Reduce state). The total number of Map slots increases up to 42 in the second half of the run; this results from the number of I/O-intensive Reduce tasks running in this part, by which time all Map tasks have completed. In the first half of the experimental runs, the total number of Map slots increases up to 27, with about 20 on average.

Pros: The authors implemented the proposed method on Hadoop 1.0.3 and evaluated it by executing jobs with Sort as the benchmark program. The modified Hadoop was able to control the number of Map slots dynamically, in contrast to the default static assignment. Using the proposed method, the execution time was improved by up to about 23% compared with default Hadoop.
Cons: The authors did not consider controlling the number of Map slots according to the change of MapReduce phases, and they were not able to remove the overhead of managing threads.

IV. CONCLUSION
In this paper we have discussed Hadoop, given a brief account of its deployment over virtual machines and the limitations of that approach, and, in the final section, described dynamic processing-slot scheduling for I/O-intensive jobs and its limitations.

V. REFERENCES
[1] K. Bakshi, "Considerations for Big Data: Architecture and Approach," IEEE Aerospace Conference, Big Sky, Montana, March 2012.
[2] D. Garlasu, V. Sandulescu, I. Halcu, and G. Neculoiu, "A Big Data implementation based on Grid Computing," Grid Computing, 17-19 Jan. 2013.
[3] S. Sagiroglu and D. Sinanc, "Big Data: A Review," 20-24 May 2013.
[4] M. Zaharia et al., "Improving MapReduce performance in heterogeneous environments," Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation, USENIX Association, 2008, pp. 29-42.
[5] Hadoop 1.0.3, The Apache Software Foundation, May 2012.
[6] B. Nicolae et al., "BlobSeer: Bringing high throughput under heavy concurrency to Hadoop Map-Reduce applications," IEEE International Symposium on Parallel & Distributed Processing (IPDPS), 2010.
[7] A. Thusoo et al., "Hive: a warehousing solution over a map-reduce framework," Proceedings of the VLDB Endowment 2.2 (2009): 1626-1629.
[8] K. Shvachko et al., "The Hadoop distributed file system," IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), 2010.
[9] S. Kurazumi et al., "Dynamic Processing Slots Scheduling for I/O Intensive Jobs of Hadoop MapReduce," ICNC, 2012.
[10] G. Xu, F. Xu, and H. Ma, "Deploying and researching Hadoop in virtual machines," IEEE International Conference on Automation and Logistics (ICAL), 2012.