www.ijecs.in International Journal of Engineering and Computer Science ISSN: 2319-7242, Volume 3, Issue 8, August 2014, Page No. 7917-7920

Hadoop – from where the journey of unstructured data begins…

Mrs. R. K. Dixit, Sourabh Mahajan, Suyog Kandi
Department of Computer Science and Engineering, Walchand Institute of Technology, Solapur
rashmirajivk@gmail.com, ssmahajan92@gmail.com, kandisuyog92@gmail.com

Abstract: Hadoop is a "flexible and available architecture for large-scale computation and data processing on a network of commodity hardware"; it is a processing technology for large-scale applications. Unstructured data is growing at an ever faster rate: big data is generated daily from social media, from sensors used to gather climate information, from digital pictures, from purchase transaction records, and from many other sources. Ultimately there is a need to process multi-petabyte datasets efficiently, in an environment where failure is expected rather than exceptional and where the number of nodes in a cluster is not constant. The Hadoop platform was designed to solve such problems and provides applications for both operations and analytics. Today Hadoop is one of the fastest-growing technologies, providing advantages for businesses across industries worldwide.

Keywords: HDFS (Hadoop Distributed File System), PB (petabyte).

1. Introduction

Computing in its purest form has changed many times as its uses and needs have shifted with technology. Early on, mainframes were predicted to be the future of computing; indeed, mainframes and large-scale machines were built and used, and in some circumstances they are still used in much the same way today. The trend has since turned from bigger and more expensive machines to smaller and more affordable commodity PCs and servers. Hadoop is a strong, open-source platform for managing and analyzing large, massive datasets. It enables you to explore complex data using custom analyses tailored to your information and questions. In a Hadoop system, unstructured data is distributed across hundreds or thousands of machines forming a cluster, and Map/Reduce routines are executed on the data in that cluster. Hadoop has its own file system, which replicates data to multiple nodes to ensure that if one node holding a piece of data goes down, there are at least two other nodes from which to retrieve it.

2. History

Hadoop was created by Doug Cutting, who named it after his child's stuffed toy elephant. Hadoop was inspired by MapReduce, a programming model developed by Google in the early 2000s for indexing the Web, and it was designed to handle petabytes and exabytes of data distributed over multiple nodes in parallel. In 2006 Yahoo! hired Doug Cutting to work on Hadoop with a dedicated team, and Hadoop became a top-level Apache project in 2008. Today Hadoop is a top-level Apache Software Foundation project with a large and active user base, mailing lists, user groups, very active development, and strong development teams, and it has grown its own ecosystem and many actively developed versions.

3. Assumptions and Goals
The assumptions and goals behind Hadoop are clear: hardware (node) failure is expected, data is accessed as streams, datasets are large, and storage is provided by a distributed file system. A distributed file system is designed to hold a large amount of data and provide access to this data to many clients distributed across a network. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system's data.

The main and basic characteristic of HDFS is that it is a distributed file system designed to run on commodity hardware. The fact that there are a huge number of components, and that each component has a non-trivial probability of failure, means that some component of HDFS is always non-functional. Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware.

Applications that run on HDFS need streaming access to their data sets. They are not the general-purpose applications that typically run on general-purpose file systems. HDFS is designed more for batch processing than for interactive use by users; the emphasis is on high throughput of data access rather than low latency. POSIX imposes many hard requirements that are not needed for applications targeted at HDFS, so POSIX semantics in a few key areas have been traded for increased data throughput rates.

Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size, so HDFS is tuned to support large files. It should provide high aggregate data bandwidth, scale to hundreds of nodes in a single cluster, and support tens of millions of files in a single instance.

HDFS applications need a write-once, read-many access model for files. A file once created, written, and closed need not be changed. A MapReduce application or a web crawler application fits perfectly with this model. There is a plan to support appending writes to files in the future.

A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the size of the data set is huge, because it minimizes network congestion and increases the overall throughput of the system. The assumption is that it is often better to migrate the computation closer to where the data is located than to move the data to where the application is running, and HDFS provides interfaces for applications to move themselves closer to the data. HDFS has also been designed to be easily portable from one platform to another, which facilitates its widespread adoption as a platform of choice for a large set of applications.

HDFS has a master/slave architecture. An HDFS cluster consists of a single Name node, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of Data nodes, usually one per node in the cluster, which manage the storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks, and these blocks are stored in a set of Data nodes. The Name node executes file system namespace operations such as opening, closing, and renaming files and directories; it also determines the mapping of blocks to Data nodes. The Data nodes are responsible for serving read and write requests from the file system's clients, and they also perform block creation, deletion, and replication upon instruction from the Name node. In short, the goals of HDFS are a very large distributed file system that assumes commodity hardware, is optimized for batch processing, and runs on heterogeneous operating systems.

Figure 1: HDFS architecture
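To make the streaming, write-once, read-many access model concrete, the following is a minimal Java sketch (not from the original paper) that uses Hadoop's FileSystem client API to write a file into HDFS once and then stream it back. It assumes a client Configuration whose fs.defaultFS points at the Name node; the path /demo/events.txt is purely illustrative.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: write a file to HDFS once, then stream it back.
// Assumes fs.defaultFS (e.g. hdfs://namenode:9000) is set in the client
// configuration; /demo/events.txt is a hypothetical path.
public class HdfsWriteOnceReadMany {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // client handle backed by the Name node

        Path file = new Path("/demo/events.txt");

        // Write once: the Name node records the block mapping,
        // the Data nodes store the replicated blocks.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("one record per line\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read many: streaming access, with the bytes served by the Data nodes.
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}
```

In this arrangement the Name node only resolves the path to block locations, while the Data nodes serve the actual bytes, which is why read throughput grows with the size of the cluster.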
4. Hadoop's components

Hadoop Distributed File System: HDFS, the storage layer of Hadoop, is a distributed, scalable, Java-based file system adept at storing large volumes of unstructured data.

MapReduce: MapReduce is a software framework that serves as the compute layer of Hadoop. MapReduce jobs are divided into two (obviously named) parts. The "Map" function divides a query into multiple parts and processes data at the node level. The "Reduce" function aggregates the results of the "Map" function to determine the "answer" to the query.

Hive is a Hadoop-based, data-warehouse-like framework originally developed by Facebook. It allows users to write queries in a SQL-like language called HiveQL, which are then converted to MapReduce. This allows SQL programmers with no MapReduce experience to use the warehouse and makes it easier to integrate with business intelligence and visualization tools such as MicroStrategy, Tableau, Revolution Analytics, etc.

Pig Latin is a Hadoop-based language developed by Yahoo!. It is relatively easy to learn and is adept at very deep, very long data pipelines (a limitation of SQL).

HBase is a non-relational database that allows for low-latency, quick lookups in Hadoop. It adds transactional capabilities to Hadoop, allowing users to conduct updates, inserts, and deletes. eBay and Facebook use HBase heavily.

Flume is a framework for populating Hadoop with data.

Oozie is a workflow processing system that lets users define a series of jobs written in multiple languages, such as MapReduce, Pig, and Hive, and then intelligently link them to one another. Oozie allows users to specify, for example, that a particular query is only to be initiated after the specified previous jobs on which it relies for data are completed.

Ambari is a web-based set of tools for deploying, administering, and monitoring Apache Hadoop clusters. Its development is led by engineers from Hortonworks, which includes Ambari in its Hortonworks Data Platform.

Avro is a data serialization system that allows for encoding the schema of Hadoop files. It is adept at parsing data and performing remote procedure calls.

Mahout is a data mining library. It takes the most popular data mining algorithms for performing clustering, regression testing, and statistical modeling and implements them using the MapReduce model.

Sqoop is a connectivity tool for moving data from non-Hadoop data stores, such as relational databases and data warehouses, into Hadoop.

HCatalog is a centralized metadata management and sharing service for Apache Hadoop.

BigTop is an effort to create a more formal process or framework for packaging and interoperability testing of Hadoop's sub-projects and related components, with the goal of improving the Hadoop platform as a whole.

Figure 2: Hadoop ecosystem
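As a sketch of how the storage layer and the compute layer fit together, the following Java driver (not from the original paper) configures and submits a word-count job. It uses the TokenCounterMapper and IntSumReducer helper classes that ship with the Hadoop MapReduce library; the HDFS directories /data/in and /data/out are hypothetical, and the output directory must not already exist.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

// Minimal sketch of a MapReduce driver: HDFS supplies the input and stores the
// output, while the framework schedules the map and reduce tasks near the data.
// /data/in and /data/out are hypothetical HDFS paths.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(TokenCounterMapper.class); // map: line -> (word, 1) pairs
        job.setCombinerClass(IntSumReducer.class);    // optional local aggregation
        job.setReducerClass(IntSumReducer.class);     // reduce: (word, [1,1,...]) -> (word, n)

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("/data/in"));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path("/data/out")); // HDFS output directory

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Tools such as Hive and Pig ultimately generate jobs of this general shape, which is why they can be used by programmers with no MapReduce experience.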
5. Working process of Hadoop architecture

Hadoop is designed to run on a large number of machines that don't share any memory or disks. That means you can buy a whole bunch of commodity servers, slap them in a rack, and run the Hadoop software on each one. When you want to load all of your organization's data into Hadoop, the software busts that data into pieces that it then spreads across your different servers. There is no one place where you go to talk to all of your data; Hadoop keeps track of where the data resides. And because multiple copies are stored, data on a server that goes offline or dies can be automatically replicated from a known good copy.

In a centralized database system, you have one big disk connected to four or eight or sixteen big processors, and that is as much horsepower as you can bring to bear. In a Hadoop cluster, every one of those servers has two or four or eight CPUs. You can run your indexing job by sending your code to each of the dozens of servers in your cluster, and each server operates on its own little piece of the data. The results are then delivered back to you in a unified whole. That is MapReduce: you map the operation out to all of those servers and then you reduce the results back into a single result set. Architecturally, the reason you are able to deal with lots of data is that Hadoop spreads it out; the reason you are able to ask complicated computational questions is that you have all of these processors working in parallel, harnessed together.

Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both Map/Reduce and the distributed file system are designed so that node failures are automatically handled by the framework. Hadoop Common is a set of utilities that support the other Hadoop subprojects; it includes file system, RPC, and serialization libraries.

6. MapReduce

MapReduce is the heart of Hadoop. It is this programming paradigm that allows for massive scalability across hundreds or thousands of servers in a Hadoop cluster. The MapReduce concept is fairly simple to understand for those who are familiar with clustered scale-out data processing solutions; for people new to the topic it can be somewhat difficult to grasp, because it is not typically something they have been exposed to previously. The term MapReduce actually refers to two separate and distinct tasks that Hadoop programs perform. The first is the map job, which takes a set of data and converts it into another set of data in which individual elements are broken down into tuples (key/value pairs). The reduce job takes the output from a map as input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce job is always performed after the map job.
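The following Java sketch (with illustrative class names, not taken from the paper) shows these two tasks for a word count: the map task emits a (word, 1) tuple for every token in its input split, and the reduce task combines the tuples for each word into a single (word, total) tuple.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Minimal sketch of the two MapReduce tasks for a word count.
// The class names are illustrative.
public class WordCountTasks {

    // Map job: convert each input line into (word, 1) tuples.
    public static class WordCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);           // emit one key/value tuple per word
            }
        }
    }

    // Reduce job: combine the tuples for each word into a smaller set of tuples.
    public static class WordCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }
}
```

Such classes would be wired into a job with setMapperClass and setReducerClass, as in the driver sketched earlier; the framework ensures that all map output for a given key reaches the same reduce task before the reduce job runs.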
7. Requirement of Hadoop

Hadoop is suited to batch data processing rather than real-time, user-facing work (for example document analysis and indexing, or web graphs and crawling); to highly parallel, data-intensive distributed applications; to very large production deployments (grids); and to processing lots of unstructured data. It fits when the processing can easily be made parallel, when running batch jobs is acceptable, and when you have access to lots of cheap hardware.

Hadoop users: companies using Hadoop include Adobe, Alibaba, Amazon, AOL, Facebook, Google, and IBM.

8. Conclusion

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. Because Hadoop runs on cheap commodity hardware and automatically handles data replication and node failure, it does the hard work and lets you focus on processing data, bringing cost savings together with efficient and reliable data processing. HDFS has many similarities with existing distributed file systems, yet the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project and is now part of the Apache Hadoop Core project.

Author Profile

Mrs. R. K. Dixit is an Associate Professor in the Computer Science and Engineering Department at Walchand Institute of Technology, Solapur. She received her B.E. degree from Shivaji University, Kolhapur, and her M.E. degree from Pune University, Pune. Her research interests lie in the areas of security, databases, big data, Hadoop, and data mining.

Sourabh S. Mahajan is an undergraduate student pursuing a B.E. in Computer Science and Engineering at Walchand Institute of Technology, Solapur. His research interests lie in the areas of big data, Hadoop, and data mining.

Suyog D. Kandi is an undergraduate student pursuing a B.E. in Computer Science and Engineering at Walchand Institute of Technology, Solapur. His research interests lie in the areas of big data, Hadoop, and data mining.