CS 590: Cloud Systems for Blind and Hearing Impaired
Real-time Parallel Computation with Hadoop
Bala Murugan

Problem Statement:
Apache Hadoop is a software framework that supports data-intensive distributed applications. The underlying Hadoop Distributed File System (HDFS) used by the Hadoop framework targets high throughput at the cost of increased latency, so Hadoop currently does not support real-time, interactive computation on input data. A mechanism for Hadoop to handle real-time data needs to be devised; such a mechanism would find application in the analysis of server logs and sensor outputs.

What needs to be done:
The major bottleneck in providing real-time processing is the disk access at the various stages of Map-Reduce. The input to the Map stage is usually a data file, and the Map stage in turn produces a set of intermediate files to be acted upon by the Reduce stage. All of this disk access increases the latency of the system; finding a means to avoid it would provide lower latency and better parallelization. (A minimal word-count job marking these file-bound stages appears after the references.)

What has been done:
Hadoop Database (HBase) is a distributed storage system providing optimizations for real-time queries. It is modeled after Google's BigTable, another distributed data storage system. HBase reduces the latency of data access during Map-Reduce by providing random access to data and a RESTful web service; a sample random read appears after the references. HBase is currently being developed as part of the Apache Software Foundation's Hadoop project.

What can be done:
We propose replacing the disk accesses between the different Map-Reduce stages with data streams, which guarantees low latency during data retrieval and computation. An input data stream backed by a buffer would replace the input file; no intermediate files would be created, and the output of one stage would be fed directly to the next. The final output data stream could either be written to a file or used as the input for the next cycle of Map-Reduce. One potential problem to be handled is fault tolerance. (A sketch of such a stream-backed pipeline appears after the references.)

What I have done:
I have started by studying the implementation details of HBase, HDFS and MapReduce as part of the Apache Hadoop project. I am also looking at mechanisms to read parts of data streams without data loss.

References:
[1] Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", Google Labs.
[2] HBase Architecture - http://wiki.apache.org/hadoop/Hbase/HbaseArchitecture
[3] Hadoop - http://en.wikipedia.org/wiki/Hadoop
[4] http://people.apache.org/~rdonkin/hadoop-talk/hadoop.html
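
To make the disk-bound stages concrete, below is the standard word-count job, adapted from the Hadoop documentation. The comments mark where files are read and written; these are the accesses the proposal aims to eliminate.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      // Each input record is read from a file split on HDFS.
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE); // buffered, then spilled to local disk
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      // Values arrive via the shuffle, which reads the mappers'
      // intermediate files over the network.
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum)); // written back to HDFS
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input files
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output files
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}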
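
The random access HBase provides can be illustrated with its Java client API. This is a minimal sketch only: the table, row key and column names are hypothetical, and the exact client classes vary between HBase versions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class RandomRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Hypothetical table holding server log lines keyed by host and timestamp.
    HTable table = new HTable(conf, "serverlogs");
    Get get = new Get(Bytes.toBytes("host42#1234567890"));
    get.addColumn(Bytes.toBytes("log"), Bytes.toBytes("line"));
    // A single-row lookup: no full-file scan is needed.
    Result result = table.get(get);
    byte[] line = result.getValue(Bytes.toBytes("log"), Bytes.toBytes("line"));
    System.out.println(Bytes.toString(line));
    table.close();
  }
}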
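
Finally, a minimal single-process sketch of the proposed stream-backed pipeline: map output is pushed into a bounded in-memory buffer and consumed directly by the reduce stage, with no intermediate files. All names here are hypothetical; a real implementation would also need the shuffle's sort/group-by-key step and, as noted above, fault tolerance.

import java.util.AbstractMap.SimpleEntry;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class StreamedMapReduce {
  // Sentinel marking the end of the map output stream.
  private static final SimpleEntry<String, Integer> EOS =
      new SimpleEntry<>(null, 0);

  public static void main(String[] args) throws InterruptedException {
    // Bounded buffer standing in for the intermediate files.
    BlockingQueue<SimpleEntry<String, Integer>> buffer =
        new ArrayBlockingQueue<>(1024);

    // Map stage: tokenize incoming records and emit (word, 1) pairs.
    Thread mapper = new Thread(() -> {
      String[] records = {"a b a", "b c"}; // stand-in for an input stream
      try {
        for (String record : records) {
          for (String word : record.split("\\s+")) {
            buffer.put(new SimpleEntry<>(word, 1)); // blocks if buffer is full
          }
        }
        buffer.put(EOS);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });

    // Reduce stage: aggregate counts as pairs arrive, with no spill to disk.
    Thread reducer = new Thread(() -> {
      Map<String, Integer> counts = new HashMap<>();
      try {
        for (SimpleEntry<String, Integer> pair = buffer.take();
             pair != EOS; pair = buffer.take()) {
          counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
      System.out.println(counts); // final output stream, or next cycle's input
    });

    mapper.start();
    reducer.start();
    mapper.join();
    reducer.join();
  }
}

The bounded buffer provides back-pressure: if the reduce stage falls behind, the map stage blocks instead of spilling to disk, which is what keeps the pipeline's latency low.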