CS 590: Cloud Systems for Blind and Hearing Impaired
Real-time parallel computation with Hadoop
Bala Murugan
Problem Statement:
Apache Hadoop [3, 4] is a software framework that supports data-intensive distributed
applications. The underlying Hadoop Distributed File System (HDFS) used by the Hadoop
framework is designed to provide high throughput at the cost of increased latency. As a
result, Hadoop currently does not support real-time interactive computation on input data.
A mechanism for Hadoop to handle real-time data needs to be devised; such a mechanism
would find application in the analysis of server logs and sensor outputs.
What needs to be done:
The major bottleneck in providing real-time processing is the disk access at the various
stages of MapReduce [1]. The input to the Map stage is usually a data file, and the Map
stage in turn produces a set of intermediate files to be acted upon by the Reduce stage.
All of this disk access increases the latency of the system; finding a means to avoid it
would provide better parallelization.
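For concreteness, the canonical word-count job from the Hadoop MapReduce tutorials is
reproduced below, with comments marking the three points at which the framework touches
disk. This is stock example code, not code specific to this project.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map stage: the framework reads input splits from files in HDFS (disk access 1).
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            // Emitted pairs are spilled to intermediate files on local disk (disk
            // access 2), then fetched and merge-sorted by reducers during the shuffle.
            context.write(word, ONE);
          }
        }
      }

      // Reduce stage: reads the sorted intermediate data and writes the final
      // results back to HDFS (disk access 3).
      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
            sum += v.get();
          }
          context.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }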
What has been done:
Hadoop Database (HBase) is a distributed storage system that provides optimizations
for real-time queries [2]. It is modeled after Google's BigTable, another distributed
data storage system. HBase reduces the latency of data access during MapReduce by
providing random access to data and a RESTful web service. It is currently being
developed as part of the Apache Software Foundation's Hadoop project.
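As a sketch of the kind of low-latency access HBase offers, the fragment below performs
a single-row random read through the HBase Java client. The table name "logs" and the
column "entry:body" are hypothetical names chosen for illustration, and the exact client
classes vary across HBase versions.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RandomRead {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // "logs" table and "entry:body" column are hypothetical, for illustration only.
        HTable table = new HTable(conf, "logs");
        Get get = new Get(Bytes.toBytes("row-2009-11-01-00042"));
        Result result = table.get(get); // single-row random read, no full scan
        byte[] body = result.getValue(Bytes.toBytes("entry"), Bytes.toBytes("body"));
        System.out.println(Bytes.toString(body));
        table.close();
      }
    }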
What can be done:
We propose replacing the disk accesses between the different MapReduce stages with
data streams, which would lower the latency of data retrieval and computation. An input
data stream backed by a buffer would replace the input file. No intermediate data files
would be created; the output of one stage would be fed directly into the next. The final
output data stream could either be written to a file or used as input for the next cycle
of MapReduce. One potential problem to be handled is fault tolerance, since failed tasks
could no longer be restarted from intermediate files persisted on disk.
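The sketch below is a toy, single-process illustration of the proposed design, not
Hadoop code: a piped in-memory stream with a fixed buffer stands in for the intermediate
files, so map output flows directly into reduce input, and a sorted map mimics the sort
phase of the shuffle.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.io.PipedInputStream;
    import java.io.PipedOutputStream;
    import java.io.PrintWriter;
    import java.util.Map;
    import java.util.TreeMap;

    public class StreamedStages {
      public static void main(String[] args) throws Exception {
        PipedOutputStream mapOut = new PipedOutputStream();
        // The in-memory buffer replaces the intermediate files on disk.
        PipedInputStream reduceIn = new PipedInputStream(mapOut, 64 * 1024);

        Thread mapper = new Thread(() -> {
          try (PrintWriter out = new PrintWriter(mapOut)) {
            for (String record : new String[] {"a", "b", "a"}) {
              out.println(record + "\t1"); // emit key/value pairs straight into the stream
            }
          } // closing the stream signals end-of-input to the reducer
        });

        Thread reducer = new Thread(() -> {
          try (BufferedReader in = new BufferedReader(new InputStreamReader(reduceIn))) {
            Map<String, Integer> counts = new TreeMap<>(); // sorted keys mimic the shuffle sort
            String line;
            while ((line = in.readLine()) != null) {
              String[] kv = line.split("\t");
              counts.merge(kv[0], Integer.parseInt(kv[1]), Integer::sum);
            }
            counts.forEach((k, v) -> System.out.println(k + "\t" + v));
          } catch (Exception e) {
            e.printStackTrace();
          }
        });

        mapper.start();
        reducer.start();
        mapper.join();
        reducer.join();
      }
    }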
What I have done:
I have started by studying the implementation details of HBase, HDFS, and MapReduce
as part of the Apache Hadoop project. I am also looking at mechanisms for reading parts
of data streams without data loss.
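As one candidate mechanism, Java's buffered streams already support a mark/reset
protocol that lets a consumer re-read a bounded window of a stream without losing data.
A minimal demonstration follows; the record format is made up for illustration.

    import java.io.BufferedInputStream;
    import java.io.ByteArrayInputStream;
    import java.io.InputStream;
    import java.nio.charset.StandardCharsets;

    public class MarkResetDemo {
      public static void main(String[] args) throws Exception {
        InputStream raw = new ByteArrayInputStream(
            "key1\tval1\nkey2\tval2\n".getBytes(StandardCharsets.UTF_8));
        BufferedInputStream in = new BufferedInputStream(raw);

        in.mark(1024);            // remember this position; up to 1024 bytes may be re-read
        byte[] peek = new byte[4];
        in.read(peek);            // consume a prefix speculatively
        in.reset();               // rewind: the consumed bytes are not lost

        int b;
        while ((b = in.read()) != -1) {
          System.out.print((char) b); // the full stream is still available
        }
      }
    }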
References:
[1] "MapReduce: Simplified Data Processing on Large Clusters", by Jeffrey Dean and Sanjay
Ghemawat; from Google Labs
[2] HBase Architecture - http://wiki.apache.org/hadoop/Hbase/HbaseArchitecture
[3] Hadoop - http://en.wikipedia.org/wiki/Hadoop
[4]http://people.apache.org/~rdonkin/hadoop-talk/hadoop.html