REAL-TIME ANALYTICS PROCESSING WITH MAPREDUCE

Authors: C. Z. PENG, Z. J. JIANG, X. B. CAI and Z. K. ZHANG
Source: Proceedings of the 2012 International Conference on Machine Learning and Cybernetics, Xian, 15-17 July 2012
Presenter: 許立新
Presentation date: 2013/5/14
Outline
1. Introduction to real-time analytics
2. Related work on Hadoop MapReduce
3. The modified programming model
 Programming Model for Real-Time Processing
 Execution Framework
 Persistent Storage
4. Conclusions
2016/7/14
Introduction (1/2)

These SPCs have not been adopted by enterprises as widely as
MapReduce, even though real-time analytics applications are
now attracting attention.

Wireless terminal devices and personal computers are
increasingly found everywhere, continuously connected to
the Internet.

Analyzing this information in real time can yield large
commercial benefits and help avoid commercial losses.
Introduction (2/2)

Data stream management systems (DSMSs) are attracting attention
again in the data analytics area. Typical data streams include:
 financial tickers
 performance measurements in network monitoring
 traffic management
 log records
 click-streams in web tracking
 data feeds from sensor applications

There has been much discussion about whether MapReduce
can be applied to real-time analytics.
Related work (1/2)

MapReduce lets developers think in a data-oriented fashion,
because the framework takes care of the following details:
 How the computations are actually carried out
 How to get the data to the processes that depend on them
 Distributed execution
 Network communication
 Coordination and fault tolerance

For a batch run, called a job, the data to be processed must first
be stored in Hadoop HDFS before the MapReduce application
runs.
Related work (2/2)

If the user does not set the number of Map and Reduce tasks,
the master node calculates the number automatically
using a hash algorithm.

After mapping, the intermediate data enter the shuffle
and sort phase supplied by MapReduce.

Hadoop MapReduce is not well suited to at least two situations:
 The data are generated dynamically.
 Only part of the data needs to be analyzed.
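
The map, shuffle-and-sort, and reduce phases described above can be sketched in a few lines. This is an in-memory illustration only, not Hadoop code: the word-count task and the names `map_fn`, `reduce_fn` and `run_job` are assumptions made for the example. It also shows why the batch model fits poorly with dynamically generated data, since the shuffle cannot start until the whole input has been mapped.

```python
from collections import defaultdict


def map_fn(_key, line):
    """Emit (word, 1) for every word in one input line."""
    for word in line.split():
        yield word.lower(), 1


def reduce_fn(key, values):
    """Sum all counts collected for one key."""
    return key, sum(values)


def run_job(records):
    """Run map, then shuffle and sort, then reduce over a finite input."""
    # Map phase: every input record is processed independently.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))

    # Shuffle and sort phase: group intermediate values by key.
    # This step needs the complete map output, hence the batch nature.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: one call per distinct key, in sorted key order.
    return [reduce_fn(key, groups[key]) for key in sorted(groups)]


result = run_job([(0, "a b a"), (1, "b c")])
print(result)
```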
The modified model (1/3)
Programming Model for Real-Time Processing. (1/2)
 This study extends the input key-value model with a time stamp that is
carried through the map and reduce phases, as shown below:
Map: (<k1,t1>, v1) → list(<k2,t1>, v2)
Reduce: (<k1,t1>, <r1, r2, …, rn>) → (<k1,t1>, r)

The difference between map and reduce is that:
 multiple maps can be executed in parallel for the same key;
 the execution of reduce has to be synchronous for the same
key and time stamp to ensure a correct result.
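
A minimal sketch of the time-stamped key-value model may make the signatures above concrete. It assumes a click-counting task; the record layout and the names `map_fn`, `reduce_fn` and `run` are hypothetical and do not come from the paper. The composite key (key, time stamp) is what keeps values from different time stamps apart in the reduce step.

```python
from collections import defaultdict


def map_fn(key_ts, value):
    """Map: (<k1,t1>, v1) -> list(<k2,t1>, v2); the time stamp passes through."""
    _source, ts = key_ts
    for page in value.split():
        yield (page, ts), 1


def reduce_fn(key_ts, values):
    """Reduce: (<k,t>, <r1..rn>) -> (<k,t>, r); here r is the sum."""
    return key_ts, sum(values)


def run(records):
    # Group by the composite key, so only values that share both the key
    # and the time stamp end up in the same reduce call.
    groups = defaultdict(list)
    for key_ts, value in records:
        for out_key, out_value in map_fn(key_ts, value):
            groups[out_key].append(out_value)
    return dict(reduce_fn(key, values) for key, values in groups.items())


clicks = [
    (("server-a", 1), "/home /cart"),
    (("server-b", 1), "/home"),
    (("server-a", 2), "/home"),
]
counts = run(clicks)
print(counts)
```

In this example the two "/home" clicks at time stamp 1 are aggregated together, while the "/home" click at time stamp 2 is kept as a separate result.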
Programming Model for Real-Time Processing. (2/2)
 The modified key-value model supports the time
stamps that real-time data streams require.

As the figure on the next page shows, only values with the same key
and the same time stamp are aggregated.

In addition, real-time analytics needs a push-style map.
Hadoop MapReduce already provides this function, so we did not
modify the map function.

Figure: The modified MapReduce programming model
The modified model (2/3)
Execution Framework for Real-Time Processing (1/4)
 In real-time data processing, however, the key-value storage
provided must be replaced with a storage system that meets these requirements:
1) The input data can be horizontally partitioned and
replicated across Hadoop MapReduce's data nodes.
2) The input, intermediate and output data can be stored in
memory controlled by the user.
3) The key-value storage system can recover from node
failure.
4) Data consistency can be guaranteed through
configuration.
Execution Framework for Real-Time Processing (2/4)
 This key-value storage system is designed mainly for
input and not-yet-processed data, which are held in memory
before map processing.

We implemented Chord [10], a technique employed in many
distributed storage systems, using
JOL [12].
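
To illustrate how a Chord-style ring can partition and replicate keys, here is a simplified sketch. It is not the JOL implementation from the paper and it omits Chord's finger tables and node join/leave protocol. It only shows the placement rule: a key is stored on the first node whose identifier follows the key's identifier on the ring, and is replicated to the next nodes after that. The 16-bit identifier space, the `Ring` class and the node names are assumptions made for the example.

```python
import bisect
import hashlib

RING_BITS = 16  # small identifier space, chosen only for illustration


def ring_id(name):
    """Hash a node name or data key onto the identifier ring."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (1 << RING_BITS)


class Ring:
    def __init__(self, node_names):
        # Nodes sorted by their position on the ring.
        self.nodes = sorted((ring_id(name), name) for name in node_names)
        self.ids = [node_id for node_id, _ in self.nodes]

    def successor_index(self, key):
        """Index of the first node at or after the key, wrapping around."""
        index = bisect.bisect_left(self.ids, ring_id(key))
        return index % len(self.nodes)

    def replicas(self, key, count=2):
        """The successor node plus the nodes that follow it on the ring."""
        start = self.successor_index(key)
        count = min(count, len(self.nodes))
        return [self.nodes[(start + i) % len(self.nodes)][1]
                for i in range(count)]


ring = Ring(["node-1", "node-2", "node-3", "node-4"])
owners = ring.replicas("sensor-42", count=3)
print(owners)
```

Because placement depends only on hash values, every node can compute the same owner list for a key without central coordination.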
Execution Framework for Real-Time Processing (3/4)
 Using JOL, we can express the input data as tuples in
ordinary tables and event tables.
 First, the nodes in the Chord ring are contacted to
determine which node is selected for a tuple, and the tuple is then
replicated to the calculated nodes.
 Second, one tuple among the replicas is selected and inserted into
the map task event table implemented with JOL, which
triggers MapReduce to execute the map function.
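
The second step, where inserting a tuple into an event table triggers the map function, can be sketched as a plain callback mechanism. This does not use JOL or reproduce its API. The `EventTable` class, the replica layout and the rule for choosing a replica are hypothetical stand-ins that only illustrate the push-style trigger.

```python
class EventTable:
    """Hypothetical event table: every insert fires the registered callbacks."""

    def __init__(self):
        self._listeners = []

    def on_insert(self, callback):
        self._listeners.append(callback)

    def insert(self, row):
        # Rows are not retained; an insert only pushes the row to listeners.
        for callback in self._listeners:
            callback(row)


def map_fn(key_ts, value):
    """Parse one sensor reading; the (key, time stamp) pair passes through."""
    key, ts = key_ts
    return [((key, ts), float(value))]


emitted = []
map_task_events = EventTable()
map_task_events.on_insert(lambda row: emitted.extend(map_fn(*row)))

# The same tuple is held by two replica nodes; exactly one copy is
# selected (here simply the first node name in sorted order).
replicas = {
    "node-b": (("sensor-1", 5), "21.5"),
    "node-a": (("sensor-1", 5), "21.5"),
}
chosen = replicas[sorted(replicas)[0]]
map_task_events.insert(chosen)
print(emitted)
```

Selecting a single replica before the insert is what prevents the same tuple from being mapped once per copy.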

Figure: MapReduce for real-time analytics
Execution Framework for Real-Time Processing (4/4)
 The incoming data streams are divided into two types,
query and data, which are horizontally distributed and
replicated to several nodes through Chord.

Data that need to be stored persistently are
placed into the "Persistent Storage" component.
The modified model (3/3)

Persistent Storage
 In a distributed computation system, node failures are
not uncommon, and they prevent some
queries from being executed.
 This system chooses the Cassandra model as its persistent key-value storage.

Figure: Chord for key-value storage

Some streaming computations, as determined by the
"Data Stream Manager" component, are stored in the persistent storage.

When a single node fails, the system
continues processing the unfinished queries.
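
One way to picture how persisted state lets an unfinished query continue after a node failure is the checkpoint-and-resume sketch below. It is an assumption-laden illustration: `PersistentStore` is an in-memory stand-in rather than a Cassandra client, and the checkpoint layout (next position, partial sum) and the `fail_after` parameter are invented for the example, not taken from the paper.

```python
class PersistentStore:
    """In-memory stand-in for the persistent key-value storage."""

    def __init__(self):
        self._rows = {}

    def put(self, key, value):
        self._rows[key] = value

    def get(self, key, default=None):
        return self._rows.get(key, default)


def process(query_id, values, store, fail_after=None):
    """Sum the values, saving (next position, partial sum) after each item."""
    position, total = store.get(query_id, (0, 0))
    for index in range(position, len(values)):
        if fail_after is not None and index == fail_after:
            raise RuntimeError("simulated node failure")
        total += values[index]
        store.put(query_id, (index + 1, total))
    return total


store = PersistentStore()
values = [3, 4, 5, 6]

try:
    # The first node fails before it handles the third value.
    process("q1", values, store, fail_after=2)
except RuntimeError:
    pass

# Another node picks up the unfinished query from the stored checkpoint.
resumed_total = process("q1", values, store)
print(resumed_total)
```

The second call starts from the saved position instead of from the beginning, so no value is counted twice.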
Conclusions
1. A real-time data stream analytics system was built by modifying the stock Hadoop MapReduce programming model.
2. Chord was implemented to distribute and replicate data and query events in the execution framework.
3. HDFS was replaced with Cassandra as the key-value storage.