Stream Processing with BigData: SSS-MapReduce
Hidemoto Nakada, Hirotaka Ogawa and Tomohiro Kudoh
National Institute of Advanced Industrial Science and Technology,
1-1-1 Umezono, Tsukuba, Ibaraki 305-8568, JAPAN
Presenter: 蔡育龍
Outline
1. Introduction
2. Implementation
a. Overview of SSS-MapReduce
b. Stream Processing in SSS
c. Sliding Window Management
3. Preliminary Evaluation
4. Discussion
5. Related Work
6. Conclusion
1. Introduction
• Existing stream processing systems mainly target low-latency data processing and work only on relatively small in-memory data sets.
• This kind of system is very effective for a specific class of applications, such as algorithmic trading, but its applicable area is not so large.
• We propose SSS, which can process streamed data along with stored large data.
• SSS is basically a KVS-based MapReduce system, but it can handle streamed data with continuous Mapper and Reducer processes, which are periodically invoked by the system.
2. Implementation
a. Overview of SSS-MapReduce
1. Server Configuration:
2. Implementation of Distributed KVS:
• When an SSS server puts a key-value pair into the distributed KVS, it determines the unitary KVS to put it in from the hashed value of the key.
• All the SSS servers share the same hash function, to guarantee that key-value pairs with the same key go to the same unitary KVS.
• SSS writes key-value pairs in bulk. The pairs are sorted by key beforehand to reduce the burden on Tokyo Cabinet. SSS also reads key-value pairs in bulk, specifying the beginning key and the ending key of the range of key pairs.
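A minimal sketch of this placement and bulk-access pattern, assuming in-memory TreeMaps stand in for the Tokyo Cabinet stores; the names (KvsPlacementSketch, unitaryKvsFor) are hypothetical, not from the paper:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

public class KvsPlacementSketch {
    // Every SSS server must use the same function so that pairs with the
    // same key always land on the same unitary KVS.
    static int unitaryKvsFor(String key, int numKvs) {
        return Math.floorMod(key.hashCode(), numKvs);
    }

    public static void main(String[] args) {
        int numKvs = 4;
        // One in-memory stand-in per unitary KVS (Tokyo Cabinet in SSS).
        List<TreeMap<String, String>> kvs = new ArrayList<>();
        for (int i = 0; i < numKvs; i++) kvs.add(new TreeMap<>());

        // Bulk put: the batch is sorted by key first (TreeMap keeps it
        // sorted), then each pair goes to the KVS chosen by the shared hash.
        TreeMap<String, String> batch = new TreeMap<>();
        batch.put("apple", "1");
        batch.put("cherry", "3");
        batch.put("banana", "2");
        batch.forEach((k, v) -> kvs.get(unitaryKvsFor(k, numKvs)).put(k, v));

        // Bulk range read: fetch all pairs whose keys fall in [begin, end).
        String begin = "a", end = "c";
        for (TreeMap<String, String> store : kvs) {
            SortedMap<String, String> range = store.subMap(begin, end);
            range.forEach((k, v) -> System.out.println(k + " -> " + v));
        }
    }
}
```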
• We also implemented a network service layer, called the Data Server, that wraps Tokyo Cabinet so that it can be accessed from remote SSS servers.
• The protocol is specially designed to leverage the specific access patterns described above and to enable pipelined processing in the SSS servers.
3. Tuple Group:
• In SSS, the data space is divided into several sub-namespaces called 'Tuple Groups'. Mappers and Reducers read input from tuple group(s) and write their output into tuple group(s).
• The Data Server allocates one data file to each Tuple Group. This design allows us to remove a whole file when we want to remove a Tuple Group.
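A minimal sketch of the one-file-per-Tuple-Group design; TupleGroupStore and its methods are hypothetical names, and plain files stand in for the Tokyo Cabinet database files:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

public class TupleGroupStore {
    private final Path dir;
    private final Map<String, Path> groupFiles = new HashMap<>();

    TupleGroupStore(Path dir) { this.dir = dir; }

    // Allocate one data file per tuple group on first use.
    Path fileFor(String group) {
        return groupFiles.computeIfAbsent(group, g -> dir.resolve(g + ".db"));
    }

    // Removing a tuple group removes its whole backing file,
    // instead of scanning and deleting entries in shared storage.
    void removeGroup(String group) throws IOException {
        Path f = groupFiles.remove(group);
        if (f != null) Files.deleteIfExists(f);
    }

    public static void main(String[] args) throws IOException {
        TupleGroupStore store = new TupleGroupStore(Files.createTempDirectory("sss"));
        Path f = store.fileFor("clicks");
        Files.writeString(f, "key1\tvalue1\n");
        System.out.println("group file: " + f);
        store.removeGroup("clicks"); // whole group gone with one deletion
    }
}
```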
b. Stream Processing in SSS
1. Stream Input and Output:
• In SSS, stream input is represented as continuous writes to a specific tuple group. The tuple group works as an input buffer for the input stream. The processing Mapper / Reducer reads from that tuple group.
2. Periodic Mapper / Reducer:
• We implemented streamed data processing by invoking Mappers and Reducers continuously and periodically.
• The Mappers and Reducers read and delete key-value pairs from the specified Tuple Group, to ensure that no key-value pair is processed more than once.
• When a Periodic Mapper or Reducer kicks in on a Tuple Group, the Data Server creates a new database file and redirects subsequent write operations to the new file, while serving the old file for read operations from the Mapper or Reducer.
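A minimal sketch of this periodic invocation with write redirection, assuming a ConcurrentLinkedQueue as a stand-in for the tuple group's database file; all names are hypothetical:

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public class PeriodicMapperSketch {
    // Current write target for the input tuple group.
    static final AtomicReference<Queue<String>> current =
            new AtomicReference<>(new ConcurrentLinkedQueue<>());

    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService timer = Executors.newScheduledThreadPool(2);

        // Stream input: continuous writes into the current buffer.
        timer.scheduleAtFixedRate(
                () -> current.get().add("record@" + System.nanoTime()),
                0, 10, TimeUnit.MILLISECONDS);

        // Periodic Mapper: rotate the buffer (new writes go to a fresh
        // one), then drain the old snapshot so each pair is seen once.
        timer.scheduleAtFixedRate(() -> {
            Queue<String> snapshot = current.getAndSet(new ConcurrentLinkedQueue<>());
            System.out.println("mapper invoked on " + snapshot.size() + " records");
            snapshot.clear(); // read-and-delete semantics
        }, 100, 100, TimeUnit.MILLISECONDS);

        Thread.sleep(500);
        timer.shutdownNow();
    }
}
```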
3. MergeReducer:
• MergeReducers are special Reducers that can handle inputs from more than one tuple group.
• The input for a MergeReducer is one key and two or more value lists.
• In the SSS server there are multiple threads that read tuples from each Tuple Group.
• Note that the data in a Tuple Group are always sorted by key.
• The SSS server controls the read threads so that they give the MergeReducer tuples that have the same key.
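A minimal sketch of this key alignment, assuming iterators over sorted maps stand in for the read threads, and assuming the MergeReducer is invoked only for keys present in all input groups (one possible interpretation of the paper's description):

```java
import java.util.List;
import java.util.TreeMap;

public class MergeReducerSketch {
    public static void main(String[] args) {
        // Two sorted tuple groups: key -> list of values.
        TreeMap<String, List<Integer>> groupA = new TreeMap<>();
        groupA.put("a", List.of(1, 2));
        groupA.put("c", List.of(3));
        TreeMap<String, List<Integer>> groupB = new TreeMap<>();
        groupB.put("a", List.of(10));
        groupB.put("b", List.of(20));

        // Advance both readers key by key, like a merge join over
        // key-sorted inputs; emit when the keys line up.
        var itA = groupA.entrySet().iterator();
        var itB = groupB.entrySet().iterator();
        var a = itA.hasNext() ? itA.next() : null;
        var b = itB.hasNext() ? itB.next() : null;
        while (a != null && b != null) {
            int cmp = a.getKey().compareTo(b.getKey());
            if (cmp == 0) { // same key present in both groups
                System.out.println("merge-reduce " + a.getKey()
                        + ": " + a.getValue() + " + " + b.getValue());
                a = itA.hasNext() ? itA.next() : null;
                b = itB.hasNext() ? itB.next() : null;
            } else if (cmp < 0) {
                a = itA.hasNext() ? itA.next() : null; // skip unmatched key
            } else {
                b = itB.hasNext() ? itB.next() : null;
            }
        }
    }
}
```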
c. Sliding Window Management
• PostReducers have a small amount of persistent in-memory storage used as a ring buffer.
• The length of the buffer is length/subwindowLength for each key.
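A minimal sketch of such a per-key ring buffer, assuming "length/subwindowLength" means window length divided by subwindow length and that each slot holds one subwindow's partial aggregate; the names are hypothetical:

```java
public class SlidingWindowSketch {
    private final long[] slots;  // one partial sum per subwindow
    private int next = 0;        // index of the slot to overwrite

    SlidingWindowSketch(int windowLength, int subwindowLength) {
        this.slots = new long[windowLength / subwindowLength];
    }

    // Called once per subwindow with that subwindow's partial aggregate;
    // the oldest subwindow falls out of the window automatically.
    void pushSubwindow(long partialSum) {
        slots[next] = partialSum;
        next = (next + 1) % slots.length;
    }

    long windowSum() {
        long sum = 0;
        for (long s : slots) sum += s;
        return sum;
    }

    public static void main(String[] args) {
        // Window of 60 s, subwindows of 10 s: a ring of 6 slots per key.
        SlidingWindowSketch w = new SlidingWindowSketch(60, 10);
        for (int i = 1; i <= 8; i++) {
            w.pushSubwindow(i);
            System.out.println("after subwindow " + i + ": sum=" + w.windowSum());
        }
    }
}
```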
3. Preliminary Evaluation
• We performed a preliminary evaluation to measure the data-stream handling throughput of SSS on one node.
• The input data were randomly generated so that they mimic Apache web server log records.
• The record size was about 300 bytes. We repeatedly put 10,000 records at 10 ms intervals. The input stream was fed directly into the Data Server, bypassing the SSS server.
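A minimal sketch of the kind of load generator this evaluation implies, with hypothetical details: records shaped like Apache access-log lines, emitted in batches of 10,000 every 10 ms (here only counted rather than sent to the Data Server):

```java
import java.util.Random;

public class LogLoadGenerator {
    static final Random rnd = new Random();

    // Shaped like an Apache access-log line; all fields are random
    // (the evaluation used records of about 300 bytes).
    static String fakeLogRecord() {
        return String.format(
                "192.168.%d.%d - - [10/Oct/2011:13:55:%02d +0900] "
                        + "\"GET /page/%d HTTP/1.1\" 200 %d \"-\" \"Mozilla/5.0\"",
                rnd.nextInt(256), rnd.nextInt(256), rnd.nextInt(60),
                rnd.nextInt(10000), rnd.nextInt(100000));
    }

    public static void main(String[] args) throws InterruptedException {
        for (int batch = 0; batch < 3; batch++) {   // a few batches for the demo
            long bytes = 0;
            for (int i = 0; i < 10000; i++) {
                bytes += fakeLogRecord().length();  // would be fed to the Data Server
            }
            System.out.println("batch " + batch + ": " + bytes + " bytes");
            Thread.sleep(10);                       // 10 ms between batches
        }
    }
}
```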
4. Discussion
• The event (key-value pair) stream is distributed to Mappers and Reducers on several nodes.
• There is a shuffle phase between Map and Reduce, where events from the nodes are exchanged with each other. This means that Reducers will receive events in an out-of-order fashion.
5. Related Work
a. Hadoop Online Prototype
• HOP (Hadoop Online Prototype) is a Hadoop variant that enables pipelined processing by directly connecting Mappers with Reducers, and even Reducers with the Mappers of the next iteration, using sockets, aiming at quick iteration of MapReduce operations.
• Although it can handle BigData, since it is based on Hadoop, its continuous queries only work on stream data from outside and cannot handle static data stored in HDFS.
b. C-MR (Continuous MapReduce)
• C-MR is a stream processing system that targets a single node with multi-core processors.
• C-MR adopts MapReduce as the programming interface and supports strict sliding window management.
• Since it is meant for a single node, it cannot scale out by increasing the number of nodes.
c. S4
• S4 is a distributed processing framework for stream data, originally implemented by Yahoo! and contributed to Apache.
• The basic data structure in S4 is called an Event, which is composed of a key and a value.
• Operations are performed in modules called PEs (Processing Elements).
• The basic concept of S4 is somewhat similar to SSS. The main difference is that SSS can handle off-memory big data while S4 only supports on-memory data.
6. Conclusion
• SSS handles stream data with continuous MapReduce, which is periodically invoked by the system and performs its operations in the storage.
• With continuous MapReduce and the MergeReducer, SSS can perform stream processing based on static BigData stored in the system.