REAL-TIME ANALYTICS PROCESSING WITH MAPREDUCE

Authors: C. Z. PENG, Z. J. JIANG, X. B. CAI and Z. K. ZHANG
Source: Proceedings of the 2012 International Conference on Machine Learning and Cybernetics, Xian, 15-17 July, 2012
Presenter: 許立新
Presentation date: 2013/5/14

Outline
1. Introduction of real-time analytics.
2. Related work of Hadoop MapReduce.
3. The modified programming model.
   Programming Model for Real-Time Processing.
   Execution Framework.
   Persistent Storage.
4. Conclusions

2016/7/14

Introduction (1/2)
- These SPCs have not been adopted by enterprises as widely as MapReduce, even though real-time analytics applications are now receiving attention.
- Wireless terminal devices and personal computers are increasingly found everywhere and are continuously connected to the Internet.
- Analyzing this information in real time can bring large commercial benefits and help avoid commercial losses.

Introduction (2/2)
- Data stream management systems (DSMS) are attracting renewed attention in the data analytics area:
  - financial tickers
  - performance measurements in network monitoring and traffic management
  - log records and click-streams in web tracking
  - data feeds from sensor applications
- There has been much discussion about whether MapReduce can be applied to real-time analytics.

Related work (1/2)
- MapReduce allows developers to think in a data-oriented fashion. The following are handled for them:
  - how the computations are actually carried out
  - how to get the data to the processes that depend on them
  - distributed execution
  - network communication
  - coordination and fault tolerance
- A batch-processing run is called a job. Before a MapReduce application runs, the data to be processed must first be stored in Hadoop HDFS.

Related work (2/2)
- If the user does not set the number of Map and Reduce tasks, the master node calculates the number automatically using a hash algorithm.
- After mapping, the intermediate data enter the shuffle-and-sort phase supplied by MapReduce.
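The batch flow described above (data stored first, then map, shuffle and sort, then reduce) can be sketched in a few lines. This is a single-process illustration only, not Hadoop code: the names `map_fn`, `reduce_fn` and `run_job`, and the word-count workload, are assumptions made for the example, and a plain list stands in for data already stored in HDFS.

```python
from collections import defaultdict


def map_fn(_key, line):
    """Map: emit (word, 1) for every word in one stored input line."""
    return [(word, 1) for word in line.split()]


def reduce_fn(key, values):
    """Reduce: combine all values that were grouped under one key."""
    return key, sum(values)


def run_job(stored_input):
    """Run one batch job over input that is fully available up front."""
    # Map phase: every stored record is processed before reducing starts.
    intermediate = []
    for offset, line in enumerate(stored_input):
        intermediate.extend(map_fn(offset, line))

    # Shuffle: group the intermediate values by key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Sort the keys, then run reduce once per key.
    return [reduce_fn(key, groups[key]) for key in sorted(groups)]


# Hypothetical input standing in for files already stored in HDFS.
stored_input = ["click view click", "view click"]
result = run_job(stored_input)
```

Note that `run_job` cannot start until the whole input list exists, which is the batch assumption that becomes a problem for continuously arriving data.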
Hadoop MapReduce is poorly suited to real-time analytics in at least two respects:
  - The data are generated dynamically.
  - Only part of the data needs to be analyzed.

The modified model (1/3)
Programming Model for Real-Time Processing (1/2)
- This study extended the input key-value model through the map and reduce phases as shown below:
    Map:    (<k1,t1>, v1) → list(<k2,t1>, v2)
    Reduce: (<k1,t1>, <r1,r2…rn>) → (<k1,t1>, r)
- The difference between map and reduce is that:
  - multiple maps are executed in parallel for the same key;
  - the execution of reduce has to be synchronous for the same key and time stamp to ensure the right result.

Programming Model for Real-Time Processing (2/2)
- The modified key-value model supports the time stamp that a real-time data stream requires.
- In the figure on the next slide, only values with the same key and the same time stamp are aggregated.
- In addition, real-time analytics needs a push-style map. Hadoop MapReduce already provides this function, so the map function was not modified.

[Figure: The modified MapReduce programming model]

The modified model (2/3)
Execution Framework for Real-Time Processing (1/4)
However, real-time data processing requires replacing the provided key-value storage with one that meets the following requirements:
1) The input data can be horizontally partitioned and replicated across Hadoop MapReduce's data nodes.
2) The input, intermediate and output data can be stored in memory controlled by the user.
3) The key-value storage system can recover from node failure.
4) Data consistency can be guaranteed through configuration.

Execution Framework for Real-Time Processing (2/4)
- This key-value storage system is designed mainly for the input and not-yet-handled data, which are stored in memory before map processing.
- We implemented Chord [10], a technique employed in many distributed storage systems, using JOL [12].
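As a rough illustration of how a Chord-style ring can partition and replicate in-memory key-value data, the sketch below places each key on its successor node and copies it to the next nodes on the ring. It is a simplified stand-in, not the JOL implementation described in the slides: it has no finger tables or node joins, and the class `ChordRing`, the node names and the `replicas` parameter are all hypothetical.

```python
import hashlib
from bisect import bisect_left


def ring_id(name: str) -> int:
    """Hash a node name or data key onto the identifier ring (SHA-1)."""
    return int.from_bytes(hashlib.sha1(name.encode("utf-8")).digest(), "big")


class ChordRing:
    """Static ring: successor lookup plus replication to following nodes."""

    def __init__(self, node_names, replicas=2):
        # Nodes sorted by their position on the ring.
        self.nodes = sorted((ring_id(n), n) for n in node_names)
        self.ids = [node_id for node_id, _ in self.nodes]
        self.replicas = min(replicas, len(self.nodes))

    def successor_index(self, key: str) -> int:
        """Index of the first node whose id is >= the key id, wrapping around."""
        return bisect_left(self.ids, ring_id(key)) % len(self.nodes)

    def placement(self, key: str):
        """Primary node for the key, followed by its replica nodes."""
        start = self.successor_index(key)
        count = len(self.nodes)
        return [self.nodes[(start + i) % count][1] for i in range(self.replicas)]


# Hypothetical cluster and data key.
ring = ChordRing(["node-a", "node-b", "node-c", "node-d"], replicas=2)
holders = ring.placement("sensor-17")
```

Because placement depends only on hashes, any node can compute where a tuple and its replicas live without a central directory, which is one reason this style of lookup appears in many distributed stores.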
Execution Framework for Real-Time Processing (3/4)
- Using JOL, the input data can be expressed as tuples in ordinary tables and event tables.
- First, the nodes in the Chord ring are contacted to determine which node is selected, and the data are then replicated to the calculated nodes.
- Second, one tuple among the replicas is selected and inserted into the map-task event table implemented with JOL, which triggers MapReduce to execute the map function.

[Figure: MapReduce for real-time analytics]

Execution Framework for Real-Time Processing (4/4)
- The incoming data streams are divided into two types, query and data, which are horizontally distributed and replicated to a number of nodes through Chord.
- Data that need to be stored persistently are placed into the "Persistent Storage" component.

The modified model (3/3)
Persistent Storage
- In a distributed computation system, node failures are not uncommon, and such failures can prevent some queries from being executed.
- This system chooses the Cassandra model as its persistent key-value storage.

[Figure: Chord for key-value storage]

- Some streaming computations, as calculated by the "Data Stream Manager" component, are stored in the persistent storage.
- When a single node fails, the system continues processing the unfinished queries.

Conclusions
1. Built a real-time data stream analytics system by modifying the stock Hadoop MapReduce programming model.
2. Implemented Chord to distribute and replicate data and query events in the execution framework.
3. Replaced HDFS with Cassandra as the key-value storage.
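To make the first conclusion concrete, here is a minimal sketch of the time-stamped key-value model, in which map carries the time stamp through unchanged and reduce aggregates only values that share both key and time stamp. It is an illustration under assumptions, not the authors' code: the page-view workload, the record layout and the names `map_fn`, `reduce_fn` and `process` are invented for the example, and a finite list processed sequentially stands in for a pushed stream with synchronized reduces.

```python
from collections import defaultdict


def map_fn(key_ts, value):
    """Map: (<k1,t1>, v1) -> list(<k2,t1>, v2); t1 is passed through as-is."""
    _k1, t1 = key_ts
    return [((value["page"], t1), 1)]


def reduce_fn(key_ts, values):
    """Reduce: fold all values that share the same key AND time stamp."""
    return key_ts, sum(values)


def process(stream):
    """Map each arriving record, then reduce per (key, time stamp) group."""
    groups = defaultdict(list)
    for key_ts, value in stream:
        for out_key_ts, v2 in map_fn(key_ts, value):
            groups[out_key_ts].append(v2)
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))


# Hypothetical click events: key is (user, time stamp), value is the record.
stream = [
    (("u1", 100), {"page": "/home"}),
    (("u2", 100), {"page": "/home"}),
    (("u1", 101), {"page": "/home"}),
    (("u3", 100), {"page": "/cart"}),
]
counts = process(stream)
```

The two `/home` views at time stamp 100 are combined, while the `/home` view at time stamp 101 stays in its own group, which mirrors the rule that only the same key with the same time stamp is aggregated.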