HPMR: Prefetching and Pre-shuffling in Shared MapReduce Computation Environment (IEEE, 2009)
Sangwon Seo (KAIST), Ingook Jang, Kyungchang Woo, Inkyo Kim, Jin-Soo Kim, Seungyoul Maeng
Presented 2013-04-25, Special Topics in File Processing, by Taehoon Kim

Contents
1. Introduction
2. Related Work
3. Design
4. Implementation
5. Evaluation
6. Conclusion

Introduction
- Internet services are difficult to support because they generate enormous volumes of data that must be processed every day.
- To address this problem, the MapReduce programming model is used: it supports distributed and parallel processing for large-scale data-intensive applications (e.g., data mining and data-intensive scientific simulation).

Introduction: HDFS
- Hadoop's distributed file system is called HDFS (Hadoop Distributed File System).
- An HDFS cluster consists of a single NameNode, the master server that manages the file-system namespace and regulates clients' access to files, and a number of DataNodes, each of which manages the storage directly attached to it.
- Under the HDFS replica placement policy, two of a block's three replicas are placed on nodes within the same rack.
- Advantage: write performance improves because inter-rack write traffic is cut down.

Introduction: MapReduce data flow
[Figure: MapReduce data flow across two nodes. Files loaded from HDFS are handled by an InputFormat, divided into splits, and fed through RecordReaders to the map tasks; a combiner and partitioner process the map output; the "shuffling" step moves intermediate data over the network; the reduce side sorts, reduces, and writes the results back to the local HDFS store through an OutputFormat.]
- Reducing the shuffling overhead is essential to improving the overall performance of a MapReduce computation.
- The network bandwidth between nodes is also an important factor in the shuffling overhead.

Introduction: "moving computation is better"
- Hadoop's basic principle is to migrate the computation closer to the data, which pays off when the data set is huge.
- Migrating the computation minimizes network congestion and increases the overall throughput (the amount of work processed within a given time) of the system.

Introduction: Hadoop On Demand
- HOD (Hadoop-On-Demand, developed by Yahoo!) is a management system for provisioning virtual Hadoop clusters over a large physical cluster.
- All physical nodes are shared by more than one Yahoo! engineer, which increases the utilization of the physical resources.
- When computing resources (e.g., network and hardware) are shared by multiple users, Hadoop's "moving computation" policy is no longer effective, precisely because the resources are shared.

Introduction: proposed schemes
- To solve this problem, two optimization schemes are proposed:
  - Prefetching: intra-block prefetching and inter-block prefetching
  - Pre-shuffling

Related Work
- J. Dean and S. Ghemawat (the original MapReduce work).
- Traditional prefetching techniques (V. Padmanabhan and J. Mogul; T. Kroeger and D. Long; P. Cao, E. Felten, et al.): prefetching methods that reduce I/O latency.

Related Work
- Zaharia et al.: LATE (Longest Approximate Time to End), which schedules speculative tasks more efficiently in a shared environment.
- Dryad (Microsoft): computations expressed as a directed acyclic graph.
- The degree of data locality is highly related to MapReduce performance.

Design (Prefetching Scheme): intra-block prefetching
[Figure 1: intra-block prefetching in the map phase. The input split assigned to a map task is divided into a "computation in progress" part and a "prefetching in progress" part. Figure 2: intra-block prefetching in the reduce phase, where the same division applies to the data expected by the reduce task.]
- Intra-block prefetching is bi-directional processing: a simple technique that prefetches data within a single block while a complex computation is being performed.
- While the complex job is performed on one side, the data that will be required next are prefetched in parallel and handed to the corresponding task.
- Advantages of intra-block prefetching:
  1. A processing bar monitors the current status of each side and raises a signal if synchronization is about to be broken.
  2. HPMR tries to find the prefetching rate at which performance is maximized while the prefetching overhead is minimized, so the network overhead can be kept low (see the sketch below).
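To make the intra-block idea concrete, here is a minimal, self-contained sketch in plain Java (illustrative only, not actual HPMR or Hadoop code): one thread performs the computation over records that are already in memory while a second thread reads ahead within the same block, and the bounded queue stands in for the processing bar that keeps the two sides from drifting apart. The class, field, and record names are assumptions made for the example.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustrative sketch of intra-block prefetching (not HPMR code): a prefetch
// thread reads records of one block ahead of the compute thread, and the
// bounded queue plays the role of the "processing bar" that keeps the two
// sides of the block synchronized.
public class IntraBlockPrefetchSketch {

    // Stand-in for one HDFS block: here, just an array of records.
    static String[] block = new String[1000];
    static { for (int i = 0; i < block.length; i++) block[i] = "record-" + i; }

    public static void main(String[] args) throws InterruptedException {
        // The queue capacity bounds how far prefetching may run ahead
        // of the computation.
        BlockingQueue<String> prefetched = new ArrayBlockingQueue<>(64);

        Thread prefetcher = new Thread(() -> {
            try {
                for (String record : block) {
                    // Simulate reading the record from disk ahead of time;
                    // put() blocks when prefetching gets too far ahead.
                    prefetched.put(record);
                }
                prefetched.put("EOF");   // end-of-block marker
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        prefetcher.start();

        // The "map" side consumes records that are already in memory,
        // so disk latency overlaps with the (expensive) computation.
        long sum = 0;
        while (true) {
            String record = prefetched.take();
            if (record.equals("EOF")) break;
            sum += record.length();      // placeholder for real map work
        }
        prefetcher.join();
        System.out.println("processed " + block.length + " records, sum=" + sum);
    }
}

The queue capacity is the knob that corresponds, loosely, to the prefetching rate: a larger capacity lets prefetching run further ahead at the cost of more memory and I/O pressure.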
Design (Prefetching Scheme): inter-block prefetching
[Figure: inter-block prefetching. Data blocks are stored on nodes n1, n2, and n3 at distances D = 1, 5, and 8 from the tasks that need them.]
- Inter-block prefetching works at the block level by prefetching the expected block replicas (copies) to the local rack.
- Tasks A2, A3, and A4 are prefetching their required blocks (D = distance).

Design (Prefetching Scheme): inter-block prefetching algorithm
1. Assign the map task to the node nearest to the required blocks.
2. The predictor generates the list of data blocks, B, to be prefetched for the target task t.

Design (Pre-Shuffling Scheme)
- The pre-shuffling module in the task scheduler looks over the input split (the candidate data) in the map phase and predicts which reducer each key-value pair will be partitioned into (see the sketch after the implementation slides).

Design (Optimization)
- LATE (Longest Approximate Time to End) algorithm: robustly performs speculative execution to maximize performance in a heterogeneous environment, but does not consider the data locality that could accelerate the MapReduce computation further.
- D-LATE (Data-aware LATE) algorithm: almost the same as LATE, except that a task is assigned as close as possible to the location where the needed data are present.

Implementation: optimized scheduler
- Predictor module: not only finds stragglers, but also predicts candidate data blocks and the reducers into which the key-value pairs will be partitioned.
- Using these predictions, the optimized scheduler performs the D-LATE algorithm.

Implementation: optimized scheduler
- Prefetcher: monitors the status of the worker threads and manages prefetching synchronization through the processing bar.
- Load balancer: checks the logs (including disk usage per node and current network traffic per data block) and is invoked to maintain load balance based on disk usage and network traffic.
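Both the pre-shuffling module and the predictor in the optimized scheduler rely on knowing, before the shuffle happens, which reducer a given key will be partitioned into. The sketch below (plain Java, illustrative only, not HPMR's actual module) makes that prediction with the same arithmetic used by Hadoop's default HashPartitioner; the sampled keys and the per-reducer histogram are assumptions for the example.

import java.util.HashMap;
import java.util.Map;

// Illustrative sketch (not HPMR's actual module): predicting, before the map
// output is shuffled, which reducer each key would be partitioned into.
public class PreShufflePredictionSketch {

    // Same arithmetic as Hadoop's default HashPartitioner.
    static int predictReducer(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 4;
        String[] sampledKeys = { "1949", "1950", "cat", "dog", "hadoop" };

        // Count how many sampled keys land on each reducer; a scheduler could
        // use this histogram to place map output near its eventual reducer
        // and so cut shuffle traffic over the network.
        Map<Integer, Integer> perReducer = new HashMap<>();
        for (String key : sampledKeys) {
            int r = predictReducer(key, numReduceTasks);
            perReducer.merge(r, 1, Integer::sum);
            System.out.println("key " + key + " -> reducer " + r);
        }
        System.out.println("predicted load per reducer: " + perReducer);
    }
}

Running the class prints the predicted reducer for each sampled key plus a per-reducer count, which is the kind of information a scheduler could use to co-locate map output with the reducer that will consume it.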
Evaluation: test environment
- Yahoo! Grid, consisting of 1,670 nodes divided into 40 racks connected by L3 routers.
- Each node: two dual-core 2.0 GHz AMD processors, 4 GB of main memory, 400 GB ATA hard disk drives, and a Gigabit Ethernet network interface card.
- All tests are configured so that HDFS maintains four replicas of each data block, whose size is 128 MB.
- Three types of workload: wordcount, search-log aggregator, and similarity calculator.

Evaluation
- Figure 7: HPMR shows significantly better performance than native Hadoop for all of the test sets.
- Figure 8: test set #1 has the smallest ratio of the number of nodes to the number of map tasks; test set #5 gains the most, due to a significant reduction in shuffling overhead.

Evaluation
- The prefetching latency is affected by disk overhead or network congestion; therefore, a long prefetching latency indicates that the corresponding node is heavily loaded.
- (Prefetching rate increases beyond 60%.)

Evaluation
- HPMR assures consistent performance even in a shared environment such as the Yahoo! Grid, where the available bandwidth fluctuates severely (4 Kbps to 128 Kbps).

Conclusion
- Two innovative schemes:
  - The prefetching scheme exploits data locality.
  - The pre-shuffling scheme reduces the network overhead required to shuffle key-value pairs.
- HPMR is implemented as a plug-in type component for Hadoop.
- HPMR improves overall performance by up to 73% compared to native Hadoop.
- As a next step, the authors plan to evaluate more complicated workloads such as HAMA (an open-source Apache incubator project).

Appendix: MapReduce Example
- Example: analyzing a weather data set.
- Each record is stored as a single line of ASCII text; within a file, every field has a fixed width and there are no delimiters.
- Example record: 0057332130999991950010103004+51317+028783FM12+017199999V0203201N00721004501CN0100001N9-01281-01391102681
- Query: from the NCDC data files recorded between 1901 and 2001, find the highest temperature (F) for each year.
- Processing steps:
  - Input: the file is read in 64 MB chunks.
  - 1st map: extract <offset, record> pairs from the file.
  - 2nd map: extract <year, temperature> pairs from each record.
  - Shuffle: organize the pairs into per-year data groups.
  - Reduce: merge the groups and return the final result file.

Appendix: MapReduce Example
- 1st map: extract <offset, record> from the file, i.e., <Key_1, Value> = <offset, record>:
  <0, 0067011990999991950051507004...9999999N9+00001+99999999999...>
  <106, 0043011990999991950051512004...9999999N9+00221+99999999999...>
  <212, 0043011990999991950051518004...9999999N9-00111+99999999999...>
  <318, 0043012650999991949032412004...0500001N9+01111+99999999999...>
  <424, 0043012650999991949032418004...0500001N9+00781+99999999999...>
  ...
- 2nd map: extract the year and temperature from each record, i.e., <Key_2, Value> = <year, temperature>:
  <1950, 0> <1950, 22> <1950, −11> <1949, 111> <1949, 78> ...

Appendix: MapReduce Example
- Shuffle: because the 2nd map produces so many pairs, they are regrouped by year, which reduces the merge cost in the reduce phase.
  2nd map output: <1950, 0>, <1950, 22>, <1950, −11>, <1949, 111>, <1949, 78>
  After shuffling: <1949, [111, 78]>, <1950, [0, 22, −11]>
- Reduce: merge the candidate sets from all mappers and return the final result.
  Mapper_1: (1950, [0, 22, −11]), (1949, [111, 78])
  Mapper_2: (1950, [25, 15]), (1949, [30, 45])
  Reducer: (1950, [0, 22, −11, 25, 15]) -> (1950, 25); (1949, [111, 78, 30, 45]) -> (1949, 111)

Appendix: Hadoop, The Definitive Guide (pp. 19-20)
[Figures 1-4 reproduced from the book.]
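The appendix walks through the classic max-temperature example from Hadoop: The Definitive Guide; the sketch below shows what the corresponding map and reduce classes look like against the Hadoop MapReduce API. The substring offsets for the year and temperature fields and the quality-code filter follow the book's treatment of the NCDC fixed-width format and should be read as assumptions here; the job driver that wires the classes together is omitted.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch of the max-temperature example: the map side turns each fixed-width
// NCDC record into a <year, temperature> pair, and the reduce side keeps the
// maximum per year.
public class MaxTemperature {

  public static class MaxTemperatureMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final int MISSING = 9999;           // sentinel for a missing reading

    @Override
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      String year = line.substring(15, 19);            // fixed-width year field
      int airTemperature;
      if (line.charAt(87) == '+') {                    // skip an explicit '+' sign
        airTemperature = Integer.parseInt(line.substring(88, 92));
      } else {
        airTemperature = Integer.parseInt(line.substring(87, 92));
      }
      String quality = line.substring(92, 93);         // reading-quality code
      if (airTemperature != MISSING && quality.matches("[01459]")) {
        context.write(new Text(year), new IntWritable(airTemperature));
      }
    }
  }

  public static class MaxTemperatureReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int maxValue = Integer.MIN_VALUE;
      for (IntWritable value : values) {
        maxValue = Math.max(maxValue, value.get());    // merge step: keep the max per year
      }
      context.write(key, new IntWritable(maxValue));
    }
  }
}

The mapper corresponds to the two map steps on the slides (record to <year, temperature>), and the reducer performs the per-year merge that yields pairs such as (1949, 111) and (1950, 25) in the example above.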