Hybrid MapReduce Workflow
Yang Ruan, Zhenhua Guo, Yuduo Zhou, Judy Qiu, Geoffrey Fox
Indiana University, US

Outline
• Introduction and Background
  – MapReduce
  – Iterative MapReduce
  – Distributed Workflow Management Systems
• Hybrid MapReduce (HyMR)
  – Architecture
  – Implementation
  – Use case
• Experiments
  – Performance
  – Scaleup
  – Fault tolerance
• Conclusions

MapReduce
• Introduced by Google
• Hadoop is an open source MapReduce framework
• Mapper: reads input data, emits key/value pairs
• Reducer: accepts a key and all the values that belong to that key, emits the final output
[Figure: MapReduce execution overview. The user program forks a master and workers; the master assigns map and reduce tasks; map workers read input splits (Split 0, Split 1, Split 2) and write locally; reduce workers read remotely, sort, and write Output File 0 and Output File 1.]

Iterative MapReduce (Twister)
• Iterative applications: K-means, EM
• An extension to MapReduce
• Long-running mappers and reducers
• Uses data streaming instead of file I/O
• Keeps static data in memory
• Uses broadcast to send updated data to all mappers
• Uses a pub/sub messaging infrastructure
• Naturally supports parallel iterative applications efficiently

Workflow Systems
• Traditional Workflow Systems
  – Focused on dynamic resource allocation
  – Pegasus, Kepler, Taverna
• MapReduce Workflow Systems
  – Oozie
    • Apache project
    • Uses XML to describe workflows
  – MRGIS
    • Focuses on GIS applications
  – CloudWF
    • Optimized for usage in the cloud
  – All based on Hadoop

Why Hybrid?
• MapReduce
  – Lacks support for parallel iterative applications
  – High overhead on iterative application execution
  – Strong fault tolerance support
  – File system support
• Iterative MapReduce
  – No file system support; data are saved on local disk or NFS
  – Weak fault tolerance support
  – Efficient iterative application execution

HyMR Architecture
• Concrete model
  – Uses PBS/TORQUE for resource allocation
  – Focused on efficient workflow execution after resources are allocated
• User Interface
  – WF definition in Script/XML
• Instance Controller
  – WF model: DAG
  – Manages workflow execution
  – Job status checker
  – Status updates in XML

Job and Runtime Controller
• Job Controller
  – Manages job execution
  – Single-Node Job: File Distributor, File Partitioner
  – Multi-Node Job: MapReduce Job, Iterative MapReduce Job
  – Twister Fault Checker: detects faults and notifies the Instance Controller
• Runtime Controller
  – Runtime Configuration: spares the user from complicated Hadoop and Twister configuration and starts the runtime automatically
  – Persistent Runtime: reduces the time cost of restarting runtimes once a job is finished
  – Supports Hadoop and Twister

File System Support in Twister
• Add HDFS support for Twister
  – Before: explicit data staging phase
  – After: implicit data staging, the same as Hadoop

A Bioinfo Data Visualization Pipeline
• Input: FASTA file
• Output: a coordinates file containing the mapping result from dimension reduction
• 3 main components:
  – Pairwise Sequence Alignment: reads the FASTA file, generates the dissimilarity matrix
  – Multidimensional Scaling (MDS): reads the dissimilarity matrix, generates the coordinates file
  – Interpolation: reads the FASTA file and the coordinates file, generates the final result
• Sample FASTA input:
    …
    >SRR042317.123
    CTGGCACGT…
    >SRR042317.129
    CTGGCACGT…
    >SRR042317.145
    CTGGCACGG…
    …

Twister-Pipeline
• Hadoop does not directly support MDS (iterative application).
  Incurs high overhead
• Every data staging step is explicitly treated as a job

Hybrid-Pipeline
• In the HyMR pipeline, distributed data are stored in HDFS. No explicit data staging is needed, as partitioned data are written to and read from HDFS directly.

Pairwise Sequence Alignment
• Used for generating the all-pair dissimilarity matrix
• Uses Smith-Waterman as the alignment algorithm
• Improves task granularity to reduce scheduling overhead
[Figure: Pairwise sequence alignment as a MapReduce job. Input sample FASTA partitions 1 to n feed map tasks (M), which compute blocks (0,0) through (n-1,n-1); reduce tasks (R) write dissimilarity matrix partitions 1 to n. Legend: sample data file I/O, network communication.]

Multidimensional Scaling (MDS)
• Scaling by Majorizing a Complicated Function (SMACOF)
• Two MapReduce jobs in each iteration
[Figure: Parallelized SMACOF. Input dissimilarity matrix partitions 1 to n feed two M/R/C stages, the parallelized SMACOF algorithm and the stress calculation, which produce the sample coordinates. Legend: sample data file I/O, sample label file I/O, network communication.]

MDS Interpolation
• SMACOF uses O(N^2) memory, which limits its applicability to large collections of data
• Interpolates out-sample sequences into the target dimension space using the mapping results of their k nearest neighbor sample sequences
[Figure: MDS interpolation. Input sample coordinates, input sample FASTA, and input out-sample FASTA partitions 1 to n feed map tasks (M), followed by reduce (R) and C stages that write the final output. Legend: sample data file I/O, out-sample data file I/O, network communication.]

Experiment Setup
• PolarGrid cluster at Indiana University (8 cores per machine)
• 16S rRNA data from the NCBI database
• Number
  of sequences: from 6144 to 36864
• Sample set and out-sample set: split 50/50
• Node/core number: from 32/256 to 128/1024

Performance Comparison
• Tested on 96 nodes, 768 cores
• The difference increases as the data size grows
• Files are written to and read from HDFS directly
• Runtime startup takes longer
• Execution includes read/write I/O, which costs more than on local disk
[Figure: Time cost (thousand seconds) versus data size (6144 to 36864 sequences) for Twister-pipeline and Hybrid-pipeline.]

Detail Time Analysis
• Twister-pipeline
  – Data staging time grows as the data size increases
  – Less runtime start/stop time
• Hybrid-pipeline
  – Data staging time is fixed because the number of map tasks is fixed
  – Longer execution time
[Figure: Two stacked-percentage charts, one per pipeline, showing the share of RuntimeControl, Execution, and DataStaging time versus data size (6144 to 36864).]

Scale-up Test
• Hybrid-pipeline performs better as the number of nodes increases
  – Data distribution overhead from Twister increases
  – Scheduling overhead for Hadoop increases, but not much
• For pure computation time, Twister-pipeline performs slightly better, since all the files are on local disk when jobs run
[Figure: Speedup (hundreds) versus core number (0 to 1024) for Twister-speedup and Hybrid-speedup; time cost (thousand seconds) versus core number for Twister-execution-time and Hybrid-execution-time.]

Fault Tolerance Test
• Kill 1 or 10 nodes manually at different times during execution
• 10% and 25% are at PSA; 40% is at MDS; 55%, 70% and 85% are at Interpolation
• If a node is killed while the Hadoop runtime is in use, the tasks are rescheduled immediately; otherwise HyMR restarts the job
[Figure: Time cost (thousand seconds) versus time percentage at which nodes are killed (10% to 85%) for Hybrid-10nodes, Hybrid-1node, and Twister-1node.]

Conclusions
• First hybrid workflow system based on MapReduce and
  iterative MapReduce runtimes
• Supports iterative parallel applications efficiently
• Fault tolerance and HDFS support added for Twister

Questions?

Supplement

Other iterative MapReduce runtimes
• Haloop
  – Extension based on Hadoop
  – Task scheduler keeps data locality for mappers and reducers
  – Input and output are cached on local disks to reduce I/O cost between iterations
  – Fault tolerance same as Hadoop; the cache is reconstructed on the worker assigned the failed worker's partition
• Spark
  – Iterative MapReduce by keeping long-running mappers and reducers
  – Built on Nexus, a cluster manager that keeps a long-running executor on each node
  – Static data are cached in memory between iterations
  – Uses Resilient Distributed Datasets to ensure fault tolerance
• Pregel
  – Large-scale iterative graph processing framework
  – Uses long-living workers to keep the updated vertices between Super Steps
  – Vertices update their status during each Super Step
  – Uses aggregators for global coordination
  – Keeps a checkpoint at each Super Step; if one worker fails, all the other workers need to roll back

Different Runtimes Comparison
Name     Iterative  Fault Tolerance  File System  Scheduling  Higher-level language  Caching  Worker Unit  Environment
Google   No         Strong           GFS          Dynamic     Sawzall                --       Process      C++
Hadoop   No         Strong           HDFS         Dynamic     Pig                    --       Process      Java
Twister  Yes        Weak             --           Static      --                     Memory   Thread       Java
Haloop   Yes        Strong           HDFS         Dynamic     --                     Disk     Process      Java
Spark    Yes        Weak             HDFS         Static      Scala                  Memory   Thread       Java
Pregel   Yes        Weak             GFS          Static      --                     Memory   Process      C++
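
Sketch: the iterative pattern with in-memory caching
The "Iterative = Yes, Caching = Memory" runtimes compared above share one pattern: mappers are created once, keep their static data in memory, and receive only the small updated data at each iteration. The following is a minimal single-process Python sketch of that pattern for K-means, one of the iterative applications named earlier. It is an illustration under stated assumptions, not Twister's or Spark's API: the names LongRunningMapper, reduce_centroid, and iterative_kmeans are hypothetical, and the "broadcast" is simulated by passing the same centroid list to every mapper.

```python
from collections import defaultdict


class LongRunningMapper:
    """Hypothetical mapper: loads its static partition once and keeps it in memory."""

    def __init__(self, points):
        self.points = points  # static data, reused in every iteration

    def map(self, centroids):
        # Emit (index of nearest centroid, (point, 1)) for each cached point.
        for p in self.points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            yield nearest, (p, 1)


def reduce_centroid(values):
    """Average all points assigned to one centroid."""
    dim = len(values[0][0])
    total = [0.0] * dim
    count = 0
    for point, n in values:
        for d in range(dim):
            total[d] += point[d]
        count += n
    return tuple(t / count for t in total)


def iterative_kmeans(partitions, centroids, max_iters=20, tol=1e-9):
    # Mappers are created once, outside the loop (assumption: this stands in
    # for long-running map tasks that survive across iterations).
    mappers = [LongRunningMapper(p) for p in partitions]
    for _ in range(max_iters):
        groups = defaultdict(list)
        for mapper in mappers:
            # Simulated broadcast: every mapper receives the current centroids.
            for key, value in mapper.map(centroids):
                groups[key].append(value)
        new_centroids = list(centroids)
        for key, values in groups.items():
            new_centroids[key] = reduce_centroid(values)
        # Largest squared movement of any centroid in this iteration.
        shift = max(
            sum((a - b) ** 2 for a, b in zip(old, new))
            for old, new in zip(centroids, new_centroids)
        )
        centroids = new_centroids
        if shift < tol:
            break
    return centroids
```

Calling iterative_kmeans with two partitions and two initial centroids returns the converged centroid list. Only the centroids change between iterations; the partitions are never reloaded, which is the cost a non-iterative MapReduce runtime would pay by re-reading its input for every job.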