http://salsahpc.indiana.edu/twister4azure Twister4Azure : Iterative MapReduce for Azure Cloud A Decentralized, Dynamically Scalable, Fault Tolerant Iterative MapReduce Framework Built Using Cloud Services for Microsoft Azure Cloud. Map Job Start Map Combine Combine Reduce Merge Add Iteration? Map Combine Reduce No Job Finish Yes MapID ……. ……. Status Status Map 1 Map 2 Map n Red 1 Red 2 Red n K-Means Clustering KMeans iterative MapReduce performance. 16 Azure Small instances, 6 iterations, 8 to 48 million 20-D data points. Left: Performance with and without data caching. Right: Speedup obtained from using the data cache 1600 160% 1400 140% Relative Parallel Efficiency Time(s) 1200 120% 1000 100% 800 80% 600 60% 400 40% 200 20% 0 0% 8 X 16M 16 X 32M 32 X 64M 48 X 96M Num Instances X Num Data Points 64 X 128M Left: Scaling speedup with increasing number of instances (Azure Small) & data for 10 iterations. Right: Increasing number of iterations using 16 million data points with caching using 16 Azure Small instances. BLAST NCBI BLAST+ sequence searching using NR protein sequence database (~8.7 GB) using 128 CPU cores. 16 Extra-Large Azure instances were used for Twister4Azure. 8-core 16GB memory per node cluster was used Apache Hadoop. 16-core 16GB memory per node Windows HPC cluster was used for DryadLINQ testing. Each query file == 100 sequences. 100.00% 90.00% 80.00% Parallel Efficiency MapID Performance Comparison Relative Parallel Efficiency Twister4Azure Familiar MapReduce programming model Fault Tolerance features similar to traditional MapReduce. No single point of failure. Combiner step Supports dynamically scaling up and down of the compute resources. Web based monitoring console Easy testing and deployment using Azure local development fabric. Scheduling Queue Iterative extensions Worker Role Merge Step Map Workers Job Bulleting Board In-Memory Caching of Reduce Workers static data Cache aware hybrid In Memory Data Cache Map Task Table scheduling using Queues Task Monitoring as well as a bulletin Role Monitoring board (special table) Time (s) Introduction There exists many algorithms that rely on iterative computations, where each iterative step can be easily specified as a MapReduce computation. MapReduceRoles for Azure (MR4Azure) is a decentralized, dynamically scalable MapReduce runtime we developed for Windows Azure Cloud platform using Microsoft Azure cloud infrastructure services as the building blocks. Twister4Azure extends MR4Azure to support optimized iterative MapReduce executions, enabling a wide array of large scale iterative data analysis and scientific applications to utilize Azure platform easily and efficiently. 70.00% 60.00% 50.00% 40.00% 30.00% Twister4Azure 20.00% Hadoop-Blast DryadLINQ-Blast 10.00% 0.00% Data Cache 128 3000 Twister4Azure Computation Flow MapReduceRoles for Azure Use distributed, highly scalable and highly available cloud services as the building blocks. Azure Queues for task scheduling. Azure Blob storage for input, output and intermediate data storage. Azure Tables for meta-data storage and monitoring Utilize eventually-consistent , high-latency cloud services effectively to deliver performance comparable to traditional MapReduce runtimes. Minimal management and maintenance overhead Thilina Gunarathne, Judy Qiu, Geoffrey Fox {tgunarat, xqiu, gcf}@indiana.edu 2000 1500 Twister4Azure 1000 Amazon EMR 500 Apache Hadoop 328 428 528 Number of Query Files 628 728 Smith Waterman-GOTOH All-Pairs Sequence Alignment Azure Small instances for Twister4Azure, Amazon EC2 High-CPU-Extra-Large instances for Amazon Elastic MapReduce and 8 core 16GB per node cluster for Apache Hadoop. Block == 200 sequences. 0 Num. of Cores * Num. of Blocks CAP3 Sequence Assembly Using Azure Small instances for Twister4Azure, Amazon EC2 High-CPU-Extra-Large instances for Amazon Elastic MapReduce and 8 core 16GB per node cluster for Apache Hadoop. Each file contained 458 reads. Parallel Efficiency Hybrid scheduling of the new iteration Adjusted Time (s) 2500 228 100% 95% 90% 85% 80% 75% 70% 65% 60% 55% 50% Twister4Azure Amazon EMR Apache Hadoop Num. of Cores * Num. of Files REFERENCES Gunarathne, T., Wu,T.L., Qiu, J., and Fox, G.C. 2010. MapReduce in the Clouds for Science. In Proceedings of CloudCom 2010 Conference (Indianapolis,December 2010) Ekanayake, J., Li, H., Zhang B., et al., 2010. Twister: A Runtime for iterative MapReduce, in Proceedings of the First International Workshop on MapReduce and its Applications of ACM HPDC 2010 conference (Chicago, June 2010) Gunarathne, T., Wu, T. L., Qiu, J., et al., 2010. Cloud computing paradigms for pleasingly parallel biomedical applications. Submitted for publication in Concurrency and Computation: Practice and Experience journal. ACKNOWLEDGEMENTS Microsoft Exploratory Research in Clouds and Platforms Grant, FutureGrid and Salsa Group.