http://salsahpc.indiana.edu/twister4azure
Twister4Azure : Iterative MapReduce for Azure Cloud
A Decentralized, Dynamically Scalable, Fault Tolerant Iterative MapReduce Framework Built Using Cloud Services for Microsoft Azure Cloud.
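[Figure: iterative computation flow. Job Start, Map, Combine, Reduce, Merge, then the decision "Add Iteration?". Yes: another round of Map, Combine, Reduce. No: Job Finish. Task table panel: MapID, Status; tasks Map 1, Map 2, ..., Map n and Red 1, Red 2, ..., Red n.]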
K-Means Clustering
K-Means iterative MapReduce performance. 16 Azure Small instances,
6 iterations, 8 to 48 million 20-D data points.
Left: Performance with and without data caching.
Right: Speedup obtained from using the data cache.
[Figure: K-Means scaling. Axes: Time (s) and Relative Parallel Efficiency versus Num Instances X Num Data Points (8 X 16M, 16 X 32M, 32 X 64M, 48 X 96M, 64 X 128M).]
Left: Scaling speedup with increasing number of instances (Azure Small)
and data for 10 iterations.
Right: Increasing number of iterations using 16 million data points
with caching using 16 Azure Small instances.
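
The K-Means benchmark above follows the Map, Combine, Reduce, Merge pattern with an iteration check at the end. The following is a minimal single-process sketch of that pattern in plain Python, not Twister4Azure's actual API. All function names, the `tol` convergence threshold and the in-process loop are assumptions made for illustration: the partitions play the role of the static data that would be cached, and the centroids are the data that changes each iteration.

```python
import math
from collections import defaultdict


def kmeans_map(partition, centroids):
    """Map: emit (nearest centroid index, (point, 1)) for every point."""
    pairs = []
    for point in partition:
        idx = min(range(len(centroids)),
                  key=lambda i: math.dist(point, centroids[i]))
        pairs.append((idx, (point, 1)))
    return pairs


def kmeans_combine(pairs):
    """Combine: pre-aggregate partial sums per centroid inside one map task."""
    sums = {}
    for idx, (vec, count) in pairs:
        if idx in sums:
            acc, n = sums[idx]
            sums[idx] = ([a + b for a, b in zip(acc, vec)], n + count)
        else:
            sums[idx] = (list(vec), count)
    return list(sums.items())


def kmeans_reduce(partials):
    """Reduce: turn all partial (sum, count) pairs of one centroid into a mean."""
    dim = len(partials[0][0])
    total, count = [0.0] * dim, 0
    for vec, n in partials:
        total = [a + b for a, b in zip(total, vec)]
        count += n
    return [x / count for x in total]


def kmeans_merge(reduced, old_centroids):
    """Merge: assemble the new centroid list and measure how far it moved."""
    new = [reduced.get(i, list(c)) for i, c in enumerate(old_centroids)]
    shift = max(math.dist(a, b) for a, b in zip(new, old_centroids))
    return new, shift


def run_kmeans(partitions, centroids, max_iterations=10, tol=1e-6):
    """Driver: repeat Map, Combine, Reduce, Merge until the centroids settle."""
    iteration = 0
    for iteration in range(1, max_iterations + 1):
        grouped = defaultdict(list)
        for partition in partitions:  # one map task per (cached) partition
            combined = kmeans_combine(kmeans_map(partition, centroids))
            for idx, partial in combined:
                grouped[idx].append(partial)
        reduced = {idx: kmeans_reduce(p) for idx, p in grouped.items()}
        centroids, shift = kmeans_merge(reduced, centroids)
        if shift < tol:  # the "Add Iteration?" decision
            break
    return centroids, iteration
```

Only the centroids change between iterations in this sketch, which is why keeping the input partitions in memory across iterations can pay off.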
BLAST
NCBI BLAST+ sequence searching using the NR protein
sequence database (~8.7 GB) on 128 CPU cores.
Twister4Azure used 16 Extra-Large Azure instances.
Apache Hadoop used a cluster with 8 cores and 16 GB of
memory per node. DryadLINQ testing used a Windows HPC
cluster with 16 cores and 16 GB of memory per node.
Each query file contained 100 sequences.
[Figure: BLAST performance comparison. Parallel efficiency versus number of query files for Twister4Azure, Hadoop-Blast and DryadLINQ-Blast.]
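
The charts on this poster report parallel efficiency and relative parallel efficiency, but the text does not define either metric. The sketch below assumes the commonly used definitions, which may differ from the ones the authors applied: efficiency is sequential time divided by (cores times parallel time), and relative efficiency replaces the sequential run with the smallest parallel run as the baseline. The function names are hypothetical.

```python
def parallel_efficiency(t_sequential, t_parallel, cores):
    """Assumed definition: T(1) / (p * T(p))."""
    return t_sequential / (cores * t_parallel)


def relative_parallel_efficiency(t_base, cores_base, t_parallel, cores):
    """Assumed definition: (p0 * T(p0)) / (p * T(p)), with p0 the baseline run."""
    return (cores_base * t_base) / (cores * t_parallel)
```

For example, with made-up timings (not measurements from the poster), a job that takes 1280 s sequentially and 12.5 s on 128 cores would have an efficiency of 1280 / (128 * 12.5) = 0.8 under this definition.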
Twister4Azure
• Familiar MapReduce programming model
• Fault tolerance features similar to traditional MapReduce
• No single point of failure
• Combiner step
• Supports dynamic scaling up and down of the compute resources
• Web-based monitoring console
• Easy testing and deployment using the Azure local development fabric
• Iterative extensions
   • Merge step
   • In-memory caching of static data
   • Cache-aware hybrid scheduling using Queues as well as a bulletin board (special table)

[Figure: Twister4Azure architecture. Labels: Scheduling Queue, Job Bulletin Board, Worker Role, Map Workers, Reduce Workers, In Memory Data Cache, Map Task Table, Task Monitoring, Role Monitoring.]
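
The cache-aware hybrid scheduling listed above can be pictured roughly as follows. This is one plausible reading of the bullet, not the documented protocol: workers first claim tasks from the bulletin board whose input they already hold in memory, and any task nobody claims falls back to the scheduling queue. The `Worker` class, the `schedule_iteration` function and the round-robin hand-out are hypothetical, in-memory stand-ins for the Azure Table and Azure Queue.

```python
from collections import deque


class Worker:
    """Hypothetical worker holding a set of cached input block IDs."""

    def __init__(self, name, cached_inputs=()):
        self.name = name
        self.cache = set(cached_inputs)


def schedule_iteration(tasks, workers):
    """Assign tasks (task_id -> input_id); returns task_id -> (worker, source)."""
    assignment = {}

    # Pass 1, bulletin board: a worker claims tasks whose input it has cached.
    for worker in workers:
        for task_id, input_id in tasks.items():
            if task_id not in assignment and input_id in worker.cache:
                assignment[task_id] = (worker.name, "cache")

    # Pass 2, queue: leftover tasks go to whichever worker polls next
    # (modelled here as round-robin); the fetched input is cached for later.
    queue = deque(t for t in tasks if t not in assignment)
    turn = 0
    while queue:
        task_id = queue.popleft()
        worker = workers[turn % len(workers)]
        worker.cache.add(tasks[task_id])
        assignment[task_id] = (worker.name, "queue")
        turn += 1
    return assignment
```

Running the same task set twice shows the intended effect: tasks scheduled through the queue in the first iteration are picked up from the cache in the second.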
Introduction
Many algorithms rely on iterative computations in which
each iterative step can be easily specified as a MapReduce
computation. MapReduceRoles for Azure (MR4Azure) is a
decentralized, dynamically scalable MapReduce runtime
that we developed for the Windows Azure cloud platform,
using Microsoft Azure cloud infrastructure services as the
building blocks. Twister4Azure extends MR4Azure to support
optimized iterative MapReduce executions, enabling a wide
array of large-scale iterative data analysis and scientific
applications to use the Azure platform easily and efficiently.
Twister4Azure Computation Flow
MapReduceRoles for Azure
• Uses distributed, highly scalable and highly available cloud services as the building blocks.
   • Azure Queues for task scheduling.
   • Azure Blob storage for input, output and intermediate data storage.
   • Azure Tables for meta-data storage and monitoring.
• Utilizes eventually-consistent, high-latency cloud services effectively to deliver performance comparable to traditional MapReduce runtimes.
• Minimal management and maintenance overhead.
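
One way a queue-based design can tolerate worker failures without a central master is the visibility timeout that Azure Queues provide: a dequeued message stays hidden for a while and reappears if the worker never deletes it. The poster does not spell out this mechanism, so the sketch below is an assumption about how it could be used. `VisibilityQueue` is a hypothetical in-memory stand-in driven by an explicit logical clock, not the Azure SDK.

```python
class VisibilityQueue:
    """In-memory imitation of a queue with a visibility timeout."""

    def __init__(self, visibility_timeout):
        self.timeout = visibility_timeout
        self._messages = {}  # msg_id -> [payload, time at which it is visible]
        self._next_id = 0

    def put(self, payload, now=0.0):
        """Enqueue a task description; it is visible immediately."""
        self._messages[self._next_id] = [payload, now]
        self._next_id += 1

    def get(self, now):
        """Return (msg_id, payload) of a visible message and hide it, or None."""
        for msg_id, entry in self._messages.items():
            if entry[1] <= now:
                entry[1] = now + self.timeout  # hidden until the timeout passes
                return msg_id, entry[0]
        return None

    def delete(self, msg_id):
        """Called by a worker only after the task has finished successfully."""
        self._messages.pop(msg_id, None)
```

If a worker takes a task and crashes before calling `delete`, the task becomes visible again after the timeout and another worker can pick it up, so no coordinator has to track which worker owns which task.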
Thilina Gunarathne, Judy Qiu, Geoffrey Fox
{tgunarat, xqiu, gcf}@indiana.edu
Smith Waterman-GOTOH All-Pairs Sequence Alignment
Twister4Azure used Azure Small instances, Amazon Elastic
MapReduce used Amazon EC2 High-CPU-Extra-Large instances,
and Apache Hadoop used a cluster with 8 cores and 16 GB per
node. Each block contained 200 sequences.
[Figure: Smith Waterman-GOTOH performance comparison. X axis: Num. of Cores * Num. of Blocks.]
CAP3 Sequence Assembly
Twister4Azure used Azure Small instances, Amazon Elastic
MapReduce used Amazon EC2 High-CPU-Extra-Large instances,
and Apache Hadoop used a cluster with 8 cores and 16 GB per
node. Each file contained 458 reads.
[Figure: performance comparison of Twister4Azure, Amazon EMR and Apache Hadoop. Recovered axis labels: Parallel Efficiency, Adjusted Time (s), Num. of Cores * Num. of Files.]
ACKNOWLEDGEMENTS
Microsoft Exploratory Research in Clouds and Platforms Grant, FutureGrid and Salsa Group.