Comparison between DryadLINQ and Agenta -Ratul Bhawal

advertisement
Comparison between DryadLINQ and Agenta
favorable choice among the available technologies
to solve such problems.
-Ratul Bhawal
[3]
Abstract:
Scalability is one of the most important factors
which one desires to achieve in Distributed Systems.
[1]
Microsoft provides a system and a set of language
extensions enabling a new programming model for
large scale distributed computing known as
DryadLINQ. It generalizes previous execution
environments such as SQL, MapReduce, and Dryad in
two ways: by adopting an expressive data model of
strongly typed .NET objects; and by supporting
general purpose imperative and declarative
operations on datasets within a traditional high-level
programming language. A DryadLINQ program is a
sequential program composed of LINQ expressions
performing arbitrary side effect free transformations
on datasets. Agenta is the latest unreleased version
of DryadLINQ. Here, we shall describe DryadLINQ
and examine its scalability and also note the key
differenced when compared with Agenta.
First let us have an overview of
DryadLINQ software stack (Figure1). Working at his
workstation, the programmer writes code in one of
the managed languages of the .NET Framework
using Language Integrated Query. The LINQ
operators are mixed with imperative code to process
data held in collection of strongly typed objects. A
single collection can span multiple computers
thereby allowing for scalable storage and efficient
execution. The code produced by a DryadLINQ
programmer looks like the code for a sequential
LINQ application. Behind the scene, however,
DryadLINQ translates LINQ queries into Dryad
computations (Directed Acyclic Graph (DAG) based
execution flows. While the Dryad engine executes
the distributed computation, the DryadLINQ client
application typically waits for the results to continue
with further processing.
Application
.NET + LINQ
1. Introduction
Scalability can be defined as the ability of a
computer application or product (hardware or
software) to continue to function well when it (or its
context) is changed in size or volume in order to
meet a user’s need. There are many cloud platforms
which can achieve desired scalability. DryadLINQ is
the platform provided by Microsoft. Many domains
like biology, chemistry, particle physics, information
retrieval and finance involve a deluge of data and
highly computation intensive applications, mandates
the use of large computing infrastructures and
parallel runtimes to achieve considerable
performance gains. Now, the data sets keep on
increasing with time leading to the need of efficient
scalable applications for parallel computing. The
support for handling large data sets, the concept of
moving computation to data, and the better quality
of services provided by DryadLINQ makes it a
DryadLINQ
Dryad
Cluster Environment
Windows
Server
Windows
Server
Figure1: DryadLINQ Software Stack
2. Features of DryadLINQ
[2]
DryadLINQ implements a DAG programming
model. DAG stands for Directed Acyclic Graph.
DryadLINQ provides LINQ API for Dryad using C#. It
handles data by means of shared directories/local
disks. It uses files, TCP pipes, Shared memory FIFO
for intermediate data communication. It uses
network topology based run time graph
optimizations for job scheduling. It has an efficient
fault tolerance mechanism. Instead of restarting the
entire job, it re executes just the failed task. It also
provides monitoring support for execution graphs.
To improve performance, DryadLINQ applications
typically break persistent input data into multiple
partitions, which can then be processed in parallel
on multiple compute nodes. The details—including
the optimal number of partitions—depend on the
particular application and data set.
Partition files
Time(min)
25
5:04
50
8:45
100
16:47
Now, from the above results, it is clear that
scheduling time is a major factor affecting
performance. As we decreased task granularity,
scheduling time increased and performance got
affected.
18
16
14
12
10
8
6
Figure 2. A DryadLINQ application with partitioned data
4
The output data is also broken into multiple
partitions, but they are presented to the application
as a single collection. The number of partitions can
even exceed the number of compute nodes, but
DryadLINQ then processes the partitions in multiple
passes. Agenta being updated version of DryadLINQ
has the basic features similar to DryadLINQ but there
have been certain modifications which make Agenta
better than DryadLINQ. To discover the key
differences we tested the PageRank algorithm using
both DryadLINQ and Agenta.
2
3. Performance results for PageRank on DryadLINQ
Increase in scalability means the ability to execute
more tasks, however with increase in number of
tasks, there is an increase in scheduling costs which
is an important factor affecting performance. I have
done performance evaluation of DryadLINQ using
PageRank Algorithm. I took data which consists of
100 adjacency matrix. Following are some of the test
cases that I tested.
0
25 Dryad
tasks
50 Dryad 100 Dryad
tasks
tasks
Figure 3. Page Rank performance results with DryadLINQ
Agenta is the latest unreleased version of
DryadLINQ. I evaluated it using Pagerank algorithm
and noticed certain key differences between
DryadLINQ and Agenta. Following are the results of
testing with Agenta
Partition files
Time(sec)
6
1888
12
1693
48
1559
72
1364
96
1251
192
1441
Partition files
Time(sec)
Partition files
Time(sec)
98
1398
384
1845
99
1248
768
2731
1000
3231
1280
3874
1450
1400
1350
1300
1250
4500
4000
3500
3000
2500
2000
1500
1000
500
0
1200
After first set of testing, we noticed that optimal
performance was with 96 tasks.
Then I did some more test cases to find a more
optimal point which would be somewhere near to 96
tasks.
Partition files
Time(sec)
90
1275
91
1259
92
1236
93
1237
94
1249
95
1254
96
1282
97
1262
90 Dryad tasks
91 Dryad tasks
92 Dryad tasks
93 Dryad tasks
94 Dryad tasks
95 Dryad tasks
96 Dryad tasks
97 Dryad tasks
98 Dryad tasks
99 Dryad tasks
6 Dryad tasks
12 Dryad tasks
24 Dryad tasks
48 Dryad tasks
72 Dryad tasks
96 Dryad tasks
192 Dryad tasks
384 Dryad tasks
768 Dryad tasks
1000 Dryad tasks
1280 Dryad tasks
1150
We found that 92 tasks is the ideal performance
point. That is this is the ideal task granularity v/s
scheduling time stand off point.
We ran PageRank application with different data
partition sets on both DryadLINQ and Agenta. Based
on the test results until now, we have identified
following key differences between DryadLINQ and
Agenta.
They can be listed as follows:
Firstly, DryadLINQ makes use of the partition file.
The partition file is a file with “.pt” ass extension
which basically contains information related to data
partition location within the cluster. The user needs
to create this file to tell the Dryad Job Manager
where to find the data.
In case of Agenta, no partition file is needed. Instead
it uses DSC (Distributed Storage Catalog). The entire
Dataset is present at head node, and Dryad
scheduler will take care of resource distribution
across the nodes in cluster. The user need not worry
about resource allocation.
The Second major difference is the kind of
scheduling strategy implemented for job execution.
Previous version of DryadLINQ uses static strategy
for scheduling tasks i.e. before actual task execution,
the job manager will make a schedule for specific
tasks on specific nodes. A specific task will execute
on the node on which it is scheduled irrespective of
the fact that in the same time other nodes might be
free and the current node might be busy. This leads
to inefficient usage of cluster resources.
Agenta on the other hand makes use of Dynamic
scheduling strategy. In the dynamic strategy, the job
manager will execute the first n tasks on n available
nodes, then as soon as a task finishes and a node
gets free, it will schedule task n+1 on the free node
instead of waiting for 1st node to get free. This leads
to efficient utilization of cluster resources.
The below diagrams will help us understand static
and dynamic strategy better.
From the above diagrams, it is clear that overall job
time reduces with dynamic strategy.
4. Conclusion
DryadLINQ is an older and stable version. Although
Agenta promises to be a much better and efficient
version, we need to conduct more tests on Agenta
before adopting Agenta totally and making
DryadLINQ obsolete.
References:
1) Yuan Yu, Michael Isard, Dennis Fetterly,
Mihai Budiu, Úlfar Erlingsson1, Pradeep
Kumar Gunda, Jon Currey,DryadLINQ: A
System for General-Purpose Distributed
Data-Parallel Computing Using a HighLevel Language
2) Microsoft
Research,
DryadLINQ
Programming guide
3) Jaliya Ekanayake, Thilina Gunarathne,
Geoffrey Fox, Atilla Soner Balkir,
Christophe Poulain, Nelson Araujo,
Roger Barga, DryadLINQ for Scientific
Analyses, 2009 Fifth IEEE International
Conference on e-Science
Download