Comparison between DryadLINQ and Agenta

-Ratul Bhawal

Abstract: Scalability is one of the most important goals in distributed systems. Microsoft provides a system and a set of language extensions, known as DryadLINQ, that enable a new programming model for large-scale distributed computing [1]. It generalizes previous execution environments such as SQL, MapReduce, and Dryad in two ways: by adopting an expressive data model of strongly typed .NET objects, and by supporting general-purpose imperative and declarative operations on datasets within a traditional high-level programming language. A DryadLINQ program is a sequential program composed of LINQ expressions that perform arbitrary side-effect-free transformations on datasets. Agenta is the latest unreleased version of DryadLINQ. Here, we describe DryadLINQ, examine its scalability, and note the key differences between it and Agenta.

1. Introduction

Scalability can be defined as the ability of a computer application or product (hardware or software) to continue to function well when it (or its context) is changed in size or volume in order to meet a user's need. Many cloud platforms can achieve the desired scalability; DryadLINQ is the platform provided by Microsoft. Many domains, such as biology, chemistry, particle physics, information retrieval, and finance, involve a deluge of data and highly computation-intensive applications, which mandates the use of large computing infrastructures and parallel runtimes to achieve considerable performance gains. Data sets also keep growing with time, which creates a need for efficient, scalable applications for parallel computing. The support for handling large data sets, the concept of moving computation to the data, and the better quality of service provided by DryadLINQ make it a favorable choice among the available technologies for solving such problems [3].

First, let us look at an overview of the DryadLINQ software stack (Figure 1). Working at a workstation, the programmer writes code in one of the managed languages of the .NET Framework using Language Integrated Query (LINQ). The LINQ operators are mixed with imperative code to process data held in collections of strongly typed objects. A single collection can span multiple computers, thereby allowing for scalable storage and efficient execution. The code produced by a DryadLINQ programmer looks like the code for a sequential LINQ application. Behind the scenes, however, DryadLINQ translates LINQ queries into Dryad computations, which are execution flows based on a Directed Acyclic Graph (DAG). While the Dryad engine executes the distributed computation, the DryadLINQ client application typically waits for the results before continuing with further processing.

Figure 1: DryadLINQ Software Stack (layers shown: Application, .NET + LINQ, DryadLINQ, Dryad, Cluster Environment, Windows Server)

2. Features of DryadLINQ [2]

DryadLINQ implements a DAG (Directed Acyclic Graph) programming model. It provides a LINQ API for Dryad using C#. It handles data by means of shared directories or local disks. It uses files, TCP pipes, and shared-memory FIFOs for intermediate data communication. It uses network-topology-based run-time graph optimizations for job scheduling. It has an efficient fault tolerance mechanism: instead of restarting the entire job, it re-executes just the failed task. It also provides monitoring support for execution graphs.

To improve performance, DryadLINQ applications typically break persistent input data into multiple partitions, which can then be processed in parallel on multiple compute nodes. The details, including the optimal number of partitions, depend on the particular application and data set.
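The partitioning idea described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not DryadLINQ code: the names `split_into_partitions`, `process_partition`, and `run_partitioned` are hypothetical, a local thread pool stands in for the compute nodes, and squaring each record stands in for the real per-partition work.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, num_partitions):
    """Split records into num_partitions contiguous chunks of near-equal size."""
    base, extra = divmod(len(records), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # The first `extra` partitions receive one additional record.
        size = base + (1 if i < extra else 0)
        partitions.append(records[start:start + size])
        start += size
    return partitions


def process_partition(partition):
    """Hypothetical per-partition work: square every record."""
    return [x * x for x in partition]


def run_partitioned(records, num_partitions, num_workers):
    """Process each partition on a worker, then present one combined result."""
    partitions = split_into_partitions(records, num_partitions)
    # The thread pool is a stand-in for compute nodes (an assumption).
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # pool.map returns results in partition order.
        results = list(pool.map(process_partition, partitions))
    # Flatten the per-partition outputs into a single collection.
    return [value for part in results for value in part]


if __name__ == "__main__":
    print(run_partitioned(list(range(10)), num_partitions=4, num_workers=2))
```

With more partitions than workers, as in the usage line, the pool simply hands the remaining partitions to workers as they become free, and the caller still sees a single ordered result.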
Figure 2. A DryadLINQ application with partitioned data

The output data is also broken into multiple partitions, but they are presented to the application as a single collection. The number of partitions can even exceed the number of compute nodes, in which case DryadLINQ processes the partitions in multiple passes.

Agenta, being an updated version of DryadLINQ, shares its basic features, but certain modifications make Agenta better than DryadLINQ. To discover the key differences, we tested the PageRank algorithm using both DryadLINQ and Agenta.

3. Performance results for PageRank on DryadLINQ

An increase in scalability means the ability to execute more tasks. However, as the number of tasks increases, scheduling costs also increase, and these are an important factor affecting performance. I evaluated the performance of DryadLINQ using the PageRank algorithm. The input data consisted of 100 adjacency matrices. The following are some of the test cases that I ran.

Partition files    Time (min:sec)
25                 5:04
50                 8:45
100                16:47

From the above results, it is clear that scheduling time is a major factor affecting performance. As the task granularity became finer (more, smaller tasks), scheduling time increased and performance suffered.

Figure 3. PageRank performance results with DryadLINQ (x-axis: 25, 50, and 100 Dryad tasks)

Agenta is the latest unreleased version of DryadLINQ. I evaluated it using the PageRank algorithm and noticed certain key differences between DryadLINQ and Agenta. The following are the results of testing with Agenta.

Partition files    Time (sec)
6                  1888
12                 1693
48                 1559
72                 1364
96                 1251
192                1441
384                1845
768                2731
1000               3231
1280               3874

[Chart: Agenta PageRank time (sec) versus number of Dryad tasks; x-axis: 6, 12, 24, 48, 72, 96, 192, 384, 768, 1000, and 1280 Dryad tasks]

After the first set of tests, we noticed that the best performance was obtained with 96 tasks. I then ran some more test cases to find a better point somewhere near 96 tasks.

Partition files    Time (sec)
90                 1275
91                 1259
92                 1236
93                 1237
94                 1249
95                 1254
96                 1282
97                 1262
98                 1398
99                 1248

[Chart: Agenta PageRank time (sec) versus number of Dryad tasks; x-axis: 90 to 99 Dryad tasks]

We found that 92 tasks is the ideal performance point. That is, this is the point at which the trade-off between task granularity and scheduling time is best balanced.

We ran the PageRank application with different data partition sets on both DryadLINQ and Agenta. Based on the test results so far, we have identified the following key differences between DryadLINQ and Agenta.

Firstly, DryadLINQ makes use of a partition file. The partition file is a file with a ".pt" extension that contains information about the location of the data partitions within the cluster. The user needs to create this file to tell the Dryad Job Manager where to find the data. In the case of Agenta, no partition file is needed. Instead, it uses the DSC (Distributed Storage Catalog). The entire data set is present at the head node, and the Dryad scheduler takes care of resource distribution across the nodes in the cluster. The user need not worry about resource allocation.

The second major difference is the scheduling strategy used for job execution. The previous version of DryadLINQ uses a static strategy for scheduling tasks: before actual task execution, the job manager makes a schedule that assigns specific tasks to specific nodes. A task executes on the node on which it is scheduled, even if that node is busy while other nodes are free at the same time. This leads to inefficient usage of cluster resources.
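To make the cost of static assignment concrete, the following minimal sketch compares it with a greedy assignment in which each task goes to whichever node becomes free first (the dynamic strategy discussed next). It is a toy simulation, not the Dryad job manager: the function names and the task durations are hypothetical, and the round-robin binding used for the static case is an assumption, since the text does not say how tasks are pre-assigned.

```python
import heapq


def static_makespan(durations, num_nodes):
    """Static strategy: task i is bound to node i % num_nodes before execution.

    Round-robin binding is an assumption made for this sketch. Each node runs
    its own tasks back to back, so the job ends when the busiest node finishes.
    """
    node_busy = [0] * num_nodes
    for i, duration in enumerate(durations):
        node_busy[i % num_nodes] += duration
    return max(node_busy)


def dynamic_makespan(durations, num_nodes):
    """Dynamic strategy: the next task goes to the node that becomes free first."""
    free_at = [0] * num_nodes  # min-heap of times at which each node is free
    heapq.heapify(free_at)
    finish = 0
    for duration in durations:
        start = heapq.heappop(free_at)  # earliest available node
        end = start + duration
        heapq.heappush(free_at, end)
        finish = max(finish, end)
    return finish


if __name__ == "__main__":
    # Hypothetical task durations: two long tasks among many short ones.
    durations = [8, 1, 1, 1, 8, 1, 1, 1]
    print("static :", static_makespan(durations, 4))   # 16
    print("dynamic:", dynamic_makespan(durations, 4))  # 9
```

In this example both long tasks are bound to the same node under the static schedule, so that node works for 16 time units while the others sit idle after 2. Under the dynamic schedule the second long task starts on a node that is already free, and the job finishes after 9 time units.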
Agenta, on the other hand, uses a dynamic scheduling strategy. In the dynamic strategy, the job manager executes the first n tasks on the n available nodes; then, as soon as any task finishes and its node becomes free, it schedules task n+1 on that node instead of waiting for the first node to become free. This leads to efficient utilization of cluster resources. The diagrams below help illustrate the static and dynamic strategies. From these diagrams, it is clear that the overall job time is lower with the dynamic strategy.

4. Conclusion

DryadLINQ is the older, stable version. Although Agenta promises to be a better and more efficient version, more tests on Agenta are needed before adopting it fully and treating DryadLINQ as obsolete.

References:
1) Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep Kumar Gunda, Jon Currey, DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language
2) Microsoft Research, DryadLINQ Programming Guide
3) Jaliya Ekanayake, Thilina Gunarathne, Geoffrey Fox, Atilla Soner Balkir, Christophe Poulain, Nelson Araujo, Roger Barga, DryadLINQ for Scientific Analyses, 2009 Fifth IEEE International Conference on e-Science