SW-G using new DryadLINQ (Argentia)

DRYADLINQ:

• Dryad is a high-performance, general-purpose distributed computing engine that is designed to manage execution of large-scale applications on various cluster technologies, including Windows® HPC Server 2008. DryadLINQ is based on LINQ, which was introduced with Microsoft® .NET Framework version 3.5. The core of DryadLINQ is a DryadLINQ provider, which translates the application's LINQ queries into a Dryad job and runs the job as a distributed application on a Windows HPC cluster. A developer thus does not need to know much about Dryad, or even about parallel or distributed computing, to write a DryadLINQ application. However, developers who are familiar with these technologies can use that knowledge to optimize performance.

A DryadLINQ query is evaluated as follows:

• An application creates one or more DistributedData<TSource> objects to represent persistent data, and then it defines a query by applying one or more LINQ operators.
• DryadLINQ builds a DistributedQuery<TResult> to represent the query, but defers evaluation.
• The application triggers the object evaluation process.
  o One way to trigger evaluation is by using foreach to enumerate a DistributedQuery<T> object. If you step through a DryadLINQ application in the debugger, you will see that the foreach call itself is the trigger. The foreach statement calls the object's GetEnumerator method, which initiates the evaluation. The actual enumeration does not take place until evaluation is complete. Evaluation can also be triggered by other DryadLINQ methods, such as Execute() and ExecuteAsync().
• LINQ passes the LINQ expression to the DryadLINQ provider.
• The DryadLINQ provider:
  o Generates an execution plan.
  o The execution plan is used by the Dryad Job Manager to create a graph for the job.
  o Because LINQ applications operate on data sets rather than individual items, the DryadLINQ provider has considerable flexibility in how it translates the LINQ query into an efficient Dryad execution plan.
  o Generates data processing code for the vertices.
  o The data processing code for each stage is compiled to a .NET assembly and then dispatched to the cluster's computers at the appropriate stage of the operation.
  o Collates any "side-information" that is necessary for the computation, in particular local variables that have been referenced in the query via closure.
• A Dryad job is started on the cluster, including both a Dryad Graph Manager and Dryad Vertex Hosts.
• The Dryad Graph Manager executes the Dryad vertices from the Dryad graph on the cluster:
  o The graph manager schedules vertices as Dryad Vertex Hosts become available.
  o Each vertex runs .NET code to process a data partition and prepare the results for subsequent stages or final output.
  o The input to a vertex can be either static data or data from the preceding stage.
  o The graph manager may apply runtime policies and optimizations to the execution graph to improve performance.
• When execution completes, the results are written to one or more distributed data sets.

Figure: how a Dryad job runs on a cluster.

DryadLINQ Application with Partitioned Data:

About SWG:

In SWG we evaluate the performance of the Smith Waterman Gotoh dissimilarity computation (SW-G) using the new DryadLINQ (Argentia).

• Calculate pairwise distances for a collection of genes (used for clustering, MDS)
• Fine grained tasks in MPI
• Coarse grained tasks in DryadLINQ
• Performed on 768 cores (Tempest Cluster)

Figure: DryadLINQ vs. MPI, with data points at 35339 and 50000.

Figure: Blocks in the upper triangle.

• The upper triangle of the N*N matrix is broken into D*D blocks.
• Each D consecutive blocks are merged to form a set of row blocks, each with NxD elements.
• Each process has a workload of NxD elements.
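The block decomposition described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used on the poster: it assumes the N sequences are cut into D contiguous chunks, so that the distance matrix becomes a D by D grid of blocks, and that only blocks on or above the diagonal are computed, with the lower triangle filled in by symmetry. The function names and `toy_distance` are hypothetical; `toy_distance` is a trivial stand-in and is not the Smith Waterman Gotoh dissimilarity.

```python
def upper_triangle_blocks(d):
    """Return (row, col) indices of the blocks on or above the diagonal of a d x d grid."""
    return [(r, c) for r in range(d) for c in range(r, d)]


def block_bounds(n, d, b):
    """Half-open index range [start, stop) of chunk b when n items are cut into d chunks."""
    return (b * n) // d, ((b + 1) * n) // d


def toy_distance(a, b):
    """Hypothetical symmetric stand-in for SW-G: mismatches plus length difference."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def pairwise_by_blocks(seqs, d):
    """Fill an n x n distance matrix by visiting only the upper-triangle blocks."""
    n = len(seqs)
    dist = [[0] * n for _ in range(n)]
    for r, c in upper_triangle_blocks(d):
        r0, r1 = block_bounds(n, d, r)
        c0, c1 = block_bounds(n, d, c)
        for i in range(r0, r1):
            for j in range(c0, c1):
                if j < i:
                    continue  # only occurs inside diagonal blocks
                value = toy_distance(seqs[i], seqs[j])
                dist[i][j] = value
                dist[j][i] = value  # mirror into the lower triangle
    return dist


if __name__ == "__main__":
    sequences = ["ACGT", "ACGA", "TTGT", "AC", "GGGG", "ACGTT", "A"]
    for row in pairwise_by_blocks(sequences, 3):
        print(row)
```

Visiting only the upper-triangle blocks roughly halves the number of distance evaluations, and each block is an independent unit of work that can be handed to a separate process.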
We need to determine how Dryad distributes this data across the compute nodes. For this purpose we record the information for every block in a log file and then analyze the log to see how the blocks are distributed. From this analysis we can infer how Dryad distributes data in general. We use the following DryadLINQ operators to distribute data across the nodes: AsDistributed() and AsDistributedFromPartitions(). The difference between the two is that AsDistributed() always splits the data using the default partitioning, whereas AsDistributedFromPartitions() lets you specify the partitions yourself. We also use a new operator called ApplyPerPartition(), which tells Dryad how to process each individual piece of data.

Scalability of Pairwise Distance Calculation:

Figure: DryadLINQ SWG on WinHPC. x-axis: No. of Sequences (10000 to 50000); left y-axis: Time per Actual Calculation (ms), 0.000 to 0.030; right y-axis: Performance Degradation on VM (Hadoop), -10% to 60%.

Problems:

1) It is not easy to debug in Argentia; some of the exception information is of little practical value. (Argentia Debugging)
2) In cases where a query is performing a significant amount of non-compute-bound work such as file I/O, it might be beneficial to specify a degree of parallelism greater than the number of cores on the machine.
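The partitioning and logging approach described in this section can be illustrated with a plain-Python analogy. This is a sketch under assumptions and does not use or reproduce the DryadLINQ API: `split_default`, `split_custom`, and `apply_per_partition` are hypothetical names that only mimic the roles the text gives to AsDistributed(), AsDistributedFromPartitions(), and ApplyPerPartition(). In particular, the even split used here is an assumption, since the text does not say how the default pieces are chosen.

```python
def split_default(items, parts):
    """Cut items into `parts` contiguous, near-equal pieces (assumed default policy)."""
    n = len(items)
    return [items[(k * n) // parts:((k + 1) * n) // parts] for k in range(parts)]


def split_custom(items, sizes):
    """Cut items into contiguous pieces whose sizes are chosen by the caller."""
    if sum(sizes) != len(items):
        raise ValueError("partition sizes must add up to the number of items")
    pieces, start = [], 0
    for size in sizes:
        pieces.append(items[start:start + size])
        start += size
    return pieces


def apply_per_partition(partitions, func, log):
    """Run func once per partition and record which items each partition received."""
    results = []
    for pid, part in enumerate(partitions):
        log.append((pid, list(part)))  # the log that is analyzed afterwards
        results.append(func(part))
    return results


if __name__ == "__main__":
    blocks = list(range(10))  # stand-ins for block identifiers
    log = []
    totals = apply_per_partition(split_default(blocks, 3), sum, log)
    for pid, received in log:
        print("partition", pid, "received blocks", received)
    print("per-partition results:", totals)
```

Reading the log after the run shows which blocks each partition handled, which is the same kind of evidence the text uses to reason about how the data was distributed.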