SW-G using new DryadLINQ(Argentia)

DRYADLINQ:
• Dryad is a high-performance, general-purpose distributed computing engine that is designed to manage
execution of large-scale applications on various cluster technologies, including Windows® HPC Server
2008. DryadLINQ is based on LINQ, which was introduced with Microsoft® .NET Framework version 3.5.
The core of DryadLINQ is a DryadLINQ provider, which translates the application’s LINQ queries into a
Dryad job and runs the job as a distributed application on a Windows HPC cluster. A developer thus does
not need to know much about Dryad—or even about parallel or distributed computing—to write a
DryadLINQ application. However, developers who are familiar with these technologies can take advantage
of their knowledge to optimize performance.
A DryadLINQ query is evaluated as follows:
• An application creates one or more DistributedData<TSource> objects to represent persistent data, and
then it defines a query by applying one or more LINQ operators.
• DryadLINQ builds a DistributedQuery<TResult> to represent the query, but defers evaluation.
• The application triggers the object evaluation process.
• One way to trigger evaluation is by calling foreach to enumerate a DistributedQuery<T> object. If you
step through a DryadLINQ application in the debugger, you will see that the foreach call itself is the
trigger. The foreach operator calls the object’s GetEnumerator method, which initiates the evaluation.
The actual enumeration doesn’t take place until evaluation is complete. Evaluation can also be triggered
by other DryadLINQ methods, such as Execute() and ExecuteAsync().
• LINQ passes the LINQ expression to the DryadLINQ provider.
• The DryadLINQ provider:
  o Generates an execution plan.
  o The execution plan is used by the Dryad Job Manager to create a graph for the job.
  o Because LINQ applications operate on data sets rather than individual items, the DryadLINQ
    provider has considerable flexibility in how it translates the LINQ query into an efficient Dryad
    execution plan.
  o Generates data processing code for the vertices.
  o The data processing code for each stage is compiled to a .NET assembly and then dispatched to
    the cluster’s computers at the appropriate stage of the operation.
  o Collates any “side-information” that is necessary for the computation, in particular local variables
    that have been referenced in the query via closure.
• A Dryad job is started on the cluster, including both a Dryad Graph Manager and Dryad Vertex Hosts.
• The Dryad Graph Manager executes the Dryad vertices from the Dryad graph on the cluster:
  o The graph manager schedules vertices as Dryad Vertex Hosts become available.
  o Each vertex runs .NET code to process a data partition and prepare the results for subsequent
    stages or final output.
  o The input to a vertex can be either static data or data from the preceding stage.
  o The graph manager may apply runtime policies and optimizations to the execution graph to
    optimize performance.
When execution completes, the results are written to one or more distributed data sets.
[Figure: How a Dryad job runs on a cluster.]
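The deferred-evaluation steps described above can be illustrated with a small Python sketch. This is not the DryadLINQ API: `DeferredQuery`, `select`, `where`, and `execute` are hypothetical names chosen for illustration, and the "vertices" here are ordinary function calls run in a single process. The sketch only shows the general pattern: operators record a plan, and nothing is computed until the query is enumerated or explicitly executed.

```python
from typing import Callable, Iterator, List, Tuple


class DeferredQuery:
    """Toy stand-in for a deferred query over partitioned data (hypothetical)."""

    def __init__(self, partitions: List[list],
                 stages: List[Tuple[str, Callable]] = None):
        self._partitions = partitions
        self._stages = list(stages or [])
        self.evaluated = False  # flips to True once execute() has run

    def select(self, fn: Callable) -> "DeferredQuery":
        # Record the operator in the plan; do not touch the data yet.
        return DeferredQuery(self._partitions, self._stages + [("select", fn)])

    def where(self, pred: Callable) -> "DeferredQuery":
        return DeferredQuery(self._partitions, self._stages + [("where", pred)])

    def _run_vertex(self, partition: list) -> list:
        # One "vertex" applies every recorded stage to a single partition.
        rows = partition
        for kind, fn in self._stages:
            if kind == "select":
                rows = [fn(r) for r in rows]
            else:
                rows = [r for r in rows if fn(r)]
        return rows

    def execute(self) -> List[list]:
        # Explicit trigger: run the whole plan, one vertex per partition.
        self.evaluated = True
        return [self._run_vertex(p) for p in self._partitions]

    def __iter__(self) -> Iterator:
        # Enumeration is the implicit trigger: evaluation completes first,
        # and only then are the results handed back one by one.
        results = self.execute()
        return (row for part in results for row in part)


# Building the query does no work; iterating over it does.
query = (DeferredQuery([[1, 2, 3], [4, 5, 6]])
         .where(lambda x: x % 2 == 0)
         .select(lambda x: x * 10))
before = query.evaluated
rows = list(query)
after = query.evaluated
```

In this sketch `before` is false and `after` is true, mirroring the point that defining a query and evaluating it are separate steps.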
DryadLINQ Application with Partitioned Data:
About SWG:
In SWG we evaluate the performance of Smith Waterman Gotoh Dissimilarity Computation (SW-G) using
the new DryadLINQ (Argentia).
[Figure: Upper triangle of the N*N matrix broken into D*D blocks]
[Figure: DryadLINQ vs. MPI; axis ticks 0 to 20000; values labeled 35339 and 50000]
• Calculate pairwise distances for a collection of genes (used for clustering, MDS)
• Fine-grained tasks in MPI
• Coarse-grained tasks in DryadLINQ
• Performed on 768 cores (Tempest Cluster)
• Blocks in the upper triangle
• Every D consecutive blocks are merged to form a set of row blocks, each with NxD elements
• Each process has a workload of NxD elements
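The block decomposition can be sketched in Python as follows. The poster does not spell out exactly how blocks are assigned to row blocks, so grouping the upper-triangle blocks by block row is an assumption made here for illustration, and the function names (`upper_triangle_blocks`, `row_blocks`, `block_bounds`) are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def upper_triangle_blocks(d: int) -> List[Tuple[int, int]]:
    """Block coordinates (i, j) with j >= i in a d x d grid of blocks."""
    return [(i, j) for i in range(d) for j in range(i, d)]


def row_blocks(d: int) -> Dict[int, List[Tuple[int, int]]]:
    """Group upper-triangle blocks by block row (assumed merge rule)."""
    rows = defaultdict(list)
    for i, j in upper_triangle_blocks(d):
        rows[i].append((i, j))
    return dict(rows)


def block_bounds(n: int, d: int, k: int) -> Tuple[int, int]:
    """Half-open range [start, stop) of sequence indices for block index k."""
    size = -(-n // d)  # ceiling division: sequences per block
    return k * size, min(n, (k + 1) * size)


blocks = upper_triangle_blocks(4)
first_row = row_blocks(4)[0]
last_range = block_bounds(10, 4, 3)
```

Because the distance matrix is symmetric, only the upper-triangle blocks need to be computed; the lower triangle can be filled in by transposition.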
We need to determine how Dryad distributes this data across the compute nodes. To do so, we
write information about every block to a log file and then analyze the log to see how the blocks
were distributed. From this analysis we can infer how Dryad distributes data in general.
We use the following DryadLinq operators to distribute data across the nodes: AsDistributed()
and AsDistributedFromPartitions(). The difference between the two is that AsDistributed()
always divides the data into a default number of pieces, whereas AsDistributedFromPartitions()
lets you divide the data as you like. We also use a new operator called ApplyPerPartition(), which
tells Dryad how to process each individual piece of data.
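The difference between the two partitioning styles, and the per-partition processing step, can be mimicked in plain Python. The functions below are hypothetical local stand-ins and not the DryadLINQ operators themselves; in particular, the default piece count of 4 is an assumption made only for this sketch.

```python
from typing import Callable, List, Sequence


def default_partitions(items: Sequence, count: int = 4) -> List[list]:
    """Split items into `count` nearly equal contiguous pieces.

    `count` stands in for a system-chosen default (assumed value).
    """
    size, extra = divmod(len(items), count)
    parts, start = [], 0
    for k in range(count):
        # The first `extra` pieces each take one additional item.
        stop = start + size + (1 if k < extra else 0)
        parts.append(list(items[start:stop]))
        start = stop
    return parts


def from_partitions(partitions: Sequence[Sequence]) -> List[list]:
    """Keep exactly the caller-supplied division of the data."""
    return [list(p) for p in partitions]


def apply_per_partition(partitions: List[list], fn: Callable) -> list:
    """Run `fn` once per partition, on the whole partition at a time."""
    return [fn(p) for p in partitions]


auto = default_partitions(list(range(10)))
manual = from_partitions([[0, 1, 2, 3, 4, 5, 6], [7, 8, 9]])
sums = apply_per_partition(manual, sum)
```

Choosing the partitions explicitly matters for SW-G because it lets each piece correspond to one row block, so the per-partition function sees a complete unit of work.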
[Figure: DryadLinq SWG on WinHPC. x-axis: No. of Sequences (10000 to 50000); left axis 0.000 to 0.030; right axis -10% to 60%]
Problems:
1) It's not easy to debug in Argentia; some exception information is of little value. (Argentia
Debugging)
2) In cases where a query is performing a significant amount of non-compute-bound work such as
file I/O, it might be beneficial to specify a degree of parallelism greater than the number of cores
on the machine.
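The general idea behind point 2 can be shown with Python threads rather than DryadLINQ. In this sketch `simulated_io_task`, `run_io_bound`, and the oversubscription factor of 4 are all hypothetical; `time.sleep` stands in for file I/O during which a core would otherwise sit idle.

```python
import os
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, List, Tuple


def simulated_io_task(task_id: int, delay: float = 0.01) -> int:
    """Pretend to wait on file I/O, then return a small result."""
    time.sleep(delay)  # the thread is blocked, not using the CPU
    return task_id * 2


def run_io_bound(task_ids: Iterable[int],
                 oversubscription: int = 4) -> Tuple[int, List[int]]:
    """Run tasks with more workers than cores (assumed factor of 4)."""
    cores = os.cpu_count() or 1
    workers = cores * oversubscription
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order in its results.
        return workers, list(pool.map(simulated_io_task, task_ids))


workers, results = run_io_bound(range(8))
```

While one worker waits on I/O another can use the core, which is why a degree of parallelism above the core count can help for work that is not compute-bound.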
Scalability of Pairwise Distance Calculation:
[Figure axes: Time per Actual Calculation (ms); Performance Degradation on VM (Hadoop)]