DryadLINQ Tutorial - Microsoft Research

DryadLINQ Tutorial
Contents
0 DryadLINQ Status
1 What is DryadLINQ?
2 Introduction to LINQ
2.1 A Simple LINQ Program
2.2 Iterating Over File Contents
2.3 IEnumerable: A Simple Extension of C#
2.4 IQueryable: Support for Specialized Execution Engines
3 DryadLINQ
3.1 A First DryadLINQ Program
3.1.1 The DryadLINQ Dynamically Linked Library
3.1.2 DryadLINQConfig.xml
3.1.3 Running the DryadLINQ Application
3.2 Watching the Remote Program
3.3 Partitioned Files
3.4 Computing Histograms
3.4.1 A Helper Class
3.4.2 The Histogram Query
3.4.3 Type Inference
3.5 Reductions (Aggregations)
3.6 Apply
3.7 Join
3.8 Statistics in DryadLINQ
3.9 Writing Custom Serializers
3.9.1 What is the Default Output Format?
3.9.2 How Can I Change the Output Format and How Can I Read Binary Data?
3.10 Advanced Topic: Higher-Order Query Operations
4 Reference
4.1 DryadLINQ Operators
4.2 DryadLINQ Annotations
0 DryadLINQ Status
DryadLINQ is currently available only within Microsoft.
1 What is DryadLINQ?
DryadLINQ is a compiler which translates LINQ programs to distributed computations which can be run
on a PC cluster. LINQ is an extension to .NET, launched with Visual Studio 2008, which provides
declarative programming for data manipulation.
By using DryadLINQ the programmer does not need to have any knowledge about parallel or distributed
computation (though a little knowledge can help with writing efficient programs). Thus any LINQ
programmer turns instantly into a cluster computing programmer.
While LINQ extensions have been made to Visual Basic and C#, the DryadLINQ compiler only supports
C#.
DryadLINQ uses the Dryad distributed execution environment to create and run distributed
applications. Knowledge or understanding of Dryad is not required for DryadLINQ users. Dryad is a
generic distributed execution environment which can support a variety of distributed programming
models. Dryad is currently deployed in production systems within Microsoft. Dryad is a library linked
with user applications, which provides the following services:
• An API to create distributed applications (jobs), by specifying which processes have to be
  executed and communication channels linking them.
• Scheduling of the processes on the cluster machines.
• Fault-tolerance through re-execution of processes after transient failures.
• Monitoring of the computation and statistics collection.
• Job visualization.
• An API for run-time resource management policies.
• Support for efficient bulk data transfer between processes.
Dryad has been tested from small clusters (four machines) to very large clusters (4000 machines). Dryad
assumes that the computer cluster you are using is hosted in a secure datacenter, with high bandwidth
connections between machines. It does not provide security or access control – applications are fully
trusted. Dryad has not been targeted at loosely coupled machines, e.g. workstations in an office
environment.
2 Introduction to LINQ
DryadLINQ leverages the LINQ programming language extensions, which were introduced as part of
the .NET 3.5 framework shipped with Visual Studio 2008.
Here we provide a brief introduction to LINQ. For more information on LINQ see ‘LINQ: .NET Language
Integrated Query’ at:
http://msdn2.microsoft.com/en-us/library/bb308959.aspx.
For lots of small examples, see “101 LINQ Samples” at:
http://msdn2.microsoft.com/en-us/vcsharp/aa336746.aspx
2.1 A Simple LINQ Program
We will start with a simple LINQ program.
1) Create a new project 'example' in Visual Studio (File/New Project). Choose a "console
application".
2) In the solution explorer, browse to your project's references, and add as a new reference
System.Core (it is in the tab .NET). This is required for enabling LINQ.
3) Here is the code to add to the file:
using System;                      // for I/O
using System.Collections;          // for IEnumerable
using System.Collections.Generic;  // for IEnumerable<T>
using System.Linq;                 // for LINQ operators

namespace Example {
    public static class Example
    {
        public static void Main(string[] args)
        {
            IEnumerable<int> l = Enumerable.Range(1, 100);
            foreach (var v in l)
                Console.WriteLine("{0}", v);
            Console.ReadKey();
        }
    }
}
When you build this program (e.g., using F6) and run it (using Ctrl-F5) it should output the numbers 1 to
100, each on a line.
This program introduces several important LINQ concepts:
• The IEnumerable<T> data type is a generic collection type, holding elements of an
  arbitrary type T. The collection is represented by an iterator, which provides efficient sequential
  access to each element. In this example we instantiate T to int.
• The Range operator generates an iterator which will produce the elements of the collection on
  demand. The collection will contain all integers from 1 to 100.
• The foreach loop repeatedly invokes the iterator of collection l to obtain a new value, and
  writes the value to the console.
• The variable v has no declared type (it is declared using just the var keyword). The compiler
  infers from the context that v is an integer.
Note that the collection elements are never stored simultaneously. There is no container of size 100
allocated to hold all values between 1 and 100. The numbers are produced by the Range() operator
on demand, consumed, and thrown away (removed by the garbage collector).
Range can be defined in C# using the yield return statement, in the following way:
static IEnumerable<int> Range(int low, int high)
{
    for (int i = low; i <= high; i++)
        yield return i;
}
(To invoke the new Range, just change the invocation from Enumerable.Range to Range).
Range defines an iterator which will return a new value each time it is invoked.
It is instructive to run this program in the debugger, single-stepping with F11. You will notice that
execution re-enters the body of Range() on every iteration of the foreach loop.
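The on-demand behaviour can also be observed without a debugger. The following is a minimal sketch rather than part of the original example: the class name LazyRangeDemo and the counter Produced are hypothetical names introduced here, and Take is the standard LINQ operator that stops after a given number of elements.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LazyRangeDemo
{
    // Counts how many values the iterator has actually produced (illustration only).
    public static int Produced = 0;

    // The same iterator as in the text, with the counter added.
    public static IEnumerable<int> Range(int low, int high)
    {
        for (int i = low; i <= high; i++)
        {
            Produced++;
            yield return i;
        }
    }

    public static void Main(string[] args)
    {
        // Calling Range runs none of its body; it only creates the iterator.
        IEnumerable<int> l = Range(1, 100);
        Console.WriteLine("Produced after the call: {0}", Produced);

        // Take(3) stops asking for values after three elements,
        // so the loop inside Range is never driven to 100.
        foreach (var v in l.Take(3))
            Console.WriteLine(v);
        Console.WriteLine("Produced after the loop: {0}", Produced);
    }
}
```

The counter stays at zero until the foreach loop starts pulling values, which is the sense in which no container of size 100 is ever allocated.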
2.2 Iterating Over File Contents
Files are ideally suited to iterator-type access, since iterators entail sequential access to file contents,
which is very efficient (in contrast to random access). Let us implement an iterator which scans a text
file:
using System.IO;
using System.Collections.Generic;

public class TextFileReader {
    private StreamReader file;

    public TextFileReader(string filename) {
        file = new StreamReader(filename);
    }

    // Returns IEnumerable<string> (not IEnumerator<string>) so that the result
    // can be used in foreach loops and with LINQ operators such as Where.
    public IEnumerable<string> Contents() {
        while (file.Peek() != -1)
            yield return file.ReadLine();
    }
}
It is now easy to implement a fgrep-like program (which searches a fixed string):
public static IEnumerable<string> Match(string filename,
                                        string tosearch)
{
    TextFileReader tr = new TextFileReader(filename);
    foreach (string s in tr.Contents()) {
        if (s.IndexOf(tosearch) >= 0)
            yield return s;
    }
}

public static void ShowOnConsole<T>(IEnumerable<T> t)
{
    foreach (T s in t)
        Console.WriteLine("{0}", s.ToString());
}

public static void Main(string[] args)
{
    ShowOnConsole(Match("c:/test.txt", "here"));
}
However, it is even better to use the built-in LINQ operator Where for this purpose:
public static IEnumerable<string> Match(string filename,
                                        string tosearch)
{
    TextFileReader tr = new TextFileReader(filename);
    IEnumerable<string> result =
        tr.Contents().Where(s => (s.IndexOf(tosearch) >= 0));
    return result;
}
This program introduces some new LINQ elements:
• The function ShowOnConsole manipulates generic streams, containing elements of an
  arbitrary type T. This function is invoked in Main to display a stream of strings.
• The Where method takes as its argument an anonymous delegate written as a lambda
  expression. The expression:
      s => (s.IndexOf(tosearch) >= 0)
  describes a function which takes an argument s and returns a Boolean value, the result of
  evaluating the expression s.IndexOf(tosearch) >= 0. Note that the lambda expression
  depends on the tosearch value. The type of this function in C# is Func<string, bool>
  (a function transforming strings to Booleans).
The type of Where is:
public static
IEnumerable<T> Where<T>(this IEnumerable<T> source,
Func<T, bool> filter)
Another useful method is Select. It is somewhat unfortunately named, since it does not just perform
a selection, it can perform an arbitrary type transformation:
public static
IEnumerable<S> Select<T, S>(this IEnumerable<T> source,
Func<T,S> transform)
The function transform is applied to every element of the input stream to generate an element of the
output stream. For example, to print the lengths of the matching lines instead of the lines themselves,
one would change the function as follows:
ShowOnConsole(Match("c:/test.txt", "here").Select(s => s.Length))
If we redefine the ShowOnConsole method as an extension of IEnumerable<T>:
public static void ShowOnConsole<T>(this IEnumerable<T> t)
we can write the above call as:
Match("c:/test.txt", "here").Select(s => s.Length).ShowOnConsole()
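To make the two signatures above concrete, here is a minimal sketch of how operators of this shape could be written by hand with yield return. The names MiniLinq, MyWhere, MySelect and MiniLinqDemo are hypothetical and chosen to avoid clashing with the real operators in System.Linq; the sketch says nothing about how the library implements them internally.

```csharp
using System;
using System.Collections.Generic;

public static class MiniLinq
{
    // Same shape as Where: keeps the elements for which the filter returns true.
    public static IEnumerable<T> MyWhere<T>(this IEnumerable<T> source,
                                            Func<T, bool> filter)
    {
        foreach (T item in source)
            if (filter(item))
                yield return item;
    }

    // Same shape as Select: maps every element of type T to an element of type S.
    public static IEnumerable<S> MySelect<T, S>(this IEnumerable<T> source,
                                                Func<T, S> transform)
    {
        foreach (T item in source)
            yield return transform(item);
    }
}

public static class MiniLinqDemo
{
    public static void Main(string[] args)
    {
        string[] lines = { "here we go", "nothing", "over here" };

        // Extension-method syntax lets the calls be chained left to right.
        foreach (int n in lines.MyWhere(s => s.IndexOf("here") >= 0)
                               .MySelect(s => s.Length))
            Console.WriteLine(n);   // prints 10, then 9
    }
}
```

Because both methods are iterators, chaining them builds a pipeline: each line flows through the filter and the transformation before the next line is examined.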
2.3 IEnumerable: A Simple Extension of C#
The LINQ programs we wrote so far can be expressed by using only "pure" C# operations. But there is
more to LINQ than just syntactic sugar.
Almost every collection in C# (e.g. List<T>, arrays, etc.) can be used as an
IEnumerable in a very simple way. For example, List<T> has an operator AsEnumerable().
Computations on IEnumerables are handled by the local machine like any standard C# program.
Most IEnumerable<T> methods take delegates as arguments (i.e., the IEnumerable methods
are higher-order functions). For example, the Where method requires a delegate returning a Boolean
value.
However, one of the great features of LINQ is that it allows computations to be dispatched to other
"providers". For example, if your data is a SQL Server database, you can write a LINQ query and ask SQL
Server to execute it. DryadLINQ takes advantage of this feature, by using Dryad to execute LINQ
queries. But in order to enable this behavior we need a new type, IQueryable<T>.
2.4 IQueryable: Support for Specialized Execution Engines
Superficially the IQueryable<T> type is very similar to IEnumerable<T>. Since
IQueryable<T> inherits from IEnumerable<T>, all methods applicable to IEnumerable<T>
operate on IQueryable<T> objects as well. However, there is a very important distinction:
IQueryable objects do not represent iterators, they represent queries, which are shipped for
execution to some computation engine and thus are not executed by the local JIT. The interfacing
between the local computation and the remote execution engine is encapsulated in a query provider.
Examples of execution engines are SQL Server or DryadLINQ, but one can easily write a new one. One of
the main design features of LINQ is extensibility, allowing new providers to be created by application
programmers.
Figure 1: A Query provider translates IQueryable objects to a suitable format and ships them to a remote
execution engine. It also transforms the remote data into C# objects.
DryadLINQ is just an instance of such a provider which interfaces with the Dryad remote execution framework.
A remote execution environment is not required to understand C#. All methods of IQueryable
receive arguments which are not delegates, but Expression<> objects. An expression is essentially a
"syntactic" representation of a computation.
For example, the type of Where operating on IQueryable<T> is:
public static
IQueryable<T> Where<T>(this IQueryable<T> source,
Expression<Func<T, bool>> filter)
Conveniently, it is very easy to obtain an Expression<Func<T,S>> where one would otherwise write a
delegate of type Func<T,S>: most of the time the compiler converts a lambda expression implicitly.
Thus, for many simple LINQ programs operating on IQueryables one can use exactly the same syntax
as for IEnumerable objects. (For more on the distinction between Func<> and
Expression<Func<>> see Section 3.10, Advanced Topic: Higher-Order Query Operations.)
                      IEnumerable                IQueryable
What                  Iterator over collection   Query
Generic               Yes                        Yes
Created               Lazily                     Lazily
Evaluated             Lazily                     Depends on provider
Evaluated by          Local JIT                  Remote execution engine
Operator pipelining   Yes                        Depends on engine
Method arguments      Delegates                  Expressions
Table 1: IEnumerable vs. IQueryable
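The difference between the "Delegates" and "Expressions" rows can be seen with the standard System.Linq.Expressions API alone. The sketch below is illustrative only: ExpressionDemo and Contains are hypothetical names, and the sketch shows what any query provider has available to inspect, not what DryadLINQ does with it.

```csharp
using System;
using System.Linq.Expressions;

public static class ExpressionDemo
{
    // Builds the filter as an expression tree instead of a delegate.
    // The lambda syntax is the same; only the declared type differs.
    public static Expression<Func<string, bool>> Contains(string tosearch)
    {
        return s => s.IndexOf(tosearch) >= 0;
    }

    public static void Main(string[] args)
    {
        // As a delegate: compiled code that can only be invoked.
        Func<string, bool> asDelegate = s => s.IndexOf("here") >= 0;
        Console.WriteLine(asDelegate("over here"));

        // As an expression: a data structure that can be inspected and translated.
        Expression<Func<string, bool>> asExpression = Contains("here");
        Console.WriteLine(asExpression.Body.NodeType);       // the top-level operator
        Console.WriteLine(asExpression.Parameters[0].Name);  // the parameter name

        // An expression can also be compiled back into a local delegate.
        Func<string, bool> compiled = asExpression.Compile();
        Console.WriteLine(compiled("nothing"));
    }
}
```

A provider receives the tree form, which is why a remote engine does not need to understand compiled C# code.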
3 DryadLINQ
3.1 A First DryadLINQ Program
Let us rewrite the Match function to operate on an IQueryable:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using LINQToDryad;

namespace Example
{
    static class Program
    {
        public static IQueryable<string> Match(string directory,
                                               string filename,
                                               string tosearch)
        {
            DryadDataContext ddc =
                new DryadDataContext("file://" + directory);
            DryadTable<LineRecord> table =
                ddc.GetTable<LineRecord>(filename);
            return table.Select(s => s.line).
                         Where(s => s.IndexOf(tosearch) >= 0);
        }

        public static void ShowOnConsole<T>(this IEnumerable<T> t)
        {
            foreach (T s in t)
                Console.WriteLine("{0}", s.ToString());
        }

        static void Main(string[] args)
        {
            IQueryable<string> result = Match("c:/", "test.txt", "here");
            result.ShowOnConsole();
        }
    }
}
Before you can compile and run this application:
• You must link with the LINQToDryad dll.
• You must configure the application to find:
  o The cluster to use
  o The place where temporary data is created on the cluster
Let us tackle these problems one by one.
3.1.1 The DryadLINQ Dynamically Linked Library
All the functionality of DryadLINQ is contained in the LINQToDryad.dll file. This file contains the
provider, which knows how to accept queries, translate them to Dryad jobs and ship them to remote
clusters for execution, and how to collect the results back.
Add the dll as a reference to your project: In solution explorer right-click on references and then browse
to the location where you installed the LINQToDryad.dll on the development machine. (If you installed
DryadLINQ at C:\dryadlinq, it will be at C:\dryadlinq\lib\debug\LINQtoDryad.dll).
3.1.2 DryadLINQConfig.xml
The configuration of your DryadLINQ application is controlled via a file called DryadLinqConfig.xml. A
global copy of this file is located in the root directory of your DryadLINQ installation, and defines default
values for all applications. Configurable parameters may optionally be overridden on a per-application
basis, by placing an additional copy in the root directory of an application. For details, including
descriptions of the configurable parameters, see the configuration reference in the DryadLINQ
documentation.
3.1.3 Running the DryadLINQ Application
If you have access to a Dryad cluster, and if you have configured everything properly, now you can run
your application (Ctrl-F5, or just F5 if you would like to single-step). You should make sure that the
cluster machines are allowed to access the input and output files you have provided. Note that the
programs running on the cluster will not use your user account privileges. The query itself (the Select
and Where calls in Match, highlighted in yellow in the original version of the previous example) will
run remotely; everything else will run locally.
When you run this application the sequence of events unfolds as follows; we indicate how they
correspond to the numbered labels in FIGURE 1.
1. C# execution
   • The configuration parameters are set.
   • The Match function is invoked. It creates an object ddc of type DryadDataContext.
     This object creates the DryadLINQ provider which will handle IQueryables that can
     be executed by DryadLINQ.
   • The ddc is used to allocate a DryadTable<T>. The DryadTable is a subclass of
     IQueryable<T> representing persistent data (i.e., files). DryadTables are typed
     collections. In our case, the table contains LineRecord objects (these are just C#
     objects, each representing a line of text). The DryadTable object is bound to a (text) file,
     specified by directory and filename.
2. Query creation
   • The methods Select and Where are applied to the table, in this order. The result
     of applying these two methods is an IQueryable<string> object called result.
   • The method ShowOnConsole is applied to result. The method starts iterating over
     the content of result.
3. Query evaluation
   • At this point, the query must be evaluated. The DryadLINQ provider receives the query
     for processing. In the meantime the C# program that spawned the query is blocked
     waiting for the query to complete.
4. Query transformation
   • DryadLINQ analyzes the query received and compiles it into a Dryad job, to be executed
     on the cluster. The compilation is done in several stages.
5. Query invocation
   • The cluster runtime receives a Dryad executable (called XmlExechost.exe).
6. Plan instantiation
   • The Dryad executable is handed the Dryad job description; the job is described in an
     XML file.
7. Plan execution
   • The Dryad runtime analyzes the execution plan and spawns the processes composing
     the Dryad job on the cluster.
8. Data generation
   • The job completes and writes the results to temporary files in the output directory
     (indicated by the configuration options).
9. Completion
   • The C# program receives notification about the job completion.
10. Result marshalling and
11. C# object creation
   • The query provider translates the data, marshalling the results into C# objects.
12. Resume program
   • The C# program resumes execution and the ShowOnConsole iterator extracts the
     data from the query output file, on demand.
3.2 Watching the Remote Program
While your program is running the remote query you will see on the output console some cryptic
strings. This information comes from the Dryad Job Manager, which oversees the execution of the
program on the cluster. While an explanation of Dryad is out of the scope of this document it is useful
to know the basics of Dryad for debugging, visualization and performance tuning purposes.
Figure 2: Execution stages of a Dryad Job.
FIGURE 2 shows the stages of a Dryad job. The job is coordinated by a Job Manager; the manager is the
brain, while all the work is performed by the workers, which are called vertices. The manager starts
first, and it creates vertices on the cluster, using a remote execution service. It monitors the vertices’
progress and gathers execution statistics. The manager prints periodic summaries of the state of the
computation, to the console.
The job manager itself may be running on a remote machine (depending on configuration parameters).
• If your job manager is running on the local machine then you can visualize your job’s state
  interactively using Internet Explorer.
• If your job manager is running on the cluster using the cluster scheduler, then you can visualize
  the job via the cluster’s web server. The visualization is not interactive.
By default, DryadLINQ will run the job manager locally. If you are running on a cluster that has a job
scheduler installed, you can configure DryadLINQ to instead submit the job to the job scheduler by
adding the usejobscheduler attribute to the <Cluster> element in your DryadLinqConfig.xml:
<Cluster name="MyClusterName"
         …
         usejobscheduler="true" />
3.3 Partitioned Files
The Match program that we wrote can be effectively parallelized for scanning a large amount of data.
It is sufficient to cut the data into pieces (preserving line boundaries) and run the scan in parallel on all
pieces. The hardest part is describing the file pieces. For this purpose DryadLINQ provides a
datatype PartitionedFile. A partitioned file on disk is composed of two parts:
1) The pieces themselves and
2) The metadata: a textual description of all the pieces of a file which has been split. FIGURE 3
   shows how the metadata is organized:
   • The first line indicates the name prefix of each piece. The pieces must all be placed in the
     same directory on all the machines. In this example each file will be in the \mydata
     directory, and its name will have the form Piece.XXXXXXXX. Here XXXXXXXX is an 8-digit
     hexadecimal number.
   • The second line is the number of pieces, in this example 4.
   • Each line that follows describes a piece:
     o The piece number, in decimal.
     o The piece size in bytes.
     o Finally, a comma-separated list of machines. A piece may be replicated on several
       machines, for fault-tolerance.
Figure 3: Partitioned File Structure
The description in FIGURE 3 corresponds to the following pieces:
• \\m1\mydata\Piece.00000000
• \\m2\mydata\Piece.00000001
• \\m3\mydata\Piece.00000001
• \\m3\mydata\Piece.00000002
• \\m4\mydata\Piece.00000003
Piece.00000001 is present on two machines.
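The mapping from metadata to piece locations can be sketched in a few lines. This is an illustration under stated assumptions, not the DryadLINQ parser: the figure itself is not reproduced here, so the sample lines are reconstructed from the piece list, the piece sizes are invented, commas are assumed to separate all fields on a piece line, and the names PartitionMetadata and PiecePaths are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class PartitionMetadata
{
    // Expands metadata lines into one UNC path per replica of each piece.
    public static List<string> PiecePaths(string[] lines)
    {
        string prefix = lines[0].Trim();          // name prefix, e.g. \mydata\Piece
        int count = int.Parse(lines[1].Trim());   // number of pieces
        var paths = new List<string>();

        for (int i = 0; i < count; i++)
        {
            // Assumed layout: number,size,machine[,machine...]
            string[] fields = lines[2 + i].Split(',');
            int piece = int.Parse(fields[0].Trim());   // piece number, in decimal
            // fields[1] holds the size in bytes; it is not needed to build paths.
            for (int m = 2; m < fields.Length; m++)
            {
                // The file suffix is the piece number as 8 hexadecimal digits.
                paths.Add(@"\\" + fields[m].Trim() + prefix + "." + piece.ToString("X8"));
            }
        }
        return paths;
    }

    public static void Main(string[] args)
    {
        // Hypothetical metadata: sizes are made up for the example.
        string[] metadata =
        {
            @"\mydata\Piece",
            "4",
            "0,1000,m1",
            "1,2000,m2,m3",
            "2,1500,m3",
            "3,1200,m4"
        };
        foreach (string path in PiecePaths(metadata))
            Console.WriteLine(path);
    }
}
```

With this input the sketch prints the five locations listed above, with piece 1 appearing once for each of its two replicas.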
Once you have partitioned your data in this way, you only need to make a tiny change to enable your
computation to use the partitioned table:
public static IQueryable<string> Match(string directory,
                                       string filename,
                                       string tosearch)
{
    DryadDataContext ddc = new DryadDataContext("file://" + directory);
    DryadTable<LineRecord> table = ddc.GetPartitionedTable<LineRecord>(filename);
    return table.Select(s => s.line).Where(s => s.IndexOf(tosearch) >= 0);
}
When running this job, the job will operate in parallel on all four partitions:
Figure 4: The program operating on a partitioned file with 4 partitions.
With four partitions, the throughput of this computation can increase by up to a factor of 4 (assuming
the cluster contains at least 4 machines). If the input partitions are on different machines, they can all
be read in parallel. The file output by the program also contains four partitions. (However, the C#
program still uses one iterator to read all four output partitions.)
3.4 Computing Histograms
Let us tackle a slightly more complex example: the input is a large text file distributed over many
machines (e.g., web pages). We want to compute a histogram of the words in the web pages, and
extract the top k words and their counts. This is a typical map-reduce application.
3.4.1 A Helper Class
We need to build a helper class Pair, which associates a count with each word.
[Serializable]
public struct Pair {
    private string word;
    private int count;

    public Pair(string w, int c)
    {
        word = w;
        count = c;
    }

    public int Count { get { return count; } }
    public string Word { get { return word; } }

    public override string ToString() {
        return word + ":" + count.ToString();
    }
}
While the Pair class is quite trivial, you should pay attention to the annotation [Serializable]
which precedes it. The distributed computation will express intermediate results as collections of Pair
objects, and these collections need to be shipped between machines. The annotation indicates to the
runtime that it needs to prepare to serialize and deserialize the internal representation of the Pair on
the wire. DryadLINQ automatically builds efficient serializers for the data structures in your program; its
serialization is much less verbose than the default C# reflection-based serialization.
3.4.2 The Histogram Query
Despite the complexity of the transformations, the DryadLINQ code is actually very compact (the
interesting part is the five query statements, highlighted in green in the original document):
public static IQueryable<Pair> Histogram(
    string directory,
    string filename,
    int k)
{
    DryadDataContext ddc = new DryadDataContext("file://" + directory);
    DryadTable<LineRecord> table =
        ddc.GetPartitionedTable<LineRecord>(filename);

    IQueryable<string> words =
        table.SelectMany(x => x.line.Split(' ').AsEnumerable());
    IQueryable<IGrouping<string, string>> groups = words.GroupBy(x => x);
    IQueryable<Pair> counts = groups.Select(x => new Pair(x.Key, x.Count()));
    IQueryable<Pair> ordered = counts.OrderByDescending(x => x.Count);
    IQueryable<Pair> top = ordered.Take(k);
    return top;
}
Calling ShowOnConsole on the result of this function will display the output. TABLE 2 shows a sample
execution for k=3.
Operator            Output
table               “A line of words of wisdom”
SelectMany          [“A”, “line”, “of”, “words”, “of”, “wisdom”]
GroupBy             [[“A”], [“line”], [“of”, “of”], [“words”], [“wisdom”]]
Select              [{“A”, 1}, {“line”, 1}, {“of”, 2}, {“words”, 1}, {“wisdom”, 1}]
OrderByDescending   [{“of”, 2}, {“A”, 1}, {“line”, 1}, {“words”, 1}, {“wisdom”, 1}]
Take(3)             [{“of”, 2}, {“A”, 1}, {“line”, 1}]
Table 2: Sample execution of Histogram
Let’s dissect this program.
• The SelectMany method transforms each input element into an IEnumerable and
  concatenates the results. In our case, we use it to transform a string representing a line into an
  IEnumerable<string> containing all the words on the line. The result is obtained by
  transforming the array returned by Split into an IEnumerable using the AsEnumerable method.
• The GroupBy has a key selector delegate as argument (the delegate returns the “key”
  associated to each input). The result is a set of bags (called IGrouping), where all elements in
  a bag have the same “key”. We use the identity function for the delegate, and thus all identical
  words are grouped together.
• Each group is summarized with a pair containing just the representative word (x.Key) and the
  count of elements in the group. Since IGrouping extends IEnumerable, it
  provides the Count() method, which we use to measure the group size.
• The pairs are sorted in descending order of their Count value (OrderByDescending).
• Finally, the Take() method just selects the first k elements of the result.
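The same chain of operators can be tried locally on an in-memory collection before running it on a cluster. The sketch below uses only standard LINQ to Objects; LocalHistogram and TopWords are hypothetical names, and KeyValuePair stands in for the Pair helper so that the example is self-contained.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LocalHistogram
{
    // Same operator chain as the Histogram query, applied to in-memory lines.
    public static List<KeyValuePair<string, int>> TopWords(IEnumerable<string> lines, int k)
    {
        return lines
            .SelectMany(line => line.Split(' '))                           // lines -> words
            .GroupBy(word => word)                                         // identical words together
            .Select(g => new KeyValuePair<string, int>(g.Key, g.Count()))  // word -> (word, count)
            .OrderByDescending(p => p.Value)                               // most frequent first
            .Take(k)                                                       // keep the top k
            .ToList();
    }

    public static void Main(string[] args)
    {
        var lines = new[] { "A line of words of wisdom" };
        foreach (var p in TopWords(lines, 3))
            Console.WriteLine("{0}:{1}", p.Key, p.Value);
    }
}
```

For the sample line this prints of:2, A:1 and line:1, matching the last row of the sample execution; ties keep their original order because the LINQ to Objects sort is stable.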
The generated plan is actually quite smart:
Figure 5: Distributed Histogram plan generated by DryadLINQ.
There are four input partitions, reading four files. The GroupBy is computed in a distributed way, using
a pattern called hash-partitioning. Each word is sent to a machine based on the hash function applied to
the word; in this way, each machine computes a local GroupBy just for a subset of the words. (Note
that all identical words will end up on the same machine). The optimizer inserts OrderBy at the level
of each partition. The final node can thus just use merge-sort to combine the ordered streams, and then
apply Take to the result. Currently the Take operator requires the whole data to be present on a
single machine; this explains why the data was merged and the output file has a single partition.
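The shape of this plan can be imitated on one machine to see why it gives the same answer as the sequential query. The following is a rough sketch only: partitions are simulated with in-memory lists, the hash function is a simple deterministic stand-in, a plain re-sort replaces the merge step, and HashPartitionSketch, PartitionOf and TopWords are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class HashPartitionSketch
{
    // Deterministic stand-in hash: equal words always map to the same partition.
    public static int PartitionOf(string word, int partitions)
    {
        int h = 0;
        foreach (char c in word)
            h = unchecked(h * 31 + c) & 0x7fffffff;
        return h % partitions;
    }

    public static List<KeyValuePair<string, int>> TopWords(
        IEnumerable<string> words, int partitions, int k)
    {
        // Stage 1: route every word to the partition chosen by its hash.
        var buckets = new List<string>[partitions];
        for (int i = 0; i < partitions; i++)
            buckets[i] = new List<string>();
        foreach (string w in words)
            buckets[PartitionOf(w, partitions)].Add(w);

        // Stage 2: each partition groups, counts and sorts only its own words.
        var perPartition = buckets.Select(b => b
            .GroupBy(w => w)
            .Select(g => new KeyValuePair<string, int>(g.Key, g.Count()))
            .OrderByDescending(p => p.Value)
            .ToList()).ToList();

        // Stage 3: combine the sorted partitions on a single node and take the top k.
        return perPartition
            .SelectMany(p => p)
            .OrderByDescending(p => p.Value)
            .Take(k)
            .ToList();
    }

    public static void Main(string[] args)
    {
        string[] words = "A line of words of wisdom".Split(' ');
        foreach (var p in TopWords(words, 4, 1))
            Console.WriteLine("{0}:{1}", p.Key, p.Value);
    }
}
```

Because identical words always land in the same partition, no count is ever split across partitions, so the local counts are already final and only the ordering has to be combined.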
3.4.3 Type Inference
Note that it would be perfectly acceptable to omit the types of all temporary variables in the program.
I.e., the following program works just fine, and is equivalent to the previous one:
var words = table.SelectMany(x => x.line.Split(' ').AsEnumerable());
var groups = words.GroupBy(x => x);
var counts = groups.Select(x => new Pair(x.Key, x.Count()));
var ordered = counts.OrderByDescending(x => x.Count);
var top = ordered.Take(k);
However, types can be good documentation, making the program more readable and maintainable.
Although the compiler is very good at inferring types, occasionally you may need to supply types
explicitly.
3.5 Reductions (Aggregations)
One of the most useful operations that can be performed on data is reduction, also called aggregation.
By definition, an aggregation takes a lot of data values and collapses them to a single value. A typical
example would be the sum of a stream of numbers. A somewhat less obvious example is the count.
LINQ contains lots of operators – see 4.1. But, as you may expect by now, LINQ provides a generic
aggregation operator which relies on a delegate to transform two values into one.
public static TAccumulate Aggregate<TSource, TAccumulate>(
this IQueryable<TSource> source,
TAccumulate seed,
Expression<Func<TAccumulate, TSource, TAccumulate>> func)
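We can sum up the values in a partitioned file in this way (the call below uses the overload of Aggregate
that takes no seed; the first element then serves as the initial value):
var result = input.Aggregate((x,y) => x+y);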
The distributed computation which aggregates an input with two partitions looks as follows:
Figure 6: Aggregation is done after collecting all inputs in a single vertex.
The Aggregate vertex collects all the data from the two input readers and sums it up.
However, if the aggregating function is associative, a much better parallel computation plan is possible
by aggregating each partition of the data separately and then combining the results at the end. By
adding an [Associative] annotation to a function, you can enable DryadLINQ to generate a much
better plan.
[Associative]
static int Add(int x, int y) { return x + y; }

var sum = input.Aggregate((x,y) => Add(x,y));
The generated plan looks much better:
Figure 7: Aggregation of associative function is done using multiple machines.
Each machine does pipelined reading followed by local aggregation on its own data, and then a global
stage combines the partial results.
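Why associativity permits the second plan can be checked locally. In the sketch below each array stands for one partition; SinglePlace mimics collecting everything in one vertex, and TwoStage mimics local aggregation followed by a global combine. The class and method names are hypothetical, partitions are assumed to be non-empty (the seedless Aggregate throws on an empty sequence), and nothing here uses the DryadLINQ API.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class AssociativeAggregation
{
    public static int Add(int x, int y) { return x + y; }

    // First plan: bring all the data to one place, then aggregate.
    public static int SinglePlace(IEnumerable<int[]> partitions)
    {
        return partitions.SelectMany(p => p).Aggregate((x, y) => Add(x, y));
    }

    // Second plan: aggregate each partition locally, then combine the partial results.
    public static int TwoStage(IEnumerable<int[]> partitions)
    {
        return partitions
            .Select(p => p.Aggregate((x, y) => Add(x, y)))   // one partial result per partition
            .Aggregate((x, y) => Add(x, y));                 // global combine
    }

    public static void Main(string[] args)
    {
        var partitions = new List<int[]> { new[] { 1, 2, 3 }, new[] { 4, 5 } };
        Console.WriteLine(SinglePlace(partitions));   // 15
        Console.WriteLine(TwoStage(partitions));      // 15
    }
}
```

The two results agree only because regrouping the additions does not change the sum; for a non-associative function such as subtraction the two plans would generally disagree, which is why the annotation is required.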
3.6 Apply
DryadLINQ takes one IQueryable object and transforms it into a distributed network of processes
(vertices). Each of the vertices manipulates only a partition of the data. In the generated code, each
vertex executes an independent LINQ program. The inputs and outputs to each vertex are all
IEnumerable objects. Thus each vertex automatically takes advantage of the lazy evaluation and
pipelining provided by the iterator model.
Most LINQ methods are “stateless”: they operate on each value in a collection independently of its
neighbors. However, some very useful types of computations need to see multiple values at once. A
typical example is a sliding-window (e.g., convolution) computation. DryadLINQ extends LINQ with
several powerful operators. The complete list of DryadLINQ operators is given in Section 4.1 DryadLINQ
Operators.
The Apply operator is a new addition. It corresponds roughly to Select: it has a delegate argument,
which produces the output by transforming the input. Unlike Select, the input to the Apply
delegate is the whole input stream, and the output is a complete stream.
Figure 8: The Select delegate receives each element individually, while the Apply delegate receives the whole
stream.
In other words, in Figure 8 the type of f is Expression<Func<T,S>>, while the type of g is
Expression<Func<IEnumerable<T>,IEnumerable<S>>>.
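To make the difference in delegate shapes concrete, here is a small sketch using plain IEnumerable and LINQ to Objects rather than DryadLINQ itself. The names ApplySketch, Square and PairSums are hypothetical. Square has the per-element shape that Select expects, while PairSums has the whole-stream shape that Apply expects, which lets it compute a sliding window of size two.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ApplySketch
{
    // Select-style delegate: sees one element at a time.
    public static int Square(int x) { return x * x; }

    // Apply-style delegate: sees the whole stream, so it can look at neighbors.
    // Emits the sum of every pair of adjacent elements.
    public static IEnumerable<int> PairSums(IEnumerable<int> source)
    {
        bool first = true;
        int previous = 0;
        foreach (int current in source)
        {
            if (!first) yield return previous + current;
            previous = current;
            first = false;
        }
    }

    public static void Main()
    {
        int[] input = { 1, 2, 3, 4 };
        Console.WriteLine(string.Join(", ", input.Select(Square))); // prints 1, 4, 9, 16
        Console.WriteLine(string.Join(", ", PairSums(input)));      // prints 3, 5, 7
    }
}
```

Because PairSums is an iterator, it remains lazy and pipelined: it holds only one previous element in memory, not the whole stream.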
There exists a binary version of Apply, which operates on two input streams:
public static IQueryable<T3>
Apply<T1, T2, T3>(this IQueryable<T1> source1,
                  IQueryable<T2> source2,
                  Expression<Func<IEnumerable<T1>,
                                  IEnumerable<T2>,
                                  IEnumerable<T3>>> procFunc);
Unfortunately, there is no binary version of Select. But we can build one using Apply. For
example, here is how to implement a binary Select-like operator which adds the corresponding
numbers in two streams (the two streams must have the same length). First, we write the per-vertex
transformation, which operates on IEnumerable inputs:
public static IEnumerable<int>
addeach(IEnumerable<int> left, IEnumerable<int> right)
{
    IEnumerator<int> left_enu = left.GetEnumerator();
    IEnumerator<int> right_enu = right.GetEnumerator();
    while (true)
    {
        bool more_left = left_enu.MoveNext();
        bool more_right = right_enu.MoveNext();
        if (more_left != more_right)
        {
            throw new Exception("Streams with different lengths");
        }
        if (!more_left) yield break; // both are finished
        int l = left_enu.Current;
        int r = right_enu.Current;
        int q = l + r;
        yield return q;
    }
}
The addeach function is hopefully obvious: it iterates over two streams in parallel using two iterators
(it uses the MoveNext() and Current stream operators rather than foreach).
To create the IQueryable version of addition, it is enough to invoke addeach on the two inputs:
public static IQueryable<int>
Add(IQueryable<int> left,
    IQueryable<int> right)
{
    return left.Apply<int, int, int>(right, (x,y) => addeach(x,y));
}
(It is surprisingly harder to write a generic Select, which takes an arbitrary delegate; this is covered in
Section 3.10, Advanced Topic: Higher-Order Query Operations.)
If we run this query, for example by supplying both inputs from a single partitioned file:
Add(input, input).ShowOnConsole();
we will have an unpleasant surprise: the executed plan is quite inefficient:
Figure 9: Naive Plan for the pairwise addition.
(Each in[] vertex reads one partition and then broadcasts the data to two consumers using a Tee vertex.
A Tee vertex stands for “broadcast”: all of its outputs carry the same data as its input. The red edges in the
figure are implemented using FIFO channels; this means that all three Merge and Apply vertices run as
separate threads in a single process on the same machine, and just pass pointers to objects to each
other.)
The generated plan performs all the additions in a single vertex, labeled “Apply__0” in the figure. For
this purpose it merges (by concatenating) the two entire input streams. It is obvious to us that the
additions could be done in parallel in each partition, but since the addeach function claims it needs to
see the entire input stream, DryadLINQ obliges and builds it before passing it to the Apply.
Fortunately, there is a way out: by adding an appropriate annotation to the addeach function you can
indicate that it can be safely applied to each partition independently:
[Homomorphic]
public static IEnumerable<int>
addeach(IEnumerable<int> left, IEnumerable<int> right)
While homomorphic is a mouthful, it just means that the function operates correctly partitionwise
(i.e., it distributes with respect to partition concatenation:
concatenate(add(a,c), add(b,d)) = add(concatenate(a,b), concatenate(c,d))). With this annotation, the
execution plan uses two separate vertices to perform the addition, one operating on each partition:
Figure 10: Using delegate annotations can improve plans.
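The distributive property behind the Homomorphic annotation can be checked on small inputs with LINQ to Objects. This is only an illustrative sketch: the names HomomorphicSketch, AddEach and DistributesOverConcat are hypothetical, and Enumerable.Zip is used in place of the hand-written iterator (note that Zip silently stops at the shorter input instead of throwing). The arrays a, b, c, d play the role of partitions; the property holds when corresponding partitions (a with c, b with d) have equal lengths.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class HomomorphicSketch
{
    // Pairwise addition over two whole streams (same role as addeach in the text).
    public static IEnumerable<int> AddEach(IEnumerable<int> left, IEnumerable<int> right)
    {
        return left.Zip(right, (l, r) => l + r);
    }

    // Checks concatenate(add(a,c), add(b,d)) == add(concatenate(a,b), concatenate(c,d)).
    public static bool DistributesOverConcat(int[] a, int[] b, int[] c, int[] d)
    {
        IEnumerable<int> partitionwise = AddEach(a, c).Concat(AddEach(b, d));
        IEnumerable<int> whole = AddEach(a.Concat(b), c.Concat(d));
        return partitionwise.SequenceEqual(whole);
    }

    public static void Main()
    {
        int[] a = { 1, 2 }, b = { 3, 4, 5 };
        int[] c = { 10, 20 }, d = { 30, 40, 50 };
        Console.WriteLine(DistributesOverConcat(a, b, c, d)); // prints True
    }
}
```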
3.7 Join
One of the most powerful operations provided by LINQ is join. Its type signature is:
public static IQueryable<TResult>
Join<TOuter, TInner, TKey, TResult>(
    this IQueryable<TOuter> outer,
    IEnumerable<TInner> inner,
    Expression<Func<TOuter, TKey>> outerKeySelector,
    Expression<Func<TInner, TKey>> innerKeySelector,
    Expression<Func<TOuter, TInner, TResult>> resultSelector);
Join operates on two sub-queries and combines all elements that have the same key (where keys are
extracted using two delegates, one for each query). For each pair of matching values the result is
obtained by applying a third delegate.
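As a small illustration of these three delegates, the sketch below runs a join with LINQ to Objects (the IEnumerable counterpart of the IQueryable signature above). The class name JoinSketch, the method NamesWithItems and the sample data are made up for the example; the first tuple component serves as the key.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class JoinSketch
{
    // Joins (id, name) pairs with (id, item) pairs on the id.
    public static IEnumerable<string> NamesWithItems(
        IEnumerable<Tuple<int, string>> people,
        IEnumerable<Tuple<int, string>> orders)
    {
        return people.Join(
            orders,
            p => p.Item1,                        // key of the outer element
            o => o.Item1,                        // key of the inner element
            (p, o) => p.Item2 + ":" + o.Item2);  // result for each matching pair
    }

    public static void Main()
    {
        var people = new[] { Tuple.Create(1, "Ann"), Tuple.Create(2, "Bob") };
        var orders = new[]
        {
            Tuple.Create(1, "book"), Tuple.Create(1, "pen"), Tuple.Create(3, "ink")
        };
        // Bob has no order and order 3 has no owner, so neither appears.
        Console.WriteLine(string.Join(", ", NamesWithItems(people, orders)));
        // prints Ann:book, Ann:pen
    }
}
```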
3.8 Statistics in DryadLINQ
In this section we take advantage of the power of LINQ to manipulate more complex C# data structures.
We will tackle a statistics application: given a very large set of high-dimensional sparse vectors, compute
their mean and standard deviation.
$$\mu = \frac{\sum_i v_i}{n}, \qquad \sigma = \sqrt{\frac{\sum_i (v_i - \mu)^2}{n}}$$
We won’t delve into the implementation of the sparse vectors; here we have used a
Dictionary<int, double> to represent them, but any implementation with the following
interface would do:
[Serializable]
public class SparseVector
{
    // Interface only: method bodies are omitted.
    public SparseVector();
    public SparseVector(string line); /* read from text file */
    public double this[uint index] { get; set; }
    [Associative]
    public SparseVector Add(SparseVector r);
    public SparseVector Subtract(SparseVector r);
    public SparseVector Square(); // elementwise square
    public SparseVector SqRoot(); // elementwise square root
    public SparseVector Divide(double scalar);
}
The statistics program will ship around objects of type (collection of) SparseVector; this is a much
more complex datatype compared with the strings that we have manipulated so far. It is also a datatype
which is much harder to support in the context of a traditional database (showing that DryadLINQ’s
resemblance to a database query engine is deceiving).
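For readers who want something concrete to experiment with, here is one possible minimal implementation of such a vector, together with a purely local computation of the mean and standard deviation using LINQ to Objects. It is a sketch under stated assumptions, not the implementation used by the authors: it indexes with int rather than uint, omits the string constructor and the [Serializable] and [Associative] attributes, and the helper names Combine, Map and StatisticsSketch are invented here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical minimal sparse vector: only explicitly set entries are stored.
public class SparseVector
{
    private readonly Dictionary<int, double> entries = new Dictionary<int, double>();

    public double this[int index]
    {
        get { double v; return entries.TryGetValue(index, out v) ? v : 0.0; }
        set { entries[index] = value; }
    }

    // Combines two vectors index by index; missing entries count as 0.
    private static SparseVector Combine(SparseVector a, SparseVector b,
                                        Func<double, double, double> op)
    {
        var result = new SparseVector();
        foreach (int i in a.entries.Keys.Union(b.entries.Keys))
            result[i] = op(a[i], b[i]);
        return result;
    }

    // Applies f to every stored entry; valid below because each f maps 0 to 0.
    private SparseVector Map(Func<double, double> f)
    {
        var result = new SparseVector();
        foreach (KeyValuePair<int, double> e in entries) result[e.Key] = f(e.Value);
        return result;
    }

    public SparseVector Add(SparseVector r) { return Combine(this, r, (x, y) => x + y); }
    public SparseVector Subtract(SparseVector r) { return Combine(this, r, (x, y) => x - y); }
    public SparseVector Square() { return Map(x => x * x); }
    public SparseVector SqRoot() { return Map(x => Math.Sqrt(x)); }
    public SparseVector Divide(double scalar) { return Map(x => x / scalar); }
}

public static class StatisticsSketch
{
    public static SparseVector Mean(IList<SparseVector> v)
    {
        return v.Aggregate((x, y) => x.Add(y)).Divide(v.Count);
    }

    public static SparseVector StdDev(IList<SparseVector> v)
    {
        SparseVector mean = Mean(v);
        return v.Select(x => x.Subtract(mean).Square())
                .Aggregate((x, y) => x.Add(y))
                .Divide(v.Count)
                .SqRoot();
    }

    public static void Main()
    {
        var a = new SparseVector(); a[0] = 1; a[3] = 4;
        var b = new SparseVector(); b[0] = 3;
        var data = new List<SparseVector> { a, b };
        SparseVector mean = Mean(data), dev = StdDev(data);
        Console.WriteLine(mean[0] + " " + mean[3]); // prints 2 2
        Console.WriteLine(dev[0] + " " + dev[3]);   // prints 1 2
    }
}
```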
We first tackle computing the average. This is done by summing up all the values and dividing the result
by the count of values. Both the sum and count are built-in aggregations. A naïve attempt to do this
falls short:
// this program is not good enough
public static SparseVector
ComputeStatistics(this IQueryable<SparseVector> v)
{
    SparseVector sum = v.Aggregate((x, y) => x.Add(y));
    int count = v.Count();
    SparseVector average = sum.Divide((double)count);
    IQueryable<SparseVector> normalized =
        v.Select(x => (x.Subtract(average).Square()));
    SparseVector sum1 = normalized.Aggregate((x, y) => x.Add(y));
    SparseVector variance = sum1.Divide((double)count);
    return variance.SqRoot();
}
This code does indeed compute the average and standard deviation of all the SparseVectors in v.
However, average is a SparseVector, and not an IQueryable. This means that three queries
will be executed: one to compute the sum, a second to compute the count, and a third to compute the
standard deviation. In between, the intermediate values (the sum and the count) are shipped back to
the C# program, where the average is computed.
In order to blend the computation into a single big query we have to make a few changes:
1) First, we have to use special DryadLINQ extensions for Aggregate and Count which return
IQueryables and not values: AggregateAsQuery and CountAsQuery. These two
operators return an IQueryable which will always contain a single element when evaluated.
2) The average computation becomes much more involved, since we can no longer perform
simple arithmetic between the sum and count. We need to use Apply to manipulate them,
as we have illustrated in Section 3.6, Apply. The Apply
delegate needs to be spelled out:
[Homomorphic]
public static
IEnumerable<SparseVector> Scale(IEnumerable<SparseVector> left,
                                IEnumerable<int> right)
// left and right should contain a single value
{
    SparseVector l = left.Single();
    int coef = right.Single();
    yield return l.Divide((double)coef);
}
public static
IQueryable<SparseVector> Average(this IQueryable<SparseVector> v,
                                 IQueryable<int> count)
{
    IQueryable<SparseVector> sum =
        v.AggregateAsQuery<SparseVector>((x, y) => x.Add(y));
    IQueryable<SparseVector> average =
        sum.Apply(count, (x, y) => Scale(x, y));
    return average;
}
The Single() method returns the unique element of a stream.
We have factored out the count computation, since it will be reused.
3) The standard deviation involves a computation between a big stream (the input vector), and a
singleton stream (the average, which is subtracted from each element of the vector). This can
be done with another instance of Apply, using the following delegate argument:
[Homomorphic(Left = true)]
public static
IEnumerable<SparseVector> stddev(IEnumerable<SparseVector> left,
                                 IEnumerable<SparseVector> average)
// average is a single value, left is a vector
{
    SparseVector avg = average.Single();
    foreach (SparseVector l in left)
    {
        SparseVector tmp = l.Subtract(avg);
        SparseVector tmp_sq = tmp.Square();
        yield return tmp_sq;
    }
}
public static
IQueryable<SparseVector> StdDev(this IQueryable<SparseVector> v,
                                IQueryable<SparseVector> average,
                                IQueryable<int> count)
{
    IQueryable<SparseVector> normalized =
        v.Apply(average, (x, y) => stddev(x, y));
    IQueryable<SparseVector> sum =
        normalized.AggregateAsQuery<SparseVector>((x, y) => x.Add(y));
    IQueryable<SparseVector> scaled =
        sum.Apply(count, (x, y) => Scale(x, y));
    // SqRoot works on one vector at a time, so Select (not Apply) is the matching operator
    IQueryable<SparseVector> result =
        scaled.Select(x => x.SqRoot());
    return result;
}
We know that the first Apply operation (the normalization) can be performed in parallel, but
how do we express this fact? In other words, stddev can handle in parallel all elements of the
left input, but it needs to see the whole right input (which is a single element stream
anyway). This can be described using a special variant of the Homomorphic attribute for the
stddev delegate:
[Homomorphic(Left=true)]
public static
IEnumerable<SparseVector> stddev(IEnumerable<SparseVector> left,
                                 IEnumerable<SparseVector> average)
This means that the function is distributive in its left argument, but not in the right one.
4) Finally, we would like the computation to save to persistent storage both of the computed
statistics: the average and the standard deviation. If we just extract the values from average
and result, two queries will run, recomputing some results twice. What we need is to have a
single query with multiple outputs. This is done by creating lazy tables for each interesting
output. The lazy tables are computed and saved to persistent storage only when forced with a
Materialize call:
public static
IQueryable<SparseVector> ReadVectors(string directory,
                                     string filename)
{
    DryadDataContext ddc = new DryadDataContext("file://" + directory);
    DryadTable<LineRecord> table =
        ddc.GetPartitionedTable<LineRecord>(filename);
    return table.Select(s => new SparseVector(s.line));
}
public static void
ComputeStatistics(this IQueryable<SparseVector> v)
{
    IQueryable<int> count = v.CountAsQuery();
    IQueryable<SparseVector> average = v.Average(count);
    IQueryable<SparseVector> dev = v.StdDev(average, count);
    IQueryable<SparseVector> a = average.ToDryadTableLazy("average");
    IQueryable<SparseVector> d = dev.ToDryadTableLazy("stddev");
    DryadLINQQueryable.Materialize(a, d);
}
This query will generate two tables when executed. The query plan for an input with two partitions
looks pretty good:
Figure 11: Plan for the statistics computation.
3.9 Writing Custom Serializers
All the DryadLINQ programs so far only read text files, but they output binary files.
In this section we will answer three questions:
1) What format is the data written in?
2) How can I write data in a different format?
3) How can I read binary data?
3.9.1 What is the Default Output Format?
DryadLINQ writes binary data in the output tables, using the same binary serialization routines that are
used to ship data between vertices. As a consequence, data written by a DryadLINQ query can be read
by another query without any special preparations; the two programs should just specify the same type
for the table contents:
DryadTable<T> output = result.ToDryadTable("histogram");
…
DryadTable<T> input = ddc.GetTable("histogram");
3.9.2 How Can I Change the Output Format and How Can I Read Binary Data?
We will answer both of these questions at once.
If you define a type MyRecord, then you can control the way it is represented on the wire by endowing
it with the following two methods:
public struct MyRecord
{
    public static MyRecord Read(DryadBinaryReader rd);
    public static int Write(DryadBinaryWriter wr, MyRecord rec);
}
(You can use Visual Studio to discover the interfaces provided by DryadBinaryReader and
DryadBinaryWriter.)
For example, there are two ways to have the SparseVectors from the statistics example be written
as text:
• Add an extra Select computation stage before the output:
result.Select(x => x.ToString()).ToDryadTable();
• Add a Write method to the SparseVector class:
public static int Write(DryadBinaryWriter wr, SparseVector vec)
{
    string s = vec.ToString();
    return wr.Write(s);
}
3.10 Advanced Topic: Higher-Order Query Operations
In this section we show how to build higher-order query operations by writing a generic Select
operator on IQueryable objects which operates on two inputs at once, pairwise.
public static Expression<Func<T1, T2, T3>>
Closure_cvv<T0, T1, T2, T3>(
    Func<T0, T1, T2, T3> function,
    Expression firstArg) // type should be T0
// build a closure from a function of 3 arguments,
// first one is constant (cvv)
{
    ParameterExpression xparam = Expression.Parameter(typeof(T1), "xparam");
    ParameterExpression yparam = Expression.Parameter(typeof(T2), "yparam");
    Expression fun = Expression.Constant(function);
    Expression body = Expression.Invoke(fun, firstArg, xparam, yparam);
    Type resultType =
        typeof(Func<,,>).MakeGenericType(typeof(T1), typeof(T2), body.Type);
    LambdaExpression result = Expression.Lambda(resultType,
                                                body,
                                                xparam,
                                                yparam);
    return (Expression<Func<T1, T2, T3>>)result;
}
public static IQueryable<T3>
Select<T1, T2, T3>(this IQueryable<T1> input0,
                   IQueryable<T2> input1,
                   Expression<Func<T1, T2, T3>> mapper)
{
    // first create pairs of elements using Apply
    Expression<Func<T1, T2, Pair<T1, T2>>>
        makepairs = (x, y) => Pair<T1, T2>.MakePair(x, y);
    Expression<Func<IEnumerable<T1>,
                    IEnumerable<T2>,
                    IEnumerable<Pair<T1,T2>>>>
        pairmaker =
            Closure_cvv<Func<T1, T2, Pair<T1,T2>>,
                        IEnumerable<T1>,
                        IEnumerable<T2>,
                        IEnumerable<Pair<T1,T2>>>(
                Conversions.Pointwise, makepairs);
    // tag pairmaker as homomorphic
    HomomorphicAttribute h = new HomomorphicAttribute();
    AttributeSystem.Add(pairmaker, h);
    ResourceAttribute a = new ResourceAttribute();
    a.IsStateful = false;
    AttributeSystem.Add(pairmaker, a);
    IQueryable<Pair<T1, T2>> pairs =
        input0.Apply<T1, T2, Pair<T1, T2>>(input1, pairmaker);
    // second, run the (slightly modified) 'mapper' on the pairs
    ParameterExpression p12 =
        Expression.Parameter(typeof(Pair<T1, T2>),
                             Conversions.GetFreshName());
    Expression p1 = Expression.Property(p12, "First");
    Expression p2 = Expression.Property(p12, "Second");
    Expression body = Expression.Invoke(mapper, p1, p2);
    Expression<Func<Pair<T1, T2>, T3>> pop =
        Expression.Lambda<Func<Pair<T1, T2>, T3>>(body, p12);
    IQueryable<T3> result = pairs.Select(pop);
    return result;
}
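The second half of the operator above, which rewrites a two-argument mapper into a one-argument lambda over pairs, can be tried in isolation with the standard System.Linq.Expressions API. The sketch below is an assumption-laden stand-in: it uses System.Tuple (with properties Item1 and Item2) in place of the tutorial's Pair type, and the names ExpressionSketch and OverPairs are hypothetical.

```csharp
using System;
using System.Linq.Expressions;

public static class ExpressionSketch
{
    // Builds pair => mapper(pair.Item1, pair.Item2) as an expression tree.
    public static Expression<Func<Tuple<T1, T2>, T3>> OverPairs<T1, T2, T3>(
        Expression<Func<T1, T2, T3>> mapper)
    {
        ParameterExpression pair =
            Expression.Parameter(typeof(Tuple<T1, T2>), "pair");
        Expression first = Expression.Property(pair, "Item1");
        Expression second = Expression.Property(pair, "Item2");
        // Invoke the original lambda on the two components of the pair.
        Expression body = Expression.Invoke(mapper, first, second);
        return Expression.Lambda<Func<Tuple<T1, T2>, T3>>(body, pair);
    }

    public static void Main()
    {
        Expression<Func<int, int, int>> add = (x, y) => x + y;
        Func<Tuple<int, int>, int> addPair = OverPairs(add).Compile();
        Console.WriteLine(addPair(Tuple.Create(3, 4))); // prints 7
    }
}
```

Working at the level of expression trees, rather than compiled delegates, is what allows a query provider to inspect and ship the resulting lambda instead of treating it as an opaque function.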
4 Reference
4.1 DryadLINQ Operators
We can distinguish four main classes of operators:
1. Operators present in LINQ which are implemented by DryadLINQ. In the table below, they are
marked with the keyword “LINQ”.
2. Adaptations of operators present in LINQ which return scalar values (i.e., not IQueryable),
but which are modified to return an IQueryable instead. For example, Count returns an
integer, while CountAsQuery returns an IQueryable whose actual contents will be a
single integer. The AsQuery variants can be chained together to produce complex
queries, while using the scalar variants would require breaking queries into small sub-queries,
which could decrease efficiency (see an example in Section 3.8, Statistics in DryadLINQ).
3. New operators, which exist only in DryadLINQ. We have added new operators which cannot be
synthesized efficiently from compositions of primitive LINQ operators, and which can
substantially improve the performance of queries in the context of a distributed execution
environment like Dryad. These operators are accompanied by a brief description in the table.
4. Operations not yet implemented in DryadLINQ (there is only one).
Operator
Aggregate
AggregateAsQuery
All
AllAsQuery
Any
AnyAsQuery
Apply
AssumeDistinct
Brief Description
LINQ
Result is query (not scalar).
LINQ
Result is query (not scalar).
LINQ
Result is query (not scalar).
Applies a delegate to an entire input stream. See also Section 3.6, Apply.
Two versions exist:
public static IQueryable<T2> Apply<T1, T2>(
this IQueryable<T1> source1,
Expression<Func<IEnumerable<T1>,
IEnumerable<T2>>> procFunc);
public static IQueryable<T3> Apply<T1, T2, T3>(
this IQueryable<T1> source1,
IQueryable<T2> source2,
Expression<Func<IEnumerable<T1>,
IEnumerable<T2>,
IEnumerable<T3>>> procFunc);
Can only be applied to DryadTable<T> objects. Asserts that there are no
two identical objects in the table (according to the comparison function).
public void AssumeDistinct<T>(
IEqualityComparer<T> comparer)
Operator
AssumeHashPartition
AssumeOrderBy
AssumeRangePartition
Average
AverageAsQuery
Concat
Contains
ContainsAsQuery
Count
CountAsQuery
Distinct
Except
First
FirstAsQuery
FirstOrDefault
FirstOrDefaultAsQuery
Fork
Brief Description
An assertion that hints the compiler that an IQueryable is partitioned
according to a specified hash function.
public void AssumeHashPartition<TKey>(
Expression<Func<T, TKey>> keySelector,
IEqualityComparer<TKey> comparer)
An assertion that hints the compiler that an IQueryable is ordered
according to a specified key selector function.
public void AssumeOrderBy<TKey>(
Expression<Func<T, TKey>> keySelector,
IComparer<TKey> comparer)
An assertion that hints the compiler that an IQueryable is partitioned
according to a specified set of buckets.
public void AssumeRangePartition<TKey>(
Expression<Func<T, TKey>> keySelector,
TKey[] rangeKeys,
IComparer<TKey> comparer)
LINQ
Result is query (not scalar).
Currently not implemented in DryadLINQ. Let us know if you need it.
LINQ
Result is query (not scalar).
LINQ
Result is query (not scalar).
LINQ
LINQ
LINQ
Result is query (not scalar).
LINQ
Result is query (not scalar).
Apply simultaneously several transformations to a single IQueryable.
There are three versions of Fork:
public static IMultiEnumerable<R1, R2, R3>
Fork<TSource, R1, R2, R3>(
this IEnumerable<TSource> source,
Func<IEnumerable<TSource>,
IEnumerable<ForkTuple<R1, R2, R3>>> mapper);
public static IMultiEnumerable<R1, R2>
Fork<TSource, R1, R2>(
this IEnumerable<TSource> source,
Func<IEnumerable<TSource>,
IEnumerable<ForkTuple<R1, R2>>> mapper);
public static IMultiEnumerable<R1, R2>
Fork<TSource, R1, R2>(
this IEnumerable<TSource> source,
Func<TSource, ForkTuple<R1, R2>> mapper);
ForkChoose
GroupBy
GroupJoin
LINQ
LINQ
LINQ
Operator
HashPartition
Brief Description
Partition the input IQueryable using a specified hash function.
There are four variants of this method:
public static IEnumerable<TSource>
HashPartition<TSource, TKey>(
this IEnumerable<TSource> source,
Func<TSource, TKey> keySelector);
public static IEnumerable<TSource>
HashPartition<TSource, TKey>(
this IEnumerable<TSource> source,
Func<TSource, TKey> keySelector,
IEqualityComparer<TKey> comparer);
public static IEnumerable<TSource>
HashPartition<TSource, TKey>(
this IEnumerable<TSource> source,
Func<TSource, TKey> keySelector,
int count);
public static IEnumerable<TSource>
HashPartition<TSource, TKey>(
this IEnumerable<TSource> source,
Func<TSource, TKey> keySelector,
IEqualityComparer<TKey> comparer,
int count);
Intersect
Join
Last
LastAsQuery
LastOrDefault
LastOrDefaultAsQuery
LongCount
LongCountAsQuery
Max
MaxAsQuery
Merge
Min
MinAsQuery
OfType
OrderBy
OrderByDescending
RangePartition
LINQ
LINQ
LINQ
Result is query (not scalar).
LINQ
Result is query (not scalar).
LINQ
Result is query (not scalar).
LINQ
Result is query (not scalar).
LINQ
LINQ
Result is query (not scalar).
LINQ
LINQ
LINQ
Partition the input IQueryable using a specified set of buckets.
public static IEnumerable<TSource>
RangePartition<TSource, TKey>(
this IEnumerable<TSource> source,
Func<TSource, TKey> keySelector,
bool isDescending);
public static IEnumerable<TSource>
RangePartition<TSource, TKey>(
this IEnumerable<TSource> source,
Func<TSource, TKey> keySelector,
TKey[] rangeKeys);
public static IEnumerable<TSource>
RangePartition<TSource, TKey>(
this IEnumerable<TSource> source,
Func<TSource, TKey> keySelector,
IComparer<TKey> comparer,
bool isDescending);
public static IEnumerable<TSource>
RangePartition<TSource, TKey>(
this IEnumerable<TSource> source,
Func<TSource, TKey> keySelector,
TKey[] rangeKeys,
IComparer<TKey> comparer);
Reverse
Select
SelectMany
SequenceEqual
SequenceEqualAsQuery
Single
SingleAsQuery
SingleOrDefault
SingleOrDefaultAsQuery
Skip
SkipWhile
Sum
SumAsQuery
Take
TakeWhile
ThenBy
ThenByDescending
Union
Where
LINQ
LINQ
LINQ
LINQ
LINQ
LINQ
LINQ
LINQ
Result is query (not scalar).
LINQ
LINQ
LINQ
Result is query (not scalar).
LINQ
LINQ
LINQ
LINQ
LINQ
LINQ
4.2 DryadLINQ Annotations
Annotations indicate to the compiler some special behavior of delegates that are used as arguments for
DryadLINQ operators. This enables the optimizer to generate better plans.
Annotation    Example                         Meaning
Associative   [Associative]                   Delegate is associative: can be applied in any order.
Homomorphic   [Homomorphic(Left=true)]        Delegate commutes with partitioning: can be applied independently on partitions of a stream. If “Left” is specified, the (two-argument) delegate commutes with partitioning only in its left input.
Resource      [Resource(IsStateful=false)]    The delegate is not blocking: i.e., it can be pipelined with other operators.