DryadLINQ Tutorial

Contents

0 DryadLINQ Status
1 What is DryadLINQ?
2 Introduction to LINQ
    2.1 A Simple LINQ Program
    2.2 Iterating Over File Contents
    2.3 IEnumerable: A Simple Extension of C#
    2.4 IQueryable: Support for Specialized Execution Engines
3 DryadLINQ
    3.1 A First DryadLINQ Program
        3.1.1 The DryadLINQ Dynamically Linked Library
        3.1.2 DryadLINQConfig.xml
        3.1.3 Running the DryadLINQ Application
    3.2 Watching the Remote Program
    3.3 Partitioned Files
    3.4 Computing Histograms
        3.4.1 A Helper Class
        3.4.2 The Histogram Query
        3.4.3 Type Inference
    3.5 Reductions (Aggregations)
    3.6 Apply
    3.7 Join
    3.8 Statistics in DryadLINQ
    3.9 Writing Custom Serializers
        3.9.1 What is the Default Output Format?
        3.9.2 How Can I Change the Output Format and How Can I Read Binary Data?
    3.10 Advanced Topic: Higher-Order Query Operations
4 Reference
    4.1 DryadLINQ Operators
    4.2 DryadLINQ Annotations

0 DryadLINQ Status

DryadLINQ is currently available only within Microsoft.

1 What is DryadLINQ?

DryadLINQ is a compiler which translates LINQ programs to distributed computations which can be run on a PC cluster. LINQ is an extension to .NET, launched with Visual Studio 2008, which provides declarative programming for data manipulation. By using DryadLINQ the programmer does not need to have any knowledge about parallel or distributed computation (though a little knowledge can help with writing efficient programs). Thus any LINQ programmer turns instantly into a cluster computing programmer. While LINQ extensions have been made to Visual Basic and C#, the DryadLINQ compiler only supports C#.

DryadLINQ uses the Dryad distributed execution environment to create and run distributed applications. Knowledge or understanding of Dryad is not required for DryadLINQ users. Dryad is a generic distributed execution environment which can support a variety of distributed programming models. Dryad is currently deployed in production systems within Microsoft. Dryad is a library linked with user applications, which provides the following services:

• An API to create distributed applications (jobs), by specifying which processes have to be executed and communication channels linking them.
• Scheduling of the processes on the cluster machines.
• Fault-tolerance through re-execution of processes after transient failures.
• Monitoring of the computation and statistics collection.
• Job visualization.
• An API for run-time resource management policies.
• Support for efficient bulk data transfer between processes.

Dryad has been tested from small clusters (four machines) to very large clusters (4000 machines).
Dryad assumes that the computer cluster you are using is hosted in a secure datacenter, with high bandwidth connections between machines. It does not provide security or access control; applications are fully trusted. Dryad has not been targeted at loosely coupled machines, e.g. workstations in an office environment.

2 Introduction to LINQ

DryadLINQ leverages the LINQ programming language extensions, which were introduced as part of the .NET 3.5 framework shipped with Visual Studio 2008. Here we provide a brief introduction to LINQ. For more information on LINQ see ‘LINQ: .NET Language Integrated Query’ at: http://msdn2.microsoft.com/en-us/library/bb308959.aspx. For lots of small examples, see “101 LINQ Samples” at: http://msdn2.microsoft.com/en-us/vcsharp/aa336746.aspx

2.1 A Simple LINQ Program

We will start with a simple LINQ program.

1) Create a new project 'example' in Visual Studio (File/New Project). Choose a "console application".
2) In the solution explorer, browse to your project's references, and add System.Core as a new reference (it is in the .NET tab). This is required for enabling LINQ.
3) Here is the code to add to the file:

    using System;                      // for I/O
    using System.Collections;          // for IEnumerable
    using System.Collections.Generic;  // for IEnumerable<T>
    using System.Linq;                 // for LINQ operators

    namespace Example
    {
        public static class Example
        {
            public static void Main(string[] args)
            {
                IEnumerable<int> l = Enumerable.Range(1, 100);
                foreach (var v in l)
                    Console.WriteLine("{0}", v);
                Console.ReadKey();
            }
        }
    }

When you build this program (e.g., using F6) and run it (using Ctrl-F5) it should output the numbers 1 to 100, one per line.

This program introduces several important LINQ concepts:

• The IEnumerable<T> data type is a special type of generic collection, holding elements of an arbitrary type T. The collection is represented by an iterator, which provides efficient sequential access to each element.
  In this example we instantiate T to int.
• The Range operator generates an iterator which will produce the elements of the collection on demand. The collection will contain all integers from 1 to 100.
• The foreach loop repeatedly invokes the iterator of collection l to obtain a new value, and writes the value to the console.
• The variable v has no type declared (it is declared just using the var keyword). The compiler infers from the context that v is an integer.

Note that the collection elements are never stored simultaneously. There is no container of size 100 allocated to hold all values between 1 and 100. The numbers are produced by the Range() operator on demand, consumed, and then discarded.

Range can be defined in C# using the yield keyword, in the following way:

    static IEnumerable<int> Range(int low, int high)
    {
        for (int i = low; i <= high; i++)
            yield return i;
    }

(To invoke the new Range, just change the invocation from Enumerable.Range to Range.) Range defines an iterator which will return a new value each time it is invoked. It is instructive to run this program with the debugger, single-stepping with F11. You will notice that execution re-enters the body of Range() on every iteration of the foreach loop.

2.2 Iterating Over File Contents

Files are ideally suited to iterator-type access, since iterators entail sequential access to file contents, which is very efficient (in contrast to random access).
Let us implement an iterator which scans a text file:

    using System.Collections.Generic;  // for IEnumerable<T>
    using System.IO;                   // for StreamReader

    public class TextFileReader
    {
        private StreamReader file;

        public TextFileReader(string filename)
        {
            file = new StreamReader(filename);
        }

        // Returns the lines of the file one at a time, on demand.
        public IEnumerable<string> Contents()
        {
            while (file.Peek() != -1)
                yield return file.ReadLine();
        }
    }

It is now easy to implement an fgrep-like program (which searches for a fixed string):

    public static IEnumerable<string> Match(string filename, string tosearch)
    {
        TextFileReader tr = new TextFileReader(filename);
        foreach (string s in tr.Contents())
        {
            if (s.IndexOf(tosearch) >= 0)
                yield return s;
        }
    }

    public static void ShowOnConsole<T>(IEnumerable<T> t)
    {
        foreach (T s in t)
            Console.WriteLine("{0}", s.ToString());
    }

    public static void Main(string[] args)
    {
        ShowOnConsole(Match("c:/test.txt", "here"));
    }

However, it is even better to use the built-in LINQ operator Where for this purpose:

    public static IEnumerable<string> Match(string filename, string tosearch)
    {
        TextFileReader tr = new TextFileReader(filename);
        IEnumerable<string> result =
            tr.Contents().Where(s => (s.IndexOf(tosearch) >= 0));
        return result;
    }

This program introduces some new LINQ elements:

• The function ShowOnConsole manipulates generic streams, containing elements of an arbitrary type T. This function is invoked in Main to display a stream of strings.
• The Where method takes as its argument an anonymous delegate written as a lambda expression. The expression

      s => (s.IndexOf(tosearch) >= 0)

  describes a function which takes an argument s and returns a Boolean value, the result of evaluating the expression s.IndexOf(tosearch) >= 0. Note that the lambda expression depends on the tosearch value. The type of this function in C# is Func<string, bool> (a function transforming strings to Booleans).

The type of Where is:

    public static IEnumerable<T> Where<T>(this IEnumerable<T> source, Func<T, bool> filter)

Another useful method is Select.
It is somewhat unfortunately named, since it does not just perform a selection; it can perform an arbitrary type transformation:

    public static IEnumerable<S> Select<T, S>(this IEnumerable<T> source, Func<T, S> transform)

The function transform is applied to every element of the input stream to generate an element of the output stream. For example, to print the lengths of the matching lines instead of the lines themselves, one would change the call as follows:

    ShowOnConsole(Match("c:/test.txt", "here").Select(s => s.Length));

If we redefine the ShowOnConsole method as an extension of IEnumerable<T>:

    public static void ShowOnConsole<T>(this IEnumerable<T> t)

we can write the above call as:

    Match("c:/test.txt", "here").Select(s => s.Length).ShowOnConsole();

2.3 IEnumerable: A Simple Extension of C#

The LINQ programs we wrote so far can be expressed by using only "pure" C# operations. But there is more to LINQ than just syntactic sugar.

Almost every collection in C# (e.g., List<T>, arrays, etc.) can be treated as an IEnumerable in a very simple way. For example, List<T> already implements IEnumerable<T>, and the AsEnumerable() method makes the conversion explicit. Computations on IEnumerables are handled by the local machine, like any standard C# program.

Most IEnumerable<T> methods take delegates as arguments (i.e., the IEnumerable methods are higher-order functions). For example, the Where method requires a delegate returning a Boolean value.

However, one of the great features of LINQ is that it allows computations to be dispatched to other "providers". For example, if your data is in a SQL Server database, you can write a LINQ query and ask SQL Server to execute it. DryadLINQ takes advantage of this feature, by using Dryad to execute LINQ queries. But in order to enable this behavior we need a new type, IQueryable<T>.

2.4 IQueryable: Support for Specialized Execution Engines

Superficially the IQueryable<T> type is very similar to IEnumerable<T>.
Since IQueryable<T> inherits from IEnumerable<T>, all methods applicable to IEnumerable<T> operate on IQueryable<T> objects as well. However, there is a very important distinction: IQueryable objects do not represent iterators, they represent queries, which are shipped for execution to some computation engine and thus are not executed by the local JIT. The interfacing between the local computation and the remote execution engine is encapsulated in a query provider. Examples of execution engines are SQL Server or DryadLINQ, but one can easily write a new one. One of the main design features of LINQ is extensibility, allowing new providers to be created by application programmers.

Figure 1: A Query provider translates IQueryable objects to a suitable format and ships them to a remote execution engine. It also transforms the remote data into C# objects.

DryadLINQ is just an instance of such a provider which interfaces with the Dryad remote execution framework. A remote execution environment is not required to understand C#.

All methods of IQueryable receive arguments which are not delegates, but Expression<> objects. An expression is essentially a "syntactic" representation of a computation. For example, the type of Where operating on IQueryable<T> is:

    public static IQueryable<T> Where<T>(this IQueryable<T> source, Expression<Func<T, bool>> filter)

Conveniently, a lambda expression that could be used as a delegate of type Func<T,S> can usually be used unchanged where an Expression<Func<T,S>> is expected; most of the time the compiler handles the conversion implicitly. Thus, for many simple LINQ programs operating on IQueryables one can use exactly the same syntax as for IEnumerable objects. (For more on the distinction between Func<> and Expression<Func<>> see Section 3.10, Advanced Topic: Higher-Order Query Operations.)
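The difference between a delegate and an expression can be observed without any remote engine. The following is a minimal sketch using only standard .NET types; the class name DelegateVersusExpression, its members, and the sample strings are illustrative assumptions, and AsQueryable() stands in for a real provider such as DryadLINQ.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Linq.Expressions;

public static class DelegateVersusExpression
{
    // The same lambda text converts implicitly to either type.
    public static readonly Func<string, bool> AsDelegate =
        s => s.IndexOf("here") >= 0;
    public static readonly Expression<Func<string, bool>> AsExpression =
        s => s.IndexOf("here") >= 0;

    // IEnumerable<T> path: Where receives compiled code (a delegate).
    public static List<string> FilterLocally(IEnumerable<string> lines)
    {
        return lines.Where(AsDelegate).ToList();
    }

    // IQueryable<T> path: Where receives an expression tree. Here the
    // built-in in-memory provider evaluates it; a provider such as
    // DryadLINQ would translate the tree instead (assumption for illustration).
    public static List<string> FilterAsQuery(IEnumerable<string> lines)
    {
        return lines.AsQueryable().Where(AsExpression).ToList();
    }

    public static void Main()
    {
        var lines = new List<string> { "here we go", "nothing", "over here" };

        // An expression is data that can be inspected...
        Console.WriteLine(AsExpression.Body.NodeType);      // GreaterThanOrEqual
        Console.WriteLine(AsExpression.Parameters[0].Name); // s

        // ...and both paths select the same two lines.
        Console.WriteLine(FilterLocally(lines).Count);      // 2
        Console.WriteLine(FilterAsQuery(lines).Count);      // 2
    }
}
```

A delegate can only be invoked, whereas the expression exposes its structure (node types, parameters), which is what allows a query provider to analyze a query and translate it for another engine.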
                         IEnumerable                 IQueryable
    What                 Iterator over collection    Query
    Generic              Yes                         Yes
    Created              Lazily                      Lazily
    Evaluated            Lazily                      Depends on provider
    Evaluated by         Local JIT                   Remote execution engine
    Operator pipelining  Yes                         Depends on engine
    Method arguments     Delegates                   Expressions

Table 1: IEnumerable vs. IQueryable

3 DryadLINQ

3.1 A First DryadLINQ Program

Let us rewrite the Match function to operate on an IQueryable:

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using LINQToDryad;

    namespace Example
    {
        static class Program
        {
            public static IQueryable<string> Match(string directory, string filename, string tosearch)
            {
                DryadDataContext ddc = new DryadDataContext("file://" + directory);
                DryadTable<LineRecord> table = ddc.GetTable<LineRecord>(filename);
                return table.Select(s => s.line)
                            .Where(s => s.IndexOf(tosearch) >= 0);
            }

            public static void ShowOnConsole<T>(this IEnumerable<T> t)
            {
                foreach (T s in t)
                    Console.WriteLine("{0}", s.ToString());
            }

            static void Main(string[] args)
            {
                IQueryable<string> result = Match("c:/", "test.txt", "here");
                result.ShowOnConsole();
            }
        }
    }

Before you can compile and run this application:

• You must link with the LINQToDryad dll.
• You must configure the application to find:
    o The cluster to use
    o The place where temporary data is created on the cluster

Let us tackle these problems one by one.

3.1.1 The DryadLINQ Dynamically Linked Library

All the functionality of DryadLINQ is contained in the LINQToDryad.dll file. This file contains the provider, which knows how to accept queries, translate them to Dryad jobs, ship them to remote clusters for execution, and collect the results.

Add the dll as a reference to your project: in Solution Explorer, right-click on References and then browse to the location where you installed LINQToDryad.dll on the development machine. (If you installed DryadLINQ at C:\dryadlinq, it will be at C:\dryadlinq\lib\debug\LINQtoDryad.dll.)
3.1.2 DryadLINQConfig.xml

The configuration of your DryadLINQ application is controlled via a file called DryadLinqConfig.xml. A global copy of this file is located in the root directory of your DryadLINQ installation, and defines default values for all applications. Configurable parameters may optionally be overridden on a per-application basis, by placing an additional copy in the root directory of an application. For details, including descriptions of the configurable parameters, please read the relevant section of this tutorial [section reference missing in source].

3.1.3 Running the DryadLINQ Application

If you have access to a Dryad cluster, and if you have configured everything properly, you can now run your application (Ctrl-F5, or just F5 if you would like to single-step). You should make sure that the cluster machines are allowed to access the input and output files you have provided. Note that the programs running on the cluster will not use your user account privileges.

The code highlighted in yellow in the previous example will run remotely; everything else will run locally.

When you run this application the sequence of events unfolds as follows; we indicate how they correspond to the numbered labels in Figure 1.

1. C# execution. The configuration parameters are set.
2. The Match function is invoked. It creates an object ddc of type DryadDataContext. This object creates the DryadLINQ provider which will handle IQueryables that can be executed by DryadLINQ. The ddc is used to allocate a DryadTable<T>. A DryadTable is an IQueryable<T> representing persistent data (i.e., files). DryadTables are typed collections. In our case, the table contains LineRecord objects (these are just C# objects representing a line of text). The DryadTable object is bound to a (text) file, specified by directory and filename.
3. Query creation. The methods Select and Where are applied to the table, in this order. The result of applying these two methods is an IQueryable<string> object called result. The method ShowOnConsole is applied to result. The method starts iterating over the content of result.
4. Query evaluation. At this point, the query must be evaluated. The DryadLINQ provider receives the query for processing. In the meantime the C# program that spawned the query is blocked waiting for the query to complete.
5. Query transformation. DryadLINQ analyzes the query received and compiles it into a Dryad job, to be executed on the cluster. The compilation is done in several stages.
6. Query invocation. The cluster runtime receives a Dryad executable [1].
7. Plan instantiation. The Dryad executable is handed the Dryad job description; the job is described in an XML file.
8. Plan execution. The Dryad runtime analyzes the execution plan and spawns the processes composing the Dryad job on the cluster.
9. Data generation. The job completes and writes the results to temporary files in the output directory (indicated by the configuration options).
10. Completion. The C# program receives notification about the job completion.
11. Result marshalling and C# object creation. The query provider translates the data, marshalling the results into C# objects.
12. Resume program. The C# program resumes execution and the ShowOnConsole iterator extracts the data from the query output file, on demand.

[1] The executable handed to Dryad is called XmlExechost.exe.

3.2 Watching the Remote Program

While your program is running the remote query you will see some cryptic strings on the output console. This information comes from the Dryad Job Manager, which oversees the execution of the program on the cluster. While an explanation of Dryad is out of the scope of this document, it is useful to know the basics of Dryad for debugging, visualization and performance tuning purposes.

Figure 2: Execution stages of a Dryad Job.
Figure 2 shows the stages of a Dryad job. The job is coordinated by a Job Manager; the manager is the brain, while all the work is performed by the workers, which are called vertices. The manager starts first, and it creates vertices on the cluster, using a remote execution service. It monitors the vertices’ progress and gathers execution statistics. The manager prints periodic summaries of the state of the computation to the console.

The job manager itself may be running on a remote machine (depending on configuration parameters). If your job manager is running on the local machine then you can visualize your job’s state interactively using Internet Explorer. If your job manager is running on the cluster using the cluster scheduler, then you can visualize the job via the cluster’s web server. The visualization is not interactive.

By default, DryadLINQ will run the job manager locally. If you are running on a cluster that has a job scheduler installed, you can configure DryadLINQ to instead submit the job to the job scheduler by adding the usejobscheduler attribute to the <Cluster> element in your DryadLinqConfig.xml:

    <Cluster name="MyClusterName" … usejobscheduler="true" />

3.3 Partitioned Files

The Match program that we wrote can be effectively parallelized for scanning a large amount of data. It is sufficient to cut the data into pieces (preserving line boundaries) and run the scan in parallel on all pieces. The hardest part is describing the file pieces. For this purpose DryadLINQ provides a datatype PartitionedFile.

A partitioned file on disk is composed of two parts:

1) The pieces themselves, and
2) The metadata: a textual description of all the pieces of a file which has been split.

Figure 3 shows how the metadata is organized:

• The first line indicates the name prefix of each piece. The pieces must all be placed in the same directory on all the machines.
  In this example each file will be in the \mydata directory, and its name will have the form Piece.XXXXXXXX. Here XXXXXXXX is an 8-digit hexadecimal number.
• The second line is the number of pieces, in this example 4.
• Each line that follows describes a piece:
    o The piece number, in decimal.
    o The piece size in bytes.
    o Finally, a comma-separated list of machines. A piece may be replicated on several machines, for fault-tolerance.

Figure 3: Partitioned File Structure

The description in Figure 3 corresponds to the following pieces:

    \\m1\mydata\Piece.00000000
    \\m2\mydata\Piece.00000001
    \\m3\mydata\Piece.00000001
    \\m3\mydata\Piece.00000002
    \\m4\mydata\Piece.00000003

Piece.00000001 is present on two machines.

Once you have partitioned your data in this way, you only need to make a tiny change to enable your computation to use the partitioned table:

    public static IQueryable<string> Match(string directory, string filename, string tosearch)
    {
        DryadDataContext ddc = new DryadDataContext("file://" + directory);
        DryadTable<LineRecord> table = ddc.GetPartitionedTable<LineRecord>(filename);
        return table.Select(s => s.line).Where(s => s.IndexOf(tosearch) >= 0);
    }

When this job runs, it operates in parallel on all four partitions:

Figure 4: The program operating on a partitioned file with 4 partitions.

The throughput of this computation can increase by up to a factor of 4 (assuming the cluster contains at least 4 machines). If the input partitions are on different machines, they can all be read in parallel. The file output by the program also contains four partitions. (However, the C# program still uses one iterator to read all four output partitions.)

3.4 Computing Histograms

Let us tackle a slightly more complex example: the input is a large text file distributed over many machines (e.g., web pages). We want to compute a histogram of the words in the web pages, and extract the top k words and their counts. This is a typical map-reduce application.
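To make the goal concrete before distributing it, the same computation can be sketched on in-memory data with ordinary LINQ to Objects. This is only a local illustration: the class name LocalHistogram, the method TopWords, the use of KeyValuePair, and the sample line are assumptions made for this sketch, and nothing here involves DryadLINQ.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LocalHistogram
{
    // Returns the k most frequent words with their counts, most frequent first.
    public static List<KeyValuePair<string, int>> TopWords(IEnumerable<string> lines, int k)
    {
        return lines
            .SelectMany(line => line.Split(' '))   // "map": each line becomes its words
            .GroupBy(word => word)                 // bring identical words together
            .Select(g => new KeyValuePair<string, int>(g.Key, g.Count())) // "reduce": count each group
            .OrderByDescending(p => p.Value)       // stable sort, largest count first
            .Take(k)
            .ToList();
    }

    public static void Main()
    {
        var lines = new[] { "A line of words of wisdom" };
        foreach (var p in TopWords(lines, 3))
            Console.WriteLine("{0}:{1}", p.Key, p.Value);
        // Prints:
        // of:2
        // A:1
        // line:1
    }
}
```

The split into words plays the role of the "map" step and the per-group count plays the role of the "reduce" step; on a single machine the whole pipeline runs lazily inside one process.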
3.4.1 A Helper Class

We need to build a helper type Pair, which will be used to represent the count associated with each word.

    [Serializable]
    public struct Pair
    {
        private string word;
        private int count;

        public Pair(string w, int c)
        {
            word = w;
            count = c;
        }

        public int Count { get { return count; } }
        public string Word { get { return word; } }

        public override string ToString()
        {
            return word + ":" + count.ToString();
        }
    }

While the Pair type is quite trivial, you should pay attention to the annotation [Serializable] which precedes it. The distributed computation will express intermediate results as collections of Pair objects, and these collections need to be shipped between machines. The annotation indicates to the runtime that it needs to be prepared to serialize and deserialize the internal representation of the Pair on the wire. DryadLINQ automatically builds efficient serializers for the data structures in your program; its serialization is much less verbose than the default .NET reflection-based serialization.

3.4.2 The Histogram Query

Despite the complexity of the transformations, the DryadLINQ code is actually very compact (the interesting part is in the green highlights):

    public static IQueryable<Pair> Histogram(string directory, string filename, int k)
    {
        DryadDataContext ddc = new DryadDataContext("file://" + directory);
        DryadTable<LineRecord> table = ddc.GetPartitionedTable<LineRecord>(filename);

        IQueryable<string> words = table.SelectMany(x => x.line.Split(' ').AsEnumerable());
        IQueryable<IGrouping<string, string>> groups = words.GroupBy(x => x);
        IQueryable<Pair> counts = groups.Select(x => new Pair(x.Key, x.Count()));
        IQueryable<Pair> ordered = counts.OrderByDescending(x => x.Count);
        IQueryable<Pair> top = ordered.Take(k);
        return top;
    }

Calling ShowOnConsole on the result of this function will display the output. Table 2 shows a sample execution for k=3.
    Operator            Output
    table               “A line of words of wisdom”
    SelectMany          [“A”, “line”, “of”, “words”, “of”, “wisdom”]
    GroupBy             [[“A”], [“line”], [“of”, “of”], [“words”], [“wisdom”]]
    Select              [{“A”, 1}, {“line”, 1}, {“of”, 2}, {“words”, 1}, {“wisdom”, 1}]
    OrderByDescending   [{“of”, 2}, {“A”, 1}, {“line”, 1}, {“words”, 1}, {“wisdom”, 1}]
    Take(3)             [{“of”, 2}, {“A”, 1}, {“line”, 1}]

Table 2: Sample execution of Histogram

Let’s dissect this program.

• The SelectMany method transforms a scalar into an IEnumerable. In our case, we use it to transform a string representing a line into an IEnumerable<string> containing all the words on the line. The result is obtained by transforming the array returned by Split into an IEnumerable using the AsEnumerable method.
• The GroupBy method takes a key selector delegate as its argument (the delegate returns the “key” associated with each input). The result is a set of bags (called IGrouping), where all elements in a bag have the same “key”. We use the identity function for the delegate, and thus all identical words are grouped together.
• Each group is summarized with a pair containing just the representative word (x.Key) and the count of elements in the group. Since IGrouping extends IEnumerable, it provides the Count() method, which we use to measure the group size.
• The pairs are sorted in descending order of their Count value (OrderByDescending).
• Finally, the Take() method just selects the first k elements of the result.

The generated plan is actually quite smart:

Figure 5: Distributed Histogram plan generated by DryadLINQ.

• There are four input partitions, reading four files.
• The GroupBy is computed in a distributed way, using a pattern called hash-partitioning. Each word is sent to a machine based on the hash function applied to the word; in this way, each machine computes a local GroupBy just for a subset of the words. (Note that all identical words will end up on the same machine.)
• The optimizer inserts OrderBy at the level of each partition. The final node can thus just use merge-sort to combine the ordered streams, and then apply Take to the result.
• Currently the Take operator requires the whole data to be present on a single machine; this explains why the data was merged and the output file has a single partition.

3.4.3 Type Inference

Note that it would be perfectly acceptable to omit the types of all temporary variables in the program. That is, the following program works just fine, and is equivalent to the previous one:

    var words = table.SelectMany(x => x.line.Split(' ').AsEnumerable());
    var groups = words.GroupBy(x => x);
    var counts = groups.Select(x => new Pair(x.Key, x.Count()));
    var ordered = counts.OrderByDescending(x => x.Count);
    var top = ordered.Take(k);

However, types can be good documentation, making the program more readable and maintainable. Although the compiler is very good at inferring types, occasionally you may need to supply types explicitly.

3.5 Reductions (Aggregations)

One of the most useful operations that can be performed on data is reduction, also called aggregation. By definition, an aggregation takes many data values and collapses them to a single value. A typical example would be the sum of a stream of numbers. A somewhat less obvious example is the count.

LINQ contains many built-in operators (see Section 4.1). In addition, as you may expect by now, LINQ provides a generic aggregation operator which relies on a delegate to transform two values into one:

    public static TAccumulate Aggregate<TSource, TAccumulate>(
        this IQueryable<TSource> source,
        TAccumulate seed,
        Expression<Func<TAccumulate, TSource, TAccumulate>> func)

We can sum up the values in a partitioned file in this way (this uses the overload without a seed, in which the first element serves as the initial value):

    var result = input.Aggregate((x, y) => x + y);

The distributed computation which aggregates an input with two partitions looks as follows:

Figure 6: Aggregation is done after collecting all inputs in a single vertex.
The Aggregate vertex collects all the data from the two input readers and sums it up. However, if the aggregating function is associative, a much better parallel computation plan is possible: aggregate each partition of the data separately and then combine the partial results at the end. By adding an [Associative] annotation to a function, you can enable DryadLINQ to generate a much better plan.

    [Associative]
    static int Add(int x, int y) { return x + y; }

    var sum = input.Aggregate((x, y) => Add(x, y));

The generated plan looks much better:

Figure 7: Aggregation of associative function is done using multiple machines.

Each machine does pipelined reading followed by local aggregation on its own data, and then a global stage combines the partial results.

3.6 Apply

DryadLINQ takes one IQueryable object and transforms it into a distributed network of processes (vertices). Each of the vertices manipulates only a partition of the data. In the generated code, each vertex executes an independent LINQ program. The inputs and outputs of each vertex are all IEnumerable objects. Thus each vertex automatically takes advantage of the lazy evaluation and pipelining provided by the iterator model.

Most LINQ methods are “stateless”: they operate on each value in a collection independently of its neighbors. However, some very useful types of computations need to see multiple values at once. A typical example is a sliding-window (e.g., convolution) computation. DryadLINQ extends LINQ with several powerful operators. The complete list of DryadLINQ operators is given in Section 4.1, DryadLINQ Operators.

The Apply operator is a new addition. It corresponds roughly to Select: it has a delegate argument, which produces the output by transforming the input. Unlike Select, the input to the Apply delegate is the whole input stream, and the output is a complete stream.

Figure 8: The Select delegate receives each element individually, while the one of Apply receives the whole stream.
In other words, in Figure 8 the type of f is Expression<Func<T,S>>, while the type of g is Expression<Func<IEnumerable<T>,IEnumerable<S>>>.

There exists a binary version of Apply, which operates on two input streams:

    public static IQueryable<T3> Apply<T1, T2, T3>(
        this IQueryable<T1> source1,
        IQueryable<T2> source2,
        Expression<Func<IEnumerable<T1>, IEnumerable<T2>, IEnumerable<T3>>> procFunc);

Unfortunately, there is no binary version of Select, but we can build one using Apply. For example, here is how to implement a binary Select-like operator which adds the corresponding numbers in two streams (the two streams must have the same length). First, we write the per-vertex transformation, which operates on IEnumerable inputs:

    public static IEnumerable<int> addeach(IEnumerable<int> left, IEnumerable<int> right)
    {
        IEnumerator<int> left_enu = left.GetEnumerator();
        IEnumerator<int> right_enu = right.GetEnumerator();
        while (true)
        {
            // advance both streams in lock-step
            bool more_left = left_enu.MoveNext();
            bool more_right = right_enu.MoveNext();
            if (more_left != more_right)
            {
                throw new Exception("Streams with different lengths");
            }
            if (!more_left) yield break; // both are finished
            int l = left_enu.Current;
            int r = right_enu.Current;
            int q = l + r;
            yield return q;
        }
    }

The addeach function is hopefully obvious: it iterates over two streams in parallel using two iterators (it uses the MoveNext() and Current stream operators rather than foreach). To create the IQueryable version of addition, it is enough to invoke addeach on the two inputs:

    public static IQueryable<int> Add(IQueryable<int> left, IQueryable<int> right)
    {
        return left.Apply<int, int, int>(right, (x,y) => addeach(x,y));
    }

(It is surprisingly harder to write a generic Select, which takes an arbitrary delegate; this is covered in Section 3.10, Advanced Topic: Higher-Order Query Operations.)
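For in-memory sequences, standard LINQ already offers a pairwise operator, Enumerable.Zip, which gives a convenient local reference for the values the pairwise addition should produce. This is a sketch for comparison only, with hypothetical names; note one behavioral difference from addeach: Zip stops silently at the end of the shorter sequence instead of throwing.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ZipSketch
{
    // Pairwise addition of two in-memory sequences using the standard Zip operator.
    public static IEnumerable<int> AddEachLocal(IEnumerable<int> left, IEnumerable<int> right)
    {
        return left.Zip(right, (l, r) => l + r);
    }

    public static void Main()
    {
        int[] a = { 1, 2, 3 };
        int[] b = { 10, 20, 30 };
        Console.WriteLine(string.Join(",", AddEachLocal(a, b))); // prints 11,22,33

        // Unlike addeach, Zip truncates to the shorter input rather than throwing.
        Console.WriteLine(AddEachLocal(a, new[] { 10 }).Count()); // prints 1
    }
}
```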
If we run this query, for example by supplying both inputs from a single partitioned file:

    Add(input, input).ShowOnConsole();

we will have an unpleasant surprise: the executed plan is quite inefficient.

Figure 9: Naive plan for the pairwise addition. (Each in[] vertex reads one partition and then broadcasts the data to two consumers using a Tee vertex. A Tee vertex stands for "broadcast": all outputs correspond to the same input. The red edges in the figure are implemented using FIFO channels; this means that the three Merge and Apply vertices run as separate threads in a single process on the same machine, and just pass pointers to objects to each other.)

The generated plan performs all the additions in a single vertex, labeled "Apply__0" in the figure. For this purpose it merges (by concatenating) the two entire input streams. It is obvious to us that the additions could be done in parallel in each partition, but since the addeach function claims it needs to see the entire input stream, DryadLINQ obliges and builds it before passing it to the Apply.

Fortunately, there is a way out: by adding an appropriate annotation to the addeach function, you can indicate that it can safely be applied to each partition independently:

    [Homomorphic]
    public static IEnumerable<int> addeach(IEnumerable<int> left, IEnumerable<int> right)

While homomorphic is a mouthful, it just means that the function operates correctly partitionwise, i.e., it distributes with respect to partition concatenation:

    concatenate(add(a,c), add(b,d)) = add(concatenate(a,b), concatenate(c,d))

With this annotation, the execution plan uses two separate vertices to perform the addition, one operating on each partition:

Figure 10: Using delegate annotations can improve plans.

3.7 Join

One of the most powerful operations provided by LINQ is join.
Its type signature is:

    public static IQueryable<TResult> Join<TOuter, TInner, TKey, TResult>(
        this IQueryable<TOuter> outer,
        IEnumerable<TInner> inner,
        Expression<Func<TOuter, TKey>> outerKeySelector,
        Expression<Func<TInner, TKey>> innerKeySelector,
        Expression<Func<TOuter, TInner, TResult>> resultSelector);

Join operates on two sub-queries and combines all elements that have the same key (where keys are extracted using two delegates, one for each query). For each pair of matching values, the result is obtained by applying a third delegate.

3.8 Statistics in DryadLINQ

In this section we take advantage of the power of LINQ to manipulate more complex C# data structures. We will tackle a statistics application: given a very large set of high-dimensional sparse vectors, compute their mean and standard deviation.

$$\mu = \frac{\sum_i v_i}{n}, \qquad \sigma = \sqrt{\frac{\sum_i (v_i - \mu)^2}{n}}$$

We won't delve into the implementation of the sparse vectors. Here we have used a Dictionary<int, double> to represent them, but any implementation with the following interface would do:

    [Serializable]
    public class SparseVector
    {
        public SparseVector();
        public SparseVector(string line); /* read from text file */
        public double this[uint index] { get; set; }
        [Associative]
        public SparseVector Add(SparseVector r);
        public SparseVector Subtract(SparseVector r);
        public SparseVector Square();  // elementwise square
        public SparseVector SqRoot();  // elementwise square root
        public SparseVector Divide(double scalar);
    }

The statistics program will ship around objects of type (collection of) SparseVector; this is a much more complex datatype than the strings that we have manipulated so far. It is also a datatype which is much harder to support in the context of a traditional database (showing that DryadLINQ's resemblance to a database query engine is deceiving).

We first tackle computing the average. This is done by summing up all the values and dividing the result by the count of values.
Both the sum and count are built-in aggregations. A naïve attempt to do this is not good enough:

    // this program is not good enough
    public static SparseVector ComputeStatistics(this IQueryable<SparseVector> v)
    {
        SparseVector sum = v.Aggregate((x, y) => x.Add(y));
        int count = v.Count();
        SparseVector average = sum.Divide((double)count);
        IQueryable<SparseVector> normalized = v.Select(x => (x.Subtract(average).Square()));
        SparseVector sum1 = normalized.Aggregate((x, y) => x.Add(y));
        // divide by the count, then take the elementwise square root
        SparseVector variance = sum1.Divide((double)count);
        return variance.SqRoot();
    }

This code does indeed compute the average and standard deviation of all the SparseVectors in v. However, average is a SparseVector, and not an IQueryable. This means that three queries will be executed: one to compute the sum, a second to compute the count, and a third to compute the standard deviation. In between, the count and average values are shipped back to the local C# program. In order to blend the computation into a single big query we have to make a few changes:

1) First, we have to use special DryadLINQ extensions for Aggregate and Count which return IQueryables and not values: AggregateAsQuery and CountAsQuery. These two operators return an IQueryable which will always contain a single element when evaluated.

2) The average computation becomes much more involved, since we can no longer perform simple arithmetic between the sum and count. We need to use Apply to manipulate them, as illustrated earlier in the section on Apply.
The Apply delegate needs to be spelled out:

    [Homomorphic]
    public static IEnumerable<SparseVector> Scale(IEnumerable<SparseVector> left, IEnumerable<int> right)
    // left and right should contain a single value
    {
        SparseVector l = left.Single();
        int coef = right.Single();
        yield return l.Divide((double)coef);
    }

    public static IQueryable<SparseVector> Average(this IQueryable<SparseVector> v, IQueryable<int> count)
    {
        IQueryable<SparseVector> sum = v.AggregateAsQuery<SparseVector>((x, y) => x.Add(y));
        IQueryable<SparseVector> average = sum.Apply(count, (x, y) => Scale(x, y));
        return average;
    }

The Single() method returns the unique element of a stream. We have factored out the count computation, since it will be reused.

3) The standard deviation involves a computation between a big stream (the input vector) and a singleton stream (the average, which is subtracted from each element of the vector). This can be done with another instance of Apply, using the following delegate argument:

    [Homomorphic(Left = true)]
    public static IEnumerable<SparseVector> stddev(IEnumerable<SparseVector> left, IEnumerable<SparseVector> average)
    // average is a single value, left is a vector
    {
        SparseVector avg = average.Single();
        foreach (SparseVector l in left)
        {
            SparseVector tmp = l.Subtract(avg);
            SparseVector tmp_sq = tmp.Square();
            yield return tmp_sq;
        }
    }

    public static IQueryable<SparseVector> StdDev(this IQueryable<SparseVector> v, IQueryable<SparseVector> average, IQueryable<int> count)
    {
        IQueryable<SparseVector> normalized = v.Apply(average, (x, y) => stddev(x, y));
        IQueryable<SparseVector> sum = normalized.AggregateAsQuery<SparseVector>((x, y) => x.Add(y));
        IQueryable<SparseVector> scaled = sum.Apply(count, (x, y) => Scale(x, y));
        // Select (not Apply) is used here: the delegate operates on one SparseVector at a time
        IQueryable<SparseVector> result = scaled.Select(x => x.SqRoot());
        return result;
    }

We know that the first Apply operation (the normalization) can be performed in parallel, but how do we express this fact?
In other words, stddev can handle all elements of the left input in parallel, but it needs to see the whole right input (which is a single-element stream anyway). This can be described using a special variant of the Homomorphic attribute for the stddev delegate:

    [Homomorphic(Left=true)]
    public static IEnumerable<SparseVector> stddev(IEnumerable<SparseVector> left, IEnumerable<SparseVector> average)

This means that the function is distributive in its left argument, but not in the right one.

4) Finally, we would like the computation to save both of the computed statistics to persistent storage: the average and the standard deviation. If we just extract the values from average and result, two queries will run, recomputing some results twice. What we need is a single query with multiple outputs. This is done by creating lazy tables for each interesting output. The lazy tables are computed and saved to persistent storage only when forced with a Materialize call:

    public static IQueryable<SparseVector> ReadVectors(string directory, string filename)
    {
        DryadDataContext ddc = new DryadDataContext("file://" + directory);
        DryadTable<LineRecord> table = ddc.GetPartitionedTable<LineRecord>(filename);
        return table.Select(s => new SparseVector(s.line));
    }

    public static void ComputeStatistics(this IQueryable<SparseVector> v)
    {
        IQueryable<int> count = v.CountAsQuery();
        IQueryable<SparseVector> average = v.Average(count);
        IQueryable<SparseVector> dev = v.StdDev(average, count);
        IQueryable<SparseVector> a = average.ToDryadTableLazy("average");
        IQueryable<SparseVector> d = dev.ToDryadTableLazy("stddev");
        DryadLINQQueryable.Materialize(a, d);
    }

This query will generate two tables when executed. The query plan for an input with two partitions looks pretty good:

Figure 11: Plan for the statistics computation.

3.9 Writing Custom Serializers

All the DryadLINQ programs so far only read text files, but they output binary files.
In this section we will answer three questions:

1) What format is the data written in?
2) How can I write data in a different format?
3) How can I read binary data?

3.9.1 What is the Default Output Format?

DryadLINQ writes binary data in the output tables, using the same binary serialization routines that are used to ship data between vertices. As a consequence, data written by a DryadLINQ query can be read by another query without any special preparation; the two programs just have to specify the same type for the table contents:

    DryadTable<T> output = result.ToDryadTable("histogram");
    …
    DryadTable<T> input = ddc.GetTable("histogram");

3.9.2 How Can I Change the Output Format and How Can I Read Binary Data?

We will answer both of these questions at once. If you define a type MyRecord, then you can control the way it is represented on the wire by endowing it with the following two methods:

    public struct MyRecord
    {
        // method bodies omitted
        public static MyRecord Read(DryadBinaryReader rd);
        public static int Write(DryadBinaryWriter wr, MyRecord rec);
    }

(You can use Visual Studio to discover the interfaces provided by DryadBinaryReader and DryadBinaryWriter.)

For example, there are two ways to have the SparseVectors from the statistics example written as text:

1) Add an extra Select computation stage before the output:

    result.Select(x => x.ToString()).ToDryadTable();

2) Add a Write method to the SparseVector class:

    public static int Write(DryadBinaryWriter wr, SparseVector vec)
    {
        string s = vec.ToString();
        return wr.Write(s);
    }

3.10 Advanced Topic: Higher-Order Query Operations

In this section we show how to build higher-order query operations by writing a generic Select operator on IQueryable objects which operates on two inputs at once, pairwise.
    // Build a closure from a function of three arguments,
    // the first of which is constant (cvv).
    public static Expression<Func<T1, T2, T3>> Closure_cvv<T0, T1, T2, T3>(
        Func<T0, T1, T2, T3> function,
        Expression firstArg) // type should be T0
    {
        ParameterExpression xparam = Expression.Parameter(typeof(T1), "xparam");
        ParameterExpression yparam = Expression.Parameter(typeof(T2), "yparam");
        Expression fun = Expression.Constant(function);
        Expression body = Expression.Invoke(fun, firstArg, xparam, yparam);
        Type resultType = typeof(Func<,,>).MakeGenericType(typeof(T1), typeof(T2), body.Type);
        LambdaExpression result = Expression.Lambda(resultType, body, xparam, yparam);
        return (Expression<Func<T1, T2, T3>>)result;
    }

    public static IQueryable<T3> Select<T1, T2, T3>(
        this IQueryable<T1> input0,
        IQueryable<T2> input1,
        Expression<Func<T1, T2, T3>> mapper)
    {
        // first create pairs of elements using Apply
        Expression<Func<T1, T2, Pair<T1, T2>>> makepairs =
            (x, y) => Pair<T1, T2>.MakePair(x, y);
        Expression<Func<IEnumerable<T1>, IEnumerable<T2>, IEnumerable<Pair<T1,T2>>>> pairmaker =
            Closure_cvv<Func<T1, T2, Pair<T1,T2>>,
                        IEnumerable<T1>,
                        IEnumerable<T2>,
                        IEnumerable<Pair<T1,T2>>>(Conversions.Pointwise, makepairs);

        // tag pairmaker as homomorphic
        HomomorphicAttribute h = new HomomorphicAttribute();
        AttributeSystem.Add(pairmaker, h);
        ResourceAttribute a = new ResourceAttribute();
        a.IsStateful = false;
        AttributeSystem.Add(pairmaker, a);
        IQueryable<Pair<T1, T2>> pairs = input0.Apply<T1, T2, Pair<T1, T2>>(input1, pairmaker);

        // second, run the (slightly modified) 'mapper' on the pairs
        ParameterExpression p12 =
            Expression.Parameter(typeof(Pair<T1, T2>), Conversions.GetFreshName());
        Expression p1 = Expression.Property(p12, "First");
        Expression p2 = Expression.Property(p12, "Second");
        Expression body = Expression.Invoke(mapper, p1, p2);
        Expression<Func<Pair<T1, T2>, T3>> pop =
            Expression.Lambda<Func<Pair<T1, T2>, T3>>(body, p12);
        IQueryable<T3> result = pairs.Select(pop);
        return result;
    }
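The second half of the Select operator above rewrites a two-argument lambda into a one-argument lambda over pairs. That step can be tried in isolation with the standard System.Linq.Expressions API. The sketch below makes two substitutions that are assumptions for illustration: it uses System.Tuple (with its Item1 and Item2 properties) in place of the tutorial's Pair class, and the names ExpressionSketch and OnPairs are hypothetical.

```csharp
using System;
using System.Linq.Expressions;

public static class ExpressionSketch
{
    // Turns (x, y) => f(x, y) into pair => f(pair.Item1, pair.Item2)
    // by invoking the original lambda on the two property accesses.
    public static Expression<Func<Tuple<T1, T2>, T3>> OnPairs<T1, T2, T3>(
        Expression<Func<T1, T2, T3>> mapper)
    {
        ParameterExpression pair = Expression.Parameter(typeof(Tuple<T1, T2>), "pair");
        Expression first = Expression.Property(pair, "Item1");
        Expression second = Expression.Property(pair, "Item2");
        Expression body = Expression.Invoke(mapper, first, second);
        return Expression.Lambda<Func<Tuple<T1, T2>, T3>>(body, pair);
    }

    public static void Main()
    {
        Expression<Func<int, int, int>> add = (x, y) => x + y;
        // Compile the rewritten expression tree into an ordinary delegate.
        Func<Tuple<int, int>, int> addPair = OnPairs(add).Compile();
        Console.WriteLine(addPair(Tuple.Create(3, 4))); // prints 7
    }
}
```

Building the tree with Expression.Invoke keeps the original mapper intact as a sub-expression, which is the same approach the tutorial code takes.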
4 Reference

4.1 DryadLINQ Operators

We can distinguish four main classes of operators:

1. Operators present in LINQ which are implemented by DryadLINQ. In the table below, they are marked with the keyword "LINQ".

2. Adaptations of operators present in LINQ which return scalar values (i.e., not IQueryable), but which are modified to return an IQueryable instead. For example, Count returns an integer, while CountAsQuery returns an IQueryable whose actual contents will be a single integer. The AsQuery variants can be chained together to produce complex queries, while using the scalar variants would require breaking queries into small sub-queries, which could decrease efficiency (see an example in Section 3.8, Statistics in DryadLINQ).

3. New operators, which exist only in DryadLINQ. We have added new operators which cannot be synthesized efficiently from compositions of primitive LINQ operators, and which can substantially improve the performance of queries in the context of a distributed execution environment like Dryad. These operators are accompanied by a brief description in the table.

4. Operations not yet implemented in DryadLINQ (there is only one).

Operator                Brief Description

Aggregate               LINQ
AggregateAsQuery        Result is query (not scalar).
All                     LINQ
AllAsQuery              Result is query (not scalar).
Any                     LINQ
AnyAsQuery              Result is query (not scalar).
Apply                   Applies a delegate to an entire input stream. See also Section 3.6, Apply. Two versions exist:

    public static IQueryable<T2> Apply<T1, T2>(
        this IQueryable<T1> source1,
        Expression<Func<IEnumerable<T1>, IEnumerable<T2>>> procFunc);

    public static IQueryable<T3> Apply<T1, T2, T3>(
        this IQueryable<T1> source1,
        IQueryable<T2> source2,
        Expression<Func<IEnumerable<T1>, IEnumerable<T2>, IEnumerable<T3>>> procFunc);

AssumeDistinct          Can only be applied to DryadTable<T> objects. Asserts that there are no two identical objects in the table (according to the comparison function).

    public void AssumeDistinct<T>(
        IEqualityComparer<T> comparer)

AssumeHashPartition     An assertion that hints to the compiler that an IQueryable is partitioned according to a specified hash function.

    public void AssumeHashPartition<TKey>(
        Expression<Func<T, TKey>> keySelector,
        IEqualityComparer<TKey> comparer)

AssumeOrderBy           An assertion that hints to the compiler that an IQueryable is ordered according to a specified key selector function.

    public void AssumeOrderBy<TKey>(
        Expression<Func<T, TKey>> keySelector,
        IComparer<TKey> comparer)

AssumeRangePartition    An assertion that hints to the compiler that an IQueryable is partitioned according to a specified set of buckets.

    public void AssumeRangePartition<TKey>(
        Expression<Func<T, TKey>> keySelector,
        TKey[] rangeKeys,
        IComparer<TKey> comparer)

Average                 LINQ
AverageAsQuery          Result is query (not scalar).
Concat                  Currently not implemented in DryadLINQ. Let us know if you need it.
Contains                LINQ
ContainsAsQuery         Result is query (not scalar).
Count                   LINQ
CountAsQuery            Result is query (not scalar).
Distinct                LINQ
Except                  LINQ
First                   LINQ
FirstAsQuery            Result is query (not scalar).
FirstOrDefault          LINQ
FirstOrDefaultAsQuery   Result is query (not scalar).
Fork                    Applies several transformations simultaneously to a single IQueryable. There are three versions of Fork:

    public static IMultiEnumerable<R1, R2, R3> Fork<TSource, R1, R2, R3>(
        this IEnumerable<TSource> source,
        Func<IEnumerable<TSource>, IEnumerable<ForkTuple<R1, R2, R3>>> mapper);

    public static IMultiEnumerable<R1, R2> Fork<TSource, R1, R2>(
        this IEnumerable<TSource> source,
        Func<IEnumerable<TSource>, IEnumerable<ForkTuple<R1, R2>>> mapper);

    public static IMultiEnumerable<R1, R2> Fork<TSource, R1, R2>(
        this IEnumerable<TSource> source,
        Func<TSource, ForkTuple<R1, R2>> mapper);

ForkChoose              LINQ
GroupBy                 LINQ
GroupJoin               LINQ
HashPartition           Partitions the input IQueryable using a specified hash function. There are four variants of this method:

    public static IEnumerable<TSource> HashPartition<TSource, TKey>(
        this IEnumerable<TSource> source,
        Func<TSource, TKey> keySelector);

    public static IEnumerable<TSource> HashPartition<TSource, TKey>(
        this IEnumerable<TSource> source,
        Func<TSource, TKey> keySelector,
        IEqualityComparer<TKey> comparer);

    public static IEnumerable<TSource> HashPartition<TSource, TKey>(
        this IEnumerable<TSource> source,
        Func<TSource, TKey> keySelector,
        int count);

    public static IEnumerable<TSource> HashPartition<TSource, TKey>(
        this IEnumerable<TSource> source,
        Func<TSource, TKey> keySelector,
        IEqualityComparer<TKey> comparer,
        int count);

Intersect               LINQ
Join                    LINQ
Last                    LINQ
LastAsQuery             Result is query (not scalar).
LastOrDefault           LINQ
LastOrDefaultAsQuery    Result is query (not scalar).
LongCount               LINQ
LongCountAsQuery        Result is query (not scalar).
Max                     LINQ
MaxAsQuery              Result is query (not scalar).
Merge                   LINQ
Min                     LINQ
MinAsQuery              Result is query (not scalar).
OfType                  LINQ
OrderBy                 LINQ
OrderByDescending       LINQ
RangePartition          Partitions the input IQueryable using a specified set of buckets.

    public static IEnumerable<TSource> RangePartition<TSource, TKey>(
        this IEnumerable<TSource> source,
        Func<TSource, TKey> keySelector,
        bool isDescending);

    public static IEnumerable<TSource> RangePartition<TSource, TKey>(
        this IEnumerable<TSource> source,
        Func<TSource, TKey> keySelector,
        TKey[] rangeKeys);

    public static IEnumerable<TSource> RangePartition<TSource, TKey>(
        this IEnumerable<TSource> source,
        Func<TSource, TKey> keySelector,
        IComparer<TKey> comparer,
        bool isDescending);

    public static IEnumerable<TSource> RangePartition<TSource, TKey>(
        this IEnumerable<TSource> source,
        Func<TSource, TKey> keySelector,
        TKey[] rangeKeys,
        IComparer<TKey> comparer);

Reverse                 LINQ
Select                  LINQ
SelectMany              LINQ
SequenceEqual           LINQ
SequenceEqualAsQuery    LINQ
Single                  LINQ
SingleAsQuery           LINQ
SingleOrDefault         LINQ
SingleOrDefaultAsQuery  Result is query (not scalar).
Skip                    LINQ
SkipWhile               LINQ
Sum                     LINQ
SumAsQuery              Result is query (not scalar).
Take                    LINQ
TakeWhile               LINQ
ThenBy                  LINQ
ThenByDescending        LINQ
Union                   LINQ
Where                   LINQ

4.2 DryadLINQ Annotations

Annotations indicate to the compiler some special behavior of delegates that are used as arguments for DryadLINQ operators. This enables the optimizer to generate better plans.

Annotation    Example                        Meaning

Associative   [Associative]                  Delegate is associative: it can be applied to the values in any grouping.
Homomorphic   [Homomorphic(Left=true)]       Delegate commutes with partitioning: it can be applied independently on partitions of a stream. If "Left" is specified, the (two-argument) delegate commutes with partitioning only in its left input.
Resource      [Resource(IsStateful=false)]   The delegate is not blocking, i.e., it can be pipelined with other operators.
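The algebraic properties that the Associative and Homomorphic annotations assert can be checked locally on small inputs. The sketch below is plain LINQ to Objects with hypothetical names; it does not use the DryadLINQ attributes themselves, it only tests on sample data that partition-wise evaluation agrees with evaluation over the concatenated stream. Arrays stand in for partitions, and corresponding partitions are assumed to be non-empty and of equal length.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class AnnotationSketch
{
    public static int Add(int x, int y) { return x + y; }

    // Associativity: aggregating each partition and combining the partial results
    // gives the same value as aggregating the concatenated stream.
    public static bool PartitionwiseAggregateAgrees(int[] p0, int[] p1)
    {
        int whole = p0.Concat(p1).Aggregate((x, y) => Add(x, y));
        int combined = Add(p0.Aggregate((x, y) => Add(x, y)),
                           p1.Aggregate((x, y) => Add(x, y)));
        return whole == combined;
    }

    // Local pairwise addition, standing in for the tutorial's addeach.
    public static IEnumerable<int> AddEach(IEnumerable<int> left, IEnumerable<int> right)
    {
        return left.Zip(right, (l, r) => Add(l, r));
    }

    // Homomorphism:
    // concatenate(add(a,c), add(b,d)) == add(concatenate(a,b), concatenate(c,d))
    public static bool PartitionwiseApplyAgrees(int[] a, int[] b, int[] c, int[] d)
    {
        IEnumerable<int> perPartition = AddEach(a, c).Concat(AddEach(b, d));
        IEnumerable<int> whole = AddEach(a.Concat(b), c.Concat(d));
        return perPartition.SequenceEqual(whole);
    }

    public static void Main()
    {
        Console.WriteLine(PartitionwiseAggregateAgrees(new[] { 1, 2 }, new[] { 3, 4 }));
        Console.WriteLine(PartitionwiseApplyAgrees(
            new[] { 1, 2 }, new[] { 3 }, new[] { 10, 20 }, new[] { 30 }));
    }
}
```

A check like this on sample data is only a sanity test; the annotations themselves are promises made by the programmer, so a delegate that fails such a test should not be annotated.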