An Introduction to DryadLINQ Christophe Poulain Microsoft Research Virtual School of Computational Science and Engineering Big Data For Science Course, July 28, 2010 The Fourth Paradigm: DataData-Intensive Science http://research.microsoft.com/fourthparadigm Scientific discovery is increasingly driven by exploration of large amounts of data from many sources. Scientific breakthrough will be powered by advanced computing capabilities that help researchers manipulate and explore massive datasets. 2 Data--intensive computing is increasingly prevalent Data Powered by powerful multi-core workstations, readily available commodity clusters and cloud services platforms 112 containers x 2000 servers = 224000 servers Programming data analyses that scale from desktop to a large number of compute nodes remains challenging 3 Dryad and DryadLINQ Research programming models for writing distributed data-parallel applications that scale from a small cluster to a large data-center. A DryadLINQ programmer can use thousands of machines, each of them with multiple processors or cores, without prior knowledge in parallel programming. 4 Availability Dryad/DryadLINQ on Windows HPC 2008 (SP1) is available as a free download from: http://research.microsoft.com/collaboration/tools/dryad.aspx – DryadLINQ (in source) & Dryad (in binary) – With tutorials, programming guides, sample codes, libraries, and a community site: http://connect.microsoft.com/dryad – Windows HPC Server licenses freely available through your department’s subscription to MSDN Academic Alliance 5 Outline • DryadLINQ programming model • Dryad and DryadLINQ overview • Applications DryadLINQ Dryad LINQ Experience Use a cluster as if it were a single computer • Sequential, single machine programming abstraction • Same program runs on single-core, multi-core, or cluster • Familiar programming languages • C#, VB, F#, IronPython… • Familiar development environment • .NET, Visual Studio or other IDE LINQ • Microsoft’s Language INtegrated Query – Released with .NET Framework 3.5, Visual Studio optional • A set of operators to manipulate datasets in .NET – Support traditional relational operators • Select, Join, GroupBy, Aggregate, etc. – Integrated into .NET programming languages • Programs can call operators • Operators can invoke arbitrary .NET functions • Data model – Data elements are strongly typed .NET objects – Much more expressive than SQL tables • Extremely extensible – Add new custom operators – Add new execution providers Example of a LINQ Query IEnumerable<string> logs = GetLogLines(); var logentries = Go through logs and keep only lines from line in logs that are not comments. Parse each where !line.StartsWith("#") line into a LogEntry object. select new LogEntry(line); var user = Go through logentries and keep from access in logentries where access.user.EndsWith(@"\ulfar") only entries that are accesses by select access; ulfar. var accesses = from access in user group access by access.page into pages select new UserPageCount("ulfar", pages.Key, pages.Count()); var htmAccesses = Group ulfar’s accesses according to from access in accesses what page they correspond to. For where access.page.EndsWith(".htm") each page, count the occurrences. orderby access.count descending select access; Sort the pages ulfar has accessed according to access frequency. 9 DryadLINQ Data Model Partition .Net objects PartitionedTable<T> PartitionedTable<T> implements IQueryable<T> and IEnumerable<T> PartitionedTable exposes metadata information: • type, partition, compression scheme, etc. 10 A complete DryadLINQ program public class LogEntry { public string user; public string ip; public string page; public LogEntry(string line) { string[] fields = line.Split(' '); this.user = fields[8]; this.ip = fields[9]; this.page = fields[5]; } } public class UserPageCount{ public string user; public string page; public int count; public UserPageCount( string user, string page, int count) { this.user = user; this.page = page; this.count = count; } } PartitionedTable<string> logs = PartitionedTable.Get<string>( @”file:\\MSR-SCR-DRYAD01\DryadData\cpoulain\logfile.pt” ); var logentries = from line in logs where !line.StartsWith("#") select new LogEntry(line); var user = from access in logentries where access.user.EndsWith(@"\ulfar") select access; var accesses = from access in user group access by access.page into pages select new UserPageCount("ulfar", pages.Key, pages.Count()); var htmAccesses = from access in accesses where access.page.EndsWith(".htm") orderby access.count descending select access; htmAccesses.ToPartitionedTable( @”file:\\MSR-SCR-DRYAD01\DryadData\cpoulain\results.pt” ); Demo Executing the log query DryadLINQ for Dryad on Windows Server 2008 HPC Cluster 12 MapReduce in DryadLINQ MapReduce(source, // sequence of Ts mapper, // T -> Ms keySelector, // M -> K reducer) // (K, Ms) -> Rs { var map = source.SelectMany(mapper); var group = map.GroupBy(keySelector); var result = group.SelectMany(reducer); return result; // sequence of Rs } 13 Outline • DryadLINQ programming model • Dryad and DryadLINQ overview • Applications Software Stack Machine Learning Image Processing Graph Analysis … Data Mining Applications Other Applications DryadLINQ Other Languages Dryad CIFS/NTFS SQL Servers Azure Storage Cosmos DFS Cluster Services (Azure, HPC, or Cosmos) Windows Server Windows Server Windows Server Windows Server 15 Dryad • Provides a general, flexible execution layer – Dataflow graph as the computation model – Higher language layer supplies graph, vertex code, channel types, hints for data locality, … • Automatically handles execution – Distributes code, routes data – Schedules processes on machines near data – Masks failures in cluster and network – Fair scheduling of concurrent jobs Dryad Job Structure Channels Input files Stage sort grep Output files awk sed perl sort grep awk sed grep Vertices (processes) sort Channel is a finite streams of items • NTFS files (temporary) • TCP pipes (inter-machine) • Memory FIFOs (intra-machine) Dryad System Architecture data plane job manager Job1 Files, TCP, FIFO V V V PD PD PD control plane New jobs Job1: v11, v12, … Job2: v21, v22, … Job3: … scheduler cluster Demo Fault tolerance DryadLINQ for Dryad on Windows Server 2008 HPC Cluster 19 Fault Tolerance Consider an embarrassingly parallel problem public static Pair<int, string> DoWork(int index) { System.Threading.Thread.Sleep(200); return new Pair<int, string>(index, System.Environment.MachineName); } public static void Main(string[] args) { int count = 50; var seeds = Enumerable.Range(1, count); var pairs = from seed in seeds select DoWork(seed); foreach (Pair<int, string> pair in pairs) { Console.WriteLine("{0} => {1}", pair.Key, pair.Value.ToString()); } } 21 An embarrassingly parallel problem Many cores, one machine with PLINQ public static Pair<int, string> DoWork(int index) { System.Threading.Thread.Sleep(200); return new Pair<int, string>(index, System.Environment.MachineName); } public static void Main(string[] args) { int count = 50; var seeds = Enumerable.Range(1, count); var pairs = from seed in seeds.AsParallel() select DoWork(seed); foreach (Pair<int, string> pair in pairs) { Console.WriteLine("{0} => {1}", pair.Key, pair.Value.ToString()); } } 22 An embarrassingly parallel problem Many cores, many machines with DryadLINQ (& PLINQ) public static Pair<int, string> DoWork(int index) { System.Threading.Thread.Sleep(2000); return new Pair<int, string>(index, System.Environment.MachineName); } public static void Main(string[] args) { int count = 50; var seeds = Enumerable.Range(1, count); int[] ranges = seeds.Take(count - 1).ToArray(); var pairs = from seed in seeds.ToPartitionedTable("tmp.pt").RangePartition(i => i, ranges) select DoWork(seed); foreach (Pair<int, string> pair in pairs) { Console.WriteLine("{0} => {1}", pair.Key, pair.Value.ToString()); } } 23 An embarrassingly parallel problem Simulating errors on the cluster Substitute DoWorkAndSimulateFailure for DoWork: private static Random RANDOM = new Random(); public static Pair<int, string> DoWorkAndSimulateFailure(int index) { if (RANDOM.NextDouble() < 0.1) { throw new Exception("My program failed."); } System.Threading.Thread.Sleep(200); return new Pair<int, string>(index, System.Environment.MachineName); } Will the program successfully finish when we run it? Let us see… 24 DryadLINQ: Friendly programming API for Dryad DryadLINQ leverages LINQ’s extensibility Scalability Local machine Query .Net program (C#, VB, F#, etc) Objects LINQ provider interface Execution engines DryadLINQ PLINQ Cluster Multi-core LINQ-to-SQL LINQ-to-XML Single-core DryadLINQ Provider Vertex code Collection<T> collection; bool IsLegal(Key k); string Hash(Key); var results = from c in collection where IsLegal(c.key) select new { Hash(c.key), c.value}; Query plan (Dryad job) Data collection C# C# C# C# results 26 Example: Word Count Count word frequency in a set of documents: var docs = new PartitionedTable<Doc>(“dfs://yuan/docs”); var words = docs.SelectMany(doc => doc.words); var groups = words.GroupBy(word => word); var counts = groups.Select(g => new WordCount(g.Key, g.Count())); counts.ToTable(“dfs://yuan/counts.txt”); IN metadata SM doc => doc.words GB word => word S g => new … OUT metadata Distributed Execution of Word Count LINQ expression IN SM GB S OUT DryadLINQ Dryad execution Execution Plan for Word Count SM Q SM GB GB SelectMany Sort GroupBy C Count D Distribute MS Mergesort GB GroupBy pipelined (1) S Sum pipelined Sum 29 Execution Plan for Word Count SM GB SM SM SM SM Q Q Q Q GB GB GB GB C C C C D D D D MS MS MS MS GB GB GB GB Sum Sum Sum Sum (1) S (2) 30 Distributed Execution Plan <ClusterName> Arden1 </ClusterName> <Resources> <Resource>C:\DryadLinqDrop\lib\retail\amd64\wrappernativeinfo.dll</Resource> <Resource>C:\DryadLinqDrop\lib\Release\LinqToDryad.dll</Resource> <Resource>C:\apps\WordCount\bin\Release\DryadLinq.dll</Resource> <Resource>C:\apps\WordCount\bin\Release\WordCount.exe</Resource> </Resources> <QueryPlan> … <Vertex> <UniqueId> 1 </UniqueId> <Partitions> 8 </Partitions> <ChannelType> DiskFile </ChannelType> <ConnectionOperator> CrossProduct </ConnectionOperator> <Entry> <AssemblyName> DryadLinq.dll </AssemblyName> <ClassName> LinqToDryad.DryadLinq_Vertex</ClassName> <MethodName> Super_2 </MethodName></Entry> <Children><Child><UniqueId> 0 </UniqueId></Child></Children> </Vertex> ... </QueryPlan> Vertex code (in DryadLinq.dll) The actual code executed at Dryad vertex public static int Super_2(string args) { DryadVertexEnv denv = new DryadVertexEnv(args); var dwriter_3 = denv.MakeWriter(DryadLinq_Extension.FactoryType_1); var dreader_4 = denv.MakeReader(DryadLinq_Extension.FactoryType_0); var source_5 = DryadLinqVertex.SelectMany(dreader_4, doc => doc.words); var source_6 = DryadLinqVertex.Sort(source_5, word => word); var source_7 = DryadLinqVertex.OrderedGroupBy(source_6, word => word); var source_8 = DryadLinqVertex.Select(g => new Pair<String, Int32>(g.Key, g.Count())); DryadLinqVertex.HashPartition(source_8, e => e.Key, dwriter_3); return 0; } DryadLINQ • Distributed execution plan generation – Static optimizations: pipelining, eager aggregation, etc. – Dynamic optimizations: data-dependent partitioning, dynamic aggregation, etc. • Vertex runtime – – – – – Single machine (multi-core) implementation of LINQ Vertex code that runs on vertices Data serialization code Callback code for runtime dynamic optimizations Automatically distributed to cluster machines DryadLINQ Job Browser Artemis 34 Simple and Scalable Distributed File System • A simple fault-tolerant, distributed file system that provides the abstractions necessary for data parallel computations on HPC clusters • High performance, reliable, scalable service • Prototypical workload • High throughput, sequential IO, write once • Cluster machines working in parallel • Configurable number of replicas per dataset http://research.microsoft.com/events/techfair2010/demos.aspx 35 Outline • DryadLINQ programming model • Dryad and DryadLINQ overview • Applications Dryad • • • • • Continuously deployed since 2006 The execution engine for Bing analytics Running on >> 104 machines Runs on clusters > 3000 machines Sifting through > 10Pb data daily 37 Microsoft Kinect: Kinect: Learning From Data Kinect (formerly Project Natal) is using DryadLINQ to train enormous decision trees from millions of images across hundreds of cores. Training examples Rasterize Motion Capture (ground truth) Machine learning Classifier Recognize players from depth map at frame rate using fraction of Xbox CPU. 38 Large--scale machine learning Large • > 1022 objects • Sparse, multi-dimensional data structures • Complex datatypes (images, video, matrices, etc.) • Complex application logic and dataflow – – – – – – >35000 lines of .Net 140 CPU days > 105 processes 30 TB data analyzed 140 avg parallelism (235 machines) 300% CPU utilization (4 cores/machine) 39 Highly efficient parallellization 40 SDSS Query Q18 Most time-consuming query from Sloan Digital Sky Survey database Find all objects within 1' of one another other that have very similar colors: that is with the color ratios u-g, g-r, r-I are less than 0.05m. (http://www.sdss.jhu.edu/SQL/SQLQueries.html) • Two tables 11.8GB and 41.8 GB • Under 2 minutes with 40 nodes • Hand-tuned Dryad implementation is faster (92s vs 113s with 40 nodes) • DryadLINQ code is 10x smaller See Dryad (Eurosys’07) and DryadLINQ (OSDI’08) papers. SDSS Query Q18 PartitionedTable<PhotoObjAll> photoObjAll = PartitionedTable.Get<PhotoObjAll>(@"file://\\<...>\ugriz-u9.pt"); PartitionedTable<Neighbor> neighbors = PartitionedTable.Get<Neighbor>(@"file://\\<...>\neighbor-u9.pt"); var j1 = from p in photoObjAll join n in neighbors on p.objId equals n.objId select new PhotoObjNeighbor(p, n); var w1 = from pn in j1 where pn.objId < pn.neighborObjId && pn.mode select pn; var j2 = from l in photoObjAll join pn in w1 on l.objId equals pn.neighborObjId select new PhotoObjNeighborAll(l, pn); var w2 = from lp in j2 where lp.l.mode && Math.Abs((lp.p.u-lp.p.g)-(lp.l.u-lp.l.g)) < 0.05 && Math.Abs((lp.p.g-lp.p.r)-(lp.l.g-lp.l.r)) < 0.05 && Math.Abs((lp.p.r-lp.p.i)-(lp.l.r-lp.l.i)) < 0.05 && Math.Abs((lp.p.i-lp.p.z)-(lp.l.i-lp.l.z)) < 0.05 select lp.p.objId; var q = w2.Distinct(); q.ToDryadPartitionedTable("result.pt"); Scalable clustering algorithm for NN-body simulations in a shared--nothing cluster shared YongChul Kwon, Dylan Nunley, Jeffrey P. Gardner, Magdalena Balazinska, Bill Howe, and Sarah Loebman. UW Tech Report. UW-CSE-09-06-01. June 2009. S43 S92 Ideal 1.00 OS43 OS92 0.80 Scale Up Speed Up 4 2 0.60 S43 S92 Ideal 0.40 0.20 0 0.00 0 2 4 Number of nodes 6 8 0 2 4 6 Number of nodes… (S=DryadLINQ; OS=OpenMP; 43=sparse; 92=dense) • Large-scale spatial clustering – 916M particles in 3D clustered under 70 minutes with 8 nodes. • Re-implemented using DryadLINQ – Partition > Local cluster > Merge cluster > Relabel – Faster development and good scalability – Must ensure near-constant processing time per tuple 8 Scalable clustering algorithm for NN-body simulations in a shared--nothing cluster shared 44 Terapixel Sky Image 1791 pairs of red-light and blue-light images acquired from two telescopes, scanned into 23,040x23,040 or 14000x14000 images. ~4TB of uncompressed data. Processed into 1791 RGB color images. Stitched into one terapixel spherical image. Image seams removed with optimization. Multi-scale resolution image available in WorldWide Telescope and Bing Map. 45 Computing Vignetting Corrections Creating Flat Fields Normalization Matrix Normalizing Corners DryadLINQ => concise code var var var var pixelRows = folders.SelectMany(image => ImageToRows(image, options)); stackedPixelRows = pixelRows.GroupBy(pixelRow => pixelRow.Position); finalRows = stackedPixelRows.Select(x => ReduceStackedRows(x)); flatField = finalRows.Apply(x => SaveFlatField(x, options)); DryadLINQ + Windows HPC => Efficient and robust execution – Elapsed time to process all flat fields: 8.7 hours – 28 8-core compute nodes => 1,950 CPU hours – Total input data: 417 GB compressed, 4 TB uncompressed. 46 CAP3 - DNA Sequence Assembly Program [1] EST (Expressed Sequence Tag) corresponds to messenger RNAs (mRNAs) transcribed from the genes residing on chromosomes. Each individual EST sequence represents a fragment of mRNA, and the EST assembly aims to re-construct full-length mRNA sequences for each expressed gene. Input files (FASTA) Cap3data.pf \DryadData\cap3\cap3data 10 0,344,CGB-K18-N01 1,344,CGB-K18-N01 … V V Cap3data.00000000 9,344,CGB-K18-N01 \\GCB-K18-N01\DryadData\cap3\cluster34442.fsa \\GCB-K18-N01\DryadData\cap3\cluster34443.fsa ... Output files \\GCB-K18-N01\DryadData\cap3\cluster34467.fsa Input files (FASTA) IQueryable<LineRecord> inputFiles=PartitionedTable.Get<LineRecord>(uri); IQueryable<OutputInfo> = inputFiles.Select(x=>ExecuteCAP3(x.line)); [1] X. Huang, A. Madan, “CAP3: A DNA Sequence Assembly Program,” Genome Research, vol. 9, no. 9, 1999. CAP3 - Performance “DryadLINQ for Scientific Analyses”, Jaliya Ekanayake, Thilina Gunarathnea, Geoffrey Fox, Atilla Soner Balkir, Christophe Poulain, Nelson Araujo, Roger Barga (IEEE eScience ‘09) High Energy Physics Data Analysis • • • • Histogramming of events from a large (up to 1TB) data set Data analysis requires ROOT framework (ROOT Interpreted Scripts) Performance depends on disk access speeds Hadoop implementation uses a shared parallel file system (Lustre) – ROOT scripts cannot access data from HDFS – On demand data movement has significant overhead • Dryad stores data in local disks giving better performance over Hadoop Pairwise Distances – ALU Sequencing 125 million distances 4 hours & 46 minutes • • • • • • Calculate pairwise distances for a collection of genes (used for clustering, MDS) O(N^2) effect Fine grained tasks in MPI Coarse grained tasks in DryadLINQ Performance close to MPI Performed on 768 cores (Tempest Cluster) 20000 18000 DryadLINQ 16000 MPI 14000 12000 10000 8000 6000 4000 2000 0 35339 50000 Xiaohong Qiu, Jaliya Ekanayake, Scott Beason, Thilina Gunarathne, Geoffrey Fox, Roger Barga, Dennis Gannon Cloud Technologies for Bioinformatics Applications (SuperComputing09) Acknowledgements MSR Silicon Valley Dryad & DryadLINQ teams Andrew Birrell, Mihai Budiu, Jon Currey, Ulfar Erlingsson, Dennis Fetterly, Michael Isard, Pradeep Kunda, Mark Manasse, Chandu Thekkath and Yuan Yu . http://research.microsoft.com/en-us/projects/dryad http://research.microsoft.com/en-us/projects/dryadlinq MSR External Research Advanced Research Tools and Services Team http://research.microsoft.com/en-us/collaboration/tools/dryad.aspx MS Product Groups: HPC, Parallel Computing Platform. Academic Collaborators Jaliya Ekanayake, Geoffrey Fox, Thilina Gunarathne, Scott Beason, Xiaohong Qiu (Indiana University Bloomington). YongChul Kwon, Magdalena Balazinska (University of Washington). Atilla Soner Balkir, Ian Foster (University of Chicago). Dryad/DryadLINQ Papers 1. Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks (EuroSys’07) 2. DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language (OSDI’08) 3. Hunting for prolems with Artemis (Usenix WASL, San Diego 08) 4. Distributed Data-Parallel Computing Using a High-Level Programming Language (SIGMOD’09) 5. Quincy: Fair scheduling for distributed computing clusters (SOSP’09) 6. Distributed Aggregation for Data-Parallel Computing: Interfaces and Implementations (SOSP’09) 7. DryadInc: Reusing work in large scale computation (HotCloud 09). Conclusion DryadLINQ provides a powerful, elegant programming environment for large-scale data-parallel computing. Still an area of active research… …download it and get involved! http://connect.microsoft.com/dryad 54 Dryad & DryadLINQ in Context Application SQL Sawzall ≈SQL Sawzall Pig, Hive DryadLINQ Scope MapReduce Hadoop Dryad GFS BigTable HDFS S3 NTFS Azure SQL Server Language Execution Storage Parallel Databases LINQ, SQL SCOPE Postgres, Radical Better languages simplicity = Oracle, custom DB2, queryTeradata, languageetc. for Search, uses Dryad to Powerfulon Restricted Layered execute, cannot execution execution MapReduce leverage engines, engine execution LINQ, and restricted .NET language engine, and languages Visual lesser Studio performance ecosystem.