An Introduction to DryadLINQ Christophe Poulain Microsoft Research

advertisement
An Introduction to DryadLINQ
Christophe Poulain
Microsoft Research
Virtual School of Computational Science and Engineering
Big Data For Science Course, July 28, 2010
The Fourth Paradigm: DataData-Intensive Science
http://research.microsoft.com/fourthparadigm
Scientific discovery is increasingly driven by exploration of
large amounts of data from many sources.
Scientific breakthrough will be powered by advanced
computing capabilities that help researchers manipulate
and explore massive datasets.
2
Data--intensive computing is increasingly prevalent
Data
Powered by powerful multi-core workstations, readily
available commodity clusters and cloud services platforms
112 containers x 2000 servers = 224000 servers
Programming data analyses that scale from desktop to a
large number of compute nodes remains challenging
3
Dryad and DryadLINQ
Research programming models for writing distributed data-parallel
applications that scale from a small cluster to a large data-center.
A DryadLINQ programmer can use thousands of machines, each of
them with multiple processors or cores, without prior knowledge in
parallel programming.
4
Availability
Dryad/DryadLINQ on Windows HPC 2008 (SP1) is available
as a free download from:
http://research.microsoft.com/collaboration/tools/dryad.aspx
– DryadLINQ (in source) & Dryad (in binary)
– With tutorials, programming guides, sample codes, libraries,
and a community site: http://connect.microsoft.com/dryad
– Windows HPC Server licenses freely available through your
department’s subscription to MSDN Academic Alliance
5
Outline
• DryadLINQ programming model
• Dryad and DryadLINQ overview
• Applications
DryadLINQ
Dryad
LINQ Experience
Use a cluster as if it were a single computer
• Sequential, single machine programming abstraction
• Same program runs on single-core, multi-core, or cluster
• Familiar programming languages
• C#, VB, F#, IronPython…
• Familiar development environment
• .NET, Visual Studio or other IDE
LINQ
• Microsoft’s Language INtegrated Query
– Released with .NET Framework 3.5, Visual Studio optional
• A set of operators to manipulate datasets in .NET
– Support traditional relational operators
• Select, Join, GroupBy, Aggregate, etc.
– Integrated into .NET programming languages
• Programs can call operators
• Operators can invoke arbitrary .NET functions
• Data model
– Data elements are strongly typed .NET objects
– Much more expressive than SQL tables
• Extremely extensible
– Add new custom operators
– Add new execution providers
Example of a LINQ Query
IEnumerable<string> logs = GetLogLines();
var logentries =
Go through logs and keep only lines
from line in logs
that are not comments. Parse each
where !line.StartsWith("#")
line into a LogEntry object.
select new LogEntry(line);
var user =
Go through logentries and keep
from access in logentries
where access.user.EndsWith(@"\ulfar")
only entries that are accesses by
select access;
ulfar.
var accesses =
from access in user
group access by access.page into pages
select new UserPageCount("ulfar", pages.Key, pages.Count());
var htmAccesses =
Group ulfar’s accesses according to
from access in accesses
what page they correspond to. For
where access.page.EndsWith(".htm")
each page, count the occurrences.
orderby access.count descending
select access;
Sort the pages ulfar has accessed
according to access frequency.
9
DryadLINQ Data Model
Partition
.Net objects
PartitionedTable<T>
PartitionedTable<T> implements IQueryable<T> and IEnumerable<T>
PartitionedTable exposes metadata information:
• type, partition, compression scheme, etc.
10
A complete DryadLINQ program
public class LogEntry {
public string user;
public string ip;
public string page;
public LogEntry(string line) {
string[] fields = line.Split(' ');
this.user = fields[8];
this.ip = fields[9];
this.page = fields[5];
}
}
public class UserPageCount{
public string user;
public string page;
public int count;
public UserPageCount(
string user, string page, int
count) {
this.user = user;
this.page = page;
this.count = count;
}
}
PartitionedTable<string> logs = PartitionedTable.Get<string>(
@”file:\\MSR-SCR-DRYAD01\DryadData\cpoulain\logfile.pt”
);
var logentries =
from line in logs
where !line.StartsWith("#")
select new LogEntry(line);
var user =
from access in logentries
where access.user.EndsWith(@"\ulfar")
select access;
var accesses =
from access in user
group access by access.page into pages
select new UserPageCount("ulfar", pages.Key, pages.Count());
var htmAccesses =
from access in accesses
where access.page.EndsWith(".htm")
orderby access.count descending
select access;
htmAccesses.ToPartitionedTable(
@”file:\\MSR-SCR-DRYAD01\DryadData\cpoulain\results.pt”
);
Demo
Executing the log query
DryadLINQ for Dryad on Windows Server 2008 HPC Cluster
12
MapReduce in DryadLINQ
MapReduce(source,
// sequence of Ts
mapper,
// T -> Ms
keySelector, // M -> K
reducer)
// (K, Ms) -> Rs
{
var map = source.SelectMany(mapper);
var group = map.GroupBy(keySelector);
var result = group.SelectMany(reducer);
return result; // sequence of Rs
}
13
Outline
• DryadLINQ programming model
• Dryad and DryadLINQ overview
• Applications
Software Stack
Machine
Learning
Image
Processing
Graph
Analysis
…
Data
Mining
Applications
Other Applications
DryadLINQ
Other Languages
Dryad
CIFS/NTFS
SQL Servers
Azure Storage
Cosmos DFS
Cluster Services (Azure, HPC, or Cosmos)
Windows
Server
Windows
Server
Windows
Server
Windows
Server
15
Dryad
• Provides a general, flexible execution layer
– Dataflow graph as the computation model
– Higher language layer supplies graph, vertex code,
channel types, hints for data locality, …
• Automatically handles execution
– Distributes code, routes data
– Schedules processes on machines near data
– Masks failures in cluster and network
– Fair scheduling of concurrent jobs
Dryad Job Structure
Channels
Input
files
Stage
sort
grep
Output
files
awk
sed
perl
sort
grep
awk
sed
grep
Vertices
(processes)
sort
Channel is a finite streams of items
• NTFS files (temporary)
• TCP pipes (inter-machine)
• Memory FIFOs (intra-machine)
Dryad System Architecture
data plane
job manager
Job1
Files, TCP, FIFO
V
V
V
PD
PD
PD
control plane
New jobs
Job1: v11, v12, …
Job2: v21, v22, …
Job3: …
scheduler
cluster
Demo
Fault tolerance
DryadLINQ for Dryad on Windows Server 2008 HPC Cluster
19
Fault Tolerance
Consider an embarrassingly parallel problem
public static Pair<int, string> DoWork(int index)
{
System.Threading.Thread.Sleep(200);
return new Pair<int, string>(index, System.Environment.MachineName);
}
public static void Main(string[] args)
{
int count = 50;
var seeds = Enumerable.Range(1, count);
var pairs = from seed in seeds select DoWork(seed);
foreach (Pair<int, string> pair in pairs)
{
Console.WriteLine("{0} => {1}", pair.Key, pair.Value.ToString());
}
}
21
An embarrassingly parallel problem
Many cores, one machine with PLINQ
public static Pair<int, string> DoWork(int index)
{
System.Threading.Thread.Sleep(200);
return new Pair<int, string>(index, System.Environment.MachineName);
}
public static void Main(string[] args)
{
int count = 50;
var seeds = Enumerable.Range(1, count);
var pairs = from seed in seeds.AsParallel() select DoWork(seed);
foreach (Pair<int, string> pair in pairs)
{
Console.WriteLine("{0} => {1}", pair.Key, pair.Value.ToString());
}
}
22
An embarrassingly parallel problem
Many cores, many machines with DryadLINQ (& PLINQ)
public static Pair<int, string> DoWork(int index)
{
System.Threading.Thread.Sleep(2000);
return new Pair<int, string>(index, System.Environment.MachineName);
}
public static void Main(string[] args)
{
int count = 50;
var seeds = Enumerable.Range(1, count);
int[] ranges = seeds.Take(count - 1).ToArray();
var pairs = from seed
in seeds.ToPartitionedTable("tmp.pt").RangePartition(i => i, ranges)
select DoWork(seed);
foreach (Pair<int, string> pair in pairs)
{
Console.WriteLine("{0} => {1}", pair.Key, pair.Value.ToString());
}
}
23
An embarrassingly parallel problem
Simulating errors on the cluster
Substitute DoWorkAndSimulateFailure for DoWork:
private static Random RANDOM = new Random();
public static Pair<int, string> DoWorkAndSimulateFailure(int index)
{
if (RANDOM.NextDouble() < 0.1)
{
throw new Exception("My program failed.");
}
System.Threading.Thread.Sleep(200);
return new Pair<int, string>(index, System.Environment.MachineName);
}
Will the program successfully finish when we run it? Let us see…
24
DryadLINQ: Friendly programming API for Dryad
DryadLINQ leverages LINQ’s extensibility
Scalability
Local machine
Query
.Net
program
(C#, VB,
F#, etc)
Objects
LINQ provider interface
Execution engines
DryadLINQ
PLINQ
Cluster
Multi-core
LINQ-to-SQL
LINQ-to-XML
Single-core
DryadLINQ Provider
Vertex
code
Collection<T> collection;
bool IsLegal(Key k);
string Hash(Key);
var results = from c in collection
where IsLegal(c.key)
select new { Hash(c.key), c.value};
Query
plan
(Dryad job)
Data
collection
C#
C#
C#
C#
results
26
Example: Word Count
Count word frequency in a set of documents:
var docs = new PartitionedTable<Doc>(“dfs://yuan/docs”);
var words = docs.SelectMany(doc => doc.words);
var groups = words.GroupBy(word => word);
var counts = groups.Select(g => new WordCount(g.Key, g.Count()));
counts.ToTable(“dfs://yuan/counts.txt”);
IN
metadata
SM
doc =>
doc.words
GB
word =>
word
S
g =>
new …
OUT
metadata
Distributed Execution of Word Count
LINQ expression
IN
SM
GB
S
OUT
DryadLINQ
Dryad execution
Execution Plan for Word Count
SM
Q
SM
GB
GB
SelectMany
Sort
GroupBy
C
Count
D
Distribute
MS
Mergesort
GB
GroupBy
pipelined
(1)
S
Sum
pipelined
Sum
29
Execution Plan for Word Count
SM
GB
SM
SM
SM
SM
Q
Q
Q
Q
GB
GB
GB
GB
C
C
C
C
D
D
D
D
MS
MS
MS
MS
GB
GB
GB
GB
Sum
Sum
Sum
Sum
(1)
S
(2)
30
Distributed Execution Plan
<ClusterName> Arden1 </ClusterName>
<Resources>
<Resource>C:\DryadLinqDrop\lib\retail\amd64\wrappernativeinfo.dll</Resource>
<Resource>C:\DryadLinqDrop\lib\Release\LinqToDryad.dll</Resource>
<Resource>C:\apps\WordCount\bin\Release\DryadLinq.dll</Resource>
<Resource>C:\apps\WordCount\bin\Release\WordCount.exe</Resource>
</Resources>
<QueryPlan>
…
<Vertex>
<UniqueId> 1 </UniqueId>
<Partitions> 8 </Partitions>
<ChannelType> DiskFile </ChannelType>
<ConnectionOperator> CrossProduct </ConnectionOperator>
<Entry> <AssemblyName> DryadLinq.dll </AssemblyName>
<ClassName> LinqToDryad.DryadLinq_Vertex</ClassName>
<MethodName> Super_2 </MethodName></Entry>
<Children><Child><UniqueId> 0 </UniqueId></Child></Children>
</Vertex>
...
</QueryPlan>
Vertex code (in DryadLinq.dll)
The actual code executed at Dryad vertex
public static int Super_2(string args) {
DryadVertexEnv denv = new DryadVertexEnv(args);
var dwriter_3 = denv.MakeWriter(DryadLinq_Extension.FactoryType_1);
var dreader_4 = denv.MakeReader(DryadLinq_Extension.FactoryType_0);
var source_5 = DryadLinqVertex.SelectMany(dreader_4, doc => doc.words);
var source_6 = DryadLinqVertex.Sort(source_5, word => word);
var source_7 = DryadLinqVertex.OrderedGroupBy(source_6, word => word);
var source_8 = DryadLinqVertex.Select(g => new Pair<String, Int32>(g.Key,
g.Count()));
DryadLinqVertex.HashPartition(source_8, e => e.Key, dwriter_3);
return 0;
}
DryadLINQ
• Distributed execution plan generation
– Static optimizations: pipelining, eager aggregation, etc.
– Dynamic optimizations: data-dependent partitioning,
dynamic aggregation, etc.
• Vertex runtime
–
–
–
–
–
Single machine (multi-core) implementation of LINQ
Vertex code that runs on vertices
Data serialization code
Callback code for runtime dynamic optimizations
Automatically distributed to cluster machines
DryadLINQ Job Browser
Artemis
34
Simple and Scalable Distributed File System
• A simple fault-tolerant, distributed file system that
provides the abstractions necessary for data parallel
computations on HPC clusters
• High performance, reliable, scalable service
• Prototypical workload
• High throughput, sequential IO, write once
• Cluster machines working in parallel
• Configurable number of replicas per dataset
http://research.microsoft.com/events/techfair2010/demos.aspx
35
Outline
• DryadLINQ programming model
• Dryad and DryadLINQ overview
• Applications
Dryad
•
•
•
•
•
Continuously deployed since 2006
The execution engine for Bing analytics
Running on >> 104 machines
Runs on clusters > 3000 machines
Sifting through > 10Pb data daily
37
Microsoft Kinect:
Kinect: Learning From Data
Kinect (formerly Project Natal) is using
DryadLINQ to train enormous decision
trees from millions of images across
hundreds of cores.
Training examples
Rasterize
Motion Capture
(ground truth)
Machine
learning
Classifier
Recognize players from depth map at frame rate using fraction of Xbox CPU.
38
Large--scale machine learning
Large
• > 1022 objects
• Sparse, multi-dimensional data structures
• Complex datatypes
(images, video, matrices, etc.)
• Complex application logic and dataflow
–
–
–
–
–
–
>35000 lines of .Net
140 CPU days
> 105 processes
30 TB data analyzed
140 avg parallelism (235 machines)
300% CPU utilization (4 cores/machine)
39
Highly efficient parallellization
40
SDSS Query Q18
Most time-consuming query from Sloan Digital Sky Survey database
Find all objects within 1' of one another other that have very similar
colors: that is with the color ratios u-g, g-r, r-I are less than 0.05m.
(http://www.sdss.jhu.edu/SQL/SQLQueries.html)
• Two tables 11.8GB and 41.8 GB
• Under 2 minutes with 40 nodes
• Hand-tuned Dryad
implementation is faster
(92s vs 113s with 40 nodes)
• DryadLINQ code is 10x smaller
See Dryad (Eurosys’07) and DryadLINQ (OSDI’08) papers.
SDSS Query Q18
PartitionedTable<PhotoObjAll> photoObjAll =
PartitionedTable.Get<PhotoObjAll>(@"file://\\<...>\ugriz-u9.pt");
PartitionedTable<Neighbor> neighbors =
PartitionedTable.Get<Neighbor>(@"file://\\<...>\neighbor-u9.pt");
var j1 = from p in photoObjAll
join n in neighbors on p.objId equals n.objId
select new PhotoObjNeighbor(p, n);
var w1 = from pn in j1
where pn.objId < pn.neighborObjId && pn.mode
select pn;
var j2 = from l in photoObjAll
join pn in w1 on l.objId equals pn.neighborObjId
select new PhotoObjNeighborAll(l, pn);
var w2 = from lp in j2 where lp.l.mode
&& Math.Abs((lp.p.u-lp.p.g)-(lp.l.u-lp.l.g)) < 0.05
&& Math.Abs((lp.p.g-lp.p.r)-(lp.l.g-lp.l.r)) < 0.05
&& Math.Abs((lp.p.r-lp.p.i)-(lp.l.r-lp.l.i)) < 0.05
&& Math.Abs((lp.p.i-lp.p.z)-(lp.l.i-lp.l.z)) < 0.05
select lp.p.objId;
var q = w2.Distinct();
q.ToDryadPartitionedTable("result.pt");
Scalable clustering algorithm for NN-body simulations in a
shared--nothing cluster
shared
YongChul Kwon, Dylan Nunley, Jeffrey P. Gardner, Magdalena Balazinska, Bill Howe,
and Sarah Loebman. UW Tech Report. UW-CSE-09-06-01. June 2009.
S43
S92
Ideal
1.00
OS43
OS92
0.80
Scale Up
Speed Up
4
2
0.60
S43
S92
Ideal
0.40
0.20
0
0.00
0
2
4
Number of nodes
6
8
0
2
4
6
Number of nodes…
(S=DryadLINQ; OS=OpenMP; 43=sparse; 92=dense)
• Large-scale spatial clustering
– 916M particles in 3D clustered under 70 minutes with 8 nodes.
• Re-implemented using DryadLINQ
– Partition > Local cluster > Merge cluster > Relabel
– Faster development and good scalability
– Must ensure near-constant processing time per tuple
8
Scalable clustering algorithm for NN-body simulations in a
shared--nothing cluster
shared
44
Terapixel Sky Image
1791 pairs of red-light and blue-light images
acquired from two telescopes, scanned into
23,040x23,040 or 14000x14000 images.
~4TB of uncompressed data.
Processed into 1791 RGB color images.
Stitched into one terapixel spherical image.
Image seams removed with optimization.
Multi-scale resolution image available in
WorldWide Telescope and Bing Map.
45
Computing Vignetting Corrections
Creating Flat Fields
Normalization Matrix
Normalizing Corners
DryadLINQ => concise code
var
var
var
var
pixelRows = folders.SelectMany(image => ImageToRows(image, options));
stackedPixelRows = pixelRows.GroupBy(pixelRow => pixelRow.Position);
finalRows = stackedPixelRows.Select(x => ReduceStackedRows(x));
flatField = finalRows.Apply(x => SaveFlatField(x, options));
DryadLINQ + Windows HPC => Efficient and robust execution
– Elapsed time to process all flat fields: 8.7 hours
– 28 8-core compute nodes => 1,950 CPU hours
– Total input data: 417 GB compressed, 4 TB uncompressed.
46
CAP3 - DNA Sequence Assembly Program [1]
EST (Expressed Sequence Tag) corresponds to messenger RNAs (mRNAs) transcribed from the genes
residing on chromosomes. Each individual EST sequence represents a fragment of mRNA, and the
EST assembly aims to re-construct full-length mRNA sequences for each expressed gene.
Input files (FASTA)
Cap3data.pf
\DryadData\cap3\cap3data
10
0,344,CGB-K18-N01
1,344,CGB-K18-N01
…
V
V
Cap3data.00000000
9,344,CGB-K18-N01
\\GCB-K18-N01\DryadData\cap3\cluster34442.fsa
\\GCB-K18-N01\DryadData\cap3\cluster34443.fsa
...
Output files
\\GCB-K18-N01\DryadData\cap3\cluster34467.fsa
Input files
(FASTA)
IQueryable<LineRecord> inputFiles=PartitionedTable.Get<LineRecord>(uri);
IQueryable<OutputInfo> = inputFiles.Select(x=>ExecuteCAP3(x.line));
[1] X. Huang, A. Madan, “CAP3: A DNA Sequence Assembly Program,” Genome Research, vol. 9, no. 9, 1999.
CAP3 - Performance
“DryadLINQ for Scientific Analyses”, Jaliya Ekanayake, Thilina Gunarathnea, Geoffrey Fox,
Atilla Soner Balkir, Christophe Poulain, Nelson Araujo, Roger Barga (IEEE eScience ‘09)
High Energy Physics Data Analysis
•
•
•
•
Histogramming of events from a large (up to 1TB) data set
Data analysis requires ROOT framework (ROOT Interpreted Scripts)
Performance depends on disk access speeds
Hadoop implementation uses a shared parallel file system (Lustre)
– ROOT scripts cannot access data from HDFS
– On demand data movement has significant overhead
• Dryad stores data in local disks giving better performance over Hadoop
Pairwise Distances – ALU Sequencing
125 million distances
4 hours & 46
minutes
•
•
•
•
•
•
Calculate pairwise distances for a collection of
genes (used for clustering, MDS)
O(N^2) effect
Fine grained tasks in MPI
Coarse grained tasks in DryadLINQ
Performance close to MPI
Performed on 768 cores (Tempest Cluster)
20000
18000
DryadLINQ
16000
MPI
14000
12000
10000
8000
6000
4000
2000
0
35339
50000
Xiaohong Qiu, Jaliya Ekanayake, Scott Beason, Thilina Gunarathne, Geoffrey Fox, Roger
Barga, Dennis Gannon Cloud Technologies for Bioinformatics Applications (SuperComputing09)
Acknowledgements
MSR Silicon Valley Dryad & DryadLINQ teams
Andrew Birrell, Mihai Budiu, Jon Currey, Ulfar Erlingsson, Dennis Fetterly, Michael
Isard, Pradeep Kunda, Mark Manasse, Chandu Thekkath and Yuan Yu .
http://research.microsoft.com/en-us/projects/dryad
http://research.microsoft.com/en-us/projects/dryadlinq
MSR External Research
Advanced Research Tools and Services Team
http://research.microsoft.com/en-us/collaboration/tools/dryad.aspx
MS Product Groups: HPC, Parallel Computing Platform.
Academic Collaborators
Jaliya Ekanayake, Geoffrey Fox, Thilina Gunarathne, Scott Beason, Xiaohong Qiu
(Indiana University Bloomington).
YongChul Kwon, Magdalena Balazinska (University of Washington).
Atilla Soner Balkir, Ian Foster (University of Chicago).
Dryad/DryadLINQ Papers
1. Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks
(EuroSys’07)
2. DryadLINQ: A System for General-Purpose Distributed Data-Parallel
Computing Using a High-Level Language (OSDI’08)
3. Hunting for prolems with Artemis (Usenix WASL, San Diego 08)
4. Distributed Data-Parallel Computing Using a High-Level Programming
Language (SIGMOD’09)
5. Quincy: Fair scheduling for distributed computing clusters (SOSP’09)
6. Distributed Aggregation for Data-Parallel Computing: Interfaces and
Implementations (SOSP’09)
7. DryadInc: Reusing work in large scale computation (HotCloud 09).
Conclusion
DryadLINQ provides a powerful, elegant
programming environment for large-scale
data-parallel computing.
Still an area of active research…
…download it and get involved!
http://connect.microsoft.com/dryad
54
Dryad & DryadLINQ in Context
Application
SQL
Sawzall
≈SQL
Sawzall
Pig, Hive
DryadLINQ
Scope
MapReduce
Hadoop
Dryad
GFS
BigTable
HDFS
S3
NTFS
Azure
SQL Server
Language
Execution
Storage
Parallel
Databases
LINQ, SQL
SCOPE
Postgres,
Radical
Better
languages
simplicity
= Oracle,
custom DB2,
queryTeradata,
languageetc.
for Search, uses Dryad to
Powerfulon
Restricted
Layered
execute,
cannot
execution
execution
MapReduce
leverage
engines,
engine
execution
LINQ,
and
restricted
.NET
language
engine,
and
languages
Visual
lesser Studio
performance
ecosystem.
Download