High Productivity Computing: Taking HPC Mainstream
Lee Grant, Technical Solutions Professional, High Performance Computing
leegrant@microsoft.com

Challenge: High Productivity Computing
"Make high-end computing easier and more productive to use. Emphasis should be placed on time to solution, the major metric of value to high-end computing users… A common software environment for scientific computation encompassing desktop to high-end systems will enhance productivity gains by promoting ease of use and manageability of systems."
— 2004 High-End Computing Revitalization Task Force, Office of Science and Technology Policy, Executive Office of the President

The Data Pipeline (x64 server)
• Data gathering: "raw" data includes sensor output, data downloaded from agency or collaboration web sites, and papers (especially for ancillary data).
• Discovery and browsing: "raw" data browsing for discovery (do I have enough data in the right places?), cleaning (does the data look obviously wrong?), and lightweight science via browsing.
• Science exploration: "science variables" and data summaries for early science exploration and hypothesis testing. Similar to discovery and browsing, but with science variables computed via gap filling, unit conversions, or simple equations.
• Domain-specific analyses: "science variables" combined with models, other specialized code, or statistics for deep science understanding.
• Scientific output: results via packages such as MATLAB or R; special rendering packages such as ArcGIS; paper preparation.

The Free Lunch Is Over for Traditional Software
[Chart: operations per second for serial code. The once-projected single-core scaling (3 GHz → 6 GHz → 12 GHz → 24 GHz) has been replaced by multi-core parts (3 GHz × 2, 4, 8 cores); the additional operations per second are available only to code that can take advantage of concurrency.]
No free lunch for traditional software: without highly concurrent software, it won't get any faster!
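The "no free lunch" point can be made quantitative with Amdahl's law: if only a fraction p of a program can run concurrently, n cores deliver at most 1/((1-p) + p/n) speedup. A minimal sketch (illustrative, not from the original deck):

```python
# Amdahl's law: upper bound on speedup when a fraction p of the work
# is parallelizable and the run uses n cores.
def speedup(p: float, n: int) -> float:
    return 1.0 / ((1.0 - p) + p / n)

# Purely serial code gains nothing from extra cores ("no free lunch")...
print(speedup(0.0, 8))            # 1.0
# ...while code that is 95% parallel approaches a 6x speedup on 8 cores.
print(round(speedup(0.95, 8), 2)) # 5.93
```

This is why the deck's later sections focus on making parallelism accessible: clock speed no longer rises, so only the parallel fraction of a program scales with new hardware.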
Microsoft's Vision for HPC
"Provide the platform, tools and broad ecosystem to reduce the complexity of HPC by making parallelism more accessible to address future computational needs."
• Reduced complexity: ease deployment for larger-scale clusters; simplify management for clusters of all scales; integrate with existing infrastructure.
• Mainstream HPC: address the needs of traditional supercomputing; address emerging cross-industry computation trends; enable non-technical users to harness the power of HPC.
• Developer ecosystem: increase the number of parallel applications and codes; offer a choice of parallel development tools, languages and libraries; drive a larger universe of developers and ISVs.

Microsoft HPC++ Solution
• Application benefits: the most productive distributed application development environment.
• Cluster benefits: a complete HPC cluster platform integrated with the enterprise infrastructure.
• System benefits: a cost-effective, reliable and high-performance server operating system.

Windows HPC Server 2008
• Integrated security via Active Directory
• Support for batch, interactive and service-oriented applications
• High-availability scheduling
• Interoperability via OGF's HPC Basic Profile
• Rapid large-scale deployment and a built-in diagnostics suite
• Integrated monitoring, management and reporting
• Familiar UI and rich scripting interface

Storage
• Access to SQL, Windows and Unix file servers
• Key parallel file server vendor support (GPFS, Lustre, Panasas)
• In-memory caching options

MPI
• MS-MPI stack based on the MPICH2 reference implementation
• Performance improvements for RDMA networking and multi-core shared memory
• MS-MPI integrated with Windows Event Tracing

Systems Management
• List or heat-map view of the cluster at a glance
• Group compute nodes based on hardware, software and custom attributes; act on groupings
• Receive alerts for failures
• Track long-running operations and access operation history
• Pivoting enables correlating nodes and jobs together

Job Scheduling
• Integrated job scheduling
• Service-oriented HPC applications
• Expanded job policies
• Support for job templates
• Improved interoperability with mixed IT infrastructure

Node/Socket/Core Allocation
Windows HPC Server can help your application make the best use of multi-core systems.
[Diagram: two nodes, each with multiple sockets of four cores, showing how jobs J1–J3 are placed under their allocation policies.]
• J1: /numsockets:3 /exclusive:false
• J2: /numnodes:1
• J3: /numcores:4 /exclusive:false

Job Submission: 3 Methods
• Command line:
    job submit /headnode:Clus1 /numprocessors:124 /nodegroup:Matlab
    job submit /corespernode:8 /numnodes:24
    job submit /failontaskfailure:true /requestednodes:N1,N2,N3,N4
    job submit /numprocessors:256 mpiexec \\share\mpiapp.exe
  (Complete PowerShell system-management commands are available as well.)
• Programmatic, with support for C++ and .NET languages:

    using Microsoft.Hpc.Scheduler;

    class Program
    {
        static void Main()
        {
            IScheduler store = new Scheduler();
            store.Connect("localhost");

            ISchedulerJob job = store.CreateJob();
            job.AutoCalculateMax = true;
            job.AutoCalculateMin = true;

            // A parametric sweep: one task instance per value of *.
            ISchedulerTask task = job.CreateTask();
            task.CommandLine = "ping 127.0.0.1 -n *";
            task.IsParametric = true;
            task.StartValue = 1;
            task.EndValue = 10000;
            task.IncrementValue = 1;
            task.MinimumNumberOfCores = 1;
            task.MaximumNumberOfCores = 1;
            job.AddTask(task);

            store.SubmitJob(job, @"hpc\user", "p@ssw0rd");
        }
    }

• Web interface: Open Grid Forum "HPC Basic Profile"

Scheduling MPI Jobs
• job submit /numprocessors:7800 mpiexec hostname
• Start time: 1 second; completion time: 27 seconds

NetworkDirect
A new RDMA networking interface built for speed and stability: 2 µs latency, 2 GB/sec bandwidth on ConnectX.
• The OpenFabrics driver for Windows includes support for NetworkDirect, Winsock Direct and IPoIB
protocols.

[Diagram: the NetworkDirect architecture. A socket-based app reaches networking hardware through Windows Sockets, TCP/IP and an NDIS miniport driver in kernel mode, or through Winsock Direct; an MPI app goes through MS-MPI to a NetworkDirect provider and a user-mode access layer, reaching the RDMA networking hardware with kernel bypass. Components are color-coded as CCP, OS and IHV components.]
• Verbs-based design for a close fit with native, high-performance networking interfaces
• Equal to hardware-optimized stacks for MPI micro-benchmarks

Top500 Results (Linpack efficiency)
• Spring 2006, NCSA, #130: 896 cores, 4.1 TF (Windows Compute Cluster Server 2003)
• Spring 2007, Microsoft, #106: 2048 cores, 9 TF, 58.8%
• Fall 2007, Microsoft, #116: 2048 cores, 11.8 TF, 77.1%
• Spring 2008, NCSA, #23: 9472 cores, 68.5 TF, 77.7% (Windows HPC Server 2008)
• Spring 2008, Umea, #40: 5376 cores, 46 TF, 85.5%
• Spring 2008, Aachen, #100: 2096 cores, 18.8 TF, 76.5%
A 30% efficiency improvement from Windows Compute Cluster Server 2003 to Windows HPC Server 2008 (November 2008 Top500).

Customers
"It is important that our IT environment is easy to use and support. Windows HPC is improving our performance and manageability."
— Dr. J.S. Hurley, Senior Manager, Head Distributed Computing, Networked Systems Technology, The Boeing Company

"Ferrari is always looking for the most advanced technological solutions and, of course, the same applies for software and engineering. To achieve industry-leading power-to-weight ratios, reduction in gear-change times, and revolutionary aerodynamics, we can rely on Windows HPC Server 2008. It provides a fast, familiar, high performance computing platform for our users, engineers and administrators."
— Antonio Calabrese, Responsabile Sistemi Informativi (Head of Information Systems), Ferrari

"Our goal is to broaden HPC availability to a wider audience than just power users.
We believe that Windows HPC will make HPC accessible to more people, including engineers, scientists, financial analysts, and others, which will help us design and test products faster and reduce costs."
— Kevin Wilson, HPC Architect, Procter & Gamble

"We are very excited about utilizing the Cray CX1 to support our research activities," said Rico Magsipoc, Chief Technology Officer for the Laboratory of Neuro Imaging. "The work that we do in brain research is computationally intensive but will ultimately have a huge impact on our understanding of the relationship between brain structure and function, in both health and disease. Having the power of a Cray supercomputer that is simple and compact is very attractive and necessary, considering the physical constraints we face in our data centers today."

Porting Unix Applications
• Windows Subsystem for UNIX-based Applications
  – Complete SVR-5 and BSD UNIX environment with 300 commands, utilities, shell scripts and compilers
  – Visual Studio extensions for debugging POSIX applications
  – Support for 32- and 64-bit applications
• Recent port of the WRF weather model
  – 350K lines of Fortran 90 and C, using MPI and OpenMP
  – Traditionally developed for Unix HPC systems
  – Two dynamical cores, full range of physics options
• Porting experience
  – Fewer than 750 lines of code changed, in makefiles/scripts
  – Level of effort similar to a port to any new version of UNIX
  – Performance on par with the Linux systems
• India Interoperability Lab, MTC Bangalore
  – Industry solutions for interop, jointly with partners
  – HPC utility computing architecture
  – Open-source applications on HPC Server 2008 (NAMD, DL_POLY, GROMACS)

High Productivity Modeling
• Languages/runtimes: C++, C#, VB; F#, Python, Ruby, JScript; Fortran (Intel, PGI); OpenMP, MPI
• .NET Framework: LINQ (language-integrated query); Dynamic Language Runtime; Fx/JIT/GC improvements; native support for web services
• Team development: team portal with version control, scheduled builds and bug tracking; test and stress generation; code analysis and code coverage; performance analysis
• IDE: rapid application development; parallel debugging; multiprocessor builds; workflow design

Microsoft Parallel Computing Technologies
[Slide: a matrix of Microsoft parallel computing technologies (CCR, Maestro, TPL/PPL, WCF, WF, Cluster-TPL, PLINQ, OpenMP, CDS, MPI/MPI.NET, Cluster SOA, Cluster-PLINQ) mapped against application scenarios spanning task concurrency, data parallelism, and local versus distributed/cloud computing. Example workloads include robotics-based manufacturing assembly lines, the Silverlight Olympics viewer, automotive control systems, internet-based photo services, ultrasound imaging equipment, media encode/decode, image processing/enhancement, data visualization, enterprise search, OLTP, animation/CGI rendering, weather forecasting, seismic monitoring and oil exploration.]

Cluster SOA
[Diagram: head nodes host WCF brokers supporting SOA functionality; each compute node performs UDF tasks as called from the WCF broker.]

SOA Broker Performance
[Charts: low latency — round-trip latency (ms) vs. message size (bytes) for WSD, IPoIB and GigE at 0k/1k/4k/16k ping-pong; high throughput — messages/sec vs. number of clients at 25 ms compute time.]

MPI.NET
• Supports all .NET languages (C#, C++, F#, ..., even Visual Basic!)
• Natural expression of MPI in C#:

    if (world.Rank == 0)
        world.Send("Hello, World!", 1, 0);
    else
    {
        string msg = world.Receive<string>(0, 0);
    }

    string[] hostnames = comm.Gather(MPI.Environment.ProcessorName, 0);

    double pi = 4.0 * comm.Reduce(dartsInCircle, (x, y) => x + y, 0)
                    / totalDartsThrown;

• Negligible overhead (relative to C) over TCP
• Debugging: Allinea DDT; VS Debugger add-in

NetPIPE Performance
[Chart: throughput (Mbps, log scale) vs. message size (1 B to 10 MB) for native C, C# (primitive) and C# (serialized); the primitive C# curve tracks native C closely.]

Parallel Extensions to .NET
• Declarative data parallelism (PLINQ):

    var q = from n in names.AsParallel()
            where n.Name == queryInfo.Name &&
                  n.State == queryInfo.State &&
                  n.Year >= yearStart && n.Year <= yearEnd
            orderby n.Year ascending
            select n;

• Imperative data and task parallelism (TPL):

    Parallel.For(0, n, i => { result[i] = compute(i); });

• Data structures and coordination constructs

Example: Tree Walk

Sequential:

    static void ProcessNode<T>(Tree<T> tree, Action<T> action)
    {
        if (tree == null) return;
        ProcessNode(tree.Left, action);
        ProcessNode(tree.Right, action);
        action(tree.Data);
    }

Thread pool:

    static void ProcessNode<T>(Tree<T> tree, Action<T> action)
    {
        if (tree == null) return;

        // Flatten the tree into a work queue.
        Stack<Tree<T>> nodes = new Stack<Tree<T>>();
        Queue<T> data = new Queue<T>();
        nodes.Push(tree);
        while (nodes.Count > 0)
        {
            Tree<T> node = nodes.Pop();
            data.Enqueue(node.Data);
            if (node.Left != null) nodes.Push(node.Left);
            if (node.Right != null) nodes.Push(node.Right);
        }

        using (ManualResetEvent mre = new ManualResetEvent(false))
        {
            int waitCount = Environment.ProcessorCount;
            WaitCallback wc = delegate
            {
                bool gotItem;
                do
                {
                    T item = default(T);
                    lock (data)
                    {
                        if (data.Count > 0) { item = data.Dequeue(); gotItem = true; }
                        else gotItem = false;
                    }
                    if (gotItem) action(item);
                } while (gotItem);
                if (Interlocked.Decrement(ref waitCount) == 0) mre.Set();
            };
            for (int i = 0; i < Environment.ProcessorCount - 1; i++)
            {
                ThreadPool.QueueUserWorkItem(wc);
            }
            wc(null);
            mre.WaitOne();
        }
    }

Example: Tree Walk (continued)

Parallel Extensions (with Task):

    static void ProcessNode<T>(Tree<T> tree, Action<T> action)
    {
        if (tree == null) return;
        Task t = Task.Create(delegate { ProcessNode(tree.Left, action); });
        ProcessNode(tree.Right, action);
        action(tree.Data);
        t.Wait();
    }

Parallel Extensions (with Parallel):

    static void ProcessNode<T>(Tree<T> tree, Action<T> action)
    {
        if (tree == null) return;
        Parallel.Do(
            () => ProcessNode(tree.Left, action),
            () => ProcessNode(tree.Right, action),
            () => action(tree.Data));
    }

Parallel Extensions (with PLINQ):

    static void ProcessNode<T>(Tree<T> tree, Action<T> action)
    {
        tree.AsParallel().ForAll(action);
    }

F# is...
...a functional, object-oriented, imperative and explorative programming language for .NET: succinct, strongly typed, efficient, scalable, interoperable, interactive, with libraries.

Interactive F# Shell

    C:\fsharpv2>bin\fsi
    MSR F# Interactive, (c) Microsoft Corporation, All Rights Reserved
    F# Version 1.9.2.9, compiling for .NET Framework Version v2.0.50727

    NOTE: See 'fsi --help' for flags
    NOTE: Commands: #r <string>;;    reference (dynamically load) the given DLL.
    NOTE:           #I <string>;;    add the given search path for referenced DLLs.
    NOTE:           #use <string>;;  accept input from the given file.
    NOTE:           #load <string> ...<string>;;  load the given file(s)
    NOTE:                            as a compilation unit.
    NOTE:           #time;;          toggle timing on/off.
    NOTE:           #types;;         toggle display of types on/off.
    NOTE:           #quit;;          exit.
    NOTE: Visit the F# website at http://research.microsoft.com/fsharp.
    NOTE: Bug reports to fsbugs@microsoft.com. Enjoy!
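Stepping back to the tree-walk examples above: the same divide-and-conquer pattern — fork a task for one subtree, walk the other inline, then join — can be sketched outside .NET with an ordinary thread pool. This Python sketch is illustrative only; the Tree class and function names are assumptions, not part of any Microsoft API:

```python
# Recursive tree walk on a thread pool, mirroring the "with Task"
# version: the left subtree runs as a pool task while the current
# thread walks the right subtree, then we join (like t.Wait()).
from concurrent.futures import ThreadPoolExecutor

class Tree:
    def __init__(self, data, left=None, right=None):
        self.data, self.left, self.right = data, left, right

def process_node(tree, action, pool):
    if tree is None:
        return
    t = pool.submit(process_node, tree.left, action, pool)
    process_node(tree.right, action, pool)  # walk right subtree inline
    action(tree.data)
    t.result()  # wait for the left subtree to finish

seen = []
tree = Tree(1, Tree(2), Tree(3))
with ThreadPoolExecutor(max_workers=4) as pool:
    process_node(tree, seen.append, pool)
print(sorted(seen))  # [1, 2, 3]
```

As with the CTP's Task version, the join at the end keeps the walk correct regardless of which subtree finishes first; a production version would need to guard against pool exhaustion on deep trees.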
> let rec f x = (if x < 2 then x else f (x-1) + f (x-2));;
val f : int -> int

> f 6;;
val it : int = 8

Example: Taming Asynchronous I/O
Processing 200 images in parallel with callback-based asynchronous I/O in C#:

    using System;
    using System.IO;
    using System.Threading;

    public class BulkImageProcAsync
    {
        public const String ImageBaseName = "tmpImage-";
        public const int numImages = 200;
        public const int numPixels = 512 * 512;

        // ProcessImage has a simple O(N) loop, and you can vary the number
        // of times you repeat that loop to make the application more
        // CPU-bound or more IO-bound.
        public static int processImageRepeats = 20;

        // Threads must decrement NumImagesToFinish, and protect
        // their access to it through a mutex.
        public static int NumImagesToFinish = numImages;
        public static Object[] NumImagesMutex = new Object[0];
        // WaitObject is signalled when all image processing is done.
        public static Object[] WaitObject = new Object[0];

        public class ImageStateObject
        {
            public byte[] pixels;
            public int imageNum;
            public FileStream fs;
        }

        public static void ReadInImageCallback(IAsyncResult asyncResult)
        {
            ImageStateObject state = (ImageStateObject)asyncResult.AsyncState;
            Stream stream = state.fs;
            int bytesRead = stream.EndRead(asyncResult);
            if (bytesRead != numPixels)
                throw new Exception(String.Format(
                    "In ReadInImageCallback, got the wrong number of " +
                    "bytes from the image: {0}.", bytesRead));
            ProcessImage(state.pixels, state.imageNum);
            stream.Close();

            // Now write out the image.
            // Using asynchronous I/O here appears not to be best practice.
            // It ends up swamping the threadpool, because the threadpool
            // threads are blocked on I/O requests that were just queued
            // to the threadpool.
            FileStream fs = new FileStream(ImageBaseName + state.imageNum +
                ".done", FileMode.Create, FileAccess.Write, FileShare.None,
                4096, false);
            fs.Write(state.pixels, 0, numPixels);
            fs.Close();

            // This application model uses too much memory.
            // Releasing memory as soon as possible is a good idea,
            // especially global state.
            state.pixels = null;
            fs = null;

            // Record that an image is finished now.
            lock (NumImagesMutex)
            {
                NumImagesToFinish--;
                if (NumImagesToFinish == 0)
                {
                    Monitor.Enter(WaitObject);
                    Monitor.Pulse(WaitObject);
                    Monitor.Exit(WaitObject);
                }
            }
        }

        public static void ProcessImagesInBulk()
        {
            Console.WriteLine("Processing images... ");
            long t0 = Environment.TickCount;
            NumImagesToFinish = numImages;
            AsyncCallback readImageCallback =
                new AsyncCallback(ReadInImageCallback);
            for (int i = 0; i < numImages; i++)
            {
                ImageStateObject state = new ImageStateObject();
                state.pixels = new byte[numPixels];
                state.imageNum = i;
                // Very large items are read only once, so you can make the
                // buffer on the FileStream very small to save memory.
                FileStream fs = new FileStream(ImageBaseName + i + ".tmp",
                    FileMode.Open, FileAccess.Read, FileShare.Read, 1, true);
                state.fs = fs;
                fs.BeginRead(state.pixels, 0, numPixels,
                    readImageCallback, state);
            }

            // Determine whether all images are done being processed.
            // If not, block until all are finished.
            bool mustBlock = false;
            lock (NumImagesMutex)
            {
                if (NumImagesToFinish > 0)
                    mustBlock = true;
            }
            if (mustBlock)
            {
                Console.WriteLine("All worker threads are queued. " +
                    " Blocking until they complete. numLeft: {0}",
                    NumImagesToFinish);
                Monitor.Enter(WaitObject);
                Monitor.Wait(WaitObject);
                Monitor.Exit(WaitObject);
            }
            long t1 = Environment.TickCount;
            Console.WriteLine("Total time processing images: {0}ms",
                (t1 - t0));
        }
    }

Example: Taming Asynchronous I/O
Equivalent F# code (same performance):

    let ProcessImageAsync(i) =
        async { // Open the file synchronously
                let  inStream  = File.OpenRead(sprintf "source%d.jpg" i)
                // Read from the file, asynchronously
                let! pixels    = inStream.ReadAsync(numPixels)
                let  pixels'   = TransformImage(pixels, i)
                let  outStream = File.OpenWrite(sprintf "result%d.jpg" i)
                // Write the result asynchronously
                do!  outStream.WriteAsync(pixels')
                do   Console.WriteLine "done!" }

    // Generate the tasks and queue them in parallel
    let ProcessImagesAsync() =
        Async.Run (Async.Parallel
            [ for i in 1 .. numImages -> ProcessImageAsync(i) ])

The Coming of Accelerators
Current offerings:

                Microsoft                 AMD             nVidia               Intel               Apple
    Model       Accelerator               Brook+          RapidMind            Ct                  Grand Central
    Libraries   D3DX, DaVinci, FFT, Scan  ACML-GPU        cuFFT, cuBLAS, cuPP  MKL++               CoreImage, CoreAnim
    Low-level   Compute Shader            CAL             CUDA                 LRB Native          OpenCL
    Target      any processor             AMD CPU or GPU  nVidia GPU           Intel CPU/Larrabee  any processor

DirectX11 Compute Shader
• A new processing model for GPUs:
  – Integrated with Direct3D
  – Supports more general constructs
  – Enables more general data structures
  – Enables more general algorithms
• Image/post processing: image reduction, histogram, convolution, FFT; video transcode, super-resolution, etc.
• Effect physics: particles, smoke, water, cloth, etc.
• Ray tracing, radiosity, etc.
• Gameplay physics, AI

FFT Performance Example
• Complex 1024×1024 2-D FFT:
  – Software:         42 ms    6 GFlops
  – Direct3D 9:       15 ms   17 GFlops (3x)
  – CUFFT:             8 ms   32 GFlops (5x)
  – Prototype DX11:    6 ms   42 GFlops (6x)
  – Latest chips:      3 ms  100 GFlops
• Shared register space and random-access writes enable ~2x speedups

IMSL .NET Numerical Library
• Linear algebra; eigensystems; interpolation and approximation; quadrature; differential equations; transforms
• Nonlinear equations; optimization; basic statistics; nonparametric tests; goodness of fit; regression
• Variances, covariances and correlations; multivariate analysis; analysis of variance; time series and forecasting; distribution functions; random number generation

Research Pipeline
• Integrate: data acquisition from source systems and integration; data transformation and synthesis
• Analyze: data enrichment with business logic and hierarchical views; data discovery via data mining
• Report: data presentation and distribution; data access for the masses

Data Browsing with Excel
[Charts: annual mean, monthly mean and weekly mean. Courtesy Catherine van Ingen, MSR.]

Datamining with
Excel
Integrated algorithms:
• Text mining
• Neural nets
• Naïve Bayes
• Time series
• Sequence clustering
• Decision trees
• Association rules

Workflow Design for SharePoint

Microsoft HPC++ Labs: Academic Computational Finance Service

Taking HPC Mainstream

© 2008 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.