High-Performance Computing With Windows
Ryan Waite, General Program Manager, Windows Server HPC Group, Microsoft Corporation

Outline
Part 1: Overview
- Why Microsoft has gotten into HPC
- What our V1 product offers
- Some future directions
Part 2: Drill-down
- A few representative V1 features (for those who are interested)

Part 1: Overview

Evolving Tools Of The Scientific Process
The scientific cycle: 1. Observation, 2. Hypothesis, 3. Prediction, 4. Validation
Instruments
- Experiments done with a telescope by Galileo 400 years ago inaugurated the scientific method
- The microscope, laser, x-ray, collider, and accelerator allowed peering further and deeper into matter
HPC
- Automation and acceleration of the scientific and engineering process itself
- Digital instruments, data mining, simulation, experiment steering

The Next Challenge: Taking HPC Mainstream
- Volume economics of industry-standard hardware and commercial software applications are rapidly bringing HPC capabilities to a broader number of users
- But HPC is still accessible only to the few computational scientists who can master a domain science, program parallel, distributed algorithms, and use and manage a supercomputer
Microsoft HPC strategy: taking HPC to the mainstream
- Enabling broad HPC adoption and making HPC a high-volume market in which everyone can have their own personal supercomputer
- Enabling domain scientists who are not computer scientists to partake in the HPC revolution

Evidence Of Standardization And Commoditization
- Clusters: over 70% of systems
- Industry usage rising
- GigE is gaining (50% of systems)
- x86 is leading (Pentium 41%, EM64T 16%, Opteron 11%)

HPC Market Trends
- Systems under $250K account for 97% of systems and 55% of revenue
[Chart: 2005 systems and year-over-year growth by market segment, from the high end down: 981 systems (-3%), 4,988 (30%), 21,733 (36%), 163,441 (33%). Source: IDC, 2005]

Even The Low End Is Powerful

               1991                      1998                          2005
System         Cray Y-MP C916           Sun HPC10000                  Small Form Factor PCs
Architecture   16 x Vector, 4GB, Bus    24 x 333MHz UltraSPARC II,    4 x 2.2GHz Athlon64,
                                        24GB, SBus                    4GB, GigE
OS             UNICOS                   Solaris 2.5.1                 Windows Server 2003 SP1
GFlops         ~10                      ~10                           ~10
Top500 #       1                        500                           N/A
Price          $40,000,000              $1,000,000 (40x drop)         < $4,000 (250x drop)
Customers      Government labs          Large enterprises             Every engineer and scientist
Applications   Classified, climate,     Manufacturing, energy,        Bioinformatics, materials
               physics research         finance, telecom              sciences, digital media
Top Challenges
- Setup is painful: it takes a long time to get clusters up and running
- Clusters are separate islands: lack of integration into IT infrastructure
- Job management: lack of integration into end-user apps
- Application availability: limited ecosystem of applications that can exploit parallel processing capabilities

"Make high-end computing easier and more productive to use. Emphasis should be placed on time to solution, the major metric of value to high-end computing users… A common software environment for scientific computation encompassing desktop to high-end systems will enhance productivity gains by promoting ease of use and manageability of systems."
— High-End Computing Revitalization Task Force, 2004 (Office of Science and Technology Policy, Executive Office of the President)

Windows Compute Cluster Server 2003
- Simplified cluster deployment, job submission, and status monitoring
- Better integration with existing Windows infrastructure, allowing customers to leverage existing technology and skill sets
- Familiar development environment allows developers to write parallel applications from within the powerful Visual Studio IDE

Windows Compute Cluster Server 2003: Leveraging Existing Windows Infrastructure
Integration with IT infrastructure:
- Kerberos authentication
- Resource management
- Secure job execution
- Group policies
- Secure MPI
- Operations Manager
- Job scheduler
- Windows Update Services
- Admin console
- Systems Management Server
- Performance monitor
- Command line interface
- Remote Installation Services

CCS Key Features
Node deployment and administration
- Task-based configuration for head and compute nodes
- UI- and command-line-based node management
- Monitoring with Performance Monitor (Perfmon), Microsoft Operations Manager (MOM), Server Performance Advisor (SPA), and 3rd-party tools
Integration with existing Windows and management infrastructure
- Integrates with Active Directory, Windows security technologies, management, and deployment tools
Extensible job scheduler
- 3rd-party extensibility at job submission and/or job assignment
- Submit jobs from command line, UI, or directly from applications
- Simple job management, similar to print queue management
Secure, high-performance MPI
- User credentials secured in job scheduler and compute nodes
- MPI stack based on the MPICH2 reference implementation (a minimal example follows this list)
- Support for high-performance interconnects through Winsock Direct
Integrated development environment
- OpenMP support in Visual Studio, Standard Edition
- Parallel debugger in Visual Studio, Professional Edition
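Because the MPI stack is based on the MPICH2 reference implementation, ordinary MPI source code carries over unchanged. A minimal sketch in C (assuming any conforming MPI implementation and C compiler; the file name is illustrative, not from the product):

    /* hello_mpi.c - each rank reports itself; every rank also sees the world size */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);                 /* start the MPI runtime */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total process count */
        printf("Hello from rank %d of %d\n", rank, size);
        MPI_Finalize();                         /* shut down cleanly */
        return 0;
    }

Compiled against the MPI headers and libraries and launched with mpiexec (the standard MPICH2 launcher), e.g. mpiexec -n 8 hello_mpi.exe; under the scheduler the same binary runs as a parallel MPI job spanning nodes.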
HPC Institutes
- National Center for Supercomputing Applications, IL, U.S.A.
- University of Utah, Salt Lake City, UT, U.S.A.
- TACC – University of Texas, Austin, TX, U.S.A.
- Cornell Theory Center, Ithaca, NY, U.S.A.
- Southampton University, Southampton, UK
- University of Virginia, Charlottesville, VA, U.S.A.
- University of Tennessee, Knoxville, TN, U.S.A.
- Nizhni Novgorod University, Nizhni Novgorod, Russia
- Tokyo Institute of Technology, Tokyo, Japan
- HLRS – University of Stuttgart, Stuttgart, Germany
- Shanghai Jiao Tong University, Shanghai, PRC

An Example Of Porting To Windows: The Weather Research and Forecasting (WRF) Model
- Large collaborative effort, led by NCAR, to develop a next-generation community model with a direct path to operations
- Applications: atmospheric research, numerical weather prediction, coupled modeling systems
- Current release WRFV2.1.2: ~1/3 million lines of Fortran 90 and some C, using MPI and OpenMP
- Traditionally developed for Unix HPC systems
- Two dynamical cores; full range of physics options
- Rapid community growth: more than 3,000 registered users
- Operational capabilities: U.S. Air Force Weather Agency; National Centers for Environmental Prediction (NOAA); KMA (Korea), IMD (India), CWB (Taiwan), IAF (Israel), WSI (U.S.)

WRF On Windows
Motivation
- Extend the range of systems available to WRF users
- Stability and consistency with respect to Linux
- Take advantage of Microsoft and 3rd-party (e.g., Portland Group) development tools and environments
Status
- WRF ported under SUA and running on development AMD64 clusters using the Compute Cluster Pack
- Of 360k lines, fewer than 750 changed to compile and link under SUA
- The largest number of changes involved the WRF build mechanism (Makefiles, scripts)
- The level of effort and the nature of the tasks were not unlike porting to any new version of UNIX
- Details of the porting experience are described in a white paper available from Microsoft and at http://www.mmm.ucar.edu/wrf/WG2/wrf_port_notes.htm

An Example Of Application Integration With HPC: Scaling Excel
Excel Services on Windows Compute Cluster Server 2003
[Diagram: Excel "12" clients author and publish spreadsheets to Excel Services running on desktops, servers, and clusters; users view and interact either through a 100% thin browser (open spreadsheet/snapshot) or programmatically through Excel "12" web services access from the Excel "12" client and custom applications]

Excel And Windows CCS
Customer requirements
- Faster spreadsheet calculation
- Free up client machines from long-running calculations
- Time- and mission-critical calculations that must run
- Parallel iterations on models
Example scenarios
- Schedule overnight risk calculations
- Farm out analytical library calculations
- Scale out Monte Carlo iterations and parametric sweeps (see the sketch below)
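Monte Carlo scale-out of this kind is embarrassingly parallel: each task runs the same executable over its own slice of the iterations with its own seed, and the estimates are combined afterward. A minimal sketch of one such task in C (the program and its seed/sample-count arguments are illustrative assumptions, not part of CCS or Excel Services):

    /* mc_pi.c - one task of a Monte Carlo sweep: estimate pi from n samples.
       Run many instances with different seeds and average the printed outputs. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        /* seed and sample count come in as sweep parameters */
        unsigned int seed = (argc > 1) ? (unsigned int)atoi(argv[1]) : 1;
        long n = (argc > 2) ? atol(argv[2]) : 1000000L;
        long hits = 0;
        long i;

        srand(seed);
        for (i = 0; i < n; i++) {
            double x = rand() / (double)RAND_MAX;
            double y = rand() / (double)RAND_MAX;
            if (x * x + y * y <= 1.0)   /* inside the quarter circle? */
                hits++;
        }
        printf("%f\n", 4.0 * hits / n);  /* one independent estimate of pi */
        return 0;
    }

In a parameter sweep job, each task would be this executable invoked with a different seed; the scheduler runs the tasks wherever processors are free, and a final step averages the results.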
Evolution Of HPC
Batch computing on supercomputers
- Compute cycles are scarce and require careful partitioning and allocation
- Cluster systems administration is a major challenge
- Manual, batch execution, mediated by the IT manager
Interactive computing on departmental clusters
- Compute cycles are cheap
- Applications split into UI and compute parts; interactive applications integrate the two
- Emergence of turnkey personal clusters; interactive computation and visualization
Complex workflow spanning applications
- Compute and data resources are diffused throughout the enterprise
- Distributed application, systems, and data management is the key source of complexity
- Multiple applications are organized into complex workflows and data pipelines
- Focus on service orientation and web services

Cheap Cycles And Personal Supercomputing
- IBM Cell processor: 256 Gflops today; a 4-node personal cluster reaches 1 Tflops, and a 32-node personal cluster would place in the Top100
- Microsoft Xbox: 3 custom PowerPCs + ATI graphics processor, 1 Tflops today for $300; an 8-node personal cluster is a "Top100" machine for $2,500 (ignoring all that you don't get for $300)
- Intel many-core chips: "100's of cores on a chip in 2015" (Justin Rattner, Intel); at "4 cores" per Tflop, that is 25 Tflops per chip
- The key challenge: how to program these things. Concurrent programming will be an important area of investment for all of Microsoft (not just HPC)

"Grid Computing"
A catch-all marketing term covering:
- Desktop cycle-stealing
- Managed HPC clusters
- Internet access to giant, distributed repositories
- Virtualization of data center IT resources
- Out-sourcing to "utility data centers"
- "Software as a service"
- Parallel databases

HPC Grids And Web Services
Compute grid
- Forest of clusters
- Coordinated scheduling of resources
Data grid
- Distributed storage facilities
- Coordinated management of data
Web services
- Glue for heterogeneous platforms, applications, and systems
- Cross- and intra-organization integration
- Standards-based distributed computing
- Interoperability and composability
[Diagram: progression from cluster-based HPC to intra-organization HPC to virtual organizations]

Part 2: Drill-Down

Technologies
Platform
- Windows Server 2003 SP1 64-bit Edition
- x64 processors (Intel EM64T and AMD Opteron)
- Ethernet, Ethernet over RDMA, and InfiniBand support
Administration
- Prescriptive, simplified cluster setup and administration
- Scripted, image-based compute node management
- Active Directory-based security
- Scalable job scheduling and resource management
Development
- MPICH-2 from Argonne National Labs with performance and security enhancements
- Cluster scheduler programmable via Web Services and DCOM
- Visual Studio 2005: OpenMP support and a parallel debugger (an OpenMP example follows)
- Partner-delivered Fortran compilers and numerical libraries
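With the OpenMP support in Visual C++ 2005, a serial loop can be spread across the processors of a node with a single pragma. A minimal sketch (a generic dot-product reduction, not taken from the product; any OpenMP-capable C compiler works):

    /* dotprod.c - sum reduction parallelized across cores with OpenMP */
    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        enum { N = 1000000 };
        static double a[N], b[N];   /* static: too large for the stack */
        double sum = 0.0;
        int i;

        for (i = 0; i < N; i++) {   /* fill with test data */
            a[i] = 1.0;
            b[i] = 0.5;
        }

        /* split the loop across threads; combine partial sums safely */
        #pragma omp parallel for reduction(+:sum)
        for (i = 0; i < N; i++)
            sum += a[i] * b[i];

        printf("dot = %f (max threads: %d)\n", sum, omp_get_max_threads());
        return 0;
    }

Built with the /openmp switch in Visual C++ 2005. OpenMP covers parallelism within a node; MPI spans nodes, and the two are routinely combined.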
Head Node Installation
- The head node installs only on x64: Windows 2003 Compute Cluster Edition, Windows 2003 SP1 Standard and Enterprise, or Windows 2003 R2
Installation
- Leverages appliance-like functionality
- Scripted installation
- Warnings if the system is misconfigured
- To Do list to assist with final configuration
Walkthrough
- Windows Server 2003 is installed on the head node (the system may have been pre-installed using the OPK)
- User launches Compute Cluster Kit setup
- The To Do list starts up, guiding the user through the next steps
- User joins an Active Directory domain
- User installs IP over IB drivers for InfiniBand cards if not pre-installed
- A wizard assists with multi-NIC routing and configuration
- Remote Installation Service is configured for imaging compute nodes

Compute Node Installation
Automated installation
- Remote Installation Service provides a simple imaging solution
- May use third-party system imaging tools for compute nodes
- Requires a private network
Walkthrough
- User racks up the compute nodes and starts the Add Node wizard
- User powers up a group of compute nodes; the nodes PXE boot
- RIS and installation scripts install the operating system (W2K3 SP1), install drivers, join the appropriate domain, install the compute cluster software (CD2), and join the cluster
- Exiting the wizard turns off RIS
[Diagram: the head node sits on the corpnet and connects to the compute nodes over private Ethernet and InfiniBand networks]

Node Management
- Not building a new systems-management paradigm; leveraging Windows infrastructure for simple management: MMC, Perfmon, Event Viewer, Remote Desktop
- Can integrate with enterprise management infrastructure, such as Microsoft Operations Manager
- The Compute Cluster MMC snap-in supports specific actions: Pause Node, Resume Node, Open CD Drive, Reboot Node, Execute Command, Remote Desktop Connection, Start PerfMon, Delete, and Properties
- Can operate on multiple nodes at once
[Screenshot: Compute Cluster Admin Console node view listing nodes with node status (Active/Paused/Installing/Patching), job status, job name, elapsed time, and owner]

Job/Task Conceptual Model
[Diagram: four job shapes — a serial job (one task, one process); a parallel MPI job (tasks whose processes communicate via IPC); a parameter sweep job (independent tasks, one process each); and a task flow job (tasks with ordering dependencies)]

Job Scheduler Stack
[Diagram: users, admins, and third-party applications reach the scheduler through the user console, admin console, command line interface, COM API, and web services (WSE 3.0); the client-side object model submits jobs/tasks to the interface layer on the head node; the scheduling layer handles admission, queueing, and allocation for job management; node managers on the compute nodes form the execution layer, handling resource management and activation]

Job Scheduler
The job scheduler provides two features: ordering and allocation
Job ordering
- Priority-based first-come, first-served (FCFS)
- Backfill supported for jobs with time limits (see the sketch below)
Resource allocation
- License-aware scheduling through plug-ins
- Parallel application node allocation policies
Extensible
- Core engine based on an embedded SQL engine
- Resource and job descriptions are based on XML
- 3rd parties can extend by plugging into the submission and execution phases to implement queuing and licensing policies
Job submission
- Jobs submitted via UI, API, command line, or web service
Security
- Jobs on compute nodes execute in the security account of the submitting user, allowing secure access to networked resources
Cleanup
- Jobs executed in Job Objects on compute nodes, facilitating cleanup
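To make the ordering policy concrete, here is a minimal sketch of priority-based FCFS with backfill under a deliberately simplified model (one resource dimension, jobs described only by requested CPUs and a time limit, and a precomputed reservation time; all names and numbers are illustrative assumptions, not the product's scheduler):

    /* backfill.c - illustrative priority-FCFS ordering with backfill.
       The queue is sorted by (priority desc, arrival asc). If the head job
       cannot start yet, later jobs may "backfill" onto idle CPUs, but only
       if their time limit expires before the head job's reserved start,
       so the head job is never delayed. */
    #include <stdio.h>

    typedef struct {
        const char *name;
        int priority;   /* higher runs first */
        int arrival;    /* tie-breaker: earlier first */
        int cpus;       /* CPUs requested */
        int limit;      /* run-time limit (minutes) */
    } Job;

    static Job queue[] = {      /* already sorted by priority, then arrival */
        {"A", 2, 0, 8, 60},     /* head job: needs the whole cluster */
        {"B", 1, 1, 2, 30},
        {"C", 1, 2, 4, 90},
    };

    int main(void)
    {
        int free_cpus = 4;      /* 4 of 8 CPUs idle; the rest free up at t+45 */
        int head_start = 45;    /* head job's reservation, precomputed here */
        int i, n = sizeof queue / sizeof queue[0];

        for (i = 0; i < n; i++) {
            Job *j = &queue[i];
            /* the head job starts whenever it fits; later jobs may only
               backfill if they finish before the head's reservation */
            if (j->cpus <= free_cpus && (i == 0 || j->limit <= head_start)) {
                printf("%s starts on %d CPUs\n", j->name, j->cpus);
                free_cpus -= j->cpus;
            } else if (i == 0) {
                printf("%s waits: reserved to start at t+%d\n", j->name, head_start);
            } else {
                printf("%s stays queued\n", j->name);
            }
        }
        return 0;
    }

Here B backfills because its 30-minute limit expires before A's reserved start at t+45; C would delay A, so it stays queued even though it has the same priority as B.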
Queue Management
- Job management model similar to print queue management, leveraging a familiar user paradigm
Queue management operations
- Delete
- Change properties: priority, run time, # of CPUs, preferred nodes, CPUs per node, all in one, license parameters, uniform attributes
- Notification
[Screenshot: Compute Cluster Admin Console queue view listing jobs and tasks with order, priority, name, owner, and status (Running/Completed/Queued)]

Networking
- Focusing on industry-standard interconnect technologies
- MPI implementation tuned to Winsock; automatic RDMA support through Winsock Direct (SAN provider required from the IHV)
Gigabit Ethernet
- Expected to be the mainstream choice
- RDMA + GigE offers compelling latency
InfiniBand
- Emerging as a leading high-end solution
- Engaged with all IB vendors; the OpenIB group is developing a Windows IB stack
- Planning to support IB in WHQL

Resources
- Microsoft HPC web site (evaluation copies available): http://www.microsoft.com/hpc/
- Microsoft Windows Compute Cluster Server 2003 community site: http://www.windowshpc.net/
- Windows Server x64 information: http://www.microsoft.com/64bit/ and http://www.microsoft.com/x64/
- Windows Server System information: http://www.microsoft.com/wss/

© 2006 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.