PetaScale Single System Image and other stuff

Principal Investigators: R. Scott Studham, ORNL; Alan Cox, Rice University; Bruce Walker, HP
Collaborators: Peter Braam, CFS; Steve Reinhardt, SGI; Stephen Wheat, Intel; Stephen Scott, ORNL
Investigators: Peter Druschel, Rice University; Scott Rixner, Rice University; Geoffroy Vallee, ORNL/INRIA; Kevin Harris, HP; Hong Ong, ORNL

Outline
• Project goals & approach
• Details on the work areas
• Collaboration ideas
• Compromising pictures of Barney

Project goals
• Evaluate methods for predictive task migration
• Evaluate methods for dynamic superpages
• Evaluate Linux at O(100,000) processors in a Single System Image

Approach
• Engineering: build a suite based on existing technologies that provides a foundation for further studies.
• Research: test advanced features built on that foundation.

The Foundation
PetaSSI 1.0 software stack – release as a distro in July 2005:
  SSI software     OpenSSI  1.9.1
  Filesystem       Lustre   1.4.2
  Basic OS         Linux    2.6.10
  Virtualization   Xen      2.0.2
[Diagram: the single system image with process migration. Each virtual node runs OpenSSI on XenLinux (Linux 2.6.10) with a Lustre client, on top of the Xen virtual machine monitor, which runs on the hardware (SMP, MMU, physical memory, Ethernet, SCSI/IDE).]

Work Areas
• Simulate ~10K nodes running OpenSSI via virtualization techniques
• System-wide tools and process management for a ~100K processor environment
• Testing Linux SMP at higher processor counts
• Scalability studies of a shared root filesystem up to ~10K nodes
• High availability: preemptive task migration
• Quantify "OS noise" at scale
• Dynamic large page sizes (superpages)

OpenSSI Cluster SW Architecture
[Diagram: the OpenSSI components include install and sysadmin, boot and init, devices, interprocess communication (IPC), application monitoring and restart (HA), resource management and job scheduling, MPI, the kernel interface, process load leveling, DLM, cluster membership (CLMS), the cluster filesystem (CFS), process management (Vproc), the Lustre client, LVS, and remote file/block access.]
• All components use the Internode Communication Subsystem (ICS) for inter-node communication; ICS runs over Quadrics, Myricom, InfiniBand, or TCP/IP.
• RDMA subsystems interface to CLMS for node-down/node-up notification.

Approach for researching PetaScale SSI
Service nodes: single install; local boot (for HA); single IP (LVS); connection load balancing (LVS); single root with HA (Lustre); single filesystem namespace (Lustre); single IPC namespace; single process space and process load leveling; application HA; strong/strict membership (CLMS).
Compute nodes: single install; network or local boot; not part of the single IP and no connection load balancing; single root with caching (Lustre); single filesystem namespace (Lustre); no single IPC namespace (optional); single process space but no process load leveling; no HA participation; scalable (relaxed) membership (CLMS Lite); inter-node communication channels opened on demand only.
[Diagram: service-node and compute-node software stacks (MPI, boot, DLM, Vproc, Lustre client, LVS, remote file/block access) layered over ICS, with compute nodes running the reduced subset.]

Approach to scaling OpenSSI
• Two or more "service" nodes plus optional "compute" nodes
• Vproc for cluster-wide monitoring, e.g., the process space, process launch, and process movement
• Lustre for scaling up the filesystem story (including root)
• Service nodes provide availability and two forms of load balancing
• Computation can also be done on service nodes
• Compute nodes allow even larger scaling
• No daemons required on compute nodes for job launch or stdio
• Enables a diskless node option
• Integration with other open source components: Lustre, LVS, PBS, Maui, Ganglia, SuperMon, SLURM, …

Simulate 10,000 nodes running OpenSSI
• Use virtualization (Xen) to demonstrate basic functionality (booting, etc.); simulation enables assessment of relative performance
• Trying to quantify how many virtual machines we can run per physical machine
• Establish performance characteristics of individual OpenSSI components at scale (e.g., Lustre on 10,000 processors)
• Exploring hardware testbeds for performance characterization: a 786-processor IBM Power system (would require a port to Power) and a 4,096-processor Cray XT3 (Catamount / Linux)
• OS testbeds are a major issue for Fast-OS projects

Virtualization and HPC
• For latency-tolerant applications it is possible to run multiple virtual machines on a single physical node to simulate large node counts.
• Xen has little overhead when guest OSes are not contending for resources.
• This may provide a path to supporting multiple OSes on a single HPC system.
• Overhead to build a Linux kernel on a guest OS: Xen 3%, VMware 27%, User Mode Linux 103%.
[Figure: latency (us) versus the number of virtual machines on a node (2 to 6) for ring & random, ring, and ping-pong communication patterns; a minimal ping-pong measurement sketch follows.]
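To make the latency measurement concrete, here is a minimal TCP ping-pong sketch in C. It is not the harness used for the figure above; the port, message size, and iteration count are arbitrary placeholders. Run the server in one guest OS and the client in another (or across physical nodes) and it reports the average round-trip latency.

/* Minimal TCP ping-pong latency sketch (not the project's measurement harness).
 * Server:  ./pingpong server 5000
 * Client:  ./pingpong client <server-ip> 5000
 */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

#define ITERS    10000
#define MSG_SIZE 8            /* small message: measures latency, not bandwidth */

static double now_us(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1e6 + tv.tv_usec;
}

/* Read or write exactly n bytes; TCP may split even small transfers. */
static void xfer(int fd, char *buf, size_t n, int writing)
{
    size_t done = 0;
    while (done < n) {
        ssize_t r = writing ? write(fd, buf + done, n - done)
                            : read(fd, buf + done, n - done);
        if (r <= 0) { perror("xfer"); exit(1); }
        done += (size_t)r;
    }
}

int main(int argc, char **argv)
{
    char buf[MSG_SIZE] = {0};
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_port = htons((unsigned short)atoi(argv[argc - 1]));

    if (argc == 3 && strcmp(argv[1], "server") == 0) {
        int lfd = socket(AF_INET, SOCK_STREAM, 0), one = 1, fd, i;
        setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
        addr.sin_addr.s_addr = INADDR_ANY;
        bind(lfd, (struct sockaddr *)&addr, sizeof(addr));
        listen(lfd, 1);
        fd = accept(lfd, NULL, NULL);
        for (i = 0; i < ITERS; i++) {          /* echo every message back */
            xfer(fd, buf, MSG_SIZE, 0);
            xfer(fd, buf, MSG_SIZE, 1);
        }
    } else if (argc == 4 && strcmp(argv[1], "client") == 0) {
        int fd = socket(AF_INET, SOCK_STREAM, 0), i;
        inet_pton(AF_INET, argv[2], &addr.sin_addr);
        connect(fd, (struct sockaddr *)&addr, sizeof(addr));
        double t0 = now_us();
        for (i = 0; i < ITERS; i++) {          /* send, then wait for the echo */
            xfer(fd, buf, MSG_SIZE, 1);
            xfer(fd, buf, MSG_SIZE, 0);
        }
        printf("avg round-trip latency: %.1f us\n", (now_us() - t0) / ITERS);
    } else {
        fprintf(stderr, "usage: %s server <port> | client <host> <port>\n", argv[0]);
        return 1;
    }
    return 0;
}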
Pushing nodes to higher processor count, and integration with OpenSSI
What is needed to push kernel scalability further:
• Continued work to quantify spinlock bottlenecks in the kernel, using the open source Lockmeter tool (http://oss.sgi.com/projects/lockmeter/)
• Paper on a 2,048-CPU Linux kernel at SGIUG next week in Germany
What is needed for SSI integration:
• Continued SMP spinlock testing
• Move to the 2.6 kernel
• Application performance testing
• Large page integration
• Quantifying CC impacts on the Fast Multipole Method using a 512-processor Altix (paper at SGIUG next week)

Establish the intersection of OpenSSI clusters and large kernels to get to 100,000+ processors
1) Establish scalability baselines
2) Enhance the scalability of both approaches
3) Understand the intersection of both methods
[Diagram: single-Linux-kernel scalability (stock Linux kernel, from 1 CPU toward 2,048 CPUs) plotted against SSI scalability (typical SSI environments, from 1 node toward 10,000-node OpenSSI clusters).]
• Continue SGI's work on single-kernel scalability
• Continue OpenSSI's work on typical SSI environments
• Test the intersection of large kernels with OpenSSI software to establish the sweet spot for 100,000-processor Linux

System-wide tools and process management for a 100,000 processor environment
• Study process creation performance; build a tree strategy if necessary
• Leverage periodic information collection
• Study the scalability of utilities such as top, ls, ps, etc.

The scalability of a shared root filesystem to 10,000 nodes
• Work has started to enable Xen; validation work to date has been with UML
• Lustre is being tested and enhanced to serve as a root filesystem
• Functionality validated with OpenSSI; there is currently a bug with tty character device access

Scalable IO Testbed
[Diagram: the I/O testbed connects IBM Power3 and Power4, Cray X1, X2, XD1, and XT3, and SGI Altix systems, plus archive and gateway nodes, through GPFS, XFS, and Lustre filesystems.]

High Availability Strategies - Applications
Predictive failures and migration:
• Leverage Intel's predictive-failure work [NDA required]
• OpenSSI supports process migration; the hard part is MPI rebinding:
  On the next global collective, do not return until you have reconnected with the indicated client; the specific client moves, reconnects, and then responds to the collective.
• Do this first for MPI, then adapt it to UPC

"OS noise" (anything that interrupts computation)
• Problem: even a small overhead can have a large impact on large-scale applications that coordinate often
• Investigation: identify the sources and measure the overhead (interrupts, daemons, and kernel threads); a minimal measurement sketch follows this slide
• Solution directions:
  Eliminate daemons or minimize the OS
  Reduce clock overhead
  Register noise makers and either coordinate them across the cluster or make noise only when the machine is idle
• Tests:
  Run Linux on the ORNL XT3 to evaluate against Catamount
  Run daemons on a different physical node under SSI
  Run the application and the services on different sections of a hypervisor
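As referenced above, here is a minimal fixed-work-quantum style probe, in the spirit of benchmarks such as FWQ/FTQ rather than a tool produced by this project: it times many identical blocks of computation and treats anything above the fastest sample as interference from the OS. The sample count and the size of the work quantum are arbitrary placeholders.

/* Fixed-work-quantum style noise probe (illustrative sketch only).
 * Times many identical work quanta; time beyond the fastest sample is
 * treated as interference ("noise") from interrupts, daemons, kernel threads.
 * Compile: gcc -O2 -o noise noise.c   (add -lrt on older glibc)
 */
#include <stdio.h>
#include <time.h>

#define SAMPLES 10000
#define WORK    100000           /* size of one work quantum (placeholder) */

static double now_us(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e6 + ts.tv_nsec / 1e3;
}

/* A fixed amount of CPU work; 'volatile' keeps the compiler from removing it. */
static void quantum(void)
{
    volatile double x = 1.0;
    for (int i = 0; i < WORK; i++)
        x = x * 1.0000001 + 0.0000001;
}

int main(void)
{
    static double sample[SAMPLES];
    double min = 1e30, max = 0.0, sum = 0.0;

    for (int i = 0; i < SAMPLES; i++) {
        double t0 = now_us();
        quantum();
        sample[i] = now_us() - t0;
    }
    for (int i = 0; i < SAMPLES; i++) {
        if (sample[i] < min) min = sample[i];
        if (sample[i] > max) max = sample[i];
        sum += sample[i];
    }
    /* Estimated noise = average time beyond the fastest (presumed undisturbed) quantum. */
    printf("quantum min %.1f us  mean %.1f us  max %.1f us  est. noise %.2f%%\n",
           min, sum / SAMPLES, max, 100.0 * (sum / SAMPLES - min) / min);
    return 0;
}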
The use of very large page sizes (superpages) for large address spaces
• The cost of TLB misses is increasing: working sets keep growing, but TLB size does not grow at the same pace.
• Processors now provide superpages: one TLB entry can map a large region, and most mainstream processors will support superpages in the next few years.
• Operating systems have been slow to harness them: there is no transparent superpage support for applications.
[Figure: TLB coverage trend. TLB coverage as a percentage of main memory, 1985-2000, falls by a factor of 1,000 in 15 years; annotated TLB miss overheads of 5%, 5-10%, and 30%.]

Other approaches
• Reservations
• Relocation: move pages at promotion time; must recover the copying costs
• Eager superpage creation (IRIX, HP-UX): only one superpage size, and the size is specified by the user, so it is not transparent
• Demotion issues are not addressed: large pages may be only partially dirty or referenced

Approach to be studied under this project: dynamic superpages
• Observation: once an application touches the first page of a memory object, it is likely to quickly touch every page of that object (example: array initialization)
• Opportunistic policy: go for the biggest page size that is no larger than the memory object (e.g., the file); if that size is not available, try preemption before resigning to a smaller size
• Speculative demotions
• Manage fragmentation
• Current work has been on Itanium and Alpha, both running BSD; this project will focus on Linux, and we are currently investigating other processors

Best-case benefits on Itanium
• SPEC CPU2000 integer: 12.7% improvement (0 to 37%)
• Other benchmarks: FFT (200³ matrix), 13% improvement; 1000x1000 matrix transpose, 415% improvement
• 25%+ improvement in 6 out of 20 benchmarks; 5%+ improvement in 16 out of 20 benchmarks

Why multiple superpage sizes
Improvements with only one superpage size vs. all sizes on Alpha (two illustrative sketches follow):
            64KB    512KB   4MB     All
  galgel     1%      0%     55%     55%
  mcf       28%     28%      1%     29%
  FFT       24%     31%     22%     68%
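For contrast with the transparent, dynamic policy studied in this project, the following sketch shows the explicit, non-transparent route on Linux kernels that provide mmap's MAP_HUGETLB flag: the application asks for huge pages itself and must fall back to base pages when none are reserved. The region size is a placeholder, and this is an illustration of the user-visible alternative, not project code.

/* Explicitly requested huge pages via mmap(MAP_HUGETLB): the non-transparent
 * alternative to the dynamic superpage policy described above.
 * Requires a kernel with hugetlbpage support and reserved huge pages, e.g.:
 *   echo 32 > /proc/sys/vm/nr_hugepages
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define REGION_SIZE (64UL * 1024 * 1024)   /* placeholder; multiple of the huge page size */

int main(void)
{
    /* Try to back the region with huge pages (superpages). */
    void *p = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

    if (p == MAP_FAILED) {
        /* No huge pages reserved (or no MAP_HUGETLB support): fall back to base
         * pages.  A transparent superpage system makes this choice inside the OS. */
        perror("mmap(MAP_HUGETLB)");
        p = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }
        printf("using base pages\n");
    } else {
        printf("using huge pages\n");
    }

    memset(p, 0, REGION_SIZE);   /* touch the region so it is actually populated */
    munmap(p, REGION_SIZE);
    return 0;
}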
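The large transpose gain quoted above comes from access patterns that touch a new page on almost every reference. The toy walk below (the matrix size is a placeholder, and the measured gap mixes cache and TLB effects) compares a row-order and a column-order traversal; rerunning it with and without superpage-backed memory is one simple way to see how much of that gap page size recovers on a given machine.

/* Toy illustration of why superpages help strided access: a column-order walk
 * of a large matrix touches a new 4KB page on every element.
 * Compile: gcc -O2 -o stride stride.c   (add -lrt on older glibc)
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 4096                           /* 4096 x 4096 doubles = 128 MB */

static double elapsed_s(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void)
{
    double *m = malloc((size_t)N * N * sizeof(double));
    struct timespec t0, t1, t2;
    double sum = 0.0;
    if (!m) return 1;

    for (size_t i = 0; i < (size_t)N * N; i++)   /* touch every page up front */
        m[i] = (double)i;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++)                  /* row order: sequential, page-friendly */
        for (int j = 0; j < N; j++)
            sum += m[(size_t)i * N + j];

    clock_gettime(CLOCK_MONOTONIC, &t1);
    for (int j = 0; j < N; j++)                  /* column order: 32KB stride, new page each access */
        for (int i = 0; i < N; i++)
            sum += m[(size_t)i * N + j];

    clock_gettime(CLOCK_MONOTONIC, &t2);
    printf("row-order %.3f s, column-order %.3f s (checksum %g)\n",
           elapsed_s(t0, t1), elapsed_s(t1, t2), sum);
    free(m);
    return 0;
}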
Summary
• Simulate ~10K nodes running OpenSSI via virtualization techniques
• System-wide tools and process management for a ~100K processor environment
• Testing Linux SMP at higher processor counts
• Scalability studies of a shared root filesystem up to ~10K nodes
• High availability: preemptive task migration
• Quantify "OS noise" at scale
• Dynamic large page sizes (superpages)

June 2005 Project Status
Work done since funding started:
  Xen and OpenSSI validation       [done]
  Xen and Lustre validation        [done]
  C/R added to OpenSSI             [done]
  IB port of OpenSSI               [done]
  Lustre Installation Manual       [Book]
  Lockmeter at 2048 CPUs           [Paper at SGIUG]
  CC impacts on apps at 512P       [Paper at SGIUG]
  Cluster Proc hooks               [Paper at OLS]
  Scaling study of OpenSSI         [Paper at COSET]
  HA OpenSSI                       [Submitted to Cluster05]
  OpenSSI Socket Migration         [Pending]
PetaSSI 1.0 release in July 2005

Collaboration ideas
• Linux on the XT3 to quantify the LWK
• Xen and other virtualization techniques
• Dynamic vs. static superpages

Questions