Application of General Purpose HPC Systems in HPEC David Alexander Director of Engineering SGI, Inc. dca 8/2003 SGI HPC System Architecture dca 8/2003 SGI Scalable ccNUMA Architecture Basic Node Structure and Interconnect C A C H E CPU CPU Processor Interface (>6GB/sec) Interface Chip C A C H E C A C H E NUMAlink Interconnect (>6GB/sec) Memory Interface (>8GB/sec) CPU CPU C A C H E Processor Interface (>6GB/sec) Interface Chip Memory Interface (>8GB/sec) Physical Memory (2-16GB) Physical Memory (2-16GB) Physical Memory SGI Scalable ccNUMA Architecture Scaling to Large Node Counts CPU CPU CPU CPU CPU CPU CPU CPU Interface Chip Interface Chip Interface Chip Interface Chip Memory Memory Memory Memory System Interconnect Fabric • Scaling to 100’s of processors • RT Memory Latency < 600ns worst case (64p config) • Bi-section bandwidth >25GB/sec (64p config) SGI Scalable ccNUMA Architecture SGI® NUMAflex™ Modular Design System Interconnect CPU & Memory System I/O Standard I/O Expansion Performance: High-bandwidth interconnect with very low latency Flexibility: Tailored configurations for different dimensions of scalability Investment protection: Add new technologies as they evolve Scalability: No central bus or switch; just modules and NUMAlink™ cables High BW I/O Expansion Graphics Expansion Storage Expansion Modules from SGI® Onyx® 3000 Series SGI Scalable ccNUMA Architecture The Benefits of Shared Memory Traditional Clusters SGI® Altix™ 3000 Commodity interconnect mem mem mem mem mem node + OS node + OS node + OS node + OS node + OS ... mem node + OS Fast NUMAflex™ interconnect Global shared memory node + OS node + OS node ... + OS node + OS What is shared memory? • All nodes operate on one large shared memory space, instead of each node having its own small memory space Shared memory is high-performance • All nodes can access one large memory space efficiently, so complex communication and data passing between nodes aren’t needed • Big data sets fit entirely in memory; less disk I/O is needed Shared memory is cost-effective and easy to deploy • The SGI Altix 3000 family supports all major parallel programming models • It requires less memory per node, because large problems can be solved in big shared memory • Simpler programming means lower tuning and maintenance costs SGI® Altix™ 3000 HPC Product dca 8/2003 SGI® Altix™ 3000 Overview First Linux® node with 64 CPUs in single-OS image First clusters with global shared memory across multiple nodes First Linux solution with HPC systemand data-management tools World-record performance for floatingpoint calculations, memory performance, I/O bandwidth, and real technical applications SGI® Altix™ 3000 Fusion of Powerhouse Technologies SGI® supercomputing technology, Intel’s most advanced processors, and open-source computing Global shared memory (64-bit shared memory across cluster nodes) SGI ProPack™ and supercomputing enhancements SGI® NUMAlink™ (Built-in high-bandwidth, low-latency interconnect) SGI® NUMAflex™ (Third-generation modular supercomputing architecture) Intel® Itanium® 2 processor family Industry-standard Linux® SGI® Altix™ 3000 Family Extreme Power, Extreme Potential Model 3700 Superclusters Model 3300 Servers Single-node entry offering 4–12 1.30 GHz Itanium® 2 processors, 3MB L3 cache Up to 96GB memory Scalable performance offering 4–64 Itanium 2 processors in a single node 1.30 GHz/3MB L3 cache 1.50 GHz/6MB L3 cache Shared memory across nodes Scalable to 2,048 processors, 16TB memory Nodes up to 64P, 4TB memory Multi-Paradigm Architecture dca 8/2003 Multi-Paradigm Architecture Overview R • NUMAlink system interconnect C • General-purpose compute nodes IO MSE V MSE C C • Integrated graphics/visualization IO R R MSE C C C C V C • Peer-attached general purpose I/O • Mission-specific accel. and/or I/O (MSE) IO C MSE C R IO R IO C MSE C Multi-Paradigm Architecture Mission Specific Element (MSE) Memory System Interconnect Fabric >5GB/s TIO ASIC FPGA, DSP, or ASIC I/O NL4 Ports (>5GB/s x 2) MSE Implementation variants: • Customer supplied/loaded FPGA algorithms and/or ASIC/DSP • Subroutine or standard library acceleration • Specific use “appliance” Embedded High Performance Computing (EHPC) dca 8/2003 Embedded High Performance Computing Unified Development/Deployment Environment EHPC Today EHPC Vision Different Development and Deployment Environments Unified Development and Deployment Development Environment HPC methods, tools, hardware Architecture provides freedom to develop new approaches Months of work Millions of dollars Significant schedule risk Significant architectural/performance risk Deployment Proprietary hardware and software “Islands of memory” narrows algorithm options Scalable ccNUMA System Architecture SGI EHPC System Mission-Specific Processing Embedded Systems Packaging Platform Port Seamless HPC Application Development Model Enhanced software development productivity Superior performance and HW utilization Demonstration to deployment in days Benefits from mainstream HPC advances Embedded High Performance Computing EHPC Platform Seamless HPC Application Development Model • Unified software environment for development, prototype and deployment • Full multi-paradigm computing support • Field-upgradeable mission specific processing acceleration • Highly optimized performance per watt and performance per slot • Design to fit into established environmentals, form factors, and interfaces • • • • 6U Eurocard form factor Passive, slot configurable backplane PMC module connectivity to standard interfaces Able to address oceanic, ground and airborne environmentals SGI EHPC System Mission-Specific Processing Embedded Systems Packaging Scalable ccNUMA System Architecture Embedded High Performance Computing Family of 6U Form Factor Modules 6U Form Factor 6U Form Factor Cache Cache uP uP PMC Module Site PIC Memory Memory DDR SDRAM DDR SDRAM Memory Memory DDR SDRAM DDR SDRAM Memory Memory DDR SDRAM DDR SDRAM Memory Memory DDR SDRAM DDR SDRAM PMC Module Site NUMAlink Controller NUMAlink4 Ports NUMAlink Controller System Control Scalable General Purpose Processor/Memory Blade Diagnostic & L1 Ports System Control General Purpose I/O Blade NUMAlink4 Ports Diagnostic & L1 Ports Embedded High Performance Computing Family of 6U Form Factor Modules 6U Form Factor 6U Form Factor Memory FPGA External I/O (Optional) TIO NL4 Port System Control 6.4GB/s FPGA Memory TIO NUMAlink4 Port ASIC Diagnostic & L1 Ports NUMAlink4 Port External I/O (Optional) ASIC TIO NL4 Port FPGA 6.4GB/s FPGA Memory System Control TIO Memory Mission-Specific Accelerator and/or I/O Blades NUMAlink4 Port Diagnostic & L1 Ports NUMAlink4 Port www.sgi.com ©2002 Silicon Graphics, Inc. All rights reserved. Silicon Graphics, SGI, IRIX, Challenge, Origin, Onyx, and the SGI logo are registered trademarks and OpenVault, FailSafe, Altix, XFS, and CXFS are trademarks of Silicon Graphics, Inc., in the U.S. and/or other countries worldwide. MIPS is a registered trademark of MIPS Technologies, Inc., used under license by Silicon Graphics, Inc. UNIX is a registered trademark of The Open Group in the U.S. and other countries. Intel and Itanium are registered trademarks of Intel Corporation. Linux is a registered trademark of Linus Torvalds. Mac is a registered trademark of Apple Computer, Inc. Windows is a registered trademark or trademark of Microsoft Corporation in the United States and/or other countries.Red Hat is a registered trademark of Red Hat, Inc. All other trademarks are the property of their respective owners. (1/03)