Altix 4700 ccNUMA Architecture
• Distributed memory – shared address space

Altix HLRB II – Phase 2
• 19 partitions with 9728 cores
• Each partition with 256 Itanium dual-core processors, i.e., 512 cores
  – Clock rate 1.6 GHz
  – 4 Flops per cycle per core
  – 12.8 GFlop/s per processor (6.4 GFlop/s per core)
• 13 high-bandwidth partitions
  – Blades with 1 processor (2 cores) and 4 GB memory
  – Frontside bus 533 MHz (8.5 GB/s)
• 6 high-density partitions
  – Blades with 2 processors (4 cores) and 4 GB memory
  – Same memory bandwidth
• Peak performance: 62.3 TFlop/s (6.4 GFlop/s per core)
• Memory: 39 TB

Memory Hierarchy
• L1D: 16 KB, 1 cycle latency, 25.6 GB/s bandwidth, 64-byte cache lines
• L2D: 256 KB, 6 cycles, 51 GB/s, 128-byte cache lines
• L3: 9 MB, 14 cycles, 51 GB/s, 128-byte cache lines

Interconnect
• NUMAlink 4
• 2 links per blade
• Each link 2 x 3.2 GB/s bandwidth
• MPI latency 1–5 µs

Disks
• Direct-attached disks (temporary large files): 600 TB, 40 GB/s bandwidth
• Network-attached disks (home directories): 60 TB, 800 MB/s bandwidth

Environment
• Footprint: 24 m x 12 m
• Weight: 103 metric tons
• Electrical power: ~1 MW

NUMAlink Building Block
• [Diagram: level-1 NUMAlink 4 routers, each connecting four compute blades – 8 cores per router with high-bandwidth blades, 16 cores with high-density blades; I/O blades with PCI/FC attach to a SAN switch and 10 GE.]

Blades and Rack Interconnection in a Partition
• [Figure only: rack-level interconnection of blades within a partition.]

Interconnection of Partitions
• [Figure: partition interconnect topology]
  – Gray squares: one partition with 512 cores each; L = login, B = batch
  – Lines: 2 NUMAlink 4 planes with 16 cables; each cable 2 x 3.2 GB/s

Interactive Partition
• [Figure: core layout of the interactive partition – 4 cores for the OS, 32 login cores, 476 cores for interactive batch]
• Login cores
  – 32 for compile & test
• Interactive batch jobs
  – 476 cores
  – managed by PBS
  – daytime interactive usage
  – small-scale and nighttime batch processing
  – single-partition jobs only
• High-density blades
  – 4 cores per memory bus

18 Batch Partitions
• [Figure: core layout of a batch partition – a few cores reserved for the OS, the remaining cores available to batch jobs in blocks of 8 (16) cores]
• Batch jobs
  – 510 (508) cores
  – managed by PBS
  – large-scale parallel jobs
  – single- or multi-partition jobs
• 5 partitions with high-density blades
• 13 partitions with high-bandwidth blades

Bandwidth
• [Chart: measured intra-node vs. inter-node bandwidth in MB/s (y-axis 0–3000)]

Coherence Implementation
• SHUB2 supports up to 8192 SHUBs (32768 cores)
• Coherence domain of up to 1024 SHUBs (4096 cores)
  – SGI term: "sharing mode"
  – Directory with one bit per SHUB
  – Multiple shared copies are supported
• Accesses from other coherence domains
  – SGI term: "exclusive sharing mode"
  – Always translated into an exclusive access
  – Only a single copy is supported
  – Directory stores the address of the SHUB (13 bits)

SHMEM Latency Model for Altix
• SHMEM get latency is the sum of:
  – 80 nsec for the function call
  – 260 nsec memory latency
  – 340 nsec for the first hop
  – 60 nsec per hop
  – 20 nsec per meter of NUMAlink cable
• Example
  – 64-processor system: max hops is 4, max total cable length is 4 m
  – Total SHMEM get latency: 1000 nsec = 80 + 260 + 340 + 60 x 4 + 20 x 4
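
The model above can be turned into a small calculator. The C sketch below simply transcribes the slide's constants; the helper name shmem_get_latency and the constant names are illustrative only and are not part of any SGI library.

#include <stdio.h>

/* Latency components of the slide's model, in nanoseconds. */
#define T_CALL      80.0   /* function-call overhead      */
#define T_MEMORY   260.0   /* memory latency              */
#define T_FIRSTHOP 340.0   /* first router hop            */
#define T_PERHOP    60.0   /* per router hop              */
#define T_PERMETER  20.0   /* per meter of NUMAlink cable */

/* Estimated SHMEM get latency for a path with the given number of
   router hops and meters of NUMAlink cable. */
static double shmem_get_latency(int hops, double meters)
{
    return T_CALL + T_MEMORY + T_FIRSTHOP
         + T_PERHOP * hops + T_PERMETER * meters;
}

int main(void)
{
    /* 64-processor example: at most 4 hops and 4 m of cable. */
    printf("max SHMEM get latency: %.0f nsec\n", shmem_get_latency(4, 4.0));
    return 0;
}

Running it reproduces the 1000 nsec figure from the example above.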
Parallel Programming Models
• Intra-host (512 cores, one Linux image): OpenMP, Pthreads, MPI
• Intra-coherence-domain (4096 cores) and across the entire machine: MPI, SHMEM, global shared segments
• [Figure: an Altix system with two Linux images in two coherence domains; global segments span both domains]

Barrier Synchronization
• Frequent in OpenMP, SHMEM, and MPI one-sided operations (MPI_Win_fence)
• Tree-based implementation using multiple fetch-op variables to minimize contention on the SHUB
• Uses uncached loads to reduce NUMAlink traffic (a generic sketch of a fetch-op-based barrier is given at the end of this section)
• [Figure: CPUs attached to a SHUB; the fetch-op variable resides in the SHUB, which connects to the router]

Programming Models
• OpenMP within a Linux image
• MPI
• SHMEM
• Shared segments (System V and Global Shared Memory)

SHMEM
• Can be used in MPI programs where all processes execute the same code
• Enables access within and across partitions
• Static data and symmetric heap data (shmalloc or shpalloc)
• Info: man intro_shmem

Example
#include <stdio.h>
#include <mpi.h>
#include <mpp/shmem.h>

int main(int argc, char **argv)
{
    long source[10] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
    static long target[10];
    int myrank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    if (myrank == 0) {
        /* put 10 elements into target on PE 1 */
        shmem_long_put(target, source, 10, 1);
    }

    shmem_barrier_all();   /* sync sender and receiver */

    if (myrank == 1)
        printf("target[0] on PE %d is %ld\n", myrank, target[0]);

    MPI_Finalize();
    return 0;
}

Global Shared Memory Programming
• Allocation of a shared memory segment via the collective GSM_Alloc
• Similar to memory-mapped files or System V shared segments, but those are limited to a single OS instance
• A GSM segment can be distributed across partitions
  – GSM_ROUNDROBIN: pages are distributed round-robin across the processes
  – GSM_SINGLERANK: places all pages near a single process
  – GSM_CUSTOM_ROUNDROBIN: each process specifies how many pages should be placed in its memory
• Data structures can be placed in this memory segment and accessed from all processes with normal load and store instructions

Example
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include <mpi_gsm.h>

#define ARRAY_LEN 1024

int main(int argc, char **argv)
{
    int rank, rc, i;
    int placement, flags;
    size_t size;
    int *shared_buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    placement = GSM_ROUNDROBIN;
    flags = 0;
    size = ARRAY_LEN * sizeof(int);
    rc = GSM_Alloc(size, placement, flags, MPI_COMM_WORLD, &shared_buf);

    /* Have one rank initialize the shared memory region */
    if (rank == 0) {
        for (i = 0; i < ARRAY_LEN; i++) {
            shared_buf[i] = i;
        }
    }
    MPI_Barrier(MPI_COMM_WORLD);

    /* Have every rank verify it can read from the shared memory */
    for (i = 0; i < ARRAY_LEN; i++) {
        if (shared_buf[i] != i) {
            printf("ERROR!! element %d = %d\n", i, shared_buf[i]);
            printf("Rank %d - FAILED shared memory test.\n", rank);
            exit(1);
        }
    }

    MPI_Finalize();
    return 0;
}

Summary
• Altix 4700 is a ccNUMA system
• >60 TFlop/s peak performance
• MPI messages are sent with a two-copy or single-copy protocol
• Hierarchical coherence implementation
  – intra-node
  – within a coherence domain
  – across coherence domains
• Programming models
  – OpenMP
  – MPI
  – SHMEM
  – GSM

The Compute Cube of LRZ
• [Building diagram with German labels, translated: re-cooling units (Rückkühlwerke), supercomputer hall (column-free), access bridge, server/network, archive/backup, air conditioning (Klima), electrical (Elektro).]
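
As referenced in the barrier-synchronization slide, the sketch below illustrates the basic idea of a barrier built on a fetch-op variable: every arriving thread performs one atomic fetch-and-decrement and then spins on a generation flag. It is a minimal, generic C11/Pthreads illustration, not SGI's implementation; the Altix barrier is tree-based, spreads the counters over multiple SHUB-resident fetch-op variables, and polls with uncached loads, none of which is reproduced here. NTHREADS and all function names are chosen for the example.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NTHREADS 8

/* One fetch-op counter plus a generation flag (sense reversal). */
static atomic_int count = NTHREADS;
static atomic_int generation = 0;

static void barrier(void)
{
    int my_gen = atomic_load(&generation);

    /* Arrival: a single atomic fetch-and-decrement on the shared counter. */
    if (atomic_fetch_sub(&count, 1) == 1) {
        /* Last arrival: re-arm the counter, then release the waiters. */
        atomic_store(&count, NTHREADS);
        atomic_fetch_add(&generation, 1);
    } else {
        /* Wait until the last arrival bumps the generation flag. */
        while (atomic_load(&generation) == my_gen)
            ;
    }
}

static void *worker(void *arg)
{
    long id = (long)arg;
    printf("thread %ld before barrier\n", id);
    barrier();
    printf("thread %ld after barrier\n", id);
    return NULL;
}

int main(void)
{
    pthread_t threads[NTHREADS];

    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&threads[i], NULL, worker, (void *)i);
    for (long i = 0; i < NTHREADS; i++)
        pthread_join(threads[i], NULL);
    return 0;
}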