8. Cache Coherence, Consistency, Synchronization

Altix 4700
ccNUMA Architecture
• Distributed memory with a shared address space
Altix HLRB II – Phase 2
• 19 partitions with 9728 cores
• Each with 256 Itanium dual-core processors, i.e., 512 cores
– Clock rate 1.6 GHz
– 4 Flops per cycle per core
– 12.8 GFlop/s per processor (6.4 GFlop/s per core)
• 13 high-bandwidth partitions
– Blades with 1 processor (2 cores) and 4 GB memory
– Frontside bus 533 MHz (8.5 GB/s)
• 6 high-density partitions
– Blades with 2 processors (4 cores) and 4 GB memory
– Same memory bandwidth
• Peak performance: 62.3 TFlop/s (6.4 GFlop/s per core)
• Memory: 39 TB
Memory Hierarchy
• L1D
• 16 KB, 1 cycle latency, 25.6 GB/s bandwidth
• cache line size 64 bytes
• L2D
• 256 KB, 6 cycles, 51 GB/s
• cache line size 128 bytes
• L3
• 9 MB, 14 cycles, 51 GB/s
• cache line size 128 bytes
Interconnect
• NUMAlink 4
• 2 links per blade
• Each link 2 × 3.2 GB/s bandwidth
• MPI latency 1–5 µs
Disks
• Direct attached disks (temporary large files)
• 600 TB
• 40 GB/s bandwidth
• Network attached disks (Home Directories)
• 60 TB
• 800 MB/s bandwidth
Environment
• Footprint: 24 m x 12 m
• Weight: 103 metric tons
• Electrical power: ~1 MW
NUMAlink Building Block
[Figure: NUMAlink building block – Level-1 NUMAlink 4 routers each connect four blades (8 cores with high-bandwidth blades, 16 cores with high-density blades); I/O blades with PCI/FC adapters attach to a SAN switch and 10 GE]
Blades and Rack
Interconnection in a Partition
Interconnection of Partitions
• Gray squares
– 1 partition with 512 cores
– L: Login, B: Batch
• Lines
– 2 NUMAlink 4 planes with 16 cables
– each cable: 2 × 3.2 GB/s
Interactive Partition
[Figure: interactive partition layout – 4 OS cores, login core blocks (16 + 12 + 4), and batch core blocks]
• Login cores
• 32 for compile & test
• Interactive batch jobs
• 476 cores
• managed by PBS
– daytime interactive usage
– small-scale and nighttime batch processing
– single partition only
• High-density blades
• 4 cores share one memory bus
18 Batch Partitions
[Figure: batch partition layout – 4 OS cores plus blocks of 8 (16) batch cores]
• Batch jobs
• 510 (508) cores
• managed by PBS
• large-scale parallel jobs
• single or multi-partition jobs
• 5 partitions with high-density blades
• 13 partitions with high-bandwidth blades
Bandwidth
[Figure: MPI bandwidth (MB/s), intra-node vs. inter-node, scale 0–3000 MB/s]
Coherence Implementation
• SHUB2 supports up to 8192 SHUBs (32768 cores)
• Coherence domain of up to 1024 SHUBs (4096 cores)
• SGI term: "Sharing mode"
• Directory with one bit per SHUB
• Multiple shared copies are supported.
• Accesses from other coherence domains
• SGI term: "Exclusive sharing mode"
• Always translated into an exclusive access
• Only a single copy is supported
• Directory stores the address of the SHUB (13 bits)
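To make the two directory formats concrete, here is an illustrative C sketch; it is not SGI's actual directory layout, only the shape implied by the bullets above (a presence-bit vector per SHUB inside a sharing domain versus a single 13-bit SHUB pointer across domains; all names are ours).

/* Illustrative sketch, not SGI's directory layout. */
#include <stdint.h>

#define SHUBS_PER_DOMAIN 1024                   /* one coherence domain = 4096 cores */

/* "Sharing mode": one presence bit per SHUB, so a cache line may
 * exist as multiple shared copies within the domain. */
struct dir_entry_sharing {
    uint64_t presence[SHUBS_PER_DOMAIN / 64];   /* 1024 presence bits */
};

/* "Exclusive sharing mode": only the 13-bit number of the owning SHUB
 * is stored, so only a single (exclusive) copy can exist. */
struct dir_entry_exclusive {
    uint16_t owner_shub;                        /* 13-bit SHUB address */
};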
SHMEM Latency Model for Altix
• SHMEM get latency is the sum of:
• 80 nsec for the function call
• 260 nsec for memory latency
• 340 nsec for the first hop
• 60 nsec per hop
• 20 nsec per meter of NUMAlink cable
• Example
• 64-processor system: max hops is 4, max total cable length is 4 m.
• Total SHMEM get latency is:
1000 nsec = 80 + 260 + 340 + 60 × 4 + 20 × 4
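The latency model above can be written as a small formula; below is a minimal C sketch using the constants from the slide (the function name is ours, not an SGI API).

#include <stdio.h>

/* latency = 80 + 260 + 340 + 60*hops + 20*cable_meters   (nsec) */
static double shmem_get_latency_nsec(int hops, double cable_meters)
{
    return 80.0                   /* function call               */
         + 260.0                  /* memory latency              */
         + 340.0                  /* first hop                   */
         + 60.0 * hops            /* per hop                     */
         + 20.0 * cable_meters;   /* per meter of NUMAlink cable */
}

int main(void)
{
    /* 64 P system: 4 hops, 4 m of cable -> 1000 nsec */
    printf("%.0f nsec\n", shmem_get_latency_nsec(4, 4.0));
    return 0;
}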
Parallel Programming Models
[Figure: scope of the programming models on the Altix system – OpenMP and Pthreads within one Linux image (intra-host, 512 cores); MPI and SHMEM within a coherency domain (4096 cores) and across the entire machine; global shared segments connect the coherency domains]
Barrier Synchronization
• Frequent in OpenMP, SHMEM, MPI single sided ops
(MPI_Win_fence)
• Tree-based implementation using multiple fetch-op variables to minimize contention on the SHUB.
• Uses uncached loads to reduce NUMAlink traffic.
[Figure: two CPUs attached to a hub (SHUB) holding the fetch-op variable; the hub connects to the router]
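To illustrate the fetch-op idea, the sketch below shows a sense-reversing barrier built on a single atomic fetch-and-add counter (the operation a fetch-op variable provides). It is not SGI's implementation: the Altix distributes several such variables over a tree of SHUBs to avoid contention and spins on uncached loads, whereas this sketch uses one counter and C11 atomics.

#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_int  count;     /* arrivals in the current barrier episode */
    atomic_bool sense;     /* flips once per completed barrier */
    int         nthreads;
} fetchop_barrier_t;

void fetchop_barrier_init(fetchop_barrier_t *b, int nthreads) {
    atomic_init(&b->count, 0);
    atomic_init(&b->sense, false);
    b->nthreads = nthreads;
}

/* local_sense is thread-private and must start out as false */
void fetchop_barrier_wait(fetchop_barrier_t *b, bool *local_sense) {
    *local_sense = !*local_sense;                       /* start a new episode */
    if (atomic_fetch_add(&b->count, 1) == b->nthreads - 1) {
        atomic_store(&b->count, 0);                     /* last arriver resets ...  */
        atomic_store(&b->sense, *local_sense);          /* ... and releases the rest */
    } else {
        while (atomic_load(&b->sense) != *local_sense)
            ;                                           /* spin until released */
    }
}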
Programming Models
• OpenMP on a Linux image
• MPI
• SHMEM
• Shared segments (System V and Global Shared Memory)
SHMEM
• Can be used for MPI programs where all processes execute the same code.
• Enables access within and across partitions.
• Static data and symmetric heap data (shmalloc or shpalloc)
• info: man intro_shmem
Example
#include <stdio.h>
#include <mpi.h>
#include <mpp/shmem.h>

int main(int argc, char **argv)
{
    long source[10] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
    static long target[10];                   /* static data is symmetric */
    int myrank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    if (myrank == 0) {
        /* put 10 elements into target on PE 1 */
        shmem_long_put(target, source, 10, 1);
    }

    shmem_barrier_all();                      /* sync sender and receiver */

    if (myrank == 1)
        printf("target[0] on PE %d is %ld\n", myrank, target[0]);

    MPI_Finalize();
    return 0;
}
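The SHMEM slide above also mentions symmetric-heap data; as a variant of the example (a sketch with the same MPI setup), target can be allocated with shmalloc instead of being declared static. shmalloc and shfree come from the SGI SHMEM library and return symmetric addresses valid on every PE.

#include <stdio.h>
#include <mpi.h>
#include <mpp/shmem.h>

int main(int argc, char **argv)
{
    long source[10] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
    long *target;
    int myrank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    target = (long *) shmalloc(10 * sizeof(long));   /* symmetric heap */

    if (myrank == 0)
        shmem_long_put(target, source, 10, 1);       /* put into target on PE 1 */

    shmem_barrier_all();                             /* sync sender and receiver */

    if (myrank == 1)
        printf("target[0] on PE %d is %ld\n", myrank, target[0]);

    shfree(target);
    MPI_Finalize();
    return 0;
}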
Global Shared Memory Programming
• Allocation of a shared memory segment via the collective GSM_Alloc.
• Similar to memory-mapped files or System V shared segments, but those are limited to a single OS instance.
• A GSM segment can be distributed across partitions.
– GSM_ROUNDROBIN: pages are distributed round-robin across processes
– GSM_SINGLERANK: places all pages near a single process
– GSM_CUSTOM_ROUNDROBIN: each process specifies how many pages should be placed in its memory.
• Data structures can be placed in this memory segment
and accessed from all processes with normal load and
store instructions.
Example
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include <mpi_gsm.h>
#define ARRAY_LEN 1024

int main(int argc, char **argv)
{
    int placement = GSM_ROUNDROBIN;
    int flags = 0, rank, i, rc;
    size_t size = ARRAY_LEN * sizeof(int);
    int *shared_buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    rc = GSM_Alloc(size, placement, flags, MPI_COMM_WORLD, &shared_buf);

    /* Have one rank initialize the shared memory region */
    if (rank == 0) {
        for (i = 0; i < ARRAY_LEN; i++)
            shared_buf[i] = i;
    }
    MPI_Barrier(MPI_COMM_WORLD);

    /* Have every rank verify it can read from the shared memory */
    for (i = 0; i < ARRAY_LEN; i++) {
        if (shared_buf[i] != i) {
            printf("ERROR!! element %d = %d\n", i, shared_buf[i]);
            printf("Rank %d - FAILED shared memory test.\n", rank);
            exit(1);
        }
    }
    MPI_Finalize();
    return 0;
}
Summary
• Altix 4700 is a ccNUMA system
• >60 TFlop/s
• MPI messages sent with two-copy or single-copy
protocol
• Hierarchical coherence implementation
• Intranode
• Coherence domain
• Across coherence domains
• Programming models
• OpenMP
• MPI
• SHMEM
• GSM
The Compute Cube of LRZ
[Figure: the LRZ compute cube – re-cooling plants (Rückkühlwerke) on top, the column-free supercomputer floor (Höchstleistungsrechner), an access bridge (Zugangsbrücke), and levels for server/network, archive/backup, climate control, and electrical supply]