System-Area Networks (SAN) Group
Distributed Shared-Memory Parallel Computing with UPC on SAN-based Clusters
Appendix for Q2 Status Report
DOD Project MDA904-03-R-0507
November 5, 2003

Outline
- Objectives and Motivations
- Background
- Related Research
- Approach
- Results
- Conclusions and Future Plans

Objectives and Motivations
- Objectives
  - Support advancements for HPC with Unified Parallel C (UPC) on cluster systems exploiting high-throughput, low-latency system-area networks (SANs)
  - Design and analysis of tools to support UPC on SAN-based systems
  - Benchmarking and case studies with key UPC applications
  - Analysis of tradeoffs in application, network, service, and system design
- Motivations
  - Increasing demand in the sponsor and scientific computing community for shared-memory parallel computing with UPC
  - New and emerging technologies in system-area networking and cluster computing: Scalable Coherent Interface (SCI), Myrinet, InfiniBand, QsNet (Quadrics), Gigabit Ethernet and 10 Gigabit Ethernet, PCI Express (3GIO)
  - Clusters offer excellent cost-performance potential
- [Figure: layered stack showing UPC over intermediate layers over the network layer, with SCI, Myrinet, InfiniBand, QsNet, 1/10 Gigabit Ethernet, and PCI Express at the network layer]

Background
- Key sponsor applications and developments toward shared-memory parallel computing with UPC
  - More details from the sponsor are requested
- UPC
  - UPC extends the C language to exploit parallelism (a minimal example follows this slide)
  - Currently runs best on shared-memory multiprocessors (notably HP/Compaq's UPC compiler)
  - First-generation UPC runtime systems are becoming available for clusters (MuPC, Berkeley UPC)
- Significant potential advantage in cost-performance ratio with COTS-based cluster configurations
  - Leverage economy of scale
  - Clusters exhibit low cost relative to tightly coupled SMP, CC-NUMA, and MPP systems
  - Scalable performance with commercial off-the-shelf (COTS) technologies
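As a concrete illustration of the UPC model referenced above, the sketch below shows a minimal parallel vector add using standard UPC constructs (shared arrays, upc_forall with an affinity expression, upc_barrier). The array names and sizes are illustrative only; this is not code from the project.

```c
/* Minimal UPC sketch (illustrative, not project code): a parallel vector
 * add in which each thread touches only the elements it has affinity to. */
#include <upc.h>
#include <stdio.h>

#define N 1024   /* elements per thread (hypothetical size) */

shared double a[N*THREADS], b[N*THREADS], c[N*THREADS];

int main(void)
{
    int i;

    /* Default cyclic layout: element i has affinity to thread i % THREADS.
     * The affinity expression &a[i] makes each thread run only its own
     * iterations, so all accesses below are local. */
    upc_forall (i = 0; i < N*THREADS; i++; &a[i]) {
        b[i] = (double)i;
        c[i] = 2.0 * i;
    }
    upc_barrier;

    upc_forall (i = 0; i < N*THREADS; i++; &a[i])
        a[i] = b[i] + c[i];
    upc_barrier;

    if (MYTHREAD == 0)
        printf("vector add completed on %d threads\n", THREADS);
    return 0;
}
```

A program like this can be built with any of the compilers mentioned in these slides (e.g., Berkeley UPC's upcc driver or the HP/Compaq UPC compiler).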
Related Research
- University of California at Berkeley: UPC runtime system; UPC-to-C translator; Global-Address-Space Networking (GASNet) design and development; application benchmarks
- George Washington University: UPC specification; UPC documentation; UPC testing strategies and test suites; UPC benchmarking; UPC collective communications; parallel I/O
- Michigan Tech University: Michigan Tech UPC (MuPC) design and development; UPC collective communications; memory-model research; programmability studies; test-suite development
- Ohio State University: UPC benchmarking
- HP/Compaq: UPC compiler
- Intrepid: GCC UPC compiler

Approach
- Benchmarking: exploiting SAN strengths for UPC
  - Design and develop a new SCI conduit for GASNet in collaboration with UCB/LBNL
  - Evaluate DSM for SCI as an option for executing UPC
- Performance analysis
  - Use and design of applications in UPC to grasp key concepts and understand performance issues
  - Network communication experiments
  - UPC computing experiments
- Emphasis on SAN options and tradeoffs
- UF HCS Lab activities
  - HP/Compaq UPC Compiler V2.1 running in the lab on a new ES80 AlphaServer (Marvel)
  - Support of testing by OSU, MTU, UCB/LBNL, UF, et al. with leading UPC tools and systems for functional performance evaluation
  - Field test of the newest compiler and system
- [Figure: collaboration diagram by layer -
  Upper layers (applications, translators, documentation): Michigan Tech - benchmarks, modeling, specification; UC Berkeley - benchmarks, UPC-to-C translator, specification; GWU - benchmarks, documents, specification; Ohio State - benchmarks;
  Middle layers (runtime systems, interfaces): Michigan Tech - UPC-to-MPI translation and runtime system; UC Berkeley - C runtime system, upper levels of GASNet; HP - UPC runtime system on AlphaServer; UF HCS Lab - GASNet collaboration, beta testing;
  Lower layers (API, networks): UC Berkeley - GASNet; UF HCS Lab - GASNet collaboration, network performance analysis; networks - SCI, Myrinet, InfiniBand, Quadrics, GigE, 10GigE, etc.]

GASNet/SCI
- GASNet
  - The Berkeley UPC runtime system operates over the GASNet layer
  - Communication APIs for implementing global-address-space SPMD languages
  - Network- and language-independent
  - Two layers: Core (Active Messages) and Extended (shared-memory interface to networks)
- SCI conduit for GASNet
  - Implements GASNet on the SCI network
  - Core API implementation nearly completed via the Dolphin SISCI API (Active Messages on SCI)
  - Uses the best aspects of SCI for higher performance: PIO for small transfers, DMA for large transfers, and writes instead of reads
- GASNet performance analysis on SANs
  - Evaluate GASNet and the Berkeley UPC runtime on a variety of networks
  - Network tradeoffs and performance analysis through benchmarks and applications
  - Comparison with data obtained from UPC on shared-memory architectures (performance, cost, scalability)
- [Figure: GASNet software stack - UPC code and compiler-generated C code over the UPC runtime system (language-independent), over the GASNet Extended and Core APIs (network-independent), over the network, with a direct-access path from the Extended API to the network]

GASNet/SCI - Core API Design
- Three types of shared-memory regions (N = number of nodes in the system)
  - Control (ConReg): stores the Message Ready Flags (MRF, N x 4 total) and the Message Exist Flag (MEF, 1 total); SCI operation: direct shared-memory write (*x = value)
  - Command (ComReg): stores 2 pairs of request/reply messages, each with an AM header (80 B) and a Medium AM payload (up to 944 B); SCI operation: memcpy() write
  - Payload (PayReg): stores the Long AM payload; SCI operation: DMA write
- Initialization phase (all nodes)
  - Export 1 ConReg, N ComRegs, and 1 PayReg
  - Import N ConRegs and N ComRegs
  - Connect to the N payload regions
  - Create one local DMA queue for Long AM transfers
- [Figure: node X's virtual address space importing Control 1..N and Command 1-X..N-X; node Y's SCI (physical address) space exporting Control Y, Command Y-1..Y-N, and Payload Y, with local regions marked in-use or free]

GASNet/SCI - Core API Design (continued)
- Execution phase: the sender sends a message through the three types of shared-memory regions (a simplified C sketch follows this slide)
  - Short AM: (1) send the AM header; (2) set the appropriate MRF and the MEF to 1
  - Medium AM: (1) send the AM header + Medium AM payload; (2) set the appropriate MRF and the MEF to 1
  - Long AM: (1) import the destination's payload region and prepare the source's payload data; (2) send the Long AM payload + AM header; (3) set the appropriate MRF and the MEF to 1
- The receiver polls for new messages
  - (1) Check the work queue; if it is not empty, skip to step 4
  - (2) If the work queue is empty, check whether MEF = 1
  - (3) If MEF = 1, set MEF = 0, check the MRFs, enqueue the corresponding messages, and set each MRF = 0
  - (4) Dequeue a message from the work queue
  - (5) Read from the ComReg and execute the message
  - (6) Send a reply to the sender; repeat for all messages
- [Figure: node X to node Y message flow - header construction and Medium AM payload into the dedicated command region, Long AM payload preparation and DMA transfer into the payload region, message-ready signal via the control region, and a transfer-complete signal back; plus a flowchart of the receiver polling steps above]
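The sketch below restates the sender/receiver protocol of the two Core API slides in plain C. It is a simplified reconstruction from the slides, not the conduit's source: the type and function names (con_reg_t, send_medium_am, poll_once) are hypothetical, the work queue is collapsed into immediate execution, and the SISCI mapping calls are assumed to have already made the remote Control and Command regions writable through ordinary pointers, so plain stores and memcpy become PIO writes on the SCI link.

```c
/* Simplified, single-threaded sketch of the Core API message path described
 * above.  Names are illustrative, not the actual conduit source; remote
 * regions are assumed to be SCI-mapped into local address space. */
#include <string.h>

#define MAX_NODES      8
#define AM_HDR_BYTES   80
#define AM_MED_MAX     944

typedef struct {                    /* one slot of a Command (ComReg) region */
    char hdr[AM_HDR_BYTES];
    char payload[AM_MED_MAX];
} cmd_slot_t;

typedef struct {                    /* Control (ConReg) region of one node   */
    volatile int mrf[MAX_NODES][4]; /* Message Ready Flags (N x 4)           */
    volatile int mef;               /* Message Exist Flag (1 per node)       */
} con_reg_t;

/* Imported (remote) regions, one per peer; set up during initialization.    */
extern con_reg_t  *remote_con[MAX_NODES];
extern cmd_slot_t *remote_cmd[MAX_NODES];   /* our dedicated slots on peer   */

/* Sender side, Medium AM: copy header + payload, then raise the flags.      */
void send_medium_am(int dest, int me, const void *hdr,
                    const void *payload, int nbytes)
{
    cmd_slot_t *slot = &remote_cmd[dest][0];        /* pick a free slot      */
    memcpy(slot->hdr, hdr, AM_HDR_BYTES);           /* step 1: header+payload */
    memcpy(slot->payload, payload, nbytes);
    remote_con[dest]->mrf[me][0] = 1;               /* step 2: flags -> 1    */
    remote_con[dest]->mef        = 1;
}

/* Receiver side: poll the MEF, scan the MRFs, execute, and reply.           */
void poll_once(con_reg_t *my_con, int nnodes,
               void (*execute)(int src, int slot),
               void (*reply)(int src))
{
    if (my_con->mef) {
        my_con->mef = 0;                            /* steps 2-3             */
        for (int src = 0; src < nnodes; src++)
            for (int s = 0; s < 4; s++)
                if (my_con->mrf[src][s]) {
                    my_con->mrf[src][s] = 0;
                    execute(src, s);                /* steps 4-5 (work queue */
                    reply(src);                     /* omitted); then reply  */
                }
    }
}
```

The ordering matters: the header and payload are written before the flags, so a receiver that observes MEF = 1 always finds a complete message in the command region.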
GASNet/SCI - Experimental Setup & Analysis
- Testbed
  - Dual 2.4 GHz Intel Xeon, 1 GB DDR PC2100 (DDR266) RAM, Intel SE7501BR2 server motherboard with E7501 chipset
  - 5.3 Gb/s Dolphin SCI D337 (2D/3D) NICs, 4x2 torus
- Independent test variables
  - Maximum request/reply pairs: Raw = N/A; Pre-Integration = 2; all other tests = the number in parentheses
  - Barrier types: none (Raw and Pre-Integration); GASNet reference extended API (AM-based, many-to-one / one-to-many); SCI conduit modified (shared-memory based, all-to-all)
- Test setup
  - SCI Raw / Pre-Integration: Long (DMA) via SISCI's dma_bench and in-house code; Short/Medium (PIO) via in-house ping-pong code
  - GASNet SCI conduit, Berkeley GASNet test suite: Graphs 1 and 2 use testsmall (AM Medium only); Graphs 3 and 4 use testlarge (payloads <= 512 bytes transferred by AM Medium, payloads > 512 bytes by AM Long)
  - In-house code (pure Core API AM transfer) used with modified barriers: Graphs 1 and 2 use only AM Medium; Graphs 3 and 4 use only AM Long
- Barrier wait time (2 nodes / 4 nodes)
  - Reference barrier (20) = 30.329 µs / 47.236 µs
  - Modified barrier (2) = 4.632 µs / 5.252 µs
  - Modified barrier (20) = 4.661 µs / 5.305 µs
- Pre-Integration results for Short AM latency [(RTT with barrier) / 2], measured on a PIII/1 GHz (Xeon results not yet available)
  - Raw = 3.62 µs
  - Pre-Integration = 6.66 µs
  - Reference barrier (20) = 13.267 µs
  - Modified barrier (2) = 11.144 µs
  - Modified barrier (20) = 4.148 µs

GASNet/SCI - Core API Latency / Throughput Results
- [Graphs 1-4: latency (µs) and throughput (MB/s) vs. payload size (bytes) comparing SCI Raw (PIO and DMA) with the SCI conduit (Pre-Integration) under the reference barrier (20), modified barrier (2), and modified barrier (20); annotations mark "unexpected behavior" in the Medium AM throughput curve and the "comm. protocol shift" in the large-payload curves]

GASNet/SCI - Analysis
- Communication protocol shift (sketched after this slide)
  - gasnet_put / gasnet_get use AM Medium for payload sizes up to MaxAMMedium() = 512 bytes
  - AM Long is used for payloads larger than MaxAMMedium()
- Unexpected behavior
  - In-house throughput test (AM Medium); under investigation
- 2 request/reply pairs + GASNet reference barrier
  - High latency (> 1000 µs)
  - Small throughput (< 1 MB/s up to a 64 KB message size)
- Performance improvement strategies
  - Increase the number of request/reply pairs (contention reduction vs. memory requirement)
  - Use shared-memory barriers (part of the SCI Extended API)
- Latency and throughput approach the SCI Raw values and can be improved with further optimization of the Core API and realization of the Extended API
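For reference, the protocol shift described above has the shape sketched below. This is an illustrative rendering of how a put can be layered on the Core API, not the conduit's actual code; the handler indices are hypothetical, and the slide's MaxAMMedium() corresponds to GASNet's gasnet_AMMaxMedium() limit query (512 bytes for this conduit).

```c
/* Illustrative put path over the GASNet Core API, showing the Medium/Long
 * protocol shift; handler indices are hypothetical, not conduit source. */
#include <gasnet.h>

#define PUT_MEDIUM_HANDLER 201   /* hypothetical client handler indices */
#define PUT_LONG_HANDLER   202

void sketch_put(gasnet_node_t node, void *dst, void *src, size_t nbytes)
{
    if (nbytes <= gasnet_AMMaxMedium()) {
        /* Small payloads ride inside a Medium AM (<= 512 B on this conduit);
         * the handler on the target copies them to dst, whose address the
         * reference implementation passes via handler arguments. */
        gasnet_AMRequestMedium0(node, PUT_MEDIUM_HANDLER, src, nbytes);
    } else {
        /* Larger payloads use a Long AM, which delivers the data directly
         * to dst (a DMA write on the SCI conduit). */
        gasnet_AMRequestLong0(node, PUT_LONG_HANDLER, src, nbytes, dst);
    }
}
```

Sweeping message sizes across this 512-byte boundary is what produces the knee labeled "comm. protocol shift" in Graphs 3 and 4.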
GASNet/SCI - Large Segment Support
- Many UPC applications require large amounts of memory
  - The SISCI API does not inherently support segments larger than 1 MB on Linux (no kernel support for large contiguous memory allocation)
  - Without larger segments, practical UPC applications cannot be run
- Possible solutions
  - BigPhysArea patch
    - An available kernel patch that allows large blocks of memory to be set aside at boot time
    - Enables SISCI to create large SCI segments, with no performance impact
    - Limits the physical memory usable by other programs; the reserved memory is available only through special function calls
    - Not an elegant solution: the kernel on every machine must be patched, and any upgrade or major OS change requires re-patching
  - Working with Dolphin to enhance their driver
    - The Address Translation Table (ATT) has 16K entries per node, supporting 64 MB (16K x 4 KB) to 1 GB (16K x 64 KB) with different page sizes
    - Currently works only with contiguous physical memory
    - Extension to support multiple pages per entry for non-contiguous memory

UPC - STREAM Memory Benchmark
- Sustained memory bandwidth benchmark
  - Measures memory access time with and without processing: Copy a[i] = b[i], Scale a[i] = c*b[i], Add a[i] = b[i] + c[i], Triad a[i] = d*b[i] + c[i]
  - Implemented in UPC and MPI for comparative memory access performance
  - An embarrassingly parallel benchmark: synchronization only at the beginning and end of every function (UPC, MPI), so execution performance should scale linearly; an indication of scalable communication
  - Address translation at the end of each block (UPC): check for correct thread affinity, since nodes only act on shared memory with affinity to themselves (see the sketch after this slide)
- Experimental result
  - First-ever UPC benchmark on native SCI
  - Block size of 100, MPICH 1.2.5, SCI conduit without the barrier enhancement, Berkeley UPC runtime system 1.0.1
  - Linear performance gain with the SCI conduit even without the Extended API implemented
  - All tests show the same trend; only the Triad result is shown
- [Graph: sustained bandwidth (MB/s, 0-6000) for 1, 2, and 4 threads - UPC over the MPI conduit, UPC over the SCI conduit, and MPI]
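To make the affinity check concrete, below is a minimal UPC rendering of the Triad kernel. The block size mirrors the 100-element blocks used in the tests, but the array names and structure are illustrative rather than the exact benchmark source.

```c
/* Sketch of the STREAM Triad kernel in UPC (illustrative, not the exact
 * benchmark source); BLOCK mirrors the block size used in the experiments. */
#include <upc.h>

#define BLOCK 100
#define N     (BLOCK * THREADS)

shared [BLOCK] double a[N], b[N], c[N];

void triad(double d)
{
    int i;

    upc_barrier;                         /* synchronize only at the start... */

    /* The affinity expression &a[i] restricts each thread to the blocks it
     * owns, so every access below is to shared memory with local affinity. */
    upc_forall (i = 0; i < N; i++; &a[i])
        a[i] = d * b[i] + c[i];

    upc_barrier;                         /* ...and at the end of the kernel  */
}
```

Timing the kernel and dividing the bytes moved (three 8-byte accesses per element) by the elapsed time yields sustained-bandwidth figures like those plotted above.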
Network Performance Tests
- Detailed understanding of high-performance cluster interconnects
  - Identifies suitable networks for UPC over clusters
  - Aids in smooth integration of interconnects with upper-layer UPC components
  - Enables optimization of network communication, both unicast and collective
- Various levels of network performance analysis
  - Low-level tests
    - InfiniBand, based on the Virtual Interface Provider Library (VIPL)
    - SCI, based on Dolphin SISCI and Scali SCI
    - Myrinet, based on Myricom GM
    - QsNet, based on the Quadrics Elan communication library
    - Host architecture issues (e.g., CPU, I/O)
  - Mid-level tests
    - Sockets: Dolphin SCI Sockets on SCI; BSD sockets on Gigabit and 10 Gigabit Ethernet; GM Sockets on Myrinet; SOVIA on InfiniBand
    - MPI: InfiniBand and Myrinet based on MPI/Pro; SCI based on ScaMPI and SCI-MPICH
- [Figure: layered stack showing UPC over intermediate layers over the network layer, with SCI, Myrinet, InfiniBand, QsNet, 1/10 Gigabit Ethernet, and PCI Express at the network layer]

Gigabit Ethernet Performance Tests
- Low-cost and widely deployed
  - Standardized TCP communication protocol: high overhead, low performance
  - Promising lightweight communication protocol: M-VIA (NERSC), a Virtual Interface Architecture implementation for Linux
  - Mesh topology for a Gigabit Ethernet cluster (Jefferson National Lab): a direct or indirect, scalable system architecture with a high performance/cost ratio
  - Suitable platform for large-scale UPC applications
- One-way latency tests (a minimal ping-pong sketch follows this slide)
  - TCP: NetPIPE
  - MPI over TCP: MPICH and MPI/Pro
  - GASNet over MPI/TCP: MPICH
- Results
  - ~60 µs one-way latency for TCP communication
  - The commercial MPI implementation (MPI/Pro) performs better for large messages
  - ~100 µs GASNet overhead for small messages: each send/receive includes the RPC associated with the AM
  - For large messages the GASNet overhead is negligible
- [Graph: one-way latency (µs, 0-1200) vs. message size (0 B to 32 KB) for TCP, MPI/Pro over TCP, MPICH over TCP, and GASNet over MPICH/TCP]
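The one-way latencies above come from ping-pong style measurements, reported as half the averaged round-trip time. A minimal MPI version of such a test is sketched below; the message size, repetition count, and tag are illustrative, not the parameters used in the experiments.

```c
/* Minimal MPI ping-pong latency sketch (illustrative parameters). */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define REPS      1000
#define MSG_BYTES 8
#define TAG       7

int main(int argc, char **argv)
{
    char buf[MSG_BYTES];
    int rank, i;
    double t0, rtt;
    MPI_Status st;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    memset(buf, 0, sizeof buf);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < REPS; i++) {
        if (rank == 0) {                     /* send, then wait for the echo */
            MPI_Send(buf, MSG_BYTES, MPI_CHAR, 1, TAG, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_BYTES, MPI_CHAR, 1, TAG, MPI_COMM_WORLD, &st);
        } else if (rank == 1) {              /* echo the message back        */
            MPI_Recv(buf, MSG_BYTES, MPI_CHAR, 0, TAG, MPI_COMM_WORLD, &st);
            MPI_Send(buf, MSG_BYTES, MPI_CHAR, 0, TAG, MPI_COMM_WORLD);
        }
    }
    rtt = (MPI_Wtime() - t0) / REPS;         /* average round-trip time (s)  */

    if (rank == 0)
        printf("one-way latency: %.2f usec\n", rtt / 2.0 * 1e6);

    MPI_Finalize();
    return 0;
}
```

Run with two processes (e.g., mpirun -np 2) and swept over message sizes, a test of this form produces curves like those in the latency graph above.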
On-going SAN Tests and Future Plans
- Quadrics (QsNet)
  - Features: fat-tree topology; low latency and high bandwidth; work offloading from the host CPU for communication events
  - Potential: existing GASNet implementation
  - Initial results: bandwidth from Elan to main memory of 195.73 MB/s; latency of 3 µs
  - Future plans: low-level network performance tests; performance evaluation of GASNet on Quadrics; UPC benchmarking on Quadrics-based clusters
  - [Graph: "Quadrics Memory Results" - bandwidth (MB/s, 0-250) vs. message size (bytes, 40 to 100K)]
- InfiniBand
  - Features: based on VIPL; industry standard; low latency and high bandwidth; reliability; performance increase with NICs
  - Potential: existing MPI 1.2 implementation; existing GASNet conduit
  - Future plans: low-level network performance tests; performance evaluation of GASNet on InfiniBand; UPC benchmarking on InfiniBand-based clusters

On-going Architectural Tests and Future Plans
- Pentium 4 Xeon
  - Features: Intel NetBurst microarchitecture; 32-bit processor; Hyper-Threading technology for increased CPU utilization; RISC processor core; 4.3 GB/s I/O bandwidth
  - Future plans: performance and functionality evaluations with various SANs; UPC and other parallel application benchmarks
- Opteron
  - Features: 64-bit processor; real-time support of 32-bit OSs; on-chip memory controllers, eliminating the 4 GB memory barrier imposed by 32-bit systems; 19.2 GB/s I/O bandwidth per processor
  - Future plans: performance and functionality evaluations with various SANs; UPC and other parallel application benchmarks

Conclusions and Future Plans
- Accomplishments
  - Baselining of UPC on shared-memory multiprocessors
  - Evaluation of promising tools for UPC on clusters; leverage and extend communication and UPC layers
  - Conceptual design of new tools
  - Preliminary network and system performance analyses
  - Completed V1.0 of the GASNet Core API SCI conduit for UPC
- Key insights
  - Existing tools are limited but developing and emerging
  - SCI is a promising interconnect for UPC: inherent shared-memory features and low latency for memory transfers; copy vs. transfer latencies; scalability from 1D to 3D tori
  - GASNet/SCI Core API performance is comparable to SCI Raw performance
  - The method of accessing shared memory in UPC is important for optimal performance (see the sketch below)
- Future plans
  - Further investigation of GASNet and related tools, and Extended API design for the SCI conduit
  - Evaluation of design options and tradeoffs
  - GASNet conduits on promising SANs; UPC performance evaluations on different CPU architectures
  - Comprehensive performance analysis of new SANs and SAN-based clusters
  - Evaluation of UPC methods and tools on various architectures and systems
  - UPC benchmarking on cluster architectures, networks, and conduits
  - Cost/performance analysis for all options
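One way to read the final key insight: in UPC, every dereference of a pointer-to-shared pays for address translation, while shared data that has affinity to the calling thread may legally be cast to an ordinary C pointer and accessed at local-memory speed. The sketch below contrasts the two access methods; the array name and block size are illustrative.

```c
/* Illustrative contrast between shared-pointer access and casting
 * affinity-local shared data to a private pointer (names hypothetical). */
#include <upc.h>

#define BLOCK 1024

shared [BLOCK] double x[BLOCK * THREADS];

double sum_local_block(void)
{
    /* Legal in UPC: a shared address with affinity to MYTHREAD may be
     * cast to a private pointer, skipping per-access translation. */
    double *mine = (double *)&x[MYTHREAD * BLOCK];
    double s = 0.0;
    for (int i = 0; i < BLOCK; i++)
        s += mine[i];                   /* plain local loads */
    return s;
}

double sum_via_shared_pointer(void)
{
    double s = 0.0;
    for (int i = 0; i < BLOCK; i++)
        s += x[MYTHREAD * BLOCK + i];   /* each access goes through the
                                           shared-pointer translation path */
    return s;
}
```

Bulk remote data can similarly be moved with upc_memget/upc_memput rather than element-by-element shared accesses, which is the kind of access-method choice the insight above refers to.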