Distributed Shared-Memory Parallel Computing with UPC on SAN-based Clusters – Q3 Status Report
High-Performance Networking (HPN) Group, HCS Research Laboratory
ECE Department, University of Florida
Principal Investigator: Professor Alan D. George
Sr. Research Assistant: Mr. Hung-Hsun Su
10/15/04

Outline
- Objectives and Motivations
- Background
- Related Research
- Approach
- Results
- Conclusions and Future Plans

Objectives and Motivations
- Objectives
  - Support advancements for HPC with Unified Parallel C (UPC) on cluster systems exploiting high-throughput, low-latency System-Area Networks (SANs) and LANs
  - Design and analysis of tools to support UPC on SAN-based systems
  - Benchmarking and case studies with key UPC applications
  - Analysis of tradeoffs in application, network, service, and system design
- Motivations
  - Increasing demand in the sponsor and scientific computing community for shared-memory parallel computing with UPC
  - New and emerging technologies in system-area networking and cluster computing:
    - Scalable Coherent Interface (SCI)
    - Myrinet (GM)
    - InfiniBand (VAPI)
    - QsNet (Quadrics Elan)
    - Gigabit Ethernet and 10 Gigabit Ethernet
  - Clusters offer excellent cost-performance potential

Background
- Key sponsor applications and developments point toward shared-memory parallel computing with UPC
- UPC extends the C language to exploit parallelism (a minimal UPC sketch follows this slide)
  - Notable compilers include HP/Compaq's UPC compiler, MuPC, and Berkeley UPC
- First-generation UPC runtime systems are becoming available for clusters
  - UPC currently runs best on shared-memory multiprocessors or proprietary clusters (e.g., AlphaServer SC)
- Significant potential advantage in cost-performance ratio with Commercial Off-The-Shelf (COTS) cluster configurations
  - Leverage economy of scale
  - Clusters exhibit low cost relative to tightly coupled SMP, CC-NUMA, and MPP systems
  - Scalable performance with COTS technologies
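For readers new to UPC, the following minimal sketch (illustrative only, not project code; the block size is an arbitrary choice) shows the shared-array declaration, owner-computes loop, and barrier that UPC adds to C:

    /* Minimal UPC sketch (illustrative only): a blocked shared array with
     * owner-computes work distribution.  BLK is an arbitrary block size. */
    #include <upc_relaxed.h>
    #include <stdio.h>

    #define BLK 64

    shared [BLK] double a[BLK*THREADS], b[BLK*THREADS], c[BLK*THREADS];

    int main(void)
    {
        int i;

        /* upc_forall: each thread executes the iterations whose affinity
         * expression (&c[i]) names data it owns. */
        upc_forall (i = 0; i < BLK*THREADS; i++; &c[i])
            c[i] = a[i] + b[i];

        upc_barrier;                      /* all threads synchronize here */
        if (MYTHREAD == 0)
            printf("vector add done on %d threads\n", THREADS);
        return 0;
    }

Because all three arrays share the same blocking, every access in this loop is local; changing the affinity expression or the blocking factor would turn some of them into remote accesses, which under Berkeley UPC are carried out through GASNet.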
Related Research
- University of California at Berkeley
  - UPC runtime system
  - UPC-to-C translator
  - Global-Address Space Networking (GASNet) design and development
  - Application benchmarks
- George Washington University
  - UPC specification and documentation
  - UPC testing strategies and test suites
  - UPC benchmarking
  - UPC collective communications
  - Parallel I/O
- Michigan Tech University
  - Michigan Tech UPC (MuPC) design and development
  - UPC collective communications
  - Memory model research
  - Programmability studies
  - Test suite development
- Ohio State University
  - UPC benchmarking
- HP/Compaq
  - UPC compiler
- Intrepid
  - GCC UPC compiler

Related Research -- MuPC & DSM
- MuPC (Michigan Tech UPC)
  - First open-source reference implementation of UPC for COTS clusters
  - Any cluster that provides Pthreads and MPI can use it
  - Built as a reference implementation, so performance is a secondary concern
  - Limitations in application size and memory model
  - Not suitable for performance-critical applications
- UPC/DSM/SCI
  - SCI-VM (a DSM system for SCI)
  - The HAMSTER interface allows multiple modules to support MPI and shared-memory models
  - Created using the Dolphin SISCI API and ANSI C
  - SCI-VM is not under active development, so future upgrades are uncertain
  - Not feasible given the amount of work needed versus the expected performance
- Better possibilities with GASNet

Related Research -- GASNet
- Global-Address Space Networking (GASNet) [1]: a language-independent, low-level networking layer for high-performance communication, created by U.C. Berkeley as the communication target for the Berkeley UPC system
- Interface for high-level global-address-space SPMD languages such as UPC [3] and Titanium [4]
- Each node exposes a segment region for communication; three segment configurations:
  - Segment-fast: sacrifices size for speed
  - Segment-large: allows a large memory area for shared space, perhaps with some loss in performance (though the firehose [2] algorithm is often employed)
  - Segment-everything: exposes the entire virtual memory space of each process for shared access
- The firehose algorithm manages memory in buckets for efficient transfers
- Divided into two layers (a put/get usage sketch follows this slide):
  - Core: Active Messages
  - Extended: high-level operations that take direct advantage of network capabilities; a reference implementation of the extended API is available that uses only the Core layer
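To make the extended-API role concrete, here is a hedged sketch of how a GASNet client issues blocking one-sided transfers, based on our reading of the GASNet specification [1]; the segment size is arbitrary and error checking is omitted:

    /* Hedged sketch of GASNet extended-API usage per the spec [1];
     * segment size and error handling are simplified. */
    #include <gasnet.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        gasnet_seginfo_t *seg;
        gasnet_node_t peer;
        double buf[128];

        gasnet_init(&argc, &argv);
        gasnet_attach(NULL, 0, 256*GASNET_PAGESIZE, 0);   /* no AM handlers in this sketch */

        seg = malloc(gasnet_nodes() * sizeof(*seg));
        gasnet_getSegmentInfo(seg, gasnet_nodes());        /* where each node's segment lives */

        peer = (gasnet_mynode() + 1) % gasnet_nodes();

        /* Blocking one-sided transfers into and out of the peer's segment. */
        gasnet_put(peer, seg[peer].addr, buf, sizeof(buf));
        gasnet_get(buf, peer, seg[peer].addr, sizeof(buf));

        gasnet_exit(0);
        return 0;
    }

The Berkeley UPC runtime plays the role of this client; conduits such as the SCI conduit described later implement what these calls do on the wire.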
Related Research -- Berkeley UPC
- Second open-source implementation of UPC for COTS clusters, and the first with a focus on performance
- Uses GASNet for all accesses to remote memory
- Network conduits allow high performance over many different interconnects
- Layered stack: UPC code -> UPC-to-C translator -> translator-generated C code -> Berkeley UPC runtime system -> GASNet communication system -> network hardware, with the layers respectively platform-, network-, compiler-, and language-independent
- The translator targets a variety of architectures: x86, Alpha, Itanium, PowerPC, SPARC, MIPS, PA-RISC
- Best chance as of now for high-performance UPC applications on COTS clusters
- Note: currently supports only strict shared-memory access and therefore uses only the blocking transfer functions in the GASNet spec

Approach
- Use and design of applications in UPC to grasp key concepts and understand performance issues
  - NAS benchmarks from GWU
  - DES cipher benchmark from UF
- Design and develop a new SCI conduit for GASNet in collaboration with UCB/LBNL
- Evaluate DSM for SCI as an option for executing UPC
- Benchmarking and performance analysis
  - Network communication experiments
  - UPC computing experiments
  - Emphasis on exploiting SAN strengths for UPC and on SAN options and tradeoffs (SCI, Myrinet, InfiniBand, Quadrics, GigE, 10GigE, etc.)
- Field test of the newest compiler and system: HP/Compaq UPC compiler V2.1 running in our lab on an ES80 AlphaServer (Marvel)
- Support of testing by OSU, MTU, UCB/LBNL, UF, et al. with leading UPC tools and systems for function performance evaluation
- Collaboration spans all layers:
  - Upper layers (applications, translators, documentation): UC Berkeley (benchmarks, UPC-to-C translator, specification), GWU (benchmarks, documents, specification), Michigan Tech (benchmarks, modeling, specification), Ohio State (benchmarks), UF HCS Lab
  - Middle layers (runtime systems, interfaces): Michigan Tech (UPC-to-MPI translation and runtime system), UC Berkeley (C runtime system, upper levels of GASNet, beta testing), HP (UPC runtime system on AlphaServer)
  - Lower layers: UC Berkeley (GASNet; GASNet collaboration and network performance analysis)

GASNet SCI Conduit
- Scalable Coherent Interface (SCI): a low-latency, high-bandwidth SAN with shared-memory capabilities
  - Requires memory exporting and importing
  - PIO (requires importing) plus DMA (needs 8-byte alignment)
  - Remote writes are ~10x faster than remote reads
- AM enabling (core API)
  - Dedicated AM message channels (Command segments)
  - Request/response pairs to prevent deadlock
  - Flags to signal the arrival of new AMs (Control segments)
- Put/Get enabling (extended API)
  - Global segment (Payload segments)
- [Figure: per-node SCI memory layout -- Control segments (N total), Command segments (N x N total), and Payload segments (N total), plus local DMA queues; each node exports these segments into SCI space and imports remote segments into its virtual address space.]

GASNet SCI Conduit - Core API
- Active Message transfer, sender side (a client-side AM sketch follows this slide):
  1. Obtain a free slot (tracked locally using an array of flags)
  2. Package the AM header
  3. Transfer the data
     - Short AM: PIO write (header)
     - Medium AM: PIO write (header) + PIO write (medium payload)
     - Long AM: PIO write (header); payloads up to 1024 bytes, and the unaligned portion of larger payloads, are sent by PIO write, while the remainder is sent by DMA write in multiples of 64 bytes
  4. Wait for transfer completion
  5. Signal AM arrival (Message Ready flag = type of AM; Message Exists flag = TRUE)
  6. Wait for the reply/control signal, which frees the remote slot for reuse
- Receiver side: poll the Message Exists flags; when new messages are available, extract the message information from the command and payload segments, process all new messages (issuing an AM reply or ack for each), then return to other processing
- [Figure: sender/receiver flow between Node X and Node Y through the control, command (Y-1 ... Y-N), and payload segments.]
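The traffic this conduit must service is generated through the GASNet core API. The following client-side sketch is based on our reading of the GASNet spec [1]; the handler indices, argument value, and handler-table initialization style are illustrative assumptions rather than conduit code:

    /* Hedged sketch of core-API Active Message usage per the GASNet spec [1];
     * handler indices (200/201) and the argument are arbitrary, and the
     * handler-table layout is from memory, so treat this as illustrative. */
    #include <gasnet.h>
    #include <stdio.h>

    #define PING_IDX 200
    #define PONG_IDX 201

    static void ping_handler(gasnet_token_t token, gasnet_handlerarg_t seq)
    {
        /* Runs on the target node; answer with an empty reply. */
        printf("node %d received ping %d\n", (int)gasnet_mynode(), (int)seq);
        gasnet_AMReplyShort0(token, PONG_IDX);
    }

    static void pong_handler(gasnet_token_t token)
    {
        /* Reply received back at the requester; nothing to do in this sketch. */
        (void)token;
    }

    static gasnet_handlerentry_t htable[] = {
        { PING_IDX, (void (*)())ping_handler },
        { PONG_IDX, (void (*)())pong_handler },
    };

    int main(int argc, char **argv)
    {
        gasnet_node_t peer;

        gasnet_init(&argc, &argv);
        gasnet_attach(htable, 2, GASNET_PAGESIZE, 0);

        peer = (gasnet_mynode() + 1) % gasnet_nodes();
        gasnet_AMRequestShort1(peer, PING_IDX, 42);   /* short AM: header only    */

        gasnet_AMPoll();                              /* give incoming AMs a turn */
        gasnet_exit(0);
        return 0;
    }

In the SCI conduit, a request like this travels as a PIO-written header into one of the dedicated command slots, and the reply frees that slot as described above.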
Experimental Testbed
- Elan, VAPI (Xeon), MPI, and SCI conduits
  - Nodes: dual 2.4 GHz Intel Xeons, 1 GB DDR PC2100 (DDR266) RAM, Intel SE7501BR2 server motherboard with E7501 chipset
  - SCI: 667 MB/s (300 MB/s sustained) Dolphin SCI D337 (2D/3D) NICs, using PCI 64/66, in a 4x2 torus
  - Quadrics: 528 MB/s (340 MB/s sustained) Elan3, using PCI-X in two nodes with a QM-S16 16-port switch
  - InfiniBand: 4x (10 Gb/s, 800 MB/s sustained) Infiniserv HCAs, using PCI-X 100, with an InfiniIO 2000 8-port switch from Infinicon
  - RedHat 9.0 with gcc V3.3.2; SCI uses the MP-MPICH beta from RWTH Aachen University, Germany; Berkeley UPC runtime system 1.1
- VAPI (Opteron)
  - Nodes: dual AMD Opteron 240, 1 GB DDR PC2700 (DDR333) RAM, Tyan Thunder K8S server motherboard
  - InfiniBand: same as in VAPI (Xeon)
- GM (Myrinet) conduit (c/o access to a cluster at MTU*)
  - Nodes: dual 2.0 GHz Intel Xeons, 2 GB DDR PC2100 (DDR266) RAM
  - Myrinet: 250 MB/s Myrinet 2000, using PCI-X, on 8 nodes connected with a 16-port M3FSW16 switch
  - RedHat 7.3 with Intel C compiler V7.1; Berkeley UPC runtime system 1.1
- ES80 AlphaServer (Marvel)
  - Four 1 GHz EV7 Alpha processors, 8 GB RD1600 RAM, proprietary inter-processor connections
  - Tru64 5.1B Unix, HP UPC V2.1 compiler
* Via testbed made available courtesy of Michigan Tech

SCI Conduit - GASNet Core-Level Experiments
- Experimental setup
  - SCI conduit: latency and throughput measured with testam (10,000 iterations)
  - Raw SCI: PIO latency measured with scipp; DMA latency and throughput measured with dma_bench
- Analysis
  - Latency is a little high, but the overhead is constant rather than growing with payload size
  - Throughput follows the raw SCI trend
- [Charts: Short/Medium AM ping-pong latency, Long AM ping-pong latency, and Long AM throughput vs. payload size for the SCI conduit vs. raw SCI; the PIO/DMA mode shift is marked.]

SCI Conduit - GASNet Extended-Level Experiments
- Experimental setup
  - GASNet configured with segment-large: as fast as segment-fast for memory inside the segment, and uses firehose for memory outside the segment (often more efficient than segment-fast)
  - Berkeley GASNet test suite, averaged over 1000 iterations
  - Each test uses put/get operations to take advantage of the implemented extended APIs
  - Executed with target memory falling inside and then outside the GASNet segment; only the inside results are reported unless the difference was significant
  - Latency results use testsmall; throughput results use testlarge
- Analysis
  - Elan shows the best latency for puts and gets
  - VAPI has by far the best bandwidth, and very good latency
  - GM latencies are a little higher than all the rest
  - The HCS SCI conduit shows better put latency than MPI on SCI for sizes > 64 bytes, and is very close to MPI on SCI for smaller messages; its other latencies are slightly higher than MPI on SCI
  - GM and SCI provide about the same throughput, with the HCS SCI conduit achieving slightly higher bandwidth at the largest message sizes
- Quick look at the estimated total cost to support 8 nodes of these interconnect architectures: SCI ~$8,700; Myrinet ~$9,200; InfiniBand ~$12,300; Elan3 ~$18,000 (based on the Elan4 pricing structure, which is slightly higher)

GASNet Extended-Level Latency
- [Chart: round-trip latency (µsec) vs. message size (1 byte to 1 KB) for put and get over the GM, VAPI, Elan, HCS SCI, and MPI-on-SCI conduits.]

GASNet Extended-Level Throughput
- [Chart: throughput (MB/s) vs. message size (128 bytes to 256 KB) for put and get over the same conduits.]

Matisse IP-Based Networks
- Switch-based GigE network with a DWDM backbone between switches for high scalability; product in the alpha testing stage
- Experimental setup
  - Nodes: dual 2.4 GHz Intel Xeons, 1 GB DDR PC2100 (DDR266) RAM, Intel SE7501BR2 server motherboard with E7501 chipset
  - Setups: 1 switch (all nodes connected to one switch); 2 switches (half of the nodes connected to each switch, with either short (1 km) or long (12.5 km) fiber between the switches)
  - Tests: low level - Pallas benchmark (ping-pong and send-recv); GASNet level - testsmall
- [Charts: Pallas PingPong and SendRecv throughput (MB/s) vs. message size for 1 switch, 2 switches with short fiber, and 2 switches with long fiber.]
Matisse IP-Based Networks - GASNet Results
- GASNet put/get latency for 2 switches with short or long fiber is constant: ~250 µs (short) and ~374 µs (long)
- Throughput is comparable to regular GigE
- Latency is comparable to regular GigE (~255 µs for all sizes)
- [Charts: testsmall put/get latency and throughput for 1 switch, 2 switches (short fiber), 2 switches (long fiber), and regular GigE.]

UPC Function Performance
- A look at common shared-data operations (a UPC sketch of these operations follows these results):
  - Comparison between accesses to local data through regular and private pointers
  - Block copies between shared and private memory (upc_memget, upc_memput)
  - Pointer conversion (shared-local to private)
  - Pointer addition (advancing a pointer to the next location)
  - Loads and stores (to a single location, local and remote)
- Block copies
  - upc_memget and upc_memput translate directly into GASNet blocking put and get (even on local shared objects); see the previous graphs for results
  - Marvel with the HP UPC compiler shows no appreciable difference between local and remote puts and gets and regular C operations
    - Steady increase from 0.27 to 1.83 µsec for sizes from 2 bytes to 8 KB
    - Difference of < 0.5 µsec for remote operations

UPC Function Performance - Pointer Operations
- Cast (local shared to private): all Berkeley UPC conduits ~2 ns; Marvel needed ~90 ns
- Pointer addition: shared-pointer manipulation is about an order of magnitude more expensive than private
- [Chart: pointer-addition execution time (µsec), private vs. shared, on MPI-SCI, MPI GigE, Elan, GM, VAPI, and Marvel.]

UPC Function Performance - Loads and Stores (local data)
- Loads and stores through pointers (not bulk), with the data local to the calling node; "Pvt Shared" denotes private pointers to the local shared space
- MPI on GigE shared stores take two orders of magnitude longer and are therefore not shown
- Marvel shared loads and stores are roughly two orders of magnitude greater than private
- [Chart: execution time (µsec) of private, shared, and pvt-shared loads and stores on MPI SCI, MPI GigE, Elan, GM, VAPI, and Marvel.]

UPC Function Performance - Loads and Stores (remote data)
- Loads and stores through pointers (not bulk), with the data remote to the calling node
- Note: MPI on GigE showed ~450 µsec for loads and ~500 µsec for stores (not plotted)
- Marvel remote access through pointers is the same as with local shared data, two orders of magnitude less than Elan
- [Chart: remote load and store execution time (µsec) on MPI-SCI, Elan, GM, VAPI, and Marvel.]
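The following UPC sketch gathers the operation categories measured above in one place (block size and indices are arbitrary; this is not the benchmark code):

    /* UPC sketch of the operations measured above: cast, pointer addition,
     * single-element loads/stores, and bulk copies.  Sizes are illustrative. */
    #include <upc_relaxed.h>

    #define BLK 2048
    shared [BLK] int data[BLK*THREADS];       /* one block of BLK ints per thread */

    int main(void)
    {
        int local[BLK];
        int v;
        int *priv;
        shared int *sp;

        /* Pointer conversion: a shared address with local affinity cast to a
         * private pointer (measured at ~2 ns on the Berkeley UPC conduits). */
        priv = (int *)&data[MYTHREAD * BLK];
        priv[0] = 1;                          /* store through the private pointer */

        /* Shared-pointer addition and single-element load/store. */
        sp = &data[0];
        sp += ((MYTHREAD + 1) % THREADS) * BLK;   /* advance to a neighbor's block */
        *sp = MYTHREAD;                           /* remote store when THREADS > 1 */
        v = *sp;                                  /* remote load                   */

        upc_barrier;                              /* order fine-grained and bulk phases */

        /* Bulk copies between shared and private memory (8 KB here). */
        upc_memget(local, &data[MYTHREAD * BLK], sizeof(local));
        upc_memput(&data[MYTHREAD * BLK], local, sizeof(local));

        upc_barrier;
        (void)v;
        return 0;
    }

On Berkeley UPC, the cast and the private-pointer store compile to plain C operations, while the shared-pointer accesses and bulk copies go through the UPC runtime (the bulk copies map to GASNet blocking put/get, as noted above).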
UPC Benchmark - IS from NAS
- IS (Integer Sort, Class A): lots of fine-grained communication, low computation
- Poor performance in the GASNet communication system does not necessarily indicate poor performance in the UPC application
- [Chart: execution time (sec) for 1, 2, 4, and 8 threads on GM, Elan, GigE MPI, VAPI (Xeon), SCI MPI, SCI, and Marvel.]

UPC Benchmark - FT from NAS*
- FT (3-D Fast Fourier Transform, Class A): medium communication, high computation
- Used optimized version 01 (private pointers to local shared memory); a sketch of this optimization follows below
- The SCI conduit was unable to run due to a driver limitation (size constraint)
- High-bandwidth networks perform best (VAPI followed by Elan)
- The VAPI conduit allows a cluster of Xeons to keep pace with Marvel's performance
- MPI on GigE is not well suited to these problems; its high-latency, low-bandwidth traits limit performance
- MPI on SCI has lower bandwidth than VAPI but still maintains near-linear speedup beyond 2 nodes (it skirts TCP/IP overhead)
- GM performance is a factor of processor speed (see the 1-thread case)
- [Chart: execution time (sec) for 1, 2, 4, and 8 threads on GM, Elan, GigE MPI, VAPI, SCI MPI, and Marvel; the high latency of MPI on GigE impedes performance.]
* Using code developed at GWU
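The "private pointers to local shared memory" optimization used in the FT version above can be sketched as follows (array shape and loop are illustrative, not the NAS FT code):

    /* Sketch of the "private pointers to local shared memory" optimization:
     * the thread's own block is walked through an ordinary C pointer,
     * avoiding shared-pointer overhead on every reference. */
    #include <upc_relaxed.h>

    #define CHUNK 4096
    shared [CHUNK] double field[CHUNK*THREADS];

    void scale_local_chunk(double s)
    {
        int i;
        double *mine;

        /* One cast up front instead of a shared-pointer dereference per element. */
        mine = (double *)&field[MYTHREAD * CHUNK];

        for (i = 0; i < CHUNK; i++)
            mine[i] *= s;              /* plain loads/stores, no runtime calls */
    }

    int main(void)
    {
        scale_local_chunk(2.0);
        upc_barrier;
        return 0;
    }

The same idea appears in the pointer-operation measurements earlier: a one-time ~2 ns cast replaces a shared-pointer dereference on every element of the local block.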
UPC Benchmark - DES Differential Attack Simulator
- S-DES (8-bit key) cipher (integer-based)
- Creates the basic components used in differential cryptanalysis: S-boxes, Difference Pair Tables (DPT), and Differential Distribution Tables (DDT)
- Bandwidth-intensive application, designed for a high cache-miss rate, so very costly in terms of memory access
- [Chart: execution time (msec) for sequential, 1, 2, and 4 threads on GM, Elan, GigE MPI, VAPI (Xeon), SCI MPI, SCI, and Marvel.]

UPC Benchmark - DES Analysis
- With an increasing number of nodes, bandwidth and NIC response time become more important; interconnects with high bandwidth and fast response times perform best
- Marvel shows near-perfect linear speedup, but the processing time of integers is an issue
- VAPI shows constant speedup
- Elan shows near-linear speedup from 1 to 2 nodes, but more nodes are needed in the testbed for better analysis
- GM does not begin to show any speedup until 4 nodes, and then only minimal speedup
- The SCI conduit performs well for high-bandwidth programs but has the same speedup problem as GM
- The MPI conduit is clearly inadequate for high-bandwidth programs

UPC Benchmark - Differential Cryptanalysis for the CAMEL Cipher
- Uses 1024-bit S-boxes
- Given a key, encrypts data, then tries to guess the key solely from the encrypted data using a differential attack
- Three main phases:
  1. Compute the optimal difference pair based on the S-box (not very CPU-intensive)
  2. Perform the main differential attack (extremely CPU-intensive): obtain a list of candidate keys, then check all candidate keys by brute force in combination with the optimal difference pair computed earlier
  3. Analyze the data from the differential attack (not very CPU-intensive)
- Computationally intensive (independent processes) plus several synchronization points
- Parameters: MAINKEYLOOP = 256; NUMPAIRS = 400,000; initial key = 12345
- [Chart: execution time (s) for 1, 2, 4, 8, and 16 threads on SCI (Xeon), VAPI (Opteron), and Marvel.]

UPC Benchmark - CAMEL Analysis
- Marvel
  - Attained almost perfect speedup
  - Synchronization cost very low
- Berkeley UPC
  - Speedup decreases with an increasing number of threads
  - Run time varied greatly as the number of threads increased
  - The cost of synchronization increases with the number of threads, making it hard to get consistent timing readings
  - Still decent performance at 32 threads (76.25% efficiency, VAPI)
  - Performance is more sensitive to data affinity

Architectural Performance Tests
Theme: a preliminary study of tradeoffs in available processor architectures, since their performance will clearly affect computation, communication, and synchronization in UPC clusters.
- Intel Pentium 4 Xeon
  - 32-bit processor; Intel NetBurst microarchitecture; Hyper-Threading technology for increased CPU utilization; 4.3 GB/s I/O bandwidth
- AMD Opteron
  - 32-bit/64-bit processor; RISC processor core; real-time support of 32-bit OSes; on-chip memory controllers; eliminates the 4 GB memory barrier imposed by 32-bit systems; 19.2 GB/s I/O bandwidth per processor
- Intel Itanium II
  - 64-bit processor; based on the EPIC architecture; 3-level cache design; Enhanced Machine Check Architecture (MCA) with extensive Error Correcting Code (ECC); 6.4 GB/s I/O bandwidth

CPU Performance Results
- AIM 9 disk-copy benchmark: 10 iterations using 5 MB files, testing sequential and random reads, writes, and copies (computation benchmarks excluded due to compiler problems with Itanium2)
- Itanium2 is slightly above Opteron in both reads and writes, except for random writes where Opteron has a slight advantage
- Both Itanium2 and Opteron outperform Xeon by a wide margin in all cases except sequential reads
- Xeon sequential reads are comparable to Opteron, but Itanium2 is much higher than both
- Major performance gain from sequential reads compared to random, but sequential writes do not receive nearly as large a boost
- [Chart: AIM 9 throughput (MB/s) for random and sequential reads and writes on Itanium2, Opteron, and Xeon.]

10 Gigabit Ethernet - Preliminary Results
- Testbed: nodes each with dual 2.4 GHz Xeons, S2io Xframe 10GigE card in PCI-X 100, 1 GB PC2100 DDR RAM, Intel PRO/1000 1GigE, RedHat 9.0 kernel 2.4.20-8smp, LAM-MPI V7.0.3
- 10GigE is promising due to the expected economy of scale of Ethernet
- S2io 10GigE shows impressive throughput, though slightly less than half of the theoretical maximum; further tuning is needed to go higher
- Results show a much-needed decrease in latency versus other Ethernet options
- [Charts: throughput (MB/s) and round-trip latency (µsec) vs. message size for 10GigE and GigE.]

Conclusions
Key insights:
- The HCS SCI conduit shows promise
  - Performance on par with other conduits
  - On-going collaboration with the vendor (Dolphin) to resolve the memory constraint issue
- The Berkeley UPC system is a promising COTS cluster tool
  - Performance on par with HP UPC (also see [6])
  - Performance of COTS clusters matches and sometimes beats that of high-end CC-NUMA systems
  - Various conduits allow UPC to execute on many interconnects; VAPI and Elan are initially found to be the strongest
  - Some open issues remain with bugs and optimization; active bug reports and the development team help drive improvements
  - A very good solution for executing UPC on clusters, but perhaps not quite ready for production use: no debugging or performance tools are available
- Xeon clusters are suitable for applications with a high read/write ratio
- Opteron clusters are suitable for generic applications due to comparable read and write capability
- Itanium2 is excellent for sequential reads, and about the same as Opteron for everything else
- 10GigE provides high bandwidth with much lower latencies than 1GigE
Key accomplishments to date:
- Baselining of UPC on shared-memory multiprocessors
- Evaluation of promising tools for UPC on clusters
- Leveraging and extension of communication and UPC layers
- Conceptual design of new tools for UPC
- Preliminary network and system performance analyses for UPC systems
- Completion of an optimized GASNet SCI conduit for UPC

References
1. D. Bonachea, "GASNet Specification, v1.1," U.C. Berkeley Tech Report UCB/CSD-02-1207, October 2002.
2. C. Bell and D. Bonachea, "A New DMA Registration Strategy for Pinning-Based High Performance Networks," Workshop on Communication Architecture for Clusters (CAC'03), April 2003.
3. W. Carlson, J. Draper, D. Culler, K. Yelick, E. Brooks, and K. Warren, "Introduction to UPC and Language Specification," IDA Center for Computing Sciences, Tech. Report CCS-TR-99-157, May 1999.
4. K. A. Yelick, L. Semenzato, G. Pike, C. Miyamoto, B. Liblit, A. Krishnamurthy, P. N. Hilfinger, S. L. Graham, D. Gay, P. Colella, and A. Aiken, "Titanium: A High-Performance Java Dialect," Concurrency: Practice and Experience, Vol. 10, No. 11-13, September-November 1998.
5. B. Gordon, S. Oral, G. Li, H. Su, and A. George, "Performance Analysis of HP AlphaServer ES80 vs. SAN-based Clusters," 22nd IEEE International Performance, Computing, and Communications Conference (IPCCC), April 2003.
6. W. Chen, D. Bonachea, J. Duell, P. Husbands, C. Iancu, and K. Yelick, "A Performance Analysis of the Berkeley UPC Compiler," 17th Annual International Conference on Supercomputing (ICS), June 2003.