Distributed Shared-Memory Parallel Computing with UPC on SAN-based Clusters – Q3 Status Report
High-Performance Networking (HPN) Group, HCS Research Laboratory
ECE Department, University of Florida
Principal Investigator: Professor Alan D. George
Sr. Research Assistant: Mr. Hung-Hsun Su
10/15/04

Outline
- Objectives and Motivations
- Background
- Related Research
- Approach
- Results
- Conclusions and Future Plans

Objectives and Motivations
- Objectives
  - Support advancements for HPC with Unified Parallel C (UPC) on cluster systems exploiting high-throughput, low-latency System-Area Networks (SANs) and LANs
  - Design and analysis of tools to support UPC on SAN-based systems
  - Benchmarking and case studies with key UPC applications
  - Analysis of tradeoffs in application, network, service, and system design
- Motivations
  - Increasing demand in the sponsor and scientific computing community for shared-memory parallel computing with UPC
  - New and emerging technologies in system-area networking and cluster computing:
    - Scalable Coherent Interface (SCI)
    - Myrinet (GM)
    - InfiniBand (VAPI)
    - QsNet (Quadrics Elan)
    - Gigabit Ethernet and 10 Gigabit Ethernet
  - Clusters offer excellent cost-performance potential

Background
- Key sponsor applications and developments point toward shared-memory parallel computing with UPC
- UPC extends the C language to exploit parallelism (a minimal UPC sketch follows this slide)
  - Notable compilers include HP/Compaq's UPC compiler, MuPC, and Berkeley UPC
- First-generation UPC runtime systems are becoming available for clusters
  - UPC currently runs best on shared-memory multiprocessors or proprietary clusters (e.g., AlphaServer SC)
- Significant potential advantage in cost-performance ratio with Commercial Off-The-Shelf (COTS) cluster configurations
  - Leverage economy of scale
  - Clusters exhibit low cost relative to tightly coupled SMP, CC-NUMA, and MPP systems
  - Scalable performance with COTS technologies
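For readers new to UPC, the following minimal sketch (illustrative only, not project code; the block size is an arbitrary choice) shows the shared-array declaration, owner-computes loop, and barrier that UPC adds to C:

    /* Minimal UPC sketch (illustrative only): a blocked shared array with
     * owner-computes work distribution.  BLK is an arbitrary block size. */
    #include <upc_relaxed.h>
    #include <stdio.h>

    #define BLK 64

    shared [BLK] double a[BLK*THREADS], b[BLK*THREADS], c[BLK*THREADS];

    int main(void)
    {
        int i;

        /* upc_forall: each thread executes the iterations whose affinity
         * expression (&c[i]) names data it owns. */
        upc_forall (i = 0; i < BLK*THREADS; i++; &c[i])
            c[i] = a[i] + b[i];

        upc_barrier;                      /* all threads synchronize here */
        if (MYTHREAD == 0)
            printf("vector add done on %d threads\n", THREADS);
        return 0;
    }

Because all three arrays share the same blocking, every access in this loop is local; changing the affinity expression or the blocking factor would turn some of them into remote accesses, which under Berkeley UPC are carried out through GASNet.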
Related Research
- University of California at Berkeley
  - UPC runtime system
  - UPC-to-C translator
  - Global-Address Space Networking (GASNet) design and development
  - Application benchmarks
- George Washington University
  - UPC specification and documentation
  - UPC testing strategies and test suites
  - UPC benchmarking
  - UPC collective communications
  - Parallel I/O
- Michigan Tech University
  - Michigan Tech UPC (MuPC) design and development
  - UPC collective communications
  - Memory model research
  - Programmability studies
  - Test suite development
- Ohio State University
  - UPC benchmarking
- HP/Compaq
  - UPC compiler
- Intrepid
  - GCC UPC compiler

Related Research -- MuPC & DSM
- MuPC (Michigan Tech UPC)
  - First open-source reference implementation of UPC for COTS clusters
  - Any cluster that provides Pthreads and MPI can use it
  - Built as a reference implementation, so performance is a secondary concern
  - Limitations in application size and memory model
  - Not suitable for performance-critical applications
- UPC/DSM/SCI
  - SCI-VM (a DSM system for SCI)
  - The HAMSTER interface allows multiple modules to support MPI and shared-memory models
  - Created using the Dolphin SISCI API and ANSI C
  - SCI-VM is not under active development, so future upgrades are uncertain
  - Not feasible given the amount of work needed versus the expected performance
- Better possibilities with GASNet

Related Research -- GASNet
- Global-Address Space Networking (GASNet) [1]: a language-independent, low-level networking layer for high-performance communication, created by U.C. Berkeley as the communication target for the Berkeley UPC system
- Interface for high-level global-address-space SPMD languages such as UPC [3] and Titanium [4]
- Each node exposes a segment region for communication; three segment configurations:
  - Segment-fast: sacrifices size for speed
  - Segment-large: allows a large memory area for shared space, perhaps with some loss in performance (though the firehose [2] algorithm is often employed)
  - Segment-everything: exposes the entire virtual memory space of each process for shared access
- The firehose algorithm manages memory in buckets for efficient transfers
- Divided into two layers (a put/get usage sketch follows this slide):
  - Core: Active Messages
  - Extended: high-level operations that take direct advantage of network capabilities; a reference implementation of the extended API is available that uses only the Core layer
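To make the extended-API role concrete, here is a hedged sketch of how a GASNet client issues blocking one-sided transfers, based on our reading of the GASNet specification [1]; the segment size is arbitrary and error checking is omitted:

    /* Hedged sketch of GASNet extended-API usage per the spec [1];
     * segment size and error handling are simplified. */
    #include <gasnet.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        gasnet_seginfo_t *seg;
        gasnet_node_t peer;
        double buf[128];

        gasnet_init(&argc, &argv);
        gasnet_attach(NULL, 0, 256*GASNET_PAGESIZE, 0);   /* no AM handlers in this sketch */

        seg = malloc(gasnet_nodes() * sizeof(*seg));
        gasnet_getSegmentInfo(seg, gasnet_nodes());        /* where each node's segment lives */

        peer = (gasnet_mynode() + 1) % gasnet_nodes();

        /* Blocking one-sided transfers into and out of the peer's segment. */
        gasnet_put(peer, seg[peer].addr, buf, sizeof(buf));
        gasnet_get(buf, peer, seg[peer].addr, sizeof(buf));

        gasnet_exit(0);
        return 0;
    }

The Berkeley UPC runtime plays the role of this client; conduits such as the SCI conduit described later implement what these calls do on the wire.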
Related Research -- Berkeley UPC
- Second open-source implementation of UPC for COTS clusters, and the first with a focus on performance
- Uses GASNet for all accesses to remote memory
- Network conduits allow high performance over many different interconnects
- Layered stack: UPC code -> UPC-to-C translator -> translator-generated C code -> Berkeley UPC runtime system -> GASNet communication system -> network hardware, with the layers respectively platform-, network-, compiler-, and language-independent
- The translator targets a variety of architectures: x86, Alpha, Itanium, PowerPC, SPARC, MIPS, PA-RISC
- Best chance as of now for high-performance UPC applications on COTS clusters
- Note: currently supports only strict shared-memory access and therefore uses only the blocking transfer functions in the GASNet spec

Approach
- Use and design of applications in UPC to grasp key concepts and understand performance issues
  - NAS benchmarks from GWU
  - DES cipher benchmark from UF
- Design and develop a new SCI conduit for GASNet in collaboration with UCB/LBNL
- Evaluate DSM for SCI as an option for executing UPC
- Benchmarking and performance analysis
  - Network communication experiments
  - UPC computing experiments
  - Emphasis on exploiting SAN strengths for UPC and on SAN options and tradeoffs (SCI, Myrinet, InfiniBand, Quadrics, GigE, 10GigE, etc.)
- Field test of the newest compiler and system: HP/Compaq UPC compiler V2.1 running in our lab on an ES80 AlphaServer (Marvel)
- Support of testing by OSU, MTU, UCB/LBNL, UF, et al. with leading UPC tools and systems for function performance evaluation
- Collaboration spans all layers:
  - Upper layers (applications, translators, documentation): UC Berkeley (benchmarks, UPC-to-C translator, specification), GWU (benchmarks, documents, specification), Michigan Tech (benchmarks, modeling, specification), Ohio State (benchmarks), UF HCS Lab
  - Middle layers (runtime systems, interfaces): Michigan Tech (UPC-to-MPI translation and runtime system), UC Berkeley (C runtime system, upper levels of GASNet, beta testing), HP (UPC runtime system on AlphaServer)
  - Lower layers: UC Berkeley (GASNet; GASNet collaboration and network performance analysis)

GASNet SCI Conduit
- Scalable Coherent Interface (SCI): a low-latency, high-bandwidth SAN with shared-memory capabilities
  - Requires memory exporting and importing
  - PIO (requires importing) plus DMA (needs 8-byte alignment)
  - Remote writes are ~10x faster than remote reads
- AM enabling (core API)
  - Dedicated AM message channels (Command segments)
  - Request/response pairs to prevent deadlock
  - Flags to signal the arrival of new AMs (Control segments)
- Put/Get enabling (extended API)
  - Global segment (Payload segments)
- [Figure: per-node SCI memory layout -- Control segments (N total), Command segments (N x N total), and Payload segments (N total), plus local DMA queues; each node exports these segments into SCI space and imports remote segments into its virtual address space.]

GASNet SCI Conduit - Core API
- Active Message transfer, sender side (a client-side AM sketch follows this slide):
  1. Obtain a free slot (tracked locally using an array of flags)
  2. Package the AM header
  3. Transfer the data
     - Short AM: PIO write (header)
     - Medium AM: PIO write (header) + PIO write (medium payload)
     - Long AM: PIO write (header); payloads up to 1024 bytes, and the unaligned portion of larger payloads, are sent by PIO write, while the remainder is sent by DMA write in multiples of 64 bytes
  4. Wait for transfer completion
  5. Signal AM arrival (Message Ready flag = type of AM; Message Exists flag = TRUE)
  6. Wait for the reply/control signal, which frees the remote slot for reuse
- Receiver side: poll the Message Exists flags; when new messages are available, extract the message information from the command and payload segments, process all new messages (issuing an AM reply or ack for each), then return to other processing
- [Figure: sender/receiver flow between Node X and Node Y through the control, command (Y-1 ... Y-N), and payload segments.]
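The traffic this conduit must service is generated through the GASNet core API. The following client-side sketch is based on our reading of the GASNet spec [1]; the handler indices, argument value, and handler-table initialization style are illustrative assumptions rather than conduit code:

    /* Hedged sketch of core-API Active Message usage per the GASNet spec [1];
     * handler indices (200/201) and the argument are arbitrary, and the
     * handler-table layout is from memory, so treat this as illustrative. */
    #include <gasnet.h>
    #include <stdio.h>

    #define PING_IDX 200
    #define PONG_IDX 201

    static void ping_handler(gasnet_token_t token, gasnet_handlerarg_t seq)
    {
        /* Runs on the target node; answer with an empty reply. */
        printf("node %d received ping %d\n", (int)gasnet_mynode(), (int)seq);
        gasnet_AMReplyShort0(token, PONG_IDX);
    }

    static void pong_handler(gasnet_token_t token)
    {
        /* Reply received back at the requester; nothing to do in this sketch. */
        (void)token;
    }

    static gasnet_handlerentry_t htable[] = {
        { PING_IDX, (void (*)())ping_handler },
        { PONG_IDX, (void (*)())pong_handler },
    };

    int main(int argc, char **argv)
    {
        gasnet_node_t peer;

        gasnet_init(&argc, &argv);
        gasnet_attach(htable, 2, GASNET_PAGESIZE, 0);

        peer = (gasnet_mynode() + 1) % gasnet_nodes();
        gasnet_AMRequestShort1(peer, PING_IDX, 42);   /* short AM: header only    */

        gasnet_AMPoll();                              /* give incoming AMs a turn */
        gasnet_exit(0);
        return 0;
    }

In the SCI conduit, a request like this travels as a PIO-written header into one of the dedicated command slots, and the reply frees that slot as described above.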
Experimental Testbed
- Elan, VAPI (Xeon), MPI, and SCI conduits
  - Nodes: dual 2.4 GHz Intel Xeons, 1 GB DDR PC2100 (DDR266) RAM, Intel SE7501BR2 server motherboard with E7501 chipset
  - SCI: 667 MB/s (300 MB/s sustained) Dolphin SCI D337 (2D/3D) NICs, using PCI 64/66, in a 4x2 torus
  - Quadrics: 528 MB/s (340 MB/s sustained) Elan3, using PCI-X in two nodes with a QM-S16 16-port switch
  - InfiniBand: 4x (10 Gb/s, 800 MB/s sustained) Infiniserv HCAs, using PCI-X 100, with an InfiniIO 2000 8-port switch from Infinicon
  - RedHat 9.0 with gcc V3.3.2; SCI uses the MP-MPICH beta from RWTH Aachen University, Germany; Berkeley UPC runtime system 1.1
- VAPI (Opteron)
  - Nodes: dual AMD Opteron 240, 1 GB DDR PC2700 (DDR333) RAM, Tyan Thunder K8S server motherboard
  - InfiniBand: same as in VAPI (Xeon)
- GM (Myrinet) conduit (c/o access to a cluster at MTU*)
  - Nodes: dual 2.0 GHz Intel Xeons, 2 GB DDR PC2100 (DDR266) RAM
  - Myrinet: 250 MB/s Myrinet 2000, using PCI-X, on 8 nodes connected with a 16-port M3FSW16 switch
  - RedHat 7.3 with Intel C compiler V7.1; Berkeley UPC runtime system 1.1
- ES80 AlphaServer (Marvel)
  - Four 1 GHz EV7 Alpha processors, 8 GB RD1600 RAM, proprietary inter-processor connections
  - Tru64 5.1B Unix, HP UPC V2.1 compiler
* Via testbed made available courtesy of Michigan Tech

SCI Conduit - GASNet Core-Level Experiments
- Experimental setup
  - SCI conduit: latency and throughput measured with testam (10,000 iterations)
  - Raw SCI: PIO latency measured with scipp; DMA latency and throughput measured with dma_bench
- Analysis
  - Latency is a little high, but the overhead is constant rather than growing with payload size
  - Throughput follows the raw SCI trend
- [Charts: Short/Medium AM ping-pong latency, Long AM ping-pong latency, and Long AM throughput vs. payload size for the SCI conduit vs. raw SCI; the PIO/DMA mode shift is marked.]

SCI Conduit - GASNet Extended-Level Experiments
- Experimental setup
  - GASNet configured with segment-large: as fast as segment-fast for memory inside the segment, and uses firehose for memory outside the segment (often more efficient than segment-fast)
  - Berkeley GASNet test suite, averaged over 1000 iterations
  - Each test uses put/get operations to take advantage of the implemented extended APIs
  - Executed with target memory falling inside and then outside the GASNet segment; only the inside results are reported unless the difference was significant
  - Latency results use testsmall; throughput results use testlarge
- Analysis
  - Elan shows the best latency for puts and gets
  - VAPI has by far the best bandwidth, and very good latency
  - GM latencies are a little higher than all the rest
  - The HCS SCI conduit shows better put latency than MPI on SCI for sizes > 64 bytes, and is very close to MPI on SCI for smaller messages; its other latencies are slightly higher than MPI on SCI
  - GM and SCI provide about the same throughput, with the HCS SCI conduit achieving slightly higher bandwidth at the largest message sizes
- Quick look at the estimated total cost to support 8 nodes of these interconnect architectures: SCI ~$8,700; Myrinet ~$9,200; InfiniBand ~$12,300; Elan3 ~$18,000 (based on the Elan4 pricing structure, which is slightly higher)

GASNet Extended-Level Latency
- [Chart: round-trip latency (µsec) vs. message size (1 byte to 1 KB) for put and get over the GM, VAPI, Elan, HCS SCI, and MPI-on-SCI conduits.]

GASNet Extended-Level Throughput
- [Chart: throughput (MB/s) vs. message size (128 bytes to 256 KB) for put and get over the same conduits.]

Matisse IP-Based Networks
- Switch-based GigE network with a DWDM backbone between switches for high scalability; product in the alpha testing stage
- Experimental setup
  - Nodes: dual 2.4 GHz Intel Xeons, 1 GB DDR PC2100 (DDR266) RAM, Intel SE7501BR2 server motherboard with E7501 chipset
  - Setups: 1 switch (all nodes connected to one switch); 2 switches (half of the nodes connected to each switch, with either short (1 km) or long (12.5 km) fiber between the switches)
  - Tests: low level - Pallas benchmark (ping-pong and send-recv); GASNet level - testsmall
- [Charts: Pallas PingPong and SendRecv throughput (MB/s) vs. message size for 1 switch, 2 switches with short fiber, and 2 switches with long fiber.]
Matisse IP-Based Networks - GASNet Results
- GASNet put/get latency for 2 switches with short or long fiber is constant: ~250 µs (short) and ~374 µs (long)
- Throughput is comparable to regular GigE
- Latency is comparable to regular GigE (~255 µs for all sizes)
- [Charts: testsmall put/get latency and throughput for 1 switch, 2 switches (short fiber), 2 switches (long fiber), and regular GigE.]

UPC Function Performance
- A look at common shared-data operations (a UPC sketch of these operations follows these results):
  - Comparison between accesses to local data through regular and private pointers
  - Block copies between shared and private memory (upc_memget, upc_memput)
  - Pointer conversion (shared-local to private)
  - Pointer addition (advancing a pointer to the next location)
  - Loads and stores (to a single location, local and remote)
- Block copies
  - upc_memget and upc_memput translate directly into GASNet blocking put and get (even on local shared objects); see the previous graphs for results
  - Marvel with the HP UPC compiler shows no appreciable difference between local and remote puts and gets and regular C operations
    - Steady increase from 0.27 to 1.83 µsec for sizes from 2 bytes to 8 KB
    - Difference of < 0.5 µsec for remote operations

UPC Function Performance - Pointer Operations
- Cast (local shared to private): all Berkeley UPC conduits ~2 ns; Marvel needed ~90 ns
- Pointer addition: shared-pointer manipulation is about an order of magnitude more expensive than private
- [Chart: pointer-addition execution time (µsec), private vs. shared, on MPI-SCI, MPI GigE, Elan, GM, VAPI, and Marvel.]

UPC Function Performance - Loads and Stores (local data)
- Loads and stores through pointers (not bulk), with the data local to the calling node; "Pvt Shared" denotes private pointers to the local shared space
- MPI on GigE shared stores take two orders of magnitude longer and are therefore not shown
- Marvel shared loads and stores are roughly two orders of magnitude greater than private
- [Chart: execution time (µsec) of private, shared, and pvt-shared loads and stores on MPI SCI, MPI GigE, Elan, GM, VAPI, and Marvel.]

UPC Function Performance - Loads and Stores (remote data)
- Loads and stores through pointers (not bulk), with the data remote to the calling node
- Note: MPI on GigE showed ~450 µsec for loads and ~500 µsec for stores (not plotted)
- Marvel remote access through pointers is the same as with local shared data, two orders of magnitude less than Elan
- [Chart: remote load and store execution time (µsec) on MPI-SCI, Elan, GM, VAPI, and Marvel.]
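The following UPC sketch gathers the operation categories measured above in one place (block size and indices are arbitrary; this is not the benchmark code):

    /* UPC sketch of the operations measured above: cast, pointer addition,
     * single-element loads/stores, and bulk copies.  Sizes are illustrative. */
    #include <upc_relaxed.h>

    #define BLK 2048
    shared [BLK] int data[BLK*THREADS];       /* one block of BLK ints per thread */

    int main(void)
    {
        int local[BLK];
        int v;
        int *priv;
        shared int *sp;

        /* Pointer conversion: a shared address with local affinity cast to a
         * private pointer (measured at ~2 ns on the Berkeley UPC conduits). */
        priv = (int *)&data[MYTHREAD * BLK];
        priv[0] = 1;                          /* store through the private pointer */

        /* Shared-pointer addition and single-element load/store. */
        sp = &data[0];
        sp += ((MYTHREAD + 1) % THREADS) * BLK;   /* advance to a neighbor's block */
        *sp = MYTHREAD;                           /* remote store when THREADS > 1 */
        v = *sp;                                  /* remote load                   */

        upc_barrier;                              /* order fine-grained and bulk phases */

        /* Bulk copies between shared and private memory (8 KB here). */
        upc_memget(local, &data[MYTHREAD * BLK], sizeof(local));
        upc_memput(&data[MYTHREAD * BLK], local, sizeof(local));

        upc_barrier;
        (void)v;
        return 0;
    }

On Berkeley UPC, the cast and the private-pointer store compile to plain C operations, while the shared-pointer accesses and bulk copies go through the UPC runtime (the bulk copies map to GASNet blocking put/get, as noted above).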
UPC Benchmark - IS from NAS
- IS (Integer Sort, Class A): lots of fine-grained communication, low computation
- Poor performance in the GASNet communication system does not necessarily indicate poor performance in the UPC application
- [Chart: execution time (sec) for 1, 2, 4, and 8 threads on GM, Elan, GigE MPI, VAPI (Xeon), SCI MPI, SCI, and Marvel.]

UPC Benchmark - FT from NAS*
- FT (3-D Fast Fourier Transform, Class A): medium communication, high computation
- Used optimized version 01 (private pointers to local shared memory); a sketch of this optimization follows below
- The SCI conduit was unable to run due to a driver limitation (size constraint)
- High-bandwidth networks perform best (VAPI followed by Elan)
- The VAPI conduit allows a cluster of Xeons to keep pace with Marvel's performance
- MPI on GigE is not well suited to these problems; its high-latency, low-bandwidth traits limit performance
- MPI on SCI has lower bandwidth than VAPI but still maintains near-linear speedup beyond 2 nodes (it skirts TCP/IP overhead)
- GM performance is a factor of processor speed (see the 1-thread case)
- [Chart: execution time (sec) for 1, 2, 4, and 8 threads on GM, Elan, GigE MPI, VAPI, SCI MPI, and Marvel; the high latency of MPI on GigE impedes performance.]
* Using code developed at GWU
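The "private pointers to local shared memory" optimization used in the FT version above can be sketched as follows (array shape and loop are illustrative, not the NAS FT code):

    /* Sketch of the "private pointers to local shared memory" optimization:
     * the thread's own block is walked through an ordinary C pointer,
     * avoiding shared-pointer overhead on every reference. */
    #include <upc_relaxed.h>

    #define CHUNK 4096
    shared [CHUNK] double field[CHUNK*THREADS];

    void scale_local_chunk(double s)
    {
        int i;
        double *mine;

        /* One cast up front instead of a shared-pointer dereference per element. */
        mine = (double *)&field[MYTHREAD * CHUNK];

        for (i = 0; i < CHUNK; i++)
            mine[i] *= s;              /* plain loads/stores, no runtime calls */
    }

    int main(void)
    {
        scale_local_chunk(2.0);
        upc_barrier;
        return 0;
    }

The same idea appears in the pointer-operation measurements earlier: a one-time ~2 ns cast replaces a shared-pointer dereference on every element of the local block.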
UPC Benchmark - DES Differential Attack Simulator
- S-DES (8-bit key) cipher (integer-based)
- Creates the basic components used in differential cryptanalysis: S-boxes, Difference Pair Tables (DPT), and Differential Distribution Tables (DDT)
- Bandwidth-intensive application, designed for a high cache-miss rate, so very costly in terms of memory access
- [Chart: execution time (msec) for sequential, 1, 2, and 4 threads on GM, Elan, GigE MPI, VAPI (Xeon), SCI MPI, SCI, and Marvel.]

UPC Benchmark - DES Analysis
- With an increasing number of nodes, bandwidth and NIC response time become more important; interconnects with high bandwidth and fast response times perform best
- Marvel shows near-perfect linear speedup, but the processing time of integers is an issue
- VAPI shows constant speedup
- Elan shows near-linear speedup from 1 to 2 nodes, but more nodes are needed in the testbed for better analysis
- GM does not begin to show any speedup until 4 nodes, and then only minimal speedup
- The SCI conduit performs well for high-bandwidth programs but has the same speedup problem as GM
- The MPI conduit is clearly inadequate for high-bandwidth programs

UPC Benchmark - Differential Cryptanalysis for the CAMEL Cipher
- Uses 1024-bit S-boxes
- Given a key, encrypts data, then tries to guess the key solely from the encrypted data using a differential attack
- Three main phases:
  1. Compute the optimal difference pair based on the S-box (not very CPU-intensive)
  2. Perform the main differential attack (extremely CPU-intensive): obtain a list of candidate keys, then check all candidate keys by brute force in combination with the optimal difference pair computed earlier
  3. Analyze the data from the differential attack (not very CPU-intensive)
- Computationally intensive (independent processes) plus several synchronization points
- Parameters: MAINKEYLOOP = 256; NUMPAIRS = 400,000; initial key = 12345
- [Chart: execution time (s) for 1, 2, 4, 8, and 16 threads on SCI (Xeon), VAPI (Opteron), and Marvel.]

UPC Benchmark - CAMEL Analysis
- Marvel
  - Attained almost perfect speedup
  - Synchronization cost very low
- Berkeley UPC
  - Speedup decreases with an increasing number of threads
  - Run time varied greatly as the number of threads increased
  - The cost of synchronization increases with the number of threads, making it hard to get consistent timing readings
  - Still decent performance at 32 threads (76.25% efficiency, VAPI)
  - Performance is more sensitive to data affinity

Architectural Performance Tests
Theme: a preliminary study of tradeoffs in available processor architectures, since their performance will clearly affect computation, communication, and synchronization in UPC clusters.
- Intel Pentium 4 Xeon
  - 32-bit processor; Intel NetBurst microarchitecture; Hyper-Threading technology for increased CPU utilization; 4.3 GB/s I/O bandwidth
- AMD Opteron
  - 32-bit/64-bit processor; RISC processor core; real-time support of 32-bit OSes; on-chip memory controllers; eliminates the 4 GB memory barrier imposed by 32-bit systems; 19.2 GB/s I/O bandwidth per processor
- Intel Itanium II
  - 64-bit processor; based on the EPIC architecture; 3-level cache design; Enhanced Machine Check Architecture (MCA) with extensive Error Correcting Code (ECC); 6.4 GB/s I/O bandwidth

CPU Performance Results
- AIM 9 disk-copy benchmark: 10 iterations using 5 MB files, testing sequential and random reads, writes, and copies (computation benchmarks excluded due to compiler problems with Itanium2)
- Itanium2 is slightly above Opteron in both reads and writes, except for random writes where Opteron has a slight advantage
- Both Itanium2 and Opteron outperform Xeon by a wide margin in all cases except sequential reads
- Xeon sequential reads are comparable to Opteron, but Itanium2 is much higher than both
- Major performance gain from sequential reads compared to random, but sequential writes do not receive nearly as large a boost
- [Chart: AIM 9 throughput (MB/s) for random and sequential reads and writes on Itanium2, Opteron, and Xeon.]

10 Gigabit Ethernet - Preliminary Results
- Testbed: nodes each with dual 2.4 GHz Xeons, S2io Xframe 10GigE card in PCI-X 100, 1 GB PC2100 DDR RAM, Intel PRO/1000 1GigE, RedHat 9.0 kernel 2.4.20-8smp, LAM-MPI V7.0.3
- 10GigE is promising due to the expected economy of scale of Ethernet
- S2io 10GigE shows impressive throughput, though slightly less than half of the theoretical maximum; further tuning is needed to go higher
- Results show a much-needed decrease in latency versus other Ethernet options
- [Charts: throughput (MB/s) and round-trip latency (µsec) vs. message size for 10GigE and GigE.]

Conclusions
Key insights:
- The HCS SCI conduit shows promise
  - Performance on par with other conduits
  - On-going collaboration with the vendor (Dolphin) to resolve the memory constraint issue
- The Berkeley UPC system is a promising COTS cluster tool
  - Performance on par with HP UPC (also see [6])
  - Performance of COTS clusters matches and sometimes beats that of high-end CC-NUMA systems
  - Various conduits allow UPC to execute on many interconnects; VAPI and Elan are initially found to be the strongest
  - Some open issues remain with bugs and optimization; active bug reports and the development team help drive improvements
  - A very good solution for executing UPC on clusters, but perhaps not quite ready for production use: no debugging or performance tools are available
- Xeon clusters are suitable for applications with a high read/write ratio
- Opteron clusters are suitable for generic applications due to comparable read and write capability
- Itanium2 is excellent for sequential reads, and about the same as Opteron for everything else
- 10GigE provides high bandwidth with much lower latencies than 1GigE
Key accomplishments to date:
- Baselining of UPC on shared-memory multiprocessors
- Evaluation of promising tools for UPC on clusters
- Leveraging and extension of communication and UPC layers
- Conceptual design of new tools for UPC
- Preliminary network and system performance analyses for UPC systems
- Completion of an optimized GASNet SCI conduit for UPC

References
1. D. Bonachea, "GASNet Specification, v1.1," U.C. Berkeley Tech Report UCB/CSD-02-1207, October 2002.
2. C. Bell and D. Bonachea, "A New DMA Registration Strategy for Pinning-Based High Performance Networks," Workshop on Communication Architecture for Clusters (CAC'03), April 2003.
3. W. Carlson, J. Draper, D. Culler, K. Yelick, E. Brooks, and K. Warren, "Introduction to UPC and Language Specification," IDA Center for Computing Sciences, Tech. Report CCS-TR-99-157, May 1999.
4. K. A. Yelick, L. Semenzato, G. Pike, C. Miyamoto, B. Liblit, A. Krishnamurthy, P. N. Hilfinger, S. L. Graham, D. Gay, P. Colella, and A. Aiken, "Titanium: A High-Performance Java Dialect," Concurrency: Practice and Experience, Vol. 10, No. 11-13, September-November 1998.
5. B. Gordon, S. Oral, G. Li, H. Su, and A. George, "Performance Analysis of HP AlphaServer ES80 vs. SAN-based Clusters," 22nd IEEE International Performance, Computing, and Communications Conference (IPCCC), April 2003.
6. W. Chen, D. Bonachea, J. Duell, P. Husbands, C. Iancu, and K. Yelick, "A Performance Analysis of the Berkeley UPC Compiler," 17th Annual International Conference on Supercomputing (ICS), June 2003.