General-Purpose vs. GPU: Comparison of Many-Cores on Irregular Benchmarks
George C. Caragea, Fuat Keceli, Alexandros Tzannes, Uzi Vishkin

XMT: An Easy-to-Program Many-Core

XMT: Motivation and Background
• Many-cores are coming. But 40 years of parallel computing have never produced a successful general-purpose parallel computer: easy to program, good speedups, scalable up and down.
• IF you could program it: great speedups. XMT: fix the IF.
• XMT: designed from the ground up to address on-chip parallelism. Tested HW & SW prototypes.
• Builds on PRAM algorithmics, the only really successful parallel algorithmic theory. A latent, though not widespread, knowledge base.

XMT Programming Model
• At each step, provide all instructions that can execute concurrently (not dependent on each other).
• PRAM/XMT abstraction: all such instructions execute immediately ("uniform cost").
• PRAM-like programming, using reduced synchrony.
• Main construct: the spawn-join block, which can start any number of virtual threads at once.
• Virtual threads advance at their own speed, not in lockstep.
• Prefix-sum (ps): similar to atomic fetch-and-add.

Paraleap: XMT PRAM-on-chip silicon
• Built an FPGA prototype, announced at SPAA'07.
• Built using 3 FPGA chips: 2 Virtex-4 LX200, 1 Virtex-4 FX100.

  Clock rate          75 MHz
  DRAM size           1GB
  DRAM channels       1
  Mem. data rate      0.6GB/s
  No. cores (TCUs)    64
  Clusters            8
  Cache modules       8
  Shared cache        256KB

Ease of Programming
• A necessary condition for the success of a general-purpose platform.
• Already in von Neumann's 1947 specs.
• Indications that XMT is easy to program:
  1. XMT is based on a rich algorithmic theory (PRAM).
  2. Ease of teaching as a benchmark:
     a. Parallel programming successfully taught from middle school and high school on up.
     b. Evaluated by education experts (SIGCSE 2010).
     c. XMT found superior to MPI, OpenMP and CUDA.
  3. A programmer's workflow for deriving efficient programs from PRAM algorithms.
  4. DARPA HPCS productivity study: XMT development time was half that of MPI.

XMTC Programming Language
• C with simple SPMD extensions:
• spawn: start any number of virtual threads.
• $: unique thread ID.
• ps/psm: atomic prefix-sum, with an efficient hardware implementation.

    int A[N], B[N];
    int base = 0;
    spawn(0, N-1) {
        int inc = 1;
        if (A[$] != 0) {
            ps(inc, base);
            B[inc] = A[$];
        }
    }

XMTC Example: Array Compaction
• Non-zero elements of A are copied into B; order is not necessarily preserved.
• Atomically executing ps(inc, base) sets base = base + inc and leaves the original value of base in inc.
• Each element is therefore copied into a unique location in B.
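Before the architecture comparison, here is the same array-compaction pattern as a minimal, self-contained C11 sketch. It illustrates the ps semantics and is not the XMT implementation: atomic_fetch_add stands in for the hardware prefix-sum, a serial loop stands in for spawn(0, N-1), and fetch_and_add is an illustrative helper; A, B, base and inc mirror the XMTC example above.

    #include <stdatomic.h>
    #include <stdio.h>

    #define N 8

    /* Stand-in for XMTC's ps(inc, base): atomically add inc to base
       and hand back the value base held before the addition. */
    static int fetch_and_add(atomic_int *base, int inc) {
        return atomic_fetch_add(base, inc);
    }

    int main(void) {
        int A[N] = {3, 0, 7, 0, 0, 2, 9, 0};
        int B[N] = {0};
        atomic_int base = 0;

        /* Serial stand-in for spawn(0, N-1): on XMT, each iteration
           would be an independent virtual thread. */
        for (int t = 0; t < N; t++) {
            int inc = 1;
            if (A[t] != 0) {
                inc = fetch_and_add(&base, inc);  /* inc <- old base; base += 1 */
                B[inc] = A[t];                    /* unique slot in B */
            }
        }

        /* Prints the non-zero elements of A; under real concurrency
           their order would not necessarily be preserved. */
        for (int i = 0; i < base; i++)
            printf("%d ", B[i]);
        printf("\n");
        return 0;
    }

On XMT the same pattern remains scalable because concurrent ps requests are combined in hardware and complete in constant time, instead of serializing on a single memory location as a software atomic typically does.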
TESLA vs. XMT: Architecture Comparison

Memory Latency Hiding and Reduction
• TESLA: heavy multithreading (requires large register files and a state-aware scheduler); limited local shared scratchpad memory; no coherent private caches at the SM or SP level.
• XMT: large globally shared cache; no coherent private TCU or cluster caches; software prefetching (see the first sketch after this comparison).

Memory and Cache Bandwidth
• TESLA: memory access patterns must be coordinated by the user for efficiency (request coalescing); scratchpad memories are prone to bank conflicts.
• XMT: relaxed need for user-coordinated DRAM access, thanks to the caches; address hashing avoids memory-module hotspots; high-bandwidth mesh-of-trees interconnect between clusters and caches.

Functional Unit (FU) Allocation
• TESLA: dedicated FUs for SPs and SFUs; less arbitration logic required; higher theoretical peak performance.
• XMT: heavy FUs (FPU and MDU) are shared through arbitrators; lightweight FUs (ALU, branch) are allocated per TCU; ALUs do not include multiply-divide functionality.

Control Flow and Synchronization
• TESLA: a single instruction cache and issue unit per SM; warps execute in lockstep, which penalizes diverging branches; synchronization and communication are efficient locally within blocks but expensive globally; switching between serial and parallel modes (i.e., passing control from CPU to GPU) requires off-chip communication.
• XMT: one instruction cache and program counter per TCU enables independent progress of threads; threads coordinate via constant-time prefix-sum (see the BFS sketch after this comparison), with other communication going through the shared cache; dynamic hardware support for fast switching between serial and parallel modes and for load balancing of virtual threads.
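A minimal illustration of the software-prefetching idea from the XMT column, written in plain C with the GCC/Clang __builtin_prefetch hint. This is an analogy only: XMT instead relies on its dedicated prefetch buffers (the 32KB "Prefetch Buffers" entry in the configuration table below), and PREFETCH_DIST and sum_with_prefetch are illustrative names.

    /* Latency hiding via software prefetching: request data
       PREFETCH_DIST iterations ahead while computing on the
       current element, so DRAM latency overlaps useful work. */
    #define PREFETCH_DIST 16

    long sum_with_prefetch(const int *a, int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            if (i + PREFETCH_DIST < n)
                __builtin_prefetch(&a[i + PREFETCH_DIST], 0 /* read */, 1);
            sum += a[i];
        }
        return sum;
    }

The distance of 16 elements is arbitrary; the point is only that fetches are initiated before the data is needed.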
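To connect the prefix-sum primitive to an irregular workload, here is a single-level BFS sketch in the same spirit; it previews the Bfs benchmark used in the evaluation below. Again atomic_fetch_add stands in for ps and a serial loop for spawn; bfs_level, rowptr, colidx, dist, frontier, fsize and next are illustrative names, not code from the study.

    #include <stdatomic.h>

    /* One BFS level over a CSR graph (rowptr/colidx): expand the
       current frontier into next. Every newly discovered vertex
       claims a unique slot in next via fetch-and-add -- the role
       ps plays on XMT, where it completes in constant time. */
    int bfs_level(const int *rowptr, const int *colidx, int *dist,
                  int level, const int *frontier, int fsize, int *next) {
        atomic_int tail = 0;
        /* Serial stand-in for spawn(0, fsize-1): one virtual thread
           per frontier vertex; vertex degrees -- and hence thread
           lengths -- vary, which is what makes BFS irregular. */
        for (int t = 0; t < fsize; t++) {
            int v = frontier[t];
            for (int e = rowptr[v]; e < rowptr[v + 1]; e++) {
                int w = colidx[e];
                if (dist[w] == -1) {  /* unvisited; a truly parallel version
                                         would claim w atomically, e.g. with
                                         psm on a per-vertex gatekeeper */
                    dist[w] = level + 1;
                    next[atomic_fetch_add(&tail, 1)] = w;
                }
            }
        }
        return atomic_load(&tail);    /* size of the next frontier */
    }

Because vertex degrees vary, the virtual threads do unequal amounts of work; XMT's per-TCU program counters and hardware load balancing tolerate this, whereas lockstep warps pay for the divergence.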
Tested Configurations: GTX280 vs. XMT-1024
• Configurations need equivalent area constraints (576 mm² in 65nm); one cannot simply set the numbers of functional units and the memory sizes to the same values.
• The area estimate for the envisioned XMT chip is based on the 64-TCU XMT ASIC prototype (designed in 90nm IBM technology).
• In each category, the more area-intensive side is emphasized.

                                      GTX280             XMT-1024
  Principal Computational Resources
  Cores                               240 SP, 60 SFU     1024 TCU
  Integer Units                       240 ALU+MDU        1024 ALU, 64 MDU
  Floating Point Units                240 FPU, 60 SFU    64 FPU
  On-Chip Memory
  Registers                           1920KB             128KB
  Prefetch Buffers                    --                 32KB
  Regular Caches                      --                 4104KB
  Constant Cache                      240KB              128KB
  Texture Cache                       480KB              --
  Shared Memory                       480KB              --

Benchmarks

  Name    Description                    CUDA Source                    Lines of Code   Dataset                  Parallel sections   Threads/section
                                                                        CUDA    XMT                              CUDA    XMT         CUDA     XMT
  Bfs     Breadth-First Search           Harish and Narayanan; Rodinia  290     86      1M nodes, 6M edges       25      12          1M       87.4K
  Bprop   Back Propagation               Rodinia                        960     522     64K nodes                2       65          1.04M    19.4K
  Conv    Image Convolution              NVIDIA CUDA SDK                283     87      1024x512                 2       2           131K     512K
  Msort   Merge-Sort                     Thrust library                 966     283     1M keys                  82      140         32K      10.7K
  NW      Needleman-Wunsch               Rodinia                        430     129     2x2048 sequences         255     4192        1.1K     1.1K
  Reduct  Parallel Reduction             NVIDIA CUDA SDK                481     59      16M elts.                3       3           5.5K     44K
  Spmv    Sparse matrix-vector multiply  Bell and Garland               91      34      36K x 36K, 4M non-zero   1       1           30.7K    36K

Experimental Platform
XMTSim: the cycle-accurate XMT simulator
• Timing modeled after the 64-TCU FPGA prototype.
• Highly configurable, to simulate any configuration.
• Modular design enables architectural exploration.
• Part of the XMT Software Release: http://www.umiacs.umd.edu/users/vishkin/XMT/sw-release.html

Performance Comparison
• With the 1024-TCU XMT configuration:
  • 6.05x average speedup on the irregular applications.
  • 2.07x average slowdown on the regular applications.
• With the 512-TCU XMT configuration:
  • 4.57x average speedup on the irregular applications.
  • 3.06x average slowdown on the regular applications.
• Case study: BFS on a low-parallelism dataset:
  • 73.4x speedup over the Rodinia implementation.
  • 6.89x speedup over the UIUC implementation.
  • 110.6x speedup when using only 64 TCUs (lower latencies in the smaller design).

Conclusions
• SPAA'09: 10x over an Intel Core Duo with the same silicon area.
• Current work: XMT outperforms the GPU on all irregular workloads and does not fall significantly behind on regular workloads.
• No need to pay a high performance penalty for ease of programming.
• A promising candidate for the pervasive platform of the future: a highly parallel general-purpose CPU coupled with a parallel GPU.
• Future work: power/energy comparison of XMT and GPU.