Performance Potential of an Easy-to-Program PRAM-On-Chip Prototype Versus State-of-the-Art Processor

George C. Caragea – University of Maryland
A. Beliz Saybasili – LCB Branch, NHLBI, NIH
Xingzhi Wen – NVIDIA Corporation
Uzi Vishkin – University of Maryland
www.umiacs.umd.edu/users/vishkin/XMT

Hardware Prototypes of PRAM-On-Chip
• 64-core, 75 MHz FPGA prototype [SPAA'07, Computing Frontiers'08], implementing the original explicit multi-threaded (XMT) architecture [SPAA'98] (Cray started using the name "XMT" about 7 years later).
• Interconnection network for 128 cores: 9mm x 5mm, IBM 90nm process, 400 MHz prototype [HotInterconnects'07].
• Same design as the 64-core FPGA: 10mm x 10mm, IBM 90nm process, 150 MHz prototype.
• The design scales to 1000+ cores on chip.

Objective of the Current Paper
A meaningful comparison of:
1. our FPGA design, with
2. a state-of-the-art (Intel) processor.

XMT: A PRAM-On-Chip Vision
• Manycores are coming, but 40 years of parallel computing have never produced a successful general-purpose parallel computer: one that is easy to program, gives good speedups, and scales up and down. IF you could program it, you got great speedups. XMT: fix the IF.
• XMT was designed from the ground up to address this for on-chip parallelism, unlike work that matches current hardware (as in some other SPAA papers).
• Tested hardware and software prototypes.
• Builds on PRAM algorithmics, the only really successful parallel algorithmic theory: a latent, though not widespread, knowledge base.
• This paper: ~10X relative to an Intel Core 2 Duo.
• If there is time: we are really serious about ease of programming.

Objective for the Programmer's Model
• The emerging consensus is unclear, but the analysis should be work-depth. So why not design for your analysis, as in serial computing? Ask what could be done in parallel at each step, assuming unlimited hardware.
• Serial paradigm: Time = Work. Natural (parallel) paradigm: Work = total number of operations, but Time << Work. (The original slide illustrates both with a figure of operations laid out over time.)
• XMT is designed for work-depth, unique among manycores:
– one operation now, any number of operations in the next time unit;
– competitive on nesting (to be published);
– no need to program for locality.

Programmer's Model: Engineering Workflow
• Arbitrary CRCW work-depth algorithm: reason about correctness and complexity in the synchronous model.
• SPMD reduced synchrony:
– Threads advance at their own speed, not in lockstep.
– Main construct: the spawn-join block (the slide depicts execution as alternating serial and parallel segments: spawn ... join, spawn ... join); note that any number of threads can be started at once.
– Prefix-sum (ps) with independence-of-order semantics (IOS); see the XMTC sketch below.
– Establish correctness and complexity by relating execution back to the work-depth analysis.
– Circumvents "the problem with threads", e.g., [Lee].
• Tune (compiler or expert programmer): (i) the length of the sequence of round trips to memory, (ii) QRQW, (iii) work-depth [VCL07].
• Contrast with trial and error, which has a similar start but then: while inter-thread bandwidth is insufficient do { rethink the algorithm to take better advantage of the cache }.
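To make spawn-join, $, and ps concrete, here is a minimal XMTC sketch of array compaction, the standard introductory XMT example: the nonzero elements of A are copied into B in arbitrary order. The syntax follows the XMTC tutorial listed in the software release below; declaration details (such as the required scope of psBaseReg) may differ in the released compiler.

```c
/*
 * Array compaction in XMTC: copy the nonzero elements of A[0..N-1]
 * into B, in arbitrary order (IOS). Adapted from the standard XMTC
 * tutorial example; exact declaration rules may differ in the
 * released compiler.
 */
#define N 1000

int A[N], B[N];
psBaseReg count;            /* hardware prefix-sum base register */

int main(void) {
    count = 0;
    spawn(0, N - 1) {       /* start N virtual threads; $ is the thread ID */
        int e = 1;
        if (A[$] != 0) {
            ps(e, count);   /* atomically: e = old count; count += 1 */
            B[e] = A[$];    /* e is a unique slot, whatever the thread order */
        }
    }                       /* implicit join: all threads have finished here */
    /* count now holds the number of nonzero elements */
    return 0;
}
```

Because each ps call hands its thread a unique slot regardless of arrival order, the program is correct under IOS even though threads are not in lockstep; this is how the synchronous work-depth analysis carries over to the reduced-synchrony execution.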
XMT Architecture Overview
• One serial core: the master thread control unit (MTCU).
• Parallel cores (TCUs) grouped in clusters.
• Global memory space evenly partitioned into cache banks using hashing.
• No local caches at the TCUs, which avoids expensive cache-coherence hardware.
• Block diagram (figure in the original slide): the MTCU and a hardware scheduler/prefix-sum unit, plus clusters 1..C, connected through a parallel interconnection network to memory banks 1..M (the shared L1 cache), which in turn connect to DRAM channels 1..D.

Paraleap: XMT PRAM-On-Chip Silicon
• Built FPGA prototype, announced at SPAA'07.
• Built using 3 FPGA chips: 2 Virtex-4 LX200, 1 Virtex-4 FX100.

  Clock rate      75 MHz     DRAM size        1 GB
  No. TCUs        64         DRAM channels    1
  Clusters        8          Mem. data rate   0.6 GB/s
  Cache modules   8          Shared cache     256 KB

• With no prior design experience, X. Wen completed the synthesizable Verilog description AND the new FPGA-based XMT computer in slightly more than two years, single-handedly. The basic simplicity of the XMT architecture translates into faster time to market and lower implementation cost.

Benchmarks
• Sparse matrix-vector multiplication (SpMV)
– Matrix stored in Compressed Sparse Row (CSR) format.
– Serial version: iterate through the rows.
– Parallel version: one thread per row (see the first sketch below).
• 1-D FFT
– Fixed-point arithmetic implementation.
– Serial version: radix-2 Cooley-Tukey algorithm.
– Parallel version: each stage of the serial algorithm parallelized.
• Quicksort
– Serial version: standard textbook implementation.
– Parallel version: two phases.
– Phase 1: for large sub-arrays, parallelize the partitioning operation using the atomic prefix-sum (see the second sketch below).
– Phase 2: process all partitions in parallel, each using the serial partitioning algorithm.
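As a first sketch, the "one thread per row" SpMV parallelization in XMTC style. All names and sizes (values, colIdx, rowPtr, x, y, NROWS, NNZ) are illustrative, not taken from the paper's benchmark code, and integer data stands in for the actual values; a square matrix is assumed.

```c
/*
 * Illustrative XMTC sketch (not the paper's code): CSR SpMV with one
 * spawned thread per row. values/colIdx hold the nonzeros and their
 * column indices; row r occupies values[rowPtr[r]..rowPtr[r+1]-1].
 */
#define NROWS 4096
#define NNZ   65536

int values[NNZ], colIdx[NNZ];
int rowPtr[NROWS + 1];          /* rowPtr[NROWS] == NNZ */
int x[NROWS], y[NROWS];

int main(void) {
    spawn(0, NROWS - 1) {       /* one virtual thread per matrix row */
        int j, sum = 0;
        for (j = rowPtr[$]; j < rowPtr[$ + 1]; j++)
            sum += values[j] * x[colIdx[j]];
        y[$] = sum;             /* rows are independent: no synchronization */
    }
    return 0;
}
```

Each row's dot product is written to a distinct y[$], so no prefix-sum or other coordination is needed; the thread count simply equals the number of rows.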
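The second sketch illustrates the phase-1 idea for quicksort: one thread per element, with the atomic prefix-sum claiming a unique output slot on the "less than" or "greater or equal" side of the pivot. This is a reconstruction of the approach described above under stated assumptions, not the paper's implementation; the scratch array B and the counters lt/ge are hypothetical names.

```c
/*
 * Illustrative XMTC sketch of one parallel partition step over
 * A[0..N-1] around `pivot`, writing into scratch array B.
 * Reconstructed from the description above, not the paper's code.
 */
#define N 100000

int A[N], B[N], pivot;
psBaseReg lt, ge;               /* slot counters for the two sides */

void partition_step(void) {
    lt = 0; ge = 0;
    spawn(0, N - 1) {           /* one virtual thread per element */
        int e = 1;
        if (A[$] < pivot) {
            ps(e, lt);          /* e = unique index from the left end */
            B[e] = A[$];
        } else {
            ps(e, ge);          /* e = unique offset from the right end */
            B[N - 1 - e] = A[$];
        }
    }
    /* After the join: lt + ge == N, B[0..lt-1] < pivot <= B[lt..N-1];
     * within each side the elements land in arbitrary order (IOS). */
}
```

Once sub-arrays become small, phase 2 switches to running the serial partitioning algorithm on all partitions in parallel, one thread per partition.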
Experimental Platforms

                   XMT Paraleap (FPGA)        Intel Core 2 Duo
  Processor        1 MTCU, 64 TCUs            dual core, E6300
  Clock            75 MHz                     1.86 GHz
  Cache            256 KB shared L1 cache     2x64 KB L1, 2x2 MB L2
  DRAM             1 GB DDR2                  2 GB DDR2
  Data rate        0.6 GB/s                   6.4 GB/s
  Compiler         XMTCC (GCC 4.0.2-based)    GCC 4.1.0; Intel C++ Professional Compiler (ICC) v11
  Compiler optim.  -O3, data prefetch,        -O3, SSE3 SIMD, data prefetching,
                   read-only buffers          auto-parallelization

• For a meaningful comparison: compare clock-cycle counts.

Input Datasets

                 small               large
  Program        N      footprint    N      footprint
  SpMV           22K    200 KB       4M     33 MB
  FFT            8K     192 KB       4M     96 MB
  Quicksort      100K   781 KB       20M    153 MB

• The large datasets represent realistic input sizes, recommended by an Intel engineer for the comparison; they give the Intel Core 2 an advantage because of its larger cache.
• The small datasets fit in the caches of both Paraleap and the Intel Core 2, providing the fairest comparison for the current XMT generation.

Clock-Cycle Speedup
• Computed as: speedup = (#clock cycles on Core 2) / (#clock cycles on Paraleap).

                 Core 2 – ICC        Core 2 – GCC
  Program        small    large      small    large
  SpMV           6.7      3.3        6.3      3.26
  FFT            9.51     2.51       8.76     2.71
  Quicksort(*)   13.07    7.75       13.89    8.18

• Paraleap outperforms the Intel Core 2 on all benchmarks.
• Speedups are lower on the large datasets because of Paraleap's smaller cache; this will not be an issue for future implementations of XMT.
• The silicon area of a 64-TCU XMT is roughly the same as one core of the Intel Core 2 Duo.
• There is no reason for the clock frequency of XMT to fall behind.

Conclusion
• XMT provides a viable answer to the biggest challenges for the field: ease of programming and scalability (up and down).
• This preliminary evaluation shows good results for the XMT architecture versus a state-of-the-art Intel Core 2 platform.
• An ICPP'08 paper compares XMT with GPUs.

Software Release
Allows you to use your own computer for programming in an XMT environment and experimenting with it, including:
(i) a cycle-accurate simulator of the XMT machine;
(ii) a compiler from XMTC to that machine.
Also provided, extensive material for teaching or self-studying parallelism, including:
(i) tutorial + manual for XMTC (150 pages);
(ii) class notes on parallel algorithms (100 pages);
(iii) video recording of the 9/15/07 high-school tutorial (300 minutes);
(iv) video recording of the graduate Parallel Algorithms lectures (30+ hours).
www.umiacs.umd.edu/users/vishkin/XMT/sw-release.html

Next Major Objective
An industry-grade chip. Requires 10X in funding.

Ease of Programming
• Benchmark: can any CS major program your manycore? You cannot really avoid the question. Teachability demonstrated so far:
– A freshman class with 11 non-CS students; some programming assignments: merge-sort, integer-sort, and sample-sort.
• Other teachers:
– A magnet high-school teacher downloaded the simulator, assignments, and class notes from the XMT page and taught himself. He recommends teaching XMT first: it is the easiest to set up (a simulator), to program, and to analyze, with the ability to anticipate performance as in serial programming, and not only for embarrassingly parallel problems. He also teaches OpenMP, MPI, and CUDA. See the keynote at CS4HS'09@CMU and the interview with the teacher.
– High-school and middle-school students (some 10 years old) from underrepresented groups, taught by a high-school math teacher.