Institute of Computing Technology Chinese Academy of Sciences Many-Core Processor Design for High Performance and High Throughput Computing Dongrui Fan, Xiaochun Ye Institute of Computing Technology, Chinese Academy of Sciences AMS Processor Research Plan in AMS High Throughput Computing ExtraScale Computing 1000 cores on chip Advanced Simulation 1000 threads on chip Manycore Research Godson-T(Tera Flops) • Fine-Grain Sychronization • Resource Sharing Godson-D (Throughput) HPC HTC 2 • Realtime Processing • Concurrent Independent Processing CO-DESIGN 2015 Overview Microprocessor Architecture Challenges Godson-T Features for High Performance Computing Godson-D Features for High Throughput Computing 3 CO-DESIGN 2015 Microprocessor Architecture Revolution Memory wall + Power wall + ILP wall = Brick wall ! What parallel architecture will be successful? Power Wall Many-Core Performance Multi-Core ILP Wall Superscalar Processor Core Simpler Processor Core Power, Complexity Parallel architectures come to rescue, superscalar processor is being replaced by on-chip multi-core / many-core processor 4 CO-DESIGN 2015 Exploiting Polymorphic Parallelism Instruction-Level Parallelism Godson-3 With Powerful Processing Unit Heterogenous Many-Core With moderate DLP and strong TLP exploitation Godson-T Thread-Level Parallelism 5 CO-DESIGN 2015 Many-Core Challenges: The 5 P’s Power efficiency challenge Performance challenge How to provide a converged many core solution in a standard programming environment Parallelism challenge How to scale from 1 to 1000 cores – the number of cores is the new Megahertz Programming challenge Performance per watt is the new metric – Dark Silicon becomes serious when facing 10nm process technology How to exploit parallelism through software and hardware co-design Platform challenge How to maximize the usability, such as, runtime system (OS), compiler & library, many-core debug… 6 CO-DESIGN 2015 What we could help…… Productivity Side: Most sequential programmers are not ready to switch into parallel programming Unique programming model cannot solve all problems efficiently; conservative programming model and make incremental improvement? support multiple programming features efficiently? Locks are messy new way to eliminate deadlock? …… Performance Side: On-chip synchronization is fast; Flops are cheap, memory communications are expensive; handle all synchronizations on-chip? trade flops for communication latency? Fine-grained parallelism should be taken into consideration; enable data-driven thread execution on chip? …… 7 CO-DESIGN 2015 Overview Microprocessor Architecture Challenges › Memory wall + Power wall + ILP wall › The 5P’s: Many-core Processor Challenges Godson-T Features for High Performance Computing › Godson-T architecture overview › Architectural supports for multithreading › Software runtime system Godson-D Features for High Throughput Computing 8 CO-DESIGN 2015 Architecture Overview of Godson-T Program Runtime System Memory Controller CORE GodsonT Processor L1$ Private Memroy R Memory Controller Memory Controller Processing Core Private I-$ Fetch & Decode Register Files FP/ INT MAC Sync Vector ALU ALU Unit ALU Synchronization Manager L2 Cache Bank I/O Controller SPM D$ Local Memory DTA Router Memory Controller 9 CO-DESIGN 2015 Processing Core 4Byte FETCH Instruction Buffer 1 instruction DECODE 1 instruction 32Byte 2r/1w Vector Register File 32 Bytex 2x1 Vector/ Floating-Point Unit Vector/ Fixed-Point Unit ISA: MIPS (user), SIMD-ext., sync-ext. 8-stage pipeline Dual-issue per thread Expected SIMD L0_Icache L0_Dcache 16Byte 16Byte Load/Store/ Synchronization Unit Fast level-1 memory 32KB private memory Automatically mapped into stack address space Full/empty bit tagged on each 64-bit slot enables efficient producerconsumer style synchronization 16Byte in/out msg 8Byte SPM/ L1 Cache Router 10 Load, Store, Data Move, Arithmetic …… Communication with external modules through message packets CO-DESIGN 2015 Interconnection Separated routers with each processing core Static XY wormhole routing Round-Robin arbitration Duplex 128bit link for each connection Guaranteed in-order point-to-point communication Deadlock-free & livelock-free Scalable and power-efficient for MESH topology Low latency core-to-core communication Separated networks allowing traffic segregation and tolerating burst DMA transfer 11 CO-DESIGN 2015 Memory Hierarchy Data Transfer Bandwidth Reg Each processing core with 32 fixedpoint and 32 floating-point registers Local Memory 32KB local memory, including L1$D, SPM 512GB/s 128GB/s 16 address-interleaved L2 cache banks, 128KB each L2 Cache 51.2GB/s Off-Chip Memory 4 DDR3-1600 memory controllers 12 CO-DESIGN 2015 Data Communication & Synchronization Data Transfer Agent (DTA) DTA horizontal operation DTA vertical operation DTA vertical operation SPM DTA L2 Cache SPM …… DTA Router …… Router …… off-chip memory Router …… Programmable asynchronous data transfer agent Support vertical and horizontal DTA operations (such as prefetch) Data transfers between multidimension addresses (such as matrix inversion ) Network load perception, automatically flow control (improve bandwidth-efficiency) Support fine-grain synchronous operations on-chip network (a) vertical and horizontal DTA operations chunk stride chunk stride …… block block stride …… block stride invovled data block address (b) 2D strided DTA operations 13 CO-DESIGN 2015 Data Communication & Synchronization Region-Based Cache Coherent Protocol Lazy cache coherent protocol Pure mutual-exclusion synchronization instruction without memory accesses; Eliminate busy-waiting for locks; More scalable than bus-snoopy and directory-based cache protocol; Enable hardware deadlock detecting. int mutual_func_a() { …… acquire_lock(LOCK_VAL); X = 3; release_lock(LOCK_VAL); …… } X=0 Mini Core Mini Core Private Cache Private Cache ACQ. REL. X=3 ACQ. REL. X=0 int mutual_func_b() { …… acquire_lock(LOCK_VAL); Y = X; Z = X; release_lock(LOCK_VAL); …… } Interconnect ACK. CORE 0 UNLOCK LOCKED CORE 1 WAITING UNLOCK LOCKED X=3 X=0 Synchronization Manager Shared L2 Cache 14 CO-DESIGN 2015 Pthread-like Thread Execution using Private Memory Support master-slave style thread execution; Programming easily: C programming model + Pthread API; Compatible with large amount of conventional codes on shared memory system. int slave_func(int argc, int argv[]) { // do computation; …… godsont_exit(0); } int mater_func() { …… int thread_id = godsont_ create_thread(slave_func, 2, param_array); …… } Stack Pointer 1. Master transmits parameters to slave; 2. Master setups PC and register context for slave; 3. Slave function begins execution, private memory acts as a local stack. SPM PC ab Interconnect Tens of cycles to create or terminate a thread. 15 CO-DESIGN 2015 GodRunner Static Software Stack The programmers majorly focus on expressing parallelism, While the processor and runtime system concentrate on efficient parallel executions. The library provides a rich set of Software APIs for task management, including batch thread management. Programmer is responsible for creating,terminating and synchronizing tasks by inserting appropriate Pthreads-like APIs. Applications (ANSI C Programs) Pthreads-like C Libraries Library GNU C Compiler and Binutils Application Binary Dynamic “GodRunner” Runtime System Godson-T Architecture Simulator (GAS) Hardware 16 CO-DESIGN 2015 Evaluating the Scalability of Performance 64 pFind (32K pep.) iBlastP (600 seq.) FFT (m=16) RADIX (n=256k) LU (n=512) CHOLESKY (tk29.O) 56 48 Speedup 40 32 24 16 8 0 0 8 16 24 32 40 48 56 64 # of Threads Speedup of the bioinformatics and SPLASH-2 benchmarks on Godson-T 17 CO-DESIGN 2015 Summary of Godson-T Exploiting parallelism on multiple level Without disturbing DLP, Can merge with DLP processors Exploiting fine-grained data-driven TLP A scalable many-core architecture research platform Lock-based cache coherence protocol Asynchronous DTA and hardware-supported synchronization mechanisms Pthreads-like programming model, and versatile parallel libraries Flexible to extend, easy to use, and simple to implement Keep balance between software and hardware to gain programmability and performance 18 CO-DESIGN 2015 Overview Microprocessor Architecture Challenges › Memory wall + Power wall + ILP wall › The 5P’s: Many-core Processor Challenges Godson-T Features for High Performance Computing › Godson-T architecture overview › Architectural supports for multithreading › Software runtime system Godson-D Features for High Throughput Computing › Motivation of DPU(Godson-D) › Godson-D Features for High Throughput Computing 19 CO-DESIGN 2015 HPC versus HTC Conventional Workload FLOPS Characteristics Parallelism Now Realtime Reliability Throughput Requirement Performance Requirement Conventionally, Internet Service HTC systems were implemented High NextHigh-Volume Abundant Throughput by HPC As requirements of for Data infrastructures. Thread-level Yes Diversity generation multiple Parallelism Diversity Tasks tasks data-center throughput, QoS, scalability and reliability are Poor Locality increasing in emerging HTC systems, High- conventional wisdom is no longer suitable for Single-node Highperformance Scientific Service Polymorphic fault causes Performance next-generation data-center computing. Parallelism No computing Mono-Task whole system for single system Good Locality Exploitation crash down task A Case Study Story: A system with hundreds of general purpose high performance processors Serves about 1,000,000 users concurrently Zoom in: -- one state-of-art processor can process about 7000 users’ requests in specific period -- each user’s data size is about 10KB What will happen? 21 CO-DESIGN 2015 …… L2 Cache Miss Rate: 94.86% L3 Cache Miss Rate: 72.37% Memory Dynamic user data Static user data … 7000 users’ data compete for memory space Phenomenon L2 and L3 caches contribute little because of high miss rate Float point Unit is useless for this mobile internet scenario Main Cause Massive independent user request compete for limited on-chip memory, and memory access latency dominates the throughput 22 CO-DESIGN 2015 Other HTC Benchmarks Cache Miss Rate on Commercial Processor 23 CO-DESIGN 2015 Problems in Commercial Processors We can easily find the following problems in current data center: Problem 1: Mismatch between Data Format and Data Structure in CPU Problem 2: Mismatch between Data Flow and Data Path in CPU -- On-chip Memory -- Data Path Problem 3: Mismatch between Special Data Processing requirement and the General Architecture of CPU -- Computing Problem 4: Mismatch between Real Time Data Processing Requirement and the Unpredictable Architecture of CPU -- Comprehensive 24 CO-DESIGN 2015 Godson-D Features for HTC Thousands of threads running concurrently to hide latency High throughput data access path Data Processor Unit Godson-D Realtime control technologies 25 CO-DESIGN 2015 Godson-D Architecture Hierarchical Ring + Cone + Thread Dispatcher ALU ALU ALU ALU Shared SPM D$ D$ D$ D$ D$ D$ D$ D$ D$ D$ D$ D$ D$ D$ D$ D$ Shared D$ Dispatcher Dispatcher Dispatcher AGU AGU AGU AGU scheduler Dispatcher Dispatcher Dispatcher Dispatcher Dispatcher Dispatcher Dispatcher Dispatcher Dispatcher Dispatcher Dispatcher Dispatcher CO-DESIGN 2015 26 Thread Core Group Shared I$ Shared FU Memory Access Collect Table(M-Table) 1. M-Table can collect the discrete memory access requirements together and process them in batch, thus improve the memory efficiency. RAM N cycle Oldest N cycle N cycle N cycle N cycle Performance Speedup: 2.82X Memory Access Latency: 1.58X Number of memory requirements: 43% 27 CO-DESIGN 2015 High density NoC(HD-NoC) • There is a large proportion of small message packets in HTC applications • Conventional large links are split into small channels (such as 8 or 16 bits) so that flits in the same direction can be transferred64bit simultaneously Typical Router Granularity Distribution 100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0% 64bit 64bit R 8 64bit 4 2 64bit 1 64bit rd wt BFS rd wt GUPS rd wt rd wt rd wt rd 64bit R HD-Router wt SSCA2 canneal listrank pagerank 64bit The effective bandwidth utilization can be improved by 4.5X 28 CO-DESIGN 2015 Hardware Supported Realtime Scheduler Fast Pass Two Jump to reach MCU for urgent thread 1 2 Hardware realtime scheduler Normal Thread Pointer Null Thread Pointer High Priority Thread Pointer PID PID PID PID PID PID PID PID PID PID PID PID TID TID TID TID TID TID TID TID TID TID TID TID Occupied Occupied Occupied Occupied Occupied Occupied Occupied Occupied Occupied Occupied Occupied Occupied Core_Num Core_Num Core_Num Core_Num Core_Num Core_Num Core_Num Core_Num Core_Num Core_Num Core_Num Core_Num 29 Threshold_Time Threshold_Time Threshold_Time Threshold_Time Threshold_Time Threshold_Time Threshold_Time Threshold_Time Threshold_Time Threshold_Time Threshold_Time Threshold_Time Context_Pointer Context_Pointer Context_Pointer Context_Pointer Context_Pointer Context_Pointer Context_Pointer Context_Pointer Context_Pointer Context_Pointer Context_Pointer Context_Pointer Next_Thread Next_Thread Next_Thread Next_Thread Next_Thread Next_Thread Next_Thread Next_Thread Next_Thread Next_Thread Next_Thread Next_Thread CO-DESIGN 2015 Effect of global realtime scheduling Search 235 209 183 157 131 105 79 53 27 1 thread_id thread_id Search 0 2 000 000 4 000 000 6 000 000 finished_cycles 8 000 000 241 217 193 169 145 121 97 73 49 25 1 0 241 217 193 169 145 121 97 73 49 25 1 0 20000000 40000000 60000000 finished_cycles 8000000 Terasort Thread_id thread_id TeraSort 2000000 4000000 6000000 finished_cycles 80000000 235 209 183 157 131 105 79 53 27 1 0 30 20000000 40000000 60000000 80000000 Finished_cycles CO-DESIGN 2015 HTC System vs. HPC System HTC HPC Memory queue Processing Unit Task queue Data path Performanceoriented scheduling Single direction M-Table T-Table Hardware support QoS scheduling Bidirection dynamic self-adaption Low width/ High density High width / low density Message-based memory Imperative memory access Memory Perf-oriented Cache replacement 31 QoS oriented Cache replacement CO-DESIGN 2015 Thank you!