ECE8833 Polymorphous and Many-Core Computer Architecture
Lecture 4: Billion-Transistor Architectures '97 (Part II)
Prof. Hsien-Hsin S. Lee, School of Electrical and Computer Engineering

Practitioners' Groups
Every one has an acronym!
• IRAM – Implemented at Berkeley (the VIRAM-1 chip)
• CMP – Led to Sun Niagara and the multicore (r)evolution
• SMT – Intel HyperThreading (arguably Intel first envisioned the idea), IBM POWER5, Alpha 21464
  – Many credit this technology to UCSB's multistreaming work in the early 1990s
• RAW – Led to the Tilera TILE64

Paper authors: C. E. Kozyrakis, S. Perissakis, D. Patterson, T. Anderson, K. Asanovic, N. Cardwell, R. Fromm, J. Golbus, B. Gribstad, K. Keeton, R. Thomas, N. Treuhaft, K. Yelick

Mission Statement

Future Roadblocks that Inspired IRAM
• Latency issues
  – Continually growing performance gap between processor and memory
  – DRAM optimized for density, not speed
• Bandwidth issues
  – Off-chip bus
    • Slow and narrow
    • High capacitance, high energy
  – Especially hurts scientific codes, databases, etc.

IRAM Approach
• Move DRAM closer to the processor
  – Enlarge on-chip bandwidth
• Fewer I/O pins
  – Smaller package
  – Serial interface
Anything look familiar?

IRAM Chip Design Research
• How much larger and slower is a processor designed in a straight DRAM process vs. a standard logic process?
  – Microprocessor fabs offer fast transistors for fast logic and many metal layers for accelerating communication and simplifying power distribution
  – DRAM fabs offer many poly layers to give small DRAM cells and low leakage for a low refresh rate
• Speed of the page buffer vs. registers and cache
• New DRAM interface based on fast serial links (2.5 Gbit/s, or about 300 MB/s per pin)
• Quantify the bandwidth vs. area/power tradeoff
• Area overhead for IRAM vs. a DRAM
• Extra power dissipation for IRAM vs. a DRAM
• Performance of IRAM with the same area and power as a DRAM ("processor for free")
Source: David Patterson's slide in his IRAM overview talk

IRAM Architecture Research
• How much slower can a processor with a high-bandwidth memory be and yet be as fast as a conventional computer? (very interesting point)
• Compare memory management schemes (e.g., vector registers, scratch pad, wide TLB/cache)
• Compare schemes for running large programs, i.e., spanning multiple IRAMs
• Quantify the value of compact programs and data (e.g., compact code, on-the-fly compression)
• Quantify the pros and cons of a standard instruction set vs. a custom IRAM instruction set
Source: David Patterson's slide in his IRAM overview talk

IRAM Compiler Research
• Explicit SW control of memory management vs. conventional implicit HW designs
  – Protection (software fault isolation)
  – Paging (dynamic relocation, overlapped I/O accesses)
  – "Cache" control (vector registers, scratch pad)
  – I/O interrupt/polling
• Evaluate benchmark performance in conjunction with the architectural research
  – Number crunching (vector vs. superscalar)
  – Memory intensive (database, operating system)
  – Real-time benchmarks (stability and performance)
  – Pointer intensive (GCC compiler)
• Impact of language on IRAM (Fortran 77 vs. HPF, C/C++ vs. Java)
Source: David Patterson's slide in his IRAM overview talk

Potential IRAM Architecture
• "New model": VSIW = Very Short Instruction Word!
  – Compact: describes N operations with one short (vector) instruction (see the sketch after this slide)
  – Predictable: (real-time) performance vs. statistical (cache) performance
  – Multimedia ready: choose Nx64b, 2Nx32b, or 4Nx16b
  – Easy to get high performance; the N operations:
    • Are independent
    • Use the same functional unit
    • Access disjoint registers
    • Access registers in the same order as previous instructions
    • Access contiguous memory words or a known pattern
    • Hide memory latency (and any other latency)
  – Compiler technology already developed
Source: David Patterson's slide in his IRAM talk
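The VSIW point can be made concrete with a toy model. The C sketch below is not the VIRAM instruction set; vadd, vreg_t, and VLEN are invented names that only illustrate why one short vector instruction is a compact description of N independent element operations executed by the same functional unit over disjoint registers.

    /* Toy model of "one short instruction describes N operations".
     * This is NOT the VIRAM ISA: vadd is a hypothetical vector add over
     * software-modeled vector registers, used only for illustration. */
    #include <stdio.h>

    #define VLEN 8                      /* assumed maximum vector length */

    typedef struct { long e[VLEN]; } vreg_t;

    /* One "instruction": element i of vd gets vs1[i] + vs2[i] for i < vl. */
    static void vadd(vreg_t *vd, const vreg_t *vs1, const vreg_t *vs2, int vl) {
        for (int i = 0; i < vl; i++)    /* N independent ops, same FU, disjoint regs */
            vd->e[i] = vs1->e[i] + vs2->e[i];
    }

    int main(void) {
        vreg_t a, b, c;
        for (int i = 0; i < VLEN; i++) { a.e[i] = i; b.e[i] = 10 * i; }
        vadd(&c, &a, &b, VLEN);         /* a scalar loop of VLEN adds becomes one instruction */
        printf("c[3] = %ld\n", c.e[3]); /* prints 33 */
        return 0;
    }

The element operations inside vadd are exactly the properties listed above: independent, same functional unit, disjoint registers, and a contiguous access pattern, which is what lets the hardware pipeline them and hide memory latency.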
Berkeley Vector-Intelligent RAM
Why vector processing?
• Scalable design
• Higher code density
• Run at a higher clock rate
• Better energy efficiency due to easier clock gating of the vector/scalar units
• Lower die temperature to keep a good data retention rate
• On-chip DRAM is sufficient for embedded applications
• Use external off-chip DRAM as secondary memory
  – Pages swapped between on-chip and off-chip DRAMs

VIRAM-1 Floorplan
• 180nm CMOS, 6-layer copper
• 125 million transistors, 325 mm2
• 2 watts @ 200MHz
• 13MB of IBM embedded DRAM macros (13Mbit each) and 4 vector units (8KB of vector registers in total)
• VRF = 32x64b, 64x32b, or 128x16b
[Floorplan labels: 64-bit MIPS M5Kc scalar core; ¼ of the 8KB VRF (custom layout); IBM embedded DRAM macros]
[Gebis et al., DAC student design contest '04]

Paper authors: S. J. Eggers, J. S. Emer, H. M. Levy, J. L. Lo, R. L. Stamm, D. M. Tullsen

SMT Concept vs. Other Alternatives
[Figure: execution slots (FU1–FU4) over time for a conventional superscalar, fine-grained multithreading (cycle-by-cycle interleaving), coarse-grained multithreading (block interleaving), a single-threaded chip multiprocessor (CMP), and simultaneous multithreading (Intel's HT); shading distinguishes Threads 1–5 and unused slots]
• The early SMT idea was developed at UCSB (Mario Nemirovsky's group, HICSS '94)
• The name SMT was christened by the group at the University of Washington (ISCA '95)

Exploiting Choice: SMT Inst Fetch Policies
• FIFO, Round Robin: simple, but may be too naive
• RR.X.Y – fetch from X threads, up to Y instructions each
  – RR.1.8
  – RR.2.4 or RR.4.2
  – RR.2.8
• What are the main design and/or performance issues when X > 1?
[Tullsen et al. ISCA '96]

Exploiting Choice: SMT Inst Fetch Policies
• Adaptive fetching policies (a sketch of ICOUNT follows below)
  – BRCOUNT (reduce wrong-path issuing)
    • Count the # of branch instructions in the decode/rename/IQ stages
    • Give top priority to the thread with the lowest BRCOUNT
  – MISSCOUNT (reduce IQ clog)
    • Count the # of outstanding D-cache misses
    • Give top priority to the thread with the lowest MISSCOUNT
  – ICOUNT (reduce IQ clog)
    • Count the # of instructions in the decode/rename/IQ stages
    • Give top priority to the thread with the lowest ICOUNT
  – IQPOSN (reduce IQ clog)
    • Give lowest priority to the threads whose instructions sit closest to the head of the INT or FP instruction queues
      – Because threads with the oldest instructions are the most prone to IQ clog
    • No counter needed
[Tullsen et al. ISCA '96]

Exploiting Choice: SMT Inst Fetch Policies
[Results figure] [Tullsen et al. ISCA '96]
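To make the ICOUNT heuristic concrete, here is a minimal C sketch of the per-cycle selection step. The structures and names (thread_state_t, icount_pick) are invented for illustration and are not taken from any real simulator: the fetch stage simply favors the unstalled thread with the fewest instructions in the decode, rename, and issue-queue stages.

    /* Hedged sketch of ICOUNT-style fetch selection in the spirit of
     * Tullsen et al.; data structures are hypothetical. */
    #include <stdio.h>
    #include <limits.h>

    #define NUM_THREADS 4

    typedef struct {
        int in_decode;   /* instructions currently in decode */
        int in_rename;   /* instructions currently in rename */
        int in_iq;       /* instructions currently in the issue queue */
        int stalled;     /* e.g., waiting on an I-cache miss */
    } thread_state_t;

    /* Returns the thread id to fetch from this cycle, or -1 if all are stalled. */
    int icount_pick(const thread_state_t t[NUM_THREADS]) {
        int best = -1, best_count = INT_MAX;
        for (int i = 0; i < NUM_THREADS; i++) {
            if (t[i].stalled) continue;
            int count = t[i].in_decode + t[i].in_rename + t[i].in_iq;
            if (count < best_count) {   /* fewest in-flight pre-issue instructions */
                best_count = count;
                best = i;
            }
        }
        return best;
    }

    int main(void) {
        thread_state_t t[NUM_THREADS] = {
            {4, 2, 10, 0}, {1, 1, 3, 0}, {0, 0, 0, 1}, {2, 2, 8, 0}
        };
        printf("fetch from thread %d\n", icount_pick(t)); /* thread 1 (ICOUNT = 5) */
        return 0;
    }

Threads making fast forward progress have low counts and get fetched more, which is exactly how ICOUNT reduces IQ clog.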
Alpha 21464 (EV8)
• Leading-edge process technology
  – 1.2 to 2.0 GHz
  – 0.125 µm CMOS, SOI-compatible
  – Cu interconnect, 7 metal layers
  – Low-k dielectrics
• Chip characteristics
  – 1.2V Vdd, 250W (EV6: 72W, EV7: 125W)
  – 250 million transistors, 350 mm2
  – 1100 signal pins in flip-chip packaging
Slide source: Dr. Joel Emer

EV8 Architecture Overview
• Enhanced OoO execution
• 8-wide-issue superscalar processor
• Large on-die L2 (1.75MB)
• 8 DRDRAM channels
• On-chip router for the system interconnect
• Directory-based ccNUMA for up to 512-way SMP
• 4-way SMT
Slide source: Dr. Joel Emer

SMT Pipeline
• Replicated resources
  – PCs
  – Register maps
• Shared resources
  – Instruction queue
  – First- and second-level caches
  – Translation buffers
  – Branch predictor
[Figure: pipeline stages Fetch, Decode/Map, Queue, Reg Read, Execute, Dcache/Store Buffer, Reg Write, Retire, with per-thread PCs and register maps feeding shared register files, I-cache, and D-cache]
Slide source: Dr. Joel Emer

Intel HyperThreading
• Intel Xeon Processor, Xeon MP Processor, and ATOM
• Enables Simultaneous Multi-Threading (SMT)
  – Exploits ILP resources through TLP (Thread-Level Parallelism)
  – Issues and executes multiple threads in the same snapshot
• Appears to the OS as 2 logical processors
• The threads share the same execution resources
• Architectural state and certain microarchitectural state are duplicated
  – IPs, iTLB, streaming buffer
  – Architectural register file
  – Return stack buffer
  – Branch history buffer
  – Register Alias Table

Sharing Resources in Intel HT
• The P4's trace cache (or µcode ROM) is accessed alternately each cycle by the two logical processors unless one is stalled on a TC miss
• The TLB is shared, tagged with the logical processor ID, but partitioned
  – x86 does not employ ASIDs
  – Hard partitioning appears to be the only option to allow HT
• µop queue (halved) after µops are fetched from the TC
• ROB (126/2 in P4)
• Load buffer (48/2 in P4)
• Store buffer (24/2 or 32/2 in P4)
• General µop queue and memory µop queue (halved)
• Retirement: alternates between the 2 logical processors

HT in Intel ATOM
[Die photo: 25 mm2 at 45nm; 32KB L1-I, 24KB L1-D, 512KB L2]
• First in-order processor with HT
• HT claimed to enlarge the silicon real estate by 8%
• Claimed 30% performance increase at a 15% power increase
• Shared cache space is competed for (and can be deprived) between threads
• No dedicated multiplier – uses the SIMD multiplier
• No dedicated integer divider – uses the FP divider
Source: Microprocessor Report and Intel

Paper authors: L. Hammond, B. A. Nayfeh, K. Olukotun

Main Argument
• A single thread of control has limited parallelism (ILP is dead)
• The cost of extracting more of it is prohibitive due to complexity
• Achieve parallelization with SW, not HW
  – Inherently parallel multimedia applications
  – Widespread multi-tasking OSes
  – Emerging parallel compilers (ref. SUIF), mainly for loop-level parallelism
• Why not SMT?
  – Interconnect delay issue
  – Partitioning is less localized than in a CMP
• Use relatively simple single-thread processors
  – Exploit only a "modest" amount of ILP
  – Execute multiple threads in parallel
• Bottom line

Architectural Comparison
[Figure]

Single Chip Multiprocessor
[Figure]

Commercial CMP (AMD Phenom II Quad-Core)
• AMD K10 (Barcelona), code name "Deneb"
• 45nm process
• 4 cores, private 512KB L2 per core
• Shared 6MB L3 (2MB in Phenom)
• Integrated Northbridge
  – Up to 4 DIMMs
• Sideband Stack Optimizer (SSO)
  – Parallelizes many POPs and PUSHes (which were dependent on each other)
    • Converts them into pure load/store instructions
  – No µops occupy the FUs for stack-pointer adjustment

Intel Core i7 (Nehalem)
• 4 cores
• HT support in each core
• 8MB shared L3
• 3 DDR3 channels
• 25.6 GB/s memory BW
• Turbo Boost Technology (a toy controller sketch follows below)
  – A new P-state (Performance)
  – DVFS when the workload operates under the max power limit
  – Same frequency for all cores
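As a rough illustration of the Turbo Boost idea (not Intel's actual algorithm), the sketch below steps a shared frequency bin upward only while an assumed power model stays under the package limit; every name and number here is hypothetical.

    /* Toy sketch of a Turbo-Boost-like DVFS controller: raise the shared
     * core frequency one bin at a time while the estimated package power
     * stays under the limit.  The power model and constants are invented
     * for illustration and are not Intel's algorithm. */
    #include <stdio.h>

    #define BASE_MHZ   2667
    #define BIN_MHZ     133          /* one frequency bin */
    #define MAX_BINS      4
    #define TDP_WATTS   95.0

    static double estimate_power(int active_cores, int bins_up) {
        /* crude model: per-core power grows with the frequency bin */
        double per_core = 15.0 + 4.0 * bins_up;
        return 20.0 /* uncore */ + active_cores * per_core;
    }

    /* All cores run at the same frequency (as on Nehalem); pick the highest
     * bin whose power estimate still fits under the TDP. */
    int pick_turbo_bins(int active_cores) {
        int bins = 0;
        while (bins < MAX_BINS &&
               estimate_power(active_cores, bins + 1) <= TDP_WATTS)
            bins++;
        return bins;                 /* frequency = BASE_MHZ + bins * BIN_MHZ */
    }

    int main(void) {
        for (int cores = 1; cores <= 4; cores++)
            printf("%d active core(s): +%d bins\n", cores, pick_turbo_bins(cores));
        return 0;
    }

The pattern it demonstrates is the one described above: the fewer cores are active (or the lighter the workload), the more thermal headroom remains and the higher the shared frequency can be pushed.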
UltraSPARC T1
• Up to eight cores, each 4-way threaded
• Fine-grained multithreading
  – Thread-selection logic takes out threads that encounter long-latency events
  – Round-robin, cycle by cycle
  – 4 threads in a group share a processing pipeline (Sparc pipe)
• 1.2 GHz (90nm)
• In-order, 8 instructions per cycle (single issue from each core)
• 1 shared FPU
• Caches
  – 16KB, 4-way, 32B-line L1-I
  – 8KB, 4-way, 16B-line L1-D
  – Blocking caches (a reason for MT)
  – 4-banked, 12-way, 3MB L2 + 4 memory controllers (shared by all cores)
  – Data moves between the L2 and the cores through an integrated crossbar switch for high throughput (200 GB/s)

UltraSPARC T1
• The thread-select logic marks a thread inactive based on
  – Instruction type
    • A predecode bit in the I-cache indicates long-latency instructions
  – Misses
  – Traps
  – Resource conflicts

UltraSPARC T2
• A fatter version of the T1
• 1.4 GHz (65nm)
• 8 threads per core, 8 cores on-die
• 1 FPU per core (1 FPU per die in T1), 16 integer EUs (8 in T1)
• L2 increased to an 8-banked, 16-way, 4MB shared cache
• 8-stage integer pipeline (as opposed to 6 for T1)
• 16 instructions per cycle
• One PCI Express port (x8, 1.0)
• Two 10 Gigabit Ethernet ports with packet classification and filtering
• Eight encryption engines
• Four dual-channel FBDIMM memory controllers
• 711 signal I/Os, 1831 pins total
• The subsequent T2 Plus spans 2 sockets: 16 cores / 128 threads

Sun ROCK Processor
• 16 cores, two threads per core
• Hardware scout threading (runahead)
  – Invisible to SW
  – A long-latency instruction automatically starts the HW scout
    • L1 D$ miss
    • Micro-DTLB miss
    • Divide
  – Warms up the branch predictor
  – Prefetches memory
• Execute Ahead (EXE)
  – Retires independent instructions while scouting
• Simultaneous Speculative Threading (SST) [ISCA '09]
  – Two hardware threads for one program
  – The runahead thread speculatively executes under a cache miss
  – OoO retirement
• HTM support

Many-Core Processors: Intel Teraflops (Polaris)
• 2KB data memory and 3KB instruction memory per tile
• No coherence support
• 2 FMACs per tile
• The next generation will have 3D-integrated memory
  – SRAM first
  – DRAM in the future

Paper authors: E. Waingold, M. Taylor, D. Srikrishna, V. Sarkar, W. Lee, V. Lee, J. Kim, M. Frank, P. Finch, R. Barua, J. Babb, S. Amarasinghe, A. Agarwal

MIT RAW Design Tenets
• Long wires across the chip will be the constraint
• Expose the architecture to software (parallelizing compilers)
  – Explicit parallelization
  – Pins
  – Communication
• Use a tile-based architecture
  – Similar designs sponsored by the DARPA PCA program: UT TRIPS, Stanford Smart Memories
• Simple point-to-point static routing network (a hop-count sketch follows below)
  – One cycle across each tile
  – More scalable than a bus
  – Harnessed by the compiler with a precise count of wire hops
  – A dynamic router supports memory accesses that cannot be analyzed statically

Application Mapping on RAW
[Figure: a 16-tile Raw chip simultaneously running a video data stream feeding a frame buffer and screen, four-way parallelized scalar code, a two-way threaded Java program, and httpd; annotations show fast inter-tile ALU forwarding (3 cycles), a custom datapath pipeline built by the compiler, and idle tiles in sleep mode for power saving] [Taylor, IEEE MICRO '02]

Scalar Operand Network Design
[Figure: a non-pipelined scalar operand network, one pipelined with a bypass link, and one pipelined with a bypass link and multiple ALUs; many live values reside in the SON] [Taylor et al. HPCA '03]
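The "precise count of wire hops" the Raw compiler relies on, and the dimension-ordered routing used by the dynamic network, both reduce to Manhattan distance on the tile grid. The sketch below is an illustration under stated assumptions (a 4x4 tile numbering and roughly one cycle per registered hop), not Raw's actual router code.

    /* Sketch of dimension-ordered (X-then-Y) routing distance on a 2D mesh
     * of tiles, using hop count as a stand-in for network latency: Raw's
     * static network registers values at every tile, so each hop costs
     * roughly one cycle.  Tile numbering and the cost model are assumptions. */
    #include <stdio.h>
    #include <stdlib.h>

    #define MESH_DIM 4                       /* 4x4 = 16 tiles */

    /* Manhattan distance = hops taken by X-then-Y dimension-ordered routing. */
    int mesh_hops(int src_tile, int dst_tile) {
        int sx = src_tile % MESH_DIM, sy = src_tile / MESH_DIM;
        int dx = dst_tile % MESH_DIM, dy = dst_tile / MESH_DIM;
        return abs(dx - sx) + abs(dy - sy);
    }

    int main(void) {
        /* Worst case on a 4x4 mesh: opposite corners, tile 0 -> tile 15. */
        printf("tile 0 -> tile 15: %d hops\n", mesh_hops(0, 15));  /* 6 hops */
        return 0;
    }

The worst-case result (6 hops across 16 tiles) matches the figure quoted for the Raw on-chip network later in these slides.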
Communication Scalability Issue
[Figure: superscalar bypass/wakeup structures, showing the routing area, large MUXes, and complex compare logic]
• RB (# of result buses) × WS (window size) compares are made per cycle
• Long, dense wires elongate the cycle time
  – Pipeline the wire
• The cost of processing incoming information is high
• A similar problem arises in bus-based snoopy cache protocols

Scalar Operand Network
[Figure: a Multiscalar-style operand network (a distributed ILP machine) chaining register files through switches, versus a scalar operand network on a 2-D point-to-point interconnect (e.g., Raw or TRIPS)] [Taylor et al. HPCA '03]

Mapping Operations to a Tile-based Architecture
[Figure: the dataflow graph of the code below (ld a, ld b, >>, *, +, st b) mapped onto a grid of tiles and register files]
  i = a[j];
  q = b[i];
  r = q + j;
  s = q >> 3;
  t = r * s;
  b[j] = l;
  b[t] = t;
• The mapping is done at
  – Compile time (RAW)
  – Or at runtime
• "Point-to-point" 2D mesh
• Tradeoff
  – Computation vs. communication
  – Compute affinity (data should flow through fewer hops)
• How to maintain control flow / flow control?

RAW Core-to-Core Communication
• Static router
  – Wires are "placed and routed" by software
  – Point-to-point scalar transport
  – Compilers (or assembly writers) handle predictable communication
• Dynamic router
  – Transports dynamic, unpredictable operations
    • Interrupts
    • Cache misses
  – Handles communication that is unpredictable at compile time

Architectural Comparison: RAW vs. Superscalar vs. Multiprocessor
• Raw replaces the buses of a superscalar with a switched network
• The switched network is tightly integrated into the processor's pipeline to support single-cycle message injection and receive operations
• Raw software (the compiler) has to implement functions such as instruction scheduling, dependency checking, etc.
• Raw yields complexity to software so that more hardware can be devoted to ALUs and memory

RAW's Four On-Chip Mesh Networks
[Figure: the compute pipeline plus four mesh networks of 8 32-bit channels, registered at the inputs; the longest wire equals the length of a tile] [Slide source: Michael B. Taylor]

Raw Architecture
[Figure] [Slide source: Volker Strumpen]

Raw Compute Processor Pipeline
[Figure: fast ALU-to-network path (4 cycles); registers R24–R27 map to the 4 on-chip physical networks; 0-cycle local bypass] [Taylor, IEEE MICRO '02]

RAW Processor Tile
Each tile contains
• A tile processor
  – 32-bit MIPS, 8-stage, in-order, single issue
  – 32KB instruction memory
  – 32KB data cache (not coherent, user managed)
• A switch processor
  – 8K-instruction memory
  – Executes basic move and branch instructions
  – Transfers data between the local switch and the neighboring switches
• A dynamic router
  – Hardware controlled (not directly under the programmer's control)

Raw Programming
• Compute the sum c = a + b across four tiles (a simplified two-tile sketch follows below)

Data Path: Zoom 1
• Stateful hardware: local data memory (a, c), a register (b), and both static networks (snet1 and snet2)

Zoom 2: Processor Datapaths
[Figure]

Zoom 2: Switch Datapaths (+ tile processor)
[Figure]

Raw Assembly
[Figure]
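The c = a + b example above survives here only as figures, so the following is a hedged C-style sketch of the same idea, simplified to two tiles. static_send and static_recv are invented stand-ins for writing and reading the switch registers ($csto/$csti); they are not Raw's real programming interface, and real tiles execute concurrently under the switch schedule rather than sequentially.

    /* Hedged sketch of spreading c = a + b across tiles: a and c live in
     * tile 0's local data memory, b lives in a register on tile 1, and the
     * static network carries b over to tile 0. */
    #include <stdio.h>

    static int channel;                         /* models one static-network word */

    static void static_send(int v) { channel = v; }    /* like: or $csto, $0, $reg */
    static int  static_recv(void)  { return channel; } /* like: reading $csti */

    static int a_mem = 7, c_mem;                /* tile 0's local data memory */
    static int b_reg = 35;                      /* a register on tile 1 */

    static void tile1(void) {                   /* sender */
        static_send(b_reg);                     /* route b toward tile 0 */
    }

    static void tile0(void) {                   /* receiver */
        c_mem = a_mem + static_recv();          /* c = a + b */
    }

    int main(void) {
        tile1();                                /* in hardware these run in parallel, */
        tile0();                                /* coupled by the switch schedule      */
        printf("c = %d\n", c_mem);              /* prints 42 */
        return 0;
    }

The real Raw version, shown in the assembly on the next slide, expresses the send and receive as reads and writes of the network-mapped registers while the switch processors carry the word between tiles.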
RAW On-Chip Network
• 2D mesh
  – The longest wire is no longer than one side of a tile
  – Worst case: 6 hops (or cycles) across 16 tiles
• 2 static routers, "point-to-point", each with
  – A 64KB SW-managed instruction cache
  – A pair of routing crossbars
  – Example (Tile 0 sends a word to Tile 1):
      Tile 0 (sender):
        or   $csto, $0, $5            # processor: write $5 onto the static network
        nop  route $csto->$cEo2       # SWITCH 0: route it to the east output
      Tile 1 (receiver):
        nop  route $cWi2->$csti2      # SWITCH 1: route the west input to the processor
        and  $5, $5, $csti2           # processor: consume the word from $csti2
• 2 dynamic routers
  – Dimension-ordered routing performed by hardware
  – Example (Tile 0 sends two words to Tile 15):
      Tile 0 (sender):
        lui  $3, $0, 15
        ihdr $cgno, $3, 0x0200        # header, msg len = 2
        or   $cgno, $0, $9            # send word 1
        ld   $cgno, $0, $csti         # send word 2
      Tile 15 (receiver):
        or   $2, $cgni, $0            # word 1
        or   $3, $cgni, $0            # word 2

Control Orchestration Optimization
• Orchestrated by the Raw compiler
• Control localization
  – Hide a control-flow sequence inside a "macro-instruction" (macroins) assigned to one tile
[Lee et al. ASPLOS '98]

Example of RAW Compiler Transformation
Initial code:
  y = a+b;
  z = a*a;
  a = y*a*5;
  y = y*b*6;
Compiler passes: Instruction Partitioner, Global Data Partitioner, Data & Inst Placer, Communication Code Gen, Event Scheduler.
After the initial transformation (renaming into single-assignment temporaries):
  read(a)
  read(b)
  y_1 = a+b
  z_1 = a*a
  tmp_1 = y_1*a
  a_1 = tmp_1*5
  tmp_2 = y_1*b
  y_2 = tmp_2*6
  write(z)
  write(a)
  write(y)
[Lee et al. ASPLOS '98]

Example of RAW Compiler Transformation
The Instruction Partitioner splits the stream into two groups, the Global Data Partitioner assigns the data sets {a,z} and {b,y} to them, and the Data & Inst Placer binds the groups to tiles P0 and P1 (a toy placement heuristic is sketched after these slides):
  P0 ({a,z}):  read(a); z_1 = a*a; write(z); tmp_1 = y_1*a; a_1 = tmp_1*5; write(a)
  P1 ({b,y}):  read(b); y_1 = a+b; tmp_2 = y_1*b; y_2 = tmp_2*6; write(y)
[Lee et al. ASPLOS '98]

Example of RAW Compiler Transformation
Communication Code Gen inserts the sends, receives, and switch routes:
  P0: read(a); send(a); z_1 = a*a; write(z); y_1 = rcv(); tmp_1 = y_1*a; a_1 = tmp_1*5; write(a)
  S0: route(P0,S1); route(S1,P0)
  S1: route(S0,P1); route(P1,S0)
  P1: read(b); a = rcv(); y_1 = a+b; send(y_1); tmp_2 = y_1*b; y_2 = tmp_2*6; write(y)
[Lee et al. ASPLOS '98]

Example of RAW Compiler Transformation
The Event Scheduler finally orders the computation, the sends and receives, and the switch routes on P0, S0, S1, and P1 so that each operand arrives exactly when it is needed.
[Lee et al. ASPLOS '98]
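The Data & Inst Placer above binds instruction groups to tiles so that values travel through few hops. The toy greedy heuristic below, invented for illustration and not Rawcc's actual placement algorithm, captures that "compute affinity" idea: each operation is placed on the tile that already owns most of its source operands.

    /* Toy greedy placer in the spirit of the Data & Inst Placer pass:
     * an operation goes to the tile holding most of its operands, so that
     * intermediate values flow through fewer network hops.  Everything
     * here is a simplified illustration. */
    #include <stdio.h>

    #define NTILES  2
    #define NVALUES 16

    static int tile_of[NVALUES];              /* which tile currently owns value v */

    /* Place an op producing `dst` from `src1` and `src2`; return the chosen tile. */
    int place_op(int dst, int src1, int src2) {
        int votes[NTILES] = {0};
        votes[tile_of[src1]]++;
        votes[tile_of[src2]]++;
        int best = 0;
        for (int t = 1; t < NTILES; t++)
            if (votes[t] > votes[best]) best = t;
        tile_of[dst] = best;                  /* the result stays where it was computed */
        return best;
    }

    int main(void) {
        enum { A, B, Y, Z };                  /* values from the running example */
        tile_of[A] = 0;                       /* {a,z} were assigned to P0 */
        tile_of[B] = 1;                       /* {b,y} were assigned to P1 */
        printf("y = a+b placed on tile %d\n", place_op(Y, A, B)); /* tie -> tile 0 */
        printf("z = a*a placed on tile %d\n", place_op(Z, A, A)); /* tile 0 */
        return 0;
    }

A real placer must also balance load and account for the switch schedule, which is why the Event Scheduler pass follows placement in the flow above.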
Raw Compiler Example
Assign instructions to the tiles, maximizing locality. Generate the static router instructions to transfer operands and streams between the tiles.
  tmp3 = (seed*6+2)/3
  v2 = (tmp1 - tmp3)*5
  v1 = (tmp1 + tmp2)*3
  v0 = tmp0 - v1
  ….
[Figure: the expressions above, renamed into single-assignment form (seed.0, pval0–pval7, tmp0.1–tmp3.6, v0.9–v3.10), partitioned and placed across four Raw tiles; each tile holds a slice of the computation (e.g., pval1=seed.0*3.0, tmp1.3=pval2+2.0, v1.8=pval7*3.0, v2.7=pval6*5.0) while the static switches carry the intermediate values between tiles] [Slide source: Michael B. Taylor]

Scalability
• 180 nm: 16 tiles; 90 nm: 64 tiles. Just stamp out more tiles!
• The longest wire (about 1 cycle per hop), the frequency, and the design and verification complexity are all independent of the issue width.
• The architecture is backwards compatible.
[Slide source: Michael B. Taylor]