In search of performance
• The road to multi-cores
– Single-cycle processor
– Pipelining
– Caches
– Simple scalar pipeline
– Superscalar, VLIW and EPIC
– Multi-threading, Hyper-Threading and Simultaneous Multithreading
• Multicores
– Chip Multi-Processors
– Graphics Processors
– Software Managed

What is this?
00100111101111011111111111100000   10101111101111110000000000010100
10101111101001000000000000100000   10101111101001010000000000100100
10101111101000000000000000011000   10101111101000000000000000011100
10001111101011100000000000011100   10001111101110000000000000011000
00000001110011100000000000011001   00100101110010000000000000000001
00101001000000010000000001100101   10101111101010000000000000011100
00000000000000000111100000010010   00000011000011111100100000100001
00010100001000001111111111110111   10101111101110010000000000011000
00111100000001000001000000000000   10001111101001010000000000011000
00001100000100000000000011101100   00100100100001000000010000110000
10001111101111110000000000010100   00100111101111010000000000100000
00000011111000000000000000001000   00000000000000000001000000100001

It is the MIPS machine language code for a routine that computes and prints the sum of the squares of the integers between 0 and 100, i.e. the machine-level form of this C program:

#include <stdio.h>

int main (int argc, char *argv[])
{
    int i;
    int sum = 0;
    for (i = 0; i <= 100; i = i + 1)
        sum = sum + i * i;
    printf ("The sum from 0 .. 100 is %d\n", sum);
    return 0;
}

High-Level Languages
• Higher-level languages
– Allow the programmer to think in a more natural language
• Customized for their intended use, e.g.,
– Fortran for scientific computation
– Cobol for business programming
– Lisp for symbol manipulation
– Improve programmer productivity and maintainability
• More understandable code that is easier to debug and validate
– Independent of
• The computer on which applications are developed
• The computer on which applications will execute
• Enabler: optimizing compiler technology
• Today there is very little programming at the assembly level

Translation from High-Level Languages
• High-level language program (in C):
swap (int v[], int k) . . .
• The C compiler translates it into an assembly language program (for MIPS):
swap: sll $2, $5, 2
      add $2, $4, $2
      lw  $15, 0($2)
      lw  $16, 4($2)
      sw  $16, 0($2)
      sw  $15, 4($2)
      jr  $31
• The assembler then produces machine (object) code (for MIPS), one 32-bit word per instruction:
000000 00000 00101 0001000010000000   (sll)
000000 00100 00010 0001000000100000   (add)
100011 00010 01111 0000000000000000   (lw)
100011 00010 10000 0000000000000100   (lw)
101011 00010 10000 0000000000000000   (sw)
101011 00010 01111 0000000000000100   (sw)
000000 11111 00000 0000000000001000   (jr)

Fetching Instructions
• Fetching instructions involves
– reading the instruction from the Instruction Memory
– updating the PC to hold the address of the next instruction
[Figure: fetch datapath, where the PC drives the Instruction Memory read address and an adder computes PC + 4]
– The PC is updated every cycle, so it does not need an explicit write control signal
– The Instruction Memory is read every cycle, so it does not need an explicit read control signal

Executing ALU Operations
• R-type instruction format: op [31–26] | rs [25–21] | rt [20–16] | rd [15–11] | shamt [10–6] | funct [5–0]
– Perform the operation indicated by op and funct on the values in rs and rt
– Store the result back into the Register File (into location rd)
[Figure: R-type datapath, where the Register File read ports feed the ALU; RegWrite controls the write port, and the ALU produces overflow and zero outputs under ALU control]
– Note that the Register File is not written every cycle (e.g. on sw), so we need an explicit write control signal (RegWrite) for the Register File
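
To make the R-type field boundaries above concrete, here is a minimal C sketch (not from the slides) that unpacks the fields of a 32-bit MIPS instruction word; the test value is the add instruction from the swap routine shown earlier.

#include <stdio.h>
#include <stdint.h>

/* Decode the fields of a MIPS R-type instruction word. */
int main(void)
{
    uint32_t instr = 0x00821020;   /* add $2, $4, $2 from the swap routine above */

    uint32_t op    = (instr >> 26) & 0x3F;   /* bits 31-26 */
    uint32_t rs    = (instr >> 21) & 0x1F;   /* bits 25-21 */
    uint32_t rt    = (instr >> 16) & 0x1F;   /* bits 20-16 */
    uint32_t rd    = (instr >> 11) & 0x1F;   /* bits 15-11 */
    uint32_t shamt = (instr >>  6) & 0x1F;   /* bits 10-6  */
    uint32_t funct =  instr        & 0x3F;   /* bits 5-0   */

    /* Prints: op=0 rs=4 rt=2 rd=2 shamt=0 funct=0x20 */
    printf("op=%u rs=%u rt=%u rd=%u shamt=%u funct=0x%x\n",
           op, rs, rt, rd, shamt, funct);
    return 0;
}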
Executing Load and Store Operations
[Figure: load/store datapath, where the 16-bit offset is sign-extended to 32 bits and added to the base register by the ALU to form the Data Memory address; MemRead and MemWrite control the Data Memory, RegWrite the Register File write on a load]

Executing Branch Operations
[Figure: branch datapath, where the sign-extended offset is shifted left 2 and added to PC + 4 to form the branch target address; the ALU's zero output goes to the branch control logic]

Executing Jump Operations
• J-type instruction format: op [31–26] | jump target address [25–0]
• Jump operations have to replace the lower 28 bits of the PC with the lower 26 bits of the fetched instruction shifted left by 2 bits
[Figure: jump datapath, where Instruction[25–0] shifted left 2 is concatenated with PC+4[31–28] to form the jump address]

Adding the Pieces Together
[Figure: single-cycle datapath combining fetch, ALU operations and load/store, with control signals RegWrite, ALUSrc, ALU control, MemWrite, MemRead and MemtoReg]

Adding the Branch Portion
[Figure: the same datapath with the branch adder and the PCSrc mux added]

Adding the Jump Portion
[Figure: the same datapath with the jump address logic and the Jump mux added]

MIPS Machine (with Controls)
[Figure: complete single-cycle MIPS datapath, where the Control Unit decodes Instr[31–26] into Jump, Branch, MemRead, MemtoReg, MemWrite, ALUSrc, RegWrite, RegDst and ALUOp, and the ALU control decodes Instr[5–0]]

Single Cycle – Can we do better?
• Every instruction executes in 1 cycle
– So every instruction's time equals the slowest instruction's time
• Resources cannot be reused within an instruction
– A wire can carry only one value in one cycle
[Figure: timing diagram over cycles 1–10, where a single-cycle lw occupies Ifetch, Dec/Reg, Exec, Mem and Wr before the following sw can begin]

In search of performance (next: Pipelining)

Pipelined Processor
[Figure: five-stage pipelined datapath with IF/ID, ID/EX, EX/MEM and MEM/WB registers between the stages]
• Data must be carried from one stage to the next: pipeline registers (latches) hold the temporary values and the control information needed for execution between clock edges
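
As a rough back-of-the-envelope illustration of the two points above (every single-cycle instruction pays for the slowest instruction, while a pipeline's clock only has to cover the slowest stage), here is a small C sketch. The per-stage delays are made-up numbers, not from the slides.

#include <stdio.h>

int main(void)
{
    /* Hypothetical per-stage delays in picoseconds: IF, ID, EX, MEM, WB. */
    int stage_ps[5] = {200, 100, 200, 200, 100};
    int n_stages = 5;
    long n_instr = 1000000;

    /* Single cycle: the clock period is the sum of all stage delays. */
    long single_cycle_ps = 0;
    for (int i = 0; i < n_stages; i++)
        single_cycle_ps += stage_ps[i];

    /* Pipelined: the clock period is the slowest stage; one instruction
       completes per cycle once the pipeline is full. */
    long pipe_cycle_ps = 0;
    for (int i = 0; i < n_stages; i++)
        if (stage_ps[i] > pipe_cycle_ps)
            pipe_cycle_ps = stage_ps[i];

    long t_single = n_instr * single_cycle_ps;
    long t_pipe   = (n_instr + n_stages - 1) * pipe_cycle_ps;

    printf("single-cycle: %ld ps, pipelined: %ld ps, speedup ~%.2fx\n",
           t_single, t_pipe, (double)t_single / t_pipe);
    return 0;
}

With these (unbalanced) stage delays the speedup comes out close to 4x rather than 5x, which previews the point made next: the ideal n-times speedup assumes perfectly balanced stages.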
Applied to Computer Design
[Figure: overlapped execution, where instructions i, i+1, i+2 and i+3 each pass through IF, ID, EX, MEM and WB (IM, Reg, ALU, DM, Reg), offset by one clock, so in steady state one instruction completes per cycle]
• Potential of n-times speedup, where n = number of pipeline stages

Data Forwarding
[Figure: pipelined datapath illustrating data forwarding between stages]

Branch Hazard
[Figure: pipelined datapath with control, where PCSrc and the branch target are produced in a later stage than fetch, creating the branch hazard]

A Branch Predictor
[Figure: branch prediction logic sits next to the Instruction Memory; it supplies a guess as to whether and where to branch in place of the normal PC value, and is updated with branch outcome information]

In search of performance (next: Caches)

Memory Hierarchy
• Capacity, access time and cost per level:
– CPU registers: 100s of bytes, < 10s of ns
– Cache: KBytes, 10–100 ns, 1–0.1 cents/bit
– Main memory: MBytes, 200–500 ns, 0.0001–0.00001 cents/bit
– Disk: GBytes, ~10 ms (10,000,000 ns), 10^-5–10^-6 cents/bit
– Tape: effectively infinite, seconds to minutes, 10^-8 cents/bit
• Staging/transfer unit between levels, and who manages it:
– Registers <-> cache: instruction operands (1–8 bytes), managed by the program/compiler
– Cache <-> memory: blocks (8–128 bytes), managed by the cache controller
– Memory <-> disk: pages (4K–16K bytes), managed by the OS
– Disk <-> tape: files (MBytes), managed by the user/operator
• Upper levels are faster; lower levels are larger
• Fact: large memories are slow, fast memories are small
• How do we create a memory that gives the illusion of being large, cheap and fast (most of the time)?

Caches: Insight
• Temporal locality (locality in time): keep the most recently accessed data items closer to the processor
• Spatial locality (locality in space): move blocks consisting of contiguous words up to the upper levels
[Figure: the processor exchanges words with the upper-level memory, which exchanges blocks (Blk X, Blk Y) with the lower-level memory]

Direct Mapped Cache
• Cache line 0 can be occupied by data from:
– Memory location 0, 8, 16, ... etc.
– In general: any memory location whose 3 LSBs of the address are 0s
– Address<2:0> => cache index
• Which one should we place in the cache?
– The one we reference most frequently
• How can we tell which one is in the cache?
– By storing a tag (the remaining upper address bits) alongside each line
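
A minimal C sketch of the direct-mapped lookup just described (not from the slides; the line count and block size are illustrative choices): the low address bits pick the line, and the stored tag is compared to decide hit or miss.

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define NUM_LINES   8    /* 8 lines -> 3 index bits, as in the example above */
#define BLOCK_BYTES 4    /* one word per block */

struct line { bool valid; uint32_t tag; uint32_t data; };
static struct line cache[NUM_LINES];

/* Returns true on a hit and fills *data; a real cache would refill on a miss. */
static bool cache_lookup(uint32_t addr, uint32_t *data)
{
    uint32_t index = (addr / BLOCK_BYTES) % NUM_LINES;   /* low address bits      */
    uint32_t tag   = (addr / BLOCK_BYTES) / NUM_LINES;   /* remaining upper bits  */

    if (cache[index].valid && cache[index].tag == tag) {
        *data = cache[index].data;
        return true;                                     /* hit  */
    }
    return false;                                        /* miss */
}

int main(void)
{
    uint32_t v;
    cache[2].valid = true; cache[2].tag = 1; cache[2].data = 42;  /* pretend address 40 is cached */
    printf("addr 40: %s\n", cache_lookup(40, &v) ? "hit" : "miss");  /* index 2, tag 1 -> hit  */
    printf("addr  8: %s\n", cache_lookup(8, &v)  ? "hit" : "miss");  /* index 2, tag 0 -> miss */
    return 0;
}

The second lookup misses even though the line is occupied: two addresses fight over the same index. That is exactly the conflict miss the next slide addresses.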
Set-Associative Cache
• Conflict misses are misses caused by different memory locations mapping to the same cache index
– Solution 1: make the cache bigger
– Solution 2: provide multiple entries for the same cache index
• N-way set associative: N entries for each cache index
– N direct-mapped caches operate in parallel
[Figure: a two-way set-associative cache, where the cache index selects a set, both ways' cache tags are compared in parallel, and a selector (MUX) picks the data of the way that hits]

In search of performance (next: Simple scalar pipelines)

Simple Pipelined Processor
[Figure: IF, ID, RF, EX, MEM, WB pipeline, with the Register File read in RF and written in WB]
• Fetch, issue and commit one instruction per cycle
– Execution units can take multiple cycles
– Including the memory
• Issue logic
– Needs to stall an instruction until its operands are ready to read
– Depending on the bypasses
– In-order issue, execute and commit
• Several execution units
– I-ALU, FP-Add, FP-Mult, FP-Div
– With different delays

Multi-cycle and Pipelined Units
[Figure: the EX stage replaced by parallel units between the ROB and the commit buffer (CB): a 7-stage FP multiplier (M1–M7), a 4-stage FP adder (A1–A4), a non-pipelined divider and the integer ALU]
Functional unit        Latency   Initiation interval
I-ALU                  1         1
FP ALU                 4         1
FP + integer mult      7         1
FP divider             24        24

Multi-cycle Units
• In practice the EX unit has separate, parallel units for integer arithmetic, multiplication and floating-point operations
• Scheduling is done using reservation tables
Functional unit        Latency
I-ALU                  1
FP ALU                 4
FP + integer mult      7
FP divider             24

Instruction Reordering
• Even in an in-order issue processor, several instructions can be executing at the same time
• Example:
DIVD F0, F2, F4
ADDD F10, F0, F8
SUBD F12, F8, F14
– ADDD cannot be executed because of its dependence on the DIVD result (F0)
– But SUBD can be
• Instruction reordering is therefore needed during issue (a small dependence-check sketch follows below)
• A commit buffer is needed for in-order writes
• Front end: instruction fetch is separated from the rest of the pipeline by the ROB
• Back end: instruction commit is separated from the rest of the pipeline by the CB
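
A minimal C sketch of the dependence check behind the example above (illustrative, not any real machine's issue logic; the names can_issue and busy are hypothetical): an instruction may issue only if none of its source registers is the destination of an instruction still in flight.

#include <stdio.h>
#include <stdbool.h>

struct instr { const char *name; int dst, src1, src2; };

/* busy[r] is true while register r is the destination of an unfinished instruction. */
static bool busy[32];

static bool can_issue(struct instr in)
{
    return !busy[in.src1] && !busy[in.src2];
}

int main(void)
{
    struct instr divd = {"DIVD F0,F2,F4",   0,  2,  4};
    struct instr addd = {"ADDD F10,F0,F8", 10,  0,  8};
    struct instr subd = {"SUBD F12,F8,F14",12,  8, 14};

    busy[divd.dst] = true;   /* DIVD has issued; F0 is not ready for ~24 cycles */

    printf("%s: %s\n", addd.name, can_issue(addd) ? "issue" : "stall");  /* stall: reads F0   */
    printf("%s: %s\n", subd.name, can_issue(subd) ? "issue" : "stall");  /* issue: independent */
    return 0;
}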
Dynamic Scheduling
• Register scoreboarding (CDC 6600)
– The first instruction whose sources are not the destination of an unfinished instruction is issued
– In the presence of bypasses?
– Only one destination can be in flight
• Otherwise it is difficult to disambiguate which result to use
• Tomasulo's algorithm (IBM System/360 Model 91)
– Writes the result into a new location
– All reads happen from the new location
– Register renaming

In search of performance (next: Superscalar, VLIW and EPIC)

Superscalar Processors
• Exploit instruction-level parallelism in hardware
– Instructions are issued from a sequential instruction stream
– CPU hardware dynamically checks for data dependences between instructions at run time
• Dynamic re-ordering of instructions
– Accepts multiple instructions per clock cycle
[Figure: the pipelined datapath issuing 2 instructions per cycle and committing 2 instructions per cycle]
• Issue logic overhead
– For n instructions in the window and a k-issue processor, the number of gates grows roughly as n^k and the delay as k² log n
– A combinatorial explosion
• Also limited by the ILP available in the application

VLIW Processors
• Instruction re-ordering and management are extremely resource- and power-hungry
– So perform them in the compiler: the VLIW processor (Josh Fisher)
• An instruction is composed of several operations
– All operations in an instruction execute in parallel
– All reads are done in the operand-read stage, and all writes in the WB stage
– No re-ordering and no checking by the machine
• Very simple hardware (power-efficient)
– Each operation requires a small, statically predictable delay
• Triggered a lot of compiler work
– Trace scheduling: speculative execution and compensation code
• Problems
– Code density, with lots of NOPs
– The compiler must speculate on branches
– Load hoisting
– Re-compilation is needed when the machine width changes
– Memory operations do not have a deterministic delay

EPIC
• Intel Itanium architecture
– 1.7 billion transistors on a 90 nm process
– 24 MB L3 cache
[Figures: the sturdy heat sink of the Itanium; the die photo, a processor in a sea of cache]
• Explicitly Parallel Instruction Computing: beyond VLIW
• Compatibility
– A bit-vector in the instruction specifies dependences on previous instructions
• Load hoisting
– Speculative-load and check-load instructions
• Branches
– Predication
– Multi-way branch instructions
• Register renaming
– Very large register files
• Code size
– Itanium code size is roughly 2x the equivalent x86 code size

Comparison
• Superscalar and VLIW are the two ends of the spectrum
• EPIC has more dynamic features than VLIW, but removes the really costly hardware features

In search of performance (next: Multi-threading, Hyper-Threading and Simultaneous Multithreading)

Multi-threading
• Tackles the memory wall
– Small pipeline stalls can be fixed by scheduling
– For memory misses, switch to a different thread of execution
• Can be done in software
– Store and restore the thread context
• Hardware threading
– Hardware state for each thread: register file, PC, CPSR, register renaming tables
– Even separate pipeline registers for each thread
• When to switch
– Coarse-grained: when the current thread stalls, on a timeout, or by priority with pre-emption
– Fine-grained: round robin, dynamic thread priorities
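
A minimal C sketch of the idea just described: one hardware context (PC plus registers; the struct layout here is hypothetical) per thread, and a round-robin switch whenever the running thread stalls on a miss.

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define NUM_THREADS 4

/* One architectural context per hardware thread (simplified). */
struct context { uint32_t pc; uint32_t regs[32]; bool stalled; };
static struct context ctx[NUM_THREADS];

/* Round-robin selection of the next ready thread; returns -1 if all are stalled. */
static int next_thread(int current)
{
    for (int i = 1; i <= NUM_THREADS; i++) {
        int t = (current + i) % NUM_THREADS;
        if (!ctx[t].stalled)
            return t;
    }
    return -1;
}

int main(void)
{
    ctx[0].stalled = true;   /* thread 0 just missed in the cache */
    ctx[2].stalled = true;
    printf("switch from thread 0 to thread %d\n", next_thread(0));  /* prints 1 */
    return 0;
}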
Simultaneous Multithreading (SMT)
• Issue instructions from multiple threads in each cycle
• Chief advantage
– It is possible to find more instructions to issue, and hence achieve higher performance
• Chief disadvantages
– The largest hardware overhead of the multithreading schemes
– Needs a larger cache
– Load/store queues need to keep memory consistent
• Examples: Intel Hyper-Threading, IBM AS/400
[Figure: issue slots over time for (a) a superscalar, (b) a multithreaded, and (c) an SMT processor]

In search of performance (next: Multicores, Chip Multi-Processors: Sun Niagara, IBM Power, Intel)

Sun UltraSPARC T1 (Niagara)
• 8 cores
– Small L1 caches: a private 8 KB data cache and a private 16 KB instruction cache per core
– Shared L2: 3 MB unified, with 4 banks and 16-byte accesses to memory
• At 400 MHz, 25.6 GB/s
– Only one FPU, shared by all cores
• Each FP operation takes > 40 cycles
– As a rough guideline, performance degrades if the number of floating-point instructions exceeds 1 percent of total instructions
• Core pipeline
– No out-of-order execution
– Small caches
– 4-way multi-threading to hide cache miss latency

Sun UltraSPARC T2 (Niagara 2)
• 1 FPU per core
• Each core is 4-way SMT
• Eight encryption engines, each supporting DES, 3DES, AES, RC4, SHA-1, SHA-256, MD5, RSA-2048, ECC and CRC32
• Power: 123 W (max), 95 W (normal)
• Verilog RTL available under the OpenSPARC project
• Rock processor (separate from Niagara)
– Hardware-implemented transactional memory
– Hardware scout for prefetching
– Was slated to come out in 2009

IBM Power Series
• POWER: Performance Optimization With Enhanced RISC
• POWER1, from the RS/6000
• POWER2
– Quad-word storage instructions: the quad-word load instruction moves two adjacent double-precision values into two adjacent floating-point registers
– Hardware square-root instruction
– Floating-point to integer conversion instructions
– POWER2 was the processor used in the 1997 IBM Deep Blue chess supercomputer, which beat chess grandmaster Garry Kasparov

POWER3
• Issues up to 4 FP instructions per cycle; peak of 8 instructions per cycle
• 32-byte interface to the L1 I-cache, 16-byte interface to the L1 D-cache
• Unified L2 of 1 MB–16 MB, with an L2 cache latency of 6 cycles
• 6-stage pipeline, so the branch mis-prediction penalty is only 3 cycles
• 2048-entry BHT
• Full register renaming (Tomasulo's algorithm)
• Non-blocking caches, for up to 4 outstanding cache misses
• Data cache prefetching
– A prefetch mechanism monitors accesses to two adjacent cache lines
– If it finds such a pattern, it starts a stream of prefetches (a sketch of this idea follows the POWER4 overview below)

POWER4
• Dual-core, with a shared L2 cache; external off-chip L3 cache
• 174 M transistors, 130 nm, 1 GHz, 115 W
• 5 instructions can commit each cycle
• Register renaming, out-of-order execution and hit-under-miss allow more than 200 instructions in flight
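
The POWER3 prefetch heuristic described above (watch for misses to two adjacent cache lines, then start streaming ahead) might look roughly like this. This is a hypothetical C sketch of the heuristic, not IBM's implementation; the line size and function names are made up.

#include <stdio.h>
#include <stdint.h>

#define LINE_BYTES 128

static uint64_t last_line;   /* cache line of the previous miss        */
static int      have_last;   /* has a previous miss been recorded yet? */

/* Called on each demand miss; issues a prefetch when two adjacent lines miss in a row. */
static void on_miss(uint64_t addr)
{
    uint64_t line = addr / LINE_BYTES;
    if (have_last && line == last_line + 1) {
        /* Two sequential lines observed: start (or continue) a prefetch stream. */
        printf("prefetch line %llu\n", (unsigned long long)(line + 1));
    }
    last_line = line;
    have_last = 1;
}

int main(void)
{
    on_miss(0x1000);   /* line 32                     */
    on_miss(0x1080);   /* line 33 -> prefetch line 34 */
    on_miss(0x1100);   /* line 34 -> prefetch line 35 */
    return 0;
}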
POWER5
• 276 M transistors, 130 nm
• On-chip memory controller; the L3 is still off-chip
• 2-way SMT, with software thread priority
• Increase in resources over POWER4
– GPRs: 80 -> 120
– FPRs: 72 -> 120
– The LRQ and SRQ have 32 entries each
• Dynamic resource balancing for SMT
– Reduce the priority of a low-ILP thread
[Figure: POWER5 chip overview]

POWER6
• "Often the individual components, especially the processors, are capable of additional performance, but the power and thermal costs require the system to enforce limits to ensure safe, continued operation."
• 790 M transistors, 65 nm, < 5 GHz, 430 W
• 13 gate delays per pipeline stage
• "Historically, Power Architecture technology-based machines consumed nearly their maximum power when idle."
• PURR: Processor Utilization of Resources Register
– Added for each SMT thread
– Collects statistics for thread prioritization
• A water-cooled POWER6 was revealed at SuperComputing 2007 (SC07)
• IBM and Cray each received $250 M from DARPA toward a petascale computer

Nehalem
• (Advertisement video shown in lecture)
• 730 M transistors, 45 nm, ~3 GHz, 173 W
• 2-way SMT
• Operation fusion
– The ISA cannot be changed, but complex instructions can be executed efficiently in hardware
• Loop Stream Detector
– Sits after branch prediction, fetch and decode and holds up to 28 decoded micro-ops; when a loop fits, the front end replays it from this buffer instead of re-fetching and re-decoding

Nehalem Memory Architecture
• Per core: a 32 KB L1 instruction cache, a 32 KB L1 data cache, and a 256 KB private L2 cache
• A shared L3 cache, kept inclusive of the core caches
– With an exclusive L3, an L3 miss must check all the other cores' caches
– With an inclusive L3, an L3 miss needs no further checks, and an L3 hit only needs to check the cores whose core valid bits are set (a sketch of this filtering follows the power-management material below)

Nehalem Power Control Unit
• An integrated, proprietary microcontroller (the PCU)
• Each core has its own PLL (driven from BCLK) and its own Vcc, frequency, thermal, current and power sensors; the uncore and LLC are supplied separately
• Shifts power control from hardware to embedded firmware
• Real-time sensors for temperature, current and power
• Flexibility enables sophisticated algorithms, tuned for the current operating conditions

Nehalem Turbo Mode and Power Gating
• Power gating: zero power for inactive cores
• Turbo mode: when the workload is lightly threaded or below TDP, the available headroom is used to add performance (frequency) bins to the active cores
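
A minimal C sketch of the core-valid-bit filtering described in the memory architecture bullets above (the struct and field names are hypothetical): on an L3 hit, only cores whose bit is set need to be snooped; on an L3 miss, the inclusive property means no core can hold the line.

#include <stdio.h>
#include <stdint.h>

#define NUM_CORES 4

struct l3_line { int present; uint8_t core_valid; /* one bit per core */ };

/* Snoop only the cores that may hold the line, as indicated by the core-valid bits. */
static void snoop_filter(struct l3_line line)
{
    if (!line.present) {   /* inclusive L3: an L3 miss means no core has the line */
        printf("L3 miss: no snoops needed\n");
        return;
    }
    for (int c = 0; c < NUM_CORES; c++)
        if (line.core_valid & (1u << c))
            printf("snoop core %d\n", c);
}

int main(void)
{
    struct l3_line hit  = {1, 0x4};   /* present, cached only by core 2 */
    struct l3_line miss = {0, 0x0};
    snoop_filter(hit);
    snoop_filter(miss);
    return 0;
}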
In search of performance (next: Graphics Processors, Nvidia Tesla)

Nvidia GPUs
• CPUs devote more of their transistors to data storage (caches) and control; GPUs devote more to data processing
• Writing applications for the GPU used to require a graphics API, like DirectX
• CUDA: Compute Unified Device Architecture
– A C-like language for writing GPU applications without going through the graphics API
– Execution model: a highly multithreaded co-processor running for-loop kernels (parallel for); a small kernel sketch follows the Tesla S1070 table below

GPU for Graphics
• Graphics applications
– All the work on one pixel is a thread, executed on a single core
– Multiple threads execute on a single core in SMT fashion
– Pixel tasks are independent
– All the pixels of a frame are processed simultaneously

Processor Architecture
• Thread processors have
– Private registers
– A common instruction stream
– Shared memory
• Tesla S1070 (roughly 1 TFlop per GPU):
Number of Tesla GPUs: 4
Streaming processor cores: 960 (240 per GPU)
Frequency of processor cores: 1.296 to 1.44 GHz
Single-precision floating-point performance (peak): 3.73 to 4.14 TFlops
Double-precision floating-point performance (peak): 311 to 345 GFlops
Floating-point precision: IEEE 754 single & double
Total dedicated memory: 16 GB
Memory interface: 512-bit
Memory bandwidth: 408 GB/s
Max power consumption: 800 W
System interface: PCIe x16 or x8
Programming environment: CUDA
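
A minimal CUDA C sketch of the "parallel for" kernel style described above (the array size and launch configuration are arbitrary choices, not from the slides): each thread computes one iteration of what would be a sequential loop on the CPU, here the squares 0..100 from the earlier example.

#include <stdio.h>
#include <cuda_runtime.h>

/* Each thread handles one loop iteration: out[i] = i * i. */
__global__ void squares(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = i * i;
}

int main(void)
{
    const int n = 101;            /* 0..100, as in the sum-of-squares example */
    int host[101];
    int *dev;

    cudaMalloc(&dev, n * sizeof(int));
    squares<<<(n + 255) / 256, 256>>>(dev, n);   /* enough 256-thread blocks to cover n */
    cudaMemcpy(host, dev, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev);

    printf("100*100 = %d\n", host[100]);
    return 0;
}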
In search of performance (next: Software Managed, IBM Cell)

IBM Cell
• Clock speed: > 4 GHz
• Peak performance (single precision): > 256 GFlops
• Peak performance (double precision): > 26 GFlops*
• Local storage per SPU: 256 KB
• Area: 221 mm²; technology: 90 nm; total transistors: 234 M
• Heterogeneous multi-core system architecture
– A Power Processing Element (PPE) for control tasks
– Synergistic Processor Elements (SPEs) for data-intensive processing
• Each Synergistic Processor Element consists of
– A Synergistic Processor Unit (SPU)
– A Synergistic Memory Flow Control (SMF) unit

Cell Components – SPE (8 per chip)
• Synergistic Processing Unit
– RISC organization: 32-bit fixed-length instruction encoding, 3-operand instruction format, load/store architecture, unified register file
– User-mode architecture: no page translation within the SPU
– SIMD dataflow: a broad set of operations (8, 16, 32, 64 bit), graphics single-precision float, and IEEE double-precision float
– 256 KB local store, combined instruction and data
– DMA block transfer using Power Architecture memory translation
[Figure: SPE pipeline]

Synergistic Memory Flow Control
• The SMF implements memory management and mapping
• The SMF operates in parallel with the SPU
– Independent compute and transfer
– Command interface from the SPU
• A DMA queue decouples the SMF and the SPU
• Block transfers between system memory and the local store; 8 concurrent memory transfers
• SPE programs reference system memory using the user-level effective address space
– Ease of data sharing
– Local-store-to-local-store transfers
– Protection

In search of performance (finally: Where are we headed?)

Where are we headed to?
• Intel CMPs
– 16-core and 32-core CMPs
– With SMT and shared memory
– Cache coherency
– TLMs, hardware- and software-implemented
• Intel 80-core experimental system
– More like the Cell processor
• Intel network processor, Larrabee
– A network processor with tens of cores
• Nvidia
– Planning a 1000-core processor
• IBM
– The next version of Cell

CGRAs (Coarse-Grained Reconfigurable Architectures)
• An array of PEs connected by a mesh-like interconnect
[Figure: each PE contains a functional unit (FU), its configuration, and a local register file (LRF)]
• Characterized by array size, node functionalities, interconnect, and register file configuration
• Execute the compute-intensive kernels of multimedia applications (an example of such a kernel follows below)
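
As an illustration of the kind of compute-intensive multimedia kernel a CGRA targets (a hypothetical example in C, not from the slides), consider a simple brightness-scaling loop: a CGRA compiler would map the load, multiply, shift, saturate and store of the loop body onto FUs of the PE array and stream successive iterations through them.

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/* Scale pixel brightness by a fixed-point factor (e.g. 160/128 = 1.25x).
   The loop body is the kind of kernel that gets mapped onto a CGRA's PE array. */
static void scale_brightness(uint8_t *dst, const uint8_t *src, size_t n, int factor)
{
    for (size_t i = 0; i < n; i++) {
        int v = (src[i] * factor) >> 7;        /* multiply and shift map to FUs      */
        dst[i] = v > 255 ? 255 : (uint8_t)v;   /* saturation is another simple FU op */
    }
}

int main(void)
{
    uint8_t in[4] = {0, 100, 200, 250}, out[4];
    scale_brightness(out, in, 4, 160);         /* 160/128 = 1.25x brightness */
    printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]);   /* 0 125 250 255 */
    return 0;
}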