Advanced Processor Architectures

Outline
- Introduction
- CISC & RISC
- Basic techniques
- Superscalar
- VLIW
- Multicore & Manycore

Introduction
- The microprocessor is an implementation of the Von Neumann architecture and of the Turing computation model
- First microprocessor: 1971, Intel 4004, about 2,300 transistors, 740 kHz, 4 bit
- Long and rapid evolution since then: billions of transistors, 3-4 GHz, 64 bit, ...
- The computation model, however, has always remained the same

Introduction
- After a few years, two approaches emerged for the design of new microprocessors
- CISC: Complex Instruction-Set Computers
  - Enforces code compactness
  - Variable-length instructions
  - Complex operations, complex addressing modes
  - Complex hardware
- RISC: Reduced Instruction-Set Computers
  - Fixed-length instructions
  - Simple operations on registers (load/store approach)
  - Simpler hardware

Introduction
- RISC instructions
  - Load & store (memory <-> registers)
  - ALU operations (on registers)
  - Branches
- Execution time is easily predictable
- Execution can be broken into smaller phases
  - Pipelining allows parallelizing the execution
  - Pipelining allows executing instructions faster
- Provided that memory is fast
  - Can be obtained with cache memories
- Branches interrupt the sequential execution
  - This problem is mitigated with branch prediction

Introduction
- The RISC approach seems promising and scalable
- Several standards, e.g. MIPS: 5-stage pipeline
  - FETCH, DECODE, EXECUTE, MEMORY ACCESS, WRITE BACK
- Execution example
  [Pipeline diagram: seven instructions flow through the F/D/E/M/W stages, each shifted by one cycle]

Pipelining
- With the availability of
  - Silicon foundry processes
  - Design tools
- pipelining became feasible also for CISC machines
- Main problems with pipelining
  - Definition and balancing of the stages
  - Data hazards
  - Control hazards

Pipelining
- Maximum operating frequency of a pipeline
  - Number of stages: N
  - Total latency: L = L1 + ... + LN
  - Maximum stage latency: Lmax = max(L1, ..., LN)
- No pipelining: Fmax = 1 / L
- Unbalanced pipeline: Fmax = 1 / Lmax
- Optimal case (L1 = L2 = ... = LN = L / N): Fmax = 1 / (L/N) = N / L
- For example, a 10 ns datapath split into five perfectly balanced 2 ns stages can be clocked at 500 MHz instead of 100 MHz

Data Hazards
- Consider the pipelined execution of the code

    ADD R0, R1, R2    // R0 <= R1 + R2
    DEC R0            // R0 <= R0 - 1

- ADD reads registers R1 and R2, calculates the sum, and writes the result in R0 only at the write-back stage
- DEC reads register R0 at the decode stage, before ADD has written it back
  [Pipeline diagram: DEC's register read overlaps ADD's execution]

Data Hazards
- The code should execute as follows: three NOPs separate ADD and DEC, so that DEC reads R0 only after ADD has written it back
  [Pipeline diagram: ADD, three NOPs, then DEC]

Data Hazards
- The NOP instructions can be
  - Generated by the compiler
    - Strongly architecture dependent
    - Low runtime overhead
  - Inserted as "stalls" or bubbles by the microprocessor
    - Simpler compiler
    - More complex hardware architecture: additional logic for data hazards is needed
- General approach to this kind of problem: code scheduling
  - At compile time
  - At run time

Data Hazards - Solutions
- Bypass logic
  - Involves the Decode and Execute stages
  - Requires additional multiplexers
- The Decode stage compares
  - Registers written by the Execute and Memory stages
  - Registers read by the Decode stage
- and controls the multiplexers to select the most recent data

Data Hazards - Solutions
- Bypass logic and bubbles
  - When data is ready only after the Memory access stage, the bypass cannot "anticipate" the data
  - A bubble is inserted
  (a toy model of these rules is sketched below)
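The forwarding and stall rules just described can be condensed into a small decision function. The sketch below is a toy model, not the control logic of any particular processor: it assumes a classic five-stage pipeline where ALU results are forwarded from the Execute stage, while a load's result is available only after the Memory stage, so an instruction that consumes a load result in the very next slot needs one bubble.

```c
#include <stdio.h>
#include <stdbool.h>

/* Simplified instruction: destination register, two sources, and a
   flag telling whether the result comes from the Memory stage (load). */
typedef struct {
    const char *name;
    int dest;        /* register written, -1 if none   */
    int src1, src2;  /* registers read,   -1 if unused */
    bool is_load;
} Instr;

/* Bubbles required between a producer and the instruction issued
   right after it, assuming full bypass from Execute and Memory:
   - ALU producer  -> 0 bubbles (result forwarded from Execute)
   - load producer -> 1 bubble  (result available only after Memory) */
static int bubbles_needed(const Instr *prod, const Instr *cons)
{
    bool dependent = prod->dest != -1 &&
                     (prod->dest == cons->src1 || prod->dest == cons->src2);
    if (!dependent)
        return 0;
    return prod->is_load ? 1 : 0;
}

int main(void)
{
    /* LD R0,[addr] ; ADD R2,R0,R1 ; DEC R2 */
    Instr prog[] = {
        { "LD  R0,[addr]", 0, -1, -1, true  },
        { "ADD R2,R0,R1",  2,  0,  1, false },
        { "DEC R2",        2,  2, -1, false },
    };
    for (int i = 0; i + 1 < 3; i++)
        printf("%-14s -> %-14s : %d bubble(s)\n",
               prog[i].name, prog[i + 1].name,
               bubbles_needed(&prog[i], &prog[i + 1]));
    return 0;
}
```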
Control Hazards
- Consider the pipelined execution of the code

    LOOP: ...
          DEC R0      // R0 <= R0 - 1
          BGE LOOP    // If R0 >= 0 jumps to LOOP
          ADD ...

- DEC reads register R0, decrements it and sets the condition flags
- If R0 >= 0 the next instruction is not ADD, but ADD has already been fetched and decoded
  [Pipeline diagram: ADD enters the pipeline before the branch outcome is known]

Control Hazards
- The code should execute as follows
  - The wrong instruction (ADD) is flushed from the pipeline
  - The correct instruction is fetched once the branch outcome is known
  [Pipeline diagram: ADD is flushed and replaced by the correct target instruction]

Control Hazards - Solutions
- Prediction units
  - Try to predict whether a branch will be taken or not
  - Fetch the next instruction accordingly
  - Still require a pipeline flush, but more rarely if a good prediction policy is implemented
- Types of prediction
  - Always taken: assumes that the branch will always be taken
  - Always not taken: assumes that the branch will never be taken
  - History based: predicts the next jump based on the branch history

Control Hazards - Solutions
- Delay slot
  - The instruction following a branch is always executed
  - The branch is taken only after this instruction
  [Pipeline diagram: the slot instruction executes while the branch is resolved, then the next instruction can be fetched]

Control Hazards - Solutions
- All of these improve execution time
  - Always taken / always not taken: fail every time the prediction is wrong; at least one clock cycle is wasted
  - Delay slot: fails if the compiler cannot exploit the slot; at least one clock cycle is wasted (NOP)
    - Decidable at compile time, no additional hardware
    - Controversial solution: relies on the compiler's ability to exploit delay slots

Performance improvement
- Assuming no hazards, a major cause limiting the execution speed is memory I/O
- Several solutions at system level
  - Stanford
  - Harvard
  - SuperHarvard
- Other approaches at CPU level
  - Prefetch buffer
  - Loop optimization
  - Conditional execution

Stanford Architecture
- Based on the Von Neumann model
- Unified cache for data and instructions
- Data and instructions have different access patterns
  [Block diagram: fetch unit (PC, IR) and memory unit (MAR, MDR) sharing one L0 cache and one address/data bus to memory]

Harvard Architecture
- Based on the Von Neumann model
- Split caches for data and instructions
- Better hit rates
  [Block diagram: fetch unit with L0 I-cache, memory unit with L0 D-cache, separate address/data paths to memory]

SuperHarvard Architecture
- Extends the split cache for data and instructions
- Integrates a DMA controller for high-speed data transfers
- Two buses
  [Block diagram: fetch unit with L0 I-cache, memory unit with L0 D-cache, two memories, I/O with DMA]

Instruction Prefetch Buffer
- When the pipeline CPI is close to 1 or greater, instruction fetch latency may slow down execution
- Fetching instructions ahead of time is beneficial
- Especially true for superscalar architectures
  [Block diagram: prefetch buffer reading ahead from the L0 I-cache into a queue of PC/IR pairs]

Loop Optimization
- Most loops are explicitly indexed
- This requires allocating a register for the loop index, and
  - Incrementing or decrementing the index
  - Comparing with the loop boundary
  - Jumping back to the beginning, or out of the loop
- All indexed loops can be rewritten as (a C equivalent is sketched below)

    INIT:  MOV R0, #N
    LOOP:  ...            // Loop body
           DEC R0
           BGE LOOP
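In C, the same rewriting amounts to turning an up-counting loop into a down-counting one whose termination test is a single decrement-and-branch, i.e. the DEC/BGE pair above. The function names and the float workload are only illustrative:

```c
#include <stddef.h>

/* Up-counting version: explicit index register, compare against n,
   conditional jump back to the top of the loop. */
void scale_up(float *v, size_t n, float k)
{
    for (size_t i = 0; i < n; i++)
        v[i] *= k;
}

/* Down-counting version: the whole loop-control overhead collapses
   into a single decrement-and-branch, the pattern that a dedicated
   loop unit (a "zero-overhead loop") can execute for free. */
void scale_down(float *v, size_t n, float k)
{
    while (n-- > 0)        /* maps onto DEC / BGE */
        v[n] *= k;
}
```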
Loop Optimization
- In the rewritten loop, the DEC R0 / BGE LOOP pair can be executed by a dedicated unit
  - Significant reduction of execution time
  - Using dedicated hardware, these instructions can be parallelized with the loop body
  - Requires ad hoc usage of loop registers and dedicated loop-optimization logic

Conditional execution
- A similar principle can be applied to
  - Single instructions
  - Conditional constructs
- Consider the code

    if( A + B > 0 )
        Z = C * B;
    ...

- The direct translation with normal instructions is

    ADD   R0, R1, R2    // temp = A + B
    BLE   L000
    MUL   R3, R2, R4    // Z = C * B
    L000: ...

Conditional execution
- With conditional instructions we have

    ADD   R0, R1, R2    // temp = A + B
    MULGT R3, R2, R4    // Z = C * B, executed only if flags N and Z are reset

- This arrangement has several advantages
  - Smaller code
  - Saves one instruction at run time
  - Does not generate control hazards
- Clearly, it requires additional hardware

Superscalar Architectures
- Different instructions
  - Can use different units of the pipeline
  - Can share no data dependency
- They can be executed in parallel by having more than one pipeline
  [Block diagram: FETCH, DECODE, ISSUE feeding execution units EXEC 1 ... EXEC N, then RETIRE and WRITE BACK]

Superscalar Architectures
- Typical architecture: different pipelines for different classes of instructions
  - Integer & logic
  - Integer multiply & divide
  - Floating point
  - Branch
- Each pipeline
  - Is optimized for the execution of specific instructions
  - Requires a different number of cycles to complete
- Instructions must be retired in order: reorder buffer

Superscalar Architectures
- A specific unit decides how to issue instructions
  - Can be placed after the decode stage or integrated with it
- This activity requires a sort of instruction scheduling
- Can be generalized: out-of-order execution
  [Block diagram: FETCH, DECODE, SCHEDULER, execution units EXEC 1 ... EXEC N, REORDER BUFFER, WRITE BACK]

Superscalar Architectures
- Decoded instructions can be stored in a trace cache
  - Reduces latency
  - Needs a specific unit
  [Block diagram: FETCH, DECODE, TRACE CACHE feeding the scheduler, execution units, REORDER BUFFER, WRITE BACK]

Superscalar Architectures
- Classes of instructions and their pipelines
  - Integer: FETCH, DISPATCH, DECODE, EXECUTE, WRITE BACK
  - Load-Store: FETCH, DISPATCH, DECODE, ADDR GEN, CACHE, WRITE BACK
  - Floating Point: FETCH, DISPATCH, DECODE, EXECUTE 1, EXECUTE 2, WRITE BACK
  - Branch: FETCH, DISPATCH, DECODE, EXECUTE, PREDICT

Superscalar Architectures
- Example: RISC 8000 architecture
  [Block diagram]

Explicitly Parallel Architectures
- In superscalar architectures it is the hardware that
  - Analyzes sequences of instructions
  - Verifies the presence of data dependencies
  - Dispatches them to different pipelines
- Most of these tasks can be done at compile time
  - Data dependencies
  - Extraction of Instruction-Level Parallelism
  - Dispatch to different pipelines
  - Explicit scheduling on several executors
- The resulting architectures are
  - EPIC: Explicitly Parallel Instruction Computer
  - VLIW: Very Long Instruction Word

Explicitly Parallel Architectures
- Analyzing the structure of a typical RISC code
  - Compilers can extract parallelism
  - Code generators can "pack" several small RISC instructions into one long VLIW instruction (a sketch of this packing follows below)
- Original RISC code

    LD  R0, #0xF004
    ADD R1, R2, R2
    ADD R3, R4, R4
    MUL R5, #12, R5
    ST  R5, #0xF008

- In the superscalar case the hardware discovers at run time that these instructions can proceed in parallel; in the VLIW case the compiler packs them into very long instruction words

Explicitly Parallel Architectures
- Simplified architecture
  [Block diagram: instruction cache, fetch unit, one very long IR, parallel decoders, execute units #1 ... #N sharing a register file and a data cache]
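To give a feel for what the compiler does, here is a toy bundler: it scans the instruction sequence from the previous slide and opens a new bundle whenever an instruction reads a register written earlier in the current bundle (a read-after-write dependence). It deliberately ignores functional-unit constraints, latencies and WAR/WAW dependences, so it is a sketch of the idea rather than a real VLIW scheduling algorithm:

```c
#include <stdio.h>
#include <stdbool.h>

#define NREGS 32

typedef struct {
    const char *text;
    int dest;       /* register written, -1 if none   */
    int src[2];     /* registers read,   -1 if unused */
} Op;

int main(void)
{
    /* The RISC sequence from the slide (destination is the first operand). */
    Op prog[] = {
        { "LD  R0, #0xF004",  0, { -1, -1 } },
        { "ADD R1, R2, R2",   1, {  2,  2 } },
        { "ADD R3, R4, R4",   3, {  4,  4 } },
        { "MUL R5, #12, R5",  5, {  5, -1 } },
        { "ST  R5, #0xF008", -1, {  5, -1 } },
    };
    int n = sizeof prog / sizeof prog[0];

    bool written[NREGS] = { false };   /* regs written in the current bundle */
    int bundle = 0;
    printf("bundle %d:\n", bundle);
    for (int i = 0; i < n; i++) {
        /* RAW dependence on the current bundle? -> start a new one. */
        bool raw = (prog[i].src[0] >= 0 && written[prog[i].src[0]]) ||
                   (prog[i].src[1] >= 0 && written[prog[i].src[1]]);
        if (raw) {
            for (int r = 0; r < NREGS; r++) written[r] = false;
            printf("bundle %d:\n", ++bundle);
        }
        printf("    %s\n", prog[i].text);
        if (prog[i].dest >= 0) written[prog[i].dest] = true;
    }
    return 0;
}
```

With this rule the ST, which consumes the MUL result, falls into a second bundle; a real code generator would also weigh instruction latencies and the number of available execution units.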
Example: TI 6455 DSP Core
  [Block diagram of the core]
- Two datapaths, A and B
  - Provide the same functionality
- Registers
  - Only some registers can be used on both A and B
  - Exchange data from A to B and vice versa
- Complex routing structure

Example: TI 6455 DSP Core
- .L unit
  - 32/40-bit arithmetic and compare operations
  - 32-bit logical operations
  - Leftmost 1 or 0 counting for 32 bits
  - Normalization count for 32 and 40 bits
  - Byte shifts
  - Data packing/unpacking
  - 5-bit constant generation
  - Dual 16-bit and quad 8-bit arithmetic operations
  - Dual 16-bit and quad 8-bit minimum/maximum operations
- .S unit
  - 32-bit arithmetic operations
  - 32/40-bit shifts and 32-bit bit-field operations
  - 32-bit logical operations
  - Branches
  - Constant generation
  - Register transfers to/from the control register file (.S2 only)
  - Byte shifts
  - Data packing/unpacking
  - Dual 16-bit and quad 8-bit compare operations
  - Dual 16-bit shift operations
  - Dual 16-bit and quad 8-bit saturated arithmetic operations
- .M unit
  - 32 x 32-bit, 16 x 16-bit and 16 x 32-bit multiply operations
  - Quad 8 x 8-bit and dual 16 x 16-bit multiply operations
  - Dual 16 x 16-bit multiply with add/subtract operations
  - Quad 8 x 8-bit multiply with add operation
  - Bit expansion
  - Bit interleaving/de-interleaving
  - Variable shift operations
  - Rotation
  - Galois field multiply
- .D unit
  - 32-bit add, subtract, linear and circular address calculation
  - Loads and stores with 5-bit constant offset
  - Loads and stores with 15-bit constant offset (.D2 only)
  - Load and store doublewords with 5-bit constant
  - Load and store non-aligned words and doublewords
  - 5-bit constant generation
  - 32-bit logical operations

Example: TI 6455 DSP Core
- Generic instruction format
  [Figure: generic instruction format, with examples of sequential and parallel execution]

Multi-Core Architecture
- Integration technology allows packing several microprocessor cores on a single chip: multi-core architectures
- Many advantages
  - High performance
  - Applications can scale very well
- But... many problems
  - Communication among cores
  - Memory access
  - Cache coherence
  - ...

Multi-Core Architecture
- Simplified view of a single-core architecture: one core (registers + ALU) behind a bus interface
- Simplified view of a multi-core architecture: several cores, each with its own registers and ALU, sharing the bus interface
  [Block diagrams]

Multi-Core Architecture - Execution
- Threads run in parallel on different cores
- Within each core, threads are time-sliced

Multi-Core Architecture - Parallelism
- ILP - Instruction Level Parallelism
  - Parallelism at the machine-instruction level
  - Instruction re-ordering, instruction pipelining, splitting instructions into micro-operations, branch prediction, ...
  - Main source of performance in the last 15 years
- TLP - Thread Level Parallelism
  - Parallelism on a coarser scale
  - A server serves each client in a separate thread
  - Games run AI, graphics, and physics in three threads
  - Superscalar processors cannot fully exploit TLP

Multi-Core Architecture - Models
- Four possible computation models, classified by instruction and data streams
  - SISD (single instruction, single data): traditional cores, RISC and CISC
  - SIMD (single instruction, multiple data): vector machines, graphic cards
  - MISD (multiple instruction, single data)
  - MIMD (multiple instruction, multiple data): multi-core, many-core

Multi-Core Architecture - Memory
- Shared memory
  - One common shared memory for all processors
  - Large
- Distributed memory
  - Each processor has its own local memory
  - Small
  - Its content is not replicated anywhere else

Multi-Core Architecture - Caches
- Cache memories can be
  - Private
    - Closer to the core, so faster access
    - Reduces contention
    - Cache coherence is a major problem (see the example below)
  - Shared
    - Threads on different cores can share cache data
    - More cache space is available if only a few high-performance threads run on the system
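A concrete way to see why coherence is a problem for private caches is false sharing: two threads that update two different variables still generate coherence traffic if those variables happen to sit in the same cache line. The sketch below uses POSIX threads and assumes 64-byte cache lines (a common but not universal size); timing the two runs, e.g. with `time`, typically shows the padded version completing noticeably faster on a multi-core machine.

```c
/* Build with: cc -O2 -pthread false_sharing.c */
#include <pthread.h>
#include <stdio.h>

#define ITERS 100000000UL

/* Two counters in the same cache line vs. padded onto separate lines. */
static struct { volatile unsigned long a, b; } same_line;
static struct { volatile unsigned long a; char pad[64];
                volatile unsigned long b; } padded;

static void *bump(void *arg)
{
    volatile unsigned long *p = arg;
    for (unsigned long i = 0; i < ITERS; i++)
        (*p)++;                       /* each write dirties the line */
    return NULL;
}

static void run(volatile unsigned long *x, volatile unsigned long *y,
                const char *label)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump, (void *)x);
    pthread_create(&t2, NULL, bump, (void *)y);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("%s: %lu + %lu increments done\n", label, *x, *y);
}

int main(void)
{
    run(&same_line.a, &same_line.b, "counters on the same cache line");
    run(&padded.a,    &padded.b,    "counters on separate cache lines");
    return 0;
}
```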
Multi-Core Architecture - Caches
- Example: four cores, each with a private cache, and a shared main memory holding X = 33210 at address 0xF0
- Initially all caches are empty
- Core 1 reads variable X at address 0xF0: cache 1 has a miss, then loads the correct value (X = 33210)
- Then Core 2 reads the same variable X at address 0xF0: cache 2 has a miss, then loads the correct value (X = 33210)
- Then Core 1 modifies variable X (X = 1228)
  - Cache 1 uses a write-through policy, so main memory is updated
  - Cache 2 still holds the stale value 33210
- Then Core 2 reads variable X: cache 2 hits, but reads the wrong value!

Multi-Core Architecture - Caches
- Invalidation with snooping
  - Invalidation: when a core writes data, all copies of this data in other caches are invalidated
  - Snooping: cores continuously monitor ("snoop") the bus connecting the cores
- The same sequence with write-through plus invalidation
  - Core 1 reads X at 0xF0: miss, cache 1 loads X = 33210
  - Core 2 reads X at 0xF0: miss, cache 2 loads X = 33210
  - Core 1 writes X = 1228: memory is updated and the copy in cache 2 is invalidated
  - After the invalidation, cache 2 no longer contains a value for X
  - Core 2 reads X: it now has a miss, but reads the correct value (X = 1228)

Multi-Core Architecture - Caches
- A different solution maintains all caches up to date
  - Update: when a core writes data, all caches are notified of the new value through a broadcast message
- The same sequence with write-through plus update
  - Core 1 reads X at 0xF0: miss, cache 1 loads X = 33210
  - Core 2 reads X at 0xF0: miss, cache 2 loads X = 33210
  - Core 1 writes X = 1228: memory is updated and cache 2 receives the message and updates its copy
  - Core 2 reads X: it hits and reads the correct value (X = 1228)
- (A toy model of the write-invalidate variant is sketched below)
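The write-through, write-invalidate behaviour of the walkthrough above can be reproduced with a toy model: one shared memory word, a one-entry private cache per core, and a write that updates memory and invalidates every other copy. This is a sketch for illustration only, not a description of a real snooping controller (which tracks whole lines and more cache-line states):

```c
#include <stdio.h>
#include <stdbool.h>

#define NCORES 4

/* One-entry private cache per core, plus a single shared memory word. */
static struct { bool valid; int value; } cache[NCORES];
static int memory = 33210;                    /* X, initially 33210 */

static int core_read(int core)
{
    if (!cache[core].valid) {                 /* miss: load from memory */
        cache[core].value = memory;
        cache[core].valid = true;
        printf("core %d: miss, loads X = %d\n", core + 1, memory);
    } else {
        printf("core %d: hit,  reads X = %d\n", core + 1, cache[core].value);
    }
    return cache[core].value;
}

static void core_write(int core, int value)
{
    cache[core].value = value;
    cache[core].valid = true;
    memory = value;                           /* write-through to memory    */
    for (int c = 0; c < NCORES; c++)          /* snooping caches invalidate */
        if (c != core)
            cache[c].valid = false;
    printf("core %d: writes X = %d, other copies invalidated\n",
           core + 1, value);
}

int main(void)
{
    core_read(0);           /* core 1: miss, X = 33210                       */
    core_read(1);           /* core 2: miss, X = 33210                       */
    core_write(0, 1228);    /* core 1: X = 1228, core 2's copy invalidated   */
    core_read(1);           /* core 2: miss again, but reads the correct 1228 */
    return 0;
}
```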
Many-Core Architecture
- Evolution of the multi-core approach: many-cores
  - Hundreds of simpler cores
  - Many local memories
  - Complex communication network, often a NoC (Network on Chip)
  - One or more "supervisor cores"

Many-Core Architecture
- Simplified architecture
  [Block diagram: a grid of small cores, each with its own local memory, connected by a NoC; NoC bridges link the grid to supervisor cores, external memories, and I/O]

Further reading
1. Computer Architecture: A Quantitative Approach - John L. Hennessy, David A. Patterson, Andrea C. Arpaci-Dusseau
2. The SPARC Architecture Manual, Version 8 - SPARC International
3. TMS320C6455 Fixed-Point Digital Signal Processor - Texas Instruments data sheet
4. TMS320C64x/C64x+ DSP CPU and Instruction Set - Texas Instruments data sheet
5. Planning Considerations for Multicore Processor Technology - Dell report
6. Many-core Architecture and Programming Challenges - Satnam Singh, Microsoft Research