CS 152 Computer Architecture and Engineering
Lecture 22: Final Lecture
Krste Asanovic
Electrical Engineering and Computer Sciences, University of California, Berkeley
http://www.eecs.berkeley.edu/~krste
http://inst.cs.berkeley.edu/~cs152

Today’s Lecture
• Review of the entire semester – what you learned
• Follow-on classes
• What’s next in computer architecture?
5/6/2008 CS152-Spring’08

The New CS152 Executive Summary (what was promised in Lecture 1)
• The processor your predecessors built in CS152
• What you’ll understand and experiment with in the new CS152
• Plus, the technology behind chip-scale multiprocessors (CMPs)

From Babbage to IBM 650
[Figure: photographs of early computing machines]

IBM 360: Initial Implementations
                  Model 30           Model 70
  Storage         8K–64 KB           256K–512 KB
  Datapath        8-bit              64-bit
  Circuit Delay   30 ns/level        5 ns/level
  Local Store     Main Store         Transistor Registers
  Control Store   Read-only, 1 µs    Conventional circuits
The IBM 360 instruction set architecture (ISA) completely hid the underlying technological differences between the various models.
Milestone: the first true ISA designed as a portable hardware-software interface! With minor modifications it still survives today!

Microcoded Microarchitecture
[Figure: a microcode controller (ROM) holding fixed microcode instructions drives a datapath connected to memory (RAM); memory holds the user program written in macrocode instructions (e.g., MIPS, x86, etc.). Status signals (busy?, zero?, opcode) feed back to the controller; the controller drives memory control signals (enMem, MemWrt).]

Implementing Complex Instructions
[Figure: bus-based microcoded datapath with IR, immediate extender, register file (32 GPRs + PC), A and B latches, ALU, memory-address register, and memory, all sharing a 32-bit bus.]
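To make the microcoded style concrete, here is a minimal sketch (in Python, with an invented encoding – not the actual CS152 microcode) of how one complex register-memory instruction decomposes into a microroutine of register-transfer steps on the bus-based datapath:

```python
# Toy microcoded interpreter: each macroinstruction is implemented as a
# sequence of register-transfer microoperations, in the spirit of the
# bus-based microcoded datapath. The encoding here is illustrative only.

def run_reg_memory_op(regs, mem, rd, rs, rt):
    """Implement the CISC-style 'rd <- M[(rs)] op (rt)' in microsteps."""
    ma = regs[rs]          # ustep 1: MA <- Reg[rs]
    a = mem[ma]            # ustep 2: A  <- M[MA]   (memory operand fetch)
    b = regs[rt]           # ustep 3: B  <- Reg[rt]
    regs[rd] = a + b       # ustep 4: Reg[rd] <- A op B   (op = add here)
    return regs

regs = {1: 100, 2: 7, 3: 0}    # rs=1 holds address 100, rt=2 holds 7
mem = {100: 35}
run_reg_memory_op(regs, mem, rd=3, rs=1, rt=2)
print(regs[3])  # 42
```

Each Python statement stands in for one or more microinstructions sequenced by the controller ROM; a hardwired RISC pipeline instead needs several simple instructions to do the same work.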
Register transfers implemented by microcode:
  Reg-Memory-src ALU op:  rd ← M[(rs)] op (rt)
  Reg-Memory-dst ALU op:  M[(rd)] ← (rs) op (rt)
  Mem-Mem ALU op:         M[(rd)] ← M[(rs)] op M[(rt)]

From CISC to RISC
• Use fast RAM to build a fast instruction cache of user-visible instructions, not fixed hardware microroutines
  – Can change the contents of the fast instruction memory to fit what the application needs right now
• Use a simple ISA to enable a hardwired pipelined implementation
  – Most compiled code used only a few of the available CISC instructions
  – Simpler encoding allowed pipelined implementations
• Further benefit with integration
  – In the early ’80s, a 32-bit datapath + small caches could fit on a single chip
  – No chip crossings in the common case allows faster operation

Nanocoding
• Exploits recurring control-signal patterns in µcode, e.g., ALU0: A ← Reg[rs] ...; ALUi0: A ← Reg[rs] ...
[Figure: the µcode ROM holds next-state addresses and nanoaddresses; the nanoinstruction ROM decodes each nanoaddress into the full set of control signals.]
• The MC68000 had 17-bit µcode containing either a 10-bit µjump or a 9-bit nanoinstruction pointer
  – Nanoinstructions were 68 bits wide, decoded to give 196 control signals

“Iron Law” of Processor Performance
  Time/Program = Instructions/Program × Cycles/Instruction × Time/Cycle
• Instructions per program depends on source code, compiler technology, and the ISA
• Cycles per instruction (CPI) depends upon the ISA and the microarchitecture
• Time per cycle depends upon the microarchitecture and the base technology

  Microarchitecture          CPI   Cycle time
  Microcoded                 >1    short
  Single-cycle unpipelined   1     long
  Pipelined                  1     short

5-Stage Pipelined Execution
[Figure: datapath with stages I-Fetch (IF), Decode/Reg. Fetch (ID), Execute (EX), Memory (MA), and Write-Back (WB), with instruction memory, GPRs, immediate extender, ALU, and data memory.]
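The Iron Law is easy to exercise numerically. The sketch below uses illustrative numbers (not measurements from the lecture) to show why pipelining wins: it keeps the single-cycle design's CPI of 1 while recovering the microcoded design's short cycle time.

```python
# Iron Law: Time/Program = (Instructions/Program) x (Cycles/Instruction)
#                        x (Time/Cycle). Numbers are illustrative only.

def exec_time_ns(instructions, cpi, cycle_time_ns):
    return instructions * cpi * cycle_time_ns

N = 1_000_000  # same program, same instruction count, on three designs
microcoded = exec_time_ns(N, cpi=4.0, cycle_time_ns=1.0)  # CPI > 1, short cycle
single_cyc = exec_time_ns(N, cpi=1.0, cycle_time_ns=4.0)  # CPI = 1, long cycle
pipelined  = exec_time_ns(N, cpi=1.0, cycle_time_ns=1.0)  # CPI = 1, short cycle

print(microcoded, single_cyc, pipelined)  # 4000000.0 4000000.0 1000000.0
```

With these (assumed) parameters the pipelined machine is 4x faster than either alternative, even though all three execute the identical instruction stream.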
  time →        t0   t1   t2   t3   t4   t5   t6   t7  ...
  instruction1  IF1  ID1  EX1  MA1  WB1
  instruction2       IF2  ID2  EX2  MA2  WB2
  instruction3            IF3  ID3  EX3  MA3  WB3
  instruction4                 IF4  ID4  EX4  MA4  WB4
  instruction5                      IF5  ID5  EX5  MA5  WB5

Pipeline Hazards
• Pipelining instructions is complicated by HAZARDS:
  – Structural hazards (two instructions want the same hardware resource)
  – Data hazards (an earlier instruction produces a value needed by a later instruction)
  – Control hazards (an instruction changes control flow, e.g., branches or exceptions)
• Techniques to handle hazards:
  – Interlock (hold the newer instruction until older instructions drain out of the pipeline)
  – Bypass (transfer a value from an older instruction to a newer instruction as soon as it is available somewhere in the machine)
  – Speculate (guess the effect of the earlier instruction)
• Speculation needs a predictor, a prediction check, and a recovery mechanism

Exception Handling in the 5-Stage Pipeline
[Figure: PC-address exceptions are detected at fetch, illegal opcodes at decode, overflow at execute, and data-address exceptions at memory; asynchronous interrupts enter at writeback. Exception flags (Exc) and PCs flow down the pipeline to the commit point, where the earliest exception selects the handler PC, kills the F/D/E stages and the writeback, and records EPC and Cause.]

Processor-DRAM Gap (latency)
[Figure: log-scale performance, 1980–2000. CPU performance (“Moore’s Law”) grows ~60%/year while DRAM grows ~7%/year, so the processor-memory performance gap grows ~50%/year.]
A four-issue 2 GHz superscalar accessing 100 ns DRAM could execute 800 instructions during the time for one memory access!

Common Predictable Patterns
Two predictable properties of memory references:
• Temporal Locality: if a location is referenced, it is likely to be referenced again in the near future.
• Spatial Locality: if a location is referenced, it is likely that locations near it will be referenced in the near future.
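The "800 instructions per memory access" claim is just the issue rate times the miss latency in cycles, which a few lines of arithmetic confirm:

```python
# Peak instructions a superscalar could issue during one DRAM access.
issue_width = 4            # instructions per cycle
clock_hz = 2e9             # 2 GHz
dram_latency_s = 100e-9    # 100 ns

cycles_per_access = clock_hz * dram_latency_s   # 200 cycles
lost_instructions = issue_width * cycles_per_access

print(int(lost_instructions))  # 800
```

This is why the memory hierarchy (caches exploiting temporal and spatial locality) is essential: without it, one miss costs hundreds of instruction slots.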
Memory Reference Patterns
[Figure: memory address versus time, one dot per access, showing horizontal bands of temporal locality and diagonal bands of spatial locality.]
Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168–192 (1971)

Causes for Cache Misses
• Compulsory: first reference to a block, a.k.a. cold-start misses
  – misses that would occur even with an infinite cache
• Capacity: cache is too small to hold all data needed by the program
  – misses that would occur even under a perfect replacement policy
• Conflict: misses that occur because of collisions due to the block-placement strategy
  – misses that would not occur with full associativity

A Typical Memory Hierarchy c. 2006
[Figure: CPU with a multiported register file; split L1 instruction and data caches (on-chip SRAM); a large unified L2 cache (on-chip SRAM); multiple interleaved memory banks (DRAM).]

Modern Virtual Memory Systems: the illusion of a large, private, uniform store
• Protection & Privacy: several users, each with their private address space and one or more shared address spaces (page table ≡ name space)
• Demand Paging: provides the ability to run programs larger than the primary memory (swapping store backs primary memory)
• Hides differences in machine configurations
• The price is address translation on each memory reference: VA → mapping (TLB) → PA

Hierarchical Page Table
[Figure: a 32-bit virtual address split into a 10-bit L1 index (p1, bits 31:22), a 10-bit L2 index (p2, bits 21:12), and a 12-bit offset (bits 11:0). A processor register holds the root of the current page table; the level-1 page table points to level-2 page tables, whose PTEs point to data pages. Pages may reside in primary or secondary memory, and a PTE may mark a nonexistent page.]

Address Translation & Protection
[Figure: virtual address = virtual page number (VPN) + offset. Kernel/user mode and read/write intent drive the protection check while address translation proceeds; either can raise an exception.]
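A two-level table walk matching the slide's 10/10/12 split can be sketched in a few lines. This is a software model for illustration (tables as Python dicts, page faults as exceptions), not any particular OS's implementation:

```python
# Two-level page-table walk: 10-bit L1 index (p1), 10-bit L2 index (p2),
# 12-bit page offset, as in the hierarchical page table slide.

PAGE_BITS = 12
L2_BITS = 10

def translate(root, va):
    p1 = (va >> (PAGE_BITS + L2_BITS)) & 0x3FF   # bits 31:22
    p2 = (va >> PAGE_BITS) & 0x3FF               # bits 21:12
    offset = va & 0xFFF                          # bits 11:0
    l2_table = root.get(p1)      # level-1 PTE -> level-2 page table
    if l2_table is None:
        raise MemoryError("page fault: nonexistent level-2 table")
    ppn = l2_table.get(p2)       # level-2 PTE -> physical page number
    if ppn is None:
        raise MemoryError("page fault: nonexistent page")
    return (ppn << PAGE_BITS) | offset

root = {1: {2: 0x7ABCD}}         # VA with p1=1, p2=2 maps to PPN 0x7ABCD
va = (1 << 22) | (2 << 12) | 0x345
print(hex(translate(root, va)))  # 0x7abcd345
```

The walk costs two memory references per translation, which is exactly why the next slides introduce the TLB to cache recent translations.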
The physical address is the physical page number (PPN) concatenated with the unchanged page offset.
• Every instruction and data access needs address translation and protection checks
• A good VM design needs to be fast (~one cycle) and space-efficient → Translation Lookaside Buffer (TLB)

Address Translation in the CPU Pipeline
[Figure: an instruction TLB sits before the instruction cache and a data TLB before the data cache; each can raise a TLB miss, page fault, or protection violation.]
• Software handlers need a restartable exception on a page fault or protection violation
• Handling a TLB miss needs a hardware or software mechanism to refill the TLB
• Need mechanisms to cope with the additional latency of a TLB:
  – slow down the clock
  – pipeline the TLB and cache access
  – virtual-address caches
  – parallel TLB/cache access

Concurrent Access to TLB & Cache
[Figure: the virtual address splits into a VPN and a k-bit page offset; a direct-mapped cache with 2^L blocks of 2^b bytes takes its virtual index from the low-order bits while the TLB translates the VPN to a PPN, which is compared against the physical tag.]
• The index is available without consulting the TLB, so the cache and TLB accesses can begin simultaneously
• The tag comparison is made after both accesses are completed
• Cases: L + b = k, L + b < k, L + b > k

CS152 Administrivia
• Lab 4 competition winners!
• Quiz 6 on Thursday, May 8 – L19–21, PS 6, Lab 6
• Last 15 minutes: course survey
  – HKN survey
  – Informal feedback survey for those who’ve not done it already
• Quiz 5 results

Complex Pipeline Structure
[Figure: IF and ID feed an issue stage reading GPRs and FPRs; functional units include the ALU, a memory unit, Fadd, Fmul, and Fdiv, converging at writeback.]

Superscalar In-Order Pipeline
• Fetch two instructions per cycle; issue both simultaneously if one is integer/memory and the other is floating-point
• An inexpensive way of increasing throughput; examples include the Alpha 21064 (1992) and the MIPS R5000 series (1996)
• The same idea can be extended to wider issue by duplicating functional units (e.g., the
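The three cases L + b vs. k determine whether the cache index fits entirely inside the untranslated page offset. A small helper (example geometries only) makes the condition explicit:

```python
# When index bits (L) + block-offset bits (b) <= page-offset bits (k),
# a direct-mapped cache can be indexed purely with untranslated bits,
# so the TLB lookup and cache access can proceed fully in parallel.

import math

def index_fits_in_page_offset(cache_bytes, block_bytes, page_bytes):
    blocks = cache_bytes // block_bytes
    L = int(math.log2(blocks))        # index bits
    b = int(math.log2(block_bytes))   # block-offset bits
    k = int(math.log2(page_bytes))    # page-offset bits
    return L + b <= k

print(index_fits_in_page_offset(4096, 64, 4096))    # True:  4 KB cache, 4 KB pages
print(index_fits_in_page_offset(16384, 64, 4096))   # False: index needs VPN bits
```

When L + b > k, some index bits come from the VPN and the design must deal with the resulting complications (e.g., higher associativity or alias handling), which is why L1 capacity per way is often bounded by the page size.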
4-issue UltraSPARC), but register-file ports and bypassing costs grow quickly
[Figure: dual-issue pipeline sharing GPRs and FPRs: an integer/memory pipe (X1, data memory, X3, W), pipelined Fadd and Fmul pipes (X1–X3, W), and an unpipelined FDiv divider, with the commit point after writeback.]

Types of Data Hazards
Consider executing a sequence of instructions of the form rk ← (ri) op (rj):
• Data-dependence → Read-after-Write (RAW) hazard
    r3 ← (r1) op (r2)
    r5 ← (r3) op (r4)
• Anti-dependence → Write-after-Read (WAR) hazard
    r3 ← (r1) op (r2)
    r1 ← (r4) op (r5)
• Output-dependence → Write-after-Write (WAW) hazard
    r3 ← (r1) op (r2)
    r3 ← (r6) op (r7)

Phases of Instruction Execution
• Fetch: instruction bits retrieved from the cache into the fetch buffer
• Decode: instructions placed in the appropriate issue (aka “dispatch”) stage buffer
• Execute: instructions and operands sent to the execution units; when execution completes, all results and exception flags are available in the result buffer
• Commit: instruction irrevocably updates architectural state (aka “graduation” or “completion”)

Pipeline Design with Physical Regfile
[Figure: fetch with branch prediction → decode & rename → out-of-order execution against a physical register file (branch unit, ALU, memory unit, store buffer, D$) → in-order commit through a reorder buffer. Branch resolution kills wrong-path instructions at each stage and updates the predictors.]

Reorder Buffer Holds the Active Instruction Window
[Figure: a window of instructions from oldest to newest – ld r1,(r3); add r3,r1,r2; sub r6,r7,r9; add r3,r3,r6; ld r6,(r1); add r6,r6,r3; st r6,(r1); ld r6,(r1) – with commit, execute, and fetch pointers each advancing from cycle t to cycle t+1.]

Branch History Table
[Figure: the fetch PC indexes a 2^k-entry BHT, 2 bits per entry, to predict taken/not-taken; the opcode and offset from the I-cache compute the target PC.]
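The three hazard classes on the slide reduce to set membership tests on source and destination registers. A minimal classifier for instructions of the form rk ← (ri) op (rj):

```python
# Classify the hazard(s) between an older and a newer instruction,
# following the slide's three cases (RAW, WAR, WAW).

def hazards(older, newer):
    """older/newer are (dest, src1, src2) register-number triples."""
    d1, s1a, s1b = older
    d2, s2a, s2b = newer
    found = set()
    if d1 in (s2a, s2b):
        found.add("RAW")   # data-dependence: newer reads what older writes
    if d2 in (s1a, s1b):
        found.add("WAR")   # anti-dependence: newer writes what older reads
    if d1 == d2:
        found.add("WAW")   # output-dependence: both write the same register
    return found

print(hazards((3, 1, 2), (5, 3, 4)))  # {'RAW'}
print(hazards((3, 1, 2), (1, 4, 5)))  # {'WAR'}
print(hazards((3, 1, 2), (3, 6, 7)))  # {'WAW'}
```

Register renaming onto a physical register file, as in the next slide, eliminates the WAR and WAW cases by giving each write a fresh destination; only true RAW dependences remain.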
A 4K-entry BHT with 2 bits per entry achieves ~80–90% correct predictions.

Two-Level Branch Predictor
The Pentium Pro uses the result from the last two branches to select one of four sets of BHT bits (~95% correct).
[Figure: a 2-bit global branch-history shift register, updated with the taken/not-taken result of each branch, combines with the fetch PC to index the BHT.]

Branch Target Buffer (BTB)
[Figure: a 2^k-entry direct-mapped BTB (can also be associative); each entry holds a valid bit, the branch PC (entry PC), and the predicted target PC, matched against the fetch PC.]
• Keep both the branch PC and the target PC in the BTB
• PC+4 is fetched if the match fails
• Only taken branches and jumps are held in the BTB
• The next PC is determined before the branch is fetched and decoded

Combining the BTB and BHT
• BTB entries are considerably more expensive than BHT entries, but can redirect fetches at an earlier stage in the pipeline and can accelerate indirect branches (JR)
• The BHT can hold many more entries and is more accurate
• The BHT, acting in a later pipeline stage, corrects the BTB when it misses a predicted-taken branch
• The BTB and BHT are only updated after the branch resolves in the E stage
[Figure: pipeline stages from PC generation/mux through two instruction-fetch stages, branch address calculation/begin decode, complete decode, steering instructions to functional units, register-file read, and integer execute; the BTB acts in the fetch stages and the BHT at branch address calculation.]

Sequential ISA Bottleneck
[Figure: sequential source code (a = foo(b); for (i=0, i<…) is compiled by a superscalar compiler – find independent operations, schedule operations – into sequential machine code, which a superscalar processor must again check for instruction dependencies and schedule for execution.]

VLIW: Very Long Instruction Word
[Figure: one instruction word with slots Int Op 1, Int Op 2, Mem Op 1, Mem Op 2, FP Op 1, FP Op 2, feeding two single-cycle integer units, two three-cycle load/store units, and two four-cycle floating-point units.]
• Multiple operations packed into one instruction
• Each operation slot is for a fixed function
• Constant operation latencies are specified
• The architecture requires a guarantee of:
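The per-entry behavior behind those BHT accuracy figures is the 2-bit saturating counter. This sketch shows why 2 bits beat 1: on a loop branch, the counter mispredicts only the exit, not the exit plus the re-entry (the workload below is invented for illustration):

```python
# 2-bit saturating counter, one per BHT entry: states 0-1 predict
# not-taken, states 2-3 predict taken; move one step per outcome.

class TwoBitPredictor:
    def __init__(self, state=2):        # start weakly taken
        self.state = state

    def predict(self):
        return self.state >= 2          # True = predict taken

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

# A loop branch: taken 9 times, then not-taken once at loop exit.
p = TwoBitPredictor()
outcomes = [True] * 9 + [False]
mispredicts = 0
for taken in outcomes * 2:              # execute the loop twice
    if p.predict() != taken:
        mispredicts += 1
    p.update(taken)
print(mispredicts)  # 2: one misprediction per loop exit
```

A 1-bit scheme would mispredict twice per loop execution (exit and first re-entry); the second bit of hysteresis keeps the prediction "taken" across the single not-taken outcome.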
  – Parallelism within an instruction ⇒ no cross-operation RAW check
  – No data use before data ready ⇒ no data interlocks

Scheduling Loop Unrolled Code (unroll 4 ways)
loop:  ld   f1, 0(r1)
       ld   f2, 8(r1)
       ld   f3, 16(r1)
       ld   f4, 24(r1)
       add  r1, 32
       fadd f5, f0, f1
       fadd f6, f0, f2
       fadd f7, f0, f3
       fadd f8, f0, f4
       sd   f5, 0(r2)
       sd   f6, 8(r2)
       sd   f7, 16(r2)
       sd   f8, 24(r2)
       add  r2, 32
       bne  r1, r3, loop
[Figure: VLIW schedule packing the loads and stores into the two memory slots, the fadds into the FP+ slot, and the adds and bne into the integer slots.]

Software Pipelining (unroll 4 ways first)
loop:  ld   f1, 0(r1)
       ld   f2, 8(r1)
       ld   f3, 16(r1)
       ld   f4, 24(r1)
       add  r1, 32
       fadd f5, f0, f1
       fadd f6, f0, f2
       fadd f7, f0, f3
       fadd f8, f0, f4
       sd   f5, 0(r2)
       sd   f6, 8(r2)
       sd   f7, 16(r2)
       add  r2, 32
       sd   f8, -8(r2)
       bne  r1, r3, loop
[Figure: software-pipelined VLIW schedule with a prolog that starts the loads and fadds of early iterations, a steady-state loop body in which the loads, fadds, and stores of three different iterations issue together, and an epilog that drains the remaining fadds and stores.]

Vector Programming Model
[Figure: scalar registers r0–r15; vector registers v0–v15, each holding elements [0] … [VLRMAX-1]; a vector length register (VLR). A vector arithmetic instruction such as ADDV v3, v1, v2 adds v1 and v2 elementwise over elements [0] … [VLR-1]. A vector load/store such as LV v1, r1, r2 moves a vector between a vector register and memory using base r1 and stride r2.]

Vector Unit Structure
[Figure: vector registers partitioned across four lanes (elements 0,4,8,…; 1,5,9,…; 2,6,10,…; 3,7,11,…), each lane with its own functional-unit pipelines, over a shared memory subsystem.]

Vector Instruction Parallelism
Can overlap execution of multiple vector instructions – the example machine has 32 elements per vector register and 8 lanes.
[Figure: the load, multiply, and add units each work on a different vector instruction, completing 24 operations/cycle while issuing 1 short instruction/cycle.]
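The vector model's handling of arbitrary-length vectors is strip-mining: process the data in chunks of at most VLRMAX elements, setting the vector length register for each chunk. A software sketch of the compiled loop structure (Python standing in for the generated scalar+vector code):

```python
# Strip-mining sketch: an N-element vector add done in chunks of at most
# VLRMAX elements, setting the vector length register (VLR) per chunk.

VLRMAX = 32   # elements per vector register on the slide's example machine

def addv(dst, a, b, start, vlr):
    """ADDV-style elementwise add over elements [0]..[vlr-1] of a strip."""
    for i in range(vlr):
        dst[start + i] = a[start + i] + b[start + i]

def vector_add(a, b):
    n = len(a)
    dst = [0] * n
    done = 0
    while done < n:
        vlr = min(VLRMAX, n - done)    # set VLR for this strip
        addv(dst, a, b, done, vlr)     # one vector instruction's worth
        done += vlr
    return dst

a = list(range(70))    # 70 elements -> strips of 32, 32, and 6
b = [1] * 70
print(vector_add(a, b)[:5])  # [1, 2, 3, 4, 5]
```

On real hardware each `addv` call is a single instruction whose elements spread across the lanes; the loop overhead is paid once per 32 elements rather than once per element.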
Multithreading
How can we guarantee no dependencies between instructions in a pipeline?
• One way is to interleave the execution of instructions from different program threads on the same pipeline
[Figure: four threads T1–T4 interleaved on a non-bypassed 5-stage pipe:
  T1: LW   r1, 0(r2)     F D X M W
  T2: ADD  r7, r1, r4      F D X M W
  T3: XORI r5, r4, #12       F D X M W
  T4: SW   0(r7), r5           F D X M W
  T1: LW   r5, 12(r1)            F D X M W
The prior instruction in a thread always completes writeback before the next instruction in the same thread reads the register file.]

Multithreaded Categories
[Figure: issue slots over time (processor cycles) for superscalar, fine-grained multithreading, coarse-grained multithreading, multiprocessing, and simultaneous multithreading, with five threads and idle slots shown.]

SMT in Power 5
[Figure: compared with the Power 4, the Power 5 adds 2 fetches (PCs), 2 initial decodes, and 2 commits (architected register sets).]

A Producer-Consumer Example
The program is written assuming instructions are executed in order.
  Producer posting item x:       Consumer:
    Load  Rtail, (tail)            Load  Rhead, (head)
    Store (Rtail), x             spin:
    Rtail = Rtail + 1              Load  Rtail, (tail)
    Store (tail), Rtail            if Rhead == Rtail goto spin
                                   Load  R, (Rhead)
                                   Rhead = Rhead + 1
                                   Store (head), Rhead
                                   process(R)

Sequential Consistency: A Memory Model
“A system is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in the order specified by the program.” – Leslie Lamport
Sequential Consistency = arbitrary order-preserving interleaving of the memory references of sequential programs

Sequential Consistency
Sequential consistency imposes more memory-ordering constraints than those imposed by uniprocessor program dependencies. What are these in our example?
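The producer-consumer code transcribes directly into Python with explicit head/tail indices. Note the caveat: this sketch works here because CPython executes each statement effectively atomically and in order; on a weakly ordered multiprocessor, the store of the item must be made visible before the store to `tail`, which is exactly the ordering requirement the sequential-consistency discussion is about.

```python
# The slide's producer-consumer, one producer and one consumer thread.
# The ordering "store item, then store tail" is what makes it correct.

import threading

N = 10
buf = [None] * 64
head = tail = 0
results = []

def producer():
    global tail
    for x in range(N):
        buf[tail] = x          # Store (Rtail), x   -- must happen first
        tail += 1              # Store (tail), Rtail

def consumer():
    global head
    while len(results) < N:
        if head == tail:       # spin: queue empty
            continue
        r = buf[head]          # Load R, (Rhead)
        head += 1              # Store (head), Rhead
        results.append(r)      # process(R)

t_prod = threading.Thread(target=producer)
t_cons = threading.Thread(target=consumer)
t_cons.start(); t_prod.start()
t_prod.join(); t_cons.join()
print(results)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

If the two stores in the producer could be reordered (as a non-sequentially-consistent machine may do), the consumer could observe the new `tail` before the item is in the buffer and read garbage.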
  T1:                          T2:
    Store (X), 1    (X = 1)      Load  R1, (Y)
    Store (Y), 11   (Y = 11)     Store (Y’), R1   (Y’ = Y)
                                 Load  R2, (X)
                                 Store (X’), R2   (X’ = X)
Sequential consistency adds ordering requirements between these accesses beyond the uniprocessor program dependencies.

Mutual Exclusion and Locks
Want to guarantee that only one process is active in a critical section. Options:
• Blocking atomic read-modify-write instructions, e.g., Test&Set, Fetch&Add, Swap
• Non-blocking atomic read-modify-write instructions, e.g., Compare&Swap, Load-reserve/Store-conditional
• Protocols based on ordinary Loads and Stores

Snoopy Cache Protocols
[Figure: processors M1–M3, each with a snoopy cache, share a memory bus with physical memory and DMA devices (disks).]
Use a snoopy mechanism to keep all processors’ view of memory coherent.

MESI: An Enhanced MSI Protocol – increased performance for private data
Each cache line has a tag and state bits:
  M: Modified Exclusive
  E: Exclusive, unmodified
  S: Shared
  I: Invalid
[Figure: state-transition diagram for the cache state in processor P1. A P1 write or read keeps a line in M; a P1 write moves E to M; another processor’s read moves M to S (P1 writes back); a read miss fills in E if not shared, S if shared; a write miss fills in M; another processor’s intent to write moves the line to I.]

Basic Operation of a Directory
• k processors
• With each cache block in memory: k presence bits and 1 dirty bit
• With each cache block in a cache: 1 valid bit and 1 dirty (owner) bit
• Read from main memory by processor i:
  – If the dirty bit is OFF: { read from main memory; turn p[i] ON; }
  – If the dirty bit is ON: { recall the line from the dirty processor (cache state goes to shared); update memory; turn the dirty bit OFF; turn p[i] ON; supply the recalled data to i; }
• Write to main memory by processor i:
  – If the dirty bit is OFF: { send invalidations to all caches that have the block; turn the dirty bit ON; supply data to i; turn p[i] ON; ...
}

Directory Cache Protocol (Handout 6)
[Figure: CPUs with caches on an interconnection network connecting to directory controllers, each fronting a DRAM bank.]
• Assumptions: reliable network; FIFO message delivery between any given source-destination pair

Performance of Symmetric Shared-Memory Multiprocessors
Cache performance is a combination of:
1. Uniprocessor cache-miss traffic
2. Traffic caused by communication, which results in invalidations and subsequent cache misses
• This adds a 4th C: the coherence miss
  – joining Compulsory, Capacity, and Conflict
  – (sometimes called a Communication miss)

Intel “Nehalem” (2008)
• 2–8 cores
• SMT (2 threads/core)
• Private L2$/core
• Shared L3$
• Initially in 45 nm

Related Courses
• CS61C (strong prerequisite): basic computer organization, first look at pipelines + caches
• CS 152: computer architecture, first look at parallel architectures
• CS 252: graduate computer architecture, advanced topics
• CS 258: parallel architectures, languages, systems
• CS 150: digital logic design
• CS 194-6: new FPGA-based architecture lab class

Advice: Get Involved in Research
E.g.,
• RADLab – data center
• ParLab – parallel clients
• Undergrad research experience is the most important part of an application to top grad schools.

End of CS152
• Thanks for being such patient guinea pigs!
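The directory bookkeeping described above (k presence bits and a dirty bit per memory block) can be modeled in a few lines. This is a state-update sketch only: network messages, data movement, and the dirty-bit-ON write case (elided with "..." on the slide) are not modeled.

```python
# Directory state per memory block: k presence bits + 1 dirty bit,
# updated on reads and writes by processor i as the slides describe.

K = 4  # processors

class DirEntry:
    def __init__(self):
        self.presence = [False] * K
        self.dirty = False

    def read(self, i):
        if self.dirty:
            # recall the line from the dirty processor (its cache goes
            # to shared), update memory, turn the dirty bit off
            self.dirty = False
        self.presence[i] = True          # turn p[i] on

    def write(self, i):
        if not self.dirty:
            # send invalidations to all sharers, then grant ownership
            self.presence = [False] * K
        self.dirty = True                # turn the dirty bit on
        self.presence[i] = True          # turn p[i] on

e = DirEntry()
e.read(0); e.read(1)          # two sharers of a clean block
print(e.presence, e.dirty)    # [True, True, False, False] False
e.write(2)                    # invalidate sharers; processor 2 owns it
print(e.presence, e.dirty)    # [False, False, True, False] True
```

Unlike a snoopy protocol, only the caches named by the presence bits receive invalidations, which is what lets directories scale past a single shared bus.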
  – Hopefully your pain will help future generations of CS152 students

Acknowledgements
• These slides contain material developed and copyright by:
  – Arvind (MIT)
  – Krste Asanovic (MIT/UCB)
  – Joel Emer (Intel/MIT)
  – James Hoe (CMU)
  – John Kubiatowicz (UCB)
  – David Patterson (UCB)
• MIT material derived from course 6.823
• UCB material derived from course CS252