CSCE 430/830 Computer Architecture
Reviews of Quantitative Principles, Memory Hierarchy & Pipeline Design Basics
Adapted from Professor David Patterson, Electrical Engineering and Computer Sciences, University of California, Berkeley

Instruction Set Architecture: Critical Interface
• The instruction set is the interface between software and hardware.
• Properties of a good abstraction:
  – Lasts through many generations (portability)
  – Used in many different ways (generality)
  – Provides convenient functionality to higher levels
  – Permits an efficient implementation at lower levels

Example: MIPS
• Programmable storage: 2^32 bytes of memory; 31 x 32-bit GPRs r1..r31 (r0 = 0); 32 x 32-bit FP registers (paired for double precision); HI, LO, PC
• Questions an ISA must answer: Data types? Format? Addressing modes? Operations?
• Arithmetic/logical: Add, AddU, Sub, SubU, And, Or, Xor, Nor, SLT, SLTU, AddI, AddIU, SLTI, SLTIU, AndI, OrI, XorI, LUI, SLL, SRL, SRA, SLLV, SRLV, SRAV
• Memory access: LB, LBU, LH, LHU, LW, LWL, LWR, SB, SH, SW, SWL, SWR
• Control: J, JAL, JR, JALR, BEq, BNE, BLEZ, BGTZ, BLTZ, BGEZ, BLTZAL, BGEZAL
• 32-bit instructions aligned on word boundaries
• See pp. 13 of the textbook for MIPS64

Processor Performance Equation
CPU time = Seconds / Program
         = (Instructions / Program) x (Cycles / Instruction) x (Seconds / Cycle)

Which design aspects affect each term:
                   Inst Count   CPI    Clock Rate
  Program              X
  Compiler             X        (X)
  Instruction Set      X         X
  Organization                   X         X
  Technology                               X

Define and quantify power (1/2)
• For CMOS chips, the traditionally dominant energy consumption has been in switching transistors, called dynamic power:
    Power_dynamic = 1/2 x Capacitive load x Voltage^2 x Frequency switched
• For mobile devices, energy is the better metric:
    Energy_dynamic = Capacitive load x Voltage^2
• For a fixed task, slowing the clock rate (frequency switched) reduces power, but not energy
• Capacitive load is a function of the number of transistors connected to an output and of the technology, which determines the capacitance of wires and transistors
• Dropping voltage helps both, so supply voltages went from 5V to 1V
• To save energy and dynamic power, most CPUs now turn off the clock of inactive modules (e.g., the floating-point unit)

Define and quantify power (2/2)
• Because leakage current flows even when a transistor is off, static power is now important too:
    Power_static = Current_static x Voltage
• Leakage current increases in processors with smaller transistor sizes
• Increasing the number of transistors increases power even if they are turned off
• In 2006, the goal for leakage was 25% of total power consumption; high-performance designs reach 40%
• Very low power systems even gate the voltage to inactive modules to control the loss due to leakage
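To make the dynamic-power relationships concrete, a minimal sketch in C; the capacitance, voltage, and frequency values below are made-up illustrative numbers, not figures from the slides.

#include <stdio.h>

/* Dynamic power and energy of CMOS switching, per the formulas above:
 *   P_dynamic = 1/2 * C * V^2 * f   (watts)
 *   E_dynamic = C * V^2             (joules per switching event) */
static double dynamic_power(double cap_load, double voltage, double freq) {
    return 0.5 * cap_load * voltage * voltage * freq;
}

static double dynamic_energy(double cap_load, double voltage) {
    return cap_load * voltage * voltage;
}

int main(void) {
    double c = 1e-9;      /* illustrative capacitive load: 1 nF switched per cycle */
    double v = 1.0;       /* supply voltage: 1 V */
    double f = 2e9;       /* switching frequency: 2 GHz */

    printf("P_dyn = %.3f W\n", dynamic_power(c, v, f));
    printf("E_dyn = %.3e J per switch\n", dynamic_energy(c, v));

    /* Halving the frequency halves power but leaves energy for a fixed task
     * unchanged; dropping voltage from 1.0 V to 0.8 V cuts both by 0.64x. */
    printf("P_dyn at f/2   = %.3f W\n", dynamic_power(c, v, f / 2));
    printf("P_dyn at 0.8 V = %.3f W\n", dynamic_power(c, 0.8, f));
    return 0;
}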
Define and quantify dependability (1/3)
• How do we decide when a system is operating properly?
• Infrastructure providers now offer Service Level Agreements (SLAs) to guarantee that their networking or power service will be dependable
• Systems alternate between two states of service with respect to an SLA:
  1. Service accomplishment, where the service is delivered as specified in the SLA
  2. Service interruption, where the delivered service is different from the SLA
• Failure = transition from state 1 to state 2
• Restoration = transition from state 2 to state 1

Define and quantify dependability (2/3)
• Module reliability = a measure of continuous service accomplishment (or of time to failure). Two metrics:
  1. Mean Time To Failure (MTTF) measures reliability
  2. Failures In Time (FIT) = 1/MTTF, the rate of failures
     – Traditionally reported as failures per billion hours of operation
• Mean Time To Repair (MTTR) measures service interruption
  – Mean Time Between Failures (MTBF) = MTTF + MTTR
• Module availability measures service as it alternates between the two states of accomplishment and interruption (a number between 0 and 1, e.g. 0.9)
• Module availability = MTTF / (MTTF + MTTR)

The Principle of Locality
• The Principle of Locality: programs access a relatively small portion of the address space at any instant of time.
• Two different types of locality:
  – Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
  – Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)
• For the last 15 years, hardware has relied on locality for speed. Locality is a property of programs which is exploited in machine design.

Amdahl's Law
ExTime_new = ExTime_old x [(1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]

Speedup_overall = ExTime_old / ExTime_new
                = 1 / [(1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]

Best you could ever hope to do:
Speedup_maximum = 1 / (1 - Fraction_enhanced)

Memory Hierarchy - the Big Picture
• Problem: memory is too slow and too small
• Solution: memory hierarchy

  Level                               Speed (ns)          Size (bytes)
  Registers (in the datapath)         0.25-0.5            < 1K
  L1 on-chip / L2 off-chip cache      0.5-25              < 16M
  Main memory (DRAM)                  80-250              < 16G
  Secondary storage (disk)            5,000,000 (5 ms)    > 100G

Fundamental Cache Questions
• Q1: Where can a block be placed in the upper level? (Block placement)
• Q2: How is a block found if it is in the upper level? (Block identification)
• Q3: Which block should be replaced on a miss? (Block replacement)
• Q4: What happens on a write? (Write strategy)

Q1: Where can a block be placed in the upper level?
• Memory block 12 placed in an 8-block cache:
  – Fully associative: block 12 can go in any of the 8 cache blocks
  – Direct mapped: (12 mod 8) = 4, so only cache block 4
  – 2-way set associative: (12 mod 4) = 0, so either block of set 0
  – Set-associative mapping = (Block number) modulo (Number of sets)
(Figure: the 8-block cache under the three placement schemes, with memory blocks 0-31 below.)

Q2: How is a block found if it is in the upper level?
• The block offset selects the desired data from the block
  – Len(Block offset) = log2(Cache block size)
• The index selects the set
  – Number of sets = Number of cache blocks / Associativity (number of ways)
  – Len(Index) = log2(Number of sets)
• The stored tag is compared against the tag field of the address for a hit
  – Len(Tag) = Len(Memory address) - Len(Index) - Len(Block offset)
• There is a tag on each block, so only the tag needs to be checked; there is no need to compare the index or block offset
• Increasing associativity shrinks the index and expands the tag

  Address = | Tag | Index | Block offset |   (Tag and Index together form the block address)
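A minimal sketch of this address breakdown in C; the 32 KB / 64-byte / 4-way parameters are illustrative, not taken from the slides.

#include <stdio.h>

/* Integer log2 for power-of-two values. */
static unsigned log2u(unsigned x) {
    unsigned n = 0;
    while (x > 1) { x >>= 1; n++; }
    return n;
}

int main(void) {
    unsigned addr_bits  = 32;        /* memory address width */
    unsigned cache_size = 32 * 1024; /* illustrative: 32 KB cache */
    unsigned block_size = 64;        /* illustrative: 64-byte blocks */
    unsigned ways       = 4;         /* illustrative: 4-way set associative */

    unsigned num_blocks = cache_size / block_size;
    unsigned num_sets   = num_blocks / ways;

    unsigned offset_bits = log2u(block_size);   /* selects the byte within the block */
    unsigned index_bits  = log2u(num_sets);     /* selects the set */
    unsigned tag_bits    = addr_bits - index_bits - offset_bits;

    printf("offset=%u bits, index=%u bits, tag=%u bits\n",
           offset_bits, index_bits, tag_bits);  /* expect 6, 7, 19 */
    return 0;
}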
Q3: Which block should be replaced on a miss?
• Easy for direct mapped: there is only one candidate
• Set associative or fully associative:
  – Random
  – LRU (Least Recently Used)
• Miss rates by cache size, associativity, and replacement policy:

  Size      2-way LRU  2-way Ran  4-way LRU  4-way Ran  8-way LRU  8-way Ran
  16 KB     5.2%       5.7%       4.7%       5.3%       4.4%       5.0%
  64 KB     1.9%       2.0%       1.5%       1.7%       1.4%       1.5%
  256 KB    1.15%      1.17%      1.13%      1.13%      1.12%      1.12%

Q4: What happens on a write?

                                     Write-Through                      Write-Back
  Policy                             Data written to the cache block    Write data only to the cache;
                                     is also written to lower-level     update the lower level when the
                                     memory                             block falls out of the cache
  Debug                              Easy                               Hard
  Do read misses produce writes?     No                                 Yes
  Do repeated writes make it to
  the lower level?                   Yes                                No

• Additional option (on a write miss): let writes to an un-cached address allocate a new cache line ("write-allocate").

Cache Performance Measures
• Hit rate: fraction of accesses found in the cache
  – Usually so high that we instead talk about the miss rate = 1 - hit rate
• Hit time: time to access the cache
• Miss penalty: time to fetch a block from the lower level, including the time to deliver it to the CPU
  – Access time: time to access the lower level
  – Transfer time: time to transfer the block
• Average memory-access time (AMAT) = Hit time + Miss rate x Miss penalty (in ns or clock cycles)

Memory Hierarchy Basics
Six basic cache optimizations:
• Larger block size
  – Reduces compulsory misses
  – Increases capacity and conflict misses, increases miss penalty
• Larger total cache capacity to reduce miss rate
  – Increases hit time, increases power consumption
• Higher associativity
  – Reduces conflict misses
  – Increases hit time, increases power consumption
• Higher number of cache levels
  – Reduces overall memory access time
• Giving priority to read misses over writes
  – Reduces miss penalty
• Avoiding address translation in cache indexing
  – Reduces hit time

Memory Hierarchy Basics
Ten advanced cache optimizations:
• Small and simple first-level caches
• Way prediction
• Pipelined cache access
• Nonblocking caches
• Multibanked caches
• Critical word first, early restart
• Merging write buffer
• Compiler optimizations
• Hardware prefetching
• Compiler prefetching
See pp. 96 of the textbook for a summary.

Details of the Page Table
(Figure: a virtual address is split into a virtual page number and a 12-bit offset; the page number indexes the page table, whose entry holds a valid bit (V), access rights, and the physical frame number, which is concatenated with the offset to form the physical address. The Page Table Base Register points to the page table, which itself is located in physical memory.)
• The page table maps virtual page numbers to physical frames ("PTE" = Page Table Entry)
• Virtual memory => treat main memory as a cache for disk

The TLB caches page table entries
• Physical and virtual pages must be the same size!
• The TLB caches page table entries, so most translations avoid touching the in-memory page table
(Figure: the virtual page number is looked up in the TLB, tagged with an ASID; on a hit the physical frame number replaces the page number, and the page offset passes through unchanged.)
• MIPS handles TLB misses in software (random replacement); other machines use hardware
• Pages with V=0 either reside on disk or have not yet been allocated; the OS handles a reference to a V=0 page as a "page fault"
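A minimal sketch of the translation path just described, assuming 4 KB pages and a flat, illustrative page-table array; the structure and field names are made up for the example, and real MIPS page tables and TLBs differ in detail.

#include <stdint.h>
#include <stdio.h>

#define PAGE_BITS 12                /* 4 KB pages -> 12-bit offset, as in the slide */
#define NUM_PAGES 16                /* tiny illustrative address space */

/* Illustrative page table entry: valid bit plus physical frame number. */
typedef struct {
    int      valid;
    uint32_t frame;
} pte_t;

static pte_t page_table[NUM_PAGES] = {
    [0] = {1, 2}, [1] = {1, 0}, [2] = {1, 5}, [3] = {1, 3},  /* rest invalid (V=0) */
};

/* Translate a virtual address; returns 0 and sets *pa on success,
 * -1 on a V=0 entry (the OS would handle this as a page fault). */
static int translate(uint32_t va, uint32_t *pa) {
    uint32_t vpn    = va >> PAGE_BITS;             /* virtual page number */
    uint32_t offset = va & ((1u << PAGE_BITS) - 1);
    if (vpn >= NUM_PAGES || !page_table[vpn].valid)
        return -1;                                 /* page fault */
    *pa = (page_table[vpn].frame << PAGE_BITS) | offset;
    return 0;
}

int main(void) {
    uint32_t pa;
    if (translate(0x2123, &pa) == 0)               /* vpn 2, offset 0x123 */
        printf("VA 0x2123 -> PA 0x%x\n", (unsigned)pa);  /* frame 5 -> 0x5123 */
    if (translate(0x9000, &pa) != 0)
        printf("VA 0x9000 -> page fault (V=0)\n");
    return 0;
}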
Summary of Virtual Machine Monitors
• Virtual machine revival:
  – Overcome security flaws of modern OSes
  – Processor performance is no longer the highest priority
  – Manage software, manage hardware
• "… VMMs give OS developers another opportunity to develop functionality no longer practical in today's complex and ossified operating systems, where innovation moves at geologic pace." [Rosenblum and Garfinkel, 2005]
• Virtualization poses challenges for the processor, virtual memory, and I/O
  – Paravirtualization and ISA upgrades help cope with those difficulties
• Xen is an example VMM that uses paravirtualization
  – 2005 performance on non-I/O-bound, I/O-intensive apps: 80% of native Linux without a driver VM, 34% with a driver VM
• The Opteron memory hierarchy is still critical to performance

Visualizing Pipelining (Figure A.2, page A-8)
(Figure: over cycles 1-7, successive instructions each pass through Ifetch, Reg, ALU, DMem, and Reg write-back, offset by one cycle, so up to five instructions are in flight at once.)

Classic RISC Pipeline
Source: http://en.wikipedia.org/wiki/Classic_RISC_pipeline

Pipelining is not quite that easy!
• Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle
  – Structural hazards: the hardware cannot support this combination of instructions (a single person to fold and put clothes away)
    – Root cause: resource contention
  – Data hazards: an instruction depends on the result of a prior instruction still in the pipeline (missing sock)
    – Root cause: data dependence
  – Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)
    – Root cause: control transfer

One Memory Port / Structural Hazards (Figure A.4, page A-14)
(Figure: with a single memory port, a load's data-memory access in cycle 4 collides with the instruction fetch of the third instruction after it.)

One Memory Port / Structural Hazards (similar to Figure A.5, page A-15)
(Figure: the conflict is resolved by stalling the later instruction for one cycle, inserting a bubble into the pipeline.)

Data Dependence and Hazards
• InstrJ is data dependent (a.k.a. truly dependent) on InstrI if:
  1. InstrJ tries to read an operand before InstrI writes it
       I: add r1,r2,r3
       J: sub r4,r1,r3
  2. or InstrJ is data dependent on InstrK, which is dependent on InstrI
• If two instructions are data dependent, they cannot execute simultaneously or be completely overlapped
• Data dependence in an instruction sequence reflects data dependence in the source code; the effect of the original data dependence must be preserved
• If a data dependence causes a hazard in the pipeline, it is called a Read After Write (RAW) hazard
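A minimal sketch of checking for this kind of true dependence between two decoded instructions; the struct and its fields are illustrative, not a real MIPS decoder.

#include <stdbool.h>
#include <stdio.h>

/* Illustrative decoded-instruction record: one destination, two sources.
 * Register 0 is hard-wired to zero in MIPS, so it never creates a dependence. */
typedef struct {
    int dest;   /* register written, or -1 if none (e.g., stores, branches) */
    int src1;   /* registers read, or -1 if unused */
    int src2;
} instr_t;

/* True (RAW) dependence: a later instruction reads a register
 * that an earlier instruction writes. */
static bool raw_dependent(instr_t earlier, instr_t later) {
    if (earlier.dest <= 0)          /* no destination, or r0 */
        return false;
    return later.src1 == earlier.dest || later.src2 == earlier.dest;
}

int main(void) {
    instr_t i = { .dest = 1, .src1 = 2, .src2 = 3 };  /* add r1,r2,r3 */
    instr_t j = { .dest = 4, .src1 = 1, .src2 = 3 };  /* sub r4,r1,r3 */
    printf("J RAW-dependent on I? %s\n", raw_dependent(i, j) ? "yes" : "no");
    return 0;
}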
Name Dependence #1: Anti-dependence
• Name dependence: two instructions use the same register or memory location (a "name"), but there is no flow of data between the instructions associated with that name; there are two kinds of name dependence
• Anti-dependence: InstrJ writes an operand before InstrI reads it
    I: sub r4,r1,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7
  – Called an "anti-dependence" by compiler writers; it results from reuse of the name "r1"
• If an anti-dependence causes a hazard in the pipeline, it is called a Write After Read (WAR) hazard

Name Dependence #2: Output dependence
• Output dependence: InstrJ writes an operand before InstrI writes it
    I: sub r1,r4,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7
  – Called an "output dependence" by compiler writers; this also results from reuse of the name "r1"
• If an output dependence causes a hazard in the pipeline, it is called a Write After Write (WAW) hazard
• Instructions involved in a name dependence can execute simultaneously if the name used in the instructions is changed so the instructions do not conflict
  – Register renaming resolves name dependences for registers
  – Either by the compiler or by hardware

RAW, WAR, WAW
• Read After Write
• Write After Read
• Write After Write
• Condition for any of these hazards between two instructions:
  – the two instructions use at least one register in common, and
  – at least one of the two instructions writes that register

Forwarding to Avoid Data Hazard (Figure A.7, page A-19)
  add r1,r2,r3
  sub r4,r1,r3
  and r6,r1,r7
  or  r8,r1,r9
  xor r10,r1,r11
(Figure: the add's ALU result is forwarded from the EX/MEM and MEM/WB pipeline registers to the ALU inputs of the dependent instructions, so no stalls are needed.)

Forwarding to Avoid LW-SW Data Hazard (Figure A.8, page A-20)
  add r1,r2,r3
  lw  r4, 0(r1)
  sw  r4,12(r1)
  or  r8,r6,r9
  xor r10,r9,r11
(Figure: the loaded value held in the MEM/WB register is forwarded to the store's memory-access stage.)

Forwarding Schemes
• Goal: start the work earlier
• Forwarding 1: EX/MEM pipeline register => input of the ALU
• Forwarding 2: MEM/WB pipeline register => input of the ALU
• Forwarding 3 (special case): register file => register file
  – Works because writing a register is done in the first half of the clock cycle, and reading the same register is done in the second half
• Forwarding 4 (for the LW-SW data hazard): MEM/WB pipeline register => input of the memory-access stage
  – Data loaded from memory is forwarded to the store that writes it back to memory

HW Change for Forwarding (Figure A.23, page A-37)
(Figure: multiplexers at the ALU inputs choose among the ID/EX register values and the results held in the EX/MEM and MEM/WB pipeline registers; the immediate and next-PC paths are unchanged.)
What circuit detects and resolves this hazard?

Data Hazard Even with Forwarding (Figure A.9, page A-21)
  lw  r1, 0(r2)
  sub r4,r1,r6
  and r6,r1,r7
  or  r8,r1,r9
(Figure: the load's data is not available until the end of its MEM stage, which is too late to forward to the EX stage of the immediately following sub.)

Data Hazard Even with Forwarding (similar to Figure A.10, page A-21)
(Figure: the pipeline inserts one bubble so the sub's EX stage lines up with the cycle after the load's MEM stage; the and and or are delayed one cycle as well.)
How is this detected?
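A minimal sketch of the control logic behind both questions ("What circuit detects and resolves this hazard?" and "How is this detected?"), using the standard textbook conditions; the pipeline-register field names are illustrative.

#include <stdbool.h>
#include <stdio.h>

/* Where an ALU source operand comes from:
 * the register file (via ID/EX), the EX/MEM result, or the MEM/WB result. */
enum fwd { FROM_ID_EX, FROM_EX_MEM, FROM_MEM_WB };

/* Forwarding unit for one ALU input (register number src).
 * The regwrite/rd arguments describe the two older instructions still
 * in the pipeline; r0 never forwards. */
static enum fwd forward_select(int src,
                               bool exmem_regwrite, int exmem_rd,
                               bool memwb_regwrite, int memwb_rd) {
    if (exmem_regwrite && exmem_rd != 0 && exmem_rd == src)
        return FROM_EX_MEM;                 /* most recent result wins */
    if (memwb_regwrite && memwb_rd != 0 && memwb_rd == src)
        return FROM_MEM_WB;
    return FROM_ID_EX;
}

/* Load-use interlock: if the instruction now in EX is a load whose
 * destination is read by the instruction in decode, stall one cycle
 * (insert a bubble); forwarding alone cannot cover this case. */
static bool needs_load_use_stall(bool idex_memread, int idex_rd,
                                 int ifid_rs, int ifid_rt) {
    return idex_memread && idex_rd != 0 &&
           (idex_rd == ifid_rs || idex_rd == ifid_rt);
}

int main(void) {
    /* add r1,r2,r3 in MEM, sub r4,r1,r3 in EX: forward r1 from EX/MEM. */
    printf("sub's r1 source: %d (1 = EX/MEM)\n",
           forward_select(1, true, 1, false, 0));

    /* lw r1,0(r2) in EX, sub r4,r1,r6 in decode: must stall one cycle. */
    printf("load-use stall needed: %s\n",
           needs_load_use_stall(true, 1, 1, 6) ? "yes" : "no");
    return 0;
}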
Software Scheduling to Avoid Load Hazards
Try producing fast code for
  a = b + c;
  d = e - f;
assuming a, b, c, d, e, and f are in memory.

Slow code:            Fast code:
  LW  Rb,b              LW  Rb,b
  LW  Rc,c              LW  Rc,c
  ADD Ra,Rb,Rc          LW  Re,e
  SW  a,Ra              ADD Ra,Rb,Rc
  LW  Re,e              LW  Rf,f
  LW  Rf,f              SW  a,Ra
  SUB Rd,Re,Rf          SUB Rd,Re,Rf
  SW  d,Rd              SW  d,Rd

The compiler optimizes for performance; the hardware checks for safety.

Control Hazard on Branches: Three-Stage Stall
  10: beq r1,r3,36
  14: and r2,r3,r5
  18: or  r6,r1,r7
  22: add r8,r1,r9
  ...
  36: xor r10,r1,r11
(Figure: the branch is not resolved until three cycles after the beq is fetched, so three instructions enter the pipeline before the direction is known.)
What do you do with the 3 instructions in between? How do you do it? Where is the "commit"?

5 Steps of the MIPS Datapath (Figure A.3, page A-9)
(Figure: the five stages - Instruction Fetch, Instruction Decode / Register Fetch, Execute / Address Calculation, Memory Access, Write Back - separated by the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers, with the next-PC adder, register file, sign extender, ALU, and data memory.)
• Data stationary control: local decode for each instruction phase / pipeline stage

Pipelined MIPS Datapath (Figure A.24, page A-38)
(Figure: the pipelined MIPS datapath, with the next-PC adders and zero test, register file, sign extender, ALU, data memory, and the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers.)
• Interplay of instruction set design and cycle time.

Four Branch Hazard Alternatives
#1: Stall until the branch direction is clear
#2: Predict branch not taken
  – Execute successor instructions in sequence
  – "Squash" instructions in the pipeline if the branch is actually taken
  – Takes advantage of the late pipeline state update
  – 47% of MIPS branches are not taken on average
  – PC+4 is already calculated, so use it to fetch the next instruction
#3: Predict branch taken
  – 53% of MIPS branches are taken on average
  – But the branch target address has not yet been calculated in MIPS
    » MIPS still incurs a 1-cycle branch penalty
    » Other machines: branch target known before the outcome
#4: Delayed branch
  – Define the branch to take place AFTER the following instructions:
      branch instruction
      sequential successor 1
      sequential successor 2
      ........
      sequential successor n
      branch target if taken
    (a branch delay of length n)
  – A 1-slot delay allows a proper decision and the branch-target address in the 5-stage pipeline
  – MIPS uses this
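When the compiler fills the delay slot from the instruction before the branch (case A in the figure that follows), the move is legal only if the branch does not read the register that instruction writes. A minimal sketch of that check, with illustrative register fields.

#include <stdbool.h>
#include <stdio.h>

/* Case A of delay-slot scheduling: hoist the instruction that precedes the
 * branch into the delay slot. It always executes anyway, so the only concern
 * is that the branch condition must not read the register it writes
 * (otherwise the branch would see a stale value). */
static bool can_fill_from_before(int prev_dest,    /* reg written before the branch */
                                 int branch_src1,  /* regs the branch reads */
                                 int branch_src2) {
    return prev_dest != branch_src1 && prev_dest != branch_src2;
}

int main(void) {
    /* add $1,$2,$3 ; if $2=0 then ...  -> branch reads $2 and $0; add writes $1: OK */
    printf("add $1,$2,$3 before 'if $2=0': %s\n",
           can_fill_from_before(1, 2, 0) ? "can fill" : "cannot fill");

    /* add $1,$2,$3 ; if $1=0 then ...  -> branch reads $1, which add writes: not OK */
    printf("add $1,$2,$3 before 'if $1=0': %s\n",
           can_fill_from_before(1, 1, 0) ? "can fill" : "cannot fill");
    return 0;
}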
Scheduling Branch Delay Slots (Figure A.14)
A. From before the branch:
     add $1,$2,$3
     if $2=0 then
       [delay slot]
   becomes
     if $2=0 then
       add $1,$2,$3

B. From the branch target:
     sub $4,$5,$6
     ...
     add $1,$2,$3
     if $1=0 then
       [delay slot]
   becomes
     add $1,$2,$3
     if $1=0 then
       sub $4,$5,$6

C. From the fall-through path:
     add $1,$2,$3
     if $1=0 then
       [delay slot]
     sub $4,$5,$6
     OR  $7,$8,$9
   becomes
     add $1,$2,$3
     if $1=0 then
       sub $4,$5,$6
     OR  $7,$8,$9

• A is the best choice: it always fills the delay slot with useful work
• In B, the sub instruction may need to be copied, increasing the instruction count
• In B and C, it must be okay to execute the sub when the branch goes the unexpected way

Speedup Equation for Pipelining
CPI_pipelined = Ideal CPI + Average stall cycles per instruction

Speedup = [(Ideal CPI x Pipeline depth) / (Ideal CPI + Pipeline stall CPI)]
          x (Cycle time_unpipelined / Cycle time_pipelined)

For the simple RISC pipeline, Ideal CPI = 1:

Speedup = [Pipeline depth / (1 + Pipeline stall CPI)]
          x (Cycle time_unpipelined / Cycle time_pipelined)

Evaluating Branch Alternatives
Pipeline speedup = Pipeline depth / (1 + Branch frequency x Branch penalty)

Assume 4% unconditional branches, 6% conditional branches untaken, and 10% conditional branches taken:

  Scheduling scheme    Branch penalty   CPI    Speedup vs. unpipelined   Speedup vs. stall
  Stall pipeline       3                1.60   3.1                       1.0
  Predict taken        1                1.20   4.2                       1.33
  Predict not taken    1                1.14   4.4                       1.40
  Delayed branch       0.5              1.10   4.5                       1.45
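A small worked check of the table above: assuming a pipeline depth of 5, it recomputes the CPI and speedup columns from the stated branch frequencies (under predict-not-taken, only the 14% of branches that are actually taken, conditional or unconditional, pay the 1-cycle penalty).

#include <stdio.h>

int main(void) {
    const double depth = 5.0;               /* 5-stage pipeline */
    const double f_uncond = 0.04;           /* branch mix from the slide */
    const double f_cond_untaken = 0.06;
    const double f_cond_taken = 0.10;
    const double f_all = f_uncond + f_cond_untaken + f_cond_taken;   /* 0.20 */

    /* Stall cycles per instruction for each scheme:
     * - stall: every branch pays 3 cycles
     * - predict taken: every branch still pays 1 (target not known early in MIPS)
     * - predict not taken: only branches that actually go somewhere pay 1
     * - delayed branch: effective penalty of 0.5 per branch */
    double stalls[4] = {
        f_all * 3.0,
        f_all * 1.0,
        (f_uncond + f_cond_taken) * 1.0,
        f_all * 0.5
    };
    const char *names[4] = { "Stall pipeline", "Predict taken",
                             "Predict not taken", "Delayed branch" };

    double cpi_stall = 1.0 + stalls[0];
    for (int i = 0; i < 4; i++) {
        double cpi = 1.0 + stalls[i];       /* ideal CPI of 1 plus branch stalls */
        printf("%-18s CPI=%.2f  vs. unpipelined=%.1f  vs. stall=%.2f\n",
               names[i], cpi, depth / cpi, cpi_stall / cpi);
    }
    return 0;
}

Running this reproduces the CPI column (1.60, 1.20, 1.14, 1.10) and both speedup columns of the table.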