Lecture 1: Cost/Performance, DLX, Pipelining
Prof. Fred Chong, ECS 250A Computer Architecture, UC Davis Dept. of Computer Science, Winter 1999
(Adapted from Patterson CS252, Copyright 1998 UCB)
FTC.W99

Computer Architecture Is ...
"... the attributes of a [computing] system as seen by the programmer, i.e., the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls, the logic design, and the physical implementation."
— Amdahl, Blaauw, and Brooks, 1964

Computer Architecture's Changing Definition
• 1950s to 1960s: Computer Architecture Course = Computer Arithmetic
• 1970s to mid-1980s: Computer Architecture Course = Instruction Set Design, especially an ISA appropriate for compilers
• 1990s: Computer Architecture Course = Design of CPU, memory system, I/O system, multiprocessors

Computer Architecture Topics
• Input/Output and Storage: disks, WORM, tape, RAID, emerging technologies
• Memory Hierarchy: L1 cache, L2 cache, DRAM, interleaving, bus protocols; coherence, bandwidth, latency
• Instruction Set Architecture: addressing, protection, exception handling
• Pipelining and Instruction-Level Parallelism: pipelining, hazard resolution, superscalar, reordering, prediction, speculation, vector, DSP
• VLSI

Computer Architecture Topics (continued)
• Multiprocessors: shared memory, message passing, data parallelism; processor-memory-switch organization
• Networks and Interconnections: network interfaces, topologies, routing, bandwidth, latency, reliability
[Figure: processors (P) and memories (M) connected through switches (S) by an interconnection network]

ECS 250A Course Focus
Understanding the design techniques, machine structures, technology factors, and evaluation methods that will determine the form of computers in the 21st century. Surrounding influences: technology, parallelism, programming languages, applications, operating systems, measurement & evaluation, interface design (ISA), history.
Computer Architecture = Instruction Set Design + Organization + Hardware

Topic Coverage
Textbook: Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 2nd Ed., 1996.
• Performance/Cost, DLX, Pipelining, Caches, Branch Prediction
• ILP, Loop Unrolling, Scoreboarding, Tomasulo, Dynamic Branch Prediction
• Trace Scheduling, Speculation
• Vector Processors, DSPs
• Memory Hierarchy
• I/O
• Interconnection Networks
• Multiprocessors

ECS 250A: Staff
Instructor: Fred Chong, Office: EUII-3031, chong@cs, Office Hours: Mon 4-6pm or by appt.
TA: Diana Keen, Office: EUII-2239, keend@cs, TA Office Hours: Fri 1-3pm
Class: Mon 6:10-9pm
Text: Computer Architecture: A Quantitative Approach, Second Edition (1996)
Web page: http://arch.cs.ucdavis.edu/~chong/250A/ (lectures available online before 1pm the day of lecture)
Newsgroup: ucd.class.cs250a{.d}

Grading
• Problem Sets: 35%
• One in-class exam (prelim simulation): 20%
• Project Proposals and Drafts: 10%
• Project Final Report: 25%
• Project Poster Session (CS colloquium): 10%

Assignments
• Read Ch. 1-3
• Problem Set 1, due Mon 1/25/99 (alone or in pairs)
• Project Proposals, due Mon 1/25/99
  – groups of 2 or 3
  – see web page and links
  – e-mail to me and cc: Diana about ideas, due Mon 1/18/99; pick 3 research papers

VLSI Transistors / CMOS Inverter / CMOS NAND Gate
[Figures: transistor symbols with gate G and terminals A, B; CMOS inverter with In/Out; CMOS NAND gate with inputs A, B and output C]

Integrated Circuit Costs
IC cost = (Die cost + Testing cost + Packaging cost) / Final test yield
Die cost = Wafer cost / (Dies per wafer × Die yield)
Dies per wafer = [π × (Wafer diameter / 2)² / Die area] − [π × Wafer diameter / √(2 × Die area)] − Test dies
Die yield = Wafer yield × (1 + (Defects per unit area × Die area) / α)^(−α)
Die cost grows roughly with (die area)⁴.

Real World Examples
Chip         Metal layers  Line width  Wafer cost  Defects/cm²  Area (mm²)  Dies/wafer  Yield  Die cost
386DX        2             0.90        $900        1.0          43          360         71%    $4
486DX2       3             0.80        $1200       1.0          81          181         54%    $12
PowerPC 601  4             0.80        $1700       1.3          121         115         28%    $53
HP PA 7100   3             0.80        $1300       1.0          196         66          27%    $73
DEC Alpha    3             0.70        $1500       1.2          234         53          19%    $149
SuperSPARC   3             0.70        $1700       1.6          256         48          13%    $272
Pentium      3             0.80        $1500       1.5          296         40          9%     $417
– From "Estimating IC Manufacturing Costs," by Linley Gwennap, Microprocessor Report, August 2, 1993, p. 15

Cost/Performance: What Is the Relationship of Cost to Price?
• Component costs
• Direct costs (add 25% to 40%): recurring costs — labor, purchasing, scrap, warranty
• Gross margin (add 82% to 186%): nonrecurring costs — R&D, marketing, sales, equipment maintenance, rental, financing cost, pretax profits, taxes
• Average discount to get list price (add 33% to 66%): volume discounts and/or retailer markup
As fractions of list price: average discount 25% to 40%, gross margin 34% to 39%, direct cost 6% to 8%, component cost 15% to 33%.

Chip Prices (August 1993), assuming purchase of 10,000 units
Chip         Area (mm²)  Mfg. cost  Price   Multiplier  Comment
386DX        43          $9         $31     3.4         Intense competition
486DX2       81          $35        $245    7.0         No competition
PowerPC 601  121         $77        $280    3.6
DEC Alpha    234         $202       $1231   6.1         Recoup R&D?
Pentium      296         $473       $965    2.0         Early in shipments

Summary: Price vs. Cost
[Figure: stacked bars breaking list price into average discount, gross margin, direct costs, and component costs, as percentages and as cost multipliers, for minicomputers, workstations, and PCs]

Technology Trends: Microprocessor Capacity
[Figure: transistors per microprocessor vs. year, 1970-2000, tracking Moore's Law: i4004, i8080, i8086, i80286, i80386, i80486, Pentium; Alpha 21264: 15 million; Alpha 21164: 9.3 million; PowerPC 620: 6.9 million; Pentium Pro: 5.5 million; Sparc Ultra: 5.2 million]
CMOS improvements: die size 2X every 3 years; line width halves every 7 years.

Memory Capacity (Single-Chip DRAM)
Year  Size (Mb)  Cycle time
1980  0.0625     250 ns
1983  0.25       220 ns
1986  1          190 ns
1989  4          165 ns
1992  16         145 ns
1996  64         120 ns
2000  256        100 ns

Technology Trends (Summary)
       Capacity       Speed (latency)
Logic  2x in 3 years  2x in 3 years
DRAM   4x in 3 years  2x in 10 years
Disk   4x in 3 years  2x in 10 years

Processor Performance Trends
[Figure: performance vs. year, 1965-2000, for supercomputers, mainframes, minicomputers, and microprocessors]

Processor Performance (1.35X per year before, 1.55X now)
[Figure: SPEC performance 1987-97, growing about 1.54X per year, from machines such as the Sun-4/260, MIPS M/120, MIPS M/2000, IBM RS/6000, HP 9000/750, DEC AXP/500, IBM POWER 100, up through the DEC Alpha 4/266, 5/300, 5/500, and 21264/600]

Performance Trends (Summary)
• Workstation performance (measured in SPECmarks) improves roughly 50% per year (2X every 18 months)
• Improvement in cost-performance estimated at 70% per year

Computer Engineering Methodology
An iterative cycle: technology trends → evaluate existing systems for bottlenecks (benchmarks) → simulate new designs and organizations (workloads) → implement the next-generation system (implementation complexity) → back to evaluation.

Measurement Tools
• Benchmarks, traces, mixes
• Hardware: cost, delay, area, power estimation
• Simulation (many levels): ISA, RT, gate, circuit
• Queuing theory
• Rules of thumb
• Fundamental "laws"/principles

The Bottom Line: Performance (and Cost)
Plane             DC to Paris  Speed     Passengers  Throughput (pmph)
Boeing 747        6.5 hours    610 mph   470         286,700
BAC/Sud Concorde  3 hours      1350 mph  132         178,200
• Time to run the task (ExTime): execution time, response time, latency
• Tasks per day, hour, week, sec, ns, ... (Performance): throughput, bandwidth

The Bottom Line: Performance (and Cost)
"X is n times faster than Y" means:
ExTime(Y) / ExTime(X) = Performance(X) / Performance(Y)
• Speed of Concorde vs. Boeing 747
• Throughput of Boeing 747 vs. Concorde

Amdahl's Law
Speedup due to enhancement E:
Speedup(E) = ExTime without E / ExTime with E = Performance with E / Performance without E
Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected.

Amdahl's Law
ExTime_new = ExTime_old × [(1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]
Speedup_overall = ExTime_old / ExTime_new = 1 / [(1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]

Amdahl's Law Example
• Floating-point instructions improved to run 2X faster, but only 10% of actual instructions are FP:
ExTime_new = ExTime_old × (0.9 + 0.1/2) = 0.95 × ExTime_old
Speedup_overall = 1 / 0.95 = 1.053

Metrics of Performance
Each level of the system has its own metrics:
• Application: answers per month, operations per second
• Programming language / compiler
• ISA: millions of instructions per second (MIPS), millions of FP operations per second (MFLOP/s)
• Datapath, control, function units, transistors, wires, pins: megabytes per second, cycles per second (clock rate)

Aspects of CPU Performance
CPU time = Seconds/Program = (Instructions/Program) × (Cycles/Instruction) × (Seconds/Cycle)
Which factors affect which component:
              Inst Count  CPI  Clock Rate
Program       X
Compiler      X           (X)
Inst. Set     X           X
Organization              X    X
Technology                     X

Cycles Per Instruction
"Average cycles per instruction":
CPI = (CPU time × Clock rate) / Instruction count = Cycles / Instruction count
CPU time = CycleTime × Σ_{i=1..n} CPI_i × I_i
"Instruction frequency":
CPI = Σ_{i=1..n} CPI_i × F_i, where F_i = I_i / Instruction count
Invest resources where time is spent!
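Amdahl's Law as stated above is easy to check numerically. A minimal sketch (the function name is illustrative, not from the lecture), reproducing the slide's floating-point example:

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when fraction_enhanced of the task
    is accelerated by speedup_enhanced (Amdahl's Law)."""
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)

# FP example from the slide: 10% of instructions run 2x faster.
print(round(amdahl_speedup(0.10, 2.0), 3))  # 1.053
```

Note the law's punchline: even an infinite speedup on 10% of the work caps the overall speedup at 1/(1 − 0.10) ≈ 1.11 — hence "invest resources where time is spent."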
Example: Calculating CPI
Base machine (reg/reg), typical mix:
Op      Freq  Cycles  CPI(i)  (% Time)
ALU     50%   1       0.5     (33%)
Load    20%   2       0.4     (27%)
Store   10%   2       0.2     (13%)
Branch  20%   2       0.4     (27%)
Total CPI = 1.5

SPEC: System Performance Evaluation Cooperative
• First round, 1989: 10 programs yielding a single number ("SPECmarks")
• Second round, 1992: SPECint92 (6 integer programs) and SPECfp92 (14 floating-point programs)
  – Compiler flags unlimited. Flags for the DEC 4000 Model 610, March 93:
    spice: unix.c:/def=(sysv,has_bcopy,"bcopy(a,b,c)=memcpy(b,a,c)"
    wave5: /ali=(all,dcom=nat)/ag=a/ur=4/ur=200
    nasa7: /norecu/ag=a/ur=4/ur2=200/lc=blas
• Third round, 1995: new set of programs, SPECint95 (8 integer programs) and SPECfp95 (10 floating-point)
  – "benchmarks useful for 3 years"
  – single flag setting for all programs: SPECint_base95, SPECfp_base95

How to Summarize Performance
• Arithmetic mean (weighted arithmetic mean) tracks execution time: Σ(Ti)/n or Σ(Wi×Ti)
• Harmonic mean (weighted harmonic mean) of rates (e.g., MFLOPS) also tracks execution time: n/Σ(1/Ri) or n/Σ(Wi/Ri)
• Normalized execution time is handy for scaling performance (e.g., "X times faster than a SPARCstation 10")
• But do not take the arithmetic mean of normalized execution times; use the geometric mean: (Π x_i)^(1/n)

SPEC First Round
• One program spent 99% of its time in a single line of code
• A new front-end compiler could improve results dramatically
[Figure: SPEC performance for gcc, espresso, spice, doduc, nasa7, li, eqntott, matrix300, fpppp, tomcatv, with one outlier benchmark far above the rest]

Impact of Means on SPECmark89 for IBM 550
(Ratio to VAX, before and after a compiler change)
Program    Ratio B/A   Time B/A    Weighted Time B/A
gcc        30 / 29     49 / 51     8.91 / 9.22
espresso   35 / 34     65 / 67     7.64 / 7.86
spice      47 / 47     510 / 510   5.69 / 5.69
doduc      46 / 49     41 / 38     5.81 / 5.45
nasa7      78 / 144    258 / 140   3.43 / 1.86
li         34 / 34     183 / 183   7.86 / 7.86
eqntott    40 / 40     28 / 28     6.68 / 6.68
matrix300  78 / 730    58 / 6      3.43 / 0.37
fpppp      90 / 87     34 / 35     2.97 / 3.07
tomcatv    33 / 138    20 / 19     2.01 / 1.94
Mean       54 / 72     124 / 108   54.42 / 49.99
           (geometric, ratio 1.33)  (arithmetic, ratio 1.16)  (weighted arith., ratio 1.09)

Performance Evaluation
• "For better or worse, benchmarks shape a field"
• Good products are created when you have:
  – good benchmarks
  – good ways to summarize performance
• Since sales are partly a function of performance relative to the competition, companies invest in improving the product as reported by the performance summary
• If the benchmarks/summary are inadequate, companies must choose between improving the product for real programs and improving it to get more sales; sales almost always win!
• Execution time is the measure of computer performance!

Instruction Set Architecture (ISA)
The ISA is the interface between software and hardware: software targets the instruction set; hardware implements it.

Interface Design
A good interface:
• lasts through many implementations (portability, compatibility)
• is used in many different ways (generality)
• provides convenient functionality to higher levels
• permits an efficient implementation at lower levels
[Figure: one interface serving many uses over time, across implementations 1, 2, and 3]

Evolution of Instruction Sets
• Single accumulator (EDSAC, 1950)
• Accumulator + index registers (Manchester Mark I, IBM 700 series, 1953)
• Separation of programming model from implementation:
  – high-level-language based (B5000, 1963)
  – concept of a family (IBM 360, 1964)
• General-purpose register machines:
  – complex instruction sets (VAX, Intel 432, 1977-80)
  – load/store architecture (CDC 6600, Cray-1, 1963-76)
• RISC (MIPS, SPARC, HP-PA, IBM RS6000, ..., 1987)

Evolution of Instruction Sets (continued)
• Major advances in computer architecture are typically associated with landmark instruction set designs
  – Ex: stack vs. GPR (System 360)
• Design decisions must take into account: technology, machine organization, programming languages, compiler technology, operating systems
• And they in turn influence these

A "Typical" RISC
• 32-bit fixed-format instructions (3 formats)
• 32 32-bit GPRs (R0 contains zero; double precision takes a register pair)
• 3-address, reg-reg arithmetic instructions
• Single address mode for load/store: base + displacement (no indirection)
• Simple branch conditions
• Delayed branch
See: SPARC, MIPS, HP PA-RISC, DEC Alpha, IBM PowerPC, CDC 6600, CDC 7600, Cray-1, Cray-2, Cray-3

Example: MIPS Instruction Formats
Register-Register: Op (31-26) | Rs1 (25-21) | Rs2 (20-16) | Rd (15-11) | Opx (10-0)
Register-Immediate: Op (31-26) | Rs1 (25-21) | Rd (20-16) | immediate (15-0)
Branch: Op (31-26) | Rs1 (25-21) | Rs2/Opx (20-16) | immediate (15-0)
Jump / Call: Op (31-26) | target (25-0)

Summary #1
• Designing to last through trends:
         Capacity       Speed (latency)
  Logic  2x in 3 years  2x in 3 years
  DRAM   4x in 3 years  2x in 10 years
  Disk   4x in 3 years  2x in 10 years
  6 years to graduate => 16X CPU speed and DRAM/disk size
• Time to run the task: execution time, response time, latency
• Tasks per day, hour, week, sec, ns, ...: throughput, bandwidth
• "X is n times faster than Y" means ExTime(Y)/ExTime(X) = Performance(X)/Performance(Y)

Summary #2
• Amdahl's Law: Speedup_overall = ExTime_old/ExTime_new = 1 / [(1 − Fraction_enhanced) + Fraction_enhanced/Speedup_enhanced]
• CPI Law: CPU time = Seconds/Program = (Instructions/Program) × (Cycles/Instruction) × (Seconds/Cycle)
• Execution time is the REAL measure of computer performance!
• Good products are created when you have good benchmarks and good ways to summarize performance
• Die cost grows roughly with (die area)⁴
• Can the PC industry support engineering/research investment?

Pipelining: It's Natural!
• Laundry example: Ann, Brian, Cathy, and Dave each have one load of clothes to wash, dry, and fold
• Washer takes 30 minutes
• Dryer takes 40 minutes
• "Folder" takes 20 minutes

Sequential Laundry
[Figure: sequential laundry timeline from 6 PM to midnight; each load runs 30 + 40 + 20 minutes back to back]
• Sequential laundry takes 6 hours for 4 loads
• If they learned pipelining, how long would laundry take?

Pipelined Laundry: Start Work ASAP
[Figure: pipelined laundry timeline; loads overlap, paced by the 40-minute dryer stage]
• Pipelined laundry takes 3.5 hours for 4 loads

Pipelining Lessons
• Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
• Pipeline rate is limited by the slowest pipeline stage
• Multiple tasks operate simultaneously
• Potential speedup = number of pipe stages
• Unbalanced lengths of pipe stages reduce speedup
• Time to "fill" the pipeline and time to "drain" it reduce speedup

Computer Pipelines
• Execute billions of instructions, so throughput is what matters
• DLX desirable features: all instructions the same length; registers located in the same place in the instruction format; memory operands only in loads and stores

5 Steps of DLX Datapath (Figure 3.1, page 130)
Instruction Fetch → Instr. Decode / Reg. Fetch → Execute / Addr. Calc → Memory Access → Write Back

Pipelined DLX Datapath (Figure 3.4, page 137)
• Data stationary control: local decode for each instruction phase / pipeline stage

Visualizing Pipelining (Figure 3.3, page 133)
[Figure: instructions in order (vertical) vs. clock cycles (horizontal), with the IF/ID/EX/MEM/WB stages of successive instructions overlapped]

It's Not That Easy for Computers
• Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle
  – Structural hazards: hardware cannot support this combination of instructions (a single person to fold and put clothes away)
  – Data hazards: an instruction depends on the result of a prior instruction still in the pipeline (missing sock)
  – Control hazards: branches and other instructions that change the PC stall the pipeline until the hazard is resolved, leaving "bubbles" in the pipeline

One Memory Port / Structural Hazards (Figures 3.6 and 3.7, pages 142-143)
[Figure: a load and a later instruction fetch both need the single memory port in the same cycle, so the later instruction must stall]

Speedup Equation for Pipelining
CPI_pipelined = Ideal CPI + pipeline stall clock cycles per instruction
Speedup = [Ideal CPI × Pipeline depth / (Ideal CPI + Pipeline stall CPI)] × (Clock cycle_unpipelined / Clock cycle_pipelined)
If ideal CPI = 1:
Speedup = [Pipeline depth / (1 + Pipeline stall CPI)] × (Clock cycle_unpipelined / Clock cycle_pipelined)

Example: Dual-Port vs. Single-Port Memory
• Machine A: dual-ported memory
• Machine B: single-ported memory, but its pipelined implementation has a 1.05 times faster clock rate
• Ideal CPI = 1 for both; loads are 40% of instructions executed
SpeedupA = Pipeline depth / (1 + 0) × (clock_unpipe/clock_pipe) = Pipeline depth
SpeedupB = Pipeline depth / (1 + 0.4 × 1) × (clock_unpipe / (clock_unpipe / 1.05)) = (Pipeline depth / 1.4) × 1.05 = 0.75 × Pipeline depth
SpeedupA / SpeedupB = Pipeline depth / (0.75 × Pipeline depth) = 1.33
• Machine A is 1.33 times faster

Data Hazard on R1 (Figure 3.9, page 147)
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or  r8,r1,r9
xor r10,r1,r11
[Figure: each later instruction reads r1 in ID/RF before add writes it back in WB]

Three Generic Data Hazards (InstrI followed by InstrJ)
• Read After Write (RAW): InstrJ tries to read an operand before InstrI writes it
• Write After Read (WAR): InstrJ tries to write an operand before InstrI reads it — InstrI gets the wrong operand
  – Can't happen in the DLX 5-stage pipeline because all instructions take 5 stages, reads are always in stage 2, and writes are always in stage 5
• Write After Write (WAW): InstrJ tries to write an operand before InstrI writes it — leaves the wrong result (InstrI's, not InstrJ's)
  – Can't happen in the DLX 5-stage pipeline because all instructions take 5 stages and writes are always in stage 5
• We will see WAR and WAW in later, more complicated pipes

Forwarding to Avoid Data Hazard (Figure 3.10, page 149); HW Change for Forwarding (Figure 3.20, page 161)
[Figures: for the add/sub/and/or/xor sequence above, forwarding paths from the EX/MEM and MEM/WB pipeline registers back to the ALU inputs deliver r1 in time]

Data Hazard Even with Forwarding (Figures 3.12 and 3.13, pages 153-154)
lw  r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or  r8,r1,r9
[Figure: the load's result is not available until after MEM, so sub must stall one cycle even with forwarding]

Software Scheduling to Avoid Load Hazards
Try producing fast code for
  a = b + c;
  d = e - f;
assuming a, b, c, d, e, and f are in memory.

Slow code:         Fast code:
LW  Rb,b           LW  Rb,b
LW  Rc,c           LW  Rc,c
ADD Ra,Rb,Rc       LW  Re,e
SW  a,Ra           ADD Ra,Rb,Rc
LW  Re,e           LW  Rf,f
LW  Rf,f           SW  a,Ra
SUB Rd,Re,Rf       SUB Rd,Re,Rf
SW  d,Rd           SW  d,Rd

Control Hazard on Branches: Three-Stage Stall
[Figure: a taken branch creates three bubble cycles before the target instruction can be fetched]

Branch Stall Impact
• If CPI = 1, 30% of instructions are branches, and each stalls 3 cycles => new CPI = 1.9!
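That 1.9 figure is just the stall term of the pipelining speedup equation applied to branches; a quick arithmetic check (the function name is illustrative):

```python
def stalled_cpi(base_cpi, branch_freq, stall_cycles):
    """CPI including the stall cycles that branches add:
    CPI = ideal CPI + (branch frequency x stall cycles per branch)."""
    return base_cpi + branch_freq * stall_cycles

# Slide's example: ideal CPI of 1, 30% branches, 3-cycle stall each.
print(round(stalled_cpi(1.0, 0.30, 3), 2))  # 1.9
```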
• Two-part solution:
  – determine whether the branch is taken or not sooner, AND
  – compute the taken-branch address earlier
• DLX branches test whether a register = 0 or ≠ 0
• DLX solution:
  – move the zero test to the ID/RF stage
  – add an adder to calculate the new PC in the ID/RF stage
  – 1 clock cycle penalty for branches instead of 3

Pipelined DLX Datapath (Figure 3.22, page 163)
This is the correct 1-cycle-latency implementation!

Four Branch Hazard Alternatives
#1: Stall until branch direction is clear
#2: Predict branch not taken
  – execute successor instructions in sequence
  – "squash" instructions in the pipeline if the branch is actually taken
  – advantage of late pipeline state update
  – 47% of DLX branches are not taken on average
  – PC+4 is already calculated, so use it to fetch the next instruction
#3: Predict branch taken
  – 53% of DLX branches are taken on average
  – but DLX hasn't calculated the branch target address yet, so DLX still incurs a 1-cycle branch penalty
  – other machines: branch target known before outcome
#4: Delayed branch
  – define the branch to take place AFTER a following instruction:
      branch instruction
      sequential successor_1
      sequential successor_2
      ...
      sequential successor_n   <- branch delay of length n
      branch target if taken
  – a 1-slot delay allows a proper decision and branch target address in a 5-stage pipeline; DLX uses this

Delayed Branch
• Where to get instructions to fill the branch delay slot?
  – from before the branch instruction
  – from the target address: only valuable when the branch is taken
  – from the fall-through: only valuable when the branch is not taken
  – cancelling branches allow more slots to be filled
• Compiler effectiveness for a single branch delay slot:
  – fills about 60% of branch delay slots
  – about 80% of instructions executed in branch delay slots are useful in computation
  – so about 50% (60% × 80%) of slots are usefully filled
• Delayed branch downside: 7-8 stage pipelines and multiple instructions issued per clock (superscalar)

Evaluating Branch Alternatives
Pipeline speedup = Pipeline depth / (1 + Branch frequency × Branch penalty)
Scheduling scheme   Branch penalty  CPI   Speedup v. unpipelined  Speedup v. stall
Stall pipeline      3               1.42  3.5                     1.0
Predict taken       1               1.14  4.4                     1.26
Predict not taken   1               1.09  4.5                     1.29
Delayed branch      0.5             1.07  4.6                     1.31
Conditional & unconditional branches = 14% of instructions; 65% of them change the PC.

Pipelining Summary
• Just overlap tasks — easy if the tasks are independent
• Speedup ≤ pipeline depth; if ideal CPI is 1, then:
  Speedup = [Pipeline depth / (1 + Pipeline stall CPI)] × (Clock cycle unpipelined / Clock cycle pipelined)
• Hazards limit performance on computers:
  – structural: need more hardware resources
  – data (RAW, WAR, WAW): need forwarding, compiler scheduling
  – control: delayed branch, prediction
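The CPI column of the branch-alternatives table follows from CPI = 1 + branch frequency × effective penalty. A sketch reproducing it (names are illustrative; it assumes, per the table's footnote, that predict-not-taken pays its penalty only for the 65% of branches that change the PC):

```python
# CPI for each branch-handling scheme: CPI = 1 + branch_freq * effective_penalty
BRANCH_FREQ = 0.14   # conditional & unconditional branches
TAKEN_FRAC = 0.65    # fraction of branches that change the PC

def cpi(effective_penalty, base_cpi=1.0, freq=BRANCH_FREQ):
    return base_cpi + freq * effective_penalty

schemes = {
    "stall pipeline":    3.0,               # always pay the full 3-cycle stall
    "predict taken":     1.0,               # DLX computes the target late: 1 cycle
    "predict not taken": 1.0 * TAKEN_FRAC,  # 1-cycle penalty only when taken
    "delayed branch":    0.5,               # half a cycle on average
}

for name, pen in schemes.items():
    print(f"{name}: CPI = {cpi(pen):.2f}")
```

With pipeline depth D, the speedup of each scheme over the stall-everything baseline is then just cpi(3.0)/cpi(pen), matching the table's last column.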