Lecture 11: Pipelining and Branch Prediction
EEN 312: Processors: Hardware, Software, and Interfacing
Department of Electrical and Computer Engineering
Spring 2014, Dr. Rozier (UM)

THE QUIZ SHOW!

Today’s class will be a quiz show
• We will be solving puzzles involving pipelining, branch prediction, and the stack.
• Form up into groups of 8 individuals.
• Points are awarded for correct solutions, with extra credit points for the top teams:
  – 4 pts for 1st place
  – 3 pts for 2nd place
  – 2 pts for 3rd place
  – 1 pt for 4th place

The Rules!
• Each group will elect a “buzzer”. When the buzzer raises his hand, your group will be called on to solve the puzzle.
• One representative will be sent up per group. They will give their answer and explain it.
• Once the buzzer has raised his hand, your group must stop discussing the answer!

PIPELINING

Pipelining
• Assume r5 != r4
• Assume there is one memory for instructions and data.
• During a cycle either data can be loaded for an instruction OR an instruction can be fetched, not both.
(100) A structural hazard exists. What is it?

    str r0, [r1, #16]
    ldr r0, [r1, #8]
    cmp r5, r4
    beq label
    add r5, r2, r4
    add r5, r5, r0

Pipelining
• Assume r5 != r4
• Assume there is one memory for instructions and data.
• During a cycle either data can be loaded for an instruction OR an instruction can be fetched, not both.
(200) Can this structural hazard be eliminated by adding “bubbles” to the pipeline in the form of NOP instructions?

    str r0, [r1, #16]
    ldr r0, [r1, #8]
    cmp r5, r4
    beq label
    add r5, r2, r4
    add r5, r5, r0

Pipelining
• Assume r5 != r4
• Assume there is one memory for instructions and data.
• During a cycle either data can be loaded for an instruction OR an instruction can be fetched, not both.
(300) To guarantee forward progress, how must this hazard be resolved: in favor of data access, or instruction fetching? Why?

    str r0, [r1, #16]
    ldr r0, [r1, #8]
    cmp r5, r4
    beq label
    add r5, r2, r4
    add r5, r5, r0

Pipelining
• Assume r5 != r4
• Assume there is one memory for instructions and data.
• During a cycle either data can be loaded for an instruction OR an instruction can be fetched, not both.
(400) Draw the 5-stage pipeline for this code. Assume the stages are: Fetch, Decode, Execute, Memory, Writeback. What is the total execution time?

    str r0, [r1, #16]
    ldr r0, [r1, #8]
    cmp r5, r4
    beq label
    add r5, r2, r4
    add r5, r5, r0

Pipelining
• Assume r5 != r4
• Assume there is one memory for instructions and data.
• During a cycle either data can be loaded for an instruction OR an instruction can be fetched, not both.
(500) Assume we have a new processor such that when the offset is zero on a memory operation, the Execute stage (ALU) can be skipped. The Memory and Execute stages can now be overlapped in the pipeline. What speedup is achieved with this new architecture?

    str r0, [r1, #0]
    ldr r0, [r10, #0]
    cmp r5, r4
    beq label
    add r5, r2, r4
    add r5, r5, r0

DATA DEPENDENCIES

Data Dependencies
(100) Find all data dependencies in this sequence.

    ldr r1, [r1, #0]
    and r1, r1, r2
    ldr r2, [r1, #0]
    ldr r1, [r3, #0]

Data Dependencies
(200) Find all hazards in this sequence, with and without forwarding, for a 5-stage pipeline. Assume the stages are: Fetch, Decode, Execute, Memory, Writeback.

    ldr r1, [r1, #0]
    and r1, r1, r2
    ldr r2, [r1, #0]
    ldr r1, [r3, #0]

Data Dependencies
(300) To reduce the clock cycle time, we are considering a split of the MEM stage into two stages. Find all hazards in this sequence for a 5-stage pipeline, with and without forwarding. Assume the stages are: Fetch, Decode, Execute, Memory, Writeback.

    add r1, r2, r1
    ldr r2, [r1, #0]
    ldr r1, [r1, #4]
    or  r3, r1, r2

Data Dependencies
• Assume all data memory values are 0’s.
• Assume:
  – r0 = 0
  – r1 = -1
  – r2 = 31
  – r3 = 1500
• Assume the processor has forwarding logic for hazards.
(400) What value is the first one to be forwarded, and what is the value it overrides?

    add r1, r2, r1
    ldr r2, [r1, #0]
    ldr r1, [r1, #4]
    or  r3, r1, r2

Data Dependencies
• Assume all data memory values are 0’s.
• Assume:
  – r0 = 0
  – r1 = -1
  – r2 = 31
  – r3 = 1500
(500) The hazard detection unit assumes forwarding was implemented, but the processor designers (UF students) forgot to implement it! What are the final register values? What should they be? Add NOPs to this sequence to ensure correct execution despite UF’s screw-up!

    add r1, r2, r1
    ldr r2, [r1, #0]
    ldr r1, [r1, #4]
    or  r3, r1, r2

BRANCH PREDICTION

Branch Prediction
(100) When building a branch prediction unit, state for each of the following cases whether the best prediction is “branch not taken” or “branch taken”:
1. Branches associated with “If” statements
2. Branches associated with “Else if” statements
3. Branches associated with “Else” statements
4. Branches associated with “For” statements

Branch Prediction
(200) Design a dynamic branch predictor for if statements and loops. Describe how to implement it in hardware. What new hardware might it require?

Branch Prediction
• Assume branch prediction is handled by branch not taken.
• Assume one element of the array at r2 is equal to 100.
(300) How many times is the branch predicted correctly versus incorrectly?

           mov r1, #0
           mov r2, #DEADBEEF
    LOOP:  ldr r3, [r2, r0 lsl 2]
           cmp r3, #100
           beq LABEL
           mov r4, r3
    LABEL: add r0, r0, #1
           cmp r0, #5
           beq LOOP
           mov r0, r4
           add r0, r0, #1

Branch Prediction
• Assume branch prediction is handled by branch not taken.
• Assume one element of the array at r2 is equal to 100.
• Assume the PC pipeline is three instructions deep.
• Assume the PC pipeline can be flushed in one cycle, and on a misprediction must be fully flushed.
• Assume a pipeline with the phases: Fetch, Decode, Issue, Execute, Memory, and Writeback.
• Assume branches are evaluated in the Issue step, and the pipeline is flushed during Execute.
(400) How many cycles does the loop take?

           mov r1, #0
           mov r2, #DEADBEEF
    LOOP:  ldr r3, [r2, r0 lsl 2]
           cmp r3, #100
           beq LABEL
           mov r4, r3
    LABEL: add r0, r0, #1
           cmp r0, #5
           beq LOOP
           mov r0, r4
           add r0, r0, #1

Branch Prediction
• Assume branch prediction is handled by branch not taken.
• Assume the PC pipeline is three instructions deep.
• Assume the PC pipeline can be flushed in one cycle, and on a misprediction must be fully flushed.
• Assume a pipeline with the phases: Fetch, Decode, Issue, Execute, Memory, and Writeback.
• Assume branches are evaluated in the Issue step, and the pipeline is flushed during Execute.
(500) Act as the compiler. Optimize the code for branch not taken. How many cycles does it take?

           mov r1, #0
           mov r2, #DEADBEEF
    LOOP:  ldr r3, [r2, r0 lsl 2]
           cmp r3, #100
           beq LABEL
           mov r4, r3
    LABEL: add r0, r0, #1
           cmp r0, #5
           beq LOOP
           mov r0, r4
           add r0, r0, #1

PROCESSOR ARCHITECTURE

Processor Architecture
(100) For a five-stage pipeline with stages Fetch, Decode, Execute, Memory, and Writeback, describe what happens in each stage.

Processor Architecture
(200) Describe the purpose of a clock signal in a processor. Why do processors need clock signals?

Processor Architecture
(300) Describe how registers are selected from the register file during the Decode phase. How is this accomplished in hardware?

Processor Architecture
(400) Why must we allocate new registers in the datapath for the writeback register instead of reading it from the Decode phase?

Processor Architecture
(500) Design a one-bit full adder.

REPRESENTATION OF DATA

Representation of Data
(100) Describe the difference between big endian and little endian representations.
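
As a point of reference for the endianness questions, here is a minimal Python sketch (not part of the original slides) that prints the byte layout of one 32-bit value under each byte order, lowest memory address first. The value 0x12345678 and the helper name byte_layout are illustrative assumptions, chosen so that none of the quiz values is used.

```python
def byte_layout(value, order):
    """Return the 4 bytes of an unsigned 32-bit value as hex, lowest address first.

    `order` is "big" or "little", as accepted by int.to_bytes.
    This helper is hypothetical and exists only for illustration.
    """
    raw = value.to_bytes(4, byteorder=order)
    return " ".join(f"{b:02x}" for b in raw)


example = 0x12345678  # assumed example value, not taken from the slides

# Big endian: the most significant byte sits at the lowest address.
print("big endian:   ", byte_layout(example, "big"))
# Little endian: the least significant byte sits at the lowest address.
print("little endian:", byte_layout(example, "little"))
```

The same four bytes appear in both layouts; only their order in memory differs.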

Representation of Data
(200) Represent the following data in big endian and little endian formats:
1. 00ac8eff
2. 54897743
3. be88fac8

Representation of Data
(300) Represent the following data as hexadecimal numbers in big and little endian formats. Assume unsigned integers.
1. 128
2. 976

Representation of Data
(400) Represent the following data as hexadecimal numbers in big and little endian formats. Assume signed integers.
1. -55
2. 99

Representation of Data
(500) Write assembly code which takes data from one register in big endian format and stores it in a new register in little endian format. You may use temporary registers.

FINAL QUESTION

Final Question
• Each team should decide an amount of points to bid.
• Write down your bids on a sheet of paper and hand them in.
• You will have only 60 seconds to answer the next question as a team. Write your answers down before the time limit.
  – Answer correctly and you will add your bid to your score.
  – Answer incorrectly and you will lose those points.

Final Question
In order to detect data hazards, new hardware must be added. Assuming that the register IDs involved in an instruction are available during the Decode stage, what hardware would be necessary to check for data hazards?

WRAP UP

For next time
• Enjoy your spring break!
• Read Chapter 5, sections 5.1 – 5.3