http://www.ece.cmu.edu/~ece447/
February 28, 2012

CMU 18-447: Introduction to Computer Architecture
Handout 6 / Lab 3: Pipelining Basics
Due: Friday, March 2, 2012, 9:20pm (200 points, done individually)

Objectives

In this lab, you will gain an appreciation for timing, control, and forwarding issues that occur in an in-order 5-stage pipelined processor. READ this handout carefully for specifically required design elements. You will not get any credit for this lab if you do not meet the required elements during check off.

Lab Description

Using your single-cycle MIPS core as a starting point, you will build a 5-stage pipelined processor. In this lab, you will be concerned with building a pipeline to execute ALU and memory instructions with dependence resolutions. The next lab will add control flow instructions and introduce other details found in modern processors. Your pipeline should be similar to the 5-stage pipeline introduced in Patterson and Hennessy. The basic pipeline layout is listed in Table 1.

Table 1: Pipeline Layout

Stage       Function
Fetch       Read instruction from memory
Decode      Decode instruction and read values from the register file
Execute     Execute the ALU instruction or calculate memory addresses
Memory      Access memory for load and store instructions
Writeback   Write finished results back to the register file

This lab consists of two checkpoints (both need to be handed in; see the Grading section for more details):

• Checkpoint 1: Pipelined MIPS core which stalls on data dependences
• Checkpoint 2: Pipelined MIPS core which forwards when possible for dependent instructions

Detecting and handling data dependences adds significant complexity to most pipelined designs. MIPS is no exception. There are two basic ways of handling read-after-write (RAW) dependences in the MIPS pipeline which will always result in correct execution. The simplest method is to stall earlier pipeline stages when a RAW dependence is detected.
Stalling typically requires disabling earlier pipeline register writes and propagating at least one “bubble” through the remaining pipeline stages. This gives time for instruction results to be written back and for dependent instructions to read the new value from the register file. For Checkpoint 1, your implementation may stall, but must stall only when necessary.

If RAW dependences are common, these stalls can significantly impact the performance of the processor. Forwarding paths allow the processor to send results directly to dependent instructions in prior pipeline stages, even before register writeback has occurred. In the best case, this allows back-to-back dependent instructions to execute without any stalls. Compared to stalling, more control and datapath logic is required to correctly implement forwarding. For Checkpoint 2, you should forward into the end of the decode stage. Your implementation can still stall; however, it should not stall unnecessarily when forwarding would allow one instruction to feed another instruction directly.

Instruction Set

In order to make the pipelined implementation easier, we will only require a subset of the MIPS instruction set for this lab. We are not requiring you to implement control flow instructions such as jump and branch. The instructions that you are required to support are listed in Table 2. (Note: there is no need to delete existing code for control flow instructions, because you will need to start from them in Lab 4.)

Table 2: Required Instruction Set

J        JAL      BEQ      BNE      BLEZ     BGTZ
ADDI     ADDIU    SLTI     SLTIU    ANDI     ORI
XORI     LUI      LB       LH       LW       LBU
LHU      SB       SH       SW       BLTZ     BGEZ
BLTZAL   BGEZAL   SLL      SRL      SRA      SLLV
SRLV     SRAV     JR       JALR     SYSCALL  ADD
ADDU     SUB      SUBU     AND      OR       XOR
NOR      SLT      SLTU     MULT     MFHI     MFLO
MTHI     MTLO     MULTU

Suggestions on Completing the Lab

You should diagram the pipelined core first, then implement and test it by executing code with sufficient NOP instructions added to hide the dependences (to make sure your pipeline works correctly on programs without dependences first). Do not try to implement the pipeline without first drawing a detailed diagram. Think carefully about what situations will require a stall, and what conditions will allow you to continue execution. At this point, you should be able to execute arithmetic instructions that do not have data dependences.

Next, add logic to handle stalls from the multiplier; at this point, you should be able to handle all instructions that do not have RAW dependences.

Next, add dependence detection and stall logic. Your test code should now execute properly even without padding by NOP instructions. Keep this processor core for handin.

Finally, use this code as a basis for the next model with forwarding. Your two implementations should produce exactly the same (architectural) results, but test cases with dependences that can be resolved by forwarding should complete in fewer cycles.

When you are designing the pipeline for this lab, consider defining some invariants (rules that are always true during pipeline operation) that describe how pipeline stages interact with each other. Often, these sorts of rules can be summed up in a sentence or so each; adhering to them rigorously can make defining the semantics of each module much easier. To get you started, we propose one such rule:

• If pipeline stage n asserts a “stall” signal to stage n − 1 before a clock edge, then the inputs to stage n must not change as a result of the clock edge.
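
The proposed rule and the dependence-detection step can be illustrated with a small behavioral model. The Python sketch below is not part of the lab's RTL and is not a required design; the names Instr, raw_hazard, and PipeReg are hypothetical, and the sketch assumes a stall-only pipeline in which the decode stage waits for every older instruction that writes one of its source registers.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Instr:
    """Hypothetical decoded instruction (not a structure from the lab RTL)."""
    name: str
    dest: Optional[int] = None   # register number written, or None
    srcs: Tuple[int, ...] = ()   # register numbers read


def raw_hazard(decode: Optional[Instr], older: Sequence[Optional[Instr]]) -> bool:
    """Return True if the instruction in decode reads a register that an
    older instruction still in the pipeline has not yet written back."""
    if decode is None:
        return False
    for instr in older:
        # Bubbles (None) and writes to $zero never create a dependence.
        if instr is None or instr.dest is None or instr.dest == 0:
            continue
        if instr.dest in decode.srcs:
            return True
    return False


class PipeReg:
    """Pipeline register with a stall input: while stalled, it holds its value."""

    def __init__(self) -> None:
        self.value: Optional[Instr] = None

    def clock(self, next_value: Optional[Instr], stall: bool) -> None:
        if not stall:
            self.value = next_value


# ADDU writes $3; the SUBU right behind it reads $3.
addu = Instr("ADDU $3, $1, $2", dest=3, srcs=(1, 2))
subu = Instr("SUBU $5, $3, $4", dest=5, srcs=(3, 4))
lw = Instr("LW $6, 0($7)", dest=6, srcs=(7,))

if_id = PipeReg()                       # register feeding the decode stage
if_id.clock(subu, stall=False)          # SUBU enters decode
stall = raw_hazard(if_id.value, older=[addu, None, None])  # ADDU is in execute
if_id.clock(lw, stall)                  # clock edge while the stall is asserted
print(stall, if_id.value.name)          # SUBU is still in decode
```

Here decode plays the role of stage n: because it asserts the stall, the register at its input keeps SUBU across the clock edge. Whether the writeback stage belongs in the list of older instructions depends on how your register file treats a read and a write of the same register in one cycle, which this sketch does not settle.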

Of course, your rules probably don’t need to be so formally specified; the important thing is that they are clear to you, and help you build a cohesive view of how the pipeline fits together.

Critical Paths

The Iron Law tells us that the cycle time can significantly impact overall processor performance. The longest critical path in all of the pipeline stages determines your overall cycle time. While a short cycle time is not necessary to achieve a perfect score on this lab, you should still try to keep your cycle time as short as possible.

Keep in mind that cycle time isn’t the whole story. Features such as forwarding paths may actually increase your cycle time, but also improve overall performance (by reducing stalls and decreasing CPI).

Finally, it’s important to make your processor work first, and then make it work fast.

Handout

The files for this lab are available in /afs/ece/class/ece447/labs/lab3; they can be extracted and built much like the previous labs. When you start on this lab, you should begin with a copy of the lab3 directory, and then take the rtl/ from your Lab 2 implementation and copy it into this directory.

The interface for your top-level mips_core module remains the same. Most of the work that you do in this lab will involve splitting and refactoring your MIPS core into pipeline stages. Only one change has been made in the base RTL that we provide:

• Stalling multiplier: In the last lab, we did not use the multiplier because we did not implement multiply/divide operations. In this lab, we will be implementing multiply operations, and we have provided a new stalling multiplier design in order to help you implement these instructions. The multiplier takes multiple cycles to execute. If you attempt to issue the multiply unit an opcode that it’s not ready for (e.g., an MTHI while a MULT is still being executed), then the multiplier will assert the mul_stall_2a output.
You should take this potential for stalls into account when a MULT or MULTU instruction is executed and stall the pipeline accordingly.

Pipeline Diagram

In order to begin implementing the pipeline, you should start by extending your single-cycle processor diagram from Lab 2. As with previous labs, you should include all control and datapath wires and keep careful track of which values are stored in pipeline registers. It may be helpful to use different colors or line weights to separate control and datapath wires. Use some discretion when choosing what details to include on your diagram (for instance, the various units and control structures within your ALU are probably not relevant, but all of the inputs and outputs certainly are).

Two diagrams are required in this lab. The first should show your basic pipeline which stalls on data dependences. Show how the dependences are detected and how you control the pipeline stalls. The second diagram should also show the forwarding paths. Both diagrams should be computer-drawn, as before. (We suggest Inkscape or xfig if you are looking for vector-drawing software to produce your diagrams.) Your figures should have approximately the same level of detail as the LC-3b datapath diagrams that we have seen in class. Note that you do not need to show all control logic in great detail, but only a control-logic box that produces signals based on the instruction.

Think carefully about the location of the system call unit. To maintain the illusion of a machine that executes each instruction sequentially and atomically (i.e., the pipeline is not exposed to the programmer), a system call should only be handled after all other preceding instructions have completed. Otherwise, the exit SYSCALL ($v0 = 10) might halt your processor before all of the program’s results have written back to registers and memory.
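
The ordering requirement for SYSCALL can be written as a condition on the stages ahead of it. The Python sketch below is only a behavioral illustration; the function name may_handle_syscall, the list representation of the pipeline, and the stage numbering (0 = fetch through 4 = writeback) are assumptions made for this example, and it does not prescribe where your system call unit should sit.

```python
from typing import Optional, Sequence


def may_handle_syscall(pipeline: Sequence[Optional[str]], syscall_stage: int) -> bool:
    """pipeline[i] names the instruction in stage i, or is None for a bubble.

    Instructions in later stages are older than the SYSCALL, so the SYSCALL
    may be handled only once all of those stages hold bubbles.
    """
    return all(instr is None for instr in pipeline[syscall_stage + 1:])


# Stage order assumed here: fetch, decode, execute, memory, writeback.
busy = [None, "SYSCALL", "ADDIU", None, "SW"]   # older work still in flight
drained = [None, "SYSCALL", None, None, None]   # every older instruction is done
print(may_handle_syscall(busy, 1), may_handle_syscall(drained, 1))
```

A unit placed in the writeback stage satisfies the same condition trivially, since no stage follows it; the trade-off between the two placements is one of the design issues worth showing on your diagram.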

You may find it advantageous to split your pipeline into one module per pipeline stage, and then to draw each module in its own diagram. (See the LC-3b pipeline handout for an example.)

Be sure your diagram clearly addresses the design issues associated with the required design elements. You will be questioned about them during checkoff. You should be able to describe your pipeline design and any issues you encountered clearly (a small portion of your grade on this lab will depend on how well you describe your design during checkoff).

Verilog Notes

You should use synthesizable-style Verilog for this lab. During the demo, you need to show your Verilog code is synthesizable by invoking the synthesis tool. To make grading (and your testing) easier, make sure you dump the contents of the register file before ending execution, as in Lab 2.

You also must not imply latches in combinational logic. Look for warnings in the synthesis output from XST. If you see a warning of the form “Found N-bit latch for signal X. Latches may be generated from incomplete case or if statements,” then you have implied a latch, probably accidentally. Please see our handout on Verilog for more details on how to avoid this problem.

Be careful! In this lab, you will have many, many wires floating around. For your sanity, then, it is very important to have a scheme for naming them and routing them around. Much of the base code in 447rtl/ is written with reasonable style; for guidance, you might wish to look there.

Handin

You should electronically hand in all of your Verilog files through the course AFS space. Bring a paper copy of your diagram to your lab demo period, and submit a PDF copy into your course AFS space. During the demo, we will ask you questions about your pipelined designs and test them with a number of input programs.
During the demo, you need to show that your Verilog code is synthesizable by invoking the synthesis tool; you can accomplish this by running “make synth”. Please be sure to allow plenty of time to get checked off (i.e., don’t come in the last 15 minutes of lab).

Code submission should be done similarly to Lab 2. Please hand in two buildable trees into /afs/ece/class/ece447/handin/$USER/lab3/checkpoint1 and /afs/ece/class/ece447/handin/$USER/lab3/checkpoint2 for Checkpoints 1 and 2 respectively. (In other words, each of these directories should have an rtl/ subdirectory, 447rtl/ subdirectory, etc.) Also, please make sure to clean out your runs/ directories before submission; if you don’t, then your submission may be hundreds of megabytes. If your submission does not build when copied out of your handin directory, the automatic grading scripts will not work, and your grader will have to intervene manually, making him or her very unhappy!

You should also submit a README.txt that describes details of your implementation. If you did anything ‘clever’ (hopefully!), then you should describe it. Also state the critical path length of your design, and describe the critical path (which pipeline stage does it pass through, and what component(s) are the limiting factor?). You can find critical path information in XST’s timing results. Please describe any optimization you did or design choices you made to reduce the critical path. Finally, if you wrote any additional test cases, submit and describe them.

Grading

Checkpoints 1 and 2 are both due at the end of the second lab week, as with previous labs. However, this is a difficult lab, so we suggest that you have at least arithmetic and multiply instructions done by the first week!

Functionality will be tested with a set of test programs at your lab demo, and you must check off your lab in order to receive a grade.
As with Labs 1 and 2, your grade for Lab 3 will be based on a number of test cases, which we will release after we grade the labs. Functionality is the primary goal: your processor must be correct in all cases (with the Lab 1 simulator as the golden standard to match). We will test many corner cases on your turned-in RTL, so you should test thoroughly and rigorously to be prepared.

Extra Credit: In order to encourage you to start early and to avoid a check-off rush near the due date, we will offer a 5% extra-credit bonus for those who check off their lab during the first week (by Friday, Feb. 24, 2012 at 9:20pm).

Extra Credit: Performance Competition

Finally, we will be holding a performance competition among those designs that are correct (i.e., pass all tests). Students with the three lowest execution times¹ on our test cases, and a correct design, will receive significant extra credit (which will be specified exactly later, but will be at least 5%). They will also receive prizes. You should clearly describe how you optimized cycle time and CPI in your design.

¹ Execution time is defined as cycle count multiplied by cycle time (critical path length). You should carefully weigh any optimization that improves (decreases) cycle time if it might also introduce stalls that would increase cycle count.
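
The footnote's definition can be made concrete with a short calculation. The Python sketch below compares two hypothetical designs; the cycle counts and cycle times are invented purely for illustration and are not measurements from any lab design or test case.

```python
def execution_time_ns(cycle_count: int, cycle_time_ns: float) -> float:
    """Execution time = cycle count x cycle time (critical path length)."""
    return cycle_count * cycle_time_ns


# Invented numbers: the forwarding design has a longer critical path
# (11 ns vs. 10 ns) but needs fewer cycles because it stalls less often.
stall_only = execution_time_ns(cycle_count=1500, cycle_time_ns=10.0)
forwarding = execution_time_ns(cycle_count=1100, cycle_time_ns=11.0)

print(f"stall-only: {stall_only:.0f} ns, forwarding: {forwarding:.0f} ns")
print("forwarding wins" if forwarding < stall_only else "stall-only wins")
```

With these made-up figures the forwarding design finishes sooner despite its slower clock; with a different mix of dependences or a larger cycle-time penalty the comparison could go the other way, which is why both factors need to be weighed together.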