Handout 6 / Lab 3: Pipelining Basics

http://www.ece.cmu.edu/~ece447/
February 28, 2012
CMU 18-447: Introduction to Computer Architecture
Due: Friday, March 2, 2012, 9:20pm
(200 points, done individually)
Objectives
In this lab, you will gain an appreciation for timing, control, and forwarding issues that occur in an in-order
5-stage pipelined processor. READ this handout carefully for specifically required design elements.
You will not get any credit for this lab if you do not meet the required elements during check off.
Lab Description
Using your single-cycle MIPS core as a starting point, you will build a 5-stage pipelined processor. In
this lab, you will be concerned with building a pipeline to execute ALU and memory instructions with
dependence resolutions. The next lab will add control flow instructions and introduce other details found in
modern processors.
Your pipeline should be similar to the 5-stage pipeline introduced in Patterson and Hennessy. The basic
pipeline layout is listed in Table 1.
Table 1: Pipeline Layout

Stage      Function
Fetch      Read instruction from memory
Decode     Decode instruction and read values from the register file
Execute    Execute the ALU instruction or calculate memory addresses
Memory     Access memory for load and store instructions
Writeback  Write finished results back to the register file
This lab consists of two checkpoints (both need to be handed in; see the Grading section for more details):
• Checkpoint 1: Pipelined MIPS core which stalls on data dependences
• Checkpoint 2: Pipelined MIPS core which forwards when possible for dependent instructions
Detecting and handling data dependences adds significant complexity to most pipelined designs. MIPS
is no exception. There are two basic ways of handling read-after-write (RAW) dependences in the MIPS
pipeline that always result in correct execution. The simpler method is to stall earlier pipeline
stages when a RAW dependence is detected. Stalling typically requires disabling earlier pipeline register
writes and propagating at least one “bubble” through the remaining pipeline stages. This gives time for
instruction results to be written back and for dependent instructions to read the new value from the register file.
For Checkpoint 1, your implementation may stall, but must stall only when necessary.
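The stall condition described above can be sketched as a small behavioral model. This is only an illustration in Python, not the required Verilog: the `Instr` record and the `must_stall` function are hypothetical names, and the sketch assumes that a value cannot be read from the register file until the cycle after it is written.

```python
from typing import NamedTuple, Optional, Sequence


class Instr(NamedTuple):
    """Hypothetical, simplified view of an in-flight instruction."""
    name: str
    dest: Optional[int]          # register written, or None (e.g. a store)
    srcs: Sequence[int] = ()     # registers read


def must_stall(decoding: Instr, older: Sequence[Optional[Instr]]) -> bool:
    """Return True if `decoding` reads a register that an older in-flight
    instruction has not yet written back.

    `older` holds the instructions in later stages whose results are not
    yet readable from the register file (None marks a bubble). Register 0
    is hardwired to zero in MIPS, so it never creates a dependence.
    """
    pending = {i.dest for i in older
               if i is not None and i.dest not in (None, 0)}
    return any(src in pending for src in decoding.srcs)


add = Instr("ADDU", dest=3, srcs=(1, 2))
sub = Instr("SUBU", dest=5, srcs=(3, 4))   # reads $3, produced by ADDU
```

Here `must_stall(sub, [add, None, None])` is true, while the same `sub` with only bubbles ahead of it does not stall. Stalling "only when necessary" means the condition must be this precise: comparing against registers that an instruction does not actually read or write would introduce needless bubbles.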
If RAW dependences are common, these stalls can significantly impact the performance of the processor.
Forwarding paths allow the processor to send results directly to dependent instructions in prior pipeline
stages, even before register writeback has occurred. In the best case, this allows back-to-back dependent
instructions to execute without any stalls. Compared to stalling, more control and datapath logic is required
to correctly implement forwarding. For Checkpoint 2, you should forward into the end of the decode stage.
Your implementation can still stall; however, it should not stall unnecessarily when forwarding would allow
one instruction to feed another instruction directly.
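As a rough sketch of the operand selection that forwarding implies, the Python model below picks each source operand from the youngest older instruction that writes it, and falls back to the register file otherwise. The `Producer` record and `read_operand` function are hypothetical names, and the idea that a load has no value to forward until it has accessed memory is an assumption about your design rather than a requirement stated in this handout.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class Producer(NamedTuple):
    """Hypothetical summary of an older instruction in a later stage."""
    dest: Optional[int]     # destination register, or None
    value: Optional[int]    # result, or None if it is not computed yet


def read_operand(src: int,
                 regfile: Sequence[int],
                 producers: Sequence[Optional[Producer]]
                 ) -> Tuple[Optional[int], bool]:
    """Return (value, stall) for one source register.

    `producers` is ordered youngest first (Execute, Memory, Writeback),
    so the most recent writer of `src` takes priority.
    """
    if src == 0:
        return 0, False                  # $0 always reads as zero
    for p in producers:
        if p is not None and p.dest == src:
            if p.value is None:
                return None, True        # result not ready yet: stall
            return p.value, False        # forward the in-flight result
    return regfile[src], False           # no in-flight writer


regs = [0] * 32
regs[3] = 111                            # stale value in the register file
in_execute = Producer(dest=3, value=42)  # newest writer of $3
in_memory = Producer(dest=3, value=7)    # older writer of $3
```

With these values, `read_operand(3, regs, [in_execute, in_memory, None])` forwards 42 rather than the stale 111 or the older 7. A producer whose `value` is still `None` is the one case in this sketch where a stall remains necessary.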
Instruction Set
In order to make the pipelined implementation easier, we will only require a subset of the MIPS instruction
set for this lab. We are not requiring you to implement control flow instructions such as jump and branch.
The instructions that you are required to support are listed in Table 2. (Note: there is no need to delete
existing code for control flow instructions, because you will need to start from them in Lab 4.)
Table 2: Required Instruction Set

J       JAL     BEQ    BNE    BLEZ     BGTZ
ADDI    ADDIU   SLTI   SLTIU  ANDI     ORI
XORI    LUI     LB     LH     LW       LBU
LHU     SB      SH     SW     BLTZ     BGEZ
BLTZAL  BGEZAL  SLL    SRL    SRA      SLLV
SRLV    SRAV    JR     JALR   SYSCALL  ADD
ADDU    SUB     SUBU   AND    OR       XOR
NOR     SLT     SLTU   MULT   MFHI     MFLO
MTHI    MTLO    MULTU
Suggestions on Completing the Lab
You should diagram the pipelined core first, then implement and test it by executing code with sufficient
NOP instructions added to hide the dependences (to make sure your pipeline works correctly on programs
without dependences first). Do not try to implement the pipeline without first drawing a detailed diagram.
Think carefully about what situations will require a stall, and what conditions will allow you to continue
execution. At this point, you should be able to execute arithmetic instructions that do not have data dependences. Next, add logic to handle stalls from the multiplier; at this point, you should be able to handle all
instructions that do not have RAW dependences. Next, add dependence detection and stall logic. Your test
code should now execute properly even without padding by NOP instructions. Keep this processor core for
handin. Finally, use this code as a basis for the next model with forwarding. Your two implementations
should produce exactly the same (architectural) results, but test cases with dependences that can be resolved
by forwarding should complete in fewer cycles.
When you are designing the pipeline for this lab, consider defining some invariants (rules that are always
true during pipeline operation) that describe how pipeline stages interact with each other. Often, these
sorts of rules can be summed up in a sentence or so each; adhering to them rigorously can make defining
semantics of each module much easier. To get you started, we propose one such rule:
• If pipeline stage n asserts a “stall” signal to stage n − 1 before a clock edge, then the inputs to
stage n must not change as a result of the clock edge.
Of course, your rules probably don’t need to be so formally specified; the important thing is that they are
clear to you, and help you build a cohesive view of how the pipeline fits together.
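To make the proposed rule concrete, here is a minimal Python model of a pipeline register that obeys it. The class name, the `stall` and `insert_bubble` arguments, and the use of the string "NOP" as a bubble are all assumptions made for illustration; your Verilog will look different.

```python
class PipelineRegister:
    """Hypothetical model of the register between stage n-1 and stage n."""

    def __init__(self, bubble="NOP"):
        self.bubble = bubble
        self.value = bubble          # the pipeline starts out full of bubbles

    def clock(self, next_value, stall=False, insert_bubble=False):
        """Model one rising clock edge."""
        if stall:
            return                   # hold: the inputs to stage n do not change
        self.value = self.bubble if insert_bubble else next_value


if_id = PipelineRegister()           # feeds Decode
id_ex = PipelineRegister()           # feeds Execute

if_id.clock("SUBU")                  # SUBU arrives in Decode

# Decode sees a hazard: it stalls Fetch (IF/ID holds) and sends a bubble on.
id_ex.clock(if_id.value, insert_bubble=True)
if_id.clock("OR", stall=True)
held = (if_id.value, id_ex.value)

# Hazard cleared: both registers advance. The downstream register is clocked
# first so that it sees the pre-edge value of its upstream neighbor.
id_ex.clock(if_id.value)
if_id.clock("OR")
released = (if_id.value, id_ex.value)
```

While the stall is asserted, Decode keeps seeing SUBU and Execute receives a bubble; once it is released, SUBU moves on and OR takes its place. The explicit clocking order is needed only because this software model updates registers one at a time, whereas real flip-flops all sample on the same edge.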
Critical Paths
The Iron Law tells us that the cycle time can significantly impact overall processor performance. The
critical path (the longest combinational path in any pipeline stage) determines your overall cycle time. While a short cycle
time is not necessary to achieve a perfect score on this lab, you should still try to keep your cycle time as
short as possible.
Keep in mind that cycle time isn’t the whole story. Features such as forwarding paths may actually increase
your cycle time, but also improve overall performance (by reducing stalls and decreasing CPI). Finally, it’s
important to make your processor work first, and then make it work fast.
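The trade-off can be checked with simple arithmetic, using this handout's definition of execution time as cycle count multiplied by cycle time. The numbers below are made up purely for illustration and do not describe any real design.

```python
def execution_time_ns(cycle_count, cycle_time_ns):
    """Execution time = cycle count x cycle time (critical path length)."""
    return cycle_count * cycle_time_ns


# Made-up numbers: the same program on two hypothetical designs.
# Stalling only: shorter critical path, but many stall cycles.
stall_only = execution_time_ns(cycle_count=1600, cycle_time_ns=10)
# Forwarding: slightly longer critical path, far fewer stall cycles.
forwarding = execution_time_ns(cycle_count=1100, cycle_time_ns=11)

faster = "forwarding" if forwarding < stall_only else "stall_only"
```

In this example the forwarding design wins despite its 10% longer cycle, because it needs far fewer cycles. With a different mix of instructions or a much longer forwarding path, the comparison could go the other way.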
Handout
The files for this lab are available in /afs/ece/class/ece447/labs/lab3; they can be extracted and
built much like the previous labs.
When you start on this lab, you should begin with a copy of the lab3 directory, and then take the rtl/ from
your Lab 2 implementation and copy it into this directory. The interface for your top-level mips_core
module remains the same. Most of the work that you do in this lab will involve splitting and refactoring
your MIPS core into pipeline stages. Only one change has been made in the base RTL that we provide:
• Stalling multiplier: In the last lab, we did not use the multiplier because we did not implement
multiply/divide operations. In this lab, we will be implementing multiply operations, and we have
provided a new stalling multiplier design in order to help you implement these instructions. The
multiplier takes multiple cycles to execute. If you attempt to issue the multiply unit an opcode that
it’s not ready for (e.g., an MTHI while a MULT is still being executed), then the multiplier will assert
the mul stall 2a output. You should take this potential for stalls into account when a MULT or
MULTU instruction is executed and stall the pipeline accordingly.
Pipeline Diagram
In order to begin implementing the pipeline, you should start by extending your single-cycle processor
diagram from Lab 2. As with previous labs, you should include all control and datapath wires and keep
careful track of which values are stored in pipeline registers. It may be helpful to use different colors or line
weights to separate control and datapath wires. Use some discretion when choosing what details to include
on your diagram (for instance, the various units and control structures within your ALU are probably not
relevant, but all of the inputs and outputs certainly are).
Two diagrams are required in this lab. The first should show your basic pipeline which stalls on data
dependences. Show how the dependences are detected and how you control the pipeline stalls. The second
diagram should also show the forwarding paths. Both diagrams should be computer-drawn, as before. (We
suggest Inkscape or xfig if you are looking for vector-drawing software to produce your diagrams.) Your
figures should have approximately the same level of detail as the LC-3b datapath diagrams that we have
seen in class. Note that you do not need to show all control logic in great detail, but only a control-logic
box that produces signals based on the instruction.
Think carefully about the location of the system call unit. To maintain the illusion of a machine that
executes each instruction sequentially and atomically (i.e., the pipeline is not exposed to the programmer),
a system call should only be handled after all other preceding instructions have completed. Otherwise, the
exit SYSCALL ($v0 = 10) might halt your processor before all of the program’s results have written back
to registers and memory.
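The effect of the system call unit's location can be seen in a toy model of an ideal, stall-free 5-stage pipeline. This Python sketch is illustrative only: the function name and the simplification that every instruction "completes" in Writeback are assumptions, not part of the lab infrastructure.

```python
STAGES = ["Fetch", "Decode", "Execute", "Memory", "Writeback"]


def completed_before_halt(program, halt_stage):
    """Return the instructions that have finished Writeback by the end of
    the cycle in which SYSCALL occupies `halt_stage`.

    Assumes one instruction enters Fetch per cycle and nothing stalls.
    """
    if "SYSCALL" not in program:
        raise ValueError("program must contain a SYSCALL")
    halt_idx = STAGES.index(halt_stage)
    done = []
    cycle = 0
    while True:
        # In this ideal pipeline, instruction i is in stage (cycle - i).
        for i, instr in enumerate(program):
            stage = cycle - i
            if stage == len(STAGES) - 1 and instr != "SYSCALL":
                done.append(instr)       # older instruction retires
            if instr == "SYSCALL" and stage == halt_idx:
                return done              # processor halts here
        cycle += 1


program = ["ADDU", "SW", "SYSCALL"]
early = completed_before_halt(program, "Decode")
late = completed_before_halt(program, "Writeback")
```

Halting as soon as SYSCALL is decoded leaves `early` empty, so neither ADDU nor SW has completed. Handling it in Writeback gives `late == ["ADDU", "SW"]`, which preserves the illusion of sequential execution.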
You may find it advantageous to split your pipeline into one module per pipeline stage, and then to draw
each module in its own diagram. (See the LC-3b pipeline handout for an example.)
Be sure your diagram clearly addresses the design issues associated with the required design elements. You
will be questioned about them during checkoff. You should be able to describe your pipeline design and
any issues you encountered clearly (a small portion of your grade on this lab will depend on how well you
describe your design during checkoff).
Verilog Notes
You should use synthesizable-style Verilog for this lab. During the demo, you need to show your Verilog
code is synthesizable by invoking the synthesis tool. To make grading (and your testing) easier, make sure
you dump the contents of the register file before ending execution, as in Lab 2.
You also must not imply latches in combinational logic. Look for warnings in the synthesis output from
XST. If you see a warning of the form “Found N-bit latch for signal X. Latches may be generated from
incomplete case or if statements,” then you have implied a latch, probably accidentally. Please see our
handout on Verilog for more details on how to avoid this problem.
Be careful! In this lab, you will have many, many wires floating around. For your sanity, then, it is very
important to have a scheme for naming them and routing them around. Much of the base code in 447rtl/
is written with reasonable style; for guidance, you might wish to look there.
Handin
You should electronically hand in all of your Verilog files through the course AFS space. Bring a paper copy
of your diagram to your lab demo period, and submit a PDF copy into your course AFS space. During the
demo, we will ask you questions about your pipelined designs and test them with a number of input programs.
During the demo, you need to show that your Verilog code is synthesizable by invoking the synthesis tool;
you can accomplish this by running “make synth”. Please be sure to allow plenty of time to get checked off
(i.e., don’t come in the last 15 minutes of lab).
Code submission should be done similarly to Lab 2. Please hand in two buildable trees into
/afs/ece/class/ece447/handin/$USER/lab3/checkpoint1 and
/afs/ece/class/ece447/handin/$USER/lab3/checkpoint2 for Checkpoints 1 and 2, respectively. (In other
words, each of these directories should have an rtl/ subdirectory, 447rtl/ subdirectory, etc.) Also, please
make sure to clean out your runs/ directories before submission; if you don’t,
then your submission may be hundreds of megabytes. If your submission does not build when copied out
of your handin directory, the automatic grading scripts will not work, and your grader will have to intervene
manually, making him or her very unhappy!
You should also submit a README.txt that describes details of your implementation. If you did anything
‘clever’ (hopefully!), then you should describe it. Also state the critical path length of your design, and
describe the critical path (which pipeline stage does it pass through, and what component(s) are the limiting
factor?). You can find critical path information in XST’s timing results. Please describe any optimization
you did or design choices you made to reduce the critical path. Finally, if you wrote any additional test
cases, submit and describe them.
Grading
Checkpoints 1 and 2 are both due at the end of the second lab week, as with previous labs. However, this
is a difficult lab, so we suggest that you have at least arithmetic and multiply instructions done by the end of the first
week! Functionality will be tested with a set of test programs at your lab demo, and you must check-off
your lab in order to receive a grade.
As with labs 1 and 2, your grade for lab 3 will be based on a number of test cases, which we will release
after we grade the labs. Functionality is the primary goal: your processor must be correct in all cases (with
the Lab 1 simulator as the golden standard to match). We will test many corner cases on your turned-in
RTL, so you should test thoroughly and rigorously to be prepared.
Extra Credit: In order to encourage you to start early and to avoid a check-off rush near the due date, we
will offer a 5% extra-credit bonus for those who check off their lab during the first week (by Friday, Feb.
24, 2012 at 9:20pm).
Extra Credit: Performance Competition
Finally, we will be holding a performance competition among those designs that are correct (i.e., pass all
tests). Students with the three lowest execution times [1] on our test cases, and a correct design, will
receive significant extra credit (which will be specified exactly later, but will be at least 5%). They will also
receive prizes. You should clearly describe how you optimized cycle time and CPI in your design.

[1] Execution time is defined as cycle count multiplied by cycle time (critical path length). You should carefully weigh any optimization that
improves (decreases) cycle time if it might also introduce stalls that would increase cycle count.