Gerald, Huifang, Kahn, Mei, & Yuchi
CS152 Lab 7 Report
By QWX
Abstract
The purpose of this lab assignment is to add several sub-projects to our existing one-memory processor.
These sub-projects are 8-stage deep pipelining, a branch predictor, a victim cache, a stream buffer, and a write buffer.
Division of Labor:
8 stage deep pipelining: Yuchi & Gerald
Branch predictor: Yuchi, Gerald, & Mei
Victim cache: Kahn
Stream buffer: Huifang
Write buffer: Mei
Detailed Strategy:
1. Implementation
1.1 Deep Pipelining Processor
Our 8-stage pipeline is broken into 2 IF, 1 ID, 2 EX, 2 MEM, and 1 WB stages:
IF1: the PC calculation and the first stage of the I-cache access
IF2: the second stage of the I-cache access, the branch prediction, the main controller
ID: instruction decode and forwarding and stall control signal generators
EX1: the only EX stage for logic operations and the first stage for add, subtract, and shift operations
EX2: the second stage for add, subtract, and shift operations
MEM1: the lw/sw address calculation
MEM2: the D-cache access
WB: the register write back
The main function units in each stage are:
IF1: PC adder, J mux, PC src mux, I-cache, and stream buffer
IF2: I-cache, extender, branch adder, main controller, the jr forwarding mux, and branch
predictor
ID: register file, forwarding, and stall controllers
EX1: two 16-bit adders, two 16-bit subtractors, one 16-bit enabled shifter, rs and rt forwarding muxes, Rt mux, register destination mux, and branch decision logic gates
EX2: one 16-bit adder, one 16-bit subtractor, one 16-bit enabled shifter, and SLT logic gates
MEM1: address adder, and address forwarding mux
MEM2: sw forwarding mux, mem-to-reg mux, D-cache, write buffer, and victim cache
WB: register file
The critical path (slt followed by branch forwarding) in lab 5 runs through the EX pipeline register, the ALU source mux, the ALU, the SLT logic, the forwarding muxes in the ID stage, the comparator, and the J mux back in the IF stage. In order to break this critical path, we first decided to move the forwarding muxes, as well as the branch decision logic, to EX1. Since the branch outcome would still be available, this is feasible in terms of the MIPS one-delay-slot convention. To also honor the one delay slot for jumps, the main controller needs to be in the second stage, IF2, so that the jump decision can be made there. After doing so, the ALU became our critical path. To break up the ALU and minimize our cycle time, we implemented each ALU function individually: AND, OR, XOR, and the comparator are implemented at the gate level, and 16-bit adders, subtractors, and shifters implemented in VHDL are used to reduce the delay of the 32-bit operations. A sketch of how an add can be split across EX1 and EX2 is given below.
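As an illustration of this splitting, the following is a minimal VHDL sketch, with hypothetical entity and signal names, of one way a 32-bit add can be spread over EX1 and EX2 using 16-bit adders: the low halves are added in EX1 and the carry is pipelined into EX2. Our actual datapath differs in detail (for example, EX1 holds two 16-bit adders), so this is only a sketch of the idea.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity split_adder is
  port (
    clk  : in  std_logic;
    a, b : in  std_logic_vector(31 downto 0);  -- operands available in EX1
    sum  : out std_logic_vector(31 downto 0)   -- full result available in EX2
  );
end entity;

architecture rtl of split_adder is
  signal lo_sum_ex2         : unsigned(15 downto 0);
  signal carry_ex2          : unsigned(0 downto 0);
  signal a_hi_ex2, b_hi_ex2 : unsigned(15 downto 0);
begin
  -- EX1: add the lower 16 bits and register the carry into the EX2 stage
  process (clk)
    variable lo : unsigned(16 downto 0);
  begin
    if rising_edge(clk) then
      lo := resize(unsigned(a(15 downto 0)), 17) + resize(unsigned(b(15 downto 0)), 17);
      lo_sum_ex2 <= lo(15 downto 0);
      carry_ex2  <= lo(16 downto 16);
      a_hi_ex2   <= unsigned(a(31 downto 16));
      b_hi_ex2   <= unsigned(b(31 downto 16));
    end if;
  end process;

  -- EX2: add the upper 16 bits plus the pipelined carry
  sum(15 downto 0)  <= std_logic_vector(lo_sum_ex2);
  sum(31 downto 16) <= std_logic_vector(a_hi_ex2 + b_hi_ex2 + carry_ex2);
end architecture;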
In terms of forwarding, because two more stages are added between the ID and WB stages, results need to be forwarded across at most four stages. Thus, five-input forwarding muxes are used, located in ID (for jr), EX1 (for the ALU), MEM1 (for the address calculation), and MEM2 (for sw).
As for stalls, at most three stalls are needed for a lw followed by a dependent ALU instruction. One stall is needed for an instruction that completes in EX2 followed by any dependent ALU instruction. One stall is needed for a lw followed by any address-dependent lw/sw instruction. A sketch of the general load-use case is shown below.
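As a rough illustration, here is a VHDL sketch (hypothetical entity and signal names; the register fields and lw flags are assumed to be decoded from the instructions currently in each stage) of the general load-use stall check: the consumer is held in ID while the lw is still in EX1, EX2, or MEM1, giving the one-to-three-stall behaviour described above. The special lw/sw address case, which needs only one stall because the address is computed in MEM1, is omitted.

library ieee;
use ieee.std_logic_1164.all;

entity load_use_stall is
  port (
    id_rs, id_rt                     : in  std_logic_vector(4 downto 0);  -- sources read in ID
    ex1_dest, ex2_dest, mem1_dest    : in  std_logic_vector(4 downto 0);  -- lw destinations downstream
    ex1_is_lw, ex2_is_lw, mem1_is_lw : in  std_logic;
    stall                            : out std_logic
  );
end entity;

architecture rtl of load_use_stall is
  -- '1' when dest is a real register and matches either source
  function matches(dest, rs, rt : std_logic_vector) return std_logic is
  begin
    if dest /= "00000" and (dest = rs or dest = rt) then
      return '1';
    else
      return '0';
    end if;
  end function;
begin
  -- Hold the consumer in ID while the lw is still in EX1, EX2, or MEM1;
  -- once the lw reaches MEM2 its result can be forwarded into EX1 next cycle.
  stall <= (ex1_is_lw  and matches(ex1_dest,  id_rs, id_rt)) or
           (ex2_is_lw  and matches(ex2_dest,  id_rs, id_rt)) or
           (mem1_is_lw and matches(mem1_dest, id_rs, id_rt));
end architecture;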
1.2 Branch Predictor
In our 8-stage deep pipeline, branch decision making is in EX1, which is the fourth stage; without a branch predictor, a branch instruction would need three delay slots. In order to comply with the MIPS convention of one delay slot, the pipeline has to flush two instructions if the branch is taken. We decided to use a two-bit branch predictor to increase the rate of correct predictions. The FSM of the predictor consists of four states: strong taken, weak taken, strong not taken, and weak not taken. The state transition diagram is as follows.
We reserve a two-bit entry for each instruction to record the branch history, indexed by the lower bits of the PC. We have 256 entries in the prediction table. We set the initial value of each entry to "weak taken", so that a loop is mispredicted only once.
The branch predictor works as follows.
1. When a branch instruction enters the IF2 stage, the branch predictor outputs a predicted value for the branch, which is passed to the IF1 stage to select the next PC.
2. When the branch instruction and the predicted value are passed down to the EX1 stage, the real branch decision is made by a comparator. If the decision does not agree with the predicted value, a flush signal flushes the IF1 and IF2 stages and selects the correct PC as the next PC.
3. The branch decision in the EX1 stage is passed back to the branch predictor in the IF2 stage to update the predictor state according to the state transition diagram. A sketch of such a predictor is shown below.
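Below is a minimal VHDL sketch of this kind of predictor, written as the common two-bit saturating counter (the exact transition diagram of our design is not reproduced here). The entity and port names, and the use of PC bits [9:2] as the 256-entry table index, are assumptions for illustration.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity branch_predictor is
  port (
    clk           : in  std_logic;
    -- prediction lookup in IF2
    lookup_pc     : in  std_logic_vector(31 downto 0);
    predict_taken : out std_logic;
    -- update from the branch resolved in EX1
    update_en     : in  std_logic;
    update_pc     : in  std_logic_vector(31 downto 0);
    actual_taken  : in  std_logic
  );
end entity;

architecture rtl of branch_predictor is
  -- 256-entry table of 2-bit counters; "10" = weak taken is the initial value,
  -- so a loop is mispredicted only once.
  type counter_table is array (0 to 255) of unsigned(1 downto 0);
  signal table : counter_table := (others => "10");
  signal lookup_idx, update_idx : integer range 0 to 255;
begin
  lookup_idx <= to_integer(unsigned(lookup_pc(9 downto 2)));
  update_idx <= to_integer(unsigned(update_pc(9 downto 2)));

  -- predict taken when the counter's MSB is set (weak or strong taken)
  predict_taken <= table(lookup_idx)(1);

  -- saturating-counter update for every resolved branch
  process (clk)
    variable c : unsigned(1 downto 0);
  begin
    if rising_edge(clk) then
      if update_en = '1' then
        c := table(update_idx);
        if actual_taken = '1' and c /= "11" then
          c := c + 1;
        elsif actual_taken = '0' and c /= "00" then
          c := c - 1;
        end if;
        table(update_idx) <= c;
      end if;
    end if;
  end process;
end architecture;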
The block diagram of our branch predictor is as follows.
(Block diagram: branch predictor (BP) in IF2 and branch comparators, shown across the IF1, IF2, ID, and EX1 stages.)
1.3 Stream Buffer ( schematic in Viewdraw )
(Diagram: the stream buffer sits between the DRAM and the instruction cache.)
The stream buffer connects to the memory and the instruction cache as indicated in the above diagram.
It extends the bandwidth of the interface between the instruction cache and the memory. On each instruction fetch from memory, the DRAM sends the required 4 words to the instruction cache and the following 4 words to the stream buffer. The contents of the stream buffer are copied into the instruction cache when an instruction misses in the cache but hits in the stream buffer. When fetching instructions, the pipeline is only stalled during the data transfer between the DRAM and the instruction cache. A sketch of this hit/refill behaviour is given below.
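The following is a minimal VHDL sketch of the hit/refill behaviour described above. Port names are hypothetical, the block address is assumed to be the 32-bit address without the 4-bit line offset, and the 4-word fill is modelled as a single 128-bit transfer; in the real design the words arrive from the DRAM one at a time.

library ieee;
use ieee.std_logic_1164.all;

entity stream_buffer is
  port (
    clk, reset      : in  std_logic;
    -- I-cache side: block address of the instruction that missed
    miss_block_addr : in  std_logic_vector(27 downto 0);
    sb_hit          : out std_logic;
    sb_data         : out std_logic_vector(127 downto 0);  -- the buffered 4-word line
    -- refill side: the 4 words that follow the line just sent to the I-cache
    fill_en         : in  std_logic;
    fill_block_addr : in  std_logic_vector(27 downto 0);
    fill_data       : in  std_logic_vector(127 downto 0)
  );
end entity;

architecture rtl of stream_buffer is
  signal valid      : std_logic := '0';
  signal block_addr : std_logic_vector(27 downto 0) := (others => '0');
  signal line       : std_logic_vector(127 downto 0) := (others => '0');
begin
  -- Hit when the missing block is the one we buffered; the controller then
  -- copies the line into the I-cache instead of going to the DRAM.
  sb_hit  <= '1' when valid = '1' and miss_block_addr = block_addr else '0';
  sb_data <= line;

  process (clk)
  begin
    if rising_edge(clk) then
      if reset = '1' then
        valid <= '0';
      elsif fill_en = '1' then
        -- capture the 4 words following the line just sent to the I-cache
        valid      <= '1';
        block_addr <= fill_block_addr;
        line       <= fill_data;
      end if;
    end if;
  end process;
end architecture;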
In the final version the size of the stream buffer is 4 words. This comes from the performance trade-off between reducing the refill penalty for about half of the instruction misses and increasing the stalls caused by conflicting instruction and data requests to the memory. The 4-word buffer is chosen instead of the 8-word buffer given in the provided specs because a smaller stream buffer is expected to perform better on a small test program. The details of this consideration are in the trade-off section.
Originally, the stream buffer was designed and implemented with an instruction pre-fetch feature and a buffer size of 8 words. With this feature, 4 words are still sent to the stream buffer during instruction fetches; in addition, when the stream buffer controller finds that one of its two 4-word blocks is empty and the pipeline is running without memory accesses, it sends a request to the arbiter to fetch 4 more words from memory. Stream buffer pre-fetching has lower priority than the instruction cache and data cache and is killed when either cache issues a request. This pre-fetch feature was disabled during final processor debugging because it involves much more complicated cooperation within the memory system and the debugging time was limited. Given another chance to improve the processor, this feature would be finished and is believed to be helpful for performance.
1.4 Victim Cache
The victim cache serves as a cache for the data cache. When dirty data are written back to DRAM, they are also written to the victim cache for later reference. Data that are not dirty are also written to the victim cache before they are overwritten in the data cache. The time to access data in the victim cache is smaller than the time to access data in the DRAM, so the victim cache helps to improve the performance of the processor.
In our design, the victim cache consists of the block cell and the victim cache controller. The block cell is composed of 16 memory cells for four 4-word cache lines and 4 memory cells for the tags; each cell stores 1 word (4 bytes) of data. The row decoder selects which cache line is being read or written, and the column decoder selects a word within the cache line. Four comparators compare the input tag with the tags of the cache lines; their outputs (HIT0-3) are OR-ed together to determine a hit or miss in the victim cache. The HIT signal stops the arbiter from issuing a memory read request. The victim cache controller reads the HIT0-3 signals and uses a FIFO algorithm to determine which cache line to replace. A sketch of this tag-compare and FIFO replacement logic is given below.
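The following is a minimal VHDL sketch of the tag-compare and FIFO replacement logic described above (HIT0-3 OR-ed into HIT, and a wrapping pointer choosing the next line to replace). Entity and port names, and the 28-bit block-address tag, are assumptions for illustration; the data array and the controller handshaking are omitted.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity victim_cache_tags is
  port (
    clk, reset : in  std_logic;
    req_tag    : in  std_logic_vector(27 downto 0);  -- block address of the request
    hit        : out std_logic;
    hit_way    : out std_logic_vector(1 downto 0);
    -- a line evicted from the data cache is written into the FIFO-selected way
    alloc_en   : in  std_logic;
    alloc_tag  : in  std_logic_vector(27 downto 0);
    alloc_way  : out std_logic_vector(1 downto 0)
  );
end entity;

architecture rtl of victim_cache_tags is
  type tag_array is array (0 to 3) of std_logic_vector(27 downto 0);
  signal tags     : tag_array := (others => (others => '0'));
  signal valid    : std_logic_vector(3 downto 0) := (others => '0');
  signal fifo_ptr : unsigned(1 downto 0) := "00";   -- next way to replace
  signal hits     : std_logic_vector(3 downto 0);
begin
  -- four comparators, one per line (HIT0-3), OR-ed into the HIT signal
  gen_cmp : for i in 0 to 3 generate
    hits(i) <= '1' when valid(i) = '1' and tags(i) = req_tag else '0';
  end generate;
  hit <= hits(0) or hits(1) or hits(2) or hits(3);
  hit_way <= "00" when hits(0) = '1' else
             "01" when hits(1) = '1' else
             "10" when hits(2) = '1' else "11";  -- don't care when hit = '0'

  alloc_way <= std_logic_vector(fifo_ptr);

  -- FIFO replacement: evicted lines fill ways 0, 1, 2, 3, 0, ... in order
  process (clk)
  begin
    if rising_edge(clk) then
      if reset = '1' then
        valid    <= (others => '0');
        fifo_ptr <= "00";
      elsif alloc_en = '1' then
        tags(to_integer(fifo_ptr))  <= alloc_tag;
        valid(to_integer(fifo_ptr)) <= '1';
        fifo_ptr <= fifo_ptr + 1;   -- wraps from 3 back to 0
      end if;
    end if;
  end process;
end architecture;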
1.5 Write Buffer
We decided to add a write buffer between the data cache and memory. Without the write buffer, each write miss stalls the pipeline to fetch four words from memory into the data cache, per our write-back policy. With a write buffer, the pipeline writes the word being stored to the write buffer, and the write buffer writes it back to memory afterwards, so the pipeline can proceed without stalling as long as the data cache is not being accessed during the write-back process.
We designed a five-word write buffer, in which one word is for the word being stored and the other four are for writing back the dirty cache line. The write buffer itself is quite simple: five word registers and two address registers, plus a couple of muxes. The write buffer is controlled by the data cache controller, however, and this greatly complicates the controller design.
The revised data cache controller works as follows.
1. Not-dirty read miss: stall the pipeline, read 4 words from memory, transfer the replaced cache line to the victim cache during the waiting time, send one word to the processor, then release the pipeline. (10-cycle penalty)
2. Dirty read miss: stall the pipeline, send a read request to memory, transfer the dirty cache line to the write buffer and the victim cache during the waiting time, write the refilled cache line, send one word to the processor, release the pipeline, then write back from the write buffer to memory. (10-cycle penalty)
3. Not-dirty write miss: write the word being stored to the write buffer, send a read request to memory, transfer the replaced cache line to the victim cache during the waiting time, write the refilled cache line, then write back from the write buffer to memory. (no penalty!)
4. Dirty write miss: write the word being stored to the write buffer, send a read request to memory, transfer the replaced cache line to the write buffer and the victim cache during the waiting time, write the refilled cache line, then write back from the write buffer to memory. (no penalty!)
The state transition diagram for "sw" with a dirty cache miss involves the following states: writing the "sw" word into the write buffer, writing the dirty cache line into the write buffer, writing the buffered cache lines to memory, writing the stored word into the cache, and loading 4 words from memory into the cache. A simplified FSM sketch of this sequence is given below.
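A much-simplified VHDL sketch of this state sequence follows. State and signal names are hypothetical, the memory read request and the victim cache copy are assumed to be issued by surrounding logic, and in the real controller these steps overlap with the memory wait so that the pipeline itself is not stalled.

library ieee;
use ieee.std_logic_1164.all;

entity sw_dirty_miss_fsm is
  port (
    clk, reset    : in  std_logic;
    start         : in  std_logic;  -- a sw misses on a dirty line
    mem_read_done : in  std_logic;  -- 4 words have arrived from memory
    wb_write_done : in  std_logic;  -- write buffer finished writing back to memory
    busy          : out std_logic
  );
end entity;

architecture rtl of sw_dirty_miss_fsm is
  type state_t is (idle, sw_to_wbuf, line_to_wbuf, fill_from_mem,
                   sw_to_cache, wbuf_to_mem);
  signal state : state_t := idle;
begin
  busy <= '0' when state = idle else '1';

  process (clk)
  begin
    if rising_edge(clk) then
      if reset = '1' then
        state <= idle;
      else
        case state is
          when idle =>
            if start = '1' then
              state <= sw_to_wbuf;      -- write the "sw" word into the write buffer
            end if;
          when sw_to_wbuf =>
            state <= line_to_wbuf;      -- copy the dirty line into the write buffer
                                        -- (and victim cache) while memory is busy
          when line_to_wbuf =>
            state <= fill_from_mem;
          when fill_from_mem =>
            if mem_read_done = '1' then
              state <= sw_to_cache;     -- the refilled line is now in the cache
            end if;
          when sw_to_cache =>
            state <= wbuf_to_mem;       -- merge the stored word into the new line
          when wbuf_to_mem =>
            if wb_write_done = '1' then
              state <= idle;            -- drain the write buffer to memory
            end if;
        end case;
      end if;
    end if;
  end process;
end architecture;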
2. Trade-offs
2.1 Deep Pipelining Processor
There is always a big trade-off between performance and design complexity, and the deep pipelining processor is no exception. The major trade-off within the 8-stage processor itself is between stalls and forwarding: in order to minimize the number of stalls, more forwarding paths are required, and thus more functional units are used in our processor. The following trade-offs appear in our design.
1. Three stalls versus one stall for a lw followed by an address-dependent lw/sw instruction: to avoid three stalls, we moved the address calculation to the MEM1 stage, at the cost of one additional adder and one forwarding mux for the address operand rs.
2. Three stalls versus zero stalls for a lw followed by a sw: to avoid unnecessary stalls in this case, we have a forwarding mux in the MEM2 stage for the rt operand of the sw.
3. One stall versus zero stalls for an add, subtract, shift, or slt operation followed by a dependent logic operation: to avoid this stall, we implemented the logic operations at the gate level, so their results are available in EX1.
4. One stall versus zero stalls for an add, subtract, shift, or slt operation followed by an address-dependent lw/sw instruction: by moving the address calculation into MEM1, as in the first case, we have essentially eliminated this penalty.
Moreover, with respect to jr, we forward the dependent result in the earliest possible stage, i.e., the ID stage. Yet, even with this forwarding, we still need to flush one instruction in order to meet the MIPS convention of one jump delay slot.
2.2 Branch Predictor
We use a large number of prediction table entries in order to reduce the chance of instructions overlapping in the same entry. The trickiest part of the branch predictor design is the coordination with pipeline stalls and with nearby jump instructions: the predicted value needs to be blocked when the pipeline is stalled.
For the combination branch + some instruction + jump, the branch decision from EX1 and the jump decision from IF2 both arrive at IF1 to select the next PC in the same cycle. The next PC should be decided by the branch instruction, since it comes before the jump. A sketch of this next-PC priority is given below.
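A minimal VHDL sketch of this next-PC priority, with hypothetical signal names, is shown below: the branch resolved in EX1, being the older instruction, wins over the jump decided in IF2, which in turn wins over the prediction and the sequential PC.

library ieee;
use ieee.std_logic_1164.all;

entity next_pc_select is
  port (
    pc_plus_4        : in  std_logic_vector(31 downto 0);
    predicted_target : in  std_logic_vector(31 downto 0);
    predict_taken    : in  std_logic;                      -- from the predictor in IF2
    jump_target      : in  std_logic_vector(31 downto 0);
    jump_taken       : in  std_logic;                      -- decided in IF2
    branch_target    : in  std_logic_vector(31 downto 0);
    branch_redirect  : in  std_logic;                      -- misprediction resolved in EX1
    next_pc          : out std_logic_vector(31 downto 0)
  );
end entity;

architecture rtl of next_pc_select is
begin
  -- The oldest instruction wins: a branch resolving in EX1 overrides a jump
  -- in IF2, which overrides the predictor, which overrides sequential fetch.
  next_pc <= branch_target    when branch_redirect = '1' else
             jump_target      when jump_taken = '1'      else
             predicted_target when predict_taken = '1'   else
             pc_plus_4;
end architecture;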
2.3 Stream Buffer
The performance gain the stream buffer brings to the pipelined processor involves the trade-off between reducing the refill penalty for about half of the instruction misses and increasing the stalls caused by conflicting instruction and data requests to the memory. The stream buffer achieves less gain for small application programs because there are few instruction misses. The simulation results for a 4-word stream buffer are shown below:
Application              w/o Stream Buffer (cycles)   With Stream Buffer (cycles)   Performance Gain   Program Size (bytes)
Lab5 Mystery Program     2445                         2689                          8%                 10 k
Lab6 Mystery Program     10877                        11083                         -1.9%              3 k
Project Test Program     2951                         2664                          3.4%               4 k
Here we can see that for a program of small size the stream buffer can even increase the execution time. With this in mind we chose a 4-word stream buffer, because the number of stalls incurred by the stream buffer increases with its size, which would lead to even worse performance for a small program.
2.4 Victim Cache
In order to improve the performance of the pipeline processor, we choose to implement the victim cache.
Due to the principle of locality, data replaced in the data cache might be referenced again in the near future. These data are stored temporarily in the victim cache for later reference. We hoped that the victim cache would reduce data cache misses and hence speed up our processor.
Instead of putting the victim cache between the data cache and the arbiter, we chose to place it alongside the data cache and design it as an add-on module. There are three benefits to this decision.
First, the victim cache can be used in our previous design without modifying the data cache or the arbiter.
This helped us in the testing phase because we were able to test the victim cache in our 5-stage pipeline (from lab6) before testing it with our 8-stage pipeline processor. Second, the victim cache can be disabled or enabled without affecting the operation of the pipeline (even though it might affect the performance of the pipeline). This feature helped us narrow down whether a bug was in the victim cache or in another part of our design. Finally, we were able to measure the effect of the victim cache on the pipeline processor: the effect is the difference in performance when the victim cache is enabled versus disabled.
2.5 Write Buffer
We designed a 5-word write buffer instead of the 4 words in the spec. This avoids stalling the pipeline on dirty write misses, because the number of words to be transferred is five. For dirty read misses, we release the stall right after the data cache gets the cache line from memory, before writing back the replaced data.
In this way, we reduce the penalty for dirty read misses from 19 cycles to 10 cycles. For memory reads, since we need to wait 5 cycles between sending out the request and getting the first data word, we decided to "steal" 4 cycles from the waiting time to transfer the replaced cache line from the data cache to the write buffer.
The performance gain of the write buffer is fairly large. For the dirty write miss case, the total number of stalled cycles is reduced from 19 to zero, and for a not-dirty write miss the number of cycles saved is 9. For a typical program in which stores are 15% of instructions, the gain in CPI is 2.7, assuming that all the stored words are dirty.
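As a rough check, assuming every store is a dirty write miss whose stall is removed by the write buffer, and an average saving on the order of 18-19 cycles per store:

\[
\Delta\mathrm{CPI} \;\approx\; f_{sw} \times (\text{cycles saved per store}) \;\approx\; 0.15 \times 18 \;\approx\; 2.7
\]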
3. Verification
This project comprises 5 components. The victim cache, write buffer, and stream buffer are parts of the cache system, while the branch predictor interacts closely with the deep-pipelined processor. Thus, in the first stage each component was tested independently, while in the second stage the verification was divided into 2 groups: the 3 cache features were added to the previous 5-stage pipeline and tested one after another, while the branch predictor was tested with the 8-stage deep pipeline. Finally, the enhanced cache system was added to the pipeline with the branch predictor.
In short, the verification of this project was divided into groups of dependent components and carried out from the basic level up the hierarchy. We used the lab5, lab6, and project-released mystery programs as a good sequence of testing programs, together with input-vector command files and some simple test programs.
3.1 Deep Pipelining Processor
To make sure the modified processors worked (the original 5-stage pipelined processor, the 7-stage, the 8-stage, and the 8-stage pipelined processor with caches), we tested the whole processor using some critical test files written by ourselves (basically covering the corner cases we came up with) and the lab5 mystery test file. When we tested the 8-stage pipelined processor, we had to add delay slots to all test files: three delay slots for all branch instructions and two for the jump register instruction. Although it was much more complex to test this pipeline, we followed the same testing rules for the more complicated processor. Whenever we found an instruction that was not executed as expected, we used both the schematic and the waveforms to trace the instruction and check whether we got the expected forwarding results, computational results, and control signals in each stage, where applicable.
When we got new components from the other members who were designing the branch predictor, stream buffer, victim cache, and write buffer, we added them to the 8-stage pipelined processor one by one instead of combining all of them with the processor at once. This approach added complexity to the processor gradually, which made testing relatively easy. After one component passed its tests, the next component was added in.
We found it very convenient to use the "break" command when debugging the processor, since it leads directly to the point we were interested in.
3.2 Branch Predictor
To make sure that the branch prediction and flush mechanisms work correctly, we separated the testing process into two phases. In phase one, we disabled the predictor, so that every prediction is not-taken. We also ran a working 5-stage pipeline on the same test program in order to compare the two, especially checking that the PC and instruction sequences are the same. In the second phase, we added the predictor back in and compared the two results again.
3.3 Stream Buffer
The stream buffer was first tested with input vectors written in a command file, and then incorporated into the pipelined processor and tested with the mystery programs. Since it is functionally independent of the other features in the project, further testing with the other features included caused no trouble with the stream buffer.
3.4 Victim Cache
We performed the following verification to make sure the victim cache functions as designed:
1. Since the victim cache communicates with the data cache controller and the arbiter, we tested the victim cache with a fake data cache controller and a fake arbiter, which we modified from the VHDL files of the cache controller and the arbiter. The purpose of this test was to verify the control signals between the victim cache, the cache controller, and the arbiter.
2. Testing the victim cache with our working 5-stage pipeline from lab6.
3. Testing the victim cache in our new 8-stage pipeline design.
The results from the lab5_mystery.s, lab6_mystery.s, and partial_sum.s programs show that the victim cache works with both our 5-stage and 8-stage pipeline processors.
3.5 Write Buffer
To make sure the revised cache controller works, we first used a simple test program to check that "lw" and "sw" go through all the states of the designed FSM. Next, we added the write buffer to the pipeline and compared the cache contents with the correct results.
Results:
1. Performance Analysis
1.1 Deep Pipelining Processor
In comparison with the 5-stage pipeline, the number of cycles increases because of the extra stalls. Yet we have reduced our cycle time to approximately half of the original.
Cycle time
              Five-Stage Processor   Eight-Stage Processor
Two Memory    43 ns                  20.5 ns
Two Caches    52 ns                  30 ns
The number of cycles with the lab5 mystery program
                                                               Two Memory    Two Caches
5-Stage Processor                                              1200 cycles   2700 cycles
8-Stage Processor (w/o branch predictor)                       1390 cycles   6500 cycles
8-Stage Processor (with victim cache, stream & write buffer)   -             4600 cycles
1.2 Branch Predictor
We ran lab6_mystery.s using our branch prediction. Since the quick sort has only five branches, and most of the branch decisions are taken, the performance gain of our branch prediction is large. Compared to the run without the prediction mechanism, we got a performance gain of approximately 15 percent.
1.3 Stream Buffer
The stream buffer achieves a good performance gain for large programs, but a small or even negative gain for small programs. This is because there are few instruction misses when executing small programs, and the extra stalls introduced by the transfers between the stream buffer and the arbiter hurt the performance. The comparison of running different programs before and after adding the stream buffer is shown below:
Application              w/o Stream Buffer (cycles)   With Stream Buffer (cycles)   Performance Gain
Lab5 Mystery Program     1645                         1589                          10%
Lab6 Mystery Program     10877                        11083                         -1.9%
Project Test Program     2951                         2664                          3.4%
1.4 Victim Cache
The results of our experiments on the performance of the victim cache, using the lab6_mystery.s program and our 5-stage pipeline processor, are shown in the following table:

Test condition                              Number of hits   Without victim cache   With victim cache   Speedup
lab6_mystery.s, 5-stage pipeline,           15               10,006 cycles          10,142 cycles       -1.36%
only dirty data written to victim cache
In our experiment, the victim cache makes the pipeline performance worse because:
1. The cycles saved by the 15 hits do not compensate for the other overhead the victim cache adds to the pipeline, such as extra wait cycles.
2. The victim cache causes more read misses in the data cache, as observed from the simulation waveforms. This can be explained as follows: the data cache uses a random algorithm to determine which data to replace, but our hardware can only generate a pseudo-random sequence, i.e., the sequence is the same for every simulation, and read hits in the victim cache shift this sequence. For example, suppose that at clock cycle 100 the number generated is 0, so data X in the lower half of the block is replaced while data Y in the upper half stays in the data cache; there is then no read miss in the data cache if Y is referenced later. If instead there is a read hit in the victim cache before cycle 100, the data cache reference occurs earlier (since a read hit in the victim cache saves some cycles), say at cycle 95, and the number generated is 1, so data Y in the upper half of the block is replaced. Later, when data Y is referenced, we get a read miss in the data cache.
This result does not show the true effect of the victim cache on the performance of the pipelined processor:
1. One test is not enough to measure the performance of the victim cache.
2. There are very few hits in the victim cache. The 15 hits in 10,000 cycles correspond to the number of times dirty data got written back to DRAM; data that are not dirty are not written to the victim cache.
3. There would be more victim cache hits if not-dirty data were also written to the victim cache.
4. The algorithm the data cache uses to determine which data to replace depends on a sequence of 0s and 1s, which in turn depends on when read hits occur in the victim cache.
1.5 Write Buffer
We ran lab6_mystery.s with the write buffer. Since we save a lot of cycles on store and load words, the performance gain is considerable. One problem is that if a subsequent memory access arrives before the previous write-back finishes, the pipeline needs to be stalled. Since reads and writes in the test program are frequent, the stalls cannot be eliminated completely. We got a 15% performance gain with the write buffer.
2. Critical path analysis:
The critical path in 8-stage pipelining processor with I-cache and D-cache
--any instruction followed by dependent branch
EX1 pipeline register + forwarding mux + branch comparator + branch logic gate + PCsrc mux + J mux
= 3 + 3.5 + 5 + 5 + 2.5 + 1.5
= 20.5 ns
The critical path in 8-stage pipelining processor with I-cache and D-cache
--LW followed by JR
MEM2 pipeline register + D-cache access time + mem-to-reg mux + jr forwarding mux + J mux
= 3 + 19.5 + 1.5 + 3.5 + 2.5
= 30 ns
Conclusion:
In this project we were aiming for a deeply pipelined processor with four performance-enhancing features.
In the end, the pipeline works well with the other 3 memory system features, while the branch predictor turned out to be more complicated than we expected.
Again in this project we experienced a situation similar to lab6: features were designed to be complex for the sake of the performance goal, without enough consideration of the time and effort needed for testing and interface debugging. The most important lesson we learned from this class is that a good testing methodology is essential for a good design, and that scheduling design complexity according to the goal and the available design time is critical to making sure a good design can be implemented on time.
We would like to sincerely thank the professor and the TAs for their help during this class. We really appreciate the experience of processor design throughout the semester. What we learned is not only the computer architecture course material, but also the skills of designing and testing a complete system.
Appendix I (schematics or block diagram):
Deep Pipelining
8_stage
Branch Predictor
Branch_predictor.1
Stream Buffer
stream buffer.l
Victim Cache
Victim Cache block diagrams.doc
Write Buffer
writebuffer
Appendix II (VHDL files):
The following is the VHDL source code for the components added or changed in lab7.
Deep Pipelining
Ex1_forwarding.vhd
Mem1_forwarding.vhd
Mem2_forwarding.vhd
Stall.vhd
Branch Predictor
Branch_predictor.vhd
Stream Buffer
stream buffer control.vhd
Victim Cache
victim_cache_cntrl.vhd
arbitor71.vhd
fake_arbitor.vhd
fake_cache_cntrl.vhd
Write Buffer
Cache_ctrl.vhd
Appendix III(diagnostic programs):
Deep Pipelining
Pipe.cmd
Pipe2.cmd
Pipe3.cmd
Branch Predictor
Branch.cmd
Stream Buffer
stream buffer.cmd
lab5 mystery program
lab6 mystery program
partial sum program
Victim Cache
victim_cache.cmd
v_d_a.cmd
Write Buffer
Wirte_buffer.cmd
Appendix IV (online notebooks):
Notebooks for Gerald: Gerald_notebook.txt
Notebooks for Huifang: Huifang_notebook.doc
Notebooks for Kahn: Kahn_notebook.doc
Notebooks for Mei: Mei_notebook.txt
Notebooks for Yuchi: Yuchi_notebook.txt