Final Project: Super-Scalar, Stream Buffer, Victim Cache, CAM-based Cache

Group: Steve Fang, Kent Lin, Jeff Tsai, Qian Yu

Abstract

The final design is a super-scalar MIPS microprocessor, which can handle two instructions in parallel when there are no dependencies between them. Because more instructions are executed per cycle than in the single-issue processor, instruction cache misses become a significant bottleneck, so a stream buffer is included to relieve this problem. Additionally, a victim cache is added to reduce the miss rate of the data cache. Together, these two features address the problem of very costly DRAM accesses. Finally, a CAM-style cache organization has been added to the first-level cache.

Division of Labor

Implementation of the super-scalar design was split into three parts: schematic drawing and wiring, control logic and hazard handling, and memory. Memory consists of the memory controller for the new DRAM component, the stream buffer, and the victim cache. The assignment breakdown was: Jeff handled the memory controller, Qian worked on the datapath drawing/wiring, Steve worked on the stream buffer, and Kent worked on the victim cache. Each member was responsible for testing and verifying his assigned component. In addition, everyone contributed to the control logic and hazard handling because of the complexity of that subject.

Detailed Strategy

Datapath

The structure of the datapath itself is fairly simple; much like the single-issue, 5-stage pipeline covered in class, the super-scalar version features the same stages, but many components are duplicated to allow for complete parallelism. Moreover, the datapath is aligned so that only even instructions can go through the top (even) pipeline, while only odd instructions can enter the bottom (odd) pipeline.

[High-level diagram: a single Fetch stage feeds two parallel pipelines, even and odd, each with its own Decode, Execution, Memory, and Write-back stages.]

In general, many components of the datapath were duplicated for the parallel execution of the even and the odd instructions, while others were modified with more ports. As a result, the instruction bandwidth of the datapath increases, and so does the throughput (instructions per cycle) of the processor. The following table presents a big picture of how the classic 5-stage datapath was changed at each pipeline stage.

Table 1. Datapath Modification

STAGE  COMPONENTS DUPLICATED               COMPONENTS WITH WIDER BANDWIDTH
IF     Pipeline registers                  Instruction Cache outputs
ID     Instruction Decode Controller,      Register File outputs;
       Extender, Branch Comparator,        Hazard Controller inputs and outputs;
       Forwarding muxes, Branch PC bus,    Stall Unit inputs;
       Jump PC bus, Pipeline registers     Forwarding mux inputs
EX     ALU, SLT Unit, Shifter,             None
       Pipeline registers
MEM    Pipeline register,                  None
       Memory Source Muxes
WB     Register File Source Select Mux,    Register File inputs
       Monitors

In the instruction fetch stage, the output bandwidth of the cache is doubled so that the datapath can fetch both the even and odd instructions in parallel. Since the even and the odd instructions sit in the same 2-word cache line, this change was made in the cache datapath. On the other hand, the data cache of the MEM stage does not need to be modified, because our memory system allows only one memory access at a time; the stall unit of the decode stage separates two parallel memory-access instructions.
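As a rough illustration of the doubled fetch bandwidth, the sketch below shows how a 2-word cache line could map onto the two issue slots. This is a minimal sketch with names of our own choosing, assuming the even instruction occupies the lower word of the line; it is not the actual cache datapath.

    library ieee;
    use ieee.std_logic_1164.all;

    -- Hypothetical sketch: splitting a 2-word I-cache line into the two issue slots.
    entity fetch_split is
      port (
        cache_line : in  std_logic_vector(63 downto 0);  -- 2-word line from the I-cache
        instr_even : out std_logic_vector(31 downto 0);  -- goes to the even pipeline
        instr_odd  : out std_logic_vector(31 downto 0)   -- goes to the odd pipeline
      );
    end fetch_split;

    architecture rtl of fetch_split is
    begin
      -- Assumption: the even instruction is the lower word, the odd the upper word.
      instr_even <= cache_line(31 downto 0);
      instr_odd  <= cache_line(63 downto 32);
    end rtl;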
The major modifications of the datapath are in the instruction decode stage and the execution stage. The components that operate on an individual instruction, such as the branch comparator, extender, and ALU, are duplicated. The hazard and stall units, however, have to take the instructions of the ID, EX, and MEM stages from both the even and odd pipelines in order to determine dependencies. One special case is the register file: to work with the super-scalar scheme, it must be able to read and write two different registers, as well as the same register, in parallel. If both the even and the odd instructions write back to the same register file location, the odd instruction must win, since it is later in program order; this prevents a WAW hazard.

Beyond the general modifications, the 2-way super-scalar pipeline adds complexity in calculating new PCs. The following table shows the PC calculation for the different types of instructions.

Table 2. PC Calculation

INSTRUCTION TYPE             NEW PC CALCULATION
Arithmetic, Logic, Memory    PC = PC + 8
Even Branch                  PC = PC + 4 + Branch Value
Odd Branch                   PC = PC + 8 + Branch Value
Jump, Jr                     PC = Jump Value
Even JAL                     PC = Jump Value; $R31 = PC + 4
Odd JAL                      PC = Jump Value; $R31 = PC + 8

To give a more intuitive view of the implementation, schematic captures of the branch and jump PC calculations are presented below.

Picture 1 shows the implementation of the branch PC calculation. We cannot use PC[31:0] for the calculation of the even branch address, since PC = PC + 8 in the super-scalar pipeline. Therefore, PCD[31:3] (the PC of the decode stage, i.e. the "current" PC) is used as the base PC, and the result signal of the even branch comparator is used as the third bit, which in effect adds 4 to the base PC.

Picture 1. Branch PC calculation

Picture 2 shows the implementation of the jump PC calculation. The first-level mux selects between Jump and Jr instructions, whereas the second-level mux selects between the even and odd instructions.

Picture 2. Jump PC calculation

Super-scalar Structure

As mentioned before, the microprocessor features two pipelines for parallel operation, and complicated hazard and stall cases arise from this.

Stalls

There are six different stall cases that must be handled. The first two occur when either one of the instructions in the execution stage is a branch or jump instruction; in that case, the delay slot must be handled. More precisely, if the branch occurred in the even location, then the delay slot has already been executed in parallel with it, so the instructions in the decode stage (both even and odd) must be replaced with no-ops. (See Figure 1.)

Figure 1 – Both instructions in the decode stage need to be ignored. [Diagram: the BRANCH is in the even pipeline, with its delay slot beside it in the odd pipeline.]

However, if the branch is in the odd pipeline, then the even instruction at the decode stage needs to be executed while the odd one must be swapped with a no-op. (See Figure 2.)

Figure 2 – Since the branch now occurs in the odd pipeline, the delay slot comes after it, but the following instruction must be ignored. [Diagram: the BRANCH is in the odd pipeline; the delay slot is the next even instruction, and the odd instruction beside it is the one to ignore.]

The third case occurs when the instruction in the execution stage is a JAL or a LW and the even instruction in the decode stage depends on it. In that case forwarding cannot resolve the dependency, and both instructions in the decode stage must be stalled. (See Figure 3.)

Figure 3 – A load word followed by an instruction that uses the loaded data forces a stall in both pipelines. [Diagram: ADDIU $1, $3, 2 in the even pipeline and LW $1, 0($10) in the odd pipeline.]
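To make this third case concrete, the following minimal sketch shows one way the detection condition could be expressed. The entity and signal names are illustrative, not taken from the actual design, and the real hazard controller covers many more cases than this one.

    library ieee;
    use ieee.std_logic_1164.all;

    -- Hypothetical sketch of stall case 3: a LW (or JAL) in the execution
    -- stage whose destination register is read by the even decode-stage
    -- instruction. Forwarding cannot help, so both decode slots must stall.
    entity lw_stall_case3 is
      port (
        ex_is_lw   : in  std_logic;                     -- EX instruction is a LW
        ex_is_jal  : in  std_logic;                     -- EX instruction is a JAL
        ex_dest    : in  std_logic_vector(4 downto 0);  -- EX destination register
        id_even_rs : in  std_logic_vector(4 downto 0);  -- even decode source registers
        id_even_rt : in  std_logic_vector(4 downto 0);
        stall_both : out std_logic                      -- stall both decode instructions
      );
    end lw_stall_case3;

    architecture rtl of lw_stall_case3 is
    begin
      stall_both <= '1' when ((ex_is_lw = '1') or (ex_is_jal = '1')) and
                             ((ex_dest = id_even_rs) or (ex_dest = id_even_rt))
                    else '0';
    end rtl;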
The fourth stall case is when the same thing happens for the odd instruction at the decode stage. Here, the even instruction is allowed to execute while the odd instruction is stalled for one cycle. Afterwards, the even instruction is replaced with a no-op, since it has already been executed once, and the odd instruction is permitted to enter the execution stage.

The fifth case is when both instructions in the decode stage access memory. Since there is only one data cache, a parallel access is impossible, so the even instruction goes first while the odd is stalled until the following cycle.

Finally, the last case occurs when the odd instruction in the decode stage has a dependency on the even instruction in the same stage. Because both instructions are still in the decode stage, nothing has been calculated yet, so forwarding is impossible. Therefore, this is handled identically to the previous case: the odd instruction is stalled for one cycle. (See Figure 4.)

Figure 4 – These two instructions cannot be executed in parallel, so the even instruction is executed first while the odd one is stalled until the following cycle. [Diagram: SW $1, 0($10) in the even pipeline and SW $3, 4($10) in the odd pipeline.]

Because of the alignment of the pipelines, a major problem occurs when there is a branch or jump to an odd instruction. Instructions come in pairs in this processor, so when the target of the branch is an odd instruction, it is crucial that the even instruction fetched alongside it (the instruction just before the target) is not executed. Thus, there is also a branch handler that watches for branches and jumps to odd addresses. Once the branch is taken, the stall handler discussed above handles the delay slot so that only one instruction is executed there. Then, after the new PC is calculated and its corresponding instruction fetched, the branch handler checks the target PC: if the target PC is even, the pipeline acts as normal, whereas if the target is odd, the handler ignores the even instruction. (See Figure 5.)

Figure 5 – If a branch/jump targets an odd instruction, the corresponding even instruction must be ignored. This is handled by a special branch/jump handler that is separate from the stall handler. [Diagram: the branch and its delay slot execute; in the fetched target pair, only the odd instruction, the branch target, is kept.]

Stream Buffer

The stream buffer sits between the first-level cache and the DRAM. With a slow memory access time, an instruction fetch on every cycle would be very inefficient. The stream buffer improves efficiency because it performs a burst to fetch four instructions on every miss in the stream buffer. One normal read from the DRAM takes 9 cycles, but one burst (two consecutive reads) takes 10 cycles. There are three cases in which a burst will occur:

Case 1: At the very beginning of the program, the stream buffer is empty and needs to fetch the first four instructions from the DRAM.

Case 2: A branch/jump goes to an instruction that is neither in the stream buffer nor in the first-level cache.

Case 3: Two hits in the stream buffer are followed by a miss in the stream buffer, meaning that the next instruction has not been prefetched.

Stream Buffer Controller

The stream buffer controller works on the positive edge of a phase-shifted clock; the delay of the phase shift is 10 ns. The reason for this phase-shifted clock is that the registers in the first-level cache work on the negative edge of the clock.
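For illustration, here is a simulation-only sketch of such a phase-shifted clock. The 10 ns delay is from the report, but the entity and signal names are ours, and how the shifted clock is generated in the actual design may differ (e.g. a physical delay element).

    library ieee;
    use ieee.std_logic_1164.all;

    -- Hypothetical sketch: a clock delayed by the 10 ns phase shift.
    entity phase_shift is
      port (
        clk         : in  std_logic;
        clk_shifted : out std_logic
      );
    end phase_shift;

    architecture behavioral of phase_shift is
    begin
      -- Transport delay reproduces every edge of clk exactly 10 ns later.
      clk_shifted <= transport clk after 10 ns;
    end behavioral;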
The request signal from the first-level cache goes up after 5 ns if there is a cache miss. Therefore, by using the phase-shifted clock, we can send a fake wait_sig to the cache to tell it to stall the instruction (before the negative edge of the normal clock) if there is a miss in the stream buffer. The trick is to operate between the phase-shifted clock for the stream buffer controller and the negative edge of the normal clock for the cache and pipelined datapath.

Without the stream buffer, the first-level cache operates mainly by looking at the wait_sig from the DRAM. Now, with the stream buffer, the stream buffer controller takes in the wait_sig from the DRAM and sends a fake wait_sig to the first-level cache. This is necessary because, during burst mode, the DRAM controller drops its wait_sig low for one cycle, and this signal cannot be sent directly to the first-level cache; hence the fake wait_sig, which is a continuously high signal as the specification expects. The dip in the DRAM's wait_sig is important to the stream buffer controller, however: it uses the first dip to enable the first half of the stream buffer to store the first two instructions, and when the wait_sig goes down a second time, it enables the second half of the stream buffer to store the next two instructions. An example of the wait_sig timing from the DRAM during a burst is shown below.

Figure 6 – Example of the wait signal behavior during a burst read. [Timing diagram: the wait signal, plotted as 1/0 against time, dips low for one cycle at each of the two points in the burst where a pair of instructions becomes available.]

The following is the pseudo code:

    if (no match in buffer) then
      if (request_from_cache = '0') then
        wait_sig_to_cache := '0';
        request_to_DRAM   := '0';
      elsif ((request_from_cache = '1') and (wait_sig_from_DRAM = '0')) then
        request_to_DRAM   := '1';
        wait_sig_to_cache := '1';
        if (wait_sig_from_DRAM = '0' for the first time) then
          enable_for_first_half_register  := '1';
          enable_for_second_half_register := '0';
        elsif (wait_sig_from_DRAM = '0' for the second time) then
          enable_for_first_half_register  := '0';
          enable_for_second_half_register := '1';
        else
          enable_for_first_half_register  := '0';
          enable_for_second_half_register := '0';
        end if;
      end if;
    else
      wait_sig_to_cache := '0';
      request_to_DRAM   := '0';
    end if;

Once the four instructions are in the stream buffer, every time the first-level cache misses, the miss penalty is only one cycle. Every time the cache controller sends a request and the PC to the stream buffer, the stream buffer compares the PC with the address tag associated with each half of the buffer and, based on the comparator results, selects the matching half to output to the cache.

High-Level Schematic of the Stream Buffer

[Schematic: the address tag from the first-level cache (a 10-bit bus) is compared against both internal address tags to detect a hit; the comparator results drive the mux that selects between the two halves of the four 32-bit data registers (filled from the 64-bit DRAM data bus), and the selected pair of 32-bit words goes to the first-level cache. The stream buffer controller takes the wait_sig from the DRAM and the request and address from the first-level cache, and produces the wait_sig to the first-level cache, the request to the DRAM, the register enable signals, and the address to the DRAM and to the internal tag registers.]
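The tag-compare-and-select path in this schematic can be sketched as follows. This is a minimal illustration with our own names and widths, assuming one tag per 2-instruction half of the buffer; it is not the actual netlist.

    library ieee;
    use ieee.std_logic_1164.all;

    -- Hypothetical sketch of the stream buffer hit logic: the PC's tag is
    -- compared against the tag of each half of the buffer, and the matching
    -- half is selected for output to the first-level cache.
    entity sb_hit_select is
      port (
        pc_tag      : in  std_logic_vector(9 downto 0);   -- tag from the 1st level cache
        tag_first   : in  std_logic_vector(9 downto 0);   -- tag of the first half
        tag_second  : in  std_logic_vector(9 downto 0);   -- tag of the second half
        data_first  : in  std_logic_vector(63 downto 0);  -- first pair of instructions
        data_second : in  std_logic_vector(63 downto 0);  -- second pair of instructions
        hit         : out std_logic;
        data_out    : out std_logic_vector(63 downto 0)
      );
    end sb_hit_select;

    architecture rtl of sb_hit_select is
      signal match_first, match_second : std_logic;
    begin
      match_first  <= '1' when pc_tag = tag_first  else '0';
      match_second <= '1' when pc_tag = tag_second else '0';
      hit      <= match_first or match_second;
      -- The comparator results drive the output mux.
      data_out <= data_first when match_first = '1' else data_second;
    end rtl;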
Victim Cache

The implementation of the victim cache is similar to the first-level cache in terms of control logic and schematic design; the main differences are in the handling of data input and output. The victim cache holds four cache lines, is fully associative, and uses a FIFO (first in, first out) replacement policy.

The reason for using FIFO instead of random, as in the first-level cache, is that random replacement is more effective for larger caches. The small size of the victim cache makes it probable that certain cache blocks would be replaced more often than others. Because the victim cache writes out to the DRAM on a miss, replacing some blocks more frequently would result in a higher AMAT (AMAT = hit time + miss rate * miss penalty). A FIFO replacement policy ensures each block has the same chance to be replaced. Theoretically, LRU is the best replacement policy, but it is too difficult to implement with more than two entries.

Top Level Schematic of the Victim Cache in the Memory Hierarchy

[Schematic: the datapath connects to the first-level cache, whose lines hold valid and dirty bits, Tag[9:0], and two data words (Word 0, Word 1). Below it, the victim cache contains four cache lines plus two holding registers (reg 1 and reg 2), and connects down to the arbiter and the DRAM.]

Because the victim cache is fully associative, each cache block component contains a comparator to determine a hit between the address and the tag. However, the input to each cache block comes only from the first-level cache, so the victim cache blocks are simplified versions of the first-level cache blocks.

Schematic-wise, the victim cache sits between the first-level cache and the DRAM. The goal is to make the victim cache transparent to the system, so that the first-level cache thinks it is talking directly to the DRAM. Thus the victim cache intercepts all intermediate signals and must output data as well as control lines. Muxes are used to select which cache block to output and where to output it (first-level cache or memory). To avoid losing cache data while swapping two cache lines, two additional cache registers hold the outputs from the first-level cache and the victim cache.

In addition to sending data back to the first-level cache, the victim cache must also send the dirty bit. The first-level cache needs to know whether the data being sent up is dirty, so that the correct replacement behavior is used on subsequent memory accesses.

Changes to the datapath are local to the memory system. The first-level cache needs extra output signals to send the tag, valid, and dirty bits to the victim cache, and an extra input to receive the dirty bit from the victim cache. The DRAM needs a mux to choose the correct burst signal from the victim cache or the stream buffer.

Victim Cache Control

The first-level cache control works on the rising edge of the clock, while the memory controller works on the falling edge. To act transparently, the victim cache must work within this window and output its control signals at the correct time for both the first-level cache and the memory controller. A delayed clock is used to set up this timing. The following RTL describes the behavior of the victim cache control:

    if (first_level_cache_request = 1) then
      reg1 <= first-level cache line
      reg2 <= victim cache line (hit or replacement candidate)
      if (replaced_block = dirty) then
        DRAM <= reg2
      end if
      if (victim_cache_hit = 1) then
        first-level cache <= reg2
      else
        first-level cache <= DRAM
      end if
      victim cache <= reg1
    end if

Control Signals

- To the first-level cache: Because the first-level cache interacts with the DRAM solely through the wait signal, the victim cache must simulate the wait signal to get the appropriate response from the first-level cache. The wait signal must be set high when the victim cache is writing a cache line to the DRAM or when the first-level cache is trying to read from the DRAM.

- To the DRAM: Because the victim cache performs only one memory access at a time, the burst signal is set low. The request signal comes only from the victim cache; any request from the first-level cache is first interpreted by the victim cache, which determines which memory requests, if any, to perform.
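To make the FIFO replacement policy described above concrete, here is a minimal sketch of a 4-entry replacement pointer, assuming a simple wrapping counter; the entity and signal names are ours, not from the actual design.

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    -- Hypothetical sketch of the FIFO replacement pointer: every time a line
    -- is allocated into the victim cache, the pointer advances, so each line
    -- waits its turn to be evicted (round-robin = FIFO for a full cache).
    entity fifo_pointer is
      port (
        clk, reset, allocate : in  std_logic;
        victim_line          : out unsigned(1 downto 0)  -- which of the 4 lines to replace
      );
    end fifo_pointer;

    architecture rtl of fifo_pointer is
      signal ptr : unsigned(1 downto 0) := (others => '0');
    begin
      process (clk)
      begin
        if rising_edge(clk) then
          if reset = '1' then
            ptr <= (others => '0');
          elsif allocate = '1' then
            ptr <= ptr + 1;  -- wraps 3 -> 0
          end if;
        end if;
      end process;
      victim_line <= ptr;
    end rtl;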
CAM-based Cache

The major difference between the CAM-based cache and a standard cache is how the address tags are stored and handled. In the CAM-based cache, the addresses are stored in a completely separate register array from the data words. Each tag address register is accompanied by a comparator, whose function is to determine whether the register contains a match with the input address. The match signals from the registers are used to output a hit data word quickly, instead of waiting on the control logic.

Top Level Schematic of the CAM-Based Cache

[Schematic: a 10-bit address and 32- or 64-bit data enter the cache; the CAM array (8 cache lines x 10 bits) feeds a priority encoder, with the control logic alongside.]

The general CAM-based design uses the above scheme. At the very least, the CAM-based cache must output a hit signal and a line select signal. All comparators work in parallel, each outputting a single-bit match signal, and all the match signals pass to an 8-to-3 priority encoder. The hit signal (the OR of the match signals) is used by the cache control to determine whether the data resides in the cache or a request to the DRAM is necessary. The line select signal drives an 8-to-1 mux that chooses which cache line's data to output. In an effort to make the cache output a hit value as quickly as possible, both the encoder and the 8-to-1 mux are built out of gates.

The CAM-based cache was already mostly built during lab 6, so most of the functional testing had been done on it. Tests for the final project implementation were done at the cache level to measure delay and the improvement over the previous VHDL-based components. The improvement varied with the input patterns, but the output delay improved by a few nanoseconds on average.
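As an illustration of the hit and line-select logic just described, here is a minimal behavioral sketch. The names are ours, and the actual design builds the encoder and mux out of gates rather than behavioral code; in a correctly maintained CAM at most one match line is high, so the priority order does not matter functionally.

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    -- Hypothetical sketch: hit signal (OR of the match lines) and an
    -- 8-to-3 priority encoder over the eight CAM match lines.
    entity cam_encoder is
      port (
        match    : in  std_logic_vector(7 downto 0);  -- one match bit per cache line
        hit      : out std_logic;                     -- data resides in the cache
        line_sel : out std_logic_vector(2 downto 0)   -- drives the 8-to-1 data mux
      );
    end cam_encoder;

    architecture rtl of cam_encoder is
    begin
      hit <= match(0) or match(1) or match(2) or match(3) or
             match(4) or match(5) or match(6) or match(7);

      process (match)
      begin
        line_sel <= "000";               -- default when nothing matches
        for i in 7 downto 0 loop
          if match(i) = '1' then
            -- Later iterations overwrite, so the lowest matching line wins.
            line_sel <= std_logic_vector(to_unsigned(i, 3));
          end if;
        end loop;
      end process;
    end rtl;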
Results

Overall, the super-scalar processor works on simple code where there are no dependencies. It supports all the instructions required of the final processor, but can run reliably only on simple arithmetic programs. When there are complicated, overlapping dependencies, the controller sometimes fails to insert no-ops correctly, so certain tests fail; in particular, the processor seems unable to exit from the partial sums program.

Currently, our processor minus the super-scalar design has a minimum cycle time of 76 ns. The critical path is in the memory stage, beginning at the control logic of the first-level cache, running to the victim cache, passing data through the arbiter, and ending at the input to the DRAM controller. Because our design has the first-level cache working on the rising edge of the clock, we effectively have only one half clock period to get the data stable at the memory controller side. The exact timing of the critical path is determined by the phase shifter, the delay of the victim cache control, and the load delay of several muxes and gates in the arbiter. The total reaches 38 ns for the half clock cycle, resulting in the 76 ns clock cycle. Compared with lab 6, which had a 42 ns clock cycle, note that lab 6 did not have a victim cache operating on the delayed clock; the victim cache is most likely the main reason for the increase in cycle time.

As for the performance improvements from the super-scalar design and the added memory components, the results are mixed. The longer cycle time is already a disadvantage for the new processor. In addition, the provided test program, partial_sum, runs in tight loops with many memory loads but few memory stores, whereas our processor is most optimized for long stretches of sequential code and a high volume of memory stores. The reason is our use of the stream buffer to prefetch sequential instructions and the victim cache to increase the effective capacity of the cache. The following chart shows the results of our processor at various stages compared with the lab 6 processor.

[Bar chart: cycle counts, on a scale of 0 to 3000, for four configurations: the original processor (lab 6), processor + stream buffer, processor + victim cache, and processor + SB + VC.]

Conclusion

In conclusion, this final project took us around 200 hours as a group to complete. The strength of the project is that we tried to keep everything simple: when there was a bug, we knew how to attack the problem. However, when the problem lay with VHDL and Viewlogic, we were often stuck for a while before we could think of a way around it. Two weeks for the final project is really limited; if we had more time, we would definitely improve our hardware components. One improvement for the stream buffer is to start prefetching the next two instructions as soon as two instructions have been fetched by the first-level cache. Another improvement, which we did not have time for, is to make the first-level instruction cache a FIFO. The victim cache currently uses four states during a write-back to memory and a read; the fourth state is a reset state and could probably be pushed onto the falling edge of the third state, saving one cycle. Finally, branching and jumping performance could be increased by using a more aggressive stall scheme.

We are sure a lot of people encountered problems with VHDL and Viewlogic. Sometimes the clock would not run after cycling for a certain time; at other times the clock would be undefined at the very beginning for no apparent reason. The biggest challenge still lies in getting the VHDL components to work properly. In the stall controller, there was one case that the VHDL code never picked up. This was eventually resolved by removing the specific cases and using a more general scheme to deal with that one particular case (the LW dependency when the current instruction was even was combined with the general even case).

Appendix

See the attached zip file.