CS 152 Computer Architecture
Final Project Report, Spring 2001
Prof. Kubiatowicz
Section: 2-4 pm Wednesday    TA: Ed Liao

Team members:
    Lin Zhou        14255590
    Sonkai Lao      14429797
    Wen Hui Guan    12270914
    Wilma Yeung     14499996

Table of Contents:
    Abstract
    Division of Labor
    Detailed Strategy
    Results
    Conclusion
    Appendix I (notebooks)
    Appendix II (schematics)
    Appendix III (VHDL files)
    Appendix IV (testing)
    Appendix V (delay time table)

Abstract:

We implemented a deeply pipelined processor with seven stages, along with a branch predictor, separate instruction and data caches, and a stream buffer. Starting from the original 5-stage pipeline, we divided the instruction fetch stage into two stages, and did the same with the execution stage. The instruction cache is read in the first fetch stage, and the branch decision is predicted in the second. Since the ALU is the functional unit with the worst delay (15ns), we broke it into two 16-bit adders to achieve better cycle time; each adder is paired with a logic unit and placed in its own execution stage. We also added a branch predictor to the second fetch stage. With this implementation, our pipelined processor can sustain a cycle time of 22.5ns with an ideal memory system.

To optimize the memory system, we implemented two cache subsystems, one for data and one for instructions. Furthermore, because memory accesses to the instruction cache are largely sequential, we attached a stream buffer to it. Finally, the performance metrics we used for the caches are hit time and miss penalty. The hit time for both the data and instruction caches is 7ns. The miss penalty for the data cache is estimated at 124.5ns, and that for the instruction cache at 422ns, based on the results of running the mystery program with the ideal memory.

Division of Labor:

    Project Design Areas                        Team Member(s) Involved
    Processor datapath and controllers          Sonkai Lao, Lin Zhou
    Cache system, controllers, and arbiter      Wilma Yeung, Wen Hui Guan
    Branch target predictor                     Sonkai Lao
    Tracer monitor                              Wen Hui Guan
    Forwarding unit                             Wilma Yeung
    Report write-up                             Lin Zhou, Wen Hui Guan

Detailed Strategy:

a. General Design Strategy

Memory System

(1) Instruction Cache and Stream Buffer

A stream buffer is added to reduce compulsory and capacity misses, since instructions tend to execute in sequence in groups of four to five between branches and jumps. We designed both the instruction cache and the stream buffer with a fully associative lookup, so each cache and buffer block is associated with a comparator.

For the instruction cache, we implemented a FIFO replacement policy, enforced by a counter that selects blocks for replacement in round-robin fashion, like the hand of a clock. When a read misses in both the cache and the buffer, the buffer is flushed. The counter advances after the instruction cache reads in 2 sequential words for a block, and the buffer controller then fills the buffer with the next 4 sequential words. This sequence of reads corresponds to two memory accesses under our implementation of burst requests of 3 words. Based on this implementation, the miss penalty is about 422ns (see Results). When a hit is found in the cache, the hit time is 7ns. If the requested word misses in the cache but hits in the buffer, it costs only one extra ns to bring the word to the NPC. A sketch of the replacement counter appears below.
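As a rough illustration of this policy, the following VHDL sketch shows a 3-bit FIFO replacement counter that advances only after a missed block has been refilled. The entity and signal names here are our own, chosen for illustration; they are not taken from the project files.

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity fifo_counter is
      port ( clk, reset, refill_done : in  std_logic;
             victim : out std_logic_vector(2 downto 0) );
    end fifo_counter;

    architecture behv of fifo_counter is
      signal count : unsigned(2 downto 0) := (others => '0');
    begin
      process (clk)
      begin
        if rising_edge(clk) then
          if reset = '1' then
            count <= (others => '0');
          elsif refill_done = '1' then
            count <= count + 1;   -- wraps from "111" back to "000",
          end if;                 -- sweeping the blocks like a clock hand
        end if;
      end process;
      victim <= std_logic_vector(count);
    end behv;

Because the counter only moves after a refill, the block it points at is always the oldest resident block, which is exactly the FIFO victim.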
(2) Data Cache Re-design (fully associative, burst requests, and write-back)

We implemented a fully associative access method for the data cache, along with a write-back policy. Since we employ two 32-bit DRAM banks in parallel, we connect the data cache to the memory via a 64-bit bus. We also implemented a FIFO replacement policy, enforced by a counter that selects blocks for replacement in round-robin fashion, as in the instruction cache; here the counter advances only on a read-miss or write-miss.

To increase the performance of the cache, we designed it so that when a write to the DRAM is required, the cache writes only one word, instead of two, to the memory. When a read-miss selects a block for replacement, however, two words are read into the block to take advantage of spatial locality. This also reduces the write overhead, since a word is written back to the DRAM only if its dirty bit is set; accordingly, each block carries two dirty bits.

Write-misses and write-hits involve some complication, since we need to update only one word (32 bits) over a 64-bit bus while keeping the implementation simple. On a write-miss, the block selected for replacement is handled in the same manner as on a read-miss: we request a read of two words from the DRAMs and set the valid bits after writing back the dirty word(s), if any. To put these scenarios together over a single 64-bit input bus, we had to consider several write cases. Our approach is to keep two valid bits and two dirty bits per block. On a write-hit, we simply update the appropriate word and set its dirty bit. On a compulsory write-miss, we need only update the tag, the cache word, and the corresponding valid bit; by keeping two valid bits per block, we can skip allocating the companion word on a write and so reduce the overhead. Because the data source for a write can be either the CPU or the DRAM, the target block may be empty (no valid bits set), and the input data arrives on a 64-bit bus, the 64-bit input must be assembled appropriately. The data cache controller therefore distinguishes the input source and generates a 3-bit source selection signal to choose the input data.

(3) Data Cache Controller Design

As described in the data cache design section, write-hits and read-hits are handled similarly: a read-hit selects the right word to output to the CPU, while a write-hit updates the appropriate word. Both cases are handled in the same state by recognizing the type of hit and generating the corresponding selection signals. On either kind of miss, the cache requests permission to access the DRAM through the arbiter (described below) and must wait for the arbiter's grant signal before performing the write or read.

The state diagram (Data_cache_fsm.ppt in Appendix II) shows that hits are handled in the START state. When a miss is triggered, the controller goes to the GRANT_W or GRANT_R state until the write or read permission is granted. Once permission is granted, the controller transitions to the write or read state and waits until the DRAM is idle. When the operation completes, the controller returns to the initial state.

It is also possible for a read-miss to be serviced right after a write-miss. As the diagram shows, in this case the state machine can request the read immediately after the write to the DRAM finishes, which saves one cycle on the fly. A simplified sketch of these transitions follows.
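The following VHDL is a minimal sketch of those transitions, assuming hypothetical state and signal names (miss, dirty, granted, mem_idle) rather than the ones in Data_cache_ctrl.vhd; the control outputs asserted in each state are omitted for brevity.

    library ieee;
    use ieee.std_logic_1164.all;

    entity dcache_fsm_sketch is
      port ( clk, miss, dirty, granted, mem_idle : in std_logic );
    end dcache_fsm_sketch;

    architecture behv of dcache_fsm_sketch is
      type state_t is (START, GRANT_W, GRANT_R, WRITE_MEM, READ_MEM);
      signal state : state_t := START;
    begin
      process (clk)
      begin
        if rising_edge(clk) then
          case state is
            when START =>                    -- read-hits and write-hits serviced here
              if miss = '1' and dirty = '1' then
                state <= GRANT_W;            -- dirty word must be written back first
              elsif miss = '1' then
                state <= GRANT_R;
              end if;
            when GRANT_W =>
              if granted = '1' then state <= WRITE_MEM; end if;
            when GRANT_R =>
              if granted = '1' then state <= READ_MEM; end if;
            when WRITE_MEM =>
              if mem_idle = '1' then
                state <= GRANT_R;            -- request the refill read right away,
              end if;                        -- saving a cycle over returning to START
            when READ_MEM =>
              if mem_idle = '1' then state <= START; end if;
          end case;
        end if;
      end process;
    end behv;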
(4) Instruction Cache and Stream Buffer Controller Design

The instruction cache and stream buffer controllers are two communicating state machines. On a hit in either the cache or the buffer, both machines stay in the START state; if the hit is in the buffer, the buffer controller directs the buffer, rather than the cache, to output the requested word.

On a miss in both, the buffer controller takes the active role, communicating with the arbiter and the memory controller to request a sequence of 6 words (burst reading 3 times, two words each). Once the arbiter grants the memory access request to the buffer controller, the first two words are written to the selected cache block while the next 4 sequential words are placed in the buffer. After each burst read from memory, the buffer controller waits one cycle to let the DRAM recover. It is also quite possible for a write-miss to be followed by a read-miss; we handle this case by requesting memory access again immediately after the write-miss is serviced, because the availability of instruction data is critical to processor performance. The state diagrams for the instruction cache and the stream buffer controller are Instr_cache_fsm.ppt and Stream_buffer.fsm.ppt in Appendix II.

(5) Memory Controller Design

The memory controller (memctrl) is a synchronous process containing two asynchronous components, a memory read component (mem_read) and a memory write component (mem_write). The synchronous input signals of memctrl trigger mem_read and mem_write to perform the operation; separating the two makes it easy for memctrl to distinguish read from write operations. To increase the throughput of the DRAMs, we implemented the burst read request of 3 words: within the RAS and CAS timing constraints, for each RAS assertion we toggle the CAS signal three times to accomplish the burst read.

(6) Memory Arbiter

We designed the arbiter as a state machine with 3 states: Start, Instruction, and Data. From the Start state, when a request comes from either the data or the instruction cache, the machine transitions to the respective state and asserts the appropriate memory request signal. When requests arrive from both caches at the same time, the data request takes priority and the instruction fetch register is stalled until that memory access completes, at which point the instruction request is serviced.

Datapath

(7) Main Structure of the 7-Stage Deep Pipeline (overview)

[Pipeline overview diagram: IF1 -> IF2 -> DEC -> EXE1 (ALU1) -> EXE2 (ALU2) -> MEM -> WB, with pipeline registers between stages.]

(8) Branch Target Predictor and Controller

To implement a branch target predictor, we placed a branch prediction table in the second instruction fetch stage. The table maintains 16 entries of PCs, predicted PCs, and next-state bits, and the prediction is indexed by the input PC from the next-PC logic. Each entry is associated with a 2-bit state machine that always predicts "not taken" the first time. The predicted NPC is latched into the next-PC logic, but the actual branch decision is made in the first EXE stage. If the prediction is wrong, the instruction registers are flushed, which costs two extra cycles of delay.

[Figure: branch predictor datapath in IF1/IF2. The next-PC logic indexes the branch predict table alongside the instruction cache; depending on whether the decision was right, the predicted PC or the corrected PC is fed back; each table entry runs the 2-bit predict-taken / predict-not-taken state machine.]
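As an illustration of the per-entry 2-bit state machine just described, here is a minimal VHDL sketch of a saturating predictor initialized to "strongly not taken", matching the report's predict-not-taken-first behavior. The names are ours, not those of Predict_entry.vhd or B_table_ctrl.vhd, and the exact transition structure is an assumption.

    library ieee;
    use ieee.std_logic_1164.all;

    entity predict_2bit is
      port ( clk, update, taken : in  std_logic;  -- outcome resolved in EXE1
             predict_taken      : out std_logic );
    end predict_2bit;

    architecture behv of predict_2bit is
      type pred_t is (STRONG_NT, WEAK_NT, WEAK_T, STRONG_T);
      signal state : pred_t := STRONG_NT;         -- first prediction: not taken
    begin
      process (clk)
      begin
        if rising_edge(clk) then
          if update = '1' then
            if taken = '1' then                   -- branch actually taken
              case state is
                when STRONG_NT => state <= WEAK_NT;
                when WEAK_NT   => state <= WEAK_T;
                when others    => state <= STRONG_T;
              end case;
            else                                  -- branch actually not taken
              case state is
                when STRONG_T  => state <= WEAK_T;
                when WEAK_T    => state <= WEAK_NT;
                when others    => state <= STRONG_NT;
              end case;
            end if;
          end if;
        end if;
      end process;

      predict_taken <= '1' when (state = WEAK_T or state = STRONG_T) else '0';
    end behv;

Two mispredictions are needed to flip a strong state, so a single anomalous branch outcome does not immediately change the prediction.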
(9) ALU Modification

To reduce the cycle time, we divided the original 32-bit ALU into two 16-bit adders, so that addition and subtraction are handled across the two execution stages. The adder in EXE1 operates on the lower 16 bits of the operands, while the upper 16 bits are computed by the adder in EXE2. Each execution stage also has a 32-bit logic unit that handles logic operations and shifts. The reason for having two logic units is to avoid stalls between arithmetic and shift/logic operations: the two logic units operate in different stages, and the output of the first logic unit is forwarded to the EXE2 stage. If there is a RAW hazard, the controller detects it and selects the right logic unit output through the control of a MUX, so forwarding completes as usual. For example, the first logic unit's output is chosen for

    sll   $1 $2 $3
    addiu $4 $1 3

and the second logic unit's output for

    addiu $1 $2 $3
    sll   $4 $1 2

In addition, gate-level logic blocks, such as the SLT logic unit, were built to reduce the cycle time of the critical path.

[Figure: split-ALU datapath. The EXE1 16-bit adder computes the lower halves A_Half[15:0]/B_Half[15:0] while the EXE2 adder computes the upper halves; each stage contains a 32-bit logic unit and the SLT logic, with MUXes selecting Out_1[31:0]/Out_2[31:0], pipeline registers between stages, and forwarding back to DEC through a Mux32x8.]

(10) Forwarding Unit

The forwarding unit (hazard controller) is written in VHDL. It compares the register numbers among the decode, 1st execution, 2nd execution, and memory stages; when a data dependency is detected, it forwards the needed value back to the earlier stage, thereby avoiding RAW hazards. To implement the forwarding unit, the VHDL generates the selection signals for the various MUXes that steer the correct data flow based on the comparison results. There are 2 MUXes at the decode stage, 4 at the 1st execution stage, and 5 at the 2nd execution stage. The unit also controls which logic unit to use for each shift/logic operation. A sketch of one such comparison follows.
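The following VHDL sketch shows one such register comparison in isolation, under assumed signal names (exe2_regwrite, exe2_rd, exe1_rs, fwd_a_sel) rather than those of Hazardcontroller.vhd; the real unit repeats this pattern for every stage pair and operand.

    library ieee;
    use ieee.std_logic_1164.all;

    entity fwd_compare_sketch is
      port ( exe2_regwrite : in  std_logic;
             exe2_rd       : in  std_logic_vector(4 downto 0);  -- dest reg in EXE2
             exe1_rs       : in  std_logic_vector(4 downto 0);  -- source reg in EXE1
             fwd_a_sel     : out std_logic );
    end fwd_compare_sketch;

    architecture behv of fwd_compare_sketch is
    begin
      -- Forward when the older instruction writes a register (other than $0)
      -- that the younger instruction reads; the MUX select then picks the
      -- forwarded value instead of the register file output.
      fwd_a_sel <= '1' when exe2_regwrite = '1'
                        and exe2_rd /= "00000"
                        and exe2_rd = exe1_rs
                   else '0';
    end behv;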
b. Testing Strategy

Delay modeling of major components:

(1) Forwarding Unit
The forwarding unit generates the selection signals for the MUXes by comparing registers from different stages. In hardware it can be built from MUXes and simple logic gates, so we model its delay as 5ns.

(2) Predict Entry
The predict entry keeps the PC, the predicted NPC, and the state bits of the prediction state machine. It can be implemented as a register, so a delay of 3ns is reasonable.

(3) Branch Table Controller
The controller's state transitions follow edge-triggered flip-flops, and all of its output signals come from simple combinational logic, so we model its delay as that of a series of two registers, which is 6ns.

(4) Comparator
The 9-bit comparator used to compare tags can be implemented with nine 2-input NAND gates and nine inverters. However, ANDing the nine single-bit results would need a 9-input AND gate, whose high fan-in increases the delay; the alternative of breaking it into about 3 levels of gates, excluding inverters, also adds delay. We therefore give it a delay of 6ns.

(5) 3-bit Counter
The 3-bit counter can be implemented with 3 toggle flip-flops in parallel and a modest amount of combinational logic to select the output bits, so its delay is tied to that of the flip-flops. Modeling a toggle flip-flop at 2ns, the same as a register, a 2ns delay for the counter seems justifiable.

(6) Cache Block Select Decoder
The decoder takes a 3-bit input and outputs 8 bits in parallel, of which at most one is asserted, based on the input value. This can be realized as one level of 8 NAND gates, excluding inverters, so we assign it a delay of 2ns.

(7) Instruction and Data Cache Controllers
The cache controllers can be implemented with pure combinational logic, with state transitions triggered by flip-flops. A reasonable delay is 3ns.

(8) Memory Controller
As with the branch table controller, state transitions follow edge-triggered flip-flops and all outputs come from simple combinational logic, so we model its delay as a series of two registers, which is 6ns.

(9) Memory Arbiter
The memory arbiter is designed as a state machine with direct state-bit assignment, so its structure is two-level combinational logic. As with the memory controller, we model it with a 6ns delay.

(10) All other common components are modeled with the same values as in labs 4 through 6.

Memory System Testing Methodology

Cache, stream buffer, and controllers:

(1) We first tested the individual cache and controller components; the individual testing scripts (.cmd files) are listed in Appendix IV. We then ran a comprehensive test on the memory cache system as a whole. Read-hits are tested in a straightforward manner by addressing cache contents and verifying their values.

(2) In particular, we checked the correct handling of the replacement policy by first causing write-misses on all blocks. One additional write-miss must then choose the first block for replacement under the FIFO policy, so the dirty word is written back to the DRAM before the new word is written into the cache block; we check the memory contents for the correct update. Read-misses are tested similarly.

(3) For the instruction cache, we specifically checked a read that misses in the cache but hits in the buffer. Reading the next 3 sequential words should then cause no unnecessary delay or stall, because the buffer already holds the 4 sequential words. After that, a miss in both the cache and the buffer should flush the buffer and bring in another 6 sequential words.

Memory controller:

(1) We first tested the basic write and read operations, verifying the corresponding memory contents and outputs.

(2) To test the burst read of 3 words from memory, we supply a row address and lower the RAS signal; while RAS is low, we supply the controller with three consecutive column addresses, each followed by lowering the CAS signal. From this single access of the DRAMs we expect 6 sequential words to be output, since the two DRAMs in parallel are accessed at the same time. A sketch of the stimulus appears below.
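A minimal sketch of that stimulus as a VHDL testbench process, assuming hypothetical signal names (addr, ras_n, cas_n, declared in the enclosing testbench architecture) and illustrative timing values rather than the actual DRAM parameters:

    -- One page-mode access: a single RAS assertion followed by three
    -- CAS toggles, each presenting the next consecutive column address.
    stimulus : process
      type col_array is array (0 to 2) of std_logic_vector(8 downto 0);
      constant row  : std_logic_vector(8 downto 0) := (others => '0');
      constant cols : col_array :=
        ("000000000", "000000001", "000000010");
    begin
      addr  <= row;
      ras_n <= '0';                 -- latch the row address
      wait for 30 ns;               -- assumed RAS-to-CAS delay
      for i in 0 to 2 loop
        addr  <= cols(i);
        cas_n <= '0';               -- latch a column address
        wait for 15 ns;             -- assumed CAS pulse width
        cas_n <= '1';
        wait for 15 ns;             -- assumed CAS precharge
      end loop;
      ras_n <= '1';                 -- end of access; with two DRAMs in
      wait;                         -- parallel, six words were returned
    end process stimulus;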
Comprehensive testing:

(3) Our overall testing proceeds in hierarchical order. First, we broke the pipeline into its 7 stages and tested it with the mystery program used in the previous labs to make sure the pipeline works. After testing all the basic components (cache blocks, caches, controllers, forwarding unit, branch predictor, arbiter, etc.), we added one functional unit at a time to the pipeline and ran it with an ideal memory. After ensuring those functional units worked properly, we attached the memory and cache system to the pipeline and ran a series of intensive tests to further confirm the proper functioning of each unit and of the whole processor. Finally, a trace monitor was added to the datapath to monitor the committed results at the write-back stage.

Results:

a. Critical path

(1) The processor critical path:

    Time Delay for Critical Path
    Mux16x2      : 1.5 ns
    Mux32x8      : 3.5 ns
    Mux32x5      : 3.5 ns
    16-bit Adder : 8 ns
    SLT Logic    : 3 ns
    32-bit Reg   : 3 ns
    Total        = 22.5 ns

[Figure: the critical path runs through the split-ALU datapath shown in section (9).]

Compared with the 37ns cycle time obtained from the 5-stage pipeline, the deep pipeline's 22.5ns cycle time is a 39.2% improvement.

(2) Cache performance metrics:

Data and instruction hit time (on cache or buffer):
    9-bit comparator (6ns) + 32-bit tristate buffer (1ns) = 7ns

Instruction cache miss penalty: the path runs through the 9-bit comparator (6ns), arbiter (2ns), stream buffer controller (6ns), memory controller (6ns), DRAM access (400ns), the word selection tristate (1ns), and the path to the PC (1ns):
    Miss penalty = (6 + 2 + 6 + 6 + 400 + 1 + 1)ns = 422ns

Data cache miss penalty: the path runs through a MUX (3.5ns), the 9-bit comparator (6ns), arbiter (2ns), cache controller (6ns), memory controller (6ns), DRAM access (100ns), and the word selection tristate (1ns):
    Miss penalty = (3.5 + 6 + 2 + 6 + 6 + 100 + 1)ns = 124.5ns

The miss penalties for the data and instruction caches differ significantly because the instruction cache burst reads 3 times (page mode) on a miss, while the data cache reads only once (normal mode).

b. Cycle time for the pipelined processor

The calculated delay for the critical path is 22.5ns, so the clock rate of the new processor is 1/22.5ns = 44.44 MHz.
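As a quick check, the clock rate and the improvement over the 5-stage design follow directly from the two cycle times:

    \[
    f_{clk} = \frac{1}{22.5\,\mathrm{ns}} \approx 44.44\,\mathrm{MHz},
    \qquad
    \frac{37\,\mathrm{ns} - 22.5\,\mathrm{ns}}{37\,\mathrm{ns}} \approx 39.2\%
    \]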
Conclusion:

We have implemented a deep-pipeline processor running at 44.44 MHz with branch prediction capability, and a stream buffer to complement the instruction cache. Both the data and instruction caches use fully associative access with a write-back policy and FIFO replacement. Furthermore, we implemented burst memory read requests of 3 words for each DRAM, which works well with the memory controller, DRAMs, caches, and cache controllers as a whole over a 64-bit bus. An arbiter is constructed to prioritize simultaneous data and instruction requests; the data cache request always takes priority. If time permitted, we would further enhance the cache system by implementing a victim cache for the data cache.

Appendix I (notebooks):

a. Lin Zhou
b. Sonkai Lao
c. Wilma Yeung
d. Wen Hui Guan

Appendix II (schematics):

The schematics and state diagrams shown in this report are collected in the following presentation files:

    Component                         Schematic file
    Data cache controller             Data_cache_fsm.ppt
    Instruction cache controller      Instr_cache_fsm.ppt
    Cache and memory system           Mem_cache_system.ppt
    Arbiter controller                Fsm_for_arbiter.ppt
    Stream buffer controller          Stream_buffer.fsm.ppt
    Presentation schematics           Sum.ppt

Appendix III (VHDL files):

VHDL files for all components and blocks used:

    Component                         VHDL file
    Data cache controller             Data_cache_ctrl.vhd
    Predict entry                     Predict_entry.vhd
    Instruction cache controller      Instr_cache_ctrl.vhd
    Stream buffer controller          Stream_buf_ctrl.vhd
    Memory controller                 Memctrl.vhd
    Branch table controller           B_table_ctrl.vhd
    Memory arbiter                    Final_mem_arbiter.vhd
    Hazard controller (forwarding)    Hazardcontroller.vhd
    16-bit ALU                        Half_alu.vhd
    Debugging tracer (monitor)        Display.vhd
    3-bit counter                     Counter_3bit.vhd
    m1x8 mux                          Mux1x8.vhd
    m10x2 mux                         M10x2.vhd
    m32x2 mux                         M32x2.vhd
    m8x2 mux                          M8x2.vhd
    10-bit tristate                   Tristate_10bit.vhd
    32-bit tristate                   Tristate.vhd
    9-bit comparator                  Comparator9.vhd
    9-bit register                    Register9int.vhd
    1-bit register                    Reg1.vhd
    Block select decoder              Block_sel_decoder.vhd
    Memory read component             Mem_read.vhd
    Memory write component            Mem_write.vhd

Note: all other components, such as the extender, shifter, and muxes, are the same as those used in labs 4 through 6, so they are not listed here again.

Appendix IV (testing):

Component testing, performed in hierarchical order:

    Component                         Command file
    Overall system test               Final.cmd, Quicksort.cmd
    Cache system                      Cachesystem.cmd, Test_cache_sys.cmd, A_partial.cmd
    Pipeline datapath                 Test_datapathfinal.cmd
    Forwarding unit                   Forward.cmd, Forward_no_sort.cmd
    Forwarding on quickSort           Forward_sort.cmd
    Instruction cache                 Instr_cache
    Memory system                     Test_mem_sys.cmd
    DRAM bank                         Test_memoryBank.cmd
    Data cache                        Test_datacache.cmd
    Replacement policy                Replace.cmd
    Arbiter                           Test_arbiter.cmd

Other files: all testbenches are saved in the folder finalproj/behv/testbnch/

Log files:
(1) system test log: finalproj/fsimtemp.log

Final datapath testing:
    Datapath schematic file:  ../finalproj/sch/datapathhazard.1
    Datapath command file:    ../finalproj/cmd/final.cmd
    Monitor output file:      ../finalproj/disassembly.out

Appendix V (delay time for each component):

    Component                             Delay time
    Stream buffer controller              6 ns
    Predict entry                         3 ns
    Hazard controller (forwarding)        5 ns
    Branch table controller               6 ns
    Data cache controller                 6 ns
    Instruction cache controller          6 ns
    Block select decoder                  2 ns
    3-bit counter                         2 ns
    m1x8 mux                              3.5 ns
    m32x8 mux                             3.5 ns
    m10x2 mux                             1.5 ns
    m32x2 mux                             1.5 ns
    m8x2 mux                              1.5 ns
    10-bit tristate                       1 ns
    32-bit tristate                       2 ns
    9-bit comparator                      6 ns
    9-bit register                        3 ns
    32-bit register                       3 ns
    Logic unit (shifter/logic unit)       10 ns
    16-bit adder                          8 ns
    SLT logic                             3 ns
    1-bit register                        3 ns
    Data, instr. cache controllers        6 ns
    Memory controller                     6 ns
    Memory arbiter                        6 ns
    Memory read component                 part of the memory controller
    Memory write component                part of the memory controller