Technological Document
Due: 6/24/2008

CPU Design Report
Project Name: MIPS CPU Design
Author: Lin Yin
TA: Li Tingtao
School of Software, Shanghai Jiao Tong University

ABSTRACT
This report gives a brief description of the design and VHDL implementation of a MIPS CPU with pipeline and cache functionalities.

KEYWORDS
MIPS, CPU, Pipeline, Cache, Logic, VHDL, Digital Circuit

School of Software, Shanghai Jiao Tong University
Address: Dongchuan Road No. 800    Post Code: 200240

Contents
1. Executive Overview ........................................ 1
2. Processor and Cache Design ................................ 1
2.1 Processor Design ......................................... 1
2.1.1 Architectural Overview ................................. 1
2.1.2 Hazard Solving ......................................... 2
2.1.3 The Controller ......................................... 4
2.1.4 Detailed Design for Each Phase ......................... 6
2.2 Cache Design ............................................ 10
2.2.1 Overview .............................................. 10
2.2.2 Design Choices ........................................ 11
2.2.3 State Machines ........................................ 12
3.
Processor Debugging ...................................... 14
3.1 Debug Overview .......................................... 14
3.2 Debugging Synchronization Problems ...................... 15
3.3 Debugging the Cache ..................................... 15
4. Results .................................................. 15
4.1 Testing Code ............................................ 15
4.2 Single Cycle MIPS ....................................... 15
4.3 Pipeline MIPS ........................................... 18
4.4 Pipeline MIPS with Cache ................................ 19
5. Conclusions .............................................. 20
6. Development Environment .................................. 20
6.1 Hardware ................................................ 21
6.2 Software ................................................ 21
7. Appendices ............................................... 22
7.1 ISA ..................................................... 22
7.2 Testing Codes ........................................... 22
8.
References .................................................. 23

1. Executive Overview

In this CPU design practice, I have implemented a 5-stage pipelined MIPS CPU with cache functionality. The practice is intended to deepen our understanding of computer architecture through hands-on CPU logic design. My design supports 15 instructions: 7 R-type, 7 I-type, and 1 J-type (see appendix 7.1). It successfully fulfills all the basic ideas of a simple MIPS CPU, including instruction execution, pipelining, and cache functionality. However, due to time limits, the bus arbitrator is not realized, so separate memories are required for instructions and data.

From the practice, I have learned how to make a single-cycle CPU run, how to turn a single-cycle CPU into a pipelined one by solving the problem of hazards, and how to design a cache using a finite state machine model. As a by-product, I also learned the VHDL language and how to solve synchronization problems. Above all, the practice did achieve its goal: it gave me a better understanding of the architecture and organization of a computer.

The rest of this document is organized as follows: first, the details of the processor and cache design are given, followed by a description of the problems encountered and solved during the debugging phase. Then the results of the practice are presented. Finally, conclusions are drawn and the development environment is listed.

2. Processor and Cache Design

2.1 Processor Design

2.1.1 Architectural Overview

Figure 1  Top-level Circuit

As shown in figure 1, the MIPS CPU is divided into 5 phases, according to the stages in the pipeline: IF (Instruction Fetch), ID (Instruction Decoding), EXE (Execution), MEM (Memory Access) and WB (Write Back).
Whenever a phase has finished its task for an instruction, it moves on to the next instruction. Therefore, a set of backup registers is required to save the results of each phase. The long bars in figure 1 denote these backup registers.

2.1.2 Hazard Solving

Like any pipelined CPU, the design of a pipelined MIPS involves three types of hazards: structural hazards, data hazards, and control hazards.

Structural hazards happen when instructions compete for the same hardware resource. For example, both the IF phase and the MEM phase call for memory accesses, so they compete for the memory and for the data and address buses. In my design, this conflict is solved by using a separate instruction cache and data cache, and separate instruction memory and data memory. However, the separation of the latter would not be necessary if an arbitrator were available: whenever a miss happens in one of the caches, the arbitrator would freeze the memory so that the other cache cannot access it until the current memory access finishes. The strategy of using separate caches to solve structural hazards can be generalized as adding hardware resources. A general solution to this kind of hazard also includes adding bubbles, where the CPU stops instruction fetching for one cycle to allow the competing instructions to gain resource access sequentially. This approach is not adopted in my design because it delays instruction execution and is not efficient enough.

Data hazards happen when the execution of an instruction needs the result of an instruction one or several instructions ahead of it. In most cases, an instruction puts its result in a register or in memory so that the following instructions can fetch it. However, if each instruction waited until the final results of previous instructions were available, unnecessary delay would result and the pipeline would stall.
Thus, whenever such a data dependence happens, we should try to get the partial result of the previous instruction to the waiting instruction as soon as possible. This technique is called forwarding. In MIPS, partial results have the following three sources:

ID phase – the result of this phase may come from the register file or the instruction word.
EXE phase – the result of this phase may come from the ALU.
MEM phase – the result of this phase may come from memory (cache).

Since my design does not support multiplication instructions, it has fewer data hazard sources, so fewer cases need to be considered. The following list gives the ideas behind solving the data hazards in my design.

When an instruction In needs the result of In-3: since In-3 has already reached its WB phase and written its result to the register file at the negative clock edge of the current cycle, no data hazard happens.

When an instruction In needs the result of In-2: since In-2 has already reached its MEM phase and the result has already been calculated at the negative clock edge of the current cycle, forwarding is possible. The only problem is that the calculated result has not been written into the register file yet, so we forward the data from the backup register of the MEM phase.

When an instruction In needs the result of In-1, two cases must be distinguished. If the result of In-1 comes from the ID or EXE phase, the result has already been calculated, so forwarding is possible. However, if the result of In-1 comes from the MEM phase (a load), it is not ready yet. The only solution is to insert a bubble to delay instruction fetch for one cycle and allow In-1 to get its result from memory.

Figure 2  Solution to Data Hazards by Forwarding
Figure 3  Solution to Data Hazards by Adding Bubbles

The last kind of hazard, the control hazard, happens when branches are needed.
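The data-hazard handling just described can be condensed into a small decision model. The sketch below is illustrative Python, not part of the VHDL design; the function name and return labels are hypothetical.

```python
def hazard_action(distance, producer_is_load):
    """Decide how to resolve a dependence on the instruction `distance`
    instructions ahead (1, 2, or 3), per the rules above."""
    if distance >= 3:
        return "none"          # producer already wrote back at the negative edge
    if distance == 2:
        return "forward_mem"   # forward from the MEM-phase backup register
    # distance == 1
    if producer_is_load:
        return "bubble"        # load-use hazard: stall one cycle, then forward
    return "forward_alu"       # ID/EXE result is ready: forward it directly

print(hazard_action(1, True))   # a load one instruction ahead forces a bubble
```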
We are put in a dilemma whether to execute the following instruction or to take the branch. Such a decision cannot be made until we know whether the branch condition is true. Generally, three solutions are proposed for this hazard (see reference [1]). In my design, the third one (the non-delay solution) is adopted because of its efficiency. The branch condition is judged in the ID phase. Also, a delay slot is appended to each branch instruction to make fuller use of resources. Due to the introduction of the delay slot, the instruction that directly follows a branch instruction is guaranteed to execute, whether the branch is taken or not. The execution of a series of instructions involving a branch is illustrated in figure 4.

Cycle          1    2    3    4    5    6    7    8
Branch         IF   ID   EXE  MEM  WB
Delay Slot          IF   ID   EXE  MEM  WB
Suc. Inst. 1             IF   ID   EXE  MEM  WB
Suc. Inst. 2                  IF   ID   EXE  MEM  WB

Figure 4  Solution to Control Hazards

The delay slot is always executed after the branch, but the successor instructions 1 and 2 are executed only when the branch is not taken.

2.1.3 The Controller

Figure 5  The PCU Module (ports: instr[31..0], controlw_id/controlw_exe[26..20] in; writepc, writeir, jump, branch, fwda[1..0], fwdb[1..0], rs/rt/rd[4..0], controlw[31..0] out)

The task of the controller is to decode the MIPS instructions and generate control signals for every design element in the CPU. The generation of control signals is the core issue in CPU design: it coordinates every part of the processor to work in phase with the others. The solution to the hazard problems is also implemented in the controller. In my design, all controller functionalities, along with decoding, are encapsulated in a PCU (Processor Control Unit).
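Behaviorally, the decoding step of such a PCU can be sketched as follows. This is an illustrative Python model, not the VHDL; the field layout is the standard MIPS encoding, consistent with the test program in appendix 7.2 (R-type has an all-zero opcode, `j` uses opcode 0x02).

```python
def decode(instr):
    """Split a 32-bit MIPS instruction word into the fields the PCU uses."""
    op    = (instr >> 26) & 0x3F   # 6-bit opcode
    rs    = (instr >> 21) & 0x1F   # source register
    rt    = (instr >> 16) & 0x1F   # target register
    rd    = (instr >> 11) & 0x1F   # destination register (R-type)
    funct = instr & 0x3F           # func-code selects the ALU operation
    imm   = instr & 0xFFFF         # 16-bit immediate (I-type)
    if op == 0x00:                 # R-type: all-zero opcode
        return dict(type="R", rs=rs, rt=rt, dest=rd, funct=funct)
    if op == 0x02:                 # J-type: j
        return dict(type="J", target=instr & 0x3FFFFFF)
    return dict(type="I", op=op, rs=rs, dest=rt, imm=imm)

# word 05 of the test program: add $3, $3, $4
print(decode(0x00641820))
```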
The type of an instruction is recognized by decoding its opcode; different kinds of instructions have different opcodes. For example, an R-type instruction has an all-zero opcode in my design. For R-type instructions, the kind of calculation is recognized by decoding the func-code. The opcodes and func-codes of each instruction are given in appendix 7.1.

Once the concrete operation of an instruction is recognized, the PCU can generate the control word. The control word contains every piece of information needed to define the behavior of the processor in the current cycle, and provides enough information for the generation of the control word in the next cycle. Typically, this includes: the earliest phase at which the final result of the current instruction will be available, whether the result will be written to the register file or to memory, the data sources for port A and port B of the ALU, the func-code for the ALU, etc. In practice, the control word is stored in every backup register so that each phase can work correctly. The layout of the control word is given in table 2-1.

Table 2-1 Control Word
Bits    Name        Usage
0-4     ALUCONTROL  Func-code of the ALU
5       ALUSRCA     The data source of ALU port A
6       WRITEMEM    Controls the write enable port of the memory
7       MEMTOREG    Whether the register file should write the data from memory
8       WRITEREG    Whether the result will be written to the register file in the WB phase
9-13    ResltDes    Selects the register to be written in the WB phase
14      ALURESOK    Whether the EXE phase will use the ALU
15      MEMRESOK    Whether the MEM phase will access memory
16-17   ALUSRCB     The data source of ALU port B
18-31   Reserved

Table 2-2 Other Control Signals
Name     # of Bits  Comments
fwda     2          Controls forwarding to ALU port A
fwdb     2          Controls forwarding to ALU port B
writepc  1          Controls the write enable port of the PC register
writeir  1          Controls the write enable port of the instruction register (IR)
branch   1          Whether the current instruction is a branch instruction
jump     1          Whether the current instruction is a jump instruction
rs       5          Number of the source register
rt       5          Number of the target register
rd       5          Number of the destination register

Table 2-3 Data Source for ALU Port A
ALUSRCA  Data Source
0        RS in the register file
1        INST[10..6] in the instruction code

Table 2-4 Data Source for ALU Port B
ALUSRCB  Data Source
0        RT in the register file
1        INST[15..0] in the instruction code

Table 2-5 Forwarding Source
FWDA/FWDB  Forwarding Source
00         No forwarding
01         Forward data from the ALU
10         Forward data from memory

2.1.4 Detailed Design for Each Phase

2.1.4.1 IF Phase

Figure 6  IF Phase

The IF phase performs the following operations: (1) calculate the next instruction address for the program counter register (PC); (2) read the instruction from memory.

The PC is computed as follows:
(1) In the plain case, simply add 4 to the current PC.
(2) For a branch instruction, shift the offset INST[15..0] two bits left and add it to the current instruction address. Note that "current instruction" means the instruction at IF execution, that is, the instruction in the delay slot. The offset INST[15..0] was saved during the ID phase of the branch; it is part of the branch instruction itself.
(3) For a jump instruction, shift the target field INST[25..0] two bits left and use it as the new instruction address (the high bits of the PC are kept). The target field INST[25..0] was saved during the ID phase of the jump; it is part of the jump instruction itself.

Instruction fetch retrieves the instruction at the address saved in the PC register. However, since no MMU is provided in my simple MIPS CPU, the address is sent directly to the instruction memory (cache).
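The next-PC selection described above can be sketched as follows. This is an illustrative Python model (`next_pc` is a hypothetical name, not the VHDL); `pc` is the address of the instruction currently in IF, which is the delay slot when a branch or jump is resolving in ID.

```python
def next_pc(pc, taken_branch=None, jump=None):
    """Compute the next PC per the IF-phase rules (illustrative model).
    taken_branch: 16-bit branch offset if a taken branch is in ID, else None.
    jump: 26-bit jump target field if a jump is in ID, else None."""
    if taken_branch is not None:
        # sign-extend the 16-bit offset, shift left two bits, add to pc
        off = taken_branch - 0x10000 if taken_branch & 0x8000 else taken_branch
        return pc + (off << 2)
    if jump is not None:
        # target field shifted left two bits replaces the low bits of the PC
        return (pc & 0xF0000000) | ((jump & 0x3FFFFFF) << 2)
    return pc + 4  # plain case

# beq $2, $0, finish at word 07 (offset 3): delay slot is at 0x20,
# so the taken target is 0x20 + (3 << 2) = 0x2C, i.e. word 0B (finish)
print(hex(next_pc(0x20, taken_branch=3)))
```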
The instruction is saved in the instruction register.

2.1.4.2 ID Phase

Figure 7  ID Phase

The ID phase is responsible for the following tasks:
(1) Instruction decoding.
(2) Handling data hazards.
(3) Judging whether the condition is met for a branch instruction.
(4) Fetching the operands from the register file.

Once the mechanism described in section 2.1.3 is used, the decoding work is largely a matter of combinational logic: simple AND/OR operations are enough to realize the decoder.

As mentioned in section 2.1.2, data hazards are solved by establishing extra data paths and forwarding the data of previous instructions as soon as possible. The forwarding strategy is given in table 2-6 (the rs operand is shown; the rt operand is handled analogously, with RT in the condition and fwdb in the strategy).

Table 2-6 Forwarding Strategy
Instruction             Data Source  Condition                          Strategy
1 instruction ahead     EXE phase    ALUDES[5] AND (RS == ALUDES[4..0]) fwda = 01B
1 instruction ahead     MEM phase    ALUDES[6] AND (RS == ALUDES[4..0]) bubble
2 instructions ahead    EXE phase    MEMDES[5] AND (RS == MEMDES[4..0]) fwda = 10B
2 instructions ahead    MEM phase    MEMDES[6] AND (RS == MEMDES[4..0]) fwda = 10B

2.1.4.3 EXE Phase

Figure 8  EXE Phase

The EXE phase follows the ID phase. Its main tasks are:
(1) Use the ALU to perform arithmetic/logic operations.
(2) Save the result of the calculation to the backup register.
(3) Generate various control signals.

Before the calculation starts, the operands must be selected. The input data of the ALU come either from the immediate field of the instruction code or from RS/RT in the register file. This choice is decided by the signals ALUSRCA/ALUSRCB given by the PCU.

2.1.4.4 MEM Phase

Figure 9  MEM Phase

The MEM phase has the following main tasks:
(1) Generate the memory access control signals and perform the memory access.
(2) Save the data read from memory to the backup register so it can be used by the WB phase.

The following signals are needed to access memory:
(1) DATAO[31..0]: output data of the CPU.
(2) DADDR[31..0]: address for accessing the memory.
(3) WRITEMEM: controls the write enable of the memory.

2.1.4.5 WB Phase

Figure 10  WB Phase

The only task of the WB phase is to write the result of the calculation to the register file. To achieve this, the following signals are needed:
(1) RESULT[31..0]: the final result of the execution of an instruction.
(2) CONTROLW_MEM[13..9] (ResltDes in the PCU signals): the index of the register to be written.
(3) CONTROLW_MEM[8] (WRITEREG in the PCU signals): controls the write enable of the register file.

2.2 Cache Design

2.2.1 Overview

Figure 11  Cache Overview (CPU core with cache controller, i-cache, d-cache, and main memory)

As shown in figure 11, the caches lie in the CPU core and are coordinated by a cache controller. All addresses issued by the CPU are sent to the cache controller first instead of directly to the main memory. The cache controller decides whether the needed data is in the cache. If it is, no memory access is needed: the data is taken directly from the cache and sent back to the CPU. If not, the controller fetches several consecutive words from the main memory to fill the corresponding line in the cache. Since cache access is much faster than memory access, performance increases significantly once accesses hit the cache.

By using a cache, we rely on the temporal and spatial locality of data access; that is, we believe that a datum used now will soon be used again, and that an access to a datum suggests accesses to adjacent data in the near future. Therefore, instead of fetching only the data we currently need, we also fetch the adjacent block; sequential access to memory is much faster than random access.
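To illustrate how line-sized fills exploit spatial locality, here is a tiny miss-counting model of a direct-mapped cache with 4-word lines. This is illustrative Python only (the class and names are hypothetical, and data storage is omitted); it only counts line fills.

```python
class TinyCache:
    """Direct-mapped cache model: 4-word lines, miss counting only."""
    def __init__(self, lines=128):
        self.lines = lines
        self.tags = [None] * lines     # stored tag per line; None = invalid
        self.misses = 0

    def read(self, addr):
        word = addr >> 2               # byte address -> word address
        index = (word >> 2) % self.lines  # line index above the 2-bit word offset
        tag = word >> 2                # remaining high bits identify the line
        if self.tags[index] != tag:    # miss: fill the whole 4-word line
            self.tags[index] = tag
            self.misses += 1

c = TinyCache()
for a in range(0, 64, 4):              # 16 sequential word reads
    c.read(a)
print(c.misses)                        # only every 4th access misses
```

Sequential access costs only one memory fill per four reads, which is why the controller fetches whole lines rather than single words.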
Since my design does not include an MMU module, the caches receive physical addresses from the CPU instead of virtual addresses. Lock functionality is not supported. Two state machines work in coordination in the cache controller to schedule data reads and writes.

2.2.2 Design Choices

For simplicity, my design adopts a direct-mapped cache. The instruction cache (i-cache) and data cache (d-cache) are separated in order to avoid structural hazards. The size of each cache is 2KB. Both have 128 lines; each line has 4 words; and each word is 4 bytes long. Therefore, signals DADDR[10..2] are used for addressing the cache: DADDR[10..4] index the line and DADDR[3..2] select the word within the line. Each line also has 21 tag bits, 1 valid bit and 1 dirty bit. The tag bits record the high 21 bits of the address so that the cache controller can decide whether a read/write hits or misses the cache. The valid bit distinguishes whether a line in the cache is valid; all valid bits are cleared when the machine restarts. The dirty bit records whether the line is dirty, that is, whether a block in the line has been overwritten without the data being written back to memory.

Choices involved in any cache design are also made in this practice. The strategies I adopted are listed below:
(1) Block (re)placement & block identification. Since the cache is direct-mapped, there is no choice for these two strategies. The line to be (re)placed is decided by the low 11 bits of the address, and a block is identified by using the low 11 bits for addressing and the high 21 bits plus the valid bit for verification.
(2) Write strategy. For efficiency, I adopt the write-back strategy: information is written only to the block in the cache, and a modified cache block is written to main memory only when it is replaced.
(3) Allocation strategy.
I assume that data written to memory will soon be read, and therefore adopt the read-write-allocate strategy: upon a cache miss, the cache controller allocates a cache line for either a read or a write. Any load or store to an address that is not in the cache allocates a cache line.

Since the memory clock is much slower than the CPU clock, we also have a choice about how the cache controller learns that a memory read/write has completed. We can either send both the memory clock and the CPU clock to the cache controller and let it arbitrate, or add a memready signal between the cache controller and the memory so that the memory can signal when the data is ready. In my design, I chose the former.

2.2.3 State Machines

2.2.3.1 Cache State Switching

Figures 12 and 13 show the state switching of the instruction cache and the data cache, respectively.

Figure 12  State Switching for the Instruction Cache (invalid/valid states, with reads filling invalid lines)
Figure 13  State Switching for the Data Cache (invalid, valid-clean, and valid-dirty states, driven by reads and writes)

2.2.3.2 Finite State Machines for the I-cache and D-cache

The difference between the I-cache and the D-cache lies in that the I-cache does not allow write actions. To keep this document concise and clear, only the state machines for the D-cache are given here. The state machines for the I-cache are simply submachines of the D-cache, obtained by removing the edges involving write actions and any isolated states left after those edges are removed.

To realize the D-cache, two state machines are needed; we call them cachefsm and fillfsm. Cachefsm handles the data access requests, judges hit or miss, and schedules fillfsm if a miss unfortunately occurs. After fillfsm has fulfilled its task, cachefsm fetches the data from the cache and sends it to the output pins, if necessary.
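The address decomposition from section 2.2.2 can be written out explicitly. The sketch below is illustrative Python, not the VHDL; the function name is hypothetical.

```python
def split_daddr(daddr):
    """Decompose a 32-bit address per the cache geometry above:
    tag = high 21 bits, line index = DADDR[10..4], word = DADDR[3..2]."""
    tag   = (daddr >> 11) & 0x1FFFFF   # 21 tag bits
    index = (daddr >> 4) & 0x7F        # 7-bit line index (128 lines)
    word  = (daddr >> 2) & 0x3         # 2-bit word select (4 words/line)
    return tag, index, word

print(split_daddr(0x34))  # tag 0, line 3, word 1
```

Note that 21 + 7 + 2 + 2 (byte offset) = 32 bits, so the three fields plus the byte offset cover the whole address.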
Fillfsm is responsible for fetching data from memory to fill the corresponding line in the cache upon a miss. It does everything from the necessary write-back to the memory read.

The state machine for cachefsm is shown in figure 14; it is a 3-state finite state machine (FSM). It starts from the SRW state and judges whether the request is a read or a write. If it is a read, it switches to the SDRW state; if it is a write, it switches to the SDWW state; otherwise it remains in the SRW state. It may also schedule fillfsm, depending on whether a miss happens.

The state machine for fillfsm is a bit more complicated; it is a 10-state FSM. When it is not working, it remains in the SIDLE state; it enters the SS state when it is scheduled by cachefsm. Then it goes through a series of SWW or SRW states, depending on the type of data access request. Each SWW or SRW state writes/reads one word to/from the memory, so a group of 4 states writes back/fills one line of the cache. The concrete state machine for fillfsm is also shown in figure 14. In figure 14, the dashed lines denote messages between the two state machines: cachefsm schedules fillfsm by sending the writeback or fillcache signal, and fillfsm tells cachefsm whether it is busy via the cachebusy signal.

Figure 14  State Machine for the D-Cache (cachefsm: SRW, SDRW, SDWW; fillfsm: SIDLE, SS, SWW0–SWW3, SRW0–SRW3)

3. Processor Debugging

3.1 Debug Overview

To facilitate debugging, I use ModelSim for pre-synthesis analysis. This tool has powerful facilities for writing testbenches, adding breakpoints, and watching waveforms. For each module in the design, I write a set of testbenches to test it in ModelSim. After it has passed all test cases, I integrate it into the system.
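A behavioral reference model can serve as an oracle when writing such testbenches. As an illustration (Python sketch, simplified to the fill path; not the VHDL itself), the fillfsm state sequence of section 2.2.3 for one miss might be modeled as:

```python
def fillfsm_states(writeback_needed):
    """Yield the fillfsm state sequence for one cache miss (simplified model).
    writeback_needed: True when the victim line is dirty."""
    yield "SS"                          # entered when scheduled by cachefsm
    if writeback_needed:
        for i in range(4):              # dirty line: write 4 words back first
            yield f"SWW{i}"
    for i in range(4):                  # then fill the line: read 4 words
        yield f"SRW{i}"
    yield "SIDLE"                       # done; cachebusy is dropped

print(list(fillfsm_states(False)))
```

With a dirty victim, the model visits all 10 states (SS, SWW0–SWW3, SRW0–SRW3, SIDLE), matching the 10-state FSM described above.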
The breakpoint function of ModelSim empowered me to catch nearly any bug in my design. After the CPU design was finished, I compiled and synthesized it in Quartus II, loaded the test code and data into the memories, edited the input waveform in the Quartus Waveform Editor, and watched the result.

3.2 Debugging Synchronization Problems

The biggest headache in the design is that Quartus II does not support asynchronous memory reads and writes. That is, the output of the memory is only given at a clock edge, and data can be written to the memory only at a clock edge, too. This is quite different from the default memory in MAX+PLUS II. When I found that the final output waveform did not conform to what I expected, I traced it down to a delayed memory read near the second cycle of my test program and thereby detected this problem.

This problem can be solved by giving a negative clock edge to the memory and a positive one to the CPU. The memory data access is then done in the middle of a CPU cycle, so the memory appears asynchronous.

However, when I added caches to my design, this approach no longer worked. The cachefsm state machine requires a half-cycle delay so that it can schedule the fillfsm state machine, so it also needs the negative clock edge, which conflicts with the memory access again. Solving this calls for using different clocks for the CPU and the memory, rather than simply inverting the CPU clock for the memory. The memory cycle should be an integral multiple of the CPU cycle to ensure that every module in the design works in phase.

3.3 Debugging the Cache

When I added the cache to my design, the machine behaved strangely again. The first problem I discovered was that the values of both PC and IR had short cycles no matter whether the cache hit or missed. So I checked the modules that generate these signals and found that I had forgotten to write the logic that locks instruction fetch upon a cache miss.
I modified the PC calculation logic and the first problem was solved. But other problems were still ahead: the final results of my test programs were all wrong. So I output the states of the cache FSMs and watched their waveforms. By doing this, I detected several flaws in my state-switching logic and some conditions that had been neglected. For example, I had forgotten to check the dirty bit before filling the cache with new data.

Once the FSM diagram is correct, debugging state switching becomes much easier: watch the states in the waveform and see whether they conform to your expectation; if not, trace the wrong transition down to the program and rectify it.

4. Results

4.1 Testing Code

To test the correctness of each design, three pieces of code are used. To keep this document concise, only the code cited in reference [1] is shown here (a small modification is made at line 0C to show the correctness of cache write-back). The testing code is attached in the appendix of this document.

4.2 Single Cycle MIPS

4.3 Pipeline MIPS

As can be seen from the result, the pipelined CPU takes the same number of cycles as the single-cycle CPU, but only 1/5 of the cycles of a multi-cycle CPU. The time cost of the pipelined CPU is much less than that of both the single-cycle and the multi-cycle CPU. It is much faster than the single-cycle CPU because the single-cycle CPU suffers many glitches at the beginning of each cycle and requires plenty of time to stabilize, whereas the pipelined CPU does not. It is much faster than the multi-cycle CPU simply because it uses fewer cycles, and because all modules in the CPU are now in full-time use.
In theory, a CPU with k pipeline stages will have 1/k of the time cost of a multi-cycle CPU, assuming that the time costs of all stages are even.

4.4 Pipeline MIPS with Cache

As seen from the result, the cache significantly cuts down the time cost when the data hits the cache. The cycles from 3.5 ms to 4.3 ms demonstrate the whole process of a cache miss together with write-back, illustrating the penalty incurred upon a cache miss.

5. Conclusions

In this practice, I have successfully built a MIPS CPU with pipeline and cache functionality. The three types of hazards are solved and the memory is correctly synchronized. Two separate direct-mapped caches are designed for instruction and data access. The hierarchy of storage is fully demonstrated in this practice.

From the practice, I have become more familiar with computer architecture and organization. I have gained a deep impression of the hazards involved in pipelining and how to solve them. I also learned different cache policies with their advantages and disadvantages. As a by-product, I learned how to use VHDL to build digital circuits and how to debug them by writing testbenches and watching waveforms. Above all, it taught me the patience and carefulness to do things in an orderly way. This practice is really a valuable lesson in my college study.

6. Development Environment

6.1 Hardware

Processor: Intel Core Duo processor T2300 (1.66 GHz)
Memory: 1536MB

6.2 Software

OS: Windows XP + SP2
Design Software: Quartus II 7.2
Debug and Simulation: ModelSim SE 6.2b

7.
Appendices

7.1 ISA

Table 7-1 Instruction Set

7.2 Testing Codes

WIDTH=32;
DEPTH=32;
ADDRESS_RADIX=HEX;
DATA_RADIX=HEX;
CONTENT BEGIN
00 : 00000820;    % add $1, $0, $0 %
01 : 20020004;    % addi $2, $0, 4 %
02 : 00001820;    % add $3, $0, $0 %
03 : 8C240000;    % loop: lw $4, 0($1) %
04 : 20210004;    % addi $1, $1, 4 %
05 : 00641820;    % add $3, $3, $4 %
06 : 2042FFFF;    % addi $2, $2, -1 %
07 : 10400003;    % beq $2, $0, finish %
08 : 00000000;    % nop %
09 : 08000003;    % j loop %
0A : 00000000;    % nop %
0B : AC230000;    % finish: sw $3, 0($1) %
0C : 8C220000;    % lw $4, 0($2) %
0D : 0800000D;    % here: j here %
[0E..1F] : 00000000;
END;

8. References

[1] Zhu Ziyu, Li Yamin. CPU Chip Logic Design. Tsinghua University Press, 2005.
[2] John L. Hennessy, David A. Patterson. Computer Architecture: A Quantitative Approach. Third Edition. Morgan Kaufmann Publishers, 2003.
[3] Purdue University ECE437 MIPS Design Labs. http://cobweb.ecn.purdue.edu/~ece437l/materials